How to Annotate a Genome

by Ross Overbeek

We at the Fellowship for Interpretation of Genomes (FIG) have actively led the Project to Annotate 1000 Genomes since its inception in 2003.  In that effort we pioneered what we called the subsystems approach to annotation, in which experts annotate a single subsystem across the entire set of genomes.  This was radically different from the more usual approach of attempting to annotate all of the genes in a single genome.  The effort to develop well-curated sets of subsystems has led to a collection of 400-600 subsystems (depending on where you choose to impose a threshold of acceptable quality).  We believe that the number will continue to grow for reasons that will become apparent in this short note.

It is time to revisit the issue of how to annotate a specific genome of interest, since numerous biologists are now faced with that opportunity.   For what it is worth, here is our advice.

Begin by Identifying the Recognizable Instances of Subsystems

When you are able to annotate a complete subsystem, the individual assignments are all somewhat more reliable.  Most of the common machinery can easily be identified, and this establishes a starting point for the more difficult remaining tasks.  The easiest way to perform this initial stage of analysis is to proceed through two tasks:
  1. Submit the genome sequence to the RAST server maintained at Argonne National Laboratory.  This can be done by going to the RAST server, registering yourself as a user (anyone is welcome to use the site), uploading your sequence, and getting an initial annotation back in about 12 hours.  You can then download the initial annotation to your site and work on it using any tools you prefer.  The initial annotation from RAST gives you three things:
    • protein-encoding genes (CDSs),
    • RNA-encoding genes (tRNAs and rRNAs),
    • identified subsystems.
  2. Once you have an initial set of identified subsystems, you should manually go through and see where RAST failed to identify active variants.  RAST is fairly conservative in its calls, so if a gene was mis-called (e.g., due to a frameshift) or occurs in an unusual form (e.g., an unknown form of an enzyme), you will see almost all of the genes in a subsystem accounted for, but not enough for RAST to say that the subsystem is really there.  If you do this analysis within RAST, you can compare the metabolic reconstruction for your genome against related genomes, focusing on the specific differences.
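
As a rough illustration of the manual check in step 2, the following sketch lists subsystems for which most, but not all, functional roles were found in the genome.  It is a simplification under stated assumptions: it treats a subsystem as a flat set of roles (real subsystems have variants with alternative roles), the 0.75 threshold is arbitrary, and the function name, the role names, and the input layout are all hypothetical rather than anything provided by RAST or the SEED.

```python
from typing import Dict, List, Set, Tuple


def near_complete_subsystems(
    subsystem_roles: Dict[str, Set[str]],
    roles_in_genome: Set[str],
    min_fraction: float = 0.75,
) -> List[Tuple[str, float, List[str]]]:
    """List subsystems that are mostly, but not entirely, accounted for.

    Returns (subsystem name, fraction of roles present, missing roles),
    sorted with the most nearly complete subsystems first.
    """
    candidates = []
    for name, roles in subsystem_roles.items():
        if not roles:
            continue  # nothing to check for an empty definition
        missing = sorted(roles - roles_in_genome)
        fraction = (len(roles) - len(missing)) / len(roles)
        # Keep only subsystems with a gap that is small enough to inspect by hand.
        if missing and fraction >= min_fraction:
            candidates.append((name, fraction, missing))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates


# Hypothetical example data (an assumption, not real RAST output).
subsystems = {
    "Subsystem A": {"role 1", "role 2", "role 3", "role 4"},
    "Subsystem B": {"role 5", "role 6"},
}
found = {"role 1", "role 2", "role 3", "role 5", "role 6"}
candidates = near_complete_subsystems(subsystems, found)
for name, fraction, missing in candidates:
    print(name, fraction, missing)
```

Each reported subsystem points you at the specific missing roles, which are the places to look for a mis-called gene or an unusual form of an enzyme.
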

If your genome is close to a previously annotated and studied genome, we suggest that you do a detailed analysis of what genes distinguish the new genome from the previously annotated genome (or genomes).  The SEED provides a tool for easily doing such a comparison, and similar tools are either available or becoming available from a number of sources.
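
To make the idea of "what genes distinguish the new genome" concrete, here is a minimal sketch that compares two genomes by their assigned functions.  It is not the SEED comparison tool; it assumes that each genome is available as a mapping from feature ID to assigned function, and it matches functions by exact string equality, which is cruder than a comparison based on sequence similarity.  All names and IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def roles_to_features(assignments: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert a feature -> function mapping into function -> sorted feature IDs."""
    index = defaultdict(list)
    for feature_id, function in assignments.items():
        index[function].append(feature_id)
    return {function: sorted(ids) for function, ids in index.items()}


def distinguishing_roles(
    new_genome: Dict[str, str],
    reference: Dict[str, str],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Return the functions found only in the new genome and only in the reference."""
    new_index = roles_to_features(new_genome)
    ref_index = roles_to_features(reference)
    only_new = {r: new_index[r] for r in sorted(new_index.keys() - ref_index.keys())}
    only_ref = {r: ref_index[r] for r in sorted(ref_index.keys() - new_index.keys())}
    return only_new, only_ref


# Hypothetical feature IDs and functions, for illustration only.
new_genome = {"new.peg.1": "Role X", "new.peg.2": "Role Y"}
reference = {"ref.peg.1": "Role X", "ref.peg.2": "Role Z"}
only_new, only_ref = distinguishing_roles(new_genome, reference)
print("only in new genome:", only_new)
print("only in reference:", only_ref)
```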

Note that this initial step can be done very rapidly -- in a few days.

Fix Frameshifts, Annotate Insertion Sequences, and Process Pseudo-genes

RAST often fails to identify the functional role of a particular gene because of frameshifts.  Frameshifts are very common in low-quality sequence and in sequence produced by 454 technology.  The problem is not particularly serious, but we do recommend that you post-process the gene calls to clean up the frameshifts.  Biologists are justifiably reluctant to change sequence data without resequencing; hence, we recommend that the actual DNA sequence remain unchanged, that the correction be embodied in the proposed translation of the feature, and that the discrepancy between the actual DNA sequence and the translation be recorded with the feature.  We note that you can automatically correct obvious frameshifts using tools within the SEED environment, and we anticipate that these will become increasingly important as larger volumes of low-quality sequence data become available.
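
The bookkeeping recommended above (leave the DNA alone, correct the translation, record the discrepancy with the feature) can be sketched as a small record type.  This is only an illustration: the `Feature` class, its field names, and the note format are hypothetical and do not reflect how the SEED stores features, and the sequences are toy placeholder strings, not a real gene.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Feature:
    feature_id: str
    dna: str                      # sequence as determined; never edited here
    translation: str              # proposed protein translation
    notes: Tuple[str, ...] = ()   # free-text annotations kept with the feature


def with_corrected_translation(
    feature: Feature, corrected_translation: str, discrepancy: str
) -> Feature:
    """Return a copy with a corrected translation and a note describing the
    mismatch between the unchanged DNA and the new translation."""
    note = f"translation corrected; differs from DNA: {discrepancy}"
    return replace(
        feature,
        translation=corrected_translation,
        notes=feature.notes + (note,),
    )


# Toy placeholder values (assumptions for illustration only).
original = Feature("new.peg.7", dna="ATGGCAAATAG", translation="MAN")
corrected = with_corrected_translation(
    original, "MAKL", "apparent single-base deletion near position 8"
)
print(corrected.dna == original.dna)  # the DNA is carried over untouched
print(corrected.notes)
```

Making the record immutable means a correction always yields a new feature, so the originally called translation remains available for comparison.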

The issue of detecting insertion sequences, mobile elements, prophages and so forth is important for a number of reasons.  Determining the set of impacted genes (often pseudo-genes) is extremely time-consuming.  We would guess that tools to support this type of analysis will appear soon, but for now you will need to determine how much effort you are willing to expend on the task.   So, this part of the effort can take from a few days (to automatically detect and correct frameshifts) to man-years (to characterize insertion sequences, pseudo-genes, and prophages).

Look at Identified Functions that are Not in Subsystems

As you scan through the RAST-identified genes that have not yet been placed in subsystems, you will find that some correspond to FIGfams and some do not.  Some are closely similar to well-annotated proteins (e.g., to Swiss-Prot entries), and some are not.

We recommend that you scan through these focusing on those that correspond to functional roles that should be encoded into subsystems.    It is particularly important to examine those for which "functional coupling" information exists (RAST will give you this information).  When strong functional coupling data exists, and when the functional role can be identified with reasonable certainty, you have a particularly good candidate for a new subsystem.  If you can connect any of the genes in the cluster (in, say, genomes that are "close" and have been actively studied) to literature, you need to get the relevant papers before deciding how to proceed.  We suggest making a rapid pass through the set of genes that have not been assigned to subsystems, prioritizing these genes for possible use in starting new subsystems.
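
One way to organize the rapid prioritizing pass is sketched below.  It ranks unassigned genes so that those with both strong functional coupling and an identifiable role come first, then those with literature connections, then the rest by coupling strength.  The numeric `coupling_score`, the cutoff of 10.0, the test for an "identified" role, and all field names are assumptions made for this example; RAST reports functional coupling in its own form, and you would need to adapt the inputs accordingly.

```python
from typing import List, NamedTuple


class UnassignedGene(NamedTuple):
    feature_id: str
    function: str
    coupling_score: float  # hypothetical: higher means stronger functional coupling
    has_literature: bool   # a gene in the cluster can be connected to papers


def has_identified_role(function: str) -> bool:
    """Crude check that a function is more than a placeholder."""
    text = function.strip().lower()
    return bool(text) and "hypothetical" not in text


def prioritize(
    genes: List[UnassignedGene], min_coupling: float = 10.0
) -> List[UnassignedGene]:
    """Sort genes so the best candidates for new subsystems come first."""
    def key(gene: UnassignedGene):
        strong = gene.coupling_score >= min_coupling
        good_candidate = strong and has_identified_role(gene.function)
        return (good_candidate, gene.has_literature, gene.coupling_score)
    return sorted(genes, key=key, reverse=True)


# Hypothetical genes, for illustration only.
genes = [
    UnassignedGene("peg.1", "hypothetical protein", 25.0, False),
    UnassignedGene("peg.2", "Role A", 30.0, True),
    UnassignedGene("peg.3", "Role B", 2.0, False),
    UnassignedGene("peg.4", "Role C", 12.0, False),
]
ranked = prioritize(genes)
for gene in ranked:
    print(gene.feature_id, gene.function, gene.coupling_score)
```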

We urge you to develop new subsystems when possible and to publish these subsystems (which makes them accessible to users working on other versions of the SEED).


So, our approximate approach to annotating a new genome would be:

  1. Run the genome through RAST.
  2. Do a detailed metabolic comparison (within RAST) between your new genome and one or more of its closest relatives.  Follow this by a general comparison of what genes distinguish it from its closest relatives.
  3. Correct obvious frameshifts.
  4. Decide whether or not you are willing to spend the effort needed to identify IS elements, prophages and other mobile elements.  Similarly, decide whether or not you wish to expend the effort to carefully identify pseudo-genes.
  5. If you have substantially changed the gene calls, rerun your genome through RAST (keeping the gene calls that you have now established).
  6. Go through the genes that have not yet been placed into subsystems, and determine whether or not it makes sense to construct a limited set of new subsystems (especially if they capture aspects of the genome which may have motivated the sequencing effort in the first place).

This can be done very rapidly, or it can be given more time; it all depends on the anticipated role of the genome.  In many cases, these tasks can be performed in a few weeks, and we believe that the overall time will continue to drop as the quality of the RAST analysis improves (due to an expanded library of subsystems).