We at the Fellowship for Interpretation of Genomes (FIG) have actively led the
Project to Annotate 1000 Genomes since its inception in 2003. In that effort we pioneered what we called the
subsystems approach to annotation,
in which experts annotate a single subsystem across the entire set of
genomes. This was a radically different approach from the more
usual one of attempting to annotate all of the genes in a single genome.
The effort to develop well-curated sets of subsystems has led to
a collection of 400-600 subsystems (depending on where you choose to
impose a threshold of acceptable quality). We believe that the
number will continue to grow for reasons that will become apparent in
this short note.
It is time to revisit the issue of how to
annotate a specific genome of interest, since numerous biologists are
now faced with that opportunity. For what it is worth, here is
our advice.
Begin by Identifying the Recognizable Instances of Subsystems
When
you are able to annotate a complete subsystem, the individual
assignments are all somewhat more reliable. Most of the common
machinery can easily be identified, and this establishes a starting
point for the more difficult remaining tasks. The easiest way to
perform this initial stage of analysis is to proceed through two tasks:
- Submit the genome sequence to the RAST server maintained at Argonne National Laboratory. This can be done by going to the RAST server,
registering yourself as a user (anyone is welcome to use the site),
uploading your sequence, and getting an initial annotation back in
about 12 hours. You can then download the initial annotation to
your site and work on it using any tools you prefer. The initial
annotation from RAST gives you three things:
- protein-encoding genes (CDSs),
- RNA-encoding genes (tRNAs and rRNAs),
- identified subsystems
- Once
you have an initial set of identified subsystems, go through them manually
to see where RAST failed to identify an active variant.
RAST is fairly conservative in its calls, so if there is a
mis-called gene (e.g., due to a frameshift) or an unusual form of a
gene (e.g., an unknown form of an enzyme), you will see almost all of
the genes in a subsystem accounted for, but not enough for RAST to assert
that the subsystem is present. If you do this analysis
within RAST, you can compare the metabolic reconstruction for your
genome against related genomes, focusing on the specific differences.
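If you have downloaded the annotation and are working outside RAST, the manual check for missed subsystems can be roughed out in a few lines. The following is a minimal sketch, not RAST's own logic: it assumes you have a table of the functional roles that make up each subsystem, the set of roles found in your genome, and the set of subsystems RAST called. The function name, the input layout, and the 0.75 threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set, Tuple


def near_complete_subsystems(
    subsystem_roles: Dict[str, Set[str]],
    genome_roles: Set[str],
    called: Set[str],
    min_fraction: float = 0.75,  # hypothetical cutoff; tune to taste
) -> List[Tuple[str, float, List[str]]]:
    """List subsystems that were not called but have most of their roles present.

    Returns (subsystem name, fraction of roles found, missing roles),
    sorted with the most complete subsystems first.
    """
    candidates = []
    for name, roles in subsystem_roles.items():
        if name in called or not roles:
            continue  # already identified, or nothing to compare against
        found = roles & genome_roles
        fraction = len(found) / len(roles)
        if fraction >= min_fraction:
            missing = sorted(roles - genome_roles)
            candidates.append((name, fraction, missing))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates
```

The missing roles reported for each candidate are the places to look first for a frameshifted gene call or an unusual form of an enzyme.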
If
your genome is close to a previously annotated and studied genome, we
suggest that you do a detailed analysis of what genes distinguish the
new genome from the previously annotated genome (or genomes). The
SEED provides a tool for easily doing such a comparison, and similar
tools are either available or becoming available from a number of
sources.
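To illustrate the kind of comparison meant here, the sketch below contrasts genomes by their assigned functions. It is a deliberately simplified stand-in, not the SEED comparison tool: it assumes each genome is available as a mapping from gene identifier to function string, and it treats two genes as equivalent only when those strings match exactly. Real tools rely on sequence similarity, so treat the function name and data layout as hypothetical.

```python
from typing import Dict, List, Tuple


def distinguishing_functions(
    new_genome: Dict[str, str],
    relatives: List[Dict[str, str]],
) -> Tuple[List[str], List[str]]:
    """Compare assigned functions between a new genome and its relatives.

    Each genome maps gene id -> assigned function (assumed layout).
    Returns two sorted lists:
      - functions present in the new genome but in none of the relatives
      - functions present in every relative but absent from the new genome
    """
    new_functions = set(new_genome.values())
    relative_sets = [set(r.values()) for r in relatives]
    in_any_relative = set().union(*relative_sets)
    in_all_relatives = set.intersection(*relative_sets) if relative_sets else set()
    only_in_new = sorted(new_functions - in_any_relative)
    missing_from_new = sorted(in_all_relatives - new_functions)
    return only_in_new, missing_from_new
```

Functions found in every relative but missing from the new genome deserve a second look, since they may point to a mis-called gene rather than a true absence.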
Note that this initial step can be done very rapidly -- in a few days.
Fix Frameshifts, Annotate Insertion Sequences, and Process Pseudo-genes
RAST
often fails to identify the functional role of a particular gene due to
frameshifts. This is very common in low-quality sequence or
sequence produced by 454 technology. It is not particularly
serious, but we do recommend that you post-process the gene calls to
clean up the frameshifts. Biologists are justifiably reluctant
to change sequence data without resequencing; hence, we recommend that
the actual DNA sequence remain unchanged, that the correction be
embodied in the proposed translation of the feature, and that the
discrepancy between the actual DNA sequence and the translation be
recorded with the feature. We note that you can automatically correct
obvious frameshifts using tools within the SEED environment, and we
anticipate that these will become increasingly important as larger
volumes of low-quality sequence data become available.
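The recommendation to leave the DNA untouched and record the correction with the feature can be made concrete with a small data structure. This is a minimal sketch under our own assumptions: the `Feature` class, its field names, and the wording of the note are hypothetical and do not reflect the SEED's internal representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    feature_id: str
    dna: str            # sequence exactly as read; never edited here
    translation: str    # proposed protein translation
    notes: List[str] = field(default_factory=list)


def record_frameshift_correction(
    feature: Feature,
    corrected_translation: str,
    dna_position: int,
    change: str,
) -> Feature:
    """Return a copy of the feature with a corrected translation.

    The DNA is carried over unchanged, and the discrepancy between
    the DNA and the translation is recorded as a note on the feature.
    """
    note = (
        f"frameshift corrected in translation only: {change} "
        f"at DNA position {dna_position}; DNA sequence left as read"
    )
    return Feature(
        feature_id=feature.feature_id,
        dna=feature.dna,
        translation=corrected_translation,
        notes=feature.notes + [note],
    )
```

Returning a new object rather than editing in place keeps the original call available for comparison if the region is later resequenced.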
The
issue of detecting insertion sequences, mobile elements, prophages and
so forth is important for a number of reasons. Determining the
set of impacted genes (often pseudo-genes) is extremely time-consuming.
We would guess that tools to support this type of analysis will
appear soon, but for now you will need to determine how much effort you
are willing to expend on the task. So, this part of the effort
can take from a few days (to automatically detect and correct
frameshifts) to man-years (to characterize insertion sequences,
pseudo-genes, and prophages).
Look at Identified Functions that are Not in Subsystems
As
you scan through the genes not yet placed in subsystems that were
identified by RAST, some correspond to FIGfams, and some do not.
Some are closely similar to well-annotated proteins (e.g., to
Swiss-Prot entries), and some are not.
We recommend that you
scan through these focusing on those that correspond to functional
roles that should be encoded into subsystems. It is
particularly important to examine those for which "functional coupling"
information exists (RAST will give you this information). When
strong functional coupling data exists, and when the functional role
can be identified with reasonable certainty, you have a particularly
good candidate for a new subsystem. If you can connect any of the
genes in the cluster (in, say, genomes that are "close" and have been
actively studied) to literature, you need to get the relevant papers
before deciding how to proceed. We suggest making a rapid pass
through the set of genes that have not been assigned to subsystems,
prioritizing these genes for possible use in starting new subsystems.
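A rapid pass of this kind amounts to filtering and ranking. The sketch below shows one way to do it, assuming you have exported, for each gene not yet in a subsystem, a numeric functional-coupling score and your own judgment of whether the function is confidently identified. The `Candidate` fields, the scoring scale, and the `min_coupling` cutoff are hypothetical; RAST's actual output format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    gene_id: str
    function: str
    coupling_score: int        # assumed numeric summary of functional coupling
    confident_function: bool   # curator's judgment of the assignment
    has_literature: bool = False


def prioritize(candidates: List[Candidate], min_coupling: int = 1) -> List[Candidate]:
    """Keep genes with coupling evidence and a confident function,
    ranked with the strongest coupling first (ties broken by gene id)."""
    keep = [
        c for c in candidates
        if c.confident_function and c.coupling_score >= min_coupling
    ]
    return sorted(keep, key=lambda c: (-c.coupling_score, c.gene_id))


def needs_reading(candidates: List[Candidate]) -> List[str]:
    """Gene ids with linked literature, to be read before deciding how to proceed."""
    return [c.gene_id for c in candidates if c.has_literature]
```

Genes at the top of the ranked list are the natural seeds for new subsystems, and those flagged by `needs_reading` should wait until the relevant papers are in hand.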
We
urge you to develop new subsystems when possible and to publish these
subsystems (which makes them accessible to users working on other
versions of the SEED).
Summary
So, our approximate approach to annotating a new genome would be:
- Run the genome through RAST.
- Do a detailed metabolic comparison (within RAST) between your new genome and one or more of its closest relatives. Follow this with a general comparison of which genes distinguish it from its closest relatives.
- Correct obvious frameshifts.
- Decide
whether or not you are willing to spend the effort needed to identify
IS elements, prophages and other mobile elements. Similarly,
decide whether or not you wish to expend the effort to carefully
identify pseudo-genes.
- If you have substantially changed the
gene calls, rerun your genome through RAST (keeping the gene
calls that you have now established).
- Go through the genes that
have not yet been placed into subsystems and determine whether or not it
makes sense to construct a limited set of new subsystems (especially if
they capture aspects of the genome which may have motivated the
sequencing effort in the first place).
This can be done
very rapidly or more slowly and thoroughly, depending on
the anticipated role of the genome. In many cases, these tasks
can be performed in a few weeks, and we believe that the overall time
will continue to drop as the quality of the RAST analysis (due to an
expanded library of subsystems) improves.