Reflections on Accurate Annotations:
The Basic Cycle and Its Significance
by Ross Overbeek
I have recently been reflecting on the status of the Project to Annotate 1000 Genomes,
and in this short essay I will argue that it has been an
overwhelming success for reasons that became apparent only as the
project progressed. A
thousand more-or-less complete genomes now exist, a framework for
rapidly annotating new genomes with remarkable accuracy is now
functioning, and we are on the verge of another major shift in the
world of annotations. This reflection is based on an
informal note that I sent to friends on the last day of 2007, but my
thoughts have clarified somewhat since then.
The Production of Accurate Annotations
The efforts required to establish a framework for high-volume, accurate
annotation are substantial. I believe that it is important
that we reflect on what we have learned about the factors that
determine productivity. So, what have we learned from the
project?
First, subsystem-based annotation is the key to accuracy. While
numerous efforts still focus on annotating a single genome, two
principles are now widely accepted: comparative analysis is central to
everything, and accuracy comes from studying the variations of a single
component of cellular machinery as they manifest across the entire
collection of existing genomes. Manual subsystem creation
and maintenance is the rate-limiting component of
successful annotation efforts, and the factors that constrain this
process are at the heart of the matter. We have understood
this for some time now.
However, I am going to argue a new position in this short essay:
- There are three distinct components that make up our
strategy for rapid, accurate annotation: subsystem-based annotation,
FIGfams as a framework for propagating the subsystem annotations, and
RAST as a technology for using FIGfams and subsystems to consistently
propagate annotations to newly-sequenced genomes.
- These three components form a cycle (subsystems =>
FIGfams => RAST technology => subsystems). This cycle creates a
feedback loop that rapidly accelerates the productivity achievable in
all three components; conversely, failure in any one of them
dramatically impairs productivity in the others. Understanding this
cycle will be the key to supporting higher productivity in subsystem
maintenance and creation (a toy sketch of the cycle in code follows the
figure below).
- To understand the dependencies, we need to consider each of
the components:
- The key to accurate FIGfam creation and maintenance is to couple it
directly to subsystem maintenance. Since the initial release of the
FIGfams, updates have occurred automatically, driven by changes in the
subsystem collection: FIGfams are automatically split, merged, and
added as the subsystem collection is maintained. One area of
substantial cost remains in FIGfam development -- the creation of
family-specific decision procedures that are occasionally needed to
achieve the required accuracy. At this point we have approximately
10,000 subsystem-based FIGfams, although the overall collection
contains over 100,000 families (the majority containing only 2-3
members).
- RAST depends centrally on FIGfams for the assertion of function to
newly-recognized genes. The more accurate the FIGfams and their
associated decision procedures, the more accurate the assignments of
function made to genes in genomes processed by RAST.
- Finally, the central costs of maintaining subsystems are cleaning up
errors in existing subsystems (often signaled by multiple genes being
assigned the same functional role) and adding new genomes to existing
subsystems. Once a subsystem has reached an acceptable level of
accuracy (and many are not there yet), the central cost is the
integration of new genomes after annotation by RAST. The speed with
which new genomes can be added depends on how well RAST assigns gene
function (and, secondarily, on how accurately these RAST-based
annotations can be used to infer operational variants of subsystems).
- The main costs of increasing the speed and accuracy of annotations
split into two categories: those relating to the maintenance of
existing subsystems, and those relating to the generation of new
subsystems. The maintenance costs are containable if the cycle is
established and functions smoothly; otherwise, I suspect they will
inevitably grow rapidly.
Let me begin by depicting the cycle pictorially:

[Figure: the basic cycle -- subsystems => FIGfams => RAST => subsystems]
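To make the dependencies concrete, here is a minimal, runnable sketch of one turn of the cycle in Python. Everything in it is hypothetical -- the data structures, the role and gene names, and especially matches_family, which merely stands in for real sequence comparison plus any family-specific decision procedure. It is an illustration of the idea, not the SEED's actual implementation.

```python
from collections import defaultdict

# Toy data: a subsystem maps functional roles to the genes asserted
# to implement them. All names and IDs here are invented.
subsystems = {
    "Histidine Biosynthesis": {
        "Histidinol dehydrogenase (EC 1.1.1.23)": {"genomeA:peg1"},
        "ATP phosphoribosyltransferase (EC 2.4.2.17)": {"genomeA:peg2"},
    },
}

def derive_figfams(subsystems):
    """Subsystems => FIGfams: genes filling the same functional role
    seed one family, so families split, merge, and appear automatically
    as curators edit the subsystem collection."""
    figfams = defaultdict(set)
    for roles in subsystems.values():
        for role, genes in roles.items():
            figfams[role] |= genes
    return figfams

def annotate(new_genes, figfams, matches_family):
    """FIGfams => RAST: assign a function to each gene of a newly
    sequenced genome by finding the family it matches."""
    annotations = {}
    for gene in new_genes:
        for role, members in figfams.items():
            if matches_family(gene, members):
                annotations[gene] = role
                break
    return annotations

def integrate(annotations, subsystems):
    """RAST => subsystems: fold the new annotations back into the
    collection, flagging roles that pick up multiple genes -- the
    usual signal of an error in need of curation."""
    work_list = []
    for name, roles in subsystems.items():
        for role, genes in roles.items():
            new = {g for g, fn in annotations.items() if fn == role}
            genes |= new
            if len(new) > 1:
                work_list.append((name, role, sorted(new)))
    return work_list  # drives the next turn of the cycle

# One turn of the cycle, with a trivial stand-in for sequence similarity:
figfams = derive_figfams(subsystems)
similar = lambda gene, members: any(
    gene.split(":")[1] == m.split(":")[1] for m in members)
new_annotations = annotate(["genomeB:peg1", "genomeB:peg2"], figfams, similar)
print(integrate(new_annotations, subsystems))
```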
I have argued that the rate at which rapid, accurate annotations can be
achieved is limited by the rate at which subsystems can be maintained
and created. I place maintenance ahead of creation at this stage.
As the collection grows (it now contains over 600
subsystems with over 6800 distinct functional roles), the costs of
maintenance will tend to dominate. The
creation of new subsystems will always be a critical activity, but each
new subsystem will impact smaller sets of genomes as we "move into the
tail of the distribution".
The costs relating to subsystem maintenance, which will quickly
dominate, depend critically on how smoothly the cycle I have described
functions. We have just established the complete cycle.
The two central costs that cannot be avoided are the creation of
FIGfam-dependent decision procedures and the creation of new
subsystems. The manual work on FIGfams will be necessary to achieve
near-100% accuracy in annotating seriously ambiguous paralogs; in the
vast majority of cases, however, this effort will be restricted to
specific curators who are willing to spend massive effort to get things
perfect. The more central cost relates to manual curation of the
subsystems.
More Effective Integration of Existing Annotation Efforts
In the section above, I reflected on the cycle that we shall depend
upon for supporting increased volume and accuracy of our own efforts.
Other groups are certainly experimenting with their own
solutions, and in some cases with clear successes. I have no
desire to rate these competing efforts. I sincerely believe
that cooperative activity is the key to
enhanced achievements by everyone. However, effective
cooperation is often elusive. I think that we have put in
place an extremely important mechanism for making cooperation much
easier, and the benefits more compelling.
Anyone working on one of the main annotation efforts realizes that it
is not easy to derive real benefit from access to the annotation
efforts of other groups. The effort required to characterize
discrepancies between local annotations and those produced externally
often outweighs any resulting benefits.
Two events of major importance have occurred:
- Both PIR and the SEED Project decided to build
correspondences between the IDs used by different annotation projects.
The PIR effort produced BioThesaurus, and the SEED effort produced the
Annotation Clearing House. The fact that it will become trivial to
reconcile IDs between the different annotation efforts will undoubtedly
support a rapid increase in the cross-linking of entries. The SEED is
working with UniProt to cross-link proteins from all of our complete
genomes, and I am sure similar efforts are under way between the other
major annotation efforts. (A small sketch of how such a correspondence
table might be used follows this list.)
- Within the Annotation Clearing House, a project has been initiated
that allows experts to assert that specific annotations are reliable
(using whatever IDs they wish). This has already led to many tens of
thousands of assertions that specific annotations are highly reliable.
PIR is preparing a list of assertions that they consider highly
reliable, and both institutions are making these lists openly
available.
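To make the value of these correspondences concrete, here is a minimal sketch in Python of how such a table might be used. The table contents and IDs are invented placeholders; real correspondences would be drawn from resources such as BioThesaurus or the Annotation Clearing House.

```python
# Hypothetical correspondence table: a local (FIG-style) protein ID
# mapped to the IDs other projects use for what is believed to be the
# same protein. All IDs below are invented placeholders.
id_correspondence = {
    "fig|83333.1.peg.1001": {"uniprot": "P12345", "refseq": "NP_000001"},
}

def external_id(local_id, namespace):
    """Resolve a local protein ID into another project's namespace;
    returns None when no correspondence is known."""
    return id_correspondence.get(local_id, {}).get(namespace)

print(external_id("fig|83333.1.peg.1001", "uniprot"))  # -> P12345
```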
To see the utility of exchanging expert assertions in a framework in
which it is easy to compare the results, let me describe how we intend
to use these assertions (a minimal sketch in code follows the list):
- We begin with a 3-column table of reliable annotations
containing [ProteinID, AssertedFunction, IDofExpert].
- We then take our own IDs and construct a 2-column table
[FIG-function, AssertedFunction]. This table gives a correspondence
between each of our functional roles and the functional roles used by
the expert making the assertion of reliability.
- Then we go through this correspondence table (using both
tools and manual inspection) and split it into one set in which we
believe the two columns are essentially identical, and a second set
that we believe represents errors (either our own or those of the
expert asserting reliability). We anticipate that in most cases the
expert assertion will be accurate, which is what makes this exercise so
beneficial to us.
- We take the table of "essentially the same" assertions and
distribute it as a table of synonyms (which we consider to be a very
useful resource in its own right).
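As a concrete illustration of these four steps, here is a small sketch in Python. The table layouts follow the text, but the example rows and the agree() heuristic are invented stand-ins; in practice the split is made with both tools and manual inspection.

```python
# Step 1: expert assertions as [ProteinID, AssertedFunction, IDofExpert].
assertions = [
    ("P12345", "Histidinol dehydrogenase (EC 1.1.1.23)", "expert_7"),
    ("P99999", "Histidinol dehydrogenase (EC 1.1.1.23)", "expert_7"),
]

# Our own annotations, keyed by the same protein IDs (reached via the
# ID correspondences described above). Invented for illustration.
fig_function = {
    "P12345": "Histidinol dehydrogenase (EC 1.1.1.23)",
    "P99999": "Histidinol-phosphate phosphatase (EC 3.1.3.15)",
}

# Step 2: the 2-column correspondence [FIG-function, AssertedFunction].
correspondence = [
    (fig_function[pid], asserted)
    for pid, asserted, _expert in assertions
    if pid in fig_function
]

def agree(fig_fn, asserted_fn):
    """Crude stand-in for 'essentially identical'."""
    return fig_fn.lower() == asserted_fn.lower()

# Step 3: split into synonyms and suspected errors.
synonyms = [pair for pair in correspondence if agree(*pair)]
suspected_errors = [pair for pair in correspondence if not agree(*pair)]

# Step 4: `synonyms` is distributed as a resource; `suspected_errors`
# becomes our work list, since the expert assertion is usually right.
print(synonyms)
print(suspected_errors)
```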
We are strongly motivated to resolve differences between our
annotations and high-reliability assertions made by experts.
The production of the table of synonyms not only reduces the
effort required to redo such a comparison in the future, but is also a
major asset in itself. I am confident that any serious annotation
group that participates will benefit, and I believe that these
exchanges will accelerate in 2008 and 2009.
Summary
I have tried to express the significance of the cycle depicted above,
but I think that I have failed to really convey the epiphany, so let me
end by expressing it somewhat more emphatically. I believe
that there will be a very rapid acceleration in the sequencing of new,
complete genomes (although the quality of the sequence will frequently
be far from perfect, and I am willing to say that a genome in 100
contigs is "essentially complete"). Groups that now try to
provide accurate integrations of all (or most) complete genomes will be
strained heavily. The tendency will be to go in one of two
directions:
- Some will swing to completely automated approaches.
This will result in the rapid propagation of errors for those
portions of the cellular machinery that are not yet accurately
characterized -- which is quite a bit.
- Others will give up any attempt at comprehensive annotation
and focus on accurate annotation of a slowly growing subset.
The problem with the second approach is that accurate annotation of new
cellular mechanisms (i.e., the introduction of new subsystems) will
increasingly depend on a comprehensive set of genomes: comparative
analysis is central to working out any of the serious difficulties, and
the larger the set of accurately annotated genomes, the better the
framework for careful correction.
The cycle depicted above is the only viable strategy I know of for
handling the deluge of genomes accurately. I claim that, as time goes
by, the SEED effort to implement this cycle will find itself in a
continuously strengthening position. Other groups will be forced to
copy it rapidly, but it really was not that easy to establish, and I
believe the odds are that the SEED effort will be the only group left
standing in 2-3 years (i.e., the only group claiming both accuracy and
comprehensive integration).