Reflections on Accurate Annotations:

The Basic Cycle and Its Significance

by Ross Overbeek


I have recently been reflecting on the status of the Project to Annotate 1000 Genomes, and in this short essay I will argue that it has been an overwhelming success for reasons that became apparent only as the project progressed.  A thousand more-or-less complete genomes now exist, a framework for rapidly annotating new genomes with remarkable accuracy is now functioning, and we are on the verge of another major shift in the world of annotations.  This reflection is based on an informal note that I sent to friends on the last day of 2007, but my thoughts have clarified somewhat since then.

The Production of Accurate Annotations

The efforts required to establish a framework for high-volume, accurate annotation are substantial.  I believe that it is important that we reflect on what we have learned about the factors that determine productivity.  So, what have we learned from the project?

First, subsystem-based annotation is the key to accuracy.  While numerous efforts still focus on annotating a single genome, two principles are now widely accepted: comparative analysis is central to everything, and accuracy comes from studying the variations of a single component of cellular machinery as they are manifested across the entire collection of existing genomes.  Manual subsystem creation and maintenance is the rate-limiting component of successful annotation efforts, and the factors that constrain this process are at the heart of the matter.  We have understood this for some time now.

However, I am going to argue a new position in this short essay:

  1. There are three distinct components that make up our strategy for rapid, accurate annotation: subsystems-based annotation, FIGfams as a framework for propagating the subsystem annotations, and RAST as a technology for using FIGfams and subsystems to propagate annotations consistently to newly-sequenced genomes.

  2. These three components form a cycle (subsystems => FIGfams => RAST technology => subsystems).  This cycle creates feedback that rapidly accelerates the productivity achievable in all three components.  Further, failure in any one of these components dramatically impairs productivity in the others.  Understanding this cycle will be the key to supporting higher productivity in subsystem maintenance and creation.

  3. To understand the dependencies, we need to consider each of the components in turn.

  4. The main costs of increasing the speed and accuracy of annotations fall into two categories: those relating to maintenance of existing subsystems, and those relating to generation of new subsystems.  The maintenance costs are containable if the cycle is established and functions smoothly.  Otherwise, I suspect that they will inevitably grow rapidly.

Let me begin by depicting the cycle pictorially:

[Figure: the basic cycle -- subsystems => FIGfams => RAST technology => subsystems]

I have argued that the achievement of rapid, accurate annotation is limited by the rate at which subsystems can be maintained and created.  I place maintenance ahead of creation at this stage.  As the collection grows (it now contains over 600 subsystems covering over 6800 distinct functional roles), the costs of maintenance will tend to dominate.  The creation of new subsystems will always be a critical activity, but each new subsystem will impact smaller sets of genomes as we "move into the tail of the distribution".

The costs relating to subsystem maintenance, which will quickly dominate, depend critically on how smoothly the cycle I described functions.  We have just established the complete cycle.

The two central costs that cannot be avoided will be the creation of FIGfam-dependent decision procedures and the creation of new subsystems.  The manual work on FIGfams will be necessary to achieve near-100% accuracy in the annotation of seriously ambiguous paralogs.  However, in the vast majority of cases, this effort will be restricted to specific curators who are willing to invest massive effort to get things perfect.  The more central cost relates to manual curation of the subsystems.
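
To make the idea of a FIGfam-dependent decision procedure a little more concrete, here is a minimal sketch in Python of what such a procedure might look like.  Everything in it -- the data structures, the scores, the thresholds, the family and role names -- is my own invention for illustration and is not the actual FIGfam or RAST code; the point is only that family membership plus a family-specific acceptance rule can propagate a subsystem's functional role to a protein from a newly-sequenced genome, while ambiguous paralogs get flagged for a curator.

    # Illustrative sketch only: hypothetical data structures and thresholds,
    # not the actual FIGfam/RAST implementation.
    from dataclasses import dataclass
    from typing import Dict, Optional

    @dataclass
    class FIGfamSketch:
        family_id: str
        functional_role: str   # the role as curated in a subsystem
        min_score: float       # family-specific acceptance threshold
        min_margin: float      # required gap over the next-best family

    def assign_function(scores: Dict[str, float],
                        families: Dict[str, FIGfamSketch]) -> Optional[str]:
        """Given similarity scores of one new protein against each family,
        return a functional role only when the evidence is unambiguous;
        otherwise return None so a curator can examine the ambiguous paralog."""
        if not scores:
            return None
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        best_id, best_score = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
        family = families[best_id]
        if best_score >= family.min_score and (best_score - runner_up) >= family.min_margin:
            return family.functional_role
        return None   # ambiguous: defer to manual curation

    # Tiny usage example with made-up families and scores:
    families = {
        "FAM_A": FIGfamSketch("FAM_A", "Functional role A (illustrative)", 0.80, 0.10),
        "FAM_B": FIGfamSketch("FAM_B", "Functional role B (illustrative)", 0.80, 0.10),
    }
    print(assign_function({"FAM_A": 0.93, "FAM_B": 0.55}, families))  # confident call
    print(assign_function({"FAM_A": 0.84, "FAM_B": 0.82}, families))  # None -> curator

In real life the decision procedure for a problematic family may encode far more (genome context, trusted exemplars, curator-specific rules), which is exactly why building such procedures is one of the unavoidable manual costs mentioned above.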

More Effective Integration of Existing Annotation Efforts

In the section above, I reflected on the cycle that we shall depend upon for supporting increased volume and accuracy of our own efforts.  Other groups are certainly experimenting with their own solutions, and in some cases with clear successes.  I have no desire to rate these competing efforts.   I sincerely believe that cooperative activity is the key to enhanced achievements by everyone.  However, effective cooperation is often elusive.  I think that we have put in place an extremely important mechanism for making cooperation much easier, and the benefits more compelling.

Anyone working on one of the main annotation efforts realizes that it is not easy to derive real benefit from access to the annotations produced by other groups.  The effort required to characterize discrepancies between local annotations and those produced externally often outweighs any benefits that result.

Two events of major importance have occurred:

  1. Both PIR and the SEED Project decided to build correspondences between the IDs used by different annotation projects.  The PIR effort produced BioThesaurus and the SEED effort produced the Annotation Clearing House.  The fact that it will become trivial to reconcile IDs between the different annotation efforts will undoubtedly support a rapid increase in cross-linked entries.  The SEED is working with UniProt to cross-link proteins from all of our complete genomes, and I am sure similar efforts are happening between the other major annotation efforts.

  2. Within the Annotation Clearing House, a project to allow experts to assert that specific annotations are reliable (using whatever IDs they wish) has been initiated.  This has led to many tens of thousands of assertions that specific annotations are highly reliable.  PIR is preparing a list of assertions that they consider highly reliable, and both institutions are making these lists openly available.

To see the utility of exchanging expert assertions in a framework in which it is easy to compare the results, let me describe how we intend to use these assertions (a short illustrative sketch follows the list):

  1. We begin with a 3-column table of reliable annotations containing [ProteinID,AssertedFunction,IDofExpert]

  2. We then look up our own functional assignment for each ProteinID and construct a 2-column table [FIG-function,AssertedFunction].  This table gives a correspondence between each of our functional roles and the functional roles used by the expert making the assertion of reliability.

  3. Then, we go through this correspondence table (using both tools and manual inspection) and split it into one set in which we believe the two columns are essentially identical and a second set whose entries we believe represent errors (either our own or those of the expert asserting reliability).  We anticipate that in most cases the expert assertion will be accurate, which is what makes this exercise so beneficial to us.

  4. We take the table of "essentially the same" assertions and distribute it as a table of synonyms (which we consider to be a very useful resource).
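
To make the four steps concrete, here is a minimal sketch in Python.  The file format, the tab separator, and the lookup table mapping each ProteinID to our own FIG-function are assumptions made purely for illustration -- this is not the actual SEED or Annotation Clearing House pipeline -- and the crude string normalization stands in for what is really a mix of tools and manual inspection.

    # Illustrative sketch of steps 1-4; the input format and the FIG-function
    # lookup are assumptions, not the real pipeline.
    import csv
    from typing import Dict, List, Tuple

    def normalize(function: str) -> str:
        """Crude stand-in for the tools-plus-inspection test of whether two
        functional roles are 'essentially identical'."""
        return " ".join(function.lower().replace("-", " ").split())

    def reconcile(assertions_tsv: str, fig_function_of: Dict[str, str]):
        """assertions_tsv: 3-column table [ProteinID, AssertedFunction, IDofExpert].
        fig_function_of: our own ProteinID -> FIG-function assignments.
        Returns (synonyms, conflicts)."""
        synonyms: List[Tuple[str, str]] = []             # step 4: "essentially the same"
        conflicts: List[Tuple[str, str, str, str]] = []  # step 3: needs resolution
        with open(assertions_tsv, newline="") as handle:
            for protein_id, asserted_function, expert_id in csv.reader(handle, delimiter="\t"):
                fig_function = fig_function_of.get(protein_id)
                if fig_function is None:
                    continue                             # protein not in our integration
                # step 2: one row of the 2-column correspondence table
                if normalize(fig_function) == normalize(asserted_function):
                    synonyms.append((fig_function, asserted_function))
                else:
                    conflicts.append((protein_id, expert_id, fig_function, asserted_function))
        return synonyms, conflicts

The conflicts are where much of the value lies: since the expert assertions are usually accurate, each conflict points at an annotation of our own that probably needs correction, while the synonyms become the openly distributed table described in step 4.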

We are strongly motivated to resolve differences between our annotations and high-reliability assertions made by experts.  The production of the table of synonyms not only reduces the effort required to redo such a comparison in the future but is also a major asset in itself.  I am confident that any serious annotation group that participates will benefit, and I believe that these exchanges will accelerate in 2008 and 2009.

Summary

I have tried to express the significance of the cycle depicted above, but I think that I failed to really convey the epiphany, so let me end by expressing it somewhat more emphatically.  I believe that there will be a very rapid acceleration in the sequencing of new, complete genomes (although frequently the quality of the sequence will be far from perfect, and I am willing to say that a genome in 100 contigs is "essentially complete").  Groups that now try to provide accurate integrations of all (or most) complete genomes will be strained heavily.  The tendency will be to go in one of two directions:

  1. Some will swing to completely automated approaches.  This will result in rapid propagation of errors (for those portions of the cellular mechanisms that are not yet accurately characterized -- which is quite a bit).

  2. Others will give up any attempt at comprehensive annotation and focus on accurate annotation of a slowly growing subset.

The problem with the second approach is that accurate annotation of new cellular mechanisms (i.e., the introduction of new subsystems) will increasingly depend on a comprehensive set of genomes: comparative analysis is central to working out any of the serious difficulties, and the larger the set of accurately annotated genomes, the better the framework for careful correction.

The cycle depicted above is the only viable strategy that I know of to handle the deluge of genomes accurately.  I claim that as time goes by, the SEED effort to implement the above cycle will emerge in a continuously strengthening position.  Other groups will be forced to rapidly copy it, but it really was not that easy to establish, and I believe the odds are that the SEED effort will be the only group standing in 2-3 years (i.e., it will be the only group claiming both accuracy and comprehensive integration).