Exchange of Assignments and Annotations: the SEED Perspective

In this document I describe the basic concepts relating to assignments and annotations as they are implemented in the SEED. I will discuss notions relating to exchange of annotations and maintenance of relevant events. I will not discuss issues relating to ontologies and the use of constrained vocabularies to formulate functions. These topics are important, but they are largely independent of the issues I will be covering in this document.

The topic of annotation is certainly not limited to the specific class of features we call coding sequences (this type of feature is often abbreviated as CDS or, within the SEED, called a protein-encoding gene, or PEG). However, most of the central issues do relate to CDSs. Hence, I will focus on annotation of CDSs; the generalization of the concept to arbitrary features is straightforward.

Annotation

An annotation is a time-stamped piece of text written by an author that is attached to a feature. It may be viewed as a 4-tuple: {Feature,TimeStamp,Author,Text}. The text can be either structured or unstructured. One particular type of structured annotation is used to record a judgement relating to the function of a protein encoded by a specific gene. The basic syntax of this form of structured annotation is "The function of Gene is Function".

Assignment

An assignment is a 3-tuple {Feature,Author,Function}. When a user of the SEED generates an assignment, an annotation is also generated to record the event.
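
As a minimal sketch, the two tuples can be written as Python records. The class names, the field names, the helper make_assignment, and the feature id "peg.1" used in the test of the idea are all hypothetical; the only things taken from the text are the tuple contents and the wording of the structured annotation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Annotation:
    """The 4-tuple {Feature,TimeStamp,Author,Text}."""
    feature: str
    timestamp: datetime
    author: str
    text: str


@dataclass(frozen=True)
class Assignment:
    """The 3-tuple {Feature,Author,Function}."""
    feature: str
    author: str
    function: str


def make_assignment(feature, author, function, now=None):
    """Build an assignment plus the annotation that records the event.

    Hypothetical helper: it only mirrors the rule that generating an
    assignment also generates an annotation.
    """
    when = now if now is not None else datetime.now(timezone.utc)
    assignment = Assignment(feature, author, function)
    record = Annotation(feature, when, author,
                        f"The function of {feature} is {function}")
    return assignment, record
```

Keeping the two records separate reflects the distinction drawn above: the assignment carries the current function, while the annotation is the time-stamped record of how it came to be made.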

Function of a Protein Encoded by a Gene

A Function also has some minimal structure. It can be an arbitrary string of text that does not contain any occurrences of either "; " or " / ", which is called a Basic Function. It can also be a sequence of basic functions separated by "; ", which may be taken to mean "the function is asserted to be one or more of the basic functions". It can also be a sequence of basic functions separated by " / ", which may be taken to mean "the function is asserted to be all of the basic functions".
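
The structure described above can be illustrated with a small parser. This is a sketch, not SEED code: the function name parse_function and the labels "basic", "one_or_more", and "all" are my own, and rejecting a string that mixes both separators is an assumption, since the text does not say how such a string should be read.

```python
def parse_function(function):
    """Split a Function string into (kind, list of basic functions)."""
    has_alternatives = "; " in function   # "one or more of"
    has_conjunction = " / " in function   # "all of"
    if has_alternatives and has_conjunction:
        # Assumption: mixing the two separators is not covered by the text.
        raise ValueError("mixed separators are not described")
    if has_alternatives:
        return "one_or_more", function.split("; ")
    if has_conjunction:
        return "all", function.split(" / ")
    return "basic", [function]
```

For instance, parse_function("Kinase A / Kinase B") yields the kind "all" with two basic functions, while a string containing neither separator comes back unchanged as a single basic function.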

The Issue of IDs

To understand the issues relating to IDs of coding sequences, consider a situation in which we have a system (say, an instance of the SEED) which contains 100 closely related genomes all sharing the same genus and species. Within just 2-3 years this will be commonplace. Already, we have versions with 5-10 distinct strains of Staphylococcus aureus. Now suppose that a specific gene appears with exactly identical sequence in each of the 100 genomes. Further, in one of the genomes it has been duplicated. Hence, we have 101 distinct coding sequences that all translate to a single protein sequence. Finally, assume that in the genome with two copies, the upstream regions contain regulatory sites that cause one copy to be expressed only at high temperatures and the other copy at low temperatures; that is, the two copies have what are arguably different functions.

When comparing data from two systems, it may or may not be trivial to determine when two genomes are identical. If users of each system are making occasional corrections to the actual sequences of the genomes, it becomes somewhat problematic. When genomes are not precisely identical, it can be quite difficult to determine whether genes from the two genomes should be thought of as identical or not.

When a version of the SEED receives an assignment from an external source, it is normally received as a 3-tuple {ExternalID,Sequence,Function}, where the Sequence is a protein sequence (i.e., the translation of the CDS). If the ExternalID can reliably be mapped to a specific coding sequence in the SEED, then the assignment is unambiguous. On the other hand, if the mapping cannot be done unambiguously, the assignment is taken as a set of assertions -- one for each of the internal CDSs that have matching translations. Two translations are considered matching if after discarding the initial amino acids, one of the sequences is a suffix of the other and the shorter sequence has a length that is at least 70% of the length of the longer sequence.
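
The matching rule for translations can be sketched as follows. Two points are assumptions on my part: I read "discarding the initial amino acids" as dropping the first residue of each sequence (the skip parameter), and I measure the 70% ratio on the lengths after that trimming. The function name translations_match is hypothetical.

```python
def translations_match(seq_a, seq_b, min_ratio=0.7, skip=1):
    """Return True if two protein translations are considered matching.

    skip      -- number of leading residues discarded (assumed to be 1)
    min_ratio -- minimum length of the shorter sequence relative to the longer
    """
    a, b = seq_a[skip:], seq_b[skip:]
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return False  # nothing left to compare after trimming
    is_suffix = longer.endswith(shorter)
    long_enough = len(shorter) >= min_ratio * len(longer)
    return is_suffix and long_enough
```

Under these assumptions, two translations that differ only in their first residue match, as do two calls of the same gene whose start positions differ by a few residues; a short fragment that happens to be a suffix of a much longer protein does not.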

This naturally raises the issue of how unambiguous mappings can be determined. Within the SEED, the following steps are used:

  1. If the exchange is with a version of the SEED, then FIG ids can be matched. If some genomes exist in only one version, or if ids have been added to one or both of the versions, ids may fail to match.
  2. Otherwise, if CDS ids (e.g., gi or RefSeq ids) can be matched, then an unambiguous correspondence can be established.
  3. Otherwise, if identical versions of a genome are in use (ensuring that checksums of the contigs in the genome give identical results), then locations on contigs can be used as ids to determine unambiguous matches.
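
The three steps above amount to a cascade of lookups, which can be sketched as below. Everything about the data layout is hypothetical: the record keys "fig_id", "cds_id", and "location", the index of dictionaries, and the function names. The use of MD5 for contig checksums is also an assumption; the text does not name a checksum algorithm.

```python
import hashlib


def same_contigs(contigs_a, contigs_b):
    """Compare two genomes by per-contig checksum (MD5 is an assumption)."""
    def digests(contigs):
        return {name: hashlib.md5(seq.encode("ascii")).hexdigest()
                for name, seq in contigs.items()}
    return digests(contigs_a) == digests(contigs_b)


def resolve_cds(record, index, from_seed=False, contigs_identical=False):
    """Return the internal CDS id for an external record, or None.

    record -- dict that may hold 'fig_id', 'cds_id', 'location'
    index  -- dict of dicts, one lookup table per kind of id
    None means no unambiguous mapping was found.
    """
    steps = [
        (from_seed, "fig_id"),            # step 1: exchange with a SEED
        (True, "cds_id"),                 # step 2: e.g. gi or RefSeq ids
        (contigs_identical, "location"),  # step 3: identical genome versions
    ]
    for allowed, key in steps:
        value = record.get(key)
        if allowed and value is not None and value in index.get(key, {}):
            return index[key][value]
    return None
```

When resolve_cds returns None, the assignment would be treated as a set of assertions over all CDSs with matching translations, as described earlier.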

The Notion of Cooperative Maintenance of a Master Set of Annotations

The SEED is designed to support a group of annotators who wish to cooperatively annotate a set of genomes. By "cooperatively annotate", I mean that they are willing to overwrite each other's assignments -- they are trusted annotators. The SEED also supports any number of independent annotators -- individuals who do not overwrite each other's assignments. Corresponding to each SEED, there is a single set of cooperating users, and they specify user ids of the form master:user. Note that the annotations recording assignments should never get overwritten in any event.

The Use and Synchronization of Cooperating Annotation Systems

Multiple copies of the SEED can be used to maintain synchronized assignments. Usually, the systems would be those of a cooperating group of annotators. The SEED provides the capabilities for daily automatic synchronizations. This is achieved by designating one of the systems as a "clearinghouse". On a daily basis, the clearinghouse will acquire all newly-generated annotations and assignments from each of the other participating systems. Then, it will merge and dispense updates to the other systems. Setting up and administering this behaviour is described in a separate document.

Introduction of Externally-Generated Assignments and Annotations

Any SEED system can initiate a transfer of annotations and assignments from other SEED systems. We expect to move towards common protocols that allow such transfers with a growing number of non-SEED annotation systems.

When annotations are transferred they are simply merged. When assignments are transferred and the author is not a cooperating user, the SEED user is offered the option of accepting them (or not accepting them). If they are accepted and the author was a cooperating annotator, the assignment will be made (but no annotation will be generated). If they are accepted and the author was not a cooperating annotator, the assignment will be made and an annotation recording the event will be made. A "short-cut" to acceptance can be utilized for a cooperating annotator -- an assignment that is accompanied by an annotation that is time-stamped as more recent than any existing annotations is automatically accepted.
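
One possible reading of these acceptance rules is sketched below. It is an interpretation, not the SEED implementation: I assume that a cooperating annotator is recognized by the master: prefix on the user id, that the "short-cut" bypasses the user's decision, and that timestamps are any mutually comparable values. The function names and the (assignment_made, annotation_generated) return convention are hypothetical.

```python
def is_cooperating(author):
    """Cooperating users have ids of the form master:user."""
    return author.startswith("master:")


def handle_incoming(author, incoming_ts, existing_ts, user_accepts):
    """Decide what happens to one transferred assignment.

    incoming_ts  -- timestamp of the accompanying annotation, or None
    existing_ts  -- timestamps of annotations already on the feature
    user_accepts -- the local user's answer when asked
    Returns (assignment_made, annotation_generated).
    """
    cooperating = is_cooperating(author)
    # Short-cut: a cooperating annotator's assignment whose annotation is
    # more recent than every existing annotation is accepted automatically.
    if cooperating and incoming_ts is not None \
            and all(incoming_ts > t for t in existing_ts):
        return True, False
    if not user_accepts:
        return False, False
    # Accepted: a new annotation is generated only for non-cooperating authors.
    return True, not cooperating
```

In this reading, the accompanying annotation from a cooperating annotator already records the event once annotations are merged, which is why no second annotation is generated for it.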

The Introduction of a New Release of the SEED

Introduction of a new release of the SEED is conceptually the same as
  1. considering the new release as the current system (with minimal assignments and annotations),
  2. treating the old version of the SEED as the source of all of its existing assignments and annotations that were made since the completion of the installation of the last release, and
  3. accepting automatically all assignments that would introduce changes (i.e., that do not match the assignments supplied with the release), so that you will not be asked whether or not you wish to accept them.