Exchange of Assignments and Annotations: the SEED Perspective
In this document I describe the basic concepts relating to assignments
and annotations as they are implemented in the SEED. I will discuss
notions relating to exchange of annotations and maintenance of
relevant events. I will not discuss issues relating to ontologies
and the use of constrained vocabularies to formulate functions.
These topics are important, but they are largely independent of the
issues I will be covering in this document.
The topic of annotation is certainly not limited to the specific class
of features we call coding sequences (this type of feature is often
abbreviated as CDS; within the SEED, a protein-encoding gene is
called a PEG). However, most of the central issues do
relate to CDSs. Hence, I will focus on annotation of CDSs; the
generalization of the concept to arbitrary features is straightforward.
Annotation
An annotation is a time-stamped piece of
text written by an author that is attached to a feature. It
may be viewed as a 4-tuple: {Feature,TimeStamp,Author,Text}. The text
can be either structured or unstructured. One
particular type of structured annotation is used to record a judgement
relating to the function of a protein encoded by a specific gene. The
basic syntax of this form of structured annotation is "The function
of Gene is Function", where Gene and Function are placeholders.
Assignment
An assignment is a 3-tuple {Feature,Author,Function}. When a
user of the SEED generates an assignment, an annotation is also
generated to record the event.
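The two tuples can be sketched as simple records. This is only an illustration, assuming Python dataclasses as a stand-in; the names Annotation, Assignment, and make_assignment are hypothetical and are not taken from the SEED code. The feature id in the usage note is a placeholder.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Annotation:
    """The 4-tuple {Feature,TimeStamp,Author,Text}."""
    feature: str
    timestamp: datetime
    author: str
    text: str


@dataclass(frozen=True)
class Assignment:
    """The 3-tuple {Feature,Author,Function}."""
    feature: str
    author: str
    function: str


def make_assignment(feature, author, function, now=None):
    """Return an assignment plus the annotation that records the event.

    Hypothetical helper: the annotation text follows the structured
    form "The function of Gene is Function".
    """
    now = now or datetime.now(timezone.utc)
    assignment = Assignment(feature, author, function)
    note = Annotation(feature, now, author,
                      f"The function of {feature} is {function}.")
    return assignment, note
```

Calling make_assignment("peg.1", "master:alice", "Thioredoxin") returns both records, mirroring the rule that generating an assignment also generates an annotation.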
Function of a Protein Encoded by a Gene
A Function also has some minimal structure. It can be an
arbitrary string of text that does not contain any occurrences of
either "; " or " / ", which is called a Basic Function. It can
also be a sequence of basic functions separated by "; ", which may be
taken to mean "the function is asserted to be one or more of the basic
functions". It can also be a sequence of basic functions separated by
" / ", which may be taken to mean "the function is asserted to be all
of the basic functions".
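A function string of this shape can be split with a few lines of code. The sketch below is an illustration only; parse_function and the labels "basic", "one_or_more", and "all" are hypothetical names. The text does not describe a string that mixes both separators, so the sketch rejects that case rather than guessing.

```python
def parse_function(function: str):
    """Split a function string into (kind, basic_functions).

    kind is "basic", "one_or_more" (parts separated by "; "),
    or "all" (parts separated by " / ").
    """
    has_semicolon = "; " in function
    has_slash = " / " in function
    if has_semicolon and has_slash:
        # Assumption: a mixed form is not defined, so refuse it.
        raise ValueError("mixed '; ' and ' / ' separators are not described")
    if has_semicolon:
        return "one_or_more", function.split("; ")
    if has_slash:
        return "all", function.split(" / ")
    return "basic", [function]
```

For example, parse_function("Enzyme A / Enzyme B") yields the kind "all" with two basic functions, while a string with neither separator is returned as a single basic function.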
The Issue of IDs
To understand the issues relating to IDs of coding sequences, consider
a situation in which we have a system (say, an instance of the SEED)
which contains 100 closely related genomes all sharing the same genus
and species. Within just 2-3 years this will be commonplace.
Already, we have versions with 5-10 distinct strains of
Staphylococcus aureus. Now suppose that a specific gene
appears with exactly identical sequence in each of the 100 genomes.
Further, in one of the genomes it has been duplicated. Hence, we have
101 distinct coding sequences that all translate to a single protein
sequence.
Finally, assume that in the genome with two copies, the upstream
regions contain regulatory sites that cause one copy to be expressed
only at high temperatures and the other copy at low temperatures; that
is, the two copies have what are arguably different functions.
When comparing data from two systems, it may or may not be trivial to
determine when two genomes are identical. If users of each system are
making occasional corrections to the actual sequences of the genomes,
this determination becomes somewhat problematic. When genomes are not precisely
identical, it can be quite difficult to determine whether genes from
the two genomes should be thought of as identical or not.
When a version of the SEED receives an assignment from an external
source, it is normally received as a 3-tuple
{ExternalID,Sequence,Function}, where the Sequence is a protein
sequence (i.e., the translation of the CDS). If the ExternalID
can reliably be mapped to a specific coding sequence in the SEED, then
the assignment is unambiguous. On the other hand, if the mapping
cannot be done unambiguously, the assignment is taken as a set of
assertions -- one for each of the internal CDSs that have matching
translations.
Two translations are considered matching if after discarding the
initial amino acids, one of the sequences is a suffix of the other and
the shorter sequence has a length that is at least 70% of the length
of the longer sequence.
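The matching rule can be written as a short predicate. This is a sketch under stated assumptions: the text does not say how many initial amino acids are discarded, so the parameter skip (default 1) is a guess, and lengths are compared after discarding. The name translations_match is hypothetical.

```python
def translations_match(a: str, b: str, skip: int = 1,
                       min_ratio: float = 0.7) -> bool:
    """Return True if two protein translations are considered matching.

    skip: number of initial amino acids to discard (assumed value).
    min_ratio: required length of the shorter sequence relative to
    the longer one (70% in the text).
    """
    a, b = a[skip:], b[skip:]
    shorter, longer = sorted((a, b), key=len)
    if not longer:
        return False
    # The shorter sequence must be a suffix of the longer one and
    # must be at least min_ratio of its length.
    return longer.endswith(shorter) and len(shorter) >= min_ratio * len(longer)
```

Under these assumptions, two translations that differ only in their first residue match, and a truncated translation matches only if it retains at least 70% of the longer one.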
This naturally raises the issue of how unambiguous mappings can be
determined.
Within the SEED, the following steps are used:
- If the exchange is with a version of the SEED, then FIG ids can be
  matched. If some genomes exist in only one version, or if ids have
  been added to one or both of the versions, ids may fail to match.
- Otherwise, if CDS ids (e.g., gi or RefSeq ids) can be matched, then an
  unambiguous correspondence can be established.
- Otherwise, if identical versions of a genome are in use (ensuring that
  checksums of the contigs in the genome give identical results), then
  locations on contigs can be used as ids to determine unambiguous
  matches.
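The steps above can be read as a fallback chain. The sketch below is one possible reading, not the SEED's implementation: resolve_cds, the record keys ("fig_id", "aliases", "location"), and the layout of index are all hypothetical. It assumes that a failed FIG id match falls through to the next step, and it returns None when no unambiguous mapping exists, in which case the assignment would be handled through matching translations.

```python
from typing import Optional


def resolve_cds(ext: dict, index: dict, from_seed: bool,
                contigs_identical: bool) -> Optional[str]:
    """Map an external CDS record to an internal id, or return None.

    index["fig"]: set of internal FIG ids
    index["alias"]: dict from external CDS ids to internal ids
    index["location"]: dict from (contig, start, end) to internal ids
    """
    # Step 1: exchange with another SEED, so try the FIG id directly.
    if from_seed and ext.get("fig_id") in index["fig"]:
        return ext["fig_id"]
    # Step 2: try other CDS ids (e.g., gi or RefSeq ids).
    for alias in ext.get("aliases", ()):
        if alias in index["alias"]:
            return index["alias"][alias]
    # Step 3: locations are usable only if contig checksums agree.
    if contigs_identical:
        return index["location"].get(ext.get("location"))
    return None
```

The flag contigs_identical stands for the checksum comparison described in the last step; computing the checksums is outside this sketch.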
The Notion of Cooperative Maintenance of a Master Set of Annotations
The SEED is designed to support a group of annotators who wish to
cooperatively annotate a set of genomes. By "cooperatively annotate",
I mean that they are willing to overwrite each other's assignments; that
is, they are trusted annotators. The SEED also supports any number of
independent annotators, that is, individuals who do not overwrite each
other's assignments. Corresponding to each SEED, there is a single
set of cooperating users, and they specify user ids of the form
master:user.
Note that, in any event, the annotations recording assignments
should never be overwritten.
The Use and Synchronization of Cooperating Annotation Systems
Multiple copies of the SEED can be used to maintain synchronized assignments.
Usually, the systems would be those of
a cooperating group of annotators. The SEED provides the capabilities
for daily automatic synchronizations. This is achieved by designating
one of the systems as a "clearinghouse". On a daily basis, the
clearinghouse will acquire all newly-generated annotations and
assignments from each of the other participating systems. Then, it
will merge and dispense updates to the other systems. Setting up and
administering this behaviour is described in a separate document.
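As a rough illustration of the acquire, merge, and dispense cycle, the sketch below treats each system's records as a set of hashable tuples and merges by set union. This is an assumption made for brevity: the name daily_sync is hypothetical, and the handling of conflicting assignments is not modeled.

```python
def daily_sync(clearinghouse: set, peers: dict) -> None:
    """Run one synchronization round, updating all sets in place.

    clearinghouse: records held by the clearinghouse system
    peers: dict from system name to that system's set of records
    """
    # Acquire: pull records the clearinghouse has not yet seen.
    for records in peers.values():
        clearinghouse.update(records - clearinghouse)
    # Dispense: send each system whatever it is missing.
    for records in peers.values():
        records.update(clearinghouse - records)
```

After one round, every participating system holds the same merged set of records.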
Introduction of Externally-Generated Assignments and Annotations
Any SEED system can initiate a transfer of annotations and assignments
from other SEED systems. We expect to move towards common protocols
that allow such transfers with a growing number of non-SEED annotation
systems.
When annotations are transferred they are simply merged. When
assignments are transferred and the author is not a cooperating user,
the SEED user is offered the option of accepting them (or not
accepting them). If they are accepted and the author was a
cooperating annotator, the assignment will be made (but no annotation
will be generated). If they are accepted and the author was not a
cooperating annotator, the assignment will be made and an annotation
recording the event will be made. A "short-cut" to acceptance can be
utilized for a cooperating annotator -- an assignment that is
accompanied by an annotation that is time-stamped as more recent than
any existing annotations is automatically accepted.
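Two of the rules above are simple enough to sketch: annotations are merged, and the "short-cut" accepts an assignment from a cooperating annotator when its annotation is newer than any existing one. The sketch makes several assumptions: annotations are (feature, timestamp, author, text) tuples with numeric timestamps, a cooperating annotator is recognized by the master: prefix of the user id, and "existing annotations" means those already held for the same feature. The names merge_annotations and auto_accept are hypothetical.

```python
def merge_annotations(existing: set, incoming: set) -> set:
    """Merge transferred annotations; identical tuples collapse."""
    return existing | incoming


def auto_accept(author: str, incoming_timestamp: float,
                existing_timestamps) -> bool:
    """Return True if the short-cut applies to an incoming assignment.

    existing_timestamps: timestamps of annotations already held for
    the feature (assumed scope).
    """
    is_cooperating = author.startswith("master:")
    is_newest = all(incoming_timestamp > t for t in existing_timestamps)
    return is_cooperating and is_newest
```

An assignment that does not qualify for the short-cut would go through the ordinary accept-or-reject step, which is not modeled here.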
The Introduction of a New Release of the SEED
Introduction of a new release of the SEED is conceptually the same as
- considering the new release as the current system (with minimal
  assignments and annotations), and
- treating the old version of the SEED as the source of all of its
  existing assignments and annotations that were made since the
  completion of the installation of the last release, and
- accepting all assignments that would introduce changes (i.e., that
  do not match the assignments supplied with the release) automatically
  (i.e., you will not be asked whether or not you wish to accept them).