How to Annotate a Subsystem
Introduction
One of the objectives in constructing the SEED was to provide a framework in which annotations
of genes could be studied and improved. The ability to annotate a specific subsystem (like glycolysis, the ribosome,
or histidine biosynthesis) is a key
component of our strategy to "Annotate a 1000 Genomes". In this project, we plan to cooperatively
develop detailed annotations of several hundred subsystems, using hundreds of genomes in the annotation of
each subsystem. Once we have the detailed encoding of a subsystem based on several hundred organisms, we believe that
this encoding can be easily used to add any number of new organisms. This will allow us to automatically,
and accurately, annotate the thousands of genomes that will become available during the coming years.
There are two basic styles in annotating a subsystem: the style used by a novice to produce a reasonably accurate
encoding, and the style used by an expert to produce as accurate an analysis as possible. Ultimately we are, of
course, seeking expert encodings for each of the cellular subsystems. However, we also believe that
producing basically accurate encodings using less skilled participants is a worthy and useful objective.
In fact, we consider annotation of subsystems to be a basic activity of a practicing biologist
and believe that most biologists should learn how to do it well. Certainly students should spend some time
learning both how to do it and how to exploit the results.
We will begin by trying to convey the goals and approaches for developing "reasonably accurate" encodings quickly.
After covering this topic, we will move on to discuss what more is needed to develop expert annotations.
Developing a Reasonably Accurate Subsystem Annotation Quickly
The steps involved in constructing a subsystem annotation quickly are as follows (we cover the details for
each step below):
- Select a subsystem and specify the precise roles that make up the subsystem.
- Carefully annotate a few key organisms that are known to include the subsystem. For each
gene included in the subsystem, make sure the annotations are made precisely as you desire,
and project these exact formulations of function to as many orthologous genes as you can.
- Add the well-annotated genomes to the spreadsheet, fixing any errors that show up.
- Add a somewhat larger group of organisms, again fixing errors that show up.
- Add all of the remaining organisms.
- Remove organisms which appear not to have the subsystem.
- Make a pass to fill in as many as possible of the cells that are missing genes.
- Mark each genome with a numerical variant code. Genomes with the same version of the subsystem should be marked with the same variant code. Genomes with errors (an uncalled gene for a role required in the subsystem, for example) should be marked with a variant code of zero.
This is a sketch of how to make a subsystem annotation (that will contain errors). It should be
viewed as a starting point for an expert annotation. As you read what follows, it will be useful to
work in parallel on a short assignment.
Let us now go through these steps in detail.
Getting Started
The first task is to initiate a new subsystem and fill in the roles. To do this, you first have to
get to the Subsystems Page. You get to this from the initial SEED search page, using a section
near the bottom that says "Work on Subsystems". You will need to fill in a user ID (something like "master:JohnD"),
and then click on the button that says Work on Subsystems.
Once you get to the Subsystems Page, you should see a table describing the existing subsystems (if any) and
a spot where you can initiate a new subsystem (look for the words To Start a New Subsystem Annotation).
You will need to type in a unique name for the subsystem. Make the name descriptive
(something like Histidine Biosynthesis or Chemotaxis).
Once you click on start new subsystem annotation, you should see a blank spreadsheet that you will
be filling in. It begins with slots to write Functional Roles. You have room for five functional roles, but whenever
you click on update spreadsheet it will update the form, making sure that you have room to add
more functional roles.
You begin by filling in the names of the functional roles you wish to annotate, worded exactly as
you want them to appear in annotations. Genes will be annotated as coding for proteins that have
these roles, or occasionally several of these roles (in the
case of multifunctional proteins). Here are some minimal guidelines that we suggest people follow in
assigning text strings for each role:
- Although genes can be multifunctional, functional roles are not.
You should have a separate role for each "catalytic domain" or, in the case of nonmetabolic subsystems, for each
function encoded by a peptide.
- We tend to use Swiss Prot or UniProt descriptions as functional roles. These are usually expertly curated,
carefully chosen wordings. On the other hand, it is really your choice.
- When you include an EC number, include it in the form (EC x.x.x.x) (e.g., (EC 2.7.1.11) for phosphofructokinase).
- You may be encoding a piece of metabolism in which alternatives exist. You can make a list of roles that
includes all alternatives, understanding that most organisms will not contain genes corresponding to every
role. Alternatively, you can encode two subsystems (each containing one alternative). We have a utility that
can be used to glue two subsystem spreadsheets together into a single spreadsheet. Sometimes it is much more
convenient to work on separate small pieces and then just combine them.
- Add an abbreviation for each role in the Abbrev column. This will serve as a column header in the spreadsheet below. The abbreviation can be the abbreviation for a gene or protein, or something with meaning only in the context of your subsystem.
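Because roles are matched as exact text strings, it can help to check the wording of each role before typing it into the spreadsheet. The following is a minimal sketch in Python, not part of the SEED itself: the function name check_role and the two patterns are hypothetical, and allowing "-" in the later EC fields (for partially classified enzymes) is an assumption that goes beyond the (EC x.x.x.x) form described above.

```python
import re

# Assumption: a well-formed EC number looks like "(EC 2.7.1.11)"; the later
# fields may be "-" for partially classified enzymes.
EC_WELL_FORMED = re.compile(r"\(EC \d+\.(?:\d+|-)\.(?:\d+|-)\.(?:\d+|-)\)")
# Any mention of an EC number, however it is punctuated.
EC_MENTION = re.compile(r"\bEC[\s:]*\d")


def check_role(role: str) -> list[str]:
    """Return a list of formatting problems found in a functional role string."""
    problems = []
    if role != role.strip():
        problems.append("leading or trailing whitespace")
    if "  " in role:
        problems.append("repeated spaces")
    if EC_MENTION.search(role) and not EC_WELL_FORMED.search(role):
        problems.append("EC number not in the form (EC x.x.x.x)")
    return problems


if __name__ == "__main__":
    for role in ["6-phosphofructokinase (EC 2.7.1.11)",
                 "6-phosphofructokinase EC 2.7.1.11"]:
        print(repr(role), check_role(role))
```

An empty list means the role string passed these particular checks; it does not mean the wording is the one you want.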
Once you have typed in the roles, click on update spreadsheet and proceed to the next step.
Annotate a Few Key Organisms
Usually, there is at least one organism in which the subsystem is well-annotated, in the sense that what is actually
known has been captured correctly (often this is E. coli or
B. subtilis). We suggest that you begin by looking at the
annotations in these key organisms, and make sure they agree
(exact match -- no changes in case, spacing, punctuation, etc.) with the descriptions you used for
the roles when you began the subsystem spreadsheet.
As you make sure they agree, for each gene look at the closest 100 FIG Ids (i.e., look at similarities with maxN
set to 100, Max Expand set to 100, and Just FID Ids checked). Check the assignments made
to the similar genes, and when you can do it both accurately and quickly, make them consistent with
the descriptions you are using for the roles.
This may take anywhere from 10 minutes to several hours, but it will start having a dramatic effect on the
consistency of annotations for this very limited set of genes.
Add the Well-Annotated Organisms to the Spreadsheet
Now, you can update the spreadsheet by selecting the set of organisms that you believe are well-annotated (we
suggest starting with just one or two) and clicking on update spreadsheet. The spreadsheet should
get updated, and rows should appear corresponding to the organisms you selected. The FIG Ids in each cell
should correspond to the genes having the roles. Frequently, some cells will turn up as empty, although you
know for sure they should be filled. This is normally due to mismatches in the role descriptions and the
functions assigned to the genes (the last time this happened, there was an extra space in one of the role
descriptions, but there are many ways you can make seemingly identical strings mismatch). One easy way
to track down what is going on is to check the box show missing and update the spreadsheet. This will
cause links to be generated for each cell that has no genes in it. By clicking on one of these
links you will cause the SEED to look for candidate genes. You should pursue these links, trying to
correct whatever errors led to the failure to fill in the cell. Once you have corrected all of the
errors, check the box fill and update the spreadsheet. The entries that match will be added to
the previously empty cells. Continue until the cells appear to be filled in correctly.
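When a cell stays empty even though the gene is clearly there, the cause is often a near-miss between the role description and the function assigned to the gene. The sketch below illustrates one way to classify such near-misses offline. It is only an illustration: explain_mismatch is a hypothetical helper, and it assumes you have copied the two strings out of the SEED pages yourself.

```python
import string


def _collapse(text: str) -> str:
    """Trim the ends and reduce every run of whitespace to a single space."""
    return " ".join(text.split())


def _strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation, then collapse whitespace again."""
    table = str.maketrans("", "", string.punctuation)
    return _collapse(text.translate(table))


def explain_mismatch(role: str, assigned: str) -> str:
    """Say why a role string and an assigned function fail to match exactly."""
    if role == assigned:
        return "exact match"
    if _collapse(role) == _collapse(assigned):
        return "differs only in spacing"
    if _collapse(role).lower() == _collapse(assigned).lower():
        return "differs in case"
    if _strip_punctuation(role.lower()) == _strip_punctuation(assigned.lower()):
        return "differs in punctuation"
    return "different text"


if __name__ == "__main__":
    role = "Histidinol dehydrogenase (EC 1.1.1.23)"
    print(explain_mismatch(role, "Histidinol  dehydrogenase (EC 1.1.1.23)"))
```

The checks run from the mildest difference to the most severe, so the answer names the first normalization that makes the two strings agree.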
Add a Larger Set of Organisms
Now, we suggest that you add four or five somewhat diverse organisms, and see if the spreadsheet fills
in as you anticipated. If not, use show missing to set up links to pursue missing entries.
You can also use show duplicates to look at cases where multiple genes are included in a cell.
These are often legitimate, but if you know what you are doing, it might be useful to look for
clear misannotations. If you are not familiar with the details of the subsystem, leave duplicate
checking to experts.
You can use Add Genomes with Solid Hits to add all genomes for which all of the cells can be filled in.
This is sometimes useful, but often specific roles are optional, and when some are missing those genomes
will not automatically get selected.
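The selection rule behind "solid hits", as described above, can be sketched in a few lines of Python. This is an assumption about the logic, not the SEED's actual code: the spreadsheet is modeled as a plain dictionary from genome name to a dictionary of role abbreviation to gene IDs, and all genome names, abbreviations, and gene IDs are made up for illustration.

```python
def genomes_with_solid_hits(candidates: dict, roles: list) -> list:
    """Return the genomes in which every role has at least one gene."""
    return [
        genome
        for genome, cells in candidates.items()
        if all(cells.get(role) for role in roles)
    ]


# Hypothetical data: genome_B has no gene for the role abbreviated HisG.
ROLES = ["HisD", "HisG"]
CANDIDATES = {
    "genome_A": {"HisD": ["geneA1"], "HisG": ["geneA2"]},
    "genome_B": {"HisD": ["geneB1"], "HisG": []},
}

if __name__ == "__main__":
    print(genomes_with_solid_hits(CANDIDATES, ROLES))
```

Under this rule genome_B is skipped even if HisG is optional in that organism, which is why such genomes have to be added by hand.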
Add all of the Remaining Organisms
You can easily select all of the remaining organisms and add them in a single shot. This will usually
lead to many potential errors -- cells that are empty, duplicates, and cells that are filled for organisms that
clearly do not have the subsystem. In each case, these may or may not be real errors. In most cases, they
are things that should be examined and thought about.
As you correct annotations and update the spreadsheet, empty cells will fill in. However, if there
should be two entries in a cell, but only one appears, the second will only be added after checking the
"refill spreadsheet from scratch" checkbox option and then updating the spreadsheet.
Remove Organisms that Do Not Have Functioning Versions of the Subsystem
Now, you should make a pass to erase genomes that you believe do not have versions of the subsystem.
To erase a genome, just erase the genome number; this will cause the whole row to go away when the
spreadsheet is updated.
The basic idea of the spreadsheet is to document variants of the subsystem as they are understood.
When a row is present with missing entries, it can simply reflect alternatives, uncertainty, or your
strong belief that the subsystem is present and that there must exist a gene that has not yet been
properly characterized and annotated. You should remove only those rows that represent organisms that
you believe do not have the subsystem.
Make One Last Pass to Check Missing Genes
For cells that you believe represent roles that are not optional, you should check for missing
entries one last time. If you cannot find the gene, you should make a note that it represents a
serious difficulty: the gene may not have been called (but is there on the chromosome), it may have
a form that was never characterized (these are the "gems" we are looking for; the clue in this case
is when several genomes all are missing the same role), or it might be in a frameshifted gene that was
misannotated, among other possibilities. Make one last pass, and keep detailed notes (in the notes section, so others
may benefit from your observations). If you haven't already done so, make use of the functional coupling
feature to help locate missing genes (described in section 2.44 of the Getting Started tutorial).
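The clue mentioned above, several genomes all missing the same role, is easy to tally once the spreadsheet is written down as data. The following Python sketch assumes a hypothetical dictionary layout (genome name to role abbreviation to gene IDs) with made-up names; it is not a SEED feature.

```python
from collections import Counter


def missing_role_counts(spreadsheet: dict, roles: list) -> Counter:
    """Count, for each role, how many genomes have an empty cell for it."""
    counts = Counter()
    for cells in spreadsheet.values():
        for role in roles:
            if not cells.get(role):
                counts[role] += 1
    return counts


# Hypothetical data: two of three genomes lack a gene for HisB.
ROLES = ["HisB", "HisD"]
SPREADSHEET = {
    "genome_A": {"HisB": ["geneA1"], "HisD": ["geneA2"]},
    "genome_B": {"HisB": [], "HisD": ["geneB1"]},
    "genome_C": {"HisD": ["geneC1"]},
}

if __name__ == "__main__":
    for role, count in missing_role_counts(SPREADSHEET, ROLES).most_common():
        print(role, "missing in", count, "genomes")
```

A role that is missing in many otherwise complete rows is a candidate for an uncharacterized form of the enzyme. A role missing in a single row more likely points to an uncalled or frameshifted gene.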
Mark the Prototypical Genomes
The key use of annotated subsystems will be to annotate new genomes. To make this work, we need
to have a detailed record of the "acceptable variants of the subsystem". As you go through the list
of genomes in the spreadsheet, when you find one that defines an operational subsystem, mark it with a
numerical variant code. Several operational variations of a subsystem may exist. Genomes with the same set of roles present should be marked with the same variant code. Explain the differences
between the variants in the notes section. Some genomes will have empty cells for essential roles, indicating a missing gene. Do not assign a variant code to a genome in which a gene was simply not called, or in which a frameshift prevented accurate recognition of the gene; leave its code set to zero. When you get done, you have two classes of genomes: those that can be used to describe the set of variants of the subsystem (i.e., of working versions of the subsystem) and those that you believe have working versions, but also have uncorrected errors.
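The rule for variant codes (same set of roles present, same code; zero for genomes with uncorrected errors) can be sketched as follows. This Python fragment is only an illustration under stated assumptions: the dictionary layout, the function name assign_variant_codes, and all genome and role names are hypothetical, and the list of genomes with errors must come from the curator's judgment, since code cannot tell an uncalled gene from a legitimately absent role.

```python
def assign_variant_codes(spreadsheet: dict, roles: list,
                         genomes_with_errors=frozenset()) -> dict:
    """Give genomes with the same set of filled roles the same nonzero code."""
    codes = {}
    code_for_pattern = {}
    for genome, cells in spreadsheet.items():
        if genome in genomes_with_errors:
            codes[genome] = 0  # uncalled or frameshifted gene: leave at zero
            continue
        pattern = frozenset(role for role in roles if cells.get(role))
        if pattern not in code_for_pattern:
            # Codes are numbered 1, 2, ... in order of first appearance.
            code_for_pattern[pattern] = len(code_for_pattern) + 1
        codes[genome] = code_for_pattern[pattern]
    return codes


# Hypothetical data: A and B share a pattern, C uses an alternative,
# and the curator has judged that D has an uncalled gene.
ROLES = ["R1", "R2", "R3"]
SPREADSHEET = {
    "genome_A": {"R1": ["a1"], "R2": ["a2"], "R3": []},
    "genome_B": {"R1": ["b1"], "R2": ["b2"], "R3": []},
    "genome_C": {"R1": ["c1"], "R2": [], "R3": ["c3"]},
    "genome_D": {"R1": ["d1"], "R2": [], "R3": []},
}

if __name__ == "__main__":
    print(assign_variant_codes(SPREADSHEET, ROLES, {"genome_D"}))
```

The numbers themselves carry no meaning beyond grouping, so the differences between variants still need to be explained in the notes section.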
Once this step has been completed, you have finished a crude version of the basic spreadsheet. It
represents a very valuable contribution for the following reasons:
- The annotations will be far more consistent.
- Known variants of the subsystem have been characterized (albeit imperfectly in many cases).
- The existence of missing genes (new forms of enzymes) is usually made vivid.
- We have data that can be treated as relatively reliable (this is often essential for creating
"learning sets" or "evaluation sets").
- It forms a starting point for an expert who wishes to do the really difficult analysis required
to accurately characterize the subsystem.
The Second Phase: Expert Annotation
The second phase of subsystem annotation requires extensive expertise from years of research,
coupled with the capability of doing experimental verification of conjectures. It represents
orders of magnitude more effort. The goal of the expert annotation should ultimately be to clarify
the evolutionary history of the genes in the protein families that include those implementing the
functional roles in the spreadsheet. That is, many roles will be implemented by genes that have paralogs
that implement closely related functions (which are not included in the subsystem). It is necessary to
clarify the precise differences in function. Further, a truly complete analysis will clarify
the detailed evolutionary history: where each duplication, horizontal transfer, cluster breakup, and so forth
occurred.
There is broad disagreement about whether or not characterization of the evolutionary history is essential.
It is certainly true that accurate identification of function can be accomplished at far less effort than
working out the details of the evolutionary history. And effort that is spent working out the evolutionary events
(which is extremely difficult and demanding work in many cases) could certainly be expended working
out the functional characterizations of other subsystems. It is not our place to assign relative values
to these different activities (but, in the few cases in which the evolutionary history has largely been
pieced together, one gets glimpses of a new level of understanding).
We will attempt to add features to the SEED to support expert annotation of subsystems. We believe that
in many cases experts will use the SEED as a framework for curating the data that will be included in
a progression of review articles. That is how it is intended to work.