How to Annotate a Subsystem: Part 2
Part one of this tutorial was designed to get you started by showing
you how to build a new subsystem and add it to the growing collection
maintained at the clearinghouse. The goals of this tutorial
are somewhat different.
We are reaching the end of the initial phase of subsystem
development. Many subsystems now exist in the clearinghouse. A few
reflect truly expert analysis and required substantial effort to
produce. Many more probably required substantial effort, but since
they were not done by an expert contain errors and omissions. This is
completely ok -- when we started the Project to Annotate 1000 Genomes,
it was understood that the bulk of the initial effort would be done by
enthusiastic amateurs like me. The results are proving extremely
valuable, and the simple fact is that this initial stage accomplished
two key goals:
- It demonstrated the value, caused the software to mature quickly,
and pushed the project forward to the point where experts could
recognize the value and begin to contribute.
- It helped turn many of the amateurs into far more accomplished
researchers in the areas in which they expended the effort.
Finally, there are many poorly done subsystems and false starts. The
way the SEED effort has been set up, you are basically responsible for
determining what you believe to be acceptable, and to import the
subsystems that meet your standards.
In this tutorial, the focus will not be on how to construct a
subsystem spreadsheet, but rather how to take an existing one and
- improve it, and
- exploit it for research purposes.
I will use as an example De Novo Purine Biosynthesis, a
subsystem that I did myself. I suggest that you access a copy via
either your own server or through the University of Chicago server.
Begin by getting the spreadsheet and then getting a copy sorted by
variant. This may not be completely trivial, so I urge you to do this
before continuing and to look at the results. If you experience
trouble, get help (if necessary send email to the SEED users via
Once you have the spreadsheet in front of you, start looking at the
different variants from the beginning. The first thing that pops out
is that the column RPAL is completely empty. What is going on
If you go to KEGG and look at the pathway diagram for purine
metabolism and then look at the enzyme entry for EC 184.108.40.206, you will
find that it
This is a situation in which either the paper was wrong, or there
might be a gem hidden here. I did not take the time to get the paper
and read it. If you do, and you can find the sequence that was used
in the analysis in 1968, then you will be able to clean up a number of
annotations. This is the sort of thing that an expert would know from
his experience, but it is something you can learn from a little
- has no known sequences, but
- resulted from a paper published in 1968 describing an "alternative
first step in purine biosynthesis".
Now, just start from the beginning and notice when organisms seem to
have problems. For example:
There are many, many such comments that could be made about the
contents of this spreadsheet. Most simply reflect bad gene calls,
incomplete genomes (the SEED tends to mark anything with 90% or more
of the sequence as "complete", but you need to find out more about the
status of the genome if you intend to rule out potentially unsequenced
portions), and so forth. All of these discrepancies need to be
explained. Some of them represent interesting research problems, and
some just represents parts of the analysis that need to be cleaned up.
- Fusobacterium nucleatum polymorphum ATCC seems to be
missing a number of genes. In particular, genes encoding purL,
purQ, and purK look like serious omissions.
- Pseudomonas aeruginosa UCBPP- PA14 seems to be missing
- Numerous organisms seem to be missing purK.
Beginning the Second Stage of Subsystem Analysis
Once you have examined the basic spreadsheet just looking for anomolies,
it is probably a good idea to begin a more systematic examination. I
suggest that you begin by checking the request that will cause the
spreadsheet to be sorted by_variant and then redisplaying the
spreadsheet. This will cause the genomes to be sorted "by variant".
Here is how this works:
This is not a completely perfect way to locate a role that is missing
in many genomes, but it is quite useful.
- First, each genome is inspected, and a vector of os and
1s is built. The first entry in the vector will be a 1, if
genes have been found for this role; else, it will be 0. The0 second
value in the vector will reflect the presence or absence of genes
encoding the second role, and so forth. In my version of the
spreadsheet, there are 16 roles. The first genome is missing genes
for PrsA, RPAL, and PurN (the 1st, 3rd, and 5th
roles). Thus, the vector for this first genome would be
- Then, the rows are sorted using the computed vectors as the keys.
This will bring together rows that are missing the same entries.
Now, as we begin to peruse the results, note that we have a number of
genomes that are all missing PrsA. Is that reasonable? To
understand the answer, we need to get a picture or diagram of the
pathway to see whether or not this step is required. You should get
the KEGG map covering Purine Metabolism in a second window.
The KEGG maps are a critical resource, but you could also use Gerhard
book on biochemical pathways.
What can you glean? Is PrsA required? What is it used for? Is
there an alternative source of the intermediate it produces?
You will need to go through this reasoning process of trying to
understand whether or not roles must be there (and have not yet been
pinned down to specific genes) or whether they simply represent
alternatives for numerous issues revealed by the spreadsheet.
As an exercise, I suggest that you try to articulate the major issues
to be resolved for this spreadsheet. You can gain a reasonable
overview in just a few minutes, although actually developing the
detailed analysis might well take many hours.
The Issue of Duplicate Assignments
One tends to focus on the empty cells as a source of interesting
challenges, but the issue of too many genes being assigned to a single
functional role is also important. In many cases, multiple entries
simply reflect a genome that is still in many contigs, and pieces of
the same gene have been assigned distinct IDs (since they occur on
different contigs). Framshifts can also produce this apparent
redundancy. And, then, it is true that in some genomes, there
actually are many genes playing the same role. Figuring out the truth
is often tedious, unless you have a particular interest in the given
gene or organism.
What Is the Real Objective?
The subsystem spreadsheet represents a framework where numerous
problems can be exposed with relatively little effort. Working
through all of the detailed issues is essential to developing a
comprehensive grasp of the subsystem and how it is implemented in the
organisms that contain it. In many respects, this represents the
underlying data needed to support a detailed review of the subsystem.
To cover a subsystem in this level of detail represents a major
investment of time and effort, but it will in most cases lay the
foundation for a comprehensive research plan to clarify issues
relating to the subsystem.
By constructing spreadsheets for all of the processes in core
metabolism, we are simultaneously developing detailed, informal
metabolic reconstructions for hundreds of organisms simultaneously.
This is a point worth pondering, since metabolic reconstructions
represent a key foundation for numerous technologies that will emerge
to exploit genomic data.
Beyond this basic need to fill in the missing pieces and clarify the
outstanding issues, is there more to be done? Is there a stage
in our development of encoded subsystems that goes beyond constructing
detailed metabolic reconstructions for hundreds (and eventually
thousands) or organisms?
The fact that I posed the question signals that I believe that the
answer is "yes". This final stage of analysis was best illustrated in
on evolution of the tryptophan operan
written by Xie, Bonner and Jensen. In this review, the authors
attempted to provide an accurate picture of how the seven conserved
catalytic domains required for tryptophan biosynthesis evolved. That
is, for each functional role, an attempt was made to determine the
evolutionary history of the genes that implement the role. This would
include determination of when clusters formed and were broken, when
fusions occurred, when duplications occurred, when horizontal
transfers occurred, and so forth. Years were spent in the analysis
of these seven functional roles, but it is fair to say that the
results were incredibly encouraging. While questions remain, the
authors produced a stunningly detailed picture of the history of this
pathway. The impact of this study will go well beyond our
understanding of the tryptophan operon; it presented a template of
exactly what was needed to advance and deepen our understanding of
subsystems in general.
In this tutorial (which is still far from complete), we will attempt
to offer some guidance and tools to support this advanced analysis.
Over the next two years, we hope to develop a more-or-less
comprehensive set of tools to support this style of analysis.
Constructing the Phylogentic Framework
The first step, assuming an initial subsystem spreadsheet already
exists, will be to develop multiple sequence alignments for each of
the functional roles that make up the system. To get a fairly good
alignment, you can just use clustalW, a fine program that has
justifiably gained a good reputation. You can request that alignments
be generated directly from your web-browser, but the results may not
be available immediately. In large subsystems it may take a few
minutes (or, very occasionally, hours) to complete the alignment;
hence the subsystem analysis application schedules the task, and then
you must periodically check to see when the results have completed.
If you make changes to a spreadsheet and then recompute alignments, it
will only realign columns in which entries have been inserted or
Computing an alignment produces an initial, very approximate
phylogenetic tree. You may inspect these alignments directly from the
SEED. The first questions that needs to be asked are as follows:
The experiences of the team working on the tryptophan strongly
support the view that there is a recognizable phylogenetic history
that is essentially in agreement with estimates produced from analysis
of rRNA (see
Inter-genomic displacement via lateral gene transfer of bacterial trp operons in an
overall context of vertical genealogy).
Whether or not this turns out to be the case for most subsystems, you
should begin by examining the initial set of trees.
What one would expect, for subsystems that are
part of the central cellular machinery, is overall agreement between
the individual trees obscured by occasional horizontal transfers.
To what extent are these alignments consistent with one
- To what extent are they consistent with the estimate of
phylogenetic history provided by alignments of rRNA?
Here we need to develop a detailed study of the trees from de
novo purine biosynthesis
Once you have determined a set of functional roles and organisms that,
you believe, are consistent (and, hence, either were not subject to
horizontal transfer or were transferred as a block of genes), you can
build a more reliable estimate of the evolutionary history. The
straightforward way to do this is to build one large alignment by
concatenating the set of consistent "role alignments" and then
building a tree. There are arguments about whether this reveals or
obscures the truth, but if the individual alignments actually reflect
a common history it should lead to a more accurate estimate of the
tree. This should be done, and the result should be compared with the
Here we do this for the de novo purine biosynthesis subsystem
Using the Phylogenetic Framework
Once you have constructed an estimate of the phylogenetic history of
the majority of functional roles that make up the subsystem, you can
utilize it to study the history of clusters on the chromosome, fusion
events and horizontal transfers. We have developed tools within the
SEED to support the analysis required, but in each case careful human
judgement must be applied; the tools simply aid in developing and
Developing a History of the Chromosomal Clusters
Once you have an estimate of the underlying phylogenetic tree, you can
use it to estimate when clusters were formed and broken. The tool we
offer works like this:
The labelled tree produced by these steps can be used to gain an
overview of when clusters were formed and broken. In many cases it
produces a powerful confirmation of the overall accuracy of the
phylogenetic tree. On the other hand, if inconsistencies are revealed
(i.e., a parsimonious interpretation of the clusters involves many
duplicate events), this too is worth thinking about.
Consider the pairs of genes (X,Y) such that the genes implement
roles from the subsystem, they are adjacent in at least one genome,
and X sorts ahead of Y. Make a list of all such pairs.
- Now, for each pair in the list, perform the following steps:
- Take a copy of the phylogenetic tree and label every leaf
(genome) that contains adjacent versions of the genes with a "+" and
label every leaf that doe not with a "-".
- Propogate these labels to internal nodes using a parsimony
- Record the results
- At this point, you have recorded the nodes at which the presence
of adjacent pairs could be inferred based on an assumption of
parsimony (i.e., that a minimal number of "changes" occurred).
For each node, coalesce these pairs into longer strings when
possible. Note that as you form a longer string (or, for that matter,
even with a single pair of genes) you cannot be sure that all occurrences of
this pattern have genes in the same orientation.