How to Annotate a Subsystem: Part 2


Part one of this tutorial was designed to get you started by showing you how to build a new subsystem and add it to the growing collection maintained at the clearinghouse. The goals of this tutorial are somewhat different.

We are reaching the end of the initial phase of subsystem development. Many subsystems now exist in the clearinghouse. A few reflect truly expert analysis and required substantial effort to produce. Many more probably required substantial effort, but since they were not done by an expert contain errors and omissions. This is completely ok -- when we started the Project to Annotate 1000 Genomes, it was understood that the bulk of the initial effort would be done by enthusiastic amateurs like me. The results are proving extremely valuable, and the simple fact is that this initial stage accomplished two key goals:

  1. It demonstrated the value, caused the software to mature quickly, and pushed the project forward to the point where experts could recognize the value and begin to contribute.
  2. It helped turn many of the amateurs into far more accomplished researchers in the areas in which they expended the effort.

Finally, there are many poorly done subsystems and false starts. The way the SEED effort has been set up, you are basically responsible for determining what you believe to be acceptable, and to import the subsystems that meet your standards.

In this tutorial, the focus will not be on how to construct a subsystem spreadsheet, but rather how to take an existing one and

I will use as an example De Novo Purine Biosynthesis, a subsystem that I did myself. I suggest that you access a copy via either your own server or through the University of Chicago server. Begin by getting the spreadsheet and then getting a copy sorted by variant. This may not be completely trivial, so I urge you to do this before continuing and to look at the results. If you experience trouble, get help (if necessary send email to the SEED users via

Once you have the spreadsheet in front of you, start looking at the different variants from the beginning. The first thing that pops out is that the column RPAL is completely empty. What is going on here? If you go to KEGG and look at the pathway diagram for purine metabolism and then look at the enzyme entry for EC, you will find that it

  1. has no known sequences, but
  2. resulted from a paper published in 1968 describing an "alternative first step in purine biosynthesis".
This is a situation in which either the paper was wrong, or there might be a gem hidden here. I did not take the time to get the paper and read it. If you do, and you can find the sequence that was used in the analysis in 1968, then you will be able to clean up a number of annotations. This is the sort of thing that an expert would know from his experience, but it is something you can learn from a little effort.

Now, just start from the beginning and notice when organisms seem to have problems. For example:

There are many, many such comments that could be made about the contents of this spreadsheet. Most simply reflect bad gene calls, incomplete genomes (the SEED tends to mark anything with 90% or more of the sequence as "complete", but you need to find out more about the status of the genome if you intend to rule out potentially unsequenced portions), and so forth. All of these discrepancies need to be explained. Some of them represent interesting research problems, and some just represents parts of the analysis that need to be cleaned up.

Beginning the Second Stage of Subsystem Analysis

Once you have examined the basic spreadsheet just looking for anomolies, it is probably a good idea to begin a more systematic examination. I suggest that you begin by checking the request that will cause the spreadsheet to be sorted by_variant and then redisplaying the spreadsheet. This will cause the genomes to be sorted "by variant". Here is how this works:
  1. First, each genome is inspected, and a vector of os and 1s is built. The first entry in the vector will be a 1, if genes have been found for this role; else, it will be 0. The0 second value in the vector will reflect the presence or absence of genes encoding the second role, and so forth. In my version of the spreadsheet, there are 16 roles. The first genome is missing genes for PrsA, RPAL, and PurN (the 1st, 3rd, and 5th roles). Thus, the vector for this first genome would be (0,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1).
  2. Then, the rows are sorted using the computed vectors as the keys. This will bring together rows that are missing the same entries.
This is not a completely perfect way to locate a role that is missing in many genomes, but it is quite useful.

Now, as we begin to peruse the results, note that we have a number of genomes that are all missing PrsA. Is that reasonable? To understand the answer, we need to get a picture or diagram of the pathway to see whether or not this step is required. You should get the KEGG map covering Purine Metabolism in a second window. The KEGG maps are a critical resource, but you could also use Gerhard Michal's great book on biochemical pathways.

What can you glean? Is PrsA required? What is it used for? Is there an alternative source of the intermediate it produces?

You will need to go through this reasoning process of trying to understand whether or not roles must be there (and have not yet been pinned down to specific genes) or whether they simply represent alternatives for numerous issues revealed by the spreadsheet.

As an exercise, I suggest that you try to articulate the major issues to be resolved for this spreadsheet. You can gain a reasonable overview in just a few minutes, although actually developing the detailed analysis might well take many hours.

The Issue of Duplicate Assignments

One tends to focus on the empty cells as a source of interesting challenges, but the issue of too many genes being assigned to a single functional role is also important. In many cases, multiple entries simply reflect a genome that is still in many contigs, and pieces of the same gene have been assigned distinct IDs (since they occur on different contigs). Framshifts can also produce this apparent redundancy. And, then, it is true that in some genomes, there actually are many genes playing the same role. Figuring out the truth is often tedious, unless you have a particular interest in the given gene or organism.

What Is the Real Objective?

The subsystem spreadsheet represents a framework where numerous problems can be exposed with relatively little effort. Working through all of the detailed issues is essential to developing a comprehensive grasp of the subsystem and how it is implemented in the organisms that contain it. In many respects, this represents the underlying data needed to support a detailed review of the subsystem. To cover a subsystem in this level of detail represents a major investment of time and effort, but it will in most cases lay the foundation for a comprehensive research plan to clarify issues relating to the subsystem.

By constructing spreadsheets for all of the processes in core metabolism, we are simultaneously developing detailed, informal metabolic reconstructions for hundreds of organisms simultaneously. This is a point worth pondering, since metabolic reconstructions represent a key foundation for numerous technologies that will emerge to exploit genomic data.

Beyond this basic need to fill in the missing pieces and clarify the outstanding issues, is there more to be done? Is there a stage in our development of encoded subsystems that goes beyond constructing detailed metabolic reconstructions for hundreds (and eventually thousands) or organisms?

The fact that I posed the question signals that I believe that the answer is "yes". This final stage of analysis was best illustrated in a review on evolution of the tryptophan operan written by Xie, Bonner and Jensen. In this review, the authors attempted to provide an accurate picture of how the seven conserved catalytic domains required for tryptophan biosynthesis evolved. That is, for each functional role, an attempt was made to determine the evolutionary history of the genes that implement the role. This would include determination of when clusters formed and were broken, when fusions occurred, when duplications occurred, when horizontal transfers occurred, and so forth. Years were spent in the analysis of these seven functional roles, but it is fair to say that the results were incredibly encouraging. While questions remain, the authors produced a stunningly detailed picture of the history of this pathway. The impact of this study will go well beyond our understanding of the tryptophan operon; it presented a template of exactly what was needed to advance and deepen our understanding of subsystems in general.

In this tutorial (which is still far from complete), we will attempt to offer some guidance and tools to support this advanced analysis. Over the next two years, we hope to develop a more-or-less comprehensive set of tools to support this style of analysis.

Constructing the Phylogentic Framework

The first step, assuming an initial subsystem spreadsheet already exists, will be to develop multiple sequence alignments for each of the functional roles that make up the system. To get a fairly good alignment, you can just use clustalW, a fine program that has justifiably gained a good reputation. You can request that alignments be generated directly from your web-browser, but the results may not be available immediately. In large subsystems it may take a few minutes (or, very occasionally, hours) to complete the alignment; hence the subsystem analysis application schedules the task, and then you must periodically check to see when the results have completed. If you make changes to a spreadsheet and then recompute alignments, it will only realign columns in which entries have been inserted or deleted.

Computing an alignment produces an initial, very approximate phylogenetic tree. You may inspect these alignments directly from the SEED. The first questions that needs to be asked are as follows:

  1. To what extent are these alignments consistent with one another?
  2. To what extent are they consistent with the estimate of phylogenetic history provided by alignments of rRNA?
The experiences of the team working on the tryptophan strongly support the view that there is a recognizable phylogenetic history that is essentially in agreement with estimates produced from analysis of rRNA (see Inter-genomic displacement via lateral gene transfer of bacterial trp operons in an overall context of vertical genealogy). Whether or not this turns out to be the case for most subsystems, you should begin by examining the initial set of trees. What one would expect, for subsystems that are part of the central cellular machinery, is overall agreement between the individual trees obscured by occasional horizontal transfers.

Here we need to develop a detailed study of the trees from de novo purine biosynthesis
Once you have determined a set of functional roles and organisms that, you believe, are consistent (and, hence, either were not subject to horizontal transfer or were transferred as a block of genes), you can build a more reliable estimate of the evolutionary history. The straightforward way to do this is to build one large alignment by concatenating the set of consistent "role alignments" and then building a tree. There are arguments about whether this reveals or obscures the truth, but if the individual alignments actually reflect a common history it should lead to a more accurate estimate of the tree. This should be done, and the result should be compared with the rRNA tree.
Here we do this for the de novo purine biosynthesis subsystem

Using the Phylogenetic Framework

Once you have constructed an estimate of the phylogenetic history of the majority of functional roles that make up the subsystem, you can utilize it to study the history of clusters on the chromosome, fusion events and horizontal transfers. We have developed tools within the SEED to support the analysis required, but in each case careful human judgement must be applied; the tools simply aid in developing and applying hypotheses.

Developing a History of the Chromosomal Clusters

Once you have an estimate of the underlying phylogenetic tree, you can use it to estimate when clusters were formed and broken. The tool we offer works like this: The labelled tree produced by these steps can be used to gain an overview of when clusters were formed and broken. In many cases it produces a powerful confirmation of the overall accuracy of the phylogenetic tree. On the other hand, if inconsistencies are revealed (i.e., a parsimonious interpretation of the clusters involves many duplicate events), this too is worth thinking about.