Data Mining Tools in the SEED

A Sketchy Set of Notes Maintained by Ross Overbeek

This tutorial discusses a number of issues that you will need to know about in order to use the data mining tools included in SEED. It is organized as follows:

  1. Introduction: pegs, fid2dna, translate, and translation_of
  2. A More Complex Example: Indirect Functional Coupling
  3. Extracting Upstream Regions
  4. Locating New Forms of Enzymes
The first time through, I recommend reading it in order.

Introduction: pegs, fid2dna, translate, and translation_of

The SEED is a loosely structured, evolving system. It is by intent an "executable prototype" in which new ideas are installed by numerous participants. The set of tools that I describe here are descendants of an experiment that we did with the WIT system. The basic idea is to build a command-line capability to very easily extract data from an integration like the SEED. To extract desired data, you invoke small programs that are strung together to produce the desired output. It is clearly a notion that derived from the original ideas developed in the UNIX culture. Although we will not elaborate on the point here, most of the precise design grew out of work done on logic programming systems in the late 1980's and early 1990's.

To see how things work, let's consider a simple example:

	pegs 83333.1 | fid2dna | translate
is a sequence of three commands that are strung together to get translations of the genes in Escherichia coli. The way it works is that This little example actually translates the DNA. The SEED includes translations for each PEG. You could get them using
	pegs 83333.1 | translation_of 
As an exercise, we suggest that you run
	pegs 83333.1 | translation_of > tmp.SEED.translations
	pegs 83333.1 | fid2dna | translate > tmp.translated.SEED.DNA
and see how the two files compare. Would you expect them to be the same? Why might they differ?

A More Complex Example: Indirect Functional Coupling

Functionally related genes tend to cluster on prokaryotic chromosomes. This fact is the basis for a number of techniques that have been used to gain clues relating to the function of "hypothetical proteins". As the number of sequenced genomes grows, the power of techniques based on statistical analysis of chromosomal clusters grows dramatically (I have argued elsewhere that the information grows approximately as the square of the number of genomes, but that is a story best left for another document).

Inevitably, the question arises: "What can one do with a gene when it does not appear to be in a chromosomal cluster?" The technique that I will describe now has often been referred to as indirect coupling. Here is how it works:

  1. You start with a gene in some genome (prokaryotic or eukaryotic). Call this gene X.
  2. Compute all bi-directional best hits (BBHs) of X. These amount to an attempt to find the corresponding genes in other genomes. You need to remember that such projections are not completely reliable, but they are at the heart of many of the inferences we make relating to function. Call this set of BBHs S1.
  3. Now for each element of S1, we can check to see if it appears to be in a chromosomal cluster, and if it is we can compute the genes that appear to be directly functional coupled. Call this set of genes that are fucntionally coupled to BBHs of X S2.
  4. Now we want to locate the genes in the original genome that correspond to genes in . Again, we estimate this correspondence using BBHs. That is, we take BBHs of genes in S2 that are in the same genome as X. Call this set of genes S3. These are the genes that are indirectly coupled to X.
  5. Finally, look at the functions of the indirectly coupled genes. This set of functions can often be hugely informative.
To illustrate this process and how it can be implemented using the tools we provide with the SEED, let us go through the computation one step at a time. The first step is to pick an initial gene. As an example, let me choose a gene from Methanocaldococcus jannaschii, one of the first genomes I studied. In the version of the SEED I am using at this time, the peg is fig|2190.1.peg.1537. It is also known by a number of other ids: NP_248444.1, gi|15669631, sp|Q58835, and kegg|mja:MJ1440, for example. You should make a file containing just one line, and that line should contain one of these ids. Call this file initial.peg. Then
	bbhs < initial.peg >
should construct a file containing the BBHs of the single gene in the file. If initial.peg had contained more genes, BBHs would have been computed for all of the ids in the file.

You should make an initial file and run bbhs to make sure that you understand this simple tool. If you wished to restrict yourself to BBHs that had better similarity scores than, say, 1.0e-50, you would use

	bbhs 1.0e-50 < initial.peg >
The tool bbhs computes BBHs for any id that occurs as a last field in the tab-separated fields in lines from the input file. If there are numerous PEGs in the input file, execution of bbhs can go for hours.

The next step involved computing genes that were functionally coupled to those in This can be done using

    functionally_coupled 5000 1.0e-10 3 fast < >
The tool functionally_coupled computes genes that are functionally coupled (based on preserved contiguity) to those that occur as last entries in the tab-separated fields of lines from the input file. The parameters to the tool are as follows:
  1. the maximum distance separating a gene from another that is putatively clustered with it,
  2. the maximum p-score for similarities used to find corresponding genes between two clusters,
  3. the minimum functional coupling score (which is one less than the number of clusters in "significantly different" genomes), and
  4. the word fast which causes the tool to attempt a fast technique that sometimes misses couplings (this argument is optional). Fast is meant only relatively here: execution can go for hours, even with the option set.
The tool writes out 3-tuples containing the two functionally-coupled genes separated by their coupling score.

Again, I urge you to actually go through this step and examine the output file.

Once you have created the file, it is time to search for BBHs from the coupled genes back in the original genome. This can be done using

   bbhs < | restrict_to_genome initial.genome >
The bbhs tool is the same one we discussed above. The simple tool restrict_to_genome takes a single argument which is a file containing a list of genome ids. In this case the file contains just a single line giving the id of the genome containing the initial gene (which was, if you recall, 2190.1).

Finally, you should append the functions of the detected genes and remove duplicates using

function_of < | sort -u >
The function_of tool takes the last field in each input line, looks up the function (assuming the field was a PEG id), and writes out the PEG id and function. If you do all of this, you get a file containing
fig|2190.1.peg.1170	Methanocaldococcus jannaschii	Shikimate 5-dehydrogenase (EC
fig|2190.1.peg.1551	Methanocaldococcus jannaschii	3-dehydroquinate dehydratase (EC
fig|2190.1.peg.571	Methanocaldococcus jannaschii	3-phosphoshikimate 1-carboxyvinyltransferase (EC
For those of you who are familiar with aromatic amino acid biosynthesis, these three functions do, in fact, perfectly bracket the initial gene (fig|2190.1.peg.1537, which is the archael shikimate kinase) in the pathway for chorismate biosynthesis. A number of us partiucipated in predicting the function of this gene a number of years ago. With the tools and data we have now, the predictions of function for such genes is vastly simplified. You could have produced the three lines of output by stringing all of the commands together using
	% bbhs < initial.peg | 
	> functionally_coupled 5000 1.0e-10 3 fast | 
	> bbhs | 
	> restrict_to_genomes initial.genome | 
	> function_of | 
	> sort -u >
So, there you have a pipeline of commands to compute indirect functional coupling. You can adjust the similarities required for the BBHs, and you can alter the minimum coupling score required, but that is essentially the computation.

You might wish to play with the different parameters. In particular, the "fast" parameter on functionally_coupled is optional, and does alter the behaviour. How well it performs depends critically on how many pins and chromosomal clusters have accurately been precomputed. If you do not use the "fast" parameter, it can slow things down by an order of magnitude. We do try to keep pins and clusters in good shape, but ....

Finally, you could use similar_to rather than bbhs. However, unless you proceed incrementally, checking all intermediate results, this can produce a lot of false positives (i.e., a lot of garbage).

Extracting Upstream Regions

Suppose that you wished to search upstream regions of genes for control sites. To do this, you need to get sets of genes that might share a common control site, extract the upstream regions, and then search these for a common regulatory site. In this section we will discuss how to extract the desired upstream regions.

Normally, one would select a set of potentially co-regulated genes. This might be done using something like

% pegs_in_subsystem "Histidine Biosynthesis" | grep 'fig|211586.1.peg' > related
% cat related
ATP phosphoribosyltransferase (EC	fig|211586.1.peg.1886
Phosphoribosyl-ATP pyrophosphatase (EC	fig|211586.1.peg.1879
Phosphoribosyl-AMP cyclohydrolase (EC	fig|211586.1.peg.1879
Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC	fig|211586.1.peg.1881
Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC	fig|211586.1.peg.3771
Imidazole glycerol phosphate synthase subunit hisH (EC 2.4.2.-)	fig|211586.1.peg.1882
Imidazole glycerol phosphate synthase subunit hisF (EC 4.1.3.-)	fig|211586.1.peg.1880
Imidazoleglycerol-phosphate dehydratase (EC	fig|211586.1.peg.1883
Histidinol-phosphate aminotransferase (EC	fig|211586.1.peg.1884
Histidinol-phosphatase (EC	fig|211586.1.peg.1883
Histidinol dehydrogenase (EC	fig|211586.1.peg.1885

The command pegs_in_subsystem can be used to extract all of the genes from the spreadsheet encoding the designated subsystem (in this case, Histidine Biosynthesis). The command produces a 2-column (tab-separated) file in which the first column is a functional role and the second is a PEG assigned that function. By using grep to pull out just those genes for a given genome (in this case 211586.1, which is Shewanella oneidensis MR-1), we get a potentially co-regulated set of genes. This may or may not be exactly what you wish -- you must still be careful of situations in which the genes occur in a single operon, but that is a detail for later.

Once you have a set of potentially co-regulated genes, you might use something like

	upstream upstream=200 plus=100 < related > upstream.fasta

which produces a file like
Here, the parameters to upstream have the following meanings: The output contains one fasta entry for each gene. A "-" separates the upstream sequence from the start of the gene's coding sequence. Lowercase is used for the upstream sequence, until it overlaps another gene (and from there on you get uppercase).

The key to actually extracting accurate upstream operators in prokaryotes has turned out to be the use of comparative analysis with corresponding genes from closely related organisms. This requires determining a set of "corresponding genes from closely related organisms" for each gene in the potentially co-regulated set. One way to do this might be to use

pegs_in_subsystem "Histidine Biosynthesis" | grep "ATP phosphoribosyltransferase (EC"

which produces a file something like
ATP phosphoribosyltransferase (EC	fig|83333.1.peg.1994
ATP phosphoribosyltransferase (EC	fig|272626.1.peg.573
ATP phosphoribosyltransferase (EC	fig|224308.1.peg.3498
ATP phosphoribosyltransferase (EC	fig|169963.1.peg.563

Note that all we have done here is to extract a column (rather than a row) from the subsystem spreadsheet.

You might, of course, wish to consider corresponding genes as defined by sequence similarity, rather than using a subsystem spreadsheet. In this case, you need to get all of the genes from other organisms that pass some similarity threshhold. To do this, you might use something like

	echo 'fig|211586.1.peg.1886' | similar_to >

This gives you the genes similar to fig|211586.1.peg.1886. If you wished to get the upstream regions for the original gene and its related genes, then
(echo 'fig|211586.1.peg.1886';  echo 'fig|211586.1.peg.1886' | similar_to) | upstream
would work (here we just use default parameters on similar_to and upstream). To do this for an entire Shewanella genome I used
% for i in `pegs 211586.1`
> do
>     (echo "$i"; echo "$i" | similar_to | is_prokaryotic) | upstream upstream=500 plus=100 > "Output/$i"
>     echo $i
> done
while running in bash. The is_prokaryotic filter just removes non-prokaryotic genes from the input to upstream. The is_eukaryotic, is_bacterial, and is_archaeal filters now work, as well.

Locating New Forms of Enzymes

Researchers have discovered that many enzymes have multiple, nonhomologous forms. The best way to recognize such situations is by constructing a subsystem spreadsheet. Once you have constructed a spreadsheet for a subsystem, you often will find a set of organisms in which the majority of functional roles can be matched up with PEGs, but one or more columns remain empty. These often represent cases in which an alternative form of the enzyme exists, and one of the more interesting tasks in annotation is to search for the gene that has not yet been identified as playing the missing role. In this short section, we will illustrate one one to seek the missing gene using the SEED data mining tools (you could, of course, also seek it using the Web-based tools).

I will illustrate the technique using a search for the archaeal form of the shikimate kinase. This example induces pleasant memories for me, since this was one of the first bioinformatics predictions that I was involved in that was then confirmed in the wet lab. At the time a number of us worked out the prediction, our approachs were far from systematic. Now it is time to formulate the approaches that work and apply them routinely.

In this example, we had a known version of the shikimate kinase, an enzyme used in aromatic amino-acid biosynthesis. The known version occurs throughout the bacteria and eukaryotes. However, while it was absolutely clear that many archaea synthesize aromatic amino acids with essentially the same pathway used by bacteria, locating the archaeal form of the shikimate kinase had not yet been done. So, how might one go about locating the archaeal form, given only a number of occurrences of the known bacterial form?
The basic approach that we will illustrate here uses the following steps: