This tutorial discusses a number of issues that you will need to know about in order to use the data mining tools included in SEED. It is organized as follows:
To see how things work, let's consider a simple example:
pegs 83333.1 | fid2dna | translateis a sequence of three commands that are strung together to get translations of the genes in Escherichia coli. The way it works is that
pegs 83333.1 | head -5which will show the first 5 lines written by pegs.
pegs 83333.1 | fid2dna | head -50to see what comes out.
pegs 83333.1 | translation_ofAs an exercise, we suggest that you run
pegs 83333.1 | translation_of > tmp.SEED.translations pegs 83333.1 | fid2dna | translate > tmp.translated.SEED.DNAand see how the two files compare. Would you expect them to be the same? Why might they differ?
Inevitably, the question arises: "What can one do with a gene when it does not appear to be in a chromosomal cluster?" The technique that I will describe now has often been referred to as indirect coupling. Here is how it works:
bbhs < initial.peg > corresponding.genes.in.other.genomesshould construct a file containing the BBHs of the single gene in the file. If initial.peg had contained more genes, BBHs would have been computed for all of the ids in the file.
You should make an initial file and run bbhs to make sure that you understand this simple tool. If you wished to restrict yourself to BBHs that had better similarity scores than, say, 1.0e-50, you would use
bbhs 1.0e-50 < initial.peg > corresponding.genes.in.other.genomesThe tool bbhs computes BBHs for any id that occurs as a last field in the tab-separated fields in lines from the input file. If there are numerous PEGs in the input file, execution of bbhs can go for hours.
The next step involved computing genes that were functionally coupled to those in corresponding.genes.in.other.genomes. This can be done using
functionally_coupled 5000 1.0e-10 3 fast < corresponding.genes.in.other.genomes > coupled.in.other.genomesThe tool functionally_coupled computes genes that are functionally coupled (based on preserved contiguity) to those that occur as last entries in the tab-separated fields of lines from the input file. The parameters to the tool are as follows:
Again, I urge you to actually go through this step and examine the output file.
Once you have created the file coupled.in.other.genomes, it is time to search for BBHs from the coupled genes back in the original genome. This can be done using
bbhs < coupled.in.other.genomes | restrict_to_genome initial.genome > restricted.to.initial.genomeThe bbhs tool is the same one we discussed above. The simple tool restrict_to_genome takes a single argument which is a file containing a list of genome ids. In this case the file contains just a single line giving the id of the genome containing the initial gene (which was, if you recall, 2190.1).
Finally, you should append the functions of the detected genes and remove duplicates using
function_of < restricted.to.initial.genome | sort -u > indirectly.coupled.in.original.genomeThe function_of tool takes the last field in each input line, looks up the function (assuming the field was a PEG id), and writes out the PEG id and function. If you do all of this, you get a file containing
fig|2190.1.peg.1170 Methanocaldococcus jannaschii Shikimate 5-dehydrogenase (EC 1.1.1.25) fig|2190.1.peg.1551 Methanocaldococcus jannaschii 3-dehydroquinate dehydratase (EC 4.2.1.10) fig|2190.1.peg.571 Methanocaldococcus jannaschii 3-phosphoshikimate 1-carboxyvinyltransferase (EC 2.5.1.19)For those of you who are familiar with aromatic amino acid biosynthesis, these three functions do, in fact, perfectly bracket the initial gene (fig|2190.1.peg.1537, which is the archael shikimate kinase) in the pathway for chorismate biosynthesis. A number of us partiucipated in predicting the function of this gene a number of years ago. With the tools and data we have now, the predictions of function for such genes is vastly simplified. You could have produced the three lines of output by stringing all of the commands together using
% bbhs < initial.peg | > functionally_coupled 5000 1.0e-10 3 fast | > bbhs | > restrict_to_genomes initial.genome | > function_of | > sort -u > indirectly.coupled.in.original.genomeSo, there you have a pipeline of commands to compute indirect functional coupling. You can adjust the similarities required for the BBHs, and you can alter the minimum coupling score required, but that is essentially the computation.
You might wish to play with the different parameters. In particular, the "fast" parameter on functionally_coupled is optional, and does alter the behaviour. How well it performs depends critically on how many pins and chromosomal clusters have accurately been precomputed. If you do not use the "fast" parameter, it can slow things down by an order of magnitude. We do try to keep pins and clusters in good shape, but ....
Finally, you could use similar_to rather than bbhs. However, unless you proceed incrementally, checking all intermediate results, this can produce a lot of false positives (i.e., a lot of garbage).
Normally, one would select a set of potentially co-regulated genes. This might be done using something like
% pegs_in_subsystem "Histidine Biosynthesis" | grep 'fig|211586.1.peg' > related % cat related ATP phosphoribosyltransferase (EC 2.4.2.17) fig|211586.1.peg.1886 Phosphoribosyl-ATP pyrophosphatase (EC 3.6.1.31) fig|211586.1.peg.1879 Phosphoribosyl-AMP cyclohydrolase (EC 3.5.4.19) fig|211586.1.peg.1879 Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16) fig|211586.1.peg.1881 Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16) fig|211586.1.peg.3771 Imidazole glycerol phosphate synthase subunit hisH (EC 2.4.2.-) fig|211586.1.peg.1882 Imidazole glycerol phosphate synthase subunit hisF (EC 4.1.3.-) fig|211586.1.peg.1880 Imidazoleglycerol-phosphate dehydratase (EC 4.2.1.19) fig|211586.1.peg.1883 Histidinol-phosphate aminotransferase (EC 2.6.1.9) fig|211586.1.peg.1884 Histidinol-phosphatase (EC 3.1.3.15) fig|211586.1.peg.1883 Histidinol dehydrogenase (EC 1.1.1.23) fig|211586.1.peg.1885 %
Once you have a set of potentially co-regulated genes, you might use something like
upstream upstream=200 plus=100 < related > upstream.fasta
>fig|211586.1.peg.1886 acccaccatcatcaccaccatatgcggtagccttcggacgttcactgccggaggactata tccttctggcggactgattcaaagatgatcataaacccctggaaggaattccgggggttt ttgcttttaggcttttaataaacgatttaacagctaattagtaattaatttcaatatgtt aaccgccacatttactggcg-GTGCCGATAAGCCAGTTAAAGCGTGAAACGAAAGAATAC CATTTTGAATTTGTTAACTTTGAGCGAGATAGAATTATGACCGAATCAAACCGTTTACGT A >fig|211586.1.peg.1879 GTTTATCGATGCGAAGGTGGATGCAGCACTGGCGGCCAGCGTGTTCCATAAAGCCATTAT CAATATCGGCGAGTTAAAACAGTATTTAGCCGCCGAAGGTATTGCTATTAGACGCtagag cggttctaggttctaggttctaggttctaggttctaggttctaggttctaggttctacaa atagattaagagtgaatact-ATGTCAACGCAAACAAACACTAAGTCAGATTTTAGCGAG CTGGATTGGGACAAACAGGATGGGCTTATCCCTGCTGTTATCCAAAACCACTTATCAGGC A >fig|211586.1.peg.1881 ATCGGTAAAGACAACTTTATGGGCGTGCAATTCCACCCTGAAAAAAGTGCCGCCGTGGGC GCGCAAATCCTAGGTAACTTTTTAAAAATGCAAtaaatgcaatcgcttattaggtttggc tgagccaccactatttttcagccacaccatacgacaataaagcgcggcagcaaatgcacc aaaagaattaaggacaacag-ATGATCATTCCAGCGATTGATTTAATTGATGGCAAGGTA GTCAGGCTCTACCAAGGGGATTATGGGCAGCAGACCACCTTCGATTTAAGCCCCCTCGCC C . . .Here, the parameters to upstream have the following meanings:
The key to actually extracting accurate upstream operators in prokaryotes has turned out to be the use of comparative analysis with corresponding genes from closely related organisms. This requires determining a set of "corresponding genes from closely related organisms" for each gene in the potentially co-regulated set. One way to do this might be to use
pegs_in_subsystem "Histidine Biosynthesis" | grep "ATP phosphoribosyltransferase (EC 2.4.2.17)"
ATP phosphoribosyltransferase (EC 2.4.2.17) fig|83333.1.peg.1994 ATP phosphoribosyltransferase (EC 2.4.2.17) fig|272626.1.peg.573 ATP phosphoribosyltransferase (EC 2.4.2.17) fig|224308.1.peg.3498 ATP phosphoribosyltransferase (EC 2.4.2.17) fig|169963.1.peg.563 . . .
You might, of course, wish to consider corresponding genes as defined by sequence similarity,
rather than using a subsystem spreadsheet. In this case, you need to get all of the genes
from other organisms that pass some similarity threshhold. To do this, you might use something like
echo 'fig|211586.1.peg.1886' | similar_to > genes.similar.to.1886
(echo 'fig|211586.1.peg.1886'; echo 'fig|211586.1.peg.1886' | similar_to) | upstreamwould work (here we just use default parameters on similar_to and upstream). To do this for an entire Shewanella genome I used
% for i in `pegs 211586.1` > do > (echo "$i"; echo "$i" | similar_to | is_prokaryotic) | upstream upstream=500 plus=100 > "Output/$i" > echo $i > donewhile running in bash. The is_prokaryotic filter just removes non-prokaryotic genes from the input to upstream. The
I will illustrate the technique using a search for the archaeal form of the shikimate kinase. This example induces pleasant memories for me, since this was one of the first bioinformatics predictions that I was involved in that was then confirmed in the wet lab. At the time a number of us worked out the prediction, our approachs were far from systematic. Now it is time to formulate the approaches that work and apply them routinely.
In this example, we had a known
version of the shikimate kinase, an enzyme used in aromatic
amino-acid biosynthesis. The known version occurs throughout the
bacteria and eukaryotes. However, while it was absolutely clear that
many archaea synthesize aromatic amino acids with essentially the same
pathway used by bacteria, locating the archaeal form of the shikimate
kinase had not yet been done. So, how might one go about locating the
archaeal form, given only a number of occurrences of the known
bacterial form?
The basic approach that we will illustrate here uses the following
steps:
bbhs iterate < initial > bbhs cat initial bbhs > SiThe command bbhs was used in our last example. In this case, we have added an extra argument; by saying iterate we have requested that the command compute BBHs of the PEGs in the file initial, but then iterate by adding BBHs of any of the newly added PEGs (and continue until no more new BBHs can be added).
functionally_coupled 5000 1.0e-10 5 fast < Si > fc.Si bbhs length=0.8 iterate < fc.Si > known.machineryThe genes in fc.Si should all occur in genomes that contain the known version of the enzyme. This set represents "genes coupled to the known version" and represents an estimate of the machinery that is functionally related to the role played by the enzyme. If we then project this machinery to all of the genomes (including those that do not contain the known version) using bbhs we have the file known.machinery.
genomes_of < Si > genomes.of.Si restrict_to_genomes -v genomes.of.Si < known.machinery > known.in.genomes.without.orig.formThe tool genome_of takes as input a file containing lines that end with PEG ids. It outputs lines that contain the genome ids of all genomes containing those PEGs (i.e., those that occur in the last column of the input file). The tool restrict_to_genomes takes as input a file containing PEG ids in the last column. It restricts this input set based on the genome ids of these PEGs, writing out those lines that pass a test. The test is given by the command line arguments. If the command line contains a single argument, it should be a file containing genome ids. In the case, the test amounts to "keep PEGs from genomes corresponding to those in the given file". If the command line arguments are of the form -v File, then the test amounts to "keep only PEG ids from genomes other than those in the given file".
functionally_coupled 5000 1.0e-10 5 fast < known.in.genomes.without.orig.form > potential.candidatesto get genes that are functionally coupled to PEGs in known.in.genomes.without.orig.form. The trouble with these "potential candidates" is that the file will contain instances of all components of the machinery of the subsystem; what we really want is instances that bear no resemblance to either the original form or to any other gene that was functionally related to the original form (or to any BBH of those!). We achive this restriction using
cat Si fc.Si known.machinery > things.to.ignore restrict_by_similarity -v things.to.ignore 1.0e-3 < potential.candidates > candidatesThe restrict_by_similarity command can be used to restrict a set of PEGs to those that are either similar to, or definitely not similar to, a set of given PEGs (in this case things.to.ignore). In the invocation shown, we take those not similar to genes in potential.candidates, and we specify a similarity threshhold of 1.0e-3.
cluster_by_bbhs < candidates > data.to.look.atThis concludes my initial discussion of what I call the "sniffer algorithm" (with the image of a bloodhound in mind). It is, perhaps, worth noting that we base this exploration almost entirely on projection (BBHs) and functional coupling on the chromosome. If one wanted to include use of other forms of evidence of functional coupling, it can be done easily. For example, to add fusion data, one would simply change
functionally_coupled 5000 1.0e-10 5 fast < known.in.genomes.without.orig.form > potential.candidatesto
functionally_coupled 5000 1.0e-10 5 fast < known.in.genomes.without.orig.form > potential.candidates fusion_of < known.in.genomes.without.orig.form >> potential.candidates