By the term compound we refer to the normal notion of chemical compound.
A cellular machine is a set of proteins that together perform a function. Unless otherwise noted, when we use the term machine we will always be speaking of a cellular machine. Many machines transform one set of compounds into another set. Some machines (transport machines) are used to move compounds into or out of the cell. Later we will try to convey a more comprehensive notion of what functions are implemented by machines that we understand.
A protein is a string of amino acids (i.e., a string in the 20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
A genome is a string of DNA bases (i.e., a string in the 4-character alphabet {A,C,G,T}).
A gene is a region in the genome that describes how to build a protein. The description is a sequence of 3-character codons. Each codon may be thought of as an instruction specifying which amino acid should come next in the protein the gene describes. Thus, if the protein described by the gene contains 100 amino acids, then the gene would be composed of 100 codons (i.e., 300 DNA characters) followed by a codon that means "stop here" (a stop codon).
There are three stop codons: {TAA,TAG,TGA}. The genetic code is the table of correspondences between codons and amino acids:
Amino Acid | Codons |
---|---|
A | GCT, GCC, GCA, GCG |
C | TGT, TGC |
D | GAT, GAC |
E | GAA, GAG |
F | TTT, TTC |
G | GGT, GGC, GGA, GGG |
H | CAT, CAC |
I | ATT, ATC, ATA |
K | AAA, AAG |
L | TTA, TTG, CTT, CTC, CTA, CTG |
M | ATG |
N | AAT, AAC |
P | CCT, CCC, CCA, CCG |
Q | CAA, CAG |
R | CGT, CGC, CGA, CGG, AGA, AGG |
S | TCT, TCC, TCA, TCG, AGT, AGC |
T | ACT, ACC, ACA, ACG |
V | GTT, GTC, GTA, GTG |
W | TGG |
Y | TAT, TAC |
* | TAG, TGA, TAA [Stop codons] |
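As a minimal sketch of how the table can be used, the following Python snippet turns it into a lookup and translates a DNA string one codon at a time. The names `GENETIC_CODE`, `CODON_TO_AA`, and `translate` are ours, chosen only for illustration; stop codons are rendered as `*`, as in the table.

```python
# Genetic code copied from the table above: amino acid -> codons ("*" marks stop codons).
GENETIC_CODE = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table so that each codon maps to its amino acid.
CODON_TO_AA = {
    codon: aa for aa, codons in GENETIC_CODE.items() for codon in codons.split()
}


def translate(dna):
    """Translate DNA in frame, three characters at a time.

    A trailing partial codon (one or two leftover characters) is ignored.
    """
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))


print(len(CODON_TO_AA))           # 64
print(translate("ATGAAACTCTAC"))  # MKLY
```

The sketch assumes upper-case input containing only A, C, G, and T; any other character raises a `KeyError`.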
Very quickly, the second alternative became more appropriate; it was based on the idea of effectively exploiting the efforts that had been expended in the early genomes to more quickly and accurately identify the genes in each new genome.
It is worth noting that the second approach, while exploiting the investments made in annotating the early genomes, also has the property that early errors are frequently propagated. If an algorithm had called a section of an early genome a gene when it actually was not, then when we see something similar in a new genome it might well get improperly labeled as well.
The third approach offers an unusual perspective and opportunity. It suggests that we are entering an era in which we have many available genomes, and that there might be approaches based on comparison that would support more accurate annotations for the entire collection. There may be many such approaches, but we will describe just one that is based on ideas used in creating one of the early gene-calling systems. Let us start by quoting the abstract from "CRITICA: coding region identification tool invoking comparative analysis" by Jonathan Badger and Gary Olsen (Mol Biol Evol. 1999 Apr;16(4):512-24. PMID: 10331277):
"Gene recognition is essential to understanding existing and future DNA sequence data. CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) is a suite of programs for identifying likely protein coding sequences in DNA by combining comparative analysis of DNA sequences with more common noncomparative methods. In the comparative component of the analysis, regions of DNA are aligned with related sequences from the DNA databases; if the translation of the aligned sequences has greater amino acid identity than expected for the observed percentage nucleotide identity, this is interpreted as evidence for coding. CRITICA also incorporates noncomparative information derived from the relative frequencies of hexanucleotides in coding-frames versus other contexts (i.e., dicodon bias). The dicodon usage information is derived by iterative analysis of the data so that CRITICA is not dependent upon the existence or accuracy of coding sequence annotations in the databases. This independence makes the method particularly well-suited for the analysis of novel genomes. CRITICA was tested by analyzing the available Salmonella typhimurium DNA sequences. Its predictions were compared to the DNA sequence annotations and to the predictions of GenMark. CRITICA proved more accurate than GenMark, and, moreover, many of its predictions that would seem to be errors, instead reflect problems in the sequence databases."
To understand the basic idea, we need to discuss how genomes are passed on to descendants. We discuss the notion of replication below, but for now let us just say that cells occasionally copy their genome and divide into two cells, leaving a version of the genome in each cell. The set of machines in the original cell also gets divided. How the cell makes sure that each of the new cells gets enough machines to make up an operational life-form is a separate topic. For now, let us just say that they do achieve it.
The new cell containing a copy of the genome that existed in the original cell may very occasionally contain a copied genome that differs from the original version due to errors in copying. These differences are called mutations. If a mutation occurred in a gene (encoding a protein), and if the mutation caused the encoding to be changed to produce a protein sequence that would not work, then the mutation is lethal and the cell dies (whatever that means -- something close to "it does not function well enough to compete for resources"). On the other hand, a mutation may change the encoded protein to a version that is just as good, or even better. Many of the changes will simply change the DNA, but not the protein it is used to generate (e.g., a mutation might change GGC to GGA, both of which encode the amino acid G).
Most mutations that occur in protein-encoding genes are lethal (the proteins have been optimized over many, many generations). Relatively few improve the functioning of the encoded protein. This means that most mutations that alter which amino acid is encoded do not appear in the sequenced genomes (cells with those mutations often just die). A disproportionate number of the mutations we do observe leave the encoded sequence of amino acids unchanged.
Let's make this all more concrete and you can try to tie a lot of these notions together. Let us begin with a multiple-sequence alignment of the starts of some genes from closely-related cells:
    fig|198214.1.peg.4    ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
    fig|83333.1.peg.4     ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
    fig|331112.3.peg.3    ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
    fig|155864.1.peg.4    ATGAAACTCTACAATCTTAAAGATCACAATGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
    fig|321314.4.peg.144  ATGAAACTCTATAATCTGAAAGACCATAATGAGCAGGTCAGCTTTGCGCAGGCCGTCACG
                          *********** ***** ***** ** ** ******************** ***** **
    fig|198214.1.peg.4    CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
    fig|83333.1.peg.4     CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
    fig|331112.3.peg.3    CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCATGACCTGCCGGAATTCAGCCTG
    fig|155864.1.peg.4    CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
    fig|321314.4.peg.144  CAAGGACTGGGCAAACAGCAGGGACTTTTTTTTCCGCACGAACTGCCGGAGTTTAGCCTG
                          ** **  ******** * ***** ** *********** ** ******** ** ******
We are depicting the initial 120 characters of the DNA encoding the corresponding protein from 5 distinct cells. We have associated distinct identifiers with the 5 genes (e.g., fig|198214.1.peg.4). Each of the genes begins with ATG, which is the codon encoding M. The corresponding amino acid strings (that is, the starts of the proteins encoded by the genes) are as follows:
    fig|198214.1.peg.4    MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
    fig|331112.3.peg.3    MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
    fig|83333.1.peg.4     MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
    fig|155864.1.peg.4    MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
    fig|321314.4.peg.144  MKLYNLKDHNEQVSFAQAVTQGLGKQQGLFFPHELPEFSL
                          ************************* ******* ******
Note that we have 120 DNA characters encoding 40 amino acids in each of 5 closely-related genomes. Note that the fourth codon in the gene (TAT in one genome, but TAC in the others) corresponds to the Y in the fourth position of the amino acid alignment. We highly recommend that you manually go through the correspondence between the DNA and amino acid sequences. Tabulate the number of mutations that did not alter the amino acid sequences, as well as the number that did. Think about what this means. It is critical.
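Once you have done the tabulation by hand, a small script can check your work. The following is a minimal sketch, assuming two gap-free, in-frame DNA strings of equal length. It counts differing codons (not individual character changes), and the function name `count_codon_changes` is ours.

```python
from itertools import product

# Standard genetic code built in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_changes(gene_a, gene_b):
    """Return (silent, altering): the number of differing codons that leave the
    amino acid unchanged, and the number that change it."""
    if len(gene_a) != len(gene_b) or len(gene_a) % 3:
        raise ValueError("expected equal-length sequences made of whole codons")
    silent = altering = 0
    for i in range(0, len(gene_a), 3):
        codon_a, codon_b = gene_a[i:i + 3], gene_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TO_AA[codon_a] == CODON_TO_AA[codon_b]:
            silent += 1
        else:
            altering += 1
    return silent, altering


# First 30 characters of fig|198214.1.peg.4 and fig|321314.4.peg.144.
a = "ATGAAACTCTACAATCTGAAAGATCACAAC"
b = "ATGAAACTCTATAATCTGAAAGACCATAAT"
print(count_codon_changes(a, b))  # (4, 0)
```

In this 30-character stretch, all four differing codons are silent, which is the pattern the text asks you to look for.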
What is important for you to realize is that the authors of CRITICA had a pretty good idea: with just these five genomes you can rather reliably recognize that these regions encode amino acid strings. If we were to take the 30 characters ahead of the genes (usually called upstream of the genes) along with the initial ATG we would get the following alignment of those DNA sequences:
    fig|198214.1.peg.4    ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
    fig|331112.3.peg.3    ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
    fig|83333.1.peg.4     ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
    fig|155864.1.peg.4    ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
    fig|321314.4.peg.144  ACGGCGGGCGCACGAGTAGTGGGATAATCAATG
                          ****************** *** * * * ****
When we look at the generated amino acids, we see
    fig|198214.1.peg.4    TAGARVLENXM
    fig|331112.3.peg.3    TAGARVLENXM
    fig|83333.1.peg.4     TAGARVLENXM
    fig|155864.1.peg.4    TAGARVLENXM
    fig|321314.4.peg.144  TAGARVVGXSM
                          ******:   *
Here we see some Xs in the alignment; they represent stop codons (i.e., they indicate that the codon does not encode an amino acid). What is worth noting is that there are mutations in 5 of the 30 upstream characters, and together they change 4 of the 10 translated upstream characters. It is a fact that most genes begin with ATG, which makes it quite likely that this gene begins with the exact ATG we have shown.
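The contrast between the coding region and the upstream region can be expressed numerically, in the spirit of the comparative signal described in the CRITICA abstract. The following is only a sketch of that idea, not the CRITICA algorithm itself: it compares the fraction of identical DNA characters with the fraction of identical translated characters. The function names are ours, and stop codons appear as `*` rather than X.

```python
from itertools import product

# Standard genetic code built in TCAG order ("*" marks stop codons).
CODON_TO_AA = {"".join(c): aa for c, aa in zip(
    product("TCAG", repeat=3),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")}


def translate(dna):
    """Translate in frame, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))


def fraction_identical(x, y):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(p == q for p, q in zip(x, y)) / len(x)


def identity_profile(dna_a, dna_b):
    """Return (DNA identity, identity of the in-frame translations)."""
    return (fraction_identical(dna_a, dna_b),
            fraction_identical(translate(dna_a), translate(dna_b)))


# Start of the gene in fig|198214.1.peg.4 versus fig|321314.4.peg.144.
coding = identity_profile("ATGAAACTCTACAATCTGAAAGATCACAAC",
                          "ATGAAACTCTATAATCTGAAAGACCATAAT")
# The 30 upstream characters plus ATG from the same two genomes.
upstream = identity_profile("ACGGCGGGCGCACGAGTACTGGAAAACTAAATG",
                            "ACGGCGGGCGCACGAGTAGTGGGATAATCAATG")
print(coding)    # DNA identity 26/30, translated identity 10/10
print(upstream)  # DNA identity 28/33, translated identity 7/11
```

In the coding stretch the translations agree more often than the DNA does; upstream, the relationship is reversed.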
Now let us return to the topic of gene-calling. Our basic approach will be as follows:
    fig|226900.1.peg.4136  ------------------ATGAGTAAAATTATCGGTATTGACTTAGGTAC
    fig|138677.1.peg.499   ATGAGTGAACACAAAAAATCAAGCAAAATTATAGGTATAGACTTAGGCAC
                                                ** ******** ***** ******** **
    fig|226900.1.peg.4136  AACAAACTCTTGTGTAGCTGTTATGGAAGGTGGAGAACCAAAGGTTATCC
    fig|138677.1.peg.499   AACAAACTCCTGCGTATCTGTTATGGAAGGAGGACAAGCTAAAGTAATTA
                           ********* ** *** ************* *** ** * ** ** **
    fig|226900.1.peg.4136  CAAATCCAGAAGGGAACCGTACAACACCTTCTGTTGTAGCTTTCAAAAAT
    fig|138677.1.peg.499   CATCATCCGAAGGAACAAGAACCACGCCATCGATCGTTGCCTTCAAAGGT
                           **    * ***** *   * ** ** ** **  * ** ** ******  *
    fig|226900.1.peg.4136  GAAGAACGTCAAGTTGGGGAAGTTGCAAAGCGCCAAGCAATTACAAACCC
    fig|138677.1.peg.499   AATGAGAAATTAGTGGGGATTCCAGCAAAACGTCAAGCAGTGACAAATCC
                            * **      *** ***      ***** ** ****** * ***** **
    fig|226900.1.peg.4136  AAATACAA---TCATGTCTGTTAAACGTCATATGGG---TACAGACTACA
    fig|138677.1.peg.499   AGAAAAAACTCTCGGCTCTACAAAACGCTTTATTGGCCGTAAGTACTCTG
                           * * * **   **   ***   *****   *** **   **   ***
    fig|226900.1.peg.4136  AAGTAG--------------------------------------------
    fig|138677.1.peg.499   AAGTAGCTTCGGAAATCCAAACCGTTCCTTATACAGTCACCTCCGGATCT
                           ******
    fig|226900.1.peg.4136  -------------------AAGTTGAAGGTAAAGATTATACACCTCAAGA
    fig|138677.1.peg.499   AAAGGTGATGCCGTTTTCGAAGTTGATGGCAAACAATACACTCCAGAAGA
                                              ******* ** *** * ** ** **  ****
    fig|226900.1.peg.4136  AATTTCTGCCATCATTTTACAAAACTTAAAAGCTTCTGCTGAAGCATACT
    fig|138677.1.peg.499   AATTGGCGCACAAATCTTAATGAAAATGAAAGAGACAGCAGAAGCTTATC
                           ****   **    ** ***   **  * ****   * ** ***** **
    fig|226900.1.peg.4136  TAGGTGAAACAGTAACGAAAGCTGTTATTACAGTACCTGCATACTTCAAC
    fig|138677.1.peg.499   TAGGCGAAACTGTCACAGAAGCAGTGATCACCGTCCCCGCATACTTCAAT
                           **** ***** ** **  **** ** ** ** ** ** ***********
    fig|226900.1.peg.4136  GATGCAGAGCGTCAAGCAACGAAAGATGCTGGTCGTATCGCTGGTTTAGA
    fig|138677.1.peg.499   GATTCTCAACGAGCATCCACAAAAGATGCTGGACGCATTGCAGGTCTAGA
                           *** *  * **   * * ** *********** ** ** ** *** ****
    fig|226900.1.peg.4136  AGTTGAGCGTATCATTAACGAGCCAACAGCAGCAGCACTTGCTTACGGTT
    fig|138677.1.peg.499   TGTAAAACGTATCATTCCAGAACCTACCGCAGCAGCTCTTGCCTACGGAA
                            **  * *********   ** ** ** ******** ***** *****
    fig|226900.1.peg.4136  TAGAAAAACAAGACGAAGAACAAAAAATCTTAGTATATGACTTAGGTGGC
    fig|138677.1.peg.499   TCGATAA---AGTCGGTGATAAAAAAATCGCTGTCTTCGACCTTGGTGGA
                           * ** **   ** **  **  ********   ** *  *** * *****
When two characters are in the same column, the implication is that we believe that they derived from the same character in an ancestral sequence. When a dash (i.e., a -) appears in a column, it indicates that we believe that
    fig|226900.1.peg.4136  -------------------MSKIIGIDLGTTNSCVAVME-GGEPKVIPNP
    fig|95665.5.peg.505    ----------------------------------MAVIE-NKKPIVLENP
    fig|138677.1.peg.499   -------------MSEHKKSSKIIGIDLGTTNSCVSVME-GGQAKVITSS
    fig|243274.1.peg.368   ---------------MAEKKEFVVGIDLGTTNSVIAWMKPDGTVEVIPNA
    fig|349521.5.peg.4864  MIRKIAVFSFLRANRGFQSSMSLIGIDLGTTNSLIAHWG-EQGVEIIPNR
    fig|397945.5.peg.3653  -----------------MEQKMIIGIDLGTTNSLVAAWK-DGRSVLIPNA
                                                             ::         :: .
    fig|226900.1.peg.4136  EGNRTTPSVVAFK-NEERQVGEVAKRQAITNPN-TIMSVKRHMG------
    fig|95665.5.peg.505    EGKRTVPSVVSFN-GDEVLVGDAAKRKQITNPN-TVSSIKRLMG------
    fig|138677.1.peg.499   EGTRTTPSIVAFK-GNEKLVGIPAKRQAVTNPEKTLGSTKRFIGRKYSEV
    fig|243274.1.peg.368   EGSRVTPSVVAFTKSGEILVGEPAKRQMILNPERTIKSIKRKMG------
    fig|349521.5.peg.4864  LGARLTPSAVSLDADGAVIVGQAAKDRLVTHPDLSVASFKRRMG------
    fig|397945.5.peg.3653  LGETLTPSCVSLDEDVTVLVGRAARERLQTHPDRTAANFKRYMG------
                            *   .** *::  .    **  *: :   :*: :  . ** :*
    fig|226900.1.peg.4136  ----------------TDYKVEVEGKDYTPQEISAIILQNLKASAEAYLG
    fig|95665.5.peg.505    ----------------TKEKVTILNKEYTPEEISAKILSYIKDYAEKKLG
    fig|138677.1.peg.499   ASEIQTVPYTVTSGSKGDAVFEVDGKQYTPEEIGAQILMKMKETAEAYLG
    fig|243274.1.peg.368   ----------------TDYKVRIDDKEYTPQEISAFILKKLKNDAEAYLG
    fig|349521.5.peg.4864  ----------------TNAAYTLGKQSFRPEELSALVLKQLKEDAEAYLN
    fig|397945.5.peg.3653  ----------------SDRTVALAGRAFRPEELSSLVLRALKADAEAFLG
                                            .    :  : : *:*:.: :*  :*  **  *.
In actuality, these six sequences are part of a set of sequences that are fairly similar, and recognizably so. However, we believe that it is far from clear that the alignment above is actually "correct" or "optimal" in a meaningful sense. Rather, it is probably close to correct but contains errors. Exactly where the dashes (called indels, since they represent characters that were either inserted or deleted) should be placed is uncertain.
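The row of asterisks beneath an alignment block can be recomputed mechanically from the rows above it. The following is a minimal sketch that marks only fully conserved columns. The `:` and `.` marks seen in the protein alignment, which by common convention flag columns of similar rather than identical amino acids, are deliberately left out, and the function name `conservation_line` is ours.

```python
def conservation_line(rows):
    """Return a string with '*' under every column in which all rows carry the
    same character and that character is not a gap; other columns get ' '."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    marks = []
    for column in zip(*rows):
        conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if conserved else " ")
    return "".join(marks)


# First ten columns of the last block of the two-sequence DNA alignment above.
print(repr(conservation_line(["TAGAAAAACA", "TCGATAA---"])))  # '* ** **   '
```

A column that contains a dash in any row is never marked, since a gap means there is no character to compare.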
There are two classes of problems associated with multiple-sequence alignments:
Before we leave this topic, we will briefly describe a tool that we
believe any computer scientist could build easily and that would
reveal numerous research topics. Suppose that we have a single genome
that we wish to analyze, and that we have computed all regions of
similarity between sections of this genome and other complete genomes.
For each character in the genome we are focused on, we can easily
extract all regions in other genomes that are similar to regions in
the focus genome that contain the given character. Further, each of
the stored similarities (between a region in the given genome and one
of the other genomes) has an associated percent identity (a
measure of how similar the regions are - the percent of the aligned
characters that are identical). Now, the utility that is needed is
the ability to specify a region in the given genome, along with a
range of desired similarities, and then the program would display the
alignment composed of the selected similarity range (maybe with some
representation of the consensus and how conserved the values are).
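To suggest how little machinery the core of such a utility needs, here is a minimal sketch of the selection step only. Computing the similarities and displaying the resulting alignment are not attempted. The `Similarity` record and its field names are a hypothetical schema of our own, with 0-based, half-open coordinates assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Similarity:
    """One stored similarity between the focus genome and another genome
    (hypothetical schema)."""
    focus_start: int         # start in the focus genome (0-based, inclusive)
    focus_end: int           # end in the focus genome (exclusive)
    other_genome: str        # identifier of the other genome
    other_start: int
    other_end: int
    percent_identity: float  # percent of aligned characters that are identical


def select_similarities(sims, region_start, region_end, min_identity, max_identity):
    """Return the similarities that overlap [region_start, region_end) in the
    focus genome and whose percent identity lies in the requested range,
    most similar first."""
    chosen = [
        s for s in sims
        if s.focus_start < region_end and s.focus_end > region_start
        and min_identity <= s.percent_identity <= max_identity
    ]
    return sorted(chosen, key=lambda s: s.percent_identity, reverse=True)


sims = [
    Similarity(0, 100, "genome B", 10, 110, 95.0),
    Similarity(50, 150, "genome C", 0, 100, 60.0),
    Similarity(200, 300, "genome B", 500, 600, 99.0),
]
for s in select_similarities(sims, 40, 120, 50.0, 100.0):
    print(s.other_genome, s.percent_identity)
```

A linear scan like this is enough for experimenting. For a full genome with many stored similarities one would presumably want an index on the focus coordinates, which is one of the design questions the tool raises.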
    seq3  -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE
    seq4  -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA
    seq5  -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN
    seq1  -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
    seq2  MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE

    seq3  EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL
    seq4  ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL
    seq5  TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL
    seq1  EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL
    seq2  FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL

    seq3  ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI
    seq4  ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI
    seq5  ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI
    seq1  ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV
    seq2  ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV

    seq3  FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV
    seq4  FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV
    seq5  FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV
    seq1  VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA
    seq2  YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA

    seq3  NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI
    seq4  NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM
    seq5  NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI
    seq1  NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI
    seq2  NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI

    seq3  VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA
    seq4  IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA
    seq5  LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA
    seq1  AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------
    seq2  AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------

    seq3  FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC
    seq4  FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP
    seq5  LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL
    seq1  -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ
    seq2  -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG

    seq3  RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA
    seq4  GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR
    seq5  VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA
    seq1  LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK
    seq2  LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS

    seq3  FVSQHGNRGKPL
    seq4  FMSGHLGA----
    seq5  FIEKKAL-----
    seq1  LMMNHQ------
    seq2  YLLGK-------
From the extant five sequences that are similar and displayed in the previous alignment, we can construct a tree that depicts the "phylogenetic history" of the sequences. Here is one reasonable tree for the last 5 sequences.
     ,----------------------- seq5
     |
     |
    -|
     |
     |
     |                                         ,---------------------------- seq3
     |                                         |
     |                                         |
     |                       ,-----------------|
     |                       |                 |
     |                       |                 |
     |                       |                 `--------------------------- seq4
     |                       |
     |                       |
     |                    ,--|
     |                    |  |
     |                    |  |
     |                    |  `---------------------------------------------- seq1
     |                    |
     |                    |
     `--------------------|
                          |
                          |
                          `------------------------------------------------ seq2
The tree suggests that at some point an ancestral cell replicated. One copy led (through a chain of descendants) to seq5, while the remaining sequences descend from the other copy.
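For readers who prefer to manipulate such a tree rather than look at it, the branching order can be written as nested pairs. The following is a minimal sketch; the topology is our reading of the drawing, branch lengths are omitted, and the function names are ours. It prints the tree in the widely used Newick notation and lists the group of sequences below each branch point.

```python
# Branching order read off the drawing: seq5 splits off first, then seq2,
# then seq1, leaving seq3 and seq4 as the closest pair.
TREE = ("seq5", ((("seq3", "seq4"), "seq1"), "seq2"))


def leaves(node):
    """Return the names of all sequences at or below a node."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def to_newick(node):
    """Render a node in Newick notation (without the trailing semicolon)."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def clades(node):
    """Yield the sorted leaf names under every branch point, root first."""
    if isinstance(node, str):
        return
    yield sorted(leaves(node))
    for child in node:
        yield from clades(child)


print(to_newick(TREE) + ";")  # (seq5,(((seq3,seq4),seq1),seq2));
for group in clades(TREE):
    print(group)
```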
Note that we now have alignments that contain thousands of sequences, and even displaying such trees is nontrivial. Because evolution plays such a central role in the phenomena we study, the construction of alignments and trees in order to compare extant versions of proteins and gain insight into their historical origins is considered basic to the task at hand.
M1 | harvesting light energy |
M2 | building sugar from smaller components and energy |
M3 | storing strings of sugar molecules as starch |
M4 | breaking up starch to give sugar |
M5 | breaking up sugar to get energy and smaller molecules |
2OG | 2-oxoglutarate |
3PG | 3-phosphoglycerate |
A | Adenine [one of the characters in a DNA string] |
Ala | Alanine [an amino acid] |
Arg | Arginine [an amino acid] |
Asn | Asparagine [an amino acid] |
Asp | Aspartate [an amino acid] |
C | Cytosine [one of the characters in a DNA string] |
CHOR | Chorismate |
CO2 | Carbon dioxide |
Daughter genome | the added cell after replication |
E4P | Erythrose 4-phosphate |
Extra Membrane | A little extra membrane for the new cell |
G | Guanine [one of the characters in a DNA string] |
G6P | Glucose 6-phosphate |
Genome | the DNA string in the cell that contains the genes |
Gln | Glutamine [an amino acid] |
Glu | Glutamate [an amino acid] |
Gly | Glycine [an amino acid] |
HOM | Homoserine |
His | Histidine [an amino acid] |
Iso | Isoleucine [an amino acid] |
Leu | Leucine [an amino acid] |
Lys | Lysine [an amino acid] |
Membrane | the thing enclosing the cell |
Met | Methionine [an amino acid] |
OXLA | Oxalacetate |
PEP | Phosphoenolpyruvate |
PYR | Pyruvate |
Phe | Phenylalanine [an amino acid] |
Pro | Proline [an amino acid] |
R5P | Ribose 5-phosphate |
Ser | Serine [an amino acid] |
Starch | A polymer of sugars (used for storage) |
Sugar | think glucose |
T | Thymine [one of the characters in a DNA string] |
Thr | Threonine [an amino acid] |
Trp | Tryptophan [an amino acid] |
Tyr | Tyrosine [an amino acid] |
Val | Valine [an amino acid] |
M6 | build glutamate and glutamine from 2-oxoglutarate |
M7 | build proline from glutamate and ATP |
M8 | build aspartate from oxalacetate |
M9 | build arginine from glutamate, aspartate, and ATP |
M10 | build asparagine from glutamine, aspartate, and ATP |
M11 | build serine from 3-phosphoglycerate and glutamate |
M12 | build glycine from serine |
M13 | build cysteine from serine |
M14 | build methionine from homoserine and cysteine |
M15 | build lysine from pyruvate and aspartate |
M16 | build homoserine from aspartate |
M17 | build threonine from homoserine and ATP |
M18 | build isoleucine from glutamate, threonine and pyruvate |
M19 | build alanine from pyruvate |
M20 | build valine from pyruvate |
M21 | build leucine from pyruvate |
M22 | build the intermediate chorismate from phosphoenolpyruvate and erythrose 4-phosphate |
M23 | build tyrosine and phenylalanine from glutamate and chorismate |
M24 | build tryptophan from chorismate and glutamine |
M25 | build ribose 5-phosphate from glucose-6-phosphate |
M26 | build histidine from ribose-5-phosphate and ATP |
M30 | building a protein from amino acids and a gene |
M27 | build nucleotides |
M28 | build new genome |
M29 | split the cell into original and daughter |
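One way to make the machine tables concrete is to treat each machine as a rule that turns a set of input compounds into a set of output compounds. The sketch below encodes a handful of the machines listed above (using the glossary abbreviations) and works out what can be built from a given starting set. It is a deliberate simplification of our own: quantities, energy costs beyond the mere presence of ATP, and intermediate steps are all ignored.

```python
# A few machines from the tables above, written as (inputs, outputs).
MACHINES = {
    "M6": ({"2OG"}, {"Glu", "Gln"}),
    "M7": ({"Glu", "ATP"}, {"Pro"}),
    "M8": ({"OXLA"}, {"Asp"}),
    "M9": ({"Glu", "Asp", "ATP"}, {"Arg"}),
    "M10": ({"Gln", "Asp", "ATP"}, {"Asn"}),
    "M16": ({"Asp"}, {"HOM"}),
    "M17": ({"HOM", "ATP"}, {"Thr"}),
}


def producible(available, machines=MACHINES):
    """Return every compound obtainable from `available` by repeatedly running
    any machine whose inputs are all present."""
    have = set(available)
    changed = True
    while changed:
        changed = False
        for inputs, outputs in machines.values():
            if inputs <= have and not outputs <= have:
                have |= outputs
                changed = True
    return have


print(sorted(producible({"2OG", "ATP"})))          # ['2OG', 'ATP', 'Gln', 'Glu', 'Pro']
print(sorted(producible({"2OG", "OXLA", "ATP"})))
```

Removing a machine from the dictionary and rerunning the function shows which downstream compounds the cell would lose, which is a simple way to explore how the machines depend on one another.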
Before leaving this topic, it is worth noting that a site called
The Annotation Clearinghouse
exists. This resource will allow users to download assertions of
function that are considered to be reasonably reliable by human
annotators manually curating the growing body of data. The assertions
use widely differing IDs for genes (but a table for interconverting
the IDs is provided), they use an uncontrolled vocabulary (although
progress is being made in developing synonym lists), and many of the
assertions are undoubtedly wrong. However, it is a start on a
resource of central importance.
In our view, the most useful set of clues to date has arisen from recognizing that genes that implement closely related functions (i.e., functions that are part of the same machine or machines that implement connected functions) often occur close to one another in the genome. That is, if you take the genes that implement a machine, and you look at where these genes occur in the genome, the occurrences are not random. On average, about 50% of the genes that make up a machine will occur within 5000 characters of one another in the genome. In some genomes far fewer genes cluster (for reasons we do not fully understand).
To exploit this tendency, we might construct sets of pairs of genes. All pairs in a set occur close together in a genome (one of the ones in our collection). All of the first members of pairs are similar to one another, and all of the second members are similar to one another. The fact that all of the 2-tuples in each set have corresponding pairs that are similar might lead one to believe that all of the pairs implemented the same two abstract functions, but that is not the case. It is often, and perhaps usually, the case; but, there are many instances where the pairs implement distinct functions. For example, there are many cases in which 4 close genes implement a transport machine. For each of these transport machines, even though they transport completely different compounds, 3 of the 4 genes are pretty similar. The fourth gene is often the one that is specific to the compound being transported.
What we can say, assuming that we find enough entries in a set (that is, far more corresponding pairs than one would expect by chance), is that the functions of the genes in each pair are related. We cannot say with reliability that the actual functions in all of the pairs match up, but the ones in each pair will usually be related.
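The counting behind such sets of pairs can be sketched in a few lines. In the sketch below, each gene is described by its genome, a "family" label standing for a group of mutually similar genes, and its start position. How the families are computed is outside the sketch, and the input format and function name are assumptions of ours. The 5000-character threshold is the figure quoted above. Circular genomes, multiple DNA molecules per genome, and any test of statistical significance are ignored.

```python
from collections import defaultdict
from itertools import combinations


def clustered_family_pairs(genes, max_distance=5000):
    """Count, for each pair of families, the genomes in which a member of one
    family starts within `max_distance` characters of a member of the other.

    `genes` is an iterable of (genome, family, start) tuples.
    """
    by_genome = defaultdict(list)
    for genome, family, start in genes:
        by_genome[genome].append((start, family))

    genomes_with_pair = defaultdict(set)
    for genome, located in by_genome.items():
        for (start_a, fam_a), (start_b, fam_b) in combinations(sorted(located), 2):
            if fam_a != fam_b and abs(start_a - start_b) <= max_distance:
                genomes_with_pair[frozenset((fam_a, fam_b))].add(genome)
    return {pair: len(found) for pair, found in genomes_with_pair.items()}


genes = [
    ("g1", "famA", 100), ("g1", "famB", 2000), ("g1", "famC", 90000),
    ("g2", "famA", 500), ("g2", "famB", 4000),
    ("g3", "famA", 10), ("g3", "famB", 50000),
]
print(clustered_family_pairs(genes))  # famA and famB are close together in 2 genomes
```

A pair of families that turns up close together in many genomes is a candidate for the kind of related-function evidence described above. Deciding how many genomes is "enough" is exactly the statistical question the text leaves open.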
Further, a single protein might well participate in pairs from several sets. By combining the evidence from all of these sets of pairs, it is possible to produce an estimate of all of the components in a machine, without really knowing the functions of any of them. That is, it becomes possible to say "I think that these four genes implement a machine", and to do so without having a clear idea of what the machine actually does. The information produced by examining conserved contiguity has not really been completely exploited. It has proved to be immensely useful, but there is far more to be gleaned from this data by those with some minimal creativity and statistical competence.