Understanding Single-celled Life:

An Abstract Approach

by Ralph Butler, Ross Overbeek, ...


Part 1: The Cell: a Basic Abstraction

A cell is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.

By the term compound we refer to the normal notion of chemical compound.

A cellular machine is a set of proteins that together perform a function. Unless otherwise noted, when we use the term machine we will always be speaking of a cellular machine. Many machines transform one set of compounds into another set. Some machines (transport machines) are used to move compounds into or out of the cell. Later we will try to convey a more comprehensive notion of what functions are implemented by machines that we understand.

A protein is a string of amino acids (i.e., a string in the 20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).

A genome is a string of DNA bases (i.e., a string in the 4-character alphabet {A,C,G,T}).

A gene is a region in the genome that describes how to build a protein. The description is a sequence of 3-character codons. Each codon may be thought of as an instruction specifying which amino acid should come next in the protein the gene describes.   Thus, if the protein described by the gene contains 100 amino acids, then the gene would be composed of 100 codons (i.e., 300 DNA characters) followed by a codon that means "stop here" (a stop codon).  There are three stop codons: {TAA,TAG,TGA}. The genetic code is the table of correspondences between codons and amino acids:

Amino Acid Codons
A GCT, GCC, GCA, GCG
C TGT, TGC
D GAT, GAC
E GAA, GAG
F TTT, TTC
G GGT, GGC, GGA, GGG
H CAT, CAC
I ATT, ATC, ATA
K AAA, AAG
L TTA, TTG, CTT, CTC, CTA, CTG
M ATG
N AAT, AAC
P CCT, CCC, CCA, CCG
Q CAA, CAG
R CGT, CGC, CGA, CGG, AGA, AGG
S TCT, TCC, TCA, TCG, AGT, AGC
T ACT, ACC, ACA, ACG
V GTT, GTC, GTA, GTG
W TGG
Y TAT, TAC
* TAG, TGA, TAA [Stop codons]



The process of building a protein as a string of amino acids from the gene containing codons is called expressing the gene.
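
As a minimal sketch of what expressing a gene means at this level of abstraction, the following Python fragment turns the genetic code table above into a lookup and reads a gene three characters at a time until it reaches a stop codon. The names GENETIC_CODE, CODON_TO_AA, and express are ours, chosen for illustration, and the sketch assumes the gene string starts on a codon boundary and contains only the characters A, C, G, and T.

```python
# Genetic code as given in the table above: amino acid -> its codons.
# '*' marks the three stop codons.
GENETIC_CODE = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table so that each codon maps to the amino acid it specifies.
CODON_TO_AA = {
    codon: aa for aa, codons in GENETIC_CODE.items() for codon in codons.split()
}


def express(gene):
    """Translate a gene (DNA string) into a protein string, stopping at a stop codon."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        aa = CODON_TO_AA[gene[i:i + 3]]
        if aa == "*":          # stop codon: the protein is complete
            break
        protein.append(aa)
    return "".join(protein)


print(express("ATGAAACTCTACTAA"))  # MKLY
```

The example gene is four codons followed by the stop codon TAA, so the result is a protein of four amino acids.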

Problems in BioInformatics that Depend only on the Basic Abstraction

Identifying Genes within the Genome

If we plan on using a genome, it will usually be necessary to identify the genes within the genome.  How can this best be done?   First, it should be noted that this can be broken into three variations:

  1. Given no assumption of an existing body of previously identified genes, find the genes in a new genome.
  2. Given a large collection of existing genomes in which the genes have been identified, find the set of genes in a new genome.
  3. Given a large set of existing genomes, discard any existing decisions and try to identify genes in all of them from scratch.
When the first genome was sequenced, the first option was pretty much the only reasonable choice (this is not completely true, since we had many partial genomes that had already been sequenced and annotated). People focused on developing reasonable strategies that would make the best possible choices taking just the single genome as input.

Very quickly, the second alternative became more appropriate; it was based on the idea of effectively exploiting the efforts that had been expended in the early genomes to more quickly and accurately identify the genes in each new genome.

It is worth noting that the second approach, while exploiting the investments made in annotating the early genomes, also has the property that early errors are frequently propagated. If an algorithm had called a section of an early genome a gene when it actually was not, then when we see something similar in a new genome it might well get improperly labeled as well.

The third approach offers an unusual perspective and opportunity. It suggests that we are entering an era in which we have many available genomes, and that there might be approaches based on comparison that would support more accurate annotations for the entire collection. There may be many such approaches, but we will describe just one that is based on ideas used in creating one of the early gene-calling systems. Let us start by quoting the abstract from "CRITICA: coding region identification tool invoking comparative analysis" by Jonathan Badger and Gary Olsen (Mol Biol Evol. 1999 Apr;16(4):512-24. PMID: 10331277):

"Gene recognition is essential to understanding existing and future DNA sequence data. CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) is a suite of programs for identifying likely protein coding sequences in DNA by combining comparative analysis of DNA sequences with more common noncomparative methods. In the comparative component of the analysis, regions of DNA are aligned with related sequences from the DNA databases; if the translation of the aligned sequences has greater amino acid identity than expected for the observed percentage nucleotide identity, this is interpreted as evidence for coding. CRITICA also incorporates noncomparative information derived from the relative frequencies of hexanucleotides in coding-frames versus other contexts (i.e., dicodon bias). The dicodon usage information is derived by iterative analysis of the data so that CRITICA is not dependent upon the existence or accuracy of coding sequence annotations in the databases. This independence makes the method particularly well-suited for the analysis of novel genomes. CRITICA was tested by analyzing the available Salmonella typhimurium DNA sequences. Its predictions were compared to the DNA sequence annotations and to the predictions of GenMark. CRITICA proved more accurate than GenMark, and, moreover, many of its predictions that would seem to be errors, instead reflect problems in the sequence databases."
To understand the basic idea, we need to discuss how genomes are passed on to descendants. We discuss the notion of replication below, but for now let us just say that cells occasionally copy their genome and divide into two cells, leaving a version of the genome in each cell. The set of machines in the original cell also gets divided. How the cell makes sure that each of the new cells gets enough machines to make up an operational life-form is a separate topic. For now, let us just say that they do achieve it. The new cell containing a copy of the genome that existed in the original cell may very occasionally contain a copied genome that differs from the original version due to errors in copying. These differences are called mutations. If a mutation occurred in a gene (encoding a protein), and if the mutation caused the encoding to be changed to produce a protein sequence that would not work, then the mutation is lethal and the cell dies (whatever that means -- something close to "it does not function well enough to compete for resources"). On the other hand, a mutation may change the encoding so that the new version of the protein is just as good, or even better. Many mutations simply change the DNA but not the protein it is used to generate (e.g., a mutation might change GGC to GGA, both of which encode the amino acid G).

Most mutations that occur in protein-encoding genes are lethal (the proteins have been optimized over many, many generations). The number that improve the functioning of the encoded protein are relatively few. This means that most mutations that alter which amino acid is encoded do not appear in the sequenced genomes (cells with those mutations often just die). A disproportionate number of mutations will be of the category that leave the encoded sequence of amino acids unchanged.

Let's make this all more concrete and you can try to tie a lot of these notions together. Let us begin with a multiple-sequence alignment of the starts of some genes from closely-related cells:


fig|198214.1.peg.4        ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
fig|83333.1.peg.4         ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
fig|331112.3.peg.3        ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
fig|155864.1.peg.4        ATGAAACTCTACAATCTTAAAGATCACAATGAGCAGGTCAGCTTTGCGCAAGCCGTAACC
fig|321314.4.peg.144      ATGAAACTCTATAATCTGAAAGACCATAATGAGCAGGTCAGCTTTGCGCAGGCCGTCACG
                          *********** ***** ***** ** ** ******************** ***** ** 

fig|198214.1.peg.4        CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
fig|83333.1.peg.4         CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
fig|331112.3.peg.3        CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCATGACCTGCCGGAATTCAGCCTG
fig|155864.1.peg.4        CAGGGGTTGGGCAAAAATCAGGGGCTGTTTTTTCCGCACGACCTGCCGGAATTCAGCCTG
fig|321314.4.peg.144      CAAGGACTGGGCAAACAGCAGGGACTTTTTTTTCCGCACGAACTGCCGGAGTTTAGCCTG
                          ** **  ******** * ***** ** *********** ** ******** ** ******
We are depicting the initial 120 characters of the DNA encoding the corresponding protein from 5 distinct cells. We have associated distinct identifiers with the 5 genes (e.g., fig|198214.1.peg.4). Each of the genes begins with ATG, which is a codon encoding M. The corresponding amino acid strings (that is, the starts of the proteins encoded by the genes) are as follows:
fig|198214.1.peg.4        MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
fig|331112.3.peg.3        MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
fig|83333.1.peg.4         MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
fig|155864.1.peg.4        MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSL
fig|321314.4.peg.144      MKLYNLKDHNEQVSFAQAVTQGLGKQQGLFFPHELPEFSL
                          ************************* ******* ******
Note that we have 120 DNA characters encoding 40 amino acids in each of 5 closely-related genomes. Note that the fourth codon in the gene (TAT in one genome, but TAC in the others) corresponds to the Y in the fourth position of the amino acid alignment. We highly recommend that you manually go through the correspondence between the DNA and amino acid sequences. Tabulate the number of mutations that did not alter the amino acid sequences, as well as the number that did. Think about what this means. It is critical.
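
The tabulation suggested above can also be sketched in code. The fragment below compares two in-frame, ungapped genes codon by codon and counts how many differing codons still encode the same amino acid and how many do not. The function name tally_codon_changes is ours, the codon lookup is the standard genetic code written in a compact form, and the two strings are the first 60 characters of fig|198214.1.peg.4 and fig|321314.4.peg.144 from the alignment above. It is an illustration of the bookkeeping only, not a substitute for going through the correspondence by hand.

```python
from itertools import product

# Standard genetic code in compact form: codons enumerated in T, C, A, G order
# (third position varying fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def tally_codon_changes(gene_a, gene_b):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Assumes both strings are in frame, ungapped, and aligned position by position.
    """
    synonymous = nonsynonymous = 0
    for i in range(0, min(len(gene_a), len(gene_b)) - 2, 3):
        codon_a, codon_b = gene_a[i:i + 3], gene_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TO_AA[codon_a] == CODON_TO_AA[codon_b]:
            synonymous += 1       # DNA changed, amino acid did not
        else:
            nonsynonymous += 1    # DNA change altered the amino acid
    return synonymous, nonsynonymous


peg_4 = "ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC"
peg_144 = "ATGAAACTCTATAATCTGAAAGACCATAATGAGCAGGTCAGCTTTGCGCAGGCCGTCACG"
print(tally_codon_changes(peg_4, peg_144))  # (7, 0)
```

For this pair, all seven differing codons in the first 60 characters leave the encoded amino acid unchanged.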

What is important for you to realize is that the authors of CRITICA had a pretty good idea: with just these five genomes you can rather reliably recognize that these regions encode amino acid strings. If we were to take the 30 characters ahead of the genes (usually called upstream of the genes) along with the initial ATG we would get the following alignment of those DNA sequences:


fig|198214.1.peg.4        ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
fig|331112.3.peg.3        ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
fig|83333.1.peg.4         ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
fig|155864.1.peg.4        ACGGCGGGCGCACGAGTACTGGAAAACTAAATG
fig|321314.4.peg.144      ACGGCGGGCGCACGAGTAGTGGGATAATCAATG
                          ****************** *** * * * ****
When we look at the generated amino acids, we see

fig|198214.1.peg.4        TAGARVLENXM
fig|331112.3.peg.3        TAGARVLENXM
fig|83333.1.peg.4         TAGARVLENXM
fig|155864.1.peg.4        TAGARVLENXM
fig|321314.4.peg.144      TAGARVVGXSM
                          ******:   *
Here we see some Xs in the alignment; they represent stop codons (i.e., they indicate that the codon does not encode an amino acid). What is worth noting is that there are mutations in 5 of the 30 upstream characters, and together they change 4 of the 10 encoded characters. Most genes begin with ATG, which makes it quite likely that this gene begins with the exact ATG we have shown.
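
The contrast between the coding region and the upstream region can be made explicit by computing, for each, the nucleotide identity and the identity of the translations. The sketch below is our own simplification of the comparative idea described in the CRITICA abstract, not the CRITICA algorithm itself. The function names are hypothetical, stop codons are written as X to match the alignments above, and the inputs are the first 60 coding characters and the 30 upstream characters of fig|198214.1.peg.4 and fig|321314.4.peg.144.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; stops shown as X.
BASES = "TCAG"
AMINO = "FFLLSSSSYYXXCCXWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate every complete codon, keeping stop codons as X."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identity(x, y):
    """Return (matching positions, compared positions) for equal-length strings."""
    return sum(a == b for a, b in zip(x, y)), len(x)


coding_a = "ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCTTTGCGCAAGCCGTAACC"
coding_b = "ATGAAACTCTATAATCTGAAAGACCATAATGAGCAGGTCAGCTTTGCGCAGGCCGTCACG"
upstream_a = "ACGGCGGGCGCACGAGTACTGGAAAACTAA"
upstream_b = "ACGGCGGGCGCACGAGTAGTGGGATAATCA"

for label, a, b in [("coding", coding_a, coding_b),
                    ("upstream", upstream_a, upstream_b)]:
    print(label, identity(a, b), identity(translate(a), translate(b)))
# coding (53, 60) (20, 20)
# upstream (25, 30) (6, 10)
```

In the coding region the translations are more similar than the DNA (20 of 20 versus 53 of 60), while in the upstream region they are less similar (6 of 10 versus 25 of 30). That asymmetry is the signal the comparative approach looks for.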

Now let us return to the topic of gene-calling. Our basic approach will be as follows:

  1. Begin by attempting to find as many genes as we can by taking the existing set of genomes and finding protein-encoding sections using the idea that was used in CRITICA. This will be computationally expensive because it might require looking for similar regions in thousands of genomes (remember there are 499,500 pairwise comparisons to make for 1000 genomes, and there are thousands of similar genes for almost all of the pairwise comparisons). However, what we will get out is a pretty accurate estimate of which areas of each genome are actually genes.
  2. A second step, after forming as many accurate predictions as we can make, would involve polishing things up by taking the set of predicted genes and trying to
A comprehensive recalling of genes should be done periodically, leading to ever more reliable estimates for ever more thousands of genomes. The whole topic of reducing the effort required to do the incremental comparisons between genomes is obviously going to be considered over the coming years. What is important, we suppose, is that it is clear at this point that we can now accurately call genes in prokaryotic genomes (although no one has yet gone back and cleaned up all of the errors in the existing genomes).
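
The count of pairwise comparisons quoted in step 1 above is simply the number of ways to choose 2 genomes out of n, that is, n(n-1)/2. A short check, assuming Python 3.8 or later for math.comb (the function name pairwise_comparisons is ours):

```python
import math


def pairwise_comparisons(n_genomes):
    """Number of unordered genome pairs among n_genomes genomes: n(n-1)/2."""
    return math.comb(n_genomes, 2)


print(pairwise_comparisons(1000))   # 499500
print(pairwise_comparisons(10000))  # 49995000
```

The quadratic growth is why reducing the cost of incremental comparisons matters as the collection grows.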

Identifying Similar Genes

Genes are said to be homologous if they share a common ancestor.  Tools have been developed to construct estimates of whether or not two genes, or the protein sequences they encode, are homologous.  Most of these are based on measuring the degree of similarity between the genes based on some metric.  The most basic versions of this problem are

  1. Given two genes (or proteins), are they homologs?  That is, estimate the likelihood that they are homologs.
  2. Given a gene and a database of other genes, extract a prioritized list from the database of genes that are likely to be homologs.  Similarly, given a protein sequence and a database of other protein sequences, which are most likely to be produced by homologous genes?
  3. Produce an alignment of two DNA or protein sequences that attempts to show corresponding characters in the two sequences.   For example,
fig|226900.1.peg.4136      ------------------ATGAGTAAAATTATCGGTATTGACTTAGGTAC
fig|138677.1.peg.499       ATGAGTGAACACAAAAAATCAAGCAAAATTATAGGTATAGACTTAGGCAC
                                                ** ******** ***** ******** **

fig|226900.1.peg.4136      AACAAACTCTTGTGTAGCTGTTATGGAAGGTGGAGAACCAAAGGTTATCC
fig|138677.1.peg.499       AACAAACTCCTGCGTATCTGTTATGGAAGGAGGACAAGCTAAAGTAATTA
                           ********* ** *** ************* *** ** * ** ** **  

fig|226900.1.peg.4136      CAAATCCAGAAGGGAACCGTACAACACCTTCTGTTGTAGCTTTCAAAAAT
fig|138677.1.peg.499       CATCATCCGAAGGAACAAGAACCACGCCATCGATCGTTGCCTTCAAAGGT
                           **    * ***** *   * ** ** ** **  * ** ** ******  *

fig|226900.1.peg.4136      GAAGAACGTCAAGTTGGGGAAGTTGCAAAGCGCCAAGCAATTACAAACCC
fig|138677.1.peg.499       AATGAGAAATTAGTGGGGATTCCAGCAAAACGTCAAGCAGTGACAAATCC
                            * **      *** ***      ***** ** ****** * ***** **

fig|226900.1.peg.4136      AAATACAA---TCATGTCTGTTAAACGTCATATGGG---TACAGACTACA
fig|138677.1.peg.499       AGAAAAAACTCTCGGCTCTACAAAACGCTTTATTGGCCGTAAGTACTCTG
                           * * * **   **   ***   *****   *** **   **   ***   

fig|226900.1.peg.4136      AAGTAG--------------------------------------------
fig|138677.1.peg.499       AAGTAGCTTCGGAAATCCAAACCGTTCCTTATACAGTCACCTCCGGATCT
                           ******                                            

fig|226900.1.peg.4136      -------------------AAGTTGAAGGTAAAGATTATACACCTCAAGA
fig|138677.1.peg.499       AAAGGTGATGCCGTTTTCGAAGTTGATGGCAAACAATACACTCCAGAAGA
                                              ******* ** *** * ** ** **  ****

fig|226900.1.peg.4136      AATTTCTGCCATCATTTTACAAAACTTAAAAGCTTCTGCTGAAGCATACT
fig|138677.1.peg.499       AATTGGCGCACAAATCTTAATGAAAATGAAAGAGACAGCAGAAGCTTATC
                           ****   **    ** ***   **  * ****   * ** ***** **  

fig|226900.1.peg.4136      TAGGTGAAACAGTAACGAAAGCTGTTATTACAGTACCTGCATACTTCAAC
fig|138677.1.peg.499       TAGGCGAAACTGTCACAGAAGCAGTGATCACCGTCCCCGCATACTTCAAT
                           **** ***** ** **  **** ** ** ** ** ** *********** 

fig|226900.1.peg.4136      GATGCAGAGCGTCAAGCAACGAAAGATGCTGGTCGTATCGCTGGTTTAGA
fig|138677.1.peg.499       GATTCTCAACGAGCATCCACAAAAGATGCTGGACGCATTGCAGGTCTAGA
                           *** *  * **   * * ** *********** ** ** ** *** ****

fig|226900.1.peg.4136      AGTTGAGCGTATCATTAACGAGCCAACAGCAGCAGCACTTGCTTACGGTT
fig|138677.1.peg.499       TGTAAAACGTATCATTCCAGAACCTACCGCAGCAGCTCTTGCCTACGGAA
                            **  * *********   ** ** ** ******** ***** *****  

fig|226900.1.peg.4136      TAGAAAAACAAGACGAAGAACAAAAAATCTTAGTATATGACTTAGGTGGC
fig|138677.1.peg.499       TCGATAA---AGTCGGTGATAAAAAAATCGCTGTCTTCGACCTTGGTGGA
                           * ** **   ** **  **  ********   ** *  *** * ***** 
When two characters are in the same column, the implication is that we believe that they derived from the same character in an ancestral sequence. When a dash (i.e., a -) appears in a column, it indicates that we believe that a character was inserted into one sequence or deleted from the other.
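
One classical way to produce such a two-sequence alignment is dynamic programming in the style of Needleman and Wunsch. The sketch below is a minimal version with a simple scoring scheme (match +1, mismatch -1, gap -1) that we chose for illustration; we do not know which program or scoring produced the alignment shown above, and practical tools use more refined scoring and gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score); '-' marks an inserted gap.
    """
    n, m = len(a), len(b)

    def pair(i, j):
        # Score for placing a[i-1] and b[j-1] in the same column.
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


aligned_a, aligned_b, best = global_align("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print(best)  # 5
```

For the toy input, the best score is 5: six matching columns and one gap. Several placements of the dash achieve that score, which is a small-scale version of the uncertainty about where gaps belong in real alignments.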

Multiple-Sequence Alignment

A multiple-sequence alignment extends the notion of a binary alignment. We have already used them in discussing the problem of identifying the genes in genomes, but they represent a fundamental source of comparative insight and come into play in almost every aspect of analyzing genomic sequences. Consider the following piece of a multiple-sequence alignment:
fig|226900.1.peg.4136      -------------------MSKIIGIDLGTTNSCVAVME-GGEPKVIPNP
fig|95665.5.peg.505        ----------------------------------MAVIE-NKKPIVLENP
fig|138677.1.peg.499       -------------MSEHKKSSKIIGIDLGTTNSCVSVME-GGQAKVITSS
fig|243274.1.peg.368       ---------------MAEKKEFVVGIDLGTTNSVIAWMKPDGTVEVIPNA
fig|349521.5.peg.4864      MIRKIAVFSFLRANRGFQSSMSLIGIDLGTTNSLIAHWG-EQGVEIIPNR
fig|397945.5.peg.3653      -----------------MEQKMIIGIDLGTTNSLVAAWK-DGRSVLIPNA
                                                             ::         :: . 

fig|226900.1.peg.4136      EGNRTTPSVVAFK-NEERQVGEVAKRQAITNPN-TIMSVKRHMG------
fig|95665.5.peg.505        EGKRTVPSVVSFN-GDEVLVGDAAKRKQITNPN-TVSSIKRLMG------
fig|138677.1.peg.499       EGTRTTPSIVAFK-GNEKLVGIPAKRQAVTNPEKTLGSTKRFIGRKYSEV
fig|243274.1.peg.368       EGSRVTPSVVAFTKSGEILVGEPAKRQMILNPERTIKSIKRKMG------
fig|349521.5.peg.4864      LGARLTPSAVSLDADGAVIVGQAAKDRLVTHPDLSVASFKRRMG------
fig|397945.5.peg.3653      LGETLTPSCVSLDEDVTVLVGRAARERLQTHPDRTAANFKRYMG------
                            *   .** *::  .    **  *: :   :*: :  . ** :*      

fig|226900.1.peg.4136      ----------------TDYKVEVEGKDYTPQEISAIILQNLKASAEAYLG
fig|95665.5.peg.505        ----------------TKEKVTILNKEYTPEEISAKILSYIKDYAEKKLG
fig|138677.1.peg.499       ASEIQTVPYTVTSGSKGDAVFEVDGKQYTPEEIGAQILMKMKETAEAYLG
fig|243274.1.peg.368       ----------------TDYKVRIDDKEYTPQEISAFILKKLKNDAEAYLG
fig|349521.5.peg.4864      ----------------TNAAYTLGKQSFRPEELSALVLKQLKEDAEAYLN
fig|397945.5.peg.3653      ----------------SDRTVALAGRAFRPEELSSLVLRALKADAEAFLG
                                            .    :  : : *:*:.: :*  :*  **  *.

In actuality, these six sequences are part of a set of sequences that are fairly similar, and recognizably so. However, we believe that it is far from clear that the alignment above is actually "correct" or "optimal" in a meaningful sense. Rather, it is probably close to correct, but contains errors. Exactly where the dashes (called indels, since they represent characters that were either inserted or deleted) should be placed is uncertain.

There are two classes of problems associated with multiple-sequence alignments:

  1. how to compute them and
  2. how to use them.
Some of the most interesting problems are of the second sort -- using multiple-sequence alignments in what might be called molecular archaeology to uncover events in the evolutionary history of the sequences that occur in the alignment. On the other hand, one of the more important problems in bioinformatics is, as we accumulate collections of thousands of homologous sequences, the development of tools to support the construction and use of these multiple-sequence alignments.

Before we leave this topic, we will briefly describe a tool that we believe any computer scientist could build easily and that would reveal numerous research topics. Suppose that we have a single genome that we wish to analyze, and that we have computed all regions of similarity between sections of this genome and other complete genomes. For each character in the genome we are focused on, we can easily extract all regions in other genomes that are similar to regions in the focus genome that contain the given character. Further, each of the stored similarities (between a region in the given genome and one of the other genomes) has an associated percent identity (a measure of how similar the regions are - the percent of the aligned characters that are identical). Now, the utility that is needed is the ability to specify a region in the given genome, along with a range of desired similarities, and then the program would display the alignment composed of the selected similarity range (maybe with some representation of the consensus and how conserved the values are).
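
The core of the tool described above is a filter over stored similarities. The sketch below shows one possible shape for it. The record layout (Similarity), the field names, the coordinate convention (0-based, end-exclusive), and the function select_similarities are all assumptions made for illustration, and the three records are invented examples rather than real data. Rendering the selected hits as an alignment with a consensus is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Similarity:
    """One stored similarity between the focus genome and another genome."""
    focus_start: int         # start in the focus genome (0-based, inclusive)
    focus_end: int           # end in the focus genome (exclusive)
    other_genome: str        # identifier of the other genome
    other_start: int
    other_end: int
    percent_identity: float  # percent of aligned characters that are identical


def select_similarities(sims, region_start, region_end, min_identity, max_identity):
    """Return similarities overlapping the region whose identity is in range,
    most similar first."""
    hits = [
        s for s in sims
        if s.focus_start < region_end and region_start < s.focus_end
        and min_identity <= s.percent_identity <= max_identity
    ]
    return sorted(hits, key=lambda s: s.percent_identity, reverse=True)


# Invented example records (not real data).
example_sims = [
    Similarity(0, 100, "genome_1", 500, 600, 95.0),
    Similarity(50, 150, "genome_2", 10, 110, 72.5),
    Similarity(200, 300, "genome_3", 0, 100, 88.0),
]
for hit in select_similarities(example_sims, 60, 90, 70.0, 100.0):
    print(hit.other_genome, hit.percent_identity)
# genome_1 95.0
# genome_2 72.5
```

A linear scan is enough for a sketch. With similarities against thousands of genomes, an interval index over the focus coordinates would be the natural next step.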

Given a multiple sequence alignment, determine the most likely evolutionary history of the sequences (i.e., construct a phylogenetic tree).

Here is an example of a multiple sequence alignment:

seq3 -------------------MRYISTRGQAPALNFEDVLLAGLASDGGLYVPENLPRFTLE
seq4 -------------------MRYISTRGSAPTLSFEEVLLTGLASDGGLYVPESLPSFTSA
seq5 -------------------MNYISTRGAIAPIGFKDAVMMGLATDGGLLLPETIPALGRN
seq1 -------------------MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLT
seq2 MKIRVICGAPTPKPFIKIPMKYYSTNKQAPLASLEEAVVKGLASDKGLFMPMTIKPLPQE

seq3 EIASWVGLPYHELAFRVMRPFVAGSIADADFKKILEETYGVFAHDAVAPLRQLNGNEWVL
seq4 ELEAMASLDYPSLAHRILLPFVEEAFTGEELREIIDDTYAVFRHSAVAPLVQLDHNQWVL
seq5 TLESWQSLSYQDLAFNVIS-LFADDIPAQDLKDLIDRSYATFSHPEITPVVEKDG-VYIL
seq1 EIDEMLKLDFVTRSAKILSAFIGDEIPQEILEERVRAAFAFP-----APVANVESDVGCL
seq2 FYDEIENLSFREIAYRVADAFFGEDVPAETLKEIVYDTLNFD-----VPLVPVKENIYSL

seq3 ELFHGPTLAFKDFALQLLGRLLDHVLAKRGER-VVIMGATSGDTGSAAIEGCRRCDNVDI
seq4 ELFHGPTLAFKDFALQLLGRLLDAILKRRGEK-VVIMGATSGDTGSAAIAGCERCENIDI
seq5 ELFHGPTLAFKDVALQLLGNLFEYLLKERGEK-MNIVGATSGDTGSAAIYGVRGKDKINI
seq1 ELFHGPTLAFKDFGGRFMAQMLTHIA---GDKPVTILTATSGDTGAAVAHAFYGLPNVKV
seq2 ELFHGPTLAFKDVGGRFMARLLGYFIRKEGRKQVNVLVATSGDTGSAVANGFLGVEGIHV

seq3 FIMHPHNRVSEVQRRQMTTILGDNIHNIAIEGNFDDCQEMVKASFADQGFLK-GTRLVAV
seq4 FILHPHGRVSEVQRRQMTTLSAPTIHNLAIEGNFDDCQAMVKASFRDQSFLPDGRRLVAV
seq5 FILHPHGKTSPVQALQMTTVLDPNVHNIAARGTFDDCQNIVKSLFSDLPFKE-KYSLGAV
seq1 VILYPRGKISPLQEKLFCTLGG-NIETVAIDGDFDACQALVKQAFDDEELKV-ALGLNSA
seq2 YVLYPKGKVSEIQEKQFTTLGR-NITALEVDGTFDDCQALVKAAFMDQELNE-QLLLTSA

seq3 NSINWARIMAQIVYYFHAALQLG-APH-RSVAFSVPTGNFGDIFAGYLARNMGLPVSQLI
seq4 NSINWARIMAQIVYYFYAGLRLG-APH-RAAAYSVPTGNFGDIFAGYLASKMGLPVAQLM
seq5 NSINWARVLAQVVYYFYAYFRVA-ALFGQEVVFSVPTGNFGDIFAGYVAKRMGLPIRRLI
seq1 NSINISRLLAQICYYFEAVAQLPQETRNQ-LVVSVPSGNFGDLTAGLLAKSLGLPVKRFI
seq2 NSINVARFLPQAFYYFYAYAQLKKAGRAENVVICVPSGNFGNITAGLFGKKMGLPVRRFI

seq3 VATNRNDILHRFMSGNRYDKDTLHPSLSPSMDIMVSSNFERLLFDLHGRNGKAVAELLDA
seq4 IATNRNDVLHRLLSTGDYARQTLEHTLSPSMDISVSSNFERLMFDLYERDGAAIASLMAA
seq5 LATNENNILSRFINGGDYSLGDVVATVSPSMDIQLASNFERYVYYLFGENPARVREAFAA
seq1 AATNVNDTVPRFLHDGQWSPKATQATLSNAMDVSQPNNWPR-VEELFR------------
seq2 AANNKNDIFYQYLQTGQYNPRPSVATIANAMDVGDPSNFAR-VLDLYGGS----------

seq3 FKASGKLSVEDQRWTEARKLFDSLAVSDEQTCETIAEVYRSCGELLDPHTAIGVRAAREC
seq4 FDD-GDITLSDAAMEKARQLFASHRVDDAQTLACIADVWGRTEYLLDPHSAIGYAAATQP
seq5 LPTKGRIDFTEAEMEKVRDEFLSRSVNEDETIATIAAFHRETGYILDPHTAVGVKAALEL
seq1 -------------RKIWQLKELGYAAVDDETTQQTMRELKELGYTSEPHAAVAYRALRDQ
seq2 -------------HAAIAAEISGTTYTDEQIRESVKACWQQTGYLLDPHGACGYRALEEG

seq3 RRSLSVPMVTLGTAHPVKFPEAVEKAGIGQAPALPAHLADLFEREERCTVLPNELAKVQA
seq4 GANTQTPWVTLATAHPAKFPDAIKASAVGTTAQLPVHLADLFERSEHFDVLPNDIAAVQR
seq5 VQDG-TPAVCLATAHPAKFAEAVVR-AVGFEPSRPTSLEGIEALPSRCDVLDADRDAIKA
seq1 LNPG-EYGLFLGTAHPAKFKESVEA-ILGETLDLPKELAERADLPLLSHNLPADFAALRK
seq2 LQPG-ETGVFLETAHPAKFLQTVES-IIGTEVEIPAKLRAFMKGEKKSLPMTKEFADFKS

seq3 FVSQHGNRGKPL
seq4 FMSGHLGA----
seq5 FIEKKAL-----
seq1 LMMNHQ------
seq2 YLLGK-------
From the extant five sequences that are similar and displayed in the previous alignment, we can construct a tree that depicts the "phylogenetic history" of the sequences. Here is one reasonable tree for the last 5 sequences.
  ,----------------------- seq5
  |
  |
 -|
  |
  |
  |                                         ,---------------------------- seq3
  |                                         |
  |                                         |
  |                       ,-----------------|
  |                       |                 |
  |                       |                 |
  |                       |                 `--------------------------- seq4
  |                       |
  |                       |
  |                    ,--|
  |                    |  |
  |                    |  |
  |                    |  `---------------------------------------------- seq1
  |                    |
  |                    |
  `--------------------|
                       |
                       |
                       `------------------------------------------------ seq2
The tree suggests that at some point an ancestral cell replicated. One copy led (through a chain of descendants) to seq5, while the remaining sequences descend from the other copy.
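
To give a feel for how a tree can be derived from pairwise distances between aligned sequences, here is a minimal sketch of UPGMA, one of the simplest clustering methods used for this purpose. We do not know which method produced the tree shown above, and UPGMA makes strong assumptions (roughly constant rates of change) that more careful methods avoid. The function name upgma and the toy distances for four sequences named A to D are invented for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a tree by repeatedly merging the two closest clusters.

    names: list of leaf names.
    dist:  dict mapping frozenset({x, y}) -> distance for every pair of leaves.
    Returns the tree as nested 2-tuples of leaf names.
    """
    clusters = {name: 1 for name in names}  # tree -> number of leaves in it
    d = dict(dist)                          # working copy; grows as clusters merge
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        nx, ny = clusters.pop(x), clusters.pop(y)
        merged = (x, y)
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average of the distances from its two parts.
        for z in clusters:
            d[frozenset((merged, z))] = (
                nx * d[frozenset((x, z))] + ny * d[frozenset((y, z))]
            ) / (nx + ny)
        clusters[merged] = nx + ny
    return next(iter(clusters))


# Invented toy distances (not computed from the alignment above).
toy_names = ["A", "B", "C", "D"]
toy_dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("C", "D")): 4.0,
    frozenset(("A", "C")): 10.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "C")): 10.0,
    frozenset(("B", "D")): 10.0,
}
print(upgma(toy_names, toy_dist))  # (('A', 'B'), ('C', 'D'))
```

In practice the distances would be estimated from the alignment, for example as the fraction of aligned columns in which two sequences differ.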

Note that we now have alignments that contain thousands of sequences, and even displaying such trees is nontrivial. Because evolution plays such a central role in the phenomena we study, the construction of alignments and trees in order to compare extant versions of proteins and gain insight into their historical origins is considered basic to the task at hand.

What is "the tree of life" and How Might it Get Built?


The problem of constructing a single phylogenetic tree from a single alignment (the last problem) is relevant to this issue, but it does not cover it.  Suppose that you built 200 alignments  that contain the sequences common to almost all genomes.  Then, if you were to build 200 trees, and then you found that they were not identical (or even close in some cases), what would you infer, and how should you respond?  Is it even possible or desirable that we actually create an estimate of the history of how the existing micro-organisms have evolved from some ancestral organism?

Assuming that We Do Have an Estimate of the Tree of Life, which Proteins Characterize Subdivisions of the Tree?

It is clear that sequences are introduced into genomes through replication and (in addition) through horizontal transfer.  In the presence of large amounts of horizontal transfer, many genes will occur only in relatively small portions of a specific subtree (these represent relatively recent transfers).  Is it possible and meaningful to create inventories of proteins that tend to be unique to a subtree (or is the concept "tend to be unique" somewhat similar to "a little pregnant")?

Can We Identify Instances of Horizontal Transfer?

How can we construct tools to recognize horizontal transfer, and can these tools be good enough to sort out the actual details of the evolutionary history?

Can We Determine Which Columns and Sections of a Multiple-Sequence Alignment are Conserved (and Why)?

Conservation normally implies functional constraints (the reason a column has restricted content is that any evolutionary change  led to the death of the organism that had it).  Shifts of function relate to conserved sections that have changed (i.e., the sections are not random, but neither are they identical).  The correspondence between conservation and function is a rich source of significant problems.
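
A very simple way to quantify conservation is to ask, for each column, what fraction of the sequences carry the most common character. The sketch below does this for the first 20 columns of the third block of the five-sequence alignment shown earlier (seq3, seq4, seq5, seq1, seq2, in that order). The function name column_conservation is ours, and the measure is deliberately crude: it treats a dash like any other character and ignores how chemically similar two amino acids are.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column, return the fraction of rows holding the most common character."""
    return [
        Counter(column).most_common(1)[0][1] / len(column)
        for column in zip(*alignment)
    ]


block = [
    "ELFHGPTLAFKDFALQLLGR",  # seq3
    "ELFHGPTLAFKDFALQLLGR",  # seq4
    "ELFHGPTLAFKDVALQLLGN",  # seq5
    "ELFHGPTLAFKDFGGRFMAQ",  # seq1
    "ELFHGPTLAFKDVGGRFMAR",  # seq2
]
scores = column_conservation(block)
print(scores[:12])  # twelve fully conserved columns, each 1.0
print(scores[12])   # 0.6 (F, F, V, F, V)
```

The run of fully conserved columns (ELFHGPTLAFKD) is the kind of region where one would suspect a functional constraint.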

To What Extent Can Structure (Secondary or Tertiary) be Predicted from a Multiple-Sequence Alignment?

Comparison of columns in a large multiple sequence alignment was the key to developing secondary structures for both DNA alignments and protein alignments.

The Machines: an Initial Inventory

Energy Issues

The following diagram offers a summary of the machines that relate to acquisition and storage of energy, as well as the production of a number of key compounds by breaking up sugar:




    
M1 harvesting light energy
M2 building sugar from smaller components and energy
M3 Storing strings of sugar molecules as starch
M4 breaking up starch to give sugar
M5 breaking up sugar to get energy and smaller molecules

Many of our machines will need energy to run.  In the basic organism we are describing, we have included M1 to harvest energy from sunlight.  This process is called photosynthesis.  The cell stores energy in a molecule called ATP.  Whenever energy is needed, the molecule is broken into two pieces, releasing energy.  The cell maintains a fairly constant concentration of ATP, which allows reactions throughout the cell to depend on it.  This is similar in many respects to the way electricity is available throughout a house.  Appliances can be designed to plug in anywhere, and they assume the normal voltage will be available.  Similarly, we have a mechanism for maintaining the concentration of ATP, and this allows us to include reactions that depend on that concentration.

M2 is a machine that builds sugar from CO2 and energy.  This involves a number of transformations.  Eventually, we will need to examine the individual steps, but for now let us remain at this quite abstract level.

Machines M3 and M4  allow the cell to store sugars when energy is abundant, and then to use them later when energy is needed.  Starch should be thought of as just a string of sugar molecules, which is a convenient way to store them.  When sugar is needed, M4 can be used to break off a few.

Finally, M5 is a machine that takes sugar molecules and breaks them into smaller pieces, releasing energy (in the form of ATP) in the process.  These smaller molecules are the building blocks that are used  over and over to build things needed by the cell.  Here is a table that contains the abbreviations we use for these molecules.  Frankly, if you have not had biochemistry classes, you might simply work with the abbreviations, since the full names can be intimidating.

2OG 2-oxoglutarate
3PG 3-phosphoglycerate
A Adenine [one of the characters in a DNA string]
Ala Alanine [an amino acid]
Arg Arginine [an amino acid]
Asn Asparagine [an amino acid]
Asp Aspartate [an amino acid]
C Cytosine [one of the characters in a DNA string]
CHOR Chorismate
CO2 Carbon dioxide
Daughter genome the added cell after replication
E4P Erythrose 4-phosphate
Extra Membrane A little extra membrane for the new cell
G Guanine [one of the characters in a DNA string]
G6P Glucose 6-phosphate
Genome the DNA string in the cell that contains the genes
Gln Glutamine [an amino acid]
Glu Glutamate [an amino acid]
Gly Glycine [an amino acid]
HOM Homoserine
His Histidine [an amino acid]
Iso Isoleucine [an amino acid]
Leu Leucine [an amino acid]
Lys Lysine [an amino acid]
Membrane the thing enclosing the cell
Met Methionine [an amino acid]
OXLA Oxalacetate
PEP Phosphoenolpyruvate
PYR Pyruvate
Phe Phenylalanine [an amino acid]
Pro Proline [an amino acid]
R5P Ribose 5-phosphate
Ser Serine [an amino acid]
Starch A polymer of sugars (used for storage)
Sugar think glucose
T Thymine [one of the characters in a DNA string]
Thr Threonine [an amino acid]
Trp Tryptophan [an amino acid]
Tyr Tyrosine [an amino acid]
Val Valine [an amino acid]


Building the Amino Acids


M6 build glutamate and glutamine  from 2-oxoglutarate
M7 build proline from glutamate and ATP
M8 build aspartate from oxaloacetate
M9 build arginine from glutamate, aspartate, and ATP
M10 build asparagine from glutamine, aspartate, and ATP
M11 build serine from 3-phosphoglycerate and glutamate



M12 build glycine from serine
M13 build cysteine from serine
M14 build methionine from homoserine and cysteine
M15 build lysine from pyruvate and aspartate
M16 build homoserine from aspartate
M17 build threonine from homoserine and ATP
M18 build isoleucine from glutamate, threonine and pyruvate



M19 build alanine from pyruvate
M20 build valine from pyruvate
M21 Build leucine from pyruvate
M22 build the intermediate  chorismate from phosphoenolpyruvate and erythrose 4-phosphate
M23 build tyrosine and phenylalanine from glutamate and chorismate
M24 build tryptophan from chorismate and glutamine
M25 build ribose 5-phosphate from glucose-6-phosphate
M26 build histidine from ribose-5-phosphate and ATP
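
One way to work with this inventory is to write each machine down as a pair (inputs, outputs) using the abbreviations from the table above. The sketch below is only an illustration: it encodes M6 through M12, the names MACHINES and reachable are our own, and it makes the simplifying assumption that a compound is never used up once the cell has it.

```python
# Hypothetical encoding of machines M6-M12 as (inputs, outputs) pairs.
# Compound names are the abbreviations from the table above.
MACHINES = {
    "M6": ({"2OG"}, {"Glu", "Gln"}),
    "M7": ({"Glu", "ATP"}, {"Pro"}),
    "M8": ({"OXLA"}, {"Asp"}),
    "M9": ({"Glu", "Asp", "ATP"}, {"Arg"}),
    "M10": ({"Gln", "Asp", "ATP"}, {"Asn"}),
    "M11": ({"3PG", "Glu"}, {"Ser"}),
    "M12": ({"Ser"}, {"Gly"}),
}


def reachable(start, machines=MACHINES):
    """Return every compound obtainable from `start` by running machines.

    Assumption: compounds are not consumed, so a machine can run whenever
    all of its inputs are present.
    """
    have = set(start)
    changed = True
    while changed:
        changed = False
        for inputs, outputs in machines.values():
            # Run the machine if its inputs are available and it adds something new.
            if inputs <= have and not outputs <= have:
                have |= outputs
                changed = True
    return have


if __name__ == "__main__":
    print(sorted(reachable({"2OG", "OXLA", "3PG", "ATP"})))
```

Starting from the building blocks that M5 produces, a loop of this kind shows which amino acids the cell can build and which machines would have to be added to build the rest.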

Expressing Genes



M30 build a protein from amino acids and a gene

M30 is a complex machine that we have not represented all that well. It exists in the cell, and you might imagine the cell as containing free-floating amino acids (which are built by the machines discussed above). M30 can take the description of a protein encoded in a gene and build the protein from the instructions and the free-floating amino acids. It is certainly a complex and incredible machine, and it exists as a central component of the life forms we are studying.
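
The string-level effect of M30 can be sketched in a few lines using the genetic code table from Part 1. This is a minimal sketch of the abstraction only, not of the real machinery; the names CODE_TABLE, CODON_TO_AA, and express are our own, and the function assumes the gene string starts at the first codon.

```python
# Genetic code as given in the table in Part 1 ("*" marks the stop codons).
CODE_TABLE = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table: codon -> one-letter amino acid (or "*" for stop).
CODON_TO_AA = {
    codon: aa for aa, codons in CODE_TABLE.items() for codon in codons.split()
}


def express(gene):
    """Translate a gene (a DNA string) into a protein string.

    Reads one 3-character codon at a time and stops at the first stop codon.
    A codon containing characters other than A, C, G, T raises KeyError.
    """
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = CODON_TO_AA[gene[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(express("ATGGCTTGGTAA"))
```

In the example call, the codons are ATG, GCT, TGG, and the stop codon TAA, so the gene has three codons that name amino acids followed by one that means "stop here".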

Motility

The cell we envision has some motility.  It can "turn on its motor and propellers" to move a bit, turn off the motility machinery, wait a while, turn it on again, and so forth.
We do not show a diagram or table of this machine, but we shall number it M31.

Replication


Replication is described in a somewhat imprecise manner.  We think of M27 as a machine that builds the nucleotides, which are the characters that make up the DNA genome.  Then M28 is a machine that takes these loose "characters" floating in the cell, along with the existing genome, and manufactures a copy of the genome.  Finally, M29 takes some extra membrane (see the output of M5), along with the genome copy, and "pinches" the extended cell, creating two separate cells, which we call the "original" (containing the original genome) and the "daughter" (containing the copy of the genome).

 



M27 build nucleotides
M28 build new genome
M29 split the cell into original and daughter

Problems in BioInformatics that Can Be Done Once the Notion of "Function" Exists


The inventory of machines has led us (albeit circuitously) into a discussion of "the function of a protein" and how to think about it.  These problems relate to the use of comparative analysis between the protein sequences from many distinct genomes (and what clues we can expect to develop in our attempts to make sense of it all).

Identifying the Functions of Genes

The general topic of how to assign function to genes is central to genome annotation.  Deciding when you can safely project function based on similarity is a topic that can profitably be pondered.

Before leaving this topic, it is worth noting that a site called The Annotation Clearinghouse exists. This resource will allow users to download assertions of function that are considered to be reasonably reliable by human annotators manually curating the growing body of data. The assertions use widely differing IDs for genes (but a table for interconverting the IDs is provided), they use an uncontrolled vocabulary (although progress is being made in developing synonym lists), and many of the assertions are undoubtedly wrong. However, it is a start on a resource of central importance.

Predicting When Two Genes Implement Related Functions

There are many clues that can be used to improve the accuracy of function projection.  Conservation of contiguity, detection of gene fusions, protein-protein interaction data, and characterization of regulatory sites have all proven useful.  Integration of clues from a number of sources has been attempted (and will undoubtedly be important in the future).

In our view, the most useful set of clues to date has arisen from recognizing that genes that implement closely related functions (i.e., functions that are part of the same machine or of machines that implement connected functions) often occur close to one another in the genome. That is, if you take the genes that implement a machine, and you look at where these genes occur in the genome, the occurrences are not random. On average, about 50% of the genes that make up a machine will occur within 5000 characters of one another in the genome. In some genomes far fewer genes cluster (for reasons we do not fully understand).

To exploit this tendency, we might construct sets of pairs of genes. All pairs in a set occur close together in a genome (one of the ones in our collection). All of the first members of pairs are similar to one another, and all of the second members are similar to one another. The fact that all of the 2-tuples in each set have corresponding pairs that are similar might lead one to believe that all of the pairs implemented the same two abstract functions, but that is not the case. It is often, and perhaps usually, the case; but, there are many instances where the pairs implement distinct functions. For example, there are many cases in which 4 close genes implement a transport machine. For each of these transport machines, even though they transport completely different compounds, 3 of the 4 genes are pretty similar. The fourth gene is often the one that is specific to the compound being transported.

What we can say, assuming that we find enough entries in a set (that is, far more corresponding pairs than one would expect by chance), is that the functions of the genes in each pair are related. We cannot say with reliability that the actual functions in all of the pairs match up, but the ones in each pair will usually be related.
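
The construction of these sets of pairs can be sketched as follows. Several things here are assumptions made for illustration: similarity is represented by a precomputed "family" label on each gene, "close" means that the gap between the two genes is at most 5000 characters, and "enough entries" is approximated by a simple count of genomes (min_genomes) rather than by a real statistical test. The function names are our own.

```python
from collections import defaultdict
from itertools import combinations

MAX_GAP = 5000  # "close" threshold, borrowed from the 5000-character figure above


def close_pair_sets(genomes, max_gap=MAX_GAP):
    """Group pairs of close genes by the families of their two members.

    `genomes` maps a genome name to a list of (gene_id, family, start, end).
    Returns {(family_1, family_2): [(genome, gene_1, gene_2), ...]}.
    """
    sets = defaultdict(list)
    for name, genes in genomes.items():
        ordered = sorted(genes, key=lambda g: g[2])  # order by start position
        for a, b in combinations(ordered, 2):
            id_a, fam_a, _, end_a = a
            id_b, fam_b, start_b, _ = b
            # Skip pairs within one family and pairs that are too far apart.
            if fam_a == fam_b or start_b - end_a > max_gap:
                continue
            # Order each pair by family so corresponding members line up.
            first, second = sorted([(fam_a, id_a), (fam_b, id_b)])
            sets[(first[0], second[0])].append((name, first[1], second[1]))
    return dict(sets)


def well_supported(sets, min_genomes=2):
    """Keep only the sets whose pairs come from at least `min_genomes` genomes."""
    return {
        key: pairs
        for key, pairs in sets.items()
        if len({genome for genome, _, _ in pairs}) >= min_genomes
    }
```

A set that survives this filter says only that the two families are related; as noted above, it does not establish that every pair in the set implements the same two functions.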

Further, a single protein might well participate in pairs from several sets. By combining the evidence from all of these sets of pairs, it is possible to produce an estimate of all of the components in a machine, without really knowing the functions of any of them. That is, it becomes possible to say "I think that these four genes implement a machine", and to do so without having a clear idea of what the machine actually does. The information produced by examining conserved contiguity has not really been completely exploited. It has proved to be immensely useful, but there is far more to be gleaned from this data by those with some minimal creativity and statistical competence.
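
One simple way to combine the evidence is to treat each well-supported pair as a link and to collect everything that is linked together, directly or indirectly, into one candidate machine. The sketch below does exactly that (it computes connected components). This is our own choice of technique for illustration, not a claim about how such estimates are actually produced, and the name candidate_machines is hypothetical.

```python
from collections import defaultdict


def candidate_machines(related_pairs):
    """Group items into candidate machines from a list of related pairs.

    `related_pairs` is an iterable of (a, b) pairs judged to be related.
    Returns a list of sets; each set is one connected group.
    """
    neighbors = defaultdict(set)
    for a, b in related_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    groups = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk collecting everything linked to `start`.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(group)
    return groups


if __name__ == "__main__":
    pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
    print(candidate_machines(pairs))
```

Note that nothing in this procedure uses the function of any gene, which is the point made above: the groups come purely from conserved contiguity. A single spurious link will merge two unrelated groups, so a more careful treatment would weight the links by how strongly they are supported.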

Grouping Genes into Subsystems

The genes that encode proteins that together implement a single machine may be thought of as an instance of a subsystem.  In later tutorials we will discuss the notion of subsystem in more detail.  Essentially, it is an abstraction of the notion of machine, and it represents an important conceptual framework for analyzing the functions of genes from many genomes simultaneously.  So, how can you detect when two genes are components of the same machine?

Constructing Sets of Isofunctional Homologs

Homologs are genes that share a common ancestor.  Isofunctional genes implement the same function.  The goal of compiling sets of homologous genes (and the proteins they encode) that implement a single function is central to automating annotation of genomes.  Further, since we will be faced with annotating thousands of new genomes over the next few years (and the number will increase much more rapidly after that), almost all annotations will be automated.

Supporting Decision Procedures for Sets of Isofunctional Homologs

Suppose that you have a collection of sets of isofunctional homologs.  Suppose further that you have, say, 10,000 of these sets.  For each set, you will wish to develop a decision procedure which, when given as input a set and a new protein sequence, determines whether or not the protein should be added to the set.  In some cases, such decisions are easy, and you will wish to use a very fast decision procedure.  In others, they are very difficult, and you will need to bring many sources of clues to bear.
Construction of such decision procedures will become increasingly important.
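
To make the shape of such a decision procedure concrete, here is a deliberately crude sketch of the "very fast" kind. It assumes that shared 3-character substrings (k-mers) are an adequate stand-in for similarity and that a fixed threshold separates members from non-members. Both assumptions are ours, the names kmers and belongs are hypothetical, and a realistic procedure would use proper sequence comparison and additional clues.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of a protein string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def belongs(members, candidate, k=3, threshold=0.5):
    """Decide whether `candidate` should be added to the set `members`.

    Accepts the candidate if its k-mer overlap (Jaccard index) with at
    least one existing member reaches `threshold`.
    """
    cand = kmers(candidate, k)
    if not cand:
        return False  # sequence too short to compare
    best = max(
        (len(cand & kmers(m, k)) / len(cand | kmers(m, k)) for m in members),
        default=0.0,  # an empty set accepts nothing
    )
    return best >= threshold


if __name__ == "__main__":
    family = ["MKTAYIAKQR", "MKTAYIAKQK"]
    print(belongs(family, "MKTAYIAKQR"), belongs(family, "WWWWWWWWWW"))
```

The interface is the part worth noting: a set and a new protein sequence go in, and a yes-or-no answer comes out. Harder cases would keep the same interface while replacing the body with a slower procedure that draws on more evidence.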

Characterization of Regulons for a Genome

Genes are often co-regulated.  That is, expression of a set of genes may always be tightly coordinated.  In this case, we will think of the co-regulated set as a regulon.  Determining which genes make up which regulons is a task that requires both bioinformatic analysis and wet lab confirmation.  Don't attempt this one without a close working relationship with a wet lab biologist.

Characterization of "States of the Cell"

It might be conjectured that a cell has a limited set of states.  Each state is characterized by the set of regulons that are expressed.  It seems likely that the cell should be viewed as "tending to stay in the same state" until forced to make a transition to another state.  That is, the states demonstrate a degree of homeostasis.  If we understood a comprehensive list of states, and we worked out the forces that determine transitions, we would begin to understand the cell as a dynamic system.
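
This conjecture has the form of a simple state machine, which can be sketched as follows. Everything specific in the sketch is invented for illustration: the two states, the regulon names inside them, and the signals that force transitions are hypothetical and are not taken from any real organism.

```python
# A state is the set of regulons that are expressed (names are hypothetical).
GROWTH = frozenset({"amino_acid_synthesis", "replication"})
FORAGING = frozenset({"motility"})

# Forced transitions: (current state, signal) -> next state.
TRANSITIONS = {
    (GROWTH, "sugar_depleted"): FORAGING,
    (FORAGING, "sugar_found"): GROWTH,
}


def step(state, signal, transitions=TRANSITIONS):
    """Return the next state; the cell stays put unless a transition is forced."""
    return transitions.get((state, signal), state)


def run(state, signals):
    """Apply a sequence of signals, one at a time, and return the final state."""
    for signal in signals:
        state = step(state, signal)
    return state


if __name__ == "__main__":
    final = run(GROWTH, ["sugar_depleted", "noise", "sugar_found"])
    print(sorted(final))
```

The default in step is what models the "tending to stay in the same state" behavior: any signal that does not appear in the transition table leaves the state unchanged.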