Sequencing of genomes is laying the foundation for advances in science that will dramatically reshape our society. These advances will initially occur in medicine, agriculture, and chemical production, but in the long term the impact will be pervasive. The computer revolution started by impacting payrolls, but eventually allowed man to travel to the moon. Similarly, the biological revolution is beginning by reshaping the life sciences, but this will surely not be the whole story or even the most significant outcome.
The interpretation of genomes will constitute the most exciting and most significant science of the century. By rapidly advancing our understanding of life, how it arose, and how it continues to change, we will acquire the tools that will allow us to better understand and improve our existence. Understanding will begin with relatively simple forms of life -- unicellular organisms. While the central mechanisms of life are shared by these organisms and the most complex animals and plants, unicellular organisms also exhibit remarkable diversity. They have an immense amount to teach us about life itself, and we will need to master these lessons before a full understanding of complex genomes is achievable.
The Fellowship for Interpretation of Genomes will focus on organizing the data needed to support interpretation of genomes, providing the infrastructure needed by the world community in its efforts to achieve understanding. In addition, we will ourselves pick specific, critical problems and attempt to actively participate in the unravelling of the secrets within these amazing entities. It is only by merging the work of building infrastructure with the applications that use it that we will more deeply understand what is needed at each step.
The FIG Architecture: the Seed
We begin with the "seed" of FIG. The seed contains the essential, basic elements that are needed to sustain a scalable integration of thousands of genomes. The later parts of this document will attempt to offer precise notions of what makes up the seed of FIG. I will cover the basic types of objects, make comments on what extensions will be needed to support hundreds of thousands of genomes, and offer an implementation plan.
However, before we go into such detail, some broad notions should be discussed. The idea of integrating hundreds of thousands of genomes needs some clarification. Indeed, what is meant by integrating a bunch of genomes, no matter what the number? In my mind, the notion of integration is essentially "maintenance of notions of neighborhood, allowing forms of access that can be used to easily explore connections and comparisons between data from numerous genomes". This may be viewed as a complicated way to say "a framework to support comparative analysis". To be a shade more precise:
- Genes from a single genome are often "functionally related" in that they participate in implementing a single pathway or subsystem. For any single gene, the "functional neighborhood" of that gene is the set of genes that are functionally related to the gene. Supporting access based on this notion of neighborhood requires an encoding of the cellular machinery (e.g., pathways).
- Genes that occur close to each other on a chromosome may be thought of as "positionally related". The set of genes that are positionally related to a given gene amounts to the "positional neighborhood" of the gene. One of the huge payoffs of integrating data has come from a correlation, in prokaryotic genomes, between the neighborhoods imposed by "functionally related" and "positionally related".
- Genes from one or more genomes that share a common ancestor are called "homologous". Homology induces yet another notion of neighborhood. One can build more restricted neighborhoods upon this basic concept. Thus, I tend to think of a protein family as a set of homologous genes that have a common function (a very imprecise notion, I grant). Maintenance of protein families will, of course, be an absolutely essential part of effectively integrating many thousands of genomes.
- Sets of very closely related genomes may be viewed as a neighborhood (i.e., the neighborhood of a genome becomes a set of closely related genomes). One can layer a notion of "variation", including SNPs, on the notion of closely related genomes, and then whole frameworks for exploring minor variations become possible.
The power in an integration arises from mixing the different notions of neighborhood. The tools for supporting effective use of a variety of comparative notions constitute the computational framework for comparative analysis, which is often abbreviated to the notion of "integration".
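To make the notions of neighborhood above a little more concrete, here is a minimal sketch in Python. It is an illustration only, not the FIG design: the Gene record, its field names, the 5000-base window, and every function name below are assumptions invented for this example, and the gene data is made up. The last function shows one way of mixing notions, by asking which protein families turn up near homologs of a gene in more than one genome.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; all field names are assumptions."""
    gene_id: str
    genome: str
    contig: str
    start: int                                # start coordinate on the contig
    family: Optional[str] = None              # protein-family label, if any
    subsystems: FrozenSet[str] = frozenset()  # pathways/subsystems the gene is in


def positional_neighborhood(gene: Gene, genes: Iterable[Gene],
                            window: int = 5000) -> Set[Gene]:
    """Genes on the same contig whose start lies within `window` bases."""
    return {g for g in genes
            if g != gene and g.genome == gene.genome
            and g.contig == gene.contig
            and abs(g.start - gene.start) <= window}


def functional_neighborhood(gene: Gene, genes: Iterable[Gene]) -> Set[Gene]:
    """Genes in the same genome sharing at least one subsystem."""
    return {g for g in genes
            if g != gene and g.genome == gene.genome
            and g.subsystems & gene.subsystems}


def homology_neighborhood(gene: Gene, genes: Iterable[Gene]) -> Set[Gene]:
    """Genes, from any genome, carrying the same family label."""
    return {g for g in genes
            if g != gene and gene.family is not None
            and g.family == gene.family}


def conserved_neighbor_families(gene: Gene, genes: Iterable[Gene],
                                window: int = 5000,
                                min_genomes: int = 2) -> Set[str]:
    """Mix homology with position: families found near the gene or its
    homologs in at least `min_genomes` distinct genomes."""
    genes = list(genes)
    genomes_by_family = defaultdict(set)
    for member in {gene} | homology_neighborhood(gene, genes):
        for nb in positional_neighborhood(member, genes, window):
            if nb.family is not None and nb.family != gene.family:
                genomes_by_family[nb.family].add(member.genome)
    return {fam for fam, seen in genomes_by_family.items()
            if len(seen) >= min_genomes}


# Made-up data: two genomes, A and B, each with a famX gene next to a famY gene.
GENES = [
    Gene("g1", "A", "c1", 1000, "famX", frozenset({"glycolysis"})),
    Gene("g2", "A", "c1", 2500, "famY", frozenset({"glycolysis"})),
    Gene("g3", "A", "c1", 90000, "famZ"),
    Gene("h1", "B", "c1", 500, "famX", frozenset({"glycolysis"})),
    Gene("h2", "B", "c1", 1800, "famY", frozenset({"glycolysis"})),
]
```

With this toy data, conserved_neighbor_families(GENES[0], GENES) yields {"famY"}: famY sits next to famX in both genomes, which is the kind of positional evidence that, in prokaryotes, tends to correlate with a functional relationship. A real system would of course need indexed storage rather than linear scans over a list.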
FIG will offer the key services required to architect and implement a comparative framework for interpreting genomes.