A Short Essay on Starting a Bioinformatics Curriculum by Ross Overbeek

I am occasionally invited to join discussions relating to the topic of bioinformatics education. I would like to believe that I know a fair amount about curriculum development due to a rather intense 10-year period in my life in which a group of us built a computer science department from scratch. It was one of the most wonderful and demanding experiences of my life. I also believe that I know a fair amount about bioinformatics, although it is certainly true that I am ignorant of many important aspects. In this short essay, I wish to present one approach to "getting started". I believe that almost all colleges and universities now face an unusual opportunity, which is in many respects similar to the opportunity we faced in computer science at the start of the 1970s. It is a time of rapid change and unpredictable scientific advances; departments will have to integrate new facts, ideas and technologies at a rate that many will find jarring. In most cases, the curricula will adapt slowly, will ultimately reach a tolerable state, and no one will even reflect on "what might have been". However, for the few willing to really give it a shot, here is something to think about.

I am going to propose a two-course sequence that is similar in several respects to courses now being taught at a very few universities (almost all at the graduate level). I would suggest this two-course sequence as appropriate for junior/senior undergraduates and graduate students. Whenever new courses are proposed, there is a belief that "you need someone with adequate background" to teach them. For what I am proposing, you need someone with a reasonable background in biochemistry, a strong desire to move into the analysis of genomes, and a willingness to spend time learning new material. That is all.

The First Course

Understanding Subsystems

The Project to Annotate 1000 Genomes introduced a framework for annotation of subsystems. For purposes of this discussion, consider the annotation of a subsystem to be a detailed analysis of a metabolic pathway as it is manifested in the existing collection of genomes. Understanding how to construct and work with these populated subsystems is an extremely useful skill. The first half of this initial course should be devoted to having students take a standard metabolic pathway, along with one or more review articles, do a thorough review of whatever has been encoded in a currently existing subsystem, and add one or more genomes to the subsystem. The first subsystem that I encoded was Histidine Biosynthesis, and that (or any other central pathway) would be a perfectly acceptable starting point. This amounts to roughly 7 weeks devoted to developing a reasonably detailed understanding of a single component of metabolism, analyzing which variants are known to exist, locating and evaluating the annotations relating to existing genomes, and tabulating the outstanding questions.

The process of reviewing and analyzing an existing populated subsystem can be used as a framework for introducing numerous fundamental concepts. For example, I would expect at least the following to be discussed:

The explicit goal would be for the student to develop a thorough understanding of the pathway and to evaluate/improve the existing analysis.

Is this too much to ask for seven weeks? My response is to draw analogies with what we did in computer science. The basic principles are just

  1. the lectures are used to support the students' effort to do the assignment (basically, all real education will come from actually doing the assignment, rather than the lectures), and
  2. you can create assignments about very demanding topics by supplying as much framework as necessary to make the assignment addressable by the student.
This second point is worth amplifying. In the old days we asked students to implement simple operating systems. Realistically, this is not achievable (in most cases) in one semester. However, we supplied some sections, they plugged in others, and the resulting assignments were manageable and hugely enlightening. I would argue that many of the topics I mentioned could be related to specific, small aspects of the overall assignment of evaluating/improving/extending an existing subsystem.

Developing Metabolic Reconstructions

The second half of the first course would be devoted to developing a detailed metabolic reconstruction for a single organism. In existing classes, graduate students usually spend an entire semester on such a topic. However, the advances that have taken place during the last year in the development of subsystems now make it feasible to cover far more, far faster.

The student would go through the following steps for a specific genome:

  1. Determine the set of subsystems that contain entries with operational variants.
  2. Organize these subsystems into a basic hierarchy composed of the standard topics (amino acid metabolism, energy metabolism, transcription, translation, ...).
  3. Create a list of the reactions attached to roles in these subsystems.
  4. Construct an initial, partial reaction network.
  5. Determine the set of genes that apparently implement enzymatic roles, but are not included in subsystems.
  6. Determine how this list should be integrated into the emerging reconstruction.
  7. Compose a list of enzymatic functions that are believed to be present, but which cannot be connected to specific genes.
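
As a rough illustration of how these steps fit together, here is a minimal Python sketch. Everything in it is an assumption made for the example: the data layout, the names SUBSYSTEMS, REACTIONS, GENES and reconstruct, the toy genomes, the placeholder reaction ids, and the convention that variant codes "-1" and "0" mean "no operational variant". A real exercise would draw this information from the populated subsystems rather than from hand-written dictionaries. Step 6 is left out because it is a matter of judgment rather than computation, and the sketch ignores the fact that different variants of a subsystem use different sets of roles.

```python
from collections import defaultdict

# Toy data invented for illustration; not taken from any real subsystem export.
SUBSYSTEMS = {
    "Histidine Biosynthesis": {
        "category": "amino acid metabolism",
        "variants": {"genome_A": "1", "genome_B": "-1"},
        "roles": {
            "ATP phosphoribosyltransferase (EC 2.4.2.17)": ["R1"],
            "Histidinol dehydrogenase (EC 1.1.1.23)": ["R2"],
        },
    },
    "Glycolysis": {
        "category": "energy metabolism",
        "variants": {"genome_A": "2"},
        "roles": {"Pyruvate kinase (EC 2.7.1.40)": ["R3"]},
    },
}
# Placeholder reaction ids -> (substrates, products).
REACTIONS = {
    "R1": (["ATP", "PRPP"], ["PR-ATP"]),
    "R2": (["histidinol"], ["histidine"]),
    "R3": (["PEP", "ADP"], ["pyruvate", "ATP"]),
}
# Gene -> annotated role, per genome.
GENES = {
    "genome_A": {
        "g1": "ATP phosphoribosyltransferase (EC 2.4.2.17)",
        "g2": "Pyruvate kinase (EC 2.7.1.40)",
        "g3": "Alcohol dehydrogenase (EC 1.1.1.1)",
        "g4": "hypothetical protein",
    }
}
NON_OPERATIONAL = {"-1", "0"}  # assumed convention for "no operational variant"


def reconstruct(genome):
    gene_roles = GENES.get(genome, {})
    annotated_roles = set(gene_roles.values())

    # Step 1: subsystems in which this genome has an operational variant.
    active = {
        name: sub for name, sub in SUBSYSTEMS.items()
        if sub["variants"].get(genome, "-1") not in NON_OPERATIONAL
    }

    # Step 2: group those subsystems under the standard topics.
    hierarchy = defaultdict(list)
    for name, sub in active.items():
        hierarchy[sub["category"]].append(name)

    # Step 3: reactions attached to roles in the active subsystems.
    subsystem_roles = {
        role: rxns for sub in active.values() for role, rxns in sub["roles"].items()
    }
    reactions = sorted({r for rxns in subsystem_roles.values() for r in rxns})

    # Step 4: initial, partial network as substrate -> products edges.
    network = defaultdict(set)
    for rxn in reactions:
        substrates, products = REACTIONS[rxn]
        for substrate in substrates:
            network[substrate].update(products)

    # Step 5: genes with an EC-numbered role that no active subsystem covers.
    unplaced = sorted(
        gene for gene, role in gene_roles.items()
        if "(EC " in role and role not in subsystem_roles
    )

    # Step 7: roles expected from the subsystems but not tied to any gene.
    missing = sorted(r for r in subsystem_roles if r not in annotated_roles)

    return {
        "hierarchy": dict(hierarchy),
        "reactions": reactions,
        "network": {k: sorted(v) for k, v in network.items()},
        "unplaced_genes": unplaced,
        "missing_roles": missing,
    }
```

Calling reconstruct("genome_A") on this toy data flags g3 as an enzymatic gene outside the subsystems and reports histidinol dehydrogenase as a role with no gene. Those two lists are exactly where the student's own analysis would begin.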

This would be done only for genomes that had been integrated into the large and growing set of populated subsystems (see the subsystems available on the publicly available SEED at the University of Chicago). Picking appropriate genomes is, of course, essential. The instructor should limit the choices to genomes with 5000 or fewer genes (those with 1500-2500 would probably be the best picks in most cases, but good arguments can be made for looking at very small genomes and larger genomes in special cases).

Tools to simplify each of the steps either now exist or are now in development. It would seem useful to me to compose a list of properties such as

These properties all correspond to variant codes associated with specific subsystems. The professor should supply a list of, perhaps, 100 such properties, and the student would be responsible for producing the assessment for his genome. Note that this component of the exercise can (and maybe should) be fully automated. The student does need to understand how the properties are inferred and what they actually imply.
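
To suggest what automating this component might look like, here is a small Python sketch. The property names, the subsystem names, the variant codes treated as "implying" each property, and the names PROPERTY_RULES and assess are all hypothetical, chosen only to show the shape of the computation. The real list of properties and their corresponding variant codes would come from the professor.

```python
# Hypothetical rules: property -> (subsystem, variant codes that imply the property).
PROPERTY_RULES = {
    "can synthesize histidine": ("Histidine Biosynthesis", {"1", "2"}),
    "has a complete TCA cycle": ("TCA Cycle", {"1"}),
}


def assess(variant_codes, rules=PROPERTY_RULES):
    """Map each property to True, False, or None (no variant code recorded)."""
    assessment = {}
    for prop, (subsystem, implying_codes) in rules.items():
        code = variant_codes.get(subsystem)
        # None keeps "not yet analyzed" distinct from "analyzed and absent".
        assessment[prop] = None if code is None else code in implying_codes
    return assessment


# Variant codes for one made-up genome, keyed by subsystem name.
genome_codes = {"Histidine Biosynthesis": "1"}
print(assess(genome_codes))
```

Keeping the three outcomes separate matters for the exercise: a property reported as None tells the student that the genome has not been placed in that subsystem at all, which is a different question from whether the pathway is absent.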

The Second Course

The second course would be focused on converting one of the two products of the first course (i.e., the enhanced subsystem or the metabolic reconstruction) into a publication. Clearly, entry to the course would be restricted to students who had done a reasonable job in the first course.

Producing a Publishable Enhanced Subsystem

This is, in my opinion, the more difficult of the two options. The best way to understand the goal of the effort is to read "Ancient origin of the tryptophan operon and the dynamics of evolutionary change" by Jensen's team. This is the best example I have read, but it represents an order of magnitude more experience and effort than any student can achieve in a single course. However, it does establish a goal and a basic approach. To emulate this effort for other subsystems seems to me to be a reasonable objective. I also believe that students can produce a publishable review within the semester in many cases, and this would be an experience of real significance for them.

Producing a Publishable Metabolic Reconstruction

Bernhard Palsson has pioneered technology for developing metabolic reconstructions to the stage where they can fruitfully be employed to support metabolic modeling. He has also pioneered teaching this technology to students. To see what they are producing, let me simply point to the papers they published describing models for Staphylococcus aureus and Helicobacter pylori (note that a recent paper discussing the extension and use of this last model has appeared in J. Bact., Aug 2005, 5818-5830).

The production of such models still represents a publishable effort, but this will probably not be true two years from now. By then, the integration of wet lab confirmations will be required. These efforts are path-breaking, and I assume (and hope) that some other groups will start mass-producing these reconstructions.


This is intended to be a short proposal arguing for the immediate introduction of a 2-course sequence leading to publishable efforts. The effects of initiating such an effort would include
  1. rapid education of both instructors and students,
  2. clarification of where else to expand the curriculum,
  3. testable conjectures that act as input to more standard courses in biochemistry and molecular biology, and
  4. taking a pioneering role in the coming genomics era.
A few groups have already begun efforts like this, but we are at a really unusual point in scientific progress, and this will remain a major opportunity for the next few years. This particular way to start will undoubtedly be replaced by others focusing on new advances. As in most fields driven by rapid technological change, each stage is just a warm-up for the next. I think that many schools would do well to take my advice, but it is also true that new opportunities will continue to emerge rapidly.

I will add comments on specifics relating to this proposal in separate documents: