A Short Essay on Starting a Bioinformatics Curriculum
by Ross Overbeek
I am occasionally invited to join discussions relating to the topic of
bioinformatics education. I would like to believe that I know a fair
amount about curriculum development due to a rather intense 10-year
period in my life in which a group of us built a computer science
department from scratch. It was one of the most wonderful and
demanding experiences of my life. I also believe that I know a fair
amount about bioinformatics, although it is certainly true that I am
ignorant of many important aspects. In this short essay, I wish to
present one approach to "getting started". I believe that almost all
colleges and universities now face an unusual opportunity, which is in
many respects similar to the opportunity we faced in computer science
at the start of the 1970s. It is a time of rapid change and
unpredictable scientific advances; departments will have to
integrate new facts, ideas and technologies at a rate that many will
find jarring. In most cases, the curricula will adapt slowly, will
ultimately reach a tolerable state, and no one will even reflect on
"what might have been". However, for the few willing to really give
it a shot, here is something to think about.
I am going to propose a two-course sequence that is similar in several
respects to courses now being taught at a very few universities
(almost all at the graduate level). I would suggest this two-course
sequence as appropriate for junior/senior undergraduates and graduate
students. In all cases in which new courses are proposed, there is a
belief that "you need someone with adequate background" to teach
them. For what I am proposing, you need someone with a reasonable
background in biochemistry, a strong desire to move into the
analysis of genomes, and a willingness to spend time to learn new
material. That is all.
The First Course
Understanding Subsystems
The Project to Annotate 1000 Genomes introduced a framework for the
annotation of subsystems. For purposes of this discussion, consider the
annotation of a subsystem to be a detailed analysis of a metabolic
pathway as it is manifested in the existing collection of genomes.
Understanding how to construct and work with these populated
subsystems is an extremely useful skill. The first half of this
initial course should be devoted to having students take a standard
metabolic pathway, along with one or more review articles, do a
thorough review of whatever has been encoded in a currently existing
subsystem, and add one or more genomes to the subsystem. The first
subsystem that I encoded was Histidine Biosynthesis, and that
(or any other central pathway) would be a perfectly acceptable starting
point. This amounts to roughly 7 weeks devoted to developing a
reasonably detailed understanding of a single component of metabolism,
analyzing which variants are known to exist, locating and evaluating
the annotations relating to existing genomes, and tabulating the
outstanding questions.
The process of reviewing and analyzing an existing populated
subsystem can be used as a framework for introducing numerous
fundamental concepts. For example, I would expect at least the
following to be discussed:
- the tendency of functionally related prokaryotic genes to cluster
on the chromosome,
- problems related to identifying orthologs and paralogs,
- creation of multiple sequence alignments,
- use of phylogenetic trees to separate out paralogs,
- gene calling (not exactly how it is done, but at least an overview
of the basic issues),
- gene fusions,
- use of Hidden Markov Models (HMMs), and
- non-orthologous gene displacement.
The explicit goal would be for the student to develop a thorough
understanding of the pathway and to evaluate/improve the existing
analysis.
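
The first topic in the list, chromosomal clustering, lends itself to a small exercise. What follows is only a minimal sketch in Python, assuming the student already has the ordinal positions on one contig of the genes assigned to a subsystem. The function name, the `max_gap` and `min_size` parameters, and the positions in the usage note are all hypothetical and are not taken from any existing tool.

```python
def find_clusters(positions, max_gap=5, min_size=2):
    """Group subsystem genes that lie close together on one contig.

    positions: gene id -> ordinal position of the gene on the contig
    max_gap:   largest position difference allowed between neighbours
    min_size:  smallest group that is reported as a cluster
    (All names and defaults here are illustrative assumptions.)
    """
    ordered = sorted(positions.items(), key=lambda item: item[1])
    clusters, current = [], []
    for gene, pos in ordered:
        # Close the current group when the next gene is too far away.
        if current and pos - current[-1][1] > max_gap:
            if len(current) >= min_size:
                clusters.append([g for g, _ in current])
            current = []
        current.append((gene, pos))
    # Do not forget the last group.
    if len(current) >= min_size:
        clusters.append([g for g, _ in current])
    return clusters
```

Called with made-up positions such as `{"hisA": 10, "hisB": 11, "hisC": 13, "hisD": 40}`, the function groups the first three genes and leaves the distant one out. A real assignment would also have to handle strands, multiple contigs, and circular chromosomes.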
Is this too much to ask for seven weeks? My response is to draw
analogies with what we did in computer science. The basic principles
are just
- the lectures are used to support the students' effort to do the
assignment (basically, all real education will come from actually
doing the assignment, rather than the lectures), and
- you can create assignments about very demanding topics by
supplying as much framework as necessary to make the assignment
addressable by the student.
This second point is worth amplifying. In the old days we asked
students to implement simple operating systems. Realistically, this
is not achievable (in most cases) in one semester. However, we
supplied sections, they plugged in sections, and the resulting assignments
were manageable and hugely enlightening. I would argue that many
of the topics I mentioned could be related to specific, small aspects
of the overall assignment of evaluating/improving/extending an
existing subsystem.
Developing Metabolic Reconstructions
The second half of the first course would be devoted to developing a
detailed metabolic reconstruction for a single organism. In existing
classes, graduate students usually spend an entire semester on such a
topic. However, the advances that have taken place during the last
year in the development of subsystems now make it feasible to cover
far more, far faster.
The student would go through the following steps for a specific genome:
- Determine the set of subsystems that contain entries with
operational variants.
- Organize these subsystems into a basic hierarchy composed of the
standard topics (amino acid metabolism, energy metabolism,
transcription, translation, ...).
- Create a list of the reactions attached to roles in these
subsystems.
- Construct an initial, partial reaction network.
- Determine the set of genes that apparently implement enzymatic
roles, but are not included in subsystems.
- Determine how this list should be integrated into the emerging
reconstruction.
- Compose a list of enzymatic functions that are believed to be
present, but which cannot be connected to specific genes.
This would be done only for genomes that had been integrated into the
large and growing, publicly-available set of populated subsystems (see
the subsystems available on the publicly-available SEED at the
University of Chicago).
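
The mechanical parts of the steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SEED's actual data model: the `Subsystem` class, the `reconstruct` function, the layout of the input dictionaries, and the convention that variant codes "0" and "-1" mark a non-operational variant are all hypothetical choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Subsystem:
    # Hypothetical record layout, invented for this sketch.
    name: str
    category: str            # e.g. "amino acid metabolism"
    role_reactions: dict     # functional role -> list of reaction ids
    variant_by_genome: dict  # genome id -> variant code


def is_operational(variant):
    # Assumption: "0" and "-1" stand for undetermined or absent variants.
    return variant not in ("0", "-1")


def reconstruct(genome, subsystems, gene_roles, reactions, expected_roles):
    """Return the pieces of a first-pass reconstruction for one genome.

    gene_roles:     gene id -> enzymatic role assigned to that gene
    reactions:      reaction id -> (substrates, products)
    expected_roles: roles believed to be present in the organism
    """
    # Subsystems with an operational variant in this genome.
    active = [s for s in subsystems
              if is_operational(s.variant_by_genome.get(genome, "-1"))]

    # Group them under the standard top-level topics.
    hierarchy = defaultdict(list)
    for s in active:
        hierarchy[s.category].append(s.name)

    # Reactions attached to roles that some gene actually fills.
    assigned = set(gene_roles.values())
    reaction_ids = sorted({rid
                           for s in active
                           for role, rids in s.role_reactions.items()
                           if role in assigned
                           for rid in rids})

    # Initial, partial network: substrate -> set of products.
    network = defaultdict(set)
    for rid in reaction_ids:
        substrates, products = reactions[rid]
        for substrate in substrates:
            network[substrate].update(products)

    # Genes with enzymatic roles that no active subsystem covers.
    covered = {role for s in active for role in s.role_reactions}
    unplaced = sorted(g for g, role in gene_roles.items()
                      if role not in covered)

    # Expected functions that cannot be connected to a gene.
    missing = sorted(set(expected_roles) - assigned)

    return dict(hierarchy), reaction_ids, dict(network), unplaced, missing
```

The sketch deliberately stops short of the integration step: deciding how the unplaced genes fit into the emerging reconstruction is a matter of judgment, and that judgment is where the student learns the most.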
Picking appropriate genomes is, of course, essential. The instructor
should limit the choices to genomes with 5000 or fewer genes (those
with 1500-2500 would probably be the best picks in most cases, but
good arguments can be made for looking at very small genomes and
larger genomes in special cases).
Tools to simplify each of the steps either now exist or are now in
development. It would seem useful to me to compose a list of
properties such as
- capable of synthesizing histidine,
- capable of supporting de novo purine biosynthesis,
- photosynthetic, ...
These properties all correspond to variant codes associated with
specific subsystems. The professor should supply a list of, perhaps,
100 such properties, and the student would be responsible for
producing the assessment for his genome. Note that this component of
the exercise can (and maybe should) be fully automated. The student
does need to understand how the properties are inferred and what they
actually imply.
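
To make the idea of automating this assessment concrete, here is a minimal Python sketch. It assumes that each property can be expressed as "subsystem S has one of these variant codes in the genome". The rule table, the subsystem names other than Histidine Biosynthesis, and the variant codes themselves are invented for illustration and do not reflect the actual codes used in any subsystem.

```python
# Hypothetical rule table: property -> (subsystem name, variant codes that imply it).
PROPERTY_RULES = {
    "capable of synthesizing histidine":
        ("Histidine Biosynthesis", {"1", "2"}),
    "capable of supporting de novo purine biosynthesis":
        ("De Novo Purine Biosynthesis", {"1"}),
    "photosynthetic":
        ("Photosystem", {"1", "3"}),
}


def assess_properties(genome_variants, rules=PROPERTY_RULES):
    """Infer yes/no properties for one genome from its variant codes.

    genome_variants: subsystem name -> variant code recorded for the genome.
    A subsystem with no recorded variant counts as "property not shown".
    """
    return {prop: genome_variants.get(subsystem) in codes
            for prop, (subsystem, codes) in rules.items()}
```

For a genome recorded as `{"Histidine Biosynthesis": "2"}`, the first property comes out true and the other two false. The inference is a simple table lookup. What the student still has to understand is why a given variant code justifies the property.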
The Second Course
The second course would be focused on converting one of the two
products of the first course (i.e., the enhanced subsystem or the
metabolic reconstruction) into a publication. Clearly, entry
to the course would be restricted to those students that had done a
reasonable job in the first course.
Producing a Publishable Enhanced Subsystem
This is, in my opinion, the more difficult of the two options. The
best way to understand the goal of the effort is to read "Ancient
origin of the tryptophan operon and the dynamics of evolutionary
change" by Jensen's team. This is the best example I have read, but it
represents an order of magnitude more experience and effort than any
student can achieve in a single course. However, it does establish a
goal and a basic approach. To emulate this effort for other
subsystems seems to me to be a reasonable objective. I also believe
that students can produce a publishable review within the semester in
many cases, and this would be an experience of real significance for
them.
Producing a Publishable Metabolic Reconstruction
Bernhard Palsson has pioneered technology for developing metabolic
reconstructions to the stage where they can fruitfully be employed to
support metabolic modeling. He has also pioneered teaching this
technology to students. To see what they are producing, let me simply
point at the papers they published describing models for
Staphylococcus aureus and Helicobacter pylori (note that a recent paper
discussing the extension and use of this last model has appeared in
J. Bact., Aug 2005, 5818-5830).
The production of such models still represents a publishable effort,
but this will probably not be true two years from now. By then, the
integration of wet-lab confirmations will be required. These
efforts are path-breaking, and I assume (and hope) that some other groups will
start mass-producing these reconstructions.
Summary
This is intended to be a short proposal arguing for the immediate
introduction of a 2-course sequence leading to publishable efforts.
The effects of initiating such an effort would include
- rapid education of both instructors and students,
- clarification of where else to expand the curriculum,
- testable conjectures that act as input to more standard courses in
biochemistry and molecular biology, and
- taking a pioneering role in the coming genomics era.
A few groups have already begun efforts like this, but we are at a really
unusual point in scientific progress, and this will remain a major
opportunity for the next few years. This particular way to start will
undoubtedly be replaced by others focusing on new advances. As in
most fields driven by rapid technological advance, each stage is
just a warm-up for the next. I think that many schools would do well
to take my advice, but it is also true that new opportunities will
continue to emerge rapidly.
I will add comments on specifics relating to this proposal in separate
documents: