The Annotation Clearinghouse
The Annotation Clearinghouse is a rather simple service. It does two things:
- If you give it the ID of a protein (e.g., tr|Q40XE9, uni|Q40XE9, fig|266940.1.peg.3572,
gi|69288893, NP_251014, kegg|pae:PA2324, or
sp|Q03224), the server will return IDs of other proteins that it knows about that have the same sequence.
Furthermore, associated with each ID will be the function asserted for that ID by the major annotation teams that use that ID
(at least those for which we can download their annotations).
Thus, at any point where you have an ID, you can quickly get opinions about that protein (or other proteins that have
the same sequence). To see how this works, just go to the
clearinghouse, type in sp|P0A753, and look at the returned table. We emphasize that all
of the proteins in this table are character-for-character identical, except for small variations at
the start (due to differing calls of start positions).
- The second aspect of the server involves allowing experts to express reliable assertions. That is, we are
gathering assertions of function for proteins where each assertion is being made by an expert, and the expert
believes the assertion to be reliable (i.e., it is not a speculation or guess). It is understood
that even "reliable" assertions are sometimes wrong or incomplete, and we encourage the experts to update
their assertions in such cases.
When users of the annotation clearinghouse ask for assertions about the protein with a given ID, the expert
assertions are also provided, and they are colored
so that they stand out. In the table returned for sp|P0A753 you will see some assertions made
by Andrei Osterman, who is known for his research in NAD metabolism.
We frequently update the sequences used to provide these annotations, trying to acquire the latest
opinions from each of the annotation groups. We solicit help from experts when some annotations are
known to be in error. We invite experts to register and contribute their assertions, and we
believe that the annotation groups will carefully monitor such assertions and that this will lead to
steady improvement of the annotations contributed by the major annotation efforts.
Why Is This Important?
Both services are important, but for
quite distinct reasons. The first service, the gathering of assertions
of function from annotation teams from around the world, allows a user
to rapidly gather a set of opinions for comparison. In effect, the
user has access to the best efforts from a number of groups
simultaneously. All well-known annotation efforts actively compare
their annotations to all others that are publicly available. Each of
these annotation efforts must actively gather annotations for
comparison. This represents an unnecessary duplication of effort, and
in many cases the gathering of annotations is not systematic, leading to
collections that are often out of date.
The
second service allows experts to contribute assertions. In almost
all cases, we would expect the annotation efforts to observe these and
adjust their annotations accordingly. The result will be rapid
correction of many errors that currently tend to persist
for years. We all know of cases in which published results
contradict the annotations in the archives, and even remain unnoticed
by some of the most reputable annotation efforts. This service
will offer a mechanism that can be used by experts to draw attention to
errors that persist in the public collections (who has not seen an
obvious error and wondered, "How can I get this corrected without consuming much effort?").
Making
these services easily accessible to the bioinformatics community will
allow tool-builders access to data that has been certified by experts.
Even though the
expert assertions will probably never cover a
majority of the protein sets, they will represent a sizable sample of
very reliable assertions. As such, teams building automated
annotation pipelines will have a growing body of data against which
they can calibrate their efforts.
The Simple Programmatic Interface
Most annotation efforts use a fairly simple approach: they use BLAST to find a set of similar sequences,
then they gather annotations from a few sources for these similar proteins, and finally they integrate
these annotations for similar sequences into an assertion for their protein of interest.
The annotation clearinghouse provides a simple way to gather current annotations from all of the major
annotation efforts.
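The integration step described above can take many forms. As a minimal sketch of one possibility, the following picks the function string asserted most often among the annotations gathered for similar sequences. The routine name consensus_function and the sample strings are hypothetical, and this simple vote is an illustrative assumption, not the method used by any particular annotation team:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the function string asserted most often,
# together with its count. Ties are broken alphabetically so that the
# result is deterministic.
sub consensus_function {
    my(@functions) = @_;
    my %count;
    foreach my $func (@functions)
    {
        next unless defined($func) && $func =~ /\S/;   # skip empty assertions
        $count{$func}++;
    }
    my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    return @ranked ? ($ranked[0], $count{$ranked[0]}) : (undef, 0);
}

# Made-up placeholder annotations for three similar sequences.
my @annotations = (
    "example function A",
    "hypothetical protein",
    "example function A",
);
my($best, $votes) = consensus_function(@annotations);
print "$best\t$votes\n";
```

A real pipeline would probably weight each assertion by its source or by sequence similarity; the plain count is used here only to keep the sketch short.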
Here is a simple Perl program that takes a set of IDs, goes across the web to obtain current
annotations, and then displays the acquired annotations. We offer it solely to illustrate
how straightforward it is to access this data:
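#!/usr/bin/perl
use strict;
use warnings;

# Read one protein ID per line from STDIN and print the rows returned for each.
while (defined(my $line = <STDIN>))
{
    if ($line =~ /(\S+)/)
    {
        my $set = get_set($1);
        foreach my $entry (@$set)
        {
            print join("\t", @$entry), "\n";
        }
    }
}

# Fetch the raw dump for one ID and return a reference to a list of rows,
# each row being a reference to a list of tab-separated fields.
sub get_set {
    my($id) = @_;
    my $tmpF = "/tmp/download.$$";
    my $url = "http://clearinghouse.nmpdr.org/aclh.cgi?page=SearchResults&raw_dump=1&query=$id";

    # The list form of system bypasses the shell, so the ID needs no quoting.
    if (system("wget", "-q", "-O", $tmpF, $url) != 0)
    {
        warn "wget failed for $id\n";
        unlink $tmpF;
        return [];
    }
    open(my $tmp, "<", $tmpF) || die "could not open $tmpF: $!";

    # Skip everything before the header line, which starts with "Identifier".
    my $line;
    while (defined($line = <$tmp>) && ($line !~ /^Identifier/)) {}

    my $set = [];
    if (defined($line))
    {
        chomp $line;
        push(@$set, [split(/\t/, $line)]);

        # Data rows follow the header and end at the first blank line.
        while (defined($line = <$tmp>) && ($line =~ /\S/))
        {
            chomp $line;
            push(@$set, [split(/\t/, $line)]);
        }
    }
    close($tmp);
    unlink $tmpF;
    return $set;
}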
The routine get_set($id) can easily be used by any Perl programmer
to download any desired protein sets. Note that the command wget must be accessible via your path.
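Where wget is not available, the same download can probably be done from within Perl. The following is a sketch under stated assumptions, not part of the clearinghouse itself: it uses the core module HTTP::Tiny, the routine names parse_dump and get_set_http are hypothetical, and it assumes the server accepts a percent-encoded query value. The parsing is kept separate from the network call so that it can be checked offline against a made-up sample:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;

# Parse raw dump text: rows start at the line beginning with "Identifier"
# and stop at the first blank line, mirroring get_set.
sub parse_dump {
    my($text) = @_;
    my $set = [];
    my $in_table = 0;
    foreach my $line (split(/\r?\n/, $text))
    {
        if (! $in_table)
        {
            next unless $line =~ /^Identifier/;
            $in_table = 1;
        }
        last unless $line =~ /\S/;
        push(@$set, [split(/\t/, $line)]);
    }
    return $set;
}

# Hypothetical wget-free counterpart of get_set.
sub get_set_http {
    my($id) = @_;
    # Percent-encode characters such as "|" (assumed acceptable to the server).
    (my $escaped = $id) =~ s/([^A-Za-z0-9_.~-])/sprintf("%%%02X", ord($1))/ge;
    my $url = "http://clearinghouse.nmpdr.org/aclh.cgi?page=SearchResults&raw_dump=1&query=$escaped";
    my $response = HTTP::Tiny->new->get($url);
    die "request failed: $response->{status} $response->{reason}" unless $response->{success};
    return parse_dump($response->{content});
}

# Offline check of the parser on a made-up two-column dump.
my $sample = "preamble\nIdentifier\tFunction\nid1\tfunction one\n\nignored\n";
print join("\t", @$_), "\n" foreach @{ parse_dump($sample) };
```

Calling get_set_http("sp|P0A753") should then return the same structure as get_set, provided the assumptions above hold.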