The Annotation Clearinghouse

The Annotation Clearinghouse is a rather simple service. It does two things:
  1. If you give it the ID of a protein (e.g., tr|Q40XE9, uni|Q40XE9, fig|266940.1.peg.3572, gi|69288893, NP_251014, kegg|pae:PA2324, or sp|Q03224), the server will return IDs of other proteins that it knows about that have the same sequence. Furthermore, associated with each ID will be the function asserted for that ID by the major annotation teams that use that ID (at least those for which we can download their annotations). Thus, at any point where you have an ID, you can quickly get opinions about that protein (or other proteins that have the same sequence). To see how this works, just go to the clearinghouse, type in sp|P0A753, and look at the returned table. We emphasize that all of the proteins in this table are character-for-character identical, except for small variations at the start (due to differing calls of start positions).

  2. The second aspect of the server involves allowing experts to express reliable assertions. That is, we are gathering assertions of function for proteins where each assertion is being made by an expert, and the expert believes the assertion to be reliable (i.e., it is not a speculation or guess). It is understood that even "reliable" assertions are sometimes wrong or incomplete, and we encourage the experts to update their assertions in such cases. When users of the annotation clearinghouse ask for assertions about the protein with a given ID, the expert assertions are also provided, and they are colored so that they stand out. In the table returned for sp|P0A753 you will see some assertions made by Andrei Osterman, who is known for his research in NAD metabolism.

We frequently update the sequences used to provide these current annotations, trying to acquire the latest opinions from each of the annotation groups. We solicit help from experts when some annotations are known to be in error. We invite experts to register and contribute their assertions, and we believe that the annotation groups will carefully monitor such assertions, leading to steady improvement of the annotations contributed by the major annotation efforts.

Why Is This Important?

Both services are important, but for quite distinct reasons.  The first service, the gathering of assertions of function from annotation teams from around the world, allows a user to rapidly gather a set of opinions for comparison.  In effect, the user has access to the best efforts from a number of groups simultaneously.  All well-known annotation efforts actively compare their annotations to all others that are publicly available, so each of these efforts must gather annotations for comparison on its own.  This represents an unnecessary duplication of effort, and in many cases the gathering of annotations is not systematic, leading to collections that are often out of date.

The second service allows experts to contribute assertions.  In almost all cases, we would expect the annotation efforts to observe these and adjust their annotations accordingly.  The result will be rapid correction of many errors that currently tend to persist for years.  We all know of cases in which published results contradict the annotations in the archives, yet remain unnoticed even by some of the most reputable annotation efforts.  This service will offer a mechanism that experts can use to draw attention to errors that persist in the public collections (who has not seen an obvious error and wondered "How can I get this corrected without consuming much effort?").

Making these services easily accessible to the bioinformatics community will give tool-builders access to data that has been certified by experts.  Even though the expert assertions will probably never cover a majority of the protein sets, they will represent a sizable sample of very reliable assertions.  As such, teams building automated annotation pipelines will have a growing body of data against which they can calibrate their efforts.

The Simple Programmatic Interface

Most annotation efforts use a fairly simple approach: they use BLAST to find a set of similar sequences, then they gather annotations from a few sources for these similar proteins, and finally they integrate these annotations for similar sequences into an assertion for their protein of interest. The annotation clearinghouse provides a simple way to gather current annotations from all of the major annotation efforts.
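The final integration step of that approach can be illustrated with a very small Perl sketch that tallies the functions gathered for similar proteins and reports the one asserted most often. This is only an illustration under stated assumptions: the routine name most_common_function and the sample strings are hypothetical, and real annotation pipelines weigh their sources far more carefully than a simple vote.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the clearinghouse): given the function
# strings gathered for similar proteins, return the string asserted most
# often.  Ties are broken alphabetically; undef is returned for no input.
sub most_common_function {
    my(@functions) = @_;

    my %count;
    foreach my $func (@functions)
    {
        # Ignore missing or blank assertions.
        next unless defined($func) && ($func =~ /\S/);
        $count{$func}++;
    }
    my @ranked = sort { ($count{$b} <=> $count{$a}) || ($a cmp $b) } keys %count;
    return $ranked[0];
}

# Made-up sample data, for illustration only.
my @gathered = ("function A", "function B", "function A");
print most_common_function(@gathered), "\n";
```

A majority vote is used here only because it is the simplest possible integration rule; it treats every source as equally trustworthy, which is exactly the assumption that expert assertions are meant to improve upon.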

Here is a simple Perl program that takes a set of IDs, goes across the web to obtain current annotations, and then displays the acquired annotations. We offer it solely to illustrate how straightforward it is to access this data:

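#!/usr/bin/perl
use strict;
use warnings;

# Read one ID per line from STDIN and print the rows returned for each.
while (defined(my $line = <STDIN>))
{
    if ($line =~ /(\S+)/)
    {
        my $set = get_set($1);
        foreach my $entry (@$set)
        {
            print join("\t", @$entry), "\n";
        }
    }
}

# Download the raw table for one ID.  Returns a reference to a list of rows
# (header row first), each row a reference to a list of tab-separated fields.
sub get_set {
    my($id) = @_;

    my $tmpF = "/tmp/download.$$";
    my $url = "http://clearinghouse.nmpdr.org/aclh.cgi?page=SearchResults&raw_dump=1&query=$id";
    # The list form of system bypasses the shell, so the "|" in IDs is harmless.
    if (system("wget", "-q", "-O", $tmpF, $url) != 0)
    {
        warn "wget failed for $id\n";
        unlink $tmpF;
        return [];
    }

    open(my $tmp, "<", $tmpF) || die "could not open $tmpF: $!";
    # Skip everything before the header line, which starts with "Identifier".
    my $row;
    while (defined($row = <$tmp>) && ($row !~ /^Identifier/)) {}
    my $set = [];
    if (defined($row))
    {
        chomp $row;
        push(@$set, [split(/\t/, $row)]);
        # Data rows follow the header and end at the first blank line.
        while (defined($row = <$tmp>) && ($row =~ /\S/))
        {
            chomp $row;
            push(@$set, [split(/\t/, $row)]);
        }
    }
    close($tmp);
    unlink $tmpF;
    return $set;
}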

The routine get_set($id) can easily be used by any Perl programmer to download any desired protein sets.  Note that the command wget must be accessible via your path.
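Because the first row returned by get_set is the header line (the one beginning with "Identifier"), the rows can be turned into records keyed by column name, which is often more convenient than positional fields. The following is a minimal sketch, not part of the clearinghouse itself: the routine name rows_to_records is hypothetical, and apart from "Identifier" the column name and values in the sample table are placeholders, since the actual columns are whatever the server returns.

```perl
use strict;
use warnings;

# Hypothetical helper: convert the structure returned by get_set (a header
# row followed by data rows) into a list of hashes keyed by column name.
sub rows_to_records {
    my($set) = @_;

    my @rows = @$set;
    my $header = shift @rows;
    return [] unless $header;

    my @records;
    foreach my $row (@rows)
    {
        my %record;
        @record{@$header} = @$row;   # hash slice: pair column names with fields
        push(@records, \%record);
    }
    return \@records;
}

# Placeholder table standing in for real get_set output.
my $set = [ ["Identifier", "ColumnB"], ["id1", "value1"], ["id2", "value2"] ];
foreach my $record (@{ rows_to_records($set) })
{
    print "$record->{Identifier}\t$record->{ColumnB}\n";
}
```

With real data, one would pass the result of get_set($id) in place of the placeholder table and look up fields by the column names that appear in the downloaded header.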