Before we launch into a discussion of the details of how to invoke specific routines, there are some background issues that you should know about. First, let's talk about fig, which is a command-line utility that is used for a number of purposes. Its most basic purpose is to exercise the API and illustrate what data is accessed by each operation. However, you may also find it extremely useful as a quick way to invoke services. To try it out, type
figThis should produce a prompt of the form
??You should type
?? h<cr>to get a listing of the commands supported by the utility. At the time this document was written (May, 2004), the output was as follows:
?? h DB Current DB abbrev genome_name access_sims [expand|func|expandfunc] (for timeing - access 10 distinct sims) add_annotation FID User [ prompted for annotation; terminate with "." at start of line ] add_chromosomal_clusters File add_genome GenomeDir add_pch_pins File add_peg_link PEG Link (e.g., <a href=http//...>link to somewhere</a>) all_compounds all_features GenomeID Type all_maps all_protein_families all_reactions all_roles all_sets Relation SetName assign_function PEG User [conf=X] Function assign_functionF User File [no_annotations] assignments_made who date G1 G2 ... [ none => all ] auto_assign PEG [Seq] auto_assignF User FileOfPEGs auto_assignG User GenomeID1 GenomeID2 ... bbhs PEG Cutoff between n1 n2 n3 blast PEG [against nr] blastit PEG seq db maxP boundaries_of Loc build_tree_of_complete min_for_label by_alias alias by_fig_id FID1 FID2 candidates_for_role Genome Cutoff Role cas CID cas_to_cid cas catalyzed_by RID catalyzes role cgi_url clean_tmp close_genes FID distance comp2react CID contig_ln GenomeID Contig coupling_and_evidence FID Bound SimCutoff CouplingCutoff crude_estimate_of_distance GenomeID1 GenomeID2 delete_all_peg_links PEG delete_genomes G1 G2 G3 ...Gn delete_peg_link PEG URL displayable_reaction RID dna_seq GenomeID Loc dsims FastaFile [MaxN MaxPsc Select] *** select defaults to raw ec_to_maps EC ec_name EC expand_ec EC epoch_to_readable [gives readable version of current time] export_chromosomal_clusters export_pch_pins export_set Relation SetName File exportable_subsystem Subsystem external_calls PEG1 PEG2 PEG3... extract_seq ContigsFile loc all_exchangable_subsystems best_bbh_candidates Genome Cutoff Requested Known family_function Family fast_coupling FID Bound CouplingCutoff feature_aliases FID feature_annotations FID feature_location FID file2N File ftype FID function_of ID [all] [trans] genes_in_region GenomeID Contig Beg End genome_of PEG genomes [complete|Pat] genome_counts [complete] genome_version GenomeID genus_species GenomeID get_translation ID get_translations File [tab] h [command] *** h h for help on how to use help *** hypo Function ids_in_family Family ids_in_set WhichSet Relation SetName in_cluster_with PEG in_family FID in_pch_pin_with FID in_sets Which Relation SetName is_archaeal GenomeID is_bacterial GenomeID is_eukaryotic GenomeID is_prokaryotic GenomeID is_exchangable_subsystem Subsystem is_real_feature FID largest_clusters FileOfRoles user load_all map_to_ecs map map_name map mapped_prot_ids ID maps_to_id ID max n1 n2 n3 ... merged_related_annotations PEG1 PEG2 ... PEGn min n1 n2 n3 ... names_of_compound neighborhood_of_role role org_of prot_id pegs_of GenomeID possibly_truncated FID reaction2comp RID related_by_func_sim PEG user reversible RID rnas_of GenomeID roles_of_function function same_func 'function1' 'function2' search_index pattern seqs_with_role role who seqs_with_roles_in_genomes genomes roles made_by sims PEG maxN maxP select sort_fids_by_taxonomy FID1 FID2 FID3 ... sort_genomes_by_taxonomy GenomeID1 GenomeID2 GenomeID3 ... sz_family family taxonomic_groups_of_complete min_for_labels taxonomy_of GenomeID translatable prot_id translate_function PEG user translated_function_of PEG user translation_length ID unique_functions PEGs user [make the pegs comma separated] verify_dir dirIf you quickly browse through the commands that are supported by fig, you should get a fairly complete overview of the services provided by the API. We are not going to cover the use of fig in detail in this tutorial, but we do encourage you to experiment with it. By the way, you should type x to exit from it.
fig is ususally accessed in interactive mode (invoked with no arguments). However, if you wish to invoke it to process a single command, you can just put the command as a single argument on the command-line. Thus,
fig 'add_peg_link fig|188.1.peg.96 <a href=http://nature.berkeley.edu/~hongwang/Project/ATP_synthase/>animation</a>'would add a link to PEG fig|188.1.peg.96 while
fig 'delete_all_peg_links fig|188.1.peg.96'would delete all links from the PEG. Since adding links to PEGs is something you might wish to do, the easiest way to do it is to just create a file of commands of the sort shown, make the file executable, and then run it.
If you are a Perl programmer, eventually you will wish to peruse the fig source code to see how it is structured, how to add to it, and how to invoke services via the API.
We have briefly discussed the utility fig. Now it is time to discuss writing your own programs using the API. Basically, the entire API is encapsulated in the Perl module FIG.pm. To illustrate the style of programming we recommend, let's create a simple program to add a link to a PEG (rather than using fig). Adding a link to a PEG will cause the link to be displayed whenever anyone examines the PEG via the SEED web services.. The API defines a set of services that do many things; adding a link to a PEG is just one very minor service, but understanding exactly how one could invoke such a service is what this tutorial is about. To begin, use
tool_hdr add_linkThis creates a file called add_link and makes it executable. The file contains an initial set of Perl commands designed to establish the correct execution environment (where the libraries are, and so forth). You need to add your code to the end of the initialized file. We recommend adding the following lines:
use FIG; my $fig = new FIG; $usage = "usage: add_link PEG Link"; ( ($peg = shift @ARGV) && ($link = shift @ARGV) ) || die $usage; $fig->add_peg_links([$peg,$link]);
use FIG; my $fig = new FIG;initialize $fig. This is your key to the fig services. All API services are invoked using
$fig->rtn(arguments)
<a href=URL>LABEL</a>
genomes.ex.pl genomes.ex.py use strict; use FIG; my $fig = new FIG; my(%sofar,$genome,$which,$abbrev,$version); foreach $genome ($fig->genomes("complete")) { $which = $fig->genus_species($genome); $abbrev = &get_unique_abbreviation($which,\%sofar); $version = $fig->genome_version($genome); print "$genome\t$abbrev\t$version\t$which\n"; } sub get_unique_abbreviation { my($which,$sofar) = @_; my($nonunique,$n); $nonunique = &FIG::abbrev($which); $nonunique =~ s/ //g; $nonunique =~ s/\.+$//; $n = $sofar->{$nonunique}; if (! defined($n)) { $n = $sofar->{$nonunique} = 1; } $sofar->{$nonunique}++; return $nonunique . ".$n"; } from FigKernelPackages import FIG import sys import string def die(msg): print msg sys.exit(0) fig = FIG.FIG() sofar = {} def get_unique_abbreviation(which, sofar): nonunique = fig.abbrev(which); nonunique = string.rstrip(string.replace(nonunique," ", ""), ".") if not sofar.has_key(nonunique): sofar[nonunique] = 1 n = sofar[nonunique] sofar[nonunique] = sofar[nonunique] + 1 return (nonunique+"."+str(n)) for genome in fig.genomes("complete"): which = fig.genus_species(genome) abbrev = get_unique_abbreviation(which,sofar); version = fig.genome_version(genome); print "%s\t%s\t%s\t%s" % (genome, abbrev, version, which)
197221.1 The.elon.BP.1 197221.1.3406724106 Thermosynechococcus elongatus BP-1 198094.1 Bac.anth.st.2 198094.1.2593352402 Bacillus anthracis str. Ames 198214.1 Shi.flex.2a.1 198214.1.1464268764 Shigella flexneri 2a str. 301 198215.1 Shi.flex.2a.2 198215.1.1756034760 Shigella flexneri 2a str. 2457TWhat other genome-related services does the SEED offer? The only ones worth mentioning here relate to contigs. Here is a short example illustrating how to use those services. Basically, you can
contigs_for_genome.pl contigs_for_genome.py use strict; use FIG; my $fig = new FIG; my $usage = "usage: contigs_for_genome Genome"; my($genome,@contigs,$contig,$len,$seq); ($genome = shift @ARGV) || die $usage; @contigs = $fig->all_contigs($genome); foreach $contig (@contigs) { $len = $fig->contig_ln($genome,$contig); if ($len >= 50) { $seq = $fig->dna_seq($genome,$contig . "_1_50"); print "$contig\t$len\t$seq\n"; } } from FigKernelPackages import FIG import sys fig = FIG.FIG() def die(msg): print msg sys.exit(0) if len(sys.argv) != 2: die("usage: contigs_for_genome Genome") genome = sys.argv[1] contigs = fig.all_contigs(genome) for contig in contigs: len = fig.contig_ln(genome,contig) if len >= 50: seq = fig.dna_seq(genome,contig+"_1_50") print contig+"\t"+len+"\t"+seq
NC_000913_1_50would be used to represent the first fifty chracters of the contig NC_000913.
NC_000913_50_1would represent the sequence on the complementary strand of NC_000913 going from position 50 to position 1.
$fig->dna_seq($genome,"NC_000913_1_50,NC_000913_52_100") or $fig->dna_seq($genome,"NC_000913_1_50","NC_000913_52_100")would return 99 characters formed by deleting character 51 from the first 100 characters of NC_000913.
There is a fairly rich set of services for pegs, as you might guess from the web services offered by the SEED. As an introduction to some of these capabilities, consider the following short program:
features_around.pl features_around.py use strict; my($usage,$id,$fid,$peg,$loc,$contig,$beg,$end,$start_reg,$end_reg); my($loc1,$aliases1,$trunc,$pseq,$prot_ln,$func,$features_in_region); my($start_of_leftmost_feature,$end_of_rightmost_feature); my($genome); use FIG; my $fig = new FIG; $usage = "usage: features_around ID"; ($id = shift @ARGV) || die $usage; if ($peg = $fig->by_alias($id)) { if ($loc = $fig->feature_location($peg)) { ($contig,$beg,$end) = $fig->boundaries_of($loc); if (defined($contig)) { $start_reg = &FIG::min($beg,$end) - 5000; $end_reg = &FIG::max($beg,$end) + 5000; $genome = &FIG::genome_of($peg); ($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) = $fig->genes_in_region($genome,$contig,$start_reg,$end_reg); foreach $fid (@$features_in_region) { $loc1 = $fig->feature_location($fid); $aliases1 = $fig->feature_aliases($fid); $trunc = $fig->possibly_truncated($fid); $pseq = $func = $prot_ln = ""; if (&FIG::ftype($fid) eq "peg") { if ($prot_ln = $fig->translation_length($fid)) { $pseq = $fig->get_translation($fid); $func = $fig->function_of($fid); } } print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n"; } } } else { print STDERR "Sorry, could not get the location of $fid\n"; } } else { print STDERR "Sorry, could not figure out which PEG you meant by $id\n"; } from FigKernelPackages import FIG import sys import string fig = FIG.FIG() def die(msg): print msg sys.exit(0) if len(sys.argv) != 2: die("usage: features_around ID") id = sys.argv[1] peg = fig.by_alias(id) if peg: loc = fig.feature_location(peg) if loc: try: (contig,beg,end) = fig.boundaries_of(loc) start_reg = int(fig.min(beg,end)) - 5000 end_reg = int(fig.max(beg,end)) + 5000 genome = fig.genome_of(peg) (features_in_region,start_of_leftmost_feature,end_of_rightmost_feature) = fig.genes_in_region(genome,contig,start_reg,end_reg); for fid in features_in_region: loc1 = fig.feature_location(fid) aliases1 = fig.feature_aliases(fid) trunc = fig.possibly_truncated(fid) pseq = func = prot_ln = "" if fig.ftype(fid) == "peg": prot_ln = fig.translation_length(fid) if prot_ln: pseq = fig.get_translation(fid) func = fig.function_of(fid) print string.join([fid, loc1, str(aliases1), str(trunc), str(func), str(prot_ln), pseq], "\t") #print join("\t",($fid,$loc1,$aliases1,$trunc,$func,$prot_ln,$pseq)),"\n"; except: pass else: sys.stderr.write("Sorry, could not get the location of %s\n" % fid) #print STDERR "Sorry, could not get the location of $fid\n"; else: sys.stderr.write("Sorry, could not figure out which PEG you meant by %s\n" % id) #print STDERR "Sorry, could not figure out which PEG you meant by $id\n";
fig|xxxx.y.peg.zzzzwhere xxxx.y is the genome ID and zzzz is a unique sequence number (these usually occur in the same order as the genes occur on the contigs, but this is not guaranteed to be true). The argument could also be any other ID known by the SEED that uniquely corresponds to the FIG ID. For example,
features_around 'sp|Q9TL34'would display twelve features (fig|31312.1.peg.1 through fig|31312.1.peg.12, since sp|Q9TL34 corresponds to fig|31312.1.peg.5. The program outputs a tab-separated file of the sort one might use as input to a spreadsheet like Excel. The columns in the output contain:
Now, let's go through some of the details in the code:
$fig->by_alias($id)returns the FIG ID when it can be determined (otherwise, undef).
$fig->feature_location($peg)Remember, a location is a sequence of contiguous regions separated by commas (often a single contiguous region); a contiguous region amounts to a contig ID, a beginning position, and an ending position separated by underscores. In some contexts, all you want is the region containing a gene (without the details of intron/exon boundaries). In this case, you can use
($contig,$beg,$end) = $fig->boundaries_of($loc);to convert the location to a single contiguous region. These will succeed only if the sequence of contiguous regions all have the same contig ID. $beg will be set to the start of the first exon, while $end will be set to the end of the last exon.
$genome = &FIG::genome_of($peg);$fig->genome_of($peg) would work, as well), and then using
($features_in_region,$start_of_leftmost_feature,$end_of_rightmost_feature) = $fig->genes_in_region($genome,$contig,$start_reg,$end_reg);to extract the set of genes that overlap the region specified by the four arguments $genome, $contig, $start_reg, and $end_reg. A list of three things is returned by the invocation:
$loc1 = $fig->feature_location($fid); $aliases1 = $fig->feature_aliases($fid); $trunc = $fig->possibly_truncated($fid);shows how to get the location of a feature, a comma-separated list of aliases, and an indication of whether or not the feature might be truncated (this is a rather simple check to just see if the feature occurs within a few hundred bases from the end of the contig). The use of
&FIG::ftype($fid)just extracts the type of the feature (which is almost always peg or rna, although we certainly hope that we will have a richer set of feature types in the not too distant future). The two invocations
$prot_ln = $fig->translation_length($fid) $pseq = $fig->get_translation($fid)show how to get the length of the protein sequence encoded by a PEG and the actual sequence itself. Finally,
$func = $fig->function_of($fid)is used to get the current master function assignment for the gene. The SEED does support non-master assignments, although their use has been discouraged. the original intent was to allow inexperienced users to make assignments that did not overwrite the "master" assignment. Users who identify themselves as master:someone will only make assignments that overwrite the master assignment. However, a user without the "master:" prefix can make an assignment that is retained seprately. To access such an assignment, you would use
$func = $fig->function_of($fid,"someone")If someone has not made an assignment to $fig, this will return the master assignment.
print &Dumper($some_returned_variable);to see precisely what comes back.
The next short example illustrates how to access the reverse complement of a DNA sequence (i.e., the opposite strand), how to translate DNA sequences, and how to write out sequences in FASTA format. Let us look at it before thinking about additional complexities like how to handle nonstandard translations or properly translating the first codon:
seq_utils.pl seq_utils.py use FIG; my $fig = new FIG; # seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderr $usage = "usage: seq_utils Genome Location"; ( ($genome = shift @ARGV) && ($loc = shift @ARGV) ) || die $usage; if ($seq = $fig->dna_seq($genome,$loc)) { $seqR = &FIG::reverse_comp($seq); $tran = &FIG::translate($seq); $tranR = &FIG::translate($seqR); &FIG::display_id_and_seq("plus_strand",\$seq); &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR); &FIG::display_id_and_seq("minus_strand",\$seqR); &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR); } from FigKernelPackages import FIG import sys fig = FIG.FIG() def die(msg): print msg sys.exit(0) if len(sys.argv) != 3: die("usage: seq_utils Genome Location") genome = sys.argv[1] loc = sys.argv[2] seq = fig.dna_seq(genome, loc) if seq: seqR = fig.reverse_comp(seq); tran = fig.translate(seq); tranR = fig.translate(seqR); fig.display_id_and_seq("plus_strand",seq); #fig.display_id_and_seq("plus_strand_translated",\$tran,\*STDERR); fig.display_id_and_seq("minus_strand",seqR); #fig.display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);
seq_utils 31312.1 NC_000927_9431_9048 > stdout 2> stderrwill write the following contents to the file stdout:
>plus_strand ATGTCTACTATTACAAATCAAATTATTGAACAACTAAAATCGATGACGCTTTTGGAAGCG GCTGAACTTGTGTCTCAAATTGAAGAGACATTTGGAGTTGATGCTTCTGCACCCGCAGCT GGAGCGGTGATGGTCGCTGCCGCTGCGCCGTCTGCGCCAGTTGTCGAAGAGCAAACTGAA TTTACTGTCATGATTAATGATGTGCCAAGTGCGAAACGAATTGCTGTTATTAAAGTTGTA CGTGCTTTAACAGGATTAGGGTTGAAAGAAGCCAAAGATATGATTGAGTCTGTACCAAAA GCAATTAAAGAAGGAATTGGGAAAGACGAAGCTCAACAAATTAAACAACAGCTCGAAGAG GCTGGTGCAACTGTCACAGTGAAA >minus_strand TTTCACTGTGACAGTTGCACCAGCCTCTTCGAGCTGTTGTTTAATTTGTTGAGCTTCGTC TTTCCCAATTCCTTCTTTAATTGCTTTTGGTACAGACTCAATCATATCTTTGGCTTCTTT CAACCCTAATCCTGTTAAAGCACGTACAACTTTAATAACAGCAATTCGTTTCGCACTTGG CACATCATTAATCATGACAGTAAATTCAGTTTGCTCTTCGACAACTGGCGCAGACGGCGC AGCGGCAGCGACCATCACCGCTCCAGCTGCGGGTGCAGAAGCATCAACTCCAAATGTCTC TTCAATTTGAGACACAAGTTCAGCCGCTTCCAAAAGCGTCATCGATTTTAGTTGTTCAAT AATTTGATTTGTAATAGTAGACATand the following two sequences to the file stderr
>plus_strand_translated MSTITNQIIEQLKSMTLLEAAELVSQIEETFGVDASAPAAGAVMVAAAAPSAPVVEEQTE FTVMINDVPSAKRIAVIKVVRALTGLGLKEAKDMIESVPKAIKEGIGKDEAQQIKQQLEE AGATVTVK >minus_strand_translated FHCDSCTSLFELLFNLLSFVFPNSFFNCFWYRLNHIFGFFQP*SC*STYNFNNSNSFRTW HIINHDSKFSLLFDNWRRRRSGSDHHRSSCGCRSINSKCLFNLRHKFSRFQKRHRF*LFN NLICNSRHNow let us look at the routines within the program that produced these results:
$seq = $fig->dna_seq($genome,$loc)to extract the DNA. The actual location can specify sequence from either strand (remember how?).
$seqR = &FIG::reverse_comp($seq)to get the reverse complement of the sequence held in $seq.
$tran = &FIG::translate($seq); $tranR = &FIG::translate($seqR);to get translations of both strands. Note that these invocations use the standard genetic code to do the translation. The translate service supports two optional arguments. The second argument is used to pass a pointer to a hash containing the desired translation, if you wish to use a nonstandard code. The keys of the hash should be
AAA => AAC => AAG => . . . TTT =>The values are what the codons are translated to (normally, from the set {A,C,G,T}, but it is essentially unrestricted). If the third argument evaluates to true (i.e., defined, not null string, and not 0), then initial GTG and TTG codons will be translated to "M".
&FIG::display_id_and_seq("plus_strand",\$seq); &FIG::display_id_and_seq("plus_strand_translated",\$tran,\*STDERR); &FIG::display_id_and_seq("minus_strand",\$seqR); &FIG::display_id_and_seq("minus_strand_translated",\$tranR,\*STDERR);The first argument is the id you wish, the second is a pointer to the sequence to write out, and the third argument (which is optional) is a file handle. It defaults to writing to STDOUT.
show_similarities.pl show_similarities.py use FIG; my $fig = new FIG; use Sim; use strict; my($usage,$peg,$cutoff,$max,$expand); my(@sims,$sim,$id2,$pscore,$iden,$b1,$e1,$b2,$e2,$ln1,$ln2,$func1,$func2); $usage = "usage: show_similarities PEG CutOff MaxReturned Expand"; ( ($peg = shift @ARGV) && ($cutoff = shift @ARGV) && ($max = shift @ARGV) && defined($expand = shift @ARGV) ) || die $usage; @sims = $fig->sims($peg,$max,$cutoff,"all",$expand); foreach $sim (@sims) { $ln1 = $sim->ln1; $id2 = $sim->id2; $ln2 = $sim->ln2; $pscore = $sim->psc; $iden = $sim->iden; $b1 = $sim->b1; $e1 = $sim->e1; $b2 = $sim->b2; $e2 = $sim->e2; $func2 = $fig->function_of($id2); print join("\t",($ln1,$id2,$ln2,$pscore,$iden,$b1,$e1,$b2,$e2,$func2)),"\n"; } from FigKernelPackages import FIG import sys, string fig = FIG.FIG() id1_ix = 0 ln1_ix = 12 id2_ix = 1 ln2_ix = 13 iden_ix = 2 psc_ix = 10 ali_ln_ix = 3 mismatches_ix = 4 gaps_ix = 5 bsc_ix = 11 b1_ix = 6 e1_ix = 7 b2_ix = 8 e2_ix = 9 bit_score_ix = 11 tool_ix = 14 def2_ix = 15 ali_ix = 16 def die(msg): print msg sys.exit(0) if len(sys.argv) != 5: die("usage: show_similarities PEG CutOFF MaxReturned Expand") peg = sys.argv[1] cutoff = sys.argv[2] max = sys.argv[3] expand = sys.argv[4] sims = fig.sims(peg, max, cutoff, "all", expand) for sim in sims: ln1 = str(sim[ln1_ix]) id2 = str(sim[id2_ix]) ln2 = str(sim[ln2_ix]) pscore = str(sim[psc_ix]) iden = str(sim[iden_ix]) b1 = str(sim[b1_ix]) e1 = str(sim[e1_ix]) b2 = str(sim[b2_ix]) e2 = str(sim[e2_ix]) func2 = fig.function_of(id2) print string.join((ln1, id2, ln2, pscore, iden, b1, e1, b2, e2, func2[1]), "\t")
show_similarities 'fig|83333.1.peg.3' 1.0e-5 500 0it would produce an output file that started with the following lines:
310 fig|216592.1.peg.4423 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39) 310 sp|Q8XA82 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39) 310 pirnr|NF01089494 310 2e-177 99.35 1 310 1 310 Homoserine kinase 310 pirnr|NF01137671 310 2e-177 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39) (HK) 310 fig|216599.1.peg.3713 310 3e-177 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39)This output amounts to the first five unexpanded similarities. If you were to run
show_similarities 'fig|83333.1.peg.3' 1.0e-5 500 2it would produce an output file that started with the following lines:
310 sp|P00547 310 0 100 1 310 1 310 Homoserine kinase (EC 2.7.1.39) 310 pirnr|NF00702903 310 0 100 1 310 1 310 Homoserine kinase (EC 2.7.1.39) (HK) 310 kegg|eco:b0003 310 0 100 1 310 1 310 homoserine kinase (HK) [EC:2.7.1.39] 310 kegg|ecj:JW0002 310 0 100 1 310 1 310 Homoserine kinase [EC:2.7.1.39] 310 pirnr|NF01368612 310 0 100 1 310 -18 291 homoserine kinase 310 gi|33383907 310 0 100 1 310 -18 291 homoserine kinase 310 gi|16127997 310 0 100 1 310 1 310 homoserine kinase 310 fig|216592.1.peg.4423 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39) 310 fig|216593.1.peg.3578 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39) 310 fig|216598.1.peg.2077 310 4e-178 99.68 1 310 1 310 Homoserine kinase (EC 2.7.1.39) 310 gi|10186209 310 4e-178 99.68 1 310 -21 288 homoserine kinase 310 gi|33383857 310 4e-178 99.68 1 310 -18 291 homoserine kinase 310 pirnr|NF00706256 310 4e-178 99.68 1 310 -21 288 Homoserine kinase (Fragment) 310 pirnr|NF01368414 310 4e-178 99.68 1 310 -18 291 homoserine kinase 310 tr|Q9ETA5 310 4e-178 99.68 1 310 -21 288 Homoserine kinase 310 sp|Q8XA82 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39) 310 fig|155864.1.peg.3 310 9e-178 99.35 1 310 1 310 Homoserine kinase (EC 2.7.1.39) 310 kegg|ece:Z0003 310 9e-178 99.35 1 310 1 310 homoserine kinase [EC:2.7.1.39]The first seven lines of this expanded set of similarities amount to identical matches against the set of sequences that were collapsed into a single entry in the nonredundant database (i.e., sequences that were essentially identical with fig|83333.1.peg.3). The next eight similarities correspond to the first "raw" similarity, and so forth.
use Sim;at the head of your program.
$fig->sims($peg,$max,$cutoff,"all",$expand)takes several parameters (the given PEG, the maximum number of similarities to return, the p-score cutoff, a "selection" option, and the number of raw similarities to expand). Itb returns a list of objects of type similarity. The "selection" option can be
$ln1 = $sim->ln1; $id2 = $sim->id2; $ln2 = $sim->ln2; $pscore = $sim->psc; $iden = $sim->iden; $b1 = $sim->b1; $e1 = $sim->e1; $b2 = $sim->b2; $e2 = $sim->e2;
The SEED supports one central role that can be used to summarize preserved contiguity: The following short program illustrates its use:
show_functional_coupling.pl show_functional_coupling.py use FIG; my $fig = new FIG; use strict; my($usage,$bound,$sim_cutoff,$fc_cutoff,@fc,$tuple,$peg); $usage = "usage: show_functional_coupling Bound SimCutoff FcCutOff < PEGs > PEG1-Sc-PEGs"; ( ($bound = shift @ARGV) && ($sim_cutoff = shift @ARGV) && ($fc_cutoff = shift @ARGV) ) || die $usage; while (defined($_ =)) { if ($_ =~ /^(fig\|\d+\.\d+\.peg\.\d+)/) { $peg = $1; @fc = $fig->coupling_and_evidence($peg,$bound,$sim_cutoff,$fc_cutoff); foreach $tuple (@fc) { # print &Dumper($tuple); print "$peg\t$tuple->[0]\t$tuple->[1]\n"; } } } from FigKernelPackages import FIG2 import sys, re fig = FIG2.FIG() def die(msg): print msg sys.exit(0) if len(sys.argv) != 4: die("usage: fc Bound SimCutoff FcCutOff < PEGs > PEG1-Sc-PEGs") bound = sys.argv[1] sim_cutoff = sys.argv[2] fc_cutoff = sys.argv[3] for line in sys.stdin.readlines(): match = re.search("^(fig\|\d+\.\d+\.peg\.\d+)", line) if match: peg = match.group(1) fc = fig.coupling_and_evidence(peg, bound, sim_cutoff, fc_cutoff) for tuple in fc: print "%s\t%s\t%s" % ( peg, tuple[0], tuple[1])
@fc =$fig->coupling_and_evidence($peg,$bound,$sim_cutoff,$fc_cutoff);to compute a list of 3-tuples. Each 3-tuple corresponds to a functionally coupled gene. The 3-tuple contains
show_functional_coupling 5000 1.0e-10 2 < tmpand if tmp contains a single line of the form
fig|83333.1.peg.3then the program will produce the following output:
$VAR1 = [ 77, 'fig|83333.1.peg.2', [ [ 'fig|263.1.peg.1622', 'fig|263.1.peg.1624' ], [ 'fig|263.1.peg.1622', 'fig|263.1.peg.1625' ], [ 'fig|562.2.peg.2720', 'fig|562.2.peg.2721' ], [ 'fig|573.2.peg.4598', 'fig|573.2.peg.4594' ], . . <<< deleted lines showing more pairs >>> . ] ]; $VAR1 = [ 44, 'fig|83333.1.peg.4', [ [ 'fig|263.1.peg.1622', 'fig|263.1.peg.1620' ], [ 'fig|562.2.peg.2526', 'fig|562.2.peg.2527' ], [ 'fig|562.2.peg.2521', 'fig|562.2.peg.2522' ], [ 'fig|562.2.peg.2525', 'fig|562.2.peg.2527' ], [ 'fig|573.2.peg.4598', 'fig|573.2.peg.4599' ], . . <<< deleted lines showing more pairs >>> . ] ]; $VAR1 = [ 10, 'fig|83333.1.peg.6', [ [ 'fig|562.2.peg.2526', 'fig|562.2.peg.2529' ], [ 'fig|562.2.peg.2525', 'fig|562.2.peg.2529' ], . . <<< deleted lines showing more pairs >>> . ] ]; $VAR1 = [ 7, 'fig|83333.1.peg.7', [ [ 'fig|562.2.peg.2526', 'fig|562.2.peg.2531' ], [ 'fig|562.2.peg.2526', 'fig|562.2.peg.2530' ], [ 'fig|562.2.peg.2525', 'fig|562.2.peg.2531' ], [ 'fig|562.2.peg.2525', 'fig|562.2.peg.2530' ], [ 'fig|573.2.peg.4598', 'fig|573.2.peg.4604' ], . . <<< deleted lines showing more pairs >>> . ] ]; $VAR1 = [ 7, 'fig|83333.1.peg.8', [ [ 'fig|594.1.peg.4954', 'fig|594.1.peg.4958' ], [ 'fig|630.2.peg.581', 'fig|630.2.peg.584' ], [ 'fig|12149.1.peg.2', 'fig|12149.1.peg.6' ], . . <<< deleted lines showing more pairs >>> . ] ];If you simply replaced
print &Dumper($tuple);with
print "$peg\t$tuple->[0]\t$tuple->[1]\n";you would get
fig|83333.1.peg.3 77 fig|83333.1.peg.2 fig|83333.1.peg.3 44 fig|83333.1.peg.4 fig|83333.1.peg.3 10 fig|83333.1.peg.6 fig|83333.1.peg.3 7 fig|83333.1.peg.7 fig|83333.1.peg.3 7 fig|83333.1.peg.8which might be closer to what most end users wish to see.
The only detail that might be worth noting is the scoring function. It counts the number of pairs (besides the pair composed of the given PEG and the related PEG) that could be located, where "essentially identical" occurrences are collapsed to a single occurrence. For example, if you had 10 E.coli genomes (of closely related strains), you would wish to count them as a single occurrence. Basically, we collapse occurrences between genes that are encode protein sequences that are more than 97% identical to a single occurrence. So, when you see
fig|83333.1.peg.3 77 fig|83333.1.peg.2This means that there are genes encoding proteins similar to those encoded by fig|83333.1.peg.3 and fig|83333.1.peg.2 that are within 5000 bases of one another in 77 quite distinct genomes.
Since KEGG is doing such a useful job in attempting to curate maps, we download their maps and use them. We supplement them occasionally with some drawn by ourselves, but we do think that the community owes KEGG real gratitude for making these maps available.
role1 / role2 / role3...That is, the gene function should be made up of a list of roles separated by " / ". A gene function is said to connect to each of the roles that make it up (those roles that are separated by " / "), as well as any embedded EC numbers. Thus, the gene function
Pyruvate kinase (EC 2.7.1.40) / 6-phosphofructokinase (EC 2.7.1.11)would connect to
Normally, when we speak of maps, most of the functional roles are simply EC numbers (although some are not). When we speak of subsystems, the functional roles are almost never specified as just an EC number.
Our first simple example program takes as command line arguments a map ID (i.e., one of the KEGG map numbers -- these can be taken off the initial page displayed by the SEED web services) and a genome ID. It displays the map ID and the "map name". Then, it lists the roles contained within the map, and for each role it displays the PEGs associated with the role. Thus,
connect_map_to_genes MAP00010 83333.1produced the following output for me:
MAP00010 Glycolysis / Gluconeogenesis 2.7.1.11 fig|83333.1.peg.1706 6-phosphofructokinase II (EC 2.7.1.11) fig|83333.1.peg.3835 6-phosphofructokinase (EC 2.7.1.11) 2.7.1.41 1.2.1.3 fig|83333.1.peg.3522 Aldehyde dehydrogenase (EC 1.2.1.3) 5.4.2.1 fig|83333.1.peg.3548 2,3-bisphosphoglycerate-independent phosphoglycerate mutase (EC 5.4.2.1) fig|83333.1.peg.4303 Phosphoglycerate mutase (EC 5.4.2.1) fig|83333.1.peg.741 Phosphoglycerate mutase (EC 5.4.2.1) 3.1.3.10 fig|83333.1.peg.989 Glucose-1-phosphatase precursor (EC 3.1.3.10) 1.1.1.27 fig|83333.1.peg.787 L-lactate dehydrogenase (EC 1.1.1.27) 1.8.1.4 fig|83333.1.peg.116 Dihydrolipoamide dehydrogenase (EC 1.8.1.4) 3.6.1.7 fig|83333.1.peg.953 Acylphosphatase (EC 3.6.1.7) 2.7.1.40 fig|83333.1.peg.1660 Pyruvate kinase (EC 2.7.1.40) fig|83333.1.peg.1838 Pyruvate kinase (EC 2.7.1.40) 2.7.1.1 3.1.3.9 4.1.1.1 5.1.3.15 1.1.99.8 1.2.4.1 fig|83333.1.peg.114 Pyruvate dehydrogenase E1 component (EC 1.2.4.1) 5.1.3.3 fig|83333.1.peg.742 Aldose 1-epimerase (EC 5.1.3.3) 6.2.1.1 fig|83333.1.peg.3979 Acetyl-coenzyme A synthetase (EC 6.2.1.1) 5.4.2.4 2.7.1.69 fig|83333.1.peg.1086 PTS system, glucose-specific IIBC component (EIIBC-Glc) (Glucose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69) fig|83333.1.peg.1607 PTS system, maltose and glucose-specific IIABC component (Maltose and glucose-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69) fig|83333.1.peg.1719 PTS system, cellobiose-specific IIA component (EIIA-Cel) (Cellobiose- permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69) fig|83333.1.peg.1721 PTS system, cellobiose-specific IIB component (EIIB-Cel) (Cellobiose- permease IIB component) (Phosphotransferase enzyme II, B component) (EC 2.7.1.69) fig|83333.1.peg.1800 PTS system, mannose-specific IIAB component (EC 2.7.1.69) fig|83333.1.peg.2068 PTS system, galactitol-specific IIB component (EIIB-GAT) (Galacticol- permease IIB component) (Phosphotransferase enzyme II, B component) (EC 2.7.1.69) fig|83333.1.peg.2069 PTS system, galactitol-specific IIA component (EIIA-GAT) (Galacticol- permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69) fig|83333.1.peg.2142 PTS system, fructose-specific IIBC component (EIIBC-Fru) (Fructose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69) fig|83333.1.peg.2144 PTS system, fructose-specific IIA/FPr component (EIIA-Fru) (Fructose- permease IIA/FPr component) (Phosphotransferase enzyme II, A/FPr component) (EC 2.7.1.69) fig|83333.1.peg.2165 PTS system, mannitol-specific IIBC component (EC 2.7.1.69) fig|83333.1.peg.2360 PTS system, fructose-specific IIBC component (EC 2.7.1.69) fig|83333.1.peg.2385 PTS system, glucose-specific IIA component (EC 2.7.1.69) fig|83333.1.peg.2657 Sorbitol PTS, EIIC (EC 2.7.1.69) fig|83333.1.peg.2658 PTS system, glucitol/sorbitol-specific IIBC component (EC 2.7.1.69) fig|83333.1.peg.2659 PTS system, glucitol/sorbitol-specific IIA component (EIIA-gut) (Glucitol/sorbitol-permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69) fig|83333.1.peg.2670 PTS system, arbutin-, cellobiose-, and salicin-specific IIABC component (EIIABC-ASC) (Arbutin-, cellobiose-, and salicin-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69) fig|83333.1.peg.2884 PTS system, mannitol (Cryptic) -specific IIBC component (EIIBC-(C)MTL) (Mannitol (Cryptic)-permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69) fig|83333.1.peg.2885 PTS system, mannitol (Cryptic) -specific IIA component (EIIA-(C)MTL) (Mannitol (Cryptic)-permease IIA component) (Phosphotransferase enzyme II, A component) (EC 2.7.1.69) fig|83333.1.peg.3083 PTS system, N-acetylgalactosamine-specific IIB component 1 (EIIB-AGA) (N-acetylgalactosamine-permease IIB component 1) (Phosphotransferase enzyme II, B component 1) (EC 2.7.1.69) fig|83333.1.peg.3147 Nitrogen regulatory IIA protein (EC 2.7.1.69) fig|83333.1.peg.3534 PTS system, mannitol-specific IIABC component (EIIABC-MTL) (Mannitol- permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69) fig|83333.1.peg.3659 PTS system, beta-glucoside-specific IIABC component (EIIABC-Bgl) (Beta-glucoside-permease IIABC component) (Phosphotransferase enzyme II, ABC component) (EC 2.7.1.69) fig|83333.1.peg.3820 PTS system, fructose-specific IIBC component (EC 2.7.1.69) fig|83333.1.peg.3821 PTS system, fructose-specific IIABC component (EC 2.7.1.69) fig|83333.1.peg.3869 PTS system, fructose-specific IIBC component (EC 2.7.1.69) fig|83333.1.peg.3870 PTS system, fructose-like-2 IIB component 1 (Phosphotransferase enzyme II, B component) (EC 2.7.1.69) fig|83333.1.peg.3873 PTS system, fructose-like-2 IIB component 2 (Phosphotransferase enzyme II, B component) (EC 2.7.1.69) fig|83333.1.peg.4150 PTS system, trehalose-specific IIBC component (EIIBC-TRE) (Trehalose- permease IIBC component) (Phosphotransferase enzyme II, BC component) (EC 2.7.1.69) fig|83333.1.peg.4213 PTS system, galactitol-specific IIC component (EC 2.7.1.69) fig|83333.1.peg.669 PTS system, glucose-specific IIABC component (EC 2.7.1.69) fig|83333.1.peg.723 PTS system, fructose-specific IIABC component (EC 2.7.1.69) 1.1.1.71 1.1.1.1 fig|83333.1.peg.1228 Aldehyde-alcohol dehydrogenase [Includes: Alcohol dehydrogenase (EC 1.1.1.1) (ADH); Acetaldehyde dehydrogenase [acetylating] (EC 1.2.1.10) (ACDH); Pyruvate-formate-lyase deactivase (PFL deactivase)] fig|83333.1.peg.1301 Alcohol dehydrogenase (EC 1.1.1.1) fig|83333.1.peg.3523 Alcohol dehydrogenase (EC 1.1.1.1) fig|83333.1.peg.353 Alcohol dehydrogenase class III (EC 1.1.1.1) 2.7.1.63 3.2.1.86 fig|83333.1.peg.1717 6-phospho-beta-glucosidase (EC 3.2.1.86) fig|83333.1.peg.2853 6-phospho-beta-glucosidase bglA (EC 3.2.1.86) fig|83333.1.peg.3658 6-phospho-beta-glucosidase bglB (EC 3.2.1.86) 3.1.6.3 1.2.1.12 fig|83333.1.peg.1762 Glyceraldehyde 3-phosphate dehydrogenase (EC 1.2.1.12) 2.7.2.3 fig|83333.1.peg.2877 Phosphoglycerate kinase (EC 2.7.2.3) 5.3.1.9 fig|83333.1.peg.3934 Glucose-6-phosphate isomerase (EC 5.3.1.9) 1.2.1.5 1.2.1.51 4.2.1.11 fig|83333.1.peg.2735 Enolase (EC 4.2.1.11) 5.3.1.1 fig|83333.1.peg.3838 Triosephosphate isomerase (EC 5.3.1.1) 3.1.3.13 4.1.2.13 fig|83333.1.peg.1504 Fructose-bisphosphate aldolase (EC 4.1.2.13) fig|83333.1.peg.2072 Fructose-bisphosphate aldolase (EC 4.1.2.13) fig|83333.1.peg.2876 Fructose-bisphosphate aldolase (EC 4.1.2.13) 2.7.1.2 fig|83333.1.peg.2361 Glucokinase (EC 2.7.1.2) 2.3.1.12 fig|83333.1.peg.115 Dihydrolipoamide acetyltransferase component of pyruvate dehydrogenase complex (EC 2.3.1.12) (E2) 1.1.1.2 3.1.3.11 fig|83333.1.peg.2881 Fructose-1,6-bisphosphatase (EC 3.1.3.11) fig|83333.1.peg.4142 Fructose-1,6-bisphosphatase (EC 3.1.3.11) 5.4.2.2 fig|83333.1.peg.678 Phosphoglucomutase (EC 5.4.2.2)Now let's examine the short program that produced the output:
returns the full name associated with the map.
connect_map_to_genes.pl connect_map_to_genes.py use FIG; my $fig = new FIG; use Sim; use strict; my($usage,$map,$genome,$name,$role,$peg,$func,$expanded_role); $usage = "usage: connect_map_to_genes Map Genome"; ( ($map = shift @ARGV) && ($genome = shift @ARGV) ) || die $usage; $name = $fig->map_name($map); print "$map\t$name\n"; foreach $role ($fig->map_to_ecs($map)) { print " $role\n"; foreach $peg ($fig->seqs_with_role($role,"master",$genome)) { $func = $fig->function_of($peg); print " $peg\t$func\n"; } } from FigKernelPackages import FIG2 import sys fig = FIG2.FIG() def die(msg): print msg sys.exit(0) if len(sys.argv) != 3: die("usage: connect_map_to_genes Map Genome") map = sys.argv[1] genome = sys.argv[2] name = fig.map_name(map); print "%s\t%s" % (map, name) roles = fig.map_to_ecs(map) for role in roles: print role pegs = fig.seqs_with_role(role,"master",genome) for peg in pegs: func = fig.function_of(peg); print " %s\t%s" % (peg, func[0][1])
$fig->map_to_roles($map)
$expanded_role = $fig->expand_ec($role)I did not do that in this example, since I wanted you to see the EC numbers being used as functional roles in maps. You should modify the program to expand the ECs and rerun it.
$fig->seqs_with_role($role,"master",$genome)
connect_subsystem_to_genes.pl connect_subsystem_to_genes.py use FIG; my $fig = new FIG; use Sim; use strict; my($usage,$subsystem,$genome,$name,$role,$peg,$func); $usage = "usage: connect_subsystem_to_genes Subsystem Genome"; ( ($subsystem = shift @ARGV) && ($genome = shift @ARGV) ) || die $usage; print "$subsystem\n"; foreach $role ($fig->subsystem_to_roles($subsystem)) { print " $role\n"; foreach $peg ($fig->seqs_with_role($role,"master",$genome)) { $func = $fig->function_of($peg); print " $peg\t$func\n"; } } from FigKernelPackages import FIG2 import sys fig = FIG2.FIG() def die(msg): print msg sys.exit(0) if len(sys.argv) != 3: die("usage: connect_subsystem_to_genes Subsystem Genome") subsystem = sys.argv[1] genome = sys.argv[2] print subsystem roles = fig.subsystem_to_roles(subsystem) for role in roles: print role pegs = fig.seqs_with_role(role,"master",genome) for peg in pegs: func = fig.function_of(peg); print "%s\t%s" % ( peg, func[0][1])
connect_subsystem_to_genes 'Embden-Meyerhof and Gluconogenesis' 83333.1would produce the following output:
Note that the functional roles in subsystems tend to be far more detailed (and almost never just an EC number).
The problem with the above program is that we have chosen to view a subsystem as a spreadsheet in which each cell corresponds to a specific functional role in a designated organism. The gene functions of the PEGs contained in the cell usually connect to the specific role, but this is not forced. If someone changes the function of a gene, it does not disappear from the spreadsheet, unless the spreadsheet is rebuilt by forcing the entries to reflect current assignments and how they connect to the roles. We have chosen this approach to allow curators of spreadsheets to have precise control over the contents. People who import a subsystem may or may not instal the gene assignments needed to make the spreadsheet consistent, but the spreadsheet reflects the curators best assessment of which genes implement specific roles. This leads to my assertion that the following program is a better way to display a subsystem:
connect_subsystem_to_genes_the_right_way.pl connect_subsystem_to_genes_the_right_way.py use FIG; my $fig = new FIG; use Sim; use strict; my($usage,$subsystem,$genome,$name,$role,$peg,$func); $usage = "usage: connect_subsystem_to_genes_the_right_way Subsystem Genome"; ( ($subsystem = shift @ARGV) && ($genome = shift @ARGV) ) || die $usage; print "$subsystem\n"; foreach $role ($fig->subsystem_to_roles($subsystem)) { print " $role\n"; foreach $peg ($fig->pegs_in_subsystem_cell($subsystem, $genome, $role)) { $func = $fig->function_of($peg); print " $peg\t$func\n"; } } from FigKernelPackages import FIG2 import sys fig = FIG2.FIG() def die(msg): print msg sys.exit(0) if len(sys.argv) != 3: die("usage: connect_subsystem_to_genes_the_right_way Subsystem Genome") subsystem = sys.argv[1] genome = sys.argv[2] print subsystem roles = fig.subsystem_to_roles(subsystem) for role in roles: print role pegs = fig.pegs_in_subsystem_cell(subsystem, genome, role) for peg in pegs: func = fig.function_of(peg) print " %s\t%s" % (peg, func[0][1])
We have just illustrated how to connect maps to functional roles and functional roles to gene functions. Now we need to go the other way. That is, given a gene function, what roles would it connect to, and which maps and subsystems contain those roles? The following program illustrates the services required to answer this question;
gf2maps_and_subs.pl gf2maps_and_subs.py use FIG; my $fig = new FIG; use strict; my($usage,$peg,$func,@roles,$role,@maps,$map); my(%in_map,@maps_it_connects_to,@subsystems_it_is_in); my($map_name,$subsystem); $usage = "usage: gf2maps_and_subs PEG"; # The single input argument must be a FIG id, which is used to get the assigned function. # The program then prints a list of maps to which the gene can be connected, # followed by a list of subsystems that contain the gene. ($peg = shift @ARGV) || die $usage; if (($func = $fig->function_of($peg)) && (! &FIG::hypo($func))) { @roles = $fig->roles_of_function($func); foreach $role (@roles) { @maps = $fig->role_to_maps($role); foreach $map (@maps) { $in_map{$map} = 1; } } @maps_it_connects_to = sort keys(%in_map); if (@maps_it_connects_to > 0) { print "Maps that connect to $func\n\n"; foreach $map (@maps_it_connects_to) { $map_name = $fig->map_name($map); print "$map\t$map_name\n"; } print "\n"; } } else { print STDERR "you need to give a valid PEG with a non-hypothetical function to connect to maps\n"; } @subsystems_it_is_in = $fig->peg_to_subsystems($peg); if (@subsystems_it_is_in > 0) { print "Subsystems that contain $peg\n\n"; foreach $subsystem (@subsystems_it_is_in) { print "$subsystem\n"; } } from FigKernelPackages import FIG2 import sys fig = FIG2.FIG() def die(msg): print msg sys.exit(0) if len(sys.argv) != 2: die("usage: gf2maps_and_subs PEG") # The single input argument must be a FIG id, which is used to get the assigned function. # The program then prints a list of maps to which the gene can be connected, # followed by a list of subsystems that contain the gene. peg = sys.argv[1] func = fig.function_of(peg) func = func[0][1] hypo = fig.hypo(func) in_map = {} if func and hypo[0] == 0: roles = fig.roles_of_function(func) for role in roles: maps = fig.role_to_maps(role) for map in maps: in_map[map] = 1 maps_it_connects_to = in_map.keys() maps_it_connects_to.sort() if len(maps_it_connects_to) > 0: print "Maps that connect to %s\n" % func for map in maps_it_connects_to: map_name = fig.map_name(map) print "%s\t%s" % (map, map_name[0]) print " "; else: sys.stderr.write("you need to give a valid PEG with a non-hypothetical function to connect to maps\n") subsystems_it_is_in = fig.peg_to_subsystems(peg) if len(subsystems_it_is_in) > 0: print "Subsystems that contain %s\n" % peg for subsystem in subsystems_it_is_in: print subsystem
&FIG::hypo($func)to check to see whether or not the function is equivalent to hypothetical protein.
The way we explore connections between the PEG and maps is quite different than the way we connect the PEG to subsystems. In the case of maps, the connection is constructed by computing the set of roles to which the gene function connects, and then connecting each role to maps. In the case of subsystems, we simply look up which subsystems contain the given PEG. Subsystems explicitly contain a precise list of the PEGs that occur within them, which is not true of maps.
@roles = $fig->roles_of_function($func); @maps = $fig->role_to_maps($role) @subsystems = $fig->peg_to_subsystems($peg);The first takes a function as input and returns a list of roles that the function connects to, the second takes a role and returns a list of maps to which the role connects, and the last takes a peg and returns the list of subsystems that contain the PEG.
@annotations = $fig->feature_annotations($fid)The returned list contains 4-tuples (i.e., pointers to lists, each of which contains 4 entries). Each 4-tuple contains
@annotations = $fig->merged_related_annotations($fig_ids)which is a useful generalization. This routine takes a pointer to a list of feature IDs as input, and it produces a merged list of the annotations attached to all of the features. This can often be very convenient. For example, you might have a set of similar PEGs that are all misannotated, and you wish to reconstruct the steps that led to the current situation. By merging the annotations (which normally record both who made the annotations and why), you can get a record of the events in chronilogical order. The following short example illustrates this utility:
annotations_of.pl annotations_of.py use FIG; my $fig = new FIG; use strict; my($usage,$fig_ids,@annotations); $usage = "usage: annotations_of Fid1 Fid2 ..."; (@ARGV > 0) || die $usage; $fig_ids = [@ARGV]; @annotations = $fig->merged_related_annotations($fig_ids); if (@annotations > 0) { &display_annotations(\@annotations); } else { print STDERR "Sorry, there were no annotations attached to the features\n"; } sub display_annotations { my($annotations) = @_; my($annotation,$fid,$timestamp,$user,$text); foreach $annotation (@$annotations) { ($fid,$timestamp,$user,$text) = @$annotation; print "=======\n"; print "$user\t$timestamp\t$fid\n$text\n"; print "=======\n\n"; } } from FigKernelPackages import FIG import sys def display_annotations(annotations): for annotation in annotations: fid,timestamp,user,text = annotation print "=======\n"; print "%s\t%s\t%s\n%s\n" % ( user, timestamp, fid, text) print "=======\n\n" fig = FIG.FIG() def die(msg): print msg sys.exit(0) if len(sys.argv) < 2: die("usage: annotations_of Fid1, Fid2, ... ") fig_ids = sys.argv[1:] annotations = fig.merged_related_annotations(fig_ids) if len(annotations) > 0: display_annotations(annotations) else: sys.stderr.write("Sorry, there were no annotations attached to the features\n")
annotations_of 'fig|207559.1.peg.735' 'fig|12149.1.peg.2133' 'fig|1525.1.peg.1757'produced the following output:
======= automated_assignment Thu Mar 18 17:42:51 2004 fig|12149.1.peg.2133 Set master function to ATP phosphoribosyltransferase (EC 2.4.2.1 ======= ======= automated_assignment Thu Mar 18 17:45:16 2004 fig|1525.1.peg.1757 Set master function to ATP phosphoribosyltransferase (EC 2.4.2.1 ======= ======= last_release Fri Apr 2 14:14:07 2004 fig|207559.1.peg.735 Set master function to ATP phosphoribosyltransferase (EC 2.4.2.1 =======
batch_assignments.pl use FIG; my $fig = new FIG; use strict; my($usage,$user,$annotation,$text,$peg,$func,$conf,$user_name,$funcO); $usage = "usage: batch_assignments User [AnnotationFile] < Assignments"; ($user = shift @ARGV) || die $usage; if ($annotation = shift @ARGV) { $text = join("",`cat $annotation`); } else { $text = ""; } while (defined($_ =)) { chop; ($peg,$func,$conf) = split(/\t/,$_); if (! $conf) { $conf = "" } $funcO = $fig->function_of($peg); if ($funcO ne $func) { # annotations are now made by assign_function $fig->assign_function($peg,$user,$func,$conf); } }
There is also the matter of confidence levels. The SEED supports attaching confidence codes to assignments, although they are seldom used. The codes we use at this point are:
Now, back to the example program:
$text = join("",`cat $annotation`);just reads the entire file, concatenates the lines, and assigns the result to the variable $text.
$fig->assign_function($peg,$user,$func,$conf);is used to actually make the assignment in the database, record it in in the current assignments file, and record an annotation of the annotation history file.
@bbhs = $fig->bbhs($peg,1.0e-10);This would set @bbhs to a list of 2-tuples. Each 2-tuple would contain a PEG and a p-score. They would correspond to BBHs which have similarity better than 1.0e-10.
To get blastp similarities quickly (and somewhat less accurately than just blasting against the nonredundant database), use something like
@sims = $fig->dsims($id,$seq,200,1.0e-10,"raw")This would return the similarities that can rapidly be detected between the protein sequence with id given by id and sequence given by $seq. It would give you at most 200 similarities with p-scores better than 1.0e-10, and you would get only the raw similarities (all would return the expanded similarities).