Explanation of Protein Similarities Options

The SEED has numerous (some might say too many) options for controlling the display of similarities. Some are fairly obvious, while others are less so. Learning to use the options allows you to do things that are not easily done using other tools.

Background: Sequences and similarities in the SEED

Some of the options relate to how the SEED stores sequences and similarities. The system is in many ways similar to that employed by NCBI in its non-redundant BLAST databases. The primary idea is that the SEED protein database is a merger of sequences from several different sources. It is common to have the same sequence from some combination of GenBank, RefSeq, UniProt, and KEGG, as well as the SEED genomes. In the NCBI non-redundant databases, identical sequences are represented by a single sequence entry and a list of all the different sources. The SEED carries this one step further: the sequences do not need to be identical. A sequence that is identical to the carboxy-terminal portion of another sequence can be merged into the entry for the longer sequence. The most common reason for this to happen is that the sequences are based upon the same gene, but the assumed start site differs between the entries. In instances of very closely related organisms, it is also possible for proteins to be merged between genomes. With the sequencing of multiple strains within a bacterial species, this is now a fairly common occurrence.
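The merging rule described above can be illustrated with a small sketch. This is not the SEED's actual code: the function name, the (source_id, sequence) input format, and the IDs in the example are assumptions made for illustration, and the quadratic scan is chosen for clarity rather than speed.

```python
def merge_suffix_sequences(entries):
    """Group sequences so that any sequence identical to the C-terminal
    end of a longer one shares an entry with it.

    entries: list of (source_id, sequence) pairs (hypothetical format).
    Returns a dict mapping each representative (longest) sequence to the
    list of source ids merged into it.
    """
    groups = {}
    # Visit longest sequences first, so a representative always exists
    # before the shorter sequences it absorbs are examined.
    by_length = sorted(entries, key=lambda e: len(e[1]), reverse=True)
    for source_id, seq in by_length:
        for rep in groups:
            if rep.endswith(seq):          # identical, or a C-terminal portion
                groups[rep].append(source_id)
                break
        else:
            groups[seq] = [source_id]      # no match: new representative
    return groups


entries = [
    ("fig|1.1.peg.1", "KLMNPQ"),     # shorter version, later assumed start
    ("gb|A", "MAAKLMNPQ"),
    ("sp|B", "MAAKLMNPQ"),
    ("kegg|C", "MTTT"),
]
groups = merge_suffix_sequences(entries)
```

In this example the FIG sequence ends up inside the entry whose representative is the longer GenBank/UniProt version, which is the situation that makes expanding the list of matches necessary.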

So, what does this mean for similarities? The similarities are computed only for the representative (longest) version of each protein. The results are automatically adjusted to look like those that would be found for the user-requested query. However, the SEED has no way of knowing which subject sequence (matching sequence) will be of greatest interest to the user. On one hand, listing all equivalent sequences is time-consuming and clutters the list with multiple versions of each match. On the other hand, the description associated with the representative sequence might be less useful than that associated with other entries. Perhaps the most annoying issue is that the entry for the SEED genome (the FIG sequence) is often not the representative, and hence is not visible without expanding the list to show all equivalent sequences. A less obvious consequence is that without expanding the list of matches, proteins identical to the query in the same or other genomes will not be included in the table of Similarities! (Of course they are already displayed as "Assignments for Essentially Identical Proteins", but you might not have realized that this is what that list represents.)

This approach has some potentially unexpected, or even undesired, consequences, but on closer consideration they are mostly harmless, or perhaps even blessings in disguise. In particular, the reported region of similarity can have a negative start coordinate, because the similarity extends beyond the reported start of the particular protein. In addition, similarity scores are not adjusted for shorter sequences that might not include the entire region of similarity (again, this applies only when the reported start point is a negative sequence position).
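To see where a negative start coordinate comes from, consider the following sketch of the coordinate shift. It assumes that the shorter sequence is identical to the C-terminal end of the representative and that coordinates are simply shifted by the length difference; the function name and the exact numbering convention are assumptions, not a description of the SEED's internals.

```python
def adjust_to_shorter(rep_start, rep_end, rep_length, short_length):
    """Shift a match region from representative coordinates to the
    coordinates of a shorter sequence that is identical to the
    representative's C-terminal end."""
    offset = rep_length - short_length   # residues absent from the N-terminus
    return rep_start - offset, rep_end - offset


# Representative of 300 residues; the requested protein has only the
# last 280. A match covering 11..290 of the representative begins
# upstream of the shorter protein's reported start.
upstream = adjust_to_shorter(11, 290, 300, 280)    # (-9, 270)
inside = adjust_to_shorter(50, 290, 300, 280)      # (30, 270)
```

The first result starts before position 1, which signals that the similarity was found in a region the shorter entry does not include. The score attached to it is still the one computed for the full region.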

Standard Options

Max sims is the number of similarities to report. This is the number of entries in the table of similarities, not necessarily the number of unique sequences that were "expanded". The number reported can be less than this if there are fewer entries in the database that satisfy all of the other criteria defined by the search options (by default, the only limit is the E-value of the match).

Max expand lets the user control the expansion of the representative database sequence (which in this context should be viewed as arbitrarily chosen) into the list of equivalent sequences. Expanding is essential for two functions: showing just FIG sequences and showing sequences that are identical to the query. Beyond these two cases, expanding at least some sequences is frequently useful for seeing what diverse databases (e.g., UniProt, KEGG and RefSeq) have to say about the most significant matches. Sometimes it appears that fewer entries were expanded than requested, because one or more expanded entries were filtered out by later tests (e.g., they were environmental sequences, or they did not have a FIG ID). Regardless of the value of Max expand, the process stops as soon as Max sims similarities have been reported.
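The interplay of Max sims, Max expand and the later filters can be sketched as follows. This is only a model of the behavior described above: the function, its arguments, the IDs, and the assumption that Max expand counts the best-scoring representatives to expand are all hypothetical.

```python
def report_similarities(matches, equivalents, max_sims, max_expand,
                        keep=lambda entry: True):
    """matches: representative ids, best match first.
    equivalents: dict mapping a representative id to its equivalent ids.
    keep: filter applied to every candidate entry (e.g. FIG ids only).
    """
    reported = []
    for rank, rep in enumerate(matches):
        # Assumption: only the first max_expand representatives are expanded.
        if rank < max_expand:
            candidates = equivalents.get(rep, [rep])
        else:
            candidates = [rep]
        for entry in candidates:
            if len(reported) >= max_sims:
                return reported            # stop as soon as Max sims is reached
            if keep(entry):
                reported.append(entry)
    return reported


matches = ["r1", "r2", "r3"]
equivalents = {
    "r1": ["r1", "fig|a", "sp|x"],
    "r2": ["r2", "fig|b"],
    "r3": ["r3", "fig|c"],
}
capped = report_similarities(matches, equivalents, max_sims=4, max_expand=1)
fig_only = report_similarities(matches, equivalents, max_sims=10, max_expand=2,
                               keep=lambda e: e.startswith("fig|"))
```

In the second call the third match is not expanded, so its FIG sequence is never seen and the filtered table is shorter than expected. This mirrors why expansion is needed when only FIG sequences are to be shown.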

Max E-val sets an upper limit on the E-value (the expected number of random database matches this good or better) of matches that will be displayed (that is, a lower limit on the significance). The highest E-value actually reported is never greater than this value, but can be less due to the E-value cutoff used in computing the original similarities, and/or the limited number of similarities stored in the database.

There is a pop-up menu to select the treatment of entries when they are expanded (and even to force the expanding of additional entries):

Show Env. samples is used to enable the reporting of similarities to environmental sequences. By default, environmental sequences are not reported because all of their annotations are indirect. However, they may be displayed so that the user can annotate them, or explore other properties, such as genomic context.

Hide aliases removes the aliases column from the similarities table, primarily to save screen real estate.

Sort by:

Group by genome is a useful method to collect paralogs within a genome to the same location in the similarities list. There are several properties of the function that need to be understood to use it safely and effectively:

Standard Buttons

The resubmit button is used to redisplay the similarities after changing one or more of the options. This is also the action taken if you press the return key while the cursor is in one of the text boxes.

The more similarities button acts the same as the resubmit button, but it also doubles the values of Max sims and Max expand.

The previous PEG button (when present) navigates to the next lower-numbered protein-coding gene in the genome. This is not necessarily in the same contig, and in some genomes, it can be anywhere. This operation preserves all of the similarities settings (unlike navigating via the genome context table, or map).

The next PEG button (when present) navigates to the next higher-numbered protein-coding gene in the genome. This is not necessarily in the same contig, and in some genomes, it can be anywhere. This operation preserves all of the similarities settings (unlike navigating via the genome context table, or map).

The more sim options button does what it says: it gives you more options.

Extra Options

The more sim options button enables additional options, with their default values. Hiding the options with the fewer sim options button reverts all of the extra options to their default values. That is, the display is never influenced by options that are hidden from view.

Min similarity defines another way to cut off weak matches, in this case by percent identity (as reported by BLAST, which comes with caveats) or by bit score per position. Although the latter is less intuitive than percent identity, it is probably better for highly diverged sequences, in that the most common amino acid replacements still get a positive score. The effective range of this measure is 0 to 2 bits. The measure is selected with the as defined by pop-up menu.
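The two measures can be compared with a short sketch. It assumes that "position" means an alignment column, so that bit score per position is the bit score divided by the alignment length; that reading, the function name and the example numbers are assumptions for illustration.

```python
def passes_min_similarity(identities, bit_score, alignment_length,
                          minimum, measure="identity"):
    """Return True if a match meets the minimum similarity.

    measure: "identity" for percent identity, or "bits_per_position"
    for bit score divided by alignment length (assumed definition).
    """
    if measure == "identity":
        value = 100.0 * identities / alignment_length
    elif measure == "bits_per_position":
        value = bit_score / alignment_length
    else:
        raise ValueError("unknown measure: " + measure)
    return value >= minimum


# A diverged match: 45 identities over 150 aligned positions, 180 bits.
by_identity = passes_min_similarity(45, 180.0, 150, minimum=30)
by_bits = passes_min_similarity(45, 180.0, 150, minimum=1.0,
                                measure="bits_per_position")
too_strict = passes_min_similarity(45, 180.0, 150, minimum=1.5,
                                   measure="bits_per_position")
```

Here the match has 30% identity and 1.2 bits per position, so it passes the first two thresholds and fails the third.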

Min query cover (%) is used to eliminate matches that only cover a small part of the query. Typically these are matches to conserved domains, but they can also be matches to fragmentary genes in the database.

Min subject cover (%) is used to eliminate matches that only cover a small part of the database sequence. Typically these are matches to conserved domains, but they can also be matches to multifunctional genes in the database.
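The two coverage filters differ only in which sequence length is used as the denominator. A minimal sketch, assuming coverage is the matched region's length as a percentage of the full sequence length (the function name and example numbers are hypothetical):

```python
def percent_cover(start, end, length):
    """Percentage of a sequence of the given length covered by the
    matched region start..end (1-based, inclusive)."""
    return 100.0 * (end - start + 1) / length


# A 400-residue query matching an 90-residue database entry over one
# conserved domain: residues 101..180 of the query, 5..84 of the subject.
query_cover = percent_cover(101, 180, 400)    # 20.0
subject_cover = percent_cover(5, 84, 90)      # about 88.9

min_cover = 50.0
kept_by_query_filter = query_cover >= min_cover
kept_by_subject_filter = subject_cover >= min_cover
```

This match would survive a 50% Min subject cover setting but be removed by a 50% Min query cover setting, which is the typical pattern for a hit to a single conserved domain of a longer query.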