So, what does this mean for similarities? The similarities are computed only
for the representative (longest) version of each protein. The results are
automatically adjusted to look like those that would be found for the
user-requested query. However, the SEED has no way of knowing which subject
sequence (matching sequence) will be of greatest interest to the user. On
one hand, listing all equivalent sequences is time consuming, and clutters
the list with multiple versions of each match. On the other hand, the
description associated with the representative sequence might be less useful
than that associated with other entries. Perhaps the most annoying issue is
that the entry for the SEED genome (the FIG sequence) is often not the
representative, and hence is not visible without expanding the list to show
all equivalent sequences. A less obvious consequence is that without
expanding the list of matches, proteins identical to the query in the same or
other genomes will not be included in the table of Similarities! (Of course,
they are already displayed as "Assignments for Essentially Identical
Proteins", but you might not have realized that this is what that list
represents.)
Adjusting results that were computed for the representative sequence has
potentially unexpected, or even undesired, consequences, but closer
consideration often suggests that they are mostly harmless, or perhaps even
blessings in disguise. In particular, the reported region of similarity can
have a negative start coordinate, because the similarity extends beyond the
reported start of the particular protein. Also, similarity scores are not
adjusted for shorter sequences that might not include the entire region of
similarity (again, this is only the case when the reported start point is a
negative sequence position).
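The negative start coordinate can be illustrated with a minimal sketch. It assumes, purely for illustration, that a match region on the representative sequence is converted to the requested protein by subtracting a fixed offset. The function name to_query_coords and the offset parameter are hypothetical and are not taken from the SEED code, and the exact coordinate convention used by the SEED may differ.

```python
def to_query_coords(rep_start, rep_end, offset):
    """Shift a match region from representative coordinates to the
    coordinates of a shorter version of the same protein.

    offset is the number of residues by which the representative extends
    upstream of the shorter version (a hypothetical parameter).
    """
    return rep_start - offset, rep_end - offset


# The representative has 30 extra residues at its start. A match covering
# representative positions 11-250 therefore begins before the shorter protein.
start, end = to_query_coords(11, 250, offset=30)
print(start, end)  # -19 220
```

Under this assumption the score attached to the match is still the one computed for the full region, which is the behavior described above for negative start positions.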
Max expand lets the user control the expansion of the representative database sequence (which in this context should be viewed as arbitrarily chosen) into the list of equivalent sequences. Expanding is essential for two functions: showing just FIG sequences, and showing sequences that are identical to the query. Beyond these two cases, expanding at least some sequences is frequently useful for seeing what diverse databases (e.g., UniProt, KEGG and RefSeq) have to say about the most significant matches. Sometimes fewer entries appear to have been expanded than requested, because one or more expanded entries were filtered out by later tests (e.g., they were environmental sequences, or they did not have a FIG ID). Regardless of the value of Max expand, the process stops as soon as Max sims similarities have been reported.
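The interaction between Max expand, the later filters, and Max sims can be sketched as follows. This is only an illustration under stated assumptions, not the SEED implementation: it assumes that Max expand counts how many of the top representative matches are expanded, and that each list of equivalents includes the representative itself. The function expand_similarities, the keep filter, and all sequence IDs are invented for the example.

```python
def expand_similarities(rep_hits, equivalents, max_expand, max_sims,
                        keep=lambda seq_id: True):
    """Expand the first max_expand representative hits into their
    equivalent sequences, apply a filter, and stop at max_sims rows."""
    reported = []
    for rank, rep_id in enumerate(rep_hits):
        if rank < max_expand:
            # Assumption: the equivalents list contains the representative.
            candidates = equivalents.get(rep_id, [rep_id])
        else:
            candidates = [rep_id]
        for seq_id in candidates:
            if len(reported) >= max_sims:
                return reported  # Max sims always ends the process
            if keep(seq_id):
                reported.append(seq_id)
    return reported


rep_hits = ["rep1", "rep2", "rep3"]
equivalents = {
    "rep1": ["rep1", "fig|1.1.peg.1", "uni|A"],
    "rep2": ["rep2", "fig|2.2.peg.2"],
}

# Expand only the best match, and report at most four similarities.
print(expand_similarities(rep_hits, equivalents, max_expand=1, max_sims=4))

# Show just FIG sequences: unexpanded or non-FIG entries are filtered out.
only_fig = lambda seq_id: seq_id.startswith("fig|")
print(expand_similarities(rep_hits, equivalents, 2, 10, keep=only_fig))
```

In the second call the third match is never expanded and has no FIG ID, so it disappears from the list, which is how a filter can make it look as if fewer entries were expanded.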
Max E-val sets an upper limit on the E-value (the expected number of random database matches this good or better) of matches that will be displayed (that is, a lower limit on the significance). The highest E-value actually reported is never greater than this value, but can be less due to the E-value cutoff used in computing the original similarities, and/or the limited number of similarities stored in the database.
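As a small sketch of the cutoff, assuming each match is held as a dictionary with an evalue field (a made-up representation, not the SEED data structure), the filter keeps only matches at or below Max E-val. The example values are invented.

```python
def filter_by_evalue(hits, max_eval):
    """Keep only matches whose E-value does not exceed max_eval."""
    return [hit for hit in hits if hit["evalue"] <= max_eval]


hits = [
    {"id": "a", "evalue": 1e-50},
    {"id": "b", "evalue": 1e-8},
    {"id": "c", "evalue": 0.5},
]

kept = filter_by_evalue(hits, max_eval=1e-5)
print([hit["id"] for hit in kept])       # ['a', 'b']
print(max(hit["evalue"] for hit in kept))  # 1e-08
```

The highest E-value that survives here is 1e-8, well below the requested limit of 1e-5, simply because no stored match falls between the two values.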
There is a pop-up menu to select the treatment of entries when they are expanded (and even to force the expanding of additional entries):
Show Env. samples is used to enable the reporting of similarities to environmental sequences. By default, environmental sequences are not reported because all of their annotations are indirect. However, they may be displayed so that the user can annotate them, or explore other properties, such as genomic context.
Hide aliases removes the aliases column from the similarities table, primarily to save screen real estate.
Sort by:
For the above two options, the difference in the * version (relative to the non-* version) is a small sample penalty so that very short sequences will be less apt to randomly appear very early in the list. As the order in the menu might suggest, the * versions are recommended, though details of their behavior might be changed in the future.
Group by genome is a useful method to collect paralogs within a genome to the same location in the similarities list. There are several properties of the function that need to be understood to use it safely and effectively:
The more similarities button acts the same as the resubmit button, but it also doubles the values of Max sims and Max expand.
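The effect of the button on the settings can be sketched as below. The SimSettings class, its field names, and the starting values are hypothetical; only the doubling of Max sims and Max expand comes from the description above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimSettings:
    """Hypothetical container for a few similarity settings."""
    max_sims: int
    max_expand: int
    max_eval: float


def more_similarities(settings):
    """Return the settings a resubmit would use after doubling
    Max sims and Max expand; everything else is left unchanged."""
    return replace(settings,
                   max_sims=settings.max_sims * 2,
                   max_expand=settings.max_expand * 2)


current = SimSettings(max_sims=50, max_expand=5, max_eval=1e-5)
print(more_similarities(current))
```

Pressing the button twice in this sketch would go from 50 and 5 to 200 and 20, with the E-value limit untouched.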
The previous PEG button (when present) navigates to the next lower numbered protein coding gene in the genome. This is not necessarily in the same contig, and in some genomes, it can be anywhere. This operation conserves all of the similarities settings (unlike navigating via the genome context table, or map).
The next PEG button (when present) navigates to the next higher numbered protein coding gene in the genome. This is not necessarily in the same contig, and in some genomes, it can be anywhere. This operation conserves all of the similarities settings (unlike navigating via the genome context table, or map).
The more sim options button does what it says: it gives you more options.
Min similarity defines another way to cut off weak matches, in this case by percent identity (as reported by BLAST, which comes with caveats) or by bit score per position. Although the latter is less intuitive than percent identity, it is probably better for highly diverged sequences, in that the most common amino acid replacements still get a positive score. The effective range of this measure is 0 to 2 bits. The measurement option is selected with the "as defined by" pop-up menu.
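A rough sketch of the two measures follows. It assumes that bit score per position means the BLAST bit score divided by the number of aligned positions; the exact denominator used by the SEED is not stated here, so treat that as an assumption. The function names, the measure labels, and the example numbers are invented.

```python
def bits_per_position(bit_score, aligned_positions):
    """Bit score normalized by alignment length (assumed definition)."""
    if aligned_positions <= 0:
        raise ValueError("aligned_positions must be positive")
    return bit_score / aligned_positions


def passes_min_similarity(hit, threshold, measure="identity"):
    """Apply a Min similarity cutoff using one of the two measures."""
    if measure == "identity":
        value = hit["percent_identity"]
    elif measure == "bits_per_position":
        value = bits_per_position(hit["bit_score"], hit["aligned_positions"])
    else:
        raise ValueError(f"unknown measure: {measure}")
    return value >= threshold


# A diverged match: low identity, but a respectable normalized bit score.
hit = {"percent_identity": 28.0, "bit_score": 300.0, "aligned_positions": 250}
print(passes_min_similarity(hit, 30.0, measure="identity"))
print(passes_min_similarity(hit, 1.0, measure="bits_per_position"))
```

With these invented numbers the match is rejected at 30% identity but accepted at 1 bit per position, which is the kind of case where the second measure is the more forgiving one.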
Min query cover (%) is used to eliminate matches that only cover a small part of the query. Typically these are matches to conserved domains, but they can also be matches to fragmentary genes in the database.
Min subject cover (%) is used to eliminate matches that only cover a small part of the database sequence. Typically these are matches to conserved domains, but they can also be matches to multifunctional genes in the database.
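The two coverage filters can be sketched together. The example assumes 1-based, inclusive match coordinates that lie within each sequence, and computes coverage as matched length over total length; percent_cover, passes_cover, and the field names are hypothetical, and the numbers are invented.

```python
def percent_cover(start, end, length):
    """Percentage of a sequence covered by the region start..end
    (1-based, inclusive coordinates are assumed)."""
    return 100.0 * (end - start + 1) / length


def passes_cover(hit, min_query_cover, min_subject_cover):
    """Apply the Min query cover and Min subject cover cutoffs."""
    q_cover = percent_cover(hit["q_start"], hit["q_end"], hit["q_len"])
    s_cover = percent_cover(hit["s_start"], hit["s_end"], hit["s_len"])
    return q_cover >= min_query_cover and s_cover >= min_subject_cover


# A 400-residue query matching one part of a 1200-residue
# multifunctional database protein.
hit = {"q_start": 1, "q_end": 400, "q_len": 400,
       "s_start": 51, "s_end": 450, "s_len": 1200}

print(percent_cover(hit["q_start"], hit["q_end"], hit["q_len"]))  # 100.0
print(passes_cover(hit, min_query_cover=70, min_subject_cover=0))
print(passes_cover(hit, min_query_cover=70, min_subject_cover=70))
```

The query is fully covered, so only a Min subject cover cutoff removes this match; the subject is covered to about one third of its length.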