Documentation read from 04/17/2019 22:07:25 version of /vol/public-pseed/FIGdisk/FIG/bin/svr_corresponding_genes.

svr_corresponding_genes

Attempt to Tabulate Corresponding Genes from Two Complete Genomes

Example:

    svr_corresponding_genes 107806.1 198804.1

would produce an 18-column table that attempts to present the correspondence between the genes in two genomes (in this case 107806.1 and 198804.1, which are two Buchnera genomes).

There is no input other than the command-line arguments. The two genomes must be specified, and there are two optional arguments that control how the "context" of a gene is determined.

One important aspect of the tool is that it tries to establish the correspondence, and then, for a corresponding pair of genes Ga and Gb, it attempts to determine how many genes in the "context" of Ga map to genes in the "context" of Gb. This is important, since preservation of context considerably increases the confidence of the mapping between Ga and Gb. The optional parameters affect which genes are included in the "context". Using

    -n 5

would indicate that the context of G should include 5 distinct genes to the left of G and 5 distinct genes to the right of G. This notion of distinct was added due to the existence of numerous splice variants in some eukaryotic genomes. Genes are considered to be distinct if the size of the overlap between the genes is less than a threshold. The threshold can be set using the -o parameter. Thus, use of

    -o 1000

would indicate that two genes are distinct iff the boundaries of the two genes overlap by less than 1000 bp. The default is a very high value, so if you specify nothing (which is appropriate for prokaryotic genomes), any two genes will be considered distinct.
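
The distinctness rule and the -n context window can be illustrated with a short Python sketch. It is not the tool's implementation: the gene representation (ID, start, end on a single contig), the function names, the numeric stand-in for the "very high" default, and the choice to require a candidate to be distinct from the central gene as well as from genes already chosen on the same side are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical gene record: (id, start, end), start <= end, one contig.
Gene = Tuple[str, int, int]

# Assumed stand-in for the "very high" default of -o.
DEFAULT_MAX_OVERLAP = 10**9


def overlap(a: Gene, b: Gene) -> int:
    """Number of base pairs shared by two genes (0 if they are disjoint)."""
    return max(0, min(a[2], b[2]) - max(a[1], b[1]) + 1)


def are_distinct(a: Gene, b: Gene, max_overlap: int = DEFAULT_MAX_OVERLAP) -> bool:
    """Two genes are distinct if they overlap by less than max_overlap bp."""
    return overlap(a, b) < max_overlap


def context(genes: List[Gene], i: int, n: int,
            max_overlap: int = DEFAULT_MAX_OVERLAP) -> List[str]:
    """IDs of up to n distinct genes on each side of genes[i].

    genes must be sorted by position. Left-side genes are listed first,
    nearest neighbor first, followed by right-side genes.
    """
    picked: List[Gene] = []
    for step in (-1, 1):          # walk left, then right
        side: List[Gene] = []
        j = i + step
        while 0 <= j < len(genes) and len(side) < n:
            cand = genes[j]
            # Assumption: skip candidates that are not distinct from the
            # central gene or from genes already chosen on this side.
            if all(are_distinct(cand, g, max_overlap) for g in side + [genes[i]]):
                side.append(cand)
            j += step
        picked.extend(side)
    return [g[0] for g in picked]


def count_context_pairs(ctx_a: List[str], ctx_b: List[str],
                        mapping: Dict[str, str]) -> int:
    """Count genes in the context of Ga whose mapped gene is in the context of Gb."""
    targets = set(ctx_b)
    return sum(1 for g in ctx_a if mapping.get(g) in targets)
```

With the default threshold every neighbor counts as distinct, which matches the prokaryotic case described above; a smaller threshold causes heavily overlapping neighbors (such as splice variants) to be skipped, so the window reaches further along the contig.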

Command-Line Options

The program is invoked using

    svr_corresponding_genes [-u ServerUrl] [-n HalfSzOfContext] [-o MaxOverlap] [-d Genome1Dir] Genome1 Genome2

-n HalfSizeOfRegion

This specifies how many genes to the left and to the right of a gene are considered part of its context. The default is 10.

-o MaxOverlap

This allows the user to specify the overlap threshold used when adding genes to the context: two genes are considered "distinct" only if they overlap by less than this value. It defaults to a very large value.

-d Genome1Dir

This allows the user to give a SEED "genome directory" that is used to get data for Genome1, rather than taking it from the SEED itself. This allows one, for example, to use a RAST directory.

-u ServerUrl

This allows the user to specify the URL for the Sapling server. If it is "localhost", then the Sapling method will be run on the local SEED.
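
When the tool is driven from a script, the argument list can be assembled programmatically. The following is a minimal Python sketch that uses only the options documented above; the function name build_command and its parameter names are hypothetical, and the sketch only builds the argument list without running anything.

```python
from typing import List, Optional


def build_command(genome1: str, genome2: str,
                  half_context: Optional[int] = None,
                  max_overlap: Optional[int] = None,
                  genome1_dir: Optional[str] = None,
                  server_url: Optional[str] = None) -> List[str]:
    """Assemble an argument list for svr_corresponding_genes.

    Options left as None are omitted so the tool's own defaults apply.
    """
    cmd = ["svr_corresponding_genes"]
    if server_url is not None:
        cmd += ["-u", server_url]
    if half_context is not None:
        cmd += ["-n", str(half_context)]
    if max_overlap is not None:
        cmd += ["-o", str(max_overlap)]
    if genome1_dir is not None:
        cmd += ["-d", genome1_dir]
    # The two genome IDs are required and come last.
    cmd += [genome1, genome2]
    return cmd
```

On a machine where the tool is installed, the resulting list could be passed to subprocess.run(cmd, capture_output=True, text=True, check=True) to capture the table from standard output.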

Output Format

The standard output is an 18-column tab-delimited file:

Column-1

The ID of a PEG in Genome1.

Column-2

The ID of a PEG in Genome2 that is our best estimate of a "corresponding gene".

Column-3

The number of pairs of matching genes found in the context.

Column-4

Pairs of corresponding genes from the contexts

Column-5

The function of the gene in Genome1

Column-6

The function of the gene in Genome2

Column-7

Aliases of the gene in Genome1 (any protein with an identical sequence is considered an alias, whether or not it is actually the name of the same gene in the same genome)

Column-8

Aliases of the gene in Genome2 (any protein with an identical sequence is considered an alias, whether or not it is actually the name of the same gene in the same genome)

Column-9

Bi-directional best hits will contain "<=>" in this column. Otherwise, an "->" or an "<-" will appear.

Column-10

Percent identity over the region of the detected match

Column-11

The P-score (P-sc) for the detected match

Column-12

Beginning match coordinate in the protein encoded by the gene in Genome1.

Column-13

Ending match coordinate in the protein encoded by the gene in Genome1.

Column-14

Length of the protein encoded by the gene in Genome1.

Column-15

Beginning match coordinate in the protein encoded by the gene in Genome2.

Column-16

Ending match coordinate in the protein encoded by the gene in Genome2.

Column-17

Length of the protein encoded by the gene in Genome2.

Column-18

Bit score for the match. Divide by the length of the longer PEG to get what we often refer to as a "normalized bit score".
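
The column layout above can be read with a few lines of Python. This is a sketch under stated assumptions rather than part of the tool: the field names in COLUMNS are invented labels for the 18 columns, lines that do not have exactly 18 tab-separated fields are skipped, and the "length of the longer PEG" is taken to be the larger of Column-14 and Column-17.

```python
from typing import Dict, Iterator

# Hypothetical labels for Column-1 through Column-18, in order.
COLUMNS = [
    "peg1", "peg2", "context_count", "context_pairs",
    "function1", "function2", "aliases1", "aliases2",
    "direction", "percent_identity", "p_score",
    "begin1", "end1", "length1",
    "begin2", "end2", "length2",
    "bit_score",
]


def parse_rows(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per well-formed line of the tab-delimited output."""
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # assumption: ignore blank or malformed lines
        yield dict(zip(COLUMNS, fields))


def is_bidirectional(row: Dict[str, str]) -> bool:
    """True if Column-9 marks the pair as a bi-directional best hit."""
    return "<=>" in row["direction"]


def normalized_bit_score(row: Dict[str, str]) -> float:
    """Bit score divided by the length of the longer protein."""
    longer = max(int(row["length1"]), int(row["length2"]))
    return float(row["bit_score"]) / longer
```

A typical use would be to keep only rows for which is_bidirectional is true and then rank them by normalized_bit_score or by the context count in Column-3.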
