Documentation read from 04/17/2019 22:07:27 version of /vol/public-pseed/FIGdisk/FIG/bin/svr_make_pan_genome_prot_families.

svr_make_pan_genome_prot_families < InputDefiningGenomes > ProteinFamilies

Construct the protein families needed to study Pan Genomes

Introduction

The study of pan genomes focuses on protein families composed of corresponding proteins. This program reads an input file that defines where to find each genome, which proteins it contains, where those proteins are located, and what functions they have been assigned. It writes the resulting protein families to standard output.

Command-Line Arguments

The program is invoked using

    svr_make_pan_genome_prot_families [options] < FileDefiningGenomes > ProteinFamilies

Each genome can be identified by a genome ID from P-SEED, a SEED/RAST directory, or a triple of files (fasta, tbl, assigned_functions). Each line of the input file describes one of these three sources of a genome.

-d

Directory used to store the binary correspondences

-i

Minimum identity used in forming binary correspondences (defaults to 80)

-bbhs=[0|1]

Use -bbhs=1 to force connections to be bidirectional best hits (BBHs). Defaults to 1, so use -bbhs=0 to get a looser matching procedure.
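
As an illustration of what a bidirectional best hit is, the following Python sketch pairs proteins from two genomes. It is not taken from the program: the function name, the score table, and the protein IDs are hypothetical, and it shows only the general BBH idea.

```python
def bidirectional_best_hits(scores):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit.

    `scores` maps (protein_in_genome_1, protein_in_genome_2) to a similarity
    score, higher being better. This input layout is an assumption made
    for the illustration.
    """
    best_for_a = {}  # protein in genome 1 -> (score, best partner in genome 2)
    best_for_b = {}  # protein in genome 2 -> (score, best partner in genome 1)
    for (a, b), score in scores.items():
        if a not in best_for_a or score > best_for_a[a][0]:
            best_for_a[a] = (score, b)
        if b not in best_for_b or score > best_for_b[b][0]:
            best_for_b[b] = (score, a)
    # Keep a pair only when the best hit agrees in both directions.
    return sorted(
        (a, b) for a, (_, b) in best_for_a.items() if best_for_b[b][1] == a
    )


# Made-up scores for two proteins in each of two genomes.
example_scores = {
    ("g1.peg.1", "g2.peg.7"): 95.0,
    ("g1.peg.1", "g2.peg.8"): 81.0,
    ("g1.peg.2", "g2.peg.7"): 88.0,
    ("g1.peg.2", "g2.peg.8"): 70.0,
}
bbh_pairs = bidirectional_best_hits(example_scores)
```

In this example only g1.peg.1 and g2.peg.7 choose each other. g1.peg.2 also prefers g2.peg.7, but that preference is not returned, so a strict BBH rule leaves it unpaired.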

-n

Minimum number of genes in context that can be paired (defaults to 5)

-numMatchingFunctionsInContext=N

Minimum number of pairs in the context that must have matching functions (defaults to min(2, number of genes in context)).

-maxPsc=pscore

Maximum p-score allowed in correspondences (defaults to 1.0e-10)

-coverage=Frac

Fraction of each gene in a pair that must lie within the region of similarity for the pair to be considered "corresponding" (defaults to 0.7)
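
The coverage rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's code: the function names are hypothetical, and match coordinates are assumed to be 1-based and inclusive.

```python
def meets_coverage(match_start, match_end, gene_length, min_fraction=0.7):
    """True if the matched region covers at least min_fraction of the gene.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    matched = match_end - match_start + 1
    return matched / gene_length >= min_fraction


def pair_corresponds(region_a, len_a, region_b, len_b, min_fraction=0.7):
    """Both genes of the pair must pass the coverage test."""
    return (meets_coverage(*region_a, len_a, min_fraction)
            and meets_coverage(*region_b, len_b, min_fraction))


# Gene A: 280 of 300 residues matched; gene B: only 151 of 400 matched.
well_covered = pair_corresponds((1, 280), 300, (1, 290), 310)
poorly_covered = pair_corresponds((1, 280), 300, (50, 200), 400)
```

The test is applied to each gene separately, so a short protein that matches only one domain of a much longer protein fails even though the short protein itself is fully covered.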

-p=N

Number of computations of correspondences that can be run in parallel.
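
For callers who drive the program from a script, the following Python sketch assembles an argument list and connects the input and output files to standard input and standard output. The helper names are hypothetical. Only options that this page writes in the -name=value form are included, since the value syntax for -d, -i, and -n is not spelled out here.

```python
import subprocess


def build_command(bbhs=None, coverage=None, max_psc=None, parallel=None):
    """Build the argument list; options left as None are omitted."""
    cmd = ["svr_make_pan_genome_prot_families"]
    if bbhs is not None:
        cmd.append(f"-bbhs={int(bbhs)}")
    if coverage is not None:
        cmd.append(f"-coverage={coverage}")
    if max_psc is not None:
        cmd.append(f"-maxPsc={max_psc}")
    if parallel is not None:
        cmd.append(f"-p={parallel}")
    return cmd


def run_families(input_path, output_path, **options):
    """Run the program with the genome list on stdin and families on stdout."""
    with open(input_path) as infile, open(output_path, "w") as outfile:
        subprocess.run(build_command(**options), stdin=infile,
                       stdout=outfile, check=True)


# Looser matching, stricter coverage, four parallel computations.
cmd = build_command(bbhs=0, coverage=0.8, parallel=4)
```

Passing the arguments as a list avoids shell quoting issues; run_families is defined but not called here because it needs the program to be installed.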

Output

The output file defines the resulting protein families. Each line contains

    [SetNumber,ProteinID,AssignedFunction]
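
A minimal Python sketch for grouping the output lines into families follows. It assumes the three fields are tab-separated, which this page does not state explicitly; the function name and the sample lines are made up for illustration.

```python
from collections import defaultdict


def read_families(lines):
    """Group (ProteinID, AssignedFunction) pairs by SetNumber.

    Assumes tab-separated fields; adjust the delimiter if the real
    output differs.
    """
    families = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first two tabs only, so the function text stays whole.
        set_number, protein_id, function = line.split("\t", 2)
        families[set_number].append((protein_id, function))
    return dict(families)


# Made-up sample output: two families, three proteins.
sample = [
    "1\tfig|1111.1.peg.1\thypothetical protein\n",
    "1\tfig|2222.1.peg.9\thypothetical protein\n",
    "2\tfig|1111.1.peg.2\tDNA polymerase I\n",
]
families = read_families(sample)
```

With a real run, the same function can be given an open file handle for the ProteinFamilies file in place of the sample list.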