Documentation read from 04/17/2019 22:07:24 version of /vol/public-pseed/FIGdisk/FIG/bin/svr_CS_pipeline.

svr_CS_pipeline

Generate data needed to support close-strain analysis.

------

Example:

    mkdir Data.Strep
    svr_CS_pipeline -d Data.Strep -g Streptococcus
or
    fill in Data.kmers, rep.genomes and genome.names and use
    svr_CS_pipeline -d Data.Streptococcus
or
    fill in Data.kmers, rep.genomes, genome.names, Seqs, PegLocs, and PegDNA and use
    svr_CS_pipeline -d Data.Streptococcus
or
    fill in Data.kmers, rep.genomes, genome.names, Seqs, PegLocs, PegDNA, and families.all and use
    svr_CS_pipeline -d Data.Streptococcus
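
The four examples above differ only in which inputs are already present in the Data directory. The following Python sketch reports which example a given directory matches. It is a checklist helper only: the function name and the labels are hypothetical, it assumes the listed names are direct children of the directory, and it says nothing about how svr_CS_pipeline itself decides what to compute.

```python
import os
import tempfile

# Input names taken from the examples above; the labels are hypothetical.
GENOMES = ["Data.kmers", "rep.genomes", "genome.names"]
SEQUENCES = GENOMES + ["Seqs", "PegLocs", "PegDNA"]
FAMILIES = SEQUENCES + ["families.all"]

# Checked from the most complete set of inputs to the least complete.
EXAMPLES = [
    ("families supplied", FAMILIES),
    ("sequences supplied", SEQUENCES),
    ("genomes listed", GENOMES),
]


def matching_example(data_dir):
    """Return the label of the most complete example that data_dir satisfies."""
    present = set(os.listdir(data_dir))
    for label, needed in EXAMPLES:
        if all(name in present for name in needed):
            return label
    return "empty directory plus -g Genus"


# Usage: build a throwaway directory holding only the genome lists.
with tempfile.TemporaryDirectory() as data_dir:
    os.mkdir(os.path.join(data_dir, "Data.kmers"))
    for name in ("rep.genomes", "genome.names"):
        open(os.path.join(data_dir, name), "w").close()
    print(matching_example(data_dir))
```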

Command-Line Options

-d Data

This is an extended Data directory (what Bob might call a "close strain workspace"). It includes a Data.kmers directory that kmer_guts uses to annotate PEGs, the "rep.genomes" and "genome.names" files that identify the genomes to be included, a set of derived protein families, and a set of derived files used to support comparative analysis of the genomes.

-r Role for representative genomes (defaults to DNA-directed RNA polymerase beta subunit (EC 2.7.7.6)).
-i IdentityFraction

This is the identity fraction used by Gary's representative_sequences when choosing representative genomes.

-g Genus (required if rep.genomes and genome.names are missing)

Output Format

Output is added to the extended Data directory. The key files are

    families.all [the protein families underlying everything]
             FamilyID - an integer
             Function - function assigned to family
             SubFunction - the Function and an integer (SubFunction) together uniquely
                  determine the FamilyID.  Another way to look at it is

                  a) each family is assigned a unique ID and a function
                  b) multiple families can have the same function (consider
                     "hypothetical protein")
                  c) the Function+SubFunction uniquely determine the FamilyID
             PEG
             LengthProt - the length of the translated PEG
             Mean       - the mean length of PEGs in the family
             StdDev     - standard deviation of lengths for family
             Z-sc       - the Z-score associated with the length of this PEG
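
The families.all columns can be read with a few lines of Python. This is a minimal sketch, not the pipeline's own code: it assumes the file is tab-separated with the eight fields in the order listed above, and it assumes Z-sc follows the conventional definition (LengthProt - Mean) / StdDev. The class and function names are hypothetical, and the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class FamilyRow(NamedTuple):
    family_id: int
    function: str
    sub_function: int
    peg: str
    length_prot: int
    mean: float
    std_dev: float
    z_score: float


def parse_families(lines: Iterable[str]) -> Iterator[FamilyRow]:
    """Yield one row per line; ASSUMES tab-separated fields in the documented order."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 8:
            continue  # skip blank or unexpected lines rather than guess
        fam, func, sub, peg, length, mean, sd, z = fields
        yield FamilyRow(int(fam), func, int(sub), peg, int(length),
                        float(mean), float(sd), float(z))


def length_z_score(length: int, mean: float, std_dev: float) -> float:
    """Conventional z-score; returns 0.0 when the family has no length spread."""
    return 0.0 if std_dev == 0 else (length - mean) / std_dev


def conflicting_keys(rows: Iterable[FamilyRow]) -> List[Tuple[str, int]]:
    """Return (Function, SubFunction) pairs that map to more than one FamilyID."""
    seen: Dict[Tuple[str, int], int] = {}
    bad = []
    for row in rows:
        key = (row.function, row.sub_function)
        if seen.setdefault(key, row.family_id) != row.family_id:
            bad.append(key)
    return bad


# Made-up rows for illustration only (not real pipeline output).
SAMPLE = [
    "1206\thypothetical protein\t1\tfig|1.1.peg.10\t330\t300.0\t15.0\t2.0\n",
    "1206\thypothetical protein\t1\tfig|2.1.peg.44\t285\t300.0\t15.0\t-1.0\n",
    "1207\thypothetical protein\t2\tfig|1.1.peg.11\t120\t120.0\t0.0\t0.0\n",
]
rows = list(parse_families(SAMPLE))
```

Here two families share the function "hypothetical protein" but differ in SubFunction, so conflicting_keys(rows) finds nothing to report, matching points (b) and (c) above.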

    labeled.tree [a rooted labeled newick tree]
    readable.tree [an ascii version of labeled.tree]
    placed.events [adjacency shifts placed on the tree]
        Each line describes an event that occurred on an arc.  The format
        used to encode the events is as follows:
             ancestral node
             node   [the event occurred on the arc from the ancestor to the node]
             family:direction [thus, 1206:upstream means the event occurred as a
                    change of the protein family upstream of family 1206]
             ancestral-adjacency [family:strand of the adjacent family at ancestral node]
             node-adjacency      [family:strand of adjacent family at the child]
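
A single placed.events line can be split into its parts as sketched below. The sketch assumes the five fields are tab-separated and appear in the order listed above. The node labels and the "+"/"-" strand symbols in the sample line are made up, and the adjacency fields are kept as raw strings because their exact encoding is not specified here. The names PlacedEvent and parse_placed_event are hypothetical.

```python
from typing import NamedTuple


class PlacedEvent(NamedTuple):
    ancestor: str
    node: str
    family: str
    direction: str
    ancestral_adjacency: str  # raw "family:strand" text at the ancestral node
    node_adjacency: str       # raw "family:strand" text at the child node


def parse_placed_event(line: str) -> PlacedEvent:
    """Split one line; ASSUMES five tab-separated fields in the documented order."""
    ancestor, node, fam_dir, anc_adj, node_adj = line.rstrip("\n").split("\t")
    # "1206:upstream" -> family "1206", direction "upstream"
    family, direction = fam_dir.split(":", 1)
    return PlacedEvent(ancestor, node, family, direction, anc_adj, node_adj)


# Made-up line for illustration only.
event = parse_placed_event("n12\tn13\t1206:upstream\t877:+\t912:-\n")
print(event.family, event.direction)
```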
    
    where.shifts.occurred [where families were gained/lost on arcs]
        describes where families were gained or lost
             ancestral node
             node (child of ancestor)
             family
             ancestral value
             node value
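
One simple use of where.shifts.occurred is to count how many families changed on each arc of the tree. The sketch below assumes five tab-separated fields per line in the order listed above. The function name is hypothetical, and the sample node labels and the "1"/"0" values are made up, since the encoding of the ancestral and node values is not specified here.

```python
from collections import Counter
from typing import Iterable


def shifts_per_arc(lines: Iterable[str]) -> Counter:
    """Count shift records per (ancestor, node) arc; ASSUMES tab-separated fields."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip blank or unexpected lines rather than guess
        ancestor, node, _family, _ancestral_value, _node_value = fields
        counts[(ancestor, node)] += 1
    return counts


# Made-up lines for illustration only.
SHIFT_SAMPLE = [
    "n12\tn13\t1206\t1\t0\n",
    "n12\tn13\t877\t0\t1\n",
    "n12\tn14\t912\t1\t0\n",
]
arc_counts = shifts_per_arc(SHIFT_SAMPLE)
for arc, count in arc_counts.most_common():
    print(arc, count)
```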

These are the files that drive the "What Changed?" application.