Documentation read from 04/17/2019 22:07:26 version of /vol/public-pseed/FIGdisk/FIG/bin/svr_get_rep_genomes.

svr_get_rep_genomes

Get a set of representative genomes using heuristics and the NCBI taxonomy.

------

Example:

    svr_get_rep_genomes -n 80 -f taxonomies -t Proteobacteria -c 1 -m 2000000

would produce a 5-column table. The first column contains the KBase ID of each selected genome, the second the SEED ID, the third the size of the genome, the fourth the number of contigs, and the fifth the NCBI taxonomy.

    -n says "get 80 genomes"
    -f taxonomies names a file that should contain the NCBI taxonomies (to build
          it, run this program with the name of a file that does not exist;
          the program then creates the file)
    -t Proteobacteria says "get the 80 genomes from the taxonomic grouping Proteobacteria"
    -c 1 says "give me only genomes with a single contig"
    -m 2000000 says "get only genomes that are at least 2 Mbp in size"

------

Command-Line Options

-k File [name of file containing already selected genomes]
-n N [the number of genomes being requested - default is 100]

You may or may not get exactly that number of genomes.

-f tax-file [a file in which taxonomies have been cached]

If the file does not exist, running the program builds it (and it may take a few minutes).

-t [taxonomic group - default is 'Bacteria']

Scan the tax-file if you are not sure of the NCBI names of taxonomic groups.

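-c Max-Contigs [the maximum number of contigs a selected genome may have]
-m Min-DNA-size [the minimum size of a selected genome, in bp]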
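
The options above can be assembled into a command line programmatically. The following is a minimal Python sketch, not part of the tool itself: the helper names build_command and run_command are hypothetical, and it assumes svr_get_rep_genomes is on the PATH when run_command is called. Passing -n, -f and -t explicitly on every call is a choice made here for clarity; the documentation does not say which options are mandatory.

```python
import subprocess


def build_command(n=100, tax_file="taxonomies", group="Bacteria",
                  max_contigs=None, min_size=None, keep_file=None):
    """Assemble an argument list for svr_get_rep_genomes (hypothetical helper)."""
    # -n and -t use the defaults documented above; they are passed explicitly here.
    cmd = ["svr_get_rep_genomes", "-n", str(n), "-f", tax_file, "-t", group]
    if max_contigs is not None:
        cmd += ["-c", str(max_contigs)]   # maximum number of contigs
    if min_size is not None:
        cmd += ["-m", str(min_size)]      # minimum DNA size
    if keep_file is not None:
        cmd += ["-k", keep_file]          # file of already selected genomes
    return cmd


def run_command(cmd):
    """Run the command and return its standard output as text.

    Assumes svr_get_rep_genomes is installed and on the PATH.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Reproduce the example invocation shown at the top of this page.
example = build_command(n=80, group="Proteobacteria",
                        max_contigs=1, min_size=2000000)
print(" ".join(example))
```

Building the arguments as a list rather than a single string avoids shell quoting problems when a file name or taxonomic group contains spaces.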

Output Format

The standard output is a tab-delimited file. It consists of the following fields:

      the KBase ID of a selected genome
      the SEED ID of a selected genome
      the size of the genome (in bp)
      the number of contigs in the genome
      a representation of the NCBI taxonomy with consecutive groups separated by ": "
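
The five fields described above can be read back into structured records. The following is a minimal Python sketch under the assumption that each output line holds exactly the five tab-separated fields listed; the names RepGenome and parse_rep_genomes are hypothetical, and the sample line uses made-up placeholder IDs and sizes, not real output.

```python
from typing import List, NamedTuple


class RepGenome(NamedTuple):
    """One row of svr_get_rep_genomes output (hypothetical record type)."""
    kbase_id: str
    seed_id: str
    size_bp: int
    contigs: int
    taxonomy: List[str]


def parse_rep_genomes(text: str) -> List[RepGenome]:
    """Parse tab-delimited output into RepGenome records."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split into at most five fields so the taxonomy stays in one piece.
        kbase_id, seed_id, size, contigs, taxonomy = line.split("\t", 4)
        rows.append(RepGenome(
            kbase_id=kbase_id,
            seed_id=seed_id,
            size_bp=int(size),
            contigs=int(contigs),
            # Consecutive taxonomic groups are separated by ": ".
            taxonomy=taxonomy.split(": "),
        ))
    return rows


# Placeholder line for illustration only; the IDs and size are invented.
sample = ("KBASE_ID_1\tSEED_ID_1\t2500000\t1\t"
          "Bacteria: Proteobacteria: Gammaproteobacteria\n")
parsed = parse_rep_genomes(sample)
print(parsed[0].taxonomy)
```

Splitting the taxonomy on ": " gives a list ordered from the broadest group to the narrowest, which makes it straightforward to group the selected genomes at any taxonomic level.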