Documentation read from 04/17/2019 22:07:27 version of /vol/public-pseed/FIGdisk/FIG/bin/svr_representative_sequences.
usage: representative_sequences [ opts ] [ rep_seqs_0 ] < new_seqs > rep_seqs
-a                - number of threads used by blastall (D=2)
-b                - order input sequences by size (long to short)
-c cluster_type   - behavior of clustering algorithm (0 or 1, D=1)
-d seq_clust_dir  - directory for files of clustered sequences
-f id_clust_file  - file with one line per cluster, listing its ids
-g keep_gid_list  - list of genome IDs to keep
-i keep_id_list   - list of sequence IDs to keep
-l log_file       - real-time record of clustering, one line per seq
-m measure_of_sim - measure of similarity to use: identity_fraction (default), positive_fraction (proteins only), or score_per_position (0-2 bits)
-s similarity     - similarity required to be clustered (D = 0.8)

Sequences are clustered, with one representative sequence reported for each cluster. rep_seqs_0 is an optional file of sequences to be assigned to unique clusters, regardless of their similarities. Each new sequence is added to the cluster with the most similar representative sequence, or, if its similarity to any existing representative is less than 'similarity', it becomes the representative of a new cluster.

With the -d option, each cluster of sequences is written to a distinct file in the specified directory.

With the -f option, for each cluster, a tab-separated list of ids is written to the specified file.

With the -l option, the id of each sequence analyzed is written to the log file, followed by the id of the sequence that represents it (when appropriate).

cluster_type 0 is the original method, which has only the representative for each group in the blast database. This can randomly segregate distant members of groups, regardless of the placement of other very similar sequences.

cluster_type 1 adds more diverse representatives of a group in the blast database. This is slightly more expensive, but is much less likely to split close relatives into different groups.
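The assignment rule described above can be illustrated with a small Python sketch. It is not the program's implementation: the real tool measures similarity with blastall, whereas the hypothetical `identity_fraction` helper below simply compares characters position by position, and `cluster` is a name chosen only for this example.

```python
def identity_fraction(a, b):
    """Toy stand-in for a BLAST-derived similarity (assumption):
    matching positions divided by the longer sequence length."""
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def cluster(new_seqs, similarity=0.8, rep_seqs_0=()):
    """Greedy clustering of (id, sequence) pairs.

    Each new sequence joins the cluster of its most similar representative,
    or becomes a new representative if no similarity reaches the threshold.
    """
    # Every sequence in rep_seqs_0 gets its own cluster, whatever its similarity.
    reps = [(seq_id, seq) for seq_id, seq in rep_seqs_0]
    clusters = {seq_id: [seq_id] for seq_id, _ in reps}

    for seq_id, seq in new_seqs:
        best_id, best_sim = None, -1.0
        for rep_id, rep_seq in reps:
            sim = identity_fraction(seq, rep_seq)
            if sim > best_sim:
                best_id, best_sim = rep_id, sim
        if best_id is not None and best_sim >= similarity:
            clusters[best_id].append(seq_id)
        else:
            reps.append((seq_id, seq))
            clusters[seq_id] = [seq_id]
    return clusters


# Made-up example sequences: s2 differs from s1 at one of ten positions.
example = [("s1", "MKTAYIAKQR"), ("s2", "MKTAYIAKQK"), ("s3", "GGGGGGGGGG")]
example_clusters = cluster(example, similarity=0.8)
```

With the default threshold of 0.8, s2 joins the cluster represented by s1, and s3 starts a cluster of its own. Raising the threshold above 0.9 would leave every sequence as its own representative.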
The -a option sets the number of threads used by blastall (D=2).
The -b option orders the input sequences by size (long to short).
The -c option selects the behavior of the clustering algorithm (0 or 1, D=1).
cluster_type 0 is the original method, which has only the representative for each group in the blast database. This can randomly segregate distant members of groups, regardless of the placement of other very similar sequences.
cluster_type 1 adds more diverse representatives of a group in the blast database. This is slightly more expensive, but is much less likely to split close relatives into different groups.
With the -d option, each cluster of sequences is written to a distinct file in the specified directory.
With the -f option, for each cluster, a tab-separated list of ids is written to the specified file.
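A file written with -f can be read back with a few lines of Python. The sketch below assumes only what is stated above (one cluster per line, ids separated by tabs). The text does not say whether the representative is listed first, so the code does not depend on the order. The function names and the sample ids are made up for illustration.

```python
def read_id_clusters(lines):
    """Parse lines of tab-separated ids into a list of clusters (lists of ids)."""
    clusters = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        clusters.append(line.split("\t"))
    return clusters


def cluster_index(clusters):
    """Map every id to the index of the cluster that contains it."""
    return {seq_id: i for i, members in enumerate(clusters) for seq_id in members}


# Made-up sample content in the -f layout.
sample_lines = [
    "fig|83333.1.peg.1\tfig|83333.1.peg.7\n",
    "fig|83333.1.peg.2\n",
]
sample_clusters = read_id_clusters(sample_lines)
sample_index = cluster_index(sample_clusters)
```

In practice the lines would come from `open(id_clust_file)`; the index makes it easy to ask which cluster a given id ended up in.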
The file specified with -g contains lines beginning with a genome ID. Sequences whose IDs belong to these genomes are always kept.
The file specified with -i contains lines, each of which is a comma-separated list of FIG IDs. Sequences with these IDs are always kept.
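The two keep lists can be sketched in Python as follows. Several details are assumptions, not statements from this page: that the genome ID on each line is followed by whitespace (or nothing), that FIG IDs have the usual `fig|<genome>.<type>.<number>` layout, and all of the function names.

```python
import re


def read_keep_genomes(lines):
    """Collect the genome ID at the start of each non-empty line.
    Assumption: the ID ends at the first whitespace character."""
    return {line.split()[0] for line in lines if line.strip()}


def read_keep_ids(lines):
    """Collect FIG IDs from lines of comma-separated IDs."""
    ids = set()
    for line in lines:
        ids.update(tok.strip() for tok in line.split(",") if tok.strip())
    return ids


def genome_of(fig_id):
    """Genome part of a FIG ID such as 'fig|83333.1.peg.4' (assumed layout)."""
    m = re.match(r"fig\|(\d+\.\d+)\.", fig_id)
    return m.group(1) if m else None


def is_kept(seq_id, keep_genomes, keep_ids):
    """True if the sequence is listed explicitly or belongs to a kept genome."""
    return seq_id in keep_ids or genome_of(seq_id) in keep_genomes


# Made-up sample content for the two files.
keep_genomes = read_keep_genomes(["83333.1\tsome genome name\n", "\n"])
keep_ids = read_keep_ids(["fig|224308.1.peg.9, fig|224308.1.peg.10\n"])
```

A sequence such as `fig|83333.1.peg.55` is then kept because of its genome, while `fig|224308.1.peg.9` is kept because it is listed by ID.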
The -l log file is used to see the details of the clustering process. We doubt that most users will find it necessary.
Sequences are removed if their similarity to a "kept" sequence exceeds a specified threshold (see the -s similarity option below).
The possible measures of similarity that you can specify with -m are as follows:
identity_fraction (default), positive_fraction (proteins only), or score_per_position (0-2 bits)
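To make the three names concrete, here is one plausible way such measures could be computed from the summary of a single pairwise alignment. This page does not define the formulas, so everything below is an assumption: in particular the use of the aligned length as the denominator, the function name, and the parameter names.

```python
def similarity_value(measure, identities, positives, bit_score, aligned_length):
    """Return the chosen similarity measure for one pairwise alignment.

    Assumption: each measure is normalized by the number of aligned positions.
    """
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    if measure == "identity_fraction":
        return identities / aligned_length
    if measure == "positive_fraction":  # only meaningful for proteins
        return positives / aligned_length
    if measure == "score_per_position":  # bits per aligned position
        return bit_score / aligned_length
    raise ValueError(f"unknown measure: {measure}")


# Made-up alignment: 100 positions, 90 identical, 95 positive, 150 bits.
example_identity = similarity_value("identity_fraction", 90, 95, 150.0, 100)
example_score = similarity_value("score_per_position", 90, 95, 150.0, 100)
```

Under these assumptions the example alignment has an identity fraction of 0.9 and a score of 1.5 bits per position, which shows why the threshold given with -s must be chosen on the scale of the selected measure.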
The -s option sets the similarity threshold used to determine when sequences are deleted (but remain represented by a kept sequence). The default is 0.8.
You have the option of reading all of the sequences from STDIN, but you can also specify a set of files as arguments on the command line. All of these files (plus STDIN) are sources for the input sequences.
The set of retained sequences is written to STDOUT. Which sequences are represented by each retained sequence can be determined by the output indicated in -f (a file of grouped sequences, one group per line) or in -d (a directory in which each file represents a single group).
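For scripted use, the command can be driven from Python. The sketch below only assembles options documented on this page and connects STDIN and STDOUT to files. It assumes the program is installed on the PATH as `svr_representative_sequences`; the helper names `build_command` and `run_clustering` are made up for this example.

```python
import subprocess


def build_command(similarity=0.8, measure="identity_fraction", cluster_type=1,
                  seq_clust_dir=None, id_clust_file=None, log_file=None,
                  program="svr_representative_sequences"):
    """Assemble the argument list from the options documented above.
    Assumption: the executable name in `program` is available on the PATH."""
    cmd = [program, "-c", str(cluster_type), "-m", measure, "-s", str(similarity)]
    if seq_clust_dir:
        cmd += ["-d", seq_clust_dir]
    if id_clust_file:
        cmd += ["-f", id_clust_file]
    if log_file:
        cmd += ["-l", log_file]
    return cmd


def run_clustering(cmd, new_seqs_path, rep_seqs_path):
    """Feed new_seqs on STDIN and write the retained sequences from STDOUT."""
    with open(new_seqs_path, "rb") as fin, open(rep_seqs_path, "wb") as fout:
        subprocess.run(cmd, stdin=fin, stdout=fout, check=True)


# Build (but do not run) a command that also writes the grouped ids with -f.
example_cmd = build_command(similarity=0.9, id_clust_file="clusters.tsv")
```

Calling `run_clustering(example_cmd, "new_seqs.fasta", "rep_seqs.fasta")` would then correspond to the usage line at the top of this page, with the file names being placeholders.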