Documentation read from 04/17/2019 22:07:27 version of /vol/public-pseed/FIGdisk/FIG/bin/svr_psiblast_search.

svr_psiblast_search

    svr_psiblast_search [options] < ali.trimmed.fa > hits.extracted.fa

This script takes a FASTA file containing a trimmed protein sequence alignment, uses PSIBLAST to search it against a protein database of complete genomes, and writes the extracted regions of the hits to standard output.

Introduction

usage: svr_psiblast_search [options] < ali.trimmed.fa > hits.fa

       -a   n_processor     - number of processors to use (D = 2)
       -b   database        - database to search against: SEED, CORE, PSEED, PUBSEED (D), FASTA file name, FIG genome ID
       -c   min_frac_cov    - minimum fraction coverage of query and subject sequence (D = 0.20)
       -cq  min_q_cov       - minimum fraction coverage of query sequence (D = 0.50)
       -cs  min_s_cov       - minimum fraction coverage of subject sequence (D = 0.20)
       -e   max_e_val       - maximum psiblast e-Value (D = 0.01)
       -i   min_ident       - minimum fraction identity (D = 0.15)
       -l                   - run search locally if a local database is specified
       -n   max_num_seqs    - maximum matching sequences (D = 5000)
       -p   min_positive    - minimum fraction positive scoring (D = 0.20)
       -r   report_file     - output file of psiblast report
       -u   max_q_uncov     - maximum unmatched query (D = 20)
       -uc  max_q_uncov_c   - maximum unmatched query, c-term (D = 20)
       -un  max_q_uncov_n   - maximum unmatched query, n-term (D = 20)

       options for incremental search:

       -inc                 - incrementally expand an initial set of sequences through multiple psiblast rounds
       -fast                - use fast trimming (trim to conserved domains) (D = 0)
       -nr   min_reps       - only use representative seqs if number of seqs exceeds this threshold (D = 10)
       -nq   max_nquery     - stop incremental search if the number of query sequences exceeds this threshold (D = 500)
       -rep                 - collapse profile seqs into representatives before submitting to psiblast
       -sim  max_reps_sim   - threshold used to collapse seqs into representatives (D = 0.95)
       -stop max_rounds     - stop incremental search after a specified number of psiblast rounds (D = until convergence)
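
As an illustration of how the options above combine, here is a minimal Python sketch that assembles a command line from a few of the documented flags. The helper name build_command and the file name report.txt are hypothetical and not part of svr_psiblast_search; only flags listed in the usage text are emitted, and the command is printed rather than executed.

```python
import shlex

def build_command(database="PUBSEED", max_e_val=0.01, min_ident=0.15,
                  incremental=False, max_rounds=None, report_file=None):
    """Assemble an argv list using only flags listed in the usage text."""
    argv = ["svr_psiblast_search", "-b", database,
            "-e", str(max_e_val), "-i", str(min_ident)]
    if report_file is not None:
        argv += ["-r", report_file]
    if incremental:
        argv.append("-inc")
        # -stop is only meaningful for the incremental search
        if max_rounds is not None:
            argv += ["-stop", str(max_rounds)]
    return argv

cmd = build_command(incremental=True, max_rounds=3, report_file="report.txt")
# The alignment is read from STDIN and the hits are written to STDOUT.
print(shlex.join(cmd) + " < ali.trimmed.fa > hits.fa")
```

The defaults in the sketch mirror the documented defaults (PUBSEED, 0.01, 0.15), so calling build_command() with no arguments should be equivalent to passing those values explicitly.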

Command-Line options

-a n_processor

Number of processors to use (D = 2)

-b database

Database for psiblast to search against. It can be a FASTA file name, a FIG genome ID, or one of the strings SEED, CORE, PSEED, or PUBSEED, each of which selects a preconfigured database of all protein sequences from complete genomes. The default is PUBSEED.

-c min_frac_cov

Minimum fraction coverage of query and subject sequence (D = 0.20)

-cq min_q_cov

Minimum fraction coverage of query sequence (D = 0.50)

-cs min_s_cov

Minimum fraction coverage of subject sequence (D = 0.20)

-e max_e_val

Maximum psiblast e-Value (D = 0.01).

-i min_ident

Minimum fraction identity (D = 0.15).

-l

With the -l option, psiblast search is run locally. The database must be a local FASTA file.

-n max_num_seqs

Maximum matching sequences (D = 5000).

-p min_positive

Minimum fraction of positive scoring AAs (D = 0.20).

-r report_file

Output file name for the psiblast records, which are written as an 11-column table containing:

  [ subject_id, bit_score, e_value,
    subject_length, status,
    fraction_ident, fraction_positive,
    query_uncov_n_term, query_uncov_c_term,
    subject_uncov_n_term, subject_uncov_c_term ]
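
The report columns listed above can be read into named records. The following is a minimal Python sketch that assumes the table is tab-delimited with one hit per line; the delimiter is not stated in this documentation, so that is an assumption, as are the names Hit and parse_report. The sample row is invented for illustration, including its status value.

```python
import csv
from typing import Iterable, List, NamedTuple

class Hit(NamedTuple):
    subject_id: str
    bit_score: float
    e_value: float
    subject_length: int
    status: str
    fraction_ident: float
    fraction_positive: float
    query_uncov_n_term: int
    query_uncov_c_term: int
    subject_uncov_n_term: int
    subject_uncov_c_term: int

def parse_report(lines: Iterable[str]) -> List[Hit]:
    """Parse report rows; assumes tab-delimited text (not confirmed by the docs)."""
    hits = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 11:
            continue  # skip blank or malformed lines
        hits.append(Hit(row[0], float(row[1]), float(row[2]), int(row[3]),
                        row[4], float(row[5]), float(row[6]),
                        int(row[7]), int(row[8]), int(row[9]), int(row[10])))
    return hits

# Invented sample row, for illustration only.
sample = ["hypothetical_id_1\t250.5\t1e-30\t320\tplaceholder\t0.42\t0.61\t3\t5\t10\t12\n"]
for hit in parse_report(sample):
    print(hit.subject_id, hit.e_value, hit.fraction_ident)
```

With a real report, parse_report(open(report_file)) would take the place of the sample list, where report_file is the name given to -r.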

-u max_q_uncov

Maximum unmatched query (D = 20).

-uc max_q_uncov_c

Maximum unmatched query, c-term (D = 20).

-un max_q_uncov_n

Maximum unmatched query, n-term (D = 20).

-inc

With the -inc option, multiple psiblast search rounds will be carried out to expand the input set of sequences. This can be particularly useful when the starting profile contains few sequences.

The input set of sequences can be unaligned. The psiblast hits at the end of each round are aligned, trimmed, and sorted. The top hits are then selected to form the set of profile sequences for the next round. The algorithm tries to expand the set cautiously. Unless the psiblast hits share high identity (~75%) with the profile, the set grows by no more than a factor of 2 each round. If '-stop max_psiblast_rounds' is not specified, the process runs to convergence or until the number of profile sequences reaches 500, at which point a clear pattern should have emerged in the aligned profile sequences.

The command-line options only affect the final round of psiblast. Customized psiblast options are used in the iterative rounds.
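
The cautious growth rule described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the script's actual logic: the function names are hypothetical, "convergence" is taken to mean that a round adds no new profile sequences, and the high-identity case is modelled as simply lifting the factor-of-2 cap.

```python
def next_profile_size(current_size, n_hits, high_identity, max_growth=2):
    """Number of profile sequences to carry into the next round."""
    if high_identity:
        return n_hits  # assumption: no cap when hits closely match the profile
    return min(n_hits, current_size * max_growth)

def simulate(start_size, hits_per_round, max_nquery=500, max_rounds=None):
    """Return the profile size after each round until a stop condition is met."""
    sizes = [start_size]
    for round_no, n_hits in enumerate(hits_per_round, start=1):
        if max_rounds is not None and round_no > max_rounds:
            break  # corresponds to -stop
        new_size = next_profile_size(sizes[-1], n_hits, high_identity=False)
        if new_size <= sizes[-1]:
            break  # assumed convergence: the round added nothing new
        sizes.append(new_size)
        if new_size >= max_nquery:
            break  # corresponds to the 500-sequence limit (-nq)
    return sizes

print(simulate(3, [40, 40, 40, 40, 40]))  # prints [3, 6, 12, 24, 40]
```

Starting from 3 sequences with 40 available hits per round, the set doubles until it reaches the number of hits and then stops growing, which is the convergence case in this simplified model.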

-fast

With the -fast option, fast trimming (trim to conserved domains) is used.

-nr min_seqs_for_reps

Only use representative seqs if number of seqs exceeds this threshold (D = 10)

-nq max_query_seqs

Stop incremental search if the number of query sequences exceeds this threshold (D = 500)

-rep

Collapse profile seqs into representatives before submitting to psiblast if the number of profile sequences in a psiblast round is equal to or greater than min_seqs_for_reps.

-sim max_reps_sim

Specifies the threshold to use for collapsing seqs into representatives (D = 0.95)
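
To make the role of the similarity threshold concrete, here is a minimal Python sketch of one common way to collapse aligned sequences into representatives: a greedy pass that keeps a sequence only if its identity to every representative kept so far is below the threshold. This is an assumption for illustration; the documentation does not say which method or similarity measure svr_psiblast_search actually uses, and both function names are hypothetical.

```python
def fraction_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def pick_representatives(aligned, max_sim=0.95):
    """Greedy selection: keep a sequence only if it is below max_sim to all kept ones."""
    reps = []
    for seq_id, seq in aligned:
        if all(fraction_identity(seq, rep_seq) < max_sim for _, rep_seq in reps):
            reps.append((seq_id, seq))
    return reps

# Invented aligned sequences, for illustration only.
aligned = [
    ("s1", "ACDEFGHIKLMNPQRSTVWY"),
    ("s2", "ACDEFGHIKLMNPQRSTVWY"),  # identical to s1, so it is collapsed
    ("s3", "ACDEFGHIKLAAAAAAAAAA"),  # 50% identical to s1, so it is kept
]
print([seq_id for seq_id, _ in pick_representatives(aligned)])
```

The greedy result depends on input order, so a real implementation would likely fix the order first, for example by sorting on sequence length or score.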

-stop max_psiblast_rounds

Stop incremental search after a specified number of psiblast rounds (D = unlimited).

Input

The input search profile is a FASTA alignment read from STDIN.

Output

The set of hits is written to STDOUT. Coordinates of the extracted sequences are appended to the FASTA comment field.
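
The hits can be read back with any FASTA parser. The minimal Python sketch below splits each header into the sequence ID and the comment field, where the appended coordinates are found. The exact format of the coordinates is not specified here, so the comment is returned as raw text; the function name read_fasta and the sample record are hypothetical.

```python
def read_fasta(lines):
    """Yield (seq_id, comment, sequence) for each FASTA record."""
    seq_id, comment, chunks = None, "", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, comment, "".join(chunks)
            # The ID is the first word; the rest of the header is the comment.
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line.strip())
    if seq_id is not None:
        yield seq_id, comment, "".join(chunks)

# Invented record, for illustration only; the comment text is a placeholder.
sample = ">hypothetical_id_1 placeholder comment text\nMKTAYIAK\nQRQISFVK\n"
for seq_id, comment, seq in read_fasta(sample.splitlines()):
    print(seq_id, "|", comment, "|", len(seq))
```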

If the -inc option is specified, the psiblast history is written to STDERR as a 4-column table. Each row gives the search status for one psiblast round:

  [ profile_length, num_starting_seqs, num_trimmed_reps, num_psiblast_hits ]
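
When the history is captured from STDERR, for example by redirecting it to a file, it can be parsed as follows. This Python sketch assumes the four columns are whitespace-delimited integers and that any other STDERR lines should be ignored; neither assumption is confirmed by this documentation. The names RoundStatus and parse_history, and the sample values, are hypothetical.

```python
from typing import Iterable, List, NamedTuple

class RoundStatus(NamedTuple):
    profile_length: int
    num_starting_seqs: int
    num_trimmed_reps: int
    num_psiblast_hits: int

def parse_history(lines: Iterable[str]) -> List[RoundStatus]:
    """Collect one RoundStatus per line that consists of exactly four integers."""
    rounds = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or not all(f.isdigit() for f in fields):
            continue  # skip other messages that may appear on STDERR
        rounds.append(RoundStatus(*map(int, fields)))
    return rounds

# Invented history lines, for illustration only.
sample = ["210\t4\t4\t9\n", "some other message\n", "215\t8\t7\t21\n"]
for number, status in enumerate(parse_history(sample), start=1):
    print(number, status.num_starting_seqs, status.num_psiblast_hits)
```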