Documentation read from 04/17/2019 22:07:28 version of /vol/public-pseed/FIGdisk/FIG/bin/svr_trim_ali.

svr_trim_ali

    svr_trim_ali [options] < ali.fa > trim.fa

This script takes a FASTA file of aligned sequences, trims the alignment by running PSIBLAST against the sequences themselves, and writes the trimmed alignment to the standard output.

Introduction

usage: svr_trim_ali [options] < ali.fa > trim.fa

       -a align_tool   - alignment tool to use: Clustal (D), MAFFT, Muscle. 
       -c              - append trimming coordinates to description fields in FASTA
       -d log_dir      - directory for log files
       -e log_prefix   - prefix for log file names
       -f fract_cov    - fraction of sequences to be covered in initial trimming (D = 0.75)
       -g              - attempt no more than a single round of psiblast
       -l              - run trimming locally
       -m              - trim to median ends only
       -r              - first collapse seqs into representatives
       -s max_reps_sim - threshold used to collapse seqs into representatives (D = 0.9)
       -cd             - trim to conserved domains
       -html file      - show clipped ends in lowercase letters in an html alignment

Command-Line options

-a align_tool

Alignment tool to use. The default is Clustal, which seems to deal with end gaps better. If MAFFT is chosen, the script automatically selects an appropriate strategy from L-INS-i, FFT-NS-i, and FFT-NS-2 according to data size.

-c

With the -c option, the coordinates of the trimmed sequences are appended to the comment field of the output FASTA.
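To make the idea of trimming coordinates concrete, here is a minimal Python sketch that converts a pair of alignment columns into positions in the ungapped sequence and appends them to the header. The function names and the "start-end" suffix format are hypothetical; the format that svr_trim_ali actually writes is not specified in this documentation.

```python
def trimmed_coords(aligned_seq, left, right, gap_chars="-."):
    """Return 1-based (start, end) positions, in the ungapped sequence, of the
    residues kept between alignment columns left..right (0-based, inclusive).
    Returns None if no residue falls inside the kept columns."""
    before = sum(ch not in gap_chars for ch in aligned_seq[:left])
    kept = sum(ch not in gap_chars for ch in aligned_seq[left:right + 1])
    if kept == 0:
        return None
    return before + 1, before + kept


def fasta_record(seq_id, description, aligned_seq, left, right):
    """Build one trimmed FASTA record with coordinates appended to the header.
    The 'start-end' suffix is an illustrative assumption, not the tool's format."""
    parts = [seq_id]
    if description:
        parts.append(description)
    coords = trimmed_coords(aligned_seq, left, right)
    if coords is not None:
        parts.append(f"{coords[0]}-{coords[1]}")
    return ">" + " ".join(parts) + "\n" + aligned_seq[left:right + 1] + "\n"


if __name__ == "__main__":
    print(fasta_record("seq1", "kinase", "--MKTAYI-AK", 3, 8), end="")
```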

-d log_dir

Directory name for trimming log files. Without the -d option, log files are not saved.

-e log_prefix

Prefix for log file names. Random digits are appended to the file names so that existing files will not be clobbered.

-f fract_cov (D = 0.75)

Fraction of sequences to be covered in initial trimming. Use 0.5 for trimming to median ends.
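One way to read the coverage fraction is sketched below in Python: choose the left and right alignment columns such that at least fract_cov of the sequences already have residues at those boundaries. This is an illustration under that assumption, not the script's actual code, and the function names are hypothetical.

```python
import math


def end_columns(aligned_seq, gap_chars="-."):
    """Return (first, last) 0-based columns that hold residues, or None if all gaps."""
    cols = [i for i, ch in enumerate(aligned_seq) if ch not in gap_chars]
    return (cols[0], cols[-1]) if cols else None


def coverage_trim(aligned_seqs, fract_cov=0.75):
    """Pick (left, right) so that at least fract_cov of the sequences start at
    or before column `left` and end at or after column `right`."""
    ends = [e for e in map(end_columns, aligned_seqs) if e is not None]
    if not ends:
        raise ValueError("alignment contains no residues")
    need = max(1, math.ceil(fract_cov * len(ends)))
    starts = sorted(s for s, _ in ends)                 # ascending start columns
    stops = sorted((e for _, e in ends), reverse=True)  # descending end columns
    return starts[need - 1], stops[need - 1]


if __name__ == "__main__":
    ali = ["ABCDEFGH", "-BCDEFG-", "--CDEF--", "---DE---"]
    print(coverage_trim(ali, 0.75))
    print(coverage_trim(ali, 0.5))
```

In this toy alignment, lowering the fraction from 0.75 to 0.5 moves both cut points outward, so fewer columns are removed.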

-g

Without the -g option, more than one round of psiblast search may be attempted to incorporate sequences with multiple HSPs.

-l

Run trimming and psiblast locally.

-m

Trim only to median ends (or to the coverage fraction specified with the -f option).

-r

Use representative sequences to reduce data size and the influence of over-represented sequences.

-s max_reps_sim (D = 0.9)

The similarity threshold used to collapse seqs into representatives.
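The following Python sketch shows one common way a similarity threshold can be used to collapse sequences: a greedy pass that keeps a sequence only if it is less similar than the threshold to every representative kept so far. The identity measure and the greedy order are assumptions made for illustration; the documentation does not say how svr_trim_ali computes similarity or picks representatives.

```python
def identity(a, b, gap_chars="-."):
    """Fraction of identical residues over columns where both sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in gap_chars and y not in gap_chars]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def pick_representatives(aligned, max_reps_sim=0.9):
    """Greedy collapse of (id, aligned_seq) pairs: a sequence becomes a new
    representative only if its identity to every existing one is below the threshold."""
    reps = []
    for seq_id, seq in aligned:
        if all(identity(seq, rep_seq) < max_reps_sim for _, rep_seq in reps):
            reps.append((seq_id, seq))
    return reps


if __name__ == "__main__":
    ali = [
        ("a", "ACDEFGHIKLMNPQRSTVWY"),
        ("b", "ACDEFGHIKLMNPQRSTVWA"),  # 19 of 20 positions match "a"
        ("c", "YWVTSRQPNMLKIHGFEDCA"),  # "a" reversed, no matching positions
    ]
    print([seq_id for seq_id, _ in pick_representatives(ali)])
```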

-t fract_ends (D = 0.1)

The minimum fraction of ends falling in the same window of uncovered amino acids that are considered significant for determining the trimming cutoff. A smaller fraction value indicates more aggressive trimming.

-w window_size (D = 10)

The size of the initial sliding window used to count instances of sequences whose ends have a similar number of uncovered amino acids. If no cutoff value is found, additional rounds of calculation are carried out with increasing window sizes. The effect of starting window size on trimming is uncertain. A narrower starting window size usually indicates less aggressive trimming, but it may have the opposite effect when fract_ends is very small.
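The interplay of window_size and fract_ends can be sketched in Python as follows. The sketch slides a window over the per-sequence counts of uncovered amino acids, looks for the most populated window, and widens the window when no window holds a significant fraction of the ends. The step size, the upper limit, the tie-breaking rule, and the way a window would be turned into a cutoff are all assumptions; the script's real procedure may differ.

```python
def significant_window(uncovered, fract_ends=0.1, window_size=10,
                       step=10, max_window=50):
    """Return (start, size) of the most populated window that holds at least
    fract_ends of all ends, widening the window when none qualifies.
    `uncovered` lists, per sequence, the number of uncovered amino acids at one end.
    `step` and `max_window` are hypothetical parameters for this sketch."""
    if not uncovered:
        return None
    need = fract_ends * len(uncovered)
    size = window_size
    while size <= max_window:
        def hits(start):
            # Number of ends whose uncovered count falls in [start, start + size)
            return sum(start <= u < start + size for u in uncovered)
        # max() returns the first maximal element, so ties go to the smallest start
        best = max(range(max(uncovered) + 1), key=hits)
        if hits(best) >= need:
            return best, size
        size += step
    return None


if __name__ == "__main__":
    print(significant_window([3, 4, 5, 30], fract_ends=0.5))
    print(significant_window([0, 12, 25, 38], fract_ends=0.5))
```

In the second call no window of size 10 contains two of the four ends, so the sketch retries with a window of size 20.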

-cd

Trim to conserved domains. No psiblast search is attempted.

-html file

Generate an HTML file for visualizing the trimmed alignment, with clipped ends shown in lowercase.
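As an illustration of the lowercase convention, the Python sketch below lowercases everything outside a kept column range and wraps the rows in a preformatted HTML block. The page layout and the use of a single column range for all sequences are assumptions; the HTML produced by svr_trim_ali itself may look different.

```python
import html


def mark_clipped(aligned_seq, left, right):
    """Lowercase the columns outside left..right (0-based, inclusive)."""
    return (aligned_seq[:left].lower()
            + aligned_seq[left:right + 1].upper()
            + aligned_seq[right + 1:].lower())


def alignment_to_html(aligned, left, right):
    """Render (id, aligned_seq) pairs as a minimal <pre> block (hypothetical layout)."""
    rows = [f"{html.escape(seq_id):<12} {mark_clipped(seq, left, right)}"
            for seq_id, seq in aligned]
    return "<pre>\n" + "\n".join(rows) + "\n</pre>\n"


if __name__ == "__main__":
    print(alignment_to_html([("seq1", "MKTAYIAK"), ("seq2", "-KTAYIA-")], 2, 5), end="")
```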

Input

The input set of aligned sequences is read from STDIN.

Output

The set of trimmed sequences is written to STDOUT.
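Because the script reads from STDIN and writes to STDOUT, it can be driven from another program by redirecting both streams. The Python sketch below assumes that svr_trim_ali is on the PATH and covers only a few of the options listed above; the helper names are hypothetical, and the spelling accepted for align_tool values is taken from the usage text rather than verified.

```python
import subprocess


def build_command(align_tool=None, fract_cov=None, coords=False, local=False):
    """Assemble an argument list from a subset of the documented options."""
    cmd = ["svr_trim_ali"]
    if align_tool is not None:
        cmd += ["-a", align_tool]      # -a align_tool
    if fract_cov is not None:
        cmd += ["-f", str(fract_cov)]  # -f fract_cov
    if coords:
        cmd.append("-c")               # append trimming coordinates
    if local:
        cmd.append("-l")               # run trimming locally
    return cmd


def trim_alignment(in_path, out_path, **options):
    """Equivalent of: svr_trim_ali [options] < in_path > out_path"""
    with open(in_path) as src, open(out_path, "w") as dst:
        subprocess.run(build_command(**options), stdin=src, stdout=dst, check=True)


if __name__ == "__main__":
    print(" ".join(build_command(align_tool="MAFFT", fract_cov=0.5, coords=True)))
```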