compseq | Sequence Analysis in a Nutshell: A Guide to Common Tools and Databases

compseq

compseq counts the occurrences of words of a chosen length (dimers, trimers, and so on) in a sequence.

Here is a sample session with compseq. To count the frequencies of dinucleotides in a file:

% compseq  embl:hsfau  2  result3.comp

To count the frequencies of hexanucleotides, omitting from the output any hexanucleotides that do not occur in the sequence:

% compseq  embl:hsfau  6  result6.comp  -nozero

To count the frequencies of trinucleotides in frame 2 of a sequence using a previously prepared compseq output to show the expected frequencies:

% compseq  embl:hsfau  3  result3.comp  -frame 2  -in prev.comp
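The counting done by the first sample command can be pictured as sliding a window of the chosen word length along the sequence one position at a time. The following is a minimal Python sketch of that idea, not compseq's own code; the function name count_words and the short test sequence are made up for illustration, and the real program reports more than this sketch does.

```python
from collections import Counter


def count_words(sequence: str, word: int) -> Counter:
    """Count every overlapping word of length `word`.

    The window advances by one position each time, which matches the
    default behavior described for compseq (hypothetical helper).
    """
    seq = sequence.upper()
    # A sequence shorter than `word` yields an empty range, so no words.
    return Counter(seq[i:i + word] for i in range(len(seq) - word + 1))


# Dinucleotide counts for a short, made-up sequence.
counts = count_words("ACGTACGA", 2)
total = sum(counts.values())
for w, n in sorted(counts.items()):
    # Word, observed count, observed frequency.
    print(w, n, round(n / total, 4))
```

A sequence of length L contains L - word + 1 overlapping words, so the eight-base example yields seven dinucleotides.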

Mandatory qualifiers:

[-sequence] (seqall): Sequence database USA (Uniform Sequence Address).
[-word] (integer): The size of word (n-mer) to count. If you want to count codon frequencies, enter 3 here.
[-outfile] (outfile): The results file.

Optional qualifiers (bold if not always prompted):

-infile (infile): A file previously produced by compseq, used to set the expected frequencies of words in the analysis. The word size in the current run must match the word size in this file. Use a file produced from protein sequences when counting protein word frequencies, and a file produced from nucleotide sequences when analyzing a nucleotide sequence.
-frame (integer): By default, compseq counts all words by moving a window of length word along the sequence one position at a time. This option instead moves the window by the length of the word each time, skipping the intervening words, so that only the words in a single frame are counted. Set the value to a number other than 0 to select the frame: 1 counts only the words in frame 1, 2 counts only the words in frame 2, and so on.
-[no]ignorebz (boolean): The amino acid code B represents asparagine or aspartic acid, and the code Z represents glutamine or glutamic acid. These codes are not commonly used, and you may not want to count words containing them. With this option set, words containing B or Z are added to the count of "Other" words.
-reverse (boolean): Set this option to true if you want to count words in the reverse complement of a nucleic acid sequence.
-[no]zerocount (boolean): Words with a zero count are normally displayed. Using -nozerocount to leave them out can make the results file much smaller.
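How the -frame, -reverse, and -[no]zerocount qualifiers above change the count can be sketched in Python. This is an illustration under stated assumptions, not compseq's implementation: the names count_words_framed and drop_zero are hypothetical, the sketch handles nucleotide sequences only, and it assumes that frame 1 starts at the first base of the sequence.

```python
from collections import Counter
from itertools import product

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_words_framed(sequence, word, frame=0, reverse=False):
    """Count words of length `word` in a nucleotide sequence.

    frame=0 counts all overlapping words (window moves by one).
    frame=N (N >= 1) counts only words in frame N, moving the window
    by `word` each time. ASSUMPTION: frame 1 starts at the first base.
    """
    seq = sequence.upper()
    if reverse:
        # Complement each base, then reverse the string.
        seq = seq.translate(COMPLEMENT)[::-1]
    if frame == 0:
        start, step = 0, 1
    else:
        start, step = frame - 1, word
    # Start every possible ACGT word at zero so that absent words are
    # still listed; words with other symbols are added as they appear.
    counts = Counter({"".join(p): 0 for p in product("ACGT", repeat=word)})
    for i in range(start, len(seq) - word + 1, step):
        counts[seq[i:i + word]] += 1
    return counts


def drop_zero(counts):
    """Keep only the words that occur, as when zero counts are hidden."""
    return {w: n for w, n in counts.items() if n > 0}


frame1 = drop_zero(count_words_framed("ATGAAACCC", 3, frame=1))
frame2 = drop_zero(count_words_framed("ATGAAACCC", 3, frame=2))
rev1 = drop_zero(count_words_framed("ATGAAACCC", 3, frame=1, reverse=True))
print(frame1, frame2, rev1)
```

For trinucleotides there are 64 possible words, so hiding the zero counts reduces the listing for this nine-base example from 64 lines to at most three.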