Chapter 6. Readseq

Readseq is a classic, dating from 1989. Developed by Don Gilbert, this program reads and writes nucleotide and protein sequences in many useful formats. The Java version is the most current; we're using Version 2.

To run Readseq use:

java -cp readseq.jar run options inputfiles

For more details use:

java -cp readseq.jar help more
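
A small shell wrapper can save retyping the java prefix. The sketch below is only an illustration: the function names readseq and gb_to_fasta, the READSEQ_JAR variable, and the file names are assumptions made for this example, while the -inform, -format, and -output options are the ones listed in Table 6-2.

```shell
# Location of the Readseq jar; set READSEQ_JAR to override (hypothetical variable).
READSEQ_JAR="${READSEQ_JAR:-readseq.jar}"

# readseq: hypothetical wrapper that forwards every argument to "run".
readseq() {
    java -cp "$READSEQ_JAR" run "$@"
}

# gb_to_fasta INPUT OUTPUT: convert a GenBank file to FASTA.
# -inform names the input format explicitly instead of relying on guessing.
gb_to_fasta() {
    readseq -inform=genbank -format=fasta -output="$2" "$1"
}

# Usage (file names are placeholders):
#   gb_to_fasta input.gb output.fasta
```

Nothing runs until one of the functions is called, so the definitions can be placed in a shell startup file.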

This chapter contains a list of the command line options used in Readseq.

6.1 Supported Formats

Table 6-1 lists the formats supported by Readseq. ID is a number that can be used to select the format (the name is preferred). Alternate names are separated by the | character, and you can use any of them to specify a format. The flag columns hold T (true) or F (false): R and W indicate whether Readseq can read and write the format, I means the format is interleaved, F indicates that sequence record documentation and features are parsed, and S indicates that the format contains sequence data. Content-type is the magic string sent for that format through a CGI web server. Suffix is the standard file suffix used for that format.

Table 6-1. Supported formats for Readseq

ID  Name              R  W  I  F  S  Content-type          Suffix
1   GenBank|gb        T  T  F  T  T  biosequence/genbank   .gb
2   EMBL|em           T  T  F  T  T  biosequence/embl      .embl
3   Pearson|Fasta|fa  T  T  F  F  T  biosequence/fasta     .fasta
4   GCG               T  T  F  F  T  biosequence/gcg       .gcg
5   MSF               T  T  T  F  T  biosequence/msf       .msf
6   Clustal           T  T  T  F  T  biosequence/clustal   .aln
7   NBRF              T  T  F  F  T  biosequence/nbrf      .nbrf
8   PIR|CODATA        T  T  F  F  T  biosequence/codata    .pir
9   ACEDB             T  T  F  F  T  biosequence/acedb     .ace
10  Phylip3.2         T  T  T  F  T  biosequence/phylip2   .phylip2
11  Phylip|Phylip4    T  T  T  F  T  biosequence/phylip    .phylip
12  Plain|Raw         T  T  F  F  T  biosequence/plain     .seq
13  PAUP|NEXUS        T  T  T  F  T  biosequence/nexus     .nexus
14  XML               T  T  F  T  T  biosequence/xml       .xml
15  FlatFeat|FFF      T  T  F  T  F  biosequence/fff       .fff
16  GFF               T  T  F  T  F  biosequence/gff       .gff
17  BLAST             T  F  T  F  T  biosequence/blast     .blast
18  Pretty            F  T  T  F  T  biosequence/pretty    .pretty
19  SCF               T  F  F  F  T  biosequence/scf       .scf
20  DNAStrider        T  T  F  F  T  biosequence/strider   .strider
21  IG|Stanford       T  T  F  F  T  biosequence/ig        .ig
22  Fitch             F  F  F  F  T  biosequence/fitch     .fitch
23  ASN.1             F  F  F  F  T  biosequence/asn1      .asn
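
When scripting conversions, it can help to derive an output file suffix from a format name. The following is a minimal sketch, not part of Readseq: format_suffix is a hypothetical helper that covers only a handful of rows from Table 6-1 and accepts the alternate names without regard to case.

```shell
# format_suffix NAME: print the standard suffix for a few Table 6-1 formats.
# Hypothetical helper; extend the case list with further rows as needed.
format_suffix() {
    # Lowercase the name so that "FastA", "FASTA", and "fasta" all match.
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$name" in
        genbank|gb)        echo ".gb" ;;
        embl|em)           echo ".embl" ;;
        pearson|fasta|fa)  echo ".fasta" ;;
        clustal)           echo ".aln" ;;
        pir|codata)        echo ".pir" ;;
        plain|raw)         echo ".seq" ;;
        paup|nexus)        echo ".nexus" ;;
        *) echo "unknown format: $1" >&2; return 1 ;;
    esac
}

format_suffix FastA   # prints .fasta
```

A script could then build an output name such as "result$(format_suffix embl)".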

6.2 Command-Line Options

Table 6-2 through Table 6-6 summarize Readseq's command-line options.

Table 6-2. Primary options

Option

Definition

-a[ll]

Select all sequences. "all" causes processing of all sequences (default now for Version 2, for compatibility with Version 1). Use "items=1,2,3" to select a subset.

-c[aselower]

Change to lower case. "caselower" and "CASEUPPER" will convert sequence case.

-C[ASEUPPER]

Change to UPPERCASE.

-degap[=-]

Remove gap symbols. "degap=symbol" will remove this symbol from output sequence (- normally).

-f[ormat=]#

Format number for output.

-f[ormat=]Name

Format name for output. See the formats list (Table 6-1) for names and numbers. "format=genbank", "format=gb", "format=xml", etc., select an output format. You can also use the format number, but these numbers may change with revisions. Alternate names of formats are listed in Table 6-1 (for example, "Pearson|FastA|fa" allows "pearson", "fasta", or "fa" as a name). Format names are case-insensitive.

-inform[at]=#

Input format number.

-inform[at]=Name

Input format name. Assume input data is this format. "inform=genbank" lets you specify data input format. Normally Readseq guesses the input format (usually correctly). Use this option if you wish to bypass this input format guessing.

-i[tem=2,3,4]

Select Item number(s) from several. "items=2,3,4" will select these sequence records from a multisequence input file.

-l[ist]

List sequences only. "list" will list titles of sequence records.

-o[utput=]out.seq

Redirect output. "output=file" sends output to the named file.

-p[ipe]

Pipe (command line, < stdin, > stdout). "pipe" causes input data to come from STDIN and output to go to STDOUT, the Unix standard files (unless -out and an input file are given), and no prompting or progress reports will occur.

-r[everse]

Reverse-complement of input sequence. "reverse" will write the sequence from end to start, and DNA bases are complemented. Amino residues are not complemented.

-t[ranslate=]io

Translate input symbol [i] to output symbol [o] in the given sequence bases, e.g., -tAN to change "A" to "N". Use several -tio options to translate several symbols.

-v[erbose]

Verbose progress. "verbose" will print some progress reports.

-ch[ecksum]

Calculate & print checksum of sequences.
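
The primary options combine naturally in pipelines. The sketch below assumes the long option spellings implied by the bracket notation in Table 6-2 (-list, -pipe, -item=, -format=); the function names, the READSEQ_JAR variable, and the file names are hypothetical. Note that the table writes the selection option as -i[tem=2,3,4] while its description spells it "items", so the spelling used here may need adjusting for your copy of Readseq.

```shell
# list_records FILE: print the titles of the sequence records in FILE.
list_records() {
    java -cp "${READSEQ_JAR:-readseq.jar}" run -list "$1"
}

# pick_records ITEMS: read a multisequence file on stdin and write the
# selected records (e.g. 2,3) as FASTA on stdout, without prompts.
pick_records() {
    java -cp "${READSEQ_JAR:-readseq.jar}" run -pipe -item="$1" -format=fasta
}

# Usage (file names are placeholders):
#   list_records multi.gb
#   pick_records 2,3 < multi.gb > subset.fasta
```

Listing the records first is a convenient way to find the item numbers to pass to the second function.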

Table 6-3. Documentation and feature table extraction options

Option

Definition

-feat[ures]=exon,CDS...

Extract sequence of selected features.

-nofeat[ures]=repeat_region,intron...

Remove sequence of selected features. "feature=CDS,intron" lets you specify those features to extract, or remove, in the output. Currently this causes each feature to produce a new sequence record.

-field=AC,ID...

Include selected document fields in output.

-nofield=COMMENT,...

Remove selected document fields from output.
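
As an illustration of Table 6-3, the following sketch wraps feature extraction and field removal in two shell functions. The function names, the READSEQ_JAR variable, and the file names are assumptions for this example; the options themselves come from the table. Input is assumed to be in a format whose features are parsed (column F in Table 6-1), such as GenBank.

```shell
# extract_features FEATURES INPUT OUTPUT: write the sequence of each selected
# feature (e.g. CDS,exon) as its own FASTA record.
extract_features() {
    java -cp "${READSEQ_JAR:-readseq.jar}" run -feat="$1" -format=fasta -output="$3" "$2"
}

# strip_fields FIELDS INPUT OUTPUT: rewrite a record as GenBank without the
# listed document fields (e.g. COMMENT).
strip_fields() {
    java -cp "${READSEQ_JAR:-readseq.jar}" run -nofield="$1" -format=genbank -output="$3" "$2"
}

# Usage (file names are placeholders):
#   extract_features CDS,exon input.gb features.fasta
#   strip_fields COMMENT input.gb trimmed.gb
```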

Table 6-4. Subrange options

Option

Definition

-subrange=-1000..10

Extract subrange of sequence for feature locations:

-subrange=1..end
-subrange=end-10..end+99

-extract=10000..99999

Extract all features and sequence from given base range.
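
The subrange options can be sketched the same way. In this example the -subrange value is read as a range relative to each selected feature's location, which is an interpretation of the table's wording rather than something the table states outright. The function names, the READSEQ_JAR variable, and the file names are hypothetical.

```shell
# feature_flank FEATURES RANGE INPUT OUTPUT: extract a subrange around each
# selected feature, e.g. RANGE=-1000..10 (assumed to be relative to the
# feature location).
feature_flank() {
    java -cp "${READSEQ_JAR:-readseq.jar}" run -feat="$1" -subrange="$2" -format=fasta -output="$4" "$3"
}

# extract_region RANGE INPUT OUTPUT: keep the features and sequence that fall
# in a base range, e.g. RANGE=10000..99999.
extract_region() {
    java -cp "${READSEQ_JAR:-readseq.jar}" run -extract="$1" -format=genbank -output="$3" "$2"
}

# Usage (file names are placeholders):
#   feature_flank CDS -1000..10 input.gb flanks.fasta
#   extract_region 10000..99999 input.gb region.gb
```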

Table 6-5. Pair, unpair options

Option

Definition

-pair=1

Combine features (fff,gff) and sequence files to one output.

-unpair=1

Split features, sequence from one input to two files.

Table 6-6. Pretty format options

Option

Definition

-wid[th]=#

Sequence line width.

-tab=#

Left indent.

-col[space]=#

Column space within sequence line on output.

-gap[count]

Count gap chars in sequence numbers.

-nameleft, -nameright[=#]

Name on left/right side [=max width].

-nametop

Name at top/bottom.

-numleft, -numright

Seq index on left/right side.

-numtop, -numbot

Index on top/bottom.

-match[=.]

Use match base for 2..n species.

-inter[line=#]

Blank line(s) between sequence blocks.
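
The Pretty options apply when Pretty is chosen as the output format (it is write-only in Table 6-1). The sketch below shows one possible combination; the numeric values, the function name pretty_print, the READSEQ_JAR variable, and the file names are arbitrary choices for illustration, and the long option spellings follow the bracket notation in Table 6-6.

```shell
# pretty_print INPUT OUTPUT: write INPUT in Pretty format with 60 residues
# per line in columns of 10, names on the left (at most 12 characters),
# sequence indices on the right, and one blank line between blocks.
pretty_print() {
    java -cp "${READSEQ_JAR:-readseq.jar}" run -format=pretty \
        -width=60 -colspace=10 -nameleft=12 -numright -interline=1 \
        -output="$2" "$1"
}

# Usage (file names are placeholders):
#   pretty_print alignment.aln alignment.pretty
```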