extractfeat | Sequence Analysis in a Nutshell: A Guide to Common Tools and Databases

extractfeat

extractfeat is a simple utility for extracting parts of a sequence that have been annotated as a specific type of feature. These subsequences are written to the output sequence file.

Here is a sample session with extractfeat to write out the exons of a sequence:

% extractfeat embl:hsfau1 -type exon stdout

To write out the exons with 10 extra bases at the start and end so that you can inspect the splice sites:

% extractfeat embl:hsfau1 -type exon -before 10 -after 10 stdout

To write out the 10 bases around the start of all "exon" features in the EMBL database:

% extractfeat embl:\* -type exon -before 5 -after -5 stdout

To write out the 7 residues around all phosphorylated residues in SWISS-PROT:

% extractfeat sw:\* -type mod_res -value phosphorylation -before 3 -after -4 stdout

Mandatory qualifiers:

[-sequence] (seqall): Sequence database USA.
[-outseq] (seqout): Output sequence USA.

Optional qualifiers:

-before (integer)

If this value is greater than 0, that number of bases or residues before the feature is included in the extracted sequence, so that you can see the context of the feature. If this value is negative, the extracted sequence starts that number of bases or residues before the end of the feature. For example, a value of 10 starts the extraction 10 bases or residues before the start of the feature, and a value of -10 starts the extraction 10 bases or residues before the end of the feature. The output sequence is padded with "N" or "X" characters if the sequence starts after the required start of the extraction.

-after (integer)

If this value is greater than 0, that number of bases or residues after the feature is included in the extracted sequence, so that you can see the context of the feature. If this value is negative, the extracted sequence ends that number of bases or residues after the start of the feature. For example, a value of 10 ends the extraction 10 bases or residues after the end of the feature, and a value of -10 ends the extraction 10 bases or residues after the start of the feature. The output sequence is padded with "N" or "X" characters if the sequence ends before the required end of the extraction.
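The coordinate arithmetic behind -before and -after can be sketched in a few lines of Python. This is an illustrative model, not the EMBOSS implementation: the function name `extract_window` is hypothetical, it handles forward-strand features only, and the treatment of negative values is inferred from the sample sessions above (5 before and -5 after gives 10 bases; 3 before and -4 after gives 7 residues), with negative -before assumed to mirror negative -after.

```python
def extract_window(seq, feat_start, feat_end, before=0, after=0, pad="N"):
    """Return the extracted text for one forward-strand feature.

    Coordinates are 1-based and inclusive, as in a feature table.
    Illustrative sketch only; names and edge-case handling are assumptions.
    """
    # Non-negative -before extends upstream of the feature start;
    # a negative value counts back from the feature end instead (assumed).
    if before >= 0:
        start = feat_start - before
    else:
        start = feat_end + before + 1

    # Non-negative -after extends downstream of the feature end;
    # a negative value counts forward from the feature start instead.
    if after >= 0:
        end = feat_end + after
    else:
        end = feat_start - after - 1

    # Positions that fall outside the sequence are filled with the pad
    # character ("N" for nucleotides, "X" for proteins).
    return "".join(
        seq[pos - 1] if 1 <= pos <= len(seq) else pad
        for pos in range(start, end + 1)
    )


# A feature at positions 8..12 of a 20-base sequence:
# 5 bases upstream plus the first 5 bases of the feature.
window = extract_window("ACGTACGTACGTACGTACGT", 8, 12, before=5, after=-5)
print(len(window), window)
```

Under these assumptions, a single-residue feature extracted with before=3 and after=-4 yields a 7-residue window centered on that residue, which matches the phosphorylation example.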

-source (string)

By default, any feature source in the feature table is shown. You can set this to match any feature source you want to show. The source name is usually either the name of the program that detected the feature, or the feature table (e.g., EMBL) that the feature came from. The source may be wildcarded by using *. If you want to show more than one source, separate their names with the character |, e.g., gene* | embl.

-type (string)

By default, every feature in the feature table is extracted. You can set this to be any feature type you want to extract. See Chapter 2 for a list of the EMBL feature types, and Chapter 3 for a list of the SWISS-PROT feature types. The type may be wildcarded by using *. If you want to extract more than one type, separate their names with the | character. For example:

*UTR | intron
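The way a pattern such as the one above selects feature types can be modeled with Python's standard `fnmatch` module. This is only a sketch of the described behavior: the helper name `pattern_matches` is hypothetical, case-sensitive matching is an assumption, and `fnmatch` also gives special meaning to `?` and `[...]`, which the text here does not mention.

```python
from fnmatch import fnmatchcase


def pattern_matches(pattern, name):
    """Return True if name matches any |-separated alternative in pattern.

    Each alternative may contain * as a wildcard. Whitespace around the
    | separator is ignored. Illustrative sketch; not the EMBOSS matcher.
    """
    alternatives = [alt.strip() for alt in pattern.split("|")]
    return any(fnmatchcase(name, alt) for alt in alternatives)


# Select untranslated regions and introns from a list of feature types.
feature_types = ["exon", "intron", "5'UTR", "3'UTR", "CDS"]
selected = [t for t in feature_types if pattern_matches("*UTR | intron", t)]
print(selected)
```

The same matching rule would apply to the -source, -tag, and -value qualifiers, which accept patterns of the same form.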

-sense (integer)

By default, features of any sense are extracted. You can set this to match the feature sense you want: 0 matches any sense, 1 matches forward sense, and -1 matches reverse sense.

-minscore (float)

The minimum score of the features to extract. If this is greater than or equal to the maximum score, any score is permitted.

-maxscore (float)

The maximum score of the features to extract. If this is less than or equal to the minimum score, any score is permitted.
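The interaction between the two score qualifiers can be summarized as a small predicate. This is a sketch under stated assumptions: the function name `score_allowed` is hypothetical, the bounds are assumed to be inclusive, and the defaults of 0.0 for both limits are an assumption that makes the filter inactive unless a real range is given.

```python
def score_allowed(score, minscore=0.0, maxscore=0.0):
    """Return True if a feature with this score passes the score filter.

    When minscore is greater than or equal to maxscore, the range is
    empty or inverted, so the filter is switched off and any score passes.
    Illustrative sketch; inclusive bounds and defaults are assumptions.
    """
    if minscore >= maxscore:
        return True
    return minscore <= score <= maxscore


# With no range set, every score passes; with 1..10, only scores inside it.
print(score_allowed(42.0))
print(score_allowed(0.5, minscore=1.0, maxscore=10.0))
```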

-tag (string)

Tags are the types of extra values that a feature may have. For example, in the EMBL feature table, a CDS type of feature may have the tags /codon, /codon_start, /db_xref, /EC_number, /evidence, /exception, /function, /gene, /label, /map, /note, /number, /partial, /product, /protein_id, /pseudo, /standard_name, /translation, /transl_except, /transl_table, or /usedin. Some of these tags also have values (e.g., /gene can have the value of the gene name). By default, any feature tag in the feature table is extracted. You can set this to match any feature tag you want to show. The tag may be wildcarded by using *. If you want to extract more than one tag, separate their names with the | character. For example:

gene | label

-value (string)

Tag values are the values associated with a feature tag. Only some tags have values; for example, in the EMBL feature table, the /gene tag of a CDS feature can have the value of the gene name. By default, any feature tag value in the feature table is extracted. You can set this to match any feature tag value you want to show. The tag value may be wildcarded by using *. If you want to extract more than one tag value, separate the values with the | character. For example:

pax* | 10

-join (boolean)

Some features, such as coding sequence (CDS) and mRNA, are composed of exons concatenated together. There may be other forms of joined sequence, depending on the feature table. If this option is set to TRUE, any group of these features is output as a single sequence. If the -before and -after qualifiers have been set, only the sequence before the first feature and after the last feature is added.
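The effect of -join on a group of features can be sketched as follows. This is an illustrative model rather than the EMBOSS code: the function name `extract_joined` is hypothetical, only forward-strand parts and non-negative -before and -after values are handled, and out-of-range positions are padded as described for those qualifiers.

```python
def extract_joined(seq, parts, before=0, after=0, pad="N"):
    """Concatenate the parts of a joined feature into one sequence.

    parts is an ordered list of (start, end) pairs, 1-based and inclusive.
    Flanking sequence is added only before the first part and after the
    last part, never between parts. Illustrative sketch with assumptions.
    """
    if before < 0 or after < 0:
        raise ValueError("this sketch handles non-negative flanks only")

    def fetch(start, end):
        # Positions outside the sequence are filled with the pad character.
        return "".join(
            seq[pos - 1] if 1 <= pos <= len(seq) else pad
            for pos in range(start, end + 1)
        )

    first_start = parts[0][0]
    last_end = parts[-1][1]

    pieces = [fetch(first_start - before, first_start - 1)]
    pieces.extend(fetch(start, end) for start, end in parts)
    pieces.append(fetch(last_end + 1, last_end + after))
    return "".join(pieces)


# Two parts at 4..6 and 10..12; the stretch at 7..9 between them is skipped.
joined = extract_joined("AAACCCGGGTTTAAACCC", [(4, 6), (10, 12)], before=2, after=2)
print(joined)
```

Without joining, each part would instead be written as its own sequence, each with its own flanking bases.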