fasta - manipulations with FASTA databases
FASTA is a simple file format for protein sequence databases. Please refer to the NCBI website for the most detailed information on the format.
Data manipulation
Classes
Several classes of FASTA parsers are available. All of them have common features:
context manager support;
header parsing;
direct iteration.
Available classes:
FASTABase
- common ancestor, suitable for type checking. Abstract class.
FASTA
- text-mode, sequential parser. Good for iteration over database entries.
IndexedFASTA
- binary-mode, indexing parser. Supports direct indexing by header string.
TwoLayerIndexedFASTA
- additionally supports indexing by extracted header fields.
UniProt and IndexedUniProt, UniParc and IndexedUniParc, UniMes and IndexedUniMes, UniRef and IndexedUniRef, SPD and IndexedSPD, NCBI and IndexedNCBI, RefSeq and IndexedRefSeq
- format-specific parsers.
Functions
read()
- returns an instance of the appropriate reader class, for sequential iteration or random access.
chain()
- read multiple files at once.
chain.from_iterable()
- read multiple files at once, using an iterable of files.
write()
- write entries to a FASTA database.
parse()
- parse a FASTA header.
Decoy sequence generation
decoy_sequence()
- generate a decoy sequence from a given sequence, using one of the other functions listed in this section or any other callable.
reverse()
- generate a reversed decoy sequence.
shuffle()
- generate a shuffled decoy sequence.
fused_decoy()
- generate a “fused” decoy sequence.
Decoy database generation
write_decoy_db()
- generate a decoy database and write it to a file.
decoy_db()
- generate entries for a decoy database from a given FASTA database.
decoy_entries()
- generate decoy entries for an iterator.
decoy_chain()
- a version of decoy_db() for multiple files.
decoy_chain.from_iterable()
- like decoy_chain(), but with an iterable of files.
Auxiliary
std_parsers
- a dictionary with parsers for known FASTA header formats.
- pyteomics.fasta.chain(*args, **kwargs)¶
Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.
- chain.from_iterable(files, **kwargs)¶
Chain read() for several files. Keyword arguments are passed to the read() function.
- Parameters:
files – Iterable of file names or file objects.
- pyteomics.fasta.decoy_chain(*args, **kwargs)¶
Chain decoy_db() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the decoy_db() function.
- decoy_chain.from_iterable(files, **kwargs)¶
Chain decoy_db() for several files. Keyword arguments are passed to the decoy_db() function.
- Parameters:
files – Iterable of file names or file objects.
- class pyteomics.fasta.FASTA(source, ignore_comments=False, parser=None, encoding=None)[source]¶
Bases:
FASTABase, FileReader
Text-mode, sequential FASTA parser. Suitable for iteration over the file to obtain all entries in order.
- __init__(source, ignore_comments=False, parser=None, encoding=None)[source]¶
Create a new FASTA parser object. Supports iteration, yields (description, sequence) tuples. Supports with syntax.
- Parameters:
source (str or file-like) – File to read. If file object, it must be opened in text mode.
ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.
parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.
encoding (str or None, optional) – File encoding (if it is given by name).
- reset()¶
Resets the iterator to its initial state.
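The sequential behaviour described for this class (iteration that yields (description, sequence) tuples, with an optional parser callable applied to each description) can be illustrated with a small standalone generator. This is a minimal sketch and not the pyteomics implementation: iter_fasta is a hypothetical name, and the handling of multi-line descriptions (ignore_comments) is left out.

```python
import io


def iter_fasta(handle, parser=None):
    """Yield (description, sequence) pairs from a text-mode file object."""
    description, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if description is not None:
                yield description, ''.join(chunks)
            description = line[1:]
            if parser is not None:
                # The parser result replaces the raw header string.
                description = parser(description)
            chunks = []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, ''.join(chunks)


example = io.StringIO('>sp|P1|A_TEST first protein\nPEPT\nIDE\n>sp|P2|B_TEST second\nMKR\n')
entries = list(iter_fasta(example))
upper = list(iter_fasta(io.StringIO('>a b\nAC\n'), parser=str.upper))
```

Sequence lines are joined per record, so a wrapped sequence comes back as a single string.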
- class pyteomics.fasta.FASTABase(source, **kwargs)[source]¶
Bases:
object
Abstract base class for FASTA file parsers. Can be used for type checking.
- class pyteomics.fasta.FlavoredMixin(parse=True)[source]¶
Bases:
object
Parser aimed at a specific FASTA flavor. Subclasses should define parser and header_pattern. The parse argument in __init__() defines whether description is parsed in output.
- class pyteomics.fasta.IndexedFASTA(source, ignore_comments=False, parser=None, **kwargs)[source]¶
Bases:
FASTABase, TaskMappingMixin, IndexedTextReader
Indexed FASTA parser. Supports direct indexing by matched labels.
- __init__(source, ignore_comments=False, parser=None, **kwargs)[source]¶
Create an indexed FASTA parser object.
- Parameters:
source (str or file-like) – File to read. If file object, it must be opened in binary mode.
ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.
parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.
encoding (str or None, optional, keyword only) – File encoding. Default is UTF-8.
block_size (int or None, optional, keyword only) – Number of bytes to consume at once.
delimiter (str or None, optional, keyword only) – Overrides the FASTA record delimiter (default is '\n>').
label (str or None, optional, keyword only) – Overrides the FASTA record label pattern. Default is '^[\n]?>(.*)'.
label_group (int or str, optional, keyword only) – Overrides the matched group used as key in the byte offset index. This in combination with label can be used to extract fields from headers. However, consider using TwoLayerIndexedFASTA for this purpose.
- map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- reset()¶
Resets the iterator to its initial state.
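The idea behind a byte offset index can be sketched in a few lines: scan the file once in binary mode, remember where each header starts, then seek straight to a record on request. The code below is an illustration under simplifying assumptions, not the IndexedFASTA implementation. build_offset_index and get_record are hypothetical helpers, and the label pattern is a simplified variant of the default shown above.

```python
import io
import re


def build_offset_index(handle, label=rb'^>(.*)'):
    """Map each header string to the byte offset where its record starts."""
    data = handle.read()
    index = {}
    for match in re.finditer(label, data, flags=re.MULTILINE):
        index[match.group(1).decode('utf-8').strip()] = match.start()
    return index


def get_record(handle, index, header):
    """Seek to the stored offset and read a single record."""
    handle.seek(index[header])
    lines = []
    for i, raw in enumerate(handle):
        if i > 0 and raw.startswith(b'>'):
            break  # the next record begins here
        lines.append(raw.decode('utf-8').strip())
    return lines[0][1:], ''.join(lines[1:])


db = io.BytesIO(b'>first entry\nPEPT\nIDE\n>second entry\nMKR\n')
index = build_offset_index(db)
record = get_record(db, index, 'second entry')
```

Offsets are counted in bytes, which is why the source has to be opened in binary mode: text-mode decoding would make the positions unreliable for seeking.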
- class pyteomics.fasta.IndexedNCBI(source, parse=True, **kwargs)[source]¶
Bases:
NCBIMixin, TwoLayerIndexedFASTA
Indexed parser for NCBI FASTA files.
- __init__(source, parse=True, **kwargs)¶
Creates an IndexedNCBI object.
- Parameters:
source (str or file) – The file to read. If a file object, it needs to be in binary mode.
parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
**kwargs – Passed to the TwoLayerIndexedFASTA constructor.
- build_second_index()¶
Create the mapping from extracted field to whole header string.
- get_by_id(key)¶
Get the entry by value of header string or extracted field.
- map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.IndexedRefSeq(source, parse=True, **kwargs)[source]¶
Bases:
RefSeqMixin, TwoLayerIndexedFASTA
Indexed parser for RefSeq FASTA files.
- __init__(source, parse=True, **kwargs)¶
Creates an IndexedRefSeq object.
- Parameters:
source (str or file) – The file to read. If a file object, it needs to be in binary mode.
parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
**kwargs – Passed to the TwoLayerIndexedFASTA constructor.
- build_second_index()¶
Create the mapping from extracted field to whole header string.
- get_by_id(key)¶
Get the entry by value of header string or extracted field.
- map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.IndexedSPD(source, parse=True, **kwargs)[source]¶
Bases:
SPDMixin, TwoLayerIndexedFASTA
Indexed parser for SPD FASTA files.
- __init__(source, parse=True, **kwargs)¶
Creates an IndexedSPD object.
- Parameters:
source (str or file) – The file to read. If a file object, it needs to be in binary mode.
parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
**kwargs – Passed to the TwoLayerIndexedFASTA constructor.
- build_second_index()¶
Create the mapping from extracted field to whole header string.
- get_by_id(key)¶
Get the entry by value of header string or extracted field.
- map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.IndexedUniMes(source, parse=True, **kwargs)[source]¶
Bases:
UniMesMixin, TwoLayerIndexedFASTA
Indexed parser for UniMes FASTA files.
- __init__(source, parse=True, **kwargs)¶
Creates an IndexedUniMes object.
- Parameters:
source (str or file) – The file to read. If a file object, it needs to be in binary mode.
parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
**kwargs – Passed to the TwoLayerIndexedFASTA constructor.
- build_second_index()¶
Create the mapping from extracted field to whole header string.
- get_by_id(key)¶
Get the entry by value of header string or extracted field.
- map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.IndexedUniParc(source, parse=True, **kwargs)[source]¶
Bases:
UniParcMixin, TwoLayerIndexedFASTA
Indexed parser for UniParc FASTA files.
- __init__(source, parse=True, **kwargs)¶
Creates an IndexedUniParc object.
- Parameters:
source (str or file) – The file to read. If a file object, it needs to be in binary mode.
parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
**kwargs – Passed to the TwoLayerIndexedFASTA constructor.
- build_second_index()¶
Create the mapping from extracted field to whole header string.
- get_by_id(key)¶
Get the entry by value of header string or extracted field.
- map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.IndexedUniProt(source, parse=True, **kwargs)[source]¶
Bases:
UniProtMixin, TwoLayerIndexedFASTA
Indexed parser for UniProt FASTA files.
- __init__(source, parse=True, **kwargs)¶
Creates an IndexedUniProt object.
- Parameters:
source (str or file) – The file to read. If a file object, it needs to be in binary mode.
parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
**kwargs – Passed to the TwoLayerIndexedFASTA constructor.
- build_second_index()¶
Create the mapping from extracted field to whole header string.
- get_by_id(key)¶
Get the entry by value of header string or extracted field.
- map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.IndexedUniRef(source, parse=True, **kwargs)[source]¶
Bases:
UniRefMixin, TwoLayerIndexedFASTA
Indexed parser for UniRef FASTA files.
- __init__(source, parse=True, **kwargs)¶
Creates an IndexedUniRef object.
- Parameters:
source (str or file) – The file to read. If a file object, it needs to be in binary mode.
parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
**kwargs – Passed to the TwoLayerIndexedFASTA constructor.
- build_second_index()¶
Create the mapping from extracted field to whole header string.
- get_by_id(key)¶
Get the entry by value of header string or extracted field.
- map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.NCBI(source, parse=True, **kwargs)[source]¶
Bases:
NCBIMixin, FASTA
Text-mode parser for NCBI FASTA files.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.NCBIMixin(parse=True)[source]¶
Bases:
FlavoredMixin
- __init__(parse=True)¶
- class pyteomics.fasta.RefSeq(source, parse=True, **kwargs)[source]¶
Bases:
RefSeqMixin, FASTA
Text-mode parser for RefSeq FASTA files.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.RefSeqMixin(parse=True)[source]¶
Bases:
FlavoredMixin
- __init__(parse=True)¶
- class pyteomics.fasta.SPD(source, parse=True, **kwargs)[source]¶
Bases:
SPDMixin, FASTA
Text-mode parser for SPD FASTA files.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.SPDMixin(parse=True)[source]¶
Bases:
FlavoredMixin
- __init__(parse=True)¶
- class pyteomics.fasta.TwoLayerIndexedFASTA(source, header_pattern=None, header_group=None, ignore_comments=False, parser=None, **kwargs)[source]¶
Bases:
IndexedFASTA
Parser with two-layer index. Extracted groups are mapped to full headers (where possible), full headers are mapped to byte offsets.
When indexed, the key is looked up in both indexes, allowing access by meaningful IDs (like UniProt accession) and by full header string.
- __init__(source, header_pattern=None, header_group=None, ignore_comments=False, parser=None, **kwargs)[source]¶
Open source and create a two-layer index for convenient random access both by full header strings and extracted fields.
- Parameters:
source (str or file-like) – File to read. If file object, it must be opened in binary mode.
header_pattern (str or RE or None, optional) – Pattern to match the header string. Must capture the group used for the second index. If None (default), second-level index is not created.
header_group (int or str or None, optional) – Defines which group is used as key in the second-level index. Default is 1.
ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.
parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.
**kwargs – Other arguments.
- map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- reset()¶
Resets the iterator to its initial state.
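The two-layer lookup described for TwoLayerIndexedFASTA can be pictured as two dictionaries: extracted field to full header, and full header to byte offset. The sketch below only illustrates that idea and is not how the class is implemented. TwoLayerIndex and its lookup method are hypothetical, and the header pattern and offsets are made up for the example.

```python
import re


class TwoLayerIndex:
    """Resolve a key through extracted IDs first, then full headers."""

    def __init__(self, offsets, header_pattern, header_group=1):
        self.offsets = dict(offsets)   # full header -> byte offset
        self.id_to_header = {}         # extracted field -> full header
        pattern = re.compile(header_pattern)
        for header in self.offsets:
            match = pattern.match(header)
            if match:
                # Headers that do not match stay reachable by full string only.
                self.id_to_header[match.group(header_group)] = header

    def lookup(self, key):
        header = self.id_to_header.get(key, key)
        return header, self.offsets[header]


offsets = {'sp|P12345|TEST_HUMAN Test protein': 0, 'unparsable header': 57}
index = TwoLayerIndex(offsets, r'^\w+\|(\w+)\|')
by_id = index.lookup('P12345')
by_header = index.lookup('unparsable header')
```

Headers that the pattern cannot match are simply absent from the second layer, which corresponds to the "where possible" qualifier in the class description.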
- class pyteomics.fasta.UniMes(source, parse=True, **kwargs)[source]¶
Bases:
UniMesMixin, FASTA
Text-mode parser for UniMes FASTA files.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.UniMesMixin(parse=True)[source]¶
Bases:
FlavoredMixin
- __init__(parse=True)¶
- class pyteomics.fasta.UniParc(source, parse=True, **kwargs)[source]¶
Bases:
UniParcMixin, FASTA
Text-mode parser for UniParc FASTA files.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.UniParcMixin(parse=True)[source]¶
Bases:
FlavoredMixin
- __init__(parse=True)¶
- class pyteomics.fasta.UniProt(source, parse=True, **kwargs)[source]¶
Bases:
UniProtMixin, FASTA
Text-mode parser for UniProt FASTA files.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.UniProtMixin(parse=True)[source]¶
Bases:
FlavoredMixin
- __init__(parse=True)¶
- class pyteomics.fasta.UniRef(source, parse=True, **kwargs)[source]¶
Bases:
UniRefMixin, FASTA
Text-mode parser for UniRef FASTA files.
- reset()¶
Resets the iterator to its initial state.
- class pyteomics.fasta.UniRefMixin(parse=True)[source]¶
Bases:
FlavoredMixin
- __init__(parse=True)¶
- pyteomics.fasta.decoy_db(source=None, mode='reverse', prefix='DECOY_', decoy_only=False, ignore_comments=False, parser=None, **kwargs)[source]¶
Iterate over sequences for a decoy database out of a given source.
- Parameters:
source (file-like object or str or None, optional) – A path to a FASTA database or a file object itself. Default is None, which means read standard input.
mode (str or callable, optional) – Algorithm of decoy sequence generation. ‘reverse’ by default. See decoy_sequence() for more information.
prefix (str, optional) – A prefix to the protein descriptions of decoy entries. The default value is ‘DECOY_’.
decoy_only (bool, optional) – If set to True, only the decoy entries will be written to output. If False, the entries from source will be written first. False by default.
ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False.
parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format guessing. Default is None, which means return the header “as is”.
**kwargs – Passed to decoy_sequence().
- Returns:
out – An iterator over entries of the new database.
- Return type:
iterator
- pyteomics.fasta.decoy_entries(entries, mode='reverse', prefix='DECOY_', decoy_only=True, **kwargs)[source]¶
Iterate over protein entries (tuples) and produce decoy entries. The entries are only iterated once.
- Parameters:
entries (iterable of tuples) – Any iterable of (description, sequence) pairs.
mode (str or callable, optional) – Algorithm of decoy sequence generation. ‘reverse’ by default. See decoy_sequence() for more information.
prefix (str, optional) – A prefix to the protein descriptions of decoy entries. The default value is ‘DECOY_’.
decoy_only (bool, optional) – If set to True, only the decoy entries will be written to output. If False, each consumed entry is yielded unchanged, followed by its decoy counterpart. True by default.
**kwargs – Passed to decoy_sequence().
- Returns:
out – An iterator over new entries.
- Return type:
iterator
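The interleaving controlled by decoy_only can be sketched with a plain generator. This is an illustration of the documented behaviour rather than the library code. iter_decoys and make_decoy are hypothetical names, and the decoy function is passed in explicitly instead of being selected by mode.

```python
def iter_decoys(entries, make_decoy, prefix='DECOY_', decoy_only=True):
    """Yield decoy entries, optionally preceded by their target entries."""
    for description, sequence in entries:
        if not decoy_only:
            # The target entry is passed through unchanged.
            yield description, sequence
        yield prefix + description, make_decoy(sequence)


targets = [('prot1', 'PEPT'), ('prot2', 'MKR')]
only_decoys = list(iter_decoys(targets, lambda s: s[::-1]))
interleaved = list(iter_decoys(targets, lambda s: s[::-1], decoy_only=False))
```

Because each input entry is consumed exactly once, this pattern also works when entries is a one-shot iterator such as a file reader.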
- pyteomics.fasta.decoy_sequence(sequence, mode='reverse', **kwargs)[source]¶
Create a decoy sequence out of a given sequence string.
- Parameters:
sequence (str) – The initial sequence string.
mode (str or callable, optional) – Type of decoy sequence. Should be one of the standard modes or any callable. Standard modes are:
‘reverse’ for reverse();
‘shuffle’ for shuffle();
‘fused’ for fused_decoy().
Default is ‘reverse’.
**kwargs – Passed to the decoy function.
- Returns:
decoy_sequence – The decoy sequence.
- Return type:
str
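Accepting either a mode name or a callable is a simple dispatch pattern, sketched below. This is not the pyteomics source. make_decoy, MODES and the two helper functions are hypothetical, and the seed argument exists only to make the example reproducible.

```python
import random


def reverse_sketch(sequence):
    return sequence[::-1]


def shuffle_sketch(sequence, seed=None):
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return ''.join(residues)


MODES = {'reverse': reverse_sketch, 'shuffle': shuffle_sketch}


def make_decoy(sequence, mode='reverse', **kwargs):
    """Resolve mode to a function (by name or as a callable) and apply it."""
    func = mode if callable(mode) else MODES[mode]
    # Extra keyword arguments are forwarded to the chosen function.
    return func(sequence, **kwargs)


reversed_seq = make_decoy('PEPTIDE')
custom = make_decoy('PEPTIDE', mode=lambda s, n=1: s[n:] + s[:n], n=2)
shuffled = make_decoy('PEPTIDE', mode='shuffle', seed=0)
```

A user-supplied callable receives the same keyword arguments as a built-in mode would, so it can define its own options.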
- pyteomics.fasta.fused_decoy(sequence, decoy_mode='reverse', sep='R', **kwargs)[source]¶
Create a “fused” decoy sequence by concatenating a decoy sequence with the original one. The method and its use cases are described in:
Ivanov, M. V., Levitsky, L. I., & Gorshkov, M. V. (2016). Adaptation of Decoy Fusion Strategy for Existing Multi-Stage Search Workflows. Journal of The American Society for Mass Spectrometry, 27(9), 1579-1582.
- Parameters:
sequence (str) – The initial sequence string.
decoy_mode (str or callable, optional) – Type of decoy sequence to use. Should be one of the standard modes or any callable. Standard modes are:
‘reverse’ for reverse();
‘shuffle’ for shuffle();
‘fused’ for fused_decoy() (if you love recursion).
Default is ‘reverse’.
sep (str, optional) – Amino acid motif that separates the decoy sequence from the target one. This setting should reflect the enzyme specificity used in the search against the database being generated. Default is ‘R’, which is suitable for trypsin searches.
**kwargs – Passed to the decoy generation function.
Examples
>>> fused_decoy('PEPT')
'TPEPRPEPT'
>>> fused_decoy('MPEPT', 'shuffle', 'K', keep_nterm=True)
'MPPTEKMPEPT'
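The first example can be reproduced by hand: the decoy is the reversed sequence, followed by the separator and the original sequence. The sketch below does this with a hypothetical fused_sketch function (reverse mode only). It then applies a deliberately naive cleavage after every separator residue, an assumption made purely for illustration, to show why the separator should match the enzyme specificity.

```python
import re


def fused_sketch(sequence, sep='R'):
    """Concatenate a reversed decoy, a separator motif, and the target."""
    return sequence[::-1] + sep + sequence


fused = fused_sketch('PEPT')

# Naive cleavage after every 'R' (real trypsin rules are more involved):
# the decoy part and the target part end up in separate peptides.
peptides = re.findall(r'.*?R|.+$', fused)
```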
- pyteomics.fasta.parse(header, flavor='auto', parsers=None)[source]¶
Parse the FASTA header and return a nice dictionary.
- Parameters:
header (str) – FASTA header to parse.
flavor (str, optional) – Short name of the header format (case-insensitive). Valid values are 'auto' and keys of the parsers dict. Default is 'auto', which means try all formats in turn and return the first result that can be obtained without an exception.
parsers (dict, optional) – A dict where keys are format names (lowercased) and values are functions that take a header string and return the parsed header.
- Returns:
out – A dictionary with the info from the header. The format depends on the flavor.
- Return type:
dict
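The 'auto' behaviour (try every parser in turn and return the first result obtained without an exception) can be sketched as follows. The two header formats and all function names here are invented for the example; they are not the parsers shipped in std_parsers.

```python
def parse_uniprot_like(header):
    # Hypothetical parser: expects "db|accession|entry_name description".
    db, accession, rest = header.split('|', 2)
    entry, _, name = rest.partition(' ')
    return {'db': db, 'id': accession, 'entry': entry, 'name': name}


def parse_key_value(header):
    # Hypothetical parser: expects "id key=value key=value".
    ident, *pairs = header.split()
    fields = dict(pair.split('=', 1) for pair in pairs)
    if not fields:
        raise ValueError('no key=value pairs found')
    return {'id': ident, **fields}


def parse_header(header, flavor='auto', parsers=None):
    parsers = parsers or {'uniprot_like': parse_uniprot_like, 'kv': parse_key_value}
    if flavor.lower() != 'auto':
        return parsers[flavor.lower()](header)
    for func in parsers.values():
        try:
            return func(header)
        except Exception:
            continue  # this format did not fit; try the next one
    raise ValueError('Unknown FASTA header format: ' + header)


first = parse_header('sp|P12345|TEST_HUMAN Test protein')
second = parse_header('seq1 OS=human GN=abc')
```

With automatic detection the order of the parsers matters: a permissive parser placed first can shadow a more specific one.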
- pyteomics.fasta.read(source=None, use_index=None, flavor=None, **kwargs)[source]¶
Parse a FASTA file. This function serves as a dispatcher between different parsers available in this module.
- Parameters:
source (str or file or None, optional) – A file object (or file name) with a FASTA database. Default is None, which means read standard input.
use_index (bool, optional) – If True, the created parser object will be an instance of IndexedFASTA. If False (default), it will be an instance of FASTA.
flavor (str or None, optional) – A supported FASTA header format. If specified, a format-specific parser instance is returned.
Note
See std_parsers for supported flavors.
- Returns:
out – A named 2-tuple with FASTA header (str or dict) and sequence (str). Attributes ‘description’ and ‘sequence’ are also provided.
- Return type:
iterator of tuples
- pyteomics.fasta.reverse(sequence, keep_nterm=False, keep_cterm=False)[source]¶
Create a decoy sequence by reversing the original one.
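- Parameters:
sequence (str) – The initial sequence string.
keep_nterm (bool, optional) – If True, then the N-terminal residue will be kept. Default is False.
keep_cterm (bool, optional) – If True, then the C-terminal residue will be kept. Default is False.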
- Returns:
decoy_sequence – The decoy sequence.
- Return type:
str
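One way to read the keep_nterm and keep_cterm flags is that the terminal residues stay in place while everything between them is reversed. The sketch below implements that reading. reverse_sketch is a hypothetical stand-in and is not guaranteed to match the library in every edge case.

```python
def reverse_sketch(sequence, keep_nterm=False, keep_cterm=False):
    """Reverse a sequence, optionally leaving the terminal residues in place."""
    start = 1 if keep_nterm else 0
    end = len(sequence) - 1 if keep_cterm else len(sequence)
    if start >= end:
        return sequence  # nothing left to reverse
    return sequence[:start] + sequence[start:end][::-1] + sequence[end:]


plain = reverse_sketch('PEPTIDE')
kept_n = reverse_sketch('MPEPTK', keep_nterm=True)
kept_both = reverse_sketch('MPEPTK', keep_nterm=True, keep_cterm=True)
```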
- pyteomics.fasta.shuffle(sequence, keep_nterm=False, keep_cterm=False, keep_nterm_M=False, fix_aa='')[source]¶
Create a decoy sequence by shuffling the original one.
- Parameters:
sequence (str) – The initial sequence string.
keep_nterm (bool, optional) – If True, then the N-terminal residue will be kept. Default is False.
keep_cterm (bool, optional) – If True, then the C-terminal residue will be kept. Default is False.
keep_nterm_M (bool, optional) – If True, then the N-terminal methionine will be kept. Default is False.
fix_aa (iterable, optional) – Single letter codes for amino acids that should preserve their position during shuffling. Default is ‘’.
- Returns:
decoy_sequence – The decoy sequence.
- Return type:
str
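Position-preserving shuffling can be sketched by collecting the positions that are allowed to move, shuffling only the residues at those positions, and writing them back. shuffle_sketch below is a hypothetical illustration. The seed argument is an addition for reproducibility, and keep_nterm_M is left out for brevity.

```python
import random


def shuffle_sketch(sequence, keep_nterm=False, keep_cterm=False, fix_aa='', seed=None):
    """Shuffle residues while keeping selected positions untouched."""
    residues = list(sequence)
    last = len(residues) - 1
    # Positions that are allowed to move.
    free = [
        i for i, aa in enumerate(residues)
        if aa not in fix_aa
        and not (keep_nterm and i == 0)
        and not (keep_cterm and i == last)
    ]
    movable = [residues[i] for i in free]
    random.Random(seed).shuffle(movable)
    for i, aa in zip(free, movable):
        residues[i] = aa
    return ''.join(residues)


original = 'MPEPTIDEK'
decoy = shuffle_sketch(original, keep_nterm=True, fix_aa='K', seed=42)
```

The decoy keeps the amino acid composition of the original, and every residue listed in fix_aa stays at its original index.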
- pyteomics.fasta.std_parsers¶
A dictionary with parsers for known FASTA header formats. For now, supported formats are those described at the UniProt help page.
- pyteomics.fasta.write(entries, output=None)[source]¶
Create a FASTA file with entries.
- Parameters:
entries (iterable of (str/dict, str) tuples) – An iterable of 2-tuples in the form (description, sequence). If description is a dictionary, the value for RAW_HEADER_KEY will be written as protein description.
output (file-like or str, optional) – A file open for writing or a path to write to. If the file exists, it will be opened for writing. Default is None, which means write to standard output.
Note
The default mode for output files specified by name has been changed from a to w in pyteomics 4.6. See file_mode to override the mode.
file_mode (str, keyword only, optional) – If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘w’.
Note
The default changed from ‘a’ in pyteomics 4.6.
- Returns:
output_file – The file where the FASTA is written.
- Return type:
file object
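Writing entries boils down to emitting a '>' line with the description followed by the sequence. The sketch below shows one possible layout and how a dictionary description might fall back to a stored raw header. It is not the pyteomics implementation: write_fasta is hypothetical, RAW_KEY is a made-up stand-in for RAW_HEADER_KEY, and the wrapping width is an arbitrary choice.

```python
import io

RAW_KEY = 'raw'  # hypothetical stand-in for the RAW_HEADER_KEY mentioned above


def write_fasta(entries, output, width=60):
    """Write (description, sequence) pairs as FASTA records."""
    for description, sequence in entries:
        if isinstance(description, dict):
            # Parsed headers are written back using their stored raw string.
            description = description[RAW_KEY]
        output.write('>' + description + '\n')
        # Wrap long sequences; the width is an arbitrary choice for this sketch.
        for start in range(0, len(sequence), width):
            output.write(sequence[start:start + width] + '\n')
    return output


buffer = write_fasta(
    [('prot1', 'PEPTIDE'), ({RAW_KEY: 'prot2 parsed'}, 'MKRMKR')],
    io.StringIO(),
    width=4,
)
text = buffer.getvalue()
```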
- pyteomics.fasta.write_decoy_db(source=None, output=None, mode='reverse', prefix='DECOY_', decoy_only=False, **kwargs)[source]¶
Generate a decoy database out of a given source and write to file.
If output is a path, the file will be opened for appending, so no information will be lost if the file exists. However, the user should be careful when providing open file streams as source and output. Reading and writing will start from the current position in the files, which is where the last I/O operation finished. One can use the file.seek() method to change it.
- Parameters:
source (file-like object or str or None, optional) – A path to a FASTA database or a file object itself. Default is None, which means read standard input.
output (file-like object or str, optional) – A path to the output database or a file open for writing. Defaults to None, in which case the results go to the standard output.
mode (str or callable, optional) – Algorithm of decoy sequence generation. ‘reverse’ by default. See decoy_sequence() for more details.
prefix (str, optional) – A prefix to the protein descriptions of decoy entries. The default value is ‘DECOY_’.
decoy_only (bool, optional) – If set to True, only the decoy entries will be written to output. If False, the entries from source will be written as well. False by default.
file_mode (str, keyword only, optional) – If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘a’.
**kwargs – Passed to decoy_sequence().
- Returns:
output – A (closed) file object for the created file.
- Return type:
file