Pyteomics documentation v4.7.1

fasta - manipulations with FASTA databases


FASTA is a simple file format for protein sequence databases. Please refer to the NCBI website for the most detailed information on the format.

Data manipulation

Classes

Several classes of FASTA parsers are available. All of them have common features:

  • context manager support;

  • header parsing;

  • direct iteration.

Available classes:

FASTABase - common ancestor, suitable for type checking. Abstract class.

FASTA - text-mode, sequential parser. Good for iteration over database entries.

IndexedFASTA - binary-mode, indexing parser. Supports direct indexing by header string.

TwoLayerIndexedFASTA - additionally supports indexing by extracted header fields.

UniProt and IndexedUniProt, UniParc and IndexedUniParc, UniMes and IndexedUniMes, UniRef and IndexedUniRef, SPD and IndexedSPD, NCBI and IndexedNCBI, RefSeq and IndexedRefSeq - format-specific parsers.

Functions

read() - returns an instance of the appropriate reader class, for sequential iteration or random access.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

write() - write entries to a FASTA database.

parse() - parse a FASTA header.

Decoy sequence generation

decoy_sequence() - generate a decoy sequence from a given sequence, using one of the other functions listed in this section or any other callable.

reverse() - generate a reversed decoy sequence.

shuffle() - generate a shuffled decoy sequence.

fused_decoy() - generate a “fused” decoy sequence.

Decoy database generation

write_decoy_db() - generate a decoy database and write it to a file.

decoy_db() - generate entries for a decoy database from a given FASTA database.

decoy_entries() - generate decoy entries for an iterator.

decoy_chain() - a version of decoy_db() for multiple files.

decoy_chain.from_iterable() - like decoy_chain(), but with an iterable of files.

Auxiliary

std_parsers - a dictionary with parsers for known FASTA header formats.


pyteomics.fasta.chain(*args, **kwargs)

Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:

files – Iterable of file names or file objects.
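Conceptually, chaining readers means iterating over each file in turn and yielding its entries. The following is a minimal stdlib-only sketch of that idea, not the pyteomics implementation: `read_entries` and `chain_entries` are hypothetical helpers, with `read_entries` standing in for read().

```python
import io
from itertools import chain


def read_entries(handle):
    """Hypothetical stand-in for read(): yield (description, sequence) pairs."""
    description, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if description is not None:
                yield description, ''.join(chunks)
            description, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if description is not None:
        yield description, ''.join(chunks)


def chain_entries(files):
    """Yield the entries of every file in order, one file after another."""
    return chain.from_iterable(read_entries(f) for f in files)


files = [io.StringIO('>a\nPEPTIDE\n'), io.StringIO('>b\nACDE\nFGH\n')]
entries = list(chain_entries(files))
```

Because the result is a lazy iterator, each file is only read when iteration reaches it.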

pyteomics.fasta.decoy_chain(*args, **kwargs)

Chain decoy_db() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the decoy_db() function.

decoy_chain.from_iterable(files, **kwargs)

Chain decoy_db() for several files. Keyword arguments are passed to the decoy_db() function.

Parameters:

files – Iterable of file names or file objects.

class pyteomics.fasta.FASTA(source, ignore_comments=False, parser=None, encoding=None)[source]

Bases: FASTABase, FileReader

Text-mode, sequential FASTA parser. Suitable for iteration over the file to obtain all entries in order.

__init__(source, ignore_comments=False, parser=None, encoding=None)[source]

Create a new FASTA parser object. Supports iteration, yields (description, sequence) tuples. Supports with syntax.

Parameters:
  • source (str or file-like) – File to read. If file object, it must be opened in text mode.

  • ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.

  • parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.

  • encoding (str or None, optional) – File encoding (if it is given by name).

reset()

Resets the iterator to its initial state.
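The effect of the parser argument on the yielded tuples can be illustrated without the library. This is a sketch under stated assumptions: `Entry`, `with_parser` and `split_id` are hypothetical names, and `split_id` is a toy parser rather than one of the std_parsers.

```python
from collections import namedtuple

# Hypothetical named tuple mirroring the (description, sequence) pairs.
Entry = namedtuple('Entry', ['description', 'sequence'])


def with_parser(entries, parser=None):
    """Yield entries, passing each description through parser if one is given."""
    for description, sequence in entries:
        if parser is not None:
            description = parser(description)
        yield Entry(description, sequence)


def split_id(description):
    """Toy parser: treat the first word as an identifier."""
    ident, _, rest = description.partition(' ')
    return {'id': ident, 'name': rest}


raw = [('P1 first protein', 'PEPTIDE')]
plain = list(with_parser(raw))                    # header returned "as is"
parsed = list(with_parser(raw, parser=split_id))  # header replaced by a dict
```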

class pyteomics.fasta.FASTABase(source, **kwargs)[source]

Bases: object

Abstract base class for FASTA file parsers. Can be used for type checking.

__init__(source, **kwargs)[source]

class pyteomics.fasta.FlavoredMixin(parse=True)[source]

Bases: object

Parser aimed at a specific FASTA flavor. Subclasses should define parser and header_pattern. The parse argument in __init__() defines whether description is parsed in output.

__init__(parse=True)[source]

class pyteomics.fasta.IndexedFASTA(source, ignore_comments=False, parser=None, **kwargs)[source]

Bases: FASTABase, TaskMappingMixin, IndexedTextReader

Indexed FASTA parser. Supports direct indexing by matched labels.

__init__(source, ignore_comments=False, parser=None, **kwargs)[source]

Create an indexed FASTA parser object.

Parameters:
  • source (str or file-like) – File to read. If file object, it must be opened in binary mode.

  • ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.

  • parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.

  • encoding (str or None, optional, keyword only) – File encoding. Default is UTF-8.

  • block_size (int or None, optional, keyword only) – Number of bytes to consume at once.

  • delimiter (str or None, optional, keyword only) – Overrides the FASTA record delimiter (default is '\n>').

  • label (str or None, optional, keyword only) – Overrides the FASTA record label pattern. Default is '^[\n]?>(.*)'.

  • label_group (int or str, optional, keyword only) – Overrides the matched group used as key in the byte offset index. This in combination with label can be used to extract fields from headers. However, consider using TwoLayerIndexedFASTA for this purpose.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.
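The idea of a byte offset index keyed by record labels can be pictured with a stdlib-only sketch. It is not the pyteomics implementation: `build_offset_index` and `fetch` are hypothetical helpers, the whole file is read into memory for brevity instead of being consumed in blocks, and the patterns are bytes versions of the documented default delimiter and label.

```python
import io
import re


def build_offset_index(handle, delimiter=b'\n>', label=rb'^[\n]?>(.*)', label_group=1):
    """Map each record label to the (start, end) byte offsets of its record."""
    data = handle.read()
    pattern = re.compile(label)
    # A record starts at offset 0 and right after every newline followed by '>'.
    starts = [0]
    pos = data.find(delimiter)
    while pos != -1:
        starts.append(pos + 1)  # skip the newline, keep the '>'
        pos = data.find(delimiter, pos + 1)
    ends = starts[1:] + [len(data)]
    index = {}
    for start, end in zip(starts, ends):
        match = pattern.match(data[start:end])
        if match:
            key = match.group(label_group).decode('utf-8').strip()
            index[key] = (start, end)
    return index


def fetch(handle, index, key):
    """Seek straight to a record and return (description, sequence)."""
    start, end = index[key]
    handle.seek(start)
    lines = handle.read(end - start).decode('utf-8').splitlines()
    return lines[0][1:], ''.join(lines[1:])


data = b'>sp|P1|A first\nPEPTIDE\n>sp|P2|B second\nACDE\nFGH\n'
handle = io.BytesIO(data)
index = build_offset_index(handle)
entry = fetch(handle, index, 'sp|P2|B second')
```

Working on bytes is what makes the offsets usable with seek(), which is one reason an indexed reader would want its source opened in binary mode.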

class pyteomics.fasta.IndexedNCBI(source, parse=True, **kwargs)[source]

Bases: NCBIMixin, TwoLayerIndexedFASTA

Indexed parser for NCBI FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedNCBI object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.

build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedRefSeq(source, parse=True, **kwargs)[source]

Bases: RefSeqMixin, TwoLayerIndexedFASTA

Indexed parser for RefSeq FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedRefSeq object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.

build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedSPD(source, parse=True, **kwargs)[source]

Bases: SPDMixin, TwoLayerIndexedFASTA

Indexed parser for SPD FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedSPD object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.

build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedUniMes(source, parse=True, **kwargs)[source]

Bases: UniMesMixin, TwoLayerIndexedFASTA

Indexed parser for UniMes FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedUniMes object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.

build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedUniParc(source, parse=True, **kwargs)[source]

Bases: UniParcMixin, TwoLayerIndexedFASTA

Indexed parser for UniParc FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedUniParc object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.

build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedUniProt(source, parse=True, **kwargs)[source]

Bases: UniProtMixin, TwoLayerIndexedFASTA

Indexed parser for UniProt FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedUniProt object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.

build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedUniRef(source, parse=True, **kwargs)[source]

Bases: UniRefMixin, TwoLayerIndexedFASTA

Indexed parser for UniRef FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedUniRef object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.

build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.NCBI(source, parse=True, **kwargs)[source]

Bases: NCBIMixin, FASTA

Text-mode parser for NCBI FASTA files.

__init__(source, parse=True, **kwargs)

Creates an NCBI object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the FASTA constructor.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.NCBIMixin(parse=True)[source]

Bases: FlavoredMixin

__init__(parse=True)

class pyteomics.fasta.RefSeq(source, parse=True, **kwargs)[source]

Bases: RefSeqMixin, FASTA

Text-mode parser for RefSeq FASTA files.

__init__(source, parse=True, **kwargs)

Creates a RefSeq object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the FASTA constructor.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.RefSeqMixin(parse=True)[source]

Bases: FlavoredMixin

__init__(parse=True)

class pyteomics.fasta.SPD(source, parse=True, **kwargs)[source]

Bases: SPDMixin, FASTA

Text-mode parser for SPD FASTA files.

__init__(source, parse=True, **kwargs)

Creates an SPD object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the FASTA constructor.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.SPDMixin(parse=True)[source]

Bases: FlavoredMixin

__init__(parse=True)

class pyteomics.fasta.TwoLayerIndexedFASTA(source, header_pattern=None, header_group=None, ignore_comments=False, parser=None, **kwargs)[source]

Bases: IndexedFASTA

Parser with two-layer index. Extracted groups are mapped to full headers (where possible), full headers are mapped to byte offsets.

When indexed, the key is looked up in both indexes, allowing access by meaningful IDs (like UniProt accession) and by full header string.

__init__(source, header_pattern=None, header_group=None, ignore_comments=False, parser=None, **kwargs)[source]

Open source and create a two-layer index for convenient random access both by full header strings and extracted fields.

Parameters:
  • source (str or file-like) – File to read. If file object, it must be opened in binary mode.

  • header_pattern (str or RE or None, optional) – Pattern to match the header string. Must capture the group used for the second index. If None (default), second-level index is not created.

  • header_group (int or str or None, optional) – Defines which group is used as key in the second-level index. Default is 1.

  • ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.

  • parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.

  • **kwargs – Other keyword arguments are passed to the IndexedFASTA constructor.

build_second_index()[source]

Create the mapping from extracted field to whole header string.

get_by_id(key)[source]

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.
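The two-layer lookup can be sketched with plain dictionaries. This is an illustration only: `build_second_index_sketch` and `get_by_id_sketch` are hypothetical helpers, the header pattern and byte offsets are made up, and the order in which the two indexes are consulted is an assumption.

```python
import re


def build_second_index_sketch(headers, header_pattern, header_group=1):
    """Map the captured field of each header to the full header string."""
    pattern = re.compile(header_pattern)
    second = {}
    for header in headers:
        match = pattern.match(header)
        if match:  # non-matching headers stay reachable by full string only
            second[match.group(header_group)] = header
    return second


def get_by_id_sketch(key, first_index, second_index):
    """Try the key as a full header, then as an extracted field."""
    if key in first_index:
        return first_index[key]
    return first_index[second_index[key]]


# First layer: full header -> byte offset (the offsets are invented).
first = {'sp|P12345|EXAMPLE_HUMAN Example protein': 0, 'custom header': 120}
# Second layer: accession-like field -> full header.
second = build_second_index_sketch(first, r'^\w+\|([^|]+)\|')

by_accession = get_by_id_sketch('P12345', first, second)
by_header = get_by_id_sketch('custom header', first, second)
```

Keeping the second layer as a mapping to header strings, rather than to offsets, means the first-layer index remains the single place where offsets are stored.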

class pyteomics.fasta.UniMes(source, parse=True, **kwargs)[source]

Bases: UniMesMixin, FASTA

Text-mode parser for UniMes FASTA files.

__init__(source, parse=True, **kwargs)

Creates a UniMes object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the FASTA constructor.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.UniMesMixin(parse=True)[source]

Bases: FlavoredMixin

__init__(parse=True)

class pyteomics.fasta.UniParc(source, parse=True, **kwargs)[source]

Bases: UniParcMixin, FASTA

Text-mode parser for UniParc FASTA files.

__init__(source, parse=True, **kwargs)

Creates a UniParc object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the FASTA constructor.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.UniParcMixin(parse=True)[source]

Bases: FlavoredMixin

__init__(parse=True)

class pyteomics.fasta.UniProt(source, parse=True, **kwargs)[source]

Bases: UniProtMixin, FASTA

Text-mode parser for UniProt FASTA files.

__init__(source, parse=True, **kwargs)

Creates a UniProt object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the FASTA constructor.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.UniProtMixin(parse=True)[source]

Bases: FlavoredMixin

__init__(parse=True)

class pyteomics.fasta.UniRef(source, parse=True, **kwargs)[source]

Bases: UniRefMixin, FASTA

Text-mode parser for UniRef FASTA files.

__init__(source, parse=True, **kwargs)

Creates a UniRef object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.

  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.

  • **kwargs – Passed to the FASTA constructor.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.UniRefMixin(parse=True)[source]

Bases: FlavoredMixin

__init__(parse=True)

pyteomics.fasta.decoy_db(source=None, mode='reverse', prefix='DECOY_', decoy_only=False, ignore_comments=False, parser=None, **kwargs)[source]

Iterate over sequences for a decoy database out of a given source.

Parameters:
  • source (file-like object or str or None, optional) – A path to a FASTA database or a file object itself. Default is None, which means read standard input.

  • mode (str or callable, optional) – Algorithm of decoy sequence generation. ‘reverse’ by default. See decoy_sequence() for more information.

  • prefix (str, optional) – A prefix to the protein descriptions of decoy entries. The default value is ‘DECOY_’.

  • decoy_only (bool, optional) – If set to True, only the decoy entries will be written to output. If False, the entries from source will be written first. False by default.

  • ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False.

  • parser (function or None, optional) – Defines whether the fasta descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format guessing. Default is None, which means return the header “as is”.

  • **kwargs – Passed to decoy_sequence().

Returns:

out – An iterator over entries of the new database.

Return type:

iterator

pyteomics.fasta.decoy_entries(entries, mode='reverse', prefix='DECOY_', decoy_only=True, **kwargs)[source]

Iterate over protein entries (tuples) and produce decoy entries. The entries are only iterated once.

Parameters:
  • entries (iterable of tuples) – Any iterable of (description, sequence) pairs.

  • mode (str or callable, optional) – Algorithm of decoy sequence generation. ‘reverse’ by default. See decoy_sequence() for more information.

  • prefix (str, optional) – A prefix to the protein descriptions of decoy entries. The default value is ‘DECOY_’.

  • decoy_only (bool, optional) – If set to True, only the decoy entries will be written to output. If False, each consumed entry is yielded unchanged, followed by its decoy counterpart. True by default.

  • **kwargs – Passed to decoy_sequence().

Returns:

out – An iterator over new entries.

Return type:

iterator
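The interleaving described for decoy_only=False can be made concrete with a short generator. This is a sketch of the documented behaviour, not the library code: `decoy_entries_sketch` is a hypothetical name, and plain reversal stands in for the decoy function.

```python
def decoy_entries_sketch(entries, decoy=lambda s: s[::-1],
                         prefix='DECOY_', decoy_only=True):
    """Yield decoy entries, optionally preceded by their target entries."""
    for description, sequence in entries:
        if not decoy_only:
            # The target entry comes first, immediately followed by its decoy.
            yield description, sequence
        yield prefix + description, decoy(sequence)


targets = [('a', 'PEPT'), ('b', 'ACDE')]
only_decoys = list(decoy_entries_sketch(targets))
interleaved = list(decoy_entries_sketch(targets, decoy_only=False))
```

Because targets and decoys alternate, the input is consumed in a single pass, so it can be any one-shot iterator.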

pyteomics.fasta.decoy_sequence(sequence, mode='reverse', **kwargs)[source]

Create a decoy sequence out of a given sequence string.

Parameters:
  • sequence (str) – The initial sequence string.

  • mode (str or callable, optional) – Type of decoy sequence. Should be one of the standard modes or any callable. Standard modes are ‘reverse’ (see reverse()), ‘shuffle’ (see shuffle()) and ‘fused’ (see fused_decoy()). Default is ‘reverse’.

  • **kwargs – Passed to the decoy function.

Returns:

decoy_sequence – The decoy sequence.

Return type:

str
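Accepting either a mode name or a callable boils down to a small dispatch step. The sketch below assumes a hypothetical `STANDARD_MODES` table with two illustrative modes; the helper names and the error message are not taken from pyteomics.

```python
import random


def _reverse(sequence):
    return sequence[::-1]


def _shuffle(sequence):
    residues = list(sequence)
    random.shuffle(residues)
    return ''.join(residues)


# Hypothetical lookup table from mode name to decoy function.
STANDARD_MODES = {'reverse': _reverse, 'shuffle': _shuffle}


def decoy_sequence_sketch(sequence, mode='reverse', **kwargs):
    """Dispatch to a named decoy function or call a user-supplied callable."""
    if callable(mode):
        return mode(sequence, **kwargs)
    if mode not in STANDARD_MODES:
        raise ValueError('Unsupported decoy mode: {!r}'.format(mode))
    return STANDARD_MODES[mode](sequence, **kwargs)


reversed_decoy = decoy_sequence_sketch('PEPT')
custom_decoy = decoy_sequence_sketch('PEPT', mode=str.lower)
shuffled_decoy = decoy_sequence_sketch('PEPTIDE', mode='shuffle')
```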

pyteomics.fasta.fused_decoy(sequence, decoy_mode='reverse', sep='R', **kwargs)[source]

Create a “fused” decoy sequence by concatenating a decoy sequence with the original one. The method and its use cases are described in:

Ivanov, M. V., Levitsky, L. I., & Gorshkov, M. V. (2016). Adaptation of Decoy Fusion Strategy for Existing Multi-Stage Search Workflows. Journal of The American Society for Mass Spectrometry, 27(9), 1579-1582.

Parameters:
  • sequence (str) – The initial sequence string.

  • decoy_mode (str or callable, optional) – Type of decoy sequence to use. Should be one of the standard modes or any callable. Standard modes are ‘reverse’ (see reverse()) and ‘shuffle’ (see shuffle()). Default is ‘reverse’.

  • sep (str, optional) – Amino acid motif that separates the decoy sequence from the target one. This setting should reflect the enzyme specificity used in the search against the database being generated. Default is ‘R’, which is suitable for trypsin searches.

  • **kwargs – Passed to the decoy generation function.

Examples

>>> fused_decoy('PEPT')
'TPEPRPEPT'
>>> fused_decoy('MPEPT', 'shuffle', 'K', keep_nterm=True)
'MPPTEKMPEPT'
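The first example can be reproduced by hand: the decoy, the separator and the original sequence are joined in that order. A minimal sketch, with `fused_decoy_sketch` as a hypothetical name and plain reversal as the decoy function:

```python
def fused_decoy_sketch(sequence, decoy=lambda s: s[::-1], sep='R'):
    """Concatenate decoy + separator + target, as in the first example."""
    return decoy(sequence) + sep + sequence


fused = fused_decoy_sketch('PEPT')  # 'TPEP' + 'R' + 'PEPT'
```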
pyteomics.fasta.parse(header, flavor='auto', parsers=None)[source]

Parse a FASTA header and return the result as a dictionary.

Parameters:
  • header (str) – FASTA header to parse

  • flavor (str, optional) – Short name of the header format (case-insensitive). Valid values are 'auto' and keys of the parsers dict. Default is 'auto', which means try all formats in turn and return the first result that can be obtained without an exception.

  • parsers (dict, optional) – A dict where keys are format names (lowercased) and values are functions that take a header string and return the parsed header.

Returns:

out – A dictionary with the info from the header. The format depends on the flavor.

Return type:

dict
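The 'auto' behaviour (try every format in turn and keep the first result that does not raise) can be sketched as follows. Everything here is illustrative: `parse_sketch`, the two toy parsers and the keys of the returned dictionaries are hypothetical and do not reflect the output of the real std_parsers.

```python
import re


def parse_pipe(header):
    """Toy parser for 'db|id|entry description' headers."""
    match = re.match(r'^(\w+)\|([^|]+)\|(\S+)\s+(.*)$', header)
    if match is None:
        raise ValueError('not a pipe-delimited header')
    db, ident, entry, name = match.groups()
    return {'db': db, 'id': ident, 'entry': entry, 'name': name}


def parse_plain(header):
    """Toy fallback parser: first word is the identifier."""
    ident, _, name = header.partition(' ')
    return {'id': ident, 'name': name}


def parse_sketch(header, flavor='auto', parsers=None):
    """Use the named parser, or try all of them in turn for 'auto'."""
    parsers = parsers or {}
    flavor = flavor.lower()
    if flavor != 'auto':
        return parsers[flavor](header)
    for parser in parsers.values():
        try:
            return parser(header)
        except Exception:
            continue  # this format did not fit, try the next one
    raise ValueError('Unrecognized FASTA header: ' + header)


toy_parsers = {'pipe': parse_pipe, 'plain': parse_plain}
first = parse_sketch('sp|P12345|EX_HUMAN Example protein', parsers=toy_parsers)
second = parse_sketch('prot1 something', parsers=toy_parsers)
```

With this scheme the order of the parsers matters: a permissive parser placed first would shadow stricter ones.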

pyteomics.fasta.read(source=None, use_index=None, flavor=None, **kwargs)[source]

Parse a FASTA file. This function serves as a dispatcher between different parsers available in this module.

Parameters:
  • source (str or file or None, optional) – A file object (or file name) with a FASTA database. Default is None, which means read standard input.

  • use_index (bool, optional) – If True, the created parser object will be an instance of IndexedFASTA. If False (default), it will be an instance of FASTA.

  • flavor (str or None, optional) –

    A supported FASTA header format. If specified, a format-specific parser instance is returned.

    Note

    See std_parsers for supported flavors.

Returns:

out – A named 2-tuple with FASTA header (str or dict) and sequence (str). Attributes ‘description’ and ‘sequence’ are also provided.

Return type:

iterator of tuples

pyteomics.fasta.reverse(sequence, keep_nterm=False, keep_cterm=False)[source]

Create a decoy sequence by reversing the original one.

Parameters:
  • sequence (str) – The initial sequence string.

  • keep_nterm (bool, optional) – If True, then the N-terminal residue will be kept. Default is False.

  • keep_cterm (bool, optional) – If True, then the C-terminal residue will be kept. Default is False.

Returns:

decoy_sequence – The decoy sequence.

Return type:

str
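
The effect of keep_nterm and keep_cterm can be shown with a short pure-Python sketch. reverse_sketch is a hypothetical re-implementation written from the parameter descriptions above, not the library source; it assumes that a kept terminal residue stays in place while everything between the kept ends is reversed.

```python
def reverse_sketch(sequence, keep_nterm=False, keep_cterm=False):
    """Reverse a sequence, optionally leaving the terminal residues in place."""
    start = 1 if keep_nterm else 0
    end = len(sequence) - 1 if keep_cterm else len(sequence)
    if start >= end:
        return sequence  # nothing left to reverse
    return sequence[:start] + sequence[start:end][::-1] + sequence[end:]


print(reverse_sketch('PEPTIDE'))                                    # EDITPEP
print(reverse_sketch('PEPTIDE', keep_nterm=True))                   # PEDITPE
print(reverse_sketch('PEPTIDE', keep_cterm=True))                   # DITPEPE
print(reverse_sketch('PEPTIDE', keep_nterm=True, keep_cterm=True))  # PDITPEE
```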

pyteomics.fasta.shuffle(sequence, keep_nterm=False, keep_cterm=False, keep_nterm_M=False, fix_aa='')[source]

Create a decoy sequence by shuffling the original one.

Parameters:
  • sequence (str) – The initial sequence string.

  • keep_nterm (bool, optional) – If True, then the N-terminal residue will be kept. Default is False.

  • keep_cterm (bool, optional) – If True, then the C-terminal residue will be kept. Default is False.

  • keep_nterm_M (bool, optional) – If True, then the N-terminal methionine will be kept. Default is False.

  • fix_aa (iterable, optional) – Single letter codes for amino acids that should preserve their position during shuffling. Default is ‘’.

Returns:

decoy_sequence – The decoy sequence.

Return type:

str
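
How the position-preserving options interact is easier to see in code. shuffle_sketch below is a hypothetical illustration written from the parameter descriptions above, not the library implementation; the rng argument is an addition of this sketch that makes the result reproducible.

```python
import random


def shuffle_sketch(sequence, keep_nterm=False, keep_cterm=False,
                   keep_nterm_M=False, fix_aa='', rng=None):
    """Shuffle only the residues that are not pinned to their position."""
    if rng is None:
        rng = random.Random()
    residues = list(sequence)
    fixed = set(fix_aa)
    last = len(residues) - 1
    movable = []  # indices of residues that are allowed to move
    for i, aa in enumerate(residues):
        if aa in fixed:
            continue
        if i == 0 and (keep_nterm or (keep_nterm_M and aa == 'M')):
            continue
        if i == last and keep_cterm:
            continue
        movable.append(i)
    shuffled = [residues[i] for i in movable]
    rng.shuffle(shuffled)
    for i, aa in zip(movable, shuffled):
        residues[i] = aa
    return ''.join(residues)


target = 'MKPEPTIDEKR'
decoy = shuffle_sketch(target, keep_nterm_M=True, fix_aa='KR',
                       rng=random.Random(0))
print(target)
print(decoy)
```

In this sketch the decoy always has the same amino acid composition as the target, and every K, R and the N-terminal M keep their original positions.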

pyteomics.fasta.std_parsers

A dictionary with parsers for known FASTA header formats. Currently, the supported formats are those described on the UniProt help page.

pyteomics.fasta.write(entries, output=None)[source]

Create a FASTA file with entries.

Parameters:
  • entries (iterable of (str/dict, str) tuples) – An iterable of 2-tuples in the form (description, sequence). If description is a dictionary, it must have a special key, whose value will be written as protein description. The special key is defined by the variable RAW_HEADER_KEY.

  • output (file-like or str, optional) –

    A file open for writing or a path to write to. If the path points to an existing file, that file is opened in the mode set by file_mode (by default ‘w’, which overwrites it). Default is None, which means write to standard output.

    Note

    The default mode for output files specified by name was changed from ‘a’ to ‘w’ in pyteomics 4.6. Use file_mode to override the mode.

  • file_mode (str, keyword only, optional) –

    If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘w’.

    Note

    The default changed from ‘a’ in pyteomics 4.6.

Returns:

output_file – The file where the FASTA is written.

Return type:

file object
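
The practical difference between the ‘w’ and ‘a’ values of file_mode is the standard Python file-mode behaviour. The sketch below uses only the standard library; write_lines is a hypothetical helper for this example and does not call pyteomics.

```python
import os
import tempfile


def write_lines(path, lines, file_mode='w'):
    """Write text lines to path, opening the file in the given mode."""
    with open(path, file_mode) as f:
        f.writelines(line + '\n' for line in lines)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.fasta')

    write_lines(path, ['>first', 'PEPTIDE'])
    write_lines(path, ['>second', 'MKR'])  # 'w' truncates: the first entry is gone
    with open(path) as f:
        after_w = f.read()

    write_lines(path, ['>third', 'AAK'], file_mode='a')  # 'a' keeps existing content
    with open(path) as f:
        after_a = f.read()

print(after_w)
print(after_a)
```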

pyteomics.fasta.write_decoy_db(source=None, output=None, mode='reverse', prefix='DECOY_', decoy_only=False, **kwargs)[source]

Generate a decoy database out of a given source and write to file.

If output is a path, the file is opened for appending by default, so existing content is not lost. Take care, however, when passing open file objects as source and output: reading and writing start from the current position in each file, which is where the last I/O operation finished. Use the file.seek() method to change that position.
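
The note about the current file position applies to any Python file object. A minimal standard-library illustration, with an in-memory stream standing in for an open FASTA file:

```python
import io

source = io.StringIO('>first\nPEPTIDE\n>second\nMKR\n')

source.readline()          # an earlier read consumes the first header line
remaining = source.read()  # a later reader only sees what follows

source.seek(0)             # rewind to the beginning of the stream
everything = source.read()

print(repr(remaining))
print(repr(everything))
```

A stream that has already been partly read would therefore contribute only its remaining entries unless it is rewound first.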

Parameters:
  • source (file-like object or str or None, optional) – A path to a FASTA database or a file object itself. Default is None, which means read standard input.

  • output (file-like object or str, optional) – A path to the output database or a file open for writing. Default is None, which means write to standard output.

  • mode (str or callable, optional) – Algorithm of decoy sequence generation. ‘reverse’ by default. See decoy_sequence() for more details.

  • prefix (str, optional) – A prefix added to the protein descriptions of decoy entries. Default is ‘DECOY_’.

  • decoy_only (bool, optional) – If set to True, only the decoy entries will be written to output. If False, the entries from source will be written as well. False by default.

  • file_mode (str, keyword only, optional) – If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘a’.

  • **kwargs – Passed to decoy_sequence().

Returns:

output – A (closed) file object for the created file.

Return type:

file
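
The roles of prefix and decoy_only can be sketched without any file I/O. decoy_entries_sketch and its make_decoy argument are hypothetical names for this illustration; the ordering (all target entries first, then all decoys) is an assumption of the sketch, not a statement about the library.

```python
def decoy_entries_sketch(entries, prefix='DECOY_', decoy_only=False,
                         make_decoy=None):
    """Yield (description, sequence) pairs for a target-decoy database."""
    if make_decoy is None:
        def make_decoy(seq):
            return seq[::-1]  # plain reversal, as in the 'reverse' mode
    entries = list(entries)
    if not decoy_only:
        for entry in entries:  # assumption: targets are emitted first
            yield entry
    for description, sequence in entries:
        yield prefix + description, make_decoy(sequence)


targets = [('sp|P1|A first', 'PEPTIDE'), ('sp|P2|B second', 'MKR')]
combined = list(decoy_entries_sketch(targets))
decoys = list(decoy_entries_sketch(targets, prefix='REV_', decoy_only=True))

for description, sequence in combined:
    print(description, sequence)
```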
