Pyteomics documentation v4.3.3dev1

Welcome to Pyteomics tutorial!

What is Pyteomics?

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis, such as:

  • calculation of basic physico-chemical properties of polypeptides:
    • mass and isotopic distribution
    • charge and pI
    • chromatographic retention time
  • access to common proteomics data:
    • MS or LC-MS data
    • FASTA databases
    • search engines output
  • easy manipulation of sequences of modified peptides and proteins

The goal of the Pyteomics project is to provide a versatile, reliable and well-documented set of open tools for the wide proteomics community. One of the project’s key features is Python itself, an open source language increasingly popular in scientific programming. The main applications of the library are reproducible statistical data analysis and rapid software prototyping.

Citation

Pyteomics is distributed under Apache License version 2.0.

When using or redistributing Pyteomics, or parts of it, please cite the following papers:

Goloborodko, A.A.; Levitsky, L.I.; Ivanov, M.V.; and Gorshkov, M.V. (2013) “Pyteomics - a Python Framework for Exploratory Data Analysis and Rapid Software Prototyping in Proteomics”, Journal of The American Society for Mass Spectrometry, 24(2), 301–304. DOI: 10.1007/s13361-012-0516-6

Levitsky, L.I.; Klein, J.; Ivanov, M.V.; and Gorshkov, M.V. (2018) “Pyteomics 4.0: five years of development of a Python proteomics framework”, Journal of Proteome Research. DOI: 10.1021/acs.jproteome.8b00717

Backup of old repo

Pyteomics source code used to be hosted on Bitbucket. An archive of issues and pull requests is stored at: https://levitsky.github.io/bitbucket_backup/#!/levitsky/pyteomics.

Pyteomics Extensions

Additional third-party packages extending the Pyteomics functionality can be installed separately:

Feedback & Support

Please email pyteomics@googlegroups.com with any questions about Pyteomics. You are welcome to use the GitHub issue tracker to report bugs, request features, etc.

Relation to other proteomics data analysis tools

Our goal is to create an infrastructure for proteomics data analysis within the Python ecosystem. Pyteomics is not a proteomic search engine, nor does it perform any data conversion. There are other tools for that. Pyteomics does not aim to substitute any of these, but rather to coexist with and complement them.

Introduction

This tutorial covers the basic Pyteomics functionality. For more details, please check the API reference. You can also access the API docstrings from the Python shell:

>>> from pyteomics.mass import calculate_mass
>>> help(calculate_mass)

IPython users can use the following shortcut:

>>> from pyteomics.mass import calculate_mass
>>> calculate_mass?

We expect the reader to be familiar with the basic Python syntax as well as proteomics concepts.

How to install Pyteomics

Supported Python versions

Pyteomics supports Python 2.7 and Python 3.3+.

Project dependencies

Pyteomics uses the following Python packages:

All dependencies are optional.

GNU/Linux

The preferred way to obtain Pyteomics is via the pip Python package manager. The shell code for a freshly installed Ubuntu system:

sudo apt-get install python-setuptools python-dev build-essential
sudo easy_install pip
sudo pip install lxml numpy matplotlib pyteomics

Arch-based distros

On Arch Linux and related distros, you can install Pyteomics from AUR:

Windows

  • Get pip, if you don’t have it yet.

  • Install Pyteomics and its dependencies:

    pip install lxml numpy matplotlib pyteomics
    

Peptide sequence formats. Parser module

modX

Pyteomics uses a custom IUPAC-derived peptide sequence notation named modX. As in the IUPAC notation, each amino acid residue is represented by a capital letter, but it may be preceded by an arbitrary number of small letters to show modification. Terminal modifications are separated from the backbone sequence by a hyphen (‘-’). By default, both termini are assumed to be unmodified, which can be shown explicitly by ‘H-’ for N-terminal hydrogen and ‘-OH’ for C-terminal hydroxyl.

“H-HoxMMdaN-OH” is an example of a valid sequence in modX. See parser - operations on modX peptide sequences for additional information. Note that it is recommended to include either 0 or 2 terminal groups in a modX sequence.
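
The notation can be illustrated with a small regular-expression sketch. This is not how pyteomics.parser is implemented; the function name split_modx and both patterns are hypothetical. The sketch assumes a sequence with either no terminal groups or both of them, following the recommendation above, because a single terminal group makes the split ambiguous.

```python
import re

# Hypothetical patterns: optional N-terminal group, residues, optional C-terminal group.
_MODX = re.compile(r'^([^-]+-)?((?:[a-z]*[A-Z])+)(-[^-]+)?$')
# A residue is a capital letter preceded by any number of small letters.
_RESIDUE = re.compile(r'[a-z]*[A-Z]')


def split_modx(sequence):
    """Split a modX string into terminal groups and (modified) residues."""
    match = _MODX.match(sequence)
    if match is None:
        raise ValueError('not a modX sequence: %r' % sequence)
    n_term, backbone, c_term = match.groups()
    tokens = _RESIDUE.findall(backbone)
    if n_term:
        tokens.insert(0, n_term)
    if c_term:
        tokens.append(c_term)
    return tokens


print(split_modx('H-HoxMMdaN-OH'))
# ['H-', 'H', 'oxM', 'M', 'daN', '-OH']
```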

Sequence operations

There are two helper functions to check if a label is in modX format or represents a terminal modification: pyteomics.parser.is_modX() and pyteomics.parser.is_term_mod():

>>> from pyteomics import parser
>>> parser.is_modX('A')
True
>>> parser.is_modX('pT')
True
>>> parser.is_modX('pTx')
False
>>> parser.is_term_mod('pT')
False
>>> parser.is_term_mod('Ac-')
True

A modX sequence can be translated to a list of amino acid residues with pyteomics.parser.parse() function:

>>> from pyteomics import parser
>>> parser.parse('PEPTIDE')
['P', 'E', 'P', 'T', 'I', 'D', 'E']
>>> parser.parse('PEPTIDE', show_unmodified_termini=True)
['H-', 'P', 'E', 'P', 'T', 'I', 'D', 'E', '-OH']
>>> parser.parse('Ac-PEpTIDE', labels=parser.std_labels+['Ac-', 'pT'])
['Ac-', 'P', 'E', 'pT', 'I', 'D', 'E']

In the last example we supplied two arguments, the sequence itself and ‘labels’. The latter is used to specify what labels are allowed for amino acid residues and terminal modifications. std_labels is a predefined set of labels for the twenty standard amino acids, ‘H-’ for N-terminal hydrogen and ‘-OH’ for C-terminal hydroxyl. In this example we specified the codes for phosphorylated threonine and N-terminal acetylation.

Since version 2.5, specifying labels is never mandatory. If this argument is not supplied, no checks will be made. However, the last example won’t work without labels, because it has only one terminal group shown, which is discouraged.

parse() has another mode, in which it returns tuples:

>>> parser.parse('Ac-PEpTIDE-OH', split=True)
[('Ac-', 'P'), ('E',), ('p', 'T'), ('I',), ('D',), ('E',)]

or:

>>> parser.parse('Ac-PEpTIDE-OH', split=True, labels=parser.std_labels+['Ac-', 'p'])
[('Ac-', 'P'), ('E',), ('p', 'T'), ('I',), ('D',), ('E',)]

Also, note what we supply as labels here: ‘p’ instead of ‘pT’. That means that ‘p’ is a modification applicable to any residue.

In modX, the standard len() function cannot be used to determine the length of a peptide because of the modifications. Use pyteomics.parser.length() instead:

>>> from pyteomics import parser
>>> parser.length('aVRILLaVIGNE')
10

The pyteomics.parser.amino_acid_composition() function accepts a sequence and returns a dictionary with amino acid labels as keys and integer numbers as values, corresponding to the number of times each residue occurs in the sequence:

>>> from pyteomics import parser
>>> parser.amino_acid_composition('PEPTIDE')
{'I': 1.0, 'P': 2.0, 'E': 2.0, 'T': 1.0, 'D': 1.0}

pyteomics.parser.cleave() is a function that performs in silico cleavage. Its arguments are the sequence, the rule for enzyme specificity and, optionally, the number of missed cleavages allowed. cleave() returns a set of product peptides.

>>> from pyteomics import parser
>>> parser.cleave('AKAKBK', parser.expasy_rules['trypsin'], 0)
{'AK', 'BK'}

pyteomics.parser.expasy_rules is a predefined dict with the cleavage rules for the most common proteases.
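
The idea behind rule-based cleavage can be sketched with the standard re module. The code below is only an illustration: cleave_sketch and TRYPSIN_SKETCH are hypothetical names, and the rule is a simplified stand-in for trypsin specificity, not the pattern stored in expasy_rules.

```python
import re

# Simplified, hypothetical trypsin rule: cleave after K or R unless followed by P.
TRYPSIN_SKETCH = r'[KR](?!P)'


def cleave_sketch(sequence, rule, missed_cleavages=0):
    """Return the set of peptides produced by cleaving after every rule match."""
    # Cleavage positions are the ends of the rule matches, plus both sequence ends.
    sites = [0] + [m.end() for m in re.finditer(rule, sequence)]
    if sites[-1] != len(sequence):
        sites.append(len(sequence))
    peptides = set()
    for i in range(len(sites) - 1):
        # A peptide may span up to missed_cleavages internal cleavage sites.
        last = min(i + missed_cleavages + 1, len(sites) - 1)
        for j in range(i + 1, last + 1):
            peptides.add(sequence[sites[i]:sites[j]])
    return peptides


print(sorted(cleave_sketch('AKAKBK', TRYPSIN_SKETCH, 0)))
# ['AK', 'BK']
```

With one missed cleavage allowed, the same call also yields the longer peptides 'AKAK' and 'AKBK'.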

All possible modified sequences of a peptide can be obtained with pyteomics.parser.isoforms():

>>> from pyteomics import parser
>>> forms = parser.isoforms('PEPTIDE', variable_mods={'p': ['T'], 'ox': ['P']})
>>> for seq in forms: print(seq)
...
oxPEPpTIDE
oxPEPTIDE
oxPEoxPpTIDE
oxPEoxPTIDE
PEPpTIDE
PEPTIDE
PEoxPpTIDE
PEoxPTIDE
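
Conceptually, the isoforms are all combinations of per-residue choices between the unmodified and the modified forms. A minimal sketch of that enumeration follows; isoforms_sketch is a hypothetical helper that only handles an unmodified one-letter sequence and is not the Pyteomics implementation.

```python
from itertools import product


def isoforms_sketch(residues, variable_mods):
    """Enumerate modified forms of an unmodified one-letter sequence."""
    options = []
    for residue in residues:
        # Each residue may stay unmodified or carry any applicable modification.
        forms = [residue]
        for mod, targets in sorted(variable_mods.items()):
            if residue in targets:
                forms.append(mod + residue)
        options.append(forms)
    # The Cartesian product of the per-residue choices gives every isoform.
    return {''.join(choice) for choice in product(*options)}


forms = isoforms_sketch('PEPTIDE', {'p': ['T'], 'ox': ['P']})
print(len(forms))
# 8
```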

Peptide properties: mass, charge, chromatographic retention

Mass and isotopes

The functions related to mass calculations and isotopic distributions are organized into the pyteomics.mass module.

Basic mass calculations

The most common task in mass spectrometry data analysis is to calculate the mass of an organic molecule or peptide, or the m/z ratio of an ion. Tasks of this kind can be performed with the pyteomics.mass.calculate_mass() function. It works with chemical formulas, polypeptide sequences in modX notation, pre-parsed sequences and dictionaries of chemical compositions:

>>> from pyteomics import mass
>>> mass.calculate_mass(formula='H2O')
18.0105646837036

>>> mass.calculate_mass(formula='C2H5OH')
46.0418648119876

>>> mass.calculate_mass(composition={'H':2, 'O':1})
18.0105646837036

>>> mass.calculate_mass(sequence='PEPTIDE')
799.359964027207

>>> from pyteomics import parser
>>> ps = parser.parse('PEPTIDE', show_unmodified_termini=True)
>>> mass.calculate_mass(parsed_sequence=ps)
799.359964027207

Warning

Always set show_unmodified_termini=True when parsing a sequence, if you want to use the result to calculate the mass. Otherwise, the mass of the terminal hydrogen and hydroxyl will not be taken into account.
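
The role of the terminal groups can be seen in a hand calculation. The sketch below sums residue masses and then adds one water molecule for the terminal H and OH. The residue masses are rounded textbook monoisotopic values entered by hand, and peptide_mass_sketch is a hypothetical helper, not part of pyteomics.mass.

```python
# Monoisotopic residue masses in Da, rounded to five decimals (assumed values).
RESIDUE_MASS = {'P': 97.05276, 'E': 129.04259, 'T': 101.04768,
                'I': 113.08406, 'D': 115.02694}
# Mass of the terminal groups: H on the N-terminus plus OH on the C-terminus.
WATER = 18.01056


def peptide_mass_sketch(sequence, with_termini=True):
    """Sum residue masses; optionally add the terminal H and OH."""
    total = sum(RESIDUE_MASS[aa] for aa in sequence)
    if with_termini:
        total += WATER
    return total


print(round(peptide_mass_sketch('PEPTIDE'), 3))
print(round(peptide_mass_sketch('PEPTIDE', with_termini=False), 3))
```

Omitting the termini lowers the result by about 18.01 Da, which is the error the warning above refers to.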

Mass-to-charge ratio of ions

pyteomics.mass.calculate_mass() can be used to calculate the mass/charge ratio of peptide ions and ionized fragments. To do that, simply supply the ion type and the charge:

>>> from pyteomics import mass
>>> mass.calculate_mass(sequence='PEPTIDE', ion_type='M', charge=2)
400.6872584803735

>>> mass.calculate_mass(sequence='PEP', ion_type='b', charge=1)
324.15539725264904

>>> mass.calculate_mass(sequence='TIDE', ion_type='y', charge=1)
477.219119708098
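
For an intact protonated peptide (ion type ‘M’), the relation between neutral mass and m/z is simple enough to write out. The sketch below covers only that case; fragment ions such as b and y also differ from the peptide in elemental composition. The proton mass constant and the name mz_sketch are assumptions of this example.

```python
PROTON = 1.00727646677  # proton mass in Da (assumed value)


def mz_sketch(neutral_mass, charge):
    """m/z of a molecule carrying `charge` extra protons."""
    return (neutral_mass + charge * PROTON) / charge


# Neutral monoisotopic mass of PEPTIDE taken from the example above.
print(round(mz_sketch(799.359964027207, 2), 4))
# 400.6873
```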

Mass of modified peptides

With pyteomics.mass.calculate_mass() you can calculate masses of modified peptides as well. For the function to recognize the modified residue, you need to add the information about its elemental composition to the pyteomics.mass.std_aa_comp dictionary used in the calculations by default.

>>> from pyteomics import mass
>>> mass.std_aa_comp['pT'] = mass.Composition(
...    {'C': 4, 'H': 8, 'N': 1, 'O': 5, 'P': 1})
>>> mass.calculate_mass(sequence='PEPpTIDE')
879.3262945499629

To add information about modified amino acids to a user-defined aa_comp dict one can either add the composition info for a specific modified residue or just for a modification:

>>> from pyteomics import mass
>>> aa_comp = dict(mass.std_aa_comp)
>>> aa_comp['p'] = mass.Composition('HPO3')
>>> mass.calculate_mass('pT', aa_comp=aa_comp)
199.02457367493957

In this example we call calculate_mass() with a positional (non-keyword) argument (‘pT’). This feature was added in version 1.2.4. When you provide a non-keyword argument, it will be treated as a sequence; if it fails, it will be treated as a formula; in case it fails as well, a PyteomicsError will be raised. Note that ‘pT’ is treated as a sequence here, so default terminal groups are implied when calculating the composition and mass:

>>> mass.calculate_mass('pT', aa_comp=aa_comp) == mass.calculate_mass(aa_comp['p']) + mass.calculate_mass(aa_comp['T']) + mass.calculate_mass('H2O')
True

You can create a specific entry for a modified amino acid to override the modification on a specific residue:

>>> aa_comp['pT'] = mass.Composition({'N': 2})
>>> mass.Composition('pT', aa_comp=aa_comp)
{'H': 2, 'O': 1, 'N': 2}
>>> mass.Composition('pS', aa_comp=aa_comp)
{'H': 8, 'C': 3, 'N': 1, 'O': 6, 'P': 1}

The Unimod database is an excellent resource for information on the chemical compositions of known protein modifications. Version 2.0.3 introduced the pyteomics.mass.Unimod class, which can serve as a Python interface to Unimod:

>>> db = mass.Unimod()
>>> aa_comp = dict(mass.std_aa_comp)
>>> aa_comp['p'] = db.by_title('Phospho')['composition']
>>> mass.calculate_mass('PEpTIDE', aa_comp=aa_comp)
782.2735307010443

Chemical compositions

Some problems in organic mass spectrometry deal with molecules made by addition or subtraction of standard chemical ‘building blocks’. In pyteomics.mass there are two ways to approach these problems.

  • There is a pyteomics.mass.Composition class intended to store chemical formulas. pyteomics.mass.Composition objects are dicts that can be added or subtracted from one another or multiplied by integers.

    >>> from pyteomics import mass
    >>> p = mass.Composition(formula='HO3P') # Phosphate group
    >>> p
    Composition({'H': 1, 'O': 3, 'P': 1})
    >>> mass.std_aa_comp['T']
    Composition({'C': 4, 'H': 7, 'N': 1, 'O': 2})
    >>> p + mass.std_aa_comp['T']
    Composition({'C': 4, 'H': 8, 'N': 1, 'O': 5, 'P': 1})
    

    The values of pyteomics.mass.std_aa_comp are pyteomics.mass.Composition objects.

  • All functions that accept a formula keyword argument sum and subtract numbers following the same atom in the formula:

    >>> from pyteomics import mass
    >>> mass.calculate_mass(formula='C2H6') # Ethane
    30.046950192426
    >>> mass.calculate_mass(formula='C2H6H-2') # Ethylene
    28.031300128284002
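
Both approaches can be mimicked with the standard library to show what happens to the atom counts. The sketch below is an illustration only: parse_formula and formula_mass are hypothetical helpers, the element masses are entered by hand, and isotope labels such as H[2] are not handled.

```python
import re
from collections import Counter

# Monoisotopic masses of a few elements in Da (assumed values).
ELEMENT_MASS = {'H': 1.00782503207, 'C': 12.0, 'N': 14.0030740048,
                'O': 15.99491461956, 'P': 30.97376163}


def parse_formula(formula):
    """Sum the signed counts of repeated atoms, e.g. 'C2H6H-2' -> C2H4."""
    counts = Counter()
    for element, number in re.findall(r'([A-Z][a-z]*)(-?\d*)', formula):
        counts[element] += int(number) if number else 1
    return counts


def formula_mass(formula):
    """Monoisotopic mass from the summed atom counts."""
    return sum(ELEMENT_MASS[el] * n for el, n in parse_formula(formula).items())


# Building blocks add up like the Composition objects above.
phospho_threonine = parse_formula('HO3P') + parse_formula('C4H7NO2')
print(dict(parse_formula('C2H6H-2')))
# {'C': 2, 'H': 4}
```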
    
Faster mass calculations

While pyteomics.mass.calculate_mass() has a flexible and convenient interface, it may be too slow for large-scale calculations. There is an optimized and simplified version of this function named pyteomics.mass.fast_mass(). It works only with unmodified sequences in standard one-letter IUPAC notation. Like pyteomics.mass.calculate_mass(), pyteomics.mass.fast_mass() can calculate m/z when provided with ion type and charge. Amino acid masses can be specified via the aa_mass argument.

>>> from pyteomics import mass
>>> mass.fast_mass('PEPTIDE')
799.3599446837036

If you need to calculate the mass or m/z for a peptide with modifications and/or non-standard terminal groups, but don’t want to specify all compositions, you can also use the pyteomics.mass.fast_mass2() function. It uses aa_mass the same way as fast_mass(), but has full modX support:

>>> mass.fast_mass2('H-PEPTIDE-OH')
799.3599446837036

Isotopes

If not specified, pyteomics.mass assumes that the substances are in the pure isotopic state. However, you may specify a particular isotopic state in brackets (e.g. O[18], N[15]) in a chemical formula. An element with an unspecified isotopic state is assumed to have the mass of the most abundant isotope and an abundance of 100%.

>>> mass.calculate_mass(formula='H[2]2O') # Heavy water
20.0231181752416
>>> mass.calculate_mass(formula='H[2]HO') # Semiheavy water
19.0168414294726

The pyteomics.mass.isotopic_composition_abundance() function calculates the relative abundance of a given isotopic state of a molecule. The input can be provided as a formula or as a Composition/dict.

>>> from pyteomics import mass
>>> mass.isotopic_composition_abundance(formula='H2O') # Water with an unspecified isotopic state
1.0
>>> mass.isotopic_composition_abundance(formula='H[2]2O') # Heavy water
1.3386489999999999e-08
>>> mass.isotopic_composition_abundance(formula='H[2]H[1]O') # Semiheavy water
0.0002313727050147582
>>> mass.isotopic_composition_abundance(composition={'H[2]': 1, 'H[1]': 1, 'O': 1}) # Semiheavy water
0.0002313727050147582
>>> mass.isotopic_composition_abundance(formula='H[2]2O[18]') # Heavy-hydrogen heavy-oxygen water
2.7461045585999998e-11

Warning

You cannot mix specified and unspecified states of the same element in one formula in pyteomics.mass.isotopic_composition_abundance() due to ambiguity.

>>> mass.isotopic_composition_abundance(formula='H[2]HO')
...
PyteomicsError: Pyteomics error, message: 'Please specify the isotopic states of all atoms of H or do not specify them at all.'
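
The numbers above follow a multinomial distribution over the isotopes of each element. A minimal sketch for a single element is given below, assuming approximate natural abundances for hydrogen. The function name and the abundance table are hypothetical, and for a whole molecule the per-element results would be multiplied together.

```python
from math import factorial

# Approximate natural abundances of the hydrogen isotopes (assumed values).
H_ABUNDANCE = {1: 0.9998843, 2: 0.0001157}


def isotopic_state_abundance(isotope_counts, abundances):
    """Multinomial probability of one isotopic state of a single element."""
    total = sum(isotope_counts.values())
    probability = float(factorial(total))
    for isotope, count in isotope_counts.items():
        probability = probability / factorial(count) * abundances[isotope] ** count
    return probability


heavy = isotopic_state_abundance({2: 2}, H_ABUNDANCE)            # both atoms are H[2]
semiheavy = isotopic_state_abundance({1: 1, 2: 1}, H_ABUNDANCE)  # one H[1], one H[2]
print(heavy, semiheavy)
```

The factor of two in the semiheavy case comes from the two ways of placing the single deuterium, which is why that state is far more abundant than fully heavy water.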

Finally, you can find the most probable isotopic composition for a substance with the pyteomics.mass.most_probable_isotopic_composition() function. The substance is specified as a formula, a pyteomics.mass.Composition object or a modX sequence string.

>>> from pyteomics import mass
>>> mass.most_probable_isotopic_composition(formula='H2SO4')
Composition({'H[1]': 2.0,  'H[2]': 0.0,  'O[16]': 4.0,  'O[17]': 0.0,  'S[32]': 1.0,  'S[33]': 0.0})
>>> mass.most_probable_isotopic_composition(formula='C300H602')
Composition({'C[12]': 297.0, 'C[13]': 3.0, 'H[1]': 602.0, 'H[2]': 0.0})
>>> mass.most_probable_isotopic_composition(sequence='PEPTIDE'*100)
Composition({'C[12]': 3364.0,  'C[13]': 36.0,  'H[1]': 5102.0,  'H[2]': 0.0, 'N[14]': 698.0,  'N[15]': 2.0,  'O[16]':  398.0,  'O[17]': 3.0})

The information about chemical elements, their isotopes and relative abundances is stored in the pyteomics.mass.nist_mass dictionary.

>>> from pyteomics import mass
>>> print(mass.nist_mass['C'])
{0: (12.0, 1.0), 12: (12.0, 0.98938), 13: (13.0033548378, 0.01078), 14: (14.0032419894, 0.0)}

The zero key stands for the unspecified isotopic state. The data about isotopes are stored as tuples (accurate mass, relative abundance).

Charge and pI

Electrochemical properties of polypeptides can be assessed via the pyteomics.electrochem module. For now, it can calculate:

  • the charge of a polypeptide molecule at given pH;
  • the isoelectric point.

The pyteomics.electrochem module is based on the Henderson-Hasselbalch equation.
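
As an illustration of the Henderson-Hasselbalch approach, the sketch below sums the fractional charges of the ionizable groups of PEPTIDE and finds the isoelectric point by bisection. The pK values are typed in by hand as assumptions and are not read from pyteomics.electrochem; charge_sketch and pI_sketch are hypothetical names.

```python
# (pK, charge) pairs for the ionizable groups of PEPTIDE (assumed values).
PK_SKETCH = {'N-term': (9.69, +1), 'C-term': (2.34, -1),
             'E': (4.25, -1), 'D': (3.65, -1)}


def charge_sketch(group_counts, pH):
    """Net charge as the sum of Henderson-Hasselbalch fractional charges."""
    total = 0.0
    for group, count in group_counts.items():
        pK, sign = PK_SKETCH[group]
        # Fraction of the group that is in its charged form at this pH.
        fraction = 1.0 / (1.0 + 10 ** (sign * (pH - pK)))
        total += sign * count * fraction
    return total


def pI_sketch(group_counts, precision=1e-4):
    """Bisection for the pH at which the net charge is zero."""
    low, high = 0.0, 14.0
    while high - low > precision:
        mid = (low + high) / 2
        # The net charge decreases monotonically with pH.
        if charge_sketch(group_counts, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


groups = {'N-term': 1, 'C-term': 1, 'E': 2, 'D': 1}
print(round(charge_sketch(groups, 7), 3))
# -2.998
```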

Examples

Both functions in the module accept input in the form of a modX sequence, a parsed sequence or a dict with amino acid composition.

>>> from pyteomics import electrochem
>>> electrochem.charge('PEPTIDE', 7)
-2.9980189709606284
>>> from pyteomics import parser
>>> parsed_seq = parser.parse('PEPTIDE', show_unmodified_termini=True)
>>> electrochem.charge(parsed_seq, 7)
-2.9980189709606284
>>> aa_composition = parser.amino_acid_composition('PEPTIDE', show_unmodified_termini=True)
>>> electrochem.charge(aa_composition, 7)
-2.9980189709606284
>>> electrochem.pI('PEPTIDE')
2.87451171875
>>> electrochem.pI('PEPTIDE', precision_pI=0.0001)
2.876354217529297

[Figure: charge vs. pH (_images/charge_vs_ph.png)]

Customization

The pKas of individual amino acids are stored in dicts in the following format: {modX label : (pKa, charge)}. The module contains several datasets published in scientific journals: pyteomics.electrochem.pK_lehninger (used by default), pyteomics.electrochem.pK_sillero, pyteomics.electrochem.pK_dawson, pyteomics.electrochem.pK_rodwell.

Retention time prediction

Pyteomics has two modules for prediction of retention times (RTs) of peptides and proteins in liquid chromatography.

BioLCCC

The first module is pyteomics.biolccc. This module implements the BioLCCC model of liquid chromatography of polypeptides. pyteomics.biolccc is not distributed with the main package and has to be installed separately. pyteomics.biolccc can be downloaded from http://pypi.python.org/pypi/pyteomics.biolccc, and the project documentation is hosted at http://theorchromo.ru/docs.

Additive model of peptide chromatography

Another option for retention time prediction is the pyteomics.achrom module distributed with Pyteomics. It implements the additive model of polypeptide chromatography. Briefly, in the additive model each amino acid residue changes the retention time by a fixed value that depends only on its type (e.g. an alanine residue adds 2.0 min to RT, while an arginine decreases it by 1.1 min). The module documentation contains the complete description of this model and the references. In this tutorial we will focus on the basic usage.

Retention time prediction

Retention time prediction with pyteomics.achrom is done by the pyteomics.achrom.calculate_RT() function:

>>> from pyteomics import achrom
>>> achrom.calculate_RT('PEPTIDE', achrom.RCs_guo_ph7_0)
7.8000000000000025

The first argument of the function is the sequence of a peptide in modX notation.

The second argument is the set of parameters called ‘retention coefficients’, which describe the chromatographic properties of individual amino acid residues in a polypeptide chain. pyteomics.achrom has a number of predefined sets of retention coefficients obtained from publications. The list, detailed descriptions and references related to these sets can be found in the module documentation.
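
In its simplest form the additive model is just a sum over residues. The sketch below uses the two illustrative coefficients mentioned earlier for alanine and arginine. The dictionary layout and the function name are assumptions of this example and do not reproduce the structure of the predefined coefficient sets.

```python
# Illustrative retention coefficients in minutes; not a published set.
RCS_SKETCH = {'A': 2.0, 'R': -1.1, 'G': 0.0}


def additive_rt(residues, retention_coefficients):
    """Sum the per-residue contributions to the retention time."""
    return sum(retention_coefficients[aa] for aa in residues)


print(round(additive_rt('AAR', RCS_SKETCH), 1))
# 2.9
```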

Calibration

The main advantage of the additive model is that it gives more accurate predictions if adjusted to specific chromatographic setups and conditions. This adjustment, or ‘calibration’, requires a set of known peptide sequences and corresponding retention times (a ‘training set’) and returns a set of new retention coefficients. The following code illustrates the calibration procedure in Pyteomics.

>>> from pyteomics import achrom
>>> RCs = achrom.get_RCs(sequences, RTs)
>>> achrom.calculate_RT('PEPTIDE', RCs)

The first argument of pyteomics.achrom.get_RCs() should be a list of modX sequences, and the second a list of floating-point retention times.

As in pyteomics.parser.parse(), all non-standard modX amino acid labels used in the training set should be supplied to the labels keyword argument of pyteomics.achrom.get_RCs() along with the standard ones:

>>> RCs = achrom.get_RCs(sequences, RTs, labels=achrom.std_labels + ['pS', 'pT'])

Advanced calibration

The standard additive model allows a couple of improvements. Firstly, an explicit dependency on the length of a peptide may be introduced by multiplying the retention time by (1.0 + m * log(L)), where L is the number of amino acid residues in the peptide and m is the length correction parameter, typically ~ -0.2.

The value of the length correction parameter is set at the calibration and stored along with the retention coefficients. By default, length correction is enabled in pyteomics.achrom.get_RCs() and the parameter equals -0.21. You can change the value of the length correction parameter by supplying the ‘lcp’ keyword argument, or you can disable length correction completely by setting lcp=0:

>>> RCs = achrom.get_RCs(sequences, RTs, lcp=-0.18) # A new value of the length correction parameter

>>> RCs = achrom.get_RCs(sequences, RTs, lcp=0) # Disable length correction.
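
The effect of the length correction factor can be checked numerically. The sketch below applies (1.0 + m * log(L)) to an uncorrected retention time. Using the natural logarithm is an assumption, since the text above only says log(L), and length_corrected_rt is a hypothetical helper.

```python
from math import log


def length_corrected_rt(uncorrected_rt, n_residues, lcp=-0.21):
    """Scale a retention time by the length correction factor."""
    # Natural logarithm assumed; lcp=0 switches the correction off.
    return uncorrected_rt * (1.0 + lcp * log(n_residues))


print(round(length_corrected_rt(10.0, 10), 3))
print(length_corrected_rt(10.0, 10, lcp=0))
```

With the default parameter, a ten-residue peptide keeps roughly half of its uncorrected retention time, so the correction is far from negligible for longer peptides.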

Another considerable improvement over the standard additive model is to treat terminal amino acid residues as separate chemical entities. This behavior is disabled by default, but can be enabled by setting term_aa=True:

>>> RCs = achrom.get_RCs(sequences, RTs, term_aa=True)

This correction is implemented by addition of the ‘nterm’ and ‘cterm’ prefixes to the labels of terminal amino acid residues of the training peptides. In order for this correction to work, the training peptides should represent all possible variations of terminal amino acid residues.

Data Access

The following section is dedicated to data manipulation. Pyteomics aims to support the most common formats of (LC-)MS/MS data, peptide identification results and protein databases.

General Notes

  • Each module mentioned below corresponds to a file format. In each module, the top-level function read() allows iteration over entries in a file. It works like the built-in open(), allowing direct iteration and supporting the with syntax, which we recommend using. So you can do:

    >>> from pyteomics import mgf
    >>> reader = mgf.read('tests/test.mgf')
    >>> for spectrum in reader:
    ...     ...
    >>> reader.close()
    

    … but it is recommended to do:

    >>> from pyteomics import mgf
    >>> with mgf.read('tests/test.mgf') as reader:
    ...     for spectrum in reader:
    ...         ...
    
  • Additionally, most modules provide one or several classes which implement different parsing modes, e.g. pyteomics.mgf.MGF and pyteomics.mgf.IndexedMGF. Indexed parsers build an index of file entries and thus allow random access in addition to iteration. See Indexed Parsers for a detailed description and examples.

  • Apart from read(), which reads just one file, all modules described here have functions for reading multiple files: chain() and chain.from_iterable(). chain('f1', 'f2') is equivalent to chain.from_iterable(['f1', 'f2']). chain() and chain.from_iterable() only support the with syntax. If you don’t want to use the with syntax, you can just use the itertools functions chain() and chain.from_iterable().

  • Throughout this section we use pyteomics.auxiliary.print_tree() to display the structure of the data returned by various parsers. Replace this call with the actual processing that you need to perform on your files.

Text-based formats

MGF

Mascot Generic Format (MGF) is a simple human-readable format for MS/MS data. It allows storing MS/MS peak lists and experimental parameters. pyteomics.mgf is a module that implements reading and writing MGF files.

Reading

The pyteomics.mgf.read() function allows iterating through spectrum entries. Spectra are represented as dicts. By default, MS/MS peak lists are stored as numpy.ndarray objects under the m/z array and intensity array keys. Fragment charges are stored in a masked array under the charge array key. Parameters are stored as a dict under the params key.

Here is an example of use:

>>> from pyteomics import mgf, auxiliary
>>> with mgf.read('tests/test.mgf') as reader:
...     auxiliary.print_tree(next(reader))
m/z array
params
 -> username
 -> useremail
 -> mods
 -> pepmass
 -> title
 -> itol
 -> charge
 -> mass
 -> itolu
 -> it_mods
 -> com
intensity array
charge array

To speed up parsing, or if you want to avoid using numpy, you can tweak the behaviour of pyteomics.mgf.read() with parameters convert_arrays and read_charges.
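
To make the structure of the returned dicts concrete, here is a stripped-down reader for MGF-like text that uses only the standard library. It is a sketch, not the pyteomics.mgf parser: read_mgf_sketch is a hypothetical name, the peak lists are plain Python lists, parameter values stay as strings, and headers and fragment charges are ignored.

```python
import io


def read_mgf_sketch(lines):
    """Yield one dict per BEGIN IONS/END IONS block of an MGF-like text."""
    spectrum = None
    for line in lines:
        line = line.strip()
        if line == 'BEGIN IONS':
            spectrum = {'params': {}, 'm/z array': [], 'intensity array': []}
        elif line == 'END IONS':
            yield spectrum
            spectrum = None
        elif spectrum is not None and line:
            if '=' in line:
                # KEY=value lines are spectrum parameters.
                key, value = line.split('=', 1)
                spectrum['params'][key.lower()] = value
            else:
                # Remaining lines are "m/z intensity" pairs.
                mz, intensity = line.split()[:2]
                spectrum['m/z array'].append(float(mz))
                spectrum['intensity array'].append(float(intensity))


SAMPLE = """BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.6 73.0
846.8 44.0
END IONS
"""

spectra = list(read_mgf_sketch(io.StringIO(SAMPLE)))
print(spectra[0]['params']['title'], spectra[0]['m/z array'])
# Spectrum 1 [846.6, 846.8]
```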

Reading file headers

pyteomics.mgf can also extract headers with general parameters from MGF files with the pyteomics.mgf.read_header() function. It also returns a dict.

>>> header = mgf.read_header('tests/test.mgf')
>>> auxiliary.print_tree(header)
itolu
itol
username
com
useremail
it_mods
charge
mods
mass

Class-based interface

Since version 3.4.3, MGF parsing functionality is encapsulated in a class: pyteomics.mgf.MGF. This class can be used for:

  • sequential parsing of the file (the same as read()):

>>> with mgf.MGF('tests/test.mgf') as reader:
...     for spectrum in reader:
...         ...

  • accessing the file header (the same as read_header()):

>>> f = mgf.MGF('tests/test.mgf')
>>> f.header
{'charge': [2, 3],
 'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
 'it_mods': 'Oxidation (M)',
 'itol': '1',
 'itolu': 'Da',
 'mass': 'Monoisotopic',
 'mods': 'Carbamidomethyl (C)',
 'useremail': 'leu@altered-state.edu',
 'username': 'Lou Scene'}

  • direct access to spectra by title (the same as get_spectrum()):

>>> f = mgf.MGF('tests/test.mgf')
>>> f['Spectrum 2']
{'charge array': masked_array(data = [3 2 1 1 1 1],
              mask = False,
        fill_value = 0),
 'intensity array': array([  237.,   128.,   108.,  1007.,   974.,    79.]),
 'm/z array': array([  345.1,   370.2,   460.2,  1673.3,  1674. ,  1675.3]),
 'params': {'charge': [2, 3],
  'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
  'it_mods': 'Oxidation (M)',
  'itol': '1',
  'itolu': 'Da',
  'mass': 'Monoisotopic',
  'mods': 'Carbamidomethyl (C)',
  'pepmass': (1084.9, 1234.0),
  'rtinseconds': '25',
  'scans': '3',
  'title': 'Spectrum 2',
  'useremail': 'leu@altered-state.edu',
  'username': 'Lou Scene'}}

Note

MGF’s support for direct indexing is rudimentary, because it does not in fact keep an index and has to search through the file line-wise on every call. pyteomics.mgf.IndexedMGF is designed for random access and more (see Indexed Parsers for details).

Writing

Creation of MGF files is implemented in the pyteomics.mgf.write() function. The user can specify the header, an iterable of spectra in the same format as returned by read(), and the output path.

>>> spectra = mgf.read('tests/test.mgf')
>>> mgf.write(spectra=spectra, header=header)
USERNAME=Lou Scene
ITOL=1
USEREMAIL=leu@altered-state.edu
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
CHARGE=2+ and 3+
MASS=Monoisotopic
ITOLU=Da
COM=Taken from http://www.matrixscience.com/help/data_file_help.html

BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.6 73.0
846.8 44.0
847.6 67.0
1640.1 291.0
1640.6 54.0
1895.5 49.0
END IONS

BEGIN IONS
TITLE=Spectrum 2
RTINSECONDS=25
PEPMASS=1084.9
SCANS=3
345.1 237.0
370.2 128.0
460.2 108.0
1673.3 1007.0
1674.0 974.0
1675.3 79.0
END IONS

MS1 and MS2

MS1 and MS2 are simple human-readable formats for MS1 and MSn data. They allow storing peak lists and experimental parameters. Because the MS1 and MS2 formats are quite similar to MGF, the corresponding modules (pyteomics.ms1 and pyteomics.ms2) provide the same functions and classes with very similar signatures for reading headers and spectra from files.

Writing is not supported at this time.

FASTA

FASTA is a common format for protein sequence databases.

Reading

To extract data from FASTA databases, use the pyteomics.fasta.read() function.

>>> from pyteomics import fasta
>>> with fasta.read('/path/to/file/my.fasta') as db:
...     for entry in db:
...         ...

Just like other parsers in Pyteomics, pyteomics.fasta.read() returns a generator object instead of a list to prevent excessive memory use. The generator yields (description, sequence) tuples, so it’s natural to use it as follows:

>>> with fasta.read('/path/to/file/my.fasta') as db:
...     for descr, seq in db:
...         ...
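
The generator behaviour described above can be sketched with a few lines of standard-library code. This is an illustration of lazily yielding (description, sequence) tuples, not the pyteomics.fasta implementation; read_fasta_sketch is a hypothetical name.

```python
import io


def read_fasta_sketch(lines):
    """Lazily yield (description, sequence) tuples from FASTA-formatted lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # A new header closes the previous entry, if there was one.
            if description is not None:
                yield description, ''.join(chunks)
            description, chunks = line[1:], []
        elif line and description is not None:
            chunks.append(line)
    if description is not None:
        yield description, ''.join(chunks)


SAMPLE = """>Protein 1
PEPTIDE
PEPTIDE
>Protein 2
EDITPEP
"""

entries = list(read_fasta_sketch(io.StringIO(SAMPLE)))
print(entries)
# [('Protein 1', 'PEPTIDEPEPTIDE'), ('Protein 2', 'EDITPEP')]
```

Because entries are produced one at a time, only the current sequence needs to be held in memory, which is the reason for returning a generator rather than a list.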

You can also use attributes to access description and sequence:

>>> with fasta.read('my.fasta') as reader:
...     descriptions = [item.description for item in reader]

Description parsing

You can specify a function that will be applied to the FASTA headers for your convenience. pyteomics.fasta.std_parsers has some pre-defined parsers that can be used for this purpose.

>>> with fasta.read('HUMAN.fasta', parser=fasta.std_parsers['uniprot']) as r:
...     print(next(r).description)
{'PE': 2, 'gene_id': 'LCE6A', 'GN': 'LCE6A', 'id': 'A0A183', 'taxon': 'HUMAN',
 'SV': 1, 'OS': 'Homo sapiens', 'entry': 'LCE6A_HUMAN',
 'name': 'Late cornified envelope protein 6A', 'db': 'sp'}

or try guessing the header format:

>>> with fasta.read('HUMAN.fasta', parser=fasta.parse) as r:
...     print(next(r).description)
{'PE': 2, 'gene_id': 'LCE6A', 'GN': 'LCE6A', 'id': 'A0A183', 'taxon': 'HUMAN',
 'SV': 1, 'OS': 'Homo sapiens', 'entry': 'LCE6A_HUMAN',
 'name': 'Late cornified envelope protein 6A', 'db': 'sp'}

Class-based interface

The pyteomics.fasta.FASTA class is available for text-based (old style) parsing (the same as shown with read() above). Also, the new binary-mode, indexed parser, pyteomics.fasta.IndexedFASTA, implements all the perks of the Indexed Parsers. Both classes also have a number of flavor-specific subclasses that implement header parsing.

Additionally, flavored indexed parsers allow accessing the protein entries by the extracted ID field, while the regular pyteomics.fasta.IndexedFASTA uses full description string for identification:

In [1]: from pyteomics import fasta

In [2]: db = fasta.IndexedUniProt('sprot_human.fasta') # A SwissProt database

In [3]: len(db['Q8IYH5'].sequence)
Out[3]: 903

In [4]: db['Q8IYH5'] == db['sp|Q8IYH5|ZZZ3_HUMAN ZZ-type zinc finger-containing protein 3 OS=Homo sapiens GN=ZZZ3 PE=1 SV=1']
Out[4]: True

Writing

You can also create a FASTA file using a sequence of (description, sequence) tuples.

>>> entries = [('Protein 1', 'PEPTIDE'*1000), ('Protein 2', 'PEPTIDE'*2000)]
>>> fasta.write(entries, 'target-file.fasta')

Decoy databases

Another common task is to generate a decoy database. Pyteomics allows that by means of the pyteomics.fasta.decoy_db() and pyteomics.fasta.write_decoy_db() functions.

>>> fasta.write_decoy_db('mydb.fasta', 'mydb-with-decoy.fasta')

The only required argument is the first one, indicating the source database. The second argument is the target file and defaults to system standard output.

If you need to modify a single sequence, use the pyteomics.fasta.decoy_sequence() function. It supports three modes: 'reverse', 'shuffle', and 'fused' (see pyteomics.fasta.reverse(), pyteomics.fasta.shuffle() and pyteomics.fasta.fused_decoy() for documentation).

>>> fasta.decoy_sequence('PEPTIDE', 'reverse')
'EDITPEP'
>>> fasta.decoy_sequence('PEPTIDE', 'shuffle')
'TPPIDEE'
>>> fasta.decoy_sequence('PEPTIDE', 'shuffle')
'PTIDEPE'
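
The 'reverse' and 'shuffle' modes amount to simple string operations, sketched below with the standard library. The function names and the seed argument are hypothetical and only illustrate the idea; the 'fused' mode is not covered.

```python
import random


def reverse_sketch(sequence):
    """Decoy made by reversing the residue order."""
    return sequence[::-1]


def shuffle_sketch(sequence, seed=None):
    """Decoy made by randomly permuting the residues."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return ''.join(residues)


print(reverse_sketch('PEPTIDE'))
# EDITPEP
```

A shuffled decoy keeps the amino acid composition of the original sequence, while repeated calls without a fixed seed generally give different permutations, as in the two 'shuffle' calls above.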
mzTab

mzTab is a HUPO-PSI standardized text-based format for describing identification and quantification of peptides and small molecules. You can read an mzTab file into a set of pandas.DataFrame objects with the pyteomics.mztab.MzTab class.

>>> from pyteomics import mztab
>>> tables = mztab.MzTab("path/to/file.mzTab")
>>> psms = tables.spectrum_match_table
>>> # do something with DataFrame

XML formats

XML parsers are implemented as classes and provide an object-oriented interface. The functional interface is preserved for backward compatibility and wraps the actual class-based machinery. That means that reader objects returned by read() functions have additional methods.

One of the most important methods is iterfind(). It allows reading additional information from XML files.

mzML and mzXML

mzML and mzXML are XML-based formats for experimental data obtained on MS/MS or LC-MS setups. Pyteomics offers you the functionality of pyteomics.mzml and pyteomics.mzxml modules to gain access to the information contained in those files from Python. The interfaces of the two modules are very similar, so this section uses mzML for demonstration.

The user can iterate through MS/MS spectra contained in a file via the pyteomics.mzml.read() function or pyteomics.mzml.MzML class. Here is an example of the output:

>>> from pyteomics import mzml, auxiliary
>>> with mzml.read('tests/test.mzML') as reader:
...     auxiliary.print_tree(next(reader))
count
index
highest observed m/z
ms level
total ion current
intensity array
lowest observed m/z
defaultArrayLength
profile spectrum
MSn spectrum
positive scan
base peak intensity
m/z array
base peak m/z
id
scanList
 -> count
 -> scan [list]
 ->  -> scan start time
 ->  -> preset scan configuration
 ->  -> filter string
 ->  -> instrumentConfigurationRef
 ->  -> scanWindowList
 ->  ->  -> count
 ->  ->  -> scanWindow [list]
 ->  ->  ->  -> scan window lower limit
 ->  ->  ->  -> scan window upper limit
 ->  -> [Thermo Trailer Extra]Monoisotopic M/Z:
 -> no combination

Additionally, pyteomics.mzml.MzML objects support direct indexing with spectrum IDs and all other features of Indexed Parsers.

pyteomics.mzml.PreIndexedMzML offers the same functionality, but it uses byte offset information found at the end of the file. Unlike the rest of the functions and classes, pyteomics.mzml.PreIndexedMzML does not have a counterpart in pyteomics.mzxml.

pepXML

pepXML is a widely used XML-based format for peptide identifications. It contains information about the MS data, the parameters of the search engine used and the assigned sequences. To access these data, use the pyteomics.pepxml module.

The function pyteomics.pepxml.read() iterates through Peptide-Spectrum matches in a pepXML file and returns them as a custom dict. Alternatively, you can use the pyteomics.pepxml.PepXML interface.

>>> from pyteomics import pepxml, auxiliary
>>> with pepxml.read('tests/test.pep.xml') as reader:
...     auxiliary.print_tree(next(reader))
end_scan
search_hit [list]
 -> hit_rank
 -> calc_neutral_pep_mass
 -> modifications
 -> modified_peptide
 -> peptide
 -> num_matched_ions
 -> search_score
 ->  -> deltacn
 ->  -> spscore
 ->  -> sprank
 ->  -> deltacnstar
 ->  -> xcorr
 -> num_missed_cleavages
 -> analysis_result [list]
 ->  -> peptideprophet_result
 ->  ->  -> all_ntt_prob
 ->  ->  -> parameter
 ->  ->  ->  -> massd
 ->  ->  ->  -> fval
 ->  ->  ->  -> nmc
 ->  ->  ->  -> ntt
 ->  ->  -> probability
 ->  -> analysis
 -> tot_num_ions
 -> num_tot_proteins
 -> is_rejected
 -> proteins [list]
 ->  -> num_tol_term
 ->  -> protein
 ->  -> peptide_next_aa
 ->  -> protein_descr
 ->  -> peptide_prev_aa
 -> massdiff
index
assumed_charge
spectrum
precursor_neutral_mass
start_scan
Reading into a pandas.DataFrame

If you like working with tabular data using pandas, you can load pepXML files directly into pandas.DataFrames using the pyteomics.pepxml.DataFrame() function. It can read multiple files at once (using pyteomics.pepxml.chain()) and return a combined table with essential information about search results. This function requires pandas.

X!Tandem

The X!Tandem search engine has its own output format that contains more info than pepXML. Pyteomics has a reader for it in the pyteomics.tandem module.

>>> from pyteomics import tandem, auxiliary
>>> with tandem.read('tests/test.t.xml') as reader:
...     auxiliary.print_tree(next(reader))
...
rt
support
 -> fragment ion mass spectrum
 ->  -> M+H
 ->  -> note
 ->  -> charge
 ->  -> Ydata
 ->  ->  -> units
 ->  ->  -> values
 ->  -> Xdata
 ->  ->  -> units
 ->  ->  -> values
 ->  -> label
 ->  -> id
 -> supporting data
 ->  -> convolution survival function
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
 ->  -> b ion histogram
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
 ->  -> y ion histogram
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
 ->  -> hyperscore expectation function
 ->  ->  -> a1
 ->  ->  -> a0
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
mh
maxI
expect
sumI
act
fI
z
id
protein [list]
 -> peptide
 ->  -> pre
 ->  -> end
 ->  -> seq
 ->  -> b_ions
 ->  -> nextscore
 ->  -> mh
 ->  -> y_ions
 ->  -> start
 ->  -> hyperscore
 ->  -> expect
 ->  -> delta
 ->  -> id
 ->  -> post
 ->  -> missed_cleavages
 ->  -> b_score
 ->  -> y_score
 -> uid
 -> sumI
 -> label
 -> note
 -> expect
 -> file
 ->  -> URL
 ->  -> type
 -> id

pyteomics.tandem.read() returns a pyteomics.tandem.TandemXML instance, which can also be created directly.

Reading into a pandas.DataFrame

You can also load data from X!Tandem files directly into pandas.DataFrames using the pyteomics.tandem.DataFrame() function. It can read multiple files at once (using pyteomics.tandem.chain()) and return a combined table with essential information about search results. Of course, this function requires pandas.

mzIdentML

mzIdentML is one of the standards developed by the Proteomics Informatics working group of the HUPO Proteomics Standard Initiative.

The module interface is similar to that of the other reader modules. The pyteomics.mzid.read() function returns a pyteomics.mzid.MzIdentML instance, which you can just as easily use directly.

>>> from pyteomics import mzid, auxiliary
>>> with mzid.read('tests/test.mzid') as reader:
...     auxiliary.print_tree(next(reader))
SpectrumIdentificationItem [list]
 -> PeptideEvidenceRef [list]
 ->  -> peptideEvidence_ref
 -> ProteinScape:SequestMetaScore
 -> chargeState
 -> rank
 -> ProteinScape:IntensityCoverage
 -> calculatedMassToCharge
 -> peptide_ref
 -> passThreshold
 -> experimentalMassToCharge
 -> id
spectrumID
id
spectraData_ref
Element IDs and references

In mzIdentML, some elements contain references to other elements in the same file. The references are simply XML attributes whose name ends with _ref and the value is an ID, identical to the value of the id attribute of a certain element.

The parser can retrieve information from these references on the fly. To enable this, pass retrieve_refs=True to the pyteomics.mzid.MzIdentML.iterfind() method, to the pyteomics.mzid.MzIdentML constructor, or to pyteomics.mzid.read(). Retrieval of data by ID is implemented in the pyteomics.mzid.MzIdentML.get_by_id() method. Alternatively, the MzIdentML object itself can be indexed with element IDs:

>>> from pyteomics import mzid
>>> m = mzid.MzIdentML('tests/test.mzid')
>>> m['ipi.HUMAN_decoy']
{'DatabaseName': 'database IPI_human',
 'decoy DB accession regexp': '^SHD',
 'decoy DB generation algorithm': 'PeakQuant.DecoyDatabaseBuilder',
 'id': 'ipi.HUMAN_decoy',
 'location': 'file://www.medizinisches-proteom-center.de/DBServer/ipi.HUMAN/3.15/ipi.HUMAN_decoy.fasta',
 'name': ['decoy DB from IPI_human',
  'DB composition target+decoy',
  'decoy DB type shuffle'],
 'numDatabaseSequences': 58099,
 'releaseDate': '2006-02-22T09:30:47Z',
 'version': '3.15'}
>>> m.close()
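
The idea of following a _ref attribute to the element with the matching id can be sketched with plain dictionaries. This is a conceptual illustration only: the function resolve_refs, the sample elements and the way fields are merged are assumptions, not the behavior of the Pyteomics parser.

```python
def resolve_refs(element, by_id):
    """Return a copy of element in which every '*_ref' key is replaced
    by the fields of the element it points to (one level deep)."""
    resolved = {}
    for key, value in element.items():
        if key.endswith('_ref') and isinstance(value, str) and value in by_id:
            # merge the referenced element's fields, except its own id
            resolved.update(
                (k, v) for k, v in by_id[value].items() if k != 'id')
        else:
            resolved[key] = value
    return resolved


# hypothetical, heavily simplified elements
peptides = {'prot1_pep1': {'id': 'prot1_pep1', 'PeptideSequence': 'PEPTIDE'}}
item = {'id': 'SEQ_spec1_pep1', 'rank': 1, 'peptide_ref': 'prot1_pep1'}
full_item = resolve_refs(item, peptides)
```

The lookup table by_id plays the role of the ID index: without it, every reference would require another pass over the file.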

Note

Since version 3.3, pyteomics.mzid.MzIdentML objects keep an index of byte offsets for some of the elements (see Indexed Parsers). Indexing helps achieve acceptable performance when using retrieve_refs=True, or when accessing individual elements by their ID.

This behavior can be disabled by passing use_index=False to the object constructor. An alternative, older mechanism is caching of element IDs. To build a cache for a file, you can pass build_id_cache=True and use_index=False to the MzIdentML constructor, or to pyteomics.mzid.read(), or call the pyteomics.mzid.MzIdentML.build_id_cache() method prior to reading the data.

Reading into a pandas.DataFrame

pyteomics.mzid also provides a pyteomics.mzid.DataFrame() function that reads one or several files into a single Pandas DataFrame. This function requires pandas.

idXML

idXML is an OpenMS format for peptide identifications. It is supported in pyteomics.openms.idxml. It partially supports indexing (protein information can be indexed and extracted with retrieve_refs).

The regular iterative parsing is done through read() or IDXML, and pandas.DataFrames can be created as well.

TraML

TraML is also a PSI format. It stores a lot of information on SRM experiments. The parser, pyteomics.traml.TraML, iterates over <Transition> elements by default. Like MzIdentML, it has a retrieve_refs parameter that helps pull in the information from other parts of the file. TraML is one of the Indexed Parsers.

FeatureXML

pyteomics.openms.featurexml implements a simple parser for .featureXML files used in the OpenMS framework. The usage is identical to other XML parsing modules. Since featureXML has feature IDs, FeatureXML objects also support direct indexing as well as iteration, among the many features of Indexed Parsers:

>>> from pyteomics.openms import featurexml

>>> # function style, iteration
... with featurexml.read('tests/test.featureXML') as f:
...     qual = [feat['overallquality'] for feat in f]
...

>>> qual # qualities of the two features in the test file
[0.791454, 0.945634]

>>> # object-oriented style, direct indexing
>>> f = featurexml.FeatureXML('tests/test.featureXML')
>>> f['f_189396504510444007']['overallquality']
0.945634
>>> f.close()

As always, pyteomics.openms.featurexml.read() and pyteomics.openms.featurexml.FeatureXML are interchangeable.

TrafoXML

.trafoXML is another OpenMS format based on XML. It describes a transformation produced by an RT alignment algorithm. The file basically contains a series of (from; to) pairs corresponding to original and transformed retention times:

>>> from pyteomics.openms import trafoxml
>>> from_rt, to_rt = [], []
>>> with trafoxml.read('test/test.trafoXML') as f:
...    for pair in f:
...        from_rt.append(pair['from'])
...        to_rt.append(pair['to'])

>>> # plot the transformation
>>> import pylab
>>> pylab.plot(from_rt, to_rt)

As always, pyteomics.openms.trafoxml.read() and pyteomics.openms.trafoxml.TrafoXML are interchangeable. TrafoXML parsers do not support indexing because there are no IDs for specific data points in this format.

Controlled Vocabularies

Controlled Vocabularies are the universal annotation system used in the PSI formats, including mzML and mzIdentML. pyteomics.mzml.MzML, pyteomics.traml.TraML and pyteomics.mzid.MzIdentML retain the annotation information. It can be accessed using the helper function, pyteomics.auxiliary.cvquery():

>>> from pyteomics import auxiliary as aux, mzid, mzml
>>> f = mzid.MzIdentML('tests/test.mzid')
>>> s = next(f)
>>> s
{'SpectrumIdentificationItem': [{'ProteinScape:SequestMetaScore': 7.59488518903425, 'calculatedMassToCharge': 1507.695, 'PeptideEvidenceRef': [{'peptideEvidence_ref': 'PE1_SEQ_spec1_pep1'}], 'chargeState': 1, 'passThreshold': True, 'peptide_ref': 'prot1_pep1', 'rank': 1, 'id': 'SEQ_spec1_pep1', 'ProteinScape:IntensityCoverage': 0.3919545603809718, 'experimentalMassToCharge': 1507.696}], 'spectrumID': 'databasekey=1', 'id': 'SEQ_spec1', 'spectraData_ref': 'LCMALDI_spectra'}
>>> aux.cvquery(s)
{'MS:1001506': 7.59488518903425, 'MS:1001505': 0.3919545603809718}
>>> f.close()

Indexed Parsers

Most of the parsers implement indexing: MGF, mzML, mzXML, FASTA, PEFF, pepXML, mzIdentML, ms1, TraML, featureXML. Some formats do not have indexing parsers, because there is no unique ID field in the files to identify entries.

XML parser classes are named after the format, e.g. pyteomics.mzml.MzML. Text format parsers that implement indexing have names that start with the word “Indexed”, e.g. pyteomics.fasta.IndexedFASTA, as opposed to pyteomics.fasta.FASTA, which does not implement indexing. This distinction is due to the fact that indexed parsers need to open the files in binary mode. This may affect performance for text-based formats and is not always backwards-compatible (you cannot instantiate an indexed parser class using a previously opened file if it is in text mode). XML files, on the other hand, are always meant to be opened in binary mode. So, there is no duplication of classes for XML formats, but indexing can still be disabled by passing use_index=False to the class constructor or the read() function.
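
To illustrate why binary mode matters, here is a minimal standalone sketch of a byte-offset index over FASTA-like data. The functions build_offset_index and read_entry are hypothetical and are not how Pyteomics builds its index; they only show that exact byte offsets make it possible to jump straight to an entry. In text mode, Python's tell() returns an opaque value rather than a byte count, so offsets of this kind are normally taken from a file opened in binary mode.

```python
import io


def build_offset_index(fileobj):
    """Map each FASTA description to the byte offset of its header line."""
    index = {}
    fileobj.seek(0)
    while True:
        offset = fileobj.tell()
        line = fileobj.readline()
        if not line:
            break
        if line.startswith(b'>'):
            index[line[1:].strip().decode('utf-8')] = offset
    return index


def read_entry(fileobj, offset):
    """Seek to a header line and read the sequence that follows it."""
    fileobj.seek(offset)
    description = fileobj.readline()[1:].strip().decode('utf-8')
    chunks = []
    for line in iter(fileobj.readline, b''):
        if line.startswith(b'>'):
            break
        chunks.append(line.strip().decode('utf-8'))
    return description, ''.join(chunks)


# an in-memory binary "file" stands in for a FASTA file opened with mode 'rb'
data = io.BytesIO(b'>Protein 1\nPEPTIDE\nPEPTIDE\n>Protein 2\nSEQUENCE\n')
index = build_offset_index(data)
entry = read_entry(data, index['Protein 2'])
```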

Basic usage

Indexed parsers can be instantiated using the class name or the read() function:

In [1]: from pyteomics import mgf

In [2]: f = mgf.IndexedMGF('tests/test.mgf')

In [3]: f
Out[3]: <pyteomics.mgf.IndexedMGF at 0x7fc983cbaeb8>

In [4]: f.close()

In [5]: f = mgf.read('tests/test.mgf', use_index=True)

In [6]: f
Out[6]: <pyteomics.mgf.IndexedMGF at 0x7fc980c63898>

They support direct assignment and iteration or the with syntax, the same way as the older, iterative parsers.

Parser objects can be used as dictionaries mapping entry IDs to entries, or as lists:

In [7]: f['Spectrum 2']
Out[7]:
{'params': {'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
  'itol': '1',
  'itolu': 'Da',
  'mods': 'Carbamidomethyl (C)',
  'it_mods': 'Oxidation (M)',
  'mass': 'Monoisotopic',
  'username': 'Lou Scene',
  'useremail': 'leu@altered-state.edu',
  'charge': [2, 3],
  'title': 'Spectrum 2',
  'pepmass': (1084.9, 1234.0),
  'scans': '3',
  'rtinseconds': 25.0 second},
 'm/z array': array([ 345.1,  370.2,  460.2, 1673.3, 1674. , 1675.3]),
 'intensity array': array([ 237.,  128.,  108., 1007.,  974.,   79.]),
 'charge array': masked_array(data=[3, 2, 1, 1, 1, 1],
              mask=False,
        fill_value=0)}

In [8]: f[1]['params']['title'] # positional indexing
Out[8]: 'Spectrum 2'

Like dictionaries, indexed parsers support membership testing and len():

In [9]: 'Spectrum 1' in f
Out[9]: True

In [10]: len(f)
Out[10]: 2
Rich Indexing

Indexed parsers also support positional indexing, slices of IDs and integers. ID-based slices include both endpoints; integer-based slices exclude the right edge of the interval. With integer indexing, step is also supported. Here is a self-explanatory demo of indexing functionality using a test file of two spectra:

In [11]: len(f['Spectrum 1':'Spectrum 2'])
Out[11]: 2

In [12]: len(f['Spectrum 2':'Spectrum 1'])
Out[12]: 2

In [13]: len(f[:])
Out[13]: 2

In [14]: len(f[:1])
Out[14]: 1

In [15]: len(f[1:0])
Out[15]: 0

In [16]: len(f[1:0:-1])
Out[16]: 1

In [17]: len(f[::2])
Out[17]: 1
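
The difference between inclusive ID-based slices and ordinary integer slices can be reproduced with a toy container. The class TinyIndex below is a hypothetical sketch and not part of Pyteomics; in particular, returning the entries of a reversed ID slice in file order is an assumption.

```python
class TinyIndex:
    """Toy container with ID-based (inclusive) and integer (exclusive) slicing."""

    def __init__(self, entries):
        # entries: list of (id, value) pairs in file order
        self._ids = [entry_id for entry_id, _ in entries]
        self._values = [value for _, value in entries]

    def __len__(self):
        return len(self._ids)

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._values[self._ids.index(key)]
        if isinstance(key, int):
            return self._values[key]
        if isinstance(key, slice):
            if isinstance(key.start, str) or isinstance(key.stop, str):
                # ID-based slice: both endpoints are included
                start = 0 if key.start is None else self._ids.index(key.start)
                stop = (len(self) - 1 if key.stop is None
                        else self._ids.index(key.stop))
                low, high = sorted((start, stop))
                return self._values[low:high + 1]
            # integer slice: standard Python semantics, right edge excluded
            return self._values[key]
        raise TypeError('unsupported key: {!r}'.format(key))


f = TinyIndex([('Spectrum 1', 'first'), ('Spectrum 2', 'second')])
```

With these two entries, the toy container gives the same lengths as the session above for every slice shown.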
RT-based indexing

In MGF, mzML and mzXML the spectra are usually time-ordered. The corresponding indexed parsers allow accessing the spectra by retention time, including slices:

In [18]: from pyteomics import mzxml

In [19]: f = mzxml.MzXML('tests/test.mzXML')

In [20]: spec = f.time[5.5] # get the spectrum closest to this retention time

In [21]: len(f.time[5.5:6.0]) # get spectra from a range
Out[21]: 2

RT lookup is performed using binary search. When retrieving ranges, the closest spectra to the start and end of the range are used as endpoints, so it is possible that they are slightly outside the range.
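
The lookup described above can be sketched with the standard bisect module. The functions closest_index and time_slice and the retention times are hypothetical; this is an illustration of the idea, not the code used by Pyteomics.

```python
from bisect import bisect_left


def closest_index(times, rt):
    """Index of the value in the sorted list times that is closest to rt."""
    pos = bisect_left(times, rt)
    if pos == 0:
        return 0
    if pos == len(times):
        return len(times) - 1
    # pick the nearer of the two neighbours
    return pos if times[pos] - rt < rt - times[pos - 1] else pos - 1


def time_slice(times, start, stop):
    """Indices from the spectrum closest to start up to and including
    the spectrum closest to stop."""
    return list(range(closest_index(times, start),
                      closest_index(times, stop) + 1))


times = [5.1, 5.4, 5.9, 6.6]  # hypothetical retention times, sorted
selected = time_slice(times, 5.5, 6.0)
```

Here the range 5.5 to 6.0 selects the spectra at 5.4 and 5.9: the first endpoint lies slightly outside the requested range because it is the closest match to 5.5.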

Multiprocessing

Indexed parsers provide a unified interface for multiprocessing: map(). The method applies a user-defined function to entries from the file, calling it in different processes. If the function is not provided, the parsing itself is parallelized. Depending on the format, this may speed up or slow down the parsing overall. map() is a generator and yields items as they become available, not preserving the original order:

In [1]: from pyteomics import mzml

In [2]: f = mzml.MzML('tests/test.mzML')

In [3]: for spec in f.map():
   ...:     print(spec['id'])
   ...:
controllerType=0 controllerNumber=1 scan=2
controllerType=0 controllerNumber=1 scan=1

In [4]: for item in f.map(lambda spec: spec['id']):
   ...:     print(item)
   ...:
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2

Note

To use map() with lambda functions (and in some other corner cases, like parsers instantiated with pre-opened file objects), the dill package is required. This is because the target callable and the parser itself need to be pickled for multiprocessing to work.
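
The pickling limitation can be checked with the standard library alone. The standard pickle module serializes functions by reference to an importable name, which a lambda does not have. The helper can_pickle below is hypothetical. Whether a picklable callable such as operator.itemgetter is sufficient for map() without dill in your setup is an assumption to verify.

```python
import operator
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# a lambda has no importable name, so standard pickle rejects it
lambda_ok = can_pickle(lambda spec: spec['id'])
# operator.itemgetter('id') does the same job and is picklable
getter_ok = can_pickle(operator.itemgetter('id'))
```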

Apart from parser objects, map() is available on objects returned by chain() functions and iterfind():

In [5]: for c in f.iterfind('chromatogram').map():
   ...:     print(c['id'])
   ...:
TIC

In [6]: for spec in mzml.chain('tests/test.mzML', 'tests/test.mzML').map():
   ...:     print(spec['id'])
   ...:
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2

FDR estimation and filtering

The modules for reading proteomics search engine or post-processing output (tandem, pepxml, mzid, idxml and protxml) expose similar functions: is_decoy(), fdr() and filter(). These functions implement the widely used Target-Decoy Approach (TDA) for estimating the False Discovery Rate (FDR).

The is_decoy() function is supposed to determine if a particular spectrum identification is coming from the decoy database. In tandem and pepxml this is done by checking if the protein description/name starts with a certain prefix. In mzid, a boolean value that stores this information in the PSM dict is used.

Warning

Because of the variety of the software producing files in pepXML and mzIdentML formats, the is_decoy() function provided in the corresponding modules may not work for your specific files. In this case you will have to refer to the source of pyteomics.pepxml.is_decoy() and pyteomics.mzid.is_decoy() and create your own function in a similar manner.

The fdr() function estimates the FDR in a set of PSMs by counting the decoy matches. Since it is using the is_decoy() function, the warning above applies. You can supply a custom function so that fdr() works for your data. fdr() can also be imported from auxiliary, where it has no default for is_decoy().
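
Decoy counting itself is simple. The sketch below uses one common estimate, the number of decoy PSMs divided by the number of target PSMs. The function estimate_fdr and the sample PSMs are hypothetical; the exact formula, defaults and corrections used by the Pyteomics fdr() functions are not reproduced here.

```python
def estimate_fdr(psms, is_decoy):
    """Estimate FDR as (number of decoy PSMs) / (number of target PSMs)."""
    decoys = sum(1 for psm in psms if is_decoy(psm))
    targets = len(psms) - decoys
    if targets == 0:
        raise ZeroDivisionError('no target PSMs')
    return decoys / targets


# hypothetical PSMs: dicts with a protein name; decoys carry a 'DECOY_' prefix
psms = [{'protein': 'P1'}, {'protein': 'P2'}, {'protein': 'DECOY_P3'},
        {'protein': 'P4'}, {'protein': 'P5'}]
fdr_value = estimate_fdr(psms, lambda psm: psm['protein'].startswith('DECOY_'))
```

The is_decoy argument here plays the same role as the custom function mentioned above: it encodes how decoys are labeled in a particular dataset.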

The filter() function works like chain(), but instead of yielding all PSMs, it filters them to a certain level of FDR. PSM filtering requires counting decoy matches, too (see above), but it also implies sorting the PSMs by some kind of a score. This score cannot be universal due to the above-mentioned reasons, and it can be specified as a user-defined function. For instance, the default sorting key in pyteomics.mzid.filter() is only expected to work with mzIdentML files created with Mascot. So once again,

Warning

The default parameters of filter() may not work for your files.
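
The procedure described above (sort PSMs by score, count decoys, cut at the desired FDR level) can be sketched as follows. The function filter_psms and the sample data are hypothetical, the score is assumed to be "lower is better", and FDR is estimated as decoys divided by targets; the filter() functions in Pyteomics have more parameters and may differ in detail.

```python
def filter_psms(psms, key, is_decoy, fdr=0.01):
    """Keep the longest best-scoring prefix of PSMs whose decoy/target
    ratio does not exceed fdr, and return its target PSMs.

    key(psm) is a score where lower means better (e.g. an e-value)."""
    ranked = sorted(psms, key=key)
    decoys = targets = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for i, psm in enumerate(ranked, 1):
        if is_decoy(psm):
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr:
            cutoff = i
    return [psm for psm in ranked[:cutoff] if not is_decoy(psm)]


# hypothetical PSMs as (e-value, is_decoy flag) tuples
psms = [(1e-9, False), (1e-8, False), (1e-7, False), (1e-6, False),
        (1e-5, True), (1e-4, False), (1e-3, True), (1e-2, True)]
kept = filter_psms(psms, key=lambda p: p[0], is_decoy=lambda p: p[1], fdr=0.2)
```

In this example the cut is placed after the sixth PSM, where one decoy and five targets give an estimated FDR of 0.2; the five targets above the cut are returned.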

There are also filter.chain() and filter.chain.from_iterable(). These are different from filter() in that they apply FDR filtering to all files separately and then provide a reader over top PSMs of all files, whereas filter() pools all PSMs together and applies a single threshold.

If you want to filter a list representing PSMs in arbitrary format, you can use pyteomics.auxiliary.filter(). Instead of files it takes lists (or other iterables) of PSMs. The rest is the same as for other filter() functions.

NumPy and Pandas support, etc.

pyteomics.auxiliary.filter() supports structured numpy arrays and pandas.DataFrames of PSMs. This makes it easy to filter search results stored as CSV files (see Example 3: Search engines and PSM filtering for more info).

Generally, PSMs can be provided as iterators, lists, arrays, and DataFrames, and key and is_decoy parameters to filter() can be functions, strings, lists, arrays, or iterators. If a string is given, it is used as a key in a structured array, DataFrame or an iterable of dicts.

FDR correction

As described in this JPR article, filtering based on decoy counting is inherently biased, especially for small datasets. All TDA-related functions have an optional argument, correction, that enables the correcting procedure proposed in the article.

Pyteomics API documentation

This section documents all user functions and data available in Pyteomics. You can access all of this info off-line from your Python interpreter.

Contents:

parser - operations on modX peptide sequences

modX is a simple extension of the IUPAC one-letter peptide sequence representation.

The labels (or codes) for the 20 standard amino acids in modX are the same as in IUPAC nomenclature. A label for a modified amino acid has a general form of ‘modX’, i.e.:

  • it starts with an arbitrary number of lower-case symbols or numbers (a modification);
  • it ends with a single upper-case symbol (an amino acid residue).

Valid examples of modX amino acid labels are: ‘G’, ‘pS’, ‘oxM’. This rule keeps the labels both human-readable and easy to parse.

Besides the sequence of amino acid residues, modX has a rule to specify terminal modifications of a polypeptide. Such a label should start or end with a hyphen. The default N-terminal amine group and C-terminal carboxyl group may not be shown explicitly.

Therefore, valid examples of peptide sequences in modX are: “GAGA”, “H-PEPTIDE-OH”, “H-TEST-NH2”. It is not recommended to specify only one terminal group.
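
The label rule stated above (a run of lower-case symbols or numbers followed by a single upper-case symbol) translates directly into a regular expression. The sketch below is a standalone illustration with hypothetical names; it handles sequences without terminal groups and is not the parser used by Pyteomics.

```python
import re

# a modification (lower-case letters or digits) followed by one residue
MODX_LABEL = re.compile(r'[a-z0-9]*[A-Z]')


def looks_like_modx(label):
    """Check a single label against the 'modX' rule."""
    return MODX_LABEL.fullmatch(label) is not None


def split_labels(sequence):
    """Split a sequence without terminal groups into modX labels."""
    return MODX_LABEL.findall(sequence)


labels = split_labels('PEPpTIDE')
```

Because every label ends with exactly one upper-case letter, the sequence can be split unambiguously: here 'PEPpTIDE' yields seven labels, one of which is 'pT'.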

Operations on polypeptide sequences

parse() - convert a sequence string into a list of amino acid residues.

tostring() - convert a parsed sequence to a string.

amino_acid_composition() - get numbers of each amino acid residue in a peptide.

cleave() - cleave a polypeptide using a given rule of enzymatic digestion.

num_sites() - count the number of cleavage sites in a sequence.

isoforms() - generate all unique modified peptide sequences given the initial sequence and modifications.

Auxiliary commands

coverage() - calculate the sequence coverage of a protein by peptides.

length() - calculate the number of amino acid residues in a polypeptide.

valid() - check if a sequence can be parsed successfully.

fast_valid() - check if a sequence consists only of known one-letter codes.

is_modX() - check if supplied code corresponds to a modX label.

is_term_mod() - check if supplied code corresponds to a terminal modification.

Data

std_amino_acids - a list of the 20 standard amino acid IUPAC codes.

std_nterm - the standard N-terminal modification (the unmodified group is a single atom of hydrogen).

std_cterm - the standard C-terminal modification (the unmodified group is hydroxyl).

std_labels - a list of all standard sequence elements, amino acid residues and terminal modifications.

expasy_rules - a dict with the regular expressions of cleavage rules for the most popular proteolytic enzymes.


pyteomics.parser.amino_acid_composition(sequence, show_unmodified_termini=False, term_aa=False, allow_unknown_modifications=False, **kwargs)[source]

Calculate amino acid composition of a polypeptide.

Parameters:
  • sequence (str or list) – The sequence of a polypeptide or a list with a parsed sequence.
  • show_unmodified_termini (bool, optional) – If True then the unmodified N- and C-terminus are explicitly shown in the returned dict. Default value is False.
  • term_aa (bool, optional) – If True then the terminal amino acid residues are artificially modified with nterm or cterm modification. Default value is False.
  • allow_unknown_modifications (bool, optional) – If True then do not raise an exception when an unknown modification of a known amino acid residue is found in the sequence. Default value is False.
  • labels (list, optional) – A list of allowed labels for amino acids and terminal modifications.
Returns:

out – A dictionary of amino acid composition.

Return type:

dict

Examples

>>> amino_acid_composition('PEPTIDE') == {'I': 1, 'P': 2, 'E': 2, 'T': 1, 'D': 1}
True
>>> amino_acid_composition('PEPTDE', term_aa=True) == {'ctermE': 1, 'E': 1, 'D': 1, 'P': 1, 'T': 1, 'ntermP': 1}
True
>>> amino_acid_composition('PEPpTIDE', labels=std_labels+['pT']) == {'I': 1, 'P': 2, 'E': 2, 'D': 1, 'pT': 1}
True
True
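
For plain one-letter sequences, the first two results above can be reproduced with collections.Counter. The function composition is a hypothetical sketch that ignores modifications and terminal groups; treating sequences shorter than two residues as having no terminal labels is an assumption.

```python
from collections import Counter


def composition(sequence, term_aa=False):
    """Count residues in a plain one-letter sequence (no modifications)."""
    labels = list(sequence)
    if term_aa and len(labels) >= 2:
        # mark terminal residues the way the term_aa example above does
        labels[0] = 'nterm' + labels[0]
        labels[-1] = 'cterm' + labels[-1]
    return dict(Counter(labels))


plain = composition('PEPTIDE')
with_termini = composition('PEPTDE', term_aa=True)
```
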
pyteomics.parser.cleave(sequence, rule, missed_cleavages=0, min_length=None, semi=False, exception=None)[source]

Cleaves a polypeptide sequence using a given rule.

Parameters:
  • sequence (str) –

    The sequence of a polypeptide.

    Note

    The sequence is expected to be in one-letter uppercase notation. Otherwise, some of the cleavage rules in expasy_rules will not work as expected.

  • rule (str or compiled regex) – A key present in expasy_rules or a regular expression describing the site of cleavage. It is recommended to design the regex so that it matches only the residue whose C-terminal bond is to be cleaved. All additional requirements should be specified using lookaround assertions. expasy_rules contains cleavage rules for popular cleavage agents.
  • missed_cleavages (int, optional) – Maximum number of allowed missed cleavages. Defaults to 0.
  • min_length (int or None, optional) –

    Minimum peptide length. Defaults to None.

    Note

    This checks for string length, which is only correct for one-letter notation and not for full modX. Use length() manually if you know what you are doing and apply cleave() to modX sequences.

  • semi (bool, optional) – Include products of semi-specific cleavage. Default is False. This effectively cuts every peptide at every position and adds results to the output.
  • exception (str or compiled RE or None, optional) – Exceptions to the cleavage rule. If specified, should be a key present in expasy_rules or regular expression. Cleavage sites matching rule will be checked against exception and omitted if they match.
Returns:

out – A set of unique (!) peptides.

Return type:

set

Examples

>>> cleave('AKAKBK', expasy_rules['trypsin'], 0) == {'AK', 'BK'}
True
>>> cleave('AKAKBK', 'trypsin', 0) == {'AK', 'BK'}
True
>>> cleave('GKGKYKCK', expasy_rules['trypsin'], 2) == {'CK', 'GKYK', 'YKCK', 'GKGK', 'GKYKCK', 'GK', 'GKGKYK', 'YK'}
True
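
How a regular expression rule and missed_cleavages combine can be shown with a short standalone sketch. The function simple_cleave is hypothetical, and SIMPLE_TRYPSIN is a deliberately simplified stand-in (cut after every K or R), not the rule stored in expasy_rules. Options such as min_length, semi and exception are not covered.

```python
import re


def simple_cleave(sequence, rule, missed_cleavages=0):
    """Cut sequence after every match of rule and return the set of
    peptides with up to missed_cleavages internal uncut sites."""
    cuts = [0] + [m.end() for m in re.finditer(rule, sequence)]
    if cuts[-1] != len(sequence):
        cuts.append(len(sequence))
    peptides = set()
    for i in range(len(cuts) - 1):
        # join 1 .. missed_cleavages + 1 consecutive pieces
        for j in range(i + 1, min(i + missed_cleavages + 2, len(cuts))):
            peptides.add(sequence[cuts[i]:cuts[j]])
    return peptides


# simplified stand-in for a tryptic rule: cut after every K or R
SIMPLE_TRYPSIN = r'[KR]'
no_missed = simple_cleave('AKAKBK', SIMPLE_TRYPSIN)
two_missed = simple_cleave('GKGKYKCK', SIMPLE_TRYPSIN, 2)
```

The rule matches only the residue after which the chain is cut, which is the design recommended for the rule parameter above.
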
pyteomics.parser.coverage(protein, peptides)[source]

Calculate how much of protein is covered by peptides. Peptides can overlap. If a peptide is found multiple times in protein, it contributes more to the overall coverage.

Requires numpy.

Note

Modifications and terminal groups are discarded.

Parameters:
  • protein (str) – A protein sequence.
  • peptides (iterable) – An iterable of peptide sequences.
Returns:

out – The sequence coverage, between 0 and 1.

Return type:

float

Examples

>>> coverage('PEPTIDES'*100, ['PEP', 'EPT'])
0.5
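
The value 0.5 in the example can be verified with a pure-Python sketch that marks covered residues in a boolean mask. The function simple_coverage is hypothetical, does not use numpy, and assumes plain sequences without modifications or terminal groups.

```python
def simple_coverage(protein, peptides):
    """Fraction of protein residues covered by at least one peptide."""
    covered = [False] * len(protein)
    for peptide in peptides:
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            # continue the search one position later to catch overlaps
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


value = simple_coverage('PEPTIDES' * 100, ['PEP', 'EPT'])
```

'PEP' and 'EPT' together cover the first four residues of every 'PEPTIDES' repeat, hence half of the protein.
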
pyteomics.parser.expasy_rules

This dict contains regular expressions for cleavage rules of the most popular proteolytic enzymes. The rules were taken from the PeptideCutter tool at Expasy.

Note

‘trypsin_exception’ can be used as exception argument when calling cleave() with ‘trypsin’ rule:

>>> parser.cleave('PEPTIDKDE', parser.expasy_rules['trypsin'])
{'DE', 'PEPTIDK'}
>>> parser.cleave('PEPTIDKDE', parser.expasy_rules['trypsin'], exception=parser.expasy_rules['trypsin_exception'])
{'PEPTIDKDE'}
pyteomics.parser.fast_valid(sequence, labels={'-OH', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'H-', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'})[source]

Iterate over sequence and check if all items are in labels. With strings, this only works as expected on sequences without modifications or terminal groups.

Parameters:
  • sequence (iterable (expectedly, str)) – The sequence to check. A valid sequence would be a string of labels, all present in labels.
  • labels (iterable, optional) – An iterable of known labels.
Returns:

out

Return type:

bool

pyteomics.parser.is_modX(label)[source]

Check if label is a valid ‘modX’ label.

Parameters:label (str) –
Returns:out
Return type:bool

Examples

>>> is_modX('M')
True
>>> is_modX('oxM')
True
>>> is_modX('oxMet')
False
>>> is_modX('160C')
True
pyteomics.parser.is_term_mod(label)[source]

Check if label corresponds to a terminal modification.

Parameters:label (str) –
Returns:out
Return type:bool

Examples

>>> is_term_mod('A')
False
>>> is_term_mod('Ac-')
True
>>> is_term_mod('-customGroup')
True
>>> is_term_mod('this-group-')
False
>>> is_term_mod('-')
False
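
The five results above are consistent with a simple pattern: exactly one hyphen, at the start or at the end, next to a non-empty group name. The regular expression below is a hypothetical sketch of that pattern and is not taken from the Pyteomics source.

```python
import re

# N-terminal labels end with a hyphen, C-terminal labels start with one;
# no other hyphens are allowed and the group name must be non-empty
TERM_MOD = re.compile(r'[^-]+-|-[^-]+')


def looks_like_term_mod(label):
    """Check whether label has the shape of a terminal modification."""
    return TERM_MOD.fullmatch(label) is not None


checks = {label: looks_like_term_mod(label)
          for label in ['A', 'Ac-', '-customGroup', 'this-group-', '-']}
```
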
pyteomics.parser.isoforms(sequence, **kwargs)[source]

Apply variable and fixed modifications to the polypeptide and yield the unique modified sequences.

Parameters:
  • sequence (str) – Peptide sequence to modify.
  • variable_mods (dict, optional) –

    A dict of variable modifications in the following format: {'label1': ['X', 'Y', ...], 'label2': ['X', 'A', 'B', ...]}

    Keys in the dict are modification labels (terminal modifications allowed). Values are iterables of residue labels (one letter each) or True. If a value for a modification is True, it is applicable to any residue (useful for terminal modifications). You can use values such as ‘ntermX’ or ‘ctermY’ to specify that a modification only occurs when the residue is in the terminal position. This is not needed for terminal modifications.

    Note

    Several variable modifications can occur on amino acids of the same type, but in the output each amino acid residue will be modified at most once (apart from terminal modifications).

  • fixed_mods (dict, optional) –

    A dict of fixed modifications in the same format.

    Note: if a residue is affected by a fixed modification, no variable modifications will be applied to it (apart from terminal modifications).

  • labels (list, optional) – A list of amino acid labels containing all the labels present in sequence. Modified entries will be added automatically. Defaults to std_labels. Not required since version 2.5.
  • max_mods (int or None, optional) – Number of modifications that can occur simultaneously on a peptide, excluding fixed modifications. If None or if max_mods is greater than the number of modification sites, all possible isoforms are generated. Default is None.
  • override (bool, optional) – Defines how to handle the residues that are modified in the input. False means that they will be preserved (default). True means they will be treated as unmodified.
  • show_unmodified_termini (bool, optional) – If True then the unmodified N- and C-termini are explicitly shown in the returned sequences. Default value is False.
  • format (str, optional) – If 'str' (default), an iterator over sequences is returned. If 'split', the iterator will yield results in the same format as parse() with the ‘split’ option, with unmodified terminal groups shown.
Returns:

out – All possible unique polypeptide sequences resulting from the specified modifications are yielded one by one.

Return type:

iterator over strings or lists
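
The combinatorics behind variable modifications can be sketched with itertools.product. The generator simple_isoforms is hypothetical and far more limited than isoforms(): it accepts only plain one-letter sequences and ignores terminal modifications, fixed_mods, max_mods and the other parameters listed above.

```python
from itertools import product


def simple_isoforms(sequence, variable_mods):
    """Yield every combination of variable modifications for a plain
    one-letter sequence; each residue is modified at most once."""
    options = []
    for residue in sequence:
        choices = [residue]  # the unmodified residue is always an option
        for mod, residues in variable_mods.items():
            if residues is True or residue in residues:
                choices.append(mod + residue)
        options.append(choices)
    for combination in product(*options):
        yield ''.join(combination)


forms = set(simple_isoforms('PEPT', {'ox': ['P'], 'p': ['T']}))
```

Two prolines and one threonine, each either modified or not, give 2 × 2 × 2 = 8 isoforms, including the unmodified 'PEPT'.
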

pyteomics.parser.length(sequence, **kwargs)[source]

Calculate the number of amino acid residues in a polypeptide written in modX notation.

Parameters:
  • sequence (str or list or dict) – A string with a polypeptide sequence, a list with a parsed sequence or a dict of amino acid composition.
  • labels (list, optional) – A list of allowed labels for amino acids and terminal modifications.
Returns:

out

Return type:

int

Examples

>>> length('PEPTIDE')
7
>>> length('H-PEPTIDE-OH')
7
pyteomics.parser.match_modX(label)[source]

Check if label is a valid ‘modX’ label.

Parameters:label (str) –
Returns:out
Return type:re.match or None
pyteomics.parser.num_sites(sequence, rule, **kwargs)[source]

Count the number of sites where sequence can be cleaved using the given rule (e.g. number of miscleavages for a peptide).

Parameters:
  • sequence (str) – The sequence of a polypeptide.
  • rule (str or compiled regex) –

    A regular expression describing the site of cleavage. It is recommended to design the regex so that it matches only the residue whose C-terminal bond is to be cleaved. All additional requirements should be specified using lookaround assertions.

  • labels (list, optional) – A list of allowed labels for amino acids and terminal modifications.
  • exception (str or compiled RE or None, optional) – Exceptions to the cleavage rule. If specified, should be a regular expression. Cleavage sites matching rule will be checked against exception and omitted if they match.
Returns:

out – Number of cleavage sites.

Return type:

int
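
The recommendation to match only the residue whose C-terminal bond is cleaved can be illustrated with the standard re module. The sketch below is a standalone approximation, not the library code: count_cleavage_sites is a hypothetical helper, the simplified trypsin-like rule is an assumption, and only plain one-letter sequences are handled.

```python
import re


def count_cleavage_sites(sequence, rule, exception=None):
    """Count internal bonds matched by `rule` (hypothetical helper).

    Because the rule matches only the residue before the cleaved bond,
    the end offset of each match is the position of that bond.
    """
    sites = {m.end() for m in re.finditer(rule, sequence)}
    if exception is not None:
        # Drop sites that also match the exception pattern.
        sites -= {m.end() for m in re.finditer(exception, sequence)}
    # A match at the very end of the sequence is not a cleavable bond.
    sites.discard(len(sequence))
    return len(sites)


# Simplified trypsin-like rule (assumption): cleave after K or R unless P follows.
# The "not before P" requirement is a lookahead, so it is not part of the match.
TRYPSIN_LIKE = r'[KR](?!P)'

print(count_cleavage_sites('AKAARPEPKTIDER', TRYPSIN_LIKE))
```

Keeping the extra requirement in a lookahead means the match itself stays one residue long, so the end of the match always marks the bond.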

pyteomics.parser.parse(sequence, show_unmodified_termini=False, split=False, allow_unknown_modifications=False, **kwargs)[source]

Parse a sequence string written in modX notation into a list of labels or (if split argument is True) into a list of tuples representing amino acid residues and their modifications.

Parameters:
  • sequence (str) – The sequence of a polypeptide.
  • show_unmodified_termini (bool, optional) – If True then the unmodified N- and C-termini are explicitly shown in the returned list. Default value is False.
  • split (bool, optional) – If True then the result will be a list of tuples with 1 to 4 elements: terminal modification, modification, residue. Default value is False.
  • allow_unknown_modifications (bool, optional) –

    If True then do not raise an exception when an unknown modification of a known amino acid residue is found in the sequence. This also includes terminal groups. Default value is False.

    Note

    Since version 2.5, this parameter has effect only if labels are provided.

  • labels (container, optional) –

    A container of allowed labels for amino acids, modifications and terminal modifications. If not provided, no checks will be done. Separate labels for modifications (such as ‘p’ or ‘ox’) can be supplied, which means they are applicable to all residues.

    Warning

    If show_unmodified_termini is set to True, standard terminal groups need to be present in labels.

    Warning

    Avoid using sequences with only one terminal group, as they are ambiguous. If you provide one, labels (or std_labels) will be used to resolve the ambiguity.

Returns:

out – List of tuples with labels of modifications and amino acid residues.

Return type:

list

Examples

>>> parse('PEPTIDE', split=True)
[('P',), ('E',), ('P',), ('T',), ('I',), ('D',), ('E',)]
>>> parse('H-PEPTIDE')
['P', 'E', 'P', 'T', 'I', 'D', 'E']
>>> parse('PEPTIDE', show_unmodified_termini=True)
['H-', 'P', 'E', 'P', 'T', 'I', 'D', 'E', '-OH']
>>> parse('TEpSToxM', labels=std_labels + ['pS', 'oxM'])
['T', 'E', 'pS', 'T', 'oxM']
>>> parse('zPEPzTIDzE', True, True, labels=std_labels+['z'])
[('H-', 'z', 'P'), ('E',), ('P',), ('z', 'T'), ('I',), ('D',), ('z', 'E', '-OH')]
>>> parse('Pmod1EPTIDE')
['P', 'mod1E', 'P', 'T', 'I', 'D', 'E']
pyteomics.parser.std_amino_acids

modX labels for the 20 standard amino acids.

pyteomics.parser.std_cterm

modX label for the unmodified C-terminus.

pyteomics.parser.std_labels

modX labels for the standard amino acids and unmodified termini.

pyteomics.parser.std_nterm

modX label for the unmodified N-terminus.

pyteomics.parser.tostring(parsed_sequence, show_unmodified_termini=True)[source]

Create a string from a parsed sequence.

Parameters:
  • parsed_sequence (iterable) – Expected to be in one of the formats returned by parse(), i.e. list of labels or list of tuples.
  • show_unmodified_termini (bool, optional) – Defines the behavior towards standard terminal groups in the input. True means that they will be preserved if present (default). False means that they will be removed. Standard terminal groups will not be added if not shown in parsed_sequence, regardless of this setting.
Returns:

sequence

Return type:

str

pyteomics.parser.valid(*args, **kwargs)[source]

Try to parse sequence and catch the exceptions. All parameters are passed to parse().

Returns:out – True if the sequence was parsed successfully, and False otherwise.
Return type:bool

mass - molecular masses and isotope distributions

Summary

This module defines general functions for mass and isotope abundance calculations. For most of the functions, the user can define a given substance in various formats, but all of them would be reduced to the Composition object describing its chemical composition.

Classes

Composition - a class storing chemical composition of a substance.

Unimod - a class representing a Python interface to the Unimod database (see pyteomics.mass.unimod for a much more powerful alternative).

Mass calculations

calculate_mass() - a general routine for mass / m/z calculation. Can calculate mass for a polypeptide sequence, chemical formula or elemental composition. Supplied with an ion type and charge, the function would calculate m/z.

fast_mass() - a less powerful but much faster function for polypeptide mass calculation.

fast_mass2() - a version of fast_mass that supports modX notation.

Isotopic abundances

isotopic_composition_abundance() - calculate the relative abundance of a given isotopic composition.

most_probable_isotopic_composition() - finds the most abundant isotopic composition for a molecule defined by a polypeptide sequence, chemical formula or elemental composition.

isotopologues() - iterate over possible isotopic compositions of a molecule, possibly filtered by abundance.

Data

nist_mass - a dict with exact masses of the most abundant isotopes.

std_aa_comp - a dict with the elemental compositions of the standard twenty amino acid residues, selenocysteine and pyrrolysine.

std_ion_comp - a dict with the relative elemental compositions of the standard peptide fragment ions.

std_aa_mass - a dict with the monoisotopic masses of the standard twenty amino acid residues, selenocysteine and pyrrolysine.


Composition.__init__(*args, **kwargs)[source]

A Composition object stores a chemical composition of a substance. Basically it is a dict object, in which keys are the names of chemical elements and values contain integer numbers of corresponding atoms in a substance.

The main improvement over dict is that Composition objects allow addition and subtraction.

A Composition object can be initialized with one of the following arguments: formula, sequence, parsed_sequence or split_sequence.

If none of these are specified, the constructor will look at the first positional argument and try to build the object from it. Without positional arguments, a Composition will be constructed directly from keyword arguments.

If there’s an ambiguity, i.e. the argument is both a valid sequence and a formula (such as ‘HCN’), it will be treated as a sequence. You need to provide the ‘formula’ keyword to override this.

Warning

Be careful when supplying a list with a parsed sequence or a split sequence as a keyword argument. It must be obtained with enabled show_unmodified_termini option. When supplying it as a positional argument, the option doesn’t matter, because the positional argument is always converted to a sequence prior to any processing.

Parameters:
  • formula (str, optional) – A string with a chemical formula. All elements must be present in mass_data.
  • sequence (str, optional) – A polypeptide sequence string in modX notation.
  • parsed_sequence (list of str, optional) – A polypeptide sequence parsed into a list of amino acids.
  • split_sequence (list of tuples of str, optional) – A polypeptide sequence parsed into a list of tuples (as returned by pyteomics.parser.parse() with split=True).
  • aa_comp (dict, optional) – A dict with the elemental composition of the amino acids (the default value is std_aa_comp).
  • mass_data (dict, optional) – A dict with the masses of chemical elements (the default value is nist_mass). It is used for formulae parsing only.
  • charge (int, optional) – If not 0 then additional protons are added to the composition.
  • ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion fragments (default is std_ion_comp).
  • ion_type (str, optional) – If specified, then the polypeptide is considered to be in the form of the corresponding ion. Do not forget to specify the charge state!
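
The idea of a dict that supports addition and subtraction can be sketched in a few lines. ToyComposition below is a hypothetical stand-in written for illustration; it does not parse formulae or sequences and is not the Composition class itself.

```python
class ToyComposition(dict):
    """Element -> atom count mapping that supports + and - (illustration only)."""

    def _combine(self, other, sign):
        result = ToyComposition(self)
        for element, count in other.items():
            new_count = result.get(element, 0) + sign * count
            if new_count:
                result[element] = new_count
            else:
                # Drop elements whose count reaches zero.
                result.pop(element, None)
        return result

    def __add__(self, other):
        return self._combine(other, +1)

    def __sub__(self, other):
        return self._combine(other, -1)


water = ToyComposition(H=2, O=1)
glycine_residue = ToyComposition(C=2, H=3, N=1, O=1)

# A free amino acid is its residue plus one water molecule.
glycine = glycine_residue + water
print(glycine)
```

Subtracting water from the free amino acid gives back the residue composition, which is the kind of arithmetic a plain dict does not offer.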
Composition.mass(**kwargs)[source]

Calculate the mass or m/z of a Composition.

Parameters:
  • average (bool, optional) – If True then the average mass is calculated. Note that mass is not averaged for elements with specified isotopes. Default is False.
  • charge (int, optional) – If not 0 then m/z is calculated: the mass is increased by the corresponding number of proton masses and divided by charge.
  • mass_data (dict, optional) – A dict with the masses of the chemical elements (the default value is nist_mass).
  • ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion fragments (default is std_ion_comp).
  • ion_type (str, optional) – If specified, then the polypeptide is considered to be in the form of the corresponding ion. Do not forget to specify the charge state!
Returns:

mass

Return type:

float
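
The charge rule described above (add one proton mass per charge, then divide by the charge) can be written out as follows. This is a minimal sketch: toy_mass is a hypothetical function, and the masses are approximate values typed in for illustration rather than read from nist_mass.

```python
# Approximate monoisotopic masses in Da (assumed values for illustration).
MONOISOTOPIC = {
    'H': 1.00782503207,
    'C': 12.0,
    'N': 14.0030740048,
    'O': 15.99491461956,
}
PROTON_MASS = 1.00727646677  # Da, approximate


def toy_mass(composition, charge=0):
    """Return the neutral mass, or m/z when `charge` is non-zero."""
    neutral = sum(MONOISOTOPIC[el] * n for el, n in composition.items())
    if charge:
        # m/z: add `charge` protons, then divide by the charge.
        return (neutral + charge * PROTON_MASS) / charge
    return neutral


water = {'H': 2, 'O': 1}
print(toy_mass(water))
print(toy_mass(water, charge=1))
```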

class pyteomics.mass.mass.Unimod(source='http://www.unimod.org/xml/unimod.xml')[source]

Bases: object

A class for Unimod database of modifications. The list of all modifications can be retrieved via mods attribute. Methods for convenient searching are by_title and by_name. For more elaborate filtering, iterate manually over the list.

Note

See pyteomics.mass.unimod for a new alternative class with more features.

__init__(source='http://www.unimod.org/xml/unimod.xml')[source]

Create a database and fill it from XML file retrieved from source.

Parameters:source (str or file, optional) – A file-like object or a URL to read from. Don’t forget the 'file://' prefix when pointing to local files.
by_id(i)[source]

Search modifications by record ID. If a modification is found, it is returned. Otherwise, KeyError is raised.

Parameters:i (int or str) – The Unimod record ID.
Returns:out – A single modification dict.
Return type:dict
by_name(name, strict=True)[source]

Search modifications by name. If a single modification is found, it is returned. Otherwise, a list will be returned.

Parameters:
  • name (str) – The full name of the modification(s).
  • strict (bool, optional) – If False, the search will return all modifications whose full name contains name, otherwise equality is required. True by default.
Returns:

out – A single modification or a list of modifications.

Return type:

dict or list

by_title(title, strict=True)[source]

Search modifications by title. If a single modification is found, it is returned. Otherwise, a list will be returned.

Parameters:
  • title (str) – The modification title.
  • strict (bool, optional) – If False, the search will return all modifications whose title contains title, otherwise equality is required. True by default.
Returns:

out – A single modification or a list of modifications.

Return type:

dict or list
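
The return convention shared by by_name and by_title (a single dict for exactly one hit, a list otherwise) is easy to miss. The sketch below reproduces it on a hand-written list of toy records; search_by_title and the record layout are assumptions made for illustration, not the class internals.

```python
def search_by_title(mods, title, strict=True):
    """Return one record if exactly one matches, otherwise a list (hypothetical helper)."""
    if strict:
        hits = [mod for mod in mods if mod['title'] == title]
    else:
        hits = [mod for mod in mods if title in mod['title']]
    return hits[0] if len(hits) == 1 else hits


# Toy records with an assumed layout; a real database holds many more fields.
toy_mods = [
    {'title': 'Phospho', 'full_name': 'Phosphorylation'},
    {'title': 'Oxidation', 'full_name': 'Oxidation or Hydroxylation'},
    {'title': 'Dioxidation', 'full_name': 'dihydroxy'},
]

print(search_by_title(toy_mods, 'Phospho'))
print(search_by_title(toy_mods, 'xidation', strict=False))
```

Callers that need uniform handling may want to check the result type with isinstance before iterating, since an empty result is also returned as a list.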

mass_data

Get element mass data extracted from the database

mods

Get the list of Unimod modifications

pyteomics.mass.mass.calculate_mass(*args, **kwargs)[source]

Calculates the monoisotopic mass of a polypeptide defined by a sequence string, parsed sequence, chemical formula or Composition object.

One or none of the following keyword arguments is required: formula, sequence, parsed_sequence, split_sequence or composition. All arguments given are used to create a Composition object, unless an existing one is passed as a keyword argument.

Note that if a sequence string is supplied and terminal groups are not explicitly shown, then the mass is calculated for a polypeptide with standard terminal groups (NH2- and -OH).

Warning

Be careful when supplying a list with a parsed sequence. It must be obtained with enabled show_unmodified_termini option.

Parameters:
  • formula (str, optional) – A string with a chemical formula.
  • sequence (str, optional) – A polypeptide sequence string in modX notation.
  • parsed_sequence (list of str, optional) – A polypeptide sequence parsed into a list of amino acids.
  • composition (Composition, optional) – A Composition object with the elemental composition of a substance.
  • aa_comp (dict, optional) – A dict with the elemental composition of the amino acids (the default value is std_aa_comp).
  • average (bool, optional) – If True then the average mass is calculated. Note that mass is not averaged for elements with specified isotopes. Default is False.
  • charge (int, optional) – If not 0 then m/z is calculated: the mass is increased by the corresponding number of proton masses and divided by charge.
  • mass_data (dict, optional) – A dict with the masses of the chemical elements (the default value is nist_mass).
  • ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion fragments (default is std_ion_comp).
  • ion_type (str, optional) – If specified, then the polypeptide is considered to be in the form of the corresponding ion. Do not forget to specify the charge state!
Returns:

mass

Return type:

float

pyteomics.mass.mass.fast_mass(sequence, ion_type=None, charge=None, **kwargs)[source]

Calculate monoisotopic mass of an ion using the fast algorithm. May be used only if amino acid residues are presented in one-letter code.

Parameters:
  • sequence (str) – A polypeptide sequence string.
  • ion_type (str, optional) – If specified, then the polypeptide is considered to be in the form of the corresponding ion. Do not forget to specify the charge state!
  • charge (int, optional) – If not 0 then m/z is calculated: the mass is increased by the corresponding number of proton masses and divided by z.
  • mass_data (dict, optional) – A dict with the masses of chemical elements (the default value is nist_mass).
  • aa_mass (dict, optional) – A dict with the monoisotopic mass of amino acid residues (default is std_aa_mass).
  • ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion fragments (default is std_ion_comp).
Returns:

mass – Monoisotopic mass or m/z of a peptide molecule/ion.

Return type:

float
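
One way to picture a fast, one-letter-only mass calculation is a per-character lookup of residue masses plus one water for the termini. The sketch below follows that idea under stated assumptions: toy_fast_mass is hypothetical, the table covers only a few residues, and the masses are rounded values entered for illustration rather than taken from std_aa_mass.

```python
# Rounded monoisotopic residue masses in Da (assumed values, partial table).
RESIDUE_MASS = {
    'P': 97.05276,
    'E': 129.04259,
    'T': 101.04768,
    'I': 113.08406,
    'D': 115.02694,
}
WATER_MASS = 18.01056   # H- and -OH terminal groups together
PROTON_MASS = 1.00728


def toy_fast_mass(sequence, charge=0):
    """Sum residue masses character by character; return m/z if charged."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS
    if charge:
        mass = (mass + charge * PROTON_MASS) / charge
    return mass


print(round(toy_fast_mass('PEPTIDE'), 2))
print(round(toy_fast_mass('PEPTIDE', charge=2), 2))
```

The character-by-character lookup is what limits this approach to one-letter codes: a multi-character label such as a modified residue would not be found as a single key.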

pyteomics.mass.mass.fast_mass2(sequence, ion_type=None, charge=None, **kwargs)[source]

Calculate monoisotopic mass of an ion using the fast algorithm. modX notation is fully supported.

Parameters:
  • sequence (str) – A polypeptide sequence string.
  • ion_type (str, optional) – If specified, then the polypeptide is considered to be in the form of the corresponding ion. Do not forget to specify the charge state!
  • charge (int, optional) – If not 0 then m/z is calculated: the mass is increased by the corresponding number of proton masses and divided by z.
  • mass_data (dict, optional) – A dict with the masses of chemical elements (the default value is nist_mass).
  • aa_mass (dict, optional) – A dict with the monoisotopic mass of amino acid residues (default is std_aa_mass).
  • ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion fragments (default is std_ion_comp).
Returns:

mass – Monoisotopic mass or m/z of a peptide molecule/ion.

Return type:

float

pyteomics.mass.mass.isotopic_composition_abundance(*args, **kwargs)[source]

Calculate the relative abundance of a given isotopic composition of a molecule.

Parameters:
  • formula (str, optional) – A string with a chemical formula.
  • composition (Composition, optional) – A Composition object with the isotopic composition of a substance.
  • mass_data (dict, optional) – A dict with the masses of chemical elements (the default value is nist_mass).
Returns:

relative_abundance – The relative abundance of a given isotopic composition.

Return type:

float
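
For intuition, the relative abundance of an isotopic composition can be modelled as a product of multinomial terms, one per element, assuming that atoms pick their isotopes independently. The sketch below implements that model; toy_isotopic_abundance, the nested-dict input layout and the rounded natural abundances are assumptions for illustration and may differ from what the library does internally.

```python
from math import factorial

# Rounded natural abundances (assumed values for illustration).
ABUNDANCE = {
    'C': {12: 0.9893, 13: 0.0107},
    'H': {1: 0.999885, 2: 0.000115},
}


def toy_isotopic_abundance(isotopic_composition, abundance=ABUNDANCE):
    """Multiply one multinomial probability per element.

    `isotopic_composition` maps element -> {mass number: atom count}.
    """
    total = 1.0
    for element, isotope_counts in isotopic_composition.items():
        n_atoms = sum(isotope_counts.values())
        # Number of ways to distribute the isotopes among the atoms.
        ways = factorial(n_atoms)
        for isotope, count in isotope_counts.items():
            ways //= factorial(count)
            total *= abundance[element][isotope] ** count
        total *= ways
    return total


# Two carbon atoms, exactly one of them 13C.
print(toy_isotopic_abundance({'C': {12: 1, 13: 1}}))
```

Summing the result over all isotopic compositions of the same molecule gives 1, which is a convenient sanity check for this kind of model.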

pyteomics.mass.mass.isotopologues(*args, **kwargs)[source]

Iterate over possible isotopic states of a molecule. The molecule can be defined by formula, sequence, parsed sequence, or composition. The space of possible isotopic compositions is restrained by parameters elements_with_isotopes, isotope_threshold, overall_threshold.

Parameters:
  • formula (str, optional) – A string with a chemical formula.
  • sequence (str, optional) – A polypeptide sequence string in modX notation.
  • parsed_sequence (list of str, optional) – A polypeptide sequence parsed into a list of amino acids.
  • composition (Composition, optional) – A Composition object with the elemental composition of a substance.
  • report_abundance (bool, optional) – If True, the output will contain 2-tuples: (composition, abundance). Otherwise, only compositions are yielded. Default is False.
  • elements_with_isotopes (container of str, optional) – A set of elements to be considered in isotopic distribution (by default, every element has an isotopic distribution).
  • isotope_threshold (float, optional) – The threshold abundance of a specific isotope to be considered. Default is 5e-4.
  • overall_threshold (float, optional) – The threshold abundance of the calculated isotopic composition. Default is 0.
  • aa_comp (dict, optional) – A dict with the elemental composition of the amino acids (the default value is std_aa_comp).
  • mass_data (dict, optional) – A dict with the masses of chemical elements (the default value is nist_mass).
Returns:

out – Iterator over possible isotopic compositions.

Return type:

iterator

pyteomics.mass.mass.most_probable_isotopic_composition(*args, **kwargs)[source]

Calculate the most probable isotopic composition of a peptide molecule/ion defined by a sequence string, parsed sequence, chemical formula or Composition object.

Note that if a sequence string without terminal groups is supplied then the isotopic composition is calculated for a polypeptide with standard terminal groups (H- and -OH).

For each element, only two most abundant isotopes are considered.

Parameters:
  • formula (str, optional) – A string with a chemical formula.
  • sequence (str, optional) – A polypeptide sequence string in modX notation.
  • parsed_sequence (list of str, optional) – A polypeptide sequence parsed into a list of amino acids.
  • composition (Composition, optional) – A Composition object with the elemental composition of a substance.
  • elements_with_isotopes (list of str) – A list of elements to be considered in isotopic distribution (by default, every element has an isotopic distribution).
  • aa_comp (dict, optional) – A dict with the elemental composition of the amino acids (the default value is std_aa_comp).
  • mass_data (dict, optional) – A dict with the masses of chemical elements (the default value is nist_mass).
  • ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion fragments (default is std_ion_comp).
Returns:

out – A tuple with the most probable isotopic composition and its relative abundance.

Return type:

tuple (Composition, float)

pyteomics.mass.mass.nist_mass

A dict with the exact element masses downloaded from the NIST website: http://www.nist.gov/pml/data/comp.cfm . There are entries for each element containing the masses and relative abundances of several abundant isotopes and a separate entry for undefined isotope with zero key, mass of the most abundant isotope and 1.0 abundance.

pyteomics.mass.mass.std_aa_comp

A dictionary with elemental compositions of the twenty standard amino acid residues, selenocysteine, pyrrolysine, and standard H- and -OH terminal groups.

pyteomics.mass.mass.std_aa_mass

A dictionary with monoisotopic masses of the twenty standard amino acid residues, selenocysteine and pyrrolysine.

pyteomics.mass.mass.std_ion_comp

A dict with relative elemental compositions of the standard peptide fragment ions. An elemental composition of a fragment ion is calculated as a difference between the total elemental composition of an ion and the sum of elemental compositions of its constituting amino acid residues.

unimod - interface to the Unimod database

This module provides an interface to the relational Unimod database. The main class is Unimod.

Dependencies

This module requires lxml and sqlalchemy.

class pyteomics.mass.unimod.AlternativeName(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.AminoAcid(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base, pyteomics.mass.unimod.HasFullNameMixin

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.Brick(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base, pyteomics.mass.unimod.HasFullNameMixin

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.BrickToElement(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.Classification(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.Crossreference(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.CrossreferenceSource(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.Element(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base, pyteomics.mass.unimod.HasFullNameMixin

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.Fragment(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.FragmentComposition(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.HasFullNameMixin[source]

Bases: object

A simple mixin to standardize equality operators for models with a full_name attribute.

__init__

Initialize self. See help(type(self)) for accurate signature.

class pyteomics.mass.unimod.MiscNotesModifications(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.Modification(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base, pyteomics.mass.unimod.HasFullNameMixin

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.ModificationToBrick(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.NeutralLoss(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.Position(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.Specificity(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.SpecificityToNeutralLoss(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

__init__(**kwargs)

A simple constructor that allows initialization from kwargs.

Sets attributes on the constructed instance using the names and values in kwargs.

Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.

class pyteomics.mass.unimod.Unimod(path=None)[source]

Bases: object

Main class representing the relational Unimod database.

__init__(path=None)[source]

Initialize the object from a database file.

Parameters:path (str or None, optional) – If str, should point to a database. Use a dialect-specific prefix, like 'sqlite://'. If None (default), a relational XML file will be downloaded from the default location.
by_name(identifier, strict=True)

Get a modification matching identifier. Replaces both by_name and by_title methods in the old class.

Parameters:
  • identifier (str) –
  • strict (bool, optional) – Defaults to True.
Returns:

out

Return type:

Modification

by_title(identifier, strict=True)

Get a modification matching identifier. Replaces both by_name and by_title methods in the old class.

Parameters:
  • identifier (str) –
  • strict (bool, optional) – Defaults to True.
Returns:

out

Return type:

Modification

get(identifier, strict=True)[source]

Get a modification matching identifier. Replaces both by_name and by_title methods in the old class.

Parameters:
  • identifier (str) –
  • strict (bool, optional) – Defaults to True.
Returns:

out

Return type:

Modification

pyteomics.mass.unimod.has_composition(attr_name)[source]

A decorator to simplify flagging a Model with a column to be treated as a formula for parsing. Calls _composition_listener() internally.

pyteomics.mass.unimod.load(doc_path, output_path='sqlite://')[source]

Parse the relational table-like XML file provided by http://www.unimod.org/downloads.html and convert each <tag>_row into an equivalent database entry.

By default the table will be held in memory.

pyteomics.mass.unimod.preprocess_xml(doc_path)[source]

Parse and drop namespaces from an XML document.

Parameters:doc_path (str) –
Returns:out
Return type:etree.ElementTree
pyteomics.mass.unimod.remove_namespace(doc, namespace)[source]

Remove namespace in the passed document in place.

achrom - additive model of polypeptide chromatography

Summary

The additive model of polypeptide chromatography, or achrom, is the most basic model for peptide retention time prediction. The main equation behind achrom has the following form:

RT = (1 + m\,\ln N) \sum_{i=1}^{N}{RC_i n_i} + RT_0

Here, RC_i is the retention coefficient of the amino acid residues of the i-th type, n_i corresponds to the number of amino acid residues of type i in the peptide sequence, N is the total number of different types of amino acid residues present, m is the length correction parameter, and RT_0 is a constant retention time shift.
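
The equation above can be evaluated directly once the coefficients are known. The sketch below is a literal transcription of the formula, not the library routine: additive_rt is a hypothetical function, the coefficient values are made up, and N is passed in explicitly by the caller as defined in the text.

```python
import math


def additive_rt(residue_counts, rcs, n, m=0.0, rt0=0.0):
    """Evaluate RT = (1 + m ln N) * sum(RC_i * n_i) + RT_0.

    residue_counts: residue type -> number of occurrences (n_i)
    rcs:            residue type -> retention coefficient (RC_i)
    n:              the N of the equation, supplied by the caller
    m:              length correction parameter
    rt0:            constant retention time shift
    """
    additive_part = sum(rcs[aa] * count for aa, count in residue_counts.items())
    return (1.0 + m * math.log(n)) * additive_part + rt0


# Made-up coefficients, for illustration only.
toy_rcs = {'A': 1.0, 'G': 0.5}
counts = {'A': 2, 'G': 1}

print(additive_rt(counts, toy_rcs, n=2, rt0=3.0))         # no length correction
print(additive_rt(counts, toy_rcs, n=2, m=0.1, rt0=3.0))  # with length correction
```

With m set to 0 the model reduces to a plain weighted sum of residue counts plus the constant shift.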

In order to use achrom, one needs to find the retention coefficients using experimentally determined retention times for a training set of peptides, i.e. to calibrate the model.

Calibration

get_RCs() - find a set of retention coefficients using a given set of peptides with known retention times and a fixed value of length correction parameter.

get_RCs_vary_lcp() - find the best length correction parameter and a set of retention coefficients for a given peptide sample.

Retention time calculation

calculate_RT() - calculate the retention time of a peptide using a given set of retention coefficients.

Data

RCs_guo_ph2_0 - a set of retention coefficients (RCs) from [2]. Conditions: Synchropak RP-P C18 column (250 x 4.1 mm I.D.), gradient (A = 0.1% aq. TFA, pH 2.0; B = 0.1% TFA in acetonitrile) at 1% B/min, flow rate 1 ml/min, 26 centigrades.

RCs_guo_ph7_0 - a set of retention coefficients (RCs) from [2]. Conditions: Synchropak RP-P C18 column (250 x 4.1 mm I.D.), gradient (A = aq. 10 mM (NH4)2HPO4 - 0.1 M NaClO4, pH 7.0; B = 0.1 M NaClO4 in 60% aq. acetonitrile) at 1.67% B/min, flow rate 1 ml/min, 26 centigrades.

RCs_meek_ph2_1 - a set of RCs from [1]. Conditions: Bio-Rad “ODS” column, gradient (A = 0.1 M NaClO4, 0.1% phosphoric acid in water; B = 0.1 M NaClO4, 0.1% phosphoric acid in 60% aq. acetonitrile) at 1.25% B/min, room temperature.

RCs_meek_ph7_4 - a set of RCs from [1]. Conditions: Bio-Rad “ODS” column, gradient (A = 0.1 M NaClO4, 5 mM phosphate buffer in water; B = 0.1 M NaClO4, 5 mM phosphate buffer in 60% aq. acetonitrile) at 1.25% B/min, room temperature.

RCs_browne_tfa - a set of RCs found in [7]. Conditions: Waters μBondapak C18 column, gradient (A = 0.1% aq. TFA, B = 0.1% TFA in acetonitrile) at 0.33% B/min, flow rate 1.5 ml/min.

RCs_browne_hfba - a set of RCs found in [7]. Conditions: Waters μBondapak C18 column, gradient (A = 0.13% aq. HFBA, B = 0.13% HFBA in acetonitrile) at 0.33% B/min, flow rate 1.5 ml/min.

RCs_palmblad - a set of RCs from [8]. Conditions: a fused silica column (80-100 x 0.200 mm I.D.) packed in-house with C18 ODS-AQ; solvent A = 0.5% aq. HAc, B = 0.5% HAc in acetonitrile.

RCs_yoshida - a set of RCs for normal phase chromatography from [9]. Conditions: TSK gel Amide-80 column (250 x 4.6 mm I.D.), gradient (A = 0.1% TFA in ACN-water (90:10); B = 0.1% TFA in ACN-water (55:45)) at 0.6% water/min, flow rate 1.0 ml/min, 40 centigrades.

RCs_yoshida_lc - a set of length-corrected RCs for normal phase chromatography. The set was calculated in [10] for the data from [9]. Conditions: TSK gel Amide-80 column (250 x 4.6 mm I.D.), gradient (A = 0.1% TFA in ACN-water (90:10); B = 0.1% TFA in ACN-water (55:45)) at 0.6% water/min, flow rate 1.0 ml/min, 40 centigrades.

RCs_zubarev - a set of length-corrected RCs calculated on a dataset used in [11]. Conditions: Reprosil-Pur C18-AQ column (150 x 0.075 mm I.D.), gradient (A = 0.5% AA in water; B = 0.5% AA in ACN-water (90:10)) at 0.5% water/min, flow rate 200.0 nl/min, room temperature.

RCs_gilar_atlantis_ph3_0 - a set of retention coefficients obtained in [12]. Conditions: Atlantis HILIC silica column, (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 3.0

RCs_gilar_atlantis_ph4_5 - a set of retention coefficients obtained in [12]. Conditions: Atlantis HILIC silica column, (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 4.5

RCs_gilar_atlantis_ph10_0 - a set of retention coefficients obtained in [12]. Conditions: Atlantis HILIC silica column, (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 10.0

RCs_gilar_beh - a set of retention coefficients obtained in [12]. Conditions: ACQUITY UPLC BEH HILIC column (150 x 2.1 mm I.D.), 1.7 um, 130 A, Mobile phase A: 10 mM ammonium formate buffer, pH 4.5 prepared by titrating 10 mM solution of FA with ammonium hydroxide. Mobile phase B: 90% ACN, 10% mobile phase A (v:v). Gradient: 90-60% B in 50 min.

RCs_gilar_beh_amide - a set of retention coefficients obtained in [12]. Conditions: ACQUITY UPLC BEH glycan column (150 x 2.1 mm I.D.), 1.7 um, 130 A, Mobile phase A: 10 mM ammonium formate buffer, pH 4.5 prepared by titrating 10 mM solution of FA with ammonium hydroxide. Mobile phase B: 90% ACN, 10% mobile phase A (v:v). Gradient: 90-60% B in 50 min.

RCs_gilar_rp - a set of retention coefficients obtained in [12]. Conditions: ACQUITY UPLC BEH C18 column (100 mm x 2.1 mm I.D.), 1.7 um, 130 A. Mobile phase A: 0.02% TFA in water, mobile phase B: 0.018% TFA in ACN. Gradient: 0 to 50% B in 50 min, flow rate 0.2 ml/min, temperature 40 C., pH 2.6.

RCs_krokhin_100A_fa - a set of retention coefficients obtained in [13]. Conditions: 300 um x 150mm PepMap100 (Dionex, 0.1% FA), packed with 5-um Luna C18(2) (Phenomenex, Torrance, CA), pH=2.0. Both eluents A (2% ACN in water) and B (98% ACN) contained 0.1% FA as ion-pairing modifier. 0.33% ACN/min linear gradient (0-30% B).

RCs_krokhin_100A_tfa - a set of retention coefficients obtained in [13]. Conditions: 300 um x 150mm PepMap100 (Dionex, 0.1% TFA), packed with 5-um Luna C18(2) (Phenomenex, Torrance, CA), pH=2.0. Both eluents A (2% ACN in water) and B (98% ACN) contained 0.1% TFA as ion-pairing modifier. 0.33% ACN/min linear gradient (0-30% B).

Theory

The additive model of polypeptide chromatography, or the model of retention coefficients, was the earliest attempt to describe the dependence of the retention time of a polypeptide in liquid chromatography on its sequence [1], [2]. In this model, each amino acid is assigned a number, the retention coefficient (RC), describing its retention properties. The retention time (RT) during a gradient elution is then calculated as:

RT = \sum_{i=1}^{i=N}{RC_i \cdot n_i} + RT_0,

which is the sum of retention coefficients of all amino acid residues in a polypeptide. This equation can also be expressed in terms of linear algebra:

RT = \bar{aa} \cdot \bar{RC} + RT_0,

where \bar{aa} is a vector of amino acid composition, i.e. \bar{aa}_i is the number of amino acid residues of i-th type in a polypeptide; \bar{RC} is a vector of respective retention coefficients.

In this formulation, it is clear that the additive model gives the same result for any two peptides with different sequences but the same amino acid composition. In other words, the additive model is not sequence-specific.
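The vector form and its insensitivity to residue order can be checked with a short sketch. The retention coefficients below are hypothetical numbers chosen for illustration, and the helper names are not part of the module.

```python
import math
from collections import Counter

# Hypothetical retention coefficients for the residues used below.
RCS = {'P': 2.0, 'E': 1.0, 'T': 0.5, 'I': 8.0, 'D': 0.2}
LABELS = sorted(RCS)


def composition_vector(sequence):
    """Return the amino acid composition as a vector ordered like LABELS."""
    counts = Counter(sequence)
    return [counts[aa] for aa in LABELS]


def additive_rt(sequence, rt0=0.0):
    """Dot product of the composition vector and the RC vector, plus RT_0."""
    aa_vector = composition_vector(sequence)
    rc_vector = [RCS[aa] for aa in LABELS]
    return sum(n * rc for n, rc in zip(aa_vector, rc_vector)) + rt0


rt_forward = additive_rt('PEPTIDE')
rt_reversed = additive_rt('EDITPEP')   # same residues in reverse order
```

Both sequences map to the same composition vector, so the two predicted values coincide.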

The additive model has two advantages over all other models of chromatography: it is easy to understand and easy to use. The rule behind the additive model is as simple as it could be: each amino acid residue shifts the retention time by a fixed value that depends only on its type. This rule allows a geometrical interpretation. Each peptide may be represented by a point in 21-dimensional space, with the first 20 coordinates equal to the numbers of the corresponding amino acid residues in the peptide and the 21st coordinate equal to RT. The additive model assumes that a line (more precisely, a hyperplane) may be drawn through these points. Of course, this assumption is only partially valid, and most points would not lie on the line. But the line would describe the main trend and could be used to estimate the retention time of peptides with known amino acid composition.

This best fit line is described by retention coefficients and RT_0. The procedure of finding these coefficients is called calibration. There is an analytical solution to calibration of linear models, which makes them especially useful in real applications.

Several attempts were made to improve the accuracy of prediction by the additive model (for a review of the field, see [3] and [4]). The two implemented in this module are the logarithmic length correction term described in [5] and additional sets of retention coefficients for terminal amino acid residues [6].

Logarithmic length correction

This enhancement was first described in [5]. Briefly, it was found that the following equation better describes the dependence of RT on the peptide sequence:

RT = \sum_{i=1}^{N}{RC_i} + m\,\ln N \sum_{i=1}^{N}{RC_i} + RT_0

Here the sums run over the N residues of the peptide, so N is the peptide length and RC_i is the retention coefficient of the i-th residue. We call the second term, m\,\ln N \sum_{i=1}^{N}{RC_i}, the length correction term, and m the length correction parameter. The simplified and vectorized form of this equation is:

RT = (1 + m\,\ln N) \, \bar{RC} \cdot \bar{aa} + RT_0

For a fixed value of m, this equation is linear in the retention coefficients and RT_0, so it may be solved by standard linear least-squares methods.
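One way to see the reduction to a linear problem: for a fixed m, each peptide contributes one row whose entries are the residue counts scaled by (1 + m ln N), plus a trailing 1 for RT_0. The sketch below solves the resulting least-squares problem through the normal equations in plain Python. It is an illustration under these assumptions, not necessarily how get_RCs() is implemented, and the names design_row, solve_linear and fit_rcs are hypothetical.

```python
import math
from collections import Counter


def design_row(sequence, labels, m):
    """Row of the linearised system: scaled residue counts plus a 1 for RT_0."""
    counts = Counter(sequence)
    scale = 1.0 + m * math.log(len(sequence))
    return [scale * counts[aa] for aa in labels] + [1.0]


def solve_linear(matrix, rhs):
    """Solve a square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_rcs(sequences, rts, labels, m):
    """Least-squares fit via the normal equations (A^T A) x = A^T b."""
    rows = [design_row(seq, labels, m) for seq in sequences]
    k = len(labels) + 1
    ata = [[sum(row[i] * row[j] for row in rows) for j in range(k)]
           for i in range(k)]
    atb = [sum(row[i] * rt for row, rt in zip(rows, rts)) for i in range(k)]
    x = solve_linear(ata, atb)
    return dict(zip(labels, x[:-1])), x[-1]


# Toy training set: with m = 0 the data are reproduced by A = 1, B = 2, RT_0 = 0.
rcs, rt0 = fit_rcs(['A', 'AA', 'B'], [1.0, 2.0, 2.0], labels=['A', 'B'], m=0.0)
```

The normal equations keep the sketch dependency-free; a numerically more robust solver would be preferable for real training sets with many, possibly correlated, residue types.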

Terminal retention coefficients

Another significant improvement may be obtained through introduction of separate sets of retention coefficients for terminal amino acid residues [6].
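A minimal sketch of the idea follows, assuming that terminal residues are labelled with the 'ntermX' and 'ctermX' prefixes used in the calculate_RT() examples of this module. The helper names are hypothetical and the handling of one-residue peptides is deliberately left out.

```python
from collections import Counter


def composition_with_termini(sequence):
    """Count residues, labelling the first and last ones 'ntermX' and 'ctermX'."""
    if len(sequence) < 2:
        raise ValueError('this sketch expects at least two residues')
    labels = list(sequence)
    labels[0] = 'nterm' + labels[0]
    labels[-1] = 'cterm' + labels[-1]
    return Counter(labels)


def rt_with_termini(sequence, rcs, rt0=0.0):
    """Additive RT where terminal residues use their own coefficients."""
    composition = composition_with_termini(sequence)
    return sum(rcs[label] * n for label, n in composition.items()) + rt0


rcs = {'ntermA': 1.0, 'A': 1.1, 'ctermA': 1.2}
rt = rt_with_termini('AAA', rcs, rt0=0.1)   # 1.0 + 1.1 + 1.2 + 0.1
```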

References

[1] Meek, J. L. Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition. PNAS, 1980, 77 (3), 1632-1636.
[2] Guo, D.; Mant, C. T.; Taneja, A. K.; Parker, J. M. R.; Hodges, R. S. Prediction of peptide retention times in reversed-phase high-performance liquid chromatography I. Determination of retention coefficients of amino acid residues of model synthetic peptides. Journal of Chromatography A, 1986, 359, 499-518.
[3] Baczek, T.; Kaliszan, R. Predictions of peptides’ retention times in reversed-phase liquid chromatography as a new supportive tool to improve protein identification in proteomics. Proteomics, 2009, 9 (4), 835-47.
[4] Babushok, V. I.; Zenkevich, I. G. Retention Characteristics of Peptides in RP-LC: Peptide Retention Prediction. Chromatographia, 2010, 72 (9-10), 781-797.
[5] Mant, C. T.; Zhou, N. E.; Hodges, R. S. Correlation of protein retention times in reversed-phase chromatography with polypeptide chain length and hydrophobicity. Journal of Chromatography A, 1989, 476, 363-375.
[6] Tripet, B.; Cepeniene, D.; Kovacs, J. M.; Mant, C. T.; Krokhin, O. V.; Hodges, R. S. Requirements for prediction of peptide retention time in reversed-phase high-performance liquid chromatography: hydrophilicity/hydrophobicity of side-chains at the N- and C-termini of peptides are dramatically affected by the end-groups and location. Journal of Chromatography A, 2007, 1141 (2), 212-25.
[7] Browne, C. A.; Bennett, H. P. J.; Solomon, S. The isolation of peptides by high-performance liquid chromatography using predicted elution positions. Analytical Biochemistry, 1982, 124 (1), 201-208.
[8] Palmblad, M.; Ramstrom, M.; Markides, K. E.; Hakansson, P.; Bergquist, J. Prediction of Chromatographic Retention and Protein Identification in Liquid Chromatography/Mass Spectrometry. Analytical Chemistry, 2002, 74 (22), 5826-5830.
[9] Yoshida, T. Calculation of peptide retention coefficients in normal-phase liquid chromatography. Journal of Chromatography A, 1998, 808 (1-2), 105-112.
[10] Moskovets, E.; Goloborodko A. A.; Gorshkov A. V.; Gorshkov M.V. Limitation of predictive 2-D liquid chromatography in reducing the database search space in shotgun proteomics: In silico studies. Journal of Separation Science, 2012, 35 (14), 1771-1778.
[11] Goloborodko A. A.; Mayerhofer C.; Zubarev A. R.; Tarasova I. A.; Gorshkov A. V.; Zubarev, R. A.; Gorshkov, M. V. Empirical approach to false discovery rate estimation in shotgun proteomics. Rapid Communications in Mass Spectrometry, 2010, 24 (4), 454-62.
[12] Gilar, M.; Jaworski, A. Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of Chromatography A, 2011, 1218 (49), 8890-6.
[13] Dwivedi, R. C.; Spicer, V.; Harder, M.; Antonovici, M.; Ens, W.; Standing, K. G.; Wilkins, J. A.; Krokhin, O. V. Practical implementation of 2D HPLC scheme with accurate peptide retention prediction in both dimensions for high-throughput bottom-up proteomics. Analytical Chemistry, 2008, 80 (18), 7036-42.
Dependencies

This module requires numpy.


pyteomics.achrom.RCs_browne_hfba

A set of retention coefficients determined in Browne, C. A.; Bennett, H. P. J.; Solomon, S. The isolation of peptides by high-performance liquid chromatography using predicted elution positions. Analytical Biochemistry, 1982, 124 (1), 201-208.

Conditions: Waters μBondapak C18 column, gradient (A = 0.13% aq. HFBA, B = 0.13% HFBA in acetonitrile) at 0.33% B/min, flow rate 1.5 ml/min.

pyteomics.achrom.RCs_browne_tfa

A set of retention coefficients determined in Browne, C. A.; Bennett, H. P. J.; Solomon, S. The isolation of peptides by high-performance liquid chromatography using predicted elution positions. Analytical Biochemistry, 1982, 124 (1), 201-208.

Conditions: Waters μBondapak C18 column, gradient (A = 0.1% aq. TFA, B = 0.1% TFA in acetonitrile) at 0.33% B/min, flow rate 1.5 ml/min.

pyteomics.achrom.RCs_gilar_atlantis_ph10_0

A set of retention coefficients for normal phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.

Note

Cysteine is Carbamidomethylated.

Conditions: Atlantis HILIC silica column (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 10.0

pyteomics.achrom.RCs_gilar_atlantis_ph3_0

A set of retention coefficients for normal phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.

Note

Cysteine is Carbamidomethylated.

Conditions: Atlantis HILIC silica column (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 3.0

pyteomics.achrom.RCs_gilar_atlantis_ph4_5

A set of retention coefficients for normal phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.

Note

Cysteine is Carbamidomethylated.

Conditions: Atlantis HILIC silica column (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 4.5

pyteomics.achrom.RCs_gilar_beh

A set of retention coefficients for normal phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.

Note

Cysteine is Carbamidomethylated.

Conditions: ACQUITY UPLC BEH HILIC column (150 x 2.1 mm I.D.), 1.7 um, 130 A, Mobile phase A: 10 mM ammonium formate buffer, pH 4.5 prepared by titrating 10 mM solution of FA with ammonium hydroxide. Mobile phase B: 90% ACN, 10% mobile phase A (v:v). Gradient: 90-60% B in 50 min.

pyteomics.achrom.RCs_gilar_beh_amide

A set of retention coefficients for normal phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.

Note

Cysteine is Carbamidomethylated.

Conditions: ACQUITY UPLC BEH glycan column (150 x 2.1 mm I.D.), 1.7 um, 130 A, Mobile phase A: 10 mM ammonium formate buffer, pH 4.5 prepared by titrating 10 mM solution of FA with ammonium hydroxide. Mobile phase B: 90% ACN, 10% mobile phase A (v:v). Gradient: 90-60% B in 50 min.

pyteomics.achrom.RCs_gilar_rp

A set of retention coefficients for reversed-phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.

Note

Cysteine is Carbamidomethylated.

Conditions: ACQUITY UPLC BEH C18 column (100 mm x 2.1 mm I.D.), 1.7 um, 130 A. Mobile phase A: 0.02% TFA in water, mobile phase B: 0.018% TFA in ACN. Gradient: 0 to 50% B in 50 min, flow rate 0.2 ml/min, temperature 40 C., pH 2.6.

pyteomics.achrom.RCs_guo_ph2_0

A set of retention coefficients from Guo, D.; Mant, C. T.; Taneja, A. K.; Parker, J. M. R.; Hodges, R. S. Prediction of peptide retention times in reversed-phase high-performance liquid chromatography I. Determination of retention coefficients of amino acid residues of model synthetic peptides. Journal of Chromatography A, 1986, 359, 499-518.

Conditions: Synchropak RP-P C18 column (250 x 4.1 mm I.D.), gradient (A = 0.1% aq. TFA, pH 2.0; B = 0.1% TFA in acetonitrile) at 1% B/min, flow rate 1 ml/min, 26 centigrades.

pyteomics.achrom.RCs_guo_ph7_0

A set of retention coefficients from Guo, D.; Mant, C. T.; Taneja, A. K.; Parker, J. M. R.; Hodges, R. S. Prediction of peptide retention times in reversed-phase high-performance liquid chromatography I. Determination of retention coefficients of amino acid residues of model synthetic peptides. Journal of Chromatography A, 1986, 359, 499-518.

Conditions: Synchropak RP-P C18 column (250 x 4.1 mm I.D.), gradient (A = aq. 10 mM (NH4)2HPO4 - 0.1 M NaClO4, pH 7.0; B = 0.1 M NaClO4 in 60% aq. acetonitrile) at 1.67% B/min, flow rate 1 ml/min, 26 centigrades.

pyteomics.achrom.RCs_krokhin_100A_fa

A set of retention coefficients from R.C. Dwivedi, V. Spicer, M. Harder, M. Antonovici, W. Ens, K.G. Standing, J.A. Wilkins, and O.V. Krokhin; Analytical Chemistry 2008 80 (18), 7036-7042. Practical Implementation of 2D HPLC Scheme with Accurate Peptide Retention Prediction in Both Dimensions for High-Throughput Bottom-Up Proteomics.

Note

Cysteine is Carbamidomethylated.

Conditions: 300 um x 150mm PepMap100 (Dionex, 0.1% FA), packed with 5-um Luna C18(2) (Phenomenex, Torrance, CA), pore size 100A, pH=2.0. Both eluents A (2% ACN in water) and B (98% ACN) contained 0.1% FA as ion-pairing modifier. 0.33% ACN/min linear gradient (0-30% B).

pyteomics.achrom.RCs_krokhin_100A_tfa

A set of retention coefficients from R.C. Dwivedi, V. Spicer, M. Harder, M. Antonovici, W. Ens, K.G. Standing, J.A. Wilkins, and O.V. Krokhin; Analytical Chemistry 2008 80 (18), 7036-7042. Practical Implementation of 2D HPLC Scheme with Accurate Peptide Retention Prediction in Both Dimensions for High-Throughput Bottom-Up Proteomics.

Note

Cysteine is Carbamidomethylated.

Conditions: 300 um x 150mm PepMap100 (Dionex, 0.1% TFA), packed with 5-um Luna C18(2) (Phenomenex, Torrance, CA), pore size 100 A, pH=2.0. Both eluents A (2% ACN in water) and B (98% ACN) contained 0.1% TFA as ion-pairing modifier. 0.33% ACN/min linear gradient (0-30% B).

pyteomics.achrom.RCs_meek_ph2_1

A set of retention coefficients determined in Meek, J. L. Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition. PNAS, 1980, 77 (3), 1632-1636.

Note

C stands for Cystine.

Conditions: Bio-Rad “ODS” column, gradient (A = 0.1 M NaClO4, 0.1% phosphoric acid in water; B = 0.1 M NaClO4, 0.1% phosphoric acid in 60% aq. acetonitrile) at 1.25% B/min, room temperature.

pyteomics.achrom.RCs_meek_ph7_4

A set of retention coefficients determined in Meek, J. L. Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition. PNAS, 1980, 77 (3), 1632-1636.

Note

C stands for Cystine.

Conditions: Bio-Rad “ODS” column, gradient (A = 0.1 M NaClO4, 5 mM phosphate buffer in water; B = 0.1 M NaClO4, 5 mM phosphate buffer in 60% aq. acetonitrile) at 1.25% B/min, room temperature.

pyteomics.achrom.RCs_palmblad

A set of retention coefficients determined in Palmblad, M.; Ramstrom, M.; Markides, K. E.; Hakansson, P.; Bergquist, J. Prediction of Chromatographic Retention and Protein Identification in Liquid Chromatography/Mass Spectrometry. Analytical Chemistry, 2002, 74 (22), 5826-5830.

Conditions: a fused silica column (80-100 x 0.200 mm I.D.) packed in-house with C18 ODS-AQ; solvent A = 0.5% aq. HAc, B = 0.5% HAc in acetonitrile.

pyteomics.achrom.RCs_yoshida

A set of retention coefficients determined in Yoshida, T. Calculation of peptide retention coefficients in normal-phase liquid chromatography. Journal of Chromatography A, 1998, 808 (1-2), 105-112.

Note

Cysteine is Carboxymethylated.

Conditions: TSK gel Amide-80 column (250 x 4.6 mm I.D.), gradient (A = 0.1% TFA in ACN-water (90:10); B = 0.1% TFA in ACN-water (55:45)) at 0.6% water/min, flow rate 1.0 ml/min, 40 centigrades.

pyteomics.achrom.RCs_yoshida_lc

A set of retention coefficients from the length-corrected model of normal-phase peptide chromatography. The dataset comes from Yoshida, T. Calculation of peptide retention coefficients in normal-phase liquid chromatography. Journal of Chromatography A, 1998, 808 (1-2), 105-112. The RCs were calculated in Moskovets, E.; Goloborodko A. A.; Gorshkov A. V.; Gorshkov M.V. Limitation of predictive 2-D liquid chromatography in reducing the database search space in shotgun proteomics: In silico studies. Journal of Separation Science, 2012, 35 (14), 1771-1778.

Note

Cysteine is Carboxymethylated.

Conditions: TSK gel Amide-80 column (250 x 4.6 mm I.D.), gradient (A = 0.1% TFA in ACN-water (90:10); B = 0.1% TFA in ACN-water (55:45)) at 0.6% water/min, flow rate 1.0 ml/min, 40 centigrades.

pyteomics.achrom.RCs_zubarev

A set of retention coefficients from the length-corrected model of reversed-phase peptide chromatography. The dataset was taken from Goloborodko A. A.; Mayerhofer C.; Zubarev A. R.; Tarasova I. A.; Gorshkov A. V.; Zubarev, R. A.; Gorshkov, M. V. Empirical approach to false discovery rate estimation in shotgun proteomics. Rapid communications in mass spectrometry, 2010, 24(4), 454-62.

Note

Cysteine is Carbamidomethylated.

Conditions: Reprosil-Pur C18-AQ column (150 x 0.075 mm I.D.), gradient (A = 0.5% AA in water; B = 0.5% AA in ACN-water (90:10)) at 0.5% water/min, flow rate 200.0 nl/min, room temperature.

pyteomics.achrom.calculate_RT(peptide, RC_dict, raise_no_mod=True)

Calculate the retention time of a peptide using a given set of retention coefficients.

Parameters:
  • peptide (str or dict) – A peptide sequence or amino acid composition.
  • RC_dict (dict) – A set of retention coefficients, length correction parameter and a fixed retention time shift. Keys are: ‘aa’, ‘lcp’ and ‘const’.
  • raise_no_mod (bool, optional) – If True, an exception is raised when a modified amino acid from peptide is not found in RC_dict. If False, the retention coefficient for the non-modified amino acid residue is used instead. True by default.
Returns:

RT – Calculated retention time.

Return type:

float

Examples

>>> from pyteomics.achrom import calculate_RT
>>> RT = calculate_RT('AA', {'aa': {'A': 1.1}, 'lcp': 0.0, 'const': 0.1})
>>> abs(RT - 2.3) < 1e-6      # Float comparison
True
>>> RT = calculate_RT('AAA', {'aa': {'ntermA': 1.0, 'A': 1.1, 'ctermA': 1.2},
...                           'lcp': 0.0, 'const': 0.1})
>>> abs(RT - 3.4) < 1e-6      # Float comparison
True
>>> RT = calculate_RT({'A': 3}, {'aa': {'ntermA': 1.0, 'A': 1.1, 'ctermA': 1.2},
...                              'lcp': 0.0, 'const': 0.1})
>>> abs(RT - 3.4) < 1e-6      # Float comparison
True
pyteomics.achrom.get_RCs(sequences, RTs, lcp=-0.21, term_aa=False, **kwargs)

Calculate the retention coefficients of amino acids using retention times of a peptide sample and a fixed value of length correction parameter.

Parameters:
  • sequences (list of str) – List of peptide sequences.
  • RTs (list of float) – List of corresponding retention times.
  • lcp (float, optional) – A multiplier before ln(L) term in the equation for the retention time of a peptide. Set to -0.21 by default.
  • term_aa (bool, optional) – If True, terminal amino acids are treated as being modified with ‘ntermX’/’ctermX’ modifications. False by default.
  • labels (list of str, optional) – List of all possible amino acids and terminal groups. If not given, any modX labels are allowed.
Returns:

RC_dict – Dictionary with the calculated retention coefficients.

  • RC_dict[‘aa’] – amino acid retention coefficients.
  • RC_dict[‘const’] – constant retention time shift.
  • RC_dict[‘lcp’] – length correction parameter.

Return type:

dict

Examples

>>> from pyteomics.achrom import get_RCs
>>> RCs = get_RCs(['A', 'AA'], [1.0, 2.0], 0.0, labels=['A'])
>>> abs(RCs['aa']['A'] - 1) < 1e-6 and abs(RCs['const']) < 1e-6
True
>>> RCs = get_RCs(['A', 'AA', 'B'], [1.0, 2.0, 2.0], 0.0, labels=['A', 'B'])
>>> (abs(RCs['aa']['A'] - 1) + abs(RCs['aa']['B'] - 2) +
...  abs(RCs['const'])) < 1e-6
True
pyteomics.achrom.get_RCs_vary_lcp(sequences, RTs, term_aa=False, lcp_range=(-1.0, 1.0), **kwargs)

Find the best combination of a length correction parameter and retention coefficients for a given peptide sample.

Parameters:
  • sequences (list of str) – List of peptide sequences.
  • RTs (list of float) – List of corresponding retention times.
  • term_aa (bool, optional) – If True, terminal amino acids are treated as being modified with ‘ntermX’/’ctermX’ modifications. False by default.
  • lcp_range (2-tuple of float, optional) – Range of possible values of the length correction parameter.
  • labels (list of str, optional) – List of labels for all possible amino acids and terminal groups. If not given, any modX labels are allowed.
  • lcp_accuracy (float, optional) – The accuracy of the length correction parameter calculation.
Returns:

RC_dict – Dictionary with the calculated retention coefficients.

  • RC_dict[‘aa’] – amino acid retention coefficients.
  • RC_dict[‘const’] – constant retention time shift.
  • RC_dict[‘lcp’] – length correction parameter.

Return type:

dict

Examples

>>> from pyteomics.achrom import get_RCs_vary_lcp
>>> RCs = get_RCs_vary_lcp(['A', 'AA', 'AAA'],
...                        [1.0, 2.0, 3.0],
...                        labels=['A'])
>>> abs(RCs['aa']['A'] - 1) + abs(RCs['lcp']) + abs(RCs['const']) < 1e-6
True

electrochem - electrochemical properties of polypeptides

Summary

This module is used to calculate the electrochemical properties of polypeptide molecules.

The theory behind most of this module is based on the Henderson-Hasselbalch equation and was thoroughly described in a number of sources [1], [2].

Briefly, the charge of a polypeptide at a given pH is calculated as follows:

Q_{peptide} = \sum{\frac{Q_i}{1+10^{Q_i(pH-pK_i)}}},

where the sum is taken over all ionizable groups of the polypeptide, and Q_i is -1 and +1 for acidic and basic functional groups, respectively.
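The formula translates almost literally into code. The sketch below is a plain-Python illustration and not the module's charge() function; the two ionizable groups and their pK values are made up for the example.

```python
def group_charge(pK, q, pH):
    """Average charge of one ionizable group; q is -1 (acidic) or +1 (basic)."""
    return q / (1.0 + 10.0 ** (q * (pH - pK)))


def peptide_charge(groups, pH):
    """Sum the contributions of all ionizable groups, given as (pK, q) tuples."""
    return sum(group_charge(pK, q, pH) for pK, q in groups)


# Illustrative groups with made-up pK values: one acidic and one basic.
groups = [(2.0, -1), (10.0, +1)]
charge_acidic = peptide_charge(groups, 1.0)    # net positive at low pH
charge_neutral = peptide_charge(groups, 6.0)   # the two contributions cancel
charge_basic = peptide_charge(groups, 11.0)    # net negative at high pH
```

At pH equal to its pK, a single group is half ionized and contributes a charge of magnitude 0.5.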

Charge and pI functions

charge() - calculate the charge of a polypeptide

pI() - calculate the isoelectric point of a polypeptide

GRand AVerage of hYdropathicity (GRAVY)
gravy() - calculate the GRAVY index of a polypeptide
Data

pK_lehninger - a set of pK from [3].

pK_sillero - a set of pK from [4].

pK_dawson - a set of pK from [5], the pK values for NH2- and -OH are taken from [4].

pK_rodwell - a set of pK from [6].

pK_bjellqvist - a set of pK from [7].

pK_nterm_bjellqvist - a set of N-terminal pK from [7].

pK_cterm_bjellqvist - a set of C-terminal pK from [7].

hydropathicity_KD - a set of hydropathicity indexes from [8].

References

[1] Aronson, J. N. The Henderson-Hasselbalch equation revisited. Biochemical Education, 1983, 11 (2), 68.
[2] Moore, D. S. Amino acid and peptide net charges: A simple calculational procedure. Biochemical Education, 1986, 13 (1), 10-12.
[3] Nelson, D. L.; Cox, M. M. Lehninger Principles of Biochemistry, Fourth Edition; W. H. Freeman, 2004; p. 1100.
[4] Sillero, A.; Ribeiro, J. Isoelectric points of proteins: Theoretical determination. Analytical Biochemistry, 1989, 179 (2), 319-325.
[5] Dawson, R. M. C.; Elliot, D. C.; Elliot, W. H.; Jones, K. M. Data for biochemical research. Oxford University Press, 1989; p. 592.
[6] Rodwell, J. Heterogeneity of component bands in isoelectric focusing patterns. Analytical Biochemistry, 1982, 119 (2), 440-449.
[7] Bjellqvist, B., Basse, B., Olsen, E. and Celis, J.E. Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis 1994, 15, 529-539.
[8] Kyte, J.; Doolittle, R. F. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 1982, 157 (1), 105-32.

pyteomics.electrochem.charge(sequence, pH, **kwargs)

Calculate the charge of a polypeptide at a given pH or list of pHs using a given list of amino acid electrochemical properties.

Warning

Be careful when supplying a list with a parsed sequence or a dict with amino acid composition as sequence. Such values must be obtained with the show_unmodified_termini option enabled.

Warning

If you provide pK_nterm or pK_cterm and provide sequence as a dict, it is assumed that it was obtained with term_aa=True (see pyteomics.parser.amino_acid_composition() for details).

Parameters:
  • sequence (str or list or dict) – A string with a polypeptide sequence, a list with a parsed sequence or a dict of amino acid composition.
  • pH (float or iterable of floats) – pH or iterable of pHs for which the charge is calculated.
  • pK (dict {str: [(float, int), ..]}, optional) – A set of pK of amino acids’ ionizable groups. It is a dict, where keys are amino acid labels and the values are lists of tuples (pK, charge_in_ionized_state), a tuple per ionizable group. The default value is pK_lehninger.
  • pK_nterm (dict {str: [(float, int),]}, optional) –
  • pK_cterm (dict {str: [(float, int),]}, optional) – Sets of pK of N-terminal and C-terminal (respectively) amino acids’ ionizable groups. Dicts with the same structure as pK. These values (if present) are used for N-terminal and C-terminal residues, respectively. If given, sequence must be a str or a list. The default value is an empty dict.
Returns:

out – A single value of charge or a list of charges.

Return type:

float or list of floats

pyteomics.electrochem.gravy(sequence, hydropathicity={'A': 1.8, 'C': 2.5, 'D': -3.5, 'E': -3.5, 'F': 2.8, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'K': -3.9, 'L': 3.8, 'M': 1.9, 'N': -3.5, 'P': -1.6, 'Q': -3.5, 'R': -4.5, 'S': -0.8, 'T': -0.7, 'V': 4.2, 'W': -0.9, 'Y': -1.3})

Calculate GRand AVerage of hYdropathicity (GRAVY) index for amino acid sequence.

Parameters:
  • sequence (str) – Polypeptide sequence in one-letter format.
  • hydropathicity (dict, optional) – Hydropathicity indexes of amino acids. Default is hydropathicity_KD.
Returns:

out – GRand AVerage of hYdropathicity (GRAVY) index.

Return type:

float

Examples

>>> round(gravy('PEPTIDE'), 4)
-1.4143
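For orientation, the GRAVY index is the arithmetic mean of the per-residue hydropathicity values. The sketch below spells this out in plain Python using the Kyte-Doolittle values listed in the signature above. The function name gravy_index is hypothetical, and the sketch makes no attempt to handle modified residues or unknown labels.

```python
# Kyte-Doolittle hydropathicity values, copied from the signature above.
KD = {'A': 1.8, 'C': 2.5, 'D': -3.5, 'E': -3.5, 'F': 2.8, 'G': -0.4,
      'H': -3.2, 'I': 4.5, 'K': -3.9, 'L': 3.8, 'M': 1.9, 'N': -3.5,
      'P': -1.6, 'Q': -3.5, 'R': -4.5, 'S': -0.8, 'T': -0.7, 'V': 4.2,
      'W': -0.9, 'Y': -1.3}


def gravy_index(sequence, hydropathicity=KD):
    """Mean hydropathicity over all residues of a one-letter sequence."""
    return sum(hydropathicity[aa] for aa in sequence) / len(sequence)


hydrophobic = gravy_index('AG')    # (1.8 - 0.4) / 2
hydrophilic = gravy_index('KR')    # (-3.9 - 4.5) / 2
```

Positive values indicate a predominantly hydrophobic sequence, negative values a hydrophilic one.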

pyteomics.electrochem.hydropathicity_KD

A set of hydropathicity indexes obtained from Kyte J., Doolittle F. J. Mol. Biol. 157:105-132 (1982).
pyteomics.electrochem.pI(sequence, pI_range=(0.0, 14.0), precision_pI=0.01, **kwargs)

Calculate the isoelectric point of a polypeptide using a given set of amino acids’ electrochemical properties.

Warning

Be careful when supplying a list with a parsed sequence or a dict with amino acid composition as sequence. Such values must be obtained with the show_unmodified_termini option enabled.

Parameters:
  • sequence (str or list or dict) – A string with a polypeptide sequence, a list with a parsed sequence or a dict of amino acid composition.
  • pI_range (tuple (float, float)) – The range of allowable pI values. Default is (0.0, 14.0).
  • precision_pI (float) – The precision of the calculated pI. Default is 0.01.
  • pK (dict {str: [(float, int), ..]}, optional) – A set of pK of amino acids’ ionizable groups. It is a dict, where keys are amino acid labels and the values are lists of tuples (pK, charge_in_ionized_state), a tuple per ionizable group. The default value is pK_lehninger.
  • pK_nterm (dict {str: [(float, int),]}, optional) –
  • pK_cterm (dict {str: [(float, int),]}, optional) – Sets of pK of N-terminal and C-terminal (respectively) amino acids’ ionizable groups. Dicts with the same structure as pK. These values (if present) are used for N-terminal and C-terminal residues, respectively. If given, sequence must be a str or a list. The default value is an empty dict.
Returns:

out – The calculated isoelectric point.

Return type:

float
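Because the net charge decreases monotonically as pH increases, the isoelectric point can be located by interval halving. The following sketch illustrates that idea with parameters loosely modelled on pI_range and precision_pI. It is an assumption about one reasonable approach, not a description of the module's internals, and the pK values are made up.

```python
def peptide_charge(groups, pH):
    """Net charge of a set of ionizable groups given as (pK, q) tuples."""
    return sum(q / (1.0 + 10.0 ** (q * (pH - pK))) for pK, q in groups)


def isoelectric_point(groups, pI_range=(0.0, 14.0), precision=0.01):
    """Bisect the pH interval until the charge sign change is bracketed tightly."""
    low, high = pI_range
    while high - low > precision:
        middle = (low + high) / 2.0
        if peptide_charge(groups, middle) > 0:
            low = middle      # still positively charged: pI lies at higher pH
        else:
            high = middle     # neutral or negative: pI lies at lower pH
    return (low + high) / 2.0


# Illustrative groups with made-up pK values; their charges cancel at pH 6.
groups = [(2.0, -1), (10.0, +1)]
estimate = isoelectric_point(groups)
```

The search assumes the net charge changes sign inside pI_range; otherwise the result simply converges to one of the interval ends.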

pyteomics.electrochem.pK_bjellqvist

A set of pK from Bjellqvist, B., Basse, B., Olsen, E. and Celis, J.E. Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis 1994, 15, 529-539.

pyteomics.electrochem.pK_cterm_bjellqvist

A set of C-terminal pK from Bjellqvist, B., Basse, B., Olsen, E. and Celis, J.E. Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis 1994, 15, 529-539.

pyteomics.electrochem.pK_dawson

A set of pK from Dawson, R. M. C.; Elliot, D. C.; Elliot, W. H.; Jones, K. M. Data for biochemical research. Oxford University Press, 1989; p. 592. pKs for NH2- and -OH are taken from pK_sillero.

pyteomics.electrochem.pK_lehninger

A set of pK from Nelson, D. L.; Cox, M. M. Lehninger Principles of Biochemistry, Fourth Edition; W. H. Freeman, 2004; p. 1100.

pyteomics.electrochem.pK_nterm_bjellqvist

A set of N-terminal pK from Bjellqvist, B., Basse, B., Olsen, E. and Celis, J.E. Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis 1994, 15, 529-539.

pyteomics.electrochem.pK_rodwell

A set of pK from Rodwell, J. Heterogeneity of component bands in isoelectric focusing patterns. Analytical Biochemistry, vol. 119 (2), pp. 440-449, 1982.

pyteomics.electrochem.pK_sillero

A set of pK from Sillero, A.; Ribeiro, J. Isoelectric points of proteins: Theoretical determination. Analytical Biochemistry, vol. 179 (2), pp. 319-325, 1989.

fasta - manipulations with FASTA databases

FASTA is a simple file format for protein sequence databases. Please refer to the NCBI website for the most detailed information on the format.

Data manipulation
Classes

Several classes of FASTA parsers are available. All of them have common features:

  • context manager support;
  • header parsing;
  • direct iteration.

Available classes:

FASTABase - common ancestor, suitable for type checking. Abstract class.

FASTA - text-mode, sequential parser. Good for iteration over database entries.

IndexedFASTA - binary-mode, indexing parser. Supports direct indexing by header string.

TwoLayerIndexedFASTA - additionally supports indexing by extracted header fields.

UniProt and IndexedUniProt, UniParc and IndexedUniParc, UniMes and IndexedUniMes, UniRef and IndexedUniRef, SPD and IndexedSPD, NCBI and IndexedNCBI, RefSeq and IndexedRefSeq - format-specific parsers.

Functions

read() - returns an instance of the appropriate reader class, for sequential iteration or random access.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

write() - write entries to a FASTA database.

parse() - parse a FASTA header.

Decoy sequence generation

decoy_sequence() - generate a decoy sequence from a given sequence, using one of the other functions listed in this section or any other callable.

reverse() - generate a reversed decoy sequence.

shuffle() - generate a shuffled decoy sequence.

fused_decoy() - generate a “fused” decoy sequence.

Decoy database generation

write_decoy_db() - generate a decoy database and write it to a file.

decoy_db() - generate entries for a decoy database from a given FASTA database.

decoy_chain() - a version of decoy_db() for multiple files.

decoy_chain.from_iterable() - like decoy_chain(), but with an iterable of files.

Auxiliary
std_parsers - a dictionary with parsers for known FASTA header formats.

pyteomics.fasta.chain(*args, **kwargs)

Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:
  • files – Iterable of file names or file objects.
pyteomics.fasta.decoy_chain(*args, **kwargs)

Chain decoy_db() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the decoy_db() function.

decoy_chain.from_iterable(files, **kwargs)

Chain decoy_db() for several files. Keyword arguments are passed to the decoy_db() function.

Parameters:
  • files – Iterable of file names or file objects.
class pyteomics.fasta.FASTA(source, ignore_comments=False, parser=None, encoding=None)[source]

Bases: pyteomics.fasta.FASTABase, pyteomics.auxiliary.file_helpers.FileReader

Text-mode, sequential FASTA parser. Suitable for iteration over the file to obtain all entries in order.

__init__(source, ignore_comments=False, parser=None, encoding=None)[source]

Create a new FASTA parser object. Supports iteration, yields (description, sequence) tuples. Supports with syntax.

Parameters:
  • source (str or file-like) – File to read. If file object, it must be opened in text mode.
  • ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.
  • parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.
  • encoding (str or None, optional) – File encoding (if it is given by name).
reset()

Resets the iterator to its initial state.
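
The sequential, text-mode behaviour described for FASTA can be sketched in plain Python. This is only a minimal illustration of yielding (description, sequence) tuples in file order; the function name iter_fasta is hypothetical, and the sketch does not handle multi-line descriptions, header parsing or encodings the way the real class does.

```python
import io


def iter_fasta(handle):
    """Yield (description, sequence) tuples from a text-mode file object."""
    description, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if description is not None:
                yield description, ''.join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if description is not None:
        yield description, ''.join(chunks)


# An in-memory file stands in for a FASTA database on disk.
example = io.StringIO('>prot1 first entry\nPEPT\nIDE\n>prot2 second entry\nMKR\n')
entries = list(iter_fasta(example))
```

Because entries are produced one at a time, this style suits a single pass over a large database but not lookups by header.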

class pyteomics.fasta.FASTABase(source, **kwargs)[source]

Bases: object

Abstract base class for FASTA file parsers. Can be used for type checking.

__init__(source, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class pyteomics.fasta.FlavoredMixin(parse=True)[source]

Bases: object

Parser aimed at a specific FASTA flavor. Subclasses should define parser and header_pattern. The parse argument in __init__() defines whether description is parsed in output.

__init__(parse=True)[source]

Initialize self. See help(type(self)) for accurate signature.

class pyteomics.fasta.IndexedFASTA(source, ignore_comments=False, parser=None, **kwargs)[source]

Bases: pyteomics.fasta.FASTABase, pyteomics.auxiliary.file_helpers.TaskMappingMixin, pyteomics.auxiliary.file_helpers.IndexedTextReader

Indexed FASTA parser. Supports direct indexing by matched labels.

__init__(source, ignore_comments=False, parser=None, **kwargs)[source]

Create an indexed FASTA parser object.

Parameters:
  • source (str or file-like) – File to read. If file object, it must be opened in binary mode.
  • ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.
  • parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.
  • encoding (str or None, optional, keyword only) – File encoding. Default is UTF-8.
  • block_size (int or None, optional, keyword only) – Number of bytes to consume at once.
  • delimiter (str or None, optional, keyword only) – Overrides the FASTA record delimiter (default is '\n>').
  • label (str or None, optional, keyword only) – Overrides the FASTA record label pattern. Default is '^[\n]?>(.*)'.
  • label_group (int or str, optional, keyword only) – Overrides the matched group used as key in the byte offset index. This in combination with label can be used to extract fields from headers. However, consider using TwoLayerIndexedFASTA for this purpose.
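
The idea of a byte offset index keyed by header can be sketched as follows. This is an assumption-laden illustration, not the Pyteomics implementation: the helper names build_offset_index and fetch are hypothetical, the whole file is read at once instead of in blocks of block_size bytes, and a simplified label pattern is used.

```python
import io
import re


def build_offset_index(handle):
    """Map each header (without '>') to the byte offset of its record."""
    data = handle.read()
    index = {}
    for match in re.finditer(rb'^>(.*)$', data, flags=re.MULTILINE):
        index[match.group(1).decode('utf-8').strip()] = match.start()
    return index


def fetch(handle, index, key):
    """Seek to the stored offset and read a single record."""
    handle.seek(index[key])
    header = handle.readline().decode('utf-8')[1:].strip()
    chunks = []
    for line in handle:
        if line.startswith(b'>'):
            break  # the next record starts here
        chunks.append(line.decode('utf-8').strip())
    return header, ''.join(chunks)


# The file object is binary, as required for byte offsets to be meaningful.
database = io.BytesIO(b'>prot1 first entry\nPEPT\nIDE\n>prot2 second entry\nMKR\n')
index = build_offset_index(database)
second = fetch(database, index, 'prot2 second entry')
first = fetch(database, index, 'prot1 first entry')
```

Records can be fetched in any order because each lookup seeks directly to the stored offset, which is why the source has to be opened in binary mode.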
map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.
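
Since results are returned out of order, it helps if the target function returns an identifier together with its result. The sketch below shows this pattern with the standard library only. Note the substitution: it uses a thread pool from concurrent.futures rather than worker processes, and map_unordered and sequence_length are hypothetical names, so it mirrors the calling pattern and not the actual implementation of map().

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def sequence_length(entry):
    """Target function: keep the description so results can be matched later."""
    description, sequence = entry
    return description, len(sequence)


def map_unordered(target, entries, workers=2):
    """Yield target(entry) for every entry, in completion order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(target, entry) for entry in entries]
        for future in as_completed(futures):
            yield future.result()


entries = [('prot1', 'PEPTIDE'), ('prot2', 'MKR'), ('prot3', 'ACDEFGHIK')]
# Collecting into a dict makes the arrival order irrelevant.
lengths = dict(map_unordered(sequence_length, entries))
```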

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedNCBI(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.NCBIMixin, pyteomics.fasta.TwoLayerIndexedFASTA

Indexed parser for NCBI FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedNCBI object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.
build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedRefSeq(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.RefSeqMixin, pyteomics.fasta.TwoLayerIndexedFASTA

Indexed parser for RefSeq FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedRefSeq object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.
build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedSPD(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.SPDMixin, pyteomics.fasta.TwoLayerIndexedFASTA

Indexed parser for SPD FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedSPD object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.
build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedUniMes(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.UniMesMixin, pyteomics.fasta.TwoLayerIndexedFASTA

Indexed parser for UniMes FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedUniMes object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.
build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedUniParc(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.UniParcMixin, pyteomics.fasta.TwoLayerIndexedFASTA

Indexed parser for UniParc FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedUniParc object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.
build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedUniProt(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.UniProtMixin, pyteomics.fasta.TwoLayerIndexedFASTA

Indexed parser for UniProt FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedUniProt object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.
build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.IndexedUniRef(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.UniRefMixin, pyteomics.fasta.TwoLayerIndexedFASTA

Indexed parser for UniRef FASTA files.

__init__(source, parse=True, **kwargs)

Creates an IndexedUniRef object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in binary mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.
build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.NCBI(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.NCBIMixin, pyteomics.fasta.FASTA

Text-mode parser for NCBI FASTA files.

__init__(source, parse=True, **kwargs)

Creates an NCBI object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the FASTA constructor.
reset()

Resets the iterator to its initial state.

class pyteomics.fasta.NCBIMixin(parse=True)[source]

Bases: pyteomics.fasta.FlavoredMixin

__init__(parse=True)

Initialize self. See help(type(self)) for accurate signature.

class pyteomics.fasta.RefSeq(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.RefSeqMixin, pyteomics.fasta.FASTA

Text-mode parser for RefSeq FASTA files.

__init__(source, parse=True, **kwargs)

Creates a RefSeq object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the FASTA constructor.
reset()

Resets the iterator to its initial state.

class pyteomics.fasta.RefSeqMixin(parse=True)[source]

Bases: pyteomics.fasta.FlavoredMixin

__init__(parse=True)

Initialize self. See help(type(self)) for accurate signature.

class pyteomics.fasta.SPD(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.SPDMixin, pyteomics.fasta.FASTA

Text-mode parser for SPD FASTA files.

__init__(source, parse=True, **kwargs)

Creates an SPD object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the FASTA constructor.
reset()

Resets the iterator to its initial state.

class pyteomics.fasta.SPDMixin(parse=True)[source]

Bases: pyteomics.fasta.FlavoredMixin

__init__(parse=True)

Initialize self. See help(type(self)) for accurate signature.

class pyteomics.fasta.TwoLayerIndexedFASTA(source, header_pattern=None, header_group=None, ignore_comments=False, parser=None, **kwargs)[source]

Bases: pyteomics.fasta.IndexedFASTA

Parser with two-layer index. Extracted groups are mapped to full headers (where possible), full headers are mapped to byte offsets.

When indexed, the key is looked up in both indexes, allowing access by meaningful IDs (like UniProt accession) and by full header string.

__init__(source, header_pattern=None, header_group=None, ignore_comments=False, parser=None, **kwargs)[source]

Open source and create a two-layer index for convenient random access both by full header strings and extracted fields.

Parameters:
  • source (str or file-like) – File to read. If file object, it must be opened in binary mode.
  • header_pattern (str or RE or None, optional) – Pattern to match the header string. Must capture the group used for the second index. If None (default), second-level index is not created.
  • header_group (int or str or None, optional) – Defines which group is used as key in the second-level index. Default is 1.
  • ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.
  • parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.
  • **kwargs – Other arguments, passed on to the IndexedFASTA constructor.
build_second_index()[source]

Create the mapping from extracted field to whole header string.

get_by_id(key)[source]

Get the entry by value of header string or extracted field.
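
The two-layer lookup can be sketched with two dictionaries. Everything in this example is illustrative: the header, the accession X00001, the pattern and the byte offset are made up, and the helper names only echo the method names above without reproducing their implementation.

```python
import re


def build_second_index(headers, header_pattern, header_group=1):
    """Map the extracted field of each header to the full header string."""
    second_index = {}
    for header in headers:
        match = re.match(header_pattern, header)
        if match:  # headers that do not match are reachable by full string only
            second_index[match.group(header_group)] = header
    return second_index


def get_by_id(key, first_index, second_index):
    """Look the key up as a full header first, then as an extracted field."""
    if key in first_index:
        return first_index[key]
    return first_index[second_index[key]]


# First layer: full header -> byte offset (offset is made up here).
full_header = 'sp|X00001|DEMO_HUMAN Demo protein'
first_index = {full_header: 0}
# Second layer: extracted accession -> full header.
second_index = build_second_index(first_index, r'^\w+\|([^|]+)\|')

by_accession = get_by_id('X00001', first_index, second_index)
by_header = get_by_id(full_header, first_index, second_index)
```

Keeping the byte offsets in the first layer only means the second layer stays small and can be rebuilt without re-reading the file.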

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.fasta.UniMes(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.UniMesMixin, pyteomics.fasta.FASTA

Text-mode parser for UniMes FASTA files.

__init__(source, parse=True, **kwargs)

Creates a UniMes object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the FASTA constructor.
reset()

Resets the iterator to its initial state.

class pyteomics.fasta.UniMesMixin(parse=True)[source]

Bases: pyteomics.fasta.FlavoredMixin

__init__(parse=True)

Initialize self. See help(type(self)) for accurate signature.

class pyteomics.fasta.UniParc(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.UniParcMixin, pyteomics.fasta.FASTA

Text-mode parser for UniParc FASTA files.

__init__(source, parse=True, **kwargs)

Creates a UniParc object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the FASTA constructor.
reset()

Resets the iterator to its initial state.

class pyteomics.fasta.UniParcMixin(parse=True)[source]

Bases: pyteomics.fasta.FlavoredMixin

__init__(parse=True)

Initialize self. See help(type(self)) for accurate signature.

class pyteomics.fasta.UniProt(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.UniProtMixin, pyteomics.fasta.FASTA

Text-mode parser for UniProt FASTA files.

__init__(source, parse=True, **kwargs)

Creates a UniProt object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the FASTA constructor.
reset()

Resets the iterator to its initial state.

class pyteomics.fasta.UniProtMixin(parse=True)[source]

Bases: pyteomics.fasta.FlavoredMixin

__init__(parse=True)

Initialize self. See help(type(self)) for accurate signature.

class pyteomics.fasta.UniRef(source, parse=True, **kwargs)[source]

Bases: pyteomics.fasta.UniRefMixin, pyteomics.fasta.FASTA

Text-mode parser for UniRef FASTA files.

__init__(source, parse=True, **kwargs)

Creates a UniRef object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in text mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the FASTA constructor.
reset()

Resets the iterator to its initial state.

class pyteomics.fasta.UniRefMixin(parse=True)[source]

Bases: pyteomics.fasta.FlavoredMixin

__init__(parse=True)

Initialize self. See help(type(self)) for accurate signature.

pyteomics.fasta.decoy_db(source=None, mode='reverse', prefix='DECOY_', decoy_only=False, ignore_comments=False, parser=None, **kwargs)[source]

Iterate over sequences for a decoy database out of a given source.

Parameters:
  • source (file-like object or str or None, optional) – A path to a FASTA database or a file object itself. Default is None, which means read standard input.
  • mode (str or callable, optional) – Algorithm of decoy sequence generation. ‘reverse’ by default. See decoy_sequence() for more information.
  • prefix (str, optional) – A prefix to the protein descriptions of decoy entries. The default value is ‘DECOY_’.
  • decoy_only (bool, optional) – If set to True, only the decoy entries will be written to output. If False, the entries from source will be written first. False by default.
  • ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False.
  • parser (function or None, optional) – Defines whether the fasta descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format guessing. Default is None, which means return the header “as is”.
  • **kwargs – Given to decoy_sequence().
Returns:

out – An iterator over entries of the new database.

Return type:

iterator
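
The order of entries and the role of prefix and decoy_only can be illustrated with a small generator. This is a sketch under simplifying assumptions: it works on an in-memory list of (description, sequence) tuples instead of a file, supports only reversal, and decoy_entries is a hypothetical name.

```python
def decoy_entries(entries, prefix='DECOY_', decoy_only=False):
    """Yield target entries first (unless decoy_only), then prefixed decoys."""
    entries = list(entries)  # the entries are traversed twice
    if not decoy_only:
        for description, sequence in entries:
            yield description, sequence
    for description, sequence in entries:
        # The decoy keeps the description, marked with the prefix.
        yield prefix + description, sequence[::-1]


targets = [('prot1', 'PEPTIDE'), ('prot2', 'MKR')]
combined = list(decoy_entries(targets))
decoys = list(decoy_entries(targets, decoy_only=True))
```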

pyteomics.fasta.decoy_sequence(sequence, mode='reverse', **kwargs)[source]

Create a decoy sequence out of a given sequence string.

Parameters:
  • sequence (str) – The initial sequence string.
  • mode (str or callable, optional) –

    Type of decoy sequence. Should be one of the standard modes or any callable. Standard modes are:

      • ‘reverse’ for reverse();
      • ‘shuffle’ for shuffle();
      • ‘fused’ for fused_decoy().

    Default is ‘reverse’.

  • **kwargs – Given to the decoy function.
Returns:

decoy_sequence – The decoy sequence.

Return type:

str
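
Accepting either a mode name or a callable amounts to a simple dispatch, sketched below. The names make_decoy, STANDARD_MODES and swap_ends are hypothetical, and only one standard mode is registered to keep the example short.

```python
def reverse_sketch(sequence):
    """Simplified stand-in for a standard decoy function."""
    return sequence[::-1]


# Hypothetical registry of mode names; a real one would hold more entries.
STANDARD_MODES = {'reverse': reverse_sketch}


def make_decoy(sequence, mode='reverse', **kwargs):
    """Resolve mode to a function and apply it, forwarding extra arguments."""
    decoy_function = mode if callable(mode) else STANDARD_MODES[mode]
    return decoy_function(sequence, **kwargs)


def swap_ends(sequence):
    """A made-up custom decoy function: exchange the terminal residues."""
    if len(sequence) < 2:
        return sequence
    return sequence[-1] + sequence[1:-1] + sequence[0]


by_name = make_decoy('PEPT')                # uses the registered mode
by_callable = make_decoy('PEPT', swap_ends)  # uses the custom function
```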

pyteomics.fasta.fused_decoy(sequence, decoy_mode='reverse', sep='R', **kwargs)[source]

Create a “fused” decoy sequence by concatenating a decoy sequence with the original one. The method and its use cases are described in:

Ivanov, M. V., Levitsky, L. I., & Gorshkov, M. V. (2016). Adaptation of Decoy Fusion Strategy for Existing Multi-Stage Search Workflows. Journal of The American Society for Mass Spectrometry, 27(9), 1579-1582.

Parameters:
  • sequence (str) – The initial sequence string.
  • decoy_mode (str or callable, optional) –

    Type of decoy sequence to use. Should be one of the standard modes or any callable. Standard modes are:

      • ‘reverse’ for reverse();
      • ‘shuffle’ for shuffle().

    Default is ‘reverse’.

  • sep (str, optional) – Amino acid motif that separates the decoy sequence from the target one. This setting should reflect the enzyme specificity used in the search against the database being generated. Default is ‘R’, which is suitable for trypsin searches.
  • **kwargs – Given to the decoy generation function.

Examples

>>> fused_decoy('PEPT')
'TPEPRPEPT'
>>> fused_decoy('MPEPT', 'shuffle', 'K', keep_nterm=True)
'MPPTEKMPEPT'
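
The first example can be reproduced with a few lines of plain Python, which makes the structure of a fused sequence explicit: decoy part, separator, then the unchanged target. The function name fused_decoy_sketch and its decoy_function parameter are hypothetical, and the sketch is not the library code.

```python
def reverse_sketch(sequence):
    """Simplified stand-in for the default decoy mode."""
    return sequence[::-1]


def fused_decoy_sketch(sequence, decoy_function=reverse_sketch, sep='R'):
    """Concatenate decoy, separator and target, in that order."""
    return decoy_function(sequence) + sep + sequence


fused = fused_decoy_sketch('PEPT')
fused_lys = fused_decoy_sketch('PEPT', sep='K')
```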
pyteomics.fasta.parse(header, flavor='auto', parsers=None)[source]

Parse the FASTA header and return a nice dictionary.

Parameters:
  • header (str) – FASTA header to parse
  • flavor (str, optional) – Short name of the header format (case-insensitive). Valid values are 'auto' and keys of the parsers dict. Default is 'auto', which means try all formats in turn and return the first result that can be obtained without an exception.
  • parsers (dict, optional) – A dict where keys are format names (lowercased) and values are functions that take a header string and return the parsed header.
Returns:

out – A dictionary with the info from the header. The format depends on the flavor.

Return type:

dict
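
The 'auto' behaviour, trying every parser in turn and returning the first result obtained without an exception, can be sketched as below. Both header formats and all names in this example (parse_piped, parse_simple, example_parsers, parse_header) are invented for illustration and do not correspond to entries of std_parsers.

```python
import re


def parse_piped(header):
    """Hypothetical parser for headers like 'db|id|entry description'."""
    match = re.match(r'^(\w+)\|([^|]+)\|(\S+)\s+(.*)$', header)
    if match is None:
        raise ValueError('not a piped header')
    db, identifier, entry, name = match.groups()
    return {'db': db, 'id': identifier, 'entry': entry, 'name': name}


def parse_simple(header):
    """Hypothetical parser for headers like 'id description'."""
    identifier, _, name = header.partition(' ')
    if not name:
        raise ValueError('no description')
    return {'id': identifier, 'name': name}


# Keys are lowercased format names; stricter formats come first.
example_parsers = {'piped': parse_piped, 'simple': parse_simple}


def parse_header(header, flavor='auto', parsers=example_parsers):
    """Use the named parser, or try all of them when flavor is 'auto'."""
    if flavor.lower() == 'auto':
        for parser in parsers.values():
            try:
                return parser(header)
            except Exception:
                continue  # this format does not fit, try the next one
        raise ValueError('Unknown header format: ' + header)
    return parsers[flavor.lower()](header)


auto_piped = parse_header('db|X1|ENTRY some protein')
auto_simple = parse_header('X1 some protein')
forced = parse_header('db|X1|ENTRY some protein', flavor='Simple')
```

With automatic recognition, the order of the parsers matters whenever a header is valid in more than one format, so naming the flavor explicitly gives more predictable results.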

pyteomics.fasta.read(source=None, use_index=None, flavor=None, **kwargs)[source]

Parse a FASTA file. This function serves as a dispatcher between different parsers available in this module.

Parameters:
  • source (str or file or None, optional) – A file object (or file name) with a FASTA database. Default is None, which means read standard input.
  • use_index (bool, optional) – If True, the created parser object will be an instance of IndexedFASTA. If False (default), it will be an instance of FASTA.
  • flavor (str or None, optional) –

    A supported FASTA header format. If specified, a format-specific parser instance is returned.

    Note

    See std_parsers for supported flavors.

Returns:

out – A named 2-tuple with FASTA header (str or dict) and sequence (str). Attributes ‘description’ and ‘sequence’ are also provided.

Return type:

iterator of tuples
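
To make the shape of the returned records concrete, here is a minimal stand-alone sketch of a FASTA reader that yields named 2-tuples. It uses only the standard library, assumes an already open text stream, and is not the Pyteomics implementation; the names Protein and read_fasta_sketch are hypothetical.

```python
import io
from collections import namedtuple

Protein = namedtuple('Protein', ('description', 'sequence'))


def read_fasta_sketch(source):
    """Yield (description, sequence) records from an open text stream."""
    description, chunks = None, []
    for line in source:
        line = line.strip()
        if line.startswith('>'):
            if description is not None:
                yield Protein(description, ''.join(chunks))
            description, chunks = line[1:], []
        elif line:
            chunks.append(line)       # sequences may span several lines
    if description is not None:
        yield Protein(description, ''.join(chunks))


demo = io.StringIO('>prot1 first\nPEPT\nIDE\n>prot2 second\nMKLV\n')
records = list(read_fasta_sketch(demo))
```

Each record can be unpacked as a plain tuple or accessed through the description and sequence attributes.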

pyteomics.fasta.reverse(sequence, keep_nterm=False, keep_cterm=False)[source]

Create a decoy sequence by reversing the original one.

Parameters:
  • sequence (str) – The initial sequence string.
  • keep_nterm (bool, optional) – If True, then the N-terminal residue will be kept. Default is False.
  • keep_cterm (bool, optional) – If True, then the C-terminal residue will be kept. Default is False.
Returns:

decoy_sequence – The decoy sequence.

Return type:

str
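
The effect of keep_nterm and keep_cterm can be illustrated with a short sketch. This is an assumption about the behaviour based on the parameter descriptions above, not the library source; reverse_sketch is a hypothetical name.

```python
def reverse_sketch(sequence, keep_nterm=False, keep_cterm=False):
    """Reverse a sequence, optionally leaving the terminal residues in place."""
    start = 1 if keep_nterm else 0
    stop = len(sequence) - 1 if keep_cterm else len(sequence)
    stop = max(start, stop)           # guard for very short sequences
    return sequence[:start] + sequence[start:stop][::-1] + sequence[stop:]


print(reverse_sketch('MKLV'))                      # VLKM
print(reverse_sketch('MKLV', keep_nterm=True))     # MVLK
print(reverse_sketch('MKLV', keep_cterm=True))     # LKMV
```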

pyteomics.fasta.shuffle(sequence, keep_nterm=False, keep_cterm=False)[source]

Create a decoy sequence by shuffling the original one.

Parameters:
  • sequence (str) – The initial sequence string.
  • keep_nterm (bool, optional) – If True, then the N-terminal residue will be kept. Default is False.
  • keep_cterm (bool, optional) – If True, then the C-terminal residue will be kept. Default is False.
Returns:

decoy_sequence – The decoy sequence.

Return type:

str
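
A shuffled decoy keeps the amino acid composition of the target and changes only the order. The sketch below illustrates this with the standard library; shuffle_sketch and its rng parameter are hypothetical and added here only so that the result can be made reproducible.

```python
import random


def shuffle_sketch(sequence, keep_nterm=False, keep_cterm=False, rng=random):
    """Shuffle a sequence, optionally leaving the terminal residues in place."""
    start = 1 if keep_nterm else 0
    stop = len(sequence) - 1 if keep_cterm else len(sequence)
    stop = max(start, stop)           # guard for very short sequences
    middle = list(sequence[start:stop])
    rng.shuffle(middle)               # shuffle the free residues in place
    return sequence[:start] + ''.join(middle) + sequence[stop:]


decoy = shuffle_sketch('MPEPTIDEK', keep_nterm=True, keep_cterm=True,
                       rng=random.Random(0))
```

Passing a seeded random.Random instance makes the decoy reproducible between runs, which can be useful when a decoy database has to be regenerated.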

pyteomics.fasta.std_parsers

A dictionary with parsers for known FASTA header formats. For now, the supported formats are those described at the UniProt help page.

pyteomics.fasta.write(entries, output=None)[source]

Create a FASTA file with entries.

Parameters:
  • entries (iterable of (str, str) tuples) – An iterable of 2-tuples in the form (description, sequence).
  • output (file-like or str, optional) – A file open for writing or a path to write to. If the file exists, it will be opened for appending. Default is None, which means write to standard output.
  • file_mode (str, keyword only, optional) – If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘a’.
Returns:

output_file – The file where the FASTA is written.

Return type:

file object
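
The expected input, an iterable of (description, sequence) 2-tuples, can be illustrated with a small stand-alone writer. This is a sketch under stated assumptions: write_fasta_sketch is a hypothetical name, and the line width is chosen for the example and is not taken from the library.

```python
import io


def write_fasta_sketch(entries, output, width=60):
    """Write (description, sequence) pairs to an open text stream."""
    for description, sequence in entries:
        output.write('>' + description + '\n')
        for i in range(0, len(sequence), width):
            output.write(sequence[i:i + width] + '\n')   # wrap long sequences
    return output


buffer = write_fasta_sketch([('prot1 first', 'PEPTIDE'), ('prot2', 'MKLV')],
                            io.StringIO(), width=4)
print(buffer.getvalue())
```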

pyteomics.fasta.write_decoy_db(source=None, output=None, mode='reverse', prefix='DECOY_', decoy_only=False, **kwargs)[source]

Generate a decoy database out of a given source and write to file.

If output is a path, the file will be opened for appending, so no information will be lost if the file exists. However, be careful when providing open file streams as source and output: reading and writing start from the current position in each file, which is where the last I/O operation finished. Use the file.seek() method to change it.

Parameters:
  • source (file-like object or str or None, optional) – A path to a FASTA database or a file object itself. Default is None, which means read standard input.
  • output (file-like object or str, optional) – A path to the output database or a file open for writing. Default is None, which means the results go to standard output.
  • mode (str or callable, optional) – Algorithm of decoy sequence generation. ‘reverse’ by default. See decoy_sequence() for more details.
  • prefix (str, optional) – A prefix to the protein descriptions of decoy entries. The default value is ‘DECOY_’.
  • decoy_only (bool, optional) – If set to True, only the decoy entries will be written to output. If False, the entries from source will be written as well. False by default.
  • file_mode (str, keyword only, optional) – If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘a’.
  • **kwargs – Passed to decoy_sequence().
Returns:

output – A (closed) file object for the created file.

Return type:

file
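
The warning about open file streams can be demonstrated with the standard library alone. The sketch below uses an in-memory stream as a stand-in for a FASTA file and shows that reading continues from the current position until seek() is called.

```python
import io

stream = io.StringIO('>prot1\nPEPTIDE\n>prot2\nMKLV\n')

first_line = stream.readline()   # consumes the first header line
rest = stream.read()             # continues from the current position

stream.seek(0)                   # rewind before handing the stream on
everything = stream.read()
```

If such a stream were passed as source after the first readline() call and without rewinding, the first header would be missing from the input.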

peff - PSI Extended FASTA Format

PEFF is a forthcoming standard from PSI-HUPO formalizing and extending the encoding of protein features and annotations for building search spaces for proteomics. See The PEFF specification for more up-to-date information on the standard.

Data manipulation
Classes

The PEFF parser inherits several properties from the implementation in the fasta module, building on top of the TwoLayerIndexedFASTA reader.

Available classes:

IndexedPEFF - Parse a PEFF format file in binary mode, supporting direct indexing by header string or by tag.
class pyteomics.peff.Header(mapping, original=None)[source]

Bases: collections.abc.Mapping

Hold parsed properties of a key-value pair like a sequence’s definition line.

This object supports the Mapping interface, and keys may be accessed by attribute access notation.

__init__(mapping, original=None)[source]

Initialize self. See help(type(self)) for accurate signature.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
items() → a set-like object providing a view on D's items[source]
keys() → a set-like object providing a view on D's keys[source]
values() → an object providing a view on D's values[source]
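
A mapping whose keys are also reachable as attributes can be built on collections.abc.Mapping. The class below is a minimal sketch of that pattern, not the Header source; the name HeaderSketch and the keys used in the example are hypothetical.

```python
from collections.abc import Mapping


class HeaderSketch(Mapping):
    """Read-only mapping with attribute-style access to its keys."""

    def __init__(self, mapping, original=None):
        self._mapping = dict(mapping)
        self.original = original

    def __getitem__(self, key):
        return self._mapping[key]

    def __iter__(self):
        return iter(self._mapping)

    def __len__(self):
        return len(self._mapping)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name == '_mapping':        # avoid recursion before __init__ runs
            raise AttributeError(name)
        try:
            return self._mapping[name]
        except KeyError:
            raise AttributeError(name) from None


header = HeaderSketch({'Prefix': 'sp', 'Tag': 'P12345'}, original='sp:P12345')
print(header['Tag'], header.Tag)      # P12345 P12345
```

Methods such as get(), keys(), items() and values() come for free from the Mapping base class once __getitem__, __iter__ and __len__ are defined.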
class pyteomics.peff.IndexedPEFF(source, ignore_comments=False, **kwargs)[source]

Bases: pyteomics.fasta.TwoLayerIndexedFASTA

Creates an IndexedPEFF object.

Parameters:
  • source (str or file) – The file to read. If a file object, it needs to be in rb mode.
  • parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
  • **kwargs – Passed to the TwoLayerIndexedFASTA constructor.
__init__(source, ignore_comments=False, **kwargs)[source]

Open source and create a two-layer index for convenient random access both by full header strings and extracted fields.

Parameters:
  • source (str or file-like) – File to read. If file object, it must be opened in binary mode.
  • header_pattern (str or RE or None, optional) – Pattern to match the header string. Must capture the group used for the second index. If None (default), second-level index is not created.
  • header_group (int or str or None, optional) – Defines which group is used as key in the second-level index. Default is 1.
  • ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.
  • parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.
  • **kwargs – Other arguments.
build_second_index()

Create the mapping from extracted field to whole header string.

get_by_id(key)

Get the entry by value of header string or extracted field.
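
The two-layer lookup can be sketched as follows. This is a simplified illustration that keeps the entries in an in-memory dict instead of a file index; the function names and the header pattern are assumptions made for the example.

```python
import re


def build_second_index_sketch(headers, header_pattern, header_group=1):
    """Map the field captured by header_pattern to the whole header string."""
    pattern = re.compile(header_pattern)
    index = {}
    for header in headers:
        match = pattern.match(header)
        if match is not None:
            index[match.group(header_group)] = header
    return index


def get_by_id_sketch(key, entries, second_index):
    """Look up by full header first, then by extracted field."""
    if key in entries:
        return entries[key]
    return entries[second_index[key]]


entries = {'sp|P12345|ABC_HUMAN Example protein': 'PEPTIDE',
           'sp|Q67890|XYZ_HUMAN Another protein': 'MKLV'}
second_index = build_second_index_sketch(entries, r'^\w+\|(\w+)\|')
print(get_by_id_sketch('Q67890', entries, second_index))   # MKLV
```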

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over the entries of this object in parallel. The processes parameter sets the maximum number of worker processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.
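
The calling convention (each entry is passed to target together with args and kwargs, and results arrive out of order) can be sketched with the standard library. Note the swapped-in technique: this sketch uses a thread pool from concurrent.futures rather than worker processes, and map_sketch and sequence_length are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def map_sketch(entries, target, workers=2, args=(), kwargs=None):
    """Apply target to every entry and yield results as they complete."""
    kwargs = kwargs or {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(target, entry, *args, **kwargs)
                   for entry in entries]
        for future in as_completed(futures):
            yield future.result()     # completion order, not input order


def sequence_length(entry, scale=1):
    description, sequence = entry
    return description, len(sequence) * scale


entries = [('prot1', 'PEPTIDE'), ('prot2', 'MKLV')]
# Sort the results because their order is not guaranteed.
results = sorted(map_sketch(entries, sequence_length, kwargs={'scale': 2}))
```

Returning an identifier together with each result, as sequence_length does here, is one way to match results back to entries when the order is lost.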

reset()

Resets the iterator to its initial state.

mzml - reader for mass spectrometry data in mzML format

Summary

mzML is a standard rich XML-format for raw mass spectrometry data storage. Please refer to psidev.info for the detailed specification of the format and structure of mzML files.

This module provides a minimalistic way to extract information from mzML files. You can use the old functional interface (read()) or the new object-oriented interface (MzML or PreIndexedMzML) to iterate over entries in <spectrum> elements. MzML and PreIndexedMzML also support direct indexing with spectrum IDs.

Data access

MzML - a class representing a single mzML file. Other data access functions use this class internally.

PreIndexedMzML - a class representing a single mzML file. Uses byte offsets listed at the end of the file for quick access to spectrum elements.

read() - iterate through spectra in mzML file. Data from a single spectrum are converted to a human-readable dict. Spectra themselves are stored under ‘m/z array’ and ‘intensity array’ keys.

chain() - read multiple mzML files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

Deprecated functions

version_info() - get version information about the mzML file. You can just read the corresponding attribute of the MzML object.

iterfind() - iterate over elements in an mzML file. You can just call the corresponding method of the MzML object.

Dependencies

This module requires lxml and numpy.


pyteomics.mzml.chain(*sources, **kwargs)

Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.

Parameters:
  • sources (Iterable) – Sources for creating new sequences from, such as paths or file-like objects.
  • kwargs (Mapping) – Additional arguments used to instantiate each sequence.
chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:files – Iterable of file names or file objects.
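
Conceptually, chaining readers is the same operation as itertools.chain.from_iterable applied to one iterator per source. The sketch below shows the idea with a hypothetical single-source reader, read_ids_sketch, standing in for read(); it is not the Pyteomics implementation.

```python
import io
from itertools import chain


def read_ids_sketch(source):
    """Hypothetical single-source reader: one record per non-empty line."""
    for line in source:
        line = line.strip()
        if line:
            yield line


def chain_sketch(*sources):
    """Iterate over the records of several sources, one source after another."""
    return chain.from_iterable(read_ids_sketch(s) for s in sources)


first = io.StringIO('scan=1\nscan=2\n')
second = io.StringIO('scan=3\n')
combined = list(chain_sketch(first, second))
```

Because the generator expression is lazy, each source is only read when the previous one is exhausted.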
pyteomics.mzml.version_info(source)

Provide version information about the mzML file.

Note

This function is provided for backward compatibility only. It simply creates an MzML instance and returns its version_info attribute.

Parameters:source (str or file) – File name or file-like object.
Returns:out – A (version, schema URL) tuple, both elements are strings or None.
Return type:tuple
pyteomics.mzml.iterfind(source, path, **kwargs)[source]

Parse source and yield info on elements with specified local name or by specified “XPath”.

Note

This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create an MzML object and use its iterfind() method.

Parameters:
  • source (str or file) – File name or file-like object.
  • path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzML header. Otherwise, use default parameters. Not recommended without an Internet connection or if you want to avoid the related warnings.
  • decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.
Returns:

out

Return type:

iterator
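
Matching elements by local name, without specifying namespaces, can be approximated with the standard library. The sketch below uses xml.etree.ElementTree.iterparse on a tiny in-memory document; it illustrates the idea only and does not reproduce the path syntax or the output format of iterfind(). The helper names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]


def iterfind_sketch(source, name):
    """Yield the attributes of every element whose local name matches."""
    for _, elem in ET.iterparse(source, events=('end',)):
        if local_name(elem.tag) == name:
            yield dict(elem.attrib)
            elem.clear()              # free memory, as in iterative parsing


doc = io.BytesIO(
    b'<mzML xmlns="http://psi.hupo.org/ms/mzml"><run>'
    b'<spectrum id="scan=1" index="0"/>'
    b'<spectrum id="scan=2" index="1"/>'
    b'</run></mzML>')
found = list(iterfind_sketch(doc, 'spectrum'))
```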

class pyteomics.mzml.MzML(*args, **kwargs)[source]

Bases: pyteomics.xml.ArrayConversionMixin, pyteomics.auxiliary.file_helpers.TimeOrderedIndexedReaderMixin, pyteomics.xml.MultiProcessingXML, pyteomics.xml.IndexSavingXML

Parser class for mzML files.

__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class binary_array_record

Bases: pyteomics.auxiliary.utils.binary_array_record

Hold all of the information about a base64 encoded array needed to decode the array.

__init__

Initialize self. See help(type(self)) for accurate signature.

compression

Alias for field number 1

count()

Return number of occurrences of value.

data

Alias for field number 0

decode()

Decode data into a numerical array

Returns:
Return type:np.ndarray
dtype

Alias for field number 2

index()

Return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 4

source

Alias for field number 3

build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

decode_data_array(source, compression_type=None, dtype=numpy.float64)

Decode a base64-encoded, compressed bytestring into a numerical array.

Parameters:
  • source (bytes) – A base64 string encoding a potentially compressed numerical array.
  • compression_type (str, optional) – The name of the compression method used before encoding the array into base64.
  • dtype (type, optional) – The data type to use to decode the binary array from the decompressed bytes.
Returns:

Return type:

np.ndarray
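
The decoding steps (base64 decoding, optional decompression, interpretation of the bytes as numbers) can be sketched with the standard library. This is a simplified illustration under stated assumptions: it returns a list instead of a NumPy array, assumes zlib compression and little-endian values, and uses a hypothetical boolean compressed flag in place of compression_type.

```python
import base64
import struct
import zlib


def decode_data_array_sketch(source, compressed=False, fmt='d'):
    """Decode a base64 string into a list of floats ('d' = 64-bit, 'f' = 32-bit)."""
    raw = base64.b64decode(source)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // struct.calcsize('<' + fmt)
    return list(struct.unpack('<%d%s' % (count, fmt), raw))


# Round trip: encode three doubles, then decode them again.
values = [100.5, 200.25, 300.0]
encoded = base64.b64encode(zlib.compress(struct.pack('<3d', *values)))
decoded = decode_data_array_sketch(encoded, compressed=True)
```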

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict
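
Retrieval through an offset index amounts to remembering the byte position of each entry and seeking to it later. The sketch below illustrates this on an in-memory file with one element per line; the function names, the regular expression and the one-element-per-line layout are assumptions made for the example, not the Pyteomics implementation.

```python
import io
import re


def build_offset_index_sketch(stream):
    """Scan the file once and record the byte offset of every spectrum."""
    index = {}
    stream.seek(0)
    offset = 0
    for line in stream:
        match = re.search(rb'<spectrum id="([^"]+)"', line)
        if match is not None:
            index[match.group(1).decode()] = offset + match.start()
        offset += len(line)
    return index


def get_spectrum_sketch(stream, index, elem_id):
    """Jump straight to the stored offset instead of re-reading the file."""
    stream.seek(index[elem_id])
    return stream.readline().strip()


stream = io.BytesIO(b'<run>\n'
                    b'  <spectrum id="scan=1"/>\n'
                    b'  <spectrum id="scan=2"/>\n'
                    b'</run>\n')
index = build_offset_index_sketch(stream)
print(get_spectrum_sketch(stream, index, 'scan=2'))
```

The file has to be opened in binary mode for this to work, since the offsets are byte positions.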

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs – Passed to self._get_info_smart().
Returns:

out

Return type:

iterator

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over the entries of this object in parallel. The processes parameter sets the maximum number of worker processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters:path (str) – The path to the file to parse
reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

class pyteomics.mzml.PreIndexedMzML(*args, **kwargs)[source]

Bases: pyteomics.mzml.MzML

Parser class for mzML files, subclass of MzML. Uses byte offsets listed at the end of the file for quick access to spectrum elements.

__init__(*args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

class binary_array_record

Bases: pyteomics.auxiliary.utils.binary_array_record

Hold all of the information about a base64 encoded array needed to decode the array.

__init__

Initialize self. See help(type(self)) for accurate signature.

compression

Alias for field number 1

count()

Return number of occurrences of value.

data

Alias for field number 0

decode()

Decode data into a numerical array

Returns:
Return type:np.ndarray
dtype

Alias for field number 2

index()

Return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 4

source

Alias for field number 3

build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

decode_data_array(source, compression_type=None, dtype=numpy.float64)

Decode a base64-encoded, compressed bytestring into a numerical array.

Parameters:
  • source (bytes) – A base64 string encoding a potentially compressed numerical array.
  • compression_type (str, optional) – The name of the compression method used before encoding the array into base64.
  • dtype (type, optional) – The data type to use to decode the binary array from the decompressed bytes.
Returns:

Return type:

np.ndarray

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs – Passed to self._get_info_smart().
Returns:

out

Return type:

iterator

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over the entries of this object in parallel. The processes parameter sets the maximum number of worker processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters:path (str) – The path to the file to parse
reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

pyteomics.mzml.read(source, read_schema=False, iterative=True, use_index=False, dtype=None, huge_tree=False)[source]

Parse source and iterate through spectra.

Parameters:
  • source (str or file) – A path to a target mzML file or the file object itself.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzML header. Otherwise, use default parameters. Not recommended without an Internet connection or if you want to avoid the related warnings.
  • iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for spectrum elements. Default is False.
  • dtype (type or dict, optional) – dtype to convert arrays to, one for both m/z and intensity arrays or one for each key. If dict, keys should be ‘m/z array’ and ‘intensity array’.
  • decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.
  • huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
Returns:

out – An iterator over the dicts with spectrum properties.

Return type:

iterator

mzxml - reader for mass spectrometry data in mzXML format

Summary

mzXML is a (formerly) standard XML-format for raw mass spectrometry data storage, intended to be replaced with mzML.

This module provides a minimalistic way to extract information from mzXML files. You can use the old functional interface (read()) or the new object-oriented interface (MzXML) to iterate over entries in <scan> elements. MzXML also supports direct indexing with scan IDs.

Data access

MzXML - a class representing a single mzXML file. Other data access functions use this class internally.

read() - iterate through spectra in mzXML file. Data from a single scan are converted to a human-readable dict. Spectra themselves are stored under ‘m/z array’ and ‘intensity array’ keys.

chain() - read multiple mzXML files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

Deprecated functions

version_info() - get version information about the mzXML file. You can just read the corresponding attribute of the MzXML object.

iterfind() - iterate over elements in an mzXML file. You can just call the corresponding method of the MzXML object.

Dependencies

This module requires lxml and numpy.


pyteomics.mzxml.chain(*sources, **kwargs)

Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.

Parameters:
  • sources (Iterable) – Sources for creating new sequences from, such as paths or file-like objects.
  • kwargs (Mapping) – Additional arguments used to instantiate each sequence.
chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:files – Iterable of file names or file objects.
pyteomics.mzxml.version_info(source)

Provide version information about the mzXML file.

Note

This function is provided for backward compatibility only. It simply creates an MzXML instance and returns its version_info attribute.

Parameters:source (str or file) – File name or file-like object.
Returns:out – A (version, schema URL) tuple, both elements are strings or None.
Return type:tuple
pyteomics.mzxml.iterfind(source, path, **kwargs)[source]

Parse source and yield info on elements with specified local name or by specified XPath.

Note

This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create an MzXML object and use its iterfind() method.

Parameters:
  • source (str or file) – File name or file-like object.
  • path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzXML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you want to avoid the related warnings.
  • decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.
Returns:

out

Return type:

iterator

class pyteomics.mzxml.MzXML(*args, **kwargs)[source]

Bases: pyteomics.xml.ArrayConversionMixin, pyteomics.auxiliary.file_helpers.TimeOrderedIndexedReaderMixin, pyteomics.xml.MultiProcessingXML, pyteomics.xml.IndexSavingXML

Parser class for mzXML files.

__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class binary_array_record

Bases: pyteomics.auxiliary.utils.binary_array_record

Hold all of the information about a base64 encoded array needed to decode the array.

__init__

Initialize self. See help(type(self)) for accurate signature.

compression

Alias for field number 1

count()

Return number of occurrences of value.

data

Alias for field number 0

decode()

Decode data into a numerical array

Returns:
Return type:np.ndarray
dtype

Alias for field number 2

index()

Return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 4

source

Alias for field number 3

build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

decode_data_array(source, compression_type=None, dtype=numpy.float64)

Decode a base64-encoded, compressed bytestring into a numerical array.

Parameters:
  • source (bytes) – A base64 string encoding a potentially compressed numerical array.
  • compression_type (str, optional) – The name of the compression method used before encoding the array into base64.
  • dtype (type, optional) – The data type to use to decode the binary array from the decompressed bytes.
Returns:

Return type:

np.ndarray

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict

iterfind(path, **kwargs)[source]

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs – Passed to self._get_info_smart().
Returns:

out

Return type:

iterator

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over the entries of this object in parallel. The processes parameter sets the maximum number of worker processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters:path (str) – The path to the file to parse
reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

pyteomics.mzxml.iterfind(source, path, **kwargs)[source]

Parse source and yield info on elements with specified local name or by specified XPath.

Note

This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create an MzXML object and use its iterfind() method.

Parameters:
  • source (str or file) – File name or file-like object.
  • path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzXML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or to suppress the related warnings.
  • decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.
Returns:

out

Return type:

iterator
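
The remark about filtering with plain Python can be sketched without the parser. The path accepts a single condition, whereas a generator expression over the yielded dicts can combine several. The dicts and key names below are assumptions made for illustration, not output from a real file.

```python
# Hypothetical stand-ins for the dicts an iterfind() call might yield;
# the key names are assumptions, not taken from a real file.
elements = [
    {"num": "1", "msLevel": 1, "retentionTime": 0.5},
    {"num": "2", "msLevel": 2, "retentionTime": 1.7},
    {"num": "3", "msLevel": 2, "retentionTime": 3.2},
]


def select(items, ms_level, rt_min, rt_max):
    """Lazily yield items that satisfy several conditions at once."""
    return (
        item
        for item in items
        if item["msLevel"] == ms_level
        and rt_min <= item["retentionTime"] <= rt_max
    )


selected = list(select(elements, ms_level=2, rt_min=1.0, rt_max=2.0))
```

Since select() returns a generator, it can wrap any iterator of dicts without loading all elements into memory first.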

pyteomics.mzxml.read(source, read_schema=False, iterative=True, use_index=False, dtype=None, huge_tree=False)[source]

Parse source and iterate through spectra.

Parameters:
  • source (str or file) – A path to a target mzXML file or the file object itself.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzXML header. Otherwise, use default parameters. Not recommended without an Internet connection or if you want to avoid the related warnings.
  • iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for spectrum elements. Default is False.
  • decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.
  • huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
Returns:

out – An iterator over the dicts with spectrum properties.

Return type:

iterator

mgf - read and write MS/MS data in Mascot Generic Format

Summary

MGF is a simple human-readable format for MS/MS data. It allows storing MS/MS peak lists and experimental parameters.

This module provides classes and functions for access to data stored in MGF files. Parsing is done using MGF and IndexedMGF classes. The read() function can be used as an entry point. MGF spectra are converted to dictionaries. MS/MS data points are (optionally) represented as numpy arrays. Also, common parameters can be read from MGF file header with read_header() function. write() allows creation of MGF files.
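
To make the conversion of spectra to dictionaries concrete, here is a minimal sketch of a parser for a tiny MGF-like text. It is not the module's implementation: the parse_mgf function is hypothetical, it keeps all parameter values as plain strings, and it uses lists instead of numpy arrays.

```python
import io

MGF_TEXT = """\
BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
CHARGE=3+
846.6 73.0
846.8 44.0
END IONS
"""


def parse_mgf(lines):
    """Yield one dict per BEGIN IONS ... END IONS block."""
    spectrum = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            spectrum = {"m/z array": [], "intensity array": [], "params": {}}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None and line:
            if "=" in line and not line[0].isdigit():
                # KEY=value line: store under the lowercased key.
                key, value = line.split("=", 1)
                spectrum["params"][key.lower()] = value
            else:
                # Peak line: m/z followed by intensity.
                fields = line.split()
                spectrum["m/z array"].append(float(fields[0]))
                spectrum["intensity array"].append(float(fields[1]))


spectra = list(parse_mgf(io.StringIO(MGF_TEXT)))
```

The resulting dict has the same three core keys described for the reader classes below ('m/z array', 'intensity array' and 'params').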

Classes

MGF - a text-mode MGF parser. Suitable to read spectra from a file consecutively. Needs a file opened in text mode (or will open it if given a file name).

IndexedMGF - a binary-mode MGF parser. When created, builds a byte offset index for fast random access by spectrum titles. Sequential iteration is also supported. Needs a seekable file opened in binary mode (if created from existing file object).

MGFBase - abstract class, the common ancestor of the two classes above. Can be used for type checking.
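
The idea of a byte offset index can be sketched as follows. This is an illustration under simplifying assumptions, not the IndexedMGF implementation: build_index and read_block are hypothetical helpers, and the data is an in-memory binary buffer. It shows why a seekable file opened in binary mode is needed: offsets are counted in bytes and passed to seek(), while tell() on a text-mode file returns an opaque value.

```python
import io

DATA = (
    b"BEGIN IONS\nTITLE=first\n100.0 1.0\nEND IONS\n"
    b"BEGIN IONS\nTITLE=second\n200.0 2.0\nEND IONS\n"
)


def build_index(fh):
    """Map each spectrum title to the byte offset of its BEGIN IONS line."""
    index = {}
    block_start = None
    while True:
        offset = fh.tell()
        line = fh.readline()
        if not line:
            break
        if line.strip() == b"BEGIN IONS":
            block_start = offset
        elif line.startswith(b"TITLE=") and block_start is not None:
            index[line[6:].strip().decode("utf-8")] = block_start
    return index


def read_block(fh, offset):
    """Seek to a stored offset and read lines up to END IONS."""
    fh.seek(offset)
    lines = []
    for line in iter(fh.readline, b""):
        lines.append(line.decode("utf-8").rstrip("\n"))
        if line.strip() == b"END IONS":
            break
    return lines


fh = io.BytesIO(DATA)
index = build_index(fh)           # one pass over the whole file
second = read_block(fh, index["second"])  # direct jump, no scanning
```

Building the index costs one full pass, after which each lookup is a dictionary access plus a single seek. A text-mode sequential parser has to scan from the start to find a given title.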

Functions

read() - iterate through spectra in MGF file. Data from a single spectrum are converted to a human-readable dict.

get_spectrum() - read a single spectrum with given title from a file.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

read_header() - get a dict with common parameters for all spectra from the beginning of MGF file.

write() - write an MGF file.


pyteomics.mgf.chain(*args, **kwargs)

Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:files (iterable) – Iterable of file names or file objects.
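
The chaining behaviour resembles itertools.chain from the standard library. The sketch below shows the general idea with a hypothetical stand-in for read(); it makes no claim about how the module implements chain().

```python
from itertools import chain


def read_numbers(text):
    """Hypothetical stand-in for read(): lazily yield one entry per line."""
    for line in text.splitlines():
        yield float(line)


# Strings play the role of files here.
sources = ["1.0\n2.0", "3.0"]

# Each source is handed to the reader function only when the previous
# one is exhausted, and entries come out as one continuous stream.
entries = list(chain.from_iterable(read_numbers(s) for s in sources))
```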
class pyteomics.mgf.IndexedMGF(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)[source]

Bases: pyteomics.mgf.MGFBase, pyteomics.auxiliary.file_helpers.TaskMappingMixin, pyteomics.auxiliary.file_helpers.TimeOrderedIndexedReaderMixin, pyteomics.auxiliary.file_helpers.IndexSavingTextReader

A class representing an MGF file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by title using the indexing syntax in constant time. If created using a file object, it needs to be opened in binary mode.

When iterated, IndexedMGF object yields spectra one by one. Each ‘spectrum’ is a dict with four keys: ‘m/z array’, ‘intensity array’, ‘charge array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MGF, lowercased).

header

The file header.

Type:dict
time

A property used for accessing spectra by retention time.

Type:RTLocator
__init__(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)[source]

Create an MGF file object, set MGF-specific parameters.

Parameters:
  • source (str or file or None, optional) – A file object (or file name) with data in MGF format. Default is None, which means read standard input.
  • use_header (bool, optional, keyword only) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
  • convert_arrays (one of {0, 1, 2}, optional, keyword only) – If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
  • read_charges (bool, optional, keyword only) – If True (default), fragment charges are reported. Disabling it improves performance.
  • dtype (type or str or dict, optional, keyword only) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’ and/or ‘charge array’.
  • encoding (str, optional, keyword only) – File encoding.
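
The override rule described for use_header behaves like a dictionary merge in which spectrum-specific values win. A minimal sketch with made-up parameter values (the keys and values are assumptions for illustration):

```python
# Parameters from the file header (hypothetical values).
header = {"charge": "2+ and 3+", "instrument": "ESI-TRAP"}

# Parameters given for one particular spectrum (hypothetical values).
spectrum_params = {"title": "Spectrum 1", "charge": "3+"}

# Later keys win in a dict merge, so spectrum-specific values
# replace header values that share the same key.
merged = {**header, **spectrum_params}
```

Here merged["charge"] is taken from the spectrum, while "instrument" is inherited from the header.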
map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

classmethod prebuild_byte_offset_file(path)

Construct a new reader, build its byte offset index and write it to file

Parameters:path (str) – The path to the file to parse
reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

class pyteomics.mgf.MGF(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding=None)[source]

Bases: pyteomics.mgf.MGFBase, pyteomics.auxiliary.file_helpers.FileReader

A class representing an MGF file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by title using the indexing syntax (if the file is seekable), but it takes linear time to search through the file. Consider using IndexedMGF for constant-time access to spectra.

MGF object behaves as an iterator, yielding spectra one by one. Each ‘spectrum’ is a dict with four keys: ‘m/z array’, ‘intensity array’, ‘charge array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MGF, lowercased).

header

The file header.

Type:dict
__init__(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding=None)[source]

Create an MGF file object, set MGF-specific parameters.

Parameters:
  • source (str or file or None, optional) – A file object (or file name) with data in MGF format. Default is None, which means read standard input.
  • use_header (bool, optional, keyword only) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
  • convert_arrays (one of {0, 1, 2}, optional, keyword only) – If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
  • read_charges (bool, optional, keyword only) – If True (default), fragment charges are reported. Disabling it improves performance.
  • dtype (type or str or dict, optional, keyword only) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’ and/or ‘charge array’.
  • encoding (str, optional, keyword only) – File encoding.
reset()

Resets the iterator to its initial state.

class pyteomics.mgf.MGFBase(source=None, **kwargs)[source]

Bases: object

Abstract mixin class representing an MGF file. Subclasses implement different approaches to parsing.

__init__(source=None, **kwargs)[source]

Create an MGF file object, set MGF-specific parameters.

Parameters:
  • source (str or file or None, optional) – A file object (or file name) with data in MGF format. Default is None, which means read standard input.
  • use_header (bool, optional, keyword only) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
  • convert_arrays (one of {0, 1, 2}, optional, keyword only) – If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
  • read_charges (bool, optional, keyword only) – If True (default), fragment charges are reported. Disabling it improves performance.
  • dtype (type or str or dict, optional, keyword only) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’ and/or ‘charge array’.
  • encoding (str, optional, keyword only) – File encoding.
pyteomics.mgf.get_spectrum(source, title, *args, **kwargs)[source]

Read one spectrum (with given title) from source.

See read() for explanation of parameters affecting the output.

Note

Only the key-value pairs after the “TITLE =” line will be included in the output.

Parameters:
  • source (str or file or None) – File to read from.
  • title (str) – Spectrum title.
  • *args – Given to read().
  • **kwargs – Given to read().
Returns:

out – A dict with the spectrum, if it is found, and None otherwise.

Return type:

dict or None

pyteomics.mgf.read(*args, **kwargs)[source]

Returns a reader for a given MGF file. Most of the parameters repeat the instantiation signature of MGF and IndexedMGF. Additional parameter use_index helps decide which class to instantiate for given source.

Parameters:
  • source (str or file or None, optional) – A file object (or file name) with data in MGF format. Default is None, which means read standard input.
  • use_header (bool, optional) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
  • convert_arrays (one of {0, 1, 2}, optional) – If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
  • read_charges (bool, optional) – If True (default), fragment charges are reported. Disabling it improves performance.
  • dtype (type or str or dict, optional) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’ and/or ‘charge array’.
  • encoding (str, optional) – File encoding.
  • use_index (bool, optional) –

    Determines which parsing method to use. If True (default), an instance of IndexedMGF is created. This facilitates random access by spectrum titles. If an open file is passed as source, it needs to be open in binary mode.

    If False, an instance of MGF is created. It reads source in text mode and is suitable for iterative parsing. Access by spectrum title requires linear search and thus takes linear time.

  • block_size (int, optional) – Size of the chunk (in bytes) used to parse the file when creating the byte offset index. (Accepted only for IndexedMGF.)
Returns:

out – Instance of MGF or IndexedMGF.

Return type:

MGFBase

pyteomics.mgf.read_header(source)[source]

Read the specified MGF file, get search parameters specified in the header as a dict, the keys corresponding to MGF format (lowercased).

Parameters:source (str or file) – File name or file object representing a file in MGF format.
Returns:header
Return type:dict
pyteomics.mgf.write(spectra, output=None, header='', key_order=['title', 'pepmass', 'rtinseconds', 'charge'], fragment_format=None, write_charges=True, use_numpy=None, param_formatters={'charge': <function _charge_repr>, 'pepmass': <function _pepmass_repr>})[source]

Create a file in MGF format.

Parameters:
  • spectra (iterable) –

    A sequence of dictionaries with keys ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ should be sequences of int, float, or str. Strings will be written ‘as is’. The sequences should be of equal length, otherwise excessive values will be ignored.

    ’params’ should be a dict with keys corresponding to MGF format. Keys must be strings, they will be uppercased and used as is, without any format consistency tests. Values can be of any type allowing string representation.

    ’charge array’ can also be specified.

  • output (str or file or None, optional) – Path or a file-like object open for writing. If an existing file is specified by file name, it will be opened for appending. In this case writing with a header can result in violation of format conventions. Default value is None, which means using standard output.
  • header (dict or (multiline) str or list of str, optional) – In case of a single string or a list of strings, the header will be written ‘as is’. In case of dict, the keys (must be strings) will be uppercased.
  • write_charges (bool, optional) – If False, fragment charges from ‘charge array’ will not be written. Default is True.
  • fragment_format (str, optional) –

    Format string for m/z, intensity and charge of a fragment. Useful to set the number of decimal places, e.g.: fragment_format='%.4f %.0f'. Default is '{} {} {}'.

    Note

    The supported format syntax differs depending on other parameters. If use_numpy is True and numpy is available, fragment peaks will be written using numpy.savetxt(). Then, fragment_format must be recognized by that function.

    Otherwise, plain Python string formatting is done. See the docs for details on writing the format string. If some or all charges are missing, an empty string is substituted instead, so formatting as float or int will raise an exception. Hence it is safer to just use {} for charges.

  • key_order (list, optional) –

    A list of strings specifying the order in which params will be written in the spectrum header. Unlisted keys will be in arbitrary order. Default is _default_key_order.

    Note

    This does not affect the order of lines in the global header.

  • param_formatters (dict, optional) – A dict mapping parameter names to functions. Each function must accept two arguments (key and value) and return a string. Default is _default_value_formatters.
  • use_numpy (bool, optional) –

    Controls whether fragment peak arrays are written using numpy.savetxt(). Using numpy.savetxt() is faster, but cannot handle sparse arrays of fragment charges. You may want to disable this if you need to save spectra with ‘charge arrays’ with missing values.

    If not specified, will be set to the opposite of write_charges. If numpy is not available, this parameter has no effect.

  • file_mode (str, keyword only, optional) – If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘a’.
  • encoding (str, keyword only, optional) – Output file encoding (if output is specified by name).
Returns:

output

Return type:

file
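
The two format-string styles mentioned for fragment_format, and the pitfall with missing charges, can be reproduced with plain Python string formatting. This sketch does not call write(); the peak values are made up.

```python
mz, intensity = 846.6123456, 73.4

# Percent-style format, the syntax numpy.savetxt() expects
# (two columns here, no charge).
percent_line = "%.4f %.0f" % (mz, intensity)

# str.format style with a bare {} for the charge, which tolerates
# a missing value represented by an empty string.
brace_line = "{:.4f} {:.0f} {}".format(mz, intensity, "")

# Forcing an integer format on a missing charge (empty string) fails.
try:
    "{:.4f} {:.0f} {:d}".format(mz, intensity, "")
    failed = False
except ValueError:
    failed = True
```

This is why a bare {} is the safer choice for the charge field when some charges may be missing.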

ms1 - read and write MS/MS data in MS1 format

Summary

MS1 is a simple human-readable format for MS1 data. It allows storing MS1 peak lists and experimental parameters.

This module provides minimalistic infrastructure for access to data stored in MS1 files. Two main classes are MS1, which provides an iterative, text-mode parser, and IndexedMS1, which is a binary-mode parser that supports random access using scan IDs and retention times. The function read() helps dispatch between the two classes. Also, common parameters can be read from MS1 file header with read_header() function.

Functions

read() - iterate through spectra in MS1 file. Data from a single spectrum are converted to a human-readable dict.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

read_header() - get a dict with common parameters for all spectra from the beginning of MS1 file.


pyteomics.ms1.chain(*args, **kwargs)

Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:files – Iterable of file names or file objects.
class pyteomics.ms1.IndexedMS1(source=None, use_header=False, convert_arrays=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)[source]

Bases: pyteomics.ms1.MS1Base, pyteomics.auxiliary.file_helpers.TaskMappingMixin, pyteomics.auxiliary.file_helpers.TimeOrderedIndexedReaderMixin, pyteomics.auxiliary.file_helpers.IndexedTextReader

A class representing an MS1 file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by scan ID using the indexing syntax in constant time. If created using a file object, it needs to be opened in binary mode.

When iterated, IndexedMS1 object yields spectra one by one. Each ‘spectrum’ is a dict with four keys: ‘m/z array’, ‘intensity array’, ‘charge array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MS1).

Warning

Labels for scan objects are constructed as the first number in the S line, as follows: for a line S  0   1 the label is ‘0’. If these labels are not unique for the scans in the file, the indexed parser will not work correctly. Consider using MS1 instead.
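
Whether a file is affected can be checked before choosing the indexed parser. The sketch below derives labels from S lines as described in the warning and reports duplicates. The scan_label helper and the sample lines are hypothetical, and the sketch does not use the parser itself.

```python
from collections import Counter

# Made-up lines: S lines start a scan, other lines are peaks.
lines = [
    "S  0   1",
    "100.0 5.0",
    "S  1   2",
    "200.0 7.0",
    "S  1   3",
]


def scan_label(line):
    """Return the first number after 'S' as a string, or None otherwise."""
    fields = line.split()
    if len(fields) > 1 and fields[0] == "S":
        return fields[1]
    return None


labels = [label for label in map(scan_label, lines) if label is not None]

# Labels that occur more than once would collide in a label-keyed index.
duplicates = sorted(label for label, n in Counter(labels).items() if n > 1)
```

An empty duplicates list suggests the labels are unique; otherwise the sequential parser is the safer option.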

header

The file header.

Type:dict
time

A property used for accessing spectra by retention time.

Type:RTLocator
__init__(source=None, use_header=False, convert_arrays=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)[source]

Instantiate a TaskMappingMixin object, set default parameters for IPC.

Parameters:
  • queue_timeout (float, keyword only, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.
  • queue_size (int, keyword only, optional) – The length of IPC queue used.
  • processes (int, keyword only, optional) – Number of worker processes to spawn when map() is called. This can also be specified in the map() call.
map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.ms1.MS1(source=None, use_header=False, convert_arrays=True, dtype=None, encoding=None, **kwargs)[source]

Bases: pyteomics.ms1.MS1Base, pyteomics.auxiliary.file_helpers.FileReader

A class representing an MS1 file. Supports the with syntax and direct iteration for sequential parsing.

MS1 object behaves as an iterator, yielding spectra one by one. Each ‘spectrum’ is a dict with three keys: ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, and ‘params’ stores a dict of parameters.

header

The file header.

Type:dict
__init__(source=None, use_header=False, convert_arrays=True, dtype=None, encoding=None, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

reset()

Resets the iterator to its initial state.

class pyteomics.ms1.MS1Base(source=None, use_header=False, convert_arrays=True, dtype=None, **kwargs)[source]

Bases: object

Abstract class representing an MS1 file. Subclasses implement different approaches to parsing.

__init__(source=None, use_header=False, convert_arrays=True, dtype=None, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

pyteomics.ms1.read(*args, **kwargs)[source]

Read an MS1 file and return entries iteratively.

Read the specified MS1 file, yield spectra one by one. Each ‘spectrum’ is a dict with three keys: ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, and ‘params’ stores a dict of parameters.

Parameters:
  • source (str or file or None, optional) – A file object (or file name) with data in MS1 format. Default is None, which means read standard input.
  • use_header (bool, optional) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is False.
  • convert_arrays (bool, optional) – If False, m/z and intensities will be returned as regular lists. If True (default), they will be converted to regular numpy.ndarray’s. Conversion requires numpy.
  • dtype (type or str or dict, optional) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’ and/or ‘intensity array’.
  • encoding (str, optional) – File encoding.
  • use_index (bool, optional) –

    Determines which parsing method to use. If True, an instance of IndexedMS1 is created. This facilitates random access by scan IDs. If an open file is passed as source, it needs to be open in binary mode.

    If False (default), an instance of MS1 is created. It reads source in text mode and is suitable for iterative parsing.

    Warning

    Labels for scan objects are constructed as the first number in the S line, as follows: for a line S  0   1 the label is ‘0’. If these labels are not unique for the scans in the file, the indexed parser will not work correctly.

  • block_size (int, optional) – Size of the chunk (in bytes) used to parse the file when creating the byte offset index. (Accepted only for IndexedMS1.)
Returns:

out – An instance of MS1 or IndexedMS1, depending on use_index and source.

Return type:

MS1Base

pyteomics.ms1.read_header(source, *args, **kwargs)[source]

Read the specified MS1 file, get the parameters specified in the header as a dict.

Parameters:source (str or file) – File name or file object representing a file in MS1 format.
Returns:header
Return type:dict

ms2 - read and write MS/MS data in MS2 format

Summary

MS2 is a simple human-readable format for MS2 data. It allows storing MS2 peak lists and experimental parameters.

This module provides minimalistic infrastructure for access to data stored in MS2 files. Two main classes are MS2, which provides an iterative, text-mode parser, and IndexedMS2, which is a binary-mode parser that supports random access using scan IDs and retention times. The function read() helps dispatch between the two classes. Also, common parameters can be read from MS2 file header with read_header() function.

Functions

read() - iterate through spectra in MS2 file. Data from a single spectrum are converted to a human-readable dict.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

read_header() - get a dict with common parameters for all spectra from the beginning of MS2 file.


pyteomics.ms2.chain(*args, **kwargs)

Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:files – Iterable of file names or file objects.
class pyteomics.ms2.IndexedMS2(source=None, use_header=False, convert_arrays=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)[source]

Bases: pyteomics.ms1.IndexedMS1

A class representing an MS2 file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by scan ID using the indexing syntax in constant time. If created using a file object, it needs to be opened in binary mode.

When iterated, IndexedMS2 object yields spectra one by one. Each ‘spectrum’ is a dict with four keys: ‘m/z array’, ‘intensity array’, ‘charge array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MS2).

Warning

Labels for scan objects are constructed as the first number in the S line, as follows: for a line S  0   1   123.4 the label is ‘0’. If these labels are not unique for the scans in the file, the indexed parser will not work correctly. Consider using MS2 instead.

header

The file header.

Type:dict
time

A property used for accessing spectra by retention time.

Type:RTLocator
__init__(source=None, use_header=False, convert_arrays=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)

Instantiate a TaskMappingMixin object, set default parameters for IPC.

Parameters:
  • queue_timeout (float, keyword only, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.
  • queue_size (int, keyword only, optional) – The length of IPC queue used.
  • processes (int, keyword only, optional) – Number of worker processes to spawn when map() is called. This can also be specified in the map() call.
map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

class pyteomics.ms2.MS2(source=None, use_header=False, convert_arrays=True, dtype=None, encoding=None, **kwargs)[source]

Bases: pyteomics.ms1.MS1

A class representing an MS2 file. Supports the with syntax and direct iteration for sequential parsing.

MS2 object behaves as an iterator, yielding spectra one by one. Each ‘spectrum’ is a dict with three keys: ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, and ‘params’ stores a dict of parameters.

header

The file header.

Type:dict
__init__(source=None, use_header=False, convert_arrays=True, dtype=None, encoding=None, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

reset()

Resets the iterator to its initial state.

pyteomics.ms2.read(*args, **kwargs)[source]

Read an MS2 file and return entries iteratively.

Read the specified MS2 file, yield spectra one by one. Each ‘spectrum’ is a dict with three keys: ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, and ‘params’ stores a dict of parameters.

Parameters:
  • source (str or file or None, optional) – A file object (or file name) with data in MS2 format. Default is None, which means read standard input.
  • use_header (bool, optional) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is False.
  • convert_arrays (bool, optional) – If False, m/z and intensities will be returned as regular lists. If True (default), they will be converted to regular numpy.ndarray’s. Conversion requires numpy.
  • dtype (type or str or dict, optional) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’ and/or ‘intensity array’.
  • encoding (str, optional) – File encoding.
  • use_index (bool, optional) –

    Determines which parsing method to use. If True, an instance of IndexedMS2 is created. This facilitates random access by scan IDs. If an open file is passed as source, it needs to be open in binary mode.

    Warning

    Labels for scan objects are constructed as the first number in the S line, as follows: for a line S  0   1   123.4 the label is ‘0’. If these labels are not unique for the scans in the file, the indexed parser will not work correctly.

    If False (default), an instance of MS2 is created. It reads source in text mode and is suitable for iterative parsing.

  • block_size (int, optional) – Size of the chunk (in bytes) used to parse the file when creating the byte offset index. (Accepted only for IndexedMS2.)
Returns:

An instance of MS2 or IndexedMS2, depending on use_index and source.

Return type:

out
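
The warning about scan labels can be checked before switching to use_index=True. The following is a minimal sketch in plain Python, not part of Pyteomics: the helpers scan_labels and duplicated_labels are hypothetical names, and the sketch assumes that S lines are whitespace-separated and that the label is the first field after the leading S, as in the example line above.

```python
from collections import Counter


def scan_labels(lines):
    """Yield the label of every S line: the first field after the leading 'S'."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == 'S':
            yield fields[1]


def duplicated_labels(lines):
    """Return the sorted labels that occur on more than one S line."""
    counts = Counter(scan_labels(lines))
    return sorted(label for label, n in counts.items() if n > 1)


# Hypothetical MS2 fragment; only the S lines matter for this check.
example = [
    'H\tCreationDate\t2020-01-01',
    'S\t0\t1\t123.4',
    'Z\t2\t245.8',
    '100.0 1.0',
    'S\t0\t2\t456.7',
    'S\t3\t3\t789.0',
]

print(list(scan_labels(example)))   # ['0', '0', '3']
print(duplicated_labels(example))   # ['0']
```

If duplicated_labels() returns a non-empty list for a file, the labels are not unique and iterative parsing (use_index=False) is the safer choice.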

pyteomics.ms2.read_header(source, *args, **kwargs)[source]

Read the specified MS2 file, get the parameters specified in the header as a dict.

Parameters:source (str or file) – File name or file object representing a file in MS2 format.
Returns:header
Return type:dict

pepxml - pepXML file reader

Summary

pepXML was the first widely accepted format for proteomics search engines’ output. Even though it is to be replaced by a community standard mzIdentML, it is still used commonly.

This module provides minimalistic infrastructure for access to data stored in pepXML files. The most important function is read(), which reads peptide-spectrum matches and related information and saves them into human-readable dicts. This function relies on the terminology of the underlying lxml library.

Data access

PepXML - a class representing a single pepXML file. Other data access functions use this class internally.

read() - iterate through peptide-spectrum matches in a pepXML file. Data for a single spectrum are converted to an easy-to-use dict.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

DataFrame() - read pepXML files into a pandas.DataFrame.

Target-decoy approach

filter() - filter PSMs from a chain of pepXML files to a specific FDR using TDA.

filter.chain() - chain a series of filters applied independently to several files.

filter.chain.from_iterable() - chain a series of filters applied independently to an iterable of files.

filter_df() - filter pepXML files and return a pandas.DataFrame.

fdr() - estimate the false discovery rate of a PSM set using the target-decoy approach.

qvalues() - get an array of scores and q-values for a PSM set using the target-decoy approach.

is_decoy() - determine whether a PSM is decoy or not.

Miscellaneous

roc_curve() - get a receiver-operator curve (min PeptideProphet probability in a sample vs. false discovery rate) of PeptideProphet analysis.

Deprecated functions

iterfind() - iterate over elements in a pepXML file. You can just call the corresponding method of the PepXML object.

version_info() - get information about pepXML version and schema. You can just read the corresponding attribute of the PepXML object.

Dependencies

This module requires lxml.


pyteomics.pepxml.chain(*sources, **kwargs)

Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.

pyteomics.pepxml.sources

Sources for creating new sequences from, such as paths or file-like objects

Type:Iterable
pyteomics.pepxml.kwargs

Additional arguments used to instantiate each sequence

Type:Mapping
chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:files – Iterable of file names or file objects.
pyteomics.pepxml.filter(*args, **kwargs)

Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.

Requires numpy and, optionally, pandas.

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is True.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • full_output (bool, keyword only, optional) –

    If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.

    Note

    The name for the parameter comes from the fact that it is internally passed to qvalues().

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • **kwargs (passed to the chain() function.) –
Returns:

out

Return type:

iterator or numpy.ndarray or pandas.DataFrame
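
The idea behind the filtering can be sketched in plain Python. This is an illustration only, not the Pyteomics implementation: tda_filter is a hypothetical helper, PSMs are represented here as (score, is_decoy) tuples, FDR is estimated with formula 1 at ratio = 1 without correction, and ties in score are ignored.

```python
def tda_filter(psms, fdr, key, is_decoy, reverse=False, remove_decoy=True):
    """Keep the largest top-scoring set of PSMs whose estimated FDR <= fdr.

    FDR is estimated as N_decoy / N_target (formula 1, ratio = 1,
    no correction).
    """
    ranked = sorted(psms, key=key, reverse=reverse)
    n_decoy = n_target = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for i, psm in enumerate(ranked, start=1):
        if is_decoy(psm):
            n_decoy += 1
        else:
            n_target += 1
        # Remember the deepest position at which the estimate is acceptable.
        if n_target and n_decoy / n_target <= fdr:
            cutoff = i
    kept = ranked[:cutoff]
    if remove_decoy:
        kept = [psm for psm in kept if not is_decoy(psm)]
    return kept


# Hypothetical PSMs: (e-value, is_decoy); smaller e-values are better.
psms = [(0.02, False), (0.001, False), (0.01, True), (0.002, False),
        (0.05, False), (0.2, True), (0.1, True)]

passed = tda_filter(psms, fdr=0.25, key=lambda p: p[0], is_decoy=lambda p: p[1])
print(passed)  # [(0.001, False), (0.002, False), (0.02, False), (0.05, False)]
```

With these numbers the fifth-ranked PSM is the deepest point where one decoy among four targets gives an estimate of 0.25, so the top five PSMs pass and the single decoy among them is then removed.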

filter.chain(*files, **kwargs)

Chain filter() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter() function.

filter.chain.from_iterable(*files, **kwargs)

Chain filter() for several files. Keyword arguments are passed to the filter() function.

Parameters:files – Iterable of file names or file objects.
pyteomics.pepxml.version_info(source)

Provide version information about the pepXML file.

Note

This function is provided for backward compatibility only. It simply creates a PepXML instance and returns its version_info attribute.

Parameters:source (str or file) – File name or file-like object.
Returns:out – A (version, schema URL) tuple, both elements are strings or None.
Return type:tuple
pyteomics.pepxml.iterfind(source, path, **kwargs)[source]

Parse source and yield info on elements with specified local name or by specified “XPath”.

Note

This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create a PepXML object and use its iterfind() method.

Parameters:
  • source (str or file) – File name or file-like object.
  • path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the pepXML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.
Returns:

out

Return type:

iterator

pyteomics.pepxml.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)

Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:

FDR = \frac{N_{decoy}}{N_{target} * ratio}

The second formula is:

FDR = \frac{N_{decoy} * (1 + \frac{1}{ratio})}{N_{total}}

Note

This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.

Parameters:
  • psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
  • formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
  • is_decoy (callable, iterable, or str, optional) –

    If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable, iterable, or str, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.

  • ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

    Note

    Requires numpy, if correction is a float or 2.

    Note

    Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).

Returns:

out – The estimation of FDR, (roughly) between 0 and 1.

Return type:

float
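
The two formulas can be written out directly. The sketch below is plain Python for illustration, with hypothetical function names; it does not call Pyteomics and ignores the correction and pep options. Formula 1 estimates FDR among target matches only. Formula 2 applies to the combined set, where the false matches are the decoys plus an estimated N_decoy / ratio false targets.

```python
def fdr_formula_1(n_decoy, n_target, ratio=1.0):
    """FDR = N_decoy / (N_target * ratio)."""
    return n_decoy / (n_target * ratio)


def fdr_formula_2(n_decoy, n_total, ratio=1.0):
    """FDR = N_decoy * (1 + 1 / ratio) / N_total."""
    return n_decoy * (1 + 1 / ratio) / n_total


# Hypothetical counts: 95 target and 5 decoy PSMs, equally sized databases.
n_target, n_decoy = 95, 5

print(round(fdr_formula_1(n_decoy, n_target), 4))   # 0.0526
print(fdr_formula_2(n_decoy, n_target + n_decoy))   # 0.1
```

The second estimate is larger because the decoys themselves are counted as false matches in the set being described.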

pyteomics.pepxml.qvalues(*args, **kwargs)

Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.

Requires numpy (and optionally pandas).

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is False.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
  • **kwargs (passed to the chain() function.) –
Returns:

out – A sorted array of records with the following fields:

  • ’score’: np.float64
  • ’is decoy’: np.bool_
  • ’q’: np.float64
  • ’psm’: np.object_ (if full_output is True)

Return type:

numpy.ndarray
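
A q-value can be understood as the smallest FDR over all score thresholds that still accept a given PSM. The sketch below illustrates that definition in plain Python. It is not the Pyteomics implementation: qvalues_sketch is a hypothetical name, formula 2 with ratio = 1 is assumed (the default when remove_decoy is False), and corrections, PEP input and tie handling are left out.

```python
def qvalues_sketch(scores, decoy_flags, reverse=False):
    """Return (score, is_decoy, q) records sorted from best to worst score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)

    # Running FDR estimate at each rank: 2 * N_decoy / N_total.
    n_decoy = 0
    running_fdr = []
    for n_total, i in enumerate(order, start=1):
        if decoy_flags[i]:
            n_decoy += 1
        running_fdr.append(2 * n_decoy / n_total)

    # q-value: minimum FDR over this rank and all worse-scoring ranks.
    q = running_fdr[:]
    for k in range(len(q) - 2, -1, -1):
        q[k] = min(q[k], q[k + 1])

    return [(scores[i], bool(decoy_flags[i]), qk) for i, qk in zip(order, q)]


# Hypothetical scores (smaller is better) and decoy flags.
scores = [0.02, 0.001, 0.01, 0.002, 0.05]
flags = [False, False, True, False, False]

for record in qvalues_sketch(scores, flags):
    print(record)
```

The backward pass makes q-values monotonic: a PSM ranked above a point with low FDR never gets a larger q-value than that point.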

pyteomics.pepxml.DataFrame(*args, **kwargs)[source]

Read pepXML output files into a pandas.DataFrame.

Requires pandas.

Parameters:
  • *args – Passed to chain().
  • **kwargs – Passed to chain().
  • sep (str or None, keyword only, optional) – Some values related to PSMs (such as protein information) are variable-length lists. If sep is a str, they will be packed into single string using this delimiter. If sep is None, they are kept as lists. Default is None.
  • pd_kwargs (dict, optional) – Keyword arguments passed to the pandas.DataFrame constructor.
Returns:

out

Return type:

pandas.DataFrame

class pyteomics.pepxml.PepXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]

Bases: pyteomics.xml.MultiProcessingXML, pyteomics.xml.IndexSavingXML

Parser class for pepXML files.

__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)

Create an indexed XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
  • indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs (passed to self._get_info_smart().) –
Returns:

out

Return type:

iterator

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters:path (str) – The path to the file to parse
reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

pyteomics.pepxml.filter_df(*args, **kwargs)[source]

Read pepXML files or DataFrames and return a DataFrame with filtered PSMs. Positional arguments can be pepXML files or DataFrames.

Requires pandas.

Parameters:
  • key (str / iterable / callable, keyword only, optional) – PSM score. Default is ‘expect’.
  • is_decoy (str / iterable / callable, keyword only, optional) – Default is to check if all strings in the “protein” column start with ‘DECOY_’
  • *args – Passed to auxiliary.filter() and/or DataFrame().
  • **kwargs – Passed to auxiliary.filter() and/or DataFrame().
Returns:

out

Return type:

pandas.DataFrame

pyteomics.pepxml.is_decoy(psm, prefix='DECOY_')

Given a PSM dict, return True if all protein names for the PSM start with prefix, and False otherwise. This function might not work for some pepXML flavours. Use the source to get the idea and suit it to your needs.

Parameters:
  • psm (dict) – A dict, as yielded by read().
  • prefix (str, optional) – A prefix used to mark decoy proteins. Default is ‘DECOY_’.
Returns:

out

Return type:

bool
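
The rule described above can be adapted along these lines. This is a sketch rather than the library source: is_decoy_sketch is a hypothetical name, and the layout of the PSM dict (a 'search_hit' list whose first entry has a 'proteins' list of dicts with a 'protein' key) is an assumption that should be checked against the dicts read() yields for your files.

```python
def is_decoy_sketch(psm, prefix='DECOY_'):
    """Return True if every protein of the first search hit starts with prefix."""
    proteins = psm['search_hit'][0]['proteins']
    return all(p['protein'].startswith(prefix) for p in proteins)


# Hypothetical PSM dicts reduced to the fields used above.
decoy_psm = {'search_hit': [{'proteins': [{'protein': 'DECOY_sp|P1'},
                                          {'protein': 'DECOY_sp|P2'}]}]}
mixed_psm = {'search_hit': [{'proteins': [{'protein': 'DECOY_sp|P1'},
                                          {'protein': 'sp|P3'}]}]}

print(is_decoy_sketch(decoy_psm))  # True
print(is_decoy_sketch(mixed_psm))  # False
```

A PSM whose peptide maps to at least one target protein is treated as a target, which is why the check uses all() rather than any().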

pyteomics.pepxml.read(source, read_schema=False, iterative=True, **kwargs)[source]

Parse source and iterate through peptide-spectrum matches.

Parameters:
  • source (str or file) – A path to a target pepXML file or the file object itself.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the pepXML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.
  • iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
Returns:

out – An iterator over dicts with PSM properties.

Return type:

PepXML
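
A typical way to consume the yielded dicts is to pull a few fields out of the first listed search hit. The helper below is a hypothetical sketch that works on any iterable of PSM dicts. The key names 'spectrum', 'search_hit', 'peptide', 'search_score' and 'expect' are assumptions about the dict layout and may differ between pepXML flavours.

```python
def top_hits(psms):
    """Yield (spectrum, peptide, expect) for the first search hit of each PSM."""
    for psm in psms:
        hits = psm.get('search_hit') or []
        if not hits:
            continue  # spectrum without any reported hit
        best = hits[0]
        expect = best.get('search_score', {}).get('expect')
        yield psm.get('spectrum'), best.get('peptide'), expect


# Hypothetical dicts standing in for what read() yields.
fake_psms = [
    {'spectrum': 'scan.1', 'search_hit': [
        {'peptide': 'PEPTIDEK', 'search_score': {'expect': 0.001}}]},
    {'spectrum': 'scan.2', 'search_hit': []},
]

print(list(top_hits(fake_psms)))  # [('scan.1', 'PEPTIDEK', 0.001)]

# With Pyteomics installed, the same helper could be fed from a file
# ('example.pep.xml' is a placeholder name):
#
#     from pyteomics import pepxml
#     with pepxml.read('example.pep.xml') as reader:
#         for row in top_hits(reader):
#             print(row)
```

Keeping the extraction separate from file reading makes it easy to test on hand-written dicts before running it on real data.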

pyteomics.pepxml.roc_curve(source)[source]

Parse source and return a ROC curve for PeptideProphet analysis.

Parameters:source (str or file) – A path to a target pepXML file or the file object itself.
Returns:out – A list of ROC points.
Return type:list

protxml - parsing of ProteinProphet output files

Summary

protXML is the output format of the ProteinProphet software. It contains information about identified proteins and their statistical significance.

This module provides minimalistic infrastructure for access to data stored in protXML files. The central class is ProtXML, which reads protein entries and related information and saves them into Python dicts.

Data access

ProtXML - a class representing a single protXML file. Other data access functions use this class internally.

read() - iterate through protein groups in a protXML file. Calling the function is synonymous to instantiating the ProtXML class.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

DataFrame() - read protXML files into a pandas.DataFrame.

Target-decoy approach

filter() - filter protein groups from a chain of protXML files to a specific FDR using TDA.

filter.chain() - chain a series of filters applied independently to several files.

filter.chain.from_iterable() - chain a series of filters applied independently to an iterable of files.

filter_df() - filter protXML files and return a pandas.DataFrame.

fdr() - estimate the false discovery rate of a set of protein groups using the target-decoy approach.

qvalues() - get an array of scores and q values for protein groups using the target-decoy approach.

is_decoy() - determine whether a protein group is decoy or not. This function may not suit your use case.

Dependencies

This module requires lxml.


pyteomics.protxml.chain(*sources, **kwargs)

Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.

pyteomics.protxml.sources

Sources for creating new sequences from, such as paths or file-like objects

Type:Iterable
pyteomics.protxml.kwargs

Additional arguments used to instantiate each sequence

Type:Mapping
chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:files – Iterable of file names or file objects.
pyteomics.protxml.filter(*args, **kwargs)

Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.

Requires numpy and, optionally, pandas.

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is True.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • full_output (bool, keyword only, optional) –

    If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.

    Note

    The name for the parameter comes from the fact that it is internally passed to qvalues().

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • **kwargs – Passed to the chain() function.
Returns:

out

Return type:

iterator or numpy.ndarray or pandas.DataFrame
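
The filtering idea described for filter() can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Pyteomics implementation: PSMs are (score, is_decoy) tuples, a lower score is better, and FDR is estimated as decoys / (targets * ratio). The name tda_filter is hypothetical.

```python
def tda_filter(psms, fdr, ratio=1.0):
    """Keep target PSMs from the longest best-scoring prefix whose
    estimated FDR (decoys / (targets * ratio)) does not exceed `fdr`.

    `psms` is a list of (score, is_decoy) tuples; lower score is better.
    """
    ranked = sorted(psms, key=lambda p: p[0])
    decoys = targets = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for i, (_, decoy) in enumerate(ranked, start=1):
        if decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / (targets * ratio) <= fdr:
            cutoff = i
    # Decoys are dropped from the output, as with remove_decoy=True.
    return [p for p in ranked[:cutoff] if not p[1]]


psms = [(0.001, False), (0.002, False), (0.01, True), (0.02, False),
        (0.05, False), (0.1, True), (0.2, True)]
accepted = tda_filter(psms, fdr=0.25)
print(accepted)
```

Note that the cutoff is the last rank at which the estimate is within the limit, so a PSM set can extend past a decoy if enough targets follow it.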

filter.chain(*files, **kwargs)

Chain filter() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter() function.

filter.chain.from_iterable(*files, **kwargs)

Chain filter() for several files. Keyword arguments are passed to the filter() function.

Parameters:
  • files – Iterable of file names or file objects.
pyteomics.protxml.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)

Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:

FDR = \frac{N_{decoy}}{N_{target} * ratio}

The second formula is:

FDR = \frac{N_{decoy} * (1 + \frac{1}{ratio})}{N_{total}}

Note

This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.
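
The two formulas above can be written out directly. The sketch below only restates the arithmetic; the function names and the PSM counts are hypothetical.

```python
def fdr_formula_1(n_decoy, n_target, ratio=1.0):
    """FDR = N_decoy / (N_target * ratio)."""
    return n_decoy / (n_target * ratio)


def fdr_formula_2(n_decoy, n_total, ratio=1.0):
    """FDR = N_decoy * (1 + 1 / ratio) / N_total."""
    return n_decoy * (1 + 1 / ratio) / n_total


# Hypothetical counts: 950 target and 50 decoy PSMs, equal database sizes.
n_target, n_decoy = 950, 50
print(fdr_formula_1(n_decoy, n_target))            # 50 / 950
print(fdr_formula_2(n_decoy, n_target + n_decoy))  # 100 / 1000
```

Formula 1 describes the target-only set (decoys removed). Formula 2 describes the whole set, where the false matches are the decoys themselves plus the estimated N_decoy / ratio false targets.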

Parameters:
  • psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
  • formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
  • is_decoy (callable, iterable, or str, optional) –

    If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable, iterable, or str, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.

  • ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

    Note

    Requires numpy, if correction is a float or 2.

    Note

    Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).

Returns:

out – The estimation of FDR, (roughly) between 0 and 1.

Return type:

float
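
The effect of correction=1 can be illustrated with a small sketch. It assumes that the "+1" correction adds one to the decoy count in the numerator of formula 1; the text above does not spell out the exact expression, so treat this as one plausible reading. The function name is hypothetical.

```python
def fdr_plus_one(n_decoy, n_target, ratio=1.0, correction=0):
    """Formula 1 with an optional "+1" correction (correction=1).

    Assumption: the correction adds one expected false match to the
    decoy count before dividing.
    """
    if correction not in (0, 1):
        raise ValueError("this sketch supports only correction 0 or 1")
    return (n_decoy + correction) / (n_target * ratio)


plain = fdr_plus_one(9, 1000)
corrected = fdr_plus_one(9, 1000, correction=1)
print(plain, corrected)
```

Under this assumption the correction matters most for small PSM sets, where a single extra false match changes the estimate noticeably.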

pyteomics.protxml.qvalues(*args, **kwargs)

Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.

Requires numpy (and optionally pandas).

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is False.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
  • **kwargs – Passed to the chain() function.
Returns:

out – A sorted array of records with the following fields:

  • ’score’: np.float64
  • ’is decoy’: np.bool_
  • ’q’: np.float64
  • ’psm’: np.object_ (if full_output is True)

Return type:

numpy.ndarray
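
How q-values follow from the target-decoy counts can be sketched without numpy. This is an illustration, not the library code: PSMs are (score, is_decoy) pairs, a lower score is better, decoys stay in the set, and FDR at each rank uses formula 2 from fdr(). The name sketch_qvalues is hypothetical.

```python
def sketch_qvalues(psms, ratio=1.0):
    """Return (score, is_decoy, q) tuples sorted from best to worst score."""
    ranked = sorted(psms, key=lambda p: p[0])
    fdrs = []
    decoys = 0
    for total, (_, decoy) in enumerate(ranked, start=1):
        decoys += bool(decoy)
        fdrs.append(decoys * (1 + 1 / ratio) / total)
    # The q-value is the smallest FDR at this rank or at any later,
    # more permissive rank, so q-values never decrease down the list.
    q = fdrs[:]
    for i in range(len(q) - 2, -1, -1):
        q[i] = min(q[i], q[i + 1])
    return [(s, d, qv) for (s, d), qv in zip(ranked, q)]


table = sketch_qvalues([(1, False), (2, True), (3, False), (4, False)])
for row in table:
    print(row)
```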

pyteomics.protxml.DataFrame(*args, **kwargs)[source]

Read protXML output files into a pandas.DataFrame.

Note

Rows in the DataFrame correspond to individual proteins, not protein groups.

Requires pandas.

Parameters:
  • sep (str or None, keyword only, optional) – Some values related to protein groups are variable-length lists. If sep is a str, they will be packed into single string using this delimiter. If sep is None, they are kept as lists. Default is None.
  • pd_kwargs (dict, optional) – Keyword arguments passed to the pandas.DataFrame constructor.
  • *args – Passed to chain().
  • **kwargs – Passed to chain().
Returns:

out

Return type:

pandas.DataFrame
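
The effect of the sep parameter can be pictured with a small helper. This is a sketch of the described behaviour only; pack_lists and the field names in the example row are hypothetical.

```python
def pack_lists(record, sep=None):
    """Join list-valued fields of `record` with `sep`.

    With sep=None the lists are kept unchanged.
    """
    if sep is None:
        return dict(record)
    return {
        key: sep.join(map(str, value)) if isinstance(value, list) else value
        for key, value in record.items()
    }


row = {"protein_name": "P12345", "peptide_sequence": ["PEPTIDE", "SAMPLER"]}
print(pack_lists(row, sep=";"))
print(pack_lists(row))
```

Packed strings are easier to write to delimited text files, while lists are easier to process further in Python.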

class pyteomics.protxml.ProtXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]

Bases: pyteomics.xml.MultiProcessingXML

Parser class for protXML files.

__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)

Create an indexed XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
  • indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute.

build_tree()

Build and store the ElementTree instance for the underlying file.

clear_id_cache()

Clear the element ID cache.

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict
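
The idea of retrieving an entry by seeking to a stored byte offset can be sketched with the standard library. This is a simplified illustration, not the parser's own indexing code: it scans raw bytes with a regular expression, assumes non-nested elements with a double-quoted id attribute, and uses hypothetical function names.

```python
import io
import re


def build_offset_index(stream, tag):
    """Map the id attribute of each <tag ...> element to its byte offset."""
    stream.seek(0)
    data = stream.read()
    pattern = re.compile(rb'<' + re.escape(tag) + rb'\b[^>]*\bid="([^"]+)"')
    return {m.group(1).decode(): m.start() for m in pattern.finditer(data)}


def get_raw_by_id(stream, index, tag, elem_id):
    """Seek directly to an indexed element and return its raw bytes."""
    stream.seek(index[elem_id])
    rest = stream.read()
    closing = b"</" + tag + b">"
    return rest[:rest.index(closing) + len(closing)]


doc = io.BytesIO(b'<root><item id="a">1</item><item id="b">2</item></root>')
index = build_offset_index(doc, b"item")
print(index)
print(get_raw_by_id(doc, index, b"item", "b"))
```

Building the index costs one pass over the file; afterwards each lookup is a single seek instead of a parse from the start.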

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs – Passed to self._get_info_smart().
Returns:

out

Return type:

iterator
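
Matching elements by local name, regardless of namespace, can be sketched with xml.etree.ElementTree. This is not how iterfind() is implemented; it only illustrates why namespaces need not be specified. The helper names and the example namespace URL are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_local(source, name):
    """Yield attribute dicts of elements whose local name equals `name`."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == name:
            yield dict(elem.attrib)
            elem.clear()  # free memory held by the processed element


xml_doc = io.BytesIO(
    b'<r xmlns="http://example.org/ns">'
    b'<protein_group group_number="1"/><protein_group group_number="2"/></r>'
)
groups = list(iter_local(xml_doc, "protein_group"))
print(groups)
```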

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over the entries of this object, using up to processes worker processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function.
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
  • **_kwargs – Additional keyword arguments to be passed to the target function.
Yields:

object – The work item returned by the target function.
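
The calling convention of map() (one entry plus shared args and kwargs, results in arbitrary order) can be sketched as follows. To keep the example small and portable it uses a thread pool as a stand-in for worker processes, so it shows the unordered-map pattern rather than true multiprocessing. All names are hypothetical.

```python
import os
from multiprocessing.pool import ThreadPool


def unordered_map(target, entries, processes=-1, args=(), kwargs=None):
    """Apply `target(entry, *args, **kwargs)` to each entry, yielding
    results as workers finish rather than in input order."""
    kwargs = kwargs or {}
    workers = processes if processes > 0 else (os.cpu_count() or 1)

    def call(entry):
        return target(entry, *args, **kwargs)

    with ThreadPool(workers) as pool:
        yield from pool.imap_unordered(call, entries)


def scaled_length(entry, factor=1):
    return len(entry["name"]) * factor


entries = [{"name": "A"}, {"name": "BB"}, {"name": "CCC"}]
# Sort the results because their arrival order is not guaranteed.
results = sorted(unordered_map(scaled_length, entries, kwargs={"factor": 10}))
print(results)
```

If the order matters, return an identifier together with each result and sort or join on it afterwards.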

reset()

Resets the iterator to its initial state.

pyteomics.protxml.filter_df(*args, **kwargs)[source]

Read protXML files or DataFrames and return a DataFrame with filtered PSMs. Positional arguments can be protXML files or DataFrames.

Note

Rows in the DataFrame correspond to individual proteins, not protein groups.

Requires pandas.

Parameters:
  • key (str / iterable / callable, keyword only, optional) – Default is ‘probability’.
  • is_decoy (str / iterable / callable, keyword only, optional) – Default is to check that “protein_name” starts with ‘DECOY_’.
  • reverse (bool, keyword only, optional) – Should be True if higher score is better. Default is True (because the default key is ‘probability’).
  • *args – Passed to auxiliary.filter() and/or DataFrame().
  • **kwargs – Passed to auxiliary.filter() and/or DataFrame().
Returns:

out

Return type:

pandas.DataFrame

pyteomics.protxml.is_decoy(pg, prefix='DECOY_')

Determine if a protein group should be considered decoy.

This function checks that all protein names in a group start with prefix. You may need to provide your own function for correct filtering and FDR estimation.

Parameters:
  • pg (dict) – A protein group dict produced by the ProtXML parser.
  • prefix (str, optional) – A prefix used to mark decoy proteins. Default is ‘DECOY_’.
Returns:

out

Return type:

bool
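
The check described for is_decoy() can be sketched as below. The layout of the protein group dict (a "protein" list whose items carry a "protein_name" key) is an assumption made for this example, and group_is_decoy is a hypothetical name.

```python
def group_is_decoy(pg, prefix="DECOY_"):
    """True if every protein name in the group starts with `prefix`."""
    return all(p["protein_name"].startswith(prefix) for p in pg["protein"])


decoy_group = {"protein": [{"protein_name": "DECOY_P1"},
                           {"protein_name": "DECOY_P2"}]}
mixed_group = {"protein": [{"protein_name": "DECOY_P1"},
                           {"protein_name": "P3"}]}
print(group_is_decoy(decoy_group), group_is_decoy(mixed_group))
```

A group that mixes target and decoy proteins counts as a target under this rule, which is one reason a custom function may be needed for some data sets.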

pyteomics.protxml.read(source, read_schema=False, iterative=True, **kwargs)[source]

Parse source and iterate through protein groups.

Parameters:
  • source (str or file) – A path to a target protXML file or the file object itself.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the protXML header. Otherwise, use default parameters. Not recommended without an Internet connection or if you want to avoid the related warnings.
  • iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
Returns:

out – An iterator over dicts with protein group properties.

Return type:

ProtXML

tandem - X!Tandem output file reader

Summary

X!Tandem is an open-source proteomics search engine with a simple application programming interface (API): it takes an XML file of instructions on its command line and writes the results to an XML file whose name is specified in the input XML file. The output format is described here (PDF).

This module provides a minimalistic way to extract information from X!Tandem output files. You can use the old functional interface (read()) or the new object-oriented interface (TandemXML) to iterate over entries in <group> elements, i.e. identifications for a certain spectrum.

Data access

TandemXML - a class representing a single X!Tandem output file. Other data access functions use this class internally.

read() - iterate through peptide-spectrum matches in an X!Tandem output file. Data from a single PSM are converted to a human-readable dict.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

DataFrame() - read X!Tandem output files into a pandas.DataFrame.

Target-decoy approach

filter() - iterate through peptide-spectrum matches in a chain of X!Tandem output files, yielding only top PSMs and keeping false discovery rate (FDR) at the desired level. The FDR is estimated using the target-decoy approach (TDA).

filter.chain() - chain a series of filters applied independently to several files.

filter.chain.from_iterable() - chain a series of filters applied independently to an iterable of files.

filter_df() - filter X!Tandem output files and return a pandas.DataFrame.

is_decoy() - determine if a PSM is from the decoy database.

fdr() - estimate the FDR in a data set using TDA.

qvalues() - get an array of scores and q-values for a PSM set using the target-decoy approach.

Deprecated functions

iterfind() - iterate over elements in an X!Tandem file. You can just call the corresponding method of the TandemXML object.

Dependencies

This module requires lxml and numpy.


pyteomics.tandem.chain(*sources, **kwargs)

Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.

pyteomics.tandem.sources

Sources for creating new sequences from, such as paths or file-like objects

Type:Iterable
pyteomics.tandem.kwargs

Additional arguments used to instantiate each sequence

Type:Mapping
chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:
  • files – Iterable of file names or file objects.
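
The chaining pattern behind chain() and chain.from_iterable() can be sketched with itertools. The reader below is a stand-in that yields lines from file objects; it is not the X!Tandem parser, and all names are hypothetical.

```python
import io
from itertools import chain


def read_lines(source, upper=False):
    """Stand-in reader: yield stripped, non-empty lines from a file object."""
    for line in source:
        line = line.strip()
        if line:
            yield line.upper() if upper else line


def chain_readers(*sources, **kwargs):
    """Read several sources one after another as a single iterable."""
    return chain_from_iterable(sources, **kwargs)


def chain_from_iterable(sources, **kwargs):
    """Same as chain_readers, but takes one iterable of sources."""
    return chain.from_iterable(read_lines(s, **kwargs) for s in sources)


first, second = io.StringIO("a\nb\n"), io.StringIO("c\n")
combined = list(chain_readers(first, second, upper=True))
print(combined)
```

Because the generator expression is lazy, each source is opened for reading only when the previous one is exhausted.
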
pyteomics.tandem.filter(*args, **kwargs)

Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.

Requires numpy and, optionally, pandas.

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is True.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • full_output (bool, keyword only, optional) –

    If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.

    Note

    The name for the parameter comes from the fact that it is internally passed to qvalues().

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • **kwargs – Passed to the chain() function.
Returns:

out

Return type:

iterator or numpy.ndarray or pandas.DataFrame
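
The interplay of decoy_prefix and decoy_suffix can be sketched as a small factory that builds a decoy test over a protein label string. It follows the rule stated above that decoy_prefix has no effect once decoy_suffix is specified. The function name and the labels are hypothetical, and the real default is_decoy works on PSM objects rather than plain strings.

```python
def make_is_decoy(decoy_prefix="DECOY_", decoy_suffix=None):
    """Build a predicate that tells whether a protein label marks a decoy.

    If a suffix is given it takes precedence over the prefix.
    """
    if decoy_suffix is not None:
        return lambda label: label.endswith(decoy_suffix)
    return lambda label: label.startswith(decoy_prefix)


by_prefix = make_is_decoy()
by_suffix = make_is_decoy(decoy_suffix="_REV")
print(by_prefix("DECOY_sp|P12345"), by_suffix("sp|P12345_REV"))
```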

filter.chain(*files, **kwargs)

Chain filter() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter() function.

filter.chain.from_iterable(*files, **kwargs)

Chain filter() for several files. Keyword arguments are passed to the filter() function.

Parameters:
  • files – Iterable of file names or file objects.
pyteomics.tandem.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)

Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:

FDR = \frac{N_{decoy}}{N_{target} * ratio}

The second formula is:

FDR = \frac{N_{decoy} * (1 + \frac{1}{ratio})}{N_{total}}

Note

This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.

Parameters:
  • psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
  • formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
  • is_decoy (callable, iterable, or str, optional) –

    If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable, iterable, or str, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.

  • ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

    Note

    Requires numpy, if correction is a float or 2.

    Note

    Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).

Returns:

out – The estimation of FDR, (roughly) between 0 and 1.

Return type:

float
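
When PEP values are available, a common way to estimate the FDR of a PSM set is to average them. The sketch below assumes this estimator; the text above does not state which expression fdr() uses with pep, and fdr_from_pep is a hypothetical name.

```python
def fdr_from_pep(peps):
    """Estimate FDR of a PSM set as the mean posterior error probability."""
    peps = list(peps)
    if not peps:
        raise ValueError("no PEP values given")
    return sum(peps) / len(peps)


estimate = fdr_from_pep([0.0, 0.1, 0.2, 0.5])
print(estimate)
```

No decoy counting is involved in this case, which is consistent with pep conflicting with is_decoy, formula, ratio and correction.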

pyteomics.tandem.qvalues(*args, **kwargs)

Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.

Requires numpy (and optionally pandas).

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is False.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
  • **kwargs (passed to the chain() function.) –
Returns:

out – A sorted array of records with the following fields:

  • ’score’: np.float64
  • ’is decoy’: np.bool_
  • ’q’: np.float64
  • ’psm’: np.object_ (if full_output is True)

Return type:

numpy.ndarray
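
The target-decoy calculation described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it assumes formula 1 with ratio 1 and no correction, and the helper name tda_qvalues and the sample numbers are made up for the example.

```python
def tda_qvalues(scores, decoy_flags):
    """Return (score, is_decoy, q) tuples sorted by ascending score.

    Smaller scores are treated as better. The FDR at each cut-off is
    estimated as N_decoy / N_target (formula 1, ratio 1, no correction).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    fdrs = []
    n_decoy = n_target = 0
    for i in order:
        if decoy_flags[i]:
            n_decoy += 1
        else:
            n_target += 1
        # Guard against division by zero while no target has been seen yet.
        fdrs.append(n_decoy / n_target if n_target else float('inf'))
    # q-value: the smallest FDR among all cut-offs that accept this PSM.
    qvals = fdrs[:]
    for k in range(len(qvals) - 2, -1, -1):
        qvals[k] = min(qvals[k], qvals[k + 1])
    return [(scores[i], bool(decoy_flags[i]), q) for i, q in zip(order, qvals)]


# Hypothetical scores (e.g. e-values) and decoy flags.
scores = [0.001, 0.01, 0.02, 0.05, 0.1, 0.2]
decoys = [False, False, True, False, False, True]
table = tda_qvalues(scores, decoys)
```

The backward pass with min() is what makes q-values non-decreasing along the sorted list, unlike the raw FDR estimates.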

pyteomics.tandem.iterfind(source, path, **kwargs)[source]

Parse source and yield info on elements with specified local name or by specified “XPath”.

Note

This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create a TandemXML object and use its iterfind() method.

Parameters:
  • source (str or file) – File name or file-like object.
  • path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
Returns:

out

Return type:

iterator
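
The remark about filtering in plain Python can be illustrated without Pyteomics. The sketch below uses only the standard library; the XML snippet, the element names and the helpers local_name and find_by_local_name are invented for illustration and do not reflect the real X!Tandem schema or how iterfind() is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespaced XML; not real X!Tandem output.
XML = """<root xmlns="http://example.org/ns">
  <group expect="0.5"><domain id="1"/></group>
  <group expect="2.0"><domain id="2"/></group>
</root>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree prepends to tags."""
    return tag.rsplit('}', 1)[-1]


def find_by_local_name(root, name, condition=None):
    """Yield attribute dicts of elements whose local name equals `name`.

    `condition` is an optional plain-Python predicate that plays the role
    of a trailing "[some_value>1.5]" condition in the path.
    """
    for elem in root.iter():
        if local_name(elem.tag) == name:
            info = dict(elem.attrib)
            if condition is None or condition(info):
                yield info


root = ET.fromstring(XML)
all_groups = list(find_by_local_name(root, 'group'))
high = list(find_by_local_name(root, 'group',
                               lambda d: float(d['expect']) > 1.5))
```

Because the predicate is an ordinary function, it can combine several fields or call other code, which a single path condition cannot express.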

pyteomics.tandem.DataFrame(*args, **kwargs)[source]

Read X!Tandem output files into a pandas.DataFrame.

Requires pandas.

Parameters:
  • sep (str or None, optional) – Some values related to PSMs (such as protein information) are variable-length lists. If sep is a str, they will be packed into a single string using this delimiter. If sep is None, they are kept as lists. Default is None.
  • pd_kwargs (dict, optional) – Keyword arguments passed to the pandas.DataFrame constructor.
  • *args – Passed to chain().
  • **kwargs – Passed to chain().
Returns:

out

Return type:

pandas.DataFrame
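
The effect of sep can be pictured with a small stand-alone sketch. The helper pack_lists and the field names seq and protein_label are hypothetical and chosen only for the example; the real column names depend on the file.

```python
def pack_lists(record, sep=None):
    """Return a copy of `record` with list values joined by `sep`.

    With sep=None the lists are left untouched.
    """
    if sep is None:
        return dict(record)
    return {
        key: sep.join(map(str, value)) if isinstance(value, list) else value
        for key, value in record.items()
    }


# A made-up PSM record with a variable-length list of protein labels.
psm = {'seq': 'PEPTIDE', 'protein_label': ['sp|P1|A', 'sp|P2|B']}
packed = pack_lists(psm, sep=';')
kept = pack_lists(psm)
```

Packed strings are convenient for writing a table to CSV, while lists are easier to work with in further Python code.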

class pyteomics.tandem.TandemXML(*args, **kwargs)[source]

Bases: pyteomics.xml.XML

Parser class for X!Tandem XML files.

__init__(*args, **kwargs)[source]

Create an XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • build_id_cache (bool, optional) – Defines whether a dictionary mapping IDs to XML tree elements should be built and stored on the instance. It is used in XML.get_by_id(), e.g. when using pyteomics.mzid.MzIdentML with retrieve_refs=True.
  • huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
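
The memory benefit of iterative parsing can be sketched with the standard library. Note the substitution: this example uses xml.etree.ElementTree.iterparse rather than lxml, and the document, the tag names and the helper iter_spectrum_ids are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# A small synthetic document with five <spectrum> elements.
XML = b"<run>" + b"".join(
    b'<spectrum id="%d"><peak mz="%d"/></spectrum>' % (i, 100 + i)
    for i in range(5)
) + b"</run>"


def iter_spectrum_ids(source):
    """Yield the id of each <spectrum>, freeing each element afterwards."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'spectrum':
            yield elem.get('id')
            elem.clear()  # drop children and attributes to keep memory low


ids = list(iter_spectrum_ids(io.BytesIO(XML)))
```

Elements are handled one at a time as their closing tags are read, so the full tree never has to be held in memory.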
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, **kwargs)

Parse the file and return the element with id attribute equal to elem_id. Returns None if no such element is found.

Parameters: elem_id (str) – The value of the id attribute to match.
Returns: out
Return type: dict or None
iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs (passed to self._get_info_smart().) –
Returns:

out

Return type:

iterator

reset()

Resets the iterator to its initial state.

pyteomics.tandem.filter_df(*args, **kwargs)[source]

Read X!Tandem output files or DataFrames and return a DataFrame with filtered PSMs. Positional arguments can be X!Tandem output files or DataFrames.

Requires pandas.

Parameters:
  • key (str / iterable / callable, optional) – Default is ‘expect’.
  • is_decoy (str / iterable / callable, optional) – Default is to check whether all strings in the “protein” column start with ‘DECOY_’.
  • *args – Passed to auxiliary.filter() and/or DataFrame().
  • **kwargs – Passed to auxiliary.filter() and/or DataFrame().
Returns:

out

Return type:

pandas.DataFrame

pyteomics.tandem.is_decoy(psm, prefix='DECOY_')

Given a PSM dict, return True if all protein names for the PSM start with prefix, and False otherwise.

Parameters:
  • psm (dict) – A dict, as yielded by read().
  • prefix (str, optional) – A prefix used to mark decoy proteins. Default is ‘DECOY_’.
Returns:

out

Return type:

bool
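
The "all protein names" rule can be sketched as follows. This is a simplified illustration: it assumes the protein names have already been extracted into a plain list of strings, and the helper name all_proteins_decoy is hypothetical.

```python
def all_proteins_decoy(protein_names, prefix='DECOY_'):
    """Return True only if every protein name starts with `prefix`."""
    return all(name.startswith(prefix) for name in protein_names)


only_decoys = all_proteins_decoy(['DECOY_sp|P1|A', 'DECOY_sp|P2|B'])
mixed = all_proteins_decoy(['DECOY_sp|P1|A', 'sp|P2|B'])
custom = all_proteins_decoy(['rev_P1'], prefix='rev_')
```

Requiring all names to match means that a peptide shared between a target and a decoy protein is counted as a target match.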

pyteomics.tandem.read(source, iterative=True, **kwargs)[source]

Parse source and iterate through peptide-spectrum matches.

Parameters:
  • source (str or file) – A path to a target X!Tandem output file or the file object itself.
  • iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
Returns:

out – An iterator over dicts with PSM properties.

Return type:

iterator

mzid - mzIdentML file reader

Summary

mzIdentML is one of the standards developed by the Proteomics Informatics working group of the HUPO Proteomics Standard Initiative.

This module provides a minimalistic way to extract information from mzIdentML files. You can use the old functional interface (read()) or the new object-oriented interface (MzIdentML) to iterate over entries in <SpectrumIdentificationResult> elements, i.e. groups of identifications for a certain spectrum. Note that each entry can contain more than one PSM (peptide-spectrum match). The PSMs are accessible under the “SpectrumIdentificationItem” key. MzIdentML objects also support direct indexing by element ID.

Data access

MzIdentML - a class representing a single MzIdentML file. Other data access functions use this class internally.

read() - iterate through peptide-spectrum matches in an mzIdentML file. Data from a single PSM group are converted to a human-readable dict. Internally, it creates an MzIdentML object and reads from it.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

DataFrame() - read MzIdentML files into a pandas.DataFrame.

Target-decoy approach

filter() - read a chain of mzIdentML files and filter to a certain FDR using TDA.

filter.chain() - chain a series of filters applied independently to several files.

filter.chain.from_iterable() - chain a series of filters applied independently to an iterable of files.

filter_df() - filter MzIdentML files and return a pandas.DataFrame.

is_decoy() - determine if a “SpectrumIdentificationResult” should be considered decoy.

fdr() - estimate the false discovery rate of a set of identifications using the target-decoy approach.

qvalues() - get an array of scores and q-values for a PSM set using the target-decoy approach.

Deprecated functions

version_info() - get information about mzIdentML version and schema. You can just read the corresponding attribute of the MzIdentML object.

get_by_id() - get an element by its ID and extract the data from it. You can just call the corresponding method of the MzIdentML object.

iterfind() - iterate over elements in an mzIdentML file. You can just call the corresponding method of the MzIdentML object.

Dependencies

This module requires lxml.


pyteomics.mzid.version_info(source)

Provide version information about the mzIdentML file.

Note

This function is provided for backward compatibility only. It simply creates an MzIdentML instance and returns its version_info attribute.

Parameters: source (str or file) – File name or file-like object.
Returns: out – A (version, schema URL) tuple, both elements are strings or None.
Return type: tuple
pyteomics.mzid.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)

Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:

FDR = \frac{N_{decoy}}{N_{target} * ratio}

The second formula is:

FDR = \frac{N_{decoy} * (1 + \frac{1}{ratio})}{N_{total}}

Note

This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.
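
The two formulas above can be checked by hand with a few lines of Python. The function names and the counts are made up for the example, and N_total is assumed to be the sum of target and decoy counts.

```python
def fdr_formula_1(n_decoy, n_target, ratio=1.0):
    """FDR = N_decoy / (N_target * ratio)."""
    return n_decoy / (n_target * ratio)


def fdr_formula_2(n_decoy, n_total, ratio=1.0):
    """FDR = N_decoy * (1 + 1 / ratio) / N_total."""
    return n_decoy * (1 + 1 / ratio) / n_total


# Hypothetical counts: 1000 target PSMs and 20 decoy PSMs.
n_target, n_decoy = 1000, 20
f1 = fdr_formula_1(n_decoy, n_target)             # 20 / 1000
f2 = fdr_formula_2(n_decoy, n_target + n_decoy)   # 40 / 1020
```

With ratio 1, formula 1 relates the decoy count to the target PSMs only, while formula 2 counts each decoy twice (once for itself, once for the false target it stands for) and divides by the size of the combined list.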

Parameters:
  • psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
  • formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
  • is_decoy (callable, iterable, or str, optional) –

    If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable, iterable, or str, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.

  • ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

    Note

    Requires numpy, if correction is a float or 2.

    Note

    Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).

Returns:

out – The estimated FDR, (roughly) between 0 and 1.

Return type:

float
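
The effect of correction=1 matters most for small PSM sets, as the following sketch suggests. It rests on an assumption that is not spelled out in the text: that the “+1” is added to the decoy count in the numerator of formula 1. The helper name and the counts are hypothetical.

```python
def fdr_with_plus_one(n_decoy, n_target, plus_one=False):
    """Formula 1 with ratio 1; optionally add 1 to the decoy count.

    ASSUMPTION: the "+1" is applied to the decoy count in the numerator.
    """
    numerator = n_decoy + (1 if plus_one else 0)
    return numerator / n_target


# Hypothetical small set: 2 decoys among 100 target PSMs.
plain = fdr_with_plus_one(2, 100)
corrected = fdr_with_plus_one(2, 100, plus_one=True)
```

Here the estimate grows from 2% to 3%; for a set with thousands of decoys the same correction would be negligible.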

pyteomics.mzid.qvalues(*args, **kwargs)

Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.

Requires numpy (and optionally pandas).

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is False.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
  • **kwargs (passed to the chain() function.) –
Returns:

out – A sorted array of records with the following fields:

  • ’score’: np.float64
  • ’is decoy’: np.bool_
  • ’q’: np.float64
  • ’psm’: np.object_ (if full_output is True)

Return type:

numpy.ndarray
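
When PEP values are available, one common way to derive q-values is to take the running mean of the sorted PEPs. The sketch below illustrates that idea only; the text does not state that qvalues() computes them exactly this way, and the helper pep_qvalues and the sample PEPs are invented.

```python
def pep_qvalues(peps):
    """Return (pep, q) pairs sorted by ascending PEP.

    The FDR of the top-k PSMs is estimated as the mean of their PEPs.
    """
    result = []
    total = 0.0
    for k, pep in enumerate(sorted(peps), start=1):
        total += pep
        result.append((pep, total / k))
    return result


# Hypothetical PEP values for four PSMs.
pairs = pep_qvalues([0.5, 0.01, 0.1, 0.03])
```

Since the PEPs are sorted in ascending order, the running mean never decreases, so no extra monotonicity step is needed in this case.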

pyteomics.mzid.chain(*sources, **kwargs)

Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.

pyteomics.mzid.sources

Sources for creating new sequences from, such as paths or file-like objects

Type: Iterable
pyteomics.mzid.kwargs

Additional arguments used to instantiate each sequence

Type: Mapping
chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters: files – Iterable of file names or file objects.
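
The chaining behaviour can be pictured with itertools. In this sketch read_fake is a hypothetical stand-in for a file reader (it does not open any file), and chain_readers only mimics the idea of forwarding keyword arguments to each reader.

```python
from itertools import chain


def read_fake(source, scale=1):
    """Hypothetical stand-in for a file reader: yields dicts for a 'file'."""
    for i in range(2):
        yield {'source': source, 'index': i * scale}


def chain_readers(*sources, **kwargs):
    """Lazily iterate over the records of every source, one after another."""
    return chain.from_iterable(read_fake(s, **kwargs) for s in sources)


records = list(chain_readers('a.mzid', 'b.mzid', scale=10))
```

Because the generator expression is consumed lazily, each source is only opened when the previous one has been exhausted.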
pyteomics.mzid.filter(*args, **kwargs)

Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.

Requires numpy and, optionally, pandas.

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is True.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • full_output (bool, keyword only, optional) –

    If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.

    Note

    The name for the parameter comes from the fact that it is internally passed to qvalues().

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • **kwargs (passed to the chain() function.) –
Returns:

out

Return type:

iterator or numpy.ndarray or pandas.DataFrame
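
The trade-off behind full_output=False (two passes over the data, less memory) can be sketched as below. This is an illustration under simplifying assumptions, not the library's code: FDR is estimated as N_decoy / N_target, score ties are ignored, and the names score_threshold, two_pass_filter and the PSM fields id, score and decoy are hypothetical.

```python
def score_threshold(pairs, fdr):
    """Return the worst (largest) score that keeps estimated FDR <= fdr.

    `pairs` is an iterable of (score, is_decoy); smaller scores are better.
    Returns None if no cut-off satisfies the requirement.
    """
    n_decoy = n_target = 0
    best = None
    for score, decoy in sorted(pairs):
        if decoy:
            n_decoy += 1
        else:
            n_target += 1
        if n_target and n_decoy / n_target <= fdr:
            best = score
    return best


def two_pass_filter(make_reader, fdr):
    """Yield target PSMs passing the threshold, reading the data twice."""
    # Pass 1: keep only scores and decoy flags in memory.
    cut = score_threshold(
        ((p['score'], p['decoy']) for p in make_reader()), fdr)
    if cut is None:
        return
    # Pass 2: re-read the PSMs and yield the ones that pass.
    for psm in make_reader():
        if psm['score'] <= cut and not psm['decoy']:
            yield psm


# Made-up PSMs; make_reader stands in for re-opening the input files.
PSMS = [
    {'id': 1, 'score': 0.001, 'decoy': False},
    {'id': 2, 'score': 0.01, 'decoy': False},
    {'id': 3, 'score': 0.02, 'decoy': True},
    {'id': 4, 'score': 0.05, 'decoy': False},
    {'id': 5, 'score': 0.1, 'decoy': False},
    {'id': 6, 'score': 0.2, 'decoy': True},
]
passed = [p['id'] for p in two_pass_filter(lambda: iter(PSMS), fdr=0.3)]
```

Only the lightweight (score, flag) pairs are held in memory during the first pass; the price is that every input has to be read a second time.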

filter.chain(*files, **kwargs)

Chain filter() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter() function.

filter.chain.from_iterable(files, **kwargs)

Chain filter() for several files. Keyword arguments are passed to the filter() function.

Parameters: files – Iterable of file names or file objects.
pyteomics.mzid.DataFrame(*args, **kwargs)[source]

Read MzIdentML files into a pandas.DataFrame.

Requires pandas.

Warning

Only the first ‘SpectrumIdentificationItem’ element is considered in every ‘SpectrumIdentificationResult’.

Parameters:
  • *args – Passed to chain().
  • **kwargs – Passed to chain().
  • sep (str or None, keyword only, optional) – Some values related to PSMs (such as protein information) are variable-length lists. If sep is a str, they will be packed into a single string using this delimiter. If sep is None, they are kept as lists. Default is None.
Returns:

out

Return type:

pandas.DataFrame
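
The warning about using only the first ‘SpectrumIdentificationItem’ can be made concrete with plain dicts. The records below are invented, the helper first_item_rows is hypothetical, and the field names other than SpectrumIdentificationItem are used only for illustration.

```python
# Made-up entries shaped like groups of identifications per spectrum.
results = [
    {'spectrumID': 'scan=1', 'SpectrumIdentificationItem': [
        {'rank': 1, 'PeptideSequence': 'PEPTIDEK'},
        {'rank': 2, 'PeptideSequence': 'PEPTLDEK'}]},
    {'spectrumID': 'scan=2', 'SpectrumIdentificationItem': [
        {'rank': 1, 'PeptideSequence': 'SAMPLER'}]},
]


def first_item_rows(results):
    """Build one flat row per result from its first identification item."""
    rows = []
    for res in results:
        row = {k: v for k, v in res.items()
               if k != 'SpectrumIdentificationItem'}
        row.update(res['SpectrumIdentificationItem'][0])
        rows.append(row)
    return rows


rows = first_item_rows(results)
```

The second candidate for scan=1 does not appear in any row, so lower-ranked matches have to be read with the iterator interface if they are needed.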

class pyteomics.mzid.MzIdentML(*args, **kwargs)[source]

Bases: pyteomics.xml.MultiProcessingXML, pyteomics.xml.IndexSavingXML

Parser class for MzIdentML files.

__init__(*args, **kwargs)[source]

Create an indexed XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
  • indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

out

Return type:

dict
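
Retrieval through a byte offset index, as opposed to parsing from the start of the file, can be sketched roughly like this. The data, the regular expressions and the helpers build_offset_index and read_item are all invented for the example and are much cruder than a real XML indexer.

```python
import io
import re

# A tiny synthetic "file" with three identifiable elements.
DATA = (b'<list>'
        b'<item id="a">first</item>'
        b'<item id="b">second</item>'
        b'<item id="c">third</item>'
        b'</list>')


def build_offset_index(fileobj):
    """Map each item id to the byte offset of its opening tag."""
    fileobj.seek(0)
    content = fileobj.read()
    return {m.group(1).decode(): m.start()
            for m in re.finditer(rb'<item id="([^"]+)">', content)}


def read_item(fileobj, index, elem_id):
    """Seek straight to an indexed item and return its text."""
    fileobj.seek(index[elem_id])
    chunk = fileobj.read()
    match = re.match(rb'<item id="[^"]+">([^<]*)</item>', chunk)
    return match.group(1).decode()


f = io.BytesIO(DATA)
index = build_offset_index(f)
second = read_item(f, index, 'b')
```

Building the index costs one full scan, after which each lookup starts reading at the element itself instead of at the beginning of the file.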

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs (passed to self._get_info_smart().) –
Returns:

out

Return type:

iterator
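
Matching by local name, as described for the path argument, means that the namespace part of each tag is ignored. The sketch below shows that idea with the standard library only. The helpers local_name and iter_by_local_name and the sample document are hypothetical, and the code does not reproduce the XPath translation that iterfind() performs.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_by_local_name(root, name):
    """Yield every element whose local name equals `name`, in document order."""
    for elem in root.iter():
        if local_name(elem.tag) == name:
            yield elem


xml_text = (
    '<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">'
    "<SequenceCollection>"
    '<Peptide id="pep_1"/><Peptide id="pep_2"/>'
    "</SequenceCollection>"
    "</MzIdentML>"
)
root = ET.fromstring(xml_text)
peptide_ids = [elem.get("id") for elem in iter_by_local_name(root, "Peptide")]
```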

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.
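
Because results come back out of order, it helps if the target function returns something that identifies the entry it processed. The sketch below illustrates that contract with threads from the standard library rather than worker processes. The names unordered_map and count_peaks and the spectrum dicts are assumptions, not part of Pyteomics.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def unordered_map(target, entries, workers=2, args=(), kwargs=None):
    """Apply `target` to every entry and yield results as they complete."""
    kwargs = kwargs or {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(target, entry, *args, **kwargs) for entry in entries]
        for future in as_completed(futures):
            yield future.result()


def count_peaks(spectrum, min_intensity=0.0):
    """Return the entry id together with the result so the order does not matter."""
    n_peaks = sum(1 for value in spectrum["intensity"] if value >= min_intensity)
    return spectrum["id"], n_peaks


# Hypothetical entries; a real reader would yield richer dicts.
spectra = [
    {"id": "s1", "intensity": [5.0, 20.0]},
    {"id": "s2", "intensity": [15.0, 30.0, 1.0]},
]
results = dict(unordered_map(count_peaks, spectra, kwargs={"min_intensity": 10.0}))
```

Collecting the (id, value) pairs into a dict restores a stable lookup regardless of completion order.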

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters:path (str) – The path to the file to parse
reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

pyteomics.mzid.filter_df(*args, **kwargs)[source]

Read MzIdentML files or DataFrames and return a DataFrame with filtered PSMs. Positional arguments can be MzIdentML files or DataFrames.

Requires pandas.

Warning

Only the first ‘SpectrumIdentificationItem’ element is considered in every ‘SpectrumIdentificationResult’.

Parameters:
  • key (str / iterable / callable, keyword only, optional) – Default is ‘mascot:expectation value’.
  • is_decoy (str / iterable / callable, keyword only, optional) – Default is ‘isDecoy’.
  • *args – Passed to auxiliary.filter() and/or DataFrame().
  • **kwargs – Passed to auxiliary.filter() and/or DataFrame().
Returns:

out

Return type:

pandas.DataFrame

pyteomics.mzid.get_by_id(source, elem_id, **kwargs)[source]

Parse source and return the element with id attribute equal to elem_id. Returns None if no such element is found.

Note

This function is provided for backward compatibility only. If you do multiple get_by_id() calls on one file, you should create an MzIdentML object and use its get_by_id() method.

Parameters:
  • source (str or file) – A path to a target mzIdentML file or the file object itself.
  • elem_id (str) – The value of the id attribute to match.
Returns:

out

Return type:

dict or None

pyteomics.mzid.is_decoy(psm, prefix=None)[source]

Given a PSM dict, return True if all proteins in the dict are marked as decoy, and False otherwise.

Parameters:
  • psm (dict) – A dict, as yielded by read().
  • prefix (ignored) –
Returns:

out

Return type:

bool
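
The "all proteins are decoy" rule can be sketched on a hand-written PSM dict. The nested layout used here ('SpectrumIdentificationItem' containing 'PeptideEvidenceRef' entries with an 'isDecoy' flag) is an assumption about what read() yields when references are retrieved, and all_proteins_decoy is a hypothetical helper, not the library function.

```python
def all_proteins_decoy(psm):
    """Return True only if every peptide evidence entry is flagged as decoy."""
    return all(
        evidence["isDecoy"]
        for item in psm["SpectrumIdentificationItem"]
        for evidence in item["PeptideEvidenceRef"]
    )


# Hand-made PSM dicts; real ones carry many more keys.
decoy_psm = {
    "SpectrumIdentificationItem": [
        {"PeptideEvidenceRef": [{"isDecoy": True}, {"isDecoy": True}]}
    ]
}
mixed_psm = {
    "SpectrumIdentificationItem": [
        {"PeptideEvidenceRef": [{"isDecoy": True}, {"isDecoy": False}]}
    ]
}
```

A PSM that maps to at least one target protein is treated as a target match under this rule.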

pyteomics.mzid.iterfind(source, path, **kwargs)[source]

Parse source and yield info on elements with specified local name or by specified “XPath”.

Note

This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create an MzIdentML object and use its iterfind() method.

Parameters:
  • source (str or file) – File name or file-like object.
  • path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition at the end, such as "/path/to/element[some_value>1.5]". Note that much more powerful filtering is possible in plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
  • retrieve_refs (bool, optional) – If True, additional information from references will be automatically added to the results. The file processing time will increase. Default is False.
  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzIdentML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or to suppress the related warnings.
  • build_id_cache (bool, optional) – Defines whether a cache of element IDs should be built and stored on the created MzIdentML instance. Default value is the value of retrieve_refs.
Returns:

out

Return type:

iterator

pyteomics.mzid.read(source, **kwargs)[source]

Parse source and iterate through peptide-spectrum matches.

Note

This function is provided for backward compatibility only. It simply creates an MzIdentML instance using provided arguments and returns it.

Parameters:
  • source (str or file) – A path to a target mzIdentML file or the file object itself.
  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
  • retrieve_refs (bool, optional) – If True, additional information from references will be automatically added to the results. The file processing time will increase. Default is True.
  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzIdentML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or to suppress the related warnings.
  • build_id_cache (bool, optional) –

    Defines whether a cache of element IDs should be built and stored on the created MzIdentML instance. Default value is the value of retrieve_refs.

    Note

    This parameter is ignored when use_index is True (default).

  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for the indexed elements. If True (default), build_id_cache is ignored.
  • indexed_tags (container of bytes, optional) – Defines which elements need to be indexed. Empty set by default.
Returns:

out – An iterator over the dicts with PSM properties.

Return type:

MzIdentML

mztab - mzTab file reader

Summary

mzTab is one of the standards developed by the Proteomics Informatics working group of the HUPO Proteomics Standard Initiative.

This module provides a way to read mzTab files into a collection of pandas.DataFrame instances in memory, along with a mapping of the file-level metadata.

Data access
MzTab - a class representing a single mzTab file
class pyteomics.mztab.MzTab(path, encoding='utf8', table_format='df')[source]

Bases: pyteomics.mztab._MzTabParserBase

Parser for mzTab format files.

comments

A list of comments across the file

Type:list
file

A file stream wrapper for the file to be read

Type:_file_obj
metadata

A mapping of the file-level metadata entities.

Type:OrderedDict
peptide_table

The table of peptides. Not commonly used.

Type:_MzTabTable or pd.DataFrame
protein_table

The table of protein identifications.

Type:_MzTabTable or pd.DataFrame
small_molecule_table

The table of small molecule identifications.

Type:_MzTabTable or pd.DataFrame
spectrum_match_table

The table of spectrum-to-peptide match identifications.

Type:_MzTabTable or pd.DataFrame
table_format

The structure type to replace each table with. The string ‘df’ will use pd.DataFrame instances. ‘dict’ will create a dictionary of dictionaries for each table. A callable will be called on each raw _MzTabTable object

Type:‘df’, ‘dict’, or callable
__init__(path, encoding='utf8', table_format='df')[source]

Initialize self. See help(type(self)) for accurate signature.

collapse_properties(proplist)[source]

Collapse a flat property list into a hierarchical structure.

This is intended to operate on Mapping objects, including dict, pandas.Series and pandas.DataFrame. For example, it converts

{
  "ms_run[1]-format": "Andromeda:apl file format",
  "ms_run[1]-location": "file://...",
  "ms_run[1]-id_format": "scan number only nativeID format"
}

to

{
  "ms_run": [
    {
      "format": "Andromeda:apl file format",
      "location": "file://...",
      "id_format": "scan number only nativeID format"
    }
  ]
}
Parameters:proplist (Mapping) – Key-Value pairs to collapse
Returns:The collapsed property list
Return type:OrderedDict
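
The transformation above can be approximated in a few lines of plain Python. This is a simplified sketch, not the method's actual code: the function name collapse_flat_properties and the regular expression are assumptions, and only keys of the form name[index]-field are grouped while all other keys are copied unchanged.

```python
import re
from collections import OrderedDict

# Matches keys such as "ms_run[1]-format" (name, 1-based index, field).
_INDEXED_KEY = re.compile(r"^(?P<name>[^\[\]]+)\[(?P<index>\d+)\]-(?P<field>.+)$")


def collapse_flat_properties(proplist):
    """Group 'name[i]-field' keys into lists of per-index mappings."""
    collapsed = OrderedDict()
    for key, value in proplist.items():
        match = _INDEXED_KEY.match(key)
        if match is None:
            collapsed[key] = value
            continue
        entries = collapsed.setdefault(match.group("name"), [])
        # Indices are assumed to be 1-based; clamp to 0 to stay safe.
        position = max(int(match.group("index")) - 1, 0)
        while len(entries) <= position:
            entries.append(OrderedDict())
        entries[position][match.group("field")] = value
    return collapsed


flat = {
    "ms_run[1]-format": "Andromeda:apl file format",
    "ms_run[1]-location": "file://...",
    "ms_run[1]-id_format": "scan number only nativeID format",
}
nested = collapse_flat_properties(flat)
```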

featurexml - reader for featureXML files

Summary

featureXML is a format specified in the OpenMS project. It defines a list of LC-MS features observed in an experiment.

This module provides a minimalistic way to extract information from featureXML files. You can use the old functional interface (read()) or the new object-oriented interface (FeatureXML) to iterate over entries in <feature> elements. FeatureXML also supports direct indexing with feature IDs.

Data access

FeatureXML - a class representing a single featureXML file. Other data access functions use this class internally.

read() - iterate through features in a featureXML file. Data from a single feature are converted to a human-readable dict.

chain() - read multiple featureXML files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

Dependencies

This module requires lxml.


pyteomics.openms.featurexml.chain(*args, **kwargs)

Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:files – Iterable of file names or file objects.
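
Conceptually, chaining opens each file in turn and yields its entries as one continuous stream. The sketch below mimics that behaviour with a stand-in read() context manager backed by an in-memory dict. FAKE_FILES, read and chain_from_iterable are hypothetical names for illustration and do not touch real featureXML files.

```python
from contextlib import contextmanager

# Hypothetical in-memory "files": file name -> list of feature dicts.
FAKE_FILES = {
    "a.featureXML": [{"id": "f_1"}, {"id": "f_2"}],
    "b.featureXML": [{"id": "f_3"}],
}


@contextmanager
def read(source, **kwargs):
    """Stand-in for a module-level read(): yields an iterator over entries."""
    yield iter(FAKE_FILES[source])


def chain_from_iterable(files, **kwargs):
    """Read each source in order and yield its entries as a single stream."""
    for source in files:
        with read(source, **kwargs) as reader:
            for entry in reader:
                yield entry


feature_ids = [
    feature["id"]
    for feature in chain_from_iterable(["a.featureXML", "b.featureXML"])
]
```

Because each source is opened only when its turn comes, at most one reader is active at a time.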
class pyteomics.openms.featurexml.FeatureXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]

Bases: pyteomics.xml.MultiProcessingXML

Parser class for featureXML files.

__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)

Create an indexed XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
  • indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs (passed to self._get_info_smart().) –
Returns:

out

Return type:

iterator

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

reset()

Resets the iterator to its initial state.

pyteomics.openms.featurexml.read(source, read_schema=True, iterative=True, use_index=False)[source]

Parse source and iterate through features.

Parameters:
  • source (str or file) – A path to a target featureXML file or the file object itself.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the file header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or to suppress the related warnings.
  • iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for spectrum elements. Default is False.
Returns:

out – An iterator over the dicts with feature properties.

Return type:

iterator

trafoxml - reader for trafoXML files

Summary

trafoXML is a format specified in the OpenMS project. It defines a transformation, which is a result of retention time alignment.

This module provides a minimalistic way to extract information from trafoXML files. You can use the old functional interface (read()) or the new object-oriented interface (TrafoXML) to iterate over entries in <Pair> elements.

Data access

TrafoXML - a class representing a single trafoXML file. Other data access functions use this class internally.

read() - iterate through pairs in a trafoXML file. Data from a single pair are converted to a human-readable dict.

chain() - read multiple trafoXML files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

Dependencies

This module requires lxml.


pyteomics.openms.trafoxml.chain(*args, **kwargs)

Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:files – Iterable of file names or file objects.
class pyteomics.openms.trafoxml.TrafoXML(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)[source]

Bases: pyteomics.xml.XML

Parser class for trafoXML files.

__init__(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)

Create an XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • build_id_cache (bool, optional) – Defines whether a dictionary mapping IDs to XML tree elements should be built and stored on the instance. It is used in XML.get_by_id(), e.g. when using pyteomics.mzid.MzIdentML with retrieve_refs=True.
  • huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, **kwargs)

Parse the file and return the element with id attribute equal to elem_id. Returns None if no such element is found.

Parameters:elem_id (str) – The value of the id attribute to match.
Returns:out
Return type:dict or None
iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs (passed to self._get_info_smart().) –
Returns:

out

Return type:

iterator

reset()

Resets the iterator to its initial state.

pyteomics.openms.trafoxml.read(source, read_schema=True, iterative=True)[source]

Parse source and iterate through pairs.

Parameters:
  • source (str or file) – A path to a target trafoXML file or the file object itself.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the file header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or to suppress the related warnings.
  • iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
Returns:

out – An iterator over the dicts with pair properties.

Return type:

iterator

idxml - idXML file reader

Summary

idXML is a format specified in the OpenMS project. It defines a list of peptide identifications.

This module provides a minimalistic way to extract information from idXML files. You can use the old functional interface (read()) or the new object-oriented interface (IDXML) to iterate over entries in <PeptideIdentification> elements. Note that each entry can contain more than one PSM (peptide-spectrum match). They are accessible with 'PeptideHit' key. IDXML objects also support direct indexing by element ID.
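
Since one entry can hold several PSMs, downstream code usually flattens the 'PeptideHit' list. The sketch below does this on hand-written dicts. Only the 'PeptideHit' key comes from the description above; the other keys ('spectrum_reference', 'sequence', 'score') and the helper iter_psms are assumptions made for illustration.

```python
def iter_psms(entries):
    """Yield (entry, hit) pairs, one per PSM, from PeptideIdentification-like dicts."""
    for entry in entries:
        for hit in entry.get("PeptideHit", []):
            yield entry, hit


# Hand-made entries; keys other than 'PeptideHit' are hypothetical.
entries = [
    {
        "spectrum_reference": "scan=1",
        "PeptideHit": [
            {"sequence": "PEPTIDE", "score": 0.01},
            {"sequence": "PEPTIDEK", "score": 0.20},
        ],
    },
    {"spectrum_reference": "scan=2", "PeptideHit": [{"sequence": "ELVISK", "score": 0.05}]},
]

flat = [(entry["spectrum_reference"], hit["sequence"]) for entry, hit in iter_psms(entries)]
```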

Data access

IDXML - a class representing a single idXML file. Other data access functions use this class internally.

read() - iterate through peptide-spectrum matches in an idXML file. Data from a single PSM group are converted to a human-readable dict. Basically creates an IDXML object and reads it.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

DataFrame() - read idXML files into a pandas.DataFrame.

Target-decoy approach

filter() - read a chain of idXML files and filter to a certain FDR using TDA.

filter.chain() - chain a series of filters applied independently to several files.

filter.chain.from_iterable() - chain a series of filters applied independently to an iterable of files.

filter_df() - filter idXML files and return a pandas.DataFrame.

is_decoy() - determine if a “SpectrumIdentificationResult” should be considered decoy.

fdr() - estimate the false discovery rate of a set of identifications using the target-decoy approach.

qvalues() - get an array of scores and local FDR values for a PSM set using the target-decoy approach.

Deprecated functions

version_info() - get information about idXML version and schema. You can just read the corresponding attribute of the IDXML object.

get_by_id() - get an element by its ID and extract the data from it. You can just call the corresponding method of the IDXML object.

iterfind() - iterate over elements in an idXML file. You can just call the corresponding method of the IDXML object.

Dependencies

This module requires lxml.


pyteomics.openms.idxml.version_info(source)

Provide version information about the idXML file.

Note

This function is provided for backward compatibility only. It simply creates an IDXML instance and returns its version_info attribute.

Parameters:source (str or file) – File name or file-like object.
Returns:out – A (version, schema URL) tuple, both elements are strings or None.
Return type:tuple
pyteomics.openms.idxml.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)

Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:

FDR = \frac{N_{decoy}}{N_{target} * ratio}

The second formula is:

FDR = \frac{N_{decoy} * (1 + \frac{1}{ratio})}{N_{total}}
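
As a worked example, the two formulas can be evaluated directly from PSM counts. The function names below are hypothetical, and the sketch assumes that N_total counts both target and decoy PSMs. It applies no correction and is not the library's fdr() function.

```python
def fdr_formula_1(n_decoy, n_target, ratio=1.0):
    """FDR = N_decoy / (N_target * ratio)."""
    return n_decoy / (n_target * ratio)


def fdr_formula_2(n_decoy, n_total, ratio=1.0):
    """FDR = N_decoy * (1 + 1 / ratio) / N_total."""
    return n_decoy * (1 + 1 / ratio) / n_total


# 1000 PSMs in total, 10 of them decoys, equally sized target and decoy databases.
n_decoy, n_target = 10, 990
fdr_1 = fdr_formula_1(n_decoy, n_target)            # 10 / 990
fdr_2 = fdr_formula_2(n_decoy, n_target + n_decoy)  # 10 * 2 / 1000
```

With ratio = 1, formula 1 relates decoys to targets only (about 1.01% here), while formula 2 estimates the share of false matches in the whole list including decoys (2% here).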

Note

This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.

Parameters:
  • psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
  • formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
  • is_decoy (callable, iterable, or str, optional) –

    If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable, iterable, or str, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.

  • ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

    Note

    Requires numpy, if correction is a float or 2.

    Note

    Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).

Returns:

out – The estimation of FDR, (roughly) between 0 and 1.

Return type:

float

pyteomics.openms.idxml.qvalues(*args, **kwargs)

Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.

Requires numpy (and optionally pandas).

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is False.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
  • **kwargs (passed to the chain() function.) –
Returns:

out – A sorted array of records with the following fields:

  • ’score’: np.float64
  • ’is decoy’: np.bool_
  • ’q’: np.float64
  • ’psm’: np.object_ (if full_output is True)

Return type:

numpy.ndarray
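
The idea behind target-decoy q-values can be sketched without NumPy. The function below is a simplified illustration, not the library's algorithm: it assumes that smaller scores are better, uses formula 1 with no correction, and takes the q-value of a PSM to be the lowest FDR of any score threshold that still includes it. The name tda_qvalues is hypothetical.

```python
def tda_qvalues(scores, decoy_flags, ratio=1.0):
    """Return (score, is_decoy, q) tuples sorted from best to worst score."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    fdrs = []
    n_target = n_decoy = 0
    for i in order:
        if decoy_flags[i]:
            n_decoy += 1
        else:
            n_target += 1
        # Formula 1; max() guards against division by zero before the first target.
        fdrs.append(n_decoy / (max(n_target, 1) * ratio))
    # q-value: running minimum of the FDR, taken from the worst score upwards.
    qvals = fdrs[:]
    for k in range(len(qvals) - 2, -1, -1):
        qvals[k] = min(qvals[k], qvals[k + 1])
    return [(scores[i], decoy_flags[i], q) for i, q in zip(order, qvals)]


scores = [3.0, 1.0, 5.0, 2.0, 4.0]
decoys = [True, False, True, False, False]
table = tda_qvalues(scores, decoys)
```

The running minimum makes q-values non-decreasing along the sorted list, so a single cutoff on q selects a contiguous block of the best-scoring PSMs.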

pyteomics.openms.idxml.chain(*sources, **kwargs)

Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.

pyteomics.openms.idxml.sources

Sources for creating new sequences from, such as paths or file-like objects

Type:Iterable
pyteomics.openms.idxml.kwargs

Additional arguments used to instantiate each sequence

Type:Mapping
chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:files – Iterable of file names or file objects.
pyteomics.openms.idxml.filter(*args, **kwargs)

Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.

Requires numpy and, optionally, pandas.

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is True.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • full_output (bool, keyword only, optional) –

    If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.

    Note

    The name for the parameter comes from the fact that it is internally passed to qvalues().

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • **kwargs – Passed to the chain() function.
Returns:

out

Return type:

iterator or numpy.ndarray or pandas.DataFrame
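
To make the target-decoy filtering described above more concrete, here is a minimal pure-Python sketch. It is not the code used by filter(): it assumes that formula 1 is FDR = decoys / (ratio * targets), applies no correction, and uses hypothetical PSM dicts with 'expect' and 'protein' fields. See fdr() for the authoritative definitions.

```python
def qvalues(psms, key, is_decoy, ratio=1.0):
    """Return (psm, q) pairs sorted by key (smaller is better).

    Assumption: formula 1, FDR = decoys / (ratio * targets), no correction.
    """
    ordered = sorted(psms, key=key)
    decoys = targets = 0
    fdrs = []
    for psm in ordered:
        if is_decoy(psm):
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / (ratio * targets) if targets else float("inf"))
    # The q-value of a PSM is the lowest FDR of any threshold that accepts it,
    # i.e. a running minimum taken from the end of the sorted list.
    qs, best = [], float("inf")
    for value in reversed(fdrs):
        best = min(best, value)
        qs.append(best)
    qs.reverse()
    return list(zip(ordered, qs))


def filter_psms(psms, fdr, key, is_decoy, ratio=1.0, remove_decoy=True):
    """Keep PSMs whose q-value does not exceed fdr."""
    kept = [psm for psm, q in qvalues(psms, key, is_decoy, ratio) if q <= fdr]
    if remove_decoy:
        kept = [psm for psm in kept if not is_decoy(psm)]
    return kept


def by_expect(psm):
    return psm["expect"]


def has_decoy_prefix(psm):
    return psm["protein"].startswith("DECOY_")


# Hypothetical PSMs; the field names are made up for this example.
psms = [
    {"expect": 0.004, "protein": "P3"},
    {"expect": 0.001, "protein": "P1"},
    {"expect": 0.003, "protein": "DECOY_P9"},
    {"expect": 0.006, "protein": "DECOY_P4"},
    {"expect": 0.002, "protein": "P2"},
    {"expect": 0.005, "protein": "P5"},
]
passed = filter_psms(psms, 0.25, key=by_expect, is_decoy=has_decoy_prefix)
print([psm["protein"] for psm in passed])
```

Passing remove_decoy=False in this sketch keeps the decoy PSMs that pass the threshold, but unlike filter() it does not switch to formula 2.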

filter.chain(*files, **kwargs)

Chain filter() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter() function.

filter.chain.from_iterable(*files, **kwargs)

Chain filter() for several files. Keyword arguments are passed to the filter() function.

Parameters:files – Iterable of file names or file objects.
pyteomics.openms.idxml.DataFrame(*args, **kwargs)[source]

Read idXML files into a pandas.DataFrame.

Requires pandas.

Warning

Only the first ‘PeptideHit’ element is considered in every ‘PeptideIdentification’.

Parameters:
  • *args – Passed to chain()
  • **kwargs – Passed to chain()
  • sep (str or None, keyword only, optional) – Some values related to PSMs (such as protein information) are variable-length lists. If sep is a str, they will be packed into single string using this delimiter. If sep is None, they are kept as lists. Default is None.
Returns:

out

Return type:

pandas.DataFrame
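
The effect of sep can be pictured with a small pure-Python sketch. pack_lists() and the field names in the example row are hypothetical; the sketch only mirrors the behaviour described for the sep parameter and does not use pandas.

```python
def pack_lists(record, sep=None):
    """Join list-valued fields into one string if sep is a str."""
    if sep is None:
        return dict(record)  # lists are kept as they are
    return {
        name: sep.join(map(str, value)) if isinstance(value, list) else value
        for name, value in record.items()
    }


# A made-up row with one variable-length field.
row = {"sequence": "PEPTIDE", "protein": ["P1", "P2"], "score": 0.01}
print(pack_lists(row, sep=";"))
print(pack_lists(row))
```

Packed strings are convenient for writing tables to text files, while lists are easier to process further in Python.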

class pyteomics.openms.idxml.IDXML(*args, **kwargs)[source]

Bases: pyteomics.xml.IndexedXML

Parser class for idXML files.

__init__(*args, **kwargs)[source]

Create an indexed XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
  • indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs – Passed to self._get_info_smart().
Returns:

out

Return type:

iterator
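
The idea of matching elements by local name, without specifying namespaces, can be sketched with the standard library's ElementTree. This is an illustration under stated assumptions, not how iterfind() is implemented: the document, its namespace, and the helper names are made up for the example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iterfind_local(root, name):
    """Yield attribute dicts of all elements whose local name equals name."""
    for elem in root.iter():
        if local_name(elem.tag) == name:
            yield dict(elem.attrib)


# A made-up document with a made-up namespace.
doc = ET.fromstring(
    '<IdXML xmlns="http://example.org/ns">'
    '<PeptideHit sequence="PEPTIDE" score="0.01"/>'
    '<PeptideHit sequence="PEPTIDER" score="0.2"/>'
    '</IdXML>'
)
hits = list(iterfind_local(doc, "PeptideHit"))
# Further filtering is plain Python on the yielded dicts.
good = [hit for hit in hits if float(hit["score"]) < 0.05]
print(good)
```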

reset()

Resets the iterator to its initial state.

pyteomics.openms.idxml.filter_df(*args, **kwargs)[source]

Read idXML files or DataFrames and return a DataFrame with filtered PSMs. Positional arguments can be idXML files or DataFrames.

Requires pandas.

Warning

Only the first ‘PeptideHit’ element is considered in every ‘PeptideIdentification’.

Parameters:
  • key (str / iterable / callable, keyword only, optional) – Peptide identification score. Default is ‘score’. You will probably need to change it.
  • is_decoy (str / iterable / callable, keyword only, optional) – Default is ‘is decoy’.
  • *args – Passed to auxiliary.filter() and/or DataFrame().
  • **kwargs – Passed to auxiliary.filter() and/or DataFrame().
Returns:

out

Return type:

pandas.DataFrame

pyteomics.openms.idxml.get_by_id(source, elem_id, **kwargs)[source]

Parse source and return the element with id attribute equal to elem_id. Returns None if no such element is found.

Note

This function is provided for backward compatibility only. If you do multiple get_by_id() calls on one file, you should create an IDXML object and use its get_by_id() method.

Parameters:
  • source (str or file) – A path to a target idXML file or the file object itself.
  • elem_id (str) – The value of the id attribute to match.
Returns:

out

Return type:

dict or None

pyteomics.openms.idxml.is_decoy(psm, prefix=None)[source]

Given a PSM dict, return True if it is marked as decoy, and False otherwise.

Parameters:
  • psm (dict) – A dict, as yielded by read().
  • prefix (ignored) –
Returns:

out

Return type:

bool

pyteomics.openms.idxml.iterfind(source, path, **kwargs)[source]

Parse source and yield info on elements with specified local name or by specified “XPath”.

Note

This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create an IDXML object and use its iterfind() method.

Parameters:
  • source (str or file) – File name or file-like object.
  • path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
  • retrieve_refs (bool, optional) – If True, additional information from references will be automatically added to the results. The file processing time will increase. Default is False.
  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the IDXML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings.
  • build_id_cache (bool, optional) – Defines whether a cache of element IDs should be built and stored on the created IDXML instance. Default value is the value of retrieve_refs.
Returns:

out

Return type:

iterator

pyteomics.openms.idxml.read(source, **kwargs)[source]

Parse source and iterate through peptide-spectrum matches.

Note

This function is provided for backward compatibility only. It simply creates an IDXML instance using provided arguments and returns it.

Parameters:
  • source (str or file) – A path to a target IDXML file or the file object itself.
  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
  • retrieve_refs (bool, optional) – If True, additional information from references will be automatically added to the results. The file processing time will increase. Default is True.
  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the IDXML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings.
  • build_id_cache (bool, optional) –

    Defines whether a cache of element IDs should be built and stored on the created IDXML instance. Default value is the value of retrieve_refs.

    Note

    This parameter is ignored when use_index is True (default).

  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for the indexed elements. If True (default), build_id_cache is ignored.
  • indexed_tags (container of bytes, optional) – Defines which elements need to be indexed. Empty set by default.
Returns:

out – An iterator over the dicts with PSM properties.

Return type:

IDXML

traml - targeted MS transition data in TraML format

Summary

TraML is a standard, rich XML format for targeted mass spectrometry method definitions. Please refer to psidev.info for the detailed specification of the format and structure of TraML files.

This module provides a minimalistic way to extract information from TraML files. You can use the object-oriented interface (TraML instances) to access target definitions and transitions. TraML objects also support indexing with entity IDs directly.

Data access

TraML - a class representing a single TraML file. Other data access functions use this class internally.

read() - iterate through transitions in TraML format.

chain() - read multiple TraML files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

Deprecated functions

version_info() - get version information about the TraML file. You can just read the corresponding attribute of the TraML object.

iterfind() - iterate over elements in a TraML file. You can just call the corresponding method of the TraML object.

Dependencies

This module requires lxml.


pyteomics.traml.chain(*sources, **kwargs)

Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.

pyteomics.traml.sources

Sources for creating new sequences from, such as paths or file-like objects

Type:Iterable
pyteomics.traml.kwargs

Additional arguments used to instantiate each sequence

Type:Mapping
chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:files – Iterable of file names or file objects.
class pyteomics.traml.TraML(*args, **kwargs)[source]

Bases: pyteomics.xml.MultiProcessingXML, pyteomics.xml.IndexSavingXML

Parser class for TraML files.

__init__(*args, **kwargs)[source]

Create an indexed XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
  • indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs – Passed to self._get_info_smart().
Returns:

out

Return type:

iterator

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters:path (str) – The path to the file to parse
reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename
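
Storing a byte offset index next to the data file can be sketched as follows. The JSON layout, the sidecar file name, and the helper names are assumptions made for this example; they are not taken from the library.

```python
import json
import os
import tempfile


def write_offsets(index, path):
    """Save a {tag: {id: offset}} mapping as JSON."""
    with open(path, "w") as handle:
        json.dump(index, handle)


def read_offsets(path):
    """Load a previously saved offset mapping."""
    with open(path) as handle:
        return json.load(handle)


# A made-up index: element ids mapped to byte positions in the file.
index = {"Transition": {"tr1": 120, "tr2": 482}}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sidecar file name.
    sidecar = os.path.join(tmp, "example.traml-byte-offsets.json")
    write_offsets(index, sidecar)
    restored = read_offsets(sidecar)

print(restored)
```

Reloading a saved index avoids rescanning a large file every time it is opened, at the cost of keeping the sidecar file in sync with the data.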

pyteomics.traml.iterfind(source, path, **kwargs)[source]

Parse source and yield info on elements with specified local name or by specified “XPath”.

Note

This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create a TraML object and use its iterfind() method.

Parameters:
  • source (str or file) – File name or file-like object.
  • path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the TraML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.
Returns:

out

Return type:

iterator

pyteomics.traml.read(source, retrieve_refs=True, read_schema=False, iterative=True, use_index=False, huge_tree=False)[source]

Parse source and iterate through transitions.

Parameters:
  • source (str or file) – A path to a target TraML file or the file object itself.
  • retrieve_refs (bool, optional) – If True, additional information from references will be automatically added to the results. The file processing time will increase. Default is True.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the TraML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.
  • iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for spectrum elements. Default is False.
  • huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
Returns:

out – A TraML object, suitable for iteration and possibly random access.

Return type:

TraML

pylab_aux - auxiliary functions for plotting with pylab

This module serves as a collection of useful routines for data plotting with matplotlib.

Generic plotting

plot_line() - plot a line.

scatter_trend() - plot a scatter plot with a regression line.

plot_function_3d() - plot a 3D graph of a function of two variables.

plot_function_contour() - plot a contour graph of a function of two variables.

Spectrum visualization

plot_spectrum() - plot a single spectrum (m/z vs intensity).

annotate_spectrum() - plot and annotate peaks in MS/MS spectrum.

FDR control

plot_qvalue_curve() - plot the dependence of q-value on the number of PSMs (similar to a ROC curve).

Dependencies

This module requires matplotlib.


pyteomics.pylab_aux.annotate_spectrum(spectrum, peptide, centroided=True, *args, **kwargs)[source]

Plot a spectrum and annotate matching fragment peaks.

Parameters:
  • spectrum (dict) – A spectrum as returned by Pyteomics parsers. Needs to have ‘m/z array’ and ‘intensity array’ keys.
  • peptide (str) – A modX sequence.
  • centroided (bool, optional) – Passed to plot_spectrum().
  • types (Container, keyword only, optional) – Ion types to be considered for annotation. Default is (‘b’, ‘y’).
  • maxcharge (int, keyword only, optional) – Maximum charge state for fragment ions to be considered. Default is 1.
  • colors (dict, keyword only, optional) – Keys are ion types, values are colors to plot the annotated peaks with. Defaults to a red-blue scheme.
  • ftol (float, keyword only, optional) – A fixed m/z tolerance value for peak matching. Alternative to rtol.
  • rtol (float, keyword only, optional) – A relative m/z error for peak matching. Default is 10 ppm.
  • adjust_text (bool, keyword only, optional) – Adjust the overlapping text annotations using adjustText.
  • text_kw (dict, keyword only, optional) – Keyword arguments for pylab.text().
  • adjust_kw (dict, keyword only, optional) – Keyword arguments for adjust_text().
  • ion_comp (dict, keyword only, optional) – A dictionary defining definitions of ion compositions to override pyteomics.mass.std_ion_comp.
  • mass_data (dict, keyword only, optional) – A dictionary of element masses to override pyteomics.mass.nist_mass.
  • aa_mass (dict, keyword only, optional) – A dictionary of amino acid residue masses.
  • *args – Passed to plot_spectrum().
  • **kwargs – Passed to plot_spectrum().
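
The difference between ftol and rtol can be illustrated with a small matching routine. This is a sketch, not the matching code used by annotate_spectrum(): match_peaks() is a hypothetical helper, and the m/z values are made up rather than computed from a real peptide.

```python
def match_peaks(observed_mz, theoretical, rtol=1e-5, ftol=None):
    """Map each ion label to the index of the closest observed peak,
    if that peak lies within the tolerance.

    ftol is an absolute m/z tolerance; if it is None, the relative
    tolerance rtol is used instead (1e-5 corresponds to 10 ppm).
    """
    matches = {}
    if not observed_mz:
        return matches
    for label, mz in theoretical.items():
        tol = ftol if ftol is not None else rtol * mz
        closest = min(range(len(observed_mz)),
                      key=lambda i: abs(observed_mz[i] - mz))
        if abs(observed_mz[closest] - mz) <= tol:
            matches[label] = closest
    return matches


# Made-up values for illustration only.
observed = [150.0010, 200.0015, 300.0200]
theoretical = {"b2": 200.0, "y1": 150.0, "y2": 300.0}

print(match_peaks(observed, theoretical))             # relative, 10 ppm
print(match_peaks(observed, theoretical, ftol=0.05))  # fixed tolerance
```

A relative tolerance widens with m/z, which suits analyzers whose mass error scales with m/z; a fixed tolerance treats all peaks alike.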
pyteomics.pylab_aux.plot_function_3d(x, y, function, **kwargs)[source]

Plot values of a function of two variables in 3D.

More on 3D plotting in pylab:

http://www.scipy.org/Cookbook/Matplotlib/mplot3D

Parameters:
  • x (array_like of float) – The plotting range on X axis.
  • y (array_like of float) – The plotting range on Y axis.
  • function (function) – The function to plot.
  • plot_type ({'surface', 'wireframe', 'scatter', 'contour', 'contourf'}, keyword only, optional) – The type of a plot, see scipy cookbook for examples. The default value is ‘surface’.
  • num_contours (int) – The number of contours to plot, 50 by default.
  • xlabel (str, keyword only, optional) – The X axis label. Empty by default.
  • ylabel (str, keyword only, optional) – The Y axis label. Empty by default.
  • zlabel (str, keyword only, optional) – The Z axis label. Empty by default.
  • title (str, keyword only, optional) – The title. Empty by default.
  • **kwargs – Passed to the respective plotting function.
pyteomics.pylab_aux.plot_function_contour(x, y, function, **kwargs)[source]

Make a contour plot of a function of two variables.

Parameters:
  • x, y – The positions of the nodes of a plotting grid.
  • function (function) – The function to plot.
  • filling (bool) – Fill contours if True (default).
  • num_contours (int) – The number of contours to plot, 50 by default.
  • xlabel, ylabel – The axes labels. Empty by default.
  • title (str, optional) – The title. Empty by default.
  • **kwargs – Passed to pylab.contour() or pylab.contourf().
pyteomics.pylab_aux.plot_line(a, b, xlim=None, *args, **kwargs)[source]

Plot a line y = a * x + b.

Parameters:
  • a (float) – The slope of the line.
  • b (float) – The intercept of the line.
  • xlim (tuple, optional) – Minimal and maximal values of x. If not given, pylab.xlim() will be called.
  • *args – Passed to pylab.plot() after x and y values.
  • **kwargs – Passed to pylab.plot().
Returns:

out – The line object.

Return type:

matplotlib.lines.Line2D

pyteomics.pylab_aux.plot_qvalue_curve(qvalues, *args, **kwargs)[source]

Plot a curve with q-values on the X axis and corresponding PSM number (starting with 1) on the Y axis.

Parameters:
  • qvalues (array-like) – An array of q-values for sorted PSMs.
  • xlabel (str, keyword only, optional) – Label for the X axis. Default is “q-value”.
  • ylabel (str, keyword only, optional) – Label for the Y axis. Default is “# of PSMs”.
  • title (str, keyword only, optional) – The title. Empty by default.
  • *args – Given to pylab.plot() after x and y.
  • **kwargs – Given to pylab.plot().
Returns:

out

Return type:

matplotlib.lines.Line2D

pyteomics.pylab_aux.plot_spectrum(spectrum, centroided=True, *args, **kwargs)[source]

Plot a spectrum, assuming it is a dictionary containing “m/z array” and “intensity array”.

Parameters:
  • spectrum (dict) – A dictionary, as returned by MGF, mzML or mzXML parsers. Must contain “m/z array” and “intensity array” keys with decoded arrays.
  • centroided (bool, optional) – If True (default), peaks of the spectrum are plotted using pylab.bar(). If False, the arrays are simply plotted using pylab.plot().
  • xlabel (str, keyword only, optional) – Label for the X axis. Default is “m/z”.
  • ylabel (str, keyword only, optional) – Label for the Y axis. Default is “intensity”.
  • title (str, keyword only, optional) – The title. Empty by default.
  • *args – Given to pylab.plot() or pylab.bar() (depending on centroided).
  • **kwargs – Given to pylab.plot() or pylab.bar() (depending on centroided).
pyteomics.pylab_aux.scatter_trend(x, y=None, **kwargs)[source]

Make a scatter plot with a linear regression.

Parameters:
  • x (array_like of float) – 1-D array of floats. If y is omitted, x must be a 2-D array of shape (N, 2).
  • y (array_like of float, optional) – 1-D array of floats. If y is omitted or None, x must be a 2-D array of shape (N, 2).
  • plot_trend (bool, optional) – If True then plot a trendline (default).
  • plot_sigmas (bool, optional) – If True then plot confidence intervals of the linear fit. False by default.
  • show_legend (bool, optional) – If True, a legend will be shown with linear fit equation, correlation coefficient, and standard deviation from the fit. Default is True.
  • title (str, optional) – The title. Empty by default.
  • xlabel, ylabel – The axes labels. Empty by default.
  • alpha_legend (float, optional) – Legend box transparency. 1.0 by default
  • scatter_kwargs (dict, optional) – Keyword arguments for pylab.scatter(). Empty by default.
  • plot_kwargs (dict, optional) – Keyword arguments for plot_line(). By default, sets xlim and label.
  • legend_kwargs (dict, optional) – Keyword arguments for pylab.legend(). Default is {'loc': 'upper left'}.
  • sigma_kwargs (dict, optional) – Keyword arguments for pylab.plot() used for sigma lines. Default is {'color': 'red', 'linestyle': 'dashed'}.
  • sigma_values (iterable, optional) – Each value will be multiplied with standard error of the fit, and the line shifted by the resulting value will be plotted. Default is range(-3, 4).
  • regression (callable, optional) – Function to perform linear regression. Will be given x and y as arguments. Must return a 4-tuple: (a, b, r, stderr). Default is pyteomics.auxiliary.linear_regression().
Returns:

out – A (scatter_plot, trend_line, sigma_lines, legend) tuple.

Return type:

tuple
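
A custom regression callable has to return a 4-tuple (a, b, r, stderr). The pure-Python sketch below shows one way such a function could look. It assumes that stderr means the standard deviation of the residuals around the fitted line; it is not the code of pyteomics.auxiliary.linear_regression(). The last lines show how sigma_values would translate into shifted lines.

```python
import math


def linear_regression(x, y):
    """Least-squares fit of y = a * x + b.

    Returns (a, b, r, stderr), where r is the correlation coefficient and
    stderr is taken to be the standard deviation of the residuals.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    r = sxy / math.sqrt(sxx * syy)
    residuals = [yi - (a * xi + b) for xi, yi in zip(x, y)]
    stderr = math.sqrt(sum(e * e for e in residuals) / n)
    return a, b, r, stderr


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.1, 4.9, 7.0]
a, b, r, stderr = linear_regression(x, y)

# Each sigma value k gives a line with the same slope, shifted by k * stderr.
sigma_lines = [(a, b + k * stderr) for k in range(-3, 4)]
print(a, b, r, stderr)
```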

xml - utilities for XML parsing

This module is not intended for end users. It implements the abstract classes for all XML parsers, XML and IndexedXML, and some utility functions.

Dependencies

This module requires lxml and numpy.


class pyteomics.xml.ArrayConversionMixin(*args, **kwargs)[source]

Bases: pyteomics.auxiliary.utils.BinaryDataArrayTransformer

__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

class binary_array_record

Bases: pyteomics.auxiliary.utils.binary_array_record

Hold all of the information about a base64 encoded array needed to decode the array.

__init__

Initialize self. See help(type(self)) for accurate signature.

compression

Alias for field number 1

count()

Return number of occurrences of value.

data

Alias for field number 0

decode()

Decode data into a numerical array

Returns:
Return type:np.ndarray
dtype

Alias for field number 2

index()

Return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 4

source

Alias for field number 3

decode_data_array(source, compression_type=None, dtype=<class 'numpy.float64'>)

Decode a base64-encoded, compressed bytestring into a numerical array.

Parameters:
  • source (bytes) – A base64 string encoding a potentially compressed numerical array.
  • compression_type (str, optional) – The name of the compression method used before encoding the array into base64.
  • dtype (type, optional) – The data type to use to decode the binary array from the decompressed bytes.
Returns:

Return type:

np.ndarray
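
The decoding steps (base64, optional decompression, reinterpretation of the bytes as numbers) can be sketched with the standard library alone. This is an illustration under assumptions: values are taken to be little-endian, "zlib" is assumed to be the compression name, and a plain list is returned instead of a numpy array.

```python
import base64
import struct
import zlib


def decode_data_array(source, compression_type=None, fmt="d"):
    """Decode a base64 string into a list of numbers.

    fmt is a struct format character ("d" for 64-bit, "f" for 32-bit floats).
    """
    raw = base64.b64decode(source)
    if compression_type == "zlib":
        raw = zlib.decompress(raw)
    count = len(raw) // struct.calcsize("<" + fmt)
    return list(struct.unpack("<%d%s" % (count, fmt), raw))


# Round trip: encode three values the same way, then decode them.
values = [1.0, 2.5, -3.0]
packed = struct.pack("<3d", *values)
encoded = base64.b64encode(zlib.compress(packed))
decoded = decode_data_array(encoded, compression_type="zlib")
print(decoded)
```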

class pyteomics.xml.ByteCountingXMLScanner(source, indexed_tags, block_size=1000000)[source]

Bases: pyteomics.auxiliary.file_helpers._file_obj

Carry out the construction of a byte offset index for source XML file for each type of tag in indexed_tags.

Inherits from pyteomics.auxiliary._file_obj to support the object-oriented _keep_state() interface.

__init__(source, indexed_tags, block_size=1000000)[source]
Parameters:
  • indexed_tags (iterable of bytes) – The XML tags (without namespaces) to build indices for.
  • block_size (int, optional) – The size of each chunk or “block” of the file to hold in memory as a partitioned string at any given time. Defaults to 1000000.
build_byte_index(lookup_id_key_mapping=None)[source]

Builds a byte offset index for one or more types of tags.

Parameters:lookup_id_key_mapping (Mapping, optional) – A mapping from tag name to the attribute to look up the identity for each entity of that type to be extracted. Defaults to ‘id’ for each type of tag.
Returns:Mapping from tag type to dict from identifier to byte offset
Return type:defaultdict(dict)
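
The idea of a byte offset index can be illustrated with a small standard-library sketch. It is not the Pyteomics implementation: the helper name build_byte_index_sketch and the regular expression are assumptions made for illustration, and the whole input is scanned at once instead of in blocks of block_size bytes.

```python
import re
from collections import defaultdict


def build_byte_index_sketch(data, indexed_tags, id_key=b'id'):
    """Map tag name -> {identifier: byte offset of the opening tag}.

    Hypothetical helper: scans an in-memory bytes object in one pass.
    """
    index = defaultdict(dict)
    for tag in indexed_tags:
        # Match "<tag ... id="value"" and capture the identifier value.
        pattern = re.compile(
            b'<' + re.escape(tag) + rb'\s[^>]*?\b' + re.escape(id_key) + rb'="([^"]*)"')
        for match in pattern.finditer(data):
            index[tag.decode()][match.group(1).decode()] = match.start()
    return index


xml_bytes = (b'<run><spectrum id="scan=1"><x/></spectrum>'
             b'<spectrum id="scan=2"><x/></spectrum></run>')
offsets = build_byte_index_sketch(xml_bytes, [b'spectrum'])
```

The result has the same shape as the documented return value: a mapping from tag type to a dict from identifier to byte offset.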
class pyteomics.xml.IndexSavingXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]

Bases: pyteomics.auxiliary.file_helpers.IndexSavingMixin, pyteomics.xml.IndexedXML

An extension to the IndexedXML type which adds facilities to read and write the byte offset index externally.

__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)

Create an indexed XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
  • indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs (passed to self._get_info_smart().) –
Returns:

out

Return type:

iterator

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters:path (str) – The path to the file to parse
reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

class pyteomics.xml.IndexedXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]

Bases: pyteomics.auxiliary.file_helpers.IndexedReaderMixin, pyteomics.xml.XML

Subclass of XML which uses an index of byte offsets for some elements for quick random access.

__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]

Create an indexed XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
  • indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)[source]

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict
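
The "seek to the starting position of the entry" step described above can be sketched as follows. This is an illustration only, with assumed names (offset_index, read_entry); it returns the raw bytes of the element, whereas the real method returns a parsed dict, and it does not handle nested elements that share the same tag name.

```python
import io

xml_bytes = (b'<run><spectrum id="scan=1"><x/></spectrum>'
             b'<spectrum id="scan=2"><y/></spectrum></run>')

# Hypothetical precomputed index: identifier -> byte offset of the opening tag.
offset_index = {
    'scan=1': xml_bytes.index(b'<spectrum id="scan=1"'),
    'scan=2': xml_bytes.index(b'<spectrum id="scan=2"'),
}


def read_entry(stream, elem_id, tag=b'spectrum'):
    """Jump straight to the indexed offset and read up to the closing tag."""
    stream.seek(offset_index[elem_id])
    closing = b'</' + tag + b'>'
    buffer = b''
    while closing not in buffer:
        chunk = stream.read(16)  # small chunk size, just for the demonstration
        if not chunk:
            break
        buffer += chunk
    end = buffer.find(closing)
    return buffer[:end + len(closing)] if end != -1 else None


entry = read_entry(io.BytesIO(xml_bytes), 'scan=2')
```

Nothing before the stored offset is read, which is what makes random access cheap for large files.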

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs (passed to self._get_info_smart().) –
Returns:

out

Return type:

iterator

reset()

Resets the iterator to its initial state.

class pyteomics.xml.MultiProcessingXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]

Bases: pyteomics.xml.IndexedXML, pyteomics.auxiliary.file_helpers.TaskMappingMixin

XML reader that feeds indexes to external processes for parallel parsing and analysis of XML entries.

__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)

Create an indexed XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
  • indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs (passed to self._get_info_smart().) –
Returns:

out

Return type:

iterator

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over the entries of this object, using at most processes worker processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.
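
Because results arrive out of order, it helps to have the target function return an identifier together with its result. The sketch below shows this pattern with the standard library. It is not the Pyteomics implementation: unordered_map and count_peaks are hypothetical names, the spectra are made-up dicts, and a thread pool stands in for worker processes to keep the example portable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def unordered_map(target, entries, workers=2, args=(), kwargs=None):
    """Yield target(entry, *args, **kwargs) for each entry as results finish."""
    kwargs = kwargs or {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(target, entry, *args, **kwargs) for entry in entries]
        for future in as_completed(futures):
            yield future.result()


def count_peaks(spectrum, min_intensity=0.0):
    """Return (id, number of peaks at or above min_intensity)."""
    kept = sum(1 for value in spectrum['intensity'] if value >= min_intensity)
    return spectrum['id'], kept


spectra = [
    {'id': 's1', 'intensity': [5.0, 20.0, 30.0]},
    {'id': 's2', 'intensity': [1.0, 2.0, 50.0]},
]
# Keying the results by id makes the arrival order irrelevant.
results = dict(unordered_map(count_peaks, spectra, kwargs={'min_intensity': 10.0}))
```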

reset()

Resets the iterator to its initial state.

class pyteomics.xml.TagSpecificXMLByteIndex(source, indexed_tags=None, keys=None)[source]

Bases: object

Encapsulates the construction and querying of a byte offset index for a set of XML tags.

This type mimics an immutable Mapping.

indexed_tags

The tag names to index, not including a namespace

Type:iterable of bytes
offsets

The hierarchy of byte offsets organized {"tag_type": {"id": byte_offset}}

Type:defaultdict(OrderedDict(str, int))
indexed_tag_keys

A mapping from tag name to unique identifier attribute

Type:dict(str, str)
Parameters:indexed_tags (iterable of bytes) – The tag names to include in the index
__init__(source, indexed_tags=None, keys=None)[source]

Initialize self. See help(type(self)) for accurate signature.

build_index()[source]

Perform the byte offset index building for source.

Returns:offsets – The hierarchical offset, stored in offsets
Return type:defaultdict
class pyteomics.xml.XML(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)[source]

Bases: pyteomics.auxiliary.file_helpers.FileReader

Base class for all format-specific XML parsers. The instances can be used as context managers and as iterators.

__init__(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)[source]

Create an XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • build_id_cache (bool, optional) – Defines whether a dictionary mapping IDs to XML tree elements should be built and stored on the instance. It is used in XML.get_by_id(), e.g. when using pyteomics.mzid.MzIdentML with retrieve_refs=True.
  • huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
build_id_cache()[source]

Construct a cache for each element in the document, indexed by id attribute

build_tree()[source]

Build and store the ElementTree instance for the underlying file

clear_id_cache()[source]

Clear the element ID cache

clear_tree()[source]

Remove the saved ElementTree.

get_by_id(elem_id, **kwargs)[source]

Parse the file and return the element with id attribute equal to elem_id. Returns None if no such element is found.

Parameters:elem_id (str) – The value of the id attribute to match.
Returns:out
Return type:dict or None
iterfind(path, **kwargs)[source]

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs (passed to self._get_info_smart().) –
Returns:

out

Return type:

iterator
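
The substitution of local names with local-name() checks, applied only up to the first predicate, might look roughly like the sketch below. The function to_local_name_xpath is a hypothetical helper written for illustration; the expression Pyteomics actually builds may differ in detail.

```python
def to_local_name_xpath(path):
    """Rewrite element names as local-name() checks, up to the first predicate."""
    # Everything from the first '[' onward is left untouched.
    head, bracket, tail = path.partition('[')
    steps = []
    for step in head.split('/'):
        if step in ('', '.', '..', '*'):
            steps.append(step)  # keep separators and wildcards as they are
        else:
            steps.append("*[local-name()='%s']" % step)
    return '/'.join(steps) + bracket + tail


free_path = to_local_name_xpath('//spectrum/cvParam')
with_predicate = to_local_name_xpath('spectrum[@id="scan=1"]')
```

Matching on local names is what allows paths to be written without namespaces.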

reset()

Resets the iterator to its initial state.

pyteomics.xml.xpath(tree, path, ns=None)[source]

Return the results of XPath query with added namespaces. Assumes the ns declaration is on the root element or absent.

Parameters:
  • tree (ElementTree) –
  • path (str) –
  • ns (str or None, optional) –
pyteomics.xml.xsd_parser(schema_url)[source]

Parse an XSD file from the specified URL into a schema dictionary that can be used by XML parsers to automatically cast data to the appropriate type.

Parameters:schema_url (str) – The URL to retrieve the schema from
Returns:
Return type:dict

auxiliary - common functions and objects

Math

linear_regression_vertical() - a wrapper for NumPy linear regression, minimizes the sum of squares of y errors.

linear_regression() - alias for linear_regression_vertical().

linear_regression_perpendicular() - a wrapper for NumPy linear regression, minimizes the sum of squares of (perpendicular) distances between the points and the line.

Target-Decoy Approach

qvalues() - estimate q-values for a set of PSMs.

filter() - filter PSMs to specified FDR level using TDA or given PEPs.

filter.chain() - a chained version of filter().

fdr() - estimate FDR in a set of PSMs using TDA or given PEPs.

Project infrastructure

PyteomicsError - a pyteomics-specific exception.

Helpers

Charge - a subclass of int for charge states.

ChargeList - a subclass of list for lists of charges.

print_tree() - display the structure of a complex nested dict.

memoize() - makes a memoization function decorator.

cvquery() - traverse an arbitrarily nested dictionary looking for keys which are cvstr instances, or objects with an attribute called accession.


pyteomics.auxiliary.math.linear_regression(x, y=None, a=None, b=None)[source]

Alias of linear_regression_vertical().

pyteomics.auxiliary.math.linear_regression_perpendicular(x, y=None)[source]

Calculate coefficients of a linear regression y = a * x + b. The fit minimizes perpendicular distances between the points and the line.

Requires numpy.

Parameters:x, y – 1-D arrays of floats. If y is omitted, x must be a 2-D array of shape (N, 2).
Returns:out – The structure is (a, b, r, stderr), where a – slope coefficient, b – free term, r – Pearson correlation coefficient, stderr – standard deviation.
Return type:4-tuple of float
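
For intuition, a perpendicular (orthogonal) fit has a closed-form solution that can be written without NumPy. The sketch below is an assumption-laden illustration, not the library code: fit_perpendicular is a hypothetical name, only (a, b) is returned (r and stderr are omitted), and the formula requires a nonzero covariance term.

```python
import math


def fit_perpendicular(x, y):
    """Return (a, b) for y = a * x + b, minimizing perpendicular distances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Closed-form slope of the orthogonal regression line (sxy must be nonzero).
    a = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    b = mean_y - a * mean_x
    return a, b


slope, intercept = fit_perpendicular([0, 1, 2, 3], [1, 3, 5, 7])
```

For points lying exactly on a line, as here, the perpendicular and vertical fits coincide; they differ once both coordinates carry noise.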
pyteomics.auxiliary.math.linear_regression_vertical(x, y=None, a=None, b=None)[source]

Calculate coefficients of a linear regression y = a * x + b. The fit minimizes vertical distances between the points and the line.

Requires numpy.

Parameters:
  • x, y – 1-D arrays of floats. If y is omitted, x must be a 2-D array of shape (N, 2).
  • a (float, optional) – If specified then the slope coefficient is fixed and equals a.
  • b (float, optional) – If specified then the free term is fixed and equals b.
Returns:

out – The structure is (a, b, r, stderr), where a – slope coefficient, b – free term, r – Pearson correlation coefficient, stderr – standard deviation.

Return type:

4-tuple of float
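
The effect of fixing a or b can be seen in a plain least-squares sketch. This is not the library code: fit_vertical is a hypothetical helper, it returns only (a, b), and the way it treats a fixed coefficient (minimizing the y errors over the remaining free parameter) is an assumption about the intended behaviour.

```python
def fit_vertical(x, y, a=None, b=None):
    """Return (a, b) for y = a * x + b, minimizing squared y errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    if a is None and b is None:
        # Ordinary least squares with both coefficients free.
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        a = sxy / sxx
        b = mean_y - a * mean_x
    elif a is None:
        # Free term fixed: minimize over the slope only.
        a = sum(xi * (yi - b) for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    elif b is None:
        # Slope fixed: the best free term is the mean residual.
        b = mean_y - a * mean_x
    return a, b


xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]
free_fit = fit_vertical(xs, ys)
through_origin = fit_vertical(xs, ys, b=0.0)
fixed_slope = fit_vertical(xs, ys, a=2.0)
```

Forcing the line through the origin (b=0) changes the slope, because the fit can no longer absorb the offset in the free term.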

pyteomics.auxiliary.target_decoy.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)

Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:

FDR = \frac{N_{decoy}}{N_{target} * ratio}

The second formula is:

FDR = \frac{N_{decoy} * (1 + \frac{1}{ratio})}{N_{total}}

Note

This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.

Parameters:
  • psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
  • formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
  • is_decoy (callable, iterable, or str) – If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
  • pep (callable, iterable, or str, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.

  • ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

    Note

    Requires numpy, if correction is a float or 2.

    Note

    Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).

Returns:

out – The estimation of FDR, (roughly) between 0 and 1.

Return type:

float
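
As a worked example of the two FDR formulas given for fdr(), the sketch below evaluates both on made-up counts. The function names and the numbers are illustrative assumptions; no correction is applied.

```python
def fdr_formula_1(n_decoy, n_target, ratio=1.0):
    """FDR = N_decoy / (N_target * ratio)."""
    return n_decoy / (n_target * ratio)


def fdr_formula_2(n_decoy, n_total, ratio=1.0):
    """FDR = N_decoy * (1 + 1 / ratio) / N_total."""
    return n_decoy * (1 + 1 / ratio) / n_total


# Hypothetical search result: 950 target and 50 decoy PSMs, equal-size databases.
n_target, n_decoy = 950, 50
fdr_1 = fdr_formula_1(n_decoy, n_target)            # 50 / 950
fdr_2 = fdr_formula_2(n_decoy, n_target + n_decoy)  # 50 * 2 / 1000
```

Formula 1 describes the target PSMs alone, while formula 2 describes the combined set of target and decoy PSMs, which is why it comes out larger here.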

pyteomics.auxiliary.target_decoy.filter(*args, **kwargs)

Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.

Requires numpy and, optionally, pandas.

Parameters:
  • args (positional) – Iterables to read PSMs from. All positional arguments are chained. The rest of the arguments must be named.
  • fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
  • key (callable / array-like / iterable / str, keyword only) –

    A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only) – A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.
  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is True.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • full_output (bool, keyword only, optional) –

    If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.

    Note

    The name for the parameter comes from the fact that it is internally passed to qvalues().

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • **kwargs (passed to the chain() function.) –
Returns:

out

Return type:

iterator or numpy.ndarray or pandas.DataFrame
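
The core of decoy-counting filtering can be sketched in a few lines of plain Python. This is a simplified illustration rather than the library implementation: filter_by_fdr is a hypothetical helper, it uses formula 1 with ratio=1 and no correction, and the PSMs are made-up dicts with assumed 'evalue' and 'protein' keys.

```python
def filter_by_fdr(psms, fdr, key, is_decoy, remove_decoy=True):
    """Keep the longest best-scoring prefix whose estimated FDR is <= fdr."""
    ranked = sorted(psms, key=key)  # smaller key value means a better PSM
    cutoff = targets = decoys = 0
    for position, psm in enumerate(ranked, 1):
        if is_decoy(psm):
            decoys += 1
        else:
            targets += 1
        # Formula 1 with ratio=1: decoys divided by targets.
        if targets and decoys / targets <= fdr:
            cutoff = position
    accepted = ranked[:cutoff]
    if remove_decoy:
        accepted = [psm for psm in accepted if not is_decoy(psm)]
    return accepted


psms = [
    {'evalue': 0.02, 'protein': 'P4'},
    {'evalue': 0.001, 'protein': 'P1'},
    {'evalue': 0.05, 'protein': 'DECOY_P5'},
    {'evalue': 0.01, 'protein': 'DECOY_P3'},
    {'evalue': 0.002, 'protein': 'P2'},
]
kept = filter_by_fdr(
    psms, fdr=0.4,
    key=lambda psm: psm['evalue'],
    is_decoy=lambda psm: psm['protein'].startswith('DECOY_'))
```

Note that the cutoff is the last position at which the estimate is acceptable, not the first position at which it is exceeded: P4 is kept even though the estimate briefly rises to 0.5 at the decoy ranked before it.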

pyteomics.auxiliary.target_decoy.qvalues(*args, **kwargs)

Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.

Requires numpy (and optionally pandas).

Parameters:
  • args (positional) – Iterables to read PSMs from. All positional arguments are chained. The rest of the arguments must be named.
  • key (callable / array-like / iterable / str, keyword only) –

    If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only) – If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is False.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
  • **kwargs (passed to the chain() function.) –
Returns:

out – A sorted array of records with the following fields:

  • 'score': np.float64
  • 'is decoy': np.bool_
  • 'q': np.float64
  • 'psm': np.object_ (if full_output is True)

Return type:

numpy.ndarray
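
To make the notion of a q-value concrete, the sketch below computes q-values from scores and decoy flags without NumPy. It is an illustration under stated assumptions, not the library code: tda_qvalues is a hypothetical name, formula 1 with ratio=1 and no correction is used, and smaller scores are taken to be better.

```python
def tda_qvalues(scores, decoy_flags):
    """Return (score, is_decoy, q) tuples sorted from best to worst score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    fdrs = []
    targets = decoys = 0
    for i in order:
        if decoy_flags[i]:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR if the threshold is placed right after this PSM.
        fdrs.append(decoys / targets if targets else 1.0)
    # q-value: the smallest FDR among all thresholds that still include the PSM.
    qvals = []
    best = float('inf')
    for value in reversed(fdrs):
        best = min(best, value)
        qvals.append(best)
    qvals.reverse()
    return [(scores[i], decoy_flags[i], q) for i, q in zip(order, qvals)]


records = tda_qvalues(
    [0.001, 0.002, 0.01, 0.02, 0.05],
    [False, False, True, False, True])
q_only = [q for _, _, q in records]
```

The running minimum makes the q-values non-decreasing along the sorted list, so a threshold on q always selects a prefix of the best-scoring PSMs.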

pyteomics.auxiliary.target_decoy.sigma_T(psms, is_decoy, ratio=1)[source]

Calculates the standard error for the number of false positive target PSMs.

The formula is:

\sigma(T) = \sqrt{\frac{(d + 1) \cdot p}{(1 - p)^{2}}} = \sqrt{\frac{d + 1}{r^{2}} \cdot (r + 1)}

This estimation is accurate for low FDRs. See the article for more details.
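
A quick numerical check of the formula for sigma(T) is sketched below. The meaning of the symbols is assumed, not stated in the text: d is taken to be the number of decoy PSMs, r the decoy-to-target size ratio, and p = 1 / (1 + r), which is the relation that makes the two forms of the formula agree.

```python
import math


def sigma_t_both_forms(d, r=1.0):
    """Evaluate both forms of the sigma(T) formula; hypothetical helper."""
    p = 1.0 / (1.0 + r)  # assumed relation between p and r
    via_p = math.sqrt((d + 1) * p / (1 - p) ** 2)
    via_r = math.sqrt((d + 1) / r ** 2 * (r + 1))
    return via_p, via_r


# With 9 decoys and equal-size databases both forms give sqrt(20).
form_p, form_r = sigma_t_both_forms(9, r=1.0)
```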

pyteomics.auxiliary.target_decoy.sigma_fdr(psms=None, formula=1, is_decoy=None, ratio=1)[source]

Calculates the standard error of FDR using the formula for negative binomial distribution. See sigma_T() for math. This estimation is accurate for low FDRs. See also the article for more details.

class pyteomics.auxiliary.utils.BinaryDataArrayTransformer[source]

Bases: object

A base class that provides methods for reading base64-encoded binary arrays.

compression_type_map

Maps compressor type name to decompression function

Type:dict
__init__

Initialize self. See help(type(self)) for accurate signature.

class binary_array_record[source]

Bases: pyteomics.auxiliary.utils.binary_array_record

Hold all of the information about a base64 encoded array needed to decode the array.

__init__

Initialize self. See help(type(self)) for accurate signature.

compression

Alias for field number 1

count()

Return number of occurrences of value.

data

Alias for field number 0

decode()[source]

Decode data into a numerical array

Returns:
Return type:np.ndarray
dtype

Alias for field number 2

index()

Return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 4

source

Alias for field number 3

decode_data_array(source, compression_type=None, dtype=<class 'numpy.float64'>)[source]

Decode a base64-encoded, compressed bytestring into a numerical array.

Parameters:
  • source (bytes) – A base64 string encoding a potentially compressed numerical array.
  • compression_type (str, optional) – The name of the compression method used before encoding the array into base64.
  • dtype (type, optional) – The data type to use to decode the binary array from the decompressed bytes.
Returns:

Return type:

np.ndarray
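
The decoding steps (base64 decode, decompress, reinterpret bytes as numbers) can be traced with the standard library alone. The sketch below is not the Pyteomics code: decode_float64_array is a hypothetical helper that assumes zlib compression and little-endian 64-bit floats, and it returns a list instead of an np.ndarray.

```python
import base64
import struct
import zlib


def decode_float64_array(source, compressed=True):
    """Decode base64 data into a list of 64-bit floats (illustrative helper)."""
    raw = base64.b64decode(source)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // 8  # 8 bytes per float64 value
    return list(struct.unpack('<%dd' % count, raw))


# Round trip: pack three values, compress, encode, then decode again.
values = [100.5, 200.25, 300.0]
encoded = base64.b64encode(zlib.compress(struct.pack('<3d', *values)))
decoded = decode_float64_array(encoded)
```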

pyteomics.auxiliary.utils.memoize(maxsize=1000)[source]

Make a memoization decorator. A negative value of maxsize means no size limit.
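
A decorator factory of this kind might be sketched as follows. This is not the Pyteomics implementation: memoize_sketch is a hypothetical name, only positional hashable arguments are supported, and evicting the oldest entry when the cache is full is an assumed policy.

```python
import functools


def memoize_sketch(maxsize=1000):
    """Return a decorator that caches results; negative maxsize means no limit."""
    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args):
            if args not in cache:
                if cache and 0 <= maxsize <= len(cache):
                    # Cache is full: drop the oldest inserted entry.
                    cache.pop(next(iter(cache)))
                cache[args] = func(*args)
            return cache[args]
        return wrapper
    return decorator


calls = []


@memoize_sketch(maxsize=2)
def square(x):
    calls.append(x)  # record real evaluations only
    return x * x


first = square(3)
second = square(3)  # served from the cache, no new evaluation
third = square(4)
```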

pyteomics.auxiliary.utils.print_tree(d, indent_str=' -> ', indent_count=1)[source]

Read a nested dict (with strings as keys) and print its structure.
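
The role of indent_str and indent_count can be illustrated with a small recursive sketch. The helper tree_lines is hypothetical and returns lines instead of printing them; the exact output format of print_tree() is not reproduced here and may differ.

```python
def tree_lines(d, indent_str=' -> ', indent_count=1, level=0):
    """Return one line per key, indented according to nesting depth."""
    lines = []
    for key, value in d.items():
        lines.append(indent_str * (indent_count * level) + str(key))
        if isinstance(value, dict):
            # Descend into nested dicts with one more level of indentation.
            lines.extend(tree_lines(value, indent_str, indent_count, level + 1))
    return lines


record = {'spectrum': {'id': 'scan=1', 'precursor': {'mz': 500.2}}, 'index': 0}
structure = tree_lines(record)
```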

class pyteomics.auxiliary.structures.BasicComposition(*args, **kwargs)[source]

Bases: collections.defaultdict, collections.Counter

A generic dictionary for compositions. Keys should be strings, values should be integers. Supports simple arithmetic.
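
The kind of arithmetic a composition supports can be previewed with a plain collections.Counter, one of the base classes listed above. The sketch deliberately avoids BasicComposition itself, whose operators are not shown in this section; it uses only the inherited update() and subtract() methods and standard elemental formulas.

```python
from collections import Counter

water = Counter({'H': 2, 'O': 1})
glycine = Counter({'C': 2, 'H': 5, 'N': 1, 'O': 2})

# Glycylglycine: two glycine molecules joined with the loss of one water.
dipeptide = Counter()
dipeptide.update(glycine)
dipeptide.update(glycine)
dipeptide.subtract(water)
```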

__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

clear() → None. Remove all items from D.
copy() → a shallow copy of D.[source]
default_factory

Factory for default value called by __missing__().

elements()

Iterator over elements repeating each as many times as its count.

>>> c = Counter('ABCABC')
>>> sorted(c.elements())
['A', 'A', 'B', 'B', 'C', 'C']

>>> # Knuth's example for prime factors of 1836:  2**2 * 3**3 * 17**1
>>> prime_factors = Counter({2: 2, 3: 3, 17: 1})
>>> product = 1
>>> for factor in prime_factors.elements():     # loop over factors
...     product *= factor                       # and multiply them
>>> product
1836

Note, if an element’s count has been set to zero or is a negative number, elements() will ignore it.

classmethod fromkeys(iterable, v=None)

Create a new dictionary with keys from iterable and values set to value.

get()

Return the value for key if key is in the dictionary, else default.

items() → a set-like object providing a view on D's items
keys() → a set-like object providing a view on D's keys
most_common(n=None)

List the n most common elements and their counts from the most common to the least. If n is None, then list all element counts.

>>> Counter('abcdeabcdabcaba').most_common(3)
[('a', 5), ('b', 4), ('c', 3)]
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.

setdefault()

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

subtract(**kwds)

Like dict.update() but subtracts counts instead of replacing them. Counts can be reduced below zero. Both the inputs and outputs are allowed to contain zero and negative counts.

Source can be an iterable, a dictionary, or another Counter instance.

>>> c = Counter('which')
>>> c.subtract('witch')             # subtract elements from another iterable
>>> c.subtract(Counter('watch'))    # subtract elements from another counter
>>> c['h']                          # 2 in which, minus 1 in witch, minus 1 in watch
0
>>> c['w']                          # 1 in which, minus 1 in witch, minus 1 in watch
-1
update(**kwds)

Like dict.update() but add counts instead of replacing them.

Source can be an iterable, a dictionary, or another Counter instance.

>>> c = Counter('which')
>>> c.update('witch')           # add elements from another iterable
>>> d = Counter('watch')
>>> c.update(d)                 # add elements from another counter
>>> c['h']                      # four 'h' in which, witch, and watch
4
values() → an object providing a view on D's values
class pyteomics.auxiliary.structures.CVQueryEngine[source]

Bases: object

Traverse an arbitrarily nested dictionary looking for keys which are cvstr instances, or objects with an attribute called accession.

__init__

Initialize self. See help(type(self)) for accurate signature.

index(data)[source]

Construct a flat dict whose keys are the accession numbers for all qualified keys in data and whose values are the mapped values from data.

query(data, accession)[source]

Search data for a key with the accession number accession. Returns None if not found.
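The indexing and lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: `AccessionKey` is a hypothetical stand-in for a key type with an `accession` attribute, and `build_index` and `query` are free functions written for this example.

```python
class AccessionKey(str):
    """Hypothetical str subclass carrying an accession attribute."""

    def __new__(cls, value, accession=None):
        obj = super().__new__(cls, value)
        obj.accession = accession
        return obj


def build_index(data, index=None):
    """Flatten nested dicts into a flat {accession: value} mapping."""
    if index is None:
        index = {}
    for key, value in data.items():
        accession = getattr(key, "accession", None)
        if accession is not None:
            index[accession] = value
        if isinstance(value, dict):
            build_index(value, index)  # descend into nested levels
    return index


def query(data, accession):
    """Return the value stored under the accession, or None if absent."""
    return build_index(data).get(accession)


record = {
    "scanList": {
        AccessionKey("scan start time", "MS:1000016"): 12.5,
    },
    AccessionKey("ms level", "MS:1000511"): 2,
}
flat = build_index(record)
```

Plain string keys such as `"scanList"` are skipped by the index but still traversed, so qualified keys at any depth are found.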

class pyteomics.auxiliary.structures.Charge[source]

Bases: int

A subclass of int. Can be constructed from strings in “N+” or “N-” format, and the string representation of a Charge is also in that format.
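How an `int` subclass can accept and produce the "N+" / "N-" notation is sketched below. `ChargeSketch` is a hypothetical class written for illustration; the parsing rules of the real `Charge` may be more permissive than the single regular expression assumed here.

```python
import re


class ChargeSketch(int):
    """Hypothetical int subclass accepting 'N+' / 'N-' strings."""

    def __new__(cls, value):
        if isinstance(value, str):
            match = re.fullmatch(r"(\d+)([+-])", value.strip())
            if match:
                digits, sign = match.groups()
                value = int(digits) * (1 if sign == "+" else -1)
        return super().__new__(cls, value)

    def __str__(self):
        # Magnitude first, then the sign, mirroring the input format.
        return "%d%s" % (abs(self), "-" if self < 0 else "+")


z = ChargeSketch("2+")
neg = ChargeSketch("3-")
```

Since the object is still an `int`, it takes part in ordinary arithmetic and comparisons, while `str()` gives the charge-state notation.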

__init__

Initialize self. See help(type(self)) for accurate signature.

bit_length()

Number of bits necessary to represent self in binary.

>>> bin(37)
'0b100101'
>>> (37).bit_length()
6
conjugate()

Returns self, the complex conjugate of any int.

denominator

the denominator of a rational number in lowest terms

from_bytes()

Return the integer represented by the given array of bytes.

bytes
Holds the array of bytes to convert. The argument must either support the buffer protocol or be an iterable object producing bytes. Bytes and bytearray are examples of built-in objects that support the buffer protocol.
byteorder
The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use ‘sys.byteorder’ as the byte order value.
signed
Indicates whether two’s complement is used to represent the integer.
imag

the imaginary part of a complex number

numerator

the numerator of a rational number in lowest terms

real

the real part of a complex number

to_bytes()

Return an array of bytes representing an integer.

length
Length of bytes object to use. An OverflowError is raised if the integer is not representable with the given number of bytes.
byteorder
The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use ‘sys.byteorder’ as the byte order value.
signed
Determines whether two’s complement is used to represent the integer. If signed is False and a negative integer is given, an OverflowError is raised.
class pyteomics.auxiliary.structures.ChargeList(*args, **kwargs)[source]

Bases: list

A list of Charge objects. When printed, it looks like an enumeration of the list contents. It can also be constructed from such strings (e.g. “2+, 3+ and 4+”).
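Parsing and formatting of such an enumeration can be sketched with two small helpers. Both `parse_charges` and `format_charges` are hypothetical names, and plain integers are used in place of Charge objects; the real class may accept more input variants than the pattern assumed here.

```python
import re


def parse_charges(text):
    """Extract signed integers from text such as '2+, 3+ and 4+'."""
    return [int(sign + digits)
            for digits, sign in re.findall(r"(\d+)([+-])", text)]


def format_charges(charges):
    """Render integers as an enumeration like '2+, 3+ and 4+'."""
    labels = ["%d%s" % (abs(z), "-" if z < 0 else "+") for z in charges]
    if len(labels) < 2:
        return "".join(labels)
    return ", ".join(labels[:-1]) + " and " + labels[-1]


charges = parse_charges("2+, 3+ and 4+")
text = format_charges(charges)
```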

__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

append()

Append object to the end of the list.

clear()

Remove all items from list.

copy()

Return a shallow copy of the list.

count()

Return number of occurrences of value.

extend()

Extend list by appending elements from the iterable.

index()

Return first index of value.

Raises ValueError if the value is not present.

insert()

Insert object before index.

pop()

Remove and return item at index (default last).

Raises IndexError if list is empty or index is out of range.

remove()

Remove first occurrence of value.

Raises ValueError if the value is not present.

reverse()

Reverse IN PLACE.

sort()

Stable sort IN PLACE.

exception pyteomics.auxiliary.structures.PyteomicsError(msg, *values)[source]

Bases: Exception

Exception raised for errors in Pyteomics library.

message

Error message.

Type:str
__init__(msg, *values)[source]

Initialize self. See help(type(self)) for accurate signature.

with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

pyteomics.auxiliary.structures.clear_unit_cv_table()[source]

Clear the module-level unit name and controlled vocabulary accession table.

class pyteomics.auxiliary.structures.cvstr[source]

Bases: str

A helper class to associate a controlled vocabulary accession number with an otherwise plain str object.

__init__

Initialize self. See help(type(self)) for accurate signature.

capitalize()

Return a capitalized version of the string.

More specifically, make the first character have upper case and the rest lower case.

casefold()

Return a version of the string suitable for caseless comparisons.

center()

Return a centered string of length width.

Padding is done using the specified fill character (default is a space).

count(sub[, start[, end]]) → int

Return the number of non-overlapping occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation.

encode()

Encode the string using the codec registered for encoding.

encoding
The encoding in which to encode the string.
errors
The error handling scheme to use for encoding errors. The default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
endswith(suffix[, start[, end]]) → bool

Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.

expandtabs()

Return a copy where all tab characters are expanded using spaces.

If tabsize is not given, a tab size of 8 characters is assumed.

find(sub[, start[, end]]) → int

Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

format(*args, **kwargs) → str

Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).

format_map(mapping) → str

Return a formatted version of S, using substitutions from mapping. The substitutions are identified by braces (‘{‘ and ‘}’).

index(sub[, start[, end]]) → int

Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Raises ValueError when the substring is not found.

isalnum()

Return True if the string is an alpha-numeric string, False otherwise.

A string is alpha-numeric if all characters in the string are alpha-numeric and there is at least one character in the string.

isalpha()

Return True if the string is an alphabetic string, False otherwise.

A string is alphabetic if all characters in the string are alphabetic and there is at least one character in the string.

isascii()

Return True if all characters in the string are ASCII, False otherwise.

ASCII characters have code points in the range U+0000-U+007F. Empty string is ASCII too.

isdecimal()

Return True if the string is a decimal string, False otherwise.

A string is a decimal string if all characters in the string are decimal and there is at least one character in the string.

isdigit()

Return True if the string is a digit string, False otherwise.

A string is a digit string if all characters in the string are digits and there is at least one character in the string.

isidentifier()

Return True if the string is a valid Python identifier, False otherwise.

Use keyword.iskeyword() to test for reserved identifiers such as “def” and “class”.

islower()

Return True if the string is a lowercase string, False otherwise.

A string is lowercase if all cased characters in the string are lowercase and there is at least one cased character in the string.

isnumeric()

Return True if the string is a numeric string, False otherwise.

A string is numeric if all characters in the string are numeric and there is at least one character in the string.

isprintable()

Return True if the string is printable, False otherwise.

A string is printable if all of its characters are considered printable in repr() or if it is empty.

isspace()

Return True if the string is a whitespace string, False otherwise.

A string is whitespace if all characters in the string are whitespace and there is at least one character in the string.

istitle()

Return True if the string is a title-cased string, False otherwise.

In a title-cased string, upper- and title-case characters may only follow uncased characters and lowercase characters only cased ones.

isupper()

Return True if the string is an uppercase string, False otherwise.

A string is uppercase if all cased characters in the string are uppercase and there is at least one cased character in the string.

join()

Concatenate any number of strings.

The string whose method is called is inserted in between each given string. The result is returned as a new string.

Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'

ljust()

Return a left-justified string of length width.

Padding is done using the specified fill character (default is a space).

lower()

Return a copy of the string converted to lowercase.

lstrip()

Return a copy of the string with leading whitespace removed.

If chars is given and not None, remove characters in chars instead.

static maketrans()

Return a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.

partition()

Partition the string into three parts using the given separator.

This will search for the separator in the string. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.

If the separator is not found, returns a 3-tuple containing the original string and two empty strings.

replace()

Return a copy with all occurrences of substring old replaced by new.

count
Maximum number of occurrences to replace. -1 (the default value) means replace all occurrences.

If the optional argument count is given, only the first count occurrences are replaced.

rfind(sub[, start[, end]]) → int

Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

rindex(sub[, start[, end]]) → int

Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Raises ValueError when the substring is not found.

rjust()

Return a right-justified string of length width.

Padding is done using the specified fill character (default is a space).

rpartition()

Partition the string into three parts using the given separator.

This will search for the separator in the string, starting at the end. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.

If the separator is not found, returns a 3-tuple containing two empty strings and the original string.

rsplit()

Return a list of the words in the string, using sep as the delimiter string.

sep
The delimiter according to which to split the string. None (the default value) means split according to any whitespace, and discard empty strings from the result.
maxsplit
Maximum number of splits to do. -1 (the default value) means no limit.

Splits are done starting at the end of the string and working to the front.

rstrip()

Return a copy of the string with trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.

split()

Return a list of the words in the string, using sep as the delimiter string.

sep
The delimiter according to which to split the string. None (the default value) means split according to any whitespace, and discard empty strings from the result.
maxsplit
Maximum number of splits to do. -1 (the default value) means no limit.
splitlines()

Return a list of the lines in the string, breaking at line boundaries.

Line breaks are not included in the resulting list unless keepends is given and true.

startswith(prefix[, start[, end]]) → bool

Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.

strip()

Return a copy of the string with leading and trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.

swapcase()

Convert uppercase characters to lowercase and lowercase characters to uppercase.

title()

Return a version of the string where each word is titlecased.

More specifically, words start with uppercased characters and all remaining cased characters have lower case.

translate()

Replace each character in the string using the given translation table.

table
Translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, strings, or None.

The table must implement lookup/indexing via __getitem__, for instance a dictionary or list. If this operation raises LookupError, the character is left untouched. Characters mapped to None are deleted.

upper()

Return a copy of the string converted to uppercase.

zfill()

Pad a numeric string with zeros on the left, to fill a field of the given width.

The string is never truncated.

class pyteomics.auxiliary.structures.unitfloat[source]

Bases: float

__init__

Initialize self. See help(type(self)) for accurate signature.

as_integer_ratio()

Return integer ratio.

Return a pair of integers, whose ratio is exactly equal to the original float and with a positive denominator.

Raise OverflowError on infinities and a ValueError on NaNs.

>>> (10.0).as_integer_ratio()
(10, 1)
>>> (0.0).as_integer_ratio()
(0, 1)
>>> (-.25).as_integer_ratio()
(-1, 4)
conjugate()

Return self, the complex conjugate of any float.

fromhex()

Create a floating-point number from a hexadecimal string.

>>> float.fromhex('0x1.ffffp10')
2047.984375
>>> float.fromhex('-0x1p-1074')
-5e-324
hex()

Return a hexadecimal representation of a floating-point number.

>>> (-0.1).hex()
'-0x1.999999999999ap-4'
>>> 3.14159.hex()
'0x1.921f9f01b866ep+1'
imag

the imaginary part of a complex number

is_integer()

Return True if the float is an integer.

real

the real part of a complex number

class pyteomics.auxiliary.structures.unitint[source]

Bases: int

__init__

Initialize self. See help(type(self)) for accurate signature.

bit_length()

Number of bits necessary to represent self in binary.

>>> bin(37)
'0b100101'
>>> (37).bit_length()
6
conjugate()

Returns self, the complex conjugate of any int.

denominator

the denominator of a rational number in lowest terms

from_bytes()

Return the integer represented by the given array of bytes.

bytes
Holds the array of bytes to convert. The argument must either support the buffer protocol or be an iterable object producing bytes. Bytes and bytearray are examples of built-in objects that support the buffer protocol.
byteorder
The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use ‘sys.byteorder’ as the byte order value.
signed
Indicates whether two’s complement is used to represent the integer.
imag

the imaginary part of a complex number

numerator

the numerator of a rational number in lowest terms

real

the real part of a complex number

to_bytes()

Return an array of bytes representing an integer.

length
Length of bytes object to use. An OverflowError is raised if the integer is not representable with the given number of bytes.
byteorder
The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use ‘sys.byteorder’ as the byte order value.
signed
Determines whether two’s complement is used to represent the integer. If signed is False and a negative integer is given, an OverflowError is raised.
class pyteomics.auxiliary.structures.unitstr[source]

Bases: str

__init__

Initialize self. See help(type(self)) for accurate signature.

capitalize()

Return a capitalized version of the string.

More specifically, make the first character have upper case and the rest lower case.

casefold()

Return a version of the string suitable for caseless comparisons.

center()

Return a centered string of length width.

Padding is done using the specified fill character (default is a space).

count(sub[, start[, end]]) → int

Return the number of non-overlapping occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation.

encode()

Encode the string using the codec registered for encoding.

encoding
The encoding in which to encode the string.
errors
The error handling scheme to use for encoding errors. The default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
endswith(suffix[, start[, end]]) → bool

Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.

expandtabs()

Return a copy where all tab characters are expanded using spaces.

If tabsize is not given, a tab size of 8 characters is assumed.

find(sub[, start[, end]]) → int

Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

format(*args, **kwargs) → str

Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).

format_map(mapping) → str

Return a formatted version of S, using substitutions from mapping. The substitutions are identified by braces (‘{‘ and ‘}’).

index(sub[, start[, end]]) → int

Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Raises ValueError when the substring is not found.

isalnum()

Return True if the string is an alpha-numeric string, False otherwise.

A string is alpha-numeric if all characters in the string are alpha-numeric and there is at least one character in the string.

isalpha()

Return True if the string is an alphabetic string, False otherwise.

A string is alphabetic if all characters in the string are alphabetic and there is at least one character in the string.

isascii()

Return True if all characters in the string are ASCII, False otherwise.

ASCII characters have code points in the range U+0000-U+007F. Empty string is ASCII too.

isdecimal()

Return True if the string is a decimal string, False otherwise.

A string is a decimal string if all characters in the string are decimal and there is at least one character in the string.

isdigit()

Return True if the string is a digit string, False otherwise.

A string is a digit string if all characters in the string are digits and there is at least one character in the string.

isidentifier()

Return True if the string is a valid Python identifier, False otherwise.

Use keyword.iskeyword() to test for reserved identifiers such as “def” and “class”.

islower()

Return True if the string is a lowercase string, False otherwise.

A string is lowercase if all cased characters in the string are lowercase and there is at least one cased character in the string.

isnumeric()

Return True if the string is a numeric string, False otherwise.

A string is numeric if all characters in the string are numeric and there is at least one character in the string.

isprintable()

Return True if the string is printable, False otherwise.

A string is printable if all of its characters are considered printable in repr() or if it is empty.

isspace()

Return True if the string is a whitespace string, False otherwise.

A string is whitespace if all characters in the string are whitespace and there is at least one character in the string.

istitle()

Return True if the string is a title-cased string, False otherwise.

In a title-cased string, upper- and title-case characters may only follow uncased characters and lowercase characters only cased ones.

isupper()

Return True if the string is an uppercase string, False otherwise.

A string is uppercase if all cased characters in the string are uppercase and there is at least one cased character in the string.

join()

Concatenate any number of strings.

The string whose method is called is inserted in between each given string. The result is returned as a new string.

Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'

ljust()

Return a left-justified string of length width.

Padding is done using the specified fill character (default is a space).

lower()

Return a copy of the string converted to lowercase.

lstrip()

Return a copy of the string with leading whitespace removed.

If chars is given and not None, remove characters in chars instead.

static maketrans()

Return a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.

partition()

Partition the string into three parts using the given separator.

This will search for the separator in the string. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.

If the separator is not found, returns a 3-tuple containing the original string and two empty strings.

replace()

Return a copy with all occurrences of substring old replaced by new.

count
Maximum number of occurrences to replace. -1 (the default value) means replace all occurrences.

If the optional argument count is given, only the first count occurrences are replaced.

rfind(sub[, start[, end]]) → int

Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Return -1 on failure.

rindex(sub[, start[, end]]) → int

Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.

Raises ValueError when the substring is not found.

rjust()

Return a right-justified string of length width.

Padding is done using the specified fill character (default is a space).

rpartition()

Partition the string into three parts using the given separator.

This will search for the separator in the string, starting at the end. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.

If the separator is not found, returns a 3-tuple containing two empty strings and the original string.

rsplit()

Return a list of the words in the string, using sep as the delimiter string.

sep
The delimiter according to which to split the string. None (the default value) means split according to any whitespace, and discard empty strings from the result.
maxsplit
Maximum number of splits to do. -1 (the default value) means no limit.

Splits are done starting at the end of the string and working to the front.

rstrip()

Return a copy of the string with trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.

split()

Return a list of the words in the string, using sep as the delimiter string.

sep
The delimiter according to which to split the string. None (the default value) means split according to any whitespace, and discard empty strings from the result.
maxsplit
Maximum number of splits to do. -1 (the default value) means no limit.
splitlines()

Return a list of the lines in the string, breaking at line boundaries.

Line breaks are not included in the resulting list unless keepends is given and true.

startswith(prefix[, start[, end]]) → bool

Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.

strip()

Return a copy of the string with leading and trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.

swapcase()

Convert uppercase characters to lowercase and lowercase characters to uppercase.

title()

Return a version of the string where each word is titlecased.

More specifically, words start with uppercased characters and all remaining cased characters have lower case.

translate()

Replace each character in the string using the given translation table.

table
Translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, strings, or None.

The table must implement lookup/indexing via __getitem__, for instance a dictionary or list. If this operation raises LookupError, the character is left untouched. Characters mapped to None are deleted.

upper()

Return a copy of the string converted to uppercase.

zfill()

Pad a numeric string with zeros on the left, to fill a field of the given width.

The string is never truncated.

class pyteomics.auxiliary.file_helpers.ChainBase(*sources, **kwargs)[source]

Bases: object

Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.

sources

Sources for creating new sequences from, such as paths or file-like objects

Type:Iterable
kwargs

Additional arguments used to instantiate each sequence

Type:Mapping
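The chaining idea can be sketched with `itertools.chain`. The helper `chain_sources` and the toy parser `read_words` are hypothetical and only illustrate how one callable is applied to each source with shared keyword arguments; they are not part of the library.

```python
from itertools import chain


def chain_sources(sequence_maker, *sources, **kwargs):
    """Lazily chain sequence_maker(source, **kwargs) over all sources."""
    return chain.from_iterable(
        sequence_maker(source, **kwargs) for source in sources
    )


def read_words(text, upper=False):
    # Stand-in for a real parser; here each "source" is just a string.
    for word in text.split():
        yield word.upper() if upper else word


combined = list(chain_sources(read_words, "a b", "c", upper=True))
```

Each source is opened only when iteration reaches it, so a long list of inputs does not have to be parsed up front.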
__init__(*sources, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

map(target=None, processes=-1, queue_timeout=4, args=None, kwargs=None, **_kwargs)[source]

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If negative, the number of processes will match the number of available CPUs.
  • queue_timeout (float, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.
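The calling convention of `map()` (a target applied to each entry together with extra `args` and `kwargs`, results yielded as they finish) can be illustrated with a thread pool from the standard library. Note the swap: the documented method uses worker processes, whereas this sketch uses threads to stay short and self-contained. `map_unordered` and `scale` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def map_unordered(target, entries, workers=4, args=(), kwargs=None):
    """Yield target(entry, *args, **kwargs) results as they complete."""
    kwargs = kwargs or {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(target, e, *args, **kwargs) for e in entries]
        for future in as_completed(futures):
            yield future.result()


def scale(entry, factor, offset=0):
    return entry * factor + offset


# Completion order is not guaranteed, so sort before comparing.
results = sorted(
    map_unordered(scale, [1, 2, 3], args=(10,), kwargs={"offset": 1})
)
```

Because results arrive out of order, a target that needs to be matched back to its input should return an identifier alongside its value.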

class pyteomics.auxiliary.file_helpers.FileReader(source, **kwargs)[source]

Bases: pyteomics.auxiliary.file_helpers.IteratorContextManager

Abstract class implementing context manager protocol for file readers.

__init__(source, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

reset()[source]

Resets the iterator to its initial state.
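A reader that combines the iterator protocol, the context manager protocol and `reset()` can be sketched as follows. `ReaderSketch` is a hypothetical line reader over an in-memory stream and is not derived from the library's classes; it only illustrates the pattern.

```python
import io


class ReaderSketch:
    """Hypothetical line reader: iterator, context manager and reset()."""

    def __init__(self, source):
        self._source = source
        self.reset()

    def reset(self):
        # Rewind the underlying stream and rebuild the line iterator.
        self._source.seek(0)
        self._lines = iter(self._source)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._lines).rstrip("\n")

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._source.close()


with ReaderSketch(io.StringIO("first\nsecond\n")) as reader:
    head = next(reader)        # consume one entry
    reader.reset()             # start over from the beginning
    everything = list(reader)
```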

class pyteomics.auxiliary.file_helpers.FileReadingProcess(reader_spec, target_spec, qin, qout, args_spec, kwargs_spec)[source]

Bases: multiprocessing.context.Process

Process that does a share of distributed work on entries read from a file. It reconstructs a reader object, parses entries at the given indexes, optionally does additional processing, and sends the results back.

The reader class must support the __getitem__() dict-like lookup.

__init__(reader_spec, target_spec, qin, qout, args_spec, kwargs_spec)[source]

Initialize self. See help(type(self)) for accurate signature.

close()

Close the Process object.

This method releases resources held by the Process object. It is an error to call this method if the child process is still running.

daemon

Return whether process is a daemon

exitcode

Return exit code of process or None if it has yet to stop

ident

Return identifier (PID) of process or None if it has yet to start

is_alive()

Return whether process is alive

join(timeout=None)

Wait until child process terminates

kill()

Terminate process; sends SIGKILL signal or uses TerminateProcess()

pid

Return identifier (PID) of process or None if it has yet to start

run()[source]

Method to be run in sub-process; can be overridden in sub-class

sentinel

Return a file descriptor (Unix) or handle (Windows) suitable for waiting for process termination.

start()

Start child process

terminate()

Terminate process; sends SIGTERM signal or uses TerminateProcess()

class pyteomics.auxiliary.file_helpers.IndexSavingMixin(*args, **kwargs)[source]

Bases: pyteomics.auxiliary.file_helpers.NoOpBaseReader

Common interface for IndexSavingXML and IndexSavingTextReader.

__init__(*args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

classmethod prebuild_byte_offset_file(path)[source]

Construct a new XML reader, build its byte offset index and write it to file

Parameters:path (str) – The path to the file to parse
write_byte_offsets()[source]

Write the byte offsets in _offset_index to the file at _byte_offset_filename

class pyteomics.auxiliary.file_helpers.IndexSavingTextReader(source, **kwargs)[source]

Bases: pyteomics.auxiliary.file_helpers.IndexSavingMixin, pyteomics.auxiliary.file_helpers.IndexedTextReader

__init__(source, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters:path (str) – The path to the file to parse
reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

class pyteomics.auxiliary.file_helpers.IndexedReaderMixin(*args, **kwargs)[source]

Bases: pyteomics.auxiliary.file_helpers.NoOpBaseReader

Common interface for IndexedTextReader and IndexedXML.

__init__(*args, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

class pyteomics.auxiliary.file_helpers.IndexedTextReader(source, **kwargs)[source]

Bases: pyteomics.auxiliary.file_helpers.IndexedReaderMixin, pyteomics.auxiliary.file_helpers.FileReader

Abstract class for text file readers that keep an index of records for random access. This requires reading the file in binary mode.

__init__(source, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

reset()

Resets the iterator to its initial state.
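
To illustrate the idea of an index of records for random access, the sketch below scans a stream opened in binary mode, records the byte offset at which each record starts, and later uses seek() to jump straight to one record. It is a simplified illustration, not the Pyteomics implementation: the helper names build_offset_index and read_record, the '>' record delimiter and the sample data are all assumptions made for this example.

```python
import io
from collections import OrderedDict


def build_offset_index(stream, delimiter=b'>'):
    """Map each record label to the byte offset of its header line.

    Hypothetical helper: a record starts at any line beginning with
    `delimiter`; the label is the first word after the delimiter.
    """
    index = OrderedDict()
    stream.seek(0)
    while True:
        offset = stream.tell()  # byte position before reading the line
        line = stream.readline()
        if not line:
            break
        if line.startswith(delimiter):
            label = line[len(delimiter):].split()[0].decode('utf-8')
            index[label] = offset
    return index


def read_record(stream, index, label, delimiter=b'>'):
    """Seek to a record by label and return its lines as text."""
    stream.seek(index[label])
    lines = [stream.readline()]  # the header line
    while True:
        position = stream.tell()
        line = stream.readline()
        if not line or line.startswith(delimiter):
            stream.seek(position)  # do not consume the next header
            break
        lines.append(line)
    return b''.join(lines).decode('utf-8')


# Made-up data standing in for a file opened with open(path, 'rb').
data = io.BytesIO(b'>P1 first\nMKWV\nTFIS\n>P2 second\nLLLK\n')
index = build_offset_index(data)
record = read_record(data, index, 'P2')
```

Working with bytes keeps the stored offsets valid as arguments to seek(); in text mode, decoding and newline translation can make positions differ from plain byte counts.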

class pyteomics.auxiliary.file_helpers.OffsetIndex(*args, **kwargs)[source]

Bases: collections.OrderedDict, pyteomics.auxiliary.file_helpers.WritableIndex

An augmented OrderedDict that formally wraps getting items by index

__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

clear() → None. Remove all items from od.
copy() → a shallow copy of od
from_index(index, include_value=False)[source]

Get an entry by its integer index in the ordered sequence of this mapping.

Parameters:
  • index (int) – The index to retrieve.
  • include_value (bool) – Whether to return both the key and the value or just the key. Defaults to False.
Returns:

If include_value is True, a tuple of (key, value) at index else just the key at index.

Return type:

object

from_slice(spec, include_value=False)[source]

Get a slice along index in the ordered sequence of this mapping.

Parameters:
  • spec (slice) – The slice over the range of indices to retrieve
  • include_value (bool) – Whether to return both the key and the value or just the key. Defaults to False
Returns:

If include_value is True, a tuple of (key, value) at index else just the key at index for each index in spec

Return type:

list
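
The positional lookups described for from_index() and from_slice() can be pictured with a small OrderedDict subclass. This is a sketch under stated assumptions, not the OffsetIndex source: the class name PositionalDict and the sample keys are hypothetical, and a production class would likely cache the item sequence rather than rebuild it on every call.

```python
from collections import OrderedDict


class PositionalDict(OrderedDict):
    """Hypothetical OrderedDict that also supports lookup by position."""

    def from_index(self, index, include_value=False):
        items = tuple(self.items())  # rebuilt each call; caching is possible
        key, value = items[index]
        return (key, value) if include_value else key

    def from_slice(self, spec, include_value=False):
        items = tuple(self.items())
        return [(k, v) if include_value else k for k, v in items[spec]]


# Made-up keys and byte offsets for illustration.
offsets = PositionalDict([('scan=1', 0), ('scan=2', 512), ('scan=3', 1024)])
second_key = offsets.from_index(1)
last_item = offsets.from_index(-1, include_value=True)
first_two = offsets.from_slice(slice(0, 2))
```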

fromkeys()

Create a new ordered dictionary with keys from iterable and values set to value.

get()

Return the value for key if key is in the dictionary, else default.

index_sequence

Keeps a cached copy of the items() sequence stored as a tuple to avoid repeatedly copying the sequence over many method calls.

Return type: tuple
items() → a set-like object providing a view on D's items
keys() → a set-like object providing a view on D's keys
move_to_end()

Move an existing element to the end (or beginning if last is false).

Raise KeyError if the element does not exist.

pop(k[, d]) → v[source]

Remove the specified key and return the corresponding value. If key is not found, d is returned if given, otherwise KeyError is raised.

popitem()

Remove and return a (key, value) pair from the dictionary.

Pairs are returned in LIFO order if last is true or FIFO order if false.

setdefault()

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) → None. Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values() → an object providing a view on D's values
class pyteomics.auxiliary.file_helpers.TableJoiner(*sources, **kwargs)[source]

Bases: pyteomics.auxiliary.file_helpers.ChainBase

__init__(*sources, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

map(target=None, processes=-1, queue_timeout=4, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If negative, the number of processes will match the number of available CPUs.
  • queue_timeout (float, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

class pyteomics.auxiliary.file_helpers.TimeOrderedIndexedReaderMixin(*args, **kwargs)[source]

Bases: pyteomics.auxiliary.file_helpers.IndexedReaderMixin

__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

Combined examples

This section lists examples that illustrate the possible usage of Pyteomics as a whole. The list will grow over time.


Example 1: Unravelling the Peptidome

In this example, we will introduce the Pyteomics tools to predict the basic physicochemical characteristics of peptides, such as mass, charge and chromatographic retention time. We will download a FASTA database with baker’s yeast proteins, digest it with trypsin and study the distributions of various quantitative qualities that may be measured in a typical proteomic experiment.

The example is organized as a script interrupted by comments. It is assumed that the reader already has experience with numpy and matplotlib libraries. The source code for the example can be found here.

Before we begin, we need to import all the modules that we may require. Besides pyteomics itself, we need the standard library modules that let us access the file system (os), download files from the Internet (urllib) and open gzip archives (gzip), as well as external libraries to process and visualize arrays of data (numpy, matplotlib).

import os
from urllib.request import urlretrieve
import gzip
import matplotlib.pyplot as plt
import numpy as np
from pyteomics import fasta, parser, mass, achrom, electrochem, auxiliary

We also need to download a real FASTA database. For our purposes, the UniProt database with Saccharomyces cerevisiae proteins will work fine. We’ll download a gzip-compressed database from the UniProt FTP server:

if not os.path.isfile('yeast.fasta.gz'):
    print('Downloading the FASTA file for Saccharomyces cerevisiae...')
    urlretrieve(
        'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/'
        'reference_proteomes/Eukaryota/UP000002311_559292.fasta.gz',
        'yeast.fasta.gz')
    print('Done!')

The pyteomics.fasta.FASTA class lets us iterate over the protein sequences in a FASTA file in a regular Python loop. It replaced pyteomics.fasta.read(), although the latter still exists, too. In this example, we create a FASTA object from a file-like object representing a gzip archive. All file parser objects are flexible and support a variety of use cases. Additionally, pyteomics.fasta supports a variety of FASTA types and flavors.

For all FASTA parser classes, check fasta - manipulations with FASTA databases. See also: an explanation of Indexed Parsers.
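
To make it clear what such a loop yields, here is a minimal pure-Python sketch of a FASTA iterator that produces (description, sequence) pairs from text lines. It only illustrates the file format and is not how pyteomics.fasta is implemented; the function name iter_fasta and the two sample entries are made up for this example.

```python
import io


def iter_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of text lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous entry, if there was one.
            if description is not None:
                yield description, ''.join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, ''.join(chunks)


# Made-up entries; any file-like object opened in text mode would work.
example = io.StringIO(
    '>sp|P00001|TEST_YEAST Example protein\nMKWVTF\nISLLK\n'
    '>sp|P00002|TEST2_YEAST Another one\nPEPTIDEK\n')
entries = list(iter_fasta(example))
```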

In order to obtain the peptide sequences, we cleave each protein using the pyteomics.parser.cleave() function and combine results into a set object that automatically discards multiple occurrences of the same sequence.

print('Cleaving the proteins with trypsin...')
unique_peptides = set()
with gzip.open('yeast.fasta.gz', mode='rt') as gzfile:
    for description, sequence in fasta.FASTA(gzfile):
        new_peptides = parser.cleave(sequence, 'trypsin')
        unique_peptides.update(new_peptides)
print('Done, {0} sequences obtained!'.format(len(unique_peptides)))
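
For readers unfamiliar with enzymatic cleavage, the following sketch shows the idea on a toy sequence using only the standard library. It assumes a simplified trypsin rule (cut after K or R unless the next residue is P) and no missed cleavages; the rule that Pyteomics actually applies for 'trypsin' may handle more special cases, and the function name simple_tryptic_digest is hypothetical.

```python
import re


def simple_tryptic_digest(sequence):
    """Return the set of peptides from a simplified tryptic digest.

    Assumed rule: cleave C-terminally to K or R, except before P.
    """
    peptides = set()
    start = 0
    for match in re.finditer(r'[KR](?!P)', sequence):
        end = match.end()
        peptides.add(sequence[start:end])
        start = end
    if start < len(sequence):
        # Whatever remains after the last cleavage site.
        peptides.add(sequence[start:])
    return peptides


toy_peptides = simple_tryptic_digest('AKRPGKCD')
```

As in the example above, collecting the results in a set means that a peptide produced by several proteins is stored only once.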

Later we will calculate different peptide properties. In order to store them, we create a list of dicts, where each dict stores the properties of a single peptide, including its sequence.

peptides = [{'sequence': i} for i in unique_peptides]

It is also more efficient to pre-parse the sequences into individual amino acids and supply the parsed structures to the functions that calculate m/z, charge, etc. During parsing, we explicitly save the terminal groups of peptides so that they are taken into account when calculating the m/z and charge of a peptide.

print('Parsing peptide sequences...')
for peptide in peptides:
    peptide['parsed_sequence'] = parser.parse(
        peptide['sequence'],
        show_unmodified_termini=True)
    peptide['length'] = parser.length(peptide['parsed_sequence'])
print('Done!')

For our purposes, we will limit ourselves to reasonably short peptides of at most 100 residues.

peptides = [peptide for peptide in peptides if peptide['length'] <= 100]

We use pyteomics.electrochem.charge() to calculate the charge at pH=2.0. The neutral mass and the m/z of an ion are found with pyteomics.mass.calculate_mass().

print('Calculating the mass, charge and m/z...')
for peptide in peptides:
    peptide['charge'] = int(round(
        electrochem.charge(peptide['parsed_sequence'], pH=2.0)))
    peptide['mass'] = mass.calculate_mass(peptide['parsed_sequence'])
    peptide['m/z'] = mass.calculate_mass(peptide['parsed_sequence'],
        charge=peptide['charge'])
print('Done!')
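
The relation between the neutral mass and m/z used here can be written out explicitly. The sketch below assumes that the ion is formed by attaching protons, so that m/z = (M + z·m_p) / z; the function name and the rounded proton mass constant are assumptions of this example rather than values taken from Pyteomics.

```python
PROTON_MASS = 1.00727646688  # Da, approximate value assumed for this sketch


def mz_from_neutral_mass(neutral_mass, charge):
    """Return m/z of a protonated ion with the given positive charge."""
    if charge <= 0:
        raise ValueError('charge must be a positive integer')
    return (neutral_mass + charge * PROTON_MASS) / charge


singly_charged = mz_from_neutral_mass(1000.0, 1)
doubly_charged = mz_from_neutral_mass(1000.0, 2)
```

Each additional proton adds its mass to the numerator, which is why a doubly charged ion appears at slightly more than half the neutral mass.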

Next, we calculate the retention time in reversed-phase and normal-phase chromatography using pyteomics.achrom.calculate_RT(). The phase is specified by supplying the corresponding set of retention coefficients: pyteomics.achrom.RCs_zubarev for the reversed phase and pyteomics.achrom.RCs_yoshida_lc for the normal phase.

print('Calculating the retention time...')
for peptide in peptides:
    peptide['RT_RP'] = achrom.calculate_RT(
        peptide['parsed_sequence'],
        achrom.RCs_zubarev)
    peptide['RT_normal'] = achrom.calculate_RT(
        peptide['parsed_sequence'],
        achrom.RCs_yoshida_lc)
print('Done!')
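
The notion of retention coefficients can be illustrated with the simplest additive model, in which every residue contributes a fixed amount to the retention time. The sketch below is only an illustration of that idea: the coefficient values are invented, the function name additive_rt is hypothetical, and the model applied by pyteomics.achrom.calculate_RT() may include further terms (for example, corrections for peptide length or terminal residues).

```python
def additive_rt(sequence, coefficients, constant=0.0):
    """Sum per-residue retention coefficients plus a constant offset."""
    return constant + sum(coefficients[residue] for residue in sequence)


# Invented coefficients, for illustration only.
TOY_COEFFICIENTS = {'A': 1.0, 'K': -2.0, 'L': 9.0, 'G': 0.0}
toy_rt = additive_rt('GALK', TOY_COEFFICIENTS, constant=5.0)
```

In such a model, swapping the set of coefficients is all it takes to switch from one chromatographic phase to another, which mirrors how the two calls above differ only in their second argument.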

Now that we have all the numbers, we can estimate the complexity of a sample by plotting the distributions of parameters measurable in a typical proteomic experiment. First, we show the distribution of m/z using the standard histogram plotting function from matplotlib.

plt.figure()
plt.hist([peptide['m/z'] for peptide in peptides],
    bins=2000,
    range=(0, 4000))
plt.xlabel('m/z, Th')
plt.ylabel('# of peptides within 2 Th bin')

The same set of commands allows us to plot the distribution of charge states in the sample:

plt.figure()
plt.hist([peptide['charge'] for peptide in peptides],
    bins=20,
    range=(0, 10))
plt.xlabel('charge, e')
plt.ylabel('# of peptides')

Next, we want to visualize the statistical correlation between the retention times in reversed-phase and normal-phase chromatography.

The standard approach would be to use a scatter plot. However, with a sample of our size that would be uninformative. Instead, we will plot a 2D histogram, built with a combination of numpy and matplotlib. The function numpy.histogram2d() bins a set of (x, y) points on a plane and returns the matrix of counts in each bin together with the bin edges along both axes. We also use a trick of replacing zeros in this matrix with the not-a-number value, so that empty bins are left blank in the final figure instead of being drawn in the color that stands for the lowest count. We suggest removing the fourth line in this code snippet to see how that affects the final plot. In the last line, we also apply linear regression to obtain the coefficient of correlation between the two retention times.

x = [peptide['RT_RP'] for peptide in peptides]
y = [peptide['RT_normal'] for peptide in peptides]
heatmap, xbins, ybins = np.histogram2d(x, y, bins=100)
heatmap[heatmap == 0] = np.nan
a, b, r, stderr = auxiliary.linear_regression(x,y)
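
For reference, the slope, intercept and correlation coefficient of an ordinary least-squares fit can be computed by hand as sketched below. This is a plain-Python illustration, not the code of auxiliary.linear_regression(): the function name is hypothetical and it returns only three of the four values unpacked above (the standard error is left out).

```python
import math


def simple_linear_regression(x, y):
    """Return (slope, intercept, r) of the least-squares line y = a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Points lying exactly on y = 2x + 1.
slope, intercept, r = simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
```

A value of r close to 1 or -1 indicates a nearly linear relationship, while a value near 0 means that the two quantities are essentially uncorrelated.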

The obtained heatmap is plotted with matplotlib.pyplot.imshow() function that visualizes matrices.

plt.figure()
plt.imshow(heatmap)
plt.xlabel('RT on RP, min')
plt.ylabel('RT on normal phase, min')
plt.title('All tryptic peptides, RT correlation = {0}'.format(r))

As you can see upon execution of the code, the retention times obtained on different chromatographic phases seem to be uncorrelated. The same code can also be applied to compare m/z with the retention time in reversed-phase chromatography:

x = [peptide['m/z'] for peptide in peptides]
y = [peptide['RT_RP'] for peptide in peptides]
heatmap, xbins, ybins = np.histogram2d(x, y,
    bins=[150, 2000],
    range=[[0, 4000], [0, 150]])
heatmap[heatmap == 0] = np.nan
a, b, r, stderr = auxiliary.linear_regression(x,y)

plt.figure()
plt.imshow(heatmap,
    aspect='auto',
    origin='lower')
plt.xlabel('m/z, Th')
plt.ylabel('RT on RP, min')
plt.title('All tryptic peptides, correlation = {0}'.format(r))

Finally, let us check whether the retention times remain uncorrelated when we narrow down the sample of peptides. We select the peptides with m/z lying in a 700-701 Th window and plot the two chromatographic retention times against each other. This time the sample is small enough for a scatter plot.

close_mass_peptides = [peptide for peptide in peptides
                       if 700.0 <= peptide['m/z'] <= 701.0]
x = [peptide['RT_RP'] for peptide in close_mass_peptides]
y = [peptide['RT_normal'] for peptide in close_mass_peptides]
a, b, r, stderr = auxiliary.linear_regression(x, y)

plt.figure()
plt.scatter(x, y)
plt.xlabel('RT on RP, min')
plt.ylabel('RT on normal phase, min')
plt.title('Tryptic peptides with m/z=700-701 Th\nRT correlation = {0}'.format(r))

plt.show()

As you can see, the retention times of peptides lying in a narrow mass window turn out to be substantially correlated.

At this point we stop. The next example will cover the modules allowing access to experimental proteomic datasets stored in XML-based formats.

Example 2: Fragmentation

In this example, we are going to retrieve MS/MS data from an MGF file and compare it to identification info we read from a pepXML file. We are going to compare the MS/MS spectrum in the file with the theoretical spectrum of a peptide assigned to this spectrum by the search engine.

The script source can be downloaded here. We will also need the example MGF file and the example pepXML file, but the script will download them for you.

The MGF file has a single MS/MS spectrum in it. This spectrum is taken from the SwedCAD database of annotated MS/MS spectra. The pepXML file was obtained by running X!Tandem against the MGF file and converting the results to pepXML with the Tandem2XML tool from TPP.

Let’s start with importing the modules.

from pyteomics import mgf, pepxml, mass
import os
from urllib.request import urlretrieve
import pylab

Then we’ll download the files, if needed:

for fname in ('mgf', 'pep.xml'):
    if not os.path.isfile('example.' + fname):
        urlretrieve('http://pyteomics.readthedocs.io/en/latest/_static/example.'
                + fname, 'example.' + fname)

Now it’s time to define the function that will give us m/z of theoretical fragments for a given sequence. We will use pyteomics.mass.fast_mass() to calculate the values. All we need to do is split the sequence at every bond and iterate over possible charges and ion types:

def fragments(peptide, types=('b', 'y'), maxcharge=1):
    """
    The function generates all possible m/z for fragments of types
    `types` and of charges from 1 to `maxcharge`.
    """
    for i in range(1, len(peptide)):
        for ion_type in types:
            for charge in range(1, maxcharge+1):
                if ion_type[0] in 'abc':
                    yield mass.fast_mass(
                            peptide[:i], ion_type=ion_type, charge=charge)
                else:
                    yield mass.fast_mass(
                            peptide[i:], ion_type=ion_type, charge=charge)

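So, the outer loop is over “fragmentation sites”, the next one is over ion types, and the innermost one is over charges. The final branch picks the part of the sequence to use: the N-terminal part for a, b and c ions and the C-terminal part for all other ion types.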
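
To show what the generated values look like without installing anything, the sketch below computes b and y ion m/z for a short peptide from a small table of monoisotopic residue masses. It is an independent illustration that covers all len(peptide)-1 backbone bonds; the function name by_ions, the rounded mass constants and the restriction to six residues are assumptions of this example, and mass.fast_mass() remains the tool to use in practice.

```python
# Rounded monoisotopic residue masses in Da (only the residues needed here).
MONO_RESIDUE_MASS = {
    'P': 97.05276, 'E': 129.04259, 'T': 101.04768,
    'I': 113.08406, 'D': 115.02694, 'K': 128.09496,
}
WATER = 18.01056   # Da, rounded
PROTON = 1.00728   # Da, rounded


def by_ions(peptide, charge=1):
    """Return a dict of b and y ion m/z values for every backbone bond."""
    ions = {}
    for i in range(1, len(peptide)):
        # b ions hold the N-terminal part, y ions the C-terminal part.
        b_mass = sum(MONO_RESIDUE_MASS[aa] for aa in peptide[:i])
        y_mass = sum(MONO_RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER
        ions['b%d' % i] = (b_mass + charge * PROTON) / charge
        ions['y%d' % (len(peptide) - i)] = (y_mass + charge * PROTON) / charge
    return ions


ions = by_ions('PEPTIDEK')
```

For an eight-residue peptide this gives seven b ions and seven y ions per charge state.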

All right, now it’s time to extract the info from the files. We are going to use the with statement syntax, which is not required, but recommended.

with mgf.read('example.mgf') as spectra, pepxml.read('example.pep.xml') as psms:
    spectrum = next(spectra)
    psm = next(psms)

Now prepare the figure…

pylab.figure()
pylab.title('Theoretical and experimental spectra for '
        + psm['search_hit'][0]['peptide'])
pylab.xlabel('m/z, Th')
pylab.ylabel('Intensity, rel. units')

… plot the real spectrum:

pylab.bar(spectrum['m/z array'], spectrum['intensity array'], width=0.1, linewidth=2,
        edgecolor='black')

… calculate and plot the theoretical spectrum, and show everything:

theor_spectrum = list(fragments(psm['search_hit'][0]['peptide'],
    maxcharge=psm['assumed_charge']))
pylab.bar(theor_spectrum,
        [spectrum['intensity array'].max()]*len(theor_spectrum),
        width=0.1, edgecolor='red', alpha=0.7)
pylab.show()

You will see something like this:

_images/example_msms.png

That’s it. As you can see, the most intense peaks in the spectrum are indeed matched by the theoretical spectrum.

Example 3: Search engines and PSM filtering

In this example we are going to parse the output of several search engines and see what we can do with it using Pyteomics.

Full Python code can be downloaded here (Python script) and here (IPython Notebook). The files used in this example can be downloaded from here. The example, including code, figures, and accompanying text, is contained in the IPython Notebook file.

View the rendered notebook online.

History of changes

4.3.2

Fix #7.

4.3.1

Technical release.

4.3

First release after the move to Github. Issue and PR numbers from now on refer to the Github repo. Archive of the Bitbucket issues and PRs is stored here.

Changes in this release:

4.2

  • Changes in XML XPath implementation. For standard XML parser classes, this only means a minor change in performance (should be a slight improvement, most noticeable for TandemXML).

    • For custom classes: the implementation of xpath evaluation in pyteomics.xml.XML.iterfind() has changed. Pseudo-conditions are now not supported. Instead, an attempt is made to support full XPath. The main difference is that the XPath is evaluated on XML elements, whereas pseudo-conditions used to be evaluated for complete Python dictionaries. To reproduce old behavior, you can just write an explicit if statement at an appropriate place. New implementation allows actually skipping the elements that do not satisfy the XPath predicate. When writing classes which by default iterate over elements based on a complex XPath, set _default_iter_path instead of _default_iter_tag.

      Warning

      Beware that if _default_iter_path differs from _default_iter_tag and you use indexing, all elements corresponding to _default_iter_tag will be indexed. This is a limitation of the index building procedure. This discrepancy will lead to confusing behavior (length checks, membership tests and other things based on index will not correspond to items returned by iteration). map() calls will also operate on the full index.

  • New keyword arguments queue_size, queue_timeout and processes for indexed parsers with support for map().

  • New method mass.Unimod.by_id(). Also, mass.Unimod now supports dict-like queries with record IDs.

  • Reduce memory footprint for unit primitives (PR #35 by Joshua Klein).

  • New functions pyteomics.auxiliary.sigma_T() and pyteomics.auxiliary.sigma_fdr().

  • Fix issues #44, #46, #47, #48.

4.1.2

Bugfix: fix the standard mass value for pyrrolysine (issue #42).

4.1.1

  • Add numpress support for mzML and mzXML files. To read files compressed with Numpress, install pynumpress (PyPI, GitHub).
  • Bugfixes.
API changes
  • In ms1.read() and ms2.read(), the default value for use_index is now False. Using the indexed parsers may result in incorrect behavior if the “first” scan number in S-lines is not unique.

4.1

4.0.1

Fix issue #35 (incorrect order of deserialized offset indexes on older Python versions).

4.0

3.5.1

Technical release to update the package metadata on PyPI. Project documentation on pythonhosted.org has been deleted. Latest documentation is available at: https://pyteomics.readthedocs.io/.

3.5

  • Preserve accession information on cvParam elements in mzML parser. Dictionaries produced by the parser can now be queried by accession using pyteomics.auxiliary.cvquery(). (Contributed by J. Klein)

  • Add optional decode_binary argument in pyteomics.mzml.MzML and pyteomics.mzxml.MzXML. When set to False, the parsers provide binary records suitable for decoding on demand. (Contributed by J. Klein)

  • Add method write_byte_offsets() in pyteomics.mzml.MzML, pyteomics.mzxml.MzXML and pyteomics.mzid.MzIdentML. Byte offsets can be loaded later to speed up random access. (Contributed by J. Klein)

  • Random access to MGF spectrum entries.

    This functionality will be changed in upcoming versions.

  • New module pyteomics.protxml for parsing of ProteinProphet output files.

  • Add PeptideProphet and iProphet analysis information to the output of pyteomics.pepxml.DataFrame().

  • New parameter huge_tree in XML parser constructors and read() functions. It is passed to the underlying lxml calls. Default value is False. Set to True to overcome errors such as: XMLSyntaxError: xmlSAX2Characters: huge text node.

  • New parameter skip_empty_cvparam_values in XML parser constructors. It instructs the parser to treat the empty “value” attributes in cvParam elements as if they were not there. This is helpful in cases when such empty “values” are present in one vendor’s file and absent in another: enabling the parameter will result in more unified output. Default value is False.

  • Change the default value for read_schema to False in XML parsing modules.

  • Change the default value for retrieve_refs to True in MzIdentML constructor.

  • Implement retrieve_refs for pyteomics.mzml.MzML. (Contributed by J. Klein)

  • New parameter keep_cterm in decoy generation functions in pyteomics.fasta.

  • New parameters decoy_prefix and decoy_suffix in all format-specific FDR filtering functions. If the standard is_decoy() function works for your files, you can use these parameters to specify either the prefix or the suffix appended to the protein names in decoy entries.

  • New ion types in pyteomics.mass.std_ion_comp.

  • Bugfixes.

3.4.2

  • New module pyteomics.ms1 for parsing of MS1 files.
  • mass.Composition constructor now accepts ion_type and charge parameters.
  • New functions pyteomics.mzid.DataFrame() and pyteomics.mzid.filter_df(). Their behavior may be refined later on.
  • Changes in behavior of pyteomics.auxiliary.filter() and pyteomics.auxiliary.qvalues():
    • both functions now always return DataFrames with pandas.DataFrame input and full_output=True.
    • string values of key, is_decoy and pep are substituted with simple itemgetter functions for non-pandas, non-numpy input;
    • additional parameters score_label, decoy_label, pep_label, and q_label for output control.
  • Performance optimizations in XML parsing code.

3.4.1

3.4

3.3.1

New submodule pyteomics.featurexml with a parser for OpenMS featureXML files.

3.3

  • mzML and mzIdentML parsers can now create an index of element offsets. This allows quick random access to elements by unique ID.
  • mzML parsers now come in two flavors: pyteomics.mzml.MzML and pyteomics.mzml.PreIndexedMzML. The latter uses the byte offsets listed at the end of the file.
  • New parameters convert_arrays and read_charges in mgf.read() allow using it without numpy and possibly improve performance. The default behavior is retained.
  • Performance optimizations in mgf.read() and parser.cleave().
  • New decoy generation mode called “fused decoy”, described in the paper accepted to JASMS.
API changes
  • pyteomics.parser.cleave() no longer accepts the labels argument. It is emphasized that the input sequences are expected to be in plain one-letter notation, but no checks are performed.
  • DataFrame() functions in pepxml and tandem now extract more protein-related information. The list-like protein-related values can be reported as lists or packed into strings, depending on the optional parameter sep. Some column names have changed as a result.
  • Call signatures of pyteomics.fasta.decoy_sequence() and the functions using it are slightly changed. Standard modes are now also exposed as individual functions.

3.2

New submodule pyteomics.mass.unimod contains rewritten machinery for handling of Unimod relational databases (contributed by Joshua Klein). This is a substitution and extension for the old mass.Unimod class. pyteomics.mass.unimod requires SQLAlchemy.

Other changes:

  • New function pyteomics.auxiliary.linear_regression_perpendicular() provides a linear fit minimizing distances from data points to the fit line (as opposed to pyteomics.auxiliary.linear_regression(), which minimizes vertical distances).
  • Both new and old linear regression functions now accept a single array of shape (N, 2).
  • pyteomics.pylab_aux.scatter_trend() now has an optional parameter regression which can be a callable performing the regression. Also, the regression equation is now the label of the regression line, not the scatter plot.
  • Another two new parameters for pyteomics.pylab_aux.scatter_trend() are sigma_kwargs and sigma_values.
  • pyteomics.pylab_aux functions plot_line() and scatter_trend() now return the objects they create.
  • Writer functions (pyteomics.mgf.write(), pyteomics.fasta.write(), pyteomics.fasta.write_decoy_db()) now accept a file_mode argument that overrides the mode in which the file is opened.
  • In pyteomics.mgf.write() one can now override the format spec for fragment m/z, intensity and charge values using the optional fragment_format argument. Key order and key-value parameter formatters are now also handled via optional arguments.
  • pyteomics.fasta.decoy_db() now supports ignore_comments and parser arguments.

3.1.1

3.1

This release offers integration with the great pandas library. Working with qvalues() and filter() functions is now much easier if you have your PSMs in a DataFrame. Many search engines use CSV as their output format, allowing direct creation of DataFrame objects. New functions pyteomics.tandem.DataFrame() and pyteomics.pepxml.DataFrame() facilitate creation of DataFrames from corresponding formats.

Also, qvalues(), filter() and fdr() functions can now use posterior error probabilities (PEPs) instead of using decoys for q-value calculation.

  • In qvalues() and filter() functions, key and is_decoy can now be array-like objects or strings (as well as functions and iterators). If a string is given, it is used as a field name in the PSM array or DataFrame. fdr() functions also support strings and iterables as arguments.
  • New parameter pep in qvalues(), filter() and fdr() functions. It can be callable, array-like, or iterator. Conflicts with decoy-related parameters. Compatible with key, but makes it optional.
  • Fixed the behavior of filter.chain() functions. They now treat the full_output argument the same way as filter() functions.
  • Fixed the issue that caused exceptions when calling fasta.decoy_db() and fasta.write_decoy_db() with explicitly given mode (signature for creation of pyteomics.auxiliary.FileReader objects slightly changed).
  • Pyteomics now uses setuptools and is a namespace package.
  • Minor fixes.
API changes
  • Default value of remove_decoy in qvalues() is now False.

3.0.1

3.0.0

  • XML parsers are now implemented as objects, each format has its own class. Those classes can be instantiated using the same arguments as read() functions accepted, and support direct iteration and the with syntax. The read() functions are now simple aliases to the corresponding constructors.
  • As a result, functions iterfind(), version_info() and get_by_id() are now deprecated in favor of methods iterfind() and get_by_id() and attribute version_info of corresponding instances.
  • In pyteomics.mgf.write(), the order of keys and the format of values are now controlled via module-level variables.
  • In pyteomics.electrochem, correction for pK of terminal groups depending on the terminal residue is implemented; example set of pK and corrected pK added.
  • Imports of external dependencies are delayed where possible, so that unnecessary ImportErrors do not occur.
  • local_fdr() renamed to qvalues() in pepxml, mzid, tandem and auxiliary. local_fdr() did not reflect the semantics of the function. The algorithm has also been corrected so that the array of q-values is always sorted (as it should be by definition).
  • qvalues() now also accepts a parameter full_output which keeps the PSMs alongside their scores and associated q-values.
  • All fdr(), qvalues(), and filter() functions now accept a new parameter correction. It is used for more accurate estimation of the number of false positives using TDA (paper with explanation).
  • filter() functions now support both iterator protocol and context manager protocol. They now also accept the full_output parameter, which has the following meaning: if True (default), then an array of PSMs is directly returned by the function. Otherwise, an iterator is returned, as before. The array takes some memory, but this way is usually around 2x faster.
  • New function pyteomics.pylab_aux.plot_qvalue_curve().
  • pyteomics.mass.Composition objects now have a mass() method (equivalent to pyteomics.mass.calculate_mass()).
  • Also, Composition and objects returned by pyteomics.parser.amino_acid_composition() now inherit from collections.defaultdict and collections.Counter.
  • Decoy-related functions in pyteomics.fasta now accept a new parameter keep_nterm that preserves the N-terminal residue in the generated decoy sequences.
  • Minor fixes.
API changes

2.5.5

Fix for a memory leak in pyteomics.mzid.get_by_id(), which affects pyteomics.mzid.read() with retrieve_refs=True.

2.5.4

  • New functions local_fdr() in pepxml, mzid, and tandem. The function returns a NumPy array with PSM scores and corresponding values of local FDR.
  • New parameter iterative in read() functions of XML parsing modules. Parsing of mzIdentML files with retrieve_refs=True got significantly faster.

2.5.3

  • Universally applicable modifications are now allowed in pyteomics.parser.isoforms().
  • It is now also possible to specify non-terminal modifications which are only applicable to terminal residues.
  • Fix in pyteomics.parser.parse(): if the labels argument is provided, it needs to contain standard terminal groups if they are present in the sequence or if show_unmodified_termini is set to True.
  • pyteomics.mass.Composition instances are now pickleable.
  • Performance improvements.
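As a rough illustration of what enumerating isoforms means, the sketch below builds every combination of modified and unmodified residues with itertools.product. It is a simplified stand-in, not the code behind pyteomics.parser.isoforms(): the name isoforms_sketch is hypothetical, terminal and fixed modifications are ignored, and using True to mark a universally applicable modification is an assumed convention for this example.

```python
from itertools import product


def isoforms_sketch(sequence, variable_mods):
    """Yield every modX-style string obtainable from a plain sequence.

    variable_mods maps a lowercase modification prefix either to the
    residues it may decorate (e.g. {'p': 'ST'}) or to True, meaning
    "applicable to any residue" (assumed convention for this sketch).
    """
    choices = []
    for residue in sequence:
        options = [residue]  # the unmodified residue is always an option
        for mod, targets in variable_mods.items():
            if targets is True or residue in targets:
                options.append(mod + residue)
        choices.append(options)
    # One isoform per combination of per-residue options.
    for combo in product(*choices):
        yield ''.join(combo)


forms = set(isoforms_sketch('AST', {'p': 'ST'}))
```

The number of isoforms grows exponentially with the number of modifiable sites, which is why a generator is used here instead of a list.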

2.5.2

  • New parameter reverse in all filter() functions.
  • New function pyteomics.mass.fast_mass2(), which is analogous to pyteomics.mass.fast_mass(), but supports full modX notation and is several times slower.
  • Fix in pyteomics.pepxml.read() for compatibility with files produced with Mascot2XML utility.
  • Unknown labels are now allowed in pyteomics.electrochem and pyteomics.achrom functions, in accordance with the new general policy.

2.5.1

2.5.0

API changes
  • The boolean overlap parameter in pyteomics.parser.cleave() is replaced with an integer min_length. Since min_length uses pyteomics.parser.length(), the labels keyword argument is now accepted by cleave() and num_sites(), if needed. With carefully designed cleavage rules, all cleavage functions work with modX sequences.
  • The labels argument in pyteomics.parser.parse() and related functions has changed its meaning. parse() won’t raise an exception for non-standard labels in sequences if the labels keyword argument is not given.
  • The modX notation specification is now more strict to avoid ambiguity: only zero or two terminal groups can be present in a modX sequence. Sequences with one terminal group specified will be supported where possible, but be advised that sequences such as “H-OH” are intrinsically ambiguous.
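To make the role of min_length concrete, here is a minimal regex-based cleavage sketch. It is not the Pyteomics implementation: cleave_sketch, its default rule (cut after K or R unless followed by P) and its argument names are assumptions for illustration. It measures peptides with the built-in len(), which is only correct for unmodified sequences; this limitation is the reason the entry above mentions pyteomics.parser.length() and the labels argument.

```python
import re


def cleave_sketch(sequence, rule=r'[KR](?!P)', missed_cleavages=0, min_length=1):
    """Return the set of peptides obtained by cutting after each rule match."""
    # Cut positions: start of the chain plus the end of every rule match.
    cuts = [0] + [m.end() for m in re.finditer(rule, sequence)]
    if cuts[-1] != len(sequence):
        cuts.append(len(sequence))
    peptides = set()
    for i in range(len(cuts) - 1):
        # Join up to `missed_cleavages` additional neighbouring fragments.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(cuts))):
            peptide = sequence[cuts[i]:cuts[j]]
            # Plain len() is only valid for unmodified sequences.
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


full = cleave_sketch('AKGRPLKC')
filtered = cleave_sketch('AKGRPLKC', min_length=2)
with_missed = cleave_sketch('AKGRPLKC', missed_cleavages=1)
```

Here the R in "RP" is not a cleavage site under the assumed rule, so the fully cleaved products are AK, GRPLK and C, and min_length=2 drops the single-residue peptide.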

2.4.3

  • Added the ratio keyword argument for FDR calculation.
  • Minor changes in iterfind() functions of file parsers.
  • Bugfix in pyteomics.mgf.write() (duplication of pepmass key).
  • Removed non-functional parameter read_schema for pyteomics.tandem.read().

2.4.2

  • Bugfix in pyteomics.mass.most_probable_isotopic_composition(). The bug manifested itself after version 2.4.0, when pyteomics.mass.nist_mass was expanded. Also, the format of the returned value is now in accordance with the documentation.

2.4.1

  • New function pyteomics.auxiliary.filter() for filtering lists of PSMs not coming directly from files in supported formats.
  • Also, a format-agnostic helper function pyteomics.auxiliary.fdr().

2.4.0

  • New functions for filtering to a certain FDR level based on target-decoy strategy, as well as for FDR estimation, in pyteomics.tandem, pyteomics.pepxml and pyteomics.mzid. The functions are called filter() (beware of shadowing the built-in function) and fdr() (in each of the modules). Chained versions filter.chain() and filter.chain.from_iterable() are also available. See Data Access for more info.

  • New function pyteomics.parser.coverage() for sequence coverage calculation.

  • New function pyteomics.fasta.decoy_chain(), a chained version of pyteomics.fasta.decoy_db().

  • New elements in pyteomics.mass.nist_mass. Pretty much all elements are there now.

  • Fix in pyteomics.parser.parse() to cover some fancy corner cases.

  • Bugfix in pyteomics.tandem: modification info is now fully extracted.

  • pyteomics.mass.isotopic_composition_abundance() is now able to calculate abundances for larger molecules.

    Note

    Rounding errors may be significant in this case.
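Sequence coverage, as introduced with pyteomics.parser.coverage() in this release, can be understood as the fraction of protein residues that fall inside at least one identified peptide. The sketch below shows that idea for plain, unmodified sequences only; coverage_sketch is a hypothetical name and the code is not taken from Pyteomics.

```python
def coverage_sketch(protein, peptides):
    """Fraction of residues in `protein` covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        start = protein.find(peptide)
        while start != -1:
            # Mark every residue of this occurrence as covered.
            for k in range(start, start + len(peptide)):
                covered[k] = True
            # Look for further (possibly overlapping) occurrences.
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


value = coverage_sketch('PEPTIDEKPEPK', ['PEP', 'DEK'])
```

Overlapping peptides are counted once per residue, so the result never exceeds 1.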

2.3.0

  • New parameter “read_schema” in read() functions of XML parsing modules. When set to False, disables the attempts to fetch an auxiliary file and obtain structure information about the file being parsed.
  • New function chain() in all modules that have a read() function, for convenient chaining of multiple files. chain() only works as a context manager. Use itertools.chain() in other cases. The chain.from_iterable form is also available as a context manager.
  • New function pyteomics.auxiliary.print_tree() for exploration of complex nested dicts produced by XML parsers.
  • New sets of retention coefficients in pyteomics.achrom.
  • Bugfix in pyteomics.pepxml. The bug caused an exception when parsing some pepXML files.
  • The output of pyteomics.mgf.read() now always contains a masked array of charges.
  • Other minor fixes.
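The kind of exploration that print_tree() is meant for can be approximated with a few lines of recursion. The sketch below is not the Pyteomics function: tree_lines, its indent string and the example dict are made up for illustration. It yields one line per key, indented by nesting depth, and descends into dicts stored inside lists, as XML parsers often produce.

```python
def tree_lines(d, indent=' -> ', level=0):
    """Yield one line per key of a nested dict, indented by depth."""
    for key, value in d.items():
        yield indent * level + str(key)
        if isinstance(value, dict):
            yield from tree_lines(value, indent, level + 1)
        elif isinstance(value, list):
            # Descend into dicts stored inside lists.
            for item in value:
                if isinstance(item, dict):
                    yield from tree_lines(item, indent, level + 1)


# Made-up structure, only loosely resembling parser output.
record = {'spectrum': {'id': 'scan=1', 'precursor': [{'charge': 2}]}}
lines = list(tree_lines(record))
print('\n'.join(lines))
```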
API change
  • In pyteomics.mgf.read() the precursor charge is now always represented by a list of ints (a ChargeList object).

2.2.2

2.2.1

  • Update parsers for FASTA headers.
  • NamedTuple for FASTA entries is now defined globally, which should solve pickling problems.

2.2.0

  • New module pyteomics.tandem for reading output files of X!Tandem search engine.

2.1.6

  • Fix in pyteomics.pepxml. pepXML files generated by TPP are now processed without errors.

2.1.5

  • Fix in pyteomics.pepxml. ‘modified_peptide’ is now always available.
  • Fix in pyteomics.mass (issue #2 in the bug tracker).
  • Improved arithmetic for Composition objects.

2.1.4

  • In fasta, decoy_db() now doesn’t write to file, but returns an iterator over FASTA records. The old decoy_db() is now called write_decoy_db(), which is equivalent to decoy_db() combined with write().
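The split between a lazy decoy generator and a separate writing step can be sketched as follows. This is a simplified illustration, not Pyteomics code: decoy_records, write_records, the 'DECOY_' prefix and the example entry are assumptions, and only sequence reversal is shown.

```python
import io


def decoy_records(records, prefix='DECOY_'):
    """Lazily yield (description, sequence) decoys with reversed sequences."""
    for description, sequence in records:
        yield prefix + description, sequence[::-1]


def write_records(records, output):
    """Write (description, sequence) pairs to a file-like object as FASTA."""
    for description, sequence in records:
        output.write('>' + description + '\n' + sequence + '\n')


# Made-up target entry.
targets = [('sp|P00001|EXAMPLE made-up entry', 'MKVLAT')]
buffer = io.StringIO()
# Combining the two steps mirrors "decoy_db() combined with write()".
write_records(decoy_records(targets), buffer)
```

Because the generator produces records one at a time, decoys can be filtered or inspected without ever being written to disk.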

Bugfixes:

  • In pyteomics.mgf.read(), the charges, if present, are returned as a masked array now. Previously, an exception occurred if charges were missing for some of the fragments.
  • Values in mass.nist_mass corrected.
  • Other minor corrections.

2.1.3

  • Adjust the behavior affected by the bug fixed in 2.1.2. name attributes of <cvParam> elements in the absence of value attributes are now collected in a list under the ‘name’ key.
  • Add support for overlapping matches in parser.cleave().

2.1.2

  • Bugfix in XML parsers. The bug caused the mzML parser to break on some files. The fix can slightly change the format of the output.

2.1.1

  • Rename keys in the dicts returned by mgf.read() to facilitate writing code working with both MGF and mzML.
  • The items yielded by fasta.read() now have attributes description and sequence.

2.1.0

  • New sets of retention coefficients in achrom.
  • mass.Composition now only stores non-zero ints.
  • fasta now has tools for parsing of FASTA headers.
  • File parsers now implement the context manager protocol. We recommend using with statements to avoid resource leaks.
API changes
  • ‘pepmass’ is now a tuple in the output of mgf.read() (to allow reading precursor intensities).
  • new function fasta.parse() for convenient parsing of FASTA headers.
  • fasta.std_parsers stores parsers for common UniProt header formats.
  • new parameter parser in fasta.read() allows header parsing to be applied while reading a FASTA file.
  • close parameter removed in all functions that do file I/O. The unified behavior is: if the parameter is a file object, it won’t be closed by the function. If a file path is given, the file object will be created and closed inside the corresponding function.
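The unified file-handling rule described in the last item can be expressed as a small context manager. This is a sketch of the behavior under stated assumptions, not the helper Pyteomics uses internally: open_source and read_lines are hypothetical names, and any object with a read or write attribute is treated as a file object.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_source(source, mode='r'):
    """Yield a file object for `source`, closing it only if we opened it."""
    if hasattr(source, 'read') or hasattr(source, 'write'):
        # The caller owns this file object: hand it back and leave it open.
        yield source
    else:
        # A path was given: open the file here and close it on exit.
        f = open(source, mode)
        try:
            yield f
        finally:
            f.close()


def read_lines(source):
    """Read all lines from a path or an already opened file object."""
    with open_source(source) as f:
        return [line.rstrip('\n') for line in f]


buffer = io.StringIO('>header\nPEPTIDE\n')
lines = read_lines(buffer)
still_open = not buffer.closed
```

With this pattern, a function never closes a resource it did not create, which is why a separate close parameter becomes unnecessary.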

2.0.3

  • Added new class pyteomics.mass.Unimod. The interface is experimental and may change.
  • Improved iterfind() function in XML-reading modules.
  • pyteomics.mass.Composition objects now support multiplication by int.
  • Bugfix in auxiliary.linear_regression().

2.0.2

2.0.1

API changes

2.0.0

  • Added mzid module for parsing of mzIdentML files.
  • Fixed bugs, improved tests.
API changes
  • top-module functions in fasta, mgf, mzml, pepxml, as well as mzid, are now called read().
  • in parser, parse_sequence() renamed to parse(). It now accepts an optional parameter allow_unknown_modifications.
  • mgf.write_mgf() and fasta.write_fasta() renamed to write().
  • the output format of all read() functions has changed.

1.2.5

1.2.4

  • Changes in pyteomics.mass.
API changes
  • Composition objects can be created using positional first argument, which will be treated as a sequence or (upon failure) as a formula. This means that all functions relying on Composition (calculate_mass(), most_probable_isotopic_composition(), isotopic_composition_abundance()) allow that as well. However, it’s of no use for the latter.
  • Composition entries for modifications can be added to aa_comp and used in composition and mass calculations. This way the specified group will be added to any residue bearing this modification.
  • That being said, the add_modifications() function is not needed anymore and has been removed.
  • Addition and subtraction of Composition objects now produce a Composition object, allowing addition/subtraction of multiple objects.
  • Composition is now a subclass of collections.defaultdict so one can safely retrieve values without checking if a key exists.
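The last two items can be illustrated with a toy class. CompositionSketch below is a hypothetical stand-in and not the pyteomics.mass.Composition implementation; it only demonstrates how subclassing collections.defaultdict gives zero defaults for missing elements, and how returning the same type from + and - lets several operations be chained.

```python
from collections import defaultdict


class CompositionSketch(defaultdict):
    """Element counts that default to zero and support + and -."""

    def __init__(self, *args, **kwargs):
        # int() returns 0, so missing elements read as zero.
        super().__init__(int, *args, **kwargs)

    def __add__(self, other):
        result = CompositionSketch(self)
        for element, count in other.items():
            result[element] += count
        return result

    def __sub__(self, other):
        result = CompositionSketch(self)
        for element, count in other.items():
            result[element] -= count
        return result


glycine = CompositionSketch(C=2, H=5, N=1, O=2)
water = CompositionSketch(H=2, O=1)
# Two glycines joined by a peptide bond lose one water molecule.
diglycine = glycine + glycine - water
```

Note that in this toy version, reading a missing element stores a zero entry in the dict, a side effect of defaultdict that a production class may want to avoid.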

1.2.3

API changes

1.2.2

  • Bugfix in pyteomics.pepxml: modification info is now extracted.
  • New optional boolean argument ‘split’ in pyteomics.parser.parse_sequence() makes the function return a list of tuples in which modifications are separated from the residues, instead of a regular list of labels. In labels, not only modX labels are now allowed, but also separate mod prefixes. Such modifications are assumed to be applicable to any residue.
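The difference between the two output shapes can be shown with a small regex-based sketch. It is a simplification under stated assumptions and not the Pyteomics parser: parse_sketch is a hypothetical name, terminal groups and label validation are ignored, and the tuple layout (one-element tuples for unmodified residues) is one possible representation.

```python
import re

# A label is any run of non-uppercase characters (the modification prefix)
# followed by a single uppercase letter (the residue).
_LABEL = re.compile(r'([^A-Z-]*)([A-Z])')


def parse_sketch(sequence, split=False):
    """Parse a modX-like string that has no terminal groups."""
    pairs = _LABEL.findall(sequence)
    if split:
        # Keep the modification and the residue as separate tuple items.
        return [(mod, res) if mod else (res,) for mod, res in pairs]
    return [mod + res for mod, res in pairs]


labels = parse_sketch('PEpTIDE')
tuples = parse_sketch('PEpTIDE', split=True)
```

With split=False the phosphorylated threonine stays a single label 'pT'; with split=True it becomes the pair ('p', 'T'), which makes it easy to look up the modification independently of the residue.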

1.2.1

  • Memory usage significantly decreased when parsing large mzML and pepXML files.

1.2.0

  • Added support for Python 3. Python 2.7 is still supported, Python 2.6 is not.

1.1.1

  • New function called add_modifications() added in pyteomics.mass. It updates aa_comp.
  • Also, pyteomics.parser.isoforms() is a new function to get all possible modified sequences of a peptide.

1.1.0

  • New module added - pyteomics.mgf. It is intended for reading and writing files in Mascot Generic Format.

1.0.2

  • In pyteomics.pepxml module, now all search hits are read from file (not only the top hit).
API changes:
  • pyteomics.pepxml.read(): information specific to search hits is now stored in a list under the 'search_hits' key. The list is sorted by hit rank.

1.0.1

1.0.0

  • The first public release of Pyteomics.
API changes:
