Welcome to Pyteomics tutorial!
What is Pyteomics?
Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis, such as:
- calculation of basic physico-chemical properties of polypeptides:
  - mass and isotopic distribution
  - charge and pI
  - chromatographic retention time
- access to common proteomics data:
  - MS or LC-MS data
  - FASTA databases
  - search engines output
- easy manipulation of sequences of modified peptides and proteins
The goal of the Pyteomics project is to provide a versatile, reliable and well-documented set of open tools for the wide proteomics community. One of the project’s key features is Python itself, an open source language increasingly popular in scientific programming. The main applications of the library are reproducible statistical data analysis and rapid software prototyping.
Citation
Pyteomics is distributed under Apache License version 2.0.
When using or redistributing Pyteomics, or parts of it, please cite the following papers:
Goloborodko, A.A.; Levitsky, L.I.; Ivanov, M.V.; and Gorshkov, M.V. (2013) “Pyteomics - a Python Framework for Exploratory Data Analysis and Rapid Software Prototyping in Proteomics”, Journal of The American Society for Mass Spectrometry, 24(2), 301–304. DOI: 10.1007/s13361-012-0516-6
Levitsky, L.I.; Klein, J.; Ivanov, M.V.; and Gorshkov, M.V. (2018) “Pyteomics 4.0: five years of development of a Python proteomics framework”, Journal of Proteome Research. DOI: 10.1021/acs.jproteome.8b00717
Useful Links
Pyteomics is hosted at the following sites:
- Python package @ Python Package Index: https://pypi.org/project/pyteomics/
- project documentation @ Read the Docs: https://pyteomics.readthedocs.io/
- source code @ GitHub: https://github.com/levitsky/pyteomics
- mailing list @ Google: https://groups.google.com/group/pyteomics/
Backup of old repo
Pyteomics source code used to be hosted on Bitbucket. An archive of issues and pull requests is stored at: https://levitsky.github.io/bitbucket_backup/#!/levitsky/pyteomics.
Pyteomics Extensions
Additional third-party packages extending the Pyteomics functionality can be installed separately:
- pyteomics.pepxmltk (pepXML file creation)
- pyteomics.biolccc (retention time prediction)
- pyteomics.cythonize (cythonized versions of the mass and parser modules)
Feedback & Support
Please email pyteomics@googlegroups.com with any questions about Pyteomics. You are welcome to use the GitHub issue tracker to report bugs, request features, etc.
Relation to other proteomics data analysis tools
Our goal is to create an infrastructure for proteomics data analysis within the Python ecosystem. Pyteomics is not a proteomic search engine, nor does it perform any data conversion. There are other tools for that. Pyteomics does not aim to substitute any of these, but rather to coexist with and complement them.
Introduction
This tutorial covers the basic Pyteomics functionality. For more details, please check the API reference. You can also access the API docstrings from the Python shell:
>>> from pyteomics.mass import calculate_mass
>>> help(calculate_mass)
IPython users can use the following shortcut:
>>> from pyteomics.mass import calculate_mass
>>> calculate_mass?
We expect the reader to be familiar with the basic Python syntax as well as proteomics concepts.
How to install Pyteomics
Supported Python versions
Pyteomics supports Python 2.7 and Python 3.3+.
Project dependencies
Pyteomics uses the following Python packages:
- numpy
- matplotlib (used by pyteomics.pylab_aux)
- lxml (used by XML parsing modules)
- pandas (can be used with pyteomics.pepxml, pyteomics.tandem, pyteomics.mzid, pyteomics.auxiliary)
- sqlalchemy (used by pyteomics.mass.unimod)
- pynumpress (adds support for Numpress compression)
All dependencies are optional.
GNU/Linux
The preferred way to obtain Pyteomics is via the pip Python package manager. The following shell commands install it on a freshly installed Ubuntu system:
sudo apt-get install python-setuptools python-dev build-essential
sudo easy_install pip
sudo pip install lxml numpy matplotlib pyteomics
Peptide sequence formats. Parser module
modX
Pyteomics uses a custom IUPAC-derived peptide sequence notation named modX. As in the IUPAC notation, each amino acid residue is represented by a capital letter, but it may be preceded by an arbitrary number of lowercase letters to show a modification. Terminal modifications are separated from the backbone sequence by a hyphen (‘-’). By default, both termini are assumed to be unmodified, which can be shown explicitly by ‘H-’ for N-terminal hydrogen and ‘-OH’ for C-terminal hydroxyl.
“H-HoxMMdaN-OH” is an example of a valid sequence in modX. See parser - operations on modX peptide sequences for additional information. Note that it is recommended to include either 0 or 2 terminal groups in a modX sequence.
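The notation rules above can be illustrated with a small regular-expression sketch. It is independent of Pyteomics: `split_modx` is a hypothetical helper written only for this illustration, and it assumes that terminal groups contain no hyphen other than the separating one.

```python
import re

# Hypothetical helper, not part of Pyteomics.
# Optional N-terminal group ("X-"), a backbone of residues, optional C-terminal group ("-X").
_MODX = re.compile(r'^([^-]+-)?((?:[a-z]*[A-Z])+)(-[^-]+)?$')
# One residue: any number of lowercase modification letters, then one capital letter.
_RESIDUE = re.compile(r'[a-z]*[A-Z]')


def split_modx(sequence):
    """Split a modX string into terminal groups and (modified) residues."""
    match = _MODX.match(sequence)
    if match is None:
        raise ValueError('not a valid modX sequence: {!r}'.format(sequence))
    n_term, backbone, c_term = match.groups()
    labels = _RESIDUE.findall(backbone)
    return ([n_term] if n_term else []) + labels + ([c_term] if c_term else [])


print(split_modx('H-HoxMMdaN-OH'))  # ['H-', 'H', 'oxM', 'M', 'daN', '-OH']
```

Unlike the library parser, this sketch does not check the labels against an allowed set.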
Sequence operations
There are two helper functions to check if a label is in modX format or represents a terminal modification: pyteomics.parser.is_modX() and pyteomics.parser.is_term_mod():
>>> from pyteomics import parser
>>> parser.is_modX('A')
True
>>> parser.is_modX('pT')
True
>>> parser.is_modX('pTx')
False
>>> parser.is_term_mod('pT')
False
>>> parser.is_term_mod('Ac-')
True
A modX sequence can be translated to a list of amino acid residues with the pyteomics.parser.parse() function:
>>> from pyteomics import parser
>>> parser.parse('PEPTIDE')
['P', 'E', 'P', 'T', 'I', 'D', 'E']
>>> parser.parse('PEPTIDE', show_unmodified_termini=True)
['H-', 'P', 'E', 'P', 'T', 'I', 'D', 'E', '-OH']
>>> parser.parse('Ac-PEpTIDE', labels=parser.std_labels+['Ac-', 'pT'])
['Ac-', 'P', 'E', 'pT', 'I', 'D', 'E']
In the last example we supplied two arguments, the sequence itself and ‘labels’. The latter is used to specify what labels are allowed for amino acid residues and terminal modifications. std_labels is a predefined set of labels for the twenty standard amino acids, ‘H-’ for N-terminal hydrogen and ‘-OH’ for C-terminal hydroxyl. In this example we specified the codes for phosphorylated threonine and N-terminal acetylation.
Since version 2.5, specifying labels is never mandatory. If this argument is not supplied, no checks will be made. However, the last example won’t work without labels, because it has only one terminal group shown, which is discouraged.
parse() has another mode, in which it returns tuples:
>>> parser.parse('Ac-PEpTIDE-OH', split=True)
[('Ac-', 'P'), ('E',), ('p', 'T'), ('I',), ('D',), ('E',)]
or:
>>> parser.parse('Ac-PEpTIDE-OH', split=True, labels=parser.std_labels+['Ac-', 'p'])
[('Ac-', 'P'), ('E',), ('p', 'T'), ('I',), ('D',), ('E',)]
Also, note what we supply as labels here: ‘p’ instead of ‘pT’. That means that ‘p’ is a modification applicable to any residue.
In modX, the standard len() function cannot be used to determine the length of a peptide because of the modifications. Use pyteomics.parser.length() instead:
>>> from pyteomics import parser
>>> parser.length('aVRILLaVIGNE')
10
The pyteomics.parser.amino_acid_composition() function accepts a sequence and returns a dictionary with amino acid labels as keys and the number of times each residue occurs in the sequence as values:
>>> from pyteomics import parser
>>> parser.amino_acid_composition('PEPTIDE')
{'I': 1.0, 'P': 2.0, 'E': 2.0, 'T': 1.0, 'D': 1.0}
pyteomics.parser.cleave() is a function to perform in silico cleavage. Its arguments are the sequence, the rule for enzyme specificity and, optionally, the number of missed cleavages allowed. cleave() returns a set of product peptides.
>>> from pyteomics import parser
>>> parser.cleave('AKAKBK', parser.expasy_rules['trypsin'], 0)
{'AK', 'BK'}
pyteomics.parser.expasy_rules is a predefined dict with the cleavage rules for the most common proteases.
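To make the idea of a cleavage rule and missed cleavages concrete, here is a minimal sketch that does not use Pyteomics. The function name `cleave_simple` and the simplified trypsin-like rule (cut after K or R unless followed by P) are assumptions for illustration; the rules in expasy_rules are more detailed.

```python
import re


def cleave_simple(sequence, rule=r'(?<=[KR])(?!P)', missed_cleavages=0):
    """Return the set of peptides produced by cutting at every rule match."""
    # The rule is a zero-width pattern; each match position is a cleavage site.
    inner = [m.start() for m in re.finditer(rule, sequence)
             if 0 < m.start() < len(sequence)]
    sites = [0] + inner + [len(sequence)]
    peptides = set()
    for i in range(len(sites) - 1):
        # Joining up to `missed_cleavages` extra neighbouring fragments
        last = min(i + missed_cleavages + 1, len(sites) - 1)
        for j in range(i + 1, last + 1):
            peptides.add(sequence[sites[i]:sites[j]])
    return peptides


print(sorted(cleave_simple('AKAKBK')))                      # ['AK', 'BK']
print(sorted(cleave_simple('AKAKBK', missed_cleavages=1)))  # ['AK', 'AKAK', 'AKBK', 'BK']
```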
All possible modified sequences of a peptide can be obtained with pyteomics.parser.isoforms():
>>> from pyteomics import parser
>>> forms = parser.isoforms('PEPTIDE', variable_mods={'p': ['T'], 'ox': ['P']})
>>> for seq in forms: print(seq)
...
oxPEPpTIDE
oxPEPTIDE
oxPEoxPpTIDE
oxPEoxPTIDE
PEPpTIDE
PEPTIDE
PEoxPpTIDE
PEoxPTIDE
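The enumeration above is a Cartesian product over the possible states of each residue. The following sketch shows that idea in plain Python; `isoforms_simple` is a hypothetical name, and the sketch only accepts unmodified one-letter sequences without terminal groups.

```python
from itertools import product


def isoforms_simple(sequence, variable_mods):
    """Enumerate all combinations of variable modifications on a plain sequence."""
    choices = []
    for aa in sequence:
        # Each residue can stay unmodified or carry any modification that targets it
        states = [aa] + [mod + aa for mod, targets in variable_mods.items()
                         if aa in targets]
        choices.append(states)
    return {''.join(combo) for combo in product(*choices)}


forms = isoforms_simple('PEPTIDE', {'p': ['T'], 'ox': ['P']})
print(len(forms))  # 8
```

The number of isoforms grows exponentially with the number of modifiable sites, so for long sequences it is worth limiting the number of modifications per peptide.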
Peptide properties: mass, charge, chromatographic retention
Mass and isotopes
The functions related to mass calculations and isotopic distributions are organized into the pyteomics.mass module.
Basic mass calculations
The most common task in mass spectrometry data analysis is to calculate the mass of an organic molecule or peptide, or the m/z ratio of an ion. Tasks of this kind can be performed with the pyteomics.mass.calculate_mass() function. It works with chemical formulas, polypeptide sequences in modX notation, pre-parsed sequences and dictionaries of chemical compositions:
>>> from pyteomics import mass
>>> mass.calculate_mass(formula='H2O')
18.0105646837036
>>> mass.calculate_mass(formula='C2H5OH')
46.0418648119876
>>> mass.calculate_mass(composition={'H':2, 'O':1})
18.0105646837036
>>> mass.calculate_mass(sequence='PEPTIDE')
799.359964027207
>>> from pyteomics import parser
>>> ps = parser.parse('PEPTIDE', show_unmodified_termini=True)
>>> mass.calculate_mass(parsed_sequence=ps)
799.359964027207
Warning
Always set show_unmodified_termini=True when parsing a sequence if you want to use the result to calculate the mass. Otherwise, the mass of the terminal hydrogen and hydroxyl will not be taken into account.
Mass-to-charge ratio of ions
pyteomics.mass.calculate_mass() can be used to calculate the mass/charge ratio of peptide ions and ionized fragments. To do that, simply supply the type of the ionized peptide fragment and its charge:
>>> from pyteomics import mass
>>> mass.calculate_mass(sequence='PEPTIDE', ion_type='M', charge=2)
400.6872584803735
>>> mass.calculate_mass(sequence='PEP', ion_type='b', charge=1)
324.15539725264904
>>> mass.calculate_mass(sequence='TIDE', ion_type='y', charge=1)
477.219119708098
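For the whole-peptide ion (ion_type='M'), the m/z value follows from the neutral mass by adding one proton per charge and dividing by the charge. The sketch below illustrates this arithmetic; the helper name is hypothetical and the proton mass constant is an assumed value, not read from Pyteomics.

```python
PROTON_MASS = 1.00727646677  # Da, assumed value of the proton mass


def mz_from_neutral_mass(neutral_mass, charge):
    """m/z of an ion formed by attaching `charge` protons to a neutral molecule."""
    return (neutral_mass + charge * PROTON_MASS) / charge


# Neutral monoisotopic mass of PEPTIDE as computed earlier in this section
print(mz_from_neutral_mass(799.359964027207, 2))
```

With these inputs the result agrees with the value shown above for ion_type='M' and charge=2 to within rounding error.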
Mass of modified peptides
With pyteomics.mass.calculate_mass() you can calculate masses of modified peptides as well. For the function to recognize the modified residue, you need to add the information about its elemental composition to the pyteomics.mass.std_aa_comp dictionary used in the calculations by default.
>>> from pyteomics import mass
>>> mass.std_aa_comp['pT'] = mass.Composition(
... {'C': 4, 'H': 8, 'N': 1, 'O': 5, 'P': 1})
>>> mass.calculate_mass(sequence='PEPpTIDE')
879.3262945499629
To add information about modified amino acids to a user-defined aa_comp dict, one can either add the composition info for a specific modified residue or just for a modification:
>>> from pyteomics import mass
>>> aa_comp = dict(mass.std_aa_comp)
>>> aa_comp['p'] = mass.Composition('HPO3')
>>> mass.calculate_mass('pT', aa_comp=aa_comp)
199.02457367493957
In this example we call calculate_mass() with a positional (non-keyword) argument (‘pT’). This feature was added in version 1.2.4. When you provide a non-keyword argument, it will be treated as a sequence; if it fails, it will be treated as a formula; in case it fails as well, a PyteomicsError will be raised.
Note that ‘pT’ is treated as a sequence here, so default terminal groups are implied when calculating the composition and mass:
>>> mass.calculate_mass('pT', aa_comp=aa_comp) == mass.calculate_mass(aa_comp['p']) + mass.calculate_mass(aa_comp['T']) + mass.calculate_mass('H2O')
True
You can create a specific entry for a modified amino acid to override the modification on a specific residue:
>>> aa_comp['pT'] = mass.Composition({'N': 2})
>>> mass.Composition('pT', aa_comp=aa_comp)
{'H': 2, 'O': 1, 'N': 2}
>>> mass.Composition('pS', aa_comp=aa_comp)
{'H': 8, 'C': 3, 'N': 1, 'O': 6, 'P': 1}
The Unimod database is an excellent resource for information on the chemical compositions of known protein modifications. Version 2.0.3 introduced the pyteomics.mass.Unimod class that can serve as a Python interface to Unimod:
>>> db = mass.Unimod()
>>> aa_comp = dict(mass.std_aa_comp)
>>> aa_comp['p'] = db.by_title('Phospho')['composition']
>>> mass.calculate_mass('PEpTIDE', aa_comp=aa_comp)
782.2735307010443
Chemical compositions
Some problems in organic mass spectrometry deal with molecules made by addition or subtraction of standard chemical ‘building blocks’. In pyteomics.mass there are two ways to approach these problems.
- There is a pyteomics.mass.Composition class intended to store chemical formulas. pyteomics.mass.Composition objects are dicts that can be added to or subtracted from one another or multiplied by integers.
  >>> from pyteomics import mass
  >>> p = mass.Composition(formula='HO3P') # Phosphate group
  >>> p
  Composition({'H': 1, 'O': 3, 'P': 1})
  >>> mass.std_aa_comp['T']
  Composition({'C': 4, 'H': 7, 'N': 1, 'O': 2})
  >>> p + mass.std_aa_comp['T']
  Composition({'C': 4, 'H': 8, 'N': 1, 'O': 5, 'P': 1})
  The values of pyteomics.mass.std_aa_comp are pyteomics.mass.Composition objects.
- All functions that accept a formula keyword argument sum and subtract numbers following the same atom in the formula:
  >>> from pyteomics import mass
  >>> mass.calculate_mass(formula='C2H6') # Ethane
  30.046950192426
  >>> mass.calculate_mass(formula='C2H6H-2') # Ethylene
  28.031300128284002
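The way repeated atoms and negative counts are combined in a formula can be sketched without Pyteomics as follows. The function names and the small table of monoisotopic masses are assumptions for illustration, and isotope labels in brackets are not handled here.

```python
import re
from collections import Counter

# Assumed monoisotopic masses (Da) for a few elements, for illustration only
MONOISOTOPIC = {'H': 1.00782503207, 'C': 12.0,
                'N': 14.0030740048, 'O': 15.99491461956}


def parse_formula(formula):
    """Sum the (possibly negative) counts that follow each atom symbol."""
    counts = Counter()
    for element, number in re.findall(r'([A-Z][a-z]?)(-?\d+)?', formula):
        counts[element] += int(number) if number else 1
    # Drop elements whose counts cancel out
    return {element: n for element, n in counts.items() if n != 0}


def formula_mass(formula):
    """Monoisotopic mass of a formula built from the elements in MONOISOTOPIC."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


print(parse_formula('C2H6H-2'))  # {'C': 2, 'H': 4}
print(formula_mass('C2H6H-2'))   # close to the ethylene mass shown above
```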
Faster mass calculations
While pyteomics.mass.calculate_mass() has a flexible and convenient interface, it may be too slow for large-scale calculations. There is an optimized and simplified version of this function named pyteomics.mass.fast_mass(). It works only with unmodified sequences in standard one-letter IUPAC notation. Like pyteomics.mass.calculate_mass(), pyteomics.mass.fast_mass() can calculate m/z when provided with ion type and charge. Amino acid masses can be specified via the aa_mass argument.
>>> from pyteomics import mass
>>> mass.fast_mass('PEPTIDE')
799.3599446837036
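The reason a residue-mass lookup is fast is that the peptide mass reduces to a sum of precomputed numbers plus one water molecule for the termini. A rough sketch of that idea, with an assumed table of monoisotopic residue masses rounded to five decimals (not the values Pyteomics uses internally):

```python
# Assumed monoisotopic residue masses (Da), rounded to five decimals
RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
WATER_MASS = 18.01056  # H on the N-terminus plus OH on the C-terminus


def simple_peptide_mass(sequence, aa_mass=RESIDUE_MASS):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(aa_mass[aa] for aa in sequence) + WATER_MASS


print(round(simple_peptide_mass('PEPTIDE'), 3))  # 799.36
```

Passing a different `aa_mass` mapping plays the same role as the aa_mass argument mentioned above.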
If you need to calculate the mass or m/z for a peptide with modifications and/or non-standard terminal groups, but don’t want to specify all compositions, you can also use the pyteomics.mass.fast_mass2() function. It uses aa_mass the same way as fast_mass(), but has full modX support:
>>> mass.fast_mass2('H-PEPTIDE-OH')
799.3599446837036
Isotopes
If not specified, pyteomics.mass assumes that the substances are in the pure isotopic state. However, you may specify a particular isotopic state in brackets (e.g. O[18], N[15]) in a chemical formula. An element with an unspecified isotopic state is assumed to have the mass of the most stable isotope and an abundance of 100%.
>>> mass.calculate_mass(formula='H[2]2O') # Heavy water
20.0231181752416
>>> mass.calculate_mass(formula='H[2]HO') # Semiheavy water
19.0168414294726
The pyteomics.mass.isotopic_composition_abundance() function calculates the relative abundance of a given isotopic state of a molecule. The input can be provided as a formula or as a Composition/dict.
>>> from pyteomics import mass
>>> mass.isotopic_composition_abundance(formula='H2O') # Water with an unspecified isotopic state
1.0
>>> mass.isotopic_composition_abundance(formula='H[2]2O') # Heavy water
1.3386489999999999e-08
>>> mass.isotopic_composition_abundance(formula='H[2]H[1]O') # Semiheavy water
0.0002313727050147582
>>> mass.isotopic_composition_abundance(composition={'H[2]': 1, 'H[1]': 1, 'O': 1}) # Semiheavy water
0.0002313727050147582
>>> mass.isotopic_composition_abundance(formula='H[2]2O[18]') # Heavy-hydrogen heavy-oxygen water
2.7461045585999998e-11
Warning
You cannot mix specified and unspecified states of the same element in one formula in pyteomics.mass.isotopic_composition_abundance() due to ambiguity.
>>> mass.isotopic_composition_abundance(formula='H[2]HO')
...
PyteomicsError: Pyteomics error, message: 'Please specify the isotopic states of all atoms of H or do not specify them at all.'
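The abundance of a fully specified isotopic state can be understood as a product of multinomial probabilities, one per element. The sketch below shows that calculation in plain Python. The function name, the input layout and the natural abundance values are assumptions for illustration, so the numbers are not expected to reproduce the Pyteomics output digit for digit.

```python
from math import factorial

# Assumed natural abundances, for illustration only
ABUNDANCE = {
    'H': {1: 0.999885, 2: 0.000115},
    'O': {16: 0.99757, 17: 0.00038, 18: 0.00205},
}


def isotopic_state_abundance(state):
    """Probability of an isotopic state.

    `state` maps an element symbol to {mass number: number of atoms}.
    """
    total = 1.0
    for element, isotopes in state.items():
        n_atoms = sum(isotopes.values())
        coefficient = factorial(n_atoms)  # becomes the multinomial coefficient
        probability = 1.0
        for mass_number, count in isotopes.items():
            coefficient //= factorial(count)
            probability *= ABUNDANCE[element][mass_number] ** count
        total *= coefficient * probability
    return total


# Hydrogens of semiheavy water: one H[1] and one H[2], in either order
print(isotopic_state_abundance({'H': {1: 1, 2: 1}}))
```

Elements left out of `state` contribute a factor of one, which mirrors the convention that an unspecified isotopic state has an abundance of 100%.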
Finally, you can find the most probable isotopic composition for a substance with the pyteomics.mass.most_probable_isotopic_composition() function. The substance is specified as a formula, a pyteomics.mass.Composition object or a modX sequence string.
>>> from pyteomics import mass
>>> mass.most_probable_isotopic_composition(formula='H2SO4')
Composition({'H[1]': 2.0, 'H[2]': 0.0, 'O[16]': 4.0, 'O[17]': 0.0, 'S[32]': 1.0, 'S[33]': 0.0})
>>> mass.most_probable_isotopic_composition(formula='C300H602')
Composition({'C[12]': 297.0, 'C[13]': 3.0, 'H[1]': 602.0, 'H[2]': 0.0})
>>> mass.most_probable_isotopic_composition(sequence='PEPTIDE'*100)
Composition({'C[12]': 3364.0, 'C[13]': 36.0, 'H[1]': 5102.0, 'H[2]': 0.0, 'N[14]': 698.0, 'N[15]': 2.0, 'O[16]': 398.0, 'O[17]': 3.0})
The information about chemical elements, their isotopes and relative abundances is stored in the pyteomics.mass.nist_mass dictionary.
>>> from pyteomics import mass
>>> print(mass.nist_mass['C'])
{0: (12.0, 1.0), 12: (12.0, 0.98938), 13: (13.0033548378, 0.01078), 14: (14.0032419894, 0.0)}
The zero key stands for the unspecified isotopic state. The data about isotopes are stored as tuples (accurate mass, relative abundance).
Charge and pI
Electrochemical properties of polypeptides can be assessed via the pyteomics.electrochem module. For now, it allows calculating:
- the charge of a polypeptide molecule at a given pH;
- the isoelectric point.
The pyteomics.electrochem module is based on the Henderson-Hasselbalch equation.
Examples
Both functions in the module accept input in the form of a modX sequence, a parsed sequence or a dict with amino acid composition.
>>> from pyteomics import electrochem
>>> electrochem.charge('PEPTIDE', 7)
-2.9980189709606284
>>> from pyteomics import parser
>>> parsed_seq = parser.parse('PEPTIDE', show_unmodified_termini=True)
>>> electrochem.charge(parsed_seq, 7)
-2.9980189709606284
>>> aa_composition = parser.amino_acid_composition('PEPTIDE', show_unmodified_termini=True)
>>> electrochem.charge(aa_composition, 7)
-2.9980189709606284
>>> electrochem.pI('PEPTIDE')
2.87451171875
>>> electrochem.pI('PEPTIDE', precision_pI=0.0001)
2.876354217529297
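To show how the Henderson-Hasselbalch equation leads to numbers like these, here is a self-contained sketch. It is not the library code: the function names are hypothetical, the pKa table holds assumed textbook-style values, and only plain one-letter sequences with free termini are supported.

```python
# Assumed pKa values: label -> (pKa, charge of the ionized form)
PK = {
    'H-': (9.69, +1), '-OH': (2.34, -1),
    'D': (3.65, -1), 'E': (4.25, -1), 'C': (8.33, -1), 'Y': (10.0, -1),
    'H': (6.0, +1), 'K': (10.5, +1), 'R': (12.4, +1),
}


def simple_charge(sequence, pH):
    """Net charge of an unmodified peptide with free termini at a given pH."""
    groups = ['H-'] + [aa for aa in sequence if aa in PK] + ['-OH']
    total = 0.0
    for group in groups:
        pk, sign = PK[group]
        # Fraction of the group that is ionized, times the sign of its charge
        total += sign / (1.0 + 10 ** (sign * (pH - pk)))
    return total


def simple_pI(sequence, precision=0.001):
    """Find the pH of zero net charge by bisection (charge decreases with pH)."""
    low, high = 0.0, 14.0
    while high - low > precision:
        middle = (low + high) / 2
        if simple_charge(sequence, middle) > 0:
            low = middle
        else:
            high = middle
    return (low + high) / 2


print(round(simple_charge('PEPTIDE', 7), 3))  # -2.998
print(round(simple_pI('PEPTIDE'), 2))         # 2.88
```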
Customization
The pKas of individual amino acids are stored in dicts in the following format: {modX label : (pKa, charge)}. The module contains several datasets published in scientific journals: pyteomics.electrochem.pK_lehninger (used by default), pyteomics.electrochem.pK_sillero, pyteomics.electrochem.pK_dawson, pyteomics.electrochem.pK_rodwell.
Retention time prediction
Pyteomics has two modules for prediction of retention times (RTs) of peptides and proteins in liquid chromatography.
BioLCCC
The first module is pyteomics.biolccc. This module implements the BioLCCC model of liquid chromatography of polypeptides. pyteomics.biolccc is not distributed with the main package and has to be installed separately. pyteomics.biolccc can be downloaded from http://pypi.python.org/pypi/pyteomics.biolccc, and the project documentation is hosted at http://theorchromo.ru/docs.
Additive model of peptide chromatography
Another option for retention time prediction is the pyteomics.achrom module distributed with Pyteomics. It implements the additive model of polypeptide chromatography. Briefly, in the additive model each amino acid residue changes retention time by a fixed value, depending only on its type (e.g. an alanine residue adds 2.0 min to RT, while an arginine decreases it by 1.1 min). The module documentation contains the complete description of this model and the references. In this tutorial we will focus on the basic usage.
Retention time prediction
Retention time prediction with pyteomics.achrom is done by the pyteomics.achrom.calculate_RT() function:
>>> from pyteomics import achrom
>>> achrom.calculate_RT('PEPTIDE', achrom.RCs_guo_ph7_0)
7.8000000000000025
The first argument of the function is the sequence of a peptide in modX notation. The second argument is the set of parameters called ‘retention coefficients’, which describe chromatographic properties of individual amino acid residues in a polypeptide chain. pyteomics.achrom has a number of predefined sets of retention coefficients obtained from publications. The list, detailed descriptions and references related to these sets can be found in the module documentation.
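The additive model itself fits in a few lines. The sketch below uses the illustrative coefficients mentioned earlier (alanine +2.0 min, arginine -1.1 min); the function name and the optional free term are assumptions, and the real coefficient sets shipped with the module have more entries.

```python
def additive_rt(residues, retention_coefficients, free_term=0.0):
    """Retention time as the sum of per-residue contributions plus a constant."""
    return sum(retention_coefficients[aa] for aa in residues) + free_term


# Illustrative coefficients in minutes, taken from the example in the text
example_rcs = {'A': 2.0, 'R': -1.1}
print(additive_rt('AAR', example_rcs))  # two alanines and one arginine
```

Because each residue contributes independently, peptides with the same composition get the same predicted retention time regardless of residue order.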
Calibration
The main advantage of the additive model is that it gives more accurate predictions if adjusted to specific chromatographic setups and conditions. This adjustment, or ‘calibration’, requires a set of known peptide sequences and corresponding retention times (a ‘training set’) and returns a set of new retention coefficients. The following code illustrates the calibration procedure in Pyteomics.
>>> from pyteomics import achrom
>>> RCs = achrom.get_RCs(sequences, RTs)
>>> achrom.calculate_RT('PEPTIDE', RCs)
The first argument of pyteomics.achrom.get_RCs() should be a list of modX sequences, the second a list of floating-point retention times. As in pyteomics.parser.parse(), all non-standard modX amino acid labels used in the training set should be supplied to the labels keyword argument of pyteomics.achrom.get_RCs() along with the standard ones:
>>> RCs = achrom.get_RCs(sequences, RTs, labels=achrom.std_labels + ['pS', 'pT'])
Advanced calibration
The standard additive model allows a couple of improvements. Firstly, an explicit dependency on the length of a peptide may be introduced by multiplying the retention time by $(1 + m \ln L)$, where L is the number of amino acid residues in the peptide and m is the length correction parameter, typically ~ -0.2. The value of the length correction parameter is set at the calibration and stored along with the retention coefficients. By default, length correction is enabled in pyteomics.achrom.get_RCs() and the parameter equals -0.21. You can change the value of the length correction parameter by supplying the ‘lcp’ keyword argument, or you can disable length correction completely by setting lcp=0:
>>> RCs = achrom.get_RCs(sequences, RTs, lcp=-0.18) # A new value of the length correction parameter
>>> RCs = achrom.get_RCs(sequences, RTs, lcp=0) # Disable length correction.
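The effect of the length correction can be sketched as follows, assuming the correction factor has the form 1 + m ln L and that a constant free term, if any, is added after the correction. Both the function name and these two assumptions are illustrative and should be checked against the module documentation.

```python
from math import log


def length_corrected_rt(residues, retention_coefficients, lcp=-0.21, free_term=0.0):
    """Additive retention time scaled by an assumed (1 + lcp * ln L) factor."""
    additive_part = sum(retention_coefficients[aa] for aa in residues)
    correction = 1.0 + lcp * log(len(residues))
    return correction * additive_part + free_term


rcs = {'A': 2.0, 'R': -1.1}  # illustrative coefficients, in minutes
print(length_corrected_rt('AAR', rcs, lcp=0))      # plain additive model
print(length_corrected_rt('AAR', rcs, lcp=-0.21))  # shorter RT for the same sum
```

With a negative parameter, the correction grows with peptide length, so long peptides are predicted to elute earlier than the plain sum of coefficients suggests.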
Another considerable improvement over the standard additive model is to treat terminal amino acid residues as separate chemical entities. This behavior is disabled by default, but can be enabled by setting term_aa=True:
>>> RCs = achrom.get_RCs(sequences, RTs, term_aa=True)
This correction is implemented by addition of the ‘nterm’ and ‘cterm’ prefixes to the labels of terminal amino acid residues of the training peptides. In order for this correction to work, the training peptides should represent all possible variations of terminal amino acid residues.
Data Access
The following section is dedicated to data manipulation. Pyteomics aims to support the most common formats of (LC-)MS/MS data, peptide identification results and protein databases.
General Notes
- Each module mentioned below corresponds to a file format. In each module, the top-level function read() allows iteration over entries in a file. It works like the built-in open(), allowing direct iteration and supporting the with syntax, which we recommend using. So you can do:
  >>> from pyteomics import mgf
  >>> reader = mgf.read('tests/test.mgf')
  >>> for spectrum in reader:
  ...     ...
  >>> reader.close()
  … but it is recommended to do:
  >>> from pyteomics import mgf
  >>> with mgf.read('tests/test.mgf') as reader:
  ...     for spectrum in reader:
  ...         ...
- Additionally, most modules provide one or several classes which implement different parsing modes, e.g. pyteomics.mgf.MGF and pyteomics.mgf.IndexedMGF. Indexed parsers build an index of file entries and thus allow random access in addition to iteration. See Indexed Parsers for a detailed description and examples.
- Apart from read(), which reads just one file, all modules described here have functions for reading multiple files: chain() and chain.from_iterable(). chain('f1', 'f2') is equivalent to chain.from_iterable(['f1', 'f2']). chain() and chain.from_iterable() only support the with syntax. If you don’t want to use the with syntax, you can just use the itertools functions chain() and chain.from_iterable().
- Throughout this section we use pyteomics.auxiliary.print_tree() to display the structure of the data returned by various parsers. Replace this call with the actual processing that you need to perform on your files.
Text-based formats
MGF
Mascot Generic Format (MGF) is a simple human-readable format for MS/MS data. It allows storing MS/MS peak lists and experimental parameters. pyteomics.mgf is a module that implements reading and writing MGF files.
Reading
The pyteomics.mgf.read() function allows iterating through spectrum entries. Spectra are represented as dicts. By default, MS/MS peak lists are stored as numpy.ndarray objects under the m/z array and intensity array keys. Fragment charges will be stored in a masked array under the charge array key. Parameters are stored as a dict under the params key.
Here is an example of use:
>>> from pyteomics import mgf, auxiliary
>>> with mgf.read('tests/test.mgf') as reader:
...     auxiliary.print_tree(next(reader))
m/z array
params
-> username
-> useremail
-> mods
-> pepmass
-> title
-> itol
-> charge
-> mass
-> itolu
-> it_mods
-> com
intensity array
charge array
To speed up parsing, or if you want to avoid using numpy, you can tweak the behaviour of pyteomics.mgf.read() with parameters convert_arrays and read_charges.
Reading file headers
Also, pyteomics.mgf allows extracting headers with general parameters from MGF files with the pyteomics.mgf.read_header() function. It also returns a dict.
>>> header = mgf.read_header('tests/test.mgf')
>>> auxiliary.print_tree(header)
itolu
itol
username
com
useremail
it_mods
charge
mods
mass
Class-based interface
Since version 3.4.3, MGF parsing functionality is encapsulated in a class: pyteomics.mgf.MGF. This class can be used for:
- sequential parsing of the file (the same as read()):
  >>> with mgf.MGF('tests/test.mgf') as reader:
  ...     for spectrum in reader:
  ...         ...
- accessing the file header (the same as read_header()):
  >>> f = mgf.MGF('tests/test.mgf')
  >>> f.header
  {'charge': [2, 3],
   'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
   'it_mods': 'Oxidation (M)',
   'itol': '1',
   'itolu': 'Da',
   'mass': 'Monoisotopic',
   'mods': 'Carbamidomethyl (C)',
   'useremail': 'leu@altered-state.edu',
   'username': 'Lou Scene'}
- direct access to spectra by title (the same as get_spectrum()):
  >>> f = mgf.MGF('tests/test.mgf')
  >>> f['Spectrum 2']
  {'charge array': masked_array(data = [3 2 1 1 1 1], mask = False, fill_value = 0),
   'intensity array': array([ 237., 128., 108., 1007., 974., 79.]),
   'm/z array': array([ 345.1, 370.2, 460.2, 1673.3, 1674. , 1675.3]),
   'params': {'charge': [2, 3],
    'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
    'it_mods': 'Oxidation (M)',
    'itol': '1',
    'itolu': 'Da',
    'mass': 'Monoisotopic',
    'mods': 'Carbamidomethyl (C)',
    'pepmass': (1084.9, 1234.0),
    'rtinseconds': '25',
    'scans': '3',
    'title': 'Spectrum 2',
    'useremail': 'leu@altered-state.edu',
    'username': 'Lou Scene'}}
Note
MGF’s support for direct indexing is rudimentary, because it does not in fact keep an index and has to search through the file line-wise on every call. pyteomics.mgf.IndexedMGF is designed for random access and more (see Indexed Parsers for details).
Writing
Creation of MGF files is implemented in the pyteomics.mgf.write() function. The user can specify the header, an iterable of spectra in the same format as returned by read(), and the output path.
>>> spectra = mgf.read('tests/test.mgf')
>>> mgf.write(spectra=spectra, header=header)
USERNAME=Lou Scene
ITOL=1
USEREMAIL=leu@altered-state.edu
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
CHARGE=2+ and 3+
MASS=Monoisotopic
ITOLU=Da
COM=Taken from http://www.matrixscience.com/help/data_file_help.html
BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.6 73.0
846.8 44.0
847.6 67.0
1640.1 291.0
1640.6 54.0
1895.5 49.0
END IONS
BEGIN IONS
TITLE=Spectrum 2
RTINSECONDS=25
PEPMASS=1084.9
SCANS=3
345.1 237.0
370.2 128.0
460.2 108.0
1673.3 1007.0
1674.0 974.0
1675.3 79.0
END IONS
MS1 and MS2
MS1 and MS2 are simple human-readable formats for MS1 and MSn data. They allow storing peak lists and experimental parameters. Because the MS1 and MS2 formats are quite similar to MGF, the corresponding modules (pyteomics.ms1 and pyteomics.ms2) provide the same functions and classes with very similar signatures for reading headers and spectra from files.
Writing is not supported at this time.
FASTA
FASTA is a common format for protein sequence databases.
Reading
To extract data from FASTA databases, use the pyteomics.fasta.read() function.
>>> from pyteomics import fasta
>>> with fasta.read('/path/to/file/my.fasta') as db:
...     for entry in db:
...         ...
Just like other parsers in Pyteomics, pyteomics.fasta.read() returns a generator object instead of a list to prevent excessive memory use. The generator yields (description, sequence) tuples, so it’s natural to use it as follows:
>>> with fasta.read('/path/to/file/my.fasta') as db:
...     for descr, seq in db:
...         ...
You can also use attributes to access description and sequence:
>>> with fasta.read('my.fasta') as reader:
...     descriptions = [item.description for item in reader]
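The generator-based design can be illustrated with a minimal FASTA reader written in plain Python. This is a sketch for understanding only; `read_fasta_simple` is a hypothetical name, and unlike the library reader it yields plain tuples and performs no header parsing.

```python
import io


def read_fasta_simple(handle):
    """Lazily yield (description, sequence) tuples from a FASTA-formatted handle."""
    description, chunks = None, []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous entry, if there was one
            if description is not None:
                yield description, ''.join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, ''.join(chunks)


example = io.StringIO('>Protein 1\nPEPTIDE\nPEPTIDE\n>Protein 2\nEDITPEP\n')
for descr, seq in read_fasta_simple(example):
    print(descr, len(seq))
```

Only one entry is held in memory at a time, which is the reason for returning a generator rather than a list.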
Description parsing
You can specify a function that will be applied to the FASTA headers for your convenience. pyteomics.fasta.std_parsers has some pre-defined parsers that can be used for this purpose.
>>> with fasta.read('HUMAN.fasta', parser=fasta.std_parsers['uniprot']) as r:
...     print(next(r).description)
{'PE': 2, 'gene_id': 'LCE6A', 'GN': 'LCE6A', 'id': 'A0A183', 'taxon': 'HUMAN',
'SV': 1, 'OS': 'Homo sapiens', 'entry': 'LCE6A_HUMAN',
'name': 'Late cornified envelope protein 6A', 'db': 'sp'}
or try guessing the header format:
>>> with fasta.read('HUMAN.fasta', parser=fasta.parse) as r:
...     print(next(r).description)
{'PE': 2, 'gene_id': 'LCE6A', 'GN': 'LCE6A', 'id': 'A0A183', 'taxon': 'HUMAN',
'SV': 1, 'OS': 'Homo sapiens', 'entry': 'LCE6A_HUMAN',
'name': 'Late cornified envelope protein 6A', 'db': 'sp'}
Class-based interface
The pyteomics.fasta.FASTA class is available for text-based (old style) parsing (the same as shown with read() above). Also, the new binary-mode, indexed parser, pyteomics.fasta.IndexedFASTA, implements all the perks of the Indexed Parsers. Both classes also have a number of flavor-specific subclasses that implement header parsing. Additionally, flavored indexed parsers allow accessing the protein entries by the extracted ID field, while the regular pyteomics.fasta.IndexedFASTA uses the full description string for identification:
In [1]: from pyteomics import fasta
In [2]: db = fasta.IndexedUniProt('sprot_human.fasta') # A SwissProt database
In [3]: len(db['Q8IYH5'].sequence)
Out[3]: 903
In [4]: db['Q8IYH5'] == db['sp|Q8IYH5|ZZZ3_HUMAN ZZ-type zinc finger-containing protein 3 OS=Homo sapiens GN=ZZZ3 PE=1 SV=1']
Out[4]: True
Writing
You can also create a FASTA file using a sequence of (description, sequence) tuples.
>>> entries = [('Protein 1', 'PEPTIDE'*1000), ('Protein 2', 'PEPTIDE'*2000)]
>>> fasta.write(entries, 'target-file.fasta')
Decoy databases
Another common task is to generate a decoy database. Pyteomics allows that by means of the pyteomics.fasta.decoy_db() and pyteomics.fasta.write_decoy_db() functions.
>>> fasta.write_decoy_db('mydb.fasta', 'mydb-with-decoy.fasta')
The only required argument is the first one, indicating the source database. The second argument is the target file and defaults to system standard output.
If you need to modify a single sequence, use the pyteomics.fasta.decoy_sequence() function. It supports three modes: 'reverse', 'shuffle', and 'fused' (see pyteomics.fasta.reverse(), pyteomics.fasta.shuffle() and pyteomics.fasta.fused_decoy() for documentation).
>>> fasta.decoy_sequence('PEPTIDE', 'reverse')
'EDITPEP'
>>> fasta.decoy_sequence('PEPTIDE', 'shuffle')
'TPPIDEE'
>>> fasta.decoy_sequence('PEPTIDE', 'shuffle')
'PTIDEPE'
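The 'reverse' and 'shuffle' modes correspond to very simple string operations, sketched below in plain Python. The function names are hypothetical and the library functions offer more options; the 'fused' mode is not covered here.

```python
import random


def reverse_decoy(sequence):
    """Decoy made by reading the sequence backwards."""
    return sequence[::-1]


def shuffle_decoy(sequence, rng=random):
    """Decoy made by randomly permuting the residues (composition is preserved)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return ''.join(residues)


print(reverse_decoy('PEPTIDE'))  # EDITPEP
print(shuffle_decoy('PEPTIDE'))  # a random permutation, different on each call
```

Passing a seeded `random.Random` instance as `rng` makes the shuffled decoys reproducible between runs.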
mzTab
mzTab is a HUPO-PSI standardized text-based format for describing identification and quantification of peptides and small molecules. You can read an mzTab file into a set of pandas.DataFrame objects with the pyteomics.mztab.MzTab class.
>>> from pyteomics import mztab
>>> tables = mztab.MzTab("path/to/file.mzTab")
>>> psms = tables.spectrum_match_table
>>> # do something with DataFrame
XML formats¶
XML parsers are implemented as classes and provide an object-oriented interface. The functional interface is preserved for backward compatibility and wraps the actual class-based machinery. That means that reader objects returned by read() functions have additional methods.
One of the most important methods is iterfind(). It allows reading additional information from XML files.
mzML and mzXML¶
mzML and mzXML are XML-based formats for experimental data obtained on MS/MS or LC-MS setups. Pyteomics offers you the functionality of the pyteomics.mzml and pyteomics.mzxml modules to gain access to the information contained in those files from Python. The interfaces of the two modules are very similar; this section uses mzML for demonstration.
The user can iterate through MS/MS spectra contained in a file via the pyteomics.mzml.read() function or the pyteomics.mzml.MzML class.
Here is an example of the output:
>>> from pyteomics import mzml, auxiliary
>>> with mzml.read('tests/test.mzML') as reader:
...     auxiliary.print_tree(next(reader))
...
count
index
highest observed m/z
ms level
total ion current
intensity array
lowest observed m/z
defaultArrayLength
profile spectrum
MSn spectrum
positive scan
base peak intensity
m/z array
base peak m/z
id
scanList
-> count
-> scan [list]
-> -> scan start time
-> -> preset scan configuration
-> -> filter string
-> -> instrumentConfigurationRef
-> -> scanWindowList
-> -> -> count
-> -> -> scanWindow [list]
-> -> -> -> scan window lower limit
-> -> -> -> scan window upper limit
-> -> [Thermo Trailer Extra]Monoisotopic M/Z:
-> no combination
Additionally, pyteomics.mzml.MzML objects support direct indexing with spectrum IDs and all other features of Indexed Parsers.
pyteomics.mzml.PreIndexedMzML offers the same functionality, but it uses byte offset information found at the end of the file. Unlike the rest of the functions and classes, pyteomics.mzml.PreIndexedMzML does not have a counterpart in pyteomics.mzxml.
pepXML¶
pepXML is a widely used XML-based format for peptide identifications. It contains information about the MS data, the parameters of the search engine used and the assigned sequences. To access these data, use the pyteomics.pepxml module.
The function pyteomics.pepxml.read() iterates through peptide-spectrum matches in a pepXML file and returns them as a custom dict. Alternatively, you can use the pyteomics.pepxml.PepXML interface.
>>> from pyteomics import pepxml, auxiliary
>>> with pepxml.read('tests/test.pep.xml') as reader:
...     auxiliary.print_tree(next(reader))
...
end_scan
search_hit [list]
-> hit_rank
-> calc_neutral_pep_mass
-> modifications
-> modified_peptide
-> peptide
-> num_matched_ions
-> search_score
-> -> deltacn
-> -> spscore
-> -> sprank
-> -> deltacnstar
-> -> xcorr
-> num_missed_cleavages
-> analysis_result [list]
-> -> peptideprophet_result
-> -> -> all_ntt_prob
-> -> -> parameter
-> -> -> -> massd
-> -> -> -> fval
-> -> -> -> nmc
-> -> -> -> ntt
-> -> -> probability
-> -> analysis
-> tot_num_ions
-> num_tot_proteins
-> is_rejected
-> proteins [list]
-> -> num_tol_term
-> -> protein
-> -> peptide_next_aa
-> -> protein_descr
-> -> peptide_prev_aa
-> massdiff
index
assumed_charge
spectrum
precursor_neutral_mass
start_scan
Reading into a pandas.DataFrame¶
If you like working with tabular data using pandas, you can load pepXML files directly into pandas.DataFrames using the pyteomics.pepxml.DataFrame() function. It can read multiple files at once (using pyteomics.pepxml.chain()) and return a combined table with essential information about search results. This function requires pandas.
X!Tandem¶
The X!Tandem search engine has its own output format that contains more information than pepXML. Pyteomics has a reader for it in the pyteomics.tandem module.
>>> from pyteomics import tandem, auxiliary
>>> with tandem.read('tests/test.t.xml') as reader:
...     auxiliary.print_tree(next(reader))
...
rt
support
-> fragment ion mass spectrum
-> -> M+H
-> -> note
-> -> charge
-> -> Ydata
-> -> -> units
-> -> -> values
-> -> Xdata
-> -> -> units
-> -> -> values
-> -> label
-> -> id
-> supporting data
-> -> convolution survival function
-> -> -> Ydata
-> -> -> -> units
-> -> -> -> values
-> -> -> Xdata
-> -> -> -> units
-> -> -> -> values
-> -> -> label
-> -> b ion histogram
-> -> -> Ydata
-> -> -> -> units
-> -> -> -> values
-> -> -> Xdata
-> -> -> -> units
-> -> -> -> values
-> -> -> label
-> -> y ion histogram
-> -> -> Ydata
-> -> -> -> units
-> -> -> -> values
-> -> -> Xdata
-> -> -> -> units
-> -> -> -> values
-> -> -> label
-> -> hyperscore expectation function
-> -> -> a1
-> -> -> a0
-> -> -> Ydata
-> -> -> -> units
-> -> -> -> values
-> -> -> Xdata
-> -> -> -> units
-> -> -> -> values
-> -> -> label
mh
maxI
expect
sumI
act
fI
z
id
protein [list]
-> peptide
-> -> pre
-> -> end
-> -> seq
-> -> b_ions
-> -> nextscore
-> -> mh
-> -> y_ions
-> -> start
-> -> hyperscore
-> -> expect
-> -> delta
-> -> id
-> -> post
-> -> missed_cleavages
-> -> b_score
-> -> y_score
-> uid
-> sumI
-> label
-> note
-> expect
-> file
-> -> URL
-> -> type
-> id
pyteomics.tandem.read() returns a pyteomics.tandem.TandemXML instance, which can also be created directly.
Reading into a pandas.DataFrame¶
You can also load data from X!Tandem files directly into pandas.DataFrames using the pyteomics.tandem.DataFrame() function. It can read multiple files at once (using pyteomics.tandem.chain()) and return a combined table with essential information about search results. This function requires pandas.
mzIdentML¶
mzIdentML is one of the standards developed by the Proteomics Informatics working group of the HUPO Proteomics Standard Initiative.
The module interface is similar to that of the other reader modules.
The pyteomics.mzid.read() function returns a pyteomics.mzid.MzIdentML instance, which you can just as easily use directly.
>>> from pyteomics import mzid, auxiliary
>>> with mzid.read('tests/test.mzid') as reader:
...     auxiliary.print_tree(next(reader))
...
SpectrumIdentificationItem [list]
-> PeptideEvidenceRef [list]
-> -> peptideEvidence_ref
-> ProteinScape:SequestMetaScore
-> chargeState
-> rank
-> ProteinScape:IntensityCoverage
-> calculatedMassToCharge
-> peptide_ref
-> passThreshold
-> experimentalMassToCharge
-> id
spectrumID
id
spectraData_ref
Element IDs and references¶
In mzIdentML, some elements contain references to other elements in the same file. The references are simply XML attributes whose name ends with _ref and whose value is an ID, identical to the value of the id attribute of a certain element.
The parser can retrieve information from these references on the fly, which can be enabled by passing retrieve_refs=True to the pyteomics.mzid.MzIdentML.iterfind() method, to the pyteomics.mzid.MzIdentML constructor, or to pyteomics.mzid.read(). Retrieval of data by ID is implemented in the pyteomics.mzid.MzIdentML.get_by_id() method. Alternatively, the MzIdentML object itself can be indexed with element IDs:
>>> from pyteomics import mzid
>>> m = mzid.MzIdentML('tests/test.mzid')
>>> m['ipi.HUMAN_decoy']
{'DatabaseName': 'database IPI_human',
'decoy DB accession regexp': '^SHD',
'decoy DB generation algorithm': 'PeakQuant.DecoyDatabaseBuilder',
'id': 'ipi.HUMAN_decoy',
'location': 'file://www.medizinisches-proteom-center.de/DBServer/ipi.HUMAN/3.15/ipi.HUMAN_decoy.fasta',
'name': ['decoy DB from IPI_human',
'DB composition target+decoy',
'decoy DB type shuffle'],
'numDatabaseSequences': 58099,
'releaseDate': '2006-02-22T09:30:47Z',
'version': '3.15'}
>>> m.close()
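The idea behind reference retrieval can be illustrated without any XML at all. The sketch below is a conceptual analogue only, not the Pyteomics implementation: resolve_refs is a hypothetical helper, and the by_id dictionary and the record are made-up data standing in for parsed elements.

```python
def resolve_refs(record, by_id):
    """Return a copy of record with data pulled in for every '*_ref' key.

    Hypothetical helper: by_id maps element IDs to dicts of their attributes.
    Values already present in the record take precedence over referenced ones.
    """
    resolved = {}
    for key, value in record.items():
        if isinstance(value, dict):
            resolved[key] = resolve_refs(value, by_id)
        elif isinstance(value, list):
            resolved[key] = [resolve_refs(v, by_id) if isinstance(v, dict) else v
                             for v in value]
        elif key.endswith('_ref') and value in by_id:
            resolved[key] = value  # keep the reference itself
            for ref_key, ref_value in by_id[value].items():
                if ref_key != 'id':
                    resolved.setdefault(ref_key, ref_value)
        else:
            resolved[key] = value
    return resolved


# Made-up data for illustration
by_id = {'prot1_pep1': {'id': 'prot1_pep1', 'PeptideSequence': 'PEPTIDE'}}
item = {'id': 'SEQ_spec1_pep1', 'peptide_ref': 'prot1_pep1', 'rank': 1}
print(resolve_refs(item, by_id)['PeptideSequence'])  # PEPTIDE
```

Every lookup in by_id corresponds to a jump to another place in the file, so resolving references for every record costs more than reading the records alone.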
Note
Since version 3.3, pyteomics.mzid.MzIdentML objects keep an index of byte offsets for some of the elements (see Indexed Parsers). Indexing helps achieve acceptable performance when using retrieve_refs=True, or when accessing individual elements by their ID.
This behavior can be disabled by passing use_index=False to the object constructor. An alternative, older mechanism is caching of element IDs. To build a cache for a file, you can pass build_id_cache=True and use_index=False to the MzIdentML constructor, or to pyteomics.mzid.read(), or call the pyteomics.mzid.MzIdentML.build_id_cache() method prior to reading the data.
Reading into a pandas.DataFrame¶
pyteomics.mzid also provides a pyteomics.mzid.DataFrame() function that reads one or several files into a single pandas.DataFrame. This function requires pandas.
idXML¶
idXML is an OpenMS format for peptide identifications. It is supported in pyteomics.openms.idxml.
It partially supports indexing (protein information can be indexed and extracted with retrieve_refs).
The regular iterative parsing is done through read() or IDXML, and pandas.DataFrame objects can be created as well.
TraML¶
TraML is also a PSI format. It stores a lot of information on SRM experiments.
The parser, pyteomics.traml.TraML, iterates over <Transition> elements by default.
Like MzIdentML, it has a retrieve_refs parameter that helps pull in the information from other parts of the file.
TraML is one of the Indexed Parsers.
FeatureXML¶
pyteomics.openms.featurexml implements a simple parser for .featureXML files used in the OpenMS framework. The usage is identical to other XML parsing modules. Since featureXML has feature IDs, FeatureXML objects also support direct indexing as well as iteration, among the many features of Indexed Parsers:
>>> from pyteomics.openms import featurexml
>>> # function style, iteration
... with featurexml.read('tests/test.featureXML') as f:
...     qual = [feat['overallquality'] for feat in f]
...
>>> qual # qualities of the two features in the test file
[0.791454, 0.945634]
>>> # object-oriented style, direct indexing
>>> f = featurexml.FeatureXML('tests/test.featureXML')
>>> f['f_189396504510444007']['overallquality']
0.945634
>>> f.close()
As always, pyteomics.openms.featurexml.read() and pyteomics.openms.featurexml.FeatureXML are interchangeable.
TrafoXML¶
.trafoXML is another OpenMS format based on XML. It describes a transformation produced by an RT alignment algorithm. The file essentially contains a series of (from; to) pairs corresponding to original and transformed retention times:
>>> from pyteomics.openms import trafoxml
>>> from_rt, to_rt = [], []
>>> with trafoxml.read('tests/test.trafoXML') as f:
...     for pair in f:
...         from_rt.append(pair['from'])
...         to_rt.append(pair['to'])
...
>>> # plot the transformation
>>> import pylab
>>> pylab.plot(from_rt, to_rt)
As always, pyteomics.openms.trafoxml.read() and pyteomics.openms.trafoxml.TrafoXML are interchangeable.
TrafoXML parsers do not support indexing because there are no IDs for specific data points in this format.
Controlled Vocabularies¶
Controlled Vocabularies are the universal annotation system used in the PSI formats, including mzML and mzIdentML. pyteomics.mzml.MzML, pyteomics.traml.TraML and pyteomics.mzid.MzIdentML retain the annotation information. It can be accessed using the helper function pyteomics.auxiliary.cvquery():
>>> from pyteomics import auxiliary as aux, mzid, mzml
>>> f = mzid.MzIdentML('tests/test.mzid')
>>> s = next(f)
>>> s
{'SpectrumIdentificationItem': [{'ProteinScape:SequestMetaScore': 7.59488518903425, 'calculatedMassToCharge': 1507.695, 'PeptideEvidenceRef': [{'peptideEvidence_ref': 'PE1_SEQ_spec1_pep1'}], 'chargeState': 1, 'passThreshold': True, 'peptide_ref': 'prot1_pep1', 'rank': 1, 'id': 'SEQ_spec1_pep1', 'ProteinScape:IntensityCoverage': 0.3919545603809718, 'experimentalMassToCharge': 1507.696}], 'spectrumID': 'databasekey=1', 'id': 'SEQ_spec1', 'spectraData_ref': 'LCMALDI_spectra'}
>>> aux.cvquery(s)
{'MS:1001506': 7.59488518903425, 'MS:1001505': 0.3919545603809718}
>>> f.close()
Indexed Parsers¶
Most of the parsers implement indexing: MGF, mzML, mzXML, FASTA, PEFF, pepXML, mzIdentML, ms1, TraML, featureXML. Some formats do not have indexing parsers, because there is no unique ID field in the files to identify entries.
XML parser classes are named after the format, e.g. pyteomics.mzml.MzML. Text format parsers that implement indexing have the word “Indexed” in their names, e.g. pyteomics.fasta.IndexedFASTA, as opposed to pyteomics.fasta.FASTA, which does not implement indexing.
This distinction is due to the fact that indexed parsers need to open the files in binary mode. This may affect performance for text-based formats and is not always backwards-compatible (you cannot instantiate an indexed parser class using a previously opened file if it is in text mode). XML files, on the other hand, are always meant to be opened in binary mode. So, there is no duplication of classes for XML formats, but indexing can still be disabled by passing use_index=False to the class constructor or the read() function.
Basic usage¶
Indexed parsers can be instantiated using the class name or the read() function:
In [1]: from pyteomics import mgf
In [2]: f = mgf.IndexedMGF('tests/test.mgf')
In [3]: f
Out[3]: <pyteomics.mgf.IndexedMGF at 0x7fc983cbaeb8>
In [4]: f.close()
In [5]: f = mgf.read('tests/test.mgf', use_index=True)
In [6]: f
Out[6]: <pyteomics.mgf.IndexedMGF at 0x7fc980c63898>
They support direct assignment and iteration or the with syntax, the same way as the older, iterative parsers.
Parser objects can be used as dictionaries mapping entry IDs to entries, or as lists:
In [7]: f['Spectrum 2']
Out[7]:
{'params': {'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
'itol': '1',
'itolu': 'Da',
'mods': 'Carbamidomethyl (C)',
'it_mods': 'Oxidation (M)',
'mass': 'Monoisotopic',
'username': 'Lou Scene',
'useremail': 'leu@altered-state.edu',
'charge': [2, 3],
'title': 'Spectrum 2',
'pepmass': (1084.9, 1234.0),
'scans': '3',
'rtinseconds': 25.0 second},
'm/z array': array([ 345.1, 370.2, 460.2, 1673.3, 1674. , 1675.3]),
'intensity array': array([ 237., 128., 108., 1007., 974., 79.]),
'charge array': masked_array(data=[3, 2, 1, 1, 1, 1],
mask=False,
fill_value=0)}
In [8]: f[1]['params']['title'] # positional indexing
Out[8]: 'Spectrum 2'
Like dictionaries, indexed parsers support membership testing and len():
In [9]: 'Spectrum 1' in f
Out[9]: True
In [10]: len(f)
Out[10]: 2
Rich Indexing¶
Indexed parsers also support positional indexing, slices of IDs and integers. ID-based slices include both endpoints; integer-based slices exclude the right edge of the interval. With integer indexing, step is also supported. Here is a self-explanatory demo of indexing functionality using a test file of two spectra:
In [11]: len(f['Spectrum 1':'Spectrum 2'])
Out[11]: 2
In [12]: len(f['Spectrum 2':'Spectrum 1'])
Out[12]: 2
In [13]: len(f[:])
Out[13]: 2
In [14]: len(f[:1])
Out[14]: 1
In [15]: len(f[1:0])
Out[15]: 0
In [16]: len(f[1:0:-1])
Out[16]: 1
In [17]: len(f[::2])
Out[17]: 1
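The slicing rules can be mimicked with a toy container. This is only a sketch for building intuition, not how Pyteomics parsers are implemented: TinyIndex is a hypothetical class, the entries are made up, and returning ID-based slices in file order regardless of endpoint order is an assumption of the sketch.

```python
class TinyIndex:
    """Toy container following the indexing rules described above."""

    def __init__(self, entries):
        # entries: list of (entry_id, payload) pairs in file order
        self._ids = [entry_id for entry_id, _ in entries]
        self._data = dict(entries)

    def __len__(self):
        return len(self._ids)

    def __contains__(self, key):
        return key in self._data

    def __getitem__(self, key):
        if isinstance(key, str):  # lookup by ID
            return self._data[key]
        if isinstance(key, int):  # positional lookup
            return self._data[self._ids[key]]
        if isinstance(key, slice):
            if isinstance(key.start, str) or isinstance(key.stop, str):
                # ID-based slice: both endpoints are included
                start = 0 if key.start is None else self._ids.index(key.start)
                stop = len(self) - 1 if key.stop is None else self._ids.index(key.stop)
                low, high = sorted((start, stop))
                ids = self._ids[low:high + 1]
            else:
                # integer slice: usual Python semantics, right edge excluded
                ids = self._ids[key]
            return [self._data[entry_id] for entry_id in ids]
        raise TypeError('unsupported key type: {!r}'.format(type(key)))


f = TinyIndex([('Spectrum 1', {'scans': '2'}), ('Spectrum 2', {'scans': '3'})])
print(len(f['Spectrum 1':'Spectrum 2']))  # 2
print(len(f[:1]))  # 1
print(len(f[1:0:-1]))  # 1
```

The asymmetry is deliberate: with IDs there is no natural "next ID" to use as an exclusive bound, so including both endpoints is the only convenient convention.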
RT-based indexing¶
In MGF, mzML and mzXML the spectra are usually time-ordered. The corresponding indexed parsers allow accessing the spectra by retention time, including slices:
In [18]: from pyteomics import mzxml
In [19]: f = mzxml.MzXML('tests/test.mzXML')
In [20]: spec = f.time[5.5] # get the spectrum closest to this retention time
In [21]: len(f.time[5.5:6.0]) # get spectra from a range
Out[21]: 2
RT lookup is performed using binary search. When retrieving ranges, the closest spectra to the start and end of the range are used as endpoints, so it is possible that they are slightly outside the range.
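The lookup described above can be sketched with the standard bisect module. The functions closest_index and rt_range are hypothetical names, the retention times are made up, and treating both endpoints as inclusive is an assumption; the sketch only shows why a returned spectrum may lie slightly outside the requested range.

```python
from bisect import bisect_left


def closest_index(times, rt):
    """Index of the value in the sorted list `times` that is closest to `rt`."""
    if not times:
        raise ValueError('no retention times to search')
    pos = bisect_left(times, rt)  # binary search for the insertion point
    if pos == 0:
        return 0
    if pos == len(times):
        return len(times) - 1
    before, after = times[pos - 1], times[pos]
    return pos if after - rt < rt - before else pos - 1


def rt_range(times, start, stop):
    """Indices between the spectra closest to `start` and to `stop`, inclusive."""
    low = closest_index(times, start)
    high = closest_index(times, stop)
    return list(range(low, high + 1))


times = [5.1, 5.4, 5.9, 6.3, 7.0]  # made-up, sorted retention times
print(closest_index(times, 5.5))  # 1, i.e. the spectrum at 5.4
print(rt_range(times, 5.5, 6.0))  # [1, 2]: 5.4 is slightly outside the range
```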
Multiprocessing¶
Indexed parsers provide a unified interface for multiprocessing: map(). The method applies a user-defined function to entries from the file, calling it in different processes. If the function is not provided, the parsing itself is parallelized. Depending on the format, this may speed up or slow down the parsing overall.
map() is a generator and yields items as they become available, not preserving the original order:
In [1]: from pyteomics import mzml
In [2]: f = mzml.MzML('tests/test.mzML')
In [3]: for spec in f.map():
...: print(spec['id'])
...:
controllerType=0 controllerNumber=1 scan=2
controllerType=0 controllerNumber=1 scan=1
In [4]: for item in f.map(lambda spec: spec['id']):
...: print(item)
...:
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2
Note
To use map() with lambda functions (and in some other corner cases, like parsers instantiated with pre-opened file objects), the dill package is required. This is because the target callable and the parser itself need to be pickled for multiprocessing to work.
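The pickling limitation is easy to observe with the standard library alone. The helper can_pickle below is a hypothetical name used only for this illustration; it simply reports whether the standard pickle module accepts an object.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


print(can_pickle(len))  # True: built-in functions are pickled by reference
print(can_pickle(lambda spec: spec['id']))  # False: a lambda has no importable name
```

Functions are pickled by their qualified name, so a named function defined at the top level of an importable module is the simplest choice of callable when dill is not available.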
Apart from parser objects, map() is available on objects returned by chain() functions and iterfind():
In [5]: for c in f.iterfind('chromatogram').map():
...: print(c['id'])
...:
TIC
In [6]: for spec in mzml.chain('tests/test.mzML', 'tests/test.mzML').map():
...: print(spec['id'])
...:
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2
FDR estimation and filtering¶
The modules for reading proteomics search engine or post-processing output (tandem, pepxml, mzid, idxml and protxml) expose similar functions is_decoy(), fdr() and filter(). These functions implement the widely used Target-Decoy Approach (TDA) to estimation of False Discovery Rate (FDR).
The is_decoy() function is supposed to determine if a particular spectrum identification is coming from the decoy database. In tandem and pepxml this is done by checking if the protein description/name starts with a certain prefix. In mzid, a boolean value that stores this information in the PSM dict is used.
Warning
Because of the variety of the software producing files in pepXML and mzIdentML formats, the is_decoy() function provided in the corresponding modules may not work for your specific files. In this case you will have to refer to the source of pyteomics.pepxml.is_decoy() and pyteomics.mzid.is_decoy() and create your own function in a similar manner.
The fdr() function estimates the FDR in a set of PSMs by counting the decoy matches. Since it is using the is_decoy() function, the warning above applies. You can supply a custom function so that fdr() works for your data. fdr() can also be imported from auxiliary, where it has no default for is_decoy().
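As a rough illustration of decoy counting, the sketch below estimates FDR as the ratio of decoy to target matches. It is a simplified stand-in, not the Pyteomics fdr() function: the names is_decoy_by_prefix and estimate_fdr, the 'proteins' key, the 'DECOY_' prefix and the PSM data are all assumptions made for the example.

```python
def is_decoy_by_prefix(psm, prefix='DECOY_'):
    """Treat a PSM as a decoy if all of its proteins carry the decoy prefix."""
    return all(name.startswith(prefix) for name in psm['proteins'])


def estimate_fdr(psms, is_decoy=is_decoy_by_prefix):
    """Estimate FDR as (number of decoy PSMs) / (number of target PSMs)."""
    decoys = sum(1 for psm in psms if is_decoy(psm))
    targets = len(psms) - decoys
    if targets == 0:
        raise ValueError('no target PSMs, FDR is undefined')
    return decoys / targets


# Made-up PSMs: three target matches and one decoy match
psms = [
    {'proteins': ['sp|P1|A_HUMAN']},
    {'proteins': ['sp|P2|B_HUMAN']},
    {'proteins': ['DECOY_sp|P3|C_HUMAN']},
    {'proteins': ['sp|P4|D_HUMAN', 'DECOY_sp|P5|E_HUMAN']},
]
print(round(estimate_fdr(psms), 3))  # 0.333
```

Passing is_decoy as a parameter mirrors the advice in the text: when the default rule does not match your files, only the decoy test needs to be replaced.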
The filter() function works like chain(), but instead of yielding all PSMs, it filters them to a certain level of FDR. PSM filtering requires counting decoy matches, too (see above), but it also implies sorting the PSMs by some kind of a score. This score cannot be universal due to the above-mentioned reasons, and it can be specified as a user-defined function. For instance, the default sorting key in pyteomics.mzid.filter() is only expected to work with mzIdentML files created with Mascot.
So once again,
Warning
The default parameters of filter() may not work for your files.
There are also filter.chain() and filter.chain.from_iterable(). These are different from filter() in that they apply FDR filtering to all files separately and then provide a reader over top PSMs of all files, whereas filter() pools all PSMs together and applies a single threshold.
If you want to filter a list representing PSMs in arbitrary format, you can use pyteomics.auxiliary.filter(). Instead of files it takes lists (or other iterables) of PSMs. The rest is the same as for other filter() functions.
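The sort-then-threshold logic can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the Pyteomics filter() implementation: filter_by_fdr is a hypothetical function, FDR is taken as decoys divided by targets, decoys are dropped from the result, and the PSM dicts with 'expect' and 'decoy' keys are made up.

```python
def filter_by_fdr(psms, threshold, key, is_decoy, reverse=False):
    """Keep the best-scoring target PSMs while the running FDR stays <= threshold."""
    ranked = sorted(psms, key=key, reverse=reverse)  # best PSMs first
    decoys = targets = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for count, psm in enumerate(ranked, start=1):
        if is_decoy(psm):
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= threshold:
            cutoff = count  # the deepest position that still satisfies the threshold
    return [psm for psm in ranked[:cutoff] if not is_decoy(psm)]


# Made-up PSMs scored by an e-value (lower is better)
psms = [
    {'expect': 0.03, 'decoy': False},
    {'expect': 0.001, 'decoy': False},
    {'expect': 0.01, 'decoy': True},
    {'expect': 0.002, 'decoy': False},
    {'expect': 0.04, 'decoy': True},
    {'expect': 0.02, 'decoy': False},
    {'expect': 0.05, 'decoy': True},
]
kept = filter_by_fdr(psms, 0.25, key=lambda p: p['expect'], is_decoy=lambda p: p['decoy'])
print(len(kept))  # 4
```

Both key and is_decoy are arguments here for the same reason they are configurable in the library: neither the score nor the decoy test is universal across search engines.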
NumPy and Pandas support, etc.¶
pyteomics.auxiliary.filter() supports structured numpy arrays and pandas.DataFrames of PSMs. This makes it easy to filter search results stored as CSV files (see Example 3: Search engines and PSM filtering for more info).
Generally, PSMs can be provided as iterators, lists, arrays, and DataFrames, and key and is_decoy parameters to filter() can be functions, strings, lists, arrays, or iterators. If a string is given, it is used as a key in a structured array, DataFrame or an iterable of dicts.
FDR correction¶
As described in this JPR article, filtering based on decoy counting is inherently biased, especially for small datasets. All TDA-related functions have an optional argument, correction, that enables the correcting procedure proposed in the article.
Pyteomics API documentation¶
This section documents all user functions and data available in Pyteomics. You can access all of this info off-line from your Python interpreter.
Contents:
parser - operations on modX peptide sequences¶
modX is a simple extension of the IUPAC one-letter peptide sequence representation.
The labels (or codes) for the 20 standard amino acids in modX are the same as in IUPAC nomenclature. A label for a modified amino acid has a general form of ‘modX’, i.e.:
- it starts with an arbitrary number of lower-case symbols or numbers (a modification);
- it ends with a single upper-case symbol (an amino acid residue).
Valid examples of modX amino acid labels are: ‘G’, ‘pS’, ‘oxM’. This rule makes the notation both human-readable and easy to parse.
Besides the sequence of amino acid residues, modX has a rule to specify terminal modifications of a polypeptide. Such a label should start or end with a hyphen. The default N-terminal amine group and C-terminal carboxyl group may not be shown explicitly.
Therefore, valid examples of peptide sequences in modX are: “GAGA”, “H-PEPTIDE-OH”, “H-TEST-NH2”. It is not recommended to specify only one terminal group.
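The label rule translates naturally into a regular expression. The sketch below is an illustration of the notation, not the parser used by Pyteomics: the names MODX_LABEL, looks_like_modx and split_labels are hypothetical, and terminal groups are deliberately not handled.

```python
import re

# Zero or more lower-case letters or digits, then exactly one upper-case letter
MODX_LABEL = re.compile(r'[a-z0-9]*[A-Z]')


def looks_like_modx(label):
    """Check a single label against the modX rule."""
    return MODX_LABEL.fullmatch(label) is not None


def split_labels(sequence):
    """Split a sequence without terminal groups into modX labels."""
    return MODX_LABEL.findall(sequence)


print(looks_like_modx('oxM'))  # True
print(looks_like_modx('oxMet'))  # False
print(split_labels('TEpSToxM'))  # ['T', 'E', 'pS', 'T', 'oxM']
```

Because every label ends with exactly one upper-case letter, a sequence can be split unambiguously without a list of known modifications.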
Operations on polypeptide sequences¶
parse()
- convert a sequence string into a list of amino acid residues.
tostring()
- convert a parsed sequence to a string.
amino_acid_composition()
- get numbers of each amino acid residue in a peptide.
cleave()
- cleave a polypeptide using a given rule of enzymatic digestion.
num_sites()
- count the number of cleavage sites in a sequence.
isoforms()
- generate all unique modified peptide sequences given the initial sequence and modifications.
Auxiliary commands¶
coverage()
- calculate the sequence coverage of a protein by peptides.
length()
- calculate the number of amino acid residues in a polypeptide.
valid()
- check if a sequence can be parsed successfully.
fast_valid()
- check if a sequence consists of known one-letter codes.
is_modX()
- check if supplied code corresponds to a modX label.
is_term_mod()
- check if supplied code corresponds to a terminal modification.
Data¶
std_amino_acids
- a list of the 20 standard amino acid IUPAC codes.
std_nterm
- the standard N-terminal modification (the unmodified group is a single atom of hydrogen).
std_cterm
- the standard C-terminal modification (the unmodified group is hydroxyl).
std_labels
- a list of all standard sequence elements, amino acid residues and terminal modifications.
expasy_rules
- a dict with the regular expressions of cleavage rules for the most popular proteolytic enzymes.
-
pyteomics.parser.amino_acid_composition(sequence, show_unmodified_termini=False, term_aa=False, allow_unknown_modifications=False, **kwargs)¶
Calculate amino acid composition of a polypeptide.
Parameters:
- sequence (str or list) – The sequence of a polypeptide or a list with a parsed sequence.
- show_unmodified_termini (bool, optional) – If True then the unmodified N- and C-terminus are explicitly shown in the returned dict. Default value is False.
- term_aa (bool, optional) – If True then the terminal amino acid residues are artificially modified with nterm or cterm modification. Default value is False.
- allow_unknown_modifications (bool, optional) – If True then do not raise an exception when an unknown modification of a known amino acid residue is found in the sequence. Default value is False.
- labels (list, optional) – A list of allowed labels for amino acids and terminal modifications.
Returns: out – A dictionary of amino acid composition.
Return type: dict
Examples
>>> amino_acid_composition('PEPTIDE') == {'I': 1, 'P': 2, 'E': 2, 'T': 1, 'D': 1}
True
>>> amino_acid_composition('PEPTDE', term_aa=True) == {'ctermE': 1, 'E': 1, 'D': 1, 'P': 1, 'T': 1, 'ntermP': 1}
True
>>> amino_acid_composition('PEPpTIDE', labels=std_labels+['pT']) == {'I': 1, 'P': 2, 'E': 2, 'D': 1, 'pT': 1}
True
-
pyteomics.parser.cleave(sequence, rule, missed_cleavages=0, min_length=None, semi=False, exception=None)¶
Cleaves a polypeptide sequence using a given rule.
Parameters:
- sequence (str) – The sequence of a polypeptide. Note: the sequence is expected to be in one-letter uppercase notation. Otherwise, some of the cleavage rules in expasy_rules will not work as expected.
- rule (str or compiled regex) – A key present in expasy_rules or a regular expression describing the site of cleavage. It is recommended to design the regex so that it matches only the residue whose C-terminal bond is to be cleaved. All additional requirements should be specified using lookaround assertions. expasy_rules contains cleavage rules for popular cleavage agents.
- missed_cleavages (int, optional) – Maximum number of allowed missed cleavages. Defaults to 0.
- min_length (int or None, optional) – Minimum peptide length. Defaults to None.
- semi (bool, optional) – Include products of semi-specific cleavage. Default is False. This effectively cuts every peptide at every position and adds results to the output.
- exception (str or compiled RE or None, optional) – Exceptions to the cleavage rule. If specified, should be a key present in expasy_rules or a regular expression. Cleavage sites matching rule will be checked against exception and omitted if they match.
Returns: out – A set of unique (!) peptides.
Return type: set
Examples
>>> cleave('AKAKBK', expasy_rules['trypsin'], 0) == {'AK', 'BK'}
True
>>> cleave('AKAKBK', 'trypsin', 0) == {'AK', 'BK'}
True
>>> cleave('GKGKYKCK', expasy_rules['trypsin'], 2) == {'CK', 'GKYK', 'YKCK', 'GKGK', 'GKYKCK', 'GK', 'GKGKYK', 'YK'}
True
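To see how a regular-expression rule and the missed_cleavages parameter interact, consider this minimal sketch. It is not the Pyteomics implementation: cleave_sketch is a hypothetical function, the rule '[KR]' is a deliberately simplified stand-in for the trypsin entry of expasy_rules, and min_length, semi and exception are not modelled.

```python
import re


def cleave_sketch(sequence, rule, missed_cleavages=0):
    """Cut after every match of `rule`, allowing up to `missed_cleavages` skipped sites."""
    # Cut positions: sequence start, the end of every matched residue, sequence end
    cuts = [0] + [match.end() for match in re.finditer(rule, sequence)]
    if cuts[-1] != len(sequence):
        cuts.append(len(sequence))
    peptides = set()
    for i in range(len(cuts) - 1):
        # Join up to missed_cleavages + 1 consecutive fragments
        for j in range(i + 1, min(i + missed_cleavages + 2, len(cuts))):
            peptides.add(sequence[cuts[i]:cuts[j]])
    return peptides


simple_trypsin = r'[KR]'  # simplified rule for illustration only
print(cleave_sketch('AKAKBK', simple_trypsin) == {'AK', 'BK'})  # True
print(len(cleave_sketch('GKGKYKCK', simple_trypsin, 2)))  # 8
```

Returning a set explains the "unique (!)" remark: the two identical 'AK' fragments in the first example collapse into one element.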
-
pyteomics.parser.coverage(protein, peptides)¶
Calculate how much of protein is covered by peptides. Peptides can overlap. If a peptide is found multiple times in protein, it contributes more to the overall coverage.
Requires numpy.
Note
Modifications and terminal groups are discarded.
Parameters:
- protein (str) – A protein sequence.
- peptides (iterable) – An iterable of peptide sequences.
Returns: out – The sequence coverage, between 0 and 1.
Return type: float
Examples
>>> coverage('PEPTIDES'*100, ['PEP', 'EPT'])
0.5
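The result in the example can be reproduced with a simple position mask. The function coverage_sketch is a hypothetical, numpy-free illustration of the idea; it assumes plain one-letter sequences with no modifications or terminal groups.

```python
import re


def coverage_sketch(protein, peptides):
    """Fraction of protein residues covered by at least one peptide occurrence."""
    covered = [False] * len(protein)
    for peptide in peptides:
        # The lookahead finds overlapping occurrences as well
        for match in re.finditer('(?={})'.format(re.escape(peptide)), protein):
            for position in range(match.start(), match.start() + len(peptide)):
                covered[position] = True
    return sum(covered) / len(protein)


print(coverage_sketch('PEPTIDES' * 100, ['PEP', 'EPT']))  # 0.5
```

'PEP' and 'EPT' together cover the first four residues of every eight-residue repeat, which gives one half.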
-
pyteomics.parser.expasy_rules¶
This dict contains regular expressions for cleavage rules of the most popular proteolytic enzymes. The rules were taken from the PeptideCutter tool at Expasy.
Note
‘trypsin_exception’ can be used as exception argument when calling cleave() with ‘trypsin’ rule:
>>> parser.cleave('PEPTIDKDE', parser.expasy_rules['trypsin'])
{'DE', 'PEPTIDK'}
>>> parser.cleave('PEPTIDKDE', parser.expasy_rules['trypsin'], exception=parser.expasy_rules['trypsin_exception'])
{'PEPTIDKDE'}
-
pyteomics.parser.fast_valid(sequence, labels={'-OH', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'H-', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'})¶
Iterate over sequence and check if all items are in labels. With strings, this only works as expected on sequences without modifications or terminal groups.
Parameters:
- sequence (iterable (expectedly, str)) – The sequence to check. A valid sequence would be a string of labels, all present in labels.
- labels (iterable, optional) – An iterable of known labels.
Returns: out
Return type: bool
-
pyteomics.parser.is_modX(label)¶
Check if label is a valid ‘modX’ label.
Parameters: label (str) –
Returns: out
Return type: bool
Examples
>>> is_modX('M')
True
>>> is_modX('oxM')
True
>>> is_modX('oxMet')
False
>>> is_modX('160C')
True
-
pyteomics.parser.is_term_mod(label)¶
Check if label corresponds to a terminal modification.
Parameters: label (str) –
Returns: out
Return type: bool
Examples
>>> is_term_mod('A')
False
>>> is_term_mod('Ac-')
True
>>> is_term_mod('-customGroup')
True
>>> is_term_mod('this-group-')
False
>>> is_term_mod('-')
False
-
pyteomics.parser.isoforms(sequence, **kwargs)¶
Apply variable and fixed modifications to the polypeptide and yield the unique modified sequences.
Parameters:
- sequence (str) – Peptide sequence to modify.
- variable_mods (dict, optional) – A dict of variable modifications in the following format: {'label1': ['X', 'Y', ...], 'label2': ['X', 'A', 'B', ...]}. Keys in the dict are modification labels (terminal modifications allowed). Values are iterables of residue labels (one letter each) or True. If a value for a modification is True, it is applicable to any residue (useful for terminal modifications). You can use values such as ‘ntermX’ or ‘ctermY’ to specify that a modification only occurs when the residue is in the terminal position. This is not needed for terminal modifications. Note: several variable modifications can occur on amino acids of the same type, but in the output each amino acid residue will be modified at most once (apart from terminal modifications).
- fixed_mods (dict, optional) – A dict of fixed modifications in the same format. Note: if a residue is affected by a fixed modification, no variable modifications will be applied to it (apart from terminal modifications).
- labels (list, optional) – A list of amino acid labels containing all the labels present in sequence. Modified entries will be added automatically. Defaults to std_labels. Not required since version 2.5.
- max_mods (int or None, optional) – Number of modifications that can occur simultaneously on a peptide, excluding fixed modifications. If None or if max_mods is greater than the number of modification sites, all possible isoforms are generated. Default is None.
- override (bool, optional) – Defines how to handle the residues that are modified in the input. False means that they will be preserved (default). True means they will be treated as unmodified.
- show_unmodified_termini (bool, optional) – If True then the unmodified N- and C-termini are explicitly shown in the returned sequences. Default value is False.
- format (str, optional) – If 'str' (default), an iterator over sequences is returned. If 'split', the iterator will yield results in the same format as parse() with the ‘split’ option, with unmodified terminal groups shown.
Returns: out – All possible unique polypeptide sequences resulting from the specified modifications are yielded one by one.
Return type: iterator over strings or lists
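The combinatorics behind variable modifications can be illustrated with itertools.product. The generator isoforms_sketch is a hypothetical simplification and not the library function: it accepts only plain one-letter sequences and ignores fixed modifications, terminal modifications, max_mods and the other options described above.

```python
from itertools import product


def isoforms_sketch(sequence, variable_mods):
    """Yield every combination of variable modifications, at most one per residue."""
    options = []
    for residue in sequence:
        choices = [residue]  # the unmodified residue is always allowed
        for mod, targets in variable_mods.items():
            if targets is True or residue in targets:
                choices.append(mod + residue)
        options.append(choices)
    for combination in product(*options):
        yield ''.join(combination)


print(sorted(isoforms_sketch('PEM', {'ox': ['M']})))  # ['PEM', 'PEoxM']
print(len(list(isoforms_sketch('MAM', {'ox': ['M']}))))  # 4
```

The number of isoforms is the product of the number of choices per residue, so it grows quickly with the number of modifiable sites; this is the growth that a limit such as max_mods is meant to contain.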
-
pyteomics.parser.length(sequence, **kwargs)¶
Calculate the number of amino acid residues in a polypeptide written in modX notation.
Returns: out
Return type: int
Examples
>>> length('PEPTIDE')
7
>>> length('H-PEPTIDE-OH')
7
-
pyteomics.parser.match_modX(label)¶
Check if label is a valid ‘modX’ label.
Parameters: label (str) –
Returns: out
Return type: re.match or None
-
pyteomics.parser.num_sites(sequence, rule, **kwargs)¶
Count the number of sites where sequence can be cleaved using the given rule (e.g. number of miscleavages for a peptide).
Parameters:
- sequence (str) – The sequence of a polypeptide.
- rule (str or compiled regex) – A regular expression describing the site of cleavage. It is recommended to design the regex so that it matches only the residue whose C-terminal bond is to be cleaved. All additional requirements should be specified using lookaround assertions.
- labels (list, optional) – A list of allowed labels for amino acids and terminal modifications.
- exception (str or compiled RE or None, optional) – Exceptions to the cleavage rule. If specified, should be a regular expression. Cleavage sites matching rule will be checked against exception and omitted if they match.
Returns: out – Number of cleavage sites.
Return type: int
-
pyteomics.parser.
parse
(sequence, show_unmodified_termini=False, split=False, allow_unknown_modifications=False, **kwargs)[source]¶ Parse a sequence string written in modX notation into a list of labels or (if split argument is
True
) into a list of tuples representing amino acid residues and their modifications.Parameters: - sequence (str) – The sequence of a polypeptide.
- show_unmodified_termini (bool, optional) – If
True
then the unmodified N- and C-termini are explicitly shown in the returned list. Default value is False
. - split (bool, optional) – If
True
then the result will be a list of tuples with 1 to 4 elements: terminal modification, modification, residue. Default value is False
. - allow_unknown_modifications (bool, optional) –
If
True
then do not raise an exception when an unknown modification of a known amino acid residue is found in the sequence. This also includes terminal groups. Default value is False
.
Note
Since version 2.5, this parameter has effect only if labels are provided.
- labels (container, optional) –
A container of allowed labels for amino acids, modifications and terminal modifications. If not provided, no checks will be done. Separate labels for modifications (such as ‘p’ or ‘ox’) can be supplied, which means they are applicable to all residues.
Warning
If show_unmodified_termini is set to
True
, standard terminal groups need to be present in labels.
Warning
Avoid using sequences with only one terminal group, as they are ambiguous. If you provide one, labels (or
std_labels
) will be used to resolve the ambiguity.
Returns: out – List of tuples with labels of modifications and amino acid residues.
Return type:
Examples
>>> parse('PEPTIDE', split=True)
[('P',), ('E',), ('P',), ('T',), ('I',), ('D',), ('E',)]
>>> parse('H-PEPTIDE')
['P', 'E', 'P', 'T', 'I', 'D', 'E']
>>> parse('PEPTIDE', show_unmodified_termini=True)
['H-', 'P', 'E', 'P', 'T', 'I', 'D', 'E', '-OH']
>>> parse('TEpSToxM', labels=std_labels + ['pS', 'oxM'])
['T', 'E', 'pS', 'T', 'oxM']
>>> parse('zPEPzTIDzE', True, True, labels=std_labels+['z'])
[('H-', 'z', 'P'), ('E',), ('P',), ('z', 'T'), ('I',), ('D',), ('z', 'E', '-OH')]
>>> parse('Pmod1EPTIDE')
['P', 'mod1E', 'P', 'T', 'I', 'D', 'E']
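To make the modX convention concrete, here is a minimal standard-library sketch of how a sequence can be tokenized into labels: an optional N-terminal group ending in a hyphen, residues written as a non-uppercase prefix plus one capital letter, and an optional C-terminal group starting with a hyphen. The helper `split_modx` and both regular expressions are assumptions made for this illustration; they are not the library's parser and perform no label validation.

```python
import re

# Optional 'X-' N-terminus, a body of modX residues, optional '-X' C-terminus.
_SEQUENCE = re.compile(r'^([^-]+-)?((?:[^A-Z-]*[A-Z])+)(-[^-]+)?$')
# One residue: any non-uppercase, non-hyphen prefix followed by a capital letter.
_RESIDUE = re.compile(r'[^A-Z-]*[A-Z]')

def split_modx(sequence):
    """Return (n_terminus, residue_labels, c_terminus) for a modX string."""
    match = _SEQUENCE.match(sequence)
    if match is None:
        raise ValueError('not a modX sequence: %r' % sequence)
    nterm, body, cterm = match.groups()
    return nterm, _RESIDUE.findall(body), cterm

print(split_modx('TEpSToxM'))
print(split_modx('H-PEPTIDE-OH'))
```

Unlike parse(), this sketch returns missing terminal groups as None and never checks labels against an allowed list.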
-
pyteomics.parser.
std_amino_acids
¶ modX labels for the 20 standard amino acids.
-
pyteomics.parser.
std_cterm
¶ modX label for the unmodified C-terminus.
-
pyteomics.parser.
std_labels
¶ modX labels for the standard amino acids and unmodified termini.
-
pyteomics.parser.
std_nterm
¶ modX label for the unmodified N-terminus.
-
pyteomics.parser.
tostring
(parsed_sequence, show_unmodified_termini=True)[source]¶ Create a string from a parsed sequence.
Parameters: - parsed_sequence (iterable) – Expected to be in one of the formats returned by
parse()
, i.e. list of labels or list of tuples. - show_unmodified_termini (bool, optional) – Defines the behavior towards standard terminal groups in the input.
True
means that they will be preserved if present (default).
False
means that they will be removed. Standard terminal groups will not be added if not shown in parsed_sequence, regardless of this setting.
Returns: sequence
Return type:
mass - molecular masses and isotope distributions¶
Summary¶
This module defines general functions for mass and isotope abundance
calculations. For most of the functions, the user can define a given
substance in various formats, but all of them would be reduced to the
Composition
object describing its
chemical composition.
Classes¶
Composition
- a class storing chemical composition of a substance.
Unimod
- a class representing a Python interface to the Unimod database (see pyteomics.mass.unimod
for a much more powerful alternative).
Mass calculations¶
calculate_mass()
- a general routine for mass / m/z calculation. Can calculate mass for a polypeptide sequence, chemical formula or elemental composition. Supplied with an ion type and charge, the function would calculate m/z.
fast_mass()
- a less powerful but much faster function for polypeptide mass calculation.
fast_mass2()
- a version of fast_mass that supports modX notation.
Isotopic abundances¶
isotopic_composition_abundance()
- calculate the relative abundance of a given isotopic composition.
most_probable_isotopic_composition()
- finds the most abundant isotopic composition for a molecule defined by a polypeptide sequence, chemical formula or elemental composition.
isotopologues()
- iterate over possible isotopic compositions of a molecule, possibly filtered by abundance.
Data¶
nist_mass
- a dict with exact masses of the most abundant isotopes.
std_aa_comp
- a dict with the elemental compositions of the standard twenty amino acid residues, selenocysteine and pyrrolysine.
std_ion_comp
- a dict with the relative elemental compositions of the standard peptide fragment ions.
std_aa_mass
- a dict with the monoisotopic masses of the standard twenty amino acid residues, selenocysteine and pyrrolysine.
-
Composition.
__init__
(*args, **kwargs)[source]¶ A Composition object stores a chemical composition of a substance. Basically it is a dict object, in which keys are the names of chemical elements and values contain integer numbers of corresponding atoms in a substance.
The main improvement over dict is that Composition objects allow addition and subtraction.
A Composition object can be initialized with one of the following arguments: formula, sequence, parsed_sequence or split_sequence.
If none of these are specified, the constructor will look at the first positional argument and try to build the object from it. Without positional arguments, a Composition will be constructed directly from keyword arguments.
If there’s an ambiguity, i.e. the argument is both a valid sequence and a formula (such as ‘HCN’), it will be treated as a sequence. You need to provide the ‘formula’ keyword to override this.
Warning
Be careful when supplying a list with a parsed sequence or a split sequence as a keyword argument. It must be obtained with the show_unmodified_termini option enabled. When supplying it as a positional argument, the option doesn’t matter, because the positional argument is always converted to a sequence prior to any processing.
Parameters: - formula (str, optional) – A string with a chemical formula. All elements must be present in mass_data.
- sequence (str, optional) – A polypeptide sequence string in modX notation.
- parsed_sequence (list of str, optional) – A polypeptide sequence parsed into a list of amino acids.
- split_sequence (list of tuples of str, optional) – A polypeptide sequence parsed into a list of tuples
(as returned by
pyteomics.parser.parse()
with split=True
). - aa_comp (dict, optional) – A dict with the elemental composition of the amino acids (the default value is std_aa_comp).
- mass_data (dict, optional) – A dict with the masses of chemical elements (the default
value is
nist_mass
). It is used for formulae parsing only. - charge (int, optional) – If not 0 then additional protons are added to the composition.
- ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion
fragments (default is
std_ion_comp
). - ion_type (str, optional) – If specified, then the polypeptide is considered to be in the form of the corresponding ion. Do not forget to specify the charge state!
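The idea of a composition as a dict of element counts that supports addition and subtraction can be sketched with the standard library. Everything below is illustrative and assumed for the example: `parse_formula`, `combine`, `peptide_composition` and the two-residue table are hypothetical names, the formula parser handles only simple formulas without brackets or isotopes, and none of this is the Composition class itself.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Turn a simple formula such as 'C2H3NO' into a Counter of atoms."""
    comp = Counter()
    for element, count in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        comp[element] += int(count) if count else 1
    return comp

def combine(a, b, sign=1):
    """Return a + sign * b, dropping elements whose count becomes zero."""
    out = Counter(a)
    for element, count in b.items():
        out[element] += sign * count
    return Counter({el: n for el, n in out.items() if n != 0})

# Residue compositions for glycine and alanine only (a tiny stand-in table).
RESIDUES = {'G': parse_formula('C2H3NO'), 'A': parse_formula('C3H5NO')}
WATER = parse_formula('H2O')

def peptide_composition(sequence):
    """Sum the residues and add H2O for the standard H- and -OH termini."""
    total = Counter(WATER)
    for aa in sequence:
        total = combine(total, RESIDUES[aa])
    return total

ga = peptide_composition('GA')
print(dict(ga))
print(dict(combine(ga, WATER, sign=-1)))
```

Subtracting water from the dipeptide recovers the bare sum of residues, which is the kind of arithmetic that plain dicts do not offer.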
-
Composition.
mass
(**kwargs)[source]¶ Calculate the mass or m/z of a
Composition
.
Parameters: - average (bool, optional) – If
True
then the average mass is calculated. Note that mass is not averaged for elements with specified isotopes. Default is False
. - charge (int, optional) – If not 0 then m/z is calculated: the mass is increased by the corresponding number of proton masses and divided by charge.
- mass_data (dict, optional) – A dict with the masses of the chemical elements (the default
value is
nist_mass
). - ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion
fragments (default is
std_ion_comp
). - ion_type (str, optional) – If specified, then the polypeptide is considered to be in the form of the corresponding ion. Do not forget to specify the charge state!
Returns: mass
Return type:
-
class
pyteomics.mass.mass.
Unimod
(source='http://www.unimod.org/xml/unimod.xml')[source]¶ Bases:
object
A class for Unimod database of modifications. The list of all modifications can be retrieved via mods attribute. Methods for convenient searching are by_title and by_name. For more elaborate filtering, iterate manually over the list.
Note
See
pyteomics.mass.unimod
for a new alternative class with more features.
-
__init__
(source='http://www.unimod.org/xml/unimod.xml')[source]¶ Create a database and fill it from XML file retrieved from source.
Parameters: source (str or file, optional) – A file-like object or a URL to read from. Don’t forget the 'file://'
prefix when pointing to local files.
-
by_id
(i)[source]¶ Search modifications by record ID. If a modification is found, it is returned. Otherwise,
KeyError
is raised.
Parameters: i (int or str) – The Unimod record ID.
Returns: out – A single modification dict.
Return type: dict
-
by_name
(name, strict=True)[source]¶ Search modifications by name. If a single modification is found, it is returned. Otherwise, a list will be returned.
Parameters: Returns: out – A single modification or a list of modifications.
Return type:
-
by_title
(title, strict=True)[source]¶ Search modifications by title. If a single modification is found, it is returned. Otherwise, a list will be returned.
Parameters: Returns: out – A single modification or a list of modifications.
Return type:
-
mass_data
¶ Get element mass data extracted from the database
-
mods
¶ Get the list of Unimod modifications
-
-
pyteomics.mass.mass.
calculate_mass
(*args, **kwargs)[source]¶ Calculates the monoisotopic mass of a polypeptide defined by a sequence string, parsed sequence, chemical formula or Composition object.
One or none of the following keyword arguments is required: formula, sequence, parsed_sequence, split_sequence or composition. All arguments given are used to create a
Composition
object, unless an existing one is passed as a keyword argument.
Note that if a sequence string is supplied and terminal groups are not explicitly shown, then the mass is calculated for a polypeptide with standard terminal groups (NH2- and -OH).
Warning
Be careful when supplying a list with a parsed sequence. It must be obtained with the show_unmodified_termini option enabled.
Parameters: - formula (str, optional) – A string with a chemical formula.
- sequence (str, optional) – A polypeptide sequence string in modX notation.
- parsed_sequence (list of str, optional) – A polypeptide sequence parsed into a list of amino acids.
- composition (Composition, optional) – A Composition object with the elemental composition of a substance.
- aa_comp (dict, optional) – A dict with the elemental composition of the amino acids (the default value is std_aa_comp).
- average (bool, optional) – If
True
then the average mass is calculated. Note that mass is not averaged for elements with specified isotopes. Default is False
. - charge (int, optional) – If not 0 then m/z is calculated: the mass is increased by the corresponding number of proton masses and divided by charge.
- mass_data (dict, optional) – A dict with the masses of the chemical elements (the default
value is
nist_mass
). - ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion
fragments (default is
std_ion_comp
). - ion_type (str, optional) – If specified, then the polypeptide is considered to be in the form of the corresponding ion. Do not forget to specify the charge state!
Returns: mass
Return type:
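The charge rule stated above (add one proton mass per charge, then divide by the charge) can be written out in a few lines. This is a sketch under stated assumptions: `mono_mass` is a hypothetical helper, the composition is passed as a plain dict, and the isotope and proton masses are approximate values typed in for the example rather than read from nist_mass.

```python
# Approximate monoisotopic masses in Da (assumed values for illustration).
MONO = {'H': 1.00782503207, 'C': 12.0, 'N': 14.0030740048, 'O': 15.99491461956}
PROTON = 1.00727646677

def mono_mass(composition, charge=0):
    """Monoisotopic mass of a composition, or m/z when charge is non-zero."""
    mass = sum(MONO[element] * count for element, count in composition.items())
    if charge:
        # m/z: add one proton mass per charge, then divide by the charge.
        mass = (mass + charge * PROTON) / charge
    return mass

water = {'H': 2, 'O': 1}
print(mono_mass(water))            # neutral mass
print(mono_mass(water, charge=1))  # m/z of the singly protonated species
```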
-
pyteomics.mass.mass.
fast_mass
(sequence, ion_type=None, charge=None, **kwargs)[source]¶ Calculate monoisotopic mass of an ion using the fast algorithm. May be used only if amino acid residues are presented in one-letter code.
Parameters: - sequence (str) – A polypeptide sequence string.
- ion_type (str, optional) – If specified, then the polypeptide is considered to be in the form of the corresponding ion. Do not forget to specify the charge state!
- charge (int, optional) – If not 0 then m/z is calculated: the mass is increased by the corresponding number of proton masses and divided by z.
- mass_data (dict, optional) – A dict with the masses of chemical elements (the default
value is
nist_mass
). - aa_mass (dict, optional) – A dict with the monoisotopic mass of amino acid residues (default is std_aa_mass);
- ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion
fragments (default is
std_ion_comp
).
Returns: mass – Monoisotopic mass or m/z of a peptide molecule/ion.
Return type:
-
pyteomics.mass.mass.
fast_mass2
(sequence, ion_type=None, charge=None, **kwargs)[source]¶ Calculate monoisotopic mass of an ion using the fast algorithm. modX notation is fully supported.
Parameters: - sequence (str) – A polypeptide sequence string.
- ion_type (str, optional) – If specified, then the polypeptide is considered to be in the form of the corresponding ion. Do not forget to specify the charge state!
- charge (int, optional) – If not 0 then m/z is calculated: the mass is increased by the corresponding number of proton masses and divided by z.
- mass_data (dict, optional) – A dict with the masses of chemical elements (the default
value is
nist_mass
). - aa_mass (dict, optional) – A dict with the monoisotopic mass of amino acid residues (default is std_aa_mass);
- ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion
fragments (default is
std_ion_comp
).
Returns: mass – Monoisotopic mass or m/z of a peptide molecule/ion.
Return type:
-
pyteomics.mass.mass.
isotopic_composition_abundance
(*args, **kwargs)[source]¶ Calculate the relative abundance of a given isotopic composition of a molecule.
Parameters: Returns: relative_abundance – The relative abundance of a given isotopic composition.
Return type:
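One common way to model the relative abundance of an isotopic composition is a product of multinomial terms, one per element. The sketch below illustrates that model only; it is not taken from the library source. `isotopic_abundance` is a hypothetical helper, and the natural abundances are rounded values assumed for the example.

```python
from math import factorial

# Assumed, rounded natural abundances keyed by mass number (illustration only).
ABUNDANCE = {
    'C': {12: 0.9893, 13: 0.0107},
    'H': {1: 0.999885, 2: 0.000115},
}

def isotopic_abundance(isotope_counts):
    """isotope_counts maps element -> {mass number: number of atoms}."""
    result = 1.0
    for element, counts in isotope_counts.items():
        # Multinomial coefficient n! / (k1! * k2! * ...) for this element.
        coeff = factorial(sum(counts.values()))
        for mass_number, k in counts.items():
            coeff //= factorial(k)
            result *= ABUNDANCE[element][mass_number] ** k
        result *= coeff
    return result

# Two carbon atoms, exactly one of them 13C.
print(isotopic_abundance({'C': {12: 1, 13: 1}}))
```

Summing this quantity over every way of distributing isotopes among the atoms of a molecule gives 1, which is a quick sanity check for any such table.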
-
pyteomics.mass.mass.
isotopologues
(*args, **kwargs)[source]¶ Iterate over possible isotopic states of a molecule. The molecule can be defined by formula, sequence, parsed sequence, or composition. The space of possible isotopic compositions is restrained by parameters
elements_with_isotopes
, isotope_threshold
, overall_threshold
.
Parameters: - formula (str, optional) – A string with a chemical formula.
- sequence (str, optional) – A polypeptide sequence string in modX notation.
- parsed_sequence (list of str, optional) – A polypeptide sequence parsed into a list of amino acids.
- composition (
Composition
, optional) – A Composition
object with the elemental composition of a substance. - report_abundance (bool, optional) – If
True
, the output will contain 2-tuples: (composition, abundance). Otherwise, only compositions are yielded. Default is False
. - elements_with_isotopes (container of str, optional) – A set of elements to be considered in isotopic distribution (by default, every element has an isotopic distribution).
- isotope_threshold (float, optional) – The threshold abundance of a specific isotope to be considered.
Default is
5e-4
. - overall_threshold (float, optional) – The threshold abundance of the calculated isotopic composition.
Default is
0
. - aa_comp (dict, optional) – A dict with the elemental composition of the amino acids (the
default value is
std_aa_comp
). - mass_data (dict, optional) – A dict with the masses of chemical elements (the default
value is
nist_mass
).
Returns: out – Iterator over possible isotopic compositions.
Return type: iterator
-
pyteomics.mass.mass.
most_probable_isotopic_composition
(*args, **kwargs)[source]¶ Calculate the most probable isotopic composition of a peptide molecule/ion defined by a sequence string, parsed sequence, chemical formula or
Composition
object.
Note that if a sequence string without terminal groups is supplied then the isotopic composition is calculated for a polypeptide with standard terminal groups (H- and -OH).
For each element, only two most abundant isotopes are considered.
Parameters: - formula (str, optional) – A string with a chemical formula.
- sequence (str, optional) – A polypeptide sequence string in modX notation.
- parsed_sequence (list of str, optional) – A polypeptide sequence parsed into a list of amino acids.
- composition (
Composition
, optional) – A Composition
object with the elemental composition of a substance. - elements_with_isotopes (list of str) – A list of elements to be considered in isotopic distribution (by default, every element has an isotopic distribution).
- aa_comp (dict, optional) – A dict with the elemental composition of the amino acids (the
default value is
std_aa_comp
). - mass_data (dict, optional) – A dict with the masses of chemical elements (the default
value is
nist_mass
). - ion_comp (dict, optional) – A dict with the relative elemental compositions of peptide ion
fragments (default is
std_ion_comp
).
Returns: out – A tuple with the most probable isotopic composition and its relative abundance.
Return type:
-
pyteomics.mass.mass.
nist_mass
A dict with the exact element masses downloaded from the NIST website: http://www.nist.gov/pml/data/comp.cfm . There are entries for each element containing the masses and relative abundances of several abundant isotopes and a separate entry for undefined isotope with zero key, mass of the most abundant isotope and 1.0 abundance.
-
pyteomics.mass.mass.
std_aa_comp
¶ A dictionary with elemental compositions of the twenty standard amino acid residues, selenocysteine, pyrrolysine, and standard H- and -OH terminal groups.
-
pyteomics.mass.mass.
std_aa_mass
¶ A dictionary with monoisotopic masses of the twenty standard amino acid residues, selenocysteine and pyrrolysine.
-
pyteomics.mass.mass.
std_ion_comp
¶ A dict with relative elemental compositions of the standard peptide fragment ions. An elemental composition of a fragment ion is calculated as a difference between the total elemental composition of an ion and the sum of elemental compositions of its constituting amino acid residues.
unimod - interface to the Unimod database¶
This module provides an interface to the relational Unimod database.
The main class is Unimod
.
Dependencies¶
This module requires lxml
and sqlalchemy
.
-
class
pyteomics.mass.unimod.
AlternativeName
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
AminoAcid
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
, pyteomics.mass.unimod.HasFullNameMixin
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
Brick
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
, pyteomics.mass.unimod.HasFullNameMixin
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
BrickToElement
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
Classification
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
Crossreference
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
CrossreferenceSource
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
Element
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
, pyteomics.mass.unimod.HasFullNameMixin
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
Fragment
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
FragmentComposition
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
HasFullNameMixin
[source]¶ Bases:
object
A simple mixin to standardize equality operators for models with a
full_name
attribute.
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
pyteomics.mass.unimod.
MiscNotesModifications
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
Modification
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
, pyteomics.mass.unimod.HasFullNameMixin
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
ModificationToBrick
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
NeutralLoss
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
Position
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
Specificity
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
SpecificityToNeutralLoss
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in
kwargs
. Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
pyteomics.mass.unimod.
Unimod
(path=None)[source]¶ Bases:
object
Main class representing the relational Unimod database.
-
__init__
(path=None)[source]¶ Initialize the object from a database file.
Parameters: path (str or None, optional) – If str
, should point to a database. Use a dialect-specific prefix, like 'sqlite://'
. If None
(default), a relational XML file will be downloaded from the default location.
-
by_name
(identifier, strict=True)¶ Get a modification matching identifier. Replaces both
by_name
and by_title
methods in the old class.
Parameters: Returns: out
Return type:
-
by_title
(identifier, strict=True)¶ Get a modification matching identifier. Replaces both
by_name
and by_title
methods in the old class.
Parameters: Returns: out
Return type:
-
-
pyteomics.mass.unimod.
has_composition
(attr_name)[source]¶ A decorator to simplify flagging a Model with a column to be treated as a formula for parsing. Calls
_composition_listener()
internally.
-
pyteomics.mass.unimod.
load
(doc_path, output_path='sqlite://')[source]¶ Parse the relational table-like XML file provided by http://www.unimod.org/downloads.html and convert each <tag>_row into an equivalent database entry.
By default the table will be held in memory.
achrom - additive model of polypeptide chromatography¶
Summary¶
The additive model of polypeptide chromatography, or achrom, is the most basic model for peptide retention time prediction. The main equation behind achrom has the following form:
$$RT = \sum_{i=1}^{N} RC_i \cdot n_i + RT_0$$

Here, $RC_i$ is the retention coefficient of the amino acid residues of the i-th type, $n_i$ corresponds to the number of amino acid residues of type $i$ in the peptide sequence, $N$ is the total number of different types of amino acid residues present, and $RT_0$ is a constant retention time shift.
In order to use achrom, one needs to find the retention coefficients using experimentally determined retention times for a training set of peptides, i.e. to calibrate the model.
Calibration¶
get_RCs()
- find a set of retention coefficients using a given set of peptides with known retention times and a fixed value of length correction parameter.
get_RCs_vary_lcp()
- find the best length correction parameter and a set of retention coefficients for a given peptide sample.
Retention time calculation¶
calculate_RT()
- calculate the retention time of a peptide using a given set of retention coefficients.
Data¶
RCs_guo_ph2_0
- a set of retention coefficients (RCs) from [2]. Conditions: Synchropak RP-P C18 column (250 x 4.1 mm I.D.), gradient (A = 0.1% aq. TFA, pH 2.0; B = 0.1% TFA in acetonitrile) at 1% B/min, flow rate 1 ml/min, 26 centigrades.
RCs_guo_ph7_0
- a set of retention coefficients (RCs) from [2]. Conditions: Synchropak RP-P C18 column (250 x 4.1 mm I.D.), gradient (A = aq. 10 mM (NH4)2HPO4 - 0.1 M NaClO4, pH 7.0; B = 0.1 M NaClO4 in 60% aq. acetonitrile) at 1.67% B/min, flow rate 1 ml/min, 26 centigrades.
RCs_meek_ph2_1
- a set of RCs from [1]. Conditions: Bio-Rad “ODS” column, gradient (A = 0.1 M NaClO4, 0.1% phosphoric acid in water; B = 0.1 M NaClO4, 0.1% phosphoric acid in 60% aq. acetonitrile) at 1.25% B/min, room temperature.
RCs_meek_ph7_4
- a set of RCs from [1]. Conditions: Bio-Rad “ODS” column, gradient (A = 0.1 M NaClO4, 5 mM phosphate buffer in water; B = 0.1 M NaClO4, 5 mM phosphate buffer in 60% aq. acetonitrile) at 1.25% B/min, room temperature.
RCs_browne_tfa
- a set of RCs found in [7]. Conditions: Waters μBondapak C18 column, gradient (A = 0.1% aq. TFA, B = 0.1% TFA in acetonitrile) at 0.33% B/min, flow rate 1.5 ml/min.
RCs_browne_hfba
- a set of RCs found in [7]. Conditions: Waters μBondapak C18 column, gradient (A = 0.13% aq. HFBA, B = 0.13% HFBA in acetonitrile) at 0.33% B/min, flow rate 1.5 ml/min.
RCs_palmblad
- a set of RCs from [8]. Conditions: a fused silica column (80-100 x 0.200 mm I.D.) packed in-house with C18 ODS-AQ; solvent A = 0.5% aq. HAc, B = 0.5% HAc in acetonitrile.
RCs_yoshida
- a set of RCs for normal phase chromatography from [9]. Conditions: TSK gel Amide-80 column (250 x 4.6 mm I.D.), gradient (A = 0.1% TFA in ACN-water (90:10); B = 0.1% TFA in ACN-water (55:45)) at 0.6% water/min, flow rate 1.0 ml/min, 40 centigrades.
RCs_yoshida_lc
- a set of length-corrected RCs for normal phase chromatography. The set was calculated in [10] for the data from [9]. Conditions: TSK gel Amide-80 column (250 x 4.6 mm I.D.), gradient (A = 0.1% TFA in ACN-water (90:10); B = 0.1% TFA in ACN-water (55:45)) at 0.6% water/min, flow rate 1.0 ml/min, 40 centigrades.
RCs_zubarev
- a set of length-corrected RCs calculated on a dataset used in [11]. Conditions: Reprosil-Pur C18-AQ column (150 x 0.075 mm I.D.), gradient (A = 0.5% AA in water; B = 0.5% AA in ACN-water (90:10)) at 0.5% water/min, flow rate 200.0 nl/min, room temperature.
RCs_gilar_atlantis_ph3_0
- a set of retention coefficients obtained in [12]. Conditions: Atlantis HILIC silica column, (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 3.0
RCs_gilar_atlantis_ph4_5
- a set of retention coefficients obtained in [12]. Conditions: Atlantis HILIC silica column, (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 4.5
RCs_gilar_atlantis_ph10_0
- a set of retention coefficients obtained in [12]. Conditions: Atlantis HILIC silica column, (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 10.0
RCs_gilar_beh
- a set of retention coefficients obtained in [12]. Conditions: ACQUITY UPLC BEH HILIC column (150 x 2.1 mm I.D.), 1.7 um, 130 A, Mobile phase A: 10 mM ammonium formate buffer, pH 4.5 prepared by titrating 10 mM solution of FA with ammonium hydroxide. Mobile phase B: 90% ACN, 10% mobile phase A (v:v). Gradient: 90-60% B in 50 min.
RCs_gilar_beh_amide
- a set of retention coefficients obtained in [12]. Conditions: ACQUITY UPLC BEH glycan column (150 x 2.1 mm I.D.), 1.7 um, 130 A, Mobile phase A: 10 mM ammonium formate buffer, pH 4.5 prepared by titrating 10 mM solution of FA with ammonium hydroxide. Mobile phase B: 90% ACN, 10% mobile phase A (v:v). Gradient: 90-60% B in 50 min.
RCs_gilar_rp
- a set of retention coefficients obtained in [12]. Conditions: ACQUITY UPLC BEH C18 column (100 mm x 2.1 mm I.D.), 1.7 um, 130 A. Mobile phase A: 0.02% TFA in water, mobile phase B: 0.018% TFA in ACN. Gradient: 0 to 50% B in 50 min, flow rate 0.2 ml/min, temperature 40 C., pH 2.6.
RCs_krokhin_100A_fa
- a set of retention coefficients obtained in [13]. Conditions: 300 um x 150mm PepMap100 (Dionex, 0.1% FA), packed with 5-um Luna C18(2) (Phenomenex, Torrance, CA), pH=2.0. Both eluents A (2% ACN in water) and B (98% ACN) contained 0.1% FA as ion-pairing modifier. 0.33% ACN/min linear gradient (0-30% B).
RCs_krokhin_100A_tfa
- a set of retention coefficients obtained in [13]. Conditions: 300 um x 150mm PepMap100 (Dionex, 0.1% TFA), packed with 5-um Luna C18(2) (Phenomenex, Torrance, CA), pH=2.0. Both eluents A (2% ACN in water) and B (98% ACN) contained 0.1% TFA as ion-pairing modifier. 0.33% ACN/min linear gradient (0-30% B).
Theory¶
The additive model of polypeptide chromatography, or the model of retention coefficients, was the earliest attempt to describe the dependence of retention time of a polypeptide in liquid chromatography on its sequence [1], [2]. In this model, each amino acid is assigned a number, or a retention coefficient (RC), describing its retention properties. The retention time (RT) during a gradient elution is then calculated as:
$$RT = \sum_{i=1}^{N} RC_i + RT_0,$$
which is the sum of retention coefficients of all N amino acid residues in a polypeptide plus a constant shift $RT_0$. This equation can also be expressed in terms of linear algebra:
$$RT = \bar{aa} \cdot \bar{RC} + RT_0,$$
where $\bar{aa}$ is a vector of amino acid composition, i.e. $\bar{aa}_i$ is the number of amino acid residues of i-th type in a polypeptide; $\bar{RC}$ is a vector of respective retention coefficients.
In this formulation, it is clear that the additive model gives the same result for any two peptides with different sequences but the same amino acid composition. In other words, the additive model is not sequence-specific.
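The following is a minimal sketch of the additive model, not the pyteomics implementation. The function name `additive_rt` and the coefficient values are made up for illustration. It computes RT from the amino acid composition alone, which is why two peptides with the same composition get the same value:

```python
from collections import Counter


def additive_rt(sequence, rcs, rt0=0.0):
    """Sum the retention coefficients of all residues and add the shift rt0."""
    composition = Counter(sequence)  # residue type -> number of occurrences
    return sum(rcs[aa] * n for aa, n in composition.items()) + rt0


# Illustrative coefficients only; real sets come from calibration experiments.
example_rcs = {'A': 1.1, 'G': -0.2, 'L': 8.0}

rt_1 = additive_rt('GALA', example_rcs, rt0=0.1)
rt_2 = additive_rt('LAGA', example_rcs, rt0=0.1)  # same composition, other order
```

Both calls return the same retention time (up to floating-point rounding), even though the sequences differ.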
The additive model has two advantages over other models of chromatography: it is easy to understand and easy to use. The rule behind the additive model is as simple as it could be: each amino acid residue shifts retention time by a fixed value, depending only on its type. This rule allows a geometrical interpretation. Each peptide may be represented by a point in 21-dimensional space, with the first 20 coordinates equal to the numbers of corresponding amino acid residues in the peptide and the 21st coordinate equal to RT. The additive model assumes that a line may be drawn through these points. Of course, this assumption is valid only partially, and most points would not lie on the line. But the line would describe the main trend and could be used to estimate retention time for peptides with known amino acid composition.
This best-fit line is described by the retention coefficients and the constant term $RT_0$ (the fixed retention time shift). The procedure of finding these coefficients is called calibration. There is an analytical solution to calibration of linear models, which makes them especially useful in real applications.
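Calibration can be sketched as an ordinary least-squares fit of the retention coefficients and the constant shift to measured retention times. The code below is an illustration under stated assumptions, not the library's code. The helper names `solve_linear` and `calibrate` are hypothetical, and the normal equations are solved with plain Gaussian elimination to keep the example free of dependencies. In practice numpy, which the module requires, would be the natural tool.

```python
from collections import Counter


def solve_linear(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    # Augmented matrix, so that row operations act on a and b together.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError('singular system: not enough distinct peptides')
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * y for x, y in zip(m[row], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def calibrate(sequences, rts):
    """Fit RCs and the constant shift by least squares (normal equations)."""
    labels = sorted(set(''.join(sequences)))
    rows = []
    for seq in sequences:
        counts = Counter(seq)
        # Residue counts followed by a constant 1 for the RT_0 term.
        rows.append([float(counts[aa]) for aa in labels] + [1.0])
    k = len(labels) + 1
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * rt for r, rt in zip(rows, rts)) for i in range(k)]
    solution = solve_linear(ata, atb)
    return dict(zip(labels, solution[:-1])), solution[-1]


rcs, rt0 = calibrate(['A', 'AA', 'B'], [1.0, 2.0, 2.0])
```

For this toy data set the fit recovers a coefficient of 1 for A, 2 for B, and a zero shift, up to rounding.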
Several attempts have been made to improve the accuracy of prediction by the additive model (for a review of the field, see [3] and [4]). The two implemented in this module are the logarithmic length correction term described in [5] and additional sets of retention coefficients for terminal amino acid residues [6].
Logarithmic length correction¶
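This enhancement was first described in [5]. Briefly, it was found that the following equation better describes the dependence of RT on the peptide sequence:
$$RT = \sum_{i=1}^{N} RC_i + m \ln N \cdot \sum_{i=1}^{N} RC_i + RT_0,$$
where N is the number of amino acid residues in the peptide. We call the second term, $m \ln N \cdot \sum_{i=1}^{N} RC_i$, the length correction term and m the length correction parameter. The simplified and vectorized form of this equation is:
$$RT = (1 + m \ln N) \, \bar{RC} \cdot \bar{aa} + RT_0$$
This equation may be reduced to a linear form and solved by the standard methods.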
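As a rough illustration, assume the length-corrected form RT = (1 + m ln N) * sum(RC_i) + RT_0, where N is the peptide length. The correction then simply rescales the additive sum. The function name `length_corrected_rt` and the coefficient value are hypothetical; m = -0.21 is the default `lcp` of `get_RCs()`. For a fixed m the expression stays linear in the RCs and RT_0, which is what allows calibration by standard linear methods.

```python
import math


def length_corrected_rt(sequence, rcs, m, rt0=0.0):
    """RT = (1 + m * ln N) * sum(RC_i) + RT_0, with N the peptide length."""
    rc_sum = sum(rcs[aa] for aa in sequence)
    return (1.0 + m * math.log(len(sequence))) * rc_sum + rt0


rcs = {'A': 1.0}  # illustrative coefficient, not a measured value
plain = length_corrected_rt('AAA', rcs, m=0.0)        # reduces to the additive model
corrected = length_corrected_rt('AAA', rcs, m=-0.21)  # negative m shortens the predicted RT
```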
Terminal retention coefficients¶
Another significant improvement may be obtained through introduction of separate sets of retention coefficients for terminal amino acid residues [6].
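A possible way to picture terminal coefficients is to relabel the first and last residues with the 'nterm'/'cterm' prefixes that this module uses as keys and then sum as before. The sketch below is only an illustration. The function name is hypothetical, and falling back to the internal coefficient when no terminal one is supplied is an assumption of the sketch, not documented library behaviour.

```python
def rt_with_terminal_rcs(sequence, rcs, rt0=0.0):
    """Additive RT where terminal residues may have their own coefficients."""
    labels = list(sequence)
    labels[0] = 'nterm' + labels[0]
    if len(labels) > 1:
        labels[-1] = 'cterm' + labels[-1]
    total = rt0
    for label in labels:
        amino_acid = label[-1]
        # Assumption: use the internal coefficient if no terminal one exists.
        total += rcs[label] if label in rcs else rcs[amino_acid]
    return total


rcs = {'ntermA': 1.0, 'A': 1.1, 'ctermA': 1.2}
rt = rt_with_terminal_rcs('AAA', rcs, rt0=0.1)
```

With these values the three residues contribute 1.0, 1.1 and 1.2, so the result agrees with the `calculate_RT` example that uses the same coefficients.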
References
[1] | (1, 2, 3) Meek, J. L. Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition. PNAS, 1980, 77 (3), 1632-1636. |
[2] | (1, 2, 3) Guo, D.; Mant, C. T.; Taneja, A. K.; Parker, J. M. R.; Hodges, R. S. Prediction of peptide retention times in reversed-phase high-performance liquid chromatography I. Determination of retention coefficients of amino acid residues of model synthetic peptides. Journal of Chromatography A, 1986, 359, 499-518. |
[3] | Baczek, T.; Kaliszan, R. Predictions of peptides’ retention times in reversed-phase liquid chromatography as a new supportive tool to improve protein identification in proteomics. Proteomics, 2009, 9 (4), 835-47. |
[4] | Babushok, V. I.; Zenkevich, I. G. Retention Characteristics of Peptides in RP-LC: Peptide Retention Prediction. Chromatographia, 2010, 72 (9-10), 781-797. |
[5] | (1, 2) Mant, C. T.; Zhou, N. E.; Hodges, R. S. Correlation of protein retention times in reversed-phase chromatography with polypeptide chain length and hydrophobicity. Journal of Chromatography A, 1989, 476, 363-375. |
[6] | (1, 2) Tripet, B.; Cepeniene, D.; Kovacs, J. M.; Mant, C. T.; Krokhin, O. V.; Hodges, R. S. Requirements for prediction of peptide retention time in reversed-phase high-performance liquid chromatography: hydrophilicity/hydrophobicity of side-chains at the N- and C-termini of peptides are dramatically affected by the end-groups and location. Journal of chromatography A, 2007, 1141 (2), 212-25. |
[7] | (1, 2) Browne, C. A.; Bennett, H. P. J.; Solomon, S. The isolation of peptides by high-performance liquid chromatography using predicted elution positions. Analytical Biochemistry, 1982, 124 (1), 201-208. |
[8] | Palmblad, M.; Ramstrom, M.; Markides, K. E.; Hakansson, P.; Bergquist, J. Prediction of Chromatographic Retention and Protein Identification in Liquid Chromatography/Mass Spectrometry. Analytical Chemistry, 2002, 74 (22), 5826-5830. |
[9] | (1, 2) Yoshida, T. Calculation of peptide retention coefficients in normal-phase liquid chromatography. Journal of Chromatography A, 1998, 808 (1-2), 105-112. |
[10] | Moskovets, E.; Goloborodko A. A.; Gorshkov A. V.; Gorshkov M.V. Limitation of predictive 2-D liquid chromatography in reducing the database search space in shotgun proteomics: In silico studies. Journal of Separation Science, 2012, 35 (14), 1771-1778. |
[11] | Goloborodko A. A.; Mayerhofer C.; Zubarev A. R.; Tarasova I. A.; Gorshkov A. V.; Zubarev, R. A.; Gorshkov, M. V. Empirical approach to false discovery rate estimation in shotgun proteomics. Rapid communications in mass spectrometry, 2010, 24(4), 454-62. |
[12] | (1, 2, 3, 4, 5, 6) Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6. |
[13] | (1, 2) Dwivedi, R. C.; Spicer, V.; Harder, M.; Antonovici, M.; Ens, W.; Standing, K. G.; Wilkins, J. A.; Krokhin, O. V. (2008). Practical implementation of 2D HPLC scheme with accurate peptide retention prediction in both dimensions for high-throughput bottom-up proteomics. Analytical Chemistry, 80(18), 7036-42. |
Dependencies¶
This module requires numpy.
-
pyteomics.achrom.
RCs_browne_hfba
¶ A set of retention coefficients determined in Browne, C. A.; Bennett, H. P. J.; Solomon, S. The isolation of peptides by high-performance liquid chromatography using predicted elution positions. Analytical Biochemistry, 1982, 124 (1), 201-208.
Conditions: Waters µBondapak C18 column, gradient (A = 0.13% aq. HFBA, B = 0.13% HFBA in acetonitrile) at 0.33% B/min, flow rate 1.5 ml/min.
-
pyteomics.achrom.
RCs_browne_tfa
¶ A set of retention coefficients determined in Browne, C. A.; Bennett, H. P. J.; Solomon, S. The isolation of peptides by high-performance liquid chromatography using predicted elution positions. Analytical Biochemistry, 1982, 124 (1), 201-208.
Conditions: Waters µBondapak C18 column, gradient (A = 0.1% aq. TFA, B = 0.1% TFA in acetonitrile) at 0.33% B/min, flow rate 1.5 ml/min.
-
pyteomics.achrom.
RCs_gilar_atlantis_ph10_0
¶ A set of retention coefficients for normal phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.
Note
Cysteine is Carbamidomethylated.
Conditions: Atlantis HILIC silica column (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 10.0
-
pyteomics.achrom.
RCs_gilar_atlantis_ph3_0
¶ A set of retention coefficients for normal phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.
Note
Cysteine is Carbamidomethylated.
Conditions: Atlantis HILIC silica column (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 3.0
-
pyteomics.achrom.
RCs_gilar_atlantis_ph4_5
¶ A set of retention coefficients for normal phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.
Note
Cysteine is Carbamidomethylated.
Conditions: Atlantis HILIC silica column (150 x 2.1 mm I.D.), 3 um, 100 A, gradient (A = water, B = ACN, C = 200 mM ammonium formate): 0 min, 5% A, 90% B, 5% C; 62.5 min, 55% A, 40% B, 5% C at 0.2 ml/min, temperature 40 C, pH 4.5
-
pyteomics.achrom.
RCs_gilar_beh
¶ A set of retention coefficients for normal phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.
Note
Cysteine is Carbamidomethylated.
Conditions: ACQUITY UPLC BEH HILIC column (150 x 2.1 mm I.D.), 1.7 um, 130 A, Mobile phase A: 10 mM ammonium formate buffer, pH 4.5 prepared by titrating 10 mM solution of FA with ammonium hydroxide. Mobile phase B: 90% ACN, 10% mobile phase A (v:v). Gradient: 90-60% B in 50 min.
-
pyteomics.achrom.
RCs_gilar_beh_amide
¶ A set of retention coefficients for normal phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.
Note
Cysteine is Carbamidomethylated.
Conditions: ACQUITY UPLC BEH glycan column (150 x 2.1 mm I.D.), 1.7 um, 130 A, Mobile phase A: 10 mM ammonium formate buffer, pH 4.5 prepared by titrating 10 mM solution of FA with ammonium hydroxide. Mobile phase B: 90% ACN, 10% mobile phase A (v:v). Gradient: 90-60% B in 50 min.
-
pyteomics.achrom.
RCs_gilar_rp
¶ A set of retention coefficients for reversed-phase chromatography obtained in Gilar, M., & Jaworski, A. (2011). Retention behavior of peptides in hydrophilic-interaction chromatography. Journal of chromatography A, 1218(49), 8890-6.
Note
Cysteine is Carbamidomethylated.
Conditions: ACQUITY UPLC BEH C18 column (100 mm x 2.1 mm I.D.), 1.7 um, 130 A. Mobile phase A: 0.02% TFA in water, mobile phase B: 0.018% TFA in ACN. Gradient: 0 to 50% B in 50 min, flow rate 0.2 ml/min, temperature 40 C., pH 2.6.
-
pyteomics.achrom.
RCs_guo_ph2_0
¶ A set of retention coefficients from Guo, D.; Mant, C. T.; Taneja, A. K.; Parker, J. M. R.; Hodges, R. S. Prediction of peptide retention times in reversed-phase high-performance liquid chromatography I. Determination of retention coefficients of amino acid residues of model synthetic peptides. Journal of Chromatography A, 1986, 359, 499-518.
Conditions: Synchropak RP-P C18 column (250 x 4.1 mm I.D.), gradient (A = 0.1% aq. TFA, pH 2.0; B = 0.1% TFA in acetonitrile) at 1% B/min, flow rate 1 ml/min, 26 centigrades.
-
pyteomics.achrom.
RCs_guo_ph7_0
¶ A set of retention coefficients from Guo, D.; Mant, C. T.; Taneja, A. K.; Parker, J. M. R.; Hodges, R. S. Prediction of peptide retention times in reversed-phase high-performance liquid chromatography I. Determination of retention coefficients of amino acid residues of model synthetic peptides. Journal of Chromatography A, 1986, 359, 499-518.
Conditions: Synchropak RP-P C18 column (250 x 4.1 mm I.D.), gradient (A = aq. 10 mM (NH4)2HPO4 - 0.1 M NaClO4, pH 7.0; B = 0.1 M NaClO4 in 60% aq. acetonitrile) at 1.67% B/min, flow rate 1 ml/min, 26 centigrades.
-
pyteomics.achrom.
RCs_krokhin_100A_fa
¶ A set of retention coefficients from R.C. Dwivedi, V. Spicer, M. Harder, M. Antonovici, W. Ens, K.G. Standing, J.A. Wilkins, and O.V. Krokhin; Analytical Chemistry 2008 80 (18), 7036-7042. Practical Implementation of 2D HPLC Scheme with Accurate Peptide Retention Prediction in Both Dimensions for High-Throughput Bottom-Up Proteomics.
Note
Cysteine is Carbamidomethylated.
Conditions: 300 um x 150mm PepMap100 (Dionex, 0.1% FA), packed with 5-um Luna C18(2) (Phenomenex, Torrance, CA), pore size 100A, pH=2.0. Both eluents A (2% ACN in water) and B (98% ACN) contained 0.1% FA as ion-pairing modifier. 0.33% ACN/min linear gradient (0-30% B).
-
pyteomics.achrom.
RCs_krokhin_100A_tfa
¶ A set of retention coefficients from R.C. Dwivedi, V. Spicer, M. Harder, M. Antonovici, W. Ens, K.G. Standing, J.A. Wilkins, and O.V. Krokhin; Analytical Chemistry 2008 80 (18), 7036-7042. Practical Implementation of 2D HPLC Scheme with Accurate Peptide Retention Prediction in Both Dimensions for High-Throughput Bottom-Up Proteomics.
Note
Cysteine is Carbamidomethylated.
Conditions: 300 um x 150mm PepMap100 (Dionex, 0.1% TFA), packed with 5-um Luna C18(2) (Phenomenex, Torrance, CA), pore size 100 A, pH=2.0. Both eluents A (2% ACN in water) and B (98% ACN) contained 0.1% TFA as ion-pairing modifier. 0.33% ACN/min linear gradient (0-30% B).
-
pyteomics.achrom.
RCs_meek_ph2_1
¶ A set of retention coefficients determined in Meek, J. L. Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition. PNAS, 1980, 77 (3), 1632-1636.
Note
C stands for Cystine.
Conditions: Bio-Rad “ODS” column, gradient (A = 0.1 M NaClO4, 0.1% phosphoric acid in water; B = 0.1 M NaClO4, 0.1% phosphoric acid in 60% aq. acetonitrile) at 1.25% B/min, room temperature.
-
pyteomics.achrom.
RCs_meek_ph7_4
¶ A set of retention coefficients determined in Meek, J. L. Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition. PNAS, 1980, 77 (3), 1632-1636.
Note
C stands for Cystine.
Conditions: Bio-Rad “ODS” column, gradient (A = 0.1 M NaClO4, 5 mM phosphate buffer in water; B = 0.1 M NaClO4, 5 mM phosphate buffer in 60% aq. acetonitrile) at 1.25% B/min, room temperature.
-
pyteomics.achrom.
RCs_palmblad
¶ A set of retention coefficients determined in Palmblad, M.; Ramstrom, M.; Markides, K. E.; Hakansson, P.; Bergquist, J. Prediction of Chromatographic Retention and Protein Identification in Liquid Chromatography/Mass Spectrometry. Analytical Chemistry, 2002, 74 (22), 5826-5830.
Conditions: a fused silica column (80-100 x 0.200 mm I.D.) packed in-house with C18 ODS-AQ; solvent A = 0.5% aq. HAc, B = 0.5% HAc in acetonitrile.
-
pyteomics.achrom.
RCs_yoshida
¶ A set of retention coefficients determined in Yoshida, T. Calculation of peptide retention coefficients in normal-phase liquid chromatography. Journal of Chromatography A, 1998, 808 (1-2), 105-112.
Note
Cysteine is Carboxymethylated.
Conditions: TSK gel Amide-80 column (250 x 4.6 mm I.D.), gradient (A = 0.1% TFA in ACN-water (90:10); B = 0.1% TFA in ACN-water (55:45)) at 0.6% water/min, flow rate 1.0 ml/min, 40 centigrades.
-
pyteomics.achrom.
RCs_yoshida_lc
¶ A set of retention coefficients from the length-corrected model of normal-phase peptide chromatography. The dataset comes from Yoshida, T. Calculation of peptide retention coefficients in normal-phase liquid chromatography. Journal of Chromatography A, 1998, 808 (1-2), 105-112. The RCs were calculated in Moskovets, E.; Goloborodko A. A.; Gorshkov A. V.; Gorshkov M.V. Limitation of predictive 2-D liquid chromatography in reducing the database search space in shotgun proteomics: In silico studies. Journal of Separation Science, 2012, 35 (14), 1771-1778.
Note
Cysteine is Carboxymethylated.
Conditions: TSK gel Amide-80 column (250 x 4.6 mm I.D.), gradient (A = 0.1% TFA in ACN-water (90:10); B = 0.1% TFA in ACN-water (55:45)) at 0.6% water/min, flow rate 1.0 ml/min, 40 centigrades.
-
pyteomics.achrom.
RCs_zubarev
¶ A set of retention coefficients from the length-corrected model of reversed-phase peptide chromatography. The dataset was taken from Goloborodko A. A.; Mayerhofer C.; Zubarev A. R.; Tarasova I. A.; Gorshkov A. V.; Zubarev, R. A.; Gorshkov, M. V. Empirical approach to false discovery rate estimation in shotgun proteomics. Rapid communications in mass spectrometry, 2010, 24(4), 454-62.
Note
Cysteine is Carbamidomethylated.
Conditions: Reprosil-Pur C18-AQ column (150 x 0.075 mm I.D.), gradient (A = 0.5% AA in water; B = 0.5% AA in ACN-water (90:10)) at 0.5% water/min, flow rate 200.0 nl/min, room temperature.
-
pyteomics.achrom.calculate_RT(peptide, RC_dict, raise_no_mod=True)
Calculate the retention time of a peptide using a given set of retention coefficients.
Parameters:
- peptide (str or dict) – A peptide sequence or amino acid composition.
- RC_dict (dict) – A set of retention coefficients, length correction parameter and a fixed retention time shift. Keys are: ‘aa’, ‘lcp’ and ‘const’.
- raise_no_mod (bool, optional) – If True, an exception is raised when a modified amino acid from peptide is not found in RC_dict. If False, the retention coefficient for the non-modified amino acid residue is used instead. True by default.
Returns: RT – Calculated retention time.
Return type: float
Examples
>>> RT = calculate_RT('AA', {'aa': {'A': 1.1}, 'lcp':0.0, 'const': 0.1})
>>> abs(RT - 2.3) < 1e-6      # Float comparison
True
>>> RT = calculate_RT('AAA', {'aa': {'ntermA': 1.0, 'A': 1.1, 'ctermA': 1.2}, 'lcp': 0.0, 'const':0.1})
>>> abs(RT - 3.4) < 1e-6      # Float comparison
True
>>> RT = calculate_RT({'A': 3}, {'aa': {'ntermA': 1.0, 'A': 1.1, 'ctermA': 1.2}, 'lcp': 0.0, 'const':0.1})
>>> abs(RT - 3.4) < 1e-6      # Float comparison
True
-
pyteomics.achrom.get_RCs(sequences, RTs, lcp=-0.21, term_aa=False, **kwargs)
Calculate the retention coefficients of amino acids using retention times of a peptide sample and a fixed value of length correction parameter.
Parameters:
- sequences (list of str) – List of peptide sequences.
- RTs (list of float) – List of corresponding retention times.
- lcp (float, optional) – A multiplier before ln(L) term in the equation for the retention time of a peptide. Set to -0.21 by default.
- term_aa (bool, optional) – If True, terminal amino acids are treated as being modified with ‘ntermX’/‘ctermX’ modifications. False by default.
- labels (list of str, optional) – List of all possible amino acids and terminal groups. If not given, any modX labels are allowed.
Returns: RC_dict – Dictionary with the calculated retention coefficients.
- RC_dict[‘aa’] – amino acid retention coefficients.
- RC_dict[‘const’] – constant retention time shift.
- RC_dict[‘lcp’] – length correction parameter.
Return type: dict
Examples
>>> RCs = get_RCs(['A','AA'], [1.0, 2.0], 0.0, labels=['A'])
>>> abs(RCs['aa']['A'] - 1) < 1e-6 and abs(RCs['const']) < 1e-6
True
>>> RCs = get_RCs(['A','AA','B'], [1.0, 2.0, 2.0], 0.0, labels=['A','B'])
>>> abs(RCs['aa']['A'] - 1) + abs(RCs['aa']['B'] - 2) + abs(RCs['const']) < 1e-6
True
-
pyteomics.achrom.get_RCs_vary_lcp(sequences, RTs, term_aa=False, lcp_range=(-1.0, 1.0), **kwargs)
Find the best combination of a length correction parameter and retention coefficients for a given peptide sample.
Parameters:
- sequences (list of str) – List of peptide sequences.
- RTs (list of float) – List of corresponding retention times.
- term_aa (bool, optional) – If True, terminal amino acids are treated as being modified with ‘ntermX’/‘ctermX’ modifications. False by default.
- lcp_range (2-tuple of float, optional) – Range of possible values of the length correction parameter.
- labels (list of str, optional) – List of labels for all possible amino acids and terminal groups. If not given, any modX labels are allowed.
- lcp_accuracy (float, optional) – The accuracy of the length correction parameter calculation.
Returns: RC_dict – Dictionary with the calculated retention coefficients.
- RC_dict[‘aa’] – amino acid retention coefficients.
- RC_dict[‘const’] – constant retention time shift.
- RC_dict[‘lcp’] – length correction parameter.
Return type: dict
Examples
>>> RCs = get_RCs_vary_lcp(['A', 'AA', 'AAA'], [1.0, 2.0, 3.0], labels=['A'])
>>> abs(RCs['aa']['A'] - 1) + abs(RCs['lcp']) + abs(RCs['const']) < 1e-6
True
electrochem - electrochemical properties of polypeptides¶
Summary¶
This module is used to calculate the electrochemical properties of polypeptide molecules.
The theory behind most of this module is based on the Henderson-Hasselbalch equation and was thoroughly described in a number of sources [1], [2].
Briefly, the formula for the charge of a polypeptide at a given pH is the following:
$$Q_{peptide} = \sum_i \frac{Q_i}{1 + 10^{Q_i (pH - pK_i)}},$$
where the sum is taken over all ionizable groups of the polypeptide, $pK_i$ is the pK of the i-th group, and $Q_i$ is -1 and +1 for acidic and basic functional groups, respectively.
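To make the formula concrete, here is a small sketch that evaluates Q / (1 + 10^(Q (pH - pK))) for each ionizable group and sums the results. It is not the pyteomics implementation. The function name `net_charge` is hypothetical, and the pK values are round numbers chosen for illustration rather than taken from any of the data sets in this module. Groups are given as (pK, charge) tuples, mirroring the structure of the `pK` dictionaries described below.

```python
def net_charge(groups, pH):
    """Sum Q / (1 + 10**(Q * (pH - pK))) over (pK, Q) pairs."""
    return sum(q / (1.0 + 10.0 ** (q * (pH - pk))) for pk, q in groups)


# Hypothetical molecule with one basic (Q = +1) and one acidic (Q = -1) group.
groups = [(9.0, +1), (4.0, -1)]

charge_acidic = net_charge(groups, 2.0)    # basic group protonated: close to +1
charge_neutral = net_charge(groups, 6.5)   # midway between the two pK values
charge_basic = net_charge(groups, 12.0)    # acidic group deprotonated: close to -1
```

At pH 6.5 the two contributions cancel, so the net charge is zero, which is the condition that defines the isoelectric point.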
Charge and pI functions¶
Data¶
pK_lehninger
- a set of pK from [3].
pK_sillero
- a set of pK from [4].
pK_dawson
- a set of pK from [5], the pK values for NH2- and -OH are taken from [4].
pK_rodwell
- a set of pK from [6].
pK_bjellqvist
- a set of pK from [7].
pK_nterm_bjellqvist
- a set of N-terminal pK from [7].
pK_cterm_bjellqvist
- a set of C-terminal pK from [7].
hydropathicity_KD
- a set of hydropathicity indexes from [8].
References
[1] | Aronson, J. N. The Henderson-Hasselbalch equation revisited. Biochemical Education, 1983, 11 (2), 68. Link. |
[2] | Moore, D. S.. Amino acid and peptide net charges: A simple calculational procedure. Biochemical Education, 1986, 13 (1), 10-12. Link. |
[3] | Nelson, D. L.; Cox, M. M. Lehninger Principles of Biochemistry, Fourth Edition; W. H. Freeman, 2004; p. 1100. |
[4] | (1, 2) Sillero, A.; Ribeiro, J. Isoelectric points of proteins: Theoretical determination. Analytical Biochemistry, 1989, 179 (2), 319-325. Link. |
[5] | Dawson, R. M. C.; Elliot, D. C.; Elliot, W. H.; Jones, K. M. Data for biochemical research. Oxford University Press, 1989; p. 592. |
[6] | Rodwell, J. Heterogeneity of component bands in isoelectric focusing patterns. Analytical Biochemistry, 1982, 119 (2), 440-449. Link. |
[7] | (1, 2, 3) Bjellqvist, B., Basse, B., Olsen, E. and Celis, J.E. Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis 1994, 15, 529-539. Link. |
[8] | Kyte, J.; Doolittle, R. F.. A simple method for displaying the hydropathic character of a protein. Journal of molecular biology 1982, 157 (1), 105-32. Link. |
-
pyteomics.electrochem.charge(sequence, pH, **kwargs)
Calculate the charge of a polypeptide at a given pH or list of pHs using a given list of amino acid electrochemical properties.
Warning
Be careful when supplying a list with a parsed sequence or a dict with amino acid composition as sequence. Such values must be obtained with enabled show_unmodified_termini option.
Warning
If you provide pK_nterm or pK_cterm and provide sequence as a dict, it is assumed that it was obtained with term_aa=True (see pyteomics.parser.amino_acid_composition() for details).
Parameters:
- sequence (str or list or dict) – A string with a polypeptide sequence, a list with a parsed sequence or a dict of amino acid composition.
- pH (float or iterable of floats) – pH or iterable of pHs for which the charge is calculated.
- pK (dict {str: [(float, int), ..]}, optional) – A set of pK of amino acids’ ionizable groups. It is a dict, where keys are amino acid labels and the values are lists of tuples (pK, charge_in_ionized_state), a tuple per ionizable group. The default value is pK_lehninger.
- pK_nterm, pK_cterm (dict {str: [(float, int),]}, optional) – Sets of pK of N-terminal and C-terminal (respectively) amino acids’ ionizable groups. Dicts with the same structure as pK. These values (if present) are used for N-terminal and C-terminal residues, respectively. If given, sequence must be a str or a list. The default value is an empty dict.
Returns: out – A single value of charge or a list of charges.
Return type: float or list of floats
-
pyteomics.electrochem.gravy(sequence, hydropathicity={'A': 1.8, 'C': 2.5, 'D': -3.5, 'E': -3.5, 'F': 2.8, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'K': -3.9, 'L': 3.8, 'M': 1.9, 'N': -3.5, 'P': -1.6, 'Q': -3.5, 'R': -4.5, 'S': -0.8, 'T': -0.7, 'V': 4.2, 'W': -0.9, 'Y': -1.3})
Calculate GRand AVerage of hYdropathicity (GRAVY) index for amino acid sequence.
Parameters:
- sequence (str) – Polypeptide sequence in one-letter format.
- hydropathicity (dict, optional) – Hydropathicity indexes of amino acids. Default is hydropathicity_KD.
Returns: out – GRand AVerage of hYdropathicity (GRAVY) index.
Return type: float
Examples
>>> gravy('PEPTIDE')
-1.4375
-
pyteomics.electrochem.hydropathicity_KD
A set of hydropathicity indexes obtained from Kyte J., Doolittle F. J. Mol. Biol. 157:105-132 (1982).
-
pyteomics.electrochem.pI(sequence, pI_range=(0.0, 14.0), precision_pI=0.01, **kwargs)
Calculate the isoelectric point of a polypeptide using a given set of amino acids’ electrochemical properties.
Warning
Be careful when supplying a list with a parsed sequence or a dict with amino acid composition as sequence. Such values must be obtained with enabled show_unmodified_termini option.
Parameters:
- sequence (str or list or dict) – A string with a polypeptide sequence, a list with a parsed sequence or a dict of amino acid composition.
- pI_range (tuple (float, float)) – The range of allowable pI values. Default is (0.0, 14.0).
- precision_pI (float) – The precision of the calculated pI. Default is 0.01.
- pK (dict {str: [(float, int), ..]}, optional) – A set of pK of amino acids’ ionizable groups. It is a dict, where keys are amino acid labels and the values are lists of tuples (pK, charge_in_ionized_state), a tuple per ionizable group. The default value is pK_lehninger.
- pK_nterm, pK_cterm (dict {str: [(float, int),]}, optional) – Sets of pK of N-terminal and C-terminal (respectively) amino acids’ ionizable groups. Dicts with the same structure as pK. These values (if present) are used for N-terminal and C-terminal residues, respectively. If given, sequence must be a str or a list. The default value is an empty dict.
Returns: out
Return type: float
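One simple way to locate an isoelectric point within a pH range and to a given precision is bisection on the net charge, which decreases monotonically with pH. The sketch below illustrates that idea only. It is not claimed to be the search strategy pyteomics uses, the function names are hypothetical, and the pK values are illustrative round numbers rather than one of the library's pK sets.

```python
def net_charge(groups, pH):
    """Henderson-Hasselbalch net charge for a list of (pK, Q) groups."""
    return sum(q / (1.0 + 10.0 ** (q * (pH - pk))) for pk, q in groups)


def isoelectric_point(groups, pI_range=(0.0, 14.0), precision=0.01):
    """Bisect pI_range until the zero-charge pH is bracketed within precision."""
    low, high = pI_range
    while high - low > precision:
        middle = (low + high) / 2.0
        if net_charge(groups, middle) > 0:
            low = middle    # still positive: the pI lies at a higher pH
        else:
            high = middle   # zero or negative: the pI lies at a lower pH
    return (low + high) / 2.0


groups = [(9.0, +1), (4.0, -1)]  # one basic and one acidic group
pI = isoelectric_point(groups)
```

For this symmetric pair of groups the search converges to a pH of about 6.5, halfway between the two pK values.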
-
pyteomics.electrochem.
pK_bjellqvist
¶ A set of pK from Bjellqvist, B., Basse, B., Olsen, E. and Celis, J.E. Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis 1994, 15, 529-539.
-
pyteomics.electrochem.
pK_cterm_bjellqvist
¶ A set of C-terminal pK from Bjellqvist, B., Basse, B., Olsen, E. and Celis, J.E. Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis 1994, 15, 529-539.
-
pyteomics.electrochem.
pK_dawson
¶ A set of pK from Dawson, R. M. C.; Elliot, D. C.; Elliot, W. H.; Jones, K. M. Data for biochemical research. Oxford University Press, 1989; p. 592. pKs for NH2- and -OH are taken from pK_sillero.
-
pyteomics.electrochem.
pK_lehninger
¶ A set of pK from Nelson, D. L.; Cox, M. M. Lehninger Principles of Biochemistry, Fourth Edition; W. H. Freeman, 2004; p. 1100.
-
pyteomics.electrochem.
pK_nterm_bjellqvist
¶ A set of N-terminal pK from Bjellqvist, B., Basse, B., Olsen, E. and Celis, J.E. Reference points for comparisons of two-dimensional maps of proteins from different human cell types defined in a pH scale where isoelectric points correlate with polypeptide compositions. Electrophoresis 1994, 15, 529-539.
-
pyteomics.electrochem.
pK_rodwell
¶ A set of pK from Rodwell, J. Heterogeneity of component bands in isoelectric focusing patterns. Analytical Biochemistry, vol. 119 (2), pp. 440-449, 1982.
-
pyteomics.electrochem.pK_sillero
A set of pK from Sillero, A.; Ribeiro, J. Isoelectric points of proteins: Theoretical determination. Analytical Biochemistry, vol. 179 (2), pp. 319-325, 1989.
fasta - manipulations with FASTA databases¶
FASTA is a simple file format for protein sequence databases. Please refer to the NCBI website for the most detailed information on the format.
Data manipulation¶
Classes¶
Several classes of FASTA parsers are available. All of them have common features:
- context manager support;
- header parsing;
- direct iteration.
Available classes:
FASTABase
- common ancestor, suitable for type checking. Abstract class.
FASTA
- text-mode, sequential parser. Good for iteration over database entries.
IndexedFASTA
- binary-mode, indexing parser. Supports direct indexing by header string.
TwoLayerIndexedFASTA
- additionally supports indexing by extracted header fields.
UniProt and IndexedUniProt, UniParc and IndexedUniParc, UniMes and IndexedUniMes, UniRef and IndexedUniRef, SPD and IndexedSPD, NCBI and IndexedNCBI
- format-specific parsers.
Functions¶
read()
- returns an instance of the appropriate reader class, for sequential iteration or random access.
chain()
- read multiple files at once.
chain.from_iterable()
- read multiple files at once, using an iterable of files.
write()
- write entries to a FASTA database.
parse()
- parse a FASTA header.
Decoy sequence generation¶
decoy_sequence()
- generate a decoy sequence from a given sequence, using one of the other functions listed in this section or any other callable.
reverse()
- generate a reversed decoy sequence.
shuffle()
- generate a shuffled decoy sequence.
fused_decoy()
- generate a “fused” decoy sequence.
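The two simplest decoy strategies listed above, reversing and shuffling, can be sketched in a few lines. These are stand-alone illustrations with hypothetical names (`reverse_decoy`, `shuffle_decoy`, and the `seed` argument are not part of the pyteomics API), and the library functions may offer additional options that are not shown here.

```python
import random


def reverse_decoy(sequence):
    """Return the sequence read from the C-terminus to the N-terminus."""
    return sequence[::-1]


def shuffle_decoy(sequence, seed=None):
    """Return a random permutation of the residues (seed gives repeatability)."""
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return ''.join(residues)


target = 'PEPTIDE'
reversed_seq = reverse_decoy(target)          # 'EDITPEP'
shuffled_seq = shuffle_decoy(target, seed=1)  # same residues, random order
```

Both decoys keep the amino acid composition of the target, so properties that depend only on composition, such as mass, are preserved.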
Decoy database generation¶
write_decoy_db()
- generate a decoy database and write it to a file.
decoy_db()
- generate entries for a decoy database from a given FASTA database.
decoy_chain()
- a version of decoy_db() for multiple files.
decoy_chain.from_iterable()
- like decoy_chain(), but with an iterable of files.
Auxiliary¶
std_parsers
- a dictionary with parsers for known FASTA header formats.
-
pyteomics.fasta.chain(*args, **kwargs)
Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.
-
chain.from_iterable(files, **kwargs)
Chain read() for several files. Keyword arguments are passed to the read() function.
Parameters: files – Iterable of file names or file objects.
-
pyteomics.fasta.decoy_chain(*args, **kwargs)
Chain decoy_db() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the decoy_db() function.
-
decoy_chain.from_iterable(files, **kwargs)
Chain decoy_db() for several files. Keyword arguments are passed to the decoy_db() function.
Parameters: files – Iterable of file names or file objects.
-
class
pyteomics.fasta.
FASTA
(source, ignore_comments=False, parser=None, encoding=None)[source]¶ Bases:
pyteomics.fasta.FASTABase
,pyteomics.auxiliary.file_helpers.FileReader
Text-mode, sequential FASTA parser. Suitable for iteration over the file to obtain all entries in order.
-
__init__
(source, ignore_comments=False, parser=None, encoding=None)[source]¶ Create a new FASTA parser object. Supports iteration, yields (description, sequence) tuples. Supports with syntax.
Parameters:
- source (str or file-like) – File to read. If file object, it must be opened in text mode.
- ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.
- parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.
- encoding (str or None, optional) – File encoding (if it is given by name).
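The interplay of ignore_comments and parser can be sketched without Pyteomics. The reader below is a hypothetical simplification, not the library's code: it assumes that extra description lines start with `>` or `;` and that multi-line descriptions are joined with a space, neither of which is stated above.

```python
import io

def read_fasta(source, ignore_comments=False, parser=None):
    """Yield (description, sequence); description is passed through parser if given."""
    def finish(description_lines, chunks):
        description = ' '.join(description_lines)
        if parser is not None:
            description = parser(description)
        return description, ''.join(chunks)

    description_lines, chunks = [], []
    for line in source:
        line = line.strip()
        if not line:
            continue
        if line[0] in '>;':  # assumption: both characters mark description lines
            if chunks:
                # A description line after sequence data starts a new entry.
                yield finish(description_lines, chunks)
                description_lines, chunks = [], []
            if not description_lines or not ignore_comments:
                description_lines.append(line[1:].strip())
        else:
            chunks.append(line)
    if description_lines:
        yield finish(description_lines, chunks)

text = '>sp|P1|ONE first\n;extra note\nPEP\nTIDE\n>sp|P2|TWO\nMK\n'
full = list(read_fasta(io.StringIO(text)))
short = list(read_fasta(io.StringIO(text), ignore_comments=True))
accessions = list(read_fasta(io.StringIO(text), ignore_comments=True,
                             parser=lambda d: d.split('|')[1]))
```

With the default settings the second description line is kept; with ignore_comments=True it is dropped; with a parser function the yielded first element is whatever that function returns.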
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
FASTABase
(source, **kwargs)[source]¶ Bases:
object
Abstract base class for FASTA file parsers. Can be used for type checking.
-
class
pyteomics.fasta.
FlavoredMixin
(parse=True)[source]¶ Bases:
object
Parser aimed at a specific FASTA flavor. Subclasses should define parser and header_pattern. The parse argument in __init__() defines whether description is parsed in output.
-
class
pyteomics.fasta.
IndexedFASTA
(source, ignore_comments=False, parser=None, **kwargs)[source]¶ Bases:
pyteomics.fasta.FASTABase
,pyteomics.auxiliary.file_helpers.TaskMappingMixin
,pyteomics.auxiliary.file_helpers.IndexedTextReader
Indexed FASTA parser. Supports direct indexing by matched labels.
-
__init__
(source, ignore_comments=False, parser=None, **kwargs)[source]¶ Create an indexed FASTA parser object.
Parameters:
- source (str or file-like) – File to read. If file object, it must be opened in binary mode.
- ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.
- parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.
- encoding (str or None, optional, keyword only) – File encoding. Default is UTF-8.
- block_size (int or None, optional, keyword only) – Number of bytes to consume at once.
- delimiter (str or None, optional, keyword only) – Overrides the FASTA record delimiter (default is '\n>').
- label (str or None, optional, keyword only) – Overrides the FASTA record label pattern. Default is '^[\n]?>(.*)'.
- label_group (int or str, optional, keyword only) – Overrides the matched group used as key in the byte offset index. This in combination with label can be used to extract fields from headers. However, consider using TwoLayerIndexedFASTA for this purpose.
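To make the roles of delimiter, label and the byte offset index more concrete, here is a standalone sketch of how such an index might be built and used. It is an illustration under assumptions, not the Pyteomics implementation: the whole input is held in memory, and `build_offset_index` and `fetch` are hypothetical helpers. Only the two default patterns are taken from the parameter list above.

```python
import io
import re

LABEL = re.compile(rb'^[\n]?>(.*)')  # default label pattern quoted above, as bytes
DELIMITER = b'\n>'                   # default record delimiter quoted above

def build_offset_index(data):
    """Map each record label to the (start, end) byte range of its record."""
    index = {}
    start = 0
    while start < len(data):
        nxt = data.find(DELIMITER, start)
        # Keep the newline with the current record, stop before the next '>'.
        end = len(data) if nxt == -1 else nxt + 1
        match = LABEL.match(data[start:end])
        if match:
            index[match.group(1).decode('utf-8')] = (start, end)
        start = end
    return index

def fetch(handle, index, label):
    """Seek straight to a record and return (description, sequence)."""
    start, end = index[label]
    handle.seek(start)
    lines = handle.read(end - start).decode('utf-8').splitlines()
    return lines[0][1:], ''.join(lines[1:])

data = b'>prot1 first\nPEPTIDE\n>prot2 second\nMKWV\nTFIS\n'
index = build_offset_index(data)
record = fetch(io.BytesIO(data), index, 'prot2 second')
```

Working on bytes is what makes the offsets usable with seek(), which is consistent with the requirement that file objects be opened in binary mode.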
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
IndexedNCBI
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.NCBIMixin
,pyteomics.fasta.TwoLayerIndexedFASTA
Indexed parser for NCBI FASTA files.
-
__init__
(source, parse=True, **kwargs)¶ Creates an IndexedNCBI object.
Parameters:
- source (str or file) – The file to read. If a file object, it needs to be in binary mode.
- parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
- kwargs – Passed to the TwoLayerIndexedFASTA constructor.
-
build_second_index
()¶ Create the mapping from extracted field to whole header string.
-
get_by_id
(key)¶ Get the entry by value of header string or extracted field.
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
IndexedRefSeq
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.RefSeqMixin
,pyteomics.fasta.TwoLayerIndexedFASTA
Indexed parser for RefSeq FASTA files.
-
__init__
(source, parse=True, **kwargs)¶ Creates an IndexedRefSeq object.
Parameters:
- source (str or file) – The file to read. If a file object, it needs to be in binary mode.
- parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
- kwargs – Passed to the TwoLayerIndexedFASTA constructor.
-
build_second_index
()¶ Create the mapping from extracted field to whole header string.
-
get_by_id
(key)¶ Get the entry by value of header string or extracted field.
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
IndexedSPD
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.SPDMixin
,pyteomics.fasta.TwoLayerIndexedFASTA
Indexed parser for SPD FASTA files.
-
__init__
(source, parse=True, **kwargs)¶ Creates an IndexedSPD object.
Parameters:
- source (str or file) – The file to read. If a file object, it needs to be in binary mode.
- parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
- kwargs – Passed to the TwoLayerIndexedFASTA constructor.
-
build_second_index
()¶ Create the mapping from extracted field to whole header string.
-
get_by_id
(key)¶ Get the entry by value of header string or extracted field.
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
IndexedUniMes
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.UniMesMixin
,pyteomics.fasta.TwoLayerIndexedFASTA
Indexed parser for UniMes FASTA files.
-
__init__
(source, parse=True, **kwargs)¶ Creates an IndexedUniMes object.
Parameters:
- source (str or file) – The file to read. If a file object, it needs to be in binary mode.
- parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
- kwargs – Passed to the TwoLayerIndexedFASTA constructor.
-
build_second_index
()¶ Create the mapping from extracted field to whole header string.
-
get_by_id
(key)¶ Get the entry by value of header string or extracted field.
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
IndexedUniParc
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.UniParcMixin
,pyteomics.fasta.TwoLayerIndexedFASTA
Indexed parser for UniParc FASTA files.
-
__init__
(source, parse=True, **kwargs)¶ Creates an IndexedUniParc object.
Parameters:
- source (str or file) – The file to read. If a file object, it needs to be in binary mode.
- parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
- kwargs – Passed to the TwoLayerIndexedFASTA constructor.
-
build_second_index
()¶ Create the mapping from extracted field to whole header string.
-
get_by_id
(key)¶ Get the entry by value of header string or extracted field.
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
IndexedUniProt
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.UniProtMixin
,pyteomics.fasta.TwoLayerIndexedFASTA
Indexed parser for UniProt FASTA files.
-
__init__
(source, parse=True, **kwargs)¶ Creates an IndexedUniProt object.
Parameters:
- source (str or file) – The file to read. If a file object, it needs to be in binary mode.
- parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
- kwargs – Passed to the TwoLayerIndexedFASTA constructor.
-
build_second_index
()¶ Create the mapping from extracted field to whole header string.
-
get_by_id
(key)¶ Get the entry by value of header string or extracted field.
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
IndexedUniRef
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.UniRefMixin
,pyteomics.fasta.TwoLayerIndexedFASTA
Indexed parser for UniRef FASTA files.
-
__init__
(source, parse=True, **kwargs)¶ Creates an IndexedUniRef object.
Parameters:
- source (str or file) – The file to read. If a file object, it needs to be in binary mode.
- parse (bool, optional) – Defines whether the descriptions should be parsed in the produced tuples. Default is True.
- kwargs – Passed to the TwoLayerIndexedFASTA constructor.
-
build_second_index
()¶ Create the mapping from extracted field to whole header string.
-
get_by_id
(key)¶ Get the entry by value of header string or extracted field.
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
NCBI
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.NCBIMixin
,pyteomics.fasta.FASTA
Text-mode parser for NCBI FASTA files.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
NCBIMixin
(parse=True)[source]¶ Bases:
pyteomics.fasta.FlavoredMixin
-
__init__
(parse=True)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
pyteomics.fasta.
RefSeq
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.RefSeqMixin
,pyteomics.fasta.FASTA
Text-mode parser for RefSeq FASTA files.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
RefSeqMixin
(parse=True)[source]¶ Bases:
pyteomics.fasta.FlavoredMixin
-
__init__
(parse=True)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
pyteomics.fasta.
SPD
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.SPDMixin
,pyteomics.fasta.FASTA
Text-mode parser for SPD FASTA files.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
SPDMixin
(parse=True)[source]¶ Bases:
pyteomics.fasta.FlavoredMixin
-
__init__
(parse=True)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
pyteomics.fasta.
TwoLayerIndexedFASTA
(source, header_pattern=None, header_group=None, ignore_comments=False, parser=None, **kwargs)[source]¶ Bases:
pyteomics.fasta.IndexedFASTA
Parser with two-layer index. Extracted groups are mapped to full headers (where possible), full headers are mapped to byte offsets.
When indexed, the key is looked up in both indexes, allowing access by meaningful IDs (like UniProt accession) and by full header string.
-
__init__
(source, header_pattern=None, header_group=None, ignore_comments=False, parser=None, **kwargs)[source]¶ Open source and create a two-layer index for convenient random access both by full header strings and extracted fields.
Parameters:
- source (str or file-like) – File to read. If file object, it must be opened in binary mode.
- header_pattern (str or RE or None, optional) – Pattern to match the header string. Must capture the group used for the second index. If None (default), second-level index is not created.
- header_group (int or str or None, optional) – Defines which group is used as key in the second-level index. Default is 1.
- ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.
- parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.
- arguments (Other) –
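The two-layer lookup can be illustrated with plain dictionaries. This is a sketch of the idea only: `HEADER_PATTERN`, `build_second_index` (as a free function) and `lookup` are hypothetical, and the byte offsets are made-up values.

```python
import re

# Hypothetical pattern capturing an accession between the first two '|' characters.
HEADER_PATTERN = re.compile(r'^\w+\|(\w+)\|')

def build_second_index(headers, pattern=HEADER_PATTERN, group=1):
    """Map the extracted field of each header to the full header string."""
    second = {}
    for header in headers:
        match = pattern.match(header)
        if match:  # headers that do not match are reachable only by full string
            second[match.group(group)] = header
    return second

def lookup(key, offsets, second):
    """Resolve a key as a full header first, then as an extracted field."""
    if key in offsets:
        return offsets[key]
    return offsets[second[key]]

# First layer: full header -> byte offset (illustrative numbers).
offsets = {
    'sp|P02768|ALBU_HUMAN Albumin': 0,
    'sp|P69905|HBA_HUMAN Hemoglobin subunit alpha': 120,
}
second = build_second_index(offsets)
by_accession = lookup('P69905', offsets, second)
by_header = lookup('sp|P02768|ALBU_HUMAN Albumin', offsets, second)
```

Keeping the second layer as a mapping onto full headers means the byte offsets are stored only once.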
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
UniMes
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.UniMesMixin
,pyteomics.fasta.FASTA
Text-mode parser for UniMes FASTA files.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
UniMesMixin
(parse=True)[source]¶ Bases:
pyteomics.fasta.FlavoredMixin
-
__init__
(parse=True)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
pyteomics.fasta.
UniParc
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.UniParcMixin
,pyteomics.fasta.FASTA
Text-mode parser for UniParc FASTA files.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
UniParcMixin
(parse=True)[source]¶ Bases:
pyteomics.fasta.FlavoredMixin
-
__init__
(parse=True)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
pyteomics.fasta.
UniProt
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.UniProtMixin
,pyteomics.fasta.FASTA
Text-mode parser for UniProt FASTA files.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
UniProtMixin
(parse=True)[source]¶ Bases:
pyteomics.fasta.FlavoredMixin
-
__init__
(parse=True)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
pyteomics.fasta.
UniRef
(source, parse=True, **kwargs)[source]¶ Bases:
pyteomics.fasta.UniRefMixin
,pyteomics.fasta.FASTA
Text-mode parser for UniRef FASTA files.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.fasta.
UniRefMixin
(parse=True)[source]¶ Bases:
pyteomics.fasta.FlavoredMixin
-
__init__
(parse=True)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
pyteomics.fasta.
decoy_db
(source=None, mode='reverse', prefix='DECOY_', decoy_only=False, ignore_comments=False, parser=None, **kwargs)[source]¶ Iterate over sequences for a decoy database out of a given source.
Parameters:
- source (file-like object or str or None, optional) – A path to a FASTA database or a file object itself. Default is None, which means read standard input.
- mode (str or callable, optional) – Algorithm of decoy sequence generation. ‘reverse’ by default. See decoy_sequence() for more information.
- prefix (str, optional) – A prefix to the protein descriptions of decoy entries. The default value is ‘DECOY_’.
- decoy_only (bool, optional) – If set to True, only the decoy entries will be written to output. If False, the entries from source will be written first. False by default.
- ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False.
- parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format guessing. Default is None, which means return the header “as is”.
- **kwargs – Given to decoy_sequence().
Returns: out – An iterator over entries of the new database.
Return type: iterator
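The effect of prefix and decoy_only can be sketched on an in-memory list of entries. The generator below is a hypothetical stand-in rather than the library function: it supports only reversal, takes already-parsed (description, sequence) pairs instead of a FASTA source, and materializes the input so it can be traversed twice.

```python
def decoy_entries(entries, prefix='DECOY_', decoy_only=False):
    """Yield target entries first (unless decoy_only), then prefixed reversed decoys."""
    entries = list(entries)  # simplification: allows a second pass over the input
    if not decoy_only:
        yield from entries
    for description, sequence in entries:
        yield prefix + description, sequence[::-1]

targets = [('prot1', 'PEPTIDE'), ('prot2', 'MKWV')]
combined = list(decoy_entries(targets))
decoys = list(decoy_entries(targets, prefix='REV_', decoy_only=True))
```

The prefix is what later lets a search result be attributed to the decoy half of a combined database.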
-
pyteomics.fasta.
decoy_sequence
(sequence, mode='reverse', **kwargs)[source]¶ Create a decoy sequence out of a given sequence string.
Parameters:
- sequence (str) – The initial sequence string.
- mode (str or callable, optional) – Type of decoy sequence. Should be one of the standard modes or any callable. Standard modes are:
  - ‘reverse’ for reverse();
  - ‘shuffle’ for shuffle();
  - ‘fused’ for fused_decoy().
  Default is ‘reverse’.
- **kwargs – Given to the decoy function.
Returns: decoy_sequence – The decoy sequence.
Return type:
-
pyteomics.fasta.
fused_decoy
(sequence, decoy_mode='reverse', sep='R', **kwargs)[source]¶ Create a “fused” decoy sequence by concatenating a decoy sequence with the original one. The method and its use cases are described in:
Ivanov, M. V., Levitsky, L. I., & Gorshkov, M. V. (2016). Adaptation of Decoy Fusion Strategy for Existing Multi-Stage Search Workflows. Journal of The American Society for Mass Spectrometry, 27(9), 1579-1582.
Parameters:
- sequence (str) – The initial sequence string.
- decoy_mode (str or callable, optional) – Type of decoy sequence to use. Should be one of the standard modes or any callable. Standard modes are:
  - ‘reverse’ for reverse();
  - ‘shuffle’ for shuffle();
  - ‘fused’ for fused_decoy() (if you love recursion).
  Default is ‘reverse’.
- sep (str, optional) – Amino acid motif that separates the decoy sequence from the target one. This setting should reflect the enzyme specificity used in the search against the database being generated. Default is ‘R’, which is suitable for trypsin searches.
- **kwargs – Given to the decoy generation function.
Examples
>>> fused_decoy('PEPT')
'TPEPRPEPT'
>>> fused_decoy('MPEPT', 'shuffle', 'K', keep_nterm=True)
'MPPTEKMPEPT'
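The first example above follows directly from the definition: decoy part, then separator, then the original sequence. A minimal standalone sketch (the function name `fused` and its `make_decoy` argument are hypothetical, and only reversal is built in):

```python
def fused(sequence, sep='R', make_decoy=None):
    """Concatenate a decoy of the sequence, a separator motif, and the sequence itself."""
    if make_decoy is None:
        make_decoy = lambda s: s[::-1]  # plain reversal as the default decoy
    return make_decoy(sequence) + sep + sequence

default_result = fused('PEPT')          # 'TPEP' + 'R' + 'PEPT'
lysine_result = fused('PEPT', sep='K')  # 'TPEP' + 'K' + 'PEPT'
```

Because the separator is a cleavage site for the enzyme, in silico digestion splits the fused protein back into a decoy part and a target part.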
-
pyteomics.fasta.
parse
(header, flavor='auto', parsers=None)[source]¶ Parse the FASTA header and return a nice dictionary.
Parameters:
- header (str) – FASTA header to parse.
- flavor (str, optional) – Short name of the header format (case-insensitive). Valid values are 'auto' and keys of the parsers dict. Default is 'auto', which means try all formats in turn and return the first result that can be obtained without an exception.
- parsers (dict, optional) – A dict where keys are format names (lowercased) and values are functions that take a header string and return the parsed header.
Returns: out – A dictionary with the info from the header. The format depends on the flavor.
Return type:
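The 'auto' behaviour described for flavor (try every parser in turn, keep the first result that does not raise) can be sketched as follows. The two parser functions, the `PARSERS` dict and the returned keys are hypothetical and much simpler than the real std_parsers entries.

```python
import re

def parse_uniprot_like(header):
    """Hypothetical parser for 'db|accession|entry name' headers."""
    match = re.match(r'^(\w+)\|(\w+)\|(\w+)\s+(.*)$', header)
    if match is None:
        raise ValueError('not a db|accession|entry header')
    db, accession, entry, name = match.groups()
    return {'db': db, 'id': accession, 'entry': entry, 'name': name}

def parse_plain(header):
    """Hypothetical fallback: first word is the ID, the rest is the name."""
    ident, _, rest = header.partition(' ')
    return {'id': ident, 'name': rest}

PARSERS = {'uniprot-like': parse_uniprot_like, 'plain': parse_plain}

def parse_header(header, flavor='auto', parsers=PARSERS):
    flavor = flavor.lower()  # flavor names are case-insensitive
    if flavor != 'auto':
        return parsers[flavor](header)
    for parser in parsers.values():
        try:
            return parser(header)
        except Exception:
            continue  # this format did not fit; try the next one
    raise ValueError('Unknown FASTA header format: ' + header)

structured = parse_header('sp|P02768|ALBU_HUMAN Albumin')
fallback = parse_header('contig42 unknown')
explicit = parse_header('x y', flavor='PLAIN')
```

With this strategy the order of the parsers matters, so a permissive parser such as `parse_plain` belongs at the end.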
-
pyteomics.fasta.
read
(source=None, use_index=None, flavor=None, **kwargs)[source]¶ Parse a FASTA file. This function serves as a dispatcher between different parsers available in this module.
Parameters:
- source (str or file or None, optional) – A file object (or file name) with a FASTA database. Default is None, which means read standard input.
- use_index (bool, optional) – If True, the created parser object will be an instance of IndexedFASTA. If False (default), it will be an instance of FASTA.
- flavor (str or None, optional) – A supported FASTA header format. If specified, a format-specific parser instance is returned.
  Note: See std_parsers for supported flavors.
Returns: out – A named 2-tuple with FASTA header (str or dict) and sequence (str). Attributes ‘description’ and ‘sequence’ are also provided.
Return type: iterator of tuples
-
pyteomics.fasta.
reverse
(sequence, keep_nterm=False, keep_cterm=False)[source]¶ Create a decoy sequence by reversing the original one.
Parameters: Returns: decoy_sequence – The decoy sequence.
Return type:
-
pyteomics.fasta.
shuffle
(sequence, keep_nterm=False, keep_cterm=False)[source]¶ Create a decoy sequence by shuffling the original one.
Parameters: Returns: decoy_sequence – The decoy sequence.
Return type:
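The keep_nterm and keep_cterm flags of reverse() and shuffle() are not described above. The sketch below assumes, based on the parameter names and the fused_decoy() example with keep_nterm=True, that they leave the first and last residue in place. The functions are hypothetical re-implementations, not the library code.

```python
import random

def _bounds(sequence, keep_nterm, keep_cterm):
    """Return the slice of the sequence that is allowed to change."""
    start = 1 if keep_nterm else 0
    end = len(sequence) - 1 if keep_cterm else len(sequence)
    return start, end

def reverse_decoy(sequence, keep_nterm=False, keep_cterm=False):
    start, end = _bounds(sequence, keep_nterm, keep_cterm)
    if start >= end:
        return sequence  # nothing left to reverse
    return sequence[:start] + sequence[start:end][::-1] + sequence[end:]

def shuffle_decoy(sequence, keep_nterm=False, keep_cterm=False, rng=None):
    rng = rng or random
    start, end = _bounds(sequence, keep_nterm, keep_cterm)
    if start >= end:
        return sequence
    middle = list(sequence[start:end])
    rng.shuffle(middle)
    return sequence[:start] + ''.join(middle) + sequence[end:]

plain = reverse_decoy('PEPT')
fixed_n = reverse_decoy('MPEPTK', keep_nterm=True)
fixed_both = reverse_decoy('MPEPTK', keep_nterm=True, keep_cterm=True)
shuffled = shuffle_decoy('MPEPTK', keep_nterm=True, rng=random.Random(0))
```

Keeping a terminal residue fixed is useful, for example, when every protein starts with methionine or when the C-terminal residue defines the cleavage site.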
-
pyteomics.fasta.
std_parsers
¶ A dictionary with parsers for known FASTA header formats. For now, supported formats are those described at UniProt help page.
-
pyteomics.fasta.
write
(entries, output=None)[source]¶ Create a FASTA file with entries.
Parameters:
- entries (iterable of (str, str) tuples) – An iterable of 2-tuples in the form (description, sequence).
- output (file-like or str, optional) – A file open for writing or a path to write to. If the file exists, it will be opened for appending. Default is None, which means write to standard output.
- file_mode (str, keyword only, optional) – If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘a’.
Returns: output_file – The file where the FASTA is written.
Return type: file object
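The consequence of the default file_mode of ‘a’ is easy to overlook: writing twice to the same path accumulates entries. The sketch below shows this with a hypothetical `write_fasta` helper. The 70-character line width and the blank line between entries are assumptions of this sketch, not documented behaviour.

```python
import os
import tempfile

def write_fasta(entries, path, file_mode='a', width=70):
    """Write (description, sequence) pairs to path, wrapping sequence lines."""
    with open(path, file_mode) as out:
        for description, sequence in entries:
            out.write('>' + description + '\n')
            for i in range(0, len(sequence), width):
                out.write(sequence[i:i + width] + '\n')
            out.write('\n')  # blank line between entries

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'out.fasta')
    write_fasta([('prot1', 'PEPTIDE')], path)
    write_fasta([('prot2', 'MKWV')], path)                # default 'a': appended
    with open(path) as f:
        appended = f.read()
    write_fasta([('prot3', 'AAA')], path, file_mode='w')  # 'w': file is overwritten
    with open(path) as f:
        overwritten = f.read()
```

Passing file_mode='w' is the way to start from a clean file when the output path may already exist.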
-
pyteomics.fasta.
write_decoy_db
(source=None, output=None, mode='reverse', prefix='DECOY_', decoy_only=False, **kwargs)[source]¶ Generate a decoy database out of a given source and write to file.
If output is a path, the file will be opened for appending, so no information will be lost if the file exists. However, be careful when providing open file streams as source and output: reading and writing will start from the current position in the files, which is where the last I/O operation finished. One can use the file.seek() method to change it.
Parameters:
- source (file-like object or str or None, optional) – A path to a FASTA database or a file object itself. Default is None, which means read standard input.
- output (file-like object or str, optional) – A path to the output database or a file open for writing. Defaults to None, in which case the results go to the standard output.
- mode (str or callable, optional) – Algorithm of decoy sequence generation. ‘reverse’ by default. See decoy_sequence() for more details.
- prefix (str, optional) – A prefix to the protein descriptions of decoy entries. The default value is ‘DECOY_’.
- decoy_only (bool, optional) – If set to True, only the decoy entries will be written to output. If False, the entries from source will be written as well. False by default.
- file_mode (str, keyword only, optional) – If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘a’.
- **kwargs – Given to decoy_sequence().
Returns: output – A (closed) file object for the created file.
Return type: file
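The caveat about open file streams can be reproduced with the standard library alone. In this sketch an in-memory text stream stands in for an already opened FASTA file:

```python
import io

stream = io.StringIO('>prot1\nPEPTIDE\n>prot2\nMKWV\n')

first_line = stream.readline()  # an earlier I/O operation moves the position
rest = stream.read()            # reading continues from there, so '>prot1' is missed

stream.seek(0)                  # rewind before handing the stream to a reader
everything = stream.read()
```

A reader given the stream after the first readline() would see a sequence line with no header, which is why rewinding with seek(0) is worth doing explicitly.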
peff - PSI Extended FASTA Format¶
PEFF is a forthcoming standard from PSI-HUPO formalizing and extending the encoding of protein features and annotations for building search spaces for proteomics. See The PEFF specification for more up-to-date information on the standard.
Data manipulation¶
Classes¶
The PEFF parser inherits several properties from the implementation in the fasta module, building on top of the TwoLayerIndexedFASTA reader.
Available classes:
IndexedPEFF
- Parse a PEFF format file in binary-mode, supporting direct indexing by header string or by tag.
-
class
pyteomics.peff.
Header
(mapping, original=None)[source]¶ Bases:
collections.abc.Mapping
Hold parsed properties of a key-value pair like a sequence’s definition line.
This object supports the Mapping interface, and keys may be accessed by attribute access notation.
-
__init__
(mapping, original=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
get
(k[, d]) → D[k] if k in D, else d. d defaults to None.¶
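A mapping whose keys are also readable as attributes can be built on collections.abc.Mapping in a few lines. The class below is a hypothetical illustration of that pattern, not the Header class itself, and the keys used in the example are made up.

```python
from collections.abc import Mapping

class AttrMapping(Mapping):
    """Read-only mapping whose keys can also be read as attributes."""

    def __init__(self, mapping, original=None):
        self._mapping = dict(mapping)
        self.original = original

    def __getitem__(self, key):
        return self._mapping[key]

    def __iter__(self):
        return iter(self._mapping)

    def __len__(self):
        return len(self._mapping)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith('_'):
            raise AttributeError(name)  # avoids recursion on private names
        try:
            return self._mapping[name]
        except KeyError:
            raise AttributeError(name) from None

header = AttrMapping({'Tag': 'P12345', 'PName': 'Example protein'},
                     original='>ex:P12345 \\PName=Example protein')
by_attribute = header.Tag
by_key = header['PName']
missing = header.get('Length')  # get() comes from the Mapping base class
```

Subclassing Mapping provides get(), keys(), items() and membership tests for free once __getitem__, __iter__ and __len__ are defined.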
-
-
class
pyteomics.peff.
IndexedPEFF
(source, ignore_comments=False, **kwargs)[source]¶ Bases:
pyteomics.fasta.TwoLayerIndexedFASTA
Creates an IndexedPEFF object.
-
__init__
(source, ignore_comments=False, **kwargs)[source]¶ Open source and create a two-layer index for convenient random access both by full header strings and extracted fields.
Parameters:
- source (str or file-like) – File to read. If file object, it must be opened in binary mode.
- header_pattern (str or RE or None, optional) – Pattern to match the header string. Must capture the group used for the second index. If None (default), second-level index is not created.
- header_group (int or str or None, optional) – Defines which group is used as key in the second-level index. Default is 1.
- ignore_comments (bool, optional) – If True then ignore the second and subsequent lines of description. Default is False, which concatenates multi-line descriptions into a single string.
- parser (function or None, optional) – Defines whether the FASTA descriptions should be parsed. If it is a function, that function will be given the description string, and the returned value will be yielded together with the sequence. The std_parsers dict has parsers for several formats. Hint: specify parse() as the parser to apply automatic format recognition. Default is None, which means return the header “as is”.
- arguments (Other) –
-
build_second_index
()¶ Create the mapping from extracted field to whole header string.
-
get_by_id
(key)¶ Get the entry by value of header string or extracted field.
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
-
mzml - reader for mass spectrometry data in mzML format¶
Summary¶
mzML is a standard rich XML-format for raw mass spectrometry data storage. Please refer to psidev.info for the detailed specification of the format and structure of mzML files.
This module provides a minimalistic way to extract information from mzML files. You can use the old functional interface (read()) or the new object-oriented interface (MzML or PreIndexedMzML) to iterate over entries in <spectrum> elements. MzML and PreIndexedMzML also support direct indexing with spectrum IDs.
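As a rough illustration of the object-oriented interface, the sketch below iterates over spectra and retrieves one by its ID. The helper names (`base_peak`, `iter_base_peaks`, `fetch_spectrum`) and the file path argument are assumptions made for this example. `pyteomics` is imported inside the functions so that the pure-Python helper can be tried without it.

```python
def base_peak(spectrum):
    """Return (m/z, intensity) of the most intense peak, or None if empty."""
    mz = spectrum['m/z array']
    intensity = spectrum['intensity array']
    if len(mz) == 0:
        return None
    top = max(range(len(intensity)), key=lambda k: intensity[k])
    return float(mz[top]), float(intensity[top])


def iter_base_peaks(path):
    """Yield (spectrum id, base peak) for every spectrum in an mzML file."""
    from pyteomics import mzml  # requires lxml and numpy
    with mzml.MzML(path) as reader:
        for spectrum in reader:
            yield spectrum.get('id'), base_peak(spectrum)


def fetch_spectrum(path, spectrum_id):
    """Retrieve a single spectrum by its ID instead of scanning the file."""
    from pyteomics import mzml
    with mzml.MzML(path) as reader:
        return reader.get_by_id(spectrum_id)
```

`base_peak` only relies on the ‘m/z array’ and ‘intensity array’ keys, so it works equally with lists or `numpy` arrays.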
Data access¶
MzML - a class representing a single mzML file. Other data access functions use this class internally.
PreIndexedMzML - a class representing a single mzML file. Uses byte offsets listed at the end of the file for quick access to spectrum elements.
read() - iterate through spectra in mzML file. Data from a single spectrum are converted to a human-readable dict. Spectra themselves are stored under ‘m/z array’ and ‘intensity array’ keys.
chain() - read multiple mzML files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
Deprecated functions¶
version_info() - get version information about the mzML file. You can just read the corresponding attribute of the MzML object.
iterfind() - iterate over elements in an mzML file. You can just call the corresponding method of the MzML object.
Dependencies¶
This module requires lxml and numpy.
-
pyteomics.mzml.
chain
(*sources, **kwargs)¶ Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.
-
pyteomics.mzml.
sources
¶ Sources for creating new sequences from, such as paths or file-like objects
Type: Iterable
-
pyteomics.mzml.
kwargs
¶ Additional arguments used to instantiate each sequence
Type: Mapping
-
-
chain.
from_iterable
(files, **kwargs)¶ Chain read() for several files. Keyword arguments are passed to the read() function.
Parameters: files – Iterable of file names or file objects.
-
pyteomics.mzml.
version_info
(source)¶ Provide version information about the mzML file.
Note
This function is provided for backward compatibility only. It simply creates an MzML instance and returns its version_info attribute.
Parameters: source (str or file) – File name or file-like object.
Returns: out – A (version, schema URL) tuple, both elements are strings or None.
Return type: tuple
-
pyteomics.mzml.
iterfind
(source, path, **kwargs)[source]¶ Parse source and yield info on elements with specified local name or by specified “XPath”.
Note
This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create an MzML object and use its iterfind() method.
Parameters:
- source (str or file) – File name or file-like object.
- path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
- recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
- iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
- read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.
- decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.
Returns: out
Return type: iterator
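The note about filtering in plain Python can be sketched as follows. This is only an illustration: the predicate, the `min_peaks` threshold, and the assumption that spectrum dicts carry an ‘ms level’ key are choices made for the example, and `pyteomics` is imported lazily so that the generic helpers run without it.

```python
def select(entries, predicate):
    """Lazily yield only the entries for which predicate(entry) is true."""
    return (entry for entry in entries if predicate(entry))


def is_rich_ms2(spectrum, min_peaks=3):
    """Hypothetical condition: an MS2 spectrum with at least min_peaks peaks."""
    return (spectrum.get('ms level') == 2
            and len(spectrum.get('m/z array', ())) >= min_peaks)


def rich_ms2_spectra(path):
    """Yield the matching spectra from an mzML file."""
    from pyteomics import mzml  # requires lxml and numpy
    with mzml.MzML(path) as reader:
        for spectrum in select(reader, is_rich_ms2):
            yield spectrum
```

Because the predicate is an ordinary function, it can combine any number of conditions, which a single trailing condition in the path expression cannot.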
-
class
pyteomics.mzml.
MzML
(*args, **kwargs)[source]¶ Bases:
pyteomics.xml.ArrayConversionMixin
,pyteomics.auxiliary.file_helpers.TimeOrderedIndexedReaderMixin
,pyteomics.xml.MultiProcessingXML
,pyteomics.xml.IndexSavingXML
Parser class for mzML files.
-
class
binary_array_record
¶ Bases:
pyteomics.auxiliary.utils.binary_array_record
Hold all of the information about a base64 encoded array needed to decode the array.
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
compression
¶ Alias for field number 1
-
count
()¶ Return number of occurrences of value.
-
data
¶ Alias for field number 0
-
dtype
¶ Alias for field number 2
-
index
()¶ Return first index of value.
Raises ValueError if the value is not present.
-
key
¶ Alias for field number 4
-
source
¶ Alias for field number 3
-
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
decode_data_array
(source, compression_type=None, dtype=<class 'numpy.float64'>)¶ Decode a base64-encoded, compressed bytestring into a numerical array.
Parameters: Returns: Return type: np.ndarray
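To make the decoding step concrete, here is a standard-library sketch of turning a base64-encoded, optionally zlib-compressed byte string into numbers. It is not the code used by `decode_data_array`; the function names, the `type_code` argument, and the use of plain lists instead of `numpy` arrays are assumptions of this example. Little-endian byte order is assumed, as is usual for mzML binary data.

```python
import base64
import struct
import zlib


def decode_binary_array(encoded, compressed=False, type_code='d'):
    """Decode base64 text into a list of numbers ('d' = float64, 'f' = float32)."""
    raw = base64.b64decode(encoded)
    if compressed:
        raw = zlib.decompress(raw)
    item_size = struct.calcsize('<' + type_code)
    count = len(raw) // item_size
    fmt = '<{}{}'.format(count, type_code)  # little-endian, count items
    return list(struct.unpack(fmt, raw[:count * item_size]))


def encode_binary_array(values, compressed=False, type_code='d'):
    """Inverse helper, used here only to produce example input."""
    raw = struct.pack('<{}{}'.format(len(values), type_code), *values)
    if compressed:
        raw = zlib.compress(raw)
    return base64.b64encode(raw)
```

A round trip such as `decode_binary_array(encode_binary_array([100.25, 200.5], compressed=True), compressed=True)` returns the original values.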
-
get_by_id
(elem_id, id_key=None, element_type=None, **kwargs)¶ Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
Parameters: Returns: Return type:
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs (passed to
self._get_info_smart()
.) –
Returns: out
Return type: iterator
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
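A possible way to use map() is sketched below, under a few assumptions: the target is a module-level function (so that it can be sent to worker processes), the reader is created with `use_index=True`, and the helper names and the choice of two processes are made up for the example. Since results arrive out of order, the sketch sorts them before returning.

```python
def total_ion_current(spectrum):
    """Sum of all intensities in one spectrum dict."""
    return float(sum(spectrum['intensity array']))


def parallel_tic(path, processes=2):
    """Compute the total ion current of every spectrum using worker processes."""
    from pyteomics import mzml  # requires lxml and numpy
    with mzml.MzML(path, use_index=True) as reader:
        # map() yields results as workers finish, not in file order
        return sorted(reader.map(total_ion_current, processes=processes))
```

On platforms that start worker processes by spawning a fresh interpreter, a call such as `parallel_tic(...)` normally needs to sit under an `if __name__ == '__main__':` guard.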
-
classmethod
prebuild_byte_offset_file
(path)¶ Construct a new XML reader, build its byte offset index and write it to file
Parameters: path (str) – The path to the file to parse
-
reset
()¶ Resets the iterator to its initial state.
-
write_byte_offsets
()¶ Write the byte offsets in _offset_index to the file at _byte_offset_filename
-
class
-
class
pyteomics.mzml.
PreIndexedMzML
(*args, **kwargs)[source]¶ Bases:
pyteomics.mzml.MzML
Parser class for mzML files, subclass of MzML. Uses byte offsets listed at the end of the file for quick access to spectrum elements.
-
__init__
(*args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
class
binary_array_record
¶ Bases:
pyteomics.auxiliary.utils.binary_array_record
Hold all of the information about a base64 encoded array needed to decode the array.
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
compression
¶ Alias for field number 1
-
count
()¶ Return number of occurrences of value.
-
data
¶ Alias for field number 0
-
dtype
¶ Alias for field number 2
-
index
()¶ Return first index of value.
Raises ValueError if the value is not present.
-
key
¶ Alias for field number 4
-
source
¶ Alias for field number 3
-
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
decode_data_array
(source, compression_type=None, dtype=<class 'numpy.float64'>)¶ Decode a base64-encoded, compressed bytestring into a numerical array.
Parameters: Returns: Return type: np.ndarray
-
get_by_id
(elem_id, id_key=None, element_type=None, **kwargs)¶ Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
Parameters: Returns: Return type:
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs (passed to
self._get_info_smart()
.) –
Returns: out
Return type: iterator
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
classmethod
prebuild_byte_offset_file
(path)¶ Construct a new XML reader, build its byte offset index and write it to file
Parameters: path (str) – The path to the file to parse
-
reset
()¶ Resets the iterator to its initial state.
-
write_byte_offsets
()¶ Write the byte offsets in _offset_index to the file at _byte_offset_filename
-
-
pyteomics.mzml.
read
(source, read_schema=False, iterative=True, use_index=False, dtype=None, huge_tree=False)[source]¶ Parse source and iterate through spectra.
Parameters:
- source (str or file) – A path to a target mzML file or the file object itself.
- read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.
- iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for spectrum elements. Default is False.
- dtype (type or dict, optional) – dtype to convert arrays to, one for both m/z and intensity arrays or one for each key. If dict, keys should be ‘m/z array’ and ‘intensity array’.
- decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.
- huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
Returns: out – An iterator over the dicts with spectrum properties.
Return type: iterator
mzxml - reader for mass spectrometry data in mzXML format¶
Summary¶
mzXML is a (formerly) standard XML-format for raw mass spectrometry data storage, intended to be replaced with mzML.
This module provides a minimalistic way to extract information from mzXML files. You can use the old functional interface (read()) or the new object-oriented interface (MzXML) to iterate over entries in <scan> elements. MzXML also supports direct indexing with scan IDs.
Data access¶
MzXML - a class representing a single mzXML file. Other data access functions use this class internally.
read() - iterate through spectra in mzXML file. Data from a single scan are converted to a human-readable dict. Spectra themselves are stored under ‘m/z array’ and ‘intensity array’ keys.
chain() - read multiple mzXML files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
Deprecated functions¶
version_info() - get version information about the mzXML file. You can just read the corresponding attribute of the MzXML object.
iterfind() - iterate over elements in an mzXML file. You can just call the corresponding method of the MzXML object.
Dependencies¶
This module requires lxml and numpy.
-
pyteomics.mzxml.
chain
(*sources, **kwargs)¶ Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.
-
pyteomics.mzxml.
sources
¶ Sources for creating new sequences from, such as paths or file-like objects
Type: Iterable
-
pyteomics.mzxml.
kwargs
¶ Additional arguments used to instantiate each sequence
Type: Mapping
-
-
chain.
from_iterable
(files, **kwargs)¶ Chain read() for several files. Keyword arguments are passed to the read() function.
Parameters: files – Iterable of file names or file objects.
-
pyteomics.mzxml.
version_info
(source)¶ Provide version information about the XML file.
Note
This function is provided for backward compatibility only. It simply creates an MzXML instance and returns its version_info attribute.
Parameters: source (str or file) – File name or file-like object.
Returns: out – A (version, schema URL) tuple, both elements are strings or None.
Return type: tuple
-
pyteomics.mzxml.
iterfind
(source, path, **kwargs)[source]¶ Parse source and yield info on elements with specified local name or by specified XPath.
Note
This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create an MzXML object and use its iterfind() method.
Parameters:
- source (str or file) – File name or file-like object.
- path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
- recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
- iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
- read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzXML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings.
- decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.
Returns: out
Return type: iterator
-
class
pyteomics.mzxml.
MzXML
(*args, **kwargs)[source]¶ Bases:
pyteomics.xml.ArrayConversionMixin
,pyteomics.auxiliary.file_helpers.TimeOrderedIndexedReaderMixin
,pyteomics.xml.MultiProcessingXML
,pyteomics.xml.IndexSavingXML
Parser class for mzXML files.
-
class
binary_array_record
¶ Bases:
pyteomics.auxiliary.utils.binary_array_record
Hold all of the information about a base64 encoded array needed to decode the array.
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
compression
¶ Alias for field number 1
-
count
()¶ Return number of occurrences of value.
-
data
¶ Alias for field number 0
-
dtype
¶ Alias for field number 2
-
index
()¶ Return first index of value.
Raises ValueError if the value is not present.
-
key
¶ Alias for field number 4
-
source
¶ Alias for field number 3
-
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
decode_data_array
(source, compression_type=None, dtype=<class 'numpy.float64'>)¶ Decode a base64-encoded, compressed bytestring into a numerical array.
Parameters: Returns: Return type: np.ndarray
-
get_by_id
(elem_id, id_key=None, element_type=None, **kwargs)¶ Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
Parameters: Returns: Return type:
-
iterfind
(path, **kwargs)[source]¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs (passed to
self._get_info_smart()
.) –
Returns: out
Return type: iterator
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
classmethod
prebuild_byte_offset_file
(path)¶ Construct a new XML reader, build its byte offset index and write it to file
Parameters: path (str) – The path to the file to parse
-
reset
()¶ Resets the iterator to its initial state.
-
write_byte_offsets
()¶ Write the byte offsets in _offset_index to the file at _byte_offset_filename
-
class
-
pyteomics.mzxml.
read
(source, read_schema=False, iterative=True, use_index=False, dtype=None, huge_tree=False)[source]¶ Parse source and iterate through spectra.
Parameters:
- source (str or file) – A path to a target mzXML file or the file object itself.
- read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzXML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.
- iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for spectrum elements. Default is False.
- decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.
- huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
Returns: out – An iterator over the dicts with spectrum properties.
Return type: iterator
mgf - read and write MS/MS data in Mascot Generic Format¶
Summary¶
MGF is a simple human-readable format for MS/MS data. It allows storing MS/MS peak lists and experimental parameters.
This module provides classes and functions for access to data stored in MGF files. Parsing is done using MGF and IndexedMGF classes. The read() function can be used as an entry point. MGF spectra are converted to dictionaries. MS/MS data points are (optionally) represented as numpy arrays. Also, common parameters can be read from MGF file header with read_header() function. write() allows creation of MGF files.
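A short sketch of the entry points mentioned above, assuming a path to an existing MGF file. The helper names (`describe`, `list_spectra`, `common_parameters`) are invented for this example, and `pyteomics` is imported inside the functions so that the dict-handling helper can be run on its own.

```python
def describe(spectrum):
    """Return (title, number of peaks) for one spectrum dict."""
    params = spectrum.get('params', {})
    return params.get('title'), len(spectrum['m/z array'])


def list_spectra(path):
    """Summarise every spectrum in an MGF file."""
    from pyteomics import mgf
    with mgf.read(path) as reader:
        return [describe(spectrum) for spectrum in reader]


def common_parameters(path):
    """Return the dict of header parameters shared by all spectra."""
    from pyteomics import mgf
    return mgf.read_header(path)
```

Parameter keys are lowercased by the parser, which is why the title is looked up as `'title'` rather than `'TITLE'`.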
Classes¶
MGF - a text-mode MGF parser. Suitable to read spectra from a file consecutively. Needs a file opened in text mode (or will open it if given a file name).
IndexedMGF - a binary-mode MGF parser. When created, builds a byte offset index for fast random access by spectrum titles. Sequential iteration is also supported. Needs a seekable file opened in binary mode (if created from existing file object).
MGFBase - abstract class, the common ancestor of the two classes above. Can be used for type checking.
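The idea of a byte offset index keyed by spectrum title can be sketched with the standard library alone. This illustrates the general technique, not the way IndexedMGF is implemented; the function names are hypothetical. It also shows why a seekable file opened in binary mode is needed: offsets are byte positions passed to `seek()`.

```python
import io


def build_title_index(fileobj):
    """Map each spectrum title to the byte offset of its BEGIN IONS line."""
    index = {}
    block_start = None
    while True:
        position = fileobj.tell()
        line = fileobj.readline()
        if not line:
            break
        stripped = line.strip()
        if stripped == b'BEGIN IONS':
            block_start = position
        elif stripped.startswith(b'TITLE=') and block_start is not None:
            index[stripped[len(b'TITLE='):].decode('utf-8')] = block_start
    return index


def read_block(fileobj, offset):
    """Jump straight to a spectrum block and return its lines."""
    fileobj.seek(offset)
    lines = []
    for line in iter(fileobj.readline, b''):
        lines.append(line.decode('utf-8').rstrip('\r\n'))
        if line.strip() == b'END IONS':
            break
    return lines
```

Building the index takes one pass over the file. After that, fetching any spectrum needs a dictionary lookup and a single seek, regardless of where the spectrum sits in the file.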
Functions¶
read() - iterate through spectra in MGF file. Data from a single spectrum are converted to a human-readable dict.
get_spectrum() - read a single spectrum with given title from a file.
chain() - read multiple files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
read_header() - get a dict with common parameters for all spectra from the beginning of MGF file.
write() - write an MGF file.
-
pyteomics.mgf.
chain
(*args, **kwargs)¶ Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.
-
chain.
from_iterable
(files, **kwargs)¶ Chain read() for several files. Keyword arguments are passed to the read() function.
Parameters: files (iterable) – Iterable of file names or file objects.
-
class
pyteomics.mgf.
IndexedMGF
(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)[source]¶ Bases:
pyteomics.mgf.MGFBase
,pyteomics.auxiliary.file_helpers.TaskMappingMixin
,pyteomics.auxiliary.file_helpers.TimeOrderedIndexedReaderMixin
,pyteomics.auxiliary.file_helpers.IndexSavingTextReader
A class representing an MGF file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by title using the indexing syntax in constant time. If created using a file object, it needs to be opened in binary mode.
When iterated, IndexedMGF object yields spectra one by one. Each ‘spectrum’ is a dict with four keys: ‘m/z array’, ‘intensity array’, ‘charge array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MGF, lowercased).
-
time
¶ A property used for accessing spectra by retention time.
Type: RTLocator
-
__init__
(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)[source]¶ Create an MGF file object, set MGF-specific parameters.
Parameters:
- source (str or file or None, optional) – A file object (or file name) with data in MGF format. Default is None, which means read standard input.
- use_header (bool, optional, keyword only) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
- convert_arrays (one of {0, 1, 2}, optional, keyword only) – If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
- read_charges (bool, optional, keyword only) – If True (default), fragment charges are reported. Disabling it improves performance.
- dtype (type or str or dict, optional, keyword only) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’ and/or ‘charge array’.
- encoding (str, optional, keyword only) – File encoding.
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
classmethod
prebuild_byte_offset_file
(path)¶ Construct a new reader, build its byte offset index and write it to file
Parameters: path (str) – The path to the file to parse
-
reset
()¶ Resets the iterator to its initial state.
-
write_byte_offsets
()¶ Write the byte offsets in _offset_index to the file at _byte_offset_filename
-
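Random access by title with IndexedMGF might look like the sketch below. The helper names and arguments are assumptions for illustration, and the indexing syntax `reader[title]` follows the class description above. `pyteomics` is imported lazily so that the pure-Python helper stays usable without it.

```python
def peak_pairs(spectrum):
    """Return the peaks of a spectrum dict as a list of (m/z, intensity) tuples."""
    return [(float(mz), float(intensity))
            for mz, intensity in zip(spectrum['m/z array'], spectrum['intensity array'])]


def peaks_for_title(path, title):
    """Fetch one spectrum by title and return its peaks."""
    from pyteomics import mgf
    with mgf.IndexedMGF(path) as reader:
        return peak_pairs(reader[title])
```

When a file name is passed, the reader opens the file itself; a file object passed instead has to be opened in binary mode, as noted in the class description.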
-
class
pyteomics.mgf.
MGF
(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding=None)[source]¶ Bases:
pyteomics.mgf.MGFBase
,pyteomics.auxiliary.file_helpers.FileReader
A class representing an MGF file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by title using the indexing syntax (if the file is seekable), but it takes linear time to search through the file. Consider using IndexedMGF for constant-time access to spectra.
MGF object behaves as an iterator, yielding spectra one by one. Each ‘spectrum’ is a dict with four keys: ‘m/z array’, ‘intensity array’, ‘charge array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MGF, lowercased).
-
__init__
(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding=None)[source]¶ Create an MGF file object, set MGF-specific parameters.
Parameters:
- source (str or file or None, optional) – A file object (or file name) with data in MGF format. Default is None, which means read standard input.
- use_header (bool, optional, keyword only) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
- convert_arrays (one of {0, 1, 2}, optional, keyword only) – If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
- read_charges (bool, optional, keyword only) – If True (default), fragment charges are reported. Disabling it improves performance.
- dtype (type or str or dict, optional, keyword only) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’ and/or ‘charge array’.
- encoding (str, optional, keyword only) – File encoding.
-
reset
()¶ Resets the iterator to its initial state.
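The difference between a linear search by title and a constant-time lookup through a byte offset index can be illustrated with a small standalone sketch. It does not use Pyteomics and is not how IndexedMGF is implemented internally; `build_title_index` and `read_block_at` are hypothetical helpers, and the sample data is made up. The sketch also hints at why indexed readers expect binary mode: the stored offsets are byte positions.

```python
import io

SAMPLE = (
    b"BEGIN IONS\nTITLE=first\nPEPMASS=400.2\n100.0 1.0\nEND IONS\n"
    b"BEGIN IONS\nTITLE=second\nPEPMASS=500.3\n150.0 2.0\n250.0 3.0\nEND IONS\n"
)


def build_title_index(stream):
    """Scan a binary stream once; map each TITLE to the offset of its BEGIN IONS line."""
    index = {}
    block_start = None
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        stripped = line.strip()
        if stripped == b"BEGIN IONS":
            block_start = offset
        elif stripped.startswith(b"TITLE=") and block_start is not None:
            title = stripped[len(b"TITLE="):].decode("utf-8")
            index[title] = block_start
    return index


def read_block_at(stream, offset):
    """Seek to a known byte offset and return the lines of one spectrum block."""
    stream.seek(offset)
    lines = []
    for line in iter(stream.readline, b""):
        lines.append(line.decode("utf-8").rstrip("\n"))
        if line.strip() == b"END IONS":
            break
    return lines


stream = io.BytesIO(SAMPLE)
index = build_title_index(stream)                 # one linear pass over the file
second = read_block_at(stream, index["second"])   # later lookups only need a seek
```

After the single indexing pass, each lookup costs a dictionary access plus a seek, whereas a sequential parser has to re-read the file from the start for every title.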
-
-
class
pyteomics.mgf.
MGFBase
(source=None, **kwargs)[source]¶
Bases: object
Abstract mixin class representing an MGF file. Subclasses implement different approaches to parsing.
-
__init__
(source=None, **kwargs)[source]¶ Create an MGF file object, set MGF-specific parameters.
Parameters:
- source (str or file or None, optional) – A file object (or file name) with data in MGF format. Default is None, which means read standard input.
- use_header (bool, optional, keyword only) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
- convert_arrays (one of {0, 1, 2}, optional, keyword only) – If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
- read_charges (bool, optional, keyword only) – If True (default), fragment charges are reported. Disabling it improves performance.
- dtype (type or str or dict, optional, keyword only) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’ and/or ‘charge array’.
- encoding (str, optional, keyword only) – File encoding.
-
-
pyteomics.mgf.
get_spectrum
(source, title, *args, **kwargs)[source]¶ Read one spectrum (with given title) from source.
See read() for explanation of parameters affecting the output.
Note: Only the key-value pairs after the “TITLE =” line will be included in the output.
Returns: out – A dict with the spectrum, if it is found, and None otherwise.
-
pyteomics.mgf.
read
(*args, **kwargs)[source]¶ Returns a reader for a given MGF file. Most of the parameters repeat the instantiation signature of MGF and IndexedMGF. The additional parameter use_index helps decide which class to instantiate for a given source.
Parameters:
- source (str or file or None, optional) – A file object (or file name) with data in MGF format. Default is None, which means read standard input.
- use_header (bool, optional) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
- convert_arrays (one of {0, 1, 2}, optional) – If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
- read_charges (bool, optional) – If True (default), fragment charges are reported. Disabling it improves performance.
- dtype (type or str or dict, optional) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’ and/or ‘charge array’.
- encoding (str, optional) – File encoding.
- use_index (bool, optional) – Determines which parsing method to use. If True (default), an instance of IndexedMGF is created. This facilitates random access by spectrum titles. If an open file is passed as source, it needs to be open in binary mode. If False, an instance of MGF is created. It reads source in text mode and is suitable for iterative parsing. Access by spectrum title requires linear search and thus takes linear time.
- block_size (int, optional) – Size of the chunk (in bytes) used to parse the file when creating the byte offset index. (Accepted only for IndexedMGF.)
Returns: out – Instance of MGF or IndexedMGF.
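As a minimal usage sketch of read(), the helper below iterates over an MGF file with the sequential parser and collects each title and peak count. The sample file contents and the helper names `write_sample` and `summarize` are made up for illustration; only `pyteomics.mgf.read` and its use_index parameter come from the documentation above.

```python
import os
import tempfile

# A made-up, single-spectrum MGF file used only for this illustration.
SAMPLE_MGF = """BEGIN IONS
TITLE=spec1
PEPMASS=500.25
CHARGE=2+
100.0 10.0
200.0 20.0
END IONS
"""


def write_sample(path):
    """Write the sample MGF text to the given path."""
    with open(path, "w") as handle:
        handle.write(SAMPLE_MGF)


def summarize(path):
    """Return (title, number of peaks) for every spectrum in an MGF file."""
    # Imported lazily so that write_sample() is usable without Pyteomics installed.
    from pyteomics import mgf

    # use_index=False selects the sequential, text-mode MGF parser.
    with mgf.read(path, use_index=False) as reader:
        return [(s["params"]["title"], len(s["m/z array"])) for s in reader]
```

Calling `write_sample(path)` followed by `summarize(path)` should yield one entry per spectrum. With the default use_index=True the same loop works, and spectra can additionally be looked up by title.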
-
pyteomics.mgf.
read_header
(source)[source]¶ Read the specified MGF file, get search parameters specified in the header as a dict, the keys corresponding to MGF format (lowercased).
Parameters: source (str or file) – File name or file object representing a file in MGF format.
Returns: header
Return type: dict
-
pyteomics.mgf.
write
(spectra, output=None, header='', key_order=['title', 'pepmass', 'rtinseconds', 'charge'], fragment_format=None, write_charges=True, use_numpy=None, param_formatters={'charge': <function _charge_repr>, 'pepmass': <function _pepmass_repr>})[source]¶ Create a file in MGF format.
Parameters:
- spectra (iterable) – A sequence of dictionaries with keys ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ should be sequences of int, float, or str. Strings will be written ‘as is’. The sequences should be of equal length, otherwise excess values will be ignored. ‘params’ should be a dict with keys corresponding to MGF format. Keys must be strings, they will be uppercased and used as is, without any format consistency tests. Values can be of any type allowing string representation. ‘charge array’ can also be specified.
- output (str or file or None, optional) – Path or a file-like object open for writing. If an existing file is specified by file name, it will be opened for appending. In this case writing with a header can result in violation of format conventions. Default value is None, which means using standard output.
- header (dict or (multiline) str or list of str, optional) – In case of a single string or a list of strings, the header will be written ‘as is’. In case of dict, the keys (must be strings) will be uppercased.
- write_charges (bool, optional) – If False, fragment charges from ‘charge array’ will not be written. Default is True.
- fragment_format (str, optional) – Format string for m/z, intensity and charge of a fragment. Useful to set the number of decimal places, e.g.: fragment_format='%.4f %.0f'. Default is '{} {} {}'.
Note: The supported format syntax differs depending on other parameters. If use_numpy is True and numpy is available, fragment peaks will be written using numpy.savetxt(). Then, fragment_format must be recognized by that function. Otherwise, plain Python string formatting is done. See the docs for details on writing the format string. If some or all charges are missing, an empty string is substituted instead, so formatting as float or int will raise an exception. Hence it is safer to just use {} for charges.
- key_order (list, optional) – A list of strings specifying the order in which params will be written in the spectrum header. Unlisted keys will be in arbitrary order. Default is _default_key_order.
Note: This does not affect the order of lines in the global header.
- param_formatters (dict, optional) – A dict mapping parameter names to functions. Each function must accept two arguments (key and value) and return a string. Default is _default_value_formatters.
- use_numpy (bool, optional) – Controls whether fragment peak arrays are written using numpy.savetxt(). Using numpy.savetxt() is faster, but cannot handle sparse arrays of fragment charges. You may want to disable this if you need to save spectra with ‘charge arrays’ with missing values. If not specified, will be set to the opposite of write_charges. If numpy is not available, this parameter has no effect.
- file_mode (str, keyword only, optional) – If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘a’.
- encoding (str, keyword only, optional) – Output file encoding (if output is specified by name).
Returns: output
Return type: file
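The remark about missing charges in fragment_format can be reproduced with plain Python string formatting. The snippet below is only an illustration of that behaviour, not the code write() runs; `format_peaks` is a hypothetical helper and the peak values are made up.

```python
# (m/z, intensity, charge) tuples; the second peak has no charge,
# represented here by an empty string as described for missing values.
peaks = [(100.1234, 10.0, 1), (200.5678, 20.0, "")]


def format_peaks(peaks, fragment_format="{} {} {}"):
    """Format peak tuples with plain str.format."""
    return [fragment_format.format(*peak) for peak in peaks]


# A bare {} placeholder accepts both integers and empty strings.
safe = format_peaks(peaks, "{:.2f} {:.0f} {}")

# An integer format code fails as soon as a charge is missing.
try:
    format_peaks(peaks, "{:.2f} {:.0f} {:d}")
    failed = False
except ValueError:
    failed = True
```

Here `safe` holds one formatted line per peak, while the `{:d}` variant raises `ValueError` on the peak without a charge.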
ms1 - read and write MS/MS data in MS1 format¶
Summary¶
MS1 is a simple human-readable format for MS1 data. It allows storing MS1 peak lists and experimental parameters.
This module provides minimalistic infrastructure for access to data stored in MS1 files. The two main classes are MS1, which provides an iterative, text-mode parser, and IndexedMS1, which is a binary-mode parser that supports random access using scan IDs and retention times. The function read() helps dispatch between the two classes. Also, common parameters can be read from the MS1 file header with the read_header() function.
Functions¶
read() - iterate through spectra in MS1 file. Data from a single spectrum are converted to a human-readable dict.
chain() - read multiple files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
read_header() - get a dict with common parameters for all spectra from the beginning of MS1 file.
-
pyteomics.ms1.
chain
(*args, **kwargs)¶ Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.
-
chain.
from_iterable
(files, **kwargs)¶ Chain read() for several files. Keyword arguments are passed to the read() function.
Parameters: files – Iterable of file names or file objects.
-
class
pyteomics.ms1.
IndexedMS1
(source=None, use_header=False, convert_arrays=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)[source]¶
Bases: pyteomics.ms1.MS1Base, pyteomics.auxiliary.file_helpers.TaskMappingMixin, pyteomics.auxiliary.file_helpers.TimeOrderedIndexedReaderMixin, pyteomics.auxiliary.file_helpers.IndexedTextReader
A class representing an MS1 file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by title using the indexing syntax in constant time. If created using a file object, it needs to be opened in binary mode.
When iterated, IndexedMS1 object yields spectra one by one. Each ‘spectrum’ is a dict with four keys: ‘m/z array’, ‘intensity array’, ‘charge array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MS1).
Warning: Labels for scan objects are constructed as the first number in the S line, as follows: for a line S 0 1 the label is ‘0’. If these labels are not unique for the scans in the file, the indexed parser will not work correctly. Consider using MS1 instead.
-
time
¶ A property used for accessing spectra by retention time.
Type: RTLocator
-
__init__
(source=None, use_header=False, convert_arrays=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)[source]¶ Instantiate a TaskMappingMixin object, set default parameters for IPC.
Parameters:
- queue_timeout (float, keyword only, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.
- queue_size (int, keyword only, optional) – The length of IPC queue used.
- processes (int, keyword only, optional) – Number of worker processes to spawn when map() is called. This can also be specified in the map() call.
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
- **_kwargs – Additional keyword arguments to be passed to the target function
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
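Because the indexed parser relies on unique scan labels, it can be worth checking a file before choosing between IndexedMS1 and MS1. The sketch below takes the first number of every S line as the label, following the warning above; it is a standalone illustration rather than the parser's own logic, and `scan_labels`, `duplicated_labels` and the sample text are hypothetical.

```python
from collections import Counter

# Made-up MS1-like content in which the label "1" occurs twice.
SAMPLE_MS1 = """H\tCreationDate\t2020-01-01
S\t1\t1
I\tRTime\t0.5
400.1 1000
S\t2\t2
I\tRTime\t1.0
401.1 2000
S\t1\t1
402.1 3000
"""


def scan_labels(lines):
    """Collect the first number of every S line."""
    labels = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "S":
            labels.append(fields[1])
    return labels


def duplicated_labels(lines):
    """Return the labels that occur more than once, sorted."""
    counts = Counter(scan_labels(lines))
    return sorted(label for label, n in counts.items() if n > 1)


duplicates = duplicated_labels(SAMPLE_MS1.splitlines())
```

A non-empty `duplicates` list suggests falling back to the sequential parser for that file.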
-
-
class
pyteomics.ms1.
MS1
(source=None, use_header=False, convert_arrays=True, dtype=None, encoding=None, **kwargs)[source]¶
Bases: pyteomics.ms1.MS1Base, pyteomics.auxiliary.file_helpers.FileReader
A class representing an MS1 file. Supports the with syntax and direct iteration for sequential parsing.
MS1 object behaves as an iterator, yielding spectra one by one. Each ‘spectrum’ is a dict with three keys: ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, and ‘params’ stores a dict of parameters.
-
__init__
(source=None, use_header=False, convert_arrays=True, dtype=None, encoding=None, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.ms1.
MS1Base
(source=None, use_header=False, convert_arrays=True, dtype=None, **kwargs)[source]¶ Bases:
object
Abstract class representing an MS1 file. Subclasses implement different approaches to parsing.
-
pyteomics.ms1.
read
(*args, **kwargs)[source]¶ Read an MS1 file and return entries iteratively.
Read the specified MS1 file, yield spectra one by one. Each ‘spectrum’ is a dict with three keys: ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, and ‘params’ stores a dict of parameters.
Parameters:
- source (str or file or None, optional) – A file object (or file name) with data in MS1 format. Default is None, which means read standard input.
- use_header (bool, optional) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is False.
- convert_arrays (bool, optional) – If False, m/z and intensities will be returned as regular lists. If True (default), they will be converted to regular numpy.ndarray’s. Conversion requires numpy.
- dtype (type or str or dict, optional) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’ and/or ‘intensity array’.
- encoding (str, optional) – File encoding.
- use_index (bool, optional) – Determines which parsing method to use. If True, an instance of IndexedMS1 is created. This facilitates random access by scan titles. If an open file is passed as source, it needs to be open in binary mode. If False (default), an instance of MS1 is created. It reads source in text mode and is suitable for iterative parsing.
Warning: Labels for scan objects are constructed as the first number in the S line, as follows: for a line S 0 1 the label is ‘0’. If these labels are not unique for the scans in the file, the indexed parser will not work correctly.
- block_size (int, optional) – Size of the chunk (in bytes) used to parse the file when creating the byte offset index. (Accepted only for IndexedMS1.)
Returns: out – An instance of MS1 or IndexedMS1, depending on use_index and source.
ms2 - read and write MS/MS data in MS2 format¶
Summary¶
MS2 is a simple human-readable format for MS2 data. It allows storing MS2 peak lists and experimental parameters.
This module provides minimalistic infrastructure for access to data stored in MS2 files. The two main classes are MS2, which provides an iterative, text-mode parser, and IndexedMS2, which is a binary-mode parser that supports random access using scan IDs and retention times. The function read() helps dispatch between the two classes. Also, common parameters can be read from the MS2 file header with the read_header() function.
Functions¶
read() - iterate through spectra in MS2 file. Data from a single spectrum are converted to a human-readable dict.
chain() - read multiple files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
read_header() - get a dict with common parameters for all spectra from the beginning of MS2 file.
-
pyteomics.ms2.
chain
(*args, **kwargs)¶ Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.
-
chain.
from_iterable
(files, **kwargs)¶ Chain read() for several files. Keyword arguments are passed to the read() function.
Parameters: files – Iterable of file names or file objects.
-
class
pyteomics.ms2.
IndexedMS2
(source=None, use_header=False, convert_arrays=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)[source]¶
Bases: pyteomics.ms1.IndexedMS1
A class representing an MS2 file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by title using the indexing syntax in constant time. If created using a file object, it needs to be opened in binary mode.
When iterated, IndexedMS2 object yields spectra one by one. Each ‘spectrum’ is a dict with four keys: ‘m/z array’, ‘intensity array’, ‘charge array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MS2).
Warning: Labels for scan objects are constructed as the first number in the S line, as follows: for a line S 0 1 123.4 the label is ‘0’. If these labels are not unique for the scans in the file, the indexed parser will not work correctly. Consider using MS2 instead.
-
time
¶ A property used for accessing spectra by retention time.
Type: RTLocator
-
__init__
(source=None, use_header=False, convert_arrays=True, dtype=None, encoding='utf-8', _skip_index=False, **kwargs)¶ Instantiate a TaskMappingMixin object, set default parameters for IPC.
Parameters:
- queue_timeout (float, keyword only, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.
- queue_size (int, keyword only, optional) – The length of IPC queue used.
- processes (int, keyword only, optional) – Number of worker processes to spawn when map() is called. This can also be specified in the map() call.
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
- **_kwargs – Additional keyword arguments to be passed to the target function
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.ms2.
MS2
(source=None, use_header=False, convert_arrays=True, dtype=None, encoding=None, **kwargs)[source]¶
Bases: pyteomics.ms1.MS1
A class representing an MS2 file. Supports the with syntax and direct iteration for sequential parsing.
MS2 object behaves as an iterator, yielding spectra one by one. Each ‘spectrum’ is a dict with three keys: ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, and ‘params’ stores a dict of parameters.
-
__init__
(source=None, use_header=False, convert_arrays=True, dtype=None, encoding=None, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
reset
()¶ Resets the iterator to its initial state.
-
-
pyteomics.ms2.
read
(*args, **kwargs)[source]¶ Read an MS2 file and return entries iteratively.
Read the specified MS2 file, yield spectra one by one. Each ‘spectrum’ is a dict with three keys: ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, and ‘params’ stores a dict of parameters.
Parameters:
- source (str or file or None, optional) – A file object (or file name) with data in MS2 format. Default is None, which means read standard input.
- use_header (bool, optional) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is False.
- convert_arrays (bool, optional) – If False, m/z and intensities will be returned as regular lists. If True (default), they will be converted to regular numpy.ndarray’s. Conversion requires numpy.
- dtype (type or str or dict, optional) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’ and/or ‘intensity array’.
- encoding (str, optional) – File encoding.
- use_index (bool, optional) – Determines which parsing method to use. If True, an instance of IndexedMS2 is created. This facilitates random access by scan titles. If an open file is passed as source, it needs to be open in binary mode. If False (default), an instance of MS2 is created. It reads source in text mode and is suitable for iterative parsing.
Warning: Labels for scan objects are constructed as the first number in the S line, as follows: for a line S 0 1 123.4 the label is ‘0’. If these labels are not unique for the scans in the file, the indexed parser will not work correctly.
- block_size (int, optional) – Size of the chunk (in bytes) used to parse the file when creating the byte offset index. (Accepted only for IndexedMS2.)
Returns: out – An instance of MS2 or IndexedMS2, depending on use_index and source.
pepxml - pepXML file reader¶
Summary¶
pepXML was the first widely accepted format for proteomics search engines’ output. Even though it is to be replaced by a community standard mzIdentML, it is still used commonly.
This module provides minimalistic infrastructure for access to data stored in pepXML files. The most important function is read(), which reads peptide-spectrum matches and related information and saves them into human-readable dicts. This function relies on the terminology of the underlying lxml library.
Data access¶
PepXML - a class representing a single pepXML file. Other data access functions use this class internally.
read() - iterate through peptide-spectrum matches in a pepXML file. Data for a single spectrum are converted to an easy-to-use dict.
chain() - read multiple files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
DataFrame() - read pepXML files into a pandas.DataFrame.
Target-decoy approach¶
filter() - filter PSMs from a chain of pepXML files to a specific FDR using TDA.
filter.chain() - chain a series of filters applied independently to several files.
filter.chain.from_iterable() - chain a series of filters applied independently to an iterable of files.
filter_df() - filter pepXML files and return a pandas.DataFrame.
fdr() - estimate the false discovery rate of a PSM set using the target-decoy approach.
qvalues() - get an array of scores and local FDR values for a PSM set using the target-decoy approach.
is_decoy() - determine whether a PSM is decoy or not.
Miscellaneous¶
roc_curve() - get a receiver-operator curve (min PeptideProphet probability in a sample vs. false discovery rate) of PeptideProphet analysis.
Deprecated functions¶
iterfind() - iterate over elements in a pepXML file. You can just call the corresponding method of the PepXML object.
version_info() - get information about pepXML version and schema. You can just read the corresponding attribute of the PepXML object.
Dependencies¶
This module requires lxml.
-
pyteomics.pepxml.
chain
(*sources, **kwargs)¶ Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.
-
pyteomics.pepxml.
sources
¶ Sources for creating new sequences from, such as paths or file-like objects
Type: Iterable
-
pyteomics.pepxml.
kwargs
¶ Additional arguments used to instantiate each sequence
Type: Mapping
-
-
chain.
from_iterable
(files, **kwargs)¶ Chain read() for several files. Keyword arguments are passed to the read() function.
Parameters: files – Iterable of file names or file objects.
-
pyteomics.pepxml.
filter
(*args, **kwargs)¶ Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.
Requires numpy and, optionally, pandas.
Parameters:
- args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
- fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
- key (callable / array-like / iterable / str, keyword only, optional) – A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.
Warning: The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
- is_decoy (callable / array-like / iterable / str, keyword only, optional) – A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.
Warning: The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- remove_decoy (bool, keyword only, optional) – Defines whether decoy matches should be removed from the output. Default is True.
Note: If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.
- formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
- ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) – Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- pep (callable / array-like / iterable / str, keyword only, optional) – If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Note: If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- full_output (bool, keyword only, optional) – If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.
Note: The name for the parameter comes from the fact that it is internally passed to qvalues().
- q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
- score_label (str, optional) – Field name for score in the output. Default is 'score'.
- decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
- pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
- **kwargs – passed to the chain() function.
Returns: out
Return type: iterator or numpy.ndarray or pandas.DataFrame
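The idea behind target-decoy filtering can be sketched in plain Python. The code below is not the Pyteomics implementation: it assumes PSMs are simple dicts with made-up `evalue` and `proteins` fields, treats a PSM as decoy when all its protein names carry the prefix, and estimates FDR as the number of decoys divided by the number of targets (scaled by ratio), which is assumed here to correspond to formula 1 with no correction. `filter_psms` and `is_decoy` are hypothetical names.

```python
def is_decoy(psm, prefix="DECOY_"):
    """Treat a PSM as decoy if all of its protein names start with the prefix."""
    return all(name.startswith(prefix) for name in psm["proteins"])


def filter_psms(psms, fdr, key, reverse=False, ratio=1.0):
    """Keep the largest top-ranked set of PSMs whose estimated FDR is <= fdr.

    FDR is estimated as N_decoy / (N_target * ratio); decoys are dropped
    from the returned list.
    """
    ranked = sorted(psms, key=key, reverse=reverse)
    n_target = n_decoy = 0
    cutoff = 0  # number of ranked PSMs to keep
    for i, psm in enumerate(ranked, start=1):
        if is_decoy(psm):
            n_decoy += 1
        else:
            n_target += 1
        if n_target and n_decoy / (n_target * ratio) <= fdr:
            cutoff = i
    return [psm for psm in ranked[:cutoff] if not is_decoy(psm)]


psms = [
    {"id": "a", "evalue": 1e-9, "proteins": ["sp|P1"]},
    {"id": "b", "evalue": 1e-7, "proteins": ["sp|P2"]},
    {"id": "c", "evalue": 1e-5, "proteins": ["DECOY_sp|P3"]},
    {"id": "d", "evalue": 1e-4, "proteins": ["sp|P4"]},
    {"id": "e", "evalue": 1e-3, "proteins": ["DECOY_sp|P5"]},
    {"id": "f", "evalue": 1e-2, "proteins": ["sp|P6", "DECOY_sp|P7"]},
]

# Smaller e-values are better, so the default ascending sort is used.
strict = filter_psms(psms, fdr=0.0, key=lambda p: p["evalue"])
relaxed = filter_psms(psms, fdr=0.5, key=lambda p: p["evalue"])
```

With a threshold of 0 only the PSMs ranked above the first decoy survive, while a looser threshold admits lower-ranked targets as long as the running decoy-to-target ratio stays within the limit.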
-
filter.
chain
(*files, **kwargs)¶ Chain filter() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter() function.
-
filter.chain.
from_iterable
(*files, **kwargs)¶ Chain filter() for several files. Keyword arguments are passed to the filter() function.
Parameters: files – Iterable of file names or file objects.
-
pyteomics.pepxml.
version_info
(source)¶ Provide version information about the pepXML file.
Note: This function is provided for backward compatibility only. It simply creates a PepXML instance and returns its version_info attribute.
Parameters: source (str or file) – File name or file-like object.
Returns: out – A (version, schema URL) tuple, both elements are strings or None.
Return type: tuple
-
pyteomics.pepxml.iterfind(source, path, **kwargs)
Parse source and yield info on elements with specified local name or by specified “XPath”.
Note
This function is provided for backward compatibility only. If you make multiple iterfind() calls on one file, you should create a PepXML object and use its iterfind() method.
Parameters:
- source (str or file) – File name or file-like object.
- path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition at the end, such as: "/path/to/element[some_value>1.5]". Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
- recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
- iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
- read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the pepXML header. Otherwise, use default parameters. Not recommended without an Internet connection or if you would rather avoid the related warnings.
Returns: out
Return type: iterator
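The remark about filtering with plain Python can be illustrated with a minimal sketch. It assumes only that the yielded elements behave like dicts; the sample records and the `select` helper below are hypothetical and not part of Pyteomics.

```python
def select(elements, predicate):
    """Lazily yield only the elements for which predicate(element) is true."""
    return (element for element in elements if predicate(element))


# Hypothetical stand-ins for the dicts an iterfind() call might yield.
elements = [
    {"name": "a", "some_value": 1.2},
    {"name": "b", "some_value": 2.0},
    {"name": "c"},  # the attribute is missing here
]

# Equivalent of the path condition [some_value>1.5], plus an extra test
# that a single path condition could not express.
hits = list(select(
    elements,
    lambda e: e.get("some_value", 0) > 1.5 and e["name"] != "c",
))
```

A predicate written this way can combine several conditions and tolerate missing keys, which the single trailing condition in a path cannot do.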
-
pyteomics.pepxml.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)
Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:
FDR = \frac{N_{decoy}}{N_{target} \cdot ratio}
The second formula is:
FDR = \frac{N_{decoy} \cdot (1 + \frac{1}{ratio})}{N_{total}}
Note
This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.
Parameters:
- psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
- formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
- is_decoy (callable, iterable, or str, optional) –
If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- pep (callable, iterable, or str, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.
- ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
Note
Requires numpy, if correction is a float or 2.
Note
Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).
Returns: out – The estimation of FDR, (roughly) between 0 and 1.
Return type: float
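As a rough illustration of the two estimators, the sketch below computes FDR from plain counts. It assumes that formula 1 is N_decoy / (N_target * ratio) and formula 2 is N_decoy * (1 + 1/ratio) / N_total; the function name `estimate_fdr` is hypothetical, and the correction and PEP options are deliberately left out.

```python
def estimate_fdr(n_decoy, n_target, formula=1, ratio=1.0):
    """Estimate FDR from decoy and target counts (illustrative sketch only).

    Assumed definitions:
      formula 1: N_decoy / (N_target * ratio)
      formula 2: N_decoy * (1 + 1 / ratio) / N_total
    """
    if formula not in (1, 2):
        raise ValueError("formula must be 1 or 2")
    if formula == 1:
        return n_decoy / (n_target * ratio)
    n_total = n_target + n_decoy
    return n_decoy * (1 + 1 / ratio) / n_total


# 10 decoys among 1000 targets, decoys already removed from the denominator.
fdr_targets_only = estimate_fdr(10, 1000, formula=1)
# 10 decoys and 990 targets, decoys kept in the counted set.
fdr_with_decoys = estimate_fdr(10, 990, formula=2)
```

Under these assumptions, formula 1 suits a list from which decoys have been removed, while formula 2 counts each decoy twice (once as itself, once as the false target it stands for) against the full list.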
-
pyteomics.pepxml.qvalues(*args, **kwargs)
Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.
Requires numpy (and optionally pandas).
Parameters:
- args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
- key (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Warning
The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
- is_decoy (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- pep (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- remove_decoy (bool, keyword only, optional) –
Defines whether decoy matches should be removed from the output. Default is False.
Note
If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.
- formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
- ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
- score_label (str, optional) – Field name for score in the output. Default is 'score'.
- decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
- pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
- full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
- **kwargs – Passed to the chain() function.
Returns: out – A sorted array of records with the following fields:
- 'score': np.float64
- 'is decoy': np.bool_
- 'q': np.float64
- 'psm': np.object_ (if full_output is True)
Return type: numpy.ndarray
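The idea behind TDA-based q-values can be sketched in pure Python. This is not the Pyteomics implementation (which relies on numpy); it assumes the decoy-to-target ratio estimator with smaller scores being better, and the name `tda_qvalues` is hypothetical.

```python
def tda_qvalues(scores, is_decoy, ratio=1.0):
    """Return (score, is_decoy, q) tuples sorted from best to worst score.

    Smaller scores are treated as better. The FDR at each rank is assumed
    to be N_decoy / (N_target * ratio); the q-value is the smallest FDR
    reachable at that rank or at any more permissive threshold.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    fdrs = []
    n_decoy = n_target = 0
    for i in order:
        if is_decoy[i]:
            n_decoy += 1
        else:
            n_target += 1
        # With no targets yet, the estimate is undefined; use infinity.
        fdrs.append(n_decoy / (n_target * ratio) if n_target else float("inf"))
    # Enforce monotonicity by taking a running minimum from the end.
    qvals = fdrs[:]
    for k in range(len(qvals) - 2, -1, -1):
        qvals[k] = min(qvals[k], qvals[k + 1])
    return [(scores[i], is_decoy[i], q) for i, q in zip(order, qvals)]


table = tda_qvalues([3.0, 1.0, 4.0, 2.0], [True, False, False, False])
```

Because of the running minimum, q-values never decrease down the sorted list, which is why the last q-value equals the FDR estimate for the whole set.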
-
pyteomics.pepxml.DataFrame(*args, **kwargs)
Read pepXML output files into a pandas.DataFrame.
Requires pandas.
Parameters:
- *args – Passed to chain().
- **kwargs – Passed to chain().
- sep (str or None, keyword only, optional) – Some values related to PSMs (such as protein information) are variable-length lists. If sep is a str, they will be packed into a single string using this delimiter. If sep is None, they are kept as lists. Default is None.
- pd_kwargs (dict, optional) – Keyword arguments passed to the pandas.DataFrame constructor.
Returns: out
Return type: pandas.DataFrame
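The effect of sep can be pictured with a small pure-Python sketch. The helper `pack_lists` and the sample record are hypothetical; they only mimic the described behaviour on a single dict and say nothing about how Pyteomics builds the actual DataFrame.

```python
def pack_lists(record, sep=None):
    """Join list-valued fields into one string when sep is a str.

    With sep=None the record is returned unchanged (as a shallow copy),
    mirroring the documented default of keeping lists as lists.
    """
    if sep is None:
        return dict(record)
    return {
        key: sep.join(map(str, value)) if isinstance(value, list) else value
        for key, value in record.items()
    }


row = {"protein": ["sp|P1|A", "sp|P2|B"], "expect": 0.01}
packed = pack_lists(row, sep=";")
unpacked = pack_lists(row)
```

Packed strings are convenient for writing to CSV, while lists are easier to work with when proteins need to be examined individually.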
-
class pyteomics.pepxml.PepXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)
Bases: pyteomics.xml.MultiProcessingXML, pyteomics.xml.IndexSavingXML
Parser class for pepXML files.
-
__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)
Create an indexed XML parser object.
Parameters:
- source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether the schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
- indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache()
Construct a cache for each element in the document, indexed by id attribute
-
build_tree()
Build and store the ElementTree instance for the underlying file
-
clear_id_cache()
Clear the element ID cache
-
clear_tree()
Remove the saved ElementTree.
-
get_by_id(elem_id, id_key=None, element_type=None, **kwargs)
Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
-
iterfind(path, **kwargs)
Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters:
- path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
- **_kwargs – Additional keyword arguments to be passed to the target function
Yields: object – The work item returned by the target function.
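A possible way to use map() is sketched below. It assumes that each entry is a dict whose "search_hit" key holds a list of hits (an assumption about the PSM layout, not something stated above); the helper names are hypothetical. Since results arrive out of order, the sketch reduces them with an order-independent sum.

```python
def count_search_hits(psm, min_hits=0):
    """Target function: receives one parsed entry plus extra keyword arguments.

    "search_hit" is an assumed key; entries without it count as zero hits.
    """
    n_hits = len(psm.get("search_hit", []))
    return n_hits if n_hits >= min_hits else 0


def total_hits(path, processes=2):
    """Sum the hit counts over all entries of a pepXML file in parallel."""
    # Imported lazily so that the helper above can be used without Pyteomics.
    from pyteomics import pepxml

    with pepxml.PepXML(path) as reader:
        return sum(
            reader.map(count_search_hits, processes=processes,
                       kwargs={"min_hits": 1})
        )
```

The target is defined at module level so that it can be sent to worker processes; a lambda or nested function would generally not be picklable.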
-
classmethod prebuild_byte_offset_file(path)
Construct a new XML reader, build its byte offset index and write it to file
Parameters: path (str) – The path to the file to parse
-
reset()
Resets the iterator to its initial state.
-
write_byte_offsets()
Write the byte offsets in _offset_index to the file at _byte_offset_filename
-
pyteomics.pepxml.filter_df(*args, **kwargs)
Read pepXML files or DataFrames and return a DataFrame with filtered PSMs. Positional arguments can be pepXML files or DataFrames.
Requires pandas.
Parameters:
- key (str / iterable / callable, keyword only, optional) – PSM score. Default is ‘expect’.
- is_decoy (str / iterable / callable, keyword only, optional) – Default is to check if all strings in the “protein” column start with ‘DECOY_’
- *args – Passed to auxiliary.filter() and/or DataFrame().
- **kwargs – Passed to auxiliary.filter() and/or DataFrame().
Returns: out
Return type: pandas.DataFrame
-
pyteomics.pepxml.is_decoy(psm, prefix='DECOY_')
Given a PSM dict, return True if all protein names for the PSM start with prefix, and False otherwise. This function might not work for some pepXML flavours. Use the source to get the idea and suit it to your needs.
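When the default does not suit a particular pepXML flavour, a custom decoy test can follow the same rule. The sketch below assumes that protein names live under psm["search_hit"][0]["proteins"][i]["protein"]; this layout and the name `is_decoy_by_prefix` are assumptions made for illustration.

```python
def is_decoy_by_prefix(psm, prefix="DECOY_"):
    """Return True if every protein of the top hit starts with prefix.

    Assumed layout: psm["search_hit"][0]["proteins"] is a list of dicts,
    each with a "protein" key holding the protein name.
    """
    proteins = psm["search_hit"][0]["proteins"]
    return all(entry["protein"].startswith(prefix) for entry in proteins)


decoy_psm = {"search_hit": [{"proteins": [{"protein": "DECOY_sp|P1|A"}]}]}
mixed_psm = {"search_hit": [{"proteins": [{"protein": "DECOY_sp|P1|A"},
                                          {"protein": "sp|P2|B"}]}]}
```

A PSM matching both a decoy and a target protein counts as a target under this rule, because the peptide could have come from a real protein.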
-
pyteomics.pepxml.read(source, read_schema=False, iterative=True, **kwargs)
Parse source and iterate through peptide-spectrum matches.
Parameters:
- source (str or file) – A path to a target pepXML file or the file object itself.
- read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the pepXML header. Otherwise, use default parameters. Not recommended without an Internet connection or if you would rather avoid the related warnings.
- iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
Returns: out – An iterator over dicts with PSM properties.
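A typical way to consume read() might look like the sketch below. It assumes that each PSM dict keeps its hits in a "search_hit" list and that a hit has a "peptide" key; both keys and the helper names are assumptions for illustration.

```python
def top_peptide(psm):
    """Return the peptide of the first search hit, or None if there is none.

    "search_hit" and "peptide" are assumed key names.
    """
    hits = psm.get("search_hit") or []
    return hits[0].get("peptide") if hits else None


def first_peptides(path, limit=5):
    """Collect up to `limit` top-hit peptide sequences from a pepXML file."""
    # Imported lazily so that top_peptide() works without Pyteomics installed.
    from pyteomics import pepxml

    found = []
    with pepxml.read(path) as psms:
        for psm in psms:
            peptide = top_peptide(psm)
            if peptide is not None:
                found.append(peptide)
            if len(found) >= limit:
                break
    return found
```

Using the reader in a with block closes the underlying file even when iteration stops early.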
protxml - parsing of ProteinProphet output files
Summary
protXML is the output format of the ProteinProphet software. It contains information about identified proteins and their statistical significance.
This module provides minimalistic infrastructure for access to data stored in protXML files. The central class is ProtXML, which reads protein entries and related information and saves them into Python dicts.
Data access
ProtXML - a class representing a single protXML file. Other data access functions use this class internally.
read() - iterate through protein groups in a protXML file. Calling the function is equivalent to instantiating the ProtXML class.
chain() - read multiple files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
DataFrame() - read protXML files into a pandas.DataFrame.
Target-decoy approach
filter() - filter protein groups from a chain of protXML files to a specific FDR using TDA.
filter.chain() - chain a series of filters applied independently to several files.
filter.chain.from_iterable() - chain a series of filters applied independently to an iterable of files.
filter_df() - filter protXML files and return a pandas.DataFrame.
fdr() - estimate the false discovery rate of a set of protein groups using the target-decoy approach.
qvalues() - get an array of scores and q values for protein groups using the target-decoy approach.
is_decoy() - determine whether a protein group is decoy or not. This function may not suit your use case.
Dependencies
This module requires lxml.
-
pyteomics.protxml.chain(*sources, **kwargs)
Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.
-
pyteomics.protxml.sources
Sources for creating new sequences from, such as paths or file-like objects
Type: Iterable
-
pyteomics.protxml.kwargs
Additional arguments used to instantiate each sequence
Type: Mapping
-
chain.from_iterable(files, **kwargs)
Chain read() for several files. Keyword arguments are passed to the read() function.
Parameters: files – Iterable of file names or file objects.
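The chaining idea can be sketched with the standard library alone. The reader function and the in-memory "files" below are hypothetical stand-ins; the real chain() presumably also takes care of opening and closing each source, which this sketch ignores.

```python
from itertools import chain


def chain_readers(read, sources, **kwargs):
    """Lazily concatenate the entries produced by read(source, **kwargs)."""
    return chain.from_iterable(read(source, **kwargs) for source in sources)


# Hypothetical stand-ins for files and a reader function.
fake_files = {"a.prot.xml": [1, 2], "b.prot.xml": [3]}


def fake_read(name, scale=1):
    """Yield the entries of one fake file, scaled by a keyword argument."""
    return (entry * scale for entry in fake_files[name])


combined = list(chain_readers(fake_read, ["a.prot.xml", "b.prot.xml"], scale=10))
```

Each source is only read when the iteration reaches it, so memory use stays bounded by a single file's reader.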
-
pyteomics.protxml.filter(*args, **kwargs)
Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.
Requires numpy and, optionally, pandas.
Parameters:
- args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
- fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
- key (callable / array-like / iterable / str, keyword only, optional) –
A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.
Warning
The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
- is_decoy (callable / array-like / iterable / str, keyword only, optional) –
A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.
Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- remove_decoy (bool, keyword only, optional) –
Defines whether decoy matches should be removed from the output. Default is True.
Note
If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.
- formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
- ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- pep (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- full_output (bool, keyword only, optional) –
If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.
Note
The name for the parameter comes from the fact that it is internally passed to qvalues().
- q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
- score_label (str, optional) – Field name for score in the output. Default is 'score'.
- decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
- pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
- **kwargs – Passed to the chain() function.
Returns: out
Return type: iterator or numpy.ndarray or pandas.DataFrame
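The filtering logic described above can be approximated in pure Python as follows. This is a simplified sketch, not the library's code: it always uses the decoy-to-target ratio with ratio = 1 and no correction, and the name `filter_to_fdr` and the sample records are hypothetical.

```python
def filter_to_fdr(entries, fdr, key, is_decoy, remove_decoy=True):
    """Keep the longest best-scoring prefix whose estimated FDR is <= fdr.

    Entries are sorted by key (smaller is better). FDR at each rank is
    estimated as N_decoy / N_target, a simplifying assumption.
    """
    ranked = sorted(entries, key=key)
    n_decoy = n_target = 0
    cutoff = 0
    for rank, entry in enumerate(ranked, start=1):
        if is_decoy(entry):
            n_decoy += 1
        else:
            n_target += 1
        # Remember the last rank at which the estimate is still acceptable.
        if n_target and n_decoy / n_target <= fdr:
            cutoff = rank
    accepted = ranked[:cutoff]
    if remove_decoy:
        accepted = [entry for entry in accepted if not is_decoy(entry)]
    return accepted


entries = [
    {"score": 1, "decoy": False},
    {"score": 2, "decoy": False},
    {"score": 3, "decoy": True},
    {"score": 4, "decoy": False},
    {"score": 5, "decoy": True},
]
kept = filter_to_fdr(entries, 0.4, key=lambda e: e["score"],
                     is_decoy=lambda e: e["decoy"])
strict = filter_to_fdr(entries, 0.0, key=lambda e: e["score"],
                       is_decoy=lambda e: e["decoy"])
```

Taking the last acceptable rank rather than stopping at the first violation is what makes the cut equivalent to thresholding on q-values.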
-
filter.chain(*files, **kwargs)
Chain filter() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter() function.
-
filter.chain.from_iterable(*files, **kwargs)
Chain filter() for several files. Keyword arguments are passed to the filter() function.
Parameters: files – Iterable of file names or file objects.
-
pyteomics.protxml.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)
Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:
FDR = \frac{N_{decoy}}{N_{target} \cdot ratio}
The second formula is:
FDR = \frac{N_{decoy} \cdot (1 + \frac{1}{ratio})}{N_{total}}
Note
This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.
Parameters:
- psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
- formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
- is_decoy (callable, iterable, or str, optional) –
If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- pep (callable, iterable, or str, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.
- ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
Note
Requires numpy, if correction is a float or 2.
Note
Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).
Returns: out – The estimation of FDR, (roughly) between 0 and 1.
Return type: float
-
pyteomics.protxml.
qvalues
(*args, **kwargs)¶ Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.
Requires
numpy
(and optionallypandas
).Parameters: - args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
- key (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
DataFrame
).Warning
The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If
True
, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default isFalse
. - is_decoy (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
DataFrame
).Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- pep (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
DataFrame
).Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- remove_decoy (bool, keyword only, optional) –
Defines whether decoy matches should be removed from the output. Default is
False
.Note
If set to
False
, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation offdr()
for math; basically, if remove_decoy isTrue
, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument. - formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR
estimation. Default is 1 if remove_decoy is
True
, else 2 (seefdr()
for definitions). - ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
- score_label (str, optional) – Field name for score in the output. Default is 'score'.
- decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
- pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
- full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
- **kwargs – Passed to the chain() function.
Returns: out – A sorted array of records with the following fields:
- 'score': np.float64
- 'is decoy': np.bool_
- 'q': np.float64
- 'psm': np.object_ (if full_output is True)
Return type: numpy.ndarray
-
pyteomics.protxml.DataFrame(*args, **kwargs)[source]¶
Read protXML output files into a pandas.DataFrame.
Note
Rows in the DataFrame correspond to individual proteins, not protein groups.
Requires pandas.
Parameters:
- sep (str or None, keyword only, optional) – Some values related to protein groups are variable-length lists. If sep is a str, they will be packed into a single string using this delimiter. If sep is None, they are kept as lists. Default is None.
- pd_kwargs (dict, optional) – Keyword arguments passed to the pandas.DataFrame constructor.
- *args – Passed to chain().
- **kwargs – Passed to chain().
Returns: out
Return type: pandas.DataFrame
-
class pyteomics.protxml.ProtXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]¶
Bases: pyteomics.xml.MultiProcessingXML
Parser class for protXML files.
-
__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)¶
Create an indexed XML parser object.
Parameters:
- source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
- indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
get_by_id
(elem_id, id_key=None, element_type=None, **kwargs)¶ Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
Parameters: Returns: Return type:
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
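The calling convention of map() (one entry per call, extra args and kwargs forwarded, results arriving out of order) can be illustrated with a small standalone sketch. This is not the Pyteomics implementation: it swaps in a thread pool from the standard library instead of worker processes, and the names map_unordered and count_proteins, as well as the layout of the example entries, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def map_unordered(target, entries, processes=4, args=None, kwargs=None):
    """Apply target to every entry and yield results as they complete.

    Assumption: a thread pool stands in for the worker processes described
    above, so the completion order is not guaranteed to match the input order.
    """
    args = tuple(args or ())
    kwargs = dict(kwargs or {})
    with ThreadPoolExecutor(max_workers=processes) as pool:
        futures = [pool.submit(target, entry, *args, **kwargs) for entry in entries]
        for future in as_completed(futures):
            yield future.result()


def count_proteins(group, min_probability=0.0):
    # Hypothetical entry layout: a dict with a "protein" list of dicts.
    return sum(1 for p in group["protein"] if p["probability"] >= min_probability)


groups = [
    {"protein": [{"probability": 0.99}, {"probability": 0.4}]},
    {"protein": [{"probability": 0.2}]},
]
# Sort the results, because the order of arrival is not defined.
counts = sorted(map_unordered(count_proteins, groups, kwargs={"min_probability": 0.5}))
```

Because the order is undefined, a target function that needs to identify its input should return an identifier together with the result.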
-
reset
()¶ Resets the iterator to its initial state.
-
-
pyteomics.protxml.filter_df(*args, **kwargs)[source]¶
Read protXML files or DataFrames and return a DataFrame with filtered PSMs. Positional arguments can be protXML files or DataFrames.
Note
Rows in the DataFrame correspond to individual proteins, not protein groups.
Requires pandas.
Parameters:
- key (str / iterable / callable, keyword only, optional) – Default is ‘probability’.
- is_decoy (str / iterable / callable, keyword only, optional) – Default is to check that “protein_name” starts with ‘DECOY_’.
- reverse (bool, keyword only, optional) – Should be True if higher score is better. Default is True (because the default key is ‘probability’).
- *args – Passed to auxiliary.filter() and/or DataFrame().
- **kwargs – Passed to auxiliary.filter() and/or DataFrame().
Returns: out
Return type: pandas.DataFrame
-
pyteomics.protxml.
is_decoy
(pg, prefix='DECOY_')¶ Determine if a protein group should be considered decoy.
This function checks that all protein names in a group start with prefix. You may need to provide your own function for correct filtering and FDR estimation.
Parameters: Returns: out
Return type:
-
pyteomics.protxml.
read
(source, read_schema=False, iterative=True, **kwargs)[source]¶ Parse source and iterate through protein groups.
Parameters: - source (str or file) – A path to a target protXML file or the file object itself.
- read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the protXML header. Otherwise, use default parameters. Not recommended without an Internet connection or if you want to avoid the related warnings.
- iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
Returns: out – An iterator over dicts with protein group properties.
Return type:
tandem - X!Tandem output file reader¶
Summary¶
X!Tandem is an open-source proteomic search engine with a very simple, sophisticated application programming interface (API): it simply takes an XML file of instructions on its command line, and outputs the results into an XML file, which has been specified in the input XML file. The output format is described here (PDF).
This module provides a minimalistic way to extract information from X!Tandem output files. You can use the old functional interface (read()) or the new object-oriented interface (TandemXML) to iterate over entries in <group> elements, i.e. identifications for a certain spectrum.
Data access¶
TandemXML - a class representing a single X!Tandem output file. Other data access functions use this class internally.
read() - iterate through peptide-spectrum matches in an X!Tandem output file. Data from a single PSM are converted to a human-readable dict.
chain() - read multiple files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
DataFrame() - read X!Tandem output files into a pandas.DataFrame.
Target-decoy approach¶
filter() - iterate through peptide-spectrum matches in a chain of X!Tandem output files, yielding only top PSMs and keeping false discovery rate (FDR) at the desired level. The FDR is estimated using the target-decoy approach (TDA).
filter.chain() - chain a series of filters applied independently to several files.
filter.chain.from_iterable() - chain a series of filters applied independently to an iterable of files.
filter_df() - filter X!Tandem output files and return a pandas.DataFrame.
is_decoy() - determine if a PSM is from the decoy database.
fdr() - estimate the FDR in a data set using TDA.
qvalues() - get an array of scores and local FDR values for a PSM set using the target-decoy approach.
Deprecated functions¶
iterfind() - iterate over elements in an X!Tandem file. You can just call the corresponding method of the TandemXML object.
Dependencies¶
This module requires lxml and numpy.
-
pyteomics.tandem.chain(*sources, **kwargs)¶
Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.
-
pyteomics.tandem.sources¶
Sources for creating new sequences from, such as paths or file-like objects
Type: Iterable
-
pyteomics.tandem.kwargs¶
Additional arguments used to instantiate each sequence
Type: Mapping
-
-
chain.from_iterable(files, **kwargs)¶
Chain read() for several files. Keyword arguments are passed to the read() function.
Parameters: files – Iterable of file names or file objects.
-
pyteomics.tandem.filter(*args, **kwargs)¶
Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.
Requires numpy and, optionally, pandas.
Parameters:
- args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
- fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
- key (callable / array-like / iterable / str, keyword only, optional) –
A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.
Warning
The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
. - is_decoy (callable / array-like / iterable / str, keyword only, optional) –
A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.
Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- remove_decoy (bool, keyword only, optional) –
Defines whether decoy matches should be removed from the output. Default is True.
Note
If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.
- formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
- ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- pep (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- full_output (bool, keyword only, optional) –
If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.
Note
The name for the parameter comes from the fact that it is internally passed to qvalues().
- q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
- score_label (str, optional) – Field name for score in the output. Default is 'score'.
- decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
- pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
- **kwargs – Passed to the chain() function.
Returns: out
Return type: iterator or numpy.ndarray or pandas.DataFrame
-
filter.chain(*files, **kwargs)¶
Chain filter() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter() function.
-
filter.chain.from_iterable(*files, **kwargs)¶
Chain filter() for several files. Keyword arguments are passed to the filter() function.
Parameters: files – Iterable of file names or file objects.
-
pyteomics.tandem.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)¶
Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:
$$FDR = \frac{N_{decoy}}{N_{target} \cdot ratio}$$
The second formula is:
$$FDR = \frac{N_{decoy} \cdot (1 + \frac{1}{ratio})}{N_{total}}$$
Note
This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.
Parameters:
- psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
- formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
- is_decoy (callable, iterable, or str, optional) –
If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- pep (callable, iterable, or str, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.
- ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
Note
Requires numpy, if correction is a float or 2.
Note
Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).
Returns: out – The estimation of FDR, (roughly) between 0 and 1.
Return type: float
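The two estimators can be written out as a short standalone sketch. It works on plain counts rather than PSM objects, leaves out the correction and pep options, and assumes that formula 1 divides the decoy count by the target count times ratio while formula 2 uses N_decoy * (1 + 1/ratio) / N_total; the function name estimate_fdr is hypothetical and this is not the library's code.

```python
def estimate_fdr(n_decoy, n_target, formula=1, ratio=1.0):
    """Estimate FDR from target and decoy counts (no correction applied).

    Assumed formulas:
      1: N_decoy / (N_target * ratio)
      2: N_decoy * (1 + 1 / ratio) / N_total, with N_total = N_target + N_decoy
    """
    if formula not in (1, 2):
        raise ValueError("formula must be either 1 or 2")
    if formula == 1:
        return n_decoy / (n_target * ratio)
    n_total = n_target + n_decoy
    return n_decoy * (1 + 1 / ratio) / n_total


# 10 decoy and 990 target matches, equal-sized target and decoy databases.
fdr_targets_only = estimate_fdr(10, 990, formula=1)   # decoys removed from the output
fdr_with_decoys = estimate_fdr(10, 990, formula=2)    # decoys kept in the output
```

Formula 2 gives the larger value here because the decoys themselves are counted as false matches in the reported set.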
-
pyteomics.tandem.qvalues(*args, **kwargs)¶
Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.
Requires numpy (and optionally pandas).
Parameters:
- args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
- key (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Warning
The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
. - is_decoy (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- pep (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- remove_decoy (bool, keyword only, optional) –
Defines whether decoy matches should be removed from the output. Default is False.
Note
If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.
- formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
- ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
- score_label (str, optional) – Field name for score in the output. Default is 'score'.
- decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
- pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
- full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
- **kwargs – Passed to the chain() function.
Returns: out – A sorted array of records with the following fields:
- 'score': np.float64
- 'is decoy': np.bool_
- 'q': np.float64
- 'psm': np.object_ (if full_output is True)
Return type: numpy.ndarray
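To make the relationship between scores, decoy flags and q-values concrete, here is a minimal pure-Python sketch of a generic target-decoy q-value calculation. It is an illustration under stated assumptions, not the code behind qvalues(): it uses the decoy/target ratio estimator without correction, returns plain tuples instead of a NumPy record array, and assigns an FDR of 1.0 while no target has been seen yet (an arbitrary choice made for this sketch). The name qvalues_sketch is hypothetical.

```python
def qvalues_sketch(scores, decoy_flags, reverse=False, ratio=1.0):
    """Return (score, is_decoy, q) tuples sorted from best to worst score."""
    # Best PSMs first: ascending by default (e.g. e-values), descending if reverse.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)

    # Running FDR estimate at every score threshold.
    fdrs = []
    n_decoy = n_target = 0
    for i in order:
        if decoy_flags[i]:
            n_decoy += 1
        else:
            n_target += 1
        fdrs.append(n_decoy / (n_target * ratio) if n_target else 1.0)

    # q-value: the lowest FDR reachable at this threshold or any looser one.
    q = fdrs[:]
    for k in range(len(q) - 2, -1, -1):
        q[k] = min(q[k], q[k + 1])

    return [(scores[i], decoy_flags[i], q[k]) for k, i in enumerate(order)]


evalues = [0.001, 0.002, 0.01, 0.02, 0.5]
decoys = [False, False, True, False, True]
records = qvalues_sketch(evalues, decoys)

# Keep target PSMs whose q-value does not exceed the desired level.
accepted = [r for r in records if not r[1] and r[2] <= 0.01]
```

The backward pass is what makes q-values monotonic: a PSM never gets a higher q-value than a worse-scoring PSM below it.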
-
pyteomics.tandem.iterfind(source, path, **kwargs)[source]¶
Parse source and yield info on elements with specified local name or by specified “XPath”.
Note
This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create a TandemXML object and use its iterfind() method.
Parameters:
- source (str or file) – File name or file-like object.
- path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
- recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
- iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
Returns: out
Return type: iterator
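The idea of iterating over elements by local name, without specifying namespaces and without building the whole tree, can be sketched with the standard library alone. This is not how iterfind() is implemented (the module is built on lxml); xml.etree.ElementTree is swapped in so that the example is self-contained, and the helper name iter_by_local_name and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_by_local_name(source, name):
    """Yield attribute dicts of elements whose local name equals name.

    Elements are released as soon as they have been reported, which keeps
    memory usage low on large files (the point of iterative parsing).
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        # ElementTree spells namespaced tags as "{uri}local"; keep "local" only.
        local = elem.tag.rsplit("}", 1)[-1]
        if local == name:
            yield dict(elem.attrib)
            elem.clear()


sample = io.BytesIO(
    b'<bioml xmlns:GAML="http://www.bioml.com/gaml/">'
    b'<group id="1" expect="0.01"/>'
    b'<GAML:trace id="7"/>'
    b'<group id="2" expect="0.5"/>'
    b'</bioml>'
)
groups = list(iter_by_local_name(sample, "group"))
```

Further filtering, such as the numeric condition shown in the path example above, can then be done in plain Python on the yielded dicts.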
-
pyteomics.tandem.DataFrame(*args, **kwargs)[source]¶
Read X!Tandem output files into a pandas.DataFrame.
Requires pandas.
Parameters:
- sep (str or None, optional) – Some values related to PSMs (such as protein information) are variable-length lists. If sep is a str, they will be packed into a single string using this delimiter. If sep is None, they are kept as lists. Default is None.
- pd_kwargs (dict, optional) – Keyword arguments passed to the pandas.DataFrame constructor.
- *args – Passed to chain().
- **kwargs – Passed to chain().
Returns: out
Return type: pandas.DataFrame
-
class pyteomics.tandem.TandemXML(*args, **kwargs)[source]¶
Bases: pyteomics.xml.XML
Parser class for TandemXML files.
-
__init__(*args, **kwargs)[source]¶
Create an XML parser object.
Parameters:
- source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- build_id_cache (bool, optional) – Defines whether a dictionary mapping IDs to XML tree elements should be built and stored on the instance. It is used in XML.get_by_id(), e.g. when using pyteomics.mzid.MzIdentML with retrieve_refs=True.
- huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
get_by_id(elem_id, **kwargs)¶
Parse the file and return the element with id attribute equal to elem_id. Returns None if no such element is found.
Parameters: elem_id (str) – The value of the id attribute to match.
Returns: out
Return type: dict or None
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
reset
()¶ Resets the iterator to its initial state.
-
-
pyteomics.tandem.filter_df(*args, **kwargs)[source]¶
Read X!Tandem output files or DataFrames and return a DataFrame with filtered PSMs. Positional arguments can be X!Tandem output files or DataFrames.
Requires pandas.
Parameters:
- key (str / iterable / callable, optional) – Default is ‘expect’.
- is_decoy (str / iterable / callable, optional) – Default is to check if all strings in the “protein” column start with ‘DECOY_’.
- *args – Passed to auxiliary.filter() and/or DataFrame().
- **kwargs – Passed to auxiliary.filter() and/or DataFrame().
Returns: out
Return type: pandas.DataFrame
-
pyteomics.tandem.is_decoy(psm, prefix='DECOY_')¶
Given a PSM dict, return True if all protein names for the PSM start with prefix, and False otherwise.
Parameters: Returns: out
Return type:
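When the default check does not fit your files, a custom is_decoy callable can follow the same rule. The sketch below shows that rule on a hand-made PSM dict; the layout used here (a "protein" list whose items carry a "label" string) is an assumption about the parsed output, so inspect a real PSM before relying on it. The name is_decoy_sketch is hypothetical.

```python
def is_decoy_sketch(psm, prefix="DECOY_"):
    """Return True only if every protein name of the PSM starts with prefix.

    Assumption: protein names are stored as psm["protein"][i]["label"].
    """
    return all(protein["label"].startswith(prefix) for protein in psm["protein"])


target_psm = {"protein": [{"label": "sp|P12345|EXAMPLE"}, {"label": "DECOY_sp|Q00001|X"}]}
decoy_psm = {"protein": [{"label": "DECOY_sp|P12345|EXAMPLE"}]}

flags = [is_decoy_sketch(target_psm), is_decoy_sketch(decoy_psm)]
```

A PSM that matches both a target and a decoy protein is treated as a target match under this rule, since not all of its protein names carry the prefix.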
mzid - mzIdentML file reader¶
Summary¶
mzIdentML is one of the standards developed by the Proteomics Informatics working group of the HUPO Proteomics Standard Initiative.
This module provides a minimalistic way to extract information from mzIdentML files. You can use the old functional interface (read()) or the new object-oriented interface (MzIdentML) to iterate over entries in <SpectrumIdentificationResult> elements, i.e. groups of identifications for a certain spectrum. Note that each entry can contain more than one PSM (peptide-spectrum match). They are accessible with the “SpectrumIdentificationItem” key. MzIdentML objects also support direct indexing by element ID.
Data access¶
MzIdentML - a class representing a single MzIdentML file. Other data access functions use this class internally.
read() - iterate through peptide-spectrum matches in an mzIdentML file. Data from a single PSM group are converted to a human-readable dict. Basically creates an MzIdentML object and reads it.
chain() - read multiple files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
DataFrame() - read MzIdentML files into a pandas.DataFrame.
Target-decoy approach¶
filter()
- read a chain of mzIdentML files and filter to a certain FDR using TDA.
filter.chain()
- chain a series of filters applied independently to several files.
filter.chain.from_iterable()
- chain a series of filters applied independently to an iterable of files.
filter_df()
- filter MzIdentML files and return a pandas.DataFrame
.
is_decoy()
- determine if a “SpectrumIdentificationResult” should be considered decoy.
fdr()
- estimate the false discovery rate of a set of identifications using the target-decoy approach.
qvalues()
- get an array of scores and local FDR values for a PSM set using the target-decoy approach.
Deprecated functions¶
version_info()
- get information about mzIdentML version and schema. You can just read the corresponding attribute of the MzIdentML
object.
get_by_id()
- get an element by its ID and extract the data from it. You can just call the corresponding method of the MzIdentML
object.
iterfind()
- iterate over elements in an mzIdentML file. You can just call the corresponding method of the MzIdentML
object.
Dependencies¶
This module requires lxml
.
-
pyteomics.mzid.
version_info
(source)¶ Provide version information about the mzIdentML file.
Note
This function is provided for backward compatibility only. It simply creates an
MzIdentML
instance and returns its version_info
attribute.
Parameters: source (str or file) – File name or file-like object.
Returns: out – A (version, schema URL) tuple, both elements are strings or None.
Return type: tuple
-
pyteomics.mzid.
fdr
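(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)¶ Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:
$$FDR = \frac{N_{decoy\ PSMs} + correction}{N_{target\ PSMs} \times ratio}$$
The second formula is:
$$FDR = \frac{N_{decoy\ PSMs} + correction}{N_{decoy\ PSMs} + N_{target\ PSMs}} \times \left(1 + \frac{1}{ratio}\right)$$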
Note
This function is less versatile than
qvalues()
. To obtain FDR, you can call qvalues()
and take the last q-value. This function can be used (with correction = 0 or 1) when numpy
is not available.
Parameters: - psms (iterable, optional) – An iterable of PSMs, e.g. as returned by
read()
. Not needed if is_decoy is an iterable. - formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
- is_decoy (callable, iterable, or str, optional) –
If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is
is_decoy()
. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame
).
Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- pep (callable, iterable, or str, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
pandas.DataFrame
).
Note
If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.
- ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
Note
Requires
numpy
, if correction is a float or 2.
Note
Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using
filter()
without correction).
Returns: out – The estimation of FDR, (roughly) between 0 and 1.
Return type: float
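The decoy-counting estimate described for fdr() can be sketched in plain Python. This is not the library implementation: the function name fdr_estimate is hypothetical, the two formulas are written down as an assumption about what fdr() computes (decoys over targets scaled by ratio, and decoys over all PSMs scaled by 1 + 1/ratio), and only the integer corrections 0 and 1 are handled.

```python
def fdr_estimate(n_decoy, n_target, formula=1, ratio=1.0, correction=0):
    """Estimate FDR from decoy and target PSM counts (illustrative sketch only)."""
    if formula not in (1, 2):
        raise ValueError("formula must be 1 or 2")
    if correction not in (0, 1):
        raise ValueError("this sketch only handles correction = 0 or 1")
    if n_target <= 0:
        raise ValueError("at least one target PSM is needed")
    decoy = n_decoy + correction
    if formula == 1:
        # Assumed formula 1: decoys relative to targets, scaled by database ratio.
        return decoy / (n_target * ratio)
    # Assumed formula 2: decoys relative to all PSMs, including the decoys themselves.
    return decoy / (n_decoy + n_target) * (1 + 1 / ratio)


# 10 decoys among 1000 PSMs in total.
print(fdr_estimate(10, 990))
print(fdr_estimate(10, 990, formula=2))
```

The first form suits a PSM set from which decoys will be removed, the second a set that still contains them, which matches how remove_decoy selects the default formula in filter() and qvalues().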
-
pyteomics.mzid.
qvalues
(*args, **kwargs)¶ Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.
Requires
numpy
(and optionally pandas
).
Parameters: - args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
- key (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
DataFrame
).
Warning
The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If
True
, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False
. - is_decoy (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
DataFrame
).
Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- pep (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
DataFrame
).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- remove_decoy (bool, keyword only, optional) –
Defines whether decoy matches should be removed from the output. Default is
False
.
Note
If set to
False
, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr()
for math; basically, if remove_decoy is True
, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument. - formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR
estimation. Default is 1 if remove_decoy is
True
, else 2 (see fdr()
for definitions). - ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- q_label (str, optional) – Field name for q-value in the output. Default is
'q'
. - score_label (str, optional) – Field name for score in the output. Default is
'score'
. - decoy_label (str, optional) – Field name for the decoy flag in the output. Default is
'is decoy'
. - pep_label (str, optional) – Field name for PEP in the output. Default is
'PEP'
. - full_output (bool, keyword only, optional) – If
True
, then the returned array has PSM objects along with scores and q-values. Default is False
. - **kwargs (passed to the
chain()
function.) –
Returns: out – A sorted array of records with the following fields:
- ’score’:
np.float64
- ’is decoy’:
np.bool_
- ’q’:
np.float64
- ’psm’:
np.object_
(if full_output is True
)
Return type: numpy.ndarray
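To make the relation between sorted scores, decoy flags and q-values more concrete, here is a simplified pure-Python sketch. It is not the library code: qvalues_sketch is a hypothetical name, it returns a list of tuples rather than a NumPy record array, and it assumes the simplest estimate (decoy count divided by target count, no ratio or correction).

```python
def qvalues_sketch(scores, decoy_flags, reverse=False):
    """Return (score, is_decoy, q) tuples sorted from best to worst score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)
    n_decoy = n_target = 0
    fdrs = []
    for i in order:
        if decoy_flags[i]:
            n_decoy += 1
        else:
            n_target += 1
        # FDR of the set made of this PSM and all better-scoring ones.
        fdrs.append(n_decoy / n_target if n_target else float("inf"))
    # A q-value is the lowest FDR at which the PSM is still accepted,
    # so take the running minimum from the worst score upwards.
    for k in range(len(fdrs) - 2, -1, -1):
        fdrs[k] = min(fdrs[k], fdrs[k + 1])
    return [(scores[i], decoy_flags[i], q) for i, q in zip(order, fdrs)]


rows = qvalues_sketch([0.5, 0.001, 0.02, 0.01], [False, False, True, False])
for score, is_decoy, q in rows:
    print(score, is_decoy, round(q, 3))
```

With reverse=False smaller scores are better, mirroring the key and reverse parameters documented above.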
-
pyteomics.mzid.
chain
(*sources, **kwargs)¶ Chain
sequence_maker()
for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to thesequence_maker()
function.-
pyteomics.mzid.
sources
¶ Sources for creating new sequences from, such as paths or file-like objects
Type: Iterable
-
pyteomics.mzid.
kwargs
¶ Additional arguments used to instantiate each sequence
Type: Mapping
-
-
chain.
from_iterable
(files, **kwargs)¶ Chain
read()
for several files. Keyword arguments are passed to the read()
function.
Parameters: files – Iterable of file names or file objects.
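Conceptually, chaining means reading each source in turn and presenting the entries as one stream. The sketch below shows that idea with a stand-in reader; chain_sketch and fake_read are hypothetical names, and the real chain() may handle file opening and closing differently.

```python
def chain_sketch(reader, sources, **kwargs):
    """Yield entries from every source in turn, passing kwargs to the reader."""
    for source in sources:
        for entry in reader(source, **kwargs):
            yield entry


# A stand-in for read(): maps a file name to a list of already parsed entries.
FAKE_FILES = {"a.mzid": [1, 2], "b.mzid": [3]}


def fake_read(source):
    return FAKE_FILES[source]


print(list(chain_sketch(fake_read, ["a.mzid", "b.mzid"])))  # [1, 2, 3]
```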
-
pyteomics.mzid.
filter
(*args, **kwargs)¶ Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.
Requires
numpy
and, optionally, pandas
.
Parameters: - args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
- fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
- key (callable / array-like / iterable / str, keyword only, optional) –
A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.
Warning
The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If
True
, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False
. - is_decoy (callable / array-like / iterable / str, keyword only, optional) –
A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.
Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- remove_decoy (bool, keyword only, optional) –
Defines whether decoy matches should be removed from the output. Default is
True
.
Note
If set to
False
, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr()
for math; basically, if remove_decoy is True
, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument. - formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR
estimation. Default is 1 if remove_decoy is
True
, else 2 (see fdr()
for definitions). - ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- pep (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
DataFrame
).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- full_output (bool, keyword only, optional) –
If
True
, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True
.
Note
The name for the parameter comes from the fact that it is internally passed to
qvalues()
. - q_label (str, optional) – Field name for q-value in the output. Default is
'q'
. - score_label (str, optional) – Field name for score in the output. Default is
'score'
. - decoy_label (str, optional) – Field name for the decoy flag in the output. Default is
'is decoy'
. - pep_label (str, optional) – Field name for PEP in the output. Default is
'PEP'
. - **kwargs (passed to the
chain()
function.) –
Returns: out
Return type: iterator or
numpy.ndarray
or pandas.DataFrame
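Because the default key function may not suit every file flavour, a custom one is often needed. The sketch below shows one possible key that reads the expectation value of the first PSM in an entry; evalue_key and filter_file are hypothetical names, the 'mascot:expectation value' field is assumed to be present in your files, and filter_file requires pyteomics, lxml and numpy.

```python
def evalue_key(psm):
    """Return the score used for sorting: smaller is better."""
    return psm["SpectrumIdentificationItem"][0]["mascot:expectation value"]


def filter_file(path, fdr=0.01):
    """Filter one mzIdentML file to the given FDR level (hypothetical helper)."""
    # Imported lazily so that evalue_key can be tried without pyteomics installed.
    from pyteomics import mzid

    return mzid.filter(path, fdr=fdr, key=evalue_key)


# Trying the key function on a hand-written entry.
entry = {"SpectrumIdentificationItem": [{"mascot:expectation value": 0.01}]}
print(evalue_key(entry))  # 0.01
```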
-
filter.
chain
(*files, **kwargs)¶ Chain
filter()
for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter()
function.
-
filter.chain.
from_iterable
(*files, **kwargs)¶ Chain
filter()
for several files. Keyword arguments are passed to the filter()
function.
Parameters: files – Iterable of file names or file objects.
-
pyteomics.mzid.
DataFrame
(*args, **kwargs)[source]¶ Read MzIdentML files into a
pandas.DataFrame
.
Requires
pandas
.
Warning
Only the first ‘SpectrumIdentificationItem’ element is considered in every ‘SpectrumIdentificationResult’.
Parameters: - *args – Passed to
chain()
. - **kwargs – Passed to
chain()
. - sep (str or None, keyword only, optional) – Some values related to PSMs (such as protein information) are variable-length
lists. If sep is a
str
, they will be packed into a single string using this delimiter. If sep is None
, they are kept as lists. Default is None
.
Returns: out
Return type: pandas.DataFrame
-
class
pyteomics.mzid.
MzIdentML
(*args, **kwargs)[source]¶ Bases:
pyteomics.xml.MultiProcessingXML
, pyteomics.xml.IndexSavingXML
Parser class for MzIdentML files.
-
__init__
(*args, **kwargs)[source]¶ Create an indexed XML parser object.
Parameters: - source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether schema file referenced in the file header
should be used to extract information about value conversion.
Default is
False
. - iterative (bool, optional) – Defines whether an
ElementTree
object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True
. - use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for
elements listed in indexed_tags.
This is useful for random access to spectra in mzML or elements of mzIdentML files,
or for iterative parsing of mzIdentML with
retrieve_refs=True
. If True
, build_id_cache is ignored. If False
, the object acts exactly like XML
. Default is True
. - indexed_tags (container of bytes, optional) – If use_index is
True
, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
get_by_id
(elem_id, id_key=None, element_type=None, **kwargs)¶ Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs (passed to
self._get_info_smart()
.) –
Returns: out
Return type: iterator
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the
target
function over entries of this object across up to processes
processes.
Results will be returned out of order.
Parameters: - target (
Callable
, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args
and kwargs
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (
Sequence
, optional) – Additional positional arguments to be passed to the target function - kwargs (
Mapping
, optional) – Additional keyword arguments to be passed to the target function - **_kwargs – Additional keyword arguments to be passed to the target function
Yields: object – The work item returned by the target function.
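A possible way to use map() is sketched below. The target is a module-level function so that it can be sent to worker processes, and the results are combined with sum(), which does not depend on their order. The names count_items and total_items are hypothetical, and total_items assumes pyteomics and lxml are installed.

```python
def count_items(entry):
    """Target function: number of PSMs in one entry."""
    return len(entry.get("SpectrumIdentificationItem", []))


def total_items(path, processes=2):
    """Count all PSMs in a file using worker processes (hypothetical helper)."""
    # Imported lazily so that count_items can be tried without pyteomics installed.
    from pyteomics import mzid

    with mzid.MzIdentML(path) as reader:
        # Results arrive out of order, which does not matter for a sum.
        return sum(reader.map(count_items, processes=processes))


print(count_items({"SpectrumIdentificationItem": [{}, {}, {}]}))  # 3
```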
-
classmethod
prebuild_byte_offset_file
(path)¶ Construct a new XML reader, build its byte offset index and write it to file
Parameters: path (str) – The path to the file to parse
-
reset
()¶ Resets the iterator to its initial state.
-
write_byte_offsets
()¶ Write the byte offsets in
_offset_index
to the file at_byte_offset_filename
-
-
pyteomics.mzid.
filter_df
(*args, **kwargs)[source]¶ Read MzIdentML files or DataFrames and return a
DataFrame
with filtered PSMs. Positional arguments can be MzIdentML files or DataFrames.
Requires
pandas
.
Warning
Only the first ‘SpectrumIdentificationItem’ element is considered in every ‘SpectrumIdentificationResult’.
Parameters: - key (str / iterable / callable, keyword only, optional) – Default is ‘mascot:expectation value’.
- is_decoy (str / iterable / callable, keyword only, optional) – Default is ‘isDecoy’.
- *args – Passed to
auxiliary.filter()
and/or DataFrame()
. - **kwargs – Passed to
auxiliary.filter()
and/or DataFrame()
.
Returns: out
Return type: pandas.DataFrame
-
pyteomics.mzid.
get_by_id
(source, elem_id, **kwargs)[source]¶ Parse source and return the element with id attribute equal to elem_id. Returns
None
if no such element is found.
Note
This function is provided for backward compatibility only. If you do multiple
get_by_id()
calls on one file, you should create an MzIdentML
object and use its get_by_id()
method.
Returns: out
Return type: dict
or None
-
pyteomics.mzid.
is_decoy
(psm, prefix=None)[source]¶ Given a PSM dict, return
True
if all proteins in the dict are marked as decoy, and False
otherwise.
Returns: out
Return type: bool
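When the default is_decoy() does not match your file flavour, a custom function can be passed to fdr(), qvalues() or filter(). The sketch below marks a PSM as decoy only if every protein accession starts with a given prefix. The function name is hypothetical, and the nested 'PeptideEvidenceRef' and 'accession' keys are an assumption about how the parsed dict is laid out when references are retrieved, so check them against your own data.

```python
def is_decoy_by_prefix(psm, prefix="DECOY_"):
    """Return True if all protein accessions in the PSM dict start with prefix."""
    evidence = [
        ev
        for item in psm.get("SpectrumIdentificationItem", [])
        for ev in item.get("PeptideEvidenceRef", [])
    ]
    # An entry without any protein evidence is not treated as decoy.
    return bool(evidence) and all(
        ev.get("accession", "").startswith(prefix) for ev in evidence
    )


decoy = {"SpectrumIdentificationItem": [
    {"PeptideEvidenceRef": [{"accession": "DECOY_P12345"}]}]}
mixed = {"SpectrumIdentificationItem": [
    {"PeptideEvidenceRef": [{"accession": "DECOY_P12345"},
                            {"accession": "P67890"}]}]}
print(is_decoy_by_prefix(decoy), is_decoy_by_prefix(mixed))  # True False
```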
-
pyteomics.mzid.
iterfind
(source, path, **kwargs)[source]¶ Parse source and yield info on elements with specified local name or by specified “XPath”.
Note
This function is provided for backward compatibility only. If you do multiple
iterfind()
calls on one file, you should create an MzIdentML
object and use its iterfind()
method.
Parameters: - source (str or file) – File name or file-like object.
- path (str) – Element name or XPath-like expression. Only local names separated
with slashes are accepted. An asterisk (*) means any element.
You can specify a single condition in the end, such as:
"/path/to/element[some_value>1.5]"
Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces. - recursive (bool, optional) – If
False
, subelements will not be processed when extracting info from elements. Default is True
. - retrieve_refs (bool, optional) – If
True
, additional information from references will be automatically added to the results. The file processing time will increase. Default is False
. - iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative
parsing significantly reduces memory usage and may be just a little
slower. When retrieve_refs is
True
, however, it is highly recommended to disable iterative parsing if possible. Default value is True
. - read_schema (bool, optional) – If
True
, attempt to extract information from the XML schema mentioned in the mzIdentML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings. - build_id_cache (bool, optional) – Defines whether a cache of element IDs should be built and stored on the
created
MzIdentML
instance. Default value is the value of retrieve_refs.
Returns: out
Return type: iterator
-
pyteomics.mzid.
read
(source, **kwargs)[source]¶ Parse source and iterate through peptide-spectrum matches.
Note
This function is provided for backward compatibility only. It simply creates an
MzIdentML
instance using provided arguments and returns it.
Parameters: - source (str or file) – A path to a target mzIdentML file or the file object itself.
- recursive (bool, optional) – If
False
, subelements will not be processed when extracting info from elements. Default is True
. - retrieve_refs (bool, optional) – If
True
, additional information from references will be automatically added to the results. The file processing time will increase. Default is True
. - iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative
parsing significantly reduces memory usage and may be just a little
slower. When retrieve_refs is
True
, however, it is highly recommended to disable iterative parsing if possible. Default value is True
. - read_schema (bool, optional) – If
True
, attempt to extract information from the XML schema mentioned in the mzIdentML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings. - build_id_cache (bool, optional) –
Defines whether a cache of element IDs should be built and stored on the created
MzIdentML
instance. Default value is the value of retrieve_refs.
Note
This parameter is ignored when
use_index
is True
(default). - use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for
the indexed elements. If
True
(default), build_id_cache is ignored. - indexed_tags (container of bytes, optional) – Defines which elements need to be indexed. Empty set by default.
Returns: out – An iterator over the dicts with PSM properties.
Return type:
mztab - mzTab file reader¶
Summary¶
mzTab is one of the standards developed by the Proteomics Informatics working group of the HUPO Proteomics Standard Initiative.
This module provides a way to read mzTab files into a collection of
pandas.DataFrame
instances in memory, along with a mapping
of the file-level metadata.
Data access¶
MzTab
- a class representing a single mzTab file
-
class
pyteomics.mztab.
MzTab
(path, encoding='utf8', table_format='df')[source]¶ Bases:
pyteomics.mztab._MzTabParserBase
Parser for mzTab format files.
-
file
¶ A file stream wrapper for the file to be read
Type: _file_obj
-
metadata
¶ A mapping of the file-level metadata.
Type: OrderedDict
-
peptide_table
¶ The table of peptides. Not commonly used.
Type: _MzTabTable or pd.DataFrame
-
protein_table
¶ The table of protein identifications.
Type: _MzTabTable or pd.DataFrame
-
small_molecule_table
¶ The table of small molecule identifications.
Type: _MzTabTable or pd.DataFrame
-
spectrum_match_table
¶ The table of spectrum-to-peptide match identifications.
Type: _MzTabTable or pd.DataFrame
-
table_format
¶ The structure type to replace each table with. The string ‘df’ will use pd.DataFrame instances. ‘dict’ will create a dictionary of dictionaries for each table. A callable will be called on each raw _MzTabTable object
Type: ‘df’, ‘dict’, or callable
-
__init__
(path, encoding='utf8', table_format='df')[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
collapse_properties
(proplist)[source]¶ Collapse a flat property list into a hierarchical structure.
This is intended to operate on
Mapping
objects, including dict
, pandas.Series
and pandas.DataFrame
.
{
    "ms_run[1]-format": "Andromeda:apl file format",
    "ms_run[1]-location": "file://...",
    "ms_run[1]-id_format": "scan number only nativeID format"
}
to
{
    "ms_run": [
        {
            "format": "Andromeda:apl file format",
            "location": "file://...",
            "id_format": "scan number only nativeID format"
        }
    ]
}
Parameters: proplist ( Mapping
) – Key-Value pairs to collapse
Returns: The collapsed property list
Return type: OrderedDict
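The transformation shown in the example can be approximated in a few lines of plain Python. This is only a sketch of the idea, not the method's actual implementation: collapse_sketch is a hypothetical name, and it assumes keys of the form entity[index]-property with 1-based indices, leaving all other keys unchanged.

```python
import re
from collections import OrderedDict

# Matches keys such as "ms_run[1]-format".
_KEY = re.compile(r"^(?P<entity>[^\[\]]+)\[(?P<index>\d+)\]-(?P<prop>.+)$")


def collapse_sketch(props):
    """Group 'entity[i]-prop' keys into lists of per-entity mappings."""
    out = OrderedDict()
    for key, value in props.items():
        match = _KEY.match(key)
        if match is None:
            out[key] = value  # not an indexed property: keep as is
            continue
        items = out.setdefault(match.group("entity"), [])
        index = int(match.group("index")) - 1  # indices in keys start at 1
        while len(items) <= index:
            items.append(OrderedDict())
        items[index][match.group("prop")] = value
    return out


flat = {
    "ms_run[1]-format": "Andromeda:apl file format",
    "ms_run[1]-location": "file://...",
    "ms_run[1]-id_format": "scan number only nativeID format",
}
print(collapse_sketch(flat))
```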
-
featurexml - reader for featureXML files¶
Summary¶
featureXML is a format specified in the OpenMS project. It defines a list of LC-MS features observed in an experiment.
This module provides a minimalistic way to extract information from featureXML
files. You can use the old functional interface (read()
) or the new
object-oriented interface (FeatureXML
)
to iterate over entries in <feature>
elements.
FeatureXML
also supports direct indexing with feature IDs.
Data access¶
FeatureXML
- a class representing a single featureXML file. Other data access functions use this class internally.
read()
- iterate through features in a featureXML file. Data from a single feature are converted to a human-readable dict.
chain()
- read multiple featureXML files at once.
chain.from_iterable()
- read multiple files at once, using an iterable of files.
Dependencies¶
This module requires lxml
.
-
pyteomics.openms.featurexml.
chain
(*args, **kwargs)¶ Chain
read()
for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read()
function.
-
chain.
from_iterable
(files, **kwargs)¶ Chain
read()
for several files. Keyword arguments are passed to the read()
function.
Parameters: files – Iterable of file names or file objects.
-
class
pyteomics.openms.featurexml.
FeatureXML
(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]¶ Bases:
pyteomics.xml.MultiProcessingXML
Parser class for featureXML files.
-
__init__
(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)¶ Create an indexed XML parser object.
Parameters: - source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether schema file referenced in the file header
should be used to extract information about value conversion.
Default is
False
. - iterative (bool, optional) – Defines whether an
ElementTree
object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True
. - use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for
elements listed in indexed_tags.
This is useful for random access to spectra in mzML or elements of mzIdentML files,
or for iterative parsing of mzIdentML with
retrieve_refs=True
. If True
, build_id_cache is ignored. If False
, the object acts exactly like XML
. Default is True
. - indexed_tags (container of bytes, optional) – If use_index is
True
, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
get_by_id
(elem_id, id_key=None, element_type=None, **kwargs)¶ Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs (passed to
self._get_info_smart()
.) –
Returns: out
Return type: iterator
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the
target
function over entries of this object across up to processes
processes.
Results will be returned out of order.
Parameters: - target (
Callable
, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args
and kwargs
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (
Sequence
, optional) – Additional positional arguments to be passed to the target function
- kwargs (
Mapping
, optional) – Additional keyword arguments to be passed to the target function
- **_kwargs – Additional keyword arguments to be passed to the target function
Yields: object – The work item returned by the target function.
-
reset
()¶ Resets the iterator to its initial state.
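As a rough sketch of how map() might be used with a featureXML reader: the target should be a module-level function, because it is sent to worker processes. The 'id' and 'intensity' keys and the helper names below are assumptions made for illustration, not part of the documented API; inspect a parsed feature from your own file to see which keys it really has.

```python
def feature_summary(feature):
    """Target for map(): reduce one feature dict to an (id, intensity) pair.

    The 'id' and 'intensity' keys are assumptions about the parsed dict.
    """
    return feature.get('id'), feature.get('intensity')


def summarize_features(path, processes=2):
    """Apply feature_summary to every feature in a featureXML file."""
    # Imported lazily so that feature_summary works without pyteomics and lxml.
    from pyteomics.openms import featurexml

    with featurexml.FeatureXML(path) as reader:
        # map() yields results out of order, so sort them for a stable result.
        return sorted(reader.map(feature_summary, processes=processes),
                      key=lambda item: str(item[0]))
```

On platforms that start worker processes by spawning a fresh interpreter, call summarize_features() from under an `if __name__ == '__main__':` guard.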
-
-
pyteomics.openms.featurexml.
read
(source, read_schema=True, iterative=True, use_index=False)[source]¶ Parse source and iterate through features.
Parameters: - source (str or file) – A path to a target featureXML file or the file object itself.
- read_schema (bool, optional) – If
True
, attempt to extract information from the XML schema mentioned in the file header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings. - iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce
memory usage at almost the same parsing speed. Default is
True
. - use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for
spectrum elements. Default is
False
.
Returns: out – An iterator over the dicts with feature properties.
Return type: iterator
trafoxml - reader for trafoXML files¶
Summary¶
trafoXML is a format specified in the OpenMS project. It defines a transformation, which is a result of retention time alignment.
This module provides a minimalistic way to extract information from trafoXML
files. You can use the old functional interface (read()
) or the new
object-oriented interface (TrafoXML
)
to iterate over entries in <Pair>
elements.
Data access¶
TrafoXML
- a class representing a single trafoXML file. Other data access functions use this class internally.
read()
- iterate through pairs in a trafoXML file. Data from a single trafo are converted to a human-readable dict.
chain()
- read multiple trafoXML files at once.
chain.from_iterable()
- read multiple files at once, using an iterable of files.
Dependencies¶
This module requires lxml
.
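A minimal sketch of collecting retention-time pairs from several trafoXML files with chain(). It assumes that each pair dict exposes 'from' and 'to' keys mirroring the attributes of the <Pair> element; the helper names are hypothetical.

```python
def pairs_to_points(pairs):
    """Turn pair dicts into a sorted list of (from, to) tuples.

    Assumes each dict has 'from' and 'to' keys; check this against your files.
    """
    return sorted((float(p['from']), float(p['to'])) for p in pairs)


def read_alignment(*paths):
    """Collect retention-time pairs from one or more trafoXML files."""
    # Lazy import: pairs_to_points stays usable without pyteomics and lxml.
    from pyteomics.openms import trafoxml

    # read_schema=False skips fetching the schema referenced in the file header.
    with trafoxml.chain(*paths, read_schema=False) as pairs:
        return pairs_to_points(pairs)
```

Sorting by the 'from' value makes the result convenient for interpolation, but it is a design choice of this sketch, not something the module does.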
-
pyteomics.openms.trafoxml.
chain
(*args, **kwargs)¶ Chain
read()
for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read()
function.
-
chain.
from_iterable
(files, **kwargs)¶ Chain
read()
for several files. Keyword arguments are passed to the read()
function.
Parameters: files – Iterable of file names or file objects.
-
class
pyteomics.openms.trafoxml.
TrafoXML
(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)[source]¶ Bases:
pyteomics.xml.XML
Parser class for trafoXML files.
-
__init__
(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)¶ Create an XML parser object.
Parameters: - source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether schema file referenced in the file header
should be used to extract information about value conversion.
Default is
False
. - iterative (bool, optional) – Defines whether an
ElementTree
object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True
. - build_id_cache (bool, optional) – Defines whether a dictionary mapping IDs to XML tree elements
should be built and stored on the instance. It is used in
XML.get_by_id()
, e.g. when using pyteomics.mzid.MzIdentML
with retrieve_refs=True
. - huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether
security checks for XML tree depth and node size should be disabled.
Default is
False
. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
get_by_id
(elem_id, **kwargs)¶ Parse the file and return the element with id attribute equal to elem_id. Returns
None
if no such element is found.
Parameters: elem_id (str) – The value of the id attribute to match.
Returns: out
Return type: dict
or None
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs (passed to
self._get_info_smart()
.) –
Returns: out
Return type: iterator
-
reset
()¶ Resets the iterator to its initial state.
-
-
pyteomics.openms.trafoxml.
read
(source, read_schema=True, iterative=True)[source]¶ Parse source and iterate through pairs.
Parameters: - source (str or file) – A path to a target trafoXML file or the file object itself.
- read_schema (bool, optional) – If
True
, attempt to extract information from the XML schema mentioned in the file header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings. - iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce
memory usage at almost the same parsing speed. Default is
True
.
Returns: out – An iterator over the dicts with pair properties.
Return type: iterator
idxml - idXML file reader¶
Summary¶
idXML is a format specified in the OpenMS project. It defines a list of peptide identifications.
This module provides a minimalistic way to extract information from idXML
files. You can use the old functional interface (read()
) or the new
object-oriented interface (IDXML
) to iterate over entries in
<PeptideIdentification>
elements. Note that each entry can contain more than one PSM
(peptide-spectrum match). They are accessible with 'PeptideHit'
key.
IDXML
objects also support direct indexing by element ID.
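Since one entry can hold several PSMs under the 'PeptideHit' key, a small helper can flatten them. This is a sketch only: the function names are hypothetical, and treating the first listed hit as the one of interest is an assumption.

```python
def iter_hits(entries):
    """Yield (entry, hit) for every PSM stored under the 'PeptideHit' key."""
    for entry in entries:
        for hit in entry.get('PeptideHit', []):
            yield entry, hit


def first_hits(path):
    """Return the first listed hit of each <PeptideIdentification> entry."""
    # Lazy import so that iter_hits can be used without pyteomics and lxml.
    from pyteomics.openms import idxml

    with idxml.IDXML(path) as reader:
        return [entry['PeptideHit'][0] for entry in reader
                if entry.get('PeptideHit')]
```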
Data access¶
IDXML
- a class representing a single idXML file. Other data access functions use this class internally.
read()
- iterate through peptide-spectrum matches in an idXML file. Data from a single PSM group are converted to a human-readable dict. Basically creates an IDXML
object and reads it.
chain()
- read multiple files at once.
chain.from_iterable()
- read multiple files at once, using an iterable of files.
DataFrame()
- read idXML files into a pandas.DataFrame
.
Target-decoy approach¶
filter()
- read a chain of idXML files and filter to a certain FDR using TDA.
filter.chain()
- chain a series of filters applied independently to several files.
filter.chain.from_iterable()
- chain a series of filters applied independently to an iterable of files.
filter_df()
- filter idXML files and return a pandas.DataFrame
.
is_decoy()
- determine if a “SpectrumIdentificationResult” should be considered decoy.
fdr()
- estimate the false discovery rate of a set of identifications using the target-decoy approach.
qvalues()
- get an array of scores and local FDR values for a PSM set using the target-decoy approach.
Deprecated functions¶
version_info()
- get information about idXML version and schema. You can just read the corresponding attribute of the IDXML
object.
get_by_id()
- get an element by its ID and extract the data from it. You can just call the corresponding method of the IDXML
object.
iterfind()
- iterate over elements in an idXML file. You can just call the corresponding method of the IDXML
object.
Dependencies¶
This module requires lxml
.
-
pyteomics.openms.idxml.
version_info
(source)¶ Provide version information about the idXML file.
Note
This function is provided for backward compatibility only. It simply creates an
IDXML
instance and returns its version_info
attribute.
Parameters: source (str or file) – File name or file-like object.
Returns: out – A (version, schema URL) tuple, both elements are strings or None.
Return type: tuple
-
pyteomics.openms.idxml.
fdr
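(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)¶ Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:
$$FDR = \frac{N_{decoy}}{N_{target} \cdot ratio}$$
The second formula is:
$$FDR = \frac{N_{decoy} \cdot \left(1 + \frac{1}{ratio}\right)}{N_{total}}$$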
Note
This function is less versatile than
qvalues()
. To obtain FDR, you can call qvalues()
and take the last q-value. This function can be used (with correction = 0 or 1) when numpy
is not available.
Parameters: - psms (iterable, optional) – An iterable of PSMs, e.g. as returned by
read()
. Not needed if is_decoy is an iterable. - formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
- is_decoy (callable, iterable, or str, optional) –
If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is
is_decoy()
. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame
). Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- pep (callable, iterable, or str, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
pandas.DataFrame
). Note
If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.
- ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
Note
Requires
numpy
, if correction is a float or 2. Note
Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using
filter()
without correction).
Returns: out – The estimation of FDR, (roughly) between 0 and 1.
Return type: float
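The arithmetic behind the two formulas can be sketched in plain Python. This sketch assumes that formula 1 is N_decoy / (N_target * ratio) and formula 2 is N_decoy * (1 + 1 / ratio) / N_total; the function names are hypothetical, the correction term is left out, and fdr() itself remains the reference.

```python
def fdr_formula_1(n_decoy, n_target, ratio=1.0):
    """FDR among target PSMs only (decoys are assumed to be removed later)."""
    return n_decoy / (n_target * ratio)


def fdr_formula_2(n_decoy, n_total, ratio=1.0):
    """FDR among all PSMs, targets and decoys together."""
    return n_decoy * (1 + 1 / ratio) / n_total


# 5 decoy and 100 target PSMs pass a threshold, equal-sized databases.
fdr_targets_only = fdr_formula_1(5, 100)
fdr_all_psms = fdr_formula_2(5, 105)
```

With ratio=1, formula 2 counts every decoy twice: once as a false match in its own right and once as an estimate of a false target match.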
-
pyteomics.openms.idxml.
qvalues
(*args, **kwargs)¶ Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.
Requires
numpy
(and optionally pandas
).
Parameters: - args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
- key (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
DataFrame
). Warning
The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If
True
, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False
. - is_decoy (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
DataFrame
). Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- pep (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
DataFrame
). Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- remove_decoy (bool, keyword only, optional) –
Defines whether decoy matches should be removed from the output. Default is
False
. Note
If set to
False
, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr()
for math; basically, if remove_decoy is True
, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument. - formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR
estimation. Default is 1 if remove_decoy is
True
, else 2 (see fdr()
for definitions). - ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- q_label (str, optional) – Field name for q-value in the output. Default is
'q'
. - score_label (str, optional) – Field name for score in the output. Default is
'score'
. - decoy_label (str, optional) – Field name for the decoy flag in the output. Default is
'is decoy'
. - pep_label (str, optional) – Field name for PEP in the output. Default is
'PEP'
. - full_output (bool, keyword only, optional) – If
True
, then the returned array has PSM objects along with scores and q-values. Default is False
. - **kwargs (passed to the
chain()
function.) –
Returns: out – A sorted array of records with the following fields:
- 'score':
np.float64
- 'is decoy':
np.bool_
- 'q':
np.float64
- 'psm':
np.object_
(if full_output is True
)
Return type: numpy.ndarray
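To make the q-value idea concrete, here is a pure-Python sketch based on decoy counting. It is not the implementation behind qvalues(): it assumes formula 1 with ratio=1 and no correction, and it returns plain tuples instead of a NumPy record array.

```python
def qvalues_sketch(scores, decoy_flags, reverse=False):
    """Return (score, is_decoy, q) tuples sorted from best to worst score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)
    n_target = n_decoy = 0
    fdrs = []
    for i in order:
        if decoy_flags[i]:
            n_decoy += 1
        else:
            n_target += 1
        # Formula 1 with ratio=1; guard against division by zero.
        fdrs.append(n_decoy / n_target if n_target else 1.0)
    # A q-value is the lowest FDR at which the PSM would still be accepted,
    # i.e. the running minimum of FDR taken from the worst score upwards.
    for k in range(len(fdrs) - 2, -1, -1):
        fdrs[k] = min(fdrs[k], fdrs[k + 1])
    return [(scores[i], decoy_flags[i], q) for i, q in zip(order, fdrs)]
```

As in the documented key convention, smaller scores are treated as better unless reverse=True is given.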
-
pyteomics.openms.idxml.
chain
(*sources, **kwargs)¶ Chain
sequence_maker()
for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker()
function.
-
pyteomics.openms.idxml.
sources
¶ Sources for creating new sequences from, such as paths or file-like objects
Type: Iterable
-
pyteomics.openms.idxml.
kwargs
¶ Additional arguments used to instantiate each sequence
Type: Mapping
-
-
chain.
from_iterable
(files, **kwargs)¶ Chain
read()
for several files. Keyword arguments are passed to the read()
function.
Parameters: files – Iterable of file names or file objects.
-
pyteomics.openms.idxml.
filter
(*args, **kwargs)¶ Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.
Requires
numpy
and, optionally, pandas
.
Parameters: - args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
- fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
- key (callable / array-like / iterable / str, keyword only, optional) –
A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.
Warning
The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If
True
, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False
. - is_decoy (callable / array-like / iterable / str, keyword only, optional) –
A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.
Warning
The default function may not work with your files, because format flavours are diverse.
- decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
- decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
- remove_decoy (bool, keyword only, optional) –
Defines whether decoy matches should be removed from the output. Default is
True
. Note
If set to
False
, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr()
for math; basically, if remove_decoy is True
, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument. - formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR
estimation. Default is 1 if remove_decoy is
True
, else 2 (see fdr()
for definitions). - ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- pep (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a
DataFrame
). Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- full_output (bool, keyword only, optional) –
If
True
, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True
. Note
The name for the parameter comes from the fact that it is internally passed to
qvalues()
. - q_label (str, optional) – Field name for q-value in the output. Default is
'q'
. - score_label (str, optional) – Field name for score in the output. Default is
'score'
. - decoy_label (str, optional) – Field name for the decoy flag in the output. Default is
'is decoy'
. - pep_label (str, optional) – Field name for PEP in the output. Default is
'PEP'
. - **kwargs (passed to the
chain()
function.) –
Returns: out
Return type: iterator or
numpy.ndarray
or pandas.DataFrame
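A hedged usage sketch for filter() with an explicit key function, which is useful when the default key does not suit your files. The 'score' field on each hit, the helper names and the score direction are assumptions; whether higher or lower scores are better depends on the search engine that produced the file.

```python
def first_hit_score(psm):
    """Score of the first hit; assumes a numeric 'score' field on each hit."""
    return psm['PeptideHit'][0]['score']


def filter_psms(paths, fdr=0.01, higher_is_better=True):
    """Return the PSMs from the given idXML files that pass the FDR threshold."""
    # Lazy import so that first_hit_score works without pyteomics and lxml.
    from pyteomics.openms import idxml

    # With the default full_output=True the result is array-like,
    # so it is converted to a plain list of PSM dicts here.
    return list(idxml.filter(*paths, fdr=fdr, key=first_hit_score,
                             reverse=higher_is_better))
```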
-
filter.
chain
(*files, **kwargs)¶ Chain
filter()
for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter()
function.
-
filter.chain.
from_iterable
(*files, **kwargs)¶ Chain
filter()
for several files. Keyword arguments are passed to the filter()
function.
Parameters: files – Iterable of file names or file objects.
-
pyteomics.openms.idxml.
DataFrame
(*args, **kwargs)[source]¶ Read idXML files into a
pandas.DataFrame
. Requires
pandas
. Warning
Only the first ‘PeptideHit’ element is considered in every ‘PeptideIdentification’.
Parameters: - *args – Passed to
chain()
- **kwargs – Passed to
chain()
- sep (str or None, keyword only, optional) – Some values related to PSMs (such as protein information) are variable-length
lists. If sep is a
str
, they will be packed into a single string using this delimiter. If sep is None
, they are kept as lists. Default is None
.
Returns: out
Return type: pandas.DataFrame
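The effect of sep can be illustrated on a single record. The helper below only mimics the documented behaviour (lists joined with the delimiter, or kept as lists when sep is None) and is not the library code; the 'accession' key and the function names are hypothetical.

```python
def pack_lists(record, sep=None):
    """Mimic the documented effect of sep on one flat record (a dict)."""
    if sep is None:
        return dict(record)  # variable-length values stay as lists
    return {key: sep.join(map(str, value)) if isinstance(value, list) else value
            for key, value in record.items()}


def load_psms(path, sep=';'):
    """Read an idXML file into a pandas.DataFrame with list values joined."""
    # Lazy import: requires pyteomics, lxml and pandas.
    from pyteomics.openms import idxml

    return idxml.DataFrame(path, sep=sep)
```

Joining with a delimiter gives columns that are easy to export to CSV, while keeping lists is more convenient for further processing in Python.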
-
class
pyteomics.openms.idxml.
IDXML
(*args, **kwargs)[source]¶ Bases:
pyteomics.xml.IndexedXML
Parser class for idXML files.
-
__init__
(*args, **kwargs)[source]¶ Create an indexed XML parser object.
Parameters: - source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether schema file referenced in the file header
should be used to extract information about value conversion.
Default is
False
. - iterative (bool, optional) – Defines whether an
ElementTree
object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True
. - use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for
elements listed in indexed_tags.
This is useful for random access to spectra in mzML or elements of mzIdentML files,
or for iterative parsing of mzIdentML with
retrieve_refs=True
. If True
, build_id_cache is ignored. If False
, the object acts exactly like XML
. Default is True
. - indexed_tags (container of bytes, optional) – If use_index is
True
, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
get_by_id
(elem_id, id_key=None, element_type=None, **kwargs)¶ Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs (passed to
self._get_info_smart()
.) –
Returns: out
Return type: iterator
-
reset
()¶ Resets the iterator to its initial state.
-
-
pyteomics.openms.idxml.
filter_df
(*args, **kwargs)[source]¶ Read idXML files or DataFrames and return a
DataFrame
with filtered PSMs. Positional arguments can be idXML files or DataFrames. Requires
pandas
. Warning
Only the first ‘PeptideHit’ element is considered in every ‘PeptideIdentification’.
Parameters: - key (str / iterable / callable, keyword only, optional) – Peptide identification score. Default is ‘score’. You will probably need to change it.
- is_decoy (str / iterable / callable, keyword only, optional) – Default is ‘is decoy’.
- *args – Passed to
auxiliary.filter()
and/or DataFrame()
. - **kwargs – Passed to
auxiliary.filter()
and/or DataFrame()
.
Returns: out
Return type: pandas.DataFrame
-
pyteomics.openms.idxml.
get_by_id
(source, elem_id, **kwargs)[source]¶ Parse source and return the element with id attribute equal to elem_id. Returns
None
if no such element is found. Note
This function is provided for backward compatibility only. If you do multiple
get_by_id()
calls on one file, you should create an IDXML
object and use its get_by_id()
method.
Returns: out
Return type: dict
or None
-
pyteomics.openms.idxml.
is_decoy
(psm, prefix=None)[source]¶ Given a PSM dict, return
True
if it is marked as decoy, and False
otherwise.
Returns: out
Return type: bool
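When the default is_decoy() does not fit your files, a custom callable can be passed as the is_decoy argument of fdr(), qvalues() or filter(). The sketch below assumes that each hit carries a 'protein' list of dicts with an 'accession' key; this layout is an assumption and should be checked against your own data.

```python
def is_decoy_by_prefix(psm, prefix='DECOY_'):
    """True if every protein of the first hit has a decoy-prefixed accession.

    Assumes psm['PeptideHit'][0]['protein'] is a list of dicts that each
    have an 'accession' key.
    """
    proteins = psm['PeptideHit'][0].get('protein', [])
    return bool(proteins) and all(
        p['accession'].startswith(prefix) for p in proteins)
```

Requiring all proteins to be decoys means that a peptide shared between a target and a decoy protein counts as a target, which is a design choice of this sketch.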
-
pyteomics.openms.idxml.
iterfind
(source, path, **kwargs)[source]¶ Parse source and yield info on elements with specified local name or by specified “XPath”.
Note
This function is provided for backward compatibility only. If you do multiple
iterfind()
calls on one file, you should create an IDXML
object and use its iterfind()
method.
Parameters: - source (str or file) – File name or file-like object.
- path (str) – Element name or XPath-like expression. Only local names separated
with slashes are accepted. An asterisk (*) means any element.
You can specify a single condition in the end, such as:
"/path/to/element[some_value>1.5]"
Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces. - recursive (bool, optional) – If
False
, subelements will not be processed when extracting info from elements. Default is True
. - retrieve_refs (bool, optional) – If
True
, additional information from references will be automatically added to the results. The file processing time will increase. Default is False
. - iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative
parsing significantly reduces memory usage and may be just a little
slower. When retrieve_refs is
True
, however, it is highly recommended to disable iterative parsing if possible. Default value is True
. - read_schema (bool, optional) – If
True
, attempt to extract information from the XML schema mentioned in the IDXML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings. - build_id_cache (bool, optional) – Defines whether a cache of element IDs should be built and stored on the
created
IDXML
instance. Default value is the value of retrieve_refs.
Returns: out
Return type: iterator
-
pyteomics.openms.idxml.
read
(source, **kwargs)[source]¶ Parse source and iterate through peptide-spectrum matches.
Note
This function is provided for backward compatibility only. It simply creates an
IDXML
instance using provided arguments and returns it.
Parameters: - source (str or file) – A path to a target IDXML file or the file object itself.
- recursive (bool, optional) – If
False
, subelements will not be processed when extracting info from elements. Default is True
. - retrieve_refs (bool, optional) – If
True
, additional information from references will be automatically added to the results. The file processing time will increase. Default is True
. - iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative
parsing significantly reduces memory usage and may be just a little
slower. When retrieve_refs is
True
, however, it is highly recommended to disable iterative parsing if possible. Default value is True
. - read_schema (bool, optional) – If
True
, attempt to extract information from the XML schema mentioned in the IDXML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings. - build_id_cache (bool, optional) –
Defines whether a cache of element IDs should be built and stored on the created
IDXML
instance. Default value is the value of retrieve_refs. Note
This parameter is ignored when
use_index
is True
(default). - use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for
the indexed elements. If
True
(default), build_id_cache is ignored. - indexed_tags (container of bytes, optional) – Defines which elements need to be indexed. Empty set by default.
Returns: out – An iterator over the dicts with PSM properties.
Return type: IDXML
traml - targeted MS transition data in TraML format¶
Summary¶
TraML is a standard rich XML-format for targeted mass spectrometry method definitions. Please refer to psidev.info for the detailed specification of the format and structure of TraML files.
This module provides a minimalistic way to extract information from TraML
files. You can use the object-oriented interface (TraML
instances) to
access target definitions and transitions. TraML
objects also support
indexing with entity IDs directly.
Data access¶
TraML
- a class representing a single TraML file. Other data access functions use this class internally.
read()
- iterate through transitions in TraML format.
chain()
- read multiple TraML files at once.
chain.from_iterable()
- read multiple files at once, using an iterable of files.
Deprecated functions¶
version_info()
- get version information about the TraML file. You can just read the corresponding attribute of the TraML
object.
iterfind()
- iterate over elements in a TraML file. You can just call the corresponding method of the TraML
object.
Dependencies¶
This module requires lxml
-
pyteomics.traml.
chain
(*sources, **kwargs)¶ Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.
-
pyteomics.traml.
sources
¶ Sources for creating new sequences from, such as paths or file-like objects
Type: Iterable
-
pyteomics.traml.
kwargs
¶ Additional arguments used to instantiate each sequence
Type: Mapping
-
-
chain.
from_iterable
(files, **kwargs)¶ Chain read() for several files. Keyword arguments are passed to the read() function.
Parameters: files – Iterable of file names or file objects.
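The chaining idea can be sketched with the standard library alone. This is an illustration of the pattern, not the Pyteomics implementation: `chain_sources` and `fake_reader` are hypothetical names, and the "files" are plain lists so that the snippet runs without any input data.

```python
from itertools import chain


def chain_sources(reader, *sources, **kwargs):
    """Lazily iterate over the entries of several sources, one after another.

    `reader` is any callable that takes a source plus keyword arguments and
    returns an iterable (a hypothetical stand-in for a file parser).
    """
    return chain.from_iterable(reader(source, **kwargs) for source in sources)


def fake_reader(source, upper=False):
    # Hypothetical reader: pretends every "file" is a list of strings.
    for entry in source:
        yield entry.upper() if upper else entry


# Keyword arguments are forwarded to every reader call.
entries = list(chain_sources(fake_reader, ["a", "b"], ["c"], upper=True))
print(entries)  # ['A', 'B', 'C']
```

Because the sources are opened one at a time inside a generator, only one of them needs to be active at any moment.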
-
class
pyteomics.traml.
TraML
(*args, **kwargs)[source]¶ Bases: pyteomics.xml.MultiProcessingXML, pyteomics.xml.IndexSavingXML
Parser class for TraML files.
-
__init__
(*args, **kwargs)[source]¶ Create an indexed XML parser object.
Parameters: - source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
- indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
get_by_id
(elem_id, id_key=None, element_type=None, **kwargs)¶ Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
Parameters: Returns: Return type:
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters: - target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
-
classmethod
prebuild_byte_offset_file
(path)¶ Construct a new XML reader, build its byte offset index and write it to file
Parameters: path (str) – The path to the file to parse
-
reset
()¶ Resets the iterator to its initial state.
-
write_byte_offsets
()¶ Write the byte offsets in _offset_index to the file at _byte_offset_filename
-
-
pyteomics.traml.
iterfind
(source, path, **kwargs)[source]¶ Parse source and yield info on elements with specified local name or by specified “XPath”.
Note
This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create a TraML object and use its iterfind() method.
Parameters: - source (str or file) – File name or file-like object.
- path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
- recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
- iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
- read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the TraML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.
Returns: out
Return type: iterator
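The remark about filtering in plain Python can be illustrated as follows. The records below are hypothetical stand-ins for the dicts that an iterfind() call would yield; the keys `some_value` and `label` are invented for this sketch and do not come from any real TraML file.

```python
def filter_entries(entries, predicate):
    """Lazily yield only the entries (dicts) for which predicate(entry) is true."""
    return (entry for entry in entries if predicate(entry))


# Hypothetical records standing in for parser output.
records = [
    {"id": "t1", "some_value": 0.7, "label": "light"},
    {"id": "t2", "some_value": 2.1, "label": "heavy"},
    {"id": "t3", "some_value": 3.4, "label": "light"},
]

# Two conditions combined, which a single path predicate cannot express.
selected = list(
    filter_entries(
        records,
        lambda e: e.get("some_value", 0) > 1.5 and e["label"] == "light",
    )
)
print([e["id"] for e in selected])  # ['t3']
```

The same generator works unchanged on any iterable of dicts, so it can wrap a parser object directly.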
-
pyteomics.traml.
read
(source, retrieve_refs=True, read_schema=False, iterative=True, use_index=False, huge_tree=False)[source]¶ Parse source and iterate through transitions.
Parameters: - source (str or file) – A path to a target TraML file or the file object itself.
- retrieve_refs (bool, optional) – If True, additional information from references will be automatically added to the results. The file processing time will increase. Default is True.
- read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the TraML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.
- iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for spectrum elements. Default is False.
- huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
Returns: out – A TraML object, suitable for iteration and possibly random access.
Return type:
pylab_aux - auxiliary functions for plotting with pylab¶
This module serves as a collection of useful routines for data plotting with matplotlib.
Generic plotting¶
plot_line()
- plot a line.
scatter_trend()
- plot a scatter plot with a regression line.
plot_function_3d()
- plot a 3D graph of a function of two variables.
plot_function_contour()
- plot a contour graph of a function of two variables.
Spectrum visualization¶
plot_spectrum()
- plot a single spectrum (m/z vs intensity).
annotate_spectrum()
- plot and annotate peaks in MS/MS spectrum.
FDR control¶
plot_qvalue_curve()
- plot the dependence of q-value on the amount of PSMs (similar to a ROC curve).
Dependencies¶
This module requires matplotlib
.
-
pyteomics.pylab_aux.
annotate_spectrum
(spectrum, peptide, centroided=True, *args, **kwargs)[source]¶ Plot a spectrum and annotate matching fragment peaks.
Parameters: - spectrum (dict) – A spectrum as returned by Pyteomics parsers. Needs to have ‘m/z array’ and ‘intensity array’ keys.
- peptide (str) – A modX sequence.
- centroided (bool, optional) – Passed to plot_spectrum().
- types (Container, keyword only, optional) – Ion types to be considered for annotation. Default is (‘b’, ‘y’).
- maxcharge (int, keyword only, optional) – Maximum charge state for fragment ions to be considered. Default is 1.
- colors (dict, keyword only, optional) – Keys are ion types, values are colors to plot the annotated peaks with. Defaults to a red-blue scheme.
- ftol (float, keyword only, optional) – A fixed m/z tolerance value for peak matching. Alternative to rtol.
- rtol (float, keyword only, optional) – A relative m/z error for peak matching. Default is 10 ppm.
- adjust_text (bool, keyword only, optional) – Adjust the overlapping text annotations using adjustText.
- text_kw (dict, keyword only, optional) – Keyword arguments for pylab.text().
- adjust_kw (dict, keyword only, optional) – Keyword arguments for adjust_text().
- ion_comp (dict, keyword only, optional) – A dictionary of ion composition definitions to override pyteomics.mass.std_ion_comp.
- mass_data (dict, keyword only, optional) – A dictionary of element masses to override pyteomics.mass.nist_mass.
- aa_mass (dict, keyword only, optional) – A dictionary of amino acid residue masses.
- *args – Passed to plot_spectrum().
- **kwargs – Passed to plot_spectrum().
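The difference between a fixed tolerance (ftol) and a relative one (rtol) can be made concrete with a small sketch. This is not the matching code used by annotate_spectrum(); `match_peaks` is a hypothetical helper, and expressing 10 ppm as the fraction `10e-6` is an assumption made for the illustration.

```python
def match_peaks(theoretical, observed, rtol=10e-6, ftol=None):
    """Pair each theoretical m/z with the closest observed m/z within tolerance.

    If ftol is given, it is used as an absolute tolerance; otherwise the
    tolerance is relative and equals rtol * theoretical m/z.
    Returns a list of (theoretical, observed) pairs.
    """
    matches = []
    for mz in theoretical:
        tolerance = ftol if ftol is not None else rtol * mz
        closest = min(observed, key=lambda obs: abs(obs - mz), default=None)
        if closest is not None and abs(closest - mz) <= tolerance:
            matches.append((mz, closest))
    return matches


# Hypothetical peak lists.
observed = [500.0020, 750.5000, 1000.0200]
theoretical = [500.0000, 1000.0000]

ppm_matches = match_peaks(theoretical, observed)             # 10 ppm window
abs_matches = match_peaks(theoretical, observed, ftol=0.05)  # fixed 0.05 window
print(len(ppm_matches), len(abs_matches))  # 1 2
```

With a relative tolerance the window grows with m/z (0.005 at m/z 500, 0.01 at m/z 1000), whereas a fixed tolerance is the same everywhere.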
-
pyteomics.pylab_aux.
plot_function_3d
(x, y, function, **kwargs)[source]¶ Plot values of a function of two variables in 3D.
More on 3D plotting in pylab:
http://www.scipy.org/Cookbook/Matplotlib/mplot3D
Parameters: - x (array_like of float) – The plotting range on X axis.
- y (array_like of float) – The plotting range on Y axis.
- function (function) – The function to plot.
- plot_type ({'surface', 'wireframe', 'scatter', 'contour', 'contourf'}, keyword only, optional) – The type of a plot, see scipy cookbook for examples. The default value is ‘surface’.
- num_contours (int) – The number of contours to plot, 50 by default.
- xlabel (str, keyword only, optional) – The X axis label. Empty by default.
- ylabel (str, keyword only, optional) – The Y axis label. Empty by default.
- zlabel (str, keyword only, optional) – The Z axis label. Empty by default.
- title (str, keyword only, optional) – The title. Empty by default.
- **kwargs – Passed to the respective plotting function.
-
pyteomics.pylab_aux.
plot_function_contour
(x, y, function, **kwargs)[source]¶ Make a contour plot of a function of two variables.
Parameters: - x, y – The positions of the nodes of a plotting grid.
- function (function) – The function to plot.
- filling (bool) – Fill contours if True (default).
- num_contours (int) – The number of contours to plot, 50 by default.
- xlabel, ylabel – The axes labels. Empty by default.
- title (str, optional) – The title. Empty by default.
- **kwargs – Passed to pylab.contour() or pylab.contourf().
-
pyteomics.pylab_aux.
plot_line
(a, b, xlim=None, *args, **kwargs)[source]¶ Plot a line y = a * x + b.
Parameters: Returns: out – The line object.
Return type: matplotlib.lines.Line2D
-
pyteomics.pylab_aux.
plot_qvalue_curve
(qvalues, *args, **kwargs)[source]¶ Plot a curve with q-values on the X axis and corresponding PSM number (starting with 1) on the Y axis.
Parameters: - qvalues (array-like) – An array of q-values for sorted PSMs.
- xlabel (str, keyword only, optional) – Label for the X axis. Default is “q-value”.
- ylabel (str, keyword only, optional) – Label for the Y axis. Default is “# of PSMs”.
- title (str, keyword only, optional) – The title. Empty by default.
- *args – Given to pylab.plot() after x and y.
- **kwargs – Given to pylab.plot().
Returns: out
Return type: matplotlib.lines.Line2D
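The data behind such a curve is simple to compute without any plotting. The sketch below only builds the (q-value, PSM number) pairs described above; both helper names are hypothetical, and the q-values are assumed to be sorted in ascending order already.

```python
def qvalue_curve_points(qvalues):
    """Return (q-value, PSM number) pairs; the PSM number starts at 1."""
    return [(q, n) for n, q in enumerate(qvalues, start=1)]


def psms_at_threshold(qvalues, threshold):
    """Number of PSMs whose q-value does not exceed the threshold."""
    return sum(1 for q in qvalues if q <= threshold)


# Hypothetical q-values for six sorted PSMs.
qvalues = [0.0, 0.0, 0.005, 0.01, 0.01, 0.05]
points = qvalue_curve_points(qvalues)
print(points[-1])                        # (0.05, 6)
print(psms_at_threshold(qvalues, 0.01))  # 5
```

Reading the curve at a given q-value gives the number of PSMs that would pass filtering at that FDR level.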
-
pyteomics.pylab_aux.
plot_spectrum
(spectrum, centroided=True, *args, **kwargs)[source]¶ Plot a spectrum, assuming it is a dictionary containing “m/z array” and “intensity array”.
Parameters: - spectrum (dict) – A dictionary, as returned by MGF, mzML or mzXML parsers. Must contain “m/z array” and “intensity array” keys with decoded arrays.
- centroided (bool, optional) – If True (default), peaks of the spectrum are plotted using pylab.bar(). If False, the arrays are simply plotted using pylab.plot().
- xlabel (str, keyword only, optional) – Label for the X axis. Default is “m/z”.
- ylabel (str, keyword only, optional) – Label for the Y axis. Default is “intensity”.
- title (str, keyword only, optional) – The title. Empty by default.
- *args – Given to pylab.plot() or pylab.bar() (depending on centroided).
- **kwargs – Given to pylab.plot() or pylab.bar() (depending on centroided).
-
pyteomics.pylab_aux.
scatter_trend
(x, y=None, **kwargs)[source]¶ Make a scatter plot with a linear regression.
Parameters: - x (array_like of float) – 1-D array of floats. If y is omitted, x must be a 2-D array of shape (N, 2).
- y (array_like of float, optional) – 1-D array of floats. If y is omitted or None, x must be a 2-D array of shape (N, 2).
- plot_trend (bool, optional) – If True then plot a trendline (default).
- plot_sigmas (bool, optional) – If True then plot confidence intervals of the linear fit. False by default.
- show_legend (bool, optional) – If True, a legend will be shown with linear fit equation, correlation coefficient, and standard deviation from the fit. Default is True.
- title (str, optional) – The title. Empty by default.
- xlabel, ylabel – The axes labels. Empty by default.
- alpha_legend (float, optional) – Legend box transparency. 1.0 by default.
- scatter_kwargs (dict, optional) – Keyword arguments for pylab.scatter(). Empty by default.
- plot_kwargs (dict, optional) – Keyword arguments for plot_line(). By default, sets xlim and label.
- legend_kwargs (dict, optional) – Keyword arguments for pylab.legend(). Default is {'loc': 'upper left'}.
- sigma_kwargs (dict, optional) – Keyword arguments for pylab.plot() used for sigma lines. Default is {'color': 'red', 'linestyle': 'dashed'}.
- sigma_values (iterable, optional) – Each value will be multiplied with standard error of the fit, and the line shifted by the resulting value will be plotted. Default is range(-3, 4).
- regression (callable, optional) – Function to perform linear regression. Will be given x and y as arguments. Must return a 4-tuple: (a, b, r, stderr). Default is pyteomics.auxiliary.linear_regression().
Returns: out – A (scatter_plot, trend_line, sigma_lines, legend) tuple.
Return type:
xml - utilities for XML parsing¶
This module is not intended for end users. It implements the abstract classes for all XML parsers, XML and IndexedXML, and some utility functions.
Dependencies¶
This module requires lxml and numpy.
-
class
pyteomics.xml.
ArrayConversionMixin
(*args, **kwargs)[source]¶ Bases:
pyteomics.auxiliary.utils.BinaryDataArrayTransformer
-
class
binary_array_record
¶ Bases:
pyteomics.auxiliary.utils.binary_array_record
Hold all of the information about a base64 encoded array needed to decode the array.
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
compression
¶ Alias for field number 1
-
count
()¶ Return number of occurrences of value.
-
data
¶ Alias for field number 0
-
dtype
¶ Alias for field number 2
-
index
()¶ Return first index of value.
Raises ValueError if the value is not present.
-
key
¶ Alias for field number 4
-
source
¶ Alias for field number 3
-
-
decode_data_array
(source, compression_type=None, dtype=<class 'numpy.float64'>)¶ Decode a base64-encoded, compressed bytestring into a numerical array.
Parameters: Returns: Return type: np.ndarray
-
class
-
class
pyteomics.xml.
ByteCountingXMLScanner
(source, indexed_tags, block_size=1000000)[source]¶ Bases:
pyteomics.auxiliary.file_helpers._file_obj
Carry out the construction of a byte offset index for source XML file for each type of tag in indexed_tags.
Inherits from pyteomics.auxiliary._file_obj to support the object-oriented _keep_state() interface.
-
__init__
(source, indexed_tags, block_size=1000000)[source]¶ Parameters: - indexed_tags (iterable of bytes) – The XML tags (without namespaces) to build indices for.
- block_size (int, optional) – The size of each chunk or “block” of the file to hold in memory as a partitioned string at any given time. Defaults to 1000000.
-
build_byte_index
(lookup_id_key_mapping=None)[source]¶ Builds a byte offset index for one or more types of tags.
Parameters: lookup_id_key_mapping (Mapping, optional) – A mapping from tag name to the attribute to look up the identity for each entity of that type to be extracted. Defaults to ‘id’ for each type of tag.
Returns: Mapping from tag type to dict from identifier to byte offset
Return type: defaultdict(dict)
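To make the shape of the returned index concrete, here is a minimal standard-library sketch of byte offset indexing. It is not the algorithm used by ByteCountingXMLScanner: it reads the whole stream at once instead of working in blocks, it relies on a regular expression, and the function name and `id_attr` parameter are assumptions made for the example.

```python
import io
import re
from collections import defaultdict


def build_byte_index(stream, indexed_tags, id_attr=b"id"):
    """Map tag name -> {id value -> byte offset of the opening tag}.

    Reads the whole stream into memory, which is a simplification;
    a scanner for large files would process the data in blocks.
    """
    data = stream.read()
    index = defaultdict(dict)
    for tag in indexed_tags:
        # Match an opening tag and capture the value of its id attribute.
        pattern = re.compile(
            rb"<" + re.escape(tag) + rb"\b[^>]*?\b" + re.escape(id_attr) + rb'="([^"]*)"'
        )
        for match in pattern.finditer(data):
            index[tag.decode()][match.group(1).decode()] = match.start()
    return index


xml = b'<root><item id="a">1</item><item id="b">2</item></root>'
index = build_byte_index(io.BytesIO(xml), [b"item"])
offset = index["item"]["b"]
print(xml[offset:offset + 13])  # b'<item id="b">'
```

With such an index, a reader can seek straight to a stored offset and parse a single element instead of scanning the file from the start.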
-
-
class
pyteomics.xml.
IndexSavingXML
(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]¶ Bases: pyteomics.auxiliary.file_helpers.IndexSavingMixin, pyteomics.xml.IndexedXML
An extension to the IndexedXML type which adds facilities to read and write the byte offset index externally.
-
__init__
(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)¶ Create an indexed XML parser object.
Parameters: - source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
- indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
get_by_id
(elem_id, id_key=None, element_type=None, **kwargs)¶ Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
Parameters: Returns: Return type:
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
classmethod
prebuild_byte_offset_file
(path)¶ Construct a new XML reader, build its byte offset index and write it to file
Parameters: path (str) – The path to the file to parse
-
reset
()¶ Resets the iterator to its initial state.
-
write_byte_offsets
()¶ Write the byte offsets in _offset_index to the file at _byte_offset_filename
-
-
class
pyteomics.xml.
IndexedXML
(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]¶ Bases: pyteomics.auxiliary.file_helpers.IndexedReaderMixin, pyteomics.xml.XML
Subclass of XML which uses an index of byte offsets for some elements for quick random access.
-
__init__
(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]¶ Create an indexed XML parser object.
Parameters: - source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
- indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
get_by_id
(elem_id, id_key=None, element_type=None, **kwargs)[source]¶ Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
Parameters: Returns: Return type:
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.xml.
MultiProcessingXML
(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]¶ Bases: pyteomics.xml.IndexedXML, pyteomics.auxiliary.file_helpers.TaskMappingMixin
XML reader that feeds indexes to external processes for parallel parsing and analysis of XML entries.
-
__init__
(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)¶ Create an indexed XML parser object.
Parameters: - source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
- indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache
()¶ Construct a cache for each element in the document, indexed by id attribute
-
build_tree
()¶ Build and store the
ElementTree
instance for the underlying file
-
clear_id_cache
()¶ Clear the element ID cache
-
clear_tree
()¶ Remove the saved
ElementTree
.
-
get_by_id
(elem_id, id_key=None, element_type=None, **kwargs)¶ Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
Parameters: Returns: Return type:
-
iterfind
(path, **kwargs)¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
map
(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters: - target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
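The calling convention of map() (one entry plus extra positional and keyword arguments, results arriving in completion order) can be imitated with the standard library. The sketch below uses a thread pool for brevity, whereas the documented method distributes work across processes; `unordered_map` and `count_keys` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def unordered_map(target, entries, workers=4, args=(), kwargs=None):
    """Apply target(entry, *args, **kwargs) to every entry in parallel.

    Results are yielded as soon as they are ready, so their order is not
    guaranteed to match the order of the entries.
    """
    kwargs = kwargs or {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(target, entry, *args, **kwargs) for entry in entries]
        for future in as_completed(futures):
            yield future.result()


def count_keys(entry, offset=0):
    # Hypothetical work function: returns the entry id and a derived number.
    return entry["id"], len(entry) + offset


entries = [{"id": i, "value": i * 2} for i in range(5)]
# Sort afterwards if a deterministic order is needed.
results = sorted(unordered_map(count_keys, entries, kwargs={"offset": 1}))
print(results[0])  # (0, 3)
```

Returning an identifier together with each result, as `count_keys` does, is a simple way to restore the original order when it matters.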
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.xml.
TagSpecificXMLByteIndex
(source, indexed_tags=None, keys=None)[source]¶ Bases:
object
Encapsulates the construction and querying of a byte offset index for a set of XML tags.
This type mimics an immutable Mapping.
The tag names to index, not including a namespace
Type: iterable of bytes
-
offsets
¶ The hierarchy of byte offsets organized
{"tag_type": {"id": byte_offset}}
Type: defaultdict(OrderedDict(str, int))
Parameters: index_tags (iterable of bytes) – The tag names to include in the index
-
class
pyteomics.xml.
XML
(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)[source]¶ Bases:
pyteomics.auxiliary.file_helpers.FileReader
Base class for all format-specific XML parsers. The instances can be used as context managers and as iterators.
-
__init__
(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)[source]¶ Create an XML parser object.
Parameters: - source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- build_id_cache (bool, optional) – Defines whether a dictionary mapping IDs to XML tree elements should be built and stored on the instance. It is used in XML.get_by_id(), e.g. when using pyteomics.mzid.MzIdentML with retrieve_refs=True.
- huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
-
build_id_cache
()[source]¶ Construct a cache for each element in the document, indexed by id attribute
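As an illustration of what an ID cache is, the following sketch builds a dictionary from id attribute values to elements. It uses the standard library's xml.etree.ElementTree instead of lxml and is not the Pyteomics implementation; the function and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_id_cache(xml_text):
    """Return a dict mapping each id attribute value to its element."""
    root = ET.fromstring(xml_text)
    return {
        elem.get("id"): elem
        for elem in root.iter()
        if elem.get("id") is not None
    }


doc = '<list><peptide id="p1" seq="PEPTIDE"/><peptide id="p2" seq="ELVIS"/></list>'
cache = build_id_cache(doc)
print(cache["p2"].get("seq"))  # ELVIS
```

The cache trades memory for speed: every later lookup by ID is a dictionary access rather than a new pass over the document.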
-
get_by_id
(elem_id, **kwargs)[source]¶ Parse the file and return the element with id attribute equal to elem_id. Returns None if no such element is found.
Parameters: elem_id (str) – The value of the id attribute to match.
Returns: out
Return type: dict or None
-
iterfind
(path, **kwargs)[source]¶ Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters: - path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
reset
()¶ Resets the iterator to its initial state.
-
auxiliary - common functions and objects¶
Math¶
linear_regression_vertical()
- a wrapper for NumPy linear regression, minimizes the sum of squares of y errors.
linear_regression()
- alias for linear_regression_vertical().
linear_regression_perpendicular()
- a wrapper for NumPy linear regression, minimizes the sum of squares of (perpendicular) distances between the points and the line.
Target-Decoy Approach¶
qvalues()
- estimate q-values for a set of PSMs.
filter()
- filter PSMs to specified FDR level using TDA or given PEPs.
filter.chain()
- a chained version of filter().
fdr()
- estimate FDR in a set of PSMs using TDA or given PEPs.
Project infrastructure¶
PyteomicsError
- a pyteomics-specific exception.
Helpers¶
Charge
- a subclass of int for charge states.
ChargeList
- a subclass of list for lists of charges.
print_tree()
- display the structure of a complex nested dict.
memoize()
- makes a memoization function decorator.
cvquery()
- traverse an arbitrarily nested dictionary looking for keys which are cvstr instances, or objects with an attribute called accession.
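The traversal described for cvquery() can be pictured with a short sketch. This is not the library function: `Term` is a hypothetical stand-in for a string type that carries an accession attribute, and `collect_accessions` is an invented helper that gathers every annotated key it finds.

```python
class Term(str):
    """Hypothetical string subclass that carries a CV accession."""

    def __new__(cls, value, accession=None):
        obj = super().__new__(cls, value)
        obj.accession = accession
        return obj


def collect_accessions(data):
    """Walk nested dicts and lists; map accession -> value for annotated keys."""
    found = {}
    if isinstance(data, dict):
        for key, value in data.items():
            accession = getattr(key, "accession", None)
            if accession is not None:
                found[accession] = value
            # Keep descending: annotated keys may be nested at any depth.
            found.update(collect_accessions(value))
    elif isinstance(data, (list, tuple)):
        for item in data:
            found.update(collect_accessions(item))
    return found


# Hypothetical nested record with two annotated keys.
spectrum = {
    Term("ms level", "MS:1000511"): 2,
    "scanList": {"scan": [{Term("scan start time", "MS:1000016"): 12.5}]},
}
result = collect_accessions(spectrum)
print(result["MS:1000016"])  # 12.5
```

Looking values up by accession rather than by name makes the code independent of how a particular file spells a term.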
-
pyteomics.auxiliary.math.
linear_regression
(x, y=None, a=None, b=None)[source]¶ Alias of linear_regression_vertical().
-
pyteomics.auxiliary.math.
linear_regression_perpendicular
(x, y=None)[source]¶ Calculate coefficients of a linear regression y = a * x + b. The fit minimizes perpendicular distances between the points and the line.
Requires numpy.
Parameters: x, y – 1-D arrays of floats. If y is omitted, x must be a 2-D array of shape (N, 2).
Returns: out – The structure is (a, b, r, stderr), where a – slope coefficient, b – free term, r – Pearson correlation coefficient, stderr – standard deviation.
Return type: 4-tuple of float
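For readers who want to see what minimizing perpendicular distances means numerically, here is a pure-Python sketch based on the textbook closed-form slope of an orthogonal fit. It is not the Pyteomics code, returns only (a, b) rather than the full 4-tuple, and assumes the covariance of x and y is non-zero.

```python
import math


def perpendicular_regression(x, y):
    """Fit y = a * x + b by minimizing perpendicular distances to the line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Closed-form slope of the orthogonal fit; requires sxy != 0.
    a = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    # The fitted line always passes through the centroid of the points.
    b = mean_y - a * mean_x
    return a, b


# Points lying exactly on y = 2x + 1.
a, b = perpendicular_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(round(a, 6), round(b, 6))  # 2.0 1.0
```

Unlike the vertical fit, this one treats x and y symmetrically, which is appropriate when both variables carry measurement error.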
-
pyteomics.auxiliary.math.
linear_regression_vertical
(x, y=None, a=None, b=None)[source]¶ Calculate coefficients of a linear regression y = a * x + b. The fit minimizes vertical distances between the points and the line.
Requires numpy.
Parameters:
Returns: out – The structure is (a, b, r, stderr), where a – slope coefficient, b – free term, r – Pearson correlation coefficient, stderr – standard deviation.
Return type: 4-tuple of float
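The returned 4-tuple can be reproduced with ordinary least squares in pure Python. This sketch is not the NumPy-based implementation; in particular, taking stderr to be the population standard deviation of the residuals is an assumption made for the example, and the a and b arguments of the documented function are not modelled.

```python
import math


def linear_regression_sketch(x, y):
    """Ordinary least squares fit of y = a * x + b.

    Returns (a, b, r, stderr): slope, intercept, Pearson correlation
    coefficient and the standard deviation of the residuals.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    r = sxy / math.sqrt(sxx * syy)
    residuals = [yi - (a * xi + b) for xi, yi in zip(x, y)]
    mean_res = sum(residuals) / n
    # Assumption: population standard deviation (divide by n, not n - 1).
    stderr = math.sqrt(sum((e - mean_res) ** 2 for e in residuals) / n)
    return a, b, r, stderr


a, b, r, stderr = linear_regression_sketch([0, 1, 2], [0, 1, 1])
print(round(a, 3), round(b, 3))  # 0.5 0.167
```

A tuple of this shape is what a custom regression callable has to return to be usable with scatter_trend().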
-
pyteomics.auxiliary.target_decoy.
fdr
(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)¶ Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:
The second formula is:
Note
This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.
Parameters:
- psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
- formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
- is_decoy (callable, iterable, or str) – If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
- pep (callable, iterable, or str, optional) – If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.
- ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
Note
Requires numpy, if correction is a float or 2.
Note
Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).
Returns: out – The estimation of FDR, (roughly) between 0 and 1.
Return type: float
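Decoy counting can be sketched in a few lines of plain Python. The formulas are not reproduced in this text, so the example assumes that formula 1 divides the decoy count (plus the correction) by the target count times ratio. The function name, the dict-based PSMs and the prefix test are hypothetical, and only correction values 0 and 1 are covered.

```python
def fdr_sketch(psms, is_decoy, ratio=1.0, correction=0):
    """Decoy-counting FDR estimate (assumed form of formula 1)."""
    flags = [bool(is_decoy(psm)) for psm in psms]
    n_decoy = sum(flags)
    n_target = len(flags) - n_decoy
    if n_target == 0:
        raise ZeroDivisionError('no target PSMs to normalize by')
    return (n_decoy + correction) / (n_target * ratio)

# Hypothetical PSMs: four target matches and one decoy match.
psms = [{'protein': name} for name in ('P1', 'P2', 'DECOY_P3', 'P4', 'P5')]

def looks_like_decoy(psm):
    return psm['protein'].startswith('DECOY_')

plain = fdr_sketch(psms, looks_like_decoy)
corrected = fdr_sketch(psms, looks_like_decoy, correction=1)
print(plain, corrected)
```

With one decoy among four targets the uncorrected estimate is 0.25, and the "+1" correction doubles it, which shows how strongly the correction matters for small PSM sets.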
-
pyteomics.auxiliary.target_decoy.
filter
(*args, **kwargs)¶ Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.
Requires numpy and, optionally, pandas.
Parameters:
- args (positional) – Iterables to read PSMs from. All positional arguments are chained. The rest of the arguments must be named.
- fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
- key (callable / array-like / iterable / str, keyword only) –
A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.
Warning
The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
- is_decoy (callable / array-like / iterable / str, keyword only) – A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.
- remove_decoy (bool, keyword only, optional) – Defines whether decoy matches should be removed from the output. Default is True.
Note
If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.
- formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
- ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- pep (callable / array-like / iterable / str, keyword only, optional) – If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- full_output (bool, keyword only, optional) – If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.
Note
The name for the parameter comes from the fact that it is internally passed to qvalues().
- q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
- score_label (str, optional) – Field name for score in the output. Default is 'score'.
- decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
- pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
- **kwargs – Passed to the chain() function.
Returns: out
Return type: iterator or numpy.ndarray or pandas.DataFrame
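The overall flow described above (chain the inputs, sort by score, cut at the desired FDR, optionally drop decoys) can be sketched with the standard library. This is an illustration and not the Pyteomics code: `filter_sketch` is a hypothetical name, PSMs are modelled as (score, decoy flag) tuples, and the running FDR is assumed to be the plain decoy-to-target ratio with ratio=1 and no correction.

```python
from itertools import chain

def filter_sketch(*sources, fdr, key, is_decoy, reverse=False, remove_decoy=True):
    """Keep the largest set of best-scoring PSMs whose estimated FDR <= fdr."""
    psms = sorted(chain(*sources), key=key, reverse=reverse)
    n_decoy = n_target = 0
    cutoff = 0  # number of best-scoring PSMs that pass
    for position, psm in enumerate(psms, start=1):
        if is_decoy(psm):
            n_decoy += 1
        else:
            n_target += 1
        if n_target and n_decoy / n_target <= fdr:
            cutoff = position
    kept = psms[:cutoff]
    if remove_decoy:
        kept = [psm for psm in kept if not is_decoy(psm)]
    return kept

# Two hypothetical result sets; lower score (e.g. an e-value) is better.
run1 = [(0.001, False), (0.002, False), (0.5, True)]
run2 = [(0.01, False), (0.3, False), (0.9, True)]
kept = filter_sketch(run1, run2, fdr=0.1,
                     key=lambda psm: psm[0], is_decoy=lambda psm: psm[1])
print(kept)
```

All arguments after the positional iterables are keyword only here, which mirrors the calling convention described in the parameter list.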
-
pyteomics.auxiliary.target_decoy.
qvalues
(*args, **kwargs)¶ Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.
Requires numpy (and optionally pandas).
Parameters:
- args (positional) – Iterables to read PSMs from. All positional arguments are chained. The rest of the arguments must be named.
- key (callable / array-like / iterable / str, keyword only) – If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Warning
The default function may not work with your files, because format flavours are diverse.
- reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
- is_decoy (callable / array-like / iterable / str, keyword only) – If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
- pep (callable / array-like / iterable / str, keyword only, optional) – If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
- remove_decoy (bool, keyword only, optional) – Defines whether decoy matches should be removed from the output. Default is False.
Note
If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.
- formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
- ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
- correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
- q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
- score_label (str, optional) – Field name for score in the output. Default is 'score'.
- decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
- pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
- full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
- **kwargs – Passed to the chain() function.
Returns: out – A sorted array of records with the following fields:
- 'score': np.float64
- 'is decoy': np.bool_
- 'q': np.float64
- 'psm': np.object_ (if full_output is True)
Return type: numpy.ndarray
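The relation between scores, decoy flags and q-values in the returned records can be illustrated in pure Python. The sketch below is not the library's algorithm. It assumes the running FDR is the decoy-to-target ratio (ratio=1, no correction) and takes each q-value as the smallest FDR reachable at that score threshold or any more permissive one. `qvalues_sketch` is a hypothetical name.

```python
def qvalues_sketch(scores, decoy_flags, reverse=False):
    """Return (score, is_decoy, q) records sorted from best to worst score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)
    running_fdr = []
    n_decoy = n_target = 0
    for i in order:
        if decoy_flags[i]:
            n_decoy += 1
        else:
            n_target += 1
        running_fdr.append(n_decoy / n_target if n_target else float('inf'))
    # Make the values monotonic: a q-value never exceeds the FDR of a larger set.
    q = running_fdr[:]
    for j in range(len(q) - 2, -1, -1):
        q[j] = min(q[j], q[j + 1])
    return [(scores[i], decoy_flags[i], qj) for i, qj in zip(order, q)]

scores = [0.01, 0.5, 0.02, 0.03, 0.9]
decoys = [False, True, False, True, False]
records = qvalues_sketch(scores, decoys)
for record in records:
    print(record)
```

Taking the q-value of the last record gives the FDR estimate for the whole set, which matches the remark in the fdr() note about taking the last q-value.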
-
pyteomics.auxiliary.target_decoy.
sigma_T
(psms, is_decoy, ratio=1)[source]¶ Calculates the standard error for the number of false positive target PSMs.
The formula is:
.. math::
    \sigma(T) = \sqrt{\frac{(d + 1) \cdot p}{(1 - p)^{2}}} = \sqrt{\frac{d + 1}{r^{2}} \cdot (r + 1)}
This estimation is accurate for low FDRs. See the article for more details.
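The two forms of the formula can be checked numerically. In this sketch d is assumed to be the number of decoy PSMs and r the decoy-to-target database size ratio; the text does not define the symbols, so both readings are assumptions. The two expressions agree when p = 1 / (1 + r), which the code verifies. The function names are hypothetical.

```python
import math

def sigma_t_from_ratio(d, r=1.0):
    """Second form: sqrt((d + 1) / r**2 * (r + 1))."""
    return math.sqrt((d + 1) / r ** 2 * (r + 1))

def sigma_t_from_p(d, p):
    """First form: sqrt((d + 1) * p / (1 - p)**2)."""
    return math.sqrt((d + 1) * p / (1 - p) ** 2)

d, r = 7, 2.0
by_ratio = sigma_t_from_ratio(d, r)
by_p = sigma_t_from_p(d, 1 / (1 + r))
print(by_ratio, by_p)
```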
-
pyteomics.auxiliary.target_decoy.
sigma_fdr
(psms=None, formula=1, is_decoy=None, ratio=1)[source]¶ Calculates the standard error of FDR using the formula for negative binomial distribution. See
sigma_T()
for math. This estimation is accurate for low FDRs. See also the article for more details.
-
class
pyteomics.auxiliary.utils.
BinaryDataArrayTransformer
[source]¶ Bases:
object
A base class that provides methods for reading base64-encoded binary arrays.
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
class
binary_array_record
[source]¶ Bases:
pyteomics.auxiliary.utils.binary_array_record
Hold all of the information about a base64 encoded array needed to decode the array.
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
compression
¶ Alias for field number 1
-
count
()¶ Return number of occurrences of value.
-
data
¶ Alias for field number 0
-
dtype
¶ Alias for field number 2
-
index
()¶ Return first index of value.
Raises ValueError if the value is not present.
-
key
¶ Alias for field number 4
-
source
¶ Alias for field number 3
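The idea of a record that carries everything needed to decode a base64 array can be sketched with the standard library. The field order below follows the aliases listed above (data, compression, dtype, source, key). Everything else is an assumption made for illustration: the names `ArrayRecord` and `decode_record`, the 'zlib' compression label, and the use of a `struct` format character as dtype.

```python
import base64
import struct
import zlib
from collections import namedtuple

# Field order mirrors the documented aliases: 0 data, 1 compression,
# 2 dtype, 3 source, 4 key.
ArrayRecord = namedtuple('ArrayRecord', 'data compression dtype source key')

def decode_record(record):
    """Decode a base64 payload into a list of numbers (little-endian)."""
    raw = base64.b64decode(record.data)
    if record.compression == 'zlib':
        raw = zlib.decompress(raw)
    item_size = struct.calcsize('<' + record.dtype)
    count = len(raw) // item_size
    return list(struct.unpack('<%d%s' % (count, record.dtype), raw))

# Build a small compressed payload of three 64-bit floats and decode it.
values = [100.5, 200.25, 300.0]
payload = base64.b64encode(zlib.compress(struct.pack('<3d', *values)))
record = ArrayRecord(payload, 'zlib', 'd', None, 'm/z array')
decoded = decode_record(record)
print(record.key, decoded)
```

Keeping the undecoded payload together with its compression and dtype lets decoding be postponed until the array is actually needed.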
-
-
-
pyteomics.auxiliary.utils.
memoize
(maxsize=1000)[source]¶ Make a memoization decorator. A negative value of maxsize means no size limit.
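A decorator factory of this kind can be sketched as follows. This is an illustration rather than the Pyteomics source: `memoize_sketch` is a hypothetical name, only hashable positional arguments are supported, and once the cache is full new results are simply not stored, which may differ from what the library does.

```python
import functools

def memoize_sketch(maxsize=1000):
    """Return a decorator that caches results; negative maxsize means no limit."""
    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args):
            if args in cache:
                return cache[args]
            result = func(*args)
            # Simplification: stop adding entries once the limit is reached.
            if maxsize < 0 or len(cache) < maxsize:
                cache[args] = result
            return result
        return wrapper
    return decorator

calls = []

@memoize_sketch(maxsize=10)
def square(x):
    calls.append(x)  # record real invocations
    return x * x

results = [square(3), square(3), square(4)]
print(results, calls)
```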
-
pyteomics.auxiliary.utils.
print_tree
(d, indent_str=' -> ', indent_count=1)[source]¶ Read a nested dict (with strings as keys) and print its structure.
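The kind of output this produces can be approximated with a short recursive generator. The sketch is an assumption about the general behaviour (one line per key, indented by nesting depth), not a copy of the library function. `tree_lines` and the sample dictionary are hypothetical.

```python
def tree_lines(d, indent_str=' -> ', indent_count=1, _depth=0):
    """Yield one line per key of a nested dict, indented by nesting depth."""
    for key, value in d.items():
        yield indent_str * (indent_count * _depth) + key
        if isinstance(value, dict):
            yield from tree_lines(value, indent_str, indent_count, _depth + 1)

spectrum = {
    'params': {'title': 'scan 1', 'charge': 2},
    'm/z array': [100.0, 200.0],
}
lines = list(tree_lines(spectrum))
print('\n'.join(lines))
```

Only the keys are shown, so the structure of a large parsed record can be inspected without printing its data.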
-
class
pyteomics.auxiliary.structures.
BasicComposition
(*args, **kwargs)[source]¶ Bases:
collections.defaultdict, collections.Counter
A generic dictionary for compositions. Keys should be strings, values should be integers. Allows simple arithmetic.
-
clear
() → None. Remove all items from D.¶
-
default_factory
¶ Factory for default value called by __missing__().
-
elements
()¶ Iterator over elements repeating each as many times as its count.
>>> c = Counter('ABCABC')
>>> sorted(c.elements())
['A', 'A', 'B', 'B', 'C', 'C']

# Knuth's example for prime factors of 1836: 2**2 * 3**3 * 17**1
>>> prime_factors = Counter({2: 2, 3: 3, 17: 1})
>>> product = 1
>>> for factor in prime_factors.elements():     # loop over factors
...     product *= factor                       # and multiply them
>>> product
1836
Note, if an element’s count has been set to zero or is a negative number, elements() will ignore it.
-
classmethod
fromkeys
(iterable, v=None)¶ Create a new dictionary with keys from iterable and values set to value.
-
get
()¶ Return the value for key if key is in the dictionary, else default.
-
items
() → a set-like object providing a view on D's items¶
-
keys
() → a set-like object providing a view on D's keys¶
-
most_common
(n=None)¶ List the n most common elements and their counts from the most common to the least. If n is None, then list all element counts.
>>> Counter('abcdeabcdabcaba').most_common(3)
[('a', 5), ('b', 4), ('c', 3)]
-
pop
(k[, d]) → v, remove specified key and return the corresponding value.¶ If key is not found, d is returned if given, otherwise KeyError is raised
-
popitem
() → (k, v), remove and return some (key, value) pair as a¶ 2-tuple; but raise KeyError if D is empty.
-
setdefault
()¶ Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
-
subtract
(**kwds)¶ Like dict.update() but subtracts counts instead of replacing them. Counts can be reduced below zero. Both the inputs and outputs are allowed to contain zero and negative counts.
Source can be an iterable, a dictionary, or another Counter instance.
>>> c = Counter('which')
>>> c.subtract('witch')             # subtract elements from another iterable
>>> c.subtract(Counter('watch'))    # subtract elements from another counter
>>> c['h']                          # 2 in which, minus 1 in witch, minus 1 in watch
0
>>> c['w']                          # 1 in which, minus 1 in witch, minus 1 in watch
-1
-
update
(**kwds)¶ Like dict.update() but add counts instead of replacing them.
Source can be an iterable, a dictionary, or another Counter instance.
>>> c = Counter('which')
>>> c.update('witch')           # add elements from another iterable
>>> d = Counter('watch')
>>> c.update(d)                 # add elements from another counter
>>> c['h']                      # four 'h' in which, witch, and watch
4
-
values
() → an object providing a view on D's values¶
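The inherited update() and subtract() methods already allow composition arithmetic. The sketch below uses a plain collections.Counter as a stand-in, so it shows only the inherited behaviour. It does not show whether BasicComposition's own operators treat zero or negative counts the same way.

```python
from collections import Counter

water = Counter({'H': 2, 'O': 1})
glycine = Counter({'C': 2, 'H': 5, 'N': 1, 'O': 2})

# A residue is the amino acid minus one water; subtract() works in place
# and keeps zero or negative counts instead of dropping them.
residue = Counter(glycine)
residue.subtract(water)

# Diglycine: two residues plus one water.
dipeptide = Counter()
for part in (residue, residue, water):
    dipeptide.update(part)

print(dict(residue), dict(dipeptide))
```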
-
-
class
pyteomics.auxiliary.structures.
CVQueryEngine
[source]¶ Bases:
object
Traverse an arbitrarily nested dictionary looking for keys which are cvstr instances, or objects with an attribute called accession.
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
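The traversal described for this class can be sketched with a small recursive function. The code is a guess at the general idea, not the library implementation: `TermKey` is a hypothetical stand-in for a string key carrying an accession, and `collect_accessions` is a hypothetical helper that maps each accession to the value stored under that key.

```python
class TermKey(str):
    """Hypothetical string key that carries a controlled vocabulary accession."""

    def __new__(cls, value, accession=None):
        obj = super().__new__(cls, value)
        obj.accession = accession
        return obj

def collect_accessions(data, found=None):
    """Walk nested dicts and lists, indexing values by the accession of their key."""
    if found is None:
        found = {}
    if isinstance(data, dict):
        for key, value in data.items():
            accession = getattr(key, 'accession', None)
            if accession is not None:
                found[accession] = value
            collect_accessions(value, found)
    elif isinstance(data, (list, tuple)):
        for item in data:
            collect_accessions(item, found)
    return found

spectrum = {
    TermKey('ms level', 'MS:1000511'): 2,
    'scanList': {'scan': [{TermKey('scan start time', 'MS:1000016'): 12.5}]},
}
index = collect_accessions(spectrum)
print(index)
```

Plain string keys such as 'scanList' have no accession attribute and are skipped, while their values are still searched.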
-
-
class
pyteomics.auxiliary.structures.
Charge
[source]¶ Bases:
int
A subclass of int. Can be constructed from strings in “N+” or “N-” format, and the string representation of a Charge is also in that format.
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
bit_length
()¶ Number of bits necessary to represent self in binary.
>>> bin(37)
'0b100101'
>>> (37).bit_length()
6
-
conjugate
()¶ Returns self, the complex conjugate of any int.
-
denominator
¶ the denominator of a rational number in lowest terms
-
from_bytes
()¶ Return the integer represented by the given array of bytes.
- bytes
- Holds the array of bytes to convert. The argument must either support the buffer protocol or be an iterable object producing bytes. Bytes and bytearray are examples of built-in objects that support the buffer protocol.
- byteorder
- The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use sys.byteorder as the byte order value.
- signed
- Indicates whether two’s complement is used to represent the integer.
-
imag
¶ the imaginary part of a complex number
-
numerator
¶ the numerator of a rational number in lowest terms
-
real
¶ the real part of a complex number
-
to_bytes
()¶ Return an array of bytes representing an integer.
- length
- Length of bytes object to use. An OverflowError is raised if the integer is not representable with the given number of bytes.
- byteorder
- The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use sys.byteorder as the byte order value.
- signed
- Determines whether two’s complement is used to represent the integer. If signed is False and a negative integer is given, an OverflowError is raised.
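An int subclass that parses and prints the “N+” / “N-” notation can be sketched as below. `ChargeSketch` is a hypothetical class written for illustration. The real Charge class may accept more input forms or validate differently.

```python
import re

class ChargeSketch(int):
    """Illustrative int subclass with 'N+' / 'N-' parsing and formatting."""

    def __new__(cls, value):
        if isinstance(value, str):
            match = re.fullmatch(r'\s*(\d+)([+-])\s*', value)
            if match is None:
                raise ValueError('expected "N+" or "N-", got %r' % value)
            number = int(match.group(1))
            value = number if match.group(2) == '+' else -number
        return super().__new__(cls, value)

    def __str__(self):
        return '%d%s' % (abs(self), '-' if self < 0 else '+')

z = ChargeSketch('2+')
negative = ChargeSketch('3-')
print(z, negative, z + 1)
```

Because the object is still an int, it can be used directly in arithmetic such as dividing a mass by the charge.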
-
-
class
pyteomics.auxiliary.structures.
ChargeList
(*args, **kwargs)[source]¶ Bases:
list
A list of Charge objects. When printed, looks like an enumeration of the list contents. Can also be constructed from such strings (e.g. “2+, 3+ and 4+”).
-
append
()¶ Append object to the end of the list.
-
clear
()¶ Remove all items from list.
-
copy
()¶ Return a shallow copy of the list.
-
count
()¶ Return number of occurrences of value.
-
extend
()¶ Extend list by appending elements from the iterable.
-
index
()¶ Return first index of value.
Raises ValueError if the value is not present.
-
insert
()¶ Insert object before index.
-
pop
()¶ Remove and return item at index (default last).
Raises IndexError if list is empty or index is out of range.
-
remove
()¶ Remove first occurrence of value.
Raises ValueError if the value is not present.
-
reverse
()¶ Reverse IN PLACE.
-
sort
()¶ Stable sort IN PLACE.
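Parsing and printing such an enumeration can be sketched with a regular expression. The helper names are hypothetical, plain integers are used instead of Charge objects, and the exact printed format of the real class is assumed to follow the “2+, 3+ and 4+” example.

```python
import re

def parse_charges(text):
    """Extract signed integer charges from text such as '2+, 3+ and 4+'."""
    charges = []
    for number, sign in re.findall(r'(\d+)([+-])', text):
        charges.append(int(number) if sign == '+' else -int(number))
    return charges

def format_charges(charges):
    """Render charges as an enumeration: 'a, b and c'."""
    items = ['%d%s' % (abs(z), '-' if z < 0 else '+') for z in charges]
    if len(items) < 2:
        return ''.join(items)
    return ', '.join(items[:-1]) + ' and ' + items[-1]

parsed = parse_charges('2+, 3+ and 4+')
printed = format_charges(parsed)
print(parsed, printed)
```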
-
-
exception
pyteomics.auxiliary.structures.
PyteomicsError
(msg, *values)[source]¶ Bases:
Exception
Exception raised for errors in Pyteomics library.
-
with_traceback
()¶ Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
-
-
pyteomics.auxiliary.structures.
clear_unit_cv_table
()[source]¶ Clear the module-level unit name and controlled vocabulary accession table.
-
class
pyteomics.auxiliary.structures.
cvstr
[source]¶ Bases:
str
A helper class to associate a controlled vocabulary accession number with an otherwise plain str object.
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
capitalize
()¶ Return a capitalized version of the string.
More specifically, make the first character have upper case and the rest lower case.
-
casefold
()¶ Return a version of the string suitable for caseless comparisons.
-
center
()¶ Return a centered string of length width.
Padding is done using the specified fill character (default is a space).
-
count
(sub[, start[, end]]) → int¶ Return the number of non-overlapping occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation.
-
encode
()¶ Encode the string using the codec registered for encoding.
- encoding
- The encoding in which to encode the string.
- errors
- The error handling scheme to use for encoding errors. The default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
-
endswith
(suffix[, start[, end]]) → bool¶ Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.
-
expandtabs
()¶ Return a copy where all tab characters are expanded using spaces.
If tabsize is not given, a tab size of 8 characters is assumed.
-
find
(sub[, start[, end]]) → int¶ Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Return -1 on failure.
-
format
(*args, **kwargs) → str¶ Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces ('{' and '}').
-
format_map
(mapping) → str¶ Return a formatted version of S, using substitutions from mapping. The substitutions are identified by braces ('{' and '}').
-
index
(sub[, start[, end]]) → int¶ Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Raises ValueError when the substring is not found.
-
isalnum
()¶ Return True if the string is an alpha-numeric string, False otherwise.
A string is alpha-numeric if all characters in the string are alpha-numeric and there is at least one character in the string.
-
isalpha
()¶ Return True if the string is an alphabetic string, False otherwise.
A string is alphabetic if all characters in the string are alphabetic and there is at least one character in the string.
-
isascii
()¶ Return True if all characters in the string are ASCII, False otherwise.
ASCII characters have code points in the range U+0000-U+007F. Empty string is ASCII too.
-
isdecimal
()¶ Return True if the string is a decimal string, False otherwise.
A string is a decimal string if all characters in the string are decimal and there is at least one character in the string.
-
isdigit
()¶ Return True if the string is a digit string, False otherwise.
A string is a digit string if all characters in the string are digits and there is at least one character in the string.
-
isidentifier
()¶ Return True if the string is a valid Python identifier, False otherwise.
Use keyword.iskeyword() to test for reserved identifiers such as “def” and “class”.
-
islower
()¶ Return True if the string is a lowercase string, False otherwise.
A string is lowercase if all cased characters in the string are lowercase and there is at least one cased character in the string.
-
isnumeric
()¶ Return True if the string is a numeric string, False otherwise.
A string is numeric if all characters in the string are numeric and there is at least one character in the string.
-
isprintable
()¶ Return True if the string is printable, False otherwise.
A string is printable if all of its characters are considered printable in repr() or if it is empty.
-
isspace
()¶ Return True if the string is a whitespace string, False otherwise.
A string is whitespace if all characters in the string are whitespace and there is at least one character in the string.
-
istitle
()¶ Return True if the string is a title-cased string, False otherwise.
In a title-cased string, upper- and title-case characters may only follow uncased characters and lowercase characters only cased ones.
-
isupper
()¶ Return True if the string is an uppercase string, False otherwise.
A string is uppercase if all cased characters in the string are uppercase and there is at least one cased character in the string.
-
join
()¶ Concatenate any number of strings.
The string whose method is called is inserted in between each given string. The result is returned as a new string.
Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
-
ljust
()¶ Return a left-justified string of length width.
Padding is done using the specified fill character (default is a space).
-
lower
()¶ Return a copy of the string converted to lowercase.
-
lstrip
()¶ Return a copy of the string with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
-
static
maketrans
()¶ Return a translation table usable for str.translate().
If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.
-
partition
()¶ Partition the string into three parts using the given separator.
This will search for the separator in the string. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.
If the separator is not found, returns a 3-tuple containing the original string and two empty strings.
-
replace
()¶ Return a copy with all occurrences of substring old replaced by new.
- count
- Maximum number of occurrences to replace. -1 (the default value) means replace all occurrences.
If the optional argument count is given, only the first count occurrences are replaced.
-
rfind
(sub[, start[, end]]) → int¶ Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Return -1 on failure.
-
rindex
(sub[, start[, end]]) → int¶ Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Raises ValueError when the substring is not found.
-
rjust
()¶ Return a right-justified string of length width.
Padding is done using the specified fill character (default is a space).
-
rpartition
()¶ Partition the string into three parts using the given separator.
This will search for the separator in the string, starting at the end. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.
If the separator is not found, returns a 3-tuple containing two empty strings and the original string.
-
rsplit
()¶ Return a list of the words in the string, using sep as the delimiter string.
- sep
- The delimiter according to which to split the string. None (the default value) means split according to any whitespace, and discard empty strings from the result.
- maxsplit
- Maximum number of splits to do. -1 (the default value) means no limit.
Splits are done starting at the end of the string and working to the front.
-
rstrip
()¶ Return a copy of the string with trailing whitespace removed.
If chars is given and not None, remove characters in chars instead.
-
split
()¶ Return a list of the words in the string, using sep as the delimiter string.
- sep
- The delimiter according to which to split the string. None (the default value) means split according to any whitespace, and discard empty strings from the result.
- maxsplit
- Maximum number of splits to do. -1 (the default value) means no limit.
-
splitlines
()¶ Return a list of the lines in the string, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends is given and true.
-
startswith
(prefix[, start[, end]]) → bool¶ Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.
-
strip
()¶ Return a copy of the string with leading and trailing whitespace removed.
If chars is given and not None, remove characters in chars instead.
-
swapcase
()¶ Convert uppercase characters to lowercase and lowercase characters to uppercase.
-
title
()¶ Return a version of the string where each word is titlecased.
More specifically, words start with uppercased characters and all remaining cased characters have lower case.
-
translate
()¶ Replace each character in the string using the given translation table.
- table
- Translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, strings, or None.
The table must implement lookup/indexing via __getitem__, for instance a dictionary or list. If this operation raises LookupError, the character is left untouched. Characters mapped to None are deleted.
-
upper
()¶ Return a copy of the string converted to uppercase.
-
zfill
()¶ Pad a numeric string with zeros on the left, to fill a field of the given width.
The string is never truncated.
-
-
class
pyteomics.auxiliary.structures.
unitfloat
[source]¶ Bases:
float
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
as_integer_ratio
()¶ Return integer ratio.
Return a pair of integers, whose ratio is exactly equal to the original float and with a positive denominator.
Raise OverflowError on infinities and a ValueError on NaNs.
>>> (10.0).as_integer_ratio()
(10, 1)
>>> (0.0).as_integer_ratio()
(0, 1)
>>> (-.25).as_integer_ratio()
(-1, 4)
-
conjugate
()¶ Return self, the complex conjugate of any float.
-
fromhex
()¶ Create a floating-point number from a hexadecimal string.
>>> float.fromhex('0x1.ffffp10')
2047.984375
>>> float.fromhex('-0x1p-1074')
-5e-324
-
hex
()¶ Return a hexadecimal representation of a floating-point number.
>>> (-0.1).hex()
'-0x1.999999999999ap-4'
>>> 3.14159.hex()
'0x1.921f9f01b866ep+1'
-
imag
¶ the imaginary part of a complex number
-
is_integer
()¶ Return True if the float is an integer.
-
real
¶ the real part of a complex number
-
-
class
pyteomics.auxiliary.structures.
unitint
[source]¶ Bases:
int
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
bit_length
()¶ Number of bits necessary to represent self in binary.
>>> bin(37)
'0b100101'
>>> (37).bit_length()
6
-
conjugate
()¶ Returns self, the complex conjugate of any int.
-
denominator
¶ the denominator of a rational number in lowest terms
-
from_bytes
()¶ Return the integer represented by the given array of bytes.
- bytes
- Holds the array of bytes to convert. The argument must either support the buffer protocol or be an iterable object producing bytes. Bytes and bytearray are examples of built-in objects that support the buffer protocol.
- byteorder
- The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use sys.byteorder as the byte order value.
- signed
- Indicates whether two’s complement is used to represent the integer.
-
imag
¶ the imaginary part of a complex number
-
numerator
¶ the numerator of a rational number in lowest terms
-
real
¶ the real part of a complex number
-
to_bytes
()¶ Return an array of bytes representing an integer.
- length
- Length of bytes object to use. An OverflowError is raised if the integer is not representable with the given number of bytes.
- byteorder
- The byte order used to represent the integer. If byteorder is ‘big’, the most significant byte is at the beginning of the byte array. If byteorder is ‘little’, the most significant byte is at the end of the byte array. To request the native byte order of the host system, use sys.byteorder as the byte order value.
- signed
- Determines whether two’s complement is used to represent the integer. If signed is False and a negative integer is given, an OverflowError is raised.
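The byte-order and signedness arguments described above can be illustrated with a short round trip. This is a minimal sketch that uses a plain int; unitint is assumed to behave the same way because it subclasses int.

```python
# Round-trip an integer through its byte representation.
value = 1024

# Two bytes are enough for 1024; 'big' puts the most significant byte first,
# 'little' puts it last.
big = value.to_bytes(2, byteorder='big')
little = value.to_bytes(2, byteorder='little')

# from_bytes() must be given the same byte order to recover the value.
restored = int.from_bytes(big, byteorder='big')

# Negative numbers need signed=True (two's complement) in both directions.
neg = (-2).to_bytes(1, byteorder='big', signed=True)
neg_restored = int.from_bytes(neg, byteorder='big', signed=True)

print(big, little, restored, neg, neg_restored)
```

Reading the bytes back with the wrong byte order or without signed=True would silently give a different integer, so both settings should be stored alongside the data.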
-
-
class
pyteomics.auxiliary.structures.
unitstr
[source]¶ Bases:
str
-
__init__
¶ Initialize self. See help(type(self)) for accurate signature.
-
capitalize
()¶ Return a capitalized version of the string.
More specifically, make the first character have upper case and the rest lower case.
-
casefold
()¶ Return a version of the string suitable for caseless comparisons.
-
center
()¶ Return a centered string of length width.
Padding is done using the specified fill character (default is a space).
-
count
(sub[, start[, end]]) → int¶ Return the number of non-overlapping occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation.
-
encode
()¶ Encode the string using the codec registered for encoding.
- encoding
- The encoding in which to encode the string.
- errors
- The error handling scheme to use for encoding errors. The default is ‘strict’ meaning that encoding errors raise a UnicodeEncodeError. Other possible values are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’ as well as any other name registered with codecs.register_error that can handle UnicodeEncodeErrors.
-
endswith
(suffix[, start[, end]]) → bool¶ Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.
-
expandtabs
()¶ Return a copy where all tab characters are expanded using spaces.
If tabsize is not given, a tab size of 8 characters is assumed.
-
find
(sub[, start[, end]]) → int¶ Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Return -1 on failure.
-
format
(*args, **kwargs) → str¶ Return a formatted version of S, using substitutions from args and kwargs. The substitutions are identified by braces (‘{‘ and ‘}’).
-
format_map
(mapping) → str¶ Return a formatted version of S, using substitutions from mapping. The substitutions are identified by braces (‘{‘ and ‘}’).
-
index
(sub[, start[, end]]) → int¶ Return the lowest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Raises ValueError when the substring is not found.
-
isalnum
()¶ Return True if the string is an alpha-numeric string, False otherwise.
A string is alpha-numeric if all characters in the string are alpha-numeric and there is at least one character in the string.
-
isalpha
()¶ Return True if the string is an alphabetic string, False otherwise.
A string is alphabetic if all characters in the string are alphabetic and there is at least one character in the string.
-
isascii
()¶ Return True if all characters in the string are ASCII, False otherwise.
ASCII characters have code points in the range U+0000-U+007F. Empty string is ASCII too.
-
isdecimal
()¶ Return True if the string is a decimal string, False otherwise.
A string is a decimal string if all characters in the string are decimal and there is at least one character in the string.
-
isdigit
()¶ Return True if the string is a digit string, False otherwise.
A string is a digit string if all characters in the string are digits and there is at least one character in the string.
-
isidentifier
()¶ Return True if the string is a valid Python identifier, False otherwise.
Use keyword.iskeyword() to test for reserved identifiers such as “def” and “class”.
-
islower
()¶ Return True if the string is a lowercase string, False otherwise.
A string is lowercase if all cased characters in the string are lowercase and there is at least one cased character in the string.
-
isnumeric
()¶ Return True if the string is a numeric string, False otherwise.
A string is numeric if all characters in the string are numeric and there is at least one character in the string.
-
isprintable
()¶ Return True if the string is printable, False otherwise.
A string is printable if all of its characters are considered printable in repr() or if it is empty.
-
isspace
()¶ Return True if the string is a whitespace string, False otherwise.
A string is whitespace if all characters in the string are whitespace and there is at least one character in the string.
-
istitle
()¶ Return True if the string is a title-cased string, False otherwise.
In a title-cased string, upper- and title-case characters may only follow uncased characters and lowercase characters only cased ones.
-
isupper
()¶ Return True if the string is an uppercase string, False otherwise.
A string is uppercase if all cased characters in the string are uppercase and there is at least one cased character in the string.
-
join
()¶ Concatenate any number of strings.
The string whose method is called is inserted in between each given string. The result is returned as a new string.
Example: ‘.’.join([‘ab’, ‘pq’, ‘rs’]) -> ‘ab.pq.rs’
-
ljust
()¶ Return a left-justified string of length width.
Padding is done using the specified fill character (default is a space).
-
lower
()¶ Return a copy of the string converted to lowercase.
-
lstrip
()¶ Return a copy of the string with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
-
static
maketrans
()¶ Return a translation table usable for str.translate().
If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.
-
partition
()¶ Partition the string into three parts using the given separator.
This will search for the separator in the string. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.
If the separator is not found, returns a 3-tuple containing the original string and two empty strings.
-
replace
()¶ Return a copy with all occurrences of substring old replaced by new.
- count
- Maximum number of occurrences to replace. -1 (the default value) means replace all occurrences.
If the optional argument count is given, only the first count occurrences are replaced.
-
rfind
(sub[, start[, end]]) → int¶ Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Return -1 on failure.
-
rindex
(sub[, start[, end]]) → int¶ Return the highest index in S where substring sub is found, such that sub is contained within S[start:end]. Optional arguments start and end are interpreted as in slice notation.
Raises ValueError when the substring is not found.
-
rjust
()¶ Return a right-justified string of length width.
Padding is done using the specified fill character (default is a space).
-
rpartition
()¶ Partition the string into three parts using the given separator.
This will search for the separator in the string, starting at the end. If the separator is found, returns a 3-tuple containing the part before the separator, the separator itself, and the part after it.
If the separator is not found, returns a 3-tuple containing two empty strings and the original string.
-
rsplit
()¶ Return a list of the words in the string, using sep as the delimiter string.
- sep
- The delimiter according to which to split the string. None (the default value) means split according to any whitespace, and discard empty strings from the result.
- maxsplit
- Maximum number of splits to do. -1 (the default value) means no limit.
Splits are done starting at the end of the string and working to the front.
-
rstrip
()¶ Return a copy of the string with trailing whitespace removed.
If chars is given and not None, remove characters in chars instead.
-
split
()¶ Return a list of the words in the string, using sep as the delimiter string.
- sep
- The delimiter according to which to split the string. None (the default value) means split according to any whitespace, and discard empty strings from the result.
- maxsplit
- Maximum number of splits to do. -1 (the default value) means no limit.
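The difference between split() and rsplit() only shows up when maxsplit limits the number of splits. The sketch below uses a plain str and a made-up identifier string; unitstr is assumed to inherit the same behavior.

```python
# A made-up, FASTA-like identifier used only for illustration.
header = 'sp|P12345|NAME_YEAST'

# split() consumes separators from the left...
left = header.split('|', maxsplit=1)

# ...while rsplit() consumes them from the right.
right = header.rsplit('|', maxsplit=1)

# With sep=None (the default), runs of whitespace act as one separator
# and empty strings are dropped from the result.
words = '  two   words '.split()

print(left, right, words)
```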
-
splitlines
()¶ Return a list of the lines in the string, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends is given and true.
-
startswith
(prefix[, start[, end]]) → bool¶ Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.
-
strip
()¶ Return a copy of the string with leading and trailing whitespace removed.
If chars is given and not None, remove characters in chars instead.
-
swapcase
()¶ Convert uppercase characters to lowercase and lowercase characters to uppercase.
-
title
()¶ Return a version of the string where each word is titlecased.
More specifically, words start with uppercased characters and all remaining cased characters have lower case.
-
translate
()¶ Replace each character in the string using the given translation table.
- table
- Translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, strings, or None.
The table must implement lookup/indexing via __getitem__, for instance a dictionary or list. If this operation raises LookupError, the character is left untouched. Characters mapped to None are deleted.
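As a small illustration of how maketrans() and translate() work together, the sketch below cleans up a sequence-like string. It uses a plain str and an invented input; unitstr is assumed to behave identically since it subclasses str.

```python
# Build a translation table: map 'I' to 'L' and delete '*' and '-'.
table = str.maketrans('I', 'L', '*-')

sequence = 'PEPTI-DEI*'  # invented input for illustration
cleaned = sequence.translate(table)

# A plain dict keyed by Unicode ordinals works as a table as well;
# characters mapped to None are deleted.
dict_table = {ord('I'): 'L', ord('*'): None, ord('-'): None}
same = sequence.translate(dict_table)

print(cleaned, same)
```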
-
upper
()¶ Return a copy of the string converted to uppercase.
-
zfill
()¶ Pad a numeric string with zeros on the left, to fill a field of the given width.
The string is never truncated.
-
-
class
pyteomics.auxiliary.file_helpers.
ChainBase
(*sources, **kwargs)[source]¶ Bases:
object
Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.
-
sources
¶ Sources for creating new sequences from, such as paths or file-like objects
Type: Iterable
-
kwargs
¶ Additional arguments used to instantiate each sequence
Type: Mapping
-
map
(target=None, processes=-1, queue_timeout=4, args=None, kwargs=None, **_kwargs)[source]¶ Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
- processes (int, optional) – The number of worker processes to use. If negative, the number of processes will match the number of available CPUs.
- queue_timeout (float, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
- **_kwargs – Additional keyword arguments to be passed to the target function
Yields: object – The work item returned by the target function.
-
-
class
pyteomics.auxiliary.file_helpers.
FileReader
(source, **kwargs)[source]¶ Bases:
pyteomics.auxiliary.file_helpers.IteratorContextManager
Abstract class implementing context manager protocol for file readers.
-
class
pyteomics.auxiliary.file_helpers.
FileReadingProcess
(reader_spec, target_spec, qin, qout, args_spec, kwargs_spec)[source]¶ Bases:
multiprocessing.context.Process
Process that does a share of distributed work on entries read from file. Reconstructs a reader object, parses entries from given indexes, optionally does additional processing, and sends the results back.
The reader class must support the __getitem__() dict-like lookup.
-
__init__
(reader_spec, target_spec, qin, qout, args_spec, kwargs_spec)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
close
()¶ Close the Process object.
This method releases resources held by the Process object. It is an error to call this method if the child process is still running.
-
daemon
¶ Return whether process is a daemon
-
exitcode
¶ Return exit code of process or None if it has yet to stop
-
ident
¶ Return identifier (PID) of process or None if it has yet to start
-
is_alive
()¶ Return whether process is alive
-
join
(timeout=None)¶ Wait until child process terminates
-
kill
()¶ Terminate process; sends SIGKILL signal or uses TerminateProcess()
-
pid
¶ Return identifier (PID) of process or None if it has yet to start
-
sentinel
¶ Return a file descriptor (Unix) or handle (Windows) suitable for waiting for process termination.
-
start
()¶ Start child process
-
terminate
()¶ Terminate process; sends SIGTERM signal or uses TerminateProcess()
-
-
class
pyteomics.auxiliary.file_helpers.
IndexSavingMixin
(*args, **kwargs)[source]¶ Bases:
pyteomics.auxiliary.file_helpers.NoOpBaseReader
Common interface for IndexSavingXML and IndexSavingTextReader.
-
__init__
(*args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
pyteomics.auxiliary.file_helpers.
IndexSavingTextReader
(source, **kwargs)[source]¶ Bases:
pyteomics.auxiliary.file_helpers.IndexSavingMixin, pyteomics.auxiliary.file_helpers.IndexedTextReader
-
__init__
(source, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
classmethod
prebuild_byte_offset_file
(path)¶ Construct a new XML reader, build its byte offset index and write it to file
Parameters: path (str) – The path to the file to parse
-
reset
()¶ Resets the iterator to its initial state.
-
write_byte_offsets
()¶ Write the byte offsets in _offset_index to the file at _byte_offset_filename.
-
-
class
pyteomics.auxiliary.file_helpers.
IndexedReaderMixin
(*args, **kwargs)[source]¶ Bases:
pyteomics.auxiliary.file_helpers.NoOpBaseReader
Common interface for IndexedTextReader and IndexedXML.
-
__init__
(*args, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
-
class
pyteomics.auxiliary.file_helpers.
IndexedTextReader
(source, **kwargs)[source]¶ Bases:
pyteomics.auxiliary.file_helpers.IndexedReaderMixin, pyteomics.auxiliary.file_helpers.FileReader
Abstract class for text file readers that keep an index of records for random access. This requires reading the file in binary mode.
-
reset
()¶ Resets the iterator to its initial state.
-
-
class
pyteomics.auxiliary.file_helpers.
OffsetIndex
(*args, **kwargs)[source]¶ Bases:
collections.OrderedDict, pyteomics.auxiliary.file_helpers.WritableIndex
An augmented OrderedDict that formally wraps getting items by index
-
clear
() → None. Remove all items from od.¶
-
copy
() → a shallow copy of od¶
-
from_index
(index, include_value=False)[source]¶ Get an entry by its integer index in the ordered sequence of this mapping.
Returns: If include_value is True, a tuple of (key, value) at index, else just the key at index.
-
from_slice
(spec, include_value=False)[source]¶ Get a slice along index in the ordered sequence of this mapping.
Returns: If include_value is True, a tuple of (key, value) at index, else just the key at index, for each index in spec.
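The idea of positional access into an ordered mapping can be sketched with the standard library alone. The class below is a hypothetical stand-in written for illustration; it is not the pyteomics implementation, and the keys and offsets are invented.

```python
from collections import OrderedDict


class PositionalIndex(OrderedDict):
    """Hypothetical stand-in for an offset index, for illustration only."""

    def from_index(self, index, include_value=False):
        # Materialize the items so that they can be addressed by position.
        key, value = tuple(self.items())[index]
        return (key, value) if include_value else key

    def from_slice(self, spec, include_value=False):
        # `spec` is a slice object applied to the ordered items.
        selected = tuple(self.items())[spec]
        return [(k, v) if include_value else k for k, v in selected]


# Invented keys and byte offsets.
offsets = PositionalIndex([('scan=1', 0), ('scan=2', 512), ('scan=3', 2048)])

first_key = offsets.from_index(0)
last_pair = offsets.from_index(-1, include_value=True)
middle = offsets.from_slice(slice(1, 3))
print(first_key, last_pair, middle)
```

Rebuilding the tuple on every call is wasteful for a large index, which is presumably why the real class caches it (see index_sequence).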
-
fromkeys
()¶ Create a new ordered dictionary with keys from iterable and values set to value.
-
get
()¶ Return the value for key if key is in the dictionary, else default.
-
index_sequence
¶ Keeps a cached copy of the items() sequence stored as a tuple to avoid repeatedly copying the sequence over many method calls.
Return type: tuple
-
items
() → a set-like object providing a view on D's items¶
-
keys
() → a set-like object providing a view on D's keys¶
-
move_to_end
()¶ Move an existing element to the end (or beginning if last is false).
Raise KeyError if the element does not exist.
-
pop
(k[, d]) → v, remove specified key and return the corresponding value.[source]¶ If key is not found, d is returned if given, otherwise KeyError is raised.
-
popitem
()¶ Remove and return a (key, value) pair from the dictionary.
Pairs are returned in LIFO order if last is true or FIFO order if false.
-
setdefault
()¶ Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
-
update
([E, ]**F) → None. Update D from dict/iterable E and F.¶ If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
-
values
() → an object providing a view on D's values¶
-
-
class
pyteomics.auxiliary.file_helpers.
TableJoiner
(*sources, **kwargs)[source]¶ Bases:
pyteomics.auxiliary.file_helpers.ChainBase
-
__init__
(*sources, **kwargs)¶ Initialize self. See help(type(self)) for accurate signature.
-
map
(target=None, processes=-1, queue_timeout=4, args=None, kwargs=None, **_kwargs)¶ Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
- processes (int, optional) – The number of worker processes to use. If negative, the number of processes will match the number of available CPUs.
- queue_timeout (float, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
- **_kwargs – Additional keyword arguments to be passed to the target function
Yields: object – The work item returned by the target function.
-
Combined examples¶
This section lists examples that illustrate the possible usage of Pyteomics as a whole. The list will grow in time.
Contents:
Example 1: Unravelling the Peptidome¶
In this example, we will introduce the Pyteomics tools to predict the basic physicochemical characteristics of peptides, such as mass, charge and chromatographic retention time. We will download a FASTA database with baker’s yeast proteins, digest it with trypsin and study the distributions of various quantitative qualities that may be measured in a typical proteomic experiment.
The example is organized as a script interrupted by comments. It is assumed that the reader already has experience with the numpy and matplotlib libraries. The source code for the example can be found here.
Before we begin, we need to import all the modules that we may require. Besides pyteomics itself, we need the standard-library modules that let us access the file system (os), download files from the Internet (urllib), and open gzip archives (gzip), as well as external libraries to process and visualize arrays of data (numpy, matplotlib).
import os
from urllib.request import urlretrieve
import gzip
import matplotlib.pyplot as plt
import numpy as np
from pyteomics import fasta, parser, mass, achrom, electrochem, auxiliary
We also need to download a real FASTA database. For our purposes, the Uniprot database with Saccharomyces cerevisiae proteins will work fine. We’ll download a gzip-compressed database from Uniprot FTP server:
if not os.path.isfile('yeast.fasta.gz'):
    print('Downloading the FASTA file for Saccharomyces cerevisiae...')
    urlretrieve(
        'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/'
        'reference_proteomes/Eukaryota/UP000002311_559292.fasta.gz',
        'yeast.fasta.gz')
    print('Done!')
The pyteomics.fasta.FASTA() class lets us iterate over the protein sequences in a FASTA file in a regular Python loop. It replaced pyteomics.fasta.read(), although the latter still exists, too. In this example, we create a FASTA object from a file-like object representing a gzip archive. All file parser objects are flexible and support a variety of use cases. Additionally, pyteomics.fasta supports an even greater variety of FASTA types and flavors.
For all FASTA parser classes, check fasta - manipulations with FASTA databases. See also: an explanation of Indexed Parsers.
In order to obtain the peptide sequences, we cleave each protein using the pyteomics.parser.cleave() function and combine the results into a set object that automatically discards multiple occurrences of the same sequence.
print('Cleaving the proteins with trypsin...')
unique_peptides = set()
with gzip.open('yeast.fasta.gz', mode='rt') as gzfile:
    for description, sequence in fasta.FASTA(gzfile):
        new_peptides = parser.cleave(sequence, 'trypsin')
        unique_peptides.update(new_peptides)
print('Done, {0} sequences obtained!'.format(len(unique_peptides)))
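To make the cleavage step less of a black box, here is a minimal standard-library sketch of a simplified tryptic rule: cut after K or R unless the next residue is P, with no missed cleavages. The function name and the toy sequences are invented for illustration; the actual rule used by pyteomics.parser.cleave() for 'trypsin' is more elaborate, so results on real proteins may differ.

```python
import re


def simple_tryptic_digest(sequence):
    """Cleave after K or R unless followed by P (simplified rule)."""
    # Positions right after each cleavable residue.
    cut_sites = [m.end() for m in re.finditer(r'[KR](?!P)', sequence)]
    bounds = [0] + cut_sites + [len(sequence)]
    # Consecutive boundaries delimit the peptides; skip empty pieces.
    return {sequence[a:b] for a, b in zip(bounds, bounds[1:]) if a < b}


# Toy "proteins"; the shared peptide 'AK' is stored only once in the set.
unique = set()
for protein in ('AKRPGKC', 'CCKAK'):
    unique.update(simple_tryptic_digest(protein))

print(sorted(unique))
```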
Later we will calculate different peptide properties. In order to store them, we create a list of dicts, where each dict stores the properties of a single peptide, including its sequence.
peptides = [{'sequence': i} for i in unique_peptides]
It is also more efficient to pre-parse the sequences into individual amino acids and supply the parsed structures to the functions that calculate m/z, charge, etc. During parsing, we explicitly save the terminal groups of peptides so that they are taken into account when calculating the m/z and charge of a peptide.
print('Parsing peptide sequences...')
for peptide in peptides:
    peptide['parsed_sequence'] = parser.parse(
        peptide['sequence'],
        show_unmodified_termini=True)
    peptide['length'] = parser.length(peptide['parsed_sequence'])
print('Done!')
For our purposes, we will limit ourselves to reasonably short peptides, no longer than 100 residues.
peptides = [peptide for peptide in peptides if peptide['length'] <= 100]
We use pyteomics.electrochem.charge() to calculate the charge at pH=2.0. The neutral mass and m/z of an ion are found with pyteomics.mass.calculate_mass().
print('Calculating the mass, charge and m/z...')
for peptide in peptides:
    peptide['charge'] = int(round(
        electrochem.charge(peptide['parsed_sequence'], pH=2.0)))
    peptide['mass'] = mass.calculate_mass(peptide['parsed_sequence'])
    peptide['m/z'] = mass.calculate_mass(peptide['parsed_sequence'],
                                         charge=peptide['charge'])
print('Done!')
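The relation between the neutral mass and the m/z of a protonated ion is simple enough to write out by hand. The sketch below assumes that each unit of positive charge comes from one added proton and uses an approximate proton mass; the function name is invented and this is not the pyteomics implementation.

```python
# Approximate proton mass in Da (assumed constant for this sketch).
PROTON_MASS = 1.007276


def mz_from_neutral_mass(neutral_mass, charge):
    """m/z of an ion formed by adding `charge` protons to a neutral molecule."""
    if charge < 1:
        raise ValueError('charge must be a positive integer')
    return (neutral_mass + charge * PROTON_MASS) / charge


# A 1000 Da peptide observed as a doubly protonated ion.
mz = mz_from_neutral_mass(1000.0, 2)
print(round(mz, 4))
```

The charge must be at least 1 for m/z to be defined, which is worth keeping in mind whenever a charge is obtained by rounding a fractional value.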
Next, we calculate the retention time in reversed- and normal-phase chromatography using pyteomics.achrom.calculate_RT() for two different sets of retention coefficients. The phase is specified by supplying the corresponding set of retention coefficients, pyteomics.achrom.RCs_zubarev and pyteomics.achrom.RCs_yoshida_lc for the reversed and normal phases, respectively.
print('Calculating the retention time...')
for peptide in peptides:
    peptide['RT_RP'] = achrom.calculate_RT(
        peptide['parsed_sequence'],
        achrom.RCs_zubarev)
    peptide['RT_normal'] = achrom.calculate_RT(
        peptide['parsed_sequence'],
        achrom.RCs_yoshida_lc)
print('Done!')
Now that we have all the numbers, we can estimate the complexity of a sample by plotting the distributions of parameters measurable in a typical proteomic experiment. First, we show the distribution of m/z using the standard histogram plotting function from matplotlib.
plt.figure()
plt.hist([peptide['m/z'] for peptide in peptides],
         bins=2000,
         range=(0, 4000))
plt.xlabel('m/z, Th')
plt.ylabel('# of peptides within 2 Th bin')
The same set of commands allows us to plot the distribution of charge states in the sample:
plt.figure()
plt.hist([peptide['charge'] for peptide in peptides],
         bins=20,
         range=(0, 10))
plt.xlabel('charge, e')
plt.ylabel('# of peptides')
Next, we want to visualize the statistical correlation between the retention times in reversed-phase and normal-phase chromatography. The standard approach would be to use a scatter plot. However, with a sample of our size that would be uninformative. Instead, we will plot a 2D histogram, built here with a combination of numpy and matplotlib. The function numpy.histogram2d() bins a set of (x, y) points on a plane and returns the matrix of counts in each individual bin together with the borders of the bins. We also use a trick of replacing zeros in this matrix with the not-a-number value so that in the final figure empty bins are shown in white instead of the darkest blue. We suggest removing the fourth line in this code snippet to see how that affects the final plot. In the last line, we also apply linear regression to obtain the coefficient of correlation between the two retention times.
x = [peptide['RT_RP'] for peptide in peptides]
y = [peptide['RT_normal'] for peptide in peptides]
heatmap, xbins, ybins = np.histogram2d(x, y, bins=100)
heatmap[heatmap == 0] = np.nan
a, b, r, stderr = auxiliary.linear_regression(x,y)
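For readers who want to see what the slope, intercept and correlation coefficient amount to, here is a minimal pure-Python sketch of an ordinary least-squares fit. The function name and the data are invented for illustration, the standard error is not computed, and this is not the code behind auxiliary.linear_regression().

```python
import math


def simple_linear_fit(x, y):
    """Return (slope, intercept, r) for a least-squares line y = a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross-products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Invented, perfectly linear data: y = 2*x + 1.
a, b, r = simple_linear_fit([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(a, b, r)
```

A value of r close to 0 indicates no linear relationship, while values close to 1 or -1 indicate a strong one.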
The obtained heatmap is plotted with the matplotlib.pyplot.imshow() function, which visualizes matrices.
plt.figure()
plt.imshow(heatmap)
plt.xlabel('RT on RP, min')
plt.ylabel('RT on normal phase, min')
plt.title('All tryptic peptides, RT correlation = {0}'.format(r))
As you can see upon execution of the code, the retention times obtained on different chromatographic phases seem to be uncorrelated. The same code can also be applied to compare m/z with the retention time in reversed-phase chromatography.
x = [peptide['m/z'] for peptide in peptides]
y = [peptide['RT_RP'] for peptide in peptides]
heatmap, xbins, ybins = np.histogram2d(x, y,
                                       bins=[150, 2000],
                                       range=[[0, 4000], [0, 150]])
heatmap[heatmap == 0] = np.nan
a, b, r, stderr = auxiliary.linear_regression(x,y)
plt.figure()
plt.imshow(heatmap,
           aspect='auto',
           origin='lower')
plt.xlabel('m/z, Th')
plt.ylabel('RT on RP, min')
plt.title('All tryptic peptides, correlation = {0}'.format(r))
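One detail is easy to miss when reading such heatmaps: numpy.histogram2d() puts the x bins along the first axis of the returned matrix, whereas imshow() draws the first axis vertically. The pure-Python sketch below mimics equal-width 2D binning to make that layout visible; the function name and the data are invented, and it is a simplified illustration rather than a replacement for numpy.

```python
def bin_2d(xs, ys, nx, ny, x_range, y_range):
    """Count points in an nx-by-ny grid; counts[i][j] holds x bin i, y bin j."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = [[0] * ny for _ in range(nx)]
    for x, y in zip(xs, ys):
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # points outside the range are ignored
        # Points on the upper edge are put into the last bin.
        i = min(int((x - x0) / (x1 - x0) * nx), nx - 1)
        j = min(int((y - y0) / (y1 - y0) * ny), ny - 1)
        counts[i][j] += 1
    return counts


counts = bin_2d([0.5, 0.6, 3.5], [0.1, 0.1, 1.9],
                nx=4, ny=2, x_range=(0, 4), y_range=(0, 2))

# Rows correspond to x bins. To show x horizontally, transpose first.
transposed = [list(row) for row in zip(*counts)]
print(counts)
print(transposed)
```

With a numpy array the same transposition is heatmap.T, so if the axis labels should match the plotted data it is worth checking which of the two layouts is being passed to imshow().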
Finally, let us check whether the retention times remain uncorrelated when we narrow down the sample of peptides. We select the peptides with m/z lying in a 700-701 Th window and plot two chromatographic retention times. This time the sample allows us to use a scatter plot.
close_mass_peptides = [peptide for peptide in peptides
                       if 700.0 <= peptide['m/z'] <= 701.0]
x = [peptide['RT_RP'] for peptide in close_mass_peptides]
y = [peptide['RT_normal'] for peptide in close_mass_peptides]
a, b, r, stderr = auxiliary.linear_regression(x, y)
plt.figure()
plt.scatter(x, y)
plt.xlabel('RT on RP, min')
plt.ylabel('RT on normal phase, min')
plt.title('Tryptic peptides with m/z=700-701 Th\nRT correlation = {0}'.format(r))
plt.show()
As you can see, the retention times of peptides lying in a narrow mass window turn out to be substantially correlated.
At this point we stop. The next example will cover the modules allowing access to experimental proteomic datasets stored in XML-based formats.
Example 2: Fragmentation¶
In this example, we are going to retrieve MS/MS data from an MGF file and compare it to identification info we read from a pepXML file. We are going to compare the MS/MS spectrum in the file with the theoretical spectrum of a peptide assigned to this spectrum by the search engine.
The script source can be downloaded here. We will also need the example MGF file and the example pepXML file, but the script will download them for you.
The MGF file has a single MS/MS spectrum in it. This spectrum is taken from the SwedCAD database of annotated MS/MS spectra. The pepXML file was obtained by running X!Tandem against the MGF file and converting the results to pepXML with the Tandem2XML tool from TPP.
Let’s start with importing the modules.
from pyteomics import mgf, pepxml, mass
import os
from urllib.request import urlretrieve
import pylab
Then we’ll download the files, if needed:
for fname in ('mgf', 'pep.xml'):
    if not os.path.isfile('example.' + fname):
        urlretrieve('http://pyteomics.readthedocs.io/en/latest/_static/example.'
                    + fname, 'example.' + fname)
Now it’s time to define the function that will give us m/z of theoretical fragments for a given sequence. We will use pyteomics.mass.fast_mass() to calculate the values. All we need to do is split the sequence at every bond and iterate over possible charges and ion types:
def fragments(peptide, types=('b', 'y'), maxcharge=1):
    """
    The function generates all possible m/z for fragments of types
    `types` and of charges from 1 to `maxcharge`.
    """
    # i runs over every peptide bond: 1 .. len(peptide) - 1 inclusive
    for i in range(1, len(peptide)):
        for ion_type in types:
            for charge in range(1, maxcharge+1):
                if ion_type[0] in 'abc':
                    # N-terminal fragment
                    yield mass.fast_mass(
                        peptide[:i], ion_type=ion_type, charge=charge)
                else:
                    # C-terminal fragment
                    yield mass.fast_mass(
                        peptide[i:], ion_type=ion_type, charge=charge)
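So, the outer loop is over “fragmentation sites”, the next one is over ion types, and the innermost one is over charges. Finally, the if statement picks the part of the sequence that matches the ion type: the N-terminal part for a-, b- and c-type ions and the C-terminal part for all others.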
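The splitting step on its own can be sketched without pyteomics. A peptide of n residues has n - 1 backbone bonds, so there are n - 1 ways to cut it into an N-terminal and a C-terminal part. The helper name below is invented for illustration.

```python
def split_sites(peptide):
    """Return (N-terminal part, C-terminal part) for every backbone bond."""
    return [(peptide[:i], peptide[i:]) for i in range(1, len(peptide))]


pairs = split_sites('PEPK')
for n_term, c_term in pairs:
    # The N-terminal part would give a-, b- or c-type ions,
    # the C-terminal part x-, y- or z-type ions.
    print(n_term, c_term)
```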
All right, now it’s time to extract the info from the files. We are going to use the with statement syntax, which is not required, but recommended.
with mgf.read('example.mgf') as spectra, pepxml.read('example.pep.xml') as psms:
    spectrum = next(spectra)
    psm = next(psms)
Now prepare the figure…
pylab.figure()
pylab.title('Theoretical and experimental spectra for '
            + psm['search_hit'][0]['peptide'])
pylab.xlabel('m/z, Th')
pylab.ylabel('Intensity, rel. units')
… plot the real spectrum:
pylab.bar(spectrum['m/z array'], spectrum['intensity array'], width=0.1, linewidth=2,
          edgecolor='black')
… calculate and plot the theoretical spectrum, and show everything:
theor_spectrum = list(fragments(psm['search_hit'][0]['peptide'],
                                maxcharge=psm['assumed_charge']))
pylab.bar(theor_spectrum,
          [spectrum['intensity array'].max()]*len(theor_spectrum),
          width=0.1, edgecolor='red', alpha=0.7)
pylab.show()
You will see something like this:

That’s it. As you can see, the most intense peaks in the spectrum are indeed matched by the theoretical spectrum.
Example 3: Search engines and PSM filtering¶
In this example we are going to parse the output of several search engines and see what we can do with it using Pyteomics.
Full Python code can be downloaded here (Python script) and here (IPython Notebook). The files used in this example can be downloaded from here.
The example, including code, figures, and accompanying text, is contained in the IPython Notebook file.
History of changes¶
dev¶
Add pyteomics.electrochem.gravy() (#9).
4.3.1¶
Technical release.
4.3¶
First release after the move to Github. Issue and PR numbers from now on refer to the Github repo. An archive of the Bitbucket issues and PRs is stored here.
Changes in this release:
- New module pyteomics.openms.idxml.
- Fix #3, #5, and some issues in tandem.
4.2¶
Changes in XML XPath implementation. For standard XML parser classes, this only means a minor change in performance (should be a slight improvement, most noticeable for TandemXML).
For custom classes: the implementation of XPath evaluation in pyteomics.xml.XML.iterfind() has changed. Pseudo-conditions are no longer supported. Instead, an attempt is made to support full XPath. The main difference is that the XPath is evaluated on XML elements, whereas pseudo-conditions used to be evaluated for complete Python dictionaries. To reproduce the old behavior, you can write an explicit if statement at an appropriate place. The new implementation allows actually skipping the elements that do not satisfy the XPath predicate. When writing classes which by default iterate over elements based on a complex XPath, set _default_iter_path instead of _default_iter_tag.
Warning
Beware that if _default_iter_path differs from _default_iter_tag and you use indexing, all elements corresponding to _default_iter_tag will be indexed. This is a limitation of the index building procedure. This discrepancy will lead to confusing behavior (length checks, membership tests and other things based on index will not correspond to items returned by iteration). map() calls will also operate on the full index.
- New keyword arguments queue_size, queue_timeout and processes for indexed parsers with support for map().
- New method mass.Unimod.by_id(). Also, mass.Unimod now supports dict-like queries with record IDs.
- Reduce memory footprint for unit primitives (PR #35 by Joshua Klein).
- New functions pyteomics.auxiliary.sigma_T() and pyteomics.auxiliary.sigma_fdr().
- Fix issues #44, #46, #47, #48.
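The difference between filtering on XML elements and filtering on parsed dictionaries can be illustrated with the standard library alone. The sketch below uses xml.etree.ElementTree rather than Pyteomics, and the XML snippet, the to_dict() helper and all tag names are made up for illustration; it is not the Pyteomics implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML snippet; tag and attribute names are invented for this sketch.
DOC = """
<results>
  <hit id="h1" rank="1"><score>42.0</score></hit>
  <hit id="h2" rank="2"><score>17.5</score></hit>
  <hit id="h3" rank="1"><score>8.1</score></hit>
</results>
"""

root = ET.fromstring(DOC)

# XPath predicate: evaluated on XML elements, so elements that do not
# match are skipped before anything is converted to a dictionary.
top_ranked = [hit.get("id") for hit in root.iterfind(".//hit[@rank='1']")]


def to_dict(elem):
    """Convert a <hit> element into a plain dictionary (illustrative helper)."""
    return {
        "id": elem.get("id"),
        "rank": int(elem.get("rank")),
        "score": float(elem.findtext("score")),
    }


# Explicit `if`: every element is converted first, then the condition
# is checked on the complete dictionary.
high_scoring = [
    d["id"] for d in map(to_dict, root.iterfind(".//hit")) if d["score"] > 10
]
```

The first form avoids building dictionaries for elements that are going to be discarded; the second form can express conditions on any value that appears in the converted dictionary.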
4.1.2¶
Bugfix: fix the standard mass value for pyrrolysine (issue #42).
4.1.1¶
API changes¶
- In ms1.read() and ms2.read(), the default value for use_index is now False. Using the indexed parsers may result in incorrect behavior if the “first” scan number in S-lines is not unique.
4.1¶
- New module pyteomics.mztab provides a parser for mzTab files.
- New module pyteomics.ms2 provides a parser for ms2 files. This is in fact an alias to ms1, which handles both formats.
- Added index saving functionality for pyteomics.mgf.IndexedMGF.
- New helper functions pyteomics.pylab_aux.plot_spectrum() and pyteomics.pylab_aux.annotate_spectrum().
- The rule and exception arguments in pyteomics.parser.cleave() can be keys from expasy_rules.
- Fixes.
4.0¶
- Add parameters semi and exception in pyteomics.parser.cleave().
- Add new parameter encoding in file writers.
- Add new parameters write_charges and use_numpy in pyteomics.mgf.write(). Speed up the writing when numpy is available.
- Indexing text parsers. This release introduces a family of parser classes for text files. These parsers create byte offsets of indexed entries to allow random access by unique key or by positional index, “rich” access by slices and, in case of MGF/mzML/mzXML, by retention time range. All indexing parsers, text- or XML-based, now have a unified interface.
  New class pyteomics.mgf.IndexedMGF is now the recommended way to parse MGF files. It supports fast access by spectrum titles by using an index of byte offsets. The old, sequential parser is preserved under its name, pyteomics.mgf.MGF. The function pyteomics.mgf.read() now returns an instance of one of the two classes, based on the use_index argument and the type of source. The common ancestor class, pyteomics.mgf.MGFBase, can be used for type checking.
  New FASTA parsing classes:
  - pyteomics.fasta.FASTABase - common ancestor, suitable for type checking;
  - pyteomics.fasta.FASTA - text-mode, sequential parser; does what the old fasta.read() was doing. Additionally, the following subclasses perform format-specific parsing of FASTA headers:
  - pyteomics.fasta.IndexedFASTA - binary-mode, indexing parser. Supports direct indexing by header string;
  - pyteomics.fasta.TwoLayerIndexedFASTA - additionally supports indexing by extracted header fields. Format-specific second indexes are available in subclasses:
  pyteomics.fasta.read() now returns an instance of one of these classes, depending on the arguments use_index and flavor.
  pyteomics.ms1.IndexedMS1 and pyteomics.ms1.MS1 are available for ms1 format.
  (In collaboration with J. Klein)
- Multiprocessing support: all indexed XML and text file parsers now expose a map() method. This method can map a user-supplied function to each file entry in separate processes (or simply parallelize the parsing itself). Additionally, objects returned by chain() functions and iterfind() methods also expose the map() interface to allow parallelizing the work over multiple files and when iterating over non-default XML tree elements. The order of entries is not preserved in the output. (In collaboration with J. Klein)
- New module pyteomics.peff implements the IndexedPEFF parser for protein databases in the new PSI standard format, PEFF. (Contributed by J. Klein)
- New module pyteomics.traml implements the TraML parser for the PSI standard format for SRM data, TraML. (In collaboration with J. Klein)
- pyteomics.protxml.ProtXML now also supports indexing and multiprocessing.
- Removed parameter skip_empty_cvparam_values in XML parsers. In cvParam elements, missing “value” attribute is now always equivalent to the case when it is equal to an empty string. This affects the structure of items produced by MzML and MzIdentML parsers.
- Multiple fixes and improvements.
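The idea of indexing a text file by byte offsets can be sketched in a few lines. This is only an illustration of the general technique under simplifying assumptions (an in-memory MGF-like file, unique titles); build_offset_index() and read_entry() are hypothetical helpers, not part of Pyteomics, and the actual parsers are considerably more elaborate.

```python
import io

# Hypothetical two-spectrum file in an MGF-like layout.
DATA = (
    b"BEGIN IONS\nTITLE=spec_A\n100.0 1.0\nEND IONS\n"
    b"BEGIN IONS\nTITLE=spec_B\n200.0 2.0\n300.0 3.0\nEND IONS\n"
)


def build_offset_index(stream):
    """Map each spectrum title to the (start, end) byte offsets of its entry."""
    index = {}
    start = title = None
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:
            break
        if line.startswith(b"BEGIN IONS"):
            start, title = pos, None
        elif line.startswith(b"TITLE="):
            title = line[len(b"TITLE="):].strip().decode()
        elif line.startswith(b"END IONS") and start is not None and title is not None:
            index[title] = (start, stream.tell())
            start = None
    return index


def read_entry(stream, index, title):
    """Jump straight to one entry using the index, without rescanning the file."""
    start, end = index[title]
    stream.seek(start)
    return stream.read(end - start).decode()


stream = io.BytesIO(DATA)  # binary mode, so offsets are exact byte positions
offsets = build_offset_index(stream)
entry_b = read_entry(stream, offsets, "spec_B")
```

The file is scanned once to build the index; after that, any entry can be retrieved with a single seek and read, which is what makes access by key or by position cheap.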
3.5.1¶
Technical release to update the package metadata on PyPI. Project documentation on pythonhosted.org has been deleted. Latest documentation is available at: https://pyteomics.readthedocs.io/.
3.5¶
- Preserve accession information on cvParam elements in mzML parser. Dictionaries produced by the parser can now be queried by accession using pyteomics.auxiliary.cvquery(). (Contributed by J. Klein)
- Add optional decode_binary argument in pyteomics.mzml.MzML and pyteomics.mzxml.MzXML. When set to False, the parsers provide binary records suitable for decoding on demand. (Contributed by J. Klein)
- Add method write_byte_offsets() in pyteomics.mzml.MzML, pyteomics.mzxml.MzXML and pyteomics.mzid.MzIdentML. Byte offsets can be loaded later to speed up random access. (Contributed by J. Klein)
- Random access to MGF spectrum entries.
  - Add function pyteomics.mgf.get_spectrum().
  - Add class pyteomics.mgf.MGF. mgf.read() is now an alias to the class. The class can be used for indexing using spectrum titles.
  This functionality will be changed in upcoming versions.
- New module pyteomics.protxml for parsing of ProteinProphet output files.
- Add PeptideProphet and iProphet analysis information to the output of pyteomics.pepxml.DataFrame().
- New parameter huge_tree in XML parser constructors and read() functions. It is passed to the underlying lxml calls. Default value is False. Set to True to overcome errors such as: XMLSyntaxError: xmlSAX2Characters: huge text node.
- New parameter skip_empty_cvparam_values in XML parser constructors. It instructs the parser to treat the empty “value” attributes in cvParam elements as if they were not there. This is helpful in cases when such empty “values” are present in one vendor’s file and absent in another: enabling the parameter will result in more unified output. Default value is False.
- Change the default value for read_schema to False in XML parsing modules.
- Change the default value for retrieve_refs to True in MzIdentML constructor.
- Implement retrieve_refs for pyteomics.mzml.MzML. (Contributed by J. Klein)
- New parameter keep_cterm in decoy generation functions in pyteomics.fasta.
- New parameters decoy_prefix and decoy_suffix in all format-specific FDR filtering functions. If the standard is_decoy() function works for your files, you can use these parameters to specify either the prefix or the suffix appended to the protein names in decoy entries.
- New ion types in pyteomics.mass.std_ion_comp.
- Bugfixes.
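What it means to keep a terminal residue in a decoy can be shown with a small sketch. It assumes the simplest decoy mode, plain sequence reversal; reverse_decoy() is a hypothetical stand-in written for this illustration and not the pyteomics.fasta function, although its keep_nterm and keep_cterm flags mirror the parameter names mentioned in this changelog.

```python
def reverse_decoy(sequence, keep_nterm=False, keep_cterm=False):
    """Reverse a sequence, optionally leaving the terminal residues in place."""
    start = 1 if keep_nterm else 0
    end = len(sequence) - 1 if keep_cterm else len(sequence)
    if start >= end:
        # Nothing left to reverse (very short sequence).
        return sequence
    return sequence[:start] + sequence[start:end][::-1] + sequence[end:]


plain = reverse_decoy("PEPTIDEK")
with_cterm = reverse_decoy("PEPTIDEK", keep_cterm=True)
with_nterm = reverse_decoy("PEPTIDEK", keep_nterm=True)
with_both = reverse_decoy("PEPTIDEK", keep_nterm=True, keep_cterm=True)
```

Keeping the C-terminal residue is typically useful for tryptic peptides, where the last residue is K or R in both target and decoy sequences.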
3.4.2¶
- New module pyteomics.ms1 for parsing of MS1 files.
- mass.Composition constructor now accepts ion_type and charge parameters.
- New functions pyteomics.mzid.DataFrame() and pyteomics.mzid.filter_df(). Their behavior may be refined later on.
- Changes in behavior of pyteomics.auxiliary.filter() and pyteomics.auxiliary.qvalues():
  - both functions now always return DataFrames with pandas.DataFrame input and full_output=True.
  - string values of key, is_decoy and pep are substituted with simple itemgetter functions for non-pandas, non-numpy input;
  - additional parameters score_label, decoy_label, pep_label, and q_label for output control.
- Performance optimizations in XML parsing code.
3.4.1¶
- Add selenocysteine (“U”) and pyrrolysine (“O”) to pyteomics.mass.std_aa_mass and pyteomics.mass.std_aa_comp.
- An optional parameter encoding is now accepted by text file readers (pyteomics.mgf.read() and pyteomics.fasta.read()). This can be useful for MGF files with non-ASCII spectrum titles or comments.
- New function pyteomics.mass.mass.isotopologues().
- Performance improvements in pyteomics.electrochem.pI().
- Fix the issue in pyteomics.xml which resulted in very long processing times for indexed XML files with a byte ordering mark (BOM).
- Support all standard and non-standard data array names in pyteomics.mzml.
- Change default value of retrieve_refs in pyteomics.mzid.read() to True.
- Preserve unit information extracted from cvParam tags in PSI XML files.
- Fix in pyteomics.mzxml, other minor fixes.
3.4¶
- New module pyteomics.mzxml for parsing of MzXML files.
- New parameter dtype in pyteomics.mgf.read(), pyteomics.mzml.read() and pyteomics.mzxml.read() allows changing the dtype of arrays yielded by the parsers.
- pyteomics.featurexml moved into a subpackage pyteomics.openms.
- New module pyteomics.openms.trafoxml for OpenMS transformation files.
- Bugfix in XML indexing code to make it work on Python 3.x versions prior to 3.5.
- Bugfix in pyteomics.pylab_aux.scatter_trend() (support for lists and other non-ndarrays).
- Performance improvements in pyteomics.achrom calibration functions.
3.3.1¶
New submodule pyteomics.featurexml with a parser for OpenMS featureXML files.
3.3¶
- mzML and mzIdentML parsers can now create an index of element offsets. This allows quick random access to elements by unique ID.
- mzML parsers now come in two flavors: pyteomics.mzml.MzML and pyteomics.mzml.PreIndexedMzML. The latter uses the byte offsets listed at the end of the file.
- New parameters convert_arrays and read_charges in mgf.read() allow using it without numpy and possibly improve performance. The default behavior is retained.
- Performance optimizations in mgf.read() and parser.cleave().
- New decoy generation mode called “fused decoy”, described in the paper accepted to JASMS.
API changes¶
- pyteomics.parser.cleave() no longer accepts the labels argument. It is emphasized that the input sequences are expected to be in plain one-letter notation, but no checks are performed.
- DataFrame() functions in pepxml and tandem now extract more protein-related information. The list-like protein-related values can be reported as lists or packed into strings, depending on the optional parameter sep. Some column names have changed as a result.
- Call signatures of pyteomics.fasta.decoy_sequence() and the functions using it are slightly changed. Standard modes are now also exposed as individual functions.
3.2¶
New submodule pyteomics.mass.unimod contains rewritten machinery for handling of Unimod relational databases (contributed by Joshua Klein). This is a substitution and extension for the old mass.Unimod class. pyteomics.mass.unimod requires SQLAlchemy.
Other changes:
- New function pyteomics.auxiliary.linear_regression_perpendicular() provides a linear fit minimizing distances from data points to the fit line (as opposed to pyteomics.auxiliary.linear_regression(), which minimizes vertical distances).
- Both new and old linear regression functions now accept a single array of shape (N, 2).
- pyteomics.pylab_aux.scatter_trend() now has an optional parameter regression which can be a callable performing the regression. Also, the regression equation is now the label of the regression line, not the scatter plot.
- Another two new parameters for pyteomics.pylab_aux.scatter_trend() are sigma_kwargs and sigma_values.
- pyteomics.pylab_aux functions plot_line() and scatter_trend() now return the objects they create.
- Writer functions (pyteomics.mgf.write(), pyteomics.fasta.write(), pyteomics.fasta.write_decoy_db()) now accept a file_mode argument that overrides the mode in which the file is opened.
- In pyteomics.mgf.write() one can now override the format spec for fragment m/z, intensity and charge values using the optional fragment_format argument. Key order and key-value parameter formatters are now also handled via optional arguments.
- pyteomics.fasta.decoy_db() now supports ignore_comments and parser arguments.
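The distinction between the two kinds of linear fit can be made concrete with a pure-Python sketch. fit_vertical() and fit_perpendicular() below are hypothetical helpers written for this illustration using the textbook closed-form solutions (ordinary least squares and orthogonal regression); they are not the Pyteomics functions and may differ from them in return values and edge-case handling.

```python
import math


def _moments(points):
    """Return means and centered sums of squares/products for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return mean_x, mean_y, s_xx, s_yy, s_xy


def fit_vertical(points):
    """Ordinary least squares: minimizes vertical distances to the line."""
    mean_x, mean_y, s_xx, _, s_xy = _moments(points)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


def fit_perpendicular(points):
    """Orthogonal regression: minimizes perpendicular distances to the line."""
    mean_x, mean_y, s_xx, s_yy, s_xy = _moments(points)
    if s_xy == 0:
        # With zero covariance the closed form below divides by zero;
        # this sketch simply refuses to handle that case.
        raise ValueError("slope is undefined when the covariance is zero")
    slope = (s_yy - s_xx + math.hypot(s_yy - s_xx, 2 * s_xy)) / (2 * s_xy)
    return slope, mean_y - slope * mean_x


points = [(0, 0), (1, 1), (2, 1), (3, 3)]  # shape (N, 2), as in the text
vertical = fit_vertical(points)
perpendicular = fit_perpendicular(points)
exact = fit_perpendicular([(0, 1), (1, 3), (2, 5)])  # points on y = 2x + 1
```

For noisy data the two fits differ (here the perpendicular fit is steeper), while for points lying exactly on a line both recover the same slope and intercept.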
3.1.1¶
- Bugfix in pyteomics.auxiliary.
- New parameter show_legend in pyteomics.pylab_aux.scatter_trend().
- Performance improvements in pyteomics.parser.
3.1¶
This release offers integration with the great pandas library. Working with qvalues() and filter() functions is now much easier if you have your PSMs in a DataFrame. Many search engines use CSV as their output format, allowing direct creation of DataFrame objects. New functions pyteomics.tandem.DataFrame() and pyteomics.pepxml.DataFrame() facilitate creation of DataFrames from corresponding formats.
Also, qvalues(), filter() and fdr() functions can now use posterior error probabilities (PEPs) instead of using decoys for q-value calculation.
- In qvalues() and filter() functions, key and is_decoy can now be array-like objects or strings (as well as functions and iterators). If a string is given, it is used as a field name in the PSM array or DataFrame.
- fdr() functions also support strings and iterables as arguments.
- New parameter pep in qvalues(), filter() and fdr() functions. It can be callable, array-like, or iterator. Conflicts with decoy-related parameters. Compatible with key, but makes it optional.
- Fixed the behavior of filter.chain() functions. They now treat the full_output argument the same way as filter() functions.
- Fixed the issue that caused exceptions when calling fasta.decoy_db() and fasta.write_decoy_db() with explicitly given mode (signature for creation of pyteomics.auxiliary.FileReader objects slightly changed).
- Pyteomics now uses setuptools and is a namespace package.
- Minor fixes.
API changes¶
- Default value of remove_decoy in qvalues() is now False.
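How q-values can be derived from posterior error probabilities without any decoys is sketched below. It assumes the commonly used estimate in which the q-value at a given threshold is the mean PEP of all PSMs accepted at that threshold; qvalues_from_pep() is a hypothetical helper, and the actual Pyteomics functions accept many more options.

```python
def qvalues_from_pep(peps):
    """Return (pep, q) pairs sorted from the most to the least confident PSM."""
    ranked = sorted(peps)
    result = []
    running_sum = 0.0
    for n, pep in enumerate(ranked, start=1):
        running_sum += pep
        # Expected fraction of false matches among the n best PSMs.
        result.append((pep, running_sum / n))
    return result


pairs = qvalues_from_pep([0.5, 0.01, 0.1])
q_only = [q for _, q in pairs]
```

Because the PEPs are sorted in ascending order, the running mean never decreases, so the resulting q-values are sorted as well.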
3.0.1¶
- Added legend_kwargs as a keyword argument to pyteomics.pylab_aux.scatter_trend().
- Minor fixes.
3.0.0¶
- XML parsers are now implemented as objects, each format has its own class. Those classes can be instantiated using the same arguments as read() functions accepted, and support direct iteration and the with syntax. The read() functions are now simple aliases to the corresponding constructors.
- As a result, functions iterfind(), version_info() and get_by_id() are now deprecated in favor of methods iterfind() and get_by_id() and attribute version_info of corresponding instances.
- In pyteomics.mgf.write(), the order of keys and the format of values are now controlled via module-level variables.
- In pyteomics.electrochem, correction for pK of terminal groups depending on the terminal residue is implemented; example set of pK and corrected pK added.
- Imports of external dependencies are delayed where possible, so that unnecessary ImportErrors do not occur.
- local_fdr() renamed to qvalues() in pepxml, mzid, tandem and auxiliary. local_fdr() did not reflect the semantics of the function. The algorithm has been also corrected so that the array of q-values is always sorted (as it should be by definition). qvalues() now also accepts a parameter full_output which keeps the PSMs alongside their scores and associated q-values.
- All fdr(), qvalues(), and filter() functions now accept a new parameter correction. It is used for more accurate estimation of the number of false positives using TDA (paper with explanation).
- filter() functions now support both iterator protocol and context manager protocol. They now also accept the full_output parameter, which has the following meaning: if True (default), then an array of PSMs is directly returned by the function. Otherwise, an iterator is returned, as before. The array takes some memory, but this way is usually around 2x faster.
- New function pyteomics.pylab_aux.plot_qvalue_curve().
- pyteomics.mass.Composition objects now have a mass() method (equivalent to pyteomics.mass.calculate_mass()).
- Also, Composition and objects returned by pyteomics.parser.amino_acid_composition() now inherit from collections.defaultdict and collections.Counter.
- Decoy-related functions in pyteomics.fasta now accept a new parameter keep_nterm that preserves the N-terminal residue in the generated decoy sequences.
- Minor fixes.
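Why q-values are sorted by definition can be seen from a minimal target-decoy sketch. It assumes the simplest FDR estimate (number of decoys divided by number of targets above the threshold) and ignores the correction parameter and other options; qvalues_sketch() and the PSM dictionaries are hypothetical and only loosely modelled on the key and is_decoy arguments described above.

```python
def qvalues_sketch(psms, key, is_decoy, reverse=False):
    """Return (score, is_decoy, q) tuples sorted from best to worst score."""
    ranked = sorted(psms, key=key, reverse=reverse)
    fdrs = []
    targets = decoys = 0
    for psm in ranked:
        if is_decoy(psm):
            decoys += 1
        else:
            targets += 1
        # Naive FDR estimate for the threshold set at this PSM.
        fdrs.append(decoys / max(targets, 1))
    # The q-value is the minimal FDR at which a PSM is still accepted,
    # i.e. the minimum over this and all more permissive thresholds.
    q = fdrs[:]
    for i in range(len(q) - 2, -1, -1):
        q[i] = min(q[i], q[i + 1])
    return [(key(p), is_decoy(p), qv) for p, qv in zip(ranked, q)]


# Hypothetical PSMs: lower "expect" is better, decoys carry a prefix.
psms = [
    {"expect": 0.001, "protein": "P1"},
    {"expect": 0.002, "protein": "P2"},
    {"expect": 0.010, "protein": "DECOY_P3"},
    {"expect": 0.020, "protein": "P4"},
    {"expect": 0.050, "protein": "DECOY_P5"},
    {"expect": 0.060, "protein": "P6"},
]
result = qvalues_sketch(
    psms,
    key=lambda p: p["expect"],
    is_decoy=lambda p: p["protein"].startswith("DECOY_"),
)
q = [row[2] for row in result]
```

The raw FDR estimates fluctuate as targets and decoys alternate, but taking the running minimum from the worst score upwards makes the q-values non-decreasing along the ranked list.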
API changes¶
- In pyteomics.pylab_aux.scatter_trend(), keyword arguments for pylab.scatter() and pylab.plot() are now accepted as dicts scatter_kwargs and plot_kwargs. Keyword argument alpha is now not accepted and should be put in the appropriate dict.
- In pyteomics.pylab_aux.plot_function_3d() and pyteomics.pylab_aux.plot_function_contour(), arbitrary kwargs can now also be passed to the plotting function.
- filter() functions do not support context manager protocol by default. To keep using them as iterators / context managers, specify full_output=False (see above for details).
2.5.5¶
Fix for a memory leak in pyteomics.mzid.get_by_id(), which affects pyteomics.mzid.read() with retrieve_refs=True.
2.5.4¶
- New functions local_fdr() in pepxml, mzid, and tandem. The function returns a NumPy array with PSM scores and corresponding values of local FDR.
- New parameter iterative in read() functions of XML parsing modules. Parsing of mzIdentML files with retrieve_refs=True got significantly faster.
2.5.3¶
- Universally applicable modifications are now allowed in pyteomics.parser.isoforms().
- It is now also possible to specify non-terminal modifications which are only applicable to terminal residues.
- Fix in pyteomics.parser.parse(): if the labels argument is provided, it needs to contain standard terminal groups if they are present in the sequence or if show_unmodified_termini is set to True.
- pyteomics.mass.Composition instances are now pickleable.
- Performance improvements.
2.5.2¶
- New parameter reverse in all filter() functions.
- New function pyteomics.mass.fast_mass2(), which is analogous to pyteomics.mass.fast_mass(), but supports full modX notation and is several times slower.
- Fix in pyteomics.pepxml.read() for compatibility with files produced with Mascot2XML utility.
- Unknown labels now allowed in pyteomics.electrochem and pyteomics.achrom functions in accordance with new general policy.
2.5.1¶
- Bugfixes in pyteomics.parser.isoforms():
  - handling of the labels argument is now in accordance with new policy
  - solved memory problems when using max_mods
- pyteomics.parser.cleave() does not require a valid modX sequence by default.
2.5.0¶
- pyteomics.parser.amino_acid_composition() now accepts “split” parsed sequences.
- Cleavage rules in pyteomics.parser.expasy_rules updated.
- Helper function pyteomics.parser.num_sites() counts the number of cleavage sites in a sequence.
- Helper function pyteomics.parser.match_modX() does essentially the same as pyteomics.parser.is_modX(), but returns a re.match object or None instead of a bool.
- Bugfix in pyteomics.auxiliary.filter(), which didn’t work correctly with iterators.
- Added a new parameter max_mods in pyteomics.parser.isoforms().
API changes¶
- The boolean overlap parameter in pyteomics.parser.cleave() is replaced with an integer min_length. Since min_length uses pyteomics.parser.length(), the labels keyword argument is now accepted by cleave() and num_sites(), if needed. With carefully designed cleavage rules, all cleavage functions work with modX sequences.
- The labels argument in pyteomics.parser.parse() and related functions has changed its meaning. parse() won’t raise an exception for non-standard labels in sequences if the labels keyword argument is not given.
- The modX notation specification is now more strict to avoid ambiguity: only zero or two terminal groups can be present in a modX sequence. Sequences with one terminal group specified will be supported where possible, but be advised that sequences such as “H-OH” are intrinsically ambiguous.
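The roles of cleavage sites and of a minimal peptide length in in-silico digestion can be sketched with a regular expression. The rule below is a deliberately simplified trypsin-like pattern (cleave after K or R unless followed by P), not the corresponding entry of pyteomics.parser.expasy_rules, and cleavage_sites() and cleave_sketch() are hypothetical helpers that only handle plain one-letter sequences.

```python
import re

# Simplified trypsin-like rule (assumption of this sketch):
# a zero-width position right after K or R that is not followed by P.
RULE = r"(?<=[KR])(?!P)"


def cleavage_sites(sequence, rule=RULE):
    """Return internal positions at which the sequence can be cut."""
    return [
        m.end()
        for m in re.finditer(rule, sequence)
        if 0 < m.end() < len(sequence)
    ]


def cleave_sketch(sequence, missed_cleavages=0, min_length=1, rule=RULE):
    """Return the set of peptides allowing up to `missed_cleavages` skipped sites."""
    bounds = [0] + cleavage_sites(sequence, rule) + [len(sequence)]
    peptides = set()
    for i in range(len(bounds) - 1):
        last = min(i + missed_cleavages + 2, len(bounds))
        for j in range(i + 1, last):
            peptide = sequence[bounds[i]:bounds[j]]
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


sites = cleavage_sites("AKRPGKC")
complete = cleave_sketch("AKRPGKC")
one_missed = cleave_sketch("AKRPGKC", missed_cleavages=1)
long_only = cleave_sketch("AKRPGKC", min_length=2)
```

Counting the elements of sites gives the number of cleavage sites, and min_length simply drops peptides that are too short to be of interest.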
2.4.3¶
- Added the ratio keyword argument for FDR calculation.
- Minor changes in iterfind() functions of file parsers.
- Bugfix in pyteomics.mgf.write() (duplication of pepmass key).
- Removed non-functional parameter read_schema for pyteomics.tandem.read().
2.4.2¶
- Bugfix in pyteomics.mass.most_probable_isotopic_composition(). The bug manifested itself after version 2.4.0, when pyteomics.mass.nist_mass was expanded. Also, the format of the returned value is now in accordance with the documentation.
2.4.1¶
- New function pyteomics.auxiliary.filter() for filtering lists of PSMs not coming directly from files in supported formats.
- Also, a format-agnostic helper function pyteomics.auxiliary.fdr().
2.4.0¶
- New functions for filtering to a certain FDR level based on target-decoy strategy, as well as for FDR estimation, in pyteomics.tandem, pyteomics.pepxml and pyteomics.mzid. The functions are called filter() (beware of shadowing the built-in function) and fdr() (in each of the modules). Chained versions filter.chain() and filter.chain.from_iterable() are also available. See Data Access for more info.
- New function pyteomics.parser.coverage() for sequence coverage calculation.
- New function pyteomics.fasta.decoy_chain(), a chained version of pyteomics.fasta.decoy_db().
- New elements in pyteomics.mass.nist_mass. Pretty much all elements are there now.
- Fix in pyteomics.parser.parse() to cover some fancy corner cases.
- Bugfix in pyteomics.tandem: modification info is now fully extracted.
- pyteomics.mass.isotopic_composition_abundance() is now able to calculate abundances for larger molecules.
  Note
  Rounding errors may be significant in this case.
2.3.0¶
- New parameter “read_schema” in read() functions of XML parsing modules. When set to False, disables the attempts to fetch an auxiliary file and obtain structure information about the file being parsed.
- New function chain() in all modules that have a read() function, for convenient chaining of multiple files. chain() only works as a context manager. Use itertools.chain() in other cases. The chain.from_iterable form is also available as a context manager.
- New function pyteomics.auxiliary.print_tree() for exploration of complex nested dicts produced by XML parsers.
- New sets of retention coefficients in pyteomics.achrom.
- Bugfix in pyteomics.pepxml. The bug caused an exception when parsing some pepXML files.
- The output of pyteomics.mgf.read() now always contains a masked array of charges.
- Other minor fixes.
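The idea of chaining several files behind a single context manager can be sketched with the standard library. chain_sketch() is a hypothetical helper that treats every line as an entry and opens all files up front; it only illustrates the pattern and is not how the Pyteomics chain() functions are implemented.

```python
import contextlib
import itertools
import os
import tempfile


@contextlib.contextmanager
def chain_sketch(*paths):
    """Yield one iterator over the lines of all files; close them on exit."""
    with contextlib.ExitStack() as stack:
        files = [stack.enter_context(open(path)) for path in paths]
        yield itertools.chain.from_iterable(files)


# Demonstration on two small temporary files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in [("a.txt", "1\n2\n"), ("b.txt", "3\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write(text)
        paths.append(path)

    with chain_sketch(*paths) as entries:
        collected = [line.strip() for line in entries]
```

The context manager guarantees that every underlying file is closed when the with block ends, which plain itertools.chain() cannot do on its own.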
API change¶
- In pyteomics.mgf.read() the precursor charge is now always represented by a list of ints (a ChargeList object).
2.2.2¶
- Bugfix in pyteomics.tandem. The info about all proteins is now extracted.
2.2.1¶
- Update parsers for FASTA headers.
- NamedTuple for FASTA entries is now defined globally, which should solve pickling problems.
2.2.0¶
- New module pyteomics.tandem for reading output files of X!Tandem search engine.
2.1.6¶
- Fix in pyteomics.pepxml. pepXML files generated by TPP are now processed without errors.
2.1.5¶
- Fix in pyteomics.pepxml. ‘modified_peptide’ is now always available.
- Fix in pyteomics.mass (issue #2 in the bug tracker).
- Improved arithmetics for Composition objects.
2.1.4¶
- In fasta, decoy_db() now doesn’t write to file, but returns an iterator over FASTA records. The old decoy_db() is now called write_decoy_db(), which is equivalent to decoy_db() combined with write().
Bugfixes:
- In pyteomics.mgf.read(), the charges, if present, are returned as a masked array now. Previously, an exception occurred if charges were missing for some of the fragments.
- Values in mass.nist_mass corrected.
- Other minor corrections.
2.1.3¶
- Adjust the behavior affected by the bug fixed in 2.1.2. name attributes of <cvParam> elements in the absence of value attributes are now collected in a list under the ‘name’ key.
- Add support for overlapping matches in parser.cleave().
2.1.2¶
- Bugfix in XML parsers. The bug caused the mzML parser to break on some files. The fix can slightly change the format of the output.
2.1.1¶
- Rename keys in the dicts returned by mgf.read() to facilitate writing code working with both MGF and mzML.
- The items yielded by fasta.read() now have attributes description and sequence.
2.1.0¶
- New sets of retention coefficients in achrom.
- mass.Composition now only stores non-zero ints.
- fasta now has tools for parsing of FASTA headers.
- File parsers now implement the context manager protocol. We recommend using with statements to avoid resource leaks.
API changes¶
- ‘pepmass’ is now a tuple in the output of mgf.read() (to allow reading precursor intensities).
- New function fasta.parse() for convenient parsing of FASTA headers. fasta.std_parsers stores parsers for common UniProt header formats.
- New parameter parser in fasta.read() allows applying parsing while reading a FASTA file.
- The close parameter is removed in all functions that do file I/O. The unified behavior is: if the parameter is a file object, it won’t be closed by the function. If a file path is given, the file object will be created and closed inside the corresponding function.
2.0.3¶
- Added new class pyteomics.mass.Unimod. The interface is experimental and may change.
- Improved iterfind() function in XML-reading modules.
- pyteomics.mass.Composition objects now support multiplication by int.
- Bugfix in auxiliary.linear_regression().
2.0.2¶
- Added new function iterfind() in pyteomics.mzid, pyteomics.pepxml and pyteomics.mzml.
2.0.1¶
API changes¶
- pyteomics.parser.peptide_length() is renamed to pyteomics.parser.length().
2.0.0¶
- Added mzid module for parsing of mzIdentML files.
- Fixed bugs, improved tests.
API changes¶
- Top-module functions in fasta, mgf, mzml, pepxml, as well as mzid, are now called read().
- In parser, parse_sequence() renamed to parse(). It now accepts an optional parameter allow_unknown_modifications.
- mgf.write_mgf() and fasta.write_fasta() renamed to write().
- The output format of all read() functions has changed.
1.2.5¶
- Include Apache license version 2.0: http://www.opensource.org/licenses/Apache-2.0
- Minor bugfix in pyteomics.fasta.
1.2.4¶
- Changes in pyteomics.mass.
API changes¶
- Composition objects can be created using positional first argument, which will be treated as a sequence or (upon failure) as a formula. This means that all functions relying on Composition (calculate_mass(), most_probable_isotopic_composition(), isotopic_composition_abundance()) allow that as well. However, it’s of no use for the latter.
- Composition entries for modifications can be added to aa_comp and used in composition and mass calculations. This way the specified group will be added to any residue bearing this modification.
- That being said, the add_modifications() function is not needed anymore and has been removed.
- Addition and subtraction of Composition objects now produces a Composition object, allowing addition/subtraction of multiple objects.
- Composition is now a subclass of collections.defaultdict so one can safely retrieve values without checking if a key exists.
1.2.3¶
- pyteomics.parser.isoforms() now allows terminal modifications.
- Bugfixes in pyteomics.parser.parse_sequence().
- New function pyteomics.parser.tostring() converts parsed sequences to strings.
- Helper function pyteomics.parser.is_modX() added to check modX labels.
API changes¶
- pyteomics.parser.isoforms() now returns a generator object.
1.2.2¶
- Bugfix in pyteomics.pepxml: modification info is now extracted.
- New optional boolean argument ‘split’ in pyteomics.parser.parse_sequence() allows generating a list of tuples where modifications are separated from the residues instead of a regular list of labels. In labels not only modX labels are now allowed, but also separate mod prefixes. Such modifications are assumed to be applicable to any residue.
1.2.1¶
- Memory usage significantly decreased when parsing large mzML and pepXML files.
1.2.0¶
- Added support for Python 3. Python 2.7 is still supported, Python 2.6 is not.
1.1.1¶
- New function called add_modifications() added in pyteomics.mass. It updates aa_comp.
- Also, pyteomics.parser.isoforms() is a new function to get all possible modified sequences of a peptide.
1.1.0¶
- New module added - pyteomics.mgf. It is intended for reading and writing files in Mascot Generic Format.
1.0.2¶
- In pyteomics.pepxml module, now all search hits are read from file (not only the top hit).
API changes:¶
- pyteomics.pepxml.read(): information specific to search hits is now stored in a list under the 'search_hits' key. The list is sorted by hit rank.
1.0.1¶
- Fix compatibility issues in pyteomics.pepxml module.
1.0.0¶
- The first public release of Pyteomics.
API changes:¶
- pyteomics.achrom: rename 'length correction factor' to 'length correction parameter'.
  - pyteomics.achrom.get_RCs_vary_lcf() was renamed to pyteomics.achrom.get_RCs_vary_lcp().
  - length_correction_factor keyword argument of pyteomics.achrom.get_RCs() was renamed to lcp.