
Data Access

The following section is dedicated to data manipulation. Pyteomics aims to support the most common formats of (LC-)MS/MS data, peptide identification results and protein databases.

General Notes

  • Each module mentioned below corresponds to a file format. In each module, the top-level function read() allows iteration over entries in a file. It works like the built-in open(), allowing direct iteration and supporting the with syntax, which we recommend using. So you can do:

    >>> from pyteomics import mgf
    >>> reader = mgf.read('tests/test.mgf')
    >>> for spectrum in reader:
    >>>    ...
    >>> reader.close()
    

    … but it is recommended to do:

    >>> from pyteomics import mgf
    >>> with mgf.read('tests/test.mgf') as reader:
    >>>     for spectrum in reader:
    >>>        ...
    
  • Additionally, most modules provide one or several classes which implement different parsing modes, e.g. pyteomics.mgf.MGF and pyteomics.mgf.IndexedMGF. Indexed parsers build an index of file entries and thus allow random access in addition to iteration. See Indexed Parsers for a detailed description and examples.

  • Apart from read(), which reads just one file, all modules described here have functions for reading multiple files: chain() and chain.from_iterable(). chain('f1', 'f2') is equivalent to chain.from_iterable(['f1', 'f2']). chain() and chain.from_iterable() only support the with syntax. If you don’t want to use the with syntax, you can just use the itertools functions chain() and chain.from_iterable(). A usage sketch follows this list.

  • Throughout this section we use pyteomics.auxiliary.print_tree() to display the structure of the data returned by various parsers. Replace this call with the actual processing that you need to perform on your files.
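To illustrate the chain() functions mentioned above, here is a minimal sketch that simply reads the bundled test file twice in succession:

>>> from pyteomics import mgf
>>> with mgf.chain('tests/test.mgf', 'tests/test.mgf') as reader:
>>>     for spectrum in reader:
>>>         ...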

Text-based formats

MGF

Mascot Generic Format (MGF) is a simple human-readable format for MS/MS data. It allows storing MS/MS peak lists and experimental parameters. pyteomics.mgf is a module that implements reading and writing MGF files.

Reading

The pyteomics.mgf.read() function allows iterating through spectrum entries. Spectra are represented as dicts. By default, MS/MS peak lists are stored as numpy.ndarray objects under the 'm/z array' and 'intensity array' keys. Fragment charges are stored in a masked array under the 'charge array' key. Parameters are stored as a dict under the 'params' key.

Here is an example of use:

>>> from pyteomics import mgf, auxiliary
>>> with mgf.read('tests/test.mgf') as reader:
>>>     auxiliary.print_tree(next(reader))
m/z array
params
 -> username
 -> useremail
 -> mods
 -> pepmass
 -> title
 -> itol
 -> charge
 -> mass
 -> itolu
 -> it_mods
 -> com
intensity array
charge array

To speed up parsing, or if you want to avoid using numpy, you can tweak the behaviour of pyteomics.mgf.read() with parameters convert_arrays and read_charges.
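For example, here is a minimal sketch that disables numpy conversion and charge parsing (with convert_arrays=0, peak lists are returned as plain Python lists):

>>> from pyteomics import mgf
>>> with mgf.read('tests/test.mgf', convert_arrays=0, read_charges=False) as reader:
>>>     spectrum = next(reader)
>>>     # peak lists are plain lists here; fragment charges are not parsed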

Reading file headers

Also, pyteomics.mgf allows extracting headers with general parameters from MGF files with the pyteomics.mgf.read_header() function. It also returns a dict.

>>> header = mgf.read_header('tests/test.mgf')
>>> auxiliary.print_tree(header)
itolu
itol
username
com
useremail
it_mods
charge
mods
mass

Class-based interface

Since version 3.4.3, MGF parsing functionality is encapsulated in a class: pyteomics.mgf.MGF. This class can be used for:

  • sequential parsing of the file (the same as read()):

    >>> with mgf.MGF('tests/test.mgf') as reader:
    >>>     for spectrum in reader:
    >>>         ...
    
  • accessing the file header (the same as read_header()):

    >>> f = mgf.MGF('tests/test.mgf')
    >>> f.header
    {'charge': [2, 3],
     'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
     'it_mods': 'Oxidation (M)',
     'itol': '1',
     'itolu': 'Da',
     'mass': 'Monoisotopic',
     'mods': 'Carbamidomethyl (C)',
     'useremail': 'leu@altered-state.edu',
     'username': 'Lou Scene'}
    
  • direct access to spectra by title (the same as get_spectrum()):

    >>> f = mgf.MGF('tests/test.mgf')
    >>> f['Spectrum 2']
    {'charge array': masked_array(data = [3 2 1 1 1 1],
                  mask = False,
            fill_value = 0),
     'intensity array': array([  237.,   128.,   108.,  1007.,   974.,    79.]),
     'm/z array': array([  345.1,   370.2,   460.2,  1673.3,  1674. ,  1675.3]),
     'params': {'charge': [2, 3],
      'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
      'it_mods': 'Oxidation (M)',
      'itol': '1',
      'itolu': 'Da',
      'mass': 'Monoisotopic',
      'mods': 'Carbamidomethyl (C)',
      'pepmass': (1084.9, 1234.0),
      'rtinseconds': '25',
      'scans': '3',
      'title': 'Spectrum 2',
      'useremail': 'leu@altered-state.edu',
      'username': 'Lou Scene'}}
    

    Note

    MGF’s support for direct indexing is rudimentary, because it does not in fact keep an index and has to search through the file line-wise on every call. pyteomics.mgf.IndexedMGF is designed for random access and more (see Indexed Parsers for details).

Writing

Creation of MGF files is implemented in the pyteomics.mgf.write() function. The user can specify the header, an iterable of spectra in the same format as returned by read(), and the output path.

>>> spectra = mgf.read('tests/test.mgf')
>>> mgf.write(spectra=spectra, header=header)
USERNAME=Lou Scene
ITOL=1
USEREMAIL=leu@altered-state.edu
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
CHARGE=2+ and 3+
MASS=Monoisotopic
ITOLU=Da
COM=Taken from http://www.matrixscience.com/help/data_file_help.html

BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.6 73.0
846.8 44.0
847.6 67.0
1640.1 291.0
1640.6 54.0
1895.5 49.0
END IONS

BEGIN IONS
TITLE=Spectrum 2
RTINSECONDS=25
PEPMASS=1084.9
SCANS=3
345.1 237.0
370.2 128.0
460.2 108.0
1673.3 1007.0
1674.0 974.0
1675.3 79.0
END IONS

MS1 and MS2

MS1 and MS2 are simple human-readable formats for MS1 and MSn data. They allow storing peak lists and experimental parameters. Since MS1 and MS2 are quite similar to MGF, the corresponding modules (pyteomics.ms1 and pyteomics.ms2) provide the same functions and classes with very similar signatures for reading headers and spectra from files.

Writing is not supported at this time.
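A minimal reading sketch with pyteomics.ms2 (the test file path is an assumption):

>>> from pyteomics import ms2, auxiliary
>>> with ms2.read('tests/test.ms2') as reader:
>>>     auxiliary.print_tree(next(reader))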

FASTA

FASTA is a common format for protein sequence databases.

Reading

To extract data from FASTA databases, use the pyteomics.fasta.read() function.

>>> from pyteomics import fasta
>>> with fasta.read('/path/to/file/my.fasta') as db:
>>>     for entry in db:
>>>         ...

Just like other parsers in Pyteomics, pyteomics.fasta.read() returns a generator object instead of a list to prevent excessive memory use. The generator yields (description, sequence) tuples, so it’s natural to use it as follows:

>>> with fasta.read('/path/to/file/my.fasta') as db:
>>>     for descr, seq in db:
>>>         ...

You can also use attributes to access description and sequence:

>>> with fasta.read('my.fasta') as reader:
>>>     descriptions = [item.description for item in reader]

Description parsing

You can specify a function that will be applied to the FASTA headers for your convenience. pyteomics.fasta.std_parsers has some pre-defined parsers that can be used for this purpose.

>>> with fasta.read('HUMAN.fasta', parser=fasta.std_parsers['uniprot']) as r:
>>>    print(next(r).description)
{'PE': 2, 'gene_id': 'LCE6A', 'GN': 'LCE6A', 'id': 'A0A183', 'taxon': 'HUMAN',
 'SV': 1, 'OS': 'Homo sapiens', 'entry': 'LCE6A_HUMAN',
 'name': 'Late cornified envelope protein 6A', 'db': 'sp'}

or try guessing the header format:

>>> with fasta.read('HUMAN.fasta', parser=fasta.parse) as r:
>>>    print(next(r).description)
{'PE': 2, 'gene_id': 'LCE6A', 'GN': 'LCE6A', 'id': 'A0A183', 'taxon': 'HUMAN',
 'SV': 1, 'OS': 'Homo sapiens', 'entry': 'LCE6A_HUMAN',
 'name': 'Late cornified envelope protein 6A', 'db': 'sp'}

Class-based interface

The pyteomics.fasta.FASTA class is available for text-based (old style) parsing (the same as shown with read() above). Also, the new binary-mode, indexed parser, pyteomics.fasta.IndexedFASTA, implements all the perks of the Indexed Parsers. Both classes also have a number of flavor-specific subclasses that implement header parsing.

Additionally, flavored indexed parsers allow accessing the protein entries by the extracted ID field, while the regular pyteomics.fasta.IndexedFASTA uses the full description string for identification:

In [1]: from pyteomics import fasta

In [2]: db = fasta.IndexedUniProt('sprot_human.fasta') # A SwissProt database

In [3]: len(db['Q8IYH5'].sequence)
Out[3]: 903

In [4]: db['Q8IYH5'] == db['sp|Q8IYH5|ZZZ3_HUMAN ZZ-type zinc finger-containing protein 3 OS=Homo sapiens GN=ZZZ3 PE=1 SV=1']
Out[4]: True

Writing

You can also create a FASTA file using a sequence of (description, sequence) tuples.

>>> entries = [('Protein 1', 'PEPTIDE'*1000), ('Protein 2', 'PEPTIDE'*2000)]
>>> fasta.write(entries, 'target-file.fasta')

Decoy databases

Another common task is to generate a decoy database. Pyteomics allows that by means of the pyteomics.fasta.decoy_db() and pyteomics.fasta.write_decoy_db() functions.

>>> fasta.write_decoy_db('mydb.fasta', 'mydb-with-decoy.fasta')

The only required argument is the first one, indicating the source database. The second argument is the target file and defaults to system standard output.

If you need to modify a single sequence, use the pyteomics.fasta.decoy_sequence() function. It supports three modes: 'reverse', 'shuffle', and 'fused' (see pyteomics.fasta.reverse(), pyteomics.fasta.shuffle() and pyteomics.fasta.fused_decoy() for documentation).

>>> fasta.decoy_sequence('PEPTIDE', 'reverse')
'EDITPEP'
>>> fasta.decoy_sequence('PEPTIDE', 'shuffle')
'TPPIDEE'
>>> fasta.decoy_sequence('PEPTIDE', 'shuffle')
'PTIDEPE'

mzTab

mzTab is a HUPO-PSI standardized text-based format for describing identification and quantification of peptides and small molecules. You can read an mzTab file into a set of pandas.DataFrame objects with the pyteomics.mztab.MzTab class.

>>> from pyteomics import mztab
>>> tables = mztab.MzTab("path/to/file.mzTab")
>>> psms = tables.spectrum_match_table
>>> # do something with DataFrame

XML formats

XML parsers are implemented as classes and provide an object-oriented interface. The functional interface is preserved for backward compatibility and wraps the actual class-based machinery. That means that reader objects returned by read() functions have additional methods.

One of the most important methods is iterfind(). It allows reading additional information from XML files.
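For example, the test mzML file contains a single TIC chromatogram, which can be read with iterfind():

>>> from pyteomics import mzml
>>> with mzml.MzML('tests/test.mzML') as f:
...     for chrom in f.iterfind('chromatogram'):
...         print(chrom['id'])
TIC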

mzML and mzXML

mzML and mzXML are XML-based formats for experimental data obtained on MS/MS or LC-MS setups. Pyteomics offers you the functionality of the pyteomics.mzml and pyteomics.mzxml modules to gain access to the information contained in those files from Python. The interfaces of the two modules are very similar; this section will use mzML for demonstration.

The user can iterate through MS/MS spectra contained in a file via the pyteomics.mzml.read() function or pyteomics.mzml.MzML class. Here is an example of the output:

>>> from pyteomics import mzml, auxiliary
>>> with mzml.read('tests/test.mzML') as reader:
>>>     auxiliary.print_tree(next(reader))
count
index
highest observed m/z
ms level
total ion current
intensity array
lowest observed m/z
defaultArrayLength
profile spectrum
MSn spectrum
positive scan
base peak intensity
m/z array
base peak m/z
id
scanList
 -> count
 -> scan [list]
 ->  -> scan start time
 ->  -> preset scan configuration
 ->  -> filter string
 ->  -> instrumentConfigurationRef
 ->  -> scanWindowList
 ->  ->  -> count
 ->  ->  -> scanWindow [list]
 ->  ->  ->  -> scan window lower limit
 ->  ->  ->  -> scan window upper limit
 ->  -> [Thermo Trailer Extra]Monoisotopic M/Z:
 -> no combination

Additionally, pyteomics.mzml.MzML objects support direct indexing with spectrum IDs and all other features of Indexed Parsers.

pyteomics.mzml.PreIndexedMzML offers the same functionality, but it uses byte offset information found at the end of the file. Unlike the rest of the functions and classes, pyteomics.mzml.PreIndexedMzML does not have a counterpart in pyteomics.mzxml.

mzMLb

mzMLb is an HDF5-based format which wraps an mzML file and intelligently re-organizes it for fast random access, while reducing the on-disk file size using HDF5’s rich support for data compression. If the dependencies, h5py and hdf5plugin, are available, pyteomics.mzmlb can be used to access the data in these files just like pyteomics.mzml reads mzML files.
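A minimal sketch, assuming h5py and hdf5plugin are installed (the file path is hypothetical):

>>> from pyteomics import mzmlb
>>> with mzmlb.MzMLb('path/to/file.mzMLb') as reader:
...     spectrum = next(reader)  # same spectrum dicts as pyteomics.mzml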

pepXML

pepXML is a widely used XML-based format for peptide identifications. It contains information about the MS data, the parameters of the search engine used and the assigned sequences. To access these data, use the pyteomics.pepxml module.

The function pyteomics.pepxml.read() iterates through Peptide-Spectrum matches in a pepXML file and returns them as a custom dict. Alternatively, you can use the pyteomics.pepxml.PepXML interface.

>>> from pyteomics import pepxml, auxiliary
>>> with pepxml.read('tests/test.pep.xml') as reader:
>>>     auxiliary.print_tree(next(reader))
end_scan
search_hit [list]
 -> hit_rank
 -> calc_neutral_pep_mass
 -> modifications
 -> modified_peptide
 -> peptide
 -> num_matched_ions
 -> search_score
 ->  -> deltacn
 ->  -> spscore
 ->  -> sprank
 ->  -> deltacnstar
 ->  -> xcorr
 -> num_missed_cleavages
 -> analysis_result [list]
 ->  -> peptideprophet_result
 ->  ->  -> all_ntt_prob
 ->  ->  -> parameter
 ->  ->  ->  -> massd
 ->  ->  ->  -> fval
 ->  ->  ->  -> nmc
 ->  ->  ->  -> ntt
 ->  ->  -> probability
 ->  -> analysis
 -> tot_num_ions
 -> num_tot_proteins
 -> is_rejected
 -> proteins [list]
 ->  -> num_tol_term
 ->  -> protein
 ->  -> peptide_next_aa
 ->  -> protein_descr
 ->  -> peptide_prev_aa
 -> massdiff
index
assumed_charge
spectrum
precursor_neutral_mass
start_scan

Reading into a pandas.DataFrame

If you like working with tabular data using pandas, you can load pepXML files directly into pandas.DataFrames using the pyteomics.pepxml.DataFrame() function. It can read multiple files at once (using pyteomics.pepxml.chain()) and return a combined table with essential information about search results. This function requires pandas.
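A minimal sketch (the set of columns depends on the file contents):

>>> from pyteomics import pepxml
>>> df = pepxml.DataFrame('tests/test.pep.xml')
>>> df.shape  # one row per PSM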

X!Tandem

The X!Tandem search engine has its own output format that contains more information than pepXML. Pyteomics has a reader for it in the pyteomics.tandem module.

>>> from pyteomics import tandem, auxiliary
>>> with tandem.read('tests/test.t.xml') as reader:
...     auxiliary.print_tree(next(reader))
...
rt
support
 -> fragment ion mass spectrum
 ->  -> M+H
 ->  -> note
 ->  -> charge
 ->  -> Ydata
 ->  ->  -> units
 ->  ->  -> values
 ->  -> Xdata
 ->  ->  -> units
 ->  ->  -> values
 ->  -> label
 ->  -> id
 -> supporting data
 ->  -> convolution survival function
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
 ->  -> b ion histogram
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
 ->  -> y ion histogram
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
 ->  -> hyperscore expectation function
 ->  ->  -> a1
 ->  ->  -> a0
 ->  ->  -> Ydata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> Xdata
 ->  ->  ->  -> units
 ->  ->  ->  -> values
 ->  ->  -> label
mh
maxI
expect
sumI
act
fI
z
id
protein [list]
 -> peptide
 ->  -> pre
 ->  -> end
 ->  -> seq
 ->  -> b_ions
 ->  -> nextscore
 ->  -> mh
 ->  -> y_ions
 ->  -> start
 ->  -> hyperscore
 ->  -> expect
 ->  -> delta
 ->  -> id
 ->  -> post
 ->  -> missed_cleavages
 ->  -> b_score
 ->  -> y_score
 -> uid
 -> sumI
 -> label
 -> note
 -> expect
 -> file
 ->  -> URL
 ->  -> type
 -> id

pyteomics.tandem.read() returns a pyteomics.tandem.TandemXML instance, which can also be created directly.

Reading into a pandas.DataFrame

You can also load data from X!Tandem files directly into pandas.DataFrames using the pyteomics.tandem.DataFrame() function. It can read multiple files at once (using pyteomics.tandem.chain()) and return a combined table with essential information about search results. Of course, this function requires pandas.

mzIdentML

mzIdentML is one of the standards developed by the Proteomics Informatics working group of the HUPO Proteomics Standard Initiative.

The module interface is similar to that of the other reader modules. The pyteomics.mzid.read() function returns a pyteomics.mzid.MzIdentML instance, which you can just as easily use directly.

>>> from pyteomics import mzid, auxiliary
>>> with mzid.read('tests/test.mzid') as reader:
>>>     auxiliary.print_tree(next(reader))
SpectrumIdentificationItem [list]
 -> PeptideEvidenceRef [list]
 ->  -> peptideEvidence_ref
 -> ProteinScape:SequestMetaScore
 -> chargeState
 -> rank
 -> ProteinScape:IntensityCoverage
 -> calculatedMassToCharge
 -> peptide_ref
 -> passThreshold
 -> experimentalMassToCharge
 -> id
spectrumID
id
spectraData_ref

Element IDs and references

In mzIdentML, some elements contain references to other elements in the same file. The references are simply XML attributes whose names end with _ref and whose values are IDs, identical to the value of the id attribute of some other element.

The parser can retrieve information from these references on the fly, which can be enabled by passing retrieve_refs=True to the pyteomics.mzid.MzIdentML.iterfind() method, to pyteomics.mzid.MzIdentML constructor, or to pyteomics.mzid.read(). Retrieval of data by ID is implemented in the pyteomics.mzid.MzIdentML.get_by_id() method. Alternatively, the MzIdentML object itself can be indexed with element IDs:

>>> from pyteomics import mzid
>>> m = mzid.MzIdentML('tests/test.mzid')
>>> m['ipi.HUMAN_decoy']
{'DatabaseName': 'database IPI_human',
 'decoy DB accession regexp': '^SHD',
 'decoy DB generation algorithm': 'PeakQuant.DecoyDatabaseBuilder',
 'id': 'ipi.HUMAN_decoy',
 'location': 'file://www.medizinisches-proteom-center.de/DBServer/ipi.HUMAN/3.15/ipi.HUMAN_decoy.fasta',
 'name': ['decoy DB from IPI_human',
  'DB composition target+decoy',
  'decoy DB type shuffle'],
 'numDatabaseSequences': 58099,
 'releaseDate': '2006-02-22T09:30:47Z',
 'version': '3.15'}
>>> m.close()

Note

Since version 3.3, pyteomics.mzid.MzIdentML objects keep an index of byte offsets for some of the elements (see Indexed Parsers). Indexing helps achieve acceptable performance when using retrieve_refs=True, or when accessing individual elements by their ID.

This behavior can be disabled by passing use_index=False to the object constructor. An alternative, older mechanism is caching of element IDs. To build a cache for a file, you can pass build_id_cache=True and use_index=False to the MzIdentML constructor, or to pyteomics.mzid.read(), or call the pyteomics.mzid.MzIdentML.build_id_cache() method prior to reading the data.

Reading into a pandas.DataFrame

pyteomics.mzid also provides a pyteomics.mzid.DataFrame() function that reads one or several files into a single Pandas DataFrame. This function requires pandas.

idXML

idXML is an OpenMS format for peptide identifications. It is supported in pyteomics.openms.idxml. It partially supports indexing (protein information can be indexed and extracted with retrieve_refs).

Regular iterative parsing is done through read() or the IDXML class, and a pandas.DataFrame can be created as well.
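A minimal iteration sketch (the file path is hypothetical):

>>> from pyteomics.openms import idxml
>>> with idxml.read('tests/test.idXML', retrieve_refs=True) as reader:
...     for psm in reader:
...         ...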

TraML

TraML is also a PSI format. It stores a lot of information on SRM experiments. The parser, pyteomics.traml.TraML, iterates over <Transition> elements by default. Like MzIdentML, it has a retrieve_refs parameter that helps pull in the information from other parts of the file. TraML is one of the Indexed Parsers.
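A minimal usage sketch (the file path is hypothetical):

>>> from pyteomics import traml
>>> with traml.TraML('path/to/file.traML') as reader:
...     for transition in reader:
...         ...  # each item corresponds to a <Transition> element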

FeatureXML

pyteomics.openms.featurexml implements a simple parser for .featureXML files used in the OpenMS framework. The usage is identical to other XML parsing modules. Since featureXML has feature IDs, FeatureXML objects also support direct indexing as well as iteration, among the many features of Indexed Parsers:

>>> from pyteomics.openms import featurexml

>>> # function style, iteration
... with featurexml.read('tests/test.featureXML') as f:
...     qual = [feat['overallquality'] for feat in f]
...

>>> qual # qualities of the two features in the test file
[0.791454, 0.945634]

>>> # object-oriented style, direct indexing
>>> f = featurexml.FeatureXML('tests/test.featureXML')
>>> f['f_189396504510444007']['overallquality']
0.945634
>>> f.close()

As always, pyteomics.openms.featurexml.read() and pyteomics.openms.featurexml.FeatureXML are interchangeable.

TrafoXML

.trafoXML is another OpenMS format based on XML. It describes a transformation produced by an RT alignment algorithm. The file basically contains a series of (from; to) pairs corresponding to original and transformed retention times:

>>> from pyteomics.openms import trafoxml
>>> from_rt, to_rt = [], []
>>> with trafoxml.read('test/test.trafoXML') as f:
...    for pair in f:
...        from_rt.append(pair['from'])
...        to_rt.append(pair['to'])

>>> # plot the transformation
>>> import pylab
>>> pylab.plot(from_rt, to_rt)

As always, pyteomics.openms.trafoxml.read() and pyteomics.openms.trafoxml.TrafoXML are interchangeable. TrafoXML parsers do not support indexing because there are no IDs for specific data points in this format.

Controlled Vocabularies

Controlled Vocabularies are the universal annotation system used in the PSI formats, including mzML and mzIdentML. pyteomics.mzml.MzML, pyteomics.traml.TraML and pyteomics.mzid.MzIdentML retain the annotation information. It can be accessed using the helper function, pyteomics.auxiliary.cvquery():

>>> from pyteomics import auxiliary as aux, mzid, mzml
>>> f = mzid.MzIdentML('tests/test.mzid')
>>> s = next(f)
>>> s
{'SpectrumIdentificationItem': [{'ProteinScape:SequestMetaScore': 7.59488518903425, 'calculatedMassToCharge': 1507.695, 'PeptideEvidenceRef': [{'peptideEvidence_ref': 'PE1_SEQ_spec1_pep1'}], 'chargeState': 1, 'passThreshold': True, 'peptide_ref': 'prot1_pep1', 'rank': 1, 'id': 'SEQ_spec1_pep1', 'ProteinScape:IntensityCoverage': 0.3919545603809718, 'experimentalMassToCharge': 1507.696}], 'spectrumID': 'databasekey=1', 'id': 'SEQ_spec1', 'spectraData_ref': 'LCMALDI_spectra'}
>>> aux.cvquery(s)
{'MS:1001506': 7.59488518903425, 'MS:1001505': 0.3919545603809718}
>>> f.close()

Indexed Parsers

Most of the parsers implement indexing: MGF, mzML, mzXML, FASTA, PEFF, pepXML, mzIdentML, ms1, TraML, featureXML. Some formats do not have indexing parsers, because there is no unique ID field in the files to identify entries.

XML parser classes are named according to the format, e.g. pyteomics.mzml.MzML. Text format parsers that implement indexing have “Indexed” in their names, e.g. pyteomics.fasta.IndexedFASTA, as opposed to pyteomics.fasta.FASTA, which does not implement indexing. This distinction is due to the fact that indexed parsers need to open the files in binary mode. This may affect performance for text-based formats and is not always backwards-compatible (you cannot instantiate an indexed parser class using a previously opened file if it is in text mode). XML files, on the other hand, are always meant to be opened in binary mode. So, there is no duplication of classes for XML formats, but indexing can still be disabled by passing use_index=False to the class constructor or the read() function.

Basic usage

Indexed parsers can be instantiated using the class name or the read() function:

In [1]: from pyteomics import mgf

In [2]: f = mgf.IndexedMGF('tests/test.mgf')

In [3]: f
Out[3]: <pyteomics.mgf.IndexedMGF at 0x7fc983cbaeb8>

In [4]: f.close()

In [5]: f = mgf.read('tests/test.mgf', use_index=True)

In [6]: f
Out[6]: <pyteomics.mgf.IndexedMGF at 0x7fc980c63898>

They support direct iteration and the with syntax, the same way as the older, iterative parsers.

Parser objects can be used as dictionaries mapping entry IDs to entries, or as lists:

In [7]: f['Spectrum 2']
Out[7]:
{'params': {'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
  'itol': '1',
  'itolu': 'Da',
  'mods': 'Carbamidomethyl (C)',
  'it_mods': 'Oxidation (M)',
  'mass': 'Monoisotopic',
  'username': 'Lou Scene',
  'useremail': 'leu@altered-state.edu',
  'charge': [2, 3],
  'title': 'Spectrum 2',
  'pepmass': (1084.9, 1234.0),
  'scans': '3',
  'rtinseconds': 25.0 second},
 'm/z array': array([ 345.1,  370.2,  460.2, 1673.3, 1674. , 1675.3]),
 'intensity array': array([ 237.,  128.,  108., 1007.,  974.,   79.]),
 'charge array': masked_array(data=[3, 2, 1, 1, 1, 1],
              mask=False,
        fill_value=0)}

In [8]: f[1]['params']['title'] # positional indexing
Out[8]: 'Spectrum 2'

Like dictionaries, indexed parsers support membership testing and len():

In [9]: 'Spectrum 1' in f
Out[9]: True

In [10]: len(f)
Out[10]: 2

Rich Indexing

Indexed parsers also support positional indexing, slices of IDs and integers. ID-based slices include both endpoints; integer-based slices exclude the right edge of the interval. With integer indexing, step is also supported. Here is a self-explanatory demo of indexing functionality using a test file of two spectra:

In [11]: len(f['Spectrum 1':'Spectrum 2'])
Out[11]: 2

In [12]: len(f['Spectrum 2':'Spectrum 1'])
Out[12]: 2

In [13]: len(f[:])
Out[13]: 2

In [14]: len(f[:1])
Out[14]: 1

In [15]: len(f[1:0])
Out[15]: 0

In [16]: len(f[1:0:-1])
Out[16]: 1

In [17]: len(f[::2])
Out[17]: 1

RT-based indexing

In MGF, mzML and mzXML the spectra are usually time-ordered. The corresponding indexed parsers allow accessing the spectra by retention time, including slices:

In [18]: from pyteomics import mzxml; f = mzxml.MzXML('tests/test.mzXML')

In [19]: spec = f.time[5.5] # get the spectrum closest to this retention time

In [20]: len(f.time[5.5:6.0]) # get spectra from a range
Out[20]: 2

RT lookup is performed using binary search. When retrieving ranges, the closest spectra to the start and end of the range are used as endpoints, so it is possible that they are slightly outside the range.

Multiprocessing

Indexed parsers provide a unified interface for multiprocessing: map(). The method applies a user-defined function to entries from the file, calling it in different processes. If the function is not provided, the parsing itself is parallelized. Depending on the format, this may speed up or slow down the parsing overall. map() is a generator and yields items as they become available, not preserving the original order:

In [1]: from pyteomics import mzml

In [2]: f = mzml.MzML('tests/test.mzML')

In [3]: for spec in f.map():
   ...:     print(spec['id'])
   ...:
controllerType=0 controllerNumber=1 scan=2
controllerType=0 controllerNumber=1 scan=1

In [4]: for item in f.map(lambda spec: spec['id']):
   ...:     print(item)
   ...:
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2

Note

To use map() with lambda functions (and in some other corner cases, like parsers instantiated with pre-opened file objects), the dill package is required. This is because the target callable and the parser itself need to be pickled for multiprocessing to work.

Apart from parser objects, map() is available on objects returned by chain() functions and iterfind():

In [5]: for c in f.iterfind('chromatogram').map():
   ...:     print(c['id'])
   ...:
TIC

In [6]: for spec in mzml.chain('tests/test.mzML', 'tests/test.mzML').map():
   ...:     print(spec['id'])
   ...:
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2

FDR estimation and filtering

The modules for reading proteomics search engine or post-processing output (tandem, pepxml, mzid, idxml and protxml) expose similar functions is_decoy(), fdr() and filter(). These functions implement the widely used Target-Decoy Approach (TDA) to estimation of False Discovery Rate (FDR).

The is_decoy() function determines whether a particular spectrum identification comes from the decoy database. In tandem and pepxml this is done by checking if the protein description/name starts with a certain prefix. In mzid, a boolean value that stores this information in the PSM dict is used.

Warning

Because of the variety of the software producing files in pepXML and mzIdentML formats, the is_decoy() function provided in the corresponding modules may not work for your specific files. In this case you will have to refer to the source of pyteomics.pepxml.is_decoy() and pyteomics.mzid.is_decoy() and create your own function in a similar manner.

The fdr() function estimates the FDR in a set of PSMs by counting the decoy matches. Since it is using the is_decoy() function, the warning above applies. You can supply a custom function so that fdr() works for your data. fdr() can also be imported from auxiliary, where it has no default for is_decoy().
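A hedged sketch of FDR estimation on a list of PSMs; the 'DECOY_' protein prefix in the custom criterion is an assumption about how the decoy database was generated:

>>> from pyteomics import pepxml
>>> psms = list(pepxml.read('tests/test.pep.xml'))
>>> q = pepxml.fdr(psms)  # uses the default is_decoy()
>>> q = pepxml.fdr(psms, is_decoy=lambda psm: all(
...     p['protein'].startswith('DECOY_')  # assumed decoy prefix
...     for p in psm['search_hit'][0]['proteins']))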

The filter() function works like chain(), but instead of yielding all PSMs, it filters them to a certain level of FDR. PSM filtering requires counting decoy matches, too (see above), but it also implies sorting the PSMs by some kind of a score. This score cannot be universal due to the above-mentioned reasons, and it can be specified as a user-defined function. For instance, the default sorting key in pyteomics.mzid.filter() is only expected to work with mzIdentML files created with Mascot. So once again,

Warning

The default parameters of filter() may not work for your files.

There are also filter.chain() and filter.chain.from_iterable(). These are different from filter() in that they apply FDR filtering to all files separately and then provide a reader over top PSMs of all files, whereas filter() pools all PSMs together and applies a single threshold.

If you want to filter a list representing PSMs in arbitrary format, you can use pyteomics.auxiliary.filter(). Instead of files it takes lists (or other iterables) of PSMs. The rest is the same as for other filter() functions.
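A minimal sketch of filtering a file to 1% FDR (bearing in mind the warning above: the default key and is_decoy may not suit your data):

>>> from pyteomics import pepxml
>>> with pepxml.filter('tests/test.pep.xml', fdr=0.01) as filtered:
...     for psm in filtered:
...         ...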

NumPy and Pandas support, etc.

pyteomics.auxiliary.filter() supports structured numpy arrays and pandas.DataFrames of PSMs. This makes it easy to filter search results stored as CSV files (see Example 3: Search engines and PSM filtering for more info).

Generally, PSMs can be provided as iterators, lists, arrays, and DataFrames, and key and is_decoy parameters to filter() can be functions, strings, lists, arrays, or iterators. If a string is given, it is used as a key in a structured array, DataFrame or an iterable of dicts.
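For instance, here is a toy sketch with a pandas.DataFrame, where key and is_decoy are given as column names (reverse=True means higher scores are better; the data are made up for illustration):

>>> import pandas as pd
>>> from pyteomics import auxiliary as aux
>>> psms = pd.DataFrame({'score': [0.9, 0.8, 0.7, 0.6],
...                      'decoy': [False, False, True, False]})
>>> filtered = aux.filter(psms, key='score', is_decoy='decoy', fdr=0.5, reverse=True)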

FDR correction

As described in this JPR article, filtering based on decoy counting is inherently biased, especially for small datasets. All TDA-related functions have an optional argument, correction, that enables the correcting procedure proposed in the article.

Controlled Vocabulary Terms In Structured Data

When parsing files with controlled vocabulary terms, especially in positions where those terms may be used as keys, pyteomics will use an instance of cvstr to represent that value. A cvstr is a subclass of str with two additional attributes: one storing the accession number of the term it names, and another holding the accession number of the unit of its value, if any.

class pyteomics.auxiliary.structures.cvstr(value, accession=None, unit_accession=None)[source]

Bases: str

A helper class to associate a controlled vocabulary accession number with an otherwise plain str object

accession

The accession number for this parameter, e.g. MS:1000040

Type:

str

unit_accession

The accession number for the unit of the value, if any

Type:

str

Accessing the attributes of a dict key to find out if it matches a query is inconvenient. To handle looking up a value by accession, the CVQueryEngine type can help solve the problem by either looking up a single accession value, or by converting a nested dict structure with cvstr as keys into a dict with accession numbers as keys, mapping to the values their owners pointed to in the original dict (or to their naming str if the value is empty):

For example, if we had parsed an mzML file, and read out a spectrum:

>>> from pyteomics import mzml
>>> scan = next(mzml.read("tests/test.mzML"))
>>> scan
{'index': 0,
'id': 'controllerType=0 controllerNumber=1 scan=1',
'defaultArrayLength': 19914,
'scanList': {'count': 1,
'scan': [{'instrumentConfigurationRef': 'IC1',
    'scanWindowList': {'count': 1,
    'scanWindow': [{'scan window lower limit': 200.0 m/z,
    'scan window upper limit': 2000.0 m/z}]},
    'scan start time': 0.004935 minute,
    'filter string': 'FTMS + p ESI Full ms [200.00-2000.00]',
    'preset scan configuration': 1.0,
    '[Thermo Trailer Extra]Monoisotopic M/Z:': 810.4152221679688}],
'no combination': ''},
'ms level': 1,
'MSn spectrum': '',
'positive scan': '',
'profile spectrum': '',
'base peak m/z': 810.415283203125 m/z,
'base peak intensity': 1471973.875 number of counts,
'total ion current': 15245068.0,
'lowest observed m/z': 200.00018816645022 m/z,
'highest observed m/z': 2000.0099466203771 m/z,
'count': 2,
'm/z array': array([ 200.00018817,  200.00043034,  200.00067252, ..., 1999.96151259,
        1999.98572931, 2000.00994662]),
'intensity array': array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)}

Then cvquery(scan) would yield the following lookup table:

>>> from pyteomics.auxiliary import cvquery
>>> cvquery(scan)
{'MS:1000501': 200.0 m/z,
 'MS:1000500': 2000.0 m/z,
 'MS:1000016': 0.004935 minute,
 'MS:1000512': 'FTMS + p ESI Full ms [200.00-2000.00]',
 'MS:1000616': 1.0,
 'MS:1000795': 'no combination',
 'MS:1000511': 1,
 'MS:1000580': 'MSn spectrum',
 'MS:1000130': 'positive scan',
 'MS:1000128': 'profile spectrum',
 'MS:1000504': 810.415283203125 m/z,
 'MS:1000505': 1471973.875 number of counts,
 'MS:1000285': 15245068.0,
 'MS:1000528': 200.00018816645022 m/z,
 'MS:1000527': 2000.0099466203771 m/z,
 'MS:1000514': array([ 200.00018817,  200.00043034,  200.00067252, ..., 1999.96151259,
         1999.98572931, 2000.00994662]),
 'MS:1000515': array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)}

Alternatively, if we have a particular accession in mind, e.g. MS:1000016 for "scan start time", we could query for that specifically:

>>> cvquery(scan, "MS:1000016")
0.004935

Because CVQueryEngine does not have any state of its own, we use a pre-made instance, pyteomics.auxiliary.structures.cvquery.

pyteomics.auxiliary.structures.cvquery(data, accession=None)

Traverse an arbitrarily nested dictionary looking for keys which are cvstr instances, or objects with an attribute called accession.

class pyteomics.auxiliary.structures.CVQueryEngine[source]

Bases: object

Traverse an arbitrarily nested dictionary looking for keys which are cvstr instances, or objects with an attribute called accession.

__init__()
index(data)[source]

Construct a flat dict whose keys are the accession numbers for all qualified keys in data and whose values are the mapped values from data.

query(data, accession)[source]

Search data for a key with the accession number accession. Returns None if not found.

Unit Handling

When parsing a data file with unit information associated with a scalar value, pyteomics uses annotated data types to carry units around, with the specific type determined by parsing the value.

For instance, given the XML string:

<cvParam cvRef="MS" accession="MS:1000927" name="ion injection time" value="68.227485656738"
         unitCvRef="UO" unitAccession="UO:0000028" unitName="millisecond"/>

This will be parsed into a unitfloat:

>>> value = unitfloat(68.227485656738, "millisecond")
# Get the unit name, perhaps to do a conversion to another unit
>>> value.unit_info
"millisecond"
# Can be coerced into a plain float without issue
>>> float(value)
68.227485656738
# Can be used identically to a normal float
>>> value > 50.0 and value < 90.0
True

To normalize the time unit, we can write a function like this:

import warnings

from pyteomics.auxiliary import unitfloat

def in_minutes(x):
    '''Convert a time quantity to minutes

    Parameters
    ----------
    x: unitfloat
        A float representing a quantity of time annotated with a time unit

    Returns
    -------
    unitfloat:
        The time after conversion to minutes
    '''
    try:
        unit = x.unit_info
    except AttributeError:
        return x
    if unit == 'minute':
        return x
    elif unit == 'second':
        y = unitfloat(x / 60., 'minute')
        return y
    elif unit == 'hour':
        y = unitfloat(x * 60, 'minute')
        return y
    else:
        warnings.warn("Time unit %r not recognized" % unit)
    return x

>>> seconds = unitfloat(93.5, "second")
>>> minutes = in_minutes(seconds)
>>> minutes
1.55833

class pyteomics.auxiliary.structures.unitfloat(value, unit_info=None)[source]

Bases: float

Represents a float value with a unit name.

Behaves identically to a built-in float type.

unit_info

The name of the unit this value possesses.

Type:

str

class pyteomics.auxiliary.structures.unitint(value, unit_info=None)[source]

Bases: int

Represents an integer value with a unit name.

Behaves identically to a built-in int type.

unit_info

The name of the unit this value possesses.

Type:

str

class pyteomics.auxiliary.structures.unitstr(value, unit_info=None)[source]

Bases: str

Represents a string value with a unit name.

Behaves identically to a built-in str type.

unit_info

The name of the unit this value possesses.

Type:

str
