Data Access¶
The following section is dedicated to data manipulation. Pyteomics aims to support the most common formats of (LC-)MS/MS data, peptide identification results and protein databases.
General Notes¶
Each module mentioned below corresponds to a file format. In each module, the top-level function read() allows iteration over entries in a file. It works like the built-in open(), allowing direct iteration and supporting the with syntax, which we recommend using. So you can do:

>>> from pyteomics import mgf
>>> reader = mgf.read('tests/test.mgf')
>>> for spectrum in reader:
>>>    ...
>>> reader.close()

… but it is recommended to do:

>>> from pyteomics import mgf
>>> with mgf.read('tests/test.mgf') as reader:
>>>     for spectrum in reader:
>>>         ...
Additionally, most modules provide one or several classes which implement different parsing modes, e.g. pyteomics.mgf.MGF and pyteomics.mgf.IndexedMGF. Indexed parsers build an index of file entries and thus allow random access in addition to iteration. See Indexed Parsers for a detailed description and examples.

Apart from read(), which reads just one file, all modules described here have functions for reading multiple files: chain() and chain.from_iterable(). chain('f1', 'f2') is equivalent to chain.from_iterable(['f1', 'f2']). chain() and chain.from_iterable() only support the with syntax. If you don’t want to use the with syntax, you can just use the itertools functions chain() and chain.from_iterable().

Throughout this section we use pyteomics.auxiliary.print_tree() to display the structure of the data returned by various parsers. Replace this call with the actual processing that you need to perform on your files.
Text-based formats¶
MGF¶
Mascot Generic Format (MGF) is a simple human-readable format for MS/MS data. It allows storing MS/MS peak lists and experimental parameters. pyteomics.mgf is a module that implements reading and writing MGF files.
Reading¶
The pyteomics.mgf.read() function allows iterating through spectrum entries. Spectra are represented as dicts. By default, MS/MS peak lists are stored as numpy.ndarray objects under the 'm/z array' and 'intensity array' keys. Fragment charges will be stored in a masked array under the 'charge array' key. Parameters are stored as a dict under the 'params' key.

Here is an example of use:
>>> from pyteomics import mgf, auxiliary
>>> with mgf.read('tests/test.mgf') as reader:
>>> auxiliary.print_tree(next(reader))
m/z array
params
-> username
-> useremail
-> mods
-> pepmass
-> title
-> itol
-> charge
-> mass
-> itolu
-> it_mods
-> com
intensity array
charge array
To speed up parsing, or if you want to avoid using numpy
, you can tweak the
behaviour of pyteomics.mgf.read()
with parameters convert_arrays and read_charges.
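To make the structure of an MGF entry concrete, here is a simplified, stand-alone sketch of how one spectrum block can be turned into a dict shaped like the ones read() yields. This is a hypothetical helper for illustration only, not part of pyteomics; real MGF files have more parameter types and edge cases that the library handles for you.

```python
def parse_mgf_spectrum(lines):
    """Parse one BEGIN IONS ... END IONS block into a dict
    resembling (in simplified form) what mgf.read() yields."""
    params, mz, intensity = {}, [], []
    for line in lines:
        line = line.strip()
        if line in ('BEGIN IONS', 'END IONS') or not line:
            continue
        if '=' in line:                      # KEY=value parameter line
            key, _, value = line.partition('=')
            params[key.lower()] = value
        else:                                # peak line: m/z intensity
            fields = line.split()
            mz.append(float(fields[0]))
            intensity.append(float(fields[1]))
    return {'params': params, 'm/z array': mz, 'intensity array': intensity}

entry = """BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.6 73.0
846.8 44.0
END IONS""".splitlines()

spectrum = parse_mgf_spectrum(entry)
```

The real parser additionally converts peak lists to numpy arrays, parses charges into a masked array, and interprets typed parameters such as pepmass.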
Reading file headers¶
Also, pyteomics.mgf allows extracting headers with general parameters from MGF files with the pyteomics.mgf.read_header() function. It also returns a dict.
>>> header = mgf.read_header('tests/test.mgf')
>>> auxiliary.print_tree(header)
itolu
itol
username
com
useremail
it_mods
charge
mods
mass
Class-based interface¶
Since version 3.4.3, MGF parsing functionality is encapsulated in a class:
pyteomics.mgf.MGF
. This class can be used for:
sequential parsing of the file (the same as read()):

>>> with mgf.MGF('tests/test.mgf') as reader:
..:     for spectrum in reader:
..:         ...

accessing the file header (the same as read_header()):

>>> f = mgf.MGF('tests/test.mgf')
>>> f.header
{'charge': [2, 3],
 'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
 'it_mods': 'Oxidation (M)',
 'itol': '1',
 'itolu': 'Da',
 'mass': 'Monoisotopic',
 'mods': 'Carbamidomethyl (C)',
 'useremail': 'leu@altered-state.edu',
 'username': 'Lou Scene'}

direct access to spectra by title (the same as get_spectrum()):

>>> f = mgf.MGF('tests/test.mgf')
>>> f['Spectrum 2']
{'charge array': masked_array(data = [3 2 1 1 1 1], mask = False, fill_value = 0),
 'intensity array': array([  237.,   128.,   108.,  1007.,   974.,    79.]),
 'm/z array': array([  345.1,   370.2,   460.2,  1673.3,  1674. ,  1675.3]),
 'params': {'charge': [2, 3],
  'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
  'it_mods': 'Oxidation (M)',
  'itol': '1',
  'itolu': 'Da',
  'mass': 'Monoisotopic',
  'mods': 'Carbamidomethyl (C)',
  'pepmass': (1084.9, 1234.0),
  'rtinseconds': '25',
  'scans': '3',
  'title': 'Spectrum 2',
  'useremail': 'leu@altered-state.edu',
  'username': 'Lou Scene'}}

Note

MGF’s support for direct indexing is rudimentary, because it does not in fact keep an index and has to search through the file line-wise on every call. pyteomics.mgf.IndexedMGF is designed for random access and more (see Indexed Parsers for details).
Writing¶
Creation of MGF files is implemented in the pyteomics.mgf.write() function.
The user can specify the header, an iterable of spectra in the same format as
returned by read()
, and the output path.
>>> spectra = mgf.read('tests/test.mgf')
>>> mgf.write(spectra=spectra, header=header)
USERNAME=Lou Scene
ITOL=1
USEREMAIL=leu@altered-state.edu
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
CHARGE=2+ and 3+
MASS=Monoisotopic
ITOLU=Da
COM=Taken from http://www.matrixscience.com/help/data_file_help.html
BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.6 73.0
846.8 44.0
847.6 67.0
1640.1 291.0
1640.6 54.0
1895.5 49.0
END IONS
BEGIN IONS
TITLE=Spectrum 2
RTINSECONDS=25
PEPMASS=1084.9
SCANS=3
345.1 237.0
370.2 128.0
460.2 108.0
1673.3 1007.0
1674.0 974.0
1675.3 79.0
END IONS
MS1 and MS2¶
MS1 and MS2 are simple human-readable formats for MS1 and MSn data. They allow storing peak lists and experimental parameters. Since the MS1 and MS2 formats are quite similar to MGF, the corresponding modules (pyteomics.ms1 and pyteomics.ms2) provide the same functions and classes with very similar signatures for reading headers and spectra from files.

Writing is not supported at this time.
FASTA¶
FASTA is a common format for protein sequence databases.
Reading¶
To extract data from FASTA databases, use the pyteomics.fasta.read()
function.
>>> from pyteomics import fasta
>>> with fasta.read('/path/to/file/my.fasta') as db:
>>> for entry in db:
>>> ...
Just like other parsers in Pyteomics, pyteomics.fasta.read() returns a generator object instead of a list to prevent excessive memory use. The generator yields (description, sequence) tuples, so it’s natural to use it as follows:
>>> with fasta.read('/path/to/file/my.fasta') as db:
>>> for descr, seq in db:
>>> ...
You can also use attributes to access description and sequence:
>>> with fasta.read('my.fasta') as reader:
>>> descriptions = [item.description for item in reader]
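To make the format itself concrete, here is a minimal, self-contained sketch of FASTA parsing as a generator of (description, sequence) tuples. This is a toy stand-in for illustration only; pyteomics.fasta.read() handles much more, such as header parsing and multiple database flavors.

```python
def read_fasta(lines):
    """Yield (description, sequence) tuples from FASTA-formatted lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):             # a header starts a new entry
            if description is not None:
                yield description, ''.join(chunks)
            description, chunks = line[1:], []
        elif line:
            chunks.append(line)              # sequence may span many lines
    if description is not None:              # flush the final entry
        yield description, ''.join(chunks)

text = """>Protein 1
PEPTIDE
PEPTIDE
>Protein 2
PEPTIDEK""".splitlines()

entries = list(read_fasta(text))
```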
Description parsing¶
You can specify a function that will be applied to the FASTA headers for
your convenience. pyteomics.fasta.std_parsers
has some pre-defined
parsers that can be used for this purpose.
>>> with fasta.read('HUMAN.fasta', parser=fasta.std_parsers['uniprot']) as r:
>>> print(next(r).description)
{'PE': 2, 'gene_id': 'LCE6A', 'GN': 'LCE6A', 'id': 'A0A183', 'taxon': 'HUMAN',
'SV': 1, 'OS': 'Homo sapiens', 'entry': 'LCE6A_HUMAN',
'name': 'Late cornified envelope protein 6A', 'db': 'sp'}
or try guessing the header format:
>>> with fasta.read('HUMAN.fasta', parser=fasta.parse) as r:
>>> print(next(r).description)
{'PE': 2, 'gene_id': 'LCE6A', 'GN': 'LCE6A', 'id': 'A0A183', 'taxon': 'HUMAN',
'SV': 1, 'OS': 'Homo sapiens', 'entry': 'LCE6A_HUMAN',
'name': 'Late cornified envelope protein 6A', 'db': 'sp'}
Class-based interface¶
The pyteomics.fasta.FASTA
class is available for text-based (old style) parsing
(the same as shown with read()
above). Also, the new binary-mode, indexed parser,
pyteomics.fasta.IndexedFASTA
implements all the perks of the Indexed Parsers.
Both classes also have a number of flavor-specific subclasses that implement header parsing.
Additionally, flavored indexed parsers allow accessing the protein entries by the extracted ID field, while the regular pyteomics.fasta.IndexedFASTA uses the full description string for identification:
In [1]: from pyteomics import fasta
In [2]: db = fasta.IndexedUniProt('sprot_human.fasta') # A SwissProt database
In [3]: len(db['Q8IYH5'].sequence)
Out[3]: 903
In [4]: db['Q8IYH5'] == db['sp|Q8IYH5|ZZZ3_HUMAN ZZ-type zinc finger-containing protein 3 OS=Homo sapiens GN=ZZZ3 PE=1 SV=1']
Out[4]: True
Writing¶
You can also create a FASTA file using a sequence of (description, sequence) tuples.
>>> entries = [('Protein 1', 'PEPTIDE'*1000), ('Protein 2', 'PEPTIDE'*2000)]
>>> fasta.write(entries, 'target-file.fasta')
Decoy databases¶
Another common task is to generate a decoy database. Pyteomics allows
that by means of the pyteomics.fasta.decoy_db()
and
pyteomics.fasta.write_decoy_db()
functions.
>>> fasta.write_decoy_db('mydb.fasta', 'mydb-with-decoy.fasta')
The only required argument is the first one, indicating the source database. The second argument is the target file and defaults to system standard output.
If you need to modify a single sequence, use the
pyteomics.fasta.decoy_sequence()
function. It supports three modes:
'reverse'
, 'shuffle'
, and 'fused'
(see pyteomics.fasta.reverse()
,
pyteomics.fasta.shuffle()
and pyteomics.fasta.fused_decoy()
for documentation).
>>> fasta.decoy_sequence('PEPTIDE', 'reverse')
'EDITPEP'
>>> fasta.decoy_sequence('PEPTIDE', 'shuffle')
'TPPIDEE'
>>> fasta.decoy_sequence('PEPTIDE', 'shuffle')
'PTIDEPE'
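The 'reverse' and 'shuffle' modes are easy to emulate in plain Python. The sketch below illustrates the idea only; it is not the pyteomics implementation, which also supports options such as keeping terminal residues fixed and the 'fused' mode.

```python
import random

def decoy_sequence(sequence, mode='reverse'):
    """Generate a decoy sequence (simplified sketch of the two basic modes)."""
    if mode == 'reverse':
        return sequence[::-1]                # simply reverse the residues
    if mode == 'shuffle':
        residues = list(sequence)
        random.shuffle(residues)             # random in-place permutation
        return ''.join(residues)
    raise ValueError('Unsupported mode: %r' % mode)

print(decoy_sequence('PEPTIDE', 'reverse'))  # 'EDITPEP'
```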
mzTab¶
mzTab is a HUPO-PSI standardized text-based format for describing identification
and quantification of peptides and small molecules. You can read an mzTab file into
a set of pandas.DataFrame
objects with the pyteomics.mztab.MzTab
class.
>>> from pyteomics import mztab
>>> tables = mztab.MzTab("path/to/file.mzTab")
>>> psms = tables.spectrum_match_table
>>> # do something with DataFrame
XML formats¶
XML parsers are implemented as classes and provide an
object-oriented interface. The functional interface is preserved for backward
compatibility and wraps the actual class-based machinery.
That means that reader objects returned
by read()
functions have additional methods.
One of the most important methods is iterfind()
. It allows reading
additional information from XML files.
mzML and mzXML¶
mzML and mzXML are XML-based formats for experimental data obtained on MS/MS or LC-MS
setups. Pyteomics offers you the functionality of pyteomics.mzml
and
pyteomics.mzxml
modules to gain access to the information contained in those files from Python.
The interfaces of the two modules are very similar; this section will use mzML for demonstration.
The user can iterate through MS/MS spectra contained in a file via the
pyteomics.mzml.read()
function or pyteomics.mzml.MzML
class.
Here is an example of the output:
>>> from pyteomics import mzml, auxiliary
>>> with mzml.read('tests/test.mzML') as reader:
>>> auxiliary.print_tree(next(reader))
count
index
highest observed m/z
ms level
total ion current
intensity array
lowest observed m/z
defaultArrayLength
profile spectrum
MSn spectrum
positive scan
base peak intensity
m/z array
base peak m/z
id
scanList
-> count
-> scan [list]
-> -> scan start time
-> -> preset scan configuration
-> -> filter string
-> -> instrumentConfigurationRef
-> -> scanWindowList
-> -> -> count
-> -> -> scanWindow [list]
-> -> -> -> scan window lower limit
-> -> -> -> scan window upper limit
-> -> [Thermo Trailer Extra]Monoisotopic M/Z:
-> no combination
Additionally, pyteomics.mzml.MzML
objects support direct indexing
with spectrum IDs and all other features of Indexed Parsers.
pyteomics.mzml.PreIndexedMzML
offers the same functionality,
but it uses byte offset information found at the end of the file.
Unlike the rest of the functions and classes, pyteomics.mzml.PreIndexedMzML
does not have a counterpart in pyteomics.mzxml
.
mzMLb¶
mzMLb is an HDF5-based format which wraps an mzML file and intelligently re-organizes
it for fast random access while reducing the on-disk file size using HDF5’s rich support for
data compression. If the dependencies, h5py and hdf5plugin, are available, pyteomics.mzmlb can be used to access the data in these files, just like pyteomics.mzml reads mzML files.
pepXML¶
pepXML
is a widely used XML-based format for peptide identifications.
It contains information about the MS data, the parameters of the search engine
used and the assigned sequences. To access these data, use
pyteomics.pepxml
module.
The function pyteomics.pepxml.read()
iterates through Peptide-Spectrum
matches in a pepXML file and returns them as a custom dict. Alternatively, you
can use the pyteomics.pepxml.PepXML
interface.
>>> from pyteomics import pepxml, auxiliary
>>> with pepxml.read('tests/test.pep.xml') as reader:
>>> auxiliary.print_tree(next(reader))
end_scan
search_hit [list]
-> hit_rank
-> calc_neutral_pep_mass
-> modifications
-> modified_peptide
-> peptide
-> num_matched_ions
-> search_score
-> -> deltacn
-> -> spscore
-> -> sprank
-> -> deltacnstar
-> -> xcorr
-> num_missed_cleavages
-> analysis_result [list]
-> -> peptideprophet_result
-> -> -> all_ntt_prob
-> -> -> parameter
-> -> -> -> massd
-> -> -> -> fval
-> -> -> -> nmc
-> -> -> -> ntt
-> -> -> probability
-> -> analysis
-> tot_num_ions
-> num_tot_proteins
-> is_rejected
-> proteins [list]
-> -> num_tol_term
-> -> protein
-> -> peptide_next_aa
-> -> protein_descr
-> -> peptide_prev_aa
-> massdiff
index
assumed_charge
spectrum
precursor_neutral_mass
start_scan
Reading into a pandas.DataFrame¶
If you like working with tabular data using pandas
, you can load pepXML files
directly into pandas.DataFrames
using the pyteomics.pepxml.DataFrame()
function. It can read multiple files
at once (using pyteomics.pepxml.chain()
) and return a combined table with
essential information about search results. This function requires pandas
.
X!Tandem¶
The X!Tandem search engine has its own output format that contains more information than pepXML. Pyteomics has a reader for it in the pyteomics.tandem module.
>>> from pyteomics import tandem, auxiliary
>>> with tandem.read('tests/test.t.xml') as reader:
... auxiliary.print_tree(next(reader))
...
rt
support
-> fragment ion mass spectrum
-> -> M+H
-> -> note
-> -> charge
-> -> Ydata
-> -> -> units
-> -> -> values
-> -> Xdata
-> -> -> units
-> -> -> values
-> -> label
-> -> id
-> supporting data
-> -> convolution survival function
-> -> -> Ydata
-> -> -> -> units
-> -> -> -> values
-> -> -> Xdata
-> -> -> -> units
-> -> -> -> values
-> -> -> label
-> -> b ion histogram
-> -> -> Ydata
-> -> -> -> units
-> -> -> -> values
-> -> -> Xdata
-> -> -> -> units
-> -> -> -> values
-> -> -> label
-> -> y ion histogram
-> -> -> Ydata
-> -> -> -> units
-> -> -> -> values
-> -> -> Xdata
-> -> -> -> units
-> -> -> -> values
-> -> -> label
-> -> hyperscore expectation function
-> -> -> a1
-> -> -> a0
-> -> -> Ydata
-> -> -> -> units
-> -> -> -> values
-> -> -> Xdata
-> -> -> -> units
-> -> -> -> values
-> -> -> label
mh
maxI
expect
sumI
act
fI
z
id
protein [list]
-> peptide
-> -> pre
-> -> end
-> -> seq
-> -> b_ions
-> -> nextscore
-> -> mh
-> -> y_ions
-> -> start
-> -> hyperscore
-> -> expect
-> -> delta
-> -> id
-> -> post
-> -> missed_cleavages
-> -> b_score
-> -> y_score
-> uid
-> sumI
-> label
-> note
-> expect
-> file
-> -> URL
-> -> type
-> id
pyteomics.tandem.read()
returns a
pyteomics.tandem.TandemXML
instance, which can also be
created directly.
Reading into a pandas.DataFrame¶
You can also load data from X!Tandem files directly into pandas.DataFrames
using the pyteomics.tandem.DataFrame()
function. It can read multiple files
at once (using pyteomics.tandem.chain()
) and return a combined table with
essential information about search results. Of course, this function requires pandas
.
mzIdentML¶
mzIdentML is one of the standards developed by the Proteomics Informatics working group of the HUPO Proteomics Standard Initiative.
The module interface is similar to that of the other reader modules.
The pyteomics.mzid.read()
function returns a
pyteomics.mzid.MzIdentML
instance, which you can just as easily
use directly.
>>> from pyteomics import mzid, auxiliary
>>> with mzid.read('tests/test.mzid') as reader:
>>> auxiliary.print_tree(next(reader))
SpectrumIdentificationItem [list]
-> PeptideEvidenceRef [list]
-> -> peptideEvidence_ref
-> ProteinScape:SequestMetaScore
-> chargeState
-> rank
-> ProteinScape:IntensityCoverage
-> calculatedMassToCharge
-> peptide_ref
-> passThreshold
-> experimentalMassToCharge
-> id
spectrumID
id
spectraData_ref
Element IDs and references¶
In mzIdentML, some elements contain references to other elements in the same file. The references are simply XML attributes whose names end with _ref and whose values are IDs, identical to the value of the id attribute of another element.
The parser can retrieve information from these references on the fly, which can
be enabled by passing retrieve_refs=True
to the
pyteomics.mzid.MzIdentML.iterfind()
method, to
pyteomics.mzid.MzIdentML
constructor, or to
pyteomics.mzid.read()
. Retrieval of data by ID is implemented in
the pyteomics.mzid.MzIdentML.get_by_id()
method. Alternatively, the
MzIdentML
object itself can be indexed with element IDs:
>>> from pyteomics import mzid
>>> m = mzid.MzIdentML('tests/test.mzid')
>>> m['ipi.HUMAN_decoy']
{'DatabaseName': 'database IPI_human',
'decoy DB accession regexp': '^SHD',
'decoy DB generation algorithm': 'PeakQuant.DecoyDatabaseBuilder',
'id': 'ipi.HUMAN_decoy',
'location': 'file://www.medizinisches-proteom-center.de/DBServer/ipi.HUMAN/3.15/ipi.HUMAN_decoy.fasta',
'name': ['decoy DB from IPI_human',
'DB composition target+decoy',
'decoy DB type shuffle'],
'numDatabaseSequences': 58099,
'releaseDate': '2006-02-22T09:30:47Z',
'version': '3.15'}
>>> m.close()
Note
Since version 3.3, pyteomics.mzid.MzIdentML
objects keep an index of byte
offsets for some of the elements (see Indexed Parsers).
Indexing helps achieve acceptable performance
when using retrieve_refs=True
, or when accessing individual elements by their ID.
This behavior can be disabled by passing
use_index=False
to the object constructor.
An alternative, older mechanism is caching of element IDs. To build
a cache for a file, you can pass build_id_cache=True
and use_index=False
to the MzIdentML
constructor, or to pyteomics.mzid.read()
,
or call the pyteomics.mzid.MzIdentML.build_id_cache()
method
prior to reading the data.
Reading into a pandas.DataFrame¶
pyteomics.mzid
also provides a pyteomics.mzid.DataFrame()
function
that reads one or several files into a single Pandas DataFrame
.
This function requires pandas
.
idXML¶
idXML is an OpenMS format for peptide identifications. It is supported in pyteomics.openms.idxml
.
It partially supports indexing (protein information can be indexed and extracted with retrieve_refs).
The regular iterative parsing is done through read() or IDXML, and a pandas.DataFrame can be created as well.
TraML¶
TraML is also a PSI format. It stores a lot of information on SRM experiments.
The parser, pyteomics.traml.TraML
, iterates over <Transition> elements by default.
Like MzIdentML, it has a retrieve_refs parameter that helps pull in the information from other parts of the file.
TraML
is one of the Indexed Parsers.
FeatureXML¶
pyteomics.openms.featurexml
implements a simple parser for .featureXML files
used in the OpenMS framework. The usage
is identical to other XML parsing modules. Since featureXML has feature IDs,
FeatureXML
objects also support direct indexing as well as iteration, among
the many features of Indexed Parsers:
>>> from pyteomics.openms import featurexml
>>> # function style, iteration
... with featurexml.read('tests/test.featureXML') as f:
... qual = [feat['overallquality'] for feat in f]
...
>>> qual # qualities of the two features in the test file
[0.791454, 0.945634]
>>> # object-oriented style, direct indexing
>>> f = featurexml.FeatureXML('tests/test.featureXML')
>>> f['f_189396504510444007']['overallquality']
0.945634
>>> f.close()
As always, pyteomics.openms.featurexml.read()
and pyteomics.openms.featurexml.FeatureXML
are interchangeable.
TrafoXML¶
.trafoXML is another OpenMS format based on XML. It describes a transformation produced by an RT alignment algorithm. The file basically contains a series of (from, to) pairs corresponding to original and transformed retention times:
>>> from pyteomics.openms import trafoxml
>>> from_rt, to_rt = [], []
>>> with trafoxml.read('test/test.trafoXML') as f:
... for pair in f:
... from_rt.append(pair['from'])
... to_rt.append(pair['to'])
>>> # plot the transformation
>>> import pylab
>>> pylab.plot(from_rt, to_rt)
As always, pyteomics.openms.trafoxml.read()
and pyteomics.openms.trafoxml.TrafoXML
are interchangeable.
TrafoXML parsers do not support indexing because there are no IDs for specific data points in this format.
Controlled Vocabularies¶
Controlled Vocabularies
are the universal annotation system used in the PSI formats, including
mzML and mzIdentML. pyteomics.mzml.MzML
, pyteomics.traml.TraML
and pyteomics.mzid.MzIdentML
retain the annotation information. It can be accessed using the helper function, pyteomics.auxiliary.cvquery()
:
>>> from pyteomics import auxiliary as aux, mzid, mzml
>>> f = mzid.MzIdentML('tests/test.mzid')
>>> s = next(f)
>>> s
{'SpectrumIdentificationItem': [{'ProteinScape:SequestMetaScore': 7.59488518903425, 'calculatedMassToCharge': 1507.695, 'PeptideEvidenceRef': [{'peptideEvidence_ref': 'PE1_SEQ_spec1_pep1'}], 'chargeState': 1, 'passThreshold': True, 'peptide_ref': 'prot1_pep1', 'rank': 1, 'id': 'SEQ_spec1_pep1', 'ProteinScape:IntensityCoverage': 0.3919545603809718, 'experimentalMassToCharge': 1507.696}], 'spectrumID': 'databasekey=1', 'id': 'SEQ_spec1', 'spectraData_ref': 'LCMALDI_spectra'}
>>> aux.cvquery(s)
{'MS:1001506': 7.59488518903425, 'MS:1001505': 0.3919545603809718}
>>> f.close()
Indexed Parsers¶
Most of the parsers implement indexing: MGF, mzML, mzXML, FASTA, PEFF, pepXML, mzIdentML, ms1, TraML, featureXML. Some formats do not have indexing parsers, because there is no unique ID field in the files to identify entries.
XML parser classes are called according to the format, e.g. pyteomics.mzml.MzML
. Text format parsers
that implement indexing are called with the word “Indexed”, e.g. pyteomics.fasta.IndexedFASTA
,
as opposed to pyteomics.fasta.FASTA
, which does not implement indexing.
This distinction is due to the fact that indexed parsers need to open the files in binary mode.
This may affect performance for text-based formats and is not always backwards-compatible
(you cannot instantiate an indexed parser class using a previously opened file if it is in text mode).
XML files, on the other hand, are always meant to be opened in binary mode.
So, there is no duplication of classes for XML formats, but indexing can still be disabled by passing
use_index=False
to the class constructor or the read()
function.
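The core idea behind indexed parsers can be illustrated with a toy byte-offset index: record where each entry starts in a binary-mode file, then seek() straight to an entry instead of scanning from the top. This is a conceptual sketch only; the actual pyteomics index machinery is considerably more elaborate.

```python
import io

def build_index(f, marker=b'BEGIN IONS'):
    """Record the byte offset at which each entry starts in a binary file."""
    index = []
    f.seek(0)
    while True:
        offset = f.tell()                # position before reading the line
        line = f.readline()
        if not line:
            break
        if line.strip() == marker:
            index.append(offset)
    return index

data = io.BytesIO(b'BEGIN IONS\nTITLE=a\nEND IONS\n'
                  b'BEGIN IONS\nTITLE=b\nEND IONS\n')
index = build_index(data)
data.seek(index[1])                      # jump straight to the second entry
second = data.readline()
```

This is also why indexed parsers require binary mode: byte offsets are only meaningful when the file is read as bytes.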
Basic usage¶
Indexed parsers can be instantiated using the class name or the read()
function:
In [1]: from pyteomics import mgf
In [2]: f = mgf.IndexedMGF('tests/test.mgf')
In [3]: f
Out[3]: <pyteomics.mgf.IndexedMGF at 0x7fc983cbaeb8>
In [4]: f.close()
In [5]: f = mgf.read('tests/test.mgf', use_index=True)
In [6]: f
Out[6]: <pyteomics.mgf.IndexedMGF at 0x7fc980c63898>
They support direct iteration and the with syntax, the same way as the older, iterative parsers.
Parser objects can be used as dictionaries mapping entry IDs to entries, or as lists:
In [7]: f['Spectrum 2']
Out[7]:
{'params': {'com': 'Based on http://www.matrixscience.com/help/data_file_help.html',
'itol': '1',
'itolu': 'Da',
'mods': 'Carbamidomethyl (C)',
'it_mods': 'Oxidation (M)',
'mass': 'Monoisotopic',
'username': 'Lou Scene',
'useremail': 'leu@altered-state.edu',
'charge': [2, 3],
'title': 'Spectrum 2',
'pepmass': (1084.9, 1234.0),
'scans': '3',
'rtinseconds': 25.0 second},
'm/z array': array([ 345.1, 370.2, 460.2, 1673.3, 1674. , 1675.3]),
'intensity array': array([ 237., 128., 108., 1007., 974., 79.]),
'charge array': masked_array(data=[3, 2, 1, 1, 1, 1],
mask=False,
fill_value=0)}
In [8]: f[1]['params']['title'] # positional indexing
Out[8]: 'Spectrum 2'
Like dictionaries, indexed parsers support membership testing and len()
:
In [9]: 'Spectrum 1' in f
Out[9]: True
In [10]: len(f)
Out[10]: 2
Rich Indexing¶
Indexed parsers also support positional indexing, slices of IDs and integers. ID-based slices include both endpoints; integer-based slices exclude the right edge of the interval. With integer indexing, step is also supported. Here is a self-explanatory demo of indexing functionality using a test file of two spectra:
In [11]: len(f['Spectrum 1':'Spectrum 2'])
Out[11]: 2
In [12]: len(f['Spectrum 2':'Spectrum 1'])
Out[12]: 2
In [13]: len(f[:])
Out[13]: 2
In [14]: len(f[:1])
Out[14]: 1
In [15]: len(f[1:0])
Out[15]: 0
In [16]: len(f[1:0:-1])
Out[16]: 1
In [17]: len(f[::2])
Out[17]: 1
RT-based indexing¶
In MGF, mzML and mzXML the spectra are usually time-ordered. The corresponding indexed parsers allow accessing the spectra by retention time, including slices:
In [18]: f = mzxml.MzXML('tests/test.mzXML')
In [19]: spec = f.time[5.5] # get the spectrum closest to this retention time
In [20]: len(f.time[5.5:6.0]) # get spectra from a range
Out[20]: 2
RT lookup is performed using binary search. When retrieving ranges, the closest spectra to the start and end of the range are used as endpoints, so it is possible that they are slightly outside the range.
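The lookup can be sketched with the standard bisect module: given a sorted list of retention times, find the index of the closest one. This mirrors the idea described above under the assumption that times are sorted ascending; it is not the pyteomics internals.

```python
import bisect

def closest_time_index(times, rt):
    """Index of the retention time closest to rt in a sorted list."""
    pos = bisect.bisect_left(times, rt)
    if pos == 0:
        return 0
    if pos == len(times):
        return len(times) - 1
    # pick the nearer of the two neighbouring times
    return pos if times[pos] - rt < rt - times[pos - 1] else pos - 1

times = [1.2, 3.4, 5.6, 6.1, 7.8]
print(closest_time_index(times, 5.5))
```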
Multiprocessing¶
Indexed parsers provide a unified interface for multiprocessing: map()
.
The method applies a user-defined function to entries from the file, calling it in different processes.
If the function is not provided, the parsing itself is parallelized. Depending on the format,
this may speed up or slow down the parsing overall.
map()
is a generator and yields items as they become available, not preserving the original order:
In [1]: from pyteomics import mzml
In [2]: f = mzml.MzML('tests/test.mzML')
In [3]: for spec in f.map():
...: print(spec['id'])
...:
controllerType=0 controllerNumber=1 scan=2
controllerType=0 controllerNumber=1 scan=1
In [4]: for item in f.map(lambda spec: spec['id']):
...: print(item)
...:
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2
Note
To use map()
with lambda functions (and in some other corner cases, like
parsers instantiated with pre-opened file objects), the dill
package is required.
This is because the target callable and the parser itself need to be pickled for multiprocessing to work.
Apart from parser objects, map()
is available on objects returned by chain()
functions
and iterfind()
:
In [5]: for c in f.iterfind('chromatogram').map():
...: print(c['id'])
...:
TIC
In [6]: for spec in mzml.chain('tests/test.mzML', 'tests/test.mzML').map():
...: print(spec['id'])
...:
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2
controllerType=0 controllerNumber=1 scan=1
controllerType=0 controllerNumber=1 scan=2
FDR estimation and filtering¶
The modules for reading proteomics search engine or post-processing output
(tandem
, pepxml
, mzid
, idxml
and protxml
)
expose similar functions
is_decoy()
, fdr()
and filter()
.
These functions implement the widely used Target-Decoy Approach (TDA) to the estimation of False Discovery Rate (FDR).

The is_decoy() function is supposed to determine whether a particular spectrum identification comes from the decoy database. In tandem and pepxml this is done by checking if the protein description/name starts with a certain prefix. In mzid, a boolean value that stores this information in the PSM dict is used.
Warning
Because of the variety of the software producing files in pepXML and
mzIdentML formats, the is_decoy()
function provided in the
corresponding modules may not work for your specific files. In this case
you will have to refer to the source of
pyteomics.pepxml.is_decoy()
and
pyteomics.mzid.is_decoy()
and create your own function in a
similar manner.
The fdr()
function estimates the FDR in a set of PSMs by counting
the decoy matches. Since it is using the is_decoy()
function, the
warning above applies. You can supply a custom function so that fdr()
works for your data. fdr()
can also be imported from
auxiliary
, where it has no default for is_decoy()
.
The filter()
function works like chain()
, but instead of
yielding all PSMs, it filters them to a certain level of FDR. PSM filtering
requires counting decoy matches, too (see above), but it also implies sorting
the PSMs by some kind of a score. This score cannot be universal due to the
above-mentioned reasons, and it can be specified as a user-defined function.
For instance, the default sorting key in pyteomics.mzid.filter()
is
only expected to work with mzIdentML files created with Mascot.
So once again,
Warning
The default parameters of filter()
may not work for your files.
There are also filter.chain()
and
filter.chain.from_iterable()
. These are different from
filter()
in that they apply FDR filtering to all files separately
and then provide a reader over top PSMs of all files, whereas
filter()
pools all PSMs together and applies a single threshold.
If you want to filter a list representing PSMs in arbitrary format, you can
use pyteomics.auxiliary.filter()
. Instead of files it takes lists
(or other iterables) of PSMs. The rest is the same as for other
filter()
functions.
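The essence of FDR filtering — sort the PSMs by score, then keep the largest prefix whose estimated FDR stays under the threshold — can be sketched as follows. This is a toy illustration of the principle, not the pyteomics implementation, which supports key functions, correction, full output, and more.

```python
def filter_psms(psms, fdr_threshold, key, is_decoy):
    """Return the best-scoring target PSMs with estimated FDR <= threshold."""
    ranked = sorted(psms, key=key)           # best (lowest) scores first
    decoys, passed = 0, 0
    for i, psm in enumerate(ranked, 1):
        if is_decoy(psm):
            decoys += 1
        targets = i - decoys
        if targets and decoys / targets <= fdr_threshold:
            passed = i                       # largest prefix under threshold
    return [p for p in ranked[:passed] if not is_decoy(p)]

psms = [{'score': 0.001, 'decoy': False}, {'score': 0.01, 'decoy': False},
        {'score': 0.02, 'decoy': True},  {'score': 0.05, 'decoy': False}]
good = filter_psms(psms, 0.5, key=lambda p: p['score'],
                   is_decoy=lambda p: p['decoy'])
```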
NumPy and Pandas support, etc.¶
pyteomics.auxiliary.filter()
supports structured numpy
arrays and
pandas.DataFrames
of PSMs. This makes it easy to filter search results
stored as CSV files (see Example 3: Search engines and PSM filtering for more info).
Generally, PSMs can be provided as iterators, lists, arrays, and DataFrames
,
and key and is_decoy parameters to filter()
can be functions, strings,
lists, arrays, or iterators. If a string is given, it is used as a key in a structured
array, DataFrame
or an iterable of dicts
.
FDR correction¶
As described in this JPR article, filtering based on decoy counting is inherently biased, especially for small datasets. All TDA-related functions have an optional argument, correction, that enables the correcting procedure proposed in the article.
Controlled Vocabulary Terms In Structured Data¶
When parsing files with controlled vocabulary terms, especially in positions where
those terms may be used as keys, pyteomics
will use an instance of cvstr
to represent that value. A cvstr
is a sub-class of str
with two additional attributes storing the accession number of the term it names, and another
holding the accession number of the unit of its value, if any.
- class pyteomics.auxiliary.structures.cvstr(value, accession=None, unit_accession=None)[source]¶
Bases:
str
A helper class to associate a controlled vocabulary accession number with an otherwise plain str object
Accessing the attributes of a dict key to find out if it matches a query is inconvenient. To handle looking up a value by accession, the CVQueryEngine type can help solve the problem by either looking up a single accession value, or converting a nested dict structure with cvstr as keys into a dict with accession numbers as keys, mapping to the values their owners pointed to in the original dict, or to their naming str if the value is empty.
For example, if we had parsed an mzML file, and read out a spectrum:
>>> from pyteomics import mzml
>>> scan = next(mzml.read("tests/test.mzml"))
>>> scan
{'index': 0,
'id': 'controllerType=0 controllerNumber=1 scan=1',
'defaultArrayLength': 19914,
'scanList': {'count': 1,
'scan': [{'instrumentConfigurationRef': 'IC1',
'scanWindowList': {'count': 1,
'scanWindow': [{'scan window lower limit': 200.0 m/z,
'scan window upper limit': 2000.0 m/z}]},
'scan start time': 0.004935 minute,
'filter string': 'FTMS + p ESI Full ms [200.00-2000.00]',
'preset scan configuration': 1.0,
'[Thermo Trailer Extra]Monoisotopic M/Z:': 810.4152221679688}],
'no combination': ''},
'ms level': 1,
'MSn spectrum': '',
'positive scan': '',
'profile spectrum': '',
'base peak m/z': 810.415283203125 m/z,
'base peak intensity': 1471973.875 number of counts,
'total ion current': 15245068.0,
'lowest observed m/z': 200.00018816645022 m/z,
'highest observed m/z': 2000.0099466203771 m/z,
'count': 2,
'm/z array': array([ 200.00018817, 200.00043034, 200.00067252, ..., 1999.96151259,
1999.98572931, 2000.00994662]),
'intensity array': array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)}
Then cvquery(scan)
would yield the following look up table:
>>> from pyteomics.auxiliary import cvquery
>>> cvquery(scan)
{'MS:1000501': 200.0 m/z,
'MS:1000500': 2000.0 m/z,
'MS:1000016': 0.004935 minute,
'MS:1000512': 'FTMS + p ESI Full ms [200.00-2000.00]',
'MS:1000616': 1.0,
'MS:1000795': 'no combination',
'MS:1000511': 1,
'MS:1000580': 'MSn spectrum',
'MS:1000130': 'positive scan',
'MS:1000128': 'profile spectrum',
'MS:1000504': 810.415283203125 m/z,
'MS:1000505': 1471973.875 number of counts,
'MS:1000285': 15245068.0,
'MS:1000528': 200.00018816645022 m/z,
'MS:1000527': 2000.0099466203771 m/z,
'MS:1000514': array([ 200.00018817, 200.00043034, 200.00067252, ..., 1999.96151259,
1999.98572931, 2000.00994662]),
'MS:1000515': array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)}
Alternatively, if we have a particular accession in mind, e.g. MS:1000016
for "scan start time"
, we could query for
that specifically:
>>> cvquery(scan, "MS:1000016")
0.004935
Because CVQueryEngine
does not have any state of its own,
we use a pre-made instance, pyteomics.auxiliary.structures.cvquery
.
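The traversal itself is simple to picture: walk a nested dict, and wherever a key carries an accession attribute, record the mapping. Below is a minimal stand-in for cvstr and the query logic, purely illustrative (empty values map to the term's own name, as shown above for 'no combination').

```python
class CVStr(str):
    """A str that remembers a controlled-vocabulary accession (toy version)."""
    def __new__(cls, value, accession=None):
        inst = super().__new__(cls, value)
        inst.accession = accession
        return inst

def cv_query(data, table=None):
    """Collect {accession: value} from an arbitrarily nested dict."""
    if table is None:
        table = {}
    for key, value in data.items():
        if isinstance(value, dict):
            cv_query(value, table)           # recurse into nested mappings
        elif getattr(key, 'accession', None):
            # an empty value maps to the term's name itself
            table[key.accession] = value if value != '' else str(key)
    return table

scan = {CVStr('ms level', 'MS:1000511'): 1,
        'scan': {CVStr('positive scan', 'MS:1000130'): ''}}
print(cv_query(scan))
```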
- pyteomics.auxiliary.structures.cvquery(data, accession=None)¶
Traverse an arbitrarily nested dictionary looking for keys which are
cvstr
instances, or objects with an attribute calledaccession
.
- class pyteomics.auxiliary.structures.CVQueryEngine[source]¶
Bases:
object
Traverse an arbitrarily nested dictionary looking for keys which are
cvstr
instances, or objects with an attribute calledaccession
.- __init__()¶
Unit Handling¶
When parsing a data file with unit information associated with a scalar value, pyteomics uses annotated data types to carry units around, with the specific type determined by parsing the value.
For instance, given the XML string:
<cvParam cvRef="MS" accession="MS:1000927" name="ion injection time" value="68.227485656738"
unitCvRef="UO" unitAccession="UO:0000028" unitName="millisecond"/>
This will be parsed into a unitfloat
:
>>> value = unitfloat(68.227485656738, "millisecond")
# Get the unit name, perhaps to do a conversion to another unit
>>> value.unit_info
"millisecond"
# Can be coerced into a plain float without issue
>>> float(value)
68.227485656738
# Can be used identically to a normal float
>>> value > 50.0 and value < 90.0
True
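Conceptually, unitfloat is just a float subclass carrying an extra attribute. A minimal equivalent looks like this (illustrative only; use the class shipped in pyteomics.auxiliary in real code):

```python
class UnitFloat(float):
    """A float annotated with a unit name (simplified stand-in)."""
    def __new__(cls, value, unit_info=None):
        inst = super().__new__(cls, value)   # float is immutable: use __new__
        inst.unit_info = unit_info
        return inst

value = UnitFloat(68.227485656738, 'millisecond')
print(value.unit_info)   # the unit annotation
print(float(value))      # coerces to a plain float, annotation dropped
```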
To normalize the time unit, we can write a function like this:
import warnings
from pyteomics.auxiliary import unitfloat

def in_minutes(x):
    '''Convert a time quantity to minutes

    Parameters
    ----------
    x : unitfloat
        A float representing a quantity of time annotated with a time unit

    Returns
    -------
    unitfloat :
        The time after conversion to minutes
    '''
    try:
        unit = x.unit_info
    except AttributeError:
        return x
    if unit == 'minute':
        return x
    elif unit == 'second':
        return unitfloat(x / 60., 'minute')
    elif unit == 'hour':
        return unitfloat(x * 60, 'minute')
    else:
        warnings.warn("Time unit %r not recognized" % unit)
        return x
>>> seconds = unitfloat(93.5, "second")
>>> minutes = in_minutes(seconds)
>>> minutes
1.55833
- class pyteomics.auxiliary.structures.unitfloat(value, unit_info=None)[source]¶
Bases:
float
Represents a float value with a unit name. Behaves identically to the built-in float type.