Pyteomics documentation v4.7.3

mzml - reader for mass spectrometry data in mzML format

Summary

mzML is a standard, rich XML format for storing raw mass spectrometry data. Refer to psidev.info for the detailed specification of the format and the structure of mzML files.

This module provides a minimalistic way to extract information from mzML files. You can use the old functional interface (read()) or the new object-oriented interface (MzML or PreIndexedMzML) to iterate over the <spectrum> elements of a file. MzML and PreIndexedMzML also support direct indexing with spectrum IDs.

Data access

MzML - a class representing a single mzML file. Other data access functions use this class internally.

PreIndexedMzML - a class representing a single mzML file. Uses byte offsets listed at the end of the file for quick access to spectrum elements.

read() - iterate through spectra in an mzML file. Data from each spectrum are converted to a human-readable dict. The peak data themselves are stored under the ‘m/z array’ and ‘intensity array’ keys.

chain() - read multiple mzML files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.
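As a minimal sketch of how the interfaces listed above are typically used, the helper below consumes spectrum dicts of the kind read() yields. Only the ‘m/z array’ and ‘intensity array’ keys are taken from this page; the ‘id’ key, the function names and the file name in the comments are assumptions made for illustration.

```python
def summarize_spectrum(spectrum):
    """Return (id, number of peaks, summed intensity) for one spectrum dict."""
    mz = spectrum['m/z array']
    intensity = spectrum['intensity array']
    # The built-in len() and sum() work for plain lists and 1-D numpy arrays alike.
    return spectrum.get('id'), len(mz), sum(intensity)


def iter_spectra(path):
    """Yield spectrum dicts from an mzML file (needs pyteomics, lxml and numpy)."""
    # Imported lazily so that summarize_spectrum() has no third-party dependency.
    from pyteomics import mzml
    with mzml.read(path) as reader:
        for spectrum in reader:
            yield spectrum


# Hypothetical usage, assuming a file named example.mzML exists:
#     for spectrum in iter_spectra('example.mzML'):
#         print(summarize_spectrum(spectrum))
```

Keeping the summary function separate from the file access makes it easy to try on hand-written dicts before pointing it at a real file.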

Controlled Vocabularies

mzML relies on controlled vocabularies to describe its contents extensibly. See Controlled Vocabulary Terms for more details on how they are used.

Handling Time Units and Other Qualified Quantities

mzML files may express time values in a variety of units. See Unit Handling for more information.

Deprecated functions

version_info() - get version information about the mzML file. Read the version_info attribute of an MzML object instead.

iterfind() - iterate over elements in an mzML file. Call the iterfind() method of an MzML object instead.

Dependencies

This module requires lxml and numpy.


pyteomics.mzml.chain(*sources, **kwargs)

Chain MzML readers for several sources into a single iterable. Positional arguments should be sources such as file names or file objects. Keyword arguments are passed to the MzML constructor.

Parameters:
  • sources (Iterable) – Sources for creating new sequences from, such as paths or file-like objects

  • kwargs (Mapping) – Additional arguments used to instantiate each sequence

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:

files – Iterable of file names or file objects.
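The difference between the two chaining entry points can be sketched as follows. The counting helper works on any iterable of spectrum dicts; the ‘ms level’ key and both function names are assumptions made for this illustration, not something this page specifies.

```python
from collections import Counter


def count_ms_levels(spectra):
    """Count spectra per MS level in any iterable of spectrum dicts."""
    return Counter(spectrum.get('ms level') for spectrum in spectra)


def count_ms_levels_in_files(paths):
    """Count MS levels across several mzML files read as one stream."""
    # Imported lazily; needs pyteomics, lxml and numpy.
    from pyteomics import mzml
    # chain.from_iterable() takes a single iterable of sources, whereas
    # mzml.chain(*paths) takes the same sources as positional arguments.
    return count_ms_levels(mzml.chain.from_iterable(paths))
```

Passing a list of paths to count_ms_levels_in_files() would then treat all files as one continuous sequence of spectra.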

pyteomics.mzml.version_info(source)

Provide version information about the mzML file.

Note

This function is provided for backward compatibility only. It simply creates an MzML instance and returns its version_info attribute.

Parameters:

source (str or file) – File name or file-like object.

Returns:

out – A (version, schema URL) tuple, both elements are strings or None.

Return type:

tuple

pyteomics.mzml.iterfind(source, path, **kwargs)

Parse source and yield info on elements with specified local name or by specified “XPath”.

Note

This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create an MzML object and use its iterfind() method.

Parameters:
  • source (str or file) – File name or file-like object.

  • path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.

  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.

  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.

  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.

  • decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.

Returns:

out

Return type:

iterator
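The note on the path parameter points out that plain Python allows more powerful filtering than a single trailing condition. A minimal sketch of that idea is a generator that combines several conditions over the yielded dicts; the ‘ms level’ key, the function name and the file name in the comments are assumptions made for illustration.

```python
def filter_spectra(spectra, ms_level=None, min_peaks=0):
    """Yield only the spectrum dicts that satisfy all of the given conditions."""
    for spectrum in spectra:
        # Skip spectra of the wrong MS level when a level was requested.
        if ms_level is not None and spectrum.get('ms level') != ms_level:
            continue
        # Skip spectra with too few peaks (missing arrays count as empty).
        if len(spectrum.get('m/z array', ())) < min_peaks:
            continue
        yield spectrum


# Hypothetical usage with a reader object, assuming example.mzML exists:
#     from pyteomics import mzml
#     with mzml.MzML('example.mzML') as reader:
#         for s in filter_spectra(reader.iterfind('spectrum'), ms_level=2, min_peaks=10):
#             print(s['id'])
```

Because the filter is an ordinary generator, it keeps the memory profile of the underlying iterator and can be stacked with other filters.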

class pyteomics.mzml.MzML(*args, **kwargs)

Bases: BinaryArrayConversionMixin, TimeOrderedIndexedReaderMixin, MultiProcessingXML, IndexSavingXML

Parser class for mzML files.

__init__(*args, **kwargs)

class binary_array_record(data, compression, dtype, source, key)

Bases: binary_array_record

Hold all of the information about a base64 encoded array needed to decode the array.

__init__()

compression

Alias for field number 1

count(value, /)

Return number of occurrences of value.

data

Alias for field number 0

decode()

Decode data into a numerical array

Return type:

np.ndarray

dtype

Alias for field number 2

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 4

source

Alias for field number 3

build_byte_index()

Build the byte offset index by reading the offsets from the file at _byte_offset_filename. If that fails with an IOError, fall back to the method used by IndexedXML or IndexedTextReader.

build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

decode_data_array(source, compression_type=None, dtype=numpy.float64)

Decode a base64-encoded, compressed bytestring into a numerical array.

Parameters:
  • source (bytes) – A base64 string encoding a potentially compressed numerical array.

  • compression_type (str, optional) – The name of the compression method used before encoding the array into base64.

  • dtype (type, optional) – The data type to use to decode the binary array from the decompressed bytes.

Return type:

np.ndarray
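The decoding steps described above (base64 decoding, optional decompression, reinterpretation of the bytes as numbers) can be sketched with the standard library alone. This is not the Pyteomics implementation: it assumes zlib as the only compression method, uses a boolean flag instead of a compression name, and returns a tuple rather than a numpy array. The function names are hypothetical.

```python
import base64
import struct
import zlib


def decode_array(source, compressed=False, fmt='d'):
    """Decode base64 data into a tuple of little-endian floats.

    fmt is a struct format character: 'd' for 64-bit, 'f' for 32-bit floats.
    """
    raw = base64.b64decode(source)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // struct.calcsize('<' + fmt)
    return struct.unpack('<%d%s' % (count, fmt), raw)


def encode_array(values, compressed=False, fmt='d'):
    """Inverse of decode_array(), used here only to produce sample input."""
    raw = struct.pack('<%d%s' % (len(values), fmt), *values)
    if compressed:
        raw = zlib.compress(raw)
    return base64.b64encode(raw)


encoded = encode_array((100.5, 200.25, 300.0), compressed=True)
decoded = decode_array(encoded, compressed=True)
```

The order of operations matters: compression is applied before base64 encoding, so decoding has to undo base64 first and decompress second.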

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.

  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.

Return type:

dict
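To illustrate why an offset index makes lookup by id fast, the sketch below builds a tiny id-to-byte-offset map with a regular expression and then seeks straight to one spectrum. It is a simplified stand-in under stated assumptions, not how Pyteomics builds or stores its index; the sample document and all names are made up for the example.

```python
import io
import re

# Matches the opening tag of a <spectrum> element and captures its id attribute.
SPECTRUM_START = re.compile(rb'<spectrum\s[^>]*\bid="([^"]+)"')

DOC = (
    b'<mzML><run><spectrumList count="2">\n'
    b'<spectrum index="0" id="scan=1"><cvParam name="ms level" value="1"/></spectrum>\n'
    b'<spectrum index="1" id="scan=2"><cvParam name="ms level" value="2"/></spectrum>\n'
    b'</spectrumList></run></mzML>\n'
)


def build_offset_index(data):
    """Map each spectrum id to the byte offset of its opening tag."""
    return {m.group(1).decode('utf-8'): m.start()
            for m in SPECTRUM_START.finditer(data)}


def read_spectrum_xml(stream, index, spectrum_id):
    """Seek directly to one spectrum and return its raw XML as bytes."""
    end_tag = b'</spectrum>'
    stream.seek(index[spectrum_id])
    chunk = b''
    for line in stream:
        chunk += line
        end = chunk.find(end_tag)
        if end != -1:
            return chunk[:end + len(end_tag)]
    raise ValueError('unterminated spectrum element')


index = build_offset_index(DOC)
second = read_spectrum_xml(io.BytesIO(DOC), index, 'scan=2')
```

With the index in hand, retrieving the second spectrum never touches the bytes of the first one, which is the benefit the description of get_by_id() refers to.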

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.

  • **kwargs – Passed to self._get_info_smart().

Returns:

out

Return type:

iterator

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over the entries of this object in parallel, using at most the number of worker processes given by the processes argument.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.
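A possible way to use map() is sketched below. The target is a module-level function so that it can be sent to worker processes, and the extra keyword argument travels through the kwargs parameter. The ‘id’ key, the function names, and the choice to pass use_index=True are assumptions for illustration; results are sorted afterwards because they arrive out of order.

```python
def base_peak(spectrum, min_intensity=0.0):
    """Return (id, m/z, intensity) of the most intense peak, or None."""
    mz = spectrum['m/z array']
    intensity = spectrum['intensity array']
    if len(intensity) == 0:
        return None
    # Position of the largest intensity value.
    top = max(range(len(intensity)), key=lambda i: intensity[i])
    if intensity[top] < min_intensity:
        return None
    return spectrum.get('id'), float(mz[top]), float(intensity[top])


def base_peaks_parallel(path, processes=2, min_intensity=0.0):
    """Compute base peaks for all spectra of a file in worker processes."""
    # Imported lazily; needs pyteomics, lxml and numpy.
    from pyteomics import mzml
    with mzml.MzML(path, use_index=True) as reader:
        results = reader.map(base_peak, processes=processes,
                             kwargs={'min_intensity': min_intensity})
        found = [r for r in results if r is not None]
    # Restore a deterministic order, since map() does not preserve it.
    return sorted(found)
```

On platforms that start worker processes by spawning a fresh interpreter, a script calling base_peaks_parallel() should do so under an if __name__ == '__main__': guard.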

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters:

path (str) – The path to the file to parse

reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

class pyteomics.mzml.PreIndexedMzML(*args, **kwargs)

Bases: MzML

Parser class for mzML files, subclass of MzML. Uses byte offsets listed at the end of the file for quick access to spectrum elements.

__init__(*args, **kwargs)

class binary_array_record(data, compression, dtype, source, key)

Bases: binary_array_record

Hold all of the information about a base64 encoded array needed to decode the array.

__init__()

compression

Alias for field number 1

count(value, /)

Return number of occurrences of value.

data

Alias for field number 0

decode()

Decode data into a numerical array

Return type:

np.ndarray

dtype

Alias for field number 2

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 4

source

Alias for field number 3

build_byte_index()

Build up a HierarchicalOffsetIndex of offsets for elements. Calls _find_index_list() or falls back on regular MzML indexing.

Returns:

out

Return type:

HierarchicalOffsetIndex

build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

decode_data_array(source, compression_type=None, dtype=numpy.float64)

Decode a base64-encoded, compressed bytestring into a numerical array.

Parameters:
  • source (bytes) – A base64 string encoding a potentially compressed numerical array.

  • compression_type (str, optional) – The name of the compression method used before encoding the array into base64.

  • dtype (type, optional) – The data type to use to decode the binary array from the decompressed bytes.

Return type:

np.ndarray

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.

  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.

Return type:

dict

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.

  • **kwargs – Passed to self._get_info_smart().

Returns:

out

Return type:

iterator

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over the entries of this object in parallel, using at most the number of worker processes given by the processes argument.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters:

path (str) – The path to the file to parse

reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

pyteomics.mzml.read(source, read_schema=False, iterative=True, use_index=False, dtype=None, huge_tree=False, decode_binary=True)

Parse source and iterate through spectra.

Parameters:
  • source (str or file) – A path to a target mzML file or the file object itself.

  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.

  • iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.

  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for spectrum elements. Default is False.

  • dtype (type or dict, optional) – dtype to convert arrays to, one for both m/z and intensity arrays or one for each key. If dict, keys should be ‘m/z array’ and ‘intensity array’.

  • decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.

  • huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).

Returns:

out – An iterator over the dicts with spectrum properties.

Return type:

iterator
