mzxml - reader for mass spectrometry data in mzXML format
Summary
mzXML is a formerly standard XML format for raw mass spectrometry data storage; mzML is intended to replace it.
This module provides a minimalistic way to extract information from mzXML files. You can use the old functional interface (read()) or the new object-oriented interface (MzXML) to iterate over entries in <scan> elements. MzXML also supports direct indexing with scan IDs.
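As a hedged illustration of the iteration described above, the sketch below counts scans per MS level from any iterable of scan dicts. The "msLevel" and "num" keys, the toy data, and the file name in the commented usage are assumptions made for the example, not details stated on this page.

```python
from collections import Counter


def count_ms_levels(scans):
    """Count scans per MS level in an iterable of scan dicts."""
    # 'msLevel' is an assumed key name; scans lacking it are counted under None.
    return Counter(scan.get("msLevel") for scan in scans)


# Stand-in data shaped like parsed <scan> entries (values are made up).
example_scans = [
    {"num": "1", "msLevel": 1},
    {"num": "2", "msLevel": 2},
    {"num": "3", "msLevel": 2},
]
levels = count_ms_levels(example_scans)

# Hypothetical usage with a real file (needs pyteomics, lxml and numpy):
# from pyteomics import mzxml
# with mzxml.MzXML("example.mzXML") as reader:
#     print(count_ms_levels(reader))
#     scan = reader["2"]  # direct indexing by scan ID
```

Because the helper only needs an iterable of dicts, the same function should work with either the functional or the object-oriented interface.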
Data access
MzXML - a class representing a single mzXML file. Other data access functions use this class internally.
read() - iterate through spectra in an mzXML file. Data from a single scan are converted to a human-readable dict. The spectra themselves are stored under the ‘m/z array’ and ‘intensity array’ keys.
chain() - read multiple mzXML files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
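To show how the two array keys mentioned for read() might be used together, here is a minimal sketch that finds the most intense peak of one scan dict. The helper name and the toy scan are hypothetical; only the 'm/z array' and 'intensity array' keys come from the text above.

```python
def base_peak(scan):
    """Return (m/z, intensity) of the most intense peak, or None if empty."""
    mz = scan["m/z array"]
    intensity = scan["intensity array"]
    if len(mz) == 0:
        return None
    # Pair each m/z with its intensity and pick the pair with the highest intensity.
    return max(zip(mz, intensity), key=lambda pair: pair[1])


# Toy scan with plain lists; real scans are expected to hold numpy arrays,
# which support len() and zip() in the same way.
toy_scan = {
    "m/z array": [100.1, 200.2, 300.3],
    "intensity array": [5.0, 50.0, 0.5],
}
peak = base_peak(toy_scan)
```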
Deprecated functions
version_info() - get version information about the mzXML file. You can just read the corresponding attribute of the MzXML object.
iterfind() - iterate over elements in an mzXML file. You can just call the corresponding method of the MzXML object.
Dependencies
This module requires lxml and numpy.
-
pyteomics.mzxml.chain(*sources, **kwargs)
Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.
-
pyteomics.mzxml.sources
Sources for creating new sequences from, such as paths or file-like objects
Type: Iterable
-
pyteomics.mzxml.kwargs
Additional arguments used to instantiate each sequence
Type: Mapping
-
chain.from_iterable(files, **kwargs)
Chain read() for several files. Keyword arguments are passed to the read() function.
Parameters: files – Iterable of file names or file objects.
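The chaining behaviour described for chain() and chain.from_iterable() can be pictured with a small generator. This is a sketch of the general idea under stated assumptions, not the library's implementation: chain_sources and fake_reader are hypothetical names, and the fake reader stands in for a real file reader that works as a context manager.

```python
from contextlib import contextmanager


def chain_sources(reader_factory, *sources, **kwargs):
    """Yield entries from each source in turn, one reader at a time."""
    for source in sources:
        # Each reader is opened only when its turn comes and closed afterwards.
        with reader_factory(source, **kwargs) as reader:
            yield from reader


@contextmanager
def fake_reader(source, prefix=""):
    """Hypothetical stand-in for a file reader: yields two labelled entries."""
    yield (f"{prefix}{source}:{i}" for i in range(2))


entries = list(chain_sources(fake_reader, "a", "b", prefix="scan-"))

# Hypothetical usage with pyteomics itself (file names are made up):
# from pyteomics import mzxml
# for scan in mzxml.chain("run1.mzXML", "run2.mzXML"):
#     ...
# for scan in mzxml.chain.from_iterable(["run1.mzXML", "run2.mzXML"]):
#     ...
```

Keyword arguments are forwarded unchanged to every reader, which mirrors the statement that they are passed on to the underlying function.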
-
pyteomics.mzxml.version_info(source)
Provide version information about the XML file.
Note
This function is provided for backward compatibility only. It simply creates an MzXML instance and returns its version_info attribute.
Parameters: source (str or file) – File name or file-like object.
Returns: out – A (version, schema URL) tuple; both elements are strings or None.
Return type: tuple
-
pyteomics.mzxml.iterfind(source, path, **kwargs)
Parse source and yield info on elements with specified local name or by specified XPath.
Note
This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create an MzXML object and use its iterfind() method.
Parameters:
- source (str or file) – File name or file-like object.
- path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition at the end, such as: "/path/to/element[some_value>1.5]". Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
- recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
- iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
- read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzXML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or to suppress the related warnings.
- decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.
Returns: out
Return type: iterator
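The note that plain Python allows more powerful filtering than a single path condition can be sketched as follows. The generator combines two conditions, which a path such as "/path/to/element[some_value>1.5]" cannot express. The "retentionTime", "msLevel" and "num" keys and the toy data are assumptions made for illustration.

```python
def select_scans(scans, min_rt=None, ms_level=None):
    """Yield scan dicts that satisfy every condition that was given."""
    for scan in scans:
        rt = scan.get("retentionTime")
        # Skip scans at or below the retention time threshold (or without one).
        if min_rt is not None and (rt is None or rt <= min_rt):
            continue
        # Skip scans of a different MS level when a level is requested.
        if ms_level is not None and scan.get("msLevel") != ms_level:
            continue
        yield scan


toy_scans = [
    {"num": "1", "msLevel": 1, "retentionTime": 0.5},
    {"num": "2", "msLevel": 2, "retentionTime": 2.0},
    {"num": "3", "msLevel": 1, "retentionTime": 3.0},
    {"num": "4", "msLevel": 2},
]
late = [s["num"] for s in select_scans(toy_scans, min_rt=1.5)]
late_ms2 = [s["num"] for s in select_scans(toy_scans, min_rt=1.5, ms_level=2)]
```

Since the helper is a generator, it keeps the memory benefit of iterative parsing when it wraps a reader instead of a list.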
-
class pyteomics.mzxml.MzXML(*args, **kwargs)
Bases: pyteomics.auxiliary.utils.BinaryArrayConversionMixin, pyteomics.auxiliary.file_helpers.TimeOrderedIndexedReaderMixin, pyteomics.xml.MultiProcessingXML, pyteomics.xml.IndexSavingXML
Parser class for mzXML files.
-
class binary_array_record
Bases: pyteomics.auxiliary.utils.binary_array_record
Hold all of the information about a base64 encoded array needed to decode the array.
-
__init__
Initialize self. See help(type(self)) for accurate signature.
-
compression
Alias for field number 1
-
count()
Return number of occurrences of value.
-
data
Alias for field number 0
-
dtype
Alias for field number 2
-
index()
Return first index of value. Raises ValueError if the value is not present.
-
key
Alias for field number 4
-
source
Alias for field number 3
-
build_id_cache()
Construct a cache for each element in the document, indexed by id attribute
-
build_tree()
Build and store the ElementTree instance for the underlying file
-
clear_id_cache()
Clear the element ID cache
-
clear_tree()
Remove the saved ElementTree.
-
decode_data_array(source, compression_type=None, dtype=<class 'numpy.float64'>)
Decode a base64-encoded, compressed bytestring into a numerical array.
Return type: np.ndarray
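The decoding step can be approximated with the standard library alone. The sketch below is not the library's implementation: it assumes the usual mzXML peak layout (big-endian floats of 32 or 64 bits, stored as interleaved m/z and intensity pairs, optionally zlib-compressed), and decode_peaks is a hypothetical name. It returns plain lists rather than numpy arrays.

```python
import base64
import struct
import zlib


def decode_peaks(text, precision=32, compression=None):
    """Decode base64 peak data into (m/z list, intensity list)."""
    if precision not in (32, 64):
        raise ValueError("precision must be 32 or 64")
    raw = base64.b64decode(text)
    if compression == "zlib":
        raw = zlib.decompress(raw)
    size = precision // 8
    code = "f" if precision == 32 else "d"
    count = len(raw) // size
    # '>' selects big-endian (network) byte order, as assumed for mzXML.
    values = struct.unpack(f">{count}{code}", raw)
    # Even positions hold m/z values, odd positions hold intensities.
    return list(values[0::2]), list(values[1::2])


# Round trip on made-up data: two peaks packed as 32-bit big-endian floats.
encoded = base64.b64encode(struct.pack(">4f", 100.0, 10.0, 200.5, 20.0))
mz, intensity = decode_peaks(encoded)
```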
-
get_by_id(elem_id, id_key=None, element_type=None, **kwargs)
Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
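The offset-index lookup described above can be illustrated with a toy byte-offset index. This is a simplified sketch, not the library's code: the function names are hypothetical, the index is built with a regular expression instead of an XML parser, and it assumes scans are not nested inside one another.

```python
import io
import re

# Matches the start of a <scan ...> tag and captures its num attribute.
SCAN_START = re.compile(rb'<scan\b[^>]*?\snum="(\d+)"')


def build_offset_index(data):
    """Map each scan number (as str) to the byte offset of its <scan> tag."""
    return {m.group(1).decode(): m.start() for m in SCAN_START.finditer(data)}


def read_scan_at(stream, offset):
    """Seek straight to a scan and return its raw bytes up to </scan>."""
    stream.seek(offset)
    chunk = stream.read()
    end = chunk.find(b"</scan>")
    if end == -1:
        raise ValueError("no closing </scan> tag after offset %d" % offset)
    return chunk[: end + len(b"</scan>")]


# Made-up document fragment with two flat (non-nested) scans.
document = (
    b'<msRun><scan num="1" msLevel="1"><peaks>AA==</peaks></scan>'
    b'<scan num="2" msLevel="2"><peaks>BB==</peaks></scan></msRun>'
)
index = build_offset_index(document)
second = read_scan_at(io.BytesIO(document), index["2"])
```

The point of the index is that fetching scan "2" never touches the bytes of scan "1", which is why indexed access avoids parsing from the start of the file.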
-
iterfind(path, **kwargs)
Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters:
- path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)
Execute the target function over entries of this object across up to processes processes. Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
- **_kwargs – Additional keyword arguments to be passed to the target function
Yields: object – The work item returned by the target function.
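Because results come back out of order, one reasonable pattern is for the target function to return an identifier alongside its result so the caller can restore the order. The sketch below shows such a target and simulates the out-of-order arrival with a shuffle instead of real worker processes. The "num" key, the scale argument and the toy scans are assumptions made for illustration.

```python
import random


def total_ion_current(scan, scale=1.0):
    """Return (scan ID, scaled sum of intensities) for one scan dict."""
    # Defined at module level so that worker processes could pickle it.
    return scan["num"], scale * sum(scan["intensity array"])


# Toy scans standing in for the entries a reader would yield.
toy_scans = [{"num": str(i), "intensity array": [float(i), 1.0]} for i in range(1, 5)]

results = [total_ion_current(scan, scale=2.0) for scan in toy_scans]
random.shuffle(results)  # simulate results arriving out of order
ordered = sorted(results, key=lambda item: int(item[0]))

# Hypothetical usage with a real reader (file name is made up):
# from pyteomics import mzxml
# with mzxml.MzXML("example.mzXML") as reader:
#     for scan_id, tic in reader.map(total_ion_current, processes=2,
#                                    kwargs={"scale": 2.0}):
#         ...
```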
-
classmethod prebuild_byte_offset_file(path)
Construct a new XML reader, build its byte offset index and write it to file
Parameters: path (str) – The path to the file to parse
-
reset()
Resets the iterator to its initial state.
-
write_byte_offsets()
Write the byte offsets in _offset_index to the file at _byte_offset_filename
-
pyteomics.mzxml.read(source, read_schema=False, iterative=True, use_index=False, dtype=None, huge_tree=False, decode_binary=True)
Parse source and iterate through spectra.
Parameters:
- source (str or file) – A path to a target mzXML file or the file object itself.
- read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the mzXML header. Otherwise, use default parameters. Not recommended without an Internet connection or if you want to avoid the related warnings.
- iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for spectrum elements. Default is False.
- decode_binary (bool, optional) – Defines whether binary data should be decoded and included in the output (under “m/z array”, “intensity array”, etc.). Default is True.
- huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
Returns: out – An iterator over the dicts with spectrum properties.
Return type: iterator