mzmlb - reader for mass spectrometry data in mzMLb format
Warning
This is a Provisional Implementation. The mzMLb format has been published but is not yet broadly available.
Summary
mzMLb is an HDF5 container format that wraps mzML, the standard rich XML format for raw mass spectrometry data storage. Please refer to [1] for more information about mzMLb and its features. Please refer to psidev.info for the detailed specification of the format and structure of mzML files.
This module provides a minimalistic way to extract information from mzMLb
files. You can use the old functional interface (read()) or the new
object-oriented interface (MzMLb) to iterate over entries in <spectrum> elements.
MzMLb also supports direct indexing with spectrum IDs or indices.
Data access
MzMLb - a class representing a single mzMLb file. Other data access functions use this class internally.
read() - iterate through spectra in an mzMLb file. Data from a single spectrum are converted to a human-readable dict. Spectra themselves are stored under ‘m/z array’ and ‘intensity array’ keys.
chain() - read multiple mzMLb files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
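The entries above describe each spectrum as a dict with ‘m/z array’ and ‘intensity array’ keys. The following is a minimal sketch of consuming dicts of that shape. The helper `base_peak`, the 'id' key and the sample values are assumptions made for illustration, and plain lists stand in for the arrays so that the example runs without a file.

```python
def base_peak(spectrum):
    """Return (m/z, intensity) of the most intense peak, or None if empty."""
    mz = spectrum['m/z array']
    intensity = spectrum['intensity array']
    if len(mz) == 0:
        return None
    # Position of the largest intensity value.
    top = max(range(len(intensity)), key=lambda i: intensity[i])
    return mz[top], intensity[top]


# Stand-in data shaped like the dicts described above (hypothetical values).
fake_spectra = [
    {'id': 'scan=1',
     'm/z array': [100.0, 200.5, 300.2],
     'intensity array': [10.0, 55.0, 7.0]},
    {'id': 'scan=2', 'm/z array': [], 'intensity array': []},
]

base_peaks = [base_peak(s) for s in fake_spectra]
```

With a real file, the same helper could presumably be applied to each dict produced by looping over read() or over an MzMLb instance.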
Controlled Vocabularies and Caching
mzML relies on controlled vocabularies to describe its contents extensibly.
Every MzML needs a copy of PSI-MS CV, which it handles using the psims library.
If you want to save time when creating instances of MzML, consider enabling the psims cache.
See the psims documentation on how to enable and configure the cache. Alternatively, you can handle CV creation yourself and pass a pre-created instance using the cv parameter to MzMLb.
See also
Controlled Vocabulary Terms for more details on how they are used.
Handling Time Units and Other Qualified Quantities
mzMLb contains information which may be described using a variety of different time units. See Unit Handling for more information.
- class pyteomics.mzmlb.ExternalArrayRegistry(registry, chunk_size=None)
Bases: object
Read chunks out of a single long array.
This is an implementation detail of MzMLb.
- registry
A mapping from array name to the out-of-core array object.
- Type:
Mapping
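Reading chunks out of a single long array can be pictured with the sketch below. It is not the library's implementation: `ChunkedArraySketch` and `CachedChunk` are hypothetical names, a Python list stands in for the out-of-core array, and the caching policy (keep only the most recent chunk) is an assumption.

```python
from collections import namedtuple

# Hypothetical record with a (start, end, array) shape for the cached chunk.
CachedChunk = namedtuple('CachedChunk', ['start', 'end', 'array'])


class ChunkedArraySketch:
    """Serve (offset, length) slices of one long array, one chunk at a time."""

    def __init__(self, long_array, chunk_size=4):
        self.long_array = long_array
        self.chunk_size = chunk_size
        self.cache = None
        self.loads = 0  # how many times the backing array was read

    def _load(self, offset, length):
        # Read at least chunk_size items so nearby requests hit the cache.
        end = min(len(self.long_array), offset + max(length, self.chunk_size))
        self.loads += 1
        self.cache = CachedChunk(offset, end, self.long_array[offset:end])

    def get(self, offset, length):
        end = offset + length
        cached = self.cache
        if cached is None or offset < cached.start or end > cached.end:
            self._load(offset, length)
            cached = self.cache
        lo = offset - cached.start
        return cached.array[lo:lo + length]


reader = ChunkedArraySketch(list(range(10)), chunk_size=4)
first = reader.get(0, 2)    # loads items 0..3
second = reader.get(2, 2)   # served from the cached chunk
third = reader.get(6, 3)    # outside the cache, triggers a second load
```

Consecutive spectra usually sit next to each other in the long array, which is why a sketch like this reads ahead by a whole chunk.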
- class pyteomics.mzmlb.ExternalDataMzML(*args, **kwargs)
Bases: MzML
An MzML parser that reads data arrays from an external provider.
This is an implementation detail of MzMLb.
- class binary_array_record(data, compression, dtype, source, key)
Bases: binary_array_record
Hold all of the information about a base64 encoded array needed to decode the array.
- __init__()
- compression
Alias for field number 1
- count(value, /)
Return number of occurrences of value.
- data
Alias for field number 0
- dtype
Alias for field number 2
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- key
Alias for field number 4
- source
Alias for field number 3
- build_byte_index()
Build the byte offset index by either reading these offsets from the file at _byte_offset_filename, or falling back to the method used by IndexedXML or IndexedTextReader if this operation fails due to an IOError.
- build_id_cache()
Construct a cache for each element in the document, indexed by id attribute.
- build_tree()
Build and store the ElementTree instance for the underlying file.
- clear_id_cache()
Clear the element ID cache.
- clear_tree()
Remove the saved ElementTree.
- decode_data_array(array_name, offset, length, transform=None, dtype=<class 'numpy.float64'>)
Decode a base64-encoded, compressed bytestring into a numerical array.
- Parameters:
- Return type:
np.ndarray
- get_by_id(elem_id, id_key=None, element_type=None, **kwargs)
Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
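The idea of seeking directly to an entry through an offset index can be illustrated with the standard library alone. This sketch does not use the reader's own index. The document bytes and the helpers `build_offset_index` and `read_entry` are hypothetical, and the regular-expression scan is a simplification chosen for brevity.

```python
import io
import re

# A tiny stand-in document; real files are far larger and richer.
document = (
    b'<run>'
    b'<spectrum id="scan=1">a</spectrum>'
    b'<spectrum id="scan=2">bb</spectrum>'
    b'<spectrum id="scan=3">ccc</spectrum>'
    b'</run>'
)


def build_offset_index(data):
    """Map each spectrum id to the byte offset of its opening tag."""
    pattern = re.compile(rb'<spectrum id="([^"]+)"')
    return {m.group(1).decode(): m.start() for m in pattern.finditer(data)}


def read_entry(stream, offset):
    """Seek straight to an entry and read up to its closing tag."""
    stream.seek(offset)
    rest = stream.read()
    end = rest.index(b'</spectrum>') + len(b'</spectrum>')
    return rest[:end]


index = build_offset_index(document)
stream = io.BytesIO(document)
entry = read_entry(stream, index['scan=2'])
```

An id that is missing from such an index would need the slower route of parsing from the start of the file, which is the fallback the description above mentions.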
- iterfind(path, **kwargs)
Parse the XML and yield info on elements with specified local name or by specified “XPath”.
- Parameters:
path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
**kwargs – passed to self._get_info_smart().
- Returns:
out
- Return type:
iterator
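Matching on local names means that the namespace part of each tag is ignored. The sketch below shows that idea with the standard library's ElementTree rather than with iterfind() itself. The helpers `local_name` and `iter_by_local_name` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<mzML xmlns="http://psi.hupo.org/ms/mzml">'
    '<run><spectrumList>'
    '<spectrum id="scan=1" index="0"/>'
    '<spectrum id="scan=2" index="1"/>'
    '</spectrumList></run>'
    '</mzML>'
)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit('}', 1)[-1]


def iter_by_local_name(root, name):
    """Yield every element whose local name matches, whatever its namespace."""
    for element in root.iter():
        if local_name(element.tag) == name:
            yield element


root = ET.fromstring(xml_text)
ids = [el.get('id') for el in iter_by_local_name(root, 'spectrum')]
```

This is why a path given to iterfind() is written with bare element names such as spectrum, without any namespace prefix.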
- map(target=None, workers=None, args=None, kwargs=None, method='mp', **_kwargs)
Execute the target function over entries of this object in parallel. The type of parallelism is determined by the method parameter.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int, optional) – The number of worker threads or processes to use. The default depends on the method parameter.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
method (str, optional) – The type of parallelism to use. Can be one of the following:
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
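Because results come back out of order, a target function usually needs to return something that identifies the entry it worked on. The sketch below illustrates that pattern with a standard-library thread pool instead of the reader's own map(). The `summarize` target, its `scale` keyword and the sample entries are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def summarize(entry, scale=1.0):
    """Hypothetical target: return the entry id together with its result."""
    return entry['id'], scale * sum(entry['intensity array'])


entries = [
    {'id': 'scan=1', 'intensity array': [1.0, 2.0]},
    {'id': 'scan=2', 'intensity array': [3.0]},
    {'id': 'scan=3', 'intensity array': [4.0, 5.0]},
]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(summarize, e, scale=2.0) for e in entries]
    # as_completed yields in completion order, not submission order.
    results = dict(f.result() for f in as_completed(futures))

# Restore the original order by looking results up by id.
ordered = [results[e['id']] for e in entries]
```

Keying the results by id makes the final order independent of which worker finished first.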
- pmap(target=None, workers=None, args=None, kwargs=None, **_kwargs)
Execute the target function over entries of this object across up to workers processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int or None, optional) – The number of worker processes to use. If not a positive integer, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- classmethod prebuild_byte_offset_file(path)
Construct a new XML reader, build its byte offset index and write it to file.
- Parameters:
path (str) – The path to the file to parse
- tmap(target=None, workers=None, args=None, kwargs=None, chunk_size=None, **_kwargs)
Execute the target function over entries of this object across up to workers threads.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
Warning
target must be thread-safe. The target function cannot interact with the underlying file object directly.
workers (int or None, optional) – The number of worker threads to use. If not a positive integer, defaults to the number of available CPUs.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
chunk_size (int, optional) – The number of work items to hand out to each worker thread at a time. If not specified, defaults to the chunk_size attribute of this object.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- write_byte_offsets()
Write the byte offsets in _offset_index to the file at _byte_offset_filename.
- class pyteomics.mzmlb.HDF5ByteBuffer(buffer, offset=None)
Bases: RawIOBase
Helper class that looks file-like so that we can pass a HDF5 byte dataset to an arbitrary XML parser.
Implements RawIOBase for reading.
- close()
Flush and close the IO object.
This method has no effect if the file is already closed.
- fileno()
Return underlying file descriptor if one exists.
Raise OSError if the IO object does not use a file descriptor.
- flush()
Flush write buffers, if applicable.
This is not implemented for read-only and non-blocking streams.
- isatty()
Return whether this is an ‘interactive’ stream.
Return False if it can’t be determined.
- readable()
Return whether object was opened for reading.
If False, read() will raise OSError.
- readline(size=-1, /)
Read and return a line from the stream.
If size is specified, at most size bytes will be read.
The line terminator is always b'\n' for binary files; for text files, the newlines argument to open can be used to select the line terminator(s) recognized.
- readlines(hint=-1, /)
Return a list of lines from the stream.
hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.
- seek(offset, whence=0)
Change the stream position to the given byte offset.
- offset
The stream position, relative to ‘whence’.
- whence
The relative position to seek from.
The offset is interpreted relative to the position indicated by whence. Values for whence are:
os.SEEK_SET or 0 – start of stream (the default); offset should be zero or positive
os.SEEK_CUR or 1 – current stream position; offset may be negative
os.SEEK_END or 2 – end of stream; offset is usually negative
Return the new absolute position.
- seekable()
Return whether object supports random access.
If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().
- truncate(size=None, /)
Truncate file to size bytes.
File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Return the new size.
- writable()
Return whether object was opened for writing.
If False, write() will raise OSError.
- writelines(lines, /)
Write a list of lines to stream.
Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.
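To make the role of such a file-like wrapper concrete, here is a minimal sketch of a read-only RawIOBase subclass over an in-memory buffer, handed to a standard XML parser. It is not the HDF5ByteBuffer implementation. `BytesWindowSketch` is a hypothetical class, and a bytes object stands in for the HDF5 byte dataset.

```python
import io
import xml.etree.ElementTree as ET


class BytesWindowSketch(io.RawIOBase):
    """Read-only, seekable file-like view over a bytes-like buffer."""

    def __init__(self, buffer, offset=0):
        super().__init__()
        self._buffer = memoryview(buffer)
        self._pos = offset

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = len(self._buffer) + offset
        else:
            raise ValueError('unsupported whence: %r' % (whence,))
        if new_pos < 0:
            raise ValueError('negative seek position')
        self._pos = new_pos
        return self._pos

    def readinto(self, target):
        # RawIOBase builds read() and readall() on top of readinto().
        chunk = self._buffer[self._pos:self._pos + len(target)]
        target[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


payload = b'<a><b>text</b></a>'
raw = BytesWindowSketch(payload)
head = raw.read(3)
raw.seek(-4, io.SEEK_END)   # four bytes before the end of the stream
tail = raw.read()

# Any parser that only needs read() can consume the wrapper.
root = ET.parse(BytesWindowSketch(payload)).getroot()
```

Only readinto(), seek() and tell() carry real logic here; the remaining reading methods listed above are inherited from RawIOBase.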
- class pyteomics.mzmlb.MzMLb(path, hdfargs=None, mzmlargs=None, allow_updates=False, use_index=True, **kwargs)
Bases: TimeOrderedIndexedReaderMixin, TaskMappingMixin
A parser for mzMLb [1].
Provides an identical interface to MzML.
- path
The mzMLb file path or a file-like object providing it.
- Type:
str, Path-like, or file-like object
- handle
The raw HDF5 file container.
- Type:
h5py.File
- mzml_parser
The mzML parser for the XML stream inside the HDF5 file with special behavior for retrieving the out-of-band data arrays from their respective storage locations.
- Type:
References
- [1] Bhamber, R. S., Jankevics, A., Deutsch, E. W., Jones, A. R., & Dowsey, A. W. (2021). MzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements. Journal of Proteome Research, 20(1), 172–183. https://doi.org/10.1021/acs.jproteome.0c00192
- __init__(path, hdfargs=None, mzmlargs=None, allow_updates=False, use_index=True, **kwargs)
Instantiate a MultiProcessingTaskMappingMixin object, set default parameters for IPC.
- Parameters:
queue_timeout (float, keyword only, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.
queue_size (int, keyword only, optional) – The length of IPC queue used.
workers (int, keyword only, optional) – Number of worker processes or threads to spawn when map() is called. This can also be specified in the map() call.
- get_by_id(id)
Parse the file and return the element with id attribute equal to elem_id. Returns None if no such element is found.
- get_dataset(name)
Get an HDF5 dataset by its name or path relative to the root node.
Warning
Because this accesses HDF5 data directly, it may be possible to mutate the underlying file if allow_updates is True.
- Parameters:
name (str) – The dataset name or path.
- Return type:
h5py.Dataset or h5py.Group
- Raises:
KeyError – The name is not found.
- map(target=None, workers=None, args=None, kwargs=None, method='mp', **_kwargs)
Execute the target function over entries of this object in parallel. The type of parallelism is determined by the method parameter.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int, optional) – The number of worker threads or processes to use. The default depends on the method parameter.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
method (str, optional) – The type of parallelism to use. Can be one of the following:
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- pmap(target=None, workers=None, args=None, kwargs=None, **_kwargs)
Execute the target function over entries of this object across up to workers processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int or None, optional) – The number of worker processes to use. If not a positive integer, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- tmap(target=None, workers=None, args=None, kwargs=None, chunk_size=None, **_kwargs)
Execute the target function over entries of this object across up to workers threads.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
Warning
target must be thread-safe. The target function cannot interact with the underlying file object directly.
workers (int or None, optional) – The number of worker threads to use. If not a positive integer, defaults to the number of available CPUs.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
chunk_size (int, optional) – The number of work items to hand out to each worker thread at a time. If not specified, defaults to the chunk_size attribute of this object.
**_kwargs – Additional keyword arguments to be passed to the target function.
- Yields:
object – The work item returned by the target function.
- class pyteomics.mzmlb.chunk_interval_cache_record(start, end, array)
Bases: chunk_interval_cache_record
- __init__()
- array
Alias for field number 2
- count(value, /)
Return number of occurrences of value.
- end
Alias for field number 1
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- start
Alias for field number 0
- pyteomics.mzmlb.delta_predict(data, copy=True)
Reverse the lossy transformation of the delta compression helper.
- Parameters:
data (numpy.ndarray) – The data to transform
copy (bool) – Whether to make a copy of the data array or transform it in-place.
- Returns:
The transformed data array
- Return type:
numpy.ndarray
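The effect of the copy flag, and the general shape of a delta-style reversal, can be sketched as follows. The recurrence used here is an assumption for illustration and may differ from what delta_predict actually computes. Both `delta_encode_sketch` and `delta_decode_sketch` are hypothetical, and plain lists replace NumPy arrays.

```python
def delta_encode_sketch(values):
    """Hypothetical encoder: store each value relative to its predecessor."""
    out = list(values)
    for i in range(2, len(values)):
        out[i] = values[i] - values[i - 1] + values[0]
    return out


def delta_decode_sketch(data, copy=True):
    """Undo delta_encode_sketch, optionally reusing the input list."""
    out = list(data) if copy else data
    for i in range(2, len(out)):
        # out[i - 1] is already decoded, so values are rebuilt in order.
        out[i] = out[i] + out[i - 1] - out[0]
    return out


original = [100, 103, 107, 112, 118]
encoded = delta_encode_sketch(original)
snapshot = list(encoded)

decoded = delta_decode_sketch(encoded)               # encoded is untouched
untouched = encoded == snapshot
in_place = delta_decode_sketch(encoded, copy=False)  # encoded is overwritten
```

Transforming in place avoids a second array of the same size, at the cost of losing the encoded values.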
- class pyteomics.mzmlb.external_array_slice(array_name, offset, length, source, transform, key, dtype)
Bases: external_array_slice
- __init__()
- array_name
Alias for field number 0
- count(value, /)
Return number of occurrences of value.
- dtype
Alias for field number 6
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- key
Alias for field number 5
- length
Alias for field number 2
- offset
Alias for field number 1
- source
Alias for field number 3
- transform
Alias for field number 4
Alias for field number 4
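The "alias for field number" entries describe a named tuple: each attribute is another name for a tuple position. The sketch below builds a stand-in with the same field order and shows one way such a record might be resolved into values. `ArraySliceSketch` and `resolve` are hypothetical, and the meaning given to each field here is an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in with the same field order as the class above.
ArraySliceSketch = namedtuple(
    'ArraySliceSketch',
    ['array_name', 'offset', 'length', 'source', 'transform', 'key', 'dtype'],
)


def resolve(slice_record, arrays):
    """Cut the described span out of a named array and apply its transform."""
    start = slice_record.offset
    span = arrays[slice_record.array_name][start:start + slice_record.length]
    if slice_record.transform is not None:
        span = slice_record.transform(span)
    return span


arrays = {'mz': [1.0, 2.0, 3.0, 4.0, 5.0]}
record = ArraySliceSketch('mz', 1, 3, None, None, 'm/z array', float)
values = resolve(record, arrays)

# Attribute access and positional access refer to the same fields.
same_field = record.offset == record[1]
doubled = resolve(record._replace(transform=lambda xs: [2 * x for x in xs]), arrays)
```

The inherited count() and index() methods listed above are ordinary tuple methods and operate on the field values.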
- pyteomics.mzmlb.linear_predict(data, copy=True)
Reverse the lossy transformation of the linear interpolation compression helper.
- Parameters:
data (numpy.ndarray) – The data to transform
copy (bool) – Whether to make a copy of the data array or transform it in-place.
- Returns:
The transformed data array
- Return type:
numpy.ndarray
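A linear-prediction scheme stores how far each value deviates from a straight-line extrapolation of the two values before it. The following sketch shows a round trip under an assumed recurrence, which may differ from what linear_predict actually computes. `linear_encode_sketch` and `linear_decode_sketch` are hypothetical, and plain lists replace NumPy arrays.

```python
def linear_encode_sketch(values):
    """Hypothetical encoder: store deviations from a linear extrapolation."""
    out = list(values)
    for i in range(2, len(values)):
        out[i] = values[i] - 2 * values[i - 1] + values[i - 2] + values[1]
    return out


def linear_decode_sketch(data, copy=True):
    """Undo linear_encode_sketch, optionally reusing the input list."""
    out = list(data) if copy else data
    for i in range(2, len(out)):
        # out[i - 1] and out[i - 2] are already decoded at this point.
        out[i] = out[i] + 2 * out[i - 1] - out[i - 2] - out[1]
    return out


original = [100, 103, 107, 112, 118]
encoded = linear_encode_sketch(original)
decoded = linear_decode_sketch(encoded)
```

In this example the spacing between values grows by a constant amount, so every encoded residual after the first two entries is identical, which is the kind of regularity a general-purpose compressor handles well.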