Pyteomics documentation v4.7.1

mzmlb - reader for mass spectrometry data in mzMLb format


Warning

This is a Provisional Implementation. The mzMLb format has been published but is not yet broadly available.

Summary

mzMLb is an HDF5 container format that wraps the standard rich mzML XML format for raw mass spectrometry data storage. Please refer to [1] for more information about mzMLb and its features, and to psidev.info for the detailed specification of the format and structure of mzML files.

This module provides a minimalistic way to extract information from mzMLb files. You can use the old functional interface (read()) or the new object-oriented interface (MzMLb) to iterate over entries in <spectrum> elements. MzMLb also supports direct indexing with spectrum IDs or indices.

Data access

MzMLb - a class representing a single mzMLb file. Other data access functions use this class internally.

read() - iterate through spectra in mzMLb file. Data from a single spectrum are converted to a human-readable dict. Spectra themselves are stored under ‘m/z array’ and ‘intensity array’ keys.

chain() - read multiple mzMLb files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.
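As a sketch of the functional interface: spectra come back as plain dicts, so inspecting them needs no special machinery. The file name example.mzMLb below is hypothetical; substitute your own data. The small helper works on any spectrum dict with the standard ‘m/z array’ key:

```python
import os

def peak_count(spectrum):
    """Number of peaks in a spectrum dict as yielded by read()."""
    return len(spectrum['m/z array'])

# 'example.mzMLb' is a hypothetical file name; substitute your own data.
if os.path.exists('example.mzMLb'):
    from pyteomics import mzmlb
    with mzmlb.read('example.mzMLb') as reader:
        for spectrum in reader:
            print(spectrum['id'], peak_count(spectrum))
```

The same loop works with MzMLb directly, which additionally allows random access such as reader[0] or reader[spectrum_id].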

Controlled Vocabularies

mzMLb relies on controlled vocabularies to describe its contents extensibly. See Controlled Vocabulary Terms for more details on how they are used.

Handling Time Units and Other Qualified Quantities

mzMLb contains information which may be described as using a variety of different time units. See Unit Handling for more information.
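For instance, ‘scan start time’ values are typically returned as unit-annotated floats carrying a unit_info attribute (‘minute’ or ‘second’). A hedged sketch of normalizing them to seconds; the dict layout follows the standard mzML scanList structure, and the fallback for plain floats is an assumption made here:

```python
def scan_time_in_seconds(spectrum):
    # 'scan start time' is usually a unit-annotated float with a
    # unit_info attribute; plain floats are assumed here to already
    # be in seconds.
    t = spectrum['scanList']['scan'][0]['scan start time']
    unit = getattr(t, 'unit_info', 'second')
    return float(t) * 60.0 if unit == 'minute' else float(t)
```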

References

class pyteomics.mzmlb.ExternalArrayRegistry(registry, chunk_size=None)[source]

Bases: object

Read chunks out of a single long array

This is an implementation detail of MzMLb

registry

A mapping from array name to the out-of-core array object.

Type:

Mapping

chunk_size

The number of entries to chunk together and keep in memory.

Type:

int

chunk_cache

A mapping from array name to cached array blocks.

Type:

dict
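The caching idea can be sketched in isolation. ChunkedReader below is an illustrative stand-in, not pyteomics' implementation: consecutive slice requests are served from one in-memory block, and a new block is fetched only when a request falls outside it:

```python
class ChunkedReader:
    """Illustrative stand-in for the chunk-caching idea behind
    ExternalArrayRegistry (hypothetical, not pyteomics' code)."""

    def __init__(self, array, chunk_size=4):
        self.array = array        # stands in for an out-of-core HDF5 dataset
        self.chunk_size = chunk_size
        self.chunk_start = 0
        self.chunk = None         # the cached block currently in memory
        self.loads = 0            # number of chunk fetches performed

    def _load(self, start, length):
        self.chunk_start = start
        self.chunk = self.array[start:start + max(self.chunk_size, length)]
        self.loads += 1

    def read(self, offset, length):
        # Fetch a new chunk only if the request is not fully cached.
        if (self.chunk is None or offset < self.chunk_start
                or offset + length > self.chunk_start + len(self.chunk)):
            self._load(offset, length)
        rel = offset - self.chunk_start
        return self.chunk[rel:rel + length]
```

With chunk_size=4, two reads inside the first four entries trigger only one fetch; a read beyond them triggers a second.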

__init__(registry, chunk_size=None)[source]
class pyteomics.mzmlb.ExternalDataMzML(*args, **kwargs)[source]

Bases: MzML

An MzML parser that reads data arrays from an external provider.

This is an implementation detail of MzMLb.

__init__(*args, **kwargs)[source]
class binary_array_record(data, compression, dtype, source, key)

Bases: binary_array_record

Holds all of the information needed to decode a base64-encoded array.

__init__()
compression

Alias for field number 1

count(value, /)

Return number of occurrences of value.

data

Alias for field number 0

decode()

Decode data into a numerical array

Return type:

np.ndarray

dtype

Alias for field number 2

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 4

source

Alias for field number 3

build_byte_index()

Build the byte offset index by either reading these offsets from the file at _byte_offset_filename, or falling back to the method used by IndexedXML or IndexedTextReader if this operation fails due to an IOError

build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

decode_data_array(array_name, offset, length, transform=None, dtype=<class 'numpy.float64'>)[source]

Decode a slice of an external data array into a numerical array.

Parameters:
  • array_name (str) – The name of the external array to read from.

  • offset (int) – The position of the first value of the slice within the external array.

  • length (int) – The number of values in the slice.

  • transform (callable, optional) – A transformation to apply to the decoded array, if any.

  • dtype (type, optional) – The data type to use to decode the binary array.

Return type:

np.ndarray

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.

  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.

Return type:

dict

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.

  • **kwargs (passed to self._get_info_smart().) –

Returns:

out

Return type:

iterator

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters:

path (str) – The path to the file to parse

reset()[source]

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

class pyteomics.mzmlb.HDF5ByteBuffer(buffer, offset=None)[source]

Bases: RawIOBase

Helper class that looks file-like so that we can pass an HDF5 byte dataset to an arbitrary XML parser.

Implements RawIOBase for reading.
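The general pattern can be sketched with the standard library alone. BytesBuffer below is a hypothetical stand-in, not HDF5ByteBuffer itself: implementing readinto(), seek(), and tell() on a RawIOBase subclass is enough for io.BufferedReader, and hence most parsers, to consume the data:

```python
import io

class BytesBuffer(io.RawIOBase):
    """Minimal sketch of exposing an in-memory byte array through the
    RawIOBase interface (hypothetical stand-in for HDF5ByteBuffer)."""

    def __init__(self, data, offset=0):
        self._data = data
        self._pos = offset

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        else:  # io.SEEK_END
            self._pos = len(self._data) + offset
        return self._pos

    def tell(self):
        return self._pos

    def readinto(self, b):
        # Copy as many bytes as fit into the caller's buffer.
        n = min(len(b), len(self._data) - self._pos)
        b[:n] = self._data[self._pos:self._pos + n]
        self._pos += n
        return n
```

Wrapping it as io.BufferedReader(BytesBuffer(data)) yields a buffered, file-like stream suitable for an XML parser.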

__init__(buffer, offset=None)[source]
close()[source]

Flush and close the IO object.

This method has no effect if the file is already closed.

fileno()

Return underlying file descriptor if one exists.

Raise OSError if the IO object does not use a file descriptor.

flush()

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.

isatty()[source]

Return whether this is an ‘interactive’ stream.

Return False if it can’t be determined.

readable()[source]

Return whether object was opened for reading.

If False, read() will raise OSError.

readall()[source]

Read until EOF, using multiple read() calls.

readline(size=-1, /)

Read and return a line from the stream.

If size is specified, at most size bytes will be read.

The line terminator is always b'\n' for binary files; for text files, the newlines argument to open can be used to select the line terminator(s) recognized.

readlines(hint=-1, /)

Return a list of lines from the stream.

hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

seek(offset, whence=0)[source]

Change the stream position to the given byte offset.

offset

The stream position, relative to ‘whence’.

whence

The relative position to seek from.

The offset is interpreted relative to the position indicated by whence. Values for whence are:

  • os.SEEK_SET or 0 – start of stream (the default); offset should be zero or positive

  • os.SEEK_CUR or 1 – current stream position; offset may be negative

  • os.SEEK_END or 2 – end of stream; offset is usually negative

Return the new absolute position.
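These whence semantics match the standard library; a quick illustration with io.BytesIO:

```python
import io

buf = io.BytesIO(b'abcdefgh')

buf.seek(2)                  # os.SEEK_SET (default): absolute offset
assert buf.read(2) == b'cd'

buf.seek(1, io.SEEK_CUR)     # relative to the current position
assert buf.read(1) == b'f'

buf.seek(-2, io.SEEK_END)    # relative to the end of the stream
assert buf.read() == b'gh'
```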

seekable()[source]

Return whether object supports random access.

If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().

tell()[source]

Return current stream position.

truncate(size=None, /)

Truncate file to size bytes.

File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Return the new size.

writable()

Return whether object was opened for writing.

If False, write() will raise OSError.

writelines(lines, /)

Write a list of lines to stream.

Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.

class pyteomics.mzmlb.MzMLb(path, hdfargs=None, mzmlargs=None, allow_updates=False, use_index=True, **kwargs)[source]

Bases: TimeOrderedIndexedReaderMixin, TaskMappingMixin

A parser for mzMLb [1].

Provides an identical interface to MzML.

path

The mzMLb file path or a file-like object providing it.

Type:

str, Path-like, or file-like object

handle

The raw HDF5 file container.

Type:

h5py.File

mzml_parser

The mzML parser for the XML stream inside the HDF5 file with special behavior for retrieving the out-of-band data arrays from their respective storage locations.

Type:

ExternalDataMzML

schema_version

The mzMLb HDF5 schema version, distinct from the mzML schema inside it.

Type:

str

References

[1] Bhamber, R. S., Jankevics, A., Deutsch, E. W., Jones, A. R., & Dowsey, A. W. (2021). MzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements. Journal of Proteome Research, 20(1), 172–183. https://doi.org/10.1021/acs.jproteome.0c00192

__init__(path, hdfargs=None, mzmlargs=None, allow_updates=False, use_index=True, **kwargs)[source]

Instantiate a TaskMappingMixin object, set default parameters for IPC.

Parameters:
  • queue_timeout (float, keyword only, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.

  • queue_size (int, keyword only, optional) – The length of IPC queue used.

  • processes (int, keyword only, optional) – Number of worker processes to spawn when map() is called. This can also be specified in the map() call.

get_by_id(id)[source]

Parse the file and return the element with id attribute equal to id. Returns None if no such element is found.

Parameters:

id (str) – The value of the id attribute to match.

Returns:

out

Return type:

dict or None

get_dataset(name)[source]

Get an HDF5 dataset by its name or path relative to the root node.

Warning

Because this accesses HDF5 data directly, it may be possible to mutate the underlying file if allow_updates is True.

Parameters:

name (str) – The dataset name or path.

Return type:

h5py.Dataset or h5py.Group

Raises:

KeyError – The name is not found.

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.

  • args (Sequence, optional) – Additional positional arguments to be passed to the target function

  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function

  • **_kwargs – Additional keyword arguments to be passed to the target function

Yields:

object – The work item returned by the target function.

class pyteomics.mzmlb.chunk_interval_cache_record(start, end, array)[source]

Bases: chunk_interval_cache_record

__init__()
array

Alias for field number 2

count(value, /)

Return number of occurrences of value.

end

Alias for field number 1

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

start

Alias for field number 0

pyteomics.mzmlb.delta_predict(data, copy=True)[source]

Reverse the lossy transformation of the delta compression helper.

Parameters:
  • data (numpy.ndarray) – The data to transform

  • copy (bool) – Whether to make a copy of the data array or transform it in-place.

Returns:

The transformed data array

Return type:

numpy.ndarray
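Conceptually, delta coding stores differences between consecutive values; since m/z arrays increase monotonically, the residuals are small and compress well. A sketch of the round trip (the forward step here is only for illustration; the real transform is lossy because residuals may be stored at reduced precision):

```python
import numpy as np

mz = np.array([100.0, 100.5, 101.25, 103.0])

# Forward (illustrative): keep the first value, then store differences.
residuals = np.concatenate(([mz[0]], np.diff(mz)))

# Inverse: a cumulative sum reconstructs the original sequence.
restored = np.cumsum(residuals)
assert np.allclose(restored, mz)
```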

class pyteomics.mzmlb.external_array_slice(array_name, offset, length, source, transform, key, dtype)[source]

Bases: external_array_slice

__init__()
array_name

Alias for field number 0

count(value, /)

Return number of occurrences of value.

decode()[source]

Decode data into a numerical array

Return type:

np.ndarray

dtype

Alias for field number 6

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 5

length

Alias for field number 2

offset

Alias for field number 1

source

Alias for field number 3

transform

Alias for field number 4

pyteomics.mzmlb.linear_predict(data, copy=True)[source]

Reverse the lossy transformation of the linear interpolation compression helper.

Parameters:
  • data (numpy.ndarray) – The data to transform

  • copy (bool) – Whether to make a copy of the data array or transform it in-place.

Returns:

The transformed data array

Return type:

numpy.ndarray
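The two-point linear predictor extrapolates each value from the previous two and stores only the residual; reversing it rebuilds the sequence from those residuals. A conceptual sketch (the exact forward transform used by mzMLb may differ, e.g. residuals truncated to lower precision, which is what makes it lossy):

```python
import numpy as np

x = np.array([100.0, 101.0, 102.5, 104.5, 107.0])

# Forward (illustrative): residual[i] = x[i] - (2*x[i-1] - x[i-2]),
# keeping the first two values as seeds.
res = x.copy()
res[2:] = x[2:] - (2 * x[1:-1] - x[:-2])

# Inverse: rebuild sequentially from the two seed values.
out = res.copy()
for i in range(2, len(out)):
    out[i] = res[i] + 2 * out[i - 1] - out[i - 2]
assert np.allclose(out, x)
```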

pyteomics.mzmlb.read(source, dtype=None)[source]

Parse source and iterate through spectra.

Parameters:
  • source (str or file) – A path to a target mzMLb file or the file object itself.

  • dtype (type or dict, optional) – dtype to convert arrays to, one for both m/z and intensity arrays or one for each key. If dict, keys should be ‘m/z array’ and ‘intensity array’.

Returns:

out – An iterator over the dicts with spectrum properties.

Return type:

iterator
