mzmlb - reader for mass spectrometry data in mzMLb format¶
Warning
This is a Provisional Implementation. The mzMLb format has been published but is not yet broadly available.
Summary¶
mzMLb is an HDF5 container format wrapping the standard mzML XML format for raw mass spectrometry data storage. Please refer to [1] for more information about mzMLb and its features, and to psidev.info for the detailed specification of the format and structure of mzML files.
This module provides a minimalistic way to extract information from mzMLb files. You can use the old functional interface (read()) or the new object-oriented interface (MzMLb) to iterate over entries in <spectrum> elements. MzMLb also supports direct indexing with spectrum IDs or indices.
Data access¶
MzMLb
- a class representing a single mzMLb file. Other data access functions use this class internally.
read()
- iterate through spectra in mzMLb file. Data from a single spectrum are converted to a human-readable dict. Spectra themselves are stored under ‘m/z array’ and ‘intensity array’ keys.
chain()
- read multiple mzMLb files at once.
chain.from_iterable()
- read multiple files at once, using an iterable of files.
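As a quick illustration, the snippet below sketches both interfaces. The file path and the base_peak helper are hypothetical (not part of pyteomics), and the pyteomics calls are guarded so the sketch only runs where such a file actually exists:

```python
import os

PATH = 'tests/test.mzMLb'  # hypothetical example file

def base_peak(spectrum):
    """Return the (m/z, intensity) pair of the most intense peak in a
    spectrum dict, whose peaks live under the 'm/z array' and
    'intensity array' keys."""
    intensities = spectrum['intensity array']
    idx = max(range(len(intensities)), key=lambda i: intensities[i])
    return spectrum['m/z array'][idx], intensities[idx]

if os.path.exists(PATH):  # only run where the example file is present
    from pyteomics import mzmlb

    # Functional interface: lazily iterate over <spectrum> entries.
    with mzmlb.read(PATH) as reader:
        for spectrum in reader:
            print(spectrum['id'], base_peak(spectrum))

    # Object-oriented interface: also supports direct indexing.
    with mzmlb.MzMLb(PATH) as data:
        first = data[0]            # by ordinal index
        same = data[first['id']]   # by spectrum ID

    # To iterate over several files in sequence:
    # for spectrum in mzmlb.chain('run1.mzMLb', 'run2.mzMLb'): ...
```

Both readers yield the same human-readable dicts, so per-spectrum logic like base_peak can be shared between the functional and object-oriented styles.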
Controlled Vocabularies¶
mzMLb relies on controlled vocabularies to describe its contents extensibly. See Controlled Vocabulary Terms for more details on how they are used.
Handling Time Units and Other Qualified Quantities¶
mzMLb contains information which may be described using a variety of different time units. See Unit Handling for more information.
References
[1] Bhamber, R. S., Jankevics, A., Deutsch, E. W., Jones, A. R., & Dowsey, A. W. (2021). MzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements. Journal of Proteome Research, 20(1), 172–183. https://doi.org/10.1021/acs.jproteome.0c00192
- class pyteomics.mzmlb.ExternalArrayRegistry(registry, chunk_size=None)[source]¶
Bases:
object
Read chunks out of a single long array.
This is an implementation detail of MzMLb.
- registry¶
A mapping from array name to the out-of-core array object.
- Type:
Mapping
- class pyteomics.mzmlb.ExternalDataMzML(*args, **kwargs)[source]¶
Bases:
MzML
An MzML parser that reads data arrays from an external provider.
This is an implementation detail of MzMLb.
- class binary_array_record(data, compression, dtype, source, key)¶
Bases:
binary_array_record
Hold all of the information about a base64 encoded array needed to decode the array.
- __init__()¶
- compression¶
Alias for field number 1
- count(value, /)¶
Return number of occurrences of value.
- data¶
Alias for field number 0
- dtype¶
Alias for field number 2
- index(value, start=0, stop=9223372036854775807, /)¶
Return first index of value.
Raises ValueError if the value is not present.
- key¶
Alias for field number 4
- source¶
Alias for field number 3
- build_id_cache()¶
Construct a cache for each element in the document, indexed by id attribute
- build_tree()¶
Build and store the
ElementTree
instance for the underlying file
- clear_id_cache()¶
Clear the element ID cache
- clear_tree()¶
Remove the saved
ElementTree
.
- decode_data_array(array_name, offset, length, transform=None, dtype=<class 'numpy.float64'>)[source]¶
Decode a base64-encoded, compressed bytestring into a numerical array.
- Parameters:
- Return type:
np.ndarray
- get_by_id(elem_id, id_key=None, element_type=None, **kwargs)¶
Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
- iterfind(path, **kwargs)¶
Parse the XML and yield info on elements with specified local name or by specified “XPath”.
- Parameters:
path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
**kwargs – Passed to self._get_info_smart().
- Returns:
out
- Return type:
iterator
- map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
**_kwargs – Additional keyword arguments to be passed to the target function
- Yields:
object – The work item returned by the target function.
- classmethod prebuild_byte_offset_file(path)¶
Construct a new XML reader, build its byte offset index and write it to file
- Parameters:
path (str) – The path to the file to parse
- write_byte_offsets()¶
Write the byte offsets in _offset_index to the file at _byte_offset_filename.
- class pyteomics.mzmlb.HDF5ByteBuffer(buffer, offset=None)[source]¶
Bases:
RawIOBase
Helper class that looks file-like so that we can pass an HDF5 byte dataset to an arbitrary XML parser.
Implements RawIOBase for reading.
- close()[source]¶
Flush and close the IO object.
This method has no effect if the file is already closed.
- fileno()¶
Returns underlying file descriptor if one exists.
OSError is raised if the IO object does not use a file descriptor.
- flush()¶
Flush write buffers, if applicable.
This is not implemented for read-only and non-blocking streams.
- isatty()[source]¶
Return whether this is an ‘interactive’ stream.
Return False if it can’t be determined.
- readable()[source]¶
Return whether object was opened for reading.
If False, read() will raise OSError.
- readline(size=-1, /)¶
Read and return a line from the stream.
If size is specified, at most size bytes will be read.
The line terminator is always b'\n' for binary files; for text files, the newlines argument to open can be used to select the line terminator(s) recognized.
- readlines(hint=-1, /)¶
Return a list of lines from the stream.
hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.
- seek(offset, whence=0)[source]¶
Change stream position.
Change the stream position to the given byte offset. The offset is interpreted relative to the position indicated by whence. Values for whence are:
0 – start of stream (the default); offset should be zero or positive
1 – current stream position; offset may be negative
2 – end of stream; offset is usually negative
Return the new absolute position.
- seekable()[source]¶
Return whether object supports random access.
If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().
- truncate()¶
Truncate file to size bytes.
File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Returns the new size.
- writable()¶
Return whether object was opened for writing.
If False, write() will raise OSError.
- writelines(lines, /)¶
Write a list of lines to stream.
Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.
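The idea behind HDF5ByteBuffer can be sketched with a simplified in-memory analogue. This is an illustrative sketch only, not the actual implementation: the real class reads from an h5py byte dataset rather than a bytes object, but the file-like contract is the same.

```python
import io

class ByteBuffer(io.RawIOBase):
    """Simplified in-memory analogue of HDF5ByteBuffer: wrap an
    indexable byte source so an XML parser can treat it as a file.
    (Sketch only; the real class reads from an h5py byte dataset.)"""

    def __init__(self, buffer, offset=0):
        super().__init__()
        self.buffer = buffer
        self.offset = offset  # current stream position

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.offset = offset
        elif whence == io.SEEK_CUR:
            self.offset += offset
        elif whence == io.SEEK_END:
            self.offset = len(self.buffer) + offset
        return self.offset

    def readinto(self, b):
        # Copy up to len(b) bytes into the caller-supplied buffer.
        chunk = self.buffer[self.offset:self.offset + len(b)]
        n = len(chunk)
        b[:n] = chunk
        self.offset += n
        return n

# Anything expecting a binary file object can now consume the buffer:
stream = io.BufferedReader(ByteBuffer(b'<mzML>...</mzML>'))
print(stream.read(6))  # b'<mzML>'
```

Because only readinto, seek, readable, and seekable are implemented, io.BufferedReader supplies the rest of the file interface (read, readline, and so on) for free.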
- class pyteomics.mzmlb.MzMLb(path, hdfargs=None, mzmlargs=None, allow_updates=False, use_index=True, **kwargs)[source]¶
Bases:
TimeOrderedIndexedReaderMixin, TaskMappingMixin
A parser for mzMLb [1].
Provides an identical interface to MzML.
- path¶
The mzMLb file path or a file-like object providing it.
- Type:
str, Path-like, or file-like object
- handle¶
The raw HDF5 file container.
- Type:
h5py.File
- mzml_parser¶
The mzML parser for the XML stream inside the HDF5 file with special behavior for retrieving the out-of-band data arrays from their respective storage locations.
- Type:
ExternalDataMzML
References
- [1] Bhamber, R. S., Jankevics, A., Deutsch, E. W., Jones, A. R., & Dowsey, A. W. (2021).
MzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements. Journal of Proteome Research, 20(1), 172–183. https://doi.org/10.1021/acs.jproteome.0c00192
- __init__(path, hdfargs=None, mzmlargs=None, allow_updates=False, use_index=True, **kwargs)[source]¶
Instantiate a TaskMappingMixin object, set default parameters for IPC.
- Parameters:
queue_timeout (float, keyword only, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.
queue_size (int, keyword only, optional) – The length of IPC queue used.
processes (int, keyword only, optional) – Number of worker processes to spawn when map() is called. This can also be specified in the map() call.
- get_by_id(id)[source]¶
Parse the file and return the element with id attribute equal to elem_id. Returns
None
if no such element is found.
- get_dataset(name)[source]¶
Get an HDF5 dataset by its name or path relative to the root node.
Warning
Because this accesses HDF5 data directly, it may be possible to mutate the underlying file if allow_updates is True.
- Parameters:
name (str) – The dataset name or path.
- Return type:
h5py.Dataset or h5py.Group
- Raises:
KeyError – The name is not found.
- map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)¶
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
- Parameters:
target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
**_kwargs – Additional keyword arguments to be passed to the target function
- Yields:
object – The work item returned by the target function.
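In practice, map() is convenient for CPU-bound per-spectrum work. A minimal sketch, assuming a hypothetical total_ion_current reducer and file path (the commented usage simply mirrors the signature documented above):

```python
def total_ion_current(spectrum):
    """Sum the intensity array of one spectrum dict. Defined at module
    level so it is picklable and can be shipped to worker processes."""
    return spectrum['id'], sum(spectrum['intensity array'])

# Hypothetical usage (requires pyteomics and an mzMLb file on disk):
#
#     with mzmlb.MzMLb('tests/test.mzMLb') as data:
#         for sid, tic in data.map(total_ion_current, processes=4):
#             ...  # results arrive in arbitrary order
```

Since results come back out of order, the target function should return enough context (here, the spectrum ID) to reassociate each result with its spectrum.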
- class pyteomics.mzmlb.chunk_interval_cache_record(start, end, array)[source]¶
Bases:
chunk_interval_cache_record
- __init__()¶
- array¶
Alias for field number 2
- count(value, /)¶
Return number of occurrences of value.
- end¶
Alias for field number 1
- index(value, start=0, stop=9223372036854775807, /)¶
Return first index of value.
Raises ValueError if the value is not present.
- start¶
Alias for field number 0
- pyteomics.mzmlb.delta_predict(data, copy=True)[source]¶
Reverse the lossy transformation of the delta compression helper.
- Parameters:
data (numpy.ndarray) – The data to transform
copy (bool) – Whether to make a copy of the data array or transform it in-place.
- Returns:
The transformed data array
- Return type:
numpy.ndarray
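The exact prediction scheme (including its lossy fixed-point details) is defined by the mzMLb format; as a rough, pure-Python illustration of the general idea only, not the pyteomics implementation, a first-order delta transform and its reversal look like:

```python
def delta_encode(values):
    """Keep the first value and store each later value as the
    difference from its predecessor (illustrative sketch only)."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Reverse delta_encode with a running sum."""
    out = deltas[:1]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out
```

Because m/z values within a spectrum increase monotonically, their deltas are small and repetitive and therefore compress well; linear_predict analogously reverses a linear (two-point extrapolation) predictor.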
- class pyteomics.mzmlb.external_array_slice(array_name, offset, length, source, transform, key, dtype)[source]¶
Bases:
external_array_slice
- __init__()¶
- array_name¶
Alias for field number 0
- count(value, /)¶
Return number of occurrences of value.
- dtype¶
Alias for field number 6
- index(value, start=0, stop=9223372036854775807, /)¶
Return first index of value.
Raises ValueError if the value is not present.
- key¶
Alias for field number 5
- length¶
Alias for field number 2
- offset¶
Alias for field number 1
- source¶
Alias for field number 3
- transform¶
Alias for field number 4
- pyteomics.mzmlb.linear_predict(data, copy=True)[source]¶
Reverse the lossy transformation of the linear interpolation compression helper.
- Parameters:
data (numpy.ndarray) – The data to transform
copy (bool) – Whether to make a copy of the data array or transform it in-place.
- Returns:
The transformed data array
- Return type:
numpy.ndarray