featurexml - reader for featureXML files

Summary

featureXML is a format specified in the OpenMS project. It defines a list of LC-MS features observed in an experiment.

This module provides a minimalistic way to extract information from featureXML files. You can use the old functional interface (read()) or the new object-oriented interface (FeatureXML) to iterate over entries in <feature> elements. FeatureXML also supports direct indexing with feature IDs.
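As a minimal sketch of the two interfaces described above: the helper `collect_ids` and the function `demo` are hypothetical names, and the `'id'` key is assumed to carry the `id` attribute of each `<feature>` element. The pyteomics import is kept inside `demo` so the helper can be tried on plain dicts without a featureXML file at hand.

```python
def collect_ids(features):
    """Return the 'id' value of every feature dict in an iterable."""
    return [feature['id'] for feature in features]


def demo(path):
    """Exercise both interfaces on a featureXML file at `path` (hypothetical)."""
    # Imported lazily so collect_ids() stays usable without pyteomics/lxml.
    from pyteomics.openms import featurexml

    # Functional interface: iterate over features as dicts.
    with featurexml.read(path) as reader:
        ids = collect_ids(reader)

    # Object-oriented interface: index the reader directly with a feature ID.
    with featurexml.FeatureXML(path) as reader:
        first = reader[ids[0]] if ids else None
    return ids, first
```

Calling `demo('experiment.featureXML')` (an assumed file name) would return the list of feature IDs and the dict for the first one.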

Data access

FeatureXML - a class representing a single featureXML file. Other data access functions use this class internally.

read() - iterate through features in a featureXML file. Data from a single feature are converted to a human-readable dict.

chain() - read multiple featureXML files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

Dependencies

This module requires lxml.

pyteomics.openms.featurexml.chain(*args, **kwargs)

Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters: files – Iterable of file names or file objects.
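The chaining behaviour can be pictured roughly as follows. This is an illustrative sketch, not the library's code: `chain_from_iterable`, `fake_read` and `FAKE_FILES` are hypothetical stand-ins, with `fake_read` playing the role of read() so the example runs without any featureXML files.

```python
from contextlib import contextmanager


def chain_from_iterable(read, files, **kwargs):
    """Yield entries from each file in turn, opening one reader at a time."""
    for source in files:
        with read(source, **kwargs) as reader:
            yield from reader


# Stand-in data: in-memory "files" mapped to the feature dicts they contain.
FAKE_FILES = {
    'a.featureXML': [{'id': 'f_1'}, {'id': 'f_2'}],
    'b.featureXML': [{'id': 'f_3'}],
}


@contextmanager
def fake_read(source, **kwargs):
    """Stand-in for read(): a context manager yielding an iterator of dicts."""
    yield iter(FAKE_FILES[source])


ids = [f['id'] for f in chain_from_iterable(fake_read, ['a.featureXML', 'b.featureXML'])]
```

With the real module, the equivalent call would be `featurexml.chain.from_iterable(paths, read_schema=False)`, where `paths` is any iterable of file names.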

class pyteomics.openms.featurexml.FeatureXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)

Bases: MultiProcessingXML

Parser class for featureXML files.

__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)

Create an indexed XML parser object.

Parameters:

source (str or file) – File name or file-like object corresponding to an XML file.
read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
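To make the idea of a byte-offset index concrete, here is a small standard-library sketch. It is not how pyteomics builds its index; `build_offset_index`, `read_feature_at` and the sample document are hypothetical, and the regular expression assumes flat, non-nested `<feature>` elements.

```python
import io
import re

# Matches the opening tag of a <feature> element and captures its id attribute.
FEATURE_OPEN = re.compile(rb'<feature\s[^>]*\bid="([^"]+)"')


def build_offset_index(data):
    """Map each feature id to the byte offset of its opening tag."""
    return {m.group(1).decode(): m.start() for m in FEATURE_OPEN.finditer(data)}


def read_feature_at(stream, offset):
    """Seek to `offset` and return the bytes up to and including </feature>."""
    stream.seek(offset)
    chunk = stream.read()
    end = chunk.index(b'</feature>') + len(b'</feature>')
    return chunk[:end]


DOC = (b'<featureMap><featureList count="2">'
       b'<feature id="f_1"><charge>2</charge></feature>'
       b'<feature id="f_2"><charge>3</charge></feature>'
       b'</featureList></featureMap>')

index = build_offset_index(DOC)
second = read_feature_at(io.BytesIO(DOC), index['f_2'])
```

Once the index exists, fetching `f_2` needs one seek rather than a parse from the start of the file, which is the benefit the use_index option is described as providing.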

build_byte_index()

Build up an index of offsets for elements.

Returns: out
Return type: TagSpecificXMLByteIndex

build_id_cache(): Construct a cache for each element in the document, indexed by id attribute.

build_tree(): Build and store the ElementTree instance for the underlying file.

clear_id_cache(): Clear the element ID cache.

clear_tree(): Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is described in the offset index, it is retrieved by seeking directly to the starting position of its entry; otherwise the parser falls back to parsing from the start of the file.

Parameters:

elem_id (str) – The id value of the entity to retrieve.
id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.

Return type:

dict

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:

path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
**kwargs – Passed to self._get_info_smart().

Returns:

out

Return type:

iterator
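The substitution of local names with local-name() checks can be illustrated with a small rewriting function. This is only a sketch of the idea under simplifying assumptions: `local_name_xpath` is a hypothetical helper, it handles plain slash-separated names and `*` wildcards, and it does not deal with predicates, so it should not be read as the library's actual translation.

```python
def local_name_xpath(path):
    """Rewrite each named step of a simple path as a local-name() check."""
    steps = []
    for step in path.split('/'):
        if step in ('', '*'):
            # Empty steps preserve leading '/' and '//'; wildcards pass through.
            steps.append(step)
        else:
            steps.append("*[local-name()='{}']".format(step))
    return '/'.join(steps)


relative = local_name_xpath('featureList/feature')
anywhere = local_name_xpath('//feature')
```

Matching on local names is what allows the path to be written without namespaces, as the description of the path parameter requests.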

map(target=None, workers=None, args=None, kwargs=None, method='mp', **_kwargs)

Execute the target function over entries of this object in parallel. The type of parallelism is determined by the method parameter.

Results will be returned out of order.

Parameters:

target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int, optional) – The number of worker threads or processes to use. The default depends on the method parameter.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
method (str, optional) –
The type of parallelism to use. Can be one of the following:
- one of ‘p’, ‘mp’, ‘processes’, or ‘multiprocessing’: use multiprocessing. This is the default. It is equivalent to calling pmap(); see there for details.
- one of ‘t’, ‘threading’, or ‘threads’: use threading. It is equivalent to calling tmap(); see there for details.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.
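Because results arrive out of order, one workable pattern is to have the target return an identifier along with its value. The sketch below assumes each feature dict has an `'id'` key and may have an `'intensity'` key; `summarize`, `collect` and `demo` are hypothetical names, and the pyteomics import is deferred so the helpers can be tested on plain dicts.

```python
def summarize(feature):
    """Target for map(): return (id, intensity) so results can be matched up later."""
    return feature['id'], feature.get('intensity')


def collect(results):
    """Build an id -> value mapping from (id, value) pairs arriving in any order."""
    return dict(results)


def demo(path):
    """Run summarize() over a featureXML file at `path` (hypothetical) in parallel."""
    # Imported lazily so the helpers above stay usable without pyteomics/lxml.
    from pyteomics.openms import featurexml

    with featurexml.FeatureXML(path) as reader:
        return collect(reader.map(summarize))
```

The target is defined at module level so that it can be pickled for worker processes. On platforms that spawn rather than fork, the call to `demo` would normally sit under an `if __name__ == '__main__':` guard.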

pmap(target=None, workers=None, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to workers processes.

Results will be returned out of order.

Parameters:

target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int or None, optional) – The number of worker processes to use. If not a positive integer, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.

reset(): Resets the iterator to its initial state.

tmap(target=None, workers=None, args=None, kwargs=None, chunk_size=None, **_kwargs)

Execute the target function over entries of this object across up to workers threads.

Results will be returned out of order.

Parameters:

target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
Warning: target must be thread-safe. The target function cannot interact with the underlying file object directly.
workers (int or None, optional) – The number of worker threads to use. If not a positive integer, defaults to the number of available CPUs.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
chunk_size (int, optional) – The number of work items to hand out to each worker thread at a time. If not specified, defaults to chunk_size attribute of this object.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.
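One way to satisfy the thread-safety requirement is to guard any shared state with a lock. The sketch below uses `concurrent.futures.ThreadPoolExecutor` over stand-in dicts in place of tmap(), so it runs without a featureXML file; `ChargeTally` is a hypothetical callable, and the `'charge'` key is an assumption about the feature dicts.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class ChargeTally:
    """Callable target that counts charge states across threads safely."""

    def __init__(self):
        self._lock = threading.Lock()
        self.counts = Counter()

    def __call__(self, feature):
        charge = feature.get('charge')
        # The lock serializes updates to the shared counter.
        with self._lock:
            self.counts[charge] += 1
        return feature['id'], charge


# Stand-in for entries yielded by a reader: 100 features alternating charge 2 and 3.
features = [{'id': 'f_%d' % i, 'charge': 2 + i % 2} for i in range(100)]

tally = ChargeTally()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(tally, features))
```

Note that `ThreadPoolExecutor.map` preserves input order, whereas tmap() is documented as returning results out of order, so returning the ID with each result remains useful there.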

pyteomics.openms.featurexml.read(source, read_schema=True, iterative=True, use_index=False)

Parse source and iterate through features.

Parameters:

source (str or file) – A path to a target featureXML file or the file object itself.
read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the file header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or to suppress the related warnings.
iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for feature elements. Default is False.

Returns:

out – An iterator over the dicts with feature properties.

Return type:

iterator
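The memory benefit of iterative parsing can be seen in a standard-library sketch. This is not the parser pyteomics uses; `iter_features` is a hypothetical generator built on `xml.etree.ElementTree.iterparse`, and the sample document and the `charge` child element are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_features(source):
    """Yield a small dict per <feature>, discarding each element once it is used."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'feature':
            yield {'id': elem.get('id'), 'charge': int(elem.findtext('charge'))}
            # Drop the element's children and text so parsed data does not accumulate.
            elem.clear()


DOC = (b'<featureMap><featureList count="2">'
       b'<feature id="f_1"><charge>2</charge></feature>'
       b'<feature id="f_2"><charge>3</charge></feature>'
       b'</featureList></featureMap>')

parsed = list(iter_features(io.BytesIO(DOC)))
```

Each feature is converted and released before the next one is read, so the content of only one feature needs to be held at a time.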

Pyteomics documentation v5.0