xml - utilities for XML parsing¶

This module is not intended for end users. It implements the abstract classes for all XML parsers, XML and IndexedXML, and some utility functions.

Dependencies¶

This module requires lxml and numpy.

class pyteomics.xml.ByteCountingXMLScanner(source, indexed_tags, block_size=1000000)[source]¶

Bases: _file_obj

Builds a byte offset index of the source XML file for each type of tag in indexed_tags.

Inherits from pyteomics.auxiliary._file_obj to support the object-oriented _keep_state() interface.

__init__(source, indexed_tags, block_size=1000000)[source]¶

Parameters:

indexed_tags (iterable of bytes) – The XML tags (without namespaces) to build indices for.
block_size (int, optional) – The size of each chunk or “block” of the file to hold in memory as a partitioned string at any given time. Defaults to 1000000.

build_byte_index(lookup_id_key_mapping=None)[source]¶

Builds a byte offset index for one or more types of tags.

Parameters: lookup_id_key_mapping (Mapping, optional) – A mapping from tag name to the attribute that identifies each entity of that type. Defaults to ‘id’ for each type of tag.
Returns: Mapping from tag type to dict from identifier to byte offset
Return type: defaultdict(dict)
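
The shape of the returned index can be illustrated with a minimal standard-library sketch. It is not the scanner's actual implementation: it searches one in-memory bytes object with a regular expression instead of reading the file in blocks, and the function and variable names here are chosen for illustration only.

```python
import re
from collections import defaultdict


def build_byte_index(data, indexed_tags, lookup_id_key_mapping=None):
    """Map each tag type to {identifier: byte offset of the opening tag}."""
    # Every tag is identified by its "id" attribute unless overridden.
    keys = {tag: b"id" for tag in indexed_tags}
    keys.update(lookup_id_key_mapping or {})
    index = defaultdict(dict)
    for tag in indexed_tags:
        # Match '<tag ... key="value"', allowing an optional namespace prefix.
        pattern = re.compile(
            rb"<(?:\w+:)?" + re.escape(tag) + rb"\s[^>]*?\b"
            + re.escape(keys[tag]) + rb'="([^"]*)"'
        )
        for match in pattern.finditer(data):
            index[tag][match.group(1).decode("utf-8")] = match.start()
    return index


sample = (
    b'<run><spectrum id="scan=1" index="0"/>'
    b'<spectrum id="scan=2" index="1"/>'
    b'<chromatogram id="TIC"/></run>'
)
offsets = build_byte_index(sample, [b"spectrum", b"chromatogram"])
```

Each stored offset points at the `<` of the opening tag, so a reader can later seek straight to that position.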

replace_entities(key)[source]¶

Replace XML entities in a string with their character representation.

Uses the minimal mapping of XML entities predefined for all XML documents and does not attempt to handle entities defined in an external DTD. This mapping is found in entities.

Parameters: key (str) – The string to substitute
Return type: str
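
As a rough illustration of this kind of substitution, the sketch below handles the five entities that the XML specification predefines. The names `ENTITIES` and `replace_entities` are reused here only for readability; the real method may differ in detail, for example in how it treats unknown entity names (this sketch leaves them untouched, which is an assumption).

```python
import re

# The five entities predefined for every XML document.
ENTITIES = {"quot": '"', "amp": "&", "apos": "'", "lt": "<", "gt": ">"}
_ENTITY_RE = re.compile(r"&(\w+);")


def replace_entities(key):
    """Replace predefined XML entities in a single pass over the string."""
    # Unknown entity names are returned unchanged (assumption of this sketch).
    return _ENTITY_RE.sub(lambda m: ENTITIES.get(m.group(1), m.group(0)), key)


decoded = replace_entities("a &lt; b &amp;&amp; c &gt; &quot;d&quot;")
```

Because the substitution is done in one pass, an escaped ampersand followed by `lt;` decodes to the literal text `&lt;` rather than to `<`.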

class pyteomics.xml.CVParamParser(*args, **kwargs)[source]¶

Bases: XML

A subclass of XML that implements additional processing for cvParam elements. These elements refer to the PSI-MS Controlled Vocabulary, and CVParamParser uses a copy of it for type checking. This class requires psims to work.

cv¶

Type: psims.controlled_vocabulary.controlled_vocabulary.ControlledVocabulary

__init__(*args, **kwargs)[source]¶

Create an XML parser object.

Parameters:

source (str or file) – File name or file-like object corresponding to an XML file.
read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
build_id_cache (bool, optional) – Defines whether a dictionary mapping IDs to XML tree elements should be built and stored on the instance. It is used in XML.get_by_id(), e.g. when using pyteomics.mzid.MzIdentML with retrieve_refs=True.
huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).

build_id_cache()¶: Construct a cache for each element in the document, indexed by id attribute

build_tree()¶: Build and store the ElementTree instance for the underlying file

clear_id_cache()¶: Clear the element ID cache

clear_tree()¶: Remove the saved ElementTree.

get_by_id(elem_id, **kwargs)¶

Parse the file and return the element with id attribute equal to elem_id. Returns None if no such element is found.

Parameters: elem_id (str) – The value of the id attribute to match.
Returns: out
Return type: dict or None

iterfind(path, **kwargs)¶

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:

path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
**kwargs (passed to self._get_info_smart().)

Returns:

out

Return type:

iterator
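
The substitution of local names with local-name() checks can be pictured with the following sketch. It is an illustration under stated assumptions, not the parser's own code: the helper name `to_local_name_xpath` is hypothetical, a "free" path is assumed to become a `//` search, and attribute steps or axes before the first predicate are not handled.

```python
import re

# An element name: starts with a letter or underscore.
_NAME = re.compile(r"[A-Za-z_][\w.\-]*")


def to_local_name_xpath(path):
    """Rewrite element names before the first predicate as local-name() checks."""
    # Everything from the first "[" onwards is left exactly as written.
    head, bracket, rest = path.partition("[")
    head = _NAME.sub(lambda m: "*[local-name()='%s']" % m.group(0), head)
    if not head.startswith("/"):
        # Assumption: a "free" path matches at any depth.
        head = "//" + head
    return head + bracket + rest


free = to_local_name_xpath("spectrum")
absolute = to_local_name_xpath("/mzML/run")
with_predicate = to_local_name_xpath('cvParam[@name="ms level"]')
```

Matching on local names is what lets the same path work regardless of the namespace declared in a particular file.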

reset()¶: Resets the iterator to its initial state.

class pyteomics.xml.IndexSavingXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]¶

Bases: IndexSavingMixin, IndexedXML

An extension to the IndexedXML type which adds facilities to read and write the byte offset index externally.

__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)¶

Create an indexed XML parser object.

Parameters:

source (str or file) – File name or file-like object corresponding to an XML file.
read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.

build_byte_index()¶: Build the byte offset index by reading the offsets from the file at _byte_offset_filename. If reading fails with an IOError, fall back to the method used by IndexedXML or IndexedTextReader.

build_id_cache()¶: Construct a cache for each element in the document, indexed by id attribute

build_tree()¶: Build and store the ElementTree instance for the underlying file

clear_id_cache()¶: Clear the element ID cache

clear_tree()¶: Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)¶

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:

elem_id (str) – The id value of the entity to retrieve.
id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.

Return type:

dict
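
The seek-based retrieval described above can be sketched with the standard library alone. This is a simplified illustration: the helper `read_element_at` is hypothetical, and it assumes the indexed element has no namespace prefix, is not self-closing and does not contain a nested element of the same name.

```python
import io
import xml.etree.ElementTree as ET


def read_element_at(stream, offset, tag):
    """Seek to a known byte offset and parse just one element from there."""
    stream.seek(offset)
    closing = b"</" + tag + b">"
    chunk = stream.read()
    # Cut the chunk right after the first matching closing tag.
    end = chunk.index(closing) + len(closing)
    return ET.fromstring(chunk[:end])


doc = (
    b'<run><spectrum id="a"><cvParam name="ms level" value="1"/></spectrum>'
    b'<spectrum id="b"><cvParam name="ms level" value="2"/></spectrum></run>'
)
# A tiny hand-built offset index: identifier -> byte offset.
offsets = {
    "a": doc.index(b'<spectrum id="a"'),
    "b": doc.index(b'<spectrum id="b"'),
}
element = read_element_at(io.BytesIO(doc), offsets["b"], b"spectrum")
```

Nothing before the stored offset is parsed, which is why indexed lookup stays fast even for entries near the end of a large file.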

iterfind(path, **kwargs)¶

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:

path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
**kwargs (passed to self._get_info_smart().)

Returns:

out

Return type:

iterator

classmethod prebuild_byte_offset_file(path)¶

Construct a new XML reader, build its byte offset index and write it to file

Parameters: path (str) – The path to the file to parse

reset()¶: Resets the iterator to its initial state.

write_byte_offsets()¶: Write the byte offsets in _offset_index to the file at _byte_offset_filename
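
The read-or-rebuild behaviour of an externally stored index can be sketched as follows. The JSON layout, the file name and the helper names are assumptions made for this example; they are not taken from the library.

```python
import json
import os
import tempfile


def write_byte_offsets(path, offsets):
    """Store the offset index next to the data file (JSON is an assumption)."""
    with open(path, "w") as handle:
        json.dump(offsets, handle)


def load_or_build(path, build):
    """Read a saved index, or fall back to rebuilding it on an IOError."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except IOError:
        return build()


calls = []


def scan():
    # Stand-in for a full scan of the XML file.
    calls.append("scan")
    return {"spectrum": {"scan=1": 5, "scan=2": 38}}


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this example.
    index_path = os.path.join(tmp, "run.mzML-offsets.json")
    first = load_or_build(index_path, scan)   # file missing: falls back to scan
    write_byte_offsets(index_path, first)
    second = load_or_build(index_path, scan)  # file present: read from disk
```

The second call never triggers a scan, which is the point of prebuilding the offset file for large inputs.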

class pyteomics.xml.IndexedIterfind(parser, tag_name, **kwargs)[source]¶

Bases: TaskMappingMixin, Iterfind

__init__(parser, tag_name, **kwargs)[source]¶

Instantiate a MultiProcessingTaskMappingMixin object, set default parameters for IPC.

Parameters:

queue_timeout (float, keyword only, optional) – The number of seconds to block, waiting for a result before checking to see if all workers are done.
queue_size (int, keyword only, optional) – The length of IPC queue used.
workers (int, keyword only, optional) – Number of worker processes or threads to spawn when map() is called. This can also be specified in the map() call.

map(target=None, workers=None, args=None, kwargs=None, method='mp', **_kwargs)¶

Execute the target function over entries of this object in parallel. The type of parallelism is determined by the method parameter.

Results will be returned out of order.

Parameters:

target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int, optional) – The number of worker threads or processes to use. The default depends on the method parameter.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
method (str, optional) –
The type of parallelism to use. Can be one of the following:
- one of ‘p’, ‘mp’, ‘processes’, or ‘multiprocessing’: use multiprocessing. This is the default and is equivalent to calling pmap(); see there for details.
- one of ‘t’, ‘threading’, or ‘threads’: use threading. This is equivalent to calling tmap(); see there for details.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.
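
The aliases accepted by the method parameter amount to a small dispatch table. The sketch below only illustrates that mapping; the helper `resolve_method` is hypothetical, and raising ValueError for an unknown alias is an assumption rather than documented behaviour.

```python
PROCESS_ALIASES = ("p", "mp", "processes", "multiprocessing")
THREAD_ALIASES = ("t", "threading", "threads")


def resolve_method(method="mp"):
    """Return the name of the mapping routine selected by a method alias."""
    if method in PROCESS_ALIASES:
        return "pmap"
    if method in THREAD_ALIASES:
        return "tmap"
    # Assumption: unknown aliases are rejected.
    raise ValueError("unknown parallelism method: %r" % (method,))


default_choice = resolve_method()
thread_choice = resolve_method("threads")
```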

pmap(target=None, workers=None, args=None, kwargs=None, **_kwargs)¶

Execute the target function over entries of this object across up to workers processes.

Results will be returned out of order.

Parameters:

target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int or None, optional) – The number of worker processes to use. If not a positive integer, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.

tmap(target=None, workers=None, args=None, kwargs=None, chunk_size=None, **_kwargs)¶

Execute the target function over entries of this object across up to workers threads.

Results will be returned out of order.

Parameters:

target (Callable, optional) –
The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.

Warning

target must be thread-safe. The target function cannot interact with the underlying file object directly.
workers (int or None, optional) – The number of worker threads to use. If not a positive integer, defaults to the number of available CPUs.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
chunk_size (int, optional) – The number of work items to hand out to each worker thread at a time. If not specified, defaults to chunk_size attribute of this object.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.
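
The combination of a thread pool, chunked work items and out-of-order results can be sketched with concurrent.futures. This is a generic illustration, not the library's implementation; `threaded_map`, `chunked` and `scale` are names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def threaded_map(target, items, workers=2, chunk_size=3, args=(), kwargs=None):
    """Apply target to every item on worker threads, one chunk per task."""
    kwargs = kwargs or {}

    def run(chunk):
        return [target(item, *args, **kwargs) for item in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run, chunk) for chunk in chunked(items, chunk_size)]
        # Chunks are yielded as they finish, so overall order is not guaranteed.
        for future in as_completed(futures):
            for result in future.result():
                yield result


def scale(value, factor):
    # Pure function: touches no shared state, so it is thread-safe.
    return value * factor


# Sort because results may arrive in any order.
results = sorted(threaded_map(scale, range(10), args=(2,)))
```

Handing out several items per task reduces scheduling overhead when each call to target is cheap.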

class pyteomics.xml.IndexedXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]¶

Bases: IndexedReaderMixin, XML

Subclass of XML which uses an index of byte offsets for some elements for quick random access.

__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]¶

Create an indexed XML parser object.

Parameters:

source (str or file) – File name or file-like object corresponding to an XML file.
read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.

build_byte_index()[source]¶

Build up an index of offsets for elements.

Returns: out
Return type: TagSpecificXMLByteIndex

build_id_cache()¶: Construct a cache for each element in the document, indexed by id attribute

build_tree()¶: Build and store the ElementTree instance for the underlying file

clear_id_cache()¶: Clear the element ID cache

clear_tree()¶: Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)[source]¶

Parameters:

elem_id (str) – The id value of the entity to retrieve.
id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.

Return type:

dict

iterfind(path, **kwargs)[source]¶

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:

path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
**kwargs (passed to self._get_info_smart().)

Returns:

out

Return type:

iterator

reset()¶: Resets the iterator to its initial state.

class pyteomics.xml.MultiProcessingXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]¶

Bases: IndexedXML, TaskMappingMixin

XML reader that feeds indexes to external processes for parallel parsing and analysis of XML entries.

__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)¶

Create an indexed XML parser object.

Parameters:

source (str or file) – File name or file-like object corresponding to an XML file.
read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.

build_byte_index()¶

Build up an index of offsets for elements.

Returns: out
Return type: TagSpecificXMLByteIndex

build_id_cache()¶: Construct a cache for each element in the document, indexed by id attribute

build_tree()¶: Build and store the ElementTree instance for the underlying file

clear_id_cache()¶: Clear the element ID cache

clear_tree()¶: Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)¶

Parameters:

elem_id (str) – The id value of the entity to retrieve.
id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.

Return type:

dict

iterfind(path, **kwargs)¶

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:

path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
**kwargs (passed to self._get_info_smart().)

Returns:

out

Return type:

iterator

map(target=None, workers=None, args=None, kwargs=None, method='mp', **_kwargs)¶

Execute the target function over entries of this object in parallel. The type of parallelism is determined by the method parameter.

Results will be returned out of order.

Parameters:

target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int, optional) – The number of worker threads or processes to use. The default depends on the method parameter.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
method (str, optional) –
The type of parallelism to use. Can be one of the following:
- one of ‘p’, ‘mp’, ‘processes’, or ‘multiprocessing’: use multiprocessing. This is the default and is equivalent to calling pmap(); see there for details.
- one of ‘t’, ‘threading’, or ‘threads’: use threading. This is equivalent to calling tmap(); see there for details.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.

pmap(target=None, workers=None, args=None, kwargs=None, **_kwargs)¶

Execute the target function over entries of this object across up to workers processes.

Results will be returned out of order.

Parameters:

target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int or None, optional) – The number of worker processes to use. If not a positive integer, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.

reset()¶: Resets the iterator to its initial state.

tmap(target=None, workers=None, args=None, kwargs=None, chunk_size=None, **_kwargs)¶

Execute the target function over entries of this object across up to workers threads.

Results will be returned out of order.

Parameters:

target (Callable, optional) –
The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.

Warning

target must be thread-safe. The target function cannot interact with the underlying file object directly.
workers (int or None, optional) – The number of worker threads to use. If not a positive integer, defaults to the number of available CPUs.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
chunk_size (int, optional) – The number of work items to hand out to each worker thread at a time. If not specified, defaults to chunk_size attribute of this object.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.

class pyteomics.xml.TagSpecificXMLByteIndex(source, indexed_tags=None, keys=None)[source]¶

Bases: object

Encapsulates the construction and querying of a byte offset index for a set of XML tags.

This type mimics an immutable Mapping.

indexed_tags¶

The tag names to index, not including a namespace

Type: iterable of bytes

offsets¶

The hierarchy of byte offsets organized {"tag_type": {"id": byte_offset}}

Type: defaultdict(OrderedDict(str, int))
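
A read-only wrapper over this nested layout can be sketched with collections.abc.Mapping. The class `ReadOnlyOffsets` is hypothetical and only shows what "mimics an immutable Mapping" can look like in practice; the offsets used are made-up values.

```python
from collections import OrderedDict, defaultdict
from collections.abc import Mapping


class ReadOnlyOffsets(Mapping):
    """Expose {"tag_type": {"id": byte_offset}} without item assignment."""

    def __init__(self, offsets):
        # Copy into a plain dict so a failed lookup cannot create new keys.
        self._offsets = {tag: OrderedDict(ids) for tag, ids in offsets.items()}

    def __getitem__(self, key):
        return self._offsets[key]

    def __iter__(self):
        return iter(self._offsets)

    def __len__(self):
        return len(self._offsets)


raw = defaultdict(OrderedDict)
raw["spectrum"]["scan=1"] = 5
raw["spectrum"]["scan=2"] = 38
raw["chromatogram"]["TIC"] = 71

index = ReadOnlyOffsets(raw)
try:
    index["spectrum"] = {}
    assignment_rejected = False
except TypeError:
    # Mapping defines no __setitem__, so assignment fails.
    assignment_rejected = True
```

Only the outer mapping is protected in this sketch; the inner OrderedDict objects remain mutable.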

indexed_tag_keys¶

A mapping from tag name to unique identifier attribute

Type: dict(str, str)

Parameters: indexed_tags (iterable of bytes) – The tag names to include in the index

__init__(source, indexed_tags=None, keys=None)[source]¶

build_index()[source]¶

Perform the byte offset index building for source.

Returns: offsets – The hierarchical offset index, also stored in offsets
Return type: defaultdict

class pyteomics.xml.XML(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)[source]¶

Bases: FileReader

Base class for all format-specific XML parsers. The instances can be used as context managers and as iterators.

__init__(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)[source]¶

Create an XML parser object.

Parameters:

source (str or file) – File name or file-like object corresponding to an XML file.
read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
build_id_cache (bool, optional) – Defines whether a dictionary mapping IDs to XML tree elements should be built and stored on the instance. It is used in XML.get_by_id(), e.g. when using pyteomics.mzid.MzIdentML with retrieve_refs=True.
huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).

build_id_cache()[source]¶: Construct a cache for each element in the document, indexed by id attribute

build_tree()[source]¶: Build and store the ElementTree instance for the underlying file

clear_id_cache()[source]¶: Clear the element ID cache

clear_tree()[source]¶: Remove the saved ElementTree.

get_by_id(elem_id, **kwargs)[source]¶

Parse the file and return the element with id attribute equal to elem_id. Returns None if no such element is found.

Parameters: elem_id (str) – The value of the id attribute to match.
Returns: out
Return type: dict or None
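
The idea of an ID cache, and of returning None for an unknown id, can be shown with a short standard-library sketch. The free functions below reuse the method names only for readability; the returned dict here simply holds the element's attributes, which is a simplification of what a real parser would return.

```python
import xml.etree.ElementTree as ET


def build_id_cache(root):
    """Map every id attribute in the tree to its element."""
    return {
        elem.get("id"): elem
        for elem in root.iter()
        if elem.get("id") is not None
    }


def get_by_id(cache, elem_id):
    """Return the attributes of the matching element, or None if absent."""
    elem = cache.get(elem_id)
    if elem is None:
        return None
    return dict(elem.attrib)


root = ET.fromstring(
    '<doc><peptide id="PEP_1" sequence="ELVISK"/>'
    '<peptide id="PEP_2" sequence="LIVES"/></doc>'
)
cache = build_id_cache(root)
found = get_by_id(cache, "PEP_2")
missing = get_by_id(cache, "PEP_9")
```

Building the cache costs one pass over the tree, after which each lookup is a dictionary access instead of a fresh parse.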

iterfind(path, **kwargs)[source]¶

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:

path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
**kwargs (passed to self._get_info_smart().)

Returns:

out

Return type:

iterator

reset()¶: Resets the iterator to its initial state.

pyteomics.xml.xpath(tree, path, ns=None)[source]¶

Return the results of XPath query with added namespaces. Assumes the ns declaration is on the root element or absent.

Parameters:

tree (ElementTree)
path (str)
ns (str or None, optional)
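
The effect of adding the root namespace to a query can be approximated with the standard library's ElementTree. This is a sketch of the idea only: `find_with_namespace` is a hypothetical helper, it relies on ElementTree's limited path syntax rather than full XPath, and it does not handle predicates that contain a slash.

```python
import xml.etree.ElementTree as ET


def find_with_namespace(tree, path, ns=None):
    """Run a path query after qualifying each step with the root namespace."""
    root = tree.getroot()
    if ns is None and root.tag.startswith("{"):
        # ElementTree stores qualified tags as "{namespace}local".
        ns = root.tag[1:].partition("}")[0]
    if ns:
        path = "/".join(
            "{%s}%s" % (ns, step) if step and step not in (".", "..", "*") else step
            for step in path.split("/")
        )
    return root.findall(path)


tree = ET.ElementTree(
    ET.fromstring('<a xmlns="urn:example"><b><c>1</c></b><b><c>2</c></b></a>')
)
anywhere = [elem.text for elem in find_with_namespace(tree, ".//c")]
nested = [elem.text for elem in find_with_namespace(tree, "b/c")]
```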

pyteomics.xml.xsd_parser(schema_url)[source]¶

Parse an XSD file from the specified URL into a schema dictionary that can be used by XML parsers to automatically cast data to the appropriate type.

Parameters: schema_url (str) – The URL to retrieve the schema from
Return type: dict
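
How a schema dictionary can drive type casting is sketched below. The layout of `schema_info` (sets of element and attribute name pairs grouped by target type) and the element names are assumptions made for this example; the dictionary actually returned by xsd_parser may be organized differently.

```python
def cast_attributes(tag, attrib, schema_info):
    """Convert string attribute values using per-type (element, attribute) sets."""
    converters = {
        "ints": int,
        "floats": float,
        "bools": lambda value: value in ("true", "1"),
    }
    out = dict(attrib)
    for kind, convert in converters.items():
        for element, attribute in schema_info.get(kind, ()):
            if element == tag and attribute in out:
                out[attribute] = convert(out[attribute])
    return out


# Hypothetical schema information, not taken from a real XSD.
schema_info = {
    "ints": {("spectrum", "index")},
    "floats": {("scan", "retentionTime")},
    "bools": {("scan", "centroided")},
}
spectrum = cast_attributes("spectrum", {"index": "3", "id": "scan=4"}, schema_info)
scan = cast_attributes(
    "scan", {"retentionTime": "12.5", "centroided": "true"}, schema_info
)
```

Attributes that the schema does not mention are left as strings.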

Pyteomics documentation v5.0
