xml - utilities for XML parsing
This module is not intended for end users. It implements the abstract classes for all XML parsers, XML and IndexedXML, and some utility functions.
Dependencies
This module requires lxml and numpy.
-
class pyteomics.xml.ByteCountingXMLScanner(source, indexed_tags, block_size=1000000)
Bases: pyteomics.auxiliary.file_helpers._file_obj
Builds a byte offset index of the source XML file for each type of tag in indexed_tags.
Inherits from pyteomics.auxiliary._file_obj to support the object-oriented _keep_state() interface.
-
__init__(source, indexed_tags, block_size=1000000)
Parameters:
- indexed_tags (iterable of bytes) – The XML tags (without namespaces) to build indices for.
- block_size (int, optional) – The size of each chunk or “block” of the file to hold in memory as a partitioned string at any given time. Defaults to 1000000.
-
build_byte_index(lookup_id_key_mapping=None)
Builds a byte offset index for one or more types of tags.
Parameters: lookup_id_key_mapping (Mapping, optional) – A mapping from tag name to the attribute that identifies each entity of that type. Defaults to ‘id’ for each type of tag.
Returns: Mapping from tag type to a dict from identifier to byte offset.
Return type: defaultdict(dict)
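The idea of a byte offset index can be illustrated with a minimal standard-library sketch. It is not the pyteomics implementation: the function name `sketch_byte_index`, the regular expressions and the sample bytes are all assumptions, and the whole input is scanned at once rather than block by block.

```python
import re
from collections import defaultdict


def sketch_byte_index(data, indexed_tags, id_keys=None):
    """Map each tag (bytes) to {identifier (str): offset of its opening tag}."""
    id_keys = id_keys or {}
    index = defaultdict(dict)
    for tag in indexed_tags:
        # The identifying attribute defaults to "id" for every tag type.
        key = id_keys.get(tag, b"id")
        opening = re.compile(rb"<" + re.escape(tag) + rb"\b[^>]*>")
        attribute = re.compile(rb"\b" + re.escape(key) + rb'\s*=\s*"([^"]*)"')
        for match in opening.finditer(data):
            found = attribute.search(match.group(0))
            if found:
                index[tag][found.group(1).decode("utf-8")] = match.start()
    return index


# Hypothetical, heavily simplified document.
SAMPLE = (
    b'<run>'
    b'<spectrumList count="2">'
    b'<spectrum index="0" id="scan=1"></spectrum>'
    b'<spectrum index="1" id="scan=2"></spectrum>'
    b'</spectrumList>'
    b'</run>'
)
index = sketch_byte_index(SAMPLE, [b"spectrum"])
```

Each stored offset points at the `<` of an opening tag, so slicing `SAMPLE` at that offset starts exactly at the element.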
-
replace_entities(key)
Replace XML entities in a string with their character representation.
Uses the minimal mapping of XML entities predefined for all XML documents and does not attempt to deal with entities defined in an external DTD. This mapping is found in entities.
Parameters: key (str) – The string to substitute.
Return type: str
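As a rough illustration of replacing the predefined XML entities, the following sketch uses `xml.sax.saxutils.unescape` from the standard library. The helper name `replace_predefined_entities` is hypothetical, and this is not necessarily how the method above is implemented.

```python
from xml.sax.saxutils import unescape

# unescape() handles &amp;, &lt; and &gt; on its own; the two quote
# entities complete the set predefined for every XML document.
QUOTE_ENTITIES = {"&quot;": '"', "&apos;": "'"}


def replace_predefined_entities(key):
    """Return key with the five predefined XML entities decoded."""
    return unescape(key, QUOTE_ENTITIES)


decoded = replace_predefined_entities("controllerType=0 &amp; scan=&quot;1&quot;")
```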
-
-
class pyteomics.xml.IndexSavingXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)
Bases: pyteomics.auxiliary.file_helpers.IndexSavingMixin, pyteomics.xml.IndexedXML
An extension to the IndexedXML type which adds facilities to read and write the byte offset index externally.
-
__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)
Create an indexed XML parser object.
Parameters:
- source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether the schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
- indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache()
Construct a cache for each element in the document, indexed by id attribute.
-
build_tree()
Build and store the ElementTree instance for the underlying file.
-
clear_id_cache()
Clear the element ID cache.
-
clear_tree()
Remove the saved ElementTree.
-
get_by_id(elem_id, id_key=None, element_type=None, **kwargs)
Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it is retrieved by seeking directly to the starting position of the entry; otherwise the reader falls back to parsing from the start of the file.
-
iterfind(path, **kwargs)
Parse the XML and yield info on elements with the specified local name or matching the specified “XPath”.
Parameters:
- path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
classmethod prebuild_byte_offset_file(path)
Construct a new XML reader, build its byte offset index and write it to file.
Parameters: path (str) – The path to the file to parse.
-
reset()
Resets the iterator to its initial state.
-
write_byte_offsets()
Write the byte offsets in _offset_index to the file at _byte_offset_filename.
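Storing an offset index next to the data file, so that it does not have to be rebuilt on every run, might look roughly like the sketch below. The JSON format, the file name and the helper names `write_offsets` and `read_offsets` are assumptions made for illustration. The on-disk format actually used by the class is not described in this section.

```python
import json
import os
import tempfile


def write_offsets(offsets, index_path):
    """Serialize {"tag": {"id": byte_offset}} to a JSON side file."""
    with open(index_path, "w") as handle:
        json.dump(offsets, handle)


def read_offsets(index_path):
    """Load a previously saved offset index."""
    with open(index_path) as handle:
        return json.load(handle)


# Hypothetical offsets; real values would come from scanning a file.
offsets = {"spectrum": {"scan=1": 29, "scan=2": 72}}
with tempfile.TemporaryDirectory() as folder:
    index_path = os.path.join(folder, "example-byte-offsets.json")
    write_offsets(offsets, index_path)
    restored = read_offsets(index_path)
```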
-
-
class pyteomics.xml.IndexedXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)
Bases: pyteomics.auxiliary.file_helpers.IndexedReaderMixin, pyteomics.xml.XML
Subclass of XML which uses an index of byte offsets for some elements for quick random access.
-
__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)
Create an indexed XML parser object.
Parameters:
- source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether the schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
- indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
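The memory benefit of iterative parsing mentioned for the iterative parameter can be sketched with the standard library's `xml.etree.ElementTree.iterparse`. This is a stand-in for illustration only (the module itself depends on lxml), and the helper name `iter_ids` and the sample document are assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature document.
SAMPLE = b"<run><spectrum id='a'/><spectrum id='b'/><spectrum id='c'/></run>"


def iter_ids(stream, tag):
    """Yield the id of every <tag> element without building the full tree."""
    for _event, element in ET.iterparse(stream, events=("end",)):
        if element.tag == tag:
            yield element.get("id")
            # Drop the element's contents once it has been handled.
            element.clear()


ids = list(iter_ids(io.BytesIO(SAMPLE), "spectrum"))
```

Elements are processed as soon as their closing tag is seen, so only a small part of the document needs to be held in memory at any time.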
-
build_id_cache()
Construct a cache for each element in the document, indexed by id attribute.
-
build_tree()
Build and store the ElementTree instance for the underlying file.
-
clear_id_cache()
Clear the element ID cache.
-
clear_tree()
Remove the saved ElementTree.
-
get_by_id(elem_id, id_key=None, element_type=None, **kwargs)
Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it is retrieved by seeking directly to the starting position of the entry; otherwise the reader falls back to parsing from the start of the file.
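Retrieval by seeking to a stored offset can be sketched as follows. The sketch assumes that elements of the indexed type are neither nested inside each other nor self-closing. The helper `get_by_offset`, the sample bytes and the `OFFSETS` dictionary are hypothetical, and the standard library parser stands in for lxml.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = (
    b'<run>'
    b'<spectrum id="scan=1"><mz>100.0</mz></spectrum>'
    b'<spectrum id="scan=2"><mz>200.0</mz></spectrum>'
    b'</run>'
)
# Hypothetical offset index, as a scanner might have produced it.
OFFSETS = {
    "scan=1": SAMPLE.index(b'<spectrum id="scan=1"'),
    "scan=2": SAMPLE.index(b'<spectrum id="scan=2"'),
}


def get_by_offset(stream, offset, tag):
    """Seek to an opening tag and parse only that element."""
    closing = b"</" + tag + b">"
    stream.seek(offset)
    # For brevity the rest of the stream is read; a real reader would
    # stop as soon as the closing tag has been seen.
    chunk = stream.read()
    end = chunk.index(closing) + len(closing)
    return ET.fromstring(chunk[:end])


element = get_by_offset(io.BytesIO(SAMPLE), OFFSETS["scan=2"], b"spectrum")
```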
-
iterfind(path, **kwargs)
Parse the XML and yield info on elements with the specified local name or matching the specified “XPath”.
Parameters:
- path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
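The substitution of local names with local-name() checks can be pictured with a small string-rewriting sketch. The function `to_local_name_xpath` is hypothetical and simplified: it only handles plain `/`-separated steps and treats a "free" path as a search anywhere in the document, which may differ from what the library does internally.

```python
def to_local_name_xpath(path):
    """Rewrite a namespace-free path into local-name() checks.

    Only the part before the first predicate is rewritten; the
    predicate, if any, is appended unchanged.
    """
    head, bracket, predicate = path.partition("[")
    steps = [step for step in head.split("/") if step]
    rewritten = "/".join("*[local-name()='{}']".format(step) for step in steps)
    # Absolute paths start at the root; "free" paths match anywhere.
    prefix = "/" if head.startswith("/") else "//"
    return prefix + rewritten + bracket + predicate


free = to_local_name_xpath("spectrum")
absolute = to_local_name_xpath("/mzML/run")
filtered = to_local_name_xpath('spectrum[@id="scan=1"]')
```

Matching on local-name() is what allows the caller to leave namespaces out of the path.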
-
reset()
Resets the iterator to its initial state.
-
-
class pyteomics.xml.MultiProcessingXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)
Bases: pyteomics.xml.IndexedXML, pyteomics.auxiliary.file_helpers.TaskMappingMixin
XML reader that feeds indexes to external processes for parallel parsing and analysis of XML entries.
-
__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)
Create an indexed XML parser object.
Parameters:
- source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether the schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
- indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache()
Construct a cache for each element in the document, indexed by id attribute.
-
build_tree()
Build and store the ElementTree instance for the underlying file.
-
clear_id_cache()
Clear the element ID cache.
-
clear_tree()
Remove the saved ElementTree.
-
get_by_id(elem_id, id_key=None, element_type=None, **kwargs)
Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it is retrieved by seeking directly to the starting position of the entry; otherwise the reader falls back to parsing from the start of the file.
-
iterfind(path, **kwargs)
Parse the XML and yield info on elements with the specified local name or matching the specified “XPath”.
Parameters:
- path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)
Execute the target function over entries of this object across up to processes processes.
Results will be returned out of order.
Parameters:
- target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
- processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
- args (Sequence, optional) – Additional positional arguments to be passed to the target function.
- kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
- **_kwargs – Additional keyword arguments to be passed to the target function.
Yields: object – The work item returned by the target function.
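The calling pattern of an unordered parallel map can be sketched as below. To keep the example self-contained it uses a thread pool from `concurrent.futures` as a stand-in for the worker processes described above, so it shows only how target, args and kwargs are applied and why results may arrive out of order. The names `map_unordered` and `peak_count` and the sample entries are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def map_unordered(target, entries, workers=2, args=(), kwargs=None):
    """Apply target to every entry and yield results as they finish."""
    kwargs = kwargs or {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(target, entry, *args, **kwargs) for entry in entries]
        # as_completed() yields in completion order, not submission order.
        for future in as_completed(futures):
            yield future.result()


def peak_count(spectrum, minimum=0.0):
    """Count intensities at or above a threshold."""
    return sum(1 for intensity in spectrum["intensity"] if intensity >= minimum)


spectra = [{"intensity": [1.0, 5.0]}, {"intensity": [7.0]}, {"intensity": []}]
# Sorting makes the outcome independent of completion order.
counts = sorted(map_unordered(peak_count, spectra, kwargs={"minimum": 2.0}))
```

Because the order of results is not guaranteed, a target function that needs to be matched back to its input should return an identifier along with its result.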
-
reset()
Resets the iterator to its initial state.
-
-
class pyteomics.xml.TagSpecificXMLByteIndex(source, indexed_tags=None, keys=None)
Bases: object
Encapsulates the construction and querying of a byte offset index for a set of XML tags.
This type mimics an immutable Mapping.
The tag names to index, not including a namespace
Type: iterable of bytes
-
offsets
The hierarchy of byte offsets, organized as {"tag_type": {"id": byte_offset}}
Type: defaultdict(OrderedDict(str, int))
Parameters: indexed_tags (iterable of bytes) – The tag names to include in the index
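What it means for a type to mimic an immutable Mapping can be shown with `collections.abc.Mapping`. The class `ReadOnlyOffsetIndex` below is a hypothetical sketch rather than the class documented here, and the offsets are made-up values. Only the outer level is read-only in this sketch.

```python
from collections import OrderedDict, defaultdict
from collections.abc import Mapping


class ReadOnlyOffsetIndex(Mapping):
    """Read-only view over {"tag_type": {"id": byte_offset}}."""

    def __init__(self, offsets):
        # Copy into a plain dict so that lookups of unknown tags raise
        # KeyError instead of silently creating entries.
        self._offsets = dict(offsets)

    def __getitem__(self, tag):
        return self._offsets[tag]

    def __iter__(self):
        return iter(self._offsets)

    def __len__(self):
        return len(self._offsets)


offsets = defaultdict(OrderedDict)
offsets["spectrum"]["scan=1"] = 29
offsets["spectrum"]["scan=2"] = 72
index = ReadOnlyOffsetIndex(offsets)
```

Implementing the three abstract methods is enough for `Mapping` to supply `keys()`, `items()`, `get()` and membership tests, while item assignment stays unavailable.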
-
class pyteomics.xml.XML(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)
Bases: pyteomics.auxiliary.file_helpers.FileReader
Base class for all format-specific XML parsers. The instances can be used as context managers and as iterators.
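The combination of context manager and iterator behaviour can be illustrated with a toy reader. The class `LineReader` is entirely hypothetical and reads plain lines rather than XML. It only shows the protocol methods involved, together with a reset() that rewinds the iteration.

```python
import io


class LineReader:
    """Toy reader that is both a context manager and an iterator."""

    def __init__(self, source):
        self._source = source

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._source.close()
        return False  # do not swallow exceptions

    def __iter__(self):
        return self

    def __next__(self):
        line = self._source.readline()
        if not line:
            raise StopIteration
        return line.strip()

    def reset(self):
        """Return the iterator to its initial state."""
        self._source.seek(0)


source = io.StringIO("<a/>\n<b/>\n")
with LineReader(source) as reader:
    first = next(reader)
    reader.reset()
    everything = list(reader)
```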
-
__init__(source, read_schema=None, iterative=None, build_id_cache=False, **kwargs)
Create an XML parser object.
Parameters:
- source (str or file) – File name or file-like object corresponding to an XML file.
- read_schema (bool, optional) – Defines whether the schema file referenced in the file header should be used to extract information about value conversion. Default is False.
- iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
- build_id_cache (bool, optional) – Defines whether a dictionary mapping IDs to XML tree elements should be built and stored on the instance. It is used in XML.get_by_id(), e.g. when using pyteomics.mzid.MzIdentML with retrieve_refs=True.
- huge_tree (bool, optional) – This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).
-
build_id_cache()
Construct a cache for each element in the document, indexed by id attribute.
-
get_by_id(elem_id, **kwargs)
Parse the file and return the element with id attribute equal to elem_id. Returns None if no such element is found.
Parameters: elem_id (str) – The value of the id attribute to match.
Returns: out
Return type: dict or None
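An ID cache of the kind described above amounts to a dictionary from id values to elements. The sketch below uses the standard library parser as a stand-in for lxml. The helper names `make_id_cache` and `lookup` and the sample document are assumptions, and it returns raw elements where the method above returns a dict.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; one element has no id attribute.
DOCUMENT = ET.fromstring(
    '<root><Peptide id="PEP_1"/><DBSequence id="DBSeq_1"/><Note/></root>'
)


def make_id_cache(root):
    """Map every id attribute value in the tree to its element."""
    return {
        element.get("id"): element
        for element in root.iter()
        if element.get("id") is not None
    }


def lookup(cache, elem_id):
    """Return the cached element, or None if the id is unknown."""
    return cache.get(elem_id)


cache = make_id_cache(DOCUMENT)
```

Once the cache exists, resolving a reference is a dictionary lookup instead of a new pass over the document.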
-
iterfind(path, **kwargs)
Parse the XML and yield info on elements with the specified local name or matching the specified “XPath”.
Parameters:
- path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
- **kwargs – Passed to self._get_info_smart().
Returns: out
Return type: iterator
-
reset()
Resets the iterator to its initial state.
-