Pyteomics documentation v3.5.1

xml - utilities for XML parsing

This module is not intended for end users. It implements the abstract classes for all XML parsers, XML and IndexedXML, and some utility functions.

Dependencies

This module requires lxml and numpy.


class pyteomics.xml.ArrayConversionMixin(*args, **kwargs)[source]

Bases: pyteomics.auxiliary.utils.BinaryDataArrayTransformer

Methods

binary_array_record Hold all of the information about a base64 encoded array needed to decode the array.
decode_data_array(source[, …]) Decode a base64-encoded, compressed bytestring into a numerical array.
__init__(*args, **kwargs)[source]

x.__init__(…) initializes x; see help(type(x)) for signature

class binary_array_record

Bases: pyteomics.auxiliary.utils.binary_array_record

Hold all of the information about a base64 encoded array needed to decode the array.

Attributes:
compression

Alias for field number 1

data

Alias for field number 0

dtype

Alias for field number 2

key

Alias for field number 4

source

Alias for field number 3

Methods

count(value)
decode() Decode data into a numerical array
index(value, [start, [stop]]) Raises ValueError if the value is not present.
__init__

x.__init__(…) initializes x; see help(type(x)) for signature

compression

Alias for field number 1

count(value) → integer -- return number of occurrences of value
data

Alias for field number 0

decode()

Decode data into a numerical array

Returns:
np.ndarray
dtype

Alias for field number 2

index(value[, start[, stop]]) → integer -- return first index of value.

Raises ValueError if the value is not present.

key

Alias for field number 4

source

Alias for field number 3

decode_data_array(source, compression_type=None, dtype=numpy.float64)

Decode a base64-encoded, compressed bytestring into a numerical array.

Parameters:
source : bytes

A base64 string encoding a potentially compressed numerical array.

compression_type : str, optional

The name of the compression method used before encoding the array into base64.

dtype : type, optional

The data type to use to decode the binary array from the decompressed bytes.

Returns:
np.ndarray
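
The decoding steps described above (base64 decoding, optional decompression, then reinterpreting the bytes as numbers) can be sketched roughly as follows. This is not the pyteomics implementation: the name decode_data_array_sketch, the "zlib" compression label and the little-endian byte order are assumptions made for the example, and the standard library struct module stands in for numpy so that the snippet runs on its own.

```python
import base64
import struct
import zlib


def decode_data_array_sketch(source, compression_type=None, fmt="d"):
    """Decode a base64 bytestring into a list of numbers (hypothetical helper).

    fmt is a struct format character; "d" means 64-bit float.
    """
    raw = base64.b64decode(source)
    if compression_type == "zlib":  # assumed label for zlib compression
        raw = zlib.decompress(raw)
    elif compression_type is not None:
        raise ValueError("unsupported compression: %r" % (compression_type,))
    item_size = struct.calcsize("<" + fmt)
    count = len(raw) // item_size
    # "<" assumes little-endian values, which is an assumption of this sketch
    return list(struct.unpack("<%d%s" % (count, fmt), raw[:count * item_size]))


# Round trip: pack, compress and encode three values, then decode them again.
values = [100.5, 200.25, 300.0]
encoded = base64.b64encode(zlib.compress(struct.pack("<3d", *values)))
decoded = decode_data_array_sketch(encoded, compression_type="zlib")
```

With numpy available, the last step would typically be a call such as numpy.frombuffer(raw, dtype=dtype) instead of struct.unpack.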
class pyteomics.xml.ByteCountingXMLScanner(source, indexed_tags, block_size=1000000)[source]

Bases: pyteomics.auxiliary.file_helpers._file_obj

Carry out the construction of a byte offset index for source XML file for each type of tag in indexed_tags.

Inherits from pyteomics.auxiliary._file_obj to support the object-oriented _keep_state() interface.

Methods

build_byte_index(*args, **kwargs) Builds a byte offset index for one or more types of tags.
scan  
__init__(source, indexed_tags, block_size=1000000)[source]
Parameters:
indexed_tags : iterable of bytes

The XML tags (without namespaces) to build indices for.

block_size : int, optional

The size of each chunk or “block” of the file to hold in memory as a partitioned string at any given time. Defaults to 1000000.

build_byte_index(*args, **kwargs)[source]

Builds a byte offset index for one or more types of tags.

Parameters:
lookup_id_key_mapping : Mapping, optional

A mapping from tag name to the attribute to look up the identity for each entity of that type to be extracted. Defaults to ‘id’ for each type of tag.

Returns:
defaultdict(ByteEncodingOrderedDict)

Mapping from tag type to ByteEncodingOrderedDict from identifier to byte offset
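
To make the idea of a byte offset index concrete, here is a minimal sketch that records, for each indexed tag, the byte position at which every element with a given id starts. It is an illustration only: build_byte_index_sketch and its regular expression are hypothetical, a plain OrderedDict stands in for ByteEncodingOrderedDict, and the whole input is read at once rather than in blocks of block_size.

```python
import io
import re
from collections import OrderedDict, defaultdict


def build_byte_index_sketch(stream, indexed_tags, id_attr=b"id"):
    """Map tag name -> OrderedDict(identifier -> byte offset of the opening tag)."""
    data = stream.read()
    index = defaultdict(OrderedDict)
    for tag in indexed_tags:
        # Match "<tag ... id="value"" and capture the identifier value.
        pattern = re.compile(
            b"<" + re.escape(tag) + rb"\b[^>]*?\s" + re.escape(id_attr) + rb'="([^"]*)"'
        )
        for match in pattern.finditer(data):
            index[tag][match.group(1)] = match.start()
    return index


doc = (
    b'<run><spectrumList count="2">'
    b'<spectrum index="0" id="scan=1"><x/></spectrum>'
    b'<spectrum index="1" id="scan=2"/>'
    b"</spectrumList></run>"
)
offsets = build_byte_index_sketch(io.BytesIO(doc), [b"spectrum"])
```

Each stored offset points at the "<" of the opening tag, so a reader can later seek straight to it.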

class pyteomics.xml.ByteEncodingOrderedDict(**kwds)[source]

Bases: collections.OrderedDict

Methods

clear()
copy()
fromkeys(S[, v]) If not specified, the value defaults to None.
get(k[,d])
has_key(k)
items()
iteritems() od.iteritems -> an iterator over the (key, value) pairs in od
iterkeys()
itervalues() od.itervalues -> an iterator over the values in od
keys()
pop(k[,d]) Remove specified key and return the corresponding value.
popitem() Pairs are returned in LIFO order if last is true or FIFO order if false.
setdefault(k[,d])
update([E, ]**F) If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k, v in F.items(): D[k] = v
values()
viewitems()
viewkeys()
viewvalues()
__init__(**kwds)

Initialize an ordered dictionary. The signature is the same as regular dictionaries, but keyword arguments are not recommended because their insertion order is arbitrary.

clear() → None. Remove all items from od.
copy() → a shallow copy of od
classmethod fromkeys(S[, v]) → New ordered dictionary with keys from S.

If not specified, the value defaults to None.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
has_key(k) → True if D has a key k, else False
items() → list of (key, value) pairs in od
iteritems()

od.iteritems -> an iterator over the (key, value) pairs in od

iterkeys() → an iterator over the keys in od
itervalues()

od.itervalues -> an iterator over the values in od

keys() → list of keys in od
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised.

popitem() → (k, v), return and remove a (key, value) pair.

Pairs are returned in LIFO order if last is true or FIFO order if false.

setdefault(k[, d]) → od.get(k,d), also set od[k]=d if k not in od
update([E, ]**F) → None. Update D from mapping/iterable E and F.

If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k, v in F.items(): D[k] = v

values() → list of values in od
viewitems() → a set-like object providing a view on od's items
viewkeys() → a set-like object providing a view on od's keys
viewvalues() → an object providing a view on od's values
class pyteomics.xml.FlatTagSpecificXMLByteIndex(source, indexed_tags=None, keys=None)[source]

Bases: pyteomics.xml.TagSpecificXMLByteIndex

An alternative interface on top of TagSpecificXMLByteIndex that assumes that identifiers across different tags are globally unique, as in MzIdentML.

Attributes:
offsets : ByteEncodingOrderedDict

The mapping between ids and byte offsets.
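
The difference between the hierarchical index and this flat one can be pictured with a short sketch. It assumes the hierarchical layout {tag: {id: offset}} and merges it into a single mapping from id to offset. The function name flatten_index_sketch is hypothetical, and raising on a duplicate identifier is a choice made for this example, not documented behaviour of the class.

```python
from collections import OrderedDict


def flatten_index_sketch(hierarchical):
    """Merge {tag: {id: offset}} into one {id: offset} mapping."""
    flat = OrderedDict()
    for tag, offsets in hierarchical.items():
        for identifier, offset in offsets.items():
            if identifier in flat:
                # Only meaningful when ids are unique across all tags.
                raise ValueError("duplicate identifier: %r" % (identifier,))
            flat[identifier] = offset
    return flat


hierarchical = OrderedDict([
    (b"Peptide", OrderedDict([(b"pep_1", 10)])),
    (b"DBSequence", OrderedDict([(b"db_1", 250)])),
])
flat = flatten_index_sketch(hierarchical)
```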

Methods

build_index  
items  
keys  
__init__(source, indexed_tags=None, keys=None)

x.__init__(…) initializes x; see help(type(x)) for signature

build_index()[source]

Perform the byte offset index building for source.

Returns:
offsets: defaultdict

The hierarchical offset, stored in offsets

class pyteomics.xml.IndexSavingXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=True, *args, **kwargs)[source]

Bases: pyteomics.xml.IndexedXML

An extension to the IndexedXML type which adds facilities to read and write the byte offset index externally.

Methods

build_id_cache(*args, **kwargs) Construct a cache for each element in the document, indexed by id attribute
build_tree(*args, **kwargs) Build and store the ElementTree instance for the underlying file
clear_id_cache() Clear the element ID cache
clear_tree() Remove the saved ElementTree.
get_by_id(*args, **kwargs) Retrieve the requested entity by its id.
iterfind(*args, **kwargs) Parse the XML and yield info on elements with specified local name or by specified “XPath”.
prebuild_byte_offset_file(path) Construct a new XML reader, build its byte offset index and write it to file
write_byte_offsets() Write the byte offsets in _offset_index to the file at _byte_offset_filename
next  
reset  
__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=True, *args, **kwargs)

Create an XML parser object.

Parameters:
source : str or file

File name or file-like object corresponding to an XML file.

read_schema : bool, optional

Defines whether the schema file referenced in the file header should be used to extract information about value conversion. Default is False.

iterative : bool, optional

Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.

use_index : bool, optional

Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.

indexed_tags : container of bytes, optional

If use_index is True, elements listed in this parameter will be indexed. Empty set by default.

build_id_cache(*args, **kwargs)

Construct a cache for each element in the document, indexed by id attribute

build_tree(*args, **kwargs)

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(*args, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
elem_id : str

The id value of the entity to retrieve.

id_key : str, optional

The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.

Returns:
dict
iterfind(*args, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
path : str

Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition at the end, such as: "/path/to/element[some_value>1.5]". Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.

**kwargs : passed to self._get_info_smart().
Returns:
out : iterator
classmethod prebuild_byte_offset_file(path)[source]

Construct a new XML reader, build its byte offset index and write it to file

Parameters:
path : str

The path to the file to parse

reset()

Resets the iterator to its initial state.

write_byte_offsets()[source]

Write the byte offsets in _offset_index to the file at _byte_offset_filename

class pyteomics.xml.IndexedXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=True, *args, **kwargs)[source]

Bases: pyteomics.xml.XML

Subclass of XML which uses an index of byte offsets for some elements for quick random access.

Methods

build_id_cache(*args, **kwargs) Construct a cache for each element in the document, indexed by id attribute
build_tree(*args, **kwargs) Build and store the ElementTree instance for the underlying file
clear_id_cache() Clear the element ID cache
clear_tree() Remove the saved ElementTree.
get_by_id(*args, **kwargs) Retrieve the requested entity by its id.
iterfind(*args, **kwargs) Parse the XML and yield info on elements with specified local name or by specified “XPath”.
next  
reset  
__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=True, *args, **kwargs)[source]

Create an XML parser object.

Parameters:
source : str or file

File name or file-like object corresponding to an XML file.

read_schema : bool, optional

Defines whether the schema file referenced in the file header should be used to extract information about value conversion. Default is False.

iterative : bool, optional

Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.

use_index : bool, optional

Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.

indexed_tags : container of bytes, optional

If use_index is True, elements listed in this parameter will be indexed. Empty set by default.

build_id_cache(*args, **kwargs)

Construct a cache for each element in the document, indexed by id attribute

build_tree(*args, **kwargs)

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(*args, **kwargs)[source]

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
elem_id : str

The id value of the entity to retrieve.

id_key : str, optional

The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.

Returns:
dict
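
The "seek to the starting position of the entry" strategy can be sketched as follows, assuming a byte offset for the element is already known. The helper read_element_at is hypothetical and deliberately simple: it reads forward from the offset until the first closing tag of the given name, so it does not handle self-closing or nested elements of the same name, and it returns an ElementTree element rather than the dict produced by the real parsers.

```python
import io
import xml.etree.ElementTree as ET


def read_element_at(stream, offset, tag, block_size=64):
    """Parse the single element that starts at a known byte offset."""
    stream.seek(offset)
    closing = b"</" + tag + b">"
    buffer = b""
    while True:
        block = stream.read(block_size)
        if not block:
            raise ValueError("closing tag not found")
        buffer += block
        end = buffer.find(closing)
        if end != -1:
            # Parse only the bytes that belong to this element.
            return ET.fromstring(buffer[:end + len(closing)])


doc = (
    b'<run><spectrum id="a"><mz>1</mz></spectrum>'
    b'<spectrum id="b"><mz>2</mz></spectrum></run>'
)
offset = doc.index(b'<spectrum id="b"')  # would normally come from the offset index
element = read_element_at(io.BytesIO(doc), offset, b"spectrum")
```

Nothing before the offset is parsed, which is what makes random access cheap compared with scanning from the start of the file.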
iterfind(*args, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
path : str

Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition at the end, such as: "/path/to/element[some_value>1.5]". Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.

**kwargs : passed to self._get_info_smart().
Returns:
out : iterator
reset()

Resets the iterator to its initial state.

class pyteomics.xml.PrebuiltOffsetIndex(offsets)[source]

Bases: pyteomics.xml.FlatTagSpecificXMLByteIndex

An Offset Index class which just holds offsets and performs no extra scanning effort.

Attributes:
offsets : ByteEncodingOrderedDict

Methods

build_index  
items  
keys  
__init__(offsets)[source]

x.__init__(…) initializes x; see help(type(x)) for signature

build_index()

Perform the byte offset index building for source.

Returns:
offsets: defaultdict

The hierarchical offset, stored in offsets

class pyteomics.xml.TagSpecificXMLByteIndex(source, indexed_tags=None, keys=None)[source]

Bases: object

Encapsulates the construction and querying of a byte offset index for a set of XML tags.

This type mimics an immutable Mapping.

Parameters:
indexed_tags : iterable of bytes

The tag names to include in the index

Attributes:
indexed_tags : iterable of bytes

The tag names to index, not including a namespace

offsets : defaultdict(OrderedDict(str, int))

The hierarchy of byte offsets organized {"tag_type": {"id": byte_offset}}

indexed_tag_keys: dict(str, str)

A mapping from tag name to unique identifier attribute

Methods

build_index() Perform the byte offset index building for source.
items  
keys  
__init__(source, indexed_tags=None, keys=None)[source]

x.__init__(…) initializes x; see help(type(x)) for signature

build_index()[source]

Perform the byte offset index building for source.

Returns:
offsets: defaultdict

The hierarchical offset, stored in offsets

class pyteomics.xml.XML(source, read_schema=False, iterative=True, build_id_cache=False, **kwargs)[source]

Bases: pyteomics.auxiliary.file_helpers.FileReader

Base class for all format-specific XML parsers. The instances can be used as context managers and as iterators.
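
What "context manager and iterator" means in practice can be shown with a small stand-in class. ReaderSketch is hypothetical and unrelated to the real base class beyond the protocol it follows: it supports the with statement, next(), iteration, and a reset() that returns to the start of the input. It yields tag names, whereas the real parsers yield dicts.

```python
import io
import xml.etree.ElementTree as ET


class ReaderSketch:
    """Toy reader that yields element tag names as the elements are closed."""

    def __init__(self, source):
        self._source = source
        self.reset()

    def reset(self):
        # Return to the beginning and start a fresh iterative parse.
        self._source.seek(0)
        self._events = ET.iterparse(self._source, events=("end",))

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._source.close()

    def __iter__(self):
        return self

    def __next__(self):
        _event, element = next(self._events)
        return element.tag


with ReaderSketch(io.BytesIO(b"<a><b/><c/></a>")) as reader:
    first = next(reader)       # consume one element
    reader.reset()             # start over
    everything = list(reader)  # iterate to the end
```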

Methods

build_id_cache(*args, **kwargs) Construct a cache for each element in the document, indexed by id attribute
build_tree(*args, **kwargs) Build and store the ElementTree instance for the underlying file
clear_id_cache() Clear the element ID cache
clear_tree() Remove the saved ElementTree.
get_by_id(*args, **kwargs) Parse the file and return the element with id attribute equal to elem_id.
iterfind(*args, **kwargs) Parse the XML and yield info on elements with specified local name or by specified “XPath”.
next  
reset  
__init__(source, read_schema=False, iterative=True, build_id_cache=False, **kwargs)[source]

Create an XML parser object.

Parameters:
source : str or file

File name or file-like object corresponding to an XML file.

read_schema : bool, optional

Defines whether the schema file referenced in the file header should be used to extract information about value conversion. Default is False.

iterative : bool, optional

Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.

build_id_cache : bool, optional

Defines whether a dictionary mapping IDs to XML tree elements should be built and stored on the instance. It is used in XML.get_by_id(), e.g. when using pyteomics.mzid.MzIdentML with retrieve_refs=True.

huge_tree : bool, optional

This option is passed to the lxml parser and defines whether security checks for XML tree depth and node size should be disabled. Default is False. Enable this option for trusted files to avoid XMLSyntaxError exceptions (e.g. XMLSyntaxError: xmlSAX2Characters: huge text node).

skip_empty_cvparam_values : bool, optional

Warning

This parameter affects the format of the produced dictionaries.

By default, when parsing cvParam elements, “value” attributes with empty values are not treated differently from others. When this parameter is set to True, these empty values are flattened. You can enable this to obtain the same output structure regardless of the presence of an empty “value”. Default is False.

build_id_cache(*args, **kwargs)[source]

Construct a cache for each element in the document, indexed by id attribute

build_tree(*args, **kwargs)[source]

Build and store the ElementTree instance for the underlying file

clear_id_cache()[source]

Clear the element ID cache

clear_tree()[source]

Remove the saved ElementTree.

get_by_id(*args, **kwargs)[source]

Parse the file and return the element with id attribute equal to elem_id. Returns None if no such element is found.

Parameters:
elem_id : str

The value of the id attribute to match.

Returns:
out : dict or None
iterfind(*args, **kwargs)[source]

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
path : str

Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition at the end, such as: "/path/to/element[some_value>1.5]". Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.

**kwargs : passed to self._get_info_smart().
Returns:
out : iterator
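
The path rules above (local names only, slashes as separators, an asterisk for any element, absolute versus "free" paths) can be illustrated with a sketch built on the standard library. The generator iterfind_sketch is hypothetical, yields ElementTree elements rather than dicts, and does not implement the trailing condition in square brackets.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a "{namespace}" prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iterfind_sketch(source, path):
    """Yield elements whose chain of local names matches path."""
    wanted = [step for step in path.split("/") if step]
    if not wanted:
        raise ValueError("empty path")
    absolute = path.startswith("/")
    stack = []  # local names of the currently open elements
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(local_name(element.tag))
            continue
        # Absolute paths must match from the root, free paths match a suffix.
        candidate = stack if absolute else stack[-len(wanted):]
        if len(candidate) == len(wanted) and all(
            step in ("*", name) for step, name in zip(wanted, candidate)
        ):
            yield element
        stack.pop()


doc = (
    b'<root xmlns="urn:example"><list><item id="1">a</item>'
    b'<item id="2">b</item></list><item id="3">c</item></root>'
)
free = [e.get("id") for e in iterfind_sketch(io.BytesIO(doc), "item")]
absolute = [e.get("id") for e in iterfind_sketch(io.BytesIO(doc), "/root/list/item")]
wildcard = [e.get("id") for e in iterfind_sketch(io.BytesIO(doc), "/root/*/item")]
```

Because matching uses local names, the default namespace declared on the root element does not need to appear in the path.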
reset()

Resets the iterator to its initial state.

pyteomics.xml.load_byte_index(fp)[source]

Read a byte offset index from a file

Parameters:
fp : file

The file to read the index from

Returns:
ByteEncodingOrderedDict
pyteomics.xml.save_byte_index(index, fp)[source]

Write the byte offset index to the provided file

Parameters:
index : ByteEncodingOrderedDict

The byte offset index to be saved

fp : file

The file to write the index to

Returns:
file
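
A round trip through the two functions above might look like the following sketch. The on-disk format is not described on this page, so JSON is an assumption of the example, as are the names save_index_sketch and load_index_sketch. A plain OrderedDict with bytes keys stands in for ByteEncodingOrderedDict.

```python
import io
import json
from collections import OrderedDict


def save_index_sketch(index, fp):
    """Write (identifier, offset) pairs as a JSON list, preserving order."""
    pairs = [(key.decode("utf-8"), offset) for key, offset in index.items()]
    json.dump(pairs, fp)
    return fp


def load_index_sketch(fp):
    """Read the pairs back and restore bytes keys."""
    return OrderedDict((key.encode("utf-8"), offset) for key, offset in json.load(fp))


index = OrderedDict([(b"scan=1", 29), (b"scan=2", 76)])
buffer = save_index_sketch(index, io.StringIO())
buffer.seek(0)
restored = load_index_sketch(buffer)
```

Storing a list of pairs rather than a JSON object keeps the insertion order explicit in the file.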
pyteomics.xml.xpath(tree, path, ns=None)[source]

Return the results of XPath query with added namespaces. Assumes the ns declaration is on the root element or absent.

Parameters:
tree : ElementTree
path : str
ns : str or None, optional
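
The idea of "adding namespaces" to a query can be sketched with the standard library alone. The helper xpath_sketch is hypothetical: it takes the namespace from the root element's tag and qualifies every named step of the path with it, using the limited XPath subset that xml.etree.ElementTree supports. The real function works on lxml trees and accepts an ns argument, which this sketch omits.

```python
import re
import xml.etree.ElementTree as ET


def xpath_sketch(root, path):
    """Run a namespace-free path against a tree that uses a default namespace."""
    match = re.match(r"\{(.*)\}", root.tag)
    if match is None:
        return root.findall(path)  # no namespace on the root, query as-is
    uri = match.group(1)
    qualified = "/".join(
        step if step in ("", ".", "..", "*") else "{%s}%s" % (uri, step)
        for step in path.split("/")
    )
    return root.findall(qualified)


root = ET.fromstring(
    '<root xmlns="http://example.org/ns">'
    '<list><item id="1"/><item id="2"/></list></root>'
)
found = [element.get("id") for element in xpath_sketch(root, ".//item")]
```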
