Pyteomics documentation v4.4.0

pepxml - pepXML file reader

Summary

pepXML was the first widely accepted format for the output of proteomics search engines. Although it is being superseded by the community standard mzIdentML, it is still commonly used.

This module provides a minimalistic infrastructure for accessing data stored in pepXML files. The most important function is read(), which reads peptide-spectrum matches and related information and stores them in human-readable dicts. This function relies on the terminology of the underlying lxml library.

Data access

PepXML - a class representing a single pepXML file. Other data access functions use this class internally.

read() - iterate through peptide-spectrum matches in a pepXML file. Data for a single spectrum are converted to an easy-to-use dict.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

DataFrame() - read pepXML files into a pandas.DataFrame.

Target-decoy approach

filter() - filter PSMs from a chain of pepXML files to a specific FDR using TDA.

filter.chain() - chain a series of filters applied independently to several files.

filter.chain.from_iterable() - chain a series of filters applied independently to an iterable of files.

filter_df() - filter pepXML files and return a pandas.DataFrame.

fdr() - estimate the false discovery rate of a PSM set using the target-decoy approach.

qvalues() - get an array of scores and local FDR values for a PSM set using the target-decoy approach.

is_decoy() - determine whether a PSM is decoy or not.

Miscellaneous

roc_curve() - get a receiver operating characteristic (ROC) curve (min PeptideProphet probability in a sample vs. false discovery rate) of PeptideProphet analysis.

Deprecated functions

iterfind() - iterate over elements in a pepXML file. You can just call the corresponding method of the PepXML object.

version_info() - get information about pepXML version and schema. You can just read the corresponding attribute of the PepXML object.

Dependencies

This module requires lxml.


pyteomics.pepxml.chain(*sources, **kwargs)

Chain sequence_maker() for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the sequence_maker() function.

Parameters:
  • sources (Iterable) – Sources for creating new sequences from, such as paths or file-like objects.
  • kwargs (Mapping) – Additional arguments used to instantiate each sequence.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters: files – Iterable of file names or file objects.
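
The chaining behaviour can be pictured with the standard library alone. The sketch below is an illustration only, not the Pyteomics implementation: `fake_read` is a hypothetical stand-in for read(), and `chain_from_iterable` is a hypothetical helper that concatenates the per-file iterators in order while forwarding keyword arguments.

```python
from itertools import chain

def fake_read(source, **kwargs):
    """Hypothetical stand-in for read(): yields two dummy PSM dicts per source."""
    for index in range(2):
        yield {"source": source, "index": index, "options": dict(kwargs)}

def chain_from_iterable(files, **kwargs):
    """Lazily concatenate the PSM iterators of all files, in the order given."""
    return chain.from_iterable(fake_read(f, **kwargs) for f in files)

# File names are made up; nothing is read from disk in this sketch.
psms = list(chain_from_iterable(["a.pep.xml", "b.pep.xml"], iterative=True))
sources = [psm["source"] for psm in psms]
```

Because the concatenation is lazy, each file would only be opened when the previous one is exhausted.
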
pyteomics.pepxml.filter(*args, **kwargs)

Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.

Requires numpy and, optionally, pandas.

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is True.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • full_output (bool, keyword only, optional) –

    If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.

    Note

    The name for the parameter comes from the fact that it is internally passed to qvalues().

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • **kwargs – Passed to the chain() function.
Returns:

out

Return type:

iterator or numpy.ndarray or pandas.DataFrame

filter.chain(*files, **kwargs)

Chain filter() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter() function.

filter.chain.from_iterable(*files, **kwargs)

Chain filter() for several files. Keyword arguments are passed to the filter() function.

Parameters: files – Iterable of file names or file objects.

pyteomics.pepxml.version_info(source)

Provide version information about the pepXML file.

Note

This function is provided for backward compatibility only. It simply creates a PepXML instance and returns its version_info attribute.

Parameters: source (str or file) – File name or file-like object.
Returns: out – A (version, schema URL) tuple; both elements are strings or None.
Return type: tuple

pyteomics.pepxml.iterfind(source, path, **kwargs)[source]

Parse source and yield info on elements with specified local name or by specified “XPath”.

Note

This function is provided for backward compatibility only. If you make multiple iterfind() calls on one file, you should create a PepXML object and use its iterfind() method.

Parameters:
  • source (str or file) – File name or file-like object.
  • path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as: "/path/to/element[some_value>1.5]" Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
  • recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
  • iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the pepXML header. Otherwise, use default parameters. Not recommended without an Internet connection or if you don’t like to get the related warnings.
Returns:

out

Return type:

iterator
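
The idea of matching elements by local name and then filtering in plain Python can be sketched with the standard library. This is not how Pyteomics implements iterfind(); `iter_local` is a hypothetical helper, and the tiny XML snippet is made-up sample data that borrows pepXML element and attribute names.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<msms_pipeline_analysis xmlns="http://regis-web.systemsbiology.net/pepXML">
  <msms_run_summary>
    <spectrum_query spectrum="s1" assumed_charge="2"/>
    <spectrum_query spectrum="s2" assumed_charge="3"/>
  </msms_run_summary>
</msms_pipeline_analysis>"""

def iter_local(source, name):
    """Yield attribute dicts of elements whose namespace-free tag equals name."""
    for _, elem in ET.iterparse(source, events=("end",)):
        # Tags look like '{namespace}local_name'; keep only the local part.
        if elem.tag.rsplit("}", 1)[-1] == name:
            yield dict(elem.attrib)
            elem.clear()  # free memory, as iterative parsing is meant to do

# Filtering is done in ordinary Python rather than in the path expression.
hits = [e for e in iter_local(io.BytesIO(SAMPLE), "spectrum_query")
        if int(e["assumed_charge"]) > 2]
```
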

pyteomics.pepxml.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)

Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:

\mathrm{FDR} = \frac{N_{decoy}}{N_{target} \cdot ratio}

The second formula is:

\mathrm{FDR} = \frac{N_{decoy} \cdot (1 + \frac{1}{ratio})}{N_{total}}

Note

This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.
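
As a worked illustration of the two formulas, the sketch below evaluates them for made-up counts. The helper names are hypothetical, no correction is applied, and N_total is assumed to be the sum of target and decoy PSMs.

```python
def fdr_formula_1(n_decoy, n_target, ratio=1.0):
    """Formula 1: decoys relative to targets, scaled by the database size ratio."""
    return n_decoy / (n_target * ratio)

def fdr_formula_2(n_decoy, n_total, ratio=1.0):
    """Formula 2: estimated false matches (decoy and target) among all PSMs."""
    return n_decoy * (1 + 1 / ratio) / n_total

# Made-up counts: 10 decoy PSMs and 990 target PSMs, equal database sizes.
n_decoy, n_target = 10, 990
fdr1 = fdr_formula_1(n_decoy, n_target)            # 10 / 990
fdr2 = fdr_formula_2(n_decoy, n_decoy + n_target)  # 20 / 1000
```

With ratio = 1, formula 2 counts each decoy twice (once for itself, once for the false target it stands for) and divides by the full set, which is why it suits output that still contains decoys.
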

Parameters:
  • psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
  • formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
  • is_decoy (callable, iterable, or str, optional) –

    If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable, iterable, or str, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.

  • ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

    Note

    Requires numpy, if correction is a float or 2.

    Note

    Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).

Returns:

out – The estimation of FDR, (roughly) between 0 and 1.

Return type:

float

pyteomics.pepxml.qvalues(*args, **kwargs)

Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.

Requires numpy (and optionally pandas).

Parameters:
  • args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
  • key (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
  • is_decoy (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Warning

    The default function may not work with your files, because format flavours are diverse.

  • decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
  • decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
  • pep (callable / array-like / iterable / str, keyword only, optional) –

    If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).

    Note

    If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.

  • remove_decoy (bool, keyword only, optional) –

    Defines whether decoy matches should be removed from the output. Default is False.

    Note

    If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.

  • formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
  • ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
  • correction (int or float, keyword only, optional) –

    Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.

    0 (default): no correction;

    1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;

    2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.

    If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.

    See this paper for further explanation.

  • q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
  • score_label (str, optional) – Field name for score in the output. Default is 'score'.
  • decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
  • pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
  • full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
  • **kwargs – Passed to the chain() function.
Returns:

out – A sorted array of records with the following fields:

  • 'score': np.float64
  • 'is decoy': np.bool_
  • 'q': np.float64
  • 'psm': np.object_ (if full_output is True)

Return type:

numpy.ndarray
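
The target-decoy q-value calculation can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the Pyteomics code: it uses formula 1 from fdr(), applies no correction, treats smaller scores as better, and ignores score ties. `tda_qvalues` is a hypothetical name.

```python
def tda_qvalues(scores, decoy_flags, ratio=1.0):
    """Return (score, is_decoy, q) tuples sorted by ascending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    fdrs = []
    n_decoy = n_target = 0
    for i in order:
        if decoy_flags[i]:
            n_decoy += 1
        else:
            n_target += 1
        # Running FDR at this score threshold (formula 1); 1.0 if no targets yet.
        fdrs.append(n_decoy / (n_target * ratio) if n_target else 1.0)
    # q-value: the smallest FDR reachable at this or any looser threshold.
    qvals = fdrs[:]
    for k in range(len(qvals) - 2, -1, -1):
        qvals[k] = min(qvals[k], qvals[k + 1])
    return [(scores[i], decoy_flags[i], q) for i, q in zip(order, qvals)]

# Made-up e-values and decoy flags for five PSMs.
rows = tda_qvalues([0.001, 0.01, 0.02, 0.5, 0.03],
                   [False, False, True, True, False])
# Keep target PSMs whose q-value does not exceed the chosen threshold.
kept = [score for score, is_decoy, q in rows if q <= 0.4 and not is_decoy]
```

The backward minimum pass is what makes q-values monotonic in the score, so a single threshold on q selects a contiguous block of the best-scoring PSMs.
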

pyteomics.pepxml.DataFrame(*args, **kwargs)[source]

Read pepXML output files into a pandas.DataFrame.

Requires pandas.

Parameters:
  • *args – Passed to chain().
  • **kwargs – Passed to chain().
  • sep (str or None, keyword only, optional) – Some values related to PSMs (such as protein information) are variable-length lists. If sep is a str, they will be packed into single string using this delimiter. If sep is None, they are kept as lists. Default is None.
  • pd_kwargs (dict, optional) – Keyword arguments passed to the pandas.DataFrame constructor.
Returns:

out

Return type:

pandas.DataFrame
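
The effect of sep on variable-length values can be illustrated without pandas. The helper below is a hypothetical sketch of the packing rule described for sep, applied to a single made-up record; it is not taken from the Pyteomics source.

```python
def pack_lists(record, sep=None):
    """Join list values into one string when sep is a str; keep lists if None."""
    if sep is None:
        return dict(record)
    return {key: sep.join(map(str, value)) if isinstance(value, list) else value
            for key, value in record.items()}

# Made-up PSM record with a variable-length list of protein names.
record = {"peptide": "PEPTIDE", "protein": ["P1", "P2"]}
packed = pack_lists(record, sep=";")
unpacked = pack_lists(record)
```

Packed strings are convenient for export to flat files, while lists are easier to work with programmatically.
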

class pyteomics.pepxml.PepXML(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)[source]

Bases: pyteomics.xml.MultiProcessingXML, pyteomics.xml.IndexSavingXML

Parser class for pepXML files.

__init__(source, read_schema=False, iterative=True, build_id_cache=False, use_index=None, *args, **kwargs)

Create an indexed XML parser object.

Parameters:
  • source (str or file) – File name or file-like object corresponding to an XML file.
  • read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
  • iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
  • use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
  • indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
build_id_cache()

Construct a cache for each element in the document, indexed by id attribute

build_tree()

Build and store the ElementTree instance for the underlying file

clear_id_cache()

Clear the element ID cache

clear_tree()

Remove the saved ElementTree.

get_by_id(elem_id, id_key=None, element_type=None, **kwargs)

Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.

Parameters:
  • elem_id (str) – The id value of the entity to retrieve.
  • id_key (str, optional) – The name of the XML attribute to use for lookup. Defaults to self._default_id_attr.
Returns:

Return type:

dict
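
The idea behind a byte-offset index, seeking straight to an entry instead of parsing from the start, can be sketched as follows. This is an assumption-laden illustration: `build_offset_index`, the regular expression, and the sample bytes are all hypothetical and do not reflect how Pyteomics builds or stores its index.

```python
import io
import re

DATA = (b'<root>\n'
        b'<spectrum_query spectrum="s1" index="1"/>\n'
        b'<spectrum_query spectrum="s2" index="2"/>\n'
        b'</root>\n')

def build_offset_index(buf, tag=b"spectrum_query", id_attr=b"spectrum"):
    """Map each element's id attribute to the byte offset of its opening tag."""
    pattern = re.compile(b"<" + tag + rb"\b[^>]*?\b" + id_attr + rb'="([^"]*)"')
    return {m.group(1).decode(): m.start() for m in pattern.finditer(buf)}

index = build_offset_index(DATA)

# Random access: jump to the stored offset and read just that entry.
stream = io.BytesIO(DATA)
stream.seek(index["s2"])
entry = stream.readline()
```
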

iterfind(path, **kwargs)

Parse the XML and yield info on elements with specified local name or by specified “XPath”.

Parameters:
  • path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
  • **kwargs – Passed to self._get_info_smart().
Returns:

out

Return type:

iterator

map(target=None, processes=-1, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
  • target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs
  • processes (int, optional) – The number of worker processes to use. If 0 or negative, defaults to the number of available CPUs. This parameter can also be set at reader creation.
  • args (Sequence, optional) – Additional positional arguments to be passed to the target function
  • kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function
  • **_kwargs – Additional keyword arguments to be passed to the target function
Yields:

object – The work item returned by the target function.

classmethod prebuild_byte_offset_file(path)

Construct a new XML reader, build its byte offset index and write it to file

Parameters: path (str) – The path to the file to parse.

reset()

Resets the iterator to its initial state.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename

pyteomics.pepxml.filter_df(*args, **kwargs)[source]

Read pepXML files or DataFrames and return a DataFrame with filtered PSMs. Positional arguments can be pepXML files or DataFrames.

Requires pandas.

Parameters:
  • key (str / iterable / callable, keyword only, optional) – PSM score. Default is ‘expect’.
  • is_decoy (str / iterable / callable, keyword only, optional) – Default is to check if all strings in the “protein” column start with ‘DECOY_’
  • *args – Passed to auxiliary.filter() and/or DataFrame().
  • **kwargs – Passed to auxiliary.filter() and/or DataFrame().
Returns:

out

Return type:

pandas.DataFrame

pyteomics.pepxml.is_decoy(psm, prefix='DECOY_')

Given a PSM dict, return True if all protein names for the PSM start with prefix, and False otherwise. This function might not work for some pepXML flavours. Use the source to get the idea and suit it to your needs.

Parameters:
  • psm (dict) – A dict, as yielded by read().
  • prefix (str, optional) – A prefix used to mark decoy proteins. Default is ‘DECOY_’.
Returns:

out

Return type:

bool
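
A custom decoy check for a different pepXML flavour can follow the same rule. The sketch below assumes that the protein names of a PSM are found under psm['search_hit'][0]['proteins'], each as a dict with a 'protein' key; that layout and the function name `all_proteins_decoy` are assumptions, so inspect a PSM from your own file before relying on it.

```python
def all_proteins_decoy(psm, prefix="DECOY_"):
    """Return True if every protein of the top-ranked hit starts with prefix."""
    proteins = psm["search_hit"][0]["proteins"]  # assumed PSM layout
    return all(p["protein"].startswith(prefix) for p in proteins)

# Made-up PSMs: one shared between a target and a decoy protein, one pure decoy.
shared_psm = {"search_hit": [{"proteins": [{"protein": "sp|P12345|X"},
                                           {"protein": "DECOY_sp|Q1|Y"}]}]}
decoy_psm = {"search_hit": [{"proteins": [{"protein": "DECOY_sp|P12345|X"}]}]}

shared_is_decoy = all_proteins_decoy(shared_psm)
decoy_is_decoy = all_proteins_decoy(decoy_psm)
```

Requiring all proteins to carry the prefix means a peptide shared between a target and a decoy protein is counted as a target.
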

pyteomics.pepxml.read(source, read_schema=False, iterative=True, **kwargs)[source]

Parse source and iterate through peptide-spectrum matches.

Parameters:
  • source (str or file) – A path to a target pepXML file or the file object itself.
  • read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the pepXML header. Otherwise, use default parameters. Not recommended without Internet connection or if you don’t like to get the related warnings.
  • iterative (bool, optional) – Defines whether iterative parsing should be used. It helps reduce memory usage at almost the same parsing speed. Default is True.
Returns:

out – An iterator over dicts with PSM properties.

Return type:

PepXML
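
A typical way to consume the iterator is to pull a few fields out of each PSM dict. In the sketch below, `best_hits` is a hypothetical helper; the 'spectrum', 'search_hit' and 'peptide' keys are assumed from common pepXML output and may differ between flavours. The reader is injectable so the logic can be tried on in-memory data; when it is omitted, pyteomics.pepxml.read is imported lazily.

```python
def best_hits(source, reader=None):
    """Yield (spectrum, peptide) for the top-ranked hit of each PSM."""
    if reader is None:
        # Imported lazily; requires pyteomics and lxml to be installed.
        from pyteomics import pepxml
        reader = pepxml.read
    for psm in reader(source):
        hits = psm.get("search_hit") or []
        if hits:
            yield psm["spectrum"], hits[0]["peptide"]

def fake_reader(source):
    """Hypothetical stand-in for read(), yielding made-up PSM dicts."""
    yield {"spectrum": "s1", "search_hit": [{"peptide": "PEPTIDE"}]}
    yield {"spectrum": "s2", "search_hit": []}

pairs = list(best_hits("unused.pep.xml", reader=fake_reader))
```

With a real file, calling best_hits(path) without the reader argument would iterate over the PSMs produced by read().
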

pyteomics.pepxml.roc_curve(source)[source]

Parse source and return a ROC curve for PeptideProphet analysis.

Parameters: source (str or file) – A path to a target pepXML file or the file object itself.
Returns: out – A list of ROC points.
Return type: list
