idxml - idXML file reader
Summary
idXML is a format specified in the OpenMS project. It defines a list of peptide identifications.
This module provides a minimalistic way to extract information from idXML files. You can use the old functional interface (read()) or the new object-oriented interface (IDXML) to iterate over entries in <PeptideIdentification> elements. Note that each entry can contain more than one PSM (peptide-spectrum match). The PSMs are accessible with the 'PeptideHit' key.
IDXML objects also support direct indexing by element ID.
Data access
IDXML - a class representing a single idXML file. Other data access functions use this class internally.
read() - iterate through peptide-spectrum matches in an idXML file. Data from a single PSM group are converted to a human-readable dict. Basically creates an IDXML object and reads it.
chain() - read multiple files at once.
chain.from_iterable() - read multiple files at once, using an iterable of files.
DataFrame() - read idXML files into a pandas.DataFrame.
Target-decoy approach
filter() - read a chain of idXML files and filter to a certain FDR using TDA.
filter.chain() - chain a series of filters applied independently to several files.
filter.chain.from_iterable() - chain a series of filters applied independently to an iterable of files.
filter_df() - filter idXML files and return a pandas.DataFrame.
is_decoy() - determine if a PSM should be considered decoy.
fdr() - estimate the false discovery rate of a set of identifications using the target-decoy approach.
qvalues() - get an array of scores and local FDR values for a PSM set using the target-decoy approach.
Deprecated functions
version_info() - get information about idXML version and schema. You can just read the corresponding attribute of the IDXML object.
get_by_id() - get an element by its ID and extract the data from it. You can just call the corresponding method of the IDXML object.
iterfind() - iterate over elements in an idXML file. You can just call the corresponding method of the IDXML object.
Dependencies
This module requires lxml.
- pyteomics.openms.idxml.version_info(source)
Provide version information about the idXML file.
Note
This function is provided for backward compatibility only. It simply creates an IDXML instance and returns its version_info attribute.
- pyteomics.openms.idxml.fdr(psms=None, formula=1, is_decoy=None, ratio=1, correction=0, pep=None, decoy_prefix='DECOY_', decoy_suffix=None)
Estimate FDR of a data set using TDA or given PEP values. Two formulas can be used. The first one (default) is:
\[FDR = \frac{N_{decoy}}{N_{target} * ratio}\]
The second formula is:
\[FDR = \frac{N_{decoy} * (1 + \frac{1}{ratio})}{N_{total}}\]
Note
This function is less versatile than qvalues(). To obtain FDR, you can call qvalues() and take the last q-value. This function can be used (with correction = 0 or 1) when numpy is not available.
- Parameters:
psms (iterable, optional) – An iterable of PSMs, e.g. as returned by read(). Not needed if is_decoy is an iterable.
formula (int, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1.
is_decoy (callable, iterable, or str, optional) –
If callable, should accept exactly one argument (PSM) and return a truthy value if the PSM is considered decoy. Default is is_decoy(). If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
Warning
The default function may not work with your files, because format flavours are diverse.
decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
pep (callable, iterable, or str, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a pandas.DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate FDR. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, formula, ratio, correction.
ratio (float, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
correction (int or float, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
Note
Requires numpy, if correction is a float or 2.
Note
Correction is only needed if the PSM set at hand was obtained using TDA filtering based on decoy counting (as done by using filter() without correction).
- Returns:
out – The estimation of FDR, (roughly) between 0 and 1.
- Return type:
float
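The two formulas given for fdr() can be checked by hand on made-up counts. The sketch below is plain Python arithmetic and not the library's implementation. The counts are hypothetical, and treating the "+1" correction as one extra decoy in the numerator of formula 1 is an assumption about where the correction enters.

```python
def fdr_formula_1(n_decoy, n_target, ratio=1.0, correction=0):
    """First formula: decoys over targets, scaled by the database size ratio.

    `correction` is restricted to 0 or 1 in this sketch and is simply added
    to the decoy count (an assumption, see the lead-in).
    """
    return (n_decoy + correction) / (n_target * ratio)


def fdr_formula_2(n_decoy, n_total, ratio=1.0):
    """Second formula: decoy-based error estimate over all PSMs."""
    return n_decoy * (1 + 1.0 / ratio) / n_total


# Hypothetical counts: 1000 PSMs in total, 10 of them decoys.
n_total, n_decoy = 1000, 10
n_target = n_total - n_decoy

fdr1 = fdr_formula_1(n_decoy, n_target)
fdr1_plus_one = fdr_formula_1(n_decoy, n_target, correction=1)
fdr2 = fdr_formula_2(n_decoy, n_total)
print(fdr1, fdr1_plus_one, fdr2)
```

With ratio = 1, formula 2 counts every decoy twice (once for itself and once for the false target it is assumed to represent) and divides by the full set, which is why it suits a PSM list that still contains decoys.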
- pyteomics.openms.idxml.qvalues(*args, **kwargs)
Read args and return a NumPy array with scores and q-values. q-values are calculated either using TDA or based on provided values of PEP.
Requires numpy (and optionally pandas).
- Parameters:
args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
key (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). If array-like, should contain scores for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Warning
The default function may not work with your files, because format flavours are diverse.
reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
is_decoy (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy. If array-like, should contain boolean values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Warning
The default function may not work with your files, because format flavours are diverse.
decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
pep (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
remove_decoy (bool, keyword only, optional) –
Defines whether decoy matches should be removed from the output. Default is False.
Note
If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.
formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
score_label (str, optional) – Field name for score in the output. Default is 'score'.
decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
full_output (bool, keyword only, optional) – If True, then the returned array has PSM objects along with scores and q-values. Default is False.
**kwargs (passed to the chain() function.)
- Returns:
out – A sorted array of records with the following fields:
'score': np.float64
'is decoy': np.bool_
'q': np.float64
'psm': np.object_ (if full_output is True)
- Return type:
numpy.ndarray
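The relation between running FDR and q-values can be illustrated without numpy. This is a simplified sketch and not the library's code: it uses formula 1 with no correction, expects scores where smaller is better, and guards the target count with max(..., 1) as an arbitrary choice for the case where only decoys have been seen so far. All names and data are hypothetical.

```python
def qvalues_sketch(scores, decoy_flags, ratio=1.0):
    """Return (score, is_decoy, q) tuples sorted from best to worst score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])

    # Running FDR estimate (formula 1) at every score threshold.
    running_fdr = []
    n_decoy = n_target = 0
    for i in order:
        if decoy_flags[i]:
            n_decoy += 1
        else:
            n_target += 1
        running_fdr.append(n_decoy / (max(n_target, 1) * ratio))

    # The q-value of a PSM is the smallest FDR among all thresholds that
    # still accept it, i.e. a running minimum taken from the worst score up.
    q = running_fdr[:]
    for k in range(len(q) - 2, -1, -1):
        q[k] = min(q[k], q[k + 1])

    return [(scores[i], bool(decoy_flags[i]), q_k) for i, q_k in zip(order, q)]


scores = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
decoys = [False, False, True, False, False, True]
result = qvalues_sketch(scores, decoys)
for row in result:
    print(row)
```

The running minimum is what makes q-values monotonic in the score, so a single cut on q selects a contiguous block of top-ranked PSMs.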
- pyteomics.openms.idxml.chain(*sources, **kwargs)
Chain IDXML for several sources into a single iterable. Positional arguments should be sources like file names or file objects. Keyword arguments are passed to the IDXML constructor.
- Parameters:
sources (Iterable) – Sources for creating new sequences from, such as paths or file-like objects
kwargs (Mapping) – Additional arguments used to instantiate each sequence
- chain.from_iterable(files, **kwargs)
Chain read() for several files. Keyword arguments are passed to the read() function.
- Parameters:
files – Iterable of file names or file objects.
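The chaining behaviour described for chain() and chain.from_iterable() resembles itertools.chain applied to one reader per source. The sketch below shows that idea with a stand-in reader; chain_readers, fake_reader and the file names are hypothetical, and the library's own implementation may differ.

```python
from itertools import chain


def chain_readers(sources, reader_factory, **kwargs):
    """Lazily yield entries from each source in turn.

    A reader for the next source is created only after the previous one is
    exhausted, because the generator expression is consumed on demand.
    """
    return chain.from_iterable(
        reader_factory(source, **kwargs) for source in sources
    )


# Stand-in for a real reader: maps a name to a list of pre-parsed entries.
fake_files = {
    'a.idXML': [{'id': 1}, {'id': 2}],
    'b.idXML': [{'id': 3}],
}


def fake_reader(name, **kwargs):
    return iter(fake_files[name])


merged = list(chain_readers(['a.idXML', 'b.idXML'], fake_reader))
print(merged)
```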
- pyteomics.openms.idxml.filter(*args, **kwargs)
Read args and yield only the PSMs that form a set with estimated false discovery rate (FDR) not exceeding fdr.
Requires numpy and, optionally, pandas.
- Parameters:
args (positional) – Files to read PSMs from. All positional arguments are treated as files. The rest of the arguments must be named.
fdr (float, keyword only, 0 <= fdr <= 1) – Desired FDR level.
key (callable / array-like / iterable / str, keyword only, optional) –
A function used for sorting of PSMs. Should accept exactly one argument (PSM) and return a number (the smaller the better). The default is a function that tries to extract e-value from the PSM.
Warning
The default function may not work with your files, because format flavours are diverse.
reverse (bool, keyword only, optional) – If True, then PSMs are sorted in descending order, i.e. the value of the key function is higher for better PSMs. Default is False.
is_decoy (callable / array-like / iterable / str, keyword only, optional) –
A function used to determine if the PSM is decoy or not. Should accept exactly one argument (PSM) and return a truthy value if the PSM should be considered decoy.
Warning
The default function may not work with your files, because format flavours are diverse.
decoy_prefix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name prefix to use to detect decoy matches. If you provide your own is_decoy, or if you specify decoy_suffix, this parameter has no effect. Default is “DECOY_”.
decoy_suffix (str, optional) – If the default is_decoy function works for you, this parameter specifies which protein name suffix to use to detect decoy matches. If you provide your own is_decoy, this parameter has no effect. Mutually exclusive with decoy_prefix.
remove_decoy (bool, keyword only, optional) –
Defines whether decoy matches should be removed from the output. Default is True.
Note
If set to False, then by default the decoy PSMs will be taken into account when estimating FDR. Refer to the documentation of fdr() for math; basically, if remove_decoy is True, then formula 1 is used to control output FDR, otherwise it’s formula 2. This can be changed by overriding the formula argument.
formula (int, keyword only, optional) – Can be either 1 or 2, defines which formula should be used for FDR estimation. Default is 1 if remove_decoy is True, else 2 (see fdr() for definitions).
ratio (float, keyword only, optional) – The size ratio between the decoy and target databases. Default is 1. In theory, the “size” of the database is the number of theoretical peptides eligible for assignment to spectra that are produced by in silico cleavage of that database.
correction (int or float, keyword only, optional) –
Possible values are 0, 1 and 2, or floating point numbers between 0 and 1.
0 (default): no correction;
1: enable “+1” correction. This accounts for the probability that a false positive scores better than the first excluded decoy PSM;
2: this also corrects that probability for finite size of the sample, so the correction will be slightly less than “+1”.
If a floating point number is given, then instead of the expectation value for the number of false PSMs, the confidence value is used. The value of correction is then interpreted as desired confidence level. E.g., if correction=0.95, then the calculated q-values do not exceed the “real” q-values with 95% probability.
See this paper for further explanation.
pep (callable / array-like / iterable / str, keyword only, optional) –
If callable, a function used to determine the posterior error probability (PEP). Should accept exactly one argument (PSM) and return a float. If array-like, should contain float values for all given PSMs. If string, it is used as a field name (PSMs must be in a record array or a DataFrame).
Note
If this parameter is given, then PEP values will be used to calculate q-values. Otherwise, decoy PSMs will be used instead. This option conflicts with: is_decoy, remove_decoy, formula, ratio, correction. key can still be provided. Without key, PSMs will be sorted by PEP.
full_output (bool, keyword only, optional) –
If True, then an array of PSM objects is returned. Otherwise, an iterator / context manager object is returned, and the files are parsed twice. This saves some RAM, but is ~2x slower. Default is True.
Note
The name for the parameter comes from the fact that it is internally passed to qvalues().
q_label (str, optional) – Field name for q-value in the output. Default is 'q'.
score_label (str, optional) – Field name for score in the output. Default is 'score'.
decoy_label (str, optional) – Field name for the decoy flag in the output. Default is 'is decoy'.
pep_label (str, optional) – Field name for PEP in the output. Default is 'PEP'.
**kwargs (passed to the chain() function.)
- Returns:
out
- Return type:
iterator or numpy.ndarray or pandas.DataFrame
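Because the default key and is_decoy functions may not suit every file, it helps to see how custom callables plug into a threshold filter. The following is a self-contained sketch of the idea, not the library's implementation: it uses formula 1 without correction, and it assumes each PSM dict stores its hits under 'PeptideHit' with 'score' and 'target_decoy' fields. Those field names, the helper names and the mock data are all assumptions.

```python
def best_hit_score(psm):
    """Custom key: score of the first hit (smaller is better here)."""
    return psm['PeptideHit'][0]['score']


def marked_decoy(psm):
    """Custom is_decoy: relies on an assumed 'target_decoy' field."""
    return psm['PeptideHit'][0].get('target_decoy') == 'decoy'


def filter_sketch(psms, fdr, key, is_decoy, remove_decoy=True):
    """Keep the largest top-ranked block whose estimated FDR is <= fdr."""
    ranked = sorted(psms, key=key)
    n_decoy = n_target = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for position, psm in enumerate(ranked, start=1):
        if is_decoy(psm):
            n_decoy += 1
        else:
            n_target += 1
        # The estimate is undefined until at least one target has been seen.
        if n_target and n_decoy / n_target <= fdr:
            cutoff = position
    kept = ranked[:cutoff]
    if remove_decoy:
        kept = [psm for psm in kept if not is_decoy(psm)]
    return kept


def make_psm(sequence, score, label):
    return {'PeptideHit': [{'sequence': sequence, 'score': score,
                            'target_decoy': label}]}


mock_psms = [
    make_psm('AAAK', 0.04, 'target'), make_psm('CCCK', 0.01, 'target'),
    make_psm('DDDK', 0.03, 'decoy'), make_psm('EEEK', 0.02, 'target'),
    make_psm('FFFK', 0.06, 'decoy'), make_psm('GGGK', 0.05, 'target'),
]
passed = filter_sketch(mock_psms, fdr=0.3, key=best_hit_score,
                       is_decoy=marked_decoy)
print([psm['PeptideHit'][0]['sequence'] for psm in passed])
```

Callables with the same one-argument shape are what the key and is_decoy parameters documented above expect.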
- filter.chain(*files, **kwargs)
Chain filter() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the filter() function.
- filter.chain.from_iterable(*files, **kwargs)
Chain filter() for several files. Keyword arguments are passed to the filter() function.
- Parameters:
files – Iterable of file names or file objects.
- pyteomics.openms.idxml.DataFrame(*args, **kwargs)
Read idXML files into a pandas.DataFrame.
Requires pandas.
Warning
Only the first ‘PeptideHit’ element is considered in every ‘PeptideIdentification’.
- Parameters:
*args – Passed to chain()
**kwargs – Passed to chain()
sep (str or None, keyword only, optional) – Some values related to PSMs (such as protein information) are variable-length lists. If sep is a str, they will be packed into a single string using this delimiter. If sep is None, they are kept as lists. Default is None.
- Returns:
out
- Return type:
pandas.DataFrame
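The effect of the sep parameter on list-valued fields can be pictured with a plain dict standing in for one row. This is an illustrative sketch only: pack_lists and the 'accession' column name are hypothetical and do not describe how the library builds its rows.

```python
def pack_lists(row, sep=None):
    """Join list-valued fields into one string when `sep` is a str.

    With sep=None the lists are left untouched, mirroring the documented
    default.
    """
    if sep is None:
        return dict(row)
    return {
        key: sep.join(str(item) for item in value) if isinstance(value, list) else value
        for key, value in row.items()
    }


row = {'sequence': 'PEPTIDEK', 'accession': ['P12345', 'Q67890']}
packed = pack_lists(row, sep=';')
unpacked = pack_lists(row)
print(packed)
print(unpacked)
```

Packed strings are convenient for writing the table to CSV, while lists are easier to work with when the values need further processing in Python.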
- class pyteomics.openms.idxml.IDXML(*args, **kwargs)
Bases: IndexedXML
Parser class for idXML files.
- __init__(*args, **kwargs)
Create an indexed XML parser object.
- Parameters:
source (str or file) – File name or file-like object corresponding to an XML file.
read_schema (bool, optional) – Defines whether schema file referenced in the file header should be used to extract information about value conversion. Default is False.
iterative (bool, optional) – Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
indexed_tags (container of bytes, optional) – If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
- build_byte_index()
Build up an index of offsets for elements.
- Returns:
out
- Return type:
- build_id_cache()
Construct a cache for each element in the document, indexed by id attribute.
- build_tree()
Build and store the ElementTree instance for the underlying file.
- clear_id_cache()
Clear the element ID cache.
- clear_tree()
Remove the saved ElementTree.
- get_by_id(elem_id, id_key=None, element_type=None, **kwargs)
Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it will be retrieved by immediately seeking to the starting position of the entry, otherwise falling back to parsing from the start of the file.
- iterfind(path, **kwargs)
Parse the XML and yield info on elements with specified local name or by specified “XPath”.
- Parameters:
path (str) – Element name or XPath-like expression. The path is very close to full XPath syntax, but local names should be used for all elements in the path. They will be substituted with local-name() checks, up to the (first) predicate. The path can be absolute or “free”. Please don’t specify namespaces.
**kwargs (passed to self._get_info_smart().)
- Returns:
out
- Return type:
iterator
- reset()
Resets the iterator to its initial state.
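The byte-offset idea behind build_byte_index() and get_by_id() can be shown on a toy document. This is a conceptual sketch using only the standard library and is not how the parser is implemented: the regex scan, the function names and the 'Item' tag are hypothetical, and nested elements with the same tag are not handled.

```python
import io
import re


def build_offset_index(raw, tag):
    """Map each element's id attribute to the byte offset of its start tag."""
    pattern = re.compile(rb'<' + re.escape(tag) + rb'\b[^>]*\bid="([^"]+)"')
    return {m.group(1).decode('utf-8'): m.start() for m in pattern.finditer(raw)}


def read_element(stream, offset, tag):
    """Seek straight to an indexed element and return its raw bytes."""
    stream.seek(offset)
    closing = b'</' + tag + b'>'
    chunk = stream.read()
    return chunk[:chunk.index(closing) + len(closing)]


raw = b'<root><Item id="a">1</Item><Item id="b">2</Item></root>'
index = build_offset_index(raw, b'Item')
element = read_element(io.BytesIO(raw), index['b'], b'Item')
print(index)
print(element)
```

Once the offsets are known, fetching one element costs a seek and a short read instead of a parse from the start of the file, which is the benefit the get_by_id() description refers to.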
- pyteomics.openms.idxml.filter_df(*args, **kwargs)
Read idXML files or DataFrames and return a DataFrame with filtered PSMs. Positional arguments can be idXML files or DataFrames.
Requires pandas.
Warning
Only the first ‘PeptideHit’ element is considered in every ‘PeptideIdentification’.
- Parameters:
key (str / iterable / callable, keyword only, optional) – Peptide identification score. Default is ‘score’. You will probably need to change it.
is_decoy (str / iterable / callable, keyword only, optional) – Default is ‘is decoy’.
*args – Passed to auxiliary.filter() and/or DataFrame().
**kwargs – Passed to auxiliary.filter() and/or DataFrame().
- Returns:
out
- Return type:
pandas.DataFrame
- pyteomics.openms.idxml.get_by_id(source, elem_id, **kwargs)
Parse source and return the element with id attribute equal to elem_id. Returns None if no such element is found.
Note
This function is provided for backward compatibility only. If you do multiple get_by_id() calls on one file, you should create an IDXML object and use its get_by_id() method.
- pyteomics.openms.idxml.is_decoy(psm, prefix=None)
Given a PSM dict, return True if it is marked as decoy, and False otherwise.
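The decoy_prefix and decoy_suffix parameters documented for fdr(), qvalues() and filter() describe name-based decoy detection. The sketch below shows one way such a rule could look. It works on a plain list of protein names rather than a PSM dict, and the convention that a match counts as decoy only when all of its proteins are decoys is an assumption, as are the function names and the '_REV' suffix.

```python
def make_is_decoy(decoy_prefix='DECOY_', decoy_suffix=None):
    """Build a checker for a list of protein names.

    When a suffix is given it takes precedence and the prefix is ignored,
    following the parameter descriptions above.
    """
    def is_decoy(protein_names):
        names = list(protein_names)
        if not names:
            # No protein information: do not call it a decoy.
            return False
        if decoy_suffix is not None:
            return all(name.endswith(decoy_suffix) for name in names)
        return all(name.startswith(decoy_prefix) for name in names)
    return is_decoy


by_prefix = make_is_decoy()
by_suffix = make_is_decoy(decoy_suffix='_REV')

print(by_prefix(['DECOY_P12345']))
print(by_prefix(['DECOY_P12345', 'P67890']))
print(by_suffix(['P12345_REV']))
```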
- pyteomics.openms.idxml.iterfind(source, path, **kwargs)
Parse source and yield info on elements with specified local name or by specified “XPath”.
Note
This function is provided for backward compatibility only. If you do multiple iterfind() calls on one file, you should create an IDXML object and use its iterfind() method.
- Parameters:
source (str or file) – File name or file-like object.
path (str) – Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition in the end, such as:
"/path/to/element[some_value>1.5]"
Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
retrieve_refs (bool, optional) – If True, additional information from references will be automatically added to the results. The file processing time will increase. Default is False.
iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the IDXML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings.
build_id_cache (bool, optional) – Defines whether a cache of element IDs should be built and stored on the created IDXML instance. Default value is the value of retrieve_refs.
- Returns:
out
- Return type:
iterator
- pyteomics.openms.idxml.read(source, **kwargs)
Parse source and iterate through peptide-spectrum matches.
Note
This function is provided for backward compatibility only. It simply creates an IDXML instance using provided arguments and returns it.
- Parameters:
source (str or file) – A path to a target IDXML file or the file object itself.
recursive (bool, optional) – If False, subelements will not be processed when extracting info from elements. Default is True.
retrieve_refs (bool, optional) – If True, additional information from references will be automatically added to the results. The file processing time will increase. Default is True.
iterative (bool, optional) – Specifies whether iterative XML parsing should be used. Iterative parsing significantly reduces memory usage and may be just a little slower. When retrieve_refs is True, however, it is highly recommended to disable iterative parsing if possible. Default value is True.
read_schema (bool, optional) – If True, attempt to extract information from the XML schema mentioned in the IDXML header (default). Otherwise, use default parameters. Disable this to avoid waiting on slow network connections or if you don’t like to get the related warnings.
build_id_cache (bool, optional) –
Defines whether a cache of element IDs should be built and stored on the created IDXML instance. Default value is the value of retrieve_refs.
Note
This parameter is ignored when use_index is True (default).
use_index (bool, optional) – Defines whether an index of byte offsets needs to be created for the indexed elements. If True (default), build_id_cache is ignored.
indexed_tags (container of bytes, optional) – Defines which elements need to be indexed. Empty set by default.
- Returns:
out – An iterator over the dicts with PSM properties.
- Return type: