modX format and the parser module¶

Pyteomics historically uses a custom IUPAC-derived peptide sequence notation named modX. As in the IUPAC notation, each amino acid residue is represented by a capital letter, but it may preceded by an arbitrary number of small letters to show modification. Terminal groups are separated from the backbone sequence by a hyphen (‘-’). By default, both termini are assumed to be unmodified, which can be shown explicitly by ‘H-’ for N-terminal hydrogen and ‘-OH’ for C-terminal hydroxyl.

“H-HoxMMdaN-OH” is an example of a valid sequence in modX. See parser - operations on modX peptide sequences for additional information. Note that it is recommended to include either 0 or 2 terminal groups in a modX sequence.

Parsing¶

There are two helper functions to check if a label is in modX format or represents a terminal modification: pyteomics.parser.is_modX() and pyteomics.parser.is_term_group():

>>> parser.is_modX('A')
True
>>> parser.is_modX('pT')
True
>>> parser.is_modX('pTx')
False
>>> parser.is_term_group('pT')
False
>>> parser.is_term_group('Ac-')
True

A modX sequence can be translated to a list of amino acid residues with pyteomics.parser.parse() function:

>>> from pyteomics import parser
>>> parser.parse('PEPTIDE')
['P', 'E', 'P', 'T', 'I', 'D', 'E']
>>> parser.parse('PEPTIDE', show_unmodified_termini=True)
['H-', 'P', 'E', 'P', 'T', 'I', 'D', 'E', '-OH']
>>> parser.parse('Ac-PEpTIDE', labels=parser.std_labels+['Ac-', 'pT'])
['Ac-', 'P', 'E', 'pT', 'I', 'D', 'E']

In the last example we supplied two arguments, the sequence itself and ‘labels’. The latter is used to specify what labels are allowed for amino acid residues and terminal modifications. std_labels is a predefined set of labels for the twenty standard amino acids, ‘H-’ for N-terminal hydrogen and ‘-OH’ for C-terminal hydroxyl. In this example we specified the codes for phosphorylated threonine and N-terminal acetylation.

Since version 2.5, specifying labels is never mandatory. If this argument is not supplied, no checks will be made. However, the last example won’t work without labels, because it has only one terminal group shown, which is discouraged.

parse() has another mode, in which it returns tuples:

>>> parser.parse('Ac-PEpTIDE-OH', split=True)
[('Ac-', 'P'), ('E',), ('p', 'T'), ('I',), ('D',), ('E',)]

or:

>>> parser.parse('Ac-PEpTIDE-OH', split=True, labels=parser.std_labels+['Ac-', 'p'])
[('Ac-', 'P'), ('E',), ('p', 'T'), ('I',), ('D',), ('E',)]

Also, note what we supply as labels here: ‘p’ instead of ‘pT’. That means that ‘p’ is a modification applicable to any residue.

In modX, standard len() function cannot be used to determine the length of a peptide because of the modifications. Use pyteomics.parser.length() instead:

>>> from pyteomics import parser
>>> parser.length('aVRILLaVIGNE')
10

The pyteomics.parser.amino_acid_composition() function accepts a sequence and returns a dictionary with amino acid labels as keys and integer numbers as values, corresponding to the number of times each residue occurs in the sequence:

>>> from pyteomics import parser
>>> parser.amino_acid_composition('PEPTIDE')
{'I': 1.0, 'P': 2.0, 'E': 2.0, 'T': 1.0, 'D': 1.0}

In silico digestion¶

pyteomics.parser.cleave() performs in silico cleavage. The required arguments are the sequence, the rule for enzyme specificity and the number of missed cleavages allowed (optional). cleave() returns a set of product peptides; you can get original indices of peptides with xcleave().

>>> from pyteomics import parser
>>> parser.cleave('AKAKBK', parser.expasy_rules['trypsin'], 0)
{'AK', 'BK'}
>>> parser.xcleave('AKAKBK', 'trypsin', 0)
[(0, 'AK'), (2, 'AK'), (4, 'BK')]

pyteomics.parser.expasy_rules and pyteomics.parser.psims_rules are predefined dicts with the clevage rules for the most common proteases. Their keys are recognized by cleave().

Variable modifications¶

All possible modified sequences of a peptide can be obtained with pyteomics.parser.isoforms():

>>> from pyteomics import parser
>>> forms = parser.isoforms('PEPTIDE', variable_mods={'p': ['T'], 'ox': ['P']})
>>> for seq in forms:
...    print(seq)
...
oxPEPpTIDE
oxPEPTIDE
oxPEoxPpTIDE
oxPEoxPTIDE
PEPpTIDE
PEPTIDE
PEoxPpTIDE
PEoxPTIDE

Pyteomics documentation v5.0