modX format and the parser module
Pyteomics historically uses a custom IUPAC-derived peptide sequence notation named modX. As in the IUPAC notation, each amino acid residue is represented by a capital letter, but it may be preceded by an arbitrary number of lowercase letters to indicate a modification. Terminal groups are separated from the backbone sequence by a hyphen (‘-’). By default, both termini are assumed to be unmodified, which can be shown explicitly by ‘H-’ for N-terminal hydrogen and ‘-OH’ for C-terminal hydroxyl.
“H-HoxMMdaN-OH” is an example of a valid sequence in modX. See
parser - operations on modX peptide sequences for additional information. Note that it is recommended to include
either 0 or 2 terminal groups in a modX sequence.
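The notation described above can be illustrated with a short, library-independent sketch. The pattern and the split_modx() helper below are hypothetical and are not taken from Pyteomics; they assume that modifications are written with ASCII lowercase letters and that the sequence has either zero or two terminal groups:

```python
import re

# Hypothetical pattern, not taken from Pyteomics: an optional N-terminal
# group ending in '-', a backbone of residues (lowercase letters followed by
# one capital letter), and an optional C-terminal group starting with '-'.
MODX_PATTERN = re.compile(r'^([^-]+-)?((?:[a-z]*[A-Z])+)(-[^-]+)?$')
RESIDUE_PATTERN = re.compile(r'[a-z]*[A-Z]')


def split_modx(sequence):
    """Return (n_term, residues, c_term); a missing terminus is None."""
    match = MODX_PATTERN.match(sequence)
    if match is None:
        raise ValueError('not a modX sequence: {!r}'.format(sequence))
    n_term, backbone, c_term = match.groups()
    return n_term, RESIDUE_PATTERN.findall(backbone), c_term


print(split_modx('H-HoxMMdaN-OH'))
# ('H-', ['H', 'oxM', 'M', 'daN'], '-OH')
```

The example sequence breaks down into the residues H, oxM, M and daN, framed by the two explicit terminal groups.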
Parsing
There are two helper functions to check if a label is in modX format or represents
a terminal modification: pyteomics.parser.is_modX() and
pyteomics.parser.is_term_group():
>>> from pyteomics import parser
>>> parser.is_modX('A')
True
>>> parser.is_modX('pT')
True
>>> parser.is_modX('pTx')
False
>>> parser.is_term_group('pT')
False
>>> parser.is_term_group('Ac-')
True
A modX sequence can be translated to a list of amino acid residues with
the pyteomics.parser.parse() function:
>>> from pyteomics import parser
>>> parser.parse('PEPTIDE')
['P', 'E', 'P', 'T', 'I', 'D', 'E']
>>> parser.parse('PEPTIDE', show_unmodified_termini=True)
['H-', 'P', 'E', 'P', 'T', 'I', 'D', 'E', '-OH']
>>> parser.parse('Ac-PEpTIDE', labels=parser.std_labels+['Ac-', 'pT'])
['Ac-', 'P', 'E', 'pT', 'I', 'D', 'E']
In the last example we supplied two arguments, the sequence itself
and ‘labels’. The latter is used to specify what labels are allowed for amino
acid residues and terminal modifications. std_labels is a predefined
set of labels for the twenty standard amino acids, ‘H-’ for N-terminal hydrogen
and ‘-OH’ for C-terminal hydroxyl. In this example we specified the codes for
phosphorylated threonine and N-terminal acetylation.
Since version 2.5, specifying labels is never mandatory. If this argument
is not supplied, no checks will be made. However, the last example won’t work
without labels, because it has only one terminal group shown, which is
discouraged.
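The idea of checking a parsed sequence against a set of allowed labels can be sketched without the library. STD_LABELS and unknown_labels() below are hypothetical names used only for illustration; STD_LABELS is an assumed stand-in built from the description of std_labels given above, not the actual object from Pyteomics:

```python
# Assumed stand-in for parser.std_labels: the twenty standard residues plus
# the default terminal groups, as described in the text.
STD_LABELS = list('ACDEFGHIKLMNPQRSTVWY') + ['H-', '-OH']


def unknown_labels(parsed, labels):
    """Return the parsed labels that are not in the allowed collection."""
    allowed = set(labels)
    return [label for label in parsed if label not in allowed]


parsed = ['Ac-', 'P', 'E', 'pT', 'I', 'D', 'E']
print(unknown_labels(parsed, STD_LABELS))                  # ['Ac-', 'pT']
print(unknown_labels(parsed, STD_LABELS + ['Ac-', 'pT']))  # []
```

With the standard labels alone, the acetyl group and the phosphorylated threonine are reported as unknown; extending the list makes every label acceptable.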
parse() has another mode, in which it returns tuples:
>>> parser.parse('Ac-PEpTIDE-OH', split=True)
[('Ac-', 'P'), ('E',), ('p', 'T'), ('I',), ('D',), ('E',)]
or:
>>> parser.parse('Ac-PEpTIDE-OH', split=True, labels=parser.std_labels+['Ac-', 'p'])
[('Ac-', 'P'), ('E',), ('p', 'T'), ('I',), ('D',), ('E',)]
Also, note what we supply as labels here: ‘p’ instead of ‘pT’. That means that ‘p’ is a modification applicable to any residue.
In modX, the standard len() function cannot be used to determine the length
of a peptide, because modifications and terminal groups add extra characters.
Use pyteomics.parser.length() instead:
>>> from pyteomics import parser
>>> parser.length('aVRILLaVIGNE')
10
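To see why len() overcounts, here is a minimal sketch that counts only the capital letters of the backbone. modx_length() is a hypothetical helper, not the implementation of parser.length(), and it assumes zero or two terminal groups:

```python
def modx_length(sequence):
    """Count residues in a modX string, assuming zero or two terminal groups."""
    parts = sequence.split('-')
    if len(parts) == 3:
        backbone = parts[1]      # 'H-PEPTIDE-OH' -> 'PEPTIDE'
    elif len(parts) == 1:
        backbone = parts[0]      # no terminal groups shown
    else:
        raise ValueError('expected zero or two terminal groups')
    # Each residue contributes exactly one capital letter.
    return sum(1 for char in backbone if char.isupper())


print(len('aVRILLaVIGNE'), modx_length('aVRILLaVIGNE'))  # 12 10
```

The two lowercase ‘a’ modification letters account for the difference between the string length and the number of residues.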
The pyteomics.parser.amino_acid_composition() function accepts a sequence
and returns a dictionary with amino acid labels as keys and, as values,
the number of times each residue occurs in the sequence:
>>> from pyteomics import parser
>>> parser.amino_acid_composition('PEPTIDE')
{'I': 1.0, 'P': 2.0, 'E': 2.0, 'T': 1.0, 'D': 1.0}
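The counting itself can be sketched with collections.Counter. The composition() helper below is hypothetical and is not how Pyteomics implements amino_acid_composition(); it assumes a backbone without terminal groups and, in this sketch, counts a modified residue under its own label:

```python
import re
from collections import Counter

# One residue: optional lowercase modification letters plus one capital letter.
RESIDUE_PATTERN = re.compile(r'[a-z]*[A-Z]')


def composition(backbone):
    """Count residue labels in a modX backbone without terminal groups."""
    return Counter(RESIDUE_PATTERN.findall(backbone))


counts = composition('PEpTIDE')
print(counts['P'], counts['E'], counts['pT'], counts['T'])  # 1 2 1 0
```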
In silico digestion
pyteomics.parser.cleave() performs in silico cleavage.
The required arguments are the sequence and the rule for enzyme specificity; the
number of missed cleavages allowed is optional. cleave() returns a
set of product peptides; to also get the original indices of the peptides, use xcleave().
>>> from pyteomics import parser
>>> parser.cleave('AKAKBK', parser.expasy_rules['trypsin'], 0)
{'AK', 'BK'}
>>> parser.xcleave('AKAKBK', 'trypsin', 0)
[(0, 'AK'), (2, 'AK'), (4, 'BK')]
pyteomics.parser.expasy_rules and pyteomics.parser.psims_rules are predefined dicts
with the cleavage rules for the most common proteases. Their keys are recognized by cleave().
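The mechanics of rule-based cleavage with missed cleavages can be sketched in plain Python. Everything below is an illustration under stated assumptions: SIMPLE_TRYPSIN is a deliberately simplified rule (cleave after K or R unless P follows), not the entry stored in expasy_rules, and simple_xcleave() and simple_cleave() are hypothetical helpers that only handle plain one-letter sequences:

```python
import re

# Simplified trypsin-like rule (an assumption for illustration, not the
# entry stored in expasy_rules): cleave after K or R unless P follows.
SIMPLE_TRYPSIN = r'[KR](?!P)'


def simple_xcleave(sequence, rule, missed_cleavages=0):
    """Return (start index, peptide) pairs for a plain one-letter sequence."""
    # Cleavage sites are the positions right after each rule match.
    sites = [0] + [m.end() for m in re.finditer(rule, sequence)]
    if sites[-1] != len(sequence):
        sites.append(len(sequence))
    peptides = []
    for i in range(len(sites) - 1):
        # Spanning up to `missed_cleavages` extra fragments skips internal sites.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(sites))):
            peptides.append((sites[i], sequence[sites[i]:sites[j]]))
    return peptides


def simple_cleave(sequence, rule, missed_cleavages=0):
    """Return the set of unique peptides, discarding positions."""
    return {pep for _, pep in simple_xcleave(sequence, rule, missed_cleavages)}


print(simple_xcleave('AKAKBK', SIMPLE_TRYPSIN))
# [(0, 'AK'), (2, 'AK'), (4, 'BK')]
print(sorted(simple_cleave('AKAKBK', SIMPLE_TRYPSIN, 1)))
# ['AK', 'AKAK', 'AKBK', 'BK']
```

Returning a set explains why ‘AK’ appears only once in the cleave() output above, while the index-preserving variant reports both occurrences.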
Variable modifications
All possible modified sequences of a peptide can be obtained with
pyteomics.parser.isoforms():
>>> from pyteomics import parser
>>> forms = parser.isoforms('PEPTIDE', variable_mods={'p': ['T'], 'ox': ['P']})
>>> for seq in forms:
... print(seq)
...
oxPEPpTIDE
oxPEPTIDE
oxPEoxPpTIDE
oxPEoxPTIDE
PEPpTIDE
PEPTIDE
PEoxPpTIDE
PEoxPTIDE
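The enumeration above is a Cartesian product of per-residue choices, which can be sketched with itertools.product. simple_isoforms() is a hypothetical helper, not the implementation of parser.isoforms(); it assumes a backbone without terminal groups, leaves already modified residues untouched, and makes no promise about the order of the results:

```python
import re
from itertools import product

# One residue: optional lowercase modification letters plus one capital letter.
RESIDUE_PATTERN = re.compile(r'[a-z]*[A-Z]')


def simple_isoforms(backbone, variable_mods):
    """Yield every combination of variable modifications on a backbone.

    variable_mods maps a modification prefix to the residues it may modify.
    """
    options = []
    for residue in RESIDUE_PATTERN.findall(backbone):
        choices = [residue]
        if len(residue) == 1:  # only unmodified residues receive modifications
            choices += [mod + residue for mod, targets in variable_mods.items()
                        if residue in targets]
        options.append(choices)
    # One pick per position, in every possible combination.
    for combination in product(*options):
        yield ''.join(combination)


forms = set(simple_isoforms('PEPTIDE', {'p': ['T'], 'ox': ['P']}))
print(len(forms))  # 8
```

Two prolines and one threonine each have two states, which gives 2 × 2 × 2 = 8 forms, matching the listing above.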