Peptide sequence formats¶

Two common formats for representing peptide sequences are supported in Pyteomics: modX and ProForma.

modX is a custom format historically used in Pyteomics to represent peptide sequences with modifications. This format is supported by all sequence-related functions in Pyteomics, including functions for digestion, mass and composition calculations, spectrum annotation, etc. The core functions working with modX sequences (sequence parsing, peptidoform generation, etc.) are located in the parser - operations on modX peptide sequences module. See modX format and the parser module for details.

ProForma is a more recent format that provides a more structured way to represent peptide sequences and their modifications. It is supported by most sequence-related functions in Pyteomics, including mass and composition calculations and spectrum annotation. See ProForma format for details.

modX format and the parser module ¶

Pyteomics historically uses a custom IUPAC-derived peptide sequence notation named modX. As in the IUPAC notation, each amino acid residue is represented by a capital letter, but it may be preceded by an arbitrary number of lowercase letters to denote a modification. Terminal groups are separated from the backbone sequence by a hyphen (‘-’). By default, both termini are assumed to be unmodified, which can be shown explicitly by ‘H-’ for N-terminal hydrogen and ‘-OH’ for C-terminal hydroxyl.

“H-HoxMMdaN-OH” is an example of a valid sequence in modX. See parser - operations on modX peptide sequences for additional information. Note that it is recommended to include either 0 or 2 terminal groups in a modX sequence.
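
To make the notation concrete, here is a minimal standard-library sketch of how such a string could be split into labels. It is not the Pyteomics implementation: the function name `split_modx_sketch` and both regular expressions are assumptions made for illustration, and the sketch assumes that modifications are written with lowercase ASCII letters only.

```python
import re

# Hypothetical helper patterns, not part of Pyteomics.
# Optional N-terminal group, residue body, optional C-terminal group.
_MODX_SKETCH = re.compile(r'^(?:([^-]+)-)?((?:[a-z]*[A-Z])+)(?:-([^-]+))?$')
# One residue: any number of lowercase letters followed by one capital letter.
_RESIDUE_SKETCH = re.compile(r'[a-z]*[A-Z]')


def split_modx_sketch(sequence):
    """Split a modX-like string into terminal groups and residue labels."""
    match = _MODX_SKETCH.match(sequence)
    if match is None:
        raise ValueError('not a modX-like sequence: {!r}'.format(sequence))
    n_term, body, c_term = match.groups()
    tokens = _RESIDUE_SKETCH.findall(body)
    if n_term is not None:
        tokens.insert(0, n_term + '-')   # keep the hyphen with the group
    if c_term is not None:
        tokens.append('-' + c_term)
    return tokens


print(split_modx_sketch('H-HoxMMdaN-OH'))
```

For real work, rely on the functions in pyteomics.parser rather than on this sketch; it only mirrors the rules stated in the paragraph above.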

Parsing ¶

There are two helper functions to check if a label is in modX format or represents a terminal modification: pyteomics.parser.is_modX() and pyteomics.parser.is_term_group():

>>> from pyteomics import parser
>>> parser.is_modX('A')
True
>>> parser.is_modX('pT')
True
>>> parser.is_modX('pTx')
False
>>> parser.is_term_group('pT')
False
>>> parser.is_term_group('Ac-')
True

A modX sequence can be translated to a list of amino acid residues with the pyteomics.parser.parse() function:

>>> from pyteomics import parser
>>> parser.parse('PEPTIDE')
['P', 'E', 'P', 'T', 'I', 'D', 'E']
>>> parser.parse('PEPTIDE', show_unmodified_termini=True)
['H-', 'P', 'E', 'P', 'T', 'I', 'D', 'E', '-OH']
>>> parser.parse('Ac-PEpTIDE', labels=parser.std_labels+['Ac-', 'pT'])
['Ac-', 'P', 'E', 'pT', 'I', 'D', 'E']

In the last example we supplied two arguments: the sequence itself and ‘labels’. The latter specifies which labels are allowed for amino acid residues and terminal modifications. std_labels is a predefined list of labels for the twenty standard amino acids, ‘H-’ for N-terminal hydrogen and ‘-OH’ for C-terminal hydroxyl. In this example we added the codes for phosphorylated threonine and N-terminal acetylation.

Since version 2.5, specifying labels is never mandatory. If this argument is not supplied, no checks will be made. However, the last example won’t work without labels, because it has only one terminal group shown, which is discouraged.

parse() has another mode, in which it returns tuples:

>>> parser.parse('Ac-PEpTIDE-OH', split=True)
[('Ac-', 'P'), ('E',), ('p', 'T'), ('I',), ('D',), ('E',)]

or:

>>> parser.parse('Ac-PEpTIDE-OH', split=True, labels=parser.std_labels+['Ac-', 'p'])
[('Ac-', 'P'), ('E',), ('p', 'T'), ('I',), ('D',), ('E',)]

Also, note what we supply as labels here: ‘p’ instead of ‘pT’. That means that ‘p’ is a modification applicable to any residue.

In modX, the standard len() function cannot be used to determine the length of a peptide, because modification labels and terminal groups add characters that are not residues. Use pyteomics.parser.length() instead:

>>> from pyteomics import parser
>>> parser.length('aVRILLaVIGNE')
10

The pyteomics.parser.amino_acid_composition() function accepts a sequence and returns a dictionary with amino acid labels as keys and integer numbers as values, corresponding to the number of times each residue occurs in the sequence:

>>> from pyteomics import parser
>>> parser.amino_acid_composition('PEPTIDE') == {'I': 1, 'P': 2, 'E': 2, 'T': 1, 'D': 1}
True
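
Conceptually, the composition is a tally of residue labels. The sketch below reproduces that idea with collections.Counter; `composition_sketch` is a hypothetical name, the sketch ignores terminal groups, and it should not be taken as the library's implementation. Note that a modified residue is counted under its own label.

```python
import re
from collections import Counter


def composition_sketch(sequence):
    """Tally residue labels in a modX-like string without terminal groups."""
    # A label is zero or more lowercase letters plus one capital letter.
    return Counter(re.findall(r'[a-z]*[A-Z]', sequence))


print(composition_sketch('PEPTIDE'))
print(composition_sketch('PEpTIDE'))  # 'pT' is tallied separately from 'T'
```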

In silico digestion ¶

pyteomics.parser.cleave() performs in silico cleavage. It takes the sequence, the rule for enzyme specificity and, optionally, the number of missed cleavages allowed. cleave() returns a set of product peptides, so duplicates and positions are lost; to get the original indices of the peptides, use xcleave().

>>> from pyteomics import parser
>>> parser.cleave('AKAKBK', parser.expasy_rules['trypsin'], 0)
{'AK', 'BK'}
>>> parser.xcleave('AKAKBK', 'trypsin', 0)
[(0, 'AK'), (2, 'AK'), (4, 'BK')]

pyteomics.parser.expasy_rules and pyteomics.parser.psims_rules are predefined dicts with the cleavage rules for the most common proteases. Their keys are recognized by cleave().
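
The idea behind rule-based cleavage with missed cleavages can be sketched with the re module alone. This is an illustration under stated assumptions, not the Pyteomics code: `cleave_sketch` is a hypothetical function, and the default rule `[KR](?!P)` is a deliberately simplified trypsin-like pattern, not the entry stored in expasy_rules.

```python
import re


def cleave_sketch(sequence, rule=r'[KR](?!P)', missed_cleavages=0):
    """Return (start_index, peptide) pairs for a regex cleavage rule.

    The sequence is cut after every match of `rule`; up to
    `missed_cleavages` consecutive cut sites may be skipped.
    """
    cuts = [0] + [m.end() for m in re.finditer(rule, sequence)]
    if cuts[-1] != len(sequence):
        cuts.append(len(sequence))  # keep the C-terminal remainder
    peptides = []
    for i in range(len(cuts) - 1):
        last = min(i + missed_cleavages + 2, len(cuts))
        for j in range(i + 1, last):
            peptides.append((cuts[i], sequence[cuts[i]:cuts[j]]))
    return peptides


indexed = cleave_sketch('AKAKBK')
print(indexed)                     # positions are kept, like xcleave()
print({pep for _, pep in indexed}) # a set drops duplicates, like cleave()
```

Collapsing the indexed list into a set shows why cleave() reports 'AK' only once, although the peptide occurs at two positions.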

Variable modifications ¶

All possible modified sequences of a peptide can be obtained with pyteomics.parser.isoforms():

>>> from pyteomics import parser
>>> forms = parser.isoforms('PEPTIDE', variable_mods={'p': ['T'], 'ox': ['P']})
>>> for seq in forms:
...    print(seq)
...
oxPEPpTIDE
oxPEPTIDE
oxPEoxPpTIDE
oxPEoxPTIDE
PEPpTIDE
PEPTIDE
PEoxPpTIDE
PEoxPTIDE
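
The enumeration above is a Cartesian product over the per-residue choices: every target residue is either left unmodified or carries one of its applicable modifications. A minimal sketch of that idea follows; `isoforms_sketch` is a hypothetical helper that assumes an unmodified input without terminal groups, and it does not claim to match the library's algorithm or output order.

```python
from itertools import product


def isoforms_sketch(sequence, variable_mods):
    """Enumerate modified forms of an unmodified one-letter sequence.

    `variable_mods` maps a modification label to the residues it may
    occupy, e.g. {'p': ['T'], 'ox': ['P']}.
    """
    options = []
    for aa in sequence:
        choices = [aa]  # the unmodified residue is always allowed
        for mod, targets in variable_mods.items():
            if aa in targets:
                choices.append(mod + aa)
        options.append(choices)
    return [''.join(combo) for combo in product(*options)]


forms = isoforms_sketch('PEPTIDE', {'p': ['T'], 'ox': ['P']})
print(len(forms))
print(sorted(forms))
```

With two candidate prolines and one threonine, there are 2 × 2 × 2 = 8 forms, which agrees with the eight sequences listed above.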

The ProForma standard and implementation ¶

ProForma is a standard for representing proteoforms and peptidoforms, developed by the PSI (Proteomics Standards Initiative). It provides a structured way to represent peptide sequences with a wide variety of modifications and uncertainties.

Pyteomics supports ProForma v2.0. The core functions and classes related to ProForma support are located in the proforma - Proteoform and Peptidoform Notation module; see there for more information.

The ProForma parser is object-oriented, with a primary class ProForma representing a parsed ProForma sequence. To instantiate a ProForma object, use the class method ProForma.parse():

>>> from pyteomics.proforma import ProForma
>>> seq = ProForma.parse("EM[Oxidation]EVT[Phospho]SES[Phospho]PEK")
>>> seq
ProForma([('E', None), ('M', [GenericModification('Oxidation', None, None)]), ('E', None), ('V', None), ('T', [GenericModification('Phospho', None, None)]), ('S', None), ('E', None), ('S', [GenericModification('Phospho', None, None)]), ('P', None), ('E', None), ('K', None)], {'n_term': None, 'c_term': None, 'unlocalized_modifications': [], 'labile_modifications': [], 'fixed_modifications': [], 'intervals': [], 'isotopes': [], 'group_ids': [], 'charge_state': None})

>>> seq.mass
1440.47687500136

>>> seq.composition()
Composition({'H': 86, 'C': 51, 'O': 30, 'N': 12, 'S': 1, 'P': 2})
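
The list of (residue, modifications) pairs in the parsed object reflects the basic shape of the notation: a residue letter optionally followed by a bracketed tag. The sketch below reads only that simplest subset using the re module. `tokenize_simple_proforma` is a hypothetical name, and the sketch deliberately ignores terminal modifications, ranges, fixed modifications, charge states and everything else that the real ProForma parser handles.

```python
import re

# One residue letter, optionally followed by a single [tag].
_TOKEN = re.compile(r'([A-Z])(?:\[([^\]]*)\])?')
_WHOLE = re.compile(r'(?:[A-Z](?:\[[^\]]*\])?)+')


def tokenize_simple_proforma(text):
    """Return (residue, tag-or-None) pairs for a very small ProForma subset."""
    if _WHOLE.fullmatch(text) is None:
        raise ValueError('outside the subset handled by this sketch')
    return [(m.group(1), m.group(2)) for m in _TOKEN.finditer(text)]


for residue, tag in tokenize_simple_proforma('EM[Oxidation]EVT[Phospho]SES[Phospho]PEK'):
    print(residue, tag)
```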

A top-level + in a ProForma string is treated as a chimeric separator only when chimeric=True is passed. The return value is then a list of parsed components:

>>> forms = ProForma.parse("<[Carbamidomethyl]@C>AC+CC", chimeric=True)
>>> len(forms)
2
>>> [str(form) for form in forms]
['<[Carbamidomethyl]@C>AC', '<[Carbamidomethyl]@C>CC']

Fixed modification rules, isotope labels, and peptidoform names are shared across all chimeric components.
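
What "top-level" means, and how a shared prefix ends up on every component, can be sketched in plain Python. This is a rough illustration and not the Pyteomics parser: `split_chimeric_sketch` is a hypothetical function that tracks only square brackets, so that a + inside a tag such as [+15.995] is not a separator, and it copies only leading <...> blocks. Peptidoform names and other shared elements are not handled.

```python
def split_chimeric_sketch(text):
    """Split on '+' outside square brackets and repeat leading <...> blocks."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        if ch == '+' and depth == 0:
            parts.append(''.join(current))  # top-level separator found
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current))

    # Leading <...> blocks of the first component form the shared prefix.
    first, end = parts[0], 0
    while first.startswith('<', end):
        end = first.index('>', end) + 1
    prefix = first[:end]
    return [parts[0]] + [prefix + part for part in parts[1:]]


print(split_chimeric_sketch('<[Carbamidomethyl]@C>AC+CC'))
print(split_chimeric_sketch('EM[+15.995]EK+PEPTIDE'))
```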

Other APIs such as mass calculation, fragment series generation, and spectrum annotation operate on one peptidoform at a time. Use the parsed components individually:

>>> from pyteomics import mass
>>> masses = [mass.calculate_mass(proforma=str(form)) for form in forms]
>>> fragments = [mass.fragment_series(str(form)) for form in forms]

Pyteomics documentation v5.0