parser - operations on modX peptide sequences
modX is a simple extension of the IUPAC one-letter peptide sequence representation.
The labels (or codes) for the 20 standard amino acids in modX are the same as in IUPAC nomenclature. A label for a modified amino acid has the general form ‘modX’, i.e.:
it starts with an arbitrary number of lower-case symbols or numbers (a modification);
it ends with a single upper-case symbol (an amino acid residue).
Valid examples of modX amino acid labels are: ‘G’, ‘pS’, ‘oxM’. This rule keeps the labels both human-readable and easy to parse.
Besides the sequence of amino acid residues, modX has a rule to specify terminal modifications (alternative terminal groups) of a polypeptide. Such a label should start or end with a hyphen. The default N-terminal amine group and C-terminal carboxyl group need not be shown explicitly.
Therefore, valid examples of peptide sequences in modX are: “GAGA”, “H-PEPTIDE-OH”, “H-TEST-NH2”. It is not recommended to specify only one terminal group.
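The label rule above can be written as a regular expression. The following is a minimal sketch using only the standard library; the pattern and the helper name `split_modx` are illustrative assumptions, not the pattern that pyteomics uses internally, and terminal groups are deliberately not handled.

```python
import re

# Assumed pattern: any number of lower-case letters or digits (the modification),
# followed by exactly one upper-case letter (the residue).
MODX_LABEL = re.compile(r'[a-z0-9]*[A-Z]')

def split_modx(sequence):
    """Split a modX string without terminal groups into residue labels (hypothetical helper)."""
    labels = MODX_LABEL.findall(sequence)
    # If anything was skipped by the pattern, the input is not pure modX.
    if ''.join(labels) != sequence:
        raise ValueError('not a valid modX sequence: %r' % sequence)
    return labels

print(split_modx('GAGA'))      # ['G', 'A', 'G', 'A']
print(split_modx('TEpSToxM'))  # ['T', 'E', 'pS', 'T', 'oxM']
```

Because every label ends with exactly one upper-case letter, the split is unambiguous: a string such as 'oxMet' is rejected, since its trailing lower-case letters are not followed by a residue.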
Operations on polypeptide sequences
parse() - convert a sequence string into a list of amino acid residues.
to_string() - convert a parsed sequence to a string.
to_proforma() - convert a (parsed) modX sequence to ProForma.
amino_acid_composition() - get numbers of each amino acid residue in a peptide.
cleave(), icleave(), xcleave() - cleave a polypeptide using a given rule of enzymatic digestion.
num_sites() - count the number of cleavage sites in a sequence.
peptidoforms() - generate all unique modified peptide sequences given the initial sequence and modifications.
Auxiliary commands
coverage() and coverage_mask() - calculate the sequence coverage of a protein by peptides.
length() - calculate the number of amino acid residues in a polypeptide.
strip() - remove all modifications and terminal groups from a modX sequence.
valid() - check if a sequence can be parsed successfully.
fast_valid() - check if a sequence consists of known one-letter codes.
is_modX() - check if the supplied code corresponds to a modX label.
is_term_group() - check if the supplied code corresponds to a terminal group.
Data
std_amino_acids - a list of the 20 standard amino acid IUPAC codes.
std_nterm - the standard N-terminal modification (the unmodified group is a single atom of hydrogen).
std_cterm - the standard C-terminal modification (the unmodified group is hydroxyl).
std_labels - a list of all standard sequence elements, amino acid residues and terminal modifications.
expasy_rules and psims_rules - two dicts with the regular expressions of cleavage rules for the most popular proteolytic enzymes.
- pyteomics.parser.amino_acid_composition(sequence, show_unmodified_termini=False, term_aa=False, allow_unknown_modifications=False, **kwargs)
Calculate amino acid composition of a polypeptide.
- Parameters:
sequence (str or list) – The sequence of a polypeptide or a list with a parsed sequence.
show_unmodified_termini (bool, optional) – If True then the unmodified N- and C-termini are explicitly shown in the returned dict. Default value is False.
term_aa (bool, optional) – If True then the terminal amino acid residues are artificially modified with nterm or cterm modification. Default value is False.
allow_unknown_modifications (bool, optional) – If True then do not raise an exception when an unknown modification of a known amino acid residue is found in the sequence. Default value is False.
labels (list, optional) – A list of allowed labels for amino acids and terminal modifications.
- Returns:
out – A dictionary of amino acid composition.
- Return type:
dict
Examples
>>> amino_acid_composition('PEPTIDE') == {'I': 1, 'P': 2, 'E': 2, 'T': 1, 'D': 1}
True
>>> amino_acid_composition('PEPTDE', term_aa=True) == {'ctermE': 1, 'E': 1, 'D': 1, 'P': 1, 'T': 1, 'ntermP': 1}
True
>>> amino_acid_composition('PEPpTIDE', labels=std_labels+['pT']) == {'I': 1, 'P': 2, 'E': 2, 'D': 1, 'pT': 1}
True
- pyteomics.parser.cleave(*args, **kwargs)
Cleaves a polypeptide sequence using a given rule.
- Parameters:
sequence (str) – The sequence of a polypeptide.
Note
The sequence is expected to be in one-letter uppercase notation. Otherwise, some of the cleavage rules in expasy_rules will not work as expected.
rule (str or compiled regex) – A key present in expasy_rules, psims_rules (or an MS ontology accession) or a regular expression describing the site of cleavage. It is recommended to design the regex so that it matches only the residue whose C-terminal bond is to be cleaved. All additional requirements should be specified using lookaround assertions. expasy_rules contains cleavage rules for popular cleavage agents.
See also
The regex argument.
missed_cleavages (int, optional) – Maximum number of allowed missed cleavages. Defaults to 0.
min_length (int or None, optional) – Minimum peptide length. Defaults to None.
max_length (int or None, optional) – Maximum peptide length. Defaults to None. See note above.
semi (bool, optional) – Include products of semi-specific cleavage. Default is False. This effectively cuts every peptide at every position and adds results to the output.
exception (str or compiled RE or None, optional) – Exceptions to the cleavage rule. If specified, should be a key present in expasy_rules or a regular expression. Cleavage sites matching rule will be checked against exception and omitted if they match.
regex (bool, optional) – If True, the cleavage rule is always interpreted as a regex. Otherwise, a matching value is looked up in expasy_rules and psims_rules.
- Returns:
out – A set of unique (!) peptides.
- Return type:
set
Examples
>>> cleave('AKAKBK', expasy_rules['trypsin'], 0) == {'AK', 'BK'}
True
>>> cleave('AKAKBK', 'trypsin', 0) == {'AK', 'BK'}
True
>>> cleave('AKAKBK', 'MS:1001251', 0) == {'AK', 'BK'}
True
>>> cleave('GKGKYKCK', 'Trypsin/P', 2) == {'CK', 'GKYK', 'YKCK', 'GKGK', 'GKYKCK', 'GK', 'GKGKYK', 'YK'}
True
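To illustrate how a regex rule and missed_cleavages combine, here is a minimal standard-library sketch. It is not the pyteomics implementation: the function name `cleave_sketch` is hypothetical, the two patterns are copied from the 'trypsin' entry of expasy_rules and the 'Trypsin/P' entry of psims_rules, and options such as min_length, semi and exception are left out.

```python
import re

# Patterns copied from the rule tables in this document.
TRYPSIN = r'([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))'  # matches the cleaved residue
TRYPSIN_P = r'(?<=[KR])'                                    # zero-width match at the cut

def cleave_sketch(sequence, rule, missed_cleavages=0):
    """Return the set of peptides produced by cutting after every rule match."""
    # The end of each match is the position of the cleaved bond; both sequence
    # ends are added so that the first and last fragments are included.
    cuts = sorted({0, len(sequence)} | {m.end() for m in re.finditer(rule, sequence)})
    peptides = set()
    for i in range(len(cuts) - 1):
        # Join each fragment with up to `missed_cleavages` following fragments.
        for j in range(i + 1, min(i + missed_cleavages + 2, len(cuts))):
            peptides.add(sequence[cuts[i]:cuts[j]])
    return peptides

print(sorted(cleave_sketch('AKAKBK', TRYPSIN)))  # ['AK', 'BK']
print(sorted(cleave_sketch('GKGKYKCK', TRYPSIN_P, 2)))
```

Using the end of the match as the cut position is what lets both styles of rule work in this sketch: a pattern that consumes the cleaved residue and a zero-width lookaround pattern both end exactly at the cleaved bond.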
- pyteomics.parser.coverage(protein, peptides, fast: bool = True) -> float
Calculate how much of protein is covered by peptides. Peptides can overlap. If a peptide is found multiple times in protein, it contributes more to the overall coverage.
Requires numpy.
Note
Modifications and terminal groups are discarded.
- Parameters:
- Returns:
out – The sequence coverage, between 0 and 1.
- Return type:
float
Examples
>>> coverage('PEPTIDES'*100, ['PEP', 'EPT'])
0.5
- pyteomics.parser.coverage_mask(protein: str, peptides: Iterable, fast: bool = True) -> np.ndarray
Calculate a coverage mask of protein by peptides. Peptides can overlap. If a peptide is found multiple times in protein, it contributes more to the overall coverage.
Requires numpy.
Note
Modifications and terminal groups are discarded.
- Parameters:
- Returns:
out – A 1D array of integers, with length equal to that of protein. Each position indicates how many peptides cover the corresponding residue in protein.
- Return type:
numpy.ndarray
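The relationship between the mask and the coverage value can be sketched without numpy. This is only an illustration under stated assumptions: the names `coverage_mask_sketch` and `coverage_sketch` are hypothetical, inputs are plain one-letter strings (no modifications or terminal groups), the protein is non-empty, and every occurrence of a peptide, including overlapping ones, is counted.

```python
import re

def coverage_mask_sketch(protein, peptides):
    """Count, for every residue of `protein`, how many peptide occurrences cover it."""
    mask = [0] * len(protein)
    for peptide in peptides:
        # A lookahead makes finditer report overlapping occurrences as well.
        for m in re.finditer('(?=%s)' % re.escape(peptide), protein):
            for pos in range(m.start(), m.start() + len(peptide)):
                mask[pos] += 1
    return mask

def coverage_sketch(protein, peptides):
    """Fraction of residues covered by at least one peptide (protein must be non-empty)."""
    mask = coverage_mask_sketch(protein, peptides)
    return sum(1 for count in mask if count) / len(mask)

print(coverage_mask_sketch('PEPTIDES', ['PEP', 'EPT']))  # [1, 2, 2, 1, 0, 0, 0, 0]
print(coverage_sketch('PEPTIDES' * 100, ['PEP', 'EPT']))  # 0.5
```

In this sketch the coverage is the share of non-zero positions in the mask, so overlapping peptides raise the counts in the mask without pushing the coverage above 1.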
- pyteomics.parser.expasy_rules = {'arg-c': 'R', 'asp-n': '\\w(?=D)', 'bnps-skatole': 'W', 'caspase 1': '(?<=[FWYL]\\w[HAT])D(?=[^PEDQKR])', 'caspase 10': '(?<=IEA)D', 'caspase 2': '(?<=DVA)D(?=[^PEDQKR])', 'caspase 3': '(?<=DMQ)D(?=[^PEDQKR])', 'caspase 4': '(?<=LEV)D(?=[^PEDQKR])', 'caspase 5': '(?<=[LW]EH)D', 'caspase 6': '(?<=VE[HI])D(?=[^PEDQKR])', 'caspase 7': '(?<=DEV)D(?=[^PEDQKR])', 'caspase 8': '(?<=[IL]ET)D(?=[^PEDQKR])', 'caspase 9': '(?<=LEH)D', 'chymotrypsin high specificity': '([FY](?=[^P]))|(W(?=[^MP]))', 'chymotrypsin low specificity': '([FLY](?=[^P]))|(W(?=[^MP]))|(M(?=[^PY]))|(H(?=[^DMPW]))', 'clostripain': 'R', 'cnbr': 'M', 'enterokinase': '(?<=[DE]{3})K', 'factor xa': '(?<=[AFGILTVM][DE]G)R', 'formic acid': 'D', 'glutamyl endopeptidase': 'E', 'granzyme b': '(?<=IEP)D', 'hydroxylamine': 'N(?=G)', 'iodosobenzoic acid': 'W', 'lysc': 'K', 'ntcb': '\\w(?=C)', 'pepsin ph1.3': '((?<=[^HKR][^P])[^R](?=[FL][^P]))|((?<=[^HKR][^P])[FL](?=\\w[^P]))', 'pepsin ph2.0': '((?<=[^HKR][^P])[^R](?=[FLWY][^P]))|((?<=[^HKR][^P])[FLWY](?=\\w[^P]))', 'proline endopeptidase': '(?<=[HKR])P(?=[^P])', 'proteinase k': '[AEFILTVWY]', 'staphylococcal peptidase i': '(?<=[^E])E', 'thermolysin': '[^DE](?=[AFILMV][^P])', 'thrombin': '((?<=G)R(?=G))|((?<=[AFGILTVM][AFGILTVWA]P)R(?=[^DE][^DE]))', 'trypsin': '([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))', 'trypsin_exception': '((?<=[CD])K(?=D))|((?<=C)K(?=[HY]))|((?<=C)R(?=K))|((?<=R)R(?=[HR]))'}
This dict contains regular expressions for cleavage rules of the most popular proteolytic enzymes. The rules were taken from the PeptideCutter tool at Expasy.
Note
‘trypsin_exception’ can be used as the exception argument when calling cleave() with the ‘trypsin’ rule:
>>> parser.cleave('PEPTIDKDE', parser.expasy_rules['trypsin'])
{'DE', 'PEPTIDK'}
>>> parser.cleave('PEPTIDKDE', parser.expasy_rules['trypsin'], exception=parser.expasy_rules['trypsin_exception'])
{'PEPTIDKDE'}
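The effect of an exception pattern can be pictured as a set difference between two lists of match positions. The sketch below is an assumption about the general idea, not the library's code: `cleavage_sites_sketch` is a hypothetical helper, and both patterns are copied from the 'trypsin' and 'trypsin_exception' entries above.

```python
import re

TRYPSIN = r'([KR](?=[^P]))|((?<=W)K(?=P))|((?<=M)R(?=P))'
TRYPSIN_EXCEPTION = (r'((?<=[CD])K(?=D))|((?<=C)K(?=[HY]))'
                     r'|((?<=C)R(?=K))|((?<=R)R(?=[HR]))')

def cleavage_sites_sketch(sequence, rule, exception=None):
    """Return sorted internal cut positions, dropping those matched by `exception`."""
    sites = {m.end() for m in re.finditer(rule, sequence)}
    if exception is not None:
        # A site is omitted when the exception pattern ends at the same position.
        sites -= {m.end() for m in re.finditer(exception, sequence)}
    # A "cut" at either end of the sequence does not split anything.
    return sorted(sites - {0, len(sequence)})

print(cleavage_sites_sketch('PEPTIDKDE', TRYPSIN))                     # [7]
print(cleavage_sites_sketch('PEPTIDKDE', TRYPSIN, TRYPSIN_EXCEPTION))  # []
```

In 'PEPTIDKDE' the only lysine is flanked by aspartic acid on both sides, so the first alternative of the exception pattern removes the single site and the peptide stays intact, consistent with the doctest above. The length of the returned list is the number of internal cleavage sites under these assumptions.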
- pyteomics.parser.fast_valid(sequence, labels={'-OH', 'A', 'C', 'D', 'E', 'F', 'G', 'H', 'H-', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'})
Iterate over sequence and check if all items are in labels. With strings, this only works as expected on sequences without modifications or terminal groups.
- pyteomics.parser.icleave(sequence, rule, missed_cleavages=0, min_length=None, max_length=None, semi=False, exception=None, regex=False)
Like cleave(), but the result is an iterator and includes peptide indices. Refer to cleave() for explanation of parameters.
- Returns:
out – An iterator over (index, sequence) pairs.
- Return type:
iterator
- pyteomics.parser.is_modX(label)
Check if label is a valid ‘modX’ label.
Examples
>>> is_modX('M')
True
>>> is_modX('oxM')
True
>>> is_modX('oxMet')
False
>>> is_modX('160C')
True
- pyteomics.parser.is_term_group(label)
Check if label corresponds to a terminal group.
Examples
>>> is_term_group('A')
False
>>> is_term_group('Ac-')
True
>>> is_term_group('-customGroup')
True
>>> is_term_group('this-group-')
False
>>> is_term_group('-')
False
- pyteomics.parser.is_term_mod(label)
Check if label corresponds to a terminal group.
Examples
>>> is_term_group('A')
False
>>> is_term_group('Ac-')
True
>>> is_term_group('-customGroup')
True
>>> is_term_group('this-group-')
False
>>> is_term_group('-')
False
- pyteomics.parser.isoforms(sequence, **kwargs)
Apply variable and fixed modifications to the polypeptide and yield the unique modified sequences.
- Parameters:
sequence (str) – Peptide sequence to modify.
variable_mods (dict, optional) – A dict of variable modifications in the following format:
{'label1': ['X', 'Y', ...], 'label2': ['X', 'A', 'B', ...]}
Keys in the dict are modification labels (terminal modifications allowed). Values are iterables of residue labels (one letter each) or True. If a value for a modification is True, it is applicable to any residue (useful for terminal modifications). You can use values such as ‘ntermX’ or ‘ctermY’ to specify that a modification only occurs when the residue is in the terminal position. This is not needed for terminal modifications.
Note
Several variable modifications can occur on amino acids of the same type, but in the output each amino acid residue will be modified at most once (apart from terminal modifications).
fixed_mods (dict, optional) – A dict of fixed modifications in the same format.
Note: if a residue is affected by a fixed modification, no variable modifications will be applied to it (apart from terminal modifications).
labels (list, optional) – A list of amino acid labels containing all the labels present in sequence. Modified entries will be added automatically. Defaults to std_labels. Not required since version 2.5.
max_mods (int or None, optional) – Number of modifications that can occur simultaneously on a peptide, excluding fixed modifications. If None or if max_mods is greater than the number of modification sites, all possible isoforms are generated. Default is None.
override (bool, optional) – Defines how to handle the residues that are modified in the input. False means that they will be preserved (default). True means they will be treated as unmodified.
show_unmodified_termini (bool, optional) – If True then the unmodified N- and C-termini are explicitly shown in the returned sequences. Default value is False.
format (str, optional) – If 'str' (default), an iterator over sequences is returned. If 'split', the iterator will yield results in the same format as parse() with the ‘split’ option, with unmodified terminal groups shown.
- Returns:
out – All possible unique polypeptide sequences resulting from the specified modifications are yielded one by one.
- Return type:
iterator over strings or lists
- pyteomics.parser.length(sequence, **kwargs)
Calculate the number of amino acid residues in a polypeptide written in modX notation.
- Parameters:
- Returns:
out
- Return type:
int
Examples
>>> length('PEPTIDE')
7
>>> length('H-PEPTIDE-OH')
7
- pyteomics.parser.match_modX(label)
Check if label is a valid ‘modX’ label.
- Parameters:
label (str)
- Returns:
out
- Return type:
re.match or None
- pyteomics.parser.num_sites(sequence, rule, **kwargs)
Count the number of sites where sequence can be cleaved using the given rule (e.g. number of miscleavages for a peptide).
- Parameters:
sequence (str) – The sequence of a polypeptide.
rule (str or compiled regex) –
A regular expression describing the site of cleavage. It is recommended to design the regex so that it matches only the residue whose C-terminal bond is to be cleaved. All additional requirements should be specified using lookaround assertions.
labels (list, optional) – A list of allowed labels for amino acids and terminal modifications.
exception (str or compiled RE or None, optional) – Exceptions to the cleavage rule. If specified, should be a regular expression. Cleavage sites matching rule will be checked against exception and omitted if they match.
- Returns:
out – Number of cleavage sites.
- Return type:
int
- pyteomics.parser.parse(sequence, show_unmodified_termini=False, split=False, allow_unknown_modifications=False, **kwargs)
Parse a sequence string written in modX notation into a list of labels or (if split argument is True) into a list of tuples representing amino acid residues and their modifications.
- Parameters:
sequence (str) – The sequence of a polypeptide.
show_unmodified_termini (bool, optional) – If True then the unmodified N- and C-termini are explicitly shown in the returned list. Default value is False.
split (bool, optional) – If True then the result will be a list of tuples with 1 to 4 elements: terminal modification, modification, residue. Default value is False.
allow_unknown_modifications (bool, optional) – If True then do not raise an exception when an unknown modification of a known amino acid residue is found in the sequence. This also includes terminal groups. Default value is False.
Note
Since version 2.5, this parameter has effect only if labels are provided.
labels (container, optional) – A container of allowed labels for amino acids, modifications and terminal modifications. If not provided, no checks will be done. Separate labels for modifications (such as ‘p’ or ‘ox’) can be supplied, which means they are applicable to all residues.
Warning
If show_unmodified_termini is set to True, standard terminal groups need to be present in labels.
Warning
Avoid using sequences with only one terminal group, as they are ambiguous. If you provide one, labels (or std_labels) will be used to resolve the ambiguity.
- Returns:
out – List of labels or, if split is True, list of tuples with labels of modifications and amino acid residues.
- Return type:
list
Examples
>>> parse('PEPTIDE', split=True)
[('P',), ('E',), ('P',), ('T',), ('I',), ('D',), ('E',)]
>>> parse('H-PEPTIDE')
['P', 'E', 'P', 'T', 'I', 'D', 'E']
>>> parse('PEPTIDE', show_unmodified_termini=True)
['H-', 'P', 'E', 'P', 'T', 'I', 'D', 'E', '-OH']
>>> parse('TEpSToxM', labels=std_labels + ['pS', 'oxM'])
['T', 'E', 'pS', 'T', 'oxM']
>>> parse('zPEPzTIDzE', True, True, labels=std_labels+['z'])
[('H-', 'z', 'P'), ('E',), ('P',), ('z', 'T'), ('I',), ('D',), ('z', 'E', '-OH')]
>>> parse('Pmod1EPTIDE')
['P', 'mod1E', 'P', 'T', 'I', 'D', 'E']
- pyteomics.parser.peptidoforms(sequence, **kwargs)
Apply variable and fixed modifications to the polypeptide and yield the unique modified sequences.
- Parameters:
sequence (str) – Peptide sequence to modify.
variable_mods (dict, optional) – A dict of variable modifications in the following format:
{'label1': ['X', 'Y', ...], 'label2': ['X', 'A', 'B', ...]}
Keys in the dict are modification labels (terminal modifications allowed). Values are iterables of residue labels (one letter each) or True. If a value for a modification is True, it is applicable to any residue (useful for terminal modifications). You can use values such as ‘ntermX’ or ‘ctermY’ to specify that a modification only occurs when the residue is in the terminal position. This is not needed for terminal modifications.
Note
Several variable modifications can occur on amino acids of the same type, but in the output each amino acid residue will be modified at most once (apart from terminal modifications).
fixed_mods (dict, optional) – A dict of fixed modifications in the same format.
Note: if a residue is affected by a fixed modification, no variable modifications will be applied to it (apart from terminal modifications).
labels (list, optional) – A list of amino acid labels containing all the labels present in sequence. Modified entries will be added automatically. Defaults to std_labels. Not required since version 2.5.
max_mods (int or None, optional) – Number of modifications that can occur simultaneously on a peptide, excluding fixed modifications. If None or if max_mods is greater than the number of modification sites, all possible isoforms are generated. Default is None.
override (bool, optional) – Defines how to handle the residues that are modified in the input. False means that they will be preserved (default). True means they will be treated as unmodified.
show_unmodified_termini (bool, optional) – If True then the unmodified N- and C-termini are explicitly shown in the returned sequences. Default value is False.
format (str, optional) – If 'str' (default), an iterator over sequences is returned. If 'split', the iterator will yield results in the same format as parse() with the ‘split’ option, with unmodified terminal groups shown.
- Returns:
out – All possible unique polypeptide sequences resulting from the specified modifications are yielded one by one.
- Return type:
iterator over strings or lists
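The enumeration of variable modifications can be sketched as a Cartesian product over per-residue states. This is a simplified illustration, not the pyteomics algorithm: `peptidoforms_sketch` is a hypothetical name, the input is assumed to be an unmodified one-letter sequence, and fixed modifications, terminal modifications and the ‘ntermX’/‘ctermY’ values are not handled.

```python
from itertools import product

def peptidoforms_sketch(sequence, variable_mods, max_mods=None):
    """Yield every variant of `sequence` with each residue modified at most once."""
    options = []
    for aa in sequence:
        # Each residue is either left unmodified or carries one applicable label.
        states = [aa]
        for label, residues in sorted(variable_mods.items()):
            if residues is True or aa in residues:
                states.append(label + aa)
        options.append(states)
    for combo in product(*options):
        n_mods = sum(1 for state, aa in zip(combo, sequence) if state != aa)
        if max_mods is None or n_mods <= max_mods:
            yield ''.join(combo)

print(sorted(peptidoforms_sketch('PEST', {'p': ['S', 'T']})))
# ['PEST', 'PESpT', 'PEpST', 'PEpSpT']
```

Listing the unmodified residue as one of the states of each position is what enforces the "at most one modification per residue" rule, and max_mods then simply filters the combinations by the number of modified positions.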
- pyteomics.parser.psims_rules = {'2-iodobenzoate': '(?<=W)', 'Arg-C': '(?<=R)(?!P)', 'Asp-N': '(?=[BD])', 'Asp-N ambic': '(?=[DE])', 'CNBr': '(?<=M)', 'Chymotrypsin': '(?<=[FYWL])(?!P)', 'Formic acid': '((?<=D))|((?=D))', 'Lys-C': '(?<=K)(?!P)', 'Lys-C/P': '(?<=K)', 'PepsinA': '(?<=[FL])', 'TrypChymo': '(?<=[FYWLKR])(?!P)', 'Trypsin': '(?<=[KR])(?!P)', 'Trypsin/P': '(?<=[KR])', 'V8-DE': '(?<=[BDEZ])(?!P)', 'V8-E': '(?<=[EZ])(?!P)', 'glutamyl endopeptidase': '(?<=[^E]E)', 'leukocyte elastase': '(?<=[ALIV])(?!P)', 'proline endopeptidase': '(?<=[HKR]P)(?!P)'}
This dict contains regular expressions for cleavage rules of the most popular proteolytic enzymes. The rules were taken from the PSI MS ontology.
You can use names or accessions to access the rules. Use pyteomics.auxiliary.cvquery() for accession access:
>>> from pyteomics.auxiliary import cvquery
>>> from pyteomics.parser import psims_rules
>>> cvquery(psims_rules, 'MS:1001918')
'(?<=W)'
- pyteomics.parser.std_amino_acids = ['Q', 'W', 'E', 'R', 'T', 'Y', 'I', 'P', 'A', 'S', 'D', 'F', 'G', 'H', 'K', 'L', 'C', 'V', 'N', 'M']
modX labels for the 20 standard amino acids.
- pyteomics.parser.std_cterm = '-OH'
modX label for the unmodified C-terminus.
- pyteomics.parser.std_labels = ['Q', 'W', 'E', 'R', 'T', 'Y', 'I', 'P', 'A', 'S', 'D', 'F', 'G', 'H', 'K', 'L', 'C', 'V', 'N', 'M', 'H-', '-OH']
modX labels for the standard amino acids and unmodified termini.
- pyteomics.parser.std_nterm = 'H-'
modX label for the unmodified N-terminus.
- pyteomics.parser.strip(sequence) -> str
Remove all modifications and terminal groups from a modX sequence, parsed sequence or ProForma object, and return a one-letter sequence string.
Examples
>>> strip('Ac-oxMYPEPTIDE-OH')
'MYPEPTIDE'
>>> strip(['Ac-', 'oxM', 'Y', 'P', 'E', 'pP', 'T', 'I', 'D', 'E', '-OH'])
'MYPEPTIDE'
- pyteomics.parser.to_proforma(sequence, **kwargs)
Converts a (parsed) modX sequence to a basic ProForma string. Modifications are represented as masses, if those are given in aa_mass, as chemical formulas (via aa_comp) or as names (using mod_names).
- Parameters:
sequence (str or list) – A modX sequence, possibly in the parsed form.
aa_mass (dict, keyword only, optional) – Used to render modifications as mass shifts.
aa_comp (dict, keyword only, optional) – Used to render modifications as chemical formulas.
mod_names (dict or callable, keyword only, optional) – Used to get the rendered name of modification from the mod label.
prefix (str, keyword only, optional) – Prepend all modification names with the given prefix.
- Returns:
out – A ProForma sequence.
- Return type:
str
Examples
>>> to_proforma('PEPTIDE')
'PEPTIDE'
>>> to_proforma('Ac-oxMYPEPTIDE-OH', aa_mass={'Ac-': 42.010565}, mod_names={'ox': 'Oxidation'}, prefix='U:')
'[+42.0106]-M[U:Oxidation]YPEPTIDE'
>>> to_proforma('oxidationMYPEPTIDE')  # last fallback is to just capitalize the label
'M[Oxidation]YPEPTIDE'
- pyteomics.parser.to_string(parsed_sequence: Iterable[str] | Iterable[tuple], show_unmodified_termini=True)
Create a string from a parsed sequence.
- Parameters:
parsed_sequence (iterable) – Expected to be in one of the formats returned by parse(), i.e. list of labels or list of tuples.
show_unmodified_termini (bool, optional) – Defines the behavior towards standard terminal groups in the input. True means that they will be preserved if present (default). False means that they will be removed. Standard terminal groups will not be added if not shown in parsed_sequence, regardless of this setting.
- Returns:
sequence
- Return type:
str
- pyteomics.parser.tostring(parsed_sequence: Iterable[str] | Iterable[tuple], show_unmodified_termini=True)
Create a string from a parsed sequence.
- Parameters:
parsed_sequence (iterable) – Expected to be in one of the formats returned by parse(), i.e. list of labels or list of tuples.
show_unmodified_termini (bool, optional) – Defines the behavior towards standard terminal groups in the input. True means that they will be preserved if present (default). False means that they will be removed. Standard terminal groups will not be added if not shown in parsed_sequence, regardless of this setting.
- Returns:
sequence
- Return type:
str