Pyteomics documentation v4.6.3

Example 1: Unravelling the Peptidome

«  Combined examples   ::   Contents   ::   Example 2: Fragmentation  »

Example 1: Unravelling the Peptidome

In this example, we will introduce the Pyteomics tools to predict the basic physicochemical characteristics of peptides, such as mass, charge and chromatographic retention time. We will download a FASTA database with baker’s yeast proteins, digest it with trypsin and study the distributions of various quantitative qualities that may be measured in a typical proteomic experiment.

The example is organized as a script interrupted by comments. It is assumed that the reader already has experience with numpy and matplotlib libraries. The source code for the example can be found here.

Before we begin, we need to import all the modules that we may require. Besides pyteomics itself, we need the builtin tools that allow to access the hard drive (os), download files from the Internet (urllib), open gzip archives (gzip), and external libraries to process and visualize arrays of data (numpy, matplotlib).

import os
from urllib.request import urlretrieve
import gzip
import matplotlib.pyplot as plt
import numpy as np
from pyteomics import fasta, parser, mass, achrom, electrochem, auxiliary

We also need to download a real FASTA database. For our purposes, the Uniprot database with Saccharomyces cerevisiae proteins will work fine. We’ll download a gzip-compressed database from Uniprot FTP server:

if not os.path.isfile('yeast.fasta.gz'):
    print('Downloading the FASTA file for Saccharomyces cerevisiae...')

The pyteomics.fasta.FASTA() class allows to iterate over the protein sequences in a FASTA file in a regular Python loop. It replaced, although the latter still exists, too. In this example, we create a FASTA object from a file-like object representing a gzip archive. All file parser objects are flexible and support a variety of use cases. Additionally pyteomics.fasta supports an even greater variety of FASTA types and flavors.

For all FASTA parser classes, check fasta - manipulations with FASTA databases. See also: an explanation of Indexed Parsers.

In order to obtain the peptide sequences, we cleave each protein using the pyteomics.parser.cleave() function and combine results into a set object that automatically discards multiple occurrences of the same sequence.

print('Cleaving the proteins with trypsin...')
unique_peptides = set()
with'yeast.fasta.gz', mode='rt') as gzfile:
    for description, sequence in fasta.FASTA(gzfile):
        new_peptides = parser.cleave(sequence, 'trypsin')
print('Done, {0} sequences obtained!'.format(len(unique_peptides)))

Later we will calculate different peptide properties. In order to store them, we create a list of dicts, where each dict stores the properties of a single peptide, including its sequence.

peptides = [{'sequence': i} for i in unique_peptides]

It is also more efficient to pre-parse the sequences into individual amino acids and supply the parsed structures into the functions that calculate m/z, charge, etc. During parsing, we explicitly save the terminal groups of peptides so that they are taken into the account when calculating m/z and charge of a peptide.

print('Parsing peptide sequences...')
for peptide in peptides:
    peptide['parsed_sequence'] = parser.parse(
    peptide['length'] = parser.length(peptide['parsed_sequence'])

For our purposes, we will limit ourselves to reasonably short peptides with the length less than 100 residues.

peptides = [peptide for peptide in peptides if peptide['length'] <= 100]

We use pyteomics.electrochem.charge() to calculate the charge at pH=2.0. The neutral mass and m/z of an ion is found with pyteomics.mass.calculate_mass().

print('Calculating the mass, charge and m/z...')
for peptide in peptides:
    peptide['charge'] = int(round(
        electrochem.charge(peptide['parsed_sequence'], pH=2.0)))
    peptide['mass'] = mass.calculate_mass(peptide['parsed_sequence'])
    peptide['m/z'] = mass.calculate_mass(peptide['parsed_sequence'],

Next, we calculate the retention time in the reversed- and normal-phase chromatography using pyteomics.achrom.calculate_RT() for two different sets of retention coefficients. The phase is specified by supplying corresponding sets of retention coefficients, pyteomics.achrom.RCs_zubarev and pyteomics.achrom.RCs_yoshida_lc for the reversed and normal phases, correspondingly.

print('Calculating the retention time...')
for peptide in peptides:
    peptide['RT_RP'] = achrom.calculate_RT(
    peptide['RT_normal'] = achrom.calculate_RT(

Now, as we have all the numbers we can estimate the complexity of a sample by plotting the distributions of parameters measurable in a typical proteomic experiment. First, we show the distribution of m/z using the standard histogram plotting function from matplotlib.

plt.hist([peptide['m/z'] for peptide in peptides],
    bins = 2000,
plt.xlabel('m/z, Th')
plt.ylabel('# of peptides within 2 Th bin')

The same set of commands allows us to plot the distribution of charge states in the sample:

plt.hist([peptide['charge'] for peptide in peptides],
    bins = 20,
plt.xlabel('charge, e')
plt.ylabel('# of peptides')

Next, we want to visualize the statistical correlation between m/z and retention time in reversed-phase chromatography.

The standard approach would be to use a scatter plot. However, with a sample of our size that would be uninformative. Instead, we will plot a 2d-histogram. There is no standard matplotlib command for that and we have to use a combination of numpy and matplotlib. The function numpy.histogram2d() bins a set of (x,y) points on a plane and returns the matrix of numbers in each individual bin and the borders of the bins. We also use a trick of replacing zeros in this matrix with the not-a-number value so that on the final figure empty bins are highlighted with white color instead of the darkest blue. We suggest removing the fourth line in this code snippet to see how that affects the final plot. At the last line, we also apply the linear regression to obtain the coefficient of correlation between m/z and retention time.

x = [peptide['RT_RP'] for peptide in peptides]
y = [peptide['RT_normal'] for peptide in peptides]
heatmap, xbins, ybins = np.histogram2d(x, y, bins=100)
heatmap[heatmap == 0] = np.nan
a, b, r, stderr = auxiliary.linear_regression(x,y)

The obtained heatmap is plotted with matplotlib.pyplot.imshow() function that visualizes matrices.

plt.xlabel('RT on RP, min')
plt.ylabel('RT on normal phase, min')
plt.title('All tryptic peptides, RT correlation = {0}'.format(r))

The same code can also be applied to compare the retention times obtained on different chromatographic phases. As you can see upon execution of the code, the retention times obtained on different chromatographic phases seem to be uncorrelated.

x = [peptide['m/z'] for peptide in peptides]
y = [peptide['RT_RP'] for peptide in peptides]
heatmap, xbins, ybins = np.histogram2d(x, y,
    bins=[150, 2000],
    range=[[0, 4000], [0, 150]])
heatmap[heatmap == 0] = np.nan
a, b, r, stderr = auxiliary.linear_regression(x,y)

plt.xlabel('m/z, Th')
plt.ylabel('RT on RP, min')
plt.title('All tryptic peptides, correlation = {0}'.format(r))

Finally, let us check whether the retention times remain uncorrelated when we narrow down the sample of peptides. We select the peptides with m/z lying in a 700-701 Th window and plot two chromatographic retention times. This time the sample allows us to use a scatter plot.

close_mass_peptides = [peptide for peptide in peptides
                       if 700.0 <= peptide['m/z'] <= 701.0]
x = [peptide['RT_RP'] for peptide in close_mass_peptides]
y = [peptide['RT_normal'] for peptide in close_mass_peptides]
a, b, r, stderr = auxiliary.linear_regression(x, y)

plt.scatter(x, y)
plt.xlabel('RT on RP, min')
plt.ylabel('RT on normal phase, min')
plt.title('Tryptic peptides with m/z=700-701 Th\nRT correlation = {0}'.format(r))

As you can see, the retention times of peptides lying in a narrow mass window turn out to be substantially correlated.

At this point we stop. The next example will cover the modules allowing access to experimental proteomic datasets stored in XML-based formats.

«  Combined examples   ::   Contents   ::   Example 2: Fragmentation  »