mgf - read and write MS/MS data in Mascot Generic Format

Summary

MGF is a simple human-readable format for MS/MS data. It allows storing MS/MS peak lists and experimental parameters.

This module provides classes and functions for access to data stored in MGF files. Parsing is done using the MGF and IndexedMGF classes. The read() function can be used as an entry point. MGF spectra are converted to dictionaries. MS/MS data points are (optionally) represented as numpy arrays. Common parameters can also be read from the MGF file header with the read_header() function, and write() creates MGF files.
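For orientation, the sketch below shows what a small MGF file looks like and the dictionary layout described above. The sample text and the parse_mgf helper are illustrative assumptions written with the standard library only. They are not the pyteomics parser and cover far fewer cases.

```python
# Illustrative sample: one header line followed by one spectrum block.
MGF_TEXT = """\
COM=example run
BEGIN IONS
TITLE=spectrum 1
PEPMASS=500.25
CHARGE=2+
100.1 10.0
200.2 20.0
END IONS
"""


def parse_mgf(text):
    """Toy parser (hypothetical helper): yield one dict per spectrum block."""
    spectrum = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "BEGIN IONS":
            spectrum = {"m/z array": [], "intensity array": [], "params": {}}
        elif line == "END IONS":
            yield spectrum
            spectrum = None
        elif spectrum is not None and line:
            if "=" in line:
                # KEY=value lines inside a block become lowercased params.
                key, value = line.split("=", 1)
                spectrum["params"][key.lower()] = value
            else:
                # Anything else is treated as an "m/z intensity" peak line.
                mz, intensity = line.split()[:2]
                spectrum["m/z array"].append(float(mz))
                spectrum["intensity array"].append(float(intensity))


spectra = list(parse_mgf(MGF_TEXT))
print(spectra[0]["params"]["title"])
print(spectra[0]["m/z array"])
```

Lines before the first BEGIN IONS belong to the file header, which this toy parser simply skips.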

Classes

MGF - a text-mode MGF parser. Suitable to read spectra from a file consecutively. Needs a file opened in text mode (or will open it if given a file name).

IndexedMGF - a binary-mode MGF parser. When created, builds a byte offset index for fast random access by spectrum titles. Sequential iteration is also supported. Needs a seekable file opened in binary mode (if created from an existing file object).

MGFBase - abstract class, the common ancestor of the two classes above. Can be used for type checking.

Functions

read() - an alias for MGF or IndexedMGF.

get_spectrum() - read a single spectrum with given title from a file.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

read_header() - get a dict with common parameters for all spectra from the beginning of MGF file.

write() - write an MGF file.

pyteomics.mgf.chain(*args, **kwargs)

Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

Parameters:

files (iterable) – Iterable of file names or file objects.

class pyteomics.mgf.IndexedMGF(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding='utf-8', index_by_scans=False, read_ions=False, _skip_index=False, **kwargs)

Bases: MGFBase, TaskMappingMixin, TimeOrderedIndexedReaderMixin, IndexSavingTextReader

A class representing an MGF file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by title using the indexing syntax in constant time. If created using a file object, it needs to be opened in binary mode.

When iterated, an IndexedMGF object yields spectra one by one. Each ‘spectrum’ is a dict with five keys: ‘m/z array’, ‘intensity array’, ‘charge array’, ‘ion array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, ‘ion array’ is an array of ions (str) and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MGF, lowercased).
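The idea of a byte offset index can be sketched with the standard library. The helpers build_title_index and read_block below are hypothetical and are not the pyteomics implementation. They only illustrate why lookup by title is fast once the index exists, and why the file must be in binary mode: the offsets passed to seek() have to be true byte positions.

```python
import io


def build_title_index(fileobj):
    """Map each spectrum title to the byte offset of its BEGIN IONS line."""
    index = {}
    block_start = None
    while True:
        offset = fileobj.tell()  # byte position before reading the line
        line = fileobj.readline()
        if not line:
            break
        stripped = line.strip()
        if stripped == b"BEGIN IONS":
            block_start = offset
        elif stripped.startswith(b"TITLE=") and block_start is not None:
            index[stripped[len(b"TITLE="):].decode("utf-8")] = block_start
    return index


def read_block(fileobj, offset):
    """Seek straight to a block and return its lines, up to END IONS."""
    fileobj.seek(offset)
    lines = []
    for line in iter(fileobj.readline, b""):
        lines.append(line.decode("utf-8").rstrip())
        if lines[-1] == "END IONS":
            break
    return lines


# Made-up data with two spectra, held in an in-memory binary file.
data = (
    b"BEGIN IONS\nTITLE=a\n1.0 2.0\nEND IONS\n"
    b"BEGIN IONS\nTITLE=b\n3.0 4.0\nEND IONS\n"
)
fileobj = io.BytesIO(data)
index = build_title_index(fileobj)
block_b = read_block(fileobj, index["b"])
print(block_b)
```

Building the index requires one pass over the file. After that, each lookup is a dictionary access followed by a single seek.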

header

The file header.

Type: dict

time

A property used for accessing spectra by retention time.

Type: RTLocator

__init__(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding='utf-8', index_by_scans=False, read_ions=False, _skip_index=False, **kwargs)

Create an IndexedMGF (binary-mode) reader for a given MGF file.

Parameters:

source (str or file or None, optional) –
A file object (or file name) with data in MGF format. Default is None, which means read standard input.

Note

If a file object is given, it must be opened in binary mode.
use_header (bool, optional) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
convert_arrays (one of {0, 1, 2}, optional) – If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
read_charges (bool, optional) – If True (default), fragment charges are reported. Disabling it improves performance.
read_ions (bool, optional) – If True (default: False), fragment ion types are reported. Disabling it improves performance. Note that right now, only one of (read_charges, read_ions) may be True.
dtype (type or str or dict, optional) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’, ‘charge array’ and/or ‘ion array’.
encoding (str, optional) – File encoding.
block_size (int, optional) – Size of the chunk (in bytes) used to parse the file when creating the byte offset index.

Returns:

out – The reader object.

Return type:

IndexedMGF
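The override rule described for use_header can be pictured as a dictionary merge. The values below are made up, and this is only a sketch of the documented behavior, not the library's code.

```python
# Hypothetical header and spectrum-level parameters (keys lowercased).
header = {"com": "example run", "charge": "2+"}
spectrum_params = {"title": "spectrum 1", "charge": "3+"}

# Later entries win, so spectrum-specific values override the header.
with_header = {**header, **spectrum_params}

# With use_header=False, only the spectrum's own parameters would remain.
without_header = dict(spectrum_params)

print(with_header)
print(without_header)
```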

build_byte_index()

Build the byte offset index by either reading these offsets from the file at _byte_offset_filename, or falling back to the method used by IndexedXML or IndexedTextReader if this operation fails due to an IOError.

map(target=None, workers=None, args=None, kwargs=None, method='mp', **_kwargs)

Execute the target function over entries of this object in parallel. The type of parallelism is determined by the method parameter.

Results will be returned out of order.

Parameters:

target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int, optional) – The number of worker threads or processes to use. The default depends on the method parameter.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
method (str, optional) –
The type of parallelism to use. Can be one of the following:
- one of ‘p’, ‘mp’, ‘processes’, or ‘multiprocessing’: use multiprocessing. This is the default. It is equivalent to calling pmap(); see there for details.
- one of ‘t’, ‘threading’, or ‘threads’: use threading. It is equivalent to calling tmap(); see there for details.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.
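Because results arrive out of order, it helps if the target function returns an identifier together with its result. The sketch below illustrates the pattern with the standard library's thread pool as a stand-in. It does not call map() itself, and the spectra and the count_peaks function are made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def count_peaks(spectrum):
    """Return the title together with the result so it can be matched later."""
    return spectrum["params"]["title"], len(spectrum["m/z array"])


# Hypothetical spectra in the dict layout used by the readers.
spectra = [
    {"params": {"title": "a"}, "m/z array": [100.1, 200.2]},
    {"params": {"title": "b"}, "m/z array": [150.0]},
    {"params": {"title": "c"}, "m/z array": [101.0, 202.0, 303.0]},
]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(count_peaks, s) for s in spectra]
    # as_completed yields futures as they finish, not in submission order.
    peak_counts = dict(f.result() for f in as_completed(futures))

print(peak_counts)
```

Collecting the (title, result) pairs into a dict makes the final outcome independent of completion order.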

pmap(target=None, workers=None, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to workers processes.

Results will be returned out of order.

Parameters:

target (Callable, optional) – The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.
workers (int or None, optional) – The number of worker processes to use. If not a positive integer, defaults to the number of available CPUs. This parameter can also be set at reader creation.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.

classmethod prebuild_byte_offset_file(path)

Construct a new reader, build its byte offset index and write it to a file.

Parameters:

path (str) – The path to the file to parse.

reset()

Resets the iterator to its initial state.

tmap(target=None, workers=None, args=None, kwargs=None, chunk_size=None, **_kwargs)

Execute the target function over entries of this object across up to workers threads.

Results will be returned out of order.

Parameters:

target (Callable, optional) –
The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs.

Warning

target must be thread-safe. The target function cannot interact with the underlying file object directly.
workers (int or None, optional) – The number of worker threads to use. If not a positive integer, defaults to the number of available CPUs.
args (Sequence, optional) – Additional positional arguments to be passed to the target function.
kwargs (Mapping, optional) – Additional keyword arguments to be passed to the target function.
chunk_size (int, optional) – The number of work items to hand out to each worker thread at a time. If not specified, defaults to chunk_size attribute of this object.
**_kwargs – Additional keyword arguments to be passed to the target function.

Yields:

object – The work item returned by the target function.

write_byte_offsets()

Write the byte offsets in _offset_index to the file at _byte_offset_filename.

class pyteomics.mgf.MGF(source=None, use_header=True, convert_arrays=2, read_charges=True, read_ions=False, dtype=None, encoding=None)

Bases: MGFBase, FileReader

A class representing an MGF file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by title using the indexing syntax (if the file is seekable), but it takes linear time to search through the file. Consider using IndexedMGF for constant-time access to spectra.

An MGF object behaves as an iterator, yielding spectra one by one. Each ‘spectrum’ is a dict with five keys: ‘m/z array’, ‘intensity array’, ‘charge array’, ‘ion array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, ‘ion array’ is a masked array of ions (str) and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MGF, lowercased).

header

The file header.

Type: dict

__init__(source=None, use_header=True, convert_arrays=2, read_charges=True, read_ions=False, dtype=None, encoding=None)

Create an MGF (text-mode) reader for a given MGF file.

Parameters:

source (str or file or None, optional) –
A file object (or file name) with data in MGF format. Default is None, which means read standard input.

Note

If a file object is given, it must be opened in text mode.
use_header (bool, optional) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
convert_arrays (one of {0, 1, 2}, optional) – If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
read_charges (bool, optional) – If True (default), fragment charges are reported. Disabling it improves performance.
read_ions (bool, optional) – If True (default: False), fragment ion types are reported. Disabling it improves performance. Note that right now, only one of (read_charges, read_ions) may be True.
dtype (type or str or dict, optional) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’, ‘charge array’ and/or ‘ion array’.
encoding (str, optional) – File encoding.

Returns:

out – The reader object.

Return type:

MGF

reset()

Resets the iterator to its initial state.

class pyteomics.mgf.MGFBase(source=None, **kwargs)

Bases: MaskedArrayConversionMixin

Abstract mixin class representing an MGF file. Subclasses implement different approaches to parsing.

__init__(source=None, **kwargs)

Create an MGF file object, set MGF-specific parameters.

Parameters:

source (str or file or None, optional) – A file object (or file name) with data in MGF format. Default is None, which means read standard input.
use_header (bool, optional, keyword only) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
convert_arrays (one of {0, 1, 2}, optional, keyword only) – If 0, m/z, intensities and (possibly) charges or (possibly) ions will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
read_charges (bool, optional, keyword only) – If True (default), fragment charges are reported. Disabling it improves performance.
read_ions (bool, optional) – If True (default: False), fragment ions are reported. Disabling it improves performance. Note that right now, only one of (read_charges, read_ions) may be True.
dtype (type or str or dict, optional, keyword only) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’, ‘charge array’ and/or ‘ion array’.
encoding (str, optional, keyword only) – File encoding.

pyteomics.mgf.get_spectrum(source, title, *args, **kwargs)

Read one spectrum (with given title) from source.

See read() for explanation of parameters affecting the output.

Note

Only the key-value pairs after the “TITLE =” line will be included in the output.

Parameters:

source (str or file or None) – File to read from.
title (str) – Spectrum title.
*args – Given to read().
**kwargs – Given to read().

Returns:

out – A dict with the spectrum, if it is found, and None otherwise.

Return type:

dict or None

pyteomics.mgf.read(*args, **kwargs)

Returns a reader for a given MGF file. Most of the parameters repeat the instantiation signature of MGF and IndexedMGF. Additional parameter use_index helps decide which class to instantiate for given source.

Parameters:

source (str or file or None, optional) – A file object (or file name) with data in MGF format. Default is None, which means read standard input.
use_header (bool, optional) – Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.
convert_arrays (one of {0, 1, 2}, optional) – If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.
read_charges (bool, optional) – If True (default), fragment charges are reported. Disabling it improves performance.
read_ions (bool, optional) – If True (default: False), fragment ion types are reported. Disabling it improves performance. Note that right now, only one of (read_charges, read_ions) may be True.
dtype (type or str or dict, optional) – dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’, ‘charge array’ and/or ‘ion array’.
encoding (str, optional) – File encoding.
use_index (bool, optional) –
Determines which parsing method to use. If True (default), an instance of IndexedMGF is created. This facilitates random access by spectrum titles. If an open file is passed as source, it needs to be open in binary mode.

If False, an instance of MGF is created. It reads source in text mode and is suitable for iterative parsing. Access by spectrum title requires linear search and thus takes linear time.
block_size (int, optional) – Size of the chunk (in bytes) used to parse the file when creating the byte offset index. (Accepted only for IndexedMGF.)

Returns:

out – Instance of MGF or IndexedMGF.

Return type:

MGFBase

pyteomics.mgf.read_header(source)

Read the specified MGF file, get search parameters specified in the header as a dict, the keys corresponding to MGF format (lowercased).

Parameters:

source (str or file) – File name or file object representing a file in MGF format.

Returns:

header

Return type:

dict

pyteomics.mgf.write(spectra, output=None, header='', key_order=['title', 'pepmass', 'rtinseconds', 'charge'], fragment_format=None, write_charges=True, write_ions=False, use_numpy=None, param_formatters={'charge': <function _charge_repr>, 'pepmass': <function _pepmass_repr>})

Create a file in MGF format.

Parameters:

spectra (iterable) –
A sequence of dictionaries with keys ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ should be sequences of int, float, or str. Strings will be written ‘as is’. The sequences should be of equal length, otherwise excess values will be ignored.

‘params’ should be a dict with keys corresponding to MGF format. Keys must be strings, they will be uppercased and used as is, without any format consistency tests. Values can be of any type allowing string representation.

‘charge array’ or ‘ion array’ can also be specified.

Note

Passing a single spectrum will work, but will trigger a warning. This usage pattern is discouraged. To ensure correct output when writing multiple spectra, it is recommended to construct a sequence of spectra first and then call write() once.

See also

This discussion of usage patterns of write(): https://github.com/levitsky/pyteomics/discussions/109
output (str or file or None, optional) –
Path or a file-like object open for writing. If an existing file is specified by file name, it will be opened for writing. Default value is None, which means using standard output.

Note

The default mode for output files specified by name has been changed from a to w in pyteomics 4.6. See file_mode to override the mode.
header (dict or (multiline) str or list of str, optional) – In case of a single string or a list of strings, the header will be written ‘as is’. In case of dict, the keys (must be strings) will be uppercased.
write_charges (bool, optional) – If False, fragment charges from ‘charge array’ will not be written. Default is True.
write_ions (bool, optional) – If False, fragment ions from ‘ion array’ will not be written. If True, then write_charges is set to False. Default is False.
fragment_format (str, optional) –
Format string for m/z, intensity and charge (or ion annotation) of a fragment. Useful to set the number of decimal places, e.g.: fragment_format='%.4f %.0f'. Default is '{} {} {}'.

Note

The supported format syntax differs depending on other parameters. If use_numpy is True and numpy is available, fragment peaks will be written using numpy.savetxt(). Then, fragment_format must be recognized by that function.

Otherwise, plain Python string formatting is done. See the Python documentation on format string syntax for details on writing the format string. If some or all charges are missing, an empty string is substituted instead, so formatting as float or int will raise an exception. Hence it is safer to just use {} for charges.
key_order (list, optional) –
A list of strings specifying the order in which params will be written in the spectrum header. Unlisted keys will be in arbitrary order. Default is _default_key_order.

Note

This does not affect the order of lines in the global header.
param_formatters (dict, optional) – A dict mapping parameter names to functions. Each function must accept two arguments (key and value) and return a string. Default is _default_value_formatters.
use_numpy (bool, optional) –
Controls whether fragment peak arrays are written using numpy.savetxt(). Using numpy.savetxt() is faster, but cannot handle sparse arrays of fragment charges. You may want to disable this if you need to save spectra with ‘charge arrays’ with missing values.

If not specified, will be set to the opposite of write_charges. If numpy is not available, this parameter has no effect.
file_mode (str, keyword only, optional) –
If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘w’.

Note

The default changed from ‘a’ in pyteomics 4.6.
encoding (str, keyword only, optional) – Output file encoding (if output is specified by name).
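The two format syntaxes mentioned for fragment_format can be compared with plain Python, without calling write(). The peak values below are made up. The sketch only illustrates the difference between printf-style and str.format-style strings, and why a bare {} is the safer placeholder for charges.

```python
# Made-up fragment peaks: (m/z, intensity, charge); None marks a missing charge.
peaks = [(100.1, 10.0, 1), (200.2, 20.0, None)]

# printf-style format, the syntax numpy.savetxt() understands.
printf_lines = ["%.4f %.0f" % (mz, intensity) for mz, intensity, _ in peaks]

# str.format-style; a missing charge is replaced by an empty string,
# so the charge field uses a bare {} placeholder.
format_lines = [
    "{:.4f} {:.0f} {}".format(mz, intensity, "" if charge is None else charge)
    for mz, intensity, charge in peaks
]

# An integer format spec cannot render the empty string.
try:
    "{:.4f} {:.0f} {:d}".format(200.2, 20.0, "")
    int_spec_failed = False
except ValueError:
    int_spec_failed = True

print(printf_lines)
print(format_lines)
print(int_spec_failed)
```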

Returns:

output

Return type:

file

Pyteomics documentation v5.0