Pyteomics documentation v4.1.3dev0

mgf - read and write MS/MS data in Mascot Generic Format

Summary

MGF is a simple human-readable format for MS/MS data. It allows storing MS/MS peak lists and experimental parameters.

This module provides classes and functions for access to data stored in MGF files. Parsing is done using the MGF and IndexedMGF classes. The read() function can be used as an entry point. MGF spectra are converted to dictionaries. MS/MS data points are (optionally) represented as numpy arrays. Common parameters can also be read from the MGF file header with the read_header() function. write() creates MGF files.

Classes

MGF - a text-mode MGF parser. Suitable to read spectra from a file consecutively. Needs a file opened in text mode (or will open it if given a file name).

IndexedMGF - a binary-mode MGF parser. When created, builds a byte offset index for fast random access by spectrum titles. Sequential iteration is also supported. Needs a seekable file opened in binary mode (if created from an existing file object).

MGFBase - abstract class, the common ancestor of the two classes above. Can be used for type checking.
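The idea behind a byte offset index can be pictured with the standard library alone. The sketch below is not the Pyteomics implementation; the helper names `build_title_index` and `read_block` and the sample data are made up for illustration. It scans a binary stream once, records where each spectrum block starts, and then seeks straight to a block by title.

```python
import io

def build_title_index(stream):
    """Map each spectrum title to the byte offset of its BEGIN IONS line."""
    index = {}
    block_start = None
    while True:
        offset = stream.tell()          # position before the line is read
        line = stream.readline()
        if not line:
            break
        stripped = line.strip()
        if stripped == b'BEGIN IONS':
            block_start = offset
        elif stripped.startswith(b'TITLE=') and block_start is not None:
            title = stripped[len(b'TITLE='):].decode('utf-8')
            index[title] = block_start
    return index

def read_block(stream, offset):
    """Seek to `offset` and return the lines up to and including END IONS."""
    stream.seek(offset)
    lines = []
    for line in iter(stream.readline, b''):
        lines.append(line.decode('utf-8').rstrip())
        if lines[-1] == 'END IONS':
            break
    return lines

# Hypothetical two-spectrum file held in memory (binary mode, seekable).
data = io.BytesIO(
    b'BEGIN IONS\nTITLE=first\nPEPMASS=400.5\n100.0 1.0\nEND IONS\n'
    b'BEGIN IONS\nTITLE=second\nPEPMASS=500.5\n200.0 2.0\nEND IONS\n'
)
index = build_title_index(data)
second = read_block(data, index['second'])
```

Offsets are counted in bytes, which is why a reader of this kind needs a seekable file opened in binary mode: positions in a decoded text stream do not map reliably to byte positions.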

Functions

read() - iterate through spectra in MGF file. Data from a single spectrum are converted to a human-readable dict.

get_spectrum() - read a single spectrum with given title from a file.

chain() - read multiple files at once.

chain.from_iterable() - read multiple files at once, using an iterable of files.

read_header() - get a dict with common parameters for all spectra from the beginning of MGF file.

write() - write an MGF file.
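A minimal round-trip sketch of write() and read(), assuming Pyteomics is installed. The function name `roundtrip`, the spectrum values and the file path are illustrative and not part of the library. The import is placed inside the function so that the snippet can be loaded even where Pyteomics is absent.

```python
def roundtrip(path):
    """Write two small spectra to `path`, read them back, return their titles."""
    from pyteomics import mgf  # assumption: Pyteomics is installed

    # Made-up spectra in the dictionary layout that write() expects.
    spectra = [
        {
            'm/z array': [100.1, 200.2, 300.3],
            'intensity array': [10.0, 50.0, 20.0],
            'params': {'title': 'spectrum 1', 'pepmass': 450.25},
        },
        {
            'm/z array': [150.5, 250.6],
            'intensity array': [5.0, 15.0],
            'params': {'title': 'spectrum 2', 'pepmass': 512.3},
        },
    ]
    # file_mode='w' overrides the default append mode, so rerunning the
    # function does not add the same spectra to the file twice.
    mgf.write(spectra, output=path, file_mode='w')

    # read() returns a reader that supports the with syntax and iteration.
    with mgf.read(path) as reader:
        return [spectrum['params']['title'] for spectrum in reader]
```

Calling `roundtrip('demo.mgf')` should give the titles back in file order.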


pyteomics.mgf.chain(*args, **kwargs)

Chain read() for several files. Positional arguments should be file names or file objects. Keyword arguments are passed to the read() function.

chain.from_iterable(files, **kwargs)

Chain read() for several files. Keyword arguments are passed to the read() function.

files : iterable
Iterable of file names or file objects.

class pyteomics.mgf.IndexedMGF(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding='utf-8', block_size=1000000, _skip_index=False)[source]

Bases: pyteomics.auxiliary.file_helpers.TaskMappingMixin, pyteomics.auxiliary.file_helpers.TimeOrderedIndexedReaderMixin, pyteomics.auxiliary.file_helpers.IndexSavingTextReader, pyteomics.mgf.MGFBase

A class representing an MGF file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by title using the indexing syntax in constant time. If created using a file object, it needs to be opened in binary mode.

When iterated, IndexedMGF object yields spectra one by one. Each ‘spectrum’ is a dict with four keys: ‘m/z array’, ‘intensity array’, ‘charge array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MGF, lowercased).
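To make the spectrum layout concrete, here is a hand-built dictionary with the four documented keys. Plain lists stand in for the numpy arrays, as they would with convert_arrays=0; the values and the helper `base_peak` are made up for illustration.

```python
# A hand-built spectrum in the documented layout (values are invented).
spectrum = {
    'm/z array': [100.1, 200.2, 300.3],
    'intensity array': [10.0, 50.0, 20.0],
    'charge array': [1, 2, 1],
    'params': {'title': 'spectrum 1', 'scans': '17'},
}

def base_peak(spectrum):
    """Return the (m/z, intensity) pair of the most intense peak."""
    peaks = zip(spectrum['m/z array'], spectrum['intensity array'])
    return max(peaks, key=lambda peak: peak[1])

title = spectrum['params']['title']   # parameter keys are lowercased
mz, intensity = base_peak(spectrum)
```

The same function works on spectra yielded by a reader, since it relies only on the dictionary keys and on the two arrays being iterable.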

Attributes:
header : dict

The file header.

time : RTLocator

A property used for accessing spectra by retention time.

Methods

map(self[, target, processes, …]) Execute the target function over entries of this object across up to processes processes.
prebuild_byte_offset_file(cls, path) Construct a new reader, build its byte offset index and write it to file
write_byte_offsets(self) Write the byte offsets in _offset_index to the file at _byte_offset_filename
build_byte_index  
get_by_id  
get_by_ids  
get_by_index  
get_by_index_slice  
get_by_indexes  
get_by_key_slice  
get_spectrum  
next  
parse_charge  
reset  
__init__(self, source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding='utf-8', block_size=1000000, _skip_index=False)[source]

x.__init__(…) initializes x; see help(type(x)) for signature

map(self, target=None, processes=-1, queue_timeout=4, args=None, kwargs=None, **_kwargs)

Execute the target function over entries of this object across up to processes processes.

Results will be returned out of order.

Parameters:
target : Callable, optional

The function to execute over each entry. It will be given a single object yielded by the wrapped iterator as well as all of the values in args and kwargs

processes : int, optional

The number of worker processes to use. If negative, the number of processes will match the number of available CPUs.

queue_timeout : float, optional

The number of seconds to block, waiting for a result before checking to see if all workers are done.

args : Sequence, optional

Additional positional arguments to be passed to the target function

kwargs : Mapping, optional

Additional keyword arguments to be passed to the target function

**_kwargs

Additional keyword arguments to be passed to the target function

Yields:
object

The work item returned by the target function.
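A sketch of how map() might be used, assuming Pyteomics is installed. The names `summarize` and `peak_counts` are hypothetical. Because results come back out of order, the target returns the spectrum title alongside its result so the two can be matched up afterwards.

```python
def summarize(spectrum):
    """Target for map(): reduce one spectrum dict to (title, number of peaks)."""
    return spectrum['params'].get('title'), len(spectrum['m/z array'])

def peak_counts(path, processes=2):
    """Return {title: number of peaks} for every spectrum in the file at `path`."""
    from pyteomics import mgf  # assumption: Pyteomics is installed

    with mgf.IndexedMGF(path) as reader:
        # Each result carries its own title, so arrival order does not matter.
        return dict(reader.map(summarize, processes=processes))
```

The target is defined at module level so that worker processes can import it. On platforms that start workers by spawning a fresh interpreter, call `peak_counts` from under an `if __name__ == '__main__':` guard.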

classmethod prebuild_byte_offset_file(cls, path)

Construct a new reader, build its byte offset index and write it to file

Parameters:
path : str

The path to the file to parse

reset(self)

Resets the iterator to its initial state.

write_byte_offsets(self)

Write the byte offsets in _offset_index to the file at _byte_offset_filename

class pyteomics.mgf.MGF(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding=None)[source]

Bases: pyteomics.auxiliary.file_helpers.FileReader, pyteomics.mgf.MGFBase

A class representing an MGF file. Supports the with syntax and direct iteration for sequential parsing. Specific spectra can be accessed by title using the indexing syntax (if the file is seekable), but it takes linear time to search through the file. Consider using IndexedMGF for constant-time access to spectra.

MGF object behaves as an iterator, yielding spectra one by one. Each ‘spectrum’ is a dict with four keys: ‘m/z array’, ‘intensity array’, ‘charge array’ and ‘params’. ‘m/z array’ and ‘intensity array’ store numpy.ndarray’s of floats, ‘charge array’ is a masked array (numpy.ma.MaskedArray) of ints, and ‘params’ stores a dict of parameters (keys and values are str, keys corresponding to MGF, lowercased).

Attributes:
header : dict

The file header.

Methods

get_spectrum  
next  
parse_charge  
reset  
__init__(self, source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None, encoding=None)[source]

x.__init__(…) initializes x; see help(type(x)) for signature

reset(self)

Resets the iterator to its initial state.

class pyteomics.mgf.MGFBase(source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None)[source]

Abstract class representing an MGF file. Subclasses implement different approaches to parsing.

Attributes:
encoding
header

Methods

get_spectrum  
parse_charge  
__init__(self, source=None, use_header=True, convert_arrays=2, read_charges=True, dtype=None)[source]

Create an MGF file object, set MGF-specific parameters.

Parameters:
source : str or file or None, optional

A file object (or file name) with data in MGF format. Default is None, which means read standard input.

use_header : bool, optional

Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.

convert_arrays : one of {0, 1, 2}, optional

If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.

read_charges : bool, optional

If True (default), fragment charges are reported. Disabling it improves performance.

dtype : type or str or dict, optional

dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’ and/or ‘charge array’.

encoding : str, optional

File encoding.
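The precedence rule described for use_header can be pictured as a dictionary merge. This is only an illustration of the documented behaviour, not the library code; `merge_params` and the sample values are made up.

```python
def merge_params(header, spectrum_params):
    """Combine header and per-spectrum parameters; the spectrum wins on conflict."""
    merged = dict(header)           # start from the file-wide values
    merged.update(spectrum_params)  # spectrum-specific values override them
    return merged

header = {'com': 'demo run', 'charge': '2+ and 3+'}
spectrum_params = {'title': 'spectrum 1', 'charge': '2+'}
merged = merge_params(header, spectrum_params)
```

Here `merged['charge']` is the spectrum's own `'2+'`, while `'com'` is inherited from the header.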

pyteomics.mgf.get_spectrum(source, title, *args, **kwargs)[source]

Read one spectrum (with given title) from source.

See read() for explanation of parameters affecting the output.

Note

Only the key-value pairs after the “TITLE =” line will be included in the output.

Parameters:
source : str or file or None

File to read from.

title : str

Spectrum title.

The rest of the arguments are the same as for read().

Returns:
out : dict or None

A dict with the spectrum, if it is found, and None otherwise.

pyteomics.mgf.read(*args, **kwargs)[source]

Returns a reader for a given MGF file. Most of the parameters repeat the instantiation signature of MGF and IndexedMGF. Additional parameter use_index helps decide which class to instantiate for given source.

Parameters:
source : str or file or None, optional

A file object (or file name) with data in MGF format. Default is None, which means read standard input.

use_header : bool, optional

Add the info from file header to each dict. Spectrum-specific parameters override those from the header in case of conflict. Default is True.

convert_arrays : one of {0, 1, 2}, optional

If 0, m/z, intensities and (possibly) charges will be returned as regular lists. If 1, they will be converted to regular numpy.ndarray’s. If 2, charges will be reported as a masked array (default). The default option is the slowest. 1 and 2 require numpy.

read_charges : bool, optional

If True (default), fragment charges are reported. Disabling it improves performance.

dtype : type or str or dict, optional

dtype argument to numpy array constructor, one for all arrays or one for each key. Keys should be ‘m/z array’, ‘intensity array’ and/or ‘charge array’.

encoding : str, optional

File encoding.

use_index : bool, optional

Determines which parsing method to use. If True (default), an instance of IndexedMGF is created. This facilitates random access by spectrum titles. If an open file is passed as source, it needs to be open in binary mode.

If False, an instance of MGF is created. It reads source in text mode and is suitable for iterative parsing. Access by spectrum title requires linear search and thus takes linear time.

block_size : int, optional

Size of the chunk (in bytes) used to parse the file when creating the byte offset index. (Accepted only for IndexedMGF.)

Returns:
out : MGFBase

Instance of MGF or IndexedMGF.

pyteomics.mgf.read_header(*args, **kwargs)[source]

Read the specified MGF file, get search parameters specified in the header as a dict, the keys corresponding to MGF format (lowercased).

Parameters:
source : str or file

File name or file object representing a file in MGF format.

Returns:
header : dict

pyteomics.mgf.write(*args, **kwargs)[source]

Create a file in MGF format.

Parameters:
spectra : iterable

A sequence of dictionaries with keys ‘m/z array’, ‘intensity array’, and ‘params’. ‘m/z array’ and ‘intensity array’ should be sequences of int, float, or str. Strings will be written ‘as is’. The sequences should be of equal length, otherwise excess values will be ignored.

‘params’ should be a dict with keys corresponding to MGF format. Keys must be strings, they will be uppercased and used as is, without any format consistency tests. Values can be of any type allowing string representation.

‘charge array’ can also be specified.

output : str or file or None, optional

Path or a file-like object open for writing. If an existing file is specified by file name, it will be opened for appending. In this case writing with a header can result in violation of format conventions. Default value is None, which means using standard output.

header : dict or (multiline) str or list of str, optional

In case of a single string or a list of strings, the header will be written ‘as is’. In case of dict, the keys (must be strings) will be uppercased.

write_charges : bool, optional

If False, fragment charges from ‘charge array’ will not be written. Default is True.

fragment_format : str, optional

Format string for m/z, intensity and charge of a fragment. Useful to set the number of decimal places, e.g.: fragment_format='%.4f %.0f'. Default is '{} {} {}'.

Note

The supported format syntax differs depending on other parameters. If use_numpy is True and numpy is available, fragment peaks will be written using numpy.savetxt(). Then, fragment_format must be recognized by that function.

Otherwise, plain Python string formatting is done. See the docs for details on writing the format string. If some or all charges are missing, an empty string is substituted instead, so formatting as float or int will raise an exception. Hence it is safer to just use {} for charges.

key_order : list, optional

A list of strings specifying the order in which params will be written in the spectrum header. Unlisted keys will be in arbitrary order. Default is _default_key_order.

Note

This does not affect the order of lines in the global header.

param_formatters : dict, optional

A dict mapping parameter names to functions. Each function must accept two arguments (key and value) and return a string. Default is _default_value_formatters.

use_numpy : bool, optional

Controls whether fragment peak arrays are written using numpy.savetxt(). Using numpy.savetxt() is faster, but cannot handle sparse arrays of fragment charges. You may want to disable this if you need to save spectra with ‘charge arrays’ with missing values.

If not specified, will be set to the opposite of write_charges. If numpy is not available, this parameter has no effect.

file_mode : str, keyword only, optional

If output is a file name, defines the mode the file will be opened in. Otherwise will be ignored. Default is ‘a’.

encoding : str, keyword only, optional

Output file encoding (if output is specified by name).

Returns:
output : file
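The note on fragment_format says that a missing charge is replaced with an empty string when plain Python formatting is used. The sketch below shows why a bare {} is the safer placeholder for charges. It uses only built-in string formatting; `format_peak` is a hypothetical stand-in and not the function that write() calls.

```python
def format_peak(fragment_format, mz, intensity, charge=None):
    """Format one peak with str.format, using '' for a missing charge."""
    return fragment_format.format(mz, intensity, '' if charge is None else charge)

# A bare {} accepts both an int charge and the empty-string substitute.
with_charge = format_peak('{:.4f} {:.0f} {}', 100.12341, 250.4, 2)
without_charge = format_peak('{:.4f} {:.0f} {}', 100.12341, 250.4)

# An integer format code cannot render the empty string.
try:
    format_peak('{:.4f} {:.0f} {:d}', 100.12341, 250.4)
    failed = False
except ValueError:
    failed = True
```

When numpy.savetxt() does the writing instead, the format string follows that function's %-style conventions, as in the '%.4f %.0f' example above.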