
Building and Using Libraries of Peptide Ion Fragmentation Spectra

S.E. Stein, L.E. Kilpatrick, M. Mautner, P. Neta, J. Roth

National Institute of Standards and Technology, Gaithersburg, MD/Charleston, SC

Overview: MS/MS spectra that identify peptides by sequence/spectrum matching can be reused to identify those peptides more reliably in later studies. Our work develops methods for refining and reusing the information contained in these spectra to create mass spectral reference libraries, similar to those routinely employed in other fields of mass spectrometry.

Method: Spectra that identify peptides by current sequence/spectrum matching methods are first added to an archive along with associated information. For each peptide ion identified more than once, an annotated ‘consensus spectrum’ is derived from the spectra used to make those identifications. Consensus spectra are then subjected to spectrum/sequence consistency tests to assign a measure of identification reliability and to remove false positive identifications. These tests employ sequence/spectrum correlations reported in the literature and found in our work, as well as other quantities found to be correlated with identification reliability. Examples of the latter are the similarity of the original spectra, the unassigned abundance in the consensus spectrum, and the scores of the original sequence/spectrum match. Related consensus spectra are intercompared to find further errors and finally combined to form an annotated, searchable reference library.

Building The Library

1) Extract spectra and analysis information for reliable peptide identifications from spectrum/sequence matches.

2) Create a ‘consensus spectrum’ from all spectra assigned to a single peptide ion.

3) For each consensus spectrum, perform a spectrum/sequence consistency check. Also examine reliable, single-hit identifications.

4) Assign reliability measures to each spectrum and build library.

Ways of Using the Library

Identification confirmation (post-processing): Confirm/reject peptides identified by sequence search programs

Find peptides (pre-processing): Search all spectra against library, then search unmatched peptides using sequence search and other methods

Create library of ‘unidentified spectra’ from their consensus spectra: Compare to identified peptides, use de novo methods, find biomarkers, …

Target identification: Internal standards, biomarkers, target proteins

Transmit peptide analysis information: Difficult-to-identify, unusual, manually identified, special meaning

Spectrum Variability: Effective peptide identification by matching spectra in a reference library requires that spectra are reproducible. The degree of variability has been measured using sets of spectra identifying the same precursor ion. Variations typical of ion trap spectra are shown below.

Most spectra identified by sequence search methods have dot products above 0.7.

[Figure: two plots against Dot Product (x-axis, 0.2 to 1): Max/Median abundance ratio (y-axis, 0 to 180) and #Spectra (y-axis, 0 to 1600)]

Next Steps

Enhance QA/QC. Goal: Distinguish unexpected fragmentations from misidentifications

Add spectra from reliable singly identified peptides to Library

Build comprehensive Libraries

Test and optimize spectrum matching algorithms

Provide annotated libraries for mass spectrometry data systems

[Flow diagram labels: Spectrum/Sequence Consistency; All Data for an Identified Peptide Ion; Annotated Reference Spectrum; Reference Archive; Reference Library; Spectra, Sequence, Confirmatory Information; Consensus Spectrum; Extract Common Features; Fragmentation Rules; Analysis; LC-MS/MS Results]

1) Source Spectra

a) Online repositories
- PeptideAtlas
- the Global Proteome Machine
- Open Proteomics Database
- NCRR Proteomics Resource

b) Collaborators/Contributors
- NIH/LNT (Markey/Geer/Kowalik…)
- Institute for Systems Biology (Nesvizhskii/King/Aebersold/…)
- theGPM (Beavis)
- Blueprint Initiative (Hogue)

c) NIST Measurements
- Single Protein Digests/IT, 3Q

3) Spectrum/Sequence Consistency

Find the likelihood that a consensus spectrum originated from the assigned sequence. Each factor serves to refine the probability.

Factors identified as discriminating (a and b are illustrated below):

a) Match with theoretical spectrum based on individual amino acid fragmentation behavior

b) Fraction of unassigned abundance: abundance for peaks not originating from a known fragmentation path

c) Y/B ion correlations and ratio of Y/B ion abundance sums

d) Unexpected major peaks formally consistent with rules; include tests for reasonableness of neutral losses

e) Amino acid specific rules (Proline, Glycine, Aspartic Acid, …)

f) Predicted/observed fragment ion charge state abundance ratios
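Factor b) can be illustrated with a minimal Python sketch. It is not the procedure used to build the library: it assumes that only singly charged b and y ions count as "known fragmentation paths" (no neutral losses, no multiply charged fragments, no modifications), and the function names and the 0.5 m/z tolerance are hypothetical choices made for illustration.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def b_y_ions(sequence):
    """Return m/z values of singly charged b and y ions (assumption: 1+ only)."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    total = sum(masses)
    ions = []
    running = 0.0
    for mass in masses[:-1]:
        running += mass
        ions.append(running + PROTON)                  # b ion
        ions.append(total - running + WATER + PROTON)  # complementary y ion
    return ions


def unassigned_fraction(peaks, sequence, tol=0.5):
    """Fraction of total abundance not within tol of any b or y ion.

    peaks is a list of (mz, abundance) pairs; tol is a hypothetical
    m/z tolerance.
    """
    ions = b_y_ions(sequence)
    total = sum(abundance for _, abundance in peaks)
    if total <= 0:
        return 0.0
    unassigned = sum(
        abundance
        for mz, abundance in peaks
        if not any(abs(mz - ion) <= tol for ion in ions)
    )
    return unassigned / total
```

For the dipeptide "GA", a spectrum with peaks near the b1 and y1 ions plus one unrelated peak carrying 10% of the abundance gives an unassigned fraction of 0.1. A real consistency test would assign many more fragment types before calling a peak unassigned.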

4) Overall Identification Confidence [under development]

Influential factors employed:

- Spectrum/Sequence Consistency (described above)

- Original Score (degree of sequence match)

- Peptide sequence re-identifications: number of different spectra, experiments, modifications, charge states

- Number of peptides per protein

Occurrence Distribution: Most peptide identifications are made multiple times.

2) Build Consensus Spectra

Reject noise and spurious signals; measure and report variability.

Identify matching m/z peaks in input spectra.

Find and exclude outlier spectra (use dot product of each pair). When multiple sources are available, limit the number of spectra from a single source.

Omit peaks occurring in ½ or fewer of spectra, considering only spectra with sufficiently high S/N to have generated the peak of interest.

Compute average abundance, report variance.

Create annotated spectrum: include information such as spectrum origin, retention, median dot product using all peaks and only consensus peaks, fraction of abundance not at consensus m/z, …
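The majority rule and the averaging step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure used to build the library: peaks are matched by rounding m/z into unit-width bins, each spectrum is scaled to its base peak, and outlier exclusion, the per-source limit, and the S/N eligibility test are all omitted. The names `normalize` and `consensus_spectrum` are hypothetical.

```python
from statistics import mean, pvariance


def normalize(peaks, bin_width=1.0):
    """Bin (mz, abundance) peaks and scale so the base peak is 1.0."""
    binned = {}
    for mz, abundance in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + abundance
    base = max(binned.values(), default=0.0)
    if base <= 0:
        return {}
    return {key: value / base for key, value in binned.items()}


def consensus_spectrum(spectra, bin_width=1.0):
    """Keep peaks found in more than half of the spectra.

    Returns {mz: (mean abundance, variance, number of spectra with peak)}.
    Assumption: every spectrum could have generated every peak, so the
    S/N eligibility test described in the text is not applied.
    """
    binned = [normalize(spectrum, bin_width) for spectrum in spectra]
    binned = [b for b in binned if b]
    n_spectra = len(binned)
    consensus = {}
    for key in set().union(*binned):
        values = [b[key] for b in binned if key in b]
        # "½ or fewer" is rejected, so require a strict majority.
        if 2 * len(values) > n_spectra:
            consensus[key * bin_width] = (
                mean(values),
                pvariance(values),
                len(values),
            )
    return consensus
```

With three replicate spectra, a peak seen in two of them is kept and a peak seen in only one is dropped; the reported variance gives a per-peak measure of the variability noted above.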

Sample Application: Mycobacterium Smegmatis

[data from the Open Proteomics Database/R.Wang, 27 sets of experiments]

- Created library of 2739 Peptide Ion Consensus Spectra

- 95% of identified spectra matched peptides identified by other spectra

In one series of LC-MS/MS experiments:

1551 Spectra were identified by sequence search engine

1527 Of the above spectra were re-matched by consensus library searching

1067 Spectra not identified by sequence search were identified by library search

948 Different peptides were originally identified by sequence search engine

924 Peptides were re-matched

332 Peptides not identified by sequence search in this series were identified

24 Peptides not re-matched were rich in false positives

983 Consensus spectra of spectra unmatched by library search were derived
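The counts above can be turned into rates with a few lines of Python. This is only arithmetic on the reported numbers; the variable names are hypothetical.

```python
# Counts reported for one series of LC-MS/MS experiments.
spectra_by_sequence_search = 1551
spectra_rematched = 1527
spectra_new_by_library = 1067
peptides_by_sequence_search = 948
peptides_rematched = 924
peptides_new_by_library = 332

# Share of sequence-search results confirmed by the consensus library.
spectrum_rematch_rate = spectra_rematched / spectra_by_sequence_search
peptide_rematch_rate = peptides_rematched / peptides_by_sequence_search

# Additional identifications from library search, relative to sequence search.
spectrum_gain = spectra_new_by_library / spectra_by_sequence_search
peptide_gain = peptides_new_by_library / peptides_by_sequence_search

# Peptides found by sequence search but not re-matched by the library.
peptides_not_rematched = peptides_by_sequence_search - peptides_rematched

print(f"spectra re-matched:  {spectrum_rematch_rate:.1%}")
print(f"peptides re-matched: {peptide_rematch_rate:.1%}")
print(f"extra spectra from library search:  {spectrum_gain:.1%}")
print(f"extra peptides from library search: {peptide_gain:.1%}")
print(f"peptides not re-matched: {peptides_not_rematched}")
```

About 98.5% of spectra and 97.5% of peptides were re-matched, and library search added roughly 69% more identified spectra and 35% more peptides than sequence search alone in this series. The difference of 24 peptides agrees with the count of peptides not re-matched.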

Please Recycle Your Spectra!

- Your MS/MS Data Files contain valuable, reusable information that may be very helpful to others (and maybe even you) after recycling.

- Please let us know if we may recycle them ([email protected] / 301-975-2505). Raw data files are fine, with or without identifications.

- Sources of original spectra are cited as part of consensus spectrum annotation (if you wish).

- We also encourage submission to on-line spectral repositories (www.peptideatlas.org, www.thegpm.org, bioinformatics.icmb.utexas.edu/OPD/)

Library Construction Flow Diagram

Spectrum Matching Algorithms

Algorithms: Measures of spectrum matching have been adapted from algorithms used for electron ionization spectra. Peaks are weighted by their significance:

- Reduce significance of common impurity ions (e.g., neutral loss from parent ion)

- Adjust Y/B weighting for instrument and sequence

- Reduce weight for uncertain and isotopic peaks

- Use confidence of library spectrum

Speed: Straightforward indexing leads to very fast identification (much less than a second) even for very large libraries.
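One way to picture the indexing and weighting ideas is the Python sketch below. It assumes, hypothetically, that the library is indexed by precursor m/z buckets and that peak significance is expressed as a per-peak weight function. The class name, the 2.0 m/z precursor tolerance, and the 0.7 score cutoff are illustrative assumptions, not the algorithm or parameters actually used.

```python
import math
from collections import defaultdict


def to_vector(peaks, weight=None):
    """Bin (mz, abundance) peaks to unit m/z and normalize to unit length.

    weight is an optional function mz -> factor (hypothetical hook for
    down-weighting impurity, uncertain, or isotopic peaks).
    """
    vector = {}
    for mz, abundance in peaks:
        factor = weight(mz) if weight else 1.0
        key = round(mz)
        vector[key] = vector.get(key, 0.0) + factor * abundance
    norm = math.sqrt(sum(v * v for v in vector.values()))
    if norm == 0:
        return {}
    return {key: value / norm for key, value in vector.items()}


class SpectralLibrary:
    """Toy library indexed by precursor m/z bucket (assumed design)."""

    def __init__(self, precursor_tol=2.0, weight=None):
        self.precursor_tol = precursor_tol
        self.weight = weight
        self.index = defaultdict(list)

    def _bucket(self, precursor_mz):
        return int(precursor_mz // self.precursor_tol)

    def add(self, name, precursor_mz, peaks):
        entry = (name, precursor_mz, to_vector(peaks, self.weight))
        self.index[self._bucket(precursor_mz)].append(entry)

    def search(self, precursor_mz, peaks, min_score=0.7):
        """Score only entries in the three buckets around the query."""
        query = to_vector(peaks, self.weight)
        center = self._bucket(precursor_mz)
        hits = []
        for bucket in (center - 1, center, center + 1):
            for name, lib_mz, vector in self.index.get(bucket, ()):
                if abs(lib_mz - precursor_mz) > self.precursor_tol:
                    continue
                score = sum(v * vector.get(k, 0.0) for k, v in query.items())
                if score >= min_score:
                    hits.append((score, name))
        return sorted(hits, reverse=True)
```

Because each query touches only a few buckets, the number of dot products computed depends on how many library spectra share the precursor window rather than on the total library size.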

[Figure: Spectrum 1 and Spectrum 2 combined into a Consensus Spectrum]

A ‘consensus spectrum’ is composed of peaks present in the majority of spectra in which the peak could have been generated.

Replicate spectra at moderate to low S/N commonly show spurious peaks from neutral losses by impurity ions of uncertain origin as well as seemingly random noise spikes.

The similarity of a pair of spectra originating from the same peptide ion is measured by their ‘dot product’, where spectra are expressed as normalized vectors. Here, S/N is measured as the ratio of maximum to median abundance in a typical spectrum (this assumes most peaks are noise).
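Both quantities are simple to compute. The Python sketch below is a minimal illustration that assumes peaks are given as (m/z, abundance) pairs and are aligned by rounding into unit-width m/z bins; the binning scheme and the function names are assumptions made here, not details given in the text.

```python
import math
from statistics import median


def bin_spectrum(peaks, bin_width=1.0):
    """Sum abundances into m/z bins; peaks is a list of (mz, abundance)."""
    binned = {}
    for mz, abundance in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + abundance
    return binned


def dot_product(peaks_a, peaks_b, bin_width=1.0):
    """Cosine of the angle between two spectra treated as vectors."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    shared = sum(a[key] * b[key] for key in a.keys() & b.keys())
    return shared / (norm_a * norm_b)


def signal_to_noise(peaks):
    """Maximum-to-median abundance ratio (assumes most peaks are noise)."""
    abundances = [abundance for _, abundance in peaks if abundance > 0]
    if not abundances:
        return 0.0
    return max(abundances) / median(abundances)
```

Identical spectra give a dot product of 1, spectra with no peaks in common give 0, and a spectrum with abundances 1, 2 and 100 has an S/N of 50 by this measure.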

Above S/N of 40, spectra are quite reproducible (0.7 or better median dot product).

Distribution of Number of Identifications Per Identified Peptide for D. Radiodurans (spectra from NCRR, http://ncrr.pnl.gov/data/) and M. Smegmatis (Open Proteomics Database)

Pairs of spectra identified as originating from a given peptide ion have been compared to each other – variations in abundance have been found to depend primarily on the level of signal/noise in the measured spectra.

Organism          Different Ions    Source Spectra
Human                     14,698            84,288
D. Radiodurans             6,888           115,972
Yeast                      5,879            43,344
M. Smegmatis               2,739            30,718
E. Coli                    1,259             5,861
Mixed (GPM)               66,882           587,199

These plots use D. Radiodurans spectra from NCRR; false identifications were obtained by searching Human sequences. True and false identifications had the same average sequence ‘score’.

Correct identifications have lower fractions of unassigned abundance than incorrect identifications

True positive spectra match theoretical spectra better than false positive spectra.

These plots illustrate quantitative relations for factors a) and b) that can aid in the separation of true and false positive identifications

In two well-studied cases, about 5% of identifications were made by only one spectrum.

Roughly 10% of identifications were made more than 100 times

Approximately 1/3 of peptides identified were identified only once (not shown). These will be separately processed, requiring higher scores for acceptance in library.

Consensus Spectra Derived
