building and using libraries of peptide ion fragmentation spectra s.e. stein, l.e. kilpatrick, m....

1
Building and Using Libraries of Peptide Ion Fragmentation Spectra S.E. Stein, L.E. Kilpatrick, M. Mautner, P. Neta, J. Roth National Institute of Standards and Technology, Gaithersburg, MD/Charleston, SC Overview: MS/MS spectra that serve to identify peptides by sequence/spectrum matching can be of value for more reliably identifying those peptides in later studies. Our work involves the development of methods for refining and reusing information contained in these spectra to create mass spectral reference libraries, not dissimilar from those routinely employed in other fields of mass spectrometry. Method: Spectra that identify peptides by current sequence/spectrum matching methods are first added to an archive along with associated information. For each identified peptide ion that has been identified more than once, an annotated ‘consensus spectrum’ is derived from the spectra used to make those identifications. Consensus spectra are then subjected to spectrum/sequence consistency tests to assign a measure of identification reliability and remove false positive identifications. This employs sequence/spectrum correlations reported in the literature and found in our work as well as other quantities found to be correlated with identification reliability. Examples are the similarity of original spectra, unassigned abundance in the consensus spectrum and scores of the original sequence/spectrum match. Related consensus spectra are intercompared to find further errors and finally combined to form an annotated, searchable reference library. Building The Library 1) Extract spectra and analysis information for reliable peptide identifications from spectrum/sequence matches. 2) Create a ‘consensus spectrum’ from all spectra assigned to a single peptide ion. 3) For each consensus spectrum, perform spectrum/sequence consistency check. Also examine reliable, single hit identifications. 4) Assign reliability measures to each spectrum and build library. Ways of Using the Library Identification Confirmation (post-processing) Confirm/reject peptides identified by sequence search programs Find peptides (pre-processing) Search all spectra against library Then search unmatched peptides using sequence search and other methods Create library of ‘unidentified spectra’ from their consensus spectra Compare to identified peptides, use de novo methods, find biomarkers, … Target identification Internal standards, biomarkers, target proteins Transmit peptide analysis information Difficult-to-identify, unusual, manually identified, special meaning Spectrum Variability: Effective peptide identification by matching spectra in a reference library requires that spectra are reproducible. The degree of variability has been measured using sets of spectra identifying the same precursor ion. Variations typical of ion trap spectra are shown below. Most spectra identified by sequence search methods have dot products above 0.7. 0 20 40 60 80 100 120 140 160 180 0.2 0.4 0.6 0.8 1 DotProduct Max/Median 0 200 400 600 800 1000 1200 1400 1600 0.2 0.4 0.6 0.8 1 D ot P roduct #Spectra Next Steps Enhance QA/QC Goal: Distinguish unexpected fragmentations from misidentifications Add spectra from reliable singly identified peptides to Library Build comprehensive Libraries Test and optimize spectrum matching algorithms Provide annotated libraries for mass spectrometry data systems Spectrum/Sequence Consistency All Data for an Identified Peptide Ion Annotated Reference Spectrum Reference Archive Reference Library Spectra Sequence Confirmatory Information Consensus Spectrum Extract Common Features Fragmentation Rules Analysis LC-MS/MS Results 1)Source Spectra a) Online repositories - PeptideAtlas - the Global Proteome Machine - Open Proteomics Database - NCRR Proteomics Resource b) Collaborators/Contributors - NIH/LNT (Markey/Geer/Kowalik…) - Institute of Systems Biology (Nesvizshskii/King/Aeberso ld/…) - theGPM (Beavis) - Blueprint Initiative (Hogue) c) NIST Measurements Single Protein Digests/IT, 3Q 3) Spectrum/Sequence Consistency Find likelihood that a consensus spectrum originated from assigned sequence. Each factor serves to refine probability. Factors identified as discriminating (a and b are illustrated below): a) Match with theoretical spectrum based on individual amino acid fragmentation behavior b) Fraction of unassigned abundance abundance for peaks not originating from a known fragmentation path c) Y/B ion correlations and ratio of Y/B ion abundance sums d) Unexpected major peaks formally consistent with rules include tests for reasonableness of neutral losses e) Amino acid specific rules (Proline, Glycine, Aspartic Acid, …) f) Predicted/observed fragment ion charge state abundance ratios 4) Overall Identification Confidence [under development] Influential factors employed Spectrum/Sequence Consistency (described above) Original Score (degree of sequence match) Peptide sequence re-identifications Number of different spectra, experiments, modifications, charge states Number of peptides per protein Occurrence Distribution: Most peptide identifications are made multiple times. 2) Build Consensus Spectra Reject noise and spurious signals/Measure and report variability Identify matching m/z peaks in input spectra. Find and exclude outlier spectra (use dot product of each pair) when multiple sources are available , limit the number of spectra from a single source Omit peaks occurring in ½ or fewer of spectra consider only spectra with sufficiently high S/N to have generated the peak of interest Compute average abundance, report variance Create annotated spectrum include information such as spectrum origin, retention, median dot product using all peaks and only consensus peaks, fraction of abundance not at consensus m/z, … Sample Application: Mycobacterium Smegmatis [data from the Open Proteomics Database/R.Wang, 27 sets of experiments] - Created library of 2739 Peptide Ion Consensus Spectra - 95% of identified spectra matched peptides identified by other spectra In one series of LC-MS/MS experiments: 1551 Spectra were identified by sequence search engine 1527 Of the above spectra were re-matched by consensus library searching 1067 Spectra not identified by sequence search were identified by library search 948 Different peptides were originally identified by sequence search engine 924 Peptides were re-matched 332 Peptides not identified by sequence search in this series were identified 24 Peptides not re-matched were rich in false positives 983 Consensus spectra of spectra unmatched by library search were derived Please Recycle Your Spectra! - Your MS/MS Data Files contain valuable, reusable information that may be very helpful to others (and maybe even you) after recycling. - Please let us know if we may recycle them ([email protected] / 301-975-2505). Raw data files are fine, with or without identifications. - Sources of original spectra are cited as part of consensus spectrum annotation (if you wish). - We also encourage submission to on-line spectral repositories (www.peptideatlas.org, www.thegpm.org, bioinformatics.icmb.utexas.edu/OPD/) Library Construction Flow Diagram Spectrum Matching Algorithms Algorithms: Measures of spectrum matching have been adapted from algorithms used for electron ionization spectra. Peaks are weighted by their significance: - Reduce significance of common impurity ions (e.g., neutral loss from parent ion) - Adjust Y/B weighting for instrument and sequence - Reduce weight for uncertain and isotopic peaks - Use confidence of library spectrum Speed: Straightforward indexing leads to very fast identification (<< sec) even for very large libraries. Spectrum 1 Spectrum 2 Consensus Spectrum A ‘consensus spectrum’ is composed of peaks present in the majority of spectra in which the peak could have been generated Replicate spectra at moderate to low S/N commonly show spurious peaks from neutral losses by impurity ions of uncertain origin as well as seemingly random noise spikes. Similarity of pairs of spectra originating from the same peptide ion are measured by their ‘dot product’, where spectra are expressed as normalized vectors. Here, S/N is measured as the maximum to median abundance in a typical spectrum (assumes most peaks are noise). Above S/N of 40, spectra are quite reproducible (0.7 or better median dot product). Distribution of Number of Identifications Per Identified Peptide for D. Radiodurans (spectra from NCRR, http://ncrr.pnl.gov/data/) and M. Smegmatis (Open Proteomics Database) Pairs of spectra identified as originating from a given peptide ion have been compared to each other variations in abundance have been found to depend primarily on the level of signal/noise in the measured spectra. Organism Differe nt Ions Source Spectra Human 14,698 84,288 D. Radiodurans 6,888 115,972 Yeast 5,879 43,344 M. Smegmatis 2,739 30,718 E. Coli 1,259 5,861 Mixed (GPM) 66,882 587,199 These plots use D. Radiodurans spectra from NCRR, false ids obtained by searching Human sequences. True and false identifications had the same average sequence ‘score’. Correct identifications have lower fractions of unassigned abundance than incorrect identifications True positive spectra match theoretical spectra better than false positive spectra. These plots illustrate quantitative relations for factors a) and b) that can aid in the separation of true and false positive identifications In two well- studied cases, about 5% of identification s were made by only one spectrum. Roughly 10% of identificati ons were made more than 100 times Approximately 1/3 of peptides identified were identified only once (not shown). These will be separately processed, requiring higher scores for acceptance in library. Consensus Spectra Derived

Post on 21-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Building and Using Libraries of Peptide Ion Fragmentation Spectra S.E. Stein, L.E. Kilpatrick, M. Mautner, P. Neta, J. Roth National Institute of Standards

Building and Using Libraries of Peptide Ion Fragmentation SpectraS.E. Stein, L.E. Kilpatrick, M. Mautner, P. Neta, J. Roth

National Institute of Standards and Technology, Gaithersburg, MD/Charleston, SC

Overview: MS/MS spectra that serve to identify peptides by sequence/spectrum matching can be of value for more reliably identifying those peptides in later studies. Our work involves the development of methods for refining and reusing information contained in these spectra to create mass spectral reference libraries, not dissimilar from those routinely employed in other fields of mass spectrometry.

Method: Spectra that identify peptides by current sequence/spectrum matching methods are first added to an archive along with associated information. For each identified peptide ion that has been identified more than once, an annotated ‘consensus spectrum’ is derived from the spectra used to make those identifications. Consensus spectra are then subjected to spectrum/sequence consistency tests to assign a measure of identification reliability and remove false positive identifications. This employs sequence/spectrum correlations reported in the literature and found in our work as well as other quantities found to be correlated with identification reliability. Examples are the similarity of original spectra, unassigned abundance in the consensus spectrum and scores of the original sequence/spectrum match. Related consensus spectra are intercompared to find further errors and finally combined to form an annotated, searchable reference library.

Building The Library

1) Extract spectra and analysis information for reliable peptide identifications from spectrum/sequence matches.

2) Create a ‘consensus spectrum’ from all spectra assigned to a single peptide ion.

3) For each consensus spectrum, perform spectrum/sequence consistency check. Also examine reliable, single hit identifications.

4) Assign reliability measures to each spectrum and build library.

Ways of Using the Library

Identification Confirmation (post-processing)Confirm/reject peptides identified by sequence search programs

Find peptides (pre-processing) Search all spectra against libraryThen search unmatched peptides using sequence search and other methods

Create library of ‘unidentified spectra’ from their consensus spectraCompare to identified peptides, use de novo methods, find biomarkers, …

Target identificationInternal standards, biomarkers, target proteins

Transmit peptide analysis informationDifficult-to-identify, unusual, manually identified, special meaning

Spectrum Variability: Effective peptide identification by matching spectra in a reference library requires that spectra are reproducible. The degree of variability has been measured using sets of spectra identifying the same precursor ion. Variations typical of ion trap spectra are shown below.

Most spectra identified by sequence search methods have dot products above 0.7.

0

20

40

60

80

100

120

140

160

180

0.2 0.4 0.6 0.8 1

Dot Product

Max

/Med

ian

0

200

400

600

800

1000

1200

1400

1600

0.2 0.4 0.6 0.8 1

Dot Product

#Spe

ctra

Next Steps

Enhance QA/QC Goal: Distinguish unexpected fragmentations from misidentifications

Add spectra from reliable singly identified peptides to Library

Build comprehensive Libraries

Test and optimize spectrum matching algorithms

Provide annotated libraries for mass spectrometry data systems

Spectrum/SequenceConsistency

All Data for an Identified Peptide Ion

Annotated ReferenceSpectrum

ReferenceArchive

ReferenceLibrary

Spectra Sequence ConfirmatoryInformation

Consensus Spectrum

Extract Common Features

FragmentationRules

Analysis

LC-MS/MS Results

1) Source Spectraa) Online repositories

- PeptideAtlas- the Global Proteome Machine - Open Proteomics Database- NCRR Proteomics Resource

b) Collaborators/Contributors- NIH/LNT (Markey/Geer/Kowalik…)- Institute of Systems Biology (Nesvizshskii/King/Aebersold/…) - theGPM (Beavis)- Blueprint Initiative (Hogue)

c) NIST MeasurementsSingle Protein Digests/IT, 3Q

3) Spectrum/Sequence ConsistencyFind likelihood that a consensus spectrum originated from assigned sequence.Each factor serves to refine probability.

Factors identified as discriminating (a and b are illustrated below):a) Match with theoretical spectrum

based on individual amino acid fragmentation behaviorb) Fraction of unassigned abundance

abundance for peaks not originating from a known fragmentation pathc) Y/B ion correlations and ratio of Y/B ion abundance sumsd) Unexpected major peaks formally consistent with rules

include tests for reasonableness of neutral lossese) Amino acid specific rules (Proline, Glycine, Aspartic Acid, …)f) Predicted/observed fragment ion charge state abundance ratios

4) Overall Identification Confidence [under development]

Influential factors employed Spectrum/Sequence Consistency (described above)

Original Score (degree of sequence match)

Peptide sequence re-identificationsNumber of different spectra, experiments, modifications, charge states

Number of peptides per protein

Occurrence Distribution: Most peptide identifications are made multiple times.

2) Build Consensus SpectraReject noise and spurious signals/Measure and report variability

Identify matching m/z peaks in input spectra.

Find and exclude outlier spectra (use dot product of each pair) when multiple sources are available , limit the number of spectra from a single source

Omit peaks occurring in ½ or fewer of spectra consider only spectra with sufficiently high S/N to have generated the peak of interest

Compute average abundance, report variance

Create annotated spectruminclude information such as spectrum origin, retention, median dot product using all peaks and only consensus peaks, fraction of abundance not at consensus m/z, …

Sample Application: Mycobacterium Smegmatis

[data from the Open Proteomics Database/R.Wang, 27 sets of experiments]

- Created library of 2739 Peptide Ion Consensus Spectra

- 95% of identified spectra matched peptides identified by other spectra

In one series of LC-MS/MS experiments:

1551 Spectra were identified by sequence search engine

1527 Of the above spectra were re-matched by consensus library searching

1067 Spectra not identified by sequence search were identified by library search

948 Different peptides were originally identified by sequence search engine

924 Peptides were re-matched

332 Peptides not identified by sequence search in this series were identified

24 Peptides not re-matched were rich in false positives

983 Consensus spectra of spectra unmatched by library search were derived

Please Recycle Your Spectra!- Your MS/MS Data Files contain valuable, reusable information that may be very helpful to others (and maybe even you) after recycling.

- Please let us know if we may recycle them ([email protected] / 301-975-2505). Raw data files are fine, with or without identifications.

- Sources of original spectra are cited as part of consensus spectrum annotation (if you wish).

- We also encourage submission to on-line spectral repositories (www.peptideatlas.org, www.thegpm.org, bioinformatics.icmb.utexas.edu/OPD/)

Library Construction Flow Diagram

Spectrum Matching Algorithms

Algorithms: Measures of spectrum matching have been adapted from algorithms used for electron ionization spectra. Peaks are weighted by their significance:

- Reduce significance of common impurity ions (e.g., neutral loss from parent ion)

- Adjust Y/B weighting for instrument and sequence

- Reduce weight for uncertain and isotopic peaks

- Use confidence of library spectrum

Speed: Straightforward indexing leads to very fast identification (<< sec) even for very large libraries.

Spectrum 1

Spectrum 2

ConsensusSpectrum

A ‘consensusspectrum’ is composed of peaks present in the majority of spectra in which the peak could have been generated

Replicate spectra at moderate to low S/N commonly show spurious peaks from neutral losses by impurity ions of uncertain origin as well as seemingly random noise spikes.

Similarity of pairs of spectra originating from the same peptide ion are measured by their ‘dot product’, where spectra are expressed as normalized vectors. Here, S/N is measured as the maximum to median

abundance in a typical spectrum (assumes most peaks are noise).

Above S/N of 40, spectra are quite reproducible (0.7 or better median dot product).

Distribution of Number of Identifications Per Identified Peptide for D. Radiodurans (spectra from NCRR, http://ncrr.pnl.gov/data/) and M. Smegmatis

(Open Proteomics Database)

Pairs of spectra identified as originating from a given peptide ion have been compared to each other – variations in abundance have been found to depend primarily on the level of signal/noise in the measured spectra.

Organism Different Ions

Source Spectra

Human 14,698 84,288

D. Radiodurans 6,888 115,972

Yeast 5,879 43,344

M. Smegmatis 2,739 30,718

E. Coli 1,259 5,861

Mixed (GPM) 66,882 587,199

These plots use D. Radiodurans spectra from NCRR, false ids obtained by searching Human sequences. True and false identifications had the same average sequence ‘score’.

Correct identifications have lower fractions of unassigned abundance than incorrect identifications

True positive spectra match theoretical spectra better than false positive spectra.

These plots illustrate quantitative relations for factors a) and b) that can aid in the separation of true and false positive identifications

In two well-studied cases, about 5% of identifications were made by only one spectrum.

Roughly 10% of identifications were made more than 100 times

Approximately 1/3 of peptides identified were identified only once (not shown). These will be separately processed, requiring higher scores for acceptance in library.

Consensus Spectra Derived