Meta-Search and Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center


Post on 17-Jan-2016


Page 1: Meta-Search and Result Combining

Meta-Search and Result Combining

Nathan Edwards

Department of Biochemistry and Molecular & Cellular Biology

Georgetown University Medical Center

Page 2: Meta-Search and Result Combining

Peptide Identifications

Search engines provide an answer for every spectrum...
  Can we figure out which ones to believe?

Why is this hard?
  Hard to determine “good” scores
  Significance estimates are unreliable
  Need more ids from weak spectra
  Each search engine has its strengths ...
  ... and weaknesses
  Search engines give different answers


Mascot Search Results


Translation start-site correction

Halobacterium sp. NRC-1
  Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins
  Goo, et al. MCP 2003.

GdhA1 gene: Glutamate dehydrogenase A1
  Multiple significant peptide identifications
  Observed start is consistent with Glimmer 3.0 prediction(s)


Halobacterium sp. NRC-1
ORF: GdhA1

K-score E-value vs PepArML @ 10% FDR
Many peptides inconsistent with annotated translation start site of NP_279651

[Figure: horizontal axis from 0 to 440]


Translation start-site correction


Search engine scores are inconsistent!

[Figure: axes Mascot and Tandem]


Common Algorithmic Framework – Different Results

Pre-process experimental spectra
  Charge state, cleaning, binning
Filter peptide candidates
  Decide which PSMs to evaluate
Score peptide-spectrum match
  Fragmentation modeling, dot product
Rank peptides per spectrum
  Retain statistics per spectrum
Estimate E-values
  Apply empirical or theoretical model
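The binning, scoring, and ranking steps can be sketched in a few lines. This is a minimal illustration, not how any particular search engine scores: the function names, the 1.0 m/z bin width, and the use of plain cosine similarity are all assumptions made for the example, and the theoretical peak lists are taken as given rather than produced by a fragmentation model.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin index -> intensity)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return dict(binned)

def normalized_dot_product(observed, theoretical, bin_width=1.0):
    """Cosine similarity of two binned spectra; 0.0 if either one is empty."""
    a = bin_spectrum(observed, bin_width)
    b = bin_spectrum(theoretical, bin_width)
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_candidates(observed, candidates, bin_width=1.0):
    """Score every candidate (name -> theoretical peaks) and sort best-first."""
    scored = [
        (normalized_dot_product(observed, peaks, bin_width), name)
        for name, peaks in candidates.items()
    ]
    return sorted(scored, reverse=True)
```

Changing only the bin width or the intensity normalization already changes which candidate ranks first for borderline spectra, which is one way a common framework ends up with different results.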


Comparison of search engines

No single score is comprehensive

Search engines disagree

Many spectra lack confident peptide assignment

[Figure: Venn diagram of OMSSA, X!Tandem, and Mascot identifications; regions labeled 4%, 10%, 2%, 5%, 9%, 69%, 2%]


Simple approaches (Union)

Different search engines confidently identify different spectra:
  Due to search space, spectral processing, scoring, significance estimation
Filter each search engine's results and union
  Union of results must be more complete
  But how to estimate significance for the union?
  What if the results for same spectra disagree?
Need to compensate for reduced specificity
  How much?
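A small sketch of why the union loses specificity, assuming each engine's filtered output is available as a mapping from spectrum id to (peptide, is_decoy). The data layout and function names are hypothetical, and the decoys/targets ratio is only the simplest decoy-based estimate.

```python
def union_of_ids(*engine_results):
    """Union of (spectrum, peptide) ids across engines.

    Each argument maps spectrum id -> (peptide, is_decoy).
    Returns a dict keyed by (spectrum, peptide) with the decoy flag as value.
    """
    union = {}
    for results in engine_results:
        for spectrum, (peptide, is_decoy) in results.items():
            union[(spectrum, peptide)] = is_decoy
    return union

def decoy_fdr(ids):
    """Estimate FDR as (#decoy ids) / (#target ids)."""
    decoys = sum(1 for is_decoy in ids.values() if is_decoy)
    targets = len(ids) - decoys
    return decoys / targets if targets else 0.0

def conflicting_spectra(union):
    """Spectra assigned different peptides by different engines."""
    peptides_by_spectrum = {}
    for spectrum, peptide in union:
        peptides_by_spectrum.setdefault(spectrum, set()).add(peptide)
    return sorted(s for s, peps in peptides_by_spectrum.items() if len(peps) > 1)
```

Correct identifications tend to be shared between engines while false ones tend not to be, so the union collects roughly the sum of the errors but fewer than the sum of the correct ids. Its estimated FDR can therefore exceed the FDR each engine was filtered at.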


Union of filtered peptide ids

[Figure: axes Mascot and Tandem]


Simple approaches (Intersection)

Different search engines agree on many spectra
  Agreement is unexpected given differences
Filter each search engine's results and take the intersection
  Intersection of results must be more significant
  But how to estimate significance for the intersection?
  What about the borderline spectra?
Need to compensate for reduced sensitivity
  How and how much?


Intersection of filtered peptide ids

[Figure: axes Mascot and Tandem]


Combine / Merge Results

Threshold peptide-spectrum matches from each of two search engines
  PSMs agree → boost specificity
  PSMs from one → boost sensitivity
  PSMs disagree → ?????
Sometimes agreement is "lost" due to threshold...
  How much should agreement increase our confidence?
Scores easy to "understand"
  Difficult to establish statistical significance
How to generalize to more engines?
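The three cases, plus the "lost" agreement, can be made concrete with a short sketch. It assumes each engine's thresholded PSMs are a mapping from spectrum id to peptide, and that the unthresholded top hits of the second engine are also at hand. The names and data layout are illustrative only.

```python
def classify_spectra(filtered_a, filtered_b):
    """Split spectra into agree / one_engine / disagree.

    Inputs map spectrum id -> peptide for PSMs that passed each engine's threshold.
    """
    groups = {"agree": [], "one_engine": [], "disagree": []}
    for spectrum in sorted(set(filtered_a) | set(filtered_b)):
        pep_a = filtered_a.get(spectrum)
        pep_b = filtered_b.get(spectrum)
        if pep_a is None or pep_b is None:
            groups["one_engine"].append(spectrum)
        elif pep_a == pep_b:
            groups["agree"].append(spectrum)
        else:
            groups["disagree"].append(spectrum)
    return groups

def lost_agreement(filtered_a, filtered_b, unfiltered_b):
    """Spectra that pass only engine A's threshold although engine B's
    sub-threshold top hit is the same peptide."""
    return sorted(
        spectrum
        for spectrum, peptide in filtered_a.items()
        if spectrum not in filtered_b and unfiltered_b.get(spectrum) == peptide
    )
```

Spectra returned by `lost_agreement` look like single-engine ids after thresholding, yet both engines picked the same peptide. That is evidence a threshold-then-merge scheme throws away.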


Consensus and Multi-Search

Multiple witnesses increase confidence
  As long as they are independent
  Example: Getting the story straight
Independent "random" hits unlikely to agree
  Agreement is indication of biased sampling
  Example: loaded dice
Meta-search is relatively easy
  Merging and re-ranking is hard
  Example: Booking a flight to Boston!
Scores and E-values are not comparable
  How to choose the best answer?
  Example: Best E-value favors Tandem!
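A back-of-envelope version of the "independent random hits" argument, under a simplifying assumption that is not in the slides: each engine's incorrect top hit is drawn independently and uniformly from the same N candidate peptides.

```latex
P(\text{agree} \mid \text{both incorrect})
  = \sum_{k=1}^{N} \frac{1}{N} \cdot \frac{1}{N}
  = \frac{1}{N}
```

With N = 10^4 candidates this is 10^{-4}, so chance agreement on a wrong peptide would be rare. Real engines share candidate filters and scoring biases, so their errors are correlated and the true rate is higher, which is the "as long as they are independent" caveat.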


Search for Consensus

Running many search engines is hard!
Identifications must have every opportunity to agree:
  No failed searches, matched search parameters, sequence databases, spectra
But the search engines all use:
  Varying spectral file formats, different parameter specifications for mass tolerance, modifications, pre-processing for sequence databases, different charge-state handling, termini rules
Decoy searches must also use identical parameters


Searching for Consensus

Initial methionine loss as tryptic peptide?
Missing charge state handling?
X!Tandem's refinement mode
Pyro-Gln, Pyro-Glu modifications?
Precursor mass tolerance (Da vs ppm)
Semi-tryptic only (no fully-tryptic mode).
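The Da vs ppm item is easy to quantify. The sketch below uses hypothetical helper names and only the standard definition of parts per million. It shows that a ppm window and a fixed Da window cannot be matched across the whole mass range.

```python
def ppm_to_da(tolerance_ppm, mass):
    """Absolute tolerance in Da equivalent to a ppm tolerance at the given mass."""
    return mass * tolerance_ppm / 1e6

def da_to_ppm(tolerance_da, mass):
    """Relative tolerance in ppm equivalent to a Da tolerance at the given mass."""
    return tolerance_da / mass * 1e6

def within_tolerance(observed, theoretical, tolerance, unit):
    """True if the precursor mass error is inside the window; unit is "ppm" or "Da"."""
    if unit == "ppm":
        limit = ppm_to_da(tolerance, theoretical)
    elif unit == "Da":
        limit = tolerance
    else:
        raise ValueError(f"unknown unit: {unit}")
    return abs(observed - theoretical) <= limit
```

At 1000 Da, 10 ppm is 0.01 Da; at 3000 Da it is 0.03 Da. A fixed 0.02 Da window is therefore looser than 10 ppm for light precursors and tighter for heavy ones, so two engines configured "the same" can still evaluate different candidate sets.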


Configuring for Consensus

Search engine configuration can be difficult:
  Correct spectral format
  Search parameter files and command-line
  Pre-processed sequence databases.
Must strive to ensure that each search engine is presented with the same search criteria, despite different formats, syntax, and quirks.
Search engine configuration must be automated.
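One way to automate this is to keep a single engine-neutral specification and render it into each engine's own dialect. The sketch below is purely illustrative: `engine_a` and `engine_b` are made-up engines, and every key name is hypothetical rather than a real parameter of Mascot, X!Tandem, OMSSA, or any other tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchSpec:
    """Engine-neutral search criteria (hypothetical structure)."""
    precursor_tol_ppm: float
    missed_cleavages: int
    fixed_mods: tuple
    variable_mods: tuple

def render_engine_a(spec):
    """Made-up engine A: accepts ppm directly and comma-separated mod lists."""
    return {
        "precursor_tolerance": spec.precursor_tol_ppm,
        "precursor_unit": "ppm",
        "max_missed": spec.missed_cleavages,
        "fixed": ",".join(spec.fixed_mods),
        "variable": ",".join(spec.variable_mods),
    }

def render_engine_b(spec, reference_mass=1500.0):
    """Made-up engine B: only accepts Da, so ppm is converted at a reference mass."""
    return {
        "PrecursorTolDa": round(spec.precursor_tol_ppm * reference_mass / 1e6, 6),
        "MissedCleavages": spec.missed_cleavages,
        "Mods": [("fixed", m) for m in spec.fixed_mods]
                + [("variable", m) for m in spec.variable_mods],
    }

# Registry of renderers; adding an engine means adding one function here.
RENDERERS = {"engine_a": render_engine_a, "engine_b": render_engine_b}

def render_all(spec):
    """Produce one configuration per engine from a single specification."""
    return {name: render(spec) for name, render in RENDERERS.items()}
```

Because every configuration is derived from the same `SearchSpec`, a change to the search criteria cannot reach one engine and miss another. Each renderer is also the natural place to record an engine's quirks, such as the lossy ppm to Da conversion above.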


Results Extraction for Consensus

Must be able to unambiguously extract peptide identifications from results
  Spectrum identifiers / scan numbers
  Modification identifiers
  Protein accessions
How should we handle E-values vs. probabilities vs. FDR (partitioned)?
  Cannot rely on these to be comparable
  Must use consistent, external significance calibration


Search Engine Independent FDR Estimation

Comparing search engines is difficult due to different FDR estimation techniques
  Implicit assumption: Spectra scores can be thresholded
Competitive vs Global
  Competitive controls some spectral variation
Reversed vs Shuffled Decoy Sequence
  Reversed models target redundancy accurately
Charge-state partition or Unified
  Mitigates effect of peptide length dependent scores
  What about peptide property partitions?
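A minimal sketch of decoy-based FDR that uses only the ranking of an engine's scores, which is what makes it engine independent. It assumes a competitive search (one best hit per spectrum against targets plus decoys), scores where higher is better, and the simple decoys/targets estimator. The function names are hypothetical, and a charge-state partition would just mean calling `q_values` once per partition.

```python
def q_values(psms):
    """psms: list of (score, is_decoy), higher score is better.

    Returns a list of (score, is_decoy, q_value) sorted best-first.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    fdrs = []
    decoys = targets = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # FDR estimate if the threshold were placed at this PSM.
        fdrs.append(decoys / max(targets, 1))
    # q-value: the smallest FDR at this threshold or any looser one.
    q = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q.append(running_min)
    q.reverse()
    return [(score, is_decoy, qv) for (score, is_decoy), qv in zip(ranked, q)]

def accept_at_fdr(psms, max_fdr):
    """Scores of target PSMs whose q-value is at most max_fdr."""
    return [score for score, is_decoy, qv in q_values(psms) if not is_decoy and qv <= max_fdr]
```

E-values would first be negated or log-transformed so that higher is better. After that, every engine's output is calibrated by the same external procedure instead of by its own significance model.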


Search Execution for Consensus

Running many search engines takes time
  7 x 3 searches of the same spectra!
Some search engines require licenses or specific operating systems
How to use grid/cloud computing effectively?
  Cannot assume a shared file-system
  Search engines may crash or be preempted
  Machine may "disappear"
  Machine may consistently fail searches
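The failure modes listed here suggest a scheduler that retries failed searches elsewhere and retires machines that keep failing. The sketch below is a single-process simulation of that policy, not the PepArML scheduler. `run` is a caller-supplied function, and the retry limits are arbitrary assumptions.

```python
from collections import deque

def run_searches(tasks, workers, run, max_attempts=3, max_worker_failures=2):
    """Run each task on some worker, retrying failures and retiring bad workers.

    run(task, worker) must return a result or raise an exception.
    Returns (results, abandoned): results maps task -> result, abandoned
    lists tasks that ran out of attempts or of usable workers.
    """
    queue = deque((task, 0) for task in tasks)
    active = deque(workers)
    failures = {worker: 0 for worker in workers}
    results, abandoned = {}, []
    while queue and active:
        task, attempts = queue.popleft()
        worker = active[0]
        active.rotate(-1)  # round-robin over the remaining workers
        try:
            results[task] = run(task, worker)
            failures[worker] = 0  # a success clears the worker's failure streak
        except Exception:
            failures[worker] += 1
            if failures[worker] >= max_worker_failures:
                active.remove(worker)  # consistently failing machine
            if attempts + 1 < max_attempts:
                queue.append((task, attempts + 1))
            else:
                abandoned.append(task)
    abandoned.extend(task for task, _ in queue)  # no workers left to try
    return results, abandoned
```

Since a shared file-system cannot be assumed, `run` would also have to ship the spectra and sequence database to the worker and bring the results back. Reporting abandoned tasks explicitly matters because a silently missing search removes that engine's vote from every spectrum it covered.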


Combining Multi-Search Results

Treat search engines as black-boxes
  Generate PSMs + scores, features
Apply machine learning / statistical modeling to results
  Use multiple match metrics
Combine/refine using multiple search engines
  Agreement suggests correctness


Machine Learning / Statistical Modeling

Use of multiple metrics of PSM quality:
  Precursor delta, trypsin digest features, etc
Often requires "training" with examples
  Different examples will change the result
  Generalization is always the question
Scores can be hard to "understand"
  Difficult to establish statistical significance
e.g. PeptideProphet/iProphet
  Weighted linear combination of features
  Number of sibling searches


Available Tools

PeptideProphet/iProphet
  Part of trans-proteomic-pipeline suite
Scaffold
  Commercial reimplementation of PP/iP
PepArML
  Publicly available from the Edwards lab
Lots of in-house stuff…
  Result combining mentioned in talks, lots of papers, etc. but no public tools


For Each Spectrum
  Get Mascot Identification
  Get SEQUEST Identification
  Get X!Tandem Identification

[Figure: candidate Peptides 1 to 8 from the three search engines, with p=76%, p=81%, p=56%, combined through an agreement score]

[Equation: agreement score, defined as a sum indexed by i and j of p(+ | D) terms]

$$p(+ \mid D, NSP) = \frac{p(+ \mid D)\, p(NSP \mid +)}{p(+ \mid D)\, p(NSP \mid +) + p(- \mid D)\, p(NSP \mid -)}$$

Using the probabilities given by each search engine and the probability of them agreeing, a better peptide ID is made

Brian Searle


PepArML Strategy

Meta-Search for Multi-Search:
  Automatic configuration of searches
  Automatic preprocessing of sequence databases
  Automatic spectral reformatting
  Automatic execution of search on local or remote computing resources (AWS/grid/NFS).
Result Combining:
  Decoy-based FDR significance estimation
  Unsupervised, model-free, machine-learning


Peptide Identification Meta-Search

Simple unified search interface for:
  Mascot, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch, InsPecT+MSSGF
Automatic decoy searches
Automatic spectrum file "chunking"
Automatic scheduling
  Serial, Multi-Processor, Cluster, Grid, Cloud


Grid-Enabled Peptide Identification Meta-Search

Amazon Web Services
University Cluster
Edwards Lab Scheduler & 80+ CPUs
Secure communication
Heterogeneous compute resources
Single, simple search request
Scales easily to 250+ simultaneous searches


PepArML Combiner

Peptide identification arbiter by machine learning

Unifies these ideas within a model-free, combining machine learning framework

Unsupervised training procedure


PepArML Overview

[Figure: X!Tandem, Mascot, OMSSA, Other; Feature extraction; PepArML]


Dataset Construction

[Figure: one row per (spectrum, peptide) pair, with columns X!Tandem, Mascot, OMSSA and a T/F label]
  (S_1, P_1)  T
  (S_1, P_2)  F
  (S_2, P_1)  T
  ……
  (S_n, P_m)  T


Voting Heuristic Combiner

Choose PSM with most votes
Break ties using FDR
  Select PSM with min. FDR of tied votes
How to apply this to a decoy database?
Lots of possibilities – all imperfect
  Now using: 100*#votes – min. decoy hits
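The voting rule and the score formula can be sketched directly from the bullets. The input layout (one vote per engine per spectrum, each carrying that engine's FDR) is an assumption. So is the reading of "min. decoy hits" as the smallest decoy-hit count among the voting engines, which the slide does not spell out.

```python
from collections import defaultdict

def vote(psms):
    """Pick one peptide per spectrum by majority vote.

    psms: iterable of (spectrum, engine, peptide, fdr), one entry per engine
    per spectrum. Returns spectrum -> (peptide, votes, min_fdr).
    """
    tally = defaultdict(lambda: defaultdict(list))
    for spectrum, _engine, peptide, fdr in psms:
        tally[spectrum][peptide].append(fdr)
    winners = {}
    for spectrum, peptides in tally.items():
        # Most votes wins; ties go to the peptide with the smallest FDR.
        peptide, fdrs = max(
            peptides.items(),
            key=lambda item: (len(item[1]), -min(item[1])),
        )
        winners[spectrum] = (peptide, len(fdrs), min(fdrs))
    return winners

def voting_score(votes, min_decoy_hits):
    """Score from the slide: 100 * #votes - min. decoy hits."""
    return 100 * votes - min_decoy_hits
```

The factor of 100 makes the vote count dominate: any PSM with k+1 votes outscores any PSM with k votes as long as the decoy-hit term stays below 100, and the decoy term orders PSMs only within a vote count.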


Supervised Learning


Search Engine Info. Gain


Precursor & Digest Info. Gain


Retention Time & Proteotypic Peptide Properties Info. Gain


Application to Real Data

How well do these models generalize?
  Different instruments
    Spectral characteristics change scores
  Search parameters
    Different parameters change score values
Supervised learning requires
  (Synthetic) experimental data from every instrument
  Search results from available search engines
  Training/models for all parameters x search engine sets x instruments


Model Generalization


Unsupervised Learning


Unsupervised Learning Performance


Unsupervised Learning Convergence


PepArML Performance

[Figure: panels LCQ, QSTAR, LTQ-FT]

Standard Protein Mix Database
18 Standard Proteins – Mix1


Conclusions

Combining search results from multiple engines can be very powerful
  Boost both sensitivity and specificity
  Running multiple search engines is hard
Statistical significance is hard
  Use empirical FDR estimates...but be careful...lots of subtleties
Consensus is powerful, but fragile
  Search engine quirks can destroy it
  "Witnesses" are not independent
