![Page 1: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/1.jpg)
Search Engine Result Combining
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center
![Page 2: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/2.jpg)
Peptide Identification Results
• Search engines provide an answer for every spectrum...
  • Can we figure out which ones to believe?
• Why is this hard?
  • Hard to determine “good” scores
  • Significance estimates are unreliable
  • Need more ids from weak spectra
  • Each search engine has its strengths ... and weaknesses
  • Search engines give different answers
![Page 3: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/3.jpg)
Mascot Search Results
![Page 4: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/4.jpg)
Translation start-site correction
• Halobacterium sp. NRC-1
  • Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins
  • Goo et al., MCP 2003.
• GdhA1 gene:
  • Glutamate dehydrogenase A1
• Multiple significant peptide identifications
  • Observed start is consistent with Glimmer 3.0 prediction(s)
![Page 5: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/5.jpg)
Halobacterium sp. NRC-1 ORF: GdhA1
• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651
[Figure: axis ticks from 0 to 440 in steps of 40]
![Page 6: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/6.jpg)
Translation start-site correction
![Page 7: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/7.jpg)
Search engine scores are inconsistent!
[Figure: Mascot vs. Tandem scores]
![Page 8: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/8.jpg)
Common Algorithmic Framework – Different Results
• Pre-process experimental spectra
  • Charge state, cleaning, binning
• Filter peptide candidates
  • Decide which PSMs to evaluate
• Score peptide-spectrum match
  • Fragmentation modeling, dot product
• Rank peptides per spectrum
  • Retain statistics per spectrum
• Estimate E-values
  • Apply empirical or theoretical model
![Page 9: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/9.jpg)
Comparison of search engines
• No single score is comprehensive
• Search engines disagree
• Many spectra lack confident peptide assignment
[Figure: overlap of Mascot, X!Tandem, and OMSSA identifications; region labels as extracted: 4%, 10%, 2%, 5%, 9%, 69%, 2%]
![Page 10: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/10.jpg)
Lots of techniques out there
• Treat search engines as black-boxes
  • Generate PSMs + scores, features
• Apply supervised machine learning to results
  • Use multiple match metrics
• Combine/refine using multiple search engines
  • Agreement suggests correctness
• Use empirical significance estimates
  • “Decoy” databases (FDR)
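The decoy-database idea can be illustrated with a minimal sketch. It assumes separate target and decoy searches of equal size, so that the count of decoy hits above a score threshold estimates the count of false target hits above the same threshold. The function name `decoy_fdr` and the example scores are hypothetical, not taken from the slides.

```python
def decoy_fdr(psms, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs; higher scores are better.
    Assumption: target and decoy databases are the same size, so the number
    of decoy hits approximates the number of false target hits.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing to call false
    return min(1.0, decoys / targets)


# Made-up scores: True marks a hit from the decoy search.
example_psms = [(95, False), (90, False), (88, True), (80, False),
                (75, False), (60, True), (55, False)]
fdr_at_70 = decoy_fdr(example_psms, 70)  # 1 decoy vs 4 targets above 70
```

Other estimators exist (for example, for concatenated target+decoy searches); this sketch shows only the simplest ratio.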
![Page 11: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/11.jpg)
Machine Learning
• Use of multiple metrics of PSM quality:
  • Precursor delta, trypsin digest features, etc.
• Requires "training" with examples
  • Different examples will change the result
  • Generalization is always the question
• Scores can be hard to "understand"
  • Difficult to establish statistical significance
• PeptideProphet's discriminant function
  • Weighted linear combination of features
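A weighted linear combination of features can be sketched as follows. The feature names and weights are invented for illustration and are not PeptideProphet's actual coefficients; in practice the weights would be learned from training examples.

```python
def discriminant_score(features, weights, intercept=0.0):
    """Return intercept + sum of weight * feature value over the weighted features."""
    return intercept + sum(weights[name] * features[name] for name in weights)


# Hypothetical weights: positive for evidence of a good match,
# negative for features that suggest a poor one.
example_weights = {
    "engine_score": 1.0,
    "delta_score": 2.0,
    "abs_precursor_delta": -0.5,
    "missed_cleavages": -0.25,
}
example_psm = {
    "engine_score": 3.0,
    "delta_score": 0.5,
    "abs_precursor_delta": 1.0,
    "missed_cleavages": 2.0,
}
example_score = discriminant_score(example_psm, example_weights)
```

Because the score is a weighted sum, changing the training examples changes the weights and therefore the ranking, which is the generalization concern raised above.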
![Page 12: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/12.jpg)
Combine / Merge Results
• Threshold peptide-spectrum matches from each of two search engines
  • PSMs agree → boost specificity
  • PSMs from one → boost sensitivity
  • PSMs disagree → ?????
• Sometimes agreement is "lost" due to threshold...
  • How much should agreement increase our confidence?
• Scores easy to "understand"
  • Difficult to establish statistical significance
• How to generalize to more engines?
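The three cases for two thresholded result sets can be sketched as below. The function name `merge_two_engines`, the category labels, and the example identifiers are hypothetical; the sketch only classifies each spectrum and deliberately leaves the "disagree" case unresolved, as the slide does.

```python
def merge_two_engines(psms_a, psms_b):
    """Classify each spectrum by how two thresholded result sets relate.

    psms_a, psms_b: dict mapping spectrum id -> peptide, containing only
    PSMs that already passed each engine's own score threshold.
    Returns dict spectrum id -> (category, peptide_a, peptide_b).
    """
    merged = {}
    for spectrum in sorted(set(psms_a) | set(psms_b)):
        pep_a = psms_a.get(spectrum)
        pep_b = psms_b.get(spectrum)
        if pep_a is not None and pep_b is not None:
            category = "agree" if pep_a == pep_b else "disagree"
        else:
            category = "single"  # only one engine passed its threshold
        merged[spectrum] = (category, pep_a, pep_b)
    return merged


engine_a = {"scan1": "PEPTIDEK", "scan2": "SAMPLER", "scan3": "LOSTK"}
engine_b = {"scan1": "PEPTIDEK", "scan2": "OTHERK", "scan4": "EXTRAR"}
merged_example = merge_two_engines(engine_a, engine_b)
```

Note that "scan3" lands in the "single" category even if the second engine ranked the same peptide first but just below its threshold, which is one way agreement gets "lost".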
![Page 13: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/13.jpg)
Consensus and Meta-Search
• Multiple witnesses increase confidence
  • As long as they are independent
  • Example: Getting the story straight
• Independent "random" hits unlikely to agree
  • Agreement is indication of biased sampling
  • Example: loaded dice
• Meta-search is relatively easy
  • Merging and re-ranking is hard
  • Example: Booking a flight to Denver!
• Scores and E-values are not comparable
  • How to choose the best answer?
  • Example: Best E-value favors Tandem!
![Page 14: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/14.jpg)
Searching for Consensus
Search engine quirks can destroy consensus
• Initial methionine loss as tryptic peptide
• Charge state enumeration or guessing
• X!Tandem's refinement mode
• Pyro-Gln, Pyro-Glu modifications
• Difficulty tracking spectrum identifiers
• Precursor mass tolerance (Da vs ppm)
Decoy searches must be identical!
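As one concrete case from the list above, a precursor tolerance given in Da and one given in ppm admit very different candidate sets. The sketch below uses the standard conversion (ppm is parts per million of the mass); the function names and example masses are hypothetical.

```python
def ppm_to_da(mass_da, tolerance_ppm):
    """Convert a relative tolerance in ppm to an absolute tolerance in Da."""
    return mass_da * tolerance_ppm / 1e6


def da_to_ppm(mass_da, tolerance_da):
    """Convert an absolute tolerance in Da to a relative tolerance in ppm."""
    return tolerance_da / mass_da * 1e6


def within_tolerance(observed, theoretical, tolerance, unit):
    """Check a precursor mass match with the tolerance given in 'Da' or 'ppm'."""
    if unit == "ppm":
        tolerance = ppm_to_da(theoretical, tolerance)
    elif unit != "Da":
        raise ValueError("unit must be 'Da' or 'ppm'")
    return abs(observed - theoretical) <= tolerance


# For a 2000 Da precursor, 10 ppm is about 0.02 Da, while 2 Da is about 1000 ppm.
narrow_da = ppm_to_da(2000.0, 10.0)
wide_ppm = da_to_ppm(2000.0, 2.0)
passes_da = within_tolerance(2000.5, 2000.0, 2.0, "Da")
passes_ppm = within_tolerance(2000.5, 2000.0, 10.0, "ppm")
```

An engine configured with the wide Da window can report a peptide that an engine configured in ppm never even considers, so the two cannot agree on that spectrum.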
![Page 15: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/15.jpg)
Configuring for Consensus
Search engine configuration can be difficult:
• Correct spectral format
• Search parameter files and command-line
• Pre-processed sequence databases
• Tracking spectrum identifiers
• Extracting peptide identifications, especially modifications and protein identifiers
![Page 16: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/16.jpg)
Peptide Identification Meta-Search
• Simple unified search interface for:
  • Mascot, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch, InsPecT
• Automatic decoy searches
• Automatic spectrum file "chunking"
• Automatic scheduling
  • Serial, Multi-Processor, Cluster, Grid
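Spectrum file "chunking" can be sketched as splitting a file into pieces with a fixed number of spectra, so that each piece can be scheduled as a separate search. The slides do not say which spectrum format is chunked; this sketch assumes MGF, where each spectrum ends with an `END IONS` line, and the function name `chunk_mgf` is hypothetical.

```python
def chunk_mgf(lines, spectra_per_chunk):
    """Yield lists of lines, each containing at most `spectra_per_chunk` spectra.

    Assumes MGF-style input in which every spectrum is closed by an
    'END IONS' line. Lines after the final 'END IONS' are dropped unless
    they follow an unfinished chunk.
    """
    chunk, count = [], 0
    for line in lines:
        chunk.append(line)
        if line.strip() == "END IONS":
            count += 1
            if count == spectra_per_chunk:
                yield chunk
                chunk, count = [], 0
    if count > 0:
        yield chunk  # final, partially filled chunk


# Five tiny made-up spectra split into chunks of two.
example_lines = []
for i in range(5):
    example_lines += ["BEGIN IONS", f"TITLE=scan{i}", "PEPMASS=500.0",
                      "100.0 1.0", "END IONS"]
example_chunks = list(chunk_mgf(example_lines, 2))
```

Keeping the original spectrum titles inside each chunk matters for the later merge step, since results from different engines have to be matched back to the same spectrum identifiers.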
![Page 17: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/17.jpg)
Peptide Identification Grid-Enabled Meta-Search
[Diagram labels]
• NSF TeraGrid: 1000+ CPUs
• UMIACS: 250+ CPUs
• Edwards Lab: scheduler & 80+ CPUs
• Secure communication
• Heterogeneous compute resources
• Single, simple search request
• Scales easily to 250+ simultaneous searches
• Engine sets shown: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA; X!Tandem, KScore, OMSSA
![Page 18: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/18.jpg)
PepArML
• Peptide identification arbiter by machine learning
• Unifies these ideas within a model-free, combining machine learning framework
• Unsupervised training procedure
![Page 19: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/19.jpg)
PepArML Overview
[Diagram: X!Tandem, Mascot, OMSSA, Other → Feature extraction → PepArML]
![Page 20: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/20.jpg)
Dataset Construction
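One row per (spectrum, peptide) pair, with feature columns from X!Tandem, Mascot, and OMSSA and a true/false (T/F) label:
• $(S_1, P_1)$: T
• $(S_1, P_2)$: F
• $(S_2, P_1)$: T
• …
• $(S_n, P_m)$: T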
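The dataset layout above can be sketched as one row per (spectrum, peptide) pair proposed by any engine, with one feature per engine and a T/F label. The data structures, the single score per engine, and the `known_correct` ground truth are all assumptions made for illustration; the slides do not specify which features are extracted.

```python
def build_dataset(engine_results, known_correct):
    """Build one labeled row per (spectrum, peptide) pair seen by any engine.

    engine_results: dict engine name -> dict (spectrum id, peptide) -> score
    known_correct:  dict spectrum id -> peptide assumed to be the right answer
    An engine that did not report a pair contributes None for that feature.
    """
    engines = sorted(engine_results)
    pairs = sorted({pair for results in engine_results.values() for pair in results})
    rows = []
    for spectrum, peptide in pairs:
        features = {engine: engine_results[engine].get((spectrum, peptide))
                    for engine in engines}
        label = known_correct.get(spectrum) == peptide
        rows.append({"spectrum": spectrum, "peptide": peptide,
                     "features": features, "label": label})
    return rows


# Made-up scores keyed by (spectrum, peptide).
example_results = {
    "tandem": {("S1", "P1"): 0.001, ("S2", "P1"): 0.01},
    "mascot": {("S1", "P1"): 45.0, ("S1", "P2"): 12.0},
    "omssa": {("S1", "P1"): 0.002},
}
example_truth = {"S1": "P1", "S2": "P1"}
example_rows = build_dataset(example_results, example_truth)
```

With this input the rows come out in the same order as the slide's example: (S1, P1) labeled true, (S1, P2) labeled false, and (S2, P1) labeled true.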
![Page 21: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/21.jpg)
Voting Heuristic Combiner
• Choose PSM with most votes
• Break ties using FDR
  • Select PSM with min. FDR of tied votes
• How to apply this to a decoy database?
• Lots of possibilities – all imperfect
  • Now using: 100*#votes – min. decoy hits
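The voting rule can be sketched for a single spectrum as follows. The data layout and function names are hypothetical, and the reading of "min. decoy hits" as the smallest decoy-hit count among the engines voting for the PSM is an assumption; the slide gives only the formula.

```python
from collections import defaultdict


def vote(engine_psms):
    """Pick the peptide with the most engine votes for one spectrum.

    engine_psms: dict engine name -> (peptide, estimated FDR of that PSM).
    Ties in vote count are broken by the smallest FDR among the tied peptides.
    """
    votes = defaultdict(int)
    best_fdr = {}
    for peptide, fdr in engine_psms.values():
        votes[peptide] += 1
        best_fdr[peptide] = min(fdr, best_fdr.get(peptide, fdr))
    return min(votes, key=lambda pep: (-votes[pep], best_fdr[pep]))


def combined_score(num_votes, min_decoy_hits):
    """Score from the slide: 100 * #votes minus the minimum decoy-hit count."""
    return 100 * num_votes - min_decoy_hits


majority = vote({"tandem": ("PEPTIDEK", 0.05),
                 "mascot": ("PEPTIDEK", 0.02),
                 "omssa": ("OTHERK", 0.01)})
tie_break = vote({"tandem": ("AAAK", 0.05), "mascot": ("BBBK", 0.02)})
```

The factor of 100 makes the vote count dominate: any PSM with more votes outranks any PSM with fewer, as long as the decoy-hit term stays below 100.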
![Page 22: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/22.jpg)
Supervised Learning
![Page 23: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/23.jpg)
Feature Evaluation
![Page 24: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/24.jpg)
Application to Real Data
• How well do these models generalize?
• Different instruments
  • Spectral characteristics change scores
• Search parameters
  • Different parameters change score values
• Supervised learning requires
  • (Synthetic) experimental data from every instrument
  • Search results from available search engines
  • Training/models for all parameters × search engine sets × instruments
![Page 25: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/25.jpg)
Model Generalization
![Page 26: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/26.jpg)
Unsupervised Learning
![Page 27: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/27.jpg)
Unsupervised Learning Performance
![Page 28: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/28.jpg)
Unsupervised Learning Convergence
![Page 29: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/29.jpg)
Peptide Atlas A8_IP – LTQ
![Page 30: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/30.jpg)
OMICS 17 Protein Mix – LCQ
![Page 31: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/31.jpg)
Feature Selection (InfoGain)
![Page 32: Search Engine Result Combining Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center](https://reader036.vdocuments.mx/reader036/viewer/2022062517/56649ef65503460f94c0a432/html5/thumbnails/32.jpg)
Conclusions
• Combining search results from multiple engines can be very powerful
  • Boost both sensitivity and specificity
  • Running multiple search engines is hard
• Statistical significance is hard
  • Use empirical FDR estimates...but be careful...lots of subtleties
• Consensus is powerful, but fragile
  • Search engine quirks can destroy it
  • "Witnesses" are not independent