![Page 1: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/1.jpg)
Novel Empirical FDR Estimation in PepArML
David Retz and Nathan Edwards
Georgetown University Medical Center
![Page 2: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/2.jpg)
What is PepArML?
- Meta-search using seven search engines: Mascot; X!Tandem Native, K-Score, S-Score; OMSSA; MyriMatch; InsPecT + MSGF
  - Automatic target + decoy searches
  - Automatic construction of search configuration
  - Automatic spectra and sequence (re-)formatting
- Heterogeneous cluster, grid, cloud computing
  - Centralized scheduler
  - Shared and private computational resources
  - Integration with NSF TeraGrid and AWS
![Page 3: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/3.jpg)
What is PepArML?
- A peptide identification result combiner
  - Selects best identification, per spectrum
  - Model-free, auto-train machine-learning
  - Estimates false-discovery-rates
  - Formats output as pepXML and protXML
- In use: more than 23M spectra, 1.4M search jobs, and 1TB in spectra and results.
- PepArML identifies significantly more spectra than single search engines.
  - Recovers more proteins with fewer replicates
![Page 4: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/4.jpg)
PepArML Performance
[Figure: PepArML performance on LCQ, QSTAR, and LTQ-FT data. Standard Protein Mix Database, 18 Standard Proteins, Mix1]
![Page 5: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/5.jpg)
PepArML Advantages
Can accommodate new search engines or spectrum and peptide features easily
Learns the specific characteristics of each dataset from scratch!
Provides a platform for comparing single search engine results under a common FDR estimation procedure.
![Page 6: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/6.jpg)
Search Engine Info. Gain
![Page 7: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/7.jpg)
Precursor & Digest Info. Gain
![Page 8: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/8.jpg)
Retention Time & Proteotypic Peptide Properties Info. Gain
![Page 9: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/9.jpg)
Search Engine Independent FDR Estimation
- Comparing search engines is difficult due to different FDR estimation techniques
  - Implicit assumption: Spectra scores can be thresholded
- Competitive vs Global
  - Competitive controls some spectral variation
- Reversed vs Shuffled Decoy Sequence
  - Reversed models target redundancy accurately
- Charge-state partition or Unified
  - Mitigates effect of peptide length dependent scores
  - What about peptide property partitions?
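The choices listed on this slide can be made concrete with a small sketch. The function names and the toy scores below are illustrative assumptions, not PepArML code: the sketch builds reversed and shuffled decoy sequences, then estimates FDR at a score threshold either from separate target and decoy searches (global) or by letting the target and decoy hits of each spectrum compete.

```python
import random

def reversed_decoy(protein: str) -> str:
    """Reversed decoy: deterministic, so duplicate targets give duplicate decoys."""
    return protein[::-1]

def shuffled_decoy(protein: str, seed: int = 0) -> str:
    """Shuffled decoy: same residues in a random order (seeded for repeatability)."""
    residues = list(protein)
    random.Random(seed).shuffle(residues)
    return "".join(residues)

def global_fdr(target_scores, decoy_scores, threshold):
    """Global: decoy hits above the threshold estimate the false target hits."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / targets if targets else 0.0

def competitive_fdr(target_scores, decoy_scores, threshold):
    """Competitive: per spectrum, only the better of target and decoy is kept."""
    targets = decoys = 0
    for t, d in zip(target_scores, decoy_scores):
        best, is_decoy = (d, True) if d > t else (t, False)
        if best >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    return decoys / targets if targets else 0.0

# Toy scores for five spectra (hypothetical values, best hit per spectrum).
target = [9.0, 8.0, 7.0, 3.0, 2.0]
decoy = [1.0, 2.5, 7.5, 2.0, 1.5]
fdr_global = global_fdr(target, decoy, threshold=3.0)
fdr_competitive = competitive_fdr(target, decoy, threshold=3.0)
```

Both estimators rely on the slide's implicit assumption that scores can be thresholded. A charge-state partition would simply apply the same functions to each charge state separately.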
![Page 10: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/10.jpg)
PepArML Disadvantages
- Training heuristic can fail to “get started”
  - Works best on large datasets
- Iterative training can be time-consuming
- Machine-learning “confidence” is uninterpretable for peptide identification
  - Requires two decoy searches to “calibrate” confidence as FDR
  - Each spectrum searched ~21 times!
![Page 11: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/11.jpg)
PepArML Disadvantages
- Training heuristic can fail to “get started”
  - Works best on large datasets
- Iterative training can be time-consuming
- Machine-learning “confidence” is uninterpretable for peptide identification
  - Requires two decoy searches to “calibrate” confidence as FDR
  - Can we eliminate the internal decoy? Reduce search phase by 33%
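The "~21 times" and "33%" figures are consistent with simple arithmetic, assuming each of the seven search engines runs one target search plus two decoy searches per spectrum. That split is an inference from the slides, not something they state outright; the sketch below only checks the numbers.

```python
ENGINES = 7          # seven search engines, per the "What is PepArML?" slide
TARGET_SEARCHES = 1  # assumed: one target search per engine
DECOY_SEARCHES = 2   # assumed: one internal and one external decoy search per engine

def searches_per_spectrum(engines: int, target: int, decoy: int) -> int:
    """Total number of searches each spectrum goes through."""
    return engines * (target + decoy)

with_internal = searches_per_spectrum(ENGINES, TARGET_SEARCHES, DECOY_SEARCHES)
without_internal = searches_per_spectrum(ENGINES, TARGET_SEARCHES, DECOY_SEARCHES - 1)
reduction = 1 - without_internal / with_internal

print(with_internal, without_internal, f"{reduction:.0%}")
```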
![Page 12: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/12.jpg)
PepArML Workflow
[Flowchart boxes: Spectra; Mascot, Tandem, OMSSA, ...; Extract Peptides & Features; Select High-Quality IDs (D0); Select "True" Proteins; Assign Training Labels; Train Classifier & Predict Correct IDs; Recalibrate Confidence as FDR (D1); Select "True" Proteins; Stable? (No / Yes); Output Peptide Spectrum Assignments]
- Select high-quality IDs
- Guess true proteins from search results
- Label spectra & train
- Calibrate confidence
- Guess true proteins from ML results
- Iterate!
- Estimate FDR using (external) decoy
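The control flow of this workflow, stripped of all proteomics detail, is an iterate-until-stable loop. The sketch below is a hypothetical skeleton rather than PepArML's implementation: `train_and_predict` and `relabel` are stand-in callables for the "label spectra & train" and "guess true proteins from ML results" steps, and `max_rounds` is an assumed safety limit.

```python
def iterate_until_stable(initial_labels, train_and_predict, relabel, max_rounds=20):
    """Retrain and relabel until the training labels stop changing.

    initial_labels: labels derived from the high-quality IDs (hypothetical format).
    train_and_predict: callable(labels) -> per-spectrum confidences.
    relabel: callable(confidences) -> new labels.
    Returns the final labels and the number of rounds used.
    """
    labels = initial_labels
    for round_no in range(1, max_rounds + 1):
        confidences = train_and_predict(labels)
        new_labels = relabel(confidences)
        if new_labels == labels:  # "Stable?" -> Yes
            return labels, round_no
        labels = new_labels       # "Stable?" -> No, iterate
    return labels, max_rounds

# Toy stand-ins: fixed confidences and a 0.5 cutoff (both made up).
toy_conf = {"s1": 0.9, "s2": 0.6, "s3": 0.2}
final_labels, rounds = iterate_until_stable(
    {"s1": True, "s2": False, "s3": False},
    train_and_predict=lambda labels: toy_conf,
    relabel=lambda conf: {sid: c >= 0.5 for sid, c in conf.items()},
)
```

The round limit matters because, as the disadvantages slides note, iterative training can be time-consuming.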
![Page 13: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/13.jpg)
Select High-Quality Unanimous Peptide Identifications
Requires a fast and easy, but comparable, search-engine metric.
[Figure panels: min decoy hits; min z-score]
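One possible reading of the "min z-score" panel is sketched below. The data layout, function names, and cutoff are all assumptions made for illustration: each engine's scores are standardized so they become comparable, and a spectrum is kept only when every engine reports the same peptide and the smallest z-score across engines clears the cutoff.

```python
from statistics import mean, pstdev

def zscores(results):
    """Standardize one engine's scores: {spectrum_id: (peptide, score)} -> z-scores."""
    scores = [score for _, score in results.values()]
    mu, sigma = mean(scores), pstdev(scores)
    return {sid: (pep, (score - mu) / sigma if sigma else 0.0)
            for sid, (pep, score) in results.items()}

def unanimous_high_quality(engine_results, min_z):
    """Keep spectra where all engines agree and the worst z-score is >= min_z."""
    standardized = [zscores(r) for r in engine_results]
    shared = set.intersection(*(set(r) for r in standardized))
    selected = {}
    for sid in sorted(shared):
        peptides = {r[sid][0] for r in standardized}
        worst_z = min(r[sid][1] for r in standardized)
        if len(peptides) == 1 and worst_z >= min_z:
            selected[sid] = peptides.pop()
    return selected

# Two hypothetical engines with scores on very different scales.
engine_a = {"s1": ("PEPTIDEK", 90), "s2": ("SAMPLER", 50), "s3": ("ANOTHERK", 10)}
engine_b = {"s1": ("PEPTIDEK", 9), "s2": ("DIFFERK", 8), "s3": ("ANOTHERK", 1)}
selected = unanimous_high_quality([engine_a, engine_b], min_z=0.5)
```

Here `s2` is dropped because the engines disagree on the peptide, and `s3` is dropped because its z-scores are low in both engines.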
![Page 14: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/14.jpg)
Simulate Decoy Results by Sampling Target Results
[Figure legend: Target; Decoy; Sampled Target]
![Page 15: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/15.jpg)
Simulate Decoy Results by Sampling Target Results
[Figure legend: Target; Decoy; Sampled Target]
![Page 16: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/16.jpg)
Sampled Target Approximates Decoy Calibration
Sample 75% non-training “false” target results
Rescale to # of spectra
Approximates FDR well enough to replace internal decoy
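The slide gives only the outline of the procedure, so the following is a minimal sketch under stated assumptions rather than the authors' method. It assumes the "false" target results are available as a list of confidences, that the 75% sample is drawn at random, and that "rescale to # of spectra" means scaling the sample so it stands in for one decoy hit per spectrum. All names are hypothetical.

```python
import random

def sampled_target_fdr(target_conf, false_conf, threshold, fraction=0.75, seed=0):
    """Estimate FDR at `threshold` with sampled "false" target results as pseudo-decoys.

    target_conf: confidence of the best target hit for every spectrum.
    false_conf: confidences of non-training target results presumed false.
    """
    rng = random.Random(seed)
    sample_size = round(fraction * len(false_conf))
    sample = rng.sample(list(false_conf), sample_size)
    if not sample:
        return 0.0
    # Rescale: a real decoy search would yield one hit per spectrum.
    scale = len(target_conf) / len(sample)
    pseudo_decoys = scale * sum(c >= threshold for c in sample)
    targets = sum(c >= threshold for c in target_conf)
    return min(1.0, pseudo_decoys / targets) if targets else 0.0

# Toy data: 8 spectra, 4 of them above the 0.5 confidence threshold.
target_conf = [0.95, 0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
false_conf = [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
fdr_all = sampled_target_fdr(target_conf, false_conf, threshold=0.5, fraction=1.0)
```

With `fraction=1.0` the toy example is deterministic; at the slide's 75% the estimate depends on which results land in the sample, which is why the seed is exposed as a parameter.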
![Page 17: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/17.jpg)
Decoy-free PepArML results
[Figure: decoy-free PepArML results on LCQ, QSTAR, and LTQ-FT data. Standard Protein Mix Database, 18 Standard Proteins, Mix1]
![Page 18: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/18.jpg)
Conclusions
- PepArML can significantly boost the number of spectra, peptides, and proteins identified
  - Give it a try – free! Nothing to install!
- A common FDR framework facilitates head-to-head comparison of search engines and FDR estimation techniques
- Sampled target results can substitute for decoy results (internally)
  - Reduces search time by 33%
![Page 19: Novel Empirical FDR Estimation in PepArML David Retz and Nathan Edwards Georgetown University Medical Center](https://reader035.vdocuments.mx/reader035/viewer/2022062423/56649eca5503460f94bd8a26/html5/thumbnails/19.jpg)
Acknowledgements
- Growing list of PepArML users
  - Fenselau lab (Maryland)
  - Graham lab (JHU)
  - Genovese lab (Bologna University, Italy)
- Dr. Brian Balgley, Bioproximity
- Dr. Chau-Wen Tseng & Dr. Xue Wu, University of Maryland Computer Science
- Funding: NIH/NCI