Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis
Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center


Post on 12-Jan-2016


Page 1: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

Page 2: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis


Why Tandem Mass Spectrometry?

LC-MS/MS spectra provide evidence for the amino-acid sequence and abundance of functional proteins.

Key concepts:
• Spectrum acquisition is unbiased by knowledge
• Direct observation of amino-acid sequence
• Sensitive to small sequence variations
• Spectrum acquisition is biased by abundance

Page 3: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis


Sample Preparation for MS/MS

Enzymatic Digest and Fractionation

Page 4: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis


Single Stage MS

MS

Page 5: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Tandem Mass Spectrometry (MS/MS)

Precursor selection

Page 6: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision-induced dissociation (CID)

MS/MS

Page 7: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis


Peptide Fragmentation

Peptide: S-G-F-L-E-E-D-E-L-K

MW     ion   cleavage         ion   MW
88     b1    S | GFLEEDELK    y9    1080
145    b2    SG | FLEEDELK    y8    1022
292    b3    SGF | LEEDELK    y7    875
405    b4    SGFL | EEDELK    y6    762
534    b5    SGFLE | EDELK    y5    633
663    b6    SGFLEE | DELK    y4    504
778    b7    SGFLEED | ELK    y3    389
907    b8    SGFLEEDE | LK    y2    260
1020   b9    SGFLEEDEL | K    y1    147
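The fragment masses listed for SGFLEEDELK can be recomputed from residue masses. The sketch below is a minimal illustration, assuming singly charged ions and standard monoisotopic residue masses; the function name `fragment_ions` is illustrative and not from the talk. Under these assumptions the values agree with the slide's to within about half a mass unit, so the slide may have used a slightly different mass table or rounding.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i-1] is the b_i ion (first i residues);
    y[i-1] is the y_i ion (last i residues).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b, y = [], []
    for i in range(1, len(peptide)):
        b.append(sum(masses[:i]) + PROTON)           # N-terminal fragment
        y.append(sum(masses[-i:]) + WATER + PROTON)  # C-terminal fragment
    return b, y


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} {b:8.2f}   y{i} {y:8.2f}")
```

Each b ion pairs with the complementary y ion from the same cleavage site (b4 with y6, for example), which is why the two mass ladders run in opposite directions along the peptide.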

Page 8: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Unannotated Splice Isoform

• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.
• LIME1 gene:
  • LCK interacting transmembrane adaptor 1
• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias.

Multiple significant peptide identifications

Page 9: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis


Unannotated Splice Isoform

Page 10: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis


Unannotated Splice Isoform

Page 11: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Translation start-site correction

• Halobacterium sp. NRC-1
  • Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins
  • Goo, et al. MCP 2003.
• GdhA1 gene:
  • Glutamate dehydrogenase A1
• Multiple significant peptide identifications
• Observed start is consistent with Glimmer 3.0 prediction(s)

Page 12: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Halobacterium sp. NRC-1 ORF: GdhA1

• K-score E-value vs PepArML @ 10% FDR
• Many peptides inconsistent with annotated translation start site of NP_279651

[Figure: horizontal axis from 0 to 440 in steps of 40]

Page 13: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis


Lost peptide identifications

Missing from the sequence database

Search engine strengths, weaknesses, quirks

Poor score or statistical significance

Thorough search takes too long

Page 14: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Peptide Sequence Databases

• All amino-acid 30-mers, no redundancy
  • From ESTs, Proteins, mRNAs
• 30-40 fold size and search time reduction
• Formatted as a FASTA sequence database
• One entry per gene/cluster.

Organism     Size (AA)   Size (Entries)
Human        248Mb       74,976
Mouse        171Mb       55,887
Rat          76Mb        42,372
Zebra-fish   94Mb        40,490
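To see why keeping each 30-mer only once shrinks the database, the sketch below counts total versus distinct k-mers in a set of overlapping sequences. It is only an illustration of the redundancy being removed, not the compression scheme used to build these databases; the sequences are made-up toy data and k is set to 3 so the example stays readable.

```python
def distinct_kmers(sequences, k=30):
    """Collect every distinct length-k substring across the sequences."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


def redundancy(sequences, k=30):
    """Ratio of total k-mer occurrences to distinct k-mers."""
    total = sum(max(len(s) - k + 1, 0) for s in sequences)
    return total / len(distinct_kmers(sequences, k))


# Hypothetical overlapping fragments, as might come from ESTs of one gene.
toy = ["MKTAYIAK", "KTAYIAKQ", "MKTAYIAKQR"]
kmers = distinct_kmers(toy, k=3)
print(len(kmers), redundancy(toy, k=3))
```

In this toy case 20 k-mer occurrences collapse to 8 distinct k-mers, a 2.5-fold reduction.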

Page 15: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis


Combine search engine results

No single score is comprehensive

Search engines disagree

Many spectra lack confident peptide assignment

[Figure: Venn diagram of identifications from X! Tandem, SEQUEST, and Mascot; region labels 38%, 14%, 28%, 14%, 3%, 2%, 1%. Searle et al. JPR 7(1), 2008]

Page 16: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Combining search engine results – harder than it looks!

• Consensus boosts confidence, but...
  • How to assess statistical significance?
  • Gain specificity, but lose sensitivity!
  • Incorrect identifications are correlated too!
• How to handle weak identifications?
  • Consensus vs disagreement vs abstention
  • Threshold at some significance?
• We apply "unsupervised" machine-learning....
  • Lots of related work unified in a single framework.

Page 17: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Search Engine Info. Gain


Page 18: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

PepArML Workflow

• Select high-quality IDs
• Guess true proteins from search results
• Label spectra & train
• Calibrate confidence
• Guess true proteins from ML results
• Iterate!
• Estimate FDR using (external) decoy

[Flowchart: Spectra → Mascot, Tandem, OMSSA, ... → Extract Peptides & Features → Select High-Quality IDs (D0) → Select "True" Proteins → Assign Training Labels → Train Classifier & Predict Correct IDs → Recalibrate Confidence as FDR (D1) → Select "True" Proteins → Stable? (No: repeat; Yes: Output Peptide Spectrum Assignments)]
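The last step, estimating FDR from decoys, can be sketched as follows. This is a generic target-decoy estimate, not PepArML's code: it assumes a separate decoy search scored on the same scale, higher scores being better, and the simple ratio of accepted decoys to accepted targets. The function names and the toy PSM list are hypothetical.

```python
def fdr_at_threshold(psms, threshold):
    """Estimated FDR among PSMs scoring at or above the threshold.

    psms is a list of (score, is_decoy) pairs; the estimate is
    accepted decoys divided by accepted targets.
    """
    targets = sum(1 for score, decoy in psms if score >= threshold and not decoy)
    decoys = sum(1 for score, decoy in psms if score >= threshold and decoy)
    return decoys / targets if targets else 0.0


def threshold_for_fdr(psms, max_fdr):
    """Lowest score threshold whose estimated FDR is at most max_fdr (or None)."""
    best = None
    for t in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, t) <= max_fdr:
            best = t
    return best


# Hypothetical scores: True marks a decoy hit.
toy_psms = [(9.0, False), (8.5, False), (8.0, False), (7.0, True),
            (6.5, False), (6.0, False), (5.0, True), (4.0, False)]
cutoff = threshold_for_fdr(toy_psms, 0.25)
print(cutoff, fdr_at_threshold(toy_psms, cutoff))
```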

Page 19: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

False-Discovery-Rate Curves


Page 20: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

PepArML Meta-Search Engine

• Single, simple search request
• Secure communication
• Heterogeneous compute resources
• Scales easily to 250+ simultaneous searches

[Diagram labels: NSF TeraGrid, 1000+ CPUs; Edwards Lab Scheduler & 80+ CPUs; Amazon AWS. Search engines: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA, MyriMatch.]

Page 21: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis


PeptideMapper Web Service

I’m Feeling Lucky

Page 23: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

PeptideMapper Web Service

• Suffix-tree index on peptide sequence database
  • Fast peptide to gene/cluster mapping
  • “Compression” makes this feasible
• Peptide alignment with cluster evidence
  • Amino-acid or nucleotide; exact & near-exact
• Genomic-loci mapping via
  • UCSC “known-gene” transcripts, and
  • Predetermined, embedded genomic coordinates

Page 24: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Systems Biology

Knowledge Databases
• Localization
• Function
• Process
• Interactions
• Pathway
• Mutation

High-Throughput Experiments
• Proteomics
• Sequencing
• Microarrays
• Metabolomics

[Diagram labels: Structured; molecular biology ↕ phenotype; molecular biology ↕ biology]

Page 25: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Systems Biology

Knowledge Databases
• Localization
• Function
• Process
• Interactions
• Pathway
• Mutation

High-Throughput Experiments
• Proteomics
• Sequencing
• Microarrays
• Metabolomics

[Diagram labels: Mathematical Models; Functional Annotation Enrichment; Structured; molecular biology ↕ phenotype; molecular biology ↕ biology]

Page 27: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Why not in proteomics?

• Double counting and false positives…
  …due to traditional protein inference
• Proteomics cannot see all proteins…
  …proteins are not equally likely to be drawn
• Good relative abundance is hard…
  …extra chemistries, workflows, and software
  …missing values are particularly problematic

Page 28: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

In proteomics…

• Double counting and false positives…
  • Use generalized protein parsimony
• Proteomics cannot see all proteins…
  • Use identified proteins as background
• Good relative abundance is hard…
  • Model differential spectral counts directly

Page 29: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Traditional Protein Parsimony

• Select the smallest set of proteins that explains all identified peptides.
• Sensible principle; implies:
  • Eliminate equivalent/subset proteins
• Equivalent proteins are problematic:
  • Which one to choose?
• Unique-protein peptides force the inclusion of proteins into the solution
  • True for most tools, even probability-based ones
  • Bad consequences for FDR-filtered IDs
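The traditional principle can be sketched in two steps: drop equivalent and subset proteins, then cover the remaining peptides with as few proteins as possible. The sketch below uses a greedy cover, which is a common heuristic and is not guaranteed to find the smallest set; the protein and peptide names are made up, and breaking ties between equivalent proteins by name is an arbitrary choice that illustrates the "which one to choose?" problem.

```python
def remove_dominated(protein_peptides):
    """Drop proteins whose peptide set equals or is contained in another's.

    Among equivalent proteins, the first name in sorted order is kept
    (an arbitrary tie-break).
    """
    kept = {}
    for name in sorted(protein_peptides):
        peps = protein_peptides[name]
        dominated = any(
            peps < other or (peps == other and other_name < name)
            for other_name, other in protein_peptides.items()
            if other_name != name
        )
        if not dominated:
            kept[name] = peps
    return kept


def greedy_cover(protein_peptides):
    """Repeatedly pick the protein that explains the most uncovered peptides."""
    uncovered = set().union(*protein_peptides.values())
    chosen = []
    while uncovered:
        best = max(sorted(protein_peptides),
                   key=lambda p: len(protein_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_peptides[best]
    return chosen


# Hypothetical example: B is a subset of A, C is equivalent to A, E is a subset of D.
toy = {
    "A": {"p1", "p2", "p3", "p4"},
    "B": {"p1", "p2"},
    "C": {"p1", "p2", "p3", "p4"},
    "D": {"p4", "p5"},
    "E": {"p5"},
}
reduced = remove_dominated(toy)
solution = greedy_cover(reduced)
print(sorted(reduced), solution)
```

Note that D enters the solution on the strength of the single peptide p5, which is the behavior the last bullet warns about.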

Page 30: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Peptide-Spectrum Matches

• Sigma49 – 32,691 LTQ MS/MS spectra of 49 human protein standards; IPI Human
• Yeast – 162,420 LTQ MS/MS spectra from a yeast cell lysate; SGD.
• X!Tandem E-value (no refinement), 1% FDR

Spectra used in: Zhang, B.; Chambers, M. C.; Tabb, D. L. 2007.

Page 31: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Many proteins are easy

• Eliminate equivalent / dominated proteins
  • Sigma49: 277 → 60 proteins
  • Yeast: 1226 → 1085 proteins
• Many components have a single protein:
  • Sigma49: 52 (3 multi-protein)
  • Yeast: 994 (43 multi-protein)
• Single peptides force protein inclusion
  • Sigma49: 16 single-peptide proteins
  • Yeast: 476 single-peptide proteins

Page 32: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Must eliminate redundancy

• Contained proteins should not be selected

[Figure: peptide-by-protein matrix (X marks a peptide found in a protein) for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112; 37 distinct peptides]

Page 33: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Must eliminate redundancy

• Contained proteins should not be selected
  • Even if they have some probability mass
  • Number of sibling peptides matters less if they are shared.

[Figure: peptide-by-protein matrix for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112; values shown beside the proteins, in order: 1.0, 1.0, 0.8, 0.7, 0.0, 1.0; annotation: Single AA Difference]

Page 34: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Must ignore some PSMs

• A single additional peptide should not force protein into solution

[Figure: peptide-by-protein matrix for IPI00925547, IPI00298860, IPI00925299, IPI00925519, IPI00908908, and IPI00903112; values shown beside the proteins, in order: 1.0, 0.0, 0.0, 0.0, 0.0, 1.0; annotation: Single AA Difference]

Page 35: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Example from Yeast

• "Inosine monophosphate dehydrogenase"
  • 4 gene family
• Contained proteins should not be selected
• Single peptide evidence for YML056C

[Figure: peptide-by-protein matrix for YLR432W, YHR216W, YAR073W, and YML056C; values shown beside the proteins, in order: 1.0, 0.6, 0.0, 1.0]

Page 36: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Must ignore some PSMs

• Improving peptide identification sensitivity makes things worse!
• False PSMs don't cluster

[Figure labels: 10%; 2x Proteins; PSMs]

Page 37: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Must ignore some PSMs

• Improving peptide identification sensitivity makes things worse!
• False PSMs don't cluster

[Figure labels: Select Proteins to Explain True PSM%; PSMs; 90%]

Page 38: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Must ignore some PSMs

• How do we choose?
  • Maximize # peptides?
  • Minimize FDR (naïve model)?
  • Maximize # PSMs?

[Figure: the peptide-by-protein matrices for the IPI00925547 group and the yeast YLR432W group, repeated from earlier slides]

Page 39: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Generalized Protein Parsimony

• Weight peptides by number of PSMs
• Constrain unique peptides per protein
• Maximize explained peptides (PSMs)
• Match PSM filtering FDR to % uncovered PSMs
• Readily solved by branch-and-bound
  • Permits complex protein/peptide constraints
• Reduces to traditional protein parsimony
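One way to read these bullets as an optimization problem is sketched below. The formalization is an assumption, not taken from the talk: find the fewest proteins such that every selected protein has at least `min_unique` peptides shared with no other selected protein, and the fraction of PSMs left unexplained is at most `max_uncovered` (set to match the PSM filtering FDR). The sketch uses exhaustive search over subsets instead of branch-and-bound, so it is only practical for toy inputs, and all names and data are hypothetical.

```python
from itertools import combinations


def covered_psms(selection, protein_peptides, psm_counts):
    """Total PSMs of the peptides explained by the selected proteins."""
    peptides = set().union(*(protein_peptides[p] for p in selection))
    return sum(psm_counts[pep] for pep in peptides)


def satisfies_unique(selection, protein_peptides, min_unique):
    """Each selected protein needs min_unique peptides no other selected protein has."""
    for p in selection:
        others = set().union(*(protein_peptides[q] for q in selection if q != p))
        if len(protein_peptides[p] - others) < min_unique:
            return False
    return True


def generalized_parsimony(protein_peptides, psm_counts, min_unique, max_uncovered):
    """Smallest feasible protein set, preferring the one covering the most PSMs."""
    total = sum(psm_counts.values())
    names = sorted(protein_peptides)
    for size in range(1, len(names) + 1):
        best = None
        for selection in combinations(names, size):
            if not satisfies_unique(selection, protein_peptides, min_unique):
                continue
            covered = covered_psms(selection, protein_peptides, psm_counts)
            if best is None or covered > best[0]:
                best = (covered, selection)
        if best and (total - best[0]) / total <= max_uncovered:
            return list(best[1]), best[0]
    return None


# Hypothetical data: P3 rests on one single-PSM peptide, P4 is contained in P1.
proteins = {"P1": {"a", "b", "c", "d"}, "P2": {"c", "d", "e"},
            "P3": {"f"}, "P4": {"a", "b"}}
psms = {"a": 10, "b": 8, "c": 6, "d": 5, "e": 1, "f": 1}

relaxed = generalized_parsimony(proteins, psms, min_unique=2, max_uncovered=0.10)
strict = generalized_parsimony(proteins, psms, min_unique=1, max_uncovered=0.0)
print(relaxed, strict)
```

With `min_unique=1` and `max_uncovered=0.0` every peptide must be explained, which corresponds to the traditional parsimony setting; allowing 10% uncovered PSMs lets the weakly supported proteins drop out.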

Page 40: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Match uncovered PSMs to FDR


Page 41: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Plasma membrane enrichment

• Pellicle enrichment of plasma membrane
  • Choksawangkarn et al. JPR 2013 (Fenselau Lab)
• Six replicate LC-MS/MS analyses each
  • Cell-lysate (44,861 MS/MS)
  • Fe3O4-Al2O3 pellicle (21,871 MS/MS)
• 625 3-unique proteins to match 10% FDR:
  • Lysate: 18,976 PSMs; Pellicle: 13,723 PSMs
  • 89 proteins with significantly (< 10^-5) increased counts

Page 42: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Semi-quantitative LC-MS/MS

Precursor selection + collision-induced dissociation (CID)

MS/MS

Page 43: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Semi-quantitative LC-MS/MS


Chen and Yates. Molecular Oncology, 2007

Page 44: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Plasma membrane enrichment

• Na/K+ ATPase subunit alpha-1 (P05023):
  • Lysate: 1; Pellicle: 90; p-value: 5.2 x 10^-33
• Transferrin receptor protein 1 (P02786):
  • Lysate: 17; Pellicle: 63; p-value: 2.0 x 10^-11
• DAVID Bioinformatics analysis (89/625):
  • Plasma membrane (GO:0005886): 29 (5.2 x 10^-5)
  • Transmembrane (SwissProtKW): 24 (1.3 x 10^-6)
• Transmembrane (SwissProtKW):
  • Lysate: 524; Pellicle: 1335; p-value: 2.6 x 10^-158
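A per-protein p-value of this kind can be sketched with a one-sided hypergeometric test on spectral counts. The sketch assumes the protein's PSMs are treated as draws without replacement from the pooled PSMs of both samples, using the lysate and pellicle totals reported for the 625 proteins; whether the talk used exactly this test is an assumption, and the function name is illustrative. Under these assumptions the Na/K+ ATPase example should come out on the order of 10^-33.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for n draws without replacement from N items, K of them 'successes'.

    Here: n = the protein's PSMs in both samples, K = all PSMs in the
    sample of interest, N = all PSMs in both samples.
    """
    numerator = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, min(n, K) + 1))
    return numerator / comb(N, n)  # exact integers, converted to float at the end


lysate_total, pellicle_total = 18976, 13723   # PSM totals from the slides
lysate_count, pellicle_count = 1, 90          # Na/K+ ATPase subunit alpha-1

p_value = hypergeom_upper_tail(
    pellicle_count,
    lysate_count + pellicle_count,
    pellicle_total,
    lysate_total + pellicle_total,
)
print(f"{p_value:.1e}")
```

The same function can be applied to the pooled counts of an annotation category (for example, all PSMs from transmembrane proteins) by passing the category's counts in place of a single protein's.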

Page 45: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Distribution of p-values (Yeast)


Page 46: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

A protein's PSMs rise and fall together!


Page 47: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

A protein's PSMs rise and fall together?


Page 48: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Anomalies indicate proteoforms


Page 49: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

HER2/Neu Mouse Model of Breast Cancer

• Paulovich, et al. JPR, 2007
• Study of normal and tumor mammary tissue by LC-MS/MS
  • 1.4 million MS/MS spectra
• Peptide-spectrum assignments
  • Normal samples (Nn): 161,286 (49.7%)
  • Tumor samples (Nt): 163,068 (50.3%)
• 4270 proteins identified in total
  • 2-unique generalized protein parsimony

Page 50: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Nascent polypeptide-associated complex subunit alpha

7.3 x 10^-8

Page 51: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Pyruvate kinase isozymes M1/M2

2.5 x 10^-5

Page 52: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Summary

• Improve the scope and sensitivity of peptide identification for genome annotation, using
  • Exhaustive peptide sequence databases
  • Machine-learning for combining
  • Meta-search tools to maximize consensus
  • Grid-computing for thorough search

Page 53: Generalized Protein Parsimony and Spectral Counting for Functional Enrichment Analysis

Summary

• Functional annotation enrichment for proteomics too:
  • Careful counting (generalized parsimony)
  • Differential abundance by spectral counts
• Use (multivariate-)hypergeometric model for
  • Differential abundance by spectral counts
  • Proteoform detection