02/05/10 CSCE 791
De Novo Peptide Sequencing: Informatics and Pattern Recognition Applied to Proteomics
John R. Rose
Computer Science and Engineering
University of South Carolina
Overview
• Background
• Information Theoretic Scoring Function
• Test Data Set
• Comparison with Existing Methods
• Conclusions
• Future Work
Background
• Analogy:
– Genome: machine code
– Proteome: execution of code
• Protein identification is important
– For drug discovery research
– For the identification of microbes in environmental samples
• Approaches using tandem mass spectrometry data:
– Database searching
– De novo sequencing
– Tagging
Tandem MS Data
• A peptide is ionized and the peptide bonds are fragmented.
• Fragment ions form peaks in the spectrum corresponding to their mass-to-charge ratio.
[Figure: example tandem mass spectrum; x-axis: m/z (0 to 2500); y-axis: Intens. [a.u.] (0 to 3000); labeled peaks include m/z 765.373, 1818.190, and 2117.187]
Tandem MS Data
• Fragment ions include a, b, c, x, y, and z ions.
• De novo sequencing focuses on y and b ions.
– y ions contain the carboxyl terminus
– b ions contain the amino terminus
Tandem MS Data
• A good quality spectrum consists of
– a ladder of peaks of the y-ions and
– a ladder of peaks of the b-ions
• Ex:
b-ions    y-ions
F         GLSLVR
FG        LSLVR
FGL       SLVR
FGLS      LVR
FGLSL     VR
FGLSLV    R
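The two ladders above can be computed for any peptide. The following is a minimal sketch, assuming singly charged ions and standard monoisotopic residue masses; the function name `ladders` and the rounding to three decimals are illustrative choices, not part of the talk.

```python
# Monoisotopic residue masses in Da (standard values; I and L are isobaric).
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
WATER = 18.01056   # H2O
PROTON = 1.00728   # charge carrier for a singly charged ion


def ladders(peptide):
    """Return (b_fragment, b_mz, y_fragment, y_mz) for every backbone cleavage."""
    rows = []
    for i in range(1, len(peptide)):
        b, y = peptide[:i], peptide[i:]
        # b ions keep the amino terminus: residues + proton.
        b_mz = sum(MONO[a] for a in b) + PROTON
        # y ions keep the carboxyl terminus: residues + water + proton.
        y_mz = sum(MONO[a] for a in y) + WATER + PROTON
        rows.append((b, round(b_mz, 3), y, round(y_mz, 3)))
    return rows


for row in ladders("FGLSLVR"):
    print(row)
```

Each printed row pairs one b-ion prefix with its complementary y-ion suffix, mirroring the table on the slide.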
Approaches to peptide identification
Frank et al. JPR. 2006.
De Novo Sequencing
• Data: tandem MS spectrum
• Goal: find the corresponding peptide
• General approach:
– Identify y and/or b ions
– Propose candidate peptides
– Score each candidate
– Return highest ranking peptides
• Two key issues:
– Model for candidate peptide generation
– Scoring function to evaluate candidates
Candidate Peptide Generation
• The peptide sequence can be derived by the mass differences of adjacent peaks in each of the two ladders
• Ex (peptide IYEVEGMR):
b-ions     y-ions
I          YEVEGMR
IY         EVEGMR
IYE        VEGMR
IYEV       EGMR
IYEVE      GMR
IYEVEG     MR
IYEVEGM    R
• Complicating factors:
– Missing peaks
– Posttranslational modifications
– Many-to-one equivalences, e.g., AG, GA, K, Q, E are similar in mass
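Reading residues off a ladder can be sketched as a lookup of each mass gap against a residue mass table. This is only an illustration under stated assumptions: monoisotopic masses, a hypothetical tolerance of 0.05 Da, single-residue gaps only, and b-ion m/z values for IYEVEGMR computed by hand from the mass table.

```python
# Monoisotopic residue masses in Da (standard values; I and L are isobaric).
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}


def residues_for_gap(gap, tol=0.05):
    """All single residues whose mass lies within tol Da of the gap."""
    return sorted(a for a, m in MONO.items() if abs(m - gap) <= tol)


def read_ladder(peaks, tol=0.05):
    """Candidate residues for each gap between adjacent ladder peaks."""
    peaks = sorted(peaks)
    return [residues_for_gap(hi - lo, tol) for lo, hi in zip(peaks, peaks[1:])]


# Singly charged b1..b7 ions of IYEVEGMR (assumed, idealized peak list).
b_ions = [114.091, 277.155, 406.197, 505.266, 634.308, 691.330, 822.370]
print(read_ladder(b_ions))       # one candidate list per gap
print(residues_for_gap(113.084)) # I and L cannot be told apart by mass
print(residues_for_gap(128.08))  # K and Q fall within the same tolerance
```

With a complete, clean ladder each gap yields a single residue; the last two calls show how isobaric or near-isobaric residues already produce more than one candidate.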
Actual example of labeled y and b ion peaks
The spectrum graph
Frank et al. JPR. 2006.
Construction of the NC-spectrum Graph
Chen et al. JCB 2001
• Create a pair of nodes, Nj and Cj, for each ion Ij.
• Create two auxiliary nodes, N0 and C0, to represent the zero mass and the parent mass, respectively.
• Let V = {N0, N1, …, Nk, C0, C1, …, Ck}.
• Each node x is assigned a coordinate cord(x) according to the total mass of its amino acids, that is,
\[
\mathrm{cord}(x) =
\begin{cases}
0 & x = N_0 \\
W - 18 & x = C_0 \\
w_j - 1 & x = N_j \\
W - w_j + 1 & x = C_j
\end{cases}
\]
where W is the parent mass and w_j is the mass of ion Ij.
[Figure: nodes on the mass axis: N0 at 0, C2 at 87.10, C1 at 174.11, N1 at 273.11, N2 at 360.12, C0 at 429.22]
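The coordinate rule can be checked against the numbers on the following slides (W = 447.225, peaks at 274.112 and 361.121). This is a minimal sketch; the function name `nc_coordinates` and the dictionary layout are illustrative, not from the talk.

```python
def nc_coordinates(parent_mass, ion_masses):
    """Place N/C node pairs on the mass axis following the cord(x) rule."""
    coords = {"N0": 0.0, "C0": parent_mass - 18}
    for j, w in enumerate(ion_masses, start=1):
        coords[f"N{j}"] = w - 1                # ion read as an N-terminal fragment
        coords[f"C{j}"] = parent_mass - w + 1  # complementary C-terminal reading
    return coords


coords = nc_coordinates(447.225, [274.112, 361.121])
for name, c in sorted(coords.items(), key=lambda kv: kv[1]):
    print(f"{name}: {c:.3f}")
```

Sorted by coordinate, the nodes come out in the order N0, C2, C1, N1, N2, C0, which agrees with the slide values up to rounding.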
Construction of the NC-spectrum Graph
[Figure: spectrum (x-axis: Mass / Charge; y-axis: Abundance) with peaks at 274.112 and 361.121 and W = 447.225; nodes N0 (0) and C0 (429.22) are placed on the mass axis]
Construction of the NC-spectrum Graph
[Figure: same spectrum (peaks at 274.112 and 361.121, W = 447.225); nodes C1 (174.11) and N1 (273.11) are added for the peak at 274.112]
Construction of the NC-spectrum Graph
[Figure: same spectrum (peaks at 274.112 and 361.121, W = 447.225); nodes C2 (87.10) and N2 (360.12) are added for the peak at 361.121]
Construction of the NC-spectrum Graph
[Figure: NC-spectrum graph on nodes N0 (0), C2 (87.10), C1 (174.11), N1 (273.11), N2 (360.12), C0 (429.22), with edges labeled by amino acid mass: Mass(S) = 87.08, Mass(W) = 186.21, Mass(S+W) = 273.29, Mass(R) = 156.19]
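Edge construction can be sketched as follows: two nodes are connected when their coordinate difference matches a residue mass. Assumptions to note: the code uses monoisotopic masses (the slide labels appear to be average masses), a hypothetical tolerance of 0.1 Da, and single-residue edges only, so the S+W edge shown on the slide is not generated.

```python
# Monoisotopic residue masses in Da (standard values; I and L are isobaric).
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}


def build_edges(coords, tol=0.1):
    """Directed edges (u, v, residue) where cord(v) - cord(u) matches a residue."""
    names = sorted(coords, key=coords.get)  # left to right on the mass axis
    edges = []
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            gap = coords[v] - coords[u]
            for aa, mass in MONO.items():
                if abs(gap - mass) <= tol:
                    edges.append((u, v, aa))
    return edges


# Node coordinates from the example slides (W = 447.225).
coords = {"N0": 0.0, "C2": 87.104, "C1": 174.113,
          "N1": 273.112, "N2": 360.121, "C0": 429.225}
edges = build_edges(coords)
for edge in edges:
    print(edge)
```

Among the printed edges are N0 to C2 (S), C2 to N1 (W), and N1 to C0 (R), which together spell the path S, W, R. A looser tolerance admits additional, spurious edges, which is one reason noise peaks complicate the search.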
Construction of the NC-spectrum Graph
[Figure: NC-spectrum graph on nodes N0 (0), C2 (87.10), C1 (174.11), N1 (273.11), N2 (360.12), C0 (429.22)]
• Each path from N0 to C0 represents a possible sequence for the peptide.
• A feasible path is a path from N0 to C0 that goes through exactly one node for each pair (either Nj or Cj).
Construction of the NC-spectrum Graph
[Figure: NC-spectrum graph with one path highlighted]
This is not a feasible path: it misses ion I2.
Construction of the NC-spectrum Graph
[Figure: NC-spectrum graph with one path highlighted]
This is a feasible path.
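The feasibility condition (exactly one of Nj or Cj for every ion) can be expressed as a small check. This sketch assumes a path is given as a list of node names and does not verify that consecutive nodes are actually joined by edges; the two example paths are illustrative guesses at what the slides highlight.

```python
def is_feasible(path, k):
    """True if path runs N0 .. C0 and uses exactly one of Nj/Cj for j = 1..k."""
    if len(path) < 2 or path[0] != "N0" or path[-1] != "C0":
        return False
    inner = path[1:-1]
    for j in range(1, k + 1):
        hits = inner.count(f"N{j}") + inner.count(f"C{j}")
        if hits != 1:  # pair j is skipped or visited twice
            return False
    return True


print(is_feasible(["N0", "N1", "C0"], k=2))        # False: no node for ion I2
print(is_feasible(["N0", "C2", "N1", "C0"], k=2))  # True: one node per ion
```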
Problem Reformulation
• Input: an NC-spectrum graph G.
• Output: a feasible path from N0 to C0.
• Difficulty:
– A longest path does not always go through exactly one of each pair of nodes.
– This is an NP-hard problem if the graph is a general directed graph.
Renaming Nodes
Rename the nodes from left to right as X0, …, Xk, Yk, …, Y0.
[Figure: the example nodes at 0, 87.10, 174.11, 273.11, 360.12, and 429.22 are renamed X0, X1, X2, Y2, Y1, Y0]
Xi and Yi form a complementary pair of nodes for ion i.
Problem Reformulation
[Figure: nodes in left-to-right order X0, X1, …, Xk, Yk, …, Y1, Y0]
• Let M(i, j) be a two-dimensional matrix with 0 ≤ i, j ≤ k.
• Let M(i, j) = 1 if
– there exists a path L from X0 to Xi and a path R from Yj to Y0, such that L and R together contain exactly one of Xp and Yp for each p in [0, max{i, j}].
[Figure: path L from X0 to Xi and path R from Yj to Y0]
Problem Reformulation
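[Figure: path L from X0 to Xk and path R from Yj to Y0, joined by an edge e from Xk to Yj]
• There is a feasible path if and only if
– for some i, there is an edge e from Xi to Yk and M(i, k) = 1, or
– for some j, there is an edge e from Xk to Yj and M(k, j) = 1
[Figure: path L from X0 to Xi and path R from Yk to Y0, joined by an edge e from Xi to Yk]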
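One way to read the M(i, j) table is as a dynamic program over states (i, j), processed in order of increasing max{i, j}: pair m is assigned either to the left path (extend L to Xm) or to the right path (extend R back to Ym). The sketch below follows that reading; it is not claimed to be the exact procedure of Chen et al., and the node encoding `("X", p)` / `("Y", p)` and the edge set are hypothetical.

```python
def has_feasible_path(k, edges):
    """Decide whether a feasible path X0 -> ... -> Y0 exists.

    edges is a set of (u, v) pairs with nodes written ("X", p) or ("Y", p).
    A state (i, j) means: L ends at Xi, R starts at Yj, and every pair
    up to max(i, j) is covered exactly once, i.e. M(i, j) = 1.
    """
    def edge(u, v):
        return (u, v) in edges

    states = {(0, 0)}
    for m in range(1, k + 1):
        nxt = set()
        for i, j in states:
            if edge(("X", i), ("X", m)):   # pair m joins the left path
                nxt.add((m, j))
            if edge(("Y", m), ("Y", j)):   # pair m joins the right path
                nxt.add((i, m))
        states = nxt
    # Close the path with a single edge from the end of L to the start of R.
    return any(edge(("X", i), ("Y", j)) for i, j in states)


# Example graph after renaming: X0=N0, X1=C2, X2=C1, Y2=N1, Y1=N2, Y0=C0.
good = {(("X", 0), ("X", 1)), (("X", 1), ("Y", 2)), (("Y", 2), ("Y", 0))}
bad = {(("X", 0), ("Y", 2)), (("Y", 2), ("Y", 0))}
print(has_feasible_path(2, good))  # True: X0 -> X1 -> Y2 -> Y0
print(has_feasible_path(2, bad))   # False: pair 1 is never visited
```

Each level holds at most 2m states, so the number of edge lookups grows quadratically in k rather than exponentially, which is the point of restricting attention to the (i, j) table.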
Candidate Peptide Generation
• Complicating factors:
– Posttranslational modifications
– Many-to-one equivalences, e.g., AG, GA, K, Q, E are similar in mass
– Noise peaks
– Missing peaks
Candidate Peptide Generation
– Missing peaks
• Now a many-to-many combinatorial problem
• Ex: ATEEQLK
• If the b4 ion is missing, then b3 represents ATE and b5 represents ATEEQ
• Then the mass difference for EQ is unresolved.
• Recall that AG, GA, K, Q, E are similar in mass
• Thus EQ, QE, AGQ, GAQ, AGE, GAE, … have similar mass
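The combinatorial growth caused by a missing peak can be illustrated by enumerating residue compositions that fit the unresolved gap. This is a sketch under assumptions: monoisotopic masses, a hypothetical tolerance of 0.02 Da, and at most three residues per gap. Each hit is an unordered composition written in alphabetical order, so EQ stands for both EQ and QE.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da (standard values; I and L are isobaric).
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}


def compositions_for_gap(gap, max_len=3, tol=0.02):
    """Residue compositions (up to max_len residues) whose mass fits the gap."""
    letters = sorted(MONO)
    hits = []
    for n in range(1, max_len + 1):
        for combo in combinations_with_replacement(letters, n):
            if abs(sum(MONO[a] for a in combo) - gap) <= tol:
                hits.append("".join(combo))
    return hits


# Gap left between b3 (ATE) and b5 (ATEEQ) when b4 is missing.
gap = MONO["E"] + MONO["Q"]
hits = compositions_for_gap(gap)
print(hits)
```

Every composition in the output must then be expanded into its orderings, which is where a sequence-content model becomes useful for ranking the alternatives.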
Candidate Peptide Evaluation
• Model for candidate generation
– Traditional focus on fragmentation model
• Increasing fragmentation model sophistication
• Better posttranslational modification models
• No model of peptide amino acid content
– QuasiNovo approach
• Unsophisticated fragmentation model
• No posttranslational modification model
• Uses information theory to model peptide amino acid content
Modeling Peptide Amino Acid Content
• Basic Idea: Examine actual proteins to characterize likely combinations of amino acids
• Underlying hypothesis: amino acid content is not random
• Analogy: model letter combinations in a language
– examine documents in that language
– compile profiles of letter combinations
– predict missing letters from partial data
• Motivation:
– Ability to distinguish between mass-equivalent combinations
– Ability to deal with missing peaks
Amino Acid Distribution Data
Tabulation of amino acid distributions:
Let <a1a2…an> be a contiguous sequence of n amino acids.
– There are n amino acids: <a1>, <a2>, …, <an>
– There are n-1 ordered amino acid pairs: <a1a2>, <a2a3>, …, <an-1an>
– etc.
QuasiNovo has been evaluated with 3-, 4-, 5-, and 6-tuples.
Tuple frequencies are then normalized.
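The tabulation can be sketched as a sliding window over each peptide followed by normalization. The function name `tuple_frequencies` and the choice to normalize by the total tuple count are assumptions; the slides do not specify the normalization.

```python
from collections import Counter


def tuple_frequencies(peptides, n):
    """Relative frequency of every contiguous n-tuple across the peptides."""
    counts = Counter()
    for pep in peptides:
        # A peptide of length m contributes m - n + 1 overlapping n-tuples.
        for i in range(len(pep) - n + 1):
            counts[pep[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: c / total for t, c in counts.items()}


pairs = tuple_frequencies(["FGLSLVR"], 2)
singles = tuple_frequencies(["FGLSLVR"], 1)
print(pairs)
print(singles["L"])  # L occurs twice among seven residues
```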
Amino Acid Distribution Data
Three amino acid profiles used:
1. Gammaproteobacteria:
– 206 complete genomes
– 23,882,564 tryptic peptides
2. Actinobacteria:
– 58 complete genomes
– 7,380,927 tryptic peptides generated
3. Mammalia:
– 4 complete genomes: Bovine, Human, Mouse, Rat
– 9,835,585 tryptic peptides generated
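The slides do not say how the tryptic peptides were generated. A common in-silico rule, assumed here, is to cleave after K or R unless the next residue is P, with no missed cleavages; the function below is a sketch of that rule only.

```python
def tryptic_digest(protein):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):  # trailing peptide without a K/R terminus
        peptides.append(protein[start:])
    return peptides


print(tryptic_digest("AKRPGKLR"))  # ['AK', 'RPGK', 'LR']
```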
QuasiNovo’s Use of Tuple-Profiles
• Score candidate peptides
score(FGLSLVR) = p(SLVR)p(L|SLVR)p(G|LSLV)p(F|GLSL)
• Discard poor scoring candidates
• Handle missing peaks
– Find the set of a_i that maximizes P(a_i | a_{i-4} a_{i-3} a_{i-2} a_{i-1})
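The score(FGLSLVR) factorization can be sketched with 4-tuple and 5-tuple counts: start with the probability of the last four residues, then multiply in each earlier residue conditioned on the four residues that follow it. The count-ratio estimate of the conditional probabilities, the zero score for unseen contexts, and the two-peptide training set are assumptions made for illustration.

```python
from collections import Counter


def count_tuples(peptides, n):
    """Raw counts of contiguous n-tuples across the peptides."""
    counts = Counter()
    for pep in peptides:
        for i in range(len(pep) - n + 1):
            counts[pep[i:i + n]] += 1
    return counts


def score(peptide, c4, c5):
    """p(last 4-tuple) times p(a_i | next four residues) for each earlier a_i."""
    if len(peptide) < 4:
        raise ValueError("peptide must have at least 4 residues")
    total4 = sum(c4.values())
    if total4 == 0:
        return 0.0
    s = c4[peptide[-4:]] / total4
    for i in range(len(peptide) - 5, -1, -1):
        context = peptide[i + 1:i + 5]
        if c4[context] == 0:  # unseen context: no estimate available
            return 0.0
        s *= c5[peptide[i:i + 5]] / c4[context]
    return s


training = ["FGLSLVR", "AGLSLVR"]  # toy profile, not real genome data
c4, c5 = count_tuples(training, 4), count_tuples(training, 5)
print(score("FGLSLVR", c4, c5))  # 0.25 * 1 * 1 * 0.5 = 0.125
```

In this toy profile the only uncertain factor is the first residue: F and A each follow the context GLSL once, so p(F|GLSL) = 0.5.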
Test Data Set
280 spectra of peptides selected by Frank & Pevzner (2005)
– molecular mass of up to 1400 Da
– peptides with 7-16 amino acids (average length of 10.5)
– source: ISB protein mixture data set and Open Proteomics Database
Data set used to compare PepNovo with
– Sherenga
– Peaks
– Lutefisk
Later used to compare NovoHMM with
– PepNovo
– Sherenga
– Peaks
– Lutefisk
Results
• The contenders:
– PepNovo v1.03
– PepNovo+
– NovoHMM
– QuasiNovo
– QuasiNovo Reranking
Results
[Figure: % correct (0 to 100) versus number of incorrect residues (0 to 3) for PepNovo+, PepNovo v1.03, NovoHMM, QuasiNovo, and QuasiNovo Reranking]
Results for the set of 280 MS-MS test spectra, comparing PepNovo+, PepNovo, and NovoHMM with QuasiNovo and QuasiNovo reranking.
Results
Results for the set of 76 MS-MS test spectra for E. coli peptides, comparing PepNovo+, PepNovo, and NovoHMM with three QuasiNovo scoring functions based on amino acid distributions in Gammaproteobacteria, Actinobacteria, and Mammalia.
[Figure: % correct (0 to 100) versus number of incorrect residues (0 to 3) for PepNovo+, PepNovo v1.03, NovoHMM, Gammaproteobacteria, Actinobacteria, and Mammalia]
Results
Comparison of Terminal Pair and Overall Accuracy
Algorithm              Terminal ion pair      Complete peptide
                       b2-ion     y2-ion
PepNovo+               0.509      0.616       0.702
NovoHMM                0.523      0.759       0.735
QuasiNovo Reranking    0.716      0.813       0.815
Conclusions and Future Work
• The QuasiNovo peptide model
– predicts peptide amino acid content
– has limited understanding of fragmentation
– outperforms PepNovo+ and NovoHMM
• QuasiNovo reranking
– reranks PepNovo+ and NovoHMM results
– proof-of-concept for combining peptide & fragmentation models
– shows best overall performance
• Future: Combine the QuasiNovo amino acid model with a sophisticated fragmentation model
Acknowledgements
Rose Lab
– Jimmy Cleveland
– Achraf Elallali
– Amadeo Bellotti
Fox Lab
– Alvin Fox
– Karen Fox
– Jennifer Intelicato-Young
Support
– Funding from Alfred P. Sloan Foundation
– Experiments were conducted on a 128-core shared memory computer funded by NSF (CNS 0708391).
Gammaproteobacteria
• Cumulative results from 174 spectra
• x = n: number of correctly predicted amino acids
• Note: a predicted amino acid is correct if it appears within 2.5 Da of its position in the actual peptide
[Figure: chart of the tabulated values (0.00 to 1.00) for x = 3 to x = 12; series: QuasiNovo, MM Reranking, NovoHMM, PepNovo+]
              x = 3  x = 4  x = 5  x = 6  x = 7  x = 8  x = 9  x = 10  x = 11  x = 12
QuasiNovo     0.95   0.90   0.82   0.74   0.68   0.64   0.63   0.60    0.58    0.50
MM Reranking  0.97   0.93   0.90   0.86   0.84   0.81   0.71   0.63    0.51    0.46
NovoHMM       0.97   0.94   0.87   0.75   0.61   0.45   0.27   0.23    0.13    0.02
PepNovo+      0.95   0.93   0.84   0.67   0.55   0.38   0.26   0.20    0.11    0.09
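The correctness criterion in the note can be sketched by comparing the mass offset of each residue from the N-terminus. Interpreting "position" as the summed mass of the preceding residues is an assumption, as are the monoisotopic mass table and the function names.

```python
# Monoisotopic residue masses in Da (standard values; I and L are isobaric).
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
        "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
        "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
        "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
        "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}


def prefix_masses(peptide):
    """Mass offset (Da) at which each residue starts, from the N-terminus."""
    offsets, total = [], 0.0
    for aa in peptide:
        offsets.append(total)
        total += MONO[aa]
    return offsets


def count_correct(predicted, actual, tol=2.5):
    """Count predicted residues found in the actual peptide within tol Da."""
    actual_pos = list(zip(actual, prefix_masses(actual)))
    correct = 0
    for aa, offset in zip(predicted, prefix_masses(predicted)):
        if any(aa == b and abs(offset - m) <= tol for b, m in actual_pos):
            correct += 1
    return correct


print(count_correct("FGLSLVR", "FGLSLVR"))  # 7: every residue matches
print(count_correct("GFLSLVR", "FGLSLVR"))  # 5: the swapped F and G are displaced
```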
Actinobacteria
• Cumulative results from 27 spectra
• x = n: number of correctly predicted amino acids
• Note: a predicted amino acid is correct if it appears within 2.5 Da of its position in the actual peptide
              x = 3  x = 4  x = 5  x = 6  x = 7  x = 8  x = 9  x = 10  x = 11  x = 12
QuasiNovo     0.95   0.86   0.68   0.59   0.36   0.32   0.29   0.29    0.19    0.21
MM Reranking  1.00   1.00   0.93   0.89   0.89   0.70   0.62   0.46    0.35    0.38
NovoHMM       1.00   0.85   0.81   0.70   0.48   0.22   0.12   0.08    0.00    0.00
PepNovo+      0.96   0.96   0.93   0.78   0.70   0.48   0.31   0.23    0.10    0.00
[Figure: chart of the tabulated values (0.00 to 1.00) for x = 3 to x = 12; series: QuasiNovo, MM Reranking, NovoHMM, PepNovo+]
Results: Mammalia
• Cumulative results from 79 spectra
• x = n: number of correctly predicted amino acids
• Note: a predicted amino acid is correct if it appears within 2.5 Da of its position in the actual peptide
              x = 3  x = 4  x = 5  x = 6  x = 7  x = 8  x = 9  x = 10  x = 11  x = 12
QuasiNovo     0.87   0.66   0.55   0.49   0.36   0.34   0.33   0.25    0.18    0.25
MM Reranking  0.92   0.87   0.78   0.67   0.63   0.52   0.45   0.38    0.37    0.36
NovoHMM       0.87   0.85   0.71   0.57   0.44   0.32   0.19   0.14    0.04    0.00
PepNovo+      0.90   0.86   0.77   0.72   0.56   0.32   0.26   0.21    0.07    0.00
[Figure: chart of the tabulated values (0.00 to 1.00) for x = 3 to x = 12; series: QuasiNovo, MM Reranking, NovoHMM, PepNovo+]
EF-Tu Protein
• DISTILLER/MASCOT identification: AIDKPFLLPIEDVFSISGR
• QuasiNovo identification: DSDKPFMMPVEDVFSITGR
– Score(AIDKPFLLPIEDVFSISGR) = 1.83164551734336e-38
– Score(DSDKPFMMPVEDVFSITGR) = 7.10172913187262e-36
• QuasiNovo result supported by microbiological data
– Gram stain
– physiological tests
– visual comparison of spectra of environmental isolates versus known S. aureus and interpretation of Distiller/Mascot sequence assignment
• Note: Distiller results based on 18 peaks vs 12 peaks for QuasiNovo
• Peptide displays loss of 3 water molecules