Proteomics Informatics – Protein identification I: searching protein
sequence collections and significance testing (Week 4)
Peptide Mapping - Mass Accuracy
Peptide Mapping - Database Size
[Figure panels: C. elegans, S. cerevisiae, Human]
Peptide Mapping - Cys-Containing Peptides
[Figure panels: C. elegans, S. cerevisiae, Human]
Identification – Peptide Mass Fingerprinting
[Workflow diagram: MS data are searched against a Sequence DB. Pick Protein → Digestion → All Peptide Masses → Compare with MS, Score, Test Significance; Repeat for each protein → Identified Proteins]
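The loop in this workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tryptic cleavage after K or R (not before P), singly protonated monoisotopic masses, and a simple match count in place of a real score. The function names, the two sequences and the mass list are hypothetical, and no significance test is included.

```python
# Monoisotopic residue masses (Da), as in the amino acid mass table of this deck.
MONO = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER, PROTON = 18.01056, 1.00728


def tryptic_peptides(sequence, missed_cleavages=0):
    """In-silico digestion: cleave after K or R unless the next residue is P."""
    sites = [i + 1 for i, aa in enumerate(sequence[:-1])
             if aa in "KR" and sequence[i + 1] != "P"]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        last = min(i + 2 + missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides


def mh_mass(peptide):
    """Monoisotopic [M+H]+ mass of a peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def count_matches(observed, protein_seq, tol=0.5, missed_cleavages=0):
    """Number of observed masses within tol (Da) of any theoretical peptide mass."""
    theoretical = [mh_mass(p) for p in tryptic_peptides(protein_seq, missed_cleavages)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)


def rank_proteins(observed, database, tol=0.5):
    """Pick each protein in turn, compare, and sort by match count (placeholder score)."""
    scores = {name: count_matches(observed, seq, tol) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


database = {"protA": "SGFLEEDELKAAR", "protB": "MKTAYIAK"}  # hypothetical sequences
observed = [1166.56, 317.19]                                # hypothetical peak list
ranking = rank_proteins(observed, database)
print(ranking)
```

The `tol` and `missed_cleavages` arguments correspond to the mass tolerance and missed cleavage parameters discussed in the peptide mapping slides.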
ProFound – Search Parameters
http://prowl.rockefeller.edu/
ProFound – Protein Identification by Peptide Mapping
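\[
P(k \mid D I) \propto P(k \mid I)\,\frac{(N-r)!}{N!}\,\prod_{i=1}^{r}\left\{ \sqrt{\frac{2}{\pi}}\,\frac{m_{\max}-m_{\min}}{\sigma_i}\,\sum_{j=1}^{g_i}\exp\left[-\frac{(m_i-m_{ij0})^2}{2\sigma_i^2}\right]\right\} F_{\mathrm{pattern}}
\]
W. Zhang & B.T. Chait, Analytical Chemistry 72 (2000) 2482-2489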
ProFound Results
Peptide Mapping – Mass Accuracy
[Figure: ProFound, -log(e) vs. Mass Tolerance (Da, 0 to 2); Mascot, Score vs. Mass Tolerance (Da, 0 to 2)]
Peptide Mapping - Database Size
[Figure panels: S. cerevisiae, Fungi, All Taxa]
Expectation Values (peptide mapping example), by database size:
  S. cerevisiae   4.8e-7
  Fungi           8.4e-6
  All Taxa        2.9e-4
Missed Cleavage Sites
[Figure panels: u = 1, u = 2, u = 4]
Expectation Values (peptide mapping example):
  u = 1   4.8e-7
  u = 2   1.1e-5
  u = 4   6.8e-4
Peptide Mapping - Partial Modifications
No Modifications
Phosphorylation (S, T, or Y)
Searched Searched With Without Possible Modifications Phosphorylation
of S/T/Y
DARPP-32 0.00006 0.01
CFTR 0.00002 0.005
Even if the protein is modified, it is usually better to search a protein sequence database without specifying possible modifications when using peptide mapping data.
Peptide Mapping - Ranking by Direct Calculation of the Significance
General Criteria for a Good Protein Identification Algorithm
The response to random input data should be random.
Maximum number of correct identifications and minimum number of incorrect identifications for any data set.
Maximal separation between scores for correct identifications and the distribution of scores for random matching proteins for any data set.
The statistical significance of the results should be calculated.
The searches should be fast.
Response to Random Data
[Figure: y-axis Normalized Frequency]
Peptide Fragmentation
[Diagram: Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector; b and y fragment ions]
Identification – Tandem MS
[Figure: MS/MS spectrum, % Relative Abundance (0 to 100) vs. m/z (250 to 1000)]
Tandem MS – Sequence Confirmation
[Figure: MS/MS spectrum, % Relative Abundance (0 to 100) vs. m/z (250 to 1000)]
Peptide as drawn on the slide: KLEDEELFGS (C-terminus first; the b ion series starts at S)

Residue   y ions   b ions
K           147     1166
L           260     1020
E           389      907
D           504      778
E           633      663
E           762      534
L           875      405
F          1022      292
G          1080      145
S          1166       88
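The b and y ion values on this slide can be recomputed from the residue masses. The sketch below assumes singly charged fragments, monoisotopic masses, and the peptide read from N- to C-terminus as SGFLEEDELK; the function name is hypothetical.

```python
# Monoisotopic residue masses (Da), as in the amino acid mass table of this deck.
MONO = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER, PROTON = 18.01056, 1.00728


def fragment_ions(peptide):
    """Singly charged b and y ion m/z values for a peptide written N- to C-terminus."""
    masses = [MONO[aa] for aa in peptide]
    # b ions: N-terminal fragments plus one proton.
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    # y ions: C-terminal fragments plus water plus one proton.
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b, y


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} {b:8.2f}   y{i} {y:8.2f}")
```

The computed values agree with the nominal masses on the slide to within 1 Da; the last entry in each ladder on the slide (1166) is the intact peptide M+H.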
Tandem MS – Sequence Confirmation
[Figure: MS/MS spectrum of KLEDEELFGS, % Relative Abundance vs. m/z (250 to 1000), with the [M+2H]2+ peak and fragment peaks labeled at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022 and 1080; y and b ion ladders as on the previous slide]
Tandem MS – Sequence Confirmation
[Figure: the same labeled spectrum and ion ladders for KLEDEELFGS, with mass differences of 113 and then 129 marked on the spectrum]
Tandem MS – de novo Sequencing
[Figure: MS/MS spectrum, % Relative Abundance vs. m/z (250 to 1000), with the [M+2H]2+ peak and peaks labeled at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022 and 1080]
Mass Differences: Amino acid masses

1-letter code  3-letter code  Chemical formula  Monoisotopic  Average
A              Ala            C3H5ON            71.0371       71.0788
R              Arg            C6H12ON4          156.101       156.188
N              Asn            C4H6O2N2          114.043       114.104
D              Asp            C4H5O3N           115.027       115.089
C              Cys            C3H5ONS           103.009       103.139
E              Glu            C5H7O3N           129.043       129.116
Q              Gln            C5H8O2N2          128.059       128.131
G              Gly            C2H3ON            57.0215       57.0519
H              His            C6H7ON3           137.059       137.141
I              Ile            C6H11ON           113.084       113.159
L              Leu            C6H11ON           113.084       113.159
K              Lys            C6H12ON2          128.095       128.174
M              Met            C5H9ONS           131.04        131.193
F              Phe            C9H9ON            147.068       147.177
P              Pro            C5H7ON            97.0528       97.1167
S              Ser            C3H5O2N           87.032        87.0782
T              Thr            C4H7O2N           101.048       101.105
W              Trp            C11H10ON2         186.079       186.213
Y              Tyr            C9H9O2N           163.063       163.176
V              Val            C5H9ON            99.0684       99.1326
Tandem MS – de novo Sequencing
Sequences consistent with spectrum
Mass differences between pairs of peaks (row = lower peak, column = higher peak):

       260  292  389  405  504  534  633  663  762  778  875  907 1020 1022 1079
  260        32  129  145  244  274  373  403  502  518  615  647  760  762  819
  292             97  113  212  242  341  371  470  486  583  615  728  730  787
  389                  16  115  145  244  274  373  389  486  518  631  633  690
  405                       99  129  228  258  357  373  470  502  615  617  674
  504                            30  129  159  258  274  371  403  516  518  575
  534                                 99  129  228  244  341  373  486  488  545
  633                                      30  129  145  242  274  387  389  446
  663                                           99  115  212  244  357  359  416
  762                                                16  113  145  258  260  317
  778                                                     97  129  242  244  301
  875                                                          32  145  147  204
  907                                                              113  115  172
 1020                                                                     2   59
 1022                                                                         57
Mass differences equal to an amino acid residue mass, replaced by the residue:

       260  292  389  405  504  534  633  663  762  778  875  907 1020 1022 1079
  260        32    E  145  244  274  373  403  502  518  615  647  760  762  819
  292              P  I/L  212  242  341  371  470  486  583  615  728  730  787
  389                  16    D  145  244  274  373  389  486  518  631  633  690
  405                        V    E  228  258  357  373  470  502  615  617  674
  504                            30    E  159  258  274  371  403  516  518  575
  534                                  V    E  228  244  341  373  486  488  545
  633                                      30    E  145  242  274  387  389  446
  663                                            V    D  212  244  357  359  416
  762                                                16  I/L  145  258  260  317
  778                                                      P    E  242  244  301
  875                                                          32  145    F  204
  907                                                              I/L    D  172
 1020                                                                     2   59
 1022                                                                          G
Tandem MS – de novo Sequencing
[Slide overlay: six X marks]
Sequences consistent with spectrum:
…GF(I/L)EEDE(I/L)…
…(I/L)EDEE(I/L)FG…
Peptide M+H = 1166
1166 - 1079 = 87 => S
SGF(I/L)EEDE(I/L)…
1166 - 1020 - 18 = 128 => K or Q
SGF(I/L)EEDE(I/L)(K/Q)
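The mass-difference step can be sketched as follows: every pair of peaks is checked against the monoisotopic residue masses from the table. The 0.5 Da tolerance and the function names are assumptions for this illustration; building full sequence ladders from the annotated pairs is not shown.

```python
from itertools import combinations

# Monoisotopic residue masses (Da), as in the amino acid mass table of this deck.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}


def residues_for(delta, tol=0.5):
    """Residues whose mass lies within tol (Da) of a mass difference."""
    return sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol)


def annotate_differences(peaks, tol=0.5):
    """Map each peak pair (low, high) to the residue(s) matching high - low."""
    annotated = {}
    for lo, hi in combinations(sorted(peaks), 2):
        hits = residues_for(hi - lo, tol)
        if hits:
            annotated[(lo, hi)] = "/".join(hits)
    return annotated


peaks = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1079]
annotated = annotate_differences(peaks)
for pair, residue in sorted(annotated.items()):
    print(pair, residue)

# Terminal residues from the peptide M+H of 1166, as on the slide.
print(residues_for(1166 - 1079))       # difference of 87
print(residues_for(1166 - 1020 - 18))  # difference of 128
```

Isobaric or near-isobaric residues (I/L, K/Q) cannot be separated at this tolerance, which is why they stay ambiguous in the final sequence.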
Tandem MS – de novo Sequencing
Challenges in de novo sequencing
Neutral loss (-H2O, -NH3)
Modifications
Background peaks
Incomplete information
Tandem MS – Database Search
[Workflow diagram: Lysis → Fractionation → Digestion → LC-MS → MS/MS; search against a Sequence DB: Pick Protein → Pick Peptide → All Fragment Masses → Compare with MS/MS, Score, Test Significance; Repeat for all peptides; Repeat for all proteins]
Algorithms
Comparing and Optimizing Algorithms
[Figure: for Algorithm 1 and Algorithm 2, score distributions of True and False identifications, and the corresponding curves of Sensitivity vs. 1-Specificity]
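A Sensitivity vs. 1-Specificity curve of this kind can be computed from two lists of scores, one for true and one for false identifications. The sketch below is an illustration only; the function name and the example scores are hypothetical.

```python
def roc_points(true_scores, false_scores):
    """Return (1 - specificity, sensitivity) pairs, one per score threshold."""
    thresholds = sorted(set(true_scores) | set(false_scores), reverse=True)
    points = []
    for t in thresholds:
        # Fraction of true identifications accepted at this threshold.
        sensitivity = sum(s >= t for s in true_scores) / len(true_scores)
        # Fraction of false identifications accepted at this threshold.
        one_minus_specificity = sum(s >= t for s in false_scores) / len(false_scores)
        points.append((one_minus_specificity, sensitivity))
    return points


points = roc_points(true_scores=[5, 6, 7], false_scores=[1, 2, 6])
for fpr, sens in points:
    print(f"1-specificity = {fpr:.2f}, sensitivity = {sens:.2f}")
```

An algorithm whose curve lies closer to the upper left corner separates true from false identifications better.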
MS/MS - Parent Mass Error and Enzyme Specificity
)!!( ybIII nnxx
Expectation Values (MS/MS example):
  Δm = 2, Trypsin          2.5e-5
  Δm = 100, Trypsin        2.5e-5
  Δm = 2, non-specific     7.9e-5
  Δm = 100, non-specific   1.6e-4
Sequest
Cross-correlation
X! Tandem - Search Parameters
http://www.thegpm.org/
Conventional, single stage searching
[Diagram: spectra + sequences → Generic search engine → sequences]
Generic search engine: test all cleavages, modifications, & mutations for all sequences

Some hard problems in MS/MS analysis in proteomics
Determining potential modifications
- e.g., oxidation, phosphorylation, deamidation
- calculation order 2^n
- NP complete
Allowing for unanticipated peptide cleavages
- e.g., chymotryptic contamination in trypsin
- calculation order ~ 200 × tryptic cleavage
- “unfortunate” coefficient
Detecting point mutations
- e.g., sequence homology
- calculation order 18^N
- NP complete
Multi-stage searching
[Diagram: spectra + sequences → X! Tandem stages: Tryptic cleavage → Modifications #1 → Modifications #2 → Point mutation → sequences]
Search Results
Sequence Annotations
Identification – Spectrum Library Search
[Workflow diagram: Lysis → Fractionation → Digestion → LC-MS/MS; each MS/MS spectrum is searched against a Spectrum Library: Pick Spectrum → Compare, Score, Test Significance; Repeat for all spectra → Identified Proteins]
Steps in making an Annotated Spectrum Library (ASL):
1. Find the best 10 spectra for a particular sequence, with the same PTMs and charge.
2. Add the spectra together and normalize the intensity values.
3. Assign a “quality” value: the median expectation value of the 10 spectra used.
4. Record the 20 most intense peaks in the averaged spectrum, its parent ion z, m/z, sequence, protein accessions & quality.
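The four steps can be sketched for a single sequence as below. Several details are assumptions made for the illustration and are not stated on the slide: "best" is taken to mean lowest expectation value, peaks are binned to integer m/z before summing, and intensities are normalized to the strongest peak. The function name and the input layout are hypothetical, and the parent ion and accession bookkeeping of step 4 is omitted.

```python
from collections import Counter
from statistics import median


def build_library_entry(spectra, n_best=10, n_peaks=20):
    """spectra: list of (expectation_value, {m/z: intensity}) for one sequence,
    all with the same PTMs and charge."""
    # Step 1: keep the n_best spectra (assumption: lowest expectation value is best).
    best = sorted(spectra, key=lambda s: s[0])[:n_best]
    # Step 2: add the spectra together (assumption: integer m/z bins) ...
    summed = Counter()
    for _, peaks in best:
        for mz, intensity in peaks.items():
            summed[round(mz)] += intensity
    # ... and normalize (assumption: relative to the most intense peak).
    top = max(summed.values())
    normalized = {mz: intensity / top for mz, intensity in summed.items()}
    # Step 3: quality is the median expectation value of the spectra used.
    quality = median(e for e, _ in best)
    # Step 4: record the n_peaks most intense peaks.
    strongest = sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)[:n_peaks]
    return {"quality": quality, "peaks": dict(sorted(strongest))}


spectra = [  # hypothetical spectra for one peptide
    (1e-3, {100.2: 10, 200.1: 5}),
    (1e-5, {100.1: 20, 300.0: 1}),
    (1e-4, {200.3: 5}),
]
entry = build_library_entry(spectra, n_best=3, n_peaks=2)
print(entry)
```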
Spectrum Library Characteristics – Peptide Length
[Figure: fraction of library (%) (0 to 10) vs. peptide length (0 to 50)]
Spectrum Library Characteristics – Protein Coverage
[Figure: % coverage (0 to 50), for residues and for peptides, vs. protein Mr (kDa, 10 to 190)]
Identification – Spectrum Library Search
Library spectrum (5:25)
Test spectrum (5:25)
Results: 4 peaks selected, 1 peak missed

Identification – Spectrum Library Search
How likely is this?
Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.

Matches   Probability
1         0.45
2         0.15
3         0.016
4         0.00039
5         0.0000037
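A minimal sketch of the hypergeometric model, assuming the 4 selected peaks are treated as 4 draws from 25 possible m/z values, 5 of which are library peaks. Under that reading the function gives approximately 0.45, 0.15, 0.016 and 0.00039 for 1 to 4 matches. The function and parameter names are hypothetical.

```python
from math import comb


def match_probability(k, n_bins, n_library, n_selected):
    """Probability that exactly k of n_selected randomly placed peaks
    fall on the n_library library peaks among n_bins possible m/z values."""
    if k < 0 or k > min(n_library, n_selected):
        return 0.0
    return (comb(n_library, k) * comb(n_bins - n_library, n_selected - k)
            / comb(n_bins, n_selected))


for k in range(1, 5):
    p = match_probability(k, n_bins=25, n_library=5, n_selected=4)
    print(f"{k} matches: p = {p:.2g}")
```

The same function can be called with other values of `n_bins`, `n_library` and `n_selected` to explore how quickly the probability of a random match falls as the number of possible m/z values grows.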
If you have 1000 possible m/z values and 20 peaks in test and library spectrum?
[Figure: p (log scale, 1.0E-14 to 1.0E+00) vs. number of matches (1 to 10)]
1 matched: p = 0.6
5 matched: p = 0.0002
10 matched: p = 0.0000000000001
Identification – Spectrum Library Search
[Diagram: Experimental Mass Spectrum (M/Z) compared against a Library of Assigned Mass Spectra → Best search result]
X! Hunter
X! Hunter algorithm:
1. Use dot product to find a library spectrum that best matches a test spectrum.
2. Calculate p-value with hypergeometric distribution.
3. Use p-value to calculate expectation value, given the identification parameters.
4. If expectation value is less than the median expectation value of the library spectrum, report the median value.
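Step 1 can be illustrated with a normalized dot product between binned spectra. This is a sketch under assumptions, not the X! Hunter implementation: spectra are given as dictionaries from m/z bin to intensity, the dot product is normalized by the vector lengths, and all names and data are hypothetical. Steps 2 to 4 are not shown.

```python
from math import sqrt


def dot_score(a, b):
    """Normalized dot product of two spectra given as {m/z bin: intensity}."""
    shared = set(a) & set(b)
    numerator = sum(a[mz] * b[mz] for mz in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return numerator / norm if norm else 0.0


def best_library_match(test, library):
    """Name of the library spectrum with the highest dot product to the test spectrum."""
    return max(library, key=lambda name: dot_score(test, library[name]))


library = {  # hypothetical library entries
    "pepA": {100: 1.0, 200: 0.5},
    "pepB": {150: 1.0, 300: 0.2},
}
test_spectrum = {100: 0.9, 200: 0.6, 300: 0.1}
best = best_library_match(test_spectrum, library)
print(best, round(dot_score(test_spectrum, library[best]), 3))
```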
X! Hunter Result
Query Spectrum
Library Spectrum
Significance Testing
False protein identifications are caused by random matching.
An objective criterion for testing the significance of protein identification results is necessary.
The significance of protein identifications can be tested once the distribution of scores for false results is known.
Significance Testing - Expectation Values
The majority of sequences in a collection will give a score due to random matching.
[Diagram: Database Search of the spectrum (M/Z) → List of Candidates → Distribution of Scores for Random and False Identifications → Extrapolate and Calculate Expectation Values → List of Candidates With Expectation Values]
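The extrapolation step can be sketched as follows, assuming that the number of random matches scoring at or above a given score falls off log-linearly with the score. That functional form, the fit over all scores rather than only the tail, and the function names are assumptions made for this illustration; the score list is synthetic.

```python
from math import log10


def fit_survival(scores):
    """Least-squares line through log10(number of scores >= s) vs. s."""
    xs = sorted(scores)
    n = len(xs)
    ys = [log10(n - i) for i in range(n)]  # survival count at each sorted score
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expectation_value(random_scores, top_score):
    """Extrapolate the fitted line to top_score: the expected number of
    random matches scoring at least that high."""
    slope, intercept = fit_survival(random_scores)
    return 10 ** (intercept + slope * top_score)


# Synthetic random-match scores with an exactly log-linear survival function.
random_scores = [10 * (3 - log10(1000 - i)) for i in range(1000)]
print(expectation_value(random_scores, top_score=50))
```

A candidate whose score lies far above the random distribution gets a small expectation value; a candidate inside the distribution gets a value of order one or larger.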