
Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)


Peptide Mapping - Mass Accuracy

Peptide Mapping - Database Size

C. elegans

S. cerevisiae

Human

Peptide Mapping - Cys-Containing Peptides

C. elegans

S. cerevisiae

Human

Identification – Peptide Mass Fingerprinting

[Workflow diagram; labels: Sequence DB, Pick Protein, Digestion, All Peptide Masses, MS; Compare, Score, Test Significance; Repeat for each protein; Identified Proteins.]
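The fingerprinting loop can be sketched in a few lines of Python. This is a minimal illustration, not ProFound's method: it assumes a simple trypsin rule (cleave after K or R unless the next residue is P) and ranks proteins by a plain count of matched masses instead of a statistical score. The function names, the tolerance, and the two-entry database used below are hypothetical.

```python
# Monoisotopic residue masses (Da), as in the amino acid mass table of this deck.
MONO = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER = 18.0106   # H2O added to the residue sum of an intact peptide
PROTON = 1.00728  # charge carrier for [M+H]+


def cleavage_fragments(sequence):
    """In-silico digestion; ASSUMED rule: cut after K/R, not before P."""
    fragments, start = [], 0
    for i, aa in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and following != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments


def peptide_mh(peptide):
    """Monoisotopic [M+H]+ of a peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def theoretical_masses(sequence, missed_cleavages=0):
    """All peptide [M+H]+ values, joining up to `missed_cleavages` extra fragments."""
    frags = cleavage_fragments(sequence)
    masses = []
    for i in range(len(frags)):
        last = min(i + missed_cleavages, len(frags) - 1)
        for j in range(i, last + 1):
            masses.append(peptide_mh("".join(frags[i:j + 1])))
    return masses


def count_matches(measured, sequence, tol=0.05, missed_cleavages=0):
    """Number of measured masses within `tol` Da of any theoretical peptide."""
    theo = theoretical_masses(sequence, missed_cleavages)
    return sum(any(abs(m - t) <= tol for t in theo) for m in measured)


def rank_proteins(measured, database, tol=0.05):
    """Repeat for each protein, then sort by match count (a stand-in score)."""
    hits = {name: count_matches(measured, seq, tol) for name, seq in database.items()}
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
```

For example, `rank_proteins([1166.56, 317.19], {"P1": "SGFLEEDELKAAR", "P2": "MKWVTFISLLK"})` puts the made-up entry `P1` first. A match count alone says nothing about significance, which is why a real search follows it with a significance test.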

ProFound – Search Parameters

http://prowl.rockefeller.edu/

ProFound – Protein Identification by Peptide Mapping

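$$
P(k \mid D I) \;\propto\; P(k \mid I)\,\frac{(N-r)!}{N!}\,\prod_{i=1}^{r}\left\{\sqrt{\frac{2}{\pi}}\,\frac{m_{\max}-m_{\min}}{\sigma_i}\,\sum_{j=1}^{g_i}\exp\!\left[-\frac{(m_i-m_{ij0})^2}{2\sigma_i^2}\right]\right\}F_{\mathrm{pattern}}
$$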

W. Zhang & B.T. Chait, Analytical Chemistry 72 (2000) 2482-2489

ProFound Results

Peptide Mapping – Mass Accuracy

[Figure: ProFound, -log(e) versus mass tolerance (Da), tolerance axis 0 to 2.]

[Figure: Mascot, score versus mass tolerance (Da), tolerance axis 0 to 2.]

Peptide Mapping - Database Size

[Figure panels: S. cerevisiae, Fungi, All Taxa.]

Expectation values for the peptide mapping example, by database size:

S. cerevisiae   4.8e-7
Fungi           8.4e-6
All Taxa        2.9e-4

Missed Cleavage Sites

[Figure panels: u = 1, u = 2, u = 4.]

Expectation values for the peptide mapping example, by missed cleavage sites:

u = 1   4.8e-7
u = 2   1.1e-5
u = 4   6.8e-4

Peptide Mapping - Partial Modifications

No Modifications

Phosphorylation (S, T, or Y)

            Searched Without          Searched With
            Possible Modifications    Phosphorylation of S/T/Y
DARPP-32    0.00006                   0.01
CFTR        0.00002                   0.005

With peptide mapping data, it is usually better to search a protein sequence database without specifying possible modifications, even if the protein is modified.

Peptide Mapping - Ranking by Direct Calculation of the Significance

General Criteria for a Good Protein Identification Algorithm

The response to random input data should be random.

Maximum number of correct identifications and minimum number of incorrect identifications for any data set.

Maximal separation between scores for correct identifications and the distribution of scores for random matching proteins for any data set.

The statistical significance of the results should be calculated.

The searches should be fast.

Response to Random Data

[Figure: y-axis, normalized frequency.]

Peptide Fragmentation

[Diagram; labels: Ion Source, Mass Analyzer 1, Fragmentation, Mass Analyzer 2, Detector; fragment ion types b and y.]

Identification – Tandem MS

[Spectrum: % relative abundance (0 to 100) versus m/z (ticks at 250, 500, 750, 1000).]

Tandem MS – Sequence Confirmation

Peptide as drawn on the slide: KLEDEELFGS, with the fragment ion labels at each residue:

Residue   y ions   b ions
K         147      1166
L         260      1020
E         389      907
D         504      778
E         633      663
E         762      534
L         875      405
F         1022     292
G         1080     145
S         1166     88

[Spectrum: % relative abundance (0 to 100) versus m/z (ticks at 250, 500, 750, 1000). Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080.]

Peak spacings marked on the spectrum: 113 and 129.
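The fragment labels on the sequence-confirmation slides can be reproduced from the residue masses. The sketch below assumes singly charged ions and monoisotopic masses, and reads the peptide N-terminus first as SGFLEEDELK (the b series on the slides starts at S = 88). The function name is hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues in this peptide.
MONO = {"S": 87.032, "G": 57.0215, "F": 147.068, "L": 113.084,
        "E": 129.043, "D": 115.027, "K": 128.095}
WATER = 18.0106   # added to y ions and to the intact peptide
PROTON = 1.00728  # single positive charge


def fragment_ions(peptide):
    """Return (b ions, y ions, [M+H]+) as singly charged m/z values."""
    masses = [MONO[aa] for aa in peptide]

    # b ions: N-terminal prefixes b1 .. b(n-1)
    b_ions, running = [], 0.0
    for m in masses[:-1]:
        running += m
        b_ions.append(running + PROTON)

    # y ions: C-terminal suffixes y1 .. y(n-1), each carrying one water
    y_ions, running = [], 0.0
    for m in reversed(masses[1:]):
        running += m
        y_ions.append(running + WATER + PROTON)

    mh = sum(masses) + WATER + PROTON
    return b_ions, y_ions, mh


b_ions, y_ions, mh = fragment_ions("SGFLEEDELK")
print([round(x, 1) for x in b_ions])
print([round(x, 1) for x in y_ions])
print(round(mh, 1))
```

The computed values fall within about 0.6 of the integer labels on the slides (88 to 1020 for b, 147 to 1080 for y, 1166 for the peptide), with the differences coming from the fractional part of the monoisotopic masses.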


Tandem MS – de novo Sequencing

[Spectrum: % relative abundance (0 to 100) versus m/z (ticks at 250, 500, 750, 1000). Labeled peaks: [M+2H]2+, 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080.]

Mass Differences

Amino acid masses

1-letter  3-letter  Chemical    Monoisotopic  Average
code      code      formula     (Da)          (Da)
A         Ala       C3H5ON      71.0371       71.0788
R         Arg       C6H12ON4    156.101       156.188
N         Asn       C4H6O2N2    114.043       114.104
D         Asp       C4H5O3N     115.027       115.089
C         Cys       C3H5ONS     103.009       103.139
E         Glu       C5H7O3N     129.043       129.116
Q         Gln       C5H8O2N2    128.059       128.131
G         Gly       C2H3ON      57.0215       57.0519
H         His       C6H7ON3     137.059       137.141
I         Ile       C6H11ON     113.084       113.159
L         Leu       C6H11ON     113.084       113.159
K         Lys       C6H12ON2    128.095       128.174
M         Met       C5H9ONS     131.04        131.193
F         Phe       C9H9ON      147.068       147.177
P         Pro       C5H7ON      97.0528       97.1167
S         Ser       C3H5O2N     87.032        87.0782
T         Thr       C4H7O2N     101.048       101.105
W         Trp       C11H10ON2   186.079       186.213
Y         Tyr       C9H9O2N     163.063       163.176
V         Val       C5H9ON      99.0684       99.1326

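Tandem MS – de novo Sequencing

Sequences consistent with spectrum are read from the mass differences between peaks. Each row below starts with a peak and lists its difference to every higher-mass peak, in the column order of the header row.

260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079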

260 32 129 145 244 274 373 403 502 518 615 647 760 762 819

292 97 113 212 242 341 371 470 486 583 615 728 730 787

389 16 115 145 244 274 373 389 486 518 631 633 690

405 99 129 228 258 357 373 470 502 615 617 674

504 30 129 159 258 274 371 403 516 518 575

534 99 129 228 244 341 373 486 488 545

633 30 129 145 242 274 387 389 446

663 99 115 212 244 357 359 416

762 16 113 145 258 260 317

778 97 129 242 244 301

875 32 145 147 204

907 113 115 172

1020 2 59

1022 57

260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079

260 32 E 145 244 274 373 403 502 518 615 647 760 762 819

292 P I/L 212 242 341 371 470 486 583 615 728 730 787

389 16 D 145 244 274 373 389 486 518 631 633 690

405 V E 228 258 357 373 470 502 615 617 674

504 30 E 159 258 274 371 403 516 518 575

534 V E 228 244 341 373 486 488 545

633 30 E 145 242 274 387 389 446

663 V D 212 244 357 359 416

762 16 I/L 145 258 260 317

778 P E 242 244 301

875 32 145 F 204

907 I/L D 172

1020 2 59

1022 G



…GF(I/L)EEDE(I/L)…
…(I/L)EDEE(I/L)FG…

Peptide M+H = 1166
1166 - 1079 = 87 => S

SGF(I/L)EEDE(I/L)…

1166 - 1020 - 18 = 128 => K or Q

SGF(I/L)EEDE(I/L)(K/Q)
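The difference table and its residue annotations can be generated mechanically. The sketch below is a simplified illustration: it assumes the integer peak list from the slides, monoisotopic residue masses, and a hypothetical matching tolerance of 0.5 Da. It only annotates pairs of peaks. It does not decide which annotations belong to the same ion series.

```python
# Monoisotopic residue masses (Da), as in the amino acid mass table of this deck.
MONO = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}

PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778,
         875, 907, 1020, 1022, 1079]


def residues_for(delta, tol=0.5):
    """Residues whose mass equals `delta` within `tol` Da (tolerance is assumed)."""
    return sorted(aa for aa, mass in MONO.items() if abs(mass - delta) <= tol)


def annotate_differences(peaks, tol=0.5):
    """Map each (lower peak, higher peak) pair to the residues its spacing matches."""
    peaks = sorted(peaks)
    annotated = {}
    for i, low in enumerate(peaks):
        for high in peaks[i + 1:]:
            hits = residues_for(high - low, tol)
            if hits:
                annotated[(low, high)] = hits
    return annotated


pairs = annotate_differences(PEAKS)
print(pairs[(260, 389)])             # spacing of 129
print(pairs[(292, 405)])             # spacing of 113, I and L are indistinguishable
print(residues_for(1166 - 1079))     # remaining mass at one terminus
print(residues_for(1166 - 1020 - 18))
```

Some annotated pairs are coincidences. For instance 292 and 389 differ by 97 (P), yet the sequence-confirmation slides label 292 as a b ion and 389 as a y ion, so that pair has to be discarded when the ladders are assembled.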


Challenges in de novo sequencing

Neutral loss (-H2O, -NH3)

Modifications

Background peaks

Incomplete information


Tandem MS – Database Search

[Workflow diagram; labels: Lysis, Fractionation, Digestion, LC-MS, MS/MS; Sequence DB, Pick Protein, Pick Peptide, All Fragment Masses; Repeat for all peptides; Repeat for all proteins.]

Algorithms

Comparing and Optimizing Algorithms

[Figure, shown for Algorithm 1 and Algorithm 2: score distributions for true and false identifications, and sensitivity versus 1-specificity curves.]
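A sensitivity versus 1-specificity curve of this kind can be computed directly from two lists of scores. The sketch below assumes higher scores are better and that the true and false labels are known, for example from a test data set. The function names and the use of the area under the curve as a summary are illustrative choices, not taken from the slides.

```python
def roc_points(true_scores, false_scores):
    """(1-specificity, sensitivity) at every score threshold, from strict to lenient."""
    thresholds = sorted(set(true_scores) | set(false_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        sensitivity = sum(s >= t for s in true_scores) / len(true_scores)
        false_rate = sum(s >= t for s in false_scores) / len(false_scores)
        points.append((false_rate, sensitivity))
    return points


def area_under_curve(points):
    """Trapezoidal area under the curve: 1.0 is perfect separation, 0.5 is chance."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Two made-up algorithms scored on the same identifications.
algorithm_1 = area_under_curve(roc_points([5, 6, 7], [1, 2, 3]))
algorithm_2 = area_under_curve(roc_points([2, 4, 6], [1, 3, 5]))
print(algorithm_1, algorithm_2)
```

The algorithm whose curve lies closer to the top-left corner separates true from false identifications better, which is the comparison the two panels make.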

MS/MS - Parent Mass Error and Enzyme Specificity

)!!( ybIII nnxx

Expectation values for the MS/MS example:

Δm = 2, trypsin          2.5e-5
Δm = 100, trypsin        2.5e-5
Δm = 2, non-specific     7.9e-5
Δm = 100, non-specific   1.6e-4

Sequest

Cross-correlation

X! Tandem - Search Parameters

http://www.thegpm.org/


Conventional, single stage searching

[Diagram; labels: spectra, sequences, generic search engine (test all cleavages, modifications, & mutations for all sequences), sequences.]

Some hard problems in MS/MS analysis in proteomics

Determining potential modifications
- e.g., oxidation, phosphorylation, deamidation
- calculation order 2^n
- NP complete

Allowing for unanticipated peptide cleavages
- e.g., chymotryptic contamination in trypsin
- calculation order ~ 200 × tryptic cleavage
- “unfortunate” coefficient

Detecting point mutations
- e.g., sequence homology
- calculation order 18^N
- NP complete
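The 2^n growth for potential modifications is easy to see by enumeration: each of n candidate sites is either modified or not. The sketch below uses phosphorylation of S, T and Y as the example, with 79.96633 Da as the monoisotopic mass of the added HPO3 group. The function name and test peptides are hypothetical.

```python
from itertools import product

PHOSPHO = 79.96633  # monoisotopic mass added by one phosphorylation (HPO3), in Da


def phospho_variants(peptide, sites="STY"):
    """Every on/off combination of phosphorylation over the candidate sites.

    Returns a list of (modified positions, total mass shift) tuples.
    """
    positions = [i for i, aa in enumerate(peptide) if aa in sites]
    variants = []
    for state in product((False, True), repeat=len(positions)):
        modified = tuple(p for p, on in zip(positions, state) if on)
        variants.append((modified, len(modified) * PHOSPHO))
    return variants


for peptide in ("SGFLEEDELK", "ASTYK", "SSTTYYSSTK"):
    print(peptide, len(phospho_variants(peptide)))
```

One candidate site gives 2 variants, three give 8 and nine give 512, and every variant needs its own set of fragment masses, which is why testing all modifications for all sequences in a single pass becomes expensive.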

Multi-stage searching

[Diagram; labels: spectra, sequences; successive stages: Tryptic cleavage, Modifications #1, Modifications #2, Point mutation; sequences.]

X! Tandem

Search Results

Sequence Annotations

Search Results

Identification – Spectrum Library Search

[Workflow diagram; labels: Lysis, Fractionation, Digestion, LC-MS/MS, MS/MS; Spectrum Library, Pick Spectrum; Repeat for all spectra; Identified Proteins.]

Steps in making an Annotated Spectrum Library (ASL):

1. Find the best 10 spectra for a particular sequence, with the same PTMs and charge.
2. Add the spectra together and normalize the intensity values.
3. Assign a “quality” value: the median expectation value of the 10 spectra used.
4. Record the 20 most intense peaks in the averaged spectrum, its parent ion z, m/z, sequence, protein accessions & quality.
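The four library-building steps can be sketched as follows. Several details are assumptions made for illustration: spectra are represented as dictionaries from an integer m/z bin to intensity, "best" is taken to mean lowest expectation value, and normalization scales the most intense summed peak to 1. The function name and record layout are hypothetical.

```python
from statistics import median


def build_library_entry(spectra, expectation_values, n_spectra=10, n_peaks=20):
    """Combine the best spectra of one peptide (same PTMs and charge) into one entry.

    spectra: list of {mz_bin: intensity}; expectation_values: parallel list.
    """
    # Step 1: keep the n_spectra spectra with the lowest expectation values (assumed).
    ranked = sorted(zip(expectation_values, range(len(spectra))))[:n_spectra]
    best = [spectra[index] for _, index in ranked]

    # Step 2: add the spectra together and normalize (here: base peak = 1).
    summed = {}
    for spectrum in best:
        for mz, intensity in spectrum.items():
            summed[mz] = summed.get(mz, 0.0) + intensity
    top = max(summed.values())
    normalized = {mz: intensity / top for mz, intensity in summed.items()}

    # Step 3: quality is the median expectation value of the spectra used.
    quality = median(e for e, _ in ranked)

    # Step 4: record only the n_peaks most intense peaks, sorted by m/z.
    strongest = sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)[:n_peaks]
    return {"peaks": sorted(strongest), "quality": quality}


entry = build_library_entry(
    [{100: 10, 200: 5}, {100: 20, 300: 1}, {400: 50}],
    [1e-5, 1e-3, 0.5],
    n_spectra=2,
    n_peaks=2,
)
print(entry)
```

A real entry would also carry the parent ion charge and m/z, the sequence and the protein accessions listed in step 4. They are omitted here to keep the sketch short.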

Spectrum Library Characteristics – Peptide Length

[Figure: fraction of library (%, 0 to 10) versus peptide length (0 to 50).]

Spectrum Library Characteristics – Protein Coverage

[Figure: % coverage (0 to 50) versus protein Mr (kDa, 10 to 190); series: residues, peptides.]

Identification – Spectrum Library Search

[Figure: library spectrum and test spectrum, each labeled (5:25).]

Results: 4 peaks selected, 1 peak missed

How likely is this? Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.

Matches   Probability
1         0.45
2         0.15
3         0.016
4         0.00039
5         0.0000037
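The hypergeometric model can be written out in a few lines. The sketch below treats the test peaks as a random draw, without replacement, from the possible m/z values and asks how many land on library peaks. The function and parameter names are hypothetical.

```python
from math import comb


def match_probability(k, possible_mz, library_peaks, test_peaks):
    """Hypergeometric probability that exactly k test peaks hit library peaks by chance."""
    if k < 0 or k > library_peaks or k > test_peaks:
        return 0.0
    hits = comb(library_peaks, k)                             # ways to hit k library peaks
    misses = comb(possible_mz - library_peaks, test_peaks - k)  # ways to place the rest
    return hits * misses / comb(possible_mz, test_peaks)


# 25 possible m/z values, 5 library peaks, 4 peaks selected by the test spectrum.
for k in range(1, 5):
    print(k, match_probability(k, 25, 5, 4))
```

With these parameters the probabilities for 1 to 4 matches come out at about 0.45, 0.15, 0.016 and 0.00039, in line with the first four rows of the table.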

If you have 1000 possible m/z values and 20 peaks in test and library spectrum?

[Figure: p (log scale, 1.0E-14 to 1.0E+00) versus number of matches (1 to 10).]

1 matched: p = 0.6
5 matched: p = 0.0002
10 matched: p = 0.0000000000001


X! Hunter

[Diagram: an experimental mass spectrum (M/Z) is compared with a library of assigned mass spectra to give the best search result.]

X! Hunter algorithm:

1. Use dot product to find a library spectrum that best matches a test spectrum.
2. Calculate p-value with hypergeometric distribution.
3. Use p-value to calculate expectation value, given the identification parameters.
4. If expectation value is less than the median expectation value of the library spectrum, report the median value.
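Step 1 of the algorithm can be illustrated with a small dot-product search. This is a sketch under assumptions, not X! Hunter's implementation: spectra are dictionaries from an integer m/z bin to intensity, and the dot product is normalized by the vector lengths so that identical spectra score 1. The library contents and names are made up.

```python
from math import sqrt


def normalized_dot(spectrum_a, spectrum_b):
    """Dot product over shared m/z bins, divided by both vector lengths."""
    shared = set(spectrum_a) & set(spectrum_b)
    numerator = sum(spectrum_a[mz] * spectrum_b[mz] for mz in shared)
    length_a = sqrt(sum(v * v for v in spectrum_a.values()))
    length_b = sqrt(sum(v * v for v in spectrum_b.values()))
    denominator = length_a * length_b
    return numerator / denominator if denominator else 0.0


def best_library_match(test_spectrum, library):
    """Name of the library spectrum with the highest dot product against the test spectrum."""
    return max(library, key=lambda name: normalized_dot(test_spectrum, library[name]))


library = {
    "peptide_A": {100: 1.0, 200: 0.5, 300: 0.2},
    "peptide_B": {150: 1.0, 250: 0.5, 300: 0.1},
}
test_spectrum = {100: 0.9, 200: 0.6, 350: 0.1}
print(best_library_match(test_spectrum, library))
```

The best match on its own is not yet an identification. Steps 2 to 4 turn the number of matched peaks into a p-value and an expectation value.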

X! Hunter Result

Query Spectrum

Library Spectrum

Significance Testing

False protein identifications are caused by random matching.

An objective criterion for testing the significance of protein identification results is necessary.

The significance of protein identifications can be tested once the distribution of scores for false results is known.

Significance Testing - Expectation Values

The majority of sequences in a collection will give a score due to random matching.

[Diagram: a database search of the spectrum (M/Z) gives a list of candidates; the distribution of scores for random and false identifications is extrapolated to calculate expectation values, giving a list of candidates with expectation values.]
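One way to carry out the extrapolation is sketched below. The slides do not specify the fitting procedure, so the choices here are assumptions: the number of random matches scoring at least s is taken to fall off exponentially, a straight line is fitted to log10 of that count by least squares, and the line is extended to the candidate's score. The function names and the synthetic scores are hypothetical.

```python
from math import log10


def fit_survival(scores):
    """Least-squares line through log10(number of scores >= s) versus s.

    Returns (slope, intercept). Needs at least two distinct scores.
    """
    xs = sorted(set(scores))
    ys = [log10(sum(1 for v in scores if v >= s)) for s in xs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def expectation_value(score, slope, intercept):
    """Expected number of random matches scoring at least `score` (extrapolated)."""
    return 10 ** (intercept + slope * score)


# Synthetic scores of random matches: the count halves with each score unit.
random_scores = [s for s in range(10) for _ in range(2 ** (9 - s))]
slope, intercept = fit_survival(random_scores)
print(expectation_value(5, slope, intercept))   # inside the random distribution
print(expectation_value(20, slope, intercept))  # far beyond it
```

A candidate whose score lies well outside the random distribution gets an expectation value far below 1, meaning a match that good is not expected by chance in a search of this size.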
