testing statistical significance scores of sequence comparison methods with structure similarity

21
Testing statistical significance scores of sequence comparison methods with structure similarity Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27

Upload: noel-sweet

Post on 15-Mar-2016

43 views

Category:

Documents


0 download

DESCRIPTION

Testing statistical significance scores of sequence comparison methods with structure similarity. Tim Hulsen NCMLS PhD Two-Day Conference 2006-04-27. Introduction. Sequence comparison: Important for finding similar proteins (homologs) for a protein with unknown function - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Testing statistical significance scores of sequence comparison methods with structure similarity

Testing statistical significance scores of sequence comparison methods with structure similarity

Tim HulsenNCMLS PhD Two-Day Conference

2006-04-27

Page 2: Testing statistical significance scores of sequence comparison methods with structure similarity

Introduction

• Sequence comparison: Important for finding similar proteins

(homologs) for a protein with unknown function

• Algorithms: BLAST, FASTA, Smith-Waterman

• Statistical scores: E-value (standard), Z-value

Page 3: Testing statistical significance scores of sequence comparison methods with structure similarity

E-value or Z-value?

• Smith-Waterman sequence comparison with Z-value statistics:100 randomized shuffles to test significance of SW score

O. MFTGQEYHSV

shuffle

1. GQHMSVFTEY2. YMSHQFTVGEetc.

# seqs

SW score

rnd ori: 5*SD Z = 5

Page 4: Testing statistical significance scores of sequence comparison methods with structure similarity

E-value or Z-value?

• Z-value calculation takes much time (2x100 randomizations)

• Comet et al. (1999) and Bastien et al. (2004): Z-value is theoretically more sensitive and more selective than E-value

• BUT Advantage of Z-value has never been proven by experimental results

Page 5: Testing statistical significance scores of sequence comparison methods with structure similarity

How to compare?

• Structural comparison is better than sequence comparison

• ASTRAL SCOP: Structural Classification Of Proteins

• e.g. a.2.1.3, c.1.2.4; same number ~ same structure

• Use structural classification as benchmark for sequence comparison methods

Page 6: Testing statistical significance scores of sequence comparison methods with structure similarity

ASTRAL SCOP statisticsmax. % identity members families avg. fam. size max. fam. size families =1 families >1

10% 3631 2250 1.614 25 1655 595

20% 3968 2297 1.727 29 1605 692

25% 4357 2313 1.884 32 1530 783

30% 4821 2320 2.078 39 1435 885

35% 5301 2322 2.283 46 1333 989

40% 5674 2322 2.444 47 1269 1053

50% 6442 2324 2.772 50 1178 1146

70% 7551 2325 3.248 127 1087 1238

90% 8759 2326 3.766 405 1023 1303

95% 9498 2326 4.083 479 977 1349

Page 7: Testing statistical significance scores of sequence comparison methods with structure similarity

Methods (1)• Smith-Waterman algorithms: dynamic programming; computationally

intensive– Paracel with e-value (PA E):

• SW implementation of Paracel– Biofacet with z-value (BF Z):

• SW implementation of Gene-IT– ParAlign with e-value (PA E):

• SW implementation of Sencel– SSEARCH with e-value (SS E):

• SW implementation of FASTA (see next page)

Page 8: Testing statistical significance scores of sequence comparison methods with structure similarity

Methods (2)

• Heuristic algorithms:– FASTA (FA E)

• Pearson & Lipman, 1988• Heuristic approximation; performs better than

BLAST with strongly diverged proteins– BLAST (BL E):

• Altschul et al., 1990• Heuristic approximation; stretches local alignments

(HSPs) to global alignment• Should be faster than FASTA

Page 9: Testing statistical significance scores of sequence comparison methods with structure similarity

Method parameters- all:

- matrix: BLOSUM62- gap open penalty: 12- gap extension penalty: 1

- Biofacet with z-value: 100 randomizations

Page 10: Testing statistical significance scores of sequence comparison methods with structure similarity

Receiver Operating Characteristic

• R.O.C.: statistical value, mostly used in clinical medicine

• Proposed by Gribskov & Robinson (1996) to be used for sequence comparison analysis

Page 11: Testing statistical significance scores of sequence comparison methods with structure similarity

ROC50 Examplequery d1c75a_ a.3.1.1  

hit # pc e    

1 d1gcya1 b.71.1.1 0.31

2 d1h32b_ a.3.1.1 0.4

3 d1gks__ a.3.1.1 0.52

4 d1a56__ a.3.1.1 0.52

5 d1kx2a_ a.3.1.1 0.67

6 d1etpa1 a.3.1.4 0.67

7 d1zpda3 c.36.1.9 0.87

8 d1eu1a2 c.81.1.1 0.87

9 d451c__ a.3.1.1 1.1

10 d1flca2 c.23.10.2 1.1

11 d1mdwa_ d.3.1.3 1.1

12 d2dvh__ a.3.1.1 1.5

13 d1shsa_ b.15.1.1 1.5

14 d1mg2d_ a.3.1.1 1.5

15 d1c53__ a.3.1.1 2.4

16 d3c2c__ a.3.1.1 2.4

17 d1bvsa1 a.5.1.1 6.8

18 d1dvva_ a.3.1.1 6.8

19 d1cyi__ a.3.1.1 6.8

20 d1dw0a_ a.3.1.1 6.8

21 d1h0ba_ b.29.1.11 6.8

22 d3pfk__ c.89.1.1 6.8

23 d1kful3 d.3.1.3 6.8

24 d1ixrc1 a.4.5.11 14

25 d1ixsb1 a.4.5.11 14

- Take 100 best hits- True positives: in same SCOP family, or false positives: not in same family- For each of first 50 false positives: calculate number of true positives higher in list (0,4,4,4,5,5,6,9,12,12,12,12,12)- Divide sum of these numbers by number of false positives (50) and by total number of possible true positives (size of family -1) = ROC50

(0,167)- Take average of ROC50 scores for all entries

Page 12: Testing statistical significance scores of sequence comparison methods with structure similarity

ROC50 results

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0.45

0.50

pdb010 pdb020 pdb025 pdb030 pdb035 pdb040 pdb050 pdb070 pdb090 pdb095

ASTRAL SCOP set

mea

n R

OC

50 pc ebf zbl efa ess epa e

Page 13: Testing statistical significance scores of sequence comparison methods with structure similarity

Coverage vs. Error

• C.V.E. = Coverage vs. Error (Brenner et al., 1998)

• E.P.Q. = selectivity indicator (how much false positives?)

• Coverage = sensitivity indicator (how much true positives of total?)

Page 14: Testing statistical significance scores of sequence comparison methods with structure similarity

CVE Examplequery d1c75a_ a.3.1.1  

hit # pc e    

1 d1gcya1 b.71.1.1 0.31

2 d1h32b_ a.3.1.1 0.4

3 d1gks__ a.3.1.1 0.52

4 d1a56__ a.3.1.1 0.52

5 d1kx2a_ a.3.1.1 0.67

6 d1etpa1 a.3.1.4 0.67

7 d1zpda3 c.36.1.9 0.87

8 d1eu1a2 c.81.1.1 0.87

9 d451c__ a.3.1.1 1.1

10 d1flca2 c.23.10.2 1.1

11 d1mdwa_ d.3.1.3 1.1

12 d2dvh__ a.3.1.1 1.5

13 d1shsa_ b.15.1.1 1.5

14 d1mg2d_ a.3.1.1 1.5

15 d1c53__ a.3.1.1 2.4

16 d3c2c__ a.3.1.1 2.4

17 d1bvsa1 a.5.1.1 6.8

18 d1dvva_ a.3.1.1 6.8

19 d1cyi__ a.3.1.1 6.8

20 d1dw0a_ a.3.1.1 6.8

21 d1h0ba_ b.29.1.11 6.8

22 d3pfk__ c.89.1.1 6.8

23 d1kful3 d.3.1.3 6.8

24 d1ixrc1 a.4.5.11 14

25 d1ixsb1 a.4.5.11 14

- Vary threshold above which a hit is seen as a positive: e.g. e=10,e=1,e=0.1,e=0.01- True positives: in same SCOP family, or false positives: not in same family- For each threshold, calculate coverage: number of true positives divided by total number of possible true positives- For each threshold, calculate errors-per-query: number of false positives divided by number of queries- Plot coverage on x-axis and errors-per-query on y-axis; right-bottom is best

Page 15: Testing statistical significance scores of sequence comparison methods with structure similarity

CVE results

(for PDB010)

-

+

Page 16: Testing statistical significance scores of sequence comparison methods with structure similarity

Mean Average Precision

• A.P.: borrowed from information retrieval search (Salton, 1991)

• Recall: true positives divided by number of homologs

• Precision: true positives divided by number of hits

• A.P. = approximate integral to calculate area under recall-precision curve

Page 17: Testing statistical significance scores of sequence comparison methods with structure similarity

Mean AP Examplequery d1c75a_ a.3.1.1  

hit # pc e    

1 d1gcya1 b.71.1.1 0.31

2 d1h32b_ a.3.1.1 0.4

3 d1gks__ a.3.1.1 0.52

4 d1a56__ a.3.1.1 0.52

5 d1kx2a_ a.3.1.1 0.67

6 d1etpa1 a.3.1.4 0.67

7 d1zpda3 c.36.1.9 0.87

8 d1eu1a2 c.81.1.1 0.87

9 d451c__ a.3.1.1 1.1

10 d1flca2 c.23.10.2 1.1

11 d1mdwa_ d.3.1.3 1.1

12 d2dvh__ a.3.1.1 1.5

13 d1shsa_ b.15.1.1 1.5

14 d1mg2d_ a.3.1.1 1.5

15 d1c53__ a.3.1.1 2.4

16 d3c2c__ a.3.1.1 2.4

17 d1bvsa1 a.5.1.1 6.8

18 d1dvva_ a.3.1.1 6.8

19 d1cyi__ a.3.1.1 6.8

20 d1dw0a_ a.3.1.1 6.8

21 d1h0ba_ b.29.1.11 6.8

22 d3pfk__ c.89.1.1 6.8

23 d1kful3 d.3.1.3 6.8

24 d1ixrc1 a.4.5.11 14

25 d1ixsb1 a.4.5.11 14

- Take 100 best hits- True positives: in same SCOP family, or false positives: not in same family-For each of the true positives: divide the positive rank (1,2,3,4,5,6,7,8,9,10,11,12) by the true positive rank (2,3,4,5,9,12,14,15,16,18,19,20)- Divide the sum of all of these numbers by the total number of hits (100) = AP (0.140)- Take average of AP scores for all entries = mean AP

Page 18: Testing statistical significance scores of sequence comparison methods with structure similarity

Mean AP results

0.00

0.03

0.06

0.09

0.12

0.15

0.18

0.21

0.24

0.27

0.30

pdb010 pdb020 pdb025 pdb030 pdb035 pdb040 pdb050 pdb070 pdb090 pdb095

ASTRAL SCOP set

mea

n A

P

pc ebf zbl efa ess epa e

Page 19: Testing statistical significance scores of sequence comparison methods with structure similarity

Time consumption

• PDB095 all-against-all comparison:– Biofacet: multiple days (Z-value calc.!)– SSEARCH: 5h49m– ParAlign: 47m– FASTA: 40m– BLAST: 15m

Page 20: Testing statistical significance scores of sequence comparison methods with structure similarity

Conclusions

• e-value better than Z-value(!)• SW implementations are (more or less)

the same (SSEARCH, ParAlign and Biofacet), but SSEARCH with e-value scores best of all

• Use FASTA/BLAST only when time is important

• Larger structural comparison database needed for better analysis

Page 21: Testing statistical significance scores of sequence comparison methods with structure similarity

Credits

Peter Groenen

Wilco Fleuren

Jack Leunissen