Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs Brian Y. Chen, Viacheslav Y. Fofanov, David M. Kristensen, Marek Kimmel, Olivier Lichtarge, Lydia E. Kavraki

Post on 20-Dec-2015


Page 1

Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs

Brian Y. Chen, Viacheslav Y. Fofanov, David M. Kristensen, Marek Kimmel, Olivier Lichtarge, Lydia E. Kavraki

Page 2

Motivation

• Understanding the function of proteins is a fundamental goal of biology

• Experimental determination of protein function is expensive and time-consuming

• Algorithms for computational function prediction could guide and accelerate the protein function discovery process

Algorithms for Structural Comparison and Statistical Analysis of 3D Protein Motifs

B. Chen, V. Fofanov, D. Kristensen, M. Kimmel, O. Lichtarge, L. Kavraki

Page 3

A Computational Approach

• Comparative Analysis
  – Focus: Algorithms for Comparative Analysis

• What is similar about proteins with similar function?
  – Sequence – same components?
  – Geometry – same structure?
  – Dynamics – same motion?

Page 4

A Computational Approach

• Comparative Analysis
  – Focus: Algorithms for Comparative Analysis

• What is similar about proteins with similar function?
  – Sequence – same components?
  – Geometry – same structure?
  – Dynamics – same motion?

(Same Chemistry)

Page 5

What do we need?

• A motif for comparison
  – Representative of biological function

• An algorithm for comparison
  – Search for geometric and chemical similarity

• Statistical analysis
  – Classification of results

Page 6

Outline

• Evolutionary Trace (ET)
  – A source of biologically relevant motifs using evolutionary data

• Match Augmentation (MA)
  – An algorithm for identifying geometric similarity

• Statistical analysis
  – Statistically determined geometric thresholds

Page 7

The Evolutionary Trace (ET)

Lichtarge et al., JMB 1996; Lichtarge et al., JMB 1997; Lichtarge et al., PNAS 1996; Sowa et al., NSB 2001

[Figure labels: alignment + tree; Structure; Functional site]

Alignment:

A A A . E C W
G Y R I G C K
A K R . D C W
G T R L F C L
G A K I Y C L
G T R I A C K
A K K . D C W
G Y R L C C L
A K Y . E C W

Sequences as grouped in the tree figure:

G T R I A C K

G Y R I G C K
G Y R L C C L
G T R L F C L
G A K I Y C L

A A A . E C W

A K K . D C W
A K R . D C W

A K Y . E C W

position   1 2 3 4 5 6 7
consensus  X - - X - C X
rank       2 - - 4 - 1 3

Evolutionary Trace: rank 4
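The rank row above can be reproduced with a small sketch. It assumes a simplified reading of Evolutionary Trace: cutting the tree at rank k splits the sequences into k groups, and a position receives rank k when k is the smallest number of groups for which every group is invariant at that position. The tree is not recoverable from the transcript, so the partitions below are hypothetical; they were chosen only to be consistent with the slide. The names `PARTITIONS` and `trace_ranks` are illustrative.

```python
# Simplified Evolutionary Trace ranking (illustrative sketch, not the published implementation).
SEQS = {
    "s1": "GTRIACK", "s2": "GYRIGCK", "s3": "GYRLCCL",
    "s4": "GTRLFCL", "s5": "GAKIYCL", "s6": "AAA.ECW",
    "s7": "AKK.DCW", "s8": "AKR.DCW", "s9": "AKY.ECW",
}

# Hypothetical tree cuts: PARTITIONS[k] splits the sequences into k groups.
PARTITIONS = {
    1: [["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9"]],
    2: [["s1", "s2", "s3", "s4", "s5"], ["s6", "s7", "s8", "s9"]],
    3: [["s1", "s2"], ["s3", "s4", "s5"], ["s6", "s7", "s8", "s9"]],
    4: [["s1", "s2"], ["s3", "s4"], ["s5"], ["s6", "s7", "s8", "s9"]],
}


def invariant_within_groups(position, groups):
    """True if every group shows a single residue type at this position."""
    return all(
        len({SEQS[name][position] for name in group}) == 1
        for group in groups
    )


def trace_ranks(partitions, length):
    """Per position: the smallest rank at which all groups are invariant, else None."""
    ranks = []
    for pos in range(length):
        rank = next(
            (k for k in sorted(partitions)
             if invariant_within_groups(pos, partitions[k])),
            None,
        )
        ranks.append(rank)
    return ranks


ranks = trace_ranks(PARTITIONS, 7)
print(ranks)
```

Positions that never become invariant within the four cuts considered come out as `None`, which corresponds to the "-" entries in the rank row.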

Page 8

ET Clusters Functionally Relevant

http://imgen.bcm.tmc.edu/molgenlabs/lichtarge/trace_of_the_week/traces.html

[Figure: image table with header "Cluster Type" and columns Galectin CRD, Dihydropteroate Synthase, Trp1 domain of Hop; rows: Ligand binding site, ET clusters]

• Structural Epitope: Yellow = ligand, Blue = residues within 5 Å of the ligand
• ET Clusters: Yellow = ligand, Red = largest cluster, Other colors = trace residues

Page 9

Geometric Motifs

• Trace Clusters are functionally relevant

• A source for geometric motifs

• Geometric Motifs → Function?
  – Given a protein structure:
    • Same amino acids
    • Same geometry and chemistry
  – Does the protein have the same function?

Page 10

Outline

• Evolutionary Trace (ET)
  – A source of biologically relevant motifs using evolutionary data

• Match Augmentation (MA)
  – An algorithm for identifying geometric similarity

• Statistical analysis
  – Statistically determined geometric thresholds

Page 11

Geometric Comparison Algorithms

• Geometric Hashing
  Wolfson H.J. et al. IEEE Comp. Sci. Eng., 4(4):10–21, 1997.

• JESS
  Barker J.A. et al. Bioinformatics, 19(13):1644-9, 2003.

• PINTS
  Stark A. et al. Journal of Molecular Biology, 326:1307-16, 2003.

• Many Others
  – webFEATURE, DALI, CE, SSAP…

• Our method: Match Augmentation
  – Integrate structural and evolutionary data
  – Efficient application of hashing and depth-first search

PSB 2005

Page 12

Geometric Comparison Strategy

• Biological Input:
  – A structure of a functional site (Motif)
  – A protein structure with unknown function (Target)

• Geometric Search:
  – Find target atoms geometrically similar to motif atoms, with similar atoms and amino acids (Match)

• Output:
  – Match of atoms with greatest geometric similarity
    • May identify a similar functional site in the target

Page 13

Motifs: Structure & Evolution Data

• Structure of a Functional Site
  – Points in three dimensions (3D) taken from atom coordinates (motif points)
  – Labeled by residue and atom identity
  – Alternate residues from mutation
    • Support for complex active sites
  – Priority-ranked motif points
    • Functional relevance

[Figure: motif points numbered 1 to 4; example point label: {G,C,T} C]

Page 14

Input Data: Targets

• Targets
  – Points in 3D taken from atom coordinates of whole protein structures (target points)
  – Labeled by residue and atom identity
  – No alternate residues
  – No ranking

[Figure: example target point label: {Y} C]
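The two input types described on the last two slides can be sketched as plain records. This is a minimal illustration, not the authors' data format: the class names, field names, and coordinates are assumptions, and the atom label "C" is copied as it appears in the transcript.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Coord = Tuple[float, float, float]


@dataclass(frozen=True)
class MotifPoint:
    xyz: Coord                 # position taken from an atom coordinate
    atom: str                  # atom identity label
    residues: FrozenSet[str]   # alternate residues accepted at this point
    rank: int                  # priority rank (1 = matched first)


@dataclass(frozen=True)
class TargetPoint:
    xyz: Coord                 # position taken from an atom coordinate
    atom: str                  # atom identity label
    residue: str               # single residue label; no alternates, no ranking


# Made-up example values, for illustration only.
motif = [
    MotifPoint((0.0, 0.0, 0.0), "C", frozenset({"G", "C", "T"}), rank=2),
    MotifPoint((1.5, 0.0, 0.0), "C", frozenset({"S"}), rank=1),
]
target_point = TargetPoint((4.0, 2.0, 1.0), "C", "Y")

# Motif points are processed in priority order.
by_priority = sorted(motif, key=lambda p: p.rank)
print([sorted(p.residues) for p in by_priority])
```

Keeping the alternate residues as a set on the motif side only mirrors the asymmetry in the slides: motifs carry evolutionary information, targets are raw structures.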

Page 15

Search: Matching Criteria

• Geometric Similarity
  – Points are within a threshold distance of each other when optimally superimposed

• Label Compatibility
  – Target residue label is a member of the motif point's alternate residues
  – Atom labels are identical

[Figure: a motif point labeled {S,L,T} C and a target point labeled {S} C, separated by less than the threshold distance]
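The two criteria on this slide can be expressed as two small predicates. This is a sketch under assumptions: the function names and the `epsilon` parameter are illustrative, and the distance check presumes the two point sets have already been superimposed.

```python
import math


def labels_compatible(motif_residues, motif_atom, target_residue, target_atom):
    """Label criterion: identical atom labels, and the target residue is one of
    the motif point's alternate residues."""
    return target_atom == motif_atom and target_residue in motif_residues


def within_threshold(motif_xyz, target_xyz, epsilon):
    """Geometric criterion for one pair, assuming the points are already superimposed."""
    return math.dist(motif_xyz, target_xyz) < epsilon


# Labels from the slide figure: motif {S,L,T} C versus target {S} C.
ok_labels = labels_compatible({"S", "L", "T"}, "C", "S", "C")
bad_labels = labels_compatible({"S", "L", "T"}, "C", "Y", "C")

# Made-up coordinates: the points are 5.0 apart.
close_enough = within_threshold((0.0, 0.0, 0.0), (0.0, 3.0, 4.0), epsilon=5.5)
too_far = within_threshold((0.0, 0.0, 0.0), (0.0, 3.0, 4.0), epsilon=4.0)
print(ok_labels, bad_labels, close_enough, too_far)
```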

Page 16

Matches

• Matches correlate motif points to target points
  – Bijection
  – Fulfill geometric and label criteria

• Geometric similarity is measured by Least Root Mean Squared Distance (LRMSD)

• The match we seek:
  – Bijection of all motif points
  – Smallest LRMSD of all matches considered

[Figure: Motif, Target, Match]
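The slides do not spell out the LRMSD formula. One standard way to write it, given here as an assumption, is the root mean squared distance between matched points after the best rigid superposition. The symbols are illustrative: a match pairs motif points $\mathbf{m}_i$ with target points $\mathbf{t}_i$ for $i = 1, \dots, n$, $R$ is a rotation, and $\mathbf{v}$ is a translation.

```latex
\mathrm{LRMSD} \;=\; \min_{R \in SO(3),\ \mathbf{v} \in \mathbb{R}^3}
\sqrt{\frac{1}{n} \sum_{i=1}^{n}
\left\lVert R\,\mathbf{m}_i + \mathbf{v} - \mathbf{t}_i \right\rVert^{2}}
```

Under this definition, a match of identical point sets in different poses has LRMSD 0, and larger values indicate poorer geometric agreement.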

Page 17

Match Augmentation at a Glance

[Figure: pipeline Input → Seed Matching → Augmentation → Output]

• Design Principle:
  – Correlate high-ranking points first
  – Exhaustively test potential matches
  – Filter for the match with lowest LRMSD

• Two Phases:
  – Seed Matching
  – Augmentation

Page 18

Match Augmentation at a Glance

[Figure: pipeline Input → Seed Matching → Augmentation → Output; points labeled 1 to 3. Caption: Match High Ranked Points]

Page 19

Match Augmentation at a Glance

[Figure: pipeline Input → Seed Matching → Augmentation → Output; points labeled 1 to 3 shown in both stages. Caption: Expand matches to rest of Motif]

Page 20

Match Augmentation at a Glance

[Figure: pipeline Input → Seed Matching → Augmentation → Output; point 4 added to points 1 to 3. Caption: Expand matches to rest of Motif]

Page 21

Match Augmentation at a Glance

[Figure: pipeline Input → Seed Matching → Augmentation → Output; points 4 and 5 added to points 1 to 3. Caption: Expand matches to rest of Motif]

Page 22

Filtering Completed Matches

• Augmentation implements a depth-first search

• Data is stored in a heap of matches

• Final output: match with smallest LRMSD

[Figure: a search branch ends when there are no more points to match (LRMSD: 2.41); matches are sorted by LRMSD; final output LRMSD: 0.87]

Page 23

Match Augmentation Summary

• Hybrid Algorithm
  – Seed Matching: Hashing
  – Augmentation: Depth-First Search

• Finds matches to motifs within target structures
  – Final output: match with smallest LRMSD
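The search strategy summarized above can be illustrated with a much-simplified sketch. It is not the authors' implementation, and it makes three loudly stated substitutions: the hashing-based seed phase is folded into the same depth-first search; candidate pairs are filtered by agreement of pairwise distances within a hypothetical tolerance `eps` rather than by superposition; and completed matches are scored by the root mean squared difference of pairwise distances (`drmsd`) as a stand-in for LRMSD. Because it compares distances only, this stand-in cannot tell a site from its mirror image. All names and coordinates are made up.

```python
import heapq
import math
from itertools import combinations


def compatible(m, t):
    """Label criterion: same atom label, target residue among the motif's alternates."""
    return m["atom"] == t["atom"] and t["residue"] in m["residues"]


def consistent(motif, target, match, mi, ti, eps):
    """The new pair must preserve distances to every pair already in the match."""
    return all(
        abs(math.dist(motif[mi]["xyz"], motif[mj]["xyz"])
            - math.dist(target[ti]["xyz"], target[tj]["xyz"])) < eps
        for mj, tj in match
    )


def drmsd(motif, target, match):
    """RMS difference of pairwise distances (stand-in score, not LRMSD)."""
    diffs = [
        (math.dist(motif[a]["xyz"], motif[c]["xyz"])
         - math.dist(target[b]["xyz"], target[d]["xyz"])) ** 2
        for (a, b), (c, d) in combinations(match, 2)
    ]
    return math.sqrt(sum(diffs) / len(diffs)) if diffs else 0.0


def match_augmentation(motif, target, eps=1.0):
    """Depth-first search over motif points in rank order; returns (score, match) or None."""
    order = sorted(range(len(motif)), key=lambda i: motif[i]["rank"])
    completed = []  # min-heap of (score, match)

    def dfs(depth, match):
        if depth == len(order):  # no more points to match
            heapq.heappush(completed, (drmsd(motif, target, match), match))
            return
        mi = order[depth]
        used = {ti for _, ti in match}
        for ti, t in enumerate(target):
            if (ti not in used and compatible(motif[mi], t)
                    and consistent(motif, target, match, mi, ti, eps)):
                dfs(depth + 1, match + [(mi, ti)])

    dfs(0, [])
    return completed[0] if completed else None


# Made-up data: the target holds a translated copy of the motif plus two decoys.
motif = [
    {"rank": 1, "residues": {"S", "T"}, "atom": "C", "xyz": (0.0, 0.0, 0.0)},
    {"rank": 2, "residues": {"H"}, "atom": "C", "xyz": (3.0, 0.0, 0.0)},
    {"rank": 3, "residues": {"D", "E"}, "atom": "C", "xyz": (0.0, 4.0, 0.0)},
]
target = [
    {"residue": "G", "atom": "C", "xyz": (10.0, 10.0, 10.0)},
    {"residue": "S", "atom": "C", "xyz": (1.0, 1.0, 1.0)},
    {"residue": "H", "atom": "C", "xyz": (4.0, 1.0, 1.0)},
    {"residue": "E", "atom": "C", "xyz": (1.0, 5.0, 1.0)},
    {"residue": "H", "atom": "C", "xyz": (1.0, 1.0, 9.0)},
]

best = match_augmentation(motif, target)
print(best)
```

Processing motif points in rank order means an incompatible high-priority point prunes a branch before any lower-priority points are tried, which is the intuition behind correlating high-ranking points first.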

Page 24

Testing MA on Biological Data

• Data Set
  – 12 motifs selected from residues surrounding enzymatic active sites
  – 73 targets, each evolutionarily related to one of the motifs
  – Details: www.cs.rice.edu/~brianyc/papers/PSB2005/

• Experimental Protocol
  – Search for each motif within every target
    • Matches of evolutionarily related motif-target pairs are “HPs” (BLUE)
    • Matches of unrelated motif-target pairs are “NHPs” (RED)

Page 25

Match Augmentation Conclusions

• Match Augmentation is accurate
  – Identifies cognate active sites in 95.4% of evolutionarily related proteins

• Match Augmentation is very efficient
  – Matches can be found in a fraction of a second

Page 26

Outline

• Evolutionary Trace (ET)
  – A source of biologically relevant motifs using evolutionary data

• Match Augmentation (MA)
  – An algorithm for identifying geometric similarity

• Statistical analysis
  – Statistically determined geometric thresholds

Page 27

Evaluating Statistical Significance

• Hypothesis Testing Framework:
  – H0: Motif and Target are functionally unrelated
  – HA: Motif and Target are functionally related

• Reject H0 for a given match only if the match is unusual under H0

• Problem: how do we evaluate H0 for a given match?

Page 28

The “Usual” H0 Distribution

• The set of matches between the motif and all functionally unrelated targets

• Previous methods approximate this distribution:
  – JESS
    • Matches are compared to a reference population of motifs that is partially ordered by degree of occurrence
  – PINTS
    • Approximates the distribution of matches with an artificial curve parameterized by motif size and residue content

• MA can calculate this distribution explicitly by computing matches to the entire PDB

• JESS: Barker J.A. et al. Bioinformatics, 19(13):1644-9, 2003.
• PINTS: Stark A. et al. Journal of Molecular Biology, 326:1307-16, 2003.

Page 29

A Distribution of Match LRMSDs

• LRMSD distribution of matches with the entire PDB
  – Almost all known protein structures
  – Almost no functional relation to our motifs

• Reasonable H0 distribution

[Figure: two histograms of match LRMSD, Unsmoothed and Smoothed; x-axis: LRMSD, y-axis: Frequency]

Page 30

How unusual is our match?

• We want: the probability of observing a match with lower LRMSD than the given match

[Figure: estimated density of match LRMSD; x-axis: LRMSD (0 to 5), y-axis: Frequency (0.0 to 1.2). A vertical line marks the match LRMSD. A: area left of the line. B: area under the curve.]

p-value: $p = \frac{A}{B} = \frac{\text{matches with lower LRMSD}}{\text{matches total}}$

• Apply p-value to reject H0
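The ratio on this slide amounts to counting. A minimal sketch, assuming the H0 distribution is available as a plain list of LRMSD values from matches against unrelated targets; the function name and the numbers are made up for illustration.

```python
def empirical_p_value(match_lrmsd, null_lrmsds):
    """Fraction of H0 matches whose LRMSD is lower than the given match."""
    if not null_lrmsds:
        raise ValueError("null distribution is empty")
    lower = sum(1 for x in null_lrmsds if x < match_lrmsd)
    return lower / len(null_lrmsds)


# Made-up LRMSD values standing in for matches to unrelated targets.
null = [2.1, 2.4, 1.9, 3.0, 2.8, 0.9, 2.2, 2.6, 3.3, 2.0]

p = empirical_p_value(1.0, null)
print(p)
```

Here one of the ten null values lies below 1.0, so the match would be assigned p = 0.1.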

Page 31

Statistical Significance

• Result: data-driven statistical significance value (p-value)
  – No dependence on approximations, unlike previous work

• The p-value of a match tells us the probability of observing another match with lower LRMSD against a functionally unrelated target

• Apply p-value to reject H0

• Do matches identifying cognate active sites (HPs) have low p-values? (i.e., can we reject H0 for HPs?)

Page 32

Testing our Statistical Analysis

• Distributions of matches over the PDB can be calculated efficiently
  – 12:48 on a single machine, on average

• Do not have to scan the entire PDB to accurately determine the H0 distribution
  – 5% random sample accurate enough
  – Reduces sample time to 0:38, on average

• Matches of cognate active sites (HPs) are statistically significant (low p-values)
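The sampling idea can be illustrated mechanically: estimate the p-value from a 5% random sample of the H0 matches and compare it with the value from the full set. The data below are synthetic draws from a Gaussian, chosen only to make the sketch runnable; they say nothing about real PDB match distributions and do not reproduce the timings on the slide.

```python
import random


def empirical_p_value(match_lrmsd, null_lrmsds):
    """Fraction of H0 matches whose LRMSD is lower than the given match."""
    lower = sum(1 for x in null_lrmsds if x < match_lrmsd)
    return lower / len(null_lrmsds)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic stand-in for LRMSDs of matches against every structure in a database.
population = [max(0.0, rng.gauss(2.5, 0.6)) for _ in range(10_000)]

# 5% random sample, drawn without replacement.
sample = rng.sample(population, k=len(population) // 20)

p_full = empirical_p_value(1.5, population)
p_sample = empirical_p_value(1.5, sample)
print(len(sample), p_full, p_sample)
```

The sample estimate is a binomial proportion, so its error shrinks with the square root of the sample size; that is the usual reason a small random subset can suffice for p-values that are not extremely small.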

Page 33

Conclusions

• Match Augmentation is accurate and extremely efficient
  – Correctly identifies cognate active sites (HPs)
  – Identifies matches in fractions of a second

• Algorithmic efficiency enables detailed statistical analysis
  – Explicitly calculate the H0 distribution without dependence on approximated H0 distributions
  – Matches of cognate active sites (HPs) are statistically significant
  – Significance threshold translates into useful motif-specific LRMSD thresholds
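The last point, turning a significance level into a motif-specific LRMSD cutoff, can be sketched as a quantile lookup on the H0 distribution. This assumes the p-value is the fraction of H0 matches with strictly lower LRMSD; the function name `lrmsd_threshold`, the level `alpha`, and the numbers are illustrative.

```python
import math


def lrmsd_threshold(null_lrmsds, alpha):
    """Largest LRMSD cutoff c such that any match with LRMSD <= c has
    empirical p-value <= alpha under the given H0 sample."""
    ordered = sorted(null_lrmsds)
    # Number of H0 matches allowed to lie strictly below the cutoff.
    allowed = int(alpha * len(ordered) + 1e-9)
    if allowed >= len(ordered):
        return math.inf  # every match passes at this level
    return ordered[allowed]


# Made-up H0 LRMSD values for a single motif.
null = [2.1, 2.4, 1.9, 3.0, 2.8, 0.9, 2.2, 2.6, 3.3, 2.0]

threshold = lrmsd_threshold(null, alpha=0.2)
print(threshold)
```

With these ten values and alpha = 0.2, two H0 matches may lie below the cutoff, so the threshold is the third-smallest value, 2.0. A different motif would have a different H0 sample and therefore a different cutoff.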

Page 34

Special Thanks

• Kavraki Group
  – David Schwarz
  – Amarda Shehu
  – Allison Heath
  – Hernan Stamati
  – Anne Christian
  – Drew Bryant
  – Amanda Cruess
  – Brad Dodson
  – Jessica Wu

• Lichtarge Lab
  – David Kristensen
  – Dan Morgan
  – Ivana Mihalek
  – Hui Yao

• Kimmel Group
  – Viacheslav Fofanov

• Funding
  – NSF
  – NLM 5T15LM07093
  – March of Dimes
  – Whitaker Foundation
  – Sloan Foundation
  – VIGRE
  – AMD