Predicting co-evolving pairs in Pfam using information theory where entropy is determined by phylogenetic mutation events Scooter Willis University of Florida Computer and Information Science and Engineering

Post on 31-Jan-2016

Page 1: Scooter Willis University of Florida  Computer and Information Science and Engineering

Predicting co-evolving pairs in Pfam using information theory where entropy is determined by phylogenetic mutation events

Scooter Willis

University of Florida

Computer and Information Science and Engineering

Page 2:

Homology modeling

Proteins grouped by function will share similar structures

Pfam is a large collection of protein sequences grouped by Hidden Markov Models

Pfam 19.0 (December 2005) contains 8,183 protein families, of which 2,765 have one or more solved PDB structures

Page 3:

Pfam5000

“Implications of Structural Genomics Target Selection Strategies: Pfam5000, Whole Genome, and Random Approaches”, John-Marc Chandonia and Steven E. Brenner, PROTEINS: Structure, Function, Bioinformatics (2005)

NIH is supporting structural genomics projects at 9 pilot centers through the Protein Structure Initiative.

Funding is $300 million over the next five years

Page 4:

Co-evolving pairs

A co-evolving pair is defined as two amino acids more than 10 sequence positions apart but within 12 angstroms of each other in 3D space

Apply information theory to protein families to detect co-evolving pairs, which indicate the tertiary placement of secondary structures

Actively researched topic with numerous publications in the last 5 years

It is accepted that the information is present but difficult to separate from background noise
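The definition of a co-evolving pair can be written as a small predicate. This is a minimal sketch: the function name, the parameter names, and the choice of one 3D coordinate per residue are assumptions for illustration, since the slides do not say which atom is used to measure distance.

```python
import math

def is_coevolving_pair(pos_i, pos_j, coord_i, coord_j,
                       min_separation=10, max_distance=12.0):
    """Check the slide's definition for one residue pair.

    pos_i, pos_j     -- sequence positions of the two residues
    coord_i, coord_j -- (x, y, z) coordinates in angstroms (assumed: one
                        representative atom per residue)
    """
    separation = abs(pos_i - pos_j)
    distance = math.dist(coord_i, coord_j)  # Euclidean distance
    # More than 10 positions apart in sequence, closer than 12 angstroms in space.
    return separation > min_separation and distance < max_distance
```

For example, two residues at positions 8 and 198 whose coordinates are 5 angstroms apart satisfy the definition, while residues at positions 8 and 12 do not, whatever their distance.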

Page 5:

Information Theory Approach

The entropy H(x), where x is a discrete random variable and p(x) is its probability function, measures the randomness or uncertainty in a signal and is calculated with the following formula.

$$H(x) = -\sum_{x} p(x) \log p(x)$$
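As a rough illustration, the entropy of one alignment column can be estimated from observed residue frequencies. The function name `column_entropy` is hypothetical, and log base 2 is assumed here because the later charts are labeled "log base 2".

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    total = len(column)
    counts = Counter(column)
    # p(x) is estimated as the observed frequency of each symbol.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A fully conserved column such as "AAAA" has zero entropy, while a column split evenly between two residues, such as "AACC", has one bit.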

Page 6:

Information Theory Approach

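$$H(x) = -\sum_{x} p(x) \log p(x)$$

$$H(y) = -\sum_{y} p(y) \log p(y)$$

$$H(x,y) = -\sum_{x,y} p(x,y) \log p(x,y)$$

$$MI(x,y) = H(x) + H(y) - H(x,y)$$

$$MI(x,y) = \sum_{i,j} p(x_i,y_j) \log\left(\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)}\right)$$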
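Both forms of the mutual information formula can be checked against each other on a pair of alignment columns. This is a sketch under stated assumptions: probabilities are plain observed frequencies (what the later slides call the standard, STA, probability), log base 2 is used, and all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy in bits, with p estimated from observed frequencies."""
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(symbols).values())

def mutual_information(col_x, col_y):
    """MI(x,y) = H(x) + H(y) - H(x,y) for two equal-length columns."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same number of sequences")
    joint = list(zip(col_x, col_y))  # one (x, y) symbol pair per sequence
    return entropy(col_x) + entropy(col_y) - entropy(joint)

def mutual_information_direct(col_x, col_y):
    """MI(x,y) as the double sum over observed (x_i, y_j) combinations."""
    n = len(col_x)
    px, py = Counter(col_x), Counter(col_y)
    pxy = Counter(zip(col_x, col_y))
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())
```

Columns that always change together, such as "AACC" and "DDEE", give one bit of mutual information; columns that vary independently, such as "AACC" and "DEDE", give zero.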

Page 7:

Mutual Information Venn Diagram

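[Figure: Venn diagram of two overlapping circles, H(X) and H(Y). The overlap is MI(X,Y), the non-overlapping regions are H(X|Y) and H(Y|X), and the union is H(X,Y).]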
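The regions of the Venn diagram correspond to the standard identities below, written in the notation of the slide titles. They are textbook relations and are not stated on the slide itself.

$$MI(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

$$H(X,Y) = H(X|Y) + MI(X,Y) + H(Y|X)$$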

Page 8:

Sampling

The difficulty in applying statistical methods to genetic sequence data sets is that they tend not to be random samples, and the extent of the entire population is unknown

The bias towards protein sequences that have medical research value, together with the corresponding phylogenetic influences, introduces noise (an indistinguishable background signal) that decreases the quality of statistical measures

The primary impact is on accurately measuring probability because the sample statistics do not reflect the population’s statistics
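A toy example may make the sampling problem concrete. The clade names and residues below are invented for illustration, and keeping one representative per clade is only a simple stand-in used to show the skew; it is not the method proposed in these slides.

```python
from collections import Counter

def frequencies(symbols):
    """Observed frequency of each symbol."""
    total = len(symbols)
    return {s: n / total for s, n in Counter(symbols).items()}

# Hypothetical column: (clade, residue) for each sampled sequence.
# One heavily studied clade contributes six near-identical sequences.
samples = [("clade_1", "C")] * 6 + [("clade_2", "A"), ("clade_3", "D")]

# Raw frequencies treat every sequence as an independent observation.
raw = frequencies([residue for _, residue in samples])

# One residue per clade (dict keeps the last residue seen for each clade).
per_clade = frequencies(list(dict(samples).values()))
```

Here `raw["C"]` is 0.75 while `per_clade["C"]` is one third, so the estimated probability of C depends strongly on how the sequences were sampled.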

Page 9:

Phylogenetic Effect

Reducing the impact of the phylogenetic effect when calculating probability will improve the quality of the information and the ability to detect co-evolving pairs

Page 10:

Phylogenetic Tree

[Figure: phylogenetic tree for a single alignment column, with nodes labeled by the amino acids A, C, and D]

Page 11:

Phylogenetic Tree Mutation Events

[Figure: phylogenetic tree for a pair of alignment columns, with nodes labeled by amino acid pairs (CD, CE, DE, AE) and by XX]
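The slides do not say how mutation events are inferred from the tree. As one possible reading, the sketch below counts the minimum number of state changes with Fitch small parsimony, a standard technique swapped in here as an assumption. The tree shapes are hypothetical and are not the topology from the figures.

```python
def fitch_events(tree):
    """Return (candidate_states, events) for a binary tree.

    A tree is either a leaf (a string such as "A" or "CD") or a 2-tuple of
    subtrees. `events` is the minimum number of state changes needed to
    explain the leaf labels (Fitch small parsimony).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_events = fitch_events(left)
    right_states, right_events = fitch_events(right)
    shared = left_states & right_states
    if shared:
        # The children can agree on a state: no new event at this node.
        return shared, left_events + right_events
    # The children disagree: at least one change is required here.
    return left_states | right_states, left_events + right_events + 1

# Hypothetical single-column tree: eight leaves, four of them C.
column_tree = ((("A", "D"), ("A", "A")), (("C", "C"), ("C", "C")))

# Hypothetical column-pair tree: leaves are two-residue labels.
pair_tree = (("CD", "CE"), ("CD", "CD"))
```

For `column_tree` the count is two events even though there are eight leaves, so the four C leaves are treated as the result of shared ancestry rather than as four independent observations. For `pair_tree` the count is one event.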

Page 12:

Results

Comparison of mutual information calculation using standard probability of sequences (STA) and probability with reduced phylogenetic effect (RPE)

Interested in amino acid pairs that have a mutual information score at least four standard deviations above the mean (Z>=4)

Goal: maximize the percentage of these pairs that are less than 12 angstroms apart and greater than 10 sequence positions apart
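Selecting the high-scoring pairs amounts to a Z-score filter over all MI scores in a family. This is a minimal sketch: the function name and the dictionary layout are hypothetical, and the population standard deviation is an assumption, since the slides do not say which estimator was used.

```python
import statistics

def high_z_pairs(mi_scores, threshold=4.0):
    """Return the position pairs whose MI score has Z >= threshold.

    mi_scores -- dict mapping (position_x, position_y) to an MI score
    """
    values = list(mi_scores.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (assumed)
    if stdev == 0:
        return []  # all scores identical, nothing stands out
    return sorted(pair for pair, score in mi_scores.items()
                  if (score - mean) / stdev >= threshold)
```

With 100 background pairs scoring 0.1 and a single pair scoring 1.0, only that single pair passes the Z>=4 filter.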

Page 13:

QGYSLPPEEDLLGLGMTSTSFI.........RGIYLQNAKTLEEYHNTVLRGTF..ATV..KS.KIL.TED.DRI.RK..WAIHKL.MCTFTIQGYSLPPEEDLLGFGISATSFI.........RGIYLQNVKDLREYSETIQAGKL..ATV..KG.KIL.SQD.DKT.RK..WVIHTL.MCSFSLQGYTTKGGADLVGVGLTSIGEG.........QRHYAQNFKDMSSYEAALDRGVL..PFE..RG.VIL.SDD.DEL.RK..AVIMEL.MANFKLQGYTTKKFTQTIGIGVTSIGEG.........GDYYTQNYKDLHHYEKALDLGHL..PVE..RG.VAL.SQE.DVL.RK..EVIMQM.MSNLKLQGYTTHAGTELFGFGATSISML.........HDAYVQNHKQLKEYYQAVAGDAL..PVS..KG.IKL.TTD.DIL.RR..DVIMCI.MSNFYLQGYTTQPESDLLGFGITSISML.........QDVYAQNHKTLKAFYNALDREVM..PIE..KG.FKL.SQD.DLI.RR..TVIKEL.MCQFKLQGYTTLPTADLIGFGLTSISML.........QAAYAQNQKHLATYFSDVAAGHH.GPQE..CG.FNC.TVE.DLL.RR..TIIMEL.MCQFSLQGYTTKKGVELLGFGATSIGML.........YDSYFQNWKTLRDYNKTVDEGKI..PVF..RG.YVL.NED.DFI.RR..EVIMDI.MCNLGVQGYSTYADCDLVAIGVSSIGKI.........GSTYSQNERDIDAYYAAIDEGRL..PIM..RG.YQL.NQD.DIL.RR..NIIQDL.MCRFALQGYSTHAGYDQVGLGISAIGAI.........AGRYVQNARTLDEYYGALDHGRL..PLA..RG.VAM.SAD.DHL.RR..EIIGAL.MCNGVLQGYSTHADCDLLAFGMSAISRV.........GDVYAQNEKELDAYYARIDAGEL..PVL..RG.LTL.TPD.DHV.RR..ALIGEL.MCGFELMGYTTHADTDLLGLGVSAISHI.........GATYSQNPRDLPSWEDAVDQGQL..PVW..RG.VAL.SAD.DQL.RA..ELIQQL.MCQGEVMGYTTHADTDLIGLGVSAISHF.........GDSYSQNPRELAAWDAAVDRGAL..PVC..RG.MQL.SAD.DLL.RA..EVIQAL.LCRGRVQGYTTQEECDLLGLGVSAISLL.........GDTYAQNQKELKHYYTAIDNTGI..ALH..KG.FAM.SEE.DCL.RR..DVIKQL.ICNFKLQGYTTQGDTDLLGMGVSAISMI.........GDCYAQNQKELKQYYQQVDEQGN..ALW..RG.IAL.TRD.DCI.RR..DVIKSL.ICNFRLQGYTTQGESDLLGLGVSAISML.........GDSYAQNEKDLETYYACVEQRGN..ALW..RG.LTM.TED.DCL.RR..DVIKTL.ICHFQLQGYTTQEECDLLGLGVSSISQI.........GDCYAQNQKDIRPYYEAIDKDGH..ALW..KG.CSL.NRD.DEI.RR..VVIKQL.ICHFDLQGYTTQGECDLVGFGVSAISMI.........GDAYAQNQKELKKYYAQVNDLRH..ALW..KG.VSL.DSD.DLL.RR..EVIKQL.ICNFKLQGYTTHGHCDLIGLGVSAISQI.........GDLYCQNSSDLTAYQNSLASAQL..ATS..RG.LVC.NAD.DRL.RR..AVIQQL.ICNFKLQGYTTHGHCDLIGLGVSAISQI.........GDLYCQNSSDLNTYQDSLSNAQL..ATQ..RG.LLC.NHD.DRI.RR..AVIQQL.ICHFELQGYTTHGHCDLVGLGVSAISQI.........GDLYSQNSSDINDYQTSLDNGQL..AIR..RG.LHC.NSD.DRV.RR..AVIQQL.ICHFELQGYTTHGHCDLIGLGVSAISQV.........GDLYSQNSSDLNDYQR
LLDSDQP..ATL..RG.LIC.SED.DRI.RR..AVIQQL.ICHFTLQGYTTDNEPVLIGLGASAISTF.........SDAYIQNIADIKNYSRAIEEQGL..ASF..RG.IDI.SQE.DHL.RG..EIISAL.MCHFAVQGYTNDRCGTLIGFGPSSISQF.........PGGYAQNISDVGQYRKRVEAGEL..ATV..RG.YTL.RDT.DRI.RS..AIISAL.MCNFCVQGYTTDACETLIGFGASAIGRS.........AHGYVQNEVAIGRYAQSVATGQL..ATA..KG.YRL.TAD.DRL.RA..EIIERI.MCDFSVQGYTTDACETLIGLGASAIGRT.........NDGYVQNEVPPGLYAQHIASGRL..ATV..KG.YRM.TPE.DRL.RA..GIIERL.MCDFGVQGYTTDACKTLIGIGASAIGRF.........GNGYHQNIVPPGLYASCVASGEL..PTA..KI.YEL.TAE.DRV.RA..DVIEQL.MCNFSVQGYSADTCKTLIAFGASAIGRV.........GEGYVENAGALEAYSQHIAAGRL..ATS..KG.YRL.IGE.DRV.RG..AIIERL.MCDLEALGYSADTCKTLIGFGASAIGRV.........GEGYVQNEVTRDSYCRHIAAGRL..ATS..KG.YRL.TDE.DRA.RA..AIIERL.MCDLEALGYSADTCKTVIGLGPSAIGRL.........REGYVQNESATASYHQHIQAGRP..ATS..KG.YCL.SPE.DRL.RA..AIIERL.MCDLQALGYSAETCSTVIGLGASAIGRC.........GDGYVQNDLTQSCYNRHIASGRL..AIS..RG.YRL.ATE.DRV.RA..AIIEQL.MCYLEAQGYTTDQGEVLLGFGASAIGHL.........PQGYVQNEVQIGAYAQSIGASRL..ATA..KG.YGL.TDD.DRL.RA..DIIERI.MCEFSAQGYTTDDCDSLIGLGASAIGRL.........PAGYMQNHVPLGLYAERIAFGVL..PTA..KG.YLL.SEE.DKL.RA..RVIERL.MCDFEA

PF04055.9 Sample Data 180-540

Page 14:

PF04055.9 STA method: Mutual Information log base 2

[Chart: distance in angstroms (y-axis, 0 to 30) for each Z >= 4 amino acid pair (x-axis, pairs 1 to 24)]

Page 15:

PF04055.9 RPE method: Mutual Information log base 2 with phylogenetic effect reduced

[Chart: distance in angstroms (y-axis, 0 to 25) for each Z >= 4 amino acid pair (x-axis, pairs 1 to 23)]

Page 16:

Sequence Positions of Interest

Mutual information pairs closer than 12 angstroms and greater than 10 sequence positions apart

STA         RPE
410, 554    188, 412
412, 555    8, 189
410, 555    8, 414
8, 412      8, 412
2, 193      189, 412
8, 198      2, 193
            330, 412
            26, 188
            190, 412
            5, 188
            188, 330
            3, 188

Page 17:

Clusters of Interest

[Figure: STA and RPE panels with labeled sequence positions 410, 555, 8, 198, 188, 8, 412, 330, and 26]

Page 18:

MI scores for PF04055.9

[Chart: MI(X,Y) (y-axis, 0 to 1.4) for all (X,Y) pairs where Y > X (x-axis, 0 to 12000)]

MI scores for PF04055.9 with phylogenetic effect reduced

[Chart: MI(X,Y) (y-axis, 0 to 1) for all (X,Y) pairs where Y > X (x-axis, 0 to 12000)]

Page 19:

Mutual Information Analysis in Pfam

2,765 families have one or more PDB structures

Filter on families with > 100 sequences and < 5000 sequences

At least one PDB structure must have 90% agreement with an associated Pfam sequence in the family

783 families were used to test the predictive quality of the STA and RPE methods
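The family selection criteria can be sketched as a filter. The record layout, the field names (`num_sequences`, `pdb_agreements`), and the family identifiers are hypothetical, and "90% agreement" is read here as a fractional match score between a PDB structure's sequence and a Pfam sequence, which is an assumption.

```python
def eligible_families(families, min_seqs=100, max_seqs=5000, min_agreement=0.90):
    """Return the ids of families that pass the size and PDB-agreement filters.

    families -- list of dicts with hypothetical keys:
        "id"             family identifier
        "num_sequences"  number of sequences in the family
        "pdb_agreements" agreement (0..1) of each PDB structure with its
                         best-matching Pfam sequence; empty if no structure
    """
    selected = []
    for fam in families:
        size_ok = min_seqs < fam["num_sequences"] < max_seqs
        pdb_ok = any(a >= min_agreement for a in fam["pdb_agreements"])
        if size_ok and pdb_ok:
            selected.append(fam["id"])
    return selected
```

A family with 250 sequences and one structure at 0.95 agreement passes. A family with 80 sequences, one with 7,000 sequences, or one whose best structure reaches only 0.85 agreement does not.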

Page 20:

Initial Results

        STA               RPE
        %<12A    %Pfam    %<12A    %Pfam
Z>=4    49.7     20.7     57.8     22.4
Z=3     32.8     58.0     35.5     56.0
Z=2     17.5     90.3     19.48    87.1

Page 21:

Additional Filtering

[Chart: percentage of pairs < 12 angstroms per Pfam family (y-axis) against number of MI pairs where Z>=2 (x-axis)]

Page 22:

#MI scores < 500 and MC > 40

        STA               RPE
        %<12A    %Pfam    %<12A    %Pfam
Z>=4    56.2     18.3     81.3     15.8
Z=3     42.6     55.8     56.4     46.4
Z=2     33.2     79.7     36.9     71.6

Page 23:

[Chart: number of Pfam families (y-axis, 0 to 100) in bins from 10 to 100 (x-axis, label not recoverable); series: RPE Z>=4, STA Z>=4, RPE Z=3, STA Z=3]

Page 24:

PF00014.13 PDB 1KUN

Page 25:

Structure Prediction CASP7

Mutual information prediction of co-evolving pairs is used to build a model that scores a predicted tertiary structure

When a PDB structure exists for a particular Pfam family, we have accurate data to score the predicted structure

When no PDB structure exists, a predicted tertiary structure scores better than other predicted structures when its sum of distances between co-evolving pairs is smaller
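The scoring idea for the case with no PDB structure can be sketched as follows. The function names, the coordinate layout (one point per sequence position), and the toy models are assumptions for illustration, not the scoring model actually built for CASP7.

```python
import math

def coevolution_score(coords, pairs):
    """Sum of 3D distances between predicted co-evolving pairs (lower is better).

    coords -- dict mapping sequence position to an (x, y, z) tuple in angstroms
    pairs  -- iterable of (position_x, position_y) predicted co-evolving pairs
    """
    return sum(math.dist(coords[i], coords[j]) for i, j in pairs)

def best_model(models, pairs):
    """Return the name of the candidate structure with the smallest score."""
    return min(models, key=lambda name: coevolution_score(models[name], pairs))

# Hypothetical candidate structures for one predicted pair (8, 412).
models = {
    "compact": {8: (0, 0, 0), 412: (3, 4, 0)},
    "extended": {8: (0, 0, 0), 412: (30, 40, 0)},
}
```

With the single predicted pair (8, 412), the "compact" model scores 5.0 and the "extended" model scores 50.0, so `best_model(models, [(8, 412)])` selects "compact".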