scooter willis university of florida computer and information science and engineering

Predicting co-evolving pairs in Pfam using information theory where entropy is determined by phylogenetic mutation events

Scooter Willis

University of Florida

Computer and Information Science and Engineering

Homology modeling

Proteins grouped by function will share similar structures

Pfam is a large collection of protein sequences grouped by Hidden Markov Models

Pfam 19.0 (December 2005): 8,183 protein families, of which 2,765 have one or more solved PDB structures

Pfam5000

“Implications of Structural Genomics Target Selection Strategies: Pfam5000, Whole Genome, and Random Approaches”, John-Marc Chandonia and Steven E. Brenner, PROTEINS: Structure, Function, Bioinformatics (2005)

NIH is supporting structural genomics projects at 9 pilot centers through the Protein Structure Initiative.

Funding is $300 million over the next five years

Co-evolving pairs

A co-evolving pair is defined as two amino acids more than 10 sequence positions apart but within 12 angstroms of each other in 3D space

Apply information theory to protein families to detect co-evolving pairs, which indicate the tertiary placement of secondary structures

Actively researched topic with numerous publications in the last 5 years

It is accepted that the information is present but difficult to separate from background noise
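The definition above can be written as a small predicate. This is a minimal sketch, not the author's code: the function name, the use of one 3D coordinate per residue (for example the C-alpha atom), and the strict "less than 12" comparison are assumptions made for illustration.

```python
import math

def is_coevolving_candidate(i, j, coord_i, coord_j,
                            min_separation=10, max_distance=12.0):
    """Check the slide's definition for residues at sequence positions i and j.

    coord_i and coord_j are (x, y, z) tuples in angstroms. Which atom
    represents a residue is an assumption of this sketch.
    """
    far_in_sequence = abs(i - j) > min_separation
    close_in_space = math.dist(coord_i, coord_j) < max_distance
    return far_in_sequence and close_in_space
```

For example, positions 8 and 412 with coordinates 5 angstroms apart satisfy both conditions, while positions 8 and 12 fail the sequence-separation condition regardless of distance.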

Information Theory Approach

The entropy H(x), where x is a discrete random variable and p(x) is its probability function, measures the randomness or uncertainty in a signal and is calculated with the following formula.

$$H(x) = -\sum_{x} p(x) \log p(x)$$
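The entropy definition above can be estimated for one alignment column by using observed symbol frequencies as p(x). This is a minimal sketch under assumptions: the function name is hypothetical, gap characters are counted like any other symbol, and base 2 is chosen to match the "log base 2" noted on the later result slides.

```python
import math
from collections import Counter

def column_entropy(column, base=2):
    """Entropy of one alignment column, with p(x) taken as the
    observed frequency of each symbol (standard per-sequence counting)."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total, base)
                for n in counts.values())
```

A fully conserved column such as "CCCC" has entropy 0, and a column split evenly between two residues such as "AACC" has entropy 1 bit.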

Information Theory Approach

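$$H(x) = -\sum_{x} p(x) \log p(x)$$

$$H(y) = -\sum_{y} p(y) \log p(y)$$

$$H(x,y) = -\sum_{x,y} p(x,y) \log p(x,y)$$

$$MI(x,y) = H(x) + H(y) - H(x,y)$$

$$MI(x,y) = \sum_{i,j} p(x_i, y_j) \log\left(\frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}\right)$$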
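The mutual information sum can be estimated for two alignment columns from joint and marginal symbol frequencies. This is a sketch with standard per-sequence probabilities (the STA-style counting described later), not the reduced phylogenetic effect variant; the function name and the base-2 default are assumptions.

```python
import math
from collections import Counter

def mutual_information(col_x, col_y, base=2):
    """MI between two alignment columns of equal length.

    col_x[k] and col_y[k] must come from the same sequence k.
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        p_ab = c / n
        p_a = count_x[a] / n
        p_b = count_y[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b), base)
    return mi
```

Two columns that always change together, such as "AACC" and "DDEE", give 1 bit, while columns whose symbols are independent, such as "AACC" and "DEDE", give 0.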

Mutual Information Venn Diagram

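[Figure: Venn diagram. The circles H(X) and H(Y) overlap in MI(X,Y); the non-overlapping parts are H(X|Y) and H(Y|X); the union of the two circles is H(X,Y).]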

Sampling

The difficulty of applying statistical methods to genetic sequence data sets is that they tend not to be random samples, and the extent of the entire population is unknown

The bias towards protein sequences that have medical research value and the corresponding phylogenetic influences introduce noise, or an indistinguishable background signal, which decreases the quality of statistical measures

The primary impact is on accurately measuring probability, because the sample statistics do not reflect the population's statistics

Phylogenetic Effect

Reducing the impact of the phylogenetic effect when calculating probability will improve the quality of the information and the ability to detect co-evolving pairs
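One possible reading of "probability with reduced phylogenetic effect" is to count a residue (or residue pair) once per mutation event on the tree rather than once per sequence, so that a large clade of near-identical sequences contributes a single observation. The sketch below illustrates that reading only. The slides do not specify the counting rule, so the tree encoding, the assumption that internal node states are already inferred, the choice to count the root state once, and the function name are all hypothetical.

```python
from collections import Counter

def mutation_event_probabilities(parent, state):
    """Estimate symbol probabilities from mutation events on a rooted tree.

    parent: dict mapping each non-root node to its parent node.
    state:  dict mapping every node (leaves and internal) to its symbol;
            a symbol can be one residue or a tuple of residues for a pair.
    Assumption of this sketch: a symbol is counted once at the root and
    once each time a child differs from its parent.
    """
    counts = Counter()
    for node, symbol in state.items():
        up = parent.get(node)
        if up is None or state[up] != symbol:
            counts[symbol] += 1
    total = sum(counts.values())
    return {symbol: c / total for symbol, c in counts.items()}
```

In a toy tree with six leaves (three C, two A, one D) that arise from three events, per-sequence counting gives C = 1/2, A = 1/3, D = 1/6, whereas this event-based counting gives 1/3 to each symbol.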

Phylogenetic Tree

[Figure: phylogenetic tree labeled with the amino acid observed at a single alignment column (A, C, or D).]

Phylogenetic Tree Mutation Events

[Figure: phylogenetic tree labeled with amino acid pairs for two alignment columns; labels shown are CD, CE, DE, AE, and XX.]

Results

Comparison of mutual information calculation using standard probability of sequences (STA) and probability with reduced phylogenetic effect (RPE)

The pairs of interest are amino acid pairs whose mutual information score is at least four standard deviations above the mean (Z>=4)

Maximize the percentage of co-evolving pairs that are less than 12 angstroms apart and greater than 10 sequence positions apart
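The Z>=4 selection can be sketched as follows. The function name and the input layout (a dict from position pair to MI score) are assumptions, and the population standard deviation is used here because the slides do not say which estimator was applied.

```python
import statistics

def high_z_pairs(mi_scores, z_cutoff=4.0):
    """Return {pair: z} for pairs whose MI score is at least z_cutoff
    standard deviations above the mean of all MI scores in the family.

    mi_scores: dict mapping (i, j) position pairs to MI values.
    """
    values = list(mi_scores.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD: an assumption
    if sd == 0:
        return {}
    selected = {}
    for pair, mi in mi_scores.items():
        z = (mi - mean) / sd
        if z >= z_cutoff:
            selected[pair] = z
    return selected
```

The selected pairs would then be checked against a solved structure to see what percentage are less than 12 angstroms apart and more than 10 sequence positions apart.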

QGYSLPPEEDLLGLGMTSTSFI.........RGIYLQNAKTLEEYHNTVLRGTF..ATV..KS.KIL.TED.DRI.RK..WAIHKL.MCTFTIQGYSLPPEEDLLGFGISATSFI.........RGIYLQNVKDLREYSETIQAGKL..ATV..KG.KIL.SQD.DKT.RK..WVIHTL.MCSFSLQGYTTKGGADLVGVGLTSIGEG.........QRHYAQNFKDMSSYEAALDRGVL..PFE..RG.VIL.SDD.DEL.RK..AVIMEL.MANFKLQGYTTKKFTQTIGIGVTSIGEG.........GDYYTQNYKDLHHYEKALDLGHL..PVE..RG.VAL.SQE.DVL.RK..EVIMQM.MSNLKLQGYTTHAGTELFGFGATSISML.........HDAYVQNHKQLKEYYQAVAGDAL..PVS..KG.IKL.TTD.DIL.RR..DVIMCI.MSNFYLQGYTTQPESDLLGFGITSISML.........QDVYAQNHKTLKAFYNALDREVM..PIE..KG.FKL.SQD.DLI.RR..TVIKEL.MCQFKLQGYTTLPTADLIGFGLTSISML.........QAAYAQNQKHLATYFSDVAAGHH.GPQE..CG.FNC.TVE.DLL.RR..TIIMEL.MCQFSLQGYTTKKGVELLGFGATSIGML.........YDSYFQNWKTLRDYNKTVDEGKI..PVF..RG.YVL.NED.DFI.RR..EVIMDI.MCNLGVQGYSTYADCDLVAIGVSSIGKI.........GSTYSQNERDIDAYYAAIDEGRL..PIM..RG.YQL.NQD.DIL.RR..NIIQDL.MCRFALQGYSTHAGYDQVGLGISAIGAI.........AGRYVQNARTLDEYYGALDHGRL..PLA..RG.VAM.SAD.DHL.RR..EIIGAL.MCNGVLQGYSTHADCDLLAFGMSAISRV.........GDVYAQNEKELDAYYARIDAGEL..PVL..RG.LTL.TPD.DHV.RR..ALIGEL.MCGFELMGYTTHADTDLLGLGVSAISHI.........GATYSQNPRDLPSWEDAVDQGQL..PVW..RG.VAL.SAD.DQL.RA..ELIQQL.MCQGEVMGYTTHADTDLIGLGVSAISHF.........GDSYSQNPRELAAWDAAVDRGAL..PVC..RG.MQL.SAD.DLL.RA..EVIQAL.LCRGRVQGYTTQEECDLLGLGVSAISLL.........GDTYAQNQKELKHYYTAIDNTGI..ALH..KG.FAM.SEE.DCL.RR..DVIKQL.ICNFKLQGYTTQGDTDLLGMGVSAISMI.........GDCYAQNQKELKQYYQQVDEQGN..ALW..RG.IAL.TRD.DCI.RR..DVIKSL.ICNFRLQGYTTQGESDLLGLGVSAISML.........GDSYAQNEKDLETYYACVEQRGN..ALW..RG.LTM.TED.DCL.RR..DVIKTL.ICHFQLQGYTTQEECDLLGLGVSSISQI.........GDCYAQNQKDIRPYYEAIDKDGH..ALW..KG.CSL.NRD.DEI.RR..VVIKQL.ICHFDLQGYTTQGECDLVGFGVSAISMI.........GDAYAQNQKELKKYYAQVNDLRH..ALW..KG.VSL.DSD.DLL.RR..EVIKQL.ICNFKLQGYTTHGHCDLIGLGVSAISQI.........GDLYCQNSSDLTAYQNSLASAQL..ATS..RG.LVC.NAD.DRL.RR..AVIQQL.ICNFKLQGYTTHGHCDLIGLGVSAISQI.........GDLYCQNSSDLNTYQDSLSNAQL..ATQ..RG.LLC.NHD.DRI.RR..AVIQQL.ICHFELQGYTTHGHCDLVGLGVSAISQI.........GDLYSQNSSDINDYQTSLDNGQL..AIR..RG.LHC.NSD.DRV.RR..AVIQQL.ICHFELQGYTTHGHCDLIGLGVSAISQV.........GDLYSQNSSDLNDYQR
LLDSDQP..ATL..RG.LIC.SED.DRI.RR..AVIQQL.ICHFTLQGYTTDNEPVLIGLGASAISTF.........SDAYIQNIADIKNYSRAIEEQGL..ASF..RG.IDI.SQE.DHL.RG..EIISAL.MCHFAVQGYTNDRCGTLIGFGPSSISQF.........PGGYAQNISDVGQYRKRVEAGEL..ATV..RG.YTL.RDT.DRI.RS..AIISAL.MCNFCVQGYTTDACETLIGFGASAIGRS.........AHGYVQNEVAIGRYAQSVATGQL..ATA..KG.YRL.TAD.DRL.RA..EIIERI.MCDFSVQGYTTDACETLIGLGASAIGRT.........NDGYVQNEVPPGLYAQHIASGRL..ATV..KG.YRM.TPE.DRL.RA..GIIERL.MCDFGVQGYTTDACKTLIGIGASAIGRF.........GNGYHQNIVPPGLYASCVASGEL..PTA..KI.YEL.TAE.DRV.RA..DVIEQL.MCNFSVQGYSADTCKTLIAFGASAIGRV.........GEGYVENAGALEAYSQHIAAGRL..ATS..KG.YRL.IGE.DRV.RG..AIIERL.MCDLEALGYSADTCKTLIGFGASAIGRV.........GEGYVQNEVTRDSYCRHIAAGRL..ATS..KG.YRL.TDE.DRA.RA..AIIERL.MCDLEALGYSADTCKTVIGLGPSAIGRL.........REGYVQNESATASYHQHIQAGRP..ATS..KG.YCL.SPE.DRL.RA..AIIERL.MCDLQALGYSAETCSTVIGLGASAIGRC.........GDGYVQNDLTQSCYNRHIASGRL..AIS..RG.YRL.ATE.DRV.RA..AIIEQL.MCYLEAQGYTTDQGEVLLGFGASAIGHL.........PQGYVQNEVQIGAYAQSIGASRL..ATA..KG.YGL.TDD.DRL.RA..DIIERI.MCEFSAQGYTTDDCDSLIGLGASAIGRL.........PAGYMQNHVPLGLYAERIAFGVL..PTA..KG.YLL.SEE.DKL.RA..RVIERL.MCDFEA

PF04055.9 Sample Data 180-540

[Figure: PF04055.9, STA method, mutual information log base 2. Distance in angstroms (y-axis, 0 to 30) for each of the 24 amino acid pairs with Z >= 4 (x-axis).]

[Figure: PF04055.9, RPE method, mutual information log base 2 with phylogenetic effect reduced. Distance in angstroms (y-axis, 0 to 25) for each of the 23 amino acid pairs with Z >= 4 (x-axis).]

Sequence Positions of Interest

Mutual information pairs closer than 12 angstroms and greater than 10 sequence positions apart

STA          RPE
410 554      188 412
412 555      8 189
410 555      8 414
8 412        8 412
2 193        189 412
8 198        2 193
             330 412
             26 188
             190 412
             5 188
             188 330
             3 188

Clusters of Interest

[Figure: clusters of interest for the STA and RPE methods. Labeled positions: 410, 555, 8, 198, 188, 8, 412, 330, 26.]

[Figure: MI scores for PF04055.9. MI(X,Y) (y-axis, 0 to 1.4) for all (X,Y) pairs where Y > X (x-axis, 0 to 12000).]

[Figure: MI scores for PF04055.9 with phylogenetic effect reduced. MI(X,Y) (y-axis, 0 to 1) for all (X,Y) pairs where Y > X (x-axis, 0 to 12000).]

Mutual Information Analysis in Pfam

2,765 families have one or more PDB structures

Filter on families with > 100 sequences and < 5000 sequences

At least one PDB structure must have 90% agreement with an associated Pfam sequence in the family

783 families were used to test the predictive quality of the STA and RPE methods
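The family selection criteria above can be sketched as a filter. The record layout, field names, and accession strings are hypothetical, and "90% agreement" is interpreted here as a fractional match of at least 0.90 between a PDB sequence and a Pfam sequence, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Family:
    accession: str                      # hypothetical identifier
    n_sequences: int                    # sequences in the family alignment
    best_pdb_agreement: Optional[float] # best PDB-to-Pfam match, None if no PDB

def passes_filter(family, min_seqs=100, max_seqs=5000, min_agreement=0.90):
    """Keep families with a PDB structure, more than min_seqs and fewer
    than max_seqs sequences, and at least min_agreement PDB agreement."""
    if family.best_pdb_agreement is None:
        return False
    size_ok = min_seqs < family.n_sequences < max_seqs
    return size_ok and family.best_pdb_agreement >= min_agreement
```

Applied to a list of such records, `[f for f in families if passes_filter(f)]` would yield the test set for comparing the STA and RPE methods.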

Initial Results

        STA                RPE
        %<12A    %Pfam     %<12A    %Pfam
Z>=4    49.7     20.7      57.8     22.4
Z=3     32.8     58.0      35.5     56.0
Z=2     17.5     90.3      19.48    87.1

Additional Filtering

[Figure: axis labels "Number of MI pairs where Z>=2" and "Percentage < 12 Angstroms per Pfam family".]

#MI scores < 500 and MC > 40

        STA                RPE
        %<12A    %Pfam     %<12A    %Pfam
Z>=4    56.2     18.3      81.3     15.8
Z=3     42.6     55.8      56.4     46.4
Z=2     33.2     79.7      36.9     71.6

[Figure: chart with axis label "#Pfam families" (axis ticks 0 to 100 and 10 to 100). Series: RPE Z>=4, STA Z>=4, RPE Z=3, STA Z=3.]

PF00014.13 PDB 1KUN

Structure Prediction CASP7

Mutual Information prediction of co-evolving pairs is used to build a model to score a predicted tertiary structure

When a PDB structure exists for a particular Pfam family, we have accurate data to score the predicted structure

When no PDB structure exists, a predicted tertiary structure scores better than other predicted structures when its sum of distances between co-evolving pairs is smaller
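The scoring idea for the no-PDB case can be sketched as a sum of distances over the predicted co-evolving pairs, with the lowest sum preferred. The function names and the coordinate layout (one (x, y, z) point per sequence position) are assumptions for illustration, not the scoring model actually used for CASP7.

```python
import math

def coevolution_score(coords, pairs):
    """Sum of 3D distances between predicted co-evolving pairs.

    coords: dict mapping sequence position to an (x, y, z) tuple.
    pairs:  iterable of (i, j) positions. Lower totals are better.
    """
    return sum(math.dist(coords[i], coords[j]) for i, j in pairs)

def best_model(models, pairs):
    """Return the name of the candidate structure with the lowest score.

    models: dict mapping a model name to its coords dict.
    """
    return min(models, key=lambda name: coevolution_score(models[name], pairs))
```

A candidate that brings the predicted pairs close together scores lower than an extended one, so `best_model` picks the more compact arrangement with respect to those pairs.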
