Predicting co-evolving pairs in Pfam using information theory where entropy is determined by phylogenetic mutation events
Scooter Willis
University of Florida
Computer and Information Science and Engineering
Homology modeling
Proteins grouped by function will share similar structures
Pfam is a large collection of protein sequences grouped by Hidden Markov Models
Pfam 19.0 (December 2005) contains 8,183 protein families, of which 2,765 have one or more solved PDB structures
Pfam5000
“Implications of Structural Genomics Target Selection Strategies: Pfam5000, Whole Genome, and Random Approaches”, John-Marc Chandonia and Steven E. Brenner, PROTEINS: Structure, Function, and Bioinformatics (2005)
NIH is supporting structural genomics projects at 9 pilot centers through the Protein Structure Initiative.
Funding is $300 million over the next five years
Co-evolving pairs
A co-evolving pair is defined as two amino acids more than 10 sequence positions apart but within 12 angstroms of each other in 3D space
Apply information theory to protein families to detect co-evolving pairs, which indicate the tertiary placement of secondary structures
Actively researched topic with numerous publications in the last 5 years
Accepted that the information value is present but difficult to separate from background noise
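The pair definition above can be written as a small predicate. This is a minimal sketch: the function name and constants are hypothetical, and treating "within 12 angstroms" as inclusive of exactly 12 is an assumption the slides do not settle.

```python
MIN_SEQUENCE_SEPARATION = 10    # pairs must be MORE than 10 positions apart
MAX_DISTANCE_ANGSTROMS = 12.0   # and within 12 angstroms in 3D space (assumed inclusive)


def is_coevolving_pair(pos_i, pos_j, distance_angstroms):
    """Return True when two residues satisfy the definition on this slide."""
    far_in_sequence = abs(pos_i - pos_j) > MIN_SEQUENCE_SEPARATION
    close_in_space = distance_angstroms <= MAX_DISTANCE_ANGSTROMS
    return far_in_sequence and close_in_space
```

The distance would normally come from a solved PDB structure for the family.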
Information Theory Approach
The entropy H(x), where x is a discrete random variable and p(x) is its probability function, measures the randomness or uncertainty in a signal and is calculated with the following formula.
$$H(x) = -\sum_{x} p(x) \log p(x)$$
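As an illustration of the entropy formula, here is a minimal sketch that treats one alignment column as the signal and estimates p(x) from raw residue frequencies. The function name is hypothetical, and log base 2 is chosen to match the "log base 2" labels on the later result slides.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # Sum of -p(x) * log2 p(x) over every distinct symbol x in the column.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())
```

A fully conserved column has zero entropy, while a column split evenly between two residues carries one bit.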
Information Theory Approach
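$$H(x) = -\sum_{x} p(x) \log p(x)$$
$$H(y) = -\sum_{y} p(y) \log p(y)$$
$$H(x,y) = -\sum_{x,y} p(x,y) \log p(x,y)$$
$$MI(x,y) = H(x) + H(y) - H(x,y)$$
$$MI(x,y) = \sum_{i,j} p(x_i,y_j) \log\left(\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)}\right)$$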
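The identity MI(x,y) = H(x) + H(y) - H(x,y) translates directly into code. The sketch below assumes two alignment columns given as equal-length strings (one residue per sequence) and uses plain frequency counts for the probabilities, which corresponds to the standard estimate rather than any phylogenetically corrected one. Function names are hypothetical.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy, in bits, of a sequence of hashable symbols."""
    total = len(symbols)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(symbols).values())


def mutual_information(column_x, column_y):
    """MI(x, y) = H(x) + H(y) - H(x, y) for two alignment columns."""
    if len(column_x) != len(column_y):
        raise ValueError("columns must have one residue per sequence")
    # The joint variable is the pair of residues seen in the same sequence.
    joint = list(zip(column_x, column_y))
    return entropy(column_x) + entropy(column_y) - entropy(joint)
```

Two columns that always change together share all of their information, while statistically independent columns score zero.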
Mutual Information Venn Diagram
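[Figure: Venn diagram of two overlapping circles H(X) and H(Y). The overlap is MI(X,Y), the non-overlapping regions are H(X|Y) and H(Y|X), and the union is H(X,Y).]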
Sampling
The difficulty of applying statistical methods to genetic sequence data sets is that they tend not to be random samples and the extent of the entire population is unknown
The bias towards protein sequences that have medical research value, together with the corresponding phylogenetic influences, introduces noise or an indistinguishable background signal that decreases the quality of statistical measures
The primary impact is on accurately measuring probability because the sample statistics do not reflect the population’s statistics
Phylogenetic Effect
Reducing the impact of the phylogenetic effect when calculating probability will improve the quality of the information and the ability to detect co-evolving pairs
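One possible reading of this idea, offered only as a hedged sketch and not as the author's procedure, is to count a residue once per mutation event on the tree instead of once per sequence. The example assumes a rooted tree stored as a child-to-parent map and assumes the residue at every node, including internal ones, is already known; both the data layout and the function names are hypothetical.

```python
from collections import Counter


def leaf_counts(parent_of, state):
    """Standard counting: every leaf sequence contributes one observation."""
    internal = set(parent_of.values())
    return Counter(res for node, res in state.items() if node not in internal)


def mutation_event_counts(parent_of, state):
    """Count a residue only where it first appears: at the root, or on an
    edge where the child differs from its parent (a mutation event)."""
    counts = Counter()
    for node, residue in state.items():
        parent = parent_of.get(node)
        if parent is None or state[parent] != residue:
            counts[residue] += 1
    return counts
```

With this counting, a large clade that inherited the same residue from one ancestor contributes a single observation, so heavily sampled lineages no longer dominate the probability estimate.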
Phylogenetic Tree
[Figure: phylogenetic tree whose nodes are labeled with the residue at a single alignment column (A, D, or C).]
Phylogenetic Tree Mutation Events
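[Figure: phylogenetic tree whose nodes are labeled with residue pairs at two alignment columns (CD, CE, AE, DE); some nodes are labeled XX.]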
Results
Comparison of mutual information calculation using standard probability of sequences (STA) and probability with reduced phylogenetic effect (RPE)
Interested in amino acid pairs that have a mutual information score greater than four standard deviations from the mean (Z>=4)
Maximize the percentage of co-evolving pairs that are less than 12 angstroms apart and greater than 10 sequence positions apart
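The Z >= 4 selection can be sketched as follows. The input is assumed to be a dictionary mapping a position pair to its MI score, and the population standard deviation is used; the slides do not say whether the population or sample form was applied, so that choice and the function name are assumptions.

```python
from statistics import mean, pstdev


def high_scoring_pairs(mi_scores, z_cutoff=4.0):
    """Return the pairs whose MI score lies at least z_cutoff standard
    deviations above the mean of all scores."""
    values = list(mi_scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return []           # all scores identical, nothing stands out
    return sorted(pair for pair, score in mi_scores.items()
                  if (score - mu) / sigma >= z_cutoff)
```

The selected pairs would then be checked against the 12 angstrom and 10 position criteria using a solved structure.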
QGYSLPPEEDLLGLGMTSTSFI.........RGIYLQNAKTLEEYHNTVLRGTF..ATV..KS.KIL.TED.DRI.RK..WAIHKL.MCTFTIQGYSLPPEEDLLGFGISATSFI.........RGIYLQNVKDLREYSETIQAGKL..ATV..KG.KIL.SQD.DKT.RK..WVIHTL.MCSFSLQGYTTKGGADLVGVGLTSIGEG.........QRHYAQNFKDMSSYEAALDRGVL..PFE..RG.VIL.SDD.DEL.RK..AVIMEL.MANFKLQGYTTKKFTQTIGIGVTSIGEG.........GDYYTQNYKDLHHYEKALDLGHL..PVE..RG.VAL.SQE.DVL.RK..EVIMQM.MSNLKLQGYTTHAGTELFGFGATSISML.........HDAYVQNHKQLKEYYQAVAGDAL..PVS..KG.IKL.TTD.DIL.RR..DVIMCI.MSNFYLQGYTTQPESDLLGFGITSISML.........QDVYAQNHKTLKAFYNALDREVM..PIE..KG.FKL.SQD.DLI.RR..TVIKEL.MCQFKLQGYTTLPTADLIGFGLTSISML.........QAAYAQNQKHLATYFSDVAAGHH.GPQE..CG.FNC.TVE.DLL.RR..TIIMEL.MCQFSLQGYTTKKGVELLGFGATSIGML.........YDSYFQNWKTLRDYNKTVDEGKI..PVF..RG.YVL.NED.DFI.RR..EVIMDI.MCNLGVQGYSTYADCDLVAIGVSSIGKI.........GSTYSQNERDIDAYYAAIDEGRL..PIM..RG.YQL.NQD.DIL.RR..NIIQDL.MCRFALQGYSTHAGYDQVGLGISAIGAI.........AGRYVQNARTLDEYYGALDHGRL..PLA..RG.VAM.SAD.DHL.RR..EIIGAL.MCNGVLQGYSTHADCDLLAFGMSAISRV.........GDVYAQNEKELDAYYARIDAGEL..PVL..RG.LTL.TPD.DHV.RR..ALIGEL.MCGFELMGYTTHADTDLLGLGVSAISHI.........GATYSQNPRDLPSWEDAVDQGQL..PVW..RG.VAL.SAD.DQL.RA..ELIQQL.MCQGEVMGYTTHADTDLIGLGVSAISHF.........GDSYSQNPRELAAWDAAVDRGAL..PVC..RG.MQL.SAD.DLL.RA..EVIQAL.LCRGRVQGYTTQEECDLLGLGVSAISLL.........GDTYAQNQKELKHYYTAIDNTGI..ALH..KG.FAM.SEE.DCL.RR..DVIKQL.ICNFKLQGYTTQGDTDLLGMGVSAISMI.........GDCYAQNQKELKQYYQQVDEQGN..ALW..RG.IAL.TRD.DCI.RR..DVIKSL.ICNFRLQGYTTQGESDLLGLGVSAISML.........GDSYAQNEKDLETYYACVEQRGN..ALW..RG.LTM.TED.DCL.RR..DVIKTL.ICHFQLQGYTTQEECDLLGLGVSSISQI.........GDCYAQNQKDIRPYYEAIDKDGH..ALW..KG.CSL.NRD.DEI.RR..VVIKQL.ICHFDLQGYTTQGECDLVGFGVSAISMI.........GDAYAQNQKELKKYYAQVNDLRH..ALW..KG.VSL.DSD.DLL.RR..EVIKQL.ICNFKLQGYTTHGHCDLIGLGVSAISQI.........GDLYCQNSSDLTAYQNSLASAQL..ATS..RG.LVC.NAD.DRL.RR..AVIQQL.ICNFKLQGYTTHGHCDLIGLGVSAISQI.........GDLYCQNSSDLNTYQDSLSNAQL..ATQ..RG.LLC.NHD.DRI.RR..AVIQQL.ICHFELQGYTTHGHCDLVGLGVSAISQI.........GDLYSQNSSDINDYQTSLDNGQL..AIR..RG.LHC.NSD.DRV.RR..AVIQQL.ICHFELQGYTTHGHCDLIGLGVSAISQV.........GDLYSQNSSDLNDYQR
LLDSDQP..ATL..RG.LIC.SED.DRI.RR..AVIQQL.ICHFTLQGYTTDNEPVLIGLGASAISTF.........SDAYIQNIADIKNYSRAIEEQGL..ASF..RG.IDI.SQE.DHL.RG..EIISAL.MCHFAVQGYTNDRCGTLIGFGPSSISQF.........PGGYAQNISDVGQYRKRVEAGEL..ATV..RG.YTL.RDT.DRI.RS..AIISAL.MCNFCVQGYTTDACETLIGFGASAIGRS.........AHGYVQNEVAIGRYAQSVATGQL..ATA..KG.YRL.TAD.DRL.RA..EIIERI.MCDFSVQGYTTDACETLIGLGASAIGRT.........NDGYVQNEVPPGLYAQHIASGRL..ATV..KG.YRM.TPE.DRL.RA..GIIERL.MCDFGVQGYTTDACKTLIGIGASAIGRF.........GNGYHQNIVPPGLYASCVASGEL..PTA..KI.YEL.TAE.DRV.RA..DVIEQL.MCNFSVQGYSADTCKTLIAFGASAIGRV.........GEGYVENAGALEAYSQHIAAGRL..ATS..KG.YRL.IGE.DRV.RG..AIIERL.MCDLEALGYSADTCKTLIGFGASAIGRV.........GEGYVQNEVTRDSYCRHIAAGRL..ATS..KG.YRL.TDE.DRA.RA..AIIERL.MCDLEALGYSADTCKTVIGLGPSAIGRL.........REGYVQNESATASYHQHIQAGRP..ATS..KG.YCL.SPE.DRL.RA..AIIERL.MCDLQALGYSAETCSTVIGLGASAIGRC.........GDGYVQNDLTQSCYNRHIASGRL..AIS..RG.YRL.ATE.DRV.RA..AIIEQL.MCYLEAQGYTTDQGEVLLGFGASAIGHL.........PQGYVQNEVQIGAYAQSIGASRL..ATA..KG.YGL.TDD.DRL.RA..DIIERI.MCEFSAQGYTTDDCDSLIGLGASAIGRL.........PAGYMQNHVPLGLYAERIAFGVL..PTA..KG.YLL.SEE.DKL.RA..RVIERL.MCDFEA
PF04055.9 Sample Data 180-540
[Figure: PF04055.9, STA method, mutual information (log base 2). x-axis: Z >= 4 amino acid pairs (1 to 24); y-axis: angstroms (0 to 30).]
[Figure: PF04055.9, RPE method, mutual information (log base 2) with phylogenetic effect reduced. x-axis: Z >= 4 amino acid pairs (1 to 23); y-axis: angstroms (0 to 25).]
Sequence Positions of Interest
Mutual information pairs closer than 12 angstroms and greater than 10 sequence positions apart
STA          RPE
410 554      188 412
412 555      8 189
410 555      8 414
8 412        8 412
2 193        189 412
8 198        2 193
             330 412
             26 188
             190 412
             5 188
             188 330
             3 188
Clusters of Interest
[Figure: clusters of interest for the STA and RPE methods; labeled positions: 410, 555, 8, 198, 188, 8, 412, 330, 26.]
[Figure: MI scores for PF04055.9. x-axis: all (X,Y) pairs where Y > X (0 to 12000); y-axis: MI(X,Y) (0 to 1.4).]
[Figure: MI scores for PF04055.9 with phylogenetic effect reduced. x-axis: all (X,Y) pairs where Y > X (0 to 12000); y-axis: MI(X,Y) (0 to 1).]
Mutual Information Analysis in Pfam
2,765 families have one or more PDB structures
Filter on families with > 100 sequences and < 5000 sequences
At least one PDB structure must have 90% agreement with an associated Pfam sequence in the family
783 families were used to test the predictive quality of the STA and RPE methods
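The family selection criteria above can be sketched as a single predicate. Two assumptions are made here that the slides do not confirm: "90% agreement" is taken to be a fraction of at least 0.90 between a PDB structure's sequence and a Pfam sequence, and the per-structure agreement values are assumed to be precomputed. The function name is hypothetical.

```python
def family_passes_filter(num_sequences, pdb_agreements):
    """Apply the size and structure-agreement filters to one Pfam family.

    pdb_agreements: agreement fraction (0.0 to 1.0) for each PDB structure
    associated with the family; an empty list means no solved structure.
    """
    right_size = 100 < num_sequences < 5000
    has_matching_structure = any(a >= 0.90 for a in pdb_agreements)
    return right_size and has_matching_structure
```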
Initial Results
        STA               RPE
        %<12A   %Pfam     %<12A   %Pfam
Z>=4    49.7    20.7      57.8    22.4
Z=3     32.8    58.0      35.5    56.0
Z=2     17.5    90.3      19.48   87.1
Additional Filtering
[Figure: x-axis: number of MI pairs where Z>=2; y-axis: percentage < 12 angstroms per Pfam family.]
#MI scores < 500 and MC > 40
        STA               RPE
        %<12A   %Pfam     %<12A   %Pfam
Z>=4    56.2    18.3      81.3    15.8
Z=3     42.6    55.8      56.4    46.4
Z=2     33.2    79.7      36.9    71.6
[Figure: chart with x-axis from 10 to 100 and y-axis #Pfam families (0 to 100); series: RPE Z>=4, STA Z>=4, RPE Z=3, STA Z=3.]
PF00014.13 PDB 1KUN
Structure Prediction CASP7
Mutual Information prediction of co-evolving pairs is used to build a model to score a predicted tertiary structure
When a PDB structure exists for a particular Pfam family, we have accurate data to score the predicted structure
When no PDB structure exists, a predicted tertiary structure scores better than other predicted structures when its sum of distances between co-evolving pairs is smaller
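The scoring rule for the no-PDB case can be sketched as below. Each candidate model is assumed to be a dictionary from sequence position to a 3D coordinate (for example a C-alpha position, which is an assumption), and the predicted co-evolving pairs are given as position tuples. Function names are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def coevolving_distance_sum(coords, pairs):
    """Sum of Euclidean distances between the predicted co-evolving pairs."""
    return sum(math.dist(coords[i], coords[j]) for i, j in pairs)


def best_model(models, pairs):
    """Name of the candidate structure with the smallest distance sum."""
    return min(models,
               key=lambda name: coevolving_distance_sum(models[name], pairs))
```

A lower sum means the model places the predicted pairs closer together, which is the property the slide uses to rank candidate structures.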