Contact map guided ab initio structure prediction S M Golam Mortuza Postdoctoral Research Fellow I-TASSER Workshop 2017 North Carolina A&T State University, Greensboro, NC


Contact map guided ab initio structure prediction

S M Golam Mortuza

Postdoctoral Research Fellow

I-TASSER Workshop 2017

North Carolina A&T State University, Greensboro, NC

Outline

• Ab initio structure prediction: QUARK

• Contact map prediction: NeBcon

• Contact guided ab initio structure prediction: C-QUARK

• Ab initio GPCR structure prediction: GPCR-AIM

3/20/2017 2

Ab initio structure prediction

• In the absence of homologous templates, I-TASSER based models are often less useful for biomedical studies because the models are less accurate

• Ab initio protein folding methods assemble protein structures without using templates

• Ab initio structure modeling represents the most challenging problem in structure prediction

QUARK: Ab initio structure prediction method

[Figure: overview of the QUARK method; recovered label: Knowledge-based potentials]

QUARK: Fragment generation and distance profile


Xu et al., Proteins-Structure Function and Bioinformatics, 81(2), pp. 229-239 (2012)

QUARK: Energy Function

$E_{tot} = E_{prm} + w_1 E_{prs} + w_2 E_{ev} + w_3 E_{hb} + w_4 E_{sa} + w_5 E_{dh} + w_6 E_{dp} + w_7 E_{rg} + w_8 E_{dab} + w_9 E_{hp} + w_{10} E_{bp}$

1. Backbone atomic pair-wise potential ($E_{prm}$)

2. Side-chain center pair-wise potential ($E_{prs}$)

3. Excluded volume ($E_{ev}$)

4. Hydrogen bonding ($E_{hb}$)

5. Solvent accessibility ($E_{sa}$)

6. Backbone torsion potential ($E_{dh}$)

7. Fragment-based distance profile ($E_{dp}$)

8. Radius of gyration ($E_{rg}$)

9. Strand-helix-strand packing ($E_{dab}$)

10. Helix packing ($E_{hp}$)

11. Strand packing ($E_{bp}$)
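The weighted sum above can be sketched in a few lines of Python. This is only an illustration of how the eleven terms combine: the term keys mirror the subscripts on the slide, while the function name, the dictionary layout, and any numeric values passed in are assumptions, not QUARK's actual code or weights.

```python
# Order of the ten weighted terms, matching w_1 ... w_10 on the slide.
# (The dictionary-based layout is an assumption made for this sketch.)
TERM_ORDER = ["prs", "ev", "hb", "sa", "dh", "dp", "rg", "dab", "hp", "bp"]


def total_energy(terms, weights):
    """Return E_tot = E_prm + sum_k w_k * E_k for the ten weighted terms.

    terms   -- dict mapping term name ("prm", "prs", ...) to its energy value
    weights -- sequence of ten weights w_1 ... w_10, in TERM_ORDER
    """
    if len(weights) != len(TERM_ORDER):
        raise ValueError("expected one weight per weighted term")
    weighted = sum(w * terms[name] for w, name in zip(weights, TERM_ORDER))
    return terms["prm"] + weighted
```

The backbone pair-wise term enters unweighted, so it effectively sets the scale against which the other ten weights are tuned.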

Problems with Metropolis Monte Carlo

1. The simulation can get trapped in a local energy basin

2. Increasing T helps the simulation cross local energy barriers, but at high T it cannot detect low-energy regions

[Figure: energy E versus conformation X, shown at low temperature and at high temperature]

$p_{accept} \sim e^{-dE/T}$
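The acceptance rule can be sketched as below. The slide only gives the proportionality $p_{accept} \sim e^{-dE/T}$; accepting every downhill move (dE <= 0) outright is the usual Metropolis convention and is an assumption here, as are the function and parameter names.

```python
import math
import random


def metropolis_accept(d_e, temperature, rng=random):
    """Decide whether to accept a move that changes the energy by d_e.

    Downhill moves are always accepted (standard Metropolis convention);
    uphill moves are accepted with probability exp(-d_e / temperature).
    `rng` only needs a .random() method returning a float in [0, 1).
    """
    if d_e <= 0:
        return True
    return rng.random() < math.exp(-d_e / temperature)
```

For an uphill move with dE = 1, the acceptance probability is about 0.37 at T = 1 and about 0.90 at T = 10, which is why a hot simulation crosses barriers easily but no longer distinguishes low-energy regions.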

Replica Exchange Monte Carlo

Each replica k runs at its own temperature (T1, T2, T3):

1. Initial Random Configuration

2. Make Random Change

3. Calculate dE

4. Accept with $p_{accept} = e^{-dE/T_k}$

Swap between replicas i and j: $P_{swap}^{i,j} = e^{(E_i - E_j)(1/T_i - 1/T_j)}$

$T_{max} = 2.4 + 0.016L$, $T_{min} = 0.6 + 0.00067L$
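A minimal sketch of the two ingredients specific to replica exchange: the temperature range and the swap test. The end points come from the slide, with L taken to be the protein length. The geometric spacing between them, the cap of the swap probability at 1, and all function names are assumptions for illustration; the slide does not say how the intermediate temperatures are chosen.

```python
import math


def temperature_ladder(length, n_replicas):
    """Temperatures from T_min to T_max for a protein of `length` residues.

    End points follow the slide; geometric spacing is an assumption.
    """
    t_min = 0.6 + 0.00067 * length
    t_max = 2.4 + 0.016 * length
    if n_replicas == 1:
        return [t_min]
    ratio = (t_max / t_min) ** (1.0 / (n_replicas - 1))
    return [t_min * ratio ** k for k in range(n_replicas)]


def swap_probability(e_i, e_j, t_i, t_j):
    """Probability of exchanging replicas i and j, capped at 1.

    Uses exp((E_i - E_j) * (1/T_i - 1/T_j)) as on the slide.
    """
    exponent = (e_i - e_j) * (1.0 / t_i - 1.0 / t_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)
```

With this form, a swap is certain whenever the colder replica currently holds the higher energy, so low-energy conformations found at high temperature migrate down to the cold replicas.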

Benchmark Results: QUARK vs. Rosetta

Data set: 51 small proteins (70-100 AA) and 94 medium proteins (100-150 AA)

RMSD: QUARK models are better than Rosetta models for 96 of 145 targets (p-value: 1.51×10⁻⁴)

TM-score: QUARK models are better than Rosetta models for 95 of 145 targets (p-value: 2.87×10⁻⁷)

Xu et al., Proteins-Structure Function and Bioinformatics, 80(7), pp. 1715-1735 (2012)

Benchmark Results: QUARK vs. Rosetta

First cluster center model, with the best of the top five models in parentheses:

| Data set | Method | RMSD (Å) | TM-score |
|---|---|---|---|
| 51 small proteins (70-100 residues) | Rosetta | 10.1 (8.5) | 0.350 (0.393) |
| 51 small proteins (70-100 residues) | QUARK | 9.1 (7.7) | 0.404 (0.441) |
| 94 medium proteins (100-150 residues) | Rosetta | 13.0 (11.5) | 0.317 (0.346) |
| 94 medium proteins (100-150 residues) | QUARK | 12.5 (10.7) | 0.334 (0.374) |

Xu et al., Proteins-Structure Function and Bioinformatics, 80(7), pp. 1715-1735 (2012)

Benchmark Results: QUARK vs. Rosetta

[Figure: example models. Red: native; blue: Rosetta; green: QUARK]

Xu et al., Proteins-Structure Function and Bioinformatics, 80(7), pp. 1715-1735 (2012)

QUARK in CASP Experiments

| CASP9 group | Z | CASP10 group | Z | CASP11 group | Z |
|---|---|---|---|---|---|
| QUARK | 31.6 | QUARK | 17.1 | QUARK | 33.5 |
| Multicon-Refine | 22.4 | TASSER-VMT | 13.9 | RBO_Aleph | 29.6 |
| Chunk-TASSER | 20.7 | Pcons-net | 13.7 | Multicom-con | 21.4 |
| RaptorX | 19.8 | PMS | 11.7 | RaptorX-FM | 17.6 |
| Baker-Rosetta | 19.0 | RaptorX-Roll | 11.3 | myprotein-me | 15.9 |
| Jiang_Assembly | 14.7 | HHpred-thread | 10.9 | TASSER-VMT | 15.8 |
| Gws | 13.9 | Multicom-clust | 10.6 | Baker-Rosetta | 15.7 |
| BioSerf | 13.6 | RBO-MBS | 9.1 | Seok-server | 15.6 |
| SAM-T08-server | 12.7 | MUFold_CRF | 8.8 | FUSION | 15.5 |
| Seok-server | 12.6 | Baker-Rosetta | 8.1 | nns | 15.4 |

Here, Z-score (Z) represents the significance of the structure predictions by each group compared to the average performance

QUARK modeling of T0837-D1 (128 AA) in CASP 11

Assessor’s comment: T0837-D1_499_1 represents the FM model with the biggest improvement over PDB templates in the CASP11 experiment

QUARK fragments RMSD ~ 0.1-2.6 Å

Why does Zhang-Server perform better than QUARK in CASP experiments?

• Models built by QUARK are compared with threading templates by LOMETS

• The templates are then re-ranked by their similarity to the QUARK models before they are subjected to the I-TASSER structure-assembly simulations.


Zhang et al., Proteins, 84, pp.76-86 (2015)

Limitations in current methods

• Can fold only small proteins (<150 residues)

• Can fold beta-proteins only when the topology is simple

[Figure: target R0014, CASP10]

Contact maps in ab initio protein structure prediction

• Sequence-based contact map prediction can be useful for folding the 3D structure of larger proteins that have complicated topologies

• Incorrectly predicted contacts can be harmful to 3D structure construction

• Contact prediction should have an accuracy of at least 22% to have a positive effect on ab initio structure prediction

Basic information on contact maps

• Residues are in contact if the distance between 𝐶𝛼 or 𝐶𝛽 atoms of the residues is < 8 Å

• Contact classification:

o Short range: Sequence separation 6-11 residues

o Medium range: Sequence separation 12-24 residues

o Long range: Sequence separation >24 residues

[Figure: sequence TTSQKHRDFVAEPGEKPVGSLAGIGEVLGKKLEERG with residues 1, 7, 13 and 26 marked; panel labels: Short range, Medium range, Long range]
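The two definitions above (the 8 Å cutoff and the sequence-separation classes) can be sketched as follows. The function names and the plain list-of-tuples coordinate format are assumptions for illustration; the coordinates would be one Cα or Cβ position per residue. Pairs closer than 6 residues in sequence fall outside all three classes and return None.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Boolean contact map from one (x, y, z) point per residue, in Å.

    Residues i and j are in contact if their distance is below `cutoff`.
    A residue is not counted as being in contact with itself.
    """
    n = len(coords)
    return [
        [i != j and math.dist(coords[i], coords[j]) < cutoff for j in range(n)]
        for i in range(n)
    ]


def contact_range(i, j):
    """Classify a residue pair by sequence separation |i - j|."""
    separation = abs(i - j)
    if separation < 6:
        return None
    if separation <= 11:
        return "short"
    if separation <= 24:
        return "medium"
    return "long"
```

Counting from residue 1, residues 7, 13 and 26 are the first positions that fall into the short, medium and long classes.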

Programs for predicting contact maps

• Machine Learning:

o BETACON

o SVMcon

o SVMSEQ

• Coevolution:

o PSICOV

o CCMpred

o mfDCA

o Gremlin

• Meta:

o STRUCTCH

o MetaPSICOV

o PconsC2, PconsC31

NeBcon (Neural network and Bayes-classifier based contact prediction)


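Naïve Bayes Classifier (NBC)

$X_{ij} = (X_{ij}^{1}, X_{ij}^{2}, \cdots, X_{ij}^{N})$

$X_{ij}^{m}$ is the confidence score for the ith and jth residues to be in contact as predicted by the mth contact predictor.

$C = 0$: in contact; $C = 1$: not in contact

Bayes' theorem:

$P(C \mid X_{ij}) = \frac{P(C)\, P(X_{ij} \mid C)}{P(X_{ij})}$

Under the “naïve” assumption, the confidence scores from different contact predictors are independent of each other:

$P(C \mid X_{ij}) = \frac{P(C) \prod_{m=1}^{N} P(X_{ij}^{m} \mid C)}{P(X_{ij})} = \frac{P(C) \prod_{m=1}^{N} P(X_{ij}^{m} \mid C)}{P(0) \prod_{m=1}^{N} P(X_{ij}^{m} \mid 0) + P(1) \prod_{m=1}^{N} P(X_{ij}^{m} \mid 1)}$

For the in-contact class:

$P(0 \mid X_{ij}) = \frac{P(0) \prod_{m=1}^{N} P(X_{ij}^{m} \mid 0)}{P(0) \prod_{m=1}^{N} P(X_{ij}^{m} \mid 0) + P(1) \prod_{m=1}^{N} P(X_{ij}^{m} \mid 1)}$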
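The posterior for the in-contact class can be computed as in the sketch below. It assumes the per-predictor likelihoods $P(X_{ij}^{m} \mid C)$ have already been estimated elsewhere (for example from a training set); how they are estimated is not shown on the slide. The function and argument names are hypothetical.

```python
import math


def contact_posterior(prior_contact, lik_contact, lik_noncontact):
    """Posterior probability that residues i and j are in contact.

    prior_contact  -- prior probability of the in-contact class
    lik_contact    -- P(X_ij^m | in contact) for each predictor m
    lik_noncontact -- P(X_ij^m | not in contact) for each predictor m

    Under the naive assumption the likelihoods are multiplied together.
    """
    numerator = prior_contact * math.prod(lik_contact)
    other = (1.0 - prior_contact) * math.prod(lik_noncontact)
    return numerator / (numerator + other)
```

For example, with two predictors whose scores are four and 1.5 times more likely under the in-contact class (likelihoods 0.8 vs 0.2 and 0.6 vs 0.4), a prior of 0.1 is raised to a posterior of 0.4. For many predictors, summing logarithms instead of multiplying would avoid numerical underflow.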

Contact prediction accuracy comparison

[Figure: bar chart, 50 easy targets, top L/5 long-range contacts; y-axis: accuracy. Bar values in order of extraction (method labels not recovered): 0.406, 0.341, 0.288, 0.406, 0.432, 0.364, 0.459, 0.709, 0.798]

Accuracy of the prediction: $Acc = N_{corr} / N_{T}$

• $N_{corr}$ = number of correctly predicted contacts in the contact map

• $N_{T}$ = number of predicted contacts in the contact map

[Figure: bar chart, 48 hard targets, top L/5 long-range contacts; y-axis: accuracy. Bar values in order of extraction (method labels not recovered): 0.198, 0.167, 0.181, 0.134, 0.119, 0.094, 0.242, 0.312, 0.451]
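A small sketch of how the top L/5 long-range accuracy could be computed from a scored prediction list. The input format, a list of (i, j, score) triples plus a set of native contact pairs, and the function name are assumptions; the definition of long range (separation > 24) and Acc = N_corr / N_T follow the slides.

```python
def top_long_range_accuracy(predictions, native_contacts, length, divisor=5):
    """Accuracy of the top L/divisor long-range predicted contacts.

    predictions     -- iterable of (i, j, score) triples
    native_contacts -- set of (i, j) pairs with i < j that are true contacts
    length          -- protein length L
    """
    long_range = [p for p in predictions if abs(p[0] - p[1]) > 24]
    long_range.sort(key=lambda p: p[2], reverse=True)
    top = long_range[: max(1, length // divisor)]
    if not top:
        return 0.0
    n_corr = sum(
        1 for i, j, _ in top if (min(i, j), max(i, j)) in native_contacts
    )
    return n_corr / len(top)
```

Only the highest-scoring predictions are evaluated, so the measure reflects how reliable a predictor's most confident long-range contacts are rather than how many contacts it recovers overall.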

Contact prediction accuracy comparison (all ranges)

| Methods | Short (6-11) | Medium (12-24) | Long (>24) |
|---|---|---|---|
| BETACON | 0.540 (1×10⁻⁹) | 0.430 (3×10⁻¹⁰) | 0.310 (2×10⁻¹²) |
| SVMSEQ | 0.475 (2×10⁻¹²) | 0.393 (2×10⁻¹²) | 0.236 (2×10⁻¹²) |
| SVMcon | 0.564 (4×10⁻⁹) | 0.455 (1×10⁻⁸) | 0.255 (2×10⁻¹²) |
| PSICOV | 0.204 (2×10⁻¹²) | 0.246 (2×10⁻¹²) | 0.262 (2×10⁻¹²) |
| CCMpred | 0.206 (2×10⁻¹²) | 0.238 (2×10⁻¹²) | 0.227 (2×10⁻¹²) |
| FreeContact | 0.234 (2×10⁻¹²) | 0.278 (2×10⁻¹²) | 0.278 (2×10⁻¹²) |
| STRUCTCH | 0.605 (3×10⁻⁴) | 0.487 (4×10⁻⁵) | 0.353 (2×10⁻¹²) |
| MetaPSICOV | 0.576 (5×10⁻⁶) | 0.572 (5×10⁻¹) | 0.515 (2×10⁻⁷) |
| NeBcon | 0.651 | 0.574 | 0.628 |

Contact prediction accuracy comparison (long range)

• MetaPSICOV vs. NBC: average ACC of MetaPSICOV = 0.515, average ACC of NBC = 0.546, p-value = 0.03

• NeBcon vs. NBC: average ACC of NeBcon = 0.628, average ACC of NBC = 0.546, p-value = 3.5×10⁻⁸

He et al., Bioinformatics (2017)

Diversity of contact maps

$H = -\sum_{i=1}^{100} p_i \log_2 p_i$

$p_i$ is the fraction of the top-L contacts in the ith cell, where L is the length of the protein

• $H_{min} = 0$: all contacts are accumulated in one cell

• $H_{max} = 6.64$ ($= \log_2 100$): all contacts are evenly distributed when L > 100
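The entropy above can be sketched as follows. The slide fixes the number of cells at 100 but does not say how the map is partitioned; the 10 × 10 grid used here is an assumption, as are the function name and the 0-based residue indexing.

```python
import math


def contact_map_entropy(contacts, length, grid=10):
    """Shannon entropy (in bits) of contacts binned into grid x grid cells.

    contacts -- iterable of (i, j) residue index pairs, 0-based
    length   -- protein length L
    The 10 x 10 partition (100 cells) is an assumption of this sketch.
    """
    counts = [0] * (grid * grid)
    for i, j in contacts:
        row = min(i * grid // length, grid - 1)
        col = min(j * grid // length, grid - 1)
        counts[row * grid + col] += 1
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)
```

A predictor whose top-L contacts pile up in a few cells scores low, while one that spreads them over the whole map approaches log2(100) ≈ 6.64.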

Diversity of contact maps

| Methods | Long | All |
|---|---|---|
| BETACON | 2.656 (8.4×10⁻¹⁶) | 3.912 (6.9×10⁻²⁵) |
| SVMSEQ | 3.540 (4.9×10⁻⁷) | 4.146 (5.6×10⁻¹³) |
| SVMcon | 3.289 (1.5×10⁻¹⁶) | 3.962 (1.2×10⁻²⁴) |
| PSICOV | 3.505 (6.2×10⁻²) | 3.959 (1.23×10⁻²) |
| CCMpred | 4.415 (6.9×10⁻⁹) | 5.016 (1.1×10⁻⁶) |
| FreeContact | 4.478 (4.5×10⁻¹⁰) | 4.977 (5.0×10⁻⁶) |
| STRUCTCH | 3.477 (2.6×10⁻⁸) | 4.072 (7.7×10⁻¹⁷) |
| MetaPSICOV | 3.552 (4.0×10⁻⁵) | 4.217 (9.7×10⁻⁶) |
| NeBcon | 3.665 (6.5×10⁻⁵) | 4.273 (3.3×10⁻⁹) |
| Native | 3.815 | 4.473 |

Example: diversity of contact maps

He et al., Bioinformatics (2017)

C-QUARK: Contact map guided ab initio structure prediction

[Figure: overview of the C-QUARK method; recovered labels: NeBcon, Knowledge-based potentials]

C-QUARK in CASP 12

| Groups | Z |
|---|---|
| C-QUARK | 65.1 |
| Baker-Rosetta | 60.3 |
| GOAL | 49.9 |
| RaptorX | 44.2 |
| ToyPred_email | 40.4 |
| Multicom-Novl | 19.4 |
| Seok-server | 9.2 |
| IntFOLD4 | 9.1 |
| FFAS-3D | 8.4 |
| FALCON_TOPO | 6.3 |

Here, Z-score (Z) represents the significance of the structure predictions by each group compared to the average performance

GPCR-AIM: Ab initio GPCR structure prediction


References

• Xu, D., and Zhang, Y., "Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field," Proteins-Structure Function and Bioinformatics, 80(7), pp. 1715-1735 (2012)

• Xu, D., and Zhang, Y., "Toward optimal fragment generation for ab initio protein structure assembly," Proteins-Structure Function and Bioinformatics, 81(2), pp. 229-239 (2012)

• Zhang et al., "Integration of QUARK and I-TASSER for Ab Initio Protein Structure Prediction in CASP11," Proteins, 84, pp. 76-86 (2015)

• He, B., Mortuza, S.M., Shen, H., Wang, Y., Zhang, Y., "NeBcon: Protein contact map prediction using neural network training coupled with naïve Bayes classifiers," Bioinformatics (2017) (In press)

• He, B., Mortuza, S.M., Wang, Y., Zhang, Y., "NeBcon used to improve structure prediction," (2017) (In preparation)

• Wu, H., Zhang, C., Zhang, Y., "Assemble atomic structure of G protein-coupled receptors from primary sequences," (2017) (In preparation)