20131019 生物物理若手 journal club


Proteins. 2013 Nov;81(11):1885-99. doi: 10.1002/prot.24330. Epub 2013 Aug 16. DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Liu R, Hu J.


Page 1: 20131019 生物物理若手 Journal Club

DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins. 2013 Jun 5.

20131019 生物物理若手関西支部 (biophysics young researchers' group, Kansai branch) Journal Club

Page 2: 20131019 生物物理若手 Journal Club

Topics

Prediction of protein-DNA binding residues

Network statistics

Machine learning

Page 3: 20131019 生物物理若手 Journal Club
Page 4: 20131019 生物物理若手 Journal Club

Result: DNABind, a hybrid of machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues.

[Figure: predictions for CprK (3E6C:C) and EcoRV (1RVE:A), shown in three panels: Machine learning, Template, DNABind. Legend: query protein, template protein, TP, FP, FN.]

DNABind improves classification.

True positive residues.

Page 5: 20131019 生物物理若手 Journal Club

Aim

Protein-DNA interactions are important for cell biology.

Determining them experimentally is time-consuming and costly.

Computational approaches are desirable.

Page 6: 20131019 生物物理若手 Journal Club

Computational approaches

Data bank (PDB)

Characteristics of binding residues
  Solvent exposed
  Higher electrostatic potential
  More conserved
  Hotspots as clusters of conserved residues

Structural properties (DNA-binding residues vs. the rest of the surface)
  Packing density
  Surface curvature
  B-factor
  Residue fluctuation
  Hydrogen bond donors

http://www.rcsb.org/pdb/home/home.do

Page 7: 20131019 生物物理若手 Journal Club

Computational algorithms

Feature-based
  Extract effective features

Template-based
  Align templates and retrieve the best match

Template!!


Page 10: 20131019 生物物理若手 Journal Club

Features used in machine learning

Structure-based
  PSSM (position-specific scoring matrix)
  Evolutionary conservation
  Solvent accessibility
  Local geometry (depth and protrusion index)
  Topological features: degree, closeness, betweenness, clustering coefficient
  Relative position (distance to centroid)
  Statistical potential (Boltzmann distribution)

Sequence-based (more difficult than structure-based)
  Amino acid identity
  Residue physicochemical properties: polarity, secondary structure, molecular volume, codon diversity, electrostatic charge
  Predicted structure (no 3D structure needed!!)

Page 11: 20131019 生物物理若手 Journal Club

Features used in machine learning

Structure-based
  PSSM
  Relative solvent accessibility
  Depth and protrusion index
  Topological features
  Distance to centroid
  Statistical potentials

Sequence-based
  PSSM
  Predicted structures
  Amino acid indices
  Statistical potentials

Construct machine learning (SVM)

Outputs: $STR_{score}$ (structure-based), $SEQ_{score}$ (sequence-based)

$$ML_{score} = \alpha\,STR_{score} + (1-\alpha)\,SEQ_{score}$$

Page 12: 20131019 生物物理若手 Journal Club

Template-based approach

Used in image recognition, etc.
  Example: recognition of faces in a camera.

Template!!

Page 13: 20131019 生物物理若手 Journal Club

Template-based approach

Used in image recognition, etc.
  Example: recognition of faces in a camera.

Match!! Template!!

Page 14: 20131019 生物物理若手 Journal Club

Template-based prediction

Template-based
  Structural alignment and statistical potential
  Binding residue prediction is carried out only if the target protein is judged to be a DNA-binding protein.

312 templates were selected.
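The gating idea on this slide can be sketched roughly as follows: pick the best-matching template and go on to residue prediction only when the match is good enough. This is only an illustration, not the paper's procedure. The function name `best_template`, the template names, the similarity values, and the 0.5 cutoff are all hypothetical.

```python
def best_template(scores, threshold=0.5):
    """Return the best-matching template name, or None if no match is good enough.

    scores: dict mapping template name -> structural similarity to the query
            (hypothetical values; higher means more similar).
    threshold: hypothetical cutoff below which the query is not treated as
               a DNA-binding protein and no template-based prediction is made.
    """
    if not scores:
        return None
    name = max(scores, key=scores.get)
    return name if scores[name] >= threshold else None


# Hypothetical similarity scores of one query against three templates.
hypothetical_scores = {"template_a": 0.31, "template_b": 0.72, "template_c": 0.55}
chosen = best_template(hypothetical_scores)

# No template passes the cutoff, so no template-based prediction is made.
rejected = best_template({"template_a": 0.31, "template_d": 0.42})
```

When `best_template` returns None, only the machine learning prediction would remain for that query.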

Page 15: 20131019 生物物理若手 Journal Club

Network

Degree is a commonly used measure to reflect the local connectivity of a node.

Closeness is a global centrality metric used to determine how critical a residue is in a residue interaction network.

Betweenness of residue i is defined to be the sum of the fraction of shortest paths between all pairs of residues that pass through residue i.

The clustering coefficient (transitivity) of a node quantifies how close its neighbors are to being a clique, i.e. the probability that two neighbors of the node are themselves connected.

Motifs, hubs, and communities are also important.
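The four measures on this slide can be computed on a small toy network as a rough sketch. The graph below is made up, closeness uses the common (reachable nodes) / (sum of distances) form, and betweenness is left unnormalized; the paper's exact definitions and normalizations may differ.

```python
from collections import deque


def bfs_paths(adj, s):
    """BFS from s: distances, shortest-path counts, predecessors, visit order."""
    dist = {s: 0}
    sigma = {v: 0 for v in adj}  # number of shortest paths from s to v
    sigma[s] = 1
    preds = {v: [] for v in adj}
    order = []
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def degree(adj, v):
    """Number of direct neighbors of v."""
    return len(adj[v])


def closeness(adj, v):
    """(number of reachable nodes) / (sum of shortest-path distances from v)."""
    dist, _, _, _ = bfs_paths(adj, v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


def clustering(adj, v):
    """Fraction of neighbor pairs of v that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


def betweenness(adj):
    """Unnormalized betweenness for an undirected graph (Brandes-style accumulation)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        _, sigma, preds, order = bfs_paths(adj, s)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2.0 for v, b in bc.items()}  # each pair was counted twice


# Toy contact network: a triangle (A, B, C) with a tail C-D-E.
contacts = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
bc = betweenness(contacts)
```

In this toy graph, node C bridges the triangle and the tail, so it has the highest degree and betweenness but a low clustering coefficient.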

Page 16: 20131019 生物物理若手 Journal Club

Network example: human protein interactome

Scale-free
Small-world
Cluster

Power law (Pareto distribution)

Bioinformatics. 2012 Jan 1;28(1):84-90.

Page 17: 20131019 生物物理若手 Journal Club

Machine learning

Example: spam
4601 samples, 57 parameters.
Classification: spam or nonspam

Page 18: 20131019 生物物理若手 Journal Club

Machine learning

Support vector machine (SVM)
Decision tree
RandomForest
Logistic regression
LASSO (Elastic net and Ridge)
Neural networks (Deep learning)

Evolutionary algorithm
Gaussian process
k-nearest neighbor
Clustering
Bayesian networks
Association rule learning
Inductive logic programming (ILP)

Page 19: 20131019 生物物理若手 Journal Club

Support vector machine (SVM)

Builds a hyperplane that separates the groups.
Kernel method: turns a non-linear problem into a linear one.
Easy to run.
Computationally expensive.
Tuning is very difficult.

Page 20: 20131019 生物物理若手 Journal Club

Decision tree

Builds a tree of decision rules.
Easy to understand graphically.
Performance is not so good.

Page 21: 20131019 生物物理若手 Journal Club

RandomForest

Builds many decision trees.
Much more accurate.
Somewhat time-consuming.

Page 22: 20131019 生物物理若手 Journal Club

Logistic regression

Used by many medical researchers.
Easy to use, but tuning is very difficult (to tell the truth...).

Page 23: 20131019 生物物理若手 Journal Club

LASSO, Elastic net, and Ridge regression

$$\alpha = \begin{cases} 1 & \text{LASSO} \\ \vdots & \text{Elastic Net} \\ 0 & \text{Ridge} \end{cases}$$

LASSO: Least Absolute Shrinkage and Selection Operator
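For reference, one common way to write the penalty that $\alpha$ interpolates is the glmnet-style elastic net penalty. The slide does not give a formula, so this form, the coefficient vector $w$, and the penalty strength $\lambda$ are assumptions:

$$
P_\alpha(w) = \lambda \left[ \alpha \lVert w \rVert_1 + \frac{1-\alpha}{2} \lVert w \rVert_2^2 \right]
$$

Setting $\alpha = 1$ leaves only the L1 term (LASSO), $\alpha = 0$ leaves only the squared L2 term (Ridge), and values in between mix the two (Elastic Net).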

Page 24: 20131019 生物物理若手 Journal Club

Neural networks

Artificial model of the mammalian brain (perceptron).
Multiple hidden layers.

Deep learning is a hot topic!! (hard to understand...)

http://opencv.jp/opencv-1.0.0/document/opencvref_ml_nn.html

Page 25: 20131019 生物物理若手 Journal Club

n-fold cross validation

To evaluate how the results of a statistical analysis will generalize to an independent data set.

Page 26: 20131019 生物物理若手 Journal Club

n-fold cross validation

To evaluate how the results of a statistical analysis will generalize to an independent data set.

Train data

Test

Page 31: 20131019 生物物理若手 Journal Club

n-fold cross validation

To evaluate how the results of a statistical analysis will generalize to an independent data set.

Train data

Test: 1 sample

Leave-one-out CV
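The train/test rotation shown on these slides can be sketched as a small index generator. The name `k_fold_indices` is made up for illustration, and the folds are contiguous for simplicity; in practice the samples are usually shuffled first. Leave-one-out CV is the special case where the number of folds equals the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# 5-fold CV on 10 samples: each fold holds out 2 samples for testing.
folds_5 = list(k_fold_indices(10, 5))

# Leave-one-out CV on 4 samples: every test set has exactly 1 sample.
loo = list(k_fold_indices(4, 4))
```

Each sample lands in exactly one test fold, so averaging the per-fold results uses every sample for testing once.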

Page 32: 20131019 生物物理若手 Journal Club

Performance

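| | SVM | Tree | RandomForest | LASSO | Elastic net | Ridge | Logistic | nnet |
|---|---|---|---|---|---|---|---|---|
| Recall | 0.917 | 0.872 | 0.927 | 0.894 | 0.892 | 0.852 | 0.893 | 0.930 |
| Precision | 0.948 | 0.914 | 0.954 | 0.932 | 0.926 | 0.926 | 0.930 | 0.935 |
| F | 0.932 | 0.893 | 0.940 | 0.913 | 0.911 | 0.887 | 0.911 | 0.932 |
| MCC | 0.890 | 0.826 | 0.902 | 0.858 | 0.856 | 0.821 | 0.856 | 0.888 |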
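The four rows of this table are standard functions of a binary confusion matrix. A minimal sketch follows, assuming the last row is the Matthews correlation coefficient; the counts in the example are invented and are not the data behind the table.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Recall, precision, F-measure and MCC from binary confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"recall": recall, "precision": precision, "F": f_measure, "MCC": mcc}


# Invented counts for illustration only.
example = classification_metrics(tp=90, fp=10, fn=10, tn=90)
```

Unlike recall, precision and F, MCC also uses the true negatives, which is why it can rank methods differently on unbalanced data.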

Page 33: 20131019 生物物理若手 Journal Club

Combine two approaches

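$$C_{score} = \begin{cases} \beta\,ML_{score} + (1-\beta)\,TL_{score} & \text{if a template-based score is available} \\ ML_{score} & \text{otherwise} \end{cases}$$

$$ML_{score} = \alpha\,STR_{score} + (1-\alpha)\,SEQ_{score}$$

$\alpha$ and $\beta$ are determined by CV and ROC analysis.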
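The score combination on this slide can be sketched in a few lines. It assumes the fallback to the machine learning score applies when no template-based score is available, which is one reading of the slide. The weights and score values below are made up, since the tuned values of alpha and beta are not given here.

```python
def ml_score(str_score, seq_score, alpha):
    """Weighted mix of the structure-based and sequence-based scores."""
    return alpha * str_score + (1.0 - alpha) * seq_score


def combined_score(ml, tl, beta):
    """Mix the ML and template-based scores, or fall back to ML alone.

    tl is None when no template-based score is available (an assumption
    about when the second branch of the formula applies).
    """
    if tl is None:
        return ml
    return beta * ml + (1.0 - beta) * tl


# Made-up weights and scores for one residue.
ml = ml_score(str_score=0.8, seq_score=0.5, alpha=0.6)
with_template = combined_score(ml, tl=0.9, beta=0.5)
without_template = combined_score(ml, tl=None, beta=0.5)
```

Using None for a missing template score keeps the two branches of the formula explicit instead of encoding the missing case as a score of zero.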

Page 34: 20131019 生物物理若手 Journal Club

Statistical features of structure

A: Binding residues are highly solvent accessible.
B, C: Binding residues have low depth and high protrusion.
D-G: Little difference in network features.
H: Binding residues are closer to the centroid.

Page 35: 20131019 生物物理若手 Journal Club

Performance

Page 36: 20131019 生物物理若手 Journal Club

Performance

Proteins. 2004 Dec 1;57(4):702-10.
Nucleic Acids Res. 2005 Apr 22;33(7):2302-9.

TM-score measures the similarity between two protein structures. A score below 0.2 indicates an essentially random relation, and a score above 0.5 indicates highly related structures.

A higher TM-score is required for good prediction.
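As a rough sketch, the TM-score of one fixed superposition can be computed from the distances between aligned residue pairs using the standard formula, with $d_0 = 1.24\sqrt[3]{L-15} - 1.8$ and $L$ the target length. The full TM-score is the maximum over superpositions; that search is left out here, and the function name and inputs are illustrative.

```python
def tm_score(distances, target_length):
    """TM-score for one fixed superposition (no search over superpositions).

    distances: distances in angstroms between aligned residue pairs.
    target_length: number of residues in the target protein.
    """
    if target_length <= 15:
        raise ValueError("target_length must be greater than 15")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    if d0 <= 0:
        raise ValueError("target is too short for a positive d0")
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Perfect superposition of all 100 residues.
perfect = tm_score([0.0] * 100, 100)

# Only half of the target is aligned, even though those pairs match perfectly.
half_aligned = tm_score([0.0] * 50, 100)
```

Dividing by the target length rather than the aligned length is what penalizes partial alignments.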

Page 37: 20131019 生物物理若手 Journal Club

Performance

Comparison among ML, TL, and DNABind.

Comparison between DNABind and other software.

Page 38: 20131019 生物物理若手 Journal Club

Result: DNABind, a hybrid of machine learning and template-based approaches, showed excellent performance in predicting DNA-binding residues.

[Figure: predictions for CprK (3E6C:C) and EcoRV (1RVE:A), shown in three panels: Machine learning, Template, DNABind. Legend: query protein, template protein, TP, FP, FN.]

DNABind improves classification.

True positive residues.