structural bioinformatics chih-hao lu cadd/0916.ppt

Structural Bioinformatics

Chih-Hao Lu

http://140.128.63.6/~CADD/0916.ppt

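Chih-Hao Lu (陸志豪), Assistant Professor

[email protected]

Education: Ph.D., Institute of Bioinformatics, National Chiao Tung University

Expertise: structural bioinformatics, computational biology, evolutionary computation, and machine learning

Research areas: protein local structural motifs and function prediction; protein structure and dynamics; protein-molecule interactions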

• To identify drugs that inhibit target proteins involved in diseases and have a therapeutic effect against those diseases
– Drugs often have stronger binding affinities than natural compounds

Mechanism of drug actions

[Figure: a pathway of disease, showing the proteins in the pathway, the target protein, a natural compound, and a drug]

Classification of Drug Development

Protein (receptor) structure | Compound structure known | Compound structure unknown
Known | Structure-based Drug Design (SBDD) | SBDD or de novo design
Unknown | Compound similarity search | High-Throughput Screening (HTS)

[Figure: compound similarity search, showing a query compound and the similar compounds retrieved]

DDT 2002

Central Dogma

Why study protein structure?

• Proteins play crucial functional roles in all biological processes: enzymatic catalysis, signaling messengers …

• Function depends on 3D structure.

• Easy to obtain protein sequences, difficult to determine structure.


From primary to quaternary

Primary Structure

• The protein backbone is a long linear sequence built from twenty kinds of amino acids

• Amino acids are linked to one another by peptide bonds

Proteins are polypeptide chains

Fundamentals of protein structure: 20 Amino Acids

Amino acid      3-letter  1-letter  Mr    Occurrence in proteins (%)
Glycine         Gly       G         75    7.2
Alanine         Ala       A         89    7.8
Valine          Val       V         117   6.6
Leucine         Leu       L         131   9.1
Isoleucine      Ile       I         131   5.3
Methionine      Met       M         149   2.3
Phenylalanine   Phe       F         165   3.9
Tyrosine        Tyr       Y         181   3.2
Tryptophan      Trp       W         204   1.4
Serine          Ser       S         105   6.8
Proline         Pro       P         115   5.2
Threonine       Thr       T         119   5.9
Cysteine        Cys       C         121   1.9
Asparagine      Asn       N         132   4.3
Glutamine       Gln       Q         146   4.2
Lysine          Lys       K         146   5.9
Histidine       His       H         155   2.3
Arginine        Arg       R         174   5.1
Aspartic acid   Asp       D         133   5.3
Glutamic acid   Glu       E         147   6.3
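The table above can be turned into a quick calculation. The sketch below is only an illustration: it assumes the Mr column holds rounded free amino acid masses, so one water (taken as 18) is subtracted per peptide bond. The names `FREE_MASS`, `composition`, and `approx_mass` are made up for this example.

```python
from collections import Counter

# Free amino acid masses, copied from the Mr column of the table above.
FREE_MASS = {
    "G": 75, "A": 89, "V": 117, "L": 131, "I": 131,
    "M": 149, "F": 165, "Y": 181, "W": 204, "S": 105,
    "P": 115, "T": 119, "C": 121, "N": 132, "Q": 146,
    "K": 146, "H": 155, "R": 174, "D": 133, "E": 147,
}
WATER = 18  # assumption: rounded mass of the water released per peptide bond


def composition(seq):
    """Return the percentage of each residue type found in seq."""
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def approx_mass(seq):
    """Approximate chain mass: free amino acid masses minus the water lost."""
    return sum(FREE_MASS[aa] for aa in seq) - WATER * (len(seq) - 1)


seq = "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCII"  # sequence shown on the slide
print(len(seq), approx_mass(seq))
print(sorted(composition(seq).items(), key=lambda item: -item[1])[:3])
```

Because the table values are rounded, the result is a rough estimate, not an exact molecular mass.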

TTCCPSIVARSNFNVCRLPGTPEAICATYTGCII

Sequence

Secondary Structure

α helix

• On average, every 3.6 residues form one turn

• The α-helix structure is formed by hydrogen-bond interactions

3₁₀ helix, α helix, π helix

The α helix has a dipole moment

Some amino acids are preferred in helices

• Good
– Ala, Glu, Leu, Met

• Poor
– Pro, Gly, Tyr, Ser

• The structure is amphipathic
– Hydrophobic
– Hydrophilic

Helical wheel
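A helical wheel follows directly from the 3.6 residues per turn noted earlier: each residue is rotated 360/3.6 = 100 degrees from the previous one. The sketch below is a minimal illustration; the function names and the `HYDROPHOBIC` residue set are assumptions made for the example, not something given on the slides.

```python
# 3.6 residues per turn means 360 / 3.6 = 100 degrees per residue.
DEGREES_PER_RESIDUE = 100.0

# Assumption: a simple illustrative set of hydrophobic residues.
HYDROPHOBIC = set("AVLIMFWC")


def helical_wheel(seq, step=DEGREES_PER_RESIDUE):
    """Return (position, residue, angle) tuples with angles in [0, 360)."""
    return [(i + 1, aa, (i * step) % 360.0) for i, aa in enumerate(seq)]


def hydrophobic_angles(seq):
    """Angles of the hydrophobic residues; clustering suggests an amphipathic helix."""
    return [angle for _, aa, angle in helical_wheel(seq) if aa in HYDROPHOBIC]


for pos, aa, angle in helical_wheel("AEELLKKA"):
    print(pos, aa, angle)
```

If the hydrophobic angles cluster on one side of the wheel, the helix has a hydrophobic face and a hydrophilic face.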

β sheet

• A β sheet is a plane formed by several ribbon-like β strands

• Each pair of adjacent β strands is arranged either parallel or antiparallel

Antiparallel β sheets

Parallel β sheets

Turn or Loop

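• Where one α helix or β strand connects to the next, the peptide chain has to turn by nearly 180 degrees; these regions are called turns

• Other irregular structures are collectively called loops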

Turn

Loop

Hairpin loops

Secondary structure elements are connected to form simple motifs

Schematic diagrams of the calcium-binding motif

(Luscombe, Genome Biology 2000)

The β-hairpin motif occurs frequently in protein structures

The Greek key motif is found in antiparallel β sheets

Tertiary Structure

[Figure: from sequence (TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN) to secondary structure (β sheet, α helix, loop) to tertiary structure]

• Several secondary structure elements pack together to form the tertiary structure of a protein

Simple motifs combine to form complex motifs

Quaternary structure

• A complex assembled from several identical or different tertiary-structure molecules (subunits) is called a quaternary structure

How to determine the protein structure?

• By experimentation
– X-ray crystallography
– NMR (nuclear magnetic resonance) spectroscopy

• Sequence-Structure gap


Structure Determination (X-ray)

Target Selection → Crystallomics (Isolation, Expression, Purification, Crystallization) → Data Collection → Structure Solution → Structure Refinement → Functional Annotation → PDB Deposition and Publication

The first X-ray crystallographic structural results in 1958

First determination of a 3D globular protein structure (myoglobin) in 1958, by John Kendrew

Molecular visualization

• Abstract views of macromolecules
– well-defined secondary structure elements (α-helices and β-strands)
– Jane Richardson, 1985

• α-helix as a simple cylinder or a broad, spiral ribbon
• β-strand as a broad, flat ribbon

The structure of myoglobin

Molecular visualization

RasMol

PyMOL

Swiss-Pdb Viewer

MOLMOL

MolScript

MDL Chime

Green Fluorescent Protein (GFP)

The Protein Data Bank http://www.rcsb.org/pdb/home/home.do


Number of Structures Available

Structure-based databases

• Popular software and resources for protein structure validation
– PDBSum, Procheck, What_Check

• Resources classifying protein structure
– SCOP, CATH, DALI, VAST, CE

• Popular resources of protein interactions
– Protein-Protein(DNA) interaction server, DIP, MINT

• Popular resources visualizing macromolecular structures
– PDBSum, NDB Atlas, STING

Protein evolution and the SCOP database

http://scop.berkeley.edu/


SCOP

• Classes
– all-β proteins
  • can have small adornments of α or 3₁₀ helix
– all-α structures
  • may have several regions of 3₁₀ helix, and a small β-sheet outside the α-helical core
– α/β (alpha and beta)
  • mainly parallel β sheets (β-α-β units)
– α+β (alpha plus beta)
  • mainly antiparallel β sheets (segregated α and β regions)
– others
  • multidomain proteins, membrane and cell surface proteins, small proteins, coiled coil proteins, low-resolution structures, peptides, and designed proteins

Class, Fold, Superfamily and Family classification

SCOP Sample Hierarchy

[Figure: one path through the SCOP hierarchy, with the side labels "Determined by structure" and "Related by homology"]

Levels: scop Root → Class → Fold → Superfamily → Family → Protein → Species → PDB/Ref

Fold: Rossmann fold, Flavodoxin-like, β/α-Barrel
Superfamily: TIM, Trp biosynthesis, Glycosyltransferase, RuBisCo (C)
Family: β-Glucanase, α-Amylase (N), β-Amylase, β-Galactosidase (3)
Protein, Species, PDB/Ref:
• Acid α-amylase, A. niger, 2aaa:1-353
• Oligo-1,6 glucosidase, B. cereus, J. Biochem 113:646-649
• Cyclodextrin glycosyltransferase, B. circulans, 1cdg:1-382 and 1cgt:1-382
• Cyclodextrin glycosyltransferase, B. stearothermophilus, 1cyg:1-378

The CATH domain structure database http://www.cathdb.info/index.html


CATH http://www.cathdb.info/index.html

Structure quality assurance

• Not all structures are of equally high quality
• Models from X-ray crystallography
• Models from NMR spectroscopy
• Errors in deposited structures
• Procheck, What_Check

2YSB

Ramachandran Plot

• A plot of the two backbone dihedral angles, phi against psi, for the amino acid residues in a protein.

• Due to steric hindrance from amino acid side chains, only certain angle combinations are allowed in a folded protein.

• Plotting the dihedral angles of the individual residues therefore indicates how well the structure has been determined.

• Residues that fall outside the allowed regions are called outliers and usually indicate bad geometry.

Dihedral Angles
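A dihedral angle is defined by four consecutive atoms. For phi these are C(n-1), N(n), Cα(n), C(n), and for psi they are N(n), Cα(n), C(n), N(n+1). The sketch below computes the signed angle from four 3D points with the usual atan2 formula. The helper names are hypothetical and the coordinates are toy values, not taken from a real structure.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates: the fourth atom is rotated a quarter turn about the central bond.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applying `dihedral` to the backbone atoms of every residue gives the (phi, psi) pairs that are plotted in a Ramachandran plot.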

Ramachandran Plot

Standard plot showing where different secondary structures fit into the plot.

A real-life example. All non-glycine residues are in allowed regions.

Validation

• Ideally, there should be no outliers in the Ramachandran plot, except for glycine and proline, which are "special" amino acids.

• However, the scientist depositing the structure may give a rational explanation for outliers (always refer to the publication!).

• Expect more than 85-90% of residues to fall into the red regions.

So what do you think about this?

Secondary structure assignment http://swift.cmbi.ru.nl/gv/dssp/

http://e106.life.nctu.edu.tw/~hwhuang/dssp/

http://140.128.63.6/~bioinformatics/MDLChime26SP4.exe



The role of secondary structure

• In structural genomics
– basic unit for structure classification
– main uses
  • it is indicative of the fold
  • it is an intuitive means of visualizing protein structure
  • it influences the sequence alignment
  • it is related to function
– applications (e.g. Secondary Structure Elements)
  • speed up large-scale all-against-all alignment of 3D structures
  • comparative modeling and threading

Hydrogen Bonding is Key to Automated Methods

• Why?
– ~90% of backbone donors (NH) and acceptors (C=O) form hydrogen bonds

• Basic definition
– Angle N–H–O greater than 120 degrees
– H…O distance less than 2.5 Å
– Note: H atoms are not usually identified directly

Angle-distance hydrogen bond assignment

• Baker and Hubbard assigned hydrogen bonds according to the angle N-H-O and to the distance rHO (1984)

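[Figure: geometry of the N–H…O hydrogen bond, with N–H = 1 Å, H…O < 2.5 Å, and angle N–H–O > 120°. In the limiting case (H…O = 2.5 Å, angle = 120°), the construction with 60° and 30° angles and 1.25 Å and 2.165 Å components gives an N…O distance of about 3.122 Å. A second panel shows the linear case (180°) with N–H = 1 Å and H…O = 2.5 Å.]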
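The angle-distance criterion above is easy to check numerically. The sketch below applies the two thresholds (angle N–H–O > 120°, H…O < 2.5 Å) and uses the law of cosines to get the N…O distance for the limiting geometry. The function names and test coordinates are hypothetical.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def angle_deg(a, b, c):
    """Angle at b (degrees) formed by the points a-b-c."""
    ba = [x - y for x, y in zip(a, b)]
    bc = [x - y for x, y in zip(c, b)]
    cos_t = sum(x * y for x, y in zip(ba, bc)) / (distance(a, b) * distance(c, b))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def is_hbond(n, h, o, min_angle=120.0, max_h_o=2.5):
    """Angle-distance test: angle N-H-O above min_angle and H...O below max_h_o."""
    return angle_deg(n, h, o) > min_angle and distance(h, o) < max_h_o


def n_o_distance(r_nh, r_ho, angle):
    """N...O distance from the law of cosines in the N-H-O triangle."""
    return math.sqrt(r_nh ** 2 + r_ho ** 2
                     - 2.0 * r_nh * r_ho * math.cos(math.radians(angle)))


print(round(n_o_distance(1.0, 2.5, 120.0), 3))
print(is_hbond((0, 0, 0), (1, 0, 0), (3, 0, 0)))
```

The first value is the N…O distance for the limiting case (1 Å, 2.5 Å, 120°), which can be compared with the ~3.122 Å on the slide.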

Coulomb hydrogen bond calculation – used by DSSP

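\[
E = f \, \delta^{+} \delta^{-} \left( \frac{1}{r_{NO}} + \frac{1}{r_{HC'}} - \frac{1}{r_{HO}} - \frac{1}{r_{NC'}} \right)
\]

• f is a constant, 332 Å·kcal/(mol·e²)
• δ⁺ and δ⁻ are the polar charges, in units of the electron charge
• Weakest H-bond: –0.5 kcal/mol in DSSP
• H is not given in the coordinates and requires extrapolation; note that this assumes planar geometry for the peptide bond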
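The Coulomb expression above can be evaluated directly once the four atom positions are known. The sketch below is an illustration, not the DSSP source code. The partial charges 0.42 e and 0.20 e are the values reported for DSSP but do not appear on the slide. The coordinates are a toy linear arrangement, and the function names are hypothetical.

```python
import math

F = 332.0       # constant from the slide, Å·kcal/(mol·e²)
Q1 = 0.42       # partial charge on C=O in e (reported DSSP value, not on the slide)
Q2 = 0.20       # partial charge on N-H in e (reported DSSP value, not on the slide)
CUTOFF = -0.5   # weakest accepted hydrogen bond, kcal/mol


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hbond_energy(n, h, c, o):
    """Electrostatic energy between an N-H donor group and a C=O acceptor group."""
    return F * Q1 * Q2 * (1.0 / dist(n, o) + 1.0 / dist(h, c)
                          - 1.0 / dist(h, o) - 1.0 / dist(n, c))


def is_hbond(n, h, c, o):
    return hbond_energy(n, h, c, o) < CUTOFF


# Toy linear geometry: N-H = 1.0 Å, H...O = 1.9 Å, C=O = 1.23 Å.
n, h, o, c = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.9, 0.0, 0.0), (4.13, 0.0, 0.0)
print(hbond_energy(n, h, c, o), is_hbond(n, h, c, o))
```

Moving the acceptor group far away drives the energy toward zero, so the pair no longer passes the –0.5 kcal/mol cutoff.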

DSSP

• H = alpha helix
• G = 3₁₀ helix
• I = pi helix
• B = bridge (single-residue sheet)
• E = extended beta strand
• T = beta turn
• S = bend
• C = coil
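The eight DSSP states listed above are often collapsed to three (helix, sheet, loop) before comparing them with predictions. The mapping below is one common convention and is an assumption here, since the slides do not specify a reduction. The names `REDUCE` and `reduce_states` are hypothetical.

```python
# Assumed reduction: helices (H, G, I) -> H, strands and bridges (E, B) -> E,
# and everything else (T, S, C, blank) -> L.
REDUCE = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "L", "S": "L", "C": "L",
}


def reduce_states(dssp_string):
    """Map an 8-state DSSP string onto the 3-state H/E/L alphabet."""
    return "".join(REDUCE.get(state, "L") for state in dssp_string)


print(reduce_states("CEEEETTSHHHHHHGGGC"))
```

Other reductions exist (for example, treating G or B as loop), so the convention in use should be stated whenever results are compared.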


DSSP as Implemented in the PDB

1ATP

http://www.rcsb.org/pdb/explore/explore.do?structureId=1ATP

Identifying structural domain and function in proteins

1NTY

Prediction of protein-protein or protein-DNA interaction

• Sequence-based methods
– Homology
– Correlated mutation

• Structure-based methods
– Physical docking

• Hybrid methods

Principles and methods of docking and ligand design

• Structure-based design
– Docking

• Analog-based design
– QSAR (quantitative structure-activity relationships)

Most force fields consist of a summation of bonded forces associated with chemical bonds, bond angles, and bond dihedrals, and non-bonded forces associated with van der Waals forces and electrostatic charge.
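The summation described above can be sketched term by term. The functional forms below (harmonic bonds and angles, a cosine torsion, Lennard-Jones, and Coulomb) are common choices and are assumptions here; they are not tied to any particular force field. All function names and parameter values are hypothetical.

```python
import math

COULOMB_CONSTANT = 332.0  # kcal·Å/(mol·e²), same constant as in the DSSP formula


def harmonic(x, x0, k):
    """Bond-stretch or angle-bend term: k * (x - x0)^2."""
    return k * (x - x0) ** 2


def torsion(phi, k, n, delta):
    """Dihedral term: k * (1 + cos(n * phi - delta)), angles in radians."""
    return k * (1.0 + math.cos(n * phi - delta))


def lennard_jones(r, epsilon, sigma):
    """van der Waals term: 4 * epsilon * ((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, qi, qj):
    """Electrostatic term between point charges qi and qj (in e) at distance r (Å)."""
    return COULOMB_CONSTANT * qi * qj / r


# Total energy of a toy system: one bond, one angle, one torsion, one non-bonded pair.
total = (harmonic(1.54, 1.53, 300.0) + harmonic(1.95, 1.91, 60.0)
         + torsion(math.pi, 1.4, 3, 0.0)
         + lennard_jones(3.8, 0.1, 3.4) + coulomb(3.8, 0.2, -0.2))
print(total)
```

A real force field supplies the parameters (k, x0, epsilon, sigma, charges) per atom type and sums these terms over every bond, angle, dihedral, and non-bonded pair.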

Fold recognition method

Prediction in 1D

– Secondary structure prediction
– Solvent accessibility prediction
– Disulfide bond prediction
– Fold recognition
– Enzyme class prediction
– Subcellular localization prediction
– Metal binding sites prediction
– Disulfide connectivity prediction
– Phi psi angle prediction

Secondary structure prediction

[Figure: a structure with its β sheet, α helix, and loop regions labelled]

EEEELLLLLHHHHHHHHHHHHLLLLLHHHHHHLLLLEEEELLLLL

H: helix
E: β sheet
L: loop
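A per-residue state string like the one above is usually summarised as a list of secondary structure elements. The sketch below run-length encodes the string into (state, start, end) segments with 1-based residue numbers. The function name `segments` is hypothetical.

```python
from itertools import groupby


def segments(ss):
    """Split a state string into (state, start, end) runs, numbered from 1."""
    result = []
    start = 1
    for state, run in groupby(ss):
        length = len(list(run))
        result.append((state, start, start + length - 1))
        start += length
    return result


predicted = "EEEELLLLLHHHHHHHHHHHHLLLLLHHHHHHLLLLEEEELLLLL"
for state, first, last in segments(predicted):
    print(state, first, last)
```

The same function also works for the two-state buried/exposed strings used in solvent accessibility prediction.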

Solvent accessibility prediction


EEEEBBBBEEEEEBBBBBBBEEEEEEBBBBBBBEEEEEEEBBBBEE

B: Buried
E: Exposed

[Figure: residues of the structure labelled B (buried) or E (exposed)]

Disulfide bond prediction

TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

O O R O O R

Fold recognition

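SCOP (Structural Classification Of Proteins)

[Figure: the SCOP hierarchy, root → classes → folds → superfamily → family → proteins → species. The classes shown include α/β, α+β, multi-domain, membrane, and small proteins. Example path: TIM barrel (fold) → TIM, Aldolase, … (superfamily) → TIM (family) → TIM (protein) → chicken, human (species).]

SCOP statistics: 11 classes, 800 folds, 1294 superfamilies, 2327 families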

Subcellular localization prediction

Eukaryotic Cellular compartments



Metal binding sites prediction


NNNNBNNBNNNNNBNNNNNBNNNNNNNNNNNNNNNNNNBNNN

B Binding

N Non-binding

Phi psi angle prediction

Ramachandran plot
• Phi: C(n-1) – N(n) – Cα(n) – C(n)
• Psi: N(n) – Cα(n) – C(n) – N(n+1)

[Figure: Ramachandran plot divided into regions labelled A to R]


AADGJJKKCPGDANOOEEAAAAJJJJJJJJKKNNQQCCJJJJAAAA

Disulfide Connectivity Prediction

TTC1C2PSIVARSNFNVC3RLPGTPEAIC4ATYTGC5IIIPGATC6PGDYAN (cysteines numbered C1 to C6)

[Figure: the six cysteines C1 to C6 drawn with their disulfide bridges]

Connectivity pattern: 1-6, 2-5, 3-4
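The connectivity pattern 1-6, 2-5, 3-4 is one of several ways to pair six cysteines, which is why connectivity prediction is treated as a classification problem. The sketch below finds the cysteines in the sequence and enumerates every possible pairing. The function names are hypothetical.

```python
def cysteine_positions(seq):
    """1-based positions of the cysteines in a sequence."""
    return [i + 1 for i, aa in enumerate(seq) if aa == "C"]


def connectivity_patterns(items):
    """All ways to pair up an even-sized list (every perfect matching)."""
    if not items:
        return [[]]
    first, rest = items[0], items[1:]
    patterns = []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for pattern in connectivity_patterns(remaining):
            patterns.append([(first, partner)] + pattern)
    return patterns


seq = "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN"
print(cysteine_positions(seq))
patterns = connectivity_patterns([1, 2, 3, 4, 5, 6])
print(len(patterns))
print([(1, 6), (2, 5), (3, 4)] in patterns)
```

The number of patterns grows as (2n-1)!! for n disulfide bonds, so the candidate set expands quickly as more cysteines are involved.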

[Figure: SVM workflow. Training data (samples of Class 1 to Class K, each described by Features 1 to N) are passed to the SVM to build an SVM model. Testing data (Feature 1 to Feature N) are passed to the SVM model, which assigns one of Class 1 to Class K.]

Protein Structure Prediction

[Flowchart] Sequence → sequence homology to a known fold?
• >30%: Homology Modeling → Model
• <30%: Threading → Match found?
  – Yes: Model
  – No: Ab initio
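The decision flow above can be written as a small function. This is only a sketch of the flowchart. The function name is hypothetical, and the handling of exactly 30% identity is an assumption, since the chart only shows ">30%" and "<30%".

```python
def choose_method(identity_percent, threading_match_found=False, threshold=30.0):
    """Pick a structure prediction route from the flowchart.

    Assumption: an identity exactly at the threshold is treated as low.
    """
    if identity_percent > threshold:
        return "homology modeling"
    if threading_match_found:
        return "threading"
    return "ab initio"


print(choose_method(45.0))
print(choose_method(22.0, threading_match_found=True))
print(choose_method(22.0, threading_match_found=False))
```

In practice the threshold is a rule of thumb, not a sharp boundary, which is why it is exposed as a parameter.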


Homology modeling

• The goal of protein modeling is to predict a structure from its sequence
– Template recognition and initial alignment
– Alignment correction
– Backbone generation
– Loop modeling
– Side-chain modeling
– Model optimization
– Model validation

What is Homology Modeling?

[Figure: target and template structures (8lyz, 1alc). The two proteins share similar sequences and are homologous, so one is used as the template for the other.]

1alc: KQFTKCELSQNLYDIDGYGRIALPELICTMFHTSGYDTQAIVENDESTEYGLFQISNALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILDIKGIDYWIAHKALCTEKLEQWLCEKE

8lyz: KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
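Whether two sequences "share similar sequence" is usually quantified as percent identity over an alignment. The sketch below assumes the two sequences are already aligned (equal length, "-" for gaps) and counts identical residues over the columns where both have a residue. That denominator is one of several conventions and is an assumption here. The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Toy input: the first eight residues of the two sequences, taken without gaps.
print(percent_identity("KQFTKCEL", "KVFGRCEL"))
```

A real comparison would first align the full sequences with a pairwise alignment program and then pass the gapped strings to this function.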


Structure prediction by homology modeling

[Figure: the homology modeling procedure, Step 1 to Step 4]

Structure comparison and alignment

1CRN 1JXX

CE

http://cl.sdsc.edu/ce.html

DALI

http://ekhidna.biocenter.helsinki.fi/dali_server/


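Homework (due before class on 10/6)

1. Search the Protein Data Bank using the following criteria and list the PDB IDs found:
– Hemoglobin
– Has ligand: Yes
– X-ray resolution < 1.5 Å
– Homo sapiens

2. Compare the secondary structure assignments that DSSP and STRIDE give for PDB ID 1ema

3. Analyze PDB ID 1atp using Procheck in PDBsum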
