Structural Bioinformatics
Chih-Hao Lu
http://140.128.63.6/~CADD/0916.ppt
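Chih-Hao Lu, Assistant Professor
Education: Ph.D., Institute of Bioinformatics, National Chiao Tung University
Expertise: structural bioinformatics, computational biology, evolutionary computation, and machine learning
Research areas: protein local structural modules and function prediction; protein structure and dynamics; protein-molecule interactions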
• To identify drugs that inhibit target proteins involved in diseases and have a therapeutic effect against those diseases
– Drugs often have stronger binding affinities than natural compounds
Mechanism of drug actions
[Figure: a pathway of disease; labels: target protein, protein, natural compound, drug]
Classification of Drug Development
Axes: protein (receptor) structure (known / unknown) against compound structure (known / unknown)
Approaches: Structure-based Drug Design (SBDD); SBDD or de novo design; High-Throughput Screening (HTS); compound similarity search
[Figure: a query compound and its similar compounds]
Source: DDT 2002
Central Dogma
Why study protein structure?
• Proteins play crucial functional roles in all biological processes: enzymatic catalysis, signaling messengers …
• Function depends on 3D structure.
• Easy to obtain protein sequences, difficult to determine structure.
From primary to quaternary
Primary Structure
• The protein backbone is a long linear sequence built from 20 kinds of amino acids
• Amino acids are linked to one another by peptide bonds
Proteins are polypeptide chains
Fundamentals of protein structure
20 Amino Acids
Amino acid      3-letter  1-letter  Molecular weight  Occurrence in proteins (%)
Glycine         Gly       G         75                7.2
Alanine         Ala       A         89                7.8
Valine          Val       V         117               6.6
Leucine         Leu       L         131               9.1
Isoleucine      Ile       I         131               5.3
Methionine      Met       M         149               2.3
Phenylalanine   Phe       F         165               3.9
Tyrosine        Tyr       Y         181               3.2
Tryptophan      Trp       W         204               1.4
Serine          Ser       S         105               6.8
Proline         Pro       P         115               5.2
Threonine       Thr       T         119               5.9
Cysteine        Cys       C         121               1.9
Asparagine      Asn       N         132               4.3
Glutamine       Gln       Q         146               4.2
Lysine          Lys       K         146               5.9
Histidine       His       H         155               2.3
Arginine        Arg       R         174               5.1
Aspartic acid   Asp       D         133               5.3
Glutamic acid   Glu       E         147               6.3
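As a small illustration of how the table can be used, the sketch below estimates the mass of a peptide chain and its amino acid composition. The weights are the free amino acid values from the table; subtracting one water (18) per peptide bond is a standard approximation, and the function names are hypothetical.

```python
from collections import Counter

# Free amino acid molecular weights, copied from the table above.
AMINO_ACID_MW = {
    "G": 75, "A": 89, "V": 117, "L": 131, "I": 131,
    "M": 149, "F": 165, "Y": 181, "W": 204, "S": 105,
    "P": 115, "T": 119, "C": 121, "N": 132, "Q": 146,
    "K": 146, "H": 155, "R": 174, "D": 133, "E": 147,
}
WATER_MW = 18  # one water molecule is released per peptide bond


def peptide_mass(sequence):
    """Approximate mass of a linear peptide given as one-letter codes."""
    total = sum(AMINO_ACID_MW[aa] for aa in sequence)
    return total - WATER_MW * (len(sequence) - 1)


def composition(sequence):
    """Percentage of each amino acid type present in the sequence."""
    counts = Counter(sequence)
    return {aa: 100.0 * n / len(sequence) for aa, n in counts.items()}
```

For example, `peptide_mass("GA")` adds 75 and 89 and removes one water for the single peptide bond.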
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCII
Sequence
Secondary Structure
α helix
• On average, every 3.6 residues form one turn
• The α-helix structure is formed by hydrogen-bond interactions
3₁₀ helix, α helix, π helix
The α helix has a dipole moment
Some amino acids are preferred in helices
• Good
– Ala, Glu, Leu, Met
• Poor
– Pro, Gly, Tyr, Ser
• The structure is amphipathic
– Hydrophobic
– Hydrophilic
Helical wheel
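A helical wheel follows directly from the 3.6 residues per turn: each residue is rotated 100 degrees from the previous one when viewed down the helix axis. The sketch below computes those angles and a crude amphipathicity score. The binary hydrophobic set and the scoring are illustrative assumptions, not a published hydrophobicity scale.

```python
import math

RESIDUES_PER_TURN = 3.6
DEGREES_PER_RESIDUE = 360.0 / RESIDUES_PER_TURN  # about 100 degrees

# Assumption: a simple binary hydrophobic set, for illustration only.
HYDROPHOBIC = set("AVLIMFWC")


def wheel_angles(sequence):
    """Angular position (degrees) of each residue on the helical wheel."""
    return [(i * DEGREES_PER_RESIDUE) % 360.0 for i in range(len(sequence))]


def hydrophobic_moment(sequence):
    """Length of the mean vector with +1 for hydrophobic, -1 for other residues.

    Values near 0 mean no sidedness; larger values mean the hydrophobic
    residues cluster on one face of the helix (amphipathic).
    """
    x = y = 0.0
    for aa, angle in zip(sequence, wheel_angles(sequence)):
        weight = 1.0 if aa in HYDROPHOBIC else -1.0
        x += weight * math.cos(math.radians(angle))
        y += weight * math.sin(math.radians(angle))
    return math.hypot(x, y) / len(sequence)
```

An 18-residue stretch covers exactly five turns, so a uniform sequence of that length scores close to zero, while a sequence with hydrophobic residues on one face scores much higher.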
β sheet
• A β sheet is a plane made up of several ribbon-like β strands
• Any two strands can be arranged in a parallel or an antiparallel structure
Antiparallel β sheets
Parallel β sheets
Turn or Loop
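• Where helices or strands are connected, the peptide chain must turn by nearly 180 degrees; these regions are called turns
• In addition, some irregular structures are collectively called loops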
Turn
Loop
Hairpin loops
Secondary structure elements are connected to form simple motifs
Schematic diagrams of the calcium-binding motif
(Luscombe, Genome Biology 2000)
The hairpin β motif occurs frequently in protein structures
The Greek key motif is found in antiparallel β sheets
Tertiary Structure
β sheet
Helix
Loop
Tertiary Structure
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
Sequence
Secondary Structure
• Several secondary structure elements pack together to form the protein's tertiary structure
Simple motifs combine to form complex motifs
Quaternary structure
• A complex assembled from several identical or different tertiary-structure molecules (subunits) is called the quaternary structure.
How to determine the protein structure?
• By experimentation
– X-ray crystallography
– NMR (nuclear magnetic resonance spectroscopy)
• Sequence-Structure gap
Structure Determination (X-ray)
[Flowchart: Target Selection → Crystallomics (Isolation, Expression, Purification, Crystallization) → Data Collection → Structure Solution → Structure Refinement → Functional Annotation → PDB Deposition → Publication]
The first X-ray crystallographic structural results, in 1958
First determination of a 3-D globular protein structure (myoglobin) in 1958, by John Kendrew
Molecular visualization
• Abstract views of macromolecules
– well-defined secondary structure elements (α-helices and β-strands)
– Jane Richardson, 1985
• α-helix as a simple cylinder or a broad, spiral ribbon
• β-strand as a broad, flat ribbon
The structure of myoglobin
Molecular visualization
RasMol
PyMOL
Swiss-Pdb Viewer
MOLMOL
MolScript
MDL Chime
Green Fluorescent Protein (GFP)
The Protein Data Bank http://www.rcsb.org/pdb/home/home.do
Number of Structures Available
Structure-based databases
• Popular software and resources for protein structure validation
– PDBSum, Procheck, What_Check
• Resources classifying protein structure
– SCOP, CATH, DALI, VAST, CE
• Popular resources of protein interactions
– Protein-Protein(DNA) interaction server, DIP, MINT
• Popular resources visualizing macromolecular structures
– PDBSum, NDB Atlas, STING
SCOP
• Classes
– all-β proteins
• can have small adornment of α or 3₁₀ helix
– all-α structures
• may have several regions of 3₁₀ helix, and small β-sheet outside the α-helical core
– α/β (alpha and beta)
• mainly parallel β sheets (β-α-β units)
– α+β (alpha plus beta)
• mainly antiparallel β sheets (segregated α and β regions)
– others
• multidomain proteins, membrane and cell surface proteins, small proteins, coiled coil proteins, low-resolution structures, peptides, and designed proteins
Class, Fold, Superfamily and Family classification
SCOP Sample Hierarchy
[Figure: example SCOP hierarchy with the levels scop Root, Class, Fold, Superfamily, Family, Protein, Species, and PDB/Ref, annotated "Determined by structure" and "Related by homology". Entries shown include Rossmann fold, Flavodoxin-like, TIM, Trp biosynthesis, Glycosyltransferase, RuBisCo (C), Oligo-1,6 glucosidase, and Cyclodextrin glycosyltransferase; species A. niger, B. cereus, B. circulans, and B. stearothermophilus; PDB/Ref entries 2aaa:1-353, 1cdg:1-382, 1cgt:1-382, 1cyg:1-378, and J. Biochem 113:646-649.]
The CATH domain structure database http://www.cathdb.info/index.html
CATH http://www.cathdb.info/index.html
Structure quality assurance
• Not all structures are of equally high quality
• Models from X-ray crystallography
• Models from NMR spectroscopy
• Errors in deposited structures
• Procheck, What_Check
2YSB
Ramachandran Plot
• A plot of the backbone dihedral angles (φ against ψ) of the amino acid residues in a protein.
• Due to steric hindrance from amino acid side chains, only certain angles are allowed in a folded protein.
• Plotting the dihedral angles of the individual residues in a protein can indicate how well the structure has been determined.
• Residues that deviate from the allowed values are called outliers and usually indicate bad geometry.
Dihedral Angles
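A dihedral angle is defined by four consecutive atoms, so φ and ψ can be computed directly from backbone coordinates. The sketch below is a minimal pure-Python version using the usual sign convention; the helper names are hypothetical, and atom coordinates are assumed to be given as (x, y, z) tuples.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _scale(a, s):
    return (a[0] * s, a[1] * s, a[2] * s)


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees (-180 to 180) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b1 = _scale(b1, 1.0 / math.sqrt(_dot(b1, b1)))  # unit vector along the central bond
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, _scale(b1, _dot(b0, b1)))
    w = _sub(b2, _scale(b1, _dot(b2, b1)))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

For residue n, φ would be `dihedral(C_prev, N, CA, C)` and ψ would be `dihedral(N, CA, C, N_next)`, where the arguments are the coordinates of those backbone atoms.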
Ramachandran Plot
Standard plot showing where different secondary structures fit into the plot.
A real-life example. All non-glycine residues are in allowed regions.
Validation
• Ideally, there should be no outliers in the Ramachandran plot, except for Glycine and Proline, which are “special” amino acids.
• However, there may be some rational explanation for outliers by the scientist depositing the structure. (Always refer to the publication!).
• Expect more than 85-90% of residues to fall into the red (most favored) regions.
So what do you think about this?
Secondary structure assignment http://swift.cmbi.ru.nl/gv/dssp/
http://e106.life.nctu.edu.tw/~hwhuang/dssp/
http://140.128.63.6/~bioinformatics/MDLChime26SP4.exe
The role of secondary structure
• In structural genomics
– basic unit for structure classification
– main uses
• it is indicative of the fold
• it is an intuitive means of visualizing protein structure
• it influences the sequence alignment
• it is related to function
– applications (e.g. Secondary Structure Element)
• speed up large-scale all-against-all alignment of 3D structures
• comparative modeling and threading
Hydrogen Bonding is Key to Automated Methods
• Why? About 90% of backbone donors (NH) and acceptors (C=O) form hydrogen bonds
• Basic definition
– Angle N–H···O greater than 120 degrees
– H···O distance less than 2.5 Å
– Note: H atoms are not usually identified directly
Angle-distance hydrogen bond assignment
• Baker and Hubbard assigned hydrogen bonds according to the angle N-H-O and to the distance rHO (1984)
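[Figure: N–H···O geometry with N–H = 1 Å. Criteria: H···O < 2.5 Å and N–H–O angle > 120°. At the limiting case (H···O = 2.5 Å, angle = 120°) the H···O vector has components of 1.25 Å along the N–H axis and 2.165 Å perpendicular to it (a 30°/60° triangle), giving an N···O distance of about 3.122 Å. At 180° the three atoms are collinear (1 Å + 2.5 Å).]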
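The angle-distance criterion can be written as a short test on atomic coordinates. The sketch below assumes the hydrogen position is already known (in practice it is usually built from the backbone geometry); the function names are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

MAX_HO_DISTANCE = 2.5   # Å, upper limit on the H···O distance
MIN_NHO_ANGLE = 120.0   # degrees, lower limit on the N-H···O angle


def angle_at(center, a, b):
    """Angle a-center-b in degrees."""
    va = [x - c for x, c in zip(a, center)]
    vb = [x - c for x, c in zip(b, center)]
    dot = sum(p * q for p, q in zip(va, vb))
    norm = math.sqrt(sum(p * p for p in va)) * math.sqrt(sum(q * q for q in vb))
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cosine))


def is_hydrogen_bond(n, h, o):
    """Angle-distance test on the coordinates of N, H and O."""
    return math.dist(h, o) < MAX_HO_DISTANCE and angle_at(h, n, o) > MIN_NHO_ANGLE


def n_o_distance(nh, ho, angle_deg):
    """N···O distance from the N-H length, H···O length and N-H-O angle (law of cosines)."""
    return math.sqrt(nh * nh + ho * ho - 2.0 * nh * ho * math.cos(math.radians(angle_deg)))
```

With N–H = 1 Å, H···O = 2.5 Å, and an angle of 120°, `n_o_distance` reproduces the value of about 3.122 Å shown in the figure.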
Coulomb hydrogen bond calculation – used by DSSP
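\[ E = f \, \delta^{+} \delta^{-} \left( \frac{1}{r_{NO}} + \frac{1}{r_{HC'}} - \frac{1}{r_{HO}} - \frac{1}{r_{NC'}} \right) \]
• f is a constant, 332 Å·kcal/(mol·e²)
• δ+ and δ− are the positive and negative polar charges, in electron units
• Weakest H-bond: −0.5 kcal/mol in DSSP
• H positions are not given and must be extrapolated; note that this assumes planar geometry for the peptide bond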
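The electrostatic energy above can be evaluated directly from the four atom positions. In the sketch below, the constant 332 and the −0.5 kcal/mol cutoff come from the slide; the partial charges 0.42 e and 0.20 e are the values used in the original DSSP definition and are not stated on the slide. The function names are hypothetical.

```python
import math

F = 332.0             # Å·kcal/(mol·e²), from the slide
Q1 = 0.42             # partial charge on C=O, in e (DSSP value, not on the slide)
Q2 = 0.20             # partial charge on N-H, in e (DSSP value, not on the slide)
HBOND_CUTOFF = -0.5   # kcal/mol, weakest accepted hydrogen bond


def dssp_energy(n, h, c, o):
    """Electrostatic energy between a backbone N-H group and a C=O group."""
    r_no = math.dist(n, o)
    r_hc = math.dist(h, c)
    r_ho = math.dist(h, o)
    r_nc = math.dist(n, c)
    return F * Q1 * Q2 * (1.0 / r_no + 1.0 / r_hc - 1.0 / r_ho - 1.0 / r_nc)


def is_dssp_hbond(n, h, c, o):
    """True when the energy is below the -0.5 kcal/mol cutoff."""
    return dssp_energy(n, h, c, o) < HBOND_CUTOFF
```

For a collinear N–H···O=C arrangement with an N···O distance of 2.9 Å, the energy is strongly negative, so the pair counts as a hydrogen bond; moving the C=O group about 10 Å away brings the energy close to zero.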
DSSP
• H = alpha helix
• G = 3₁₀ helix
• I = pi helix
• B = bridge (single-residue sheet)
• E = extended beta strand
• T = beta turn
• S = bend
• C = coil
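Prediction methods usually work with three states (helix, strand, loop) rather than the full DSSP alphabet. The sketch below collapses a DSSP string to three states. The mapping is one common convention and is an assumption here; other reductions exist, for example treating G or B as loop.

```python
# Assumption: one common eight-to-three state reduction (not the only one in use).
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "L", "S": "L", "C": "L",   # turns, bends and coil
}


def reduce_to_three_state(dssp_string):
    """Collapse DSSP codes to H (helix), E (strand) or L (loop).

    Any unlisted character, such as the blank DSSP uses for coil, maps to L.
    """
    return "".join(DSSP_TO_THREE_STATE.get(code, "L") for code in dssp_string)


def state_fractions(three_state):
    """Fraction of residues in each of the three states."""
    n = len(three_state)
    return {s: three_state.count(s) / n for s in "HEL"}
```

For instance, `reduce_to_three_state("HHGGEEBTSC")` yields a string with four helix, three strand, and three loop positions.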
http://e106.life.nctu.edu.tw/~hwhuang/dssp/
Identifying structural domain and function in proteins
1NTY
Prediction of protein-protein or protein-DNA interaction
• Sequence-based methods
– Homology
– Correlated Mutation
• Structure-based methods
– Physical docking
• Hybrid methods
Principles and methods of docking and ligand design
• Structure-based design
– Docking
• Analog-based design
– QSAR (Quantitative structure-activity relationships)
Most force fields consist of a summation of bonded forces associated with chemical bonds, bond angles, and bond dihedrals, and non-bonded forces associated with van der Waals forces and electrostatic charge.
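The terms named above can be sketched as simple functions. The functional forms below (harmonic bonds and angles, a cosine dihedral term, a 12-6 Lennard-Jones term, and a Coulomb term) are the ones commonly used in molecular mechanics; the slide does not specify them, and all parameter names are illustrative.

```python
import math

COULOMB_CONSTANT = 332.0  # kcal·Å/(mol·e²), approximate


def bond_energy(r, r0, k):
    """Harmonic bond stretching around the equilibrium length r0."""
    return k * (r - r0) ** 2


def angle_energy(theta, theta0, k):
    """Harmonic bond-angle bending around theta0 (radians)."""
    return k * (theta - theta0) ** 2


def dihedral_energy(phi, k, n, delta):
    """Periodic torsion term with multiplicity n and phase delta (radians)."""
    return k * (1.0 + math.cos(n * phi - delta))


def lennard_jones(r, epsilon, sigma):
    """12-6 van der Waals term; minimum of -epsilon at r = 2**(1/6) * sigma."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, qi, qj):
    """Electrostatic term between point charges qi and qj (in e) at distance r (Å)."""
    return COULOMB_CONSTANT * qi * qj / r
```

The total energy of a conformation is the sum of these terms over all bonds, angles, dihedrals, and non-bonded atom pairs.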
Fold recognition method
Prediction in 1D
– Secondary structure prediction
– Solvent accessibility prediction
– Disulfide bond prediction
– Fold recognition
– Enzyme class prediction
– Subcellular localization prediction
– Metal binding sites prediction
– Disulfide connectivity prediction
– Phi psi angle prediction
Secondary structure prediction
β sheet
Helix
Loop
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
EEEELLLLLHHHHHHHHHHHHLLLLLHHHHHHLLLLEEEELLLLL
H = Helix
E = β sheet
L = Loop
Solvent accessibility prediction
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
EEEEBBBBEEEEEBBBBBBBEEEEEEBBBBBBBEEEEEEEBBBBEE
B = Buried
E = Exposed
[Figure: schematic of a folded chain with buried (B) and exposed (E) residues]
Disulfide bond prediction
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
O O R O O R
Fold recognition
SCOP (Structural Classification Of Proteins)
Hierarchy: Root → classes → folds → superfamily → family → proteins → species
Classes shown: α/β, α+β, Multi-domain, Membrane.., Small protein..
Example path: TIM barrel → TIM, Aldolase, … → TIM, … → TIM → Chicken, Human, …
SCOP statistics: 11 classes, 800 folds, 1294 superfamilies, 2327 families
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
?
Subcellular localization prediction
Eukaryotic Cellular compartments
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
?
Metal binding sites prediction
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
NNNNBNNBNNNNNBNNNNNBNNNNNNNNNNNNNNNNNNBNNN
B Binding
N Non-binding
Phi psi angle prediction
Ramachandran plot
• Phi: C(n-1) – N(n) – Cα(n) – C(n)
• Psi: N(n) – Cα(n) – C(n) – N(n+1)
[Figure: Ramachandran plot divided into regions labeled A to R]
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
AADGJJKKCPGDANOOEEAAAAJJJJJJJJKKNNQQCCJJJJAAAA
Disulfide Connectivity Prediction
TTC1C2PSIVARSNFNVC3RLPGTPEAIC4ATYTGC5IIIPGATC6PGDYAN
Connectivity pattern: 1-6, 2-5, 3-4 (C1–C6, C2–C5, C3–C4)
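To show why connectivity prediction is a classification problem, the sketch below numbers the cysteines in the example sequence and enumerates every way of pairing them. Six cysteines give 15 possible patterns, of which 1-6, 2-5, 3-4 is one. The function names are hypothetical.

```python
# Example sequence from the slides.
SEQUENCE = "TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN"


def cysteine_positions(sequence):
    """1-based positions of all cysteines in the sequence."""
    return [i for i, aa in enumerate(sequence, start=1) if aa == "C"]


def connectivity_patterns(labels):
    """All ways to pair up an even number of cysteine labels."""
    if not labels:
        return [[]]
    first, rest = labels[0], labels[1:]
    patterns = []
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for sub_pattern in connectivity_patterns(remaining):
            patterns.append([(first, partner)] + sub_pattern)
    return patterns


cysteines = cysteine_positions(SEQUENCE)
patterns = connectivity_patterns(list(range(1, len(cysteines) + 1)))
```

The number of patterns grows quickly with the number of cysteines (1, 3, 15, 105, ... for 2, 4, 6, 8 cysteines), which is why each pattern can be treated as one class for a classifier.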
[Figure: SVM workflow. Training: training data (Class 1 to Class K, each with Features 1~N) → SVM → SVM model. Testing: testing data (Feature 1 to Feature N) → SVM model → predicted class (Class 1 to Class K).]
Protein Structure Prediction
[Flowchart: Sequence → sequence homology to a known fold? If >30%: homology modeling → model. If <30%: threading → match found? Yes: model. No: ab initio → model.]
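The decision logic of the flowchart can be sketched as follows. Percent identity is computed here over a pre-aligned pair of sequences with `-` for gaps, counting only columns where both sequences have a residue; that definition, the function names, and the handling of exactly 30% are assumptions, since the slide only gives the >30% and <30% branches.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over aligned, gap-free columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def choose_method(identity, threading_match_found=False, threshold=30.0):
    """Pick a structure prediction route following the flowchart."""
    if identity > threshold:
        return "homology modeling"
    if threading_match_found:
        return "threading"
    return "ab initio"
```

For example, a target with 45% identity to a known structure would go to homology modeling, while a target at 20% identity with no threading match would fall through to ab initio prediction.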
Homology modeling
• The goal of protein modeling is to predict a structure from its sequence
– Template recognition and initial alignment
– Alignment correction
– Backbone generation
– Loop modeling
– Side-chain modeling
– Model optimization
– Model validation
What is Homology Modeling?
[Figure: a target and a template (PDB entries 8lyz and 1alc). The two proteins share similar sequences and are homologous, so the known structure is used as the template.]
KQFTKCELSQNLYDIDGYGRIALPELICTMFHTSGYDTQAIVENDESTEYGLFQISNALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILDIKGIDYWIAHKALCTEKLEQWLCEKE
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
Structure prediction by homology modeling
Step 1
Step 2
Step 3
Step 4
Structure comparison and alignment
1CRN 1JXX
CE
http://cl.sdsc.edu/ce.html
DALI
http://ekhidna.biocenter.helsinki.fi/dali_server/
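Homework (due before class on 10/6)
1. Search the Protein Data Bank using the following criteria and list the PDB IDs found
– Hemoglobin
– Has ligand: Yes
– X-ray resolution < 1.5
– Homo sapiens
2. Compare the secondary structure assignments of PDB ID 1ema made by DSSP and by STRIDE
3. Analyze PDB ID 1atp using Procheck in PDBsum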