bioinformatics t7-proteinstructure v2014


Upload: wvcrieki

Post on 02-Jul-2015


Page 1: Bioinformatics t7-proteinstructure v2014
Page 2: Bioinformatics t7-proteinstructure v2014

FBW

25-11-2014

Wim Van Criekinge

Page 24: Bioinformatics t7-proteinstructure v2014

There IS a class on 4 November and NO class on 18 November

Page 25: Bioinformatics t7-proteinstructure v2014

The reason for “bioinformatics” to exist?

• Empirical finding: if two biological sequences are sufficiently similar, almost invariably they have similar biological functions and are descended from a common ancestor.

• (i) Function is encoded into sequence: the sequence provides the syntax, and

• (ii) there is redundancy in the encoding: many positions in the sequence may be changed without perceptible changes in the function, so the semantics of the encoding is robust.

Page 26: Bioinformatics t7-proteinstructure v2014

Protein Structure

Introduction

Why ?

How do proteins fold ?

Levels of protein structure

0,1,2,3,4

X-ray / NMR

The Protein Database (PDB)

Protein Modeling

Bioinformatics & Proteomics

Weblems

Page 27: Bioinformatics t7-proteinstructure v2014

• Proteins perform a variety of tasks in living cells

• Each protein adopts a particular fold that determines its function

• The 3D structure of a protein can bring into close proximity residues that are far apart in the amino acid sequence

• Catalytic site: the "business end" of the molecule

Why protein structure ?

Page 28: Bioinformatics t7-proteinstructure v2014

Rationale for understanding protein structure and function

Protein sequence

- large numbers of sequences, including whole genomes

Protein function

- rational drug design and treatment of disease

- protein and genetic engineering

- build networks to model cellular pathways

- study organismal function and evolution

?

structure determination

structure prediction

homology

rational mutagenesis

biochemical analysis

model studies

Protein structure

- three dimensional

- complicated

- mediates function

Page 29: Bioinformatics t7-proteinstructure v2014

About the use of protein models (Peitsch)

• Structure is preserved under evolution when sequence is not

– Interpreting the impact of mutations/SNPs and conserved residues on protein function. Potential link to disease

• Function ?

– Biochemical: the chemical interactions occurring in a protein

– Biological: role within the cell

– Phenotypic: the role in the organism

• Gene Ontology functional classification !

– Prioritisation of residues to mutate to determine protein function

– Providing hints for protein function: catalytic mechanisms of enzymes often require key residues to be close together in 3D space

– (protein-ligand complexes, rational drug design, putative interaction interfaces)

Page 30: Bioinformatics t7-proteinstructure v2014

MIS-SENSE MUTATION

e.g. Sickle Cell Anaemia

Cause: defective haemoglobin due to a mutation in the β-globin gene

Symptoms: severe anaemia and death in homozygotes

Page 31: Bioinformatics t7-proteinstructure v2014

Normal β-globin - 146 amino acids

val - his - leu - thr - pro - glu - glu - ---------
 1     2     3     4     5     6     7

            Normal gene (aa 6)   Mutant gene
DNA         CTC                  CAC
mRNA        GAG                  GUG
Product     Glu                  Val

Mutant β-globin

val - his - leu - thr - pro - val - glu - ---------
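The single-base change on this slide can be traced step by step. The sketch below assumes the DNA row shows the template strand written 3'→5' (the reading under which CTC gives GAG), and its codon table holds only the two codons on the slide rather than the full genetic code.

```python
# Base pairing used when a template-strand DNA codon is transcribed into mRNA.
DNA_TO_MRNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Only the two codons that appear on the slide (not a full genetic code).
CODON_TO_AA = {"GAG": "Glu", "GUG": "Val"}

# First seven residues of normal beta-globin, as listed on the slide.
NORMAL_START = ["Val", "His", "Leu", "Thr", "Pro", "Glu", "Glu"]


def transcribe_template(codon):
    """mRNA codon (5'->3') for a template DNA codon written 3'->5'."""
    return "".join(DNA_TO_MRNA[base] for base in codon)


def apply_codon(protein, position, template_codon):
    """Return a copy of `protein` with residue `position` (1-based) re-encoded."""
    mutated = list(protein)
    mutated[position - 1] = CODON_TO_AA[transcribe_template(template_codon)]
    return mutated


normal = apply_codon(NORMAL_START, 6, "CTC")
mutant = apply_codon(NORMAL_START, 6, "CAC")
print(" - ".join(normal))
print(" - ".join(mutant))
```

One changed base in codon 6 is enough to replace a charged glutamate with a hydrophobic valine.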

Page 32: Bioinformatics t7-proteinstructure v2014

Protein Conformation

• Christian Anfinsen

Studies on reversible denaturation

“Sequence specifies conformation”

• Chaperones and disulfide interchange enzymes: involved but not controlling final state, they provide environment to refold if misfolded

• Structure implies function: The amino acid sequence encodes the protein’s structural information

Page 33: Bioinformatics t7-proteinstructure v2014

• By itself:

– Anfinsen had developed what he called his "thermodynamic hypothesis" of protein folding to explain the native conformation of amino acid structures. He theorized that the native or natural conformation occurs because this particular shape is thermodynamically the most stable in the intracellular environment. That is, it takes this shape as a result of the constraints of the peptide bonds as modified by the other chemical and physical properties of the amino acids.

– To test this hypothesis, Anfinsen unfolded the RNase enzyme under extreme chemical conditions and observed that the enzyme's amino acid structure refolded spontaneously back into its original form when he returned the chemical environment to natural cellular conditions.

– "The native conformation is determined by the totality of interatomic interactions and hence by the amino acid sequence, in a given environment."

How does a protein fold ?

Page 34: Bioinformatics t7-proteinstructure v2014

Protein Structure

Introduction

Why ?

How do proteins fold ?

Levels of protein structure

0,1,2,3,4

X-ray / NMR

The Protein Database (PDB)

Protein Modeling

Bioinformatics & Proteomics

Weblems

Page 35: Bioinformatics t7-proteinstructure v2014

• Proteins are linear heteropolymers: one or more polypeptide chains

• Below about 40 residues the term peptide is frequently used.

• A certain number of residues is necessary to perform a particular biochemical function, and around 40-50 residues appears to be the lower limit for a functional domain size.

• Protein sizes range from this lower limit to several hundred residues in multi-functional proteins.

• Three-dimensional shapes (folds) adopted vary enormously

• Experimental methods:

– X-ray crystallography

– NMR (nuclear magnetic resonance)

– Electron microscopy

– Ab initio calculations …

The Basics

Page 36: Bioinformatics t7-proteinstructure v2014

• Zeroth: amino acid composition (proteomics, %cysteine, %glycine)

Levels of protein structure

Page 37: Bioinformatics t7-proteinstructure v2014

The basic structure of an α-amino acid is quite simple. R denotes any one of the 20 possible side chains (see table below). The Cα atom has 4 different substituents (the H is omitted in the drawing) and is thus chiral. An easy trick to remember the correct L-form is the CORN rule: when the Cα atom is viewed with the H in front, the substituents read "CO-R-N" in a clockwise direction.

Amino Acid Residues

Page 38: Bioinformatics t7-proteinstructure v2014
Page 39: Bioinformatics t7-proteinstructure v2014

Amino Acid Residues

Page 40: Bioinformatics t7-proteinstructure v2014

Amino Acid Residues

Page 41: Bioinformatics t7-proteinstructure v2014

Amino Acid Residues

Page 42: Bioinformatics t7-proteinstructure v2014

Amino Acid Residues

Page 43: Bioinformatics t7-proteinstructure v2014

• Primary: This is simply the order of covalent linkages along the polypeptide chain, i.e. the sequence itself

Levels of protein structure

Page 44: Bioinformatics t7-proteinstructure v2014

Backbone Torsion Angles

Page 45: Bioinformatics t7-proteinstructure v2014

Backbone Torsion Angles

Page 46: Bioinformatics t7-proteinstructure v2014

• Secondary

– Local organization of the protein backbone: alpha-helix, beta-strand (which assemble into beta-sheets), turn and interconnecting loop.

Levels of protein structure

Page 47: Bioinformatics t7-proteinstructure v2014

Ramachandran / Phi-Psi Plot

Page 48: Bioinformatics t7-proteinstructure v2014

The alpha-helix

Page 49: Bioinformatics t7-proteinstructure v2014

• Residues with hydrophobic properties conserved at i, i+2, i+4, separated by unconserved or hydrophilic residues, suggest surface beta-strands.

• A short run of hydrophobic amino acids (4 residues) suggests a buried beta-strand.

• Pairs of conserved hydrophobic amino acids separated by pairs of unconserved or hydrophilic residues suggest an alpha-helix with one face packing in the protein core. Likewise, an i, i+3, i+4, i+7 pattern of conserved hydrophobic residues.

A Practical Approach: Interpretation
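The spacing rules on this slide can be tried on a single sequence. The sketch below is a simplification: the hydrophobic residue set is an assumption, conservation across an alignment is ignored, and the intervening positions are not checked for being hydrophilic.

```python
# Assumed hydrophobic set; the slide does not define which residues count.
HYDROPHOBIC = set("AVLIMFWC")

# Offsets taken from the slide's rules of thumb.
PATTERNS = {
    "surface beta-strand": (0, 2, 4),
    "amphipathic alpha-helix": (0, 3, 4, 7),
}


def find_pattern(seq, offsets):
    """Return start indices where every offset position is hydrophobic."""
    span = max(offsets)
    hits = []
    for i in range(len(seq) - span):
        if all(seq[i + k] in HYDROPHOBIC for k in offsets):
            hits.append(i)
    return hits


if __name__ == "__main__":
    example = "LKELLKKLVKVKV"  # made-up sequence for illustration
    for name, offsets in PATTERNS.items():
        print(name, find_pattern(example, offsets))
```

On real data the same test would be applied to alignment columns, asking whether a column is conservatively hydrophobic instead of looking at one residue.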

Page 50: Bioinformatics t7-proteinstructure v2014

Beta-sheets

Page 51: Bioinformatics t7-proteinstructure v2014

Topologies of Beta-sheets

Page 52: Bioinformatics t7-proteinstructure v2014

Secondary structure prediction ?

Page 53: Bioinformatics t7-proteinstructure v2014

• Chou, P.Y. and Fasman, G.D. (1974). Conformational parameters for amino acids in helical, β-sheet, and random coil regions calculated from proteins. Biochemistry 13, 211-221.

• Chou, P.Y. and Fasman, G.D. (1974). Prediction of protein conformation. Biochemistry 13, 222-245.

Secondary structure prediction:CHOU-FASMAN

Page 54: Bioinformatics t7-proteinstructure v2014

• Method

• Assigning a set of prediction values to a residue, based on statistical analysis of 15 proteins

• Applying a simple algorithm to those numbers

Secondary structure prediction:CHOU-FASMAN

Page 55: Bioinformatics t7-proteinstructure v2014

Calculation of preference parameters

For each of the 20 residues and each secondary structure (α-helix, β-sheet and β-turn):

• $P = \log\left(\frac{\text{observed counts}}{\text{expected counts}}\right) + 1.0$

• Preference parameter > 1.0: the residue has a preference for the specific secondary structure.

• Preference parameter = 1.0: the residue neither prefers nor dislikes the specific secondary structure.

• Preference parameter < 1.0: the residue dislikes the specific secondary structure.

Secondary structure prediction:CHOU-FASMAN
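As a rough illustration of the formula as written on this slide, the sketch below computes a preference parameter from counts. The logarithm base (10 here) and the way the expected count is derived are assumptions, since the slide specifies neither, and the counts in the usage example are invented.

```python
import math


def expected_count(residue_total, state_total, all_total):
    """Expected count if the residue had no preference for the state.

    Assumption: expectation = residue occurrences * overall fraction of
    residues found in that secondary structure state.
    """
    return residue_total * state_total / all_total


def preference(observed, expected, base=10.0):
    """P = log(observed / expected) + 1.0, following the slide's formula."""
    return math.log(observed / expected, base) + 1.0


if __name__ == "__main__":
    # Invented counts: 100 alanines, 300 of 1000 residues are helical,
    # and 45 of the alanines are observed in helices.
    exp = expected_count(100, 300, 1000)
    print(round(preference(45, exp), 3))
```

Whatever base is used, observed equal to expected gives exactly 1.0, which is the neutral value in the three cases listed on the slide.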

Page 56: Bioinformatics t7-proteinstructure v2014

Preference parameters

Residue  P(a)  P(b)  P(t)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala      1.45  0.97  0.57  0.049  0.049   0.034   0.029
Arg      0.79  0.90  1.00  0.051  0.127   0.025   0.101
Asn      0.73  0.65  1.68  0.101  0.086   0.216   0.065
Asp      0.98  0.80  1.26  0.137  0.088   0.069   0.059
Cys      0.77  1.30  1.17  0.089  0.022   0.111   0.089
Gln      1.17  1.23  0.56  0.050  0.089   0.030   0.089
Glu      1.53  0.26  0.44  0.011  0.032   0.053   0.021
Gly      0.53  0.81  1.68  0.104  0.090   0.158   0.113
His      1.24  0.71  0.69  0.083  0.050   0.033   0.033
Ile      1.00  1.60  0.58  0.068  0.034   0.017   0.051
Leu      1.34  1.22  0.53  0.038  0.019   0.032   0.051
Lys      1.07  0.74  1.01  0.060  0.080   0.067   0.073
Met      1.20  1.67  0.67  0.070  0.070   0.036   0.070
Phe      1.12  1.28  0.71  0.031  0.047   0.063   0.063
Pro      0.59  0.62  1.54  0.074  0.272   0.012   0.062
Ser      0.79  0.72  1.56  0.100  0.095   0.095   0.104
Thr      0.82  1.20  1.00  0.062  0.093   0.056   0.068
Trp      1.14  1.19  1.11  0.045  0.000   0.045   0.205
Tyr      0.61  1.29  1.25  0.136  0.025   0.110   0.102
Val      1.14  1.65  0.30  0.023  0.029   0.011   0.029

Secondary structure prediction:CHOU-FASMAN

Page 57: Bioinformatics t7-proteinstructure v2014

Applying algorithm

1. Assign parameters to each residue (thresholds below are on the scale of the preference parameter table, where 1.00 is neutral).

2. Identify regions where 4 out of 6 residues have P(a) > 1.00: α-helix nucleus. Extend the helix in both directions until four contiguous residues have an average P(a) < 1.00: end of α-helix. If the segment is longer than 5 residues and P(a) > P(b): α-helix.

3. Repeat this procedure to locate all of the helical regions.

4. Identify regions where 3 out of 5 residues have P(b) > 1.00: β-sheet nucleus. Extend the sheet in both directions until four contiguous residues have an average P(b) < 1.00: end of β-sheet. If P(b) > 1.05 and P(b) > P(a): β-sheet.

5. Rest: P(a) > P(b): α-helix. P(b) > P(a): β-sheet.

6. To identify a bend at residue number i, calculate the following value:

p(t) = f(i) f(i+1) f(i+2) f(i+3)

If: (1) p(t) > 0.000075; (2) average P(t) > 1.00 in the tetrapeptide; and (3) averages for the tetrapeptide obey P(a) < P(t) > P(b): β-turn.

Secondary structure prediction:CHOU-FASMAN
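Two of the steps above, helix nucleation (step 2) and the bend test (step 6), can be sketched with the values from the preference parameter table. This is a partial illustration, not the full procedure: helix extension, the sheet rules and overlap resolution are left out, and thresholds use the table's scale, where 1.00 corresponds to the value 100 used in some statements of the algorithm.

```python
# Columns: P(a), P(b), P(t), f(i), f(i+1), f(i+2), f(i+3), from the table.
CF = {
    "A": (1.45, 0.97, 0.57, 0.049, 0.049, 0.034, 0.029),
    "R": (0.79, 0.90, 1.00, 0.051, 0.127, 0.025, 0.101),
    "N": (0.73, 0.65, 1.68, 0.101, 0.086, 0.216, 0.065),
    "D": (0.98, 0.80, 1.26, 0.137, 0.088, 0.069, 0.059),
    "C": (0.77, 1.30, 1.17, 0.089, 0.022, 0.111, 0.089),
    "Q": (1.17, 1.23, 0.56, 0.050, 0.089, 0.030, 0.089),
    "E": (1.53, 0.26, 0.44, 0.011, 0.032, 0.053, 0.021),
    "G": (0.53, 0.81, 1.68, 0.104, 0.090, 0.158, 0.113),
    "H": (1.24, 0.71, 0.69, 0.083, 0.050, 0.033, 0.033),
    "I": (1.00, 1.60, 0.58, 0.068, 0.034, 0.017, 0.051),
    "L": (1.34, 1.22, 0.53, 0.038, 0.019, 0.032, 0.051),
    "K": (1.07, 0.74, 1.01, 0.060, 0.080, 0.067, 0.073),
    "M": (1.20, 1.67, 0.67, 0.070, 0.070, 0.036, 0.070),
    "F": (1.12, 1.28, 0.71, 0.031, 0.047, 0.063, 0.063),
    "P": (0.59, 0.62, 1.54, 0.074, 0.272, 0.012, 0.062),
    "S": (0.79, 0.72, 1.56, 0.100, 0.095, 0.095, 0.104),
    "T": (0.82, 1.20, 1.00, 0.062, 0.093, 0.056, 0.068),
    "W": (1.14, 1.19, 1.11, 0.045, 0.000, 0.045, 0.205),
    "Y": (0.61, 1.29, 1.25, 0.136, 0.025, 0.110, 0.102),
    "V": (1.14, 1.65, 0.30, 0.023, 0.029, 0.011, 0.029),
}


def helix_nuclei(seq, window=6, needed=4, cutoff=1.00):
    """Start indices of windows where `needed` residues have P(a) > cutoff."""
    starts = []
    for i in range(len(seq) - window + 1):
        strong = sum(1 for r in seq[i:i + window] if CF[r][0] > cutoff)
        if strong >= needed:
            starts.append(i)
    return starts


def is_turn(seq, i):
    """Apply the three bend conditions to the tetrapeptide starting at i."""
    tetra = seq[i:i + 4]
    if len(tetra) < 4:
        return False
    p_t = 1.0
    for k, r in enumerate(tetra):
        p_t *= CF[r][3 + k]  # f(i), f(i+1), f(i+2), f(i+3) by position
    avg_a, avg_b, avg_t = (sum(CF[r][c] for r in tetra) / 4 for c in (0, 1, 2))
    return p_t > 0.000075 and avg_t > 1.00 and avg_a < avg_t > avg_b


if __name__ == "__main__":
    seq = "AEELLKKANPGS"  # made-up peptide for illustration
    print(helix_nuclei(seq))
    print([i for i in range(len(seq) - 3) if is_turn(seq, i)])
```

Sequences are given in one-letter code, so the three-letter names in the table are mapped to their standard one-letter symbols.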

Page 58: Bioinformatics t7-proteinstructure v2014

Successful method?

19 proteins evaluated:

• Successful in locating 88% of helical and 95% of β regions

• Correctly predicting 80% of helical and 86% of β-sheet residues

• Accuracy of predicting the three conformational states for all residues (helix, β, and coil) is 77%

Chou & Fasman: successful method

After 1974: improvement of preference parameters

Secondary structure prediction:CHOU-FASMAN
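The three-state figure quoted on this slide is a per-residue percentage. A minimal sketch of how such a number is computed is shown below; the state letters H, E and C (helix, strand, coil) and the example strings are assumptions for illustration, not data from the evaluation.

```python
def three_state_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must not be empty")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    # Made-up prediction and reference assignment for a 10-residue peptide.
    print(three_state_accuracy("HHHHCCEEEE", "HHHCCCEEEC"))
```

The segment-level figures (regions located) need a separate overlap-based measure and are not covered by this per-residue count.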

Page 59: Bioinformatics t7-proteinstructure v2014
Page 60: Bioinformatics t7-proteinstructure v2014

Sander-Schneider: Evolution of overall structure

• Naturally occurring sequences with more than 20% sequence identity over 80 or more residues always adopt the same basic structure (Sander and Schneider 1991)
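The rule above can be applied to a pairwise alignment as follows. This is a sketch under assumptions: identity is computed over columns where neither sequence has a gap (other conventions exist), and the 20% and 80-residue cutoffs are the ones quoted on the slide.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (identity %, ungapped aligned length) for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0, 0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs), len(pairs)


def likely_same_fold(aligned_a, aligned_b, min_identity=20.0, min_length=80):
    """True if the alignment passes the identity and length cutoffs."""
    identity, length = percent_identity(aligned_a, aligned_b)
    return identity > min_identity and length >= min_length


if __name__ == "__main__":
    print(percent_identity("AC-EFGH", "ACDEYGH"))
```

Short alignments fail the test regardless of identity, which reflects the length condition in the rule.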

Page 61: Bioinformatics t7-proteinstructure v2014

Sander-Schneider

• HSSP: Homology-derived Secondary Structure of Proteins

Page 62: Bioinformatics t7-proteinstructure v2014

• SCOP: Structural Classification of Proteins

• FSSP: Family of Structurally Similar Proteins

• CATH: Class, Architecture, Topology, Homology

Structural Family Databases

Page 63: Bioinformatics t7-proteinstructure v2014

Levels of protein structure

• Tertiary

– Packing of secondary structure elements into a compact spatial unit

– Fold or domain – this is the level to which structure prediction is currently possible

Page 64: Bioinformatics t7-proteinstructure v2014

Domains

Page 65: Bioinformatics t7-proteinstructure v2014

Protein Architecture

Page 66: Bioinformatics t7-proteinstructure v2014

• Protein dissection into domains

• Conserved Domain Architecture Retrieval Tool (CDART) uses information in Pfam and SMART to assign domains along a sequence

• (automatic when blasting)

Domains

Page 67: Bioinformatics t7-proteinstructure v2014

• From the analysis of alignment of protein families

• Conserved sequence features, usually associated with a specific function

• PROSITE database of protein “signature” patterns (large numbers of false positives and false negatives)

• From alignment of homologous sequences (PRINTS/PRODOM)

• From Hidden Markov Models (PFAM)

• Meta approach: INTERPRO

Domains

Page 68: Bioinformatics t7-proteinstructure v2014

Protein Architecture

Page 69: Bioinformatics t7-proteinstructure v2014

Levels of protein structure: Topology

Page 70: Bioinformatics t7-proteinstructure v2014

Hydrophobicity Plot

P53_HUMAN (P04637) human cellular tumor antigen p53

Kyte-Doolittle hydrophilicity, window=19
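A plot like this one is a sliding-window average over per-residue scale values. The sketch below uses the Kyte-Doolittle hydropathy scale with its usual sign convention (positive means hydrophobic); a plot labelled as hydrophilicity may show the same profile with the sign flipped, which is an assumption about the tool used for the slide.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of each full window; one value per window start."""
    if window < 1 or window > len(seq):
        return []
    scores = [KD[r] for r in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


if __name__ == "__main__":
    # Made-up sequence: polar flanks around a 19-residue hydrophobic stretch.
    example = "KKDDS" + "LIVALLIVAFLLIVALLIV" + "SDDKK"
    profile = hydropathy_profile(example)
    print(max(profile), profile.index(max(profile)))
```

A window of 19 is about the length of a membrane-spanning helix, so a sustained hydrophobic peak at this window size is commonly read as a candidate transmembrane segment.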

Page 71: Bioinformatics t7-proteinstructure v2014
Page 72: Bioinformatics t7-proteinstructure v2014

The ‘positive inside’ rule (EMBO J. 5:3021; EJB 174:671, 205:1207; FEBS lett. 282:41)

Membrane             In       Out
Bacterial IM         16% KR   4% KR
Eukaryotic PM        17% KR   7% KR
Thylakoid membrane   13% KR   5% KR
Mitochondrial IM     10% KR   3% KR
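The percentages above are the Lys+Arg content of the loops on each side of the membrane. A hedged sketch of how the rule might be applied to predicted loops follows; the loop sequences in the example are invented, and the function names are illustrative only.

```python
def kr_percent(segment):
    """Percentage of residues in `segment` that are Lys (K) or Arg (R)."""
    if not segment:
        return 0.0
    positives = sum(1 for r in segment.upper() if r in ("K", "R"))
    return 100.0 * positives / len(segment)


def positive_side(loops_a, loops_b):
    """Return 'a' or 'b' for the side whose pooled loops are richer in K+R."""
    a = kr_percent("".join(loops_a))
    b = kr_percent("".join(loops_b))
    if a == b:
        return "tie"
    return "a" if a > b else "b"


if __name__ == "__main__":
    # Invented loops on the two sides of a membrane protein.
    side_a = ["KRKA", "RRGS"]
    side_b = ["DEGS", "NQTS"]
    print(kr_percent("".join(side_a)), kr_percent("".join(side_b)))
    print("predicted inside:", positive_side(side_a, side_b))
```

Under the rule, the side with the higher K+R content is predicted to be the inside.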

Page 73: Bioinformatics t7-proteinstructure v2014
Page 74: Bioinformatics t7-proteinstructure v2014

• Membrane-bound receptors

• A very large number of different domains both to bind their ligand and to activate G proteins.

• 6 different families

• Transducing messages as photons, organic odorants, nucleotides, nucleosides, peptides, lipids and proteins.

GPCR Topology

• Pharmaceutically the most important class

• Challenge: Methods to find novel GPCRs in human genome

Page 75: Bioinformatics t7-proteinstructure v2014

GPCR Topology

Page 76: Bioinformatics t7-proteinstructure v2014

• Seven transmembrane regions

GPCR Structure

• Conserved residues and motifs (e.g. NPXXY)

• Hydrophobic/hydrophilic domains

GPCR Topology
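A conserved motif such as NPXXY, where X is any residue, can be located with a regular expression. The sketch below uses a lookahead so that overlapping matches are also reported; the test sequence is made up.

```python
import re

# N, P, any two residues, then Y. The lookahead allows overlapping matches.
NPXXY = re.compile(r"(?=(NP[A-Z]{2}Y))")


def find_npxxy(seq):
    """Return (0-based start, matched text) for every NPXXY occurrence."""
    return [(m.start(), m.group(1)) for m in NPXXY.finditer(seq.upper())]


if __name__ == "__main__":
    print(find_npxxy("AAANPLIYAAA"))
```

A motif hit alone is weak evidence; combined with seven predicted transmembrane segments it becomes a more useful filter for candidate GPCRs.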

Page 78: Bioinformatics t7-proteinstructure v2014

Levels of protein structure

• Difficult to predict

• Functional units: Apoptosome, proteasome

Page 79: Bioinformatics t7-proteinstructure v2014

Protein Structure

Introduction

Why ?

How do proteins fold ?

Levels of protein structure

0,1,2,3,4

X-ray / NMR

The Protein Database (PDB)

Protein Modeling

Bioinformatics & Proteomics

Weblems

Page 80: Bioinformatics t7-proteinstructure v2014

• X-ray crystallography is an experimental technique that exploits the fact that X-rays are diffracted by crystals.

• X-rays have the proper wavelength (in the Ångström range, ~10⁻⁸ cm) to be scattered by the electron cloud of an atom of comparable size.

• Based on the diffraction pattern obtained from X-ray scattering off the periodic assembly of molecules or atoms in the crystal, the electron density can be reconstructed.

• A model is then progressively built into the experimental electron density, refined against the data and the result is a quite accurate molecular structure.

What is X-ray Crystallography

Page 81: Bioinformatics t7-proteinstructure v2014

• NMR uses protein in solution

– Can look at the dynamic properties of the protein structure

– Can look at the interactions between the protein and ligands, substrates or other proteins

– Can look at protein folding

– Sample is not damaged in any way

– The maximum size of a protein for NMR structure determination is ~30 kDa. This eliminates ~50% of all proteins

– High solubility is a requirement

• X-ray crystallography uses protein crystals

– No size limit: as long as you can crystallise it

– Solubility requirement is less stringent

– Simple definition of resolution

– Direct calculation from data to electron density and back again

– Crystallisation is the process bottleneck: binary (all or nothing)

– Phase problem: relies on heavy atom soaks or SeMet incorporation

• Both techniques require large amounts of pure protein and require expensive equipment!

NMR or Crystallography ?

Page 82: Bioinformatics t7-proteinstructure v2014

Protein Structure

Introduction

Why ?

How do proteins fold ?

Levels of protein structure

0,1,2,3,4

X-ray / NMR

The Protein Database (PDB)

Protein Modeling

Bioinformatics & Proteomics

Weblems

Page 83: Bioinformatics t7-proteinstructure v2014

PDB

Page 84: Bioinformatics t7-proteinstructure v2014

PDB

Page 85: Bioinformatics t7-proteinstructure v2014

PDB

Page 86: Bioinformatics t7-proteinstructure v2014

PDB

Page 87: Bioinformatics t7-proteinstructure v2014

Visualizing Structures

Cn3D version 4.0 (NCBI)

Page 88: Bioinformatics t7-proteinstructure v2014

Ball: Van der Waals radius

Stick: bond joining atom centers

N, blue/O, red/S, yellow/C, gray (green)

Visualizing Structures

Page 89: Bioinformatics t7-proteinstructure v2014

From N to C

Visualizing Structures

Page 90: Bioinformatics t7-proteinstructure v2014

• Demonstration of Protein Explorer

• PDB, install Chime

• Search helicase (select structure where DNA is present)

• Stop spinning, hide water molecules

• Show basic residues, which interact with the negatively charged backbone

• RASMOL / Cn3D

Visualizing Structures

Page 91: Bioinformatics t7-proteinstructure v2014

Protein Structure

Introduction

Why ?

How do proteins fold ?

Levels of protein structure

0,1,2,3,4

X-ray / NMR

The Protein Database (PDB)

Protein Modeling

Bioinformatics & Proteomics

Weblems

Page 92: Bioinformatics t7-proteinstructure v2014

Modeling

Page 93: Bioinformatics t7-proteinstructure v2014

Protein Structure

Molecular Modeling: building a 3D protein structure from its sequence

Page 94: Bioinformatics t7-proteinstructure v2014

• Finding a structural homologue

• BLAST

– versus PDB database or PSI-BLAST (E < 0.005)

– Domain coverage at least 60%

• Avoid gaps

– Prefer few gaps and reasonable similarity scores over lots of gaps and high similarity scores

Modeling
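The template selection criteria on this slide can be written as a filter over search hits. The sketch below uses a hypothetical `Hit` record and invented hits; it is not the output format of BLAST or PSI-BLAST, and ranking by gap count first is one way to read the "avoid gaps" advice.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one database hit (not a real BLAST record)."""
    template_id: str
    e_value: float
    aligned_length: int
    gaps: int
    score: float


def select_templates(hits, domain_length, max_e=0.005, min_coverage=0.60):
    """Keep hits with E < max_e and enough coverage; fewest gaps first."""
    kept = [h for h in hits
            if h.e_value < max_e
            and h.aligned_length / domain_length >= min_coverage]
    # Few gaps are preferred over a higher score, so gaps sort first.
    return sorted(kept, key=lambda h: (h.gaps, -h.score))


if __name__ == "__main__":
    hits = [
        Hit("tmplA", 1e-30, 180, 12, 400.0),
        Hit("tmplB", 1e-20, 170, 2, 310.0),
        Hit("tmplC", 0.5, 190, 0, 50.0),    # fails the E-value cutoff
        Hit("tmplD", 1e-10, 90, 0, 120.0),  # covers only 45% of the domain
    ]
    for hit in select_templates(hits, domain_length=200):
        print(hit.template_id, hit.gaps, hit.score)
```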

Page 95: Bioinformatics t7-proteinstructure v2014

• Extract “template” sequences and align with query

• Watch out for missing data (PDB file) and complement with additional templates

• Try to get as much information as possible, X-ray/NMR

• Sequence alignment from structure comparison of templates (SSA) can be different from a simple sequence alignment

• >40% identity: any alignment method is OK

• <40% identity: checks are essential

– Residue conservation checks in functional regions (patterns/motifs)

– Indels: combine gaps separated by few residues

– Manual editing: move gaps from secondary structure elements to loops

– Within loops, move gaps to loop ends, i.e. the turnaround point of the backbone

• Align templates structurally, extract the corresponding SSA or QTA (query/template alignment)

Modeling

Page 96: Bioinformatics t7-proteinstructure v2014

Input for model building

• Query sequence (the one you want the 3D model for)

• Template sequences and structures

• Query/template(s) (structure) sequence alignment

Modeling

Page 97: Bioinformatics t7-proteinstructure v2014

• Methods (for details on these, see the paper):

– WHATIF,

– SWISS-MODEL,

– MODELLER,

– ICM,

– 3D-JIGSAW,

– CPH-models,

– SDC1

Modeling

Page 98: Bioinformatics t7-proteinstructure v2014

• Model evaluation (how good is the prediction, how much can the algorithm rely on and extract from the provided templates)

– PROCHECK

– WHATIF

– ERRAT

• CASP (Critical Assessment of Structure Prediction)

– Best method is manual alignment editing !

Modeling

Page 99: Bioinformatics t7-proteinstructure v2014

CASP4: overall model accuracy ranging from 1 Å to 6 Å for 50-10% sequence identity

T112/dhso – 4.9 Å (348 residues; 24%)

T92/yeco – 5.6 Å (104 residues; 12%)

T128/sodm – 1.0 Å (198 residues; 50%)

T125/sp18 – 4.4 Å (137 residues; 24%)

T111/eno – 1.7 Å (430 residues; 51%)

T122/trpa – 2.9 Å (241 residues; 33%)
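Accuracies in Å like these are commonly root-mean-square deviations between equivalent atoms of model and experimental structure; the slide does not say which measure was used, so that reading is an assumption. The sketch below computes a plain RMSD for coordinate lists that are already superimposed and in matching order, and it performs no superposition itself.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Made-up coordinates in Å: every atom is displaced by 1 Å along z.
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmsd(model, reference))
```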

Comparative modelling at CASP

               BC          CASP1     CASP2     CASP3     CASP4
alignment      excellent   poor      fair      fair      fair
side chain     ~ 80%       ~ 50%     ~ 75%     ~ 75%     ~ 75%
short loops    1.0 Å       ~ 3.0 Å   ~ 1.0 Å   ~ 1.0 Å   ~ 1.0 Å
longer loops   2.0 Å       > 5.0 Å   ~ 3.0 Å   ~ 2.5 Å   ~ 2.0 Å

Page 100: Bioinformatics t7-proteinstructure v2014
Page 101: Bioinformatics t7-proteinstructure v2014

Protein Engineering / Protein Design