biocomputational studies on protein structures

61
From the Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden Biocomputational studies on protein structures by Erik Nordling Stockholm 2002

Upload: others

Post on 13-Nov-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Biocomputational studies on protein structures

From the Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden

Biocomputational studies on protein structures

by

Er ik Nordling

Stockholm 2002

Page 2: Biocomputational studies on protein structures

© Erik Nordling

ISBN 91-7349-295-7

Page 3: Biocomputational studies on protein structures

Till min älskade Mirna

Page 4: Biocomputational studies on protein structures
Page 5: Biocomputational studies on protein structures

Abstract Biology in the post-genomic era produces large amount of data. This, in combination with the need for efficient algorithms to find genes in the genomic material, has brought a renaissance into the field of computational biology. Methods now range from lead discovery in the drug discovery process, by virtual ligand screening, through sequence comparisons and homology searches, to micro array data analysis and visualisation. This thesis primarily deals with sequence analysis, different aspects of protein structure prediction and enzyme–ligand complex characterisation, I have applied bioinformatic techniques on the enzyme families of medium-chain dehydrogenases/reductases (MDR), short-chain dehydrogenases/reductases (SDR) and biologically active peptides

Sequence analysis of the MDR superfamily extends the evolutionary context of this superfamily, as MDR enzymes were collected from completely sequenced genomes. The analysis reveals the presence of eight families whereof several were previously uncharacterised. Three families are formed by dimeric alcohol dehydrogenases (ADH), cinnamyl alcohol dehydrogenases (CAD) and tetrameric alcohol dehydrogenases (YADH). Three further families are centred on forms initially detected as mitochondrial respiratory function proteins (MRF), acetyl-CoA reductases of fatty acid synthases (ACR), and leukotriene B4 dehydrogenases (LTD). The two remaining families, with polyol dehydrogenases (PDH) and quinone reductases (QOR), are also distinct but with variable sequences. The analysis also suggests that new functions have evolved in this superfamily in higher organisms.

Factors that govern the substrate specificity of γγADH were investigated with docking calculations and can be traced to active site characteristics, most notably the Ser48/Thr48 replacement between γγADH and ββADH, which allow the oxidation of 3β-hydroxy bile acids, such as isoUDCA, in ��� ADH, while both enzymes are inactive versus 3α-hydroxy bile acids.

A homology model of type 10 17β-hydroxysteroid dehydrogenase was constructed from 7α-hydroxysteroid dehydrogenase. The validity of the model was investigated by its ability to distinguish between active and inactive using docking calculations. Substrates tested ranged from steroids and bile acids to L- and D-hydroxyacyl CoA. Ligands with 17β or 3α hydroxy groups and L-hydroxyacyl CoA could achieve interactions favourable for catalysis at the active site. A crystallographically determined structure published after the submission of our paper verified large portions of our model.

The role of a conserved asparagine in the short-chain dehydrogenase/reductase (SDR) fold is investigated through structural comparisons of 21 members with experimentally verified structures. An extensive hydrogen-bonding network including parts of the active site is revealed in 16 out of 21 SDR forms.

Molecular dynamics simulations were employed to study the effect of deleterious mutants and replacements, known to diminish fibrillation, to the stability of the helical region of the amyloidogenic peptides amyloid β-peptide (Aβ) and surfactant protein-C (SP-C). The effects in SP-C are quantitatively distinguishable, while the noise ratio in the Aβ simulations makes valid predictions somewhat difficult.

Sequence comparisons of the C-peptide of proinsulin displays a sequence variability that is 1 to 2 orders of magnitude greater than that of insulin, but in the same order of magnitude as the well established peptide hormones relaxin and parathormone, which in conjunction with functional reports may indicate a hormonal function for the C-peptide. Erik Nordling: Biocomputational studies on protein structures; Stockholm 2002; ISBN 91-7349-295-7

Page 6: Biocomputational studies on protein structures

Biocomputational studies on protein structures

6

Contents

Abbreviations............................................................................................................................. 8 List of original articles............................................................................................................. 10 Introduction.............................................................................................................................. 12 Background .............................................................................................................................. 13

Medium-chain Dehydrogenases/Reductases (papers I-III) .............................................. 13 Short-chain Dehydrogenases/Reductases (papers IV-V) ................................................. 15 Biologically active peptides (papers VI-VII) ................................................................... 16

Techniques ............................................................................................................................... 17 Sequence comparisons......................................................................................................... 17

Pairwise alignments.......................................................................................................... 17 Multiple alignments.......................................................................................................... 18 Scoring matrices............................................................................................................... 18

Databases.............................................................................................................................. 20 Database searching............................................................................................................... 20

FASTA ............................................................................................................................. 20 BLAST ............................................................................................................................. 21

Phylogenetic analysis........................................................................................................... 21 Maximum parsimony ....................................................................................................... 21 Neighbour joining ............................................................................................................ 22 Bootstrap analysis ............................................................................................................ 22

Secondary structure prediction............................................................................................. 22 Molecular modelling............................................................................................................ 23

Force field methods.......................................................................................................... 24 Local minimisation techniques............................................................................................. 25

Steepest descent................................................................................................................ 25 Conjugate gradient ........................................................................................................... 25

Global optimisation techniques............................................................................................ 25 Monte Carlo (MC)............................................................................................................ 26 Molecular dynamics (MD) ............................................................................................... 27 Simulated annealing......................................................................................................... 27

Docking calculations............................................................................................................ 27 Rigid body dockings......................................................................................................... 28 Flexible ligand – rigid receptor ........................................................................................ 28 Flexible ligand – flexible receptor ................................................................................... 29 Comments on docking methods....................................................................................... 30

Homology modelling............................................................................................................ 30 Aim of the study....................................................................................................................... 32 Project overview....................................................................................................................... 33

Genome comparison and modelling of active sites in the MDR family (paper I) ............... 33 Differential multiplicity of MDR alcohol dehydrogenases (paper II).................................. 36 Molecular modelling and docking of bile acids to γγ and ββ ADH class I (paper III) ........ 38 Molecular modelling and substrate docking of human type 10 17β-hydroxysteroid dehydrogenase (paper IV) .................................................................................................... 40 Structural Role of Conserved Asn179 in the Short-Chain Dehydrogenase/Reductase Scaffold (paper V)................................................................................................................ 42 Molecular dynamic studies of sequence influence upon fibrillation (paper VI).................. 43 A structural basis for proinsulin C-peptide activity (paper VII) .......................................... 45

Page 7: Biocomputational studies on protein structures

Contents

7

Conclusions.............................................................................................................................. 46 Perspectives.............................................................................................................................. 47 Acknowledgements.................................................................................................................. 49 References................................................................................................................................ 51

Page 8: Biocomputational studies on protein structures

Biocomputational studies on protein structures

8

Abbreviations

One and three letter codes for the 20 genetically encoded amino acids. Alanine Ala A

Arginine Arg R

Asparagine Asn N

Aspartic acid Asp D

Cysteine Cys C

Glutamic acid Glu E

Glutamine Gln Q

Glycine Gly G

Histidine His H

Isoleucine Ile I

Leucine Leu L

Lysine Lys K

Methionine Met M

Phenylalanine Phe F

Proline Pro P

Serine Ser S

Threonine Thr T

Tryptophan Trp W

Tyrosine Tyr Y

Valine Val V

Page 9: Biocomputational studies on protein structures

Abbreviations

9

17β-HSD-10 type 10 17β-hydroxysteroid dehydrogenase

Aβ Amyloid β-peptide

ACR Acyl-CoA reductase

AD Alzheimer’s disease

ADH Alcohol dehydrogenase

BPMC Biased probability Monte Carlo

CAD Cinnamyl alcohol dehydrogenase

FAS Fatty acyl synthase

HSD Hydroxysteroid dehydrogenase

LTD Leukotriene B4 dehydrogenase

MC Monte Carlo

MD Molecular dynamics

MDR Medium-chain dehydrogenases/reductases

MRF Mitochondrial respiratory function protein

NAD+/ NADH Nicotinamide adenine dinucleotide (oxidised/reduced)

NADP+/NADPH Nicotinamide adenine dinucleotide phosphate (oxidised/reduced)

PAP Pulmonary alveolar proteinosis

PDB Protein databank (www.rcsb.org)

PDH Polyol dehydrogenase

QOR Quinone oxidoreductase

RMSD Root mean square deviation

SDR Short-chain dehydrogenases/reductases

SNP Single nucleotide polymorphism

SP-C Surfactant protein-C

UDCA Ursodeoxycholic acid

YADH Yeast alcohol dehydrogenase

Å Ångström (10-10m)

Page 10: Biocomputational studies on protein structures

Biocomputational studies on protein structures

10

List of original articles This thesis is based on the following articles, which will be referred to by their Roman

numerals.

I. Nordling, E., Jörnvall, H. and Persson, B. Medium-chain

dehydrogenases/reductases (MDR): Family characterizations including genome

comparisons and active site modelling. Eur. J. Biochem. 2002 269, in press

II. Nordling, E., Persson, B. and Jörnvall, H. Differential multiplicity of MDR

alcohol dehydrogenases: enzyme genes in the human genome versus those in

organisms initially studied Cell. Mol. Life Sci. 2002, 59:1070-1075

III. Marschall, H.U., Oppermann, U.C., Svensson, S., Nordling, E., Persson, B.,

Höög, J-.O. and Jörnvall, H. Human liver class I alcohol dehydrogenase ���

isozyme: the sole cytosolic 3β-hydroxysteroid dehydrogenase of iso bile acids.

Hepatology 2000, 31:990-996

IV. Nordling, E., Oppermann, U.C., Jörnvall, H. and Persson, B. Human type 10 17

β-hydroxysteroid dehydrogenase: molecular modelling and substrate docking. J.

Mol. Graph. Model. 2001, 19:514-520, 591-593

V. Filling, C., Nordling, E., Benach, J., Berndt, K.D., Ladenstein, R., Jörnvall, H.

and Oppermann, U. Structural role of conserved Asn179 in the short-chain

dehydrogenase/reductase scaffold. Biochem. Biophys. Res. Commun. 2001,

289:712-717

VI. Nordling, E., Kallberg, Y., Johansson, J. and Persson, B. Molecular dynamics

studies of sequence influence upon fibrillation. Manuscript

VII. Jörnvall, H., Nordling, E., Persson, B., Shafqat, J. Ekberg, K, Wahren, J and

Johansson, J. A structural basis for proinsulin C-peptide activity. Manuscript

The published articles are reprinted with permission from the copyright holders.

Page 11: Biocomputational studies on protein structures

List of original articles

11

Related papers not included in the thesis

• Dreij, K., Sundberg, K., Johansson, A.S., Nordling, E., Seidel, A., Persson, B.,

Mannervik, B. and Jernström, B. Catalytic activities of human alpha class glutathione

transferases toward carcinogenic dibenzo[a,l]pyrene diol epoxides. Chem. Res.

Toxicol. 2002, 15:825-831

• Filling, C., Berndt, K.D., Benach, J., Knapp, S., Prozorovski, T., Nordling, E.,

Ladenstein, R., Jörnvall, H. and Oppermann, U. Critical residues for structure and

catalysis in short-chain dehydrogenases/reductases. J. Biol. Chem. 2002, 277:25677-

25684

• Strömberg, P., Svensson, S., Hedberg, J.J., Nordling, E. and Höög J.-O. Identification

and characterisation of two allelic forms of human alcohol dehydrogenase 2. Cell.

Mol. Life. Sci. 2002, 59:552-559

• Jonsson, A.P., Aissouni, Y., Palmberg, C., Percipalle, P., Nordling, E., Daneholt, B.,

Jörnvall, H. and Bergman, T. Recovery of gel-separated proteins for in-solution

digestion and mass spectrometry. Anal. Chem. 2001, 73:5370-5377

Page 12: Biocomputational studies on protein structures

Biocomputational studies on protein structures

12

Introduction

The structure of a protein determines its function. It is essential for substrate specificity,

stability and interactions with other proteins. The quest for protein structure and interactions

is recognised as one of the big challenges after the completion of the human genome, and

several large-scale structural genomics project are launched around the globe (Burley 2000).

The realisation that all information required for protein folding is inherited in the protein

sequence (Epstein et al. 1963) gave rise to the greatest challenge for computational biology,

the folding problem. It has not been possible to find a universal solution as with the genetic

code. One complication is that many proteins require assistance of chaperone proteins to fold

into their native structure (Ellis et al. 1989). Nevertheless, several methods exist, which

exhibit moderate predictive accuracy (Bonneu et al. 2001, Skolnick et al. 2001, Hardin et al.

2002). The most accurate structure prediction methods to date are homology modelling, or

comparative modelling methods that use the information from homologous structures to

predict the tertiary structure. The quality of the model is dependent upon the number of

residue identities in common between the two proteins, but is often below 1 Å in RMSD if the

sequence identity is relatively high.

The activity of an enzyme is determined by the shape and amino acid residue distribution of

the active site. This makes it possible to search for potential substrates by computational

methods termed docking calculations, by finding the optimal placement of the substrate with

respect to electrostatic/hydrophobic interactions and shape complementarities.

The completion of the first genome, that of the SV40 virus, with a total of 5,224 base pairs

(Fiers et al. 1978), marked the arrival of a new perspective in biology. For the first time, all

the genetic information of an organism were then known. Now with the completion of the

human genome (Venter et al. 2001, Lander et al. 2001), the needs of analysis are enormous.

One of the more appealing possibilities of the genomic sequences is the possibility to compare

species and to try to explain their differences in terms of their genes (Bansal 1999). Whole

genome comparisons have also been used for phylogenetic analysis of organisms (Koonin et

al. 1997, Fitz-Gibbon and House 1999, Tekaia et al. 1999), which largely confirms the

previous picture of the divisions of life.

Page 13: Biocomputational studies on protein structures

Introduction

13

The large amounts of data generated through the advances in biology and biochemistry have

given rise to the discipline of bioinformatics, which deals with the biological information, and

tries to use it as a base to predict different biological properties, such as sub-cellular

localisation or secondary structure of a protein.

In my thesis, I have applied bioinformatic techniques on the enzyme families of medium-

chain dehydrogenases/reductases (MDR), short-chain dehydrogenases/reductases (SDR) and

biologically active peptides

Background

Medium-chain Dehydrogenases/Reductases (papers I-III)

Medium-chain dehydrogenases/reductases (MDRs) constitute a large enzyme superfamily

with (including species variants) close to 1000 members, which have been compared at

separate stages of establishment (Persson et al. 1994, Jörnvall et al. 1999). MDR proteins are

either dimeric or tetrameric proteins with about 350 residues. The tertiary structure reveals

that they consist of one catalytic and one coenzyme-binding domain (Fig. 1).

Fig. 1 Model of dimeric γγ ADH (paper III) in ribbon representation, NAD+ is shown as ball and stick models, zinc ions are displayed in space-filling representation. isoUDCA is docked into the active site and displayed in a ball and sticks model with solvent accessible surface coloured by the electrostatic potential. Picture created in Weblab Viewer Lite (Accelrys Inc.)

Page 14: Biocomputational studies on protein structures

Biocomputational studies on protein structures

14

Several, but not all of the members have one zinc ion bound with catalytic function at the

active site. Some, in particular classical, dimeric alcohol dehydrogenases (ADHs), also have a

second zinc ion at a structural site (Brändén et al. 1975). The coenzyme binding domain has a

Rossmann-fold that binds NAD(P)H (Rossmann et al. 1975). Several gene duplications have

led to the present state of different families, summarised in Fig. 2.

The MDR enzymes represent many different enzyme activities of which zinc-dependent

ADHs are the ones early investigated most thoroughly (Bonnichsen and Wassen 1948,

Brändén et al. 1975). They participate in the

oxidation of alcohols, detoxification of

aldehydes/alcohols and the metabolism of bile acids

(Jörnvall et al. 2000, Marschall et al. 2000). There

are also tetrameric alcohol dehydrogenases,

originally detected in yeast, as yeast alcohol

dehydrogenases (YADH) (Negelein and Wulff

1937), but present also in other species (paper II).

The polyol dehydrogenases (PDHs) have activities

originally detected for sorbitol dehydrogenase

(Jörnvall et al. 1981). All their substrates are

widespread in nature because of their derivation

from glucose, fructose, and general metabolism.

Cinnamyl alcohol dehydrogenases, CAD, are

predominantly found in plants. This enzyme type

catalyses the last step in the biosynthesis of the monomeric precursors of lignin, the main

constituent of plant cell walls (Boudet et al. 1995). It has been extensively characterised from

several plant sources (Sarni et al. 1984, Halpin et al. 1994, Galliano et al. 1993), because of

its importance for the pulp industry (cf. Persson et al. 1993). Downregulation or inhibition of

CAD will reduce wood lignin content and yield a pulp of higher quality (Campbell and

Sederoff 1995). The first characterised member of the quinone oxidoreductase (QOR) family

was a lens protein (ζ-crystallin) (Gonzalez et al. 1995), in which a mutational loss may result

in cataract formation already at birth. This suggests that ζ-crystallin has a role in the

protection of the lens against oxidative damage (Rao et al. 1997). The mitochondrial

respiratory function proteins (MRF) are necessary for respiratory function, although the

mechanism is not clear (Yamazoe et al. 1994). The acyl-CoA reductases (ACR) constitute one

YADH

CADADH

PDH

MRF

QOR

LTD

ACR

0.1

Fig. 2. Evolutionary tree showing the eight MDR families. ADH –alcohol dehydrogenases, CAD – cinnamyl alcohol dehydrogenases, YADH – yeast alcohol dehydrogenases, ACR – acyl CoA reductases, LTD – leukotriene B4 dehydrogenases, QOR – quinone oxidoreductases, MRF – mitochondrial respiratory function proteins, PDH – polyol dehydrogenases. Tree created with ClustalW (Thompson et al. 1994) and visualised in Treeview (Page 1996)

Page 15: Biocomputational studies on protein structures

Background

15

domain of fatty acyl synthase (FAS) (Amy et al. 1989) and erythronolide synthase (Donadio

et al. 1991) multienzymes in animals, which are responsible for the synthesis of long-chain

fatty acids. This is an example of fusion of genes encoding for a multisubunit complex, since

separate genes provide the same function in most bacteria (Magnuson et al. 1993) and plants

(Ohlrogge and Browse 1995). The leukotriene B4 dehydrogenases (LTDs) are involved in the

metabolism of immunoreactive molecules (Yokomizu et al. 1996).

In common, therefore, as demonstrated by the examples above, all MDR families appear to

have some members with protective functions in different organismal defences (Jörnvall et al.

1999), similar to the P450 family. However, recent findings suggest more specific functions

for the family members in higher organisms (paper II).

Short-chain Dehydrogenases/Reductases (papers IV-V)

The first reported enzyme of the short-chain dehydrogenases/reductase (SDR) family was

Drosophila ADH (Sofer and Ursprung 1968, Schwartz and Jörnvall 1976), sharing function

with horse liver alcohol dehydrogenase, but not evolutionary origin. As additional proteins

were found to be similar in sequence the SDR family was established (Jörnvall et al. 1981,

Jany et al. 1984, Jörnvall et al. 1984, Persson et al. 1991). The SDR proteins are one-domain

NAD(P)(H)-dependent enzymes of typically 250 amino acid residues (Jörnvall et al. 1995),

often multimeric (Fig. 3).

Fig. 3. Tetrameric structure of 20β-hydroxysteriod dehydrogenase (pdb code 2HSD; Ghosh et al. 1994). Each subunit is shaded differently, NAD+ is displayed in space-filling representation. Figure created with Weblab Viewer Lite (Accelrys, Inc.)

Page 16: Biocomputational studies on protein structures

Biocomputational studies on protein structures

16

They display a wide substrate spectrum, ranging from steroids, alcohols, sugars, and aromatic

compounds to xenobiotics. SDRs are subdivided into classical and extended SDRs (Persson et

al. 1995), differing in lengths and cofactor-binding motifs. SDR constitutes a large family

(Jörnvall et al. 1999) with about 3000 forms known, including species variants in all living

kingdoms (Kallberg et al. 2002). The family is highly divergent, with typically 15–30%

residue identity in pairwise comparisons. For 21 members, the three-dimensional structures

have been determined. In spite of the low residue identities between the different members,

the folding pattern is conserved with superimposable peptide backbones (Krook et al. 1993;

Ghosh et al. 2001). The criterion for SDR membership is the occurrence of typical sequence

motifs, arranged in a specific manner. These motifs contain crucial residues in the nucleotide

binding and for the active site including its highly conserved triad of Ser, Tyr, and Lys

residues (Jörnvall et al. 1995, Persson et al. 1995, Oppermann et al. 1997).

Biologically active peptides (papers VI-VII)

Peptides with a biological activity are widespread in nature. Their functions range from

metabolic regulation of hormones like glucagon (Thomsen et al. 1972), innate host-defense of

anti-bacterial peptides such as LL-37 (Agerberth et al. 1995) and NK-lysin (Andersson et al.

1995) through neurological functions such as for neuropeptide Y (Minth et al. 1986) and its

likes, to the more deleterious effects of the disintegrins of snake toxins (Huang et al. 1986).

Many of the peptides are synthesised as larger inactive precursors, which are cleaved to

produce the active components. The short peptides may also require a structural assistance to

achieve the folded conformation (Beers and Lomax 1995). Often the precursors contain more

than one active peptide, which then are released together in equimolar amounts.

The peptides included in this thesis are the C-peptide, surfactant protein C (SP-C) and the

amyloid β-peptide (Aβ). The C-peptide is produced by the posttranslational cleavage of

prosinsulin to form mature insulin and C-peptide (Steiner et al. 1967). The understanding of

its function has increased from originally considered only a structural assistant for the correct

folding of insulin, to now known to have hormonal activities of its own. (Wahren et al. 2000).

In diabetes type I patients it has beneficial effects on glycaemic control and protects against

diabetic complications, while synthetic insulin, which is identical to endogenous insulin but

lacks C-peptide, does not suffice (Wahren and Johansson 1998). The surfactant protein C

(Johansson et al. 1988) is a natural part of the surfactant mixture in the lungs and lowers the

surface tension in the alveolus (Gustafsson and Johansson 1998). It is produced as a larger

Page 17: Biocomputational studies on protein structures

Background

17

precursor to fold it into an α-helix structure (Beers and Lomax 1995). The amyloid β peptide

is a product from the APP protein (Kang et al. 1987), which now has recently been suggested

to function as a kinesin 1 receptor, mediating the axonal transport of β-secretase and

presenelin-1 (Kamal et al. 2001)

Techniques

In this thesis, I have used several bioinformatic techniques. They are described below in

further detail than the format of the original articles allowed.

Sequence comparisons

The procedure of comparing two or more sequences to bring as many identical or similar

residues as possible into vertical register produces an alignment of the two sequences. Two

main types of sequence alignment are recognised, global and local. The global alignment

optimises the alignment over the full-length of the sequences, while in the local alignment,

stretches of sequence with the highest density of matches are given the highest priority.

Pairwise alignments

A direct manner of comparing sequences is the dot plot analysis (Gibbs and McIntyre 1970).

The two sequences to be compared form the rows and columns of a matrix, and in the

simplest case a dot is plotted in a graphical representation of the matrix wherever the

sequences are identical. In practice, the comparison is most often done in a window of

specified length, and a dot is put in the graph whenever the score within this window is above

a certain specified threshold. The score might be calculated as the number of identities or as a

similarity based on a scoring matrix. Sequence similarities will be obvious as diagonals in the

plot. The method does not align sequences but it constitutes a quick manner of spotting

similarities. It has the additional advantage over other alignment methods that it can detect

repetitions in the sequences as parallel diagonals in the plot.

Global alignments aim to optimally align two or more sequences. The methods most used for

global alignments are based on algorithms originally developed by Needleman and Wunsch

(1970) and later modified by Sellers (1974). This procedure (the NWS method) is using a

dynamic programming algorithm that simplifies the enormous task of calculating a score for

all possible alignments of two sequences with gaps of any lengths. The sequences to be

aligned are arranged as rows and columns of a rectangular matrix. A score is calculated for

Page 18: Biocomputational studies on protein structures

Biocomputational studies on protein structures

18

each position of the matrix according to three possible events: replacement (or conservation)

of a residue, insertion in sequence A or insertion in sequence B. The alignment is traced back

from the highest scoring matrix element to the beginning of the sequence.

The dynamic programming algorithm can also be used for finding local sequence similarities.

The Smith and Waterman (1981) algorithm is very similar to the NWS method except that a

calculated negative number for a matrix position is replaced by zero, indicating that no

sequence similarity has been detected up to that point. When all matrix elements have been

calculated, the maximum number in the matrix is located, and the alignment is traced back

from this point until the first positive number.

Multiple alignments

A multiple sequence alignment of a protein family gives evolutionary information about the

family, since the alignment picks up the evolutionary pressure on each amino acid, and, thus

displays which positions that are crucial for maintaining the fold and function of the proteins

of the investigated family.

The simplest method to obtain a multiple sequence alignment is to use one sequence as the

base for the alignment and simply align all other sequences pairwise to this sequence

(Altschul et al. 1997). The most common procedure for multiple sequence alignment uses

hierarchical methods. In these methods, alignments of all pairs of sequences are made first

using the dynamic programming algorithm. The sequences are then grouped according to their

similarities into a tree (hierarchical cluster analysis). Finally, starting with the most similar

pairs, all the sequences are aligned stepwise to each other using the dynamic programming

method. This is the procedure used in the most popular program, CLUSTALW (Thompson et

al. 1994),used in all papers except in papers III and VI.

Scoring matrices

All alignment methods need some sort of scoring for matches and mismatches. For a certain

alignment, a number is assigned to each position in the sequence depending on the match at

that position. The scores for all positions in the alignment are then added to calculate a total

score, which is used to select the optimal alignment among alternative alignments. The

simplest way of scoring is to assign a score of 1 for a match and 0 for a mismatch. Such a

matrix is often referred to as a unitary matrix (Feng et al. 1984). The commonly used matrices

Page 19: Biocomputational studies on protein structures

Sequence comparisons

19

in protein sequence analysis are generally presented as log-odds matrices. Each score in the

matrix is the logarithm of an odds ratio. The odds ratio used is the ratio of the number of

times residue "A" is observed to replace residue "B" divided by the number of times residue

"A" would be expected to replace residue "B" if the replacement occurred at random.

The PAM series are based on estimated mutation rates (Point Accepted Mutations) from

closely related proteins (Dayhoff et al. 1978). PAM 1 stands for 1 % accepted mutations.

PAM matrices for less similar sequences are obtained by extrapolations. The PAM100 matrix

corresponds to 100 accepted mutations per 100 residues, but since the same residue might

change more than once, two sequences with this level of mutations will have about 50 %

identity. A major drawback is that the series was developed upon a limited set of sequences.

The Gonnet matrix is based upon exhaustive matching of the entire protein sequence database

(Gonnet et al. 1992), an almost identical approach to the original PAM matrices but with a

more divergent set of sequences. The BLOSUM series is calculated from blocks of aligned

sequences from homologous proteins with a certain level of identity. In this manner, the

pattern of changes observed at a certain level of identity is used, instead of extrapolations

from patterns for closely related proteins (Henikoff and Henikoff 1992). The first PAM

matrix is almost as efficient as the newer BLOSUM and GONNET matrices in exhaustive

comparisons (Henikoff and Henikoff 1993, Vogt et al. 1995, Abagyan and Batalov 1997),

which is a considerable achievement considering the limited set of proteins available at the

time of its creation.

The scores from all pairs of aligned residues are combined with suitable penalties for

introducing gaps to calculate a total score, which is used to select the optimal alignment. The

gap penalties are normally determined by two parameters, one for opening a gap and one for

elongating it, proportional to the length of the gap. Most programs allow the user to choose

these parameters, which might have different optima for different protein types. The choice of

gap opening and extension penalties will depend much on the purpose of the alignment. If, for

instance the alignment is to be used as input for the construction of a homology model, it may

be beneficial with a high gap opening penalty and a low gap extension penalty, since the

quality of the model is sensitive to numerous insertions/deletion events (Abagyan and Batalov

1997)

Page 20: Biocomputational studies on protein structures

Biocomputational studies on protein structures

20

Databases

The huge amount of biological information that is collected through the scientific community

worldwide is stored in large databases. The information in the databases may be as diverse as

nucleotide sequences (Stoesser et al. 2001, Benson et al. 2002), metabolic pathways

(Kanehisa 1997) or expression profiles from micro array experiments (DeRisi et al. 1997).

The databases most relevant for the work in this thesis contain protein sequences and

structures. Swissprot (Bairoch and Apweiler 2000) and PIR (Wu et al. 2002) are manually

curated databases that contain annotated protein sequences of high quality. Since there are

several protein databases, the need to combine them for a complete view of the protein

universe has arisen. In that process, a high degree of redundancy is generated, since the

databases contain overlapping information. The solution to that problem is to remove identical

sequences of the same length or shorter versions of other entries in the database (Kallberg and

Persson 1999). The major protein structure database is the Protein databank (PDB) (Berman

et al. 2000), where experimentally determined structures and theoretical models are stored.

Also worth mentioning are the CATH (Orengo et al. 1997) and the SCOP (Murzin et al.

1995) databases, which contain structural classification of proteins.

Database searching

The question posed when performing a database search is whether there are homologous

sequences to the query sequence in the database. The problem is resolved in a similar fashion

as in the sequence alignment case. As dynamic programming can be quite time consuming,

several heuristical methods have been developed. These methods are not guaranteed to find

the best possible solution. However they are significantly faster and can therefore be used to

search large databases.

FASTA

One of the more popular search methods is FASTA (Pearson and Lipman 1988). The FASTA

algorithm is a fast heuristic approximation to the Smith-Waterman algorithm. The FASTA

algorithm divides the query sequence into overlapping words, usually of length two for

proteins or six for nucleic acids. Then as each sequence is read from the database it is also

divided into its overlapping words. These two lists of words are compared to find the identical

words in both sequences. High scoring sequences are subjected to a Smith-Waterman

alignment within the region of the dot-plot defined by the concentrated identities and the

window size.

Page 21: Biocomputational studies on protein structures

Databases

21

BLAST

The BLAST algorithm (Altschul et al. 1990) uses a word-based heuristic similar to that of

FASTA to approximate a simplification of the Smith-Waterman algorithm known as the

maximal segment pairs algorithm. A major drawback of the method is that it does not allow

gaps. The list of words is expanded with the addition of similar words in order to recover the

sensitivity lost by matching only identical words. BLAST then examines the database

sequences for words that exactly match any of the words on the expanded list.

A new gapped BLAST (Altschul et al. 1997) gets around the problem of not allowing gaps

and still avoids the high computational cost of a full Smith-Waterman alignment on the pair of

sequences, by building the alignment out from a central high scoring pair of aligned amino

acids. Another improvement of the BLAST algorithm is PSI-BLAST (Altschul et al. 1997),

which starts with a gapped BLAST search and creates a profile out of the found sequences.

This profile is used as the basis for a new search, and the procedure is iterated until the search

has converged and no new proteins are found, or the number of iterations reaches a number

specified by the user. This method has been found to be very sensitive in picking out remote

homologies (Jones and Swindell 2002).

Phylogenetic analysis

The objective of phylogenetic analysis is to trace evolution. This can be done for

morphological features of organisms, or as favoured in later years, by tracing mutational

events in the genetic material of organisms. The work included in this thesis is based solely on

protein sequences and reflects primarily the evolution of new functions in protein families and

not the birth of new species, although speciation events also are included in the trees. There

are a few different methods available to construct a phylogenetic tree. The two most used

methods are outlined below, both start with the construction of a multiple alignment of the

sequences in question and then follow the procedures described below in each section.

Maximum parsimony

The phylogenetic trees are constructed so that the sequences are grouped together by the

criterion of minimum replacements from the last common ancestor (Fitch 1971, Hartigan

1973). The method yields a number of low scoring trees. The search of trees can sample all

possible trees in search of the correct one, but it is not feasible for more than about 20

sequences/taxa (Swofford 1998). Instead usually a heuristic approach is used where the trees

Page 22: Biocomputational studies on protein structures

Biocomputational studies on protein structures

22

are built by stepwise addition of sequences in the growing tree. This method has problems to

accommodate the fact that multiple substitutions may occur at the same site (e.g. Ala � Gly

followed by Gly � Ala at the same position later in time) (Nei 1996). Popular programs

include PAUP (Swofford 1998) and PHYLIP (Felsenstein 1993). The method is used in paper

I to construct the tree of the MDR superfamily.

Neighbour joining

The trees are constructed using pairwise distances of all sequences in the set. The final tree is

put together so that the distance between two sequences in the tree (branch lengths)

corresponds to the similarity between the two sequences. A drawback of the method is that

the same distance may be obtained by comparing two different sequences (Saitou and Nei

1987). ClustalW (Thompson et al. 1994) and the already mentioned PAUP and PHYLIP are

some of the most used programs based upon this methodology. The method is used in paper I

and II for the tree construction.

Bootstrap analysis

Phylogenetic trees are usually tested for the reliability by a bootstrap test (Felsenstein 1985).

The test is made by constructing a tree based on a subset of the columns in the multiple

alignment that the tree is based on. This procedure is repeated multiple times and gives an

estimate on the reliability by giving the number of times the analysis results in the same tree.

A bootstrap value over 95% for a branch point is usually considered to be highly confident

(Efron et al. 1996).

Secondary structure prediction

Predicting the secondary structure of a protein is an important step towards the elucidation of

its three-dimensional structure, as well as its function, in cases where the three-dimensional

structure is not available, as in the situation for most membrane proteins and for very large

proteins. The secondary structure prediction methods commonly try to predict if a residue

belongs to one of three states, α-helix, β-strand or coil (irregular parts and non-ordered

structural segments). Some methods specialised to predict just one type of structure have

recently been developed, and most notably a β-turn predictor which has the highest accuracy

yet for a secondary structure prediction method (Shepherd et al. 1999). The first generation of

secondary structure prediction methods was based upon the statistically inferred single amino

acid propensity for a certain secondary structure element. The methods were not very

Page 23: Biocomputational studies on protein structures

Phylogenetic analysis

23

successful since the secondary structure is dependent upon both local and long-range

interactions (cf. Rost and Sander 2000). The second generation of prediction methods was

able to incorporate the local environment by calculating the propensities for segments of

variable sizes (Chou and Fasman 1974). The accuracy of the methods achieved around 60 %

but could not reach higher. The breakthrough for accuracy came with the use of multiple

alignments as the input to the prediction methods (Zvelebil et al. 1987), which led to the third

generation of prediction methods. The multiple alignment includes evolutionary information

about residue replacements. Thus, yielding indirect information about long-range interactions

and the local environment. The top performing methods today are based upon machine

learning techniques. Neural networks are used in the two most prominent methods. The first

method that broke the 70% barrier was PHD (Rost and Sander 1993), which used a multiple

alignment as an input to a neural network that is trained upon known sequences from the

PDB. One of the largest advances to secondary structure prediction was PSIPRED (Jones

1999), which used position specific profiles from PSI-BLAST searches as the input to a

simple feed forward neural network. This led to an increase in accuracy to roughly 75% and

that is still the frontier in secondary structure prediction. To increase this further, also long-

range interactions need to be considered. A first step in this direction can be seen in a study by

Baldi et al. (1999), where a neural network based method is developed that can accommodate

long-range interactions and scores as well as or better than the existing methods.

Molecular modelling

The structure of a protein determines its physical and chemical properties. Methods that

utilize physical or chemical information to predict structural, and therefore implicitly

functional, properties of a biological molecule are collectively termed molecular modelling

methods (cf. Forster 2002). A broad division can be drawn between methods that predict the

binding of a ligand to a macromolecular receptor – docking methods, and methods that aim at

predicting the structure of protein. A further division can be made among the structure

prediction methods, between fast threading methods that predict the fold of the protein

without generating a three-dimensional model of the protein. The speed of these methods

makes them suitable for whole-genome prediction of folds or identifying possible templates

for a more thorough investigation of the structure by constructing a complete three-

dimensional model of the protein by ab initio prediction or homology modelling.

Page 24: Biocomputational studies on protein structures

Biocomputational studies on protein structures

24

The physical properties of a molecule can be described by Schrödinger’s equation.

Unfortunately it cannot be solved for systems with more than one electron (H, He+, Li2+).

However the concept can be extended by approximations to include larger systems and has

developed to a discipline of its own, quantum chemistry (cf. Oldfield 2002). The

approximation methods can be categorized as either ab initio or semiempirical. Semiempirical

methods use parameters that compensate for neglecting some of the time consuming

mathematical terms in Schrödinger's equation and can be used on systems up to a few

hundred atoms (Stewart 1990), whereas ab initio methods include all such terms and can only

handle a few dozen atoms (Schmidt et al. 1993). The parameters used by semiempirical

methods can be derived from experimental measurements or by performing ab initio

calculations on model systems.

Force field methods

The Schrödinger equation can be bypassed by writing the energy as a parametric function of

the nuclear coordinates. This allows the dynamics of the atoms to be treated by classical

mechanics. For time independent phenomena, the problem reduces to the calculation of the

energy at a given geometry (Jensen 1999). The foundation of this methodology is the

observation that molecules tend to be composed of units, which are structurally similar in

different molecules. The O–H bond in methanol and in ethanol is of the same length and

energy content. Force fields are a generalisation of the picture of molecules being composed

of structural units, “ functional groups” , which behave similarly in different molecules. Force

fields take care of the different properties of atoms by assigning them into different atom

types according to the atom and the type of chemical bonding it is involved in. The MMFF

force field (Halgren 1995a) used in among other programs ICM, extensively used in my

thesis, has 40 types of carbon and 20 types of oxygen depending on hybridisation state, bond

number, bond-partners and charge. The parameters assigned to each atom-type are based upon

empirical measurements or higher-order calculations (Chandrasekhar and van Gunsteren

2002)

The energy is written as a sum of terms, each a function describing the energy associated with

a particular aspect of the molecule, including the energy of bond-stretching, rotation around a

bond, bending of bond-angles and non-bonded atom-atom interactions, such as van der Waals

or electrostatic interactions. This approach has the added advantage that it is quite simple to

change the nature of the function for each term, and equally easy to add or remove terms from

Page 25: Biocomputational studies on protein structures

Molecular modelling

25

the total energy function (Abagyan 1993). Such an energy function of the nuclear coordinates

makes it possible to calculate the geometry of the possible conformations of a molecule and

their relative energies. Stable conformations correspond to minima on the potential energy

surface and can be found by minimizing the energy as a function of the nuclear coordinates.

Local minimisation techniques

Local minimisation techniques are employed to find the nearest minimum, which may, or,

more probably, may not correspond to the global minimum.

Steepest descent

The gradient vector points in the direction where the energy function increases most, i.e. the

function value can always be lowered by stepping in the negative direction. Functional

evaluations are made in the negative gradient direction until the energy increases and a

temporary minimum along the negative gradient is located. At this point a new gradient is

calculated and used for the next search (Morse and Feschbach 1953). This method will always

lower the function value and is guaranteed to approach a minimum. However, two main

problems are associated with this technique. Two subsequent searches are made perpendicular

to each other and each round will therefore partially undo the previous minimisation step’s

decrease of the function value. The other problem is that as the minimum is approached, the

rate of convergence is decreased and the method will actually never reach the minimum. It

will crawl towards it at an ever-decreasing speed. Even though its obvious drawbacks, the

method is simple and robust and guaranteed to lower the function value, which may explain

its widespread use.

Conjugate gradient

The method is very similar to the previous method, but tries to avoid the partial undoing of

the previous step, by doing each step except the first, in a direction which is a mixture of the

current negative gradient and the previous search direction. The next step is taken in a

direction, which is conjugate of the previous direction, hence its name (Polak 1971).

Global optimisation techniques

The previous methods can only locate the minimum nearest to the starting conformation, as

they mimic a marbles path down a funnel and find the lowest spot in the energy landscape. If

the desired minimum is the global minimum, the deepest valley in the energy landscape, other

Page 26: Biocomputational studies on protein structures

Biocomputational studies on protein structures

26

methods have to be employed. The optimal way to find the global minimum would be to

sample all possible conformations and calculate the energy. This systematic approach is not

practically feasible with more than 10–12 free variables, since the number of possible

conformations increases exponentially with the number of free variables, and likewise does

the computational time of the analysis. Large biomolecular systems, such as proteins, lipid

membranes or nucleic acids, cannot be studied by a systematic approach. In order to be able

to investigate them, methods have been developed that sample a large number of possible

conformations in order to locate minima. Although they are not guaranteed to find the global

minimum, they often find a local minimum, which is close to the global minimum in terms of

energy.

Monte Carlo (MC)

The search starts from a given conformation. In each MC run the position of one or more

atoms is randomly changed. The new conformation is accepted as a starting point for the next

perturbation if the energy is lower than the energy of the previous conformation. Steps that

generate conformations with higher energy than the previous one are evaluated by calculation

of the Boltzmann factor (Eq. 1, T is temperature in Kelvin, kB is Boltzmann’s constant and ∆E

is the change in energy) and comparing it to a random number between 0 and 1.

Boltzmann factor = e-∆E/kB

T Eq. 1

If the Boltzmann factor is less than the random number the conformation is used as the

starting conformation for the next step in the Monte Carlo procedure. This allows the

calculation to step out of local minima and continue the search for the global minimum

(Hammersly 1960). A further enhancement of the methodology is the inclusion of information

of preferred side-chain angles for amino acids in proteins. Each amino acid has a set of

discrete side-chain angles, rotamers, that are energetically more favourable (Brändén and

Tooze 1999). By including them into the search algorithm one gets the biased probability

Monte Carlo (BPMC) search algorithm, that samples the conformational space more

efficiently (Abagyan and Totrov 1994).

Page 27: Biocomputational studies on protein structures

Molecular modelling

27

Molecular dynamics (MD)

MD methods solve Newton’s equation of motion (Eq. 2), where the force F is determined by

the mass, m, and the acceleration, a, for atoms on an energy surface (Alder and Wainwright

1957).

F = m * a Eq. 2

The available energy is distributed between potential and kinetic energy, and molecules are

thus able to overcome barriers separating minima if the barrier height is less than the total

energy minus the potential energy. Newton’s equation requires small time steps (femtosecond

level) to be integrated, which makes the simulation time short, nowadays usually in the

nanoseconds range, even though the microsecond barrier has been reached for a 36-residue

peptide with a Cray T3E supercomputer (Duan and Kollman 1998). This, in parallel with the

use of physiologically relevant temperatures, means that in practise only the local area around

the starting point is sampled and that relatively small barriers can be overcome. Molecular

dynamics simulations have evolved from the first simulation of a protein in vacuum

(McCammon et al. 1977) to include realistic description of solvent effects (Berendsen et al.

1981). This allows MD simulations to give insights into the natural dynamics on different

timescales of biomolecules in solution (Hansson et al. 2002). Popular implementations are the

CHARMM, AMBER and GROMACS (Lindahl et al. 2001) packages. The GROMACS

package was used in paper VI.

Simulated annealing

In simulated annealing the starting temperature is chosen to be high, 2000 – 3000 K, and then

gradually decreased during an MC or MD run. This allows an initially extensive sampling of

the conformational space of the molecule, and as the temperature decreases the molecule is

trapped in a minimum. If the decrease is infinitely slow the molecule would be trapped in the

global minimum, but that would require infinite computational time as well (Kirkpatrick et al.

1983).

Docking calculations

The solution to the docking problem in computational biology is the bound conformation of a

substrate to a macromolecular receptor with a correct estimation of the binding energy.

Reliable docking depends on both an adequate scoring function, whose global minimum

Page 28: Biocomputational studies on protein structures

Biocomputational studies on protein structures

28

corresponds to the biologically relevant complex, and an optimisation algorithm that

consistently finds that global minimum.

Rigid body dockings

The earliest docking methods used rigid body docking techniques (Kuntz et al. 1982), where

both the receptor and the ligand are represented as rigid bodies. In this approach the docking

problem is simplified to a search of surface complementarities. With the rigidity of the

molecules the search space is reduced to three variables that determine the rotational and

translational movements of the ligand. Even this three-dimensional search space is hard to

completely cover with a brute force approach and often stochastic search schemes such as

Monte Carlo or simulated annealing techniques are used.

Geometric surface models and data structures are used in order to find reasonable binding

modes, and heuristic cost functions – mostly rating some notion of geometric fitness – to rank

candidate complexes.

The first automated docking program was the DOCK program (Kuntz et al. 1982), which uses

a graph theory derived search algorithm in its later apparitions (Ewing et al. 1997) that

superimposes ligand atoms onto predefined site-points that map the negative image of the

binding site. The hits are scored for the inter-molecular interactions, based on the inter-

molecular terms of a molecular mechanics force field.

The gain of using this method is the swiftness of the process, which allows large databases of

compounds to be screened versus a receptor (Schneider and Böhm 2002). A drawback of the

method is that the ligand and receptor has to be co-crystallized to achieve high efficiency of

the method (Lengauer and Rarey 1996).

Flexible ligand – rigid receptor

As more powerful computers were developed, the difficulty of the docking problem was

increased, by allowing internal rearrangements of the ligand, by introducing rotation around

bonds. These techniques involve a flexible ligand and rigid receptor. This allows for the

discovery of novel substrates for a receptor at the expense of more computationally

demanding calculations (Lengauer and Rarey 1996). The free variables added to the

calculations are the rotations around bonds in the ligand. In the rigid body approach, three

Page 29: Biocomputational studies on protein structures

Docking calculations

29

variables were needed to rotate and position the substrate, each rotatable bond increases the

dimension of the problem by one. Thus, for a ligand with two rotatable bonds, there are five

variables to optimise. As the ligands often are small with relatively few variables, the speed of

the dockings is high and a large number of possible ligands can be tested. The docked

conformations are scored by semi-empirically derived force field based energy functions

(Sippl 1995, Halgren 1995b). In order to speed up the scoring of the docked conformations,

the macromolecular receptor can be represented by grids that contain the affinities for each

atom type and the possibilities for electrostatic interactions (Morris et al. 1996, Totrov and

Abagyan 1996). The gain is that the grids allow fast calculation of the score of the docked

conformation. Usually Monte Carlo algorithms (Morris et al. 1996) or genetic algorithms

(Jones et al. 1997, Morris et al. 1998) are used to efficiently sample the possible

conformations. Some popular programs of this type are AutoDock (Goodsell and Olsen

1990), FLEXX (Rarey et al. 1996) and DOCK version 4 (Ewing et al. 2001).

Flexible ligand – flexible receptor

A natural extension of the above technique is to introduce flexibility in the receptor as well. A

first step is to allow rearrangement of the protein’s side-chains. This allowed the successful

docking of new classes of substrates to receptors (Totrov and Abagyan 1994, Totrov and

Abagyan 1997). The extra free variables introduced to the system makes the approach not

feasible for scanning of large compound databases. To diminish the computational need, side-

chains located at a distance from the active site are often fixed (paper III + IV).

All these methods have trouble dealing with induced fit mechanisms, common in some

enzyme families (Najmanovich et al. 2000). A recent paper estimates that 50% of the

imaginable protein-ligand complexes are not reachable by today’s techniques due to induced

fit mechanisms (Österberg et al. 2002). It can be somewhat circumvented by using receptor

structures already in the active conformation, which are co-crystallized with a substrate.

Computational techniques that address the problem have been developed. One is based upon

selecting hinge regions in the proteins that can, as the name suggests, function as hinges in the

receptor and thus allow opening and closing movements of the receptor (Sandak et al. 1998).

Another technique, which is a development of the popular Autodock software, create grids

based upon both the bound complex and the free receptor and thus accounts for the

conformational change upon binding of substrate (Österberg et al. 2002).

Page 30: Biocomputational studies on protein structures

Biocomputational studies on protein structures

30

Comments on docking methods

The two most attractive techniques are the grid docking methods, for their swiftness, and the

flexible ligand – flexible receptor methods, for their thoroughness. The success of the grid

docking requires an active conformation of the active site, as it cannot accommodate large

rearrangements. Smaller rearrangements can be incorporated by allowing a certain overlap of

ligand and receptor, which later can be relieved by an all-atom refinement of the top-scoring

complexes. The need for larger or multiple side-chain rearrangements can only be

accommodated by the flexible ligand – flexible receptor technique. Thus, this technique is

often suited for larger substrates, since they quite naturally often require more side-chain

movements to fit in the active site. The same is true for dockings of substrates into models,

since the side-chains most likely are not in an optimal placement for ligand binding. This was

demonstrated in the recent docking of carcinogenic dibenzopyrene diol epoxides to

glutathione transfereases (Dreij et al. 2002).

Homology modelling

Homology modelling or comparative modelling methods, first reported by Browne et al.

(1969), make it possible to predict the three-dimensional structure of a protein by using

information derived from a homologous protein of known structure (Sali and Blundell 1993,

Sanchez and Sali 1997, Cardozo et al. 1995) without attempting the laborious task of ab initio

structure prediction.

The basis for this approach is that two proteins that have evolved from the same ancestral

protein retain the overall fold. The fold is more conserved through evolution than the

sequence of a protein. It acts as a scaffold upon which new functions can be attached by

changing the amino acid sequence. This has led to the evolution of superfamilies such as

MDR (Persson et al. 1994) and SDR (Jany et al. 1984, Persson et al. 1991, Jörnvall et al.

1995), studied in this thesis, with a conserved overall fold and many divergent functions.

In order to construct a homology model, the query sequence has to be aligned with one or

more homologous proteins to be used as templates in the modelling process. If the sequence

identity between query and template falls below 30%, the quality of the alignment will

deteriorate (Venclovas et al. 1999), yielding more unreliable models. There are two main

methods for the transfer of three-dimensional coordinates from the template to the query.

Fragment-based homology modelling methods identify structurally conserved regions through

Page 31: Biocomputational studies on protein structures

Docking calculations

31

the alignment, these conserved fragments usually correspond to secondary structure elements

and the coordinates of the templates are copied to them. The variable parts correspond to a

large extent to loop structures, whose conformations are investigated through searches in a

structural database of loop structures (Blundell et al. 1987, Levitt 1992). The restraint-based

homology modelling methods uses the alignment to infer geometrical restraints, such as limits

of distances of Cα-atoms, ranges of backbone and side-chain angles. These restraints are then

combined with an energy function to obtain a scoring function that is used to calculate the

resulting structure of both the structurally conserved regions and the loop regions (Sali and

Blundell 1993, Cardozo et al. 1995). This has the advantage that the structure generation is

not divided into two separate stages and the prediction method is thus more robust (Forster

2002). The hardest problem is the correct prediction of the loop regions, which has all the

problems associated with ab initio structure prediction, but limited in size and with defined

endpoints. The accuracy of loop-predictions has been shown to increase by taking a

deformation zone into account, by extending the loops a few residues in each direction from

the gap in the alignment. The procedure can accommodate the change in the local

environment caused by the deviation from the fold of the template structure (Abagyan et al.

1997).

If the sequences are distantly related, additional information to assist in the alignment

procedure is of great value. A multiple alignment of related proteins, even if they lack

experimentally verified structures, adds information of conserved sequence stretches that may

correspond to secondary structure elements and regions that allow insertions and deletion that

may hint at loop regions. Similarly secondary structure predictions may assist the assignment

of equivalent secondary structure elements between the query and the template, in addition

information about the placement of gaps can be inferred from the predictions. This strategy

has been used in paper IV.

Page 32: Biocomputational studies on protein structures

Biocomputational studies on protein structures

32

Aim of the study

The aim of the study was to apply different bioinformatic and biocomputational methods to

medium-chain dehydrogenases/reductases (MDR), short-chain dehydrogenases/reductases

(SDR) and biologically active peptides, with an emphasis on methods that utilise structural

information, in order to study and characterise proteins within these families. More

specifically the work has been directed toward:

• To characterise the MDR superfamily in complete genomes (papers I and II)

• To explain why isoUDCA is a substrate for γγADH but not for ββADH (paper III)

• To obtain a model of human type 10 17β-hydroxysteroid dehydrogenase and use it as

a base to explain the binding properties of steroid substrates to the protein (paper IV)

• To characterise the role of a conserved residue in SDR fold (paper V)

• To analyse the stability of secondary structure elements in amyloidogenic peptides

(paper VI)

• To compare the sequence variation between species of peptide hormones, including

the C-peptide of proinsulin (paper VII)

Page 33: Biocomputational studies on protein structures

Project overview

33

Project overview

Genome comparison and modelling of active sites in the MDR

family (paper I)

The availability of completed genomes provides a great opportunity to analyse all members of

enzyme families. We have therefore now studied the medium-chain dehydrogenases/

reductases (MDR) enzymes in all eukaryotic genomes available and, for comparison, the

Escherichia coli genome.

The genomes were searched for MDR members using FASTA (Pearson and Lipman 1988)

with known MDR proteins as query sequences. Multiple sequence alignments were calculated

using ClustalW (Thompson et al. 1994). The evolutionary tree, constructed from the aligned

MDR sequences from the six genomes, allows distinction of several subgroups (schematic

representation in Fig. 2). Three families are formed by dimeric alcohol dehydrogenases

(ADH; originally detected in animals/plants), cinnamyl alcohol dehydrogenases (CAD;

originally detected in plants) and tetrameric alcohol dehydrogenases (YADH; originally

detected in yeast). Although similar in name, those three families are clearly different and

separated structurally. Three further families are centred on forms initially detected as

mitochondrial respiratory function proteins (MRF), acetyl-CoA reductases of fatty acid

synthases (ACR), and leukotriene B4 dehydrogenases (LTD). The two remaining families,

with polyol dehydrogenases (PDH; originally detected as sorbitol dehydrogenase) and

quinone reductases (QOR; originally detected as ζ-crystallin), are also distinct but with

variable sequences.

The most common family in the human genome is ADH with nine members (Fig. 4) (further

investigated in paper II), followed by QOR with seven members. A. thaliana has a similar

pattern, but also has a large number of CAD and LTD members. In the unicellular organisms,

PDH is the most common family, pointing at different living conditions between unicellular

and multicellular organisms.

Page 34: Biocomputational studies on protein structures

Biocomputational studies on protein structures

34

A. thaliana

ADH

CAD

PDHQOR

MRF

LTD

Others

D. melanogaster

ADH

PDH

QORMRF

ACR

Others

H. sapiens

ADH

PDH

QOR

MRF

LTD

ACR

Others

Sum:23 Sum:38 Sum:10

C. elegans

ADH

YADH

PDHQOR

MRF

LTD

ACR

Others

Sum:13Sum:13

S. cerevisiae

ADH

CAD

YADHPDH

QOR

MRFLTD

Sum:15

E. coli

ADH

CAD

YADH

PDH

QOR

LTD

Others

Sum:17 Fig. 4. Distribution of MDR members among the different organisms investigated, the total number of members in each genome is also given.

The quinone reductase family members are homologous to the synaptic vesicle protein VAT-

1, indicating a possible neuronal function for the whole family. The polyol dehydrogenases

are numerous in the E. coli genome, contributing to half of the MDR forms, emphasising their

importance in carbohydrate metabolism. Molecular modelling of the active sites was

performed for members of the different subgroups in order to give estimates on the geometry,

hydrophobicity and volume of the substrate-binding pockets. The results are summarised in

Table 1.

Table 1. Active site characteristics for the investigated families and common functions of the members. Volume in parenthesis is for all but one member.

Active site characteristics Family

Volume (Å3) Hydrophobicity index

Function

PDH 77–257 (144–257)

–1.28 � +0.77 Polyol dehydrogenases Threonine 3-dehydrogenase

CAD 160–248 –0.69 � +0.49 Mannitol dehydrogenase Cinnamyl-alcohol dehydrogenase

QOR 124–289 –0.56 � +1.47 Quinone oxidoreductase Synaptic vesicle membrane protein VAT-1 homologue

MRF 168–243 –1.48 � –0.18 Mitochondrial respiratory function protein

LTD 161–291 –1.65 � –0.02 NADP-dependent leukotriene dehydrogenase

ACR 140–189 +0.64 � +1.08 Fatty acid synthase

Page 35: Biocomputational studies on protein structures

Project overview

35

The volumes of the active sites are more or less in the same range, with a broad span of

volumes for each family, except for the narrow range of the ACR family, explained by few

members with a high degree of sequence similarity. MRF and LTD clearly prefers hydrophilic

substrates while ACR prefers hydrophobic substrates, quite natural considering its role in fatty

acid synthesis. The variation of the hydrophobicity in PDH, CAD and QOR indicates a wide

variety of substrates for the proteins in these groups.

The sequence comparisons and subdivisions make it possible to define sequence patterns

useful for characterisation of MDR members. For QOR, a Prosite pattern (Hofmann et al.

1999) already exists (PS01162). However, this pattern only finds 5 of our presently

recognised 18 QOR members. Based upon the sequences now available, we propose a new

pattern that better will detect the QOR members (Table 2). It finds 15 of the QOR members,

which is three-fold more than the existing Prosite pattern, and it misses only 3 QOR forms,

while detecting no false positives. The patterns for the PDH and the CAD families are based

on residues that bind the catalytic zinc and the substrate. The PDH pattern (Table 2) is a bit

complex but captures all the different substrate specificities of this group and only one false

positive when the pattern is screened versus Swissprot. The MRF, LTD and ACR patterns

(Table 2) are highly specific. They find no false positive matches and miss not one of our

members.

The patterns are useful for proper recognition of new genomic sequences. They allow rapid

annotation into the different families of the MDR superfamily from the huge amounts of

sequences generated by ongoing genome projects. They are also ideal for finding particular

enzymes in the ever-increasing sequence databases.

Table 2. Sequence patterns and screening results. The column hits gives number of MDR forms detected in the six genomes investigated, fp (false positives) gives number of non-members detected in Swissprot, fn (false negatives) gives the number of proteins classified as a member but not found by the pattern. Family Pattern hits fp fn QOR [GAS]-x-N-x(2)-[DEN]-x(5)-G-x(6,19)-[PS]-x(3)-[GA]-x-[ED]-x(2)-

G-x-[VIL]-x(3)-G 15 0 3

MRF L-x(6)-[VL]-T-Y-G-G-M-[SA]-[KR] 6 0 0 PDH [GA]-[VIL]-[CS]-[GN]-[STA]-D-[VILMS]-[HKP]-x(14,27)-G-H-

[ED]-x(2)-G-x-[VI]-x(10,12)-G-[DEQ]-x-[IV] 22 1 0

CAD C-G-x-C-x(2)-D-x(17)-G-H-E 12 0 0 LTD D-x-[YF]-x-[DE]-N-V-G-[GS]-x(3)-[DEN] 16 0 0 ACR W-x(5)-W-x(8)-P-x(2)-Y-x(3)-Y-Y 5 0 0

Page 36: Biocomputational studies on protein structures

Biocomputational studies on protein structures

36

Differential multiplicity of MDR alcohol dehydrogenases (paper II)

The originally purified ADH enzyme types (yeast, Drosophila, liver and plant) were screened

for in the genomes of human, Arabidopsis thaliana, Saccharomyces cerevisiae and

Drosophila melanogaster, and were compared, where relevant, to corresponding forms in

mouse, pea, Escherichia coli and Caenorhabditis elegans. The screenings revealed the

presence of ten human, ten Arabidopsis, five yeast, and one Drosophila MDR-ADH genes. Of

the human genes, three had properties like pseudogenes (recognized by unusual exon/intron

patterns, combined with a lack of upstream promoter elements such as TATA, CAAT or GC

boxes, and with deviating chromosome localizations) and the remaining seven genes

correspond to the enzymes of classes I (3 genes) and II–V (one gene each) (Jörnvall and Höög

1995, Duester et al. 1999). These seven genes are all on chromosome 4 (Fig. 5), compatible

with the early assignments (Duester et al. 1985).

Fig. 5. Gene organisation of ADH on chromosome 4

The screenings also show that ethanol-active MDR-ADH, when present, appears to be of

multiple occurrences. The multiplicity (Jörnvall et al. 2000) and the general occurrence of

ethanol-activity (Fernández et al. 1995) have been noticed before, but can now be correlated

with enzyme type and with eukaryotic type of organism. These and other functional

conclusions are best noticed when the MDR-ADH enzymes are divided into the two

constituent families of dimeric and tetrameric ADHs, initially represented by the liver and

yeast ADHs, respectively.

The dimeric ethanol-active ADHs divergence into at least three groups, corresponding to the

formaldehyde-active class III enzyme from all species, the ethanol-active non-class-III

enzymes from animals, and the ethanol-active non-class III enzymes from plants (Fig. 6).

Page 37: Biocomputational studies on protein structures

Project overview

37

Fig. 6. Evolutionary tree of dimeric ADH. Tree created in ClustalW and viewed in Treview.

Several functional conclusions regarding these enzymes can be drawn by the evolutionary

tree:

• Only plants and vertebrates appear to have the ethanol-active dimeric MDR-ADHs.

• Extensive multiplicity occurs both in vertebrate and plant groups of these enzymes.

• The patterns in the non-class-III groups from vertebrates and plants show some

similarities, with both early and late branching frequently at similar levels.

A comparison of animal and plant lines of MDR-ADH, with species variants included, shows

that the time estimates, calculated in two different manners for the species separation, in the

class III line are too distant, falsely suggesting the class III species separations to be further

away than those for classes I and P and also more distant than accepted radiation nodes of

higher vertebrates. Hence, the constant speed expected for the class III evolution may be

slightly higher than calculated before (Cañestro et al. 2002), or class III may evolve faster in

higher eukaryotes, perhaps giving novel enzymogenesis and additional functions for class III

Page 38: Biocomputational studies on protein structures

Biocomputational studies on protein structures

38

in higher eukaryotes, like already known for class I with its pattern of recent isozyme

emergence (Jörnvall and Höög 1995, Duester et al. 1999).

The present pattern, with the distinction towards absence of dimeric ADH between

higher/lower eukaryotes, and with partly parallel patterns in plants and animals, may suggest

that a specific function is associated with ADH in higher eukaryotes. If so, dimeric ADH may

have key functions in higher eukaryotes rather than just general detoxications (Jörnvall et al.

2000).

A similar visualization of the tetrameric family shows considerably less multiplicity and

places yeast ADH with ADH of C. elegans, and with some E. coli and A. thaliana forms.

Thus, C. elegans and A. thaliana have both dimeric and tetrameric ADH enzymes.

Molecular modelling and docking of bile acids to γγγγγγγγ and ββββββββ ADH

class I (paper III)

A three-dimensional model of the class I alcohol dehydrogenase (ADH) γ subunit was

obtained by adopting its amino acid sequence into the known fold of horse liver alcohol

dehydrogenase (PDB code 3bto) using the program ICM (Molsoft LLC, Abagyan et al. 1994).

For docking studies using the human class I ADH ββ dimer, the structure was taken from the

PDB file 1deh. Docking of bile acids to the dimeric enzyme (ββ-structure and γγ-model) was

performed using a non-rigid procedure, allowing free movement of the substrate, the rotatable

bonds of the substrate, the χ angles of the residues within 7 Å from the active site and with

additional distance restraints of 2.0–2.4 Å between the O-atom at C-3 of the bile acids and the

catalytic zinc and the H-atom at C-3 of bile acids and the C-4 of the nicotinamide of NAD+.

Molecular modelling of the human class I γ ADH subunit and docking calculations

demonstrated that iso-ursodeoxycholic acid (isoUDCA) fits into the active site tightly

surrounded by mainly hydrophobic residues (Fig. 7), and that it interacts with the catalytic

residues in a preferable way (Table 3). For comparison, similar docking calculations were

performed with UDCA. This substrate could not form the interactions crucial for catalysis

(Fig. 7, Table 3). If the 3α-hydroxyl group was forced to fit at the active site, the rest of the

steroid was tilted in comparison to the 3β-hydroxyl epimer and positioned away from the

Page 39: Biocomputational studies on protein structures

Project overview

39

Fig. 7. isoUDCA (dark gray) and UDCA (light gray) docked into the active site of ��� ADH. Residues within 5 Šof the bile acids are displayed in stick models and with their surfaces transparently displayed. Residues F93, T94, Y110, N114 and L116 were omitted for increased visibility. Arrows show catalytic interactions between the ligand , O� of S48 and C4 of NAD+. The picture was created in ICM version 2.8 (Molsoft LLC, San Diego, CA).

active-site channel, compatible with the fact that this conformation is not suitable for efficient

catalysis.

The same calculations were also performed for the human class I ββ ADH where the presence

of Thr48 (instead of Ser48 as in the γγ isozyme) caused inefficient binding in accordance with

kinetic data. Distances crucial for catalysis are collected in Table 3. Distances required for

catalysis, in the range of 2 – 2.4 Å, are obtained only by isoUDCA in ��� ADH.

Table 3. Distances in Ångström obtained from the docking calculations for isoUDCA and UDCA

Isozyme Substrate Oγγγγ(Ser48) -H(C3-OH) Zn -O(C3-OH) C4(NAD+) –H(C3) ADH ββ UDCA 2.31 2.51 2.94 ADH ββ isoUDCA 3.35 2.24 1.94 ADH γγ UDCA 3.20 2.88 3.53 ADH γγ isoUDCA 2.22 2.30 2.19

Page 40: Biocomputational studies on protein structures

Biocomputational studies on protein structures

40

Molecular modelling and substrate docking of human type 10 17ββββ-

hydroxysteroid dehydrogenase (paper IV)

The human type 10 17β-hydroxysteroid dehydrogenase (17β-HSD-10) is a multifunctional

mitochondrial enzyme. It catalyses the oxidative inactivation at C17 of androgens and

estrogens. However, it also mediates oxidation of 3α-hydroxy groups of androgens, thereby

reactivating androgen metabolites. Finally, it is involved in β-oxidation of fatty acids by

catalysing the L-hydroxyacyl CoA dehydrogenase reaction of the β-oxidation cycle. Since no

three-dimensional structure of 17β-HSD-10 was available, homology modelling was carried

out to understand the molecular basis of these substrate specificities. The template in the

homology modelling was 7α-HSD (1fmc; Tanaka et al. 1996), which has a sequence identity

of 27% to the target protein. The low sequence identity required special care in the modelling

process.

The molecular modelling indicated that the reaction resulting in 3α-OH/3-oxo conversion can

be predicted to occur with 5α-reduced steroids such as 3α-ol,5α-androstane,17-one, with

optimal distances for the 3α-OH hydrogen bonding to the tyroxyl Oη 168 of 1.89 Å, to C-4 of

NAD+ of 2.23 Å, and a distance between 3α-OH oxygen to Ser155 side-chain Hγ of 1.96 Å.

Oxidation of the 3β-hydroxyl as in 3β-ol,5α-androstane,17-one or 3β-ol,5α-androstane,17β-

ol is not possible, due to a predicted distance of over 4 Å between the 3β-OH oxygen and the

Ser155 side chain. This distance will not allow the hydrogen bond necessary for catalysis.

Docking simulations with 5β reduced steroids, such as bile acids (isoUDCA and UDCA)

reveal again unfavourable distances to Ser155, thus preventing catalysis to take place. The 17

β-OH dehydrogenase reaction is possible with steroids of the androgen and estrogen classes,

yielding favourable geometry and distances for 3β,17β,5α-androstane-diol, testosterone, 5α-

dihydrotestosterone and estradiol. These predicted properties correlate with experimental data.

After submission of this manuscript another group solved the structure of the rat homologue

by crystallography (Powell et al. 2000). This structure confirms large portions of our model

including some of the enzyme–substrate complexes that we predicted. In Fig. 8, our model

and the crystal structure backbones are superimposed. The cores of the structures superimpose

well, including the majority of the active site residues. Parts of the predicted substrate-binding

loop are missing in the crystal structure, indicating an unordered structure of that part.

Page 41: Biocomputational studies on protein structures

Project overview

41

Nevertheless, we have correctly predicted the start and end of the loop. The dark-coloured

regions corresponding to mispredicted areas are mainly centred in loops, which are

notoriously hard to predict, and secondary structure elements at the edges of the structure. The

entire C-terminal part from Asn243 to Pro261 is mispredicted due to an alignment error to the

template, and the whole region is shifted one residue. The placement of the NAD+ is highly

accurate in the vicinity of the active site and allow us to correctly predict which substrates that

may be catalysed by the enzyme.

The overall quality of our predictions makes us feel that it can be justifiable to create

homology models based on templates with quite low sequence identity in homology

modelling, in our case a successful modelling was created at a sequence identity level below

30% if the overall fold of the family is conserved

.

Fig. 8. Model of ERAB and crystal structure of the rat homologue superimposed, white areas were correctly predicted, while dark areas were mispredicted. NAD+ and substrate of the model are coloured light grey; while in the crystal structure they are darker. Figure created in ICM version 2.8 (Molsoft LLC, San Diego, CA)

Page 42: Biocomputational studies on protein structures

Biocomputational studies on protein structures

42

Structural Role of Conserved Asn179 in the Short-Chain

Dehydrogenase/Reductase Scaffold (paper V)

Short-chain dehydrogenases/reductases (SDR) constitute a large family of enzymes found in

all forms of life. Despite a low level of sequence identity, the three-dimensional structures

determined display a nearly superimposable α/β folding pattern. We identified a conserved

asparagine residue located within strand βF and analysed its role in the short-chain

dehydrogenase/reductase architecture. Mutagenetic replacement of Asn179 to Ala in bacterial

3β/17β-hydroxysteroid dehydrogenase yields a folded but enzymatically inactive enzyme.

Significantly altered unfolding characteristics indicate an apparent increase in stability.

Crystallographic analysis of the wild-type enzyme at 1.2 Å resolution reveals a hydrogen

bonding network including a buried and well-ordered water molecule, connecting strands βE

to βF (Fig. 9), a common feature found in 16 of 21 known three-dimensional structures of the

family.

Fig. 9. Strands βE and βF and interactions of Asn179, with distances given in Ångström. Figure created with ICM, version 2.8 (Molsoft LLC, San Diego, CA) This segment interacts with the active site and the substrate-binding region, which could

explain the loss of function in the N179A mutant. The main interactions of this conserved site

are made through side-chain to main-chain atoms, which makes it impossible to trace them

through sequence comparisons, since no evolutionary pressure is applied on the residues in

order to maintain the interactions. Based on these results we predict that in mammalian 11β-

Page 43: Biocomputational studies on protein structures

Project overview

43

hydroxysteroid dehydrogenase the essential Asn-linked glycosylation site, which corresponds

to the conserved segment, displays the same structural features and has a central role to

maintain the SDR scaffold, rather than being a carbohydrate attachment point.

Molecular dynamic studies of sequence influence upon fibrillation

(paper VI)

Diseases such as the prion diseases and Alzheimer’s disease (AD), caused by fibrillating

proteins are gaining increased attention due to their importance in dementia among an aging

population. Presently, over 20 diseases are known to be linked to fibrillating proteins in

various tissues. We have studied the stability of the helical peptides involved in AD and

pulmonary alveolar proteinosis (PAP) using 4 nanoseconds molecular dynamics simulations.

The effect of stabilising replacements and the effect of the mutation linked to the hereditary

cerebral amyloidosis of the Flemish type are also investigated.

The stability of the peptides is clearly related to their amino acid sequence. The amyloid β-

peptide (Aβ) has lost most of its helical structure midway through the simulation and assumes

a more or less unordered structure with some helical tendencies to it (Fig. 10A), much like the

variant involved in hereditary cerebral amyloidosis of the Flemish type (Aβ Flemish type),

which loses all of its helical tendencies and even show β-strand tendencies (Fig. 10C). The

experimentally verified stable analogue Aβ (K16A+L17A+F20A) retains a helical structure

during the whole 4 ns simulation (Fig. 10B). Surfactant protein-C (SP-C) is more stable than

Aβ but the helix collapses into short segments with helical properties (Fig. 10D). The stable

analogue SP-C (Leu) maintains a rigid helix during the whole simulation (Fig. 10E) while the

tri-substituted SP-C (V18L+V20L+V24L) shows less helical character than SP-C (Leu), but

more than wild-type SP-C (Fig. 10F).

The results can be used to design more stable analogues based on the concepts derived from

this study. The deleterious effect of the mutations in hereditary cerebral amyloidosis of the

Flemish type can be explained by decreased α-helical stability of the peptide. We have also

investigated the secondary structure propensities and a correlation between β-propensities and

fibrillation was found, compatible with the molecular dynamics results.

Page 44: Biocomputational studies on protein structures

Biocomputational studies on protein structures

44

A B C D E

F 0 ps 2000 ps 4000 ps Fig. 10. Snapshots of the development of the structure during the MD simulations. A: Aβ B: Aβ (K16A+L17A+F20A) C: Aβ Flemish type (A21G) D: SP-C E: SP-C (Leu) F: SP-C (V18L+V20L+V24L)

Page 45: Biocomputational studies on protein structures

Project overview

45

A structural basis for proinsulin C-peptide activity (paper VII)

The insulin and C-peptide parts of proinsulin differ markedly in interspecies variability. With

37 forms known in primary structure, C-peptide is between one or two orders of magnitude

more variable than insulin, is about as variable as relaxin in the same protein family, and

shows a tripartite variability pattern. Considering the mammalian C-peptides, 9 positions of

the N- and C-terminal are conserved. In contrast, mid-positions of C-peptide are highly

variable and susceptible to mutational removal. These structural features are consistent with

binding interactions recently defined experimentally (Ohtomo et al. 1998, Rigler et al. 1999,

Wahren et al. 2000, Pramanik et al. 2001), highlighting an importance of the C-terminal

pentapeptide.

The main argument against a hormonal function for the C-peptide is the large species

variability, with no strictly conserved residue. To investigate this apparent problem, 15

peptide hormones were investigated by constructing multiple alignments. The results are

shown in Table 4 where C-peptide is grouped together with three other highly variable, but

well-established peptide hormones. This puts the variability of C-peptide in a new perspective

and indicates the possibility of hormonal activities of C-peptide.

Table 4. Sequence variability among peptide hormones, differences are given in % identity and in PAM units (accepted point mutations) Hormone n Human – Rat Human – Chicken Human – Cod Human – Hagfish % PAM % PAM % PAM % PAM Somatostatin 14 0* 0 0 0 - - 0 0 Glucagon 29 0 0 0 0 - - - - VIP 28 0 0 14 16 21 25 - - Substance P 11 0 0 9 10 27 33 - - GIP 42 5 5 - - - - - - Insulin 51 8 8 14 16 - - 35 47 IGF-1 70 4 4 11 12 - - - - IGF-2 67 6 6 15 17 - - - - CCK 33 9 10 39 55 - - - - Secretin 27 11 12 48 75 - - - - Pancreatic polypeptide

36 22 26 56 97 - - - -

Gastrin 17 18 21 - - - - - - GRP 27 26 32 30 38 - - - - Parathormone 84 27 33 62 119 - - - - C-Peptide 31 32 42 68 148 - - 93 988 Relaxin 54 48 75 - - - - - - *Comparison versus the mouse homologue instead of rat

Page 46: Biocomputational studies on protein structures

Biocomputational studies on protein structures

46

Conclusions

• Sequence comparisons are important and fruitful to examine evolutionary and

functional relationships in families of both large and small proteins/peptides.

• Docking calculations is a powerful tool to elucidate substrate specificities and to

investigate the molecular mechanisms of substrate binding.

• Homology modelling is feasible below 30% sequence identity if the fold of the family

is well conserved. Loops and outer parts are likely to be slightly mispredicted, while

the accuracy of the core and the active site can be expected to be high.

• A multiple structural alignment is a powerful complement to traditional multiple

sequence alignments, as factors not obvious from the sequence conservation may be

detected. The importance of structural comparisons is expected to increase as more

structures become available.

• MD simulations are ideal in studies of the stability of small peptides, as the timescales

practically reachable are biologically relevant.

• The whole genome comparisons of the MDR superfamily has given further insights

into the evolution of this superfamily and several new families have been

characterised. Different multiplicity patterns of the superfamily are also detected in

higher and lower eukaryotes, indicating different functions.

• The substrate specificity of γγADH and ββADH is primarily determined by the

Ser48/Thr48 replacement, where the Ser in γγADH allows the oxidation of bulkier

secondary alcohols.

• The model of type 10 17β-HSD is compatible with known enzymatic data and can be

used to explain enzyme-substrate interactions at a molecular level.

Page 47: Biocomputational studies on protein structures

Conclusions

47

• The structural analysis of the SDR superfamily revealed a conserved hydrogen-bonded

network centered around a conserved asparagine, common for 16 of the 21 proteins of

the SDR with experimentally verified structure.

• MD simulations of amyloidogenic helical peptides and nonfibrillating variants point at

the importance of the stability of the helical portion for their amyloidogenic

tendencies.

• The helical peptides involved in pulmonary alveolar proteinosis and Alzheimer’s

readily lose their helical conformation without the influence of external factors.

• The sequence variability of the C-peptide does not exclude it from being a peptide

hormone, as other well-characterised peptide hormones exhibit similar variability

patterns

Perspectives

One of the driving forces in computational biology is the parallel development of faster and

more powerful computers, which allows addressing more complex systems by computational

methods. The Blue Gene project by IBM, where they construct a supercomputer to try to fold

a protein by MD simulations, is one of the projects that most obviously benefits of new

technology. The recent advances in the field of computer technology and the construction of

more efficient algorithms in MD simulation software indicates that the current limiting factor

is the crudeness of the force fields used in the simulations.

The development in biology is of more importance than technological breakthroughs. The

completion of the human genome last year gave a renaissance for the field of sequence

comparisons, and especially gene prediction algorithms are and will be of utter importance in

the near future. Increase of the biological knowledge about the mechanisms that govern each

process, will provide the basis for making better prediction algorithms for transmembrane

proteins, subcellular localisation, glycosylation and secondary structure.

Page 48: Biocomputational studies on protein structures

Biocomputational studies on protein structures

48

The project that will have greatest impact on homology modelling is the structural genomic

initiative, where proteins without homology with experimentally determined structures, are

selected for crystallisation. The aim is to cover all existing folds of soluble proteins, which

would make it possible to predict the structure of all known proteins. The initiative has

identified 5285 proteins that will cover the three-dimensional space of protein folds with a

reasonable small gaps to fill up by homology modelling, of them 459 has been crystallized to

date and 15 have their structure deposited in PDB (www.jcsg.org).

The databases of single nucleotide polymorphisms (SNP) can give enormous amounts of both

disease related and structurally relevant information, especially in relation to known or

modelled three-dimensional folds.

The enormous amount of biological information generated worldwide threatens to drown the

relevant information for each researcher. Therefore are initiatives like GPCRDB (ref)

valuable, were a consortium of researchers gather all available data about a protein family and

makes it clearly and easily available to the rest of the world, in this case it is G-protein

coupled receptors, but the concept could be transferred to any protein family.

As more resources are directed into the field of bioinformatics the harder problems are no

longer shunned. Therefore I hope to see an algorithm within 5–10 years that produces fairly

accurate (RMSD below 2Å) estimates of the three-dimensional structure of proteins just based

on the sequence of the protein.

Page 49: Biocomputational studies on protein structures

References

49

Acknowledgements

This thesis was carried out at Chemistry I, Department of Medical Biochemistry and

Biophysics, Karolinska Institutet, Stockholm, Sweden. I am very grateful to all of you that

have helped me in the completion of my thesis. In particular I would like to express my

gratitude to the following persons.

Bengt Persson, my supervisor, for rescuing me from the lab-bench and toxic fumes in the far

north and allowing me to work in a field which I barely could dream of during the cold arctic

nights in Sundsvall.

Hans Jörnvall, the head of Chemistry I, for your enthusiasm and for freely sharing your

knowledge and for providing excellent working facilities.

The people that lured me into going for a Ph.D after my exjobb, Magnus Gustafsson, Malin

Hult, Tim Prozorovski, Yuqin Wang, Yvonne Kallberg and Valentina Bonetto, I had a

really good time.

My roommate Yvonne Kallberg for lots of discussion about science, Sundsvall and unrelated

stuff in our dark cave, for nice conference and party company. Lately, our sanctuary has been

invaded by Annika Nor in who tried to let some sunshine into our lives, with varying degrees

of success.

The boys of Chemistry I with unhealthy interests in non-science projects, Andreas P.

Jonsson, Jesper J. Hedberg, Magnus Gustafsson, Mikael Henr iksson, Patr ik Strömberg

and Stefan Svensson, you’ re not safe yet.

My present and past student colleagues in Chemistry I:

Andreas Almlén, Juan Astorga-Wells, Ann-Char lotte Bergman, Peter Bergman, Åsa

Brunnström, Elo Er iste, Daniel Hirschberg, Char lotta Filling, Waltter i Hosia, Johan

Lengquist, Jing L i, Ingemar L indh, Char lotte L indhé, Suya L iu, Ermias Melles, Al

Necal, Johan Nilsson, Åke Norberg, Madalina Oppermann, Anna Päivio, Essam Refai,

Naeem Shafqat, Margareta Stark, Mar ia Tollin, Xiaoqiu Wu, Eli Zahou and Shah

Zaltash, for sharing the daily grind of a Ph.D.

Page 50: Biocomputational studies on protein structures

Biocomputational studies on protein structures

50

I would to thank my main collaborators during these years, Udo Oppermann, Jan

Johansson and Hans Jörnvall, I learned a lot and hopefully science took an ant-step in the

right direction from my contributions.

Time to thank the HEJ-lab core:

Ann-Margreth Jörnvall, Ingegerd Nylander , I rene Byman, Ella Ceder lund, Car ina

Palmberg, Eva Mårtensson, Ulr ika Waldenström, Monica L indh, Marie Stahlberg,

Gunvor Alvelius, Jan Sjövall, Jan-Olof Höög, Mats Andersson, Birgitta Agerber th,

Mustafa El-Ahmad, Tomas Bergman, William Gr iffiths, Lars Hjelmqvist, Hanns-Ulr ich

Marschall, Rannar Sillard and Jawed Shafqat, there would be no HEJ-lab without you

guys.

I would also like to grasp the opportunity and say cheers to the friends I found in all over

MBB, if you’ re forgotten blame me, it’s late and everything else is finished, Jenny, Stina,

Eva, Helena (Structural Biology), Andis, Hans, Simone, Stefan (Organic chemistry),

Micke, Mattias, Pontus, (Kemi II), Aristi (Biochemistry).

I would like to express my deepest gratitude and love to my parents who always stood by and

supported me.

If it weren’ t for my sister Kar ins travelling, perhaps I wouldn’ t have found the courage to

leave the safe haven of Sundsvall, thanks a billion :-)

Mirna, finally I’m finished, I owe it all to you.

Ok, to sum it up, it’s been good; I will miss you all.

Page 51: Biocomputational studies on protein structures

References

51

References Abagyan, R. and Totrov, M. Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and Proteins. J. Mol. Biol. 1994, 235:983-1002 Abagyan, R., Batalov, S., Cardozo, T., Totrov, M., Webber, J. and Zhou, Y. Homology modeling with internal coordinate mechanics: deformation zone mapping and improvements of models via conformational search. Proteins 1997, Suppl 1:29-37 Abagyan, R.A. and Batalov, S. Do aligned sequences share the same fold? J. Mol. Biol. 1997, 273:355-68 Abagyan, R.A. Towards protein folding by global energy optimization. FEBS Lett. 1993, 325:17-22 Agerberth, B., Gunne, H., Odeberg, J., Kogner, P., Boman, H.G.and Gudmundsson, G.H., FALL-39, a putative human peptide antibiotic, is cysteine-free and expressed in bone marrow and testis. Proc. Natl. Acad. Sci. U S A 1995, 92:195-199 Alder, B.J. and Wainwright, T.E. Phase transition for a hard-sphere system. J. Chem. Phys. 1957, 27:1208 Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990, 215:403-410 Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25:3389-3402 Amy, C.M., Witkowski, A., Naggert, J., Williams, B., Randhawa, Z. and Smith, S. Molecular cloning and sequencing of cDNAs encoding the entire rat fatty acid synthase. Proc. Natl. Acad. Sci. U S A 1989, 86:3114-3118 Andersson, M., Gunne, H., Agerberth, B., Boman, A., Bergman, T., Sillard, R., Joernvall, H., Mutt, V., Olsson, B., Wigzell, H., Dagerlind, A., Boman, H.G. and Gudmundsson, G.H., NK-lysin, a novel effector peptide of cytotoxic T and NK cells. Structure and cDNA cloning of the porcine form, induction by interleukin 2, antibacterial and antitumour activity. EMBO J. 1995, 14:1615-1625 Bairoch, A. and Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28:45-48 Baldi, P., Brunak, S., Frasconi, P., Soda, G. and Pollastri, G. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 1999, 15:937-946 Bansal, A.K. An automated comparative analysis of 17 complete microbial genomes. Bioinformatics 1999, 15:900-908 Beers, M.F. and Lomax, C. Synthesis and processing of hydrophobic surfactant protein C by isolated rat type II cells. Am. J. Physiol. 1995, 269:744-753

Page 52: Biocomputational studies on protein structures

Biocomputational studies on protein structures

52

Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A. and Wheeler, D.L. GenBank. Nucleic Acids Res. 2002, 30:17-20 Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F. and Hermans, J. Interaction models for water in relation to protein hydration. In: Intermolecular Forces, (ed. Pullman, B.) Reidel, Dordrecht, 1981, 331-342 Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H. , Shindyalov, I.N. and Bourne P.E. The Protein Data Bank. Nucleic Acids Res., 2000, 28:235-242 Blundell, T.L., Sibanda, B.L., Sternberg, M.J. and Thornton, J.M. Knowledge-based prediction of protein structures and the design of novel molecules. Nature 1987, 326:347-352 Bonneau, R., Tsai, J., Ruczinski, I., Chivian, D., Rohl, C., Strauss, C.E.M. and Baker, D. Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins 2001, Suppl 5:119-126 Bonnichsen, R.K. and Wassén, A.M. Crystalline alcohol dehydrogenase from horse liver. Arch. Biochem. Biophys. 1948, 18: 361-363 Boudet, A.M., Lapierre, C. and Grima-Pettenati, J. Biochemistry and molecular biology of lignification. New Phytol. 1995, 129:203-236 Browne, W.J., North, A.C., Phillips, D.C., Brew, K., Vanaman, T.C., Hill, R.L. A possible three-dimensional structure of bovine alpha-lactalbumin based on that of hen's egg-white lysozyme. J. Mol. Biol. 1969, 42:65-86 Brändén, C.I., Jörnvall, H., Eklund, H. and Furugren, B. Alcohol dehydrogenases. In: The enzymes, third edition (ed. Boyer, P.D.), chapter 3, Academic Press, 1975 Brändén, C. and Tooze, J., Introduction to protein structure, second edition. Garland Publishing Inc., NY, 1999 Burley, S.K. An overview of structural genomics. Nat. Struct. Biol. 2000 Suppl:932-934 Campbell, M.M. and Sederoff, R.R. Variation in lignin content and composition. Mechanisms of control and implications for the genetic improvement of plants. Plant Physiol. 1995, 110:3-13 Cañestro, C., Albalat, R., Hjelmqvist, L., Godoy, L., Jörnvall, H. and Gonzàlez-Duarte, R. Ascidian and amphioxus Adh genes correlate functional and molecular features of the ADH family expansion during vertebrate evolution. J. Mol. Evol. 2002, 54:81-89 Cardozo T, Totrov M, and Abagyan R. Homology modeling by the ICM method. Proteins 1995, 23:403-414 Chandrasekhar, I. and van Gunsteren, W.F. A comparison of the potential energy parameters of aliphatic alkanes: molecular dynamics simulations of triacylglycerols in the alpha phase. Eur. Biophys. J. 2002, 31:89-101

Page 53: Biocomputational studies on protein structures

References

53

Chou, P.Y. and Fasman, G.D. Prediction of protein conformation. Biochemistry 1974, 13:222-245 Dayhoff, M.O. Schwartz, R.M. and Orcutt, B.C. A model of evolutionary change in Proteins. Matrices for detecting distant relationships In: Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington DC, 1978 DeRisi, J.L., Iyer, V.R. and Brown P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997 278:680-686 Donadio, S., Staver, M.J., McAlpine, J.B., Swanson, S.J. and Katz, L. Modular organization of genes required for complex polyketide biosynthesis. Science 1991, 252:675-679 Dreij, K., Sundberg, K., Johansson, A.S., Nordling, E., Seidel, A., Persson, B., Mannervik, B. and Jernström, B. Catalytic activities of human alpha class glutathione transferases toward carcinogenic dibenzo[a,l]pyrene diol epoxides. Chem. Res. Toxicol. 2002, 15:825-831 Duan, Y. and Kollman, P.A. Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 1998, 282:740-744 Duester, G., Farrés, J., Felder, V.C., Höög, J.-O., Parés, X., Plapp, B., Yin, S.-y. and Jörnvall, H. Alcohol dehydrogenase nomenclature. Biochem. Pharmacol. 1999, 58:389-395 Duester, G., Hatfield, G.W. and Smith, M. Molecular genetic analysis of human alcohol dehydrogenase. Alcohol 1985, 2:53-56 Efron, B., Halloran, E. and Holmes, S. Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. U S A 1996, 93:13429-13434 Ellis, R.J., van der Vies, S.M. and Hemmingsen, S.M. The molecular chaperone concept. Biochem. Soc. Symp. 1989, 55:145-153 Epstein, C.J., Goldberger, R.F. and Anfinsen, C.B. The genetic control of tertiary protein structure. Model systems. Cold Spring Harbor Symp. Quant. Biol. 1963, 28:439-449 Ewing, T.J., Makino, S., Skillman, A.G. and Kuntz, I.D. DOCK 4.0: search strategies for automated molecular docking of flexible molecule databases. J. Comput. Aided. Mol. Des. 2001, 15:411-428 Ewing, T.J.A. and Kuntz, I.D. Critical Evaluation of Search Algorithms for Automated Molecular Docking and Database Screening. J. Comp. Chem. 1997, 9:1175-1189 Felsenstein, J. PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle. 1993 Felsenstein, J. Confidence-limits on phylogenies - an approach using the bootstrap. Evolution 1985, 39:783-791 Feng, D.F., Johnson, M.S. and Doolittle, R.F. Aligning amino acid sequences: comparison of commonly used methods. J. Mol. Evol. 1984-85;21:112-125

Page 54: Biocomputational studies on protein structures

Biocomputational studies on protein structures

54

Fernández, M.R., Biosca, J.A., Norin, A., Jörnvall, H. and Parés, X. Class III alcohol dehydrogenase from Saccharomyces cerevisiae: Structural and enzymatic features differ toward the human/mammalian forms in a manner consistent with functional needs in formaldehyde detoxication. FEBS Lett. 1995, 370:23-26 Fiers, W., Contreras, R., Haegemann, G., Rogiers, R., van de Voorde, A., van Heuverswyn, H., van Herreweghe, J., Volckaert, G. and Ysebaert, M. Complete nucleotide sequence of SV40 DNA. Nature 1978, 273:113-120 Fitch, W.M. Toward defining the course of evolution: minimum change for a specific tree topology. Sys. Zool. 1971, 20:406-416 Fitz-Gibbon, S.T., and House C.H. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 1999, 27:4218-4222 Forster, M.J. Molecular modelling in structural biology. Micron 2002, 33:365-384 Galliano, H., Cabane, M., Eckerskorn, C., Lottspeich, F., Sandermann, H. Jr and Ernst D. Molecular cloning, sequence analysis and elicitor-/ozone-induced accumulation of cinnamyl alcohol dehydrogenase from Norway spruce (Picea abies L.). Plant Mol. Biol. 1993, 23:145-156 Ghosh, D., Sawicki, M., Pletnev, V., Erman, M., Ohno, S., Nakajin, S., and Duax, W.L. Porcine carbonyl reductase: Structural basis for a functional monomer in short chain dehydrogenases/reductases. J. Biol. Chem. 2001, 276:18457–18463 Ghosh, D., Wawrzak, Z., Weeks, C.M., Duax,W.L. and Erman, M. The refined three-dimensional structure of 3 α,20 β-hydroxysteroid dehydrogenase and possible roles of the residues conserved in short-chain dehydrogenases. Structure 1994, 2:629-640 Gibbs, A.J. and McIntyre, G.A. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 1970,16:1-11 Gonnet, G.H., Cohen, M.A. and Benner, S.A. Exhaustive matching of the entire protein sequence database. Science 1992, 256:1443-1445 Gonzalez, P., Rao, P.V., Nunez, S.B. and Zigler, J.S. Jr. Evidence for independent recruitment of � -crystallin/quinone reductase (CRYZ) as a crystallin in camelids and hystricomorph rodents. Mol. Biol. Evol. 1995, 12:773-781 Goodsell, D.S. and Olson, A.J. Automated Docking of Substrates to Proteins by Simulated Annealing. Proteins: Str. Func. and Genet. 1990, 8:195-202 Gustafsson, M. and Johansson, J. Structural and functional properties of pulmonary surfactant protein C and related analogs. J. Protein Chem. 1998, 17:540-542 Halgren, T.A. Merck Molecular Force Field. I.-V. J. Comp. Chem. 1995a, 17:490-641 Halgren, T.A. Potential energy functions. Curr. Opin. Struct. Biol. 1995b, 5:205-210

Page 55: Biocomputational studies on protein structures

References

55

Halpin, C., Knight, M.E., Foxon, G.A., Campbell, M., Boudet, A.M., Boon, J.J., Chabbert, B., Tollier, M.T. and Schuch, W. Manipulation of lignin quality by down-regulation of cinnamyl alcohol dehydrogenase. Plant Journal 1994, 6:339-350 Hammersley, J.M. Monte Carlo Methods for Solving Multivariable Problems. Ann. New York Acad. Sci. 1960, 86:844-874 Hansson, T., Oostenbrink, C. and van Gunsteren, W. Molecular dynamics simulations. Curr. Opin. Struct. Biol. 2002, 12:190-196 Hardin, C., Eastwood, M.P., Prentiss, M., Luthey-Schulten, Z. and Wolynes, P.G. Folding funnels: the key to robust protein structure prediction. J. Comput. Chem. 2002, 23:138-146 Hartigan, J.A. Minimum evolution fits to a given tree. Biometrics 1973, 29:53-65 Henikoff, S. and Henikoff, J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U S A. 1992, 89:10915-10919 Henikoff, S. and Henikoff, J.G. Performance evaluation of amino acid substitution matrices. Proteins 1993, 17:49-61 Hofmann, K., Bucher, P., Falquet, L. and Bairoch A. The PROSITE database, its status in 1999. Nucleic Acids Res. 1999, 27:215-219 Huang, T.F., Holt, J.C., Lukasiewicz, H. and Niewiarowski, S. Trigramin. A low molecular weight peptide inhibiting fibrinogen interaction with platelet receptors expressed on glycoprotein IIb-IIIa complex. J. Biol. Chem. 1987, 262:16157-16163 Jany, K.D., Ulmer, W., Froschle, M. and Pfleiderer, G. Complete amino acid sequence of glucose dehydrogenase from Bacillus megaterium. FEBS Lett. 1984, 165:6-10 Jensen, F. Introduction to Computational Chemistry Wiley and Sons, Chichester, 1999 Johansson, J., Curstedt, T., Robertson, B. and Jörnvall, H. Size and structure of the hydrophobic low molecular weight surfactant-associated polypeptide. Biochemistry 1988, 27:3544-3547 Jones, D.T. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 1999, 292, 195-202 Jones, D.T. and Swindells, M.B. Getting the most from PSI-BLAST. Trends Biochem. Sci. 2002, 27:161-164 Jones, G., Willett, P., Glen, R.C., Leach, A.R. and Taylor, R. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. 1997, 267:727-748 Jörnvall, H. and Höög, J.-O. Nomenclature of alcohol dehydrogenases. Alcohol and Alcoholism 1995, 30:153-161

Page 56: Biocomputational studies on protein structures

Biocomputational studies on protein structures

56

Jörnvall, H., Höög, J.-O. and Persson, B. SDR and MDR: completed genome sequences show these protein families to be large, of old origin, and of complex nature. FEBS Lett., 1999 445:261-264 Jörnvall, H., Höög, J.-O., Persson, B. and Parés, X. Pharmacogenetics of the alcohol dehydrogenase system. Am. J. Pharmacol. 2000, 61:184-191 Jörnvall, H., Persson, B., Krook, M., Atrian, S., Gonzalez-Duarte, R., Jeffery, J., and Ghosh, D. Short-chain dehydrogenases/reductases (SDR). Biochemistry 1995, 34:6003–6013 Jörnvall, H., Persson, M. and Jeffery, J. Alcohol and polyol dehydrogenases are both divided into two protein types, and structural properties cross-relate the different enzyme activities within each type. Proc. Natl. Acad. Sci. U S A 1981, 78:4226-4230 Jörnvall, H., von Bahr-Lindström, H., Jany, K.D., Ulmer, W. and Froschle, M. Extended superfamily of short alcohol-polyol-sugar dehydrogenases: structural similarities between glucose and ribitol dehydrogenases. FEBS Lett. 1984, 165:190-196 Kallberg, Y., Oppermann, U., Jörnvall, H. and Persson, B. Short-chain dehydrogenase/reductase (SDR) relationships: A large family with eight clusters common to human, animal, and plant genomes. Protein Sci. 2002, 11:636-641 Kallberg, Y. and Persson, B. KIND-a non-redundant protein database. Bioinformatics 1999, 15:260-261 Kamal, A., Almenar-Queralt, A., LeBlanc, J.F., Roberts, E.A. and Goldstein, L.S. Kinesin-mediated axonal transport of a membrane compartment containing beta-secretase and presenilin-1 requires APP. Nature 2001, 414:643-648 Kanehisa, M. A database for post-genome analysis. Trends Genet. 1997, 13:375-376 Kang, J., Lemaire, H.-G., Unterbeck, A., Salbaum, J.M., Masters, C.L., Grzeschik, K.-H., Multhaup, G., Beyreuther, K. and Mueller-Hill, B., The precursor of Alzheimer's disease amyloid A4 protein resembles a cell-surface receptor. Nature 1987, 325:733-736 Kirkpatrick, S., Gelatt, C.D., and Vecchi, M.P. Optimization by Simulated Annealing. Science 1983, 220:671-680 Koonin, E.V., Mushegian, A.R. Galperin, M.Y. and Walker D.R. Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol. 1997, 25:619-637 Krook, M., Ghosh, D., Strömberg, R., Carlquist, M., and Jörnvall, H. Carboxyethyllysine in a protein: Native carbonyl reductase/NADP+-dependent prostaglandin dehydrogenase. Proc. Natl. Acad. Sci. U S A 1993, 90:502–506 Kuntz, I.D., Blaney, J.M., Oatley, S.J., Langridge, R., and Ferrin, T.E. "A geometric approach to macromolecule-ligand interactions", J. Mol. Biol. 1982, 161, 269-288 Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., et al. and Morgan, M.J. Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921

Page 57: Biocomputational studies on protein structures

References

57

Lengauer, T. and Rarey, M. Computational methods for biomolecular docking. Curr. Opin. Struct. Biol. 1996, 6:402-406 Levitt, M. Accurate modeling of protein conformation by automatic segment matching. J. Mol. Biol. 1992, 226:507-533 Lindahl, E., Hess, B. and van der Spoel, D. GROMACS 3.0: a package for molecular simulation and trajectory analysis. J. Mol. Model. 2001, 7:306–317 Magnuson, K., Jackowski, S., Rock, C.O. and Cronan, J.E. Jr. Regulation of fatty acid biosynthesis in Escherichia coli. Microbiol. Rev. 1993, 57:522-542 Marschall, H.-U., Oppermann, U.C., Svensson, S., Nordling, E., Persson, B., Höög, J.-O. and Jörnvall, H. Human liver class I alcohol dehydrogenase ��� isozyme: the sole cytosolic 3β-hydroxysteroid dehydrogenase of iso bile acids. Hepatology 2000, 31:990-996 McCammon, J.A., Gelin, B.R. and Karplus, M. Dynamics of folded proteins. Nature 1977, 267:585-590 Minth, C.D., Andrews, P.C. and Dixon, J.E., Characterization, sequence, and expression of the cloned human neuropeptide Y gene. J. Biol. Chem. 1986, 261:11974-11979 Morris, G.M., Goodsell, D.S., Halliday, R.S., Huey, R., Hart, W.E., Belew, R.K. and Olson, A.J. Automated Docking Using a Lamarckian Genetic Algorithm and and Empirical Binding Free Energy Function. J. Computational Chemistry 1998, 19:1639-1662 Morris, G.M., Goodsell, D.S., Huey, R. and Olson, A.J. Distributed automated docking of flexible ligands to proteins: Parallel applications of AutoDock 2.4. J. Computer-Aided Molecular Design 1996, 10:293-304 Morse, P.M. and Feshbach, H. Asymptotic Series; Method of Steepest Descent. In: Methods of Theoretical Physics, Part I. McGraw-Hill, New York, 1953, 434-443 Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995, 247:536-540 Najmanovich, R., Kuttner, J., Sobolev, V. and Edelman, M. Side-chain flexibility in proteins upon ligand binding. Proteins 2000, 39:261-268 Needleman, S.B. and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two Proteins. J. Mol. Biol. 1970, 48:443-453 Negelein, E. and Wulff, H.-J. Kristallisation des Proteins der Acet-aldehyd-reduktase. Biochem. Z. 1937, 289: 436-437 Nei, M. Phylogenetic analysis in molecular evolutionary genetics. Annu. Rev. Genet. 1996, 30:371-403 Ohlrogge, J. and Browse, J. Lipid biosynthesis. Plant Cell 1995, 7:957-970

Page 58: Biocomputational studies on protein structures

Biocomputational studies on protein structures

58

Ohtomo, Y., Bergman, T., Johansson, B.-L., Jörnvall, H. and Wahren, J. Differential effects of proinsulin C-peptide fragments on Na+,K+-ATPase activity of renal tubule segments. Diabetologia 1998, 41:287-291 Oldfield, E. Chemical shifts in amino acids, peptides, and proteins: from quantum chemistry to drug design. Annu. Rev. Phys. Chem. 2002, 53:349-378 Oppermann, U.C., Persson, B., Filling, C., and Jörnvall, H.. Structure–function relationships of SDR hydroxysteroid dehydrogenases. Adv. Exp. Med. Biol. 1997, 414:403–415 Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. CATH- A Hierarchic Classification of Protein Domain Structures. Structure 1997, 5:1093-1108 Page, R.D. TreeView: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 1996, 12:357-358 Pearson, W.R. and Lipman, D.J. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U S A 1988, 85:2444-2448 Persson, B., Hallborn, J., Walfridsson, M., Hahn-Hägerdal, B., Keränen, S., Penttilä, M. and Jörnvall, H. Dual relationships of xylitol and alcohol dehydrogenases in families of two protein types. FEBS Lett. 1993, 324:9-14 Persson, B., Krook, M. and Jörnvall, H. Characteristics of short-chain alcohol dehydrogenases and related enzymes. Eur. J. Biochem. 1991, 200:537-543 Persson, B., Krook, M., and Jörnvall, H. Short-chain dehydrogenases/reductases. Adv. Exp. Med. Biol. 1995, 372:383–395 Persson, B., Zigler, J.S. Jr and Jörnvall, H. A super-family of medium-chain dehydrogenases/reductases (MDR). Sub-lines including � -crystallin, alcohol and polyol dehydrogenases, quinone oxidoreductase, enoyl reductases, VAT-1 and other proteins. Eur. J. Biochem. 1994, 226:15-22 Polak, E. Chapter 2.3 In: Computational Methods in Optimization. New York: Academic Press, 1971 Powell, A.J., Read, J.A., Banfield, M.J., Gunn-Moore, F., Yan, S.D., Lustbader, J., Stern, A.R., Stern, D.M. and Brady, R.L. Recognition of Structurally Diverse Substrates by Type II 3-Hydroxyacyl-Coa Dehydrogenase (Hadh II) Amyloid-Beta Binding Alcohol Dehydrogenase (Abad) J. Mol. Biol. 2000, 303:311-327 Pramanik, A., Ekberg, K., Zhong, Z., Shafqat, J., Henriksson, M., Jansson, O., Tibell, A., Tally, M., Wahren, J., Jörnvall, H., Rigler, R. and Johansson, J. C-peptide binding to human cell membranes: importance of Glu27. Biochem. Biophys. Res. Commun. 2001, 284:94-98 Rao, P.V., Gonzalez, P., Persson, B., Jörnvall, H., Garland, D. and Zigler J.S. Jr Guinea pig and bovine � -crystallins have distinct functional characteristics highlighting replacements in otherwise similar structures. Biochemistry 1997, 36, 5353-5362

Page 59: Biocomputational studies on protein structures

References

59

Rarey, M., Kramer, B., Lengauer, T. and Klebe, G. A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol. 1996, 261:470-489 Rigler, R., Pramanik, A., Jonasson, P., Kratz, G., Jansson, O.T., Nygren, J.-Å., Ståhl, S., Ekberg, K., Johansson, B.-L., Uhlén, S., Uhlén, M., Jörnvall, H. and Wahren, J. Specific binding of proinsulin C-peptide to human cell membranes. Proc. Natl. Acad. Sci. U S A 1999, 96:13318-13323 Rossmann, M.G., Moras, D. and Olsen, K.W. Chemical and biological evolution of nucleotide-binding protein. Nature 1974, 250:194-199 Rost, B. and Sander, C. Prediction of protein secondary structure at better than 70% accuracy, J. Mol. Biol. 1993, 232:584-599 Rost, B. and Sander, C. Third generation prediction of secondary structure. In: Protein Structure Prediction: Methods and Protocols (ed. Webster, D.), Humana Press, Clifton, NJ, 2000, 71-95 Saitou, N. and Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987, 4:406-425 Sali, A. and Blundell, T.L. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 1993, 234:779-815 Sanchez, R. and Sali, A. Advances in comparative protein-structure modelling. Curr. Opin. Struct. Biol. 1997, 7:206-214 Sandak, B., Wolfson, H.J. and Nussinov, R. Flexible docking allowing induced fit in proteins: insights from an open to closed conformational isomers. Proteins 1998, 32:159-174 Sarni, F., Grand, C. and Boudet, A.M. Purification and properties of cinnamoyl-CoA reductase and cinnamyl alcohol dehydrogenase from poplar stems (Populus X euramericana). Eur. J. Biochem. 1984, 139:259-265 Schmidt, M.W., Baldridge, K.K., Boatz, J.A., Elbert, S.T., Gordon, M.S., Jensen, J.H., Koseki, S., Matsunaga, N., Nguyen, K.A.; et al. General atomic and molecular electronic structure system. J. Comput. Chem. 1993, 14:1347-1363 Schneider, G. and Bohm, H.J. Virtual screening and fast automated docking methods. Drug Discov. Today 2002, 7:64-70 Schwartz, M.F. and Jörnvall, H. Structural analyses of mutant and wild-type alcohol dehydrogenases from Drosophila melanogaster. Eur. J. Biochem. 1976, 68:159-168 Sellers, P.H. On the theory and the computation of evolutionary distances. SIAM J. Appl. Math. 1974, 26:787-793 Shepherd, A.J., Gorse, D. and Thornton, J.M. Prediction of the location and type of beta-turns in proteins using neural networks. Protein Sci. 1999, 8:1045-55

Page 60: Biocomputational studies on protein structures

Biocomputational studies on protein structures

60

Sippl, M.J. Knowledge-based potentials for Proteins. Curr. Opin. Struct. Biol. 1995, 5:229-235 Skolnick, J., Kolinski, A., Kihara, D., Betancourt, M., Rotkiewicz, P.and Boniecki, M. Ab initio protein structure prediction via a combination of threading, lattice folding, clustering, and structure refinement. Proteins 2001, Suppl 5:149-156 Smith, T.F. and Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981, 147:195-197 Sofer, W. and Ursprung, H. Drosophila alcohol dehydrogenase. J. Biol. Chem. 1968, 243: 3110-3115 Steiner, D., Cunningham, D., Spigelman, L. and Aten, B. Insulin biosynthesis: evidence for a precursor. Science 1967, 157:697-700 Stewart, J.J. MOPAC: a semiempirical molecular orbital program. J. Comput. Aided. Mol. Des. 1990, 4:1-105 Stoesser, G., Baker, W., van den Broek, A., Camon, E., Garcia-Pastor, M., Kanz, C., Kulikova, T., Lombard, V., Lopez, R., Parkinson, H., Redaschi, N., Sterk, P., Stoehr, P. and Tuli, M.A. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 2001, 29:17-21 Swofford, D.L. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts, 1998 Tanaka, N., Nonaka, T., Tanabe, T., Yoshimoto, T., Tsuru, D. and Mitsui, Y. Crystal structures of the binary and ternary complexes of 7 α-hydroxysteroid dehydrogenase from Escherichia coli. Biochemistry 1996, 35:7715-7730 Tekaia, F., Lazcano, A. and B. Dujon.. The genomic tree as revealed from whole proteome comparisons. Genome Res. 1999, 9:550-557 Thompson, J.D., Higgins, D.G. and Gibson, T.J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680 Thomsen, J., Kristiansen, K., Brunfeldt, K. and Sundby, F., The amino acid sequence of human glucagon. FEBS Lett. 1972, 21:315-319 Totrov, M. and Abagyan, R. Detailed ab initio prediction of lysozyme-antibody complex with 1.6 A accuracy. Nat. Struct. Biol. 1994, 1:259-263 Totrov, M. and Abagyan, R. Flexible protein-ligand docking by global energy optimization in internal coordinates. Proteins 1997, Suppl 1:215-220 Wahren, J. and Johansson, B. New aspects of C-peptide physiology. Horm. Metab. Res. 1998, 30:A2-A5

Page 61: Biocomputational studies on protein structures

References

61

Wahren, J., Ekberg, K., Johansson, J., Henriksson, M., Pramanik, A., Johansson, B.L., Rigler, R. and Jörnvall, H. Role of C-peptide in human physiology. Am. J. Physiol. Endocrinol. Metab. 2000, 278:E759-E768 Venclovas, C., Zemla, A., Fidelis, K. and Moult, J. Some measures of comparative performance in the three CASPs. Proteins 1999, Suppl 3:231-237 Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., et al. and Zhu X. The sequence of the human genome. Science 2001, 291:1304-1351 Vogt, G., Etzold, T. and Argos, P. An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J. Mol. Biol. 1995, 249:816-831 Wu, C.H., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z.Z., Ledley, R.S., Lewis, K.C., Mewes, H.W., Orcutt, B.C., Suzek, B.E., Tsugita, A., Vinayaka, C.R., Yeh, L.S., Zhang, J. and Barker, W.C. The Protein Information Resource: an integrated public resource of functional annotation of Proteins. Nucleic Acids Res. 2002, 30:35-37 Yamazoe, M., Shirahige, K., Rashid, M.B., Kaneko, Y., Nakayama, T., Ogasawara, N. and Yoshikawa, H. A protein which binds preferentially to single-stranded core sequence of autonomously replicating sequence is essential for respiratory function in mitochondrial of Saccharomyces cerevisiae. J. Biol. Chem. 1994, 269:15244-15252 Yokomizo, T., Ogawa, Y., Uozumi, N., Kume, K., Izumi, T. and Shimizu, T. cDNA cloning, expression, and mutagenesis study of leukotriene B4 12-hydroxydehydrogenase. J. Biol. Chem. 1996, 271:2844-2850 Zvelebil, M.J., Barton, G.J., Taylor, W.R. and, Sternberg, M.J.E. Prediction of protein secondary structure and active sites using alignment of homologous sequences, J. Mol. Biol. 1987, 195, 957-961 Österberg, F., Morris, G.M., Sanner, M.F., Olson, A.J. and Goodsell, D.S.Automated docking to multiple target structures: incorporation of protein mobility and structural water heterogeneity in AutoDock. Proteins 2002, 46:34-40