biocomputational studies on protein structures
TRANSCRIPT
From the Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden
Biocomputational studies on protein structures
by
Er ik Nordling
Stockholm 2002
© Erik Nordling
ISBN 91-7349-295-7
Till min älskade Mirna
Abstract Biology in the post-genomic era produces large amount of data. This, in combination with the need for efficient algorithms to find genes in the genomic material, has brought a renaissance into the field of computational biology. Methods now range from lead discovery in the drug discovery process, by virtual ligand screening, through sequence comparisons and homology searches, to micro array data analysis and visualisation. This thesis primarily deals with sequence analysis, different aspects of protein structure prediction and enzyme–ligand complex characterisation, I have applied bioinformatic techniques on the enzyme families of medium-chain dehydrogenases/reductases (MDR), short-chain dehydrogenases/reductases (SDR) and biologically active peptides
Sequence analysis of the MDR superfamily extends the evolutionary context of this superfamily, as MDR enzymes were collected from completely sequenced genomes. The analysis reveals the presence of eight families whereof several were previously uncharacterised. Three families are formed by dimeric alcohol dehydrogenases (ADH), cinnamyl alcohol dehydrogenases (CAD) and tetrameric alcohol dehydrogenases (YADH). Three further families are centred on forms initially detected as mitochondrial respiratory function proteins (MRF), acetyl-CoA reductases of fatty acid synthases (ACR), and leukotriene B4 dehydrogenases (LTD). The two remaining families, with polyol dehydrogenases (PDH) and quinone reductases (QOR), are also distinct but with variable sequences. The analysis also suggests that new functions have evolved in this superfamily in higher organisms.
Factors that govern the substrate specificity of γγADH were investigated with docking calculations and can be traced to active site characteristics, most notably the Ser48/Thr48 replacement between γγADH and ββADH, which allow the oxidation of 3β-hydroxy bile acids, such as isoUDCA, in ��� ADH, while both enzymes are inactive versus 3α-hydroxy bile acids.
A homology model of type 10 17β-hydroxysteroid dehydrogenase was constructed from 7α-hydroxysteroid dehydrogenase. The validity of the model was investigated by its ability to distinguish between active and inactive using docking calculations. Substrates tested ranged from steroids and bile acids to L- and D-hydroxyacyl CoA. Ligands with 17β or 3α hydroxy groups and L-hydroxyacyl CoA could achieve interactions favourable for catalysis at the active site. A crystallographically determined structure published after the submission of our paper verified large portions of our model.
The role of a conserved asparagine in the short-chain dehydrogenase/reductase (SDR) fold is investigated through structural comparisons of 21 members with experimentally verified structures. An extensive hydrogen-bonding network including parts of the active site is revealed in 16 out of 21 SDR forms.
Molecular dynamics simulations were employed to study the effect of deleterious mutants and replacements, known to diminish fibrillation, to the stability of the helical region of the amyloidogenic peptides amyloid β-peptide (Aβ) and surfactant protein-C (SP-C). The effects in SP-C are quantitatively distinguishable, while the noise ratio in the Aβ simulations makes valid predictions somewhat difficult.
Sequence comparisons of the C-peptide of proinsulin displays a sequence variability that is 1 to 2 orders of magnitude greater than that of insulin, but in the same order of magnitude as the well established peptide hormones relaxin and parathormone, which in conjunction with functional reports may indicate a hormonal function for the C-peptide. Erik Nordling: Biocomputational studies on protein structures; Stockholm 2002; ISBN 91-7349-295-7
Biocomputational studies on protein structures
6
Contents
Abbreviations............................................................................................................................. 8 List of original articles............................................................................................................. 10 Introduction.............................................................................................................................. 12 Background .............................................................................................................................. 13
Medium-chain Dehydrogenases/Reductases (papers I-III) .............................................. 13 Short-chain Dehydrogenases/Reductases (papers IV-V) ................................................. 15 Biologically active peptides (papers VI-VII) ................................................................... 16
Techniques ............................................................................................................................... 17 Sequence comparisons......................................................................................................... 17
Pairwise alignments.......................................................................................................... 17 Multiple alignments.......................................................................................................... 18 Scoring matrices............................................................................................................... 18
Databases.............................................................................................................................. 20 Database searching............................................................................................................... 20
FASTA ............................................................................................................................. 20 BLAST ............................................................................................................................. 21
Phylogenetic analysis........................................................................................................... 21 Maximum parsimony ....................................................................................................... 21 Neighbour joining ............................................................................................................ 22 Bootstrap analysis ............................................................................................................ 22
Secondary structure prediction............................................................................................. 22 Molecular modelling............................................................................................................ 23
Force field methods.......................................................................................................... 24 Local minimisation techniques............................................................................................. 25
Steepest descent................................................................................................................ 25 Conjugate gradient ........................................................................................................... 25
Global optimisation techniques............................................................................................ 25 Monte Carlo (MC)............................................................................................................ 26 Molecular dynamics (MD) ............................................................................................... 27 Simulated annealing......................................................................................................... 27
Docking calculations............................................................................................................ 27 Rigid body dockings......................................................................................................... 28 Flexible ligand – rigid receptor ........................................................................................ 28 Flexible ligand – flexible receptor ................................................................................... 29 Comments on docking methods....................................................................................... 30
Homology modelling............................................................................................................ 30 Aim of the study....................................................................................................................... 32 Project overview....................................................................................................................... 33
Genome comparison and modelling of active sites in the MDR family (paper I) ............... 33 Differential multiplicity of MDR alcohol dehydrogenases (paper II).................................. 36 Molecular modelling and docking of bile acids to γγ and ββ ADH class I (paper III) ........ 38 Molecular modelling and substrate docking of human type 10 17β-hydroxysteroid dehydrogenase (paper IV) .................................................................................................... 40 Structural Role of Conserved Asn179 in the Short-Chain Dehydrogenase/Reductase Scaffold (paper V)................................................................................................................ 42 Molecular dynamic studies of sequence influence upon fibrillation (paper VI).................. 43 A structural basis for proinsulin C-peptide activity (paper VII) .......................................... 45
Contents
7
Conclusions.............................................................................................................................. 46 Perspectives.............................................................................................................................. 47 Acknowledgements.................................................................................................................. 49 References................................................................................................................................ 51
Biocomputational studies on protein structures
8
Abbreviations
One and three letter codes for the 20 genetically encoded amino acids. Alanine Ala A
Arginine Arg R
Asparagine Asn N
Aspartic acid Asp D
Cysteine Cys C
Glutamic acid Glu E
Glutamine Gln Q
Glycine Gly G
Histidine His H
Isoleucine Ile I
Leucine Leu L
Lysine Lys K
Methionine Met M
Phenylalanine Phe F
Proline Pro P
Serine Ser S
Threonine Thr T
Tryptophan Trp W
Tyrosine Tyr Y
Valine Val V
Abbreviations
9
17β-HSD-10 type 10 17β-hydroxysteroid dehydrogenase
Aβ Amyloid β-peptide
ACR Acyl-CoA reductase
AD Alzheimer’s disease
ADH Alcohol dehydrogenase
BPMC Biased probability Monte Carlo
CAD Cinnamyl alcohol dehydrogenase
FAS Fatty acyl synthase
HSD Hydroxysteroid dehydrogenase
LTD Leukotriene B4 dehydrogenase
MC Monte Carlo
MD Molecular dynamics
MDR Medium-chain dehydrogenases/reductases
MRF Mitochondrial respiratory function protein
NAD+/ NADH Nicotinamide adenine dinucleotide (oxidised/reduced)
NADP+/NADPH Nicotinamide adenine dinucleotide phosphate (oxidised/reduced)
PAP Pulmonary alveolar proteinosis
PDB Protein databank (www.rcsb.org)
PDH Polyol dehydrogenase
QOR Quinone oxidoreductase
RMSD Root mean square deviation
SDR Short-chain dehydrogenases/reductases
SNP Single nucleotide polymorphism
SP-C Surfactant protein-C
UDCA Ursodeoxycholic acid
YADH Yeast alcohol dehydrogenase
Å Ångström (10-10m)
Biocomputational studies on protein structures
10
List of original articles This thesis is based on the following articles, which will be referred to by their Roman
numerals.
I. Nordling, E., Jörnvall, H. and Persson, B. Medium-chain
dehydrogenases/reductases (MDR): Family characterizations including genome
comparisons and active site modelling. Eur. J. Biochem. 2002 269, in press
II. Nordling, E., Persson, B. and Jörnvall, H. Differential multiplicity of MDR
alcohol dehydrogenases: enzyme genes in the human genome versus those in
organisms initially studied Cell. Mol. Life Sci. 2002, 59:1070-1075
III. Marschall, H.U., Oppermann, U.C., Svensson, S., Nordling, E., Persson, B.,
Höög, J-.O. and Jörnvall, H. Human liver class I alcohol dehydrogenase ���
isozyme: the sole cytosolic 3β-hydroxysteroid dehydrogenase of iso bile acids.
Hepatology 2000, 31:990-996
IV. Nordling, E., Oppermann, U.C., Jörnvall, H. and Persson, B. Human type 10 17
β-hydroxysteroid dehydrogenase: molecular modelling and substrate docking. J.
Mol. Graph. Model. 2001, 19:514-520, 591-593
V. Filling, C., Nordling, E., Benach, J., Berndt, K.D., Ladenstein, R., Jörnvall, H.
and Oppermann, U. Structural role of conserved Asn179 in the short-chain
dehydrogenase/reductase scaffold. Biochem. Biophys. Res. Commun. 2001,
289:712-717
VI. Nordling, E., Kallberg, Y., Johansson, J. and Persson, B. Molecular dynamics
studies of sequence influence upon fibrillation. Manuscript
VII. Jörnvall, H., Nordling, E., Persson, B., Shafqat, J. Ekberg, K, Wahren, J and
Johansson, J. A structural basis for proinsulin C-peptide activity. Manuscript
The published articles are reprinted with permission from the copyright holders.
List of original articles
11
Related papers not included in the thesis
• Dreij, K., Sundberg, K., Johansson, A.S., Nordling, E., Seidel, A., Persson, B.,
Mannervik, B. and Jernström, B. Catalytic activities of human alpha class glutathione
transferases toward carcinogenic dibenzo[a,l]pyrene diol epoxides. Chem. Res.
Toxicol. 2002, 15:825-831
• Filling, C., Berndt, K.D., Benach, J., Knapp, S., Prozorovski, T., Nordling, E.,
Ladenstein, R., Jörnvall, H. and Oppermann, U. Critical residues for structure and
catalysis in short-chain dehydrogenases/reductases. J. Biol. Chem. 2002, 277:25677-
25684
• Strömberg, P., Svensson, S., Hedberg, J.J., Nordling, E. and Höög J.-O. Identification
and characterisation of two allelic forms of human alcohol dehydrogenase 2. Cell.
Mol. Life. Sci. 2002, 59:552-559
• Jonsson, A.P., Aissouni, Y., Palmberg, C., Percipalle, P., Nordling, E., Daneholt, B.,
Jörnvall, H. and Bergman, T. Recovery of gel-separated proteins for in-solution
digestion and mass spectrometry. Anal. Chem. 2001, 73:5370-5377
Biocomputational studies on protein structures
12
Introduction
The structure of a protein determines its function. It is essential for substrate specificity,
stability and interactions with other proteins. The quest for protein structure and interactions
is recognised as one of the big challenges after the completion of the human genome, and
several large-scale structural genomics project are launched around the globe (Burley 2000).
The realisation that all information required for protein folding is inherited in the protein
sequence (Epstein et al. 1963) gave rise to the greatest challenge for computational biology,
the folding problem. It has not been possible to find a universal solution as with the genetic
code. One complication is that many proteins require assistance of chaperone proteins to fold
into their native structure (Ellis et al. 1989). Nevertheless, several methods exist, which
exhibit moderate predictive accuracy (Bonneu et al. 2001, Skolnick et al. 2001, Hardin et al.
2002). The most accurate structure prediction methods to date are homology modelling, or
comparative modelling methods that use the information from homologous structures to
predict the tertiary structure. The quality of the model is dependent upon the number of
residue identities in common between the two proteins, but is often below 1 Å in RMSD if the
sequence identity is relatively high.
The activity of an enzyme is determined by the shape and amino acid residue distribution of
the active site. This makes it possible to search for potential substrates by computational
methods termed docking calculations, by finding the optimal placement of the substrate with
respect to electrostatic/hydrophobic interactions and shape complementarities.
The completion of the first genome, that of the SV40 virus, with a total of 5,224 base pairs
(Fiers et al. 1978), marked the arrival of a new perspective in biology. For the first time, all
the genetic information of an organism were then known. Now with the completion of the
human genome (Venter et al. 2001, Lander et al. 2001), the needs of analysis are enormous.
One of the more appealing possibilities of the genomic sequences is the possibility to compare
species and to try to explain their differences in terms of their genes (Bansal 1999). Whole
genome comparisons have also been used for phylogenetic analysis of organisms (Koonin et
al. 1997, Fitz-Gibbon and House 1999, Tekaia et al. 1999), which largely confirms the
previous picture of the divisions of life.
Introduction
13
The large amounts of data generated through the advances in biology and biochemistry have
given rise to the discipline of bioinformatics, which deals with the biological information, and
tries to use it as a base to predict different biological properties, such as sub-cellular
localisation or secondary structure of a protein.
In my thesis, I have applied bioinformatic techniques on the enzyme families of medium-
chain dehydrogenases/reductases (MDR), short-chain dehydrogenases/reductases (SDR) and
biologically active peptides
Background
Medium-chain Dehydrogenases/Reductases (papers I-III)
Medium-chain dehydrogenases/reductases (MDRs) constitute a large enzyme superfamily
with (including species variants) close to 1000 members, which have been compared at
separate stages of establishment (Persson et al. 1994, Jörnvall et al. 1999). MDR proteins are
either dimeric or tetrameric proteins with about 350 residues. The tertiary structure reveals
that they consist of one catalytic and one coenzyme-binding domain (Fig. 1).
Fig. 1 Model of dimeric γγ ADH (paper III) in ribbon representation, NAD+ is shown as ball and stick models, zinc ions are displayed in space-filling representation. isoUDCA is docked into the active site and displayed in a ball and sticks model with solvent accessible surface coloured by the electrostatic potential. Picture created in Weblab Viewer Lite (Accelrys Inc.)
Biocomputational studies on protein structures
14
Several, but not all of the members have one zinc ion bound with catalytic function at the
active site. Some, in particular classical, dimeric alcohol dehydrogenases (ADHs), also have a
second zinc ion at a structural site (Brändén et al. 1975). The coenzyme binding domain has a
Rossmann-fold that binds NAD(P)H (Rossmann et al. 1975). Several gene duplications have
led to the present state of different families, summarised in Fig. 2.
The MDR enzymes represent many different enzyme activities of which zinc-dependent
ADHs are the ones early investigated most thoroughly (Bonnichsen and Wassen 1948,
Brändén et al. 1975). They participate in the
oxidation of alcohols, detoxification of
aldehydes/alcohols and the metabolism of bile acids
(Jörnvall et al. 2000, Marschall et al. 2000). There
are also tetrameric alcohol dehydrogenases,
originally detected in yeast, as yeast alcohol
dehydrogenases (YADH) (Negelein and Wulff
1937), but present also in other species (paper II).
The polyol dehydrogenases (PDHs) have activities
originally detected for sorbitol dehydrogenase
(Jörnvall et al. 1981). All their substrates are
widespread in nature because of their derivation
from glucose, fructose, and general metabolism.
Cinnamyl alcohol dehydrogenases, CAD, are
predominantly found in plants. This enzyme type
catalyses the last step in the biosynthesis of the monomeric precursors of lignin, the main
constituent of plant cell walls (Boudet et al. 1995). It has been extensively characterised from
several plant sources (Sarni et al. 1984, Halpin et al. 1994, Galliano et al. 1993), because of
its importance for the pulp industry (cf. Persson et al. 1993). Downregulation or inhibition of
CAD will reduce wood lignin content and yield a pulp of higher quality (Campbell and
Sederoff 1995). The first characterised member of the quinone oxidoreductase (QOR) family
was a lens protein (ζ-crystallin) (Gonzalez et al. 1995), in which a mutational loss may result
in cataract formation already at birth. This suggests that ζ-crystallin has a role in the
protection of the lens against oxidative damage (Rao et al. 1997). The mitochondrial
respiratory function proteins (MRF) are necessary for respiratory function, although the
mechanism is not clear (Yamazoe et al. 1994). The acyl-CoA reductases (ACR) constitute one
YADH
CADADH
PDH
MRF
QOR
LTD
ACR
0.1
Fig. 2. Evolutionary tree showing the eight MDR families. ADH –alcohol dehydrogenases, CAD – cinnamyl alcohol dehydrogenases, YADH – yeast alcohol dehydrogenases, ACR – acyl CoA reductases, LTD – leukotriene B4 dehydrogenases, QOR – quinone oxidoreductases, MRF – mitochondrial respiratory function proteins, PDH – polyol dehydrogenases. Tree created with ClustalW (Thompson et al. 1994) and visualised in Treeview (Page 1996)
Background
15
domain of fatty acyl synthase (FAS) (Amy et al. 1989) and erythronolide synthase (Donadio
et al. 1991) multienzymes in animals, which are responsible for the synthesis of long-chain
fatty acids. This is an example of fusion of genes encoding for a multisubunit complex, since
separate genes provide the same function in most bacteria (Magnuson et al. 1993) and plants
(Ohlrogge and Browse 1995). The leukotriene B4 dehydrogenases (LTDs) are involved in the
metabolism of immunoreactive molecules (Yokomizu et al. 1996).
In common, therefore, as demonstrated by the examples above, all MDR families appear to
have some members with protective functions in different organismal defences (Jörnvall et al.
1999), similar to the P450 family. However, recent findings suggest more specific functions
for the family members in higher organisms (paper II).
Short-chain Dehydrogenases/Reductases (papers IV-V)
The first reported enzyme of the short-chain dehydrogenases/reductase (SDR) family was
Drosophila ADH (Sofer and Ursprung 1968, Schwartz and Jörnvall 1976), sharing function
with horse liver alcohol dehydrogenase, but not evolutionary origin. As additional proteins
were found to be similar in sequence the SDR family was established (Jörnvall et al. 1981,
Jany et al. 1984, Jörnvall et al. 1984, Persson et al. 1991). The SDR proteins are one-domain
NAD(P)(H)-dependent enzymes of typically 250 amino acid residues (Jörnvall et al. 1995),
often multimeric (Fig. 3).
Fig. 3. Tetrameric structure of 20β-hydroxysteriod dehydrogenase (pdb code 2HSD; Ghosh et al. 1994). Each subunit is shaded differently, NAD+ is displayed in space-filling representation. Figure created with Weblab Viewer Lite (Accelrys, Inc.)
Biocomputational studies on protein structures
16
They display a wide substrate spectrum, ranging from steroids, alcohols, sugars, and aromatic
compounds to xenobiotics. SDRs are subdivided into classical and extended SDRs (Persson et
al. 1995), differing in lengths and cofactor-binding motifs. SDR constitutes a large family
(Jörnvall et al. 1999) with about 3000 forms known, including species variants in all living
kingdoms (Kallberg et al. 2002). The family is highly divergent, with typically 15–30%
residue identity in pairwise comparisons. For 21 members, the three-dimensional structures
have been determined. In spite of the low residue identities between the different members,
the folding pattern is conserved with superimposable peptide backbones (Krook et al. 1993;
Ghosh et al. 2001). The criterion for SDR membership is the occurrence of typical sequence
motifs, arranged in a specific manner. These motifs contain crucial residues in the nucleotide
binding and for the active site including its highly conserved triad of Ser, Tyr, and Lys
residues (Jörnvall et al. 1995, Persson et al. 1995, Oppermann et al. 1997).
Biologically active peptides (papers VI-VII)
Peptides with a biological activity are widespread in nature. Their functions range from
metabolic regulation of hormones like glucagon (Thomsen et al. 1972), innate host-defense of
anti-bacterial peptides such as LL-37 (Agerberth et al. 1995) and NK-lysin (Andersson et al.
1995) through neurological functions such as for neuropeptide Y (Minth et al. 1986) and its
likes, to the more deleterious effects of the disintegrins of snake toxins (Huang et al. 1986).
Many of the peptides are synthesised as larger inactive precursors, which are cleaved to
produce the active components. The short peptides may also require a structural assistance to
achieve the folded conformation (Beers and Lomax 1995). Often the precursors contain more
than one active peptide, which then are released together in equimolar amounts.
The peptides included in this thesis are the C-peptide, surfactant protein C (SP-C) and the
amyloid β-peptide (Aβ). The C-peptide is produced by the posttranslational cleavage of
prosinsulin to form mature insulin and C-peptide (Steiner et al. 1967). The understanding of
its function has increased from originally considered only a structural assistant for the correct
folding of insulin, to now known to have hormonal activities of its own. (Wahren et al. 2000).
In diabetes type I patients it has beneficial effects on glycaemic control and protects against
diabetic complications, while synthetic insulin, which is identical to endogenous insulin but
lacks C-peptide, does not suffice (Wahren and Johansson 1998). The surfactant protein C
(Johansson et al. 1988) is a natural part of the surfactant mixture in the lungs and lowers the
surface tension in the alveolus (Gustafsson and Johansson 1998). It is produced as a larger
Background
17
precursor to fold it into an α-helix structure (Beers and Lomax 1995). The amyloid β peptide
is a product from the APP protein (Kang et al. 1987), which now has recently been suggested
to function as a kinesin 1 receptor, mediating the axonal transport of β-secretase and
presenelin-1 (Kamal et al. 2001)
Techniques
In this thesis, I have used several bioinformatic techniques. They are described below in
further detail than the format of the original articles allowed.
Sequence comparisons
The procedure of comparing two or more sequences to bring as many identical or similar
residues as possible into vertical register produces an alignment of the two sequences. Two
main types of sequence alignment are recognised, global and local. The global alignment
optimises the alignment over the full-length of the sequences, while in the local alignment,
stretches of sequence with the highest density of matches are given the highest priority.
Pairwise alignments
A direct manner of comparing sequences is the dot plot analysis (Gibbs and McIntyre 1970).
The two sequences to be compared form the rows and columns of a matrix, and in the
simplest case a dot is plotted in a graphical representation of the matrix wherever the
sequences are identical. In practice, the comparison is most often done in a window of
specified length, and a dot is put in the graph whenever the score within this window is above
a certain specified threshold. The score might be calculated as the number of identities or as a
similarity based on a scoring matrix. Sequence similarities will be obvious as diagonals in the
plot. The method does not align sequences but it constitutes a quick manner of spotting
similarities. It has the additional advantage over other alignment methods that it can detect
repetitions in the sequences as parallel diagonals in the plot.
Global alignments aim to optimally align two or more sequences. The methods most used for
global alignments are based on algorithms originally developed by Needleman and Wunsch
(1970) and later modified by Sellers (1974). This procedure (the NWS method) is using a
dynamic programming algorithm that simplifies the enormous task of calculating a score for
all possible alignments of two sequences with gaps of any lengths. The sequences to be
aligned are arranged as rows and columns of a rectangular matrix. A score is calculated for
Biocomputational studies on protein structures
18
each position of the matrix according to three possible events: replacement (or conservation)
of a residue, insertion in sequence A or insertion in sequence B. The alignment is traced back
from the highest scoring matrix element to the beginning of the sequence.
The dynamic programming algorithm can also be used for finding local sequence similarities.
The Smith and Waterman (1981) algorithm is very similar to the NWS method except that a
calculated negative number for a matrix position is replaced by zero, indicating that no
sequence similarity has been detected up to that point. When all matrix elements have been
calculated, the maximum number in the matrix is located, and the alignment is traced back
from this point until the first positive number.
Multiple alignments
A multiple sequence alignment of a protein family gives evolutionary information about the
family, since the alignment picks up the evolutionary pressure on each amino acid, and, thus
displays which positions that are crucial for maintaining the fold and function of the proteins
of the investigated family.
The simplest method to obtain a multiple sequence alignment is to use one sequence as the
base for the alignment and simply align all other sequences pairwise to this sequence
(Altschul et al. 1997). The most common procedure for multiple sequence alignment uses
hierarchical methods. In these methods, alignments of all pairs of sequences are made first
using the dynamic programming algorithm. The sequences are then grouped according to their
similarities into a tree (hierarchical cluster analysis). Finally, starting with the most similar
pairs, all the sequences are aligned stepwise to each other using the dynamic programming
method. This is the procedure used in the most popular program, CLUSTALW (Thompson et
al. 1994),used in all papers except in papers III and VI.
Scoring matrices
All alignment methods need some sort of scoring for matches and mismatches. For a certain
alignment, a number is assigned to each position in the sequence depending on the match at
that position. The scores for all positions in the alignment are then added to calculate a total
score, which is used to select the optimal alignment among alternative alignments. The
simplest way of scoring is to assign a score of 1 for a match and 0 for a mismatch. Such a
matrix is often referred to as a unitary matrix (Feng et al. 1984). The commonly used matrices
Sequence comparisons
19
in protein sequence analysis are generally presented as log-odds matrices. Each score in the
matrix is the logarithm of an odds ratio. The odds ratio used is the ratio of the number of
times residue "A" is observed to replace residue "B" divided by the number of times residue
"A" would be expected to replace residue "B" if the replacement occurred at random.
The PAM series are based on estimated mutation rates (Point Accepted Mutations) from
closely related proteins (Dayhoff et al. 1978). PAM 1 stands for 1 % accepted mutations.
PAM matrices for less similar sequences are obtained by extrapolations. The PAM100 matrix
corresponds to 100 accepted mutations per 100 residues, but since the same residue might
change more than once, two sequences with this level of mutations will have about 50 %
identity. A major drawback is that the series was developed upon a limited set of sequences.
The Gonnet matrix is based upon exhaustive matching of the entire protein sequence database
(Gonnet et al. 1992), an almost identical approach to the original PAM matrices but with a
more divergent set of sequences. The BLOSUM series is calculated from blocks of aligned
sequences from homologous proteins with a certain level of identity. In this manner, the
pattern of changes observed at a certain level of identity is used, instead of extrapolations
from patterns for closely related proteins (Henikoff and Henikoff 1992). The first PAM
matrix is almost as efficient as the newer BLOSUM and GONNET matrices in exhaustive
comparisons (Henikoff and Henikoff 1993, Vogt et al. 1995, Abagyan and Batalov 1997),
which is a considerable achievement considering the limited set of proteins available at the
time of its creation.
The scores from all pairs of aligned residues are combined with suitable penalties for
introducing gaps to calculate a total score, which is used to select the optimal alignment. The
gap penalties are normally determined by two parameters, one for opening a gap and one for
elongating it, proportional to the length of the gap. Most programs allow the user to choose
these parameters, which might have different optima for different protein types. The choice of
gap opening and extension penalties will depend much on the purpose of the alignment. If, for
instance the alignment is to be used as input for the construction of a homology model, it may
be beneficial with a high gap opening penalty and a low gap extension penalty, since the
quality of the model is sensitive to numerous insertions/deletion events (Abagyan and Batalov
1997)
Biocomputational studies on protein structures
20
Databases
The huge amount of biological information that is collected through the scientific community
worldwide is stored in large databases. The information in the databases may be as diverse as
nucleotide sequences (Stoesser et al. 2001, Benson et al. 2002), metabolic pathways
(Kanehisa 1997) or expression profiles from micro array experiments (DeRisi et al. 1997).
The databases most relevant for the work in this thesis contain protein sequences and
structures. Swissprot (Bairoch and Apweiler 2000) and PIR (Wu et al. 2002) are manually
curated databases that contain annotated protein sequences of high quality. Since there are
several protein databases, the need to combine them for a complete view of the protein
universe has arisen. In that process, a high degree of redundancy is generated, since the
databases contain overlapping information. The solution to that problem is to remove identical
sequences of the same length or shorter versions of other entries in the database (Kallberg and
Persson 1999). The major protein structure database is the Protein databank (PDB) (Berman
et al. 2000), where experimentally determined structures and theoretical models are stored.
Also worth mentioning are the CATH (Orengo et al. 1997) and the SCOP (Murzin et al.
1995) databases, which contain structural classification of proteins.
Database searching
The question posed when performing a database search is whether there are homologous
sequences to the query sequence in the database. The problem is resolved in a similar fashion
as in the sequence alignment case. As dynamic programming can be quite time consuming,
several heuristical methods have been developed. These methods are not guaranteed to find
the best possible solution. However they are significantly faster and can therefore be used to
search large databases.
FASTA
One of the more popular search methods is FASTA (Pearson and Lipman 1988). The FASTA
algorithm is a fast heuristic approximation to the Smith-Waterman algorithm. The FASTA
algorithm divides the query sequence into overlapping words, usually of length two for
proteins or six for nucleic acids. Then as each sequence is read from the database it is also
divided into its overlapping words. These two lists of words are compared to find the identical
words in both sequences. High scoring sequences are subjected to a Smith-Waterman
alignment within the region of the dot-plot defined by the concentrated identities and the
window size.
Databases
21
BLAST
The BLAST algorithm (Altschul et al. 1990) uses a word-based heuristic similar to that of
FASTA to approximate a simplification of the Smith-Waterman algorithm known as the
maximal segment pairs algorithm. A major drawback of the method is that it does not allow
gaps. The list of words is expanded with the addition of similar words in order to recover the
sensitivity lost by matching only identical words. BLAST then examines the database
sequences for words that exactly match any of the words on the expanded list.
A new gapped BLAST (Altschul et al. 1997) gets around the problem of not allowing gaps
and still avoids the high computational cost of a full Smith-Waterman alignment on the pair of
sequences, by building the alignment out from a central high scoring pair of aligned amino
acids. Another improvement of the BLAST algorithm is PSI-BLAST (Altschul et al. 1997),
which starts with a gapped BLAST search and creates a profile out of the found sequences.
This profile is used as the basis for a new search, and the procedure is iterated until the search
has converged and no new proteins are found, or the number of iterations reaches a number
specified by the user. This method has been found to be very sensitive in picking out remote
homologies (Jones and Swindell 2002).
Phylogenetic analysis
The objective of phylogenetic analysis is to trace evolution. This can be done for
morphological features of organisms, or as favoured in later years, by tracing mutational
events in the genetic material of organisms. The work included in this thesis is based solely on
protein sequences and reflects primarily the evolution of new functions in protein families and
not the birth of new species, although speciation events also are included in the trees. There
are a few different methods available to construct a phylogenetic tree. The two most used
methods are outlined below, both start with the construction of a multiple alignment of the
sequences in question and then follow the procedures described below in each section.
Maximum parsimony
The phylogenetic trees are constructed so that the sequences are grouped together by the
criterion of minimum replacements from the last common ancestor (Fitch 1971, Hartigan
1973). The method yields a number of low scoring trees. The search of trees can sample all
possible trees in search of the correct one, but it is not feasible for more than about 20
sequences/taxa (Swofford 1998). Instead usually a heuristic approach is used where the trees
Biocomputational studies on protein structures
22
are built by stepwise addition of sequences in the growing tree. This method has problems to
accommodate the fact that multiple substitutions may occur at the same site (e.g. Ala � Gly
followed by Gly � Ala at the same position later in time) (Nei 1996). Popular programs
include PAUP (Swofford 1998) and PHYLIP (Felsenstein 1993). The method is used in paper
I to construct the tree of the MDR superfamily.
Neighbour joining
The trees are constructed using pairwise distances of all sequences in the set. The final tree is
put together so that the distance between two sequences in the tree (branch lengths)
corresponds to the similarity between the two sequences. A drawback of the method is that
the same distance may be obtained by comparing two different sequences (Saitou and Nei
1987). ClustalW (Thompson et al. 1994) and the already mentioned PAUP and PHYLIP are
some of the most used programs based upon this methodology. The method is used in paper I
and II for the tree construction.
Bootstrap analysis
Phylogenetic trees are usually tested for the reliability by a bootstrap test (Felsenstein 1985).
The test is made by constructing a tree based on a subset of the columns in the multiple
alignment that the tree is based on. This procedure is repeated multiple times and gives an
estimate on the reliability by giving the number of times the analysis results in the same tree.
A bootstrap value over 95% for a branch point is usually considered to be highly confident
(Efron et al. 1996).
Secondary structure prediction
Predicting the secondary structure of a protein is an important step towards the elucidation of
its three-dimensional structure, as well as its function, in cases where the three-dimensional
structure is not available, as in the situation for most membrane proteins and for very large
proteins. The secondary structure prediction methods commonly try to predict if a residue
belongs to one of three states, α-helix, β-strand or coil (irregular parts and non-ordered
structural segments). Some methods specialised to predict just one type of structure have
recently been developed, and most notably a β-turn predictor which has the highest accuracy
yet for a secondary structure prediction method (Shepherd et al. 1999). The first generation of
secondary structure prediction methods was based upon the statistically inferred single amino
acid propensity for a certain secondary structure element. The methods were not very
Phylogenetic analysis
23
successful since the secondary structure is dependent upon both local and long-range
interactions (cf. Rost and Sander 2000). The second generation of prediction methods was
able to incorporate the local environment by calculating the propensities for segments of
variable sizes (Chou and Fasman 1974). The accuracy of the methods achieved around 60 %
but could not reach higher. The breakthrough for accuracy came with the use of multiple
alignments as the input to the prediction methods (Zvelebil et al. 1987), which led to the third
generation of prediction methods. The multiple alignment includes evolutionary information
about residue replacements. Thus, yielding indirect information about long-range interactions
and the local environment. The top performing methods today are based upon machine
learning techniques. Neural networks are used in the two most prominent methods. The first
method that broke the 70% barrier was PHD (Rost and Sander 1993), which used a multiple
alignment as an input to a neural network that is trained upon known sequences from the
PDB. One of the largest advances to secondary structure prediction was PSIPRED (Jones
1999), which used position specific profiles from PSI-BLAST searches as the input to a
simple feed forward neural network. This led to an increase in accuracy to roughly 75% and
that is still the frontier in secondary structure prediction. To increase this further, also long-
range interactions need to be considered. A first step in this direction can be seen in a study by
Baldi et al. (1999), where a neural network based method is developed that can accommodate
long-range interactions and scores as well as or better than the existing methods.
Molecular modelling
The structure of a protein determines its physical and chemical properties. Methods that
utilize physical or chemical information to predict structural, and therefore implicitly
functional, properties of a biological molecule are collectively termed molecular modelling
methods (cf. Forster 2002). A broad division can be drawn between methods that predict the
binding of a ligand to a macromolecular receptor – docking methods, and methods that aim at
predicting the structure of protein. A further division can be made among the structure
prediction methods, between fast threading methods that predict the fold of the protein
without generating a three-dimensional model of the protein. The speed of these methods
makes them suitable for whole-genome prediction of folds or identifying possible templates
for a more thorough investigation of the structure by constructing a complete three-
dimensional model of the protein by ab initio prediction or homology modelling.
Biocomputational studies on protein structures
24
The physical properties of a molecule can be described by Schrödinger’s equation.
Unfortunately it cannot be solved for systems with more than one electron (H, He+, Li2+).
However the concept can be extended by approximations to include larger systems and has
developed to a discipline of its own, quantum chemistry (cf. Oldfield 2002). The
approximation methods can be categorized as either ab initio or semiempirical. Semiempirical
methods use parameters that compensate for neglecting some of the time consuming
mathematical terms in Schrödinger's equation and can be used on systems up to a few
hundred atoms (Stewart 1990), whereas ab initio methods include all such terms and can only
handle a few dozen atoms (Schmidt et al. 1993). The parameters used by semiempirical
methods can be derived from experimental measurements or by performing ab initio
calculations on model systems.
Force field methods
The Schrödinger equation can be bypassed by writing the energy as a parametric function of
the nuclear coordinates. This allows the dynamics of the atoms to be treated by classical
mechanics. For time independent phenomena, the problem reduces to the calculation of the
energy at a given geometry (Jensen 1999). The foundation of this methodology is the
observation that molecules tend to be composed of units, which are structurally similar in
different molecules. The O–H bond in methanol and in ethanol is of the same length and
energy content. Force fields are a generalisation of the picture of molecules being composed
of structural units, “ functional groups” , which behave similarly in different molecules. Force
fields take care of the different properties of atoms by assigning them into different atom
types according to the atom and the type of chemical bonding it is involved in. The MMFF
force field (Halgren 1995a) used in among other programs ICM, extensively used in my
thesis, has 40 types of carbon and 20 types of oxygen depending on hybridisation state, bond
number, bond-partners and charge. The parameters assigned to each atom-type are based upon
empirical measurements or higher-order calculations (Chandrasekhar and van Gunsteren
2002)
The energy is written as a sum of terms, each a function describing the energy associated with
a particular aspect of the molecule, including the energy of bond-stretching, rotation around a
bond, bending of bond-angles and non-bonded atom-atom interactions, such as van der Waals
or electrostatic interactions. This approach has the added advantage that it is quite simple to
change the nature of the function for each term, and equally easy to add or remove terms from
Molecular modelling
25
the total energy function (Abagyan 1993). Such an energy function of the nuclear coordinates
makes it possible to calculate the geometry of the possible conformations of a molecule and
their relative energies. Stable conformations correspond to minima on the potential energy
surface and can be found by minimizing the energy as a function of the nuclear coordinates.
Local minimisation techniques
Local minimisation techniques are employed to find the nearest minimum, which may, or,
more probably, may not correspond to the global minimum.
Steepest descent
The gradient vector points in the direction where the energy function increases most, i.e. the
function value can always be lowered by stepping in the negative direction. Functional
evaluations are made in the negative gradient direction until the energy increases and a
temporary minimum along the negative gradient is located. At this point a new gradient is
calculated and used for the next search (Morse and Feschbach 1953). This method will always
lower the function value and is guaranteed to approach a minimum. However, two main
problems are associated with this technique. Two subsequent searches are made perpendicular
to each other and each round will therefore partially undo the previous minimisation step’s
decrease of the function value. The other problem is that as the minimum is approached, the
rate of convergence is decreased and the method will actually never reach the minimum. It
will crawl towards it at an ever-decreasing speed. Even though its obvious drawbacks, the
method is simple and robust and guaranteed to lower the function value, which may explain
its widespread use.
Conjugate gradient
The method is very similar to the previous method, but tries to avoid the partial undoing of
the previous step, by doing each step except the first, in a direction which is a mixture of the
current negative gradient and the previous search direction. The next step is taken in a
direction, which is conjugate of the previous direction, hence its name (Polak 1971).
Global optimisation techniques
The previous methods can only locate the minimum nearest to the starting conformation, as
they mimic a marbles path down a funnel and find the lowest spot in the energy landscape. If
the desired minimum is the global minimum, the deepest valley in the energy landscape, other
Biocomputational studies on protein structures
26
methods have to be employed. The optimal way to find the global minimum would be to
sample all possible conformations and calculate the energy. This systematic approach is not
practically feasible with more than 10–12 free variables, since the number of possible
conformations increases exponentially with the number of free variables, and likewise does
the computational time of the analysis. Large biomolecular systems, such as proteins, lipid
membranes or nucleic acids, cannot be studied by a systematic approach. In order to be able
to investigate them, methods have been developed that sample a large number of possible
conformations in order to locate minima. Although they are not guaranteed to find the global
minimum, they often find a local minimum, which is close to the global minimum in terms of
energy.
Monte Carlo (MC)
The search starts from a given conformation. In each MC run the position of one or more
atoms is randomly changed. The new conformation is accepted as a starting point for the next
perturbation if the energy is lower than the energy of the previous conformation. Steps that
generate conformations with higher energy than the previous one are evaluated by calculation
of the Boltzmann factor (Eq. 1, T is temperature in Kelvin, kB is Boltzmann’s constant and ∆E
is the change in energy) and comparing it to a random number between 0 and 1.
Boltzmann factor = e-∆E/kB
T Eq. 1
If the Boltzmann factor is less than the random number the conformation is used as the
starting conformation for the next step in the Monte Carlo procedure. This allows the
calculation to step out of local minima and continue the search for the global minimum
(Hammersly 1960). A further enhancement of the methodology is the inclusion of information
of preferred side-chain angles for amino acids in proteins. Each amino acid has a set of
discrete side-chain angles, rotamers, that are energetically more favourable (Brändén and
Tooze 1999). By including them into the search algorithm one gets the biased probability
Monte Carlo (BPMC) search algorithm, that samples the conformational space more
efficiently (Abagyan and Totrov 1994).
Molecular modelling
27
Molecular dynamics (MD)
MD methods solve Newton’s equation of motion (Eq. 2), where the force F is determined by
the mass, m, and the acceleration, a, for atoms on an energy surface (Alder and Wainwright
1957).
F = m * a Eq. 2
The available energy is distributed between potential and kinetic energy, and molecules are
thus able to overcome barriers separating minima if the barrier height is less than the total
energy minus the potential energy. Newton’s equation requires small time steps (femtosecond
level) to be integrated, which makes the simulation time short, nowadays usually in the
nanoseconds range, even though the microsecond barrier has been reached for a 36-residue
peptide with a Cray T3E supercomputer (Duan and Kollman 1998). This, in parallel with the
use of physiologically relevant temperatures, means that in practise only the local area around
the starting point is sampled and that relatively small barriers can be overcome. Molecular
dynamics simulations have evolved from the first simulation of a protein in vacuum
(McCammon et al. 1977) to include realistic description of solvent effects (Berendsen et al.
1981). This allows MD simulations to give insights into the natural dynamics on different
timescales of biomolecules in solution (Hansson et al. 2002). Popular implementations are the
CHARMM, AMBER and GROMACS (Lindahl et al. 2001) packages. The GROMACS
package was used in paper VI.
Simulated annealing
In simulated annealing the starting temperature is chosen to be high, 2000 – 3000 K, and then
gradually decreased during an MC or MD run. This allows an initially extensive sampling of
the conformational space of the molecule, and as the temperature decreases the molecule is
trapped in a minimum. If the decrease is infinitely slow the molecule would be trapped in the
global minimum, but that would require infinite computational time as well (Kirkpatrick et al.
1983).
Docking calculations
The solution to the docking problem in computational biology is the bound conformation of a
substrate to a macromolecular receptor with a correct estimation of the binding energy.
Reliable docking depends on both an adequate scoring function, whose global minimum
Biocomputational studies on protein structures
28
corresponds to the biologically relevant complex, and an optimisation algorithm that
consistently finds that global minimum.
Rigid body dockings
The earliest docking methods used rigid body docking techniques (Kuntz et al. 1982), where
both the receptor and the ligand are represented as rigid bodies. In this approach the docking
problem is simplified to a search of surface complementarities. With the rigidity of the
molecules the search space is reduced to three variables that determine the rotational and
translational movements of the ligand. Even this three-dimensional search space is hard to
completely cover with a brute force approach and often stochastic search schemes such as
Monte Carlo or simulated annealing techniques are used.
Geometric surface models and data structures are used in order to find reasonable binding
modes, and heuristic cost functions – mostly rating some notion of geometric fitness – to rank
candidate complexes.
The first automated docking program was the DOCK program (Kuntz et al. 1982), which uses
a graph theory derived search algorithm in its later apparitions (Ewing et al. 1997) that
superimposes ligand atoms onto predefined site-points that map the negative image of the
binding site. The hits are scored for the inter-molecular interactions, based on the inter-
molecular terms of a molecular mechanics force field.
The gain of using this method is the swiftness of the process, which allows large databases of
compounds to be screened versus a receptor (Schneider and Böhm 2002). A drawback of the
method is that the ligand and receptor has to be co-crystallized to achieve high efficiency of
the method (Lengauer and Rarey 1996).
Flexible ligand – rigid receptor
As more powerful computers were developed, the difficulty of the docking problem was
increased, by allowing internal rearrangements of the ligand, by introducing rotation around
bonds. These techniques involve a flexible ligand and rigid receptor. This allows for the
discovery of novel substrates for a receptor at the expense of more computationally
demanding calculations (Lengauer and Rarey 1996). The free variables added to the
calculations are the rotations around bonds in the ligand. In the rigid body approach, three
Docking calculations
29
variables were needed to rotate and position the substrate, each rotatable bond increases the
dimension of the problem by one. Thus, for a ligand with two rotatable bonds, there are five
variables to optimise. As the ligands often are small with relatively few variables, the speed of
the dockings is high and a large number of possible ligands can be tested. The docked
conformations are scored by semi-empirically derived force field based energy functions
(Sippl 1995, Halgren 1995b). In order to speed up the scoring of the docked conformations,
the macromolecular receptor can be represented by grids that contain the affinities for each
atom type and the possibilities for electrostatic interactions (Morris et al. 1996, Totrov and
Abagyan 1996). The gain is that the grids allow fast calculation of the score of the docked
conformation. Usually Monte Carlo algorithms (Morris et al. 1996) or genetic algorithms
(Jones et al. 1997, Morris et al. 1998) are used to efficiently sample the possible
conformations. Some popular programs of this type are AutoDock (Goodsell and Olsen
1990), FLEXX (Rarey et al. 1996) and DOCK version 4 (Ewing et al. 2001).
Flexible ligand – flexible receptor
A natural extension of the above technique is to introduce flexibility in the receptor as well. A
first step is to allow rearrangement of the protein’s side-chains. This allowed the successful
docking of new classes of substrates to receptors (Totrov and Abagyan 1994, Totrov and
Abagyan 1997). The extra free variables introduced to the system makes the approach not
feasible for scanning of large compound databases. To diminish the computational need, side-
chains located at a distance from the active site are often fixed (paper III + IV).
All these methods have trouble dealing with induced fit mechanisms, common in some
enzyme families (Najmanovich et al. 2000). A recent paper estimates that 50% of the
imaginable protein-ligand complexes are not reachable by today’s techniques due to induced
fit mechanisms (Österberg et al. 2002). It can be somewhat circumvented by using receptor
structures already in the active conformation, which are co-crystallized with a substrate.
Computational techniques that address the problem have been developed. One is based upon
selecting hinge regions in the proteins that can, as the name suggests, function as hinges in the
receptor and thus allow opening and closing movements of the receptor (Sandak et al. 1998).
Another technique, which is a development of the popular Autodock software, create grids
based upon both the bound complex and the free receptor and thus accounts for the
conformational change upon binding of substrate (Österberg et al. 2002).
Biocomputational studies on protein structures
30
Comments on docking methods
The two most attractive techniques are the grid docking methods, for their swiftness, and the
flexible ligand – flexible receptor methods, for their thoroughness. The success of the grid
docking requires an active conformation of the active site, as it cannot accommodate large
rearrangements. Smaller rearrangements can be incorporated by allowing a certain overlap of
ligand and receptor, which later can be relieved by an all-atom refinement of the top-scoring
complexes. The need for larger or multiple side-chain rearrangements can only be
accommodated by the flexible ligand – flexible receptor technique. Thus, this technique is
often suited for larger substrates, since they quite naturally often require more side-chain
movements to fit in the active site. The same is true for dockings of substrates into models,
since the side-chains most likely are not in an optimal placement for ligand binding. This was
demonstrated in the recent docking of carcinogenic dibenzopyrene diol epoxides to
glutathione transfereases (Dreij et al. 2002).
Homology modelling
Homology modelling or comparative modelling methods, first reported by Browne et al.
(1969), make it possible to predict the three-dimensional structure of a protein by using
information derived from a homologous protein of known structure (Sali and Blundell 1993,
Sanchez and Sali 1997, Cardozo et al. 1995) without attempting the laborious task of ab initio
structure prediction.
The basis for this approach is that two proteins that have evolved from the same ancestral
protein retain the overall fold. The fold is more conserved through evolution than the
sequence of a protein. It acts as a scaffold upon which new functions can be attached by
changing the amino acid sequence. This has led to the evolution of superfamilies such as
MDR (Persson et al. 1994) and SDR (Jany et al. 1984, Persson et al. 1991, Jörnvall et al.
1995), studied in this thesis, with a conserved overall fold and many divergent functions.
In order to construct a homology model, the query sequence has to be aligned with one or
more homologous proteins to be used as templates in the modelling process. If the sequence
identity between query and template falls below 30%, the quality of the alignment will
deteriorate (Venclovas et al. 1999), yielding more unreliable models. There are two main
methods for the transfer of three-dimensional coordinates from the template to the query.
Fragment-based homology modelling methods identify structurally conserved regions through
Docking calculations
31
the alignment, these conserved fragments usually correspond to secondary structure elements
and the coordinates of the templates are copied to them. The variable parts correspond to a
large extent to loop structures, whose conformations are investigated through searches in a
structural database of loop structures (Blundell et al. 1987, Levitt 1992). The restraint-based
homology modelling methods uses the alignment to infer geometrical restraints, such as limits
of distances of Cα-atoms, ranges of backbone and side-chain angles. These restraints are then
combined with an energy function to obtain a scoring function that is used to calculate the
resulting structure of both the structurally conserved regions and the loop regions (Sali and
Blundell 1993, Cardozo et al. 1995). This has the advantage that the structure generation is
not divided into two separate stages and the prediction method is thus more robust (Forster
2002). The hardest problem is the correct prediction of the loop regions, which has all the
problems associated with ab initio structure prediction, but limited in size and with defined
endpoints. The accuracy of loop-predictions has been shown to increase by taking a
deformation zone into account, by extending the loops a few residues in each direction from
the gap in the alignment. The procedure can accommodate the change in the local
environment caused by the deviation from the fold of the template structure (Abagyan et al.
1997).
If the sequences are distantly related, additional information to assist in the alignment
procedure is of great value. A multiple alignment of related proteins, even if they lack
experimentally verified structures, adds information of conserved sequence stretches that may
correspond to secondary structure elements and regions that allow insertions and deletion that
may hint at loop regions. Similarly secondary structure predictions may assist the assignment
of equivalent secondary structure elements between the query and the template, in addition
information about the placement of gaps can be inferred from the predictions. This strategy
has been used in paper IV.
Biocomputational studies on protein structures
32
Aim of the study
The aim of the study was to apply different bioinformatic and biocomputational methods to
medium-chain dehydrogenases/reductases (MDR), short-chain dehydrogenases/reductases
(SDR) and biologically active peptides, with an emphasis on methods that utilise structural
information, in order to study and characterise proteins within these families. More
specifically the work has been directed toward:
• To characterise the MDR superfamily in complete genomes (papers I and II)
• To explain why isoUDCA is a substrate for γγADH but not for ββADH (paper III)
• To obtain a model of human type 10 17β-hydroxysteroid dehydrogenase and use it as
a base to explain the binding properties of steroid substrates to the protein (paper IV)
• To characterise the role of a conserved residue in SDR fold (paper V)
• To analyse the stability of secondary structure elements in amyloidogenic peptides
(paper VI)
• To compare the sequence variation between species of peptide hormones, including
the C-peptide of proinsulin (paper VII)
Project overview
33
Project overview
Genome comparison and modelling of active sites in the MDR
family (paper I)
The availability of completed genomes provides a great opportunity to analyse all members of
enzyme families. We have therefore now studied the medium-chain dehydrogenases/
reductases (MDR) enzymes in all eukaryotic genomes available and, for comparison, the
Escherichia coli genome.
The genomes were searched for MDR members using FASTA (Pearson and Lipman 1988)
with known MDR proteins as query sequences. Multiple sequence alignments were calculated
using ClustalW (Thompson et al. 1994). The evolutionary tree, constructed from the aligned
MDR sequences from the six genomes, allows distinction of several subgroups (schematic
representation in Fig. 2). Three families are formed by dimeric alcohol dehydrogenases
(ADH; originally detected in animals/plants), cinnamyl alcohol dehydrogenases (CAD;
originally detected in plants) and tetrameric alcohol dehydrogenases (YADH; originally
detected in yeast). Although similar in name, those three families are clearly different and
separated structurally. Three further families are centred on forms initially detected as
mitochondrial respiratory function proteins (MRF), acetyl-CoA reductases of fatty acid
synthases (ACR), and leukotriene B4 dehydrogenases (LTD). The two remaining families,
with polyol dehydrogenases (PDH; originally detected as sorbitol dehydrogenase) and
quinone reductases (QOR; originally detected as ζ-crystallin), are also distinct but with
variable sequences.
The most common family in the human genome is ADH with nine members (Fig. 4) (further
investigated in paper II), followed by QOR with seven members. A. thaliana has a similar
pattern, but also has a large number of CAD and LTD members. In the unicellular organisms,
PDH is the most common family, pointing at different living conditions between unicellular
and multicellular organisms.
Biocomputational studies on protein structures
34
A. thaliana
ADH
CAD
PDHQOR
MRF
LTD
Others
D. melanogaster
ADH
PDH
QORMRF
ACR
Others
H. sapiens
ADH
PDH
QOR
MRF
LTD
ACR
Others
Sum:23 Sum:38 Sum:10
C. elegans
ADH
YADH
PDHQOR
MRF
LTD
ACR
Others
Sum:13Sum:13
S. cerevisiae
ADH
CAD
YADHPDH
QOR
MRFLTD
Sum:15
E. coli
ADH
CAD
YADH
PDH
QOR
LTD
Others
Sum:17 Fig. 4. Distribution of MDR members among the different organisms investigated, the total number of members in each genome is also given.
The quinone reductase family members are homologous to the synaptic vesicle protein VAT-
1, indicating a possible neuronal function for the whole family. The polyol dehydrogenases
are numerous in the E. coli genome, contributing to half of the MDR forms, emphasising their
importance in carbohydrate metabolism. Molecular modelling of the active sites was
performed for members of the different subgroups in order to give estimates on the geometry,
hydrophobicity and volume of the substrate-binding pockets. The results are summarised in
Table 1.
Table 1. Active site characteristics for the investigated families and common functions of the members. Volume in parenthesis is for all but one member.
Active site characteristics Family
Volume (Å3) Hydrophobicity index
Function
PDH 77–257 (144–257)
–1.28 � +0.77 Polyol dehydrogenases Threonine 3-dehydrogenase
CAD 160–248 –0.69 � +0.49 Mannitol dehydrogenase Cinnamyl-alcohol dehydrogenase
QOR 124–289 –0.56 � +1.47 Quinone oxidoreductase Synaptic vesicle membrane protein VAT-1 homologue
MRF 168–243 –1.48 � –0.18 Mitochondrial respiratory function protein
LTD 161–291 –1.65 � –0.02 NADP-dependent leukotriene dehydrogenase
ACR 140–189 +0.64 � +1.08 Fatty acid synthase
Project overview
35
The volumes of the active sites are more or less in the same range, with a broad span of
volumes for each family, except for the narrow range of the ACR family, explained by few
members with a high degree of sequence similarity. MRF and LTD clearly prefers hydrophilic
substrates while ACR prefers hydrophobic substrates, quite natural considering its role in fatty
acid synthesis. The variation of the hydrophobicity in PDH, CAD and QOR indicates a wide
variety of substrates for the proteins in these groups.
The sequence comparisons and subdivisions make it possible to define sequence patterns
useful for characterisation of MDR members. For QOR, a Prosite pattern (Hofmann et al.
1999) already exists (PS01162). However, this pattern only finds 5 of our presently
recognised 18 QOR members. Based upon the sequences now available, we propose a new
pattern that better will detect the QOR members (Table 2). It finds 15 of the QOR members,
which is three-fold more than the existing Prosite pattern, and it misses only 3 QOR forms,
while detecting no false positives. The patterns for the PDH and the CAD families are based
on residues that bind the catalytic zinc and the substrate. The PDH pattern (Table 2) is a bit
complex but captures all the different substrate specificities of this group and only one false
positive when the pattern is screened versus Swissprot. The MRF, LTD and ACR patterns
(Table 2) are highly specific. They find no false positive matches and miss not one of our
members.
The patterns are useful for proper recognition of new genomic sequences. They allow rapid
annotation into the different families of the MDR superfamily from the huge amounts of
sequences generated by ongoing genome projects. They are also ideal for finding particular
enzymes in the ever-increasing sequence databases.
Table 2. Sequence patterns and screening results. The column hits gives number of MDR forms detected in the six genomes investigated, fp (false positives) gives number of non-members detected in Swissprot, fn (false negatives) gives the number of proteins classified as a member but not found by the pattern. Family Pattern hits fp fn QOR [GAS]-x-N-x(2)-[DEN]-x(5)-G-x(6,19)-[PS]-x(3)-[GA]-x-[ED]-x(2)-
G-x-[VIL]-x(3)-G 15 0 3
MRF L-x(6)-[VL]-T-Y-G-G-M-[SA]-[KR] 6 0 0 PDH [GA]-[VIL]-[CS]-[GN]-[STA]-D-[VILMS]-[HKP]-x(14,27)-G-H-
[ED]-x(2)-G-x-[VI]-x(10,12)-G-[DEQ]-x-[IV] 22 1 0
CAD C-G-x-C-x(2)-D-x(17)-G-H-E 12 0 0 LTD D-x-[YF]-x-[DE]-N-V-G-[GS]-x(3)-[DEN] 16 0 0 ACR W-x(5)-W-x(8)-P-x(2)-Y-x(3)-Y-Y 5 0 0
Biocomputational studies on protein structures
36
Differential multiplicity of MDR alcohol dehydrogenases (paper II)
The originally purified ADH enzyme types (yeast, Drosophila, liver and plant) were screened
for in the genomes of human, Arabidopsis thaliana, Saccharomyces cerevisiae and
Drosophila melanogaster, and were compared, where relevant, to corresponding forms in
mouse, pea, Escherichia coli and Caenorhabditis elegans. The screenings revealed the
presence of ten human, ten Arabidopsis, five yeast, and one Drosophila MDR-ADH genes. Of
the human genes, three had properties like pseudogenes (recognized by unusual exon/intron
patterns, combined with a lack of upstream promoter elements such as TATA, CAAT or GC
boxes, and with deviating chromosome localizations) and the remaining seven genes
correspond to the enzymes of classes I (3 genes) and II–V (one gene each) (Jörnvall and Höög
1995, Duester et al. 1999). These seven genes are all on chromosome 4 (Fig. 5), compatible
with the early assignments (Duester et al. 1985).
Fig. 5. Gene organisation of ADH on chromosome 4
The screenings also show that ethanol-active MDR-ADH, when present, appears to be of
multiple occurrences. The multiplicity (Jörnvall et al. 2000) and the general occurrence of
ethanol-activity (Fernández et al. 1995) have been noticed before, but can now be correlated
with enzyme type and with eukaryotic type of organism. These and other functional
conclusions are best noticed when the MDR-ADH enzymes are divided into the two
constituent families of dimeric and tetrameric ADHs, initially represented by the liver and
yeast ADHs, respectively.
The dimeric ethanol-active ADHs divergence into at least three groups, corresponding to the
formaldehyde-active class III enzyme from all species, the ethanol-active non-class-III
enzymes from animals, and the ethanol-active non-class III enzymes from plants (Fig. 6).
Project overview
37
Fig. 6. Evolutionary tree of dimeric ADH. Tree created in ClustalW and viewed in Treview.
Several functional conclusions regarding these enzymes can be drawn by the evolutionary
tree:
• Only plants and vertebrates appear to have the ethanol-active dimeric MDR-ADHs.
• Extensive multiplicity occurs both in vertebrate and plant groups of these enzymes.
• The patterns in the non-class-III groups from vertebrates and plants show some
similarities, with both early and late branching frequently at similar levels.
A comparison of animal and plant lines of MDR-ADH, with species variants included, shows
that the time estimates, calculated in two different manners for the species separation, in the
class III line are too distant, falsely suggesting the class III species separations to be further
away than those for classes I and P and also more distant than accepted radiation nodes of
higher vertebrates. Hence, the constant speed expected for the class III evolution may be
slightly higher than calculated before (Cañestro et al. 2002), or class III may evolve faster in
higher eukaryotes, perhaps giving novel enzymogenesis and additional functions for class III
Biocomputational studies on protein structures
38
in higher eukaryotes, like already known for class I with its pattern of recent isozyme
emergence (Jörnvall and Höög 1995, Duester et al. 1999).
The present pattern, with the distinction towards absence of dimeric ADH between
higher/lower eukaryotes, and with partly parallel patterns in plants and animals, may suggest
that a specific function is associated with ADH in higher eukaryotes. If so, dimeric ADH may
have key functions in higher eukaryotes rather than just general detoxications (Jörnvall et al.
2000).
A similar visualization of the tetrameric family shows considerably less multiplicity and
places yeast ADH with ADH of C. elegans, and with some E. coli and A. thaliana forms.
Thus, C. elegans and A. thaliana have both dimeric and tetrameric ADH enzymes.
Molecular modelling and docking of bile acids to γγγγγγγγ and ββββββββ ADH
class I (paper III)
A three-dimensional model of the class I alcohol dehydrogenase (ADH) γ subunit was
obtained by adopting its amino acid sequence into the known fold of horse liver alcohol
dehydrogenase (PDB code 3bto) using the program ICM (Molsoft LLC, Abagyan et al. 1994).
For docking studies using the human class I ADH ββ dimer, the structure was taken from the
PDB file 1deh. Docking of bile acids to the dimeric enzyme (ββ-structure and γγ-model) was
performed using a non-rigid procedure, allowing free movement of the substrate, the rotatable
bonds of the substrate, the χ angles of the residues within 7 Å from the active site and with
additional distance restraints of 2.0–2.4 Å between the O-atom at C-3 of the bile acids and the
catalytic zinc and the H-atom at C-3 of bile acids and the C-4 of the nicotinamide of NAD+.
Molecular modelling of the human class I γ ADH subunit and docking calculations
demonstrated that iso-ursodeoxycholic acid (isoUDCA) fits into the active site tightly
surrounded by mainly hydrophobic residues (Fig. 7), and that it interacts with the catalytic
residues in a preferable way (Table 3). For comparison, similar docking calculations were
performed with UDCA. This substrate could not form the interactions crucial for catalysis
(Fig. 7, Table 3). If the 3α-hydroxyl group was forced to fit at the active site, the rest of the
steroid was tilted in comparison to the 3β-hydroxyl epimer and positioned away from the
Project overview
39
Fig. 7. isoUDCA (dark gray) and UDCA (light gray) docked into the active site of ��� ADH. Residues within 5 Šof the bile acids are displayed in stick models and with their surfaces transparently displayed. Residues F93, T94, Y110, N114 and L116 were omitted for increased visibility. Arrows show catalytic interactions between the ligand , O� of S48 and C4 of NAD+. The picture was created in ICM version 2.8 (Molsoft LLC, San Diego, CA).
active-site channel, compatible with the fact that this conformation is not suitable for efficient
catalysis.
The same calculations were also performed for the human class I ββ ADH where the presence
of Thr48 (instead of Ser48 as in the γγ isozyme) caused inefficient binding in accordance with
kinetic data. Distances crucial for catalysis are collected in Table 3. Distances required for
catalysis, in the range of 2 – 2.4 Å, are obtained only by isoUDCA in ��� ADH.
Table 3. Distances in Ångström obtained from the docking calculations for isoUDCA and UDCA
Isozyme Substrate Oγγγγ(Ser48) -H(C3-OH) Zn -O(C3-OH) C4(NAD+) –H(C3) ADH ββ UDCA 2.31 2.51 2.94 ADH ββ isoUDCA 3.35 2.24 1.94 ADH γγ UDCA 3.20 2.88 3.53 ADH γγ isoUDCA 2.22 2.30 2.19
Biocomputational studies on protein structures
40
Molecular modelling and substrate docking of human type 10 17ββββ-
hydroxysteroid dehydrogenase (paper IV)
The human type 10 17β-hydroxysteroid dehydrogenase (17β-HSD-10) is a multifunctional
mitochondrial enzyme. It catalyses the oxidative inactivation at C17 of androgens and
estrogens. However, it also mediates oxidation of 3α-hydroxy groups of androgens, thereby
reactivating androgen metabolites. Finally, it is involved in β-oxidation of fatty acids by
catalysing the L-hydroxyacyl CoA dehydrogenase reaction of the β-oxidation cycle. Since no
three-dimensional structure of 17β-HSD-10 was available, homology modelling was carried
out to understand the molecular basis of these substrate specificities. The template in the
homology modelling was 7α-HSD (1fmc; Tanaka et al. 1996), which has a sequence identity
of 27% to the target protein. The low sequence identity required special care in the modelling
process.
The molecular modelling indicated that the reaction resulting in 3α-OH/3-oxo conversion can
be predicted to occur with 5α-reduced steroids such as 3α-ol,5α-androstane,17-one, with
optimal distances for the 3α-OH hydrogen bonding to the tyroxyl Oη 168 of 1.89 Å, to C-4 of
NAD+ of 2.23 Å, and a distance between 3α-OH oxygen to Ser155 side-chain Hγ of 1.96 Å.
Oxidation of the 3β-hydroxyl as in 3β-ol,5α-androstane,17-one or 3β-ol,5α-androstane,17β-
ol is not possible, due to a predicted distance of over 4 Å between the 3β-OH oxygen and the
Ser155 side chain. This distance will not allow the hydrogen bond necessary for catalysis.
Docking simulations with 5β reduced steroids, such as bile acids (isoUDCA and UDCA)
reveal again unfavourable distances to Ser155, thus preventing catalysis to take place. The 17
β-OH dehydrogenase reaction is possible with steroids of the androgen and estrogen classes,
yielding favourable geometry and distances for 3β,17β,5α-androstane-diol, testosterone, 5α-
dihydrotestosterone and estradiol. These predicted properties correlate with experimental data.
After submission of this manuscript another group solved the structure of the rat homologue
by crystallography (Powell et al. 2000). This structure confirms large portions of our model
including some of the enzyme–substrate complexes that we predicted. In Fig. 8, our model
and the crystal structure backbones are superimposed. The cores of the structures superimpose
well, including the majority of the active site residues. Parts of the predicted substrate-binding
loop are missing in the crystal structure, indicating an unordered structure of that part.
Project overview
41
Nevertheless, we have correctly predicted the start and end of the loop. The dark-coloured
regions corresponding to mispredicted areas are mainly centred in loops, which are
notoriously hard to predict, and secondary structure elements at the edges of the structure. The
entire C-terminal part from Asn243 to Pro261 is mispredicted due to an alignment error to the
template, and the whole region is shifted one residue. The placement of the NAD+ is highly
accurate in the vicinity of the active site and allow us to correctly predict which substrates that
may be catalysed by the enzyme.
The overall quality of our predictions makes us feel that it can be justifiable to create
homology models based on templates with quite low sequence identity in homology
modelling, in our case a successful modelling was created at a sequence identity level below
30% if the overall fold of the family is conserved
.
Fig. 8. Model of ERAB and crystal structure of the rat homologue superimposed, white areas were correctly predicted, while dark areas were mispredicted. NAD+ and substrate of the model are coloured light grey; while in the crystal structure they are darker. Figure created in ICM version 2.8 (Molsoft LLC, San Diego, CA)
Biocomputational studies on protein structures
42
Structural Role of Conserved Asn179 in the Short-Chain
Dehydrogenase/Reductase Scaffold (paper V)
Short-chain dehydrogenases/reductases (SDR) constitute a large family of enzymes found in
all forms of life. Despite a low level of sequence identity, the three-dimensional structures
determined display a nearly superimposable α/β folding pattern. We identified a conserved
asparagine residue located within strand βF and analysed its role in the short-chain
dehydrogenase/reductase architecture. Mutagenetic replacement of Asn179 to Ala in bacterial
3β/17β-hydroxysteroid dehydrogenase yields a folded but enzymatically inactive enzyme.
Significantly altered unfolding characteristics indicate an apparent increase in stability.
Crystallographic analysis of the wild-type enzyme at 1.2 Å resolution reveals a hydrogen
bonding network including a buried and well-ordered water molecule, connecting strands βE
to βF (Fig. 9), a common feature found in 16 of 21 known three-dimensional structures of the
family.
Fig. 9. Strands βE and βF and interactions of Asn179, with distances given in Ångström. Figure created with ICM, version 2.8 (Molsoft LLC, San Diego, CA) This segment interacts with the active site and the substrate-binding region, which could
explain the loss of function in the N179A mutant. The main interactions of this conserved site
are made through side-chain to main-chain atoms, which makes it impossible to trace them
through sequence comparisons, since no evolutionary pressure is applied on the residues in
order to maintain the interactions. Based on these results we predict that in mammalian 11β-
Project overview
43
hydroxysteroid dehydrogenase the essential Asn-linked glycosylation site, which corresponds
to the conserved segment, displays the same structural features and has a central role to
maintain the SDR scaffold, rather than being a carbohydrate attachment point.
Molecular dynamic studies of sequence influence upon fibrillation
(paper VI)
Diseases such as the prion diseases and Alzheimer’s disease (AD), caused by fibrillating
proteins are gaining increased attention due to their importance in dementia among an aging
population. Presently, over 20 diseases are known to be linked to fibrillating proteins in
various tissues. We have studied the stability of the helical peptides involved in AD and
pulmonary alveolar proteinosis (PAP) using 4 nanoseconds molecular dynamics simulations.
The effect of stabilising replacements and the effect of the mutation linked to the hereditary
cerebral amyloidosis of the Flemish type are also investigated.
The stability of the peptides is clearly related to their amino acid sequence. The amyloid β-
peptide (Aβ) has lost most of its helical structure midway through the simulation and assumes
a more or less unordered structure with some helical tendencies to it (Fig. 10A), much like the
variant involved in hereditary cerebral amyloidosis of the Flemish type (Aβ Flemish type),
which loses all of its helical tendencies and even show β-strand tendencies (Fig. 10C). The
experimentally verified stable analogue Aβ (K16A+L17A+F20A) retains a helical structure
during the whole 4 ns simulation (Fig. 10B). Surfactant protein-C (SP-C) is more stable than
Aβ but the helix collapses into short segments with helical properties (Fig. 10D). The stable
analogue SP-C (Leu) maintains a rigid helix during the whole simulation (Fig. 10E) while the
tri-substituted SP-C (V18L+V20L+V24L) shows less helical character than SP-C (Leu), but
more than wild-type SP-C (Fig. 10F).
The results can be used to design more stable analogues based on the concepts derived from
this study. The deleterious effect of the mutations in hereditary cerebral amyloidosis of the
Flemish type can be explained by decreased α-helical stability of the peptide. We have also
investigated the secondary structure propensities and a correlation between β-propensities and
fibrillation was found, compatible with the molecular dynamics results.
Biocomputational studies on protein structures
44
A B C D E
F 0 ps 2000 ps 4000 ps Fig. 10. Snapshots of the development of the structure during the MD simulations. A: Aβ B: Aβ (K16A+L17A+F20A) C: Aβ Flemish type (A21G) D: SP-C E: SP-C (Leu) F: SP-C (V18L+V20L+V24L)
Project overview
45
A structural basis for proinsulin C-peptide activity (paper VII)
The insulin and C-peptide parts of proinsulin differ markedly in interspecies variability. With
37 forms known in primary structure, C-peptide is between one or two orders of magnitude
more variable than insulin, is about as variable as relaxin in the same protein family, and
shows a tripartite variability pattern. Considering the mammalian C-peptides, 9 positions of
the N- and C-terminal are conserved. In contrast, mid-positions of C-peptide are highly
variable and susceptible to mutational removal. These structural features are consistent with
binding interactions recently defined experimentally (Ohtomo et al. 1998, Rigler et al. 1999,
Wahren et al. 2000, Pramanik et al. 2001), highlighting an importance of the C-terminal
pentapeptide.
The main argument against a hormonal function for the C-peptide is the large species
variability, with no strictly conserved residue. To investigate this apparent problem, 15
peptide hormones were investigated by constructing multiple alignments. The results are
shown in Table 4 where C-peptide is grouped together with three other highly variable, but
well-established peptide hormones. This puts the variability of C-peptide in a new perspective
and indicates the possibility of hormonal activities of C-peptide.
Table 4. Sequence variability among peptide hormones, differences are given in % identity and in PAM units (accepted point mutations) Hormone n Human – Rat Human – Chicken Human – Cod Human – Hagfish % PAM % PAM % PAM % PAM Somatostatin 14 0* 0 0 0 - - 0 0 Glucagon 29 0 0 0 0 - - - - VIP 28 0 0 14 16 21 25 - - Substance P 11 0 0 9 10 27 33 - - GIP 42 5 5 - - - - - - Insulin 51 8 8 14 16 - - 35 47 IGF-1 70 4 4 11 12 - - - - IGF-2 67 6 6 15 17 - - - - CCK 33 9 10 39 55 - - - - Secretin 27 11 12 48 75 - - - - Pancreatic polypeptide
36 22 26 56 97 - - - -
Gastrin 17 18 21 - - - - - - GRP 27 26 32 30 38 - - - - Parathormone 84 27 33 62 119 - - - - C-Peptide 31 32 42 68 148 - - 93 988 Relaxin 54 48 75 - - - - - - *Comparison versus the mouse homologue instead of rat
Biocomputational studies on protein structures
46
Conclusions
• Sequence comparisons are important and fruitful to examine evolutionary and
functional relationships in families of both large and small proteins/peptides.
• Docking calculations is a powerful tool to elucidate substrate specificities and to
investigate the molecular mechanisms of substrate binding.
• Homology modelling is feasible below 30% sequence identity if the fold of the family
is well conserved. Loops and outer parts are likely to be slightly mispredicted, while
the accuracy of the core and the active site can be expected to be high.
• A multiple structural alignment is a powerful complement to traditional multiple
sequence alignments, as factors not obvious from the sequence conservation may be
detected. The importance of structural comparisons is expected to increase as more
structures become available.
• MD simulations are ideal in studies of the stability of small peptides, as the timescales
practically reachable are biologically relevant.
• The whole genome comparisons of the MDR superfamily has given further insights
into the evolution of this superfamily and several new families have been
characterised. Different multiplicity patterns of the superfamily are also detected in
higher and lower eukaryotes, indicating different functions.
• The substrate specificity of γγADH and ββADH is primarily determined by the
Ser48/Thr48 replacement, where the Ser in γγADH allows the oxidation of bulkier
secondary alcohols.
• The model of type 10 17β-HSD is compatible with known enzymatic data and can be
used to explain enzyme-substrate interactions at a molecular level.
Conclusions
47
• The structural analysis of the SDR superfamily revealed a conserved hydrogen-bonded
network centered around a conserved asparagine, common for 16 of the 21 proteins of
the SDR with experimentally verified structure.
• MD simulations of amyloidogenic helical peptides and nonfibrillating variants point at
the importance of the stability of the helical portion for their amyloidogenic
tendencies.
• The helical peptides involved in pulmonary alveolar proteinosis and Alzheimer’s
readily lose their helical conformation without the influence of external factors.
• The sequence variability of the C-peptide does not exclude it from being a peptide
hormone, as other well-characterised peptide hormones exhibit similar variability
patterns
Perspectives
One of the driving forces in computational biology is the parallel development of faster and
more powerful computers, which allows addressing more complex systems by computational
methods. The Blue Gene project by IBM, where they construct a supercomputer to try to fold
a protein by MD simulations, is one of the projects that most obviously benefits of new
technology. The recent advances in the field of computer technology and the construction of
more efficient algorithms in MD simulation software indicates that the current limiting factor
is the crudeness of the force fields used in the simulations.
The development in biology is of more importance than technological breakthroughs. The
completion of the human genome last year gave a renaissance for the field of sequence
comparisons, and especially gene prediction algorithms are and will be of utter importance in
the near future. Increase of the biological knowledge about the mechanisms that govern each
process, will provide the basis for making better prediction algorithms for transmembrane
proteins, subcellular localisation, glycosylation and secondary structure.
Biocomputational studies on protein structures
48
The project that will have greatest impact on homology modelling is the structural genomic
initiative, where proteins without homology with experimentally determined structures, are
selected for crystallisation. The aim is to cover all existing folds of soluble proteins, which
would make it possible to predict the structure of all known proteins. The initiative has
identified 5285 proteins that will cover the three-dimensional space of protein folds with a
reasonable small gaps to fill up by homology modelling, of them 459 has been crystallized to
date and 15 have their structure deposited in PDB (www.jcsg.org).
The databases of single nucleotide polymorphisms (SNP) can give enormous amounts of both
disease related and structurally relevant information, especially in relation to known or
modelled three-dimensional folds.
The enormous amount of biological information generated worldwide threatens to drown the
relevant information for each researcher. Therefore are initiatives like GPCRDB (ref)
valuable, were a consortium of researchers gather all available data about a protein family and
makes it clearly and easily available to the rest of the world, in this case it is G-protein
coupled receptors, but the concept could be transferred to any protein family.
As more resources are directed into the field of bioinformatics the harder problems are no
longer shunned. Therefore I hope to see an algorithm within 5–10 years that produces fairly
accurate (RMSD below 2Å) estimates of the three-dimensional structure of proteins just based
on the sequence of the protein.
References
49
Acknowledgements
This thesis was carried out at Chemistry I, Department of Medical Biochemistry and
Biophysics, Karolinska Institutet, Stockholm, Sweden. I am very grateful to all of you that
have helped me in the completion of my thesis. In particular I would like to express my
gratitude to the following persons.
Bengt Persson, my supervisor, for rescuing me from the lab-bench and toxic fumes in the far
north and allowing me to work in a field which I barely could dream of during the cold arctic
nights in Sundsvall.
Hans Jörnvall, the head of Chemistry I, for your enthusiasm and for freely sharing your
knowledge and for providing excellent working facilities.
The people that lured me into going for a Ph.D after my exjobb, Magnus Gustafsson, Malin
Hult, Tim Prozorovski, Yuqin Wang, Yvonne Kallberg and Valentina Bonetto, I had a
really good time.
My roommate Yvonne Kallberg for lots of discussion about science, Sundsvall and unrelated
stuff in our dark cave, for nice conference and party company. Lately, our sanctuary has been
invaded by Annika Nor in who tried to let some sunshine into our lives, with varying degrees
of success.
The boys of Chemistry I with unhealthy interests in non-science projects, Andreas P.
Jonsson, Jesper J. Hedberg, Magnus Gustafsson, Mikael Henr iksson, Patr ik Strömberg
and Stefan Svensson, you’ re not safe yet.
My present and past student colleagues in Chemistry I:
Andreas Almlén, Juan Astorga-Wells, Ann-Char lotte Bergman, Peter Bergman, Åsa
Brunnström, Elo Er iste, Daniel Hirschberg, Char lotta Filling, Waltter i Hosia, Johan
Lengquist, Jing L i, Ingemar L indh, Char lotte L indhé, Suya L iu, Ermias Melles, Al
Necal, Johan Nilsson, Åke Norberg, Madalina Oppermann, Anna Päivio, Essam Refai,
Naeem Shafqat, Margareta Stark, Mar ia Tollin, Xiaoqiu Wu, Eli Zahou and Shah
Zaltash, for sharing the daily grind of a Ph.D.
Biocomputational studies on protein structures
50
I would to thank my main collaborators during these years, Udo Oppermann, Jan
Johansson and Hans Jörnvall, I learned a lot and hopefully science took an ant-step in the
right direction from my contributions.
Time to thank the HEJ-lab core:
Ann-Margreth Jörnvall, Ingegerd Nylander , I rene Byman, Ella Ceder lund, Car ina
Palmberg, Eva Mårtensson, Ulr ika Waldenström, Monica L indh, Marie Stahlberg,
Gunvor Alvelius, Jan Sjövall, Jan-Olof Höög, Mats Andersson, Birgitta Agerber th,
Mustafa El-Ahmad, Tomas Bergman, William Gr iffiths, Lars Hjelmqvist, Hanns-Ulr ich
Marschall, Rannar Sillard and Jawed Shafqat, there would be no HEJ-lab without you
guys.
I would also like to grasp the opportunity and say cheers to the friends I found in all over
MBB, if you’ re forgotten blame me, it’s late and everything else is finished, Jenny, Stina,
Eva, Helena (Structural Biology), Andis, Hans, Simone, Stefan (Organic chemistry),
Micke, Mattias, Pontus, (Kemi II), Aristi (Biochemistry).
I would like to express my deepest gratitude and love to my parents who always stood by and
supported me.
If it weren’ t for my sister Kar ins travelling, perhaps I wouldn’ t have found the courage to
leave the safe haven of Sundsvall, thanks a billion :-)
Mirna, finally I’m finished, I owe it all to you.
Ok, to sum it up, it’s been good; I will miss you all.
References
51
References Abagyan, R. and Totrov, M. Biased probability Monte Carlo conformational searches and electrostatic calculations for peptides and Proteins. J. Mol. Biol. 1994, 235:983-1002 Abagyan, R., Batalov, S., Cardozo, T., Totrov, M., Webber, J. and Zhou, Y. Homology modeling with internal coordinate mechanics: deformation zone mapping and improvements of models via conformational search. Proteins 1997, Suppl 1:29-37 Abagyan, R.A. and Batalov, S. Do aligned sequences share the same fold? J. Mol. Biol. 1997, 273:355-68 Abagyan, R.A. Towards protein folding by global energy optimization. FEBS Lett. 1993, 325:17-22 Agerberth, B., Gunne, H., Odeberg, J., Kogner, P., Boman, H.G.and Gudmundsson, G.H., FALL-39, a putative human peptide antibiotic, is cysteine-free and expressed in bone marrow and testis. Proc. Natl. Acad. Sci. U S A 1995, 92:195-199 Alder, B.J. and Wainwright, T.E. Phase transition for a hard-sphere system. J. Chem. Phys. 1957, 27:1208 Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990, 215:403-410 Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25:3389-3402 Amy, C.M., Witkowski, A., Naggert, J., Williams, B., Randhawa, Z. and Smith, S. Molecular cloning and sequencing of cDNAs encoding the entire rat fatty acid synthase. Proc. Natl. Acad. Sci. U S A 1989, 86:3114-3118 Andersson, M., Gunne, H., Agerberth, B., Boman, A., Bergman, T., Sillard, R., Joernvall, H., Mutt, V., Olsson, B., Wigzell, H., Dagerlind, A., Boman, H.G. and Gudmundsson, G.H., NK-lysin, a novel effector peptide of cytotoxic T and NK cells. Structure and cDNA cloning of the porcine form, induction by interleukin 2, antibacterial and antitumour activity. EMBO J. 1995, 14:1615-1625 Bairoch, A. and Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28:45-48 Baldi, P., Brunak, S., Frasconi, P., Soda, G. and Pollastri, G. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 1999, 15:937-946 Bansal, A.K. An automated comparative analysis of 17 complete microbial genomes. Bioinformatics 1999, 15:900-908 Beers, M.F. and Lomax, C. Synthesis and processing of hydrophobic surfactant protein C by isolated rat type II cells. Am. J. Physiol. 1995, 269:744-753
Biocomputational studies on protein structures
52
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A. and Wheeler, D.L. GenBank. Nucleic Acids Res. 2002, 30:17-20 Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F. and Hermans, J. Interaction models for water in relation to protein hydration. In: Intermolecular Forces, (ed. Pullman, B.) Reidel, Dordrecht, 1981, 331-342 Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H. , Shindyalov, I.N. and Bourne P.E. The Protein Data Bank. Nucleic Acids Res., 2000, 28:235-242 Blundell, T.L., Sibanda, B.L., Sternberg, M.J. and Thornton, J.M. Knowledge-based prediction of protein structures and the design of novel molecules. Nature 1987, 326:347-352 Bonneau, R., Tsai, J., Ruczinski, I., Chivian, D., Rohl, C., Strauss, C.E.M. and Baker, D. Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins 2001, Suppl 5:119-126 Bonnichsen, R.K. and Wassén, A.M. Crystalline alcohol dehydrogenase from horse liver. Arch. Biochem. Biophys. 1948, 18: 361-363 Boudet, A.M., Lapierre, C. and Grima-Pettenati, J. Biochemistry and molecular biology of lignification. New Phytol. 1995, 129:203-236 Browne, W.J., North, A.C., Phillips, D.C., Brew, K., Vanaman, T.C., Hill, R.L. A possible three-dimensional structure of bovine alpha-lactalbumin based on that of hen's egg-white lysozyme. J. Mol. Biol. 1969, 42:65-86 Brändén, C.I., Jörnvall, H., Eklund, H. and Furugren, B. Alcohol dehydrogenases. In: The enzymes, third edition (ed. Boyer, P.D.), chapter 3, Academic Press, 1975 Brändén, C. and Tooze, J., Introduction to protein structure, second edition. Garland Publishing Inc., NY, 1999 Burley, S.K. An overview of structural genomics. Nat. Struct. Biol. 2000 Suppl:932-934 Campbell, M.M. and Sederoff, R.R. Variation in lignin content and composition. Mechanisms of control and implications for the genetic improvement of plants. Plant Physiol. 1995, 110:3-13 Cañestro, C., Albalat, R., Hjelmqvist, L., Godoy, L., Jörnvall, H. and Gonzàlez-Duarte, R. Ascidian and amphioxus Adh genes correlate functional and molecular features of the ADH family expansion during vertebrate evolution. J. Mol. Evol. 2002, 54:81-89 Cardozo T, Totrov M, and Abagyan R. Homology modeling by the ICM method. Proteins 1995, 23:403-414 Chandrasekhar, I. and van Gunsteren, W.F. A comparison of the potential energy parameters of aliphatic alkanes: molecular dynamics simulations of triacylglycerols in the alpha phase. Eur. Biophys. J. 2002, 31:89-101
References
53
Chou, P.Y. and Fasman, G.D. Prediction of protein conformation. Biochemistry 1974, 13:222-245 Dayhoff, M.O. Schwartz, R.M. and Orcutt, B.C. A model of evolutionary change in Proteins. Matrices for detecting distant relationships In: Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, Washington DC, 1978 DeRisi, J.L., Iyer, V.R. and Brown P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997 278:680-686 Donadio, S., Staver, M.J., McAlpine, J.B., Swanson, S.J. and Katz, L. Modular organization of genes required for complex polyketide biosynthesis. Science 1991, 252:675-679 Dreij, K., Sundberg, K., Johansson, A.S., Nordling, E., Seidel, A., Persson, B., Mannervik, B. and Jernström, B. Catalytic activities of human alpha class glutathione transferases toward carcinogenic dibenzo[a,l]pyrene diol epoxides. Chem. Res. Toxicol. 2002, 15:825-831 Duan, Y. and Kollman, P.A. Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 1998, 282:740-744 Duester, G., Farrés, J., Felder, V.C., Höög, J.-O., Parés, X., Plapp, B., Yin, S.-y. and Jörnvall, H. Alcohol dehydrogenase nomenclature. Biochem. Pharmacol. 1999, 58:389-395 Duester, G., Hatfield, G.W. and Smith, M. Molecular genetic analysis of human alcohol dehydrogenase. Alcohol 1985, 2:53-56 Efron, B., Halloran, E. and Holmes, S. Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. U S A 1996, 93:13429-13434 Ellis, R.J., van der Vies, S.M. and Hemmingsen, S.M. The molecular chaperone concept. Biochem. Soc. Symp. 1989, 55:145-153 Epstein, C.J., Goldberger, R.F. and Anfinsen, C.B. The genetic control of tertiary protein structure. Model systems. Cold Spring Harbor Symp. Quant. Biol. 1963, 28:439-449 Ewing, T.J., Makino, S., Skillman, A.G. and Kuntz, I.D. DOCK 4.0: search strategies for automated molecular docking of flexible molecule databases. J. Comput. Aided. Mol. Des. 2001, 15:411-428 Ewing, T.J.A. and Kuntz, I.D. Critical Evaluation of Search Algorithms for Automated Molecular Docking and Database Screening. J. Comp. Chem. 1997, 9:1175-1189 Felsenstein, J. PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle. 1993 Felsenstein, J. Confidence-limits on phylogenies - an approach using the bootstrap. Evolution 1985, 39:783-791 Feng, D.F., Johnson, M.S. and Doolittle, R.F. Aligning amino acid sequences: comparison of commonly used methods. J. Mol. Evol. 1984-85;21:112-125
Biocomputational studies on protein structures
54
Fernández, M.R., Biosca, J.A., Norin, A., Jörnvall, H. and Parés, X. Class III alcohol dehydrogenase from Saccharomyces cerevisiae: Structural and enzymatic features differ toward the human/mammalian forms in a manner consistent with functional needs in formaldehyde detoxication. FEBS Lett. 1995, 370:23-26 Fiers, W., Contreras, R., Haegemann, G., Rogiers, R., van de Voorde, A., van Heuverswyn, H., van Herreweghe, J., Volckaert, G. and Ysebaert, M. Complete nucleotide sequence of SV40 DNA. Nature 1978, 273:113-120 Fitch, W.M. Toward defining the course of evolution: minimum change for a specific tree topology. Sys. Zool. 1971, 20:406-416 Fitz-Gibbon, S.T., and House C.H. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 1999, 27:4218-4222 Forster, M.J. Molecular modelling in structural biology. Micron 2002, 33:365-384 Galliano, H., Cabane, M., Eckerskorn, C., Lottspeich, F., Sandermann, H. Jr and Ernst D. Molecular cloning, sequence analysis and elicitor-/ozone-induced accumulation of cinnamyl alcohol dehydrogenase from Norway spruce (Picea abies L.). Plant Mol. Biol. 1993, 23:145-156 Ghosh, D., Sawicki, M., Pletnev, V., Erman, M., Ohno, S., Nakajin, S., and Duax, W.L. Porcine carbonyl reductase: Structural basis for a functional monomer in short chain dehydrogenases/reductases. J. Biol. Chem. 2001, 276:18457–18463 Ghosh, D., Wawrzak, Z., Weeks, C.M., Duax,W.L. and Erman, M. The refined three-dimensional structure of 3 α,20 β-hydroxysteroid dehydrogenase and possible roles of the residues conserved in short-chain dehydrogenases. Structure 1994, 2:629-640 Gibbs, A.J. and McIntyre, G.A. The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur. J. Biochem. 1970,16:1-11 Gonnet, G.H., Cohen, M.A. and Benner, S.A. Exhaustive matching of the entire protein sequence database. Science 1992, 256:1443-1445 Gonzalez, P., Rao, P.V., Nunez, S.B. and Zigler, J.S. Jr. Evidence for independent recruitment of � -crystallin/quinone reductase (CRYZ) as a crystallin in camelids and hystricomorph rodents. Mol. Biol. Evol. 1995, 12:773-781 Goodsell, D.S. and Olson, A.J. Automated Docking of Substrates to Proteins by Simulated Annealing. Proteins: Str. Func. and Genet. 1990, 8:195-202 Gustafsson, M. and Johansson, J. Structural and functional properties of pulmonary surfactant protein C and related analogs. J. Protein Chem. 1998, 17:540-542 Halgren, T.A. Merck Molecular Force Field. I.-V. J. Comp. Chem. 1995a, 17:490-641 Halgren, T.A. Potential energy functions. Curr. Opin. Struct. Biol. 1995b, 5:205-210
References
55
Halpin, C., Knight, M.E., Foxon, G.A., Campbell, M., Boudet, A.M., Boon, J.J., Chabbert, B., Tollier, M.T. and Schuch, W. Manipulation of lignin quality by down-regulation of cinnamyl alcohol dehydrogenase. Plant Journal 1994, 6:339-350 Hammersley, J.M. Monte Carlo Methods for Solving Multivariable Problems. Ann. New York Acad. Sci. 1960, 86:844-874 Hansson, T., Oostenbrink, C. and van Gunsteren, W. Molecular dynamics simulations. Curr. Opin. Struct. Biol. 2002, 12:190-196 Hardin, C., Eastwood, M.P., Prentiss, M., Luthey-Schulten, Z. and Wolynes, P.G. Folding funnels: the key to robust protein structure prediction. J. Comput. Chem. 2002, 23:138-146 Hartigan, J.A. Minimum evolution fits to a given tree. Biometrics 1973, 29:53-65 Henikoff, S. and Henikoff, J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U S A. 1992, 89:10915-10919 Henikoff, S. and Henikoff, J.G. Performance evaluation of amino acid substitution matrices. Proteins 1993, 17:49-61 Hofmann, K., Bucher, P., Falquet, L. and Bairoch A. The PROSITE database, its status in 1999. Nucleic Acids Res. 1999, 27:215-219 Huang, T.F., Holt, J.C., Lukasiewicz, H. and Niewiarowski, S. Trigramin. A low molecular weight peptide inhibiting fibrinogen interaction with platelet receptors expressed on glycoprotein IIb-IIIa complex. J. Biol. Chem. 1987, 262:16157-16163 Jany, K.D., Ulmer, W., Froschle, M. and Pfleiderer, G. Complete amino acid sequence of glucose dehydrogenase from Bacillus megaterium. FEBS Lett. 1984, 165:6-10 Jensen, F. Introduction to Computational Chemistry Wiley and Sons, Chichester, 1999 Johansson, J., Curstedt, T., Robertson, B. and Jörnvall, H. Size and structure of the hydrophobic low molecular weight surfactant-associated polypeptide. Biochemistry 1988, 27:3544-3547 Jones, D.T. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 1999, 292, 195-202 Jones, D.T. and Swindells, M.B. Getting the most from PSI-BLAST. Trends Biochem. Sci. 2002, 27:161-164 Jones, G., Willett, P., Glen, R.C., Leach, A.R. and Taylor, R. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. 1997, 267:727-748 Jörnvall, H. and Höög, J.-O. Nomenclature of alcohol dehydrogenases. Alcohol and Alcoholism 1995, 30:153-161
Biocomputational studies on protein structures
56
Jörnvall, H., Höög, J.-O. and Persson, B. SDR and MDR: completed genome sequences show these protein families to be large, of old origin, and of complex nature. FEBS Lett., 1999 445:261-264 Jörnvall, H., Höög, J.-O., Persson, B. and Parés, X. Pharmacogenetics of the alcohol dehydrogenase system. Am. J. Pharmacol. 2000, 61:184-191 Jörnvall, H., Persson, B., Krook, M., Atrian, S., Gonzalez-Duarte, R., Jeffery, J., and Ghosh, D. Short-chain dehydrogenases/reductases (SDR). Biochemistry 1995, 34:6003–6013 Jörnvall, H., Persson, M. and Jeffery, J. Alcohol and polyol dehydrogenases are both divided into two protein types, and structural properties cross-relate the different enzyme activities within each type. Proc. Natl. Acad. Sci. U S A 1981, 78:4226-4230 Jörnvall, H., von Bahr-Lindström, H., Jany, K.D., Ulmer, W. and Froschle, M. Extended superfamily of short alcohol-polyol-sugar dehydrogenases: structural similarities between glucose and ribitol dehydrogenases. FEBS Lett. 1984, 165:190-196 Kallberg, Y., Oppermann, U., Jörnvall, H. and Persson, B. Short-chain dehydrogenase/reductase (SDR) relationships: A large family with eight clusters common to human, animal, and plant genomes. Protein Sci. 2002, 11:636-641 Kallberg, Y. and Persson, B. KIND-a non-redundant protein database. Bioinformatics 1999, 15:260-261 Kamal, A., Almenar-Queralt, A., LeBlanc, J.F., Roberts, E.A. and Goldstein, L.S. Kinesin-mediated axonal transport of a membrane compartment containing beta-secretase and presenilin-1 requires APP. Nature 2001, 414:643-648 Kanehisa, M. A database for post-genome analysis. Trends Genet. 1997, 13:375-376 Kang, J., Lemaire, H.-G., Unterbeck, A., Salbaum, J.M., Masters, C.L., Grzeschik, K.-H., Multhaup, G., Beyreuther, K. and Mueller-Hill, B., The precursor of Alzheimer's disease amyloid A4 protein resembles a cell-surface receptor. Nature 1987, 325:733-736 Kirkpatrick, S., Gelatt, C.D., and Vecchi, M.P. Optimization by Simulated Annealing. Science 1983, 220:671-680 Koonin, E.V., Mushegian, A.R. Galperin, M.Y. and Walker D.R. Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol. 1997, 25:619-637 Krook, M., Ghosh, D., Strömberg, R., Carlquist, M., and Jörnvall, H. Carboxyethyllysine in a protein: Native carbonyl reductase/NADP+-dependent prostaglandin dehydrogenase. Proc. Natl. Acad. Sci. U S A 1993, 90:502–506 Kuntz, I.D., Blaney, J.M., Oatley, S.J., Langridge, R., and Ferrin, T.E. "A geometric approach to macromolecule-ligand interactions", J. Mol. Biol. 1982, 161, 269-288 Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., et al. and Morgan, M.J. Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921
References
57
Lengauer, T. and Rarey, M. Computational methods for biomolecular docking. Curr. Opin. Struct. Biol. 1996, 6:402-406 Levitt, M. Accurate modeling of protein conformation by automatic segment matching. J. Mol. Biol. 1992, 226:507-533 Lindahl, E., Hess, B. and van der Spoel, D. GROMACS 3.0: a package for molecular simulation and trajectory analysis. J. Mol. Model. 2001, 7:306–317 Magnuson, K., Jackowski, S., Rock, C.O. and Cronan, J.E. Jr. Regulation of fatty acid biosynthesis in Escherichia coli. Microbiol. Rev. 1993, 57:522-542 Marschall, H.-U., Oppermann, U.C., Svensson, S., Nordling, E., Persson, B., Höög, J.-O. and Jörnvall, H. Human liver class I alcohol dehydrogenase ��� isozyme: the sole cytosolic 3β-hydroxysteroid dehydrogenase of iso bile acids. Hepatology 2000, 31:990-996 McCammon, J.A., Gelin, B.R. and Karplus, M. Dynamics of folded proteins. Nature 1977, 267:585-590 Minth, C.D., Andrews, P.C. and Dixon, J.E., Characterization, sequence, and expression of the cloned human neuropeptide Y gene. J. Biol. Chem. 1986, 261:11974-11979 Morris, G.M., Goodsell, D.S., Halliday, R.S., Huey, R., Hart, W.E., Belew, R.K. and Olson, A.J. Automated Docking Using a Lamarckian Genetic Algorithm and and Empirical Binding Free Energy Function. J. Computational Chemistry 1998, 19:1639-1662 Morris, G.M., Goodsell, D.S., Huey, R. and Olson, A.J. Distributed automated docking of flexible ligands to proteins: Parallel applications of AutoDock 2.4. J. Computer-Aided Molecular Design 1996, 10:293-304 Morse, P.M. and Feshbach, H. Asymptotic Series; Method of Steepest Descent. In: Methods of Theoretical Physics, Part I. McGraw-Hill, New York, 1953, 434-443 Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995, 247:536-540 Najmanovich, R., Kuttner, J., Sobolev, V. and Edelman, M. Side-chain flexibility in proteins upon ligand binding. Proteins 2000, 39:261-268 Needleman, S.B. and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two Proteins. J. Mol. Biol. 1970, 48:443-453 Negelein, E. and Wulff, H.-J. Kristallisation des Proteins der Acet-aldehyd-reduktase. Biochem. Z. 1937, 289: 436-437 Nei, M. Phylogenetic analysis in molecular evolutionary genetics. Annu. Rev. Genet. 1996, 30:371-403 Ohlrogge, J. and Browse, J. Lipid biosynthesis. Plant Cell 1995, 7:957-970
Biocomputational studies on protein structures
58
Ohtomo, Y., Bergman, T., Johansson, B.-L., Jörnvall, H. and Wahren, J. Differential effects of proinsulin C-peptide fragments on Na+,K+-ATPase activity of renal tubule segments. Diabetologia 1998, 41:287-291 Oldfield, E. Chemical shifts in amino acids, peptides, and proteins: from quantum chemistry to drug design. Annu. Rev. Phys. Chem. 2002, 53:349-378 Oppermann, U.C., Persson, B., Filling, C., and Jörnvall, H.. Structure–function relationships of SDR hydroxysteroid dehydrogenases. Adv. Exp. Med. Biol. 1997, 414:403–415 Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. CATH- A Hierarchic Classification of Protein Domain Structures. Structure 1997, 5:1093-1108 Page, R.D. TreeView: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 1996, 12:357-358 Pearson, W.R. and Lipman, D.J. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U S A 1988, 85:2444-2448 Persson, B., Hallborn, J., Walfridsson, M., Hahn-Hägerdal, B., Keränen, S., Penttilä, M. and Jörnvall, H. Dual relationships of xylitol and alcohol dehydrogenases in families of two protein types. FEBS Lett. 1993, 324:9-14 Persson, B., Krook, M. and Jörnvall, H. Characteristics of short-chain alcohol dehydrogenases and related enzymes. Eur. J. Biochem. 1991, 200:537-543 Persson, B., Krook, M., and Jörnvall, H. Short-chain dehydrogenases/reductases. Adv. Exp. Med. Biol. 1995, 372:383–395 Persson, B., Zigler, J.S. Jr and Jörnvall, H. A super-family of medium-chain dehydrogenases/reductases (MDR). Sub-lines including � -crystallin, alcohol and polyol dehydrogenases, quinone oxidoreductase, enoyl reductases, VAT-1 and other proteins. Eur. J. Biochem. 1994, 226:15-22 Polak, E. Chapter 2.3 In: Computational Methods in Optimization. New York: Academic Press, 1971 Powell, A.J., Read, J.A., Banfield, M.J., Gunn-Moore, F., Yan, S.D., Lustbader, J., Stern, A.R., Stern, D.M. and Brady, R.L. Recognition of Structurally Diverse Substrates by Type II 3-Hydroxyacyl-Coa Dehydrogenase (Hadh II) Amyloid-Beta Binding Alcohol Dehydrogenase (Abad) J. Mol. Biol. 2000, 303:311-327 Pramanik, A., Ekberg, K., Zhong, Z., Shafqat, J., Henriksson, M., Jansson, O., Tibell, A., Tally, M., Wahren, J., Jörnvall, H., Rigler, R. and Johansson, J. C-peptide binding to human cell membranes: importance of Glu27. Biochem. Biophys. Res. Commun. 2001, 284:94-98 Rao, P.V., Gonzalez, P., Persson, B., Jörnvall, H., Garland, D. and Zigler J.S. Jr Guinea pig and bovine � -crystallins have distinct functional characteristics highlighting replacements in otherwise similar structures. Biochemistry 1997, 36, 5353-5362
References
59
Rarey, M., Kramer, B., Lengauer, T. and Klebe, G. A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol. 1996, 261:470-489 Rigler, R., Pramanik, A., Jonasson, P., Kratz, G., Jansson, O.T., Nygren, J.-Å., Ståhl, S., Ekberg, K., Johansson, B.-L., Uhlén, S., Uhlén, M., Jörnvall, H. and Wahren, J. Specific binding of proinsulin C-peptide to human cell membranes. Proc. Natl. Acad. Sci. U S A 1999, 96:13318-13323 Rossmann, M.G., Moras, D. and Olsen, K.W. Chemical and biological evolution of nucleotide-binding protein. Nature 1974, 250:194-199 Rost, B. and Sander, C. Prediction of protein secondary structure at better than 70% accuracy, J. Mol. Biol. 1993, 232:584-599 Rost, B. and Sander, C. Third generation prediction of secondary structure. In: Protein Structure Prediction: Methods and Protocols (ed. Webster, D.), Humana Press, Clifton, NJ, 2000, 71-95 Saitou, N. and Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987, 4:406-425 Sali, A. and Blundell, T.L. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 1993, 234:779-815 Sanchez, R. and Sali, A. Advances in comparative protein-structure modelling. Curr. Opin. Struct. Biol. 1997, 7:206-214 Sandak, B., Wolfson, H.J. and Nussinov, R. Flexible docking allowing induced fit in proteins: insights from an open to closed conformational isomers. Proteins 1998, 32:159-174 Sarni, F., Grand, C. and Boudet, A.M. Purification and properties of cinnamoyl-CoA reductase and cinnamyl alcohol dehydrogenase from poplar stems (Populus X euramericana). Eur. J. Biochem. 1984, 139:259-265 Schmidt, M.W., Baldridge, K.K., Boatz, J.A., Elbert, S.T., Gordon, M.S., Jensen, J.H., Koseki, S., Matsunaga, N., Nguyen, K.A.; et al. General atomic and molecular electronic structure system. J. Comput. Chem. 1993, 14:1347-1363 Schneider, G. and Bohm, H.J. Virtual screening and fast automated docking methods. Drug Discov. Today 2002, 7:64-70 Schwartz, M.F. and Jörnvall, H. Structural analyses of mutant and wild-type alcohol dehydrogenases from Drosophila melanogaster. Eur. J. Biochem. 1976, 68:159-168 Sellers, P.H. On the theory and the computation of evolutionary distances. SIAM J. Appl. Math. 1974, 26:787-793 Shepherd, A.J., Gorse, D. and Thornton, J.M. Prediction of the location and type of beta-turns in proteins using neural networks. Protein Sci. 1999, 8:1045-55
Biocomputational studies on protein structures
60
Sippl, M.J. Knowledge-based potentials for Proteins. Curr. Opin. Struct. Biol. 1995, 5:229-235 Skolnick, J., Kolinski, A., Kihara, D., Betancourt, M., Rotkiewicz, P.and Boniecki, M. Ab initio protein structure prediction via a combination of threading, lattice folding, clustering, and structure refinement. Proteins 2001, Suppl 5:149-156 Smith, T.F. and Waterman, M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981, 147:195-197 Sofer, W. and Ursprung, H. Drosophila alcohol dehydrogenase. J. Biol. Chem. 1968, 243: 3110-3115 Steiner, D., Cunningham, D., Spigelman, L. and Aten, B. Insulin biosynthesis: evidence for a precursor. Science 1967, 157:697-700 Stewart, J.J. MOPAC: a semiempirical molecular orbital program. J. Comput. Aided. Mol. Des. 1990, 4:1-105 Stoesser, G., Baker, W., van den Broek, A., Camon, E., Garcia-Pastor, M., Kanz, C., Kulikova, T., Lombard, V., Lopez, R., Parkinson, H., Redaschi, N., Sterk, P., Stoehr, P. and Tuli, M.A. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 2001, 29:17-21 Swofford, D.L. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, Sunderland, Massachusetts, 1998 Tanaka, N., Nonaka, T., Tanabe, T., Yoshimoto, T., Tsuru, D. and Mitsui, Y. Crystal structures of the binary and ternary complexes of 7 α-hydroxysteroid dehydrogenase from Escherichia coli. Biochemistry 1996, 35:7715-7730 Tekaia, F., Lazcano, A. and B. Dujon.. The genomic tree as revealed from whole proteome comparisons. Genome Res. 1999, 9:550-557 Thompson, J.D., Higgins, D.G. and Gibson, T.J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680 Thomsen, J., Kristiansen, K., Brunfeldt, K. and Sundby, F., The amino acid sequence of human glucagon. FEBS Lett. 1972, 21:315-319 Totrov, M. and Abagyan, R. Detailed ab initio prediction of lysozyme-antibody complex with 1.6 A accuracy. Nat. Struct. Biol. 1994, 1:259-263 Totrov, M. and Abagyan, R. Flexible protein-ligand docking by global energy optimization in internal coordinates. Proteins 1997, Suppl 1:215-220 Wahren, J. and Johansson, B. New aspects of C-peptide physiology. Horm. Metab. Res. 1998, 30:A2-A5
References
61
Wahren, J., Ekberg, K., Johansson, J., Henriksson, M., Pramanik, A., Johansson, B.L., Rigler, R. and Jörnvall, H. Role of C-peptide in human physiology. Am. J. Physiol. Endocrinol. Metab. 2000, 278:E759-E768 Venclovas, C., Zemla, A., Fidelis, K. and Moult, J. Some measures of comparative performance in the three CASPs. Proteins 1999, Suppl 3:231-237 Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., et al. and Zhu X. The sequence of the human genome. Science 2001, 291:1304-1351 Vogt, G., Etzold, T. and Argos, P. An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J. Mol. Biol. 1995, 249:816-831 Wu, C.H., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z.Z., Ledley, R.S., Lewis, K.C., Mewes, H.W., Orcutt, B.C., Suzek, B.E., Tsugita, A., Vinayaka, C.R., Yeh, L.S., Zhang, J. and Barker, W.C. The Protein Information Resource: an integrated public resource of functional annotation of Proteins. Nucleic Acids Res. 2002, 30:35-37 Yamazoe, M., Shirahige, K., Rashid, M.B., Kaneko, Y., Nakayama, T., Ogasawara, N. and Yoshikawa, H. A protein which binds preferentially to single-stranded core sequence of autonomously replicating sequence is essential for respiratory function in mitochondrial of Saccharomyces cerevisiae. J. Biol. Chem. 1994, 269:15244-15252 Yokomizo, T., Ogawa, Y., Uozumi, N., Kume, K., Izumi, T. and Shimizu, T. cDNA cloning, expression, and mutagenesis study of leukotriene B4 12-hydroxydehydrogenase. J. Biol. Chem. 1996, 271:2844-2850 Zvelebil, M.J., Barton, G.J., Taylor, W.R. and, Sternberg, M.J.E. Prediction of protein secondary structure and active sites using alignment of homologous sequences, J. Mol. Biol. 1987, 195, 957-961 Österberg, F., Morris, G.M., Sanner, M.F., Olson, A.J. and Goodsell, D.S.Automated docking to multiple target structures: incorporation of protein mobility and structural water heterogeneity in AutoDock. Proteins 2002, 46:34-40