data driven rna structure determination - school of informatics
TRANSCRIPT
Data Driven RNA Structure
Determination
Wolfgang Lehrach
Master of Science
School of Informatics
University of Edinburgh
2003
(Graduation date: December 2003)
Abstract
This project presents a novel unsupervised learning approach to learning to de-
termine the structure of RNA molecules with little prior explicit knowledge of
RNA and only a database of sequences and their structures. The approach is
based upon adapting the parameters to maximise the probability of the correct
real life structure given the sequence. This was done with the aim of generalising
to novel sequences. A simple form of motif recognition is incorporated to improve
the ability to use the data within the database.
Rosetta style MCMC simulated annealing and conjugate gradient descent are
implemented to determine the structure with the highest probability given a
sequence and energy function.
Overall it was found that while the novel learning algorithm performed very
well, performance was limited by the choice of a simplest energy function with
few parameters which could only express how to fold molecules of low complexity
and which had to approximate more complex molecules.
i
Declaration
I declare that this thesis was composed by myself, that the work contained herein
is my own except where explicitly stated otherwise in the text, and that this work
has not been submitted for any other degree or professional qualification except
as specified.
(Wolfgang Lehrach)
ii
Table of Contents
1 Introduction 2
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 What is RNA and Why Determine its Structure . . . . . . . . . . 3
1.2.1 What RNA is made of . . . . . . . . . . . . . . . . . . . . 3
1.2.2 What is DNA . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 What are Proteins . . . . . . . . . . . . . . . . . . . . . . 7
1.2.4 Different levels of structure . . . . . . . . . . . . . . . . . . 8
1.3 Mathematical Background . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Probability and Sampling theory . . . . . . . . . . . . . . 8
1.3.2 The Metropolis Method . . . . . . . . . . . . . . . . . . . 10
1.4 Other approaches considered . . . . . . . . . . . . . . . . . . . . . 11
1.5 Temporal issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Synopsis of the Rest of the Dissertation . . . . . . . . . . . . . . . 11
2 Literature review 13
2.1 Problem Feasibility of Structure Determination of RNA and Proteins 13
2.2 MFold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Relative advantages and disadvantages . . . . . . . . . . . 14
2.3 Rosetta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Scoring function . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Sampling of the Protein Space . . . . . . . . . . . . . . . . 19
2.3.3 Result processing . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.4 Relative Advantages and Disadvantages . . . . . . . . . . . 20
iii
3 Methodology 21
3.1 Techniques used . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Why use Probability Theory . . . . . . . . . . . . . . . . . 21
3.1.2 Defining P (C | Ψ, seq) . . . . . . . . . . . . . . . . . . . . 21
3.1.3 Increasing the Probability of a Structure for a Given Sequence 22
3.1.4 Finding the Structure given the Sequence of an RNAMolecule 24
3.1.5 Rossetta-style Simulated Annealing to find all Minima . . 25
3.2 Modelling the Structure of the RNA Molecule . . . . . . . . . . . 25
3.2.1 Why use a High Level Molecule Representation . . . . . . 25
3.2.2 Cartesian ’xyz’ Molecule Representation . . . . . . . . . . 26
3.2.3 Chain Spherical ’rtp’ Molecule Representation . . . . . . . 27
3.2.4 Bond and Torsion angle based . . . . . . . . . . . . . . . . 28
3.2.5 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Defining the Energy function . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.2 Scoring the bonds between neighbours . . . . . . . . . . . 30
3.3.3 Incorporating Base to Base interactions . . . . . . . . . . . 32
3.3.4 Initial Parameter Settings . . . . . . . . . . . . . . . . . . 34
3.4 Incorporating common motif recognition . . . . . . . . . . . . . . 34
3.4.1 Choosing motifs . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.2 Comparing structures . . . . . . . . . . . . . . . . . . . . . 35
3.5 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . 36
3.5.1 Development Environment and Language . . . . . . . . . . 36
3.5.2 Regression checking . . . . . . . . . . . . . . . . . . . . . . 37
3.5.3 Vectorization, Optimisation and Benchmarking . . . . . . 37
3.5.4 Parallelizing the code . . . . . . . . . . . . . . . . . . . . . 37
3.6 Evaluating the Learning of the Parameters Ψ . . . . . . . . . . . . 38
3.6.1 Initial Teething Problems . . . . . . . . . . . . . . . . . . 38
3.6.2 Learning an Artifical Hairpin Structure . . . . . . . . . . . 39
3.6.3 Random folding . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6.4 Real data . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
iv
4 Results 41
4.1 Effects of Different Structural Repressentions on Structure Deter-
mination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Hairpin results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Adapting the Initial Parameter Settings . . . . . . . . . . 41
4.2.2 Adapting Difficult Parameter Settings . . . . . . . . . . . 43
4.3 Rederiving the Parameters of Random Folded Data . . . . . . . . 47
4.4 Real RNA data results . . . . . . . . . . . . . . . . . . . . . . . . 47
5 Conclusion 49
5.1 Concluding Remarks and Observations . . . . . . . . . . . . . . . 49
5.2 Unsolved Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3 Suggestions For Further Work . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Integrating Temporal Information . . . . . . . . . . . . . . 51
5.3.2 Expanding upon Motif Recognition . . . . . . . . . . . . . 51
5.3.3 Improving the energy function: . . . . . . . . . . . . . . . 51
5.3.4 Implementation in a Different Language . . . . . . . . . . 52
5.3.5 A Fully Cross Validated Run . . . . . . . . . . . . . . . . . 52
A Structure produced by Real RNA Molecules without Motifs 53
B Structure produced by Real RNA Molecules with Motifs 58
Bibliography 62
v
List of Figures
1.1 Structural determination given an RNA sequence . . . . . . . . . 2
1.2 Chemical structure of DNA . . . . . . . . . . . . . . . . . . . . . 4
1.3 Central dogma of biology - should I include this? should I cite it
from CBD notes? . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Buried vs non-buried Penv terms in Rosetta . . . . . . . . . . . . . 18
3.1 Chain Spherical co-ordinate system . . . . . . . . . . . . . . . . . 28
3.2 Updates required in Cartesian Representation vs Chain Spherical 29
3.3 Expressing bond preferences of sequential bases in terms of distance 31
3.4 Artificial hairpin based on loop of Transfer RNA . . . . . . . . . . 40
4.1 Folding a hairpin in the Chain Spherical Molecule Representation 42
4.2 Folding a hairpin in the Cartesian representation: Every 6th Frame 42
4.3 Adjusting parameters of almost correct hairpin . . . . . . . . . . . 43
4.4 Comparing structure under initial parameter settings and learnt
parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Comparing the Energy of a Line against the Hairpin . . . . . . . . 44
4.6 Comparing the hairpin under intitial perturbed parameters and
learnt parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.7 Adjusting parameters for the hairpin from initially perburbed pa-
rameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.8 Adjusting Parameters of a Hairpin starting from perturbed param-
eters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.9 Individually folding a molecule from the main sample . . . . . . . 47
vi
List of Tables
1.1 Different Types of RNA . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Classifying the different levels of RNA structure . . . . . . . . . . 9
2.1 RNA Secondary Structures considered by MFold . . . . . . . . . . 15
3.1 Modelled base to base interactions . . . . . . . . . . . . . . . . . . 33
3.2 Starting values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
A.1 Structure produced and learnt parameters . . . . . . . . . . . . . 57
B.1 Structure produced with motifs and learnt parameters . . . . . . . 61
vii
Chapter 1
Introduction
1.1 Overview
The aim of this project is to determine the structure of an RNA molecule given
its sequence by using the knowledge gained from a database of sequences and
their real life structures.
ACCACGGUUCACAA
Figure 1.1: Structural determination given an RNA sequence
RNA is involved in many of the fundamental processes in life and there are
many scientific and medical reasons why determining RNA’s functions would be
very useful, where its function is determined almost exclusively by its structure.
Due to this, there has been lots of prior work in structural determination of which
little has come from a basis in inference and graphical models instead mostly using
ad-hoc methods.
To determine the correct structure, there needs to be a scoring function which
2
Chapter 1. Introduction 3
rates how well a structure fits a given sequence. It is then reasonably simple -
if computationally expensive - to find a structure which maximizes this scoring
function. This is a common optimization problem for which there exist lots of
techniques to tackle it.
The fundamental problem is how to design or create a scoring function that
is able to differentiate between what structures fit a sequence and so could occur
within a cell and what simply maximises the score but couldn’t exist in real.
Manually creating and tuning a scoring function which can tell between good
and bad structures is actually a very difficult problem. As the point is to get
structure which would actually occur within a cell, there are many interactions
which have to be taken into account which are hard to explicitly model, like the
effect of the surrounding water molecules.
This project on the other hand tries to make the real world structure for some
given sequence more likely. The scoring function is modelled as P (C | Ψ, seq),
the probability of the structure given the parameters Ψ and the sequence. Using
MCMC and gradient ascent, it is then possible to increase the likelihood of the
structure by adapting the parameters, which is done for all real world examples.
The hope is then that there a set of parameters which will correct score structure
for lots of sequences and will generalise to novel sequences.
The main results from this project are that it works well on trivial example
cases like a single hairpin or a simple 3d structure. However, it does not generalise
so well to optimising for multiple types of molecules at the same time.
1.2 What is RNA and Why Determine its Structure
1.2.1 What RNA is made of
RNA and DNA are both nucleic acids and consist of bases hanging off a sugar
phosphate backbone. RNA is the evolutionary older molecule, and it is thought
that it used to used to perform all tasks in the cell[10] which tend to be done
by proteins in modern cells, while DNA is a much more stable molecule used for
long term information storage. This means RNA has the potential to form lots
Chapter 1. Introduction 4
Images from http://www.ch.cam.ac.uk/SGTL/Structures/nucleic/backbone.html and http://www.tulane.edu/~biochem/nolan/lectures/rna/
images/rnaimage2.gif
Figure 1.2: Chemical structure of DNA
of useful structures with catalytic properties.
If the structure of RNA could easily be determined for an arbitrary sequence,
it should be possible to attempt to design RNA that carry out tasks in the cell
for which Proteins would be normally used. This would open up new avenues
of attack on lots of existing medical problems as it might be easier to use RNA
molecules than attempting to use proteins. Also it would be easier to work out
structure for existing RNA molecules within the cell and from that their function
within the cell.
The structure of the sugar phosphate backbone as well as the possible bases
can be seen in 1.2. y
According to the central dogma of molecular biology, a large part of what RNA
generally do within modern cells is to act as a disposable information distribution
system, unlike DNA which is used for long term storage and proteins that are
used as functional agents. Information only every travels one direction in modern
cells, with a few exception like virii. This can be seen in 1.3.
Even in modern cells, RNA is a very versatile molecule that performs many
different functions within the cell. For instance, here are the some of the main
types of RNA as can be seen in 1.1
Chapter 1. Introduction 5
Figure 1.3: Central dogma of biology - should I include this? should I cite it from
CBD notes?
Chapter 1. Introduction 6
RNA type Description
Transfer RNA
Small 80 base sequences of RNA that connect
to a amino acid and contain the anti-codon
of the triplet that code for that amino acid.
In effect it is a translation device from the
RNA sequence to a protein amino acid. These
are matched up during protein synthesis to
convert the sequence to a protein.
Image from http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.html
Messenger RNA
Generally a copy of the sequence of some DNA
that will be synthesised to a protein. Can also
fulfil the same actions as DNA, as it would
probably have to have done in proto-cells.
Ribosomal RNA
This form of RNA forms a complex with var-
ious proteins to form the ribosomes. These
are vital for translating mRNA to proteins.
Image from http://www.zib.de/MDGroup/rnalab/firstpageDateien/rRNA.html
Table 1.1: Different Types of RNA
Chapter 1. Introduction 7
This project should be able to determine the structure of all of them, as this is
a functional classification not based upon any differences in the RNAs underlying
structure.
1.2.2 What is DNA
The main difference between RNA and DNA is an OH instead of a H in the 3’
position on the sugar ring which means that the structure of DNA is a lot more
stable. There is also a different base T instead of U. This means that the project
should also be able to fold DNA molecules with only retraining and changing
internally the U base to a T base.
DNA have only relatively recently been found to have catalytic properties as
well[13] which means that this project might be able to find interesting structure
that DNA can form.
1.2.3 What are Proteins
Proteins are the modern work horses of the cell. They perform almost all of
the physical interaction to the outside of the cell and almost all of its internal
maintainance. Due to their shape and interaction with other molecules, they can
make and break bonds on other molecules overcoming the activation energy that
would be required. This makes them extremely powerful and a lot of work has
been done on attempting to determine their structure.
There are lots of similarities between structure determination in proteins and
RNA molecules. This means if the technique on RNA works very well, it should
also be applicable to folding proteins. This also means that there is a very large
base of literature on protein folding which can be referred back to as proteins
play a central role in modern cells and is also very important to medical science.
Protein structure determination has the advantage that it is that there exist
a lot more proteins with their structure determined than with RNA. This should
lead to less data scarcity problems but as proteins have more base units(20 amino
acids vs 4 different base types) this leads to 380 point to point interactions com-
Chapter 1. Introduction 8
pared to 12 point to point interactions in RNA to get data for.
1.2.4 Different levels of structure
The structure of RNA can be defined on different levels. Not all approaches to
finding the structure of RNA work on the same structural level, so it is important
to be clear which level of structure is being talked about.
This project only deals with Quaternary structure, which reduces to being
equivilant to Tertiary structure in the case of there being only one RNA molecule.
Proteins have analogous structure level classifications.
1.3 Mathematical Background
1.3.1 Probability and Sampling theory
Starting with the discrete case. There is a set of events, X and a mapping
P (x ∈ X) which tells us the probability of event x ∈ X occuring.
Given a probability distribution P(X), it should be possible to calculate prop-
erties of X, like its mean. The mean can be calculated by
mean(X) = E(X) =∑
x∈X
xE(X)
or in the general case:
E(f(X)) =∑
x∈X
f(x)E(X)
This is extended to the continuous case:
E(f(X)) =∫ X
f(x)E(X)dx
This can be approximated by taking sampling from P(X) and averaging their
results. In the limit, this should give the correct answer. If x1, ..., xn are inde-
pendent samples from P(X), then:
∫ X
f(x)E(X)dx = 〈f(x)〉P (X) ≈1
n
∑
xi
f(xi)P (xi)
Chapter 1. Introduction 9
Structure level Description
Primary A C C A G
The sequence of the RNA molecule - this is co-
valent bonding between the backbone units.
Secondary
C C C G
G G G C
U
A
Covalent Bond
Hydrogen Bond
Deals with “local” ordered structure via hy-
drogen bonding: strands + helices in proteins,
short helical regions in RNA. Equivalently the
list of base pairs that occur in the three dimen-
sional RNA structure.
Tertiary
The global structure of protein/RNA, mainly
driven by other bonds than hydrogen-bonds. As
can be seen above in the interactions between
different strands
Quaternary
The interaction of multiple RNA molecules.
Table 1.2: Classifying the different levels of RNA structure
Chapter 1. Introduction 10
This is a sampling approximation. This will converge to the correct answer
given enough independent samples.
1.3.2 The Metropolis Method
Metropolis sampling allows one to sample from a distribution where the partition
function is not known, i.e. a function which is proportional to the required
distribution is known. This often occurs in Bayesian statistics when applying
Bayes Theorem. Given P (evidence | model) ∝ P (model | evidence)P (evidence),
it is possible to sample from P (model | evidence)P (evidence), without having
to know anything about P (model). With these samples, it is then possible to
approximate E(f(X)).
This technique will be needed to approximate an essential and otherwise in-
tractable integral within a derivative to make the sequence given the structure
more likely which is the central part of this project.
How to perform metropolis sampling is outlined below:
1. Pick a starting point x
2. Add x to list of points
3. Pick y from P (y | x). This is generally a Gaussian centred on x with a
small variance.
4. if P (y) > P (x) or if U(0, 1) < P (y)P (x)
accept the point else reject it.
5. If point was accepted set update x to contain y otherwise leave x unchanged.
6. Add point x to list of samples
7. Go to Step 2.
The list of points can be used as a list of independent samples in equation ??
and will give the correct answer given enough samples. This can be shown by
using Markov Chains[14].
Chapter 1. Introduction 11
1.4 Other approaches considered
Originally, this project was going to be based on a graphical models approach
to folding RNA molecules, for instance as it was applied in [15] to determine
the amino acid alignemnt when the backbone was known. However - assuming
pairwise interactions - it is extremely difficult to find any structural indepedencies
within the tree.
If it is assumed that the folding start from a straight line, one can assume
that the end start to fold independently. However, this quickly start to break
down and is equivilant to a well know optimization(keeping neighbours list and
cutting off interaction).
1.5 Temporal issues
What is the difference between folding RNA and determining its structure? It
is how much can be said about the evolution of the structure of the RNA as it
progresses towards its final shape. This project can not say anything about the
folding process of the molecule like for instance whether there are any plateaus
that would need to be crossed with a helper molecule. All that it can do is
determine the final structure.
While this project can model a complex that multiple RNA molecules could
form, it can’t for instance take into account how each RNA molecule would change
over the evolution of the interaction. An example of what this project can’t cope
with is an RNA molecule that cuts certain other RNA molecule in a specific place.
When this project refers to “folding”, it should be taken to mean the progres-
sion over time of deteremining the structure, not the actual states that an RNA
molecule would be expected to go through if it was actually folding in vivo.
1.6 Synopsis of the Rest of the Dissertation
Chapter two deals with previous approaches to this problem, their relative ad-
vantages and disadvantages and how they relate to this project while chapter
Chapter 1. Introduction 12
three details the methods that are used within this project and the implemen-
tation issues that were found. Chapter four goes into detail about how well the
algorithm performed and the results that were found an chapter five summarises
the achievements of this project and possible future extensions to this project.
Chapter 2
Literature review
2.1 Problem Feasibility of Structure Determination
of RNA and Proteins
What make this problem feasible to try and solve? There are a few conflicting
factors:
For:
• Some sucesses by Existing Techniques: Existing techniques in the field
lile Rosseta and MFold have been used for some time to sucessful help to
determine structures.
Against:
• Computational requirements: Despite the great advances in computa-
tional speed over the last few decades, simulating a small RNA molecule
with 2000 atoms and their interactions is still extremely computationally
difficult due to the scaling of the problem. The hope is that more intelligent
methods will get around this requirement.
• Helper proteins: Most proteins encounter energy plateaus during their
folding. This is where they are stuck in a sub-optimal folding that is a
local minima. Other proteins bind to the almost complete protein or RNA
13
Chapter 2. Literature review 14
molecule(?) and give it the required activation energy to complete its fold-
ing.
2.2 MFold
The traditional method that has been used when folding RNA is MFold that
only deals with secondary structure(see 1.2) in the hope that this provides a
significant step toward working out the actual Tertiary structure of the resulting
molecule. The scoring metric it uses is the amount of free energy available with
that structure using real life values worked out from experiments(?)
MFold works with the different types of loops of RNA, as can be seen in 2.1,
with tables of the reduction in free energy of having a loop of each type of each
size.
In this project these secondary structure are not modelled explicitly as the
energy function can encourage them to exist without explicitly knowing about
them. These will hopefully be reproduced in the results when real RNA is folded
and should be relatively easy to identify. It will prove interesting to compare the
results that MFold and this project return for the same molecule by looking for
overlap in the structure that they predict.
This prediction of the secondary structure can provide a starting point for
determining the full Tiertary structure of a given RNA molecule. The initial
shape can be resonably close to the real shape of the RNA molecule with only
the tiertary interaction wrong. However, this project is more interested in being
able to determine the structure of RNA where mfold fails and so starting from the
mfold starting position will lead to the wrong minima which might be difficult to
get out of. In effect, it could hinder more than it helps so it is not implemented.
2.2.1 Relative advantages and disadvantages
MFold has various advantages and disadvantages relative to this project:
• Bounds on folding time: As MFold uses dynamic programming methods,
Chapter 2. Literature review 15
Loop type Diagram Description
Bulges Bulges result form
small insertions or
deletions in a series of
base pair matches.
Hairpin Hairpins allow the
structure to flip back
on itself. They gener-
ally have a minimum
size of around 4-5.
Interior Very much like a
bulge except that a
mismatch occurs on
both sides.
Multi-loop Generally a central
feature in RNA.
Images from http://www.daimi.au.dk/~schauser/genome_analysis_F03/lectures_F03/RNA-struct-prediction.pdf
Table 2.1: RNA Secondary Structures considered by MFold
Chapter 2. Literature review 16
it can give low order polynomial bounds of time and space needed to fold
the RNA molecule.
• Gives solutions under constraints: Certain base pairings can be pre-
specified to occur in the solutions. If for instance it is known from exper-
imental data that a certain pairing must occur, MFold can ensure that all
solutions contain that pairing.
• Produces multiple solutions Unlike for instance a recursive approach
which would only give a single solution, MFold can produce multiple low
energy solutions to a problem [7]. This is useful as one of the other minima
may be the actual real life solution instead of the global minima. This
project should also be able to do this, but not in such a controlled manner.
• Gives estimates of real-life energy values: MFold actually takes into
account the measured energy of these configurations, while this project just
uses relative values which have little real life relevance.
Its main disadvantages compare to this project are:
• Only second order: As MFold only deals with secondary structure, it
can be almost completely wrong about the resulting shape of the molecule.
This project should be able to deal with this.
• Can’t deal with exception: Everything in biology has exceptions, and
base pairs are themselves not always immutable. For instance: Base triples
can form and pseudo-knots can form which are ignored, which are unpre-
dictable in how much energy they contribute to the final structure [9]
• Can’t deal with multi-stranded RNA: This is also related to the sec-
ondary structure constraint. This project should be able to deal with de-
termining the final shape that the multi-stranded RNA folds to.
Chapter 2. Literature review 17
2.3 Rosetta
Rosetta is the current hot topic in protein structure determination as it performed
extremely well on unseen data(CASP4 + CASP5). Part of this project is to look
into see how applicable its techniques are to determining RNAs structure.
The best possible situation for this project would be to be able to say that this
project is simply a more disciplined version of Rosetta, as there are similarities
in the approaches taken but with Rosetta should need more hand tuning.
The basis of Rosetta is the fact that the local structure influences but doesn’t
totally determine the final structure [3]. This can be seen by clustering together
fragments of sequences based on the similarity of their resulting structure.
There are three novels items in Rosetta: its scoring function, its searching
algorithm and its processing of its results.
2.3.1 Scoring function
The Rosetta scoring function is based on the decomposition: P (structure |
sequence) proportional to P (sequence | structure)P (structure), where P (structure |
sequence) represents sequence dependent features and P (structure) represents
universal sequence-independent features like β-strands assembling into β-sheets[1].
This decomposition can easily be seen to hold by use of Bayes Theorem:
P (Hypothesis | Evidence) = P (Evidence|Hypothesis)P (Hypothesis)P (Evidence)
. Bayes Theorem of-
fers a concrete way to establish probabilities which can otherwise be very difficult
to measure and plays a central role when applying probably theory to any statis-
tical problem.
P (Sequence | Structure) is approximated to two first order terms:
P (aa1aa2...aan | X) =∏
i
P (aai | X)
︸ ︷︷ ︸
∏
i<j
P (aai, aaj | X)
P (aai | X)P (aaj | X)︸ ︷︷ ︸
...
Penv Ppair
where Penv is the per amino acid environment term and Ppair is a pairwise
interaction term. Higher order terms were found to have little discriminative
power and so were not used.
Chapter 2. Literature review 18
XH20
H20
H20
H20
H20
H20
H20
H20
H20
H20H20H20H20H20
H20
H20
H20
H20
H20
H20
H20
H20
H20
H20H20H20H20H20
X
Example of a buried base Example of an unburied base
Figure 2.1: Buried vs non-buried Penv terms in Rosetta
Penv looks at each amino acid individually based on whether it is surrounded
by lots of other bases. This insures that hydrophobic terms tend to move to
the middle of structure whereas polar, hydrophilic terms to outside. This is very
important in protein structure determination, but not as relevant in RNA folding.
Ppair accounts for specific pair interactions between protein bases. It tends to
decay to 0 at 12A and thus can be ignored for long distances.
Penv and Ppair are set from looking at existing proteins and seeing what the
posterior distribution is. This project could attempt to derive these parameters
automatically.
This leaves the P (Structure) term. This is evaluated in terms of the secondary
structure, independent of the sequence and is chosen to give the maximum dis-
criminating power between real proteins and proteins that appear good under
normal scoring functions and P (Structure | Sequence) criteria but turn out to
be implausible. This was looked into by generating large numbers of non-native
proteins that scored very highly and plotting native vs non-native for each term
of interest.
Rosetta sets part of this term by representing every two residue segment in
the secondary structures by a vector and whether it is a helix or a strand. The
distances and orientations are then mapped onto spherical co-ordinates and were
examined manually looking for good classification performance between native
and compact but non-native structures, where a near native structure would still
score well. This approach resembles boosting[6], in that it allows focusing on the
Chapter 2. Literature review 19
mis-classified elements(the hard cases) and then combining various weak classifies
at the end.
Various other scoring methods were also used to score the proteins, whereupon
linear regression to find weights which produce the best classification performance
in predicted whether a protein is native. This weights are roughly equivilant to
the parameters Ψ used in this project and should be automatically be determined.
2.3.2 Sampling of the Protein Space
The basis of the sampling is the observation that the local sequence influences
the local structure, even if it doesn’t completely determine it. Rosetta stores the
proteins in a torsional space representation[4] to minimize the degrees of freedom
which are irrelevant to this problem.
Rosetta looks up all possible 3 and 9 residue segments are looked up for what
local structure they correspond to in a protein sequence database using sequence
profile comparison method which is nearest neighbours [2].
Nearest neighbour is used to find the 25 best sequences from the database,
where best is computed as distance between frequency of amino acids in each
sequence position.
A move then consists of substituting the torsional angles of a randomly chosen
neighbour at a random position for those of the current configuration.
This space is then searched using MC methods with energy functions that
favour compact structures with paired β-strands and buried hydrophobic residues,
see the scoring function.
2.3.3 Result processing
Each query repeated lots and lots of different times from different places. Results
are again clustered, and the centres of largest clusters taken as highest confidence
models, with the spread of the cluster being an indication of how reliable the
results are.
Another important point is that they used clustering on the resulting struc-
Chapter 2. Literature review 20
tures from the folding. The biggest clusters with the highest energy where then
submitted to the competitions as the possible folding of the protein as multiple
entries are allowed.
2.3.4 Relative Advantages and Disadvantages
Rosetta’s main advantages relative to this project are:
• Known to work well: Rosetta is known to perform very well on unseen
data, while this projects performance is very much unknown.
• Can work with constraints: Rossetta not only does ab initio predicition
of RNA folding but can also use experimental evidence that constrains the
final structure like Residual Dipolar Coupling [5]. This project can not
incorporate constraints like this although this could be incorporated into
the energy function.
Its disadvantages are:
• Manual tuning: It can’t automatically tune its parameters to fit the
data. Manual intervention is required. This project should be able to do
this automatically.
• Ad hoc: It is rather based on ad-hoc methods in places, like for instance
using linear regression to combine together lots of different
Its similarities
• Unit-less values: Both projects use non-real world values for energy
• Not physical modelling: Both projects do not explicitly model the un-
derlying physical processes but attempt to determine the structure using
other methods.
• Temporal issues: Both project determine the structure but can not de-
termine it progression over time as it reaches the final state.
Chapter 3
Methodology
3.1 Techniques used
3.1.1 Why use Probability Theory
Using an energy function to score the different structure available for a sequence
is not very powerful, as none of the machinery of probability theory can be bought
to bear upon it.
For instance, there is a sequence seq and structure C. The problem is to
make the structure C more likely for the sequence seq: Even the statement of
the problem uses probability theory. Simply increasing the score of the structure
by changing the scoring function isn’t good enough, as there is no guarantee that
all other structure aren’t increasing in some way, etc. While this could be done
in an ad-hoc way, it is better to use the natural and intuitive machinery that is
already there.
3.1.2 Defining P (C | Ψ, seq)
The probability function is defined in terms of the energy function. The energy
function must have continuous, differentiable parameters Ψ which will be adapted
to make the energy function prefer real structures for a sequence. Given that there
is an energy function H(C,Ψ, seq) where C is the structure, Ψ is the parameters
21
Chapter 3. Methodology 22
of the energy function and seq is the sequence of the RNA molecule, there is a
standard way to define a probability distribution:
P (C | Ψ, seq) =e−H(C,Ψ,seq)
Z
where Z is the partition function which turns it into a probability distribution.
This is needed so that P (C | Ψ, seq) satisfies the basic axioms of probability,
namely the continuous equivilant of that events must have a probability that
sums to one. It is defined as follows:
Z =∫
e−H(C,Ψ,seq)dC
Z is set of by the values of the parameters Ψ but is intractable to compute
in all apart from the most simply cases. However, using Markov Chain Monte
Carlo methods, it is still possible to increase the probability of a structure given
a sequence.
The energy function that is used in this project is defined in 3.3.1
3.1.3 Increasing the Probability of a Structure for a Given
Sequence
This is the central idea used within this project. As log is a monotonic function,
increasing log(f) is equivalent to increasing f. If this insight is applied to P (C |
Ψ, seq):
∂
∂Ψln (P (C | Ψ , seq)) =
∂
∂Ψln
(
e−H(C,Ψ ,seq)∫
e−H(C,Ψ ,seq)dC
)
= −
(
∂
∂ΨH (C,Ψ , seq)
)
+
∫(
∂
∂ΨH (C,Ψ , seq)
)(
e−H(C,Ψ ,seq)∫
e−H(C,Ψ ,seq)dC
)
dC
= −
(
∂
∂ΨH (C,Ψ , seq)
)
+
⟨
∂
∂PH (C,Ψ , seq)
⟩
P (C,Ψ ,seq)
Markov-chain Monte Carlo is used to approximate⟨
∂∂PH (C,Ψ , seq)
⟩
P (C,Ψ ,seq).
As it is not possible to evaluate P (C | Ψ, seq), simple gradient ascent is used
to try to find the local maxima:
Chapter 3. Methodology 23
Ψn+1 = Ψn + stepsize ∗∂ lnP (C,Ψn, seq)
∂Ψ
If stepsize is small enough and this equation is iterated often enough it should
increases P (C | Ψ, seq) by changing Ψ until it is a local maximum. Now if an
attempt was made to find the structure of seq by maximizing P (C2 | Ψ, seq), it
should hopefully find that C = C2, i.e. the structure that was earlier made likely.
This has a nice side effect of having an semi-intuitive explanation - if(
∂∂ΨH (C,Ψ , seq)
)
=⟨
∂∂PH (C,Ψ , seq)
⟩
P (C,Ψ ,seq)
then it has converged. If they are not equal, Ψ is adjusted so that ...
A simple approach is taken to try and optimise for multiple molecules at the
same time: This equation is iterated for each molecule in turn. This is repeated.
There are no guarantees of convergence without arbitarily small step size
and infinite precesion or of being able to find parameter setting that satisfy the
requirements for multiple molecules at the same time.
3.1.3.1 Reducing the Burn-in period for MCMC
One important consideration is what structure is used to start the MCMC sam-
pling from. Just starting from a straight chain can bias the sampling as it is in a
area of very low probability and the MCMC will take a long time to find an area
of higher probability, which it would normally be present in. This time is called
the burn in period and sample point from this time are generally discarded or
this bias can be avoided by either having very long sampling periods.
The approach taken in this project is to start with the shape that which
should be made more likely, then maximise its probability for a sufficiently long
period until a local minima is reached. Then after every time the parameteres
Ψ are updated perform a few more cycles of minimising the energy to ensure
that the starting position tracks the changing local minima. This was found to
significantly reduce the numbers of cycles of MCMC sampling that had to be
performed.
Another consideration is whether to use all sample or just to record a sample
from this stream occasionally in the hope of getting overall independent samples.
Chapter 3. Methodology 24
The advantage of getting independent samples is that it allows estimates of the
number of samples required, but when and how often to record to ensure this [14]
3.1.3.2 Other possible kinds of sampling
Gibbs sampling could also have been implemented. Gibbs sampling works by
sampling first variable given all the others, then the second variable given all the
others etc. Its main advantage over MCMC is that every sample is accepted and
no samples are lost due to rejection.
A continuous solution would be intractable but by discritizing each variable
in turn and evaluating it in every position, it is then simple to normalise these
and turn them into a valid distribution but this is going to be a lot more work
than simply throwing away the samples in Metropolis sampling.
Hybrid Monte Carlo sampling was also looked at, but it was decided that it
was too slow. For a more complete survey of other available sampling methods,
look at [14].
3.1.4 Finding the Structure given the Sequence of an RNA
Molecule
As the mapping from the energy to the probability is anti-monotonic, maximising
the probability is equivalent to minimising the energy function. All the deriva-
tives of the energy function with respect to the structure can be calculated, so a
standard function minimisation routine namely conjugate gradient descent from
netlab[?] is used to optimise the structure with the given sequence and parameter
settings.
So, given a sequence seq an initial shape is chosen for the structure C like a
circle or a random structure. The parameters settings Ψ are known from either
setting them manually or from being adapted for a certain shape as done in the
last section.
Convergence is not explicitly detected as sometimes it can appear to have
converge but then “break through” to another local minima and start improving
Chapter 3. Methodology 25
the structure again. This project takes a simple approach and just attempts to
optimise the energy function for a long time, making up for the slowness of this
approach by simply using more computers.
3.1.5 Rossetta-style Simulated Annealing to find all Minima
The main problem with just using the technique outlined above it is a naıve
hill climbing algorithm so it will never find more than one minima. The energy
function can easily have multiple minima of which one is the RNA molecules.
Starting from random positions is one approach to this problem but one which
is unlikely to find all the different minima and there are many highly twisted
minima when the molecule is overlapping a lot. Instead a simulated annealing
style approach modelled on MCMC simulated annealing as in Rosetta [2] is used
to try and find all of the minima.
Each update is to select a random position on the RNA molecule. Then
copy three sets of bond lengths, bond angles and torsion angles from a randomly
selected fragment with the same sequence.
The probability of a structure given a sequence is updated in this context to
take into account the temperature:
P (C | Ψ, seq) =e−H(C,Ψ,seq)
Z ∗ T
This means that the higher the temperature, the more likely it is that a jump
which decreases the probability will suceeed. When the temperature decrease,
it becomes harder and harder to do big jumps, at which point the conjugate
gradient descent method is used to find the final maxima.
3.2 Modelling the Structure of the RNA Molecule
3.2.1 Why use a High Level Molecule Representation
The RNA molecule is stored at a high level representation; only the position of
base is stored, not the positions of all of the atoms that make up that base. This
is done for many reasons:
Chapter 3. Methodology 26
• In order to make the problem more tractable in the short time frame avail-
able.
• To very significantly the reduce the computational complexity of trying to
fold the molecule. As there are about 20 atoms per base at least point to
point interactions between all molecule have to be taken to account makes
the problem run a few orders of magnitude quicker
• It is a more elegant solution to the problem that is more likely to be applied
to other problems as it requires less prior knowledge. For instance to adapt
this project to DNA folding, all that would be required is changing all U
bases to T bases and retraining, while with a atom based system the actual
structure would have to change. While this might be minor, adapting to
prxotein folding would certainly require almost a complete rewrite, while
with a higher level representation it would only require adding more bases.
The speed and result of the structural determination depends on how the that
is found can be folding specific, various different representations of the positions
of the RNA bases can be used. The current implementation supports 3 different
kinds of representations where the main difference is how many irrelevant degrees
of freedom can be explicitly ignored in the representation and how long it takes
to calculate the derivatives.
It would be expected that the time taken to determine how long it takes to fold
a structure would be a trade off between converging faster in representation where
it can ignore irrelevant degrees of freedom versus the trade off of the derivatives
being slower to calculate.
3.2.2 Cartesian ’xyz’ Molecule Representation
The representation is simply Cartesian co-ordinates of each base.
C = (x1, y1, z1), (x2, y2, z2), ..., (xn, yn, zn)
The main advantage of this representation is the speed with which it can be
Chapter 3. Methodology 27
moved with as when calculating the derivatives it requires only an update to 2
different sets of parameters, unlike with say chain spherical.
∆1,5 =√
(x1 − x5)2 + (y1 − y5)2 + (z1 − z5)2
Then for example:
d∆1,5
dx1=
x1 − x5
Delta1,5
All points individually effect the structure. However translating or rotating
all points will not affect the structure or its probability so there are 6 degrees of
redundancy within the representation.
3.2.3 Chain Spherical ’rtp’ Molecule Representation
Each base is described in spherical co-ordinates relative to the last base as can be
seen in Figure 3.1 without any rotation. Each base then uses normal non-rotated
Cartesian co-ordinates to determine the relative distance to the next base. r1, θ1
and φ1 specify the location of the first base relative to the origin. This means
that the chain spherical representation can represent any Cartesian molecule in
any position, which is useful for debugging.
So the structure is defined as follows:
C = (r1, θ1, φ1), (r2, θ2, φ2), ..., (rn, θn, φn)
To convert to Cartesian:
x1 = r1 cos(θ1) sin(φ1) y1 = r1 sin(θ1) sin(φ1) z1 = r1 cos(φ1)
x2 = x1 + r2 cos(θ2) sin(φ2) y2 = y1 + r2 sin(θ2) sin(φ2) z2 = z1 + r2 cos(φ2)
x3 = x2 + r3 cos(θ3) sin(φ3) y3 = y2 + r3 sin(θ3) sin(φ3) z3 = z2 + r3 cos(φ3)
... ... ...
and to convert back:
r1 =√
x21 + y21 + z21 θ1 = arctan y1, x1 φ1 = arccos z1r1
r2 =√
(x2 − x1)2 + (y2 − y1)2 + (z2 − z1)2 θ2 = arctan y2−y1x2−x1
φ2 = arccos z2−z3r2
... ... ...
Chapter 3. Methodology 28
y
x
z
θ1
r 1
r 2
z
y
x
φ1
r
φ
θ
2
2
2
3
x
y
z
Figure 3.1: Chain Spherical co-ordinate system
In this representation, r1, θ1 and φ1 do not in any way affect the actual struc-
ture of the molecule or its probability, so these can be ignored when attempting
to find the local minima. The final structure can still be rotated in 3 different
ways, so there are still 3 implicit degrees of redundancy with the representation.
As r1, θ1 and φ1 can be ignored when folding the molecule, it should converge
faster than the Cartesian representation. However, it is more expensive to calcu-
late the derivative of a point to point interaction. It requires updating on average
n/2 other bases, instead of just 2.
3.2.4 Bond and Torsion angle based
This representation stored the bond lengths, bond angles and torsion angles of
the chain. This is a very natural representation for the molecule as this relates
closely to the constraints that a real chemical model has to satisfy.
The derivatives for this are more complex and were not computed. Instead,
for completeness they were simulated with numerical approximations to check if
the resulting folding was any more complete, but no noticeable difference was
found.
Chapter 3. Methodology 29
G
A
A A
A
C G
A
A A
A
C1
2
3 4
5
6
Updates to bring GC bases closer to-
gether in Cartesian representation
Updates to bring GC bases closer to-
gether in Chain Spherical representa-
tion
Figure 3.2: Updates required in Cartesian Representation vs Chain Spherical
In this representation,
r1
bondangle1 bondangle2
torsionangle1 torsionangle2 torsionangle3are all the parameters that don’t affect the probability of the structure. After
removing these from consideration, there are no implicit redundant degrees of
freedom left, i.e. there is no simply way to change all the points that don’t affect
the energy function.
However, this comes at a high computational cost, as calculating each deriva-
tive of a point to point interaction is made considerably more expensive and
complicated.
3.2.5 Data preprocessing
The data that this project uses is pulled from various internet databases of RNA
structure files. This project uses [16] and then processes the data which is re-
trieved with various perl scripts.
When loading a RNA molecule in which RNA is stored as atom positions, the
Chapter 3. Methodology 30
mean of all atoms belonging to a single base are averaged to get a final position
for the whole molecule instead of for instance picking the same position on the
backbone every cycle.
In order not to bias the results, no sequence is allowed to have more than
one structure. If there are more than one structure for a sequence, the structure
is picked randomly from the options available. This is because there are often
duplicate entries in the database or entries done in very slightly different condi-
tions. However these don’t pose very a interesting challenge for this project as
first the rather larger question of whether it can get the general structure of the
RNA molecule should be answered.
3.3 Defining the Energy function
3.3.1 Requirements
As already mentioned, the parameters Ψ of the energy function must be continous
and differentiable in order that the technique that is used to make structure more
likely actually works. If for instance the parameters Ψ were discrete, another
technique like genetic algorithms would have to be applied.
The energy function is based only upon distances between different bases is
due to the multiple ways to represent the RNA molecules, as can be seen in the
last section. It is not clear that having non-distance terms would increase the
expressive power of the energy function but if there was any terms that would be
more useful not expressed in terms of distance, it would be very easy to implement
them.
3.3.2 Scoring the bonds between neighbours
The energy function needs to constrain bonds between neighbours in the chain
as bases can only be a certain distances apart and can only be at certain angles
to each other due to the underlying chemical structure. These constraints on the
bonds are generally expressed in terms of their length, bond angle and torsion
Chapter 3. Methodology 31
ACCACGGUUCACAA
Figure 3.3: Expressing bond preferences of sequential bases in terms of distance
angle. This project maps these constraints to distance based constraints. The
mapping from length, bond angle and torsion angle to distances can be seen in
3.3.
As each base has a difference chemical structure, it can be assumed that there
is a base specific term to the distributions of bonds, so the local structure is
constrained by the local sequence.
This leads to the beginning of an energy function:
H(C,Ψ, seq) =∑
i
αBL
n
(∆i,i+1 − pbsmeanseq(i,i+1))
2
pbsσseq(i,i+1)+αBA
n
(∆i,i+2 − pbsmeanseq(i,i+1,i+2))
2
pbsσseq(i,i+1,i+2)+
αBT
n
(∆i,i+3 − pbsmeanseq(i,i+1,i+2,i+3))
2
pbsσseq(i,i+1,i+2,i+3)
∆i,j is the Euclidean distance between bases i and j. There is no need to
normalise the gaussian distrbution used as this is the energy function. This
normalising occurs when converting to P (C | seq,Ψ) and the relative strength
learned from the real data and this means that energy function is more simple
which is an advantage as it will be evaluated a lot.
pbsmeanseq(i,i+1) is the mean of the bonds lengths that occur and pbsσseq(i,i+1) is
the standard deviation of that bond. These are simply set by looking at every
sequence fragment which the sequence seq(i,i+1) occurs and taking the mean and
standard deviation of ∆(i, i+ 1) through all the data that is available.
The parameters Ψ that are as defined as yet are αBL, αBA and αBT . αBL
models how important it is that the bond length being close to the mean for
those pair of basis seq(i, i+1), taking into account how much variance is naturally
within the bases. αBA and αBT do the same for bond angles and bond torsionl.
As there are so many pbs values, these are set directly from the database
Chapter 3. Methodology 32
instead of trying to learn them from the examples. These are very unlikely to
change much and it reduces the complexity of the learning problem.
In order to model multiple RNA molecules, this project models a single long
molecule but discounts the local bond constraints over the boundary to a different
actual RNA molecule which in essence allows modelling of multiple strands for
free. This means that the final energy function is:
H(C,Ψ, seq) =∑
i
αBL
n
(∆i,i+1 − pbsmeanseq(i,i+1))
2
pbsσseq(i,i+1)+αBA
n
(∆i,i+2 − pbsmeanseq(i,i+1,i+2)
2
pbsσseq(i,i+1,i+2)+
αBT
n
(∆i,i+3 − pbsmeanseq(i,i+1,i+2,i+3)
2
pbsσseq(i,i+1,i+2,i+3)
where Si,j is 1 if base i is on the same molecule as base j and 0 if not.
3.3.3 Incorporating Base to Base interactions
There are various interactions between bases that should be modelled as can be
seen in Table 3.1.
The α and C values are all in Ψ, the parameters of the energy function. The α
values are to be able to adjust the relative importance of each of the interactions,
while the C values allow adjusting how close the bases have to be. For instance,
it would be expected that CKD would be low as this it is always important that
base pairs do not overlap.
This give the complete energy function:
H(C,Ψ, seq) =∑
i
αBL
n
(∆i,i+1 − pbsmeanseq(i,i+1))
2
pbsσseq(i,i+1)+αBA
n
(∆i,i+2 − pbsmeanseq(i,i+1,i+2)
2
pbsσseq(i,i+1,i+2)+
αBT
n
(∆i,i+3 − pbsmeanseq(i,i+1,i+2,i+3)
2
pbsσseq(i,i+1,i+2,i+3)+
∑
i,j
−α2GCn2
MGCij
∆i,j + C2GC−α2AUn2
MAUij
∆i,j + C2AU−α2GUn2
MGUij
∆i,j + C2GU
−α2UMn2
MUMij
∆i,j + C2UM−α2LSn2
MLSij
∆i,j + C2LS−α2KD
n2MKD
ij
∆i,j + C2KD
where S are the parameters of the chain, and P are the parameters of the
energy function.
Chapter 3. Methodology 33
Energy Term Description
−α2GC
n2
MGCij
∆i,j+C2GC
G and C bases form strong triple hydrogen bonds and so
should be encouraged to come closer together. This is
done by subtracting this term from the energy function
which means that the closer each pair of G and C bases
are, the lower energy the final molecule is. MGCi,j = 1
everywhere that base i is G and base j is C or visa versa.
−α2AU
n2
MACij
∆i,j+C2AU
A and C bases form double hydrogen bonds and so should
also encouraged to come close together. MAUi,j is 1 every-
where that base i is A and base j is U or visa versa.
−α2GU
n2
MGUij
∆i,j+C2GU
G and U bases form a hydrogen bonds and so should also
be encouraged to come close together. MGUi,j is 1 every-
where that base i is G and base j is U or visa versa.
+α2UM
n2
MUMij
∆i,j+C2UM
UM stands for unmatched. Bases which are not comple-
mentary should be kept apart as no base pairs can form.
MUMi,j is 1 everywhere that base i is not complementary to
base j.
−α2LS
n2
MLSij
∆i,j+C2LS
LS stands for long stem matches is designed to encourage
long sequence of matching base pairs to come together.
This is explained in more detail below.a
z+α2KD
n2
MKDij
∆i,j+C2KD
KD stands for the Keep Distance. This is designed simply
to stop bases getting too close together or even overlap-
ping. All entries of MKD apart from the diagonal are set
to 1.
Table 3.1: Modelled base to base interactions
Chapter 3. Methodology 34
Parameter αBL αBA αBT αGC αAU αGU αUM αLS αKD CGC CAU CGU CUM CLS CKD
Initial Setting 1 1 1 3 2 2 2 1 2 .2 2 3 2 1 0.15
Table 3.2: Starting values
The conversion from energy function to a probability distribution is done:...
Thermodynamics tell us that the folding is a probability distribution proportional
to the exponent of the energy function[8].
The reason that point-wise interactions are divided by n2 while the sequential
constraints are only divided by n is to ensure that the interactions strengths are
relevant to molecule of different lengths.
3.3.4 Initial Parameter Settings
Starting values were set to αGC = 3αGU and αAU = 2αGU to account for the
number of hydrogen bonds that form when the base pairs match up and thus
their relative desirability. The other values were initally guess at, with the hope
that it would be close enough to the real answer so it could learn the rest. The
full set of starting values used can be seen in Table 3.2.
3.4 Incorporating common motif recognition
While the basic folding worked well enough, a second approach was tried to make
more use of the data available. [3] found the existance of recurring sequence
patterns that cross over protein family groups. Simple motif recognition is in-
corporated into this project in order to try and improve the performance of the
structure determination algorithm and to test if the equivilant observation holds
about recurring RNA bases.
3.4.1 Choosing motifs
Given the length of motifs n that should be searched for:
Chapter 3. Methodology 35
1. List all fragments of length n on every molecule within the database.
2. Group all fragments with the same sequence, discarding sequences with only
one fragment.
3. Pick out the fragements with the highest simularity between their structures
(See Section3.4.2)
4. Add the best motifs to the energy function with an extra parameter to
indicate their relative importance.
3.4.2 Comparing structures
Originally the structure were compared by converting to the distance, bond angle
and torsion angle representation then taking a simple distance measure between
the structure to be compared. This was highly unsatisfactory for many rea-
sons like the complexity of ensuring that the angle measurement correctly looped
around, the fact that it doesn’t generalise well to comparing more then two dif-
ferent motifs etc. It also provided misleading results about how well the project
was working that lead to lots of time being wasted.
The second approach that was taken was to measure the mean of the standard
deviation of each element of ∆ for each sequence. ∆ for molecule k is defined as:
∆ki,j =| C
ki − Ck
j |
The structure simulatity measure is then:
1
n4
n∑
i,j
(m∑
k
(∆ki,j − ( 1
m
∑mk ∆k
i,j))2
m− 1
While this takes O(n2) to evaluate, its simplicity means that it less likely to
be mis-implemented and more likely to give a correct answer.
Chapter 3. Methodology 36
3.5 Implementation Issues
3.5.1 Development Environment and Language
Initially development started in octave[17], however due to various annoyances
with this language development was moved onto matlab. Matlab was used for
speed of development which is important in a project with a short timescale. It
deals very well with numerical exceptions like dividing by 0, has good debug-
ging facilities, generally highly numerically accurate due to good libraries and
performance is good if the code is properly vectorized and/or compiled.
Netlab[12] is a very useful library of matlab functions to perform various nu-
merical tasks which proved invaluable. This project especially used the conjugate
gradient descent implementation and the gradient sanity checking code with some
minor modifications.
Matlab had in relation to this project serious problems with the number of
licenses running out when attempting to cluster the software. If given more time,
a port to a licence free language would be a very good idea. As the matlab
compiler can produce C code, it shouldn’t be too hard to integrate the resulting
compiled to C code into a Beowulf cluster using MPI.
Some of the glue code needed was built using the computer equivalent of duct
tape: Perl. This was mainly used to parse the databases of sequences online due
to its strength in text processing and to process equations outputted from Maple.
Revision tracking control was provided by CVS. This was used to insure that
the same code-base on multiple computers was matched up as a personal com-
puter was used to process the database and to be able to tag releases to keep
track of milestones achieved.
The development environment used was Emacs running in octave mode, with
the text mode version of matlab. This was found to be the most productive
method due to the occasionally instability of the Java interface to Matlab.
Chapter 3. Methodology 37
3.5.2 Regression checking
With any project it is important to be able to trust the results that it gives back.
This project has regression checking built in as various sanity checking function
that make sure the routines are giving back correct answers.
For instance, all derivatives that are computed within this project can be
rechecked numerically to check for any bugs in either the derivation or the im-
plementation. This proved to be highly useful when building the project.
3.5.3 Vectorization, Optimisation and Benchmarking
Most of the coding was initially written unvectorized for simplicity. It was then
vectorised for simplicity and extra speed. This provided most of the speed up.
The other main optimisation used (which was considerably less effort) was to
use the builtin matlab compiler. Overall the code ran twice as fast.
A specific folding case was used everytime to give fair comparisions of the
speed at this project could fold the RNA molecule.
3.5.4 Parallelizing the code
Various possible clustering methods were looked into, including plab[11] NetSolve
and hand-coding to Beowulf with MPI. Plab was used for speed of development
and the fact that almost all of the problem are ’embarrassingly parralizable’.
A brief attempt was made to port plab to octave as it would solve the problem
of only limited number of licenses being available for matlab within the Univer-
sity of Edinburgh. However lack of non-blocking I/O routines and bugs within
the blocking IO routines within the octave libraries meant that this had to be
abandoned.
There are many parts of the code that were parallelizable:
• Independently folding the same molecule to see if different local minima are
found with simulated annealing.
• Many small runs of MCMC instead of one big one run.
Chapter 3. Methodology 38
• When adjusting parameters, each molecule in the database can be consid-
ered separately, then the sum of their adjustments applied. This limits to
parralellism to a computer per molecule, but it is still a significant speed
up.
• When analysing the results, analysing each parameter setting is indepen-
dent, so it simply break up into work units which can be spread out.
3.6 Evaluating the Learning of the Parameters Ψ
Evaluating how well the parameters are adapted is not as trivial as it may at first
seem. For instance if we make some random sequences and folding them using
one set of parameters and then try to derive those parameters, there could be
many possible different settings which equate to the same set of local minima in
the energy function. A trivial example is double all the α parameters which will
result in a “colder” more extreme energy function gradient.
Three main evaluations were carried out: trying to fold an artificial example,
trying to derive the parameters of a random folding and attempting to fold real
data.
3.6.1 Initial Teething Problems
The computational expense of trying to repeatedly fold a complex model from
scratch for a sucession of slightly different parameter settings is prohibitively
expensive (compounded by the lack of matlab licenses) as many attempts may
have to be found to find the correct minima of the energy function, assuming that
it even exists. An efficient way to approach this problem is to start the folding
from the target structure as this ensures that the correct minima is always found.
As long as enough cycles of function minimisation is allowed to ensure convergence
to the true shape, the divergence from the initial this should acuratley reflect the
true performance of the algorithm.
However, in practice this turned out to be a very bad idea. As the folding
Chapter 3. Methodology 39
progresses, not only does the energy minima get closer to the shape that the
algorithm should be learning but also there is the equivilant of the temperature
decreasing in simulated annealing. 1. This is where the minimum of the energy
function get more and more extreme and the energy of these gets arbitiarily low
and the probability approachs one.
This would not normally be a problem but the netlab conjugate gradient
descent implemention that is used by this project uses a line search with a default
absolute accuracy of 10e-4. As the temperatre decreases, so does the ability of
the line search algorithm to progress around the energy landscape.
This means that the longer a run has been going on, the better it appears to
perform, indepedently of its actual performance. Needless to say, this was a bit
misleading and the sanity checking of running the folding for a very long time
didn’t help as each iteration was limited by the line accuracy.
All results that are present in the project have been folded from a circle to
ensure that there can be no bias due to this temperature change. This problem
was discovered late on in the project which combinded with the lack of matlab
licesnes is why there are less results than would be wanted for a perfect evaluation.
3.6.2 Learning an Artifical Hairpin Structure
A artificial hairpin was constructed, first in a very undisciplined manner. However
it was found that this hairpin was very unrealistic as the hairpin turn contained
to few bases. A more realistic model was made based on one of the loops of
Transfer RNA as can be seen in 3.4.
3.6.3 Random folding
A number of random sequences and random structures are generated, and then
“folded” with known random parameters setting. Using only the determined
structure of these sequences with all parameters reset to 1, test if the parameter
1The same things happens in the EM algorithm with a mixture of guassians where if a
Gaussian becomes centred on a single point, its variance gets arbitarily small, which is generally
a very bad model of the underlying data
Chapter 3. Methodology 40
−3 −2 −1 0 1 2 3
−0.5
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
C
U
C
C
U
A
U
G
C
U
U
G
G
A
G
Figure 3.4: Artificial hairpin based on loop of Transfer RNA
adaption routine can derive the correct parameter setting.
As the derived parameters may not necessarily be the same as the initial
paramteres but still be correct in being able to determine the correct structure,
we also have to test how close the determined structure for the random data is
to the initial folded structures.
3.6.4 Real data
Due to computational reasons, only a small subset of the complete dataset can
folded.
This will determine if the project would ever actually be useful for real worl
tasks in its present state and to highlight any specific problem with example
which it can not deal with.
This real data is folded both with and without motifs to investigate if they
help the molecule fold correctly.
Chapter 4
Results
4.1 Effects of Different Structural Repressentions
on Structure Determination
As predicted in Section 3.2.1, convergence is faster in the more advanced represen-
tations. Comparing the Cartesian molecule representation in Figure 4.1 against
the Chain spherical representation in Figure 4.2, there is an interesting qualita-
tive difference in how the molecule folds. It tends to curl around each bond more
which makes the straight lines more bendy, while the Cartesian tends to travel
more directly to the final shape, but take a lot longer.
The chain spherical representation works a lot faster even with its longer per
cycle evaluation time, as its about half as fast to evaluate but takes roughly 6
times as long to converge. Because of the speed difference, the Chain Spherical
RNA structure representation is used throughout the rest of this project.
4.2 Hairpin results
4.2.1 Adapting the Initial Parameter Settings
The first test of this project is to learn the structure of the artifical hairpin in
Section 3.4 for its sequence. Initially the parameters in 3.2 were used. Parameters
41
Chapter 4. Results 42
−2 0 2−2
−1
0
1
2 CU
C
C
UA
UGC
U
U
G
GA G
−0.063185
−2 0 2
−2
−1
0
1
CU
C
C
U
AUG
CU
U
G
GA G
−0.073129
−2 0 2
−2
−1
0
1
CU
C
C
U
AUG
CU
U
G
GA G
−0.073425
−2 0 2
−2
−1
0
1
CU
C
C
U
AUG
CU
U
G
G
A G−0.074583
−2 −1 0 1
−3
−2
−1
0
1
CU
C
C
U
AUG
CU
UG
G
AG
−0.077753
−3 −2 −1 0 1−3
−2
−1
0
1
C
U
C
CU
AUG
CU
UG
G
AG
−0.08253
−2 −1 0 1
−2
0
2 C
U
C
CU
AUG
CU
UG
G
AG
−0.083912
−2 −1 0 1
−2
0
2 C
U
C
CU
AUG
CU
UG
G
AG
−0.084232
−2 −1 0 1
−2
0
2 C
U
C
CU
AUG
CU
UG
G
AG
−0.084614
−2 0 2
−2
0
2 C
U
C
CU
AUG
CU
UG
G
AG
−0.085374
−2 0 2
−2
0
2 C
U
C
CU
AUG
CU
UG
G
AG
−0.085724
−2 −1 0 1
−2
0
2 C
U
C
CU
AUG
CU
UG
G
AG
−0.085804
Figure 4.1: Folding a hairpin in the Chain Spherical Molecule Representation
−2 0 2−2
−1
0
1
2 CU
C
C
UA
UGC
U
U
G
GA G
−0.05742
−2 0 2−2
−1
0
1
2 CU
C
C
UA
UGC
U
U
G
GA G
−0.059239
−2 −1 0 1−2
−1
0
1
2 CU
C
C
UA
UGC
U
U
G
GA G
−0.060734
−2 −1 0 1−2
−1
0
1
2 CU
C
C
U
AUG
CU
U
G
G
A G−0.066958
−2 0 2−2
−1
0
1
2C
U
C
C
U
AUG
CU
UG
G
AG
−0.06724
−2 0 2−2
−1
0
1
2C
U
C
C
U
AUGC
U
UG
G
AG
−0.070259
−2 0 2−2
−1
0
1
2C
U
C
C
U
AUGC
U
UG
G
AG
−0.070393
−2 0 2−2
−1
0
1
2C
U
C
C
U
AUGC
U
UG
G
AG
−0.070597
−2 0 2−2
−1
0
1
2C
U
C
C
U
AUGCU
UG
G
A
G−0.070907
−2 0 2−2
−1
0
1
2C
U
C
C
U
AUGC
U
UG
G
A
G−0.071848
−2 0 2−2
−1
0
1
2C
U
C
C
U
AUG
CU
UG
G
A
G−0.071873
−2 0 2
−2
−1
0
1
2C
U
C
CU
AUG
CU
UG
G
A
G−0.071988
Figure 4.2: Folding a hairpin in the Cartesian representation: Every 6th Frame
Chapter 4. Results 43
0 2000 4000 6000 8000 10000 12000 14000 160000
2
4
6
8
10
12
14
16
18ABTMGCcGCMAUcAUMGUcGUUMcUMLScLSKDcKD
Parameter αBL αBA αBT αGC αAU αGU αUM αLS αKD CGC CAU CGU CUM CLS CKD
Initial Setting 1 0 0 3 2 2 2 1 2 .2 2 3 2 1 0.15
Final Settings 16.3991 0 0 2.9135 2.1358 2.0676 1.9150 1.0056 1.9947 0.1842 2.0023 3.0283 2.0284 0.8794 0.0011
Figure 4.3: Adjusting parameters of almost correct hairpin
were adjusted as can be seen in Figure 4.3, but there was disappointingly little
imporvement as can be seen in Figure 4.4.
Overall the simularity matrix dropped from 0.0170 to 0.0134 which is a 24
Comparing the energy of the straight line with the energy of the hairpin can
be seen in Figure 4.5.
4.2.2 Adapting Difficult Parameter Settings
While adapting those parameters was reassuring, it isn’t an impressive demon-
stration of the power of the technique. Adjusting the parameters to make the
initial shape a lot more distant from gives us a more difficult example case. The
adaption of the parameters can be seen in Figure 4.7 and the initial and final
folded shape can be seen in 4.6. The scoring matric improves from 0.0564 to
0.0192, which confirms that the hairpin is a lot closer to being correct.
Chapter 4. Results 44
−1 0 1−1
0
1
2
3
4
5
C
U
C
C
U
A
U
G
C
U
U
G
G
A
G
Initial Structure
0 1 2−1
0
1
2
3
4
5
6
C
U
C
C
U
A
UG
C
U
U
G
G
A
G
Best structure with initial parameters
0 1 2
0
1
2
3
4
5
6
C
U
C
C
U
A
UG
C
U
U
G
G
A
G
Best structure with final parameters
Figure 4.4: Comparing structure under initial parameter settings and learnt parameters
0 200 400 600 800 1000 1200 1400 1600−5
0
5
10
15
20
25
30Line energyHairpin energy
Figure 4.5: Comparing the Energy of a Line against the Hairpin
Chapter 4. Results 45
−1 0 1
−1
0
1
2
3
4
5
C
U
C
C
U
A
U
G
C
U
U
G
G
A
G
Initial Structure
0 1 2 3
−2
0
2
4
6
8
10
CU
C
C
U
AUG
C
U
U
G
GA G
Best structure with perturbed parameters
0 1 2−2
−1
0
1
2
3
4
5
6
7
8
C
U
C
C
U
A
UGC
U
U
G
G
A
G
Best structure with final parameters
Figure 4.6: Comparing the hairpin under intitial perturbed parameters and learnt
parameters
Chapter 4. Results 46
0 0.5 1 1.5 2 2.5
x 104
0
2
4
6
8
10
12
14
16
18ABTMGCcGCMAUcAUMGUcGUUMcUMLScLSKDcKD
Parameter αBL αBA αBT αGC αAU αGU αUM αLS αKD CGC CAU CGU CUM CLS CKD
Perturbed Settings 1 0 0 3 2 2 2 1 2 1 1 1 1 1 1
Final Settings 16.5157 0 0 2.9456 2.2201 2.1206 1.8610 1.0127 1.9933 0.6667 1.1152 1.9606 2.0692 0.7860 0.6031
Figure 4.7: Adjusting parameters for the hairpin from initially perburbed parameters
Chapter 4. Results 47
Parameter αBL αBA αBT αGC αAU αGU αUM αLS αKD CGC CAU CGU CUM CLS CKD
Hidden Parameters 0.5 1 4 1.5 1.6 1.2 0.8 2 1.3 3 1.5 0.5 1 2 0.5
Derived Settings 0.0001 20.2819 252.6689 0.9944 1.0004 1.0028 0.9951 0.9973 1.0004 1.0163 1.0145 0.9950 1.0003 1.0193 1.0185
Figure 4.8: Adjusting Parameters of a Hairpin starting from perturbed parameters
−15−10−5 0 5
1015
2025
0
5
10
15
GG
C
C
CG
G
CC
GG
Initial hairpin shape
GC
C
C
G
GCC
G
−100
10
1520
25
0
10
20GG
CG
C
G
C
Folding of hairpin with initial parameters
C
C
GC
C
G
G
C
G
GG
CC
−15−10
−50
5
16182022242628
5
10
15
20
C
GG
C
G
CG
C
C
G
Folding of hairpin with best parameters
G
C
GG
CG
C
GC
C
Figure 4.9: Individually folding a molecule from the main sample
4.3 Rederiving the Parameters of Random Folded
Data
As can be seen in Figure 4.8, the learning algorithm did not managed to
4.4 Real RNA data results
This structure determination was attempted on real small molecules of RNA. The
results of this are shown in Appendix A without motifs and in Appendix B with
motifs. The results are a bit disappointing, probably partially due to having to
try and adapt to many molecules at the same time. The motifs do not appear
to make any large difference to the results which is quite suprising as one would
expect that are would be an essential part of increasing the probability.
Looking at the first model in detail, there is not much difference when param-
eters are modelled for it on its own as can be seen in figure 4.9.
There does not appear to be any big difference. Its error score is 0.3295
initially which reduces to 0.2148 which is a bit better but visually the structure
Chapter 4. Results 48
does not appear any different.
Lots of other interesting comparsions were done like testing the generalisation
performance. Due to late discoverery of the bad combination of the default line
accurary search parameters for conjugate gradient, with the “decreasing temper-
ature” phenomeon and the sudden severe instability of the plab library meant
that they had to be left out. However, they should hopefully be ready in time
for the demonstration.
Chapter 5
Conclusion
5.1 Concluding Remarks and Observations
Using motifs and with unsupervised learning on known RNA molecules this
project managed to fold novel RNA molecules of low complexity and approxi-
mate the structure of high complexity models with very little prior knowledge
of the problem. It also managed to learn to fold closer to the correct answer all
examples it was introduced to. While this demonstrates the power of applying
probability theory to this problem, it also highlights that the energy function has
too few parameters to capture all the necessary interactions to fully determine the
structure of RNA molecules. However just adding huge amounts of parameters
could lead to the opposite problem of over-fitting, which shows that designing
the energy function is far from trivial.
This project has produced a framework that allows easy exploration and ex-
perimentation of RNA structure determination in a distributed and very visual
manner. In particular all folding, MCMC sampling etc can be watched in real
time, even if it is clustered(each clustered computer opens a window for real-time
display of what it is doing). Seeing this temporal element helps understanding
what is actually happening, how the project is functioning and what if anything
is going against expectations. The existing code base of the project means that it
is very easy to substitute in a new energy function which would should succeed at
49
Chapter 5. Conclusion 50
the problem better, but this project has demonstrated the feasibility of learning
the energy parameters.
This problem had many annoying technical problems to deal with that weren’t
directly related to the project. The main problem was the lack of matlab licenses
available to get the requisite amount of computer power(as the licenses were
shared with lots of other people) which combined with the late discovery of the
temperature decreasing phenomenon lead to a not as full as evaluation as would
be wanted. The chronic shortage of matlab compiler licenses also meant that often
the matlab code would be running at less than optimal speed which resulted in
lots of time wasted. The stability of plab was also highly questionable at times,
often resulting in losing the results of long runs of analysing the result when it
was finishing.
5.2 Unsolved Problems
In the results section it was clear that the energy function does not seem to be
adaptable enough to learn complex RNA molecule shapes, like the ’S’ shape which
had such problems in the big runs. An extra term to explicitly cope with these
type of structure would have to be added.
This project can use quite ridiculous amounts of computing power, with up
to 25 computers harnessed for analysing a single instance for 20 hours. This is
partially due to the limitations of implementing it in matlab but it is also an
intrinsic characteristic of the problem and the approach taken.
No convergence testing is done as it very difficult to detect as can often be
seen when matching the molecule folding. This results in extra computation time
being needed as all molecules must be folded for a long enough period that they
will almost always have fully convereged.
Chapter 5. Conclusion 51
5.3 Suggestions For Further Work
5.3.1 Integrating Temporal Information
As outlined in 1.5, this project can not collect any temporal information about
the RNA molecule. A possible solution to this problem would be modelling the
project as Markov Chains, with the transition matrix being a discritized version
of P (xi+1|xi), where the equilibrium distribution should then provide the correct
structure which we are looking for. It should then be possible to integrate out
any time specific elements.
5.3.2 Expanding upon Motif Recognition
With a lot of the non high scoring motifs, there were distinct clusters of possible
local structure as can be seen in ??. These were generally ignored by this project
as there were example available where all motifs were in the same cluster. It would
be worth investigating the link between the surrounding sequence and which motif
is choosen, as is often done in Protein Structural Determination. These extra
motifs can then be incorporated in real energy functions for sequences that fit
that cluster.
Also the length of all motifs is currently preset. It would prove fruitful to try
and identify a method where the most promising motifs within a range of length
could be found as if the motif size is too small, lots of motifs are redundent and
cover the same area while if they are too big the motifs will be missed.
5.3.3 Improving the energy function:
The energy function used within this project has be more adaptable because while
this project sucessfully maximised the parameters to give the best approximation
by the structure, the approximations were generally not that close to the real
structure.
Chapter 5. Conclusion 52
5.3.4 Implementation in a Different Language
Matlab was very useful in the short timeframe for this project as a prototyping
language. However, an implementation in a language which doesn’t require li-
censes would be needed to look at any really interesting amounts of data as one
of the main problem this project faced was the University of Edinburgh running
out of matlab licenses.
Also the vectorization approach that matlab takes is not ideal with some of the
optimisations that would be wanted in this project. For instance neighbour lists
would allow discarding interactions that are too far apart to have any significant
influence upon each other and would allow the folding of longer RNA molecules.
5.3.5 A Fully Cross Validated Run
In order to fully evaluate this project, a true cross validation run is need that
maintains a strict barrier between the data that is learnt from the data that is
being folded. While this project did manage to minimise the contamination, the
lack of matlab licenses prohibited doing a fully cross validated sample.
Appendix A
Structure produced by Real RNA
Molecules without Motifs
Table ?? shows how the folding of each molecule changed from how it was folded
under the initial settings.
Best folding under Initial Best folding under final
Parameters Parameters
10
2010
20
20
25
30
35
40
45
50
55
60
65
C
C
C
A
A
GG
G
A
A
C
Target
U
G
C
A
G
A
U
G
C
G
G
A
G
CC
CUG
510
1520
−50
5
−5
0
5
10
A
C
AG
C
C
UG
A
GG
A
CA
G
GGC
U
0.38222
A
CAU
GCGC
GC
20
40
−100
10
−20
−10
0
10
A
GAG
AGG
C
A
AU
C
UC
GA
C
C
G
G
0.3987
C
U
A
GC
GC
GC
53
Appendix A. Structure produced by Real RNA Molecules without Motifs 54
Best folding under Initial Best folding under final
Parameters Parameters
−20−10
040
5060
−18−16−14−12−10−8−6
AG
G
A
C
G
AUGG
AAUC
Target
C
C
C
GG
U
A
CGGU
U
G
510
155
1015
−15
−10
−5
0
G
U
A
CU
CG
C
G
A
A
A
0.46079
UC
GU
A
CGGGGG
G
CU
A
1020
30
010
20
−10
0
10
20
GA
U
U
A
C
C
C
GGAG
A
G
A
C
G
C
G
A
UG
G
U
0.37914
CGU
010
20
1015
2025
510152025
GG
C
A
U
G
C
U
G
G
U
A
C
C
AA
C
C
U
Target
U
G
GG
UA
GA
1020
−10−5
05
−5
0
5CG
AU
GCA
GAU
0.41585
GAUGCUA
UG
GC
CC
GA
GU
−100
1020
−20−10
0
−20
−10
0
GUA
G
CA
U
0.37418
G
G
UCA
U
CUAC
G
A
GU
G
G
G
C
AC
4050
60
3040
50
50
60
70
80
UCACG
UAA
U
Target
GA
A
CC
GGG
A
U
A
CG
C
AAG
−8−6−4−205
1015
2025
−10
−5
0
5
A
AU
A
GCG
A
A
AC
U
G
0.61945
CG
G
GCA
U
AG
U
CA
C
23510
1520
25
−30
−20
−10
0
10
20
G
A
A
A
C
U
C
A
A
A
U
G
G
G
U
0.4156
G
C
U
G
G
A
C
C
C
A
A
Appendix A. Structure produced by Real RNA Molecules without Motifs 55
Best folding under Initial Best folding under final
Parameters Parameters
−50
525
3035
15
20
25
30
35 C
C
C
C
G
G
G
C
Target
G
G 24
6
68
−4
−2
0
2
C
G
CG
C
1.1524
G
G
C
C
G
10
20
−50
510
−10
−5
0
G
C
G
G
C
0.30863
C
C
CG
G
−50
525
3035
15
20
25
30
CG
G
G
C
G
C
C
C
Target
G4
6
34
56
7
1
2
3
4
5
6
C
G
CG
1.1462
C
G
C
G
G
C
−10−5
05
−20246
−20
−15
−10
−5
0
G
G
C
G
G
C
0.28175
CC
C
G
−50
510
15
−50
5
−10
−5
0
5
10
15
C
UCG
Target
U
G
AGG
CUCU
0 2 4 6−5
0
5
2468
G
C
UU
G
CU
C
C
1.016
GG
AU
−20−10
0
246
5
10
15
20
25
30 C
GU
A
UG
UC
0.45517
GGCUC
Appendix A. Structure produced by Real RNA Molecules without Motifs 56
Best folding under Initial Best folding under final
Parameters Parameters
1520
25−5
05
−30
−25
−20
−15
−10
−5
0
5
10
15
20
A
G
G
G
C
A
C
A
A
Target
G
U
U
UG
CC
C
CC
05
10
24
68
−10
−5
0
5 C
GC
G
C
C
U
0.91667
A
A
CG
U
C
C
G
GA
AU 0
1020
30
681012
0
10
20
GC
UC
G
C
A
G
G
U
C
0.62657
A
A
U
A
C
G
CC
510
1520
2530
35
5
10
15
20
25U
C
G
C
GC
Target
UU
G
C
G
A
2.533.5−2
02
4
0
5
10
15
20
C
G
U
0.8874
CG
UA
U
CG
GC
05
10
−50
510
0
5
10
15
G
G
U
C
C
C
U
0.53291
G
G
A
UC
Appendix A. Structure produced by Real RNA Molecules without Motifs 57
Best folding under Initial Best folding under final
Parameters Parameters
Table A.1: Structure produced and learnt parameters.
Number above each graph indicates the error in the re-
production
Appendix B
Structure produced by Real RNA
Molecules with Motifs
RNA molecule Best folding under Initial Best folding under final
Parameters Parameters
10
2010
20
20
25
30
35
40
45
50
55
60
65
C
C
C
A
A
GG
G
A
A
C
Target
U
G
C
A
G
A
U
G
C
G
G
A
G
CC
CUG
0 2 45
10
15
−10
−5
0
5
10
AG
U
C
CG
A
C
G
G
CG
A
0.50746
C
C
G
CG
GGC
G
U
A
C
AUA
A
010
2020
40
−20
−10
0
10
ACCGGGUCGU
GU
A
C
0.34341
C
C
G
AAC
GA
CGC
AG
A
G
58
Appendix B. Structure produced by Real RNA Molecules with Motifs 59
RNA molecule Best folding under Initial Best folding under final
Parameters Parameters
−20−10
040
5060
−18−16−14−12−10−8−6
AG
G
A
C
G
AUGG
AAUC
Target
C
C
C
GG
U
A
CGGU
U
G
510
1520
25
510
15
−10
−5
0
GA
UG
CA
U
CGAUC
GGC
CGUC
0.40749
G
A
A
G
GGA
U
−100
10
−50
510
−20
−10
0
10
GU
GA
G
C
C
U
GGCU
U
A
0.30399
A
C
CU
C
GG
AG
AG
GA
010
20
1015
2025
510152025
GG
C
A
U
G
C
U
G
G
U
A
C
C
AA
C
C
U
Target
U
G
GG
UA
GA
−15−10
−50
5
1015
2025
−8−6−4−2
02 U
UA
GAU
C
GC
G
CG
UG
AG
U
CC
0.35378
CAG
UGAG
A
−50 5 1015
−100
1020
30
−505
UCA
UA
C
GC
U
G
C
U
G
A
0.31095
U
A
A
GG
GG
GU
A
CCG
4050
60
3040
50
50
60
70
80
UCACG
UAA
U
Target
GA
A
CC
GGG
A
U
A
CG
C
AAG
−20 2 4 6 8
−5
0
5
−15
−10
−5
0
5
A
G
A
C
G
G
A
U
C
A
CCU
G
G
U
0.67754
A
C
G
AA
CG
A
UA
−100
1020
−100
10
−10
0
10
CAGA
U
C
GA
GG
A
0.40911
G
GC
CCAUUAC
GAUAA
Appendix B. Structure produced by Real RNA Molecules with Motifs 60
RNA molecule Best folding under Initial Best folding under final
Parameters Parameters
−50
525
3035
15
20
25
30
35 C
C
C
C
G
G
G
C
Target
G
G2
468
10
−4
−3
−2
−1
0
CG
G
G
C
1.1595
C
GC
CG
1020
−10−5
05
−20246
CGGGCC
0.26177
CCGG
−50
525
3035
15
20
25
30
CG
G
G
C
G
C
C
C
Target
G
02
46
8
−4
−2
0
GC
G
C
C
1.1624
G
CG
GC
−10−5
05
05
10
−20
−15
−10
−5
0
G
G
G
C
C
G
C
0.2531
CG
C
−50
510
15
−50
5
−10
−5
0
5
10
15
C
UCG
Target
U
G
AGG
CUCU
24
24
68
2
4
6
8
10
12
AU
U
C
U
1.0979
G
U
G
C
C
G
C
G
−15−10
−50
−20
−10
0
−20
−10
0
10
C
U
C
G
G
0.39203
C
U
UG
C
A
GU
Appendix B. Structure produced by Real RNA Molecules with Motifs 61
RNA molecule Best folding under Initial Best folding under final
Parameters Parameters
1520
25−5
05
−30
−25
−20
−15
−10
−5
0
5
10
15
20
A
G
G
G
C
A
C
A
A
Target
G
U
U
UG
CC
C
CC
24
68
46
8
−12
−10
−8
−6
−4
−2
0
2
4
A
C
G
U
CGCCG
A
UA
U
0.94021
GC
C
C
G
A
−30−20
−100
010
20
−505
AA
G
UG
C
C
CC
C
U
0.45836
U
A
G
CGACG
510
1520
2530
35
5
10
15
20
25U
C
G
C
GC
Target
UU
G
C
G
A
24
68
10
12345
0
2
4
6
8
UU
A
1.1172
U
CGC
G
G
C
CG
510
150
510
−10
−5
0
5
10
CGU
U
G
A
C
0.43994
U
GG
CC
Table B.1: Structure produced with motifs and learnt
parameters. Number above each graph indicates the er-
ror
Bibliography
[1] Kim T. Simons Ingo Ruczinski Charles Kooperberg Brian A. Fox Chris
Bystroff and David Baker Improved Recognition of Native-Like Protein
Structures Using a Combination of Sequence-Dependent and Sequence-
Independent Features of Proteins PROTEINS: Structure, Function, and Ge-
netics 34:82-95 (1999)
[2] Kim T. Simons, Charles Kooperberg, Enoch Huang and David Baker As-
sembly of Protein Tertiary Structure from Framents with Similar Local Se-
quences using Simulated Annealing and Bayesian Scoring Functions J. Mol.
Biol. (1997) 268, 209-225
[3] Karen F.Han Christopher Bystroff and Davd Baker Three-dimensional struc-
tures and contexts associated with recurrent amino acid sequence patterns
Protein Science (1997) 6 1587-1590, Cambridge University Press
[4] Christopher Bystroff An Alternative Derivation of the Equations of Motion
in Torsion Space for a Branched Linear Chain Protein Engineering vol.14
no.11 pp.825-828, 2001
[5] Carol A. Rohl and David Baker De Novo Determination of Protein Backbone
Structure from Residual Dipolar Couplings Using Rosetta J. Am. Chem. Soc.
Vol 124, No 11, pp.2723-2729, 2002
[6] R. E. Shapire. The boosting approach to machine learning: An overview
MSRI Workshop on Nonlinear Estimation and Classification, 2002.
http://www.cs.princeton.edu/ schapire/papers/msri.ps.gz
62
Bibliography 63
[7] Micheal Zuker On Finding All Suboptimal Foldings of an RNA Molecule
Science, New Series, Volume 244, Issue 4900 (Apr 7, 1989), 48-52
[8] Rune B. Lyngs, Michael Zuker and Christian N. S. Pedersen An Improved
Algorithm for RNA Secondary Structure Prediction BRICS RS-99-15 May
1999
[9] A. Zuker, B. Mathews and C. Turner Algorithms and Thermodynamics for
RNA Secondary Structure Prediction: A Practical Guide MFold manual
http://www.bioinfo.rpi.edu/ zukerm/export/mfold-3.0-manual.pdf
[10] DNA, RNA, PNA; stepping backwards in time http://exobio.ucsd.edu/
Space_Sciences/rna-dna.htm A summary of an unavailable in time paper
Gerald Joyce Nature, Vol. 338 (1989) pp. 217-224
[11] Ulrik Kjems Parallel Matlab http://bond.imm.dtu.dk/plab/about.html
[12] Ian Nabney and Christopher M. Bishop Netlab neural network software
http://www.ncrg.aston.ac.uk/netlab/index.html
[13] F. Eckstein RNA and DNA as Catalysts Unpublished, available at
http://www.mpibpc.gwdg.de/inform/MpiNews/cientif/jahrg5/9.99/scta.html
[14] David J.C. MacKay Information Theory, Inference and Learning Algorithms
360-381
[15] Chen Yanover and Yair Weiss Approximate Inference
and Protein Folding NIPS*2002 Paper AP08 (2002)
http://www.cs.huji.ac.il/ cheny/sidechain/sc.pdf
[16] H. M. Berman, W. K. Olson, D. L. Beveridge, J. Westbrook, A. Gel-
bin, T. Demeny, S.-H. Hsieh, A. R. Srinivasan, and B. Schneider. The
Nucleic Acid Database: A Comprehensive Relational Database of Three-
Dimensional Structures of Nucleic Acids. Biophys. J., 63, 751-759. (1992)
http://beta-ndb.rutgers.edu/
[17] John W. Eaton Octave http://www.octave.org/