DNA/RNA Structure - Vrije Universiteit


Post on 22-May-2020


Centre for Integrative Bioinformatics VU

Master Course DNA/Protein Structure-function Analysis and Prediction

Lecture 12

DNA/RNA Structure Prediction

Epigenetics – Epigenomics: Gene Expression

• Transcription factors (TFs) are essential for transcription initiation

• In eukaryotes, mRNA is transcribed by RNA polymerase II (Pol II)

• mRNA must then move from the nucleus to the (extranuclear) ribosomes for translation

• In eukaryotes there can be many TF-binding sites upstream of an ORF that together regulate transcription

• Nucleosomes (chromatin structures composed of histones) are structures around which DNA coils. This coiling blocks TF access to the DNA

Epigenetics – Epigenomics: Gene Expression

[Figure: mRNA transcription. Labels: TF binding site (open), TF binding site (closed), TATA, Nucleosome.]

Expression

• Because DNA is flexible, bound TFs can move in order to interact with Pol II, which is necessary for transcription initiation (see next slide)

• A recent TF-based initiation theory includes a wave function (Carlsberg) of TF binding, which is supposed to go from left to right. In this way the TF-binding site nearest to the TATA box would be bound by a TF, which would then in turn bind Pol II.

• It has been suggested that “speckles” have something to do with this (speckles are protein plaques observed in the nucleus)

• Current prediction methods for gene co-expression, e.g. finding a single shared TF-binding site, do not take this TF cooperativity into account (“parking lot optimisation”)

Expression (continued)

[Figure: mRNA transcription. Labels: TF binding sites, TATA, TF, Pol II, mRNA, Speckle.]

This is still a hypothetical model…

DNA/RNA Structure-Function relationships

• Apart from coding for proteins via genes, DNA is now known to code for many more RNA-based cell components (snRNA, rRNA, ...)

• Structural features of DNA (e.g. bendability, histone binding, methylation) are increasingly recognised as important.

• For the many different classes of RNA molecules, structure directly determines function

• It is therefore important to analyse and predict DNA structure and, in particular, RNA structure

Canonical base pairs

The complementary bases C-G and A-U form stable base pairs with each other through hydrogen bonds between donor and acceptor sites on the bases. These are called Watson-Crick base pairs and are also referred to as canonical base pairs. In addition, we consider the weaker G-U wobble pair, in which the bases bond in a skewed fashion. Other base pairs also occur, some of which are stable. These are all called non-canonical base pairs.
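As a small illustration of this terminology, the following Python sketch labels a pair of bases as Watson-Crick (canonical), G-U wobble, or non-canonical. The function name `classify_pair` and the three string labels are choices made for this example and are not part of the lecture.

```python
# Unordered base pairs, following the definitions on this slide.
WATSON_CRICK = {frozenset("CG"), frozenset("AU")}
WOBBLE = {frozenset("GU")}


def classify_pair(x: str, y: str) -> str:
    """Label the pair formed by bases x and y (case-insensitive)."""
    pair = frozenset((x.upper(), y.upper()))
    if pair in WATSON_CRICK:
        return "canonical"
    if pair in WOBBLE:
        return "wobble"
    # Everything else (A-A, C-A, ...) is treated as non-canonical here.
    return "non-canonical"
```

Using unordered sets means that G-C and C-G are treated identically, which matches the way the slide lists each pair only once.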

RNA secondary structure

The secondary structure of an RNA molecule is the collection of base pairs that occur in its 3-dimensional structure. An RNA sequence will be represented as R = r1, r2, r3, …, rn, where ri is called the i-th (ribo)nucleotide. Each ri belongs to the set {a, c, g, u}.

Secondary Structure and Pseudoknots

A secondary structure, or folding, on R is a set S of ordered pairs, written as i-j, satisfying:

1. j - i > 4
2. If i-j and i’-j’ are 2 base pairs (assuming without loss of generality that i ≤ i’), then either:
   1. i = i’ and j = j’ (they are the same base pair),
   2. i < j < i’ < j’ (i-j precedes i’-j’), or
   3. i < i’ < j’ < j (i-j includes i’-j’)

The last condition excludes pseudoknots. These occur when 2 base pairs, i-j and i’-j’, satisfy i < i’ < j < j’.
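The definition can be checked mechanically. The sketch below is one possible reading of it in Python, assuming 1-based positions and strict inequalities (so that each base takes part in at most one pair, as in the Zuker formulation later in these slides). The function name `is_secondary_structure` is hypothetical.

```python
def is_secondary_structure(pairs, n):
    """Return True if `pairs` (1-based (i, j) tuples) is a valid folding
    of a sequence of length n under the definition on this slide."""
    ps = sorted(set(pairs))  # sorting guarantees i <= i' for later pairs

    # Condition 1, plus the obvious range check 1 <= i < j <= n.
    for i, j in ps:
        if not (1 <= i < j <= n and j - i > 4):
            return False

    # Condition 2: any two distinct pairs must be nested or disjoint.
    for a, (i, j) in enumerate(ps):
        for i2, j2 in ps[a + 1:]:
            precedes = i < j < i2 < j2
            includes = i < i2 < j2 < j
            if not (precedes or includes):
                return False  # shared base or pseudoknot (i < i' < j < j')
    return True
```

For example, the nested pairs 1-10 and 2-9 are accepted, while 1-8 together with 4-12 is rejected as a pseudoknot.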

Pseudoknots

Pseudoknots are not taken into account in secondary structure prediction because energy-minimizing methods cannot deal with them: it is not known how to assign energies to the loops created by pseudoknots, and dynamic programming methods that compute minimum-energy structures break down. For this reason, pseudoknots are often considered as belonging to tertiary structure. Nevertheless, pseudoknots are real and important structural features, and covariance methods (next slide) are able to predict them from aligned, homologous RNA sequences. The figure on the next slide represents a small pseudoknot model.

A 3D model of a pseudoknot

A 3D model of a pseudoknot

The 2 helices in the structure (preceding slide) are stacked coaxially.

RNA structure can be predicted from sequence data. There are two basic routes.

1. The first attempts structure prediction of single sequences based on minimizing the free energy of folding.

2. The second computes common foldings for a family of aligned, homologous RNAs. Usually, the alignment and secondary structure inference must be performed simultaneously, or at least iteratively (see next slide)

Predicting RNA Secondary Structure

• By the thermodynamic method
  • Minimize the Gibbs free energy

• By the phylogenetic comparison method (covariance method)
  • Compare RNA sequences of identical function from different organisms

• By a combination of the above two methods
  • In principle, this could be the most powerful method

Thermodynamics

• Gibbs free energy, G
  • Describes the energetics of biomolecules in aqueous solution.

The change in free energy, ∆G, for a chemical process, such as nucleic acid folding, can be used to determine the direction of the process:

• ∆G = 0: equilibrium
• ∆G > 0: unfavorable process
• ∆G < 0: favorable process

• Thus the natural tendency for biomolecules in solution is to minimize the free energy of the entire system (biomolecules + solvent).

Thermodynamics

• ∆G = ∆H - T∆S, where ∆H is the change in enthalpy, ∆S is the change in entropy, and T is the temperature in Kelvin.

• Molecular interactions, such as hydrogen bonds, van der Waals and electrostatic interactions, contribute to the ∆H term. ∆S describes the change of order of the system.

• Thus, both molecular interactions and the order of the system determine the direction of a chemical process.

• For any nucleic acid solution, it is extremely difficult to calculate the free energy from first principles

• Biophysical methods can be used to measure free energy changes
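To make the sign convention concrete, here is a minimal Python sketch of ∆G = ∆H - T∆S. The enthalpy and entropy values are hypothetical numbers chosen only for illustration (they are not measured data), and the function names are assumptions of this example.

```python
def gibbs_free_energy(delta_h: float, delta_s: float, temperature_k: float) -> float:
    """Delta G = Delta H - T * Delta S (units: kcal/mol, kcal/(mol K), K)."""
    return delta_h - temperature_k * delta_s


def equilibrium_temperature(delta_h: float, delta_s: float) -> float:
    """Temperature at which Delta G = 0, i.e. T = Delta H / Delta S."""
    return delta_h / delta_s


# Hypothetical folding parameters for a small hairpin (illustrative only).
DELTA_H = -50.0   # kcal/mol: folding releases heat
DELTA_S = -0.14   # kcal/(mol K): folding increases order

dg_body = gibbs_free_energy(DELTA_H, DELTA_S, 310.15)   # negative: favorable
dg_hot = gibbs_free_energy(DELTA_H, DELTA_S, 370.0)     # positive: unfavorable
t_eq = equilibrium_temperature(DELTA_H, DELTA_S)        # Delta G = 0 here
```

With these assumed numbers, folding is favorable at 37 °C and becomes unfavorable above the temperature where the two terms cancel, which illustrates how both ∆H and ∆S determine the direction of the process.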

Thermodynamics: The Equilibrium Partition Function

• For a population of structures S, a partition function Q and the probability P(s) of a particular folding s can be calculated:

$$Q = \sum_{s \in S} e^{-\Delta G_s / RT}, \qquad P(s) = \frac{e^{-\Delta G_s / RT}}{Q}$$

• The heat capacity for the RNA can be obtained:

$$G = -RT \ln Q \qquad \text{and} \qquad C_p = -T \frac{\partial^2 G}{\partial T^2}$$

Heat capacity Cp (the heat required to change the temperature by 1 degree) can be measured experimentally, and can then be used to get information on G
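The partition function relations can be tried out numerically. The Python sketch below assumes a tiny, made-up ensemble of three foldings whose ∆G values are treated as independent of temperature; that simplification, the example energies, the finite-difference step, and all function names are assumptions of this illustration.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol K)


def partition_function(delta_gs, temperature_k):
    """Q = sum over foldings s of exp(-dG_s / RT)."""
    return sum(math.exp(-g / (R * temperature_k)) for g in delta_gs)


def folding_probabilities(delta_gs, temperature_k):
    """P(s) = exp(-dG_s / RT) / Q for every folding s."""
    q = partition_function(delta_gs, temperature_k)
    return [math.exp(-g / (R * temperature_k)) / q for g in delta_gs]


def ensemble_free_energy(delta_gs, temperature_k):
    """G = -RT ln Q."""
    return -R * temperature_k * math.log(partition_function(delta_gs, temperature_k))


def heat_capacity(delta_gs, temperature_k, step=0.1):
    """Cp = -T d2G/dT2, estimated with a central finite difference."""
    g_minus = ensemble_free_energy(delta_gs, temperature_k - step)
    g_mid = ensemble_free_energy(delta_gs, temperature_k)
    g_plus = ensemble_free_energy(delta_gs, temperature_k + step)
    second_derivative = (g_plus - 2.0 * g_mid + g_minus) / step ** 2
    return -temperature_k * second_derivative


# Hypothetical free energies (kcal/mol) of three alternative foldings.
EXAMPLE_DG = [-5.0, -4.0, -1.0]
probabilities = folding_probabilities(EXAMPLE_DG, 310.15)
```

The folding with the lowest ∆G receives the largest probability, and the ensemble free energy lies at or below the lowest single-structure ∆G because Q sums over all foldings.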

Zuker’s Energy Minimization Method (mFOLD)

• An RNA sequence is written R = {r1, r2, r3, …, rn}, where ri is the i-th ribonucleotide and belongs to the set {A, U, G, C}

• A secondary structure of R is a set S of base pairs, i.j, which satisfies:
  • 1 ≤ i < j ≤ n;
  • j - i > 4 (a loop cannot contain fewer than 4 nucleotides);
  • If i.j and i’.j’ are two base pairs (assume i ≤ i’), then either
    » i = i’ and j = j’ (same base pair)
    » i < j < i’ < j’ (i.j precedes i’.j’) or
    » i < i’ < j’ < j (i.j includes i’.j’)
    (this excludes pseudoknots, for which i < i’ < j < j’)

• If e(i,j) is the energy for the base pair i.j, the total energy for R is

$$E(S) = \sum_{i.j \in S} e(i,j)$$

• The objective is to minimize E(S).

Zuker’s Energy Minimization Method (mFOLD)

Free Energy Parameters

• An extensive database of free energies has been obtained for the following RNA units (the so-called “Tinoco Rules” and “Turner Rules”):
  • Single-strand stacking energy
  • Canonical (AU, GC) and non-canonical (GU) base pairs in duplexes

• Accurate free energy parameters are still lacking for
  • Loops
  • Mismatches (AA, CA, etc.)

• Using these energy parameters, the current version of mFOLD can predict ~73% of phylogenetically deduced secondary structures.

Dynamic Programming (mFOLD)

• An Example of W(i,j)
  • A matrix W(i,j) is computed that depends on the experimentally measured base pair energy e(i,j)

• The recursion begins with i = 1, j = n

1. If W(i+1,j) = W(i,j), then i is not paired. Set i = i+1 and start the recursion again.

2. If W(i,j-1) = W(i,j), then j is not paired. Set j = j-1 and start the recursion again.

3. If W(i,j) = W(i,k) + W(k+1,j), the fragment k+1…j is put on a stack and the fragment i…k is analyzed by setting j = k and going back to the beginning of the recursion.

4. If W(i,j) = e(i,j) + W(i+1,j-1), a base pair is identified and added to the list; continue by setting i = i+1 and j = j-1
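The four traceback steps can be sketched in Python together with one possible way of filling W. The fill recursion (a minimum over the same four cases), the integer pair energies in `PAIR_ENERGY`, the 0-based indexing, and all function names are assumptions made for illustration; mFOLD itself relies on measured nearest-neighbour parameters rather than one fixed energy per base pair.

```python
# Hypothetical pair energies in arbitrary units (more negative = more stable).
PAIR_ENERGY = {("G", "C"): -3, ("C", "G"): -3, ("A", "U"): -2,
               ("U", "A"): -2, ("G", "U"): -1, ("U", "G"): -1}
MIN_SEP = 4  # a pair i.j requires j - i > 4


def e(seq, i, j):
    """Energy of pair i.j (0-based), or None if the pair is not allowed."""
    if j - i <= MIN_SEP:
        return None
    return PAIR_ENERGY.get((seq[i], seq[j]))


def fill(seq):
    """W[i][j] = minimum energy of the fragment i..j."""
    n = len(seq)
    W = [[0] * n for _ in range(n)]
    for span in range(MIN_SEP + 1, n):
        for i in range(n - span):
            j = i + span
            best = min(W[i + 1][j], W[i][j - 1])      # i or j unpaired
            for k in range(i + 1, j - 1):             # bifurcation
                best = min(best, W[i][k] + W[k + 1][j])
            pair = e(seq, i, j)
            if pair is not None:                      # i pairs with j
                best = min(best, pair + W[i + 1][j - 1])
            W[i][j] = best
    return W


def traceback(seq, W):
    """Recover one optimal set of base pairs using steps 1-4 above."""
    pairs, stack = [], [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        while j - i > MIN_SEP:
            if W[i + 1][j] == W[i][j]:                # step 1
                i += 1
            elif W[i][j - 1] == W[i][j]:              # step 2
                j -= 1
            else:
                for k in range(i + 1, j - 1):         # step 3
                    if W[i][j] == W[i][k] + W[k + 1][j]:
                        stack.append((k + 1, j))
                        j = k
                        break
                else:                                 # step 4
                    pairs.append((i, j))
                    i, j = i + 1, j - 1
    return sorted(pairs)


SEQ = "GGGAAAAACCC"
W = fill(SEQ)
PAIRS = traceback(SEQ, W)
```

Integer energies are used so that the equality tests in the traceback are exact; with floating-point energies a tolerance would be needed.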

Suboptimal Folding (mFOLD)

• For any sequence of N nucleotides, the expected number of structures is greater than 1.8^N

• A sequence of 100 nucleotides has 3×10^25 foldings. If a computer can calculate 1000 structures per second, enumerating them all would take about 10^15 years!

• mFOLD generates suboptimal foldings whose free energies fall within a certain range of values. Many of these structures differ in trivial ways. These suboptimal foldings can still be useful for designing experiments.
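The figures quoted for a 100-nucleotide sequence can be reproduced with a few lines of Python. This is only a back-of-the-envelope check; the function names and the choice of 365.25 days per year are assumptions of this sketch.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def structure_count_lower_bound(n: int) -> float:
    """Lower bound on the number of foldings of an n-nucleotide sequence."""
    return 1.8 ** n


def years_to_enumerate(count: float, structures_per_second: float = 1000.0) -> float:
    """Time needed to visit `count` structures at the given rate."""
    return count / structures_per_second / SECONDS_PER_YEAR


count_100 = structure_count_lower_bound(100)   # roughly 3.4e25 foldings
years_100 = years_to_enumerate(count_100)      # roughly 1e15 years
```

This is why exhaustive enumeration is not an option and why dynamic programming, which reuses the optimal energies of sub-fragments, is used instead.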

A computer-predicted folding of Bacillus subtilis RNase P RNA

These three representations are equivalent.

Secondary Structure Prediction for Aligned RNA Sequences

• Energy and RNA sequence covariation can be combined to predict RNA secondary structures

• To quantify sequence covariation, let f_i(X) be the frequency of base X at aligned position i and f_ij(XY) be the frequency of finding X at position i and Y at position j. The mutual information score is (Chiu & Kolodziejczak and Gutell & Woese)

$$M_{ij} = \sum_{X,Y} f_{ij}(XY) \log \frac{f_{ij}(XY)}{f_i(X)\, f_j(Y)}$$

If, for instance, only GC and GU pairs occur at positions i and j, then M_ij = 0.

• The total energy for the RNA is set to a linear combination of the measured free energy and the covariance contribution
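The mutual information score can be computed directly from two alignment columns. The Python sketch below assumes each column is given as a string with one base per aligned sequence, and it uses a base-2 logarithm; the slide does not state the base, so that choice and the function name `mutual_information` are assumptions of this example.

```python
import math
from collections import Counter


def mutual_information(col_i: str, col_j: str) -> float:
    """M_ij = sum over X,Y of f_ij(XY) * log2(f_ij(XY) / (f_i(X) * f_j(Y)))."""
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    f_i = Counter(col_i)                 # counts of X at position i
    f_j = Counter(col_j)                 # counts of Y at position j
    f_ij = Counter(zip(col_i, col_j))    # counts of (X, Y) observed together
    score = 0.0
    for (x, y), count in f_ij.items():
        joint = count / n
        score += joint * math.log2(joint / ((f_i[x] / n) * (f_j[y] / n)))
    return score


# Only GC and GU pairs: position i never varies, so the score is 0.
no_covariation = mutual_information("GGGG", "CUCU")
# Perfect covariation between four different Watson-Crick pairs.
full_covariation = mutual_information("GCAU", "CGUA")
```

Pairs that never occur contribute nothing to the sum, so iterating over the observed pairs only is sufficient and avoids taking the logarithm of zero.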

Other Secondary Prediction Methods

• Nussinov algorithm (historically important), Hogeweg and Hesper (1984)

• Vienna: http://www.tbi.univie.ac.at/~ivo/RNA/
  • Uses the same recursive method in searching the folding space
  • Adds the option of computing the population of RNA secondary structures by the equilibrium partition function
  • Specific heat of an RNA can be calculated by numerical differentiation from the equilibrium partition function

• RNACAD: http://www.cse.ucsc.edu/research/compbio/ssurrna.html
  • An effort to improve multiple RNA sequence alignment by taking into account both primary and secondary structure information
  • Uses Stochastic Context-Free Grammars (SCFGs), an extension of the hidden Markov model (HMM) method

• Bundschuh, R., and Hwa, T. (1999) RNA secondary structure formation: A solvable model of heteropolymer folding. PHYSICAL REVIEW LETTERS 83, 1479-1482.

• This work treats RNA as heteropolymer and uses a simplified Go-like model to provide an exact solution for RNA transition between its native and molten phases.

Running mFOLD

• http://bioinfo.math.rpi.edu/~mfold/rna/form1.cgi

• Constraints can be entered

1. Force bases i, i+1, ..., i+k-1 to be double stranded by entering: F i 0 k on 1 line in the constraint box.

2. Force consecutive base pairs i.j, i+1.j-1, ..., i+k-1.j-k+1 by entering: F i j k on 1 line in the constraint box.

3. Force bases i, i+1, ..., i+k-1 to be single stranded by entering: P i 0 k on 1 line in the constraint box.

4. Prohibit the consecutive base pairs i.j, i+1.j-1, ..., i+k-1.j-k+1 by entering: P i j k on 1 line in the constraint box.

5. Prohibit bases i to j from pairing with bases k to l by entering: P i-j k-l on 1 line in the constraint box.

Running mFOLD

5’-CUUGGAUGGGUGACCACCUGGG-3’

[Figure: predicted foldings of this sequence in two panels: “No constraint” and “F 1 21 2 entered”.]

Predicting RNA 3D Structures

• Currently available RNA 3D structure prediction programs make use of the fact that a tertiary structure is built upon preformed secondary structures

• So once a solid secondary structure can be predicted, it is possible to predict its 3D structure

• The chances of obtaining a valid 3D structure can be increased by known spatial constraints among the different secondary segments (e.g. cross-linking, NMR results).

• However, there are far fewer thermodynamic data on 3D RNA structures, which makes 3D structure prediction challenging.

Mc-Sym

• Mc-Sym uses a “backtracking” method to solve a general problem in computer science called the constraint satisfaction problem (CSP)

• The backtracking algorithm organizes the search space as a tree in which each node corresponds to the application of an operator

• At each application, if the partially folded RNA structure is consistent with its RNA conformational database, the next operator is applied; otherwise the entire attached branch is pruned and the algorithm backtracks to the previous node.
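The general backtracking idea can be illustrated with a generic Python sketch. This is not Mc-Sym's implementation: the toy problem (integers standing in for nucleotide conformations, with a made-up consistency rule) and the names `backtrack` and `neighbours_differ_by_one` are assumptions chosen only to show how a branch is pruned as soon as a partial assignment becomes inconsistent.

```python
def backtrack(domains, consistent, partial=None):
    """Depth-first search over a tree of partial assignments.

    domains[k] lists the candidate values for variable k, and
    consistent(partial) decides whether a partial assignment may be extended.
    Yields every complete assignment that passes all checks.
    """
    partial = [] if partial is None else partial
    if len(partial) == len(domains):
        yield tuple(partial)
        return
    for value in domains[len(partial)]:
        partial.append(value)            # apply the next "operator"
        if consistent(partial):
            yield from backtrack(domains, consistent, partial)
        partial.pop()                    # prune this branch and backtrack


def neighbours_differ_by_one(partial):
    """Toy constraint: the two most recent values must differ by exactly 1."""
    return len(partial) < 2 or abs(partial[-1] - partial[-2]) == 1


# Three variables, each with three candidate values.
SOLUTIONS = list(backtrack([[0, 1, 2]] * 3, neighbours_differ_by_one))
```

Because the consistency check runs after every single assignment, an invalid prefix is discarded once, together with everything below it in the tree, rather than being rediscovered in every complete candidate.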

Mc-Sym (Continued)

• The selection of a spanning tree for a particular RNA is left to the user, but it is suggested that the nucleotides imposing the most constraints are introduced first

• Users also supply a particular Mc-Sym “conformation” for each nucleotide. These “conformers” are derived from currently available 3D databases

Mc-Sym (Continued)

Sample script:

SEQUENCE 1 A r
GAAUGCCUGCGAGCAUCCC
;;
DECLARE
;;
1 helixA *
2 helixA *
3 helixA *
4 helixA *
5 helixA *
6 helixA *
……………
19 helixA *
;;;;
RELATIONS
;;
18 helix * 19
17 helix * 18
16 helix * 17
……….
5 helix * 6
4 helix * 5
3 helix * 4
2 helix * 3
1 helix * 2
;;
BUILD
;
19 18 17 16 15 14 13 12
12 11 10 9 8 7 6 5
5 4 3 2 1
;;
CONSTRAINTS
;;
(enter experimental constraints)
18 2 3.0

RNA-protein Interactions

• There is currently no computational method that can predict RNA-protein interaction interfaces;

• Statistical methods have been applied to identify structural features at the protein-RNA interface. For instance, ENTANGLE finds that most atoms contributed by a protein to recognizing an RNA are from main chains (C, O, N, H), not from side chains! But much remains to be done;

• Electrostatic potential is of primary importance in protein-RNA recognition because of the negatively charged phosphate backbone. Efforts are being made to quantify the electrostatic potential at the molecular surfaces of a protein and an RNA in order to predict the site of RNA interaction. This often provides a good prediction, at least for the site on the protein.

References

Predicting RNA secondary structures: good reviews

• 1. Turner, D. H., and Sugimoto, N. (1988) RNA structure prediction. Annu Rev Biophys Biophys Chem 17, 167-92.

• 2. Zuker, M. (2000) Calculating nucleic acid secondary structure. Curr Opin Struct Biol 10, 303-10.

Obtaining experimental thermodynamics parameters:

• 3. Xia, T., SantaLucia, J., Jr., Burkard, M. E., Kierzek, R., Schroeder, S. J., Jiao, X., Cox, C., and Turner, D. H. (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 37, 14719-35.

• 4. Borer, P. N., Dengler, B., Tinoco, I., Jr., and Uhlenbeck, O. C. (1974) Stability of ribonucleic acid double-stranded helices. J Mol Biol 86, 843-53.

Thermodynamics theory for RNA structure prediction:

• 5. Bundschuh, R., and Hwa, T. (1999) RNA secondary structure formation: A solvable model of heteropolymer folding. Physical Review Letters 83, 1479-1482.

• 6. McCaskill, J. S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29, 1105-19.