CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Spring 2007 · Lecture 18 - 1 - Bioinformatics: Issues and Algorithms CSE 308-408 Spring 2007 Lecture 18 RNA and Protein Structure Prediction



Bioinformatics: Issues and Algorithms
CSE 308-408 • Spring 2007 • Lecture 18

RNA and Protein Structure Prediction


Outline

• Multi-Dimensional Nature of Life
• RNA Secondary Structure Prediction
• Protein Structure Determination
• Protein Threading


Life is not one-dimensional

http://www.mdc-berlin.de/englisch/research/research_areas/cancer/heinemann.htm

http://crystal.uah.edu/~carter/protein/xray.htm

Secondary and tertiary structures for tRNA (transfer RNA)

Sample protein structures


RNA secondary structure

http://www.bioinfo.rpi.edu/~zukerm/Bio-5495/RNAfold-html/

• Base-pairing rules similar to basic sequence alignment.

• Unlike DNA, single-stranded RNA can fold back on itself.

• Recall that C-G and A-U form stable pairs (“Watson-Crick”).

• In addition, G-U forms a weaker pair (“wobble”).

RNA secondary structure:

http://tigger.uic.edu/classes/phys/phys461/phys450/ANJUM04/RNA_sstrand.jpg


Modeling RNA secondary structure

http://www.bioinfo.rpi.edu/~zukerm/Bio-5495/RNAfold-html/

Let $R = r_1 r_2 \ldots r_n$ be an RNA sequence, where $r_i \in \{A, C, G, U\}$.

A secondary structure is a set of ordered pairs (i, j) such that:

(1) j − i > 4 (pairing can't be too "tight")

(2) if (i, j) and (i', j') are two base pairs, with i ≤ i', then either:

(a) i = i' and j = j' (i.e., same base pair)

or (b) i < j < i' < j' ((i, j) precedes (i', j'))

or (c) i < i' < j' < j ((i, j) encloses (i', j'))

This is also known as a folding.
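The two conditions above translate directly into a small check. The following Python sketch is illustrative only: the function name `is_valid_folding` and the `min_sep` parameter are assumptions introduced here, not part of the lecture, and base positions are taken to be 1-based as on the slide.

```python
from itertools import combinations


def is_valid_folding(pairs, min_sep=4):
    """Return True if `pairs` satisfies conditions (1) and (2).

    `pairs` is an iterable of 1-based (i, j) tuples with i < j.
    `min_sep` is a hypothetical parameter for the "not too tight" rule.
    """
    pairs = sorted(set(pairs))  # duplicates are the "same base pair" case (a)
    # Condition (1): j - i > 4, so a hairpin loop is never too tight.
    if any(j - i <= min_sep for i, j in pairs):
        return False
    # Condition (2): after sorting, the first pair starts no later than the
    # second, so two distinct pairs must be in case (b) or case (c).
    for (i, j), (k, l) in combinations(pairs, 2):
        precedes = i < j < k < l
        encloses = i < k < l < j
        if not (precedes or encloses):
            return False  # shared base or crossing pairs (a knot)
    return True
```

For example, the nested pairs (1, 10) and (2, 9) pass the check, while (1, 8) and (5, 12) cross each other and are rejected.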


Visualizing RNA secondary structure

Computer predicted folding of Bacillus subtilis RNase P RNA

http://www.bioinfo.rpi.edu/~zukerm/Bio-5495/RNAfold-html/

M = multi-loop
I = interior loop
B = bulge loop
H = hairpin loop

Dot plot representation: place a "dot" for each (i, j) pair.
• A lower base pair precedes an upper pair (i < j < i' < j').
• One marked base pair encloses all others (i < i' < j' < j).

Circular representation (axis labels: base 0, base 400).


Predicting RNA secondary structure

Goal: to determine a secondary structure with minimum free energy.

Premise:
• Base pairings have a stabilizing effect on a structure's free energy.
• Loops have a destabilizing effect.

Two popular approaches:
• Ignore loops, maximize number of base pairings (a bit simplistic, but leads to a nice algorithm).
• Employ loop-specific energy models (more realistic, but also more complex algorithms).

Figure labels: hairpin loop, stacked base pairs.


An approach for predicting RNA secondary structure

Assumption: Energy of each base pair is independent of all other base pairs and of loop structure.

Consequence: Total free energy is sum of energies for all base pairs.

Downside: Predictions only approximate reality.

Key observation:
• We can use solutions for shorter subsequences to determine solution for longer sequences.
• Should sound familiar: precisely the kind of decoupling required for dynamic programming to work.


Notation and definitions

S(i, j) = secondary structure of RNA strand (i.e., set of base pairings) from base $r_i$ to base $r_j$, inclusive.

E(S(i, j)) = free energy of secondary structure S(i, j).

e(a, b) = free energy of base pair (a, b).

So we have:

$$E(S(i, j)) = \sum_{(a, b) \in S(i, j)} e(a, b)$$

Note: we can define e(a, b) to be a very large value when physical constraints are violated (e.g., b − a ≤ 4, so that hairpin loop would be too tight).
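The additive energy model above can be sketched in a few lines of Python. The numeric values in `PAIR_ENERGY` are placeholders chosen only to respect the ordering stated earlier (Watson-Crick pairs more stable than the G-U wobble pair); they are not measured energies, and the names `e`, `structure_energy`, and `FORBIDDEN` are assumptions of this sketch.

```python
# Hypothetical per-pair energies (more negative = more stable).
PAIR_ENERGY = {
    frozenset("CG"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,  # weaker "wobble" pair
}
FORBIDDEN = 1e9  # "very large value" for physically impossible pairs


def e(seq, a, b, min_sep=4):
    """Energy of base pair (a, b), using 1-based positions into `seq`."""
    if b - a <= min_sep:  # hairpin loop would be too tight
        return FORBIDDEN
    return PAIR_ENERGY.get(frozenset((seq[a - 1], seq[b - 1])), FORBIDDEN)


def structure_energy(seq, pairs):
    """E(S) = sum of e(a, b) over all base pairs (a, b) in the structure."""
    return sum(e(seq, a, b) for a, b in pairs)
```

With these placeholder values, three stacked G-C pairs in `GGGAAAACCC` sum to three times the G-C energy, while any structure containing a too-tight pair is dominated by the penalty term.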


Algorithm for predicting RNA secondary structure

Consider RNA strand from base position i to base position j:

[Figure: strand $r_i \ldots r_k \ldots r_{j-1}\, r_j$] What is optimal folding for this?

Two possible cases depending on whether $r_j$ base-pairs:

If $r_j$ does not base-pair, then E(S(i, j)) = E(S(i, j-1)). (We already computed this.)

If $r_j$ does pair (with some $r_k$), E(S(i, j)) = E(S(i, k-1)) + e(k, j) + E(S(k+1, j-1)). (Both E(S(i, k-1)) and E(S(k+1, j-1)) are already computed.)


Algorithm for predicting RNA secondary structure

For each partial folding S(i, j), free energy is given by:

$$E(S(i, j)) = \min \begin{cases} E(S(i+1, j-1)) + e(i, j) \\ \min_{i < k \le j} \left[ E(S(i, k-1)) + e(k, j) + E(S(k+1, j-1)) \right] \end{cases}$$

Optimize free energy over increasingly longer subsequences.

How do we deal with case where $r_j$ is not paired? Define e(k, j) to be 0 when k = j. Taking the energy of an empty subsequence to be 0, the k = j term then reduces to E(S(i, j-1)).

Free energy of optimal folding for entire strand is E(S(1, n)).
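A minimal Python sketch of this dynamic program follows. It is not the lecture's code: the energy table holds placeholder values, the names `fold`, `e`, and `FORBIDDEN` are assumptions, and empty subsequences (j < i) are assumed to have energy 0. The loop over k runs from i to j, so k = i covers the case where $r_i$ pairs with $r_j$ and k = j covers the case where $r_j$ is unpaired.

```python
PAIR_ENERGY = {frozenset("CG"): -3.0, frozenset("AU"): -2.0, frozenset("GU"): -1.0}
FORBIDDEN = 1e9  # stands in for the "very large value" on the notation slide


def e(seq, k, j, min_sep=4):
    """Pair energy with the convention e(j, j) = 0 (r_j left unpaired)."""
    if k == j:
        return 0.0
    if j - k <= min_sep:
        return FORBIDDEN
    return PAIR_ENERGY.get(frozenset((seq[k - 1], seq[j - 1])), FORBIDDEN)


def fold(seq):
    """Return (minimum energy, sorted list of 1-based base pairs)."""
    n = len(seq)
    # E[i][j] for 1 <= i <= j <= n; cells with j < i stay 0 (empty range).
    E = [[0.0] * (n + 2) for _ in range(n + 2)]
    choice = [[0] * (n + 2) for _ in range(n + 2)]
    for length in range(1, n + 1):           # increasingly longer subsequences
        for i in range(1, n - length + 2):
            j = i + length - 1
            best, best_k = None, j
            for k in range(i, j + 1):        # partner of r_j (k == j: none)
                cand = E[i][k - 1] + e(seq, k, j) + E[k + 1][j - 1]
                if best is None or cand < best:
                    best, best_k = cand, k
            E[i][j], choice[i][j] = best, best_k
    # Traceback: replay the recorded choices to recover the base pairs.
    pairs, stack = [], [(1, n)]
    while stack:
        i, j = stack.pop()
        if j < i:
            continue
        k = choice[i][j]
        if k == j:
            stack.append((i, j - 1))
        else:
            pairs.append((k, j))
            stack.extend([(i, k - 1), (k + 1, j - 1)])
    return E[1][n], sorted(pairs)
```

Calling `fold("GGGAAAACCC")` with these placeholder energies recovers the three nested G-C pairs of a simple hairpin.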


Time and space complexity

What are time and space requirements of this algorithm?

For each partial folding S(i, j), free energy is given by:

$$E(S(i, j)) = \min \begin{cases} E(S(i+1, j-1)) + e(i, j) \\ \min_{i < k \le j} \left[ E(S(i, k-1)) + e(k, j) + E(S(k+1, j-1)) \right] \end{cases}$$

For an RNA sequence of length n, we must fill an n × n matrix. Each entry requires exploring an average of n/2 values for k.

Hence, time complexity is $O(n^3)$, space complexity is $O(n^2)$.
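As a rough check on the cubic bound, one can count the candidate values of k directly. Assuming the entry for (i, j) with i < j examines the j − i values of k in the range i < k ≤ j, and grouping entries by their width d = j − i (there are n − d entries of width d), the total work is:

```latex
\sum_{1 \le i < j \le n} (j - i)
  \;=\; \sum_{d=1}^{n-1} d\,(n - d)
  \;=\; \frac{(n-1)\,n\,(n+1)}{6}
  \;=\; O(n^3)
```

This agrees with the slide's estimate up to a constant factor, and the table itself has at most $n^2$ entries, which gives the $O(n^2)$ space bound.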


Summary of RNA secondary structure prediction #1

http://matisse.ucsd.edu/~hwa/pub/goletter.pdf

General summary of the situation:
• Tertiary structure is difficult to model and compute.
• Determining secondary structure is more amenable to a solution. While only an approximation, it gives good hints.
• No knots ⇒ planar graph.

Solved using dynamic programming in $O(n^3)$ time.


Summary of RNA secondary structure prediction #2

Better results can be obtained by modeling loops. This problem is also solvable in $O(n^3)$ time using some tricks.

Figure labels: hairpin loop, bulge, interior loop, helical region.
red = destabilizing ("bad")
green = stabilizing ("good")


Protein structure

http://crystal.uah.edu/~carter/protein/protein.htm

Primary structure of a protein is determined by the number and order of amino acids within the polypeptide chain.

A protein's secondary structure is defined as the local conformation of its backbone, which consists of the atoms that make up each amino acid's frame, excluding side chains. Two common motifs are beta-pleated sheets and alpha helices.

Tertiary structure is formed when attractions of side chains and secondary structure combine to form a distinct 3-dimensional structure. This gives the protein its specific function.

Sometimes distinct proteins must combine to form the correct 3-dimensional structure for a particular protein to function properly. E.g., hemoglobin is made of four similar proteins that combine to form its quaternary structure.


Protein structure

http://www.science.org.au/sats2004/images/mackay2.jpg


Sequence → structure → function

sequence

structure

function

medicine

http://www.bioinformatics.uwaterloo.ca/~j3xu/CS882/CS882-ProteinStructurePrediction.ppt


How is true 3-D structure determined?

http://crystal.uah.edu/~carter/protein/xray.htm

• As of today, must be determined experimentally.
• Techniques include x-ray crystallography and NMR.

diffraction pattern


X-ray crystallography

http://www-structure.llnl.gov/Xray/xrayequipment.htm

A diffraction pattern: the white spots are the reflections.


From electron density map to structure

http://www.ib3.gmu.edu/vaisman/csi731/lec04f02.pdf


Protein structure determination

http://www-structure.llnl.gov/Xray/tutorial/protein_structure.htm

Figure panels: backbone; backbone with side chains; electron density map.


Two common protein secondary structures

Alpha Helix

R groups of amino acids all extend to outside. Helix makes a complete turn every 3.6 amino acids. Helix is right-handed; it twists in clockwise direction. Carbonyl group (-C=O) of each peptide bond extends parallel to axis of helix and points directly at -N-H group of peptide bond 4 amino acids below it in helix. A hydrogen bond forms between them [-N-H·····O=C-].

http://www.rothamsted.bbsrc.ac.uk/notebook/courses/guide/protalpha.htm

Beta Conformation

Consists of pairs of chains lying side-by-side and stabilized by hydrogen bonds between carbonyl oxygen atom on one chain and -NH group on adjacent chain. Chains are often "anti-parallel"; N-terminal to C-terminal direction of one being reverse of other.


Alpha helix

Alpha-helix (also written α-helix) is rod-like structure stabilized by hydrogen bonds between CO and NH groups of main chain.

http://www.rothamsted.bbsrc.ac.uk/notebook/courses/guide/protalpha.htm

Ribbon representation of right-handed alpha-helix with only the alpha carbons represented.

Examining the backbone structure, note that alpha carbons three and four positions apart in the linear sequence are actually quite close together in the helix structure. The hydrogen bonds are shown in green; all main chain CO and NH groups are hydrogen bonded. This structure is quite sturdy.


Beta sheet

http://www.cryst.bbk.ac.uk/PPS2/course/section3/sheet.html


Protein structure

http://bmbiris.bmb.uga.edu/wampler/tutorial/prot3.html

Red = alpha helix
Green = beta sheet
Black = misc. loops

Some proteins are made up of mostly alpha helices. Both marine bloodworm hemoglobin (left) and E. coli cytochrome B562 (right) are composed of mostly alpha helices. The 4-helix bundle of the cytochrome is a common motif.

Some are mostly beta sheet. The green alga plastocyanin (left) and sea snake neurotoxin (right) are mostly beta sheets.


Protein structure

http://bmbiris.bmb.uga.edu/wampler/tutorial/prot3.html

Many proteins are a mix of alpha helices and beta sheets.

Two simple proteins with a mix of 2° (secondary structure) components: ribonuclease T1 (left) and pancreatic trypsin inhibitor (right).


Protein Domains

http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/T/TertiaryStructure.html

Tertiary structure of many proteins is built from several domains. Often each has a separate function to perform, such as:

• binding a small ligand (e.g., a peptide in the molecule shown here)
• spanning the plasma membrane (transmembrane proteins)
• containing the catalytic site (enzymes)
• DNA-binding (in transcription factors)
• providing a surface to bind specifically to another protein.

In some cases, each domain is encoded by a separate exon in the gene. The histocompatibility molecule shown here has three domains: α1, α2, and α3, each encoded by its own exon.


Moving towards protein structure prediction ...

• Most important information seems to be contained in alpha helices and beta sheets (which form core), not in loops.
• Given amino acid sequence, we want to determine locations of helices, sheets, and loops, and their arrangements.

How to do this?

• Experimental techniques are expensive and time-consuming.
• Exhaustive enumeration at molecular level (taking structure with smallest free energy)? Nah ...
• As of 2002, the NIH protein structure database contained approximately 15,000 entries. Hmm ...
• Idea: given sequence, see if it could fit a known structure.

This is known as protein threading.


PDB new vs. old folding growth

New fold

Old fold

http://www.bioinformatics.uwaterloo.ca/~j3xu/CS882/CS882-ProteinStructurePrediction.ppt

• Number of unique folds in nature is fairly small (possibly a few thousand).

• 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB.


Protein threading

http://www.biostat.wisc.edu/bmi576/lecture16.pdf http://www.ics.uci.edu/~rickl/publications/1998-salzberg-chapter.pdf

Somewhat similar to sequence alignment we studied earlier:

• homology modeling: align sequence to sequence,
• threading: align sequence to structure (templates).


The protein threading problem

Given:
• new protein sequence, and
• library of templates.

Find:
• best alignment of sequence to some template.

http://www.biostat.wisc.edu/bmi576/lecture16.pdf


The protein threading problem

One possible threading (note non-local interactions):

http://www.dcs.kcl.ac.uk/teaching/units/csmacmb/DOC/lecture17b.pdf
http://www.ics.uci.edu/~rickl/publications/1998-salzberg-chapter.pdf


The protein threading problem

Input:

1. Protein sequence A with n amino acids $a_i$.

2. Core structural model C, with m core segments $C_i$. Also:

(a) Length $c_i$ of each core segment.

(b) Core segments $C_i$ and $C_{i+1}$ are connected by loop $\lambda_i$ for which we know max ($l^{max}_i$) and min ($l^{min}_i$) lengths.

(c) Structural environment for each amino acid position.

3. Scoring function f(T) to evaluate each threading T.

Output: set of integers $T = \{t_1, t_2, \ldots, t_m\}$ such that value of $t_i$ indicates which amino acid from A occupies first position in core segment i.
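To make the inputs and output concrete, here is a hedged Python sketch that checks whether a candidate threading T respects the core lengths and loop-length limits. The function name `is_feasible_threading` and the list-based encoding are assumptions of this sketch, as is the choice to leave the tails before the first core segment and after the last one unconstrained, since the slide only bounds the loops between segments.

```python
def is_feasible_threading(n, core_lengths, loop_min, loop_max, t):
    """Check a threading t against the core model.

    n            -- number of amino acids in sequence A
    core_lengths -- c_i for each of the m core segments
    loop_min/max -- allowed lengths of the loop between segments i and i+1
                    (m - 1 entries each)
    t            -- t[i] is the 1-based sequence position placed at the
                    first position of core segment i (0-based list index)
    """
    m = len(core_lengths)
    if len(t) != m or len(loop_min) != m - 1 or len(loop_max) != m - 1:
        return False
    # The first segment must start inside the sequence, the last must end in it.
    if t[0] < 1 or t[-1] + core_lengths[-1] - 1 > n:
        return False
    for i in range(m - 1):
        # Residues left over between the end of segment i and start of i + 1.
        loop_len = t[i + 1] - (t[i] + core_lengths[i])
        if not (loop_min[i] <= loop_len <= loop_max[i]):
            return False
    return True
```

For instance, with n = 20, core lengths [4, 5, 3], and loop limits [1, 2] to [3, 4], the threading [1, 6, 13] uses the shortest allowed loops and is accepted.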


Core templates with interactions

http://www.biostat.wisc.edu/bmi576/lecture16.pdf http://www.ics.uci.edu/~rickl/publications/1998-salzberg-chapter.pdf

• Small circles represent amino acid positions.
• Thin lines indicate interactions represented in model.


The protein threading problem

Unfortunately, due to variable-length gaps between core segments and non-local interactions, this problem is NP-hard.

Fortunately, it is amenable to solution by a general-purpose optimization strategy known as branch-and-bound.

Possible threadings:

http://www.biostat.wisc.edu/bmi576/lecture16.pdf http://www.ics.uci.edu/~rickl/publications/1998-salzberg-chapter.pdf


Digression: branch-and-bound

Basic idea:
• Partition solution space into distinct sets.
• Compute lower bound that applies to all solutions in given set.
• If we can find a solution that is better than this lower bound, we don't need to explore any solution in that set.

Important note: branch and bound will find optimal solution (it's not heuristic) ... but ... it might take exponential time to do it.

Still, it is often much faster than naive exhaustive search.


Digression: branch-and-bound

Let's see how this works for the traveling salesman problem, which we know is also NP-complete (protein threading is a bit too complicated for now).

Given: a set of cities and costs to travel between them.
Find: a minimum cost tour that visits each city once.

[Figure: graph on cities A, B, C, D, E with the edge costs listed in this table.]

     A   B   C   D   E
A    -   2  12   5   2
B    2   -   3   7   2
C   12   3   -   2   5
D    5   7   2   -   4
E    2   2   5   4   -


Branch-and-bound

Start search at A:
A → B (2) → C (+ 3 = 5) → D (+ 2 = 7) → E (+ 4 = 11) → A (+ 2 = 13)
Total cost for this tour is 13.

Now let's try going from A to C:
A → C (12) → B (+ 3, so ≥ 15)
A → C (12) → D (+ 2, so ≥ 14)
A → C (12) → E (+ 5, so ≥ 17)
No point in exploring any of these subtrees any further!


Branch-and-bound

In traveling salesman, we were able to eliminate from consideration all tours starting with:

A-C-B-... and A-C-D-... and A-C-E-...

because we knew they could never be optimal; we already had a tour with total cost less than their partial costs.

General observation: given a complete solution with cost w,
• partial solutions with lower bound x ≥ w: don't bother exploring this
• partial solutions with lower bound y < w: explore this if/when bound becomes best
• partial solutions with lower bound z < y: explore this first
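The pruning rule can be tried on the five-city example with a short depth-first sketch in Python. This is one simple way to organize the search, not necessarily the one used in the lecture: the function name `tsp_branch_and_bound` is an assumption, the lower bound is just the cost of the partial path (valid because all costs are non-negative), and cities are tried in alphabetical order rather than best-bound-first.

```python
import math

CITIES = "ABCDE"
COST = [  # symmetric cost matrix from the slide; the diagonal is unused
    [0, 2, 12, 5, 2],
    [2, 0, 3, 7, 2],
    [12, 3, 0, 2, 5],
    [5, 7, 2, 0, 4],
    [2, 2, 5, 4, 0],
]


def tsp_branch_and_bound(cost, start=0):
    """Return (best cost, best tour as index list, number of pruned nodes)."""
    n = len(cost)
    best = {"cost": math.inf, "tour": None, "pruned": 0}

    def extend(path, visited, partial):
        # Bound: every tour that begins with `path` costs at least `partial`.
        if partial >= best["cost"]:
            best["pruned"] += 1
            return
        if len(path) == n:  # all cities visited: close the tour
            total = partial + cost[path[-1]][start]
            if total < best["cost"]:
                best["cost"], best["tour"] = total, path + [start]
            return
        for nxt in range(n):  # branch on the next city to visit
            if nxt not in visited:
                extend(path + [nxt], visited | {nxt},
                       partial + cost[path[-1]][nxt])

    extend([start], {start}, 0)
    return best["cost"], best["tour"], best["pruned"]
```

Run on `COST`, the first complete tour found is A-B-C-D-E-A with cost 13, after which the branches A-C-B, A-C-D, and A-C-E are cut off because their partial costs (15, 14, 17) already reach or exceed 13.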


Back to protein threading ...

Recall that $t_i$ indicates which amino acid occupies the first position in core segment i. Our scoring function is:

$$f(T) = \sum_i g_1(i, t_i) + \sum_i \sum_{j > i} g_2(i, j, t_i, t_j)$$

As we know sizes of core segments and min and max lengths for loops, we can determine ranges for the $t_i$'s.

This will form basis for branch-and-bound.

[Figure labels: $t_i$; $d_{i-1}$, $d_i$, $d_{i+1}$]
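One way to derive such ranges is sketched below in Python. It rests on assumptions that go beyond the slide: the tails before the first and after the last core segment are unconstrained, and the function name `start_ranges` is hypothetical. The names b and d follow the interval notation $[b_i, d_i]$ used on the branch-and-bound slides. Under these assumptions only the minimum loop lengths affect the individual ranges; the maximum loop lengths restrict which combinations of starts are jointly allowed.

```python
def start_ranges(n, core_lengths, loop_min):
    """Return [(b_i, d_i)] with b_i <= t_i <= d_i for each core segment.

    n            -- sequence length
    core_lengths -- c_i for each core segment
    loop_min     -- minimum loop length between segments i and i + 1
    """
    m = len(core_lengths)
    # Earliest starts: pack everything to the left with the shortest loops.
    b = [1] * m
    for i in range(1, m):
        b[i] = b[i - 1] + core_lengths[i - 1] + loop_min[i - 1]
    # Latest starts: pack everything to the right with the shortest loops.
    d = [n - core_lengths[-1] + 1] * m
    for i in range(m - 2, -1, -1):
        d[i] = d[i + 1] - core_lengths[i] - loop_min[i]
    if b[0] > d[0]:
        raise ValueError("sequence too short for this core model")
    return list(zip(b, d))
```

Every interval has the same width here, namely the slack left after placing all core segments and minimum-length loops.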


Branch-and-bound for protein threading

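Given a set of threadings $T^*$, the optimization problem is:

$$\min_{T \in T^*} f(T) = \min_{T \in T^*} \left[ \sum_i g_1(i, t_i) + \sum_i \sum_{j > i} g_2(i, j, t_i, t_j) \right] = \min_{T \in T^*} \sum_i \left[ g_1(i, t_i) + \sum_{j > i} g_2(i, j, t_i, t_j) \right]$$

What's a lower bound we can use?

$$\sum_i \left[ \min_{b_i \le x \le d_i} g_1(i, x) + \sum_{j > i} \min_{\substack{b_i \le y \le d_i \\ b_j \le z \le d_j}} g_2(i, j, y, z) \right]$$

Note this is determined by the interval $[b_i, d_i]$ that $t_i$ may fall in.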
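The bound minimizes each term independently, so it can never exceed the true minimum over the same intervals. A small Python sketch makes this concrete. The scoring functions `toy_g1` and `toy_g2` are invented purely for illustration, and `score` and `lower_bound` are hypothetical names; segment indices are 0-based in the code.

```python
from itertools import product


def score(t, g1, g2):
    """f(T): singleton terms plus pairwise terms over all j > i."""
    m = len(t)
    return (sum(g1(i, t[i]) for i in range(m))
            + sum(g2(i, j, t[i], t[j])
                  for i in range(m) for j in range(i + 1, m)))


def lower_bound(intervals, g1, g2):
    """Term-by-term minimum over the intervals [(b_i, d_i), ...]."""
    m = len(intervals)
    total = 0
    for i, (bi, di) in enumerate(intervals):
        total += min(g1(i, x) for x in range(bi, di + 1))
        for j in range(i + 1, m):
            bj, dj = intervals[j]
            total += min(g2(i, j, y, z)
                         for y in range(bi, di + 1)
                         for z in range(bj, dj + 1))
    return total


# Invented toy scores, for illustration only.
def toy_g1(i, x):
    return (x - 4 * (i + 1)) ** 2


def toy_g2(i, j, x, y):
    return abs((y - x) - 5)


intervals = [(1, 3), (4, 6)]
bound = lower_bound(intervals, toy_g1, toy_g2)
# Brute force over every combination of starts inside the intervals.
best = min(score(list(t), toy_g1, toy_g2)
           for t in product(*(range(b, d + 1) for b, d in intervals)))
```

The bound is cheap to evaluate compared with the brute-force minimum, and the gap between the two shrinks as the intervals are narrowed.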


Branch-and-bound for protein threading

Now we must split solution space into disjoint sets. Do this by selecting largest current interval for a ti and cutting it in half.
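The splitting step can be sketched as follows in Python. The representation of a set of threadings as a list of (b, d) intervals and the name `split_largest_interval` are assumptions of this sketch, and ties between equally wide intervals are broken in favor of the first one.

```python
def split_largest_interval(intervals):
    """Split a set of threadings into two disjoint halves.

    `intervals` is a list of (b_i, d_i) ranges for the t_i's. The widest
    range is cut in half; an empty list means nothing is left to split.
    """
    widths = [d - b for b, d in intervals]
    i = max(range(len(intervals)), key=widths.__getitem__)
    b, d = intervals[i]
    if b == d:
        return []  # every t_i is fixed: this set holds a single threading
    mid = (b + d) // 2
    left = intervals[:i] + [(b, mid)] + intervals[i + 1:]
    right = intervals[:i] + [(mid + 1, d)] + intervals[i + 1:]
    return [left, right]
```

Each half can then be given its own lower bound and explored, or discarded, independently.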

http://www.biostat.wisc.edu/bmi576/lecture16.pdf


Branch-and-bound for protein threading

http://www.biostat.wisc.edu/bmi576/lecture16.pdf


CAFASP3 Example

http://www.bioinformatics.uwaterloo.ca/~j3xu/CS882/CS882-ProteinStructurePrediction.ppt

CAFASP: Critical Assessment of Fully Automated Structure Prediction

Red: Experimental Structure
Blue: Correct Prediction
Green: Incorrect Prediction

CAFASP3 was evaluated by MaxSub, a computer program. Predicted structures are superimposed on the experimental structures to see how long a portion is superimposable.


Wrap-up

Remember:
• Come to class having done the readings.
• Check Blackboard regularly for updates.

Readings for next time:
• "The Invention of the Genetic Code," Brian Hayes, American Scientist, vol. 86, no. 1, Jan.–Feb., 1998, pp. 8–14.
• "Ode to the Code," Brian Hayes, American Scientist, vol. 92, no. 6, Nov.–Dec., 2004, pp. 494–498.

(Both papers are in the Readings folder on Blackboard.)