

Chapter 6 Dynamic Programming


Algorithmic Paradigms

Greedy. Build up a solution incrementally, myopically optimizing some local criterion.

Divide-and-conquer. Break up a problem into two sub-problems, solve each sub-problem independently, and combine solutions to the sub-problems to form a solution to the original problem.

Dynamic programming. Break up a problem into a series of overlapping sub-problems, and build up solutions to larger and larger sub-problems.


Dynamic Programming History

Bellman. Pioneered the systematic study of dynamic programming in the 1950s.

Etymology. Dynamic programming = planning over time. The Secretary of Defense was hostile to mathematical research, so Bellman sought an impressive name to avoid confrontation.
– "it's impossible to use dynamic in a pejorative sense"
– "something not even a Congressman could object to"

Reference: Bellman, R. E. Eye of the Hurricane: An Autobiography.


Dynamic Programming Applications

Areas

– Bioinformatics.
– Control theory.
– Information theory.
– Operations research (e.g., knapsack problem).
– Computer science: theory, graphics, AI, systems, ….

Some famous dynamic programming algorithms

– Viterbi's algorithm for hidden Markov models.
– Unix diff for comparing two files.
– Smith-Waterman for sequence alignment.
– Bellman-Ford for shortest-path routing in networks.
– Cocke-Kasami-Younger for parsing context-free grammars.

6.4 Knapsack Problem


Knapsack Problem

Knapsack problem.
– Given n objects and a "knapsack."
– Item i weighs wi > 0 kilograms and has value vi > 0.
– Knapsack has capacity of W kilograms.
– Goal: fill knapsack so as to maximize total value.

Ex: { 3, 4 } has value 40.

Greedy: repeatedly add item with maximum ratio vi / wi.

Ex: { 5, 2, 1 } achieves only value = 35, so greedy is not optimal.

Item  Value  Weight
  1      1      1
  2      6      2
  3     18      5
  4     22      6
  5     28      7

W = 11

Dynamic Programming: False Start

Def. OPT(i) = max profit subset of items 1, …, i.

Case 1: OPT does not select item i.
– OPT selects best of { 1, 2, …, i-1 }.

Case 2: OPT selects item i.
– Accepting item i does not immediately imply that we will have to reject other items.
– Without knowing what other items were selected before i, we don't even know if we have enough room for i.

Conclusion. Need more sub-problems!


Dynamic Programming: Adding a New Variable

Def. OPT(i, w) = max profit subset of items 1, …, i with weight limit w.

Case 1: OPT does not select item i.
– OPT selects best of { 1, 2, …, i-1 } using weight limit w.

Case 2: OPT selects item i.
– New weight limit = w - wi.
– OPT selects best of { 1, 2, …, i-1 } using this new weight limit.
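Combining the two cases gives the usual recurrence; this is exactly what the bottom-up pseudocode below fills in:

OPT(i, w) = 0                                               if i = 0
OPT(i, w) = OPT(i-1, w)                                     if wi > w
OPT(i, w) = max { OPT(i-1, w), vi + OPT(i-1, w - wi) }      otherwise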


Knapsack Problem: Bottom-Up

Knapsack. Fill up an (n+1)-by-(W+1) array.

Input: n, W, w1,…,wn, v1,…,vn

for w = 0 to W
   M[0, w] = 0

for i = 1 to n
   for w = 1 to W
      if (wi > w)
         M[i, w] = M[i-1, w]
      else
         M[i, w] = max { M[i-1, w], vi + M[i-1, w-wi] }

return M[n, W]
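A direct Python transcription of this pseudocode, offered as a sketch: arrays are 0-indexed, and the chosen item set is recovered by walking back through the table (the function and variable names are mine, not from the slides).

def knapsack(values, weights, W):
    # Bottom-up 0/1 knapsack. Returns (best value, chosen item indices 1..n).
    n = len(values)
    # M[i][w] = max value using items 1..i with weight limit w
    M = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        vi, wi = values[i - 1], weights[i - 1]
        for w in range(1, W + 1):
            if wi > w:
                M[i][w] = M[i - 1][w]                          # item i does not fit
            else:
                M[i][w] = max(M[i - 1][w], vi + M[i - 1][w - wi])
    # Trace back to recover an optimal subset
    chosen, w = [], W
    for i in range(n, 0, -1):
        if M[i][w] != M[i - 1][w]:         # item i was taken
            chosen.append(i)
            w -= weights[i - 1]
    return M[n][W], sorted(chosen)

# The running example: W = 11
print(knapsack([1, 6, 18, 22, 28], [1, 2, 5, 6, 7], 11))   # (40, [3, 4])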


Knapsack Algorithm

The (n+1)-by-(W+1) table M for the running example, with rows indexed by the item prefix and columns by the weight limit w (W = 11):

                    w:  0   1   2   3   4   5   6   7   8   9  10  11
{ }                     0   0   0   0   0   0   0   0   0   0   0   0
{ 1 }                   0   1   1   1   1   1   1   1   1   1   1   1
{ 1, 2 }                0   1   6   7   7   7   7   7   7   7   7   7
{ 1, 2, 3 }             0   1   6   7   7  18  19  24  25  25  25  25
{ 1, 2, 3, 4 }          0   1   6   7   7  18  22  24  28  29  29  40
{ 1, 2, 3, 4, 5 }       0   1   6   7   7  18  22  28  29  34  35  40

OPT: { 3, 4 }, value = 22 + 18 = 40.

Knapsack Problem: Running Time

Running time. O(n W).

How much storage is needed?
– The table we used above has size O(n W).
– For example, if W = 1,000,000 and n = 100, we need a total of 400 Mbytes (10^6 × 100 × 4 bytes). This is too high.
– The space requirement can be reduced: once row i has been computed, rows i-1 and below are never read again, so only the current and previous rows must be kept.
– Space reduces to 2W entries = 8 Mbytes in the above case.
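A sketch of this space saving in Python, keeping only the previous row (the function name is mine). Note that with only two rows we can still compute the optimal value, but we can no longer trace back the chosen items from the table alone.

def knapsack_value_only(values, weights, W):
    # 0/1 knapsack value using O(W) space: keep only the previous row.
    prev = [0] * (W + 1)             # row i-1
    for vi, wi in zip(values, weights):
        curr = prev[:]               # row i, initialized to "skip item i"
        for w in range(wi, W + 1):
            curr[w] = max(prev[w], vi + prev[w - wi])
        prev = curr
    return prev[W]

print(knapsack_value_only([1, 6, 18, 22, 28], [1, 2, 5, 6, 7], 11))   # 40

A single array also works if w is scanned from W down to wi, but two rows match the 2W figure quoted above.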

6.6 Sequence Alignment


String Similarity

How similar are two strings? Ex: ocurrance vs. occurrence.

o c u r r a n c - e
o c c u r r e n c e
5 mismatches, 1 gap

o c - u r r a n c e
o c c u r r e n c e
1 mismatch, 1 gap

o c - u r r - a n c e
o c c u r r e - n c e
0 mismatches, 3 gaps


String Similarity

Key concept used by search engines to correct spelling errors. You get a response: "Did you mean xxxxx ?"

Google search for "allgorythm": when a word is not in the dictionary, the search engine searches through all the words in its database, finds the one that is closest to "allgorythm", and asks if you probably meant it.


Edit Distance

Applications.
– Basis for Unix diff: determine how similar two files are.
– Plagiarism detection: "document similarity" is a good indicator of whether a document was plagiarized.
– Speech recognition: the sound pattern of a spoken word is matched against the sound patterns of various standard spoken words in a database.
– Computational biology.

Plagiarism checker – a simple test

Here is a paragraph from Chapter 6 of the text (the current chapter, on dynamic programming), with some small changes made to it:

"we now turn to a more powerful and subtle design technique, dynamic programming. It'll be simpler to day precisely what characterizes dynamic programming after we have seen it in action, but the basic idea is drawn from the intuition behind divide-and-conquer ad is essentially the opposite of the greedy strategy: one implicitly explores the space of all possible solutions, by carefully decomposing things into a series of subproblems, and then building up correct solutions to larger and larger subproblems. In a way, we can thus view dynamic programming as operating dangerously close to the edge of brute-force search."

Let us see if the detector can find the original source:

http://www.dustball.com/cs/plagiarism.checker/


Edit Distance

Edit distance. [Levenshtein 1966, Needleman-Wunsch 1970]

Gap penalty δ; mismatch penalty α_pq.

Cost = sum of gap and mismatch penalties.

[Figure: two alignments of CTGACCTACCT with CTGACTACAT, with costs 2δ + α_CA and α_TC + α_GT + α_AG + 2α_CA.]

Edit Distance

In the simplest model, which we study first, consider just two operations, insert and delete, each costing one unit.

Example: What is DIS("abbaabab", "bbabaaba")?

abbaabab -> bbaabab (delete a) -> bbabab (delete a) -> bbabaab (insert a) -> bbabaaba (insert a)

So DIS(abbaabab, bbabaaba) <= 4. Is there a way to perform this transformation with 3 or fewer edit operations?

Edit Distance: Problem Structure

Def. DIS(i, j) = min edit distance between x1 x2 . . . xi and y1 y2 . . . yj.

Case 1: xi = yj.
– In this case, DIS(i, j) = DIS(i-1, j-1).

Case 2: xi ≠ yj. Now there are two choices: either delete xi and convert x1 x2 . . . xi-1 to y1 y2 . . . yj, or convert x1 x2 . . . xi to y1 y2 . . . yj-1 and then insert yj.
– So DIS(i, j) = 1 + min { DIS(i-1, j), DIS(i, j-1) }.
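A bottom-up Python sketch of this recurrence; the base cases DIS(i, 0) = i and DIS(0, j) = j correspond to deleting or inserting everything (the function name is mine).

def dis(x, y):
    # Edit distance with insert and delete only, each costing 1.
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                        # delete all of x[:i]
    for j in range(n + 1):
        D[0][j] = j                        # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                D[i][j] = D[i - 1][j - 1]              # last symbols match
            else:
                D[i][j] = 1 + min(D[i - 1][j],         # delete x_i
                                  D[i][j - 1])         # insert y_j
    return D[m][n]

print(dis("abbaabab", "bbabaaba"))   # prints 4: the 4-operation transformation above is optimal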

Extending the model to include substitution

Now add another operation:
– insert(i, c): insert c as the i-th symbol (cost = 1)
– delete(i): delete the i-th symbol (cost = 1)
– change(i, c, d): change the i-th symbol from c to d (cost = 1.5)

Definition of DIS(i, j):

DIS(i, j) = min { DIS(i-1, j) + 1,  DIS(i, j-1) + 1,  DIS(i-1, j-1) + cost(xi, yj) }

where cost(xi, yj) = 0 if xi = yj, and 1.5 otherwise.
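The same bottom-up code with the substitution case added, using the costs on this slide (a sketch; names are mine):

def dis_with_change(x, y, change_cost=1.5):
    # Edit distance with insert (cost 1), delete (cost 1) and change (cost 1.5).
    m, n = len(x), len(y)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = float(i)
    for j in range(n + 1):
        D[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if x[i - 1] == y[j - 1] else change_cost
            D[i][j] = min(D[i - 1][j] + 1,           # delete x_i
                          D[i][j - 1] + 1,           # insert y_j
                          D[i - 1][j - 1] + sub)     # keep or change x_i into y_j
    return D[m][n]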

Alignment problem in molecular biology

Given two fragments F1 and F2 of DNA, we want to know how closely related they are.

(Closeness is an indicator of how likely it is that one evolved from the other.)

This kind of analysis plays a critical role in molecular biology.

DNA sequences evolve by insertions, deletions and mutations.

Thus the problem is almost the same as the one we studied in the previous slide.


Sequence Alignment

Goal: Given two strings X = x1 x2 . . . xm and Y = y1 y2 . . . yn, find an alignment of minimum cost.

Def. An alignment M is a set of ordered pairs xi-yj such that each item occurs in at most one pair and there are no crossings.

Def. The pairs xi-yj and xi'-yj' cross if i < i' but j > j'.

Ex: CTACCG vs. TACATG.
Sol: M = x2-y1, x3-y2, x4-y3, x5-y4, x6-y6.

x:  C T A C C - G
y:  - T A C A T G

Sequence Alignment: Problem Structure

Def. OPT(i, j) = min cost of aligning strings x1 x2 . . . xi and y1 y2 . . . yj.

Case 1: OPT matches xi-yj.
– Pay mismatch cost for xi-yj + min cost of aligning x1 x2 . . . xi-1 and y1 y2 . . . yj-1.

Case 2a: OPT leaves xi unmatched.
– Pay gap cost for xi + min cost of aligning x1 x2 . . . xi-1 and y1 y2 . . . yj.

Case 2b: OPT leaves yj unmatched.
– Pay gap cost for yj + min cost of aligning x1 x2 . . . xi and y1 y2 . . . yj-1.
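Putting the three cases together, with the base cases that align a prefix against the empty string, gives the recurrence the algorithm below fills in bottom-up:

OPT(i, 0) = iδ,   OPT(0, j) = jδ
OPT(i, j) = min { α(xi, yj) + OPT(i-1, j-1),  δ + OPT(i-1, j),  δ + OPT(i, j-1) }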

Sequence Alignment: Algorithm

Sequence-Alignment(m, n, x1x2...xm, y1y2...yn, δ, α) {
   for i = 0 to m
      M[i, 0] = iδ
   for j = 0 to n
      M[0, j] = jδ

   for i = 1 to m
      for j = 1 to n
         M[i, j] = min( α[xi, yj] + M[i-1, j-1],
                        δ + M[i-1, j],
                        δ + M[i, j-1] )
   return M[m, n]
}

Analysis. Θ(mn) time and space.

Computational biology: m = n = 10^5 or more is common. 10 billion ops is OK (may take a few minutes), but a 10 GB array?
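A runnable Python version of the same algorithm, offered as a sketch; here delta is the gap penalty and alpha is any function giving the mismatch cost of a pair of symbols (both names are mine):

def sequence_alignment(x, y, delta, alpha):
    # Minimum alignment cost of strings x and y.
    m, n = len(x), len(y)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta
    for j in range(n + 1):
        M[0][j] = j * delta
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(alpha(x[i - 1], y[j - 1]) + M[i - 1][j - 1],   # match x_i with y_j
                          delta + M[i - 1][j],                           # x_i unmatched
                          delta + M[i][j - 1])                           # y_j unmatched
    return M[m][n]

# Example: gap penalty 2, mismatch penalty 1 (0 for equal symbols)
print(sequence_alignment("ocurrance", "occurrence", 2,
                         lambda p, q: 0 if p == q else 1))   # 3 = one mismatch + one gap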

Problem 6.4 (HW # 4, Problem 1)

Input: NY[j], j = 1, 2, …, n and SF[j], j = 1, 2, …, n and moving cost k (= 10).

Output: optimum cost of operating the business for n months.

Solution idea:

Define OPT(j, NY) as the optimal cost of operating the business for the first j months under the condition that the operation was located in NY in the j-th month and similarly for OPT(j, SF).

Now write a recurrence formula for OPT(j, NY) in terms of OPT(j-1, NY) and OPT(j-1, SF) and similarly for OPT(j, SF).

From this it is easy to compute OPT(j, NY) and OPT(j, SF) for every j; the answer to the problem is min { OPT(n, NY), OPT(n, SF) }.
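One natural way to write the recurrence sketched above (a hint, not the only correct formulation; here NY[j] and SF[j] are the operating costs for month j and k is the moving cost):

OPT(1, NY) = NY[1],   OPT(1, SF) = SF[1]
OPT(j, NY) = NY[j] + min { OPT(j-1, NY), OPT(j-1, SF) + k }
OPT(j, SF) = SF[j] + min { OPT(j-1, SF), OPT(j-1, NY) + k }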

6.5 RNA Secondary Structure


RNA Secondary Structure

RNA. String B = b1b2…bn over alphabet { A, C, G, U }.

Secondary structure. RNA is single-stranded, so it tends to loop back and form base pairs with itself. This structure is essential for understanding the behavior of the molecule.

Ex: GUCGAUUGAGCGAAUGUAACAACGUGGCUACGGCGAGA

[Figure: the molecule folded back on itself, with complementary base pairs A-U and C-G bonded.]

RNA Secondary Structure

Secondary structure. A set of pairs S = { (bi, bj) } that satisfies:
– [Watson-Crick.] S is a matching and each pair in S is a Watson-Crick complement: A-U, U-A, C-G, or G-C.
– [No sharp turns.] The ends of each pair are separated by at least 4 intervening bases: if (bi, bj) ∈ S, then i < j - 4.
– [Non-crossing.] If (bi, bj) and (bk, bl) are two pairs in S, then we cannot have i < k < j < l.

Free energy. The usual hypothesis is that an RNA molecule will form the secondary structure with the optimum total free energy, which we approximate here by the number of base pairs.

Goal. Given an RNA molecule B = b1b2…bn, find a secondary structure S that maximizes the number of base pairs.

RNA Secondary Structure: Examples

[Figure: three candidate sets of base pairs; the first is a valid secondary structure ("ok"), the second violates the no-sharp-turns condition, and the third violates the non-crossing condition.]

RNA Secondary Structure: Sub-problems

First attempt. OPT(j) = maximum number of base pairs in a secondary structure of the substring b1b2…bj.

Difficulty. Suppose bt pairs with bn in an optimal structure. This results in two sub-problems:
– Finding a secondary structure in b1b2…bt-1: this is OPT(t-1).
– Finding a secondary structure in bt+1bt+2…bn-1: this is not a prefix of B, so we need more sub-problems.

Dynamic Programming formulation

Notation. OPT(i, j) = maximum number of base pairs in a secondary structure of the substring bibi+1…bj.

Case 1: i ≥ j - 4.
– OPT(i, j) = 0 by the no-sharp-turns condition.

Case 2: Base bj is not involved in a pair.
– OPT(i, j) = OPT(i, j-1).

Case 3: Base bj pairs with bt for some i ≤ t < j - 4.
– The non-crossing constraint decouples the resulting sub-problems.
– OPT(i, j) = 1 + max_t { OPT(i, t-1) + OPT(t+1, j-1) }, taking the max over t such that i ≤ t < j - 4 and bt and bj are Watson-Crick complements.

Dynamic Programming - Algorithm

Q. In what order do we solve the sub-problems?
A. Do shortest intervals first.

RNA(b1,…,bn) {
   for k = 5, 6, …, n-1
      for i = 1, 2, …, n-k
         j = i + k
         Compute M[i, j] using the recurrence
   return M[1, n]
}

Running time. O(n^3).

[Figure: the order in which the table entries M[i, j] are filled, by increasing interval length k = j - i.]
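A Python sketch of this algorithm (a direct, Nussinov-style implementation of the recurrence above, returning only the optimal number of pairs; names are mine):

def rna_secondary_structure(b):
    # Max number of non-crossing Watson-Crick pairs with no sharp turns.
    pairs = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    n = len(b)
    M = [[0] * n for _ in range(n)]       # M[i][j] = OPT(i, j); stays 0 when j - i <= 4
    for k in range(5, n):                 # interval length j - i, shortest first
        for i in range(n - k):
            j = i + k
            best = M[i][j - 1]            # case: b_j not in a pair
            for t in range(i, j - 4):     # case: b_j pairs with b_t, i <= t < j - 4
                if (b[t], b[j]) in pairs:
                    left = M[i][t - 1] if t > i else 0
                    best = max(best, 1 + left + M[t + 1][j - 1])
            M[i][j] = best
    return M[0][n - 1] if n > 0 else 0

print(rna_secondary_structure("ACCGGUAGU"))   # 2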
