1
Greedy Algorithms
Bioinformatics Algorithms
© Jeff Parker, 2009
Why should I care about posterity? What's posterity ever done for me?
- Groucho Marx
2
Question from Last Week
What are we supposed to take away from the discussion of Motif Finding?
Objectives
Understand the Motif Problem
Understand Exhaustive Search
Understand that some Exhaustive search is less exhausting than others
Understand Branch and Bound
3
Outline
Objectives
Understand why we cannot use backtracking
Understand what a Greedy Algorithm is
Understand use of metrics
Understand alternatives – BFS, Best First, and A* search
Understand the biological problem
Understand some algorithms for sorting by reversals
What is a greedy algorithm?
Understand why trying for local optimization can distort things
Alternatives to Greed
4
Backtracking's Limits
Backtracking can solve problems like the knight's tour
It is not well suited for the 15 puzzle
There is no way to characterize a position as a dead end.
However, some positions are more promising than others
The problem of sorting also has no dead ends
5
Definition of Greedy Algorithm
Many algorithms make a sequence of choices among alternatives
Sorting – which pair should we exchange?
Traveling Salesman – which city should we visit next?
Greedy algorithms make a locally optimal choice in hopes that it will lead to a globally optimal solution.
That is, they look ahead one move.
Sometimes a greedy algorithm is optimal.
Often it is not.
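A tiny illustration of that last point (not from the slides; the coin systems are my own example) in Python 3: the same greedy rule is optimal for one set of coin denominations and suboptimal for another.

```python
def greedy_change(amount, coins):
    """Make change greedily: repeatedly take the largest coin that still
    fits. This is the locally optimal choice at each step."""
    used = []
    for c in sorted(coins, reverse=True):
        while amount >= c:
            amount -= c
            used.append(c)
    return used

# With US-style denominations the greedy choice happens to be globally optimal:
print(greedy_change(30, [1, 5, 10, 25]))   # [25, 5]
# With denominations {1, 3, 4} it is not: greedy gives three coins,
# but two coins (3 + 3) suffice.
print(greedy_change(6, [1, 3, 4]))         # [4, 1, 1]
```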
6
Decision Trees
We may have a sequence of decisions to make
Should I file the short form or the long form?
If I file the long form, should I fill out Schedule C?
Note that the tree below is not a data structure: it is an expansion of the logic of an algorithm.
Should I file the long form?
Should I file Schedule C?
[Decision tree: each question branches yes / no.]
7
Example
Consider a linear search
We can view this as a sequence of decisions
In searching an array of N items for item x, there are 2N+1 outcomes: the item should be
Before the first
Same value as the first
After the first, but before the second
and so on
2 5 6 12 21
8
Linear Search

Linear_search ( int target, int list [] ) {
    int pos;
    for (pos = 0; pos < MAX; pos++) {
        if ( target <= list[pos] )
            break;
    }
    // Sort things out…
    if (pos < MAX) {
        if (target == list[pos])
            update list[pos];
        else
            insert before list[pos];
    } else // pos == MAX
        insert after list[MAX - 1];
}
[Decision tree: the root tests T <= L[0]; on yes, T == L[0] chooses between update L[0] and insert before L[0]; on no, the chain continues with T <= L[1], T == L[1], and so on.]
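The pseudocode above can be sketched as a runnable Python 3 function. This is my own translation, not the course code: the function name is mine, and list.insert stands in for the slide's "insert before"/"insert after" actions, while assignment stands in for "update".

```python
def linear_update_or_insert(target, lst):
    """Scan a sorted list for the first slot where target fits;
    update on an exact match, otherwise insert at that position."""
    pos = 0
    while pos < len(lst) and target > lst[pos]:
        pos += 1
    if pos < len(lst) and lst[pos] == target:
        lst[pos] = target          # "update" stands in for real per-item work
    else:
        lst.insert(pos, target)    # covers "before list[pos]" and "after the end"
    return lst

print(linear_update_or_insert(7, [2, 5, 6, 12, 21]))   # [2, 5, 6, 7, 12, 21]
```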
9
High Level View
This gives a scrawny tree. The algorithm is Greedy.
How many comparisons are needed to reach each possible outcome?
2+2+3+3+4+4+5+5+6+6+5 = 45
We are looking at the sum of the path lengths: this gives a measure of the average complexity
There are two forms, internal and external path length, and they are closely related
Greedy algorithms work in phases. In each phase, a decision is made that appears to be good, without regard for future consequences.
10
Look at Binary Search
Two possible versions of the main loop

/* Binary 1 - Forgetful Binary Search */
while ( top > bottom ) {
    middle = ( top + bottom ) / 2;
    if ( list[middle] < target )
        bottom = middle + 1;
    else
        top = middle;
} // then sort things out…

/* Version 2 - Careful Binary Search - check middle entry */
while ( top > bottom ) {
    middle = ( top + bottom ) / 2;
    if ( list[middle] == target )
        break;
    if ( list[middle] < target )
        bottom = middle + 1;
    else /* Cut down search space */
        top = middle - 1;
} // Sort things out…
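Both loops translate directly to Python 3. This is a sketch rather than the course code: the "sort things out" step is folded into each function, and the Careful version's loop bound is widened to >= so that no post-loop fix-up is needed.

```python
def forgetful_search(lst, target):
    """Version 1: one two-way comparison per step; equality is only
    checked once, after the loop narrows the range to a single slot."""
    bottom, top = 0, len(lst) - 1
    while top > bottom:
        middle = (top + bottom) // 2
        if lst[middle] < target:
            bottom = middle + 1
        else:
            top = middle
    return bottom if lst and lst[bottom] == target else -1

def careful_search(lst, target):
    """Version 2: check the middle entry each step, so the search can
    stop early, and cut the search space on both sides."""
    bottom, top = 0, len(lst) - 1
    while top >= bottom:
        middle = (top + bottom) // 2
        if lst[middle] == target:
            return middle
        if lst[middle] < target:
            bottom = middle + 1
        else:
            top = middle - 1
    return -1
```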
11
Which is better?
Either version is better than linear search for more than a few items.
To compare them, look at the average path length from the root to a decision
We sum up the length of all paths again
How many decisions to reach all outcomes?
First look at Forgetful search
4+4+4+4+3+3+3+3+3+4+4 = 39
[Decision tree for Forgetful search: internal nodes test <= L[3], <= L[2], <= L[1], <= L[4], <= L[5]; the leaves are the outcomes = L[1] through = L[5].]
12
Careful Binary Search
Careful search has fewer recursive calls: each call does twice as much work
1 +3+5+6+6+4+3+4+5+6+6=49
(For larger sets, this does better than linear search.)
[Decision tree for Careful search: each level tests = L[i], then < L[i], for L[3], L[4], L[5], L[2], L[1].]
13
Balloon Dog Theorem
When you put one outcome close to the root, it may push others further away
[The same Careful-search decision tree as on the previous slide.]
14
Problem
Our problem today is to find the smallest number of operations that will lead us from one sequence to another
Basic operation is reversing a section of the sequence
15
Representation
We will represent the sequence with signed integers
One sequence (the turnip, below) is presented in order: this is the goal
16
Representation
We will represent the sequence with signed integers
We can represent a sequence of signed integers as a sequence of unsigned integers
1 -5 4 -3 2
1 2 10 9 7 8 6 5 3 4
Bracket the numbers with initial and final values that do not change
0 1 2 10 9 7 8 6 5 3 4 11
We may compress an increasing or a decreasing run.
0 7 8 9 6 5 4 3 2 1 10 <=> 0 3 4 2 1 5
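One way to realize the signed-to-unsigned mapping, consistent with the example above, is to expand each signed gene into a pair of unsigned endpoints: +x becomes (2x-1, 2x) and -x becomes (2x, 2x-1), with the fixed bracket values 0 and 2n+1 added at the ends. This is my reconstruction of the rule from the example, sketched in Python 3.

```python
def unsign(perm):
    """Map a signed permutation to an unsigned one: gene +x becomes the
    pair (2x-1, 2x); gene -x becomes (2x, 2x-1); bracket with 0 and 2n+1."""
    out = [0]
    for x in perm:
        a = 2 * abs(x)
        out.extend([a, a - 1] if x < 0 else [a - 1, a])
    out.append(2 * len(perm) + 1)
    return out

print(unsign([1, -5, 4, -3, 2]))
# [0, 1, 2, 10, 9, 7, 8, 6, 5, 3, 4, 11]
```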
17
Search
At each step, we can reverse a subsequence
The problem is to minimize the number of steps
Exhaustive search would follow all possible outcomes
This is simplest to organize as a BFS of the space
This is called uninformed search
Pay no attention to the contents
To compare positions, need a metric
18
Breadth First Search
He leapt onto his horse and galloped off madly in all directions. - Stephen Leacock
Systematically search each alternative
Look at all boards one move away
Look at all boards two moves away
To implement BFS, use a queue
Take the next board from queue
Look at all boards one move away
Toss duplicates.
Insert the rest in the queue
[Figure: boards labeled 1 through 13 in the order BFS visits them.]
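The queue-based procedure above can be sketched in Python 3. The move generator all_reversals is my own illustrative helper for today's problem (all boards one reversal away); the bfs function itself is generic.

```python
from collections import deque

def all_reversals(t):
    """All boards one reversal away from tuple t (illustrative helper)."""
    return [t[:i] + t[i:j][::-1] + t[j:]
            for i in range(len(t)) for j in range(i + 2, len(t) + 1)]

def bfs(start, moves, is_goal):
    """Uninformed search: look at all boards one move away, then two
    moves away, and so on, tossing duplicates as we go."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        board, depth = queue.popleft()
        if is_goal(board):
            return depth
        for nxt in moves(board):
            if nxt not in seen:           # toss duplicates
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return -1

# Minimum number of reversals to sort (3, 1, 2):
print(bfs((3, 1, 2), all_reversals, lambda b: b == (1, 2, 3)))   # 2
```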
19
Metrics
BFS takes too long for many problems.
How do we decide which move is better?
To measure progress, we use metrics.
Traveling Salesman – cost of tour
Bubble Sort – the length of the sorted subarray
15 puzzle – number of tiles that are home
A greedy algorithm uses the metric to choose its next move
20
Informed Search
These searches are “informed” by a measure of how close a position is to a solution
In so-called "Hill Climbing", we follow the most promising path we can. While this can quickly lead to a better position, it often leaves us at a local maximum (or minimum): the hill we climb is not always the highest
We cannot always continue to increase (reduce) the metric
21
Our Greedy Strategy
Our strategy will be a form of Depth First Search
At each stage, we will select the most promising next step
Since there are no dead-ends, there is always hope
We need a metric.
How do we decide which permutation is close to solved?
22
Metric 1
First metric: Length of run of items in order
0 1 2 10 9 7 8 6 5 3 4 11
0 1 2 3 5 6 8 7 9 10 4 11
Can increase this at each step
0 1 2 3 4 10 9 7 8 6 5 11
0 1 2 3 4 5 6 8 7 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
Compare with
0 1 2 10 9 7 8 6 5 3 4 11
0 1 2 10 9 7 8 6 5 4 3 11
0 1 2 10 9 8 7 6 5 4 3 11
0 1 2 3 4 5 6 7 8 9 10 11
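Metric 1 (the length of the initial run that is already in order) can be computed with a short helper. This is a sketch in Python 3; the function name is mine.

```python
def sorted_prefix(ar):
    """Metric 1: length of the initial run 0, 1, 2, ... that is
    already in order at the front of the permutation."""
    n = 1
    while n < len(ar) and ar[n] == ar[n - 1] + 1:
        n += 1
    return n

print(sorted_prefix([0, 1, 2, 10, 9, 7, 8, 6, 5, 3, 4, 11]))   # 3
```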
23
Metric 2
Look at breakpoints – places where abs(a[i] – a[i+1]) != 1
0 1 2 10 9 7 8 6 5 3 4 11 - 5
0 1 2 10 9 8 7 6 5 3 4 11 - 3
0 1 2 10 9 8 7 6 5 4 3 11 - 2
0 1 2 3 4 5 6 7 8 9 10 11 - 0
24
Finding Breakpoints
# Look at breakpoints – places where abs(a[i] – a[i+1]) != 1
def findBP(ar):
    """Find the breakpoints in a string of integers"""
    lst = []
    for i in xrange(1, len(ar)):
        if (abs(ar[i-1] - ar[i]) > 1):
            lst.append(i)
    return lst
25
Reversing String Segment
def reverseSegment(ar, strt, end):
    """Take an array ar, and reverse the segment ar[strt:end]"""
    sublst = ar[strt:end]
    # Now reverse the sublist using a Python idiom
    sublst = sublst[::-1]
    # Print the list as the three components
    print ar[:strt], sublst, ar[end:],
    # We print the number of breakpoints in the caller
    return ar[:strt] + sublst + ar[end:]
26
Sorting
def sortPermutation(ar):
    """Greedy algorithm to sort a permutation."""
    bp = findBP(ar)
    while (len(bp) > 0):
        bpLen = len(bp)
        bpMin = len(ar)
        minAr = []
        # Look at all possible reversals
27
All possible Reversals
        for i in xrange(bpLen):
            for j in xrange(i+1, bpLen):
                if (bp[i] < bp[j] - 1):
                    cand = reverseSegment(ar, bp[i], bp[j])
                    candBP = findBP(cand)
                    candBPLen = len(candBP)
                    if (candBPLen < bpMin):
                        bpMin = candBPLen
                        minAr = cand
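The routines on the last few slides can be combined into one self-contained Python 3 sketch of the greedy strategy. This is my own consolidation, not the course code: names are simplified, printing is dropped, and a bail-out when no reversal lowers the breakpoint count is my addition (the slides' version keeps looping as long as breakpoints remain).

```python
def find_bp(ar):
    """Breakpoints: positions i where abs(ar[i-1] - ar[i]) != 1."""
    return [i for i in range(1, len(ar)) if abs(ar[i - 1] - ar[i]) > 1]

def greedy_reversal_sort(ar):
    """Greedily apply, at each step, the reversal between two breakpoints
    that leaves the fewest breakpoints; stop when none remain, or when
    no reversal lowers the count. Returns (final array, steps taken)."""
    steps = 0
    bp = find_bp(ar)
    while bp:
        best, best_bp = None, len(ar) + 1
        for i in range(len(bp)):
            for j in range(i + 1, len(bp)):
                cand = ar[:bp[i]] + ar[bp[i]:bp[j]][::-1] + ar[bp[j]:]
                n = len(find_bp(cand))
                if n < best_bp:
                    best, best_bp = cand, n
        if best_bp >= len(bp):     # no reversal lowers the count: bail out
            break
        ar, bp = best, find_bp(best)
        steps += 1
    return ar, steps
```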
28
Output
List [0, 6, 1, 2, 5, 4, 9, 7, 8, 10, 3, 11]
BPs [1, 2, 4, 6, 7, 9, 10, 11] 8
[0] [2, 1, 6] [5, 4, 9, 7, 8, 10, 3, 11] w/ breakpoint count 7
[0] [4, 5, 2, 1, 6] [9, 7, 8, 10, 3, 11] w/ breakpoint count 8
…
[0, 6] [2, 1] [5, 4, 9, 7, 8, 10, 3, 11] w/ breakpoint count 8
[0, 6] [4, 5, 2, 1] [9, 7, 8, 10, 3, 11] w/ breakpoint count 8
…
[0, 6, 1, 2, 5, 4] [3, 10, 8, 7, 9] [11] w/ breakpoint count 7
[0, 6, 1, 2, 5, 4, 9] [8, 7] [10, 3, 11] w/ breakpoint count 7
29
Termination
It is not always possible to find a move that lowers the breakpoint count
(There is always some move that leaves it unchanged)
How do we know that this will terminate?
30
Informed Search: Best First
Keep a table of positions we have already seen
Insert starting position in PQueue and table
While PQueue is not empty
    Select position from the PQueue
    While there are moves from here
        Generate next position
        If position is not in table
            Insert in the PQueue and the table
We investigate multiple strands at the same time
Like breadth first search, but informed by our notion of closeness.
By placing the positions in a priority queue, we look at the most promising positions first
[Figure: boards labeled 1 through 8 in the order Best First visits them.]
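The Best First outline above can be sketched with Python's heapq as the priority queue. The search function is generic; rev_moves and breakpoints are my own illustrative helpers that apply it to today's reversal-sort problem, using the breakpoint count as h*(b).

```python
import heapq

def best_first(start, moves, h):
    """Informed search: always expand the position with the smallest
    estimate h*(b); a table of seen positions tosses duplicates."""
    seen = {start}
    pq = [(h(start), start, 0)]
    while pq:
        est, board, depth = heapq.heappop(pq)
        if est == 0:                    # h*(b) == 0 means solved
            return board, depth
        for nxt in moves(board):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(pq, (h(nxt), nxt, depth + 1))
    return None, -1

# Illustrative helpers for the reversal-sort problem:
def rev_moves(t):
    return [t[:i] + t[i:j][::-1] + t[j:]
            for i in range(len(t)) for j in range(i + 2, len(t) + 1)]

def breakpoints(t):
    return sum(1 for i in range(1, len(t)) if abs(t[i] - t[i - 1]) != 1)

print(best_first((0, 2, 1, 3), rev_moves, breakpoints))   # ((0, 1, 2, 3), 1)
```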
31
Best First algorithm in action
Take the best position from the priority queue
Look at all boards one move away
For each position, check to see if we have seen it before
If not, insert in the priority queue
Rank boards by their distance from a solution
We don't know how far it really is:
We use our estimate h*(b)
[Figure: 15-puzzle boards one move from the start, each labeled with its estimate h*(b) of 7, 8, or 9.]
32
Best First in action
Start with the center position (1). Generate all outcomes: discard boards we have seen before. Place remaining outcomes in the priority queue. We select one of the cheapest (2). Generate outcomes: toss duplicates. Select the new cheapest (3)
Note that 1, 2, 3 do not form a legal sequence of moves
[Figure: the same 15-puzzle boards expanded in Best First order; one generated board is discarded as a duplicate.]
33
A* Search
BFS finds the minimal solution, but it takes a long time.
Best First uses the function h*(b). Faster, but the solution may not be the best.
A* is an informed search that will find an optimal solution.
One way to improve things is to improve h*(b). Often difficult.
Define a new priority function f*, where
f*(b) = g*(b) + h*(b)
where g*(b) is the best estimate of the number of steps required to reach this position.
Breadth First Search amounts to f*(b) = g*(b), and Best First Search amounts to f*(b) = h*(b)
34
A* Search
Use our new priority function f*, where
f*(b) = g*(b) + h*(b)
where g*(b) is the best estimate of the number of steps required to reach this position.
Why do we need to estimate g*(b)? Don't we know how long it took?
Shortcuts: you may find you can reach a position that took you 20 steps through another path that only takes 16 steps.
When you find a better path, update the stored board to point to the new, better, solution. (Though this will happen with A* search, it will never happen with BFS. Why?)
Requeue the board at the new priority.
Not all implementations of a PQ have an easy way to update costs, but it turns out there is no harm done if you have multiple copies of a board in the PQ.
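The multiple-copies idea can be sketched in Python 3 with heapq: instead of updating a queued board's priority in place, push a fresh copy with the better g*(b) and skip any stale copies when they surface. The rev_moves and misplaced helpers are my own illustrative stand-ins for the move generator and h*(b).

```python
import heapq

def a_star(start, moves, h):
    """A*: rank boards by f*(b) = g*(b) + h*(b). Rather than updating
    priorities in place, allow duplicate boards in the PQ and skip any
    entry whose recorded g is worse than the best path found so far."""
    best_g = {start: 0}
    pq = [(h(start), 0, start)]
    while pq:
        f, g, board = heapq.heappop(pq)
        if g > best_g.get(board, float('inf')):
            continue                 # stale copy: a cheaper path was found
        if h(board) == 0:
            return g                 # h*(b) == 0 means solved
        for nxt in moves(board):
            if g + 1 < best_g.get(nxt, float('inf')):
                best_g[nxt] = g + 1
                heapq.heappush(pq, (g + 1 + h(nxt), g + 1, nxt))
    return -1

# Illustrative helpers: sort a permutation by reversals.
def rev_moves(t):
    return [t[:i] + t[i:j][::-1] + t[j:]
            for i in range(len(t)) for j in range(i + 2, len(t) + 1)]

def misplaced(t):
    return sum(1 for i, v in enumerate(t) if v != i + 1)

print(a_star((3, 1, 2), rev_moves, misplaced))   # 2
```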