LAO* Paper Presentation
Jonathan Eidelman
CSC2542, University of Toronto
Fall 2016
Acknowledgements
The slides on LAO* are those of Pascal Poupart.
The slides on AO* are those of Gholamreza Ghassem-Sani.
Thank you to both researchers for sharing their slides on the web.
Module 9 LAO*
CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo
CS886 (c) 2013 Pascal Poupart
Large State Space
• Value Iteration, Policy Iteration and Linear Programming
  – Complexity at least quadratic in $|S|$
• Problem: $|S|$ may be very large
  – Queuing problems: infinite state space
  – Factored problems: exponentially many states
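To see where the quadratic cost comes from, here is a minimal sketch of one value iteration sweep. It assumes a dense transition model stored as nested lists `T[a][s][t]` and rewards `R[s][a]`; these names and the layout are illustrative choices, not something the slides prescribe. Every state is backed up against every possible successor, so one sweep touches $|A| \cdot |S|^2$ entries.

```python
def value_iteration_sweep(T, R, V, gamma):
    """One Bellman backup over all states.

    T[a][s][t] is the (assumed dense) probability of reaching t from s under a,
    R[s][a] is the immediate reward, V is the current value estimate.
    """
    n_states = len(V)
    n_actions = len(T)
    new_V = []
    for s in range(n_states):                      # |S| states ...
        best = max(
            R[s][a] + gamma * sum(T[a][s][t] * V[t] for t in range(n_states))  # ... times |S| successors
            for a in range(n_actions)              # ... times |A| actions
        )
        new_V.append(best)
    return new_V


# Tiny illustrative MDP: state 0 moves to the absorbing state 1 and earns 1 on the way.
T = [[[0.0, 1.0], [0.0, 1.0]]]   # a single action
R = [[1.0], [0.0]]
V1 = value_iteration_sweep(T, R, [0.0, 0.0], gamma=0.9)
V2 = value_iteration_sweep(T, R, V1, gamma=0.9)
```

With an infinite or exponentially large state space, even a single sweep like this is out of reach, which motivates the ideas that follow.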
Mitigate Size of State Space
• Two ideas:
• Exploit initial state
  – Not all states are reachable
• Exploit heuristic $h$
  – approximation of optimal value function
  – usually an upper bound: $h(s) \ge V^*(s)\ \forall s$
State Space
[Figure: nested regions. The state space $S$ contains the states reachable from $s_0$, which in turn contain the states reachable by $\pi^*$.]
LAO* Algorithm
• Related to
  – A*: path heuristic search
  – AO*: tree heuristic search
  – LAO*: cyclic graph heuristic search
• LAO* alternates between
  – State space expansion
  – Policy optimization
    • value iteration, policy iteration, linear programming
Slides by: Gholamreza Ghassem-Sani
AO* REVIEW
AND/OR graphs
▪ Some problems are best represented as achieving subgoals, some of which are achieved simultaneously and independently (AND)
▪ Up to now, we have only dealt with OR options
[Figure: AND/OR graph with root "Possess TV set" and successors "Steal TV", "Earn Money" and "Buy TV".]
Searching AND/OR graphs
▪ A solution in an AND-OR tree is a subtree whose leaves are included in the goal set
▪ Cost function: sum of costs in AND node: f(n) = f(n1) + f(n2) + … + f(nk)
▪ How can we extend A* to search AND/OR trees? The AO* algorithm.
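As a small illustration of the cost function, the sketch below evaluates an AND/OR tree bottom-up. The AND rule is the sum from the slide; taking the minimum at OR nodes is the usual convention and is an assumption here, as are the node names and leaf costs, which are made up for the example.

```python
def tree_cost(node, kind, children, leaf_cost):
    """Cost of the cheapest solution subtree rooted at `node`.

    kind[node] is "AND" or "OR" for internal nodes; leaves carry a given cost.
    """
    if node not in children:                       # leaf: cost is given
        return leaf_cost[node]
    child_costs = [tree_cost(c, kind, children, leaf_cost) for c in children[node]]
    if kind[node] == "AND":
        return sum(child_costs)                    # f(n) = f(n1) + ... + f(nk)
    return min(child_costs)                        # OR node: pick the cheapest option (assumed rule)


# Hypothetical tree: the goal is reached either by option1 alone or by step1 AND step2.
children = {"goal": ["option1", "option2"], "option2": ["step1", "step2"]}
kind = {"goal": "OR", "option2": "AND"}
leaf_cost = {"option1": 9, "step1": 3, "step2": 4}
cost = tree_cost("goal", kind, children, leaf_cost)
```

Here the AND branch costs 3 + 4 = 7, which beats the single option at cost 9.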
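AND/OR search
▪ We must examine several nodes simultaneously when choosing the next move
[Figure: two AND/OR graphs rooted at A with successors B, C and D. One graph also shows E, F, G, H, I, J with estimates (5), (10), (3), (4), (15), (10) and the costs 17, 9, 27 and 38; the other shows the estimates (3), (4), (5) and (9).]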
AND/OR Best-First-Search
▪ Traverse the graph (from the initial node) following the best current path.
▪ Pick one of the unexpanded nodes on that path and expand it. Add its successors to the graph and compute f for each of them.
▪ Change the expanded node's f value to reflect its successors. Propagate the change up the graph.
▪ Reconsider the current best solution and repeat until a solution is found.
AND/OR Best-First-Search example
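[Figure: search graph after steps 1, 2 and 3. Step 1 shows only A (5). Step 2 shows A with successors B, C and D and the values (3), (4), (5), (9). Step 3 adds E (4) and F (4) and shows the revised values (3), (9), (4), (10).]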
AND/OR Best-First-Search example
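[Figure: search graph after step 4. A has successors B, C and D; the next level shows G (5), H (7), E (4) and F (4), with the revised values (10), (6), (12), (4), (10).]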
A Longer path may be better
[Figure: two copies of an AND/OR graph over the nodes A, B, C, D, E, F, G, H, I, J, each with one node marked "Unsolvable".]
Interacting Sub goals
[Figure: AND/OR graph over the nodes A, C, D and E with the estimates (2) and (5).]
AO* algorithm
1. Let G be a graph with only the starting node INIT.
2. Repeat the following until INIT is labeled SOLVED or h(INIT) > FUTILITY:
   a) Select an unexpanded node from the most promising path from INIT (call it NODE).
   b) Generate the successors of NODE. If there are none, set h(NODE) = FUTILITY (i.e., NODE is unsolvable); otherwise, for each SUCCESSOR that is not an ancestor of NODE, do the following:
      i. Add SUCCESSOR to G.
      ii. If SUCCESSOR is a terminal node, label it SOLVED and set h(SUCCESSOR) = 0.
      iii. If SUCCESSOR is not a terminal node, compute its h.
AO* algorithm (Cont.)
   c) Propagate the newly discovered information up the graph by doing the following: let S be the set of SOLVED nodes or nodes whose h values have been changed and need to have values propagated back to their parents. Initialize S to NODE. Until S is empty, repeat the following:
      i. Remove a node from S and call it CURRENT.
      ii. Compute the cost of each of the arcs emerging from CURRENT. Assign the minimum of these arc costs as the new h of CURRENT.
      iii. Mark the best path out of CURRENT by marking the arc that had the minimum cost in step ii.
      iv. Mark CURRENT as SOLVED if all of the nodes connected to it through the newly marked arc have been labeled SOLVED.
      v. If CURRENT has been labeled SOLVED or its cost was just changed, propagate its new cost back up through the graph: add all of the ancestors of CURRENT to S.
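Steps ii to iv for a single node can be sketched as follows. This is only an illustration under stated assumptions: each outgoing arc is represented as a list of successors that must all be solved together (an AND connector), every arc to a successor costs `arc_cost` (a hypothetical parameter, set to 1 here), and the function name `revise` and the numbers are made up for the example. Step v, pushing the ancestors back onto S, is left out.

```python
def revise(current, connectors, h, solved, arc_cost=1):
    """Recompute h for `current`, return the index of its best connector.

    connectors[current] is a list of connectors; each connector is a list of
    successor nodes that are reached together (AND). A single-element
    connector is an ordinary OR arc.
    """
    # Step ii: cost of each connector = sum over its successors of (arc cost + h).
    costs = [sum(arc_cost + h[s] for s in succs) for succs in connectors[current]]
    best = min(range(len(costs)), key=costs.__getitem__)
    h[current] = costs[best]
    # Step iii is represented by the returned index of the marked connector.
    # Step iv: CURRENT is solved if every node on the marked connector is solved.
    if all(s in solved for s in connectors[current][best]):
        solved.add(current)
    return best


# Illustrative numbers: A has an OR arc to B and an AND connector to C and D.
h = {"A": 5, "B": 5, "C": 3, "D": 4}
connectors = {"A": [["B"], ["C", "D"]]}
solved = set()
best = revise("A", connectors, h, solved)
```

With these values the arc to B costs 1 + 5 = 6 and the connector to C and D costs (1 + 3) + (1 + 4) = 9, so the arc to B is marked and h(A) becomes 6.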
An Example
[Figure sequence: AO* applied step by step to a graph rooted at A. The first slide shows only A (8); later slides add B, C and D, then E and G, with revised estimates in parentheses, arc costs on the arcs and backed-up costs in square brackets. On the last slide the solved nodes are labeled "Solved".]
Real Time A*
▪ Considers the cost (> 0) for switching from one branch to another in the search
▪ Example: path finding in real life
[Figure: graph over the nodes A, B, C, D, E, F, G with a heuristic value at each node; the layout did not survive extraction.]
Current state A: f(B) = 1 + 1 = 2, f(C) = 1 + 2 = 3
Current state B: f(D) = 1 + 4 = 5, f(A) = 1 + 3 = 4
Current state A: f(B) = 1 + 5 = 6, f(C) = 1 + 2 = 3
Current state C: f(A) = 1 + 6 = 7, f(E) = 1 + 7 = 8
Current state A: f(B) = 1 + 5 = 6, f(C) = 1 + 8 = 9
Current state B: f(D) = 1 + 4 = 5, f(A) = 1 + 9 = 10
Current state D: f(F) = 1 + 11 = 12, f(B) = 1 + 10 = 11
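The trace on this slide can be reproduced with a short sketch of Real Time A*. Several things here are assumptions: the chain layout F, D, B, A, C, E, G and the unit edge costs are inferred from the f values in the trace, the values h(A) = 0 and h(G) = 0 are placeholders because the slide does not let us recover them, and the function and variable names are hypothetical. The rule itself is the usual one: move to the neighbor with the lowest f, and store the second-best f as the new estimate of the state being left.

```python
def rta_star(neighbors, h, start, goal, edge_cost=1, max_steps=100):
    """Real Time A* sketch; returns the visited sequence and the learned h."""
    h = dict(h)                       # learned estimates, updated as we move
    current = start
    visited = [current]
    for _ in range(max_steps):
        if current == goal:
            break
        # f(n) = cost of the step to n + current estimate of n
        scored = sorted((edge_cost + h[n], n) for n in neighbors[current])
        best = scored[0][1]
        # Returning to `current` later is worth the second-best option;
        # with a single neighbor there is no alternative, hence infinity.
        h[current] = scored[1][0] if len(scored) > 1 else float("inf")
        current = best
        visited.append(current)
    return visited, h


# Chain inferred from the trace (assumption): F - D - B - A - C - E - G.
neighbors = {
    "F": ["D"], "D": ["F", "B"], "B": ["D", "A"], "A": ["B", "C"],
    "C": ["A", "E"], "E": ["C", "G"], "G": ["E"],
}
h0 = {"F": 11, "D": 4, "B": 1, "A": 0, "C": 2, "E": 7, "G": 0}  # h(A), h(G) are placeholders
visited, learned = rta_star(neighbors, h0, start="A", goal="G", max_steps=7)
```

After seven moves the visited sequence is A, B, A, C, A, B, D, B, and the learned estimates h(A) = 9, h(B) = 10, h(C) = 8 match the right-hand terms in the trace.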
Another Example
[Figure: graph over the nodes S, A, B, C, D, E, F, G, H with heuristic values in parentheses and edge costs on the arcs. The same graph is repeated on each slide of this example.]
Current State = S: f(A) = 3 + 5 = 8, f(B) = 2 + 4 = 6
Current State = B: f(S) = 2 + 8 = 10, f(A) = 4 + 5 = 9, f(C) = 1 + 5 = 6, f(E) = 4 + 2 = 6
Current State = C: f(H) = 2 + 4 = 6, f(B) = 1 + 6 = 7
Current State = H: f(C) = 2 + 7 = 9
Current State = C: f(B) = 1 + 6 = 7, f(H) = ∞
Current State = B: f(S) = 2 + 8 = 10, f(A) = 4 + 5 = 9, f(E) = 4 + 2 = 6, f(C) = ∞
Current State = E: f(B) = 4 + 9 = 13, f(D) = 3 + 2 = 5, f(F) = 1 + 1 = 2
Current State = F: f(E) = 1 + 5 = 6
Current State = E: f(D) = 3 + 2 = 5, f(B) = 4 + 9 = 13, f(F) = ∞
Current State = D: f(G) = 2 + 0 = 2, f(E) = 3 + 13 = 16
Visited Nodes = S, B, C, H, C, B, E, F, E, D, G
Path = S, B, E, D, G
Terminology
• $S$: state space
• $S_E \subseteq S$: envelope
  – Growing set of states
• $S_T \subseteq S_E$: terminal states
  – States whose children are not in the envelope
• $S^{\pi}_{s_0} \subseteq S_E$: states reachable from $s_0$ by following $\pi$
• $h(s)$: heuristic such that $h(s) \ge V^*(s)\ \forall s$
  – E.g., $h(s) = \max_{s,a} R(s,a) / (1 - \gamma)$
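To see why this example heuristic is an upper bound, here is a short derivation. It assumes discounted rewards with $0 \le \gamma < 1$ and uses the shorthand $R_{\max} = \max_{s,a} R(s,a)$, which is not notation from the slides. Each reward along any trajectory is at most $R_{\max}$, and the discount weights $\gamma^t$ are non-negative, so the bound holds term by term:

```latex
V^*(s) \;=\; \max_{\pi}\, \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \;\middle|\; \text{start in } s \right]
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1 - \gamma}
\;=\; h(s)
```

The bound is valid but loose; a tighter heuristic lets the envelope stay smaller.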
LAO* Algorithm
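LAO*(MDP, heuristic $h$)
  $S_E \leftarrow \{s_0\}$, $S_T \leftarrow \{s_0\}$
  Repeat
    Let $R_E(s,a) = \begin{cases} h(s) & s \in S_T \\ R(s,a) & \text{otherwise} \end{cases}$
    Let $T_E(s'|s,a) = \begin{cases} 0 & s \in S_T \\ \Pr(s'|s,a) & \text{otherwise} \end{cases}$
    Find optimal policy $\pi$ for $\langle S_E, R_E, T_E \rangle$
    Find reachable states $S^{\pi}_{s_0}$
    Select reachable terminal states $\{s_1, \dots, s_k\} \subseteq S^{\pi}_{s_0} \cap S_T$
    $S_T \leftarrow (S_T \setminus \{s_1, \dots, s_k\}) \cup (children(\{s_1, \dots, s_k\}) \setminus S_E)$
    $S_E \leftarrow S_E \cup children(\{s_1, \dots, s_k\})$
  Until $S^{\pi}_{s_0} \cap S_T$ is empty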
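The pseudocode can be turned into a small runnable sketch. Everything about the representation is an assumption made for illustration: the MDP is a dict `mdp[s][a] = (reward, {next_state: probability})`, every state has at least one action, the heuristic is a dict `h`, the policy step uses plain value iteration warm-started from the previous values, and all reachable terminal states are expanded in each round. The toy MDP and its numbers are made up.

```python
def lao_star(mdp, h, s0, gamma=0.9, tol=1e-9):
    """LAO* sketch for a reward-maximizing discounted MDP with upper-bound h."""
    def q(s, a, V):
        reward, successors = mdp[s][a]
        return reward + gamma * sum(p * V[t] for t, p in successors.items())

    envelope, terminal = {s0}, {s0}            # S_E and S_T
    V = {}
    while True:
        for s in envelope:
            V.setdefault(s, h[s])              # new states start at h, old values are reused
        while True:                            # value iteration on the envelope MDP
            delta = 0.0
            for s in envelope:
                # terminal states are worth h(s); others get a Bellman backup
                new = h[s] if s in terminal else max(q(s, a, V) for a in mdp[s])
                delta = max(delta, abs(new - V[s]))
                V[s] = new
            if delta < tol:
                break
        policy = {s: max(mdp[s], key=lambda a: q(s, a, V)) for s in envelope - terminal}

        reachable, stack = set(), [s0]         # states reachable from s0 under policy
        while stack:
            s = stack.pop()
            if s in reachable:
                continue
            reachable.add(s)
            if s not in terminal:
                stack.extend(t for t, p in mdp[s][policy[s]][1].items() if p > 0)

        fringe = reachable & terminal
        if not fringe:                         # no reachable terminal state is left
            return policy, V, envelope
        children = {t for s in fringe for a in mdp[s]
                    for t, p in mdp[s][a][1].items() if p > 0}
        terminal = (terminal - fringe) | (children - envelope)
        envelope |= children


# Hypothetical toy MDP: "good" earns 10 and ends in an absorbing state,
# "bad" leads into a zero-reward branch that the heuristic rules out.
mdp = {
    "s0": {"good": (10.0, {"g": 1.0}), "bad": (0.0, {"b1": 1.0})},
    "g": {"stay": (0.0, {"g": 1.0})},
    "b1": {"go": (0.0, {"b2": 1.0})},
    "b2": {"stay": (0.0, {"b2": 1.0})},
}
h = {"s0": 10.0, "g": 0.0, "b1": 0.0, "b2": 0.0}   # upper bounds on V*, chosen by hand
policy, V, envelope = lao_star(mdp, h, "s0")
```

In this run the envelope ends up as {s0, g, b1}: the state b2 is never added because the policy found on the envelope never reaches b1, which is the saving the algorithm is after.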
Efficiency
Efficiency influenced by
1. Choice of terminal states to add to envelope
2. Algorithm to find optimal policy
   – Can use value iteration, policy iteration, modified policy iteration, linear programming
   – Key: reuse previous computation
     • E.g., start with previous policy or value function at each iteration
Convergence
• Theorem: LAO* converges to the optimal policy
• Proof:
  – Fact: At each iteration, the value function $V$ is an upper bound on $V^*$ due to the heuristic function $h$
  – Proof by contradiction: suppose the algorithm stops, but $\pi$ is not optimal.
    • Since the algorithm stopped, all states reachable by $\pi$ are in $S_E \setminus S_T$
    • Hence, the value function $V$ is the value of $\pi$, and since $\pi$ is suboptimal, $V < V^*$, which contradicts the fact that $V$ is an upper bound on $V^*$
Summary
• LAO*
  – Extension of basic solution algorithms (value iteration, policy iteration, linear programming)
  – Exploit initial state and heuristic function
  – Gradually grow an envelope of states
  – Complexity depends on # of reachable states instead of size of state space