sample-based learning and search with permanent and transient … · 2018. 11. 3. · sample-based...
TRANSCRIPT
![Page 1: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/1.jpg)
Sample-Based Learning and Search with Permanent and Transient Memories
David Silver, Richard S. Sutton, Martin Müller
![Page 2: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/2.jpg)
A Brief History of Computer Go
2005: Computer Go is impossible!
2006: UCT invented and applied to 9x9 Go (Kocsis, Szepesvari; Gelly et al.)
2007: Human master level achieved at 9x9 Go (Gelly, Silver; Coulom)
2008: Human grandmaster level achieved at 9x9 Go (Teytaud et al.)
2009: Human master level achieved at 19x19 Go?
![Page 3: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/3.jpg)
Big Ideas for Big Environments
1. Function approximation
2. Tracking
3. Sample-based planning
4. Bootstrapping
![Page 4: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/4.jpg)
1. Function Approximation
Problem:
State space too large to represent all values
No generalisation between states
Solution:
Approximate value function with a smaller number of parameters
![Page 5: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/5.jpg)
2. Tracking
Problem:
Value function too complex to represent
Solution:
Focus on current region of state space
![Page 6: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/6.jpg)
3. Sample-Based Planning (Dyna)
Problem:
Insufficient experience to learn good policy
Solution:
Sample experience from a model of the world
Learn from simulated experience
![Page 7: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/7.jpg)
4. Bootstrapping (TD)
Problem:
Return has high variance
Solution:
Learn from value of successor states
![Page 8: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/8.jpg)
Sample-Based Search
Sample-based planning + tracking
Simulate episodes starting from current state
Learn from simulated experience
![Page 9: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/9.jpg)
Sample-Based Search
Current State
Terminal State
Simulation
![Page 10: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/10.jpg)
Sample-Based Search Algorithms
Monte-Carlo simulation
Monte-Carlo tree search
UCT
Heuristic UCT
States represented individually
Monte-Carlo learning
![Page 11: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/11.jpg)
New Idea
Dyna-2
Linear value function approximation
Temporal difference learning
![Page 12: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/12.jpg)
Dyna-2: Two Memories
A memory is a vector of features φ(s,a) and parameters θ
Permanent memory learned from all experience
Transient memory learned from recent experience
Transient memory is forgotten over time
(φ, θ)
(φ̄, θ̄)
![Page 13: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/13.jpg)
Dyna-2: Learning and Search
Sample-based learning:
Permanent memory is updated from real experience using Sarsa(λ)
Sample-based search:
Transient memory is updated from simulated experience using Sarsa(λ)
Transient memory is forgotten at the end of each real episode
![Page 14: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/14.jpg)
Real Experience
Current State
Learning
Permanent memoryQ(s, a) = φ(s, a)T θ
![Page 15: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/15.jpg)
Simulated Experience
Search
Current State
Transient memoryQ̄(s, a) = φ(s, a)T θ + φ̄(s, a)T θ̄
![Page 16: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/16.jpg)
Shape Knowledge in Go
Go players utilise a large vocabulary of shapes:
One-point jump
Ponnuki
Hane
![Page 17: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/17.jpg)
Local Shape Features
Binary features matching a local configuration of stones
All possible locations and configurations from 1x1 to 3x3
~1 million features for 9x9 Go
![Page 18: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/18.jpg)
Empty Triangle
Permanent memory is based on all experience
General knowledge about world
![Page 19: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/19.jpg)
Guzumi
Transient memory forgets old experience
Local knowledge about current situation
![Page 20: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/20.jpg)
“Blood Vomiting Game”
![Page 21: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/21.jpg)
Results for Dyna-2
![Page 22: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/22.jpg)
Dyna-2 + Alpha-Beta Search
Like a traditional alpha-beta search, except:
Evaluation function is re-learned every move
Based on simulated experience
Using permanent + transient memories
![Page 23: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/23.jpg)
Dyna-2 + Alpha-Beta
![Page 24: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/24.jpg)
Bootstrapping Results
![Page 25: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/25.jpg)
Conclusions (9x9 Go)
Sample-based search methods outperform other approaches
Function approximation improves performance
Combining permanent and transient memories significantly improves performance
TD learning outperforms Monte-Carlo learning
![Page 26: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/26.jpg)
Extrapolation (19x19 Go)
Sample-based search methods massively outperform other approaches
Function approximation is crucial
Combining permanent and transient memories significantly improves performance
TD learning outperforms Monte-Carlo learning
![Page 27: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/27.jpg)
Questions?
![Page 28: Sample-Based Learning and Search with Permanent and Transient … · 2018. 11. 3. · Sample-Based Planning (Dyna) Problem: Insufficient experience to learn good policy Solution:](https://reader036.vdocuments.mx/reader036/viewer/2022071608/61462fde8f9ff81254201a4f/html5/thumbnails/28.jpg)
Permanent and Transient Memories