
Progressive Strategies For Monte-Carlo Tree Search

Presenter: Ling Zhao

University of Alberta

November 5, 2007

Authors: G.M.J.B. Chaslot, M.H.M. Winands, J.W.H.M. Uiterwijk, H.J. van den Herik and B. Bouzy

2

Outline

Monte-Carlo Tree Search (MCTS) and the implementation in MANGO.

Progressive strategies: progressive bias and progressive unpruning.

Experiments.

Conclusions and future work.

3

MCTS

4

Selection

Process: select moves in the UCT tree that give the best balance between exploitation and exploration.

This is a multi-armed bandit problem. UCB formula: select child k of parent node p such that

$k \in \operatorname*{argmax}_{i} \left( v_i + C \sqrt{\frac{\ln n_p}{n_i}} \right)$

k: the selected child of node p; vi: value of child i; ni: visit count of child i; np: visit count of parent node p; C: a constant.

Selection precondition: np >= T (= 30)
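As a concrete illustration, here is a minimal Python sketch of this selection rule. The node structure, the value of C, and all names are illustrative assumptions, not MANGO's actual code:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    value: float = 0.0            # v_i: average result of simulations through this node
    visits: int = 0               # n_i: visit count
    children: list = field(default_factory=list)

C = 0.7   # exploration constant (illustrative; the slide does not give MANGO's value)
T = 30    # selection is applied only once the parent has n_p >= T visits

def ucb(child: Node, parent_visits: int) -> float:
    if child.visits == 0:
        return float("inf")       # unvisited children are tried first
    return child.value + C * math.sqrt(math.log(parent_visits) / child.visits)

def select_child(parent: Node) -> Node:
    # Pick the child maximizing v_i + C * sqrt(ln n_p / n_i).
    assert parent.visits >= T
    return max(parent.children, key=lambda c: ucb(c, parent.visits))
```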

5

Expansion

Process: for a given leaf node, determine whether it will be expanded by storing one or more of its children in the UCT tree.

Simple rule: expand one node per simulated game (the first node encountered that is not yet in the UCT tree).

In MANGO, when np reaches T (= 30), all of the node's children are expanded.
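A minimal Python sketch of this expansion rule; the node structure and the way legal moves are supplied are illustrative assumptions, not MANGO's code:

```python
T = 30  # expansion threshold used in MANGO (per the slide)

class Node:
    def __init__(self, move=None):
        self.move = move
        self.visits = 0
        self.value = 0.0
        self.children = []

def maybe_expand(node, legal_moves):
    # When a node's visit count reaches T, store all of its children in the
    # UCT tree at once (the simpler alternative above adds one node per game).
    if node.visits == T and not node.children:
        node.children = [Node(move) for move in legal_moves]
```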

6

Simulation

Process: self-play until the end of the game.

Rules: 1. A move that fills one of the player's own eyes is disallowed. 2. The game is stopped after a certain number of moves.

In MANGO, the probability of a move being selected in the simulation is proportional to its urgency: the sum of a capture value, a 3x3 pattern value, and a proximity modification.
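A sketch of urgency-proportional move selection in the playout. How the urgency itself is computed is not shown here; the urgencies are simply passed in as a hypothetical dict:

```python
import random

def choose_playout_move(urgencies):
    """Pick a move with probability proportional to its urgency.

    `urgencies` maps each legal move to its urgency value (capture value +
    3x3 pattern value + proximity modification, per the slide).
    """
    moves = list(urgencies)
    weights = [urgencies[m] for m in moves]
    return random.choices(moves, weights=weights, k=1)[0]
```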

7

Backpropagation

Process: use the result of a simulated game to update all nodes it traversed.

Result: +1 for a win, -1 for a loss, 0 for a draw.

The value vi of node i is computed by averaging the results of all simulated games played through it.
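A sketch of this update, keeping each node's value as a running average of playout results; handling of the two players' perspectives is omitted:

```python
def backpropagate(path, result):
    # `path` is the list of nodes traversed in this simulation; `result` is
    # +1 for a win, -1 for a loss, 0 for a draw.
    for node in path:
        node.visits += 1
        # Incremental running average: v <- v + (result - v) / n
        node.value += (result - node.value) / node.visits
```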

8

Progressive Strategies

Soft transition between selection strategy and simulation strategy.

Intuition: the selection strategy becomes more accurate than the simulation strategy only when the number of simulated games is large.

A progressive strategy uses the information available to the selection strategy, plus some (possibly expensive) domain knowledge.

A progressive strategy behaves like the simulation strategy when only a few games have been played, and converges to the selection strategy as many games are played.

9

Progressive Bias

Direct the search using possibly expensive heuristic knowledge.

Modify the selection strategy so that the influence of this knowledge decreases quickly as more games are played.

10

Progressive Bias Formula

The selection value of child i is augmented with a progressive-bias term $f(n_i) = \frac{H_i}{n_i + 1}$, where Hi is a coefficient representing heuristic knowledge about move i.

For children with ni = 0, the UCB part of the formula is replaced by a constant M with M >> any vi, so the unvisited child with the highest f(ni) is selected first.

If np ∈ [30, 100], f(ni) is dominant.

If np ∈ (100, 500], f(ni) has partial impact.

When np > 500, f(ni) is dominated, but can still serve as a tie breaker.
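A sketch of selection with the progressive-bias term added. The form f(ni) = Hi / (ni + 1), the node attributes, and the constants are assumptions consistent with the description above, not MANGO's actual code:

```python
import math

C = 0.7    # exploration constant (illustrative)
M = 1e9    # large constant for unvisited children, M >> any v_i

def progressive_bias_value(child, parent_visits):
    bias = child.heuristic / (child.visits + 1)      # f(n_i) = H_i / (n_i + 1)
    if child.visits == 0:
        # Unvisited children: the UCB part is replaced by M, so the child
        # with the highest f(n_i) (i.e. highest heuristic) is tried first.
        return M + bias
    ucb = child.value + C * math.sqrt(math.log(parent_visits) / child.visits)
    return ucb + bias

def select_child_with_bias(parent):
    return max(parent.children,
               key=lambda c: progressive_bias_value(c, parent.visits))
```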

11

Alternative Approach

Using prior knowledge (Gelly and Silver):

“Scalability of this approach to larger board sizes is an open question”.

12

Progressive Unpruning

Reduce the branching factor artificially when the selection strategy is used.

Increase the branching factor progressively as more games are simulated.

Pruning or unpruning is done according to the heuristic value of the children.

13

Progressive Unpruning (Details)

If np = T, only the k0 (= 5) children with the highest heuristic values are left unpruned.

If np > T, k = lg(np / 40) * 2.67 + k0 children are left unpruned.

Examples: k = 5 (np = 40), 7 (np = 80), 10 (np = 120).

Similar idea used by Coulom (progressive widening).
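A sketch of this unpruning rule; reading "lg" as log base 2 and truncating to an integer are assumptions, so the exact counts may differ slightly from the examples above:

```python
import math

K0 = 5   # number of children left unpruned when n_p first reaches T
T = 30

def num_unpruned(parent_visits):
    # k = lg(n_p / 40) * 2.67 + k0; clamp to k0 while the log term is negative.
    if parent_visits <= 40:
        return K0
    return int(math.log2(parent_visits / 40) * 2.67) + K0

def unpruned_children(parent):
    # Rank children by heuristic value and keep only the k best for selection.
    ranked = sorted(parent.children, key=lambda c: c.heuristic, reverse=True)
    return ranked[: num_unpruned(parent.visits)]
```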

14

Heuristic Values

Pattern value: learned offline using pattern matching (89,119 patterns from 2000 pro games).

Capture value: the number of stones that would be captured by the move, or that would escape capture thanks to the move.

Proximity value: Euclidean distance to the last move.

15

Heuristic Value Formula

Hi combines the capture value, the pattern value, and a proximity modification:

Ci: capture value

Pi: pattern value

Dk,i: distance to the k-th last move

Coefficient of the k-th proximity term: 1.25 + k/2

Computing Pi is the time-consuming part.
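The exact weighting of these components is not reproduced on this slide, so the following is only a structural placeholder showing how a heuristic Hi might combine them, not MANGO's actual formula:

```python
def heuristic_value(capture_value, pattern_value, distances_to_last_moves):
    # Placeholder combination: capture value + pattern value + a proximity
    # modification that grows as the move gets closer to the recent moves.
    # The real coefficients (e.g. the 1.25 + k/2 term above) are not applied here.
    proximity = sum(1.0 / (1.0 + d) for d in distances_to_last_moves)
    return capture_value + pattern_value + proximity
```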

16

Time For Computing Heuristics

Computing H is around 1000 times slower than playing a move in a simulated game.

So H is computed only once per node, when T (= 30) games have been played through it.

The speed reduction is only 4%, since the number of nodes with visit count >= 30 is small compared to the total number of moves in simulated games.
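A sketch of this caching policy, assuming a hypothetical compute_heuristic routine and a per-node cache field (neither is part of the paper):

```python
T = 30

def heuristics_for(node, compute_heuristic):
    # Compute the expensive heuristic values only once per node, when T games
    # have been played through it; afterwards reuse the cached values.
    if node.visits >= T and node.heuristics is None:
        node.heuristics = {child: compute_heuristic(child) for child in node.children}
    return node.heuristics
```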

17

Domain Knowledge Calls Vs. T

18

Visit Count Vs. Number of Nodes

19

Experiments

Self-play games on a 13x13 board (10 sec per move): MANGO with progressive strategies won 91% of the 500 games against MANGO without progressive strategies.

MANGO: 20,000 simulated games per move; 1 sec on 9x9, 2 sec on 13x13, 5 sec on 19x19.

GNU Go: level 10 on 9x9 and 13x13, level 0 on 19x19.

20

MANGO Vs. GNU Go

21

MANGO Vs. GNU Go

Plain MCTS does not scale well to 13x13 or 19x19 boards.

Progressive strategies are useful on every board size.

The two progressive strategies combined are the most powerful, especially on 19x19.

22

Tournament Results

Always in the top half. But were negative results removed?

23

Conclusions and Future Work

The two progressive strategies are useful, providing a soft transition between selection and simulation. Overhead is negligible.

Future work: combine with RAVE and UCT with prior knowledge; combine with the advanced knowledge developed by Coulom; use life-and-death information; improve progressive bias.
