
    Combining Online and Offline Knowledge in UCT

    Sylvain Gelly [email protected]

    Univ. Paris Sud, LRI, CNRS, INRIA, France

    David Silver [email protected]

    University of Alberta, Edmonton, Alberta

    Abstract

    The UCT algorithm learns a value function online using sample-based search. The TD(λ) algorithm can learn a value function offline for the on-policy distribution. We consider three approaches for combining offline and online value functions in the UCT algorithm. First, the offline value function is used as a default policy during Monte-Carlo simulation. Second, the UCT value function is combined with a rapid online estimate of action values. Third, the offline value function is used as prior knowledge in the UCT search tree. We evaluate these algorithms in 9 × 9 Go against GnuGo 3.7.10. The first algorithm performs better than UCT with a random simulation policy, but surprisingly, worse than UCT with a weaker, handcrafted simulation policy. The second algorithm outperforms UCT altogether. The third algorithm outperforms UCT with handcrafted prior knowledge. We combine these algorithms in MoGo, the world's strongest 9 × 9 Go program. Each technique significantly improves MoGo's playing strength.

    1. Introduction

    Value-based reinforcement learning algorithms have achieved many notable successes. For example, variants of the TD(λ) algorithm have learned to achieve a master level of play in the games of Chess (Baxter et al., 1998), Checkers (Schaeffer et al., 2001) and Othello (Buro, 1999). In each case, the value function is approximated by a linear combination of binary features. The weights are adapted to find the relative contribution of each feature to the expected value, and the policy is improved with respect to the new value function. Using this approach, the agent learns knowledge that applies across the on-policy distribution of states.

    Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

    UCT is a sample-based search algorithm (Kocsis & Szepesvari, 2006) that has achieved remarkable success in the more challenging game of Go (Gelly et al., 2006). At every time-step the UCT algorithm builds a search tree, estimating the value function at each state by Monte-Carlo simulation. After each simulated episode, the values in the tree are updated online and the simulation policy is improved with respect to the new values. These values represent local knowledge that is highly specialised to the current state.

    In this paper, we seek to combine the advantages of both approaches. We introduce two new algorithms that combine the general knowledge accumulated by an offline reinforcement learning algorithm, with the local knowledge found online by sample-based search. We also introduce a third algorithm that combines two sources of knowledge found online by sample-based search: one of which is unbiased, and the other biased but fast to learn.

    To test our algorithms, we use the domain of 9 × 9 Go (Figure 1). The program MoGo, based on the UCT algorithm, has won the KGS Computer Go tournament at 9 × 9, 13 × 13 and 19 × 19 Go, and is the highest rated program on the Computer Go Online Server (Gelly et al., 2006; Wang & Gelly, 2007). The program RLGO, based on a linear combination of binary features and using the TD(0) algorithm, has learned the strongest known value function for 9 × 9 Go that doesn't incorporate significant prior domain knowledge (Silver et al., 2007). We investigate here whether combining the online knowledge of MoGo with the offline knowledge of RLGO can achieve better overall performance.


    2. Value-Based Reinforcement Learning

    Value-based reinforcement learning methods use a value function as an intermediate step for computing a policy. In episodic tasks the return R_t is the total reward accumulated in that episode from time t until it terminates at time T. The action value function Q^π(s, a) is the expected return after action a ∈ A is taken in state s ∈ S, using policy π to select all subsequent actions,

        R_t = \sum_{k=t+1}^{T} r_k

        Q^{\pi}(s, a) = \mathbb{E}_{\pi} \left[ R_t \mid s_t = s, a_t = a \right]

    The value function can be estimated by a variety of different algorithms, for example using the TD(λ) algorithm (Sutton, 1988). The policy can then be improved with respect to the new value function, for example using an ε-greedy policy to balance exploitation (selecting the maximum value action) with exploration (selecting a random action). A cyclic process of evaluation and policy improvement, known as policy iteration, forms the basis of value-based reinforcement learning algorithms (Sutton & Barto, 1998).
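
    The snippet below is a minimal sketch of this evaluation/improvement cycle for a tabular, episodic task, using on-policy TD(0) updates (the λ = 0 special case, without eligibility traces) and an ε-greedy policy. The environment interface (reset, actions, step) and all names are illustrative assumptions, not part of the paper.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Exploit the highest-valued action with probability 1 - epsilon, else explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def run_episode(env, Q, alpha=0.1, epsilon=0.1):
    """One episode of on-policy evaluation with TD(0) updates and epsilon-greedy
    improvement. `env` is assumed to expose reset() -> state,
    actions(state) -> list of actions, and step(action) -> (state, reward, done)."""
    s = env.reset()
    a = epsilon_greedy(Q, s, env.actions(s), epsilon)
    done = False
    while not done:
        s2, r, done = env.step(a)
        if done:
            target = r                          # no bootstrapping at terminal states
        else:
            a2 = epsilon_greedy(Q, s2, env.actions(s2), epsilon)
            target = r + Q[(s2, a2)]            # undiscounted episodic return
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        if not done:
            s, a = s2, a2

Q = defaultdict(float)                          # tabular action-value estimates
```

    The sections below apply the same evaluation/improvement idea, first with the tabular UCT search tree and then with a linear function approximator (Section 4).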

    If a model of the environment is available or can be learned, then value-based reinforcement learning algorithms can be used to perform sample-based search. Rather than learning from real experience, simulated episodes can be sampled from the model. The value function is then updated using the simulated experience. The Dyna architecture provides one example of sample-based search (Sutton, 1990).

    3. UCT

    The UCT algorithm (Kocsis & Szepesvari, 2006) is a value-based reinforcement learning algorithm that focusses exclusively on the start state and the tree of subsequent states.

    The action value function Q_UCT(s, a) is approximated by a partial tabular representation T ⊆ S × A, containing a subset of all (state, action) pairs. This can be thought of as a search tree of visited states, with the start state at the root. A distinct value is estimated for each state and action in the tree by Monte-Carlo simulation.

    The policy used by UCT is designed to balance exploration with exploitation, based on the multi-armed bandit algorithm UCB (Auer et al., 2002).

    If all actions from the current state s are represented in the tree, ∀a ∈ A(s), (s, a) ∈ T, then UCT selects the action that maximises an upper confidence bound on the action value,

        Q^{\oplus}_{UCT}(s, a) = Q_{UCT}(s, a) + c \sqrt{ \frac{\log n(s)}{n(s, a)} }

        \pi_{UCT}(s) = \operatorname{argmax}_{a} \, Q^{\oplus}_{UCT}(s, a)

    where n(s, a) counts the number of times that action a has been selected from state s, and n(s) counts the total number of visits to a state, n(s) = \sum_{a} n(s, a).

    If any action from the current state s is not represented in the tree, ∃a ∈ A(s), (s, a) ∉ T, then the uniform random policy π_random is used to select an action from the set of unrepresented actions, {a ∈ A(s) | (s, a) ∉ T}.

    At the end of an episode s_1, a_1, s_2, a_2, ..., s_T, each state-action pair in the search tree, (s_t, a_t) ∈ T, is updated using the return from that episode,

        n(s_t, a_t) \leftarrow n(s_t, a_t) + 1    (1)

        Q_{UCT}(s_t, a_t) \leftarrow Q_{UCT}(s_t, a_t) + \frac{1}{n(s_t, a_t)} \left[ R_t - Q_{UCT}(s_t, a_t) \right]    (2)

    New states and actions from that episode, (s_t, a_t) ∉ T, are then added to the tree, with initial value Q_UCT(s_t, a_t) = R_t and n(s_t, a_t) = 1. In some cases, it is more memory efficient to only add the first visited state and action such that (s_t, a_t) ∉ T (Coulom, 2006; Gelly et al., 2006). This procedure builds up a search tree containing n nodes after n episodes of experience.
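
    A minimal sketch of the selection and backup rules above, in illustrative Python. The node representation, tie-breaking and the value of the exploration constant c are assumptions; only the UCB selection rule, the uniform-random choice among untried actions, and the incremental-mean update of equations (1) and (2) follow the text.

```python
import math
import random

class UCTNode:
    """Per-state statistics: visit counts n(s, a) and value estimates Q_UCT(s, a)."""
    def __init__(self):
        self.n = {}   # n(s, a)
        self.q = {}   # Q_UCT(s, a)

def uct_select(node, legal_actions, c=1.0):
    """UCB selection when every legal action is represented; otherwise choose
    uniformly at random among the unrepresented actions."""
    untried = [a for a in legal_actions if a not in node.n]
    if untried:
        return random.choice(untried)
    n_s = sum(node.n.values())                       # n(s) = sum_a n(s, a)
    return max(legal_actions,
               key=lambda a: node.q[a] + c * math.sqrt(math.log(n_s) / node.n[a]))

def uct_backup(tree, visited, R):
    """Update every (state, action) of the episode already in the tree with the
    incremental-mean rule of equations (1) and (2). In Go the only reward is the
    terminal game outcome, so the return R_t is the same value R for every step."""
    for s, a in visited:
        node = tree.get(s)
        if node is None:
            continue
        node.n[a] = node.n.get(a, 0) + 1             # equation (1)
        q = node.q.get(a, 0.0)
        node.q[a] = q + (R - q) / node.n[a]          # equation (2)
```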

    The UCT policy can be thought of in two stages. In the beginning of each episode it selects actions according to knowledge contained within the search tree. But once it leaves the scope of its search tree it has no knowledge and behaves randomly. Thus each state in the tree estimates its value by Monte-Carlo simulation. As more information propagates up the tree, the policy improves, and the Monte-Carlo estimates are based on more accurate returns.

    If a model is available, then UCT can be used as a sample-based search algorithm. Episodes are sampled from the model, starting from the actual state s. A new representation T(s) is constructed at every actual time-step, using simulated experience. Typically, thousands of episodes can be simulated at each step, so that the value function contains a detailed search tree for the current state s.


    In a two-player game, the opponent can be modelled using the agent's own policy, and episodes simulated by self-play. UCT is used to maximise the upper confidence bound on the agent's value and to minimise the lower confidence bound on the opponent's. Under certain assumptions about non-stationarity, UCT converges on the minimax value (Kocsis & Szepesvari, 2006). However, unlike other minimax search algorithms such as alpha-beta search, UCT requires no prior domain knowledge to evaluate states or order moves. Furthermore, the UCT search tree is non-uniform and favours the most promising lines. These properties make UCT ideally suited to the game of Go, which has a large state space and branching factor, and for which no strong evaluation functions are known.

    4. Linear Value Function Approximation

    UCT is a tabular method and does not generalise between states. Other value-based reinforcement learning methods offer a rich variety of function approximation methods for state abstraction in complex domains (Schraudolph et al., 1994; Enzenberger, 2003; Sutton, 1996). We consider here a simple approach to function approximation that requires minimal prior domain knowledge (Silver et al., 2007), and which has proven successful in many other domains (Baxter et al., 1998; Schaeffer et al., 2001; Buro, 1999).

    We wish to estimate a simple reward function: r = 1 if the agent wins the game and r = 0 otherwise. The value function is approximated by a linear combination of binary features φ with weights θ,

        Q_{RLGO}(s, a) = \sigma \left( \phi(s, a)^{T} \theta \right)

    where the sigmoid squashing function σ maps the value function to an estimated probability of winning the game. After each time-step, weights are updated using the TD(0) algorithm (Sutton, 1988). Because the value function is a probability, we modify the loss function so as to minimise the cross entropy between the current value and the subsequent value,

        \delta = r_{t+1} + Q_{RLGO}(s_{t+1}, a_{t+1}) - Q_{RLGO}(s_t, a_t)

        \Delta\theta_i = \frac{\alpha \, \delta}{|\phi(s_t, a_t)|} \, \phi_i(s_t, a_t)

    where δ is the TD-error and α is a step-size parameter.
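
    A minimal sketch of this update, assuming dense NumPy vectors for the binary feature vector φ(s, a) and the weights θ; the handling of terminal positions and the normalising factor |φ(s_t, a_t)| (taken here as the count of active features) are assumptions.

```python
import numpy as np

def q_rlgo(phi_sa, theta):
    """Q_RLGO(s, a) = sigma(phi(s, a)^T theta): an estimated probability of winning."""
    return 1.0 / (1.0 + np.exp(-phi_sa.dot(theta)))

def td0_update(theta, phi_t, phi_t1, r_t1, alpha, terminal=False):
    """One TD(0) step on the logistic value function, with the step size divided
    by the number of active binary features, as in the update rule above."""
    q_t = q_rlgo(phi_t, theta)
    q_t1 = 0.0 if terminal else q_rlgo(phi_t1, theta)
    delta = r_t1 + q_t1 - q_t                    # TD error
    active = phi_t.sum()                         # |phi(s_t, a_t)|
    if active > 0:
        theta += (alpha * delta / active) * phi_t
    return theta
```

    With a cross-entropy loss on a logistic output, the sigmoid derivative cancels in the gradient, which is consistent with the update being simply the TD-error times the (normalised) feature vector.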

    Figure 1. (Left) An example position from a game of 9 × 9 Go. Black and White take turns to place down stones. Stones can never move, but may be captured if completely surrounded. The player to surround most territory wins the game. (Right) Shapes are an important part of Go strategy. The figure shows (clockwise from top-left) 1 × 1, 2 × 1, 2 × 2 and 3 × 3 local shape features which occur in the example position.

    In the game of Go, the notion of shape has strategic importance. For this reason we use binary features φ(s, a) that recognise local patterns of stones (Silver et al., 2007). Each local shape feature matches a specific configuration of stones and empty intersections within a particular rectangle on the board (Figure 1). Local shape features are created for all configurations, at all positions on the board, from 1 × 1 up to 3 × 3. Two sets of weights are used: in the first set, weights are shared between all local shape features that are rotationally or reflectionally symmetric. In the second set, weights are also shared between all local shape features that are translations of the same pattern.
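
    As an illustration, one way such features can be enumerated is to encode each local window as a base-3 integer over {empty, black, white}. The square-window restriction, the edge handling and all names below are simplifying assumptions (the paper also shows rectangular shapes such as 2 × 1, and the symmetry- and translation-based weight sharing described above is omitted here).

```python
EMPTY, BLACK, WHITE = 0, 1, 2
BOARD_SIZE = 9

def shape_index(board, row, col, size):
    """Encode the size x size window anchored at (row, col) as a base-3 integer.
    `board` is a BOARD_SIZE x BOARD_SIZE grid over {EMPTY, BLACK, WHITE}; windows
    overlapping the edge are skipped here for simplicity."""
    if row + size > BOARD_SIZE or col + size > BOARD_SIZE:
        return None
    index = 0
    for dr in range(size):
        for dc in range(size):
            index = 3 * index + board[row + dr][col + dc]
    return index

def active_shape_features(board):
    """One binary feature per (window size, anchor, local configuration): the
    features returned here are exactly those 'switched on' in this position."""
    features = []
    for size in (1, 2, 3):
        for row in range(BOARD_SIZE):
            for col in range(BOARD_SIZE):
                idx = shape_index(board, row, col, size)
                if idx is not None:
                    features.append((size, row, col, idx))
    return features
```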

    During training, two versions of the same agent play against each other, both using an ε-greedy policy. Each game is started from the empty board and played through to completion, so that the loss is minimised for the on-policy distribution of states. Thus, the value function approximation learns the relative contribution of each local shape feature to winning, across the full distribution of positions encountered during self-play.

    5. UCT with a Default Policy

    The UCT algorithm has no knowledge beyond the scope of its search tree. If it encounters a state that is not represented in the tree, it behaves randomly. Gelly et al. (Gelly et al., 2006) combined UCT with a default policy that is used to complete episodes once UCT leaves the search tree. We denote the original UCT algorithm by UCT(π_random) and this extended algorithm by UCT(π) for a default policy π.

    The world's strongest 9 × 9 Computer Go program MoGo uses UCT with a handcrafted default policy π_MoGo (Gelly et al., 2006). However, in many domains it is difficult to construct a good default policy. Even when expert knowledge is available, it may be hard to interpret and encode. Furthermore, the default policy must be fast, so that many episodes can be simulated, and a large search tree built. We would like a method for learning a high performance default policy with minimal domain knowledge.

    A linear combination of binary features provides one such method. Binary features are fast to evaluate and can be updated incrementally. The representation learned by this approach is known to perform well in many domains (Baxter et al., 1998; Schaeffer et al., 2001). Minimal domain knowledge is necessary to specify the features (Silver et al., 2007; Buro, 1999).

    We learn a value function Q_RLGO for the domain of 9 × 9 Go using a linear combination of binary features (see Section 4). It is learned offline from games of self-play. We use this value function to generate a default policy for UCT. As Monte-Carlo simulation works best with a stochastic default policy, we consider three different approaches for generating a stochastic policy from the value function Q_RLGO.

    First, we consider an ε-greedy policy π_ε,

        \pi_{\epsilon}(s, a) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|A(s)|} & \text{if } a = \operatorname{argmax}_{b} Q_{RLGO}(s, b) \\ \frac{\epsilon}{|A(s)|} & \text{otherwise} \end{cases}

    Second, we consider a greedy policy π_σ over a noisy value function, corrupted by Gaussian noise η(s, a) ~ N(0, σ²),

        \pi_{\sigma}(s, a) = \begin{cases} 1 & \text{if } a = \operatorname{argmax}_{b} \left[ Q_{RLGO}(s, b) + \eta(s, b) \right] \\ 0 & \text{otherwise} \end{cases}

    Third, we select moves using a softmax distribution π_τ with temperature parameter τ,

        \pi_{\tau}(s, a) = \frac{ e^{Q_{RLGO}(s, a) / \tau} }{ \sum_{b} e^{Q_{RLGO}(s, b) / \tau} }
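
    A minimal sketch of the three randomised policies, assuming `q` is a callable Q_RLGO(state, action) and `actions` holds the legal moves; these names and the sampling details are illustrative.

```python
import math
import random

def pi_epsilon(q, state, actions, epsilon):
    """epsilon-greedy: greedy on Q_RLGO with probability 1 - epsilon, else uniform."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q(state, a))

def pi_sigma(q, state, actions, sigma):
    """Greedy on a value function corrupted by Gaussian noise N(0, sigma^2)."""
    return max(actions, key=lambda a: q(state, a) + random.gauss(0.0, sigma))

def pi_tau(q, state, actions, tau):
    """Softmax (Gibbs) selection with temperature tau."""
    weights = [math.exp(q(state, a) / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```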

    We compared the performance of each class of default policy π_random, π_MoGo, π_ε, π_σ, and π_τ. Figure 2 assesses the relative strength of each default policy (as a Go player), in a round-robin tournament of 6000 games between each pair of policies. With little or no randomisation, the policies based on Q_RLGO outperform both the random policy π_random and MoGo's handcrafted policy π_MoGo by a margin of over 90%. As the level of randomisation increases, the policies degenerate towards the random policy π_random.

    Figure 2. The relative strengths of each class of default policy, against the random policy π_random (left) and against a handcrafted policy π_MoGo (right). The x axis represents the degree of randomness in each policy.

    Next, we compare the accuracy of each default policy for Monte-Carlo simulation, on a test suite of 200 hand-labelled positions. 1000 episodes of self-play were played from each test position using each policy. We measured the MSE between the average return (i.e. the Monte-Carlo estimate) and the hand-labelled value (see Figure 3). In general, a good policy π for UCT(π) must be able to evaluate accurately in Monte-Carlo simulation. In our experiments with MoGo, the MSE appears to have a close relationship with playing strength.

    The MSE improves from uniform random simulations when a stronger and appropriately randomised default policy is used. If the default policy is too deterministic, then Monte-Carlo simulation fails to provide any benefits and the performance of UCT(π) drops dramatically. If the default policy is too random, then it becomes equivalent to the random policy π_random.

    Figure 3. The MSE of each policy when Monte-Carlo simulation is used to evaluate a test suite of 200 hand-labelled positions. The x axis indicates the degree of randomness in the policy.

    Intuitively, one might expect that a stronger, appropriately randomised policy would outperform a weaker policy during Monte-Carlo simulation. However, the accuracy of π_ε, π_σ and π_τ never comes close to the accuracy of the handcrafted policy π_MoGo, despite the large improvement in playing strengths for these default policies. To verify that the default policies based on Q_RLGO are indeed stronger in our particular suite of test positions, we reran the round-robin tournament, starting from each of these positions in turn, and found that the relative strengths of the default policies remain similar. We also compared the performance of the complete UCT algorithm, using the best default policy based on Q_RLGO and the parameter value minimising MSE (see Table 1). This experiment confirms that the MSE results apply in actual play.

    It is surprising that an objectively stronger default policy does not lead to better performance in UCT. Furthermore, because this result only requires Monte-Carlo simulation, it has implications for other sample-based search algorithms. It appears that the nature of the simulation policy may be as important as, or more important than, its objective performance. Each policy has its own bias, leading it to a particular distribution of episodes during Monte-Carlo simulation. If the distribution is skewed towards an objectively unlikely outcome, then the predictive accuracy of the search algorithm may be impaired.

    Algorithm           Wins vs. GnuGo
    UCT(π_random)       8.88 ± 0.42 %
    UCT(π_σ)            9.38 ± 1.9 %
    UCT(π_MoGo)         48.62 ± 1.1 %

    Table 1. Winning rate of the UCT algorithm against GnuGo 3.7.10 (level 0), given 5000 simulations per move, using different default policies. The numbers after the ± correspond to the standard error from several thousand complete games. π_σ is used with σ = 0.15.

    6. Rapid Action Value Estimation

    The UCT algorithm must sample every action from a state s ∈ T before it has a basis on which to compare values. Furthermore, to produce a low-variance estimate of the value, each action in state s must be sampled multiple times. When the action space is large, this can cause slow learning. To solve this problem, we introduce a new algorithm UCT_RAVE, which forms a rapid action value estimate for action a in state s, and combines this online knowledge into UCT.

    Normally, Monte-Carlo methods estimate the value by averaging the return of all episodes in which a is selected immediately. Instead, we average the returns of all episodes in which a is selected at any subsequent time. In the domain of Computer Go, this idea is known as the all-moves-as-first heuristic (Bruegmann, 1993). However, the same idea can be applied in any domain where action sequences can be transposed.

    Let Q_RAVE(s, a) be the rapid value estimate for action a in state s. After each episode s_1, a_1, s_2, a_2, ..., s_T, the action values are updated for every state s_{t_1} ∈ T and every subsequent action a_{t_2} such that a_{t_2} ∈ A(s_{t_1}), t_1 ≤ t_2 and ∀t < t_2, a_t ≠ a_{t_2},

        m(s_{t_1}, a_{t_2}) \leftarrow m(s_{t_1}, a_{t_2}) + 1

        Q_{RAVE}(s_{t_1}, a_{t_2}) \leftarrow Q_{RAVE}(s_{t_1}, a_{t_2}) + \frac{1}{m(s_{t_1}, a_{t_2})} \left[ R_{t_1} - Q_{RAVE}(s_{t_1}, a_{t_2}) \right]

    where m(s, a) counts the number of times that action a has been selected at any time following state s.
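
    The backup below is a minimal sketch of this all-moves-as-first update. The tree node is assumed to carry dictionaries m and q_rave alongside the UCT statistics (extending the node sketched in Section 3); player-to-move bookkeeping, needed in two-player self-play, is omitted for brevity.

```python
def rave_backup(states, actions, returns, tree):
    """For each tree state s_{t1}, credit every action a_{t2} with t2 >= t1 that was
    not played at any earlier step t < t2, using the return R_{t1}."""
    first_play = {}                       # first time-step at which each action occurs
    for t, a in enumerate(actions):
        first_play.setdefault(a, t)
    for t1 in range(len(actions)):
        node = tree.get(states[t1])
        if node is None:                  # only states already in the tree are updated
            continue
        for t2 in range(t1, len(actions)):
            a = actions[t2]
            if first_play[a] != t2:       # a was already played before t2: skip
                continue
            node.m[a] = node.m.get(a, 0) + 1
            q = node.q_rave.get(a, 0.0)
            node.q_rave[a] = q + (returns[t1] - q) / node.m[a]
```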

    The rapid action value estimate can quickly learn a low-variance value for each action. However, it may introduce some bias, as the value of an action usually depends on the exact state in which it is selected. Hence we would like to use the rapid estimate initially, but use the original UCT estimate in the limit. To achieve this, we use a linear combination of the two estimates, with a decaying weight β,

        Q^{\oplus}_{RAVE}(s, a) = Q_{RAVE}(s, a) + c \sqrt{ \frac{\log m(s)}{m(s, a)} }

        \beta(s, a) = \sqrt{ \frac{k}{3 n(s) + k} }

        Q^{\oplus}_{UR}(s, a) = \beta(s, a) \, Q^{\oplus}_{RAVE}(s, a) + (1 - \beta(s, a)) \, Q^{\oplus}_{UCT}(s, a)

        \pi_{UR}(s) = \operatorname{argmax}_{a} \, Q^{\oplus}_{UR}(s, a)

    where m(s) = \sum_{a} m(s, a). The equivalence parameter k controls the number of episodes of experience when both estimates are given equal weight.
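
    A minimal sketch of this blended selection rule, reusing the node statistics assumed above; the exploration constant c, the default k, and the handling of unvisited actions are assumptions (unrepresented actions would be handled as in Section 3).

```python
import math

def blended_value(node, a, c=1.0, k=1000):
    """Q_UR: mix the RAVE and UCT upper-confidence values with
    beta(s, a) = sqrt(k / (3 n(s) + k))."""
    n_s = sum(node.n.values())                     # n(s)
    m_s = sum(node.m.values())                     # m(s)
    q_uct = node.q[a] + c * math.sqrt(math.log(n_s) / node.n[a])
    q_rave = node.q_rave[a] + c * math.sqrt(math.log(m_s) / node.m[a])
    beta = math.sqrt(k / (3.0 * n_s + k))
    return beta * q_rave + (1.0 - beta) * q_uct

def ur_select(node, legal_actions, c=1.0, k=1000):
    """pi_UR(s): choose the action maximising the blended upper-confidence value."""
    return max(legal_actions, key=lambda a: blended_value(node, a, c, k))
```

    With this schedule, β(s, a) = 1/2 exactly when n(s) = k, which matches the description of k as the amount of experience at which both estimates receive equal weight.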

    We tested the new algorithm UCT_RAVE(π_MoGo), using the default policy π_MoGo, for different settings of the equivalence parameter k. For each setting, we played a 2300 game match against GnuGo 3.7.10 (level 8). The results are shown in Figure 4, and compared to the UCT(π_MoGo) algorithm with 3000 simulations per move. The winning rate using UCT_RAVE varies between 50% and 60%, compared to 24% without rapid estimation. Maximum performance is achieved with an equivalence parameter of 1000 or more. This indicates that the rapid action value estimate is worth about the same as several thousand episodes of UCT simulation.

    Figure 4. Winning rate of UCT_RAVE(π_MoGo) with 3000 simulations per move against GnuGo 3.7.10 (level 8), for different settings of the equivalence parameter k. The bars indicate the standard error. Each point of the plot is an average over 2300 complete games.

    7. UCT with Prior Knowledge

    The UCT algorithm estimates the value of each state by Monte-Carlo simulation. However, in many cases we have prior knowledge about the likely value of a state. We introduce a simple method to utilise offline knowledge, which increases the learning rate of UCT without biasing its asymptotic value estimates.

    We modify UCT to incorporate an existing value function Q_prior(s, a). When a new state and action (s, a) is added to the UCT representation T, we initialise its value according to our prior knowledge,

        n(s, a) \leftarrow n_{prior}(s, a)

        Q_{UCT}(s, a) \leftarrow Q_{prior}(s, a)

    The number n_prior estimates the equivalent experience contained in the prior value function. This indicates the number of episodes that UCT would require to achieve an estimate of similar accuracy. After initialisation, the value function is updated using the normal UCT update (see equations 1 and 2). We denote the new UCT algorithm using default policy π by UCT(π, Q_prior).

    A similar modification can be made to the UCT_RAVE algorithm, by initialising the rapid estimates according to prior knowledge,

        m(s, a) \leftarrow m_{prior}(s, a)

        Q_{RAVE}(s, a) \leftarrow Q_{prior}(s, a)
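
    A minimal sketch of this initialisation, reusing the node representation assumed earlier; `q_prior` stands for any prior value function Q_prior(s, a), and the per-action equivalent-experience constants are illustrative.

```python
def add_node_with_prior(tree, s, legal_actions, q_prior, n_prior=50, m_prior=50):
    """Initialise a newly added state from prior knowledge, then let the normal
    UCT and RAVE updates take over."""
    node = UCTNode()                      # node class from the Section 3 sketch,
    node.m, node.q_rave = {}, {}          # extended here with RAVE statistics
    for a in legal_actions:
        node.n[a] = n_prior               # n(s, a)      <- n_prior(s, a)
        node.q[a] = q_prior(s, a)         # Q_UCT(s, a)  <- Q_prior(s, a)
        node.m[a] = m_prior               # m(s, a)      <- m_prior(s, a)
        node.q_rave[a] = q_prior(s, a)    # Q_RAVE(s, a) <- Q_prior(s, a)
    tree[s] = node
    return node
```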

    We compare several methods for generating prior knowledge in 9 × 9 Go. First, we use an even-game heuristic, Q_even(s, a) = 0.5, to indicate that most positions encountered on-policy are likely to be close. Second, we use a grandfather heuristic, Q_grand(s_t, a) = Q_UCT(s_{t-2}, a), to indicate that the value with player P to play is usually similar to the value of the last state with P to play. Third, we use a handcrafted heuristic Q_MoGo(s, a). This heuristic was designed such that greedy action selection would produce the best known default policy π_MoGo(s, a). Finally, we use the linear combination of binary features, Q_RLGO(s, a), learned offline by TD(0) (see Section 4).

    For each source of prior knowledge, we assign an equivalent experience m_prior(s, a) = M_eq, for various constant values of M_eq. We played 2300 games between UCT_RAVE(π_MoGo, Q_prior) and GnuGo 3.7.10 (level 8), alternating colours between each game. The UCT algorithms sampled 3000 episodes of experience at each move (see Figure 5), rather than a fixed time per move. In fact the algorithms have comparable execution time (Table 4).

    The value function learned offline, Q_RLGO, outperforms all the other heuristics and increases the winning rate of the UCT_RAVE algorithm from 60% to 69%.

    Figure 5. Winning rate of UCT_RAVE(π_MoGo) with 3000 simulations per move against GnuGo 3.7.10 (level 8), using different prior knowledge as initialisation. The bars indicate the standard error. Each point of the plot is an average over 2300 complete games.

    Algorithm                        Wins vs. GnuGo
    UCT(π_random)                    1.84 ± 0.22 %
    UCT(π_MoGo)                      23.88 ± 0.85 %
    UCT_RAVE(π_MoGo)                 60.47 ± 0.79 %
    UCT_RAVE(π_MoGo, Q_RLGO)         69 ± 0.91 %

    Table 2. Winning rate of the different UCT algorithms against GnuGo 3.7.10 (level 8), given 3000 simulations per move. The numbers after the ± correspond to the standard error from several thousand complete games.

    Maximum performance is achieved using an equivalent experience of M_eq = 50, which indicates that Q_RLGO is worth about as much as 50 episodes of UCT_RAVE simulation. It seems likely that these results could be further improved by varying the equivalent experience according to the variance of the prior value estimate.

    8. Conclusion

    The UCT algorithm can be significantly improved by three different techniques for combining online and offline learning. First, a default policy can be used to complete episodes beyond the UCT search tree. Second, a rapid action value estimate can be used to boost initial learning. Third, prior knowledge can be used to initialise the value function within the UCT tree.

    We applied these three ideas to the Computer Go program MoGo, and benchmarked its performance against GnuGo 3.7.10 (level 8), one of the strongest 9 × 9 programs that isn't based on the UCT algorithm. Each new technique increases the winning rate significantly from the previous algorithms, from just 2% for the basic UCT algorithm up to 69% using all three techniques. Table 2 summarises these improvements, given the best parameter settings for each algorithm, and 3000 simulations per move. Table 4 indicates the CPU requirements of each algorithm.

    These results are based on executing just 3000 simulations per move. When the number of simulations increases, the overall performance of MoGo improves correspondingly. For example, using the combined algorithm UCT_RAVE(π_MoGo, Q_MoGo), the winning rate increases to 92% with more simulations per move. This version of MoGo has achieved an Elo rating of 2320 on the Computer Go Online Server, around 500 points higher than any program not based on UCT, and over 200 points higher than other UCT-based programs¹ (see Table 3).

    ¹ The Elo rating system is a statistical measure of playing strength, where a difference of 200 points indicates an expected winning rate of approximately 75% for the higher rated player.
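
    For reference, under the standard logistic Elo model (an assumption here; the paper does not give the formula), the expected score for a rating difference ΔR is

        E = \frac{1}{1 + 10^{-\Delta R / 400}}, \qquad E \big|_{\Delta R = 200} = \frac{1}{1 + 10^{-0.5}} \approx 0.76

    which is consistent with the "approximately 75%" quoted in the footnote.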

    Simulations     Wins vs. GnuGo     CGOS rating
    3000            69%                1960
    10000           82%                2110
    70000*          92%                2320

    Table 3. Winning rate of UCT_RAVE(π_MoGo, Q_MoGo) against GnuGo 3.7.10 (level 8) when the number of simulations per move is increased. The asterisked version used on CGOS modifies the simulations/move according to the available time, from 300000 games in the opening to 20000.

    Algorithm                        Speed
    UCT(π_random)                    6000 g/s
    UCT(π_MoGo)                      4300 g/s
    UCT(π_ε), UCT(π_σ), UCT(π_τ)     150 g/s
    UCT_RAVE(π_MoGo)                 4300 g/s
    UCT_RAVE(π_MoGo, Q_MoGo)         4200 g/s
    UCT_RAVE(π_MoGo, Q_RLGO)         3600 g/s

    Table 4. Number of simulations per second for each algorithm on a P4 3.4GHz, at the start of the game. UCT(π_random) is faster but much weaker, even with the same time per move. Apart from UCT(π_RLGO), all the other algorithms have a comparable execution speed.

    The domain knowledge required by this version of MoGo is contained in the default policy π_MoGo and in the prior value function Q_MoGo. By learning a value function offline using TD(0), we have demonstrated how this requirement for domain knowledge can be removed altogether. The learned value function outperforms the heuristic value function when used as prior knowledge within the tree. Surprisingly, despite its objective superiority, the learned value function performs worse than the handcrafted heuristic when used as a default policy. Understanding why certain policies perform better than others in Monte-Carlo simulations may be a key component of any future advances.

    We have focussed on the domain of 9 × 9 Go. It is likely that larger domains, such as 13 × 13 and 19 × 19 Go, will increase the importance of the three algorithms. With larger branching factors it becomes increasingly important to simulate accurately, accelerate initial learning, and incorporate prior knowledge. When these ideas are incorporated, the UCT algorithm may prove to be successful in many other challenging domains.

    References

    Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47, 235-256.

    Baxter, J., Tridgell, A., & Weaver, L. (1998). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal, 21, 84-99.

    Bruegmann, B. (1993). Monte-Carlo Go. http://www.cgl.ucsf.edu/go/Programs/Gobble.html.

    Buro, M. (1999). From simple features to sophisticated evaluation functions. 1st International Conference on Computers and Games (pp. 126-145).

    Coulom, R. (2006). Efficient selectivity and backup operators in Monte-Carlo tree search. 5th International Conference on Computers and Games, 2006-05-29, Turin, Italy.

    Enzenberger, M. (2003). Evaluation in Go by a neural network using soft segmentation. 10th Advances in Computer Games Conference (pp. 97-108).

    Gelly, S., Wang, Y., Munos, R., & Teytaud, O. (2006). Modification of UCT with patterns in Monte-Carlo Go (Technical Report 6062). INRIA.

    Kocsis, L., & Szepesvari, C. (2006). Bandit based Monte-Carlo planning. 15th European Conference on Machine Learning (pp. 282-293).

    Schaeffer, J., Hlynka, M., & Jussila, V. (2001). Temporal difference learning applied to a high-performance game-playing program. 17th International Joint Conference on Artificial Intelligence (pp. 529-534).

    Schraudolph, N., Dayan, P., & Sejnowski, T. (1994). Temporal difference learning of position evaluation in the game of Go. Advances in Neural Information Processing Systems 6 (pp. 817-824). San Francisco: Morgan Kaufmann.

    Silver, D., Sutton, R., & Muller, M. (2007). Reinforcement learning of local shape in the game of Go. 20th International Joint Conference on Artificial Intelligence (pp. 1053-1058).

    Sutton, R. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9-44.

    Sutton, R. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. 7th International Conference on Machine Learning (pp. 216-224).

    Sutton, R. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems 8 (pp. 1038-1044).

    Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

    Wang, Y., & Gelly, S. (2007). Modifications of UCT and sequence-like simulations for Monte-Carlo Go. IEEE Symposium on Computational Intelligence and Games, Honolulu, Hawaii (pp. 175-182).