

Research Article
Hybrid Online and Offline Reinforcement Learning for Tibetan Jiu Chess
Hindawi Complexity, Volume 2020, Article ID 4708075, 11 pages, https://doi.org/10.1155/2020/4708075

Xiali Li, Zhengyu Lv, Licheng Wu, Yue Zhao, and Xiaona Xu

School of Information and Engineering, Minzu University of China, Beijing 100081, China

Correspondence should be addressed to Xiali Li; xiaer_li@163.com

Received 13 February 2020; Revised 3 April 2020; Accepted 10 April 2020; Published 11 May 2020

Academic Editor: Ning Cai

Copyright © 2020 Xiali Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In this study, hybrid state-action-reward-state-action (SARSA(λ)) and Q-learning algorithms are applied to different stages of an upper confidence bound applied to tree search for Tibetan Jiu chess. Q-learning is also used to update all the nodes on the search path when each game ends. A learning strategy that uses SARSA(λ) and Q-learning algorithms combined with domain knowledge for a feedback function in the layout and battle stages is proposed. An improved deep neural network based on ResNet18 is used for self-play training. Experimental results show that hybrid online and offline reinforcement learning with a deep neural network can improve the game program's learning efficiency and understanding ability for Tibetan Jiu chess.

1. Introduction

Compared with Go, chess, Shogi, and other games in which deep neural networks and reinforcement learning have reached top human level, research on Tibetan Jiu chess is still at an early stage. The playing strength of current Jiu chess programs is low; none has defeated even beginner human players. Standard Jiu game data are also scarce: only about 300 complete Jiu game records have been obtained so far. A Jiu game has two sequential stages, layout and battle, and the battle stage can begin only after all blank intersections have been filled alternately in the layout stage, which makes the search path for upper confidence bound applied to trees (UCT) search very long. Considering the useless steps the algorithm may take during exploration, the search path is much longer than that of chess or Go. How to improve the efficiency of a Jiu chess program in self-play learning under these special rules and under the limitations of laboratory hardware is the motivation of this study.

In this study, the state-action-reward-state-action (SARSA(λ)) and Q-learning algorithms are innovatively used in different stages of UCT search for Jiu chess. In the selection, expansion, and simulation stages of the UCT search, the SARSA(λ) algorithm is used to update the quality of each action; in the backpropagation stage, the Q-learning algorithm is used. The quality of each action is denoted by its Q-value. The Q-learning algorithm is also applied globally after the end of each game. A probability distribution function based on a two-dimensional (2D) normal distribution matrix is used as feedback to SARSA(λ) and Q-learning in the layout stage, and a feedback function for the temporal difference (TD) algorithm based on important shapes [1, 2] is constructed for the battle stage. The ResNet18 structure is adapted for the deep neural network training, considering its low error rate and good performance in image classification [3]. The contributions of this study are outlined in the following:

(1) A hybrid deep reinforcement learning algorithm is applied to the Jiu chess game for the first time. The proposed strategy prevents the many worthless or low-value nodes generated by deep learning from wasting computing resources. Using the SARSA(λ) and Q-learning algorithms in different stages of UCT search helps the program learn good moves faster and improves learning efficiency; in addition, the final result can be fed back to all steps to improve the accuracy of the return estimate.


(2) Constructing a probability distribution function based on a two-dimensional (2D) normal distribution matrix gives the reinforcement learning model the ability to prioritize learning tactics in the center of the board during the layout stage.

(3) The proposed improved ResNet18-based network, with a very small parameter scale, achieves a good learning effect, reduces the use of computing power, and significantly shortens playing time when trained on a common workstation or graphics processing unit (GPU) server in a short time. The experimental results demonstrate this.

Section 2 of the paper introduces the rules and difficulties of Tibetan Jiu chess, and Section 3 reviews related work on Tibetan Jiu chess. Section 4 discusses the hybrid SARSA(λ) and Q-learning algorithm applied to different stages of the UCT search and the improved ResNet18-based deep neural network structure. Section 5 details the experiment and data analysis. Finally, we outline the conclusions and propose future work.

2. Rules and Difficulties in Tibetan Jiu Chess

There are two kinds of Tibetan chess, mi mang and Jiu, which are widely played in Sichuan, Gansu, Tibet, Qinghai, Yunnan, and other Tibetan areas in China [4]. The process of playing can be divided into two successive stages, layout and fighting. In the layout stage, the central area is occupied first in order to construct the dominant formations (square, standard, etc.) for the fighting stage; hence, the quality of the layout has an important impact on the final victory. In the fighting stage, actions include moving pieces, jumping capture, and square capture. The key to victory is to take the lead in constructing a girdling formation, and the necessary condition for constructing this formation is to construct a square.

2.1. Tibetan Jiu Chess Rules. The public Jiu board has 14 × 14 points. The Jiu game process is divided into two sequential stages: layout and battle. Jiu chess is a two-player game: the white side plays with white stones, and the black side uses black stones. Players alternate turns. Stones must be added or moved to empty points on the game board. A Jiu game starts with an empty board. The task in the layout stage is to place one stone on a point at each move until there are no empty points on the board. The goal in the battle stage is moving or capturing stones until one player wins the game. The movements and capturing methods are similar to those of international checkers [1, 2].

2.1.1. Layout Stage. In the layout stage, white plays first, followed by black. The first and second moves must be placed on points of the diagonal lines of the central grids. Then each side alternately places one stone on a point until no empty points remain on the board. After the board is filled, the game turns into the battle stage.

2.1.2. Battle Stage. In the battle stage, there are three actions to select from in each turn:

(1) Move. Usually, a player moves a stone to the up, down, left, or right adjacent empty point (see Figure 1(a)). There is one exception: if a player has no more than 14 stones left, he can move a stone to any empty cross point he wants. This exception is shown in Figure 1(c). Since the black side has no more than 14 stones left, the diamond-marked black stone moves to the point the arrow indicates; after this move, a square is constructed. The point of the diamond-marked black stone is the beginning of the move, and the point of the black stone indicated by the arrow is the ending. In this case, the black side can capture any one stone of its opponent; the dimmed diamond-marked white stone is taken away by the black side after this move.

(2) Jumping Capture. When the opponent's stone is adjacent to the player's stone and there is an empty point directly behind it, the player can perform a jumping capture. This action can be continued until the player cannot capture any more stones or the player's turn ends (see the sketch after this list). As shown in Figure 1(b), the diamond-marked white stone is placed on the final empty point indicated by the arrows after four continuous jumps. This continuous jump begins from the point of the diamond-marked white stone and ends at the point of the white stone indicated by the arrow. The white side captures all four black stones on the path indicated by the arrows.

(3) Square Capture. In one turn, if a player constructs a square with four adjacent stones, he may capture one of his opponent's stones located at any point on the board; this is called square capture. There are three important shapes associated with square capture, called gate, square, and dalian (or chain). The gate is the basic shape, shown in (A) in Figure 1(d). The square, one of the important shapes, is shown in (B) in Figure 1(d). The dalian, or chain, is the most important Jiu shape, shown in (C) in Figure 1(d); this shape is critical to winning the game. It is composed of seven stones of the same color and one empty point. The stone adjacent to the empty point is called the vital stone and is marked by the circle. By moving this vital stone to either of the two empty points, a player can construct a square and then capture one stone of the opponent anywhere on the board in that turn. A player can capture his opponent's stones repeatedly by moving the vital stone in different turns.
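To make the jumping-capture rule concrete, the following is a minimal sketch of single-jump generation for one stone; it assumes the board is stored as a 14 × 14 NumPy array with 1 for black, −1 for white, and 0 for empty, and the function and variable names are illustrative rather than taken from any of the Jiu chess programs discussed in this paper.

```python
import numpy as np

SIZE = 14
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def jump_captures(board: np.ndarray, x: int, y: int):
    """Yield (landing_point, captured_point) pairs for one jump of the stone
    at (x, y): an adjacent opponent stone with an empty point directly behind
    it can be jumped over and captured; chained jumps repeat this from the
    landing point until no capture is available or the turn ends."""
    color = board[x, y]
    for dx, dy in DIRECTIONS:
        mx, my = x + dx, y + dy          # adjacent point, possibly an opponent stone
        lx, ly = x + 2 * dx, y + 2 * dy  # point directly behind it
        if 0 <= lx < SIZE and 0 <= ly < SIZE:
            if board[mx, my] == -color and board[lx, ly] == 0:
                yield (lx, ly), (mx, my)
```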

2.1.3. Winning the Game. To win, a player must meet one of the following conditions:

(1) He has at least one special shape, such as a chain, while his opponent does not have any gates, before his opponent has fewer than 14 stones.

(2) He wins by taking away all of his opponent's stones when both sides have no special shapes or gates.

2.2. Difficulties in Tibetan Jiu Chess

(1) Deep and wide tree search space. The layout stage is closely connected with the battle stage: when the layout is finished, the battle begins. The search space of the game tree is huge, and the search path is far longer than that of general chess games.

(2) Special rules create far more low-value states than high-value states. The Jiu chess layout first covers the central area and then gradually expands outward; the importance of a position gradually decreases from the center to the outside. It takes a long time for an ordinary deep neural network to learn to choose the few high-value states from the far more numerous low-value states.

(3) Extremely limited research and expert knowledge. Jiu chess players are mostly distributed in Tibetan areas, which creates great difficulties in collecting and processing Jiu chess data. At present, we have collected and analyzed only 300 complete game records, represented as SGF files.

3. Related Work

Deep reinforcement learning algorithms used in the Atari series of games, including the Deep Q Network (DQN) algorithm [5], the 51-atom-agent (C51) algorithm [6], and algorithms suitable for continuous domains with low search depth and narrow decision-tree width [7–15], have reached or exceeded the level of human experts.

Figure 1: Actions and shapes in Jiu chess. (a) Example state 1. (b) Example state 2. (c) Example state 3. (d) Important shapes: (A) gate, (B) square, (C) dalian (chain).

In the field of computer games, pattern recognition [6, 16], reinforcement learning, deep learning, and deep reinforcement learning algorithms have been applied to Go, including the Monte Carlo algorithm and the upper confidence bound applied to trees (UCT) algorithm [17–20], the temporal difference algorithm [21], the deep learning model combined with UCT search [22], and the DQN algorithm [23, 24]; they have achieved quite good results in computer Go, which shows that deep reinforcement learning can adapt to the computer game environment. In addition, the application of deep reinforcement learning to Backgammon [25–27], Shogi [28–30], chess [31–34], Texas hold'em poker [35, 36], Mahjong [37], and StarCraft II [38] has reached or even exceeded human-level performance. There is little doubt that deep reinforcement learning will make good progress when applied to Jiu chess.

At present, Jiu chess is little known because it is mainly played in Tibetan areas. Due to its special rules and its smaller number of players compared with Go and other popular games, research on Jiu chess is very limited. Jiu chess is a complete-information game, like Go and chess. However, completely unlike Go or chess, the Jiu rules are very special, with two sequential stages, which makes the game tree search path very long. There are three Jiu chess programs in the current literature [1, 2, 39], and all of them have low playing strength, although their levels differ. In [1], several important Jiu shapes were first recognized, and about 300 playing records were collected and processed; strategies based on chess shapes were designed to defend against the opponent. It was the first Jiu program based on expert knowledge, but it had very low strength because of the limited human knowledge. In [39], a Bayesian network model, designed to address the small-sample problem of Jiu playing records, estimated the board rapidly. This method alleviated the extreme lack of expert knowledge to a certain extent, but the model was useful only in the layout stage. In [2], a temporal difference algorithm was used to realize probability statistics and prediction of chess shapes, which was the first time reinforcement learning was applied to Jiu chess. It was verified that applying SARSA(λ) or Q-learning alone, with specially tuned parameters, helped improve the efficiency of recognizing and evaluating the board. However, it did not contribute much to playing strength because neither a deep neural network nor UCT search was applied. The Q-learning algorithm is used not only in games but also in other fields, such as wireless networks, to reduce resource costs [40], and it is very promising. Therefore, in this study, we combine SARSA(λ) and Q-learning in different stages of UCT search for the first time, with the goal of achieving better performance at low hardware cost.

4. Learning Methodology

UCT search combined with deep learning and reinforcement learning performs well for Go, chess, and other games. However, it requires considerable hardware resources to support exploration of the huge state space and training of the deep neural network. To take advantage of reinforcement learning and deep learning while reducing hardware requirements as much as possible, this study proposes an improved deep neural network model combined with UCT search and hybrid online and offline temporal difference algorithms for Tibetan Jiu chess, which is compact and efficient in self-play learning.

Hybrid TD algorithms are applied to different stages of the UCT search, which minimizes searching in low-value state spaces; thereby, the computational power consumption and training time for exploring the state space are reduced. According to the unique characteristics of Jiu chess, a TD reward function based on a 2D normal distribution matrix is proposed for the layout stage, enabling the Jiu chess reinforcement learning model to acquire awareness of layout priorities more quickly. The reward function for the battle stage is designed based on Jiu chess shapes. The improved ResNet18-based deep neural network is used for self-play and training.

4.1. UCT Search. The reinforcement learning algorithm used in this study is based on a UCT search algorithm [41, 42] (see Figure 2) that combines Q-learning, SARSA(λ), and expert domain knowledge. When the model performs self-play learning, it searches from the root node, uses SARSA(λ) combined with immediate feedback from domain knowledge in the first three stages (selection, expansion, and default policy simulation), and uses the Q-learning update in the backpropagation stage.

In the selection, expansion, and default policy simulation stages, the board situation is evaluated through expert knowledge, and this value is returned to each node on the path using the SARSA(λ) update rule (see the bold parts of the routes in the selection, expansion, and default policy simulation stages of Figure 2). In the backpropagation stage, the board situation is evaluated through the Q-learning update (see the bold part of the route in the backpropagation step of Figure 2).

4.2. Hybrid SARSA(λ) and Q-Learning Algorithm. In this study, a node of the UCT search tree is expressed as

$\{S, W, N, P, V, Q, E\},$  (1)

where S represents the state of the node in the search tree, W represents the value of each action of the node, N represents the number of times each action of the node has been selected (for each selected action a, $N(s,a) \leftarrow N(s,a) + 1$), P represents the probability of selecting each action in the state (calculated by the neural network), V is the winning rate estimated by the neural network, Q is the winning rate estimated by the search tree, and E is an auxiliary variable for the Q update. In the selection, expansion, and simulation stages, E is updated according to the SARSA(λ) algorithm as follows:

$E(s,a) \leftarrow R(s,a) + \gamma Q\left(s_{i+1}, a_{i+1}\right) - Q(s,a),$  (2)

where R(s, a) is the reward value of the current situation, which can usually be calculated from the board situation evaluation, γ is the learning rate, $s_{i+1}$ is the next state reached by taking action a, and $a_{i+1}$ is the action selected in state $s_{i+1}$.

Each node $(s_i, a_i)$ previously searched on the search path is then updated in turn by the following equation:

$Q\left(s_i, a_i\right) \leftarrow \left(1 - \gamma_1\right) Q\left(s_i, a_i\right) + \gamma_1 \lambda^{\,c-i} E(s,a),$  (3)

where $\gamma_1$ is the learning rate of SARSA(λ) and i is the state index satisfying $0 \le i \le c$, with c the number of steps from the root-node state to the state where the search ends at the value-return stage of each turn; i = 0 means the search is initialized by taking the current situation as the root node of the search tree, and i = c means the search has ended. In particular, at the end of the game, c is simply the total number of steps from the beginning to the end of the game.
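To illustrate how equations (2) and (3) act on a search path, the following Python sketch applies the SARSA(λ)-style update to a list of visited (node, action) pairs. The Node structure, the parameter names gamma, gamma1, and lam, and passing the next-state value q_next as an argument are assumptions of this sketch, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    Q: dict = field(default_factory=dict)  # Q(s, a) per action (search-tree estimate)
    E: dict = field(default_factory=dict)  # auxiliary variable E(s, a) from equation (1)

def sarsa_lambda_update(path, reward, q_next, gamma, gamma1, lam):
    """Apply equations (2) and (3) to the nodes visited on one search path.

    path:   list of (node, action) pairs from the root (i = 0) to the last
            searched pair (i = c)
    reward: R(s, a), the domain-knowledge feedback for the current situation
    q_next: Q(s_{i+1}, a_{i+1}), the value of the action chosen in the next state
    """
    c = len(path) - 1
    last_node, last_action = path[-1]

    # Equation (2): E(s, a) <- R(s, a) + gamma * Q(s_{i+1}, a_{i+1}) - Q(s, a)
    e = reward + gamma * q_next - last_node.Q.get(last_action, 0.0)
    last_node.E[last_action] = e

    # Equation (3): update every (s_i, a_i) on the path with trace weight lam ** (c - i)
    for i, (node_i, action_i) in enumerate(path):
        q_old = node_i.Q.get(action_i, 0.0)
        node_i.Q[action_i] = (1 - gamma1) * q_old + gamma1 * (lam ** (c - i)) * e
```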

In the backpropagation stage of the UCT search, or when each game ends, equation (4) is used to update the Q-values of all nodes on the path, where $\gamma_2$ is the learning rate of Q-learning:

$Q\left(s_i, a_i\right) \leftarrow \left(1 - \gamma_2\right) Q\left(s_i, a_i\right) + \gamma_2 \max_{a_{i+1}} Q\left(s_{i+1}, a_{i+1}\right).$  (4)

If $s_{i+1}$ is a leaf node, $Q(s_{i+1}, a_{i+1})$ is calculated by

$Q\left(s_{i+1}, a_{i+1}\right) = \frac{W\left(s_{i+1}, a_{i+1}\right)}{N\left(s_{i+1}, a_{i+1}\right)}.$  (5)
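A corresponding sketch of the Q-learning backup in equations (4) and (5), run over the stored search path during backpropagation or when a game ends; the node layout and helper names are again illustrative assumptions rather than the authors' code.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    Q: dict = field(default_factory=dict)  # winning rate estimated by the search tree
    W: dict = field(default_factory=dict)  # accumulated action value, as in equation (1)
    N: dict = field(default_factory=dict)  # visit count per action

def q_learning_backup(path, gamma2):
    """Apply equation (4) along a stored search path; equation (5) supplies
    the bootstrap value when the next node is an unexpanded leaf."""
    for i, (node_i, action_i) in enumerate(path[:-1]):
        next_node, next_action = path[i + 1]
        if next_node.Q:
            # Equation (4): bootstrap from the best action value of the next node.
            q_next = max(next_node.Q.values())
        else:
            # Equation (5): a leaf node falls back to its visit statistics W / N.
            q_next = next_node.W.get(next_action, 0.0) / max(next_node.N.get(next_action, 0), 1)
        q_old = node_i.Q.get(action_i, 0.0)
        node_i.Q[action_i] = (1 - gamma2) * q_old + gamma2 * q_next
```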

In the selection, expansion, and simulation stages of the UCT search algorithm, the hybrid algorithm selects the action a with the maximum score from state s through equation (6), which computes the score of each alternative action:

$a = \arg\max_a \left(Q(s,\cdot) + U(s,\cdot)\right).$  (6)

For each alternative action a in state s, the comprehensive evaluation function is

$Q(s,a) + U(s,a) = Q(s,a) + c_{puct}\, P(s,a)\, \frac{\sqrt{\sum N(s,\cdot)}}{N(s,a) + 1},$  (7)

where Q(s, a) is obtained by equation (3) or (4), depending on the UCT search stage, and U(s, a) is the estimate given by the UCB method, obtained by

$U(s,a) = c_{puct}\, P(s,a)\, \frac{\sqrt{\sum N(s,\cdot)}}{N(s,a) + 1},$  (8)

where $c_{puct}$ is the parameter balancing exploitation and exploration in the UCT algorithm.
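The selection rule in equations (6)–(8) can be sketched as follows, with the per-state statistics held in plain dictionaries; the data layout and default values are assumptions made for illustration.

```python
import math

def select_action(Q, N, P, c_puct):
    """Pick the action maximizing Q(s, a) + U(s, a), as in equations (6)-(8).

    Q, N, P map each candidate action of state s to its Q-value, visit count,
    and prior probability P(s, a) from the neural network."""
    total_visits = sum(N.values())

    def score(a):
        # Equation (8): U(s, a) = c_puct * P(s, a) * sqrt(sum N(s, .)) / (N(s, a) + 1)
        u = c_puct * P[a] * math.sqrt(total_visits) / (N.get(a, 0) + 1)
        # Equation (7): the comprehensive evaluation Q(s, a) + U(s, a)
        return Q.get(a, 0.0) + u

    # Equation (6): a = argmax_a (Q(s, a) + U(s, a))
    return max(P, key=score)

# Example with three candidate actions of one state:
# select_action(Q={0: 0.4, 1: 0.1}, N={0: 10, 1: 2, 2: 0},
#               P={0: 0.5, 1: 0.3, 2: 0.2}, c_puct=0.1)
```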

4.3. Feedback Function Based on Domain Knowledge. The Q-value is updated by the combination of the SARSA(λ) and Q-learning algorithms [43, 44]. SARSA(λ) is used to update the nodes on the game path in the selection, expansion, and simulation stages of the UCT search [18]; in the backpropagation stage, Q-learning is used to update the values of all nodes on the search path. The hybrid strategy enables the algorithm to learn to prune the huge state space effectively, improving computing speed and reducing the consumption of hardware resources.

The layout stage of Jiu chess plays a key role in the game and its outcome [1]. The value of a board position decays from the center of the board outward, which is similar to the probability density of a 2D normal distribution. Therefore, the importance of each intersection in the layout stage can be approximated by a 2D discrete normal distribution (see Figure 3).

Figure 2: UCT algorithm for Jiu chess (selection, expansion, default policy simulation, and backpropagation, repeated).

In the battle stage, constructing the chess shapes discussed in [1, 2] is very important for gaining victory. The feedback function used by the SARSA(λ) and Q-learning algorithms is shown in equation (9):

$R(s,a) = \begin{cases} f(x,y), & s \text{ in the layout stage}, \\ \sum_i V_i(s), & s \text{ in the battle stage}, \end{cases}$  (9)

where $\sum_i V_i(s) = V_{chain}(s) + V_{gate}(s) + V_{square}(s)$.

Here, f(x, y) is the joint probability density of X and Y, given by the following equation [39, 45]:

$f(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\!\left\{-\frac{1}{2\left(1-\rho^2\right)}\left[\frac{\left(x-\mu_1\right)^2}{\sigma_1^2} - \frac{2\rho\left(x-\mu_1\right)\left(y-\mu_2\right)}{\sigma_1\sigma_2} + \frac{\left(y-\mu_2\right)^2}{\sigma_2^2}\right]\right\},$  (10)

where X and Y represent the chessboard coordinates and $\mu_1, \mu_2$ are the means over the range of the chessboard ($x, y \in [0, 13]$, $\sigma_1 = \sigma_2 = 2$, $\mu_1 = \mu_2 = 6.5$, $\rho = 0$). $V_{chain}$, $V_{gate}$, and $V_{square}$ are approximations summarized from the Jiu chess rules [1, 2] and experience: $V_{chain}(s) = 7c_{chain}$, where $c_{chain}$ is the number of chains on the chessboard in the current state; $V_{gate}(s) = 3c_{gate}$, where $c_{gate}$ is the number of gates on the chessboard in the current state; and $V_{square}(s) = 4c_{square}$, where $c_{square}$ is the number of squares on the chessboard in the current state.
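A possible implementation of the feedback function in equations (9) and (10) is sketched below. The square counter scans 2 × 2 blocks of same-colored stones, while the chain and gate counters are left as stubs because their detection follows the shape definitions in [1, 2]; the stage flag and function names are assumptions of this sketch rather than the authors' code.

```python
import numpy as np

SIGMA1 = SIGMA2 = 2.0
MU1 = MU2 = 6.5
RHO = 0.0

def layout_density(x, y):
    """Equation (10); with rho = 0 it reduces to two independent normals."""
    coeff = 1.0 / (2.0 * np.pi * SIGMA1 * SIGMA2 * np.sqrt(1.0 - RHO ** 2))
    z = ((x - MU1) ** 2) / SIGMA1 ** 2 + ((y - MU2) ** 2) / SIGMA2 ** 2
    return coeff * np.exp(-0.5 * z / (1.0 - RHO ** 2))

def count_squares(board, color):
    """Count 2 x 2 blocks fully occupied by stones of the given color."""
    occ = (board == color).astype(int)
    return int((occ[:-1, :-1] + occ[1:, :-1] + occ[:-1, 1:] + occ[1:, 1:] == 4).sum())

def reward(board, move_xy, in_layout_stage, color,
           count_chains=lambda board, color: 0,   # stub: chain detection per [1, 2]
           count_gates=lambda board, color: 0):   # stub: gate detection per [1, 2]
    """Equation (9): layout density in the layout stage, weighted shape
    counts (7 * chains + 3 * gates + 4 * squares) in the battle stage."""
    if in_layout_stage:
        x, y = move_xy
        return layout_density(x, y)
    return (7 * count_chains(board, color)
            + 3 * count_gates(board, color)
            + 4 * count_squares(board, color))
```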

Using the 2D normal distribution matrix, an approximate probability distribution based on expert knowledge, as the feedback function of the TD algorithm in the layout stage, we obtain better search simulations and self-play performance despite the deep search depth and large branching factor, and the program learns a reasonable layout strategy better and faster [46].

4.4. Improved Deep Neural Network Based on ResNet18. Strategy prediction and board situation evaluation are realized using a deep neural network, expressed as a function with parameter θ as shown in equation (11), where s is the current state, p is the output strategy prediction, and v is the value of the board situation evaluation:

$(p, v) = f_\theta(s).$  (11)

The ResNet series of deep convolutional neural networks shows good robustness in image classification and target detection [3]. In this study, a ResNet18-based network with a transformed fully connected layer is used; it simultaneously outputs the strategy prediction p and the board situation evaluation v (see Figure 4).

Unlike the original ResNet18 [3], a shared fully connected layer with 4096 neurons is used as a hidden layer. The final fully connected layer is replaced by a 196-dimensional fully connected output layer for the drop strategy p and a one-dimensional fully connected layer for the estimate v of the current board status. The neural network is used to predict and evaluate the states of nodes that have not previously been evaluated; it outputs p and v to the UCT search tree node so that $P(s_i) = p$ and $V(s_i) = v$.

The number of input feature planes is two, that is, there are two channels. The first channel encodes the position and color of the pieces in the chessboard state: white pieces are represented by −1 and black pieces by 1. The second channel encodes the position of the falling piece (the most recently placed piece) in the current state of the chessboard: positions without a falling piece are represented by 0, and the position of the falling piece is represented by 1. All convolution layers use ReLU as their activation function, and batch normalization is performed after convolution.
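The following PyTorch sketch shows one way to realize the described network: torchvision's ResNet18 backbone adapted to a two-channel 14 × 14 input, a shared 4096-unit hidden layer, a 196-way policy head, and a scalar value head. The 3 × 3 stem convolution, the removal of the initial max pooling, the softmax and tanh output activations, and the reading of the second input plane as the most recent move are assumptions not specified in the text.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # assumes a recent torchvision

class JiuNet(nn.Module):
    """Two-headed network over a ResNet18 backbone, as described above."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Two input feature planes instead of RGB; a 3 x 3 stem and no initial
        # pooling are assumptions chosen to suit the small 14 x 14 board.
        backbone.conv1 = nn.Conv2d(2, 64, kernel_size=3, stride=1, padding=1, bias=False)
        backbone.maxpool = nn.Identity()
        backbone.fc = nn.Identity()          # expose the 512-d pooled features
        self.backbone = backbone
        self.hidden = nn.Sequential(nn.Linear(512, 4096), nn.ReLU())  # shared hidden layer
        self.policy_head = nn.Linear(4096, 196)   # one output per board point (14 x 14)
        self.value_head = nn.Linear(4096, 1)      # scalar board evaluation

    def forward(self, x):
        h = self.hidden(self.backbone(x))
        p = torch.softmax(self.policy_head(h), dim=-1)  # drop strategy p
        v = torch.tanh(self.value_head(h))              # board evaluation v
        return p, v

def encode_board(board, last_move):
    """Two channels: stones (+1 black, -1 white, 0 empty) and a one-hot plane
    marking the most recently placed stone (one reading of the text above)."""
    planes = torch.zeros(2, 14, 14)
    planes[0] = torch.as_tensor(board, dtype=torch.float32)
    if last_move is not None:
        planes[1][last_move] = 1.0
    return planes.unsqueeze(0)  # add the batch dimension

# net = JiuNet(); p, v = net(encode_board(board, last_move=(6, 7)))
```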

At the end of the game, we replay the experience and output the dataset $\{(s, \pi, z)_i \mid 0 \le i \le c\}$ to adjust the parameters of the deep neural network. The loss function used is shown in the following equation:

$l = (z - v)^2 - \pi^{T} \ln p + c\|\theta\|^2.$  (12)

For the i-th state on the path, $\pi_i = N_i / \sum N$ and $z_i = W_i / N_i$. The training of the neural network is carried out with a mini-batch size of 100, and the learning rate is set to 0.1. At the end of each game, the number of samples (turns) sent to neural network training is about $1 \times 10^3$ to $2 \times 10^3$.

Figure 3: Approximate 2D normal distribution matrix (X and Y are the horizontal and vertical chessboard coordinates; Z is the static evaluation value).
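For concreteness, a minimal sketch of the training targets and the loss in equation (12) follows; the way π_i and z_i are assembled from the visit statistics, the tensor shapes, and the regularization weight c are assumptions of this example.

```python
import torch

def training_targets(visit_counts, action_values):
    """One reading of the text: pi_i is the normalized visit-count vector of
    state i, and z_i = W_i / N_i is its mean value over all visits."""
    n = torch.as_tensor(visit_counts, dtype=torch.float32)   # N per action, e.g. shape [196]
    w = torch.as_tensor(action_values, dtype=torch.float32)  # accumulated W per action
    pi = n / n.sum()          # pi_i = N_i / sum(N)
    z = w.sum() / n.sum()     # z_i = W_i / N_i
    return pi, z

def loss_fn(p, v, pi, z, parameters, c=1e-4):
    """Equation (12): l = (z - v)^2 - pi^T ln(p) + c * ||theta||^2."""
    value_loss = (z - v.squeeze()) ** 2
    policy_loss = -(pi * torch.log(p.squeeze() + 1e-8)).sum()
    l2 = c * sum((theta ** 2).sum() for theta in parameters)
    return value_loss + policy_loss + l2

# After a game: pi, z = training_targets(node_N, node_W)
# loss = loss_fn(p, v, pi, z, net.parameters()); loss.backward()
```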

5. Results

The experiment was performed with the following parameters: $c_{puct}$, step, and learning rate (see Table 1). We compare the efficiency of three methods of updating the Q-value: hybrid Q-learning with SARSA(λ), pure Q-learning, and pure SARSA(λ). The server and hardware configuration used in the experiment is also shown in Table 1.

5.1. The Improvement of Learning and Understanding Ability. It is important to measure whether a Jiu chess agent can quickly learn to form squares in the layout stage. Therefore, we calculate the influence of the hybrid update strategy, compared with pure Q-learning or pure SARSA(λ), on the total number of squares formed over 200 steps of self-play training, as shown in Table 2. When using the hybrid update strategy, the total number of squares in 200 steps is almost the sum of the numbers of squares obtained when using pure Q-learning or pure SARSA(λ) alone, showing that the hybrid update strategy can effectively train a Jiu chess agent to understand the game of Jiu chess and learn it well.

We also counted the number of squares per 20 steps in the 200-step training process. Figure 5 shows that the number of squares obtained with the hybrid update strategy is significantly larger than with pure Q-learning or pure SARSA(λ). Especially in the first 80 steps, the hybrid update strategy is significantly more effective than pure Q-learning or SARSA(λ). From step 140 to step 200, the hybrid update strategy drops a little compared with the first 120 steps; this is because the parameter $c_{puct}$, set at 0.1, must decrease to fit later training and avoid useless exploration. These results indicate that the playing strength trained by the hybrid update strategy is stronger than that of pure Q-learning or SARSA(λ). In addition, this strategy produces stronger learning and understanding as well as better stability.

5.2. Feedback Function Experiment. In order to improve the reinforcement learning efficiency for Jiu chess, a feedback function based on a 2D normal distribution matrix is added to the reinforcement learning game model as an auxiliary measure for updating values. To test the effect of this measure, we conducted 7 days and 110 instances of self-play training and obtained data with and without the 2D-normal-distribution-assisted feedback mechanism for comparison.

Table 3 and Figure 6 show that the total number of squares obtained when using the 2D normal distribution assistant in the self-play training process is about three times that of the program without it, and the average value is also close to three times.

Figure 4: The deep neural network structure (the improved ResNet18 for Jiu, with policy and value outputs).

Table 1: The experimental parameters.
Toolkit package: CNTK 2.7
Runtime: .NET 4.7.2
Operating system: Windows 10
Central processing unit: AMD 2700X, 4.0 GHz
Random access memory: 32 GB
Graphics processing unit: RTX 2070
Threads used: 16
Development environment: Visual Studio 2017 Community
c_puct: 0.1
Step: 200
Learning rate: 0.001
γ (Q-learning): 0.628
γ (SARSA(λ)): 0.372

Table 2: Average number of squares per 20 steps in 200-step training.
Q-learning + SARSA(λ): 86.04; Q-learning: 42.48; SARSA(λ): 55.71


The total number of squares obtained when using the 2D normal distribution assistant during training is significantly larger than that without the assistant. The two-dimensional normal distribution auxiliary matrix is, in fact, an application of expert knowledge. With it, the learning ability in the layout stage has improved about threefold.

We also compare the learning efficiency with and without the 2D normal distribution assistant from the perspective of the quality of specific layouts. As shown in Figure 7(a), in 10 days of training, the program without the aid of the 2D normal distribution learns little about the layout.

Figure 5: Number of squares employing different algorithms (Q-learning + SARSA(λ), Q-learning, SARSA(λ)). X-axis: training steps; y-axis: square count.

Table 3: The number of squares with 2D normalization on or off.
Sum: 1524 (2D normalization off), 5005 (2D normalization on)
Average: 13 (2D normalization off), 45 (2D normalization on)

Figure 6: Comparison of the number of squares with or without 2D normal distribution (panels: 2D norm on and 2D norm off). X-axis: training steps; y-axis: square count.

After using the 2D normal distribution matrix to assist training, the program learns specific chess patterns in the layout faster, forms triangles (chess gates), and even forms squares (the most important chess shape) in the fighting stage of the Jiu chess game, as shown in Figures 7(b) and 7(c).

Figure 7: Comparison of learning efficiency with or without 2D normal distribution. (a) 2D normalization off. (b) 2D normalization on. (c) 2D normalization on and making a special board type.

The above experiments show that the feedback function based on the two-dimensional normal distribution matrix can reduce the computation of the Jiu chess program and improve its self-learning efficiency, because in the layout stage the importance of a chessboard position is close to the two-dimensional normal distribution matrix. Because a Jiu game can have more than 1000 moves and the search iteration process is slow, the evaluation of the layout stage is optimized by using the two-dimensional normal distribution matrix, which in effect models expert knowledge. The feedback value of the model indirectly reduces blind exploration of the state-action transfer values, so less time is needed to obtain better results.

6. Conclusion

In this study, a deep reinforcement learning model combined with expert knowledge learns the rules of the game faster in a short time. The combination of the Q-learning and SARSA(λ) algorithms allows the neural network to learn more valuable chess strategies in a short time, which provides a reference for improving the learning efficiency of deep reinforcement learning models. The deep reinforcement learning model produces good layout results with the two-dimensional normal distribution matrix that encodes expert knowledge, which also shows that deep reinforcement learning can shorten learning time by reasonably combining expert knowledge. The good performance of ResNet18, which makes training of the deep reinforcement learning model more effective at low resource cost, has been verified by the experimental results.

Inspired by Wu et al. [47, 48], in future work we will consider improving Jiu chess strength not only with state-of-the-art technologies but also from a holistic social-good perspective. We will collect and process much more Jiu chess game data to establish a big data resource, which can create value by allowing a light reinforcement learning model to reduce the cost of computing resources. We will also explore the possibility of combining multiagent theory [49] and mechanisms [50] to reduce the network delay of the proposed reinforcement model, in addition to using the strategies proposed in [39, 41].

Data Availability

Data are available via the e-mail xiaer_li@163.com.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was funded by the National Natural Science Foundation of China (61873291 and 61773416) and the MUC 111 Project.

References

[1] X. Li, S. Wang, Z. Lv, Y. Li, and L. Wu, "Strategies research based on chess shape for Tibetan JIU computer game," International Computer Games Association Journal (ICGA), vol. 40, no. 3, pp. 318–328, 2019.
[2] X. Li, Z. Lv, S. Wang, Z. Wei, and L. Wu, "A reinforcement learning model based on temporal difference algorithm," IEEE Access, vol. 7, pp. 121922–121930, 2019.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, June 2016.
[4] D. SuodaChucha and P. Shotwell, "The research of Tibetan chess," Journal of Tibet University, vol. 9, no. 2, pp. 5–10, 1994.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Playing Atari with deep reinforcement learning," 2013, http://arxiv.org/abs/1312.5602.
[6] D. Stern, R. Herbrich, and T. Graepel, "Bayesian pattern ranking for move prediction in the game of go," in Proceedings of the 23rd International Conference on Machine Learning, pp. 873–880, ACM, Pittsburgh, PA, USA, 2006.
[7] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, "Distributional reinforcement learning with quantile regression," 2017, http://arxiv.org/abs/1710.10044.
[8] M. Andrychowicz, W. Filip, A. Ray, et al., "Hindsight experience replay," in Proceedings of the Advances in Neural Information Processing Systems, pp. 5048–5058, Long Beach, CA, USA, December 2017.
[9] F. Scott, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018, http://arxiv.org/abs/1802.09477.
[10] T. P. Lillicrap, J. J. Hunt, P. Alexander, et al., "Continuous control with deep reinforcement learning," 2015, http://arxiv.org/abs/1509.02971.
[11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor," 2018, http://arxiv.org/abs/1801.01290.
[12] M. Babaeizadeh, I. Frosio, T. Stephen, J. Clemons, and J. Kautz, "Reinforcement learning through asynchronous advantage actor-critic on a GPU," 2016, http://arxiv.org/abs/1611.06256.
[13] V. Mnih, A. P. Badia, M. Mirza, et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1928–1937, New York City, NY, USA, June 2016.
[14] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proceedings of the International Conference on Machine Learning, pp. 1889–1897, Lille, France, July 2015.
[15] J. Schulman, W. Filip, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, http://arxiv.org/abs/1707.06347.
[16] S.-J. Yen, T.-N. Yang, C. Chen, and S.-C. Hsu, "Pattern matching in go game records," in Proceedings of the Second International Conference on Innovative Computing, Information and Control (ICICIC 2007), p. 297, Washington, DC, USA, 2007.
[17] S. Gelly and Y. Wang, "Exploration exploitation in go: UCT for Monte-Carlo go," in Proceedings of the NIPS Neural Information Processing Systems Conference On-Line Trading of Exploration and Exploitation Workshop, Barcelona, Spain, 2006.
[18] Y. Wang and S. Gelly, "Modifications of UCT and sequence-like simulations for Monte-Carlo go," in Proceedings of the IEEE Symposium on Computational Intelligence and Games, pp. 175–182, Honolulu, HI, USA, 2007.
[19] S. Sharma, Z. Kobti, and S. Goodwin, "Knowledge generation for improving simulations in UCT for general game playing," in Proceedings of the Australasian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence, pp. 49–55, Auckland, New Zealand, December 2008.
[20] X. Li, Z. Lv, S. Wang, Z. Wei, X. Zhang, and L. Wu, "A middle game search algorithm applicable to low-cost personal computer for go," IEEE Access, vol. 7, pp. 121719–121727, 2019.
[21] N. N. Schraudolph, P. Dayan, and T. J. Sejnowski, "Temporal difference learning of position evaluation in the game of go," in Proceedings of the Advances in Neural Information Processing Systems, pp. 817–824, Denver, CO, USA, 1994.
[22] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[23] D. Silver, J. Schrittwieser, K. Simonyan, et al., "Mastering the game of go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[24] D. Silver, T. Hubert, J. Schrittwieser, et al., "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[25] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[26] G. Tesauro, "Practical issues in temporal difference learning," in Proceedings of the Advances in Neural Information Processing Systems, pp. 259–266, Denver, CO, USA, 1992.
[27] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Computation, vol. 6, no. 2, pp. 215–219, 1994.
[28] D. F. Beal and M. C. Smith, "First results from using temporal difference learning in shogi," in Proceedings of the International Conference on Computers and Games, Springer, Tsukuba, Japan, pp. 113–125, November 1998.
[29] R. Grimbergen and H. Matsubara, "Pattern recognition for candidate generation in the game of shogi," Games in AI Research, pp. 97–108, 1997.
[30] Y. Sato, D. Takahashi, and R. Grimbergen, "A shogi program based on Monte-Carlo tree search," ICGA Journal, vol. 33, no. 2, pp. 80–92, 2010.
[31] S. Thrun, "Learning to play the game of chess," in Proceedings of the Advances in Neural Information Processing Systems, pp. 1069–1076, Denver, CO, USA, 1995.
[32] I. Bratko, D. Kopec, and D. Michie, "Pattern-based representation of chess end-game knowledge," The Computer Journal, vol. 21, no. 2, pp. 149–153, 1978.
[33] Y. Kerner, "Learning strategies for explanation patterns: basic game patterns with application to chess," in Proceedings of the International Conference on Case-Based Reasoning, Springer, Sesimbra, Portugal, pp. 491–500, October 1995.
[34] M. Campbell, A. J. Hoane Jr., and F.-h. Hsu, "Deep Blue," Artificial Intelligence, vol. 134, no. 1-2, pp. 57–83, 2002.
[35] N. Brown and T. Sandholm, "Safe and nested subgame solving for imperfect-information games," in Proceedings of the Advances in Neural Information Processing Systems, pp. 689–699, Long Beach, CA, USA, December 2017.
[36] N. Brown and T. Sandholm, "Superhuman AI for heads-up no-limit poker: Libratus beats top professionals," Science, vol. 359, no. 6374, pp. 418–424, 2018.
[37] https://www.msra.cn/zh-cn/news/features/mahjong-ai-suphx, 2019.
[38] V. Zambaldi, D. Raposo, S. Adam, et al., "Relational deep reinforcement learning," 2018, http://arxiv.org/abs/1806.01830.
[39] S. T. Deng, "Design and implementation of JIU game prototype system and multimedia courseware," Dissertation, Minzu University of China, Beijing, China, 2017.
[40] X. Chen, J. Wu, Y. Cai, H. Zhang, and T. Chen, "Energy-efficiency oriented traffic offloading in wireless networks: a brief survey and a learning approach for heterogeneous cellular networks," IEEE Journal on Selected Areas in Communications, vol. 33, no. 4, pp. 627–640, 2015.
[41] T. Pepels, M. H. M. Winands, and M. Lanctot, "Real-time Monte Carlo tree search in Ms Pac-Man," IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 3, pp. 245–257, 2014.
[42] C. F. Sironi and M. H. M. Winands, "Comparing randomization strategies for search-control parameters in Monte-Carlo tree search," in Proceedings of the 2019 IEEE Conference on Games (CoG), pp. 1–8, IEEE, London, UK, August 2019.
[43] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[44] F. Xiao, Q. Liu, Q.-M. Fu, H.-K. Sun, and L. Gao, "Gradient descent Sarsa(λ) algorithm based on the adaptive potential function shaping reward mechanism," Journal of China Institute of Communications, vol. 34, no. 1, pp. 77–88, 2013.
[45] E. Estrada, E. Hameed, M. Langer, and A. Puchalska, "Path Laplacian operators and superdiffusive processes on graphs II: two-dimensional lattice," Linear Algebra and Its Applications, vol. 555, pp. 373–397, 2018.
[46] K. Lloyd, N. Becker, M. W. Jones, and R. Bogacz, "Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats," Frontiers in Computational Neuroscience, vol. 6, p. 87, 2012.
[47] J. Wu, S. Guo, J. Li, and D. Zeng, "Big data meet green challenges: big data toward green applications," IEEE Systems Journal, vol. 10, no. 3, pp. 888–900, 2016.
[48] J. Wu, S. Guo, H. Huang, W. Liu, and Y. Xiang, "Information and communications technologies for sustainable development goals: state-of-the-art, needs and perspectives," IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp. 2389–2406, 2018.
[49] B. Liu, N. Xu, H. Su, L. Wu, and J. Bai, "On the observability of leader-based multiagent systems with fixed topology," Complexity, vol. 2019, pp. 1–10, 2019.
[50] H. Su, J. Zhang, and X. Chen, "A stochastic sampling mechanism for time-varying formation of multiagent systems with multiple leaders and communication delays," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 12, pp. 3699–3707, 2019.

(2) Constructing probability distribution function basedon two-dimensional (2D) normal distribution ma-trix makes the reinforcement learning model havethe ability of prioritizing its learning tactics in thecenter of the layout

(3) (e proposed improved Resnet18-based networkwith very small parameter scale has achieved a goodlearning effect reduced the use of computing powerand significantly shortened playing time by trainingon a common workstation or graphics processingunit (GPU) server in a short time (e experimentalresults have demonstrated it

Section 2 of the paper introduces the rules and diffi-culties of Tibetan Jiu chess and Section 3 introduces relatedworks on Tibetan Jiu chess Section 4 discusses the hybridSARSA(λ) and Q-learning algorithm applied to differentstages of the UCTsearch and the improved RedNet18-baseddeep neural network structure Section 5 details the ex-periment and data analysis Finally we outline the con-clusions and propose future work

2 Rules and Difficulties in Tibetan JIU Chess

(ere are two kinds of Tibetan chess mi mang and Jiuwhich are widely played in Sichuan Gansu Tibet QinghaiYunnan and other Tibetan areas [4] in China(e process ofplaying chess can be divided into two successive stageslayout and fighting In the layout stage the central area isoccupied first in order to construct the dominant formation(square standard etc) for the fighting stage hence thequality of the layout has an important impact on the finalvictory In the fighting stage actions include moving chesspieces jumping capture or even square capture (e key tovictory is to take the lead in constructing a girdling for-mation and the necessary condition to constructing thisformation is to construct a square

21 Tibetan Jiu Chess Rules (e public Jiu board is 14 times 14points (e Jiu game process is divided into two sequentialstages layout and battle Jiu chess is a two-player game (ewhite side plays with white stones and the black side usesblack stones Players alternate turns Stones must be addedor moved to empty points on the game board A Jiu gamestarts with an empty board (e task in the layout stage is toplace one stone on a point at each move until there are noempty points on the board (e goal in the battle stage ismoving or capturing stones until one player wins the game(e movements and capturing methods are similar to thoseof international checkers [1 2]

211 Layout Stage In the layout stage white plays firstfollowed by black (e first and second moves must beplaced on one of the points of the diagonal line of centralgrids (en each side alternates in placing one stone on onepoint until no empty points remain on the board Afterfilling the board game turns into battle stage

212 Battle Stage In the battle stage there are three actionsto select in each turn

(1) Move Usually a player moves a stone to the updown left or right adjacent empty point (see Figure 1(a))But there is the exception If a player has no more than 14stones left he can move a stone to any empty cross point hewants(is exception is shown in Figure 1(c) Since the blackside only has no more than 14 stones left the diamond blackstone moves to the point where the arrow directs After thismove the square is constructed During this move the pointof the diamond black stone is the beginning and the point ofthe black stone directed by the arrow is the ending In thiscase the black side can capture any one stone of its op-ponent (e dim diamond marked white stone is taken awayby the black side after this move

(2) Jumping Capture When the opponentrsquos stone isadjacent to the playerrsquos stone and there is an empty pointdirectly behind it the player will perform a jumping capture(is action can be continued until the player cannot capturestones or the playerrsquos turn ends As shown in Figure 1(b) thediamond marked white stone is placed to the final emptypoint where the arrows direct after four continuousjumping (is continuous jump begins from the point of thediamond white stone and ends at the point of the white stonedirected by the arrow (e white side captures all the fourblack stones on the path where the arrows direct

(3) Square Capture In one turn if a player constructs asquare with four adjacent stones he will capture one of hisopponentrsquos stones located at any point on the board (is iscalled square capture (ere are three important shapeswhich are called gate square and dalian (or chain) asso-ciated with square capture Gate is the basic shape which isshown in (A) in Figure 1(d) Square one of the importantshapes is shown in (B) in Figure 1(d) Dalian or chain is themost important Jiu shape which is shown in (C) inFigure 1(d) (is shape is critical to the winning of the gameIt is comprised by seven stones of the same color and oneempty point (e stone adjacent to the empty point is calledvital stone which is marked by the circle By moving this vitalstone to any of the two empty points a player can construct asquare and then capture one stone anywhere of its opponenton the board in one turn A player can capture his oppo-nentrsquos stones by repeatedly moving the vital stone in dif-ferent turns

213 Winning the Game If a player wants to be a winner hewill make sure one of the following conditions is met

(1) He must have at least one special shape like chainwhile his opponent does not have any gates beforehis opponent has less than 14 stones

(2) He will win by taking away all stones of his opponentwhen both sides have no special shapes or gates

22 Difficulties in Tibetan Jiu Chess

(1) Deep and wide tree search space (e layout stage isclosely connected with the battle stage When the

2 Complexity

layout is finished the battle begins (e search spaceof the game tree is huge and the search path is farlonger than that of general chess games

(2) Special rules make much more low value states thanhigh value states (e Jiu chess layout first covers thecentral area and then gradually expands outward(e importance of position gradually decreases fromthe center to the outside It takes a long time forordinary deep neural networks to learn to choosehigh value states from much more low value states

(3) Extremely limited research and expert knowledgeJiu chess players are mostly distributed in Tibetanareas creating huge difficulties in collecting and

processing Jiu chess data At present we have onlycollected and analyzed 300 complete chess recorddata represented as SGF files

3 Related Work

Deep reinforcement learning agorithms used in the Atariseries of games inlcuding Deep Q Network (DQN) algo-rithm [5] 51-atom-agent (C51) algorithm [6] and thosesuitable for continuous fieds with low search depth andnarrow decision tree width [7ndash15] have achieved orexceeded the level of human experts In the field of computergames pattern recognition [6 16] reinforcement learning

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

1

2

3

4

5

6

7

8

9

10

11

12

13

14

1

2

3

4

5

6

7

8

9

10

11

12

13

14

(a)

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

1

2

3

4

5

6

7

8

9

10

11

12

13

14

1

2

3

4

5

6

7

8

9

10

11

12

13

14

(b)

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

1

2

3

4

5

6

7

8

9

10

11

12

13

14

1

2

3

4

5

6

7

8

9

10

11

12

13

14

(c)

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(A)

(B)

(C)

(A)

(B)

(C)

(C) (C)1

2

3

4

5

6

7

8

9

10

11

12

13

14

1

2

3

4

5

6

7

8

9

10

11

12

13

14

(d)

Figure 1 Actions and shapes in Jiu chess (a) Example state 1 (b) Example state 2 (c) Example state 3 (d) Important shapes

Complexity 3

deep learning and deep reinforcement learning algorithmsare used in Go including Monte Carlo algorithm and upperconfidence bound applied to tree (UCT) algorithm [17ndash20]temporal difference algorithm [21] the deep learning modelcombined with UCT search algorithm [22] and DQN al-gorithm [23 24] and they have also achieved quite goodresults in computer Go game which shows that the idea ofdeep reinforcement learning algorithm can adapt to thecomputer game environment In addition the application ofdeep reinforcement learning algorithm in Backgammon[25ndash27] Shogi [28ndash30] chess [31ndash34] Texas poker [35 36]Mahjong [37] and Star Craft II [38] has achieved humanexcellence and even exceeded human achievements (ere isno doubt that the deep reinforcement learning algorithmwill make a good progress in this field when it is applied toJiu chess

At present Jiu chess is little-known to the people becauseit is mainly spread in Tibetan people gathering areaes Due tospecial rules and smaller players compared with Go or otherpopular games the research on Jiu chess is very limited Jiuchess is a complete information game along with Go andchess However completely different from Go or chess Jiurules are very special with two consequential stages whichmakes the game tree search path very long(ere are three Jiuchess programs from all current literatures [1 2 39] And allof them have low chess power although they have differentplaying levels In [1] several important shapes of Jiu were firstrecognized and about 300 playing records were collected andprocessed Strategies based on chess shapes were designed todefend the opponent It was the first Jiu program based onexpert knowledge but it had very low power because of thelimited human knowledge In [39] a Bayesian network modelwhich was designed to solve the problem of small sample datafor Jiu playing records estimated the chess board rapidly (ismethod alleviated the extreme lack of expert knowledge to acertain extent but the model was only useful in the layoutstage In [2] a time difference algorithm was used to realizethe probability statistics and prediction of chess type whichwas the first time reinforcement learning was applied to Jiuchess It was verified that solely applying SARSA(λ) orQ-learning algorithm with special different parameters washelpful to improve the efficiency of recognizing and evalu-ating chess board However it did not make importantcontributions to the chess power because of not applying adeep neural network or UCT search Q-learning algorithm isnot only used in the games but also widely used in other fieldssuch as the wireless network to reduce the cost of resources[40] It is very promising So in our study we first combineSARSA(λ) and Q-learning algorithms in different stages ofUCT search to help us get the motivation of achieving betterperformance under low hardware cost

4 Learning Methodology

UCTsearch combined with deep learning and reinforcementlearning is of high performance for Go chess and othergames However it requires considerable hardware re-sources to support exploration to the huge state space andtraining for the deep neural network To take advantage of

reinforcement learning and deep learning while reducinghardware requirements as much as possible this studyproposed an improved deep neural network model com-bined UCT search with hybrid online and offline timedifference algorithms for Tibetan Jiu chess which is ex-quisite and efficient in self-play learning ability

Hybrid TD algorithms are applied to different stages of the UCT search, which minimizes searching in low-value state spaces; thereby, the computational power consumption and training time for exploring the state space are reduced. According to the unique characteristics of Jiu chess, a TD reward function based on a 2D normal distribution matrix is proposed for the layout stage, enabling the Jiu chess reinforcement learning model to acquire awareness of layout priorities more quickly. The reward function for the battle stage is designed based on Jiu chess shapes. The improved ResNet18-based deep neural network is used for self-play and training.

4.1. UCT Search. The reinforcement learning algorithm used in this study is based on a UCT [41, 42] search algorithm (see Figure 2) that combines Q-learning, SARSA(λ), and expert domain knowledge. When the model performs self-play learning, it searches from the root node, uses SARSA(λ) combined with immediate feedback from domain knowledge in the first three stages (selection, expansion, and default policy simulation), and uses the Q-learning update in the backpropagation stage.

In the selection, expansion, and default policy simulation stages, the board situation is evaluated through expert knowledge, and the evaluated value is returned to each node of the path using the SARSA(λ) update rule (see the bold parts of the routes in the selection, expansion, and default policy simulation stages of Figure 2). In the backpropagation stage, the board situation is evaluated through the Q-learning update (see the bold part of the route in the backpropagation step of Figure 2).

4.2. Hybrid SARSA(λ) and Q-Learning Algorithm. In this study, a node of the UCT search tree is expressed as

\{S, W, N, P, V, Q, E\}, \quad (1)

where S represents the state of the node in the search tree; W represents the value of each action of the node; N represents the number of times each action of the node has been selected, with N(s, a) \leftarrow N(s, a) + 1 for each selected action a; P represents the probability of selecting each action in the state (calculated by the neural network); V is the winning rate estimated by the neural network; Q is the winning rate estimated by the search tree; and E is an auxiliary variable for the Q update. In the selection, expansion, and simulation stages, E is updated according to the SARSA(λ) algorithm, which is expressed as follows:
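For concreteness, a minimal Python sketch of such a node is given below. The field names mirror the tuple in equation (1), and the per-action statistics are stored in dictionaries keyed by action; this is an illustrative data structure of our own, not code from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

Action = Any  # e.g., a board intersection or a (from, to) move


@dataclass
class UCTNode:
    """One node of the UCT search tree: {S, W, N, P, V, Q, E} from equation (1)."""
    state: Any                                             # S: board state at this node
    W: Dict[Action, float] = field(default_factory=dict)   # total action value
    N: Dict[Action, int] = field(default_factory=dict)     # visit count per action
    P: Dict[Action, float] = field(default_factory=dict)   # prior from the neural network
    V: float = 0.0                                          # winning rate estimated by the network
    Q: Dict[Action, float] = field(default_factory=dict)   # winning rate estimated by the tree
    E: Dict[Action, float] = field(default_factory=dict)   # auxiliary variable for the Q update

    def visit(self, a: Action) -> None:
        """N(s, a) <- N(s, a) + 1 each time action a is selected."""
        self.N[a] = self.N.get(a, 0) + 1
```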

E(s, a) \leftarrow R(s, a) + \gamma Q(s_{i+1}, a_{i+1}) - Q(s, a), \quad (2)

where R(s, a) is the reward value of the current situation, which can usually be calculated from the board situation evaluation; γ is the learning rate; s_{i+1} is the next state reached by taking action a; and a_{i+1} is the action selected in state s_{i+1}.

Each node (s_i, a_i) previously visited on the search path is then updated in turn by the following equation:

Q(s_i, a_i) \leftarrow (1 - \gamma_1) Q(s_i, a_i) + \gamma_1 \lambda^{c-i} E(s, a), \quad (3)

where γ1 is the learning rate of SARSA(λ); i is the state index, satisfying 0 ≤ i ≤ c; and c is the number of steps from the root-node state to the state where the search ends at the value-return stage of each turn. Here, i = 0 means the search is initialized by taking the current situation as the root node of the search tree, and i = c means the search has ended. In particular, at the end of the game, c is simply the total number of steps from the beginning to the end of the game.
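The following sketch shows how equations (2) and (3) could be applied along the current search path. Here gamma, gamma1, and lam stand for γ, γ1, and λ; q_next supplies Q(s_{i+1}, a_{i+1}); and the hypothetical UCTNode structure above is assumed. It illustrates the update rule rather than reproducing the authors' implementation.

```python
def sarsa_lambda_update(path, reward, q_next, gamma, gamma1, lam):
    """Apply equations (2) and (3) along the current search path.

    path   : list of (node, action) pairs; index 0 is the root, index c the
             deepest node reached in selection/expansion/simulation.
    reward : R(s, a) for the deepest situation, from the domain-knowledge
             feedback function of equation (9).
    q_next : Q(s_{i+1}, a_{i+1}) of the successor state-action, or 0.0 if none.
    """
    c = len(path) - 1
    node_c, a_c = path[c]
    # Equation (2): auxiliary variable E(s, a) at the deepest node.
    e = reward + gamma * q_next - node_c.Q.get(a_c, 0.0)
    node_c.E[a_c] = e
    # Equation (3): update every node (s_i, a_i) on the path, weighted by lambda^(c - i).
    for i, (node, a) in enumerate(path):
        q_old = node.Q.get(a, 0.0)
        node.Q[a] = (1.0 - gamma1) * q_old + gamma1 * (lam ** (c - i)) * e
```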

In the backpropagation stage of the UCT search, or at the end of each game, equation (4) is used to update the Q-values of all nodes on the path, where γ2 is the learning rate of Q-learning:

Q(s_i, a_i) \leftarrow (1 - \gamma_2) Q(s_i, a_i) + \gamma_2 \max_{a_{i+1}} Q(s_{i+1}, a_{i+1}). \quad (4)

If s_{i+1} is a leaf node, Q(s_{i+1}, a_{i+1}) is calculated by

Q(s_{i+1}, a_{i+1}) = \frac{W(s_{i+1}, a_{i+1})}{N(s_{i+1}, a_{i+1})}. \quad (5)
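A corresponding sketch of the backpropagation update in equations (4) and (5), again over the hypothetical UCTNode structure, might look as follows; the maximization is taken over the action values recorded at the successor node, falling back to W/N at leaf nodes.

```python
def q_learning_backup(path, gamma2):
    """Apply equation (4) backwards along the path; equation (5) at leaf nodes."""
    for i in range(len(path) - 1, -1, -1):
        node, a = path[i]
        if i + 1 < len(path):
            nxt_node, nxt_a = path[i + 1]
            if nxt_node.Q:
                # Interior node: max over the tree's Q estimates, as in equation (4).
                q_next = max(nxt_node.Q.values())
            else:
                # Leaf node: equation (5), Q = W / N (visit count clamped to at least 1).
                n = max(nxt_node.N.get(nxt_a, 1), 1)
                q_next = nxt_node.W.get(nxt_a, 0.0) / n
        else:
            q_next = 0.0  # terminal position: no successor to bootstrap from
        q_old = node.Q.get(a, 0.0)
        node.Q[a] = (1.0 - gamma2) * q_old + gamma2 * q_next
```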

In the selection, expansion, and simulation stages of the UCT search algorithm, the hybrid algorithm selects the action a with the maximum score from state s through equation (6), by calculating the score of each alternative action:

a = \arg\max_a \big( Q(s, a) + U(s, a) \big). \quad (6)

For each alternative action a in state s, the comprehensive evaluation function is given by the following equation:

Q(s, a) + U(s, a) = Q(s, a) + c_{\mathrm{puct}} P(s, a) \frac{\sqrt{\sum_b N(s, b)}}{N(s, a) + 1}, \quad (7)

where Q(s, a) is obtained by equation (3) or (4), depending on the UCT search stage, and U(s, a) is the estimate given by the UCB method, obtained by the following equation:

U(s, a) = c_{\mathrm{puct}} P(s, a) \frac{\sqrt{\sum_b N(s, b)}}{N(s, a) + 1}, \quad (8)

where c_puct is the parameter balancing exploitation and exploration in the UCT algorithm.
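A short Python sketch of the selection rule in equations (6)–(8) follows, assuming the hypothetical UCTNode structure above; candidate actions are taken from the node's prior P.

```python
import math


def select_action(node, c_puct):
    """Equations (6)-(8): pick argmax_a Q(s, a) + U(s, a) at a tree node."""
    total_n = sum(node.N.values())              # sum_b N(s, b)
    best_a, best_score = None, -float("inf")
    for a, prior in node.P.items():             # candidate actions with priors P(s, a)
        q = node.Q.get(a, 0.0)
        u = c_puct * prior * math.sqrt(total_n) / (node.N.get(a, 0) + 1)
        if q + u > best_score:
            best_a, best_score = a, q + u
    return best_a
```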

4.3. Feedback Function Based on Domain Knowledge. The Q-value is updated by the combination of the SARSA(λ) and Q-learning algorithms [43, 44]. SARSA(λ) is used to update the nodes on the game path in the selection, expansion, and simulation stages of the UCT [18]; in the backpropagation stage, Q-learning is used to update the values of all nodes on the search path. This hybrid strategy enables the algorithm to learn to prune the huge state space effectively, improving computing speed and reducing the consumption of hardware resources.

The layout stage of Jiu chess plays a key role in the game and its outcome [1]. The value of a board position decays from the center of the board outward, which is similar to the probability distribution of a 2D normal distribution matrix. Therefore, the importance of each intersection in the layout stage can be approximated by a 2D discrete normal distribution (see Figure 3).

Figure 3: Approximate 2D normal distribution matrix (Z: static evaluation value; X and Y: horizontal and vertical coordinates of the chessboard).

In the battle stage, constructing the chess shapes discussed in [1, 2] is very important for gaining a victory. The feedback function used by the SARSA(λ) and Q-learning algorithms is shown in equation (9), where \sum_i V_i(s) = V_{\text{chain}} + V_{\text{gate}} + V_{\text{square}}.

Figure 2: UCT algorithm for Jiu chess (selection, expansion, default policy simulation, and backpropagation stages).


R(s, a) = \begin{cases} f(x, y), & s \text{ in the layout stage}, \\ \sum_i V_i(s), & s \text{ in the battle stage}, \end{cases} \quad (9)

where f(x, y), the joint probability density of X and Y, is given by the following equation [39, 45]:

f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2} \right] \right\}, \quad (10)

where X and Y are the chessboard coordinates and μ1, μ2 are the mean values over the chessboard range (x, y ∈ [0, 13], σ1 = σ2 = 2, μ1 = μ2 = 6.5, ρ = 0). V_chain, V_gate, and V_square are approximations summarized from the Jiu chess rules [1, 2] and experience: V_chain(s) = 7c_chain, where c_chain is the number of chains on the chessboard in the current state; V_gate(s) = 3c_gate, where c_gate is the number of gates on the chessboard in the current state; and V_square(s) = 4c_square, where c_square is the number of squares on the chessboard in the current state.
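As a concrete illustration, the feedback in equations (9) and (10) with the parameters above could be computed as in the sketch below; count_chains, count_gates, and count_squares are hypothetical helpers standing in for the shape-recognition routines of [1, 2].

```python
import math

SIGMA1 = SIGMA2 = 2.0
MU1 = MU2 = 6.5
RHO = 0.0  # with rho = 0 the density factorizes, but the full form is kept for clarity


def layout_reward(x, y):
    """Equation (10): bivariate normal density over board coordinates x, y in [0, 13]."""
    norm = 1.0 / (2.0 * math.pi * SIGMA1 * SIGMA2 * math.sqrt(1.0 - RHO ** 2))
    zx = (x - MU1) / SIGMA1
    zy = (y - MU2) / SIGMA2
    quad = zx ** 2 - 2.0 * RHO * zx * zy + zy ** 2
    return norm * math.exp(-quad / (2.0 * (1.0 - RHO ** 2)))


def battle_reward(state):
    """Equation (9), battle stage: V_chain + V_gate + V_square with weights 7, 3, 4."""
    # count_chains / count_gates / count_squares: assumed shape-recognition helpers
    # (not defined here); they return the counts of the corresponding shapes in `state`.
    return (7 * count_chains(state)
            + 3 * count_gates(state)
            + 4 * count_squares(state))


def reward(state, x, y, in_layout_stage):
    """Equation (9): feedback R(s, a) used by the TD updates."""
    return layout_reward(x, y) if in_layout_stage else battle_reward(state)
```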

Using this 2D normal distribution matrix, an approximate probability distribution based on expert knowledge, as the feedback function of the TD algorithm in the layout stage, we obtain better search simulations and self-play performance even with deep search depths and large branching factors, and a reasonable layout strategy is learned better and faster [46].

4.4. Improved Deep Neural Network Based on ResNet18. Strategy prediction and board situation evaluation are realized using a deep neural network, which is expressed as a function with parameters θ, as shown in equation (11), where s is the current state, p is the output strategy prediction, and v is the value of the board situation evaluation:

(p, v) = f_\theta(s). \quad (11)

The ResNet series of deep convolutional neural networks shows good robustness in image classification and target detection [3]. In this study, a ResNet18-based network with a modified fully connected part is used, so that it can simultaneously output the strategy prediction p and the board situation evaluation v (see Figure 4).

Unlike ResNet18 [3], a shared fully connected layer with 4096 neurons is used as the hidden layer. The original fully connected layer is replaced by a 196-dimensional fully connected output layer for the move strategy p and a one-dimensional fully connected layer for the estimate v of the current board status. The neural network is used to predict and evaluate the states of nodes that have not previously been evaluated; its outputs p and v are written to the UCT search tree node so that P(s_i) = p and V(s_i) = v.
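For readers who want to reproduce this head structure, a rough equivalent in PyTorch/torchvision is sketched below; the paper's implementation used CNTK, so this is an assumed translation rather than the authors' code, and the softmax/tanh output activations are our assumptions. It shows a ResNet18 backbone with a two-channel input, a shared 4096-unit hidden layer, a 196-way policy head for the 14 × 14 board, and a scalar value head.

```python
import torch
import torch.nn as nn
import torchvision


class JiuNet(nn.Module):
    """ResNet18 backbone with a shared 4096-unit FC layer plus policy and value heads."""

    def __init__(self, board_size: int = 14):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        # Two input channels (piece colors, dropped-piece plane) instead of three RGB channels.
        backbone.conv1 = nn.Conv2d(2, 64, kernel_size=3, stride=1, padding=1, bias=False)
        backbone.maxpool = nn.Identity()   # keep resolution on a small 14 x 14 board (our choice)
        backbone.fc = nn.Identity()        # expose the 512-dim feature vector
        self.backbone = backbone
        self.shared = nn.Sequential(nn.Linear(512, 4096), nn.ReLU())
        self.policy_head = nn.Linear(4096, board_size * board_size)  # 196 move logits
        self.value_head = nn.Linear(4096, 1)                         # scalar board evaluation

    def forward(self, x):
        h = self.shared(self.backbone(x))
        p = torch.softmax(self.policy_head(h), dim=-1)   # strategy prediction p (assumed softmax)
        v = torch.tanh(self.value_head(h)).squeeze(-1)   # board evaluation v (assumed tanh)
        return p, v


# Example: a batch of one 2-channel 14 x 14 board.
# net = JiuNet(); p, v = net(torch.zeros(1, 2, 14, 14))
```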

The number of input feature maps is two; that is, there are two input channels. The first channel encodes the positions and colors of the pieces on the board: white pieces are represented by −1 and black pieces by 1. The second channel encodes the positions of dropped pieces in the current board state: positions where no piece has been dropped are represented by 0, and positions with a dropped piece by 1. All convolution layers use ReLU as their activation function, and batch normalization is performed after each convolution.
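A minimal sketch of this two-channel encoding, under the assumption that the board is stored as a 14 × 14 array with 1 for black, −1 for white, and 0 for empty intersections, is:

```python
import numpy as np


def encode_board(board: np.ndarray) -> np.ndarray:
    """Build the 2 x 14 x 14 input planes described above.

    board: int array of shape (14, 14) with 1 = black, -1 = white, 0 = empty.
    """
    color_plane = board.astype(np.float32)           # channel 1: piece colors (+1 / -1)
    drop_plane = (board != 0).astype(np.float32)     # channel 2: 1 where a piece has been dropped
    return np.stack([color_plane, drop_plane], axis=0)


# Example: an empty board yields an all-zero (2, 14, 14) input tensor.
# x = encode_board(np.zeros((14, 14), dtype=int))
```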

At the end of each game, we replay the experience and output the dataset \{(s, \pi, z)_i \mid 0 \le i \le c\} to adjust the parameters of the deep neural network. The loss function used is shown in the following equation:

l = (z - v)^2 - \pi^{T} \ln p + c\lVert\theta\rVert^{2}. \quad (12)

For the i-th state on the path, \pi_i = N_i / \sum N_i and z_i = W_i / N_i. The training of the neural network is carried out with a mini-batch size of 100, and the learning rate is set to 0.1. At the end of each game, the number of samples (turns) sent to neural network training is about 1 \times 10^3 to 2 \times 10^3.
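To make the training targets and loss concrete, a small numpy sketch is given below; it only restates equation (12) and the definitions of π_i and z_i, and the regularization coefficient c is left as a parameter because its value is not given in the paper.

```python
import numpy as np


def make_targets(visit_counts: np.ndarray, total_value: float, total_visits: float):
    """Targets for state i: pi_i = N_i / sum(N_i) over the moves, z_i = W_i / N_i."""
    pi = visit_counts / visit_counts.sum()
    z = total_value / total_visits
    return pi, z


def loss(z: float, v: float, pi: np.ndarray, p: np.ndarray,
         theta: np.ndarray, c_reg: float, eps: float = 1e-12) -> float:
    """Equation (12): l = (z - v)^2 - pi^T ln(p) + c * ||theta||^2.

    c_reg is the L2 coefficient c in equation (12); its value is not
    specified in the paper, so it must be supplied by the caller.
    """
    return float((z - v) ** 2 - pi @ np.log(p + eps) + c_reg * np.dot(theta, theta))
```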

5. Results

The experiments were performed with the parameters c_puct, step, and learning rate listed in Table 1. We compare the efficiency of three methods of updating the Q-value: hybrid Q-learning with SARSA(λ), pure Q-learning, and pure SARSA(λ). The server and hardware configuration used in the experiments is also shown in Table 1.

5.1. The Improvement of Learning and Understanding Ability. An important measure of the learning ability of a Jiu chess agent is how quickly it learns to form squares in the layout stage. Therefore, we calculate the influence of the hybrid update strategy, compared with pure Q-learning or pure SARSA(λ), on the total number of squares formed over 200 steps of self-play training, as shown in Table 2. When using the hybrid update strategy, the total number of squares in 200 steps is almost the sum of the numbers of squares obtained when using pure Q-learning and pure SARSA(λ) alone, showing that the hybrid update strategy can effectively train a Jiu chess agent to understand and learn the game well.

We also counted the number of squares per 20 steps during the 200-step training process. Figure 5 shows that the number of squares obtained with the hybrid update strategy is significantly larger than with pure Q-learning or pure SARSA(λ). Especially in the first 80 steps, the hybrid update strategy is significantly more effective than the pure algorithms. From step 140 to step 200, the hybrid update strategy drops slightly compared with the first 120 steps; this is because the parameter c_puct = 0.1 must decrease to fit later training and avoid useless exploration. These results indicate that the playing strength trained by the hybrid update strategy is stronger than that of pure Q-learning or SARSA(λ). In addition, this strategy produces stronger learning and understanding as well as better stability.

5.2. Feedback Function Experiment. In order to improve the reinforcement learning efficiency of Jiu chess, a feedback function based on a 2D normal distribution matrix is added to the reinforcement learning game model as an auxiliary measure for updating values. To test the effect of this measure, we conducted 7 days of self-play training (110 games) and obtained data with and without the 2D-normal-distribution-assisted feedback mechanism for comparison.

Table 3 and Figure 6 show that the total number of squares occupied when using the 2D normal distribution assistant in the self-play training process is about three times that of the program without it, and the average value is likewise close to three times higher.

Figure 4: The deep neural network structure (improved ResNet18 for Jiu with policy and value outputs).

Table 1: The experimental parameters.

Toolkit package: CNTK 2.7
Runtime: .NET 4.7.2
Operating system: Windows 10
Central processing unit: AMD 2700X, 4.0 GHz
Random access memory: 32 GB
Graphics processing unit: RTX 2070
Threads used: 16
Development environment: Visual Studio 2017 Community
c_puct: 0.1
Step: 200
Learning rate: 0.001
γ (Q-learning): 0.628
γ (SARSA(λ)): 0.372

Table 2: Average number of squares per 20 steps in 200-step training.

Q-learning + SARSA(λ): 86.04
Q-learning: 42.48
SARSA(λ): 55.71


The total number of squares occupied when using the 2D normal distribution assistant during training is significantly greater than without it. The two-dimensional normal distribution auxiliary matrix is, in fact, an application of expert knowledge; the learning ability in the layout stage improved about three times with it.

We also compare the learning efficiency of the program with and without the 2D normal distribution assistant from the perspective of the quality of specific layouts. As shown in Figure 7(a), in 10 days of training, the program without the aid of the 2D normal distribution has little knowledge of the layout. After using the 2D normal distribution matrix to assist training, the program learns specific chess patterns in the layout stage faster.

Figure 5: Number of squares employing different algorithms (square count versus training step for Q-learning + SARSA(λ), Q-learning, and SARSA(λ)).

Table 3: The number of squares with 2D normalization on or off.

           2D normalization off    2D normalization on
Sum        1524                    5005
Average    13                      45

Figure 6: Comparison of the number of squares with or without the 2D normal distribution (square count versus training step, 2D normalization on vs. off).


It also learns to form triangles (chess gates) and even squares (the most important chess shape) in the battle stage of the Jiu chess game, as shown in Figures 7(b) and 7(c).

Figure 7: Comparison of learning efficiency with or without the 2D normal distribution. (a) 2D normalization off. (b) 2D normalization on. (c) 2D normalization on and making a special board type.

The above experiments show that the feedback function based on the two-dimensional normal distribution matrix can reduce the computational load of the Jiu chess program and improve its self-learning efficiency, because in the layout stage the importance of each board position is close to the two-dimensional normal distribution matrix. Since a Jiu chess game can have more than 1000 moves and the search iterates slowly, the evaluation of the layout stage is optimized with the two-dimensional normal distribution matrix, which in effect models expert knowledge. The feedback value of this model indirectly reduces the amount of blind exploration of state-action values, so better results are obtained in less time.

6. Conclusion

In this study, the deep reinforcement learning model combined with expert knowledge learns the rules of the game faster in a short time. The combination of the Q-learning and SARSA(λ) algorithms lets the neural network learn more valuable chess strategies in a short time, which provides a reference for improving the learning efficiency of deep reinforcement learning models. The deep reinforcement learning model produced good layout results with the two-dimensional normal distribution matrix obtained from expert knowledge modeling, which also shows that deep reinforcement learning can shorten learning time by reasonably combining expert knowledge. The better performance of ResNet18, used to make training of the deep reinforcement learning model more effective at low resource cost, has been verified by the experimental results.

Inspired by Wu et al. [47, 48], in future work we will consider improving Jiu chess strength not only through state-of-the-art technologies but also from a holistic social-good perspective. We will collect and process much more Jiu chess game data to establish a big data resource, which can provide value for a lightweight reinforcement learning model and reduce the cost of computing resources. We will also explore the possibility of combining multiagent theory [49] and mechanisms [50] to reduce the network delay of the proposed reinforcement learning model, in addition to the strategies proposed in [39, 41].

Data Availability

Data are available via e-mail from xiaer_li@163.com.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was funded by the National Natural Science Foundation of China (61873291 and 61773416) and the MUC 111 Project.

References

[1] X. Li, S. Wang, Z. Lv, Y. Li, and L. Wu, "Strategies research based on chess shape for Tibetan JIU computer game," International Computer Games Association Journal (ICGA), vol. 40, no. 3, pp. 318–328, 2019.

[2] X. Li, Z. Lv, S. Wang, Z. Wei, and L. Wu, "A reinforcement learning model based on temporal difference algorithm," IEEE Access, vol. 7, pp. 121922–121930, 2019.

[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, June 2016.

[4] D. SuodaChucha and P. Shotwell, "The research of Tibetan chess," Journal of Tibet University, vol. 9, no. 2, pp. 5–10, 1994.

[5] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Playing Atari with deep reinforcement learning," 2013, http://arxiv.org/abs/1312.5602.

[6] D. Stern, R. Herbrich, and T. Graepel, "Bayesian pattern ranking for move prediction in the game of go," in Proceedings of the 23rd International Conference on Machine Learning, pp. 873–880, ACM, Pittsburgh, PA, USA, 2006.


[7] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, "Distributional reinforcement learning with quantile regression," 2017, http://arxiv.org/abs/1710.10044.

[8] M. Andrychowicz, W. Filip, A. Ray, et al., "Hindsight experience replay," in Proceedings of the Advances in Neural Information Processing Systems, pp. 5048–5058, Long Beach, CA, USA, December 2017.

[9] F. Scott, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018, http://arxiv.org/abs/1802.09477.

[10] T. P. Lillicrap, J. J. Hunt, P. Alexander, et al., "Continuous control with deep reinforcement learning," 2015, http://arxiv.org/abs/1509.02971.

[11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor," 2018, http://arxiv.org/abs/1801.01290.

[12] M. Babaeizadeh, I. Frosio, T. Stephen, J. Clemons, and J. Kautz, "Reinforcement learning through asynchronous advantage actor-critic on a GPU," 2016, http://arxiv.org/abs/1611.06256.

[13] V. Mnih, A. P. Badia, M. Mirza, et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1928–1937, New York City, NY, USA, June 2016.

[14] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proceedings of the International Conference on Machine Learning, pp. 1889–1897, Lille, France, July 2015.

[15] J. Schulman, W. Filip, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, http://arxiv.org/abs/1707.06347.

[16] S.-J. Yen, T.-N. Yang, C. Chen, and S.-C. Hsu, "Pattern matching in go game records," in Proceedings of the Second International Conference on Innovative Computing, Information and Control (ICICIC 2007), p. 297, Washington, DC, USA, 2007.

[17] S. Gelly and Y. Wang, "Exploration exploitation in go: UCT for Monte-Carlo go," in Proceedings of the NIPS Neural Information Processing Systems Conference On-Line Trading of Exploration and Exploitation Workshop, Barcelona, Spain, 2006.

[18] Y. Wang and S. Gelly, "Modifications of UCT and sequence-like simulations for Monte-Carlo go," in Proceedings of the IEEE Symposium on Computational Intelligence and Games, pp. 175–182, Honolulu, HI, USA, 2007.

[19] S. Sharma, Z. Kobti, and S. Goodwin, "Knowledge generation for improving simulations in UCT for general game playing," in Proceedings of the Australasian Joint Conference on Artificial Intelligence, Advances in Artificial Intelligence, pp. 49–55, Auckland, New Zealand, December 2008.

[20] X. Li, Z. Lv, S. Wang, Z. Wei, X. Zhang, and L. Wu, "A middle game search algorithm applicable to low-cost personal computer for go," IEEE Access, vol. 7, pp. 121719–121727, 2019.

[21] N. N. Schraudolph, P. Dayan, and T. J. Sejnowski, "Temporal difference learning of position evaluation in the game of go," in Proceedings of the Advances in Neural Information Processing Systems, pp. 817–824, Denver, CO, USA, 1994.

[22] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[23] D. Silver, J. Schrittwieser, K. Simonyan, et al., "Mastering the game of go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.

[24] D. Silver, T. Hubert, J. Schrittwieser, et al., "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.

[25] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.

[26] G. Tesauro, "Practical issues in temporal difference learning," in Proceedings of the Advances in Neural Information Processing Systems, pp. 259–266, Denver, CO, USA, 1992.

[27] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Computation, vol. 6, no. 2, pp. 215–219, 1994.

[28] D. F. Beal and M. C. Smith, "First results from using temporal difference learning in shogi," in Proceedings of the International Conference on Computers and Games, Springer, Tsukuba, Japan, pp. 113–125, November 1998.

[29] R. Grimbergen and H. Matsubara, "Pattern recognition for candidate generation in the game of shogi," Games in AI Research, pp. 97–108, 1997.

[30] Y. Sato, D. Takahashi, and R. Grimbergen, "A shogi program based on Monte-Carlo tree search," ICGA Journal, vol. 33, no. 2, pp. 80–92, 2010.

[31] S. Thrun, "Learning to play the game of chess," in Proceedings of the Advances in Neural Information Processing Systems, pp. 1069–1076, Denver, CO, USA, 1995.

[32] I. Bratko, D. Kopec, and D. Michie, "Pattern-based representation of chess end-game knowledge," The Computer Journal, vol. 21, no. 2, pp. 149–153, 1978.

[33] Y. Kerner, "Learning strategies for explanation patterns: basic game patterns with application to chess," in Proceedings of the International Conference on Case-Based Reasoning, Springer, Sesimbra, Portugal, pp. 491–500, October 1995.

[34] M. Campbell, A. J. Hoane Jr., and F.-h. Hsu, "Deep Blue," Artificial Intelligence, vol. 134, no. 1-2, pp. 57–83, 2002.

[35] N. Brown and T. Sandholm, "Safe and nested subgame solving for imperfect-information games," in Proceedings of the Advances in Neural Information Processing Systems, pp. 689–699, Long Beach, CA, USA, December 2017.

[36] N. Brown and T. Sandholm, "Superhuman AI for heads-up no-limit poker: Libratus beats top professionals," Science, vol. 359, no. 6374, pp. 418–424, 2018.

[37] https://www.msra.cn/zh-cn/news/features/mahjong-ai-suphx, 2019.

[38] V. Zambaldi, D. Raposo, S. Adam, et al., "Relational deep reinforcement learning," 2018, http://arxiv.org/abs/1806.01830.

[39] S. T. Deng, "Design and implementation of JIU game prototype system and multimedia courseware," Dissertation, Minzu University of China, Beijing, China, 2017.

[40] X. Chen, J. Wu, Y. Cai, H. Zhang, and T. Chen, "Energy-efficiency oriented traffic offloading in wireless networks: a brief survey and a learning approach for heterogeneous cellular networks," IEEE Journal on Selected Areas in Communications, vol. 33, no. 4, pp. 627–640, 2015.

[41] T. Pepels, M. H. M. Winands, and M. Lanctot, "Real-time Monte Carlo tree search in Ms Pac-Man," IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 3, pp. 245–257, 2014.

[42] C. F. Sironi and M. H. M. Winands, "Comparing randomization strategies for search-control parameters in Monte-Carlo tree search," in Proceedings of the 2019 IEEE Conference on Games (CoG), pp. 1–8, IEEE, London, UK, August 2019.

[43] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[44] F. Xiao, Q. Liu, Q.-M. Fu, H.-K. Sun, and L. Gao, "Gradient descent Sarsa(λ) algorithm based on the adaptive potential function shaping reward mechanism," Journal of China Institute of Communications, vol. 34, no. 1, pp. 77–88, 2013.

[45] E. Estrada, E. Hameed, M. Langer, and A. Puchalska, "Path Laplacian operators and superdiffusive processes on graphs. II. Two-dimensional lattice," Linear Algebra and Its Applications, vol. 555, pp. 373–397, 2018.

[46] K. Lloyd, N. Becker, M. W. Jones, and R. Bogacz, "Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats," Frontiers in Computational Neuroscience, vol. 6, p. 87, 2012.

[47] J. Wu, S. Guo, J. Li, and D. Zeng, "Big data meet green challenges: big data toward green applications," IEEE Systems Journal, vol. 10, no. 3, pp. 888–900, 2016.

[48] J. Wu, S. Guo, H. Huang, W. Liu, and Y. Xiang, "Information and communications technologies for sustainable development goals: state-of-the-art, needs and perspectives," IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp. 2389–2406, 2018.

[49] B. Liu, N. Xu, H. Su, L. Wu, and J. Bai, "On the observability of leader-based multiagent systems with fixed topology," Complexity, vol. 2019, pp. 1–10, 2019.

[50] H. Su, J. Zhang, and X. Chen, "A stochastic sampling mechanism for time-varying formation of multiagent systems with multiple leaders and communication delays," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 12, pp. 3699–3707, 2019.




deep learning and deep reinforcement learning algorithmsare used in Go including Monte Carlo algorithm and upperconfidence bound applied to tree (UCT) algorithm [17ndash20]temporal difference algorithm [21] the deep learning modelcombined with UCT search algorithm [22] and DQN al-gorithm [23 24] and they have also achieved quite goodresults in computer Go game which shows that the idea ofdeep reinforcement learning algorithm can adapt to thecomputer game environment In addition the application ofdeep reinforcement learning algorithm in Backgammon[25ndash27] Shogi [28ndash30] chess [31ndash34] Texas poker [35 36]Mahjong [37] and Star Craft II [38] has achieved humanexcellence and even exceeded human achievements (ere isno doubt that the deep reinforcement learning algorithmwill make a good progress in this field when it is applied toJiu chess

At present Jiu chess is little-known to the people becauseit is mainly spread in Tibetan people gathering areaes Due tospecial rules and smaller players compared with Go or otherpopular games the research on Jiu chess is very limited Jiuchess is a complete information game along with Go andchess However completely different from Go or chess Jiurules are very special with two consequential stages whichmakes the game tree search path very long(ere are three Jiuchess programs from all current literatures [1 2 39] And allof them have low chess power although they have differentplaying levels In [1] several important shapes of Jiu were firstrecognized and about 300 playing records were collected andprocessed Strategies based on chess shapes were designed todefend the opponent It was the first Jiu program based onexpert knowledge but it had very low power because of thelimited human knowledge In [39] a Bayesian network modelwhich was designed to solve the problem of small sample datafor Jiu playing records estimated the chess board rapidly (ismethod alleviated the extreme lack of expert knowledge to acertain extent but the model was only useful in the layoutstage In [2] a time difference algorithm was used to realizethe probability statistics and prediction of chess type whichwas the first time reinforcement learning was applied to Jiuchess It was verified that solely applying SARSA(λ) orQ-learning algorithm with special different parameters washelpful to improve the efficiency of recognizing and evalu-ating chess board However it did not make importantcontributions to the chess power because of not applying adeep neural network or UCT search Q-learning algorithm isnot only used in the games but also widely used in other fieldssuch as the wireless network to reduce the cost of resources[40] It is very promising So in our study we first combineSARSA(λ) and Q-learning algorithms in different stages ofUCT search to help us get the motivation of achieving betterperformance under low hardware cost

4 Learning Methodology

UCTsearch combined with deep learning and reinforcementlearning is of high performance for Go chess and othergames However it requires considerable hardware re-sources to support exploration to the huge state space andtraining for the deep neural network To take advantage of

reinforcement learning and deep learning while reducinghardware requirements as much as possible this studyproposed an improved deep neural network model com-bined UCT search with hybrid online and offline timedifference algorithms for Tibetan Jiu chess which is ex-quisite and efficient in self-play learning ability

Hybrid TD algorithms are applied to different stages ofthe UCT search which minimizes searching in low-valuestate spaces (ereby the computational power consump-tion and training time of the exploration state space arereduced According to the unique characteristics of Jiu chessa TD algorithm reward function is proposed based on a 2Dnormal distribution matrix for the layout stage enabling theJiu chess reinforcement learning model to more quicklyacquire layout awareness of Jiu chess priorities (e rewardfunction for the battle stage is also designed based on Jiuchess shapes (e improved ResNet18-based deep neuralnetwork based is used for self-play and training

41UCTSearch (e reinforcement learning algorithm usedin this study is based on a UCT [41 42] search algorithm (seeFigure 2) which combines Q-learning SARSA(λ) andexpert domain knowledge When the model performs self-play learning it searches from the root node uses SARSA(λ)

combined with immediate feedback from domain knowl-edge in the former three stages (selection extension anddefault policy simulation) and uses Q-learning updation inthe backpropagation stage

In the selection extension and default policy simulationstages the board situation is evaluated through expertknowledge and this evaluated value is returned to each nodeof the path using the SARSA(λ) algorithm updation method(see the bold parts of the routes in the selection extensionand default policy simulation stages of Figure 2) In thebackpropagation stage the board situation is evaluatedthrough Q-learning algorithm updation (see the bold part ofthe route in the backpropagation step of Figure 2)

42 Hybrid SARSA(λ) and Q-Learning Algorithm In thisstudy the node of the UCT search tree is expressed as

SWNP VQE (1)

where S represents the state of the node in the search treeWrepresents the value of each action of the node N representsthe number of times the action of the node is selected andfor each action a selected N(s a)⟵N(s a) + 1 P rep-resents the probability of selecting each action under thestate (calculated by the neural network) V is the winningrate estimated by the neural network Q is the winning rateestimated by the search tree and E is a variable of auxiliaryQupdation In the selection extension and simulation stagesE is updated according to the SARSA(λ) algorithm which isexpressed as follows

E(s a)⟵R(s a) + cQ si+1 ai+1( 1113857 minus Q(s a) (2)

where R(s a) is the reward value of the current situationwhich can usually be calculated by the board situation

4 Complexity

evaluation c is learning rate si+1 is the next state searchedby taking action a and ai+1 is the action selected in state si

(e node (si ai) which is previously searched in thesearching path is updated in turn by the following equation

Q si ai( 1113857⟵ 1 minus c1( 1113857Q si ai( 1113857 + c1λcminus iE(s a) (3)

where c1 is learning rate of SARSA(λ) i is denoted as thestate index satisfying 0le ile c c is the number of steps fromthe status of the root node to the status of the search endingat the stage of value return in each turn i 0 represents thesearching is initialized by taking the current situation as thesearch tree root node and i c represents the searching isended Especially at the end of the game c is simply the totalnumber of steps from the beginning to the end of the game

In the backpropagation stage of UCT search or at thetime of each game end equation (4) is used to update theQ-values of all nodes on the path where c2 is learning rate ofQ-Learning

Q si ai( 1113857⟵ 1 minus c2( 1113857Q si ai( 1113857 + c2 maxQ si+1 ai+1( 1113857

(4)

and if si+1 is a leaf node Q(si+1 ai+1) will be calculated by

Q si+1 ai+1( 1113857 W si+1 ai+1( 1113857

N si+1 ai+1( 1113857 (5)

In the selection extension and simulation stages of theUCTsearch algorithm this hybrid algorithm selects action a

with the maximum value from state s through equation (6)by calculating the score of each alternative action

a argmaxa(Q(s middot) + U(s middot)) (6)

For each alternative action a at state s there is thecomprehensive evaluation function which is represented bythe following equation

Q(s a) + U(s a) Q(s a) + cpuctP(s a)

1113936N(s middot)

N(s a) + 1

1113971

(7)

where Q(s a) is obtained by equation (3) or (4) whichdepends on different tree UCT search stages and U(s a)

represents the estimation value by the UCBmethod which isobtained by the following equation

U(s a) cpuctP(s a)

1113936N(s middot)

N(s a) + 1

1113971

(8)

where cpuct represents the parameter balancing the exploi-tation and exploration of UCT algorithm

43 Feedback Function Based on Domain KnowledgeQ-value is updated by the combination of the SARSA(λ) andQ-learning algorithms [43 44] SARSA(λ) is used to updatethe nodes on the game path in the selection expansion andsimulation stages of the UCT [18] In the backpropagationstage Q-learning is used to update the values of all nodes inthe search path(e hybrid strategy enables the algorithm tolearn to prune the huge state space effectively improvingcomputing speed and reducing the consumption of hard-ware resources

(e layout stage of Jiu chess plays a key role in the gameand its outcome [1] (e value of the board position decaysfrom the center of the board to the outside which is similarto the probability distribution of a 2D normal distributionmatrix (erefore the importance of each intersection in thelayout stage can be approximated by the 2D discrete normaldistribution (see Figure 3)

In the battle stage constructing the chess shapes discussedin [1 2] is very important in gaining a victory (e feedbackfunction used by the SARSA(λ) and Q-learning algorithm isshown in equation (9) 1113936iVi(s) Vchain+ Vgate + Vsquare

Q + U

Q + U

Q + U

P w

w

w

w

w

Selection Expansion Default policy simulation

Backpropagation

Repeat

Q + U

Figure 2 UCT algorithm for Jiu chess

Complexity 5

R(s a) f(x y) s in layout stage

1113936iVi(s) s in battle stage1113896 (9)

where f(x y) is the joint probability density of X and Y isshown in the following equation [39 45]

f(x y) 1

2πσ1σ21 minus ρ2

1113968 eminus 1 2 1minus ρ2( )( )( ) xminus μ1( )

2σ21minus 2ρ xminus μ1( ) yminus μ2( )σ1σ2( )+ yminus μ2( )2σ22( 1113857( 1113857

(10)

where X and Y represent chessboard coordinates and μrepresents the mean value of the range of the chessboard(x y isin [0 13] σ1 σ2 2 μ1 u2 65 ρ 0)Vchain Vgate and Vsquare are approximations summarizedthrough the Jiu chess rules [1 2] and experienceVchain(s) 7cchain where cchain is the sum of the number ofsquares on the chessboard in the current stateVgate(s) 3cgate where cgate is the sum of the number ofgates on the chessboard in the current state andVsquare(s) 4csquare where csquare is the sum of the number ofsquares on the chessboard in the current state

Using the 2D normal distribution approximate ma-trix probability distribution based on expert knowledgeas the feedback function of the TD algorithm in thelayout stage we can produce better search simulationsand self-play game performance in the case of deepsearch depths and large search branches and learn areasonable chess strategy in the layout stage better andfaster [46]

44 Improved Deep Neural Network Based on ResNet18Strategy prediction and board situation evaluation are re-alized using a deep neural network which is expressed as afunction with parameter θ as shown in equation (11) wheres is the current state p is the output strategy prediction andv is the value of board situation evaluation

(p v) fθ(s) (11)(e RESNETseries of deep convolution neural networks

shows good robustness in image classification and targetdetection [3] In this study a ResNet18-based network isused to transform the full connection layer It can

simultaneously output strategy prediction p and boardsituation evaluation v (see Figure 4)

Unlike ResNet18 [3] a public full connection layerwith 4096 neuron nodes is used as the hidden layer (efully connected layer is replaced by a 196-dimensionaloutput fully connected layer to output drop strategy p anda one-dimensional fully connected layer to output theestimation v of the current board status (e neuralnetwork is used to predict and evaluate the state of nodesthat have not previously been evaluated (e neuralnetwork outputs p v to the UCT search tree node so thatP(si) p V(si) v

(e number of input characteristic graphs is two Itmeans that there are two channels (e first channel is theposition and color of the pieces in the chessboard stateWhite pieces are represented by minus1 and black ones arerepresented by 1 (e second channel is the position of thefalling pieces in the current state of the chessboard (epositions in which pieces do not fall are represented by 0and the position of the falling pieces is represented by 1All convolution layers use ReLu as their activationfunction and batch normalization is performed afterconvolution

At the end of the game we play back the experience andoutput the dataset ile 0lefc |(s π z)i1113864 1113865 to adjust the pa-rameters of the deep neural network (e loss function usedis shown in the following equation

l (z minus v)2

minus πT ln p + cθ2 (12)

In the i-th state of the path πi NiNi andzi WiNi (e training of the neural network is carried out

Z static evaluation value

X horizontal coordinates of chess board

Y vertical coordinates of

chess board

Figure 3 Approximate 2D normal distribution matrix

6 Complexity

with amin-batch parameter of 100 and the learning rate is setto 01 At the end of each game the number of samples(turns) sent to neural network training is about 1 times 103 to2 times 103

5 Results

(e experiment was performed with the following param-eters cpuct step and learning rate (see Table 1) We comparethe efficiency of three methods of updating the Q valuehybrid Q-learning with SARSA(λ) pure Q-learning andpure SARSA(λ) (e server and hardware configurationused in the experiment is shown in Table 1

51Ae Improvement of Learning and Understanding AbilityIt is important to measure the learning ability of a Jiu chessagent that can quickly learn to form squares in the layoutstage (erefore we calculate the influence of the hybridupdate strategy compared to pure Q-learning or SARSA(λ)

algorithms on the sum of the squares of 200 steps of self-playgame training as shown in Table 2 When using the hybridupdate strategy the total number of squares in 200 steps isalmost the sum of the number of squares occupied whenonly using pure Q-learning or SARSA(λ) proving that thehybrid update strategy can effectively train a Jiu chess agentto understand the game of Jiu chess and learn it well

We also counted the number of squares per 20 steps inthe 200-step training process Figure 5 shows that thenumber of squares occupied when using the hybrid updatestrategy is significantly more than that when using pureQ-learning or SARSA(λ) algorithms Especially in first 80steps hybrid updating strategy is significantly more effectivethan pure Q-learning or SARSA(λ) algorithms And fromstep 140 to step 200 hybrid updating strategy has a littledrop compared to the first 120 steps It is because parametercpuct at 01 must decrease to fit later training to avoid uselessexploration (is result indicates that the chess powertrained by the hybrid update strategy is stronger than that ofpure Q-learning or SARSA(λ) In addition this strategy

produces stronger learning and understanding as well asbetter stability

52 FeedbackFunctionExperiment In order to improve thereinforcement learning efficiency of Jiu chess a feedbackfunction based on a 2D normal distribution matrix isadded to the reinforcement learning game model as anauxiliary measure for updating value To test the effect ofthis measure we conducted 7 days and 110 instances ofself-play training and obtained data with and without the2D-normal-distribution-assisted feedback mechanism forcomparison

Table 3 and Figure 6 show that the total number ofsquares occupied when using the 2D normal distributionassistant in the self-play training process is about three timesthat of programs that do not the average value is close to

Value

Policy

ImprovedResNet18 for

JIU

123456789

1011121314

123456789

1011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

Figure 4 (e deep neural network structure

Table 1 (e experimental parameters

Toolkit package CNTK 27Runtime Net 472Operating system Windows 10Central processing unit AMD 2700X40GHzRandom access memory 32GBGraphics processing unit RTX2070(reads used 16Development environment Visual Studio 2017 Communitycpuct 01Step 200Learning rate 0001c(Q-learning) 0628c(SARSA(λ)) 0372

Table 2 Average number of squares per 20 steps in 200-steptraining

Q-learning + SARSA(λ) Q-learning SARSA(λ)

8604 4248 5571

Complexity 7

three times (e total number of squares occupied whenusing the 2D normal distribution assistant during training issignificantly more than that without the assistant Two-di-mensional normal distribution auxiliary matrix shows theapplication of expert knowledge in fact (e learning abilityto play in the layout has improved about 3 times with thetwo-dimensional normal distribution auxiliary matrix

We also compare the learning efficiency of using a 2Dnormal distribution assistant program from the perspectiveof the quality of specific layouts As shown in Figure 7(a) in10 days of training the program without the aid of the 2Dnormal distribution has little knowledge of the layout Afterusing the 2D normal distribution matrix to assist in trainingthe layout process can learn specific chess patterns faster

Squa

re co

unt

Q-learning + SARSA(λ)Q-learningSARSA(λ)

0

10

20

30

40

50

60

20 40 60 80 100 120 140 160 180 2000Training steps

Figure 5 Number of squares employing different algorithms

Table 3 (e number of squares with 2D normalization on or off

2D normalization off 2D normalization onSum 1524 5005Average 13 45

Squa

re co

unt

2D norm on 2D norm off

50 1000Training steps

0

30

60

90

120

Squa

re co

unt

0

30

60

90

120

50 1000Training steps

Figure 6 Comparison of the number of squares with or without 2D normal distribution

8 Complexity

form a triangle (chess gate) and even form squares (the mostimportant chess shape) in the fighting stage of Jiu chessgame as shown in Figures 7(b) and 7(c)

(e above experiments show that the feedback functionbased on the two-dimensional normal distribution matrixcan reduce the calculation amount of Jiu chess program andimprove the self-learning efficiency of the program becausein the layout stage the importance of the chessboard po-sition is close to the two-dimensional normal distributionmatrix Because Jiu chess can have more than 1000 matchingsteps and slow search iteration process the evaluation oflayout stage is optimized by using two-dimensional normaldistribution matrix which is actually the knowledge ofmodel experts (e feedback value of the model indirectlyreduces the process of blindly exploring the transfer value ofchessboard state action so it takes less time to get betterresults

6 Conclusion

In this study the deep reinforcement learning modelcombined with expert knowledge can learn the rules of thegame faster in a short time (e combination of Q-learningand SARSA(λ) algorithms can make the neural networklearn more valuable chess strategies in a short time whichprovides a reference for improving the learning efficiency ofthe deep reinforcement learning model (e deep rein-forcement learning model has produced good layout resultsobtained by the two-dimensional normal distributionmatrixof expert knowledge modeling which also proves that deepreinforcement learning can shorten the learning time bycombining expert knowledge reasonably (e better per-formance of ResNet18 which was used to make the deepreinforcement learning model training more effectively atlow resources cost has been verified by experimental results

Inspired byWu et al [47 48] we consider to improve Jiuchess power not only from the state-of-the-art of technol-ogies but also from the holistic social good perspectives inthe future work We will collect and process much more Jiuchess game data to establish the big data resource which canturn big values for using light reinforcement learning model

to reduce the cost of computing resources We will alsoexplore the possibilities of combining multiagent theory [49]and mechanism [50] to reduce the network delay of theproposed reinforcement model besides using the strategiesproposed in [39 41]

Data Availability

Data are available via the e-mail xiaer_li163com

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(is study was funded by the National Natural ScienceFoundation of China (61873291 and 61773416) and theMUC 111 Project

References

[1] X Li S Wang Z Lv Y Li and L Wu ldquoStrategies researchbased on chess shape for Tibetan JIU computer gamerdquo In-ternational Computer Games Association Journal (ICGA)vol 40 no 3 pp 318ndash328 2019

[2] X Li Z Lv S Wang Z Wei and L Wu ldquoA reinforcementlearning model based on temporal difference algorithmrdquoIEEE Access vol 7 pp 121922ndash121930 2019

[3] K He X Zhang S Ren and J Sun ldquoDeep residual learningfor image recognitionrdquo in Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 770ndash778 LasVegas NV USA June 2016

[4] D SuodaChucha and P Shotwell ldquo(e research of Tibetanchessrdquo Journal of Tibet University vol 9 no 2 pp 5ndash10 1994

[5] V Mnih K Kavukcuoglu D Silver et al ldquoPlaying atari withdeep reinforcement learningrdquo 2013 httparxivorgabs13125602

[6] D Stern R Herbrich and T Graepel ldquoBayesian patternranking for move prediction in the game of gordquo in Proceedingsof the 23rd International Conference on Machine Learningpp 873ndash880 ACM Pittsburgh PA USA 2006

123456789

1011121314

123456789

1011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(a)

123456789

1011121314

1234567891011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(b)

123456789

1011121314

1234567891011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(c)

Figure 7 Comparison of learning efficiency with or without 2D normal distribution (a) 2D normalization off (b) 2D normalization on(c) 2D normalization on and making a special board type

Complexity 9

[7] W Dabney M Rowland M G Bellemare and R MunosldquoDistributional reinforcement learning with quantile regres-sionrdquo 2017 httparxivorgabs171010044

[8] M Andrychowicz W Filip A Ray et al ldquoHindsight expe-rience replayrdquo in Proceedings of the Advances in Neural In-formation Processing Systems pp 5048ndash5058 Long BeachCA USA December 2017

[9] F Scott H van Hoof and D Meger ldquoAddressing functionapproximation error in actor-critic methodsrdquo 2018 httparxivorgabs180209477

[10] T P Lillicrap J J Hunt P Alexander et al ldquoContinuouscontrol with deep reinforcement learningrdquo 2015 httparxivorgabs150902971

[11] T Haarnoja A Zhou P Abbeel and S Levine ldquoSoft actor-critic off-policy maximum entropy deep reinforcementlearning with a stochastic actorrdquo 2018 httparxivorgabs180101290

[12] M Babaeizadeh I Frosio T Stephen J Clemons andJ Kautz ldquoReinforcement learning through asynchronousadvantage actor-critic on a gpurdquo 2016 httparxivorgabs161106256

[13] V Mnih A P Badia M Mirza et al ldquoAsynchronous methodsfor deep reinforcement learningrdquo in Proceedings of the In-ternational Conference on Machine Learning pp 1928ndash1937New York City NY USA June 2016

[14] John Schulman S Levine P Abbeel M Jordan andP Moritz ldquoTrust region policy optimizationrdquo in Proceedingsof the International Conference on Machine Learningpp 1889ndash1897 Lille France July 2015

[15] John Schulman W Filip P Dhariwal A Radford andO Klimov ldquoProximal policy optimization algorithmsrdquo 2017httparxivorgabs170706347

[16] S-J Yen T-N Yang C Chen and S-C Hsu ldquoPatternmatching in go game recordsrdquo in Proceedings of the SecondInternational Conference on Innovative Computing Infor-mation and Control (ICICIC 2007) p 297 Washington DCUSA 2007

[17] S Gelly and Y Wang ldquoExploration exploitation in go UCTfor Monte-Carlo gordquo in Proceedings of the NIPS NeuralInformation Processing Systems Conference On-Line Trading ofExploration and Exploitation Workshop Barcelona Spain2006

[18] Y Wang and S Gelly ldquoModifications of UCT and sequence-like simulations for Monte-Carlo gordquo in Proceedings of theIEEE Symposium on Computational Intelligence and Gamespp 175ndash182 Honolulu HI USA 2007

[19] S Sharma Z Kobti and S Goodwin ldquoKnowledge generationfor improving simulations in uct for general game playingrdquo inProceedings of the Australasian Joint Conference on ArtificialIntelligence Advances in Artificial Intelligence pp 49ndash55Auckland New Zealand December 2008

[20] X Li Z Lv S Wang Z Wei X Zhang and L Wu ldquoA middlegame search algorithm applicable to low-cost personalcomputer for gordquo IEEE Access vol 7 pp 121719ndash1217272019

[21] N N Schraudolph P Dayan and T J Sejnowski ldquoTemporaldifference learning of position evaluation in the game of gordquoin Proceedings of the Advances in Neural Information Pro-cessing Systems pp 817ndash824 Denver CO USA 1994


Deep learning and deep reinforcement learning algorithms have been used in Go, including the Monte Carlo and upper confidence bound applied to trees (UCT) algorithms [17–20], the temporal difference algorithm [21], deep learning models combined with UCT search [22], and DQN-style algorithms [23, 24]. These methods have achieved very good results in computer Go, which shows that deep reinforcement learning can adapt to the computer game environment. In addition, deep reinforcement learning applied to Backgammon [25–27], Shogi [28–30], chess [31–34], Texas hold'em poker [35, 36], Mahjong [37], and StarCraft II [38] has reached and even exceeded top human performance. There is little doubt that deep reinforcement learning will make good progress when applied to Jiu chess.

At present, Jiu chess is little known because it is mainly played in Tibetan communities. Owing to its special rules and its smaller player base compared with Go and other popular games, research on Jiu chess is very limited. Like Go and chess, Jiu chess is a complete-information game. However, completely unlike Go or chess, the Jiu rules are very special, with two sequential stages, which makes the game tree search path very long. Only three Jiu chess programs appear in the current literature [1, 2, 39], and all of them are weak players, although at different levels. In [1], several important Jiu shapes were first identified, and about 300 playing records were collected and processed; strategies based on chess shapes were designed to defend against the opponent. It was the first Jiu program based on expert knowledge, but it played very weakly because of the limited human knowledge available. In [39], a Bayesian network model, designed to address the small-sample problem of Jiu playing records, evaluated the chessboard rapidly. This method alleviated the extreme lack of expert knowledge to a certain extent, but the model was only useful in the layout stage. In [2], a temporal difference algorithm was used for probability statistics and prediction of chess shapes, which was the first application of reinforcement learning to Jiu chess. It verified that applying SARSA(λ) or Q-learning alone, with suitably different parameters, helps improve the efficiency of recognizing and evaluating the chessboard. However, it did not contribute much to playing strength because it used neither a deep neural network nor UCT search. Q-learning is used not only in games but also widely in other fields, such as wireless networks, to reduce resource costs [40], and it is very promising. Therefore, in this study, we combine the SARSA(λ) and Q-learning algorithms in different stages of UCT search, motivated by achieving better performance at low hardware cost.

4. Learning Methodology

UCT search combined with deep learning and reinforcement learning performs very well for Go, chess, and other games. However, it requires considerable hardware resources to support exploration of the huge state space and training of the deep neural network. To take advantage of reinforcement learning and deep learning while reducing hardware requirements as much as possible, this study proposes an improved deep neural network model combined with UCT search and hybrid online and offline temporal difference (TD) algorithms for Tibetan Jiu chess, which is compact and efficient in self-play learning.

Hybrid TD algorithms are applied to different stages of the UCT search, which minimizes searching in low-value state spaces; thereby, the computational power consumption and training time spent exploring the state space are reduced. According to the unique characteristics of Jiu chess, a TD reward function based on a 2D normal distribution matrix is proposed for the layout stage, enabling the Jiu chess reinforcement learning model to acquire awareness of layout priorities more quickly. The reward function for the battle stage is designed based on Jiu chess shapes. The improved ResNet18-based deep neural network is used for self-play and training.

4.1. UCT Search. The reinforcement learning algorithm used in this study is based on a UCT search algorithm [41, 42] (see Figure 2), which combines Q-learning, SARSA(λ), and expert domain knowledge. When the model performs self-play learning, it searches from the root node, uses SARSA(λ) combined with immediate feedback from domain knowledge in the first three stages (selection, extension, and default policy simulation), and uses Q-learning updates in the backpropagation stage.

In the selection, extension, and default policy simulation stages, the board situation is evaluated through expert knowledge, and this evaluated value is returned to each node on the path using the SARSA(λ) update rule (see the bold routes in the selection, extension, and default policy simulation stages of Figure 2). In the backpropagation stage, the board situation is evaluated through the Q-learning update rule (see the bold route in the backpropagation step of Figure 2).
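To make this schedule concrete, the following minimal Python sketch (not the authors' code) shows where each update rule sits inside one search iteration; `select`, `expand`, `simulate`, `sarsa_update`, and `q_backup` are assumed game-specific callables, and `path` is assumed to be a list of (node, action) pairs from the root to the last in-tree node. Concrete sketches of the update callables are given with equations (2)–(8) below.

```python
# Structural sketch of one hybrid UCT iteration (assumption: path = [(node, action), ...]).
def uct_iteration(root, select, expand, simulate, sarsa_update, q_backup):
    path = select(root)          # selection: descend the tree using eqs. (6)-(8)
    leaf = expand(path[-1][0])   # expansion: add a child and query the network for P and V
    reward = simulate(leaf)      # default-policy simulation scored by expert knowledge
    sarsa_update(path, reward)   # SARSA(lambda)-style update of the visited path, eqs. (2)-(3)
    q_backup(path)               # Q-learning update in the backpropagation stage, eqs. (4)-(5)
    return reward
```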

4.2. Hybrid SARSA(λ) and Q-Learning Algorithm. In this study, a node of the UCT search tree is expressed as

\{S, W, N, P, V, Q, E\}, \qquad (1)

where S represents the state of the node in the search tree, W represents the value of each action of the node, N represents the number of times each action of the node has been selected (for each selected action a, N(s, a) ← N(s, a) + 1), P represents the probability of selecting each action in the state (calculated by the neural network), V is the winning rate estimated by the neural network, Q is the winning rate estimated by the search tree, and E is an auxiliary variable for updating Q. In the selection, extension, and simulation stages, E is updated according to the SARSA(λ) algorithm, which is expressed as follows:

E(s, a) \leftarrow R(s, a) + \gamma Q(s_{i+1}, a_{i+1}) - Q(s, a), \qquad (2)

where R(s, a) is the reward value of the current situation, which can usually be calculated from the board-situation evaluation, γ is the learning rate, s_{i+1} is the next state reached by taking action a, and a_{i+1} is the action selected in state s_{i+1}.

Each node (s_i, a_i) previously searched on the search path is then updated in turn by the following equation:

Q(s_i, a_i) \leftarrow (1 - \gamma_1) Q(s_i, a_i) + \gamma_1 \lambda^{\,c-i} E(s, a), \qquad (3)

where γ₁ is the learning rate of SARSA(λ), i is the state index satisfying 0 ≤ i ≤ c, and c is the number of steps from the root-node state to the state at which the search ends when values are returned in each turn; i = 0 means the search was initialized with the current situation as the search-tree root node, and i = c means the search has ended. In particular, at the end of a game, c is simply the total number of steps from the beginning to the end of the game.
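As an illustration, a minimal Python sketch of this path update is given below. The node record simply holds the fields {S, W, N, P, V, Q, E} of equation (1); the learning rate γ₁ = 0.372 is the value reported in Table 1, while γ, λ, and the zero successor value at the search end are placeholders rather than the authors' exact choices.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Node:
    """UCT node holding the fields {S, W, N, P, V, Q, E} of equation (1)."""
    state: object
    W: Dict[tuple, float] = field(default_factory=dict)  # accumulated action values
    N: Dict[tuple, int] = field(default_factory=dict)    # visit counts N(s, a)
    P: Dict[tuple, float] = field(default_factory=dict)  # priors from the neural network
    V: float = 0.0                                        # winning rate estimated by the network
    Q: Dict[tuple, float] = field(default_factory=dict)  # winning rate estimated by the tree
    E: Dict[tuple, float] = field(default_factory=dict)  # auxiliary TD errors, eq. (2)

def sarsa_lambda_update(path: List[Tuple[Node, tuple]], reward: float,
                        gamma: float = 0.372, gamma1: float = 0.372,
                        lam: float = 0.9) -> None:
    """Eq. (2) at the end of the searched path, then eq. (3) for every earlier node."""
    c = len(path) - 1                      # steps from the root state to the search end
    node, action = path[-1]
    q_next = 0.0                           # placeholder: no successor value at the search end
    node.E[action] = reward + gamma * q_next - node.Q.get(action, 0.0)  # eq. (2)
    td_error = node.E[action]
    for i, (n_i, a_i) in enumerate(path):  # eq. (3): decay by lambda^(c - i)
        old_q = n_i.Q.get(a_i, 0.0)
        n_i.Q[a_i] = (1.0 - gamma1) * old_q + gamma1 * (lam ** (c - i)) * td_error
```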

In the backpropagation stage of the UCT search, or at the end of each game, equation (4) is used to update the Q-values of all nodes on the path, where γ₂ is the learning rate of Q-learning:

Q(s_i, a_i) \leftarrow (1 - \gamma_2) Q(s_i, a_i) + \gamma_2 \max Q(s_{i+1}, a_{i+1}), \qquad (4)

and if s_{i+1} is a leaf node, Q(s_{i+1}, a_{i+1}) is calculated by

Q(s_{i+1}, a_{i+1}) = \frac{W(s_{i+1}, a_{i+1})}{N(s_{i+1}, a_{i+1})}. \qquad (5)
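A matching sketch of the backpropagation update is shown below, again over the assumed list of (node, action) pairs; γ₂ = 0.628 is the Q-learning rate reported in Table 1, and the leaf test used here is an assumption.

```python
def q_learning_backup(path, gamma2: float = 0.628) -> None:
    """Eq. (4) for every node on the path; eq. (5) supplies the target at a leaf."""
    for i in range(len(path) - 1):
        node, action = path[i]
        next_node, next_action = path[i + 1]
        if next_node.Q:                                   # interior node: bootstrap on max Q
            target = max(next_node.Q.values())
        else:                                             # leaf node: eq. (5), W / N
            visits = max(next_node.N.get(next_action, 1), 1)
            target = next_node.W.get(next_action, 0.0) / visits
        node.Q[action] = (1.0 - gamma2) * node.Q.get(action, 0.0) + gamma2 * target
```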

In the selection, extension, and simulation stages of the UCT search algorithm, the hybrid algorithm selects the action a with the maximum score in state s through equation (6), by computing the score of each alternative action:

a = \arg\max_a \big( Q(s, \cdot) + U(s, \cdot) \big). \qquad (6)

For each alternative action a in state s, the comprehensive evaluation function is given by the following equation:

Q(s, a) + U(s, a) = Q(s, a) + c_{\mathrm{puct}} P(s, a) \frac{\sqrt{\sum_b N(s, b)}}{N(s, a) + 1}, \qquad (7)

where Q(s, a) is obtained by equation (3) or (4), depending on the UCT search stage, and U(s, a) is the estimate given by the UCB method, obtained by the following equation:

U(s, a) = c_{\mathrm{puct}} P(s, a) \frac{\sqrt{\sum_b N(s, b)}}{N(s, a) + 1}, \qquad (8)

where c_puct is the parameter balancing exploitation and exploration in the UCT algorithm.
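A small sketch of this selection rule, using the node record sketched above, is given below; it assumes that the candidate actions are exactly those that received a prior P(s, a) from the network, and uses the value c_puct = 0.1 reported in Table 1.

```python
import math

def select_action(node, c_puct: float = 0.1):
    """Choose argmax_a [Q(s, a) + U(s, a)] as in eqs. (6)-(8)."""
    total_visits = sum(node.N.values())        # sum_b N(s, b)
    def score(a):
        u = c_puct * node.P.get(a, 0.0) * math.sqrt(total_visits) / (node.N.get(a, 0) + 1)
        return node.Q.get(a, 0.0) + u          # eq. (7): search value plus exploration bonus
    return max(node.P, key=score)              # eq. (6)
```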

4.3. Feedback Function Based on Domain Knowledge. The Q-value is updated by the combination of the SARSA(λ) and Q-learning algorithms [43, 44]. SARSA(λ) is used to update the nodes on the game path in the selection, expansion, and simulation stages of the UCT search [18]; in the backpropagation stage, Q-learning is used to update the values of all nodes on the search path. This hybrid strategy enables the algorithm to learn to prune the huge state space effectively, improving computing speed and reducing the consumption of hardware resources.

The layout stage of Jiu chess plays a key role in the game and its outcome [1]. The value of a board position decays from the center of the board outward, which resembles the probability distribution of a 2D normal distribution matrix. Therefore, the importance of each intersection in the layout stage can be approximated by a discrete 2D normal distribution (see Figure 3).

Figure 2: UCT algorithm for Jiu chess (selection, expansion, default policy simulation, and backpropagation).

In the battle stage, constructing the chess shapes discussed in [1, 2] is very important for gaining a victory. The feedback function used by the SARSA(λ) and Q-learning algorithms is shown in equation (9), where ∑_i V_i(s) = V_chain + V_gate + V_square:

R(s, a) = \begin{cases} f(x, y), & s \text{ in layout stage}, \\ \sum_i V_i(s), & s \text{ in battle stage}, \end{cases} \qquad (9)

where f(x, y), the joint probability density of X and Y, is given by the following equation [39, 45]:

f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\!\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2} \right] \right\}, \qquad (10)

where X and Y represent chessboard coordinates and μ₁, μ₂ represent the mean values over the chessboard range (x, y ∈ [0, 13], σ₁ = σ₂ = 2, μ₁ = μ₂ = 6.5, ρ = 0). V_chain, V_gate, and V_square are approximations summarized from the Jiu chess rules [1, 2] and experience: V_chain(s) = 7c_chain, where c_chain is the number of chains on the chessboard in the current state; V_gate(s) = 3c_gate, where c_gate is the number of gates on the chessboard in the current state; and V_square(s) = 4c_square, where c_square is the number of squares on the chessboard in the current state.
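The two branches of equation (9) can be sketched as follows; the shape-counting helpers passed into `battle_reward` are assumptions standing in for the program's board-analysis routines, while the density parameters and the 7/3/4 weights are the values given above.

```python
import math

def layout_reward(x: int, y: int,
                  sigma1: float = 2.0, sigma2: float = 2.0,
                  mu1: float = 6.5, mu2: float = 6.5) -> float:
    """Eq. (10) with rho = 0: product of two independent 1D normal densities."""
    zx = (x - mu1) / sigma1
    zy = (y - mu2) / sigma2
    return math.exp(-0.5 * (zx * zx + zy * zy)) / (2.0 * math.pi * sigma1 * sigma2)

def battle_reward(board, count_chains, count_gates, count_squares) -> float:
    """Eq. (9), battle stage: V_chain + V_gate + V_square with weights 7, 3, and 4."""
    return (7 * count_chains(board)
            + 3 * count_gates(board)
            + 4 * count_squares(board))
```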

Using this 2D-normal-distribution-based probability matrix, built on expert knowledge, as the feedback function of the TD algorithm in the layout stage yields better search simulations and self-play performance under deep search depths and large branching factors, and the model learns a reasonable layout strategy better and faster [46].

4.4. Improved Deep Neural Network Based on ResNet18. Strategy prediction and board-situation evaluation are realized using a deep neural network, expressed as a function with parameters θ as shown in equation (11), where s is the current state, p is the output strategy prediction, and v is the board-situation evaluation:

(p, v) = f_{\theta}(s). \qquad (11)

The ResNet series of deep convolutional neural networks shows good robustness in image classification and object detection [3]. In this study, a ResNet18-based network with a modified fully connected layer is used; it simultaneously outputs the strategy prediction p and the board-situation evaluation v (see Figure 4).

Unlike ResNet18 [3], a shared fully connected layer with 4096 neurons is used as the hidden layer. The original fully connected layer is replaced by a 196-dimensional fully connected output layer for the move strategy p and a one-dimensional fully connected layer for the estimation v of the current board status. The neural network predicts and evaluates the states of nodes that have not previously been evaluated and outputs (p, v) to the UCT search tree node, so that P(s_i) = p and V(s_i) = v.

The number of input feature maps is two, that is, there are two channels. The first channel encodes the positions and colors of the pieces in the chessboard state: white pieces are represented by −1 and black pieces by 1. The second channel encodes the position of the newly placed piece in the current chessboard state: positions without a newly placed piece are represented by 0 and the placed position by 1. All convolution layers use ReLU as their activation function, and batch normalization is performed after each convolution.
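The description above can be mirrored by the following illustrative PyTorch sketch (the paper itself used CNTK): the 2-channel 14 × 14 input, the ResNet18 trunk, the shared 4096-unit hidden layer, the 196-way policy head, and the scalar value head follow the text, while the 3 × 3 input convolution, the removed max pooling, and the tanh on the value output are assumptions made to fit the small board.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class JiuNet(nn.Module):
    """Two-headed ResNet18-style network: (p, v) = f_theta(s), eq. (11)."""
    def __init__(self):
        super().__init__()
        trunk = resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(2, 64, kernel_size=3, stride=1, padding=1, bias=False)
        trunk.maxpool = nn.Identity()            # keep spatial detail on the 14 x 14 board
        trunk.fc = nn.Identity()                 # expose the 512-d trunk features
        self.trunk = trunk
        self.hidden = nn.Sequential(nn.Linear(512, 4096), nn.ReLU())  # shared hidden layer
        self.policy_head = nn.Linear(4096, 196)  # one output per intersection (14 x 14)
        self.value_head = nn.Linear(4096, 1)

    def forward(self, x: torch.Tensor):
        h = self.hidden(self.trunk(x))
        p = torch.softmax(self.policy_head(h), dim=-1)   # move strategy p
        v = torch.tanh(self.value_head(h))               # board evaluation v
        return p, v
```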

At the end of the game, we replay the experience and output the dataset {(s, π, z)_i | 0 ≤ i ≤ c} to adjust the parameters of the deep neural network. The loss function used is shown in the following equation:

l = (z - v)^2 - \pi^{T} \ln p + c \lVert \theta \rVert^{2}. \qquad (12)

In the i-th state of the path, π_i = N_i / ∑N_i and z_i = W_i / N_i.

Figure 3: Approximate 2D normal distribution matrix (X, Y: chessboard coordinates; Z: static evaluation value).


The training of the neural network is carried out with a mini-batch size of 100, and the learning rate is set to 0.1. At the end of each game, the number of samples (turns) sent to neural network training is about 1 × 10³ to 2 × 10³.
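A corresponding sketch of the training objective is shown below; the regularization coefficient c and the use of plain SGD are placeholders, while the learning rate of 0.1 and the mini-batch size of 100 are the values stated above, and `JiuNet` refers to the network sketched earlier.

```python
import torch

def loss_fn(p, v, pi, z, model, c: float = 1e-4):
    """Eq. (12): squared value error, policy cross-entropy, and L2 regularization."""
    value_term = torch.mean((z - v.squeeze(-1)) ** 2)
    policy_term = -torch.mean(torch.sum(pi * torch.log(p + 1e-8), dim=-1))
    l2_term = c * sum(torch.sum(w ** 2) for w in model.parameters())
    return value_term + policy_term + l2_term

# model = JiuNet(); optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # batch size 100
```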

5. Results

The experiment was performed with the following parameters: c_puct, step count, and learning rate (see Table 1). We compare the efficiency of three methods of updating the Q-value: hybrid Q-learning with SARSA(λ), pure Q-learning, and pure SARSA(λ). The server and hardware configuration used in the experiment is also shown in Table 1.

5.1. The Improvement of Learning and Understanding Ability. An important measure of the learning ability of a Jiu chess agent is how quickly it learns to form squares in the layout stage. Therefore, we calculate the influence of the hybrid update strategy, compared with the pure Q-learning and pure SARSA(λ) algorithms, on the total number of squares formed over 200 steps of self-play training, as shown in Table 2. When using the hybrid update strategy, the total number of squares in 200 steps is almost the sum of the numbers of squares occupied when using pure Q-learning or pure SARSA(λ) alone, showing that the hybrid update strategy can effectively train a Jiu chess agent to understand the game of Jiu chess and learn it well.

We also counted the number of squares per 20 steps over the 200-step training process. Figure 5 shows that the number of squares occupied when using the hybrid update strategy is significantly larger than when using the pure Q-learning or SARSA(λ) algorithms. Especially in the first 80 steps, the hybrid update strategy is significantly more effective than pure Q-learning or SARSA(λ). From step 140 to step 200, the hybrid update strategy shows a slight drop compared with the first 120 steps; this is because the parameter c_puct, fixed at 0.1, should decrease in later training to avoid useless exploration. These results indicate that the chess strength trained by the hybrid update strategy is stronger than that of pure Q-learning or SARSA(λ). In addition, this strategy produces stronger learning and understanding ability as well as better stability.

5.2. Feedback Function Experiment. In order to improve the reinforcement learning efficiency of Jiu chess, a feedback function based on a 2D normal distribution matrix is added to the reinforcement learning game model as an auxiliary measure for value updating. To test the effect of this measure, we conducted 7 days and 110 instances of self-play training and obtained data with and without the 2D-normal-distribution-assisted feedback mechanism for comparison.

Table 3 and Figure 6 show that the total number of squares occupied when using the 2D normal distribution assistant in the self-play training process is about three times that of the program without it, and the average value is likewise close to three times as large.

Figure 4: The deep neural network structure (improved ResNet18 for Jiu with policy and value outputs over the 14 × 14 board).

Table 1: The experimental parameters.

Toolkit package: CNTK 2.7
Runtime: .NET 4.7.2
Operating system: Windows 10
Central processing unit: AMD 2700X, 4.0 GHz
Random access memory: 32 GB
Graphics processing unit: RTX 2070
Threads used: 16
Development environment: Visual Studio 2017 Community
c_puct: 0.1
Step: 200
Learning rate: 0.001
γ (Q-learning): 0.628
γ (SARSA(λ)): 0.372

Table 2: Average number of squares per 20 steps in 200-step training.

Q-learning + SARSA(λ)    Q-learning    SARSA(λ)
8604                     4248          5571


The total number of squares occupied when using the 2D normal distribution assistant during training is significantly larger than without the assistant. The 2D normal distribution auxiliary matrix is, in effect, an application of expert knowledge: the learning ability in the layout stage improves about three times with the auxiliary matrix.

We also compared learning efficiency with and without the 2D normal distribution assistant from the perspective of the quality of specific layouts. As shown in Figure 7(a), after 10 days of training, the program without the aid of the 2D normal distribution has learned little about the layout. With the 2D normal distribution matrix assisting training, the program learns specific chess patterns in the layout stage faster.

Figure 5: Number of squares employing different algorithms (square count per 20 training steps for Q-learning + SARSA(λ), Q-learning, and SARSA(λ)).

Table 3: The number of squares with 2D normalization on or off.

            2D normalization off    2D normalization on
Sum         1524                    5005
Average     13                      45

Figure 6: Comparison of the number of squares with or without the 2D normal distribution (square count versus training steps, with the assistant on and off).


It can also form a triangle (chess gate) and even form squares (the most important chess shape) in the battle stage of the Jiu chess game, as shown in Figures 7(b) and 7(c).

The above experiments show that the feedback function based on the two-dimensional normal distribution matrix can reduce the amount of computation of the Jiu chess program and improve its self-learning efficiency, because in the layout stage the importance of each chessboard position is close to the two-dimensional normal distribution matrix. Because a Jiu chess game can last more than 1000 moves and the search iterates slowly, the evaluation of the layout stage is optimized using the two-dimensional normal distribution matrix, which in effect models expert knowledge. The feedback value of the model indirectly reduces blind exploration of chessboard state-action transition values, so better results are obtained in less time.

6. Conclusion

In this study, a deep reinforcement learning model combined with expert knowledge learns the rules of the game faster in a short time. The combination of the Q-learning and SARSA(λ) algorithms enables the neural network to learn more valuable chess strategies in a short time, which provides a reference for improving the learning efficiency of deep reinforcement learning models. The model produces good layout results using the two-dimensional normal distribution matrix that encodes expert knowledge, which also shows that deep reinforcement learning can shorten learning time by incorporating expert knowledge reasonably. The experimental results also verify the good performance of the improved ResNet18, which makes training of the deep reinforcement learning model more effective at low resource cost.

Inspired by Wu et al. [47, 48], in future work we will consider improving Jiu chess strength not only with state-of-the-art techniques but also from a holistic social-good perspective. We will collect and process much more Jiu chess game data to establish a big data resource, which can provide value for lightweight reinforcement learning models and reduce the cost of computing resources. We will also explore combining multiagent theory [49] and mechanisms [50] to reduce the network delay of the proposed reinforcement learning model, in addition to the strategies proposed in [39, 41].

Data Availability

Data are available via e-mail: [email protected].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was funded by the National Natural Science Foundation of China (61873291 and 61773416) and the MUC 111 Project.

References

[1] X. Li, S. Wang, Z. Lv, Y. Li, and L. Wu, "Strategies research based on chess shape for Tibetan JIU computer game," International Computer Games Association Journal (ICGA), vol. 40, no. 3, pp. 318–328, 2019.

[2] X. Li, Z. Lv, S. Wang, Z. Wei, and L. Wu, "A reinforcement learning model based on temporal difference algorithm," IEEE Access, vol. 7, pp. 121922–121930, 2019.

[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, June 2016.

[4] D. SuodaChucha and P. Shotwell, "The research of Tibetan chess," Journal of Tibet University, vol. 9, no. 2, pp. 5–10, 1994.

[5] V. Mnih, K. Kavukcuoglu, D. Silver, et al., "Playing Atari with deep reinforcement learning," 2013, http://arxiv.org/abs/1312.5602.

[6] D. Stern, R. Herbrich, and T. Graepel, "Bayesian pattern ranking for move prediction in the game of go," in Proceedings of the 23rd International Conference on Machine Learning, pp. 873–880, ACM, Pittsburgh, PA, USA, 2006.

Figure 7: Comparison of learning efficiency with or without the 2D normal distribution: (a) 2D normalization off; (b) 2D normalization on; (c) 2D normalization on, forming a special board type.


[7] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, "Distributional reinforcement learning with quantile regression," 2017, http://arxiv.org/abs/1710.10044.

[8] M. Andrychowicz, W. Filip, A. Ray, et al., "Hindsight experience replay," in Proceedings of the Advances in Neural Information Processing Systems, pp. 5048–5058, Long Beach, CA, USA, December 2017.

[9] F. Scott, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018, http://arxiv.org/abs/1802.09477.

[10] T. P. Lillicrap, J. J. Hunt, P. Alexander, et al., "Continuous control with deep reinforcement learning," 2015, http://arxiv.org/abs/1509.02971.

[11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor," 2018, http://arxiv.org/abs/1801.01290.

[12] M. Babaeizadeh, I. Frosio, T. Stephen, J. Clemons, and J. Kautz, "Reinforcement learning through asynchronous advantage actor-critic on a GPU," 2016, http://arxiv.org/abs/1611.06256.

[13] V. Mnih, A. P. Badia, M. Mirza, et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1928–1937, New York City, NY, USA, June 2016.

[14] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proceedings of the International Conference on Machine Learning, pp. 1889–1897, Lille, France, July 2015.

[15] J. Schulman, W. Filip, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, http://arxiv.org/abs/1707.06347.

[16] S.-J. Yen, T.-N. Yang, C. Chen, and S.-C. Hsu, "Pattern matching in go game records," in Proceedings of the Second International Conference on Innovative Computing, Information and Control (ICICIC 2007), p. 297, Washington, DC, USA, 2007.

[17] S. Gelly and Y. Wang, "Exploration exploitation in go: UCT for Monte-Carlo go," in Proceedings of the NIPS Neural Information Processing Systems Conference On-Line Trading of Exploration and Exploitation Workshop, Barcelona, Spain, 2006.

[18] Y. Wang and S. Gelly, "Modifications of UCT and sequence-like simulations for Monte-Carlo go," in Proceedings of the IEEE Symposium on Computational Intelligence and Games, pp. 175–182, Honolulu, HI, USA, 2007.

[19] S. Sharma, Z. Kobti, and S. Goodwin, "Knowledge generation for improving simulations in UCT for general game playing," in Proceedings of the Australasian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence, pp. 49–55, Auckland, New Zealand, December 2008.

[20] X. Li, Z. Lv, S. Wang, Z. Wei, X. Zhang, and L. Wu, "A middle game search algorithm applicable to low-cost personal computer for go," IEEE Access, vol. 7, pp. 121719–121727, 2019.

[21] N. N. Schraudolph, P. Dayan, and T. J. Sejnowski, "Temporal difference learning of position evaluation in the game of go," in Proceedings of the Advances in Neural Information Processing Systems, pp. 817–824, Denver, CO, USA, 1994.

[22] D. Silver, A. Huang, C. J. Maddison, et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[23] D. Silver, J. Schrittwieser, K. Simonyan, et al., "Mastering the game of go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.

[24] D. Silver, T. Hubert, J. Schrittwieser, et al., "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.

[25] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.

[26] G. Tesauro, "Practical issues in temporal difference learning," in Proceedings of the Advances in Neural Information Processing Systems, pp. 259–266, Denver, CO, USA, 1992.

[27] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Computation, vol. 6, no. 2, pp. 215–219, 1994.

[28] D. F. Beal and M. C. Smith, "First results from using temporal difference learning in shogi," in Proceedings of the International Conference on Computers and Games, pp. 113–125, Springer, Tsukuba, Japan, November 1998.

[29] R. Grimbergen and H. Matsubara, "Pattern recognition for candidate generation in the game of shogi," Games in AI Research, pp. 97–108, 1997.

[30] Y. Sato, D. Takahashi, and R. Grimbergen, "A shogi program based on Monte-Carlo tree search," ICGA Journal, vol. 33, no. 2, pp. 80–92, 2010.

[31] S. Thrun, "Learning to play the game of chess," in Proceedings of the Advances in Neural Information Processing Systems, pp. 1069–1076, Denver, CO, USA, 1995.

[32] I. Bratko, D. Kopec, and D. Michie, "Pattern-based representation of chess end-game knowledge," The Computer Journal, vol. 21, no. 2, pp. 149–153, 1978.

[33] Y. Kerner, "Learning strategies for explanation patterns: basic game patterns with application to chess," in Proceedings of the International Conference on Case-Based Reasoning, pp. 491–500, Springer, Sesimbra, Portugal, October 1995.

[34] M. Campbell, A. J. Hoane Jr., and F.-h. Hsu, "Deep Blue," Artificial Intelligence, vol. 134, no. 1-2, pp. 57–83, 2002.

[35] N. Brown and T. Sandholm, "Safe and nested subgame solving for imperfect-information games," in Proceedings of the Advances in Neural Information Processing Systems, pp. 689–699, Long Beach, CA, USA, December 2017.

[36] N. Brown and T. Sandholm, "Superhuman AI for heads-up no-limit poker: Libratus beats top professionals," Science, vol. 359, no. 6374, pp. 418–424, 2018.

[37] https://www.msra.cn/zh-cn/news/features/mahjong-ai-suphx, 2019.

[38] V. Zambaldi, D. Raposo, S. Adam, et al., "Relational deep reinforcement learning," 2018, http://arxiv.org/abs/1806.01830.

[39] S. T. Deng, "Design and implementation of JIU game prototype system and multimedia courseware," Dissertation, Minzu University of China, Beijing, China, 2017.

[40] X. Chen, J. Wu, Y. Cai, H. Zhang, and T. Chen, "Energy-efficiency oriented traffic offloading in wireless networks: a brief survey and a learning approach for heterogeneous cellular networks," IEEE Journal on Selected Areas in Communications, vol. 33, no. 4, pp. 627–640, 2015.

[41] T. Pepels, M. H. M. Winands, and M. Lanctot, "Real-time Monte Carlo tree search in Ms Pac-Man," IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 3, pp. 245–257, 2014.

[42] C. F. Sironi and M. H. M. Winands, "Comparing randomization strategies for search-control parameters in Monte-Carlo tree search," in Proceedings of the 2019 IEEE Conference on Games (CoG), pp. 1–8, IEEE, London, UK, August 2019.

[43] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[44] F. Xiao, Q. Liu, Q.-M. Fu, H.-K. Sun, and L. Gao, "Gradient descent Sarsa(λ) algorithm based on the adaptive potential function shaping reward mechanism," Journal of China Institute of Communications, vol. 34, no. 1, pp. 77–88, 2013.

[45] E. Estrada, E. Hameed, M. Langer, and A. Puchalska, "Path Laplacian operators and superdiffusive processes on graphs. II. Two-dimensional lattice," Linear Algebra and Its Applications, vol. 555, pp. 373–397, 2018.

[46] K. Lloyd, N. Becker, M. W. Jones, and R. Bogacz, "Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats," Frontiers in Computational Neuroscience, vol. 6, p. 87, 2012.

[47] J. Wu, S. Guo, J. Li, and D. Zeng, "Big data meet green challenges: big data toward green applications," IEEE Systems Journal, vol. 10, no. 3, pp. 888–900, 2016.

[48] J. Wu, S. Guo, H. Huang, W. Liu, and Y. Xiang, "Information and communications technologies for sustainable development goals: state-of-the-art, needs and perspectives," IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp. 2389–2406, 2018.

[49] B. Liu, N. Xu, H. Su, L. Wu, and J. Bai, "On the observability of leader-based multiagent systems with fixed topology," Complexity, vol. 2019, pp. 1–10, 2019.

[50] H. Su, J. Zhang, and X. Chen, "A stochastic sampling mechanism for time-varying formation of multiagent systems with multiple leaders and communication delays," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 12, pp. 3699–3707, 2019.

Complexity 11

evaluation c is learning rate si+1 is the next state searchedby taking action a and ai+1 is the action selected in state si

(e node (si ai) which is previously searched in thesearching path is updated in turn by the following equation

Q si ai( 1113857⟵ 1 minus c1( 1113857Q si ai( 1113857 + c1λcminus iE(s a) (3)

where c1 is learning rate of SARSA(λ) i is denoted as thestate index satisfying 0le ile c c is the number of steps fromthe status of the root node to the status of the search endingat the stage of value return in each turn i 0 represents thesearching is initialized by taking the current situation as thesearch tree root node and i c represents the searching isended Especially at the end of the game c is simply the totalnumber of steps from the beginning to the end of the game

In the backpropagation stage of UCT search or at thetime of each game end equation (4) is used to update theQ-values of all nodes on the path where c2 is learning rate ofQ-Learning

Q si ai( 1113857⟵ 1 minus c2( 1113857Q si ai( 1113857 + c2 maxQ si+1 ai+1( 1113857

(4)

and if si+1 is a leaf node Q(si+1 ai+1) will be calculated by

Q si+1 ai+1( 1113857 W si+1 ai+1( 1113857

N si+1 ai+1( 1113857 (5)

In the selection extension and simulation stages of theUCTsearch algorithm this hybrid algorithm selects action a

with the maximum value from state s through equation (6)by calculating the score of each alternative action

a argmaxa(Q(s middot) + U(s middot)) (6)

For each alternative action a at state s there is thecomprehensive evaluation function which is represented bythe following equation

Q(s a) + U(s a) Q(s a) + cpuctP(s a)

1113936N(s middot)

N(s a) + 1

1113971

(7)

where Q(s a) is obtained by equation (3) or (4) whichdepends on different tree UCT search stages and U(s a)

represents the estimation value by the UCBmethod which isobtained by the following equation

U(s a) cpuctP(s a)

1113936N(s middot)

N(s a) + 1

1113971

(8)

where cpuct represents the parameter balancing the exploi-tation and exploration of UCT algorithm

43 Feedback Function Based on Domain KnowledgeQ-value is updated by the combination of the SARSA(λ) andQ-learning algorithms [43 44] SARSA(λ) is used to updatethe nodes on the game path in the selection expansion andsimulation stages of the UCT [18] In the backpropagationstage Q-learning is used to update the values of all nodes inthe search path(e hybrid strategy enables the algorithm tolearn to prune the huge state space effectively improvingcomputing speed and reducing the consumption of hard-ware resources

(e layout stage of Jiu chess plays a key role in the gameand its outcome [1] (e value of the board position decaysfrom the center of the board to the outside which is similarto the probability distribution of a 2D normal distributionmatrix (erefore the importance of each intersection in thelayout stage can be approximated by the 2D discrete normaldistribution (see Figure 3)

In the battle stage constructing the chess shapes discussedin [1 2] is very important in gaining a victory (e feedbackfunction used by the SARSA(λ) and Q-learning algorithm isshown in equation (9) 1113936iVi(s) Vchain+ Vgate + Vsquare

Q + U

Q + U

Q + U

P w

w

w

w

w

Selection Expansion Default policy simulation

Backpropagation

Repeat

Q + U

Figure 2 UCT algorithm for Jiu chess

Complexity 5

R(s a) f(x y) s in layout stage

1113936iVi(s) s in battle stage1113896 (9)

where f(x y) is the joint probability density of X and Y isshown in the following equation [39 45]

f(x y) 1

2πσ1σ21 minus ρ2

1113968 eminus 1 2 1minus ρ2( )( )( ) xminus μ1( )

2σ21minus 2ρ xminus μ1( ) yminus μ2( )σ1σ2( )+ yminus μ2( )2σ22( 1113857( 1113857

(10)

where X and Y represent chessboard coordinates and μrepresents the mean value of the range of the chessboard(x y isin [0 13] σ1 σ2 2 μ1 u2 65 ρ 0)Vchain Vgate and Vsquare are approximations summarizedthrough the Jiu chess rules [1 2] and experienceVchain(s) 7cchain where cchain is the sum of the number ofsquares on the chessboard in the current stateVgate(s) 3cgate where cgate is the sum of the number ofgates on the chessboard in the current state andVsquare(s) 4csquare where csquare is the sum of the number ofsquares on the chessboard in the current state

Using the 2D normal distribution approximate ma-trix probability distribution based on expert knowledgeas the feedback function of the TD algorithm in thelayout stage we can produce better search simulationsand self-play game performance in the case of deepsearch depths and large search branches and learn areasonable chess strategy in the layout stage better andfaster [46]

44 Improved Deep Neural Network Based on ResNet18Strategy prediction and board situation evaluation are re-alized using a deep neural network which is expressed as afunction with parameter θ as shown in equation (11) wheres is the current state p is the output strategy prediction andv is the value of board situation evaluation

(p v) fθ(s) (11)(e RESNETseries of deep convolution neural networks

shows good robustness in image classification and targetdetection [3] In this study a ResNet18-based network isused to transform the full connection layer It can

simultaneously output strategy prediction p and boardsituation evaluation v (see Figure 4)

Unlike ResNet18 [3] a public full connection layerwith 4096 neuron nodes is used as the hidden layer (efully connected layer is replaced by a 196-dimensionaloutput fully connected layer to output drop strategy p anda one-dimensional fully connected layer to output theestimation v of the current board status (e neuralnetwork is used to predict and evaluate the state of nodesthat have not previously been evaluated (e neuralnetwork outputs p v to the UCT search tree node so thatP(si) p V(si) v

(e number of input characteristic graphs is two Itmeans that there are two channels (e first channel is theposition and color of the pieces in the chessboard stateWhite pieces are represented by minus1 and black ones arerepresented by 1 (e second channel is the position of thefalling pieces in the current state of the chessboard (epositions in which pieces do not fall are represented by 0and the position of the falling pieces is represented by 1All convolution layers use ReLu as their activationfunction and batch normalization is performed afterconvolution

At the end of the game we play back the experience andoutput the dataset ile 0lefc |(s π z)i1113864 1113865 to adjust the pa-rameters of the deep neural network (e loss function usedis shown in the following equation

l (z minus v)2

minus πT ln p + cθ2 (12)

In the i-th state of the path πi NiNi andzi WiNi (e training of the neural network is carried out

Z static evaluation value

X horizontal coordinates of chess board

Y vertical coordinates of

chess board

Figure 3 Approximate 2D normal distribution matrix

6 Complexity

with amin-batch parameter of 100 and the learning rate is setto 01 At the end of each game the number of samples(turns) sent to neural network training is about 1 times 103 to2 times 103

5 Results

(e experiment was performed with the following param-eters cpuct step and learning rate (see Table 1) We comparethe efficiency of three methods of updating the Q valuehybrid Q-learning with SARSA(λ) pure Q-learning andpure SARSA(λ) (e server and hardware configurationused in the experiment is shown in Table 1

51Ae Improvement of Learning and Understanding AbilityIt is important to measure the learning ability of a Jiu chessagent that can quickly learn to form squares in the layoutstage (erefore we calculate the influence of the hybridupdate strategy compared to pure Q-learning or SARSA(λ)

algorithms on the sum of the squares of 200 steps of self-playgame training as shown in Table 2 When using the hybridupdate strategy the total number of squares in 200 steps isalmost the sum of the number of squares occupied whenonly using pure Q-learning or SARSA(λ) proving that thehybrid update strategy can effectively train a Jiu chess agentto understand the game of Jiu chess and learn it well

We also counted the number of squares per 20 steps inthe 200-step training process Figure 5 shows that thenumber of squares occupied when using the hybrid updatestrategy is significantly more than that when using pureQ-learning or SARSA(λ) algorithms Especially in first 80steps hybrid updating strategy is significantly more effectivethan pure Q-learning or SARSA(λ) algorithms And fromstep 140 to step 200 hybrid updating strategy has a littledrop compared to the first 120 steps It is because parametercpuct at 01 must decrease to fit later training to avoid uselessexploration (is result indicates that the chess powertrained by the hybrid update strategy is stronger than that ofpure Q-learning or SARSA(λ) In addition this strategy

produces stronger learning and understanding as well asbetter stability

52 FeedbackFunctionExperiment In order to improve thereinforcement learning efficiency of Jiu chess a feedbackfunction based on a 2D normal distribution matrix isadded to the reinforcement learning game model as anauxiliary measure for updating value To test the effect ofthis measure we conducted 7 days and 110 instances ofself-play training and obtained data with and without the2D-normal-distribution-assisted feedback mechanism forcomparison

Table 3 and Figure 6 show that the total number ofsquares occupied when using the 2D normal distributionassistant in the self-play training process is about three timesthat of programs that do not the average value is close to

Value

Policy

ImprovedResNet18 for

JIU

123456789

1011121314

123456789

1011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

Figure 4 (e deep neural network structure

Table 1 (e experimental parameters

Toolkit package CNTK 27Runtime Net 472Operating system Windows 10Central processing unit AMD 2700X40GHzRandom access memory 32GBGraphics processing unit RTX2070(reads used 16Development environment Visual Studio 2017 Communitycpuct 01Step 200Learning rate 0001c(Q-learning) 0628c(SARSA(λ)) 0372

Table 2 Average number of squares per 20 steps in 200-steptraining

Q-learning + SARSA(λ) Q-learning SARSA(λ)

8604 4248 5571

Complexity 7

three times (e total number of squares occupied whenusing the 2D normal distribution assistant during training issignificantly more than that without the assistant Two-di-mensional normal distribution auxiliary matrix shows theapplication of expert knowledge in fact (e learning abilityto play in the layout has improved about 3 times with thetwo-dimensional normal distribution auxiliary matrix

We also compare the learning efficiency of using a 2Dnormal distribution assistant program from the perspectiveof the quality of specific layouts As shown in Figure 7(a) in10 days of training the program without the aid of the 2Dnormal distribution has little knowledge of the layout Afterusing the 2D normal distribution matrix to assist in trainingthe layout process can learn specific chess patterns faster

Squa

re co

unt

Q-learning + SARSA(λ)Q-learningSARSA(λ)

0

10

20

30

40

50

60

20 40 60 80 100 120 140 160 180 2000Training steps

Figure 5 Number of squares employing different algorithms

Table 3 (e number of squares with 2D normalization on or off

2D normalization off 2D normalization onSum 1524 5005Average 13 45

Squa

re co

unt

2D norm on 2D norm off

50 1000Training steps

0

30

60

90

120

Squa

re co

unt

0

30

60

90

120

50 1000Training steps

Figure 6 Comparison of the number of squares with or without 2D normal distribution

8 Complexity

form a triangle (chess gate) and even form squares (the mostimportant chess shape) in the fighting stage of Jiu chessgame as shown in Figures 7(b) and 7(c)

(e above experiments show that the feedback functionbased on the two-dimensional normal distribution matrixcan reduce the calculation amount of Jiu chess program andimprove the self-learning efficiency of the program becausein the layout stage the importance of the chessboard po-sition is close to the two-dimensional normal distributionmatrix Because Jiu chess can have more than 1000 matchingsteps and slow search iteration process the evaluation oflayout stage is optimized by using two-dimensional normaldistribution matrix which is actually the knowledge ofmodel experts (e feedback value of the model indirectlyreduces the process of blindly exploring the transfer value ofchessboard state action so it takes less time to get betterresults

6 Conclusion

In this study the deep reinforcement learning modelcombined with expert knowledge can learn the rules of thegame faster in a short time (e combination of Q-learningand SARSA(λ) algorithms can make the neural networklearn more valuable chess strategies in a short time whichprovides a reference for improving the learning efficiency ofthe deep reinforcement learning model (e deep rein-forcement learning model has produced good layout resultsobtained by the two-dimensional normal distributionmatrixof expert knowledge modeling which also proves that deepreinforcement learning can shorten the learning time bycombining expert knowledge reasonably (e better per-formance of ResNet18 which was used to make the deepreinforcement learning model training more effectively atlow resources cost has been verified by experimental results

Inspired byWu et al [47 48] we consider to improve Jiuchess power not only from the state-of-the-art of technol-ogies but also from the holistic social good perspectives inthe future work We will collect and process much more Jiuchess game data to establish the big data resource which canturn big values for using light reinforcement learning model

to reduce the cost of computing resources We will alsoexplore the possibilities of combining multiagent theory [49]and mechanism [50] to reduce the network delay of theproposed reinforcement model besides using the strategiesproposed in [39 41]

Data Availability

Data are available via the e-mail xiaer_li163com

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(is study was funded by the National Natural ScienceFoundation of China (61873291 and 61773416) and theMUC 111 Project

References

[1] X Li S Wang Z Lv Y Li and L Wu ldquoStrategies researchbased on chess shape for Tibetan JIU computer gamerdquo In-ternational Computer Games Association Journal (ICGA)vol 40 no 3 pp 318ndash328 2019

[2] X Li Z Lv S Wang Z Wei and L Wu ldquoA reinforcementlearning model based on temporal difference algorithmrdquoIEEE Access vol 7 pp 121922ndash121930 2019

[3] K He X Zhang S Ren and J Sun ldquoDeep residual learningfor image recognitionrdquo in Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 770ndash778 LasVegas NV USA June 2016

[4] D SuodaChucha and P Shotwell ldquo(e research of Tibetanchessrdquo Journal of Tibet University vol 9 no 2 pp 5ndash10 1994

[5] V Mnih K Kavukcuoglu D Silver et al ldquoPlaying atari withdeep reinforcement learningrdquo 2013 httparxivorgabs13125602

[6] D Stern R Herbrich and T Graepel ldquoBayesian patternranking for move prediction in the game of gordquo in Proceedingsof the 23rd International Conference on Machine Learningpp 873ndash880 ACM Pittsburgh PA USA 2006

123456789

1011121314

123456789

1011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(a)

123456789

1011121314

1234567891011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(b)

123456789

1011121314

1234567891011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(c)

Figure 7 Comparison of learning efficiency with or without 2D normal distribution (a) 2D normalization off (b) 2D normalization on(c) 2D normalization on and making a special board type

Complexity 9

[7] W Dabney M Rowland M G Bellemare and R MunosldquoDistributional reinforcement learning with quantile regres-sionrdquo 2017 httparxivorgabs171010044

[8] M Andrychowicz W Filip A Ray et al ldquoHindsight expe-rience replayrdquo in Proceedings of the Advances in Neural In-formation Processing Systems pp 5048ndash5058 Long BeachCA USA December 2017

[9] F Scott H van Hoof and D Meger ldquoAddressing functionapproximation error in actor-critic methodsrdquo 2018 httparxivorgabs180209477

[10] T P Lillicrap J J Hunt P Alexander et al ldquoContinuouscontrol with deep reinforcement learningrdquo 2015 httparxivorgabs150902971

[11] T Haarnoja A Zhou P Abbeel and S Levine ldquoSoft actor-critic off-policy maximum entropy deep reinforcementlearning with a stochastic actorrdquo 2018 httparxivorgabs180101290

[12] M Babaeizadeh I Frosio T Stephen J Clemons andJ Kautz ldquoReinforcement learning through asynchronousadvantage actor-critic on a gpurdquo 2016 httparxivorgabs161106256

[13] V Mnih A P Badia M Mirza et al ldquoAsynchronous methodsfor deep reinforcement learningrdquo in Proceedings of the In-ternational Conference on Machine Learning pp 1928ndash1937New York City NY USA June 2016

[14] John Schulman S Levine P Abbeel M Jordan andP Moritz ldquoTrust region policy optimizationrdquo in Proceedingsof the International Conference on Machine Learningpp 1889ndash1897 Lille France July 2015

[15] John Schulman W Filip P Dhariwal A Radford andO Klimov ldquoProximal policy optimization algorithmsrdquo 2017httparxivorgabs170706347

[16] S-J Yen T-N Yang C Chen and S-C Hsu ldquoPatternmatching in go game recordsrdquo in Proceedings of the SecondInternational Conference on Innovative Computing Infor-mation and Control (ICICIC 2007) p 297 Washington DCUSA 2007

[17] S Gelly and Y Wang ldquoExploration exploitation in go UCTfor Monte-Carlo gordquo in Proceedings of the NIPS NeuralInformation Processing Systems Conference On-Line Trading ofExploration and Exploitation Workshop Barcelona Spain2006

[18] Y Wang and S Gelly ldquoModifications of UCT and sequence-like simulations for Monte-Carlo gordquo in Proceedings of theIEEE Symposium on Computational Intelligence and Gamespp 175ndash182 Honolulu HI USA 2007

[19] S Sharma Z Kobti and S Goodwin ldquoKnowledge generationfor improving simulations in uct for general game playingrdquo inProceedings of the Australasian Joint Conference on ArtificialIntelligence Advances in Artificial Intelligence pp 49ndash55Auckland New Zealand December 2008

[20] X Li Z Lv S Wang Z Wei X Zhang and L Wu ldquoA middlegame search algorithm applicable to low-cost personalcomputer for gordquo IEEE Access vol 7 pp 121719ndash1217272019

[21] N N Schraudolph P Dayan and T J Sejnowski ldquoTemporaldifference learning of position evaluation in the game of gordquoin Proceedings of the Advances in Neural Information Pro-cessing Systems pp 817ndash824 Denver CO USA 1994

[22] D Silver A Huang C J Maddison et al ldquoMastering the gameof go with deep neural networks and tree searchrdquo Naturevol 529 no 7587 pp 484ndash489 2016

[23] D Silver J Schrittwieser K Simonyan et al ldquoMastering thegame of go without human knowledgerdquo Nature vol 550no 7676 pp 354ndash359 2017

[24] D Silver T Hubert J Schrittwieser et al ldquoA general rein-forcement learning algorithm that masters chess shogi andgo through self-playrdquo Science vol 362 no 6419pp 1140ndash1144 2018

[25] R S Sutton ldquoLearning to predict by the methods of temporaldifferencesrdquo Machine Learning vol 3 no 1 pp 9ndash44 1988

[26] G Tesauro ldquoPractical issues in temporal difference learningrdquoin Proceedings of the Advances in Neural Information Pro-cessing Systems pp 259ndash266 Denver CO USA 1992

[27] G Tesauro ldquoTd-gammon a self-teaching backgammonprogram achieves master-level playrdquo Neural Computationvol 6 no 2 pp 215ndash219 1994

[28] D F Beal and M C Smith ldquoFirst results from using temporaldifference learning in shogirdquo in Proceedings of the Interna-tional Conference on Computers and Games Springer Tsu-kuba Japan pp 113ndash125 November 1998

[29] R Grimbergen and H Matsubara ldquoPattern recognition forcandidate generation in the game of shogirdquo Games in AIResearch pp 97ndash108 1997

[30] Y Sato D Takahashi and R Grimbergen ldquoA shogi programbased on monte-carlo tree searchrdquo ICGA Journal vol 33no 2 pp 80ndash92 2010

[31] S (run ldquoLearning to play the game of chessrdquo in Proceedingsof the Advances in Neural Information Processing Systemspp 1069ndash1076 Denver CO USA 1995

[32] I Bratko D Kopec and D Michie ldquoPattern-based repre-sentation of chess end-game knowledgerdquo Ae ComputerJournal vol 21 no 2 pp 149ndash153 1978

[33] Y Kerner ldquoLearning strategies for explanation patterns basicgame patterns with application to chessrdquo in Proceedings of theInternational Conference on Case-Based Reasoning SpringerSesimbra Portugal pp 491ndash500 October 1995

[34] M Campbell A J Hoane Jr and F-h Hsu ldquoDeep bluerdquoArtificial Intelligence vol 134 no 1-2 pp 57ndash83 2002

[35] N Brown and T Sandholm ldquoSafe and nested subgame solvingfor imperfect-information gamesrdquo in Proceedings of the Ad-vances in Neural Information Processing Systems pp 689ndash699Long Beach CA USA December 2017

[36] N Brown and T Sandholm ldquoSuperhuman ai for heads-up no-limit poker libratus beats top professionalsrdquo Science vol 359no 6374 pp 418ndash424 2018

[37] httpswwwmsracnzh-cnnewsfeaturesmahjong-ai-suphx 2019

[38] V Zambaldi D Raposo S Adam et al ldquoRelational deepreinforcement learningrdquo 2018 httparxivorgabs180601830

[39] S T Deng ldquoDesign and implementation of JIU game pro-totype system andmultimedia coursewarerdquo Minzu Universityof China Beijing China Dissertation 2017

[40] X Chen J Wu Y Cai H Zhang and T Chen ldquoEnergy-efficiency oriented traffic offloading in wireless networks abrief survey and a learning approach for heterogeneous cel-lular networksrdquo IEEE Journal on Selected Areas in Commu-nications vol 33 no 4 pp 627ndash640 2015

[41] T Pepels M H M Winands M Lanctot and M LanctotldquoReal-time Monte Carlo tree search in ms pac-manrdquo IEEETransactions on Computational Intelligence and AI in Gamesvol 6 no 3 pp 245ndash257 2014

[42] C F Sironi and M H M Winands ldquoComparing randomi-zation strategies for search-control parameters in monte-carlo

10 Complexity

tree searchrdquo in Proceedings of the 2019 IEEE Conference onGames (CoG) pp 1ndash8 IEEE London UK August 2019

[43] C Watkins and P Dayan ldquoQ-learningrdquo Machine Learningvol 8 no 3-4 pp 279ndash292 1992

[44] F Xiao Q Liu Qi-M Fu H-K Sun and L Gao ldquoGradientdescent Sarsa (λ) algorithm based on the adaptive potentialfunction shaping reward mechanismrdquo Journal of China In-stitute of Communications vol 34 no 1 pp 77ndash88 2013

[45] E Estrada E Hameed M Langer and A Puchalska ldquoPathlaplacian operators and superdiffusive processes on graphs iitwo-dimensional latticerdquo Linear Algebra and Its Applicationsvol 555 pp 373ndash397 2018

[46] K Lloyd N Becker M W Jones and R Bogacz ldquoLearning touse working memory a reinforcement learning gating modelof rule acquisition in ratsrdquo Frontiers in ComputationalNeuroscience vol 6 p 87 2012

[47] J Wu S Guo J Li and D Zeng ldquoBig data meet greenchallenges big data toward green applicationsrdquo IEEE SystemsJournal vol 10 no 3 pp 888ndash900 2016

[48] J Wu S Guo H Huang W Liu and Y Xiang ldquoInformationand communications technologies for sustainable develop-ment goals state-of-the-art needs and perspectivesrdquo IEEECommunications Surveys amp Tutorials vol 20 no 3pp 2389ndash2406 2018

[49] B Liu N Xu H Su LWu and J Bai ldquoOn the observability ofleader-based multiagent systems with fixed topologyrdquoComplexity vol 2019 pp 1ndash10 2019

[50] H Su J Zhang and X Chen ldquoA stochastic samplingmechanism for time-varying formation of multiagent systemswith multiple leaders and communication delaysrdquo IEEETransactions on Neural Networks and Learning Systemsvol 30 no 12 pp 3699ndash3707 2019

Complexity 11

R(s a) f(x y) s in layout stage

1113936iVi(s) s in battle stage1113896 (9)

where f(x y) is the joint probability density of X and Y isshown in the following equation [39 45]

f(x y) 1

2πσ1σ21 minus ρ2

1113968 eminus 1 2 1minus ρ2( )( )( ) xminus μ1( )

2σ21minus 2ρ xminus μ1( ) yminus μ2( )σ1σ2( )+ yminus μ2( )2σ22( 1113857( 1113857

(10)

where X and Y represent chessboard coordinates and μrepresents the mean value of the range of the chessboard(x y isin [0 13] σ1 σ2 2 μ1 u2 65 ρ 0)Vchain Vgate and Vsquare are approximations summarizedthrough the Jiu chess rules [1 2] and experienceVchain(s) 7cchain where cchain is the sum of the number ofsquares on the chessboard in the current stateVgate(s) 3cgate where cgate is the sum of the number ofgates on the chessboard in the current state andVsquare(s) 4csquare where csquare is the sum of the number ofsquares on the chessboard in the current state

Using the 2D normal distribution approximate ma-trix probability distribution based on expert knowledgeas the feedback function of the TD algorithm in thelayout stage we can produce better search simulationsand self-play game performance in the case of deepsearch depths and large search branches and learn areasonable chess strategy in the layout stage better andfaster [46]

44 Improved Deep Neural Network Based on ResNet18Strategy prediction and board situation evaluation are re-alized using a deep neural network which is expressed as afunction with parameter θ as shown in equation (11) wheres is the current state p is the output strategy prediction andv is the value of board situation evaluation

$$(p, v) = f_\theta(s). \quad (11)$$

The ResNet series of deep convolutional neural networks shows good robustness in image classification and target detection [3]. In this study, a ResNet18-based network with a modified fully connected layer is used; it simultaneously outputs the strategy prediction $p$ and the board situation evaluation $v$ (see Figure 4).

Unlike the original ResNet18 [3], a shared fully connected layer with 4096 neurons is used as the hidden layer. The final fully connected layer is replaced by a 196-dimensional fully connected output layer for the move strategy $p$ and a one-dimensional fully connected layer for the estimate $v$ of the current board state. The neural network is used to predict and evaluate the states of nodes that have not previously been evaluated; its outputs $p$ and $v$ are assigned to the UCT search tree node so that $P(s_i) = p$ and $V(s_i) = v$.

The number of input feature maps is two; that is, there are two channels. The first channel encodes the positions and colors of the pieces in the chessboard state: white pieces are represented by −1 and black pieces by 1. The second channel encodes the position of the most recently placed piece in the current chessboard state: positions without a newly placed piece are represented by 0 and the position of the newly placed piece by 1. All convolution layers use ReLU as the activation function, and batch normalization is performed after each convolution.
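The following PyTorch sketch illustrates the described structure (two input channels, a shared 4096-unit hidden layer, a 196-dimensional policy head, and a scalar value head). It is only an approximation under our own assumptions: the authors' implementation used CNTK (Table 1), and their exact ResNet18 adaptations for the 14 × 14 board are not given, so the 3 × 3 stem and the reuse of torchvision's ResNet18 trunk are ours.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class JiuNet(nn.Module):
    """Two-headed network realizing (p, v) = f_theta(s) of equation (11)."""

    def __init__(self, n_moves: int = 196):
        super().__init__()
        trunk = resnet18(weights=None)
        # Two input channels: piece colors (+1 black, -1 white) and the
        # last-move mask; a 3x3 stride-1 stem (our assumption) suits the small board.
        trunk.conv1 = nn.Conv2d(2, 64, kernel_size=3, stride=1, padding=1, bias=False)
        trunk.fc = nn.Identity()                      # keep the 512-d trunk features
        self.trunk = trunk
        self.hidden = nn.Sequential(nn.Linear(512, 4096), nn.ReLU())
        self.policy_head = nn.Linear(4096, n_moves)   # 196-d move distribution
        self.value_head = nn.Linear(4096, 1)          # scalar board evaluation

    def forward(self, s: torch.Tensor):
        h = self.hidden(self.trunk(s))
        p = torch.softmax(self.policy_head(h), dim=-1)
        v = torch.tanh(self.value_head(h))
        return p, v


# A batch of one empty 14 x 14 board with two channels.
p, v = JiuNet()(torch.zeros(1, 2, 14, 14))
```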

At the end of the game, we replay the experience and output the dataset $\{(s, \pi, z)_i\}$ to adjust the parameters of the deep neural network. The loss function used is shown in the following equation:

$$l = (z - v)^2 - \pi^{T} \ln p + c\,\lVert\theta\rVert^{2}. \quad (12)$$

For the $i$-th state on the search path, $\pi_i = N_i / \sum_j N_j$ and $z_i = W_i / N_i$. The training of the neural network is carried out with a mini-batch size of 100, and the learning rate is set to 0.1. At the end of each game, the number of samples (turns) sent to neural network training is about $1 \times 10^3$ to $2 \times 10^3$.

Figure 3: Approximate 2D normal distribution matrix ($X$: horizontal coordinate of the chessboard; $Y$: vertical coordinate of the chessboard; $Z$: static evaluation value).
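A minimal sketch of the loss in equation (12), again in PyTorch rather than the authors' CNTK setup; the regularization coefficient value is a placeholder of ours, and `params` is expected to be the network's parameter iterable.

```python
import torch


def jiu_loss(p: torch.Tensor, v: torch.Tensor,
             pi: torch.Tensor, z: torch.Tensor,
             params, c: float = 1e-4) -> torch.Tensor:
    """Equation (12): l = (z - v)^2 - pi^T ln p + c * ||theta||^2.

    p:  predicted move distribution, shape (batch, 196)
    v:  predicted board value, shape (batch, 1)
    pi: search-visit move distribution (target), shape (batch, 196)
    z:  value target from the search path, shape (batch,)
    """
    value_term = (z - v.squeeze(-1)) ** 2
    policy_term = -(pi * torch.log(p + 1e-12)).sum(dim=-1)
    l2_term = c * sum((w ** 2).sum() for w in params)
    return (value_term + policy_term).mean() + l2_term


# Usage with the JiuNet sketch above:
# net = JiuNet(); p, v = net(states)
# loss = jiu_loss(p, v, pi_targets, z_targets, net.parameters())
```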

5. Results

The experiment was performed with the following parameters: $c_{\text{puct}}$, step, and learning rate (see Table 1). We compare the efficiency of three methods of updating the Q-value: hybrid Q-learning with SARSA(λ), pure Q-learning, and pure SARSA(λ). The server and hardware configuration used in the experiment is also listed in Table 1.

5.1. The Improvement of Learning and Understanding Ability. An important measure of the learning ability of a Jiu chess agent is how quickly it learns to form squares in the layout stage. We therefore compare the effect of the hybrid update strategy against pure Q-learning and pure SARSA(λ) on the total number of squares formed over 200 steps of self-play training, as shown in Table 2. With the hybrid update strategy, the total number of squares in 200 steps is almost the sum of the numbers obtained when using pure Q-learning or pure SARSA(λ) alone, indicating that the hybrid update strategy can effectively train a Jiu chess agent to understand and play the game well.

We also counted the number of squares per 20 steps over the 200-step training process. Figure 5 shows that the number of squares formed with the hybrid update strategy is significantly higher than with pure Q-learning or pure SARSA(λ). The advantage is especially clear in the first 80 steps. From step 140 to step 200, the hybrid update strategy drops slightly compared with the first 120 steps, because the parameter $c_{\text{puct}} = 0.1$ must decrease in later training to avoid useless exploration. These results indicate that the playing strength trained by the hybrid update strategy is higher than that of pure Q-learning or pure SARSA(λ), and that the strategy yields stronger learning and understanding as well as better stability.

5.2. Feedback Function Experiment. To improve the reinforcement learning efficiency for Jiu chess, a feedback function based on a 2D normal distribution matrix is added to the reinforcement learning game model as an auxiliary measure for updating values. To test its effect, we ran 7 days and 110 games of self-play training and collected data with and without the 2D-normal-distribution-assisted feedback mechanism for comparison.

Table 3 and Figure 6 show that the total number of squares formed when using the 2D normal distribution assistant during self-play training is about three times that of the program without it, and the average value is likewise close to three times.

Figure 4: The deep neural network structure (improved ResNet18 for Jiu with policy and value outputs over the 14 × 14 board).

Table 1: The experimental parameters.
Toolkit package: CNTK 2.7
Runtime: .NET 4.7.2
Operating system: Windows 10
Central processing unit: AMD 2700X, 4.0 GHz
Random access memory: 32 GB
Graphics processing unit: RTX 2070
Threads used: 16
Development environment: Visual Studio 2017 Community
c_puct: 0.1
Step: 200
Learning rate: 0.001
c (Q-learning): 0.628
c (SARSA(λ)): 0.372

Table 2: Average number of squares per 20 steps in 200-step training.
Q-learning + SARSA(λ): 8604; Q-learning: 4248; SARSA(λ): 5571


The total number of squares formed when using the 2D normal distribution assistant during training is significantly larger than without it. The two-dimensional normal distribution auxiliary matrix is, in effect, an application of expert knowledge, and with it the ability to learn layout play improved by roughly a factor of three.

We also compare the learning efficiency with and without the 2D normal distribution assistant from the perspective of the quality of specific layouts. As shown in Figure 7(a), after 10 days of training, the program without the aid of the 2D normal distribution has learned little about the layout. With the 2D normal distribution matrix assisting training, the program learns specific chess patterns in the layout stage faster.

Figure 5: Number of squares employing different algorithms (Q-learning + SARSA(λ), Q-learning, SARSA(λ); x-axis: training steps, 0–200; y-axis: square count, 0–60).

Table 3: The number of squares with 2D normalization on or off.
Sum: 1524 (off), 5005 (on)
Average: 13 (off), 45 (on)

Figure 6: Comparison of the number of squares with and without the 2D normal distribution (x-axis: training steps, 0–100; y-axis: square count, 0–120).


It also learns to form a triangle (chess gate) and even to form squares (the most important chess shape) in the battle stage of the Jiu chess game, as shown in Figures 7(b) and 7(c).

The above experiments show that the feedback function based on the two-dimensional normal distribution matrix can reduce the amount of computation of the Jiu chess program and improve its self-learning efficiency, because in the layout stage the importance of a chessboard position closely follows the two-dimensional normal distribution matrix. Because a game of Jiu chess can take more than 1000 moves and the search iterates slowly, the evaluation of the layout stage is optimized with the two-dimensional normal distribution matrix, which in fact encodes expert knowledge. The feedback value from this model reduces blind exploration of the transfer values of chessboard state-action pairs, so better results are obtained in less time.

6. Conclusion

In this study, a deep reinforcement learning model combined with expert knowledge learns the rules of the game faster in a short time. The combination of the Q-learning and SARSA(λ) algorithms allows the neural network to learn more valuable chess strategies in a short time, which provides a reference for improving the learning efficiency of deep reinforcement learning models. The model produced good layout results with the two-dimensional normal distribution matrix built from expert knowledge, which also shows that deep reinforcement learning can shorten the learning time by reasonably incorporating expert knowledge. The improved ResNet18, used to make the training of the deep reinforcement learning model more effective at a low resource cost, was likewise validated by the experimental results.

Inspired by Wu et al. [47, 48], in future work we will consider improving the strength of the Jiu chess program not only with state-of-the-art techniques but also from a holistic social-good perspective. We will collect and process many more Jiu chess game records to build a big data resource, which can create value for lightweight reinforcement learning models and reduce the cost of computing resources. We will also explore combining multiagent theory [49] and mechanisms [50] to reduce the network delay of the proposed reinforcement learning model, in addition to using the strategies proposed in [39, 41].

Data Availability

Data are available via e-mail from xiaer_li@163.com.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was funded by the National Natural Science Foundation of China (61873291 and 61773416) and the MUC 111 Project.

References

[1] X. Li, S. Wang, Z. Lv, Y. Li, and L. Wu, "Strategies research based on chess shape for Tibetan JIU computer game," International Computer Games Association Journal (ICGA), vol. 40, no. 3, pp. 318–328, 2019.

[2] X. Li, Z. Lv, S. Wang, Z. Wei, and L. Wu, "A reinforcement learning model based on temporal difference algorithm," IEEE Access, vol. 7, pp. 121922–121930, 2019.

[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, June 2016.

[4] D. SuodaChucha and P. Shotwell, "The research of Tibetan chess," Journal of Tibet University, vol. 9, no. 2, pp. 5–10, 1994.

[5] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Playing Atari with deep reinforcement learning," 2013, http://arxiv.org/abs/1312.5602.

[6] D. Stern, R. Herbrich, and T. Graepel, "Bayesian pattern ranking for move prediction in the game of Go," in Proceedings of the 23rd International Conference on Machine Learning, pp. 873–880, ACM, Pittsburgh, PA, USA, 2006.

Figure 7: Comparison of learning efficiency with and without the 2D normal distribution: (a) 2D normalization off; (b) 2D normalization on; (c) 2D normalization on, forming a special board shape.

[7] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, "Distributional reinforcement learning with quantile regression," 2017, http://arxiv.org/abs/1710.10044.

[8] M. Andrychowicz, F. Wolski, A. Ray et al., "Hindsight experience replay," in Proceedings of the Advances in Neural Information Processing Systems, pp. 5048–5058, Long Beach, CA, USA, December 2017.

[9] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018, http://arxiv.org/abs/1802.09477.

[10] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., "Continuous control with deep reinforcement learning," 2015, http://arxiv.org/abs/1509.02971.

[11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor," 2018, http://arxiv.org/abs/1801.01290.

[12] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz, "Reinforcement learning through asynchronous advantage actor-critic on a GPU," 2016, http://arxiv.org/abs/1611.06256.

[13] V. Mnih, A. P. Badia, M. Mirza et al., "Asynchronous methods for deep reinforcement learning," in Proceedings of the International Conference on Machine Learning, pp. 1928–1937, New York City, NY, USA, June 2016.

[14] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proceedings of the International Conference on Machine Learning, pp. 1889–1897, Lille, France, July 2015.

[15] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, http://arxiv.org/abs/1707.06347.

[16] S.-J. Yen, T.-N. Yang, C. Chen, and S.-C. Hsu, "Pattern matching in Go game records," in Proceedings of the Second International Conference on Innovative Computing, Information and Control (ICICIC 2007), p. 297, Washington, DC, USA, 2007.

[17] S. Gelly and Y. Wang, "Exploration exploitation in Go: UCT for Monte-Carlo Go," in Proceedings of the NIPS Neural Information Processing Systems Conference On-Line Trading of Exploration and Exploitation Workshop, Barcelona, Spain, 2006.

[18] Y. Wang and S. Gelly, "Modifications of UCT and sequence-like simulations for Monte-Carlo Go," in Proceedings of the IEEE Symposium on Computational Intelligence and Games, pp. 175–182, Honolulu, HI, USA, 2007.

[19] S. Sharma, Z. Kobti, and S. Goodwin, "Knowledge generation for improving simulations in UCT for general game playing," in Proceedings of the Australasian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence, pp. 49–55, Auckland, New Zealand, December 2008.

[20] X. Li, Z. Lv, S. Wang, Z. Wei, X. Zhang, and L. Wu, "A middle game search algorithm applicable to low-cost personal computer for Go," IEEE Access, vol. 7, pp. 121719–121727, 2019.

[21] N. N. Schraudolph, P. Dayan, and T. J. Sejnowski, "Temporal difference learning of position evaluation in the game of Go," in Proceedings of the Advances in Neural Information Processing Systems, pp. 817–824, Denver, CO, USA, 1994.

[22] D. Silver, A. Huang, C. J. Maddison et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.

[23] D. Silver, J. Schrittwieser, K. Simonyan et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.

[24] D. Silver, T. Hubert, J. Schrittwieser et al., "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.

[25] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.

[26] G. Tesauro, "Practical issues in temporal difference learning," in Proceedings of the Advances in Neural Information Processing Systems, pp. 259–266, Denver, CO, USA, 1992.

[27] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Computation, vol. 6, no. 2, pp. 215–219, 1994.

[28] D. F. Beal and M. C. Smith, "First results from using temporal difference learning in shogi," in Proceedings of the International Conference on Computers and Games, pp. 113–125, Springer, Tsukuba, Japan, November 1998.

[29] R. Grimbergen and H. Matsubara, "Pattern recognition for candidate generation in the game of shogi," Games in AI Research, pp. 97–108, 1997.

[30] Y. Sato, D. Takahashi, and R. Grimbergen, "A shogi program based on Monte-Carlo tree search," ICGA Journal, vol. 33, no. 2, pp. 80–92, 2010.

[31] S. Thrun, "Learning to play the game of chess," in Proceedings of the Advances in Neural Information Processing Systems, pp. 1069–1076, Denver, CO, USA, 1995.

[32] I. Bratko, D. Kopec, and D. Michie, "Pattern-based representation of chess end-game knowledge," The Computer Journal, vol. 21, no. 2, pp. 149–153, 1978.

[33] Y. Kerner, "Learning strategies for explanation patterns: basic game patterns with application to chess," in Proceedings of the International Conference on Case-Based Reasoning, pp. 491–500, Springer, Sesimbra, Portugal, October 1995.

[34] M. Campbell, A. J. Hoane Jr., and F.-h. Hsu, "Deep Blue," Artificial Intelligence, vol. 134, no. 1-2, pp. 57–83, 2002.

[35] N. Brown and T. Sandholm, "Safe and nested subgame solving for imperfect-information games," in Proceedings of the Advances in Neural Information Processing Systems, pp. 689–699, Long Beach, CA, USA, December 2017.

[36] N. Brown and T. Sandholm, "Superhuman AI for heads-up no-limit poker: Libratus beats top professionals," Science, vol. 359, no. 6374, pp. 418–424, 2018.

[37] https://www.msra.cn/zh-cn/news/features/mahjong-ai-suphx, 2019.

[38] V. Zambaldi, D. Raposo, A. Santoro et al., "Relational deep reinforcement learning," 2018, http://arxiv.org/abs/1806.01830.

[39] S. T. Deng, "Design and implementation of JIU game prototype system and multimedia courseware," Dissertation, Minzu University of China, Beijing, China, 2017.

[40] X. Chen, J. Wu, Y. Cai, H. Zhang, and T. Chen, "Energy-efficiency oriented traffic offloading in wireless networks: a brief survey and a learning approach for heterogeneous cellular networks," IEEE Journal on Selected Areas in Communications, vol. 33, no. 4, pp. 627–640, 2015.

[41] T. Pepels, M. H. M. Winands, and M. Lanctot, "Real-time Monte Carlo tree search in Ms Pac-Man," IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 3, pp. 245–257, 2014.

[42] C. F. Sironi and M. H. M. Winands, "Comparing randomization strategies for search-control parameters in Monte-Carlo tree search," in Proceedings of the 2019 IEEE Conference on Games (CoG), pp. 1–8, IEEE, London, UK, August 2019.

[43] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.

[44] F. Xiao, Q. Liu, Q.-M. Fu, H.-K. Sun, and L. Gao, "Gradient descent Sarsa(λ) algorithm based on the adaptive potential function shaping reward mechanism," Journal of China Institute of Communications, vol. 34, no. 1, pp. 77–88, 2013.

[45] E. Estrada, E. Hameed, M. Langer, and A. Puchalska, "Path Laplacian operators and superdiffusive processes on graphs II: two-dimensional lattice," Linear Algebra and Its Applications, vol. 555, pp. 373–397, 2018.

[46] K. Lloyd, N. Becker, M. W. Jones, and R. Bogacz, "Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats," Frontiers in Computational Neuroscience, vol. 6, p. 87, 2012.

[47] J. Wu, S. Guo, J. Li, and D. Zeng, "Big data meet green challenges: big data toward green applications," IEEE Systems Journal, vol. 10, no. 3, pp. 888–900, 2016.

[48] J. Wu, S. Guo, H. Huang, W. Liu, and Y. Xiang, "Information and communications technologies for sustainable development goals: state-of-the-art, needs and perspectives," IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp. 2389–2406, 2018.

[49] B. Liu, N. Xu, H. Su, L. Wu, and J. Bai, "On the observability of leader-based multiagent systems with fixed topology," Complexity, vol. 2019, pp. 1–10, 2019.

[50] H. Su, J. Zhang, and X. Chen, "A stochastic sampling mechanism for time-varying formation of multiagent systems with multiple leaders and communication delays," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 12, pp. 3699–3707, 2019.


with amin-batch parameter of 100 and the learning rate is setto 01 At the end of each game the number of samples(turns) sent to neural network training is about 1 times 103 to2 times 103

5 Results

(e experiment was performed with the following param-eters cpuct step and learning rate (see Table 1) We comparethe efficiency of three methods of updating the Q valuehybrid Q-learning with SARSA(λ) pure Q-learning andpure SARSA(λ) (e server and hardware configurationused in the experiment is shown in Table 1

51Ae Improvement of Learning and Understanding AbilityIt is important to measure the learning ability of a Jiu chessagent that can quickly learn to form squares in the layoutstage (erefore we calculate the influence of the hybridupdate strategy compared to pure Q-learning or SARSA(λ)

algorithms on the sum of the squares of 200 steps of self-playgame training as shown in Table 2 When using the hybridupdate strategy the total number of squares in 200 steps isalmost the sum of the number of squares occupied whenonly using pure Q-learning or SARSA(λ) proving that thehybrid update strategy can effectively train a Jiu chess agentto understand the game of Jiu chess and learn it well

We also counted the number of squares per 20 steps inthe 200-step training process Figure 5 shows that thenumber of squares occupied when using the hybrid updatestrategy is significantly more than that when using pureQ-learning or SARSA(λ) algorithms Especially in first 80steps hybrid updating strategy is significantly more effectivethan pure Q-learning or SARSA(λ) algorithms And fromstep 140 to step 200 hybrid updating strategy has a littledrop compared to the first 120 steps It is because parametercpuct at 01 must decrease to fit later training to avoid uselessexploration (is result indicates that the chess powertrained by the hybrid update strategy is stronger than that ofpure Q-learning or SARSA(λ) In addition this strategy

produces stronger learning and understanding as well asbetter stability

52 FeedbackFunctionExperiment In order to improve thereinforcement learning efficiency of Jiu chess a feedbackfunction based on a 2D normal distribution matrix isadded to the reinforcement learning game model as anauxiliary measure for updating value To test the effect ofthis measure we conducted 7 days and 110 instances ofself-play training and obtained data with and without the2D-normal-distribution-assisted feedback mechanism forcomparison

Table 3 and Figure 6 show that the total number ofsquares occupied when using the 2D normal distributionassistant in the self-play training process is about three timesthat of programs that do not the average value is close to

Value

Policy

ImprovedResNet18 for

JIU

123456789

1011121314

123456789

1011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

Figure 4 (e deep neural network structure

Table 1 (e experimental parameters

Toolkit package CNTK 27Runtime Net 472Operating system Windows 10Central processing unit AMD 2700X40GHzRandom access memory 32GBGraphics processing unit RTX2070(reads used 16Development environment Visual Studio 2017 Communitycpuct 01Step 200Learning rate 0001c(Q-learning) 0628c(SARSA(λ)) 0372

Table 2 Average number of squares per 20 steps in 200-steptraining

Q-learning + SARSA(λ) Q-learning SARSA(λ)

8604 4248 5571

Complexity 7

three times (e total number of squares occupied whenusing the 2D normal distribution assistant during training issignificantly more than that without the assistant Two-di-mensional normal distribution auxiliary matrix shows theapplication of expert knowledge in fact (e learning abilityto play in the layout has improved about 3 times with thetwo-dimensional normal distribution auxiliary matrix

We also compare the learning efficiency of using a 2Dnormal distribution assistant program from the perspectiveof the quality of specific layouts As shown in Figure 7(a) in10 days of training the program without the aid of the 2Dnormal distribution has little knowledge of the layout Afterusing the 2D normal distribution matrix to assist in trainingthe layout process can learn specific chess patterns faster

Squa

re co

unt

Q-learning + SARSA(λ)Q-learningSARSA(λ)

0

10

20

30

40

50

60

20 40 60 80 100 120 140 160 180 2000Training steps

Figure 5 Number of squares employing different algorithms

Table 3 (e number of squares with 2D normalization on or off

2D normalization off 2D normalization onSum 1524 5005Average 13 45

Squa

re co

unt

2D norm on 2D norm off

50 1000Training steps

0

30

60

90

120

Squa

re co

unt

0

30

60

90

120

50 1000Training steps

Figure 6 Comparison of the number of squares with or without 2D normal distribution

8 Complexity

form a triangle (chess gate) and even form squares (the mostimportant chess shape) in the fighting stage of Jiu chessgame as shown in Figures 7(b) and 7(c)

(e above experiments show that the feedback functionbased on the two-dimensional normal distribution matrixcan reduce the calculation amount of Jiu chess program andimprove the self-learning efficiency of the program becausein the layout stage the importance of the chessboard po-sition is close to the two-dimensional normal distributionmatrix Because Jiu chess can have more than 1000 matchingsteps and slow search iteration process the evaluation oflayout stage is optimized by using two-dimensional normaldistribution matrix which is actually the knowledge ofmodel experts (e feedback value of the model indirectlyreduces the process of blindly exploring the transfer value ofchessboard state action so it takes less time to get betterresults

6 Conclusion

In this study the deep reinforcement learning modelcombined with expert knowledge can learn the rules of thegame faster in a short time (e combination of Q-learningand SARSA(λ) algorithms can make the neural networklearn more valuable chess strategies in a short time whichprovides a reference for improving the learning efficiency ofthe deep reinforcement learning model (e deep rein-forcement learning model has produced good layout resultsobtained by the two-dimensional normal distributionmatrixof expert knowledge modeling which also proves that deepreinforcement learning can shorten the learning time bycombining expert knowledge reasonably (e better per-formance of ResNet18 which was used to make the deepreinforcement learning model training more effectively atlow resources cost has been verified by experimental results

Inspired byWu et al [47 48] we consider to improve Jiuchess power not only from the state-of-the-art of technol-ogies but also from the holistic social good perspectives inthe future work We will collect and process much more Jiuchess game data to establish the big data resource which canturn big values for using light reinforcement learning model

to reduce the cost of computing resources We will alsoexplore the possibilities of combining multiagent theory [49]and mechanism [50] to reduce the network delay of theproposed reinforcement model besides using the strategiesproposed in [39 41]

Data Availability

Data are available via the e-mail xiaer_li163com

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(is study was funded by the National Natural ScienceFoundation of China (61873291 and 61773416) and theMUC 111 Project

References

[1] X Li S Wang Z Lv Y Li and L Wu ldquoStrategies researchbased on chess shape for Tibetan JIU computer gamerdquo In-ternational Computer Games Association Journal (ICGA)vol 40 no 3 pp 318ndash328 2019

[2] X Li Z Lv S Wang Z Wei and L Wu ldquoA reinforcementlearning model based on temporal difference algorithmrdquoIEEE Access vol 7 pp 121922ndash121930 2019

[3] K He X Zhang S Ren and J Sun ldquoDeep residual learningfor image recognitionrdquo in Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 770ndash778 LasVegas NV USA June 2016

[4] D SuodaChucha and P Shotwell ldquo(e research of Tibetanchessrdquo Journal of Tibet University vol 9 no 2 pp 5ndash10 1994

[5] V Mnih K Kavukcuoglu D Silver et al ldquoPlaying atari withdeep reinforcement learningrdquo 2013 httparxivorgabs13125602

[6] D Stern R Herbrich and T Graepel ldquoBayesian patternranking for move prediction in the game of gordquo in Proceedingsof the 23rd International Conference on Machine Learningpp 873ndash880 ACM Pittsburgh PA USA 2006

123456789

1011121314

123456789

1011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(a)

123456789

1011121314

1234567891011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(b)

123456789

1011121314

1234567891011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(c)

Figure 7 Comparison of learning efficiency with or without 2D normal distribution (a) 2D normalization off (b) 2D normalization on(c) 2D normalization on and making a special board type

Complexity 9

[7] W Dabney M Rowland M G Bellemare and R MunosldquoDistributional reinforcement learning with quantile regres-sionrdquo 2017 httparxivorgabs171010044

[8] M Andrychowicz W Filip A Ray et al ldquoHindsight expe-rience replayrdquo in Proceedings of the Advances in Neural In-formation Processing Systems pp 5048ndash5058 Long BeachCA USA December 2017

[9] F Scott H van Hoof and D Meger ldquoAddressing functionapproximation error in actor-critic methodsrdquo 2018 httparxivorgabs180209477

[10] T P Lillicrap J J Hunt P Alexander et al ldquoContinuouscontrol with deep reinforcement learningrdquo 2015 httparxivorgabs150902971

[11] T Haarnoja A Zhou P Abbeel and S Levine ldquoSoft actor-critic off-policy maximum entropy deep reinforcementlearning with a stochastic actorrdquo 2018 httparxivorgabs180101290

[12] M Babaeizadeh I Frosio T Stephen J Clemons andJ Kautz ldquoReinforcement learning through asynchronousadvantage actor-critic on a gpurdquo 2016 httparxivorgabs161106256

[13] V Mnih A P Badia M Mirza et al ldquoAsynchronous methodsfor deep reinforcement learningrdquo in Proceedings of the In-ternational Conference on Machine Learning pp 1928ndash1937New York City NY USA June 2016

[14] John Schulman S Levine P Abbeel M Jordan andP Moritz ldquoTrust region policy optimizationrdquo in Proceedingsof the International Conference on Machine Learningpp 1889ndash1897 Lille France July 2015

[15] John Schulman W Filip P Dhariwal A Radford andO Klimov ldquoProximal policy optimization algorithmsrdquo 2017httparxivorgabs170706347

[16] S-J Yen T-N Yang C Chen and S-C Hsu ldquoPatternmatching in go game recordsrdquo in Proceedings of the SecondInternational Conference on Innovative Computing Infor-mation and Control (ICICIC 2007) p 297 Washington DCUSA 2007

[17] S Gelly and Y Wang ldquoExploration exploitation in go UCTfor Monte-Carlo gordquo in Proceedings of the NIPS NeuralInformation Processing Systems Conference On-Line Trading ofExploration and Exploitation Workshop Barcelona Spain2006

[18] Y Wang and S Gelly ldquoModifications of UCT and sequence-like simulations for Monte-Carlo gordquo in Proceedings of theIEEE Symposium on Computational Intelligence and Gamespp 175ndash182 Honolulu HI USA 2007

[19] S Sharma Z Kobti and S Goodwin ldquoKnowledge generationfor improving simulations in uct for general game playingrdquo inProceedings of the Australasian Joint Conference on ArtificialIntelligence Advances in Artificial Intelligence pp 49ndash55Auckland New Zealand December 2008

[20] X Li Z Lv S Wang Z Wei X Zhang and L Wu ldquoA middlegame search algorithm applicable to low-cost personalcomputer for gordquo IEEE Access vol 7 pp 121719ndash1217272019

[21] N N Schraudolph P Dayan and T J Sejnowski ldquoTemporaldifference learning of position evaluation in the game of gordquoin Proceedings of the Advances in Neural Information Pro-cessing Systems pp 817ndash824 Denver CO USA 1994

[22] D Silver A Huang C J Maddison et al ldquoMastering the gameof go with deep neural networks and tree searchrdquo Naturevol 529 no 7587 pp 484ndash489 2016

[23] D Silver J Schrittwieser K Simonyan et al ldquoMastering thegame of go without human knowledgerdquo Nature vol 550no 7676 pp 354ndash359 2017

[24] D Silver T Hubert J Schrittwieser et al ldquoA general rein-forcement learning algorithm that masters chess shogi andgo through self-playrdquo Science vol 362 no 6419pp 1140ndash1144 2018

[25] R S Sutton ldquoLearning to predict by the methods of temporaldifferencesrdquo Machine Learning vol 3 no 1 pp 9ndash44 1988

[26] G Tesauro ldquoPractical issues in temporal difference learningrdquoin Proceedings of the Advances in Neural Information Pro-cessing Systems pp 259ndash266 Denver CO USA 1992

[27] G Tesauro ldquoTd-gammon a self-teaching backgammonprogram achieves master-level playrdquo Neural Computationvol 6 no 2 pp 215ndash219 1994

[28] D F Beal and M C Smith ldquoFirst results from using temporaldifference learning in shogirdquo in Proceedings of the Interna-tional Conference on Computers and Games Springer Tsu-kuba Japan pp 113ndash125 November 1998

[29] R Grimbergen and H Matsubara ldquoPattern recognition forcandidate generation in the game of shogirdquo Games in AIResearch pp 97ndash108 1997

[30] Y Sato D Takahashi and R Grimbergen ldquoA shogi programbased on monte-carlo tree searchrdquo ICGA Journal vol 33no 2 pp 80ndash92 2010

[31] S (run ldquoLearning to play the game of chessrdquo in Proceedingsof the Advances in Neural Information Processing Systemspp 1069ndash1076 Denver CO USA 1995

[32] I Bratko D Kopec and D Michie ldquoPattern-based repre-sentation of chess end-game knowledgerdquo Ae ComputerJournal vol 21 no 2 pp 149ndash153 1978

[33] Y Kerner ldquoLearning strategies for explanation patterns basicgame patterns with application to chessrdquo in Proceedings of theInternational Conference on Case-Based Reasoning SpringerSesimbra Portugal pp 491ndash500 October 1995

[34] M Campbell A J Hoane Jr and F-h Hsu ldquoDeep bluerdquoArtificial Intelligence vol 134 no 1-2 pp 57ndash83 2002

[35] N Brown and T Sandholm ldquoSafe and nested subgame solvingfor imperfect-information gamesrdquo in Proceedings of the Ad-vances in Neural Information Processing Systems pp 689ndash699Long Beach CA USA December 2017

[36] N Brown and T Sandholm ldquoSuperhuman ai for heads-up no-limit poker libratus beats top professionalsrdquo Science vol 359no 6374 pp 418ndash424 2018

[37] httpswwwmsracnzh-cnnewsfeaturesmahjong-ai-suphx 2019

[38] V Zambaldi D Raposo S Adam et al ldquoRelational deepreinforcement learningrdquo 2018 httparxivorgabs180601830

[39] S T Deng ldquoDesign and implementation of JIU game pro-totype system andmultimedia coursewarerdquo Minzu Universityof China Beijing China Dissertation 2017

[40] X Chen J Wu Y Cai H Zhang and T Chen ldquoEnergy-efficiency oriented traffic offloading in wireless networks abrief survey and a learning approach for heterogeneous cel-lular networksrdquo IEEE Journal on Selected Areas in Commu-nications vol 33 no 4 pp 627ndash640 2015

[41] T Pepels M H M Winands M Lanctot and M LanctotldquoReal-time Monte Carlo tree search in ms pac-manrdquo IEEETransactions on Computational Intelligence and AI in Gamesvol 6 no 3 pp 245ndash257 2014

[42] C F Sironi and M H M Winands ldquoComparing randomi-zation strategies for search-control parameters in monte-carlo

10 Complexity

tree searchrdquo in Proceedings of the 2019 IEEE Conference onGames (CoG) pp 1ndash8 IEEE London UK August 2019

[43] C Watkins and P Dayan ldquoQ-learningrdquo Machine Learningvol 8 no 3-4 pp 279ndash292 1992

[44] F Xiao Q Liu Qi-M Fu H-K Sun and L Gao ldquoGradientdescent Sarsa (λ) algorithm based on the adaptive potentialfunction shaping reward mechanismrdquo Journal of China In-stitute of Communications vol 34 no 1 pp 77ndash88 2013

[45] E Estrada E Hameed M Langer and A Puchalska ldquoPathlaplacian operators and superdiffusive processes on graphs iitwo-dimensional latticerdquo Linear Algebra and Its Applicationsvol 555 pp 373ndash397 2018

[46] K Lloyd N Becker M W Jones and R Bogacz ldquoLearning touse working memory a reinforcement learning gating modelof rule acquisition in ratsrdquo Frontiers in ComputationalNeuroscience vol 6 p 87 2012

[47] J Wu S Guo J Li and D Zeng ldquoBig data meet greenchallenges big data toward green applicationsrdquo IEEE SystemsJournal vol 10 no 3 pp 888ndash900 2016

[48] J Wu S Guo H Huang W Liu and Y Xiang ldquoInformationand communications technologies for sustainable develop-ment goals state-of-the-art needs and perspectivesrdquo IEEECommunications Surveys amp Tutorials vol 20 no 3pp 2389ndash2406 2018

[49] B Liu N Xu H Su LWu and J Bai ldquoOn the observability ofleader-based multiagent systems with fixed topologyrdquoComplexity vol 2019 pp 1ndash10 2019

[50] H Su J Zhang and X Chen ldquoA stochastic samplingmechanism for time-varying formation of multiagent systemswith multiple leaders and communication delaysrdquo IEEETransactions on Neural Networks and Learning Systemsvol 30 no 12 pp 3699ndash3707 2019

Complexity 11

three times (e total number of squares occupied whenusing the 2D normal distribution assistant during training issignificantly more than that without the assistant Two-di-mensional normal distribution auxiliary matrix shows theapplication of expert knowledge in fact (e learning abilityto play in the layout has improved about 3 times with thetwo-dimensional normal distribution auxiliary matrix

We also compare the learning efficiency of using a 2Dnormal distribution assistant program from the perspectiveof the quality of specific layouts As shown in Figure 7(a) in10 days of training the program without the aid of the 2Dnormal distribution has little knowledge of the layout Afterusing the 2D normal distribution matrix to assist in trainingthe layout process can learn specific chess patterns faster

Squa

re co

unt

Q-learning + SARSA(λ)Q-learningSARSA(λ)

0

10

20

30

40

50

60

20 40 60 80 100 120 140 160 180 2000Training steps

Figure 5 Number of squares employing different algorithms

Table 3 (e number of squares with 2D normalization on or off

2D normalization off 2D normalization onSum 1524 5005Average 13 45

Squa

re co

unt

2D norm on 2D norm off

50 1000Training steps

0

30

60

90

120

Squa

re co

unt

0

30

60

90

120

50 1000Training steps

Figure 6 Comparison of the number of squares with or without 2D normal distribution

8 Complexity

form a triangle (chess gate) and even form squares (the mostimportant chess shape) in the fighting stage of Jiu chessgame as shown in Figures 7(b) and 7(c)

(e above experiments show that the feedback functionbased on the two-dimensional normal distribution matrixcan reduce the calculation amount of Jiu chess program andimprove the self-learning efficiency of the program becausein the layout stage the importance of the chessboard po-sition is close to the two-dimensional normal distributionmatrix Because Jiu chess can have more than 1000 matchingsteps and slow search iteration process the evaluation oflayout stage is optimized by using two-dimensional normaldistribution matrix which is actually the knowledge ofmodel experts (e feedback value of the model indirectlyreduces the process of blindly exploring the transfer value ofchessboard state action so it takes less time to get betterresults

6 Conclusion

In this study the deep reinforcement learning modelcombined with expert knowledge can learn the rules of thegame faster in a short time (e combination of Q-learningand SARSA(λ) algorithms can make the neural networklearn more valuable chess strategies in a short time whichprovides a reference for improving the learning efficiency ofthe deep reinforcement learning model (e deep rein-forcement learning model has produced good layout resultsobtained by the two-dimensional normal distributionmatrixof expert knowledge modeling which also proves that deepreinforcement learning can shorten the learning time bycombining expert knowledge reasonably (e better per-formance of ResNet18 which was used to make the deepreinforcement learning model training more effectively atlow resources cost has been verified by experimental results

Inspired byWu et al [47 48] we consider to improve Jiuchess power not only from the state-of-the-art of technol-ogies but also from the holistic social good perspectives inthe future work We will collect and process much more Jiuchess game data to establish the big data resource which canturn big values for using light reinforcement learning model

to reduce the cost of computing resources We will alsoexplore the possibilities of combining multiagent theory [49]and mechanism [50] to reduce the network delay of theproposed reinforcement model besides using the strategiesproposed in [39 41]

Data Availability

Data are available via the e-mail xiaer_li163com

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(is study was funded by the National Natural ScienceFoundation of China (61873291 and 61773416) and theMUC 111 Project

References

[1] X Li S Wang Z Lv Y Li and L Wu ldquoStrategies researchbased on chess shape for Tibetan JIU computer gamerdquo In-ternational Computer Games Association Journal (ICGA)vol 40 no 3 pp 318ndash328 2019

[2] X Li Z Lv S Wang Z Wei and L Wu ldquoA reinforcementlearning model based on temporal difference algorithmrdquoIEEE Access vol 7 pp 121922ndash121930 2019

[3] K He X Zhang S Ren and J Sun ldquoDeep residual learningfor image recognitionrdquo in Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 770ndash778 LasVegas NV USA June 2016

[4] D SuodaChucha and P Shotwell ldquo(e research of Tibetanchessrdquo Journal of Tibet University vol 9 no 2 pp 5ndash10 1994

[5] V Mnih K Kavukcuoglu D Silver et al ldquoPlaying atari withdeep reinforcement learningrdquo 2013 httparxivorgabs13125602

[6] D Stern R Herbrich and T Graepel ldquoBayesian patternranking for move prediction in the game of gordquo in Proceedingsof the 23rd International Conference on Machine Learningpp 873ndash880 ACM Pittsburgh PA USA 2006

123456789

1011121314

123456789

1011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(a)

123456789

1011121314

1234567891011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(b)

123456789

1011121314

1234567891011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(c)

Figure 7 Comparison of learning efficiency with or without 2D normal distribution (a) 2D normalization off (b) 2D normalization on(c) 2D normalization on and making a special board type

Complexity 9

[7] W Dabney M Rowland M G Bellemare and R MunosldquoDistributional reinforcement learning with quantile regres-sionrdquo 2017 httparxivorgabs171010044

[8] M Andrychowicz W Filip A Ray et al ldquoHindsight expe-rience replayrdquo in Proceedings of the Advances in Neural In-formation Processing Systems pp 5048ndash5058 Long BeachCA USA December 2017

[9] F Scott H van Hoof and D Meger ldquoAddressing functionapproximation error in actor-critic methodsrdquo 2018 httparxivorgabs180209477

[10] T P Lillicrap J J Hunt P Alexander et al ldquoContinuouscontrol with deep reinforcement learningrdquo 2015 httparxivorgabs150902971

[11] T Haarnoja A Zhou P Abbeel and S Levine ldquoSoft actor-critic off-policy maximum entropy deep reinforcementlearning with a stochastic actorrdquo 2018 httparxivorgabs180101290

[12] M Babaeizadeh I Frosio T Stephen J Clemons andJ Kautz ldquoReinforcement learning through asynchronousadvantage actor-critic on a gpurdquo 2016 httparxivorgabs161106256

[13] V Mnih A P Badia M Mirza et al ldquoAsynchronous methodsfor deep reinforcement learningrdquo in Proceedings of the In-ternational Conference on Machine Learning pp 1928ndash1937New York City NY USA June 2016

[14] John Schulman S Levine P Abbeel M Jordan andP Moritz ldquoTrust region policy optimizationrdquo in Proceedingsof the International Conference on Machine Learningpp 1889ndash1897 Lille France July 2015

[15] John Schulman W Filip P Dhariwal A Radford andO Klimov ldquoProximal policy optimization algorithmsrdquo 2017httparxivorgabs170706347

[16] S-J Yen T-N Yang C Chen and S-C Hsu ldquoPatternmatching in go game recordsrdquo in Proceedings of the SecondInternational Conference on Innovative Computing Infor-mation and Control (ICICIC 2007) p 297 Washington DCUSA 2007

[17] S Gelly and Y Wang ldquoExploration exploitation in go UCTfor Monte-Carlo gordquo in Proceedings of the NIPS NeuralInformation Processing Systems Conference On-Line Trading ofExploration and Exploitation Workshop Barcelona Spain2006

[18] Y Wang and S Gelly ldquoModifications of UCT and sequence-like simulations for Monte-Carlo gordquo in Proceedings of theIEEE Symposium on Computational Intelligence and Gamespp 175ndash182 Honolulu HI USA 2007

[19] S Sharma Z Kobti and S Goodwin ldquoKnowledge generationfor improving simulations in uct for general game playingrdquo inProceedings of the Australasian Joint Conference on ArtificialIntelligence Advances in Artificial Intelligence pp 49ndash55Auckland New Zealand December 2008

[20] X Li Z Lv S Wang Z Wei X Zhang and L Wu ldquoA middlegame search algorithm applicable to low-cost personalcomputer for gordquo IEEE Access vol 7 pp 121719ndash1217272019

[21] N N Schraudolph P Dayan and T J Sejnowski ldquoTemporaldifference learning of position evaluation in the game of gordquoin Proceedings of the Advances in Neural Information Pro-cessing Systems pp 817ndash824 Denver CO USA 1994

[22] D Silver A Huang C J Maddison et al ldquoMastering the gameof go with deep neural networks and tree searchrdquo Naturevol 529 no 7587 pp 484ndash489 2016

[23] D Silver J Schrittwieser K Simonyan et al ldquoMastering thegame of go without human knowledgerdquo Nature vol 550no 7676 pp 354ndash359 2017

[24] D Silver T Hubert J Schrittwieser et al ldquoA general rein-forcement learning algorithm that masters chess shogi andgo through self-playrdquo Science vol 362 no 6419pp 1140ndash1144 2018

[25] R S Sutton ldquoLearning to predict by the methods of temporaldifferencesrdquo Machine Learning vol 3 no 1 pp 9ndash44 1988

[26] G Tesauro ldquoPractical issues in temporal difference learningrdquoin Proceedings of the Advances in Neural Information Pro-cessing Systems pp 259ndash266 Denver CO USA 1992

[27] G Tesauro ldquoTd-gammon a self-teaching backgammonprogram achieves master-level playrdquo Neural Computationvol 6 no 2 pp 215ndash219 1994

[28] D F Beal and M C Smith ldquoFirst results from using temporaldifference learning in shogirdquo in Proceedings of the Interna-tional Conference on Computers and Games Springer Tsu-kuba Japan pp 113ndash125 November 1998

[29] R Grimbergen and H Matsubara ldquoPattern recognition forcandidate generation in the game of shogirdquo Games in AIResearch pp 97ndash108 1997

[30] Y Sato D Takahashi and R Grimbergen ldquoA shogi programbased on monte-carlo tree searchrdquo ICGA Journal vol 33no 2 pp 80ndash92 2010

[31] S (run ldquoLearning to play the game of chessrdquo in Proceedingsof the Advances in Neural Information Processing Systemspp 1069ndash1076 Denver CO USA 1995

[32] I Bratko D Kopec and D Michie ldquoPattern-based repre-sentation of chess end-game knowledgerdquo Ae ComputerJournal vol 21 no 2 pp 149ndash153 1978

[33] Y Kerner ldquoLearning strategies for explanation patterns basicgame patterns with application to chessrdquo in Proceedings of theInternational Conference on Case-Based Reasoning SpringerSesimbra Portugal pp 491ndash500 October 1995

[34] M Campbell A J Hoane Jr and F-h Hsu ldquoDeep bluerdquoArtificial Intelligence vol 134 no 1-2 pp 57ndash83 2002

[35] N Brown and T Sandholm ldquoSafe and nested subgame solvingfor imperfect-information gamesrdquo in Proceedings of the Ad-vances in Neural Information Processing Systems pp 689ndash699Long Beach CA USA December 2017

[36] N Brown and T Sandholm ldquoSuperhuman ai for heads-up no-limit poker libratus beats top professionalsrdquo Science vol 359no 6374 pp 418ndash424 2018

[37] httpswwwmsracnzh-cnnewsfeaturesmahjong-ai-suphx 2019

[38] V Zambaldi D Raposo S Adam et al ldquoRelational deepreinforcement learningrdquo 2018 httparxivorgabs180601830

[39] S T Deng ldquoDesign and implementation of JIU game pro-totype system andmultimedia coursewarerdquo Minzu Universityof China Beijing China Dissertation 2017

[40] X Chen J Wu Y Cai H Zhang and T Chen ldquoEnergy-efficiency oriented traffic offloading in wireless networks abrief survey and a learning approach for heterogeneous cel-lular networksrdquo IEEE Journal on Selected Areas in Commu-nications vol 33 no 4 pp 627ndash640 2015

[41] T Pepels M H M Winands M Lanctot and M LanctotldquoReal-time Monte Carlo tree search in ms pac-manrdquo IEEETransactions on Computational Intelligence and AI in Gamesvol 6 no 3 pp 245ndash257 2014

[42] C F Sironi and M H M Winands ldquoComparing randomi-zation strategies for search-control parameters in monte-carlo

10 Complexity

tree searchrdquo in Proceedings of the 2019 IEEE Conference onGames (CoG) pp 1ndash8 IEEE London UK August 2019

[43] C Watkins and P Dayan ldquoQ-learningrdquo Machine Learningvol 8 no 3-4 pp 279ndash292 1992

[44] F Xiao Q Liu Qi-M Fu H-K Sun and L Gao ldquoGradientdescent Sarsa (λ) algorithm based on the adaptive potentialfunction shaping reward mechanismrdquo Journal of China In-stitute of Communications vol 34 no 1 pp 77ndash88 2013

[45] E Estrada E Hameed M Langer and A Puchalska ldquoPathlaplacian operators and superdiffusive processes on graphs iitwo-dimensional latticerdquo Linear Algebra and Its Applicationsvol 555 pp 373ndash397 2018

[46] K Lloyd N Becker M W Jones and R Bogacz ldquoLearning touse working memory a reinforcement learning gating modelof rule acquisition in ratsrdquo Frontiers in ComputationalNeuroscience vol 6 p 87 2012

[47] J Wu S Guo J Li and D Zeng ldquoBig data meet greenchallenges big data toward green applicationsrdquo IEEE SystemsJournal vol 10 no 3 pp 888ndash900 2016

[48] J Wu S Guo H Huang W Liu and Y Xiang ldquoInformationand communications technologies for sustainable develop-ment goals state-of-the-art needs and perspectivesrdquo IEEECommunications Surveys amp Tutorials vol 20 no 3pp 2389ndash2406 2018

[49] B Liu N Xu H Su LWu and J Bai ldquoOn the observability ofleader-based multiagent systems with fixed topologyrdquoComplexity vol 2019 pp 1ndash10 2019

[50] H Su J Zhang and X Chen ldquoA stochastic samplingmechanism for time-varying formation of multiagent systemswith multiple leaders and communication delaysrdquo IEEETransactions on Neural Networks and Learning Systemsvol 30 no 12 pp 3699ndash3707 2019

Complexity 11

form a triangle (chess gate) and even form squares (the mostimportant chess shape) in the fighting stage of Jiu chessgame as shown in Figures 7(b) and 7(c)

(e above experiments show that the feedback functionbased on the two-dimensional normal distribution matrixcan reduce the calculation amount of Jiu chess program andimprove the self-learning efficiency of the program becausein the layout stage the importance of the chessboard po-sition is close to the two-dimensional normal distributionmatrix Because Jiu chess can have more than 1000 matchingsteps and slow search iteration process the evaluation oflayout stage is optimized by using two-dimensional normaldistribution matrix which is actually the knowledge ofmodel experts (e feedback value of the model indirectlyreduces the process of blindly exploring the transfer value ofchessboard state action so it takes less time to get betterresults

6 Conclusion

In this study the deep reinforcement learning modelcombined with expert knowledge can learn the rules of thegame faster in a short time (e combination of Q-learningand SARSA(λ) algorithms can make the neural networklearn more valuable chess strategies in a short time whichprovides a reference for improving the learning efficiency ofthe deep reinforcement learning model (e deep rein-forcement learning model has produced good layout resultsobtained by the two-dimensional normal distributionmatrixof expert knowledge modeling which also proves that deepreinforcement learning can shorten the learning time bycombining expert knowledge reasonably (e better per-formance of ResNet18 which was used to make the deepreinforcement learning model training more effectively atlow resources cost has been verified by experimental results

Inspired byWu et al [47 48] we consider to improve Jiuchess power not only from the state-of-the-art of technol-ogies but also from the holistic social good perspectives inthe future work We will collect and process much more Jiuchess game data to establish the big data resource which canturn big values for using light reinforcement learning model

to reduce the cost of computing resources We will alsoexplore the possibilities of combining multiagent theory [49]and mechanism [50] to reduce the network delay of theproposed reinforcement model besides using the strategiesproposed in [39 41]

Data Availability

Data are available via the e-mail xiaer_li163com

Conflicts of Interest

(e authors declare that they have no conflicts of interest

Acknowledgments

(is study was funded by the National Natural ScienceFoundation of China (61873291 and 61773416) and theMUC 111 Project

References

[1] X Li S Wang Z Lv Y Li and L Wu ldquoStrategies researchbased on chess shape for Tibetan JIU computer gamerdquo In-ternational Computer Games Association Journal (ICGA)vol 40 no 3 pp 318ndash328 2019

[2] X Li Z Lv S Wang Z Wei and L Wu ldquoA reinforcementlearning model based on temporal difference algorithmrdquoIEEE Access vol 7 pp 121922ndash121930 2019

[3] K He X Zhang S Ren and J Sun ldquoDeep residual learningfor image recognitionrdquo in Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pp 770ndash778 LasVegas NV USA June 2016

[4] D SuodaChucha and P Shotwell ldquo(e research of Tibetanchessrdquo Journal of Tibet University vol 9 no 2 pp 5ndash10 1994

[5] V Mnih K Kavukcuoglu D Silver et al ldquoPlaying atari withdeep reinforcement learningrdquo 2013 httparxivorgabs13125602

[6] D Stern R Herbrich and T Graepel ldquoBayesian patternranking for move prediction in the game of gordquo in Proceedingsof the 23rd International Conference on Machine Learningpp 873ndash880 ACM Pittsburgh PA USA 2006

123456789

1011121314

123456789

1011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(a)

123456789

1011121314

1234567891011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(b)

123456789

1011121314

1234567891011121314

A B C D E F G H J K L M N O

A B C D E F G H J K L M N O

(c)

Figure 7 Comparison of learning efficiency with or without 2D normal distribution (a) 2D normalization off (b) 2D normalization on(c) 2D normalization on and making a special board type

Complexity 9

[7] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, “Distributional reinforcement learning with quantile regression,” 2017, http://arxiv.org/abs/1710.10044.
[8] M. Andrychowicz, F. Wolski, A. Ray et al., “Hindsight experience replay,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 5048–5058, Long Beach, CA, USA, December 2017.
[9] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” 2018, http://arxiv.org/abs/1802.09477.
[10] T. P. Lillicrap, J. J. Hunt, A. Pritzel et al., “Continuous control with deep reinforcement learning,” 2015, http://arxiv.org/abs/1509.02971.
[11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor,” 2018, http://arxiv.org/abs/1801.01290.
[12] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz, “Reinforcement learning through asynchronous advantage actor-critic on a GPU,” 2016, http://arxiv.org/abs/1611.06256.
[13] V. Mnih, A. P. Badia, M. Mirza et al., “Asynchronous methods for deep reinforcement learning,” in Proceedings of the International Conference on Machine Learning, pp. 1928–1937, New York City, NY, USA, June 2016.
[14] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in Proceedings of the International Conference on Machine Learning, pp. 1889–1897, Lille, France, July 2015.
[15] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017, http://arxiv.org/abs/1707.06347.
[16] S.-J. Yen, T.-N. Yang, C. Chen, and S.-C. Hsu, “Pattern matching in go game records,” in Proceedings of the Second International Conference on Innovative Computing, Information and Control (ICICIC 2007), p. 297, Washington, DC, USA, 2007.
[17] S. Gelly and Y. Wang, “Exploration exploitation in go: UCT for Monte-Carlo go,” in Proceedings of the NIPS Neural Information Processing Systems Conference On-Line Trading of Exploration and Exploitation Workshop, Barcelona, Spain, 2006.
[18] Y. Wang and S. Gelly, “Modifications of UCT and sequence-like simulations for Monte-Carlo go,” in Proceedings of the IEEE Symposium on Computational Intelligence and Games, pp. 175–182, Honolulu, HI, USA, 2007.
[19] S. Sharma, Z. Kobti, and S. Goodwin, “Knowledge generation for improving simulations in UCT for general game playing,” in Proceedings of the Australasian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence, pp. 49–55, Auckland, New Zealand, December 2008.
[20] X. Li, Z. Lv, S. Wang, Z. Wei, X. Zhang, and L. Wu, “A middle game search algorithm applicable to low-cost personal computer for go,” IEEE Access, vol. 7, pp. 121719–121727, 2019.
[21] N. N. Schraudolph, P. Dayan, and T. J. Sejnowski, “Temporal difference learning of position evaluation in the game of go,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 817–824, Denver, CO, USA, 1994.
[22] D. Silver, A. Huang, C. J. Maddison et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[23] D. Silver, J. Schrittwieser, K. Simonyan et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[24] D. Silver, T. Hubert, J. Schrittwieser et al., “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[25] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[26] G. Tesauro, “Practical issues in temporal difference learning,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 259–266, Denver, CO, USA, 1992.
[27] G. Tesauro, “TD-Gammon, a self-teaching backgammon program, achieves master-level play,” Neural Computation, vol. 6, no. 2, pp. 215–219, 1994.
[28] D. F. Beal and M. C. Smith, “First results from using temporal difference learning in shogi,” in Proceedings of the International Conference on Computers and Games, Springer, Tsukuba, Japan, pp. 113–125, November 1998.
[29] R. Grimbergen and H. Matsubara, “Pattern recognition for candidate generation in the game of shogi,” Games in AI Research, pp. 97–108, 1997.
[30] Y. Sato, D. Takahashi, and R. Grimbergen, “A shogi program based on Monte-Carlo tree search,” ICGA Journal, vol. 33, no. 2, pp. 80–92, 2010.
[31] S. Thrun, “Learning to play the game of chess,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 1069–1076, Denver, CO, USA, 1995.
[32] I. Bratko, D. Kopec, and D. Michie, “Pattern-based representation of chess end-game knowledge,” The Computer Journal, vol. 21, no. 2, pp. 149–153, 1978.
[33] Y. Kerner, “Learning strategies for explanation patterns: basic game patterns with application to chess,” in Proceedings of the International Conference on Case-Based Reasoning, Springer, Sesimbra, Portugal, pp. 491–500, October 1995.
[34] M. Campbell, A. J. Hoane Jr., and F.-H. Hsu, “Deep Blue,” Artificial Intelligence, vol. 134, no. 1-2, pp. 57–83, 2002.
[35] N. Brown and T. Sandholm, “Safe and nested subgame solving for imperfect-information games,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 689–699, Long Beach, CA, USA, December 2017.
[36] N. Brown and T. Sandholm, “Superhuman AI for heads-up no-limit poker: Libratus beats top professionals,” Science, vol. 359, no. 6374, pp. 418–424, 2018.
[37] https://www.msra.cn/zh-cn/news/features/mahjong-ai-suphx, 2019.
[38] V. Zambaldi, D. Raposo, A. Santoro et al., “Relational deep reinforcement learning,” 2018, http://arxiv.org/abs/1806.01830.
[39] S. T. Deng, “Design and implementation of JIU game prototype system and multimedia courseware,” Dissertation, Minzu University of China, Beijing, China, 2017.
[40] X. Chen, J. Wu, Y. Cai, H. Zhang, and T. Chen, “Energy-efficiency oriented traffic offloading in wireless networks: a brief survey and a learning approach for heterogeneous cellular networks,” IEEE Journal on Selected Areas in Communications, vol. 33, no. 4, pp. 627–640, 2015.
[41] T. Pepels, M. H. M. Winands, and M. Lanctot, “Real-time Monte Carlo tree search in Ms Pac-Man,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 3, pp. 245–257, 2014.
[42] C. F. Sironi and M. H. M. Winands, “Comparing randomization strategies for search-control parameters in Monte-Carlo tree search,” in Proceedings of the 2019 IEEE Conference on Games (CoG), pp. 1–8, IEEE, London, UK, August 2019.
[43] C. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[44] F. Xiao, Q. Liu, Q.-M. Fu, H.-K. Sun, and L. Gao, “Gradient descent Sarsa(λ) algorithm based on the adaptive potential function shaping reward mechanism,” Journal of China Institute of Communications, vol. 34, no. 1, pp. 77–88, 2013.
[45] E. Estrada, E. Hameed, M. Langer, and A. Puchalska, “Path Laplacian operators and superdiffusive processes on graphs. II. Two-dimensional lattice,” Linear Algebra and Its Applications, vol. 555, pp. 373–397, 2018.
[46] K. Lloyd, N. Becker, M. W. Jones, and R. Bogacz, “Learning to use working memory: a reinforcement learning gating model of rule acquisition in rats,” Frontiers in Computational Neuroscience, vol. 6, p. 87, 2012.
[47] J. Wu, S. Guo, J. Li, and D. Zeng, “Big data meet green challenges: big data toward green applications,” IEEE Systems Journal, vol. 10, no. 3, pp. 888–900, 2016.
[48] J. Wu, S. Guo, H. Huang, W. Liu, and Y. Xiang, “Information and communications technologies for sustainable development goals: state-of-the-art, needs and perspectives,” IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp. 2389–2406, 2018.
[49] B. Liu, N. Xu, H. Su, L. Wu, and J. Bai, “On the observability of leader-based multiagent systems with fixed topology,” Complexity, vol. 2019, pp. 1–10, 2019.
[50] H. Su, J. Zhang, and X. Chen, “A stochastic sampling mechanism for time-varying formation of multiagent systems with multiple leaders and communication delays,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 12, pp. 3699–3707, 2019.
