abstract - arxiv · 2018. 5. 22. · hyper-parameter optimization is an important problem in...

18
AlphaX: eXploring Neural Architectures with Deep Neural Networks and Monte Carlo Tree Search Linnan Wang* Department of Computer Science Brown University [email protected] Yiyang Zhao* Department of Computer Science Northeastern University [email protected] Yuu Jinnai* Department of Computer Science Brown University [email protected] Abstract We present AlphaX, a fully automated agent that designs complex neural architec- tures from scratch. AlphaX explores the exponentially exploded search space with a novel distributed Monte Carlo Tree Search (MCTS) and a Meta-Deep Neural Network (DNN). MCTS intrinsically improves the search efficiency by automati- cally balancing the exploration and exploitation at each state, while Meta-DNN predicts the network accuracy to guide the search, and to provide an estimated reward for the preemptive backpropagation in the distributed setup. As the search progresses, AlphaX also generates the training date for Meta-DNN. So, the learning of Meta-DNN is end-to-end. In searching for NASNet style architectures, AlphaX found several promising architectures with up to 1% higher accuracy than NASNet using only 17 GPUs for 5 days, demonstrating up to 23.5x speedup over the original searching for NASNet that used 500 GPUs in 4 days. 1 Introduction Discovering efficient network architecture is extremely laborious, and requires tremendous design heuristics. For this reason, Neural Architecture Search (NAS) using AutoML techniques has sparked a surge of interests. The goal of NAS is to search for the target neural architecture in the design space [26, 36, 34, 35, 18, 2, 42, 13, 40, 25, 37, 22, 24, 30, 10, 3, 11, 6, 27, 39, 29], and it has succeeded in alleviating the amount of human efforts in the design. However, the amount of computational power required by NAS methods is often prohibitively large (e.g. 10 4 GPU hours) that motivates us to improve the search efficiency. We present AlphaX, a new DNN design agent born for the high search efficiency. The novelty of AlphaX is in a successful marriage of Monte Carlo Tree Search (MCTS) [20, 5] and a Meta-DNN that serves as a predictive model to estimate the accuracy of a sampled architecture. MCTS is a planning algorithm for finding an optimal policy in state-space search problems [5]. The most successful subclass of MCTS is Upper Confidence bound applied to Trees (UCT) which introduces a multiarm bandit algorithm (i.e. UCB1) [1] to a tree search policy [20]. This enables MCTS to delicately balance the exploration and exploitation at each state so that the search efficiency is optimized at a fine granularity. Meta-DNN improves the search procedure in two ways: 1) it predict the performance of a sampled architecture to heuristically guide the search to a promising region based on prior experiences. This is inspired by AlphaGo [32] that utilizes a value predictor in addition to Monte * These three authors contributed equally. Preprint. Work in progress. arXiv:1805.07440v1 [cs.LG] 18 May 2018

Upload: others

Post on 10-Sep-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

AlphaX: eXploring Neural Architectures with DeepNeural Networks and Monte Carlo Tree Search

Linnan Wang*Department of Computer Science

Brown [email protected]

Yiyang Zhao*Department of Computer Science

Northeastern [email protected]

Yuu Jinnai*Department of Computer Science

Brown [email protected]

Abstract

We present AlphaX, a fully automated agent that designs complex neural architec-tures from scratch. AlphaX explores the exponentially exploded search space witha novel distributed Monte Carlo Tree Search (MCTS) and a Meta-Deep NeuralNetwork (DNN). MCTS intrinsically improves the search efficiency by automati-cally balancing the exploration and exploitation at each state, while Meta-DNNpredicts the network accuracy to guide the search, and to provide an estimatedreward for the preemptive backpropagation in the distributed setup. As the searchprogresses, AlphaX also generates the training date for Meta-DNN. So, the learningof Meta-DNN is end-to-end. In searching for NASNet style architectures, AlphaXfound several promising architectures with up to 1% higher accuracy than NASNetusing only 17 GPUs for 5 days, demonstrating up to 23.5x speedup over the originalsearching for NASNet that used 500 GPUs in 4 days.

1 Introduction

Discovering efficient network architecture is extremely laborious, and requires tremendous designheuristics. For this reason, Neural Architecture Search (NAS) using AutoML techniques has sparkeda surge of interests. The goal of NAS is to search for the target neural architecture in the design space[26, 36, 34, 35, 18, 2, 42, 13, 40, 25, 37, 22, 24, 30, 10, 3, 11, 6, 27, 39, 29], and it has succeededin alleviating the amount of human efforts in the design. However, the amount of computationalpower required by NAS methods is often prohibitively large (e.g. 104 GPU hours) that motivates usto improve the search efficiency.

We present AlphaX, a new DNN design agent born for the high search efficiency. The novelty ofAlphaX is in a successful marriage of Monte Carlo Tree Search (MCTS) [20, 5] and a Meta-DNN thatserves as a predictive model to estimate the accuracy of a sampled architecture. MCTS is a planningalgorithm for finding an optimal policy in state-space search problems [5]. The most successfulsubclass of MCTS is Upper Confidence bound applied to Trees (UCT) which introduces a multiarmbandit algorithm (i.e. UCB1) [1] to a tree search policy [20]. This enables MCTS to delicatelybalance the exploration and exploitation at each state so that the search efficiency is optimized at afine granularity. Meta-DNN improves the search procedure in two ways: 1) it predict the performanceof a sampled architecture to heuristically guide the search to a promising region based on priorexperiences. This is inspired by AlphaGo [32] that utilizes a value predictor in addition to Monte

*These three authors contributed equally.

Preprint. Work in progress.

arX

iv:1

805.

0744

0v1

[cs

.LG

] 1

8 M

ay 2

018

Page 2: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

Carlo rollouts to estimate promising states in Go. 2) the predicted accuracy also estimates the rewardfor preemptive backpropagation to mitigate the bias innate to distributed training.

The AlphaX has the following technical strengths: 1) our method is extensible to various designspaces, including the nonlinear NASNet search space [43]. Other MCTS related implementations areexclusively restricted to linear DNNs and the single GPU training [27, 39]; 2) the search engine isfully automatic, and the learning of Meta-DNN is end-to-end. AlphaX generates the new trainingdata for Meta-DNN as training progresses; 3) the preemptive backpropagation enabled by Meta-DNNmakes it scales near-linearly over a large amount of GPUs with client-server style scheme.

2 Related Work

Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], whileBayesian Optimization (BO) is a popular approach for the hyper-parameter search [17, 33, 38, 19].The search space of BO is typically constrained to a few hyper-parameters, while NAS involvesthousands of discrete hyper-parameters. Also, BO requires good representations/kernels for NAS.Therefore, several discrete optimization approaches have been proposed to NAS.

Visited = 100Expected Return = 0.82

Visited = 2Expected Return = 0.81

MCTSQ-Learning c = 0.5

Figure 1: the behavior difference be-tween MCTS and Q-learning.

Reinforcement Learning (RL): Several RL techniques havebeen investigated for NAS [2, 42]. Baker et al. proposedMetaQNN, which uses a Q-learning agent to search oversequential Convolutional Neural Networks (CNNs) [2].Zoph et al. built an RNN agent trained with Policy Gra-dient to design simple non-linear CNNs and LSTM [42].Though successful in designing networks, their approachesare not efficient, as they rely on the randomness inducedby a stochastic policy to escape from the local minimum,and to explores to a new region (Details analysis of itseffect on NAS is available in 10). In contrast, our ap-proach explicitly leverages the statistics of action choiceat individual states to encourage less explored actions. InFig.1, Q-learning seeks to choose an action that exclusively optimizes the best expected return (i.e.accuracy), thus it goes left in the example except for the randomness induced by a stochastic policy(e.g. ε-greedy strategy). On the other hand, MCTS chooses to go right because the right branch ismuch less visited than the left. In contrast with Q-Learning, the UCB1 score in MCTS explicitlyconsiders both the expected return and the number of visits to balance the decision here.

Hill Climbing (HC): Elsken et al. proposed a simple hill climbing for architecture search [11]. Startingfrom an architecture they specify, they train every child networks and moves to the best performingchild. Liu et al. deployed a beam search which follows a similar procedure to hill climbing but selectsthe top-K architectures instead of only the best [23]. The problem of their approaches is that HCtends to trap into a local optimum from which it can never escape (Fig.8), while MCTS guarantees toeventually find the optimal [20].

Evolutionary Algorithm (EA): Evolutionary algorithms represent each neural network as a stringof genes and search the architecture space by mutation and recombinations [26, 36, 34, 18, 13, 24,30, 40, 25, 37, 29]. Strings which represent a neural network with top performance are selectedto generate child models. The selection process is in lieu of exploitation, and the mutation is toencourage exploring. Essentially, GA algorithms, like Q-learning, also fail to consider the visitingstatistics at individual states to inform the decision between the exploitation and exploration. Anothershortcoming of EA is the vast requirement of computational resources. Real et al. used 250 GPUs for10 days [30] and Liu et al. used 250 GPU for 10 days [24], while AlphaX finds leading architectureswith only 17 GPUs in 4 days.

Several approaches have been succeeded in speeding up NAS by bypassing the training time for eachnetwork architectures with a predictive model for weights.

HyperNet: Because NAS requires training a lot of architectures, reducing the time to train eacharchitecture has a potential to significantly reduce the total computation time. HyperNet [16] is anapproach to use a meta-network (i.e. a hypernet) to generate the weights for other DNNs. The conceptis inspired by HyperNEAT [35] that utilizes a pattern network to evolve connectivities. SMASH

2

Page 3: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

[4] proposes evaluating model architectures with the weights generated from a HyperNet to bypassthe actual training. However, those generated weights tends to be quite different from optimal ones,introducing bias in the search. In addition, SMASH only considers skip connections and variablelayers, while AlphaX fully supports the non-linear network design.

Knowledge Transfer: Transferring knowledge from previously trained networks to a new network isanother approach to reduce the training time for each architecture. Cai et al. [6] integrated Net2Net[8] to RL and successfully reduced the amount of computational resource required. Pham et al. [28]proposed to transfer weights by forcing all child models to share the same weights.

Our method is orthogonal to the purpose of the HyperNet and Knowedge Transfer – our goal is toefficiently search over architectures while these approaches aim to speed up the training time. In fact,our method can be combined with these approaches to further improve the efficiency.

3 Design Space

Block Block

Block

Block

Block

leftbranch

rightbranch

~

output

layer

layer layer

layer~

Cell Output

Cell

Cell Input1 Cell Input2

Figure 2: Cell structure

The design space of AlphaX is consistent with NASNet state space[43] of which the key advantage is at the transferability. That is, thepromising architectures found on CIFAR-10 [21] often work well onlarger image datasets (e.g. ImageNet [9]) too. NASNet is a networklinearly stacked with multiple Cells. We search for a hierarchicalCell structure, such as the one in Fig.2. There are two types of Cells,Normal Cell (NCell) and Reduction Cell (RCell). Normal Cellmaintains the input and output dimensions with the padding, whileReduction Cell reduces the height and width by half with the strid-ing. The network structure for CIFAR-10 follows "Image→NCell(×N)→ RCell→ NCell (×N)→ RCell→ NCell (×N)→ soft-max", and the network structure for ImageNet follows "Image→3×3Conv, stride 2→RCell (×2)→NCell (×N) → RCell → NCell(×N)→ RCell→ NCell (×N)→ softmax". ×N means repeatingN times, and N is empirically set to 6 in experiments.

Fig.2 demonstrates the internal hierarchical structure of a Cell. Please note the layout of RCell andNCell are same, so we illustrate the both with one structure. A Cell has two inputs, which are eitherthe output of Celli−1 or Celli−2. Blocks compose of a Cell, and it has also two input. The inputof Blocki can be any outputs of Block1 ∼ Blocki−1, or Cell inputs. Then, the unused outputs ofBlocks are concatenated in the channel dimension as the Cell’s output. A block consists of a leftand a right Branch, and the input of a Branch can be any of the block inputs. A Branch consistsof multiple network layers sequentially linked together, e.g. convolution, pool, or activation. Theoutputs of both branches add together as the Block’s output.

It is critical to match the input and output dimensions to ensure the design feasibility. If a Cell’s twoinputs mismatch, we reduce the high-dimensional input to match the low-dimensional one by thestriding. NCell enforces the output dimensions of a Block to be same as the input using padding.RCell enforces the height and width are half of the input by having only 1 conv layer or poolinglayer of stride 2 at any dataflow from the Cell input to the output, and the output of other conv layersare strictly padded the same as the layer’s input. These enable us to directly concatenate the blocks’outputs in the channel dimension as the Cell output. We also follow the design conventions in ResNet,Inception, and VGG, such as doubling the filter size of convolution layers in RCell, and assigning abatch normalization+activation right after a convolution.

With this design domain, the key questions are 1) what are the internal structure of each Branch andBlock in RCell and NCell? 2) what are the connectivities of each Block and Cell?

4 AlphaX: eXploring Neural Architectures with MCTS and DNNs

4.1 State and Action Space

State Space: A state represents a network architecture, and AlphaX utilizes states (or nodes) to keeptrack of past trails to inform future decisions. RCell and NCell are two design units; using twoseparate state spaces ignores their correlations, so we use one state to represent both structures. A

3

Page 4: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

filters:f, stride:s kernel: kNormal Cell

inputs=>left: p_cell, right: pp_cellblock0

left branch==> l1, conv, f: 32 s: 1 k: 3right branch==> l1, conv, f: 32 s: 1 k: 3==> l1, sep_conv, f: 32 s: 1 k: 3

 block NSimilar as block0

~~~~

Reduction CellSimilar as Normal Cell

(a) State Representation

state action

s1

s2

s3 s4

T

s5

s6

(b) State Space With Ter-minal

state action

s1

s2 s3

s4 s5 s6

(c) State Space WithoutTerminal

♦ 7x7 avg pool♦ 3x3 max pool♦ 5x5 max pool♦ 7x7 max pool

♦ 3x3 avg pool♦ 5x5 avg pool

♦ 1x1 conv♦ 3x3 conv♦ identity♦ 3x3 separable conv♦ 5x5 separable conv ♦ 7x7 separable conv

♦ 3♦ 4♦ 5♦ 6

♦ 1♦ 2

♦ 7♦ 8♦ 9♦ 10♦ 11 ♦ 12

Actions Codes

(d) Action Space

Figure 3: State and action space of AlphaX.

state explicitly captures the internal structures of RCell and NCell, and their hyper-parameters.Fig.3a demonstrates the JSON style coding scheme for representing a state as a string. So, we usesparse data structure such as HashMap to store and reference them. We constrained the state space sothat the architectures in the space are manageable within the GPU memory size. See Appendix Sec.8for details of our state and action constraints.

We also introduce a terminal state to allow for multiple actions. All the other states can transit to theterminal state by taking the terminal action (Fig. 3b), and the agent only trains the network, fromwhich it reaches the terminal. Without the terminal state, the agent trains each new nodes to preventthe unbounded state transitions. This results in an undesired progressive searching behavior (Fig.3c).With the terminal state, we freely create new nodes along the searching path before reaching theterminal (Fig.3b). This enables agent to bypass shallow architectures to directly train complex ones.

Action Space: An action morphs the current network architecture, i.e. current state, to transit tothe next state. MCTS decides the optimal action, and we will explain it in Section 4.2. Actionswe consider are 1) adding a new layer in the left or right branch of Blocki in NCell or RCell, 2)creating a new block in NCell and RCell, and 3) the terminating. Actions also explicitly specifythe connectivities of Blocks. A Block has two inputs, which can be either inputs of current Cell oroutputs of existing Blocks. The agent prepares each input combinations as an action to create a newBlock with inputs in the action set.

4.2 Search Procedure

In this section, we describe the search procedure of AlphaX. In AlphaX, the focus of MCTS is on theanalysis of the most promising moves at a state, while the focus of DNN is on capturing the priormodel architectures and their accuracies to update the search tree based on the Monte Carlo samplingof a new state’s descendants. Unlike other search algorithms proposed in NAS [42], MCTS tracks thevisiting statistics at individual nodes to balance the exploration and exploitation at a fine granularity.Each node tracks these two statistics: 1) N(s, a) counts the selection of action a at state s; 2) Q(s, a)is the expected reward after taking action a at state s, and intuitively Q(s, a) is an estimate of howpromising the state-action pair is based on the previously sampled network architectures. A typicalsearching iteration in AlphaX consists of Selection, Expansion, Meta-DNN assisted Simulation,Backpropagation and Meta-DNN Training, and we elucidate each steps as follows.

Selection traverses down the search tree to trace the current most promising search path. It startsfrom the root, and stops till reaching a leaf. At a node, the agent selects actions based on UCB1 [1]:

πtree(s) = argmaxa∈A

(Q(s, a)

N(s, a)+ 2c

√2 logN(s)

N(s, a)

), (1)

where N(s) is the number of visits to the state s (i.e. N(s) =∑

a∈AN(s, a)), and c is a constant.The first term ( Q(s,a)

N(s,a) ) is called an exploitation term which estimates the expected accuracy of

its descendant architectures. The second term (2c√

2 logN(s)N(s,a) ) is called an exploration term. The

exploration term starts is large while N(s, a) is small (i.e. the action is not yet explored a lot), anddecreases with N(s, a) increases. As such, the exploration term encourages the agent to take an

4

Page 5: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

action which is explored less frequently than other actions. The constant c can balance how muchexploration we encourage to the agent - the higher we set c the more the agent will weight on thesecond term, encouraging more exploration. We follow the tree policy until it reaches to a node whichis never visited before (Fig.9 in Appendix 7).

UCB1 is a well suited searching algorithm for NAS. First, Q(s, a)/N(s, a) estimates the expectedaccuracy of all the subsequent descendants, i.e. all the architecture variations from the current node.This is quite different from greedy based algorithms such as Hill Climbing that makes decisionexclusively with the current state and action. Thus, πtree(s) roughly tells us the best search directionat a state. Second, we have a fine-grained policy to cope with exploitation and exploration dilemmaby leveraging the visiting statistics (Fig.1).

Expansion adds a new node into the tree. Q(s, a) and N(s, a) are initialized to zeros, and Simulationwill update them.

Meta-DNN assisted Simulation randomly samples the descendants of a new node to approximateQ(s, a) of the node with their accuracies. The process is to estimate how promising the searchdirection rendered by the new node and its descendants. The simulation starts at the new node. Theagent traverses down the tree by taking the uniform-random action until reaching a terminal state, andit dispatches the architecture before the terminal for training. This completes a round of simulation.

The more simulation we roll, the more accurate estimate we get. However, we cannot conduct manysimulations as the network training is extremely time-consuming. AlphaX adopts a hybrid strategyto solve this issue by incorporating a meta-DNN to predict the network accuracy in addition to theactual training. Specifically, we estimate q = Q(s, a) with

Q(s, a)←

(Acc(sim0(s

′)) +1

k

∑i=1..k

Pred(simi(s′))

)/2 (2)

where s′ is the new state you explored by applying action a to state s, sim represents a simulationfrom the new state s′, and Acc is the actually trained accuracy in the first simulation, and Pred isthe predicted accuracy from Meta-DNN in the subsequent n simulations. Please note s is the parentof s′, and a is the optimal action yield by πtree(s). The key benefit of Meta-DNN is to introducethe prior experiences to guide the search. If a search branch presents architectures that are similarto previously trained good ones, Meta-DNN updates the exploitation term in Eq.1 to increase thelikelihood of exploring this branch. We set c = 0.5 following Wistuba [39].

Meta-DNN is a fully connected network that takes the model architecture as the input to predict it’saccuracy based on prior experiences. The architecture of Meta-DNN is input → 512 → 2048 →2048 → 512 → softmax, and each fully connected layers follows a RELU activation. The lastsoftmax outputs a probability distribution over 0∼99 categories that represent discretized accuraciesin [0%, 99%] at the precision of 1%. Since the Cell structure is a non-linear graph, we use thefollowing coding scheme to vectorize them: a 6 digits vector codes a Block in a Cell; the first twodigits represent up to two layers in the left branch, while the 3rd and 4th digits represent up to twolayers in the right branch. TABLE.1 demonstrates the codes of layers, and we use 0 to pad the vectorif a layer is absent. The last two digits represent the input for the left and right branch, respectively.For the input, 0 indicates the input is the output of previous Cell, 1 is the previous previous Cell,and i+ 2 indicates the input is the output of Blocki. If a block is absent, it is [0,0,0,0,0,0]. Basically,these 6 digits specify the internal structure and connectivity of a Block. A Cell has up to 5 Blocksin our state space, so a 60 digits vector is sufficient to represent a state that includes the structure ofa RCell and NCell and their connectivities. See Appendix Sec.12 for detailed description of ourcoding scheme.

Backpropagation back-tracks the search path from the new node to the root to update visiting statistics.With the new node’s estimated q (from Eq.2 ) from Simulations, we iteratively back-propagate theinformation to its ancestral as:

Q(s, a)← Q(s, a) + q, N(s, a)← N(s, a) + 1, s← parent(s), a← πtree(s), (3)until it reaches the root node.

Meta-DNN Training utilizes previously trained architectures and their accuracies. The codingscheme of training architectures is same as the one used in the simulation. As AlphaX advances inthe searching, it also generates the new training data. So, the learning of Meta-DNN is end-to-end.

5

Page 6: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

4.3 Distributed AlphaX

AlphaX

Trainer 1(parent, arch)

(parent, arch, acc)

Clients

Trainer k

Server

  Iter N

SearchEngine Job Queue

Arch for Training

Rev Buffer

Backprop with Meta-DNN of iter N

Backprop with Real Acc of iter Z

Socket

Figure 4: The architecture of distributed AlphaX

A search iteration of AlphaX involves anactual network training which can takehours to complete. It is imperative to ex-tend AlphaX to work in the distributed set-ting. Fig.4 demonstrates the proposed ar-chitecture of distributed AlphaX. There isa master node that is exclusively schedul-ing the search, while there are clients thatare exclusively for training networks. Thegeneral procedures on the server side are asfollows: 1) The agent follows the normalselection and expansion steps. 2) The sim-ulation in MCTS picks a network archn for the actual training, and we push archn into a job queue,where archn represent the selected network architecture at iteration n, and rollout−from(archn) isthe node which it started the rollout from to reach archn. 3) The agent preemptively backpropagatesq̂ ← 1

k

∑i=1..k Pred(simi(s

′)) which is based only on predicted accuracies from the Meta-DNN atiteration n.

Q(s, a)← Q(s, a) + q̂, N(s, a)← N(s, a) + 1, s← parent(s), a← πtree(s). (4)

4) The server checks the receive buffer to retrieve a finished job from clients that includes archz ,accz . Then the agent starts the second backpropagation to propagate q ← accz+q̂

2 (Eq. 2) from thenode the rollout started (s← rollout−from(archz)) to replace the backpropagated q̂ with q:

Q(s, a)← Q(s, a) + q − q̂, s← parent(s), a← πtree(s). (5)

A popular approach to parallelize MCTS is by introducing a virtual loss, which is a method to avoidmaking all the processors to traverse the same trajectory by backpropagating a score (i.e. performanceof the DNN) of 0 before the rollout finishes and the actual score becomes available to the agent[7, 12, 41, 32]. However, backpropagating a score of 0 can significantly bias the search becausethe accuracy of the DNN can be close to 1.0, and the variance of the scores among the efficientarchitectures is often small. Thus, 0 may be significantly smaller than the actual score, biasing thesearch to avoid the region. To solve this issue, we instead preemptively backpropagate the score q̂predicted by the Meta-DNN which we can expect to be closer to the actual score until the actualtraining and testing is finishes and the actual q becomes available.

The client constantly tries to retrieve a job from the master job queue if it is free. It starts trainingonce it gets the job, then it transmits the finished job back to the server. So, each client is a dedicatedtrainer. We also consider the fault-tolerance by taking a snapshot of the server’s states every fewiterations, and AlphaX can resume the searching from the breakpoint using the latest snapshot.

A pseudocode for the whole system is available in Appendix 7.

5 Experiment

5.1 Searching for NASNet-style Architectures

Experiment Setup: we test AlphaX on state-of-the-art search space proposed by Zoph et al. [43].Our experiments focus on CIFAR10, since several works using this search space has validated thatgood architectures discovered on CIFAR10 also perform well on ImageNet [43, 23]. We setupAlphaX on 17 GPUs, with 16 GPUs (NV 1080) as clients and 1 GPU (NV TITAN Xp) as the server.AlphaX builds the network from scratch, and it runs autonomously for a week. Each clients trains anarchitecture for 50 epochs with cutout data augmentation during the search, then we manually selecttop ranked architectures from AlphaX to fully train them toward 300 epochs.

Quantitative Evaluations: AlphaX found several appealing architectures with compara-ble top accuracies as NASNet using only 17 GPUs in 5 days. This demonstrates23.5× speedup over the searching for NASNet [43] that used 500 GPUs in 4 days.

6

Page 7: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

Normal Cell

+

conv3*3

||

+

Cell Output

Cell Input1 Cell Input2

++

sep 5*5

sep 5*5

sep 7*7

+

conv3*3

sep 5*5

max 3*3

max 3*3

sep 5*5

+||

: Add

: Concatenate

(a) NCell

sep 5*5

Reduction Cell

+

sep 7*7 max 3*3

+

||

avg 3*3

+ +

Cell Output

Cell Input1 Cell Input2

avg 3*3 avg 3*3

+

avg 3*3

sep 7*7

max 5*5 max 3*3 conv1*1

sep 5*5

conv1*1

sep 7*7 conv3*3

conv3*3avg 3*3 sep 5*5

conv3*3 max 7*7

(b) RCell

Figure 5: (a) and (b) demonstrate RCell and NCell structures of Arch6 in Fig.7.

0 200 400 600 800 1000 1200

# network trained

0.78

0.8

0.82

0.84

0.86

0.88

be

st

ac

c i

n s

ea

rch

ing

best accuracy trace

(a) Best Accuracy Trace

0 200 400 600 800 1000 1200 1400

# network trained

0.7

0.72

0.74

0.76

ac

cu

rac

y

60

65

70

75

arc

h d

ist

w.r

.t t

he

me

an

acc arch dist

(b) Accuracy Trace by Archs

0 200 400 600 800 1000 1200 1400

# network trained

0

2

4

6

8

10

arc

h t

yp

es

0.7

0.72

0.74

0.76

ac

cu

rac

y

arch types

accuracy trace

exploitation

exploration

(c) KMean Clusters by Archs

Figure 6: The x axis of the 3 figures represents the sampled architectures as AlphaX progresses. (a)best accuracy trace in 50 epochs. (b) the left y axis is the accuracy of each sampled architecture,while the right axis is the architecture distance to the mean arch calculated from the encoding vector(details in supplemental material). (c) the left axis is 10 types of architectures clustered by k-means,and the right axis is the accuracy trace with each type of architectures highlighted by different colors.

100 150 200 250 300

epochs

0.8

0.85

0.9

0.95

ac

cu

rac

y NASNet 12.8MB

Arch1 8.8MB

Arch2 31.2MB

Arch3 10.3MB

Arch4 5.3MB

Arch5 6.1MB

Arch6 21MB

Arch7 18.3MB

Figure 7: The validation accuracy of NASNet and oftop-ranked architectures founded by AlphaX.

We replicate NASNet in Keras * as thebaseline, and Fig.7 demonstrates several in-teresting architectures founded by AlphaXduring the search. Their internal structuresof RCell and NCell are depicted in Ap-pendix Sec.11. Arch6 (the blue line) con-sistently outperforms NASNet up to 1 %w.r.t accuracy after 100 epochs, and it’s in-ternal structures are depicted by Fig.5a andFig.5b. Please note the structural pattern isdifferent from NASNet in the sense that itfeatures the ResNet style jump connection.Additionally, Arch1 (8.8MB) and Arch5(6.1MB) delivery the comparable top accu-racy as NASNet, with 31.2% and 52.34% less parameters, respectively.

The trending up best accuracy trace (Fig.6a) and accuracy trace (Fig.6b) demonstrate AlphaX is aneffective system for NAS. As AlphaX progresses, the top accuracy is incrementally improving overthe time. Here, the accuracy is less than 90 because we only train an architecture for 50 epochs withthe cutout data augmentation. The above 90% accuracies in Fig.7 are from 300 epochs training.

Qualitative Evaluations: Fig.6b and Fig.6c intend to provide some intuition about the searchbehavior of AlphaX. The architecture distance in the right axis of Fig.6b represents the euclideandistance between a sampled architecture and the mean architecture of entire samples, whereas thedistance is defined as Euclidean distance between the encoding of the sampled architecture and the

*https://keras.io

7

Page 8: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

mean value of the encoding. Since the coding scheme provides a vector representation of neuralarchitectures, we use the euclidean distance of the two vector to estimate the similarity of the twoarchitectures. The vector representation of an architecture also enables us to cluster them based ontheir intrinsic characteristics. We cluster the sampled architectures into 10 categories with k-means.The clustering results are shown by the left axis of Fig.6c. Since an accuracy corresponds to anarchitecture (right axis of Fig.6c), we also color the accuracy w.r.t the type found by k-mean.

We observed that AlphaX tends to search for similar architectures within a period, e.g. [200, 400]and [650, 750] in Fig.6b, which suggests that the search algorithm is showing exploitation behavior,searching for architectures similar to previously found architectures with high performance. Duringthe exploitation, MCTS searches for similar architectures in the current design sub-domain, which itconsiders the most promising at that particular moment. So, the subsequent architectures are similarto the point where the exploitation starts. The 10 continuous segments highlighted in Fig.6c supportsthis hypothesis.

The exploration behavior of AlphaX is also observable from either the drastic variations of archi-tectural distance in Fig.6b or the type transitions in Fig.6c. MCTS encourages to explore novelarchitectures by depreciating the current search direction, either with the number of visits (N ) or thelow reward (accuracy).

Generally, the search is improving over time based on the accuracy trace in Fig.6b. The searchingbehavior of AlphaX is quite different from the previously studied exploitation centric architecturesearch strategies such as reinforcement learning ([2, 42]), hill climbing (Elsken et al. [10]), beamsearch (Liu et al. [23]), or genetic algorithm (Real et al. [31]). MCTS solves this issue by usingthe visiting statistics to depreciate the exploitation to escape local optima. On the other hand, it isalso faster than the exploration-centric algorithms such as random search due to the exploitationfactor in UCB1. The key advantage of MCTS is in automatically balancing the tradeoff between theexploration and exploitation at the state level, so we observe a pattern of switching between thesetwo in Fig.6c.

5.2 Search Behavior of Different Algorithms

0 20 40 60 80 100

# discovered architectures

0.39

0.4

0.41

0.42

top

ac

cu

rac

y

HillClimbing

QD

Random

MCTS

CONV(64,2,2,)->CONV(32,4,2)->FC->softmax

CONV(64,4,2,)->CONV(32,2,2)->FC->softmax

Figure 8: Comparisons of different search algorithms.

To further clarify the importance of explo-ration and exploitation, we conduct a con-trolled experiment to demonstrates the be-havior of different search algorithms. Thetask is to find the first two layers for a linearnetwork. The entire search space contains100 states, so exhaustive search is feasible.Appendix 9 provides additional implemen-tation details and experimental setup. Fig.8demonstrates Random Search, Q-Learningand MCTS reached the same architectureand the top accuracy after exhausting theentire search space, while HillClimbing was trapped in a sub-optimal structure quickly. Also, MCTSwas faster than Q-Learning and Random to reach the global optimal, which suggests its algorithmicadvantages in balancing the exploration and exploitation.

6 Conclusion

In this paper, we addressed the problem of how to design neural architecture automatically within amoderate amount of computational time. To this end, we proposed AlphaX, a method for automatedneural architecture search and demonstrated that it achieves a competitive performance with muchless computational resource required than other NAS algorithms.

8

Page 9: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

References

[1] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmedbandit problem. Machine learning, 47(2-3):235–256, 2002.

[2] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing Neural NetworkArchitectures using Reinforcement Learning. pages 1–18, 2016.

[3] Andrew Brock, Theodore Lim, J. M. Ritchie, and Nick Weston. SMASH: One-Shot ModelArchitecture Search through HyperNetworks. 2017.

[4] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: one-shot modelarchitecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.

[5] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling,Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton.A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligenceand AI in games, 4(1):1–43, 2012.

[6] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Efficient architecture searchby network transformation. arXiv preprint arXiv:1707.04873, 2017.

[7] Guillaume MJ-B Chaslot, Mark HM Winands, and H Jaap van Den Herik. Parallel monte-carlotree search. In International Conference on Computers and Games, pages 60–71. Springer,2008.

[8] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowl-edge transfer. arXiv preprint arXiv:1511.05641, 2015.

[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scalehierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009.IEEE Conference on, pages 248–255. IEEE, 2009.

[10] Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. Simple And Efficient ArchitectureSearch for Convolutional Neural Networks. pages 1–14, 2017.

[11] Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. Simple and efficient architecture searchfor convolutional neural networks. arXiv preprint arXiv:1711.04528, 2017.

[12] Markus Enzenberger, Martin Muller, Broderick Arneson, and Richard Segal. Fuego—an open-source framework for board games and go engine based on monte carlo tree search. IEEETransactions on Computational Intelligence and AI in Games, 2(4):259–270, 2010.

[13] Chrisantha Fernando, Dylan Banarse, Malcolm Reynolds, Frederic Besse, David Pfau, MaxJaderberg, Marc Lanctot, and Daan Wierstra. Convolution by evolution: Differentiable patternproducing networks. In Proceedings of the Genetic and Evolutionary Computation Conference2016, pages 109–116. ACM, 2016.

[14] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum,and Frank Hutter. Efficient and robust automated machine learning. In Advances in NeuralInformation Processing Systems, pages 2962–2970, 2015.

[15] Isabelle Guyon, Kristin Bennett, Gavin Cawley, Hugo Jair Escalante, Sergio Escalera, Tin KamHo, Núria Macia, Bisakha Ray, Mehreen Saeed, Alexander Statnikov, et al. Design of the 2015chalearn automl challenge. In Neural Networks (IJCNN), 2015 International Joint Conferenceon, pages 1–8. IEEE, 2015.

[16] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106,2016.

[17] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimizationfor general algorithm configuration. In International Conference on Learning and IntelligentOptimization, pages 507–523. Springer, 2011.

[18] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrentnetwork architectures. In International Conference on Machine Learning, pages 2342–2350,2015.

[19] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional bayesianoptimisation and bandits via additive models. In International Conference on Machine Learning,pages 295–304, 2015.

9

Page 10: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

[20] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In Europeanconference on machine learning, pages 282–293. Springer, 2006.

[21] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.2009.

[22] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, JonathanHuang, and Kevin Murphy. Progressive Neural Architecture Search. 2017.

[23] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille,Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. arXiv preprintarXiv:1712.00559, 2017.

[24] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu.Hierarchical Representations for Efficient Architecture Search. pages 1–13, 2017.

[25] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon,Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, et al. Evolving deep neuralnetworks. arXiv preprint arXiv:1703.00548, 2017.

[26] Geoffrey F Miller, Peter M Todd, and Shailesh U Hegde. Designing neural networks usinggenetic algorithms. In ICGA, volume 89, pages 379–384, 1989.

[27] Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training deeparchitectures. arXiv preprint arXiv:1704.08792, 2017.

[28] Hieu Pham, Melody Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Faster discovery of neuralarchitectures by searching for paths in a large model. 2018.

[29] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution forimage classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.

[30] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan,Quoc Le, and Alex Kurakin. Large-Scale Evolution of Image Classifiers. 2017.

[31] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, JieTan, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprintarXiv:1703.01041, 2017.

[32] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mas-tering the game of go with deep neural networks and tree search. nature, 529(7587):484–489,2016.

[33] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machinelearning algorithms. In Advances in neural information processing systems, pages 2951–2959,2012.

[34] Kenneth O Stanley. Compositional pattern producing networks: A novel abstraction of develop-ment. Genetic programming and evolvable machines, 8(2):131–162, 2007.

[35] Kenneth O Stanley, David B D’Ambrosio, and Jason Gauci. A hypercube-based encoding forevolving large-scale neural networks. Artificial life, 15(2):185–212, 2009.

[36] Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmentingtopologies. Evolutionary computation, 10(2):99–127, 2002.

[37] Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A Genetic ProgrammingApproach to Designing Convolutional Neural Network Architectures. 2017.

[38] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Auto-weka: Combinedselection and hyperparameter optimization of classification algorithms. In Proceedings of the19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages847–855. ACM, 2013.

[39] Martin Wistuba. Finding competitive network architectures within a day using uct. arXivpreprint arXiv:1712.07420, 2017.

[40] Lingxi Xie and Alan Yuille. Genetic cnn. arXiv preprint arXiv:1703.01513, 2017.[41] Kazuki Yoshizoe, Akihiro Kishimoto, Tomoyuki Kaneko, Haruhiro Yoshimoto, and Yutaka

Ishikawa. Scalable distributed monte-carlo tree search. In Fourth Annual Symposium onCombinatorial Search, 2011.

10

Page 11: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

[42] Barret Zoph and Quoc V. Le. Neural Architecture Search with Reinforcement Learning. pages1–16, 2016.

[43] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferablearchitectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.

11

Page 12: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

Selection Expansion Simulation Backpropagation

Q+=0.7

N+=1

2.8 /15

3.0 / 4 8.5 /10

0.3 / 1 1.8 / 2

0.9 / 1

NN NNNN ・・・

πtree

πrandom

Q(s, a) /  N(s, a)

UCB1 score

1.9 1.6

2.0 2.1

2.1 +∞

2.8 /15

3.0 / 4 8.5 /10

0.3 / 1 1.8 / 2

0.9 / 1 0 / 0

1.9 1.6

2.0 2.1

2.1 +∞

2.8 /15

3.0 / 4 8.5 /10

0.3 / 1 1.8 / 2

0.9 / 1 0.7 / 1

1.9 1.6

2.0 2.1

2.1

3.5 /16

3.7 / 5 8.5 /10

0.3 / 1 2.5 / 3

0.9 / 1 0.7 / 1

1.8 1.6

2.1 2.1

2.4 1.9

Figure 9: High level overview of the search procedure of AlphaX

7 Pseudocode for AlphaX

In this section, we describe the pseudocode of the Distributed AlphaX. Algorithm 3 describes thesearch engine of AlphaX, with Figure 9 shows a high-level overview of the search engine. Algorithm2 is the server procedure to send the architecture to train chosen by the MCTS to the client andcollect the architectures trained and their scores. Algorithm 1 is the client which trains and tests thearchitecture provided by the server.

Algorithm 1 Client

1: Require: Start working once building connection to the server2: while True do3: if The client is connected to server then4: network ← Receive()5: accuracy ← Train(network)6: Send (network, accuracy) to the Server7: else8: Wait for re-connection9: end if

10: end while

Algorithm 2 Server

1: while size(TASK−QUEUE) > 2 do2: while no idle client do3: Continue . Wait for dispatching jobs until there are idle clients4: end while5: Create a new connection to a random idle client6: network ← TASK−QUEUE.pop()7: Send network to a Client8: if Received−Signal() then9: network, accuracy ← Receive−Result()

10: acc(network)← accuracy11: state← rollout−from(network)12: Backpropagation(state, (accuracy − q̂(state))/2, 0)

. Replace q̂ with q = Q(s, a) in Eq. 213: Train the meta-DNN with a new data (network, accuracy)14: else15: Continue16: end if17: end while

12

Page 13: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

Algorithm 3 Search Engine (MCTS)

1: function Expansion(state)2: Create a new node in a tree for state.3: for all action available at state do4: Q(state, action)← 0, N(state, action)← 05: end for6: end function7:8: function Simulation(state)9: action← none

10: while action is not term do11: randomly generate an action12: next−net← Apply(state, action)

. Apply returns the next state when action is applied to state13: end while14: end function15:16: function Backpropagation(state, q, n)17: while state is not root do18: state← parent(state)19: Q(state, action)← Q(state, action) + q20: N(state, action)← N(state, action) + n21: end while22: end function23:24: Require: Start from the root25: while episode < MAX−episode do26: Server()27: cur−state← root−node28: i← 029: while i < MAX−tree−depth do30: i← i+ 131: next−action← Selection(cur−state) . Select an action based on Eq. 132: if next−state not in tree then33: next−state← Expansion(next−action)34: Tt ← Simulationt(next−state) for t = 0...k

. k is the number of simulations we run using the Meta-DNN35: TASK−QUEUE.push(T0)36: rollout−from(T0)← next−state37: q̂(next−state)← 1

k

∑i=1..k Pred(Ti)

. Pred returns an accuracy predicted by the Meta-DNN38: Backpropagation(next−state, q̂) . Preemptive backpropagation to send q̂39: end if40: end while41: end while

13

Page 14: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

8 Details of the State and Action Space

This section contains the description of the state and action space for NASNet design space.

We constrain the state space to make the design problem manageable. The state space exponentiallygrows with the depth: a k layers linear network has nk architecture variations, where n is the numberof layer types. We leverage the GPU DRAM size, our current computing resources and the designheuristics from leading DNNs, to propose the following constraints on the state space: 1) a branchhas at most 2 layers; 2) a cell has at most 5 blocks; 3) any data flow from a Cell input to output haveat most 1 pooling layer. 4) the depth of blocks is limited to 2. 5) we use the layers in TABLE.1:

Actions also preserve the constraints imposed on the state space. If the next state reaches out of thedesign boundary, the agent automatically removes the action from the action set. For example, weexclude the "adding a new layer" actions for a branch if it already has 2 layers. So, the action set isdynamically changing w.r.t states.

14

Page 15: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

0 500 1000 1500 2000

# network trained

0.2

0.3

0.4

0.5

0.6

0.7

ac

cu

rac

y

= 0.2 [1, 0]

Figure 10: The effect of ε toward the top accuracy in the search.

9 Experimental Setup for Section 5.2

The state space we consider consists of small simple CNNs. We consider a convolution layer and asoftmax layer. For a convolutino layer we allow a range of 1 or 2 for stride, 32 or 64 for filters, and 2or 4 for kernels. We set the maximum depth to 3. We constraint that the final layer is always a denselayer with output size set to the number of classes.

Actions consists of (1) add conv layer, (2) add softmax layer, (3) increment or decrement one of theparameters for a convolution layer in the current CNN. For MCTS, random, and Q-learning agentshave a terminal action to terminate the episode.

MCTS: For this experiment we do not use a Meta-DNN – our goal here is to study the behavior ofeach search algorithm.

Random: Random agent selects action uniformly at random.

Q-learning: We implemented a tabular Q-learning agent with ε-greedy strategy. The learning rate isset to 0.2. We set the discount factor to be 1 in order not to prioritize short-term rewards. We fix ε to0.2. We initialize the Q-value with 0.5.

Hill Climbing: For a hill climbing, an agent starts from a randomly chosen initial state. It trains everyarchitectures in the child nodes and moves to the child node of which architecture performed the best,and repeat this procedure. Unlike MCTS and Q-learning which trains a NN only when it is a terminalstate, hill climbing considers every state (and its child nodes) it visits to train. As such, we do nothave a terminal action for hill climbing. As we observed that the hill climbing tends to stuck to alocal optimum, we restart from a randomly chosen initial state if it visits the same state twice in thetrajectory.

10 ε Scheduling for Q-Learning

In this section we investigate the effect of the hyperparameter ε on the performance of Q-learning.Fig.10 is the comparison of Q-learning with two configurations of ε. ε ∈ [1, 0] means we reduce ε by0.1 every 200 steps. Here, we try to find a linear network with the depth up to 10.

We observed that a small ε (encouraging exploitation) quickly converges to a local optimal; a large ε(encouraging exploration), though slow, finds networks with higher accuracies. This suggests thatthe performance of Q-learning is contingent on choosing a good scheduling of the hyperparameter ε.Baker et al. [2] proposed to start Q-learning with the exploration probability (ε) set to 1.0, graduallyreducing the ε down. While Q-learning requires to manually tune the schedule of the probabilityof exploration, the upper confidence bound in MCTS automatically balances the exploration andexploitation based on the number of visits (Fig.1) and only requires to choose a single parameter c.Because our goal is to minimize the amount of human effort required for deep learning, we deployedUCT for our search engine.

15

Page 16: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

11 Architectures Designed by AlphaX

conv3*3

Normal Cell

+

+

sep 3*3

max 7*7

conv3*3

sep 7*7 max 7*7

max 3*3

conv3*3

conv3*3

sep 7*7

+

avg 3*3

||

sep 7*7

sep 3*3conv3*3 max 3*3

max 3*3

+ +

Cell Output

Cell Input1 Cell Input2

(a) Normal Cell

sep 5*5

Reduction Cell

+ +

+

conv1*1

sep 7*7

max 7*7 conv3*3

max 3*3

conv3*3

sep 3*3

sep 7*7 max 7*7

||

Cell Input1 Cell Input2

Cell Output

: Add

||  : Concatenate +

(b) Reduction Cell

Figure 11: Founded network architecture1

max 3*3

Normal Cell

+

sep 5*5 sep 3*3

+

||

conv1*1

+ +

Cell Output

Cell Input1 Cell Input2

max 3*3 avg 5*5

+

sep 5*5

sep 5*5

+||

: Add

 : Concatenate

(a) Normal Cell

Reduction Cell

+ +

+

max 7*7 conv1*1

max 7*7 max 5*5

conv3*3

||

Cell Input1 Cell Input2

Cell Output

sep 5*5 sep 7*7

conv3*3 max 5*5

(b) Reduction Cell

Figure 12: Founded network architecture2

sep 5*5

Normal Cell

+ +

+

Cell Input1 Cell Input2

Cell Output

+||

: Add  : Concatenate

(a) Normal Cell

avg 3*3

Reduction Cell

sep 3*3avg 5*5

conv1*1 sep 3*3

sep 3*3 sep 3*3

conv3*3

sep 7*7

avg 5*5

+

+

+

+

+

||

Cell Input1 Cell Input2

Cell Output

(b) Reduction Cell

Figure 13: Founded network architecture3

16

Page 17: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

max 5*5

Normal Cell

+

+

+

sep 5*5

sep 5*5

||

Cell Input1 Cell Input2

Cell Output

+ : Add

||  : Concatenate

(a) Normal Cell

avg 3*3

Reduction Cell

+

+

+

sep 3*3

max 5*5

max 7*7

||

max 5*5 sep 7*7

conv1*1

sep 7*7sep 5*5

sep 7*7

+

max 3*3

Cell Input1 Cell Input2

Cell Output

(b) Reduction Cell

Figure 14: Founded network architecture4

sep 3*3

sep 3*3 conv3*3

avg 3*3

sep 3*3

sep 3*3 conv 3*3

Normal Cell

++ +

+ +

||

Cell Input1 Cell Input2

Cell Output

(a) Normal Cell

avg 3*3

sep 5*5

sep 5*5

sep 5*5

sep 5*5

avg 3*3sep 7*7

Cell Input2

Reduction Cell

+ +

sep 7*7

+

||

Cell Input1

Cell Output

+ : Add

||  : Concatenate

(b) Reduction Cell

Figure 15: Founded network architecture5

sep 3*3

Normal Cell

+

+

max 7*7

conv3*3

sep 7*7

+

||

conv3*3

sep 7*7

+ +

Cell Output

Cell Input1 Cell Input2

sep 7*7

(a) Normal Cell

Reduction Cell

+

+

avg 3*3 sep 5*5

conv3*3

||

Cell Input1 Cell Input2

Cell Output

sep 5*5 sep 7*7

sep 5*5 sep 7*7

conv3*3

+||

: Add

 : Concatenate

(b) Reduction Cell

Figure 16: Founded network architecture6

17

Page 18: Abstract - arXiv · 2018. 5. 22. · Hyper-parameter optimization is an important problem in Machine Learning [38, 14, 15], while Bayesian Optimization (BO) is a popular approach

Block2Block0 Block1 Block3 Block4

encode encode encode encode encode

[X,X,X,X,X,X] [X,X,X,X,X,X] [X,X,X,X,X,X] [X,X,X,X,X,X] [X,X,X,X,X,X]

Concat

[X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X]

Normal Cell

Block0 Block1 Block2 Block3 Block4

encode encode encode encode encode

[X,X,X,X,X,X] [X,X,X,X,X,X] [X,X,X,X,X,X] [X,X,X,X,X,X] [X,X,X,X,X,X]

Concat

[X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X]

Reduction Cell

Concat

[X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X]

Cell Input1 Cell Input2 Cell Input1 Cell Input2

(a) Network Architecture Encoding

Block 

cell h-2

sep 5*5

11

conv 1*1

7

encoder

max 5*5 avg 5*5

encoder encoder

5 2

+left

branchrightbranch

encoder

cell h-1 encoder 0 1

flatten[11, 5, 7, 2, 0, 1]

Output

encoder

(b) Singe Block Encoding

Figure 17: Encoder scheme for cell-based network architecture

Table 1: The code of different types of layers

layers code layer code layer code layer code

3x3 avg pool 1 3x3 max pool 4 1x1 conv 7 3x3 depth-separable conv 105x5 avg pool 2 5x5 max pool 5 3x3 conv 8 5x5 depth-separable conv 117x7 avg pool 3 7x7 max pool 6 identity 9 7x7 depth-separable conv 12

Put awesome architectures

12 Encoder scheme

we use the following coding scheme to vectorize the networks: a 6 digits vector codes a Block in aCell; the first two digits represent up to two layers in the left branch, while the 3rd and 4th digitsrepresent up to two layers in the right branch. TABLE.1 demonstrates the codes of layers, and weuse 0 to pad the vector if a layer is absent. The last two digits represent the input for the left andright branch, respectively. For the input, 0 indicates the input is the output of previous Cell, 1 is theprevious previous Cell, and i+ 2 indicates the input is the output of Blocki. If a block is absent, itis [0,0,0,0,0,0]. Figure.17 shows the scheme of encoding a cell-based network architecture. Basically,these 6 digits specify the internal structure and connectivity of a Block. A Cell has up to 5 Blocksin our state space, so a 60 digits vector is sufficient to represent a state that includes the structure of aRCell and NCell and their connectivities.

For example, as the architecture shown in Figure.12, There are 5 blocks in its normal cell and 3blocks in its reduction cell. According to its architecture, the encoding procedure could be followedby the steps below.

Step 1: Encode Block0 (bottom left) in NCell. There is only one layer (max 3*3) in its left branchand one layer (sep 5*5) in its right branch. According to TABLE.1, the left branch couldbe encoded as [4,0] while the right branch could be encoded as [11,0]. The inputs ofboth branch are from previous previous cell, so, the inputs could be encoded as [0,0]. Byconcatenating these 3 vectors (6 digits) from left, right to inputs, the whole block could beencoded as [4,0,11,0,0,0].

Step 2: After encoding Block0, we should repeatedly do step1 until all the blocks are encoded.If a block is absent, it is [0,0,0,0,0,0]. In this case, Block1 is [10,0,0,0,0,1], Block2is[4,0,2,0,1,1], Block3 is [7,0,11,11,1,1] and Block4 is [0,0,0,0,3,3].

Step 3: We concatenate the codes from Block0 to Block4. Hence, the NCell should be representedas [4,0,11,0,0,0,10,0,0,0,0,1,4,0,2,0,1,1,7,0,11,11,1,1,0,0,0,0,3,3].

Step 4: We encode the RCell with same operations. In this case, the RCell should be representedas [11,8,12,5,0,1,6,0,7,0,0,2,6,0,5,8,2,1,0,0,0,0,0,0,0,0,0,0,0,0].

Step 5: Concatenate the code from NCell and RCell. The architecture is represented by a 60 digitsvector.

18