
Aging Evolution for Image Classifier Architecture Search

Esteban Real∗† and Alok Aggarwal† and Yanping Huang† and Quoc V. Le
Google Brain, Mountain View, California, USA

†Equal contribution. ∗Correspondence: [email protected]

Abstract

The effort devoted to hand-crafting neural network image classifiers has motivated the use of architecture search to discover them automatically. Although evolutionary algorithms have been repeatedly applied to neural network topologies, the image classifiers thus discovered have remained inferior to human-crafted ones. Here, we evolve an image classifier—AmoebaNet-A—that surpasses hand-designs for the first time. To do this, we modify the tournament selection evolutionary algorithm by introducing an age property to favor the younger genotypes. Matching size, AmoebaNet-A has comparable accuracy to current state-of-the-art ImageNet models discovered with more complex architecture-search methods. Scaled to larger size, AmoebaNet-A sets a new state-of-the-art 83.9% top-1 / 96.6% top-5 ImageNet accuracy. In a controlled comparison against a well known reinforcement learning algorithm, we give evidence that evolution can obtain results faster with the same hardware, especially at the earlier stages of the search. This is relevant when fewer compute resources are available. Evolution is, thus, a simple method to effectively discover high-quality architectures.

Introduction

Until recently, most state-of-the-art image classifier architectures have been manually designed by human experts (Krizhevsky, Sutskever, and Hinton 2012; He et al. 2016; Szegedy et al. 2017). To speed up the process, researchers have looked into automated methods (Baker et al. 2017a; Zoph and Le 2016; Real et al. 2017). These methods are now collectively known as architecture-search algorithms. A traditional approach is neuro-evolution of topologies (Miller, Todd, and Hegde 1989; Angeline, Saunders, and Pollack 1994; Stanley and Miikkulainen 2002). Improved hardware now allows scaling up evolution to produce high-quality image classifiers (Real et al. 2017; Miikkulainen et al. 2017; Liu et al. 2018b). Yet, the architectures produced by evolutionary algorithms / genetic programming have not reached the accuracy of those directly designed by human experts. Here we evolve image classifiers that surpass hand-designs.

To do this, we make two additions to the standard evolutionary process. First, we propose a change to the well-established tournament selection evolutionary algorithm (Goldberg and Deb 1991) that we refer to as aging evolution or regularized evolution. Whereas in tournament selection, the best genotypes (architectures) are kept, we propose to associate each genotype with an age, and bias the tournament selection to choose the younger genotypes. We will show that this change turns out to make a difference. The connection to regularization will be clarified in the Discussion section.

Second, we implement the simplest set of mutations that would allow evolving in the NASNet search space (Zoph et al. 2018). This search space associates convolutional neural network architectures with small directed graphs in which vertices represent hidden states and labeled edges represent common network operations (such as convolutions or pooling layers). Our mutation rules only alter architectures by randomly reconnecting the origin of edges to different vertices and by randomly relabeling the edges, covering the full search space.

Searching in the NASNet space allows a controlled comparison between evolution and the original method for which it was designed, reinforcement learning (RL). Thus, this paper presents the first comparative case study of architecture-search algorithms for the image classification task. Within this case study, we will demonstrate that evolution can attain similar results with a simpler method, as will be shown in the Discussion section. In particular, we will highlight that in all our experiments evolution searched faster than RL and random search, especially at the earlier stages, which is important when experiments cannot be run for long times due to compute resource limitations.

Despite its simplicity, our approach works well in our benchmark against RL. It also evolved a high-quality model, which we name AmoebaNet-A. This model is competitive with the best image classifiers obtained by any other algorithm today at similar sizes (82.8% top-1 / 96.1% top-5 ImageNet accuracy). When scaled up, it sets a new state-of-the-art accuracy (83.9% top-1 / 96.6% top-5 ImageNet accuracy).

Related Work

Review papers provide informative surveys of earlier (Yao 1999; Floreano, Dürr, and Mattiussi 2008) and more recent (Elsken, Metzen, and Hutter 2018) literature on image classifier architecture search, including successful RL studies (Zoph et al. 2018; Liu et al. 2018a) and evolutionary studies like those mentioned in the Introduction. Other methods have also been applied: cascade-correlation (Fahlman and Lebiere 1990), boosting (Cortes et al. 2017), hill-climbing (Elsken, Metzen, and Hutter 2017), MCTS (Negrinho and Gordon 2017), SMBO (Mendoza et al. 2016), random search (Bergstra and Bengio 2012), and grid search (Zagoruyko and Komodakis 2016). Some methods even forewent the idea of independent architectures (Saxena and Verbeek 2016). There is much architecture-search work beyond image classification too, but that is outside our scope.

Even though some methods stand out due to their efficiency (Suganuma, Shirakawa, and Nagao 2017; Pham et al. 2018), many approaches use large amounts of resources. Several recent papers reduced the compute cost through progressive-complexity search stages (Liu et al. 2018a), hypernets (Brock et al. 2018), accuracy prediction (Baker et al. 2017b; Klein et al. 2017; Domhan, Springenberg, and Hutter 2017), warm-starting and ensembling (Feurer et al. 2015), parallelization, reward shaping and early stopping (Zhong, Yan, and Liu 2018), or Net2Net transformations (Cai et al. 2018). Most of these methods could in principle be applied to evolution too, but this is beyond the scope of this paper.

A popular approach to evolution has been through generational algorithms, e.g. NEAT (Stanley and Miikkulainen 2002). All models in the population must finish training before the next generation is computed. Generational evolution becomes inefficient in a distributed environment where a different machine is used to train each model: machines that train faster models finish earlier and must wait idle until all machines are ready. Real-time algorithms address this issue, e.g. rtNEAT (Stanley, Bryant, and Miikkulainen 2005) and tournament selection (Goldberg and Deb 1991). Unlike the generational algorithms, however, these discard models according to their performance or do not discard them at all, resulting in models that remain alive in the population for a long time—even for the whole experiment. We will present evidence that the finite lifetimes of aging evolution can give better results than direct tournament selection, while retaining its efficiency.

Methods

This section contains a readable description of the methods. The Methods Details section gives additional information.

Search Space

All experiments use the NASNet search space (Zoph et al. 2018). This is a space of image classifiers, all of which have the fixed outer structure indicated in Figure 1 (left): a feed-forward stack of Inception-like modules called cells. Each cell receives a direct input from the previous cell (as depicted) and a skip input from the cell before it (Figure 1, middle). The cells in the stack are of two types: the normal cell and the reduction cell. All normal cells are constrained to have the same architecture, as are reduction cells, but the architecture of the normal cells is independent of that of the reduction cells. Other than this, the only difference between them is that every application of the reduction cell is followed by a stride of 2 that reduces the image size, whereas normal cells preserve the image size. As can be seen in the figure, normal cells are arranged in three stacks of N cells. The goal of the architecture-search process is to discover the architectures of the normal and reduction cells.

Figure 1: NASNet Search Space (Zoph et al. 2018). LEFT: the full outer structure (omitting skip inputs for clarity). MIDDLE: detailed view with the skip inputs. RIGHT: cell example. Dotted line demarcates a pairwise combination.

As depicted in Figure 1 (middle and right), each cell has two input activation tensors and one output. The very first cell takes two copies of the input image. After that, the inputs are the outputs of previous cells. In particular, the first cell in each stack of normal cells uses two copies of the output of the previous reduction cell as its two inputs.

Both normal and reduction cells must conform to the following construction. The two cell input tensors are considered hidden states “0” and “1”. More hidden states are then constructed through pairwise combinations. A pairwise combination is depicted in Figure 1 (right, inside dashed circle). It consists in applying an operation (or op) to an existing hidden state, applying another op to another existing hidden state, and adding the results to produce a new hidden state. Ops belong to a fixed set of common convnet operations such as convolutions and pooling layers. Repeating hidden states or operations within a combination is permitted. In the cell example of Figure 1 (right), the first pairwise combination applies a 3x3 average pool op to hidden state 0 and a 3x3 max pool op to hidden state 1, in order to produce hidden state 2. The next pairwise combination can now choose from hidden states 0, 1, and 2 to produce hidden state 3 (it chose 0 and 1 in Figure 1), and so on. After exactly five pairwise combinations, any hidden states that remain unused (hidden states 5 and 6 in Figure 1) are concatenated to form the output of the cell (hidden state 7).

A given architecture is fully specified by the five pairwise combinations that make up the normal cell and the five that make up the reduction cell. Once the architecture is specified, the model still has two free parameters that can be used to alter its size (and its accuracy): the number of normal cells per stack (N) and the number of output filters of the convolution ops (F). N and F are determined manually.
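To make the encoding concrete, the sketch below shows one way such an architecture could be represented in Python. It is purely illustrative: the type names (Combination, Architecture), the identifiers in the op list, and the helper functions are our own simplifications, not the data structures used in the actual experiments (the op names are transcribed from the Methods Details section).

from collections import namedtuple
import random

# A pairwise combination applies op1 to hidden state in1 and op2 to hidden
# state in2, then adds the two results to create a new hidden state.
Combination = namedtuple("Combination", ["in1", "op1", "in2", "op2"])
Architecture = namedtuple("Architecture", ["normal_cell", "reduction_cell"])

# Op names transcribed from the Methods Details section; identifiers are ours.
OPS = ["none", "sep_conv_3x3", "sep_conv_5x5", "sep_conv_7x7",
       "avg_pool_3x3", "max_pool_3x3", "dil_sep_conv_3x3", "conv_1x7_7x1"]

def random_cell(num_combinations=5):
    # Hidden states 0 and 1 are the two cell inputs; combination i creates
    # hidden state i + 2 and may read any state created before it.
    cell = []
    for i in range(num_combinations):
        usable_states = range(2 + i)
        cell.append(Combination(in1=random.choice(usable_states), op1=random.choice(OPS),
                                in2=random.choice(usable_states), op2=random.choice(OPS)))
    return cell

def random_architecture():
    return Architecture(normal_cell=random_cell(), reduction_cell=random_cell())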

Evolutionary Algorithm

The evolutionary method we used is summarized in Algorithm 1. It keeps a population of P trained models throughout the experiment. The population is initialized with models with random architectures (“while |population|” in Algorithm 1). All architectures that conform to the search space described are possible and equally likely.

Algorithm 1 Aging Evolution

population ← empty queue                              ▷ The population.
history ← ∅                                           ▷ Will contain all models.
while |population| < P do                             ▷ Initialize population.
    model.arch ← RANDOMARCHITECTURE()
    model.accuracy ← TRAINANDEVAL(model.arch)
    add model to right of population
    add model to history
end while
while |history| < C do                                ▷ Evolve for C cycles.
    sample ← ∅                                        ▷ Parent candidates.
    while |sample| < S do
        candidate ← random element from population    ▷ The element stays in the population.
        add candidate to sample
    end while
    parent ← highest-accuracy model in sample
    child.arch ← MUTATE(parent.arch)
    child.accuracy ← TRAINANDEVAL(child.arch)
    add child to right of population
    add child to history
    remove dead from left of population               ▷ Oldest.
    discard dead
end while
return highest-accuracy model in history

After this, evolution improves the initial population in cycles (“while |history|” in Algorithm 1). At each cycle, it samples S random models from the population, each drawn uniformly at random with replacement. The model with the highest validation fitness within this sample is selected as the parent. A new architecture, called the child, is constructed from the parent by the application of a transformation called a mutation. A mutation causes a simple and random modification of the architecture and is described in detail below. Once the child architecture is constructed, it is then trained, evaluated, and added to the population. This process is called tournament selection (Goldberg and Deb 1991).

It is common in tournament selection to keep the population size fixed at the initial value P. This is often accomplished with an additional step within each cycle: discarding (or killing) the worst model in the random S-sample. We will refer to this approach as non-aging evolution. In contrast, in this paper we prefer a novel approach: killing the oldest model in the population—that is, removing from the population the model that was trained the earliest (“remove dead from left of population” in Algorithm 1). This favors the newer models in the population. We will refer to this approach as aging evolution. In the context of architecture search, aging evolution allows us to explore the search space more, instead of zooming in on good models too early, as non-aging evolution would (see Discussion section for details).

In practice, this algorithm is parallelized by distributing the “while |history|” loop in Algorithm 1 over multiple workers. A full implementation can be found online ([link anonymized for review]). Intuitively, the mutations can be thought of as providing exploration, while the parent selection provides exploitation. The parameter S controls the aggressiveness of the exploitation: S = 1 reduces to a type of random search, and 2 ≤ S ≤ P leads to evolution of varying greediness.
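For readers who prefer code to pseudocode, here is a minimal single-process Python sketch of Algorithm 1. The helpers random_architecture, mutate and train_and_eval are trivial placeholders standing in for the real search space, mutations, and model training; they are our own stand-ins so the control flow (sampling S candidates, mutating the best, and removing the oldest model) can be run as-is.

import collections
import random

class Model:
    def __init__(self):
        self.arch = None
        self.accuracy = None

def random_architecture():
    return random.randint(0, 2 ** 16 - 1)          # placeholder "genotype"

def mutate(arch):
    return arch ^ (1 << random.randint(0, 15))     # placeholder mutation: flip one bit

def train_and_eval(arch):
    # Placeholder noisy fitness; the real version trains and evaluates a network.
    return bin(arch).count("1") / 16.0 + random.gauss(0, 0.01)

def aging_evolution(cycles=1000, population_size=100, sample_size=25):
    population = collections.deque()   # oldest on the left, youngest on the right
    history = []                       # every model ever evaluated

    # Initialize the population with random architectures.
    while len(population) < population_size:
        model = Model()
        model.arch = random_architecture()
        model.accuracy = train_and_eval(model.arch)
        population.append(model)
        history.append(model)

    # Evolve: sample S candidates, mutate the best one, kill the oldest model.
    while len(history) < cycles:
        sample = [random.choice(population) for _ in range(sample_size)]
        parent = max(sample, key=lambda m: m.accuracy)
        child = Model()
        child.arch = mutate(parent.arch)
        child.accuracy = train_and_eval(child.arch)
        population.append(child)
        history.append(child)
        population.popleft()           # aging: discard the oldest model

    return max(history, key=lambda m: m.accuracy)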

New models are constructed by applying a mutation to existing models, transforming their architectures in random ways. To navigate the NASNet search space described above, we use two mutations that we call the hidden state mutation and the op mutation. Only one of them is applied in each cycle, choosing between the two at random.

Figure 2: Illustration of the two mutation types.

The hidden state mutation consists of first making a random choice of whether to modify the normal cell or the reduction cell. Once a cell is chosen, the mutation picks one of the five pairwise combinations uniformly at random. Once the pairwise combination is picked, one of the two elements of the pair is chosen uniformly at random. The chosen element has one hidden state. This hidden state is now replaced with another hidden state from within the cell, subject to the constraint that no loops are formed (to keep the feed-forward nature of the convnet). Figure 2 (top) shows an example.

The op mutation behaves like the hidden state mutation as far as choosing one of the two cells, one of the five pairwise combinations, and one of the two elements of the pair. Then it differs in that it modifies the op instead of the hidden state. It does this by replacing the existing op with a random choice from a fixed list of ops (see Methods Details). Figure 2 (bottom) shows an example.
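The following sketch shows how these two mutations could act on the illustrative cell encoding from the Search Space section, i.e. a list of five (in1, op1, in2, op2) tuples. The function names and representation are ours, not the released implementation.

import random

OPS = ["none", "sep_conv_3x3", "sep_conv_5x5", "sep_conv_7x7",
       "avg_pool_3x3", "max_pool_3x3", "dil_sep_conv_3x3", "conv_1x7_7x1"]

def hidden_state_mutation(cell):
    cell = [list(comb) for comb in cell]          # work on a mutable copy
    i = random.randrange(len(cell))               # pick one pairwise combination
    slot = random.choice([0, 2])                  # which of the two inputs to rewire
    # Reconnect to any earlier hidden state; this keeps the graph feed-forward
    # (no loops), since combination i may only read states 0 .. i + 1.
    # (A fuller implementation would resample if the same state is picked.)
    cell[i][slot] = random.randrange(2 + i)
    return [tuple(comb) for comb in cell]

def op_mutation(cell):
    cell = [list(comb) for comb in cell]
    i = random.randrange(len(cell))
    slot = random.choice([1, 3])                  # which of the two ops to relabel
    cell[i][slot] = random.choice(OPS)
    return [tuple(comb) for comb in cell]

def mutate(arch):
    # arch is a (normal_cell, reduction_cell) pair; pick one cell and one
    # mutation type uniformly at random, as described in the text.
    normal, reduction = arch
    mutation = random.choice([hidden_state_mutation, op_mutation])
    if random.random() < 0.5:
        return (mutation(normal), reduction)
    return (normal, mutation(reduction))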

Baseline Algorithms

Our main baseline is the application of RL to the same search space. RL was implemented using the algorithm and code in the baseline study (Zoph et al. 2018). An LSTM controller outputs the architectures, constructing the pairwise combinations one at a time, and then gets a reward for each architecture by training and evaluating it. More detail can be found in the baseline study. We also compared against random search (RS). For RS, each model is constructed randomly and independently, just like models are constructed for the initial population in the evolutionary algorithm.

Experimental Setup

We ran controlled comparisons at scale, ensuring identical conditions for evolution, RL and random search (RS). In particular, all methods used the same computer code for network construction, training and evaluation. Experiments always searched on the CIFAR-10 dataset (Krizhevsky and Hinton 2009).

As in the baseline study, we first performed architecture search over small models (i.e. small N and F) until 20k models were evaluated. After that, we used the model augmentation trick (Zoph et al. 2018): we took architectures discovered by the search (e.g. the output of an evolutionary experiment) and turned them into full-size, accurate models. To accomplish this, we enlarged the models by increasing N and F, and trained them for a longer time on the CIFAR-10 or the ImageNet classification datasets (Krizhevsky and Hinton 2009; Deng et al. 2009). For ImageNet, a stem was added at the input of the model to reduce the image size, as shown in Figure 7 (right). This is the same procedure as in the baseline study.

Methods Details

This section complements the Methods section with the details necessary to reproduce our experiments. Possible ops: none (identity); 3x3, 5x5 and 7x7 separable (sep.) convolutions (convs.); 3x3 average (avg.) pool; 3x3 max pool; 3x3 dilated (dil.) sep. conv.; 1x7 then 7x1 conv. Evolved with P=100, S=25. CIFAR-10 dataset (Krizhevsky and Hinton 2009) with 5k withheld examples for validation. Standard ImageNet dataset (Deng et al. 2009), 1.2M 331x331 images and 1k classes; 50k examples withheld for validation; standard validation set used for testing. During the search phase, each model trained for 25 epochs; N=3/F=24, 1 GPU. Each experiment ran on 450 K40 GPUs for 20k models (approx. 7 days). To optimize evolution, we tried 5 configurations with P/S of: 100/2, 100/50, 20/20, 100/25, 64/16; best was 100/25. To optimize RL, we started with parameters already tuned in the baseline study and further optimized the learning rate over 8 configurations: 0.00003, 0.00006, 0.00012, 0.0002, 0.0004, 0.0008, 0.0016, 0.0032; best was 0.0008. To avoid selection bias, plots do not include optimization runs, as was decided a priori. The best few (20) models were selected from each experiment and augmented to N=6/F=32, as in the baseline study; batch 128, SGD with momentum rate 0.9, L2 weight decay 5 × 10^-4, initial lr 0.024 with cosine decay, 600 epochs, ScheduledDropPath to 0.7 prob; auxiliary softmax with half the weight of the main softmax. For Table 1, we used N/F of 6/32 and 6/36. For the ImageNet table, 6/190 and 6/204 and standard training methods (Szegedy et al. 2017): distributed sync SGD with 100 P100 GPUs; RMSProp optimizer with 0.9 decay and ε=0.1, 4 × 10^-5 weight decay, 0.1 label smoothing, auxiliary softmax weighted by 0.4; dropout probability 0.5; ScheduledDropPath to 0.7 probability (as in the baseline—note that this trick only contributes 0.3% top-1 ImageNet acc.); 0.001 initial lr, decaying every 2 epochs by 0.97. For the largest model, N=18/F=448. Wherever possible, we used the same conditions as the baseline study.
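For convenience, the sketch below collects the search-phase and CIFAR-10 augmentation settings listed above into plain Python dictionaries. The values are transcribed from the text; the key names are illustrative and do not correspond to the authors' configuration code.

# Hypothetical summary of the settings above; values transcribed from the text.
SEARCH_PHASE = {
    "dataset": "CIFAR-10 (5k examples withheld for validation)",
    "population_size": 100, "sample_size": 25,
    "epochs_per_model": 25, "N": 3, "F": 24,
    "models_per_experiment": 20_000, "gpus": 450,
}
CIFAR_AUGMENTATION = {
    "N": 6, "F": 32, "batch_size": 128, "optimizer": "SGD + momentum 0.9",
    "weight_decay": 5e-4, "initial_lr": 0.024, "lr_schedule": "cosine decay",
    "epochs": 600, "scheduled_drop_path": 0.7, "aux_softmax_weight": 0.5,
}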

Results

Comparison With RL and RS Baselines

Currently, reinforcement learning (RL) is the predominant method for architecture search. In fact, today's state-of-the-art image classifiers have been obtained by architecture search with RL (Zoph et al. 2018; Liu et al. 2018a). Here we seek to compare our evolutionary approach against their RL algorithm. We performed large-scale side-by-side architecture-search experiments on CIFAR-10. We first optimized the hyper-parameters of the two approaches independently (details in Methods Details section). Then we ran 5 repeats of each of the two algorithms—and also of random search (RS).

Figure 3 shows the model accuracy as the experiments progress, highlighting that evolution yielded more accurate models at the earlier stages, which could become important in a resource-constrained regime where the experiments may have to be stopped early (for example, when 450 GPUs for 7 days is too much). At the later stages, if we allow the experiments to run for the full 20k models (as in the baseline study), evolution produced models with similar accuracy. Both evolution and RL compared favorably against RS. It is important to note that the vertical axis of Figure 3 does not present the compute cost of the models, only their accuracy. Next, we will consider their compute cost as well.

Figure 3: Time-course of 5 identical large-scale experiments for each algorithm (evolution, RL, and RS), showing accuracy before augmentation on CIFAR-10. All experiments were stopped when 20k models were evaluated, as done in the baseline study. Note this plot does not show the compute cost of models, which was higher for the RL ones.

As in the baseline study, the architecture-search experiments above were performed over small models, to be able to train them quicker. We then used the model augmentation trick (Zoph et al. 2018) by which we take an architecture discovered by the search (e.g. the output of an evolutionary experiment) and turn it into a full-size, accurate model, as described in the Methods.

Figure 4 compares the augmented top models from the three sets of experiments. It shows test accuracy and model compute cost. The latter is measured in FLOPs, by which we mean the total count of operations in the forward pass, so lower is better. Evolved architectures had higher accuracy (and similar FLOPs) than those obtained with RS, and lower FLOPs (and similar accuracy) than those obtained with RL. Number of parameters showed similar behavior to FLOPs. Therefore, evolution occupied the ideal relative position in this graph within the scope of our case study.

Figure 4: Final augmented models from 5 identical architecture-search experiments for each algorithm, on CIFAR-10. Each marker corresponds to the top models from one experiment.

So far we have been comparing evolution with our reproduction of the experiments in the baseline study, but it is also informative to compare directly against the results reported by the baseline study. We select our evolved architecture with the highest validation accuracy and call it AmoebaNet-A. Table 1 compares it with the top model of the baseline study, NASNet-A. Such a comparison is not entirely controlled, as we have no way of ensuring the network training code was identical and that the same number of experiments were done to obtain the final model. The table summarizes the results of training AmoebaNet-A at sizes comparable to a NASNet-A version, showing that AmoebaNet-A is slightly more accurate (when matching model size) or considerably smaller (when matching accuracy). We did not train our model at larger sizes on CIFAR-10. Instead, we moved to ImageNet to do further comparisons in the next section.

Table 1: CIFAR-10 results for AmoebaNet-A, compared to the top model reported in the baseline study.

Model                  # Params   Error (%)

NASNet-A (baseline)    3.3 M      3.41
AmoebaNet-A            2.6 M      3.40 ± 0.08
AmoebaNet-A            3.2 M      3.34 ± 0.06

ImageNet Results

Following the accepted standard, we compare our top model's performance on the popular ImageNet dataset against other top models from the literature. Again, we use AmoebaNet-A, the model with the highest validation accuracy on CIFAR-10 among our evolution experiments. We highlight that the model was evolved on CIFAR-10 and then transferred to ImageNet, so the evolved architecture cannot have overfit the ImageNet dataset. When retrained on ImageNet, AmoebaNet-A performs comparably to the baseline for the same number of parameters (Table 2).

Table 2: ImageNet classification results for AmoebaNet-A compared to hand-designs (top) and other automated methods (middle). Evolution-based approaches are marked with ∗. The “cost” refers to the compute cost measured in multiply-adds. Non-AmoebaNet models are from the literature (Szegedy et al. 2017; Xie et al. 2017; Zhang et al. 2017; Chen et al. 2017; Xie and Yuille 2017; Zhong, Yan, and Liu 2018; Liu et al. 2018b; Zoph et al. 2018; Liu et al. 2018a). We omitted Squeeze-and-Excite-Net because it was not benchmarked on the same ImageNet dataset version.

Model                # Params   Cost    Top-1 / Top-5 Acc. (%)

Incep-ResNet V2      55.8M      13.2B   80.4 / 95.3
ResNeXt-101          83.6M      31.5B   80.9 / 95.6
PolyNet              92.0M      34.7B   81.3 / 95.8
Dual-Path-Net-131    79.5M      32.0B   81.5 / 95.8

GeNet-2∗             156M       –       72.1 / 90.4
Block-QNN-B∗         –          –       75.7 / 92.6
Hierarchical∗        64M        –       79.7 / 94.8
NASNet-A             88.9M      23.8B   82.7 / 96.2
PNASNet-5            86.1M      25.0B   82.9 / 96.2

AmoebaNet-A∗         86.7M      23.1B   82.8 / 96.1
AmoebaNet-A∗         99.6M      26.2B   82.8 / 96.2

Finally, we focused on AmoebaNet-A exclusively and enlarged it, setting a new state-of-the-art accuracy on ImageNet of 83.9% / 96.6% top-1 / top-5 accuracy with 469M parameters. Such high parameter counts may be beneficial in training other models too but we have not managed to do this yet.

Discussion

This section will suggest directions for future work, which we will motivate by speculating about the evolutionary process and by summarizing additional minor results. The details of these minor results have been relegated to the appendices, as they are not necessary to understand or reproduce our main results above.

Scope of results. Some of our findings may be restricted to the search spaces and datasets we used. A natural direction for future work is to extend the controlled comparison to more search spaces, datasets, and tasks, to verify generality, or to more algorithms. Appendix A presents preliminary results, performing evolutionary and RL searches over three search spaces (SP-I: same as in the Results section; SP-II: like SP-I but with more possible ops; SP-III: like SP-II but with more pairwise combinations) and three datasets (gray-scale CIFAR-10, MNIST, and gray-scale ImageNet), at a small-compute scale (on CPU, F=8, N=1). Evolution reached equal or better accuracy in all cases (Figure 5, top).

Figure 5: TOP: Comparison of the final model accuracy in five different contexts, from left to right: G-CIFAR/SP-I, G-CIFAR/SP-II, G-CIFAR/SP-III, MNIST/SP-I and G-ImageNet/SP-I. Each circle marks the top test accuracy at the end of one experiment. BOTTOM: Search progress of the experiments in the case of G-CIFAR/SP-II (LEFT, best for RL) and G-CIFAR/SP-III (RIGHT, best for evolution).

Algorithm speed. In our comparison study, Figure 3 suggests that both RL and evolution are approaching a common accuracy asymptote. This raises the question of which algorithm gets there faster. The plots indicate that evolution reaches half-maximum accuracy in roughly half the time. We abstain, nevertheless, from further quantifying this effect since it depends strongly on how speed is measured (the number of models necessary to reach accuracy a depends on a; the natural choice of a = a_max / 2 may be too low to be informative; etc.). Algorithm speed may be more important when exploring larger spaces, where reaching the optimum can require more compute than is available. We saw an example of this in the SP-III space, where evolution stood out (Figure 5, bottom-right). Therefore, future work could explore evolving on even larger spaces.

Model speed. The speed of individual models produced is also relevant. Figure 4 demonstrates that evolved models are faster (lower FLOPs). We speculate that asynchronous evolution may be reducing the FLOPs because it is indirectly optimizing for speed even when training for a fixed number of epochs: fast models may do well because they “reproduce” quickly even if they initially lack the higher accuracy of their slower peers. Verifying this speculation could be the subject of future work.

Benefits of aging evolution. Aging evolution seemed advantageous in additional small-compute-scale experiments, shown in Figure 6 and presented in more detail in Appendix B. These were carried out on CPU instead of GPU, and used a gray-scale version of CIFAR-10, to reduce compute requirements. In the appendix, we also show that these results tend to hold when varying the dataset or the search space.

Figure 6: Small-compute-scale comparison between our aging tournament selection variant and the non-aging variant, for different population sizes (P) and sample sizes (S), showing that aging tends to be beneficial (most markers are above the y = x line).

Understanding aging evolution and regularization. We can speculate that aging may help navigate the training noise in evolutionary experiments, as follows. Noisy training means that models may sometimes reach high accuracy just by luck. In non-aging evolution (NAE, i.e. standard tournament selection), such lucky models may remain in the population for a long time—even for the whole experiment. One lucky model, therefore, can produce many children, causing the algorithm to focus on it, reducing exploration. Under aging evolution (AE), on the other hand, all models have a short lifespan, so the population is wholly renewed frequently, leading to more diversity and more exploration. In addition, another effect may be in play, which we describe next. In AE, because models die quickly, the only way an architecture can remain in the population for a long time is by being passed down from parent to child through the generations. Each time an architecture is inherited it must be retrained. If it produces an inaccurate model when retrained, that model is not selected by evolution and the architecture disappears from the population. The only way for an architecture to remain in the population for a long time is to retrain well repeatedly. In other words, AE can only improve a population through the inheritance of architectures that retrain well. (In contrast, NAE can improve a population by accumulating architectures/models that were lucky when they trained the first time.) That is, AE is forced to pay attention to architectures rather than models. In other words, the addition of aging involves introducing additional information to the evolutionary process: architectures should retrain well. This additional information prevents overfitting to the training noise, which makes it a form of regularization in the broader mathematical sense¹. Regardless of the exact mechanism, in Appendix C we perform experiments to verify the plausibility of the conjecture that aging helps navigate noise. There we construct a very basic search space where the only difficulty is a noisy evaluation. If our conjecture is true, AE should be better in that basic space too. We found this to be the case. We leave further verification of the conjecture to future work, noting that theoretical results may prove useful here.

¹ https://en.wikipedia.org/wiki/Regularization_(mathematics)

Simplicity of aging evolution. A desirable feature of evolutionary algorithms is their simplicity. By design, the application of a mutation causes a random change. The process of constructing new architectures, therefore, is entirely random. What makes evolution different from random search is that only the good models are selected to be mutated. This selection tends to improve the population over time. In this sense, evolution is simply “random search plus selection”. In outline, the process can be described briefly: “keep a population of N models and proceed in cycles: at each cycle, copy-mutate the best of S random models and kill the oldest in the population”. Implementation-wise, we believe the methods of this paper are sufficient for a reader to understand evolution. The sophisticated nature of the RL alternative introduces complexity in its implementation: it requires back-propagation and poses challenges to parallelization (Salimans et al. 2017). Even different implementations of the same algorithm have been shown to produce different results (Henderson et al. 2018). Finally, evolution is also simple in that it has few meta-parameters, most of which do not need tuning (Real et al. 2017). In our study, we only adjusted 2 meta-parameters and only through a handful of attempts (see Methods Details section). In contrast, note that the RL baseline requires training an agent/controller which is often itself a neural network with many weights (such as an LSTM), and its optimization has more meta-parameters to adjust: learning rate schedule, greediness, batching, replay buffer, etc. (These meta-parameters are all in addition to the weights and training parameters of the image classifiers being searched, which are present in both approaches.) It is possible that through careful tuning, RL could be made to produce even better models than evolution, but such tuning would likely involve running many experiments, making it more costly. Evolution did not require much tuning, as described. It is also possible that random search would produce equally good models if run for a very long time, which would be very costly.

Figure 7: AmoebaNet-A architecture. The overall model (Zoph et al. 2018) (LEFT) and the AmoebaNet-A normal cell (RIGHT, TOP) and reduction cell (RIGHT, BOTTOM).

Interpreting architecture search. Another important direction for future work is that of analyzing architecture-search experiments (regardless of the algorithm used) to try to discover new neural network design patterns. Anecdotally, for example, we found that architectures with high output vertex fan-in (number of edges into the output vertex) tend to be favored in all our experiments. In fact, the models in the final evolved populations have a mean fan-in value that is 3 standard deviations above what would be expected from randomly generated models. We verified this pattern by training various models with different fan-in values and the results confirm that accuracy increases with fan-in, as had been found in ResNeXt (Xie et al. 2017). Discovering broader patterns may require designing search spaces specifically for this purpose.
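As a concrete illustration of the fan-in statistic, the helper below counts the edges into the output vertex under the illustrative cell encoding used earlier, assuming (as in the text's description) that every hidden state left unused by the five combinations is concatenated into the output. Whether the two cell inputs should be counted when unused is a detail of the search space we do not settle here; this sketch counts them.

def output_fanin(cell):
    # cell is a list of (in1, op1, in2, op2) tuples; inputs 0 and 1 plus one
    # new state per combination, so the cell has 2 + len(cell) hidden states.
    num_states = 2 + len(cell)
    used = {comb[0] for comb in cell} | {comb[2] for comb in cell}
    # The output concatenates every state no combination consumes; the fan-in
    # of the output vertex is the number of such unused states.
    return sum(1 for state in range(num_states) if state not in used)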

Conclusion

This paper used an evolutionary algorithm to discover image classifier architectures. Our contributions are the following:

• We proposed aging evolution, a variant of tournament selection by which genotypes die according to their age, favoring the young. This improved upon standard tournament selection while still allowing for efficiency at scale through asynchronous population updating. We open-sourced the code. We also implemented simple mutations that permit the application of evolution to the popular NASNet search space.

• We presented the first controlled comparison of algorithms for image classifier architecture search in a case study of evolution, RL and random search. We showed that evolution had somewhat faster search speed and stood out in the regime of scarcer resources / early stopping. Evolution also matched RL in final model quality, employing a simpler method.

• We evolved AmoebaNet-A (Figure 7), a competitive image classifier. On ImageNet, it is the first evolved model to surpass hand-designs. Matching size, AmoebaNet-A has comparable accuracy to top image classifiers discovered with other architecture-search methods. At large size, it sets a new state-of-the-art accuracy. We open-sourced the code and a checkpoint (https://tfhub.dev/google/imagenet/amoebanet_a_n18_f448/classification/1).


Acknowledgments

We wish to thank Megan Kacholia, Vincent Vanhoucke, Xiaoqiang Zheng and especially Jeff Dean for their support and valuable input; Chris Ying for his work helping tune AmoebaNet models and for his help with specialized hardware, Barret Zoph and Vijay Vasudevan for help with the code and experiments used in their paper (Zoph et al. 2018), as well as Jiquan Ngiam, Jacques Pienaar, Arno Eigenwillig, Jianwei Xie, Derek Murray, Gabriel Bender, Golnaz Ghiasi, Saurabh Saxena and Jie Tan for other coding contributions; Jacques Pienaar, Luke Metz, Chris Ying and Andrew Selle for manuscript comments, all the above and Patrick Nguyen, Samy Bengio, Geoffrey Hinton, Risto Miikkulainen, Jeff Clune, Kenneth Stanley, Yifeng Lu, David Dohan, David So, David Ha, Vishy Tirumalashetty, Yoram Singer, and Ruoming Pang for helpful discussions; and the larger Google Brain team.


Appendix A: Evolution and Reinforcement Learning

Motivation

In this appendix, we will extend the comparison between evolution and reinforcement learning (RL) from the Results Section. Evolutionary algorithms and RL have been applied recently to the field of architecture search. Yet, comparison is difficult because studies tend to use novel search spaces, preventing direct attribution of the results to the algorithm. For example, the search space may be small instead of the algorithm being fast. The picture is blurred further by the use of different training techniques that affect model accuracy (Ciregan, Meier, and Schmidhuber 2012; Wan et al. 2013; Srivastava et al. 2014), different definitions of FLOPs that affect model compute cost² and different hardware platforms that affect algorithm run-time³. Accounting for all these factors, we will compare the two approaches in a variety of image classification contexts. To achieve statistical confidence, we will present repeated experiments without sampling bias.

Setup

All evolution and RL experiments used the NASNet search space design (Zoph et al. 2018). Within this design, we define three concrete search spaces that differ in the number of pairwise combinations (C) and in the number of ops allowed (see Methods Section). In order of increasing size, we will refer to them as SP-I (e.g. Figure A-1f), SP-II, and SP-III (e.g. Figure A-1g). SP-I is the exact variant used in the main text and in the study that we use as our baseline (Zoph et al. 2018). SP-II increases the allowed ops from 8 to 19 (identity; 1x1 and 3x3 convs.; 3x3, 5x5 and 7x7 sep. convs.; 2x2 and 3x3 avg. pools; 2x2 min pool; 2x2 and 3x3 max pools; 3x3, 5x5 and 7x7 dil. sep. convs.; 1x3 then 3x1 conv.; 1x7 then 7x1 conv.; 3x3 dil. conv. with rates 2, 4 and 6). SP-III allows for larger tree structures within the cells (C=15, same 19 ops).

The evolutionary algorithm is the same as that in the main text. The RL algorithm is the one used in the baseline study. We chose this baseline because, when we began, it had obtained the most accurate results on CIFAR-10, a popular dataset for image classifier architecture search.

We ran evolution and RL experiments for comparison purposes at different compute scales, always ensuring both approaches used identical conditions. In particular, evolution and RL used the same code for network construction, training and evaluation. The experiments in this appendix were performed at a smaller compute scale than in the main text, to reduce resource usage: we used gray-scale versions of popular datasets (e.g. “G-ImageNet” instead of ImageNet), we ran on CPU instead of GPU and trained relatively small models (F=8, see Methods Details in main text) for only 4 epochs. Where unstated, the experiments ran on SP-I and G-CIFAR.

² For example, see https://stackoverflow.com/questions/329174/what-is-flop-s-and-is-it-a-good-measure-of-performance.

³ A Tesla P100 can be twice as fast as a K40, for example.

Findings

We first optimized the meta-parameters for evolution and for RL by running experiments with each algorithm, repeatedly, under each condition (Figure A-1a). We then compared the algorithms in 5 different contexts by swapping the dataset or the search space (Figure A-1b). Evolution was either better than or equal to RL, with statistical significance. The best contexts for evolution and for RL are shown in more detail in Figures A-1c and A-1d, respectively. They show the progress of 5 repeats of each algorithm. The initial speed of evolution is noticeable, especially in the largest search space (SP-III). Figures A-1f and A-1g illustrate the top architectures from SP-I and SP-III, respectively. Regardless of context, Figure A-1e indicates that accuracy under evolution increases significantly faster than under RL at the initial stage. This stage was not accelerated by higher RL learning rates.

Outcome

The main text provides a comparison between algorithms for image classifier architecture search in the context of the SP-I search space on CIFAR-10, at scale. This appendix extends those results, varying the dataset and the search space by running many small experiments, confirming the conclusions of the main text.

Figure A-1: Evolution and RL in different contexts. Plots show repeated evolution (orange) and RL (blue) experiments side-by-side. (a) Summary of hyper-parameter optimization experiments on G-CIFAR. We swept the learning rate (lr) for RL (left) and the population size (P) and sample size (S) for evolution (right). We ran 5 experiments (circles) for each scenario. The vertical axis measures the mean validation accuracy (MVA) of the top 100 models in an experiment. Superposed on the raw data are ± 2SEM error bars. From these results, we selected the best meta-parameters to use in the remainder of this figure. (b) We assessed robustness by running the same experiments in 5 different contexts, spanning different datasets and search spaces: G-CIFAR/SP-I, G-CIFAR/SP-II, G-CIFAR/SP-III, MNIST/SP-I and G-ImageNet/SP-I, shown from left to right. These experiments ran to 20k models. The vertical axis measures the mean testing accuracy (MTA) of the top 100 models (selected by validation accuracy). (c) and (d) show a detailed view of the progress of the experiments in the G-CIFAR/SP-II and G-CIFAR/SP-III contexts, respectively. The horizontal axes indicate the number of models (m) produced as the experiment progresses. (e) Resource-constrained settings may require stopping experiments early. At 5k models, evolution performs better than RL in all 5 contexts. (f) and (g) show a stack of normal cells of the best model found for G-CIFAR in the SP-I and SP-III search spaces, respectively. The “h” labels some of the hidden states. The ops (“avg 3x3”, etc.) are listed in full form in the text. Data flows from left to right. See the baseline study for a detailed description of these diagrams. In (f), N=3 (see Methods section), so the cell is replicated three times; i.e. the left two-thirds of the diagram (grayed out) are constrained to mirror the right third. This is in contrast with the vastly larger SP-III search space of (g), where a bigger, unconstrained construct without replication (N=1) is explored.


Appendix B: Aging and Non-Aging Evolution

Motivation

In this appendix, we will extend the comparison between aging evolution (AE) and standard tournament selection / non-aging evolution (NAE). As was described in the Methods Section, the evolutionary algorithm used in this paper keeps the population size constant by always removing the oldest model whenever a new one is added; we will refer to this algorithm as AE. A recent paper used a similar method but kept the population size constant by removing the worst individual in each tournament (Real et al. 2017); we will refer to that algorithm as NAE. This appendix will show how these two algorithms compare in a variety of contexts.

Setup

The search spaces and datasets were the same as in Appendix A.

Findings

Figure B-1: A comparison of NAE and AE under 5 different contexts, spanning different datasets and search spaces: G-CIFAR/SP-I, G-CIFAR/SP-II, G-CIFAR/SP-III, MNIST/SP-I and G-ImageNet/SP-I, shown from left to right. For each context, we show the final MTA of a few NAE and a few AE experiments (circles) in adjacent columns. We superpose ± 2SEM error bars, where SEM denotes the standard error of the mean. The first context contains many repeats with identical meta-parameters and their MTA values seem normally distributed (Shapiro–Wilk test). Under this normality assumption, the error bars represent 95% confidence intervals.

We performed experiments in 5 different search space–dataset contexts. In each context, we ran several repeats of evolutionary search using NAE and AE (Figure B-1). Under 4 of the 5 contexts, AE resulted in statistically significant higher accuracy at the end of the runs, on average. The exception was the G-ImageNet search space, where the experiments were extremely short due to the compute demands of training on so much data using only CPUs. Interestingly, in the two contexts where the search space was bigger (SP-II and SP-III), all AE runs did better than all NAE runs.

Additionally, we performed three experiments comparing AE and NAE at scale, under the same conditions as in the main text. The results, which can be seen in Figure B-2, provide some verification that observations from smaller CPU experiments in the previous paragraph generalize to the large-compute regime.

Figure B-2: A comparison of AE and NAE at scale. These experiments use the same conditions as the main text (including dataset, search space, resources and duration). From top to bottom: an AE experiment with good AE meta-parameters from Appendix A, an analogous NAE experiment, and an NAE experiment with the meta-parameters used in a recent study (Real et al. 2017). These accuracy values are not meaningful in absolute terms, as the models need to be augmented to reach their maximum accuracy, as described in the Methods Section.

Outcome

The Discussion Section in the main text suggested that AE tends to perform better than NAE across various parameters for one fixed search space–dataset context. Such robustness is desirable for computationally demanding architecture search experiments, where we cannot always afford many runs to optimize the meta-parameters. This appendix extends those results to show that the conclusion holds across various contexts.


Appendix C: Aging Evolution in Basic Search Space

Motivation

As indicated in the Discussion Section, we suspect that aging may help navigate the noisy evaluation in an evolution experiment. We leave verification of this suspicion to future work, but for motivation we provide here a sanity check for it. We construct a very basic search space in which the only difficulty is a noisy evaluation. Within this basic search space, we will see that aging evolution outperforms non-aging evolution.

Setup

The basic search space we use here does not involve any neural networks. The goal is to evolve solutions to a very simple, single-optimum, D-dimensional, noisy optimization problem with a signal-to-noise ratio matching that of our neuro-evolution experiments.

The search space used is the set of vertices of a D-dimensional unit cube. A specific vertex is “analogous” to a neural network architecture in a real experiment. In a real experiment, training and evaluating an architecture yields a noisy accuracy. Likewise, in this basic search space, we assign a noisy simulated accuracy (SA) to each cube vertex. The SA is the fraction of coordinates that are zero, plus a small amount of Gaussian noise (µ = 0, σ = 0.01, matching the observed noise for neural networks). Thus, the goal is to get close to the optimum, the origin. The sample complexity used was 10k. This space is helpful because an experiment completes in milliseconds.

This optimization problem can be seen as a simplification of the evolutionary search for the minimum of a multi-dimensional integer-valued paraboloid with bounded support, where the mutations treat the values along each coordinate categorically. If we restrict the domain along each direction to the set {0, 1}, we reduce the problem to the unit cube described above. The paraboloid's value at a cube corner is just the number of coordinates that are not zero. We mention this connection because searching for the minimum of a paraboloid seems like a more natural choice for a trivial problem (“trivial” compared to architecture search). The simpler unit cube version, however, was chosen because it permits faster computation.
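A minimal sketch of this basic search space and of the two evolutionary variants is given below. The simulated accuracy follows the definition above (fraction of zero coordinates plus Gaussian noise); the single-bit-flip mutation and the default parameter values are illustrative assumptions rather than the exact simulation settings.

import collections
import random

def simulated_accuracy(vertex, sigma=0.01):
    # Fraction of zero coordinates plus Gaussian noise, as defined above.
    return sum(1 for x in vertex if x == 0) / len(vertex) + random.gauss(0.0, sigma)

def mutate(vertex):
    flipped = list(vertex)
    i = random.randrange(len(vertex))
    flipped[i] = 1 - flipped[i]                    # flip one coordinate
    return tuple(flipped)

def evolve(dims=64, cycles=10_000, population_size=100, sample_size=25, aging=True):
    Model = collections.namedtuple("Model", ["vertex", "accuracy"])
    population = collections.deque()
    history = []
    while len(population) < population_size:
        v = tuple(random.randint(0, 1) for _ in range(dims))
        m = Model(v, simulated_accuracy(v))
        population.append(m)
        history.append(m)
    while len(history) < cycles:
        sample = [random.choice(population) for _ in range(sample_size)]
        parent = max(sample, key=lambda m: m.accuracy)
        child_v = mutate(parent.vertex)
        child = Model(child_v, simulated_accuracy(child_v))
        population.append(child)
        history.append(child)
        if aging:
            population.popleft()                                   # AE: kill the oldest
        else:
            population.remove(min(sample, key=lambda m: m.accuracy))  # NAE: kill the worst in the sample
    return max(history, key=lambda m: m.accuracy)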

We stress that these simulations are not intended to truly mimic architecture search experiments over the space of neural networks. We used them only as a testing ground for techniques that evolve solutions in the presence of noisy evaluations.

Findings

We found that optimized NAE and AE perform similarly in low-dimensional problems, which are easier. As the dimensionality (D) increases, AE becomes relatively better than NAE (Figure B-1).

Figure B-1: Results in the basic search space. The graph summarizes thousands of evolutionary search simulations. The vertical axis measures the simulated accuracy (SA) and the horizontal axis the dimensionality (D) of the problem, a measure of its difficulty. For each D, we optimized the meta-parameters for NAE and AE independently. To do this, we carried out 100 simulations for each meta-parameter combination and averaged the outcomes. We plot here the optima found, together with ± 2SEM error bars. The graph shows that in this basic search space, AE is never worse and is significantly better for larger D (note the broad range of the vertical axis).

Outcome

The findings provide circumstantial evidence in favor of our suspicion that aging may help navigate noise (Discussion Section), suggesting that attempting to verify this with more generality may be an interesting direction for future work.


References

Angeline, P. J.; Saunders, G. M.; and Pollack, J. B. 1994. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks.
Baker, B.; Gupta, O.; Naik, N.; and Raskar, R. 2017a. Designing neural network architectures using reinforcement learning. In ICLR.
Baker, B.; Gupta, O.; Raskar, R.; and Naik, N. 2017b. Accelerating neural architecture search using performance prediction. ICLR Workshop.
Bergstra, J., and Bengio, Y. 2012. Random search for hyper-parameter optimization. JMLR.
Brock, A.; Lim, T.; Ritchie, J. M.; and Weston, N. 2018. SMASH: One-shot model architecture search through hypernetworks. In ICLR.
Cai, H.; Chen, T.; Zhang, W.; Yu, Y.; and Wang, J. 2018. Efficient architecture search by network transformation. In AAAI.
Chen, Y.; Li, J.; Xiao, H.; Jin, X.; Yan, S.; and Feng, J. 2017. Dual path networks. In NIPS.
Ciregan, D.; Meier, U.; and Schmidhuber, J. 2012. Multi-column deep neural networks for image classification. In CVPR.
Cortes, C.; Gonzalvo, X.; Kuznetsov, V.; Mohri, M.; and Yang, S. 2017. AdaNet: Adaptive structural learning of artificial neural networks. In ICML.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
Domhan, T.; Springenberg, J. T.; and Hutter, F. 2017. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI.
Elsken, T.; Metzen, J.-H.; and Hutter, F. 2017. Simple and efficient architecture search for convolutional neural networks. ICLR Workshop.
Elsken, T.; Metzen, J. H.; and Hutter, F. 2018. Neural architecture search: A survey. arXiv.
Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. In NIPS.
Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.; Blum, M.; and Hutter, F. 2015. Efficient and robust automated machine learning. In NIPS.
Floreano, D.; Dürr, P.; and Mattiussi, C. 2008. Neuroevolution: from architectures to learning. Evolutionary Intelligence.
Goldberg, D. E., and Deb, K. 1991. A comparative analysis of selection schemes used in genetic algorithms. FOGA.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2018. Deep reinforcement learning that matters. AAAI.
Klein, A.; Falkner, S.; Springenberg, J. T.; and Hutter, F. 2017. Learning curve prediction with Bayesian neural networks. ICLR.
Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Master's thesis, Dept. of Computer Science, U. of Toronto.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
Liu, C.; Zoph, B.; Shlens, J.; Hua, W.; Li, L.-J.; Fei-Fei, L.; Yuille, A.; Huang, J.; and Murphy, K. 2018a. Progressive neural architecture search. ECCV.
Liu, H.; Simonyan, K.; Vinyals, O.; Fernando, C.; and Kavukcuoglu, K. 2018b. Hierarchical representations for efficient architecture search. In ICLR.
Mendoza, H.; Klein, A.; Feurer, M.; Springenberg, J. T.; and Hutter, F. 2016. Towards automatically-tuned neural networks. In Workshop on Automatic Machine Learning.
Miikkulainen, R.; Liang, J.; Meyerson, E.; Rawal, A.; Fink, D.; Francon, O.; Raju, B.; Navruzyan, A.; Duffy, N.; and Hodjat, B. 2017. Evolving deep neural networks. arXiv.
Miller, G. F.; Todd, P. M.; and Hegde, S. U. 1989. Designing neural networks using genetic algorithms. In ICGA.
Negrinho, R., and Gordon, G. 2017. DeepArchitect: Automatically designing and training deep architectures. arXiv.
Pham, H.; Guan, M. Y.; Zoph, B.; Le, Q. V.; and Dean, J. 2018. Faster discovery of neural architectures by searching for paths in a large model. ICLR Workshop.
Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y. L.; Le, Q.; and Kurakin, A. 2017. Large-scale evolution of image classifiers. In ICML.
Salimans, T.; Ho, J.; Chen, X.; and Sutskever, I. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv.
Saxena, S., and Verbeek, J. 2016. Convolutional neural fabrics. In NIPS.
Simmons, J. P.; Nelson, L. D.; and Simonsohn, U. 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR.
Stanley, K. O., and Miikkulainen, R. 2002. Evolving neural networks through augmenting topologies. Evol. Comput.
Stanley, K. O.; Bryant, B. D.; and Miikkulainen, R. 2005. Real-time neuroevolution in the NERO video game. TEVC.
Suganuma, M.; Shirakawa, S.; and Nagao, T. 2017. A genetic programming approach to designing convolutional neural network architectures. In GECCO.
Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. A. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI.
Wan, L.; Zeiler, M.; Zhang, S.; Le Cun, Y.; and Fergus, R. 2013. Regularization of neural networks using DropConnect. In ICML.
Xie, L., and Yuille, A. 2017. Genetic CNN. In ICCV.
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; and He, K. 2017. Aggregated residual transformations for deep neural networks. In CVPR.
Yao, X. 1999. Evolving artificial neural networks. IEEE.
Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. In BMVC.
Zhang, X.; Li, Z.; Loy, C. C.; and Lin, D. 2017. PolyNet: A pursuit of structural diversity in very deep networks. In CVPR.
Zhong, Z.; Yan, J.; and Liu, C.-L. 2018. Practical network blocks design with Q-learning. In AAAI.
Zoph, B., and Le, Q. V. 2016. Neural architecture search with reinforcement learning. In ICLR.
Zoph, B.; Vasudevan, V.; Shlens, J.; and Le, Q. V. 2018. Learning transferable architectures for scalable image recognition. In CVPR.