
Chapter 7
Towards Automatically-Tuned Deep Neural Networks

Hector Mendoza, Aaron Klein, Matthias Feurer, Jost Tobias Springenberg, Matthias Urban, Michael Burkart, Maximilian Dippel, Marius Lindauer, and Frank Hutter

Abstract Recent advances in AutoML have led to automated tools that can compete with machine learning experts on supervised learning tasks. In this work, we present two versions of Auto-Net, which provide automatically-tuned deep neural networks without any human intervention. The first version, Auto-Net 1.0, builds upon ideas from the competition-winning system Auto-sklearn by using the Bayesian Optimization method SMAC and uses Lasagne as the underlying deep learning (DL) library. The more recent Auto-Net 2.0 builds upon a recent combination of Bayesian Optimization and HyperBand, called BOHB, and uses PyTorch as its DL library. To the best of our knowledge, Auto-Net 1.0 was the first automatically-tuned neural network to win competition datasets against human experts (as part of the first AutoML challenge). Further empirical results show that ensembling Auto-Net 1.0 with Auto-sklearn can perform better than either approach alone, and that Auto-Net 2.0 can perform better yet.

7.1 Introduction

Neural networks have significantly improved the state of the art on a variety of benchmarks in recent years and opened many new promising research avenues [22, 27, 36, 39, 41]. However, neural networks are not easy to use for non-experts since their performance crucially depends on proper settings of a large set of hyperparameters (e.g., learning rate and weight decay) and architecture choices (e.g., number of layers and type of activation functions).

H. Mendoza · A. Klein · M. Feurer · J. T. Springenberg · M. Urban · M. Burkart · M. Dippel · M. Lindauer
Department of Computer Science, University of Freiburg, Freiburg, Baden-Württemberg, Germany
e-mail: [email protected]

F. Hutter (✉)
Department of Computer Science, University of Freiburg, Freiburg, Germany

© The Author(s) 2019
F. Hutter et al. (eds.), Automated Machine Learning, The Springer Series on Challenges in Machine Learning, https://doi.org/10.1007/978-3-030-05318-5_7


Here, we present work towards effective off-the-shelf neural networks based on approaches from automated machine learning (AutoML).

AutoML aims to provide effective off-the-shelf learning systems to free experts and non-experts alike from the tedious and time-consuming tasks of selecting the right algorithm for a dataset at hand, along with the right preprocessing method and the various hyperparameters of all involved components. Thornton et al. [43] phrased this AutoML problem as a combined algorithm selection and hyperparameter optimization (CASH) problem, which aims to identify the combination of algorithm components with the best (cross-)validation performance.
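Written out, the CASH objective takes the following form (this is the standard formulation of [43]; the notation of candidate algorithms A^(j), their hyperparameter spaces Λ^(j), a loss L, and K cross-validation folds is spelled out here for convenience and not part of this chapter's text):

\begin{equation*}
A^{*}_{\lambda^{*}} \in \operatorname*{arg\,min}_{A^{(j)} \in \mathcal{A},\ \lambda \in \Lambda^{(j)}} \ \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}\!\left(A^{(j)}_{\lambda},\ \mathcal{D}^{(k)}_{\mathrm{train}},\ \mathcal{D}^{(k)}_{\mathrm{valid}}\right)
\end{equation*}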

One powerful approach for solving this CASH problem treats this cross-validation performance as an expensive blackbox function and uses Bayesian optimization [4, 35] to search for its optimizer. While Bayesian optimization typically uses Gaussian processes [32], these tend to have problems with the special characteristics of the CASH problem (high dimensionality; both categorical and continuous hyperparameters; many conditional hyperparameters, which are only relevant for some instantiations of other hyperparameters). Adapting GPs to handle these characteristics is an active field of research [40, 44], but so far Bayesian optimization methods using tree-based models [2, 17] work best in the CASH setting [9, 43].
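To make the tree-based approach concrete, the following is a minimal, self-contained sketch of such a Bayesian optimization loop (not SMAC itself): a random forest surrogate is fit to the observed (configuration, error) pairs, per-tree predictions provide a rough uncertainty estimate, and the next configuration is picked by maximizing expected improvement over random candidates. The toy 2-D search space and all function names are illustrative assumptions.

import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def expected_improvement(mu, sigma, best):
    # EI for minimization; a small floor avoids division by zero
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def toy_validation_error(x):
    # stand-in for the expensive cross-validation blackbox
    return (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2

def tree_based_bo(objective, dim=2, n_init=5, n_iter=20, n_candidates=1000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.random((n_init, dim))                       # initial random designs
    y = np.array([objective(x) for x in X])
    for _ in range(n_iter):
        forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
        cand = rng.random((n_candidates, dim))
        # per-tree predictions give a crude mean/uncertainty estimate
        preds = np.stack([tree.predict(cand) for tree in forest.estimators_])
        mu, sigma = preds.mean(axis=0), preds.std(axis=0)
        x_next = cand[np.argmax(expected_improvement(mu, sigma, y.min()))]
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next))
    return X[np.argmin(y)], y.min()

if __name__ == "__main__":
    best_x, best_err = tree_based_bo(toy_validation_error)
    print("best configuration:", best_x, "error:", best_err)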

Auto-Net is modelled after the two prominent AutoML systems Auto-WEKA [43] and Auto-sklearn [11], discussed in Chaps. 4 and 6 of this book, respectively. Both of these use the random-forest-based Bayesian optimization method SMAC [17] to tackle the CASH problem – to find the best instantiation of classifiers in WEKA [16] and scikit-learn [30], respectively. Auto-sklearn employs two additional methods to boost performance. Firstly, it uses meta-learning [3] based on experience on previous datasets to start SMAC from good configurations [12]. Secondly, since the eventual goal is to make the best predictions, it is wasteful to try out dozens of machine learning models and then only use the single best model; instead, Auto-sklearn saves all models evaluated by SMAC and constructs an ensemble of these with the ensemble selection technique [5]. Even though both Auto-WEKA and Auto-sklearn include a wide range of supervised learning methods, neither includes modern neural networks.
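The ensemble selection technique of [5] can be sketched as follows: starting from an empty ensemble, models are greedily added, with replacement, picking at each step whichever model most improves the validation performance of the averaged predictions. This is a minimal illustration under assumed inputs (a list of validation-set probability predictions and an accuracy metric), not Auto-sklearn's actual implementation.

import numpy as np

def ensemble_selection(model_val_probs, y_val, ensemble_size=50):
    """Greedy ensemble selection (with replacement) over saved model predictions.

    model_val_probs: list of arrays, each of shape (n_val, n_classes)
    y_val: true labels, shape (n_val,)
    Returns normalized selection counts, i.e. ensemble weights per model.
    """
    counts = np.zeros(len(model_val_probs), dtype=int)
    running_sum = np.zeros_like(model_val_probs[0])

    def accuracy(prob_sum):
        return np.mean(prob_sum.argmax(axis=1) == y_val)

    for _ in range(ensemble_size):
        scores = [accuracy(running_sum + p) for p in model_val_probs]
        best = int(np.argmax(scores))
        counts[best] += 1
        running_sum += model_val_probs[best]
    return counts / counts.sum()

# toy usage with three dummy "models" of decreasing quality on a 2-class problem
rng = np.random.default_rng(1)
y_val = rng.integers(0, 2, size=100)
probs = [np.clip(np.eye(2)[y_val] + rng.normal(0, s, (100, 2)), 0, None) for s in (0.3, 0.6, 0.9)]
print("ensemble weights:", ensemble_selection(probs, y_val))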

Here, we introduce two versions of a system we dub Auto-Net to fill this gap. Auto-Net 1.0 is based on Theano and has a relatively simple search space, while the more recent Auto-Net 2.0 is implemented in PyTorch and uses a more complex space and more recent advances in DL. A further difference lies in their respective search procedures: Auto-Net 1.0 automatically configures neural networks with SMAC [17], following the same AutoML approach as Auto-WEKA and Auto-sklearn, while Auto-Net 2.0 builds upon BOHB [10], a combination of Bayesian Optimization (BO) and efficient racing strategies via HyperBand (HB) [23].

Auto-Net 1.0 achieved the best performance on two datasets in the human expert track of the recent ChaLearn AutoML Challenge [14]. To the best of our knowledge, this is the first time that a fully-automatically-tuned neural network won a competition dataset against human experts. Auto-Net 2.0 further improves upon Auto-Net 1.0 on large datasets, showing recent progress in the field.


We describe the configuration space and implementation of Auto-Net 1.0 in Sect. 7.2 and of Auto-Net 2.0 in Sect. 7.3. We then study their performance empirically in Sect. 7.4 and conclude in Sect. 7.5. We omit a thorough discussion of related work and refer to Chap. 3 of this book for an overview of the extremely active field of neural architecture search. Nevertheless, we note that several other recent tools follow Auto-Net's goal of automating deep learning, such as Auto-Keras [20], Photon-AI, H2O.ai, DEvol or Google's Cloud AutoML service.

This chapter is an extended version of our 2016 paper introducing Auto-Net, presented at the 2016 ICML Workshop on AutoML [26].

7.2 Auto-Net 1.0

We now introduce Auto-Net 1.0 and describe its implementation. We chose to implement this first version of Auto-Net as an extension of Auto-sklearn [11] by adding a new classification (and regression) component; the reason for this choice was that it allows us to leverage existing parts of the machine learning pipeline: feature preprocessing, data preprocessing and ensemble construction. Here, we limit Auto-Net to fully-connected feed-forward neural networks, since they apply to a wide range of different datasets; we defer the extension to other types of neural networks, such as convolutional or recurrent neural networks, to future work. To have access to neural network techniques we use the Python deep learning library Lasagne [6], which is built around Theano [42]. However, we note that in general our approach is independent of the neural network implementation.

Following [2] and [7], we distinguish between layer-independent network hyperparameters that control the architecture and training procedure and per-layer hyperparameters that are set for each layer. In total, we optimize 63 hyperparameters (see Table 7.1), using the same configuration space for all types of supervised learning (binary, multiclass and multilabel classification, as well as regression). Sparse datasets also share the same configuration space. (Since neural networks cannot handle datasets in sparse representation out of the box, we transform the data into a dense representation on a per-batch basis prior to feeding it to the neural network.)
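The per-batch densification mentioned above can be illustrated with the following sketch (scipy/numpy only; the generator interface and the batch size are assumptions, not Auto-Net's actual data pipeline):

import numpy as np
import scipy.sparse as sp

def dense_minibatches(X_sparse, y, batch_size=32):
    """Yield dense mini-batches from a sparse feature matrix.

    The full dataset stays in sparse format; only the current batch
    is converted to a dense array before being fed to the network.
    """
    n = X_sparse.shape[0]
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        yield X_sparse[start:stop].toarray(), y[start:stop]

# toy usage: a 1000 x 500 sparse dataset with 1% non-zero entries
X = sp.random(1000, 500, density=0.01, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=1000)
for xb, yb in dense_minibatches(X, y):
    pass  # xb is a dense np.ndarray of shape (batch_size, 500)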

The per-layer hyperparameters of layer k are conditionally dependent on the number of layers being at least k. For practical reasons, we constrain the number of layers to be between one and six: firstly, we aim to keep the training time of a single configuration low,¹ and secondly, each layer adds eight per-layer hyperparameters to the configuration space, such that allowing additional layers would further complicate the configuration process.

The most common way to optimize the internal weights of neural networks is via stochastic gradient descent (SGD), using partial derivatives calculated with backpropagation. Standard SGD crucially depends on the correct setting of the learning rate hyperparameter.

¹ We aimed to be able to afford the evaluation of several dozens of configurations within a time budget of two days on a single CPU.

Table 7.1 Configuration space of Auto-Net. The configuration space for the preprocessing methods can be found in [11]

Name | Range | Default | Log scale | Type | Conditional

Network hyperparameters:
Batch size | [32, 4096] | 32 | ✓ | float | –
Number of updates | [50, 2500] | 200 | ✓ | int | –
Number of layers | [1, 6] | 1 | – | int | –
Learning rate | [10^-6, 1.0] | 10^-2 | ✓ | float | –
L2 regularization | [10^-7, 10^-2] | 10^-4 | ✓ | float | –
Dropout output layer | [0.0, 0.99] | 0.5 | ✓ | float | –
Solver type | {SGD, Momentum, Adam, Adadelta, Adagrad, smorm3s, Nesterov} | smorm3s | – | cat | –
lr-policy | {Fixed, Inv, Exp, Step} | Fixed | – | cat | –

Conditioned on solver type:
β1 | [10^-4, 10^-1] | 10^-1 | ✓ | float | ✓
β2 | [10^-4, 10^-1] | 10^-1 | ✓ | float | ✓
ρ | [0.05, 0.99] | 0.95 | ✓ | float | ✓
Momentum | [0.3, 0.999] | 0.9 | ✓ | float | ✓

Conditioned on lr-policy:
γ | [10^-3, 10^-1] | 10^-2 | ✓ | float | ✓
k | [0.0, 1.0] | 0.5 | – | float | ✓
s | [2, 20] | 2 | – | int | ✓

Per-layer hyperparameters:
Activation-type | {Sigmoid, TanH, ScaledTanH, ELU, ReLU, Leaky, Linear} | ReLU | – | cat | ✓
Number of units | [64, 4096] | 128 | ✓ | int | ✓
Dropout in layer | [0.0, 0.99] | 0.5 | – | float | ✓
Weight initialization | {Constant, Normal, Uniform, Glorot-Uniform, Glorot-Normal, He-Normal, He-Uniform, Orthogonal, Sparse} | He-Normal | – | cat | ✓
Std. normal init. | [10^-7, 0.1] | 0.0005 | – | float | ✓
Leakiness | [0.01, 0.99] | 1/3 | – | float | ✓
tanh scale in | [0.5, 1.0] | 2/3 | – | float | ✓
tanh scale out | [1.1, 3.0] | 1.7159 | ✓ | float | ✓

To lessen this dependency, various algorithms (solvers) for stochastic gradient descent have been proposed. We include the following well-known methods from the literature in the configuration space of Auto-Net: vanilla stochastic gradient descent (SGD), stochastic gradient descent with momentum (Momentum), Adam [21], Adadelta [48], Nesterov momentum [28] and Adagrad [8]. Additionally, we used a variant of the vSGD optimizer [33], dubbed "smorm", in which the estimate of the Hessian is replaced by an estimate of the squared gradient (calculated as in the RMSprop procedure). Each of these methods comes with a learning rate α and its own set of hyperparameters, for example Adam's moment decay rates β1 and β2. Each solver's hyperparameters are only active if the corresponding solver is chosen.

We also decay the learning rate α over time, using the following policies, which multiply the initial learning rate by a factor α_decay after each epoch t = 0 . . . T:

• Fixed: α_decay = 1
• Inv: α_decay = (1 + γt)^(−k)
• Exp: α_decay = γ^t
• Step: α_decay = γ^⌊t/s⌋

Here, the hyperparameters k, s and γ are conditionally dependent on the choice of the policy.

To search for a strong instantiation in this conditional search space of Auto-Net 1.0, as in Auto-WEKA and Auto-sklearn, we used the random-forest-based Bayesian optimization method SMAC [17]. SMAC is an anytime approach that keeps track of the best configuration seen so far and outputs this when terminated.

7.3 Auto-Net 2.0

Auto-Net 2.0 differs from Auto-Net 1.0 mainly in the following three aspects:

• it uses PyTorch [29] instead of Lasagne as a deep learning library,
• it uses a larger configuration space including up-to-date deep learning techniques and modern architectures (such as ResNets), and includes more compact representations of the search space, and
• it applies BOHB [10] instead of SMAC to obtain a well-performing neural network more efficiently.

In the following, we will discuss these points in more detail.

Since the development and maintenance of Lasagne ended last year, we chose a different Python library for Auto-Net 2.0. The most popular deep learning libraries right now are PyTorch [29] and Tensorflow [1]. These come with quite similar features and mostly differ in the level of detail they give insight into. For example, PyTorch offers the user the possibility to trace all computations during training.


While there are advantages and disadvantages for each of these libraries, we decided to use PyTorch because of its ability to dynamically construct computational graphs. For this reason, we also started referring to Auto-Net 2.0 as Auto-PyTorch.

The search space of Auto-Net 2.0 includes both hyperparameters for module selection (e.g., scheduler type, network architecture) and hyperparameters for each of the specific modules. It supports different deep learning modules, such as network type, learning rate scheduler, optimizer and regularization technique, as described below. Auto-Net 2.0 is also designed to be easily extended; users can add their own modules to the ones listed below.

Auto-Net 2.0 currently offers four different network types:

Multi-Layer Perceptrons This is a standard implementation of conventional MLPs extended by dropout layers [38]. Similarly to Auto-Net 1.0, each layer of the MLP is parameterized (e.g., number of units and dropout rate).

Residual Neural Networks These are deep neural networks that learn residual functions [47], with the difference that we use fully-connected layers instead of convolutional ones. As is standard with ResNets, the architecture consists of M groups, each of which stacks N residual blocks in sequence. While the architecture of each block is fixed, the number M of groups, the number N of blocks per group, as well as the width of each group is determined by hyperparameters, as shown in Table 7.2 (a minimal sketch of such a block is given after this list).

Shaped Multi-Layer Perceptrons To avoid that every layer has its own hyperparameters (which is an inefficient representation to search), in shaped MLPs the overall shape of the layers is predetermined, e.g. as a funnel, long funnel, diamond, hexagon, brick, or triangle. We followed the shapes from https://mikkokotila.github.io/slate/#shapes; Ilya Loshchilov also proposed parameterization by such shapes to us before [25].

Shaped Residual Networks A ResNet where the overall shape of the layers is predetermined (e.g. funnel, long funnel, diamond, hexagon, brick, triangle).
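As referenced in the ResNet description above, the following is a minimal PyTorch sketch of a fully-connected residual block and of stacking M groups of N such blocks; the exact layer ordering, dropout placement and helper names are illustrative assumptions rather than Auto-Net 2.0's actual implementation.

import torch
import torch.nn as nn

class FCResidualBlock(nn.Module):
    """A residual block built from fully-connected layers instead of convolutions."""
    def __init__(self, width, dropout=0.0):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(width, width),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))  # skip connection around the block body

def build_fc_resnet(in_features, group_widths, blocks_per_group, n_classes):
    """M groups (one width each), N residual blocks per group, and a linear head."""
    layers, width = [], in_features
    for group_width in group_widths:
        layers.append(nn.Linear(width, group_width))  # change the width between groups
        layers.append(nn.ReLU())
        layers.extend(FCResidualBlock(group_width) for _ in range(blocks_per_group))
        width = group_width
    layers.append(nn.Linear(width, n_classes))
    return nn.Sequential(*layers)

net = build_fc_resnet(in_features=20, group_widths=[128, 64], blocks_per_group=2, n_classes=3)
print(net(torch.randn(8, 20)).shape)  # torch.Size([8, 3])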

The network types of ResNets and ShapedResNets can also use any of the regularization methods of Shake-Shake [13] and ShakeDrop [46]. MixUp [49] can be used for all networks.

The optimizers currently supported in Auto-Net 2.0 are Adam [21] and SGD with momentum. Moreover, Auto-Net 2.0 currently offers five different schedulers that change the optimizer's learning rate over time (as a function of the number of epochs):

Exponential This multiplies the learning rate with a constant factor in each epoch.

Step This decays the learning rate by a multiplicative factor after a constant number of steps.

Cyclic This modifies the learning rate in a certain range, alternating between increasing and decreasing [37].

Cosine Annealing with Warm Restarts [24] This learning rate schedule implements multiple phases of convergence. It cools down the learning rate to zero following a cosine decay [24] (the formula is given after this list), and after each convergence phase heats it up to start a next phase of convergence, often to a better optimum. The network weights are not modified when heating up the learning rate, such that the next phase of convergence is warm-started.

Table 7.2 Configuration space of Auto-Net 2.0. There are 112 hyperparameters in total

Name | Range | Default | Log scale | Type | Conditional

General hyperparameters:
Batch size | [32, 500] | 32 | ✓ | int | −
Use mixup | {True, False} | True | − | bool | −
Mixup alpha | [0.0, 1.0] | 1.0 | − | float | ✓
Network | {MLP, ResNet, ShapedMLP, ShapedResNet} | MLP | − | cat | −
Optimizer | {Adam, SGD} | Adam | − | cat | −
Preprocessor | {nystroem, kernel pca, fast ica, kitchen sinks, truncated svd} | nystroem | − | cat | −
Imputation | {most frequent, median, mean} | most frequent | − | cat | −
Use loss weight strategy | {True, False} | True | − | cat | −
Learning rate scheduler | {Step, Exponential, OnPlateau, Cyclic, CosineAnnealing} | Step | − | cat | −

Preprocessors:
Nystroem:
  Coef | [−1.0, 1.0] | 0.0 | − | float | ✓
  Degree | [2, 5] | 3 | − | int | ✓
  Gamma | [0.00003, 8.0] | 0.1 | ✓ | float | ✓
  Kernel | {poly, rbf, sigmoid, cosine} | rbf | − | cat | ✓
  Num components | [50, 10000] | 100 | ✓ | int | ✓
Kitchen sinks:
  Gamma | [0.00003, 8.0] | 1.0 | ✓ | float | ✓
  Num components | [50, 10000] | 100 | ✓ | int | ✓
Truncated SVD:
  Target dimension | [10, 256] | 128 | − | int | ✓
Kernel PCA:
  Coef | [−1.0, 1.0] | 0.0 | − | float | ✓
  Degree | [2, 5] | 3 | − | int | ✓
  Gamma | [0.00003, 8.0] | 0.1 | ✓ | float | ✓
  Kernel | {poly, rbf, sigmoid, cosine} | rbf | − | cat | ✓
  Num components | [50, 10000] | 100 | ✓ | int | ✓
Fast ICA:
  Algorithm | {parallel, deflation} | parallel | − | cat | ✓
  Fun | {logcosh, exp, cube} | logcosh | − | cat | ✓
  Whiten | {True, False} | True | − | cat | ✓
  Num components | [10, 2000] | 1005 | − | int | ✓

Networks:
MLP:
  Activation function | {Sigmoid, Tanh, ReLu} | Sigmoid | − | cat | ✓
  Num layers | [1, 15] | 9 | − | int | ✓
  Num units (for layer i) | [10, 1024] | 100 | ✓ | int | ✓
  Dropout (for layer i) | [0.0, 0.5] | 0.25 | − | int | ✓
ResNet:
  Activation function | {Sigmoid, Tanh, ReLu} | Sigmoid | − | cat | ✓
  Residual block groups | [1, 9] | 4 | − | int | ✓
  Blocks per group | [1, 4] | 2 | − | int | ✓
  Num units (for group i) | [128, 1024] | 200 | ✓ | int | ✓
  Use dropout | {True, False} | True | − | bool | ✓
  Dropout (for group i) | [0.0, 0.9] | 0.5 | − | int | ✓
  Use shake drop | {True, False} | True | − | bool | ✓
  Use shake shake | {True, False} | True | − | bool | ✓
  Shake drop βmax | [0.0, 1.0] | 0.5 | − | float | ✓
ShapedMLP:
  Activation function | {Sigmoid, Tanh, ReLu} | Sigmoid | − | cat | ✓
  Num layers | [3, 15] | 9 | − | int | ✓
  Max units per layer | [10, 1024] | 200 | ✓ | int | ✓
  Network shape | {Funnel, LongFunnel, Diamond, Hexagon, Brick, Triangle, Stairs} | Funnel | − | cat | ✓
  Max dropout per layer | [0.0, 0.6] | 0.2 | − | float | ✓
  Dropout shape | {Funnel, LongFunnel, Diamond, Hexagon, Brick, Triangle, Stairs} | Funnel | − | cat | ✓
Shaped ResNet:
  Activation function | {Sigmoid, Tanh, ReLu} | Sigmoid | − | cat | ✓
  Num layers | [3, 9] | 4 | − | int | ✓
  Blocks per layer | [1, 4] | 2 | − | int | ✓
  Use dropout | {True, False} | True | − | bool | ✓
  Max units per layer | [10, 1024] | 200 | ✓ | int | ✓
  Network shape | {Funnel, LongFunnel, Diamond, Hexagon, Brick, Triangle, Stairs} | Funnel | − | cat | ✓
  Max dropout per layer | [0.0, 0.6] | 0.2 | − | float | ✓
  Dropout shape | {Funnel, LongFunnel, Diamond, Hexagon, Brick, Triangle, Stairs} | Funnel | − | cat | ✓
  Use shake drop | {True, False} | True | − | bool | ✓
  Use shake shake | {True, False} | True | − | bool | ✓
  Shake drop βmax | [0.0, 1.0] | 0.5 | − | float | ✓

Optimizers:
Adam:
  Learning rate | [0.0001, 0.1] | 0.003 | ✓ | float | ✓
  Weight decay | [0.0001, 0.1] | 0.05 | − | float | ✓
SGD:
  Learning rate | [0.0001, 0.1] | 0.003 | ✓ | float | ✓
  Weight decay | [0.0001, 0.1] | 0.05 | − | float | ✓
  Momentum | [0.1, 0.9] | 0.3 | ✓ | float | ✓

Schedulers:
Step:
  γ | [0.001, 0.9] | 0.4505 | − | float | ✓
  Step size | [1, 10] | 6 | − | int | ✓
Exponential:
  γ | [0.8, 0.9999] | 0.89995 | − | float | ✓
OnPlateau:
  γ | [0.05, 0.5] | 0.275 | − | float | ✓
  Patience | [3, 10] | 6 | − | int | ✓
Cyclic:
  Cycle length | [3, 10] | 6 | − | int | ✓
  Max factor | [1.0, 2.0] | 1.5 | − | float | ✓
  Min factor | [0.001, 1.0] | 0.5 | − | float | ✓
Cosine annealing:
  T0 | [1, 20] | 10 | − | int | ✓
  Tmult | [1.0, 2.0] | 1.5 | − | float | ✓

OnPlateau This scheduler² changes the learning rate whenever a metric stops improving; specifically, it multiplies the current learning rate with a factor γ if there was no improvement after p epochs.
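The cosine decay referenced above (and the T0 and Tmult hyperparameters in Table 7.2) follow the schedule of SGDR [24]; in that paper's notation, within a restart period of length T_i the learning rate at position T_cur is

\begin{equation*}
\eta_t = \eta_{\min} + \tfrac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\!\left(\frac{T_{\mathrm{cur}}}{T_i}\,\pi\right)\right),
\end{equation*}

and after each restart the period is scaled by T_i ← T_mult · T_i. Here η_min and η_max are the lower and upper learning rate bounds; with η_min = 0 this matches the description above of cooling the learning rate down to zero.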

Similar to Auto-Net 1.0, Auto-Net 2.0 can search over pre-processing techniques. Auto-Net 2.0 currently supports Nyström [45], kernel principal component analysis [34], fast independent component analysis [18], random kitchen sinks [31] and truncated singular value decomposition [15]. Users can specify a list of pre-processing techniques to be taken into account and can also choose between different balancing and normalization strategies (for balancing strategies, only weighting the loss is available, and for normalization strategies, min-max normalization and standardization are supported). In contrast to Auto-Net 1.0, Auto-Net 2.0 does not build an ensemble at the end (although this feature will likely be added soon). All hyperparameters of Auto-Net 2.0 with their respective ranges and default values can be found in Table 7.2.
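For illustration, the following sketch wires one of these preprocessors into a scikit-learn pipeline (min-max normalization, then a Nyström kernel approximation, then a simple linear classifier); the classifier choice, the toy data and the specific parameter values are placeholders and not part of Auto-Net:

from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# min-max normalization -> Nystroem kernel approximation -> linear model
pipe = make_pipeline(
    MinMaxScaler(),
    Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0),
    SGDClassifier(random_state=0),
)
print(pipe.fit(X, y).score(X, y))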

As optimizer for this highly conditional space, we used BOHB (Bayesian Optimization with HyperBand) [10], which combines conventional Bayesian optimization with the bandit-based strategy Hyperband [23] to substantially improve its efficiency. Like Hyperband, BOHB uses repeated runs of Successive Halving [19] to invest most runtime in promising neural networks and stops training neural networks with poor performance early. Like in Bayesian optimization, BOHB learns which kinds of neural networks yield good results. Specifically, like the BO method TPE [2], BOHB uses a kernel density estimator (KDE) to describe regions of high performance in the space of neural networks (architectures and hyperparameter settings) and trades off exploration versus exploitation using this KDE. One of the advantages of BOHB is that it is easily parallelizable, achieving almost linear speedups with an increasing number of workers [10].
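The Successive Halving subroutine that BOHB inherits from Hyperband can be sketched as follows (a generic illustration, not BOHB's implementation; the evaluate(config, budget) interface and the halving rate eta = 3 are assumptions):

import random

def successive_halving(configs, evaluate, min_budget=1, max_budget=27, eta=3):
    """Evaluate all configs on a small budget, keep the best 1/eta, multiply the budget by eta, repeat."""
    budget = min_budget
    while budget <= max_budget and configs:
        losses = [(evaluate(c, budget), c) for c in configs]
        losses.sort(key=lambda pair: pair[0])
        keep = max(1, len(configs) // eta)
        configs = [c for _, c in losses[:keep]]
        budget *= eta
    return configs[0]

# toy example: each "config" is a learning rate; the loss improves with budget and depends on the config
random.seed(0)
candidates = [10 ** random.uniform(-5, -1) for _ in range(27)]
best = successive_halving(candidates, lambda lr, b: abs(lr - 1e-3) + 1.0 / b)
print("selected learning rate:", best)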

As a budget for BOHB we can use either epochs or (wallclock) time in minutes; by default we use runtime, but users can freely adapt the different budget parameters. An example usage is shown in Algorithm 1. Similar to Auto-sklearn, Auto-Net is built as a plugin estimator for scikit-learn. Users have to provide a training set and a performance metric (e.g., accuracy).

Algorithm 1 Example usage of Auto-Net 2.0

from autonet import AutoNetClassification

cls = AutoNetClassification(min_budget=5, max_budget=20, max_runtime=120)
cls.fit(X_train, Y_train)
predictions = cls.predict(X_test)

² Implemented by PyTorch.


Optionally, users may also specify a validation set and a test set. The validation set is used during training to get a measure of the performance of the network and to train the KDE models of BOHB.

7.4 Experiments

We now empirically evaluate our methods. Our implementations of Auto-Net run on both CPUs and GPUs, but since neural networks heavily employ matrix operations they run much faster on GPUs. Our CPU-based experiments were run on a compute cluster, each node of which has two eight-core Intel Xeon E5-2650 v2 CPUs, running at 2.6 GHz, and a shared memory of 64 GB. Our GPU-based experiments were run on a compute cluster, each node of which has four GeForce GTX TITAN X GPUs.

7.4.1 Baseline Evaluation of Auto-Net 1.0 and Auto-sklearn

In our first experiment, we compare different instantiations of Auto-Net 1.0 on the five datasets of phase 0 of the AutoML challenge. First, we use the CPU-based and GPU-based versions to study the difference of running NNs on different hardware. Second, we allow the combination of neural networks with the models from Auto-sklearn. Third, we also run Auto-sklearn without neural networks as a baseline. On each dataset, we performed 10 one-day runs of each method, allowing up to 100 min for the evaluation of a single configuration by five-fold cross-validation on the training set. For each time step of each run, following [11] we constructed an ensemble from the models it had evaluated so far and plot the test error of that ensemble over time. In practice, we would either use a separate process to calculate the ensembles in parallel or compute them after the optimization process.

Fig. 7.1 shows the results on two of the five datasets. First, we note that the GPU-based version of Auto-Net was consistently about an order of magnitude faster than the CPU-based version. Within the given fixed compute budget, the CPU-based version consistently performed worst, whereas the GPU-based version performed best on the newsgroups dataset (see Fig. 7.1a), tied with Auto-sklearn on 3 of the other datasets, and performed worse on one. Despite the fact that the CPU-based Auto-Net was very slow, in 3/5 cases the combination of Auto-sklearn and CPU-based Auto-Net still improved over Auto-sklearn; this can, for example, be observed for the dorothea dataset in Fig. 7.1b.

7.4.2 Results for AutoML Competition Datasets

Having developed Auto-Net 1.0 during the first AutoML challenge, we used a combination of Auto-sklearn and GPU-based Auto-Net for the last two phases to win the respective human expert tracks.


Fig. 7.1 Results for the four methods on two datasets from Tweakathon0 of the AutoML challenge. We show errors on the competition's validation set (not the test set since its true labels are not available), with our methods only having access to the training set. To avoid clutter, we plot mean error ± 1/4 standard deviations over the 10 runs of each method. (a) newsgroups dataset. (b) dorothea dataset

Fig. 7.2 Official AutoML human expert track competition results for the three datasets for which we used Auto-Net. We only show the top 10 entries. (a) alexis dataset. (b) yolanda dataset. (c) tania dataset

Auto-sklearn has been developed for much longer and is much more robust than Auto-Net, so for 4/5 datasets in the 3rd phase and 3/5 datasets in the 4th phase Auto-sklearn performed best by itself and we only submitted its results. Here, we discuss the three datasets for which we used Auto-Net. Fig. 7.2 shows the official AutoML human expert track competition results for these three datasets. The alexis dataset was part of the 3rd phase ("advanced phase") of the challenge. For this, we ran Auto-Net on five GPUs in parallel (using SMAC in shared-model mode) for 18 h. Our submission included an automatically-constructed ensemble of 39 models and clearly outperformed all human experts, reaching an AUC score of 90%, while the best human competitor (Ideal Intel Analytics) only reached 80%. To the best of our knowledge, this is the first time an automatically-constructed neural network won a competition dataset. The yolanda and tania datasets were part of the 4th phase ("expert phase") of the challenge.

Fig. 7.3 Performance on the tania dataset over time (error over time in seconds, log scale, for AutoNet-GPU, AutoNet-CPU, AutoNet+Autosklearn and Autosklearn). We show cross-validation performance on the training set since the true labels for the competition's validation or test set are not available. To avoid clutter, we plot mean error ± 1/4 standard deviations over the 10 runs of each method

For yolanda, we ran Auto-Net for 48 h on eight GPUs and automatically constructed an ensemble of five neural networks, achieving a close third place. For tania, we ran Auto-Net for 48 h on eight GPUs along with Auto-sklearn on 25 CPUs, and in the end our automated ensembling script constructed an ensemble of eight 1-layer neural networks, two 2-layer neural networks, and one logistic regression model trained with SGD. This ensemble won the first place on the tania dataset.

For the tania dataset, we also repeated the experiments from Sect. 7.4.1. Fig. 7.3 shows that for this dataset Auto-Net performed clearly better than Auto-sklearn, even when only running on CPUs. The GPU-based variant of Auto-Net performed best.

7.4.3 Comparing Auto-Net 1.0 and 2.0

Finally, we show an illustrative comparison between Auto-Net 1.0 and 2.0. We note that Auto-Net 2.0 has a much more comprehensive search space than Auto-Net 1.0, and we therefore expect it to perform better on large datasets given enough time. We also expect that searching the larger space is harder than searching Auto-Net 1.0's smaller space; however, since Auto-Net 2.0 uses the efficient multi-fidelity optimizer BOHB to terminate poorly-performing neural networks early on, it may nevertheless obtain strong anytime performance.

Table 7.3 Error metric of different Auto-Net versions, run for different times, all on CPU. We compare Auto-Net 1.0, ensembles of Auto-Net 1.0 and Auto-sklearn, Auto-Net 2.0 with one worker, and Auto-Net 2.0 with four workers. All results are means across 10 runs of each system. We show errors on the competition's validation set (not the test set since its true labels are not available), with our methods only having access to the training set

Method | newsgroups 10^3 s | newsgroups 10^4 s | newsgroups 1 day | dorothea 10^3 s | dorothea 10^4 s | dorothea 1 day
Auto-Net 1.0 | 0.99 | 0.98 | 0.85 | 0.38 | 0.30 | 0.13
Auto-sklearn + Auto-Net 1.0 | 0.94 | 0.76 | 0.47 | 0.29 | 0.13 | 0.13
Auto-Net 2.0: 1 worker | 1.0 | 0.67 | 0.55 | 0.88 | 0.17 | 0.16
Auto-Net 2.0: 4 workers | 0.89 | 0.57 | 0.44 | 0.22 | 0.17 | 0.14

On the other hand, Auto-Net 2.0 so far does not implement ensembling, and due to this missing regularization component and its larger hypothesis space, it may be more prone to overfitting than Auto-Net 1.0.

In order to test these expectations about performance on different-sized datasets, we used a medium-sized dataset (newsgroups, with 13k training data points) and a small one (dorothea, with 800 training data points). The results are presented in Table 7.3.

On the medium-sized dataset newsgroups, Auto-Net 2.0 performed much better than Auto-Net 1.0, and using four workers also led to strong speedups on top of this, making Auto-Net 2.0 competitive with the ensemble of Auto-sklearn and Auto-Net 1.0. We found that, despite Auto-Net 2.0's larger search space, its anytime performance (using the multi-fidelity method BOHB) was better than that of Auto-Net 1.0 (using the blackbox optimization method SMAC). On the small dataset dorothea, Auto-Net 2.0 also performed better than Auto-Net 1.0 early on, but given enough time Auto-Net 1.0 performed slightly better. We attribute this to the lack of ensembling in Auto-Net 2.0, combined with its larger search space.

7.5 Conclusion

We presented Auto-Net, which provides automatically-tuned deep neural networks without any human intervention. Even though neural networks show superior performance on many datasets, for traditional datasets with manually-defined features they do not always perform best. However, we showed that, even in cases where other methods perform better, combining Auto-Net with Auto-sklearn into an ensemble often leads to equal or better performance than either approach alone.

Finally, we reported results on three datasets from the AutoML challenge's human expert track, for which Auto-Net won one third place and two first places. We showed that ensembles of Auto-sklearn and Auto-Net can get users the best of both worlds and quite often improve over the individual tools. First experiments on the new Auto-Net 2.0 showed that using a more comprehensive search space, combined with BOHB as an optimizer, yields promising results.


In future work, we aim to extend Auto-Net to more general neural network architectures, including convolutional and recurrent neural networks.

Acknowledgements This work has partly been supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant no. 716721.

Bibliography

1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Tensorflow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). pp. 265–283 (2016), https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
2. Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K. (eds.) Proceedings of the 25th International Conference on Advances in Neural Information Processing Systems (NIPS'11). pp. 2546–2554 (2011)
3. Brazdil, P., Giraud-Carrier, C., Soares, C., Vilalta, R.: Metalearning: Applications to Data Mining. Springer Publishing Company, Incorporated, 1 edn. (2008)
4. Brochu, E., Cora, V., de Freitas, N.: A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Computing Research Repository (CoRR) abs/1012.2599 (2010)
5. Caruana, R., Niculescu-Mizil, A., Crew, G., Ksikes, A.: Ensemble selection from libraries of models. In: Proceedings of the 21st International Conference on Machine Learning. pp. 137–144. ACM Press (2004)
6. Dieleman, S., Schlüter, J., Raffel, C., Olson, E., Sønderby, S., Nouri, D., Maturana, D., Thoma, M., Battenberg, E., Kelly, J., Fauw, J.D., Heilman, M., diogo149, McFee, B., Weideman, H., takacsg84, peterderivaz, Jon, instagibbs, Rasul, K., CongLiu, Britefury, Degrave, J.: Lasagne: First release. (Aug 2015), https://doi.org/10.5281/zenodo.27878
7. Domhan, T., Springenberg, J.T., Hutter, F.: Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In: Yang, Q., Wooldridge, M. (eds.) Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI'15). pp. 3460–3468 (2015)
8. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (Jul 2011)
9. Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., Leyton-Brown, K.: Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In: NIPS Workshop on Bayesian Optimization in Theory and Practice (BayesOpt'13) (2013)
10. Falkner, S., Klein, A., Hutter, F.: Combining hyperband and bayesian optimization. In: NIPS 2017 Bayesian Optimization Workshop (Dec 2017)
11. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J.T., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Proceedings of the 29th International Conference on Advances in Neural Information Processing Systems (NIPS'15) (2015)
12. Feurer, M., Springenberg, T., Hutter, F.: Initializing Bayesian hyperparameter optimization via meta-learning. In: Bonet, B., Koenig, S. (eds.) Proceedings of the Twenty-ninth National Conference on Artificial Intelligence (AAAI'15). pp. 1128–1135. AAAI Press (2015)
13. Gastaldi, X.: Shake-shake regularization. CoRR abs/1705.07485 (2017)
14. Guyon, I., Bennett, K., Cawley, G., Escalante, H.J., Escalera, S., Ho, T.K., Macià, N., Ray, B., Saeed, M., Statnikov, A., Viegas, E.: Design of the 2015 ChaLearn AutoML challenge. In: 2015 International Joint Conference on Neural Networks (IJCNN). pp. 1–8 (July 2015)
15. Halko, N., Martinsson, P., Tropp, J.: Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions (2009)
16. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: An update. SIGKDD Explorations 11(1), 10–18 (2009)
17. Hutter, F., Hoos, H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Coello, C. (ed.) Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION'11). Lecture Notes in Computer Science, vol. 6683, pp. 507–523. Springer-Verlag (2011)
18. Hyvärinen, A., Oja, E.: Independent component analysis: algorithms and applications. Neural Networks 13(4–5), 411–430 (2000)
19. Jamieson, K., Talwalkar, A.: Non-stochastic best arm identification and hyperparameter optimization. In: Gretton, A., Robert, C. (eds.) Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS. JMLR Workshop and Conference Proceedings, vol. 51, pp. 240–248. JMLR.org (2016)
20. Jin, H., Song, Q., Hu, X.: Efficient neural architecture search with network morphism. CoRR abs/1806.10282 (2018)
21. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (2015)
22. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Bartlett, P., Pereira, F., Burges, C., Bottou, L., Weinberger, K. (eds.) Proceedings of the 26th International Conference on Advances in Neural Information Processing Systems (NIPS'12). pp. 1097–1105 (2012)
23. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research 18, 185:1–185:52 (2017)
24. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: International Conference on Learning Representations (ICLR) 2017 Conference Track (2017)
25. Loshchilov, I.: Personal communication (2017)
26. Mendoza, H., Klein, A., Feurer, M., Springenberg, J., Hutter, F.: Towards automatically-tuned neural networks. In: ICML 2016 AutoML Workshop (2016)
27. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015)
28. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/sqr(k)). Soviet Mathematics Doklady 27, 372–376 (1983)
29. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: Autodiff Workshop at NIPS (2017)
30. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
31. Rahimi, A., Recht, B.: Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In: Advances in Neural Information Processing Systems. pp. 1313–1320 (2009)
32. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. The MIT Press (2006)
33. Schaul, T., Zhang, S., LeCun, Y.: No more pesky learning rates. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International Conference on Machine Learning (ICML'13). Omnipress (2014)
34. Schölkopf, B., Smola, A., Müller, K.: Kernel principal component analysis. In: International Conference on Artificial Neural Networks. pp. 583–588. Springer (1997)
35. Shahriari, B., Swersky, K., Wang, Z., Adams, R., de Freitas, N.: Taking the human out of the loop: A review of Bayesian optimization. Proc. of the IEEE 104(1) (2016)
36. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–503 (2016)
37. Smith, L.N.: Cyclical learning rates for training neural networks. In: Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. pp. 464–472. IEEE (2017)
38. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)
39. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. CoRR abs/1409.3215 (2014), http://arxiv.org/abs/1409.3215
40. Swersky, K., Duvenaud, D., Snoek, J., Hutter, F., Osborne, M.: Raiders of the lost architecture: Kernels for Bayesian optimization in conditional parameter spaces. In: NIPS Workshop on Bayesian Optimization in Theory and Practice (BayesOpt'13) (2013)
41. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: DeepFace: Closing the gap to human-level performance in face verification. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR'14). pp. 1701–1708. IEEE Computer Society Press (2014)
42. Theano Development Team: Theano: A Python framework for fast computation of mathematical expressions. Computing Research Repository (CoRR) abs/1605.02688 (May 2016)
43. Thornton, C., Hutter, F., Hoos, H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Dhillon, I., Koren, Y., Ghani, R., Senator, T., Bradley, P., Parekh, R., He, J., Grossman, R., Uthurusamy, R. (eds.) The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'13). pp. 847–855. ACM Press (2013)
44. Wang, Z., Hutter, F., Zoghi, M., Matheson, D., de Freitas, N.: Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research 55, 361–387 (2016)
45. Williams, C., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems. pp. 682–688 (2001)
46. Yamada, Y., Iwamura, M., Kise, K.: ShakeDrop regularization. CoRR abs/1802.02375 (2018)
47. Zagoruyko, S., Komodakis, N.: Wide residual networks. CoRR abs/1605.07146 (2016)
48. Zeiler, M.: ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701 (2012), http://arxiv.org/abs/1212.5701
49. Zhang, H., Cissé, M., Dauphin, Y., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. CoRR abs/1710.09412 (2017)

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.