Hyperparameter Search in Machine Learning
Marc Claesen and Bart De Moor
ESAT-STADIUS, KU Leuven
iMinds Medical IT Department
STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics
Outline
1 Introduction
2 Example: optimizing hyperparameters for an SVM classifier
3 Challenges in hyperparameter search
4 State-of-the-art
Machine learning
Methods capable of learning patterns of interest from data.
by formulating the learning task as an optimization problem
Machine learning sits at the intersection of several fields:
statistics, computer science, optimization, (biology), ...
The field encompasses learning methods with various origins, e.g.:
biology, e.g. neural networks [1]
convex optimization, e.g. support vector machines [2]
statistics, e.g. hidden Markov models [3]
tensor decompositions, e.g. recommender systems [4]
Hyperparameter search
Most machine learning methods are (hyper)parameterized.
e.g. Occam’s razor: model complexity and overfitting
Hyperparameters can significantly impact performance
suitable hyperparameters must be determined for each task
occurs in both supervised and unsupervised learning
→ need for disciplined, automated optimization methods
Some examples:
SVM: regularization and kernel hyperparameters
ANN: regularization, network architecture, transfer functions
Formalizing hyperparameter tuning
In a general sense, tuning involves these components:
a learning algorithm A, parameterized by hyperparameters λ
training and test data X(tr), X(te)
a model M = A(X(tr) | λ)
a loss function L to assess the quality of M, typically using X(te): L(M | X(te))
In optimization terms, we aim to find λ∗ (assuming minimization):
λ∗ = arg min_λ L(A(X(tr) | λ) | X(te)) = arg min_λ F(λ | A, X(tr), X(te), L)
where F is the objective function.
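The components above can be sketched directly in code. This is a minimal, self-contained illustration in which the learning algorithm A, the loss L, and the data are toy stand-ins (a shrinkage-regularized mean estimator with squared-error loss), not any particular library:

```python
import random

def train(X_tr, lam):
    # Toy "learning algorithm" A: a regularized mean estimator.
    # lam acts as a shrinkage hyperparameter pulling the estimate toward 0.
    mean = sum(X_tr) / len(X_tr)
    return mean / (1.0 + lam)          # the "model" M = A(X_tr | lam)

def loss(model, X_te):
    # Loss L(M | X_te): mean squared error of the estimate on test data.
    return sum((x - model) ** 2 for x in X_te) / len(X_te)

def F(lam, X_tr, X_te):
    # The tuning objective F(lam | A, X_tr, X_te, L).
    return loss(train(X_tr, lam), X_te)

random.seed(0)
X_tr = [random.gauss(1.0, 0.5) for _ in range(50)]
X_te = [random.gauss(1.0, 0.5) for _ in range(50)]

# Crude search for lam* = arg min F(lam) over a grid of candidates.
lam_star = min((i * 0.01 for i in range(101)),
               key=lambda lam: F(lam, X_tr, X_te))
```

Any hyperparameter optimizer discussed later only ever sees F as a black box: it proposes a λ, receives a score, and proposes the next λ.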
Tuning in practice
Most often done using a combination of grid and manual search:
grid search suffers from the curse of dimensionality
manual tuning leads to poor reproducibility
Better solutions exist but lack adoption because:
potential performance improvements are underestimated
lack of availability and/or ease of use
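The curse of dimensionality mentioned above is easy to quantify: a grid with k candidate values per hyperparameter requires k^d objective evaluations in d dimensions. A quick stdlib illustration:

```python
from itertools import product

def grid(axes):
    # Cartesian product of per-hyperparameter value lists:
    # every combination becomes one objective function evaluation.
    return list(product(*axes))

# 10 candidate values per hyperparameter:
values = [0.1 * i for i in range(10)]

evals_2d = len(grid([values] * 2))   # 10^2 = 100 evaluations
evals_5d = len(grid([values] * 5))   # 10^5 = 100,000 evaluations
```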
Support vector machine (SVM) classifiers
min over α, ξ, b of
    (1/2) Σ_{i∈SV} Σ_{j∈SV} α_i α_j y_i y_j κ(x_i, x_j) + C Σ_{i=1..n} ξ_i,
subject to
    y_i ( Σ_{j∈SV} α_j y_j κ(x_i, x_j) + b ) ≥ 1 − ξ_i,   ξ_i ≥ 0, ∀i.
Task: optimize hyperparameters for an SVM
Tune an SVM classifier with RBF kernel κ(u, v) = exp(−γ‖u − v‖²):
min over α, b, ξ of
    (1/2) Σ_{i∈SV} Σ_{j∈SV} α_i α_j y_i y_j exp(−γ‖x_i − x_j‖²) + C Σ_{i∈SV} ξ_i,
where the double sum equals ‖w‖².
optimize regularization parameter C and kernel parameter γ
evaluate each (C, γ) pair using 2× iterated 10-fold cross-validation
via Optunity’s particle swarm optimizer [5]
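The mechanics of this tuning setup can be mimicked with a minimal particle swarm optimizer. This stand-alone sketch is not Optunity's implementation, and the objective is a smooth toy stand-in for the cross-validated error surface over (C, γ), with its minimum placed at C = 10, γ = 0.1 by construction:

```python
import random

def toy_cv_error(C, gamma):
    # Toy stand-in for the cross-validated error over (C, gamma).
    return (C - 10.0) ** 2 / 100.0 + (gamma - 0.1) ** 2 * 100.0

def pso(f, bounds, n_particles=20, n_iter=50, w=0.7, c1=1.5, c2=1.5):
    # Minimal particle swarm optimization: each particle is attracted to
    # its personal best (pbest) and the swarm's global best (gbest).
    rng = random.Random(42)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(*p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = f(*pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

(best_C, best_gamma), best_err = pso(toy_cv_error, [(0.0, 100.0), (0.0, 1.0)])
```

In real use, the objective would be the cross-validated score of an actual SVM rather than this synthetic surface; the swarm loop itself is unchanged.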
Response surface I (figure)
Response surface II (figure)
Expensive function evaluations
A single objective function evaluation consists of:
1 training a model via the learning method
(can be very time consuming: days up to weeks! [6, 7, 8])
2 predicting on a test set (for supervised methods)
3 computing an evaluation metric for the model or its predictions
All of the above is often done in cross-validation [9, 10]:
used to reliably estimate generalization performance
involves many repetitions → exacerbates computation time
Training/evaluation time is itself a function of the hyperparameter choice!
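To see why cross-validation exacerbates the cost: 2× iterated 10-fold cross-validation trains 20 models per hyperparameter evaluation. A sketch that builds the folds and counts the trainings; the actual train-and-score step is a placeholder, not a real learner:

```python
import random

def kfold_indices(n, k, rng):
    # Shuffle the indices, then split into k (nearly) equal folds.
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def iterated_cv(n, k=10, iterations=2, seed=0):
    rng = random.Random(seed)
    trainings = 0
    scores = []
    for _ in range(iterations):
        folds = kfold_indices(n, k, rng)
        for i, test_idx in enumerate(folds):
            train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
            assert not set(train_idx) & set(test_idx)  # folds are disjoint
            trainings += 1        # one full model training per fold
            scores.append(0.0)    # placeholder: train on train_idx, score on test_idx
    return trainings, sum(scores) / len(scores)

trainings, mean_score = iterated_cv(n=200)   # 2 iterations x 10 folds = 20 trainings
```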
Randomness
The objective function measures empirical performance based on a finite sample (data set) → induces discrete, non-smooth jumps
This gives rise to a stochastic component, inherent to:
the learning method (e.g. resampling methods [11, 12, 13])
random sampling (e.g. cross-validation, bootstrap [10, 9])
The objective function F is not a strict mathematical function
→ evaluating F(x) multiple times yields different results
The empirical optimum might not really be the best!
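The stochastic component can be made concrete: evaluating the "same" objective with different resampling seeds yields different values. A toy simulation in which the cross-validation shuffle drives the noise (the scores are synthetic, for illustration only):

```python
import random

def noisy_objective(lam, seed):
    # Simulated cross-validated score: a deterministic part depending on
    # the hyperparameter lam, plus sampling noise driven by the CV shuffle.
    rng = random.Random(seed)
    deterministic = (lam - 0.3) ** 2
    sampling_noise = rng.gauss(0.0, 0.02)
    return deterministic + sampling_noise

# Same hyperparameter, different resampling seeds -> different results:
a = noisy_objective(0.3, seed=1)
b = noisy_objective(0.3, seed=2)
# Same seed reproduces the same value:
c = noisy_objective(0.3, seed=1)
```

This is why a hyperparameter that merely "won" a single noisy comparison may not generalize best; averaging over repeated resamplings reduces, but does not remove, the effect.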
Exotic search spaces
Hyperparameter search spaces can be extremely complex:
mixed integer-continuous (e.g. regularization & kernel)
often domain constrained (e.g. positive regularization)
combinatorial (e.g. feature selection)
conditional dimensions (*)
(*) Consider the architecture of an artificial neural network:
number of hidden layers
size per hidden layer
(transfer functions per layer)
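Conditional dimensions like the network architecture above can be encoded by sampling the structural choice first, then the parameters that only exist given that structure. A stdlib sketch; the layer-count and size ranges are made-up illustration values:

```python
import random

def sample_architecture(rng):
    # First sample the structural choice...
    n_layers = rng.randint(1, 4)
    # ...then the dimensions that only exist conditional on it:
    layer_sizes = [rng.randint(8, 256) for _ in range(n_layers)]
    transfer = [rng.choice(["tanh", "relu", "sigmoid"]) for _ in range(n_layers)]
    return {"n_layers": n_layers, "sizes": layer_sizes, "transfer": transfer}

rng = random.Random(0)
arch = sample_architecture(rng)
```

A fixed-dimensional optimizer cannot represent such a space directly; the number of active dimensions itself depends on an earlier decision.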
Desiderata for hyperparameter optimizers
Optimization routines for hyperparameter search are ideally:
efficient in terms of function evaluations,
appropriate for wildly varying objective functions,
able to account for randomness,
flexible in terms of search space,
parallelizable.
The practical performance bottleneck is evaluating F → deciding on the next point to evaluate need not be fast.
Sequential model-based optimization (SMBO)
Commonly used for time-consuming objective functions F [14, 15].
SMBO is an iterative approach, in which each iteration involves:
1 model the response surface M based on previous evaluations
→ evaluating M is cheap, so use M as a surrogate for F
2 find the optimal test point x∗ based on M
→ optimize some criterion, e.g. expected improvement [16]
Approaches differ in terms of model and criterion [14, 15, 17].
But: inherently sequential!
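The two-step loop can be sketched with a deliberately simple surrogate: fit a quadratic to all evaluations so far, then pick the next point by minimizing the surrogate over a candidate grid. Real SMBO uses probabilistic models and acquisition criteria such as expected improvement; this one-dimensional sketch only shows the control flow, on a toy expensive objective with its minimum at x = 2:

```python
import random

def expensive_f(x):
    # Stand-in for the expensive objective F; minimum value 1.0 at x = 2.
    return (x - 2.0) ** 2 + 1.0

def fit_quadratic(xs, ys):
    # Least-squares fit of y ~ a + b*x + c*x^2 via the 3x3 normal equations.
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    A = [[S[0], S[1], S[2]], [S[1], S[2], S[3]], [S[2], S[3], S[4]]]
    b = T[:]
    for col in range(3):                       # Gaussian elimination, partial pivoting
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):                        # back-substitution
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, 3))) / A[r][r]
    return coef

def smbo(n_init=4, n_iter=10, lo=-5.0, hi=5.0):
    rng = random.Random(0)
    xs = [rng.uniform(lo, hi) for _ in range(n_init)]
    ys = [expensive_f(x) for x in xs]
    candidates = [lo + i * (hi - lo) / 200 for i in range(201)]
    for _ in range(n_iter):
        a, b_, c = fit_quadratic(xs, ys)              # 1) model the response surface
        x_next = min(candidates, key=lambda x: a + b_ * x + c * x * x)
        xs.append(x_next)                             # 2) optimize the cheap surrogate
        ys.append(expensive_f(x_next))                # one expensive evaluation per step
    y_best, x_best = min(zip(ys, xs))
    return x_best, y_best

x_best, y_best = smbo()
```

Note the sequential dependency: each surrogate fit needs the result of the previous expensive evaluation, which is exactly why vanilla SMBO is hard to parallelize.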
Metaheuristic optimization techniques
A large variety of metaheuristic methods have been used, such as:
particle swarm optimization [18, 19, 20]
genetic algorithms [21, 22]
artificial bee colony [23]
harmonic search [24]
simulated annealing [25]
Nelder-Mead simplex [26]
Advantages:
ease of implementation and parallelization
general purpose solvers → few implicit assumptions
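Random search, one of the simplest general purpose alternatives, illustrates these advantages: each hyperparameter is sampled independently within its box constraints, evaluations are embarrassingly parallel, and the budget is decoupled from the dimensionality. A minimal sketch over a toy objective (minimum at C = 1, γ = 0.5 by construction):

```python
import random

def toy_objective(params):
    # Toy stand-in for a tuning objective; minimum at C = 1, gamma = 0.5.
    return (params["C"] - 1.0) ** 2 + (params["gamma"] - 0.5) ** 2

def random_search(f, space, num_evals, seed=0):
    # Sample each hyperparameter uniformly within its (lo, hi) box,
    # keep the best configuration seen so far.
    rng = random.Random(seed)
    best_params, best_val = None, float("inf")
    for _ in range(num_evals):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        val = f(params)
        if val < best_val:
            best_params, best_val = params, val
    return best_params, best_val

space = {"C": (0.0, 10.0), "gamma": (0.0, 1.0)}
best_params, best_val = random_search(toy_objective, space, num_evals=500)
```

Unlike grid search, adding a hyperparameter to `space` does not multiply the budget; `num_evals` stays whatever the user can afford.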
Software
Several packages offer Bayesian SMBO approaches:
Hyperopt [27], Spearmint [17]
ParamILS [28], AutoWEKA [29]
BayesOpt [30], DiceKriging [31]
Optunity offers fundamentally distinct methods [5]:
focus on metaheuristic techniques not offered elsewhere
PSO, CMA-ES, random search, Sobol sequences, ...
multiplatform: Python, R, MATLAB, Octave
General purpose optimization libraries are also applicable
→ but often difficult to integrate into a machine learning pipeline
Metaheuristic methods are competitive to SMBO
Optunity’s standard PSO [5] versus Hyperopt’s tree-structured Parzen estimator [15, 27] on the two-dimensional Rastrigin function.
(Figure: best error so far, on a log scale from 10^0 to 10^1, versus function evaluation number (1 to 500), comparing random search, the tree of Parzen estimators, and particle swarm optimization.)
Conclusion
Hyperparameter search in machine learning
requires disciplined optimization methods
is receiving a lot of research attention, e.g. ChaLearn AutoML
The main challenges are:
expensive function evaluations with a stochastic component
exotic search spaces
Hyperparameter search is an interesting optimization problem
→ metaheuristic optimization methods are good candidates
Acknowledgements
Research Council KU Leuven: GOA/10/09 MaNet
Flemish Government:
FWO: project G.0871.12N (Neural circuits)
IWT: TBM Logic Insulin (100793), TBM Rectal Cancer (100783), TBM IETA (130256); PhD grant (111065)
Industrial Research Fund (IOF): IOF/HB/13/027 Logic Insulin
iMinds Medical Information Technologies SBO 2014
VLK Stichting E. van der Schueren: rectal cancer
Federal Government: FOD: Cancer Plan 2012-2015, KPC-29-023 (prostate)
COST: Action: BM1104: Mass Spectrometry Imaging
References I
[1] Simon Haykin. Neural Networks: A Comprehensive Foundation. 2004.
[2] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[3] Lawrence Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[4] Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the fourth ACM conference on Recommender Systems, pages 79–86. ACM, 2010.
References II
[5] Marc Claesen, Jaak Simm, Dusan Popovic, Yves Moreau, and Bart De Moor. Easy hyperparameter search using Optunity. arXiv preprint arXiv:1412.1114, 2014.
[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[7] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
References III
[8] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[9] Bradley Efron and Gail Gong. A leisurely look at the bootstrap, the jackknife, and cross-validation. The American Statistician, 37(1):36–48, 1983.
[10] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence, volume 14, pages 1137–1145, 1995.
[11] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
References IV
[12] Marc Claesen, Frank De Smet, Johan A.K. Suykens, and Bart De Moor. EnsembleSVM: A library for ensemble learning using support vector machines. Journal of Machine Learning Research, 15:141–145, 2014.
[13] Marc Claesen, Frank De Smet, Johan A.K. Suykens, and Bart De Moor. A robust ensemble approach to learn from positive and unlabeled data using SVM base models. Neurocomputing, 160:73–84, 2015.
[14] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
References V
[15] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.
[16] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
[17] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
References VI
[18] Michael Meissner, Michael Schmuker, and Gisbert Schneider. Optimized particle swarm optimization (OPSO) and its application to artificial neural network training. BMC Bioinformatics, 7(1):125, 2006.
[19] XC Guo, JH Yang, CG Wu, CY Wang, and YC Liang. A novel LS-SVMs hyper-parameter selection based on particle swarm optimization. Neurocomputing, 71(16):3211–3215, 2008.
[20] Shih-Wei Lin, Kuo-Ching Ying, Shih-Chieh Chen, and Zne-Jung Lee. Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Systems with Applications, 35(4):1817–1824, 2008.
References VII
[21] Jinn-Tsong Tsai, Jyh-Horng Chou, and Tung-Kuan Liu. Tuning the structure and parameters of a neural network by using hybrid Taguchi-genetic algorithm. Neural Networks, IEEE Transactions on, 17(1):69–80, 2006.
[22] Carlos Ansótegui, Meinolf Sellmann, and Kevin Tierney. A gender-based genetic algorithm for the automatic configuration of algorithms. In Principles and Practice of Constraint Programming, CP 2009, pages 142–157. Springer, 2009.
[23] Dervis Karaboga, Bahriye Akay, and Celal Ozturk. Artificial bee colony (ABC) optimization algorithm for training feed-forward neural networks. In Modeling Decisions for Artificial Intelligence, pages 318–329. Springer, 2007.
References VIII
[24] João P Papa, Gustavo H Rosa, Aparecido N Marana, Walter Scheirer, and David D Cox. Model selection for Discriminative Restricted Boltzmann Machines through meta-heuristic techniques. Journal of Computational Science, 9:14–18, 2015.
[25] Samuel Xavier-de Souza, Johan A.K. Suykens, Joos Vandewalle, and Désiré Bollé. Coupled simulated annealing. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 40(2):320–335, 2010.
[26] Gavin C Cawley and Nicola LC Talbot. Fast exact leave-one-out cross-validation of sparse least-squares support vector machines. Neural Networks, 17(10):1467–1475, 2004.
References IX
[27] James Bergstra, Dan Yamins, and David D Cox. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, pages 13–20. SciPy, 2013.
[28] Frank Hutter, Holger H Hoos, Kevin Leyton-Brown, and Thomas Stützle. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36(1):267–306, 2009.
[29] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Auto-WEKA: Automated selection and hyper-parameter optimization of classification algorithms. CoRR, abs/1208.3719, 2012.
References X
[30] Ruben Martinez-Cantin. BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits. arXiv preprint arXiv:1405.7430, 2014.
[31] Olivier Roustant, David Ginsbourger, Yves Deville, et al. DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by kriging-based metamodeling and optimization. 2012.