
Improving Bayesian Optimization for Machine Learning using Expert Priors

by

Kevin Swersky

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Computer Science
University of Toronto

© Copyright 2017 by Kevin Swersky

Abstract

    Improving Bayesian Optimization for Machine Learning using Expert Priors

    Kevin Swersky

    Doctor of Philosophy

    Graduate Department of Computer Science

    University of Toronto

    2017

Deep neural networks have recently become astonishingly successful at many machine learning problems, such as object recognition and speech recognition, and they are now also being used in many new and creative ways. However, their performance critically relies on the proper setting of numerous hyperparameters. Manual tuning by an expert researcher has traditionally been an effective approach; however, it is becoming increasingly infeasible as models become more complex and machine learning systems become further embedded within larger automated systems. Bayesian optimization has recently been proposed as a strategy for intelligently optimizing the hyperparameters of deep neural networks and other machine learning systems; it has been shown in many cases to outperform experts, and provides a promising way to reduce both the computational and human time required. Nevertheless, expert researchers can still be quite effective at hyperparameter tuning due to their ability to incorporate contextual knowledge and intuition into their search, while traditional Bayesian optimization treats each problem as a black box and therefore cannot take advantage of this knowledge. In this thesis, we draw inspiration from these abilities and incorporate them into the Bayesian optimization framework as additional prior information. These extensions include the ability to transfer knowledge between problems, the ability to transform the problem domain into one that is easier to optimize, and the ability to terminate experiments when they are no longer deemed to be promising, without requiring their training to converge. We demonstrate in experiments across a range of machine learning models that these extensions significantly reduce the cost and increase the robustness of Bayesian optimization for automatic hyperparameter tuning.
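The black-box loop the abstract refers to can be illustrated with a minimal sketch: fit a surrogate model to the hyperparameter evaluations seen so far, maximize an acquisition function to pick the next setting to try, and repeat. The sketch below assumes a Gaussian-process surrogate with a squared-exponential kernel and the expected-improvement acquisition function (both covered in Chapter 2); the toy objective, grid-based candidate search, and all function names are illustrative assumptions, not code from the thesis.

```python
# Minimal sketch of black-box Bayesian optimization of one "hyperparameter".
# The surrogate, acquisition function, and toy objective are illustrative.
import numpy as np
from math import erf, sqrt, pi

def rbf_kernel(A, B, lengthscale=0.3, amplitude=1.0):
    """Squared-exponential covariance between two 1-D point sets."""
    d = A[:, None] - B[None, :]
    return amplitude * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and variance at test points Xs given data (X, y)."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(Xs, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks.T)
    mu = Ks @ alpha
    var = rbf_kernel(Xs, Xs).diagonal() - np.sum(v ** 2, axis=0)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    """EI for minimization: expected amount by which we beat `best`."""
    sigma = np.sqrt(var)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2.0 * pi)
    return (best - mu) * cdf + sigma * pdf

def objective(x):
    # Toy stand-in for "validation error as a function of a hyperparameter".
    return np.sin(3.0 * x) + (x - 1.0) ** 2

# The black-box loop: fit the surrogate, maximize the acquisition
# function over a grid of candidates, evaluate, and repeat.
grid = np.linspace(0.0, 2.0, 201)
X = np.array([0.1, 1.9])            # two initial evaluations
y = objective(X)
for _ in range(20):
    mu, var = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, var, y.min()))]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

best_x = X[np.argmin(y)]            # the best hyperparameter value found
```

Because the loop only ever sees (setting, score) pairs, it cannot exploit the contextual knowledge an expert would bring; the extensions in this thesis inject such knowledge as additional prior information.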


Acknowledgements

This thesis, and all of my work beyond it, are the product of the incredible guidance and support that I have received, and the lasting friendships that I have made throughout my time in academia. I cannot stress enough the many ways in which the people in my life have inspired me. To these people I express my deepest gratitude.

First I absolutely must acknowledge my supervisor Richard Zemel. Rich would never hesitate to entertain a new idea, no matter how silly. He would happily dive head first into the depths of a new, unknown area, become an expert, and help me through with excellent advice. Beyond this, Rich was an incredible role model. He carries himself with a great deal of humility, and is respectful to everyone around him. I was unbelievably lucky to have had the opportunity to work with him, and I hope that in my career I can be even a fraction as good a mentor as he is.

I would like to thank my supervisory committee, Geoffrey Hinton and Brendan Frey, for their excellent feedback and support throughout my degree. I had the privilege to work with Brendan on multiple projects and was always energized by his constant enthusiasm and keen ideas. I was incredibly fortunate to be able to work with Geoff. His intuition and ability to get to the core insight of any problem is astounding, but even more important is his generosity and the way he treats those around him. I was also fortunate to work with or alongside many excellent professors, including Sanja Fidler, Toniann Pitassi, Ruslan Salakhutdinov, and Raquel Urtasun. I would also like to thank my external examiner Max Welling for his thoughtful and insightful comments on this thesis.

I was fortunate enough to have several other excellent mentors during my degree. Early on I worked a lot with Danny Tarlow, who helped me build research momentum and reach a new level of maturity in my work. Later, I collaborated with Jasper Snoek, whose unwavering work ethic, novel insights, and homemade beer led to the papers that form the foundation of this thesis. Throughout all of this, I was privileged to be able to collaborate with Ryan Adams, who oversaw much of my work on Bayesian optimization. Ryan continues to astonish me with his breadth and depth of knowledge, and by his kindness and sensitivity. I am indebted to my former advisor Nando de Freitas, who brought me into the world of machine learning, advised me during my master's degree, brought me back into the world of machine learning after I graduated, and continues to be a good friend to this day.

Perhaps the greatest aspect of my time in academia is the group of friends that I worked with, socialized with, and crunched with at deadlines. These people are unbelievably talented, smart, genuine, and drove me to set the bar as high as I possibly could. They include Jimmy Ba, Yuri Burda, Laurent Charlin, George Dahl, David Duvenaud, Roman Garnett, Michael Gelbart, Alex Graves, Roger Grosse, Frank Hutter, Navdeep Jaitly, Emtiyaz Khan, Jamie Kiros, Alex Krizhevsky, Hugo Larochelle, Yujia Li, Chris Maddison, Ben Marlin, James Martens, Vlad Mnih, Abdel-rahman Mohamed, Michael Osborne, Mengye Ren, Oren Rippel, Mark Schmidt, Alex Schwing, Jake Snell, Nitish Srivastava, Ilya Sutskever, Charlie Tang, Tijmen Tieleman, Maks Volkovs, Jackson Wang, Alex Wiltschko, and Youssef Zaky. This is certainly not an exhaustive list.

Finally, I owe a debt of gratitude to those closest to me, whose endless love saw me through this degree and everything leading up to it. I want to thank my girlfriend Sarah Austin, for always being there for me through even the most trying times, and for all of the adventures we shared along the way. I am grateful for the support of my stepdad Michael, my older brother Darren, my younger brother Lorne, and my stepsister Ashley. I would like to thank my stepbrother Nathan for pushing me to move outside of my comfort zone and for always lending an ear when I needed one. I am grateful for the short time I had with my late father Gerald, for the memories and values he bestowed upon me during our time together. Finally, I am most thankful to my dear mother Joanne, who took on an incredible burden of being a single, working mother and made it look easy. This thesis is as much her hard work as it is mine.


Contents

    1 Introduction 1

    1.1 Summary of Thesis Chapters . . . . . . . . . . . . . . . . . . . . . . . . . 3

    2 Background 5

    2.1 Deep Learning for Classification . . . . . . . . . . . . . . . . . . . . . . . 5

    2.1.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    2.1.2 Multi-layer Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . 6

    2.1.3 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . 7

    2.2 Training Deep Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    2.2.1 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . 9

    2.2.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    2.3 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.3.1 Probabilistic Matrix Factorization . . . . . . . . . . . . . . . . . . 12

    2.4 Topic Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

    2.4.1 Online Latent Dirichlet Allocation . . . . . . . . . . . . . . . . . . 14

    2.5 Types of Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    2.6 Bayesian Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    2.6.1 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    2.6.2 Acquisition Functions . . . . . . . . . . . . . . . . . . . . . . . . . 20

    2.6.2.1 Pure Exploration and Exploitation . . . . . . . . . . . . 21

    2.6.2.2 Probability of Improvement . . . . . . . . . . . . . . . . 21

    2.6.2.3 Expected Improvement . . . . . . . . . . . . . . . . . . . 22

    2.6.2.4 Entropy Search . . . . . . . . . . . . . . . . . . . . . . . 22

    2.6.3 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . 24

    2.6.3.1 Learning Gaussian Process Hyperparameters . . . . . . . 24

    2.6.3.2 Initializing the Search . . . . . . . . . . . . . . 24

    2.6.3.3 Optimizing the Acquisition Function . . . . . . . . . . . 24

    2.6.3.4 Integrated Acquisition Functions . . . . . . . . . . . . . 25


    2.6.4 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.6.4.1 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2.6.4.2 Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . 27

    3 Multi-Task 28

    3.1 In
