Introduction Sequential optimization Bayesian optimization Applications Discussion on optimal learning

Nonlinear optimization methods for data-fitting problems

Tien Mai

Canada Excellence Research Chair in Data Science for Real-Time Decision-Making, Polytechnique Montréal

May 18, 2016


Table of Contents

1 Introduction

2 Sequential optimization

3 Bayesian optimization

4 Applications

5 Discussion on optimal learning


Introduction

My bio

2009: B.Sc. Computer Science

2013: M.Sc. Operations Research (nonlinear optimization)

2016: Ph.D. Operations Research (econometric modeling, route choice problem)

April 2016 - : Post-doc Data Science ...


Nonlinear optimization

$$\max_{x \in \Omega} f(x)$$

f(x) is nonlinear, possibly non-concave, continuous, and either noisy or noise-free (deterministic)

f(x) may be a black-box function (no analytical expression, expensive to evaluate)

These are important problems when fitting economic or machine learning models, or when making decisions


Sequential optimization

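$$\max_{x \in \Omega} f(x)$$

We generate a sequence $x_0, x_1, \ldots$ that converges to the optimal solution $x^*$ such that $f(x_0) \le f(x_1) \le \ldots \le f(x^*)$

At iteration t, we build a predictive model $m(x_t + p)$ that approximates $f(x)$ around $x_t$; $m(\cdot)$ is cheap to evaluate

Quadratic model, where $H_t$ is the Hessian of f at $x_t$ or an approximation of it

$$m(x_t + p) = f(x_t) + p^T \nabla f(x_t) + \frac{1}{2} p^T H_t p$$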
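A short derivation, not on the original slide, shows where the step p comes from when the quadratic model is maximized without constraints. It assumes the matrix in the quadratic term, written H here, is negative definite, so that the model has a unique maximizer.

```latex
\nabla_p\, m(x_t + p) = \nabla f(x_t) + H p = 0
\quad\Longrightarrow\quad
p = -H^{-1} \nabla f(x_t)
```

When H is the exact Hessian of f at $x_t$, this p is the Newton step.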


Sequential optimization

Trust region

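Define a trust region $B_t$

$p_t = \arg\max_{p \,:\, x_t + p \in B_t} m(x_t + p)$

$x_{t+1} = x_t + p_t$ if $p_t$ is good (the actual increase in f is close to the increase predicted by m), otherwise reduce $B_t$

Line search

$p_t = \arg\max_p m(x_t + p)$

$x_{t+1} = x_t + \alpha_t p_t$, where the step length $\alpha_t$ satisfies some conditions (e.g. the Wolfe conditions)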
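The line-search scheme can be sketched in a few lines. This is a minimal one-dimensional illustration, not the speaker's code: the function name `newton_line_search`, the test objective $2x - e^x$, and the use of a simple Armijo (sufficient-increase) backtracking rule in place of the full Wolfe conditions are all assumptions made for the example.

```python
import math

def newton_line_search(f, grad, hess, x0, tol=1e-10, max_iter=50, c=1e-4):
    """Maximize a 1-D function with Newton steps and Armijo backtracking."""
    x = x0
    for _ in range(max_iter):
        g, h = grad(x), hess(x)
        if abs(g) < tol:
            break
        # Maximizer of the quadratic model when h < 0; otherwise fall back
        # to a plain gradient-ascent direction.
        p = -g / h if h < 0 else g
        alpha = 1.0
        # Halve the step until the sufficient-increase condition holds.
        while f(x + alpha * p) < f(x) + c * alpha * g * p and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * p
    return x

# Hypothetical concave objective with its maximum at x = ln 2.
f = lambda x: 2.0 * x - math.exp(x)
grad = lambda x: 2.0 - math.exp(x)
hess = lambda x: -math.exp(x)

x_star = newton_line_search(f, grad, hess, x0=3.0)
```

A trust-region variant would instead cap the length of p and shrink that cap whenever the realized increase in f falls short of the model's prediction.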


Predictive model

$$m(x_t + p) = f(x_t) + p^T \nabla f(x_t) + \frac{1}{2} p^T H_t p$$

Newton’s method: $H_t$ is the matrix of second derivatives of f at $x_t$ (the Hessian)

Quasi-Newton: $H_t$ is a Hessian approximation

BFGS, SR1, DFP

BHHH (for maximum likelihood estimation)

Large-scale problems: limited-memory BFGS (L-BFGS), conjugate gradient

Derivative-free trust region algorithms
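For reference, two of the approximations named above are usually written as follows. The symbols $s_t$, $y_t$, $B_t$ and $\ell_i$ are introduced here for illustration and do not appear on the slides. The BFGS update is given in its standard minimization form, applied to $g = -f$, so that $B_t$ approximates the Hessian of $-f$.

```latex
% BFGS update, with s_t = x_{t+1} - x_t and
% y_t = \nabla g(x_{t+1}) - \nabla g(x_t), where g = -f
B_{t+1} = B_t
  - \frac{B_t s_t s_t^{T} B_t}{s_t^{T} B_t s_t}
  + \frac{y_t y_t^{T}}{y_t^{T} s_t}

% BHHH approximation for a log-likelihood
% \ell(\theta) = \sum_{i=1}^{n} \ell_i(\theta)
\nabla^2 \ell(\theta) \approx
  -\sum_{i=1}^{n} \nabla \ell_i(\theta)\, \nabla \ell_i(\theta)^{T}
```

BHHH needs only the per-observation gradients, which is why it is attractive for maximum likelihood estimation.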


Sequential optimization - properties

Under some mild conditions, the algorithms converge after a finite number of iterations

They find local solutions only

Gradients may be too expensive to obtain, and derivative-free trust region algorithms are limited to a small number of variables (e.g. 20)


Bayesian optimization

Bayes rule

$$P(\text{hypothesis} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{hypothesis})\, P(\text{hypothesis})}{P(\text{Data})}$$

P(hypothesis) is the prior, P(hypothesis|Data) is the posterior probability given Data

Given Data, we use Bayes rule to infer P(hypothesis|Data)

Global optimization

Targets problems with expensive cost functions and no available derivatives


Bayesian inference

Fitting data with a probabilistic model

Maximum likelihood estimation

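$$\max_{\theta} P(\text{Data} \mid \theta)$$

Bayesian inference

$$P(\theta \mid \text{Data}) = \frac{P(\text{Data} \mid \theta)\, P(\theta)}{P(\text{Data})} \propto P(\text{Data} \mid \theta)\, P(\theta)$$

We can sample θ with the Metropolis-Hastings algorithm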
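A minimal random-walk Metropolis-Hastings sampler illustrates why only the unnormalized posterior $P(\text{Data} \mid \theta) P(\theta)$ is needed. The model below (Gaussian observations with unit variance, a wide Gaussian prior on the mean) and the five data values are hypothetical, chosen only so that the example is self-contained.

```python
import math
import random

def metropolis_hastings(log_post, theta0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings for a scalar parameter."""
    rng = random.Random(seed)
    theta, lp = theta0, log_post(theta0)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        lp_new = log_post(proposal)
        # Accept with probability min(1, posterior ratio). The symmetric
        # proposal cancels out and P(Data) is never evaluated.
        if math.log(1.0 - rng.random()) < lp_new - lp:
            theta, lp = proposal, lp_new
        samples.append(theta)
    return samples

# Hypothetical data: y_i ~ N(theta, 1), prior theta ~ N(0, 10^2).
data = [1.8, 2.1, 2.4, 1.9, 2.3]

def log_post(theta):
    log_lik = -0.5 * sum((y - theta) ** 2 for y in data)
    log_prior = -0.5 * theta ** 2 / 100.0
    return log_lik + log_prior

samples = metropolis_hastings(log_post, theta0=0.0, n_samples=20000)
kept = samples[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

For this conjugate model the posterior mean is available in closed form (about 2.096), which gives a simple check on the sampler.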


Bayesian optimization

$$\max_{x \in \Omega} f(x)$$

Bayes rule

$$P(f \mid \text{Data}) \propto P(\text{Data} \mid f)\, P(f)$$

$\text{Data} = \{f(x_0), f(x_1), \ldots, f(x_n)\}$

Gaussian process

$$f(x) \sim \mathcal{GP}(m(x), \mathrm{cov}(x, x'))$$


Bayesian optimization - algorithm

[Slides 12-14: figures illustrating the algorithm; the images are not preserved in this transcript]

Acquisition functions

Acquisition functions are defined such that high acquisition corresponds to potentially high values of the objective function

$\text{Data} = \{f(x_0), \ldots, f(x_n)\}$

$f^* = \max\{f(x_0), \ldots, f(x_n)\}$

Probability of improvement

$$\mathrm{PI}(x \mid \text{Data}) = P(f(x) \ge f^* \mid \text{Data})$$

Expected improvement

$$\mathrm{EI}(x \mid \text{Data}) = E_f\left[\max\{0, f(x) - f^*\} \mid \text{Data}\right]$$
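When the posterior of f(x) at a point is Gaussian with mean μ and standard deviation σ, as it is under a Gaussian process, both acquisition functions have closed forms. The sketch below assumes that Gaussian posterior; the function names are illustrative and not taken from the slides.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mu, sigma, f_best):
    """P(f(x) >= f_best) when f(x) ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return 1.0 if mu >= f_best else 0.0
    return norm_cdf((mu - f_best) / sigma)

def expected_improvement(mu, sigma, f_best):
    """E[max(0, f(x) - f_best)] when f(x) ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(0.0, mu - f_best)
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm_cdf(z) + sigma * norm_pdf(z)
```

Expected improvement rewards both a high posterior mean and a large posterior uncertainty, whereas probability of improvement ignores how large the improvement would be.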


Algorithm

$\text{Data} = \{f(x_0)\}$

For t = 1, 2, ..., N

$x_t = \arg\max_x \mathrm{EI}(x \mid \text{Data})$

Compute $f(x_t)$

$\text{Data} = \text{Data} \cup \{f(x_t)\}$, update the Gaussian process

End
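The loop above can be sketched end to end for a one-dimensional problem. Everything specific here is an assumption made for illustration: a zero-mean Gaussian process with a squared-exponential kernel of fixed length scale, a grid search for the EI maximizer, a small hand-written Cholesky solver to keep the example dependency-free, and a toy objective $-(x - 0.7)^2$.

```python
import math

def rbf(a, b, length=0.3):
    """Squared-exponential covariance with unit variance."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a small positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(xs, ys, jitter=1e-6):
    """Factorize the kernel matrix once per iteration."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    return L, solve_upper(L, solve_lower(L, ys))

def gp_predict(xs, L, alpha, x):
    """Posterior mean and standard deviation of f at x."""
    k = [rbf(a, x) for a in xs]
    mu = sum(ki * ai for ki, ai in zip(k, alpha))
    v = solve_lower(L, k)
    return mu, math.sqrt(max(1.0 - sum(vi * vi for vi in v), 0.0))

def expected_improvement(mu, sigma, f_best):
    if sigma <= 0.0:
        return max(0.0, mu - f_best)
    z = (mu - f_best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - f_best) * cdf + sigma * pdf

def bayes_opt(f, lo, hi, n_iter=10, n_grid=201):
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    xs = [grid[n_grid // 2]]          # Data = {f(x_0)}
    ys = [f(xs[0])]
    for _ in range(n_iter):
        L, alpha = gp_fit(xs, ys)     # update the Gaussian process
        f_best = max(ys)
        x_next = max(grid, key=lambda x: expected_improvement(
            *gp_predict(xs, L, alpha, x), f_best))
        xs.append(x_next)             # Data = Data U {f(x_t)}
        ys.append(f(x_next))
    best = max(range(len(ys)), key=ys.__getitem__)
    return xs[best], ys[best]

x_best, y_best = bayes_opt(lambda x: -(x - 0.7) ** 2, 0.0, 1.0)
```

In practice the inner maximization of EI is itself a nonlinear optimization problem, and a grid is only workable in very low dimension.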


Pros and cons of Bayesian optimization

Powerful tool for machine learning, especially optimal learning

The prior P(f) is critically important to efficient Bayesian optimization, and Gaussian processes are not always the best choice

Solving $\max_x \mathrm{EI}(x)$ or $\max_x \mathrm{PI}(x)$ is expensive in many cases

Global optimization, but the convergence is not guaranteed


Applications

Route choice modeling using parametric Markov decision processes

Design optimization using Bayesian optimization


Route choice modeling with revealed preference data

Data: trip observations given by travelers in real transport networks

Objectives

Assessing travelers’ preferences for route characteristics (e.g. travel time, travel cost)

Predicting the path that a traveler would choose to go from one location to another

Traffic simulation

Parametric Markov decision processes with the random utility maximization (RUM) framework


Route choice models

Network consists of links and nodes

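Link utilities

$$u(a \mid k; \theta) = \theta^T x(a \mid k) + \varepsilon_a$$

$x(a \mid k)$ is a vector of attributes of link a, and $\varepsilon_a$ is a random term

Optimal policy at state k

$$\max_{a \in A(k)} \{ u(a \mid k; \theta) + V(a) \}$$

A(k) is the set of outgoing links from k, V(a) is the expected cost-to-go under the optimal policy from a to the destination

$$P(a \mid k; \theta) = P\left(a = \arg\max_{a' \in A(k)} \{ u(a' \mid k; \theta) + V(a') \}\right)$$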
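The slide does not state the distribution of the random terms. Under the additional assumption that they are i.i.d. Gumbel (extreme value type I), the value function becomes a log-sum-exp recursion and the choice probabilities take the logit form. The sketch below uses that assumption on a hypothetical acyclic toy network in which, for simplicity, states are nodes and the only attribute is travel time.

```python
import math
from functools import lru_cache

# Hypothetical toy network: state -> list of (next_state, travel_time).
NETWORK = {
    "o": [("a", 1.0), ("b", 2.0)],
    "a": [("d", 2.0), ("b", 0.5)],
    "b": [("d", 1.0)],
    "d": [],
}

def choice_probabilities(theta, network, destination):
    """Logit choice probabilities P(a | k; theta) under Gumbel errors."""

    @lru_cache(maxsize=None)
    def value(k):
        # V(k) = log sum_a exp(theta * x(a|k) + V(a)), with V(destination) = 0.
        if k == destination:
            return 0.0
        return math.log(sum(math.exp(theta * t + value(a))
                            for a, t in network[k]))

    probs = {}
    for k, arcs in network.items():
        if k == destination:
            continue
        probs[k] = {a: math.exp(theta * t + value(a) - value(k))
                    for a, t in arcs}
    return probs

# theta < 0: longer travel time lowers utility.
probs = choice_probabilities(theta=-1.0, network=NETWORK, destination="d")
```

Every evaluation of the likelihood at a new θ requires recomputing V, which is why this step dominates the cost of estimation.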


Estimation

Sequential optimization

$$\max_{\theta} P(\text{trip observations} \mid \theta)$$

Bayesian inference

$$P(\theta \mid \text{trip observations}) \propto P(\text{trip observations} \mid \theta)\, P(\theta)$$

Computing V is costly

Bayesian inference requires a huge number of samples (e.g. 10000); the dynamic Bayesian inference approach (Imai et al., 2009) can be used

Sequential optimization often outperforms the Bayesian approach


Design optimization

Designs can affect site traffic and return visits

We can use data to select better designs

Typically, users are randomized across different designs, data are collected from them, and criteria such as Click Through Rate or number of reviews written are used to decide which designs are better


Too many designs

Designs are adapted depending on users’ characteristics, e.g. age, sex, location, time zone ...

When you are close to the business

When you are far from the business


Parameterizing

Designs can be parameterized

Thresholds for when advertisements have certain attributes

Thresholds for when/how to show ads

Parameters for the size of texts, numbers of ads per site ...

Maximize an objective function defined based on

Click Through Rate (CTR)

Revenue Per Opportunity (RPO)

Number of reviews written ...

The objective function is evaluated only by collecting data from users


Bayesian optimization

A Gaussian process is fitted to data (objective values evaluated at some sets of parameters)

Maximize the acquisition function to select a new design to test (a new set of parameters)

Update data

This method can be applied to optimize many parameterized systems


Optimal learning - a case study of myself


Optimal learning

Efficiently collecting information to make decisions

Useful in applications where information is expensive to collect

Bayes rule is one of the keys

$$P(\text{hypothesis} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{hypothesis})\, P(\text{hypothesis})}{P(\text{Data})}$$


Another example - Learning user preferences in recommender systems

A system that makes personalized recommendations to users based on their browsing history

Amazon, eBay, Google, arXiv.org ...

For long-term users with lots of historical data: a machine learning classifier can be trained

For new users: How can we learn user preferences without sending too many irrelevant recommendations?


Learning user preferences in recommender systems

Idea:

Make a prior assumption on how relevant an arriving item is to the user

The assumption can be updated (via Bayes rule) when more information is observed

Define a cost function based on whether the items sent are relevant to the user or not

We end up with a dynamic programming problem; solving it gives an optimal policy for sending or not sending a new item
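The first two steps of the idea can be sketched with a Beta prior on the probability that an item is relevant, updated from binary feedback. This is an illustrative assumption, not the model from the talk, and the decision rule shown is a myopic one-step rule that stands in for the dynamic programming policy rather than implementing it. The names `update_relevance`, `should_send`, `reward` and `penalty` are hypothetical.

```python
def update_relevance(alpha, beta, relevant):
    """Bayes update of a Beta(alpha, beta) belief after one feedback signal."""
    return (alpha + 1, beta) if relevant else (alpha, beta + 1)

def should_send(alpha, beta, reward=1.0, penalty=1.0):
    """Myopic rule: send if the expected one-step payoff is non-negative."""
    p_relevant = alpha / (alpha + beta)  # posterior mean of the Beta belief
    return p_relevant * reward - (1.0 - p_relevant) * penalty >= 0.0

belief = (1, 1)  # uniform prior on the relevance probability
for feedback in [True, True, False]:
    belief = update_relevance(*belief, feedback)

send_next = should_send(*belief)
```

A dynamic programming policy would also account for the information gained by sending an item, so it may send items that the myopic rule rejects.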


References

Nocedal, Jorge, and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.

Brochu, Eric, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010).

Imai, Susumu, Neelam Jain, and Andrew Ching. Bayesian estimation of dynamic discrete choice models. Econometrica 77.6 (2009): 1865-1899.

Powell, Warren B., and Peter Frazier. Optimal learning. TutORials in Operations Research (2008): 213-246.

Mai, Tien. Dynamic programming approaches for estimating and applying large-scale discrete choice models. Ph.D. thesis, Université de Montréal, 2015.


Design optimization at Yelp: https://people.orie.cornell.edu/pfrazier/Presentations/2014.11.Cornell.Webinar.pdf

Learning user preferences at arXiv.org: https://people.orie.cornell.edu/pfrazier/Presentations/2014.01.Lancaster.Arxiv.pdf


Thank you for your attention!

Questions?
