Nonlinear optimization methods for data-fitting problems
Tien Mai
Canada Excellence Research Chair in Data Science for Real-Time Decision-Making, Polytechnique Montreal
May 18, 2016
Table of Contents
1 Introduction
2 Sequential optimization
3 Bayesian optimization
4 Applications
5 Discussion on optimal learning
Introduction
My bio
2009: B.Sc. Computer Science
2013: M.Sc. Operations Research (nonlinear optimization)
2016: Ph.D. Operations Research (econometric modeling, route choice problem)
April 2016 - : Post-doc Data Science ...
Nonlinear optimization
max_{x ∈ Ω} f(x)
f(x) is nonlinear, possibly non-concave, continuous, and either noisy or noise-free (deterministic)
f(x) may be a black-box function (no analytical expression, expensive to evaluate)
Such problems are important when fitting economic or machine learning models, or when making decisions
Sequential optimization
max_{x ∈ Ω} f(x)
We generate a sequence x_0, x_1, ... that converges to the optimal solution x*, such that f(x_0) ≤ f(x_1) ≤ ... ≤ f(x*)
At iteration t, we build a predictive model m(x_t + p) to approximate f(x) around x_t; m(·) is cheap to evaluate
Quadratic model
m(x_t + p) = f(x_t) + p^T ∇f(x_t) + (1/2) p^T H_t p
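As a side note (not on the slides), the quadratic model is indeed cheap to evaluate once f(x_t), ∇f(x_t), and H_t are available; a minimal NumPy sketch:

```python
import numpy as np

def quadratic_model(p, f_val, grad, hess):
    """Evaluate m(x_t + p) = f(x_t) + p^T grad + 0.5 * p^T H_t p."""
    return f_val + grad @ p + 0.5 * p @ (hess @ p)
```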
Sequential optimization
Trust region
Define a trust region B_t around x_t
p_t = argmax_{p : x_t + p ∈ B_t} m(x_t + p)
x_{t+1} = x_t + p_t if p_t is good; otherwise shrink B_t
Line search
p_t = argmax_p m(x_t + p)
x_{t+1} = x_t + α_t p_t, where the step size α_t satisfies some conditions (e.g., the Wolfe conditions)
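A hedged sketch of the line-search update for maximization, using simple Armijo backtracking in place of a full set of step-size conditions (f, grad_f, and the ascent direction p are assumed given):

```python
import numpy as np

def backtracking_step(f, grad_f, x, p, alpha=1.0, shrink=0.5, c=1e-4):
    """Shrink alpha until the Armijo condition for maximization holds:
    f(x + alpha * p) >= f(x) + c * alpha * grad_f(x)^T p."""
    fx, slope = f(x), c * (grad_f(x) @ p)
    while f(x + alpha * p) < fx + alpha * slope and alpha > 1e-12:
        alpha *= shrink
    return x + alpha * p
```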
Predictive model
m(x_t + p) = f(x_t) + p^T ∇f(x_t) + (1/2) p^T H_t p
Newton’s method: H_t is the matrix of second derivatives (the Hessian)
Quasi-Newton: H_t is a Hessian approximation
BFGS, SR1, DFP, BHHH (for maximum likelihood estimation)
Large-scale problems: limited-memory BFGS (L-BFGS), conjugate gradient
Derivative-free trust region algorithms
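For instance (an illustrative sketch, not from the slides), SciPy's L-BFGS implementation maximizes f by minimizing −f:

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective: smooth and concave, with its maximum at x = (1, 2).
def f(x):
    return -((x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2)

# minimize() only minimizes, so negate f to maximize it.
res = minimize(lambda x: -f(x), x0=np.zeros(2), method="L-BFGS-B")
print(res.x)  # approximately [1., 2.]
```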
Sequential optimization - properties
Under some mild conditions, algorithms converge after a finite number of iterations
Local solutions
Gradients may be too expensive to obtain, and derivative-free trust region algorithms are limited to a small number of variables (e.g., 20)
Bayesian optimization
Bayes’ rule
P(hypothesis | Data) = P(Data | hypothesis) P(hypothesis) / P(Data)
P(hypothesis) is the prior; P(hypothesis | Data) is the posterior probability given Data
Given Data, we use Bayes’ rule to infer P(hypothesis | Data)
Global optimization
Problems with derivative-free and expensive cost functions
Bayesian inference
Fitting data with a probabilistic model
Maximum likelihood estimation
max_θ P(Data | θ)
Bayesian inference
P(θ | Data) = P(Data | θ) P(θ) / P(Data) ∝ P(Data | θ) P(θ)
We can sample θ with the Metropolis-Hastings algorithm
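A minimal random-walk Metropolis-Hastings sketch (illustrative; log_post is an assumed callable returning log P(Data | θ) + log P(θ) up to a constant):

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_samples=10000, step=0.1, seed=0):
    """Random-walk MH: propose theta' ~ N(theta, step^2 I) and accept
    with probability min(1, P(theta' | Data) / P(theta | Data))."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(proposal)
        if np.log(rng.random()) < lp_prop - lp:  # accept the proposal
            theta, lp = proposal, lp_prop
        samples.append(theta.copy())
    return np.array(samples)
```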
Bayesian optimization
max_{x ∈ Ω} f(x)
Bayes’ rule: P(f | Data) ∝ P(Data | f) P(f)
Data = {f(x_0), f(x_1), ..., f(x_n)}
Gaussian process prior
f(x) ∼ GP(m(x), cov(x, x′))
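A bare-bones sketch of the GP posterior update (1-D inputs and a zero prior mean for brevity; the squared-exponential covariance and all names here are illustrative assumptions, not from the slides):

```python
import numpy as np

def sq_exp(x1, x2, ell=1.0, sigma=1.0):
    """Squared-exponential covariance cov(x, x')."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and variance at test points Xs, given Data = {f(x_i)};
    assumes prior mean m(x) = 0."""
    K = sq_exp(X, X) + noise * np.eye(len(X))
    Ks, Kss = sq_exp(X, Xs), sq_exp(Xs, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mean, var
```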
Bayesian optimization - algorithm
[Three figure slides illustrating the algorithm.]
Acquisition functions
Acquisition functions are defined so that a high acquisition value corresponds to potentially high values of the objective function
Data = {f(x_0), ..., f(x_n)}, f* = max{f(x_0), ..., f(x_n)}
Probability of improvement
PI(x | Data) = P(f(x) ≥ f* | Data)
Expected improvement
EI(x | Data) = E_f[ max{0, f(x) − f*} | Data ]
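Under a Gaussian posterior with mean μ(x) and standard deviation σ(x), both acquisition functions have well-known closed forms; a sketch (function names are assumptions):

```python
import numpy as np
from scipy.stats import norm

def pi_and_ei(mu, sigma, f_star):
    """Closed-form PI and EI for a Gaussian posterior N(mu, sigma^2)."""
    z = (mu - f_star) / sigma
    pi = norm.cdf(z)                                        # P(f(x) >= f*)
    ei = (mu - f_star) * norm.cdf(z) + sigma * norm.pdf(z)  # E[max(0, f(x) - f*)]
    return pi, ei
```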
Algorithm
Data = {f(x_0)}
For t = 1, 2, ..., N
    x_t = argmax_x EI(x | Data)
    Compute f(x_t)
    Data = Data ∪ {f(x_t)}, update the Gaussian process
End
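Putting the pieces together, a sketch of the loop under assumptions not on the slides: scikit-learn's GP regressor, a dense candidate grid in place of a true inner maximization of EI, and the pi_and_ei helper above:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt(f, lo, hi, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = [rng.uniform(lo, hi)]                   # Data = {f(x_0)}
    y = [f(X[0])]
    grid = np.linspace(lo, hi, 1000)
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True)
        gp.fit(np.array(X)[:, None], y)         # update the Gaussian process
        mu, sigma = gp.predict(grid[:, None], return_std=True)
        _, ei = pi_and_ei(mu, np.maximum(sigma, 1e-9), max(y))
        x_t = grid[np.argmax(ei)]               # x_t = argmax_x EI(x | Data)
        X.append(x_t)
        y.append(f(x_t))                        # Data = Data ∪ {f(x_t)}
    best = int(np.argmax(y))
    return X[best], y[best]
```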
Pros and cons of Bayesian optimization
Powerful tool for machine learning, especially optimal learning
The prior P(f) is critically important for efficient Bayesian optimization; Gaussian processes are not always the best choice
Solving max_x EI(x) or max_x PI(x) is itself expensive in many cases
Global optimization, but convergence is not guaranteed
Applications
Route choice modeling using parametric Markov decision processes
Design optimization using Bayesian optimization
Route choice modeling with revealed preference data
Data: trip observations given by travelers in real transport networks
Objectives
Assessing travelers’ preferences over route characteristics (e.g., travel time, travel cost)
Predicting the path a traveler would choose to go from one location to another
Traffic simulation
Parametric Markov decision processes within the random utility maximization (RUM) framework
Route choice models
Network consists of links and nodes
Link utilities
u(a | k; θ) = θ^T x(a | k) + ε_a
x(a | k) is a vector of attributes of link a
Optimal policy at state k
max_{a ∈ A(k)} { u(a | k; θ) + V(a) }
A(k) is the set of outgoing links from k; V(a) is the expected cost-to-go under the optimal policy from a to the destination
P(a | k; θ) = P(a = argmax_{a′ ∈ A(k)} { u(a′ | k; θ) + V(a′) })
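As an illustrative sketch only: if the ε_a are assumed i.i.d. extreme value (an assumption not stated on the slide, under which the choice probabilities take a multinomial logit form), V can be computed by fixed-point iteration on a logsum recursion:

```python
import numpy as np

def solve_value_function(succ, util, dest, n_iter=200):
    """Iterate V(k) = log sum_{a in succ[k]} exp(util[k, a] + V(a)),
    with V(dest) = 0.  succ[k]: outgoing links from state k (dest must be
    a key of succ, e.g., with an empty list); util[(k, a)]: theta^T x(a|k)."""
    V = {k: 0.0 for k in succ}
    for _ in range(n_iter):
        for k in succ:
            if k != dest and succ[k]:
                V[k] = np.log(sum(np.exp(util[k, a] + V[a]) for a in succ[k]))
    return V

def choice_prob(a, k, V, succ, util):
    """Logit form of P(a | k; theta) implied by the extreme value assumption."""
    weights = {b: np.exp(util[k, b] + V[b]) for b in succ[k]}
    return weights[a] / sum(weights.values())
```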
Estimation
Sequential optimization
max_θ P(trip observations | θ)
Bayesian inference
P(θ | trip observations) ∝ P(trip observations | θ) P(θ)
Computing V is costly
Bayesian inference requires a huge number of samples (e.g., 10,000); dynamic Bayesian inference (Imai et al., 2009) can be used
Sequential optimization often outperforms the Bayesian one
Design optimization
Designs can affect site traffic and return visits
We can use data to select better designs
Typically, users are randomized across different designs, data are collected from them, and criteria such as Click Through Rate or the number of reviews written decide which designs are better
Too many designs
Designs are adapted depending on users’ characteristics, e.g., age, sex, location, time zone ...
When you are close to the business
When you are far from the business
Parameterizing
Designs can be parameterized
Thresholds for when advertisements have certain attributes
Thresholds for when/how to show ads
Parameters for text sizes, the number of ads per site ...
Maximize an objective function defined based on
Click Through Rate (CTR)
Revenue Per Opportunity (RPO)
Number of reviews written ...
The objective function is evaluated only by collecting data from users
Bayesian optimization
A Gaussian process is fitted to the data (objective values evaluated at some sets of parameters)
Maximize the acquisition function to select a new design to test (a new set of parameters)
Update the data
This method can be applied to optimize many parameterized systems
Optimal learning - a case study of myself
Optimal learning
Efficiently collecting information to make decisions
Useful in applications where information is expensive to collect
Bayes’ rule is one of the keys
P(hypothesis | Data) = P(Data | hypothesis) P(hypothesis) / P(Data)
Another example - Learning user preferences in recommender systems
A system that makes personalized recommendations to users based on their browsing history
Amazon, eBay, Google, arXiv.org ...
For long-term users with lots of historical data: a machine learning classifier can be trained
For new users: how can we learn user preferences without sending too many irrelevant recommendations?
Learning user preferences in recommender systems
Idea:
Make a prior assumption on how relevant a newly arriving item is to the user
The assumption can be updated (via Bayes’ rule) as more information is observed
Define a cost function based on whether the items sent turn out to be relevant to the user or not
This yields a dynamic programming problem; solving it gives an optimal policy for sending or not sending a new item (a sketch of the belief update follows)
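A minimal sketch of the Bayesian update step only, assuming a Beta-Bernoulli model of item relevance (an illustrative assumption; the dynamic program built on top of it is omitted):

```python
class RelevanceBelief:
    """Beta(alpha, beta) belief about the probability that items like this
    one are relevant to the user, updated via Bayes' rule."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta  # Beta(1, 1) = uniform prior

    def update(self, relevant: bool) -> None:
        """Posterior after observing whether one sent item was relevant."""
        if relevant:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean(self) -> float:
        """Posterior mean estimate of P(relevant)."""
        return self.alpha / (self.alpha + self.beta)
```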
References
Nocedal, Jorge, and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.
Brochu, Eric, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010).
Imai, Susumu, Neelam Jain, and Andrew Ching. Bayesian estimation of dynamic discrete choice models. Econometrica 77.6 (2009): 1865-1899.
Powell, Warren B., and Peter Frazier. Optimal learning. TutORials in Operations Research (2008): 213-246.
Mai, T. (2015). Dynamic programming approaches for estimating and applying large-scale discrete choice models. Ph.D. thesis, Université de Montréal.
References
Design optimization at Yelp: https://people.orie.cornell.edu/pfrazier/Presentations/2014.11.Cornell.Webinar.pdf
Learning user preferences at arXiv.org: https://people.orie.cornell.edu/pfrazier/Presentations/2014.01.Lancaster.Arxiv.pdf
Thank you for your attention! Questions?