
Nonlinear optimization methods for data-fitting problems

Tien Mai

Canada Excellence Research Chair in Data Science for Real-Time Decision-Making, Polytechnique Montreal

May 18, 2016


Table of Contents

1 Introduction

2 Sequential optimization

3 Bayesian optimization

4 Applications

5 Discussion on optimal learning


Introduction

My bio

2009: B.Sc. Computer Science

2013: M.Sc. Operations Research (nonlinear optimization)

2016: Ph.D. Operations Research (econometric modeling, route choice problem)

April 2016 - : Post-doc Data Science ...


Nonlinear optimization

$\max_{x \in \Omega} f(x)$

$f(x)$ is nonlinear, possibly non-concave, continuous, and either noisy or noise-free (deterministic)

$f(x)$ may be a black-box function (no analytical expression, expensive to evaluate)

Such problems are important when fitting economic or machine learning models, or when making decisions


Sequential optimization

$\max_{x \in \Omega} f(x)$

We generate a sequence $x_0, x_1, \dots$ that converges to the optimal solution $x^*$, such that $f(x_0) \le f(x_1) \le \dots \le f(x^*)$

At iteration $t$, we build a predictive model $m(x_t + p)$ to approximate $f(x)$ around $x_t$; $m(\cdot)$ is cheap to evaluate

Quadratic model

$m(x_t + p) = f(x_t) + p^T \nabla f(x_t) + \frac{1}{2} p^T H_t p$


Sequential optimization

Trust region

Define a trust region $B_t$ around $x_t$
$p_t = \arg\max_{p \,:\, x_t + p \in B_t} m(x_t + p)$
$x_{t+1} = x_t + p_t$ if $p_t$ is good, otherwise reduce $B_t$

Line search

$p_t = \arg\max_{p} m(x_t + p)$
$x_{t+1} = x_t + \alpha_t p_t$, where the step size $\alpha_t$ satisfies some conditions
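A minimal sketch (not from the slides) of the line-search idea: take the steepest-ascent direction and backtrack on the step size until a sufficient-increase (Armijo) condition holds. The toy objective and all constants are illustrative.

```python
import numpy as np

def f(x):                      # toy concave objective: f(x) = -||x - 1||^2
    return -np.sum((x - 1.0) ** 2)

def grad_f(x):
    return -2.0 * (x - 1.0)

def line_search_step(x, alpha0=1.0, c1=1e-4, shrink=0.5):
    p = grad_f(x)                          # ascent direction (steepest ascent)
    alpha = alpha0
    # backtrack until the sufficient-increase (Armijo) condition holds
    while f(x + alpha * p) < f(x) + c1 * alpha * (p @ p):
        alpha *= shrink                    # reduce the step until it is "good"
    return x + alpha * p

x = np.zeros(2)
for t in range(50):
    x = line_search_step(x)
print(x)                                   # approaches the maximizer [1, 1]
```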


Predictive model

$m(x_t + p) = f(x_t) + p^T \nabla f(x_t) + \frac{1}{2} p^T H_t p$

Newton’s method: $H_t$ is the Hessian (matrix of second derivatives)

Quasi-Newton methods: $H_t$ is a Hessian approximation

BFGS, SR1, DFP; BHHH (for maximum likelihood estimation)

Large-scale problems: limited-memory BFGS (L-BFGS), conjugate gradient (see the sketch below)

Derivative-free trust region algorithms
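For the quasi-Newton / L-BFGS point above, a minimal sketch using SciPy's limited-memory BFGS routine; since the routine minimizes, the negated objective and gradient are passed in. The toy objective is my own.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):                       # toy objective to maximize
    return -np.sum((x - 3.0) ** 2)

def grad_f(x):
    return -2.0 * (x - 3.0)

# L-BFGS-B minimizes, so hand it -f and -grad_f to maximize f.
res = minimize(lambda x: -f(x), x0=np.zeros(5),
               jac=lambda x: -grad_f(x), method="L-BFGS-B")
print(res.x)                    # close to the maximizer (3, ..., 3)
```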


Sequential optimization - properties

Under some mild conditions, these algorithms converge after a finite number of iterations

Local solutions

Gradients may be too expensive to obtain, and derivative-free trust region algorithms are limited to a small number of variables (e.g. 20)


Bayesian optimization

Bayes rule

$P(\text{hypothesis} \mid \text{Data}) = \dfrac{P(\text{Data} \mid \text{hypothesis}) \, P(\text{hypothesis})}{P(\text{Data})}$

$P(\text{hypothesis})$ is the prior, $P(\text{hypothesis} \mid \text{Data})$ is the posterior probability given the data

Given Data, we use Bayes rule to infer P(hypothesis|Data)

Global optimization

Targets problems with derivative-free, expensive-to-evaluate cost functions


Bayesian inference

Fitting data with a probabilistic model

Maximum likelihood estimation

$\max_{\theta} P(\text{Data} \mid \theta)$

Bayesian inference

$P(\theta \mid \text{Data}) = \dfrac{P(\text{Data} \mid \theta) \, P(\theta)}{P(\text{Data})} \propto P(\text{Data} \mid \theta) \, P(\theta)$

We can sample $\theta$ with the Metropolis-Hastings algorithm
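A minimal random-walk Metropolis-Hastings sketch for a toy Gaussian model with a Gaussian prior; the model, proposal scale, and sample sizes are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)       # toy observed data

def log_posterior(theta):
    log_lik = -0.5 * np.sum((data - theta) ** 2)     # N(theta, 1) likelihood
    log_prior = -0.5 * theta ** 2 / 100.0            # N(0, 10^2) prior
    return log_lik + log_prior                       # unnormalized: P(Data) dropped

samples, theta = [], 0.0
for _ in range(5000):
    proposal = theta + rng.normal(scale=0.3)         # symmetric random-walk proposal
    log_accept = log_posterior(proposal) - log_posterior(theta)
    if np.log(rng.uniform()) < log_accept:           # accept with prob min(1, ratio)
        theta = proposal
    samples.append(theta)

print(np.mean(samples[1000:]))                       # posterior mean, close to the data mean
```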


Bayesian optimization

$\max_{x \in \Omega} f(x)$

Bayes rule: $P(f \mid \text{Data}) \propto P(\text{Data} \mid f) \, P(f)$

Data $= \{f(x_0), f(x_1), \dots, f(x_n)\}$

Gaussian process prior: $f(x) \sim \mathcal{GP}\big(m(x), \operatorname{cov}(x, x')\big)$
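One way to realize this in code, as a sketch: place a GP prior on $f$ and condition on a handful of evaluations with scikit-learn's GaussianProcessRegressor. The objective, kernel, and evaluation points are assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):                                        # hypothetical black-box objective
    return np.sin(3 * x) + 0.5 * x

X = np.array([[0.2], [0.9], [1.7], [2.5]])       # evaluated points x_0, ..., x_3
y = f(X).ravel()                                 # observed values f(x_i)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
gp.fit(X, y)                                     # conditions the GP prior on the data

X_new = np.linspace(0, 3, 7).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)   # posterior mean and uncertainty
print(np.c_[X_new.ravel(), mean, std])
```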

Bayesian optimization - algorithm

[Slides 12-14: figures only; no text captured in the transcript.]

Acquisition functions

Acquisition functions are defined such that high acquisition values correspond to potentially high values of the objective function

Data $= \{f(x_0), \dots, f(x_n)\}$, $f^* = \max\{f(x_0), \dots, f(x_n)\}$

Probability of improvement

$\mathrm{PI}(x \mid \text{Data}) = P\big(f(x) \ge f^* \mid \text{Data}\big)$

Expected improvement

$\mathrm{EI}(x \mid \text{Data}) = \mathbb{E}_f\big[\max\{0, f(x) - f^*\} \mid \text{Data}\big]$
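Assuming a Gaussian posterior at $x$ (mean $\mu$, standard deviation $\sigma$), as under a GP, expected improvement has a closed form; a small sketch of it for maximization. The numbers in the usage lines are illustrative.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_star):
    # EI(x) = E[max{0, f(x) - f*}] when f(x) ~ N(mu, sigma^2)
    sigma = np.maximum(sigma, 1e-12)          # guard against zero predictive variance
    z = (mu - f_star) / sigma
    return (mu - f_star) * norm.cdf(z) + sigma * norm.pdf(z)

print(expected_improvement(mu=1.2, sigma=0.5, f_star=1.0))   # clearly positive: worth testing
print(expected_improvement(mu=0.2, sigma=0.1, f_star=1.0))   # near zero: unlikely to improve
```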


Algorithm

Data $= \{f(x_0)\}$
For $t = 1, 2, \dots, N$:
    $x_t = \arg\max_x \mathrm{EI}(x \mid \text{Data})$
    Compute $f(x_t)$
    Data $=$ Data $\cup \{f(x_t)\}$, update the Gaussian process
End
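A minimal end-to-end sketch of this loop (my own toy setup, not the presenter's code): fit a GP to the data, pick the next point by scoring expected improvement over a batch of random candidates, evaluate $f$ there, and update the data.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):                                          # hypothetical expensive objective
    return -(x - 0.7) ** 2 + 0.1 * np.sin(20 * x)

def expected_improvement(mu, sigma, f_star):
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - f_star) / sigma
    return (mu - f_star) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(3, 1))                 # initial evaluations: Data = {f(x_0), ...}
y = f(X).ravel()

for t in range(15):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    candidates = rng.uniform(0, 1, size=(2000, 1)) # cheap stand-in for argmax_x EI(x)
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X = np.vstack([X, x_next])                     # Data = Data ∪ {f(x_t)}
    y = np.append(y, f(x_next[0]))

print(X[np.argmax(y)], y.max())                    # best point found, near x = 0.7
```

In a real setting the random-candidate step would be replaced by a proper inner optimization of EI, which is exactly the expensive subproblem noted on the pros-and-cons slide.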


Pros and cons of Bayesian optimization

Powerful tool for machine learning, especially optimal learning

The prior $P(f)$ is critical to efficient Bayesian optimization; Gaussian processes are not always the best choice

Solving $\max_x \mathrm{EI}(x)$ or $\max_x \mathrm{PI}(x)$ is expensive in many cases

Global optimization, but convergence is not guaranteed


Applications

Route choice modeling using parametric Markov decision processes

Design optimization using Bayesian optimization


Route choice modeling with revealed preference data

Data: trip observations given by travelers in real transport networks

Objectives

Assessing travelers’ preferences for route characteristics (e.g. travel time, travel cost)
Predicting the path a traveler would choose to go from one location to another
Traffic simulation

Parametric Markov decision processes with the random utility maximization (RUM) framework


Route choice models

Network consists of links and nodes

Link utilities: $u(a \mid k; \theta) = \theta^T x(a \mid k) + \varepsilon_a$

$x(a \mid k)$ is a vector of attributes of link $a$

Optimal policy at state $k$:

$\max_{a \in A(k)} \big\{ u(a \mid k; \theta) + V(a) \big\}$

$A(k)$ is the set of outgoing links from $k$; $V(a)$ is the expected cost-to-go under the optimal policy from $a$ to the destination

$P(a \mid k; \theta) = P\Big(a \equiv \arg\max_{a' \in A(k)} \big\{ u(a' \mid k; \theta) + V(a') \big\}\Big)$
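A minimal sketch of how $V$ and $P(a \mid k; \theta)$ could be computed on a toy network, assuming i.i.d. extreme-value error terms so that the value function satisfies a logsumexp recursion and the choice probabilities take a logit form (the recursive-logit flavour of this model, not necessarily the presenter's exact formulation). The network, attributes, and $\theta$ are invented for illustration, and states are nodes here for simplicity rather than links.

```python
import numpy as np

# Toy network: A(k) lists outgoing links from node k as (next node, attribute vector x(a|k));
# "D" is the destination.
links = {
    "O": [("A", np.array([1.0, 0.5])), ("B", np.array([2.0, 0.0]))],
    "A": [("D", np.array([1.5, 0.2]))],
    "B": [("D", np.array([0.5, 0.8])), ("A", np.array([0.3, 0.1]))],
    "D": [],
}
theta = np.array([-1.0, -2.0])      # tastes for (travel time, travel cost); negative = disutility

def value_function(theta, n_iter=100):
    V = {k: 0.0 for k in links}      # V(D) = 0 at the destination
    for _ in range(n_iter):          # fixed-point iteration (the costly part on real networks)
        for k in links:
            if not links[k]:
                continue
            vals = [theta @ x + V[a] for a, x in links[k]]
            V[k] = np.log(np.sum(np.exp(vals)))   # logsumexp over outgoing links
    return V

def choice_probs(k, theta, V):
    vals = np.array([theta @ x + V[a] for a, x in links[k]])
    p = np.exp(vals - vals.max())
    return p / p.sum()               # P(a | k; theta) in logit form

V = value_function(theta)
print(choice_probs("O", theta, V))   # probabilities of the two links out of O
```

On real networks the fixed-point system for $V$ is large, which is the "computing $V$ is costly" point on the estimation slide.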


Estimation

Sequential optimization

$\max_{\theta} P(\text{trip observations} \mid \theta)$

Bayesian inference

$P(\theta \mid \text{trip observations}) \propto P(\text{trip observations} \mid \theta) \, P(\theta)$

Computing V is costly

Bayesian inference requires a huge number of samples (e.g. 10,000); dynamic Bayesian inference (Imai et al., 2009) can be used

Sequential optimization often outperforms the Bayesian one


Design optimization

Designs can affect site traffic and return visits

We can use data to select better designs

Typically, companies randomize users across different designs, collect data from users, and look at criteria (e.g. click-through rate, number of reviews written) to decide which designs are better


Too many designs

Designs are adapted depending on users’ characteristics, e.g. age, sex, location, time zone ...

[Example designs: one shown when you are close to the business, another when you are far from the business.]


Parameterizing

Designs can be parameterized

Thresholds for when advertisements have certain attributes

Thresholds for when/how to show ads

Parameters for text size, the number of ads per site ...

Maximize an objective function defined in terms of

Click Through Rate (CTR)

Revenue Per Opportunity (RPO)

Number of reviews written ...

The objective function can be evaluated only by collecting data from users


Bayesian optimization

A Gaussian process is fitted to the data (objective values evaluated at some sets of parameters)
Maximize the acquisition function to select a new design to test (a new set of parameters)
Update the data

This method can be applied to optimize many parameterized systems


Optimal learning - a case study of myself


Optimal learning

Efficiently collecting information to make decisions

Useful in applications where information is expensive to collect

Bayes rule is one of the keys

$P(\text{hypothesis} \mid \text{Data}) = \dfrac{P(\text{Data} \mid \text{hypothesis}) \, P(\text{hypothesis})}{P(\text{Data})}$


Another example - Learning user preferences in recommender systems

A system that makes personalized recommendations to users based on their browsing history

Amazon, Ebay, Google, arXiv.org ...

For long-term users with lots of historical data: a machine learning classifier can be trained

For new users: how can we learn user preferences without sending too many irrelevant recommendations?


Learning user preferences in recommender systems

Idea:

Make a prior assumption about how relevant an arriving item is to the user

The assumption can be updated (via Bayes rule) when more information is observed

Define a cost function based on whether the items sent are relevant to the user or not

This leads to a dynamic programming problem; solving it gives an optimal policy for sending or not sending a new item
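A minimal sketch of the Bayesian-updating part of this idea, simplified to a Beta-Bernoulli prior on item relevance and a myopic send/not-send rule instead of the full dynamic program; all names and numbers are illustrative.

```python
import numpy as np

alpha, beta = 1.0, 1.0            # Beta(1, 1) prior: no initial knowledge about relevance
cost_irrelevant = 1.0             # cost of annoying the user with an irrelevant item
reward_relevant = 2.0             # value of sending a relevant item

rng = np.random.default_rng(0)
true_relevance = 0.3              # unknown probability that an arriving item is relevant

for item in range(200):
    p_relevant = alpha / (alpha + beta)                  # posterior mean of relevance
    expected_gain = p_relevant * reward_relevant - (1 - p_relevant) * cost_irrelevant
    if expected_gain <= 0:        # myopic rule: stop sending once it looks unprofitable
        continue                  # (the full dynamic program would also value learning)
    relevant = rng.uniform() < true_relevance            # observe the user's feedback
    if relevant:                  # Bayes update of the Beta prior
        alpha += 1
    else:
        beta += 1

print(alpha / (alpha + beta))     # posterior estimate of relevance after the feedback
```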


References

Nocedal, Jorge, and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.

Brochu, Eric, Vlad M. Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010).

Imai, Susumu, Neelam Jain, and Andrew Ching. Bayesian estimation of dynamic discrete choice models. Econometrica 77.6 (2009): 1865-1899.

Powell, Warren B., and Peter Frazier. Optimal learning. TutORials in Operations Research (2008): 213-246.

Mai, T. (2015). Dynamic programming approaches for estimating and applying large-scale discrete choice models. Ph.D. thesis, Universite de Montreal.


References

Design optimization at Yelp: https://people.orie.cornell.edu/pfrazier/Presentations/2014.11.Cornell.Webinar.pdf

Learning user preferences at arXiv.org: https://people.orie.cornell.edu/pfrazier/Presentations/2014.01.Lancaster.Arxiv.pdf


Thank you for your attention! Questions?
