Nonlinear optimization methods for data-fitting problems
Tien Mai
Canada Excellence Research Chair in Data Science for Real-Time Decision-Making, Polytechnique Montreal
May 18, 2016
Table of Contents
1 Introduction
2 Sequential optimization
3 Bayesian optimization
4 Applications
5 Discussion on optimal learning
Introduction
My bio
2009: B.Sc. Computer Science
2013: M.Sc. Operations Research (nonlinear optimization)
2016: Ph.D. Operations Research (econometric modeling, route choice problem)
April 2016 - : Post-doc Data Science ...
Nonlinear optimization
max_{x ∈ Ω} f(x)
f(x) is nonlinear, possibly non-concave, continuous, and either noisy or noise-free (deterministic)
f(x) may be a black-box function (no analytical expression, expensive to evaluate)
Such problems are important when fitting economic or machine learning models, or when making decisions
Sequential optimization
max_{x ∈ Ω} f(x)
We generate a sequence x_0, x_1, ... that converges to the optimal solution x*, such that f(x_0) ≤ f(x_1) ≤ ... ≤ f(x*)
At iteration t, we build a predictive model m(x_t + p) to approximate f(x) around x_t; m(·) is cheap to evaluate
Quadratic model
m(x_t + p) = f(x_t) + p^T ∇f(x_t) + (1/2) p^T H_t p
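As a side note (not on the slides), the quadratic model is indeed cheap to evaluate once f(x_t), ∇f(x_t), and H_t are available; a minimal NumPy sketch:

```python
import numpy as np

def quadratic_model(p, f_val, grad, hess):
    """Evaluate m(x_t + p) = f(x_t) + p^T grad + 0.5 * p^T H_t p."""
    return f_val + grad @ p + 0.5 * p @ (hess @ p)
```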
Sequential optimization
Trust region
Define a trust region B_t around x_t
p_t = argmax_{p : x_t + p ∈ B_t} m(x_t + p)
x_{t+1} = x_t + p_t if p_t is good; otherwise shrink B_t
Line search
p_t = argmax_p m(x_t + p)
x_{t+1} = x_t + α_t p_t, where the step size α_t satisfies some conditions (e.g., the Wolfe conditions)
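A hedged sketch of the line-search update for maximization, using simple Armijo backtracking in place of a full set of step-size conditions (f, grad_f, and the ascent direction p are assumed given):

```python
import numpy as np

def backtracking_step(f, grad_f, x, p, alpha=1.0, shrink=0.5, c=1e-4):
    """Shrink alpha until the Armijo condition for maximization holds:
    f(x + alpha * p) >= f(x) + c * alpha * grad_f(x)^T p."""
    fx, slope = f(x), c * (grad_f(x) @ p)
    while f(x + alpha * p) < fx + alpha * slope and alpha > 1e-12:
        alpha *= shrink
    return x + alpha * p
```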
Predictive model
m(x_t + p) = f(x_t) + p^T ∇f(x_t) + (1/2) p^T H_t p
Newton’s method: H_t is the matrix of second derivatives (the Hessian)
Quasi-Newton: H_t is a Hessian approximation
BFGS, SR1, DFP, BHHH (for maximum likelihood estimation)
Large-scale problems: limited-memory BFGS (L-BFGS), conjugate gradient
Derivative-free trust region algorithms
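For instance (an illustrative sketch, not from the slides), SciPy's L-BFGS implementation maximizes f by minimizing −f:

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective: smooth and concave, with its maximum at x = (1, 2).
def f(x):
    return -((x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2)

# minimize() only minimizes, so negate f to maximize it.
res = minimize(lambda x: -f(x), x0=np.zeros(2), method="L-BFGS-B")
print(res.x)  # approximately [1., 2.]
```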
Sequential optimization - properties
Under some mild conditions, algorithms converge after a finite number of iterations
Local solutions
Gradients may be too expensive to obtain, and derivative-free trust region algorithms are limited to a small number of variables (e.g., 20)
Bayesian optimization
Bayes’ rule
P(hypothesis | Data) = P(Data | hypothesis) P(hypothesis) / P(Data)
P(hypothesis) is the prior; P(hypothesis | Data) is the posterior probability given Data
Given Data, we use Bayes’ rule to infer P(hypothesis | Data)
Global optimization
Problems with derivative-free and expensive cost functions
Bayesian inference
Fitting data with a probabilistic model
Maximum likelihood estimation
max_θ P(Data | θ)
Bayesian inference
P(θ | Data) = P(Data | θ) P(θ) / P(Data) ∝ P(Data | θ) P(θ)
We can sample θ with the Metropolis-Hastings algorithm
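A minimal random-walk Metropolis-Hastings sketch (illustrative; log_post is an assumed callable returning log P(Data | θ) + log P(θ) up to a constant):

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_samples=10000, step=0.1, seed=0):
    """Random-walk MH: propose theta' ~ N(theta, step^2 I) and accept
    with probability min(1, P(theta' | Data) / P(theta | Data))."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(proposal)
        if np.log(rng.random()) < lp_prop - lp:  # accept the proposal
            theta, lp = proposal, lp_prop
        samples.append(theta.copy())
    return np.array(samples)
```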
Bayesian optimization
max_{x ∈ Ω} f(x)
Bayes’ rule: P(f | Data) ∝ P(Data | f) P(f)
Data = {f(x_0), f(x_1), ..., f(x_n)}
Gaussian process prior
f(x) ∼ GP(m(x), cov(x, x′))
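A bare-bones sketch of the GP posterior update (1-D inputs and a zero prior mean for brevity; the squared-exponential covariance and all names here are illustrative assumptions, not from the slides):

```python
import numpy as np

def sq_exp(x1, x2, ell=1.0, sigma=1.0):
    """Squared-exponential covariance cov(x, x')."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and variance at test points Xs, given Data = {f(x_i)};
    assumes prior mean m(x) = 0."""
    K = sq_exp(X, X) + noise * np.eye(len(X))
    Ks, Kss = sq_exp(X, Xs), sq_exp(Xs, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mean, var
```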
Bayesian optimization - algorithm
[Three figure slides illustrating the algorithm.]
Acquisition functions
Acquisition functions are defined so that a high acquisition value corresponds to potentially high values of the objective function
Data = {f(x_0), ..., f(x_n)}, f* = max{f(x_0), ..., f(x_n)}
Probability of improvement
PI(x | Data) = P(f(x) ≥ f* | Data)
Expected improvement
EI(x | Data) = E_f[ max{0, f(x) − f*} | Data ]
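Under a Gaussian posterior with mean μ(x) and standard deviation σ(x), both acquisition functions have well-known closed forms; a sketch (function names are assumptions):

```python
import numpy as np
from scipy.stats import norm

def pi_and_ei(mu, sigma, f_star):
    """Closed-form PI and EI for a Gaussian posterior N(mu, sigma^2)."""
    z = (mu - f_star) / sigma
    pi = norm.cdf(z)                                        # P(f(x) >= f*)
    ei = (mu - f_star) * norm.cdf(z) + sigma * norm.pdf(z)  # E[max(0, f(x) - f*)]
    return pi, ei
```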
Algorithm
Data = {f(x_0)}
For t = 1, 2, ..., N
    x_t = argmax_x EI(x | Data)
    Compute f(x_t)
    Data = Data ∪ {f(x_t)}, update the Gaussian process
End
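Putting the pieces together, a sketch of the loop under assumptions not on the slides: scikit-learn's GP regressor, a dense candidate grid in place of a true inner maximization of EI, and the pi_and_ei helper above:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt(f, lo, hi, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = [rng.uniform(lo, hi)]                   # Data = {f(x_0)}
    y = [f(X[0])]
    grid = np.linspace(lo, hi, 1000)
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True)
        gp.fit(np.array(X)[:, None], y)         # update the Gaussian process
        mu, sigma = gp.predict(grid[:, None], return_std=True)
        _, ei = pi_and_ei(mu, np.maximum(sigma, 1e-9), max(y))
        x_t = grid[np.argmax(ei)]               # x_t = argmax_x EI(x | Data)
        X.append(x_t)
        y.append(f(x_t))                        # Data = Data ∪ {f(x_t)}
    best = int(np.argmax(y))
    return X[best], y[best]
```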
Pros and cons of Bayesian optimization
Powerful tool for machine learning, especially optimal learning
The prior P(f) is critically important for efficient Bayesian optimization; Gaussian processes are not always the best choice
Solving max_x EI(x) or max_x PI(x) is itself expensive in many cases
Global optimization, but convergence is not guaranteed
Applications
Route choice modeling using parametric Markov decision processes
Design optimization using Bayesian optimization
Route choice modeling with revealed preference data
Data: trip observations given by travelers in real transport networks
Objectives
Assessing travelers’ preferences over route characteristics (e.g., travel time, travel cost)
Predicting the path a traveler would choose to go from one location to another
Traffic simulation
Parametric Markov decision processes within the random utility maximization (RUM) framework
Route choice models
Network consists of links and nodes
Link utilities
u(a | k; θ) = θ^T x(a | k) + ε_a
x(a | k) is a vector of attributes of link a
Optimal policy at state k
max_{a ∈ A(k)} { u(a | k; θ) + V(a) }
A(k) is the set of outgoing links from k; V(a) is the expected cost-to-go under the optimal policy from a to the destination
P(a | k; θ) = P(a = argmax_{a′ ∈ A(k)} { u(a′ | k; θ) + V(a′) })
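As an illustrative sketch only: if the ε_a are assumed i.i.d. extreme value (an assumption not stated on the slide, under which the choice probabilities take a multinomial logit form), V can be computed by fixed-point iteration on a logsum recursion:

```python
import numpy as np

def solve_value_function(succ, util, dest, n_iter=200):
    """Iterate V(k) = log sum_{a in succ[k]} exp(util[k, a] + V(a)),
    with V(dest) = 0.  succ[k]: outgoing links from state k (dest must be
    a key of succ, e.g., with an empty list); util[(k, a)]: theta^T x(a|k)."""
    V = {k: 0.0 for k in succ}
    for _ in range(n_iter):
        for k in succ:
            if k != dest and succ[k]:
                V[k] = np.log(sum(np.exp(util[k, a] + V[a]) for a in succ[k]))
    return V

def choice_prob(a, k, V, succ, util):
    """Logit form of P(a | k; theta) implied by the extreme value assumption."""
    weights = {b: np.exp(util[k, b] + V[b]) for b in succ[k]}
    return weights[a] / sum(weights.values())
```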
Estimation
Sequential optimization
max_θ P(trip observations | θ)
Bayesian inference
P(θ | trip observations) ∝ P(trip observations | θ) P(θ)
Computing V is costly
Bayesian inference requires a huge number of samples (e.g., 10,000); dynamic Bayesian inference (Imai et al., 2009) can be used
Sequential optimization often outperforms the Bayesian one
Design optimization
Designs can affect site traffic and return visits
We can use data to select better designs
Typically, users are randomized across different designs, data are collected from them, and criteria such as Click Through Rate or the number of reviews written decide which designs are better
Too many designs
Designs are adapted depending on users’ characteristics, e.g., age, sex, location, time zone ...
When you are close to the business
When you are far from the business
Parameterizing
Designs can be parameterized
Thresholds for when advertisements have certain attributes
Thresholds for when/how to show ads
Parameters for text sizes, the number of ads per site ...
Maximize an objective function defined based on
Click Through Rate (CTR)
Revenue Per Opportunity (RPO)
Number of reviews written ...
The objective function is evaluated only by collecting data from users
Bayesian optimization
A Gaussian process is fitted to the data (objective values evaluated at some sets of parameters)
Maximize the acquisition function to select a new design to test (a new set of parameters)
Update the data
This method can be applied to optimize many parameterized systems
Optimal learning - a case study of myself
Optimal learning
Efficiently collecting information to make decisions
Useful in applications where information is expensive to collect
Bayes’ rule is one of the keys
P(hypothesis | Data) = P(Data | hypothesis) P(hypothesis) / P(Data)
Another example - Learning user preferences in recommender systems
A system that makes personalized recommendations to users based on their browsing history
Amazon, eBay, Google, arXiv.org ...
For long-term users with lots of historical data: a machine learning classifier can be trained
For new users: how can we learn user preferences without sending too many irrelevant recommendations?
Learning user preferences in recommender systems
Idea:
Make a prior assumption on how relevant a newly arriving item is to the user
The assumption can be updated (via Bayes’ rule) as more information is observed
Define a cost function based on whether the items sent turn out to be relevant to the user or not
This yields a dynamic programming problem; solving it gives an optimal policy for sending or not sending a new item (a sketch of the belief update follows)
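A minimal sketch of the Bayesian update step only, assuming a Beta-Bernoulli model of item relevance (an illustrative assumption; the dynamic program built on top of it is omitted):

```python
class RelevanceBelief:
    """Beta(alpha, beta) belief about the probability that items like this
    one are relevant to the user, updated via Bayes' rule."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta  # Beta(1, 1) = uniform prior

    def update(self, relevant: bool) -> None:
        """Posterior after observing whether one sent item was relevant."""
        if relevant:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean(self) -> float:
        """Posterior mean estimate of P(relevant)."""
        return self.alpha / (self.alpha + self.beta)
```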
References
Nocedal, Jorge, and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.
Brochu, Eric, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010).
Imai, Susumu, Neelam Jain, and Andrew Ching. Bayesian estimation of dynamic discrete choice models. Econometrica 77.6 (2009): 1865-1899.
Powell, Warren B., and Peter Frazier. Optimal learning. TutORials in Operations Research (2008): 213-246.
Mai, T. (2015). Dynamic programming approaches for estimating and applying large-scale discrete choice models. Ph.D. thesis, Université de Montréal.
References
Design optimization at Yelp: https://people.orie.cornell.edu/pfrazier/Presentations/2014.11.Cornell.Webinar.pdf
Learning user preferences at arXiv.org: https://people.orie.cornell.edu/pfrazier/Presentations/2014.01.Lancaster.Arxiv.pdf
Thank you for your attention! Questions?