Mini-Course 6: Hyperparameter Optimization – Harmonica
Yang Yuan
Computer Science Department, Cornell University
Last time
I Bayesian Optimization [Snoek et al., 2012, Swersky et al., 2013, Snoek et al., 2014, Gardner et al., 2014, Wang et al., 2013].
I Gradient descent [Maclaurin et al., 2015, Fu et al., 2016, Luketina et al., 2015]
I Random Search [Bergstra and Bengio, 2012, Recht, 2016]
I Multi-armed Bandit based algorithms: Hyperband, SuccessiveHalving [Li et al., 2016, Jamieson and Talwalkar, 2016].
I Grid Search
A natural question..
With so many great algorithms for tuning hyperparameters...
Why do we still hire PhD students to do it manually?
Implicit Assumptions
I If f is random noise, no algorithm is better than random search
I Every algorithm needs some assumptions to work
I BO: f can be approximated by the prior distribution
I Hyperband/SH: f gets more accurate as we invest more resources
I GD: the hyperparameter space is smooth, and all the local minima are pretty good
Curse of dimensionality!
I If f is random noise, no algorithm is better than random search
I Every algorithm needs some assumptions to work
I BO: f can be approximated by the prior distribution
I Hyperband/SH: f gets more accurate as we invest more resources
I GD: the hyperparameter space is smooth, and all the local minima are roughly equally good
I None of these work in the general high-dimensional setting
Curse of dimensionality!
I Sample complexity is exponential in the number of variables n
I Hyperband, SH
I random search
I grid search
I Bayesian Optimization is even worse
I Exponential sample complexity as well
I Prior distribution may not suit large n
I How to decrease the dimension?
I Manually select ∼ 10 important variables among all possible variables.
I Only tune the selected variables.
I Not purely Auto-ML.
I How can we do better?
Our assumption
I f is a high-dimensional function on Boolean variables
I Discretize the continuous variables
I Binarize the categorical variables
I f can be approximated by a small decision tree
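The discretize/binarize step above can be sketched in a few lines. The hyperparameter names, grids, and encoding choice (one ±1 indicator per option) are illustrative assumptions of mine, not prescribed by the slides:

```python
# Sketch: map each hyperparameter to a few +-1 variables, so the tuning
# objective becomes a Boolean function f: {-1,1}^n -> [-1,1].
# All names and ranges below are made-up examples.

def binarize_choice(value, options):
    """One +-1 indicator per option (categorical -> Boolean)."""
    return [1 if value == o else -1 for o in options]

def binarize_continuous(value, grid):
    """Discretize a continuous value to the nearest grid point, then encode."""
    nearest = min(grid, key=lambda g: abs(g - value))
    return binarize_choice(nearest, grid)

# e.g. learning rate on a log grid, plus a categorical optimizer choice
x = (binarize_continuous(0.003, [1e-1, 1e-2, 1e-3, 1e-4])
     + binarize_choice("adam", ["sgd", "adam"]))
# x is now a point in {-1, 1}^6 on which f can be evaluated
```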
Kaggle 101: Survival Rate Prediction For Titanic
I Predict whether the passenger will survive based on the following personal data:
I Ticket class (Pclass)
I Sex
I Age
I Number of siblings on board
I Number of parents on board
I Number of children on board
I Ticket number
I Ticket fare
I Cabin letter (Cabin)
I Embarked port (Embarked)
I There are many more related features:
I Hometown
I Occupation
I Native language
I Race
I Can swim or not?
I · · ·
What is a decision tree?
I This simple decision tree on 4 variables gives you ≈ 75% prediction rate at Kaggle
When does a small decision tree approximate f ?
I If f “roughly” depends on a few variables
I We don’t need exact prediction; we only need a good “estimation”
I Some variables are more important than the others
I True for many applications
I Counter example?
I If f is a parity function: f = x1 ⊕ x2 ⊕ · · · ⊕ xn
I Cannot get a good estimation with any 4 variables.
I Need an exponentially large decision tree on these variables
How can we learn a small decision tree?
Step 1 Convert the decision tree into a sparse low degree polynomial in the Fourier basis (well known)
Step 2 Learn the polynomial
Preliminaries
I f : {−1, 1}^n → [−1, 1]
I D is the uniform distribution on {−1, 1}^n
I Two functions are close: E_{x∼D}[(f(x) − g(x))^2] ≤ ε
I Fourier basis: χ_S(x) = ∏_{i∈S} x_i for S ⊆ [n].
I 2^n of them
I complete orthonormal basis for Boolean functions
I each χ_S can be identified with S
I Representation of f under the χ_S basis:

f(x) = ∑_{S⊆[n]} f_S χ_S(x)

I Coefficient f_S ≜ ⟨f, χ_S⟩ = E_{x∼D}[f(x) χ_S(x)].
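These definitions can be computed by brute force for small n; the helper names below are mine, and `max` of two ±1 bits anticipates the max2 example a few slides later:

```python
# Brute-force Fourier coefficients f_S = E_{x~uniform}[f(x) * chi_S(x)]
# of a function on {-1,1}^n (a sketch; only feasible for small n).
from itertools import combinations, product

def chi(S, x):
    """Fourier basis function chi_S(x) = prod_{i in S} x_i."""
    out = 1
    for i in S:
        out *= x[i]
    return out

def fourier_coefficients(f, n):
    """All 2^n coefficients, one per subset S of [n]."""
    points = list(product([-1, 1], repeat=n))
    return {S: sum(f(x) * chi(S, x) for x in points) / len(points)
            for r in range(n + 1) for S in combinations(range(n), r)}

# e.g. the max of two +-1 bits: coefficients 1/2, 1/2, 1/2, -1/2
c = fourier_coefficients(lambda x: max(x), 2)
```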
Preliminaries
I L1 norm: L1(f) = ∑_S |f_S|.
I Sparsity: L0(f) = |{S : f_S ≠ 0}|.
I Parseval’s identity: E_{x∼D}[f(x)^2] = ∑_S f_S^2.
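All three quantities are easy to check numerically for a small Boolean function; a sketch (the random test function is an arbitrary choice of mine):

```python
# Numerical check of L1, L0 and Parseval's identity for an arbitrary
# function f: {-1,1}^3 -> [-1,1] (illustrative sketch).
import math
import random
from itertools import combinations, product

random.seed(0)
n = 3
points = list(product([-1, 1], repeat=n))
table = {x: random.uniform(-1, 1) for x in points}   # f as a truth table

def coeff(S):
    return sum(table[x] * math.prod(x[i] for i in S) for x in points) / len(points)

coeffs = {S: coeff(S) for r in range(n + 1) for S in combinations(range(n), r)}

L1 = sum(abs(c) for c in coeffs.values())             # sum_S |f_S|
L0 = sum(1 for c in coeffs.values() if c != 0)        # number of nonzero terms
mean_square = sum(v * v for v in table.values()) / len(points)   # E[f^2]
# Parseval: mean_square equals the sum of squared coefficients
```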
Examples

max2(+1,+1) = +1    max2(−1,+1) = +1
max2(+1,−1) = +1    max2(−1,−1) = −1

max2(x1, x2) = 1/2 + (1/2) x1 + (1/2) x2 − (1/2) x1 x2

I max2 has L1 = 2, L0 = 4.

Similarly,

Maj3(x1, x2, x3) = (1/2) x1 + (1/2) x2 + (1/2) x3 − (1/2) x1 x2 x3

I Maj3 has L1 = 2, L0 = 4.
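Both expansions can be verified exhaustively; a quick sketch (`maj3` is the ±1 majority function):

```python
# Exhaustively verify the max2 and Maj3 expansions and their L1/L0 values.
from itertools import product

for x1, x2 in product([-1, 1], repeat=2):
    assert 0.5 + 0.5 * x1 + 0.5 * x2 - 0.5 * x1 * x2 == max(x1, x2)

def maj3(x1, x2, x3):
    """Majority of three +-1 bits."""
    return 1 if x1 + x2 + x3 > 0 else -1

for x1, x2, x3 in product([-1, 1], repeat=3):
    assert 0.5 * x1 + 0.5 * x2 + 0.5 * x3 - 0.5 * x1 * x2 * x3 == maj3(x1, x2, x3)

# both functions have four nonzero coefficients of magnitude 1/2
nonzero = [0.5, 0.5, 0.5, -0.5]
L1 = sum(abs(c) for c in nonzero)   # = 2
L0 = len(nonzero)                   # = 4
```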
Examples

f(w1, w2, w3) = 2w1 + 8w1w2 is a 2-sparse degree-2 polynomial.

I 2-sparse means it has 2 terms
I degree 2 means its terms have degree at most 2.

w1   w2   w3    y
 1   −1    1   −6
−1   −1    1    6
 1    1   −1   10

I However, f is not a linear combination of w1, w2, w3.
Examples

Expand the matrix!

w1   w2   w3   w1w2   w1w3   w2w3    y
 1   −1    1    −1      1     −1    −6
−1   −1    1     1     −1     −1     6
 1    1   −1     1     −1     −1    10

I The expanded matrix is the (degree ≤ 2 part of the) Fourier basis
I Now f is a linear combination of the basis
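Over the full cube the expanded columns are orthonormal, so each weight in that linear combination is just an inner product with y. A minimal sketch (using all 8 rows of {−1,1}^3, not just the 3 shown):

```python
# Recover f(w) = 2*w1 + 8*w1*w2 as a linear combination of the expanded
# monomial columns: over all of {-1,1}^3 the columns are orthonormal,
# so each weight is an inner product with y (illustrative sketch).
import math
from itertools import combinations, product

points = list(product([-1, 1], repeat=3))
y = {w: 2 * w[0] + 8 * w[0] * w[1] for w in points}

weights = {}
for r in range(4):
    for S in combinations(range(3), r):      # monomial prod_{i in S} w_i
        weights[S] = sum(y[w] * math.prod(w[i] for i in S)
                         for w in points) / len(points)

# only the w1 column and the w1*w2 column get nonzero weight: 2 and 8
```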
Convert decision tree into sparse low degree polynomial

Theorem
For any decision tree T with s leaf nodes, there exists a degree-log(s/ε), sparsity-(s²/ε) function h that 2ε-approximates T.
Convert decision tree into sparse low degree polynomial

Assume decision tree T has s leaf nodes

Step 1 Truncate T at depth log(s/ε)
I There are 2^{log(s/ε)} = s/ε nodes on this level
I The truncation differs by at most (ε/s) · s = ε fraction, by a union bound.
I So below, assume T has depth at most log(s/ε)
Convert decision tree into sparse low degree polynomial

Step 2 T with s leaves can be represented by f with L1(f) ≤ s and degree log(s/ε)
I A tree with s leaf nodes can be represented as a union of s “AND” terms.
I Every “AND” term has L1 ≤ 1, so L1(f) ≤ s.
I Every “AND” term has at most log(s/ε) variables, so degree at most log(s/ε)
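To see why each “AND” term has L1 ≤ 1: the indicator of a path, e.g. [x1 = +1 AND x2 = −1], equals ∏_i (1 + b_i x_i)/2, whose 2^k coefficients each have magnitude 1/2^k. A sketch (the particular path is a made-up example):

```python
# Step 2 in code: an "AND" term such as [x1 = +1 AND x2 = -1] equals
# prod_i (1 + b_i * x_i) / 2, a polynomial with 2^k coefficients of
# magnitude 1/2^k, hence L1 = 1 (illustrative path on n = 3 variables).
import math
from itertools import combinations, product

literals = {0: +1, 1: -1}        # hypothetical path: x1 == +1 and x2 == -1
k = len(literals)

def and_indicator(x):
    """1 if x follows the path, else 0."""
    return int(all(x[i] == b for i, b in literals.items()))

# expansion: the coefficient of chi_S is (prod_{i in S} b_i) / 2^k
coeffs = {S: math.prod(literals[i] for i in S) / 2 ** k
          for r in range(k + 1) for S in combinations(sorted(literals), r)}

for x in product([-1, 1], repeat=3):   # the polynomial matches the indicator
    val = sum(c * math.prod(x[i] for i in S) for S, c in coeffs.items())
    assert val == and_indicator(x)

L1 = sum(abs(c) for c in coeffs.values())   # = 1
```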
Convert decision tree into sparse low degree polynomial

Step 3 For f with L1 ≤ s and degree log(s/ε), there is h with L0 ≤ s²/ε and degree log(s/ε), s.t.

E[(f − h)²] ≤ ε

I Let h include all terms in Λ ≜ {S : |f_S| ≥ ε/L1(f)}
I h has at most L1(f) / (ε/L1(f)) = L1(f)²/ε terms
I By Parseval’s identity, the missing terms have contribution at most (sum of squares)

∑_{S∉Λ} f_S² ≤ max_{S∉Λ} |f_S| · ∑_{S∉Λ} |f_S| ≤ (ε/L1(f)) · L1(f) = ε
How do we learn the polynomial?

Theorem
For any decision tree T with s leaf nodes, there exists a degree-log(s/ε), sparsity-(s²/ε) function h that 2ε-approximates T.

I How do we learn the sparse low degree function h?
I Well studied in Boolean analysis; two classical algorithms:
I KM algorithm [Kushilevitz and Mansour, 1991]
I LMN algorithm [Linial et al., 1993]
KM algorithm
I Recursively prune less promising sets of basis functions, explore the promising sets
I f_α means the sum of all Fourier terms whose index starts with the prefix α.
I f_0 = f_{x2x3} · x2x3 + f_{x2} · x2 + f_{x3} · x3 + f_∅ · 1
I f_1 = f_{x1x2x3} · x1x2x3 + f_{x1x2} · x1x2 + f_{x1x3} · x1x3 + f_{x1} · x1
I f_01 = f_{x2x3} · x2x3 + f_{x2} · x2
I f_11 = f_{x1x2x3} · x1x2x3 + f_{x1x2} · x1x2
I All of these functions are well defined and also satisfy Parseval’s identity
I E[f_11²] = f_{x1x2x3}² + f_{x1x2}²
I At the last level, f_α is equal to one coefficient
I E[f_110²] = f_{x1x2}²
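The prefix recursion above can be sketched as follows. The real KM algorithm estimates E[f_α²] from queries to f; for clarity this toy version computes the weights exactly by brute force, and the sparse test function is a made-up example:

```python
# KM-style search, sketched: recursively extend a 0/1 prefix alpha over the
# variables (bit i says whether x_i is in S), pruning whenever
# E[f_alpha^2] < theta^2.  The real algorithm estimates these weights by
# querying f; here they are computed exactly by brute force (n = 4).
import math
from itertools import combinations, product

n, theta = 4, 0.3
points = list(product([-1, 1], repeat=n))
f = lambda x: 0.8 * x[0] * x[2] - 0.5 * x[1] + 0.1 * x[3]   # made-up sparse f

all_coeffs = {S: sum(f(x) * math.prod(x[i] for i in S) for x in points) / len(points)
              for r in range(n + 1) for S in combinations(range(n), r)}

def weight(prefix):
    """E[f_prefix^2]: total squared mass of the S consistent with the prefix."""
    return sum(c * c for S, c in all_coeffs.items()
               if all((i in S) == bool(b) for i, b in enumerate(prefix)))

def km(prefix=()):
    if weight(prefix) < theta ** 2:
        return []                       # prune: nothing large hides below here
    if len(prefix) == n:                # a full prefix identifies a single S
        return [tuple(i for i, b in enumerate(prefix) if b)]
    return km(prefix + (0,)) + km(prefix + (1,))

big = km()   # the sets S with |f_S| >= theta
```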
KM algorithm
I Threshold θ ≜ ε
I Running time per iteration: O(1/ε⁶)
I O(n L1(f)/ε) iterations.
I Two problems:
I Pretty slow: depends on ε⁻⁷ and n
I Sequential algorithm; cannot query in parallel
LMN algorithm
I Take m uniform random samples of f
I For every S with degree ≤ log(s/ε), estimate f_S using the m samples:

f_S ≜ ⟨f, χ_S⟩ = E_{x∼D}[f(x) χ_S(x)] ≈ (1/m) ∑_{i=1}^m f(x_i) χ_S(x_i)

I By concentration, the estimation is accurate
I Doing this for all such S, we recover the function
I O((s²/ε²) · log n) sample complexity; parallelizable
I Two problems:
I does not work well in practice.
I does not have guarantees in the noisy setting.
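The estimation step can be written in a few lines. The hidden function, the sample size m, and the cutoff 0.25 below are illustrative choices of mine, not the theory's exact bounds:

```python
# LMN in code: estimate every coefficient of degree <= 2 by the empirical
# mean (1/m) * sum_i f(x_i) * chi_S(x_i) over m uniform random samples.
import math
import random
from itertools import combinations

random.seed(1)
n, degree, m = 10, 2, 4000
f = lambda x: 1.0 * x[0] * x[3] - 0.5 * x[7]      # hidden sparse function

samples = [tuple(random.choice([-1, 1]) for _ in range(n)) for _ in range(m)]

estimates = {S: sum(f(x) * math.prod(x[i] for i in S) for x in samples) / m
             for r in range(degree + 1) for S in combinations(range(n), r)}

# keep only coefficients whose estimated magnitude is clearly nonzero
recovered = {S: c for S, c in estimates.items() if abs(c) > 0.25}
```

With m large enough, the two true terms survive the cutoff and every other estimate concentrates near zero.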
Our algorithm: Harmonica

LMN algorithm [Linial et al., 1993] requires
I Running time: O(n^{log(s/ε)})
I Sample complexity: O((s²/ε²) · log n)
I Not improved for more than two decades!

Harmonica
I Running time: O(n^{log(s/ε)})
I Sample complexity: O((s²/ε) · log n)
I 1/ε improvement
I Works in the noisy setting
I Parallelizable
I First “practical” algorithm under the uniform sampling assumption!
I Previously criticized as a useless setting
How do we learn the sparse low-degree polynomial?
The problem involves a few key words:
I Noisy measurements
I Sparse recovery
I Sample efficiency

Compressed sensing!
What is compressed sensing?
I Query: a measurement matrix A ∈ R^{m×N}, with m ≪ N
I Observe: y = Ax + e, where
I A is the matrix we pick to measure with,
I e is the noise,
I x is the unknown vector.
I In general there are infinitely many solutions, so we can’t find x
I If x is sparse, it can be recovered with compressed sensing [Donoho, 2006, Candes et al., 2006]
What is compressed sensing?
I The Lasso algorithm:

min_{x*} { λ‖Ax* − y‖²₂ + ‖x*‖₁ }

I Effect: linear regression with ℓ₁ regularization.
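A minimal numerical sketch of sparse recovery via this objective, solved with ISTA (a standard iterative soft-thresholding method) rather than any particular library; the dimensions, noise level, and regularization weight are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, s = 100, 400, 5                      # m << N measurements, s-sparse x

A = rng.standard_normal((m, N)) / np.sqrt(m)   # random measurement matrix
x_true = np.zeros(N)
x_true[rng.choice(N, s, replace=False)] = rng.standard_normal(s)
e = 0.01 * rng.standard_normal(m)          # bounded noise term
y = A @ x_true + e                         # observations y = Ax + e

def lasso_ista(A, y, lam=0.01, iters=2000):
    """Minimize 0.5*||A x - y||_2^2 + lam*||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - A.T @ (A @ x - y) / L      # gradient step on the smooth part
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of lam*||.||_1
    return x

x_hat = lasso_ista(A, y)
print(np.linalg.norm(x_hat - x_true))      # small: the sparse x is recovered
```

Despite only m = 100 observations of a 400-dimensional vector, the recovery error is on the order of the noise, as the theory predicts.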
A general compressed sensing theorem
I Random orthonormal family:
I ψ₁, · · · , ψ_N are mappings from X = (x₁, · · · , x_d) to R, with

E_{X∼D}[ψ_i(X) · ψ_j(X)] = 1 if i = j, and 0 otherwise.

I The Fourier basis {χ_S} is a random orthonormal family!
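The orthonormality condition for the parity basis can be checked by Monte Carlo under the uniform distribution on {−1, 1}^n; the dimensions and index sets below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 200_000
X = rng.choice([-1, 1], size=(m, n))       # uniform samples from {-1,1}^n

def chi(S, X):
    """Parity character chi_S(x) = prod_{i in S} x_i."""
    return X[:, list(S)].prod(axis=1) if S else np.ones(len(X))

same = np.mean(chi([0, 3], X) * chi([0, 3], X))   # exactly 1: chi_S^2 = 1
diff = np.mean(chi([0, 3], X) * chi([1], X))      # ≈ 0 for S != T
print(same, diff)
```

The cross term is itself a parity of the symmetric difference of the two sets, so its mean vanishes; empirically it sits within about 1/√m of zero.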
A general compressed sensing theorem

Theorem ([Rauhut, 2010])
Given a measurement matrix A ∈ R^{m×N} whose columns come from a random orthonormal family, and a vector y = Ax + e, where x is s-sparse and e is the error term, Lasso finds x* s.t.

‖x − x*‖₂ ≤ c‖e‖₂/√m

with probability 1 − δ, as long as m ≥ O(s log N), where c is a constant.

I In other words, if the error term is bounded, x can be recovered
I If we can show that ‖e‖₂/√m ≤ √ε/c, then
I ‖x − x*‖²₂ ≤ ε
I By Parseval’s identity, f is recovered with ε error!
Main Theorem

Theorem (Main theorem)
Consider a decision tree T with s leaf nodes and n variables. Under uniform sampling, Lasso learns T in time n^{O(log(s/ε))} with sample complexity O(s² log n/ε), with high probability.
Proof for the main theorem
I Convert T into a degree-log(s/ε), sparsity-s polynomial f in the Fourier basis
I Write T = h + g , where

g = ∑_{S: |S|>d} f_S χ_S + ∑_{S: |S|<d, f_S<O(ε)} f_S χ_S

has small value.
I Assume the samples {(z₁, y₁), · · · , (z_m, y_m)} are picked independently
I Then the g(z_i) are independent as well
Proof for the main theorem

Theorem (Multidimensional Chebyshev inequality)
Let e be an m-dimensional random vector with expected value 0 and covariance matrix V . If V is positive definite, then for any real number δ > 0:

Pr(‖e‖₂ > √‖V‖₂ · δ) ≤ m/δ²

I It suffices to show ‖(g(z₁), · · · , g(z_m))‖₂/√m ≤ √ε/c
I E[g(z_i)] = 0, since g contains no constant term
I Var[g(z_i)] = ε/2, so √‖V‖₂ ≤ √(ε/2)
I Setting δ = √(2m), we get Pr(‖e‖₂ > √(εm)) ≤ 1/2
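The Chebyshev bound above can be sanity-checked by simulation; the dimension, covariance, and δ below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
m, trials, delta = 5, 100_000, 3.0

B = rng.standard_normal((m, m))
V = B @ B.T / m                                   # a positive definite covariance
e = rng.multivariate_normal(np.zeros(m), V, size=trials)  # mean-zero vectors

thresh = np.sqrt(np.linalg.norm(V, 2)) * delta    # sqrt(||V||_2) * delta
tail = np.mean(np.linalg.norm(e, axis=1) > thresh)
print(tail, m / delta**2)   # empirical tail probability vs. the bound m/delta^2
```

For Gaussian vectors the empirical tail is far below the Chebyshev bound, which only uses second moments; the inequality itself holds for any mean-zero distribution with covariance V.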
Go over the whole proof
I Objective: learn a decision tree T of size s
I T ≈ a degree-log(s/ε) polynomial h with ‖h‖₁ ≤ s
I h ≈ a degree-log(s/ε), s²/ε-sparse polynomial f
I f captures all the important variables!
I the “top layers” of the decision tree
I no overfitting!
I Lasso learns f via compressed sensing
Heuristic: iterative selections
I A small decision tree is not accurate enough to give good results
I We can only identify ∼5 important monomials

Solution: Multi-stage Lasso
I First, find ∼5 important monomials
I Fix them so as to maximize the sparse linear function
I Rerun Lasso on the remaining variables!
Multi-stage Lasso: how does it work?
We need to stop here: selecting more monomials won’t approximate the function any better.

Multi-stage Lasso: how does it work?
After fixing 5 monomials, sample more configurations and rerun Lasso, selecting 5 more monomials.
Multi-stage Lasso: why does it work?
We assume this subtree can be approximated by a sparse function. Different subtrees can be approximated by different functions, which is much more expressive than one-stage Lasso!
Our algorithm: Harmonica
Step 1 Query (say) 100 random samples of f
Step 2 Expand the samples to include low-degree features
Step 3 Run Lasso; it returns (say) 5 important monomials, and the corresponding important variables
Step 4 Update f by fixing these important variables. Go to Step 1.
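One stage of this loop (sample, expand to low-degree monomial features, run Lasso, read off the heaviest monomials) can be sketched compactly; the toy objective and all sizes are illustrative assumptions, and the Lasso solve uses plain iterative soft-thresholding rather than the authors' implementation:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, m = 20, 300

# Hypothetical objective on {-1,1}^20: sparse in a few low-degree monomials
def f(x):
    return 3*x[0] - 2*x[1]*x[2] + x[3]*x[4] + 0.01*rng.standard_normal()

# Step 1: query random samples
X = rng.choice([-1, 1], size=(m, n))
y = np.array([f(x) for x in X])

# Step 2: expand samples into all monomial features of degree <= 2
subsets = [()] + [(i,) for i in range(n)] + list(combinations(range(n), 2))
Phi = np.column_stack([X[:, list(S)].prod(axis=1) if S else np.ones(m)
                       for S in subsets])

# Step 3: run Lasso (plain ISTA) and read off the heaviest monomials
lam, L = 0.1, np.linalg.norm(Phi, 2) ** 2
w = np.zeros(Phi.shape[1])
for _ in range(3000):
    z = w - Phi.T @ (Phi @ w - y) / L
    w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

top = sorted(range(len(w)), key=lambda j: -abs(w[j]))[:3]
print([subsets[j] for j in top])   # should surface (0,), (1,2), (3,4)
```

Step 4 would then fix the variables appearing in the selected monomials to the signs that maximize the learned sparse linear function, and loop.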
Harmonica: an example
Assume x ∈ {−1, 1}^100, y ∈ R.

1. Query 100 random samples (x1, f (x1)), · · · , (x100, f (x100)).
2. Call Lasso on the expanded feature vectors, which returns 5 important variables:
I x1 = 1, x4 = −1, x3 = 1, x10 = −1, x77 = −1.
3. Update f as f′ = f_{(1,4,3,10,77),(1,−1,1,−1,−1)}:
I for every x, fix its coordinates 1, 4, 3, 10, 77 to (1,−1, 1,−1,−1), then send x to f.
Harmonica: an example
4. Query 100 more random samples (x101, f′(x101)), · · · , (x200, f′(x200)).
5. Call Lasso on the expanded feature vectors, which returns 6 more important variables:
I x2 = −1, x57 = 1, x82 = 1, x13 = −1, x67 = 1, x82 = −1.
6. Update f′ as f′′ = f′_{(2,57,82,13,67,82),(−1,1,1,−1,1,−1)}.
7. Query 100 more random samples (x201, f′′(x201)), · · · , (x300, f′′(x300)).
8. · · ·
9. Obtain f′′′ and run Hyperband/random search/Spearmint on f′′′.
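The restriction update f′ = f_{(idx),(vals)} used in steps 3 and 6 is just coordinate pinning. A minimal sketch (0-based indices, hypothetical objective):

```python
def restrict(f, idx, vals):
    """Return f' with coordinates idx pinned to vals (the update
    f' = f_{(idx),(vals)} from the slides); other coordinates pass through."""
    def f_prime(x):
        x = list(x)
        for i, v in zip(idx, vals):
            x[i] = v
        return f(x)
    return f_prime

# Hypothetical objective: sum of coordinates
f = lambda x: sum(x)
f1 = restrict(f, [0, 3, 2], [1, -1, 1])
print(f1([0, 5, 0, 0]))   # coordinates become [1, 5, 1, -1] -> 6
```

Each stage composes another restriction on top of the last, so later Lasso stages only ever explore the subcube (subtree) selected so far.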
Why does Harmonica work?
I Multi-stage sparse function approximation
I Very expressive
I Accurate sampling inside subtrees
I Never wastes samples in less promising subtrees
I Lasso provably learns a decision tree
I Identifies the important monomials
I Uses compressed sensing techniques
Experimental setting
I Cifar10 with a residual network [He et al., 2016]
I 60 hyperparameters: 39 real, 21 dummy
I 10 machines run in parallel
I Two-stage Lasso with degree-3 features for feature selection:
I Small network: 8 layers, 30 total epochs per trial
I The small network is fast!
I Base algorithm is Hyperband/random search for fine-tuning on the large network:
I 56 layers, 160 total epochs per trial
I Features selected on the small network work well
60 Boolean variables for this task
I Weight initialization
I Optimization Method
I Learning rate
I Learning rate drop
I Momentum
I residual link weight
I Activation layer position
I Convolution bias
I Activation layer type
I Dropout
I Dropout rate
I Batch norm
I Batch norm tuning
I Resnet shortcut type
I Weight decay
I Batch size
I · · ·
I and 21 dummy variables
[Figure: Final test error (%), lower is better, for Best Human Rate, Harmonica 1, Harmonica 2, Harmonica+Random Search, Random Search, Hyperband, and Spearmint; and total running time (GPU days), shorter is better: Harmonica 1: 10.1, Harmonica 2: 3.6, Harmonica+RndS: 8.3, Random Search: 20.0, Hyperband: 17.3, Spearmint: 8.5.]
Harmonica: Optimization Time
[Figure: Total optimization time (s) vs. number of queries (0–500), log scale from 10⁻¹ to 10⁵ s, for Spearmint (n = 60) and Harmonica (n = 30, 60, 100, 200).]
Selected features: matches our experience
Stage Feature Name Weights
1-1 Batch norm 8.05
1-2 Activation 3.47
1-3 Initial learning rate * Initial learning rate 3.12
1-4 Activation * Batch norm -2.55
1-5 Initial learning rate -2.34
2-1 Optimization method -4.22
2-2 Optimization method * Use momentum -3.02
2-3 Resblock first activation 2.80
2-4 Use momentum 2.19
2-5 Resblock 1st activation * Resblock 3rd activation 1.68
3-1 Weight decay parameter -0.49
3-2 Weight decay -0.26
3-3 Initial learning rate * Weight decay 0.23
3-4 Batch norm tuning 0.21
3-5 Weight decay * Weight decay parameter 0.20
Average test error drop
After fixing the features selected in each stage, the average test error drops.
I We are in a better subtree
[Figure: Average test error (%): Uniform Random 60.16, After Stage 1 33.3, After Stage 2 24.33, After Stage 3 21.3.]
Harmonica: benefits
I Scalable in n
I Fast optimization time (running Lasso)
I Parallelizable
I Feature extraction
Conclusion
I Curse of dimensionality
I Multi-stage Lasso on low-degree monomials:
I The multi-stage sparse function is expressive
I It captures correlations between variables
I It queries samples in promising subtrees
I With many important variables fixed, a base algorithm can be called for fine-tuning
I Compressed sensing gives a provable guarantee on recovery:
I The first improvement in sample complexity for decision-tree learning in more than two decades
I The first “practical” decision-tree learning algorithm under uniform sampling
I This is a fairly new area and an important problem
The Last slide..
Thank you for coming to my mini-course!
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13:281–305.
Candes, E. J., Romberg, J., and Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theor., 52(2):489–509.
Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Inf. Theor., 52(4):1289–1306.
Fu, J., Luo, H., Feng, J., Low, K. H., and Chua, T. (2016). DrMAD: Distilling reverse-mode automatic differentiation for optimizing hyperparameters of deep neural networks. CoRR, abs/1601.00917.
Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. (2014). Bayesian optimization with inequality constraints. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pages 937–945.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR, pages 770–778.
Jamieson, K. G. and Talwalkar, A. (2016). Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, pages 240–248.
Kushilevitz, E. and Mansour, Y. (1991). Learning decision trees using the Fourier spectrum. In Proceedings of the Twenty-Third Annual ACM Symposium on Theory of Computing, STOC ’91, pages 455–464.
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2016). Hyperband: A novel bandit-based approach to hyperparameter optimization. ArXiv e-prints.
Linial, N., Mansour, Y., and Nisan, N. (1993). Constant depth circuits, Fourier transform, and learnability. J. ACM, 40(3):607–620.
Luketina, J., Berglund, M., Greff, K., and Raiko, T. (2015). Scalable gradient-based tuning of continuous regularization hyperparameters. CoRR, abs/1511.06727.
Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, ICML’15, pages 2113–2122. JMLR.org.
Rauhut, H. (2010). Compressive sensing and structured random matrices. Theoretical Foundations and Numerical Methods for Sparse Recovery, 9:1–92.
Recht, B. (2016). Embracing the random. http://www.argmin.net/2016/06/23/hyperband/.
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2960–2968.
Snoek, J., Swersky, K., Zemel, R. S., and Adams, R. P. (2014). Input warping for Bayesian optimization of non-stationary functions. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pages 1674–1682.
Swersky, K., Snoek, J., and Adams, R. P. (2013). Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems 26, pages 2004–2012.
Wang, Z., Zoghi, M., Hutter, F., Matheson, D., and de Freitas, N. (2013). Bayesian optimization in high dimensions via random embeddings. In IJCAI 2013, pages 1778–1784.