Orthogonal Machine Learning: Power and Limitations

Ilias Zadik¹, joint work with Lester Mackey², Vasilis Syrgkanis²

¹Massachusetts Institute of Technology (MIT) and ²Microsoft Research New England (MSRNE)

35th International Conference on Machine Learning (ICML) 2018

Introduction

Main Application: Pricing a product in the digital economy!

A simple approach: plot Demand against Price and run a Linear Regression.

Introduction

Challenge: What if the bottom points are from the Summer and the upper ones from the Winter? Then we get a new Linear Regression.

In reality, thousands of confounders like seasonality simultaneously affect price and demand. How do we price correctly?

The Partially Linear Regression Problem (PLR)

Definition (Partially Linear Regression (PLR))

Let p ∈ N, θ0 ∈ R, and f0, g0 : R^p → R.

• T ∈ R treatment or policy applied [e.g. price],

• Y ∈ R outcome of interest [e.g. demand],

• X ∈ R^p vector of associated covariates [e.g. seasonality, ...].

Related by

Y = θ0T + f0(X) + ε, E[ε | X, T] = 0 a.s.

T = g0(X) + η, E[η | X] = 0 a.s., Var (η) > 0,

where η, ε represent noise variables.

Goal: Given n i.i.d. samples (Yi, Ti, Xi), i = 1, ..., n, find a √n-consistent, asymptotically normal (√n-a.n.) estimator θ̂0 of θ0, i.e.,

√n(θ̂0 − θ0) → N(0, σ²).
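To make the data-generating process concrete, here is a minimal simulation sketch of the PLR model; the particular choices of f0, g0, the noise laws, and the constants are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, theta0 = 5000, 10, 3.0

# Illustrative nuisance functions (assumptions, not from the slides).
f0 = lambda X: np.sin(X[:, 0]) + X[:, 1] ** 2   # confounding in the outcome
g0 = lambda X: 0.5 * X[:, 0] - 0.25 * X[:, 2]   # confounding in the treatment

X = rng.normal(size=(n, p))              # covariates (e.g. seasonality)
eta = rng.normal(size=n)                 # treatment noise, E[eta | X] = 0
eps = rng.normal(size=n)                 # outcome noise, E[eps | X, T] = 0

T = g0(X) + eta                          # treatment (e.g. price)
Y = theta0 * T + f0(X) + eps             # outcome (e.g. demand)
```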

PLR: Challenges and Main Question

Hence, n samples for learning θ0, but ...

Challenge 1: Besides θ0, we also need to learn f0, g0!

Challenge 2: And we do not want to spend too many samples learning them (more than necessary to estimate θ0)!

Main Question: What is the optimal learning rate for the nuisance functions f0, g0* so that we get a √n-a.n. estimator of θ0?

*The maximum a_n such that ‖f̂0 − f0‖, ‖ĝ0 − g0‖ = o(a_n) suffices.

Literature Review

• Trivial rate: learn f0, g0 at an n^{-1/2} rate.

• [Chernozhukov et al., 2017]: It suffices to learn f0, g0 at an n^{-1/4} rate to construct a √n-a.n. estimator of θ0.

The technique is based on
▸ the Generalized Method of Moments (Z-estimation)
▸ with a “First-Order Orthogonal Moment”.

Z-Estimation for PLR

Choose m such that E[m(Y, T, f0(X), g0(X), θ0) | X] = 0 a.s.

Given 2n samples Zi = (Xi, Ti, Yi):

• (Stage 1) Use samples Zn+1, ..., Z2n to form estimates f̂0, ĝ0 of f0, g0.

• (Stage 2) Use Z1, ..., Zn to find θ̂0 by solving

(1/n) ∑_{t=1}^{n} m(Yt, Tt, f̂0(Xt), ĝ0(Xt), θ̂0) = 0.

[Chernozhukov et al., 2017] suggests a simple first-order orthogonal moment for PLR:

m(Y, T, f(X), g(X), θ) = (Y − θT − f(X)) (T − g(X)).

For this choice, an n^{-1/4} first-stage error suffices!
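A minimal sketch of this two-stage recipe: the code below learns the nuisances by regressing T on X and Y on X on one half of the data (i.e. it works with the reparameterized nuisance q0 = θ0·g0 + f0 = E[Y | X], a switch the slides themselves make later for the 2-orthogonal construction), then solves the empirical moment equation, which is linear in θ, in closed form on the other half. The random-forest first-stage learners are an illustrative choice, not prescribed by the slides.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def first_order_dml(X, T, Y):
    """Two-stage Z-estimation with the first-order orthogonal moment
    (Y - q(X) - theta * (T - g(X))) * (T - g(X)), where q0 = theta0*g0 + f0."""
    n = len(Y) // 2
    # Stage 1: learn the nuisances on the second half of the sample.
    g_hat = RandomForestRegressor(random_state=0).fit(X[n:], T[n:])  # E[T | X]
    q_hat = RandomForestRegressor(random_state=0).fit(X[n:], Y[n:])  # E[Y | X]
    # Stage 2: solve the empirical moment equation on the first half.
    T_res = T[:n] - g_hat.predict(X[:n])   # treatment residual
    Y_res = Y[:n] - q_hat.predict(X[:n])   # outcome residual
    return np.sum(Y_res * T_res) / np.sum(T_res ** 2)

# Usage, with (X, T, Y) generated as in the simulation sketch above:
# theta_hat = first_order_dml(X, T, Y)
```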

Z-Estimation for PLR: Comments

Definition (First-Order Orthogonality)

A moment m : R^p → R is first-order orthogonal with respect to the nuisance function if

E[ ∇_γ m(Y, T, γ, θ0) |_{γ=(f0(X), g0(X))} | X ] = 0.

Intuition: Low sensitivity to first stage errors because of Taylor Expansion!
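As a worked check for the PLR moment above (using only E[η | X] = 0 and E[ε | X, T] = 0 from the model definition):

∇_f m = −(T − g(X)), and E[−(T − g0(X)) | X] = −E[η | X] = 0;

∇_g m = −(Y − θT − f(X)), and E[−(Y − θ0T − f0(X)) | X] = −E[ε | X] = 0,

so both first-order derivatives have zero conditional mean at the true nuisances.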

Question 1: Can we generalize to higher-order orthogonality? Will this improve the first-stage error we can tolerate?

Question 2: Do higher-order orthogonal moments exist for PLR?

Definition: Higher-Order Orthogonality

Let k ∈ N.

Definition (k-Orthogonal Moment)

The moment condition is called k-orthogonal if for any α ∈ N² with α1 + α2 ≤ k:

E[D^α m(Y, T, f0(X), g0(X), θ0) | X] = 0,

where D^α = ∇^{α1}_{γ1} ∇^{α2}_{γ2} and the γi are the coordinates of the nuisance (f0, g0).
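For example, the first-order orthogonal PLR moment m = (Y − θT − f(X))(T − g(X)) is not 2-orthogonal: its mixed second derivative is ∇_f ∇_g m = 1, so E[D^{(1,1)} m | X] = 1 ≠ 0. Higher-order orthogonality therefore requires genuinely new moment constructions.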

Main Result on k-Orthogonality: an n^{-1/(2k+2)} rate suffices!

Theorem (informal)

Let m be a k-orthogonal moment that satisfies certain identifiability and smoothness assumptions. Then, if the Stage 1 error of estimating f0, g0 is o(n^{-1/(2k+2)}), the solution θ̂0 of the Stage 2 equation is a √n-a.n. estimator of θ0.

Comments:

• The existence of a smooth k-orthogonal moment implies that an n^{-1/(2k+2)} nuisance error suffices!

• The proof is based on a careful higher-order Taylor expansion argument.

• The original Theorem deals with a much more general case of GMM than PLR (come to the Poster for details!).

2-orthogonal moment for PLR: A Gaussianity Issue

Question: Can we construct a 2-orthogonal moment for PLR?

Gist of the Result: Yes, if and only if the treatment residual η|X is not normally distributed!

2-orthogonal moment for PLR? Limitations!

Limitation: No, if η|X is normally distributed!

Theorem (informal)

Assume η|X is normally distributed. Then there is no m which is

• 2-orthogonal

• satisfies certain identifiability and smoothness assumptions and,

• the solution of Stage 2 satisfies θ̂0 − θ0 = O_P(1/√n).

The proof is based on Stein’s Lemma, E[q′(Z)] = E[Z q(Z)] for Z ∼ N(0, 1), which allows us to connect 2-orthogonality algebraically with the asymptotic variance of θ̂0!

2-orthogonal moment for PLR? Power!

Power: Yes, if η|X is not normally distributed!

Technical Detail before the Theorem: For our positive result we need to change the nuisance from (f0, g0) to (q0, g0), where q0 = θ0 g0 + f0.

Theorem

Under the PLR model, suppose that we know E[η^r | X], E[η^{r−1} | X] and that E[η^{r+1}] ≠ r E[ E[η²|X] E[η^{r−1}|X] ] for some r ∈ N, so that η|X is not a.s. Gaussian. Then the moments

m(X, Y, T, θ, q(X), g(X)) := (Y − q(X) − θ(T − g(X))) × ( (T − g(X))^r − E[η^r | X] − r (T − g(X)) E[η^{r−1} | X] )

are 2-orthogonal and satisfy identifiability and smoothness assumptions.
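For r = 2 the second factor simplifies (using E[η | X] = 0) to (T − g(X))² − E[η² | X], and the moment is linear in θ, so Stage 2 again has a closed-form solution. Below is a minimal sketch; the random-forest first stage and the crude homoskedastic estimate of E[η² | X] are illustrative assumptions, not the slides' prescription.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def second_order_dml(X, T, Y):
    """Stage 2 with the 2-orthogonal moment for r = 2:
    (Y - q(X) - theta*(T - g(X))) * ((T - g(X))**2 - E[eta^2 | X])."""
    n = len(Y) // 2
    # Stage 1: nuisance estimates q_hat ~ q0 = theta0*g0 + f0 and g_hat ~ g0.
    g_hat = RandomForestRegressor(random_state=0).fit(X[n:], T[n:])
    q_hat = RandomForestRegressor(random_state=0).fit(X[n:], Y[n:])
    eta_hat = T[n:] - g_hat.predict(X[n:])
    sigma2 = np.mean(eta_hat ** 2)          # crude estimate of E[eta^2 | X]
    # Stage 2: the moment is linear in theta, so solve it in closed form.
    T_res = T[:n] - g_hat.predict(X[:n])
    Y_res = Y[:n] - q_hat.predict(X[:n])
    w = T_res ** 2 - sigma2                 # 2-orthogonal "instrument"
    return np.sum(Y_res * w) / np.sum(T_res * w)   # needs E[eta^3] != 0
```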

2-orthogonal moment for PLR? Power (comments)

Comments:

(1) 2-orthogonal moments exist under non-Gaussianity of η|X!

(2) Non-Gaussianity is standard in pricing (random discounts of a baseline price); see the sketch below.

(3) Proof: reverse-engineer the Limitation Theorem.

(4) A more general result in the paper does not require knowing the conditional moments.
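As a quick numerical illustration of comment (2): the discount distribution and probabilities below are made up, purely to show a non-Gaussian treatment residual with E[η³] ≠ 0 (the condition used in the high-dimensional theorem later).

```python
import numpy as np

rng = np.random.default_rng(1)
# Price = baseline minus a random discount applied with probability 0.3,
# so the treatment residual eta takes two values and is far from Gaussian.
eta = np.where(rng.random(100_000) < 0.3, -0.7, 0.3)
eta -= eta.mean()                # center so the empirical mean is 0
print(np.mean(eta ** 3))         # third moment is clearly nonzero (~ -0.084)
```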

PLR with High Dimensional Linear Nuisance Functions

Suppose f0(X) = ⟨X, β0⟩ and g0(X) = ⟨X, γ0⟩ for s-sparse β0, γ0 ∈ R^p.

How large a sparsity s can we tolerate with the suggested methods? (Stage 1 error ⇔ bounds on sparsity.)

The LASSO can learn s-sparse linear f0, g0 with error √(s log p / n). How does this compare to the error we can tolerate?

Literature:

• Trivial rate o(1/√n): no s works.

• First-order orthogonal rate o(n^{-1/4}): s = o(n^{1/2} / log p) works.

PLR with High Dimensional Linear Nuisance Functions

Theorem

Suppose that

• E[η³] ≠ 0,

• X has i.i.d. mean-zero standard Gaussian entries,

• ε, η are almost surely bounded by the known value C,

• and θ0 ∈ [–M, M] for known M.

If

s = o(n^{2/3} / log p),

and in the first stage of estimation we use the LASSO with λn = 2CM √(3 log(p)/n), then, using the 2-orthogonal moment m for r = 2, the solution of the Stage 2 equation is a √n-a.n. estimator of θ0.
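A minimal sketch of this high-dimensional recipe, combining a LASSO first stage with the r = 2 second-order moment; C and M are the bounds from the theorem's assumptions. How the slide's λn maps onto scikit-learn's `alpha` parameter is a scaling-convention assumption, as is the homoskedastic estimate of E[η² | X].

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_plr_second_order(X, T, Y, C, M):
    """LASSO first stage for the s-sparse linear nuisances, then Stage 2
    with the r = 2 2-orthogonal moment (identification needs E[eta^3] != 0)."""
    n = X.shape[0] // 2
    p = X.shape[1]
    lam = 2 * C * M * np.sqrt(3 * np.log(p) / n)   # slide's lambda_n; mapping to
    g_lasso = Lasso(alpha=lam).fit(X[n:], T[n:])   # sklearn's alpha is an assumption
    q_lasso = Lasso(alpha=lam).fit(X[n:], Y[n:])   # learns q0 = theta0*g0 + f0
    sigma2 = np.mean((T[n:] - g_lasso.predict(X[n:])) ** 2)
    T_res = T[:n] - g_lasso.predict(X[:n])
    Y_res = Y[:n] - q_lasso.predict(X[:n])
    w = T_res ** 2 - sigma2
    return np.sum(Y_res * w) / np.sum(T_res * w)
```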

Experiments 1: Fixed Sparsity

We consider s = 100, n = 5000, p = 1000, θ0 = 3.

Figure: Histogram for the first-order orthogonal estimator.

Figure: Histogram for the second-order orthogonal estimator.

First-order orthogonal: the bias is an order of magnitude bigger than the variance!

Experiments 2: Varying Sparsity

We consider n = 5000, p = 1000, θ0 = 3.

Figure: 1st vs 2nd Order Orthogonal: BIAS, STD, MSE, Stage 1 L2-error.

Experiments 3: MSE for Varying n, p, s

Figure: n = 2000, p = 2000.

Figure: n = 2000, p = 5000.

Figure: n = 5000, p = 1000.

Summary

• We introduced the notion of k-orthogonality for GMM. An n^{-1/(2k+2)} first-stage error suffices for such moments to work. [Come to the Poster for the general result!]

• We established that non-normality of η|X is sufficient and necessary for the existence of useful 2-orthogonal moments for PLR.

• We used a 2-orthogonal moment to tolerate o(n^{2/3} / log p) sparsity, much larger than the state-of-the-art tolerance.

• We ran synthetic experiments that support our claims.

Future Work

• How fundamental is the impossibility result when η|X is normally distributed? Can we establish a general lower bound?

• How fundamental is the sparsity barrier o(n^{2/3} / log p)?

• Can we construct useful higher-order orthogonal moments for PLR?

Thank you!!
