TRANSCRIPT
Agnostically Learning Decision Trees
Parikshit Gopalan, MSR-Silicon Valley, IITB'00
Adam Tauman Kalai, MSR-New England
Adam R. Klivans, UT Austin
[Figure: a decision tree querying X1, X2, X3 with 0/1 leaves.]
Computational Learning
Learning: Predict f from examples.
x, f(x)
f : {0,1}^n → {0,1}
Valiant's Model
x, f(x)
f : {0,1}^n → {0,1}
Assumption: f comes from a nice concept class.
Halfspaces:
[Figure: points labeled + and - separated by a halfspace.]
Valiant's Model
x, f(x)
f : {0,1}^n → {0,1}
Assumption: f comes from a nice concept class.
Decision Trees:
[Figure: a decision tree on X1, X2, X3 with 0/1 leaves.]
The Agnostic Model [Kearns-Schapire-Sellie'94]
x, f(x)
f : {0,1}^n → {0,1}
No assumptions about f.
Learner should do as well as best decision tree.
Decision Trees:
[Figure: a decision tree on X1, X2, X3 with 0/1 leaves.]
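For concreteness, the agnostic goal can be written out as follows (a standard formulation; the symbols opt and ε are my additions, not from the slides): with no assumption on f, given uniform examples (x, f(x)), the learner must output a hypothesis h satisfying

\[
\Pr_x\bigl[h(x)\neq f(x)\bigr] \;\le\; \mathrm{opt} + \epsilon,
\qquad
\mathrm{opt} \;=\; \min_{\text{decision trees } T}\ \Pr_x\bigl[T(x)\neq f(x)\bigr].
\]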
Agnostic Model = Noisy Learning
f : {0,1}^n → {0,1}
Concept: Message. Truth table: Encoding. Function f: Received word.
Coding: Recover the Message.
Learning: Predict f.
[Figure: a decision tree on X1, X2, X3 with 0/1 leaves.]
Uniform Distribution Learning for Decision Trees
Noiseless Setting:
– No queries: n^(log n) time [Ehrenfeucht-Haussler'89].
– With queries: poly(n) [Kushilevitz-Mansour'91].
Reconstruction for sparse real polynomials in the l1 norm.
Agnostic Setting:
Polynomial time, uses queries. [G.-Kalai-Klivans’08]
The Fourier Transform Method
Powerful tool for uniform distribution learning.
Introduced by Linial-Mansour-Nisan.
– Small-depth circuits [Linial-Mansour-Nisan'89]
– DNFs [Jackson'95]
– Decision trees [Kushilevitz-Mansour'94, O'Donnell-Servedio'06, G.-Kalai-Klivans'08]
– Halfspaces, Intersections [Klivans-O'Donnell-Servedio'03, Kalai-Klivans-Mansour-Servedio'05]
– Juntas [Mossel-O'Donnell-Servedio'03]
– Parities [Feldman-G.-Khot-Ponnuswami'06]
The Fourier Polynomial
Let f : {-1,1}^n → {-1,1}. Write f as a polynomial.
– AND: ½ + ½X1 + ½X2 - ½X1X2
– Parity: X1X2
Parity of S ⊆ [n]: χ_S(x) = ∏_{i∈S} X_i
Write f(x) = Σ_S c(S) χ_S(x)
– Σ_S c(S)² = 1.
[Figure: the function f in the standard basis vs. the basis of parities.]
The Fourier Polynomial
c(S)²: the weight of S.
Let f : {-1,1}^n → {-1,1}. Write f as a polynomial.
– AND: ½ + ½X1 + ½X2 - ½X1X2
– Parity: X1X2
Parity of S ⊆ [n]: χ_S(x) = ∏_{i∈S} X_i
Write f(x) = Σ_S c(S) χ_S(x)
– Σ_S c(S)² = 1.
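As a quick sanity check on the expansion above (my own toy code, not from the talk), the snippet below computes c(S) = E_x[f(x)·χ_S(x)] for the two-variable AND, using the convention that -1 encodes True, and confirms both the ½ + ½X1 + ½X2 - ½X1X2 expansion and Σ_S c(S)² = 1.

from itertools import product

# Toy check (not from the talk): Fourier coefficients of 2-bit AND,
# with the convention that -1 encodes True, so AND(x1, x2) = max(x1, x2).
def AND(x):
    return -1 if x[0] == -1 and x[1] == -1 else 1

def chi(S, x):
    # Parity chi_S(x) = product of x_i over i in S.
    p = 1
    for i in S:
        p *= x[i]
    return p

n = 2
subsets = [(), (0,), (1,), (0, 1)]
inputs = list(product([-1, 1], repeat=n))

# c(S) = E_x[ f(x) * chi_S(x) ]
coeffs = {S: sum(AND(x) * chi(S, x) for x in inputs) / len(inputs) for S in subsets}
print(coeffs)                                # {(): 0.5, (0,): 0.5, (1,): 0.5, (0, 1): -0.5}
print(sum(c * c for c in coeffs.values()))   # Parseval: 1.0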
Low Degree Functions
Low-degree functions: Most of the weight lies on small subsets.
Halfspaces, small-depth circuits.
Low-degree algorithm [Linial-Mansour-Nisan]: finds the low-degree Fourier coefficients.
Least Squares Regression: Find a low-degree P minimizing E_x[|P(x) – f(x)|²].
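A minimal sketch of the low-degree approach (my illustration; function names and sample sizes are arbitrary choices): estimate every coefficient c(S) with |S| ≤ d by the empirical average of f(x)·χ_S(x) over random uniform examples, then predict with the sign of the resulting low-degree polynomial.

import random
from itertools import combinations

# Sketch of the low-degree algorithm: estimate c(S) = E[f(x) * chi_S(x)]
# for all |S| <= d from random uniform examples.
def chi(S, x):
    p = 1
    for i in S:
        p *= x[i]
    return p

def low_degree_estimate(f, n, d, samples=20000):
    xs = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(samples)]
    ys = [f(x) for x in xs]
    coeffs = {}
    for k in range(d + 1):
        for S in combinations(range(n), k):
            coeffs[S] = sum(y * chi(S, x) for x, y in zip(xs, ys)) / samples
    return coeffs

def predict(coeffs, x):
    # Sign of the estimated low-degree polynomial.
    val = sum(c * chi(S, x) for S, c in coeffs.items())
    return 1 if val >= 0 else -1

# Example: 5-bit majority; its degree-3 part already predicts it well.
maj = lambda x: 1 if sum(x) > 0 else -1
coeffs = low_degree_estimate(maj, n=5, d=3)
test = [[random.choice([-1, 1]) for _ in range(5)] for _ in range(2000)]
acc = sum(predict(coeffs, x) == maj(x) for x in test) / len(test)
print(f"agreement with majority: {acc:.2f}")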
Sparse Functions
Sparse functions: Most of the weight lies on a few subsets.
Decision trees: t leaves ⇒ O(t) subsets.
Sparse Algorithm [Kushilevitz-Mansour'91].
Sparse l2 Regression:
Find a t-sparse P minimizing E_x[|P(x) – f(x)|²].
Sparse l2 Regression
Sparse functions: Most of the weight lies on a few subsets.
Decision trees: t leaves ⇒ O(t) subsets.
Sparse Algorithm [Kushilevitz-Mansour'91].
Sparse l2 Regression: Find a t-sparse P minimizing E_x[|P(x) – f(x)|²].
Finding large coefficients: Hadamard decoding [Kushilevitz-Mansour'91, Goldreich-Levin'89].
Agnostic Learning via l2 Regression?
[Figure: the target f, with values in {-1,+1}.]
f : {-1,1}^n → {-1,1}
Agnostic Learning via l2 Regression?
[Figure: the target f and the best decision tree, values in {-1,+1}.]
Agnostic Learning via l2 Regression?
[Figure: the target f and the best tree, values in {-1,+1}.]
l2 Regression:
Loss |P(x) – f(x)|².
Pay 1 for indecision.
Pay 4 for a mistake.
l1 Regression [KKMS'05]:
Loss |P(x) – f(x)|.
Pay 1 for indecision.
Pay 2 for a mistake.
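The 1-versus-4 and 1-versus-2 accounting is just arithmetic; a tiny check (my illustration): indecision means P(x) = 0, a confident mistake means P(x) = -f(x).

f = 1.0   # the true label, in {-1, +1}
for P, label in [(0.0, "indecision"), (-f, "mistake")]:
    print(label, " l2 loss:", (P - f) ** 2, " l1 loss:", abs(P - f))
# indecision  l2 loss: 1.0  l1 loss: 1.0
# mistake     l2 loss: 4.0  l1 loss: 2.0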
Agnostic Learning via l1 Regression?
[Figure: the target f and the best tree, values in {-1,+1}.]
Agnostic Learning via l1 Regression
Thm [KKMS’05]: l1 Regression always gives a good predictor.
l1 regression for low-degree polynomials via Linear Programming (a sketch appears after this slide).
[Figure: the target f vs. the best tree.]
Sparse l1 Regression: Find a t-sparse polynomial P minimizing E_x[|P(x) – f(x)|].
Why is this Harder:
l2 is basis independent, l1 is not.
Don’t know the support of P.
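The KKMS step mentioned above can be set up as a linear program. Below is a minimal sketch of one such formulation (my own setup and variable names, using scipy's linprog; not the talk's code): the variables are the coefficients c(S) over the degree-≤d parities, plus one slack per example bounding |P(x_i) - f(x_i)|.

import numpy as np
from scipy.optimize import linprog
from itertools import combinations

# Sketch: l1 regression over the degree-<=d parity basis via an LP,
# in the spirit of the KKMS approach described above.
def chi_matrix(xs, subsets):
    # A[i, j] = chi_{S_j}(x_i)
    return np.array([[np.prod(x[list(S)]) if S else 1.0 for S in subsets] for x in xs])

def l1_regression(xs, ys, d):
    n = xs.shape[1]
    subsets = [S for k in range(d + 1) for S in combinations(range(n), k)]
    A = chi_matrix(xs, subsets)
    m, k = A.shape
    # Variables: k coefficients c(S), then m slacks e_i >= |P(x_i) - y_i|.
    obj = np.concatenate([np.zeros(k), np.ones(m) / m])
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([ys, -ys])
    bounds = [(None, None)] * k + [(0, None)] * m
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return dict(zip(subsets, res.x[:k]))

# Tiny demo: noisy labels of (x1 AND x2) with -1 encoding True, degree-2 fit.
rng = np.random.default_rng(0)
xs = rng.choice([-1, 1], size=(300, 4))
ys = np.maximum(xs[:, 0], xs[:, 1]) * rng.choice([1, -1], size=300, p=[0.9, 0.1])
coeffs = l1_regression(xs, ys.astype(float), d=2)
print({S: round(c, 2) for S, c in coeffs.items() if abs(c) > 0.1})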
Agnostically Learning Decision Trees
[G.-Kalai-Klivans]: A polynomial-time algorithm for Sparse l1 Regression.
The Gradient-Projection Method
Variables: the c(S)'s.
Constraint: Σ_S |c(S)| ≤ t.
Minimize: E_x|P(x) – f(x)|.
P(x) = Σ_S c(S) χ_S(x)
f(x)
Q(x) = Σ_S d(S) χ_S(x)
L1(P,Q) = Σ_S |c(S) – d(S)|
L2(P,Q) = [Σ_S (c(S) – d(S))²]^(1/2)
The Gradient-Projection Method
Variables: the c(S)'s.
Constraint: Σ_S |c(S)| ≤ t.
Minimize: E_x|P(x) – f(x)|.
Gradient step, then Projection step.
The Gradient-Projection Method
The Gradient
g(x) = sgn[f(x) - P(x)]
P(x) := P(x) + λ·g(x).
Increase P(x) if it is low; decrease P(x) if it is high.
[Figure: f(x) and P(x), values in [-1,+1].]
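In code, the gradient step is a single pointwise update; a toy sketch over a full truth table for small n (my simplification; the step size lam is my choice, not from the talk):

import numpy as np
from itertools import product

# Gradient step of the method above (toy sketch over a full truth table).
def gradient_step(f_vals, P_vals, lam):
    # g(x) = sgn[f(x) - P(x)]; move P toward f pointwise.
    g_vals = np.sign(f_vals - P_vals)
    return P_vals + lam * g_vals

n = 4
xs = np.array(list(product([-1, 1], repeat=n)))
f_vals = np.sign(xs.sum(axis=1) + 0.5)   # some Boolean target (a majority-style function)
P_vals = np.zeros(len(xs))               # start from the zero polynomial
P_vals = gradient_step(f_vals, P_vals, lam=0.1)

# Equivalently, in coefficient space the step adds lam * ghat(S) to each c(S),
# where ghat(S) = E_x[g(x) * chi_S(x)]; the projection step on the next slides
# then restores the L1 constraint on the coefficients.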
The Gradient-Projection Method
[Figure: bar chart of the Fourier spectrum of P.]
Projection onto the L1 Ball
Currently: Σ_S |c(S)| > t. Want: Σ_S |c(S)| ≤ t.
[Figure: bar chart of the Fourier spectrum of P.]
Projection onto the L1 Ball
Below cutoff: Set to 0.
Above cutoff: Subtract the cutoff.
Projection onto the L1 Ball
[Figure: Fourier spectra of P and Proj(P).]
Below cutoff: Set to 0.
Above cutoff: Subtract the cutoff.
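The cutoff rule above is soft-thresholding. Here is a small sketch of the projection onto the L1 ball of radius t (a standard routine; the variable names are mine):

import numpy as np

# Projection of a coefficient vector onto the L1 ball of radius t, implementing
# the "below cutoff: set to 0, above cutoff: subtract" rule from the slides.
def project_l1(c, t):
    if np.abs(c).sum() <= t:
        return c.copy()
    u = np.sort(np.abs(c))[::-1]                 # magnitudes, largest first
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    k = np.max(np.nonzero(u * ks > css - t)[0]) + 1
    cutoff = (css[k - 1] - t) / k
    return np.sign(c) * np.maximum(np.abs(c) - cutoff, 0.0)

c = np.array([0.30, -0.25, 0.20, 0.15, 0.10, 0.05])
p = project_l1(c, t=0.5)
print(p, np.abs(p).sum())                        # L1 norm of the result is exactly 0.5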
Analysis of Gradient-Projection [Zinkevich'03]
Progress measure: squared L2 distance from the optimum P*.
Key Equation:
|P_t – P*|² – |P_{t+1} – P*|² ≥ 2λ·(L(P_t) – L(P*)) – λ²
(Left side: progress made in this step. L(P_t) – L(P*): how suboptimal the current solution is.)
Within ε of optimal in 1/ε² iterations.
A good L2 approximation to P_t suffices.
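A sketch of where the key equation comes from (the standard argument for projected subgradient descent; λ is the step size and the norm is the coefficient-space L2 norm, which by Parseval matches the E_x-based distance). Let P_{t+1} = Proj(P_t + λ g_t) with g_t(x) = sgn[f(x) - P_t(x)], so that -g_t is a subgradient of L(P) = E_x|P(x) - f(x)| at P_t.

\begin{align*}
\|P_{t+1} - P^*\|^2
  &\le \|P_t + \lambda g_t - P^*\|^2
     && \text{($P^*$ is in the ball; projection onto a convex set is non-expansive)}\\
  &= \|P_t - P^*\|^2 + 2\lambda \langle g_t,\, P_t - P^*\rangle + \lambda^2 \|g_t\|^2\\
  &\le \|P_t - P^*\|^2 - 2\lambda\,\bigl(L(P_t) - L(P^*)\bigr) + \lambda^2,
\end{align*}

using convexity of L (which gives ⟨g_t, P_t - P*⟩ ≤ L(P*) - L(P_t)) and ‖g_t‖² = E_x[g_t(x)²] ≤ 1. Rearranging gives the key equation.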
[Figure: the gradient step shown on f(x) and P(x); the projection step shown on the Fourier spectrum of P.]
Gradient: g(x) = sgn[f(x) - P(x)].
Projection.
The Gradient
g(x) = sgn[f(x) - P(x)].
[Figure: f(x) and P(x), values in [-1,+1].]
Compute the sparse approximation g' = KM(g).
Is g' a good L2 approximation to g?
No. Initially g = f.
L2(g, g') can be as large as 1.
Sparse l1 Regression
Variables: the c(S)'s.
Constraint: Σ_S |c(S)| ≤ t.
Minimize: E_x|P(x) – f(x)|.
Approximate Gradient.
Projection Compensates.
KM as l2 Approximation
The KM Algorithm:
Input: g : {-1,1}^n → {-1,1}, and t.
Output: A t-sparse polynomial g' minimizing E_x[|g(x) – g'(x)|²].
Run Time: poly(n, t).
KM as L1 Approximation
The KM Algorithm:
Input: A Boolean function g = Σ_S c(S) χ_S(x).
An error bound ε.
Output: An approximation g' = Σ_S c'(S) χ_S(x) s.t.
|c(S) – c'(S)| ≤ ε for all S ⊆ [n].
Run Time: poly(n, 1/ε).
[Figure: Fourier spectra of g and g' = KM(g).]
KM as L1 Approximation
1) Identify coefficients larger than ε.
2) Estimate via sampling, set the rest to 0.
Only 1/ε² such coefficients.
[Figure: Fourier spectra of g and g' = KM(g).]
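A toy stand-in for the two steps above (my illustration: for small n it finds the large coefficients by brute force, where the real KM algorithm uses membership queries and Hadamard/Goldreich-Levin-style search, and then re-estimates them by sampling):

import random
from itertools import product, combinations

# Toy stand-in for KM: (1) find coefficients larger than eps -- here by brute
# force over all subsets -- then (2) estimate them by sampling, set the rest to 0.
def chi(S, x):
    p = 1
    for i in S:
        p *= x[i]
    return p

def km_like(g, n, eps, samples=5000):
    all_S = [S for k in range(n + 1) for S in combinations(range(n), k)]
    inputs = list(product([-1, 1], repeat=n))
    exact = {S: sum(g(x) * chi(S, x) for x in inputs) / len(inputs) for S in all_S}
    large = [S for S, c in exact.items() if abs(c) > eps]   # at most 1/eps^2 of these
    xs = [random.choice(inputs) for _ in range(samples)]
    return {S: sum(g(x) * chi(S, x) for x in xs) / samples for S in large}

g = lambda x: -1 if x[0] == -1 and x[1] == -1 else 1        # AND of x1, x2 (-1 = True) on 4 bits
print(km_like(g, n=4, eps=0.3))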
Projection Preserves L1 Distance
[Figure: sorted Fourier spectra of P + g and P + g', with the projection cutoffs.]
Both lines stop within ε of each other (else one line would dominate the other).
L1 distance at most 2ε after projection.
Projecting onto the L1 ball does not increase L1 distance.
Sparse l1 Regression
Variables: the c(S)'s.
Constraint: Σ_S |c(S)| ≤ t.
Minimize: E_x|P(x) – f(x)|.
• L∞(P, P') ≤ 2ε (coefficient-wise).
• L1(P, P') ≤ 2t.
• L2(P, P')² ≤ L∞·L1 ≤ 4εt.
Can take ε = 1/t².
Sparse L1 Regression: Find a sparse polynomial P minimizing E_x[|P(x) – f(x)|].
[G.-Kalai-Klivans'08]: Can get within ε of the optimum in poly(t, 1/ε) iterations. An algorithm for Sparse l1 Regression.
First polynomial-time algorithm for Agnostically Learning Sparse Polynomials.
Agnostically Learning Decision Trees
l1 Regression from l2 Regression
Function f : D → [-1,1], orthonormal basis B.
Sparse l2 Regression: Find a t-sparse polynomial P minimizing E_x[|P(x) – f(x)|²].
Sparse l1 Regression: Find a t-sparse polynomial P minimizing E_x[|P(x) – f(x)|].
[G.-Kalai-Klivans'08]: Given a solution to l2 Regression, can solve l1 Regression.
Agnostically Learning DNFs?
Problem: Can we agnostically learn DNFs in polynomial time? (uniform dist. with queries)
Noiseless Setting: Jackson's Harmonic Sieve.
Implies a weak learner for depth-3 circuits.
Beyond current Fourier techniques.