TRANSCRIPT
Agnostically Learning Decision Trees
Parikshit Gopalan, MSR-Silicon Valley, IITB'00
Adam Tauman Kalai, MSR-New England
Adam R. Klivans, UT Austin
[Figure: a decision tree querying X1, X2, X3 with 0/1 leaves.]
Computational Learning
Learning: Predict f from examples.
x, f(x)
f : {0,1}^n → {0,1}
Valiant's Model
x, f(x)
f : {0,1}^n → {0,1}
Assumption: f comes from a nice concept class.
Halfspaces:
[Figure: points labeled + and - separated by a halfspace.]
Valiant's Model
x, f(x)
f : {0,1}^n → {0,1}
Assumption: f comes from a nice concept class.
Decision Trees:
[Figure: a decision tree on X1, X2, X3 with 0/1 leaves.]
The Agnostic Model [Kearns-Schapire-Sellie'94]
x, f(x)
f : {0,1}^n → {0,1}
No assumptions about f.
Learner should do as well as best decision tree.
Decision Trees:
[Figure: a decision tree on X1, X2, X3 with 0/1 leaves.]
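For concreteness, the agnostic goal can be written out as follows (a standard formulation; the symbols opt and ε are my additions, not from the slides): with no assumption on f, given uniform examples (x, f(x)), the learner must output a hypothesis h satisfying

\[
\Pr_x\bigl[h(x)\neq f(x)\bigr] \;\le\; \mathrm{opt} + \epsilon,
\qquad
\mathrm{opt} \;=\; \min_{\text{decision trees } T}\ \Pr_x\bigl[T(x)\neq f(x)\bigr].
\]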
Agnostic Model = Noisy Learning
f : {0,1}^n → {0,1}
Concept: Message. Truth table: Encoding. Function f: Received word.
Coding: Recover the Message.
Learning: Predict f.
[Figure: a decision tree on X1, X2, X3 with 0/1 leaves.]
Uniform Distribution Learning for Decision Trees
Noiseless Setting:
– No queries: n^(log n) time [Ehrenfeucht-Haussler'89].
– With queries: poly(n) [Kushilevitz-Mansour'91].
Reconstruction for sparse real polynomials in the l1 norm.
Agnostic Setting:
Polynomial time, uses queries. [G.-Kalai-Klivans’08]
The Fourier Transform Method
Powerful tool for uniform distribution learning.
Introduced by Linial-Mansour-Nisan.
– Small-depth circuits [Linial-Mansour-Nisan'89]
– DNFs [Jackson'95]
– Decision trees [Kushilevitz-Mansour'94, O'Donnell-Servedio'06, G.-Kalai-Klivans'08]
– Halfspaces, Intersections [Klivans-O'Donnell-Servedio'03, Kalai-Klivans-Mansour-Servedio'05]
– Juntas [Mossel-O'Donnell-Servedio'03]
– Parities [Feldman-G.-Khot-Ponnuswami'06]
The Fourier Polynomial
Let f : {-1,1}^n → {-1,1}. Write f as a polynomial.
– AND: ½ + ½X1 + ½X2 - ½X1X2
– Parity: X1X2
Parity of S ⊆ [n]: χ_S(x) = ∏_{i∈S} X_i
Write f(x) = Σ_S c(S) χ_S(x)
– Σ_S c(S)² = 1.
[Figure: the function f in the standard basis vs. the basis of parities.]
The Fourier Polynomial
c(S)²: the weight of S.
Let f : {-1,1}^n → {-1,1}. Write f as a polynomial.
– AND: ½ + ½X1 + ½X2 - ½X1X2
– Parity: X1X2
Parity of S ⊆ [n]: χ_S(x) = ∏_{i∈S} X_i
Write f(x) = Σ_S c(S) χ_S(x)
– Σ_S c(S)² = 1.
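As a quick sanity check on the expansion above (my own toy code, not from the talk), the snippet below computes c(S) = E_x[f(x)·χ_S(x)] for the two-variable AND, using the convention that -1 encodes True, and confirms both the ½ + ½X1 + ½X2 - ½X1X2 expansion and Σ_S c(S)² = 1.

from itertools import product

# Toy check (not from the talk): Fourier coefficients of 2-bit AND,
# with the convention that -1 encodes True, so AND(x1, x2) = max(x1, x2).
def AND(x):
    return -1 if x[0] == -1 and x[1] == -1 else 1

def chi(S, x):
    # Parity chi_S(x) = product of x_i over i in S.
    p = 1
    for i in S:
        p *= x[i]
    return p

n = 2
subsets = [(), (0,), (1,), (0, 1)]
inputs = list(product([-1, 1], repeat=n))

# c(S) = E_x[ f(x) * chi_S(x) ]
coeffs = {S: sum(AND(x) * chi(S, x) for x in inputs) / len(inputs) for S in subsets}
print(coeffs)                                # {(): 0.5, (0,): 0.5, (1,): 0.5, (0, 1): -0.5}
print(sum(c * c for c in coeffs.values()))   # Parseval: 1.0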
Low Degree Functions
Low-degree functions: Most of the weight lies on small subsets.
Halfspaces, small-depth circuits.
Low-degree algorithm [Linial-Mansour-Nisan]: finds the low-degree Fourier coefficients.
Least Squares Regression: Find a low-degree P minimizing E_x[|P(x) – f(x)|²].
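A minimal sketch of the low-degree approach (my illustration; function names and sample sizes are arbitrary choices): estimate every coefficient c(S) with |S| ≤ d by the empirical average of f(x)·χ_S(x) over random uniform examples, then predict with the sign of the resulting low-degree polynomial.

import random
from itertools import combinations

# Sketch of the low-degree algorithm: estimate c(S) = E[f(x) * chi_S(x)]
# for all |S| <= d from random uniform examples.
def chi(S, x):
    p = 1
    for i in S:
        p *= x[i]
    return p

def low_degree_estimate(f, n, d, samples=20000):
    xs = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(samples)]
    ys = [f(x) for x in xs]
    coeffs = {}
    for k in range(d + 1):
        for S in combinations(range(n), k):
            coeffs[S] = sum(y * chi(S, x) for x, y in zip(xs, ys)) / samples
    return coeffs

def predict(coeffs, x):
    # Sign of the estimated low-degree polynomial.
    val = sum(c * chi(S, x) for S, c in coeffs.items())
    return 1 if val >= 0 else -1

# Example: 5-bit majority; its degree-3 part already predicts it well.
maj = lambda x: 1 if sum(x) > 0 else -1
coeffs = low_degree_estimate(maj, n=5, d=3)
test = [[random.choice([-1, 1]) for _ in range(5)] for _ in range(2000)]
acc = sum(predict(coeffs, x) == maj(x) for x in test) / len(test)
print(f"agreement with majority: {acc:.2f}")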
Sparse Functions
Sparse functions: Most of the weight lies on a few subsets.
Decision trees: t leaves ⇒ O(t) subsets.
Sparse Algorithm [Kushilevitz-Mansour'91].
Sparse l2 Regression:
Find a t-sparse P minimizing E_x[|P(x) – f(x)|²].
Sparse l2 Regression
Sparse functions: Most of the weight lies on a few subsets.
Decision trees: t leaves ⇒ O(t) subsets.
Sparse Algorithm [Kushilevitz-Mansour'91].
Sparse l2 Regression: Find a t-sparse P minimizing E_x[|P(x) – f(x)|²].
Finding large coefficients: Hadamard decoding [Kushilevitz-Mansour'91, Goldreich-Levin'89].
Agnostic Learning via l2 Regression?
[Figure: the target f, with values in {-1,+1}.]
f : {-1,1}^n → {-1,1}
Agnostic Learning via l2 Regression?
[Figure: the target f and the best decision tree, values in {-1,+1}.]
Agnostic Learning via l2 Regression?
[Figure: the target f and the best tree, values in {-1,+1}.]
l2 Regression:
Loss |P(x) – f(x)|².
Pay 1 for indecision.
Pay 4 for a mistake.
l1 Regression [KKMS'05]:
Loss |P(x) – f(x)|.
Pay 1 for indecision.
Pay 2 for a mistake.
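The 1-versus-4 and 1-versus-2 accounting is just arithmetic; a tiny check (my illustration): indecision means P(x) = 0, a confident mistake means P(x) = -f(x).

f = 1.0   # the true label, in {-1, +1}
for P, label in [(0.0, "indecision"), (-f, "mistake")]:
    print(label, " l2 loss:", (P - f) ** 2, " l1 loss:", abs(P - f))
# indecision  l2 loss: 1.0  l1 loss: 1.0
# mistake     l2 loss: 4.0  l1 loss: 2.0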
Agnostic Learning via l1 Regression?
[Figure: the target f and the best tree, values in {-1,+1}.]
Agnostic Learning via l1 Regression
Thm [KKMS’05]: l1 Regression always gives a good predictor.
l1 regression for low-degree polynomials via Linear Programming (a sketch appears after this slide).
[Figure: the target f vs. the best tree.]
Sparse l1 Regression: Find a t-sparse polynomial P minimizing E_x[|P(x) – f(x)|].
Why is this Harder:
l2 is basis independent, l1 is not.
Don’t know the support of P.
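The KKMS step mentioned above can be set up as a linear program. Below is a minimal sketch of one such formulation (my own setup and variable names, using scipy's linprog; not the talk's code): the variables are the coefficients c(S) over the degree-≤d parities, plus one slack per example bounding |P(x_i) - f(x_i)|.

import numpy as np
from scipy.optimize import linprog
from itertools import combinations

# Sketch: l1 regression over the degree-<=d parity basis via an LP,
# in the spirit of the KKMS approach described above.
def chi_matrix(xs, subsets):
    # A[i, j] = chi_{S_j}(x_i)
    return np.array([[np.prod(x[list(S)]) if S else 1.0 for S in subsets] for x in xs])

def l1_regression(xs, ys, d):
    n = xs.shape[1]
    subsets = [S for k in range(d + 1) for S in combinations(range(n), k)]
    A = chi_matrix(xs, subsets)
    m, k = A.shape
    # Variables: k coefficients c(S), then m slacks e_i >= |P(x_i) - y_i|.
    obj = np.concatenate([np.zeros(k), np.ones(m) / m])
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([ys, -ys])
    bounds = [(None, None)] * k + [(0, None)] * m
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return dict(zip(subsets, res.x[:k]))

# Tiny demo: noisy labels of (x1 AND x2) with -1 encoding True, degree-2 fit.
rng = np.random.default_rng(0)
xs = rng.choice([-1, 1], size=(300, 4))
ys = np.maximum(xs[:, 0], xs[:, 1]) * rng.choice([1, -1], size=300, p=[0.9, 0.1])
coeffs = l1_regression(xs, ys.astype(float), d=2)
print({S: round(c, 2) for S, c in coeffs.items() if abs(c) > 0.1})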
Agnostically Learning Decision Trees
[G.-Kalai-Klivans]: A polynomial-time algorithm for Sparse l1 Regression.
The Gradient-Projection Method
Variables: the c(S)'s.
Constraint: Σ_S |c(S)| ≤ t.
Minimize: E_x|P(x) – f(x)|.
P(x) = Σ_S c(S) χ_S(x)
f(x)
Q(x) = Σ_S d(S) χ_S(x)
L1(P,Q) = Σ_S |c(S) – d(S)|
L2(P,Q) = [Σ_S (c(S) – d(S))²]^(1/2)
The Gradient-Projection Method
Variables: the c(S)'s.
Constraint: Σ_S |c(S)| ≤ t.
Minimize: E_x|P(x) – f(x)|.
Gradient step, then Projection step.
The Gradient-Projection Method
The Gradient
g(x) = sgn[f(x) - P(x)]
P(x) := P(x) + λ·g(x).
Increase P(x) if it is low; decrease P(x) if it is high.
[Figure: f(x) and P(x), values in [-1,+1].]
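In code, the gradient step is a single pointwise update; a toy sketch over a full truth table for small n (my simplification; the step size lam is my choice, not from the talk):

import numpy as np
from itertools import product

# Gradient step of the method above (toy sketch over a full truth table).
def gradient_step(f_vals, P_vals, lam):
    # g(x) = sgn[f(x) - P(x)]; move P toward f pointwise.
    g_vals = np.sign(f_vals - P_vals)
    return P_vals + lam * g_vals

n = 4
xs = np.array(list(product([-1, 1], repeat=n)))
f_vals = np.sign(xs.sum(axis=1) + 0.5)   # some Boolean target (a majority-style function)
P_vals = np.zeros(len(xs))               # start from the zero polynomial
P_vals = gradient_step(f_vals, P_vals, lam=0.1)

# Equivalently, in coefficient space the step adds lam * ghat(S) to each c(S),
# where ghat(S) = E_x[g(x) * chi_S(x)]; the projection step on the next slides
# then restores the L1 constraint on the coefficients.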
The Gradient-Projection Method
[Figure: bar chart of the Fourier spectrum of P.]
Projection onto the L1 Ball
Currently: Σ_S |c(S)| > t. Want: Σ_S |c(S)| ≤ t.
[Figure: bar chart of the Fourier spectrum of P.]
Projection onto the L1 Ball
Below cutoff: Set to 0.
Above cutoff: Subtract the cutoff.
Projection onto the L1 Ball
[Figure: Fourier spectra of P and Proj(P).]
Below cutoff: Set to 0.
Above cutoff: Subtract the cutoff.
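The cutoff rule above is soft-thresholding. Here is a small sketch of the projection onto the L1 ball of radius t (a standard routine; the variable names are mine):

import numpy as np

# Projection of a coefficient vector onto the L1 ball of radius t, implementing
# the "below cutoff: set to 0, above cutoff: subtract" rule from the slides.
def project_l1(c, t):
    if np.abs(c).sum() <= t:
        return c.copy()
    u = np.sort(np.abs(c))[::-1]                 # magnitudes, largest first
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    k = np.max(np.nonzero(u * ks > css - t)[0]) + 1
    cutoff = (css[k - 1] - t) / k
    return np.sign(c) * np.maximum(np.abs(c) - cutoff, 0.0)

c = np.array([0.30, -0.25, 0.20, 0.15, 0.10, 0.05])
p = project_l1(c, t=0.5)
print(p, np.abs(p).sum())                        # L1 norm of the result is exactly 0.5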
Analysis of Gradient-Projection [Zinkevich'03]
Progress measure: squared L2 distance from the optimum P*.
Key Equation:
|P_t – P*|² – |P_{t+1} – P*|² ≥ 2λ·(L(P_t) – L(P*)) – λ²
(Left side: progress made in this step. L(P_t) – L(P*): how suboptimal the current solution is.)
Within ε of optimal in 1/ε² iterations.
A good L2 approximation to P_t suffices.
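A sketch of where the key equation comes from (the standard argument for projected subgradient descent; λ is the step size and the norm is the coefficient-space L2 norm, which by Parseval matches the E_x-based distance). Let P_{t+1} = Proj(P_t + λ g_t) with g_t(x) = sgn[f(x) - P_t(x)], so that -g_t is a subgradient of L(P) = E_x|P(x) - f(x)| at P_t.

\begin{align*}
\|P_{t+1} - P^*\|^2
  &\le \|P_t + \lambda g_t - P^*\|^2
     && \text{($P^*$ is in the ball; projection onto a convex set is non-expansive)}\\
  &= \|P_t - P^*\|^2 + 2\lambda \langle g_t,\, P_t - P^*\rangle + \lambda^2 \|g_t\|^2\\
  &\le \|P_t - P^*\|^2 - 2\lambda\,\bigl(L(P_t) - L(P^*)\bigr) + \lambda^2,
\end{align*}

using convexity of L (which gives ⟨g_t, P_t - P*⟩ ≤ L(P*) - L(P_t)) and ‖g_t‖² = E_x[g_t(x)²] ≤ 1. Rearranging gives the key equation.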
[Figure: the gradient step shown on f(x) and P(x); the projection step shown on the Fourier spectrum of P.]
Gradient: g(x) = sgn[f(x) - P(x)].
Projection.
The Gradient
g(x) = sgn[f(x) - P(x)].
[Figure: f(x) and P(x), values in [-1,+1].]
Compute the sparse approximation g' = KM(g).
Is g' a good L2 approximation to g?
No. Initially g = f.
L2(g, g') can be as large as 1.
Sparse l1 Regression
Variables: the c(S)'s.
Constraint: Σ_S |c(S)| ≤ t.
Minimize: E_x|P(x) – f(x)|.
Approximate Gradient.
Projection Compensates.
KM as l2 Approximation
The KM Algorithm:
Input: g : {-1,1}^n → {-1,1}, and t.
Output: A t-sparse polynomial g' minimizing E_x[|g(x) – g'(x)|²].
Run Time: poly(n, t).
KM as L1 Approximation
The KM Algorithm:
Input: A Boolean function g = Σ_S c(S) χ_S(x).
An error bound ε.
Output: An approximation g' = Σ_S c'(S) χ_S(x) s.t.
|c(S) – c'(S)| ≤ ε for all S ⊆ [n].
Run Time: poly(n, 1/ε).
[Figure: Fourier spectra of g and g' = KM(g).]
KM as L1 Approximation
1) Identify coefficients larger than ε.
2) Estimate via sampling, set the rest to 0.
Only 1/ε² such coefficients.
[Figure: Fourier spectra of g and g' = KM(g).]
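A toy stand-in for the two steps above (my illustration: for small n it finds the large coefficients by brute force, where the real KM algorithm uses membership queries and Hadamard/Goldreich-Levin-style search, and then re-estimates them by sampling):

import random
from itertools import product, combinations

# Toy stand-in for KM: (1) find coefficients larger than eps -- here by brute
# force over all subsets -- then (2) estimate them by sampling, set the rest to 0.
def chi(S, x):
    p = 1
    for i in S:
        p *= x[i]
    return p

def km_like(g, n, eps, samples=5000):
    all_S = [S for k in range(n + 1) for S in combinations(range(n), k)]
    inputs = list(product([-1, 1], repeat=n))
    exact = {S: sum(g(x) * chi(S, x) for x in inputs) / len(inputs) for S in all_S}
    large = [S for S, c in exact.items() if abs(c) > eps]   # at most 1/eps^2 of these
    xs = [random.choice(inputs) for _ in range(samples)]
    return {S: sum(g(x) * chi(S, x) for x in xs) / samples for S in large}

g = lambda x: -1 if x[0] == -1 and x[1] == -1 else 1        # AND of x1, x2 (-1 = True) on 4 bits
print(km_like(g, n=4, eps=0.3))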
Projection Preserves L1 Distance
[Figure: sorted Fourier spectra of P + g and P + g', with the projection cutoffs.]
Both lines stop within ε of each other (else one line would dominate the other).
L1 distance at most 2ε after projection.
Projecting onto the L1 ball does not increase L1 distance.
Sparse l1 Regression
Variables: the c(S)'s.
Constraint: Σ_S |c(S)| ≤ t.
Minimize: E_x|P(x) – f(x)|.
• L∞(P, P') ≤ 2ε (coefficient-wise).
• L1(P, P') ≤ 2t.
• L2(P, P')² ≤ L∞·L1 ≤ 4εt.
Can take ε = 1/t².
Sparse L1 Regression: Find a sparse polynomial P minimizing E_x[|P(x) – f(x)|].
[G.-Kalai-Klivans'08]: Can get within ε of the optimum in poly(t, 1/ε) iterations. An algorithm for Sparse l1 Regression.
First polynomial-time algorithm for Agnostically Learning Sparse Polynomials.
Agnostically Learning Decision Trees
l1 Regression from l2 Regression
Function f : D → [-1,1], orthonormal basis B.
Sparse l2 Regression: Find a t-sparse polynomial P minimizing E_x[|P(x) – f(x)|²].
Sparse l1 Regression: Find a t-sparse polynomial P minimizing E_x[|P(x) – f(x)|].
[G.-Kalai-Klivans'08]: Given a solution to l2 Regression, can solve l1 Regression.
Agnostically Learning DNFs?
Problem: Can we agnostically learn DNFs in polynomial time? (uniform dist. with queries)
Noiseless Setting: Jackson's Harmonic Sieve.
Implies a weak learner for depth-3 circuits.
Beyond current Fourier techniques.