
A Quarter-Century of Efficient Learnability

Rocco Servedio

Columbia University

Valiant 60th Birthday Symposium

Bethesda, Maryland

May 30, 2009

1984

and of course...

Probably Approximately Correct learning [Valiant84]

[Valiant84] presents a range of learning models and oracles; the distribution D models the (possibly complex) world.

• Concept class C of Boolean functions over domain X (typically X = {0,1}^n or R^n)

• Unknown and arbitrary distribution D over X

• Unknown target concept c ∈ C to be learned from examples

Learner has access to i.i.d. draws from D labeled according to c: each example x belongs to X, is drawn i.i.d. from D, and comes with its label c(x).

PAC learning concept class C

Learner’s goal: come up with a hypothesis h that will have high accuracy on future examples.

• For any target function c ∈ C,

• for any distribution D over X,

• with probability ≥ 1 − δ, the learner outputs a hypothesis h that is (1 − ε)-accurate w.r.t. D.

Efficiently

Algorithm must be computationally efficient: it should run in time poly(n, 1/ε, 1/δ).
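In symbols, the two requirements above read as follows (the outer probability is over the learner's sample and internal randomness):

$$
\Pr\Bigl[\ \Pr_{x \sim D}\bigl[h(x) \neq c(x)\bigr] \le \varepsilon\ \Bigr] \;\ge\; 1 - \delta,
\qquad
\text{running time } \le \mathrm{poly}(n,\ 1/\varepsilon,\ 1/\delta).
$$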

From [Valiant84]:

“The results of learnability theory would then indicate the maximum granularity of the single concepts that can be acquired without programming.”

“This paper attempts to explore the limits of what is learnable as allowed by algorithmic complexity…. The identification of these limits is a major goal of the line of work proposed in this paper.”

The PAC model and its variants provide a clean theoretical framework for studying the computational complexity of learning problems.

So, what can be learned efficiently?

In the rest of the 1980s,

Valiant & colleagues gave remarkable results on the abilities and limitations of computationally efficient learning algorithms.

This work introduced research directions and questions that continue to be intensively studied to this day.

25 years of efficient learnability

Rest of talk: survey some

• positive results (algorithms)

• negative results (two flavors of hardness results)

Positive results: learning k-DNF

Theorem [Valiant84]: k-DNF learnable in polynomial time for any k=O(1).

k = 2: a 2-DNF is an OR of ANDs, each AND containing at most 2 literals.

View a k-DNF as a disjunction over “metavariables” (all conjunctions of at most k literals), and learn the disjunction using the elimination algorithm.

25 years later: improving this to super-constant k is still a major open question!

Much has been learned in trying for this improvement…
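As a concrete illustration of the metavariable/elimination idea above, here is a minimal Python sketch (the data format and function name are illustrative assumptions, not from the talk):

```python
from itertools import combinations, product

def learn_k_dnf(examples, n, k):
    """Elimination-style learner for k-DNF via "metavariables" (a sketch).

    examples: list of (x, label) pairs, x a tuple of n bits, label in {0, 1}.
    Every conjunction of at most k literals is treated as a metavariable,
    and the k-DNF is learned as a disjunction over those metavariables.
    """
    # A term is a tuple of (index, sign) pairs: sign=1 means x_i, sign=0 means NOT x_i.
    def satisfies(x, term):
        return all(x[i] == s for i, s in term)

    # Start with all conjunctions of at most k literals -- O(n^k) metavariables.
    terms = set()
    for size in range(1, k + 1):
        for idxs in combinations(range(n), size):
            for signs in product((0, 1), repeat=size):
                terms.add(tuple(zip(idxs, signs)))

    # Eliminate every metavariable that fires on a negative example.
    for x, label in examples:
        if label == 0:
            terms = {t for t in terms if not satisfies(x, t)}

    # Hypothesis: OR of the surviving metavariables.
    return lambda x: any(satisfies(x, t) for t in terms)
```

The running time is n^{O(k)}, which is polynomial only for constant k; this is exactly why pushing beyond k = O(1) is the open question mentioned above.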

Poly-time PAC learning, general distributions

• Decision lists (greedy alg.) [Rivest87]

• Halfspaces (poly-time LP; see the sketch after this list) [Littlestone87, BEHW89, …]

• Parities, integer lattices (Gaussian elim.) [HelmboldSloanWarmuth92, FischerSimon92]

• Restricted types of branching programs (DL + parities) [ErgunKumarRubinfeld95, BshoutyTamonWilson98]

• Geometric concept classes (…random projections…) [BshoutyChenHomer94, BGMST98, Vempala99, …]

and more…
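A quick sketch of the LP route to halfspaces from the list above (my own margin-1 feasibility formulation using scipy; it assumes the labeled sample is linearly separable):

```python
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """Find a halfspace consistent with a linearly separable sample via an LP.

    X: (m, n) array of examples; y: length-m array of labels in {-1, +1}.
    Solves the feasibility LP  y_i * (w . x_i + b) >= 1  for all i
    (zero objective), then returns the resulting threshold function.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = X.shape
    # Variables z = [w_1, ..., w_n, b]; constraint rows: -y_i * [x_i, 1] . z <= -1.
    A_ub = -(y[:, None] * np.hstack([X, np.ones((m, 1))]))
    b_ub = -np.ones(m)
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))
    if not res.success:
        return None  # sample not linearly separable (or solver failure)
    w, b = res.x[:n], res.x[n]
    return lambda x: 1 if float(np.dot(w, x)) + b >= 0 else -1
```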


General-distribution PAC learning, cont.

Quasi-poly / sub-exponential-time learning:

• poly-size decision trees [EhrenfeuchtHaussler89, Blum92]

• poly-size DNF [Bshouty96, TaruiTsukiji99, KlivansS01]

• intersections of few poly(n)-weight halfspaces [KlivansO’DonnellS02]

“PTF method” (halfspaces + metavariables) - link with complexity theory
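A one-line sketch of the PTF method, where d denotes the polynomial threshold function degree (the symbols below are my own shorthand for the idea named on the slide):

$$
\forall f \in C:\; f(x) = \operatorname{sign}\bigl(p_f(x)\bigr),\ \deg(p_f) \le d
\;\Longrightarrow\;
f \text{ is a halfspace over the } O(n^d) \text{ monomial metavariables}
\;\Longrightarrow\;
C \text{ is PAC learnable in time } n^{O(d)}.
$$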

[Figures: a poly-size DNF drawn as an OR of ANDs of literals over x1, …, x7; labeled +/− example points; a small decision tree with internal nodes labeled by variables and leaves labeled ±1.]

Distribution-specific learning

Theorem [KearnsLiValiant87]: monotone Boolean functions can be weakly learned (accuracy 1/2 + Ω(1/n)) in poly time under the uniform distribution on {0,1}^n.

Ushered in the study of algorithms for uniform-distribution and distribution-specific learning: halfspaces [Baum90], DNF [Verbeurgt90, Jackson95], decision trees [KushilevitzMansour93], AC0 [LinialMansourNisan89, FurstJacksonSmith91], extended AC0 [JacksonKlivansS02], juntas [MosselO'DonnellS03], general monotone functions [BshoutyTamon96, BlumBurchLangford98, O'DonnellWimmer09], monotone decision trees [O'DonnellS06], intersections of halfspaces [BlumKannan94, Vempala97, KwekPitt98, KlivansO'DonnellS08], convex sets, much more…

Key tool: Fourier analysis of Boolean functions
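To make “Fourier analysis as a tool” concrete, here is a minimal Python sketch in the spirit of the low-degree algorithm of [LinialMansourNisan89]; the ±1 input convention, function names, and sample format are illustrative assumptions:

```python
from itertools import combinations

def chi(S, x):
    """Parity character chi_S(x) = prod_{i in S} x_i, for x in {-1,+1}^n."""
    p = 1
    for i in S:
        p *= x[i]
    return p

def low_degree_learn(sample, n, d):
    """Low-degree algorithm sketch for uniform-distribution learning.

    sample: list of (x, f(x)) pairs with x uniform over {-1,+1}^n, f(x) in {-1,+1}.
    Estimates hat{f}(S) = E[f(x) * chi_S(x)] for every |S| <= d and returns
    the sign of the resulting degree-d Fourier approximation as the hypothesis.
    """
    m = len(sample)
    coeffs = {}
    for size in range(d + 1):
        for S in combinations(range(n), size):
            coeffs[S] = sum(y * chi(S, x) for x, y in sample) / m

    def hypothesis(x):
        val = sum(c * chi(S, x) for S, c in coeffs.items())
        return 1 if val >= 0 else -1

    return hypothesis
```

Classes whose Fourier spectrum is concentrated on low-degree sets, such as AC0 [LinialMansourNisan89], are learnable this way in quasi-polynomial time under the uniform distribution.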

Recently come full circle on monotone functions:
– [O'DonnellWimmer09]: poly time, with accuracy matching the upper bound of [BlumBurchLangford98]: optimal!

Other variants

After [Valiant84], efficient learning algorithms studied in many settings:

• Learning in the presence of noise: malicious [Valiant85], agnostic [KearnsSchapireSellie93], random misclassification [AngluinLaird87],…

• Related models: Exact learning from queries and counterexamples [Angluin87], Statistical Query Learning [Kearns93], many others…

• PAC-style analyses of unsupervised learning problems: learning discrete distributions [KMRRSS94], learning mixture distributions [Dasgupta99, AroraKannan01, many others…]

• Evolvability framework [Valiant07, Feldman08, …]

Nice algorithmic results in all these settings.

Limits of efficient learnability: is proper learning feasible?

Proper learning:

a learning algorithm for class C must use hypotheses from C itself.

There are efficient proper learning algorithms for conjunctions, disjunctions, halfspaces, decision lists, parities, k-DNF, k-CNF.

• What about k-term DNF – can we learn using k-term DNF as hypotheses?

Proper learning is computationally hard

Theorem [PittValiant87]: If RP ≠ NP, no poly-time algorithm can learn 3-term DNF using 3-term DNF hypotheses.

Given a graph G, the reduction produces a distribution over labeled examples such that

there is a high-accuracy 3-term DNF iff G is 3-colorable.

Note: can learn 3-term DNF in poly time using 3-CNF hypotheses!
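Why the change of representation helps (a standard identity spelling out the note above): by the distributive law,

$$
T_1 \vee T_2 \vee T_3 \;=\; \bigwedge_{\ell_1 \in T_1,\ \ell_2 \in T_2,\ \ell_3 \in T_3} \bigl(\ell_1 \vee \ell_2 \vee \ell_3\bigr),
$$

so every 3-term DNF is also a 3-CNF, and 3-CNF (like any k-CNF) is properly learnable in poly time by eliminating every candidate clause that some positive example falsifies.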

“Often a change of representation can make a difficult learning task easy.”

[Figure: the reduction maps a graph G to a distribution over labeled examples, e.g. positives (011111, +), (101111, +), (110111, +) and negatives (001111, −), (010111, −), (011101, −), …]
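A small Python sketch of the graph-to-examples reduction sketched above (the encoding is the standard one; the function name and output format are my own):

```python
def coloring_to_dnf_examples(n_vertices, edges):
    """[PittValiant87]-style reduction from Graph 3-Coloring to learning 3-term DNF.

    Each vertex v yields a positive example: all-ones except position v.
    Each edge (u, v) yields a negative example: all-ones except positions u and v.
    The sample is consistent with some 3-term DNF iff the graph is 3-colorable
    (one term per color class).
    """
    examples = []
    for v in range(n_vertices):
        x = [1] * n_vertices
        x[v] = 0
        examples.append((tuple(x), '+'))
    for u, v in edges:
        x = [1] * n_vertices
        x[u] = x[v] = 0
        examples.append((tuple(x), '-'))
    return examples

# Example: a triangle on vertices 0, 1, 2 plus isolated vertices 3, 4, 5.
print(coloring_to_dnf_examples(6, [(0, 1), (0, 2), (1, 2)]))
```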

From 1987…

Theorem [PittValiant87]: Learning k-term DNF using (2k-3)-term DNF hypotheses is hard.

Opened the door to a whole range of hardness results of the form: class C is hard to learn using hypotheses from class H.

This work showed computational barriers to learning with restricted representations in general, not just proper learning.

… to 2009

Great progress in recent years using sophisticated machinery from hardness of approximation.

• [ABFKP04]: Hard to learn n-term DNF using n^100-size OR-of-halfspace hypotheses. [Feldman06]: Holds even if the learner can make membership queries to the target function.

• [KhotSaket08]: Hard to (even weakly) learn an intersection of 2 halfspaces using 100 halfspaces as the hypothesis.

If data is corrupted with 1% noise, then:

• [FeldmanGopalanKhotPonnuswami08]: Hard to (even weakly) learn an AND using an AND as hypothesis. Same for halfspaces.

• [GopalanKhotSaket07, Viola08]: Hard to (even weakly) learn a parity even using degree-100 GF(2) polynomials as hypotheses.

Active area with lots of ongoing work.

Representation-Independent Hardness

Suppose there are no hypothesis restrictions: any poly-size circuit OK.

Are there learning problems that are still hard for computational reasons?

Yes. [Valiant84]: Existence of pseudorandom functions [GoldreichGoldwasserMicali84] implies that general Boolean circuits are (representation-independently) hard to learn.

PKC and hardness of learning

Key insight of [KearnsValiant89]: Public-key cryptosystems give hard-to-learn functions.

– The adversary can create labeled examples of the decryption function by herself… so it must not be learnable from labeled examples, or else the cryptosystem would be insecure!

Theorem [KearnsValiant89]: Simple classes of functions – NC1, TC0, poly-size DFAs – are inherently hard to learn.

Theorem [Regev05, KlivansSherstov06]: Really simple functions – poly-size OR of halfspaces – are inherently hard to learn.

Closing the gap: Can these results be extended to show that DNF are inherently hard to learn? Or are DNF efficiently learnable?

Efficient learnability: Model and Results

Thank you, Les!

Valiant

• provided an elegant model for the computational study of learning

• followed this up with foundational results on what is (and isn’t) efficiently learnable

These fundamental questions continue to be intensively studied and cross-fertilize other topics in TCS.