COLT Tutorial

A tutorial on COLT (Computational Learning Theory) for the machine learning discipline.


  • A Tutorial on Computational Learning Theory
    Presented at Genetic Programming 1997

    Stanford University, July 1997

    Vasant Honavar
    Artificial Intelligence Research Laboratory

    Department of Computer Science
    honavar@cs.iastate.edu

    Iowa State University, Ames, Iowa 50011
    http://www.cs.iastate.edu/~honavar/aigroup.html

  • What are learning systems?

    Systems that improve their performance on one or more tasks with experience in their environment

    Examples: Pattern recognizers, adaptive control systems, adaptive intelligent agents, etc.

  • Computational Models of Learning

    Model of the Learner: Computational capabilities, sensors, effectors, knowledge representation, inference mechanisms, prior knowledge, etc.

    Model of the Environment: Tasks to be learned, information sources (teacher, queries, experiments), performance measures

    Key questions: Can a learner with a certain structure learn a specified task in a particular environment? Can the learner do so efficiently? If so, how? If not, why not?

  • Computational Models of Learning

    Theories of Learning: What are they good for?
    Mistake bound model
    Maximum Likelihood model
    PAC (Probably Approximately Correct) model
    Learning from simple examples
    Concluding remarks

  • Theories of Learning: What are they good for?

    To make explicit relevant aspects of the learner and the environment

    To identify easy and hard learning problems (and the precise conditions under which they are easy or hard)

    To guide the design of learning systems

    To shed light on natural learning systems

    To help analyze the performance of learning systems

  • Mistake bound Model

    Example: Given an arbitrary, noise-free sequence of labeled examples ⟨X1, C(X1)⟩, ⟨X2, C(X2)⟩, ... of an unknown binary conjunctive concept C over {0,1}^N, the learner's task is to predict C(X) for a given X.

    Theorem: Exact online learning of conjunctive concepts can be accomplished with at most (N+1) prediction mistakes.

  • Mistake bound model

    Algorithm:
    Initialize L = {X1, ~X1, X2, ~X2, ..., XN, ~XN}
    Predict according to the match between an instance and the conjunction of literals in L
    Whenever a mistake is made on a positive example, drop the offending literals from L

    E.g. (N = 4): the positive example 0111 will result in L = {~X1, X2, X3, X4};
    a subsequent positive example 1110 will yield L = {X2, X3}
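
    As a concrete illustration, the following minimal Python sketch implements this elimination rule (function and variable names are illustrative, not from the tutorial):

      # A sketch of the mistake-bound elimination algorithm for conjunctive
      # concepts over {0,1}^N. Literals are pairs (i, polarity): (i, True)
      # denotes X_{i+1}, (i, False) denotes ~X_{i+1}.

      def make_learner(n):
          L = {(i, pol) for i in range(n) for pol in (True, False)}  # all 2N literals

          def predict(x):
              # The conjunction is satisfied only if every literal in L agrees with x
              return all((x[i] == 1) == pol for (i, pol) in L)

          def update(x, label):
              # On a mistake on a positive example, drop the offending literals
              if label == 1 and not predict(x):
                  for (i, pol) in list(L):
                      if (x[i] == 1) != pol:
                          L.discard((i, pol))

          return predict, update, L

      # Mirroring the example above with N = 4:
      predict, update, L = make_learner(4)
      update([0, 1, 1, 1], 1)   # positive example 0111
      print(sorted(L))          # ~X1, X2, X3, X4 remain
      update([1, 1, 1, 0], 1)   # positive example 1110
      print(sorted(L))          # X2, X3 remain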

  • Mistake bound model

    Proof of Theorem 1:
    No literal in C is ever eliminated from L
    Each mistake eliminates at least one literal from L
    The first mistake eliminates N of the 2N literals
    Conjunctive concepts can be learned with at most (N+1) mistakes

    Conclusion: Conjunctive concepts are easy to learn in the mistake bound model

  • Optimal Mistake Bound Learning Algorithms

    Definition: An optimal mistake bound mbound(C) for a concept class C is the lowest possible mistake bound in the worst case (considering all concepts in C, and all possible sequences of examples).

    Definition: An optimal learning algorithm for a concept class C (in the mistake bound framework) is one that is guaranteed to exactly learn any concept in C, using any noise-free example sequence, with at most O(mbound(C)) mistakes.

    Theorem: mbound(C) ≤ lg|C|

  • The Halving Algorithm

    Definition: The version space V_i = {C ∈ C | C is consistent with the first i examples}

    Definition: The halving algorithm predicts according to the majority of concepts in the current version space; a mistake results in elimination of all the offending concepts from the version space

    Fine print: The halving algorithm may not be efficiently implementable.
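
    For intuition, here is a small Python sketch of the halving algorithm over an explicitly enumerated version space (the concept class and names below are illustrative; explicit enumeration is exactly the inefficiency the fine print refers to):

      from itertools import product

      # Halving algorithm: predict with the majority of the current version space,
      # then eliminate every concept inconsistent with the revealed label.

      def halving(concepts, examples):
          version_space = list(concepts)              # V_0 = C
          mistakes = 0
          for x, label in examples:
              votes = sum(c(x) for c in version_space)
              prediction = 1 if 2 * votes >= len(version_space) else 0
              if prediction != label:
                  mistakes += 1
              version_space = [c for c in version_space if c(x) == label]
          return mistakes, version_space

      # Illustrative concept class: the three "dictator" concepts c_i(x) = x[i]
      # over {0,1}^3, with examples labeled by the hidden target c_1.
      concepts = [lambda x, i=i: x[i] for i in range(3)]
      target = lambda x: x[1]
      examples = [(x, target(x)) for x in product([0, 1], repeat=3)]
      mistakes, V = halving(concepts, examples)
      print(mistakes, len(V))    # at most lg|C| mistakes; only the target concept survives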

  • The Halving Algorithm

    The halving algorithm can be practical if there is a way to compactly represent and efficiently manipulate the version space.

    Question: Are there any efficiently implementable optimal mistake bound learning algorithms?

    Answer: Littlestone's Winnow algorithm for learning monotone disjunctions of at most k of n literals, using the hypothesis class of threshold functions, with O(k lg n) mistakes.
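
    A rough sketch of the multiplicative-update idea behind Winnow follows (the particular threshold and update factor here are illustrative choices, not necessarily the parameters from Littlestone's analysis):

      # Winnow-style learner for monotone disjunctions over {0,1}^n.
      # Multiplicative updates to per-variable weights yield O(k lg n) mistakes
      # for a target disjunction of k literals (under suitable parameter choices).

      def winnow(n, examples, theta=None, alpha=2.0):
          theta = n / 2.0 if theta is None else theta
          w = [1.0] * n                               # one weight per variable
          mistakes = 0
          for x, label in examples:
              score = sum(w[i] for i in range(n) if x[i] == 1)
              prediction = 1 if score >= theta else 0
              if prediction != label:
                  mistakes += 1
                  for i in range(n):
                      if x[i] == 1:
                          # promote on a missed positive, demote on a false positive
                          w[i] = w[i] * alpha if label == 1 else w[i] / alpha
          return w, mistakes

    Run on examples labeled by a sparse disjunction such as X1 ∨ X3, the weights of the relevant variables grow while the others shrink, and the number of mistakes grows only logarithmically with n.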

  • Bounding the prediction error

    The mistake bound model bounds the number of mistakes that the learner will ever make before exactly learning a concept, but not the prediction error after having seen a certain number of examples.

    The mistake bound model assumes that the examples are chosen arbitrarily - in the worst case, by a smart, adversarial teacher. It might often be satisfactory to assume randomly drawn examples.

  • Probably Approximately Correct Learning

    [Diagram: an Oracle draws Samples from the Instance Distribution, labels them with the unknown Concept, and passes the resulting Examples to the Learner]

  • Probably Approximately Correct Learning

    Consider:
    An instance space X
    A concept space C = {C : X → {0,1}}
    A hypothesis space H = {h : X → {0,1}}
    An unknown, arbitrary, not necessarily computable, stationary probability distribution D over the instance space X

  • PAC Learning

    The oracle samples the instance space according to D and provides labeled examples of an unknown concept C to the learner

    The learner is tested on samples drawn from the instance space according to the same probability distribution D

    The learner's task is to output a hypothesis h from H that closely approximates the unknown concept C based on the examples it has encountered

  • PAC Learning

    In the PAC setting, exact learning (zero error approximation) cannot be guaranteed

    In the PAC setting, even approximate learning (with bounded non-zero error) cannot be guaranteed 100% of the time

    Definition: The error of a hypothesis h with respect to a target concept C and an instance distribution D is given by error(h) = Prob_D[C(X) ≠ h(X)]
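
    The error is simply the D-probability of the region where h and C disagree; a small Monte Carlo sketch (with an illustrative distribution, target, and hypothesis of my own choosing) makes the definition concrete:

      import random

      # Estimate error_D(h) = Prob_D[C(X) != h(X)] by sampling from D.

      def estimate_error(h, C, sample_from_D, m=100_000):
          disagreements = 0
          for _ in range(m):
              x = sample_from_D()
              if h(x) != C(x):
                  disagreements += 1
          return disagreements / m

      # Illustrative setup: D uniform over {0,1}^3, target C = X1 AND X2, hypothesis h = X1.
      # h errs exactly when X1 = 1 and X2 = 0, an event of probability 1/4 under D.
      D = lambda: [random.randint(0, 1) for _ in range(3)]
      C = lambda x: x[0] and x[1]
      h = lambda x: x[0]
      print(estimate_error(h, C, D))   # approximately 0.25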

  • PAC Learning

    Definition: A concept class C is said to be PAC-learnable using a hypothesis class H if there exists a learning algorithm L such that for all concepts in C, for all instance distributions D on an instance space X, and for all 0 < ε < 1 and 0 < δ < 1, L, when given access to the Example oracle, produces, with probability at least (1 - δ), a hypothesis h from H with error no more than ε (Valiant, 1984)

  • Efficient PAC Learning

    Definition: C is said to be efficiently PAC-learnable if L runs in time that is polynomial in N (size of the instance representation), size(C) (size of the concept representation), 1/ε, and 1/δ

    Remark: Note that lower error or increased confidence requires more examples.

    Remark: In order for a concept class to be efficiently PAC-learnable, it should be PAC-learnable using a random sample of size polynomial in the relevant parameters.

  • Sample complexity of PAC Learning

    Definition: A consistent learner is one that returns some hypothesis h from the hypothesis class H that is consistent with a random sequence of m examples.

    Remark: A consistent learner is a MAP learner (one that returns a hypothesis that is most likely given the training data) if all hypotheses are a-priori equally likely

    Theorem: A consistent learner is guaranteed to be PAC if the number of samples m > (1/ε)(ln|H| + ln(1/δ))
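
    Plugging numbers into this bound is straightforward; the sketch below (with illustrative parameter values, and using |H| = 3^N for conjunctions over N boolean variables) computes a sufficient sample size:

      from math import ceil, log

      # Sufficient sample size for a consistent learner to be PAC:
      #   m > (1/eps) * (ln|H| + ln(1/delta))

      def pac_sample_size(H_size, eps, delta):
          return ceil((log(H_size) + log(1.0 / delta)) / eps)

      # Illustrative numbers: conjunctions over N = 50 variables (|H| = 3^N),
      # error eps = 0.1, confidence 1 - delta = 0.95.
      print(pac_sample_size(3 ** 50, 0.1, 0.05))   # about 580 examples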

  • Sample Complexity of PAC Learning

    Proof: Consider a hypothesis h that is not a PAC approximation of an unknown concept C. Clearly, the error of h, i.e. the probability that h is wrong on a random instance, is at least ε; equivalently, the probability that h is right on a random instance is at most (1 - ε). The probability of h being consistent with m independently drawn random examples is therefore at most (1 - ε)^m, and the probability that some such bad hypothesis in H is consistent with all m examples is at most |H|(1 - ε)^m. For PAC learning, we want to make sure that this probability of L returning a bad hypothesis is small.
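
    Requiring this failure probability to be at most δ and solving for m recovers the sample-size bound of the preceding theorem; a short derivation of that step (using the standard inequality 1 - ε ≤ e^{-ε}):

      |H|\,(1-\epsilon)^m \;\le\; |H|\,e^{-\epsilon m} \;\le\; \delta
      \quad\text{(using } 1-\epsilon \le e^{-\epsilon}\text{)}

      \ln|H| - \epsilon m \;\le\; \ln\delta
      \quad\Longrightarrow\quad
      m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)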

  • PAC-Easy and PAC-Hard Concept Classes

    Conjunctive concepts are easy to learn:
    Use the same algorithm as the one used in the mistake bound framework
    Sample complexity: O((1/ε)(N ln 3 + ln(1/δ)))
    Time complexity is polynomial in the relevant parameters of interest.

    Remark: Polynomial sample complexity is necessary but not sufficient for efficient PAC learning.

  • PAC-Easy and PAC-Hard Concept Classes

    Theorem: The 3-term DNF concept class (disjunctions of at most 3 conjunctions) is not efficiently PAC-learnable using the same hypothesis class (although it has polynomial sample complexity) unless P=NP.

    Proof: By polynomial time reduction of graph 3-colorability (a well-known NP-complete problem) to the problem of deciding whether a given set of labeled examples is consistent with some 3-term DNF formula.

  • Transforming Hard Problems to Easy ones

    Theorem: 3-term DNF concepts are efficiently PAC-learnable using 3-CNF (conjunctive normal form with at most 3 literals per clause) hypothesis class.

    Proof: Transform each example over N boolean variables into a corresponding example over N³ variables (one for each possible clause in a 3-CNF formula).

    The problem reduces to learning a conjunctive concept over the transformed instance space.

    3-term DNF ⊆ 3-CNF
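
    The reduction is easy to sketch in code: create one derived boolean variable per possible clause of at most 3 literals, set it to 1 exactly when the clause is satisfied, and hand the transformed examples to any conjunctive-concept learner (names and representation below are illustrative):

      from itertools import combinations

      # Sketch of the 3-term DNF -> 3-CNF reduction: one derived variable per
      # clause of at most 3 literals over X1..XN (and their negations).

      def all_clauses(n, k=3):
          literals = [(i, pol) for i in range(n) for pol in (True, False)]
          clauses = []
          for size in range(1, k + 1):
              for combo in combinations(literals, size):
                  if len({i for i, _ in combo}) == size:   # no variable used twice
                      clauses.append(combo)
          return clauses

      def transform(x, clauses):
          # A derived variable is 1 iff its clause is satisfied by x.
          return [int(any((x[i] == 1) == pol for i, pol in combo)) for combo in clauses]

      clauses = all_clauses(4)                 # N = 4 yields O(N^3) derived variables
      print(len(clauses))
      print(transform([1, 0, 1, 0], clauses)[:8])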

  • Transforming Hard Problems to Easy ones

    Theorem: For any k ≥ 2, k-term DNF concepts are efficiently PAC-learnable using the k-CNF hypothesis class.

    Remark: In this case, enlarging the search space by using a hypothesis class that is larger than strictly necessary actually makes the problem easy!

    Remark: No, we have not proved that P=NP.

    Summary:
      Conjunctive   Easy
      k-term DNF    Hard
      k-CNF         Easy
      CNF           Hard

  • Inductive Bias: Occam's Razor

    Occam's razor: Keep it simple, stupid!

    An Occam learning algorithm returns a simple or succinct hypothesis that is consistent with the training