# colt tutorial

Post on 12-Dec-2015


DESCRIPTION

A tutorial on COLT (computational learning theory) in the machine learning discipline

TRANSCRIPT

A Tutorial on Computational Learning Theory

Presented at Genetic Programming 1997

Stanford University, July 1997

Vasant Honavar

Artificial Intelligence Research Laboratory

Department of Computer Science

Iowa State University, Ames, Iowa 50011

honavar@cs.iastate.edu

http://www.cs.iastate.edu/~honavar/aigroup.html

What are learning systems?

Systems that improve their performance on one or more tasks with experience in their environment

Examples: Pattern recognizers, adaptive control systems, adaptive intelligent agents, etc.

Computational Models of Learning

Model of the Learner: Computational capabilities, sensors, effectors, knowledge representation, inference mechanisms, prior knowledge, etc.

Model of the Environment: Tasks to be learned, information sources (teacher, queries, experiments), performance measures

Key questions: Can a learner with a certain structure learn a specified task in a particular environment? Can the learner do so efficiently? If so, how? If not, why not?

Computational Models of Learning

Theories of Learning: What are they good for?

Mistake bound model

Maximum Likelihood model

PAC (Probably Approximately Correct) model

Learning from simple examples

Concluding remarks

Theories of Learning: What are they good for?

To make explicit relevant aspects of the learner and the environment

To identify easy and hard learning problems (and the precise conditions under which they are easy or hard)

To guide the design of learning systems

To shed light on natural learning systems

To help analyze the performance of learning systems

Mistake bound Model

Example: Given an arbitrary, noise-free sequence of labeled examples $\langle X_1, C(X_1) \rangle, \langle X_2, C(X_2) \rangle, \ldots$ of an unknown binary conjunctive concept $C$ over $\{0,1\}^N$, the learner's task is to predict $C(X)$ for a given $X$.

Theorem: Exact online learning of conjunctive concepts can be accomplished with at most (N+1) prediction mistakes.

Mistake bound model

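Algorithm

Initialize $L = \{X_1, \lnot X_1, \ldots, X_N, \lnot X_N\}$

Predict according to the match between an instance and the conjunction of literals in $L$

Whenever a mistake is made on a positive example, drop the offending literals from $L$

E.g., with $N = 4$: the positive example $\langle 0111, 1 \rangle$ will result in $L = \{\lnot X_1, X_2, X_3, X_4\}$, and a subsequent positive example $\langle 1110, 1 \rangle$ will yield $L = \{X_2, X_3\}$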
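The elimination algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's own code: the function names, the `(index, value)` encoding of literals, and the two positive examples `0111` and `1110` (chosen to reproduce the literal sets quoted in the text for N = 4) are all assumptions.

```python
from typing import Iterable, Set, Tuple

Literal = Tuple[int, bool]  # (variable index, value the literal requires)


def predict(literals: Set[Literal], x: Tuple[int, ...]) -> int:
    """Return 1 if x satisfies every literal in the conjunction, else 0."""
    return int(all(bool(x[i]) == value for i, value in literals))


def learn_conjunction(examples: Iterable[Tuple[Tuple[int, ...], int]], n: int):
    """Online elimination learner; returns (final literal set, mistake count)."""
    # Start with all 2n literals: X_i and not-X_i for every variable.
    literals: Set[Literal] = {(i, v) for i in range(n) for v in (True, False)}
    mistakes = 0
    for x, label in examples:
        if predict(literals, x) != label:
            mistakes += 1
            # With noise-free data, mistakes happen only on positive examples;
            # drop every literal that the positive example violates.
            if label == 1:
                literals = {(i, v) for i, v in literals if bool(x[i]) == v}
    return literals, mistakes


# Hypothetical stream for N = 4: two positive examples.
stream = [((0, 1, 1, 1), 1), ((1, 1, 1, 0), 1)]
final_literals, n_mistakes = learn_conjunction(stream, n=4)
print(sorted(final_literals), n_mistakes)  # [(1, True), (2, True)] 2
```

The surviving pairs `(1, True)` and `(2, True)` correspond to the literals X2 and X3 (indices are zero-based), and the two mistakes are well within the N + 1 = 5 bound.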

Mistake bound model

Proof of Theorem 1:

No literal in $C$ is ever eliminated from $L$

Each mistake eliminates at least one literal from $L$

The first mistake eliminates $N$ of the $2N$ literals, leaving $N$; each later mistake removes at least one more, so at most $N$ further mistakes are possible

Conjunctive concepts can therefore be learned with at most $(N+1)$ mistakes

Conclusion: Conjunctive concepts are easy to learn in the mistake bound model

Optimal Mistake Bound Learning Algorithms

Definition: An optimal mistake bound $\mathrm{mbound}(\mathcal{C})$ for a concept class $\mathcal{C}$ is the lowest possible mistake bound in the worst case (considering all concepts in $\mathcal{C}$, and all possible sequences of examples).

Definition: An optimal learning algorithm for a concept class $\mathcal{C}$ (in the mistake bound framework) is one that is guaranteed to exactly learn any concept in $\mathcal{C}$, using any noise-free example sequence, with at most $O(\mathrm{mbound}(\mathcal{C}))$ mistakes.

Theorem: $\mathrm{mbound}(\mathcal{C}) \le \lg |\mathcal{C}|$

The Halving Algorithm

Definition: The version space $V_i = \{\, C \in \mathcal{C} \mid C \text{ is consistent with the first } i \text{ examples} \,\}$, where $\mathcal{C}$ is the concept class

Definition: The halving algorithm predicts according to the majority of concepts in the current version space, and a mistake results in elimination of all the offending concepts from the version space

Fine print: The halving algorithm may not be efficiently implementable.
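A brute-force sketch of the halving algorithm makes both the majority vote and the "fine print" concrete: the version space is stored as an explicit list, which is only feasible for tiny concept classes. The toy class used here (the 8 monotone conjunctions over 3 variables), the tie-breaking rule, and all names are assumptions made for illustration. The sketch also prunes inconsistent concepts after every example rather than only after mistakes, which does not affect the mistake bound.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Instance = Tuple[int, ...]
Concept = Callable[[Instance], int]


def halving(concepts: Sequence[Concept], examples) -> Tuple[List[Concept], int]:
    """Run the halving algorithm; return (final version space, mistakes)."""
    version_space = list(concepts)
    mistakes = 0
    for x, label in examples:
        votes = sum(c(x) for c in version_space)
        prediction = int(2 * votes >= len(version_space))  # ties predict 1
        if prediction != label:
            mistakes += 1
        # Keep only the concepts that agree with the revealed label.
        version_space = [c for c in version_space if c(x) == label]
    return version_space, mistakes


def make_conjunction(indices: Tuple[int, ...]) -> Concept:
    """Monotone conjunction of the variables at the given indices."""
    return lambda x: int(all(x[i] for i in indices))


# Toy concept class: all 8 monotone conjunctions over 3 variables.
subsets = [s for r in range(4) for s in product(range(3), repeat=r) if list(s) == sorted(set(s))]
concepts = [make_conjunction(s) for s in subsets]

target = make_conjunction((0, 2))
examples = [(x, target(x)) for x in product((0, 1), repeat=3)]
final_space, n_mistakes = halving(concepts, examples)
print(len(concepts), len(final_space), n_mistakes)
```

Each mistake removes at least half of the remaining concepts, so with 8 concepts the learner can make at most lg 8 = 3 mistakes on any example sequence.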

The Halving Algorithm

The halving algorithm can be practical if there is a way to compactly represent and efficiently manipulate the version space.

Question: Are there any efficiently implementable optimal mistake bound learning algorithms?

Answer: Yes. Littlestone's Winnow algorithm learns monotone disjunctions of at most k of n literals, using the hypothesis class of threshold functions, with at most O(k lg n) mistakes.
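As a rough illustration of Littlestone's approach, here is a sketch of the Winnow1 variant. The slides do not give parameters, so the choices below (promotion factor 2, threshold n/2, zeroing weights on false positives) are assumptions, as are the function name and the randomly generated example stream. For these settings the standard analysis bounds the number of mistakes by 2k lg n + 2.

```python
import math
import random
from typing import List, Sequence, Tuple


def winnow1(examples: Sequence[Tuple[Sequence[int], int]], n: int) -> Tuple[List[float], int]:
    """Winnow1 with promotion factor 2 and threshold n/2; returns (weights, mistakes)."""
    weights = [1.0] * n
    theta = n / 2
    mistakes = 0
    for x, label in examples:
        total = sum(w for w, xi in zip(weights, x) if xi)
        prediction = int(total >= theta)
        if prediction == label:
            continue
        mistakes += 1
        if label == 1:
            # False negative: double the weights of the active variables.
            weights = [w * 2 if xi else w for w, xi in zip(weights, x)]
        else:
            # False positive: active variables cannot be relevant, so zero them.
            weights = [0.0 if xi else w for w, xi in zip(weights, x)]
    return weights, mistakes


# Hypothetical target: the monotone disjunction X3 OR X10 over n = 16 variables.
n, relevant = 16, {2, 9}
rng = random.Random(1)
examples = []
for _ in range(500):
    x = tuple(rng.randint(0, 1) for _ in range(n))
    examples.append((x, int(any(x[i] for i in relevant))))

weights, n_mistakes = winnow1(examples, n)
bound = 2 * len(relevant) * math.log2(n) + 2
print(n_mistakes, bound)
```

The learner maintains only n weights, so each prediction and update takes O(n) time, in contrast to the explicit version space of the halving algorithm.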

Bounding the prediction error

The mistake bound model bounds the number of mistakes that the learner will ever make before exactly learning a concept, but not the prediction error after having seen a certain number of examples.

The mistake bound model assumes that the examples are chosen arbitrarily: in the worst case, by a smart, adversarial teacher. It might often be satisfactory to assume randomly drawn examples.

Probably Approximately Correct Learning

[Figure: block diagram of the PAC learning setting, with components labeled Instance Distribution, Samples, Oracle, Concept, Examples, and Learner]

Probably Approximately Correct Learning

Consider:

An instance space $X$

A concept space $\mathcal{C} = \{\, C : X \to \{0,1\} \,\}$

A hypothesis space $H = \{\, h : X \to \{0,1\} \,\}$

An unknown, arbitrary, not necessarily computable, stationary probability distribution $D$ over the instance space $X$

PAC Learning

The oracle samples the instance space according to D and provides labeled examples of an unknown concept C to the learner

The learner is tested on samples drawn from the instance space according to the same probability distribution D

The learner's task is to output a hypothesis h from H that closely approximates the unknown concept C based on the examples it has encountered

PAC Learning

In the PAC setting, exact learning (zero error approximation) cannot be guaranteed

In the PAC setting, even approximate learning (with bounded non-zero error) cannot be guaranteed 100% of the time

Definition: The error of a hypothesis $h$ with respect to a target concept $C$ and an instance distribution $D$ is given by $\mathrm{Prob}_D[\, C(X) \neq h(X) \,]$
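For a small finite instance space the error can be computed exactly by summing the probability mass of the instances on which the target and the hypothesis disagree. The snippet below is a minimal sketch: the uniform distribution over $\{0,1\}^3$, the target, and the hypothesis are hypothetical choices for illustration.

```python
from itertools import product


def error(concept, hypothesis, distribution):
    """Exact error: total probability of instances where the two functions disagree."""
    return sum(p for x, p in distribution.items() if concept(x) != hypothesis(x))


# Hypothetical setting: uniform distribution D over {0,1}^3.
instances = list(product((0, 1), repeat=3))
uniform = {x: 1 / len(instances) for x in instances}

target = lambda x: int(x[0] and x[1])  # C = X1 AND X2
hypothesis = lambda x: int(x[0])       # h = X1

# C and h disagree exactly when X1 = 1 and X2 = 0, i.e. on 2 of the 8 instances.
print(error(target, hypothesis, uniform))  # 0.25
```

The same hypothesis can have a very different error under another distribution, which is why the PAC definition quantifies over all distributions $D$.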

PAC Learning

Definition: A concept class $\mathcal{C}$ is said to be PAC-learnable using a hypothesis class $H$ if there exists a learning algorithm $L$ such that for all concepts in $\mathcal{C}$, for all instance distributions $D$ on an instance space $X$, and for all $\varepsilon, \delta$ with $0 < \varepsilon, \delta < 1$, $L$, when given access to the Example oracle, produces, with probability at least $(1 - \delta)$, a hypothesis $h$ from $H$ with error no more than $\varepsilon$ (Valiant, 1984)

Efficient PAC Learning

Definition: $\mathcal{C}$ is said to be efficiently PAC-learnable if $L$ runs in time that is polynomial in $N$ (size of the instance representation), $\mathrm{size}(C)$ (size of the concept representation), $\frac{1}{\varepsilon}$, and $\frac{1}{\delta}$

Remark: Note that lower error or increased confidence requires more examples.

Remark: In order for a concept class to be efficiently PAC-learnable, it should be PAC-learnable using a random sample of size polynomial in the relevant parameters.

Sample complexity of PAC Learning

Definition: A consistent learner is one that returns some hypothesis h from the hypothesis class H that is consistent with a random sequence of m examples.

Remark: A consistent learner is a MAP learner (one that returns a hypothesis that is most likely given the training data) if all hypotheses are a-priori equally likely

Theorem: A consistent learner is guaranteed to be PAC if the number of samples $m > \frac{1}{\varepsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)$
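To get a feel for the numbers, the standard bound for consistent learners over a finite hypothesis class, $m > \frac{1}{\varepsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)$, can be evaluated directly. The helper name and the example values below are illustrative assumptions.

```python
import math


def pac_sample_size(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Smallest integer m with m > (ln|H| + ln(1/delta)) / epsilon."""
    bound = (math.log(hypothesis_count) + math.log(1 / delta)) / epsilon
    return math.floor(bound) + 1


# Hypothetical example: |H| = 3**10 (conjunctions over 10 boolean variables),
# error at most 0.1 with confidence at least 0.95.
print(pac_sample_size(3 ** 10, epsilon=0.1, delta=0.05))  # 140
```

The sample size grows only logarithmically in $|H|$ and $1/\delta$ but linearly in $1/\varepsilon$, so halving the error tolerance roughly doubles the number of examples needed.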

Sample Complexity of PAC Learning

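Proof: Consider a hypothesis $h$ that is not a PAC approximation of an unknown concept $C$. Clearly, the error of $h$, or the probability that $h$ is wrong on a random instance, is at least $\varepsilon$, so the probability that $h$ is consistent with one random example is at most $(1 - \varepsilon)$. The probability of $h$ being consistent with $m$ independently drawn random examples is therefore at most $(1 - \varepsilon)^m$. For PAC learning, we want to make sure that the probability of $L$ returning such a bad hypothesis is small. Since there are at most $|H|$ bad hypotheses, this probability is at most $|H| (1 - \varepsilon)^m \le |H| e^{-\varepsilon m}$, which is less than $\delta$ whenever $m > \frac{1}{\varepsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)$.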

PAC- Easy and PAC-Hard Concept Classes

Conjunctive concepts are easy to learn.

Use the same algorithm as the one used in the mistake bound framework

Sample complexity: $O\left(\frac{1}{\varepsilon}\left(N \ln 3 + \ln \frac{1}{\delta}\right)\right)$

Time complexity is polynomial in the relevant parameters of interest.

Remark: Polynomial sample complexity is necessary but not sufficient for efficient PAC learning.
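A small simulation can illustrate PAC learning of conjunctions. It draws $m$ examples, with $m$ taken from the consistent-learner bound using $|H| = 3^N$ (each variable appears positively, negated, or not at all), runs the elimination learner on them, and computes the exact error of the result. The uniform distribution, the target $X_1 \land \lnot X_3$, the random seed, and all names are assumptions chosen for illustration.

```python
import math
import random
from itertools import product


def learn_conjunction(sample, n):
    """Keep every literal (index, value) that no positive example violates."""
    literals = {(i, v) for i in range(n) for v in (0, 1)}
    for x, label in sample:
        if label == 1:
            literals = {(i, v) for i, v in literals if x[i] == v}
    return literals


def evaluate(literals, x):
    """Value of the conjunction of the given literals on instance x."""
    return int(all(x[i] == v for i, v in literals))


n, epsilon, delta = 5, 0.1, 0.05
# Sample size from the bound for consistent learners with |H| = 3**n.
m = math.floor((n * math.log(3) + math.log(1 / delta)) / epsilon) + 1

target = {(0, 1), (2, 0)}  # hypothetical target: X1 AND NOT X3
rng = random.Random(0)
sample = []
for _ in range(m):
    x = tuple(rng.randint(0, 1) for _ in range(n))  # uniform D over {0,1}^n
    sample.append((x, evaluate(target, x)))

h = learn_conjunction(sample, n)
instances = list(product((0, 1), repeat=n))
err = sum(evaluate(h, x) != evaluate(target, x) for x in instances) / len(instances)
print(m, err)
```

Because the learner never drops a literal of the target, its hypothesis can only err by predicting 0 on a positive instance; the error is one-sided.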

PAC-Easy and PAC-Hard Concept Classes

Theorem: The 3-term DNF concept class (disjunctions of at most 3 conjunctions) is not efficiently PAC-learnable using the same hypothesis class (although it has polynomial sample complexity) unless P=NP.

Proof: By polynomial time reduction of graph 3-colorability (a well-known NP-complete problem) to the problem of deciding whether a given set of labeled examples is consistent with some 3-term DNF formula.

Transforming Hard Problems to Easy ones

Theorem: 3-term DNF concepts are efficiently PAC-learnable using 3-CNF (conjunctive normal form with at most 3 literals per clause) hypothesis class.

Proof: $\text{3-term DNF} \subseteq \text{3-CNF}$

Transform each example over $N$ boolean variables into a corresponding example over $O(N^3)$ variables (one for each possible clause in a 3-CNF formula).

The problem reduces to learning a conjunctive concept over the transformed instance space.
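The reduction can be sketched as follows: enumerate every clause of at most 3 literals, treat each clause as a new boolean feature, and run the elimination idea over those features, so that the hypothesis is the conjunction of all clauses satisfied by every positive example. The target formula, the training set (all 16 instances rather than a random sample), and the function names are hypothetical choices for illustration.

```python
from itertools import combinations, product


def all_clauses(n, width=3):
    """Every disjunction of 1..width distinct literals over n variables."""
    literals = [(i, v) for i in range(n) for v in (0, 1)]
    return [c for size in range(1, width + 1) for c in combinations(literals, size)]


def satisfies(clause, x):
    """A clause (disjunction) holds if any of its literals matches x."""
    return any(x[i] == v for i, v in clause)


def learn_3cnf(examples, n):
    """Conjunction learner over clause features: drop clauses a positive example falsifies."""
    surviving = all_clauses(n)
    for x, label in examples:
        if label == 1:
            surviving = [c for c in surviving if satisfies(c, x)]
    return surviving


def evaluate(cnf, x):
    """A CNF formula holds if every one of its clauses holds."""
    return int(all(satisfies(c, x) for c in cnf))


def target(x):
    """Hypothetical 3-term DNF: (X1 AND X2) OR (X3 AND NOT X4) OR (NOT X1 AND X4)."""
    return int((x[0] and x[1]) or (x[2] and not x[3]) or (not x[0] and x[3]))


n = 4
instances = list(product((0, 1), repeat=n))
cnf = learn_3cnf([(x, target(x)) for x in instances], n)
print(len(all_clauses(n)), len(cnf))
print(all(evaluate(cnf, x) == target(x) for x in instances))  # True
```

The number of clause features is polynomial in the number of variables (92 for n = 4), which is what keeps the transformed conjunction-learning problem efficient.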

Transforming Hard Problems to Easy ones

Theorem: For any $k \ge 2$, k-term DNF concepts are efficiently PAC-learnable using the k-CNF hypothesis class.

Remark: In this case, enlarging the search space by using a hypothesis class that is larger than strictly necessary actually makes the problem easy!

Remark: No, we have not proved that P=NP.

Summary:

| Concept class | Conjunctive | k-term DNF | k-CNF | CNF |
| --- | --- | --- | --- | --- |
| PAC learning | Easy | Hard | Easy | Hard |

Inductive Bias: Occam's Razor

Occam's razor: Keep it simple, stupid!

An Occam learning algorithm returns a simple or succinct hypothesis that is consistent with the training