
Machine Learning

CPS4801

Research Day
• Keynote Speaker
  o Tuesday 9:30-11:00 STEM Lecture Hall (2nd floor)
  o Meet-and-Greet 11:30 STEM 512
• Faculty Presentation
  o Tuesday 11:00-3:00 STEM
  o Prof. Liou 2:00 Room 415
• Student Poster
  o Wednesday 10:00-3:00
  o Computer Science 10:00-12:00 STEM Atrium
• Schedule: http://orsp.kean.edu/ResearchDays_Schedule.html

Outline
• Introduction
• Decision tree learning
• Clustering
• Artificial Neural Networks
• Genetic algorithms

Learning from Examples

• An agent is learning if it improves its performance on future tasks after making observations about the world.

• One class of learning problem:
  o from a collection of input-output pairs, learn a function that predicts the output for new inputs.

Why learning?
• The designer cannot anticipate all possible situations
  o A robot designed to navigate mazes must learn the layout of each new maze.
• The designer cannot anticipate all changes
  o A program designed to predict tomorrow’s stock market prices must learn to adapt when conditions change.
• Programmers sometimes have no idea how to program a solution
  o recognizing faces

Types of Learning
• Supervised learning
  o the agent observes example input-output pairs and learns a function that maps input to output
• Unsupervised learning
  o correct answers are not given
  o clustering: a taxi agent must develop a concept of “good traffic days” and “bad traffic days”
• Reinforcement learning
  o the agent learns from rewards or punishments
  o taxi agent: lack of a tip
  o chess game: two points for a win

Supervised Learning
• Learning a function/rule from specific input-output pairs is also called inductive learning.
• Given a training set of N example pairs:
  o (x1, y1), (x2, y2), ..., (xN, yN)
  o where each yi was generated by an unknown target function y = f(x)
• Problem: find a hypothesis h such that h ≈ f
• h generalizes well if it correctly predicts the value of y for novel examples (test set).

Supervised Learning
• When the output y is one of a finite set of values (sunny, cloudy, rainy), the learning problem is called classification.
  o Boolean or binary classification if there are only two values
• When y is a number (tomorrow’s temperature), the problem is called regression.

Inductive learning method
• The points are in the (x,y) plane, where y = f(x).
• We approximate f with h selected from a hypothesis space H.
• Construct/adjust h to agree with f on the training set
• E.g., curve fitting:


Inductive learning method
• Construct/adjust h to agree with f on the training set
• (h is consistent if it agrees with f on all examples)
• E.g., curve fitting:
• How to choose from among multiple consistent hypotheses?

Inductive learning method
• Ockham’s razor: prefer the simplest hypothesis consistent with the data (after the 14th-century English philosopher William of Ockham)
• There is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better.
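As a rough illustration of the curve-fitting slides above, here is a minimal Python sketch that fits polynomials of increasing degree to a handful of made-up training points. The degree-5 polynomial agrees with f on every training point (a consistent hypothesis), but Ockham’s razor prefers the simpler low-degree fit; the data values here are invented purely for illustration.

import numpy as np

# Hypothetical training points (x, y = f(x)); values invented for illustration
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 5.0])

# Each polynomial degree defines a different hypothesis space H
for degree in (1, 3, 5):
    h = np.polyfit(x, y, degree)          # construct/adjust h to agree with f on the training set
    train_error = float(np.sum((np.polyval(h, x) - y) ** 2))
    print(degree, round(train_error, 4))  # degree 5 interpolates all 6 points (consistent),
                                          # but the simpler degree-1 fit may generalize better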


Cross-Validation
• Split the labeled data (1566 examples) into 10 folds
• Train the model on 9 folds (approx. 1409 examples) and evaluate it on the held-out fold (approx. 157 examples)
• Lather, rinse, repeat (10 times)
• Report the average
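The diagram on this slide can be read as the procedure sketched below in Python. It is a minimal sketch only: train_model and evaluate are hypothetical placeholders for whatever learner and accuracy measure are being validated.

import random

def cross_validate(examples, k=10):
    """Sketch of k-fold cross-validation over a list of labeled examples."""
    examples = examples[:]                       # copy, then shuffle before splitting
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]   # split into k roughly equal folds

    scores = []
    for i in range(k):
        test_fold = folds[i]                     # 1 held-out fold (~157 of 1566)
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]  # 9 folds (~1409)
        model = train_model(train_set)            # hypothetical: fit the model on the training folds
        scores.append(evaluate(model, test_fold)) # hypothetical: accuracy on the held-out fold
    return sum(scores) / k                        # report the average over the k runs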

Learning decision trees

• One of the simplest and yet most successful forms of machine learning.

• A decision tree represents a function that takes as input a vector of attribute values and returns a “decision”: a single output value.
  o discrete inputs, Boolean classification

Learning decision trees

Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
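For concreteness, one training example with these ten attributes might be represented as an attribute-value record like the Python dictionary below. The particular values, and the target name will_wait, are assumptions made for illustration only.

# One hypothetical training example: a vector of attribute values plus the target classification
example = {
    "Alternate": True,
    "Bar": False,
    "Fri/Sat": False,
    "Hungry": True,
    "Patrons": "Some",       # None / Some / Full
    "Price": "$",            # $ / $$ / $$$
    "Raining": False,
    "Reservation": True,
    "Type": "Thai",          # French / Italian / Thai / Burger
    "WaitEstimate": "0-10",  # 0-10 / 10-30 / 30-60 / >60
}
will_wait = True             # the "decision": whether we wait for a table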

Decision trees
• One possible representation for hypotheses (this tree does not use Price or Type)
• The “true” tree for deciding whether to wait:

Expressiveness
• Decision trees can express any function of the input attributes.
• E.g., for Boolean functions, truth table row → path to leaf:
• Goal ⇔ (Path1 ∨ Path2 ∨ Path3 ∨ ...)
• Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example.
• Prefer to find more compact decision trees


Constructing the Decision Tree
• Goal: find the smallest decision tree consistent with the examples
• Divide-and-conquer: test the most important attribute first; this divides the problem into smaller subproblems that can be solved recursively.
  o “Most important”: the attribute that best splits the examples
• Form a tree with root = best attribute
• For each value v_i (or range) of the best attribute:
  o Select those examples with best = v_i
  o Construct subtree_i by recursively calling decision tree learning on that subset of examples, with all attributes except best
  o Add a branch to the tree with label = v_i and subtree = subtree_i
• (A minimal sketch of this recursion follows below.)
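Here is a minimal, hedged sketch of that divide-and-conquer recursion in Python. Examples are assumed to be dictionaries of attribute values with a "class" key, and plurality_value and choose_best_attribute are assumed helpers (the latter would use the information-gain measure introduced later in these notes).

def decision_tree_learning(examples, attributes, parent_examples):
    """Recursive divide-and-conquer construction of a decision tree (sketch)."""
    if not examples:
        return plurality_value(parent_examples)          # assumed helper: majority class of the parent
    if all(ex["class"] == examples[0]["class"] for ex in examples):
        return examples[0]["class"]                      # all examples agree: return a leaf
    if not attributes:
        return plurality_value(examples)                 # attributes exhausted: majority class
    best = choose_best_attribute(attributes, examples)   # assumed helper: attribute with largest IG
    tree = {best: {}}                                    # root of this (sub)tree = best attribute
    for v in sorted({ex[best] for ex in examples}):      # for each value v of the best attribute
        subset = [ex for ex in examples if ex[best] == v]         # examples with best = v
        remaining = [a for a in attributes if a != best]          # all attributes except best
        tree[best][v] = decision_tree_learning(subset, remaining, examples)  # branch labelled v
    return tree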

Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose the “most significant” attribute as the root of the (sub)tree

Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

• Which is a better choice?

Attribute-based representations

• Examples are described by attribute values
• A training set of 12 examples
• E.g., situations where I will/won't wait for a table:

• Classification of examples is positive (T) or negative (F)


Choosing the Best Attribute: Binary Classification
• Want a formal measure that returns a maximum value when an attribute makes a perfect split and a minimum value when it makes no distinction
• Information theory (Shannon and Weaver, 1949)
  o Entropy: a measure of the uncertainty of a random variable
    • A coin that always comes up heads --> 0
    • A flip of a fair coin (heads or tails) --> 1 bit
    • The roll of a fair four-sided die --> 2 bits
  o Information gain: the expected reduction in entropy caused by partitioning the examples according to this attribute


Formula for Entropy
H(P(v1), ..., P(vn)) = -Σk P(vk) log2 P(vk)

Examples:
• Suppose we have a collection of 10 examples, 5 positive, 5 negative:
  H(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
• Suppose we have a collection of 100 examples, 1 positive and 99 negative:
  H(1/100, 99/100) = -0.01 log2(0.01) - 0.99 log2(0.99) ≈ 0.08 bits
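The two worked examples can be checked with a few lines of Python; the function below simply evaluates the entropy formula over a list of outcome probabilities.

from math import log2

def entropy(probabilities):
    """H = -sum(p * log2(p)) over the outcome probabilities (terms with p = 0 contribute 0)."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit    (5 positive, 5 negative)
print(entropy([0.01, 0.99]))   # ~0.081 bits (1 positive, 99 negative)
print(entropy([0.25] * 4))     # 2.0 bits   (fair four-sided die)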

Information gain
• Information gain (from the attribute test) = the difference between the original information requirement and the new requirement
• Information Gain (IG), or the reduction in entropy from the attribute test:
  IG(A) = H(p/(p+n), n/(p+n)) - Σk (pk + nk)/(p + n) · H(pk/(pk+nk), nk/(pk+nk))
• Choose the attribute with the largest IG

Information gain
• For the training set, p = n = 6, so H(6/12, 6/12) = 1 bit
• Consider the attributes Patrons and Type (and the others too):
• Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root
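As a sketch of that comparison, the snippet below computes IG for Patrons and Type using per-value positive/negative counts (None: 0/2, Some: 4/0, Full: 2/4; each Type value split evenly). These counts are taken from the usual textbook version of the restaurant example and are assumptions here, since the slide's table is not reproduced above.

from math import log2

def entropy2(p, n):
    """Entropy of a Boolean classification with p positive and n negative examples."""
    total = p + n
    return -sum(x / total * log2(x / total) for x in (p, n) if x > 0)

def information_gain(splits, p=6, n=6):
    """IG(A) = H(p, n) - sum_k (p_k + n_k)/(p + n) * H(p_k, n_k) over the values of attribute A."""
    remainder = sum((pk + nk) / (p + n) * entropy2(pk, nk) for pk, nk in splits)
    return entropy2(p, n) - remainder

# Assumed (positive, negative) counts per attribute value
patrons = [(0, 2), (4, 0), (2, 4)]            # None, Some, Full
rtype   = [(1, 1), (1, 1), (2, 2), (2, 2)]    # French, Italian, Thai, Burger

print(information_gain(patrons))   # ~0.541 bits
print(information_gain(rtype))     # 0.0 bits -> Patrons is the better root attribute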

Example contd.
• Decision tree learned from the 12 examples:

• Substantially simpler than the “true” tree