Machine Learning CPS4801
Research Day
• Keynote Speaker
  o Tuesday 9:30-11:00, STEM Lecture Hall (2nd floor)
  o Meet-and-Greet 11:30, STEM 512
• Faculty Presentation
  o Tuesday 11:00-3:00, STEM
  o Prof. Liou 2:00, Room 415
• Student Poster
  o Wednesday 10:00-3:00
  o Computer Science 10:00-12:00, STEM Atrium
• Schedule: http://orsp.kean.edu/ResearchDays_Schedule.html
Outline
• Introduction
• Decision tree learning
• Clustering
• Artificial Neural Networks
• Genetic algorithms
Learning from Examples
• An agent is learning if it improves its performance on future tasks after making observations about the world.
• One class of learning problem:
  o From a collection of input-output pairs, learn a function that predicts the output for new inputs.
Why learning?
• The designer cannot anticipate all possible situations.
  o A robot designed to navigate mazes must learn the layout of each new maze.
• The designer cannot anticipate all changes.
  o A program designed to predict tomorrow’s stock market prices must learn to adapt when conditions change.
• Programmers sometimes have no idea how to program a solution.
  o Example: recognizing faces
Types of Learning
• Supervised learning
  o The agent observes example input-output pairs and learns a function that maps inputs to outputs.
• Unsupervised learning
  o Correct answers are not given.
  o Clustering: a taxi agent must develop a concept of “good traffic days” and “bad traffic days”.
• Reinforcement learning
  o The agent learns from rewards or punishments.
  o Taxi agent: the lack of a tip
  o Chess game: two points for a win
Supervised Learning
• Learning a function/rule from specific input-output pairs is also called inductive learning.
• Given a training set of N example pairs:
  o (x1, y1), (x2, y2), ..., (xN, yN)
  o Each pair is generated by an unknown target function y = f(x).
• Problem: find a hypothesis h such that h ≈ f.
• h generalizes well if it correctly predicts the value of y for novel examples (the test set).
Supervised Learning
• When the output y is one of a finite set of values (sunny, cloudy, rainy), the learning problem is called classification.
  o With only two values, it is called Boolean or binary classification.
• When y is a number (tomorrow’s temperature), the problem is called regression.
Inductive learning method
• The points are in the (x, y) plane, where y = f(x).
• We approximate f with h selected from a hypothesis space H.
• Construct/adjust h to agree with f on the training set.
Inductive learning method
• Construct/adjust h to agree with f on the training set.
• (h is consistent if it agrees with f on all examples.)
• E.g., curve fitting:
• How to choose from among multiple consistent hypotheses?
Inductive learning method
• Ockham’s razor: prefer the simplest hypothesis consistent with data (14th-century English philosopher William of Ockham)
• There is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better.
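Cross-Validation
• Start with the labeled data (1566 examples).
• Split it into 10 folds.
• Train a model on 9 folds (approx. 1409 examples).
• Evaluate the model on the remaining fold (approx. 157 examples).
• Lather, rinse, repeat (10 times), holding out a different fold each time.
• Report the average.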
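The 10-fold procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the names k_fold_indices, cross_validate, train_majority, and accuracy are hypothetical, and the "model" is a deliberately trivial majority-label predictor so that the fold logic stays in focus.

```python
import random

def k_fold_indices(n_examples, k=10, seed=0):
    """Shuffle the example indices and deal them into k folds of near-equal size."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(examples, labels, train, evaluate, k=10, seed=0):
    """Train on k-1 folds, evaluate on the held-out fold, and average the k scores."""
    scores = []
    for held_out in k_fold_indices(len(examples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(examples)) if i not in held]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        scores.append(evaluate(model,
                               [examples[i] for i in held_out],
                               [labels[i] for i in held_out]))
    return sum(scores) / k

# Hypothetical learner (an assumption for illustration): always predict
# the most common label seen in the training folds.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)

# With 1566 labeled examples, each fold holds 156 or 157 of them.
fold_sizes = [len(fold) for fold in k_fold_indices(1566, k=10)]

# Made-up data: one third of the labels are True, so the majority label is False.
toy_examples = list(range(1566))
toy_labels = [i % 3 == 0 for i in toy_examples]
average_accuracy = cross_validate(toy_examples, toy_labels, train_majority, accuracy)
```

Every example is used for evaluation exactly once and for training nine times, which is why the averaged score is a less noisy estimate than a single train/test split.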
Learning decision trees
• One of the simplest and yet most successful forms of machine learning.
• A decision tree represents a function that takes as input a vector of attribute values and returns a “decision”, a single output value.
  o Here: discrete inputs, Boolean classification
Learning decision trees
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time in minutes (0-10, 10-30, 30-60, >60)
Decision trees
• One possible representation for hypotheses (no Price and Type)
• “True” tree for deciding whether to wait:
Expressiveness
• Decision trees can express any function of the input attributes.
• E.g., for Boolean functions, truth table row → path to leaf:
• Goal ⇔ (Path1 ∨ Path2 ∨ Path3 ∨ ...)
• Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example.
• Prefer to find more compact decision trees.
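As a small illustration of "truth table row → path to leaf", the sketch below encodes the XOR function of two Boolean attributes as a tree and checks it against the full truth table. The nested-tuple representation and the names xor_tree, classify, and positive_paths are assumptions made for this example only.

```python
from itertools import product

# Internal node: (attribute_name, {value: subtree}); leaf: a bool.
xor_tree = ("A", {
    False: ("B", {False: False, True: True}),
    True:  ("B", {False: True,  True: False}),
})

def classify(tree, example):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while not isinstance(tree, bool):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

def positive_paths(tree, path=()):
    """Collect the root-to-leaf paths ending in True; the goal is their disjunction."""
    if isinstance(tree, bool):
        return [path] if tree else []
    attribute, branches = tree
    paths = []
    for value, subtree in branches.items():
        paths += positive_paths(subtree, path + ((attribute, value),))
    return paths

# Every row of the truth table corresponds to exactly one path in the tree.
truth_table = {(a, b): a != b for a, b in product([False, True], repeat=2)}
matches = all(classify(xor_tree, {"A": a, "B": b}) == out
              for (a, b), out in truth_table.items())
paths = positive_paths(xor_tree)
```

Here the goal is the disjunction of two paths: (A = False and B = True) or (A = True and B = False). XOR needs a leaf for every truth table row, which is the kind of function for which no more compact tree exists.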
Constructing the Decision Tree
• Goal: find the smallest decision tree consistent with the examples.
• Divide and conquer: test the most important attribute first, which divides the problem into smaller subproblems that can be solved recursively.
  o “Most important”: the attribute that best splits the examples
• Form a tree with root = best attribute.
• For each value v_i (or range) of the best attribute:
  o Select the examples with best = v_i.
  o Construct subtree_i by recursively calling the decision tree learner with that subset of examples and all attributes except best.
  o Add a branch to the tree with label = v_i and subtree = subtree_i.
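The recursive procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: "best" is taken to mean highest information gain, the function names and the six-example data set are hypothetical, and the sketch only creates branches for attribute values that actually occur in the examples.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of the label distribution in a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def split(examples, attribute):
    """Group (attributes, label) examples by their value for one attribute."""
    groups = {}
    for attrs, label in examples:
        groups.setdefault(attrs[attribute], []).append((attrs, label))
    return groups

def information_gain(examples, attribute):
    """Entropy before the split minus the size-weighted entropy after it."""
    before = entropy([label for _, label in examples])
    after = sum(len(g) / len(examples) * entropy([label for _, label in g])
                for g in split(examples, attribute).values())
    return before - after

def learn_tree(examples, attributes):
    """Return a leaf label, or a node (best_attribute, {value: subtree})."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:      # all examples agree: make a leaf
        return labels[0]
    if not attributes:             # nothing left to test: use the majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    remaining = [a for a in attributes if a != best]
    return (best, {value: learn_tree(subset, remaining)
                   for value, subset in split(examples, best).items()})

def classify(tree, attrs):
    """Walk from the root to a leaf, following the example's attribute values."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[attrs[attribute]]
    return tree

# Hypothetical toy data (not the 12-example restaurant training set).
toy = [
    ({"Patrons": "None", "Hungry": False}, False),
    ({"Patrons": "Some", "Hungry": True}, True),
    ({"Patrons": "Some", "Hungry": False}, True),
    ({"Patrons": "Full", "Hungry": True}, True),
    ({"Patrons": "Full", "Hungry": False}, False),
    ({"Patrons": "None", "Hungry": True}, False),
]
tree = learn_tree(toy, ["Patrons", "Hungry"])
```

On this toy data the learner tests Patrons at the root and only needs Hungry under the "Full" branch, so the resulting tree is smaller than one that tests every attribute on every path.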
Decision tree learning
• Aim: find a small tree consistent with the training examples.
• Idea: (recursively) choose the "most significant" attribute as the root of the (sub)tree.
Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
• Which is a better choice?
Attribute-based representations
• Examples are described by attribute values.
• A training set of 12 examples
• E.g., situations where I will/won't wait for a table:
• The classification of each example is positive (T) or negative (F).
Choosing the Best Attribute: Binary Classification
• We want a formal measure that returns its maximum value when an attribute makes a perfect split and its minimum when it makes no distinction.
• Information theory (Shannon and Weaver, 1949)
  o Entropy: a measure of the uncertainty of a random variable
    • A coin that always comes up heads: 0 bits
    • A flip of a fair coin (heads or tails): 1 bit
    • The roll of a fair four-sided die: 2 bits
  o Information gain: the expected reduction in entropy caused by partitioning the examples according to this attribute
Formula for Entropy
H(P1, ..., Pn) = −Σ_i Pi log2(Pi), measured in bits
Examples:
• Suppose we have a collection of 10 examples, 5 positive and 5 negative:
  H(1/2, 1/2) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1 bit
• Suppose we have a collection of 100 examples, 1 positive and 99 negative:
  H(1/100, 99/100) = −0.01 log2(0.01) − 0.99 log2(0.99) ≈ 0.08 bits
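The entropy values quoted on these slides can be checked numerically. The short sketch below assumes the standard definition of entropy in bits (the negated sum of p · log2 p over the outcomes); the function and variable names are chosen for this example.

```python
import math

def entropy(probabilities):
    """Entropy in bits; outcomes with probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

always_heads = entropy([1.0])           # no uncertainty
fair_coin = entropy([0.5, 0.5])         # 5 positive, 5 negative examples
four_sided_die = entropy([0.25] * 4)    # four equally likely outcomes
skewed = entropy([0.01, 0.99])          # 1 positive, 99 negative examples
```

The skewed collection comes out at roughly 0.08 bits: when almost every example has the same label, there is little uncertainty left for an attribute test to remove.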
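Information gain
• Information gain (from an attribute test) = the difference between the original information requirement and the new requirement after the test.
• Information gain (IG), or reduction in entropy, from testing attribute A, which splits the p positive and n negative examples into subsets with p_i positive and n_i negative examples:
  IG(A) = H(p/(p+n), n/(p+n)) − Σ_i (p_i+n_i)/(p+n) · H(p_i/(p_i+n_i), n_i/(p_i+n_i))
• Choose the attribute with the largest IG.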
Information gain
• For the training set, p = n = 6, so H(6/12, 6/12) = 1 bit.
• Consider the attributes Patrons and Type (and others too):
• Patrons has the highest IG of all attributes and so is chosen by the decision tree learning (DTL) algorithm as the root.
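The comparison between Patrons and Type can be reproduced with a short calculation. The per-value counts below are an assumption: they are not in this transcript and are taken from the restaurant data set in Russell and Norvig's textbook, which these slides appear to follow. The function names are hypothetical.

```python
import math

def entropy(positive, negative):
    """Entropy in bits of a set with the given positive/negative example counts."""
    total = positive + negative
    return -sum((c / total) * math.log2(c / total)
                for c in (positive, negative) if c > 0)

def information_gain(splits):
    """splits maps each attribute value to its (positive, negative) example counts."""
    p = sum(pos for pos, _ in splits.values())
    n = sum(neg for _, neg in splits.values())
    remainder = sum((pos + neg) / (p + n) * entropy(pos, neg)
                    for pos, neg in splits.values())
    return entropy(p, n) - remainder

# ASSUMPTION: counts from the textbook restaurant example (12 examples, p = n = 6).
patrons = {"None": (0, 2), "Some": (4, 0), "Full": (2, 4)}
restaurant_type = {"French": (1, 1), "Italian": (1, 1),
                   "Thai": (2, 2), "Burger": (2, 2)}

gain_patrons = information_gain(patrons)        # about 0.541 bits
gain_type = information_gain(restaurant_type)   # 0 bits
```

Under these counts, every value of Type leaves an even mix of positive and negative examples, so testing it removes no uncertainty, while two of the three Patrons values yield pure subsets.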