TRANSCRIPT

Page 1:

Machine Learning, Decision Trees, Overfitting

Machine Learning 10-601

Tom M. Mitchell Machine Learning Department

Carnegie Mellon University

January 12, 2009

Readings:

•  Mitchell, Chapter 3

•  Bishop, Chapter 1.6

Page 2:

Machine Learning 10-601

Instructor
•  Tom Mitchell

TAs
•  Andy Carlson
•  Purna Sarkar

Course assistant
•  Sharon Cavlovich

webpage: www.cs.cmu.edu/~tom/10601_sp09

See webpage for
•  Office hours
•  Grading policy
•  Final exam date
•  Late homework policy
•  Syllabus details
•  ...

Page 3:

Machine Learning:

Study of algorithms that
•  improve their performance P
•  at some task T
•  with experience E

well-defined learning task: <P,T,E>

Page 4:

Learning to Predict Emergency C-Sections

9714 patient records, each with 215 features

[Sims et al., 2000]

Page 5:

Learning to detect objects in images

Example training images for each orientation

(Prof. H. Schneiderman)

Page 6:

Learning to classify text documents

Company home page vs. Personal home page vs. University home page vs. …

Page 7:

Reading a noun (vs verb)

[Rustandi et al., 2005]

Page 8:

Machine Learning - Practice

Application areas: object recognition, mining databases, speech recognition, control learning, text analysis

•  Supervised learning
•  Bayesian networks
•  Hidden Markov models
•  Unsupervised clustering
•  Reinforcement learning
•  ....

Page 9:

Machine Learning - Theory

PAC Learning Theory

relating the number of training examples (m), the representational complexity of the hypothesis space (|H|), the error rate (ε), and the failure probability (δ):

m ≥ (1/ε) ( ln |H| + ln (1/δ) )
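As a numeric illustration of this bound (not from the slides; the hypothesis class and numbers are made up), a small Python sketch:

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Smallest m satisfying m >= (1/epsilon) * (ln|H| + ln(1/delta)):
    with probability at least 1 - delta, any hypothesis in H that is
    consistent with m such examples has true error below epsilon."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# e.g. conjunctions over 10 boolean attributes: |H| = 3**10 + 1 hypotheses
print(pac_sample_bound(3**10 + 1, epsilon=0.1, delta=0.05))  # 140
```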

Other theories for

•  Reinforcement skill learning

•  Semi-supervised learning

•  Active student querying

•  …

… also relating:

•  # of mistakes during learning

•  learner’s query strategy

•  convergence rate

•  asymptotic performance

•  bias, variance

(supervised concept learning)

Page 10:

Growth of Machine Learning

•  Machine learning is already the preferred approach to
–  Speech recognition, Natural language processing
–  Computer vision
–  Medical outcomes analysis
–  Robot control
–  …

•  This ML niche is growing (why?)

[Figure: "ML apps." shown as a growing region within "All software apps."]

Page 11:

Growth of Machine Learning

•  Machine learning is already the preferred approach to
–  Speech recognition, Natural language processing
–  Computer vision
–  Medical outcomes analysis
–  Robot control
–  …

•  This ML niche is growing
–  Improved machine learning algorithms
–  Increased data capture, networking
–  Software too complex to write by hand
–  New sensors / IO devices
–  Demand for self-customization to user, environment

[Figure: "ML apps." shown as a growing region within "All software apps."]

Page 12:

Function Approximation and Decision Tree Learning

Page 13:

Function approximation

Setting:
•  Set of possible instances X
•  Unknown target function f : X → Y
•  Set of function hypotheses H = { h | h : X → Y }

Given:
•  Training examples { <xi, yi> } of unknown target function f

Determine:
•  Hypothesis h ∈ H that best approximates f

Page 14:

Each internal node: tests one attribute Xi

Each branch from a node: selects one value for Xi

Each leaf node: predicts Y (or P(Y | X ∈ leaf))
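To make this concrete, here is a minimal Python sketch (not from the slides) of such a tree: internal nodes test one attribute, branches select one of its values, and leaves predict Y. The attribute names are illustrative only.

```python
# Minimal decision-tree representation: an internal node tests one attribute,
# each branch corresponds to one value of that attribute, a leaf predicts Y.
class Leaf:
    def __init__(self, prediction):
        self.prediction = prediction

class Node:
    def __init__(self, attribute, branches):
        self.attribute = attribute   # attribute Xi tested at this node
        self.branches = branches     # dict: attribute value -> subtree

def predict(tree, x):
    """Follow the branches selected by x's attribute values until a leaf."""
    while isinstance(tree, Node):
        tree = tree.branches[x[tree.attribute]]
    return tree.prediction

# Illustrative example (attribute names are made up, not taken from the slides):
tree = Node("Outlook", {
    "Sunny":    Node("Humidity", {"High": Leaf("No"),   "Normal": Leaf("Yes")}),
    "Overcast": Leaf("Yes"),
    "Rain":     Node("Wind",     {"Strong": Leaf("No"), "Weak":   Leaf("Yes")}),
})
print(predict(tree, {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # Yes
```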

Page 15:

Decision Trees

How would you represent the boolean function A ∧ B?  A ∨ B?

How would you represent A ∧ B ∨ C ∧ D ∧ (¬E)?
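One way to see these as trees (a sketch, not from the slides): a decision tree expresses a boolean function as nested attribute tests, so (A ∧ B) ∨ (C ∧ D ∧ ¬E) becomes a chain of tests, with the C/D/E subtree repeated on the branches where A ∧ B fails.

```python
from itertools import product

def f(A, B, C, D, E):
    """(A ∧ B) ∨ (C ∧ D ∧ ¬E) written as the nested attribute tests a
    decision tree would make; note the C/D/E subtree is duplicated on the
    branches where A ∧ B fails, since trees cannot share subtrees."""
    def cde(C, D, E):          # subtree testing C, then D, then E
        if C:
            if D:
                return not E
            return False
        return False
    if A:
        if B:
            return True        # the A ∧ B branch
        return cde(C, D, E)
    return cde(C, D, E)

# sanity check against the flat boolean formula over all 32 assignments
assert all(
    f(a, b, c, d, e) == ((a and b) or (c and d and not e))
    for a, b, c, d, e in product([False, True], repeat=5)
)
```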

Page 16: [no text content]

Page 17:

node = Root

[ID3, C4.5, …]
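Since this slide names the ID3/C4.5 family, here is a compact Python sketch of greedy top-down tree induction: pick the attribute with the highest information gain, split, and recurse. It is an illustration under simplifying assumptions (categorical attributes, information gain as the split criterion, no pruning), not the slide's exact pseudocode, and the toy data is made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the empirical class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    """Information gain of splitting `examples` (list of (x, y)) on `attr`."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attributes):
    """Greedy top-down induction: pick the best attribute, split, recurse.
    Leaves are class labels; internal nodes are (attribute, {value: subtree})."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:            # training examples perfectly classified
        return labels[0]
    if not attributes:                   # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    branches = {}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        branches[v] = id3(subset, [a for a in attributes if a != best])
    return (best, branches)

# Tiny made-up training set (attribute names are illustrative only)
data = [({"Outlook": "Sunny",    "Wind": "Weak"},   "No"),
        ({"Outlook": "Sunny",    "Wind": "Strong"}, "No"),
        ({"Outlook": "Rain",     "Wind": "Weak"},   "Yes"),
        ({"Outlook": "Rain",     "Wind": "Strong"}, "No"),
        ({"Outlook": "Overcast", "Wind": "Weak"},   "Yes")]
print(id3(data, ["Outlook", "Wind"]))
```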

Page 18:

Random Variables, Distributions, Entropy

Page 19:

Entropy

Entropy H(X) of a random variable X:

H(X) = − Σ_{i=1}^{n} P(X = i) log2 P(X = i)

H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).

Why? Information theory:
•  The most efficient code assigns −log2 P(X = i) bits to encode the message X = i
•  So, the expected number of bits to code one random X is

Σ_{i=1}^{n} P(X = i) ( −log2 P(X = i) )      (n = # of possible values for X)
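As a quick numerical illustration (not from the slides): a fair coin has one bit of entropy, and skewed distributions have less.

```python
import math

def entropy(probs):
    """H(X) = -sum_i P(X=i) * log2 P(X=i); zero-probability values contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0    (fair coin)
print(entropy([0.8, 0.2]))  # ~0.722 (less uncertain)
print(entropy([1.0]))       # 0.0    (no uncertainty at all)
```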

Page 20:

Sample Entropy

Page 21:

Entropy

Entropy H(X) of a random variable X:
H(X) = − Σ_{i=1}^{n} P(X = i) log2 P(X = i)

Specific conditional entropy H(X|Y=v) of X given Y = v:
H(X|Y=v) = − Σ_{i=1}^{n} P(X = i | Y = v) log2 P(X = i | Y = v)

Page 22:

Entropy

Entropy H(X) of a random variable X:
H(X) = − Σ_{i=1}^{n} P(X = i) log2 P(X = i)

Specific conditional entropy H(X|Y=v) of X given Y = v:
H(X|Y=v) = − Σ_{i=1}^{n} P(X = i | Y = v) log2 P(X = i | Y = v)

Conditional entropy H(X|Y) of X given Y:
H(X|Y) = Σ_{v ∈ values(Y)} P(Y = v) H(X|Y=v)

Page 23:

Entropy

Entropy H(X) of a random variable X:
H(X) = − Σ_{i=1}^{n} P(X = i) log2 P(X = i)

Specific conditional entropy H(X|Y=v) of X given Y = v:
H(X|Y=v) = − Σ_{i=1}^{n} P(X = i | Y = v) log2 P(X = i | Y = v)

Conditional entropy H(X|Y) of X given Y:
H(X|Y) = Σ_{v ∈ values(Y)} P(Y = v) H(X|Y=v)

Mutual information (aka information gain) of X and Y:
I(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

Page 24:

Gain(S, A) = mutual information between A and the target class variable, over sample S:

Gain(S, A) = Entropy(S) − Σ_{v ∈ values(A)} ( |S_v| / |S| ) Entropy(S_v)

where S_v is the subset of S for which A = v.
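For instance (made-up numbers, not from the slides): if S contains 8 positive and 8 negative examples, so Entropy(S) = 1.0, and a boolean attribute A splits S into S_true with 6+/2− (Entropy ≈ 0.811) and S_false with 2+/6− (Entropy ≈ 0.811), then Gain(S, A) = 1.0 − (8/16)(0.811) − (8/16)(0.811) ≈ 0.19.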

Pages 25–27: [no text content]

Page 28:

Decision Tree Learning Applet

•  http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

Page 29:

Which Tree Should We Output?

•  ID3 performs a heuristic search through the space of decision trees

•  It stops at the smallest acceptable tree. Why?

William of Occam

Page 30:

Why Prefer Short Hypotheses? (Occam’s Razor)

Arguments in favor:

Arguments opposed:

Page 31:

Why Prefer Short Hypotheses? (Occam’s Razor)

Argument in favor:
•  Fewer short hypotheses than long ones
–  a short hypothesis that fits the data is less likely to be a statistical coincidence
–  it is highly probable that a sufficiently complex hypothesis will fit the data

Argument opposed:
•  There are also fewer hypotheses with a prime number of nodes and attributes beginning with "Z"
•  What's so special about "short" hypotheses?

Pages 32–36: [no text content]

Page 37:

Reduced Error Pruning

Split available training data into training and pruning sets

1.  Learn a tree that classifies the training set perfectly
2.  Do until further pruning is harmful over the pruning set:

–  evaluate the impact on pruning-set accuracy of pruning each possible node (plus those below it)

–  greedily remove the node that best improves pruning-set accuracy

This produces the smallest version of the most accurate tree (over the pruning set)

What if data is limited...?
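A minimal Python sketch of this procedure, assuming dict-based trees, and labeling a collapsed node with the majority class of the pruning examples that reach it (Mitchell's text uses the training examples affiliated with the node; that detail is simplified here):

```python
from collections import Counter
import copy

# Tree representation: internal node = {"attr": name, "branches": {value: subtree}},
# leaf = class label. Data = list of (feature_dict, label) pairs.

def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attr"]]]   # assumes known attribute values
    return tree

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def internal_node_paths(tree, path=()):
    """Yield the branch-value path from the root to every internal node."""
    if isinstance(tree, dict):
        yield path
        for v, sub in tree["branches"].items():
            yield from internal_node_paths(sub, path + (v,))

def examples_reaching(tree, path, data):
    """Examples routed through the branch values in `path`."""
    out = []
    for x, y in data:
        node, ok = tree, True
        for v in path:
            if x[node["attr"]] != v:
                ok = False
                break
            node = node["branches"][v]
        if ok:
            out.append((x, y))
    return out

def collapse(tree, path, label):
    """Copy of `tree` with the node at `path` replaced by leaf `label`."""
    new = copy.deepcopy(tree)
    if not path:
        return label
    node = new
    for v in path[:-1]:
        node = node["branches"][v]
    node["branches"][path[-1]] = label
    return new

def reduced_error_prune(tree, prune_set):
    """Greedily collapse whichever node most improves (or ties) pruning-set
    accuracy; stop when no single collapse helps."""
    while True:
        base = accuracy(tree, prune_set)
        best = None
        for path in internal_node_paths(tree):
            reached = examples_reaching(tree, path, prune_set)
            if not reached:
                continue
            label = Counter(y for _, y in reached).most_common(1)[0][0]
            candidate = collapse(tree, path, label)
            acc = accuracy(candidate, prune_set)
            if acc >= base and (best is None or acc > best[0]):
                best = (acc, candidate)
        if best is None:
            return tree
        tree = best[1]
```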

Pages 38–43: [no text content]

Page 44:

What you should know:

•  Well-posed function approximation problems:
–  Instance space, X
–  Sample of labeled training data { <xi, yi> }
–  Hypothesis space, H = { f : X → Y }

•  Learning is a search/optimization problem over H
–  Various objective functions
•  minimize training error
•  minimize error over a separate pruning set after constructing the tree from the training set
•  among hypotheses that minimize training error, select the smallest (?)

•  Decision tree learning
–  Greedy top-down learning of decision trees (ID3, C4.5, ...)
–  Overfitting and tree/rule post-pruning
–  Extensions…

Page 45:

Questions to think about (1)

•  ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?

Page 46:

Questions to think about (2)

•  Consider a target function f : <x1, x2> → y, where x1 and x2 are real-valued and y is boolean. What is the set of decision surfaces describable with decision trees that use each attribute at most once?

Page 47:

Questions to think about (3)

•  Why use Information Gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?