INTRODUCTION TO ARTIFICIAL INTELLIGENCE Massimo Poesio Machine Learning: Decision Trees


Page 1:

INTRODUCTION TO ARTIFICIAL INTELLIGENCE

Massimo Poesio

Machine Learning: Decision Trees

Page 2:

A DEFINITION OF LEARNING: LEARNING AS IMPROVEMENT

Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the task or tasks drawn from the same population more efficiently and more effectively the next time. -- Herb Simon

Page 3:


LEARNING AS IMPROVEMENT, 2

Improve on task T, with respect to performance metric P, based on experience E.

T: Assign to words their senses.
P: Percentage of words correctly classified.
E: Corpus of words, some with human-given labels.

T: Categorize email messages as spam or legitimate.
P: Percentage of email messages correctly classified.
E: Database of emails, some with human-given labels.

T: Playing checkers.
P: Percentage of games won against an arbitrary opponent.
E: Playing practice games against itself.

T: Recognizing hand-written words.
P: Percentage of words correctly classified.
E: Database of human-labeled images of handwritten words.

Page 4:


SPECIFYING A LEARNING SYSTEM

• Choose the training experience.
• Choose exactly what is to be learned, i.e. the target function.
• Choose how to represent the target function.
• Choose a learning algorithm to infer the target function from the experience.

[Diagram: Environment/Experience -> Learner -> Knowledge -> Performance Element]

Page 5:

FEATURES

• The functions learned by ML algorithms specify a mapping from input FEATURES to output FEATURES

Page 6:

EXAMPLE 1: CHECKERS

• Features used in the linear function seen last time:
  – bp(b): number of black pieces on board b
  – rp(b): number of red pieces on board b
  – bk(b): number of black kings on board b
  – rk(b): number of red kings on board b
  – bt(b): number of black pieces threatened (i.e. which can be immediately taken by red on its next turn)
  – rt(b): number of red pieces threatened
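
As a reminder of the form of that linear function, here is a minimal Python sketch. The board is represented directly by its feature values, and the weights are hypothetical placeholders; in the actual checkers learner they would be estimated from playing experience, not set by hand.

    FEATURES = ["bp", "rp", "bk", "rk", "bt", "rt"]

    def evaluate(board, weights):
        """Linear evaluation: V(b) = w0 + w1*bp(b) + ... + w6*rt(b)."""
        w0, *ws = weights
        return w0 + sum(w * board[f] for w, f in zip(ws, FEATURES))

    # Example: black is a piece up and threatens one red piece.
    board = {"bp": 12, "rp": 11, "bk": 0, "rk": 0, "bt": 0, "rt": 1}
    weights = [0.0, 1.0, -1.0, 3.0, -3.0, -0.5, 0.5]   # hypothetical values
    print(evaluate(board, weights))                    # -> 1.5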

Page 7:

EXAMPLE 2: WHEN THE NEIGHBOR DRIVES

• Suppose we are trying to learn when our neighbor goes to work by car, so that we can ask for a ride.

• Their decision appears to be influenced by:
  – Temperature
  – Whether it’s going to rain or not
  – Day of the week
  – Whether they need to stop at a shop on the way back
  – How they are dressed

Page 8:

PAST EXPERIENCE IN TERMS OF FEATURES
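
(The feature table on this slide can be reconstructed from the labeled examples that reappear on Page 21; “+” = drive, “-” = walk. The column names are our best guess.)

    #   Day   Temp   Precip.  Clothing   Keys      Decision
    1   Sat   hot    no       casual     keys      drive (+)
    2   Mon   cold   snow     casual     no-keys   walk (-)
    3   Tue   hot    no       casual     no-keys   walk (-)
    4   Tue   cold   rain     casual     no-keys   walk (-)
    5   Wed   hot    rain     casual     keys      drive (+)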

Page 9:

PREDICTION TASK

Page 10:

PREDICTION

Page 11:

THE NEED FOR AVERAGING

Page 12:

THE NEED FOR GENERALIZATION

Page 13:

FIRST EXAMPLE OF ML ALGORITHM: DECISION TREES

• A method developed independently by Quinlan in AI and by Breiman et al. in statistics

Page 14:


DECISION TREES

• Tree-based classifiers for instances represented as feature vectors. Nodes test features, there is one branch for each value of the feature, and leaves specify the category.

• Can represent arbitrary conjunctions and disjunctions. Can represent any classification function over discrete feature vectors.

color
├─ red → shape
│         ├─ circle → pos
│         ├─ square → neg
│         └─ triangle → neg
├─ blue → neg
└─ green → pos

color
├─ red → shape
│         ├─ circle → A
│         ├─ square → B
│         └─ triangle → C
├─ blue → B
└─ green → C
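
Later slides traverse such a tree to classify an instance. As a concrete illustration (our own encoding, not from the slides), the first tree above can be written as nested Python dictionaries, with a classifier that walks from the root down to a leaf:

    # An internal node is {"feature": ..., "branches": {value: subtree}};
    # a leaf is simply the category label.
    tree = {
        "feature": "color",
        "branches": {
            "red": {
                "feature": "shape",
                "branches": {"circle": "pos", "square": "neg", "triangle": "neg"},
            },
            "blue": "neg",
            "green": "pos",
        },
    }

    def classify(tree, instance):
        """Follow the branch matching the instance's value at each node."""
        while isinstance(tree, dict):
            tree = tree["branches"][instance[tree["feature"]]]
        return tree

    print(classify(tree, {"color": "red", "shape": "square"}))   # -> neg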

Page 15:

A DECISION TREE FOR THE DRIVING PROBLEM

Page 16:

LEARNING DECISION TREES FROM DATA

• Use the data in the training set to build a decision tree that will then be used to make decisions on unseen data

• The decision tree specifies a function

Page 17:

TRAVERSING THE DECISION TREE

Page 18:

TRAVERSING THE DECISION TREE

Page 19:

TRAVERSING THE DECISION TREE

Page 20:

Decision Tree Learning

• Discrete class values
  – Slight changes in the input have either no effect or a full effect on the classification
• Discrete feature values (or discretized)
• Fast
• Modern DT induction algorithms handle:
  – Noisy feature values
  – Noisy labels
  – Missing feature values

Page 21:

Top-down DT induction

• Partition training examples into good “splits”, based on values of a single “good” feature:

(1) Sat, hot, no, casual, keys -> +
(2) Mon, cold, snow, casual, no-keys -> -
(3) Tue, hot, no, casual, no-keys -> -
(4) Tue, cold, rain, casual, no-keys -> -
(5) Wed, hot, rain, casual, keys -> +

Page 22:

Top-down DT induction

keys?
├─ yes → Drive: 1, 5
└─ no → Walk: 2, 3, 4

Page 23:

Top-down DT induction

• Partition training examples into good “splits”, based on values of a single “good” feature

(1) Sat, hot, no, casual -> +
(2) Mon, cold, snow, casual -> -
(3) Tue, hot, no, casual -> -
(4) Tue, cold, rain, casual -> -
(5) Wed, hot, rain, casual -> +

• No acceptable classification: proceed recursively

Page 24:

Top-down DT induction

t?
├─ cold → Walk: 2, 4
└─ hot → Drive: 1, 5; Walk: 3

Page 25:

Top-down DT induction

t?
├─ cold → Walk: 2, 4
└─ hot → day?
         ├─ Sat → Drive: 1
         ├─ Tue → Walk: 3
         └─ Wed → Drive: 5

Page 26:

Top-down DT induction

t?
├─ cold → Walk: 2, 4
└─ hot → day?
         ├─ Sat → Drive: 1
         ├─ Tue → Walk: 3
         ├─ Wed → Drive: 5
         └─ Mon, Thu, Fri, Sun → ? (Drive)

Page 27:

Top-down DT induction: divide-and-conquer algorithm

• Pick a feature.
• Split your examples into subsets based on the values of the feature.
• For each subset, examine the examples:
  – Zero examples: assign the most popular class of the parent.
  – All from the same class: assign this class.
  – Otherwise, process recursively.
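
A minimal Python sketch of this divide-and-conquer procedure, reusing the nested-dictionary tree encoding from earlier. The choice of split feature is deliberately naive here (just the first one still available); the following slides replace it with an information-gain heuristic.

    from collections import Counter

    def majority(labels):
        return Counter(labels).most_common(1)[0][0]

    def induce(examples, features, domains):
        """examples: list of (feature_dict, label) pairs;
        features: feature names still available for splitting;
        domains: feature name -> set of its possible values."""
        labels = [y for _, y in examples]
        if len(set(labels)) == 1:      # all from the same class: assign it
            return labels[0]
        if not features:               # nothing left to split on
            return majority(labels)
        f, rest = features[0], features[1:]   # naive choice of split feature
        node = {"feature": f, "branches": {}}
        for v in domains[f]:
            subset = [(x, y) for x, y in examples if x[f] == v]
            if not subset:             # zero examples: parent's most popular class
                node["branches"][v] = majority(labels)
            else:
                node["branches"][v] = induce(subset, rest, domains)
        return node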

Page 28:

Top-Down DT induction

Different trees can be built for the same data, depending on the order of features:

t?
├─ cold → Walk: 2, 4
└─ hot → day?
         ├─ Sat → Drive: 1
         ├─ Tue → Walk: 3
         ├─ Wed → Drive: 5
         └─ Mon, Thu, Fri, Sun → ? (Drive)

Page 29:

Top-down DT induction

Different trees can be built for the same data, depending on the order of features:

t?
├─ cold → Walk: 2, 4
└─ hot → day?
         ├─ Sat → Drive: 1
         ├─ Tue → Walk: 3
         ├─ Wed → Drive: 5
         └─ Mon → clothing?
                  ├─ casual → Walk: ?
                  └─ halloween → Drive: ?

Page 30:

Selecting features

• Intuitively:
  – We want more “informative” features to be higher in the tree:
    • Is it Monday? Is it raining? Good political news? No Halloween clothes? Hat on? Coat on? Car keys? Yes?? -> Driving! (this does not look like a good learning job)
  – We want a nice, compact tree.

Page 31:

Selecting features, 2

• Formally:
  – Define “tree size” (number of nodes, number of leaves, depth, ...)
  – Try all the possible trees and find the smallest one: NP-hard
• Top-down DT induction is a greedy search, and depends on heuristics for feature ordering (so there is no optimality guarantee):
  – Information gain

Page 32:

Entropy

Information theory: entropy is the number of bits needed to encode some piece of information.

S: a set of N examples, p*N positive (“Walk”) and q*N negative (“Drive”).

Entropy(S) = -p*lg(p) - q*lg(q)

p = 1, q = 0 => Entropy(S) = 0
p = 1/2, q = 1/2 => Entropy(S) = 1
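
A direct transcription of the formula into Python; the guard against zero probabilities follows the usual convention 0*lg(0) = 0, which the slide leaves implicit.

    from math import log2

    def entropy(p, q):
        """Entropy of a two-class set with class proportions p and q (p + q = 1)."""
        return -sum(r * log2(r) for r in (p, q) if r > 0)

    print(entropy(1.0, 0.0))   # 0.0  : a pure set needs no bits
    print(entropy(0.5, 0.5))   # 1.0  : a 50/50 split needs one full bit
    print(entropy(0.6, 0.4))   # 0.97 : the E(S) used on the next slides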

Page 33:

Entropy and Decision Trees

keys?
├─ no → Walk: 2, 3, 4
└─ yes → Drive: 1, 5

E(S) = -0.6*lg(0.6) - 0.4*lg(0.4) = 0.97

E(S_no) = 0    E(S_keys) = 0

Page 34:

Entropy and Decision Trees

t?
├─ cold → Walk: 2, 4
└─ hot → Drive: 1, 5; Walk: 3

E(S) = -0.6*lg(0.6) - 0.4*lg(0.4) = 0.97

E(S_cold) = 0    E(S_hot) = -(1/3)*lg(1/3) - (2/3)*lg(2/3) = 0.92

Page 35:

Information gain

• For each feature f, compute the reduction in entropy on the split:

Gain(S, f) = E(S) - Σ_i E(S_i) * |S_i| / |S|

f = keys?: Gain(S, f) = 0.97
f = t?: Gain(S, f) = 0.97 - 0*(2/5) - 0.92*(3/5) = 0.42
f = clothing?: Gain(S, f) = ?
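
The same computation in Python, run on the five examples from Page 21 (the feature names are our own labels for the tuple positions). It reproduces the two values above, and running it on “clothing” answers the question on this slide.

    from collections import Counter
    from math import log2

    def set_entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def gain(examples, f):
        """Gain(S, f) = E(S) minus the size-weighted entropies of the split."""
        g = set_entropy([y for _, y in examples])
        for v in {x[f] for x, _ in examples}:
            sub = [y for x, y in examples if x[f] == v]
            g -= len(sub) / len(examples) * set_entropy(sub)
        return g

    S = [({"day": "Sat", "t": "hot",  "prec": "no",   "clothing": "casual", "keys": "yes"}, "+"),
         ({"day": "Mon", "t": "cold", "prec": "snow", "clothing": "casual", "keys": "no"},  "-"),
         ({"day": "Tue", "t": "hot",  "prec": "no",   "clothing": "casual", "keys": "no"},  "-"),
         ({"day": "Tue", "t": "cold", "prec": "rain", "clothing": "casual", "keys": "no"},  "-"),
         ({"day": "Wed", "t": "hot",  "prec": "rain", "clothing": "casual", "keys": "yes"}, "+")]

    print(round(gain(S, "keys"), 2))   # 0.97
    print(round(gain(S, "t"), 2))      # 0.42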

Page 36:

Divide-and-conquer with information gain

• Batch learning (read all the input data, compute information gain based on all the examples simultaneously)
• Greedy search; may find local optima
• Outputs a single solution
• Optimizes depth

Page 37:

Complexity

• Worst case: build a complete tree
  – Compute gains on all nodes: at level i, we have already examined i features; m - i remain

• In practice the tree is rarely complete, and induction is linear in the number of features and the number of examples (i.e., very fast)

Page 38:

Overfitting

• Suppose we build a very complex tree... is it good?

• Last lecture: we measure the quality (“goodness”) of the prediction, not the performance on the training data

• Why complex trees can yield mistakes:
  – Noise in the data
  – Even without noise, decisions at the last levels are based on too few observations

Page 39:

Overfitting

Mon: Walk (50 observations), Drive (5)
Tue: Walk (40), Drive (3)
Wed: Drive (1)
Thu: Walk (42), Drive (14)
Fri: Walk (50)
Sat: Drive (20), Walk (20)
Sun: Drive (10)

• Can we conclude that “Wed -> Drive”?

Page 40:

Overfitting

• A hypothesis H is said to overfit the training data if there exists another hypothesis H' such that:

  Error(H, train data) <= Error(H', train data)
  Error(H, unseen data) > Error(H', unseen data)

• Overfitting is related to hypothesis complexity: a more complex hypothesis (e.g., a larger decision tree) overfits more

Page 41:

Overfitting Prevention for DT: Pruning

• “Prune” a complex tree: produce a smaller tree that is less accurate on the training data

  Original tree: ... Mon: hot -> drive (2), cold -> walk (100)
  Pruned tree: ... Mon -> walk (100/2)

• Post-pruning / pre-pruning

Page 42:

Pruning criteria

• Cross-validation
  – Reserve some training data to evaluate the utility of the subtrees

• Statistical tests: use a test to determine whether the observations at a given level could be random

• MDL (minimum description length): compare the added complexity against memorizing exceptions
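
As an illustration of the cross-validation criterion, here is a sketch of a single pruning decision in the reduced-error style (our example, not from the slides): a subtree is kept only if it beats the bare majority-class leaf on held-out examples. A full pruner applies this test bottom-up to every subtree.

    def classify(tree, x):
        """Walk a nested-dict tree (as in the earlier sketches) to a leaf label."""
        while isinstance(tree, dict):
            tree = tree["branches"][x[tree["feature"]]]
        return tree

    def prune_node(subtree, majority_label, heldout):
        """Replace the subtree by a majority leaf unless it helps on held-out data."""
        keep = sum(classify(subtree, x) == y for x, y in heldout)
        cut = sum(majority_label == y for _, y in heldout)
        return subtree if keep > cut else majority_label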

Page 43:

DT: issues

• Splitting criteria
  – Information gain is biased towards features with many values
• Non-discrete features
• Non-discrete outputs (“regression trees”)
• Costs
• Missing values
• Incremental learning
• Memory issues

Page 44:

ACKNOWLEDGMENTS

• Some of the slides are from:
  – Ray Mooney’s UTexas ML course
  – MIT OpenCourseWare AI course