
Introduction to Artificial Intelligence

COMP307 Machine Learning 3 – Decision Tree Learning Method

Yi Mei (yi.mei@ecs.vuw.ac.nz)


Outline

• Decision tree learning vs learned decision trees

• How to build a decision tree using a set of instances

• How to measure a node in a decision tree: (im)purity measures

• Design issues of DT learning


Decision Tree

• A tree-like model for making decisions


Decision Trees vs DT Learning

• A Decision Tree (DT) is a classifier
– Symbolic representation, not probabilistic
– Essentially a rule
– "Easy" to interpret

• DT learning is a learning process
– To find a DT: the output/solution is a DT
– One of the oldest classification learning methods in AI
– Also developed independently in Statistics/Operations Research


Example (Training) Dataset

• Approve/Reject a loan application?


Applicant  Job    Deposit  Family    Class
1          true   low      single    Approve
2          true   low      couple    Approve
3          true   low      single    Approve
4          true   high     single    Approve
5          false  high     couple    Approve
6          false  low      couple    Reject
7          true   low      children  Reject
8          false  low      single    Reject
9          false  high     children  Reject


Example DT

• An example DT:


[Figure: a decision tree for the loan data with feature nodes Job, Deposit and Family, edges labelled true/false, high/low and single/couple/children, and Approve/Reject leaves]

Building Decision Trees

• You can always build a decision tree trivially
– Choose some order on the attributes
– Build the tree with one attribute for each level
– Label each leaf with the appropriate class

[Figure: a full tree testing attributes A, B, C and D level by level, with one leaf (class X, Y or Z) for every possible instance]

• Problems
– Each leaf represents a possible instance
– All we are doing is remembering every instance: no generalisation, no prediction, no learning

• Solution
– Find a small decision tree
– Capture the common features of instances
– Probably generalise to predict classes for unseen instances

[Figure: a much smaller tree over the same attributes that still assigns the classes X, Y and Z]

Building a Good Decision Tree

• Input: instances described by attribute-value pairs
• Output: a "good" decision tree classifier
– Critical issue: choosing which attribute to use next

• DT algorithm:

    Examine the set of instances in the root node
    If the set is "pure" enough, or there are no more attributes,
        then stop
    Else
        Construct the subsets of instances in the subnodes
        Compute the average "purity" of the subnodes
        Choose the best attribute
        Recurse on each subnode
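A minimal Python sketch of this recursive algorithm, assuming categorical features and two classes, and using the P(A)P(B) impurity introduced later in the lecture; the function names and data layout are illustrative, not the course's reference code.

    from collections import Counter

    def impurity(labels):
        # P(A)*P(B) impurity of a two-class node: mn / (m + n)^2, which is 0 for a pure node
        counts = Counter(labels)
        if not counts:
            return 0.0
        m = counts.most_common(1)[0][1]
        n = len(labels) - m
        return (m * n) / (len(labels) ** 2)

    def weighted_impurity(instances, labels, attr):
        # weighted average impurity of the subnodes produced by splitting on attr
        total = len(instances)
        score = 0.0
        for value in {inst[attr] for inst in instances}:
            subset = [lab for inst, lab in zip(instances, labels) if inst[attr] == value]
            score += (len(subset) / total) * impurity(subset)
        return score

    def build_tree(instances, labels, attributes):
        # leaf node if the set is pure or no attributes remain: predict the majority class
        if impurity(labels) == 0 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        # choose the attribute whose split has the lowest weighted average impurity
        attr = min(attributes, key=lambda a: weighted_impurity(instances, labels, a))
        children = {}
        for value in {inst[attr] for inst in instances}:
            idx = [i for i, inst in enumerate(instances) if inst[attr] == value]
            children[value] = build_tree([instances[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [a for a in attributes if a != attr])
        return (attr, children)   # internal node: (attribute tested, branches keyed by value)

Each instance here would be a dict such as {"Job": True, "Deposit": "low", "Family": "single"}, with labels holding the matching Approve/Reject classes.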

Decision Tree Building

• Choose the best attribute: which split of the root is best?

[Figure: three candidate splits of the root node for the loan data — on Job (true/false), on Deposit (high/low), and on Family (single/couple/children)]


Building a DT

• A simple way:
– Start from a root; select a feature at the root to be branched on
– Grow the tree depth-first or breadth-first (or in any other order)
– For each node, add one child node for each possible feature value
– A child node can be a feature node or a class (leaf) node

• Node: feature / class
• Edge: a value of the parent node's feature

• This simple approach cannot support numeric features (infinitely many possible values)


Building a DT

• Input: instances/samples
• Output: a "good" decision tree classifier

• A decision tree progressively splits the training set into smaller and smaller subsets

• Pure node: all the samples at that node have the same class label

• There is no need to further split a pure node
• Recursive tree-growing process: given the data at a node, decide whether the node should be a leaf node or find another feature test to split the node


Building a DT

• Start:
– Build a tree with a single root node
– The entire training set is at the root node

• Repeat until no split is needed:
– Select a node from the tree to examine
– Check the purity of the set at the examined node
– If the set is pure enough: turn the node into a leaf node labelled with the majority class
– Else: design a feature test to expand the node and split the set


Example (Training) Dataset

• Approve/Reject a loan application?

Applicant  Job    Deposit  Family    Class
1          true   low      single    Approve
2          true   low      couple    Approve
3          true   low      single    Approve
4          true   high     single    Approve
5          false  high     couple    Approve
6          false  low      couple    Reject
7          true   low      children  Reject
8          false  low      single    Reject
9          false  high     children  Reject


Design Issues

• Should the feature tests (answers to questions) be binary or multivalued? In other words, how many splits should be made at a node?
• Which feature or feature combination should be tested at a node?
• When should a node be declared a leaf node?
• If the tree becomes "too large", can it be pruned to make it smaller and simpler?
• If a leaf node is impure, how should a category label be assigned to it?


Number of Splits

• Binary: every question has a True/False answer



Feature Test Design

• Which feature/attribute should be used in the feature test?
• Greedy design: the question should make the child nodes as pure as possible
• Node (im)purity can be defined in different ways:
– Probability based
– Information theory based

Entropy-Based Impurity (from ML Tools 03: Entropy-Based Discretisation)

• Entropy measures the impurity or uncertainty in a group of examples
• S is the training set, with classes C1, …, CN
• E(S) measures the entropy of S, where Pc is the proportion of class Cc in S

[Figure: example groups of instances ranging from very impure (high entropy), through less impure (low entropy), to pure (null entropy)]
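The entropy formula itself did not survive the slide extraction; the standard definition matching the notation above (an assumption about what the original slide showed) is

    E(S) = − Σ_{c=1..N} P_c × log₂(P_c)

For example, a two-class node with proportions 0.7 and 0.3 has E(S) ≈ −0.7 log₂ 0.7 − 0.3 log₂ 0.3 ≈ 0.88 (impure, high entropy), a 50/50 node has the maximum E(S) = 1, and a pure node has E(S) = 0 (null entropy).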


Node (Im)purity Measure

• Assume there are two classes, A and B
• At a node: m instances of class A, n instances of class B
• Impurity: imp = P(A) × P(B) = (m / (m + n)) × (n / (m + n)) = mn / (m + n)²
– If pure (m = 0 or n = 0): imp = 0
– If m = n, imp is maximal
– Smooth

• The smaller the impurity, the better
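For example, a node with m = 7 instances of class A and n = 3 of class B has imp = (7 × 3) / 10² = 0.21, a perfectly balanced node (m = n = 5) has the maximum imp = 25/100 = 0.25, and a pure node has imp = 0.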

Measuring Purity

• Need a measure of how "pure" a node is
– All one class → pure → can predict the class
– A mixture of classes → impure → have to ask more questions

• Several functions
– Probability based
– Information theory based
– …

• Choose the attribute whose children have the best purity

(Im)Purity Measure: P(A)P(B)

[Figure: three example nodes: one all class A, one with 7 A's and 3 B's, and one all class B]

• Impurity: P(A)P(B) = (m / (m + n)) × (n / (m + n)) = mn / (m + n)², where m is the number of A's and n is the number of B's

• Goodness of an attribute:
– the average impurity of its subnodes

Weighting the Impurities

• How do we take the average?

[Figure: two candidate attributes. Att 1 splits the instances into two child nodes with impurities 1/5 × 4/5 = 16% and 4/5 × 1/5 = 16%, giving an average of 16%. Att 2 splits them into one node with impurity 1/1 × 0/1 = 0% and one with impurity 4/9 × 5/9 = 24.6%; is the average 12.3%? 20%?]

• Need to weight the nodes by the probability of going to each node

Weighting the Impurities (Continued)

[Figure: a true/false split into a node of 7 instances with P(node) = 7/10 and impurity = 4/7 × 3/7 = 12/49, and a node of 3 instances with P(node) = 3/10 and impurity = 2/3 × 1/3 = 2/9]

• Goodness of attribute = weighted average impurity of the subnodes
  = Σ_i [P(node_i) × impurity(node_i)]
  = (7/10 × 12/49) + (3/10 × 2/9)
  = 84/490 + 6/90 ≈ 0.238


Feature Test Design

• Goodness of a feature test: the average impurity of the child nodes

[Figure: feature test FT1 splits the data into two child nodes with impurities 1/5 × 4/5 = 4/25 and 4/5 × 1/5 = 4/25; average impurity = 4/25 = 16%]


Feature Test Design

• Goodness of a feature test: the average impurity of the child nodes

[Figure: feature test FT2 splits the data into two child nodes with impurities 1/1 × 0/1 = 0 and 4/9 × 5/9 = 20/81; average impurity = 10/81 ≈ 12.3%]

• So FT2 is better than FT1?


Feature Test Design

• Weight the impurities by the probability of each node
– Weighted average impurity = Σ_i P(node_i) × impurity(node_i)

[Figure: FT1 again, with P(node1) = 0.5 and P(node2) = 0.5; impurities 1/5 × 4/5 = 4/25 and 4/5 × 1/5 = 4/25; weighted average impurity = 4/25 = 16%]


Feature Test Design

• Weight the impurities by the probability of each node
– Weighted average impurity = Σ_i P(node_i) × impurity(node_i)

[Figure: FT2 again, with P(node1) = 0.1 and P(node2) = 0.9; impurities 1/1 × 0/1 = 0 and 4/9 × 5/9 = 20/81; weighted average impurity = 0.9 × 20/81 ≈ 22.2%]

• So FT1 is better than FT2 after weighting
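A quick numeric check of the FT1 vs FT2 comparison above, as a minimal Python sketch; the per-node class counts (1, 4)/(4, 1) and (1, 0)/(4, 5) are read off the figures, and the helper names are illustrative.

    def imp(m, n):
        # P(A)P(B) impurity of a node with m instances of one class and n of the other
        total = m + n
        return (m * n) / (total * total) if total else 0.0

    def weighted_average_impurity(children):
        # children: list of (m, n) class counts, one pair per child node
        total = sum(m + n for m, n in children)
        return sum(((m + n) / total) * imp(m, n) for m, n in children)

    ft1 = [(1, 4), (4, 1)]   # two child nodes of 5 instances each
    ft2 = [(1, 0), (4, 5)]   # a child node of 1 instance and a child node of 9

    print(weighted_average_impurity(ft1))   # 0.16    -> 16%
    print(weighted_average_impurity(ft2))   # 0.222…  -> 22.2%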


Numeric Features

• Approve/Reject a loan application?

Applicant  Job    Deposit  Family    Class
1          true   $10K     single    Approve
2          true   $7K      couple    Approve
3          true   $4K      single    Approve
4          true   $16K     single    Approve
5          false  $18K     couple    Approve
6          false  $6K      couple    Reject
7          true   $8K      children  Reject
8          false  $3K      single    Reject
9          false  $30K     children  Reject

What question to ask about Deposit?


Numeric Features

• Can split on a simple comparison
– Which split point?
– Consider the class boundaries

Example test: Deposit < $10K? (True / False)

The instances sorted by Deposit, with candidate split points at the class boundaries (<$4K, <$6K, <$7K, <$8K, <$10K, <$30K):

Applicant  Job    Deposit  Family    Class
8          false  $3K      single    Reject
3          true   $4K      single    Approve
6          false  $6K      couple    Reject
2          true   $7K      couple    Approve
7          true   $8K      children  Reject
1          true   $10K     single    Approve
4          true   $16K     single    Approve
5          false  $18K     couple    Approve
9          false  $30K     children  Reject


Numeric Features

• Consider the class boundaries and choose the split with the minimal weighted average impurity:
– (Deposit < $4K):  1/9 × imp(0:1) + 8/9 × imp(5:3) = 0.208
– (Deposit < $6K):  2/9 × imp(1:1) + 7/9 × imp(4:3) = 0.246
– (Deposit < $7K):  3/9 × imp(1:2) + 6/9 × imp(4:2) = 0.222
– (Deposit < $8K):  4/9 × imp(2:2) + 5/9 × imp(3:2) = 0.244
– (Deposit < $10K): 5/9 × imp(2:3) + 4/9 × imp(3:1) = 0.217
– (Deposit < $30K): 8/9 × imp(5:3) + 1/9 × imp(0:1) = 0.208
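A minimal Python sketch of evaluating those candidate thresholds under the P(A)P(B) impurity; the data layout and names are illustrative, not course code.

    def imp(a, b):
        # P(A)P(B) impurity of a node with a Approves and b Rejects
        total = a + b
        return (a * b) / (total * total) if total else 0.0

    # (Deposit in $K, class) pairs from the example dataset, sorted by Deposit
    data = [(3, "Reject"), (4, "Approve"), (6, "Reject"), (7, "Approve"),
            (8, "Reject"), (10, "Approve"), (16, "Approve"), (18, "Approve"),
            (30, "Reject")]

    def weighted_impurity(threshold):
        # weighted average impurity of the split (Deposit < threshold)
        left = [c for d, c in data if d < threshold]
        right = [c for d, c in data if d >= threshold]
        total = len(data)
        def side_imp(labels):
            return imp(labels.count("Approve"), labels.count("Reject"))
        return (len(left) / total) * side_imp(left) + (len(right) / total) * side_imp(right)

    for t in (4, 6, 7, 8, 10, 30):   # candidate split points at the class boundaries
        print(f"Deposit < ${t}K: {weighted_impurity(t):.3f}")
    # Deposit < $4K and Deposit < $30K tie for the lowest weighted impurity (0.208)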


When to Stop Splitting?

• If we stop splitting too early, the nodes are not pure enough
• If we stop too late, the tree becomes too large and complex, and can overfit

• Stop splitting a node when:
– The node is pure (or reaches some impurity threshold)
– The maximal tree depth/size is reached
– The best candidate split reduces the impurity by less than a preset threshold (e.g. 5%)
– …

• Principle: trade off tree complexity against accuracy/error
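One way these stopping tests could be wired into the recursive builder sketched earlier; a hedged illustration where max_depth and min_gain are assumed parameters and impurity() is the per-node helper from that sketch.

    def should_stop(labels, depth, best_split_impurity, max_depth=5, min_gain=0.05):
        # impurity() is the P(A)P(B) helper defined in the earlier build_tree sketch
        current = impurity(labels)
        if current == 0:                               # node is already pure
            return True
        if depth >= max_depth:                         # maximal tree depth/size reached
            return True
        if current - best_split_impurity < min_gain:   # best split barely reduces impurity
            return True
        return False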


Pruning

• Shrink the tree, i.e. make it smaller/simpler, to reduce overfitting
• Pruning is the inverse of splitting
• Any pair of leaves whose elimination yields only a satisfactory (small) increase in impurity is eliminated, and their common parent node becomes a leaf node
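A minimal sketch of that pruning rule for the two-class case, assuming a node representation that stores the class counts of the training instances reaching each node; the representation and the threshold are assumptions for illustration, not the course's algorithm.

    # node representation (assumed):
    #   leaf:  ("leaf", {"Approve": a, "Reject": r})
    #   split: ("split", attribute, {value: child_node}, {"Approve": a, "Reject": r})

    def counts_impurity(counts):
        # P(A)P(B) impurity from a two-class count dictionary
        m, n = counts.get("Approve", 0), counts.get("Reject", 0)
        total = m + n
        return (m * n) / (total * total) if total else 0.0

    def prune(node, threshold=0.05):
        if node[0] == "leaf":
            return node
        _, attr, children, counts = node
        children = {v: prune(child, threshold) for v, child in children.items()}
        if all(child[0] == "leaf" for child in children.values()):
            total = sum(counts.values())
            split_imp = sum((sum(child[1].values()) / total) * counts_impurity(child[1])
                            for child in children.values())
            merged_imp = counts_impurity(counts)
            # eliminate the leaves if merging them costs only a small increase in
            # impurity; the common parent becomes a leaf (majority class)
            if merged_imp - split_imp < threshold:
                return ("leaf", counts)
        return ("split", attr, children, counts)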


Summary

• Decision Tree vs DT learning
• How to build a decision tree from a set of training instances
• Design issues
– Node split
– Question/test design
– Numeric features to binary features
– Stopping criteria
– Pruning