Introduction to Artificial Intelligence
COMP307 Machine Learning 3 – Decision Tree Learning Method
Outline
• Decision tree learning vs learned decision trees
• How to build a decision tree using a set of instances
• How to measure a node in a decision tree: (im)purity measures
• Design issues of DT learning
Decision Tree
• A tree-like model for making decisions
Decision Trees vs DT Learning
• A Decision Tree (DT) is a classifier
– Symbolic representation, not probabilistic
– Essentially a rule
– “Easy” to interpret
• DT learning is a learning process
– To find a DT: the output/solution is a DT
– One of the oldest classification learning methods in AI
– Also developed independently in Statistics/Operations Research
Example (Training) Dataset
• Approve/Reject a loan application?
Applicant Job Deposit Family Class
1 true low single Approve
2 true low couple Approve
3 true low single Approve
4 true high single Approve
5 false high couple Approve
6 false low couple Reject
7 true low children Reject
8 false low single Reject
9 false high children Reject
Example DT
• An example DT:

[Figure: example decision tree. The root tests Family: children → Reject; single → Job (true → Approve, false → Reject); couple → Deposit (high → Approve; low → Job: true → Approve, false → Reject).]
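As a minimal Python sketch (not from the slides), the tree in the figure, whose structure is reconstructed here from the figure residue and the training data above, could be hand-coded as a classification function; the dict keys are illustrative:

    # Hand-coded version of the example DT above (an illustration only).
    def classify(applicant):
        """Classify a loan application dict with keys Job, Deposit, Family."""
        if applicant["Family"] == "children":
            return "Reject"
        if applicant["Family"] == "single":
            return "Approve" if applicant["Job"] else "Reject"
        # Family == "couple": test the deposit, then the job
        if applicant["Deposit"] == "high":
            return "Approve"
        return "Approve" if applicant["Job"] else "Reject"

    # Applicant 6 from the table: no job, low deposit, couple -> Reject
    print(classify({"Job": False, "Deposit": "low", "Family": "couple"}))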
Building Decision Trees
• You can always build a decision tree trivially
– Choose some order on the attributes
– Build tree with one attribute for each level
– Label each leaf with appropriate class
[Figure: a full tree that tests attribute A at the root, then B, C, and D at successive levels, with every leaf labelled with a class X, Y, or Z.]
• Problems
– Each leaf represents a possible instance
– All we are doing is remembering every instance: no generalisation, no prediction, no learning
[Figure: a much smaller tree over the same attributes that still classifies the instances correctly.]
• Solution
– Find a small decision tree that captures the common features of the instances
– Such a tree will probably generalise to predict classes for unseen instances
Building A Good Decision Tree
• Input: Instances described by attribute-value pairs
• Output: a “good” decision tree classifier
– Critical issue: choosing which attribute to use next
• DT algorithm (a sketch is given below):

    Examine the set of instances at the current node
    If the set is "pure" enough, or no more attributes remain
    then stop
    Else
        Construct the subsets of instances in the subnodes
        Compute the average "purity" of the subnodes
        Choose the best attribute
        Recurse on each subnode
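A minimal runnable Python sketch of that loop, assuming instances are (feature-dict, label) pairs and a weighted_impurity helper like the one sketched later in these notes; the names and the purity test are illustrative, not the course's code:

    from collections import Counter

    def build_tree(instances, attributes, purity_threshold=1.0):
        """Recursively build a DT; instances are (feature_dict, label) pairs."""
        labels = [label for _, label in instances]
        majority, count = Counter(labels).most_common(1)[0]
        # Stop when the node is pure enough or no attributes remain.
        if count / len(labels) >= purity_threshold or not attributes:
            return majority                      # leaf: predict the majority class
        # Choose the attribute whose split has the lowest weighted average impurity.
        best = min(attributes, key=lambda a: weighted_impurity(instances, a))
        remaining = [a for a in attributes if a != best]
        children = {}
        for value in {feats[best] for feats, _ in instances}:
            subset = [(f, l) for f, l in instances if f[best] == value]
            children[value] = build_tree(subset, remaining, purity_threshold)
        return (best, children)                  # internal node: (attribute, subtrees)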
Decision Tree Building

[Figure: three candidate root splits for the loan dataset: Job (true/false), Deposit (high/low), and Family (single/couple/children, where children → Reject). Which is best? Choose the best attribute.]
Building a DT
• A simple way:
– Start from a root; select a feature at the root to be branched on
– Grow the tree depth-first or breadth-first (or in any other order)
– For each node, add one child node for each possible feature value
– A child node can be a feature node or a class (leaf node)
• Node: a feature or a class
• Edge: a value of the parent node's feature
• Cannot support numeric features (infinitely many possible values)
Building a DT
• Input: instances/samples
• Output: a “good” decision tree classifier
• A decision tree progressively splits the training set into smaller subsets
• Pure node: all the samples at that node have the same class label
• No need to further split a pure node
• Recursive tree-growing process: given the data at a node, either declare the node a leaf node or find another feature test to split the node
Building a DT
• Start:
– Build a tree with a single root node
– The entire training set is at the root node
• Repeat until no split is needed:
– Select a node from the tree to examine
– Check the purity of the set at the examined node
– If the set is pure enough: make the node a leaf node of the majority class
– Else: design a feature test to expand the node and split the set
Design Issues
• Should the feature tests (answers to questions) be binary or multivalued? In other words, how many splits should be made at a node?
• Which feature or feature combination should be tested at a node?
• When should a node be declared a leaf node?
• If the tree becomes “too large”, can it be pruned to make it smaller and simpler?
• If a leaf node is impure, how should a category label be assigned to it?
Number of Splits
• Binary: every question has a True/False answer
Feature Test Design
• Which feature/attribute should be used in the feature test?
• Greedy design: the question should make the child nodes as pure as possible
• Node (im)purity: can be defined in different ways
– Probability based
– Information theory based
ML Tools 03: Entropy Based Discretisation
• Entropy measures the impurity or uncertainty in a group of examples
• S is the training set, C1, …, CN the classes
• E(S) measures the entropy of S, where Pc is the proportion of class Cc in S:

E(S) = −∑_c Pc log2(Pc)

[Figure: three groups of examples: very impure (high entropy), less impure (low entropy), and pure / least impure (null entropy).]
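The entropy formula translates directly to Python; a small sketch:

    import math
    from collections import Counter

    def entropy(labels):
        """E(S) = -sum over classes of P_c * log2(P_c)."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    print(entropy(["A", "A", "B", "B"]))  # 1.0: maximally impure for two classes
    print(entropy(["A", "A", "A", "A"]))  # -0.0: a pure node has null entropy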
Node (Im)purity Measure
• Assume there are two classes A and B
• At a node: m instances of class A, n instances of class B
• Impurity: imp = P(A)P(B) = m/(m+n) × n/(m+n) = mn/(m+n)²
– If pure (m = 0 or n = 0): imp = 0
– If m = n: imp is at its maximum
– Smooth
The smaller the better
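As a quick sketch, the P(A)P(B) measure in Python, with the two properties above checked:

    def impurity(m, n):
        """imp = P(A)P(B) = mn / (m + n)^2 for a node with m A's and n B's."""
        return (m * n) / (m + n) ** 2 if m + n else 0.0

    assert impurity(5, 0) == 0.0    # pure node: imp = 0
    assert impurity(3, 3) == 0.25   # m == n: maximum impurity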
Measuring Purity
• Need a measure of how “pure” a node is
– all one class → pure → can predict the class
– mixture of classes → impure → have to ask more questions
• Several functions
– probability based
– information theory based
– …
• Choose the attribute whose children have the best purity
(Im)Purity Measure: P(A)P(B)

[Figure: three example nodes: all class A, 7 A's and 3 B's, and all class B.]

• Impurity: P(A)P(B) = m/(m+n) × n/(m+n) = mn/(m+n)², where m is the number of A's and n is the number of B's
• Goodness of attribute:
– average impurity of subnodes
Weighting the Impurities
• How do we take the average?

[Figure: two candidate splits. Att 1 (true/false) produces two subnodes, each with impurity 1/5 × 4/5 = 16%, so the average is 16%. Att 2 (true/false) produces one subnode with impurity 1/1 × 0/1 = 0% and one with impurity 4/9 × 5/9 ≈ 24.7%; candidate averages of 12.3% and 20% are both marked with question marks.]

• Need to weight the nodes by the probability of going to each node:
Weighting the Impurities (Continued)

[Figure: a true/false split. One node receives 7 of 10 instances, P(node) = 7/10, with impurity 4/7 × 3/7 = 12/49; the other receives 3 of 10, P(node) = 3/10, with impurity 2/3 × 1/3 = 2/9.]

• Goodness of attribute = weighted average impurity of subnodes
= ∑_i [P(node_i) × impurity(node_i)]
= (7/10 × 12/49) + (3/10 × 2/9) = 84/490 + 6/90 ≈ 0.238
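This weighted average is the weighted_impurity helper that the earlier build_tree sketch assumed. A two-class Python version (same assumed (feature-dict, label) layout as before, using the impurity function sketched above), followed by a check against the worked numbers:

    from collections import Counter

    def weighted_impurity(instances, attribute):
        """Weighted average impurity of the subnodes created by splitting on attribute."""
        total = len(instances)
        score = 0.0
        for value in {feats[attribute] for feats, _ in instances}:
            labels = [label for feats, label in instances if feats[attribute] == value]
            counts = sorted(Counter(labels).values(), reverse=True) + [0]  # pad if one class is absent
            score += (len(labels) / total) * impurity(counts[0], counts[1])
        return score

    # The worked example above: 7/10 * 12/49 + 3/10 * 2/9
    print(round(7 / 10 * impurity(4, 3) + 3 / 10 * impurity(2, 1), 3))  # 0.238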
Feature Test Design
• Goodness of a feature test: average impurity of the child nodes
• FT1?
– Impurity of child 1 = 1/5 × 4/5 = 4/25
– Impurity of child 2 = 4/5 × 1/5 = 4/25
– Average impurity = 4/25 = 16%
Feature Test Design
• Goodness of a feature test: average impurity of the child nodes
• FT2?
– Impurity of child 1 = 1/1 × 0/1 = 0
– Impurity of child 2 = 4/9 × 5/9 = 20/81
– Average impurity = 10/81 ≈ 12.3%
• So FT2 is better than FT1?
Feature Test Design
• Weighting the impurities by the probability of each node
– Weighted average impurity = ∑_i P(node_i) × impurity(node_i)
• FT1? P(node1) = 0.5, P(node2) = 0.5
– Impurity of child 1 = 1/5 × 4/5 = 4/25
– Impurity of child 2 = 4/5 × 1/5 = 4/25
– Weighted average impurity = 4/25 = 16%
Feature Test Design
• Weighting the impurities by the probability of each node
– Weighted average impurity = ∑_i P(node_i) × impurity(node_i)
• FT2? P(node1) = 0.1, P(node2) = 0.9
– Impurity of child 1 = 1/1 × 0/1 = 0
– Impurity of child 2 = 4/9 × 5/9 = 20/81
– Weighted average impurity = 0.9 × 20/81 ≈ 22.2%
• FT1 is better than FT2 after weighting
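Reusing the impurity function sketched earlier, the weighted comparison reduces to a few lines:

    # Weighted comparison of the two feature tests above.
    ft1 = 0.5 * impurity(1, 4) + 0.5 * impurity(4, 1)   # 4/25 = 16.0%
    ft2 = 0.1 * impurity(1, 0) + 0.9 * impurity(4, 5)   # 0.9 * 20/81 ~ 22.2%
    print(f"FT1: {ft1:.1%}  FT2: {ft2:.1%}")            # FT1 wins once weighted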
Numeric Features
• Approve/Reject a loan application?
Applicant Job Deposit Family Class
1 true $10K single Approve
2 true $7K couple Approve
3 true $4K single Approve
4 true $16K single Approve
5 false $18K couple Approve
6 false $6K couple Reject
7 true $8K children Reject
8 false $3K single Reject
9 false $30K children Reject
What question to ask about Deposit?
Numeric Features
• Can split on a simple comparison, e.g. Deposit < $10K (True/False)
– Which split point?
– Consider class boundaries
Applicant Job Deposit Family Class
8 false $3K single Reject
3 true $4K single Approve
6 false $6K couple Reject
2 true $7K couple Approve
7 true $8K children Reject
1 true $10K single Approve
4 true $16K single Approve
5 false $18K couple Approve
9 false $30K children Reject
Candidate split points at the class boundaries: <$4K, <$6K, <$7K, <$8K, <$10K, <$30K
Numeric Features
• Consider class boundaries; choose the best split, i.e. the one with minimal weighted average impurity:
– (Deposit < $4K): 1/9 × imp(0:1) + 8/9 × imp(5:3) ≈ 0.208
– (Deposit < $6K): 2/9 × imp(1:1) + 7/9 × imp(4:3) ≈ 0.246
– (Deposit < $7K): 3/9 × imp(1:2) + 6/9 × imp(4:2) ≈ 0.222
– (Deposit < $8K): 4/9 × imp(2:2) + 5/9 × imp(3:2) ≈ 0.244
– (Deposit < $10K): 5/9 × imp(2:3) + 4/9 × imp(3:1) ≈ 0.217
– (Deposit < $30K): 8/9 × imp(5:3) + 1/9 × imp(0:1) ≈ 0.208
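A sketch of this boundary scan in Python, reusing the impurity function from earlier; the list of (deposit, class) pairs transcribes the table above, and the data layout is illustrative:

    # (deposit in $K, class) pairs from the table above.
    deposits = [(10, "Approve"), (7, "Approve"), (4, "Approve"), (16, "Approve"),
                (18, "Approve"), (6, "Reject"), (8, "Reject"), (3, "Reject"),
                (30, "Reject")]

    def split_score(data, threshold):
        """Weighted average impurity of the two subsets induced by value < threshold."""
        below = [label for value, label in data if value < threshold]
        above = [label for value, label in data if value >= threshold]
        score = 0.0
        for side in (below, above):
            if side:
                m, n = side.count("Approve"), side.count("Reject")
                score += (len(side) / len(data)) * impurity(m, n)
        return score

    # Candidate thresholds at the class boundaries listed above.
    for t in (4, 6, 7, 8, 10, 30):
        print(f"Deposit < ${t}K: {split_score(deposits, t):.3f}")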
When to Stop Splitting?
• If splitting stops too early, the nodes are not pure enough
• If it stops too late, the tree becomes too large and complex, and can overfit
• Stop splitting a node when:
– The node is pure (or reaches some impurity threshold)
– The maximal tree depth/size is reached
– The best candidate split reduces the impurity by less than a preset threshold (e.g. 5%)
– …
• Principle: trade-off between tree complexity and accuracy/error
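As a sketch, the first two criteria might combine into a single test like the following; the threshold values are arbitrary placeholders, not from the slides:

    from collections import Counter

    def should_stop(labels, depth, max_depth=5, purity_threshold=0.95):
        """Stop if the node is pure enough or the tree is already deep enough."""
        majority_fraction = max(Counter(labels).values()) / len(labels)
        return majority_fraction >= purity_threshold or depth >= max_depth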
Pruning
• Shrink the tree to make it smaller/simpler and reduce overfitting
• Pruning is the inverse of splitting
• Any pair of sibling leaf nodes whose elimination yields only a satisfactory (small) increase in impurity is eliminated, and their common parent node becomes a leaf node
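A small sketch of that pruning test for one pair of sibling leaves, reusing the two-class impurity function from earlier; the counts in the usage line are illustrative:

    def merge_increase(left, right):
        """Impurity increase from replacing two sibling leaves by one parent leaf.
        left and right are (m, n) class counts at each leaf."""
        total = sum(left) + sum(right)
        weighted = (sum(left) / total) * impurity(*left) \
                 + (sum(right) / total) * impurity(*right)
        merged = impurity(left[0] + right[0], left[1] + right[1])
        return merged - weighted

    # Two nearly pure siblings: merging barely raises impurity, so prune them.
    print(round(merge_increase((4, 1), (3, 1)), 4))  # ~0.0006, well under a 5% threshold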
Summary
• Decision Tree vs DT learning
• How to build a decision tree from a set of training instances
• Design issues
– Node splits
– Question/test design
– Numeric features to binary features
– Stopping criteria
– Pruning