Introduction to Artificial Intelligence
COMP307 Machine Learning 3 – Decision Tree Learning Method
Outline
• Decision tree learning vs learned decision trees
• How to build a decision tree using a set of instances
• How to measure a node in a decision tree: (im)purity measures
• Design issues of DT learning
Decision Tree
• A tree-like model for making decisions
Decision Trees vs DT Learning
• A Decision Tree (DT) is a classifier
– Symbolic representation, not probabilistic
– Essentially a rule
– “Easy” to interpret
• DT learning is a learning process
– To find a DT: the output/solution is a DT
– One of the oldest classification learning methods in AI
– Also developed independently in Statistics/Operations Research
Example (Training) Dataset
• Approve/Reject a loan application?
Applicant Job Deposit Family Class
1 true low single Approve
2 true low couple Approve
3 true low single Approve
4 true high single Approve
5 false high couple Approve
6 false low couple Reject
7 true low children Reject
8 false low single Reject
9 false high children Reject
Example DT
• An example DT:

[Figure: example decision tree. The root tests Family: children → Reject; single → Job (true → Approve, false → Reject); couple → Deposit (high → Approve; low → Job: true → Approve, false → Reject).]
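As a minimal Python sketch (not from the slides), the tree in the figure, whose structure is reconstructed here from the figure residue and the training data above, could be hand-coded as a classification function; the dict keys are illustrative:

    # Hand-coded version of the example DT above (an illustration only).
    def classify(applicant):
        """Classify a loan application dict with keys Job, Deposit, Family."""
        if applicant["Family"] == "children":
            return "Reject"
        if applicant["Family"] == "single":
            return "Approve" if applicant["Job"] else "Reject"
        # Family == "couple": test the deposit, then the job
        if applicant["Deposit"] == "high":
            return "Approve"
        return "Approve" if applicant["Job"] else "Reject"

    # Applicant 6 from the table: no job, low deposit, couple -> Reject
    print(classify({"Job": False, "Deposit": "low", "Family": "couple"}))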
Building Decision Trees
• You can always build a decision tree trivially
– Choose some order on the attributes
– Build tree with one attribute for each level
– Label each leaf with appropriate class
[Figure: a full tree that tests attribute A at the root, then B, C, and D at successive levels, with every leaf labelled with a class X, Y, or Z.]
• Problems
– Each leaf represents a possible instance
– All we are doing is remembering every instance: no generalisation, no prediction, no learning
[Figure: a much smaller tree over the same attributes that still classifies the instances correctly.]
• Solution
– Find a small decision tree that captures the common features of the instances
– Such a tree will probably generalise to predict classes for unseen instances
Building A Good Decision Tree
• Input: Instances described by attribute-value pairs
• Output: a “good” decision tree classifier
– Critical issue: choosing which attribute to use next
• DT algorithm (a sketch is given below):

    Examine the set of instances at the current node
    If the set is "pure" enough, or no more attributes remain
    then stop
    Else
        Construct the subsets of instances in the subnodes
        Compute the average "purity" of the subnodes
        Choose the best attribute
        Recurse on each subnode
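A minimal runnable Python sketch of that loop, assuming instances are (feature-dict, label) pairs and a weighted_impurity helper like the one sketched later in these notes; the names and the purity test are illustrative, not the course's code:

    from collections import Counter

    def build_tree(instances, attributes, purity_threshold=1.0):
        """Recursively build a DT; instances are (feature_dict, label) pairs."""
        labels = [label for _, label in instances]
        majority, count = Counter(labels).most_common(1)[0]
        # Stop when the node is pure enough or no attributes remain.
        if count / len(labels) >= purity_threshold or not attributes:
            return majority                      # leaf: predict the majority class
        # Choose the attribute whose split has the lowest weighted average impurity.
        best = min(attributes, key=lambda a: weighted_impurity(instances, a))
        remaining = [a for a in attributes if a != best]
        children = {}
        for value in {feats[best] for feats, _ in instances}:
            subset = [(f, l) for f, l in instances if f[best] == value]
            children[value] = build_tree(subset, remaining, purity_threshold)
        return (best, children)                  # internal node: (attribute, subtrees)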
Decision Tree Building

[Figure: three candidate root splits for the loan dataset: Job (true/false), Deposit (high/low), and Family (single/couple/children, where children → Reject). Which is best? Choose the best attribute.]
Building a DT
• A simple way:
– Start from a root; select a feature at the root to be branched on
– Grow the tree depth-first or breadth-first (or in any other order)
– For each node, add one child node for each possible feature value
– A child node can be a feature node or a class (leaf node)
• Node: a feature or a class
• Edge: a value of the parent node's feature
• Cannot support numeric features (infinitely many possible values)
Building a DT
• Input: instances/samples
• Output: a “good” decision tree classifier
• A decision tree progressively splits the training set into smaller subsets
• Pure node: all the samples at that node have the same class label
• No need to further split a pure node
• Recursive tree-growing process: given the data at a node, either declare the node a leaf node or find another feature test to split the node
Building a DT
• Start:
– Build a tree with a single root node
– The entire training set is at the root node
• Repeat until no split is needed:
– Select a node from the tree to examine
– Check the purity of the set at the examined node
– If the set is pure enough: make the node a leaf node of the majority class
– Else: design a feature test to expand the node and split the set
Design Issues
• Should the feature tests (answers to questions) be binary or multivalued? In other words, how many splits should be made at a node?
• Which feature or feature combination should be tested at a node?
• When should a node be declared a leaf node?
• If the tree becomes “too large”, can it be pruned to make it smaller and simpler?
• If a leaf node is impure, how should a category label be assigned to it?
Number of Splits
• Binary: every question has a True/False answer
Feature Test Design
• Which feature/attribute should be used in the feature test?
• Greedy design: the question should make the child nodes as pure as possible
• Node (im)purity: can be defined in different ways
– Probability based
– Information theory based
ML Tools 03: Entropy Based Discretisation
• Entropy measures the impurity or uncertainty in a group of examples
• S is the training set, C1, …, CN the classes
• E(S) measures the entropy of S, where Pc is the proportion of class Cc in S:

E(S) = −∑_c Pc log2(Pc)

[Figure: three groups of examples: very impure (high entropy), less impure (low entropy), and pure / least impure (null entropy).]
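The entropy formula translates directly to Python; a small sketch:

    import math
    from collections import Counter

    def entropy(labels):
        """E(S) = -sum over classes of P_c * log2(P_c)."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    print(entropy(["A", "A", "B", "B"]))  # 1.0: maximally impure for two classes
    print(entropy(["A", "A", "A", "A"]))  # -0.0: a pure node has null entropy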
Node (Im)purity Measure
• Assume there are two classes A and B
• At a node: m instances of class A, n instances of class B
• Impurity: imp = P(A)P(B) = m/(m+n) × n/(m+n) = mn/(m+n)²
– If pure (m = 0 or n = 0): imp = 0
– If m = n: imp is at its maximum
– Smooth
The smaller the better
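As a quick sketch, the P(A)P(B) measure in Python, with the two properties above checked:

    def impurity(m, n):
        """imp = P(A)P(B) = mn / (m + n)^2 for a node with m A's and n B's."""
        return (m * n) / (m + n) ** 2 if m + n else 0.0

    assert impurity(5, 0) == 0.0    # pure node: imp = 0
    assert impurity(3, 3) == 0.25   # m == n: maximum impurity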
Measuring Purity
• Need a measure of how “pure” a node is
– all one class → pure → can predict the class
– mixture of classes → impure → have to ask more questions
• Several functions
– probability based
– information theory based
– …
• Choose the attribute whose children have the best purity
(Im)Purity Measure: P(A)P(B)

[Figure: three example nodes: all class A, 7 A's and 3 B's, and all class B.]

• Impurity: P(A)P(B) = m/(m+n) × n/(m+n) = mn/(m+n)², where m is the number of A's and n is the number of B's
• Goodness of attribute:
– average impurity of subnodes
Weighting the Impurities
• How do we take the average?

[Figure: two candidate splits. Att 1 (true/false) produces two subnodes, each with impurity 1/5 × 4/5 = 16%, so the average is 16%. Att 2 (true/false) produces one subnode with impurity 1/1 × 0/1 = 0% and one with impurity 4/9 × 5/9 ≈ 24.7%; candidate averages of 12.3% and 20% are both marked with question marks.]

• Need to weight the nodes by the probability of going to each node:
Weighting the Impurities (Continued)

[Figure: a true/false split. One node receives 7 of 10 instances, P(node) = 7/10, with impurity 4/7 × 3/7 = 12/49; the other receives 3 of 10, P(node) = 3/10, with impurity 2/3 × 1/3 = 2/9.]

• Goodness of attribute = weighted average impurity of subnodes
= ∑_i [P(node_i) × impurity(node_i)]
= (7/10 × 12/49) + (3/10 × 2/9) = 84/490 + 6/90 ≈ 0.238
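This weighted average is the weighted_impurity helper that the earlier build_tree sketch assumed. A two-class Python version (same assumed (feature-dict, label) layout as before, using the impurity function sketched above), followed by a check against the worked numbers:

    from collections import Counter

    def weighted_impurity(instances, attribute):
        """Weighted average impurity of the subnodes created by splitting on attribute."""
        total = len(instances)
        score = 0.0
        for value in {feats[attribute] for feats, _ in instances}:
            labels = [label for feats, label in instances if feats[attribute] == value]
            counts = sorted(Counter(labels).values(), reverse=True) + [0]  # pad if one class is absent
            score += (len(labels) / total) * impurity(counts[0], counts[1])
        return score

    # The worked example above: 7/10 * 12/49 + 3/10 * 2/9
    print(round(7 / 10 * impurity(4, 3) + 3 / 10 * impurity(2, 1), 3))  # 0.238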
Feature Test Design
• Goodness of a feature test: average impurity of the child nodes
• FT1?
– Impurity of child 1 = 1/5 × 4/5 = 4/25
– Impurity of child 2 = 4/5 × 1/5 = 4/25
– Average impurity = 4/25 = 16%
Feature Test Design
• Goodness of a feature test: average impurity of the child nodes
• FT2?
– Impurity of child 1 = 1/1 × 0/1 = 0
– Impurity of child 2 = 4/9 × 5/9 = 20/81
– Average impurity = 10/81 ≈ 12.3%
• So FT2 is better than FT1?
Feature Test Design
• Weighting the impurities by the probability of each node
– Weighted average impurity = ∑_i P(node_i) × impurity(node_i)
• FT1? P(node1) = 0.5, P(node2) = 0.5
– Impurity of child 1 = 1/5 × 4/5 = 4/25
– Impurity of child 2 = 4/5 × 1/5 = 4/25
– Weighted average impurity = 4/25 = 16%
Feature Test Design
• Weighting the impurities by the probability of each node
– Weighted average impurity = ∑_i P(node_i) × impurity(node_i)
• FT2? P(node1) = 0.1, P(node2) = 0.9
– Impurity of child 1 = 1/1 × 0/1 = 0
– Impurity of child 2 = 4/9 × 5/9 = 20/81
– Weighted average impurity = 0.9 × 20/81 ≈ 22.2%
• FT1 is better than FT2 after weighting
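Reusing the impurity function sketched earlier, the weighted comparison reduces to a few lines:

    # Weighted comparison of the two feature tests above.
    ft1 = 0.5 * impurity(1, 4) + 0.5 * impurity(4, 1)   # 4/25 = 16.0%
    ft2 = 0.1 * impurity(1, 0) + 0.9 * impurity(4, 5)   # 0.9 * 20/81 ~ 22.2%
    print(f"FT1: {ft1:.1%}  FT2: {ft2:.1%}")            # FT1 wins once weighted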
Numeric Features
• Approve/Reject a loan application?
Applicant Job Deposit Family Class
1 true $10K single Approve
2 true $7K couple Approve
3 true $4K single Approve
4 true $16K single Approve
5 false $18K couple Approve
6 false $6K couple Reject
7 true $8K children Reject
8 false $3K single Reject
9 false $30K children Reject
What question to ask about Deposit?
Numeric Features
• Can split on a simple comparison, e.g. Deposit < $10K (True/False)
– Which split point?
– Consider class boundaries
Applicant Job Deposit Family Class
8 false $3K single Reject
3 true $4K single Approve
6 false $6K couple Reject
2 true $7K couple Approve
7 true $8K children Reject
1 true $10K single Approve
4 true $16K single Approve
5 false $18K couple Approve
9 false $30K children Reject
Candidate split points at the class boundaries: <$4K, <$6K, <$7K, <$8K, <$10K, <$30K
Numeric Features
• Consider class boundaries; choose the best split, i.e. the one with minimal weighted average impurity:
– (Deposit < $4K): 1/9 × imp(0:1) + 8/9 × imp(5:3) ≈ 0.208
– (Deposit < $6K): 2/9 × imp(1:1) + 7/9 × imp(4:3) ≈ 0.246
– (Deposit < $7K): 3/9 × imp(1:2) + 6/9 × imp(4:2) ≈ 0.222
– (Deposit < $8K): 4/9 × imp(2:2) + 5/9 × imp(3:2) ≈ 0.244
– (Deposit < $10K): 5/9 × imp(2:3) + 4/9 × imp(3:1) ≈ 0.217
– (Deposit < $30K): 8/9 × imp(5:3) + 1/9 × imp(0:1) ≈ 0.208
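A sketch of this boundary scan in Python, reusing the impurity function from earlier; the list of (deposit, class) pairs transcribes the table above, and the data layout is illustrative:

    # (deposit in $K, class) pairs from the table above.
    deposits = [(10, "Approve"), (7, "Approve"), (4, "Approve"), (16, "Approve"),
                (18, "Approve"), (6, "Reject"), (8, "Reject"), (3, "Reject"),
                (30, "Reject")]

    def split_score(data, threshold):
        """Weighted average impurity of the two subsets induced by value < threshold."""
        below = [label for value, label in data if value < threshold]
        above = [label for value, label in data if value >= threshold]
        score = 0.0
        for side in (below, above):
            if side:
                m, n = side.count("Approve"), side.count("Reject")
                score += (len(side) / len(data)) * impurity(m, n)
        return score

    # Candidate thresholds at the class boundaries listed above.
    for t in (4, 6, 7, 8, 10, 30):
        print(f"Deposit < ${t}K: {split_score(deposits, t):.3f}")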
When to Stop Splitting?
• If splitting stops too early, the nodes are not pure enough
• If it stops too late, the tree becomes too large and complex, and can overfit
• Stop splitting a node when:
– The node is pure (or reaches some impurity threshold)
– The maximal tree depth/size is reached
– The best candidate split reduces the impurity by less than a preset threshold (e.g. 5%)
– …
• Principle: trade-off between tree complexity and accuracy/error
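As a sketch, the first two criteria might combine into a single test like the following; the threshold values are arbitrary placeholders, not from the slides:

    from collections import Counter

    def should_stop(labels, depth, max_depth=5, purity_threshold=0.95):
        """Stop if the node is pure enough or the tree is already deep enough."""
        majority_fraction = max(Counter(labels).values()) / len(labels)
        return majority_fraction >= purity_threshold or depth >= max_depth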
Pruning
• Shrink the tree to make it smaller/simpler and reduce overfitting
• Pruning is the inverse of splitting
• Any pair of sibling leaf nodes whose elimination yields only a satisfactory (small) increase in impurity is eliminated, and their common parent node becomes a leaf node
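A small sketch of that pruning test for one pair of sibling leaves, reusing the two-class impurity function from earlier; the counts in the usage line are illustrative:

    def merge_increase(left, right):
        """Impurity increase from replacing two sibling leaves by one parent leaf.
        left and right are (m, n) class counts at each leaf."""
        total = sum(left) + sum(right)
        weighted = (sum(left) / total) * impurity(*left) \
                 + (sum(right) / total) * impurity(*right)
        merged = impurity(left[0] + right[0], left[1] + right[1])
        return merged - weighted

    # Two nearly pure siblings: merging barely raises impurity, so prune them.
    print(round(merge_increase((4, 1), (3, 1)), 4))  # ~0.0006, well under a 5% threshold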
Summary
• Decision Tree vs DT learning
• How to build a decision tree from a set of training instances
• Design issues
– Node splits
– Question/test design
– Numeric features to binary features
– Stopping criteria
– Pruning