TRANSCRIPT
Announcements
• HW2 on web later tonight
• Due Oct 27 (Friday after the exam)
• Involves ID3 (decision trees)
• Reading assignment on website: Machine Learning: Four Current Directions (read section on ensembles)
• Midterm exam in two weeks
• In class on Oct 25
Last Time: Logistic Regression and Perceptrons
LR Decision Rule

LR models the log odds as a linear function of the features:

ln( Pr(C=1 | F) / Pr(C=0 | F) ) = w0 + w1 f1 + ... + wN fN

Predict class + if

T < w0 + w1 f1 + ... + wN fN

where T is a threshold (0 if equal FP and FN costs).
The Decision Boundary of Logistic Regression is a hyperplane (line in 2D)

If T < w0 + w1 f1 + ... + wN fN, predict +; otherwise predict -.

In 2D the boundary is the line w0 + w1 f1 + w2 f2 - T = 0.

[figure: + and - examples in the (f1, f2) plane, separated by this line]
"Weight Space"

[figure: the conditional log likelihood L(W) plotted against the weights W; the goal is its maximum]

• Given feature representations, the weights W are free parameters that define a space
• Each point in "weight space" corresponds to an LR model
• Associated with each point is a conditional log likelihood
• One way to do LR learning is to perform "gradient ascent" in the weight space

For LR, L(W) is a concave function (it has a single global maximum), so we are guaranteed to find the global maximum.
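The gradient-ascent idea above can be sketched in code. This is an illustrative sketch, not the course's own implementation; the learning rate, step count, and toy data in the usage note are my choices.

```python
import math

def log_likelihood(w, X, y):
    """Conditional log likelihood L(W): the value attached to a point in weight space."""
    ll = 0.0
    for xi, yi in zip(X, y):
        z = w[0] + sum(wj * fj for wj, fj in zip(w[1:], xi))  # w0 + w1 f1 + ... + wN fN
        p = 1.0 / (1.0 + math.exp(-z))                        # Pr(C = 1 | F)
        ll += math.log(p) if yi == 1 else math.log(1.0 - p)
    return ll

def gradient_ascent(X, y, eta=0.1, steps=500):
    """Climb the concave surface L(W), starting from the origin of weight space."""
    w = [0.0] * (1 + len(X[0]))
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * fj for wj, fj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = yi - p                  # gradient of L with respect to z for this example
            grad[0] += err
            for j, fj in enumerate(xi):
                grad[j + 1] += err * fj
        w = [wj + eta * g for wj, g in zip(w, grad)]
    return w
```

Because L(W) is concave, every step that follows the gradient moves toward the single global maximum, which is why this simple loop suffices for LR.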
Perceptrons

[figure: input units f1 ... fN (plus F0 = 1) connect through weights w0, w1, ..., wN to a single output unit computing the weighted sum Σ]

Predict + if T < w0 + w1 f1 + ... + wN fN

The decision rule for perceptrons has the same form as the decision rule for logistic regression and naïve Bayes. So perceptrons are linear separators.
Today's Topics
• Finish neural nets
• Decision trees

Next few weeks preview
• Next time
  • D-tree wrap up, ensembles
  • Exam review (last 30 mins)
  • Exam material will cover everything through next week's class
• In two weeks
  • Exam (1st hour of class)
  • Advanced ML topics (possible projects)
• In three weeks
  • Support Vector Machines (last pure SL topic)
  • Begin graphical probabilistic models
Should You?

[cartoon caption: "Fenwick here is biding his time waiting for neural networks."]
Advantages of Neural Networks
• Provide best predictive accuracy for some problems
  - Being supplanted by SVMs?
• Can represent a rich class of concepts
• Nonlinear decision surfaces

[figure: a nonlinear boundary separating positive and negative regions]
Artificial Neural Networks (ANNs), sometimes called "multi-layer perceptrons"

[figure: features feed input units, which connect via weights to hidden units and then to output units; error signals propagate back along the weights; a recurrent link is also shown]
ANNs (continued)

Individual units set their output to be the sigmoid function applied to a linear combination of incoming signals:

input_j = Σ_i ( w_ij × output_i )

output_j = 1 / (1 + exp( -(input_j - bias_j) ))

[figure: a unit with several inputs and one output; the sigmoid maps the unit's input to its output]
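The unit computation above can be sketched directly (a minimal sketch; the function name is mine):

```python
import math

def unit_output(weights, inputs, bias):
    """One ANN unit: sigmoid of a linear combination of incoming signals.
    input_j = sum_i(w_ij * output_i); output_j = 1 / (1 + exp(-(input_j - bias_j)))."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(net - bias)))
```

With zero net input relative to the bias the unit outputs 0.5; strongly positive input drives it toward 1 and strongly negative input toward 0.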
Perceptron Convergence Theorem (Rosenblatt, 1957)

Perceptron = no hidden units

If a set of examples is learnable, the delta rule will eventually find the necessary weights.

However, a perceptron can only learn/represent linearly separable datasets.
Linear Separability

Consider a perceptron. Its output is 1 if W1X1 + W2X2 + ... + WnXn > Θ, and 0 otherwise.

In terms of feature space, the boundary along two features Xi and Xj satisfies

Wi Xi + Wj Xj = Θ

Xj = -(Wi / Wj) Xi + Θ / Wj        [ y = mx + b ]

[figure: + and - examples in the (Xi, Xj) plane separated by this line]

Hence, a perceptron can only classify examples if a "line" (hyperplane) can separate them.
The XOR Problem

Exclusive OR (XOR):

     Input    Output
a)   0 0      0
b)   0 1      1
c)   1 0      1
d)   1 1      0

XOR is not linearly separable.

[figure: the four points a-d at the corners of the unit square; no single line separates the 1 outputs from the 0 outputs]
A Neural Network Solution

[figure: inputs X1 and X2 feed two hidden units H1 and H2 with weights of +1 and -1; both hidden units feed the output unit with weights of -1]

H1 = (X1 is 1) AND (X2 is 1)
H2 = (X1 is 0) AND (X2 is 0)
Output = neither H1 nor H2

Biases need to be set!
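One concrete way to set those biases is sketched below, using hard-threshold units. The specific threshold values (1.5 and -0.5) are my choice; any values realizing the same three conditions work.

```python
def step(net, threshold):
    """Hard-threshold unit: fires (outputs 1) when net input exceeds the threshold."""
    return 1 if net > threshold else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2, 1.5)      # H1: X1 is 1 AND X2 is 1  (weights +1, +1)
    h2 = step(-x1 - x2, -0.5)    # H2: X1 is 0 AND X2 is 0  (weights -1, -1)
    return step(-h1 - h2, -0.5)  # output: neither H1 nor H2 (weights -1, -1)
```

Tracing the four inputs shows the output unit fires exactly when neither hidden unit does, which is XOR.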
The Need for Hidden The Need for Hidden UnitsUnits
If there is one layer of enough hidden units (possibly 2N for Boolean functions), the input can be recoded. (N = number of input units)
This recoding allows any mapping to be represented (Minsky & Papert)Question: How to provide an error signal to the interior units?
Hidden Units

One view: hidden units allow a system to create its own internal representation, for which problem solving is easy.
Reformulating XOR

[figure: a network where inputs X1 and X2 also feed a third input X3 = X1 ∧ X2, and all three feed the output unit]

So, if a hidden unit can learn to represent X1 ∧ X2, the solution is easy.
Backpropagation

• Backpropagation involves a generalization of the "delta rule" for perceptrons
• Rumelhart, Parker, and Le Cun (and Bryson & Ho (1969), Werbos (1974)) independently developed (1985) a technique for determining how to adjust the weights of interior ("hidden") units
• Derivation involves partial derivatives ∂E/∂W_i,j (hence, the threshold function must be differentiable)

[figure: the error signal propagating back through the weights]
Weight Space - same idea as we saw with LR

• Given a network layout, the weights and biases are free parameters that define a space
• Now, each point in this weight space (w) specifies a neural network
• Associated with each point is an error rate, E, over the training data
• BackProp performs gradient descent in weight space
Gradient descent in weight space

[figure: the error surface E over weights W1 and W2; from the current weight vector w, backprop moves downhill along the gradient]
Backprop Calculations

• Assume one layer of hidden units (standard topology); index output units by i, hidden units by j, and input units by k

1. Error = ½ Σ_i ( Teacher_i - Output_i )²
2.       = ½ Σ_i ( Teacher_i - f [ Σ_j W_i,j × Output_j ] )²
3.       = ½ Σ_i ( Teacher_i - f [ Σ_j W_i,j × f ( Σ_k W_j,k × Output_k ) ] )²

• Determine ∂Error/∂W_i,j (use equation 2) and ∂Error/∂W_j,k (use equation 3)
• Weight update: ΔW_x,y = -η ( ∂E / ∂W_x,y )

* See Table 4.2 for results
Differentiating the Logistic Function

out_i = 1 / ( 1 + e^-( Σ_j w_j,i × out_j - Θ_i ) )

f'(weighted input) = out_i ( 1 - out_i )

[figure: the sigmoid f rises from 0 to 1, crossing ½ at the threshold; its derivative f' is bell-shaped with a maximum of ¼ at the threshold and near 0 at both extremes]

Notice that even if the unit is totally wrong, there is no (or very little) change in the weights.
The Need for Symmetry Breaking

Assume all weights are initially the same. Can the corresponding (mirror-image) weights ever differ? No. Why? By symmetry.

Solution: randomize the initial weights.
Using BP to Train ANNs

1. Initialize weights & biases to small random values (e.g., in [-0.3, 0.3])
2. Randomize the order of the training examples; for each do:
   a) Propagate activity forward to the output units:

      out_i = f ( Σ_j w_i,j × out_j )

   (layers indexed k → j → i)
Using BP to Train ANNs (continued)

   b) Compute "error" for output units:   δ_i = f'(net_i) × ( Teacher_i - out_i )
   c) Compute "error" for hidden units:   δ_j = f'(net_j) × Σ_i ( w_i,j × δ_i )
   d) Update weights:   Δw_i,j = η × δ_i × out_j        Δw_j,k = η × δ_j × out_k

   where f'(net_j) = ∂f(net_j) / ∂net_j
Using BP to Train ANNs (continued)

3. Repeat until the training-set error rate is small enough (or until the tuning-set error rate begins to rise - see later slide)
   - Should use "early stopping" (i.e., minimize error on the tuning set; more details later)
4. Measure accuracy on the test set to estimate generalization (future accuracy)
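Steps 1-2 above (with updates b-d from the previous slide) can be sketched as follows. This is a minimal sketch assuming one hidden layer, a single sigmoid output unit, and online (per-example) updates; the function names and the OR training data in the usage note are mine.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_bp(examples, n_hidden=2, eta=0.5, epochs=2000, seed=0):
    """Step 1: small random init; step 2: for each example propagate forward (a),
    compute deltas for output and hidden units (b, c), and update weights (d)."""
    rng = random.Random(seed)
    n_in = len(examples[0][0])
    # hidden<-input and output<-hidden weights; the last entry of each row is a bias weight
    w_h = [[rng.uniform(-0.3, 0.3) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.3, 0.3) for _ in range(n_hidden + 1)]

    def forward(x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_h]
        out = sigmoid(sum(w * v for w, v in zip(w_o, h + [1.0])))
        return h, out

    for _ in range(epochs):
        rng.shuffle(examples)                              # randomize example order
        for x, teacher in examples:
            h, out = forward(x)                            # a) forward pass
            d_o = out * (1 - out) * (teacher - out)        # b) delta_i = f'(net)(T - out)
            d_h = [hj * (1 - hj) * w_o[j] * d_o            # c) delta_j = f'(net) w_ij delta_i
                   for j, hj in enumerate(h)]
            w_o = [w + eta * d_o * v for w, v in zip(w_o, h + [1.0])]   # d) updates
            for j in range(n_hidden):
                w_h[j] = [w + eta * d_h[j] * v for w, v in zip(w_h[j], x + [1.0])]
    return lambda x: forward(x)[1]
```

A usage example: training on the (linearly separable, hence easy) OR function and rounding the sigmoid outputs recovers the truth table.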
Report Card

                            K-NN   NB    LR    ANN
Learning Efficiency         A+     A     ?     ?
Classification Efficiency   F      B     ?     ?
Empirical Performance       C      A     ?     ?
Domain Insight              F      C     ?     ?
Implementation Ease         A      B     ?     ?

(LR and ANN columns to be filled in.)
Next Topic

Decision Trees (mainly Quinlan's model)
ID3 Algorithm (Quinlan, 1979)

• Induction of decision trees (top-down)
• Based on Hunt's CLS (1963)
• Handles noisy & missing feature values

[figure: an example tree - COLOR? branches to Red and Blue; Red leads to SIZE? with Big → - and Small → +; Blue → -]
Main Issue

• How to choose the next feature to place in the decision tree?
  • Random choice?
  • Feature with largest number of values?
  • Feature with fewest?
  • Information theoretic measure (Quinlan's approach)
Main Hypothesis of ID3

• The simplest tree that classifies the training examples will work best on future examples (Occam's Razor)¹

[figure: a larger tree splitting on COLOR? then SIZE? vs. a smaller tree splitting only on SIZE? (Big → -, Small → +)]

It is NP-hard to find the smallest tree (Hyafil + Rivest, 1976).

1. Empirical evidence calls this assumption into question (Mingers 1989); counterexamples appeared in MLJ. Also see Murphy + Pazzani, JAIR.
Why Occam's Razor?

• There are fewer short hypotheses (trees in ID3) than long ones
• A short hypothesis that fits the training data is unlikely to be a coincidence
• A long hypothesis that fits the training data might be (since there are many more long trees)
• The COLT community formally addresses these issues (see chapter 7)
Finding Small Decision Trees

• ID3: generate small trees with a greedy algorithm:
  • Find a feature that "best" divides the data
  • Recur on each subset of the data that the feature induces
• What does "best" mean?
  • Postpone briefly
Overview of ID3

[figure: ID3 starts from the full example set {+1 ... +5, -1, -2, -3} with feature set {A1, A2, A3, A4}; a splitting attribute (here A4, then A2) partitions the examples, and the remaining features {A1, A3, ...} recur on each subset until the leaves are pure (+ or -); an empty branch uses the majority class at the parent node]
ID3 Algorithm (Table 3.1)

• Given:
  • E, a set of classified examples
  • F, a set of features not yet in the decision tree
• If E = ∅ then return "?" (or use the majority class at the parent)
• Else if All_E_Same_Class, return <the class>
• Else if F = ∅ return *Error?* (have +/- examples with the same feature values)
• Else
  • Let bestF = FeatureThatGainsMostInfo(E, F)
  • Let leftF = F - bestF
  • Add node bestF to the decision tree
  • For each possible value, v, of bestF do
    • Add a link (labelled v) to the decision tree
    • And connect it to the result of ID3({ex in E | ex has value v for feature bestF}, leftF)
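The recursion above can be sketched as follows. This is an illustrative sketch, not Quinlan's code: examples are assumed to be (feature-dict, label) pairs with binary '+'/'-' labels, and "gains most info" is implemented as "smallest expected remaining info E(f)" (equivalent, since I(f+, f-) is constant across features, as a later slide notes).

```python
import math

def info(pos, neg):
    """I(f+, f-) = -f+ lg f+ - f- lg f-  (0 if one class is empty)."""
    total = pos + neg
    if pos == 0 or neg == 0:
        return 0.0
    fp, fn = pos / total, neg / total
    return -fp * math.log2(fp) - fn * math.log2(fn)

def id3(examples, features):
    """examples: list of (dict feature -> value, '+' or '-'); follows Table 3.1."""
    labels = [lab for _, lab in examples]
    if not examples:
        return '?'                                  # empty: caller may use parent majority
    if len(set(labels)) == 1:
        return labels[0]                            # all examples same class
    if not features:
        return max(set(labels), key=labels.count)   # out of features: majority class
    def expected_info(f):                           # E(f): info still needed after split
        e = 0.0
        for v in set(ex[f] for ex, _ in examples):
            sub = [lab for ex, lab in examples if ex[f] == v]
            e += len(sub) / len(examples) * info(sub.count('+'), sub.count('-'))
        return e
    best = min(features, key=expected_info)         # max gain = min E(f)
    tree = {}
    for v in set(ex[best] for ex, _ in examples):
        subset = [(ex, lab) for ex, lab in examples if ex[best] == v]
        tree[v] = id3(subset, [f for f in features if f != best])
    return (best, tree)
```

On the Color/Size table used in the worked example later in these notes, Size classifies the data completely, so it becomes the root and the tree is a single split.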
Venn Diagram View of ID3

• Question: How do decision trees divide the feature space?

[figure: + and - examples scattered in the (F1, F2) plane]
Venn Diagram View of ID3

• Question: How do decision trees divide the feature space?

[figure: the same (F1, F2) plane carved into axis-parallel rectangles by successive splits on F1 and F2, each rectangle labelled + or -]
View ID3 as a Search Algorithm

Search Space:        Space of all decision trees constructible using the current feature set
Operators:           Add a node (i.e., grow the tree)
Search Strategy:     Hill climbing
Heuristic Function:  Information gain (or gain ratio); other algorithms use similar "purity measures"
Start Node:          Empty tree, or an isolated leaf node (+ or -) depending on the majority class
End Node:            Tree that separates all the training data ("post pruning" may be done later to reduce overfitting)
A Sample Search Tree

• Expand the "left most" (?) of the current node
• All possible trees can be generated (given thresholds "implied" by real values in the train set)

[figure: the search starts from a single "?" node; operators Add f1 ... Add fN produce trees rooted at F1, F2, ..., FN, and each of those expands further, e.g. F2 with one branch resolved to + and the other still "?"]
Scoring the Features

• We want to know: how helpful is it to know the value of feature f?
• How do we measure "helpful"?
• One technique uses "information theory" concepts
(Shannon's) Entropy

• Introduced in the 1940s as a concept in communication theory
• Roughly, the entropy of an event is proportional to its uncertainty:
  H(p) = -p × lg(p) - (1-p) × lg(1-p)
• In decision trees we are concerned with the entropy associated with an example's label
• Let f+ = fraction of positive examples  [f+ = #pos / (#pos + #neg)]
• Let f- = fraction of negative examples  [f- = #neg / (#pos + #neg)]
• The information needed to determine the category of one of these examples is its entropy:
  Info(f+, f-) = H(f+) = -f+ lg(f+) - f- lg(f-)

(This is entropy when there are only two possible outcomes.)
Consider the Extreme Cases

• All same class (+, say): Info(1, 0) = -1 lg(1) - 0 lg(0) = 0   (0 lg 0 is 0 by definition)
• 50-50 mixture: Info(½, ½) = 2 [ -½ lg(½) ] = 1

[figure: H(f+) plotted for f+ from 0 to 1: it is 0 at both ends and reaches its maximum of 1 at f+ = 0.5]
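The two-outcome entropy above, including the 0 lg 0 = 0 convention, can be sketched as:

```python
import math

def entropy(p):
    """H(p) = -p lg p - (1-p) lg (1-p), with 0 lg 0 taken as 0 by definition."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

The extreme cases from the slide fall out directly: a pure class gives 0 bits, a 50-50 mixture gives 1 bit, and everything in between lies on the hump in the figure.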
Evaluating a Feature

• How much does it help to know the value of feature f?
• Assume f divides the current set of examples into N groups
• Let q_i = fraction of data on branch i
  f_i+ = fraction of +'s on branch i
  f_i- = fraction of -'s on branch i
• E(f) = Σ_{i=1..N} q_i × I(f_i+, f_i-)
• This is the info needed after determining the value of feature f
• Another "expected value" calculation
• Pictorially:
Evaluating a Feature (con't)

[figure: feature f at the root with branches v1 ... vN; the parent set has info I(f+, f-), and each branch i has info I(f_i+, f_i-)]
Info Gain

Gain(f) = I(f+, f-) - E(f)

This is our scoring function in our hill-climbing (greedy) algorithm. Since I(f+, f-) is constant for all features, picking the f with the largest gain means picking the f with the smallest E(f).

That is, choose the feature that statistically tells us the most about the class of another example drawn from this distribution.
Today's Topics

• ID3 info gain measure and variants
• Numeric features (also hierarchical)
• Numeric outputs
• Multiple category classification
Example Info Gain Calculation

Color    Shape    Size     Class
Red       ?       BIG      +
Red       ?       BIG      +
Yellow    ?       SMALL    -
Red       ?       SMALL    -
Blue      ?       BIG      +

[the Shape values did not survive transcription]

I(f+, f-) = ?    E(color) = ?    E(shape) = ?    E(size) = ?
Info Gain Calculation (contd.)

I(f+, f-) = I(0.6, 0.4) = -0.6 × lg(0.6) - 0.4 × lg(0.4) ≈ 0.97

E(color) = 0.6 × I(2/3, 1/3) + 0.2 × I(1, 0) + 0.2 × I(0, 1)
E(shape) = 0.6 × I(2/3, 1/3) + 0.4 × I(1/2, 1/2) > E(color)
E(size)  = 0.6 × I(1, 0) + 0.4 × I(0, 1) = 0

Note that "Size" provides complete classification.
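The arithmetic above can be checked mechanically. A sketch (the helper name I mirrors the slides' notation; it takes the +/- fractions on a branch):

```python
import math

def I(fp, fn):
    """Info needed for a branch whose fraction of positives is fp and negatives fn."""
    h = 0.0
    for f in (fp, fn):
        if f > 0:                      # 0 lg 0 = 0 by definition
            h -= f * math.log2(f)
    return h

# The five examples above: 3 positive, 2 negative
i_total = I(0.6, 0.4)                                     # ~0.97 bits before any split
e_color = 0.6 * I(2/3, 1/3) + 0.2 * I(1, 0) + 0.2 * I(0, 1)
e_size = 0.6 * I(1, 0) + 0.4 * I(0, 1)                    # 0: Size classifies completely
```

Since Gain(f) = I(f+, f-) - E(f), the zero E(size) means Size has the maximum possible gain here, matching the note that it provides complete classification.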
ID3 Info Gain Measure Justified
(Ref: C4.5, J. R. Quinlan, Morgan Kaufmann, 1993, pp. 21-22)

Definition of information (due to Shannon): the info conveyed by a message M depends on its probability, i.e.,

info(M) = -log2[ Prob(M) ]

Select an example from a set S and announce that it belongs to class C. The probability of this occurring is f_C, the fraction of C's in S. Hence the info in this announcement is, by definition,

-log2( f_C )
ID3 Info Gain Measure (contd.)

Let there be K different classes in set S: C1, C2, ..., CK.

What is the expected info from a message about the class of an example in set S?

info(S) = -f_C1 × log2(f_C1) - f_C2 × log2(f_C2) - ... - f_CK × log2(f_CK)

⇒ info(S) = -Σ_{j=1..K} f_Cj × log2(f_Cj)

where f_Cj is the fraction of set S that is of class Cj.

info(S) is the average number of bits of information (by looking at feature values) needed to classify a member of set S.
TDIDT Details

• Handling numeric features
• Bias toward many-valued features
• Multi-category classification
• ID3's runtime
• Pruning d-trees to avoid overfitting
• Generating rules from d-trees
Handling Numeric Features in ID3

On-the-fly creation of binary features; choose the best.

Step 1: Plot the current examples along the feature's values.
Step 2: Divide midway between every consecutive pair of points with different categories to create binary features, e.g. F < 8 and F < 10.
Step 3: Choose the split with the best info gain.

[figure: examples plotted at feature values 5, 7, 9, 11, 13]
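Step 2 can be sketched as follows. The labels attached to the plotted values in the test below are hypothetical (the original figure's +/- markings did not survive transcription); only the midpoint rule itself comes from the slide.

```python
def candidate_splits(examples):
    """examples: (value, label) pairs. Returns the midpoints between consecutive
    values whose labels differ; each midpoint t yields a binary feature F < t."""
    pts = sorted(examples)
    splits = []
    for (v1, l1), (v2, l2) in zip(pts, pts[1:]):
        if l1 != l2:
            splits.append((v1 + v2) / 2.0)
    return splits
```

Only boundaries between differently-labelled neighbours are considered, which keeps the number of candidate thresholds small.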
Handling Numeric Features (contd.)

Note: cannot discard a numeric feature after use in one portion of the decision tree.

[figure: a tree that tests F > 5 at the root and then tests F > 10 again inside one subtree, reaching + and - leaves]
Property of Info Gain Measure

It favours features with high branching factors (i.e., many possible values).

Extreme case: at most one example per leaf, so all I(.,.) scores for the leaves equal zero and the feature scores best.

[figure: a Student ID feature splits into branches 1, ..., 99, ..., 999, each holding at most a single example, e.g. 1+ 0-, 0+ 0-, 0+ 1-]
Fix: Method 1

Convert all features to binary, e.g. Color = {Red, Blue, Green}.

From 1 N-valued feature to N binary features:
  Color = Red?   {True, False}
  Color = Blue?  {True, False}
  Color = Green? {True, False}

Used in logistic regression, neural nets, SVMs, ...
D-tree readability is probably worse.
Fix: Method 2

Find the info content in the answer to: what is the value of feature F, ignoring the output category?

IV(F) = -Σ_{i=1..K} ( (p_i + n_i) / (p + n) ) × log2( (p_i + n_i) / (p + n) )

where (p_i + n_i) / (p + n) is the fraction of all examples with F = i. (IV is also called the "split information".)

Choose the F that maximizes:

GainRatio(F) = Gain(F) / IV(F)

Read the text (Mitchell) for exact details!
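The IV(F) and gain-ratio computation above can be sketched as (the function names are mine):

```python
import math

def intrinsic_value(branch_sizes):
    """IV(F) = -sum_i (|S_i|/|S|) lg(|S_i|/|S|): the info in F's value alone,
    ignoring the output category. branch_sizes holds p_i + n_i for each value i."""
    total = sum(branch_sizes)
    return -sum(n / total * math.log2(n / total) for n in branch_sizes if n)

def gain_ratio(gain, branch_sizes):
    """GainRatio(F) = Gain(F) / IV(F)."""
    return gain / intrinsic_value(branch_sizes)
```

A many-valued feature like Student ID shatters the data into many tiny branches, making IV large and pushing its gain ratio down, which is exactly the correction this method is after.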
Fix: Method 3

Group values of nominal features.

• Done in C4.5 and CART (Breiman et al., 1984)
• Breiman et al. proved that for the 2-category case, the optimal binary partition can be found by considering only O(N) possibilities instead of O(2^N)
• Quinlan 1993: would it be better to do this as a post-processing step? (i.e., build the tree, then merge)

[figure: Color? with branches R, B, G, Y merged into Color? with two branches, "R or B or ..." and "G or Y or ..."]
Multiple Category Classification - Method 1

Approach 1: Learn one tree per class, i.e. (Category_i) versus (¬Category_i).
Multiple Category Classification - Method 2

Approach 2: Learn one tree in total.

Often learning algorithms will subdivide the full space such that every point belongs to some category.

Issues: breaking ties? If many categories are predicted, maybe use probability.
Scoring "Splits" for Regression (Real-Valued) Problems

We want real values at the leaves.

- For each feature F, "split" as done in ID3
- Use the residue remaining, say using linear least squares (LLS), instead of info gain:

  Error_F,i = Σ_{ex ∈ subset_i} [ out(ex) - LLS(ex) ]²

  TotalError(F) = Σ_{i ∈ splits of F} Error_F,i

- Why not a weighted sum (error per example vs. total error)?
- Some approaches just place constants at the leaves

[figure: output plotted against X, with an LLS line fit to the examples]
Runtime Performance of ID3

• Let E = # examples, F = # features
• At level 1:
  Look at each feature
  Look at each example (to get its feature value)
  Work to choose 1 feature = O(F × E)
Runtime Performance of ID3 (cont.)

• In the worst case, need to consider all features along all paths (full tree): O(F² × E)

Reasonably efficient.
Generating Rules

• Antecedent: conjunction of all decisions leading to the terminal node
• Consequent: label of the terminal node
• Example:

[figure: COLOR? branches Red → SIZE? (Big → +, Small → -), Blue → +, Green → -]
Generating Rules (cont.)

• Generates rules:
  Color = Green → -
  Color = Blue → +
  Color = Red and Size = Big → +
  Color = Red and Size = Small → -

• Note:
  1. Can "clean up" the rule set (see Quinlan's C4.5)
  2. Decision trees learn disjunctive concepts
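The tree-to-rules conversion can be sketched as follows, assuming the nested (feature, {value: subtree}) tree representation used earlier in these notes:

```python
def tree_to_rules(tree, conditions=()):
    """Walk a nested (feature, {value: subtree}) tree; each root-to-leaf path
    becomes a rule (antecedent conjunction, consequent label)."""
    if not isinstance(tree, tuple):          # leaf: its label is the consequent
        return [(conditions, tree)]
    feature, branches = tree
    rules = []
    for value, sub in branches.items():
        rules += tree_to_rules(sub, conditions + ((feature, value),))
    return rules
```

Applied to the COLOR/SIZE tree above, this yields exactly the four rules listed, one per leaf.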
Noise - A Major Issue in ML

• Worst case: + and - at the same point in feature space
• Causes:
  1. Too few features ("hidden variables") or too few possible values
  2. Incorrectly reported/measured/judged feature values
  3. Misclassified instances
Noise - A Major Issue in ML (cont.)

• Issue: overfitting - producing an "awkward" concept because of a few "noisy" points

[figure: the same + and - examples enclosed two ways: a contorted boundary that fits every point (bad performance on future examples?) vs. a smooth boundary that ignores the outliers (better performance?)]
Overfitting Viewed in Terms of Function-Fitting

Data = Red Line + Noise Model

[figure: f(x) plotted against x; the + data points scatter around an underlying line, and a wiggly curve passing through every point overfits it]
Definition of Overfitting

• Assuming a large enough test set so that it is representative, concept C overfits the training data if there exists a "simpler" concept S so that

  Training-set accuracy of C  >  Training-set accuracy of S

  but

  Test-set accuracy of C  <  Test-set accuracy of S
Remember!

• It is easy to learn/fit the training data
• What's hard is generalizing well to future ("test set") data!
• Overfitting avoidance is a key issue in machine learning
Can One Underfit?

• Sure, if not fully fitting the training set - e.g., just return the majority category (+ or -) in the trainset as the learned model
• But also if there is not enough data to illustrate the important distinctions
ID3 & Noisy Data

• To avoid overfitting, allow splitting to stop before all examples are of one class
• Option 1: if info left < ε (a threshold), don't split
  - empirically failed; bad performance on error-free data (Quinlan)
ID3 & Noisy Data (cont.)

• Option 2: estimate whether all remaining features are statistically independent of the class of the remaining examples
  - uses the chi-squared test of the original ID3 paper
  - works well on error-free data
ID3 & Noisy Data (cont.)

• Option 3 (not in the original ID3 paper): build the complete tree, then use some "spare" (tuning) examples to decide which parts of the tree can be pruned. This is called "reduced [tune-set] error pruning".
ID3 & Noisy Data (cont.)

• Pruning is currently the best choice - see C4.5 for technical details
• Repeat using a greedy algorithm
Greedily Pruning D-trees

• Sample (hill climbing) search space

[figure: candidate pruned trees are evaluated; follow the best one, and stop if no improvement]
Pruning by Measuring Accuracy on Tune Set

1. Run ID3 to fully fit the TRAIN' set; measure accuracy on TUNE
2. Consider all subtrees where ONE interior node is removed and replaced by a leaf
   - label it with the majority category in the pruned subtree
   - choose the best subtree on TUNE
   - if no improvement, quit
3. Go to 2
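The loop above can be sketched as follows, again using the nested (feature, {value: subtree}) representation. One simplification to note: for brevity this sketch labels every candidate leaf with a single caller-supplied majority class, whereas the slide uses the majority category within each pruned subtree.

```python
def classify(tree, ex):
    """Follow a nested (feature, {value: subtree}) tree down to a leaf label."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[ex[feature]]
    return tree

def subtrees_with_one_cut(tree, leaf_label):
    """All trees obtained by replacing exactly ONE interior node with a leaf."""
    if not isinstance(tree, tuple):
        return []
    feature, branches = tree
    cuts = [leaf_label]                      # cut here: this whole subtree becomes a leaf
    for v, sub in branches.items():          # or cut somewhere deeper
        for cut in subtrees_with_one_cut(sub, leaf_label):
            new_branches = dict(branches)
            new_branches[v] = cut
            cuts.append((feature, new_branches))
    return cuts

def reduced_error_prune(tree, tune, leaf_label):
    """Steps 2-3: keep taking the best single cut while TUNE accuracy improves."""
    def acc(t):
        return sum(classify(t, ex) == lab for ex, lab in tune) / len(tune)
    while True:
        best = max(subtrees_with_one_cut(tree, leaf_label), key=acc, default=None)
        if best is None or acc(best) <= acc(tree):
            return tree
        tree = best
```

In the usage below, a subtree split on a noisy feature B is replaced by a leaf because doing so raises tune-set accuracy, and pruning then stops since no further cut helps.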
The Tradeoff in Greedy Algorithms

• Efficiency vs. optimality

[figure: a tree rooted at R with children A and B; B has children C and D, and C has children E and F. Evaluating individual cuts on the tune set finds the best cuts discard C's and F's subtrees, but taking the single best cut discards B's subtrees - an irrevocable choice]

Greedy search: a powerful, general-purpose trick of the trade.
Hypothetical Trace of a Greedy Algorithm

Full-tree accuracy = 85% on the TUNE set

[figure: a tree rooted at R with interior nodes A, B, C, D, E, F; each node is annotated with the accuracy obtained if that node is replaced by a leaf (leaving the rest of the tree the same): [64], [77], [74], [87], [63], [89], [88]]

Pruning @ B works best.
Hypothetical Trace of a Greedy Algorithm (cont.)

• Full-tree accuracy = 89%

[figure: the pruned tree, rooted at R with children A [64] and B [77]]

- STOP, since there is no improvement by cutting again, and return the above tree.
Train/Tune/Test Accuracies

[figure: accuracy (up to 100%) plotted against tree size for the Train, Tune, and Test sets; Train keeps rising while Tune and Test peak and then fall. The chosen pruned tree is where Tune peaks; the ideal tree to choose is where Test peaks]
Rule Post-Pruning (another greedy algorithm)
1. Induce a decision tree
2. Convert it to rules (see earlier slide)
3. Consider dropping ONE rule antecedent
   • Delete the one that improves tuning-set accuracy the most
   • Repeat as long as progress is being made
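A sketch of this procedure under assumed representations (not code from the lecture): each rule is a (list-of-antecedent-predicates, class) pair, rules are tried in order, and unmatched examples get a default class.

```python
def rule_predict(rules, default, x):
    """Return the class of the first rule whose antecedents all hold."""
    for antecedents, label in rules:
        if all(test(x) for test in antecedents):
            return label
    return default

def rules_acc(rules, default, data):
    return sum(rule_predict(rules, default, x) == y for x, y in data) / len(data)

def post_prune_rules(rules, default, tune):
    """Greedily drop the single antecedent whose removal most improves
    tuning-set accuracy; stop when no deletion helps."""
    best = rules_acc(rules, default, tune)
    while True:
        cands = [rules[:i] + [(ants[:j] + ants[j + 1:], label)] + rules[i + 1:]
                 for i, (ants, label) in enumerate(rules)
                 for j in range(len(ants))]
        scored = [(rules_acc(c, default, tune), c) for c in cands]
        if not scored or max(a for a, _ in scored) <= best:
            return rules                      # no progress: stop
        best, rules = max(scored, key=lambda s: s[0])
```

Because antecedents are dropped per rule, a node shared by several rules can disappear from some of them while surviving in others, which is the advantage noted on the next slide.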
Rule Post-Pruning (continued)
• Advantages
  • Allows an intermediate node to be pruned from some rules but retained in others
  • Can correct poor early decisions in tree construction
  • The final concept is more understandable
Training with Noisy Data
• If we can clean up the training data, should we do so?
• No (assuming one can't clean up the testing data when the learned concept will be used)
• It is better to train with the same type of data as will be experienced when the result of learning is put into use
Overfitting + Noise
• Using the strict definition of overfitting presented earlier, is it possible to overfit noise-free data?
  • In general?
  • Using ID3?
Example of Overfitting of Noise-free Data
Let
• Correct concept = A ∧ B
• Feature C be true 50% of the time, for both + and − examples
• Prob(+ example) = 0.9
• Training set:
  • +: ABCDE, ABC¬DE, ABCD¬E
  • −: A¬B¬CD¬E, ¬AB¬C¬DE
Example (Continued)

Tree              Trainset Accuracy    Testset Accuracy
ID3's             100%                 50%
Simpler "tree"    60%                  90%

[figures: ID3's tree tests only C (T → +, F → −), since C happens to separate this
training set perfectly; the simpler "tree" is the single leaf that always predicts +]
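The training-set numbers in the table can be checked directly, and the test-set numbers follow from the stated probabilities. This assumes, as the figures suggest, that ID3's tree tests only C and the simpler "tree" is the single leaf "+".

```python
# The five training examples (features A..E); the correct concept is A AND B
train = [({'A': 1, 'B': 1, 'C': 1, 'D': 1, 'E': 1}, '+'),
         ({'A': 1, 'B': 1, 'C': 1, 'D': 0, 'E': 1}, '+'),
         ({'A': 1, 'B': 1, 'C': 1, 'D': 1, 'E': 0}, '+'),
         ({'A': 1, 'B': 0, 'C': 0, 'D': 1, 'E': 0}, '-'),
         ({'A': 0, 'B': 1, 'C': 0, 'D': 0, 'E': 1}, '-')]

id3_tree = lambda x: '+' if x['C'] else '-'   # C separates TRAIN perfectly
simple_tree = lambda x: '+'                   # the single leaf "+"

def train_acc(h):
    return sum(h(x) == y for x, y in train) / len(train)

print(train_acc(id3_tree))     # 1.0 -> 100%
print(train_acc(simple_tree))  # 0.6 ->  60%

# Expected TEST accuracy, from Prob(+) = 0.9 and Prob(C) = 0.5 in each class:
#   C-tree: 0.9 * 0.5 + 0.1 * 0.5 = 0.5  (C is uninformative on new data)
#   leaf +: Prob(+) = 0.9
```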
Post-Pruning
• There are more sophisticated methods of deciding where to prune than simply estimating accuracy on a tuning set
• See the C4.5 and CART books for details
• We won't discuss them, except for MDL
• Tuning sets are also called
  • Pruning sets (in d-tree algorithms)
  • Validation sets (in general)
Tuning Sets vs MDL
• Two ways to deal with overfitting
  • Tuning sets
    • Empirically evaluate pruned trees
  • MDL (Minimal Description Length)
    • Theoretically evaluate/score pruned trees
    • Describe the training data in as few bits as possible ("compression")
MDL (continued)
• No need to hold back some training data
• But how good is the MDL hypothesis?
  • Heuristic: MDL => good generalization
The Minimal Description Length (MDL) Principle
(Rissanen, 1986; Quinlan and Rivest, 1989)
• Informally, we want to view a training set as
  data = general rule + exceptions to the rule ("noise")
• Tradeoff between
  • A simple rule, but many exceptions
  • A complex rule with few exceptions
• How to make this tradeoff?
  • Try to minimize the "description length" of the rule + exceptions
Trading Off Simplicity vs Coverage
• Minimize
  size of answer = (# bits needed to represent a decision tree that covers, possibly incompletely, the training examples)
                   + λ × (# bits needed to encode the exceptions to this decision tree)
• Answer: the message/description length
• λ: a weighting factor, user-defined or set via a tuning set
• How are future examples categorized?
• Important issue: what's the best coding strategy to use?
A Simple MDL Algorithm
1. Build the full tree using ID3 (and ALL the training examples)
2. Consider all/many subtrees, keeping the one that minimizes:
   score = (# nodes in tree) + λ × (error rate on training set)
   (a crude scoring function)

Some details: if # features = Nf and # examples = Ne,
then we need Ceiling(log2 Nf) bits to encode each tree node
and Ceiling(log2 Ne) bits to encode an exception.
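The description length can be sketched as a scoring function using the per-item bit costs from the note above; `mdl_score` is a hypothetical helper, not code from the course.

```python
from math import ceil, log2

def mdl_score(num_nodes, num_exceptions, num_features, num_examples, lam=1.0):
    """Description length: bits to encode the tree, plus lambda-weighted
    bits to encode its exceptions (training examples it misclassifies)."""
    tree_bits = num_nodes * ceil(log2(num_features))          # per-node cost
    exception_bits = num_exceptions * ceil(log2(num_examples))  # per-exception cost
    return tree_bits + lam * exception_bits

# e.g. a 7-node tree with 3 exceptions, given 5 features and 100 examples:
# 7 * 3 + 1.0 * (3 * 7) = 42.0 bits
```

A subtree with fewer nodes but more exceptions can beat the full tree under this score, which is exactly the simplicity-vs-coverage tradeoff above.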
Searching the Space of Pruned D-trees with MDL
• Can use the same greedy search algorithm used with pruning sets
• But use the MDL score rather than pruning-set accuracy as the heuristic function
MDL Summarized
• The overfitting problem
  • We can exactly fit the training data, but will this generalize well to test data?
  • Trade off some training-set errors for fewer test-set errors
• One solution: the MDL hypothesis
  • Solve the MDL problem (on the training data) and you are likely to generalize well (accuracy on the test data)
• The MDL problem
  • Minimize |description of general concept| + λ |list of exceptions (in the train set)|
Small Disjuncts (Holte et al., IJCAI 1989)
• The results of learning can often be viewed as a disjunction of conjunctions
• Definition: small disjuncts are disjuncts that correctly classify few training examples
  • Not necessarily small in area
The Problem with Small Disjuncts
• Collectively, they cover much of the training data, but account for much of the test-set error
• One study
  • Small disjuncts covered 41% of the training data and produced 95% of the test-set error
• The "small-disjuncts problem" is still an open issue (see the Quinlan paper in MLJ for additional discussion)
Overfitting Avoidance Wrapup
• Note: this is a fundamental issue in all of ML, not just decision trees; after all, it is easy to exactly match the training data via "table lookup"
• Approaches
  • Use a simple ML algorithm from the start
  • Optimize accuracy on a tuning set
  • Only make distinctions that are statistically justified
  • Minimize |concept description| + λ |exception list|
  • Use ensembles to average out overfitting (next topic)
Decision "Stumps"
• Holte (MLJ) compared:
  • Decision trees with only one decision (decision stumps)
    vs
  • Trees produced by C4.5 (with its pruning algorithm used)
• Decision "stumps" do remarkably well on the UC Irvine data sets
  • Archive too easy?
• Decision stumps are a "quick and dirty" control for comparing to new algorithms
  • But C4.5 is easy to use and probably a better control
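A decision stump in the spirit of Holte's 1R can be sketched as follows; this is a simplification (real 1R also discretizes numeric features and handles missing values, which this omits).

```python
from collections import Counter, defaultdict

def fit_stump(data):
    """Choose the single feature whose one-test 'tree' makes the fewest
    training errors; each feature value maps to its majority class."""
    best = None
    for f in data[0][0]:                        # candidate features
        by_value = defaultdict(Counter)         # class counts per feature value
        for x, y in data:
            by_value[x[f]][y] += 1
        n_correct = sum(c.most_common(1)[0][1] for c in by_value.values())
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        if best is None or n_correct > best[0]:
            best = (n_correct, f, rule)
    _, f, rule = best
    return lambda x: rule[x[f]]

# A is perfectly predictive here, B is noise, so the stump tests A only
stump = fit_stump([({'A': 1, 'B': 0}, '+'), ({'A': 1, 'B': 1}, '+'),
                   ({'A': 0, 'B': 1}, '-'), ({'A': 0, 'B': 0}, '-')])
```

That such a one-test classifier comes within a few percent of C4.5 on many UCI datasets is the surprise of Holte's study.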
C4.5 Compared to 1R ("Decision Stumps")
• Test-set accuracy
• 1st column: UCI datasets (see the Holte paper for the key)
• Max diff: 2nd row
• Min diff: 5th row
• UCI datasets too easy?
Dataset   C4.5     1R
BC        72.0%    68.7%
CH        99.2%    68.7%
GL        63.2%    67.6%
G2        74.3%    53.8%
HD        73.6%    72.9%
HE        81.2%    76.3%
HO        83.6%    81.0%
HY        99.1%    97.2%
IR        93.8%    93.5%
LA        77.2%    71.5%
LY        77.5%    70.7%
MU        100.0%   98.4%
SE        97.7%    95.0%
SO        97.5%    81.0%
VO        95.6%    95.2%
V1        89.4%    86.8%
Considering the Cost of Measuring a Feature
• We want trees with high accuracy whose tests are also inexpensive to compute
• Common heuristic:
  • Information_Gain(F)² / Cost(F)
  • Used in medical domains as well as in robot-sensing tasks
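The heuristic simply replaces information gain as the feature-scoring function during tree induction. A sketch, with a user-supplied cost table (the function names here are illustrative, not from the lecture):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(data, f):
    """Standard information gain of splitting on feature f."""
    gain = entropy([y for _, y in data])
    for v in set(x[f] for x, _ in data):
        subset = [y for x, y in data if x[f] == v]
        gain -= len(subset) / len(data) * entropy(subset)
    return gain

def cost_sensitive_score(data, f, cost):
    """Information_Gain(F)^2 / Cost(F): prefer informative AND cheap tests."""
    return info_gain(data, f) ** 2 / cost[f]
```

Squaring the gain means an expensive test can still win if it is sufficiently informative, while cheap but uninformative tests score near zero.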