Post on 19-Dec-2015
Algorithms: The basic methods
Inferring rudimentary rules

Simplicity first
• Simple algorithms often work surprisingly well
• Many different kinds of simple structure exist:
  – One attribute might do all the work
• 1R: learns a 1-level decision tree
  – In other words, generates a set of rules that all test one particular attribute
• Basic version (assuming nominal attributes):
  – One branch for each of the attribute’s values
  – Each branch assigns the most frequent class
  – Error rate: proportion of instances that don’t belong to the majority class of their corresponding branch
  – Choose the attribute with the lowest error rate
Pseudo-code for 1R

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate
• Note: “missing” is always treated as a separate attribute value
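The pseudo-code above can be sketched in Python as follows. This is a minimal version for nominal attributes only (the function and variable names are mine), run here on the outlook and windy columns of the weather data:

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, target):
    """1R: for each attribute, build one rule per value (predict the
    majority class), then keep the attribute with the lowest error count."""
    best = None
    for attr in attributes:
        # Count class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst[attr]][inst[target]] += 1
        # Rule: each value predicts its most frequent class.
        rules = {val: cls.most_common(1)[0][0] for val, cls in counts.items()}
        errors = sum(1 for inst in instances
                     if rules[inst[attr]] != inst[target])
        if best is None or errors < best[1]:
            best = (attr, errors, rules)
    return best

# The outlook, windy and play columns of the weather data.
data = [
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": True,  "play": "yes"},
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "sunny",    "windy": True,  "play": "yes"},
    {"outlook": "overcast", "windy": True,  "play": "yes"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": True,  "play": "no"},
]

attr, errors, rules = one_r(data, ["outlook", "windy"], "play")
print(attr, errors, rules)  # outlook wins with 4/14 errors
```

Outlook makes 2 errors on sunny and 2 on rainy (4 total); windy makes 5, so outlook is chosen.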
Evaluating the weather attributes

Arbitrarily breaking the tie between the first and third rule sets, we pick the first. Oddly enough, the game is played when it’s overcast or rainy but not when it’s sunny. Perhaps it’s an indoor pursuit.
Dealing with numeric attributes
• Numeric attributes are discretized: the range of the attribute is divided into a set of intervals
  – Instances are sorted according to the attribute’s values
  – Breakpoints are placed where the (majority) class changes (so that the total error is minimized)
outlook    temp.  humidity  windy  play
sunny       85      85      FALSE  no
sunny       80      90      TRUE   no
overcast    83      86      FALSE  yes
rainy       70      96      FALSE  yes
rainy       68      80      FALSE  yes
rainy       65      70      TRUE   no
overcast    64      65      TRUE   yes
sunny       72      95      FALSE  no
sunny       69      70      FALSE  yes
rainy       75      80      FALSE  yes
sunny       75      70      TRUE   yes
overcast    72      90      TRUE   yes
overcast    81      75      FALSE  yes
rainy       71      91      TRUE   no
The problem of overfitting
• The discretization procedure is very sensitive to noise
  – A single instance with an incorrect class label will most likely produce a separate interval
  – Also: e.g. a time-stamp attribute will have zero errors, because each partition contains just one instance
  – However, highly branching attributes do not usually perform well on future examples
• Simple solution: enforce a minimum number of instances of the majority class per interval
  – Weather data example (with the minimum set to 3):
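One way to sketch this procedure in Python (my own reading of it): close a partition once its majority class has at least three members and the next instance starts a different class, then merge adjacent partitions that share a majority class. On the temperature column this yields a single breakpoint between 75 and 80, i.e. temperature ≤ 77.5 → yes, > 77.5 → no:

```python
from collections import Counter

def majority(part):
    """Most frequent class in a partition (ties broken by first appearance)."""
    return Counter(c for _, c in part).most_common(1)[0][0]

def discretize(pairs, min_majority=3):
    """1R discretization sketch: sweep sorted (value, class) pairs, closing a
    partition once its majority class has min_majority members and the next
    instance belongs to a different class; then merge adjacent partitions
    that share the same majority class."""
    pairs = sorted(pairs)
    parts, current = [], []
    for i, (value, cls) in enumerate(pairs):
        current.append((value, cls))
        maj, n = Counter(c for _, c in current).most_common(1)[0]
        nxt = pairs[i + 1][1] if i + 1 < len(pairs) else None
        if n >= min_majority and nxt is not None and nxt != maj:
            parts.append(current)
            current = []
    if current:
        parts.append(current)   # leftovers form the last partition
    merged = [parts[0]]
    for part in parts[1:]:
        if majority(part) == majority(merged[-1]):
            merged[-1].extend(part)
        else:
            merged.append(part)
    return [([v for v, _ in p], majority(p)) for p in merged]

# (temperature, play) pairs from the weather data above.
temps = [(85, "no"), (80, "no"), (83, "yes"), (70, "yes"), (68, "yes"),
         (65, "no"), (64, "yes"), (72, "no"), (69, "yes"), (75, "yes"),
         (75, "yes"), (72, "yes"), (81, "yes"), (71, "no")]
print(discretize(temps))
```

The final interval (80, 81, 83, 85) is a 2–2 tie, arbitrarily assigned "no" here.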
Result of overfitting avoidance
• Whenever adjacent partitions have the same majority class, as the first two in this example do, we merge them
• Final result for the temperature attribute:

Resulting rule set

The same procedure leads to the rule for humidity, which turns out to be the best.
Discussion of 1R
• 1R was described in a paper by Holte (1993)
• Contains an experimental evaluation on 16 datasets
• Minimum number of instances was set to 6 after some experimentation (instead of 3, as we did)
• 1R’s simple rules performed not much worse than much more complex decision trees
• Simplicity first pays off!
Statistical modeling
• “Opposite” of 1R: use all the attributes
• Two assumptions: attributes are
  – equally important
  – statistically conditionally independent (given the class value)
• This means that knowledge about the value of a particular attribute doesn’t tell us anything about the value of another attribute (if the class is known)
• Although based on assumptions that are almost never correct, this scheme works well in practice!
Probabilities for the weather data
Bayes’s rule
• Probability of event H given evidence E:

    P(H | E) = P(E | H) P(H) / P(E)

• A priori probability of H: P(H)
  – Probability of the event before evidence has been seen
• A posteriori probability of H: P(H | E)
  – Probability of the event after evidence has been seen
Naïve Bayes for classification
• Classification learning: what’s the probability of the class given an instance?
  – Evidence E = instance
  – Event H = class value for the instance
• Naïve Bayes assumption: evidence can be split into independent parts (i.e. the attributes of the instance!)

    P(H | E) = P(E | H) P(H) / P(E)
             = P(E1 | H) P(E2 | H) … P(En | H) P(H) / P(E)
The weather data example
• A new instance E: Outlook=Sunny, Temperature=Cool, Humidity=High, Windy=True

P(play=yes | E) =
P(Outlook=Sunny | play=yes) *
P(Temp=Cool | play=yes) *
P(Humidity=High | play=yes) *
P(Windy=True | play=yes) *
P(play=yes) / P(E)
= (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0053 / P(E)

Don’t worry about the 1/P(E); it’s α, the normalization constant.
The weather data example (continued)
P(play=no | E) =
P(Outlook=Sunny | play=no) *
P(Temp=Cool | play=no) *
P(Humidity=High | play=no) *
P(Windy=True | play=no) *
P(play=no) / P(E) =
= (3/5) * (1/5) * (4/5) * (3/5) * (5/14) / P(E) = 0.0206 / P(E)
Normalization constant

P(play=yes, E) + P(play=no, E) = P(E), i.e.
P(play=yes, E)/P(E) + P(play=no, E)/P(E) = 1, i.e.
P(play=yes | E) + P(play=no | E) = 1, i.e.
0.0053 / P(E) + 0.0206 / P(E) = 1, i.e.
1/P(E) = 1/(0.0053 + 0.0206)

So,
P(play=yes | E) = 0.0053 / (0.0053 + 0.0206) = 20.5%
P(play=no | E)  = 0.0206 / (0.0053 + 0.0206) = 79.5%
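The whole calculation can be sketched in Python. This counts frequencies from the nominal version of the weather data (temperature and humidity discretized to hot/mild/cool and high/normal, the standard nominal variant of this dataset; function names are mine) and reproduces the 20.5% / 79.5% result:

```python
from collections import Counter, defaultdict

def train_nb(data, target):
    """Collect the counts naive Bayes needs: class frequencies and,
    per (attribute, class) pair, the frequency of each attribute value."""
    class_counts = Counter(row[target] for row in data)
    cond = defaultdict(Counter)
    for row in data:
        for attr, val in row.items():
            if attr != target:
                cond[(attr, row[target])][val] += 1
    return class_counts, cond

def posterior(class_counts, cond, instance):
    """P(class) times the product of P(attr=val | class), normalized."""
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        p = n / total
        for attr, val in instance.items():
            p *= cond[(attr, cls)][val] / n
        scores[cls] = p
    z = sum(scores.values())          # the normalization constant P(E)
    return {cls: p / z for cls, p in scores.items()}

rows = [("sunny","hot","high","false","no"), ("sunny","hot","high","true","no"),
        ("overcast","hot","high","false","yes"), ("rainy","mild","high","false","yes"),
        ("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
        ("overcast","cool","normal","true","yes"), ("sunny","mild","high","false","no"),
        ("sunny","cool","normal","false","yes"), ("rainy","mild","normal","false","yes"),
        ("sunny","mild","normal","true","yes"), ("overcast","mild","high","true","yes"),
        ("overcast","hot","normal","false","yes"), ("rainy","mild","high","true","no")]
names = ["outlook", "temp", "humidity", "windy", "play"]
data = [dict(zip(names, r)) for r in rows]

cc, cond = train_nb(data, "play")
post = posterior(cc, cond, {"outlook": "sunny", "temp": "cool",
                            "humidity": "high", "windy": "true"})
print({k: round(v, 3) for k, v in post.items()})  # yes ≈ 0.205, no ≈ 0.795
```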
The “zero-frequency problem”
• What if an attribute value never occurred with some class value (e.g. suppose “Humidity = High” never occurred with class “play = yes”)?
  – The probability P(Humidity=High | play=yes) would be zero!
• The a posteriori probability would then also be zero, no matter how likely the other values are!
• Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
  – I.e. initialize the counters to 1 instead of 0
• Result: probabilities will never be zero! (stabilizes probability estimates)
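A minimal sketch of the Laplace estimator (function name is mine), using the Outlook counts for play=no from the weather data, where overcast really does have a zero count (sunny 3, overcast 0, rainy 2):

```python
from collections import Counter

def laplace_probs(value_counts, all_values):
    """Laplace-smoothed conditional probabilities: add 1 to every
    attribute value-class count so no probability is ever zero."""
    total = sum(value_counts[v] for v in all_values) + len(all_values)
    return {v: (value_counts[v] + 1) / total for v in all_values}

counts = Counter({"sunny": 3, "overcast": 0, "rainy": 2})
probs = laplace_probs(counts, ["sunny", "overcast", "rainy"])
print(probs)  # {'sunny': 0.5, 'overcast': 0.125, 'rainy': 0.375}
```

Without smoothing, overcast would get probability 0/5 = 0; with it, (0+1)/(5+3) = 0.125, and the probabilities still sum to 1.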
Missing values
• Training: the instance is not included in the frequency counts for that attribute value-class combination
• Classification: the attribute is omitted from the calculation
• Example (Outlook missing):

P(play=yes | E) =
P(Temp=Cool | play=yes) *
P(Humidity=High | play=yes) *
P(Windy=True | play=yes) *
P(play=yes) / P(E)
= (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0238 / P(E)

P(play=no | E) =
P(Temp=Cool | play=no) *
P(Humidity=High | play=no) *
P(Windy=True | play=no) *
P(play=no) / P(E)
= (1/5) * (4/5) * (3/5) * (5/14) / P(E) = 0.0343 / P(E)

After normalization: P(play=yes | E) = 41%, P(play=no | E) = 59%
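At classification time a missing attribute simply means dropping its factor from each product; a quick check of the arithmetic above:

```python
# With Outlook missing, its factor is dropped from each product.
p_yes = (3/9) * (3/9) * (3/9) * (9/14)   # Temp=Cool, Humidity=High, Windy=True | yes
p_no  = (1/5) * (4/5) * (3/5) * (5/14)   # the same factors given play=no
z = p_yes + p_no                          # the normalization constant P(E)
print(round(p_yes / z, 2), round(p_no / z, 2))  # 0.41 0.59
```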
Dealing with numeric attributes
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
• The probability density function for the normal distribution is defined by two parameters:
  – the sample mean μ
  – the standard deviation σ
• The density function f(x):

    f(x) = (1 / (sqrt(2π) σ)) · exp(−(x − μ)² / (2σ²))
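The density function translates directly into code. As a check, for the weather data the temperature column given play=yes has sample mean 73 and standard deviation 6.2 (computed from the nine “yes” rows of the table above), which gives the 0.0340 factor used in the classification example below:

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal density f(x) for sample mean mu and standard deviation sigma."""
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

# Temperature given play=yes: mean 73, std. dev. 6.2.
print(round(gaussian_density(66, 73, 6.2), 4))  # 0.034
```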
Statistics for the weather data
• Temperature: mean 73, std. dev. 6.2 (play=yes); mean 74.6, std. dev. 7.9 (play=no)
• Humidity: mean 79.1, std. dev. 10.2 (play=yes); mean 86.2, std. dev. 9.7 (play=no)
Classifying a new day
• A new day E: Outlook=Sunny, Temperature=66, Humidity=90, Windy=True

P(play=yes | E) =
P(Outlook=Sunny | play=yes) *
P(Temp=66 | play=yes) *
P(Humidity=90 | play=yes) *
P(Windy=True | play=yes) *
P(play=yes) / P(E)
= (2/9) * (0.0340) * (0.0221) * (3/9) * (9/14) / P(E) = 0.000036 / P(E)

P(play=no | E) =
P(Outlook=Sunny | play=no) *
P(Temp=66 | play=no) *
P(Humidity=90 | play=no) *
P(Windy=True | play=no) *
P(play=no) / P(E)
= (3/5) * (0.0279) * (0.0380) * (3/5) * (5/14) / P(E) = 0.000136 / P(E)

After normalization: P(play=yes | E) = 20.9%, P(play=no | E) = 79.1%
Probability densities
• Relationship between probability and density:

    P(x − ε/2 ≤ X ≤ x + ε/2) ≈ ε · f(x)

• But: this doesn’t change the calculation of the a posteriori probabilities, because the factor ε cancels out
• Exact relationship:

    P(a ≤ X ≤ b) = ∫ from a to b of f(t) dt
Discussion of Naïve Bayes
• Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
• Why? Because classification doesn’t require accurate probability estimates, as long as the maximum probability is assigned to the correct class
• However: adding too many redundant attributes will cause problems (e.g. identical attributes)
• Note also: many numeric attributes are not normally distributed