Data Mining, Chapter 6
Implementations: Real Machine Learning Schemes
Kirk Scott
The Little Brown Bat
A Zombie Fly laying eggs inside a Honey Bee
Argyria
Methemoglobinemia (from Wikipedia)
• Methemoglobinemia (or methaemoglobinaemia) is a disorder characterized by the presence of a higher than normal level of methemoglobin (metHb, i.e., ferric [Fe3+] rather than ferrous [Fe2+] haemoglobin) in the blood. Methemoglobin is an oxidized form of hemoglobin that has a decreased affinity for oxygen, resulting in an increased affinity of oxygen to other heme sites within the same red blood cell.
• This leads to an overall reduced ability of the red blood cell to release oxygen to tissues, with the associated oxygen–hemoglobin dissociation curve therefore shifted to the left. When methemoglobin concentration is elevated in red blood cells, tissue hypoxia can occur.
• …
• Carriers
• The Fugates, a family that lived in the hills of Kentucky, are the most famous example of this hereditary genetic condition. They are known as the "Blue Fugates." Martin Fugate settled near Hazard, Kentucky, circa 1800. His wife was a carrier of the recessive methemoglobinemia (met-H) gene, as was a nearby clan with whom the Fugates intermarried. As a result, many descendants of the Fugates were born with met-H.[7][8][9]
• The "blue men of Lurgan" were a pair of Lurgan men suffering from what was described as "familial idiopathic methaemoglobinaemia" who were treated by Dr. James Deeny in 1942. Deeny, who would later become the Chief Medical Officer of the Republic of Ireland, prescribed a course of ascorbic acid and sodium bicarbonate. In case one, by the eighth day of treatment there was a marked change in appearance and by the twelfth day of treatment the patient's complexion was normal. In case two, the patient's complexion reached normality over a month-long duration of treatment.[10]
Back to the Topic at Hand
• Chapter 4 provided an introduction to data mining algorithms and the motivations underlying them
• Chapter 5 provided a relatively in-depth treatment of how results are evaluated
• Some evaluation features that exist in Weka were brought out, like lift charts
• Doubtless, other features have also been implemented in Weka
• When you are doing your project, you will have to look more closely into the evaluation tools in Weka
• With the foregoing background, chapter 6 covers issues surrounding various data mining algorithms in some detail
• My goal is to present this information at a level where you would be an informed user of Weka
• When doing your project, you will be using Weka
• If issues come up, you will be informed enough to recognize them, and you will be able to search around in Weka for how to make a decision about them
• You will have to become sort of an expert on the data mining algorithms you choose to use
• At the end of the chapter, in section 6.11, the book lists all of the implementations in Weka
• I think it will be useful to list all of the implementations up front
• This provides a preview of what you’ll find in Weka
• It also provides context for the discussion of the issues in sections 6.1-6.10
Subsections of the Chapter with Implementations in Weka
6.1 Decision Trees
• J48 (implementation of C4.5)
• SimpleCart (minimum cost-complexity pruning a la CART)
• REPTree (reduced-error pruning)
6.2 Classification Rules
• (For classifiers, see Section 11.4 and Table 11.5.)
• JRip (RIPPER rule learner)
• Part (rules from partial decision trees)
• Ridor (ripple-down rule learner)
6.3 Association Rules
• (see Section 11.7 and Table 11.8)
• FPGrowth (frequent-pattern trees)
• GeneralizedSequentialPatterns (find large item trees in sequential data)
6.4 Linear Models and Extensions
• SMO and variants for learning support vector machines
• LibSVM (uses third-party libsvm library)
• MultilayerPerceptron
• RBFNetwork (radial-basis function network)
• SPegasos (SVM using stochastic gradient descent)
6.5 Instance-Based Learning
• IBk (k-nearest-neighbor classifier)
• KStar (generalized distance functions)
• NNge (rectangular generalizations)
6.6 Numeric Prediction
• M5P (model trees)
• M5Rules (rules from model trees)
• LWL (locally weighted learning)
6.7 Bayesian Networks
• BayesNet
• AODE, WAODE (averaged one-dependence estimators)
6.8 Clustering
• (For clustering methods, see Section 11.6 and Table 11.7.)
• XMeans
• Cobweb (includes Classit)
• EM
6.9 Semisupervised Learning
• No separate data mining implementations are listed for this section
6.10 Multi-Instance Learning
• MISVM (iterative method for learning SVM by relabeling instances)
• MISMO (SVM with multi-instance kernel)
• CitationKNN (nearest-neighbor method with Hausdorff distance)
• MILR (logistic regression for multi-instance data)
• MIOptimalBall (learning balls for multi-instance classification)
• MIDD (the diverse-density method using the noisy-OR function)
6.1 Decision Trees
Algorithm C4.5
• This was the algorithm introduced in chapter 4
• It is divide and conquer
• Splitting decisions are greedy, based on the purity/information function value of the results
A Review of the Algorithm for Nominal Attributes
• 1. The fundamental question at each level of the tree is always which attribute to split on
• In other words, given attributes x1, x2, x3…, do you branch first on x1 or x2 or x3…?
• Having chosen the first to branch on, which of the remaining ones do you branch on next, and so on?
• 2. Suppose you can come up with a function, the information (info) function
• This function is a measure of how much information is needed in order to make a decision at each node in a tree
• 3. You split on the attribute that gives the greatest information gain from level to level
• 4. A split is good if it means that little information will be needed at the next level down
• You measure the gain by subtracting the amount of information needed at the next level down from the amount needed at the current level
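Steps 2–4 above can be sketched concretely by taking the info function to be the entropy of the class distribution at a node, as C4.5 does; the three-way split and the 9-yes/5-no counts below are a hypothetical example, not from the book:

```python
import math
from collections import Counter

def entropy(labels):
    """Bits of information needed to classify an instance drawn from `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, groups):
    """Information gained by splitting `labels` into the subsets in `groups`:
    info needed at the current level minus the weighted info needed below."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Hypothetical 14-instance data set: 9 yes, 5 no, split three ways
labels = ["yes"] * 9 + ["no"] * 5
groups = [["yes"] * 2 + ["no"] * 3,   # branch 1: 2 yes, 3 no
          ["yes"] * 4,                # branch 2: pure
          ["yes"] * 3 + ["no"] * 2]   # branch 3: 3 yes, 2 no
print(round(info_gain(labels, groups), 3))  # ≈ 0.247 bits gained
```

The attribute whose split maximizes this gain is the one chosen at the node.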
Numeric Attributes
• Because most data sets include numeric attributes, the algorithm needs to be extended
• Obviously, numeric attributes fall into a range; they don’t fall into predefined categories
• That means you need to decide where to split (branch) on them
• In general you handle numeric attributes by ordering the instances by value and splitting <, > at a single value
• The information function was used to determine which attribute to split on for the nominal case
• The information function can also be used to choose the best split point for a numeric attribute
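A minimal sketch of choosing a numeric split point, assuming binary &lt;=/&gt; splits at midpoints between adjacent distinct sorted values; the example data is hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, groups):
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

def best_numeric_split(values, labels):
    """Sort the instances by value once, then try a binary split at the
    midpoint between each pair of adjacent distinct values, keeping the
    threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    ordered = [label for _, label in pairs]
    best_thresh, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold falls between equal values
        thresh = (pairs[i - 1][0] + pairs[i][0]) / 2
        gain = info_gain(ordered, [ordered[:i], ordered[i:]])
        if gain > best_gain:
            best_thresh, best_gain = thresh, gain
    return best_thresh, best_gain

# Hypothetical attribute: low values are "no", high values are "yes"
print(best_numeric_split([1, 2, 3, 10, 11, 12],
                         ["no", "no", "no", "yes", "yes", "yes"]))
# -> (6.5, 1.0): the split between 3 and 10 separates the classes perfectly
```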
Nominal vs. Numeric Splitting
• Differences between splitting on nominal and numeric:
• Nominal—split once on that attribute
• Numeric—may be split again at every succeeding level
• Or, may do a multi-way split on a numeric attribute at a given level
Numeric Cost Implications
• Computational cost/implementation question for numerics:
• For a numeric attribute over a continuous range there is a potentially infinite number of possible split points (though with n instances, only the gaps between adjacent sorted values actually matter)
• Whether you split into multiple branches at one level or split multiple times on the same attribute at different levels, the cost of deciding can be high
• Also, if you split at different levels, you have a practical consideration:
• Do the instances have to be re-sorted on the attribute at every level?
• A suitable implementation can preserve the initial sorting so it’s available at all lower levels
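One way such an implementation can preserve the initial sorting, sketched with hypothetical helper names: sort the instance indices once at the root, and let each child node filter that ordering rather than re-sort.

```python
def presort(values):
    """Sort instance indices by attribute value once, at the root."""
    return sorted(range(len(values)), key=lambda i: values[i])

def filter_order(order, reaching_node):
    """Restrict a presorted index list to the instances that reach a child
    node; the result is still sorted, with no re-sort needed."""
    keep = set(reaching_node)
    return [i for i in order if i in keep]

temps = [70, 64, 85, 68]        # hypothetical numeric attribute values
order = presort(temps)          # [1, 3, 0, 2] -- sorted once at the root
print(filter_order(order, [0, 2, 3]))  # a child's instances, still in order
```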
Missing Values in Trees
• As mentioned in chapter 4, you can handle missing values as a separate branch
• Logically, this makes sense if the absence of a value means something
• If the absence doesn’t mean anything, it makes sense to assign instances to the branches proportionally
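The proportional assignment can be sketched as follows; the branch names and counts are hypothetical:

```python
def distribute_missing(branch_counts, n_missing):
    """Split instances whose attribute value is missing across the branches
    in proportion to the known instances already going down each branch."""
    total = sum(branch_counts.values())
    return {b: n_missing * c / total for b, c in branch_counts.items()}

# Hypothetical split: 14 known instances go down three branches; 2 instances
# with a missing value are assigned fractionally in the same proportions
print(distribute_missing({"sunny": 5, "overcast": 4, "rainy": 5}, 2))
```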
• Recall that, in simple terms, the best information gain outcome is a split that produces pure leaves
• Practically speaking, this observation about missing values is worth noting:
• The information function and gain computations can be applied in situations where some of the attribute values are missing
Pruning
• This wasn’t discussed in detail in chapter 4
• It turns out to be a big deal
• There are two kinds of pruning: pre-pruning and post-pruning
• Pre-pruning is a bit of a misnomer
• It means that the tree-building algorithm includes heuristics that decide not to expand down a given branch
• This is the less common approach
• Post-pruning means:
• Create a complete tree following the rules
• After the tree is finished, evaluate it, potentially removing nodes and branches
• This is more common
• With post-pruning, on the one hand, you’ve wasted work in developing branches that are pruned
• On the other hand, you don’t throw anything out without having fully developed and evaluated it
• A pre-pruning algorithm will use fewer computational resources, but it may throw out something useful
Why Pruning, and How?
• Pruning is important because it goes back to the concepts of training, overfitting, using a test set, and algorithm/result evaluation
• You develop a tree with a training set
• You potentially prune it with a test set
• The end result, obviously, is a smaller tree
• Hopefully it’s also a better tree
• By definition, a completed tree will be fitted by the algorithm as closely as possible to the training set
• Pruning involves creating pruned versions of the tree and applying them to the test set
• The error rates of the different pruned versions of the tree are checked against each other and against the original
• It turns out that the error rate on the test set may be less if branches in the overfitted tree are merged or removed
• We don’t know exactly how to prune yet
• But notice that this is entirely pragmatic, and there is a logic to it
• Up until now, the concept of overfitting has simply been asserted
• It may have seemed illogical that a “less well fitted” tree might be better
• But who knows—chopping bits out of the tree and trying it on the test set might give better results
• Why not try and see?
The Two Kinds of Post-Pruning
• There are two kinds of post-pruning:
• Subtree replacement
• Subtree raising
Subtree Replacement
• Subtree replacement refers to collapsing a subtree (branch) into a single leaf node
• The subtree replacement algorithm is bottom up
• Work from the leaves up, looking for branches where performance on the test set is better if the branch is collapsed
• See Figure 1.3 on the following overhead for illustration
Subtree replacement, from (b) to (a): the whole left branch is replaced
• Subtree replacement is not too computationally costly
• All of the instances from the collapsed branch go into the leaf that replaces it
• No additional computation is needed for this step
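A sketch of the bottom-up replacement test, under the simplifying assumption that branches are judged by held-out error counts (the function names and counts are my own illustration, not the book's algorithm):

```python
from collections import Counter

def errors_as_leaf(labels):
    """Misclassifications if the branch collapses to one majority-class leaf."""
    return len(labels) - Counter(labels).most_common(1)[0][1] if labels else 0

def replace_subtree(subtree_errors, labels):
    """Bottom-up check: collapse the branch when a single leaf does no worse
    on the held-out instances than the subtree it replaces."""
    return errors_as_leaf(labels) <= subtree_errors

# Hypothetical branch: its leaves together misclassify 4 of the 10 held-out
# instances reaching it; a single "yes" leaf would misclassify only 3
held_out = ["yes"] * 7 + ["no"] * 3
print(replace_subtree(4, held_out))  # True -> collapse the branch
```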
Subtree Raising
• Subtree raising refers to collapsing an internal node and raising one of its children to replace it
• The other children of the original have to be reapportioned into the branches of the replacement
• Typically only the child with the most descendants is a candidate for raising
• See Figure 6.1 on the following overhead for illustration
Subtree raising, from (a) to (b): B is replaced by C, and the instances in 4 and 5 have to be reapportioned into 1, 2, and 3
• The subtree raising algorithm is more computationally intensive than subtree replacement
• The expense comes from reapportioning the instances into the new branches/leaves and recalculating the purity/error rate
Estimating Error Rates
• As explained earlier, a pruning algorithm can be based on error rates on test sets
• Generally, the test set is smaller than the training set
• It may not be representative of the overall population
• It may undo the overfitting from the training set, while not being perfect itself
• It turns out that the C4.5 algorithm doesn’t actually use a test set
• Using certain statistical assumptions, it bases error estimates on the training set
• All we need to understand is that C4.5 does include pruning
• The statistics are explained in a box and, as usual, there’s no need to know the details
Complexity of Decision Tree Induction
• The deep details of the derivation of the computational complexity of the algorithm are not important
• However, it’s worth noting that the complexity is tractable
• The book gives this as the overall figure for tree induction (creation) followed by post-pruning:
• O(mn log n) + O(n (log n)²)
• m = number of attributes
• n = number of instances
The Cost of Building the Tree
• O(mn log n)
• Informally:
• log n is the number of levels of the tree, for a log base equal to the average degree of branching
• At every level, in the worst case, you have to consider all n instances
• You do this for all of the m attributes
The Cost of Pruning the Tree with Subtree Replacement
• O(n)
• This is smaller than the order for subtree raising, so it is not included separately in the formula
The Cost of Pruning the Tree with Subtree Raising
• O(n (log n)²)
• n instances potentially reclassified at every level of the tree gives O(n log n)
• Reclassification itself is O(log n)
• Therefore, the total order of complexity is that shown above
• All we really need to know is that decision tree induction can be implemented so that it runs in log-linear/low-order polynomial time
• It is a computationally practical algorithm
From Trees to Rules
• As noted in chapter 4, following every branch of a tree gives a complete set of rules for it
• Rule sets can be pruned just like a tree can
• A specific approach to making rules from trees will come up in the next numbered section
C4.5: Choices and Options
• C4.5 has some tunable parameters
• They apply to both nominal and numeric attributes
• The parameters are:
• Confidence value
• Minimum outcomes and minimum instances
• To prune or not to prune
Confidence Value
• As noted already, the C4.5 algorithm in Weka uses statistical tools with the training set instead of the test set to calculate error rates
• And the details are beyond the scope of this set of overheads
• The default confidence value in Weka is 25%, and for lack of a better understanding of what it means, we’ll just accept it
Minimum Outcomes, Minimum Instances
• The minimum outcomes and instances are easier to understand
• What good is a splitting condition on an attribute that doesn’t have at least two outcomes?
• And what good is a splitting condition that doesn’t have at least two instances per branch?
• The default values for these parameters in Weka are 2 and 2
• It is apparent that these defaults are rock bottom values
• It is of some interest what effect changing them would have
• For lack of a better understanding, you might accept these defaults
• Or you might experiment with other values and see what effect that has on the results
To Prune or Not to Prune
• Pruning can be turned off in C4.5 in Weka in order to obtain a more or less complete tree
• However, due to some parts of the algorithm as implemented, even with explicit post-pruning turned off, the output may have been pruned in some way
Cost-Complexity Pruning
• The pruning algorithm in C4.5 is fast
• However, it doesn’t always prune enough
• CART = Classification and Regression Trees
• This scheme has a more advanced, stringent, and costly approach to pruning
• It might profitably be applied to a C4.5-derived tree, giving a smaller, better result
Discussion
• Tree induction is presented first in this chapter because it’s probably the most studied of the data mining schemes
• As presented up to this point, decision nodes have been on one attribute
• CART supports decision nodes on >1 (nominal) attributes at a time
• For numeric attributes, a decision can be made based on a function of >1 attribute at a node
• Multivariate numeric test conditions are hyperplanes, not parallel to an axis like a single attribute compared to a constant
• Fancier schemes will take longer to run
• The results may be more compact, but also harder for humans to understand
What’s in Weka?
• Implementation of C4.5: J48 in Weka
• Reduced-error pruning: REPTree in Weka
• Minimum cost-complexity pruning a la CART (classification and regression trees): SimpleCart in Weka
6.2 Classification Rules
• Recall the basic idea presented in chapter 4:
• You can make rules by trying to “cover” classifications in the data
• Recall, also, that the question of whether to accept “imperfect” rules or only “perfect” rules came up
• It is one of the aspects that will be discussed more here
• The basic question with rules is the same as with trees:
• A rule-producing algorithm will tend to overfit the training data
• That will mean that it is not such a good predictor
• How do you evaluate the error rate of a rule on a test set and decide whether it is good enough to keep?
Criteria for Choosing Tests
• Recall that in chapter 4, tests, namely conditions, are added to a rule with AND under this criterion:
• p stands for the number of correct classifications (p = positive)
• t stands for the total number covered (t = total)
• You wanted to maximize p/t
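A tiny Python sketch of this p/t score (the toy instances and the `accuracy` helper are illustrative, not from Weka):

```python
from fractions import Fraction

def accuracy(instances, condition, target_class):
    """p/t for a candidate rule condition: p = covered instances of the
    target class, t = all instances covered by the condition."""
    covered = [x for x in instances if condition(x)]
    if not covered:
        return Fraction(0)
    p = sum(1 for x in covered if x["class"] == target_class)
    return Fraction(p, len(covered))

# Toy data: outlook == "sunny" covers 3 instances, 2 of class "no",
# so the candidate rule "sunny => no" scores p/t = 2/3
data = [
    {"outlook": "sunny", "class": "no"},
    {"outlook": "sunny", "class": "no"},
    {"outlook": "sunny", "class": "yes"},
    {"outlook": "rainy", "class": "yes"},
]
ratio = accuracy(data, lambda x: x["outlook"] == "sunny", "no")
```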
• Maximizing p/t is what led to “perfection”
• If there was a condition that gave p/t = 1, it would be chosen
• This is not necessarily ideal
• Which rule is better:
• A rule that covers one instance with p/t = 1
• Or a rule that covers 1,000 cases with p/t = 999/1000?
An Alternative Rule Evaluation Criterion
• Let P and T be the values for a rule before a new condition is added
• Let p and t be the values after a condition is added
• You could compare different conditions by finding this product based on information gain:
• p * (log p/t – log P/T)
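A quick Python illustration of how this criterion scores the two rules from the earlier example. The before-values P and T here are hypothetical, chosen only to make the comparison concrete:

```python
import math

def gain(p, t, P, T):
    """p * (log2(p/t) - log2(P/T)): information gained about the class
    by the new condition, weighted by the p instances still covered."""
    if p == 0:
        return 0.0
    return p * (math.log2(p / t) - math.log2(P / T))

# Hypothetical before-values: P = 1000 positives among T = 2000 covered.
# The one-instance "perfect" rule scores far below the 999/1000 rule,
# even though only the former has p/t = 1.
perfect_small = gain(1, 1, 1000, 2000)       # = 1.0
imperfect_big = gain(999, 1000, 1000, 2000)  # ~ 997.6
```

This is exactly the skew described on the next slide: the p factor rewards coverage, not just purity.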
• This new criterion skews the judgment from perfection to the number of cases covered
• If you run an algorithm that quits only when ultimate perfection is achieved, you’ll get there eventually by selecting rules based on either criterion
• There is no absolute best criterion for selecting rules
• The real problem is still trimming the rule set back until it is useful for prediction
Missing Values, Numeric Attributes
• Covering algorithms tend to handle missing values pretty well (assuming that the majority of the values aren’t missing…)
• Informally, you could say the algorithm builds rules for positive hits
• It effectively ignores missing values
• Separate and conquer means that you slowly narrow down to a remainder of instances
• Either instances with missing values are handled earlier in the process based on attributes with values
• Or at the end, the few remaining instances will be handled as special cases
• Conditions will be added to rules so that exceptional instances are classified—on attributes with values
• Handling these exceptional cases may actually constitute overfitting…
• Numeric valued attributes can be handled for rules just like for trees
• Instances can be ordered on the attribute value and all candidate rules based on a <, > comparison or split can be evaluated
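A minimal sketch of that threshold search for a single numeric attribute. The helper names and the toy temperature data are illustrative; ties in p/t are broken toward larger coverage, one reasonable choice among several:

```python
def candidate_thresholds(values):
    """Midpoints between adjacent distinct sorted values -- the only
    split points worth testing for a numeric attribute."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

def best_lt_rule(values, labels, target):
    """Evaluate 'value < thr => target' for every candidate threshold;
    keep the highest p/t, breaking ties toward larger coverage p."""
    best_score, best_thr = None, None
    for thr in candidate_thresholds(values):
        covered = [lab for v, lab in zip(values, labels) if v < thr]
        p = sum(1 for lab in covered if lab == target)
        score = (p / len(covered), p)
        if best_score is None or score > best_score:
            best_score, best_thr = score, thr
    return best_thr, best_score

# temperature < 69.0 => "yes" covers 3 instances, all correctly
thr, (acc, p) = best_lt_rule(
    [64, 65, 68, 70, 72, 75],
    ["yes", "yes", "yes", "no", "no", "no"],
    "yes",
)
```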
Generating Good Rules
• This goes back to the idea that an imperfect rule is not overfitted and might make a good predictor
• The approach is similar to the approach with trees
• Divide the data 2/3, 1/3 into a growing set and a pruning set, for example
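That split is one line of bookkeeping; a Python sketch (the fixed seed is just for reproducibility):

```python
import random

def grow_prune_split(data, seed=0):
    """Shuffle and split 2/3 : 1/3 into a growing set (used to add
    conditions) and a pruning set (used to remove them)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = 2 * len(shuffled) // 3
    return shuffled[:cut], shuffled[cut:]

grow, prune = grow_prune_split(list(range(9)))
```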
• The growing set is used to make rules by adding conditions
• The pruning set is used to simplify rules by removing conditions
• The criterion for removing rules is a reduction in the error rate on the pruning set
• This is called reduced-error pruning
• One algorithm for doing this is incremental reduced-error pruning
• The algorithm goes like this:
• For a given classification, grow a complete covering rule for it
• Now test the error rate of the rule on the pruning set and compare this with the error rate for all “sub-rules” with conditions removed
• For that class, hold onto whichever rule has the lowest error rate
• Do this for all classes
• Compare the error rates for the rules for each class
• Keep the one rule with the lowest error rate
• Remove the covered instances
• Repeat
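The grow-then-prune step can be sketched in Python for nominal attributes. This is a simplified illustration, not Weka's implementation: a rule is a list of attribute=value tests, and the “sub-rules” tried during pruning are the prefixes with final conditions dropped:

```python
def covers(rule, x):
    """A rule covers x if every attribute=value test matches."""
    return all(x[a] == v for a, v in rule)

def error_rate(rule, cls, data):
    covered = [x for x in data if covers(rule, x)]
    if not covered:
        return 1.0  # a rule that covers nothing is useless
    return sum(1 for x in covered if x["class"] != cls) / len(covered)

def grow_rule(grow_set, cls, attrs):
    """PRISM-style growing: greedily add the attribute=value test that
    maximizes p/t until the rule is pure on the growing set."""
    rule, covered, unused = [], list(grow_set), list(attrs)
    while unused and any(x["class"] != cls for x in covered):
        best = None
        for a in unused:
            for v in {x[a] for x in covered}:
                cov = [x for x in covered if x[a] == v]
                p = sum(1 for x in cov if x["class"] == cls)
                key = (p / len(cov), p)
                if best is None or key > best[0]:
                    best = (key, (a, v))
        a, v = best[1]
        rule.append((a, v))
        covered = [x for x in covered if x[a] == v]
        unused.remove(a)
    return rule

def prune_rule(rule, cls, prune_set):
    """Compare the grown rule against every prefix sub-rule on the
    pruning set; keep the one with the lowest error rate."""
    prefixes = [rule[:i] for i in range(1, len(rule) + 1)]
    return min(prefixes, key=lambda r: error_rate(r, cls, prune_set))

# Grown on the growing set the rule needs both tests; on the pruning
# set the shorter sub-rule does at least as well, so pruning keeps it
grow_set = [
    {"outlook": "sunny", "windy": "false", "class": "yes"},
    {"outlook": "sunny", "windy": "true", "class": "no"},
    {"outlook": "rainy", "windy": "false", "class": "no"},
]
prune_set = [
    {"outlook": "sunny", "windy": "true", "class": "yes"},
    {"outlook": "sunny", "windy": "false", "class": "yes"},
    {"outlook": "rainy", "windy": "false", "class": "no"},
]
rule = grow_rule(grow_set, "yes", ["outlook", "windy"])
pruned = prune_rule(rule, "yes", prune_set)
```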
• Why do you have to repeat?
• Because there is a difference between the growing set and the pruning set, and you are developing rules that may be imperfect for the growing set
• You remove the instances covered by the accepted rule from the training data and then go again
• That’s what is meant by incremental
• A non-incremental version of the algorithm would build a complete rule set first and then prune it
• This is more time-consuming
Evaluating Error Rate to Select Rules
• The book suggests various alternatives to comparing on the basis of p/t or expressions including log p/t – log P/T
• They all have the same problem:
• We’ve decided to accept imperfect rules, not just perfect ones
• There is no perfect balancing point between percent covered correctly and total number covered correctly
Algorithm Performance/Refinements
• Incremental reduced-error pruning produces good rule sets quickly
• It can be sped up by simply picking a rule for each class in order of size, from smallest to largest
• It will also run more quickly with a suitable stopping condition
• For example, once a rule is accepted with a sufficiently low error rate, stop searching for more refinements
• Unfortunately, such a stopping condition may cause better solutions to be overlooked
• A better stopping condition may be based on the MDL principle
Using Global Optimization
• Everything we’ve talked about is heuristic
• We can’t claim we’re finding optimal trees or rule sets
• After finishing an algorithm, further heuristics may be applied, which may lead to better (but not actually optimal) solutions
• In this context the idea is to run the incremental algorithm; then try to improve, taking all of the derived rules into account
What is RIPPER?
• RIPPER stands for repeated incremental pruning to produce error reduction
• This is the name of a rule generation scheme
• In short, it has most of the bells and whistles noted above built in, in order to improve the rule sets generated
Obtaining Rules from Partial Decision Trees
• The book asserts that rule building schemes tend to prune too much
• Tree building schemes tend to err in the opposite direction, pruning too little
• A balancing approach is to use trees to develop rule sets
• In gross form, you could build a complete tree
• Then pick the best rule by tracing all the branches
• Then remove the covered instances and repeat
• However, building a complete tree each time is wasteful and unnecessary
• The alternative is to build a partial decision tree
• In brief, what you do is a form of pre-pruning during tree creation
• You use metrics that tell you there’s no need to explore certain branches further
• The decision about which branches merit further expansion is based on their entropy (information function) values
• Once the tree development is complete, you pick the best rule among those branches that can be traced to a leaf
• Then you throw out the partial tree and repeat the process with the instances that weren’t covered by the rule
• This technique is simpler than other schemes in this sense:
• It doesn’t have a global optimization stage at the end
• It can give rule sets that match the performance of schemes that do require global optimization
Rules with Exceptions
• Rules with exceptions get a more complete and friendlier presentation here than in chapter 4
• They are not a logical abomination
• They have a logic of their own
• Imagine starting with a default case—namely the majority classification
• All instances which don’t fall into this classification are exceptions
• Among the exceptions, let the majority classification be the new default
• Then those instances which don’t fall into this classification are exceptions
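The default/exception nesting described above can be sketched without the rule conditions themselves. This Python fragment builds only the skeleton of majority defaults with recursive exception sets; the function name and dictionary layout are illustrative:

```python
from collections import Counter

def defaults_with_exceptions(instances):
    """Nested default/exception skeleton: at each level the majority
    class is the default; everything else forms the exception set."""
    if not instances:
        return None
    counts = Counter(x["class"] for x in instances)
    default, n = counts.most_common(1)[0]
    exceptions = [x for x in instances if x["class"] != default]
    return {"default": default, "covers": n,
            "exceptions": defaults_with_exceptions(exceptions)}

# 5 a's, 3 b's, 1 c: default "a", then "b" among the exceptions, then "c"
data = [{"class": c} for c in "aaaaabbbc"]
tree = defaults_with_exceptions(data)
```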
• With other techniques it might be the nth iteration (node split, condition added) before you ultimately nail something
• Here, you’re always thinking of the majority case first
• Within one or two levels at the top you have a good picture of the situation overall
• You go down for finer levels of detail
• Effectively, this is an alternative way of building a representation of the data
• You could say that at every level, your thinking is, “All else being equal…”
• It is no accident that the organizing principle of going from broad default to fine exception mirrors human thinking in some problem domains
Discussion
• This is a repetition of the rule building algorithms and the implementations mentioned in section 6.11
• Simple rule building for relatively noise-free data, covering, separate and conquer: PRISM
• Incremental reduced-error pruning, RIPPER: JRip in Weka
• Rules from partial decision trees: PART in Weka
• Rules with exceptions, ripple down rules: Ridor in Weka
6.3 Association Rules
• The algorithm given in chapter 4 for finding association rules is known as the apriori algorithm
• It was essentially an exhaustive search
• It was made somewhat more tolerable by observing that if a weaker rule didn’t meet the threshold for acceptance, a stronger one wouldn’t either
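That observation is the heart of apriori candidate pruning: an itemset can only be frequent if all of its subsets are. A Python sketch, assuming itemsets are represented as sets (`apriori_gen` is an illustrative name, not Weka's API):

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Candidate k-itemsets from the frequent (k-1)-itemsets: any
    candidate with an infrequent (k-1)-subset is pruned up front,
    since a superset can never be more frequent than its subsets."""
    prev = {frozenset(s) for s in frequent_prev}
    k = len(next(iter(prev))) + 1
    items = sorted({i for s in prev for i in s})
    return [frozenset(c) for c in combinations(items, k)
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))]

# Frequent pairs AB, AC, BC, BD: only {A,B,C} survives as a candidate
# triple; every other triple contains an infrequent pair (AD or CD)
pairs = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"B", "D"}]
triples = apriori_gen(pairs)
```

Only the surviving candidates ever have their support counted against the data, which is what keeps the exhaustive search tolerable.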
• The bottom line turns out to be that there’s got to be a better way
• There is—it’s known as a frequent pattern or FP-tree implementation (FP not to be confused with false positive)
• Essentially, the FP-tree is based on a special kind of data structure, a prefix tree, a tree with additional information attached
• The details of the FP-tree implementation are of no interest
• However, there is a side comment of some interest
• The authors mention that it is desirable to use a data structure small enough to be memory resident
• Compare this idea with things that came up in db systems, B-trees and hash join
• Even a high complexity algorithm is relatively tolerable if it can be executed in memory
• Even a low complexity algorithm is relatively costly if it involves access to secondary storage
Association Rule Mining In Weka
• FPGrowth (frequent pattern trees)
• GeneralizedSequentialPatterns (GSP)
• GSP is an application of the idea of apriori rule generation to databases of event sequences
• If, by chance, you are considering such a database of sequential events, GSP might be just the piece of software to use
The End