Mining with Rare Cases
Paper by Gary M. Weiss
Presenter: Indar Bhatia
INFS 795, April 28, 2005


TRANSCRIPT

Page 1: Mining with Rare Cases

Mining with Rare Cases

Paper by Gary M. Weiss

Presenter: Indar Bhatia
INFS 795
April 28, 2005

Page 2: Mining with Rare Cases

Presentation Overview

1. Motivation and Introduction to the Problem
2. Why Rare Cases Are Problematic
3. Techniques for Handling Rare Cases
4. Summary and Conclusion

Page 3: Mining with Rare Cases

Motivation and Introduction

What are rare cases?
– A case corresponds to a region in the instance space that is meaningful to the domain under study.
– A rare case is a case that covers a small region of the instance space.

Why are they important?
– Detecting suspicious cargo
– Finding sources of rare diseases
– Detecting fraud
– Finding terrorists
– Identifying rare diseases

Classification problem
– Covers relatively few training examples

Example: finding associations between infrequently purchased supermarket items.

Page 4: Mining with Rare Cases

Modeling Problem

For a classification problem, the rare cases may manifest themselves as small disjuncts, i.e., those disjuncts in the classifier that cover few training examples.

In unsupervised learning, the three rare cases will be more difficult to generalize from because they contain fewer data points.

In association rule mining, the problem will be to detect items that co-occur infrequently.

[Figure: two examples of rarity in the instance space. Left, a clustering showing 1 common and 3 rare classes; right, a two-class classification problem whose positive class contains 1 common and 2 rare cases, labeled P1, P2, P3.]

Page 5: Mining with Rare Cases

Modeling Problem

Current research indicates that rare cases and small disjuncts pose difficulties for data mining, i.e., rare cases have a much higher misclassification rate than common cases.

Small disjuncts collectively cover a substantial fraction of all examples and cannot simply be eliminated – doing so would substantially degrade the performance of a classifier.

In the most thorough study of small disjuncts (Weiss & Hirsh, 2000), it was shown that in the classifiers induced from 30 real-world data sets, most classifier errors are contributed by the smaller disjuncts.

Page 6: Mining with Rare Cases

Why Rare Cases Are Problematic

Problems arising from absolute rarity
– The most fundamental problem is the associated lack of data: only a few examples related to rare cases are in the data set (absolute rarity).
– Lack of data makes it difficult to detect rare cases and, if they are detected, makes generalization difficult.

Problems arising from relative rarity
– Looking for a needle in a haystack: rare cases are obscured by common cases (relative rarity).
– Data mining algorithms rely on greedy search heuristics that examine one variable at a time. Since the detection of rare cases may depend on the conjunction of many conditions, any single condition in isolation may not provide much guidance.
– For example, consider the association rule mining problem. To find rare associations, the analysis has to use a very low support threshold (approaching support = 0), which causes a combinatorial explosion in large datasets.
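The support-threshold trade-off above can be sketched with a few lines of Python; the transactions below are made-up illustrative data:

```python
# Why rare associations need a very low support threshold: a rare pair
# only survives filtering when minsup drops to its (tiny) support.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread", "milk"},
    {"bread", "milk"}, {"bread", "milk"}, {"caviar", "truffles"},
]

pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

n = len(transactions)
support = {pair: c / n for pair, c in pair_counts.items()}

# A typical minsup of 10% keeps the common pair, but the rare pair
# (caviar, truffles) survives only if minsup <= 1/6.
print(support[("bread", "milk")])       # 5/6
print(support[("caviar", "truffles")])  # 1/6
```

Lowering minsup enough to catch the rare pair forces the miner to enumerate vastly more candidate itemsets, which is the combinatorial explosion the slide describes.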

Page 7: Mining with Rare Cases

Why Rare Cases Are Problematic

The metrics
– The metrics used to evaluate classifier accuracy are focused on common cases. As a consequence, rare cases may be totally ignored.

Example:
– Consider a decision tree. Most decision trees are grown in a top-down manner, where test conditions are repeatedly evaluated and the best one selected.
– The metrics (e.g., information gain) used to select the best test generally prefer tests that result in a balanced tree where purity is increased for most of the examples.
– Rare cases, which correspond to high-purity branches covering few examples, will often not be included in the decision tree.

Page 8: Mining with Rare Cases

Why Rare Cases Are Problematic

The bias

– The bias of a data mining system is critical to its performance. Extra-evidentiary bias makes it possible to generalize from specific examples.
– The bias used by many data mining systems, especially those used to induce classifiers, is a maximum-generality bias.
– This means that when a disjunct covering some set of training examples is formed, only the most general set of conditions that satisfy those examples is selected.
– The maximum-generality bias works well for common cases, but not for rare cases/small disjuncts.
– Attempts to address the problems of small disjuncts by selecting an appropriate bias must be considered.

Page 9: Mining with Rare Cases

Why Rare Cases Are Problematic

Noisy data
– A sufficiently high level of background noise may prevent the learner from distinguishing between noise and rare cases.
– Unfortunately, there is not much that can be done to minimize the impact of noise on rare cases.
– For example: pruning and overfitting-avoidance techniques, as well as inductive biases that foster generalization, can minimize the overall impact of noise, but because these methods tend to remove both the rare cases and the noise-generated ones, they do so at the expense of rare cases.

Page 10: Mining with Rare Cases

Techniques for Handling Rare Cases

1. Obtain Additional Training Data
2. Use a More Appropriate Inductive Bias
3. Use More Appropriate Metrics
4. Employ Non-Greedy Search Techniques
5. Employ Knowledge/Human Interaction
6. Employ Boosting
7. Place Rare Cases Into Separate Classes

Page 11: Mining with Rare Cases

1. Obtain Additional Training Data

Simply obtaining additional training data will not help much, because most of the new data will also be associated with the common cases, with perhaps some associated with rare cases. This may help with problems of "absolute rarity" but not with "relative rarity."

Only by selectively obtaining additional data for the rare cases can one address the issues with relative rarity. Such a sampling scheme will also help with absolute rarity.

The selective sampling approach does not seem practical for real-world data sets.

Page 12: Mining with Rare Cases

2. Use a More Appropriate Inductive Bias

Rare cases tend to cause small disjuncts to be formed in a classifier induced from labeled data. This is partly due to the bias used by most learners.

Simple strategies that eliminate all small disjuncts, or that use statistical significance testing to prevent small disjuncts from being formed, have proven to perform poorly.

More sophisticated approaches for adjusting the bias of a learner in order to minimize the problem with small disjuncts have been investigated.

Holte et al. (1989) use a maximum-generality bias for large disjuncts and a maximum-specificity bias for small disjuncts. This was shown to improve the performance of the small disjuncts but degrade the performance of the large disjuncts, yielding poorer overall performance.

Page 13: Mining with Rare Cases

2. Use a More Appropriate Inductive Bias

The approach was refined to ensure that the more specific bias used to induce the small disjuncts does not affect – and therefore cannot degrade – the performance of the large disjuncts.

This was accomplished by using different learners for examples that fall into large disjuncts and examples that fall into small disjuncts (Ting, 1994).

Although this hybrid approach was shown to improve the accuracy of small disjuncts, the results were not conclusive.

Carvalho and Freitas (2002a, 2002b) essentially use the same approach, except that the set of training examples falling into each individual small disjunct is used to generate a separate classifier.

Several attempts have been made to perform better on rare cases by using a highly specific bias for the induced small disjuncts. These methods have shown mixed success.

Page 14: Mining with Rare Cases

3. Use More Appropriate Metrics

Altering the relative importance of precision vs. recall:
– Use evaluation metrics that, unlike accuracy, do not discount the importance of rare cases.
– Given a classification rule R that predicts target class C, the recall of R is the percentage of examples belonging to C that are correctly identified, while the precision of R is the percentage of times that the rule is correct.
– Rare cases can be given more prominence by increasing the importance of precision over recall.
– Timeweaver (Weiss, 1999), a genetic-algorithm-based classification system, searches for rare cases by carefully altering the relative importance of precision vs. recall.
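The precision/recall trade-off can be made concrete with an F-beta score, where beta controls their relative importance; the counts below are invented for illustration:

```python
# Precision, recall, and an F-beta score whose beta parameter shifts
# the balance between them (beta < 1 favors precision, beta > 1 recall).

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

def f_beta(precision, recall, beta):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A rule for a rare class: 8 true positives, 2 false positives,
# and 32 rare-class examples it fails to cover.
p, r = precision_recall(tp=8, fp=2, fn=32)
print(p, r)                    # 0.8 0.2
print(f_beta(p, r, beta=0.5))  # 0.5 – the precision-weighted score
```

With accuracy alone, a rule covering so few rare examples would look nearly worthless; an F-beta score makes its high precision on the rare class visible.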

Page 15: Mining with Rare Cases

3. Use More Appropriate Metrics

Two-phase rule induction:
– PNrule (Joshi, Agarwal & Kumar, 2001) uses two-phase rule induction to focus on each measure separately.
– The first phase focuses on recall. In the second phase, precision is optimized. This is accomplished by learning to identify false positives within the rules from phase 1.
– In the needle-in-the-haystack analogy, the first phase identifies regions likely to contain the needle; the second phase then learns to discard the strands of hay within those regions.
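The two phases can be sketched on toy 1-D data; the data, the interval rules, and the way they are fit here are all invented to illustrate the idea, not PNrule's actual induction procedure:

```python
# Toy two-phase (PNrule-style) induction on 1-D points.
# Phase 1 learns a broad interval with high recall; phase 2 learns,
# inside that interval, a sub-rule that captures the false positives.

pos = [1.0, 1.5, 2.0, 2.5, 3.0]                   # rare class
neg = [2.1, 2.2, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

# Phase 1 (P-phase): cover all positives -> full recall, some false positives.
lo, hi = min(pos), max(pos)
covered_fp = [x for x in neg if lo <= x <= hi]    # the hay inside the region

# Phase 2 (N-phase): carve out the sub-interval where false positives sit.
n_lo, n_hi = min(covered_fp), max(covered_fp)

def predict(x):
    return (lo <= x <= hi) and not (n_lo <= x <= n_hi)

recall = sum(predict(x) for x in pos) / len(pos)
print(recall)                           # 1.0 – no positive was carved out
print([x for x in neg if predict(x)])   # [] – all false positives removed
```

The P-phase rule alone would misfire on the negatives at 2.1 and 2.2; the N-phase exception rule removes them without sacrificing recall.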

Page 16: Mining with Rare Cases

PN-rule Learning

P-phase:
– Find rules covering positive examples with good support
– Seek good recall

N-phase:
– Remove false positives from the examples covered in the P-phase
– Seek high accuracy and significant support

Page 17: Mining with Rare Cases

4. Employ Non-Greedy Search Techniques

Most greedy algorithms make locally optimal choices so that the search remains tractable; mining algorithms based on the greedy method are therefore not globally optimal.

Greedy algorithms are not suitable for dealing with rare cases, because rare cases may depend on the conjunction of many conditions, and any single condition in isolation may not provide the needed guidance.

Mining algorithms for handling rare cases must use more powerful global search methods.

Recommended solution:
– Genetic algorithms, which operate on a population of candidate solutions rather than a single solution.
– For this reason, GAs are more appropriate for rare cases (Goldberg, 1989; Freitas, 2002; Weiss, 1999; Carvalho and Freitas, 2002).

Page 18: Mining with Rare Cases

5. Employ Knowledge/Human Interaction

The interaction and knowledge of domain experts can be used more effectively for rare-case mining.

Examples:
– SAR detection
– Rare disease detection
– Etc.

Page 19: Mining with Rare Cases

6. Employ Boosting

Boosting algorithms, such as AdaBoost, are iterative algorithms that place different weights on the training distribution at each iteration.

Following each iteration, boosting increases the weights associated with the incorrectly classified examples and decreases the weights associated with the correctly classified examples.

This forces the learner to focus more on the incorrectly classified examples in the next iteration.

RareBoost (Joshi, Agarwal & Kumar, 2001) applies a modified weight-update mechanism to improve the performance of rare classes and rare cases.
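The AdaBoost weight update described above can be sketched in a few lines; the six examples and the weak learner's results are invented for illustration:

```python
# One AdaBoost weight-update step: misclassified examples are
# up-weighted, correctly classified ones down-weighted, then the
# weights are renormalized into a distribution.
import math

weights = [1 / 6] * 6                               # uniform starting weights
correct = [True, True, True, True, False, False]    # weak learner's results

err = sum(w for w, c in zip(weights, correct) if not c)   # weighted error = 1/3
alpha = 0.5 * math.log((1 - err) / err)                   # classifier's vote weight

weights = [w * math.exp(-alpha if c else alpha)
           for w, c in zip(weights, correct)]
total = sum(weights)
weights = [w / total for w in weights]

print(round(weights[0], 3), round(weights[4], 3))   # 0.125 0.25
```

After one round, each misclassified (often rare-case) example carries twice the weight of a correctly classified one, which is exactly the pressure that pushes the next weak learner toward the hard examples.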

Page 20: Mining with Rare Cases

7. Place Rare Cases Into Separate Classes

Rare cases complicate classification because different rare cases may have little in common with one another, making it difficult to assign the same class label to all of them.

Solution: reformulate the problem so that rare cases are viewed as separate classes.

Approach:
1. Separate each class into subclasses using clustering.
2. Learn after re-labeling the training examples with the new class labels.
3. Because multiple clustering experiments were used in step 1, step 2 involves learning multiple models.
4. These models are combined using voting.

Page 21: Mining with Rare Cases

Boosting-Based Algorithms

RareBoost
– Updates the weights differently

SMOTEBoost
– A combination of SMOTE (Synthetic Minority Oversampling Technique) and boosting

Page 22: Mining with Rare Cases

CREDOS

First use ripple-down rules to overfit the data
– Ripple-down rules are often used

Then prune to improve generalization
– A different mechanism from decision trees

Page 23: Mining with Rare Cases

Cost-Sensitive Modeling

Detection rate / false alarm rate may be misleading.

Cost factors: damage cost, response cost, operational cost.

Costs for TP, FP, TN, FN.
Define a cumulative cost.
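A cumulative cost over the four confusion-matrix outcomes can be sketched as follows; the per-outcome costs and counts are arbitrary illustrative values:

```python
# Cumulative cost from a confusion matrix: a high detection rate can
# still be a bad outcome if false alarms and misses are costly.

costs = {"TP": 5.0,     # response cost when we act on a true detection
         "FP": 10.0,    # wasted response on a false alarm
         "TN": 0.0,
         "FN": 100.0}   # damage cost of a missed rare event

counts = {"TP": 8, "FP": 20, "TN": 960, "FN": 12}

cumulative_cost = sum(costs[k] * counts[k] for k in costs)
detection_rate = counts["TP"] / (counts["TP"] + counts["FN"])
print(cumulative_cost, detection_rate)  # 1440.0 0.4
```

Here the 12 missed rare events dominate the total cost, which is exactly what a raw detection-rate / false-alarm-rate summary would hide.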

Page 24: Mining with Rare Cases

Outlier Detection Schemes

Detect intrusions (data points) that are very different from the "normal" activities (the rest of the data points).

General steps:
– Identify "normal" behavior
– Construct a useful set of features
– Define a similarity function
– Use an outlier detection algorithm: statistics based, distance based, or model based

Page 25: Mining with Rare Cases

Distance-Based Outlier Detection

Represent data as a vector of features.

Major approaches:
– Nearest-neighbor based
– Density based
– Clustering based

Problem:
– High dimensionality of data

Page 26: Mining with Rare Cases

Distance Based – Nearest Neighbor

Not enough neighbors → outliers
– Compute the distance d to the k-th nearest neighbor
– Outlier points are located in sparser neighborhoods and have d larger than a certain threshold

Mahalanobis-distance-based approach
– More appropriate for computing distances in skewed distributions

[Figure: scatter plot of points over axes x' and y' with a skewed distribution; p1 and p2 mark candidate outlier points.]
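The k-th-nearest-neighbor scoring described above can be sketched on made-up 2-D points; the data and threshold are invented for illustration:

```python
# k-th-NN outlier scoring: each point's score is its distance to its
# k-th nearest neighbor; points scoring above a threshold are outliers.
import math

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]  # (8, 8) is isolated

def kth_nn_distance(p, pts, k):
    dists = sorted(math.dist(p, q) for q in pts if q != p)
    return dists[k - 1]

k, threshold = 2, 3.0
scores = {p: kth_nn_distance(p, points, k) for p in points}
outliers = [p for p, s in scores.items() if s > threshold]
print(outliers)  # [(8, 8)]
```

Every point in the dense cluster has a 2nd-NN distance near 1, so only the isolated point exceeds the threshold and lands in a "sparse neighborhood."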

Page 27: Mining with Rare Cases

Distance Based – Density

Local Outlier Factor (LOF)
– The average of the ratios of the density of example p and the densities of its nearest neighbors

Steps:
– Compute the density of the local neighborhood for each point
– Compute the LOF
– Larger LOF → outliers

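A simplified density-ratio score in the spirit of LOF can be sketched as follows; this is not the exact LOF definition (which uses reachability distances), and the data are invented for illustration:

```python
# LOF-like score: ratio of the neighbors' average local density to the
# point's own density. Values well above 1 mark points sitting in a
# sparser region than their neighbors.
import math

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]

def knn(p, pts, k):
    return sorted((q for q in pts if q != p), key=lambda q: math.dist(p, q))[:k]

def density(p, pts, k):
    # Inverse of the average distance to the k nearest neighbors.
    return k / sum(math.dist(p, q) for q in knn(p, pts, k))

def lof_like(p, pts, k=2):
    nbrs = knn(p, pts, k)
    return (sum(density(q, pts, k) for q in nbrs) / k) / density(p, pts, k)

scores = {p: lof_like(p, points) for p in points}
print(max(scores, key=scores.get))  # (5, 5)
```

Points inside the unit square score near 1 because their density matches their neighbors'; the isolated point scores far higher.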

Page 28: Mining with Rare Cases

Distance Based – Clustering

A radius w of proximity is specified.
Two points x1 and x2 are "near" if d(x1, x2) < w.
Define N(x) as the number of points that are within w of x.
Points in small clusters → outliers.
Fixed-width clustering is used for speedup.
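The N(x) radius-count idea above can be sketched directly (without the fixed-width-clustering speedup); the points and radius are invented for illustration:

```python
# Radius-count outlier detection: flag points whose neighborhood count
# N(x) within radius w falls below a minimum cluster size.
import math

points = [(0, 0), (0.5, 0), (0, 0.5), (0.4, 0.4), (9, 9)]
w = 1.0

def n_within(x, pts):
    return sum(1 for q in pts if q != x and math.dist(x, q) < w)

outliers = [p for p in points if n_within(p, points) < 2]
print(outliers)  # [(9, 9)]
```

This brute-force version is O(n²); fixed-width clustering trades a little accuracy for speed by comparing each point against cluster centers instead of all points.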

Page 29: Mining with Rare Cases

Distance Based – Clustering (cont.)

K-nearest neighbor + canopy clustering
– Compute the sum of distances to the k nearest neighbors
– Small k-NN sum → point in a dense region
– Canopy clustering is used for speedup

WaveCluster
– Transform the data into multidimensional signals using the wavelet transformation
– Remove the high/low-frequency parts
– The remaining parts → outliers

Page 30: Mining with Rare Cases

Model-Based Outlier Detection

Similar to probabilistic-based schemes:
– Build a prediction model for normal behavior
– Deviation from the model → potential intrusion

Major approaches:
– Neural networks
– Unsupervised Support Vector Machines (SVMs)

Page 31: Mining with Rare Cases

Model Based – Neural Networks

Use a replicator 4-layer feed-forward neural network.
The input variables are also the target outputs during training.
The replicator network forms a compressed model of the training data.
Outlyingness → reconstruction error.
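The reconstruction-error scoring idea can be illustrated with a deliberately trivial "compressed model" that reconstructs every point as the data mean; a replicator neural network plays the same role with a far richer model. The data are invented for illustration:

```python
# Reconstruction-error outlier scoring with a trivial compressed model:
# points the "model" reconstructs poorly are the most outlying.

data = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]

# Degenerate model: reconstruct every point as the componentwise mean.
mean = tuple(sum(c) / len(data) for c in zip(*data))

def reconstruction_error(p):
    return sum((a - b) ** 2 for a, b in zip(p, mean)) ** 0.5

scores = {p: reconstruction_error(p) for p in data}
print(max(scores, key=scores.get))  # (10, 10)
```

A trained replicator network would instead reconstruct the dense cluster almost perfectly and fail only on points unlike anything seen during training, giving the same ranking for a much less obvious notion of "normal."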

Page 32: Mining with Rare Cases

Model Based – SVMs

Attempt to separate the entire set of training data from the origin.

Regions where most of the data lie are labeled as one class.

Parameters:
– Expected outlier rate
– Variance of the Radial Basis Function (RBF) kernel: larger → higher detection rate but more false alarms; smaller → lower detection rate but fewer false alarms

Good for high-quality, controlled training data.

[Figure: one-class SVM decision regions separating the data from the origin.]

Page 33: Mining with Rare Cases

Summary and Conclusion

Rare classes, which result from a highly skewed class distribution, share many of the problems associated with rare cases. Rare classes and rare cases are connected.

Although rare cases can occur within both rare classes and common classes, rare cases are expected to be more of an issue for rare classes.

Japkowicz (2001) views rare classes as a consequence of between-class imbalance and rare cases as a consequence of within-class imbalance.

Thus, both forms of rarity are a type of data imbalance.

The modeling improvements presented in this paper are applicable to both types of rarity.