inductive learning from imbalanced data sets

1

Inductive Learning from Imbalanced Data Sets

Nathalie Japkowicz, Ph.D.

School of Information Technology and Engineering

University of Ottawa

2

Inductive Learning: Definition

Given a sequence of input/output pairs of the form <xi, yi>, where xi is a possible input and yi is the output associated with xi:

Learn a function f such that:

• f(xi) = yi for all i,
• f makes a good guess for the outputs of inputs that it has not previously seen.

[If f has only 2 possible outputs, f is called a concept and learning is called concept-learning.]

3

Inductive Learning: Example

Patient  Temperature  Cough  Sore Throat  Sinus Pain  Class
1        37           yes    no           no          no flu
2        39           no     yes          yes         flu
3        38.4         no     no           no          no flu
4        36.8         no     yes          no          no flu
5        38.5         yes    no           yes         flu
6        39.2         no     no           yes         flu

Goal: Learn how to predict whether a new patient with a given set of symptoms does or does not have the flu.

4

Standard Assumption

The data sets are balanced: i.e., there are as many positive examples of the concept as there are negative ones.

Example: Our database of sick and healthy patients contains as many examples of sick patients as it does of healthy ones.

5

The Standard Assumption is not Always Correct

There exist many domains that do not have a balanced data set. Examples:

• Helicopter Gearbox Fault Monitoring
• Discrimination between Earthquakes and Nuclear Explosions
• Document Filtering
• Detection of Oil Spills
• Detection of Fraudulent Telephone Calls

6

But What is the Problem?

Standard learners are often biased towards the majority class.

That is because these classifiers attempt to reduce global quantities such as the error rate, not taking the data distribution into consideration.

As a result, examples from the overwhelming class are well classified, whereas examples from the minority class tend to be misclassified.
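To make the bias concrete, here is a minimal sketch in Python. The class sizes are hypothetical and the "classifier" simply predicts the majority class every time. It is only meant to show why a low global error rate can hide a complete failure on the minority class.

```python
def error_rate(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive examples that are predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    hits = sum(1 for p in predictions_for_positives if p == positive)
    return hits / len(predictions_for_positives)


# Hypothetical test set: 990 majority examples (0) and 10 minority examples (1).
y_true = [0] * 990 + [1] * 10

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

majority_error = error_rate(y_true, y_pred)
minority_recall = recall(y_true, y_pred)
print(majority_error, minority_recall)
```

The global error rate is only 1%, yet not a single minority example is recognized.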

7

Significance of the problem for Machine Learners/Data Miners

Two Workshops:

• AAAI’2000 Workshop, organizers: R. Holte, N. Japkowicz, C. Ling, S. Matwin, 13 contributions
• ICML’2003 Workshop, organizers: N. Chawla, N. Japkowicz, A. Kolcz, 16 contributions

Bibliography on Class Imbalance, maintained by N. Japkowicz, 37 entries

Special Issue of the ACM SIGKDD Explorations Newsletter, editors: N. Chawla, N. Japkowicz, A. Kolcz (call for papers just came out)

Profile of people involved in this research:

• Well-known researchers: e.g., F. Provost, C. Elkan, R. Holte, T. Fawcett, C. Ling, etc.

8

Several Common Approaches

At the data level: Re-Sampling

• Oversampling (Random or Directed)
• Undersampling (Random or Directed)
• Active Sampling

At the algorithmic level:

• Adjusting the costs
• Adjusting the decision threshold / probabilistic estimate at the tree leaf
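The two random re-sampling approaches can be sketched in a few lines of Python. This is an illustration only: the function names and the toy data are hypothetical, and the directed variants (which choose which examples to duplicate or discard) are not shown.

```python
import random


def split_by_class(examples):
    """Group (x, y) pairs by their label y."""
    groups = {}
    for x, y in examples:
        groups.setdefault(y, []).append((x, y))
    return groups


def random_oversample(examples, seed=0):
    """Duplicate randomly chosen examples of the smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    groups = split_by_class(examples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    return balanced


def random_undersample(examples, seed=0):
    """Discard randomly chosen examples of the larger classes
    until every class is as small as the smallest one."""
    rng = random.Random(seed)
    groups = split_by_class(examples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Hypothetical data set: 90 negative and 10 positive examples.
data = [(i, "neg") for i in range(90)] + [(i, "pos") for i in range(10)]
over = random_oversample(data)
under = random_undersample(data)
print(len(over), len(under))
```

Oversampling keeps every original example and grows the data set; undersampling shrinks it and throws information away, which is one reason the two can behave differently.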

9

My Contributions

Fundamental:

• What domain characteristics aggravate the problem?
• Class imbalances or small disjuncts?
• Are all classifiers sensitive to class imbalances?
• Which proposed solutions to the class imbalance problem are more appropriate?

New Approaches:

• Specialized Resampling: within-class versus between-class imbalances
• One class versus two-class learning
• Multiple Resampling

10

Part I: Fundamentals

I. What domain characteristics aggravate the problem?

II. Class Imbalances or Small Disjuncts?

III. Are all classifiers sensitive to class imbalances?

IV. Which proposed solutions to the class imbalance problem are more appropriate?

11

I.I What domain characteristics aggravate the Problem?

To answer this question, I generated artificial domains that vary along three different axes:

• The degree of concept complexity
• The size of the training set
• The degree of imbalance between the two classes.

[Figure: the interval from 0 to 1 divided into alternating + and - regions.]

12

I.I What domain characteristics aggravate the Problem?

I created 125 domains, each representing a different type of class imbalance, by varying the concept complexity (C), the size of the training set (S) and the degree of imbalance (I) at different rates (5 settings were used per domain characteristic).

I ran C5.0 [a decision tree learning algorithm] on these various imbalanced domains and plotted its error rate on each domain.

Each experiment was repeated 5 times and the results averaged.
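A grid of this kind could be generated along the following lines. This is a rough sketch under assumptions of my own, not the generator used in the experiments: the unit interval is cut into 2**c equal regions with alternating labels, the per-region size grows linearly with s, and i = 5 is taken to mean full balance, with each step down halving the positive class. The function name make_domain and the base_size parameter are hypothetical.

```python
import itertools
import random


def make_domain(c, s, i, base_size=50, seed=0):
    """Generate a one-dimensional two-class domain on the unit interval.

    Assumptions (not taken from the talk): 2**c equal regions with
    alternating labels; each negative region holds base_size * s points;
    each positive region holds that count divided by 2**(5 - i),
    so i = 5 is balanced and i = 1 is a 1/16 imbalance.
    """
    rng = random.Random(seed)
    n_regions = 2 ** c
    width = 1.0 / n_regions
    neg_per_region = base_size * s
    pos_per_region = max(1, neg_per_region // 2 ** (5 - i))
    examples = []
    for r in range(n_regions):
        label = "+" if r % 2 == 0 else "-"
        count = pos_per_region if label == "+" else neg_per_region
        for _ in range(count):
            x = (r + rng.random()) * width  # a point inside region r
            examples.append((x, label))
    return examples


# 5 settings per characteristic give 5 * 5 * 5 = 125 domains.
settings = list(itertools.product(range(1, 6), repeat=3))
domains = {(c, s, i): make_domain(c, s, i) for c, s, i in settings}
print(len(domains))
```

Under these assumptions, large c combined with small s and a strong imbalance leaves only a handful of positive points per region, which is the situation examined on the small-disjunct slides.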

13

I.I What domain characteristics aggravate the Problem?

[Figure: error rate (0.0 to 60.0) as a function of concept complexity, C=1 to C=5, at S=1, with one series per imbalance level I=1 to I=5; the legend is annotated "Large Imbal." and "Full balance".]

14

I.I What domain characteristics aggravate the Problem?

[Figure: error rate (0.0 to 60.0) as a function of concept complexity, C=1 to C=5, at S=5, with one series per imbalance level I=1 to I=5; the legend is annotated "Large Imbal." and "Full balance".]

15

I.I What domain characteristics aggravate the Problem?

The problem is aggravated by two factors:

• An increase in the degree of class imbalance
• An increase in problem complexity: class imbalances do not hinder the classification of simple problems (e.g., linearly separable ones)

However, the problem is simultaneously mitigated by one factor:

• The size of the training set: large training sets yield low sensitivity to class imbalances

16

I.II: Class Imbalances or Small Disjuncts?

Studying the training sets from the previous experiments, it can be inferred that when i and c are large and s is small, the domain contains many very small subclusters.

These were also the conditions under which C5.0 performed the worst.

To test whether it is these small subclusters that cause performance degradation, we disregarded the value of s and set the size of all subclusters to 50 examples.

17

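I.II: Class Imbalances or Small Disjuncts?

[Figure: error rate (0 to 50) at high concept complexity, c=5, for s=1 and s=5 (previous experiment) and for all subclusters of size 50 (this experiment), with one series per imbalance level I=1 to I=5; the legend is annotated "Large Imbal." and "Full balance".]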

18

I.II: Class Imbalances or Small Disjuncts?

When all the subclusters are of size 50, even at the highest degree of concept complexity, no matter what the class imbalance is, the error is below 1%. It is negligible.

This suggests that it is not the class imbalance per se that causes a performance decrease, but rather the small disjunct problem created by the class imbalance (in highly complex and small-sized domains) that causes that loss of performance.

19

I.III Are all classifiers sensitive to class imbalances?

[Figure: schematic illustrations of the three classifiers compared: a Decision Tree (C5.0), a Neural Net (MLP), and a Support Vector Machine (SVM).]

20


[Figure: error rate (0 to 50) of C5.0, MLP and SVM at S = 1, C = 3, with one series per imbalance level I=1 to I=5; the legend is annotated "Large Imbal." and "Full balance".]

21


Decision Tree (C5.0): C5.0 is the most sensitive to class imbalances. This is because C5.0 works globally, not paying attention to specific data points.

Multi-Layer Perceptrons (MLPs): MLPs are less prone to the class imbalance problem than C5.0. This is because of their flexibility: their solution gets adjusted by each data point in a bottom-up manner as well as by the overall data set in a top-down manner.

Support Vector Machines (SVMs): SVMs are even less prone to the class imbalance problem than MLPs because they are only concerned with a few support vectors, the data points located close to the boundaries.

22

I.IV Which Solution is Best?

• Random Oversampling
• Directed Oversampling
• Random Undersampling
• Directed Undersampling
• Adjusting the Costs

23

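I.IV Which Solution is Best?

[Figure: error rate (0 to 50) at S = 1, C = 3 for the imbalanced data (Imb.) and for random oversampling (R.Ov), directed oversampling (D.Ov), random undersampling (R.Un), directed undersampling (D.Un) and cost adjusting (Cost), with one series per imbalance level I=1 to I=5; the legend is annotated "Large Imbal." and "Full Bal.".]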

24

I.IV Which Solution is Best?

Three of the five methods considered present an improvement over C5.0 at S=1 and C=3: Random oversampling, Directed oversampling and Cost-modifying.

Undersampling (random and directed) is not effective and can even hurt the performance.

Random oversampling helps quite dramatically at all levels of complexity. Directed oversampling makes a bit of a difference by helping slightly more.

On the graph of the previous slide, Cost-adjusting is about as effective as Directed oversampling. Generally, however, it is found to be slightly more useful.
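One common way to adjust costs is to move the decision threshold applied to a probability estimate, such as the class proportion at a tree leaf. The Python sketch below shows that idea with hypothetical cost values. It is not a description of how C5.0 handles costs internally.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting positive has the lower expected cost.

    With p the estimated probability of the positive class, predicting
    positive costs (1 - p) * cost_fp in expectation and predicting
    negative costs p * cost_fn, so positive is cheaper when
    p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def classify(p_positive, cost_fp=1.0, cost_fn=1.0):
    """Label an example from its estimated positive-class probability."""
    return "pos" if p_positive > cost_threshold(cost_fp, cost_fn) else "neg"


# Hypothetical leaf estimate: 30% of the training examples at this leaf are positive.
p = 0.3
with_equal_costs = classify(p)                    # threshold 0.5
with_costly_misses = classify(p, cost_fn=9.0)     # threshold 0.1
print(with_equal_costs, with_costly_misses)
```

Raising the cost of missing a minority (positive) example lowers the threshold, so leaves that a standard learner would label negative start predicting positive.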

25

Part II: New Approaches

I. Specialized Resampling: within-class versus between-class imbalances

II. One class versus two-class learning

III. Multiple Resampling

26

II.I: Within-class versus Between-class Imbalances

Idea:

• Use unsupervised learning to identify subclusters in each class separately.
• Re-sample the subclusters of each class until no within-class imbalance and no between-class imbalance are present (although the subclusters of each class can have different sizes)
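The re-sampling step can be sketched as follows in Python. The sketch assumes the subclusters have already been found (the unsupervised step is skipped), and the way target sizes are chosen is my own reading of the idea; the function names are hypothetical.

```python
import math
import random


def grow(cluster, target, rng):
    """Pad a subcluster with random duplicates of its own members up to target size."""
    return list(cluster) + [rng.choice(cluster) for _ in range(target - len(cluster))]


def guided_oversample(clusters_by_class, seed=0):
    """Oversample so that subclusters within a class are equal in size
    and the classes end up (approximately) equal in total size.

    clusters_by_class maps a class label to a list of non-empty
    subclusters; the subclusters are taken as given here.
    """
    rng = random.Random(seed)
    # Size each class would reach if all its subclusters matched its largest one.
    class_size = {
        label: len(clusters) * max(len(c) for c in clusters)
        for label, clusters in clusters_by_class.items()
    }
    target_total = max(class_size.values())
    result = {}
    for label, clusters in clusters_by_class.items():
        # Rounding up can overshoot slightly when the total is not divisible.
        per_cluster = math.ceil(target_total / len(clusters))
        result[label] = [grow(c, per_cluster, rng) for c in clusters]
    return result


# Hypothetical subclusters: two positive (sizes 10, 4), three negative (sizes 40, 20, 5).
clusters = {
    "pos": [list(range(10)), list(range(10, 14))],
    "neg": [list(range(40)), list(range(40, 60)), list(range(60, 65))],
}
balanced = guided_oversample(clusters)
print({label: [len(c) for c in cs] for label, cs in balanced.items()})
```

In this toy case each positive subcluster grows to 60 examples and each negative one to 40, so both classes total 120 even though subcluster sizes differ between the classes.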

27


[Figure: the symmetric case and the asymmetric case.]

28

II.I: Within-class vs Between-class Imbalances: Experiments

• Imbalanced
• Random Oversampling: between-class imbalance eliminated
• Guided Oversampling I (# Clusters Known): use prior knowledge of classes to guide clustering
• Guided Oversampling II (# Clusters Unknown): let the clustering algorithm determine the number of clusters

29

II.I: Within-class vs Between-class Imbalances: Letters

• Subset of the Letters dataset found at the UCI Repository
• Positive class contains the vowels a and u
• Negative class contains the consonants m, s, t and w.
• All letters are distributed according to their frequency in English texts.

30

II.I: Within-class vs Between-class Imbalances: Letters

Method                                         Precision  Recall  F-Measure
Imbalanced                                     0.905      0.818   0.859
Random Oversampling                            0.905      0.818   0.859
Guided Oversampling I (# Clusters Unknown)     0.923      0.914   0.919
Guided Oversampling II (Using Known Clusters)  0.935      0.877   0.905

31

II.I: Within-class vs Between-class Imbalances: Text Classification

Reuters-21578 Dataset

Classifying a document according to its topic

Positive class is a particular topic

Negative class is every other topic

32

II.I: Within-class vs Between-class Imbalances: Text Classification

Method                                         Precision  Recall  F-Measure
Imbalanced                                     0.617      0.394   0.455
Random Oversampling                            0.580      0.545   0.560
Guided Oversampling I (# Clusters Unknown)     0.650      0.510   0.544
Guided Oversampling II (Using Known Clusters)  0.601      0.751   0.665

33


Results:

• On letter and text categorization tasks, this strategy worked better than the random over-sampling strategy.
• Noise in the small subclusters, however, caused problems because it was magnified too much.
• This promising strategy requires more study.

34

II.II One-Class versus Two-Class Learning

[Figure: schematic illustrations of the classifiers compared: DMLP, RMLP, and a Decision Tree (C5.0).]

35

II.II One-Class versus Two-Class Learning

[Figure: error rate (0 to 35) of RMLP, DMLP and C4.5 on three domains: Helicop., Promoters and Sonars.]

36

II.II One-Class versus Two-Class Learning

• One-Class learning is more accurate than two-class learning on two of the three domains we considered and as accurate on the third.
• It can thus be quite useful in class imbalanced situations.
• Further comparisons with other proposed methods are required.

37

II.III Multiple Resampling

Idea:

• Although the results reported here suggest that undersampling is not as useful as oversampling, other studies of ours and of others (on different data sets) suggest that it can be. It shouldn’t be abandoned.
• Further experiments of ours (not reported here) suggest that oversampling or undersampling until a full balance is achieved may not always be optimal. A different re-sampling rate should be used.

38


Idea (Continued):

• It is not possible to know, a priori, whether a given domain favours oversampling or undersampling and what resampling rate is best.
• Therefore, we decided to create a self-adaptive combination scheme that considers both strategies at various rates.

39


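[Figure: architecture of the combination scheme. Oversampling classifiers (sampled at different rates) feed an oversampling expert, undersampling classifiers (sampled at different rates) feed an undersampling expert, and the two experts are combined to produce the output.]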
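The training sets behind such a scheme could be produced as in the Python sketch below. It assumes evenly spaced re-sampling rates, which is my own choice, and the function name resampled_sets is hypothetical. Each returned set would train one classifier; how the experts combine their classifiers is not shown.

```python
import random


def resampled_sets(majority, minority, n_rates=10, seed=0):
    """Build training sets re-sampled at n_rates evenly spaced rates.

    Rate k / n_rates closes that fraction of the gap between the class
    sizes: oversampling duplicates random minority examples, while
    undersampling discards random majority examples. The last rate
    gives full balance.
    """
    rng = random.Random(seed)
    gap = len(majority) - len(minority)
    over, under = [], []
    for k in range(1, n_rates + 1):
        step = round(gap * k / n_rates)
        extra = [rng.choice(minority) for _ in range(step)]
        over.append((list(majority), list(minority) + extra))
        kept = rng.sample(majority, len(majority) - step)
        under.append((kept, list(minority)))
    return over, under


# Hypothetical data: 1000 majority and 100 minority examples (identified by index).
majority = list(range(1000))
minority = list(range(1000, 1100))
over_sets, under_sets = resampled_sets(majority, minority)
print([len(m) for _, m in over_sets])
print([len(m) for m, _ in under_sets])
```

Training one classifier per set gives each expert a range of operating points, so the scheme does not have to commit in advance to one strategy or one rate.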

40

II.III Multiple Resampling

• The combination scheme was compared to C4.5-Adaboost (with 20 classifiers) with respect to the FB-measure on a text classification task (Reuters-21578, Top 10 categories).
• The FB-measure combines precision (the proportion of examples classified as positive that are truly positive) and recall (the proportion of truly positive examples that are classified as positive) in the following way:

F1: precision and recall are given equal weight
F2: recall is given twice the weight of precision
F0.5: precision is given twice the weight of recall

41

II.III Testing the Combination Scheme: Results

[Figure: F-measure (0 to 0.8) of Adaboost and of the Mixture-of-Experts scheme for B=1, B=2 and B=0.5.]

In all cases, the mixture scheme is superior to Adaboost. However, though it helps both recall and precision, it helps recall more.

42

Summary/Conclusions: Overall Goals of the research

This talk presented some of the work I conducted in recent years. In particular, I focused on the class imbalance problem, aiming at:

• Establishing some fundamental results regarding the nature of the problem, the behaviour of different types of classifiers, and the relative performance of various previously proposed schemes for dealing with the problem.
• Designing new methods for attacking the problem.

43

Summary/Conclusions: Results, Fundamentals

• The sensitivity of Decision Trees and Neural Networks to class imbalance increases with the domain complexity and the degree of imbalance. Training set size mitigates this pattern. SVMs are not sensitive to class imbalances [up to 1/16 imbalance].
• Cost-Adjusting is slightly more effective than random or directed over- or undersampling although all approaches are helpful, and directed oversampling is close to cost-adjusting.
• The class imbalance problem may not be a problem in itself. Rather, the small disjunct problem it causes is responsible for the decay.

44

Summary/Conclusions: Results, New Approaches

I presented three new methods very different from each other and from previously proposed schemes. They all showed promise over previously proposed approaches.

• Approach 1: Oversampling with respect to within-class and between-class imbalances
• Approach 2: One-class Learning
• Approach 3: Adaptive combination scheme which combines over- and under-sampling at 10 different rates each.

45

Summary/Conclusions: Future Work

• Expand on and study in more depth all the new proposed approaches I have described.
• Adapt the idea of Boosting to the class imbalance problem (with the National Institute of Health (NIH) in Washington, D.C. and Master’s Student Benjamin Wang)
• Design novel Oversampling schemes and feature selection schemes for Text Classification (with Ph.D. Student: Taeho Jo)

46

Partial Bibliography

"A Multiple Resampling Method for Learning from Imbalanced Data Sets", Estabrooks, A., Jo, T. and Japkowicz, N., Computational Intelligence, Volume 20, Number 1, 2004. (in press)

"The Class Imbalance Problem: A Systematic Study", Japkowicz, N. and Stephen, S., Intelligent Data Analysis, Volume 6, Number 5, pp. 429-450, November 2002.

"Supervised versus Unsupervised Binary-Learning by Feedforward Neural Networks", Japkowicz, N., Machine Learning, Volume 42, Issue 1/2, pp. 97-122, January 2001.

"A Mixture-of-Experts Framework for Concept-Learning from Imbalanced Data Sets", Estabrooks, A. and Japkowicz, N., Proceedings of the 2001 Intelligent Data Analysis Conference.

"Concept-Learning in the Presence of Between-Class and Within-Class Imbalances", Japkowicz, N., Proceedings of the Fourteenth Conference of the Canadian Society for Computational Studies of Intelligence, 2001.

"The Class Imbalance Problem: Significance and Strategies", Japkowicz, N., in the Proceedings of the 2000 International Conference on Artificial Intelligence (IC-AI'2000), Volume 1, pp. 111-117.

47

A Summary of the Various Measures Used

Error Rate: ( b + c ) / ( a + b + c + d )

Accuracy: ( a + d ) / ( a + b + c + d )

Precision: P = a / ( a + c )

Recall: R = a / ( a + b )

FB-Measure: ( B^2 + 1 ) P R / ( B^2 P + R )

                     Actual Class
                     Pos    Neg
Classified as  Pos   a      c
               Neg   b      d
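These measures translate directly into code. The Python sketch below uses the counts a, b, c, d as they appear in the formulas above (a: positives classified as positive, b: positives classified as negative, c: negatives classified as positive, d: negatives classified as negative). The example counts are made up for illustration.

```python
def measures(a, b, c, d, beta=1.0):
    """Compute the summary measures from the four confusion-matrix counts.

    a: positives classified as positive, b: positives classified as negative,
    c: negatives classified as positive, d: negatives classified as negative.
    Assumes a + c > 0 and a + b > 0 so that precision and recall are defined.
    """
    total = a + b + c + d
    precision = a / (a + c)
    recall = a / (a + b)
    f_beta = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return {
        "error_rate": (b + c) / total,
        "accuracy": (a + d) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": f_beta,
    }


# Hypothetical counts: 10 actual positives, 90 actual negatives.
m = measures(a=8, b=2, c=4, d=86)
m_recall_heavy = measures(a=8, b=2, c=4, d=86, beta=2.0)
print(m)
```

With these counts recall (0.8) is higher than precision (about 0.67), so the B=2 measure, which favours recall, comes out above the B=1 measure.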
