Borderline-SMOTE (reading group transcript)
Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
Synthetic minority over-sampling technique (SMOTE)
Presented by Hector Franco, TCD
Topic: machine learning, imbalanced data sets
Content:
0. Basic concepts
1. Introduction
2. Recent developments
3. Algorithms description
4. Evaluation
5. Discussion

Basic concepts:
Why is this paper important for us? Multi-class problems become imbalanced when we compare one class against all the others. In some cases the data set is too small to generalize well. Text classification is an example of imbalanced data. The method can be used with tree kernels.
[Figure: effect of SMOTE and DEC (SDC); one panel after DEC alone, one after SMOTE and DEC.]
[Figure: SMOTE's informed oversampling procedure with k = 3; legend: minority samples, synthetic samples, majority samples.]
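The interpolation step that figure illustrated is easy to state in code. A minimal sketch (variable and function names are mine, not from the paper): a synthetic sample lies at a random point on the segment between a minority sample and one of its k nearest minority-class neighbours.

```python
import numpy as np

def smote_sample(x, neighbor, rng):
    # One SMOTE step: a random point on the segment between a minority
    # sample x and one of its k nearest minority-class neighbours.
    gap = rng.random()                 # uniform in (0, 1)
    return x + gap * (neighbor - x)

rng = np.random.default_rng(0)
print(smote_sample(np.array([1.0, 1.0]), np.array([3.0, 2.0]), rng))
```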
Section 1: Introduction

By convention, the class with the smaller number of examples is called the minority (or positive) class.
Section 2: Recent developments in imbalanced data sets learning

Types of imbalance in data sets:
- Between-class imbalance (our focus)
- Within-class imbalance

Imbalance matters in text classification. We focus on the minority class and want high prediction performance for it. A multiclass problem can be decomposed into two-class problems (one against all).
Evaluation metrics in imbalanced domains:

Overall accuracy is not very good in unbalanced data, because the majority class dominates it.

TP rate (recall) = TP / (TP + FN)
FP rate = FP / (FP + TN)
Precision = TP / (TP + FP)
F-value = ((1 + β²) · recall · precision) / (β² · recall + precision), a popular evaluation metric for imbalance problems. Usually β = 1, and β = 1 in this paper.
AUC: area under the ROC curve.
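As a quick self-contained illustration (a minimal sketch, not from the paper; the function name and example counts are mine), these metrics can be computed from confusion-matrix counts, with the minority class treated as positive:

```python
# Sketch: metrics from confusion-matrix counts, minority class as positive.
def metrics(tp, fn, fp, tn, beta=1.0):
    tp_rate = tp / (tp + fn)                      # recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_value = ((1 + beta**2) * tp_rate * precision) / (beta**2 * tp_rate + precision)
    return tp_rate, fp_rate, f_value

# Example: 100 minority vs. 900 majority test samples
print(metrics(tp=60, fn=40, fp=30, tn=870))
```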
2.2 Dealing with imbalanced data sets

- Data level: change the distribution to make the data balanced.
- Algorithm level: modify the existing data mining algorithms, or make new algorithms.
2.2.1 Methods at data level (re-sampling methods):
- Random over-sampling: duplicate minority examples (sketched below).
- Random under-sampling: can remove important data.
- Remove noise.
- SMOTE.
- Combine under-sampling and over-sampling.
- Find the hard examples and over-sample them.
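A minimal sketch of the two random re-sampling baselines (assumed data layout: one list of samples per class; not code from the paper):

```python
import random

def random_oversample(minority, n_extra, seed=None):
    # Duplicate randomly chosen minority examples.
    rnd = random.Random(seed)
    return minority + [rnd.choice(minority) for _ in range(n_extra)]

def random_undersample(majority, n_keep, seed=None):
    # Drop majority examples at random; this can remove important data.
    rnd = random.Random(seed)
    return rnd.sample(majority, n_keep)
```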
2.2.2 Methods at algorithm level:
- AdaBoost (increases the weights of misclassified examples) does not perform well on imbalanced data sets; improved variants update the weights of TPs and FPs separately, which works better than weighting predictions without distinguishing TP from FP.
- Use a kernel-modified SVM.
- Use a BMPM (Biased Minimax Probability Machine). There are other cost-based learning methods…
Section 3: A new over-sampling method: Borderline-SMOTE

Algorithms usually try to learn the borderline as exactly as possible, so examples near the borderline are the most informative.
The paper proposes two new oversampling methods: Borderline-SMOTE1 and Borderline-SMOTE2.
Borderline-SMOTE1 algorithm: find the minority examples whose m nearest neighbours (among all samples) are mostly, but not all, majority examples; these form the DANGER set, and SMOTE is applied only to them (a sketch follows).
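The algorithm slide itself did not survive extraction. A minimal NumPy sketch under the paper's definitions (function and parameter names are mine, not the authors' code):

```python
import numpy as np

def borderline_smote1(X_min, X_maj, m=5, k=5, n_new_per_seed=1, rng=None):
    # Illustrative sketch of Borderline-SMOTE1, not the authors' code.
    # X_min, X_maj: (n, d) arrays of minority / majority samples.
    rng = np.random.default_rng() if rng is None else rng
    X_all = np.vstack([X_min, X_maj])
    synthetic = []
    for p in X_min:
        # m nearest neighbours among ALL samples (index 0 is p itself).
        nn_all = np.argsort(np.linalg.norm(X_all - p, axis=1))[1:m + 1]
        n_maj = np.sum(nn_all >= len(X_min))        # majority neighbours
        # DANGER: at least half, but not all, neighbours are majority.
        if m / 2 <= n_maj < m:
            # Interpolate towards k nearest MINORITY neighbours.
            nn_min = np.argsort(np.linalg.norm(X_min - p, axis=1))[1:k + 1]
            for j in rng.choice(nn_min, size=n_new_per_seed):
                gap = rng.random()                   # uniform in (0, 1)
                synthetic.append(p + gap * (X_min[j] - p))
    return np.array(synthetic)
```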
Borderline-SMOTE2: additionally interpolates from each DANGER example towards its majority-class neighbours, but with random numbers between 0 and 0.5, so the synthetic examples lie closer to the minority example that generated them (one-line change sketched below).
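Relative to the Borderline-SMOTE1 sketch above, the difference is small (again an illustrative sketch, not the authors' code):

```python
# Borderline-SMOTE2 (sketch): for a DANGER example p, also interpolate
# towards a nearest MAJORITY neighbour, with gap in (0, 0.5) so the
# synthetic sample stays on the minority side.
def smote2_step(p, majority_neighbor, rng):
    gap = rng.random() * 0.5           # (0, 0.5) instead of (0, 1)
    return p + gap * (majority_neighbor - p)
```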
[Figure: circle data set (artificial); panels show the DANGER samples and the Borderline-SMOTE1 synthetic samples.]
Section 4: Experiments

Methods compared:
- Nothing (baseline)
- SMOTE
- Random over-sampling
- Borderline-SMOTE1
- Borderline-SMOTE2

Settings: k = 5 nearest neighbours, 10-fold cross-validation, C4.5 classifier (protocol sketched after the data sets below). We only want to improve the prediction of the minority class.

Data sets:
circle (artificial), pima, satimage, haberman
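A sketch of the evaluation protocol (assumptions: scikit-learn's DecisionTreeClassifier stands in for C4.5, `borderline_smote1` is the sketch from Section 3, X and y are NumPy arrays with the minority class labelled 1; resampling is applied to training folds only, never the test fold):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def minority_tp_rate(X, y, oversample=None, n_splits=10, seed=0):
    tp = fn = 0
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in cv.split(X, y):
        X_tr, y_tr = X[train], y[train]
        if oversample is not None:
            # Resample the training fold only, never the test fold.
            X_syn = oversample(X_tr[y_tr == 1], X_tr[y_tr == 0])
            if len(X_syn):
                X_tr = np.vstack([X_tr, X_syn])
                y_tr = np.concatenate([y_tr, np.ones(len(X_syn))])
        clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        pred = clf.predict(X[test])
        tp += np.sum((pred == 1) & (y[test] == 1))
        fn += np.sum((pred != 1) & (y[test] == 1))
    return tp / (tp + fn)   # TP rate on the minority class
```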
Section 5: Conclusion

Working with imbalanced data sets is a common problem.
Borderline examples are more easily misclassified.
The proposed methods perform better than traditional SMOTE.
Open research questions:
- How to define DANGER examples.
- How to determine the number of examples in DANGER.
- How to combine the methods with other data mining algorithms.
Thank you for your time
Creative Commons license

You are free:
- to copy, distribute, display, and perform the work
- to make derivative works

Under the following conditions:
- Attribution: you must give the original author credit.
- Non-Commercial: you may not use this work for commercial purposes.
- For any reuse or distribution, you must make clear to others the license terms of this work.
- Any of these conditions can be waived if you get permission from the copyright holder.
- Nothing in this license impairs or restricts the author's moral rights.