Borderline-SMOTE (reading group transcript)
Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
Synthetic minority over-sampling technique (SMOTE)
Presented by Hector Franco, TCD
Topic: machine learning, imbalanced data sets
Content:
0. Basic concepts
1. Introduction
2. Recent developments
3. Algorithms description
4. Evaluation
5. Discussion

Basic concepts:
Why is this paper important for us? Multi-class problems become imbalanced when we compare one class against all the others. In some cases the data set is too small to generalize well. Text classification is an example of imbalanced data. The method can be used with tree kernels.
[Figure: effect of SMOTE and DEC (SDC); one panel after DEC alone, one after SMOTE and DEC.]
[Figure: SMOTE's informed oversampling procedure with k = 3; legend: minority samples, synthetic samples, majority samples.]
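The interpolation step that figure illustrated is easy to state in code. A minimal sketch (variable and function names are mine, not from the paper): a synthetic sample lies at a random point on the segment between a minority sample and one of its k nearest minority-class neighbours.

```python
import numpy as np

def smote_sample(x, neighbor, rng):
    # One SMOTE step: a random point on the segment between a minority
    # sample x and one of its k nearest minority-class neighbours.
    gap = rng.random()                 # uniform in (0, 1)
    return x + gap * (neighbor - x)

rng = np.random.default_rng(0)
print(smote_sample(np.array([1.0, 1.0]), np.array([3.0, 2.0]), rng))
```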
Section 1: Introduction

By convention, the class with the smaller number of examples is called the minority (or positive) class.
Section 2: Recent developments in imbalanced data sets learning

Types of imbalance in data sets:
- Between-class imbalance (our focus)
- Within-class imbalance

Imbalance matters in text classification. We focus on the minority class and want high prediction performance for it. A multiclass problem can be decomposed into two-class problems (one against all).
Evaluation metrics in imbalanced domains:

Overall accuracy is not very good in unbalanced data, because the majority class dominates it.

TP rate (recall) = TP / (TP + FN)
FP rate = FP / (FP + TN)
Precision = TP / (TP + FP)
F-value = ((1 + β²) · recall · precision) / (β² · recall + precision), a popular evaluation metric for imbalance problems. Usually β = 1, and β = 1 in this paper.
AUC: area under the ROC curve.
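As a quick self-contained illustration (a minimal sketch, not from the paper; the function name and example counts are mine), these metrics can be computed from confusion-matrix counts, with the minority class treated as positive:

```python
# Sketch: metrics from confusion-matrix counts, minority class as positive.
def metrics(tp, fn, fp, tn, beta=1.0):
    tp_rate = tp / (tp + fn)                      # recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_value = ((1 + beta**2) * tp_rate * precision) / (beta**2 * tp_rate + precision)
    return tp_rate, fp_rate, f_value

# Example: 100 minority vs. 900 majority test samples
print(metrics(tp=60, fn=40, fp=30, tn=870))
```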
2.2 Dealing with imbalanced data sets

- Data level: change the distribution to make the data balanced.
- Algorithm level: modify the existing data mining algorithms, or make new algorithms.
2.2.1 Methods at data level (re-sampling methods):
- Random over-sampling: duplicate minority examples (sketched below).
- Random under-sampling: can remove important data.
- Remove noise.
- SMOTE.
- Combine under-sampling and over-sampling.
- Find the hard examples and over-sample them.
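A minimal sketch of the two random re-sampling baselines (assumed data layout: one list of samples per class; not code from the paper):

```python
import random

def random_oversample(minority, n_extra, seed=None):
    # Duplicate randomly chosen minority examples.
    rnd = random.Random(seed)
    return minority + [rnd.choice(minority) for _ in range(n_extra)]

def random_undersample(majority, n_keep, seed=None):
    # Drop majority examples at random; this can remove important data.
    rnd = random.Random(seed)
    return rnd.sample(majority, n_keep)
```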
2.2.2 Methods at algorithm level:
- AdaBoost (increases the weights of misclassified examples) does not perform well on imbalanced data sets; improved variants update the weights of TPs and FPs separately, which works better than weighting predictions without distinguishing TP from FP.
- Use a kernel-modified SVM.
- Use a BMPM (Biased Minimax Probability Machine). There are other cost-based learning methods…
Section 3: A new over-sampling method: Borderline-SMOTE

Algorithms usually try to learn the borderline as exactly as possible, so examples near the borderline are the most informative.
The paper proposes two new oversampling methods: Borderline-SMOTE1 and Borderline-SMOTE2.
Borderline-SMOTE1 algorithm: find the minority examples whose m nearest neighbours (among all samples) are mostly, but not all, majority examples; these form the DANGER set, and SMOTE is applied only to them (a sketch follows).
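The algorithm slide itself did not survive extraction. A minimal NumPy sketch under the paper's definitions (function and parameter names are mine, not the authors' code):

```python
import numpy as np

def borderline_smote1(X_min, X_maj, m=5, k=5, n_new_per_seed=1, rng=None):
    # Illustrative sketch of Borderline-SMOTE1, not the authors' code.
    # X_min, X_maj: (n, d) arrays of minority / majority samples.
    rng = np.random.default_rng() if rng is None else rng
    X_all = np.vstack([X_min, X_maj])
    synthetic = []
    for p in X_min:
        # m nearest neighbours among ALL samples (index 0 is p itself).
        nn_all = np.argsort(np.linalg.norm(X_all - p, axis=1))[1:m + 1]
        n_maj = np.sum(nn_all >= len(X_min))        # majority neighbours
        # DANGER: at least half, but not all, neighbours are majority.
        if m / 2 <= n_maj < m:
            # Interpolate towards k nearest MINORITY neighbours.
            nn_min = np.argsort(np.linalg.norm(X_min - p, axis=1))[1:k + 1]
            for j in rng.choice(nn_min, size=n_new_per_seed):
                gap = rng.random()                   # uniform in (0, 1)
                synthetic.append(p + gap * (X_min[j] - p))
    return np.array(synthetic)
```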
Borderline-SMOTE2: additionally interpolates from each DANGER example towards its majority-class neighbours, but with random numbers between 0 and 0.5, so the synthetic examples lie closer to the minority example that generated them (one-line change sketched below).
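Relative to the Borderline-SMOTE1 sketch above, the difference is small (again an illustrative sketch, not the authors' code):

```python
# Borderline-SMOTE2 (sketch): for a DANGER example p, also interpolate
# towards a nearest MAJORITY neighbour, with gap in (0, 0.5) so the
# synthetic sample stays on the minority side.
def smote2_step(p, majority_neighbor, rng):
    gap = rng.random() * 0.5           # (0, 0.5) instead of (0, 1)
    return p + gap * (majority_neighbor - p)
```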
[Figure: circle data set (artificial); panels show the DANGER samples and the Borderline-SMOTE1 synthetic samples.]
Section 4: Experiments

Methods compared:
- Nothing (baseline)
- SMOTE
- Random over-sampling
- Borderline-SMOTE1
- Borderline-SMOTE2

Settings: k = 5 nearest neighbours, 10-fold cross-validation, C4.5 classifier (protocol sketched after the data sets below). We only want to improve the prediction of the minority class.

Data sets:
circle (artificial), pima, satimage, haberman
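A sketch of the evaluation protocol (assumptions: scikit-learn's DecisionTreeClassifier stands in for C4.5, `borderline_smote1` is the sketch from Section 3, X and y are NumPy arrays with the minority class labelled 1; resampling is applied to training folds only, never the test fold):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def minority_tp_rate(X, y, oversample=None, n_splits=10, seed=0):
    tp = fn = 0
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in cv.split(X, y):
        X_tr, y_tr = X[train], y[train]
        if oversample is not None:
            # Resample the training fold only, never the test fold.
            X_syn = oversample(X_tr[y_tr == 1], X_tr[y_tr == 0])
            if len(X_syn):
                X_tr = np.vstack([X_tr, X_syn])
                y_tr = np.concatenate([y_tr, np.ones(len(X_syn))])
        clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        pred = clf.predict(X[test])
        tp += np.sum((pred == 1) & (y[test] == 1))
        fn += np.sum((pred != 1) & (y[test] == 1))
    return tp / (tp + fn)   # TP rate on the minority class
```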
Section 5: Conclusion

Working with imbalanced data sets is a common problem.
Borderline examples are more easily misclassified.
The proposed methods perform better than traditional SMOTE.
Open research questions:
- How to define DANGER examples.
- How to determine the number of examples in DANGER.
- How to combine the methods with other data mining algorithms.
Thank you for your time
Creative Commons license

You are free:
- to copy, distribute, display, and perform the work
- to make derivative works

Under the following conditions:
- Attribution: you must give the original author credit.
- Non-Commercial: you may not use this work for commercial purposes.
- For any reuse or distribution, you must make clear to others the license terms of this work.
- Any of these conditions can be waived if you get permission from the copyright holder.
- Nothing in this license impairs or restricts the author's moral rights.