
  • UNIVERSIDAD TECNICA FEDERICO SANTA MARIA

    Departamento de Informática

    Valparaíso - Chile

    An Ensemble of One-Class Domain Descriptors for Imbalanced Classification

    Thesis submitted in partial fulfillment
    of the requirements for the degree of
    Magíster en Ciencias de la Ingeniería Informática

    Felipe Ramírez González

    Evaluation Committee

    Prof. Dr. rer. nat. Héctor Allende Olivares (Advisor, UTFSM)
    Prof. Dr. Raúl Monge Anwandter (UTFSM)
    Prof. Dr. Max Chacón Pacheco (USACH)

    September, 2012

  • UNIVERSIDAD TECNICA FEDERICO SANTA MARIA

    Departamento de Informática
    Valparaíso - Chile

    Thesis title
    An Ensemble of One-Class Domain Descriptors for Imbalanced Classification

    Author
    Felipe Ramírez González

    Thesis submitted to the Departamento de Informática of the Universidad Técnica Federico Santa María in partial fulfillment of the requirements for the degree of Magíster en Ciencias de la Ingeniería Informática.

    Dr. Héctor Allende O. (Advisor)

    Dr. Raúl Monge A. (Internal Evaluator)

    Dr. Max Chacón P. (External Evaluator)

    September, 2012

  • to Claudia

  • Acknowledgments

    I would like to offer my sincerest gratitude to my advisor, Dr. Héctor Allende, who has been a fundamental element throughout the course of this project. I especially thank him for providing me with the freedom to work in my own way and for trusting in my capabilities. I would also like to acknowledge the assistance provided by the members of the INCA Research Group; I deeply appreciate their valuable comments and suggestions in the elaboration of this work.

    On a personal note, I would like to acknowledge the support of my family and friends: my parents Silvia and Sergio, my brother Álvaro, and especially my girlfriend Claudia, to whom this work is dedicated. I thank them for their patience and companionship throughout this year of hard work.

    I would also like to acknowledge the support of the following research grants: Fondecyt 1110854, CCTVal FB0821, and DGIP 24.12.01. I especially thank DGIP and Dr. Allende for offering me a position as a scientific assistant in the Advanced Computing Laboratory at Universidad Técnica Federico Santa María, where this research was developed.


  • Abstract

    Over the past few years, pattern recognition algorithms have made possible a growing number of applications in biomedical engineering, such as mammography analysis for breast cancer detection, electrocardiogram signal analysis for cardiovascular disease diagnosis, and magnetic resonance imaging analysis for brain tumor segmentation, among many others.

    Given the low rates at which some of these diseases occur in real life, available observations of such phenomena are highly underrepresented, often accounting for less than 1% of all available cases. Under these circumstances, an automatic test which unconditionally issues a negative diagnosis given any observation will be correct 99% of the time, but at the cost of not being able to detect any diseases, which is the whole purpose of the test. This simple example reveals that when rare cases are described, the test's empirical error is an inappropriate performance measure, and, as a consequence, so are learning models based on its minimization.

    This situation is commonly known as the class imbalance problem. It often characterizes applications where a highly infrequent but very important phenomenon is described, and it hinders the performance of error minimization-based pattern recognition algorithms. New solutions capable of compensating for this problem have been proposed under two main approaches: applying data resampling schemes, or modifying existing traditional algorithms. This discipline is known in the literature as class imbalanced learning.

    In this thesis we propose a new algorithm for imbalanced classification designed to improve the accuracy of the minority class, hence improving the overall performance of the classifier. Computer simulations show that the proposed strategy, which we have termed Dual Support Vector Domain Description, outperforms related literature approaches in especially interesting benchmark instances.

    Keywords: imbalanced data, one-class learning, ensemble learning


  • Resumen

    In recent years, pattern recognition algorithms have made possible a growing number of applications in the field of biomedical engineering, such as the analysis of mammograms for cancer detection, the analysis of electrocardiographic signals for the diagnosis of cardiovascular diseases, and the analysis of magnetic resonance images for the segmentation of brain tumors, among many others.

    Given how infrequently some kinds of pathologies manifest themselves in real life, the observable cases are highly underrepresented, often accounting for less than 1% of all available data. Under these conditions, an automatic test that unconditionally issues a negative diagnosis will be correct 99% of the time, but it will be unable to detect the truly important cases in which the disease is present. This simple example reveals that when infrequent cases are described, the empirical error of the test is an inadequate performance measure, and, consequently, so are the learning models based on its minimization.

    This situation is commonly known as the class imbalance problem. It usually arises in applications that describe a highly infrequent but vitally important phenomenon, and it significantly deteriorates the performance of pattern recognition algorithms based on error minimization. This motivates the need to develop algorithms capable of compensating for this deterioration, either by applying some data resampling scheme or by modifying traditional algorithms. In the literature this discipline is called class imbalanced learning.

    This thesis proposes a new algorithm for solving imbalanced classification problems, especially designed to improve the accuracy of the minority class, thereby improving the overall performance of the classifier. Computer simulations show that the proposed strategy, which has been named Dual Support Vector Domain Description, performs better than related methods from the literature on especially interesting problem instances.

    Keywords: imbalanced data, one-class learning, ensemble learning


  • Contents

    1 Introduction
        1.1 Motivation
        1.2 Learning from Imbalanced Datasets
        1.3 Proposal
        1.4 Contributions
        1.5 Summary of the Chapter

    2 Machine Learning Overview
        2.1 Definition
        2.2 Classification Tasks
        2.3 Classifier Performance
            2.3.1 Precision and Recall
            2.3.2 Sensitivity and Specificity
        2.4 Traditional Classifiers Under Class Imbalances
        2.5 Summary of the Chapter

    3 State of the Art
        3.1 External Approaches
            3.1.1 Resampling
            3.1.2 Complex Data Edition
        3.2 Internal Approaches
            3.2.1 Ensemble Learning
            3.2.2 Cost-Sensitive Learning
            3.2.3 One-Class Learning
        3.3 Appropriate Performance Measures
            3.3.1 Geometric Mean
            3.3.2 F-measure
            3.3.3 ROC Analysis
            3.3.4 Optimized Precision
            3.3.5 Optimized Accuracy with Recall-Precision
            3.3.6 Index of Balanced Accuracy
        3.4 Summary of the Chapter

    4 Methodology
        4.1 Proposed Method
            4.1.1 Dual Domain Descriptions
            4.1.2 Nested Aggregation Rule
        4.2 Related Methods
        4.3 Performance Assessment
            4.3.1 Measures
            4.3.2 Validation
        4.4 Summary of the Chapter

    5 Results
        5.1 Experimental Setup
        5.2 Experiments with Real Data
            5.2.1 Benchmark Instances
            5.2.2 Algorithm Tuning
            5.2.3 Performance
            5.2.4 Discussion
        5.3 Experiments with Synthetic Data
            5.3.1 Domain Generation Framework
            5.3.2 Algorithm Tuning
            5.3.3 Performance
            5.3.4 Discussion
        5.4 Validation
            5.4.1 Relevant Results
            5.4.2 Hypothesis Testing
        5.5 Summary of the Chapter

    6 Conclusions
        6.1 Summary
        6.2 Results
        6.3 Future Work
        6.4 Achievements

    Bibliography

  • List of Figures

    1.1 Distribution of labels of the MIT-BIH ECG database
    2.1 Illustration of supervised and unsupervised learning
    2.2 Performance assessment in an information retrieval system
    3.1 Discrimination-based versus recognition-based classification
    3.2 SVDD model trained with both target and outlier objects
    3.3 G-mean as a function of sensitivity and specificity
    3.4 F-measure as a function of precision and recall
    3.5 Sample ROC plot
    3.6 Optimized precision as a function of sensitivity and specificity
    3.7 Balanced accuracy graph
    3.8 Index of balanced accuracy as a function of sensitivity and specificity
    4.1 Tightened target-class and outlier-class SVDDs
    4.2 Data boundaries for three SVDD variations
    4.3 Width of the extended decision boundary of the SVDD for three data scenarios
    4.4 Width of the extended decision boundary of the SVDD for two thresholds
    5.1 Testing G-mean as a function of imbalance in real instances
    5.2 Testing G-mean as a function of imbalance in synthetic instances
    5.3 G-mean of a decision tree for five complexity levels


  • List of Tables

    4.1 Selection of related state-of-the-art algorithms
    5.1 Summary of real benchmark datasets
    5.2 Multi-class to two-class mapping applied to benchmark datasets
    5.3 Optimal parameters for real datasets
    5.4 Training performance over real datasets
    5.5 Testing performance over real datasets
    5.6 CPU performance over real datasets
    5.7 Summary of synthetic benchmark datasets
    5.8 Optimal parameters for synthetic datasets
    5.9 Training performance over synthetic datasets
    5.10 Testing performance over synthetic datasets
    5.11 CPU performance over synthetic datasets
    5.12 Summary of results with real and synthetic data


  • Chapter 1

    Introduction

    Machine learning has recently attracted special attention for real-world problem solving in an increasing number of domains where new issues not previously considered have appeared, given the nature of certain phenomena. One such difficulty is the so-called class imbalance problem [JS02], a bias in data class distributions that is said to hinder the performance of traditional pattern recognition algorithms such as decision trees, artificial neural networks, and support vector machines, as their respective training algorithms assume a relatively uniform distribution of labels [Li07; JS02; Bre96; Bis96; Tom76].

    The literature uses the term class imbalance problem to refer to a series of difficulties found by some pattern recognition algorithms when learning from imbalanced datasets. A training dataset is said to be imbalanced if the samples of one of the classes are severely outnumbered by the samples of the other classes. Formally, in the case of binary classification, let n0 and n1 be the number of available training examples of each class. In cases where n1 ≪ n0, the training set is imbalanced, class 1 being the minority and class 0 the majority. An imbalance level or ratio is often defined as n0/n1 to characterize the amount of imbalance between classes in a given dataset. In this work, we represent this concept as the percentage of samples that belong to the majority class.

    Some well-known real benchmark datasets have imbalance ratios of up to 10³ [CHG10], i.e., datasets with 10³ observations of one class per each observation of the other. These high imbalance levels are often found in applications where uncommon but vital phenomena need to be detected, such as computer-assisted diagnosis of rare diseases [LLH10], oil spill detection in satellite images [BS05], and recognition of many other infrequent events.

    It should be noted, however, that class imbalances do not pose a problem in themselves; they are just the reflection of a phenomenon's underlying probability distribution when data is sampled at random. The actual problem is that error minimization-based pattern recognition algorithms have difficulties when fed with data drawn from these distributions.


    Figure 1.1: Distribution of records among the 14 types of rhythm annotated in the MIT-BIH Arrhythmia Database.

    1.1 Motivation

    Despite the fact that there is an increasing number of works addressing machine learning methods and models, most of them are focused on solving theoretical issues rather than proposing practical applications: so much research, so few good products [SN98]. Our main motivation behind this work was to address learning issues that show up when facing real-world problems.

    In previous work we analyzed a database of human electrocardiogram records to extract features for arrhythmia classification [RACVA10]. We observed that the distribution of records among the 14 types of rhythm was heavily imbalanced: as can be seen in Figure 1.1, normal rhythm represents almost three quarters of the entire length of the database, and the remaining quarter is split into 13 other types. We observed that our classifier tended to memorize these patterns of normal rhythm, resulting in overall high specificities and low sensitivities in several tests.

    1.2 Learning from Imbalanced Datasets

    Most learning algorithms for pattern recognition models are based on minimizing the discrepancy between known observations and the responses of the model given a certain task. Usually this discrepancy (also called loss) is measured as the empirical classification error, that is, the quotient between the number of discrepancies and the total number of available observations. We refer to these kinds of error minimization-based models as traditional algorithms, in the context of pattern recognition.

    It has been observed that the influence of the minority class on the learning process of traditional algorithms trained on highly imbalanced datasets is practically null. Classifiers tend to over-fit the majority class, achieving high accuracy overall but low accuracy on the minority (and often most important) class. Two main reasons explain why traditional classifiers fail under class imbalances: their goal is to minimize the classification error (or maximize the overall accuracy), and they assume that both training and testing data are drawn from the same distribution.


    Suppose an application where positive examples account for only 1% of all available observations, such as the case of arrhythmia classification mentioned earlier. Based on the facts stated above, a simple classifier that unconditionally issues negative responses will achieve 99% accuracy in operation (testing performance). It may seem that the classifier has performance comparable with state-of-the-art methods; however, by definition it is unable to spot any positive examples, i.e., it has zero sensitivity (0% accuracy on the positive class). Thus, the problem of class imbalances is precisely to find a way to improve the performance on the positive class without diminishing the performance on the negative class.
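    The always-negative classifier above can be reproduced in a few lines of code. The following is a minimal sketch (the labels are synthetic and numpy is assumed to be available): with roughly 1% positive labels, predicting only negatives yields about 99% accuracy and exactly zero sensitivity.

        import numpy as np

        rng = np.random.default_rng(0)
        y_true = (rng.random(10_000) < 0.01).astype(int)   # ~1% positive examples
        y_pred = np.zeros_like(y_true)                      # the classifier always answers "negative"

        accuracy = np.mean(y_pred == y_true)
        n_pos = np.sum(y_true == 1)
        sensitivity = np.sum((y_pred == 1) & (y_true == 1)) / max(n_pos, 1)
        print(accuracy, sensitivity)   # roughly 0.99 and exactly 0.0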

    However, regarding the nature of the problem, in [BPM04] the authors consider that the imbalance level is not the only factor that hinders the performance of traditional classifiers. Sample size, data overlapping, and subconcepts among classes are further factors that pose additional difficulties when learning from imbalanced datasets.

    1.3 Proposal

    Our previous experience with imbalanced data, along with the reports in the literature, encouraged us to continue investigating the problem to find more suitable solutions. Thus, in this thesis we propose a classification model that is able to address problems with such imbalances among class distributions in general. We aim to outperform related state-of-the-art algorithms in particular cases of interest, in order to provide a new approach to imbalanced classification given certain specific problem features.

    In particular, we propose a supervised binary classification algorithm consisting of an ensemble-based aggregation of data descriptors that achieves generalization ability under class distribution imbalances. The method consists of a pair of data domain descriptors, each one modeled over a different class, along with a rule-based combination scheme designed to improve the accuracy on the outlier class, hence improving the overall performance.

    We performed computer simulations and compared our proposal with several related literature approaches in terms of performance measures appropriate for class imbalances. The results show that our strategy achieves comparable performance in most instances and outperforms other approaches in some specific instances. Moreover, our method exhibits an interesting behavior regarding performance as a function of imbalance level, which could be studied in more detail in further work.

    1.4 Contributions

    This work contributes to the literature on machine learning methods and algorithms, particularly to the fields of novelty detection and imbalanced classification. Our stated contributions are the following:

    An in-depth survey on related literature methods to solve the class imbalance problem

    A new method able to properly solve instances of the class imbalance problem


    A comparative experimental study of the classification and computational performance of the proposed and related methods

    The results of this research project also contributed to the publication of two papers in international proceedings: a new related cost-sensitive approach to imbalanced classification in medical scenarios [ORV+12], and the method proposed in this thesis [RA12].

    1.5 Summary of the Chapter

    In this chapter we have introduced the problem of imbalanced classification by discussing real-world cases and stated a formal definition for it. We have also characterized classic pattern recognition algorithms that fail to appropriately solve imbalanced problems in practice. Finally, we described the big picture of our proposal and stated the contributions of this thesis to the literature.

    This first chapter covered an introduction to the problem and the proposed solution. In the next chapter we discuss basic machine learning concepts so that the reader is able to comprehend topics that are covered further in the thesis. In Chapter 3 we survey the most relevant existing methods for imbalanced classification problems. In Chapter 4 we describe in detail our proposed method along with the methodology followed to conduct the experimental study. In Chapter 5 we report and analyze the results of the comparative study and discuss their implications. Finally, in Chapter 6 we summarize this thesis, give a general discussion of the results, and conclude with our final remarks.

  • Chapter 2

    Machine Learning Overview

    Artificial intelligence is the branch of computer science that aims to develop intelligent agents for complex automated decision making. Intelligent agents have been defined as systems that perceive their environment and take actions that maximize their chances of success [RN02]. Machine learning is a discipline within this branch of computer science that aims to allow intelligent agents to evolve their behaviors based on empirical data, or experience. In this chapter we introduce some basic machine learning concepts in order to understand the more specific framework on which this work focuses. We first give a general description of what this discipline is about, then we give a detailed explanation of the particular case of pattern recognition tasks, and finally we show how the class imbalance problem introduced in Chapter 1 affects traditional machine learning algorithms.

    2.1 Definition

    Tom Mitchell defined the notion of machine learning in [Mit97]: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Thus, within this context, intelligent agents as defined before are regarded as learners.

    The objective of a learner is to determine a given phenomenon's underlying probability distribution using a set of known training examples, in order to be able to answer questions about new observations. This is often regarded as the generalization ability of a learner, or how well it explains unknown cases based on knowledge extracted from a finite number of training observations. Thus, the core purpose of a learning machine is to maximize its chances of success in unknown environments.

    Vapnik described the model of learning as a function estimation model in terms of three components and a discrepancy function [Vap99]:

    1. a generator of random vectors x, drawn independently from a fixed but unknown distribution P(x);


    2. a supervisor that returns an output vector y for every input vector x, according to a conditional distribution function P(y|x), also fixed but unknown;

    3. a learning machine capable of implementing a set of functions f(x, α), α ∈ Λ, where Λ is the set of parameters that define the family of parametric learning machines.

    Thus the problem is to find the function f that best fits the supervisor's response. The selection is based on a training set of n random independent identically distributed (i.i.d.) observations (x1, y1), ..., (xn, yn) drawn according to P(x, y) = P(x)P(y|x).

    A loss function L(y, f(x, α)) measures the discrepancy between the supervisor's and the learning machine's responses given an observation x. The goal is then to minimize the expected value of the loss, given by the risk functional

    R(\alpha) = \int L(y, f(x, \alpha)) \, dP(x, y).    (2.1)

    In order to minimize this functional when P(x, y) is unknown, an induction principle based on available training data is used to replace R(α) by the empirical risk functional

    R_{emp}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i, \alpha)).    (2.2)

    The input of an object x is typically represented as a set of d features (variables) believed to carry discriminating and characterizing information about the observation, often called a feature vector. The d-dimensional space in which these feature vectors lie is referred to as the feature space. Depending on the nature of the supervisor's output y for a given object, three main learning tasks can be defined:

    1. the regression task, where the output takes continuous values, i.e., y ∈ ℝ, and the supervisor is available,

    2. the classification task, where the output takes discrete values, or labels, and the supervisor is also available, and

    3. the clustering task, where the output takes discrete values but the supervisor is not always available.

    In general, according to the availability of the supervisor's output, three main learning frameworks are defined: supervised learning, where the supervisor is always available; unsupervised learning, where the supervisor is not available; and semi-supervised learning, a middle-ground framework where the supervisor is not always available. In this context, regression and classification can be categorized as supervised learning tasks, and clustering as an unsupervised task. Figure 2.1 illustrates the difference between supervised (A) and unsupervised (B) learning models.


    Figure 2.1: Difference between supervised (A) and unsupervised (B) learning models.

    2.2 Classification Tasks

    This work focuses on classification tasks, particularly on binary classification, in which an observation's discrete output can take one of two possible values. In the following we refer to an observation's feature vector as an instance, and to the corresponding supervisor's output as its label. We also refer to the instance-label tuple as a pattern. Therefore, we focus on the problem of pattern recognition.

    Let the supervisor's two possible output values be zero and one, i.e., y ∈ {0, 1}, and let f(x, α), α ∈ Λ, be a set of functions whose possible outputs are likewise zero or one. Consider the zero-one loss function

    L(y, f(x, \alpha)) =
    \begin{cases}
    0 & \text{if } y = f(x, \alpha) \\
    1 & \text{if } y \neq f(x, \alpha)
    \end{cases}    (2.3)

    in which case the functional (2.1) represents the learning machine's probability of misclassifying an instance, i.e., the probability that the responses of the supervisor and the learning machine differ. As a consequence, the fundamental objective of a classification learning machine is to minimize its probability of committing classification errors.
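    As an illustration, the empirical risk (2.2) under the zero-one loss (2.3) reduces to the fraction of misclassified training patterns. A minimal sketch with hypothetical labels and responses:

        import numpy as np

        def zero_one_loss(y, f):
            # loss (2.3): 0 when the supervisor and the machine agree, 1 otherwise
            return np.where(y == f, 0.0, 1.0)

        def empirical_risk(y, f):
            # R_emp (2.2): average loss over the n available observations
            return np.mean(zero_one_loss(y, f))

        y = np.array([0, 1, 1, 0, 1])   # hypothetical supervisor outputs
        f = np.array([0, 1, 0, 0, 0])   # hypothetical learning machine responses
        print(empirical_risk(y, f))     # 2 errors out of 5 -> 0.4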

    Although we have framed classification as a task with discrete outputs, classification algorithms do not necessarily output discrete values. Many classifiers, such as neural networks, naturally yield a probability or score: a continuous value that represents the degree to which an instance is a member of a class. In general, for classification tasks with c classes, the output of a classifier can be denoted by y(x) = [y_0(x), ..., y_{c-1}(x)], where the components y_i(x) are (estimates of) the posterior probabilities for classes 0, ..., c-1 given x, i.e., the probability that x is a member of class i. Thus, three types of classifier can be defined [BKKP99]:

    1. Crisp classifier: y_i(x) ∈ {0, 1}, Σ_i y_i(x) = 1, ∀x ∈ ℝ^d

    2. Probabilistic classifier: y_i(x) ∈ [0, 1], Σ_i y_i(x) = 1, ∀x ∈ ℝ^d

    3. Possibilistic classifier: y_i(x) ∈ [0, 1], Σ_i y_i(x) > 0, ∀x ∈ ℝ^d.

    In general, a classifier is a function

    D : \mathbb{R}^d \rightarrow [0, 1]^c    (2.4)


    where Ω denotes the set of possible labels. In this work we use Ω = {0, 1} for convenience. A function needs to be defined in order to translate outputs, whether crisp, probabilistic, or possibilistic, into a discrete label. Note that the most general type of output is the possibilistic one, as it can be normalized to probabilistic and, in turn, hardened to crisp. In the following we refer to patterns labeled with 0 as negative examples, and to those labeled with 1 as positive examples.
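    The relation between the three output types can be sketched as follows: a possibilistic output is normalized so that its components sum to one (probabilistic) and then hardened to a crisp label by taking the largest component. The scores below are hypothetical.

        import numpy as np

        scores = np.array([0.2, 0.9, 0.4])             # possibilistic outputs y_i(x) for c = 3 classes
        probs = scores / scores.sum()                   # normalized: components now sum to 1
        crisp = np.eye(len(scores))[np.argmax(probs)]   # hardened: one-hot vector of the winner
        label = int(np.argmax(probs))                   # discrete label finally assigned to x
        print(probs, crisp, label)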

    2.3 Classifier Performance

    Assessing the performance of a classifier is application-dependent. In medical diagnosis, for example, tests are often required not to miss ill patients, because doing so could have fatal consequences. In high-frequency data processing, on the other hand, it may be necessary to minimize false alarms that could lead to system overloads. These examples are particular cases of a more general framework for measuring the performance of a classifier given a certain task.

    Given a classifier D and a set of instances S, a confusion matrix (also called a contingency table) C_{D,S} := (c_{i,j})_{2×2} is used to display information about the four possible outcomes of D given S:

    1. c1,1: positive examples correctly classified, or true positives (TP),

    2. c1,2: positive examples incorrectly classified as negatives, or false negatives (FN),

    3. c2,2: negative examples correctly classified, or true negatives (TN), and

    4. c2,1: negative examples incorrectly classified as positives, or false positives (FP).

    In matrix form:

    C =
    \begin{pmatrix}
    TP & FN \\
    FP & TN
    \end{pmatrix}.    (2.5)

    Note that S could be the same set of instances used for training the learning machine, also known as the training set, in which case C contains information about how well the model fits known data, or training performance. But S could also be a different, unknown set of instances, known as the testing set, in which case C will contain more interesting information, as testing performance measures the generalization ability of the classifier.

    According to this setting, a classifier can make two types of error: misclassifying a negative example, known as a type I error or false positive, and misclassifying a positive example, known as a type II error or false negative.

    In terms of (2.5), the overall classification error introduced in Section 2.2 can be expressed as

    \text{Error} = \frac{FP + FN}{TP + TN + FP + FN}.    (2.6)


    Figure 2.2: A depiction of the given example and its relation to the confusion matrix values. Ticked boxes represent relevant documents, whereas crossed boxes represent irrelevant documents. Enclosed documents represent the documents retrieved by a given system, which has 3/4 precision and 3/7 recall.

    Thus, the overall classification accuracy can be expressed as

    \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = 1 - \text{Error}.    (2.7)

    Further performance measures that take into account the different types of error can be derived from the confusion matrix, each one with a specific use within an application domain. In the following we explain the most commonly used ones, along with their interpretations within their respective fields.

    2.3.1 Precision and Recall

    Precision and recall are two measures commonly used for assessing the performance of information retrieval systems, such as web search engines. Precision measures the exactness of the system, as it reports the relevance of the retrieved instances, whilst recall measures its completeness, as it indicates the fraction of all relevant instances that were retrieved.

    Consider, for example, a search engine that looks through 10 documents, 7 of which are known to be relevant to a given query. If the system retrieves 4 documents in total, 3 of which turn out to be relevant, then its precision is 3/4 and its recall is 3/7. In this case the search engine is arguably exact, as 3 out of 4 returned documents were relevant, but it lacks completeness, as it only returned 3 of the 7 relevant documents. Figure 2.2 depicts this example and its relationship with the values of the confusion matrix.

    In terms of (2.5) precision and recall can be expressed as

    \text{Precision} = \frac{TP}{TP + FP}    (2.8)

    and

    \text{Recall} = \frac{TP}{TP + FN}.    (2.9)
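    The search engine example above can be checked directly from (2.8) and (2.9); with 4 retrieved documents of which 3 are relevant, and 7 relevant documents in total, TP = 3, FP = 1 and FN = 4:

        def precision(tp, fp):
            return tp / (tp + fp)    # (2.8)

        def recall(tp, fn):
            return tp / (tp + fn)    # (2.9)

        print(precision(3, 1))   # 0.75      (3/4)
        print(recall(3, 4))      # 0.4285... (3/7)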

    2.3.2 Sensitivity and Specificity

    Sensitivity and specificity are two measures generally used in fields where one type of error is costlier than the other, so performance is analyzed separately in terms of the two types of error.


    The sensitivity of a test measures its ability to detect positive instances, while its specificity measures its ability to detect negative examples. In simpler terms, sensitivity and specificity are measures of the accuracy on the positive and the negative class, respectively.

    In terms of (2.5) sensitivity and specificity are given by

    \text{Sensitivity} = \frac{TP}{TP + FN}    (2.10)

    \text{Specificity} = \frac{TN}{TN + FP}    (2.11)

    For a given classifier, there is often a trade-off between these measures: high sensitivity usually means low specificity and vice versa. In medicine, for example, diagnostic tests should be highly sensitive, so that they are unlikely to miss a disease; however, this often comes at the cost of firing too many false alarms, i.e., having low specificity. Other applications may need the complete opposite; for example, brain-computer interface systems receive hundreds of signals per second, hence a low rate of false positives may be required when searching for a pattern in order to avoid system overloads.

    It should be noted that neither of these measures should be used independently for measuring the performance of a classifier; they only make sense when used together. Consider, for example, a dummy medical test that unconditionally issues positive diagnoses for any given observation: it will achieve perfect sensitivity, but it is useless. It should also be noted that expressions (2.9) and (2.10) are equivalent but have slightly different interpretations according to their respective application domains.
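    The trade-off between the two measures can be illustrated by sweeping the decision threshold of a scoring classifier: lowering the threshold raises sensitivity and lowers specificity. The scores below are synthetic and the sketch only assumes numpy.

        import numpy as np

        def sensitivity(tp, fn):
            return tp / (tp + fn)    # (2.10)

        def specificity(tn, fp):
            return tn / (tn + fp)    # (2.11)

        rng = np.random.default_rng(1)
        y = np.concatenate([np.ones(50), np.zeros(950)])                            # 5% positive labels
        s = np.concatenate([rng.normal(1.0, 1.0, 50), rng.normal(0.0, 1.0, 950)])   # synthetic scores

        for thr in (-0.5, 0.5, 1.5):
            pred = (s >= thr).astype(int)
            tp = np.sum((pred == 1) & (y == 1)); fn = np.sum((pred == 0) & (y == 1))
            tn = np.sum((pred == 0) & (y == 0)); fp = np.sum((pred == 1) & (y == 0))
            print(thr, round(sensitivity(tp, fn), 2), round(specificity(tn, fp), 2))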

    2.4 Traditional Classifiers Under Class Imbalances

    In the previous chapter we defined the problem of imbalanced class distributions as a difficulty that affects traditional pattern recognition algorithms when learning from datasets of this nature. By Occam's razor, traditional learning algorithms try to output the simplest hypothesis that best explains a set of training data. Under extremely imbalanced data, the simplest hypothesis is that all samples are negative.

    Many traditional pattern recognition algorithms follow this principle, such as the C4.5 algorithm, Artificial Neural Networks (ANN), and Support Vector Machines (SVM). The authors of [DBFS91] observed that a feed-forward neural network trained on an imbalanced dataset may not learn to discriminate sufficiently between classes.

    2.5 Summary of the Chapter

    In this chapter we have covered a considerable number of machine learning and pattern recognition concepts needed to understand the following contents of this thesis. We defined the different tasks of machine learning, presented the most common measures to assess the performance of a learning machine, and discussed the reasons why traditional algorithms fail to provide appropriate solutions for imbalanced problems.

  • Chapter 3

    State of the Art

    The literature on class imbalanced learning exhibits a considerable number of methods and algorithms tailored to overcome the class imbalance problem. Most of these algorithms are modified versions of traditional error-minimization models enabled to handle imbalances in class distributions by means of a variety of mechanisms. Depending on the level at which they operate, two major research trends can be identified: external (or data-level) approaches, and internal (or algorithmic-level) approaches. In this chapter we survey the most relevant state-of-the-art pattern recognition algorithms proposed in the literature for class imbalanced problems at both levels.

    3.1 External Approaches

    At the data level, the proposed solutions are mainly different forms of data resampling, such as random resampling, focused sampling, sampling with synthetic data, and combinations of the above. This edition of the dataset aims to balance its class distribution in order to be able to perform classification using traditional algorithms with an appropriately balanced training set of examples, hence avoiding the problem instead of solving it.

    3.1.1 Resampling

    Resampling is the simplest and most intuitive technique to face imbalanced classification problems. In simple words, it works by modifying the training set in such a way that the class distribution becomes balanced. This can be achieved in several ways: by randomly deleting majority class samples, by randomly replicating minority class samples, or by applying a combination of both approaches. In the following we describe the two most popular and simple resampling strategies.


    3.1.1.1 Random Over-sampling

    Random over-sampling is the action of replicating randomly chosen objects of a given sample. In our particular case the technique is used to balance the class distribution of an imbalanced dataset, i.e., to match the number of samples of both classes, by randomly replicating samples of the minority class until the dataset is balanced. The artificially balanced dataset is then used to train a traditional pattern recognition algorithm and perform standard classification. This procedure is described in Algorithm 1.

    Algorithm 1 Random Over-sampling

    1: Let D be a training set of n1 majority class samples and n2 minority class samples
    2: Choose at random n1 − n2 minority class samples with replacement and append them to D

    3: Train a traditional error-minimization classifier

    One of the possible drawbacks of random over-sampling is that, as it increases the size of the training set, it could also increase the amount of computational resources required for training the classifier. Additionally, this method increases the risk that the base classifier over-fits the replicated objects.
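    A minimal Python sketch of Algorithm 1 (the labels are assumed to be 0 for the majority class and 1 for the minority class; only numpy is assumed):

        import numpy as np

        def random_oversample(X, y, seed=0):
            # Replicate randomly chosen minority samples (label 1) until both classes match in size.
            rng = np.random.default_rng(seed)
            idx_min = np.flatnonzero(y == 1)
            idx_maj = np.flatnonzero(y == 0)
            extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
            keep = np.concatenate([idx_maj, idx_min, extra])
            return X[keep], y[keep]   # balanced set, ready to train a traditional classifier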

    3.1.1.2 Random Under-sampling

    Random under-sampling is the action of deleting randomly chosen objects of a given sample. Similar to the random over-sampling technique, this procedure is used to balance a dataset by randomly removing samples of the majority class until the dataset becomes balanced. Then, the balanced dataset is used to train a traditional pattern recognition algorithm to perform standard classification. The procedure is described in Algorithm 2.

    Algorithm 2 Random Under-sampling

    1: Let D be a training set of n1 majority class samples and n2 minority class samples
    2: Choose at random n1 − n2 majority class samples and remove them from D
    3: Train a traditional error-minimization classifier

    One of the disadvantages of random under-sampling is that it is prone to discarding training observations that could be critical for decision making, risking a decrease in generalization performance.
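    The corresponding sketch of Algorithm 2 (same labeling convention as before) removes majority samples instead of replicating minority ones:

        import numpy as np

        def random_undersample(X, y, seed=0):
            # Discard randomly chosen majority samples (label 0) until both classes match in size.
            rng = np.random.default_rng(seed)
            idx_min = np.flatnonzero(y == 1)
            idx_maj = np.flatnonzero(y == 0)
            keep_maj = rng.choice(idx_maj, size=len(idx_min), replace=False)
            keep = np.concatenate([keep_maj, idx_min])
            return X[keep], y[keep]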

    A mixture of random over-sampling and under-sampling is proposed in [LLL98]; however, the approach does not provide a statistically significant improvement over either method separately, according to the authors themselves.

    3.1.2 Complex Data Edition

    Extensions of data edition methods have also been proposed to improve existing plain resampling techniques, mainly in the form of intelligent criteria employed to select or synthesize objects for training.


    For example, in [KM97] the authors propose a one-sided sampling strategy that under-samples the majority class by removing from the training set the least reliable objects for classification. The selection is performed according to three criteria: objects with high class-label noise, borderline objects (objects close to the boundary between classes in feature space), and redundant objects. Borderline samples are commonly detected using Tomek Links [Tom76].

    3.1.2.1 SMOTE

    SMOTE is the acronym for Synthetic Minority Over-sampling Technique, a data edition method proposed in 2002 by Chawla et al. that synthesizes new minority class samples by interpolating existing neighboring objects in feature space. Similar to the resampling methods discussed in the previous sections, the SMOTE algorithm is used to synthesize a number of minority class samples so that the training dataset becomes balanced before being fed to a traditional error-minimization pattern recognition algorithm. The procedure is detailed in Algorithm 3.

    Algorithm 3 SMOTE

    1: Input parameters: k: number of nearest neighbors
    2: Let D be a training set of n1 majority class samples and n2 minority class samples, d the dimensionality of the data, xij the value of attribute j of sample xi, and s the synthesized samples
    3: for i from 1 to (n1 − n2) do
    4:     Randomly choose one of the k nearest neighbors of a given minority class sample xl in D, xn
    5:     for j from 1 to d do
    6:         δ ← xnj − xlj
    7:         g ← random number between 0 and 1
    8:         sij ← xlj + δ · g
    9:     end for

    10: end for

    11: Train a traditional error-minimization classifier

    The main advantage of SMOTE over plain resampling is that it avoids the risk of over-fitting replicated instances, as it synthesizes new ones for training. On the other hand, the algorithm is prone to performing interpolation using misleading neighbors, as it only considers minority class samples to choose from. Moreover, the algorithm needs to be fed with an additional parameter that is problem-dependent and should be tuned to achieve proper performance.
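    The interpolation step of Algorithm 3 can be sketched as follows (a simplified version that recomputes neighbors for every synthetic sample and assumes the minority set has more than k objects):

        import numpy as np

        def smote(X_min, n_new, k=5, seed=0):
            # Synthesize n_new samples by interpolating each chosen minority sample
            # with one of its k nearest minority-class neighbors.
            rng = np.random.default_rng(seed)
            synth = np.empty((n_new, X_min.shape[1]))
            for s in range(n_new):
                i = rng.integers(len(X_min))
                dist = np.linalg.norm(X_min - X_min[i], axis=1)
                neighbors = np.argsort(dist)[1:k + 1]        # skip the sample itself
                j = rng.choice(neighbors)
                g = rng.random()                             # interpolation factor in [0, 1)
                synth[s] = X_min[i] + g * (X_min[j] - X_min[i])
            return synth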

    Two variations of SMOTE are proposed in [HWM05]. They basically consist in a straightforward application of the SMOTE algorithm to previously detected borderline samples of a given dataset. The main difference between the two variations is that one considers both positive and negative examples for neighborhood interpolation when synthesizing, whilst the other one only considers positive examples, similar to the original setting. The authors claim that both techniques achieve better classification accuracy than the original version.


    Another variation is proposed in [BSL09]. It synthesizes minority class samples taking into account the presence of nearby majority class instances, defining a safe level, which is ignored by the previous techniques. The authors claim to outperform all aforementioned versions of SMOTE.

    It should be noted that the objective of these advanced resampling strategies is to select a set of training objects to be edited (deleted, replicated, synthesized) in order to maximize the performance of the classifier when trained with the edited observations. From the point of view of optimization, evolutionary algorithms have also contributed solutions to imbalanced classification. An example is the evolutionary prototype selection technique [GFH06], which uses a genetic algorithm to perform the optimal edition of the training set for classification.

    3.2 Internal Approaches

    We have seen that external approaches are a variety of methods that alter imbalanced training datasets so that traditional error-minimization algorithms will perform reasonably well. An alternative way to address imbalanced classification is to design imbalance-insensitive algorithms able to counter the effect of the majority class without the need to modify the training set. Methods that fall in this category are regarded as internal or algorithmic-level approaches.

    Three machine learning sub-disciplines have made contributions within this category: ensemble learning, cost-sensitive learning, and one-class learning. In the following we cover the most relevant works within each category.

    3.2.1 Ensemble Learning

    Ensembles of learning machines were originally proposed in [Nil65] and are based on the intuitive idea that many opinions are often more useful than only one in decision making. In this sense, it is assumed that the aggregated decision of a committee of classifiers will outperform its average member. Formally, ensemble algorithms consist in the aggregation of a set of local decisions {d1, ..., dL} to generate a function D by means of a linear combination (not necessarily convex) of the local contributions:

    D = \sum_{i=1}^{L} \lambda_i d_i    (3.1)

    It is expected that, provided an appropriate design of the aggregation function, the previous assumption will hold, i.e., D will outperform the average of the local predictors d_i, under the assumption that they are good enough explainers of some domain of the problem. This property, commonly known as diversity, is critical for an ensemble of machines to work properly.
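    A toy illustration of (3.1): three hypothetical local decisions combined with fixed (hypothetical) weights and then hardened into a label.

        import numpy as np

        local_decisions = np.array([0.9, 0.4, 0.7])   # hypothetical d_i(x), scores in [0, 1]
        weights = np.array([0.5, 0.2, 0.3])           # hypothetical combination weights

        D = np.dot(weights, local_decisions)          # linear combination of the local contributions
        label = int(D >= 0.5)                         # harden the aggregated score into a class label
        print(D, label)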

    The literature exhibits a considerable variety of ensemble designs; however, two of them are widely used in several fields: Bagging [Bre96] and Boosting [FS95]. Bagging consists in training several classifiers with different bootstrap samples of the available training data.


    On the other hand, boosting adaptively adjusts the sampling weights of training examples according to the performance in previous iterations, hence focusing on objects that are hard to classify.

    Several authors have also combined the concept of ensemble learning with existing techniques for solving class imbalances. A considerable number of new hybrid methods have benefited from aggregating multiple decisions, with encouraging results [Li07; CLHB03; FSZC99].

    3.2.1.1 Balanced Bagging

    In [Li07] the authors aim to address imbalanced classification by partitioning the original imbalanced problem into several smaller balanced sub-problems using a variation of Bagging. The method consists in building a number of base classifiers fed with all available minority class instances and a random sample without replacement of majority class samples, so that the training subsets are balanced. By doing so, the members of the ensemble are trained using all available data. In operation, the final decision for a new observation is the majority vote of the members of the ensemble. Algorithm 4 gives a detailed explanation of the procedure.

    Algorithm 4 Balanced Bagging

    1: Let D be a training set of n1 majority class samples and n2 minority class samples
    2: Split the majority class training set into N = n1/n2 disjoint subsets of n2 samples
    3: for i from 1 to N do
    4:     Train a traditional error-minimization classifier using all available minority class samples and subset i of the split majority class samples
    5: end for

    There are two clear advantages of Balanced Bagging over resampling methods. One is that no majority class samples are discarded from the training set, so there is no risk of losing important information. Also, the training subsets of each classification unit have no replicated elements, hence there is no risk of over-fitting as in random over-sampling.

    A major drawback of such an ensemble method is that the diversity is introduced at the data level, so there is no guarantee that the base classifiers will be diverse for a given problem. Other strategies for partitioning the majority class data besides random sampling could be explored in order to ensure diversity.
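    A sketch of Algorithm 4, using a decision tree as a stand-in for the traditional base classifier (the algorithm does not fix a particular base learner); labels are assumed to be 0 for the majority class and 1 for the minority class, and scikit-learn is assumed to be available.

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        def balanced_bagging_fit(X, y, seed=0):
            # Train one base classifier per disjoint majority-class subset plus all minority samples.
            rng = np.random.default_rng(seed)
            idx_min = np.flatnonzero(y == 1)
            idx_maj = rng.permutation(np.flatnonzero(y == 0))
            n_parts = max(len(idx_maj) // len(idx_min), 1)
            members = []
            for part in np.array_split(idx_maj, n_parts):
                idx = np.concatenate([part, idx_min])
                members.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
            return members

        def balanced_bagging_predict(members, X):
            votes = np.array([m.predict(X) for m in members])
            return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote of the ensemble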

    There is a considerable number of other ensemble-based methods for imbalanced classification, but only a few are worth mentioning. SMOTEBoost [CLHB03] is a combination of the SMOTE algorithm previously described in Section 3.1.2.1 and AdaBoost. Similarly to SMOTE, in [CHG10] a ranked minority class over-sampling technique is introduced and used adaptively with a modification of AdaBoost to produce the RAMOBoost algorithm. Finally, the SHRINK algorithm [KHM98] uses an aggregation strategy based on a set of nested tests, which are evaluated and weighted by an appropriate performance measure for decision making.


    3.2.2 Cost-Sensitive Learning

    Class imbalances are very common in applications where misclassification costs are asymmetric. Consider, for example, computer-assisted diagnosis of diseases, where the cost of false negatives is much greater than that of false positives, as we previously covered in Section 2.3. Cost-sensitive learning integrates asymmetric misclassification costs in order to compensate for class imbalances during training. It has been reported that mildly imbalanced problems can be accurately solved simply by integrating asymmetric misclassification costs in the loss function at training time.

    Because of this natural simultaneous occurrence of both phenomena (class imbalance and cost asymmetry), it is said that cost-sensitive learning methods can solve the problem of imbalanced classification. In [ZL06] the authors show empirically that mildly imbalanced problems can be accurately solved simply by integrating asymmetric misclassification costs into a learning algorithm.
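    The simplest way to integrate asymmetric costs is to weight each type of error in the empirical loss; the cost values used below are purely illustrative.

        import numpy as np

        def cost_weighted_error(y_true, y_pred, c_fn=10.0, c_fp=1.0):
            # Empirical loss where a false negative costs c_fn and a false positive costs c_fp.
            fn = (y_true == 1) & (y_pred == 0)
            fp = (y_true == 0) & (y_pred == 1)
            return np.mean(c_fn * fn + c_fp * fp)

        y_true = np.array([1, 0, 0, 1, 0])
        y_pred = np.array([0, 0, 1, 1, 0])
        print(cost_weighted_error(y_true, y_pred))   # (10 + 1) / 5 = 2.2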

    Similarly to ensemble learning, a considerable number of cost-sensitive algorithms for imbalanced classification have been proposed.

    3.2.2.1 AdaCost

    AdaCost [FSZC99] is a variation of the AdaBoost algorithm that includes an adjustment term in the updating rule of the sampling distribution, based on the asymmetry of misclassification costs. The authors claim that with adequate parameter selection the algorithm is able to improve the overall classification performance. The training procedure of AdaCost is described in detail in Algorithm 5.

    Algorithm 5 AdaCost

    1: Input parameters: T: number of iterations, ci: misclassification cost of instance i, C: cost adjustment constant
    2: Initialize D1(i) = ci / Σj cj, ∀i
    3: for t from 1 to T do
    4:     Train a weak learner using distribution Dt
    5:     Compute a weak hypothesis ht : X → ℝ
    6:     αt ← (1/2) ln((1 + r)/(1 − r)), where r = Σi Dt(i) yi ht(xi) β(i) and β(i) = sign(yi ht(xi)) ci + C, ∀i
    7:     Update Dt+1(i) ← Dt(i) exp(−αt yi ht(xi) β(i)) / Zt, where Zt is a normalization factor so that Dt+1 remains a distribution
    8: end for
    9: Output the final hypothesis: H(x) = sign(f(x)), where f(x) = Σt=1..T αt ht(x)

    However, misclassification costs are not always available and are often defined arbitrarily. We know that false negatives are costlier than false positives in a medical scenario, but the cost ratio is not always quantified.


    Since misclassification costs are at the very heart of cost-sensitive learning, we believe that their scarce availability for a given problem is a major drawback to their application.

    3.2.3 One-Class Learning

    All aforementioned classification methods are based on discrimination, i.e., their learning algorithms seek to fit a hyperplane between classes in feature space that minimizes the classification error, extracting information from the observations of both classes. On the other hand, one-class learning (also called recognition-based learning) performs classification by tracing a hypersphere around one class of data (namely the target class) and labeling everything that falls outside it as outliers. The approach has recently been used with success for solving class imbalance problems on extremely imbalanced datasets [FSZC99]. Figure 3.1 depicts the fundamental difference between both paradigms.

    Contrary to discrimination-based classification, recognition-based methods assume the existence of only one class for training. However, in class imbalanced learning, although scarce, minority class samples are available for training. In this sense, the information of the class not being modeled (whether the majority or the minority) could be included in the learning procedure of one-class approaches to further refine the decision boundary obtained with the modeled class. The algorithm proposed in [TD04] takes advantage of this idea. We cover this algorithm in more detail later in this chapter.

    Now, in a two-class scenario, which of these classes should be regarded as the target class for training remains an open question. According to [RG97], when nothing about the minority class distribution can be assumed (or if an insufficient number of examples is available, as in our particular case), only a description of the boundary of the majority class can be accurately estimated. We address this issue later on in Chapter 4.

    There are mainly two approaches for building one-class classifiers: density estimation methods and boundary methods. Density estimation methods consist in estimating the underlying density function of the available data and assigning a rejection threshold for outlying objects. A common choice of probability model is the Gaussian function [Bis96]; however, it has been shown that a single normal distribution does not provide a flexible enough model to achieve good generalization.

    Figure 3.1: Depiction of the difference between discrimination-based (a) and recognition-based (b) classification.


    Hence, mixtures of Gaussians have been introduced to improve the performance in operation [DH73]. Moreover, kernels have been used with density estimation methods to achieve further flexibility [Par62].

    A drawback of density estimation methods is that in order to fit these models with acceptable likelihood, a considerable number of samples should be available for training, which is not always the case. Boundary methods were proposed as an alternative to overcome this situation, given that they only require borderline data to describe a class. In this thesis we focus on the latter approach to one-class classification, and in the following we review a considerable number of related works.

    3.2.3.1 K-centers

A very simple boundary method is the k-centers algorithm [YYD98]. Similar to the k-means clustering algorithm [Bis96], this method builds a boundary around the data by means of k small hyperspheres centered at training observations, also called support objects. The area encircled by a support object is regarded as its receptive field.

    In simple terms, the method consists in choosing a support set J of k objects from thetraining set and the minimum radius r such that all training objects belong to any receptivefield. The search is performed with several trials of randomly chosen centers and individuallyimproved by swapping between the support objects and the best choice for it within theirrespective receptive fields. The best remaining subset J over all trials is reported as thesolution. The size of the support set k is optimized by means of a successive approximationscheme in which support objects are increased from 1 to (at most) the number of trainingsamples, adding in each step the farther observation from the current support objects.
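
The following sketch (illustrative code of our own, with hypothetical names) captures the core of the k-centers idea for a fixed k; the swap-based refinement and the successive approximation over k described above are simplified here to plain random restarts.

```python
import numpy as np

def k_centers(X, k, n_trials=50, rng=None):
    """Pick k support objects from X (by random restarts, a simplification of
    the swap-based search) minimizing the radius r needed so that every
    training object lies in the receptive field of some support object."""
    rng = rng if rng is not None else np.random.default_rng(0)
    best_centers, best_radius = None, np.inf
    for _ in range(n_trials):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        # distance of every training object to its nearest candidate center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
        if dists.max() < best_radius:
            best_centers, best_radius = centers, dists.max()
    return best_centers, best_radius

def accepts(z, centers, radius):
    """A test object is a target if it falls inside any receptive field."""
    return np.linalg.norm(centers - z, axis=1).min() <= radius

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
centers, r = k_centers(X, k=5)
print(accepts(np.array([0.0, 0.0]), centers, r))    # True: inside the description
print(accepts(np.array([10.0, 10.0]), centers, r))  # False: far from all centers
```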

This algorithm risks over-fitting when k approaches the size of the training set; therefore, the convergence criterion should be chosen carefully. Moreover, the algorithm does not consider the presence of outlying observations, which could also lead to over-fitting by design.

Figure 3.2: Depiction of two SVDD models, one trained using target data only, or standard SVDD (a), and another additionally using known outlier objects, or tightened SVDD (b).


    3.2.3.2 Support Vector Domain Description

The Support Vector Domain Description (SVDD), proposed in 1999 by David Tax and Robert Duin [TD99], aims to describe an optimal hypersphere around a given set of target objects based on the structural risk minimization principle of Vapnik's Support Vector Machine (SVM) [Vap99]. Unlike most one-class methods, such as the previously discussed algorithm, the SVDD can be trained including known outlier objects to further tighten its decision boundary and improve its performance, as can be seen in Figure 3.2. The hypersphere is characterized by a center a and a radius R, obtained by solving

$$\min_{R,\,a} \; F(R, a) = R^2 + C \sum_i \xi_i$$
$$\text{s.t.} \quad \|x_i - a\|^2 \le R^2 + \xi_i, \qquad \xi_i \ge 0 \quad \forall i.$$

This problem can be solved by maximizing the Lagrangian L with respect to the multipliers α using a standard quadratic programming solver:

$$\max_{\alpha} \; L = \sum_i \alpha_i \, (x_i \cdot x_i) - \sum_{i,j} \alpha_i \alpha_j \, (x_i \cdot x_j)$$
$$\text{s.t.} \quad \sum_i \alpha_i = 1, \qquad a = \sum_i \alpha_i x_i, \qquad \gamma_i = C - \alpha_i, \quad 0 \le \alpha_i \le C \quad \forall i. \tag{3.2}$$

The inner product (x_i · x_j) can be generalized by a kernel function k(x, y) = Φ(x) · Φ(y), where Φ is a mapping of the data to a higher-dimensional space in which the fit of the hypersphere may be improved. With such a mapping, problem (3.2) becomes

$$\max_{\alpha} \; L = \sum_i \alpha_i \, k(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j \, k(x_i, x_j) \tag{3.3}$$
$$\text{s.t.} \quad \sum_i \alpha_i = 1 \tag{3.4}$$
$$a = \sum_i \alpha_i \, \Phi(x_i) \tag{3.5}$$
$$\gamma_i = C - \alpha_i, \quad 0 \le \alpha_i \le C \quad \forall i. \tag{3.6}$$

Given the optimal values of α, the center of the hypersphere a and the slack errors ξ_i can be calculated using restrictions (3.5) and (3.6). The radius R is defined as the distance from the center to the support vectors on the boundary.

Thus, a test object z is accepted if its squared distance to the center, ‖z − a‖², is smaller than or equal to R²:

$$f(z) = I\left(\|z - a\|^2 \le R^2\right) = I\left(k(z, z) - 2 \sum_i \alpha_i \, k(z, x_i) + \sum_{i,j} \alpha_i \alpha_j \, k(x_i, x_j) \le R^2\right),$$


where I is an indicator function defined as

$$I(A) = \begin{cases} \text{target} & \text{if } A \text{ is true} \\ \text{outlier} & \text{otherwise.} \end{cases} \tag{3.7}$$
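
As an illustration, the sketch below evaluates decision rule (3.7) with a Gaussian kernel, assuming the multipliers α and the squared radius R² have already been obtained from a quadratic programming solver; the helper names are ours and not part of [TD99].

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / sigma^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def svdd_decision(Z, X_train, alpha, R2, sigma=1.0):
    """Evaluate rule (3.7): accept z as a target when
    k(z,z) - 2*sum_i alpha_i k(z,x_i) + sum_ij alpha_i alpha_j k(x_i,x_j) <= R^2."""
    Kzz = np.ones(len(Z))                      # k(z, z) = 1 for the Gaussian kernel
    Kzx = rbf_kernel(Z, X_train, sigma)        # k(z, x_i)
    Kxx = rbf_kernel(X_train, X_train, sigma)  # k(x_i, x_j)
    dist2 = Kzz - 2 * Kzx @ alpha + alpha @ Kxx @ alpha
    return np.where(dist2 <= R2, 'target', 'outlier')
```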

In [GCT09] an improved boundary for the SVDD is proposed. It allows more objects to be fit into the description and accepted as targets. The method widens the boundary obtained by the SVDD according to how close the enclosed target objects are to it, avoiding over-fitting: the closer they are to the boundary, the wider the new boundary will be, and the more new objects will be accepted. Thus, an object is considered an outlier only if both of the following conditions are violated:

    1. It is enclosed by the SVDD boundary.

2. The ratio of its distance to its nearest boundary point, to the average distance of all enclosed objects to their boundary points, is not greater than a given decision threshold.

Thus, objects which are accepted by the SVDD boundary are also accepted by the improved boundary, whereas objects which are rejected by the SVDD boundary will not necessarily be rejected by the proposed decision boundary. Algorithm 6 shows the improved rule-based boundary.

Algorithm 6 Support Vector Data Description with Improved Boundary (ISVDD)

1: Let M be a trained SVDD model over the target class, D the average distance between the enclosed target training objects and the boundary of M, T a user-defined threshold, z a testing object and d(z) the distance of the testing object to the boundary
2: if M accepts z or d(z)/D ≤ T then
3:     Y(z) ← target
4: else
5:     Y(z) ← outlier
6: end if
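
In code, the decision rule of Algorithm 6 reduces to a single disjunction; the sketch below assumes a trained descriptor exposing hypothetical accepts(z) and boundary_distance(z) operations.

```python
def isvdd_label(svdd, z, D, T):
    """Improved-boundary rule of Algorithm 6: accept z if the SVDD accepts it,
    or if its distance to the boundary, relative to the average distance D of
    the enclosed training targets, does not exceed the threshold T."""
    if svdd.accepts(z) or svdd.boundary_distance(z) / D <= T:
        return 'target'
    return 'outlier'
```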

This improved boundary, however, does not consider the contrary case, in which outlier objects are incorrectly accepted by the description. In Chapter 4 we expand on this idea, as it represents the core goal of our proposal.

The literature reports that recognition-based classification models generally outperform discriminant approaches when working with high-dimensional and extremely imbalanced data [RK04].

    3.3 Appropriate Performance Measures

In Section 2.3 we covered a considerable number of standard metrics used to measure the performance of pattern recognition algorithms, with their respective interpretations within an application context: for example, the trade-off between sensitivity and specificity in a medical diagnosis scenario, and the relation between precision and recall in information retrieval systems.

Figure 3.3: G-mean (surface) as a function of sensitivity and specificity.

In this section we review additional performance measures proposed in the literature that aim to provide a better representation of a classifier's performance under class imbalance than the traditional metrics mentioned above. We therefore regard these metrics as appropriate measures.

In general, any metric that uses values from both rows of the confusion matrix (2.5) simultaneously will be inherently sensitive to class imbalances [Faw06], as the class distribution of the dataset is the proportion of the row-wise sums. Measures such as those discussed in Section 2.3 are based on values from both rows of the confusion matrix and will hereafter be regarded as inappropriate measures for imbalanced classification.

    3.3.1 Geometric Mean

A single standard performance measure is unable to give a proper representation of the performance of a pattern recognition algorithm in the task of imbalanced classification. Sensitivity and specificity report the accuracies of the positive and negative classes, respectively. Hence, for a thorough performance assessment, both measures should be analyzed together. Unfortunately, such analyses are complex from a computational point of view, as both measures need to be optimized simultaneously.

The geometric mean of the sensitivity and the specificity (or simply G-mean) is a performance measure that aims to represent a balance between sensitivity and specificity. It is defined as the square root of the product of both measures:

$$G = \sqrt{\text{Sensitivity} \times \text{Specificity}} \tag{3.8}$$

The G-mean is an arguably good performance measure for imbalanced classification, as it is independent of the class distribution [KM97]. Moreover, optimizing this measure means maximizing the overall accuracy while maintaining a balance between sensitivity and specificity. Figure 3.3 shows the surface of this measure as a function of sensitivity and specificity.
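
A short sketch of Eq. (3.8) computed from confusion matrix counts (the counts below are made up for illustration); it shows why the G-mean is informative where plain accuracy is not.

```python
import numpy as np

def g_mean(tp, fn, fp, tn):
    """Geometric mean of sensitivity and specificity, Eq. (3.8)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return np.sqrt(sensitivity * specificity)

# an all-negative classifier on a 1:99 test set has high accuracy but G = 0,
# exposing that it detects no positives at all
print(round(g_mean(tp=0, fn=10, fp=0, tn=990), 3))  # 0.0
print(round(g_mean(tp=8, fn=2, fp=50, tn=940), 3))  # 0.872
```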


Figure 3.4: F-measure (surface) as a function of precision and recall for three values of β: (a) β = 0.5, (b) β = 1, (c) β = 2.

    3.3.2 F-measure

Another such representation of a pair of complementary measures, arising in the context of information retrieval, is the F-measure or F-score. It provides a weighted average of precision and recall by means of a parameter β, and is defined as the (weighted) harmonic mean between precision and recall:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} \tag{3.9}$$

In practice, this measure is generally used with β = 1 and called F1; it then averages both metrics with equal weight. For larger values of β the relative importance of recall with respect to precision increases, and the contrary holds for smaller values, as can be seen in Figure 3.4.
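
A minimal sketch of Eq. (3.9), showing how β shifts the weight between precision and recall (the sample values are arbitrary):

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure, Eq. (3.9)."""
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

p, r = 0.5, 0.9
print(round(f_beta(p, r, beta=1), 3))    # 0.643 : equal weight (F1)
print(round(f_beta(p, r, beta=2), 3))    # 0.776 : recall weighted more heavily
print(round(f_beta(p, r, beta=0.5), 3))  # 0.549 : precision weighted more heavily
```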

A parallel between information retrieval and one-class classification is that relevant documents can be considered as target objects, and non-relevant ones as outliers. With such a mapping the F-score can be adequately interpreted in the context of novelty or outlier detection, and used for performance analysis.

Note in Figures 3.3 and 3.4 that the G-mean and the F1-score share similar surfaces. Thus, a similar weighting parameter could be used for the sensitivity and specificity in the G-mean to add flexibility, in case one measure should be optimized with higher priority than the other, as occurs in cost-sensitive learning.

    3.3.3 ROC Analysis

Receiver operating characteristic (ROC) graphs are a technique for visualizing the performance of classifiers that depicts the trade-off between their true positive and false positive rates [Faw06]. They have been extensively used for evaluating medical decision-making systems over the past three decades, and have gradually been adopted by the machine learning community for similar purposes.

Two measures extracted from the confusion matrix (2.5) need to be introduced: the true positive rate (tp rate) and the false positive rate (fp rate):

$$tp_{rate} = \frac{TP}{TP + FN} \tag{3.10}$$


Figure 3.5: An example ROC plot featuring interest zones (A), (B) and (C); axes: fp rate = 1 − specificity (horizontal) versus tp rate = sensitivity (vertical).

$$fp_{rate} = \frac{FP}{TN + FP}. \tag{3.11}$$

Note that expressions (3.10), (2.9) and (2.10) are equivalent, and that (3.11) is equivalent to the complement of (2.11), i.e., fp rate = 1 − specificity.

A ROC graph displays a point for each combination of tp rate and fp rate in a two-dimensional unit space, where each point represents the performance of a discrete classifier in terms of these two measures. A series of points that resembles a curve in ROC space is often regarded as a ROC curve, and depicts the performance of a continuous classifier by means of stepped score thresholding. Instances are sorted by score and passed through a score threshold that varies from −∞ to +∞ in appropriate steps; thus, a single continuous classifier turns into as many discrete classifiers as there are samples in the testing set. Then, true positive and false positive rates are computed for each threshold (classifier) and plotted in ROC space.
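
The thresholding procedure just described can be written compactly; the sketch below (our own, with invented scores and labels) sweeps the sorted scores, returns one (fp rate, tp rate) point per threshold, and computes the area under the resulting staircase, i.e., the AUC discussed below.

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep a decision threshold over the sorted scores and return the
    (fp rate, tp rate) pair obtained at each cut-off, as described above."""
    order = np.argsort(-scores)              # sort instances by decreasing score
    labels = np.asarray(labels)[order]
    P, N = labels.sum(), (1 - labels).sum()
    tps = np.cumsum(labels)                  # positives accepted at each cut-off
    fps = np.cumsum(1 - labels)              # negatives accepted at each cut-off
    return fps / N, tps / P

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])          # 1 = positive, 0 = negative
fpr, tpr = roc_points(scores, labels)
auc = np.sum(np.diff(np.concatenate(([0.0], fpr))) * tpr)  # area under the staircase
print(np.round(fpr, 2), np.round(tpr, 2), round(auc, 3))   # AUC = 0.812
```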

Several zones in ROC space are important to note. Classifiers near the northwest corner (zone A in Figure 3.5) are the most desirable: they achieve almost perfect classification, with high true positive rates and low false positive rates. Classifiers near the northeast corner (zone B in Figure 3.5) are liberal, as they issue positive responses with weak evidence, which often results in high false positive rates. Classifiers appearing at the southwest corner (zone C in Figure 3.5) are thought of as conservative, as they issue positive responses only with strong evidence, which results in a low false positive rate but at the price of a low true positive rate as well. Classifiers below the positive diagonal perform worse than chance. Figure 3.5 summarizes this description.

Another useful measure obtained from the ROC plot is the area under the curve (AUC). It summarizes the overall performance of the classifier across all operating points, considering the true positive–false positive trade-off.

The measures used in ROC analysis do not consider values from both rows of the confusion matrix simultaneously; therefore, they are insensitive to class imbalances.

Figure 3.6: Optimized precision (surface) as a function of sensitivity and specificity for three levels of class imbalance in the form n⁺ : n⁻: (a) 0.8 : 0.2, (b) 0.5 : 0.5, (c) 0.2 : 0.8.

    3.3.4 Optimized Precision

The authors of [RP06] propose an improvement to the representation of the accuracy for imbalanced scenarios, which they (wrongly¹) term Optimized Precision (OP).

The measure is based on two terms: the classic accuracy and a relationship index that seeks to represent the level of imbalance between sensitivity and specificity:

$$OP = \text{Accuracy} - \frac{|\text{Specificity} - \text{Sensitivity}|}{\text{Specificity} + \text{Sensitivity}} \tag{3.12}$$
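
A small sketch of Eq. (3.12) from confusion matrix counts; the counts are invented to show that a degenerate all-negative classifier with 99% accuracy obtains a negative OP.

```python
def optimized_precision(tp, fn, fp, tn):
    """Optimized Precision, Eq. (3.12): accuracy penalized by the
    sensitivity/specificity imbalance index."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return accuracy - abs(spec - sens) / (spec + sens)

# the all-negative classifier on a 1:99 dataset: 99% accuracy but OP = -0.01
print(round(optimized_precision(tp=0, fn=10, fp=0, tn=990), 2))  # -0.01
```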

Figure 3.6 illustrates OP as a function of sensitivity and specificity for three levels of imbalance. We can see that the effect of sensitivity and specificity on the measure of overall performance increases as the level of imbalance grows towards the positive and negative classes, respectively.

Although OP includes an index of balanced class accuracies to measure the level of imbalance between sensitivity and specificity, it also includes an overall accuracy term, which could bias the measure and cause an improper representation of the performance for a given level of imbalance.

    3.3.5 Optimized Accuracy with Recall-Precision

Another measure, proposed in [HSMM] to properly assess the performance of an algorithm in the task of imbalanced classification, is the Optimized Accuracy with Recall-Precision (OARP). It basically seeks to exploit the benefits of accuracy, precision and recall altogether, in order to achieve stability and robustness against imbalances in class distributions.

¹The authors refer to the term precision as the percentage of overall correctly classified samples, whilst in this thesis we refer to it as the exactness or relevance of retrieved documents in the context of information systems. The concept the authors refer to in their paper is what we term accuracy in this thesis.

Figure 3.7: A sample balanced accuracy graph displaying a classifier with Sensitivity = 0.6 and Specificity = 0.8 (axes: Dominance = Sensitivity − Specificity versus G² = Sensitivity · Specificity). The gray line is the boundary of feasible classifiers.

Based on the work of [LB07], the authors propose to use a similar form of the relationship index covered in Section 3.3.4 for each class-specific precision and recall metric, in order to attain a better representation of the performance from the point of view of each class. For binary classification tasks the OARP is given by:

$$OARP = \text{Accuracy} - \frac{1}{2 \cdot 10^{c}} \left( \frac{\text{Precision}^{+} - \text{Recall}^{-}}{\text{Precision}^{+} + \text{Recall}^{-}} + \frac{\text{Precision}^{-} - \text{Recall}^{+}}{\text{Precision}^{-} + \text{Recall}^{+}} \right), \tag{3.13}$$

where Precision⁺ and Recall⁺ are the respective measures for the positive class, and Precision⁻ and Recall⁻ are the respective measures for the negative class. Moreover, since OARP may differ from the accuracy by a significant amount as a consequence of the second term, the authors recommend reducing that term to a fraction of an arbitrary power of ten by means of the parameter c.

The authors claim that an OARP-maximization-based training algorithm will generally outperform accuracy-based ones in the task of imbalanced classification. This is due to the fact that the measure favors minority class samples when the data is heavily imbalanced, and it is robust against class distribution changes.

    3.3.6 Index of Balanced Accuracy

In general, the aforementioned performance measures do not consider how dominant the accuracy of one class is over the other; hence, their individual contributions are not reported in the final measure. The authors of [GMS09] seek to overcome this representation flaw by means of a new measure that quantifies the trade-off between an unbiased metric of overall accuracy (such as G or F1) and an index of class accuracy imbalance. The latter should not be confused with class imbalance, as it refers to an imbalance in the individual class accuracies (the sensitivity–specificity trade-off). In this sense, the class with the higher accuracy rate is regarded as the dominant class.


Figure 3.8: Index of balanced accuracy (surface) as a function of sensitivity and specificity for three values of α: (a) α = 0, (b) α = 0.5, (c) α = 1.

The proposed measure, named index of balanced accuracy (IBA), is calculated as the area of a rectangular region in the balanced accuracy graph, described by a dominance measure and the squared G-mean. The dominance is defined as the signed difference between the sensitivity and the specificity, and is expected to report which class is prevalent and how significant this relationship between class accuracies is.

Similar to ROC analysis, the outcomes of a classifier correspond to a single point in the balanced accuracy graph. Thus, the IBA is computed as the area of the rectangle defined by the corner point (−1, 0) and the outcomes of the classifier (the pair dominance–G²). The authors also introduce a parameter α to weight the relevance of the dominance:

$$IBA_{\alpha} = \left(1 + \alpha \cdot (\text{Sensitivity} - \text{Specificity})\right) \cdot \text{Sensitivity} \cdot \text{Specificity} \tag{3.14}$$

Figure 3.7 illustrates a balanced accuracy graph with several interesting zones: perfect classification (A), unconditional negative classification (B), unconditional positive classification (C), and unfeasible points delimited by a gray line (D). An example classifier with Sensitivity = 0.6 and Specificity = 0.8 (E) is also displayed, along with its corresponding rectangular area defined by the axes (F).

Greater dominance means higher sensitivity relative to specificity; therefore, increasing the value of α results in overweighting the accuracy of the positive class in the index of balanced accuracy, as can be seen in Figure 3.8.

The index of balanced accuracy for the classifier depicted in Figure 3.7 is IBA₁ = 0.384. If the class accuracies of this classifier were inverted, i.e., Sensitivity = 0.8 and Specificity = 0.6, then the index would be IBA₁ = 0.576; thus, it is clear that this measure favors higher sensitivity over specificity. Moreover, it is trivial to note that if α = 0, the indices are identical in both cases, as there is no distinction between class accuracies.
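
Eq. (3.14) is straightforward to compute; the sketch below reproduces the worked example above.

```python
def iba(sensitivity, specificity, alpha=1.0):
    """Index of Balanced Accuracy, Eq. (3.14): G-mean squared weighted by the
    signed dominance of sensitivity over specificity."""
    dominance = sensitivity - specificity
    return (1 + alpha * dominance) * sensitivity * specificity

print(round(iba(0.6, 0.8, alpha=1), 3))  # 0.384, the classifier of Figure 3.7
print(round(iba(0.8, 0.6, alpha=1), 3))  # 0.576, class accuracies inverted
print(round(iba(0.6, 0.8, alpha=0), 3))  # 0.48, dominance ignored
```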

    3.4 Summary of the Chapter

In this chapter we have reviewed a considerable number of pattern recognition algorithms tailored for the task of imbalanced classification, along with several performance assessment techniques developed to attain a better representation of a classifier's capabilities in this context.


In the next chapter we address the design, evaluation and comparison of our novel solution with respect to a subset of the previously reviewed methods. Given the wide variety of proposed techniques, we define a set of criteria to perform a proper selection of methods, including both the related state-of-the-art algorithms that we compare our proposal to, and the performance measures that we employ for that purpose.


Chapter 4

    Methodology

In this chapter we detail the methodology followed to design a novel approach for imbalanced classification and to conduct an experimental study comparing its performance to that of related state-of-the-art methods, as stated in the objectives of this thesis. We begin by describing the proposed method in detail, seeking to overcome the research gaps found in the literature review in Chapter 3. Then we define criteria to perform a proper selection of the methods to be used in the comparative study. Finally, we describe the performance assessment techniques employed in this work to report our results in the next chapter.

    4.1 Proposed Method

In this thesis we propose a new method to address the class imbalance problem, based on two approaches previously reviewed in the literature: one-class learning and ensemble learning. We have termed it Dual Support Vector Domain Description (DSVDD). Our solution works by aggregating the local decisions of one-class domain descriptors fitted to each individual class, in order to further tighten the decision boundary and improve the performance on the minority class without hindering the performance on the majority class or the overall accuracy.

Our intention is to exploit the advantages of including known outlier information in the modeling of one-class classifiers, which by their very nature are designed to model a single target class. Although a few related works on boundary tightening have been proposed in the literature, this apparently remains an open research gap. Our method works in the same sense that a single SVDD trained with both classes (tightened SVDD) seeks to improve the performance of a standard SVDD trained with target class samples only, i.e., by correcting the decision on wrongly accepted outlier samples (see Figure 3.2). Even though the SVDD has an inner mechanism to perform boundary tightening, there are cases in which it yields worse classification than the standard approach due to class overlapping and further data concept complexity issues.

Thus, we propose an extension of the SVDD with improved boundary proposed in [GCT09] and previously discussed in Section 3.2.3.2 that seeks to overcome an unaddressed flaw. Our strategy consists in training two domain descriptors, one using the samples of the positive class as targets and the other using the samples of the negative class, along with a rule-based combination scheme designed to improve the accuracy on the minority class, hence improving the overall performance.

    4.1.1 Dual Domain Descriptions

We have seen in the previous chapter that the improved boundary of the SVDD aims to accept wrongly rejected objects that actually belong to the target class. However, it does not consider the contrary case, i.e., rejecting wrongly accepted objects that actually belong to the outlier class. With our method we intend to handle this case by means of an additional description fitted to the outlier class, as described above. Thus, two tightened SVDD models are trained, using positive class samples and negative class samples as their respective targets. By doing so, we attempt to improve the performance of the SVDD in the cases where its inner tightening mechanism fails to be an improvement by itself.

Note that both domain descriptors used in our method are tightened, whether by using outlier samples or target samples. However, for simplicity we will refer to them as the target-class SVDD, for the SVDD trained over the target class and tightened using outlier data, and the outlier-class SVDD, for the SVDD trained over the outlier class and tightened using target data (see Figure 4.1 for a depiction of this scheme).

In operation, with this dual-descriptor setting, four outcomes are possible: two in which the descriptors agree and two in which they disagree. A simple aggregation rule for the individual decisions could be to accept or reject an object in agreement and to reject it in disagreement. However, this rule, among other simple rules that were considered for experimentation, yields unsatisfactory results in practice. The intersection, for example, implies fewer target objects being accepted, which means low sensitivity if positive samples are being treated as targets. The union, as another example, implies fewer outlier objects being rejected, which means low specificity in the same scenario. Low class-specific accuracies yield low overall performance if the performance measure being used is appropriate for imbalanced classification, as we saw in Chapter 3; therefore, a more suitable aggregation scheme is required.

Figure 4.1: A depiction of the tightened target-class SVDD (a) and the tightened outlier-class SVDD (b). Note that in this example the minority class is considered as the target class, but this may not always be the case.

    4.1.2 Nested Aggregation Rule

To combine the decisions of the two descriptions we designed a rule that seeks to improve the performance on the outlier class while complying with the restrictions of the extended receptive field of the SVDD with improved boundary. For this purpose we propose a nested aggregation rule as follows: if the testing object is rejected by the target-class SVDD and the ratio of its distance to its nearest boundary point, to the average distance D of all enclosed objects to their nearest boundary points, is smaller than a given threshold, then the object is classified as a target, or accepted. Up to this point the decision rule is equivalent to that of the improved boundary discussed in Section 3.2.3.2. In our approach, however, if a given test object is accepted by the target-class SVDD and rejected by the outlier-class SVDD, then it is accepted. Otherwise, if it is accepted by the target-class SVDD and also accepted by the outlier-class SVDD, then it is rejected.

Algorithm 7 shows the proposed decision rule with the nested descriptor. Lines 1 through 5 are equivalent to the improved boundary proposed by [GCT09] and discussed in Section 3.2.3.2. The proposed rule with the nested descriptor is introduced in lines 6 to 12.

Algorithm 7 Dual Support Vector Domain Description (DSVDD)

1: Let M⁺ be a trained SVDD model over the target class, M⁻ a trained SVDD model with the outlier class as its target, D the average distance between the enclosed target training objects and the boundary of M⁺, T a user-defined threshold, z a testing object and d(z) the distance of the testing object to the boundary
2: if M⁺ rejects z then
3:     if d(z)/D < T then
4:         Y(z) ← target
5:     end if
6: else
7:     if M⁻ rejects z then
8:         Y(z) ← target
9:     else
10:        Y(z) ← outlier
11:    end if
12: end if
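
In code, the nested rule of Algorithm 7 reads as follows; the two descriptors are assumed to expose hypothetical accepts(z) and boundary_distance(z) operations, and the case where the target-class SVDD rejects z and the ratio test also fails (left implicit in lines 2 to 5) is labeled as an outlier.

```python
def dsvdd_label(M_pos, M_neg, z, D, T):
    """Nested aggregation rule of Algorithm 7. M_pos is the target-class SVDD,
    M_neg the outlier-class SVDD; both expose accepts() and boundary_distance()."""
    if not M_pos.accepts(z):
        # lines 2-5: the improved-boundary relaxation; if the ratio test also
        # fails, the object is labeled an outlier (left implicit in Algorithm 7)
        if M_pos.boundary_distance(z) / D < T:
            return 'target'
        return 'outlier'
    # lines 6-12: candidate target, ask the outlier-class SVDD for a second opinion
    if M_neg.accepts(z):
        return 'outlier'
    return 'target'
```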

Figure 4.2 illustrates an example of the decision boundaries of the three related methods: the standard SVDD, the SVDD with improved boundary, and the proposed method. Figure 4.2 (d) illustrates the aim of the proposed rule, which is to improve the performance on the outlier class by rejecting objects that were wrongly accepted by the boundary of the target class. The final decision regarding candidate targets can also be seen as asking the outlier-class SVDD for a second opinion about their certainty.


Figure 4.2: Examples of decision boundaries on a set of two-dimensional testing objects. Black dots are targets and white dots are outliers. Figure (a) displays three decision boundaries: a standard target-class SVDD (thin line), a target-class SVDD with improved boundary (segmented line) and a standard outlier-class SVDD (bold line). In (b), (c) and (d) the receptive fields of the SVDD, the ISVDD, and the proposed DSVDD, respectively, are highlighted for comparison. Although the boundaries should be tightened, they are depicted as circumferences for simplicity.

Figure 4.3: Illustration of the extended decision boundary of the SVDD given a fixed threshold T in three possible scenarios: where objects are clustered near the boundary (a), where objects are distributed somewhat uniformly inside the description (b), and where objects are clustered within the description far from the boundary (c).

Similar to the SVDD with improved boundary, the parameter T is user-defined and should be properly chosen for each instance of the problem, where greater values of T mean a wider expansion of the boundary and more nearby objects being captured into the description. Additionally, the two SVDD models need to be fed with their respective hyper-parameters. In this thesis we used soft-margin SVDDs with Gaussian kernels, as they provided proper flexibility for running simulations. Thus, two additional parameters need to be set for each classifier: the soft-margin parameter C and the kernel width σ. We give further details on this matter later on in Chapter 5.

In general, it is not advisable to use a fixed value of T for several instances, as the extended receptive field of the description can lead to improper classification depending on the features of the dataset, such as complexity, size and data distribution. Figure 4.3 shows a depiction of the extended decision boundary of the SVDD given a fixed threshold T in three possible scenarios: where objects are clustered near the boundary (a), where objects are distributed somewhat uniformly inside the description (b), and where objects are clustered within the description far from the boundary (c). Objects clustered near the boundary yield a small value of D; hence, large values of T should be considered to overcome the small effect of the extended receptive field. On the other hand, objects clustered within the description far from the boundary yield a large value of D; hence, small values of T should be considered to avoid losing specificity.

Figure 4.4: Illustration of the extended decision boundary of the SVDD for two thresholds T₁ and T₂, where T₁ < T₂.

Figure 4.4 shows a depiction of the extended decision boundary of the SVDD for two thresholds. We can see that a greater threshold for the same data yields a wider receptive field of the expanded decision boundary. Thus, it is possible to accept previously rejected target objects and, in turn, increase the sensitivity. However, at the same time, previously rejected outlier objects could be accepted into the description, which decreases the specificity (if positive samples are being treated as targets, which, again, may not be the case). Therefore, this parameter should be chosen properly, as it has a direct impact on the sensitivity–specificity trade-off.

Regarding the complexity of the proposed method, we can expect it to be generally higher than that of the SVDD and the ISVDD, as two descriptions need to be trained instead of one. The complexities of the SVDD and the ISVDD should not differ by a significant amount, because the ISVDD only needs minor additional calculations: the average distance of enclosed target objects to the boundary at training time, and the decision rule in operation. We cannot estimate the differences between the complexities of our proposal and other related algorithms, as they involve different data structures and training algorithms not detailed here from an implementation point of view.

    Lastly, we expect our method to outperform other related state-of-the-art methods inparticular instances of the class imbalance problem.


    4.2 Related Methods

In Chapter 5 we compare DSVDD to other related state-of-the-art algorithms in terms of classification performance and computational complexity. The selection of these algorithms is based on three main criteria: a control sample of traditional algorithms, a sample of related external approaches, and a sample of related internal approaches. In the following we explain and identify the selected methods to be included in the experimental study.

The control sample of traditional algorithms is intended to provide a comprehensive assessment of the effect of class imbalances on the classification performance of error-minimization approaches, i.e., to inspect whether class imbalances really pose additional difficulties to standard learning machines. This is due to the fact that some authors claim that class imbalances are not the only factor that hinders traditional pattern recognition algorithms [BPM04