
Binary Classification with Models and Data Density Distribution

Xuan Chen
E-mail: [email protected]

Supervised by: Professor Raymond Chi-Wing Wong
E-mail: [email protected]

Abstract. Classification has always been a fundamental topic in data mining. Traditionally, binary classification has been formulated as a deterministic problem with 0-1 labels. However, probabilistic information is becoming more popular nowadays, and it has many practical applications, such as pattern recognition and tumour diagnosis. Probabilistic labels are more informative: they give the probability that a sample belongs to the positive case (i.e., is labelled 1) instead of plainly stating a conclusion. We therefore believe that predictions can be more accurate when the training dataset carries probabilistic labels, which is what this project tries to prove.

Keywords: data density distribution, probabilistic labels, error bound, models, optimization

1. Introduction

1.1. Motivation

Binary classification has universal applications in everyday life: stock price prediction, spam email filtering, etc. In a common machine learning setting for binary classification, given a training dataset T containing features X and 0-1 labels Y for each sample, a model is expected to be built so that when the features of a new sample are given, its label can be precisely predicted. In real-life situations, however, probabilistic labels can have a wider range of uses. For example, experts in many fields often state the probabilities of positive/negative outcomes instead of giving definite predictions, and training datasets are often built based on such experts' judgements. This makes sense, since it would be unfair to require experts to classify all samples with certainty. Hence, in order to obtain meaningful guidance for real-life situations, classification with probabilistic information deserves more attention in academic research.


In this paper, considering different data density distributions, we want to find out whether predictions are more accurate with probabilistic labels than with traditional 0-1 labels. The conclusion will tell us whether probabilistic labels are worth using.

1.2. Binary Classification with Models and Noise Conditions

We study the problem called Binary Classification with Models and Noise Conditions. Given a training dataset T with both probabilistic and clear-cut (0-1) labels, we want to figure out which kind of labels tends to make more precise predictions.

Data always follow some sort of distribution. Therefore, data density distributions such as the peak, uniform and convex distributions, and even special distributions such as the double-arch and double-bowl distributions, were taken into consideration in this project. Definitions of these data density distributions are given in Section 3. We believe that the distributions mentioned in this paper are enough to cover the common cases we are likely to meet in real life.

Moreover, the choice of model also affects the accuracy of prediction. Hence, popular models were included in our experimental phase. For each model, we contrasted its performance with probabilistic labels against its performance with clear-cut labels. By comparing across different models, we can also see which models do the best work and which improve the most when probabilistic labels are involved. At the end of our experiments, we also tried classification ensembles to improve the results when they were not satisfying.

1.3. Contributions

Generally, Gaussian Process Regression works better with probabilistic labels than with clear-cut labels only when the data density follows the bowl, V-shape, double-arch or double-bowl distribution. With probabilistic labels and Gaussian Process Regression, the more instances are found around the classification boundary, the less accurate the prediction. Also, although Gaussian Process Regression always performed the best, it is hard to tell which models improve the most with the use of probabilistic labels.

Based on theoretical analysis (introduced in detail in Section 4), with Gaussian Process Regression and probabilistic labels, the prediction error is generally bounded by O(n^{-(2+γ)/4}) [2], where the parameter γ determines the data density distribution. Under the non-realizable setting, our error bound is always no worse than the best-known error bound (i.e., O(n^{-1/2})) [5]. Nevertheless, the best-known error bound under the realizable setting is O(n^{-1}) [5], which our result is not necessarily better than.

In the experimental phase, we adopted four common models: Gaussian Process Regression, Radial Basis Function Network, Nearest Neighbour and LibSVM. Generally, Gaussian Process Regression provided the best predictions. In addition, according to our experimental results, these models performed much better with probabilistic labels only when the data density followed specific distributions: predictions were more accurate when instances lay farther away from the classification boundary. In those situations where probabilistic labels contributed little to the prediction, we harnessed a classification ensemble with probabilistic labels and showed that it efficiently improved the accuracy when more data were found around the classification boundary.

The rest of this paper is organized as follows. In Section 2, we review common ideas and fundamental knowledge of machine learning and statistical learning theory, along with previously obtained error bounds. We give our problem definition in Section 3. In Sections 4, 5 and 6, we present our theoretical proofs and experimental results, and state our observations based on them. We conclude our work and discuss future work in Section 7. Finally, in Section 8 we discuss the challenges and failures that we met during this project.

2. Related Work

In [5], it has been proven that the error is bounded by O(n^{-1}) for clear-cut training datasets if the data distribution satisfies the realizable assumption (i.e., there exists a classifier which can perfectly classify any dataset generated from the distribution). In the more general case where the data distribution satisfies the non-realizable assumption, the error of a classifier is O(n^{-1/2}). On the other hand, with probabilistic labels under the Tsybakov Noise Condition [10], the error produced by Gaussian Process Regression goes down to O(n^{-(2+γ)/4}) (where γ ≥ 0) [2]. Our discussion of different data density distributions is a derivative of this result. Our result is never worse than the error bound offered by [5] under the non-realizable setting for any distribution. However, under the realizable setting, with data density following the uniform or peak distribution, our error bound is unfortunately worse than the one given by [5]. The analysis is stated in Section 4.

Common ideas in statistical learning theory were mostly introduced in [6], and basic concepts were introduced in [8].

3. Problem Definition

3.1. Training datasets

Consider binary classification with two classes, 0 and 1. In the traditional setting, we are given a training dataset T which contains n instances I_1, I_2, ..., I_n. Each instance I_i is associated with a feature vector x_i and a target attribute y_i, where i ∈ N. Let X be the set of all possible feature vectors. Note that there are two possible values of the target attribute y_i, namely 0 and 1. A classifier h is defined to be a hypothesis or a function which takes a feature vector x_i as input and outputs y_i as either 0 or 1.

Now let us turn to classification with probabilistic information. Similar to the traditional setting, T contains n instances, namely I_1, I_2, ..., I_n. We assume that the data samples in the dataset are independent and identically distributed (i.i.d.). Each instance I_i is associated with a feature vector x_i and a fractional score f_i (rather than a target attribute y_i), where i ∈ N. We assume that all instances are generated according to a joint distribution of two random variables, X and Y, denoted by Pr(X, Y). Given a feature vector x, we define η(x) to be the conditional probability Pr(Y = 1|X = x), i.e., the probability that an instance with feature vector x has its target attribute equal to 1.

η̂(x) is the estimated probability, i.e., η̂(x) can be considered as the computed version of η(x) provided by the models.

Since the fractional score f_i is obtained from labelers and some statistical information, it can be regarded as an observed version of η(x_i). To be more specific, f_i is the value η(x_i) with Gaussian white noise added. The Gaussian white noise follows the Gaussian distribution N(0, σ²), where σ is the standard deviation of this distribution. Under this noise condition, each fractional score f_i follows the distribution N(η(x_i), σ²). If f_i is smaller than 0, it can be assigned to class 0. Likewise, if f_i is larger than 1, it can be assigned to class 1.
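To make this noise model concrete, here is a minimal Python sketch (our own NumPy illustration, not part of the original experimental code) that turns true conditional probabilities η(x_i) into observed fractional scores f_i:

```python
import numpy as np

def fractional_scores(eta, sigma=0.01, rng=None):
    """Observed scores f_i = eta(x_i) + Gaussian white noise N(0, sigma^2),
    clipped back into [0, 1] as described in Sections 3.1 and 5."""
    rng = np.random.default_rng() if rng is None else rng
    f = np.asarray(eta, dtype=float) + rng.normal(0.0, sigma, size=len(eta))
    return np.clip(f, 0.0, 1.0)  # scores below 0 fall to class 0, above 1 to class 1

# Example: five instances with known conditional probabilities.
print(fractional_scores([0.1, 0.4, 0.5, 0.8, 0.95], sigma=0.01))
```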

3.2. Measurement of error

The excess error of a given classifier is defined as the difference between the expected error generated by this classifier and the expected error generated by the best classifier.

Given a classifier h = I_{η̂(x)≥0.5}, the expected error of h, denoted by err(h), is defined to be Pr_{(x,y)∼Pr(X,Y)}(y ≠ h(x)). The Bayes classifier, denoted by h*, is defined to be the classifier which gives the minimum expected error. Note that h* = I_{η(x)≥0.5}. Given a classifier h, its excess error, denoted by E(h), is defined as the difference between its expected error and the expected error of h*, i.e., E(h) = err(h) − err(h*). Note that E(h) is never smaller than 0, and the hypothesis is more accurate as E(h) approaches 0.

In a word, when η̂(x) is very similar to η(x), our classifier performs as well as the optimal achievable classifier (i.e., the Bayes classifier).
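When the true η(x) is known, as in synthetic checks, the excess error can be estimated empirically from the identity E(h) = E[|2η(x) − 1| · I(h(x) ≠ h*(x))] used in Section 4. A minimal sketch (our own illustration, assuming arrays of true and estimated probabilities):

```python
import numpy as np

def excess_error(eta, eta_hat):
    """Monte-Carlo estimate of E(h) = err(h) - err(h*), where
    h = I[eta_hat >= 0.5] and the Bayes classifier h* = I[eta >= 0.5]."""
    eta, eta_hat = np.asarray(eta), np.asarray(eta_hat)
    disagree = (eta_hat >= 0.5) != (eta >= 0.5)   # where h differs from h*
    return float(np.mean(np.abs(2 * eta - 1) * disagree))

# Example: estimates that cross the boundary 0.5 incur excess error.
print(excess_error(eta=[0.2, 0.49, 0.51, 0.9], eta_hat=[0.3, 0.52, 0.48, 0.8]))
```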

3.3. Data density distribution

First of all, we state the definition of the Tsybakov Noise Condition.

Definition 3.1. (Tsybakov Noise Condition) Given two noise parameters c > 0 and γ ≥ 0, ∀t ∈ (0, 0.5),

Pr(|η(x) − 1/2| < t) ≤ c · t^γ.

Define f(t) = c·t^γ. Then g(t) = f′(t) = cγ·t^{γ−1} = cγ·t^a, where a = γ − 1, reflects the distribution of the data density. The distributions are symmetrical about η(x) = 1/2. Let us discuss the cases of different values of γ. Note that in the discussion we take equality in the formula; since the definition is actually an inequality, any distribution lying under the curves we draw in Figures 1-5 (i.e., smaller than c · t^γ) can be attributed to the corresponding case.
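To make the following case analysis concrete, here is a small sketch (our own illustration, with c = 1 chosen arbitrarily) that evaluates the density curve g(t) = cγ·t^{γ−1} for representative values of γ, one per shape discussed below:

```python
import numpy as np

def g(t, gamma, c=1.0):
    """Data-density curve g(t) = c * gamma * t**(gamma - 1) from Definition 3.1,
    where t is the distance |eta(x) - 1/2| from the classification boundary."""
    return c * gamma * t ** (gamma - 1)

t = np.linspace(0.01, 0.5, 5)
for gamma, shape in [(3.0, "bowl"), (2.0, "V-shape"), (1.5, "cusp"),
                     (1.0, "uniform"), (0.5, "peak")]:
    print(f"gamma={gamma} ({shape}):", np.round(g(t, gamma), 3))
```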


3.3.1. Convex Distribution Cases of the convex distribution are shown in Figure 1.

(1) Bowl-shape Distribution: When γ > 2, g′(t) starts from 0 and increases as t increases. Therefore, the data density in terms of η(x) looks like a bowl.

(2) V-shape Distribution: When γ = 2, a = 1 and g(t) is a linear function. So the data density in terms of η(x) is a symmetrical V-shape, and the lowest point occurs at η(x) = 1/2.

(3) Cusp Distribution: When 1 < γ < 2, 0 < a < 1 and g′(t) decreases, but stays larger than 0, as t increases. As a result, the data density in terms of η(x) is a cusp.

3.3.2. Uniform Distribution When γ = 1, g(t) is a constant and therefore the data are uniformly distributed (reflected in Figure 2).

3.3.3. Peak Distribution When 0 < γ < 1, a < 0 and thus g(t) decreases as t increases. Consequently, we can sketch the distribution of the data density as in Figure 3.

3.3.4. Double-arch and double-bowl distribution If the data density distribution follows the patterns shown in Figures 4 and 5, the defining formula can be written as:

Pr(|η(x) − 1/2| < t) ≤ c(t³/3 − t²/4). (*)

Such special distributions describe the tendency of the data density to be abnormally large or small at some point between the poles and the classification boundary.

Define p(t) = c(t³/3 − t²/4) and q(t) = p′(t) = ct(t − 1/2). The function q(t) reflects the data density distribution. If c < 0, then q(t) has a maximum, and therefore the distribution is double-arch-like. Likewise, if c > 0, then q(t) has a minimum and thus the distribution is double-bowl-like.

Actually, the original formula of the Tsybakov Noise Condition can also describe these two distributions: we only need to find a curve under which the area around the classification boundary is no smaller than the corresponding area under the double-arch / double-bowl distribution. For example, both the uniform and the peak distribution can describe the double-bowl distribution, since the area over [0.5 − a, 0.5 + a] (where a ∈ [0, 0.5]) beneath the curve of the double-bowl distribution is smaller than that of the uniform and peak distributions. To get the areas, the integrals of the distributions' functions have to be taken. However, such descriptions are loose: at best, we can describe the double-arch distribution as a cusp distribution and the double-bowl distribution as a uniform distribution. In order to obtain tighter error bounds subsequently, we need (*), which better fits these special distributions.

3.4. Models

3.4.1. Gaussian Process Regression The Gaussian Process is a common stochastic process. In our project, Gaussian Process Regression involved two steps. First, we introduced the prior by specifying a prior mean function, denoted by m(·), and a prior covariance function, denoted by k(·,·), which encodes the correlation between features x_i and x_j. The distribution is represented as GP(m(·), k(·,·)). In the second stage, the Radial Basis Function was adopted as k(·,·). With

η̂(x) = k(x)^T (K + σ²I)^{−1} f,

where k(x) = {k(x, x_i)}_{i=1}^n, K is the matrix consisting of the entries k(x_i, x_j), f = {f_i}_{i=1}^n, I is the n×n identity matrix and σ is a parameter set by the user, we obtain η̂(x) and can calculate E(h) in the experimental phase.
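The prediction step above is plain linear algebra. Here is a hedged NumPy sketch of that formula (our own minimal implementation with an RBF kernel and a zero prior mean, not the Weka code used in the experiments):

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Radial Basis Function covariance k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * length_scale ** 2))

def gpr_predict(X_train, f_train, X_test, sigma=0.01):
    """eta_hat(x) = k(x)^T (K + sigma^2 I)^(-1) f, as in Section 3.4.1."""
    K = rbf_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + sigma ** 2 * np.eye(len(X_train)), f_train)
    return rbf_kernel(X_test, X_train) @ alpha

# Toy usage: one feature, fractional scores rising with x.
X = np.array([[0.0], [0.3], [0.6], [1.0]])
f = np.array([0.1, 0.35, 0.7, 0.95])
print(gpr_predict(X, f, np.array([[0.5]])))
```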

3.4.2. Radial Basis Function Network A radial basis function network (RBF Network) is an artificial neural network that uses radial basis functions as activation functions. The output of the network is a scalar function of the input vector, φ: R^n → R, given by

φ(x) = Σ_{i=1}^N a_i ρ(||x − c_i||),

where N is the number of neurons in the hidden layer, c_i is the center vector of neuron i, a_i is the weight of neuron i in the linear output neuron, and ρ(||x − c_i||) = exp[−β||x − c_i||²].
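A minimal sketch of this forward pass (the centers, weights and β below are arbitrary illustrative values; training of a_i and c_i is not shown):

```python
import numpy as np

def rbf_network(x, centers, weights, beta=1.0):
    """phi(x) = sum_i a_i * exp(-beta * ||x - c_i||^2), as defined above."""
    sq_dists = ((centers - x) ** 2).sum(axis=1)  # ||x - c_i||^2 for each neuron i
    return float(weights @ np.exp(-beta * sq_dists))

centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # hidden-layer center vectors c_i
weights = np.array([0.7, 0.3])                # output-layer weights a_i
print(rbf_network(np.array([0.5, 0.5]), centers, weights))
```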


3.4.3. Nearest Neighbour The Nearest Neighbours algorithm is a non-parametric method used for classification and regression. An object is classified by a majority vote of its neighbours, the object being assigned to the class most common among its k nearest instances. In our implementation of this algorithm, nearer neighbours contribute more to the classification of unknown instances. For example, a common weighting scheme gives each neighbour a weight of 1/d, where d is its distance to the query point.
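A sketch of this distance-weighted vote (a brute-force version for clarity; the small epsilon guarding against division by zero is our own addition):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3, eps=1e-12):
    """Distance-weighted k-NN: each of the k nearest neighbours votes for
    its class with weight 1/d, where d is its distance to the query x."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    votes = {}
    for i in nearest:
        votes[int(y_train[i])] = votes.get(int(y_train[i]), 0.0) + 1.0 / (d[i] + eps)
    return max(votes, key=votes.get)  # class with the largest total weight

X = np.array([[0.0], [0.2], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])
print(weighted_knn_predict(X, y, np.array([0.15])))
```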

3.4.4. Support Vector Machine Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns. Since the SVM is a well-known and popular technique with many variants, we do not include a detailed introduction in this paper. Intuitions about SVMs can be found in [7].

In this project, we used the SVM implementation LibSVM [9], developed by Chih-Chung Chang and Chih-Jen Lin.

3.5. Problem

In this project, we study whether, under every data density distribution, the models perform better with probabilistic labels than with clear-cut labels. We give proofs for the case where Gaussian Process Regression is harnessed as the model. For the other popular models, although we do not give a theoretical analysis, we still adopted them in our experiments and report our observations.

4. Theoretical Analysis

The following analysis is for Gaussian Process Regression.

We start with the proof in [1]:

E(h) = Pr(Y ≠ h(X)) − Pr(Y ≠ h*(X))
     = E[|2η(x) − 1| · I(h(x) ≠ h*(x))]
     ≤ 2E[|η̂(x) − η(x)| · I(h(x) ≠ h*(x))]
     ≤ 2√(E[(η̂(x) − η(x))²]) · √(Pr(h(x) ≠ h*(x))). (1)

Please note that when h(x) ≠ h*(x), we have |η(x) − 1/2| ≤ |η̂(x) − η(x)|, which justifies the second step: if η(x) ≤ 1/2, then η̂(x) > 1/2, so η(x) and η̂(x) stand on the two sides of 1/2; if η(x) > 1/2, the argument is similar. The last step is the Cauchy-Schwarz inequality.

With probability at least 1 − δ, it can be proven that

E[(η̂(x) − f)²] ≤ Δ, (2)

where

Δ ∈ O(n^{−1}). (3)

Moreover, since f − η(x) is zero-mean noise independent of the estimate, the cross term below has zero expectation, and

E[(η̂(x) − η(x))²]
≤ E[(η̂(x) − η(x))² + (η(x) − f)²]
= E[(η̂(x) − η(x))² + (η(x) − f)² + 2(η̂(x) − η(x))(η(x) − f)]
= E[(η̂(x) − f)²], (4)

so we get

E[(η̂(x) − η(x))²] ≤ Δ. (5)

Therefore,

Pr(h(x) ≠ h*(x)) = Pr(h(x) ≠ h*(x), E[(η̂(x) − η(x))²] ≤ Δ). (6)

Then,

Pr(h(x) ≠ h*(x)) = Pr(h(x) ≠ h*(x), |η̂(x) − η(x)| ≤ √Δ)
                 ≤ Pr(|η(x) − 1/2| < √Δ)
                 ≤ c · Δ^{γ/2} ∈ O(n^{−γ/2}). (7)

By substituting (3), (5) and (7) back into (1), we get

E(h) ∈ O(n^{−(2+γ)/4}).

Generally, with the Tsybakov Noise Condition and Gaussian Process Regression, E(h) ∈ O(n^{−(2+γ)/4}). Note that as γ increases, the error bound becomes tighter. Below we discuss the error bound for each distribution.

4.1. Convex Distribution

If the data are bowl-shape distributed, γ > 2 and the error bound ranges from O(n^{−1}) down to O(n^{−∞}). If the data are V-shape distributed, γ = 2 and E(h) ∈ O(n^{−1}). If the data density follows the cusp distribution, 1 < γ < 2 and the error is bounded between O(n^{−1}) and O(n^{−3/4}).

4.2. Uniform Distribution

When γ = 1, the data density follows the uniform distribution and E(h) ∈ O(n^{−3/4}).

4.3. Peak Distribution

In this situation 0 ≤ γ < 1. Hence the error bound is better than O(n^{−1/2}) but worse than O(n^{−3/4}).

4.4. Double-arch and double-bowl distribution

As defined in Section 3,

Pr(|η(x) − 1/2| < √Δ) ≤ c((1/3)Δ^{3/2} − (1/4)Δ) ∈ O(n^{−1}).

Hence,

E(h) ≤ 2√(E[(η̂(x) − η(x))²]) · √(Pr(h(x) ≠ h*(x))) ∈ O(n^{−1/2}) · O(n^{−1/2}) = O(n^{−1}).

With the original formula of the Tsybakov Noise Condition, we would have to describe the double-arch distribution as a cusp distribution and the double-bowl distribution as a uniform distribution. However, the error bounds of the cusp distribution (worse than O(n^{−1})) and the uniform distribution (= O(n^{−3/4})) are both worse than the result we obtain here.

Under the realizable setting, the best-known error bound with clear-cut labels is O(n^{−1}), according to [5]. Therefore, only when the data density follows one of the bowl-shape, V-shape, double-arch or double-bowl distributions do we get a prediction with Gaussian Process Regression and probabilistic labels that is at least as precise as with clear-cut labels. In the other cases, the conclusion does not necessarily hold.

Under the non-realizable setting, with clear-cut labels, the tightest error bound known so far is O(n^{−1/2}) [5], which is no better than the result we get for any distribution with probabilistic labels and Gaussian Process Regression.
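The comparison above can be tabulated directly: our bound has exponent −(2+γ)/4, against −1 (realizable) and −1/2 (non-realizable) for clear-cut labels [5]. A small sketch with illustrative γ values (the double-arch/double-bowl case, which reaches O(n^{−1}) via (*), is not covered by this formula):

```python
def our_exponent(gamma):
    """Exponent of our bound O(n^e) for probabilistic labels + GPR (Section 4)."""
    return -(2.0 + gamma) / 4.0

REALIZABLE, NON_REALIZABLE = -1.0, -0.5  # best-known clear-cut exponents [5]

for gamma, dist in [(0.5, "peak"), (1.0, "uniform"), (1.5, "cusp"),
                    (2.0, "V-shape"), (3.0, "bowl")]:
    e = our_exponent(gamma)
    print(f"{dist:8s} gamma={gamma}: n^{e:+.2f}  "
          f"beats non-realizable: {e <= NON_REALIZABLE}  beats realizable: {e <= REALIZABLE}")
```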


5. Experiment

We conducted experiments via Weka (Waikato Environment for Knowledge Analysis) 3.6, a popular suite of machine learning software written in Java, on a workstation with a 2.10 GHz CPU and 2.0 GB RAM.

For the experiments, we selected the real datasets Model Years, Student Performance and Wine Quality, originally intended for regression, from the UCI repository [12], and Cadata and Body Fat from [13]. The normalized value of the target attribute of each instance, ranging from 0 to 1, was taken as the probability that the instance belongs to class 1. The first dataset is Model Years, whose data roughly follow the bowl-shape distribution (the data become sparser while approaching the classification boundary 0.5). The second one is Cadata, with uniformly distributed data. The data in Student Performance follow the peak distribution. We also tested the datasets Wine Quality and Body Fat, whose data can be observed to be double-arch distributed.

As stated in Section 3, we are only given an observed version of the probabilistic dataset, say T_f. Thus, we generate T_f by adding a noise value randomly picked from N(0, σ²) to each probability; each resulting value is a fractional score in T_f. Note that if the resulting value is greater than 1, we reset it to 1, and if it is smaller than 0, we set it to 0. In all experiments, we set σ = 0.01 by default.

We implemented our proposed classifier based on Gaussian Process Regression. Though in the theoretical analysis we only study the error bound for Gaussian Process Regression, in the experiments we also considered three other common comparative classifiers: Radial Basis Function Network (RBF Network), Nearest Neighbour and LibSVM. We ran all four classifiers with both probabilistic and clear-cut labels; these classifiers are introduced in Section 3.4. Note that LibSVM is only available for clear-cut labels, so we merely ran it on datasets with 0-1 labels. We ran trials with the different models to see how they performed when probabilistic labels were adopted in prediction.

We performed 10-fold cross validation for these experiments. The training dataset was randomly divided into 10 pieces, each of which was regarded as the testing dataset in one of the ten folds, while the rest were used for training. We evaluate a classifier in terms of its average accuracy on the held-out test sets.
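The following sketch outlines this protocol (a generic NumPy outline with placeholder train/evaluate callables; the actual experiments used Weka's implementations):

```python
import numpy as np

def cross_validate(X, y, train_fn, accuracy_fn, n_folds=10, seed=0):
    """Random n-fold split: each fold serves once as the test set while
    the remaining folds are used for training (Section 5)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = train_fn(X[train], y[train])
        scores.append(accuracy_fn(model, X[test], y[test]))
    return float(np.mean(scores))  # average held-out accuracy

# Toy usage with a trivial majority-class "model".
X = np.arange(20).reshape(-1, 1)
y = (X[:, 0] >= 10).astype(int)
train = lambda X, y: int(round(y.mean()))
acc = lambda m, X, y: float((y == m).mean())
print(cross_validate(X, y, train, acc))
```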

5.1. Model Years

Here is the distribution of the data in Model Years: on the left is the distribution of the clear-cut labels, with the distribution of the probabilistic labels on the right. Apparently, the data roughly followed the bowl-shape distribution.

The chart below presents the performance of each model with both clear-cut and probabilistic labels.

From the chart we can see that, with probabilistic labels, the prediction was more accurate not only when the model was Gaussian Process Regression, as we prove in Section 4, but also when it was RBF Network or Nearest Neighbour. Additionally, it is interesting to notice that the Nearest Neighbour classifier with probabilistic labels also had a satisfying performance, just a little behind Gaussian Process Regression. The result supports our analysis that when the data are bowl-shape distributed, prediction with probabilistic labels is more accurate than with clear-cut labels.

5.2. Cadata

Here is the distribution of the data in Cadata; the data followed the uniform distribution.

The chart below presents the performance of each model with both clear-cut and probabilistic labels.

From the above result we conclude that the accuracy was basically the same with probabilistic labels as with clear-cut labels. This verifies our conclusion in Section 4 that under the uniform distribution, the prediction with probabilistic labels is not necessarily better than with clear-cut labels. Still, it is interesting that Gaussian Process Regression performed the best among the models.

5.3. Student Performance

Here is the distribution of the data in Student Performance. The data accumulated at 0.5 (the classification boundary), basically following the peak distribution.

The chart below presents the performance of each model with both clear-cut and probabilistic labels.

As with Cadata, the accuracy was basically the same with probabilistic labels as with clear-cut labels, as our analysis explains. Besides, in this case we can see that the prediction with Gaussian Process Regression was no longer outstanding: all the models involved gave around the same accuracy.

5.4. Wine Quality

Here is the distribution of the data in Wine Quality; the double-arch distribution describes it best.

The conclusion from this result is similar to what we got from Model Years: the improvement in prediction accuracy was obvious when probabilistic labels were put to use. In addition, in this case Gaussian Process Regression still had the best performance.

5.5. Body Fat

Here is the distribution of the data in Body Fat; the double-arch distribution describes it best.

The improvement in accuracy was obvious when the dataset involved probabilistic labels. In addition, in this case Gaussian Process Regression also had the best performance.

According to these experimental results, GPR performed the best in most situations. Also, we see that prediction results with probabilistic labels were not likely to be affected by dataset size. In addition, it was in fact not obvious which models improved the most with the use of probabilistic labels.

From the above results, we found that probabilistic labels failed to improve the prediction in Cases 5.2 and 5.3. For these cases, we tried a classification ensemble, which successfully enhanced the accuracy.

Here is how our classification ensemble worked. First we used Model 1 to perform regression on dataset T with probabilistic labels. If a result is far away from the classification boundary, it is regarded as a confident answer and we keep it. If a regression result is too near to the classification boundary (for instance, a score of 0.45 when the classification boundary is 0.5), we pick those instances out and perform clear-cut classification on them with Model 2, as sketched below.
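A hedged sketch of this two-stage ensemble (the confidence margin of 0.05 and the placeholder Model 2 are illustrative choices of ours; the concrete model combinations are listed next):

```python
import numpy as np

def ensemble_predict(scores_model1, clear_cut_model2, X, margin=0.05):
    """Keep Model 1's regression output when it is confidently far from the
    boundary 0.5; otherwise fall back on Model 2's clear-cut classification."""
    scores = np.asarray(scores_model1)            # probabilistic output of Model 1
    confident = np.abs(scores - 0.5) >= margin
    labels = (scores >= 0.5).astype(int)          # threshold the confident scores
    if (~confident).any():
        labels[~confident] = clear_cut_model2(X[~confident])  # e.g. LibSVM
    return labels

# Toy usage: 0.45 and 0.52 are too close to 0.5, so Model 2 decides them.
scores = np.array([0.10, 0.45, 0.52, 0.93])
model2 = lambda X: np.ones(len(X), dtype=int)     # placeholder clear-cut classifier
print(ensemble_predict(scores, model2, X=np.arange(4).reshape(-1, 1)))
```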

We tried several combinations of Model 1 and Model 2: Gaussian Process Regression and RBF Network, Gaussian Process Regression and LibSVM, and RBF Network and LibSVM. Here we present the worst and best results of the classification ensemble, compared with the best result with probabilistic labels as well as the best result with clear-cut labels.


Basically, the predictions given by the classification ensemble were always no worse than our former results. In particular, there always existed some ensemble combination whose prediction was much better.

6. Discussion

Intuitively, with probabilistic labels, models are able to predict the estimated target attribute much more precisely. However, this advantage does not show when instances are located around the classification boundary. For example, consider an instance labelled 0.49999, which belongs to class 0. With probabilistic labels we may predict its estimated probability as 0.50001, which is very close to the true value, yet the model then assigns the instance to class 1 and is thus wrong. In a word, when more data are found around the classification boundary, the prediction is much less accurate with probabilistic labels. This explains why the prediction error of Gaussian Process Regression with probabilistic labels is no worse than the best-known error bound under any setting with clear-cut labels only when γ is large enough.

7. Conclusion

In this paper, we study the worth of probabilistic labels, which depends on the data density distribution. Generally, when more data are observed around the classification boundary, predictions with probabilistic labels are not necessarily better than with clear-cut labels. However, we can try classification ensembles in those cases, and the accuracy can be better than when we use probabilistic or clear-cut labels alone.

Besides, from the experimental results we can conclude that GPR performed the best in most situations, whereas it was hard to determine which models improved the most when probabilistic labels were involved.

Acknowledgements: The author would like to thank the supervisor of this project, Professor Raymond Chi-Wing Wong, for always arranging meetings during which he helped the author look at this final year thesis from different points of view, offered useful learning materials, and gave good advice when the author had difficulties with proofs.

The author would also like to thank Mr. Peng Peng, a PhD candidate at HKUST under the supervision of Professor Wong, for always arranging advisory discussions whenever the author asked for help. He helped the author understand machine learning and clarify the proofs and ideas in [2] at the beginning, and he always readily provided help when the author got stuck on ideas in this final year thesis.

Besides, the author would like to thank Mr. Ted Spaeth, the Final Year Thesis Tutor. He spent a lot of time reviewing and rectifying the reports of this project, and has given much advice for improving the writing and presentation.

8. Challenges and Future Work

We met some challenges during this project.

We wanted to prove whether some other algorithms give better performance with the use of probabilistic labels. We tried to follow the ideas presented in [2], but Gaussian Process Regression has a speciality: Σ_{i=1}^n (η̂(x_i) − f_i)² ∈ O(1), where η̂(x_i) is the estimated probability obtained from the model and f_i is the observed version of the conditional probability η(x_i). For other algorithms, the derivations were no longer available. We then tried to introduce Haussler's Theorem, VC Dimension and VC entropy, along with Covering Numbers, but none of them provided promising insights or a tighter error bound.

Therefore, there are interesting future works and room for improvement. By considering other algorithms, we may see whether our conclusion also holds for other classifiers.

References

[1] P. Peng and R. Wong, Selective Sampling on Probabilistic Data, Proc. 2014 SIAM Int'l Conf. on Data Mining (SDM 14), 28-36.
[2] P. Peng, R. Wong and P. Yu, Learning on Probabilistic Labels, Proc. 2014 SIAM Int'l Conf. on Data Mining (SDM 14), 307-315.
[3] Q. Nguyen, H. Valizadegan and M. Hauskrecht, Learning Classification with Auxiliary Probabilistic Information, Proc. 2011 IEEE Int'l Conf. on Data Mining (ICDM 2011), 477-486.
[4] M. Ebden, Gaussian Processes for Regression: A Quick Introduction, http://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf, 1-5.
[5] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, 1999, 278.
[6] O. Bousquet, S. Boucheron and G. Lugosi, Introduction to Statistical Learning Theory, www.kyb.mpg.de/publications/pdfs/pdf2819.pdf, 179-213.
[7] A. Ng, Introduction of Support Vector Machine, http://cs229.stanford.edu/notes/cs229-notes3.pdf.
[8] A. Ng, CS229: Machine Learning, http://Coursera.org, 2014.
[9] C.-C. Chang and C.-J. Lin, LibSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[10] A. Tsybakov, Optimal aggregation of classifiers in statistical learning, The Annals of Statistics, 2004, 135-166.
[11] P. Massart and E. Nedelec, Risk Bounds for Statistical Learning, The Annals of Statistics, 2006, 2326-2366.
[12] A. Frank and A. Asuncion, UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets.html
[13] Department of Statistics at Carnegie Mellon University, StatLib, http://lib.stat.cmu.edu/