Classification Problem

• Given training data $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, with $\mathbf{x}_n \in \Re^q$ and class labels $y_n \in \{-, +\}$

• Predict the class label of a given query $\mathbf{x}_0$

[Figure: scatter plot of the training points in the $(x_1, x_2)$ plane, labeled + and -, with the query point $\mathbf{x}_0$ marked.]

Classification Problem

• Unknown probability distribution $P(\mathbf{x}, y)$

• We need to estimate:

$f_+(\mathbf{x}_0) = P(+ \mid \mathbf{x}_0)$

$f_-(\mathbf{x}_0) = P(- \mid \mathbf{x}_0)$

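The Bayesian Classifier

• Loss function: $\lambda(j \mid k)$, the loss incurred by predicting class $j$ when the true class is $k$

• Expected loss (conditional risk) associated with class $j$:

$R(j \mid \mathbf{x}) = \sum_{k=1}^{J} \lambda(j \mid k)\, P(k \mid \mathbf{x})$

• Bayes rule:

$j^* = \arg\min_{1 \le j \le J} R(j \mid \mathbf{x})$

• Zero-one loss function:

$\lambda(j \mid k) = \begin{cases} 0 & \text{if } j = k \\ 1 & \text{if } j \ne k \end{cases}$

• Under zero-one loss, $R(j \mid \mathbf{x}) = 1 - P(j \mid \mathbf{x})$, so the Bayes rule becomes:

$j^* = \arg\max_{1 \le j \le J} P(j \mid \mathbf{x})$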
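The decision rule above can be sketched in a few lines of Python. This is only an illustration: the posterior values and both loss tables are made-up numbers, and the convention `loss[j][k]` (cost of predicting `j` when the true class is `k`) is an assumption chosen to mirror the conditional-risk formula.

```python
def conditional_risk(loss, posteriors, j):
    """R(j | x) = sum over k of loss[j][k] * P(k | x)."""
    return sum(loss[j][k] * p_k for k, p_k in enumerate(posteriors))


def bayes_decision(loss, posteriors):
    """Return the class index j* that minimizes the conditional risk."""
    risks = [conditional_risk(loss, posteriors, j) for j in range(len(loss))]
    return min(range(len(risks)), key=risks.__getitem__)


# Hypothetical posteriors P(k | x) for J = 3 classes at some query x.
posteriors = [0.2, 0.5, 0.3]

# Zero-one loss: 0 on the diagonal (j == k), 1 elsewhere.
zero_one = [[0 if j == k else 1 for k in range(3)] for j in range(3)]

# Under zero-one loss the risk minimizer is the posterior maximizer.
j_star = bayes_decision(zero_one, posteriors)
j_map = max(range(3), key=posteriors.__getitem__)

# Hypothetical asymmetric loss: missing a true class 0 costs 10.
asymmetric = [[0, 1, 1], [10, 0, 1], [10, 1, 0]]
j_asym = bayes_decision(asymmetric, posteriors)
```

With the asymmetric table the decision moves away from the most probable class, which is the reason the general rule is stated in terms of risk rather than posterior probability.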

The Bayesian Classifier

• Bayes rule achieves the minimum error rate:

$j^* = \arg\max_{1 \le j \le J} P(j \mid \mathbf{x})$

• How to estimate the posterior probabilities $\{P(j \mid \mathbf{x})\}_{j=1}^{J}$?

$\hat{j}(\mathbf{x}) = \arg\max_{1 \le j \le J} \hat{P}(j \mid \mathbf{x})$

Density estimation

• Use Bayes theorem to estimate the posterior probability values:

$P(j \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid j)\, P(j)}{\sum_{k=1}^{J} p(\mathbf{x} \mid k)\, P(k)}$

$p(\mathbf{x} \mid j)$ is the probability density function of $\mathbf{x}$ given class $j$

$P(j)$ is the prior probability of class $j$

Naïve Bayes Classifier

• Makes the assumption of independence of features given the class:

$p(\mathbf{x} \mid j) = p(x_1, x_2, \ldots, x_q \mid j) = \prod_{i=1}^{q} p(x_i \mid j)$

• The task of estimating a $q$-dimensional density function is reduced to the estimation of $q$ one-dimensional density functions. Thus, the complexity of the task is drastically reduced.

• The use of Bayes theorem becomes much simpler.

• Proven to be effective in practice.
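A minimal sketch of a naïve Bayes classifier follows. The slides do not say how each one-dimensional density $p(x_i \mid j)$ is estimated; modeling each one as a Gaussian is an assumption made here for illustration, as are the function names, the variance floor, and the toy data.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate P(j) and per-feature Gaussian parameters of p(x_i | j)."""
    by_class = defaultdict(list)
    for x, label in zip(X, y):
        by_class[label].append(x)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small floor on the variance (an assumption) avoids division by zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-9)
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def posterior(model, x):
    """P(j | x) via Bayes theorem with p(x | j) = prod_i p(x_i | j)."""
    log_joint = {}
    for label, (prior, means, variances) in model.items():
        ll = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            ll += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        log_joint[label] = ll
    # Normalize in log space for numerical stability.
    top = max(log_joint.values())
    unnorm = {j: math.exp(v - top) for j, v in log_joint.items()}
    total = sum(unnorm.values())
    return {j: u / total for j, u in unnorm.items()}


# Hypothetical toy data: two features, classes "+" and "-".
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.2], [3.2, 0.1], [2.9, 0.0]]
y = ["+", "+", "+", "-", "-", "-"]
model = fit_gaussian_nb(X, y)
post = posterior(model, [1.1, 2.0])
prediction = max(post, key=post.get)
```

Only $q$ means and $q$ variances are estimated per class, which is the reduction from one $q$-dimensional density to $q$ one-dimensional densities.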

Nearest-Neighbor Methods

• Predict the class label of $\mathbf{x}_0$ as the most frequent one occurring in the $K$ neighbors

[Figure: three views of the two-class scatter plot in the $(x_1, x_2)$ plane, showing the query point $\mathbf{x}_0$, its $K$ nearest neighbors, and the distance metric that defines the neighborhood.]

Basic assumption:

$f_+(\mathbf{x} + \delta\mathbf{x}) \simeq f_+(\mathbf{x})$

$f_-(\mathbf{x} + \delta\mathbf{x}) \simeq f_-(\mathbf{x})$

for small $\|\delta\mathbf{x}\|$
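The K-NN rule can be sketched as follows. Euclidean distance is an assumption (the slides only refer to a distance metric), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Indices of the k training points closest to the query (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return order[:k]


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k nearest neighbors of the query."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_posterior(train_X, train_y, query, label, k=3):
    """Estimate P(label | query) as the fraction of the k neighbors with that label."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] == label for i in neighbors) / k


# Hypothetical toy data: two well-separated classes in the plane.
train_X = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6)]
train_y = ["-", "-", "-", "+", "+", "+"]

label_a = knn_predict(train_X, train_y, (1.5, 1.5), k=3)
label_b = knn_predict(train_X, train_y, (5.2, 5.1), k=3)
p_minus = knn_posterior(train_X, train_y, (1.5, 1.5), "-", k=3)
```

The vote fraction returned by `knn_posterior` is a local estimate of the class probability, which is only reasonable if that probability changes little across the neighborhood, as the basic assumption states.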

Example: Letter Recognition

[Figure: letter samples plotted by two features, edge count and first statistical moment.]

Asymptotic Properties of K-NN Methods

$\lim_{N \to \infty} \hat{f}_j(\mathbf{x}) = f_j(\mathbf{x})$ if $\lim_{N \to \infty} K = \infty$ and $\lim_{N \to \infty} K/N = 0$

• The first condition reduces the variance by making the estimation independent of the accidental characteristics of the $K$ nearest neighbors.

• The second condition reduces the bias by assuring that the $K$ nearest neighbors are arbitrarily close to the query point.
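As a worked illustration (not a choice prescribed by the slides), letting $K$ grow like the square root of $N$ satisfies both conditions at once:

```latex
K = \lceil \sqrt{N} \rceil
\quad\Longrightarrow\quad
\lim_{N \to \infty} K = \infty,
\qquad
0 \le \lim_{N \to \infty} \frac{K}{N}
  \le \lim_{N \to \infty} \frac{\sqrt{N} + 1}{N} = 0 .
```

By contrast, a fixed $K$ (such as $K = 1$) fails the first condition, and $K = N/2$ fails the second.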

Asymptotic Properties of K-NN Methods

$\lim_{N \to \infty} E_1 \le 2E$

$E_1$: classification error rate of the 1-NN rule

$E$: classification error rate of the Bayes rule

In the asymptotic limit, no decision rule is more than twice as accurate as the 1-NN rule.

Finite-sample settings

• How well does the 1-NN rule work in finite-sample settings?

• If the number of training data $N$ is large and the number of input features $q$ is small, then the asymptotic results may still be valid.

• However, for a moderate to large number of input variables, the sample size required for their validity is beyond feasibility.

Curse-of-Dimensionality

• This phenomenon is known as the curse-of-dimensionality

• It refers to the fact that in high dimensional spaces data become extremely sparse and are far apart from each other

• It affects any estimation problem with high dimensionality

Curse of Dimensionality

Sample of size $N = 500$ uniformly distributed in $[0, 1]^q$

[Figure: DMAX, DMIN, and the ratio DMAX/DMIN plotted against the dimensionality (dim).]

The distribution of the ratio DMAX/DMIN converges to 1 as the dimensionality increases

[Figure: variance of distances from a given point, plotted against dim.]

The variance of distances from a given point converges to 0 as the dimensionality increases

[Figure: distance values from a given point, plotted against dim.]

Values flatten out as dimensionality increases
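The experiment can be approximated with a short simulation. Several details are assumptions, since the slides do not spell them out: DMAX and DMIN are taken to be the largest and smallest Euclidean distances from one randomly chosen reference point to the $N = 500$ sample points, and the spread of the distances is reported relative to their mean (standard deviation divided by mean).

```python
import math
import random


def distance_stats(q, n=500, seed=0):
    """Return (DMIN, DMAX, std/mean) of distances from one reference point."""
    rng = random.Random(seed)
    ref = [rng.random() for _ in range(q)]
    dists = [math.dist(ref, [rng.random() for _ in range(q)]) for _ in range(n)]
    mean = sum(dists) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / n)
    return min(dists), max(dists), std / mean


for q in (2, 10, 100, 500):
    dmin, dmax, rel = distance_stats(q)
    print(f"q={q:4d}  DMAX/DMIN={dmax / dmin:8.2f}  std/mean={rel:.3f}")
```

Under these assumptions the ratio shrinks toward 1 and the relative spread shrinks toward 0 as `q` grows, so the nearest and farthest points become hard to tell apart.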

Computing radii of nearest neighborhoods

• Random sample of size $N$ drawn from a uniform distribution in the $q$-dimensional unit hypercube $[-.5, .5]^q$

• Diameter of a $K = 1$ neighborhood using Euclidean distance: $d(q, N) = O(N^{-1/q})$

Median radius of a nearest neighborhood, $d(q, N)$:

q        4      4      6      6      10     10      20      20      20
N        100    1000   100    1000   1000   10000   10000   10^6    10^10
d(q,N)   0.42   0.23   0.71   0.48   0.91   0.72    1.51    1.20    0.76

• As dimensionality increases, the distance from the closest point increases faster

• Large $d(q, N)$ ⇒ highly biased estimations
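The trend can be checked with a small simulation. This is a sketch under assumptions: the query point is placed at the center of the cube, the quantity reported is the distance to the single nearest neighbor, and the trial count is arbitrary. The slides do not state these conventions, so the numbers need not match the tabulated values.

```python
import math
import random
import statistics


def median_nn_distance(q, n, trials=100, seed=0):
    """Median distance from the cube center to its nearest of n uniform points."""
    rng = random.Random(seed)
    origin = [0.0] * q
    nearest = []
    for _ in range(trials):
        best = min(
            math.dist(origin, [rng.uniform(-0.5, 0.5) for _ in range(q)])
            for _ in range(n)
        )
        nearest.append(best)
    return statistics.median(nearest)


for q in (2, 4, 10):
    print(f"q={q:2d}  N=100  median nearest-neighbor distance="
          f"{median_nn_distance(q, 100):.3f}")
```

With $N$ held fixed, the nearest neighbor moves away from the query as `q` grows, so a "local" estimate ends up averaging over a region that is no longer small relative to the unit-length cube.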


• It is a serious problem in many real-world applications

• Microarray data: 3,000-4,000 genes;

• Documents: 10,000-20,000 words in dictionary;

• Images, face recognition, etc.

How can we deal with the curse of dimensionality?

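Covariance matrix

For $\mathbf{x} = (x_1, x_2)^T \in \Re^2$ with mean $\boldsymbol{\mu} = E[\mathbf{x}] = (\mu_1, \mu_2)^T$, estimated from the sample as $\boldsymbol{\mu} \simeq \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}^i$:

$\Sigma = E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\right]$

$= E\left[\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \begin{pmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{pmatrix}\right]$

$= E\begin{pmatrix} (x_1 - \mu_1)^2 & (x_1 - \mu_1)(x_2 - \mu_2) \\ (x_1 - \mu_1)(x_2 - \mu_2) & (x_2 - \mu_2)^2 \end{pmatrix}$

$\simeq \dfrac{1}{N}\sum_{i=1}^{N} \begin{pmatrix} (x_1^i - \mu_1)^2 & (x_1^i - \mu_1)(x_2^i - \mu_2) \\ (x_1^i - \mu_1)(x_2^i - \mu_2) & (x_2^i - \mu_2)^2 \end{pmatrix}$

$= \begin{pmatrix} \frac{1}{N}\sum_{i=1}^{N} (x_1^i - \mu_1)^2 & \frac{1}{N}\sum_{i=1}^{N} (x_1^i - \mu_1)(x_2^i - \mu_2) \\ \frac{1}{N}\sum_{i=1}^{N} (x_1^i - \mu_1)(x_2^i - \mu_2) & \frac{1}{N}\sum_{i=1}^{N} (x_2^i - \mu_2)^2 \end{pmatrix}$

The diagonal entries are the variances of $x_1$ and $x_2$; the off-diagonal entries are their covariance.

Example: $\Sigma = \begin{pmatrix} 7.68 & 92.2 \\ 92.2 & 1912.5 \end{pmatrix}$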

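[Figures: scatter plots of two-dimensional samples, each labeled with its covariance matrix (entries as recovered from the figure labels):]

$\begin{pmatrix} 0.99 & 0.5 \\ 0.5 & 1.06 \end{pmatrix}$

$\begin{pmatrix} 1.04 & 1.05 \\ 1.05 & 1.15 \end{pmatrix}$

$\begin{pmatrix} 0.93 & 0.01 \\ 0.01 & 1.05 \end{pmatrix}$

$\begin{pmatrix} 0.97 & 0.49 \\ 0.49 & 1.04 \end{pmatrix}$

$\begin{pmatrix} 0.94 & 0.93 \\ 0.93 & 1.03 \end{pmatrix}$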
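A sample covariance matrix, with variances on the diagonal and covariances off the diagonal, can be computed directly from its definition. The sketch below uses a $1/N$ normalization and a made-up five-point sample; the function name is hypothetical.

```python
def covariance_matrix(points):
    """Sample covariance matrix with 1/N normalization."""
    n = len(points)
    dim = len(points[0])
    # Sample mean of each coordinate.
    mu = [sum(p[a] for p in points) / n for a in range(dim)]
    # Entry (a, b) averages the product of deviations in coordinates a and b.
    return [
        [sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in points) / n
         for b in range(dim)]
        for a in range(dim)
    ]


# Hypothetical 2-D sample in which x2 is roughly twice x1.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
sigma = covariance_matrix(points)
# sigma[0][0], sigma[1][1]: variances; sigma[0][1] == sigma[1][0]: covariance
```

A large off-diagonal entry relative to the diagonal, as in this sample, signals that the two coordinates carry largely the same information.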

Dimensionality Reduction

• Many dimensions are often interdependent (correlated);

We can:

• Reduce the dimensionality of problems;

• Transform interdependent coordinates into significant and independent ones;
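One standard way to perform such a transformation is principal component analysis (PCA); naming PCA here is an assumption, since this part of the slides does not name a method. The sketch below handles only the two-dimensional case, where the eigenvectors of the covariance matrix have a closed form. Note that the new coordinates are uncorrelated, which coincides with independence only in special cases such as Gaussian data.

```python
import math


def pca_2d(points):
    """Rotate centered 2-D data onto the eigenvectors of its covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(p[0] - mx, p[1] - my) for p in points]
    a = sum(x * x for x, _ in centered) / n   # variance of x1
    d = sum(y * y for _, y in centered) / n   # variance of x2
    b = sum(x * y for x, y in centered) / n   # covariance of x1 and x2
    # Angle of the leading eigenvector of [[a, b], [b, d]].
    theta = 0.5 * math.atan2(2 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    # Project each point onto the leading and the orthogonal eigenvector.
    return [(c * x + s * y, -s * x + c * y) for x, y in centered]


# Hypothetical correlated 2-D sample.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
projected = pca_2d(points)

# Keeping only the first coordinate gives a 1-D representation.
reduced = [z1 for z1, _ in projected]
```

After the rotation the covariance between the two new coordinates is zero and most of the variance sits in the first one, so dropping the second coordinate loses little information for this sample.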