ICS583 112: Non-Parametric Density Estimation

Uploaded by alasidig; posted 11-May-2017.

TRANSCRIPT

Acknowledgement

• The material in these slides is based on:

• R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., John Wiley & Sons, 2001.

• S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed., 2009.

• C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

• Lectures of Prof. Andrew W. Moore, School of Computer Science, Carnegie Mellon University.

• And other internet resources.


Non-Parametric Density Estimation

• Introduction

• Histogram-based estimation

• Parzen windows estimation

• Nearest neighbor estimation


Introduction

• In supervised learning we assumed that the parametric

forms of underlying (class conditional) density functions

were known.

• However, in practical pattern recognition this assumption

may be doubtful.

• Most classical parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multi-modal densities.

• Non-parametric techniques can be used with arbitrary

distributions and without knowing the parametric form of

the underlying (class conditional) densities.


Density Estimation and Parzen Window

Types of non-parametric methods

1. Estimation of the density functions p(x|ωᵢ) using sample patterns.

2. Estimation of the a posteriori probabilities P(ωᵢ|x) directly from sample patterns or prototypes.

We will consider two approaches for both types of non-parametric methods above:

1. Parzen-window-based

2. k-nearest-neighbor-based


Density estimation

• Density estimation - estimating the probability density function

p(x) based on a given set of training samples D = {x1,...,xn}.

• The estimated density is denoted by p̂(x).

• Assume that the training samples are independent and identically

distributed (i.i.d.) - each has the same probability distribution as

the others and all are mutually independent and distributed

according to p(x).

• The difference between parametric and non-parametric

estimation:

1. In the non-parametric case we try to estimate a function, the distribution p(x), instead of a parameter vector.

2. We have a finite number of training samples, so there will be some error in the estimated function.


Histograms

• Histograms are the simplest approach to density estimation.

• The feature space is divided into m equal-sized cells or bins Bᵢ.

• The number nᵢ, i = 1,…,m, of training samples falling in each bin is counted, and the estimate within bin Bᵢ is

p̂(x) = nᵢ / (nV),

where n is the total number of samples and V is the volume of a cell. (All cells have equal volume, so the index is not needed.)

• The histogram estimate is not a very good way to estimate

densities, especially when there are many features. It leads

to discontinuous density estimates.
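The bin-count estimate above can be sketched in a few lines of Python; this is a minimal 1-D version, and the function name and bin count are my own choices, not from the slides:

```python
import numpy as np

def histogram_density(samples, x, n_bins=10):
    """Histogram estimate: p_hat(x) = n_i / (n * V) for the bin B_i containing x."""
    samples = np.asarray(samples, dtype=float)
    counts, edges = np.histogram(samples, bins=n_bins)
    V = edges[1] - edges[0]                        # all bins have equal width (volume)
    i = np.searchsorted(edges, x, side="right") - 1
    i = min(max(i, 0), n_bins - 1)                 # clamp x into the outermost bins
    return counts[i] / (len(samples) * V)
```

For samples spread evenly over [0, 1], the estimate at an interior point is close to the uniform density 1, while the estimate is constant inside each bin, which is exactly the discontinuity problem mentioned above.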


Histograms (cont.)


Density estimation

In words: place a segment of length h centered at a point x̂ and count the k points that fall inside it. With N samples in total, the probability mass of the segment is estimated by P ≈ k/N, so

p̂(x) ≈ p̂(x̂) = (1/h)(k/N), for x ∈ [x̂ − h/2, x̂ + h/2]

• If p(x) is continuous, p̂(x) → p(x) as N → ∞, provided that

h_N → 0, k_N → ∞, k_N/N → 0


Density estimation

• Consider the problem of selecting k labeled balls without

replacement from an urn containing n balls.

• In how many different ways may we select those k balls?

There are

C(n, k) = n! / (k! (n − k)!)

combinations (subsets) containing k elements, where n! = n · (n − 1) · · · 2 · 1 and 0! = 1.
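As a quick check of the counting formula (using only the standard library; the example numbers are mine):

```python
from math import comb, factorial

# C(5, 2) = 5! / (2! * 3!) = 10 ways to choose 2 balls out of 5
assert comb(5, 2) == factorial(5) // (factorial(2) * factorial(3)) == 10
```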


Density estimation

• Assume that we wish to estimate the value of the density function p at x

based on training samples x1 , . . . , xn.

• If we consider a region R around x, then the probability that a training sample xⱼ falls in the region R is

P = ∫_R p(x′) dx′ ----- (1)

P is an averaged version of the density function p(x).

Assume that k of the n training samples fall in the region R. The probability that exactly k of the training samples fall into region R is given by the binomial density

P_k = C(n, k) P^k (1 − P)^(n−k) ----- (2)


Density estimation

The expected value of k is

E[k] = nP ----- (3)

Estimating E[k] by the observed k′ leads to the estimate P′ = k′/n. Therefore the ratio k/n is a good estimate for the probability P, and hence for the density function p.

If p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write

∫_R p(x′) dx′ ≈ p(x) V ----- (4)

where x is a point within R and V the volume enclosed by R.

Combining equations (1), (3), and (4):

p(x) ≈ (k/n) / V ----- (5)
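Equations (2) and (3) can be verified numerically; the sketch below (function and variable names are mine) computes the binomial density and checks that the mean of k equals nP:

```python
from math import comb

def binom_pmf(k, n, P):
    """Eq. (2): probability that exactly k of n i.i.d. samples fall in region R."""
    return comb(n, k) * P**k * (1 - P)**(n - k)

n, P = 100, 0.3
mean_k = sum(k * binom_pmf(k, n, P) for k in range(n + 1))  # E[k] by direct summation
assert abs(mean_k - n * P) < 1e-9                           # eq. (3): E[k] = nP
```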



Density estimation …

• Variance: the smaller the h, the higher the variance.

[figures: estimates for h = 0.1, N = 1000 and h = 0.8, N = 1000]


Density estimation …

[figure: h = 0.1, N = 10000]

• The larger N is, the better the accuracy.


CURSE OF DIMENSIONALITY

• In all the methods so far, we have seen that the higher the number of points N, the better the resulting estimate.

• If in the one-dimensional space an interval filled with N points is adequate (for good estimation), then in the two-dimensional space the corresponding square will require N² points, and in the ℓ-dimensional space the ℓ-dimensional cube will require N^ℓ points.

• The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.


Density estimation …

• Condition for convergence

The fraction k/(nV) is a space-averaged value of p(x); p(x) itself is obtained only in the limit V → 0.

If V → 0 and k = 0: lim p(x) = 0 (for fixed n). This is the case where no samples fall in R: an uninteresting case!

If V → 0 and k ≠ 0: the estimate diverges, lim p(x) = ∞: also an uninteresting case!


Density estimation …

• The volume V needs to approach 0 if we are to obtain p(x) rather than an averaged version of it.

• Practically, V cannot be allowed to become arbitrarily small, since the number of samples is always limited.

• One will have to accept a certain amount of variance in the ratio k/n.


Unlimited number of samples

Theoretically, if an unlimited number of samples is

available, we can circumvent this difficulty

To estimate the density of x, we form a sequence of regions

R1, R2,…containing x: the first region contains one sample,

the second two samples and so on.

Let Vn be the volume of Rn, kn the number of samples

falling in Rn and pn(x) be the nth estimate for p(x):

pn(x) = (kn/n)/Vn (7)


Unlimited number of samples

For lim_{n→∞} p_n(x) = p(x), three conditions are required:

1. lim_{n→∞} V_n = 0

2. lim_{n→∞} k_n = ∞

3. lim_{n→∞} k_n/n = 0

The first condition assures that the space-averaged P/V converges to p(x).

The second condition assures that the frequency ratio k_n/n converges to the probability P.

The third condition states that the number of samples falling in the small region R_n is always a negligibly small fraction of the total number of samples. This is required for p_n(x) to converge.


Unlimited number of samples

There are two ways of obtaining sequences of regions that satisfy these conditions, so that p_n(x) → p(x):

1. Shrink an initial region by specifying V_n as some function of n, e.g. V_n = 1/√n.

This is called "the Parzen-window estimation method".

2. Specify k_n as some function of n, e.g. k_n = √n, and let V_n grow until it encloses the k_n nearest neighbors of x.

This is called "the k_n-nearest-neighbor estimation method".


Parzen windows - An example

• Assume that the region R_n is a d-dimensional hypercube. If h_n is the length of a side of the hypercube, its volume is given by V_n = h_n^d.

• We can obtain an analytic expression for k_n – the number of samples falling into the hypercube – by defining the following window function:

φ(u) = 1 if |u_j| ≤ 1/2 for j = 1,…,d; φ(u) = 0 otherwise.

• φ((x − x_i)/h_n) is equal to unity if x_i falls within the hypercube of volume V_n centered at x, and equal to zero otherwise.


Parzen windows - An example

It follows that φ((x − x_i)/h_n) = 1 if x_i falls in the hypercube of volume V_n centered at x, and φ((x − x_i)/h_n) = 0 otherwise.

The number of samples in this hypercube is therefore given by

k_n = Σ_{i=1}^{n} φ((x − x_i)/h_n)

By substituting k_n in equation (7), we obtain the following estimate:

p_n(x) = (k_n/n)/V_n = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n)

p_n(x) estimates p(x) as an average of functions of x and the samples x_i (i = 1,…,n). These functions can be general!
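The hypercube count and the resulting estimate can be sketched numerically; this is a minimal version assuming Euclidean coordinates, with function names of my own choosing:

```python
import numpy as np

def hypercube_window(u):
    """phi(u) = 1 if |u_j| <= 1/2 for every coordinate j, else 0."""
    return float(np.all(np.abs(u) <= 0.5))

def parzen_hypercube(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/h**d) * phi((x - x_i)/h)."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:                 # 1-D data: one sample per row
        samples = samples[:, None]
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d = samples.shape[1]
    k = sum(hypercube_window((x - xi) / h) for xi in samples)
    return k / (len(samples) * h**d)

# 1-D check: the cube of side 1 around x = 0.2 covers [-0.3, 0.7],
# so 2 of the 3 samples fall inside, giving p_n(x) = 2 / (3 * 1)
print(parzen_hypercube(0.2, [0.0, 0.4, 1.0], 1.0))
```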


Parzen windows

• The Parzen-window density estimate using n training samples and the window function φ is defined by

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n)

• The estimate pn(x) is an average of (window) functions.

Usually the window function has its maximum at the origin

and its values become smaller when we move further away

from the origin. Then each training sample is contributing

to the estimate in accordance with its distance from x.


Illustration

The behavior of the Parzen-window method; case where p(x) ~ N(0,1).

Let φ(u) = (1/√(2π)) exp(−u²/2) and h_n = h_1/√n (n > 1), where h_1 is a known parameter.

Thus

p_n(x) = (1/n) Σ_{i=1}^{n} (1/h_n) φ((x − x_i)/h_n)

is an average of normal densities centered at the samples x_i.


Parzen windows …

Numerical results:

For n = 1 and h_1 = 1:

p_1(x) = φ(x − x_1) = (1/√(2π)) e^{−(x−x_1)²/2} ~ N(x_1, 1)

For n = 10 and h_1 = 0.1, the contributions of the individual samples are clearly observable!

Parzen windows … [figures]

Analogous results are also obtained in two dimensions as

illustrated:


Case where p(x) = λ₁·U(a,b) + λ₂·T(c,d) (unknown density: a mixture of a uniform and a triangle density)

Parzen windows … [figures]

Parzen windows example

Question: Given a set of five data points x₁ = 2, x₂ = 2.5, x₃ = 3, x₄ = 1 and x₅ = 6, find the Parzen probability density function (pdf) estimate at x = 3, using a Gaussian window function with σ = 1 (window width h = 1).


Parzen windows example …

x₁: (1/√(2π)) e^{−(2−3)²/2} = 0.2420

x₂: (1/√(2π)) e^{−(2.5−3)²/2} = 0.3521

x₃: (1/√(2π)) e^{−(3−3)²/2} = 0.3989

x₄: (1/√(2π)) e^{−(1−3)²/2} = 0.0540

x₅: (1/√(2π)) e^{−(6−3)²/2} = 0.0044

p(x = 3) = (0.2420 + 0.3521 + 0.3989 + 0.0540 + 0.0044)/5 = 0.2103
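The slide's arithmetic can be checked with a short script (a Gaussian window with h = 1; the function name is my own):

```python
import numpy as np

def gaussian_parzen(x, samples, h=1.0):
    """1-D Parzen estimate with a standard-normal window phi and width h."""
    u = (x - np.asarray(samples, dtype=float)) / h
    return float(np.mean(np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)) / h)

p = gaussian_parzen(3.0, [2.0, 2.5, 3.0, 1.0, 6.0])
print(round(p, 4))  # → 0.2103, matching the slide
```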

Parzen windows … [figures]

Parzen windows notes

• The window width h_n (equivalently, the volume V_n) is the most critical parameter in the Parzen-window approach.

• It is selected by cross-validation: a separate validation set, or a portion of the training set, is set aside for validation.

• The classifier is trained using different values of h_n.

• The h_n that results in the smallest error on the validation set is selected as the optimal one.

This technique is normally used with algorithms that have parameters to tune.
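For density estimation itself a common concrete selection criterion is the held-out log-likelihood rather than classification error; the sketch below is my own formulation, not from the slides, and picks the width that scores best for a 1-D Gaussian-window estimate:

```python
import numpy as np

def parzen_loglik(val, train, h):
    """Mean log-likelihood of validation points under a Gaussian-window Parzen fit."""
    u = (val[None, :] - train[:, None]) / h          # (n_train, n_val) pairwise scaled gaps
    p = np.exp(-0.5 * u**2).mean(axis=0) / (h * np.sqrt(2.0 * np.pi))
    return float(np.log(p + 1e-300).mean())          # guard against log(0)

def select_h(train, val, candidates):
    """Cross-validation: keep the window width scoring best on held-out data."""
    return max(candidates, key=lambda h: parzen_loglik(val, train, h))

train = np.linspace(-2.0, 2.0, 50)
val = np.linspace(-1.0, 1.0, 20)
print(select_h(train, val, [0.01, 0.5, 10.0]))  # a moderate width wins here
```

Too small an h spikes around individual training points (low likelihood between them); too large an h over-smooths; a moderate width scores best, mirroring the variance discussion earlier in the slides.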


Classification example

In classifiers based on Parzen-window estimation:

• We estimate the densities for each category and

classify a test point by the label corresponding to the

maximum posterior

• The decision region for a Parzen-window classifier

depends upon the choice of window function as

illustrated in the following figure.

[figure: Parzen-window decision regions for different window choices]

Example of Nearest Neighbor Rule

• Two class problem: yellow triangles and blue

squares. Circle represents the unknown sample x

and as its nearest neighbor comes from class θ1, it

is labeled as class θ1.


Example of k-NN rule with k = 3

• There are two classes: yellow triangles and blue

squares. The circle represents the unknown sample

x and as two of its nearest neighbors come from

class θ2, it is labeled class θ2.

• The number k should be:

• 1) large, to minimize the probability of misclassifying x;

• 2) small (with respect to the number of samples), so that the k nearest points are close enough to x to give an accurate estimate of the true class of x.


Kn - Nearest neighbor estimation

• Goal: a solution to the problem of the unknown "best" window function.

• Let the cell volume be a function of the training data.

• Center a cell about x and let it grow until it captures k_n samples (k_n = f(n)).

• These samples are called the k_n nearest neighbors of x.

• Two possibilities can occur:

• If the density is high near x, the cell will be small, which provides good resolution.

• If the density is low, the cell will grow large, stopping only when higher-density regions are reached.

We can obtain a family of estimates by setting k_n = k_1√n and choosing different values for k_1.
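In one dimension the growing cell is simply an interval whose half-width is the distance to the k-th nearest sample; a minimal sketch (function name mine):

```python
import numpy as np

def knn_density(x, samples, k):
    """k-NN density estimate p_n(x) = (k/n) / V_n in 1-D, where V_n = 2 * r_k
    and r_k is the distance from x to its k-th nearest sample."""
    dists = np.sort(np.abs(np.asarray(samples, dtype=float) - x))
    r_k = dists[k - 1]
    return (k / len(samples)) / (2.0 * r_k)
```

With a single sample (n = k = 1) this reduces to 1/(2|x − x₁|), the case worked out on the next slide.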


Illustration

For k_n = √n = 1 (n = 1), the estimate becomes:

p_n(x) = (k_n/n)/V_n = 1/V_1 = 1/(2|x − x₁|)

Kn - Nearest neighbor estimation …

[figures: k_n-nearest-neighbor density estimates; e.g. k = 10, N = 200]

Kn - Nearest neighbor estimation …

◦ Estimation of a posteriori probabilities

Goal: estimate P(ωᵢ|x) from a set of n labeled samples.

◦ Place a cell of volume V around x and capture k samples.

◦ If kᵢ samples among the k turn out to be labeled ωᵢ, then:

p_n(x, ωᵢ) = (kᵢ/n)/V

An estimate for P_n(ωᵢ|x) is:

P_n(ωᵢ|x) = p_n(x, ωᵢ) / Σ_{j=1}^{c} p_n(x, ωⱼ) = kᵢ/k


Kn - Nearest neighbor estimation …

• kᵢ/k is the fraction of the samples within the cell that are labeled ωᵢ

•For minimum error rate, the most frequently represented category within the cell is selected

• If k is large and the cell sufficiently small, the performance will approach the best possible
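The kᵢ/k rule amounts to counting labels among the k captured samples; a tiny sketch (function name mine):

```python
from collections import Counter

def knn_posteriors(neighbor_labels):
    """P_n(w_i | x) = k_i / k: fraction of the k captured samples labeled w_i."""
    k = len(neighbor_labels)
    return {label: count / k for label, count in Counter(neighbor_labels).items()}

post = knn_posteriors(["w2", "w2", "w1"])  # {'w2': 2/3, 'w1': 1/3}
```

Picking the label with the largest posterior here is exactly the "most frequently represented category" rule above.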


Nearest Neighbor

• The nearest-neighbor rule

• Let Dₙ = {x₁, x₂, …, xₙ} be a set of n labeled prototypes.

• Let x′ ∈ Dₙ be the closest prototype to a test point x.

• The nearest-neighbor rule for classifying x is to assign it the label associated with x′.


Nearest Neighbor

• Nearest-neighbor rule is a sub-optimal procedure

• The nearest-neighbor rule leads to an error rate greater

than the minimum possible: the Bayes rate

• If the number of prototypes is large (unlimited), the error

rate of the nearest-neighbor classifier is never worse than

twice the Bayes rate (it can be demonstrated!)

• As n → ∞, it is always possible to find x′ sufficiently close to x so that:

P(ωᵢ|x′) ≈ P(ωᵢ|x)


Bounds on Error Rate of k-Nearest Neighbor Rule

• As k gets larger, the error rate approaches the Bayes rate

• k should be a small fraction of the total number of

samples


Nearest Neighbor …

• If P(ω_m|x) ≈ 1, then the nearest-neighbor selection is almost always the same as the Bayes selection


Nearest Neighbor …

• The nearest-neighbor classifier effectively partitions the feature space into cells, each consisting of all points closer to a given training point x′ than to any other training point.

• All points in such a cell are thus labeled by the category of that training point: this is the Voronoi tessellation of the space.


The k-Nearest-Neighbor Rule

• Classify x by assigning it the label most frequently represented among its k nearest samples, i.e., use a voting scheme


Example

Example: k = 3 (odd value) and x = (0.10, 0.25)ᵗ

Prototypes and labels:

(0.15, 0.35)  ω₁
(0.10, 0.28)  ω₂
(0.09, 0.30)  ω₅
(0.12, 0.20)  ω₂

The three closest vectors to x, with their labels, are:
{(0.10, 0.28, ω₂); (0.09, 0.30, ω₅); (0.12, 0.20, ω₂)}

One voting scheme assigns the label ω₂ to x, since ω₂ is the most frequently represented label among them.
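The vote can be reproduced directly (Euclidean distance assumed; function names are mine):

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=3):
    """Assign x the label most frequent among its k nearest prototypes."""
    d = np.linalg.norm(np.asarray(prototypes, dtype=float) - np.asarray(x, dtype=float),
                       axis=1)
    nearest = np.argsort(d)[:k]               # indices of the k closest prototypes
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

protos = [(0.15, 0.35), (0.10, 0.28), (0.09, 0.30), (0.12, 0.20)]
labels = ["w1", "w2", "w5", "w2"]
print(knn_classify((0.10, 0.25), protos, labels, k=3))  # → w2
```

Two of the three nearest prototypes carry label ω₂, so the majority vote returns ω₂, as in the slide.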