ICS583 112: Non-Parametric Density Estimation

Uploaded by alasidig; posted 11-May-2017.

TRANSCRIPT

Acknowledgement

• The material in these slides is based on:

• R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., John Wiley & Sons, 2001.

• S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed., 2009.

• C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

• Lectures of Prof. Andrew W. Moore, School of Computer Science, Carnegie Mellon University.

• And other internet resources.


Non-Parametric Density Estimation

• Introduction

• Histogram-based estimation

• Parzen windows estimation

• Nearest neighbor estimation


Introduction

• In supervised learning we assumed that the parametric

forms of underlying (class conditional) density functions

were known.

• However, in practical pattern recognition this assumption

may be doubtful.

• Most classical parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multi-modal densities.

• Non-parametric techniques can be used with arbitrary

distributions and without knowing the parametric form of

the underlying (class conditional) densities.


Density Estimation and Parzen Window

Types of non-parametric methods

1. Estimation of the density functions p(x|ωᵢ) using sample patterns.

2. Estimation of the a posteriori probabilities P(ωᵢ|x) directly from sample patterns or prototypes.

We will consider two approaches for both types of non-parametric methods above:

1. Parzen-window-based

2. k-nearest-neighbor-based


Density estimation

• Density estimation - estimating the probability density function

p(x) based on a given set of training samples D = {x1,...,xn}.

• The estimated density is denoted by p̂(x).

• Assume that the training samples are independent and identically

distributed (i.i.d.) - each has the same probability distribution as

the others and all are mutually independent and distributed

according to p(x).

• The difference between parametric and non-parametric

estimation:

1. In the non-parametric case we try to estimate a function, the distribution p(x), instead of a parameter vector.

2. We have a finite number of training samples, so there will be some error in the estimated function.


Histograms

• Histograms are the simplest approach to density estimation.

• The feature space is divided into m equal-sized cells or bins Bᵢ.

• The number nᵢ, i = 1,…,m, of training samples falling in each bin is counted, and the estimate within bin Bᵢ is

p̂(x) = nᵢ / (nV),

where n is the total number of samples and V is the volume of a cell. (All cells have equal volume, so the index is not needed.)

• The histogram estimate is not a very good way to estimate

densities, especially when there are many features. It leads

to discontinuous density estimates.
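The bin-count estimate above can be sketched in a few lines of Python; this is a minimal 1-D version, and the function name and bin count are my own choices, not from the slides:

```python
import numpy as np

def histogram_density(samples, x, n_bins=10):
    """Histogram estimate: p_hat(x) = n_i / (n * V) for the bin B_i containing x."""
    samples = np.asarray(samples, dtype=float)
    counts, edges = np.histogram(samples, bins=n_bins)
    V = edges[1] - edges[0]                        # all bins have equal width (volume)
    i = np.searchsorted(edges, x, side="right") - 1
    i = min(max(i, 0), n_bins - 1)                 # clamp x into the outermost bins
    return counts[i] / (len(samples) * V)
```

For samples spread evenly over [0, 1], the estimate at an interior point is close to the uniform density 1, while the estimate is constant inside each bin, which is exactly the discontinuity problem mentioned above.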


Histograms (cont.)


Density estimation

In words: place a segment of length h centered at a point x̂ and count the k points that fall inside it. With N samples in total, the probability mass of the segment is estimated by P ≈ k/N, so

p̂(x) ≈ p̂(x̂) = (1/h)(k/N), for x ∈ [x̂ − h/2, x̂ + h/2]

• If p(x) is continuous, p̂(x) → p(x) as N → ∞, provided that

h_N → 0, k_N → ∞, k_N/N → 0


Density estimation

• Consider the problem of selecting k labeled balls without

replacement from an urn containing n balls.

• In how many different ways may we select those k balls?

There are

C(n, k) = n! / (k! (n − k)!)

combinations (subsets) containing k elements, where n! = n · (n − 1) · · · 2 · 1 and 0! = 1.
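As a quick check of the counting formula (using only the standard library; the example numbers are mine):

```python
from math import comb, factorial

# C(5, 2) = 5! / (2! * 3!) = 10 ways to choose 2 balls out of 5
assert comb(5, 2) == factorial(5) // (factorial(2) * factorial(3)) == 10
```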


Density estimation

• Assume that we wish to estimate the value of the density function p at x

based on training samples x1 , . . . , xn.

• If we consider a region R around x, then the probability that a training sample xⱼ falls in the region R is

P = ∫_R p(x′) dx′ ----- (1)

P is an averaged version of the density function p(x).

Assume that k of the n training samples fall in the region R. The probability that exactly k of the training samples fall into region R is given by the binomial density

P_k = C(n, k) P^k (1 − P)^(n−k) ----- (2)


Density estimation

The expected value of k is

E[k] = nP ----- (3)

Estimating E[k] by the observed k′ leads to the estimate P′ = k′/n. Therefore the ratio k/n is a good estimate for the probability P, and hence for the density function p.

If p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write

∫_R p(x′) dx′ ≈ p(x) V ----- (4)

where x is a point within R and V the volume enclosed by R.

Combining equations (1), (3), and (4):

p(x) ≈ (k/n) / V ----- (5)
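Equations (2) and (3) can be verified numerically; the sketch below (function and variable names are mine) computes the binomial density and checks that the mean of k equals nP:

```python
from math import comb

def binom_pmf(k, n, P):
    """Eq. (2): probability that exactly k of n i.i.d. samples fall in region R."""
    return comb(n, k) * P**k * (1 - P)**(n - k)

n, P = 100, 0.3
mean_k = sum(k * binom_pmf(k, n, P) for k in range(n + 1))  # E[k] by direct summation
assert abs(mean_k - n * P) < 1e-9                           # eq. (3): E[k] = nP
```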



Density estimation …

• Variance: the smaller the h, the higher the variance.

[figures: estimates for h = 0.1, N = 1000 and h = 0.8, N = 1000]


Density estimation …

[figure: h = 0.1, N = 10000]

• The larger N is, the better the accuracy.


CURSE OF DIMENSIONALITY

• In all the methods so far, we have seen that the higher the number of points N, the better the resulting estimate.

• If in the one-dimensional space an interval filled with N points is adequate (for good estimation), then in the two-dimensional space the corresponding square will require N² points, and in the ℓ-dimensional space the ℓ-dimensional cube will require N^ℓ points.

• The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.


Density estimation …

• Condition for convergence

The fraction k/(nV) is a space-averaged value of p(x); p(x) itself is obtained only in the limit V → 0.

If V → 0 and k = 0: lim p(x) = 0 (for fixed n). This is the case where no samples fall in R: an uninteresting case!

If V → 0 and k ≠ 0: the estimate diverges, lim p(x) = ∞: also an uninteresting case!


Density estimation …

• The volume V needs to approach 0 if we are to obtain p(x) rather than an averaged version of it.

• Practically, V cannot be allowed to become arbitrarily small, since the number of samples is always limited.

• One will have to accept a certain amount of variance in the ratio k/n.


Unlimited number of samples

Theoretically, if an unlimited number of samples is

available, we can circumvent this difficulty

To estimate the density of x, we form a sequence of regions

R1, R2,…containing x: the first region contains one sample,

the second two samples and so on.

Let Vn be the volume of Rn, kn the number of samples

falling in Rn and pn(x) be the nth estimate for p(x):

pn(x) = (kn/n)/Vn (7)


Unlimited number of samples

For lim_{n→∞} p_n(x) = p(x), three conditions are required:

1. lim_{n→∞} V_n = 0

2. lim_{n→∞} k_n = ∞

3. lim_{n→∞} k_n/n = 0

The first condition assures that the space-averaged P/V converges to p(x).

The second condition assures that the frequency ratio k_n/n converges to the probability P.

The third condition states that the number of samples falling in the small region R_n is always a negligibly small fraction of the total number of samples. This is required for p_n(x) to converge.


Unlimited number of samples

There are two ways of obtaining sequences of regions that satisfy these conditions, so that p_n(x) → p(x):

1. Shrink an initial region by specifying V_n as some function of n, e.g. V_n = 1/√n.

This is called "the Parzen-window estimation method".

2. Specify k_n as some function of n, e.g. k_n = √n, and let V_n grow until it encloses the k_n nearest neighbors of x.

This is called "the k_n-nearest-neighbor estimation method".


Parzen windows - An example

• Assume that the region R_n is a d-dimensional hypercube. If h_n is the length of a side of the hypercube, its volume is given by V_n = h_n^d.

• We can obtain an analytic expression for k_n – the number of samples falling into the hypercube – by defining the following window function:

φ(u) = 1 if |u_j| ≤ 1/2 for j = 1,…,d; φ(u) = 0 otherwise.

• φ((x − x_i)/h_n) is equal to unity if x_i falls within the hypercube of volume V_n centered at x, and equal to zero otherwise.


Parzen windows - An example

It follows that φ((x − x_i)/h_n) = 1 if x_i falls in the hypercube of volume V_n centered at x, and φ((x − x_i)/h_n) = 0 otherwise.

The number of samples in this hypercube is therefore given by

k_n = Σ_{i=1}^{n} φ((x − x_i)/h_n)

By substituting k_n in equation (7), we obtain the following estimate:

p_n(x) = (k_n/n)/V_n = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n)

p_n(x) estimates p(x) as an average of functions of x and the samples x_i (i = 1,…,n). These functions can be general!
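The hypercube count and the resulting estimate can be sketched numerically; this is a minimal version assuming Euclidean coordinates, with function names of my own choosing:

```python
import numpy as np

def hypercube_window(u):
    """phi(u) = 1 if |u_j| <= 1/2 for every coordinate j, else 0."""
    return float(np.all(np.abs(u) <= 0.5))

def parzen_hypercube(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/h**d) * phi((x - x_i)/h)."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:                 # 1-D data: one sample per row
        samples = samples[:, None]
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d = samples.shape[1]
    k = sum(hypercube_window((x - xi) / h) for xi in samples)
    return k / (len(samples) * h**d)

# 1-D check: the cube of side 1 around x = 0.2 covers [-0.3, 0.7],
# so 2 of the 3 samples fall inside, giving p_n(x) = 2 / (3 * 1)
print(parzen_hypercube(0.2, [0.0, 0.4, 1.0], 1.0))
```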


Parzen windows

• The Parzen-window density estimate using n training samples and the window function φ is defined by

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n)

• The estimate pn(x) is an average of (window) functions.

Usually the window function has its maximum at the origin

and its values become smaller when we move further away

from the origin. Then each training sample is contributing

to the estimate in accordance with its distance from x.


Illustration

The behavior of the Parzen-window method; case where p(x) ~ N(0,1).

Let φ(u) = (1/√(2π)) exp(−u²/2) and h_n = h_1/√n (n > 1), where h_1 is a known parameter.

Thus

p_n(x) = (1/n) Σ_{i=1}^{n} (1/h_n) φ((x − x_i)/h_n)

is an average of normal densities centered at the samples x_i.


Parzen windows …

Numerical results:

For n = 1 and h_1 = 1:

p_1(x) = φ(x − x_1) = (1/√(2π)) e^{−(x−x_1)²/2} ~ N(x_1, 1)

For n = 10 and h_1 = 0.1, the contributions of the individual samples are clearly observable!

Parzen windows … [figures]

Analogous results are also obtained in two dimensions as

illustrated:


Case where p(x) = λ₁·U(a,b) + λ₂·T(c,d) (unknown density: a mixture of a uniform and a triangle density)

Parzen windows … [figures]

Parzen windows example

Question: Given a set of five data points x₁ = 2, x₂ = 2.5, x₃ = 3, x₄ = 1 and x₅ = 6, find the Parzen probability density function (pdf) estimate at x = 3, using a Gaussian window function with σ = 1 (window width h = 1).


Parzen windows example …

x₁: (1/√(2π)) e^{−(2−3)²/2} = 0.2420

x₂: (1/√(2π)) e^{−(2.5−3)²/2} = 0.3521

x₃: (1/√(2π)) e^{−(3−3)²/2} = 0.3989

x₄: (1/√(2π)) e^{−(1−3)²/2} = 0.0540

x₅: (1/√(2π)) e^{−(6−3)²/2} = 0.0044

p(x = 3) = (0.2420 + 0.3521 + 0.3989 + 0.0540 + 0.0044)/5 = 0.2103
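The slide's arithmetic can be checked with a short script (a Gaussian window with h = 1; the function name is my own):

```python
import numpy as np

def gaussian_parzen(x, samples, h=1.0):
    """1-D Parzen estimate with a standard-normal window phi and width h."""
    u = (x - np.asarray(samples, dtype=float)) / h
    return float(np.mean(np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)) / h)

p = gaussian_parzen(3.0, [2.0, 2.5, 3.0, 1.0, 6.0])
print(round(p, 4))  # → 0.2103, matching the slide
```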

Parzen windows … [figures]

Parzen windows notes

• The window width h_n (equivalently, the volume V_n) is the most critical parameter in the Parzen-window approach.

• It is selected by cross-validation: a separate validation set, or a portion of the training set, is set aside for validation.

• The classifier is trained using different values of h_n.

• The h_n that results in the smallest error on the validation set is selected as the optimal one.

This technique is normally used with algorithms that have parameters to tune.
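For density estimation itself a common concrete selection criterion is the held-out log-likelihood rather than classification error; the sketch below is my own formulation, not from the slides, and picks the width that scores best for a 1-D Gaussian-window estimate:

```python
import numpy as np

def parzen_loglik(val, train, h):
    """Mean log-likelihood of validation points under a Gaussian-window Parzen fit."""
    u = (val[None, :] - train[:, None]) / h          # (n_train, n_val) pairwise scaled gaps
    p = np.exp(-0.5 * u**2).mean(axis=0) / (h * np.sqrt(2.0 * np.pi))
    return float(np.log(p + 1e-300).mean())          # guard against log(0)

def select_h(train, val, candidates):
    """Cross-validation: keep the window width scoring best on held-out data."""
    return max(candidates, key=lambda h: parzen_loglik(val, train, h))

train = np.linspace(-2.0, 2.0, 50)
val = np.linspace(-1.0, 1.0, 20)
print(select_h(train, val, [0.01, 0.5, 10.0]))  # a moderate width wins here
```

Too small an h spikes around individual training points (low likelihood between them); too large an h over-smooths; a moderate width scores best, mirroring the variance discussion earlier in the slides.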


Classification example

In classifiers based on Parzen-window estimation:

• We estimate the densities for each category and

classify a test point by the label corresponding to the

maximum posterior

• The decision region for a Parzen-window classifier

depends upon the choice of window function as

illustrated in the following figure.

[figure: Parzen-window decision regions for different window choices]

Example of Nearest Neighbor Rule

• Two class problem: yellow triangles and blue

squares. Circle represents the unknown sample x

and as its nearest neighbor comes from class θ1, it

is labeled as class θ1.


Example of k-NN rule with k = 3

• There are two classes: yellow triangles and blue

squares. The circle represents the unknown sample

x and as two of its nearest neighbors come from

class θ2, it is labeled class θ2.

• The number k should be:

• 1) large, to minimize the probability of misclassifying x;

• 2) small (with respect to the number of samples), so that the k nearest points are close enough to x to give an accurate estimate of the true class of x.


Kn - Nearest neighbor estimation

• Goal: a solution to the problem of the unknown "best" window function.

• Let the cell volume be a function of the training data.

• Center a cell about x and let it grow until it captures k_n samples (k_n = f(n)).

• These samples are called the k_n nearest neighbors of x.

• Two possibilities can occur:

• If the density is high near x, the cell will be small, which provides good resolution.

• If the density is low, the cell will grow large, stopping only when higher-density regions are reached.

We can obtain a family of estimates by setting k_n = k_1√n and choosing different values for k_1.
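In one dimension the growing cell is simply an interval whose half-width is the distance to the k-th nearest sample; a minimal sketch (function name mine):

```python
import numpy as np

def knn_density(x, samples, k):
    """k-NN density estimate p_n(x) = (k/n) / V_n in 1-D, where V_n = 2 * r_k
    and r_k is the distance from x to its k-th nearest sample."""
    dists = np.sort(np.abs(np.asarray(samples, dtype=float) - x))
    r_k = dists[k - 1]
    return (k / len(samples)) / (2.0 * r_k)
```

With a single sample (n = k = 1) this reduces to 1/(2|x − x₁|), the case worked out on the next slide.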


Illustration

For k_n = √n = 1 (n = 1), the estimate becomes:

p_n(x) = (k_n/n)/V_n = 1/V_1 = 1/(2|x − x₁|)

Kn - Nearest neighbor estimation …

[figures: k_n-nearest-neighbor density estimates; e.g. k = 10, N = 200]

Kn - Nearest neighbor estimation …

◦ Estimation of a posteriori probabilities

Goal: estimate P(ωᵢ|x) from a set of n labeled samples.

◦ Place a cell of volume V around x and capture k samples.

◦ If kᵢ samples among the k turn out to be labeled ωᵢ, then:

p_n(x, ωᵢ) = (kᵢ/n)/V

An estimate for P_n(ωᵢ|x) is:

P_n(ωᵢ|x) = p_n(x, ωᵢ) / Σ_{j=1}^{c} p_n(x, ωⱼ) = kᵢ/k


Kn - Nearest neighbor estimation …

• kᵢ/k is the fraction of the samples within the cell that are labeled ωᵢ

•For minimum error rate, the most frequently represented category within the cell is selected

• If k is large and the cell sufficiently small, the performance will approach the best possible
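The kᵢ/k rule amounts to counting labels among the k captured samples; a tiny sketch (function name mine):

```python
from collections import Counter

def knn_posteriors(neighbor_labels):
    """P_n(w_i | x) = k_i / k: fraction of the k captured samples labeled w_i."""
    k = len(neighbor_labels)
    return {label: count / k for label, count in Counter(neighbor_labels).items()}

post = knn_posteriors(["w2", "w2", "w1"])  # {'w2': 2/3, 'w1': 1/3}
```

Picking the label with the largest posterior here is exactly the "most frequently represented category" rule above.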


Nearest Neighbor

• The nearest-neighbor rule

• Let Dₙ = {x₁, x₂, …, xₙ} be a set of n labeled prototypes.

• Let x′ ∈ Dₙ be the closest prototype to a test point x.

• The nearest-neighbor rule for classifying x is to assign it the label associated with x′.


Nearest Neighbor

• Nearest-neighbor rule is a sub-optimal procedure

• The nearest-neighbor rule leads to an error rate greater

than the minimum possible: the Bayes rate

• If the number of prototypes is large (unlimited), the error

rate of the nearest-neighbor classifier is never worse than

twice the Bayes rate (it can be demonstrated!)

• As n → ∞, it is always possible to find x′ sufficiently close to x so that:

P(ωᵢ|x′) ≈ P(ωᵢ|x)


Bounds on Error Rate of k-Nearest Neighbor Rule

• As k gets larger, the error rate approaches the Bayes rate

• k should be a small fraction of the total number of

samples


Nearest Neighbor …

• If P(ω_m|x) ≈ 1, then the nearest-neighbor selection is almost always the same as the Bayes selection


Nearest Neighbor …

• The nearest-neighbor classifier effectively partitions the feature space into cells, each consisting of all points closer to a given training point x′ than to any other training point.

• All points in such a cell are thus labeled by the category of that training point: this is the Voronoi tessellation of the space.


The k-Nearest-Neighbor Rule

• Classify x by assigning it the label most frequently represented among its k nearest samples, i.e., use a voting scheme


Example

Example: k = 3 (odd value) and x = (0.10, 0.25)ᵗ

Prototypes and labels:

(0.15, 0.35)  ω₁
(0.10, 0.28)  ω₂
(0.09, 0.30)  ω₅
(0.12, 0.20)  ω₂

The three closest vectors to x, with their labels, are:
{(0.10, 0.28, ω₂); (0.09, 0.30, ω₅); (0.12, 0.20, ω₂)}

One voting scheme assigns the label ω₂ to x, since ω₂ is the most frequently represented label among them.
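The vote can be reproduced directly (Euclidean distance assumed; function names are mine):

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=3):
    """Assign x the label most frequent among its k nearest prototypes."""
    d = np.linalg.norm(np.asarray(prototypes, dtype=float) - np.asarray(x, dtype=float),
                       axis=1)
    nearest = np.argsort(d)[:k]               # indices of the k closest prototypes
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

protos = [(0.15, 0.35), (0.10, 0.28), (0.09, 0.30), (0.12, 0.20)]
labels = ["w1", "w2", "w5", "w2"]
print(knn_classify((0.10, 0.25), protos, labels, k=3))  # → w2
```

Two of the three nearest prototypes carry label ω₂, so the majority vote returns ω₂, as in the slide.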