k-nearest neighbor classification on spatial data streams using p-trees maleq khan, qin ding,...

Post on 05-Jan-2016


k-Nearest Neighbor Classification on Spatial Data Streams Using P-trees

Maleq Khan, Qin Ding, William Perrizo; NDSU

Introduction

We explored distance-metric-based computation using P-trees

Defined a new distance metric, called HOB distance

Revealed some useful properties of P-trees

Developed a new method of nearest neighbor classification using P-trees, called Closed-KNN

Developed a new algorithm for k-clustering using P-trees, with efficient statistical computation from the P-trees

Overview

1. Data Mining - classification and clustering

2. Various distance metrics - Minkowski, Manhattan, Euclidean, Max, Canberra, chord, and HOB distance
   - Neighborhoods and decision boundaries

3. P-trees and their properties

4. k-nearest neighbor classification - Closed-KNN using Max and HOB distance

5. k-clustering
   - overview of existing algorithms
   - our new algorithm
   - computation of mean and variance from the P-trees

Data Mining

extracting knowledge from a large amount of data

Functionalities: feature selection, association rule mining, classification & prediction, cluster analysis, outlier analysis, evolution analysis

Information Pyramid

[Figure: information pyramid. Labels: Raw data, Useful Information, Data Mining; more data, less information.]

Classification

Predicting the class of a data object

Training data: class labels are known

Feature1  Feature2  Feature3  Class
a1        b1        c1        A
a2        b2        c2        A
a3        b3        c3        B

Sample with unknown class: <a, b, c> -> Classifier -> predicted class of the sample

also called Supervised learning

Types of Classifier

Eager classifier: Builds a classifier model in advance

e.g. decision tree induction, neural network

Lazy classifier: Uses the raw training data

e.g. k-nearest neighbor

Clustering

The process of grouping objects into classes, with the objective that the data objects are

• similar to the objects in the same cluster
• dissimilar to the objects in the other clusters.

[Figure: a two-dimensional space showing 3 clusters]

Clustering is often called unsupervised learning or unsupervised classification:

the class labels of the data objects are unknown

Distance Metric

Measures the dissimilarity between two data points.

A distance metric is a function, d, of two n-dimensional points X and Y, such that

d(X, Y) is positive definite: if X ≠ Y, d(X, Y) > 0; if X = Y, d(X, Y) = 0

d(X, Y) is symmetric: d(X, Y) = d(Y, X)

d(X, Y) satisfies the triangle inequality: d(X, Y) + d(Y, Z) ≥ d(X, Z)

Various Distance Metrics

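Let $X = (x_1, x_2, x_3, \ldots, x_n)$ and $Y = (y_1, y_2, y_3, \ldots, y_n)$

Minkowski distance or $L_p$ distance: $d_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

Manhattan distance ($p = 1$): $d_1(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$

Euclidean distance ($p = 2$): $d_2(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Max distance ($p = \infty$): $d_\infty(X, Y) = \max_{i=1}^{n} |x_i - y_i|$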

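An Example

A two-dimensional space with X = (2, 1), Y = (6, 4) and the corner point Z = (6, 1), so that XZ = 4 and ZY = 3:

Manhattan, $d_1(X, Y) = XZ + ZY = 4 + 3 = 7$

Euclidean, $d_2(X, Y) = XY = 5$

Max, $d_\infty(X, Y) = \max(XZ, ZY) = XZ = 4$

$d_1 \ge d_2 \ge d_\infty$

For any positive integer p, $d_p \ge d_{p+1}$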
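The worked example can be checked with a few lines of Python. This is only an illustrative sketch; the function names (`minkowski`, `manhattan`, `euclidean`, `max_distance`) are chosen here and do not come from the slides.

```python
import math

def minkowski(x, y, p):
    """L_p distance between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """L_1 distance: sum of coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """L_2 distance: straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def max_distance(x, y):
    """L_infinity distance: largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

# The points from the example slide.
X, Y = (2, 1), (6, 4)
d1, d2, dmax = manhattan(X, Y), euclidean(X, Y), max_distance(X, Y)
d3 = minkowski(X, Y, 3)
print(d1, d2, dmax)  # 7 5.0 4
```

The intermediate value `d3` (p = 3) falls between the Euclidean and Max distances, in line with the ordering stated on the slide.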

Some Other Distances

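Canberra distance: $d_c(X, Y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{x_i + y_i}$

Squared chord distance: $d_{sc}(X, Y) = \sum_{i=1}^{n} \left( \sqrt{x_i} - \sqrt{y_i} \right)^2$

Squared chi-squared distance: $d_{chi}(X, Y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$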

HOB Similarity

Higher Order Bit (HOB) similarity:

$\mathrm{HOBS}(A, B) = \max_{0 \le s \le m} \{ s : 1 \le i \le s \Rightarrow a_i = b_i \}$

A, B: two scalars (integers)

$a_i$, $b_i$: ith bit of A and B (left to right)

m: number of bits

Bit position: 1 2 3 4 5 6 7 8
x1:           0 1 1 0 1 0 0 1
y1:           0 1 1 1 1 1 0 1
HOBS(x1, y1) = 3

Bit position: 1 2 3 4 5 6 7 8
x2:           0 1 0 1 1 1 0 1
y2:           0 1 0 1 0 0 0 0
HOBS(x2, y2) = 4

HOB Distance

The HOB distance between two scalar values A and B:

$d_v(A, B) = m - \mathrm{HOBS}(A, B)$

In the previous example, HOBS(x1, y1) = 3 and HOBS(x2, y2) = 4, so

$d_v(x_1, y_1) = 8 - 3 = 5$ and $d_v(x_2, y_2) = 8 - 4 = 4$

The HOB distance between two points X and Y:

$d_H(X, Y) = \max_{i=1}^{n} d_v(x_i, y_i) = \max_{i=1}^{n} \{ m - \mathrm{HOBS}(x_i, y_i) \}$

In our example (considering 2-dimensional data):

$d_H(X, Y) = \max(5, 4) = 5$
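The HOB similarity and distance can be sketched in Python as follows. The function names and the default of 8 bits are choices made for this illustration, not part of the slides; the values reproduce the two-dimensional example.

```python
def hobs(a, b, m=8):
    """Number of consecutive equal bits of a and b, counted from the
    most significant of m bits."""
    s = 0
    for i in range(m - 1, -1, -1):
        if (a >> i) & 1 != (b >> i) & 1:
            break
        s += 1
    return s

def hob_distance_scalar(a, b, m=8):
    """d_v(A, B) = m - HOBS(A, B)."""
    return m - hobs(a, b, m)

def hob_distance(x, y, m=8):
    """d_H(X, Y): maximum scalar HOB distance over all dimensions."""
    return max(hob_distance_scalar(a, b, m) for a, b in zip(x, y))

# Values from the slide: X = (x1, x2), Y = (y1, y2).
x = (0b01101001, 0b01011101)
y = (0b01111101, 0b01010000)
print(hobs(x[0], y[0]), hobs(x[1], y[1]))  # 3 4
print(hob_distance(x, y))                  # 5
```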

HOB Distance Is a Metric

HOB distance is positive definite:

     if X = Y, $d_H(X, Y) = 0$

     if X ≠ Y, $d_H(X, Y) > 0$

HOB distance is symmetric:

$d_H(X, Y) = d_H(Y, X)$

HOB distance satisfies the triangle inequality:

$d_H(X, Y) + d_H(Y, Z) \ge d_H(X, Z)$

Neighborhood of a Point

Neighborhood of a target point, T, is a set of points, S, such that X ∈ S if and only if d(T, X) ≤ r

If X is a point on the boundary, d(T, X) = r

[Figure: neighborhoods of T of width 2r under the Manhattan, Euclidean, Max and HOB distances, each with a boundary point X]

Decision Boundary

The decision boundary between points A and B is the locus of the points X satisfying the condition d(A, X) = d(B, X)

[Figure: decision boundary D between A and B, separating regions R1 and R2, with d(A,X) and d(B,X) marked]

[Figure: Manhattan, Euclidean and Max decision boundaries between A and B, in two panels labeled "> 45" and "< 45"]

[Figure: decision boundary for HOB distance, perpendicular to the axis that makes maximum distance]

Remotely Sensed Imagery Data

An image is a collection of pixels

Each pixel represents a square area on the ground

Several attributes or bands are associated with each pixel

e.g. red, green, blue reflectance values, soil moisture, nitrate

Band Sequential (BSQ) file: one file for each band

Bit Sequential (bSQ) file: one file for each bit of each band

Bi,j is the bSQ file for the jth bit of the ith band

Peano count-Tree or P-tree

We form one P-tree from each bSQ file. Pi,j is the basic P-tree for bit j of band i

• Root of the P-tree is the count of 1 bits in the entire image
• Root has 4 children with the counts of the 4 quadrants
• Recursively divide the quadrants until there is only one bit in the quadrant, unless the node is pure0 or pure1

Example 8 x 8 bit image:

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

P-tree of the image (children listed under their parent; the root count is 55):

55
  16
  8
    3 -> 1110
    0
    4
    1 -> 0010
  15
    4
    4
    3 -> 1101
    4
  16

Pure1 node: all bits are 1

Peano Mask Tree (PMT)

PMT of the same image:

m
  1
  m
    m -> 1110
    0
    1
    m -> 0010
  m
    1
    1
    m -> 1101
    1
  1

0 represents pure0 node

1 represents pure1 node

m represents mixed node
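The construction rule can be sketched as a short recursive function. This is a minimal illustration, assuming the image is a square list of lists whose side is a power of two; the `(count, children)` tuple layout and the name `build_ptree` are choices made here, not the file format used by the authors.

```python
def build_ptree(bits, r0=0, c0=0, size=None):
    """Return (count, children) for the square block whose top-left
    corner is (r0, c0). Pure0, pure1 and single-bit blocks are leaves."""
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size))
    if count == 0 or count == size * size or size == 1:
        return (count, [])
    half = size // 2
    # Quadrant order: upper-left, upper-right, lower-left, lower-right.
    children = [build_ptree(bits, r0 + dr, c0 + dc, half)
                for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))]
    return (count, children)

# The 8 x 8 bit image from the slide.
image = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]
tree = build_ptree(image)
print(tree[0])                    # 55
print([c[0] for c in tree[1]])    # [16, 8, 15, 16]
```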

P-tree ANDing

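[Figure: ANDing of two P-trees, shown as mask trees]

Operand 1: root m with children 1, m (Subtree1), 0, m (Subtree2)

Operand 2: root m with children m (Subtree3), 0, 0, m (Subtree4)

Result of AND: root m with children m (Subtree3), 0, 0, m (Subtree5)

A pure1 node ANDed with a subtree gives that subtree, a pure0 node ANDed with anything gives pure0, and Subtree5 is the AND of Subtree2 and Subtree4.

ORing and COMPLEMENT operations are performed in a similar way

There are also some other P-tree structures (such as PVT) and ANDing algorithms that are beyond the scope of this presentation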
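A recursive AND over mask trees might look like the sketch below. The representation is an assumption made for illustration: a node is `0` (pure0), `1` (pure1) or a list of four child nodes (mixed). The subtree contents `s1` to `s4` are hypothetical, since the slide does not show them.

```python
def pmt_and(a, b):
    """AND two mask-tree nodes: 0 = pure0, 1 = pure1, list = mixed."""
    if a == 0 or b == 0:
        return 0              # pure0 dominates
    if a == 1:
        return b              # pure1 is the identity for AND
    if b == 1:
        return a
    children = [pmt_and(x, y) for x, y in zip(a, b)]
    if all(c == 0 for c in children):
        return 0              # collapse a node that became pure0
    return children

# Hypothetical leaf-level subtrees (four single bits each).
s1, s2 = [1, 0, 1, 0], [1, 1, 0, 0]
s3, s4 = [0, 1, 1, 0], [1, 0, 0, 1]

operand1 = [1, s1, 0, s2]     # children 1, m, 0, m as on the slide
operand2 = [s3, 0, 0, s4]     # children m, 0, 0, m
result = pmt_and(operand1, operand2)
print(result)                 # [[0, 1, 1, 0], 0, 0, [1, 0, 0, 0]]
```

The result has the shape m, 0, 0, m, with the first child equal to Subtree3 and the last child equal to the AND of Subtree2 and Subtree4.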

Value & Interval P-tree

The value P-tree Pi(v) represents the pixels that have value v for band i:

there is a 1 in Pi(v) at a pixel location if that pixel has the value v for band i;

otherwise there is a 0 in Pi(v).

Let bj = jth bit of the value v, and Pi,j = the basic P-tree for band i, bit j.

Define Pti,j = Pi,j if bj = 1

             = P´i,j if bj = 0

Then Pi(v) = Pti,1 AND Pti,2 AND Pti,3 AND … AND Pti,m

The interval P-tree, Pi(v1, v2) = Pi(v1) OR Pi(v1+1) OR Pi(v1+2) OR … OR Pi(v2)
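The value and interval definitions can be illustrated with flat bitmaps standing in for P-trees. This is an assumption made to keep the sketch short: each Python integer below holds one bit per pixel, with no quadrant compression, and the function names and sample band values are invented for the example.

```python
def basic_bitmaps(values, m=8):
    """maps[j-1] has bit p set when bit j (1 = most significant)
    of pixel p's value is 1. Stands in for the basic P-trees Pi,j."""
    maps = []
    for j in range(1, m + 1):
        bitmap = 0
        for p, v in enumerate(values):
            if (v >> (m - j)) & 1:
                bitmap |= 1 << p
        maps.append(bitmap)
    return maps

def value_bitmap(maps, v, n_pixels, m=8):
    """Pi(v): AND of Pi,j where bit j of v is 1, else its complement."""
    full = (1 << n_pixels) - 1
    result = full
    for j in range(1, m + 1):
        bj = (v >> (m - j)) & 1
        result &= maps[j - 1] if bj else (~maps[j - 1] & full)
    return result

def interval_bitmap(maps, v1, v2, n_pixels, m=8):
    """Pi(v1, v2): OR of the value bitmaps for v1..v2."""
    result = 0
    for v in range(v1, v2 + 1):
        result |= value_bitmap(maps, v, n_pixels, m)
    return result

def rc(bitmap):
    """Root count: number of 1 bits."""
    return bin(bitmap).count("1")

band = [105, 104, 105, 200, 107, 3]   # hypothetical band values
maps = basic_bitmaps(band)
print(rc(value_bitmap(maps, 105, len(band))))          # 2
print(rc(interval_bitmap(maps, 104, 107, len(band))))  # 4
```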

Notations

P1 & P2 : P1 AND P2

P1 | P2 : P1 OR P2

P´ : COMPLEMENT of P

Pi, j : basic P-tree for band i bit j.

Pi(v) : value P-tree for value v of band i.

Pi(v1, v2) : interval P-tree for interval [v1, v2] of band i.

P0 : the pure0-tree, a P-tree whose root node is pure0.

P1 : the pure1-tree, a P-tree whose root node is pure1.

rc(P) : root count of P-tree P

N : number of pixels

n : number of bands

m : number of bits

Properties of P-trees

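In properties 1 to 3, P0 and P1 denote the pure0-tree and the pure1-tree; in properties 4 and 6, P1 and P2 are any two P-trees.

1. a) rc(P) = 0 ⇔ P = P0
   b) rc(P) = N ⇔ P = P1

2. a) P & P0 = P0
   b) P & P1 = P
   c) P & P = P
   d) P & P´ = P0

3. a) P | P0 = P
   b) P | P1 = P1
   c) P | P = P
   d) P | P´ = P1

4. rc(P1 | P2) = 0 ⇔ rc(P1) = 0 and rc(P2) = 0

5. v1 ≠ v2 ⇒ rc{Pi(v1) & Pi(v2)} = 0

6. rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)

7. rc{Pi(v1) | Pi(v2)} = rc{Pi(v1)} + rc{Pi(v2)}, where v1 ≠ v2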

P-tree Header

Header of a P-tree file to make a generalized P-tree structure

Field                         Size
Format Code                   1 word
Fan-out                       2 words
# of levels                   2 words
Root count                    4 words
Length of the body in bytes   4 words

The header is followed by the body of the P-tree

k-Nearest Neighbor Classification

1) Select a suitable value for k

2) Determine a suitable distance metric

3) Find the k nearest neighbors of the sample using the selected metric

4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs

5) Assign the plurality class to the sample to be classified.

Closed-KNN

T is the target pixel.

With k = 3, to find the third nearest neighbor, KNN arbitrarily selects one point from the boundary line of the neighborhood

Closed-KNN includes all points on the boundary

Closed-KNN yields higher classification accuracy than traditional KNN

Searching Nearest Neighbors

We begin searching by finding the exact matches.

Let the target sample, T = <v1, v2, v3, …, vn>

The initial neighborhood is the point T.

We expand the neighborhood along each dimension:

along dimension i, [vi] is expanded to the interval [vi – ai, vi + bi], for some positive integers ai and bi.

Continue expansion until there are at least k points in the neighborhood.

HOB Similarity Method for KNN

In this method, we match bits of the target to the training data

First we find matches in all 8 bits of each band (exact matching)

Let bi,j = jth bit of the ith band of the target pixel.

Define Pti,j = Pi,j, if bi,j = 1

             = P´i,j, otherwise

And Pvi,1-j = Pti,1 & Pti,2 & Pti,3 & … & Pti,j

Pnn = Pv1,1-8 & Pv2,1-8 & Pv3,1-8 & … & Pvn,1-8

If rc(Pnn) < k, update Pnn = Pv1,1-7 & Pv2,1-7 & Pv3,1-7 & … & Pvn,1-7, and continue dropping one more low-order bit until rc(Pnn) ≥ k

An Analysis of HOB Method

Let the ith band value of the target T be vi = 105 = 01101001b

Exact matching: [01101001] = [105, 105]

1st expansion: [0110100-] = [01101000, 01101001] = [104, 105]

2nd expansion: [011010--] = [01101000, 01101011] = [104, 107]

Does not expand evenly on both sides: target = 105, but the center of [104, 107] = (104 + 107) / 2 = 105.5

And expands by powers of 2.

Computationally very cheap
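The intervals in this analysis follow from masking out the low-order bits of the target value. A small sketch, with a function name (`hob_interval`) chosen here for illustration:

```python
def hob_interval(v, matched_bits, m=8):
    """Interval of m-bit values that agree with v in the first
    `matched_bits` high-order bits."""
    free = m - matched_bits            # low-order bits left unconstrained
    low = (v >> free) << free          # free bits all 0
    high = low | ((1 << free) - 1)     # free bits all 1
    return low, high

target = 105  # 01101001b
for s in (8, 7, 6):
    low, high = hob_interval(target, s)
    print(s, (low, high), (low + high) / 2)
# 8 (105, 105) 105.0
# 7 (104, 105) 104.5
# 6 (104, 107) 105.5
```

Each dropped bit doubles the interval width, and the target is generally not at the center of the interval.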

Perfect Centering Method

Max distance metric provides a better neighborhood by
- keeping the target in the center
- and expanding by 1 on both sides

Initial neighborhood P-tree (exact matching): Pnn = P1(v1) & P2(v2) & P3(v3) & … & Pn(vn)

If rc(Pnn) < k, Pnn = P1(v1-1, v1+1) & P2(v2-1, v2+1) & … & Pn(vn-1, vn+1)

If rc(Pnn) < k, Pnn = P1(v1-2, v1+2) & P2(v2-2, v2+2) & … & Pn(vn-2, vn+2)

Computationally costlier than the HOB Similarity method

But slightly better classification accuracy

Finding the Plurality Class

Let Pc(i) be the value P-tree for class i

Plurality class = $\arg\max_i \; rc(P_c(i) \,\&\, P_{nn})$
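The closed-KNN search with perfect centering, followed by the plurality vote, can be sketched without P-trees. The version below is an illustration only: it scans a plain list of training points, which the P-tree method is designed to avoid, and the names, data and labels are hypothetical. It assumes a non-empty training set.

```python
from collections import Counter

def closed_knn_max(train, labels, target, k):
    """Grow a Max-distance box around `target` by 1 per step until it
    holds at least k points; every point in the final box votes."""
    r = 0
    while True:
        nn = [p for p, x in enumerate(train)
              if all(abs(a - t) <= r for a, t in zip(x, target))]
        if len(nn) >= k or len(nn) == len(train):
            break
        r += 1
    votes = Counter(labels[p] for p in nn)
    return votes.most_common(1)[0][0], nn

# Hypothetical two-band training pixels and class labels.
train = [(10, 10), (11, 10), (10, 12), (12, 10), (20, 20)]
labels = [1, 1, 2, 1, 2]
predicted, neighbors = closed_knn_max(train, labels, (10, 10), k=3)
print(predicted, neighbors)  # 1 [0, 1, 2, 3]
```

With k = 3 the box of radius 2 contains four points, because two points lie on its boundary; closed-KNN keeps both instead of picking one arbitrarily.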

Performance

Experimented on two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND

Data contains 6 bands: Red, Green, Blue reflectance values, Soil Moisture, Nitrate, and Yield (class label).

Band values range from 0 to 255 (8 bits)

Considering 8 classes or levels of yield values: 0 to 7

Performance – Accuracy

1997 Dataset:

[Figure: accuracy (%) versus training set size (no. of pixels, 256 to 262144). Series: KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-HOBS, P-tree: Perfect Centering (closed-KNN), P-tree: HOBS (closed-KNN)]

Performance - Accuracy (cont.)

1998 Dataset:

[Figure: accuracy (%) versus training set size (no. of pixels, 256 to 262144). Series: KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-HOBS, P-tree: Perfect Centering (closed-KNN), P-tree: HOBS (closed-KNN)]

Performance - Time

1997 Dataset: both axes in logarithmic scale

[Figure: per-sample classification time (sec, 0.00001 to 1) versus training set size (no. of pixels, 256 to 262144). Series: KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-HOBS, P-tree: Perfect Centering (closed-KNN), P-tree: HOBS (closed-KNN)]

Performance - Time (cont.)

1998 Dataset: both axes in logarithmic scale

[Figure: per-sample classification time (sec, 0.00001 to 1) versus training set size (no. of pixels, 256 to 262144). Series: KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-HOBS, P-tree: Perfect Centering (closed-KNN), P-tree: HOBS (closed-KNN)]

k-Clustering

Partitioning data into k clusters, C1, C2, …, Ck, so as to minimize some criterion function,

such as the sum of squared Euclidean distances measured from the centroid of the cluster, or total variance:

$\sum_{i=1}^{k} \sum_{p \in C_i} d_2^2(c_i, p)$, where $c_i$ is the centroid or mean of $C_i$

or the sum of the pair-wise weights:

$\sum_{i=1}^{k} \sum_{p, q \in C_i} c(p, q)$, where c is the weight function, usually the distance between p and q

k-Means Algorithm

1. Arbitrarily select k initial cluster centers

2. Assign each data point to its nearest center

3. Update the centers by the means of the clusters

4. Repeat steps 2 and 3 until there is no change

Good optimization, but very slow

Complexity O(nNkt), where n = # of dimensions, N = # of data points, k = # of clusters, t = # of iterations

To address the speed issue, other algorithms have been proposed that sacrifice some quality

Divisive Approach

1. Initially consider the whole space as one hyperbox

2. Select a hyperbox to split

3. Select an axis and cut-point

4. Split the selected hyperbox by a hyperplane perpendicular to the selected axis through the selected cut-point

5. Repeat steps 2-4 until there are k hyperboxes; each hyperbox is a cluster

The mean-split algorithm, the variance-based algorithm and our proposed new algorithm follow the divisive approach

They differ in the strategies for selecting the hyperbox, axis and cut-point.

Mean-Split Algorithm

The initial hyperbox (the whole space) is assigned a number k,

that is, k clusters will be formed from this hyperbox

Let L = number of clusters assigned to a hyperbox

Li clusters are assigned to the ith sub-hyperbox:

$L_i = L \left( \alpha \, \frac{n_i}{n_1 + n_2} + (1 - \alpha) \, \frac{V_i}{V_1 + V_2} \right)$, where $i = 1, 2$ and $0 \le \alpha \le 1$

$n_i$ = # of points and $V_i$ = volume of the ith sub-hyperbox

1. Select a hyperbox with L > 1

2. Select the axis with largest spread of projected data

3. Mean of the projected data is the cut-point

Fast but poor optimization

Variance-Based Algorithm

1. Select the hyperbox with largest variance

2. By checking each point on each dimension of the selected hyperbox, find the optimal cut-point, $t_{opt}$, that gives maximum variance reduction on the projected data:

$t_{opt} = \arg\max_t \left[ \sigma^2 - w_1 \sigma_{t1}^2 - w_2 \sigma_{t2}^2 \right]$

where $w_i$ and $\sigma_{ti}^2$ are the weight and variance of the ith interval (i = 1, 2)

Still computationally costly, but optimization is closer to k-means

Our Algorithm

When a new hyperbox is formed, find two means m1 and m2 for each dimension using the projected data:

a. Arbitrarily select two values for m1 and m2 (m1 < m2)

b. Update m1 = mean of the interval [0, (m1+m2)/2]

c. Update m2 = mean of the interval [(m1+m2)/2, upper_limit]

d. Repeat steps b and c until there is no change in m1 and m2.

Then:

1. Select the hyperbox and axis for which (m2 – m1) is largest

2. Cut-point = (m1 + m2) / 2
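The two-means iteration on one dimension can be sketched as below. This is an illustrative reading of steps a to d, not the authors' implementation: it works on a plain list of projected values, uses the same midpoint for both updates in each pass (the slide leaves this detail open), and stops early if one side becomes empty. The function name and sample data are hypothetical.

```python
def two_means(values, m1, m2, max_iter=100):
    """Iterate two 1-D means; return (m1, m2, cut_point)."""
    for _ in range(max_iter):
        cut = (m1 + m2) / 2
        left = [v for v in values if v <= cut]
        right = [v for v in values if v > cut]
        if not left or not right:
            break                      # degenerate split: keep current means
        new_m1 = sum(left) / len(left)
        new_m2 = sum(right) / len(right)
        if (new_m1, new_m2) == (m1, m2):
            break                      # converged
        m1, m2 = new_m1, new_m2
    return m1, m2, (m1 + m2) / 2

projected = [1, 2, 3, 11, 12, 13]      # hypothetical projected data
m1, m2, cut_point = two_means(projected, 0, 5)
print(m1, m2, cut_point)               # 2.0 12.0 7.0
```

Starting from the arbitrary guesses 0 and 5, the means settle on the two groups after two updates, and the cut-point falls midway between them.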

Our Algorithm (cont.)

We represent each cluster by a P-tree

the initial cluster is the pure1-tree, P1

Let PCi be the P-tree for cluster Ci

The P-trees for the two new clusters after splitting along axis j:

PCi1 = PCi & Pj(0, (m1+m2)/2)

PCi2 = PCi & Pj((m1+m2)/2, upper_limit)

Note: Pj((m1+m2)/2, upper_limit) = complement of Pj(0, (m1+m2)/2)

Computing Sum & Mean from P-trees

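For all points and for dimension or band i:

$\text{sum} = \sum_{j=0}^{n-1} 2^{n-1-j} \, rc(P_{i,j})$

$\text{mean} = \frac{1}{N} \sum_{j=0}^{n-1} 2^{n-1-j} \, rc(P_{i,j})$

For the points in a cluster:

$\text{sum} = \sum_{j=0}^{n-1} 2^{n-1-j} \, rc(P_{i,j} \,\&\, P_t)$

$\text{mean} = \frac{1}{rc(P_t)} \sum_{j=0}^{n-1} 2^{n-1-j} \, rc(P_{i,j} \,\&\, P_t)$

Here the template P-tree, $P_t$, is the P-tree representing the cluster. In these formulas n is the number of bits per value, and the weight $2^{n-1-j}$ treats bit 0 as the most significant bit.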
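The idea of computing a sum from per-bit counts can be checked numerically. The sketch below uses plain lists instead of P-trees, so each "root count" is obtained by counting bits directly; it assumes bit index 0 is the most significant bit, and the names, the boolean cluster mask and the sample values are invented for illustration.

```python
def bit_counts(values, n_bits=8, mask=None):
    """counts[j] = number of selected values whose bit j
    (0 = most significant) is 1; plays the role of rc(P_ij & P_t)."""
    if mask is None:
        mask = [True] * len(values)
    return [sum(1 for v, keep in zip(values, mask)
                if keep and (v >> (n_bits - 1 - j)) & 1)
            for j in range(n_bits)]

def sum_from_counts(counts):
    """Weighted sum of bit counts: sum_j 2^(n-1-j) * counts[j]."""
    n_bits = len(counts)
    return sum(2 ** (n_bits - 1 - j) * c for j, c in enumerate(counts))

band = [105, 104, 200, 7]                    # hypothetical band values
total = sum_from_counts(bit_counts(band))
mean = total / len(band)

cluster = [True, True, False, False]         # hypothetical cluster membership
cluster_total = sum_from_counts(bit_counts(band, mask=cluster))
cluster_mean = cluster_total / sum(cluster)
print(total, mean, cluster_total, cluster_mean)  # 416 104.0 209 104.5
```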

Computing Variance from P-trees

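$\text{Variance} = \frac{1}{N} \sum (x - \mu)^2 = \frac{1}{N} \sum x^2 - \mu^2$

For all points in the space:

$\sum x^2 = \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} 2^{2n-2-j-k} \, rc(P_{i,j} \,\&\, P_{i,k})$

For the points in a cluster (with N replaced by $rc(P_t)$ and $\mu$ the cluster mean):

$\sum x^2 = \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} 2^{2n-2-j-k} \, rc(P_{i,j} \,\&\, P_{i,k} \,\&\, P_t)$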
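The sum of squares can likewise be assembled from counts of pixels that have two given bits set at once. The sketch below checks this on a small list, again assuming bit 0 is the most significant bit and using bit lists in place of P-trees; the function name and data are hypothetical.

```python
def variance_from_bits(values, n_bits=8):
    """Population variance computed only from per-bit and
    pairwise-bit counts (stand-ins for P-tree root counts)."""
    n = len(values)
    # bits[j][p] = bit j (0 = most significant) of value p
    bits = [[(v >> (n_bits - 1 - j)) & 1 for v in values]
            for j in range(n_bits)]
    total = sum(2 ** (n_bits - 1 - j) * sum(bits[j]) for j in range(n_bits))
    sum_sq = 0
    for j in range(n_bits):
        for k in range(n_bits):
            # Count of values with both bit j and bit k set: rc(P_j & P_k)
            both = sum(a & b for a, b in zip(bits[j], bits[k]))
            sum_sq += 2 ** (2 * n_bits - 2 - j - k) * both
    mean = total / n
    return sum_sq / n - mean ** 2

band = [105, 104, 200, 7]          # hypothetical band values
var = variance_from_bits(band)
print(var)
```

The result agrees with the variance computed directly from the values, which is the point of the formula: only counts are needed, not the individual data points.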

Performance

Unlike the variance-based method, which checks each point on the axis, our method rapidly converges to the optimal cut-point, topt.

It avoids scanning the database by computing sum and mean from the root counts of the P-trees

It is much faster than the variance-based method, while its optimization is as good as that of the variance-based method

Conclusion

Analyzed the effect of various distance metrics

Used a new metric, HOB distance, for fast P-tree-based computation

Revealed useful properties of P-trees

Using P-trees, developed a fast new method of KNN, called Closed-KNN, giving higher classification accuracy

Designed a new fast k-clustering algorithm that computes sum, mean and variance from P-trees without scanning databases

Thank You
