New Algorithms for Efficient High-Dimensional Nonparametric Classification
Ting Liu, Andrew W. Moore, and Alexander Gray


Page 1

New Algorithms for Efficient High-Dimensional Nonparametric Classification

Ting Liu, Andrew W. Moore, and Alexander Gray

Page 2

Overview
- Introduction
  - k Nearest Neighbors (k-NN)
  - KNS1: conventional k-NN search
- New algorithms for k-NN classification
  - KNS2: for skewed-class data
  - KNS3: "are at least t of the k-NN positive?"
- Results
- Comments

Page 3

Introduction: k-NN

k-NN is a nonparametric classification method. Given a data set of n data points, it finds the k points closest to a query point q and chooses the label held by the majority of them.

The computational cost of conventional solutions is too high in many applications, especially in the high-dimensional case.

Data points V ⊂ R^D; query point q ∈ R^D.
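As a point of reference, here is a minimal brute-force sketch of the k-NN rule just described (illustrative code, not the authors'; the function name and parameters are our own):

import numpy as np

def knn_classify(X, y, q, k):
    """Brute-force k-NN majority vote: O(nD) per query, the cost
    that ball-tree methods such as KNS1 aim to avoid."""
    dists = np.linalg.norm(X - q, axis=1)        # distance from q to every point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # majority label among the k-NN

# Toy usage: 100 points in R^5 with binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
print(knn_classify(X, y, rng.normal(size=5), k=9))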

Page 4

Introduction: KNS1

KNS1: conventional k-NN search with a ball-tree.

Ball-tree (binary):
- The root node represents the full set of points.
- A leaf node contains a small subset of the points.
- A non-leaf node has two child nodes.
- Pivot of a node: one of the points in the node, or the centroid of its points.
- Radius of a node: the maximum distance from the pivot to any point in the node (see the sketch below).
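A minimal sketch of such a node, assuming a centroid pivot and a simple widest-dimension split (the split heuristic is our assumption; the slides do not prescribe a construction):

import numpy as np

class BallNode:
    """One node of a binary ball-tree: a pivot, a radius covering all of
    the node's points, and either two children or, at a leaf, the points."""
    def __init__(self, points, leaf_size=10):
        self.pivot = points.mean(axis=0)         # centroid used as the pivot
        self.radius = np.linalg.norm(points - self.pivot, axis=1).max()
        if len(points) <= leaf_size:
            self.points, self.children = points, None   # leaf node
        else:
            # Assumed split heuristic: partition along the dimension of
            # greatest spread; real builders trade construction cost for
            # tighter balls, as the next slide notes.
            d = int(np.argmax(points.max(axis=0) - points.min(axis=0)))
            order = np.argsort(points[:, d])
            mid = len(points) // 2
            self.points = None
            self.children = (BallNode(points[order[:mid]], leaf_size),
                             BallNode(points[order[mid:]], leaf_size))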

Page 5

Introduction: KNS1

Bound the distance from a query point q: by the triangle inequality, any point x in a node satisfies ||q − pivot|| − radius ≤ ||q − x|| ≤ ||q − pivot|| + radius.

Trade off the cost of construction against the tightness of the radii of the balls.
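These bounds are cheap to evaluate; a small helper (hypothetical name, our code) makes them concrete:

import numpy as np

def dist_bounds(q, pivot, radius):
    """Triangle-inequality bounds on ||q - x|| for any point x in a ball."""
    d = np.linalg.norm(q - pivot)
    return max(d - radius, 0.0), d + radius      # (lower bound, upper bound)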

Page 6

Introduction: KNS1

Recursive procedure: PS^out = BallKNN(PS^in, Node)
- PS^in consists of the k-NN of q in V (the set of points searched so far).
- PS^out consists of the k-NN of q in V ∪ Node.

Definitions:
- D_sofar = max over x ∈ PS^in of ||x − q|| (∞ if |PS^in| < k): the minimum distance at which points are still of interest.
- D_minp^Node: the minimum possible distance from any point in Node to q (used for pruning in the sketch below).
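A minimal runnable sketch of this recursion, assuming the BallNode class sketched earlier; the max-heap encoding of PS^in and the name ball_knn are our own choices, not the paper's code:

import heapq
import numpy as np

def ball_knn(ps_in, node, q, k, d_minp_parent=0.0):
    """KNS1-style search: ps_in is a max-heap of (-dist, point) tuples
    holding the k-NN of q among the points searched so far; returns the
    updated heap after also considering `node`."""
    d_sofar = -ps_in[0][0] if len(ps_in) == k else np.inf   # pruning threshold
    d_minp = max(np.linalg.norm(q - node.pivot) - node.radius, d_minp_parent)
    if d_minp >= d_sofar:
        return ps_in                     # no point in this ball can matter
    if node.children is None:            # leaf: test each point directly
        for x in node.points:
            d = np.linalg.norm(q - x)
            if d < d_sofar:
                heapq.heappush(ps_in, (-d, tuple(x)))
                if len(ps_in) > k:
                    heapq.heappop(ps_in)
                d_sofar = -ps_in[0][0] if len(ps_in) == k else np.inf
    else:                                # recurse into the closer child first
        near, far = sorted(node.children,
                           key=lambda c: np.linalg.norm(q - c.pivot) - c.radius)
        ps_in = ball_knn(ps_in, near, q, k, d_minp)
        ps_in = ball_knn(ps_in, far, q, k, d_minp)
    return ps_in

Calling ball_knn([], root, q, k) on a BallNode root returns (-distance, point) pairs for the k nearest neighbors of q.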

Page 7

KNS2

KNS2: for skewed-class data, where one class is much more frequent than the other.

Goal: find the number of the k-NN that belong to the positive class, without explicitly finding the k-NN set.

Basic idea: build two ball-trees, Postree (small) and Negtree.
- "Find positive": search Postree with KNS1 to find the k-NN set Posset_k.
- "Insert negative": search Negtree, using Posset_k as bounds to prune faraway nodes and to estimate the number of negative points that would be inserted into the true nearest-neighbor set. (A brute-force specification of the computed quantity follows below.)
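For reference, the quantity KNS2 computes can be specified by brute force (illustrative code; KNS2 itself avoids this full scan):

import numpy as np

def positives_among_knn(Xpos, Xneg, q, k):
    """Number of positives among the k-NN of q, computed the slow way."""
    dpos = np.linalg.norm(Xpos - q, axis=1)
    dneg = np.linalg.norm(Xneg - q, axis=1)
    labeled = sorted([(d, 1) for d in dpos] + [(d, 0) for d in dneg])
    return sum(label for _, label in labeled[:k])   # count positive labels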

Page 8

KNS2

Definitions:
- Dists = {Dist_1, …, Dist_k}: the distances to the k nearest positive neighbors of q, sorted in increasing order.
- V: the set of points in the negative balls visited so far.
- (n, C): n is the number of positive points among the k-NN of q; C = {C_1, …, C_n}, where C_i is the number of negative points in V closer to q than the i-th positive neighbor.
- D_minp^Node and D_maxp^Node: the minimum and maximum possible distances from q to any point in Node.

Page 9

KNS2

Step 2, "insert negative", is implemented by the recursive function
(n^out, C^out) = NegCount(n^in, C^in, Node, j^parent, Dists)
- (n^in, C^in) summarizes the interesting negative points in V;
- (n^out, C^out) summarizes the interesting negative points in V ∪ Node (a single-point illustration follows below).
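The core bookkeeping can be illustrated on a single negative point (our simplification; the paper's NegCount processes whole ball-tree nodes, using D_minp and D_maxp to skip them wholesale):

import bisect

def insert_negative(n, C, dists, d_neg):
    """Update (n, C) after counting one negative point at distance d_neg.
    C[i] is the number of negatives closer than the (i+1)-th positive
    neighbor; dists holds the k sorted positive-neighbor distances."""
    k = len(dists)
    i = bisect.bisect_left(dists, d_neg)   # first positive at least as far
    for j in range(i, n):
        C[j] += 1                          # the negative is closer than these
    # If the n-th positive now has >= k points provably closer than it
    # (n-1 positives plus C[n-1] negatives), it cannot be among the k-NN.
    while n > 0 and (n - 1) + C[n - 1] >= k:
        n -= 1
    return n, C

# Example: k=3, positives at distances 1, 2, 3; a negative at 1.5 arrives.
print(insert_negative(3, [0, 0, 0], [1.0, 2.0, 3.0], 1.5))   # -> (2, [0, 1, 1])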

Page 10

KNS3

KNS3: "are at least t of the k nearest neighbors positive?"

No constraint of skewness in the classes.

Proposition: let m + t = k + 1. Then at least t of the k-NN of q are from the positive class if and only if Dist_pos^t ≤ Dist_neg^m, where Dist_pos^t is the distance from q to its t-th nearest positive neighbor and Dist_neg^m is the distance to its m-th nearest negative neighbor.

Instead of computing these distances exactly, we compute lower and upper bounds on them.
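The proposition is easy to exercise on concrete distance lists (illustrative helper, our names):

import numpy as np

def at_least_t_positive(dpos, dneg, k, t):
    """True iff at least t of the k-NN are positive, via the proposition:
    with m = k - t + 1, compare the t-th nearest positive distance with
    the m-th nearest negative distance."""
    m = k - t + 1                         # so that t + m = k + 1
    dpos = np.sort(np.asarray(dpos))
    dneg = np.sort(np.asarray(dneg))
    return bool(dpos[t - 1] <= dneg[m - 1])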

Page 11

KNS3

P is a set of balls from Postree; N consists of balls from Negtree.

To compute Lo(Dist_pos^t), a lower bound on Dist_pos^t, first sort the balls u_1, u_2, … of P such that i ≤ j implies D_minp^{u_i} ≤ D_minp^{u_j}. Then

Lo(Dist_pos^t) = D_minp^{u_i}, where i is the smallest index such that |Points(u_1)| + … + |Points(u_i)| ≥ t (equivalently, |Points(u_1)| + … + |Points(u_{i-1})| < t).
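A direct sketch of that bound computation (our function name; each ball is reduced to its D_minp and its point count):

def lo_dist_pos(balls, t):
    """Lower bound on the distance to the t-th nearest positive point.
    `balls` is a list of (d_minp, num_points) pairs for the balls in P."""
    total = 0
    for d_minp, num_points in sorted(balls):   # ascending D_minp
        total += num_points
        if total >= t:
            return d_minp                      # first ball where count reaches t
    raise ValueError("fewer than t positive points in P")

# Example: three balls with D_minp 0.5, 1.2, 2.0 and 4, 3, 5 points each.
print(lo_dist_pos([(0.5, 4), (1.2, 3), (2.0, 5)], t=6))   # -> 1.2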

Page 12

Experimental results

Real data

Page 13

Experimental results

k = 9, t = ceiling(k/2).

Randomly pick 1% of the negative records and 50% of the positive records as the test set (986 points); train on the remaining 87,372 data points.

Page 14

Comments

Why k-NN? It is a baseline method.

No free lunch: for uniform high-dimensional data there is no benefit.

The speedups obtained suggest that the intrinsic dimensionality of real data is much lower than the ambient dimension.