Fast and Scalable Nearest Neighbor Based Classification
Taufik Abidin and William Perrizo
Department of Computer Science, North Dakota State University
Classification

Given a (large) TRAINING SET, R(A1,…,An, C), with C=CLASSES and {A1,…,An}=FEATURES.
Classification is: labeling unclassified objects based on the training set.
kNN classification goes as follows:
1. Search the training set for the k-Nearest Neighbors of the unclassified object
2. Vote the class
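As a baseline, plain kNN over horizontally structured (record-oriented) data can be sketched as follows; this is a minimal illustration with a made-up toy dataset, not the paper's implementation:

```python
from collections import Counter

def knn_classify(training_set, labels, x, k=3):
    """Classify x by majority vote among its k nearest training tuples.
    Requires one full scan of the (horizontal) training set."""
    # Squared Euclidean distance to every training tuple (the expensive scan)
    dists = [(sum((ri - xi) ** 2 for ri, xi in zip(r, x)), c)
             for r, c in zip(training_set, labels)]
    dists.sort(key=lambda t: t[0])               # nearest first
    votes = Counter(c for _, c in dists[:k])     # vote the class
    return votes.most_common(1)[0][0]

R = [(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)]     # toy training set
C = ["a", "a", "b", "b", "b"]
print(knn_classify(R, C, (7, 8), k=3))  # → b
```

The sort makes the scan's cost obvious: every classification touches every training tuple, which is exactly the problem the vertical approach addresses.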
Problems with kNN

Finding the k-Nearest Neighbor Set from horizontally structured data (record-oriented data) can be expensive for a large training set (containing millions or trillions of tuples):
– linear in the size of the training set (1 scan)
– Closed kNN is much more accurate but requires 2 scans
Vertically structuring the data can help.
Vertical Predicate-tree (P-tree) structuring: vertically partition the table; compress each vertical bit slice into a basic P-tree; process P-trees using multi-operand logical ANDs.

A data table, R(A1..An), containing horizontal structures (records), is processed vertically (vertical scans). E.g., with four 3-bit attributes:

R( A1  A2  A3  A4 )
   010 111 110 001
   011 111 110 000
   010 110 101 001
   010 111 101 111
   101 010 001 100
   010 010 001 101
   111 000 001 100
   111 000 001 100

Bit-slicing R[A1]..R[A4] gives the vertical bit columns R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43; e.g., scanned vertically, R11 = 00001011.

The basic (1-D) P-tree for R11 is built by recording the truth of the predicate "pure 1" recursively on halves, until purity is reached:
1. Whole file is not pure1 → 0
2. 1st half is not pure1 → 0 (but it is pure, pure0, so this branch ends)
3. 2nd half is not pure1 → 0
4. 1st half of 2nd half is not pure0 → 0, keep splitting
5. 2nd half of 2nd half is pure1 → 1
6. 1st half of (1st half of 2nd half) is pure1 → 1
7. 2nd half of (1st half of 2nd half) is pure0 → 0

giving the basic P-tree P11 (levels top-down): 0 | 0 0 | 0 1 | 1 0.

E.g., to count occurrences of the tuple 111 000 001 100, AND the basic P-trees, complementing wherever the tuple bit is 0 ("pure111000001100"):

P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

The result (levels top-down) is 0 at the 2³-level, 0 0 at the 2²-level, and 0 1 at the 2¹-level; its pure1 node gives a root count of 1·2¹ = 2.
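The AND-based counting can be illustrated with Python integers standing in for (uncompressed) vertical bit slices; this is only a sketch of the vertical idea, without the tree compression:

```python
def bit_slices(rows, bits_per_attr=3):
    """Vertically partition a table of small integers into bit slices.
    Slice (d, k) packs bit k of attribute d for every row into one int
    (row i -> bit i), so a multi-operand AND is a chain of single &'s."""
    slices = {}
    for d in range(len(rows[0])):
        for k in range(bits_per_attr):
            v = 0
            for i, row in enumerate(rows):
                if (row[d] >> k) & 1:
                    v |= 1 << i
            slices[(d, k)] = v
    return slices

def count_pattern(rows, pattern, bits_per_attr=3):
    """Root count of the AND of slices (complemented where the pattern
    bit is 0): the number of rows equal to `pattern`."""
    s = bit_slices(rows, bits_per_attr)
    full = (1 << len(rows)) - 1                  # mask covering all |R| rows
    acc = full
    for d, val in enumerate(pattern):
        for k in range(bits_per_attr):
            slc = s[(d, k)]
            acc &= slc if (val >> k) & 1 else (full & ~slc)
    return bin(acc).count("1")

# The 8-row table R(A1 A2 A3 A4) from the slide, attributes as 3-bit values
R = [(0b010, 0b111, 0b110, 0b001), (0b011, 0b111, 0b110, 0b000),
     (0b010, 0b110, 0b101, 0b001), (0b010, 0b111, 0b101, 0b111),
     (0b101, 0b010, 0b001, 0b100), (0b010, 0b010, 0b001, 0b101),
     (0b111, 0b000, 0b001, 0b100), (0b111, 0b000, 0b001, 0b100)]
print(count_pattern(R, (0b111, 0b000, 0b001, 0b100)))  # → 2
```

A real P-tree replaces each packed integer with the compressed "pure1" tree, so the ANDs skip pure subtrees instead of touching every row.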
Total Variation

The Total Variation of a set X about a point a, TV(a), is the sum of the squared separations of the objects in X from a:

TV(a) = Σ_{x∈X} (x−a)∘(x−a)

We will use the concept of functional contours (in particular, the TV contours) to identify a well-pruned, small superset of the Nearest Neighbor Set of an unclassified sample (which can then be efficiently scanned).
First we discuss functional contours in general, then the specific TV contours.
Functional Contours

Given f: R(A1..An) → Y and S ⊆ Y, define contour(f,S) ≡ f⁻¹(S).
If S = {a}, then f⁻¹({a}) is Isobar(f, a).

There is a DUALITY between functions, f: R(A1..An) → Y, and derived attributes, Af of R, given by x.Af ≡ f(x), where Dom(Af) = Y. (R* denotes R extended with the derived column Af.)

From the derived-attribute point of view:
Contour(f,S) = SELECT A1..An FROM R* WHERE R*.Af ∈ S

[Figure: R with columns A1..An alongside R* with the extra column Af; in A1..An × Y space, graph(f) = { (a1,...,an, f(a1,...,an)) | (a1,...,an) ∈ R }, and contour(f,S) is the part of A1..An space lying under the portion of graph(f) over S ⊆ Y.]
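The duality can be mimicked directly in code: materialize the derived attribute Af = f(x), then select the rows whose Af value falls in S (the table, f, and S below are illustrative toys, not from the paper):

```python
# Contour(f, S) = SELECT A1..An FROM R* WHERE R*.Af IN S
R = [(1, 2), (3, 4), (5, 0), (2, 2)]   # toy relation R(A1, A2)

f = lambda x: sum(x)                   # any functional f: R(A1..An) -> Y
S = {4, 5}                             # a subset of Y

R_star = [(x, f(x)) for x in R]                # R* = R extended with Af
contour = [x for x, af in R_star if af in S]   # f^{-1}(S)
print(contour)  # → [(5, 0), (2, 2)]
```

The point of the vertical approach is that this selection is done with P-tree interval masks on the derived attribute's bit slices, never by scanning R row by row.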
TV(a) = Σ_{x∈R} (x−a)∘(x−a). Using d as an index over the dimensions and i, j, k as bit-slice indexes:

TV(a) = Σ_{x∈R} Σ_{d=1..n} (x_d² − 2·a_d·x_d + a_d²)
      = Σ_{x∈R} Σ_d (Σ_k 2^k·x_dk)² − 2·Σ_{x∈R} Σ_d a_d·(Σ_k 2^k·x_dk) + |R|·|a|²
      = Σ_x Σ_d (Σ_i 2^i·x_di)(Σ_j 2^j·x_dj) − 2·Σ_{x,d,k} 2^k·a_d·x_dk + |R|·|a|²
      = Σ_{x,d,i,j} 2^(i+j)·x_di·x_dj − 2·Σ_d a_d·Σ_{x,k} 2^k·x_dk + |R|·|a|²

Expressed in P-tree root counts (|P| denotes the root count of P):

TV(a) = Σ_{i,j,d} 2^(i+j)·|P_di ∧ P_dj| − Σ_k 2^(k+1)·Σ_d a_d·|P_dk| + |R|·|a|²
The first term does not depend upon a. Thus the derived attribute coming from f(a) = TV(a) − TV(μ), where μ is the mean of R (f does not have that 1st term at all), has identical contours to TV (just a lowered graph). We also find it useful to post-compose a log function to reduce the number of bit slices. The resulting functional is called the High-Dimension-ready Total Variation, or HDTV(a).
From the equation above, with μ the mean of R,

TV(a) = Σ_{x,d,i,j} 2^(i+j)·x_di·x_dj + |R|·( −2·Σ_d a_d·μ_d + Σ_d a_d·a_d )

so

f(a) = TV(a) − TV(μ)
     = |R|·( −2·Σ_d (a_d·μ_d − μ_d·μ_d) + Σ_d (a_d·a_d − μ_d·μ_d) )
     = |R|·( Σ_d a_d² − 2·Σ_d μ_d·a_d + Σ_d μ_d² )
     = |R|·|a−μ|²

Thus f(μ) = 0, and g(a) ≡ HDTV(a) = ln( f(a) ) = ln|R| + ln|a−μ|².

Isobars are hyper-circles centered at μ; graph(g) is a log-shaped hyper-funnel.

Going inward and outward along a−μ by ε we arrive at the inner point b = μ + (1 − ε/|a−μ|)·(a−μ) and the outer point c = μ + (1 + ε/|a−μ|)·(a−μ).
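The identity f(a) = TV(a) − TV(μ) = |R|·|a−μ|² is easy to check numerically. A small sketch with made-up data (assuming, as above, that μ is the mean of R):

```python
import math

R = [(0.0, 1.0), (2.0, 3.0), (4.0, 0.0), (1.0, 1.0)]   # toy data
n = len(R[0])
mu = tuple(sum(x[d] for x in R) / len(R) for d in range(n))   # mean of R

def TV(a):
    """Total variation about a: sum over x in R of (x-a) o (x-a)."""
    return sum(sum((x[d] - a[d]) ** 2 for d in range(n)) for x in R)

a = (3.0, 2.0)
f = TV(a) - TV(mu)
direct = len(R) * sum((a[d] - mu[d]) ** 2 for d in range(n))  # |R|·|a-mu|^2
assert abs(f - direct) < 1e-9
print(math.log(f))   # HDTV(a) = ln(f(a)) = ln|R| + ln|a-mu|^2
```

Because f depends on a only through |a−μ|, its isobars really are hyper-circles about μ, which is what makes the interval mask below work.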
[Figure: an ε-contour ring (radius ε about a) in (x1, x2, g(x)) space, with b and c marked on either side of a.]

For an ε-contour ring (radius ε about a), g(b) and g(c) are the lower and upper endpoints of a vertical interval, S, defining the ε-contour. An easy P-tree calculation on that interval provides a P-tree mask for the ε-contour (no scan required).
If more pruning is needed (i.e., the HDTV(a) contour is still too big to scan), use a dimension-projection contour (the Dim-i projection P-trees are already computed: they are the basic P-trees of R.Ai). Form that contour mask P-tree and AND it with the HDTV contour P-tree; the result is a mask for the intersection.
[Figure: the same ε-contour (radius ε about a), with the interval endpoints HDTV(b) and HDTV(c) marked on the b and c sides.]
As pre-processing, calculate basic P-trees for the HDTV derived attribute. To classify a:
1. Calculate b and c (which depend upon a and ε).
2. Form the mask P-tree for training points with HDTV-values in [HDTV(b), HDTV(c)]. (Note: when the paper was submitted we were still doing this step by sorting TV(a) values. Now we use the contour approach, which speeds this step up considerably. The performance-evaluation graphs in this paper are still based on the old method, and without Gaussian vote weighting.)
3. Use that mask P-tree to prune down the candidate NNS.
4. If the root count of the candidate set is small enough, proceed to scan and assign class votes using, e.g., a Gaussian vote function; else prune further using a dimension projection.
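The four steps can be sketched in plain Python; this is an illustrative toy, with arbitrary ε and Gaussian-width choices, and with a list comprehension standing in for the P-tree interval mask of step 2:

```python
import math
from collections import Counter

def hdtv_classify(R, labels, a, eps=1.0, sigma=1.0):
    """Sketch of HDTV-contour pruning + Gaussian-vote classification."""
    n = len(a)
    mu = tuple(sum(x[d] for x in R) / len(R) for d in range(n))  # mean of R
    hdtv = lambda x: math.log(len(R)) + math.log(
        sum((x[d] - mu[d]) ** 2 for d in range(n)))
    # 1. inner/outer points b, c along a - mu at distance eps
    dist = math.sqrt(sum((a[d] - mu[d]) ** 2 for d in range(n)))
    b = tuple(mu[d] + (1 - eps / dist) * (a[d] - mu[d]) for d in range(n))
    c = tuple(mu[d] + (1 + eps / dist) * (a[d] - mu[d]) for d in range(n))
    lo, hi = hdtv(b), hdtv(c)
    # 2-3. prune: keep training points with HDTV-value in [HDTV(b), HDTV(c)]
    cand = [(x, cls) for x, cls in zip(R, labels) if lo <= hdtv(x) <= hi]
    # 4. scan the pruned candidate set, Gaussian vote weighting
    votes = Counter()
    for x, cls in cand:
        d2 = sum((x[d] - a[d]) ** 2 for d in range(n))
        votes[cls] += math.exp(-d2 / (2 * sigma ** 2))
    return votes.most_common(1)[0][0] if votes else None

R = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
C = ["a", "a", "b", "b"]
print(hdtv_classify(R, C, (8.0, 9.0), eps=2.0))  # → b
```

Note that an HDTV contour is a spherical shell about μ, so it can keep points far from a on the opposite side of μ; that is exactly why step 4 still scans the candidates (and why the dimension-projection pruning exists).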
[Figure: the contour of the dimension projection f(a) = a1 in (x1, x2, HDTV(x)) space, intersected with the HDTV contour.]
Graphs of TV, TV−TV(μ) and HDTV

[Figure: three surfaces over the X–Y plane: TV, with TV(μ) = TV(x33) at the bottom of the bowl and TV(x15) marked; the lowered graph TV−TV(μ); and the log-shaped funnel HDTV.]
Experiments: Dataset

1. KDDCUP-99 Dataset (Network Intrusion Dataset)
– 4.8 million records, 32 numerical attributes
– 6 classes, each containing >10,000 records
– Class distribution:
    Normal        972,780
    IP sweep       12,481
    Neptune     1,072,017
    Port sweep     10,413
    Satan          15,892
    Smurf       2,807,886
– Testing set: 120 records, 20 per class
– 4 synthetic datasets (randomly generated):
    - 10,000 records (SS-I)
    - 100,000 records (SS-II)
    - 1,000,000 records (SS-III)
    - 2,000,000 records (SS-IV)
(k=5) Note: SMART-TV was done by sorting the derived attribute. Now we use the much faster P-tree interval mask.

Running time in seconds vs. training-set cardinality (×1000):

Algorithm            |   10 |  100 |  1000 |  2000 |  4891
SMART-TV             | 0.14 | 0.33 |  2.01 |  3.88 |  9.27
Vertical Closed-KNN  | 0.89 | 1.06 |  3.94 | 12.44 | 30.79
KNN                  | 0.39 | 2.34 | 23.47 | 49.28 |    NA
Speed or Scalability

[Figure: "Running Time Against Varying Cardinality" – time in seconds vs. training-set cardinality (×1000) for SMART-TV, PKNN and KNN.]

Machine used: Intel Pentium 4 CPU 2.6 GHz, 3.8GB RAM, running Red Hat Linux.
Dataset (Cont.)

2. OPTICS dataset
– ~8,000 points, 8 classes (CL-1, CL-2, …, CL-8)
– 2 numerical attributes
– Training set: 7,920 points
– Testing set: 80 points, 10 per class

[Figure: scatter plot of the OPTICS data with clusters CL-1 through CL-8 labeled.]

Dataset (Cont.)

3. IRIS dataset
– 150 samples
– 3 classes (iris-setosa, iris-versicolor, and iris-virginica)
– 4 numerical attributes
– Training set: 120 samples
– Testing set: 30 samples, 10 per class
Overall Accuracy

Overall F-score Classification Accuracy Comparison. (Note: SMART-TV class voting was done with equal votes for each training neighbor – now we use Gaussian vote weighting and get better accuracy than the other two.)

[Figure: bar chart, "Comparison of the Algorithms' Overall Classification Accuracy" – average F-score per dataset for SMART-TV, PKNN and KNN.]

Datasets  SMART-TV  PKNN   KNN
IRIS        0.97    0.71   0.97
OPTICS      0.96    0.99   0.97
SS-I        0.96    0.72   0.89
SS-II       0.92    0.91   0.97
SS-III      0.94    0.91   0.96
SS-IV       0.92    0.91   0.97
NI          0.93    0.91   NA
Summary

A nearest-neighbor-based classification algorithm that starts its classification steps by approximating the Nearest Neighbor Set.
A total variation functional is used to prune down the NNS candidate set.
It finishes classification in the traditional way.
The algorithm is fast, and it scales well to very large datasets. The classification accuracy is very comparable to that of Closed kNN (which is better than kNN).