Page 1:

Given k, k-means clustering is implemented in four steps. It assumes the clustering criterion is to maximize intra-cluster similarity and minimize inter-cluster similarity. A heuristic is used (the method is not guaranteed to find an optimal clustering).

1. Partition the points to be clustered into k subsets (or pick k initial mean points and create the initial k subsets by putting every other point in the cluster with the closest mean).

2. Compute the mean of each cluster of the current partition. Assign each non-mean point to the cluster with the most similar (closest) mean.

3. Go back to Step 2 (compute the means of the new clusters).

4. Stop when the new set of means doesn't change [much] (or use some other stopping condition?).

Its complexity is O(tkn), where n = number of objects, k = number of clusters, and t = number of iterations. Normally k, t << n, so the cost is close to O(n).
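A minimal sketch of the four-step loop, assuming plain Python lists/tuples, squared Euclidean distance, and random initial means (the function name kmeans is just illustrative; this is not the pTree-based variant suggested below):

import random

def kmeans(points, k, max_iters=100, tol=1e-9):
    # Plain k-means: pick k initial means, then alternate (re)assignment
    # and mean recomputation until the means stop moving (much).
    # Step 1: pick k initial mean points (here simply a random sample).
    means = random.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster with the closest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
            clusters[d2.index(min(d2))].append(p)
        # Step 3: recompute the mean of each (non-empty) cluster.
        new_means = [tuple(sum(col) / len(c) for col in zip(*c)) if c else m
                     for c, m in zip(clusters, means)]
        # Step 4: stop when the new set of means doesn't change (much).
        shift = max(sum((a - b) ** 2 for a, b in zip(m, nm))
                    for m, nm in zip(means, new_means))
        means = new_means
        if shift <= tol:
            break
    return means, clusters

Empty clusters are handled naively here (the old mean is kept); real implementations usually re-seed them.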

Suggestion: Use pTrees. It should be possible to calculate all distances between the sample and the centroids at once (and to parallelize that?) using the L1 distance (other distances? other correlations?).

It may also pay off to do it for multiple test points at a time (or for all non-mean points at once), so that there is just one pass across the pTrees, and to parallelize that pass. In almost all cases, it is possible to re-compute the centroids using one pTree pass (see Yue Cui's thesis in the departmental or University library).
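As a rough sketch of that one-pass idea (not Yue Cui's actual method; flat Python integers stand in for real, compressed pTrees, and the names bit_slices and cluster_mean are made up), a column's cluster mean can be computed from its basic bit slices and a cluster mask alone, since the column sum over the mask is sum over i of 2^i * count(Pj,i AND mask):

def bit_slices(column, nbits):
    # P[i] is an integer bitmap whose x-th bit is bit i of column[x].
    P = [0] * nbits
    for x, v in enumerate(column):
        for i in range(nbits):
            if (v >> i) & 1:
                P[i] |= 1 << x
    return P

def cluster_mean(P, mask):
    # Mean of one column over the rows selected by mask, using only
    # bitmap ANDs and popcounts (one pass over the slices).
    n = bin(mask).count("1")
    total = sum((1 << i) * bin(p & mask).count("1") for i, p in enumerate(P))
    return total / n

# Column A1 of the table R used in the example below, and a mask selecting
# the rows 3 2 1 4, 2 2 1 5, 7 0 1 4, 7 0 1 4 (bit x = row x):
A1 = [2, 6, 3, 2, 3, 2, 7, 7]
mask = 0b11110000
print(cluster_mean(bit_slices(A1, 3), mask))   # 4.75 (shown rounded as 4.7 in the example)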

Also, it should be possible, with multilevel pTrees, to tell when one can exit early using only the top-level pTrees, then the top 2 levels, then the top 3 levels, etc.

Does something like this work for k-means image pixel classification too? (There are many videos on YouTube describing k-means image classification.)

Does k-means work for the Netflix data too, to predict r(U,M)? E.g., in user-vote.C, for every V in supM, use k-means to vote according to the strongest class (the largest? the largest with the vote weighted according to the count gap with the others? no loops?).

Page 2:

R(A1 A2 A3 A4) =
2 7 6 1
6 7 6 0
3 7 5 1
2 7 5 7
3 2 1 4
2 2 1 5
7 0 1 4
7 0 1 4

Example of HDkM (Horizontal-Data k-Means), showing looping and Partial Distance early exit. Pick initial means:

m1 = 7 0 1 4
m2 = 2 7 6 1

Loop over rows first, then over columns. Calculate the first distance in full. In the second distance calculation, as soon as the accumulated distance exceeds the minimum distance so far (here, always the first distance), exit the column loop.

d2(m1, 6 7 6 0) = 91; the full column loop gives d2(m2, 6 7 6 0) = 17, so C2.

d2(m1, 3 7 5 1) = 90; the full column loop gives d2(m2, 3 7 5 1) = 2, so C2.

d2(m1, 2 7 5 7) = 99; the full column loop gives d2(m2, 2 7 5 7) = 37, so C2.

d2(m1, 3 2 1 4) = 20; in the column loop for m2, (2-3)^2 = 1, (7-2)^2 = 25, and 1 + 25 = 26 exceeds 20, so early exit of the column loop and C1.

d2(m1, 2 2 1 5) = 30; in the column loop for m2, (2-2)^2 = 0, (7-2)^2 = 25, (6-1)^2 = 25, and 0 + 25 + 25 = 50 exceeds 30, so early exit of the column loop and C1.

In the initial re-clustering (the mean rows stay with their own clusters), C1 is:

3 2 1 4
2 2 1 5
7 0 1 4
7 0 1 4

and C2 is:

2 7 6 1
6 7 6 0
3 7 5 1
2 7 5 7

and the new means are:

m21 = 4.7 1 1 4.2
m22 = 3.5 5.2 4.2 3.2
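The hand computation above can be re-checked with a short script; this is only a sketch of the partial-distance trick (the name d2_partial is made up), not the HDkM code itself:

def d2_partial(p, m, cutoff=float("inf")):
    # Squared L2 distance with early exit: leave the column loop as soon
    # as the running sum exceeds cutoff (the best distance so far).
    s = 0
    for a, b in zip(p, m):
        s += (a - b) ** 2
        if s > cutoff:
            return None          # early exit of the column loop
    return s

m1, m2 = (7, 0, 1, 4), (2, 7, 6, 1)
rows = [(6, 7, 6, 0), (3, 7, 5, 1), (2, 7, 5, 7), (3, 2, 1, 4), (2, 2, 1, 5)]
for r in rows:
    d1 = d2_partial(r, m1)                 # first distance: computed in full
    d2 = d2_partial(r, m2, cutoff=d1)      # second distance: may exit early
    print(r, d1, d2, "C1" if d2 is None or d1 <= d2 else "C2")

This reproduces the distances and assignments above (None marks an early exit).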

Page 3:

R(A1 A2 A3 A4) =
2 7 6 1
6 7 6 0
3 7 5 1
2 7 5 7
3 2 1 4
2 2 1 5
7 0 1 4
7 0 1 4

Second epoch. The re-clustering uses the new means

m21 = 4.7 1 1 4.2
m22 = 3.5 5.2 4.2 3.2

d2(m21, 2 7 6 1) = 79;  d2(m22, 2 7 6 1) = 13, so C2.
d2(m21, 6 7 6 0) = 80;  d2(m22, 6 7 6 0) = 22, so C2.
d2(m21, 3 7 5 1) = 65;  d2(m22, 3 7 5 1) = 9, so C2.
d2(m21, 2 7 5 7) = 67;  d2(m22, 2 7 5 7) = 19, so C2.
d2(m21, 3 2 1 4) = 4;   d2(m22, 3 2 1 4) = 21, so C1.
d2(m21, 2 2 1 5) = 9;   d2(m22, 2 2 1 5) = 26, so C1.
d2(m21, 7 0 1 4) = 6;   d2(m22, 7 0 1 4) = 50, so C1.
d2(m21, 7 0 1 4) = 6;   d2(m22, 7 0 1 4) = 50, so C1.

So the re-clustering of C1 is:

3 2 1 4
2 2 1 5
7 0 1 4
7 0 1 4

and C2 is:

2 7 6 1
6 7 6 0
3 7 5 1
2 7 5 7

This is the same as the previous clustering, so the process has completely converged and we are done.

Page 4:

One pTree calculation per epoch? This should be doable for all the samples at once, which would make k-means clustering fast (and k-means image classification too, which is what TreeMiner is doing for DoD).

Dr. Fei Pan's theorem (p. 39): Let A be the j-th column of a data table, with m+1 bits per value, and let Pj,m, ..., Pj,0 be the basic pTrees of A. For any constant c = (bm ... b0)2,

PA>c = Pj,m om ( Pj,m-1 om-1 ( ... ( Pj,k+1 ok+1 Pj,k ) ... ) )

where c = bm ... bk+1 bk ... b0 and

1. oi is the AND (∧) operation if bi = 1, and the OR (∨) operation if bi = 0.

2. k is the rightmost bit position with bit value "0".

3. The operators are right binding.
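A small sanity check of the theorem (flat Python integers as uncompressed bit vectors in place of real pTrees, and the names slices and p_greater_than are invented; an illustrative sketch, not the pTree implementation):

def slices(column, nbits):
    # P[i] has row-bit x set iff bit i of column[x] is 1.
    P = [0] * nbits
    for x, v in enumerate(column):
        for i in range(nbits):
            if (v >> i) & 1:
                P[i] |= 1 << x
    return P

def p_greater_than(P, c, nbits):
    # Build PA>c per the theorem: start at k, the rightmost 0 bit of c,
    # then, working left, AND in P[i] where bit i of c is 1 and OR it in
    # where bit i of c is 0 (right binding).
    zero_bits = [i for i in range(nbits) if not (c >> i) & 1]
    if not zero_bits:                 # c is all 1s: no nbits-bit value exceeds it
        return 0
    result = P[zero_bits[0]]
    for i in range(zero_bits[0] + 1, nbits):
        result = (P[i] & result) if (c >> i) & 1 else (P[i] | result)
    return result

A = [2, 6, 3, 2, 3, 2, 7, 7]          # the A1 column of R, 3-bit values
P = slices(A, 3)
for c in range(8):
    direct = sum(1 << x for x, v in enumerate(A) if v > c)
    assert p_greater_than(P, c, 3) == direct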

Horizontal-Data k-Means (HDkM) is approximately O(n), where n = number of points to be clustered. So for very large datasets (e.g., billions or trillions of points) this is far too slow to handle all the data coming at us.

1. Use EIN Theory to create k pTree masks (i.e., partially cluster the data into k clusters), one for each set of points closest in all dimensions to a cluster representative. Then finish the remaining points with HDkM (example follows).

2. Try the Partial Distance variation to improve speed (to further reduce the leftover points to be clustered by HDkM).

3. Fully cluster using pTrees?

4. Parallelize 1-3 above.

By De Morgan's laws (the complement of an AND is the OR of the complements, and vice versa), the complementary range predicate follows from the theorem:

PA≤c = (PA>c)' = ( Pm om ( Pm-1 om-1 ( ... ( Pk+1 ok+1 Pk ) ... ) ) )',  c = bm bm-1 ... bk+1 bk ... b0

= P'm o'm ( Pm-1 om-1 ( ... ( Pk+1 ok+1 Pk ) ... ) )'
= P'm o'm ( P'm-1 o'm-1 ( ... ( Pk+1 ok+1 Pk ) ... )' )
...
= P'm o'm ( P'm-1 o'm-1 ( ... ( P'k+1 o'k+1 P'k ) ... ) )

where o'i is the dual operation (AND becomes OR and OR becomes AND).
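A one-line check of that complement-pushing step, on the same flat bit-vector stand-ins (names and widths invented for illustration):

WIDTH = 8
FULL = (1 << WIDTH) - 1                  # complement taken within 8 row-bits

def comp(p):
    return ~p & FULL

# For c = 2 = (010)2 the theorem gives PA>c = P2 OR (P1 AND P0), so by
# De Morgan PA<=c = (PA>c)' = P2' AND (P1' OR P0').
P2, P1, P0 = 0b10110100, 0b01101100, 0b11010010
assert comp(P2 | (P1 & P0)) == comp(P2) & (comp(P1) | comp(P0))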

5. Apply to k-means classification of images (e.g., land use); see the YouTube videos on k-means image classification.