Clustering: K-Means, Nearest Neighbors
Foundation of Data Analysis 01/30/2020
Clustering Example
[Figure: original image and its segmented image]
Divide data into different groups
Clustering: K-means

1. Ask user how many clusters they’d like (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it’s closest to.
4. Each Center re-finds the centroid of the points it owns…
5. …and jumps there
6. …Repeat steps 3-5 until terminated!

Slides credit to Andrew W. Moore, CMU
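The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it initializes Centers at random datapoints, does no restarts, and does not handle the edge case of a cluster losing all its points.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch following steps 1-6 (no empty-cluster handling)."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster Center locations (here: k random datapoints)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: each datapoint finds out which Center it's closest to
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 4-5: each Center re-finds the centroid of the points it owns,
        # and jumps there
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: repeat until terminated (here: until the Centers stop moving)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```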
Disadvantages of K-means

• Does not work efficiently with complex structured data (mostly non-linear)
• Hard assignment of labels might lead to misgrouping
• Random guessing for initialization might be a hassle
• Nearest Neighbors: (Un)supervised Learning (non-parametric model)
Nearest Neighbors: Unsupervised Learning

■ Basic idea: if it walks like a duck and quacks like a duck, then it’s probably a duck
■ Compute the distance
■ Choose k of the “nearest” seeds
Nearest Neighbors

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a query point X]

K-nearest neighbors of seed x: the data points that have the k smallest distances to x.
Nearest Neighbor: Voronoi Diagram

• Partitions space into regions
• Boundary: points at the same distance from two different training examples
• K-Nearest Neighbor (KNN) classification: supervised learning
KNN Classifiers

• Requires three things:
  – The set of stored records
  – A distance metric
  – The value of k, the number of nearest neighbors to retrieve

• To classify an unknown seed:
  – Compute its distance to the training seeds
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown seed (e.g., by taking a majority vote)

[Figure: an unknown seed among labeled training points]
Nearest Neighbor Classification

■ Compute the distance between two points, e.g. the Euclidean distance (L2 norm):

  d(p, q) = sqrt( Σ_i (p_i - q_i)² )

■ Determine the class from the nearest-neighbor list:
  ■ Take the majority vote of class labels among the k nearest neighbors
  ■ Optionally weight each vote according to distance, with weight factor w = 1/d²
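The distance-weighted vote can be sketched as follows. The function name and data are illustrative, and a small epsilon is added to the denominator, an implementation detail not on the slide, to guard against a query that coincides exactly with a training point (d = 0).

```python
import numpy as np
from collections import defaultdict

def weighted_knn_vote(X_train, y_train, x, k=3):
    """Classify x by a distance-weighted vote among its k nearest neighbors,
    using the weight factor w = 1/d^2."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance (L2 norm)
    nn = np.argsort(d)[:k]                    # indices of the k nearest neighbors
    votes = defaultdict(float)
    for i in nn:
        votes[y_train[i]] += 1.0 / (d[i] ** 2 + 1e-12)  # epsilon guards d = 0
    return max(votes, key=votes.get)
```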
Nearest Neighbor Classification…

■ Choosing the value of k:
  ■ If k is too small, the classifier is sensitive to noise points
  ■ If k is too large, the neighborhood may include points from other classes
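The sensitivity to noise can be seen in a small hypothetical example: a single mislabeled noise point flips a 1-NN prediction, while the 3-NN majority vote recovers the surrounding class.

```python
import numpy as np
from collections import Counter

# Hypothetical data: the point [0.1, 0.1] is a mislabeled "noise" point
# sitting inside class a's region.
X = np.array([[0.0, 0.0], [0.0, 0.4], [0.4, 0.0], [0.1, 0.1], [5.0, 5.0]])
y = np.array(['a', 'a', 'a', 'b', 'b'])

def knn_label(x, k):
    d = np.linalg.norm(X - x, axis=1)
    nn = np.argsort(d)[:k]                      # k nearest neighbors
    return Counter(y[nn]).most_common(1)[0][0]  # majority vote
```

A query next to the noise point copies its label when k=1, but is outvoted when k=3.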
Issues of Nearest Neighbor Classification

■ Scaling issues:
  ■ Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
  ■ Example:
    ■ Height of a person may vary from 1.5 m to 1.8 m
    ■ Weight of a person may vary from 90 lb to 300 lb
    ■ Income of a person may vary from $10K to $1M
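One common fix, sketched below, is to standardize each attribute to zero mean and unit variance before computing distances (min-max scaling to [0, 1] would serve the same purpose); the rows are hypothetical.

```python
import numpy as np

def standardize(X):
    """Rescale each attribute (column) to zero mean and unit variance, so no
    attribute dominates the Euclidean distance just because of its units."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical rows: [height (m), weight (lb), income ($)]
X = np.array([[1.5,  90.0,    10_000.0],
              [1.8, 300.0, 1_000_000.0]])

# The raw distance is dominated almost entirely by the income column;
# after standardization every attribute contributes equally.
raw = np.linalg.norm(X[0] - X[1])
scaled = np.linalg.norm(standardize(X)[0] - standardize(X)[1])
```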
![Page 23: Clustering: K-Means, Nearest Neighbors07-Clustering:KNN.pdf · Clustering: K-means 8. Disadvantages of K-means 9. Disadvantages of K-means • Does not work efficiently with complex](https://reader035.vdocuments.mx/reader035/viewer/2022071414/610e7d2735f60e747a0c6166/html5/thumbnails/23.jpg)
+ + + ++ + + +oo o o oo ooooo oo o oo oo?
23
K-NN and Irrelevant Features
![Page 24: Clustering: K-Means, Nearest Neighbors07-Clustering:KNN.pdf · Clustering: K-means 8. Disadvantages of K-means 9. Disadvantages of K-means • Does not work efficiently with complex](https://reader035.vdocuments.mx/reader035/viewer/2022071414/610e7d2735f60e747a0c6166/html5/thumbnails/24.jpg)
+
+
+
+
+
++ +
o
o
o o
o
o
oo
o
o
o
oo
o
o
o
o
o?
24
K-NN and Irrelevant Features
![Page 25: Clustering: K-Means, Nearest Neighbors07-Clustering:KNN.pdf · Clustering: K-means 8. Disadvantages of K-means 9. Disadvantages of K-means • Does not work efficiently with complex](https://reader035.vdocuments.mx/reader035/viewer/2022071414/610e7d2735f60e747a0c6166/html5/thumbnails/25.jpg)
+
+
+
++
+ + +o
oo o
o
o
oo
oo
ooo
oo
o
oo?
25
K-NN and Irrelevant Features
Issues of Nearest Neighbor Classification

■ Problem with the Euclidean measure:
  ■ High-dimensional data: the curse of dimensionality
  ■ Can produce counter-intuitive results, e.g. for binary vectors:

    1 1 1 1 1 1 1 1 1 1 1 0  vs  0 1 1 1 1 1 1 1 1 1 1 1   d = 1.4142
    1 0 0 0 0 0 0 0 0 0 0 0  vs  0 0 0 0 0 0 0 0 0 0 0 1   d = 1.4142

  Both pairs are the same distance apart, even though the first pair is nearly identical and the second pair has no 1s in common.

■ Solution: normalize the vectors to unit length.
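The fix can be checked directly with the four vectors from the example above:

```python
import numpy as np

# Each pair differs in exactly two positions, so both raw Euclidean
# distances are sqrt(2) = 1.4142.
a1 = np.array([1,1,1,1,1,1,1,1,1,1,1,0], dtype=float)
a2 = np.array([0,1,1,1,1,1,1,1,1,1,1,1], dtype=float)
b1 = np.array([1,0,0,0,0,0,0,0,0,0,0,0], dtype=float)
b2 = np.array([0,0,0,0,0,0,0,0,0,0,0,1], dtype=float)

def unit(v):
    return v / np.linalg.norm(v)   # normalize to unit length

# After normalization, the nearly identical pair (a1, a2) becomes much
# closer, while the disjoint pair (b1, b2) stays maximally far apart.
d_a = np.linalg.norm(unit(a1) - unit(a2))
d_b = np.linalg.norm(unit(b1) - unit(b2))
```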
K-NN Algorithm

• Training:
  – Save the training examples

• At prediction:
  – Find the k training examples (x1,y1), …, (xk,yk) that are closest to the test example x
  – Predict the most frequent class among those yi’s

• Improvements:
  – Weighting examples from the neighborhood
  – Measuring “closeness”
  – Finding “close” examples in a large training set quickly
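The training/prediction split can be sketched directly: training is just storing the arrays, and `np.argpartition` is one simple way to find the k closest examples without fully sorting all distances (variable names are illustrative).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict each test example's class as the most frequent class among
    its k nearest training examples (Euclidean distance)."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        nn = np.argpartition(d, k)[:k]   # k smallest distances, O(n) per query
        preds.append(Counter(y_train[nn]).most_common(1)[0][0])
    return np.array(preds)
```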
Tricks with Fast k-NN

K-means using r-NN:

1. Pick k points c1 = x1, …, ck = xk as centers
2. For each center ci, find Di = Neighborhood(ci)
3. For each center ci, let ci = mean(Di)
4. Go to step 2…
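A minimal sketch of this variant, assuming Neighborhood(c) means all points within a fixed radius r of the center c; the slide does not define the neighborhood, so r and its interpretation are assumptions here.

```python
import numpy as np

def kmeans_rnn(X, k, r, n_iter=50, seed=0):
    """K-means variant where each center moves to the mean of its
    r-neighborhood (all points within radius r), not of a hard partition.
    The radius-r neighborhood is an assumption; the slide leaves it open."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 1
    for _ in range(n_iter):
        new_centers = []
        for c in centers:
            D = X[np.linalg.norm(X - c, axis=1) <= r]           # step 2
            new_centers.append(D.mean(axis=0) if len(D) else c) # step 3
        new_centers = np.array(new_centers)
        if np.allclose(new_centers, centers):                   # step 4: repeat
            break
        centers = new_centers
    return centers
```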