Cluster Analysis
Poul Petersen
BigML
Trees vs Clusters

Trees (Supervised Learning)
• Provide: labeled data
• Learning Task: be able to predict the label

Clusters (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: group data by similarity
Trees vs Clusters

Trees: labeled data, with Inputs "X" and a Label "Y":

sepal length | sepal width | petal length | petal width | species
5.1          | 3.5         | 1.4          | 0.2         | setosa
5.7          | 2.6         | 3.5          | 1.0         | versicolor
6.7          | 2.5         | 5.8          | 1.8         | virginica
…            | …           | …            | …           | …

Learning Task: Find a function "f" such that f(X) ≈ Y

Clusters: unlabeled data:

sepal length | sepal width | petal length | petal width
5.1          | 3.5         | 1.4          | 0.2
5.7          | 2.6         | 3.5          | 1.0
6.7          | 2.5         | 5.8          | 1.8
…            | …           | …            | …

Learning Task: Find "k" clusters such that the data in each cluster is self-similar
Use Cases

• Customer segmentation
• Item discovery
• Association
• Recommender
• Active learning
Customer Segmentation

• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV based on in-game purchases.
• Assumption: usage correlates with LTV.

GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for upsell.
[Figure: user clusters annotated with the percentage of high-LTV users in each: 0%, 3%, 7%, 1%.]
Item Discovery

• Dataset of 86 whiskies.
• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.

GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.
[Figure: whiskies plotted along flavor dimensions such as Smoky, Fruity, and Body.]
Association

• Dataset of Lending Club loans.
• Mark any loan that is currently late, or has ever been late, as "trouble".

GOAL: Cluster the loans by application profile to rank loan quality by the percentage of trouble loans in each cluster's population.
[Figure: loan clusters annotated with the percentage of trouble loans in each: 0%, 3%, 7%, 1%.]
Active Learning

• Dataset of diagnostic measurements of 768 patients.
• Want to test each patient for diabetes and label the dataset in order to build a model, but the test is expensive.*

GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.

*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud. Or a million images which need to be labeled as cat/not-cat.
Human Example
K=3
Human Example

[Figure: the objects sorted into three groups labeled "Round", "Skinny", and "Corners".]
Human Example

• The human used prior knowledge to select possible features that separate the objects:
  • "round", "skinny", "edges", "hard", etc.
• Items were then separated based on the chosen features.
• Separation quality was then tested to ensure that:
  • the criterion of K=3 was met
  • the groups were sufficiently "distant"
  • there was no crossover
Learning from Humans

• Length / Width
  • greater than 1 => "skinny"
  • equal to 1 => "round"
  • less than 1 => invert the ratio
• Number of Surfaces
  • distinct surfaces require "edges", which have corners
  • easier to count

Create features that capture these object differences.
Feature Engineering

Object  | Length / Width | Num Surfaces
penny   | 1              | 3
dime    | 1              | 3
knob    | 1              | 4
eraser  | 2.75           | 6
box     | 1              | 6
block   | 1.6            | 6
screw   | 8              | 3
battery | 5              | 3
key     | 4.25           | 3
bead    | 1              | 2
Now we can Plot

K=3

[Figure: scatter plot of Num Surfaces vs. Length / Width, with points for penny, dime, knob, eraser, box, block, screw, battery, key, and bead.]

K-Means key insight: we can find clusters using distances in n-dimensional feature space.

K-Means: find the "best" (minimum distance) circles that include all points.
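As a quick sketch, the plot can be reproduced from the Feature Engineering table above (matplotlib assumed; the names and values are transcribed from that table):

```python
import matplotlib.pyplot as plt

# (Length / Width, Num Surfaces) for each object, from the table above
objects = {
    "penny": (1.0, 3), "dime": (1.0, 3), "knob": (1.0, 4),
    "eraser": (2.75, 6), "box": (1.0, 6), "block": (1.6, 6),
    "screw": (8.0, 3), "battery": (5.0, 3), "key": (4.25, 3),
    "bead": (1.0, 2),
}

for name, (ratio, surfaces) in objects.items():
    plt.scatter(ratio, surfaces)
    plt.annotate(name, (ratio, surfaces))

plt.xlabel("Length / Width")
plt.ylabel("Num Surfaces")
plt.show()
```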
K-Means Algorithm

K=3

[Animation: points are assigned to the nearest of the K centroids, each centroid is recomputed as the mean of its assigned points, and the two steps repeat until the assignments stop changing.]
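A minimal NumPy sketch of this loop (Lloyd's algorithm), seeding from randomly chosen instances; the function and parameter names here are illustrative, not BigML's implementation:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) sketch."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random instances
    for _ in range(n_iter):
        # assign every point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable: converged
        centroids = new_centroids
    return labels, centroids

# e.g. the object features from the Feature Engineering table, K=3
X = np.array([[1, 3], [1, 3], [1, 4], [2.75, 6], [1, 6],
              [1.6, 6], [8, 3], [5, 3], [4.25, 3], [1, 2]], dtype=float)
labels, centroids = k_means(X, k=3)
```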
Features Matter!

[Figure: clustering the same objects on different features instead groups them as "Metal", "Wood", and "Other".]
Convergence

• Convergence is guaranteed, but not necessarily unique
• Starting points are important (k-means++)
Starting Points

• Random points or instances in n-dimensional space
• Choose points "farthest" away from each other
  • but this is sensitive to outliers
• k-means++
  • the first center is chosen randomly from the instances
  • each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from the instance's closest existing cluster center
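A sketch of that k-means++ seeding step in NumPy (names illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding sketch: first center uniformly at random, each
    later center drawn with probability proportional to the squared
    distance to its closest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from every instance to its closest center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```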
Scaling

[Figure: housing data plotted as price vs. number of bedrooms. A difference in price contributes d = 160,000 to the Euclidean distance while a difference in bedrooms contributes d = 1, so the unscaled price feature dominates.]
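One common fix, sketched below, is to z-score each feature before clustering so every dimension contributes on a comparable scale (assuming no constant columns):

```python
import numpy as np

def standardize(X):
    """Z-score each column: zero mean, unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# e.g. price vs. number of bedrooms (hypothetical values)
homes = np.array([[250_000.0, 3], [410_000.0, 4], [90_000.0, 1]])
scaled = standardize(homes)  # both columns now on comparable scales
```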
Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown “K”?
Distance to Missing Value?

• Nonsense! Try replacing missing values with:
  • Maximum
  • Mean
  • Median
  • Minimum
  • Zero
• Or ignore instances with missing values
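A sketch of these replacement strategies for a single numeric column (NumPy; the function name and strategy keys are illustrative):

```python
import numpy as np

def impute(col, strategy="median"):
    """Replace NaNs in a numeric column using one of the strategies above."""
    fill = {
        "maximum": np.nanmax(col),
        "mean":    np.nanmean(col),
        "median":  np.nanmedian(col),
        "minimum": np.nanmin(col),
        "zero":    0.0,
    }[strategy]
    return np.where(np.isnan(col), fill, col)
```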
Distance to Categorical?

One approach, similar to "k-prototypes":

• Special distance function:
  • if valA == valB then distance = 0 (or a scaling value), else distance = 1
• Assign the centroid the most common category among the member instances
• Compute Euclidean distance as normal
Distance to Categorical?

           | feature_1 | feature_2 | feature_3
instance_1 | red       | cat       | ball
instance_2 | red       | cat       | ball
instance_3 | red       | cat       | box
instance_4 | blue      | dog       | fridge

D(instance_1, instance_2) = 0
D(instance_1, instance_3) = 1
D(instance_1, instance_4) = sqrt(3)
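A sketch of such a mixed distance in the spirit of k-prototypes (the gamma weight and function names are illustrative):

```python
import numpy as np

def mixed_distance(a_num, b_num, a_cat, b_cat, gamma=1.0):
    """Squared Euclidean distance on numeric features plus a gamma-weighted
    0/1 mismatch per categorical feature, combined under one square root."""
    num_part = np.sum((np.asarray(a_num, float) - np.asarray(b_num, float)) ** 2)
    cat_part = sum(x != y for x, y in zip(a_cat, b_cat))
    return float(np.sqrt(num_part + gamma * cat_part))

# reproduces the all-categorical example above:
a = ["red", "cat", "ball"]
mixed_distance([], [], a, ["red", "cat", "ball"])     # 0.0
mixed_distance([], [], a, ["red", "cat", "box"])      # 1.0
mixed_distance([], [], a, ["blue", "dog", "fridge"])  # sqrt(3) ≈ 1.73
```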
Only Euclidean?

No. Alternatives include Mahalanobis distance and cosine similarity, which ranges from 1 (same direction) through 0 (orthogonal) to -1 (opposite directions).
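For example, a sketch of cosine similarity with its usual conversion to a distance, plus Mahalanobis distance given an inverse covariance matrix (names illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """+1 = same direction, 0 = orthogonal, -1 = opposite."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    # common conversion: 0 (same direction) up to 2 (opposite)
    return 1.0 - cosine_similarity(a, b)

def mahalanobis(a, b, inv_cov):
    """Distance that accounts for feature covariance."""
    diff = np.asarray(a) - np.asarray(b)
    return float(np.sqrt(diff @ inv_cov @ diff))
```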
Finding K: G-means

G-means finds "K" automatically: run k-means, test each cluster's points for a Gaussian fit, keep the clusters that pass, and split the ones that fail in two.

Let K=2 → Keep 1, Split 1 → New K=3
Let K=3 → Keep 1, Split 2 → New K=5
Let K=5 → all clusters pass → K=5
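A sketch of that loop, reusing the k_means sketch from the K-Means Algorithm slide and SciPy's Anderson-Darling normality test; the projection step follows the usual G-means recipe, and the names and thresholds here are illustrative:

```python
import numpy as np
from scipy.stats import anderson

def looks_gaussian(points):
    """Split the cluster with 2-means, project its points onto the axis
    joining the two child centroids, and Anderson-Darling-test that 1-D
    projection for normality."""
    if len(points) < 8:
        return True  # too few points to test or split meaningfully
    _, children = k_means(points, k=2)  # k_means: the earlier sketch
    v = children[1] - children[0]
    if np.allclose(v, 0):
        return True
    proj = points @ v / np.dot(v, v)
    result = anderson(proj, dist="norm")
    # index 2 of scipy's critical values is the 5% significance level
    return result.statistic < result.critical_values[2]

def g_means(X, max_k=10):
    """Grow K: keep Gaussian-looking clusters, split the rest in two."""
    k = 1
    while k <= max_k:
        labels, centers = k_means(X, k)
        n_keep = sum(looks_gaussian(X[labels == j]) for j in range(k))
        if n_keep == k:
            return labels, centers  # every cluster passed
        k += k - n_keep  # each failing cluster becomes two
    return k_means(X, max_k)
```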