L13. Cluster Analysis


Page 1: L13. Cluster Analysis

Cluster Analysis

Poul Petersen

BigML

Page 2: L13. Cluster Analysis

Trees vs Clusters

Trees (Supervised Learning)

Provide: labeled data
Learning Task: predict the label

Clusters (Unsupervised Learning)

Provide: unlabeled data
Learning Task: group data by similarity

Page 3: L13. Cluster Analysis

Trees vs Clusters

sepal length  sepal width  petal length  petal width  species
5.1           3.5          1.4           0.2          setosa
5.7           2.6          3.5           1.0          versicolor
6.7           2.5          5.8           1.8          virginica
…             …            …             …            …

Inputs “X” and Label “Y”
Learning Task: Find a function “f” such that f(X) ≈ Y

sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …

Learning Task: Find “k” clusters such that the data in each cluster is self-similar
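To make the contrast concrete, here is a minimal sketch of both tasks on the iris data; scikit-learn is an assumption here, since the slides name no tool:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: use the labels Y to learn f such that f(X) ≈ Y.
tree = DecisionTreeClassifier().fit(X, y)

# Unsupervised: ignore the labels and group the same rows by similarity.
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)

print(tree.predict(X[:3]))   # predicted labels
print(clusters[:3])          # cluster assignments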

Page 4: L13. Cluster Analysis

Use Cases

• Customer segmentation

• Item discovery

• Association

• Recommender

• Active learning

Page 5: L13. Cluster Analysis

Customer Segmentation

GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for upsell.

• Dataset of mobile game users.

• Data for each user consists of usage statistics and an LTV (lifetime value) based on in-game purchases.

• Assumption: usage correlates to LTV.

[Figure: example clusters containing 0%, 3%, 7%, and 1% high-LTV users]

Page 6: L13. Cluster Analysis

Item Discovery

• Dataset of 86 whiskies.

• Each whisky is scored on a scale from 0 to 4 for each of 12 flavor characteristics.

GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.

[Figure: flavor-profile axes, e.g. Smoky, Fruity, Body]

Page 7: L13. Cluster Analysis

Association

• Dataset of Lending Club loans.

• Mark any loan that is currently late, or has ever been late, as “trouble”.

GOAL: Cluster the loans by application profile to rank loan quality by the percentage of trouble loans in each cluster’s population.

[Figure: example clusters containing 0%, 3%, 7%, and 1% trouble loans]

Page 8: L13. Cluster Analysis

Active Learning

GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.

• Dataset of diagnostic measurements of 768 patients.

• We want to test each patient for diabetes and label the dataset to build a model, but the test is expensive.*

*For a more realistic example of high cost, imagine a dataset with a billion transactions, each needing to be labeled as fraud/not-fraud, or a million images that need to be labeled as cat/not-cat.
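A hedged sketch of the pattern (the general idea, not BigML's implementation), with made-up patient data: cluster the unlabeled instances, then spend the expensive test only on the instance nearest each centroid.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))   # stand-in for the 768 patients' measurements

km = KMeans(n_clusters=10, n_init=10).fit(X)

# Send only the instance nearest each centroid for the expensive test;
# its label is then a reasonable guess for the rest of its cluster.
to_label = [int(np.argmin(((X - c) ** 2).sum(axis=1)))
            for c in km.cluster_centers_]
print("instances to test:", to_label)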

Page 9: L13. Cluster Analysis


Human Example

K=3

Page 10: L13. Cluster Analysis


Human Example

“Round”

“Skinny”

“Corners”

Page 11: L13. Cluster Analysis

Human Example

• The human used prior knowledge to select possible features that separated the objects:

• “round”, “skinny”, “edges”, “hard”, etc.

• Items were then separated based on the chosen features.

• Separation quality was then tested to ensure:

• met criteria of K=3

• groups were sufficiently “distant”

• no crossover

Page 12: L13. Cluster Analysis

Learning from Humans

• Length/Width

• greater than 1 => “skinny”

• equal to 1 => “round”

• less than 1 => invert the ratio

• Number of Surfaces

• distinct surfaces require “edges”, which have corners

• easier to count

Create features that capture these object differences.

Page 13: L13. Cluster Analysis

Feature Engineering

Object   Length/Width  Num Surfaces
penny    1             3
dime     1             3
knob     1             4
eraser   2.75          6
box      1             6
block    1.6           6
screw    8             3
battery  5             3
key      4.25          3
bead     1             2

Page 14: L13. Cluster Analysis

Now we can Plot

K=3

[Scatter plot: Num Surfaces vs. Length/Width, with the points box, block, eraser; knob; penny, dime; bead; key, battery, screw]

K-Means Key Insight: we can find clusters using distances in n-dimensional feature space
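Running K=3 on the engineered features from the table above; a minimal sketch using scikit-learn (an assumption, since the slides don't name a tool):

import numpy as np
from sklearn.cluster import KMeans

objects = ["penny", "dime", "knob", "eraser", "box",
           "block", "screw", "battery", "key", "bead"]
# Columns: Length/Width, Num Surfaces (from the Feature Engineering table).
X = np.array([[1, 3], [1, 3], [1, 4], [2.75, 6], [1, 6],
              [1.6, 6], [8, 3], [5, 3], [4.25, 3], [1, 2]])

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
for name, label in zip(objects, labels):
    print(label, name)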

Page 15: L13. Cluster Analysis

Now we can Plot

[Same scatter plot, with minimum-distance circles drawn around the three groups]

K-Means: find the “best” (minimum-distance) circles that include all points

Page 16: L13. Cluster Analysis

K-Means Algorithm

K=3

[Animated over Pages 16–21: assign each point to its nearest centroid, recompute the centroids, and repeat]


Page 22: L13. Cluster Analysis

Features Matter!

[The same objects cluster differently with different features, e.g. into Metal, Wood, and Other groups]

Page 23: L13. Cluster Analysis

Convergence

Convergence is guaranteed, but not necessarily unique.

Starting points are important (K++).

Page 26: L13. Cluster Analysis

Starting Points

• Random points or instances in n-dimensional space

• Choose points “farthest” away from each other

• but this is sensitive to outliers

• k++ (sketched in code below)

• the first center is chosen randomly from the instances

• each subsequent center is chosen from the remaining instances, with probability proportional to its squared distance from the instance's closest existing cluster center
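A compact sketch of that k++ seeding rule (illustrative; not necessarily BigML's exact implementation):

import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    # First center: a uniformly random instance.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each instance to its closest chosen center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1),
                    axis=1)
        # Sample the next center proportionally to that squared distance.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(100, 2))
print(kmeans_pp_init(X, 3))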

Page 27: L13. Cluster Analysis

Scaling

[Figure: homes plotted by price and number of bedrooms. Unscaled, a typical price difference gives d = 160,000 while a bedroom difference gives d = 1, so price dominates the distance.]
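A quick sketch with made-up home data showing the problem and one standard fix, standardizing each feature before computing distances:

import numpy as np

# Columns: price, number of bedrooms (values invented for illustration).
homes = np.array([[250_000.0, 2], [410_000.0, 4], [260_000.0, 5]])

def dist(a, b):
    return np.sqrt(((a - b) ** 2).sum())

print(dist(homes[0], homes[2]))     # dominated entirely by the price gap

# Standardize each feature to mean 0, standard deviation 1 first.
scaled = (homes - homes.mean(axis=0)) / homes.std(axis=0)
print(dist(scaled[0], scaled[2]))   # now bedrooms matter too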

Page 28: L13. Cluster Analysis


Other Tricks

• What is the distance to a “missing value”?

• What is the distance between categorical values?

• What is the distance between text features?

• Does it have to be Euclidean distance?

• Unknown “K”?

Page 29: L13. Cluster Analysis

Distance to Missing Value?

• Nonsense! Try replacing missing values with one of the following (sketched in code after the list):

• Maximum

• Mean

• Median

• Minimum

• Zero

• Ignore instances with missing values
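A small sketch of those replacement options on a made-up column with missing values:

import numpy as np

col = np.array([4.0, np.nan, 7.0, 5.0, np.nan])
fills = {
    "maximum": np.nanmax(col),
    "mean":    np.nanmean(col),
    "median":  np.nanmedian(col),
    "minimum": np.nanmin(col),
    "zero":    0.0,
}
for name, v in fills.items():
    # Replace only the missing entries with the chosen fill value.
    print(name, np.where(np.isnan(col), v, col))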

Page 30: L13. Cluster Analysis


Distance to Categorical?

• Special distance function

• if valA == valB then distance = 0 (or scaling value) else distance = 1

• Assign centroid the most common category of the member instances

• Compute Euclidean distance as normal

One approach: similar to “k-prototypes”

Page 31: L13. Cluster Analysis

Distance to Categorical?

             feature_1  feature_2  feature_3
instance_1   red        cat        ball
instance_2   red        cat        ball
instance_3   red        cat        box
instance_4   blue       dog        fridge

D(instance_1, instance_2) = 0
D(instance_1, instance_3) = 1
D(instance_1, instance_4) = sqrt(3)
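A sketch of that special distance function applied to the instances above: each mismatched feature contributes 1 to the squared distance, reproducing D = 0, 1, and sqrt(3).

import numpy as np

def cat_distance(a, b):
    # Matching categorical values contribute 0, mismatches contribute 1,
    # combined like Euclidean coordinates.
    return np.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))

i1 = ("red", "cat", "ball")
i2 = ("red", "cat", "ball")
i3 = ("red", "cat", "box")
i4 = ("blue", "dog", "fridge")

print(cat_distance(i1, i2))  # 0.0
print(cat_distance(i1, i3))  # 1.0
print(cat_distance(i1, i4))  # sqrt(3) ≈ 1.732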

Page 32: L13. Cluster Analysis

Only Euclidean?

[Alternatives include Mahalanobis distance and cosine similarity, which ranges from 1 through 0 to -1]
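For instance, cosine similarity compares direction rather than magnitude, ranging from 1 (same direction) through 0 (orthogonal) to -1 (opposite). A small sketch:

import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))                            #  1.0 (same direction)
print(cosine_similarity(a, -a))                           # -1.0 (opposite)
print(cosine_similarity(a, np.array([-3.0, 0.0, 1.0])))   #  0.0 (orthogonal)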

Page 33: L13. Cluster Analysis

Finding K: G-means

[Animated over Pages 33–44: run K-means, test whether each cluster’s points look Gaussian, keep the clusters that do, split the ones that don’t, and repeat with the new K]


Page 36: L13. Cluster Analysis


Finding K: G-means

Let K=2

Keep 1, Split 1 → New K=3

Page 39: L13. Cluster Analysis


Finding K: G-means

Let K=3

Keep 1, Split 2 → New K=5

Page 42: L13. Cluster Analysis


Finding K: G-means

Let K=5

K=5 (no further splits)
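A rough sketch of the G-means loop illustrated in these slides, following Hamerly and Elkan's formulation: keep a cluster whose points look Gaussian along the candidate split direction, otherwise split it and repeat with the new K. The Anderson-Darling threshold here is illustrative, not BigML's exact setting.

import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def g_means(X, max_k=20):
    k = 1
    while True:
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        new_k = 0
        for i in range(k):
            pts = X[km.labels_ == i]
            if len(pts) < 8:              # too small to test; keep as-is
                new_k += 1
                continue
            # Split the cluster in two and project its points onto the
            # axis connecting the two child centers.
            child = KMeans(n_clusters=2, n_init=10).fit(pts)
            v = child.cluster_centers_[1] - child.cluster_centers_[0]
            proj = pts @ v / (v @ v)
            # Anderson-Darling: keep if Gaussian enough, else split in two.
            new_k += 1 if anderson(proj).statistic < 1.0 else 2
        if new_k == k or new_k > max_k:   # no cluster wants to split: done
            return km
        k = new_k

# Three synthetic Gaussian blobs; G-means should settle near K=3.
X = np.vstack([np.random.default_rng(s).normal(loc=s * 4, size=(60, 2))
               for s in range(3)])
print(g_means(X).n_clusters)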