L13. Cluster Analysis


Page 1: L13. Cluster Analysis

Cluster Analysis

Poul Petersen

BigML

Page 2: L13. Cluster Analysis

Trees vs Clusters

Trees (Supervised Learning)

Provide: labeled data
Learning Task: predict the label

Clusters (Unsupervised Learning)

Provide: unlabeled data
Learning Task: group data by similarity

Page 3: L13. Cluster Analysis

Trees vs Clusters

sepal length  sepal width  petal length  petal width  species
5.1           3.5          1.4           0.2          setosa
5.7           2.6          3.5           1.0          versicolor
6.7           2.5          5.8           1.8          virginica
…             …            …             …            …

Inputs “X” and Label “Y”
Learning Task: Find a function “f” such that f(X) ≈ Y

sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …

Learning Task: Find “k” clusters such that the data in each cluster is self-similar
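To make the contrast concrete, here is a minimal sketch of both tasks on the iris data; scikit-learn is an assumption here, since the slides name no tool:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: use the labels Y to learn f such that f(X) ≈ Y.
tree = DecisionTreeClassifier().fit(X, y)

# Unsupervised: ignore the labels and group the same rows by similarity.
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)

print(tree.predict(X[:3]))   # predicted labels
print(clusters[:3])          # cluster assignments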

Page 4: L13. Cluster Analysis

Use Cases

• Customer segmentation

• Item discovery

• Association

• Recommender

• Active learning

Page 5: L13. Cluster Analysis

Customer Segmentation

GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for upsell.

• Dataset of mobile game users.

• Data for each user consists of usage statistics and an LTV (lifetime value) based on in-game purchases.

• Assumption: usage correlates to LTV.

[Figure: example clusters containing 0%, 3%, 7%, and 1% high-LTV users]

Page 6: L13. Cluster Analysis

Item Discovery

• Dataset of 86 whiskies.

• Each whisky is scored on a scale from 0 to 4 for each of 12 flavor characteristics.

GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.

[Figure: flavor-profile axes, e.g. Smoky, Fruity, Body]

Page 7: L13. Cluster Analysis

Association

• Dataset of Lending Club loans.

• Mark any loan that is currently late, or has ever been late, as “trouble”.

GOAL: Cluster the loans by application profile to rank loan quality by the percentage of trouble loans in each cluster’s population.

[Figure: example clusters containing 0%, 3%, 7%, and 1% trouble loans]

Page 8: L13. Cluster Analysis

Active Learning

GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.

• Dataset of diagnostic measurements of 768 patients.

• We want to test each patient for diabetes and label the dataset to build a model, but the test is expensive.*

*For a more realistic example of high cost, imagine a dataset with a billion transactions, each needing to be labeled as fraud/not-fraud, or a million images that need to be labeled as cat/not-cat.
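A hedged sketch of the pattern (the general idea, not BigML's implementation), with made-up patient data: cluster the unlabeled instances, then spend the expensive test only on the instance nearest each centroid.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))   # stand-in for the 768 patients' measurements

km = KMeans(n_clusters=10, n_init=10).fit(X)

# Send only the instance nearest each centroid for the expensive test;
# its label is then a reasonable guess for the rest of its cluster.
to_label = [int(np.argmin(((X - c) ** 2).sum(axis=1)))
            for c in km.cluster_centers_]
print("instances to test:", to_label)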

Page 9: L13. Cluster Analysis


Human Example

K=3

Page 10: L13. Cluster Analysis


Human Example

“Round”

“Skinny”

“Corners”

Page 11: L13. Cluster Analysis

Human Example

• The human used prior knowledge to select possible features that separated the objects:

• “round”, “skinny”, “edges”, “hard”, etc.

• Items were then separated based on the chosen features.

• Separation quality was then tested to ensure:

• met criteria of K=3

• groups were sufficiently “distant”

• no crossover

Page 12: L13. Cluster Analysis

Learning from Humans

• Length/Width

• greater than 1 => “skinny”

• equal to 1 => “round”

• less than 1 => invert the ratio

• Number of Surfaces

• distinct surfaces require “edges”, which have corners

• easier to count

Create features that capture these object differences.

Page 13: L13. Cluster Analysis

Feature Engineering

Object   Length/Width  Num Surfaces
penny    1             3
dime     1             3
knob     1             4
eraser   2.75          6
box      1             6
block    1.6           6
screw    8             3
battery  5             3
key      4.25          3
bead     1             2

Page 14: L13. Cluster Analysis

Now we can Plot

K=3

[Scatter plot: Num Surfaces vs. Length/Width, with the points box, block, eraser; knob; penny, dime; bead; key, battery, screw]

K-Means Key Insight: we can find clusters using distances in n-dimensional feature space
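Running K=3 on the engineered features from the table above; a minimal sketch using scikit-learn (an assumption, since the slides don't name a tool):

import numpy as np
from sklearn.cluster import KMeans

objects = ["penny", "dime", "knob", "eraser", "box",
           "block", "screw", "battery", "key", "bead"]
# Columns: Length/Width, Num Surfaces (from the Feature Engineering table).
X = np.array([[1, 3], [1, 3], [1, 4], [2.75, 6], [1, 6],
              [1.6, 6], [8, 3], [5, 3], [4.25, 3], [1, 2]])

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
for name, label in zip(objects, labels):
    print(label, name)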

Page 15: L13. Cluster Analysis

Now we can Plot

[Same scatter plot, with minimum-distance circles drawn around the three groups]

K-Means: find the “best” (minimum-distance) circles that include all points

Page 16: L13. Cluster Analysis

K-Means Algorithm

K=3

[Animated over Pages 16–21: assign each point to its nearest centroid, recompute the centroids, and repeat]


Page 22: L13. Cluster Analysis

Features Matter!

[The same objects cluster differently with different features, e.g. into Metal, Wood, and Other groups]

Page 23: L13. Cluster Analysis

Convergence

Convergence is guaranteed, but not necessarily unique.

Starting points are important (K++).

Page 26: L13. Cluster Analysis

Starting Points

• Random points or instances in n-dimensional space

• Choose points “farthest” away from each other

• but this is sensitive to outliers

• k++ (sketched in code below)

• the first center is chosen randomly from the instances

• each subsequent center is chosen from the remaining instances, with probability proportional to its squared distance from the instance's closest existing cluster center
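A compact sketch of that k++ seeding rule (illustrative; not necessarily BigML's exact implementation):

import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    # First center: a uniformly random instance.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each instance to its closest chosen center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1),
                    axis=1)
        # Sample the next center proportionally to that squared distance.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(100, 2))
print(kmeans_pp_init(X, 3))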

Page 27: L13. Cluster Analysis

Scaling

[Figure: homes plotted by price and number of bedrooms. Unscaled, a typical price difference gives d = 160,000 while a bedroom difference gives d = 1, so price dominates the distance.]
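A quick sketch with made-up home data showing the problem and one standard fix, standardizing each feature before computing distances:

import numpy as np

# Columns: price, number of bedrooms (values invented for illustration).
homes = np.array([[250_000.0, 2], [410_000.0, 4], [260_000.0, 5]])

def dist(a, b):
    return np.sqrt(((a - b) ** 2).sum())

print(dist(homes[0], homes[2]))     # dominated entirely by the price gap

# Standardize each feature to mean 0, standard deviation 1 first.
scaled = (homes - homes.mean(axis=0)) / homes.std(axis=0)
print(dist(scaled[0], scaled[2]))   # now bedrooms matter too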

Page 28: L13. Cluster Analysis


Other Tricks

• What is the distance to a “missing value”?

• What is the distance between categorical values?

• What is the distance between text features?

• Does it have to be Euclidean distance?

• Unknown “K”?

Page 29: L13. Cluster Analysis

Distance to Missing Value?

• Nonsense! Try replacing missing values with one of the following (sketched in code after the list):

• Maximum

• Mean

• Median

• Minimum

• Zero

• Ignore instances with missing values
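A small sketch of those replacement options on a made-up column with missing values:

import numpy as np

col = np.array([4.0, np.nan, 7.0, 5.0, np.nan])
fills = {
    "maximum": np.nanmax(col),
    "mean":    np.nanmean(col),
    "median":  np.nanmedian(col),
    "minimum": np.nanmin(col),
    "zero":    0.0,
}
for name, v in fills.items():
    # Replace only the missing entries with the chosen fill value.
    print(name, np.where(np.isnan(col), v, col))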

Page 30: L13. Cluster Analysis


Distance to Categorical?

• Special distance function

• if valA == valB then distance = 0 (or scaling value) else distance = 1

• Assign centroid the most common category of the member instances

• Compute Euclidean distance as normal

One approach: similar to “k-prototypes”

Page 31: L13. Cluster Analysis

Distance to Categorical?

             feature_1  feature_2  feature_3
instance_1   red        cat        ball
instance_2   red        cat        ball
instance_3   red        cat        box
instance_4   blue       dog        fridge

D(instance_1, instance_2) = 0
D(instance_1, instance_3) = 1
D(instance_1, instance_4) = sqrt(3)
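A sketch of that special distance function applied to the instances above: each mismatched feature contributes 1 to the squared distance, reproducing D = 0, 1, and sqrt(3).

import numpy as np

def cat_distance(a, b):
    # Matching categorical values contribute 0, mismatches contribute 1,
    # combined like Euclidean coordinates.
    return np.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))

i1 = ("red", "cat", "ball")
i2 = ("red", "cat", "ball")
i3 = ("red", "cat", "box")
i4 = ("blue", "dog", "fridge")

print(cat_distance(i1, i2))  # 0.0
print(cat_distance(i1, i3))  # 1.0
print(cat_distance(i1, i4))  # sqrt(3) ≈ 1.732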

Page 32: L13. Cluster Analysis

Only Euclidean?

[Alternatives include Mahalanobis distance and cosine similarity, which ranges from 1 through 0 to -1]
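For instance, cosine similarity compares direction rather than magnitude, ranging from 1 (same direction) through 0 (orthogonal) to -1 (opposite). A small sketch:

import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))                            #  1.0 (same direction)
print(cosine_similarity(a, -a))                           # -1.0 (opposite)
print(cosine_similarity(a, np.array([-3.0, 0.0, 1.0])))   #  0.0 (orthogonal)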

Page 33: L13. Cluster Analysis

Finding K: G-means

[Animated over Pages 33–44: run K-means, test whether each cluster’s points look Gaussian, keep the clusters that do, split the ones that don’t, and repeat with the new K]


Page 36: L13. Cluster Analysis


Finding K: G-means

Let K=2

Keep 1, Split 1 → New K=3

Page 39: L13. Cluster Analysis


Finding K: G-means

Let K=3

Keep 1, Split 2 → New K=5

Page 42: L13. Cluster Analysis


Finding K: G-means

Let K=5

K=5 (no further splits)
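A rough sketch of the G-means loop illustrated in these slides, following Hamerly and Elkan's formulation: keep a cluster whose points look Gaussian along the candidate split direction, otherwise split it and repeat with the new K. The Anderson-Darling threshold here is illustrative, not BigML's exact setting.

import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def g_means(X, max_k=20):
    k = 1
    while True:
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        new_k = 0
        for i in range(k):
            pts = X[km.labels_ == i]
            if len(pts) < 8:              # too small to test; keep as-is
                new_k += 1
                continue
            # Split the cluster in two and project its points onto the
            # axis connecting the two child centers.
            child = KMeans(n_clusters=2, n_init=10).fit(pts)
            v = child.cluster_centers_[1] - child.cluster_centers_[0]
            proj = pts @ v / (v @ v)
            # Anderson-Darling: keep if Gaussian enough, else split in two.
            new_k += 1 if anderson(proj).statistic < 1.0 else 2
        if new_k == k or new_k > max_k:   # no cluster wants to split: done
            return km
        k = new_k

# Three synthetic Gaussian blobs; G-means should settle near K=3.
X = np.vstack([np.random.default_rng(s).normal(loc=s * 4, size=(60, 2))
               for s in range(3)])
print(g_means(X).n_clusters)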