# L13. Cluster Analysis


Poul Petersen, BigML

## Trees vs. Clusters

**Trees (Supervised Learning)**

- Provide: labeled data
- Learning task: be able to predict the label

**Clusters (Unsupervised Learning)**

- Provide: unlabeled data
- Learning task: group data by similarity

## Trees vs. Clusters

Trees (supervised): labeled data, with inputs X and label Y.

| sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 5.7 | 2.6 | 3.5 | 1.0 | versicolor |
| 6.7 | 2.5 | 5.8 | 1.8 | virginica |

Learning task: find a function f such that f(X) ≈ Y.

Clusters (unsupervised): the same data without labels.

| sepal length | sepal width | petal length | petal width |
|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 |
| 5.7 | 2.6 | 3.5 | 1.0 |
| 6.7 | 2.5 | 5.8 | 1.8 |

Learning task: find k clusters such that the data in each cluster is self-similar.
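A minimal sketch of the two tasks on the iris data, assuming scikit-learn is available:

```python
# Contrast the two learning tasks on iris, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learn f such that f(X) ~ Y from labeled data.
tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict(X[:3]))           # predicted species labels

# Unsupervised: group the same rows by similarity; no labels used.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:3])            # cluster assignments
```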

## Use Cases

- Customer segmentation
- Item discovery
- Association
- Recommender
- Active learning

## Customer Segmentation

- Dataset of mobile game users. Data for each user consists of usage statistics and an LTV based on in-game purchases.
- Assumption: usage correlates with LTV.

GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for upsell.

[Figure: user clusters annotated with their percentage of high-LTV users: 0%, 3%, 7%, 1%]

## Item Discovery

- Dataset of 86 whiskies. Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.

GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.

[Figure: flavor-profile plot with axes such as Smoky, Fruity, Body]

## Association

- Dataset of Lending Club loans.
- Mark any loan that is currently late, or has ever been late, as "trouble".

GOAL: Cluster the loans by application profile to rank loan quality by the percentage of trouble loans in each cluster.

[Figure: loan clusters annotated with their percentage of trouble loans: 0%, 3%, 7%, 1%]

## Active Learning

- Dataset of diagnostic measurements of 768 patients.
- We want to test each patient for diabetes and label the dataset to build a model, but the test is expensive.*

GOAL: Rather than sampling randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data, as in the sketch below.

*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud. Or a million images which need to be labeled as cat/not-cat.
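A minimal sketch of this cluster-then-sample idea, assuming scikit-learn; the feature matrix here is random stand-in data, not the real patient measurements:

```python
# Label a few representatives per cluster instead of sampling randomly.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))        # stand-in for the real measurements

k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Send a handful of instances from each cluster for (expensive) testing.
to_label = np.concatenate([
    rng.choice(np.where(km.labels_ == c)[0],
               size=min(5, np.sum(km.labels_ == c)), replace=False)
    for c in range(k)
])
print(len(to_label), "patients to test instead of all 768")
```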

## Human Example

K=3

[Figure: a human sorts a pile of household objects into three groups: Round, Skinny, Corners]

## Human Example

- The human used prior knowledge to select possible features that separated the objects: round, skinny, edges, hard, etc.
- Items were then separated based on the chosen features.
- Separation quality was then tested to ensure that it met the criterion of K=3, the groups were sufficiently distant, and there was no crossover.

## Learning from Humans

- Length/Width: greater than 1 => skinny; equal to 1 => round; less than 1 => invert the ratio.
- Number of surfaces: distinct surfaces require edges, which have corners and are easier to count.
- Create features that capture these object differences.

## Feature Engineering

| Object | Length / Width | Num Surfaces |
|---|---|---|
| penny | 1 | 3 |
| dime | 1 | 3 |
| knob | 1 | 4 |
| eraser | 2.75 | 6 |
| box | 1 | 6 |
| block | 1.6 | 6 |
| screw | 8 | 3 |
| battery | 5 | 3 |
| key | 4.25 | 3 |
| bead | 1 | 2 |

## Now We Can Plot

[Scatter plot of Num Surfaces vs. Length/Width with K=3 groups: {box, block, eraser}, {knob, penny, dime, bead}, {key, battery, screw}]

K-Means key insight: we can find clusters using distances in n-dimensional feature space.
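A short sketch that reproduces this plot from the feature table above, assuming matplotlib:

```python
# Scatter plot of the engineered features, assuming matplotlib.
import matplotlib.pyplot as plt

objects = {"penny": (1, 3), "dime": (1, 3), "knob": (1, 4),
           "eraser": (2.75, 6), "box": (1, 6), "block": (1.6, 6),
           "screw": (8, 3), "battery": (5, 3), "key": (4.25, 3),
           "bead": (1, 2)}

for name, (ratio, surfaces) in objects.items():
    plt.scatter(ratio, surfaces)
    plt.annotate(name, (ratio, surfaces))
plt.xlabel("Length / Width")
plt.ylabel("Num Surfaces")
plt.show()
```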

## Now We Can Plot

[The same scatter plot, with each of the three groups enclosed by a circle]

K-Means: find the best (minimum distance) circles that include all points.

## K-Means Algorithm

K=3

[Animation: place K=3 centroids, assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat until the centroids stop moving]
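A compact from-scratch sketch of those iterations (Lloyd's algorithm), assuming only numpy; initialization here is plain random instances:

```python
# Minimal k-means (Lloyd's algorithm), assuming numpy.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k random instances.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points;
        # keep the old centroid if a cluster ends up empty.
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels
```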

## Features Matter!

[Figure: with different features, the same objects instead cluster by material: Metal, Wood, Other]

## Convergence

[Animation: the centroids shift less with each iteration until they stop moving]

- Convergence is guaranteed, but not necessarily to a unique solution.
- Starting points are important (k-means++).
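Because different starts can converge to different answers, one common remedy (a sketch assuming scikit-learn) is several restarts, keeping the lowest-inertia run:

```python
# Different seeds can converge to different clusterings; keep the best.
# X here is random demo data standing in for any numeric feature matrix.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))

runs = [KMeans(n_clusters=3, init="random", n_init=1, random_state=s).fit(X)
        for s in range(10)]
best = min(runs, key=lambda km: km.inertia_)   # lowest within-cluster SSE
print(best.inertia_)
```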

## Starting Points

- Random points or instances in n-dimensional space.
- Choose points farthest away from each other; but this is sensitive to outliers.
- k-means++ (sketched after this list):
  - the first center is chosen randomly from the instances
  - each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from the instance's closest existing cluster center
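A minimal sketch of that k-means++ seeding rule, assuming numpy:

```python
# k-means++ seeding, assuming numpy.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: a uniformly random instance.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each instance to its closest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center with probability proportional to d^2
        # (already-chosen instances have d2 = 0, so they are never re-picked).
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```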

## Scaling

[Figure: house price vs. number of bedrooms. A typical distance along the price axis is d = 160,000; along the bedrooms axis it is d = 1, so unscaled Euclidean distance is dominated entirely by price]
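A small sketch of the problem and the usual fix, assuming scikit-learn; the house values are made up:

```python
# Without scaling, price swamps bedrooms in the Euclidean distance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[250_000.0, 3],      # price, bedrooms (made-up houses)
              [410_000.0, 4],
              [175_000.0, 2]])

X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X[0] - X[1]))         # ~160,000: essentially all price
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # both features contribute
```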

## Other Tricks

- What is the distance to a missing value?
- What is the distance between categorical values?
- What is the distance between text features?
- Does it have to be Euclidean distance?
- What if K is unknown?

## Distance to a Missing Value?

Nonsense! Try replacing missing values with:

- Maximum
- Mean
- Median
- Minimum
- Zero

Or simply ignore instances with missing values.
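A sketch of these replacement strategies using scikit-learn's SimpleImputer; the tiny matrix is made up:

```python
# Fill missing values before clustering, assuming scikit-learn.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan]])

# strategy can be "mean", "median", "most_frequent", or "constant" (e.g. 0).
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
print(X_filled)
```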

## Distance to Categorical?

One approach, similar to k-prototypes:

- Special distance function: if valA == valB, then distance = 0; else distance = 1 (or a scaling value).
- Assign the centroid the most common category of the member instances.
- Compute Euclidean distance as normal.

## Distance to Categorical?

| | feature_1 | feature_2 | feature_3 |
|---|---|---|---|
| instance_1 | red | cat | ball |
| instance_2 | red | cat | ball |
| instance_3 | red | cat | box |
| instance_4 | blue | dog | fridge |

Distances from instance_1: to instance_2, D = 0; to instance_3, D = 1; to instance_4, D = sqrt(3).
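A plain-Python sketch of that 0/1 distance, reproducing the three values above:

```python
# 0/1 categorical distance inside a Euclidean sum.
import math

def cat_distance(a, b):
    # Each mismatched feature contributes 1 to the sum under the root.
    return math.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))

i1 = ("red", "cat", "ball")
print(cat_distance(i1, ("red", "cat", "ball")))     # 0.0
print(cat_distance(i1, ("red", "cat", "box")))      # 1.0
print(cat_distance(i1, ("blue", "dog", "fridge")))  # sqrt(3) ~ 1.732
```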

## Only Euclidean?

No. Alternatives include Mahalanobis distance and cosine similarity (which ranges from -1 through 0 to 1).
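A small sketch contrasting Euclidean and cosine distance, assuming scipy:

```python
# Two vectors pointing the same way: far apart in Euclidean terms,
# identical in cosine terms. Assumes scipy.
import numpy as np
from scipy.spatial.distance import cosine, euclidean

a, b = np.array([1.0, 0.0]), np.array([10.0, 0.0])
print(euclidean(a, b))   # 9.0
print(cosine(a, b))      # 0.0 (cosine distance = 1 - cosine similarity)
```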

## Finding K: G-means

G-means grows K iteratively: cluster the data, test each cluster, keep the centers whose points look Gaussian, and split those that don't, as sketched after this list.

[Animation over three rounds:]

- Let K=2. One cluster passes the test, one does not: keep 1, split 1. New K=3.
- Let K=3. One cluster passes, two do not: keep 1, split 2. New K=5.
- Let K=5. Every cluster passes: K=5 recommended.
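A rough sketch of one G-means round, assuming numpy, scipy, and scikit-learn. Following Hamerly and Elkan's formulation (an assumption here, not stated in the slides), each cluster is tentatively split in two, its points are projected onto the axis between the child centers, and the projection is tested for normality with the Anderson-Darling statistic:

```python
# One G-means round: keep Gaussian-looking clusters, split the rest.
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def should_split(points):
    """True if the cluster's points do NOT look Gaussian (so: split it)."""
    if len(points) < 8:          # too few points to test sensibly
        return False
    # Tentatively split the cluster in two.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    v = km.cluster_centers_[1] - km.cluster_centers_[0]
    # Project each point onto the axis between the two child centers.
    proj = points @ v / (v @ v)
    res = anderson(proj, dist="norm")
    # Compare the statistic to the 1% critical value (index 4).
    return res.statistic > res.critical_values[4]

def gmeans_round(X, centers):
    centers = np.asarray(centers)
    labels = KMeans(n_clusters=len(centers), init=centers,
                    n_init=1, random_state=0).fit(X).labels_
    new_centers = []
    for c in range(len(centers)):
        pts = X[labels == c]
        if should_split(pts):
            km2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
            new_centers.extend(km2.cluster_centers_)   # split: 2 children
        else:
            new_centers.append(pts.mean(axis=0))       # keep
    return np.array(new_centers)  # repeat rounds until nothing splits
```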