tunup final presentation

TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering

Gianmario [email protected]

March 2013

AgilOne, Inc.
1091 N Shoreline Blvd. #250
Mountain View, CA 94043

Agenda

1. Introduction
2. Problem description
3. TunUp
4. K-means
5. Clustering evaluation
6. Full space tuning
7. Genetic algorithm tuning
8. Conclusions

Big Data

Business Intelligence

Why? Where? What? How?

Insights of customers, products and companies

Can someone else know your customer better than you?

Do you have the domain knowledge and proper computation infrastructure?

Big Data as a Service (BDaaS)

Problem Description

[Figure; labels: income, cost, customers]

Tuning of Clustering Algorithms

We need tuning when:

➢ New algorithm or version is released

➢ We want to improve accuracy and/or performance

➢ New customer comes and the system must be adapted for the new dataset and requirements


TunUp

Java framework integrating JavaML and Watchmaker

Main features:

➢ Data manipulation (loading, labelling and normalization)
➢ Clustering algorithms (k-means)
➢ Clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand)
➢ Evaluation techniques validation (Pearson Correlation t-test)
➢ Full search space tuning
➢ Genetic Algorithm tuning (local and parallel implementation)
➢ RESTful API for web service deployment (Tomcat in Amazon EC2)

Open-source: http://github.com/gm-spacagna/tunup

k-means

Geometric hard-assigning clustering algorithm:

It partitions n data points into k clusters in which each point belongs to the cluster with the nearest mean centroid.

Given k clusters S = {S1,...,Sk}, where xj is the j-th point in cluster Si and μi is the mean (centroid) of Si, the goal of k-means is to minimize the Within-Cluster Sum of Squares:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

Algorithm:

1. Initialization: a set of k random centroids is generated

2. Assignment: each point is assigned to the closest centroid

3. Update: the new centroids are calculated as the mean of the new clusters

4. Repeat from step 2 until convergence (the centroids are stable and no longer change)
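The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TunUp (Java) implementation: the initial centroids are sampled from the data points, the distance is Euclidean, and the names `kmeans` and `wcss` are hypothetical.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, max_iterations=20, seed=None):
    """Plain k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # 1. Initialization: assumption, k distinct data points act as random centroids
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # 2. Assignment: each point goes to the closest centroid
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # 4. Convergence: unchanged assignments mean the centroids are stable
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # 3. Update: each centroid becomes the mean of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignment


def wcss(points, centroids, assignment):
    """Within-Cluster Sum of Squares for a given assignment."""
    return sum(dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
```

Calling `kmeans(points, k=2, seed=0)` returns the centroids and one cluster index per point; `wcss` then gives the quantity that k-means tries to minimize.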

k-means tuning

Input parameters required:

1. K = (2,...,40)

2. Distance measure

3. Max iterations = 20 (fixed)

Different input parameters

Very different outcomes!!!

Distance measures:

0. Angular
2. Chebyshev
3. Cosine
4. Euclidean
5. Jaccard Index
6. Manhattan
7. Pearson Correlation Coefficient
8. Radial Basis Function Kernel
9. Spearman Footrule
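To make the effect of the distance measure concrete, here is a minimal sketch of four of the listed measures. The function names are hypothetical and the definitions are the textbook ones, which may differ in detail from the JavaML implementations used by TunUp.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

The same pair of points can be near under one measure and far under another, which is why the nearest centroid, and therefore the final clustering, changes with the chosen measure.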

Clustering Evaluation

Definition of cluster: “A group of the same or similar elements gathered or occurring closely together”

Two main categories:

➢ Internal criterion: only based on the clustered data itself

➢ External criterion: based on benchmarks of pre-classified items

How do we evaluate if a set of clusters is good or not?

“Clustering is in the eye of the beholder” [E. Castro, 2002]

Internal Evaluation

The common goal is to assign better scores when:
➢ High intra-cluster similarity
➢ Low inter-cluster similarity

Cluster models:

➢ Distance-based (k-means)

➢ Density-based (DBSCAN)

➢ Distribution-based (EM-clustering)

➢ Connectivity-based (linkage clustering)

The choice of the evaluation technique depends on the nature of the data and the cluster model of the algorithm.

Proposed techniques

AIC: measure of the relative quantity of information lost by a statistical model. The clustering algorithm is modelled as a Gaussian Mixture Process. (inverted function: lower values are better)

Dunn: ratio between the minimum inter-cluster distance and the maximum cluster diameter. (natural function: higher values are better)

Davies-Bouldin : average similarity between each cluster and its most similar one. (inverted fn.)

Silhouette: measure of how well each point lies within its cluster. It indicates whether the object is correctly clustered or whether it would fit better in the neighbouring cluster. (natural fn.)
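As an illustration of an internal criterion, the Dunn index can be sketched as follows. Several variants exist; this one assumes Euclidean distance, measures separation as the smallest distance between points of different clusters, and measures diameter as the largest distance between points of the same cluster. The name `dunn_index` is hypothetical and this is not TunUp's code.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Dunn index for a list of clusters, each a list of point tuples.

    Assumes at least two clusters and at least one cluster with two or more
    distinct points (otherwise the diameter is undefined or zero).
    """
    # Smallest distance between two points that belong to different clusters
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between two points that belong to the same cluster
    max_diameter = max(
        dist(p, q) for c in clusters for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter
```

Compact, well-separated clusters give a large ratio, which matches the "natural" orientation of the measure.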

External criterion: AdjustedRand

Given a set of n elements S = {o1,...,on} and two partitions to compare: X={X1,...,Xr} and Y={Y1,...,Ys}

We can use AdjustedRand as the reference measure of clustering quality and use it to validate the internal criteria.

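$$\text{RandIndex} = \frac{\text{number of agreements between } X \text{ and } Y}{\text{total number of possible pair combinations}}$$

$$\text{AdjustedRandIndex} = \frac{\text{RandIndex} - \text{ExpectedIndex}}{\text{MaxIndex} - \text{ExpectedIndex}}$$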
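A minimal sketch of the adjusted index, assuming the common pair-counting (Hubert and Arabie) form in which the index, its expected value and its maximum are all computed from the contingency table of the two partitions. The function name is hypothetical and the partitions are passed as one label per element.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_x, labels_y):
    """Adjusted Rand index of two labelings of the same n >= 2 elements."""
    n = len(labels_x)
    # Pairs of elements placed together in both partitions
    index = sum(comb(c, 2) for c in Counter(zip(labels_x, labels_y)).values())
    # Pairs placed together within each partition separately
    pairs_x = sum(comb(c, 2) for c in Counter(labels_x).values())
    pairs_y = sum(comb(c, 2) for c in Counter(labels_y).values())
    expected_index = pairs_x * pairs_y / comb(n, 2)
    max_index = (pairs_x + pairs_y) / 2
    if max_index == expected_index:
        # Degenerate case (e.g. both partitions trivial): treat as agreement
        return 1.0
    return (index - expected_index) / (max_index - expected_index)
```

The result is 1 when the two partitions are identical up to a renaming of the clusters and is close to 0 for partitions that agree only by chance.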

Correlation t-test

Average correlations:

AIC: 0.77
Dunn: 0.49
Davies-Bouldin: 0.51
Silhouette: 0.49

Pearson correlation over a set of 120 random k-means configuration evaluations.
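The validation step can be sketched as a Pearson correlation followed by the usual t statistic for testing whether a correlation differs from zero. This is a generic illustration with hypothetical function names; how TunUp pairs the scores and which significance level it applies are not stated in the slides.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom.

    Assumes |r| < 1 and n > 2.
    """
    return r * sqrt((n - 2) / (1 - r * r))
```

For example, `t_statistic(0.77, 120)` evaluates the AIC average correlation against the 120 configurations mentioned above; the resulting value is then compared with the critical value of a t distribution with 118 degrees of freedom.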

Dataset

D31: 3100 vectors, 2 dimensions, 31 clusters

Source: http://cs.joensuu.fi/sipu/datasets/

S1: 5000 vectors, 2 dimensions, 15 clusters

Initial Centroids issue

N. observations = 200

Input Configuration: k = 31, Distance Measure = Euclidean

[Figure panels: AdjustedRand, AIC]

We can consider the median value!
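Because the random initial centroids make every run of the same configuration score differently, one way to summarize a configuration is the median over repeated runs. A minimal sketch, assuming a hypothetical `evaluate` callable that performs one clustering run with the given random generator and returns its score:

```python
import random
from statistics import median


def median_score(evaluate, n_observations=200, seed=None):
    """Median score of one configuration over repeated randomized runs.

    `evaluate(rng)` is a placeholder for a single clustering run plus its
    evaluation (for example AIC); it is not part of TunUp's API.
    """
    rng = random.Random(seed)
    return median(evaluate(rng) for _ in range(n_observations))
```

Unlike the mean, the median is barely affected by a few runs that start from an unlucky set of centroids.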

Full space evaluation

Global optimum:

K = 36

Distance Measure = Euclidean

N executions averaged = 20
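Full space evaluation amounts to an exhaustive search over every (k, distance measure) pair, averaging several executions per pair. The sketch below assumes a hypothetical `evaluate(k, measure)` callable that runs one clustering and returns a score where lower is better (as with AIC); it is not TunUp's actual API.

```python
from itertools import product
from statistics import mean


def full_space_tuning(evaluate, k_values, distance_measures, n_executions=20):
    """Return the best (k, measure) pair and its averaged score."""
    best_config, best_score = None, float("inf")
    for k, measure in product(k_values, distance_measures):
        # Average over repeated executions to smooth out random initialization
        score = mean(evaluate(k, measure) for _ in range(n_executions))
        if score < best_score:
            best_config, best_score = (k, measure), score
    return best_config, best_score
```

With k in 2..40 and nine distance measures this is 39 x 9 = 351 configurations, each executed 20 times, which is the cost the genetic algorithm tries to avoid.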

Genetic Algorithm Tuning

Mutation:

$$\Pr(\text{mutate } k_i \to k_j) \propto \frac{1}{\text{distance}(k_i, k_j)}$$

$$\Pr(\text{mutate } d_i \to d_j) = \frac{1}{N_{dist} - 1}$$

Crossover:

[x1,x2,x3,x4,...,xm] and [y1,y2,y3,y4,...,ym] produce [x1,x2,x3,y4,...,ym] and [y1,y2,y3,x4,...,xm]

Selection: Elitism + Roulette wheel
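The mutation and crossover operators of this slide can be sketched as follows. The code is an illustration only: it assumes that the distance between two values of k is their absolute difference, that a candidate is a plain sequence, and that the crossover point is given; none of the names come from TunUp or Watchmaker.

```python
import random


def mutate_k(k, k_values, rng):
    """Pick a new k with probability proportional to 1 / |k - candidate|."""
    candidates = [c for c in k_values if c != k]
    weights = [1.0 / abs(c - k) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def mutate_distance(measure, measures, rng):
    """Pick any other distance measure uniformly: 1 / (N_dist - 1) each."""
    return rng.choice([m for m in measures if m != measure])


def crossover(parent_x, parent_y, point):
    """Single-point crossover: the parents swap their tails after `point`."""
    child_1 = parent_x[:point] + parent_y[point:]
    child_2 = parent_y[:point] + parent_x[point:]
    return child_1, child_2
```

For instance, `crossover([1, 2, 3, 4, 5], [6, 7, 8, 9, 10], point=3)` returns `([1, 2, 3, 9, 10], [6, 7, 8, 4, 5])`, mirroring the x/y example above. The weighted choice in `mutate_k` keeps most mutations close to the current k while still allowing occasional long jumps.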

Tuning parameters:

Fitness evaluation: AIC
Prob. mutation: 0.5
Prob. crossover: 0.9
Population size: 6
Stagnation limit: 5
Elitism: 1
N executions averaged: 10

Relevant results:

➢ Best fitness value always decreasing
➢ Mean fitness value shows a decreasing trend
➢ A high standard deviation in one population often yields a better mean fitness in the next one

Results

Test1: k = 39, Distance Measure = Manhattan

Test2: k = 33, Distance Measure = RBF Kernel

Test3: k = 36, Distance Measure = Euclidean

Different results due to:
1. Early convergence
2. Random initial centroids

Parallel GA

Optimal n. of servers = POP_SIZE – ELITISM

Amazon Elastic Compute Cloud (EC2)

10 x Micro instances

Simulation: 10 evolutions, POP_SIZE = 5, no elitism

E[T single evolution] ≤
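The parallel evaluation idea can be sketched locally with a thread pool standing in for the remote servers. This is an assumption-laden illustration: TunUp distributes the work to EC2 instances through its RESTful API, whereas here `fitness` is a hypothetical local callable, and candidates are assumed to be hashable tuples. Reusing the known scores of elite candidates is one plausible reading of why POP_SIZE – ELITISM workers are enough.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(population, fitness, elite_scores=None, max_workers=None):
    """Evaluate one generation in parallel, reusing known elite scores.

    `elite_scores` maps candidates carried over by elitism to their fitness,
    so only the remaining candidates need a worker.
    """
    known = dict(elite_scores or {})
    to_evaluate = [c for c in population if c not in known]
    # One worker per candidate to evaluate, unless a limit is given
    workers = max_workers or max(1, len(to_evaluate))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(fitness, to_evaluate))
    known.update(zip(to_evaluate, scores))
    return {c: known[c] for c in population}
```

With one worker per non-elite candidate, the wall-clock time of a generation is bounded by the slowest single fitness evaluation plus coordination overhead.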

Conclusions

We developed, tested and analysed TunUp, an open solution for: Evaluation, Validation, Tuning of Data Clustering Algorithms

Future applications:
➢ Tuning of existing algorithms
➢ Supporting new algorithms design
➢ Evaluation and comparison of different algorithms

Limitations:
➢ Single distance measure
➢ Equal normalization
➢ Master / slave parallel execution
➢ Random initial centroids

Questions?

Thank you! Tack! Grazie!
