tunup final presentation

TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering Gianmario Spacagna [email protected] March 2013 AgilOne, Inc. 1091 N Shoreline Blvd. #250 Mountain View, CA 94043

Upload: gianmario-spacagna

Post on 02-Jul-2015


DESCRIPTION

The amount of digital data has grown exponentially in recent years and, with the development of new technologies, is growing more rapidly than ever before. Yet while it is easy to see that all these data are out there, using them to turn a profit is not trivial. Data mining techniques able to extract profitable insights are the next frontier of innovation, competition and profit. A data analytics service provider that wants to scale well and grow its profit has to deal with scalability, multi-tenancy and self-adaptability. In big data applications, machine learning is a very powerful instrument, but a bad choice of algorithm or of its configuration parameters can easily lead to poor results. The key problem is automating the tuning process without a priori knowledge of the data and without human intervention. In this research project we implemented and analysed TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering. The proposed solution automatically evaluates and tunes data clustering algorithms, so that big data services can self-adapt and scale in a cost-efficient manner. For our experiments we considered k-means, a simple but popular clustering algorithm widely used in many data mining applications. Clustering outputs are evaluated using four internal techniques (AIC, Dunn, Davies-Bouldin and Silhouette) and one external technique (AdjustedRand). We then performed a correlation t-test in order to validate and benchmark our internal techniques against AdjustedRand. Once the best evaluation criterion has been defined, the main challenge of k-means is setting the right value of k, which represents the number of clusters, and the distance measure used to compute the distance between each pair of points in the data space.
To address this problem we propose an implementation of the Genetic Evolutionary Algorithm that heuristically finds an optimal configuration of our clustering algorithm. In order to improve performance, we implemented a parallel version of the genetic algorithm by developing a REST API and deploying several instances on the Amazon Elastic Compute Cloud (EC2) infrastructure. In conclusion, with this research we contributed by building and analysing TunUp, an open solution for the evaluation, validation and tuning of data clustering algorithms, with a particular focus on cloud services. Our experiments show the quality and efficiency of tuning k-means on a set of public datasets. The research also provides a roadmap that indicates how the current system should be extended and used for future clustering applications, such as tuning of existing clustering algorithms, supporting the design of new algorithms, and evaluation and comparison of different algorithms.

TRANSCRIPT

Page 1: TunUp final presentation

TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering

Gianmario Spacagna [email protected]

March 2013

AgilOne, Inc., 1091 N Shoreline Blvd. #250, Mountain View, CA 94043

Page 2: TunUp final presentation

Agenda

1. Introduction

2. Problem description

3. TunUp

4. K-means

5. Clustering evaluation

6. Full space tuning

7. Genetic algorithm tuning

8. Conclusions

Page 3: TunUp final presentation

Big Data

Page 4: TunUp final presentation

Business Intelligence

Why? Where? What? How?

Insights of customers, products and companies

Can someone else know your customer better than you?

Do you have the domain knowledge and proper computation infrastructure?

Page 5: TunUp final presentation

Big Data as a Service (BDaaS)

Page 6: TunUp final presentation

Problem Description

[Figure with labels: income, cost, customers]

Page 7: TunUp final presentation

Tuning of Clustering Algorithms

We need tuning when:

➢ New algorithm or version is released

➢ We want to improve accuracy and/or performance

➢ New customer comes and the system must be adapted for the new dataset and requirements


Page 8: TunUp final presentation

TunUp

Java framework integrating JavaML and Watchmaker

Main features:

➢ Data manipulation (loading, labelling and normalization)

➢ Clustering algorithms (k-means)

➢ Clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand)

➢ Evaluation techniques validation (Pearson Correlation t-test)

➢ Full search space tuning

➢ Genetic Algorithm tuning (local and parallel implementation)

➢ RESTful API for web service deployment (Tomcat in Amazon EC2)

Open-source: http://github.com/gm-spacagna/tunup

Page 9: TunUp final presentation

k-means

Geometric hard-assigning clustering algorithm:

It partitions n data points into k clusters in which each point belongs to the cluster with the nearest mean (centroid).

Given k clusters S = {S_1, ..., S_k}, where x_j is the j-th point of cluster S_i and \mu_i is the centroid of S_i, the goal of k-means is to minimize the Within-Cluster Sum of Squares:

\[ \underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2 \]

Algorithm:

1. Initialization: a set of k random centroids is generated

2. Assignment: each point is assigned to the closest centroid

3. Update: the new centroids are calculated as the mean of the new clusters

4. Go to step 2 until convergence (centroids are stable and no longer change)
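The four steps above can be sketched in a few lines. This is an illustrative Python sketch, not TunUp's Java code. It assumes squared Euclidean distance, picks the initial centroids by sampling k of the input points (one possible reading of "random centroids"), and uses hypothetical names (`kmeans`, `wcss`, `max_iterations`, `seed`).

```python
import random


def squared_euclidean(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=20, seed=0):
    """Plain k-means; returns (centroids, assignment)."""
    rng = random.Random(seed)
    # 1. Initialization (assumption: centroids are sampled from the data).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # 2. Assignment: index of the closest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        # 4. Convergence: stop when no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # 3. Update: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


def wcss(points, centroids, assignment):
    """Within-Cluster Sum of Squares of a clustering."""
    return sum(squared_euclidean(p, centroids[a]) for p, a in zip(points, assignment))
```

Stopping when the assignment no longer changes is equivalent to the centroids being stable, since the centroids are a function of the assignment.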

Page 10: TunUp final presentation

k-means tuning

Input parameters required:

1. K = (2,...,40)

2. Distance measure

3. Max iterations = 20 (fixed)

Different input parameters

Very different outcomes!!!

0. Angular

2. Chebyshev

3. Cosine

4. Euclidean

5. Jaccard Index

6. Manhattan

7. Pearson Correlation Coefficient

8. Radial Basis Function Kernel

9. Spearman Footrule

Page 11: TunUp final presentation

Clustering Evaluation

Definition of cluster: “A group of the same or similar elements gathered or occurring closely together”

Two main categories:

➢ Internal criterion : only based on the clustered data itself

➢ External criterion : based on benchmarks of pre-classified items

How do we evaluate if a set of clusters is good or not?

“Clustering is in the eye of the beholder” [E. Castro, 2002]

Page 12: TunUp final presentation

Internal Evaluation

Common goal is assigning better scores when:

➢ High intra-cluster similarity

➢ Low inter-cluster similarity

Cluster models:

➢ Distance-based (k-means)

➢ Density-based (DBSCAN)

➢ Distribution-based (EM-clustering)

➢ Connectivity-based (linkage clustering)

The choice of the evaluation technique depends on the nature of the data and the cluster model of the algorithm.

Page 13: TunUp final presentation

Proposed techniques

AIC: measure of the relative amount of information lost by a statistical model. The clustering algorithm is modelled as a Gaussian Mixture Process. (inverted fn.)

Dunn: ratio between the minimum inter-cluster distance and the maximum cluster diameter. (natural fn.)

Davies-Bouldin: average similarity between each cluster and its most similar one. (inverted fn.)

Silhouette: measure of how well each point lies within its cluster. Indicates whether the object is correctly clustered or would fit better in the neighbouring cluster. (natural fn.)
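As an illustration of one of these internal criteria, the Silhouette score can be computed directly from pairwise distances. The sketch below is Python for illustration only and is not taken from TunUp. It assumes Euclidean distance, at least two clusters, and the common convention that a point alone in its cluster scores 0. The function name `silhouette` is hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(points, labels):
    """Mean Silhouette score over all points (requires >= 2 clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance of p to itself is 0, so it does not affect the sum).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores indicate points that sit closer to a neighbouring cluster than to their own, which matches the "natural function" reading (higher is better).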

Page 14: TunUp final presentation

External criterion: AdjustedRand

Given a set of n elements S = {o1,...,on} and two partitions to compare: X={X1,...,Xr} and Y={Y1,...,Ys}

We can use AdjustedRand as the reference for the best clustering evaluation and use it to validate the internal criteria.

\[ \mathrm{RandIndex} = \frac{\text{number of agreements between } X \text{ and } Y}{\text{total number of possible pair combinations}} \]

\[ \mathrm{AdjustedRandIndex} = \frac{\mathrm{RandIndex} - \mathrm{ExpectedIndex}}{\mathrm{MaxIndex} - \mathrm{ExpectedIndex}} \]
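A minimal Python sketch of the adjusted index follows, for illustration only. It assumes the usual pair-counting form, in which the index is the number of element pairs placed together in both partitions, the expected index comes from the row and column totals of the contingency table, and the maximum index is their average. The function name `adjusted_rand_index` is hypothetical, and inputs with fewer than two elements are not handled.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_x, labels_y):
    """Adjusted Rand index of two labelings of the same n elements."""
    n = len(labels_x)
    # Contingency table: how many elements fall in cluster i of X and j of Y.
    contingency = Counter(zip(labels_x, labels_y))
    index = sum(comb(c, 2) for c in contingency.values())
    pairs_x = sum(comb(c, 2) for c in Counter(labels_x).values())
    pairs_y = sum(comb(c, 2) for c in Counter(labels_y).values())

    expected_index = pairs_x * pairs_y / comb(n, 2)
    max_index = (pairs_x + pairs_y) / 2
    if max_index == expected_index:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (index - expected_index) / (max_index - expected_index)
```

The score is 1 when the two partitions agree up to a renaming of the clusters and is close to 0 (possibly negative) when the agreement is no better than chance.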

Page 15: TunUp final presentation

Correlation t-test

Pearson correlation over a set of 120 random k-means configuration evaluations.

Average correlations:

AIC: 0.77

Dunn: 0.49

Davies-Bouldin: 0.51

Silhouette: 0.49
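For reference, a correlation t-test of this kind can be sketched as follows. This is an illustrative Python sketch, not TunUp's implementation. It assumes the standard test statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def correlation_t_statistic(r, n):
    """t statistic for H0: no correlation, given r over n paired samples."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

Here `xs` would hold the scores of an internal criterion and `ys` the AdjustedRand scores of the same configurations. The resulting statistic is compared against a Student's t distribution with n - 2 degrees of freedom.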

Page 16: TunUp final presentation

Dataset

D31: 3100 vectors, 2 dimensions, 31 clusters

S1: 5000 vectors, 2 dimensions, 15 clusters

Source: http://cs.joensuu.fi/sipu/datasets/

Page 17: TunUp final presentation

Initial Centroids issue

N. observations = 200

Input Configuration: k = 31, Distance Measure = Euclidean

[Figure panels: AdjustedRand, AIC]

We can consider the median value!

Page 18: TunUp final presentation

Full space evaluation

Global optimum is for:

K = 36

Distance Measure = Euclidean

N executions averaged = 20
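A full-space evaluation amounts to an exhaustive grid search. The Python sketch below is illustrative only. The `evaluate` callback stands for "run k-means with this configuration and return its score" and is hypothetical, as are the other names. It assumes a lower score is better (as with an inverted criterion such as AIC) and averages several executions per configuration to dampen the effect of random initial centroids.

```python
import statistics


def full_space_tuning(ks, distance_measures, evaluate, n_executions=20):
    """Try every (k, distance measure) pair and return the best one.

    Returns a tuple (mean_score, k, distance_measure); lower is better.
    """
    best = None
    for k in ks:
        for measure in distance_measures:
            # Average repeated runs of the same configuration.
            score = statistics.mean(
                evaluate(k, measure) for _ in range(n_executions)
            )
            if best is None or score < best[0]:
                best = (score, k, measure)
    return best
```

The number of clustering runs is len(ks) * len(distance_measures) * n_executions, which is the cost that motivates a heuristic search.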

Page 19: TunUp final presentation

Genetic Algorithm Tuning

Crossovering:

[x1,x2,x3,x4,...,xm] → [x1,x2,x3,y4,...,ym]

[y1,y2,y3,y4,...,ym] → [y1,y2,y3,x4,...,xm]

Mutation:

\[ \Pr(\text{mutate } k_i \to k_j) \propto \frac{1}{\mathrm{distance}(k_i, k_j)} \]

\[ \Pr(\text{mutate } d_i \to d_j) = \frac{1}{N_{dist} - 1} \]

Elitism + Roulette wheel
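The crossover, mutation and selection rules on this slide can be sketched as plain functions. This Python sketch is an illustration and not the Watchmaker-based code in TunUp. It assumes distance(k_i, k_j) = |k_i - k_j|, a single crossover point, and roulette weights that are non-negative with larger meaning fitter (a minimized criterion such as AIC would first have to be inverted). All function names are hypothetical.

```python
import random


def crossover(parent_x, parent_y, point):
    """Single-point crossover: swap the tails of two parents at `point`."""
    child_a = parent_x[:point] + parent_y[point:]
    child_b = parent_y[:point] + parent_x[point:]
    return child_a, child_b


def mutate_k(k, candidates, rng):
    """Pick a new k with probability proportional to 1 / |k_new - k|."""
    others = [c for c in candidates if c != k]
    weights = [1 / abs(c - k) for c in others]
    return rng.choices(others, weights=weights, k=1)[0]


def mutate_distance(measure, measures, rng):
    """Pick any other distance measure uniformly: 1 / (N_dist - 1) each."""
    return rng.choice([m for m in measures if m != measure])


def roulette_wheel(population, weights, rng):
    """Select one individual with probability proportional to its weight."""
    return rng.choices(population, weights=weights, k=1)[0]
```

Weighting the mutation of k by inverse distance favours small steps around the current value while still allowing occasional large jumps. Elitism would be applied on top of this by copying the best individuals unchanged into the next generation.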

Page 20: TunUp final presentation

Tuning parameters:

Fitness Evaluation: AIC

Prob. mutation: 0.5

Prob. Crossovering: 0.9

Population size: 6

Stagnation limit: 5

Elitism: 1

N executions averaged: 10

Relevant results:

➢ Best fitness value always decreasing

➢ Mean fitness value trend decreasing

➢ High standard deviation in the previous population often generates a better mean population in the next one

Page 21: TunUp final presentation

Results

Test1: k = 39, Distance Measure = Manhattan

Test2: k = 33, Distance Measure = RBF Kernel

Test3: k = 36, Distance Measure = Euclidean

Different results due to:

1. Early convergence

2. Random initial centroids

Page 22: TunUp final presentation

Parallel GA

Optimal n. of servers = POP_SIZE – ELITISM

Amazon Elastic Compute Cloud (EC2)

10 x Micro instances

Simulation: 10 evolutions, POP_SIZE = 5, no elitism

E[T single evolution] ≤
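One way to read the server-count rule on this slide is that elite individuals already carry a fitness value from the previous generation, so only POP_SIZE - ELITISM candidates need a fresh evaluation, one per server. The Python sketch below illustrates that reading with local threads standing in for remote servers. It is an assumption-laden illustration, not TunUp's REST-based implementation, and it assumes the elites sit at the front of the candidate list. The names `evaluate_generation` and `evaluate` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_generation(candidates, evaluate, elitism=0):
    """Evaluate the non-elite candidates of one generation in parallel.

    The first `elitism` candidates are assumed to be elites whose fitness
    is already known, so they are skipped.
    """
    to_evaluate = candidates[elitism:]
    # One worker per candidate: POP_SIZE - ELITISM workers in total.
    workers = max(1, len(to_evaluate))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in the same order as the inputs.
        return list(pool.map(evaluate, to_evaluate))
```

In a distributed setting `evaluate` would be a call to a remote evaluation service, and the time for one evolution would be dominated by the slowest of those calls.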

Page 23: TunUp final presentation

Conclusions

We developed, tested and analysed TunUp, an open solution for: Evaluation, Validation, Tuning of Data Clustering Algorithms

Future applications:

➢ Tuning of existing algorithms

➢ Supporting new algorithms design

➢ Evaluation and comparison of different algorithms

Limitations:

➢ Single distance measure

➢ Equal normalization

➢ Master / slave parallel execution

➢ Random initial centroids

Page 24: TunUp final presentation

Questions?

Page 25: TunUp final presentation

Thank you! Tack! Grazie!