TunUp final presentation
DESCRIPTION
The amount of digital data has grown exponentially in recent years and, with the development of new technologies, is growing more rapidly than ever before. While it is easy to understand that all these data are out there, using them to turn a profit is not trivial. Data mining techniques able to extract profitable insights are the next frontier of innovation, competition and profit. To scale well and grow its profit exponentially, a data analytics service provider has to deal with scalability, multi-tenancy and self-adaptability. In big data applications, machine learning is a very powerful instrument, but a bad choice of algorithm or configuration parameters can easily lead to poor results. The key problem is automating the tuning process without a priori knowledge of the data and without human intervention. In this research project we implemented and analysed TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering. The proposed solution automatically evaluates and tunes data clustering algorithms, so that big data services can self-adapt and scale in a cost-efficient manner. For our experiments we considered k-means, a simple but popular clustering algorithm that is widely used in many data mining applications. Clustering outputs are evaluated using four internal techniques (AIC, Dunn, Davies-Bouldin and Silhouette) and one external technique (AdjustedRand). We then performed a correlation t-test to validate and benchmark the internal techniques against AdjustedRand. Once the best evaluation criterion is defined, the main challenge of k-means is setting the right value of k, which represents the number of clusters, and the distance measure used to compute the distance between each pair of points in the data space.
To address this problem we propose an implementation of a genetic evolutionary algorithm that heuristically finds an optimal configuration of our clustering algorithm. To improve performance, we implemented a parallel version of the genetic algorithm by developing a REST API and deploying several instances on the Amazon Elastic Compute Cloud (EC2) infrastructure. In conclusion, with this research we contributed by building and analysing TunUp, an open solution for the evaluation, validation and tuning of data clustering algorithms, with a particular focus on cloud services. Our experiments show the quality and efficiency of tuning k-means on a set of public datasets. The research also provides a roadmap that indicates how the current system should be extended and used for future clustering applications, such as tuning existing clustering algorithms, supporting the design of new algorithms, and evaluating and comparing different algorithms.
TRANSCRIPT
TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering
Gianmario [email protected]
March 2013
AgilOne, Inc.
1091 N Shoreline Blvd. #250
Mountain View, CA 94043
Agenda
1. Introduction
2. Problem description
3. TunUp
4. K-means
5. Clustering evaluation
6. Full space tuning
7. Genetic algorithm tuning
8. Conclusions
Big Data
Business Intelligence
Why? Where? What? How?
Insights of customers, products and companies
Can someone else know your customer better than you?
Do you have the domain knowledge and proper computation infrastructure?
Big Data as a Service (BDaaS)
Problem Description
[Figure: income and cost plotted against customers]
Tuning of Clustering Algorithms
We need tuning when:
➢ A new algorithm or version is released
➢ We want to improve accuracy and/or performance
➢ A new customer arrives and the system must be adapted to the new dataset and requirements
TunUp
Java framework integrating JavaML and Watchmaker
Main features:
➢ Data manipulation (loading, labelling and normalization)
➢ Clustering algorithms (k-means)
➢ Clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand)
➢ Evaluation techniques validation (Pearson Correlation t-test)
➢ Full search space tuning
➢ Genetic Algorithm tuning (local and parallel implementation)
➢ RESTful API for web service deployment (Tomcat in Amazon EC2)
Open-source: http://github.com/gm-spacagna/tunup
k-means
Geometric hard-assigning clustering algorithm:
It partitions n data points into k clusters in which each point belongs to the cluster with the nearest mean centroid.
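If we have k clusters in the set S = {S1,...,Sk}, where xj is the jth point in a given cluster Si and μi is the mean (centroid) of that cluster, the goal of k-means is minimizing the Within-Cluster Sum of Squares:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$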
Algorithm:
1. Initialization: a set of k random centroids is generated
2. Assignment: each point is assigned to the closest centroid
3. Update: the new centroids are calculated as the means of the new clusters
4. Go to 2 until convergence (centroids are stable and do not change)
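The four steps above can be sketched in a few lines of Python. This is a minimal illustration and not the TunUp or JavaML implementation: it assumes squared Euclidean distance, points given as tuples of floats, and initial centroids drawn from the data points themselves (one common way to generate random centroids). The function name `kmeans` and its parameters are chosen here for illustration only.

```python
import random

def kmeans(points, k, max_iterations=20, seed=0):
    """Lloyd-style k-means with squared Euclidean distance (illustrative sketch)."""
    rng = random.Random(seed)
    # 1. Initialization: pick k distinct data points as the initial centroids
    #    (an assumption; the slides only say "random centroids").
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # 2. Assignment: each point goes to the closest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # 4. Convergence: stop when no point changed cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # 3. Update: each centroid becomes the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns two centroids and a list with one cluster label per point.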
k-means tuning
Input parameters required:
1. K = (2,...,40)
2. Distance measure
3. Max iterations = 20 (fixed)
Different input parameters
Very different outcomes!!!
0. Angular
2. Chebyshev
3. Cosine
4. Euclidean
5. Jaccard Index
6. Manhattan
7. Pearson Correlation Coefficient
8. Radial Basis Function Kernel
9. Spearman Footrule
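To see why the distance measure changes the outcome, here is a small sketch of four of the listed measures in plain Python. These are textbook definitions, not the JavaML classes; in particular `cosine_distance` is defined here as one minus the cosine similarity, which is an assumption about how a similarity would be turned into a distance.

```python
import math

def euclidean(a, b):
    # Straight-line distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # Assumption: distance = 1 - cosine similarity (vectors must be non-zero).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```

For a point at the origin and two candidate centroids at (3, 3) and (0, 4), Euclidean and Manhattan pick (0, 4) as the closest while Chebyshev picks (3, 3), so the assignment step of k-means can already differ with the measure.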
Clustering Evaluation
Definition of cluster:
“A group of the same or similar elements gathered or occurring closely together”
Two main categories:
➢ Internal criterion : only based on the clustered data itself
➢ External criterion : based on benchmarks of pre-classified items
How do we evaluate if a set of clusters is good or not?
“Clustering is in the eye of the beholder” [Estivill-Castro, 2002]
Internal Evaluation
Common goal is assigning better scores when:
➢ High intra-cluster similarity
➢ Low inter-cluster similarity
Cluster models:
➢ Distance-based (k-means)
➢ Density-based (DBSCAN)
➢ Distribution-based (EM-clustering)
➢ Connectivity-based (linkage clustering)
The choice of the evaluation technique depends on the nature of the data and the cluster model of the algorithm.
Proposed techniques
AIC: measure of the relative amount of information lost by a statistical model. The clustering is modelled as a Gaussian mixture process. (inverted function: lower is better)
Dunn: ratio between the minimum inter-cluster distance and the maximum cluster diameter. (natural function: higher is better)
Davies-Bouldin: average similarity between each cluster and its most similar one. (inverted function: lower is better)
Silhouette: measure of how well each point lies within its cluster. It indicates whether the object is correctly clustered or would be more appropriate in the neighbouring cluster. (natural function: higher is better)
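Two of these internal scores are simple enough to sketch directly. The code below is an illustrative version of the Dunn index and the mean Silhouette, assuming Euclidean distance, clusters given as lists of coordinate tuples, at least two clusters, and at least one cluster containing two distinct points. It is not the TunUp code, and the function names are chosen here for illustration.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dunn_index(clusters):
    """Minimum inter-cluster distance divided by maximum cluster diameter."""
    min_between = min(
        dist(p, q)
        for i, ci in enumerate(clusters)
        for cj in clusters[i + 1:]
        for p in ci
        for q in cj
    )
    max_diameter = max(dist(p, q) for c in clusters for p in c for q in c)
    return min_between / max_diameter

def mean_silhouette(clusters):
    """Average of (b - a) / max(a, b) over all points."""
    scores = []
    for i, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other points of the same cluster
            a = sum(dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: smallest mean distance to the points of any other cluster
            b = min(
                sum(dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters) if j != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Both functions are "natural": a compact, well-separated clustering scores higher than one that mixes distant points in the same cluster.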
External criterion: AdjustedRand
Given a set of n elements S = {o1,...,on} and two partitions to compare, X = {X1,...,Xr} and Y = {Y1,...,Ys}:
$$\mathrm{RandIndex} = \frac{\text{number of agreements between } X \text{ and } Y}{\text{total number of possible pair combinations}}$$
$$\mathrm{AdjustedRandIndex} = \frac{\mathrm{RandIndex} - \mathrm{ExpectedIndex}}{\mathrm{MaxIndex} - \mathrm{ExpectedIndex}}$$
We can use AdjustedRand as the reference for the best clustering evaluation and use it to validate the internal criteria.
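As a worked illustration of the AdjustedRand formula, the sketch below computes it from two label lists using pair counts taken from the contingency table of the two partitions, which is the usual way the index, its expected value and its maximum are expressed. The function name and the handling of the degenerate case (returning 1.0 when the denominator is zero) are choices made for this example.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_x, labels_y):
    """Adjusted Rand index of two labelings of the same n elements."""
    n = len(labels_x)
    # Pairs of elements placed together in both partitions.
    index = sum(comb(c, 2) for c in Counter(zip(labels_x, labels_y)).values())
    # Pairs placed together in X, and in Y.
    sum_x = sum(comb(c, 2) for c in Counter(labels_x).values())
    sum_y = sum(comb(c, 2) for c in Counter(labels_y).values())
    expected = sum_x * sum_y / comb(n, 2)
    max_index = (sum_x + sum_y) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single cluster
    return (index - expected) / (max_index - expected)
```

The score does not depend on how clusters are named: `[0, 0, 1, 1]` against `[1, 1, 0, 0]` is a perfect match, while `[0, 0, 1, 1]` against `[0, 1, 0, 1]` scores below zero, that is, worse than chance.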
Correlation t-test
Pearson correlation over a set of 120 random k-means configuration evaluations.
Average correlations:
AIC: 0.77
Dunn: 0.49
Davies-Bouldin: 0.51
Silhouette: 0.49
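The validation step can be sketched as follows: compute the Pearson correlation between an internal score and AdjustedRand over the evaluated configurations, then turn it into a t statistic with n - 2 degrees of freedom. This is the standard test for a non-zero correlation; whether TunUp computes it in exactly this way is an assumption, and the function names are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_t_statistic(r, n):
    """t statistic for the null hypothesis of no correlation (n - 2 d.o.f.)."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

As a rough reading, and assuming the reported average of 0.77 for AIC could be treated as a single correlation over n = 120 evaluations, `correlation_t_statistic(0.77, 120)` gives a value of about 13, far from what would be expected under no correlation.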
Dataset
D31: 3100 vectors, 2 dimensions, 31 clusters
S1: 5000 vectors, 2 dimensions, 15 clusters
Source: http://cs.joensuu.fi/sipu/datasets/
Initial Centroids issue
N. observations = 200
Input configuration: k = 31, Distance Measure = Euclidean
[Figures: AdjustedRand, AIC]
We can consider the median value!
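A minimal sketch of the idea: repeat the same randomised evaluation many times and keep the median, so that a few unlucky initialisations do not distort the score. The function `toy_evaluation` is a hypothetical stand-in for "run k-means with random initial centroids and score the result"; its numbers are invented for the example and are not measurements.

```python
import random
import statistics

def median_score(evaluate, n_observations=200, seed=0):
    """Run a randomised evaluation n_observations times and return the median."""
    rng = random.Random(seed)
    scores = [evaluate(rng) for _ in range(n_observations)]
    return statistics.median(scores)

def toy_evaluation(rng):
    # Hypothetical scorer: usually close to 0.9, but about one run in ten
    # starts from bad centroids and scores 0.3.
    if rng.random() < 0.1:
        return 0.3
    return 0.9 + rng.uniform(-0.02, 0.02)
```

With this toy scorer the mean is pulled down by the bad runs, while `median_score(toy_evaluation)` stays within the range of the typical runs.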
Full space evaluation
Global optimum is for:
K = 36
DistanceMeasure = Euclidean
N executions averaged = 20
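Full space evaluation amounts to an exhaustive grid search over every (k, distance measure) pair, averaging several executions per pair. The sketch below assumes a fitness where lower is better (as with AIC); `toy_fitness` is a hypothetical function whose minimum is placed at k = 36 and Euclidean only so that the example mirrors the optimum reported above.

```python
import itertools
import statistics

def full_space_search(evaluate, k_values, distance_measures, n_executions=20):
    """Score every (k, measure) pair, averaging n_executions runs; lower is better."""
    best_config, best_score = None, float("inf")
    for k, measure in itertools.product(k_values, distance_measures):
        score = statistics.mean(
            evaluate(k, measure, run) for run in range(n_executions)
        )
        if score < best_score:
            best_config, best_score = (k, measure), score
    return best_config, best_score

def toy_fitness(k, measure, run):
    # Hypothetical AIC-like score, not real data: minimal at k = 36 / Euclidean,
    # with a small run-dependent perturbation standing in for random centroids.
    penalty = 0.0 if measure == "Euclidean" else 5.0
    noise = ((run * 7919) % 11 - 5) * 0.01
    return abs(k - 36) + penalty + noise
```

With K = (2,...,40) and nine distance measures this means 39 x 9 configurations, each executed 20 times, which is the cost the genetic algorithm tries to avoid.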
Genetic Algorithm Tuning
Crossover:
Parents: [x1,x2,x3,x4,...,xm] and [y1,y2,y3,y4,...,ym]
Offspring: [x1,x2,x3,y4,...,ym] and [y1,y2,y3,x4,...,xm]
Mutation:
$$\Pr(\text{mutate } k_i \to k_j) \propto \frac{1}{\mathrm{distance}(k_i, k_j)}$$
$$\Pr(\text{mutate } d_i \to d_j) = \frac{1}{N_{dist} - 1}$$
Elitism + Roulette wheel
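The operators on this slide can be sketched in Python as below. Two assumptions are made here and are not stated in the source: distance(k_i, k_j) is taken to be |k_i - k_j|, and the crossover is a single-point crossover with an explicit cut position. The function names are illustrative and unrelated to the Watchmaker API.

```python
import random

def mutate_k(k, k_values, rng):
    """Pick a new k with probability proportional to 1 / |k - candidate|."""
    candidates = [c for c in k_values if c != k]
    weights = [1.0 / abs(c - k) for c in candidates]  # assumed distance: |k_i - k_j|
    return rng.choices(candidates, weights=weights, k=1)[0]

def mutate_distance(measure, measures, rng):
    """Pick any other distance measure uniformly: probability 1 / (N_dist - 1)."""
    return rng.choice([m for m in measures if m != measure])

def crossover(parent_x, parent_y, point):
    """Single-point crossover: swap the tails of two equal-length chromosomes."""
    child_a = parent_x[:point] + parent_y[point:]
    child_b = parent_y[:point] + parent_x[point:]
    return child_a, child_b
```

With this mutation a value of k is more likely to move to a nearby value than to a distant one, while the distance measure, which has no natural ordering, jumps to any other measure with equal probability.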
Tuning parameters:
Fitness evaluation: AIC
Prob. mutation: 0.5
Prob. crossover: 0.9
Population size: 6
Stagnation limit: 5
Elitism: 1
N executions averaged: 10
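A sketch of how elitism, roulette-wheel selection and the stagnation limit could fit together for a fitness that is minimised (AIC). The weighting scheme, which gives each individual a wheel slice proportional to how much better it is than the worst one, and the exact stagnation rule are assumptions of this example rather than a description of the actual engine behaviour.

```python
import random

def select(population, fitness, elitism=1, rng=None):
    """Return (elite, parents) for the next generation; lower fitness is better."""
    rng = rng or random.Random(0)
    ranked = sorted(population, key=fitness)
    elite = ranked[:elitism]  # copied unchanged into the next generation
    worst = fitness(ranked[-1])
    # Assumed inversion: the better the individual, the wider its wheel slice.
    weights = [worst - fitness(ind) + 1e-9 for ind in population]
    parents = rng.choices(population, weights=weights, k=len(population) - elitism)
    return elite, parents

def stagnated(best_history, limit=5):
    """True when the best fitness did not improve over the last `limit` generations."""
    if len(best_history) <= limit:
        return False
    return min(best_history[-limit:]) >= min(best_history[:-limit])
```

With a population size of 6 and elitism of 1, each generation keeps the single best configuration and draws 5 parents for crossover and mutation; the run ends once `stagnated` holds for the history of best fitness values.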
Relevant results:
➢ Best fitness value always decreasing
➢ Mean fitness value trend decreasing
➢ High standard deviation in the previous population often generates a better mean population in the next one
Results
Test1: k = 39, Distance Measure = Manhattan
Test2: k = 33, Distance Measure = RBF Kernel
Test3: k = 36, Distance Measure = Euclidean
Different results due to:
1. Early convergence
2. Random initial centroids
Parallel GA
Optimal n. of servers = POP_SIZE – ELITISM
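One plausible reading of this rule is that elite individuals carry their fitness over from the previous generation, so only POP_SIZE - ELITISM candidates need a fresh evaluation and each can go to its own server. The sketch below illustrates that reading with a local thread pool; in the deployment described in the slides each worker would instead call the REST API on an EC2 instance, which is not shown here. All names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def optimal_servers(pop_size, elitism):
    # Assumption: elite individuals are not re-evaluated, so they need no server.
    return pop_size - elitism

def evaluate_population(candidates, evaluate, n_servers):
    """Evaluate all candidates concurrently, one worker per (hypothetical) server."""
    with ThreadPoolExecutor(max_workers=n_servers) as pool:
        # pool.map keeps the results in the same order as the candidates.
        return list(pool.map(evaluate, candidates))
```

With one worker per candidate, the time for a single evolution is bounded by the slowest evaluation rather than by the sum of all evaluations.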
Amazon Elastic Compute Cloud (EC2)
10 x Micro instances
Simulation: 10 evolutions, POP_SIZE = 5, no elitism
E[T single evolution] ≤
Conclusions
We developed, tested and analysed TunUp, an open solution for:
Evaluation, Validation, Tuning of Data Clustering Algorithms
Future applications:
➢ Tuning of existing algorithms
➢ Supporting new algorithms design
➢ Evaluation and comparison of different algorithms
Limitations:
➢ Single distance measure
➢ Equal normalization
➢ Master / slave parallel execution
➢ Random initial centroids
Questions?
Thank you! Tack! Grazie!