Scalable Clustering on the Data Grid
Patrick Wendel ([email protected])Moustafa GhanemYike Guo
Discovery NetDepartment of ComputingImperial College, London
20/09/2005All Hands Meeting, Nottingham
Outline
Discovery Net Data Clustering Mining Distributed Data Description of the strategy Deployment Evaluation Conclusions – Future Works
20/09/2005All Hands Meeting, Nottingham
Discovery Net Multidisciplinary project funded by the EPSRC under the UK e-Science programme
(started Oct 2002, ended March 05)
Developed an infrastructure for Knowledge Discovery Services for integrating and analysing data collected from high throughput devices and sensors
Applications to: Life Sciences
• High throughput genomics and proteomics Real-time Environmental Monitoring
• High throughput dispersed air sensing technology Geo-Hazard modelling
• Earthquake modelling through satellite imagery
The project covered many areas including infrastructure, applications and algorithms (text mining)
Produced the Discovery Net platform which aims to integrate, compose, coordinate and deploy knowledge discovery services using a workflow technology.
20/09/2005All Hands Meeting, Nottingham
Discovery Net
Using Distributed Computing Resources
ScientificInformationScientific
InformationScientific Discovery
LiteratureLiterature
DatabasesDatabases
OperationalData
OperationalData
ImagesImages
InstrumentData
InstrumentData
e-Sciencelarge scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.
20/09/2005All Hands Meeting, Nottingham
Data Clustering We concentrate on a particular class of data mining
algorithms: Clustering
A class of explorative data mining techniques, used to find out groups of points that are similar/close to each other.
Popular analysis technique. Useful for exploring, understanding, modelling large data sets
Two main types of clustering: Hierarchical: Reorganises the data set into a hierarchy of
clusters based on their similarity. Partition/Model based: Tries to partition the data set into a
number of clusters or try to fit a statistical model (e.g. mixture of Gaussians) to a data set
Successfully applied to sociological data, image processing and genomic data.
20/09/2005All Hands Meeting, Nottingham
Mining Data on the Grid
Changing environment for data analysis: From analysing data files held locally (or
close to the algorithm), to using remote data source, using remote services through portals, now towards distributed data executions.
Distributed data sources: Data mining processes can now require data
spread across multiple organisations Service-oriented approach:
High-level functionalities are now available through well-defined services, instead of providing low-level (terminal etc..) access to resources
20/09/2005All Hands Meeting, Nottingham
Goal
Design a service-oriented distributed data clustering strategy:
that can be deployed on a Grid environment (i.e. a standard-based, service oriented, secure distributed environment)
that would allow the end-user/data analysts to deploy easily against its own data sets
20/09/2005All Hands Meeting, Nottingham
Requirements 1/2
Performance issues: The analysis process using data grids directly and
analysis services must be more efficient than gathering all the data on my desktop!
Accuracy: The strategy should at least provide a model more
representative of the overall data set Security
The deployed strategy should ensure consistent handling of authentication and authorization aspects throughout
Privacy: Restricted access to the data source
20/09/2005All Hands Meeting, Nottingham
Requirements 2/2
Heterogeneity of the resources used and/or connectivity It’s very unlikely the set of resources involved in the
distributed analysis process will be similar or work over networks of similar bandwidth
Loose-coupling between resources participating in the distributed analysis The analyst has less control on what is available/provided
by each data grid or each analysis service. Therefore the framework should, as much as possible, be unaffected by minor differences between functionalities provided by each site.
Service-oriented approach: The deployment of the analysis process should be based on the co-ordination of high-level services (instead of a dedicated distributed algorithm, e.g. MPI implementation)
20/09/2005All Hands Meeting, Nottingham
Current strategy
We restrict the current framework to the case where instances are distributed but have the same attributes on each different fragments (~ horizontal fragments)
Based on the EM-Clustering algorithm (mixture of Gaussian model fitting algorithm). Hierarchical clustering inherently complex to
distribute Statistical approach of EM provides a sound basis to
define a model combination strategy
20/09/2005All Hands Meeting, Nottingham
Approach
Generate clustering models at each data source location (compute near the data)
Transfer partial models in standard format (PMML) to a combiner site
Normalise the relative weights of each model
Perform an EM-based method on partial models to generate a global model.
20/09/2005All Hands Meeting, Nottingham
Combining Cluster Models
Derived from the EM-Clustering algorithm itself
Adapted to take as input the models generated at each site
Each partial model is treated like a (very) compressed representation of the fragment (similar to the two step approaches of some scalable clustering algorithms).
More detailed algorithm and formulae in proceedings
20/09/2005All Hands Meeting, Nottingham
Deployment: Discovery Net
The Discovery Net platform is used to build and deploy this framework.
Implementation based on an open architecture re-using common protocols and common infrastructure elements (such as the Globus Toolkits).
It also defines its own protocol for workflows, Discovery Process Markup Language (DPML) which allows the definition of data analysis workflows to be executed on distributed resources.
The platform comprises a server that stores, schedules the workflows and manage the data, and a thick client to help the workflow construction process.
Thus giving the end user the ability to define application-specific workflows performing such tasks as distributed data mining.
The model combiner is implemented as a workflow activity in Discovery Net
20/09/2005All Hands Meeting, Nottingham
DeploymentData sources Discovery Net servers
Partial clustering
Partial clustering
Partial clustering
PMML
PMML
PMML
Partial models
Global model
Combiner site
Source A
Source B
Source C
20/09/2005All Hands Meeting, Nottingham
Deployment: Workflow
The Discovery Net client enables the composition and the execution of the distributed process as a workflow constructed visually.
The execution engine will coordinate the distributed execution
20/09/2005All Hands Meeting, Nottingham
Accuracy Evaluation: Data Distribution Comparison of the accuracy of the combined model with the
average accuracy of partial models against the entire data sets (i.e. have we gained some accuracy by considering the fragments together)
Accuracy will strongly depend on how the data is distributed among different sites. In the evaluation we introduce a randomness ratio to determine how similar the data distribution is among fragments. 0 meaning that each site would have data drawn from
different distributions 1 meaning that the data from all fragments are drawn from
the same distribution Measured by log-likelihood function of the test data set:
The likelihood function of a data set represents how much that data is likely to be following the distribution function defined by the model
20/09/2005All Hands Meeting, Nottingham
Accuracy Evaluation: Data distribution
Randomness ratio effect
-120000
-100000
-80000
-60000
-40000
-20000
0
0 0.2 0.4 0.6 0.8 1 1.2
Ratio
LL
ikel
iho
od
Average log-Likelihood
Combined log-Likelihood
As expected, the ratio has a huge effect on gained accuracy. For low levels, each fragment becomes less and less representative of the complete data set, therefore the combined
model will outperform partial ones.
20/09/2005All Hands Meeting, Nottingham
Accuracy Evaluation: Number of fragments
-80000
-70000
-60000
-50000
-40000
-30000
-20000
-10000
0
2 3 4 5 6 7 8 9 10
Avg Likelihood
Combined Likelihood
(r= 0.2, 10,000 points, 5 clusters) The accuracy does degrade with increasing number of fragments, but so does the average accuracy of models generated from individual fragments.
20/09/2005All Hands Meeting, Nottingham
Accuracy Evaluation: Increasing data size
(r=0.2,d=5,5 fragments). Consistent behaviour of the combined model’s accuracy over partial ones.
Increasing data size
-5000000
-4000000
-3000000
-2000000
-1000000
0
0 200000 400000 600000 800000 1000000
1200000
# instances
LL
ikel
iho
od
Average log-Likelihood
Combined log-Likelihood
20/09/2005All Hands Meeting, Nottingham
Performance Evaluation Performance evaluation is only partially relevant, as the process does not
feed back combined models and partial models are generated near the data.
The heterogeneity of real deployments is difficult to take into account.
Time in seconds, for an increasing number of fragments
20/09/2005All Hands Meeting, Nottingham
Performance Evaluation
Execution time with lower dimensionality and larger data sets
20/09/2005All Hands Meeting, Nottingham
Conclusions
Encouraging results in terms of accuracy vs. performance, given the constraints.
But is the trade-off between accuracy and flexibility (generally the case in distributed data mining) acceptable?
This should be part of a wider explorative process, probably as a first step into the understanding of the data set.
Being part of the Discovery Net platform, the distributed analysis process can be simply designed from the Discovery Net client software.
20/09/2005All Hands Meeting, Nottingham
Future Works
First step towards more generic distributed data mining strategies (classification algorithms, association rules)
Need evaluation against real data sets ! Possible improvements including:
Refinement through feedback Use of a more complex intermediate summary
structure for the partial models (e.g. tree structures containing summary information)
Estimation of the number of clusters (using Bayesian Information Criteria)
Plenty of possible clustering algorithms to try to use.