WWW.CRIM.CA
Scalable Density Clustering for Spark
THOMAS TRIPLET, PH.D., ENG.
MARCH 9TH 2016
BIG DATA TECHNOLOGIES
• Hadoop Core
– HDFS: distributed file system
– YARN: CPU resource management and scheduling
– MapReduce: large-scale batch data processing
• Hadoop ecosystem
– NoSQL: HBase, Cassandra, Accumulo, etc.
– SQL: Hive, Stinger (Hortonworks), Impala (Cloudera), Presto (FB), Tajo, Drill (MapR)
– Data transfer: Sqoop, Flume
– Compute/ML: Spark, Storm, Giraph, Mahout
– Scripting: Pig, Cascading
– Administration: Hue, ZooKeeper, Knox
– Search: Solr, ElasticSearch
APACHE SPARK
• Popular distributed in-memory computing framework
• 10-100x faster than Hadoop MapReduce, with low latency
• Linear horizontal scalability
• Fault tolerant (RDDs)
• Applications range from long-running batch jobs to stream processing
• High-level Scala, Java, Python and R APIs
AGENDA
• Clustering algorithms (unsupervised learning)
– Distance-based (k-means)
– Density-based (DBSCAN)
• PatchWork
– Algorithm
– Results
– Performance
• Conclusion
• Future Work
INTRODUCTION: MACHINE LEARNING
Supervised Learning
• Class labels are known and pre-defined
• Training and testing datasets are (manually) labeled with the same classes
• Goal is to learn a function/rule that can classify new data points
• Examples: SVMs, neural nets, Bayesian classifiers, decision trees…
Unsupervised Learning (clustering)
• Class labels of the data are unknown
• Group/cluster similar data points without prior knowledge
• Goal is to discover structure or patterns in the data
• Examples: k-means, EM, DBSCAN, HCA…
PatchWork ➔ Unsupervised Learning (clustering)
INTRODUCTION: CLUSTERING
Distance-based
• Popular algorithm: k-means (implemented in MLlib)
• Relies on a distance function between data points
• Easy to implement
• Linear complexity (suited to big data)
• Easy to distribute
• Discovers only spherical clusters of similar sizes
• Sensitive to noise and local optima
• Requires prior knowledge of k
Density-based
• Popular algorithm: DBSCAN (not in MLlib)
• Relies on the density of data points in feature space
• Natural protection against noise and outliers
• Discovers clusters of arbitrary shape and size
• No prior knowledge of k
• Discovers only clusters of similar densities
• Quadratic complexity: not scalable
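The quadratic cost listed for DBSCAN comes from its region query: without a spatial index, finding the neighbours of one point scans every other point, and this is repeated for each point. A minimal sketch in plain Python, where the names `region_query` and `core_points` and the parameters `eps` and `min_pts` are illustrative assumptions rather than anything taken from the slides:

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], found by a linear scan."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def core_points(points, eps, min_pts):
    """Indices of points with at least min_pts neighbours (the point itself included).

    One linear scan per point gives n * n distance computations overall.
    """
    return [i for i in range(len(points))
            if len(region_query(points, i, eps)) >= min_pts]

# Three nearby points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
core = core_points(points, eps=0.5, min_pts=3)
```

Grid-based approaches avoid this pairwise scan by counting points per cell first, so the expensive neighbourhood logic runs on cells instead of individual points.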
PatchWork ➔ Density-based
PATCHWORK ALGORITHM
2 main steps:
1. createCells(dataPoints) → cells → RDD[(String, Int)]
2. createClusters(cells) → clusters
STEP 1: CELL CREATION
Cells as (cellID ; number of points):
(-1,2 ; 4)
(-1,3 ; 4)
(-2,2 ; 4)
(-3,4 ; 1)
(2,3 ; 4)
(2,4 ; 3)
(3,3 ; 3)
(3,4 ; 3)
Map output, one (cellID ; 1) pair per data point:
(-1,2 ; 1)
(-2,2 ; 1)
(-1,2 ; 1)
(-1,2 ; 1)
...
(3,4 ; 1)
(3,4 ; 1)
(3,4 ; 1)
Reduce output, one (cellID ; count) pair per cell:
(-1,2 ; 4)
(-1,3 ; 4)
(-2,2 ; 4)
(-3,4 ; 1)
(2,3 ; 4)
(2,4 ; 3)
(3,3 ; 3)
(3,4 ; 3)
setOfCells = dataPoints.map(P => (cellID(P), 1)).reduceByKey(_ + _)
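The map and reduceByKey step can be tried locally without a cluster. The sketch below mirrors it in plain Python (not Spark), under the assumption that a cell identifier is obtained by dividing each coordinate by a cell size and rounding down; the function names and the `eps` parameter are hypothetical and not taken from the slides.

```python
from collections import Counter
from math import floor

def cell_id(point, eps):
    """Hypothetical cell identifier: grid indices of the cell of side eps containing the point."""
    return tuple(floor(x / eps) for x in point)

def create_cells(points, eps):
    """Count points per cell: a local analogue of map(P => (cellID(P), 1)).reduceByKey(_ + _)."""
    return Counter(cell_id(p, eps) for p in points)

# Three points fall in cell (-1, 2); the other two fall in cells of their own.
points = [(-0.5, 2.1), (-0.2, 2.7), (-0.9, 2.4), (2.3, 3.5), (3.6, 4.2)]
cells = create_cells(points, eps=1.0)
```

Because the count per cell is a sum, partial counts computed on different partitions can be combined in any order, which is what lets this step distribute well.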
STEP 2: CLUSTER CREATION
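A minimal sketch of one way createClusters could group cells, written in plain Python. It is a generic density-grid flood fill, not a confirmed description of PatchWork's merging rule. Two assumptions are made that the slides do not state: a cell is dense when it holds at least a hypothetical `min_pts` points, and dense cells that touch by an edge or a corner belong to the same cluster. The input reuses the cell counts from step 1.

```python
from collections import deque
from itertools import product

def create_clusters(cells, min_pts):
    """Label dense, adjacent cells with a shared cluster number (breadth-first flood fill)."""
    # Assumption: a cell is dense when its point count reaches min_pts.
    dense = {cid for cid, count in cells.items() if count >= min_pts}
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            current = queue.popleft()
            # Assumption: neighbours are the cells touching by an edge or a corner.
            for offset in product((-1, 0, 1), repeat=len(current)):
                neighbour = tuple(c + o for c, o in zip(current, offset))
                if neighbour in dense and neighbour not in labels:
                    labels[neighbour] = next_label
                    queue.append(neighbour)
        next_label += 1
    return labels

# Cell counts from step 1, keyed by cellID.
cells = {(-1, 2): 4, (-1, 3): 4, (-2, 2): 4, (-3, 4): 1,
         (2, 3): 4, (2, 4): 3, (3, 3): 3, (3, 4): 3}
clusters = create_clusters(cells, min_pts=3)
```

With these assumptions the eight cells form two clusters, and the sparse cell (-3,4) stays unlabeled, which corresponds to treating its single point as noise.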
EXPERIMENTAL SETUP
• 6 servers, each with:
– Intel Xeon E5-2650, 8 cores @ 2.6 GHz
– 192 GB memory
– 30 TB storage
• Cloudera CDH 5.4.0
• Apache Spark 1.3
DATASETS
Figure panels: Aggregation, Compound, Jain, Spiral
RESULTS (JAIN DATASET)
Figure panels: K-means, DBSCAN, PatchWork
RESULTS (SPIRAL DATASET)
Figure panels: K-means, DBSCAN, PatchWork
RESULTS (AGGREGATION DATASET)
Figure panels: K-means, DBSCAN, PatchWork
RESULTS (COMPOUND DATASET)
Figure panels: K-means, DBSCAN, PatchWork
PERFORMANCE
Figure: running time in seconds (log scale, 1 to 100,000) versus millions of data points (axis ticks from 10,000 to 10,000,000,000) for DBSCAN, PatchWork and MLlib k-means||.
PERFORMANCE: SCALABILITY
Figure: normalized execution time (0 to 1) versus number of servers (1 to 5) for MLlib k-means|| and PatchWork.
CONCLUSION
FUTURE WORK
• Tests against new clustering algorithms available in Spark 1.6
• Better distribution of step 2
• Indexing for region query using R-trees
• Streaming version
Q & A
Contact: [email protected]
Availability: https://github.com/crim-ca/patchwork (MIT Licence)
Reference: Frank Gouineau, Tom Landry, Thomas Triplet (2016) PatchWork, a Scalable Density-Grid Clustering Algorithm. In Proc. 31st ACM Symposium on Applied Computing, Data-Mining track
All rights reserved © 2016 CRIM. 405, avenue Ogilvy, bureau 101, Montréal (Québec) H3N 1M3 / 514 840-1234 / 1 877 840-2746
Thomas Triplet, Ph.D., Eng. [email protected]
CRIM is an applied research centre in IT that develops, in collaboration with its clients and partners, innovative technologies and leading-edge know-how, and transfers them to Quebec companies and organizations to make them more productive and more competitive locally and globally. CRIM has four world-class IT research teams. CRIM works mainly in the fields of human-system interactions and interfaces, advanced analytics, and advanced development and testing architectures and technologies. It holds ISO 9001:2008 certification, and its activities fall within the policies and strategies led by the ministère de l'Économie, de l'Innovation et des Exportations (MEIE), its principal financial partner.