Scalable Clustering Algorithm,
M. Ghesmoune, T. Sarazin, M. Lebbah, H. Azzag
BIG DATA, MACHINE LEARNING AND SOCIAL NETWORK ANALYSIS, DECEMBER 16
Outline
● Context
● Clustering using MapReduce
● Deal with large data sets such as streams
● Conclusion & Perspectives
Context
[Figure: exploration, visualization, clustering]
Tutorial, IEEE BigData 2014
Difficulties:
- Structure
- Similarity measure?
- Number of clusters? (combinatorial)
- Validation (unlabeled data)
- Data types: categorical, mixed ...
Two alternatives
[Diagram: data and learning algorithms (Algo. Learning)]
● MapReduce / Spark
● Massive data mining as stream mining
Spark as an alternative
[Sparks et al ICDM 2013]
Logistic regression
Clustering
Implementation: K-Means
    val data = spark.textFile("hdfs://...").map(parsePoint)
    val centroids = Array(Point(randX(), randY()), Point(randX(), randY()))
Compute distance with prototypes
    closestCentroid(p, centroids)
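The slides call `Point`, `parsePoint`, and `closestCentroid` without showing their definitions. A minimal sketch of what they might look like follows, assuming 2-D points, squared Euclidean distance, and an `x,y` text format per line. The object name `KMeansHelpers`, the `squaredDistance` method, and the input format are illustrative assumptions, not the authors' code.

```scala
object KMeansHelpers {
  // Hypothetical 2-D point; the slides use Point without defining it.
  final case class Point(x: Double, y: Double) {
    def +(other: Point): Point = Point(x + other.x, y + other.y)
    def /(n: Int): Point = Point(x / n, y / n)

    // Squared Euclidean distance is enough to rank centroids.
    def squaredDistance(other: Point): Double = {
      val dx = x - other.x
      val dy = y - other.y
      dx * dx + dy * dy
    }
  }

  // Assumed input format: one "x,y" pair per line.
  def parsePoint(line: String): Point = {
    val parts = line.split(",").map(_.trim.toDouble)
    Point(parts(0), parts(1))
  }

  // Index of the centroid nearest to p.
  def closestCentroid(p: Point, centroids: Array[Point]): Int =
    centroids.indices.minBy(i => p.squaredDistance(centroids(i)))
}
```

Returning an index rather than the centroid itself keeps the key small and lets the later update write back into the centroid array by position.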
Assignment
    closestCentroid(p, centroids)
Map - Assignments
    val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)))
Reduce - update of prototypes
    val pointStats = closest.reduceByKey {
      case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2)
    }
Iteration 1
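    val pointStats = closest.reduceByKey {
      case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2)
    }
    // collect() brings the per-cluster sums to the driver, so the update
    // reaches the driver-side centroids array even on a cluster
    pointStats.collect().foreach { case (id, value) =>
      centroids(id) = value._1 / value._2
    }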
Iteration 2
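    // 1 until 10 excludes 10, so this runs 9 passes
    for (i <- 1 until 10) {
      // Map: key each point by its nearest centroid
      val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)))
      // Reduce: sum points and counts per centroid
      val pointStats = closest.reduceByKey {
        case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2)
      }
      // Update on the driver: new centroid = sum / count
      pointStats.collect().foreach { case (id, value) =>
        centroids(id) = value._1 / value._2
      }
    }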
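To try the same map / reduce / update cycle without a cluster, the loop can be imitated with plain Scala collections. The sketch below is an illustration only: `groupBy` followed by `reduce` stands in for Spark's `reduceByKey`, points are plain `(Double, Double)` tuples, and the names `LocalKMeans`, `Vec`, and `run` are assumptions rather than anything from the slides.

```scala
object LocalKMeans {
  type Vec = (Double, Double)

  def add(a: Vec, b: Vec): Vec = (a._1 + b._1, a._2 + b._2)

  def sqDist(a: Vec, b: Vec): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  def closest(p: Vec, centroids: Array[Vec]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  def run(data: Seq[Vec], init: Array[Vec], iterations: Int): Array[Vec] = {
    val centroids = init.clone()
    for (_ <- 1 to iterations) {
      // Map step: key every point by its nearest centroid, with a count of 1.
      val keyed = data.map(p => (closest(p, centroids), (p, 1)))
      // Reduce step: sum points and counts per key (stand-in for reduceByKey).
      val stats = keyed.groupBy(_._1).map { case (id, group) =>
        id -> group.map(_._2).reduce((a, b) => (add(a._1, b._1), a._2 + b._2))
      }
      // Update step: a centroid with at least one point becomes the mean of its points.
      stats.foreach { case (id, (sum, n)) =>
        centroids(id) = (sum._1 / n, sum._2 / n)
      }
    }
    centroids
  }
}
```

A centroid that attracts no points keeps its previous position here; other policies, such as re-seeding it, are equally plausible.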
Topological Map
[Figure: topological map; labels: prototype w, x]
Why?
➢ Topological organization
➢ Generalization of K-means
➢ Adapted to MapReduce (batch version)
➢ Used for visualization
➢ Used for the exploration phase
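One way to see the topological map as a generalization of K-means is the batch update: each prototype becomes a neighborhood-weighted mean of all observations, and as the neighborhood shrinks the update reduces to the K-means mean. The sketch below assumes a 1-D map and a Gaussian neighborhood. The names `BatchSomStep`, `neighborhood`, and `sigma` are illustrative choices, and this is not the implementation from the authors' repository.

```scala
object BatchSomStep {
  type Vec = Array[Double]

  def sqDist(a: Vec, b: Vec): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Gaussian neighborhood between cells c and r on a 1-D map (assumed layout).
  def neighborhood(c: Int, r: Int, sigma: Double): Double = {
    val d = (c - r).toDouble
    math.exp(-(d * d) / (2 * sigma * sigma))
  }

  // One batch pass over the data; returns the updated prototypes.
  def step(data: Seq[Vec], prototypes: Array[Vec], sigma: Double): Array[Vec] = {
    // Assignment: best-matching cell for each observation.
    val bmus = data.map(x => prototypes.indices.minBy(c => sqDist(x, prototypes(c))))
    prototypes.indices.map { c =>
      val weights = bmus.map(b => neighborhood(c, b, sigma))
      val total = weights.sum
      if (total == 0.0) prototypes(c).clone() // weights underflowed: keep the old prototype
      else
        // Quantization: weighted sum of observations divided by the total weight.
        Array.tabulate(prototypes(c).length) { j =>
          data.zip(weights).map { case (x, w) => w * x(j) }.sum / total
        }
    }.toArray
  }
}
```

The assignment step fits a map phase, and the weighted sums and totals are associative, which is what makes the batch version a natural fit for a reduce phase.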
MapReduce / Spark
[Diagram: MapReduce steps for SOM and BiTM. Labels: map1, map2, Reduce; Assignment, Row assignments, Column assignment, Quantization.]
https://github.com/TugdualSarazin/spark-clustering
BiTM-MapReduce-SPARK
[Figures: 2 million observations, 20 variables; 2 million observations, 40 variables]
Two alternatives
[Diagram: data and learning algorithms (Algo. Learning)]
● MapReduce / Spark
● Massive data mining as stream mining
Big Data as data stream
Online-offline framework for clustering data streams
G-Stream: GNG + Data Stream
GNG (Growing Neural Gas) [Fritzke 95]
― Evolving topology
― Number of cells is not fixed
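The two GNG properties above (a topology that evolves, and a node count that is not fixed) can be illustrated with a simplified sketch of the standard GNG steps: adapt the winner and its neighbors, refresh or create the winner/runner-up edge, prune old edges, and periodically insert a node where the accumulated error is highest. This is plain GNG, not G-Stream. The names `GngSketch`, `Graph`, `adapt`, `insert`, and all parameter defaults are illustrative assumptions.

```scala
import scala.collection.mutable

object GngSketch {
  type Vec = Array[Double]
  final class Node(var w: Vec, var error: Double)

  class Graph {
    val nodes = mutable.Map[Int, Node]()
    val edgeAge = mutable.Map[(Int, Int), Int]() // key is (smaller id, larger id)
    private var nextId = 0

    def addNode(w: Vec, error: Double = 0.0): Int = {
      val id = nextId
      nextId += 1
      nodes(id) = new Node(w, error)
      id
    }
    def key(a: Int, b: Int): (Int, Int) = if (a < b) (a, b) else (b, a)
    def neighbors(a: Int): Seq[Int] =
      edgeAge.keys.toSeq.collect {
        case (p, q) if p == a => q
        case (p, q) if q == a => p
      }
  }

  def sqDist(a: Vec, b: Vec): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  def moveToward(w: Vec, x: Vec, eps: Double): Vec =
    w.indices.map(i => w(i) + eps * (x(i) - w(i))).toArray

  // One adaptation step for input x; the graph must hold at least two nodes.
  def adapt(g: Graph, x: Vec, epsB: Double = 0.05, epsN: Double = 0.006,
            maxAge: Int = 50, decay: Double = 0.995): Unit = {
    val ranked = g.nodes.keys.toSeq.sortBy(id => sqDist(g.nodes(id).w, x))
    val (s1, s2) = (ranked(0), ranked(1)) // winner and runner-up
    val nbrs = g.neighbors(s1)
    nbrs.foreach { n => g.edgeAge(g.key(s1, n)) += 1 } // age the winner's edges
    g.nodes(s1).error += sqDist(g.nodes(s1).w, x)
    g.nodes(s1).w = moveToward(g.nodes(s1).w, x, epsB)
    nbrs.foreach { n => g.nodes(n).w = moveToward(g.nodes(n).w, x, epsN) }
    g.edgeAge(g.key(s1, s2)) = 0 // create the edge or reset its age
    // Topology evolves: drop old edges, then nodes left without any edge.
    val old = g.edgeAge.collect { case (e, age) if age > maxAge => e }.toSeq
    old.foreach(e => g.edgeAge.remove(e))
    val connected = g.edgeAge.keys.flatMap { case (a, b) => Seq(a, b) }.toSet
    g.nodes.keys.toSeq.filterNot(connected).foreach(id => g.nodes.remove(id))
    g.nodes.values.foreach(n => n.error *= decay)
  }

  // Growth: insert a node halfway between the highest-error node q and its
  // highest-error neighbor f. Call only after adapt, so that q has a neighbor.
  def insert(g: Graph, alpha: Double = 0.5): Int = {
    val q = g.nodes.maxBy(_._2.error)._1
    val f = g.neighbors(q).maxBy(n => g.nodes(n).error)
    val wr = g.nodes(q).w.zip(g.nodes(f).w).map { case (a, b) => 0.5 * (a + b) }
    g.nodes(q).error *= alpha
    g.nodes(f).error *= alpha
    val r = g.addNode(wr, g.nodes(q).error)
    g.edgeAge.remove(g.key(q, f))
    g.edgeAge(g.key(q, r)) = 0
    g.edgeAge(g.key(r, f)) = 0
    r
  }
}
```

In a typical run, `adapt` is called for every input and `insert` once every fixed number of inputs; that period is another tunable parameter not shown here.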
G-STREAM and others
G-Stream: characteristics
• No model initialization phase
• A graph represents the topological structure
• Multiple nodes are created at the same time
• A single online stage (no offline stage)
• Use of a reservoir
Data sets
G-Stream: Example
G-Stream on letter4
G-Stream on DS1
G-Stream on DS2
G-Stream vs GNG online: accuracy
G-Stream vs GNG online: RMS error
G-Stream vs GNG online: number of nodes
Accuracy
NMI
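For readers unfamiliar with the metric, normalized mutual information compares a clustering with reference labels and does not depend on how the clusters are numbered. The sketch below uses the normalization I(U;V) / sqrt(H(U) · H(V)). Other normalizations exist, and the slides do not say which one was used, so this is an assumption, as are the names `Nmi` and `nmi`.

```scala
object Nmi {
  // Shannon entropy (natural log) from a collection of counts summing to n.
  private def entropy(counts: Iterable[Int], n: Double): Double =
    counts.map { c => val p = c / n; -p * math.log(p) }.sum

  // NMI = I(U;V) / sqrt(H(U) * H(V)); returns 0.0 when either entropy is zero.
  def nmi(labels: Seq[Int], clusters: Seq[Int]): Double = {
    require(labels.length == clusters.length && labels.nonEmpty)
    val n = labels.length.toDouble
    val classCounts = labels.groupBy(identity).map { case (k, v) => k -> v.size }
    val clusterCounts = clusters.groupBy(identity).map { case (k, v) => k -> v.size }
    val joint = labels.zip(clusters).groupBy(identity).map { case (k, v) => k -> v.size }
    // Mutual information summed over the non-empty cells of the contingency table.
    val mi = joint.map { case ((u, v), c) =>
      (c / n) * math.log(c * n / (classCounts(u) * clusterCounts(v).toDouble))
    }.sum
    val hu = entropy(classCounts.values, n)
    val hv = entropy(clusterCounts.values, n)
    if (hu == 0.0 || hv == 0.0) 0.0 else mi / math.sqrt(hu * hv)
  }
}
```

A clustering that matches the labels up to renaming scores 1, and one that is statistically independent of the labels scores 0.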
Conclusion & perspectives
● Data set size has increased significantly
  ○ MapReduce is crucial for some algorithms
  ○ Deal with large data sets such as streams
● New approach
  ○ Resampling & sketching
  ○ Boosting & bagging [Kleiner et al ICML 2012]
References
Kleiner A, Talwalkar A, Sarkar P, Jordan M I. (2012), "The Big Data Bootstrap". Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1759-1766.
Ghesmoune M, Azzag H, Lebbah M. (2014), "G-Stream: Growing Neural Gas over Data Stream". Neural Information Processing, Lecture Notes in Computer Science, Volume 8834, pp. 207-214. Kuching, Sarawak, Malaysia, November 3-6, 2014.
Sarazin T, Lebbah M, Azzag H. (2014), "Biclustering using Spark-MapReduce". IEEE International Conference on Big Data, pp. 58-60. Washington DC, USA, October 27-30, 2014. (Poster).
Sparks E, Talwalkar A, Smith V, Kottalam J, Pan X, Gonzalez J, Franklin M, Jordan M I, Kraska T. (2013), "MLI: An API for Distributed Machine Learning". International Conference on Data Mining (ICDM).