Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
DESCRIPTION
Unsupervised Learning on Huge Data with Apache Spark
Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Spark’s MLlib module contains implementations of several unsupervised learning algorithms that scale to large datasets. In this talk, we’ll discuss how to use and implement large-scale machine learning algorithms with the Spark programming model, diving into MLlib’s K-means clustering and Principal Component Analysis (PCA).
TRANSCRIPT
![Page 1: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/1.jpg)
Clustering with Spark
Sandy Ryza / Data Science / Cloudera
![Page 2: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/2.jpg)
Me
● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, committing on Apache Hadoop
● Before that, studying combinatorial optimization and distributed systems at Brown
![Page 3: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/3.jpg)
Sometimes you find yourself with lots of stuff
![Page 4: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/4.jpg)
Large Scale Learning
![Page 5: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/5.jpg)
Network Packets
![Page 6: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/6.jpg)
Detect Network Intrusions
![Page 7: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/7.jpg)
Credit Card Transactions
![Page 8: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/8.jpg)
Detect Fraud
![Page 9: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/9.jpg)
Movie Viewings
![Page 10: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/10.jpg)
Recommend Movies
![Page 11: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/11.jpg)
Unsupervised Learning
● Learn hidden structure of your data
● Interpret new data as it relates to this structure
![Page 12: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/12.jpg)
![Page 13: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/13.jpg)
![Page 14: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/14.jpg)
Two Main Problems
● Designing a system for processing huge data in parallel
● Taking advantage of it with algorithms that work well in parallel
![Page 15: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/15.jpg)
CONFIDENTIAL - RESTRICTED*
MapReduce
[Diagram: twelve Map tasks feeding four Reduce tasks]
Key advances by MapReduce:
•Data locality: input splits are computed automatically and mappers are launched close to their data
•Fault tolerance: writing out intermediate results and restartable mappers made it possible to run on commodity hardware
•Linear scalability: locality combined with a programming model that forces developers to write generally scalable solutions to problems
![Page 16: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/16.jpg)
MapReduce
[Diagram: twelve Map tasks feeding four Reduce tasks]
Limitations of MapReduce
•Each job reads data from HDFS
•No concept of a session
•Jobs are rigid map-then-reduce
![Page 17: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/17.jpg)
Spark is a general purpose computation framework geared towards massive data - more flexible than MapReduce
Extra properties:
•Leverages distributed memory
•Full directed-graph expressions for data-parallel computations
•Improved developer experience
Yet retains: linear scalability, fault tolerance, and data locality
![Page 18: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/18.jpg)
RDDs
[Diagram: HDFS bigfile.txt → lines → numbers (split into partitions) → sum → Driver]
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()
![Page 19: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/19.jpg)
RDDs
[Diagram: HDFS bigfile.txt → lines → numbers (split into partitions) → sum → Driver]
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache().sum()
![Page 20: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/20.jpg)
numbers.sum()
[Diagram: bigfile.txt, lines, numbers (three partitions) → sum → Driver]
![Page 21: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/21.jpg)
Spark MLlib
| | Discrete | Continuous |
|---|---|---|
| Supervised | Classification: logistic regression (and regularized variants); linear SVM; naive Bayes; random decision forests (soon) | Regression: linear regression (and regularized variants) |
| Unsupervised | Clustering: K-means | Dimensionality reduction, matrix factorization: principal component analysis / singular value decomposition; alternating least squares |
![Page 22: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/22.jpg)
![Page 23: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/23.jpg)
![Page 24: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/24.jpg)
![Page 25: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/25.jpg)
Using it
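import org.apache.spark.mllib.clustering.KMeans

// Load the data: one point per line, coordinates separated by spaces
val data = sc.textFile("kmeans_data.txt")
// On Spark 1.0 and later, KMeans.train expects an RDD[Vector], so wrap each array in Vectors.dense(...)
val parsedData = data.map(_.split(' ').map(_.toDouble))
// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)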
![Page 26: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/26.jpg)
K-Means
● Choose some initial centers
● Then alternate between two steps:
○ Assign each point to a cluster based on existing centers
○ Recompute cluster centers from the points in each cluster
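The two alternating steps can be sketched on a single machine in plain Scala, with no Spark involved. This is only an illustration of the idea, not MLlib's implementation; the names `LloydSketch`, `sqDist`, `closest`, `mean` and `lloyd` are made up for this sketch, and it assumes a fixed number of iterations rather than a convergence test.

```scala
object LloydSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Index of the center closest to p.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(j => sqDist(centers(j), p))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point = {
    val dims = points.head.length
    Array.tabulate(dims)(d => points.map(_(d)).sum / points.size)
  }

  // Alternate the assignment and update steps for a fixed number of iterations.
  def lloyd(data: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] = {
    var centers = initial
    for (_ <- 0 until iterations) {
      // Step 1: assign each point to its closest current center.
      val assigned = data.groupBy(p => closest(centers, p))
      // Step 2: move each center to the mean of its points
      // (a center with no points keeps its old position).
      centers = centers.indices.map(j => assigned.get(j).map(mean).getOrElse(centers(j)))
    }
    centers
  }
}
```

For example, `LloydSketch.lloyd(data, initial, 5)` on four points forming two well-separated pairs moves each initial center to the midpoint of its pair.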
![Page 27: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/27.jpg)
![Page 28: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/28.jpg)
![Page 29: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/29.jpg)
![Page 30: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/30.jpg)
![Page 31: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/31.jpg)
![Page 32: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/32.jpg)
K-Means - very parallelizable
● Alternate between two steps:
○ Assign each point to a cluster based on existing centers
■ Process each data point independently
○ Recompute cluster centers from the points in each cluster
■ Average across partitions
![Page 33: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/33.jpg)
// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)
  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }
  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()
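To see why per-partition sums and counts are enough, here is a small plain-Scala model of the same pattern, with a list of lists standing in for the partitions of an RDD. It is a sketch under assumptions: `ContribSketch`, `partitionContribs`, `merge` and `totalContribs` are hypothetical names, and `merge` is only a guess at what a function like `mergeContribs` has to do (add the sums, add the counts).

```scala
object ContribSketch {
  type Point = Array[Double]
  // (sum of the points assigned to a center, number of such points)
  type Contrib = (Array[Double], Long)

  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // What one partition computes: per-center sums and counts for its own points.
  def partitionContribs(points: Seq[Point], centers: Seq[Point]): Map[Int, Contrib] = {
    val dims = centers.head.length
    val sums = Array.fill(centers.length)(Array.fill(dims)(0.0))
    val counts = Array.fill(centers.length)(0L)
    points.foreach { p =>
      val best = centers.indices.minBy(j => sqDist(centers(j), p))
      for (d <- 0 until dims) sums(best)(d) += p(d)
      counts(best) += 1
    }
    centers.indices.map(j => j -> (sums(j), counts(j))).toMap
  }

  // Hypothetical stand-in for mergeContribs: add the sums and add the counts.
  def merge(a: Contrib, b: Contrib): Contrib =
    (a._1.zip(b._1).map { case (x, y) => x + y }, a._2 + b._2)

  // Plays the role of reduceByKey(...).collectAsMap() over a list of partitions.
  def totalContribs(partitions: Seq[Seq[Point]], centers: Seq[Point]): Map[Int, Contrib] =
    partitions
      .flatMap(part => partitionContribs(part, centers).toSeq)
      .groupBy(_._1)
      .map { case (j, kvs) => j -> kvs.map(_._2).reduce(merge) }
}
```

Dividing each merged sum by its merged count gives the same mean as averaging over all points at once, and only k small (sum, count) pairs per partition have to be moved to get it.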
![Page 34: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/34.jpg)
// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value
![Page 35: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/35.jpg)
![Page 36: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/36.jpg)
The Problem
● K-Means is very sensitive to the initial set of center points chosen.
● The best existing algorithm for choosing centers is highly sequential.
![Page 37: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/37.jpg)
![Page 38: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/38.jpg)
K-Means++
● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen
● Repeat until k initial centers are chosen
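A minimal single-machine sketch of this seeding procedure in plain Scala follows. It weights each point by its squared distance to the nearest chosen center, as the K-Means++ paper specifies. The names `KMeansPlusPlusSketch` and `seed` are made up for illustration, and the uniform fallback when every point already coincides with a center is an assumption of this sketch.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Pick k initial centers; each new center is drawn with probability
  // proportional to its squared distance from the closest center chosen so far.
  def seed(data: IndexedSeq[Point], k: Int, rng: Random): Seq[Point] = {
    var centers = Vector(data(rng.nextInt(data.length)))
    while (centers.length < k) {
      // One full pass over the data per new center.
      val weights = data.map(p => centers.map(c => sqDist(c, p)).min)
      val total = weights.sum
      val next =
        if (total == 0.0) data(rng.nextInt(data.length)) // all points already covered
        else {
          // Walk the cumulative weights until they pass a uniform draw in [0, total).
          val target = rng.nextDouble() * total
          val cumulative = weights.scanLeft(0.0)(_ + _).tail
          val idx = cumulative.indexWhere(_ > target)
          data(if (idx >= 0) idx else data.length - 1)
        }
      centers = centers :+ next
    }
    centers
  }
}
```

Each new center needs the distances to all centers chosen before it, so the loop cannot be split into independent passes.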
![Page 39: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/39.jpg)
K-Means++
● The initial clustering has expected cost within a factor of O(log k) of the optimum cost
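Written out, the guarantee from the K-Means++ paper (Arthur and Vassilvitskii) is the following, where φ denotes the sum of squared distances from each point to its nearest chosen center and φ_OPT is the cost of the optimal k-clustering; the symbols are chosen here for illustration:

$$\mathbb{E}[\phi] \le 8(\ln k + 2)\,\phi_{\mathrm{OPT}}$$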
![Page 40: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/40.jpg)
K-Means++
● Requires k passes over the data
![Page 41: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/41.jpg)
K-Means||
● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-Means++ on sampled points to find initial centers
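One way to read these steps is the plain-Scala sketch below. It is an interpretation under stated assumptions, not MLlib's code: `m` is treated as the expected number of points kept per pass, each point is kept independently with probability proportional to its squared distance from the current candidates, and the candidates are weighted by how many points sit closest to them before a weighted K-Means++ picks the final k. All names (`KMeansParallelSketch`, `candidates`, `weights`, `reduceToK`, `initialCenters`) are hypothetical.

```scala
import scala.util.Random

object KMeansParallelSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  def minSqDist(centers: Seq[Point], p: Point): Double =
    centers.map(c => sqDist(c, p)).min

  // Oversampling passes: each pass keeps every point independently with
  // probability min(1, m * d2(p) / cost), so about m points are added per pass.
  def candidates(data: IndexedSeq[Point], m: Int, passes: Int, rng: Random): Vector[Point] = {
    var chosen = Vector(data(rng.nextInt(data.length)))
    for (_ <- 0 until passes) {
      val d2 = data.map(p => minSqDist(chosen, p))
      val cost = d2.sum
      if (cost > 0.0) {
        val picked = data.zip(d2).collect {
          case (p, d) if rng.nextDouble() < m * d / cost => p
        }
        chosen = chosen ++ picked
      }
    }
    chosen
  }

  // Weight each candidate by the number of data points closest to it.
  def weights(data: Seq[Point], cands: IndexedSeq[Point]): Vector[Double] = {
    val counts = Array.fill(cands.length)(0.0)
    data.foreach { p =>
      counts(cands.indices.minBy(j => sqDist(cands(j), p))) += 1.0
    }
    counts.toVector
  }

  // Weighted K-Means++ over the small candidate set, done locally.
  def reduceToK(cands: IndexedSeq[Point], w: IndexedSeq[Double], k: Int, rng: Random): Vector[Point] = {
    // Draw an index with probability proportional to its score.
    def draw(scores: IndexedSeq[Double]): Int = {
      val target = rng.nextDouble() * scores.sum
      val idx = scores.scanLeft(0.0)(_ + _).tail.indexWhere(_ > target)
      if (idx >= 0) idx else scores.length - 1
    }
    var centers = Vector(cands(draw(w)))
    while (centers.length < k) {
      val scores = cands.indices.map(i => w(i) * minSqDist(centers, cands(i)))
      // If every candidate is already a center, fall back to the plain weights.
      centers = centers :+ (if (scores.sum == 0.0) cands(draw(w)) else cands(draw(scores)))
    }
    centers
  }

  def initialCenters(data: IndexedSeq[Point], k: Int, m: Int, passes: Int, rng: Random): Vector[Point] = {
    val cands = candidates(data, m, passes, rng)
    reduceToK(cands, weights(data, cands), k, rng)
  }
}
```

Only `candidates` and `weights` touch the full dataset, and both are a fixed number of independent per-point passes; the sequential K-Means++ step runs on the small candidate set.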
![Page 42: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/42.jpg)
![Page 43: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/43.jpg)
![Page 44: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/44.jpg)
![Page 45: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/45.jpg)
![Page 46: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/46.jpg)
![Page 47: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/47.jpg)
![Page 48: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/48.jpg)
![Page 49: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/49.jpg)
Then on the full data...
![Page 50: Sandy Ryza – Software Engineer, Cloudera at MLconf ATL](https://reader034.vdocuments.mx/reader034/viewer/2022042521/546e75aeaf79595d298b577f/html5/thumbnails/50.jpg)