Sandy Ryza – Software Engineer, Cloudera at MLconf ATL

Clustering with Spark

Sandy Ryza / Data Science / Cloudera

Me

● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, committing on Apache Hadoop
● Before that, studying combinatorial optimization and distributed systems at Brown

Sometimes you find yourself with lots of stuff

Large Scale Learning

Network Packets

Detect Network Intrusions

Credit Card Transactions

Detect Fraud

Movie Viewings

Recommend Movies

Unsupervised Learning

● Learn hidden structure of your data
● Interpret new data as it relates to this structure

Two Main Problems

● Designing a system for processing huge data in parallel

● Taking advantage of it with algorithms that work well in parallel


MapReduce

Map Map Map Map Map Map Map Map Map Map Map Map

Reduce Reduce Reduce Reduce

Key advances by MapReduce:

•Data locality: Input splits are computed automatically and mappers are launched close to their data

•Fault tolerance: Writing out intermediate results and making mappers restartable made it possible to run on commodity hardware

•Linear scalability: Combination of locality + programming model that forces developers to write generally scalable solutions to problems


MapReduce

Map Map Map Map Map Map Map Map Map Map Map Map

Reduce Reduce Reduce Reduce

Limitations of MapReduce

•Each job reads data from HDFS

•No concept of a session

•Jobs are rigid: map-then-reduce


Spark is a general purpose computation framework geared towards massive data - more flexible than MapReduce

Extra properties:

•Leverages distributed memory
•Full Directed Graph expressions for data parallel computations
•Improved developer experience

Yet retains: Linear scalability, Fault-tolerance and Data-Locality

RDDs

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()

[Diagram: bigfile.txt in HDFS -> lines -> numbers, each RDD split into partitions; sum is returned to the Driver]

RDDs

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache().sum()   // first action reads from HDFS and fills the cache
numbers.sum()           // later actions reuse the cached partitions

[Diagram: first sum: bigfile.txt in HDFS -> lines -> numbers (partitions) -> sum at the Driver; second sum: cached numbers partitions -> sum at the Driver]

Spark MLlib

Supervised, discrete: Classification
● Logistic regression (and regularized variants)
● Linear SVM
● Naive Bayes
● Random Decision Forests (soon)

Supervised, continuous: Regression
● Linear regression (and regularized variants)

Unsupervised, discrete: Clustering
● K-means

Unsupervised, continuous: Dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares


Using it

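import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load space-separated numbers; MLlib 1.0 and later take an RDD[Vector]
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)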

K-Means

● Choose some initial centers
● Then alternate between two steps:
  ○ Assign each point to a cluster based on existing centers
  ○ Recompute cluster centers from the points in each cluster
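The two alternating steps can be sketched on a small in-memory collection. This is a minimal illustration in plain Scala, not MLlib's implementation; the names `LloydSketch`, `closest`, `step` and `run` are made up for this example, and it assumes a fixed iteration count and leaves a center in place when no point is assigned to it.

```scala
object LloydSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Step 1: index of the center closest to the point.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(j => sqDist(centers(j), p))

  // Step 2: replace each center with the mean of the points assigned to it.
  // A center with no assigned points is left where it was.
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => closest(centers, p))
    centers.indices.map { j =>
      groups.get(j) match {
        case Some(ps) =>
          val dims = centers(j).length
          Array.tabulate(dims)(d => ps.map(_(d)).sum / ps.size)
        case None => centers(j)
      }
    }
  }

  // Alternate the two steps for a fixed number of iterations.
  def run(points: Seq[Point], init: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(init)((cs, _) => step(points, cs))
}
```

For the four points (0,0), (0,2), (10,0), (10,2) and starting centers (1,1) and (9,1), `LloydSketch.run` settles on the centers (0,1) and (10,1).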

K-Means - very parallelizable

● Alternate between two steps:
  ○ Assign each point to a cluster based on existing centers
    ■ Process each data point independently
  ○ Recompute cluster centers from the points in each cluster
    ■ Average across partitions

// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)

  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }

  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()

// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}

if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}

cost = costAccum.value
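The code above passes a `mergeContribs` function to `reduceByKey` without showing it. The following is a hypothetical stand-in on plain arrays, assuming each contribution is a (vector sum, point count) pair; the names `ContribSketch`, `Contrib` and `newCenter` are invented for this sketch and are not taken from MLlib.

```scala
object ContribSketch {
  // One center's contribution from one partition: (sum of points, point count).
  type Contrib = (Array[Double], Long)

  // Hypothetical stand-in for mergeContribs: add the vector sums
  // element-wise and add the counts. The operation is associative and
  // commutative, which reduceByKey requires of its merge function.
  def mergeContribs(a: Contrib, b: Contrib): Contrib =
    (a._1.zip(b._1).map { case (x, y) => x + y }, a._2 + b._2)

  // Merge the per-partition contributions for one center, then divide the
  // total sum by the total count to get the new center.
  def newCenter(perPartition: Seq[Contrib]): Array[Double] = {
    val (sum, count) = perPartition.reduce(mergeContribs)
    require(count > 0, "a center needs at least one assigned point")
    sum.map(_ / count)
  }
}
```

Because only a sum and a count per center leave each partition, the data shuffled per iteration depends on the number of centers and dimensions, not on the number of points.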

The Problem

● K-Means is very sensitive to the initial set of center points chosen.

● The best existing algorithm for choosing centers is highly sequential.

K-Means++

● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen
● Repeat until k initial centers are chosen
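The seeding procedure above can be sketched in plain Scala on an in-memory collection. This is an illustrative sketch rather than MLlib's code: the names `KMeansPlusPlusSketch`, `sampleIndex` and `chooseCenters` are invented here, and the weighting uses squared distance as in the standard K-Means++ formulation.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Pick an index with probability proportional to its weight.
  def sampleIndex(weights: IndexedSeq[Double], rng: Random): Int = {
    val target = rng.nextDouble() * weights.sum
    var cumulative = 0.0
    var i = 0
    while (i < weights.length - 1 && cumulative + weights(i) <= target) {
      cumulative += weights(i)
      i += 1
    }
    i
  }

  def chooseCenters(points: IndexedSeq[Point], k: Int, rng: Random): Seq[Point] = {
    require(k >= 1 && k <= points.length)
    // Start with one point chosen uniformly at random.
    var centers = Vector(points(rng.nextInt(points.length)))
    while (centers.length < k) {
      // Weight of each point: squared distance to its closest chosen center.
      // Every new center needs another full pass over the points.
      val weights = points.map(p => centers.map(c => sqDist(c, p)).min)
      centers = centers :+ points(sampleIndex(weights, rng))
    }
    centers
  }
}
```

Each iteration of the `while` loop depends on the centers chosen so far, which is why the procedure is hard to parallelize across centers.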

K-Means++

● Initial clustering has expected cost within a factor of O(log k) of the optimum cost

K-Means++

● Requires k passes over the data

K-Means||

● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-Means++ on sampled points to find initial centers
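One plausible reading of these steps, sketched in plain Scala on an in-memory collection. Several details are assumptions not stated on the slide: each pass keeps a point independently with probability min(1, m * cost / total cost), candidates are weighted by how many points are closest to them, and the final reduction is a weighted K-Means++. All names (`KMeansParallelSketch`, `candidates`, `reduceToK`) are invented for this example.

```scala
import scala.util.Random

object KMeansParallelSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Squared distance from p to its closest center in cs.
  def cost(cs: Seq[Point], p: Point): Double = cs.map(c => sqDist(c, p)).min

  // Pick an index with probability proportional to its weight.
  def sampleIndex(weights: IndexedSeq[Double], rng: Random): Int = {
    val target = rng.nextDouble() * weights.sum
    var cumulative = 0.0
    var i = 0
    while (i < weights.length - 1 && cumulative + weights(i) <= target) {
      cumulative += weights(i)
      i += 1
    }
    i
  }

  // Oversampling passes: every point is kept independently, so one pass
  // adds about m candidates and needs no coordination between points.
  def candidates(points: IndexedSeq[Point], m: Int, passes: Int, rng: Random): Vector[Point] = {
    var cands = Vector(points(rng.nextInt(points.length)))
    for (_ <- 1 to passes) {
      val costs = points.map(p => cost(cands, p))
      val total = costs.sum
      if (total > 0) {
        val picked = points.zip(costs).collect {
          case (p, c) if rng.nextDouble() < m * c / total => p
        }
        cands = cands ++ picked
      }
    }
    cands
  }

  // Reduce the candidates to k centers with a weighted K-Means++ pass.
  // counts(i) is the number of input points whose closest candidate is i.
  def reduceToK(points: Seq[Point], cands: IndexedSeq[Point], k: Int, rng: Random): Seq[Point] = {
    require(cands.length >= k, "oversampling produced fewer than k candidates")
    val counts = Array.fill(cands.length)(0.0)
    points.foreach { p =>
      counts(cands.indices.minBy(i => sqDist(cands(i), p))) += 1.0
    }
    var centers = Vector(cands(sampleIndex(counts.toIndexedSeq, rng)))
    while (centers.length < k) {
      val w = cands.indices.map(i => counts(i) * cost(centers, cands(i)))
      centers = centers :+ cands(sampleIndex(w, rng))
    }
    centers
  }
}
```

The sequential K-Means++ step now runs only over the small candidate set, while the passes over the full data are limited to the few oversampling rounds plus one pass to count points per candidate.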

Then on the full data...
