Reza Zadeh
Distributing Matrix Computations with Spark MLlib
A General Platform
Spark Core
Spark Streaming real-time
Spark SQL structured
GraphX graph
MLlib machine learning
…
Standard libraries included with Spark
Outline
Introduction to MLlib
Example Invocations
Benefits of Iterations: Optimization
Singular Value Decomposition
All-pairs Similarity Computation
MLlib + {Streaming, GraphX, SQL}
Introduction
MLlib History
MLlib is a Spark subproject providing machine learning primitives
Initial contribution from AMPLab, UC Berkeley
Shipped with Spark since Sept 2013
MLlib: Available algorithms
classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means||
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS
Example Invocations
Example: K-means
Example: PCA
Example: ALS
Benefits of fast iterations
Optimization
At least two large classes of optimization problems humans can solve:
- Convex Programs
- Spectral Problems (SVD)
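Optimization - LR
    from math import exp
    import numpy
    # readPoint, D and iterations are defined elsewhere; each parsed point has fields x and y
    data = spark.textFile(...).map(readPoint).cache()
    w = numpy.random.rand(D)
    for i in range(iterations):
        # gradient of the logistic loss log(1 + exp(-y * w.x)), summed over all points
        gradient = data.map(lambda p:
            (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda a, b: a + b)
        w -= gradient
    print("Final w: %s" % w)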
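The same map-then-reduce loop can be tried without a cluster. The following is a minimal sketch in plain Python, assuming labels in {-1, +1}, a hypothetical four-point toy dataset, and a fixed step size (the slide uses a step of 1). The names `train`, `gradient` and `add` are illustrative and not part of Spark.

```python
import math
from functools import reduce

def gradient(point, w):
    """Gradient of log(1 + exp(-y * w.x)) for one point (x, y), with y in {-1, +1}."""
    x, y = point
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
    return [scale * xi for xi in x]

def add(a, b):
    """Element-wise sum of two gradient vectors (the reduce step)."""
    return [ai + bi for ai, bi in zip(a, b)]

def train(data, dims, iterations=200, step=0.1):
    w = [0.0] * dims
    for _ in range(iterations):
        # map: one gradient per point; reduce: sum them
        grad = reduce(add, map(lambda p: gradient(p, w), data))
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w

# Hypothetical toy data: the label follows the sign of the first feature;
# the second feature is a constant bias term.
data = [([2.0, 1.0], 1), ([1.5, 1.0], 1), ([-1.0, 1.0], -1), ([-2.5, 1.0], -1)]
w = train(data, dims=2)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x, _ in data]
```

Each pass touches every point, which is why keeping `data` cached in memory matters when the loop runs on a cluster.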
Spark PageRank
Using cache(), keep neighbor lists in RAM
Using partitioning, avoid repeated hashing
[Diagram: Neighbors (id, edges) and Ranks (id, rank); partitionBy, then join, join, join, …]
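The data flow behind these joins can be sketched in plain Python. This is an illustration under assumptions, not the Spark code from the talk: the neighbor lists are a dict that never changes between iterations (the part worth caching), every page is assumed to have at least one out-link, and the damping factor 0.85 and the function name `pagerank` are choices made for the example.

```python
def pagerank(links, iterations=20, damping=0.85):
    """links: dict mapping page id -> list of neighbor ids (no dangling pages)."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        # "join" neighbor lists with current ranks on page id, then send
        # rank / out-degree to every neighbor
        for page, neighbors in links.items():
            share = ranks[page] / len(neighbors)
            for dest in neighbors:
                contribs[dest] += share
        # only the ranks change from one iteration to the next
        ranks = {page: (1 - damping) + damping * c for page, c in contribs.items()}
    return ranks

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Because `links` is identical in every iteration, partitioning it once by page id means only the much smaller rank table has to move when the join is repeated.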
PageRank Results
Time per iteration (s): Hadoop 171, Basic Spark 72, Spark + Controlled Partitioning 23
Spark PageRank
Generalizes to Matrix Multiplication, opening many algorithms from Numerical Linear Algebra
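To make the connection concrete, one PageRank step can be written as a sparse matrix-vector product. The sketch below assumes the matrix is stored as (row, col, value) entries of a square, column-stochastic link matrix. The helper `matvec` is a hypothetical name, not an MLlib call.

```python
from collections import defaultdict

def matvec(entries, vector):
    """Multiply a square sparse matrix, given as (row, col, value) entries, by a vector.

    Each entry contributes value * vector[col] to output row `row`:
    a map over the entries followed by a sum per row key.
    """
    out = defaultdict(float)
    for row, col, value in entries:
        out[row] += value * vector[col]
    return [out[i] for i in range(len(vector))]

# Column-stochastic link matrix for the edges 0->1, 0->2, 1->2, 2->0
entries = [(1, 0, 0.5), (2, 0, 0.5), (2, 1, 1.0), (0, 2, 1.0)]
first_step = matvec(entries, [1.0, 1.0, 1.0])

damping = 0.85
ranks = [1.0, 1.0, 1.0]
for _ in range(20):
    ranks = [(1 - damping) + damping * r for r in matvec(entries, ranks)]
```

The per-entry products are independent of one another, so the work splits across machines in the same way as the rank contributions do.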
Deep Dive: Singular Value Decomposition
Singular Value Decomposition
Two cases: Tall and Skinny vs roughly Square
The computeSVD function chooses which one to call, so you don't have to.
SVD selection
Tall and Skinny SVD
Tall and Skinny SVD
Gets us V and the singular values
Gets us U by one matrix multiplication
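The slide images are not part of this transcript, so the following is a sketch of the standard route for a tall and skinny matrix A, assumed to be what the two bullets refer to. The small matrix AᵀA is a sum of outer products of the rows, its eigenvectors are the columns of V, its eigenvalues are the squared singular values, and each column of U is A v / sigma. Power iteration stands in for a real eigensolver here and recovers only the top singular triple. All function names are illustrative.

```python
import math

def gram(rows):
    """A^T A as the sum of outer products of the rows (a map over rows plus a reduce)."""
    n = len(rows[0])
    g = [[0.0] * n for _ in range(n)]
    for row in rows:
        for j in range(n):
            for k in range(n):
                g[j][k] += row[j] * row[k]
    return g

def top_eigenpair(g, iterations=200):
    """Power iteration on the small symmetric matrix g, done locally on one machine."""
    n = len(g)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(g[j][k] * v[k] for k in range(n)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[j] * sum(g[j][k] * v[k] for k in range(n)) for j in range(n))
    return eigenvalue, v

rows = [[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]]   # a tiny stand-in for a tall matrix
eigenvalue, v = top_eigenpair(gram(rows))
sigma = math.sqrt(eigenvalue)                  # top singular value of A
# one matrix-vector multiplication gives the matching column of U
u = [sum(r[j] * v[j] for j in range(len(v))) / sigma for r in rows]
```

Only the n-by-n Gram matrix has to fit on one machine, which is what makes the approach suit matrices with many rows and few columns.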
Square SVD via ARPACK
Very mature Fortran77 package for computing eigenvalue decompositions
JNI interface available via netlib-java
Distributed using Spark
Square SVD via ARPACK
Only needs to compute matrix-vector multiplies to build Krylov subspaces
The result of a matrix-vector multiply is small
The multiplication can be distributed
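A minimal sketch of such a distributed multiply, assuming the operator handed to the eigensolver is AᵀA and that the rows of A are split into partitions. Each partition returns a vector of length n, and the driver only sums those small vectors, so AᵀA itself is never formed. The function names are hypothetical and the "workers" are simulated with a list comprehension.

```python
from functools import reduce

def partial_gram_matvec(partition, v):
    """For one partition of rows, return the sum over rows of row * (row . v)."""
    out = [0.0] * len(v)
    for row in partition:
        dot = sum(r * x for r, x in zip(row, v))
        for j, r in enumerate(row):
            out[j] += r * dot
    return out

def gram_matvec(partitions, v):
    """(A^T A) v computed as a sum of small per-partition vectors."""
    partials = [partial_gram_matvec(p, v) for p in partitions]   # would run on workers
    return reduce(lambda a, b: [x + y for x, y in zip(a, b)], partials)  # summed on the driver

# A = [[1, 2], [3, 4], [5, 6]] split into two partitions
partitions = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
result = gram_matvec(partitions, [1.0, 0.0])
```

An iterative eigensolver would call `gram_matvec` once per iteration with a new vector `v`.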
Deep Dive: All pairs Similarity
Deep Dive: All pairs Similarity
Compute via DIMSUM: “Dimension Independent Similarity Computation using MapReduce”
Will be in Spark 1.2 as a method in RowMatrix
All-pairs similarity computation
Naïve Approach
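The slide content for the naïve approach is not reproduced in this transcript. As an assumption, the sketch below shows what a naive all-pairs computation usually looks like: exact cosine similarity between every pair of columns, emitting one product per pair of non-zeros in each row and summing per pair. Rows are sparse dicts, and the function name is illustrative.

```python
import math
from collections import defaultdict
from itertools import combinations

def naive_column_similarities(rows):
    """Exact cosine similarity between every pair of columns.

    rows: list of sparse rows, each a dict {column index: non-zero value}.
    """
    dots = defaultdict(float)
    sq_norms = defaultdict(float)
    for row in rows:
        for j, value in row.items():
            sq_norms[j] += value * value
        # "map": one product per pair of non-zeros in this row
        for j, k in combinations(sorted(row), 2):
            dots[(j, k)] += row[j] * row[k]      # "reduce": sum per column pair
    return {pair: dot / math.sqrt(sq_norms[pair[0]] * sq_norms[pair[1]])
            for pair, dot in dots.items()}

rows = [{0: 1.0, 1: 1.0}, {0: 1.0, 2: 2.0}, {1: 1.0, 2: 2.0}]
sims = naive_column_similarities(rows)
```

A row with L non-zeros emits on the order of L² products, so the amount of emitted data grows with both the number of rows and the square of the row density.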
Naïve approach: analysis
DIMSUM Sampling
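The sampling rule itself is on a slide image that this transcript does not include. The sketch below follows one common formulation of DIMSUM-style sampling and should be read as an assumption rather than the slide's exact statement. A column with norm c is kept with probability min(1, √γ / c) and its values are divided by min(√γ, c), which keeps each estimated cosine similarity unbiased. The parameter name `gamma`, the `seed` argument and the function name are illustrative.

```python
import math
import random
from collections import defaultdict

def dimsum_similarities(rows, gamma, seed=0):
    """Sampled estimate of cosine similarity between column pairs.

    rows: list of sparse rows, each a dict {column index: non-zero value}.
    gamma: oversampling parameter; larger means more pairs emitted, less variance.
    """
    rng = random.Random(seed)
    sq = defaultdict(float)
    for row in rows:
        for j, value in row.items():
            sq[j] += value * value
    norms = {j: math.sqrt(s) for j, s in sq.items()}
    sg = math.sqrt(gamma)
    keep = {j: min(1.0, sg / norms[j]) for j in norms}   # sampling probability per column
    scale = {j: min(sg, norms[j]) for j in norms}        # divisor that keeps the estimate unbiased
    sims = defaultdict(float)
    for row in rows:
        cols = sorted(row)
        for a, j in enumerate(cols):
            if rng.random() >= keep[j]:
                continue                                  # heavy columns are skipped more often
            for k in cols[a + 1:]:
                if rng.random() < keep[k]:
                    sims[(j, k)] += (row[j] / scale[j]) * (row[k] / scale[k])
    return dict(sims)

rows = [{0: 1.0, 1: 1.0}, {0: 1.0, 2: 2.0}, {1: 1.0, 2: 2.0}]
exact = dimsum_similarities(rows, gamma=1e9)   # gamma this large keeps every pair
```

With a very large `gamma` nothing is dropped and the result equals the exact cosine similarities. Lowering `gamma` emits fewer products for high-norm columns at the cost of a noisier estimate.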
DIMSUM Analysis
Spark implementation
Ongoing Work in MLlib
stats library (e.g. stratified sampling, ScaRSR)
ADMM
LDA
General Convex Optimization
MLlib + {Streaming, GraphX, SQL}
MLlib + Streaming
As of Spark 1.1, you can train linear models in a streaming fashion
Model weights are updated via SGD, thus amenable to streaming
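Why SGD suits streaming can be seen in a few lines of plain Python: each arriving mini-batch triggers one weight update, and no earlier data has to be kept. This is a sketch under assumptions (least-squares linear regression, a fixed step size, and a hypothetical deterministic stream generated by `batches`), not the MLlib streaming API.

```python
def sgd_update(weights, batch, step=0.1):
    """One SGD step of least-squares linear regression on a mini-batch of (features, label)."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xi in enumerate(x):
            grad[j] += error * xi
    return [w - step * g / len(batch) for w, g in zip(weights, grad)]

def batches(n_batches):
    """Hypothetical stream of mini-batches with labels y = 2 * x1 - 1 * x2."""
    for t in range(n_batches):
        batch = []
        for i in range(4):
            x = [((t + i) % 5) - 2.0, ((2 * t + 3 * i) % 7) - 3.0]
            batch.append((x, 2.0 * x[0] - 1.0 * x[1]))
        yield batch

weights = [0.0, 0.0]
for batch in batches(2000):
    weights = sgd_update(weights, batch)   # the model is usable after every batch
```

The state carried between batches is just the weight vector, which is the property that decision trees lack.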
More work needed for decision trees
MLlib + SQL
points = context.sql("select latitude, longitude from tweets")
model = KMeans.train(points, 10)
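For readers without a Spark installation, the clustering step can be approximated by a plain Python version of Lloyd's iteration. This is only a sketch of what a k-means trainer computes: it takes the first k points as starting centers, whereas MLlib lists k-means|| as its algorithm. The six (latitude, longitude) pairs are made up for the example.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm with a naive initialization (the first k points)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean distance)
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:   # keep the old center if a cluster ends up empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers

# Hypothetical (latitude, longitude) pairs forming two obvious groups
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centers = kmeans(points, k=2)
```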
MLlib + GraphX
Future of MLlib
General Linear Algebra
CoordinateMatrix
RowMatrix
BlockMatrix
Local and distributed versions. Operations in-between.
Goal: version 1.2
Goal: version 1.3
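The three layouts differ only in how the same entries are grouped. The sketch below is a conceptual illustration in plain Python, not the MLlib classes: coordinate entries are (i, j, value) triples, the row layout groups them by row index, and the block layout groups them into fixed-size square tiles. The helper names `to_rows` and `to_blocks` are hypothetical.

```python
from collections import defaultdict

def to_rows(entries, n_cols):
    """Coordinate entries (i, j, value) -> {row index: dense row}."""
    rows = defaultdict(lambda: [0.0] * n_cols)
    for i, j, value in entries:
        rows[i][j] = value
    return dict(rows)

def to_blocks(entries, block_size):
    """Coordinate entries -> {(block row, block col): dense square block}."""
    blocks = defaultdict(lambda: [[0.0] * block_size for _ in range(block_size)])
    for i, j, value in entries:
        block = blocks[(i // block_size, j // block_size)]
        block[i % block_size][j % block_size] = value
    return dict(blocks)

entries = [(0, 0, 1.0), (1, 3, 2.0), (3, 2, 3.0)]
rows = to_rows(entries, n_cols=4)
blocks = to_blocks(entries, block_size=2)
```

Row grouping suits tall and skinny matrices whose rows fit in memory, while block grouping keeps both dimensions bounded per piece.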
Research Goal: General Convex Optimization
Distribute CVX by backing CVXPY with PySpark
Easy-to-express distributable convex programs
Need to know less math to optimize complicated objectives
Spark and ML
Spark has all its roots in research, so we hope to keep incorporating new ideas!