Parallel Machine Learning: DSGD and SystemML
TRANSCRIPT
Big Data Analytics Seminar - Parallel Machine Learning
Janani Chakkaradhari, 09/01/2014
Parallel Machine Learning 2
Outline
• Parallelism
• Machine Learning
• The computational engine for ML
• Large Scale Matrix Factorization with DSGD
• Overhead in parallelizing ML algorithms
• Declarative Machine Learning: SystemML
• Summary
• References
Parallelism
• Parallel processing: processing multiple tasks simultaneously on multiple processors
• Parallel programming: programming a multiprocessor system using a divide-and-conquer technique; the work is shared across processors, giving higher computing power
Machine Learning
• Types
  • Supervised learning - teach the machine
  • Unsupervised learning - let it learn by itself
• Cite: Machine Learning on Big Data REF[7]
Figure: data (e.g. ratings) is fed into a learning model
The computational engine for ML
• Why so?
• The best way to see why is through examples.
Example - Supervised Learning
Figure: scatter plot of house prices (in $1000s, 0-600) against living area (1000-3500 square feet)
We want to predict the price of other houses as a function of the size of their living areas.
Example - Supervised Learning
• We wish to infer the mapping implied by the data
• Learn a function h : X → Y such that h(x) is a good predictor for the corresponding value of y
• Approximate y as a linear function of x: h_θ(x) = θ₀ + θ₁x
Figure: the fitted line drawn over the price vs. living-area scatter plot
Linear Algebra - The computational engine for ML
• Linear Discriminant Analysis (LDA)
  • Finds a linear combination of features
  • Linear classifier
• Principal Component Analysis
  • Covariance matrix
  • Eigenvalues and eigenvectors (what's stable under repetitions of a linear transform)
• PageRank
  • Eigenvalues and eigenvectors
• Recommender systems, topic modeling
  • Non-negative matrix factorization
• And more . . .
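The "stable under repetitions of a linear transform" idea behind PCA and PageRank can be seen in a few lines: repeatedly applying a matrix to a vector converges to its dominant eigenvector. A minimal sketch (the matrix values and helper name are illustrative, not from the slides):

```python
import numpy as np

# Power iteration: repeatedly applying a linear transform A reveals its
# dominant eigenvector -- the direction that stays "stable" under the
# transform. PageRank applies the same idea to a link matrix.

def power_iteration(A, iters=100):
    v = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)          # renormalize each step
    eigenvalue = v @ A @ v              # Rayleigh quotient
    return eigenvalue, v

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # symmetric toy matrix, eigenvalues 3 and 1
lam, v = power_iteration(A)
print(lam)   # ≈ 3.0, the dominant eigenvalue
```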
Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent
Outline
• Introduction to Gradient Descent
• Matrix Factorization
• SGD for Matrix Factorization
• DSGD Matrix Factorization Algorithm
Example – Supervised Learning (contd.)
• The value of h_θ(x) should be approximately equal to y, at least for the example data
• So, the values of θ should be carefully chosen
• To measure how close h_θ(x⁽ⁱ⁾) is to the corresponding y⁽ⁱ⁾ for each value of i, define the cost (or loss) function as
  J(θ) = ½ Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
• Goal: minimize the cost function and update θ
  (h_θ(x⁽ⁱ⁾) is the predicted value, y⁽ⁱ⁾ the actual value)
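The cost function above can be computed directly. A minimal sketch with made-up living-area/price numbers (not the data from the slides):

```python
import numpy as np

# Cost function for linear regression, J(theta) = 1/2 * sum((h(x) - y)^2),
# with the linear hypothesis h_theta(x) = theta0 + theta1 * x.

def hypothesis(theta, x):
    return theta[0] + theta[1] * x

def cost(theta, X, y):
    residuals = hypothesis(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

X = np.array([1000.0, 2000.0, 3000.0])   # living area (sq ft), illustrative
y = np.array([200.0, 400.0, 600.0])      # price ($1000s), illustrative

print(cost(np.array([0.0, 0.2]), X, y))  # near-perfect fit -> ~0
print(cost(np.array([0.0, 0.1]), X, y))  # worse theta -> much larger cost
```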
Gradient Descent
Figure: Intuition behind Gradient Descent REF[5]
Parallel Machine Learning 14
Gradient Descent (aka Batch Gradient Descent)
• Starts by assigning some random value to θ
• Repeatedly updates θ to make J(θ) smaller until it converges
• Since we are changing the values of θ, the rate of change is expressed with a partial derivative; the update rule is
  θ_j := θ_j − α ∂J(θ)/∂θ_j
• Produces a smoother cost-function curve, but the computation is costly (every update scans all training data)
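The batch update rule can be sketched as follows; the learning rate, toy data, and helper name are illustrative assumptions, not from the slides:

```python
import numpy as np

# Batch gradient descent for the linear-regression cost above:
# every update theta_j := theta_j - alpha * dJ/dtheta_j sums the
# gradient over ALL training examples (hence "batch").

def batch_gradient_descent(X, y, alpha=0.05, epochs=200):
    theta = np.zeros(2)                       # [intercept, slope]
    for _ in range(epochs):
        pred = theta[0] + theta[1] * X
        err = pred - y
        grad0 = np.sum(err)                   # dJ/dtheta0
        grad1 = np.sum(err * X)               # dJ/dtheta1
        theta -= alpha * np.array([grad0, grad1])
    return theta

X = np.array([0.0, 1.0, 2.0, 3.0])            # scaled feature
y = 2.0 * X + 1.0                             # true line: y = 1 + 2x
theta = batch_gradient_descent(X, y)
print(theta)   # converges toward [1.0, 2.0]
```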
Stochastic Gradient Descent
• An iterative stochastic optimization algorithm
• For each training example (x⁽ⁱ⁾, y⁽ⁱ⁾), the parameter θ gets updated independently of the other data!
• Produces a noisy cost-function curve, but is very fast and easy to implement
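For contrast with the batch version, the stochastic variant updates θ after each example. Same toy setup as above; the hyperparameters are illustrative:

```python
import numpy as np

# Stochastic gradient descent: instead of summing the gradient over the
# whole training set, update theta after EACH example. Updates are noisy
# but cheap.

def sgd(X, y, alpha=0.05, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):     # visit examples in random order
            err = theta[0] + theta[1] * X[i] - y[i]
            theta -= alpha * np.array([err, err * X[i]])  # single-example gradient
    return theta

X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * X + 1.0
theta = sgd(X, y)
print(theta)   # noisy path, but ends near [1.0, 2.0]
```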
Matrix Factorization
Figure: an original image, a partially observed version of it, and the reconstructed image obtained from a low-rank approximation
Matrix Factorization
            Avatar   The Matrix   Up
Alice         _          4         2
Bob           3          2         _
Charlie       5          _         3
Matrix Factorization
Factor the ratings into one latent factor per user (left column) and per movie (top row); each predicted rating is their product:

              Avatar    The Matrix    Up
               2.24        1.92      1.18
Alice  1.98     _        4 | 3.8   2 | 2.3
Bob    1.21   3 | 2.7    2 | 2.3      _
Charlie 2.30  5 | 5.2       _      3 | 2.7

(each cell shows observed rating | predicted rating)
The factors are chosen to minimize the loss function over the observed entries.
Matrix Factorization
The learned factors also predict the missing entries:

              Avatar    The Matrix    Up
               2.24        1.92      1.18
Alice  1.98   _ | 4.4    4 | 3.8   2 | 2.3
Bob    1.21   3 | 2.7    2 | 2.3   _ | 1.4
Charlie 2.30  5 | 5.2    _ | 4.4   3 | 2.7

The loss at each observed element is the squared difference between the observed and the predicted rating; in practice, bias and regularization terms are added.
Problem Definition
• Given an input matrix V (m × n) with rank r
• Find the best model V ≈ W H, where
  • W (m × r) is the user feature matrix / latent user factors
  • H (r × n) is the movie feature matrix / latent item factors
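The slide's toy factorization can be checked numerically: with one latent factor per user and per movie, the predicted rating matrix is simply the product of the user-factor column and the movie-factor row (factor values taken from the table above):

```python
import numpy as np

# Rank-1 factorization check for the toy example: the predicted rating
# matrix is the product W H of the user factors and the movie factors.

W = np.array([[1.98], [1.21], [2.30]])        # Alice, Bob, Charlie
H = np.array([[2.24, 1.92, 1.18]])            # Avatar, The Matrix, Up

pred = W @ H
print(np.round(pred, 1))
# [[4.4 3.8 2.3]
#  [2.7 2.3 1.4]
#  [5.2 4.4 2.7]]
```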
SGD for Matrix Factorization
SGD for Matrix Factorization
• As we know, SGD steps depend on each other
• But not all steps are dependent
• Interchangeable: two training points are interchangeable with respect to any loss function L having summation form if they share neither row nor column
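The interchangeability test is just a row/column comparison. A minimal sketch (the function name is illustrative):

```python
# Two training points (i, j) and (i', j') of the ratings matrix are
# interchangeable w.r.t. a loss in summation form when they share
# neither a row nor a column: their SGD updates touch disjoint rows of W
# and columns of H, so they can run in either order (or in parallel).

def interchangeable(p1, p2):
    (i1, j1), (i2, j2) = p1, p2
    return i1 != i2 and j1 != j2

print(interchangeable((0, 1), (2, 2)))   # True: different user, different movie
print(interchangeable((0, 1), (0, 2)))   # False: same user row
```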
DSGD Matrix Factorization Algorithm
• Loss in summation form: L(W, H) = Σ over observed (i, j) of l(V_ij, W_i*, H_*j)
• Goal – minimize the loss function value and update the (W, H) factor matrices
Figure: the input matrix partitioned into blocks across Node 1, Node 2 and Node 3
DSGD Matrix Factorization Algorithm
• Select a subset of the blocks
• Run SGD on each block independently, and then sum up the results
Figure: selected blocks processed on Node 1, Node 2 and Node 3
DSGD Matrix Factorization Algorithm
• Select a subset of the blocks (for example the block diagonal)
• Run SGD on each block independently, and then sum up the results
Figure: the diagonal blocks assigned to Node 1, Node 2 and Node 3
DSGD Matrix Factorization Algorithm
• How to get the set of interchangeable sub-matrices?
• Answer: permutations! E.g. the 6 permutations of 3 balls
Figure: The 6 permutations of 3 balls REF[9]
Stratified SGD
• How to get the set of interchangeable sub-matrices?
• Randomly permute the rows and columns of Z, and then create d × d blocks of size (m/d) × (n/d) each
Figure: Stratified SGD – REF[3]
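The permute-then-block construction can be sketched as follows (a simplification of REF[3]; the helper name and block layout are illustrative):

```python
import numpy as np

# Stratification sketch: randomly permute the rows and columns of the
# m x n training matrix Z, then cut it into a d x d grid of blocks of
# size (m/d) x (n/d). Blocks that share neither a block-row nor a
# block-column are interchangeable, so a stratum such as a block
# diagonal can be processed in parallel.

def stratify(Z, d, seed=0):
    rng = np.random.default_rng(seed)
    m, n = Z.shape
    Zp = Z[rng.permutation(m)][:, rng.permutation(n)]   # permute rows, then columns
    bm, bn = m // d, n // d
    return [[Zp[bi*bm:(bi+1)*bm, bj*bn:(bj+1)*bn]
             for bj in range(d)] for bi in range(d)]

Z = np.arange(36.0).reshape(6, 6)
blocks = stratify(Z, d=3)
print(len(blocks), len(blocks[0]), blocks[0][0].shape)   # 3 3 (2, 2)
```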
DSGD Matrix Factorization Algorithm – MapReduce
• Divide the processing into d independent map tasks
• Each task takes a block of V, together with the corresponding blocks of W and H, as input
• Runs SGD to find a local minimum of the loss function and updates the factor matrices
• The updated factor blocks, obtained by running sequential SGD on the training sequence (stratum), are collected to form the new W and H
Overhead in Parallelizing ML Algorithms
• Cost of implementing a large class of ML algorithms as low-level MapReduce jobs
• Each individual MapReduce job in an ML algorithm has to be hand-coded
• For better performance, the actual execution plan for the same ML algorithm has to be hand-tuned for different input and cluster sizes
Challenge in Optimization
• Matrix multiplication can be computed in two ways. Which one to choose?

  [2 1] [1]   [ 4]
  [8 7] [2] = [22]

• Way 1 – inner products, one per result entry: 2×(1) + 1×(2) = 4 and 8×(1) + 7×(2) = 22
• Way 2 – a linear combination of the columns: 1×[2 8]ᵀ + 2×[1 7]ᵀ = [4 22]ᵀ
Challenge in Optimization – MR
The choice of RMM (replication-based matrix multiplication) or CPMM (cross-product-based matrix multiplication) depends on the characteristics of the matrices involved in the multiplication
REF[4]
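The two execution strategies can be mimicked in a few lines on the slide's 2×2 example; this is only an analogy for the data flow of RMM and CPMM, not SystemML's actual implementation:

```python
import numpy as np

# RMM-style: one pass computing an inner product per result entry.
# CPMM-style: a cross product per shared dimension k (an outer product
# of column k of A with row k of B), then a sum over k.

A = np.array([[2.0, 1.0],
              [8.0, 7.0]])
B = np.array([[1.0],
              [2.0]])

# inner products, one per result entry
rmm = np.array([[A[i] @ B[:, j] for j in range(B.shape[1])]
                for i in range(A.shape[0])])

# sum of outer products over the shared index k
cpmm = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

print(rmm.ravel(), cpmm.ravel())   # both give [ 4. 22.]
```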
Declarative Machine Learning on MapReduce
Outline
• SystemML Architecture
• SystemML Components
• Matrix Block Representation
• Piggybacking
SystemML
• SystemML: originated from IBM Almaden and Watson Research Center
• Machine learning algorithms are expressed in a high-level language (DML – Declarative Machine Learning Language, with syntax similar to R)
• E.g. DML expresses the transpose of a matrix X as t(X), as in R
• So, the user can focus on writing scripts that answer "What?" and not "How?"
• Covers linear statistical models, PCA, PageRank, matrix factorization, iterative algorithms (while and for loops), and so on
• Scales to very large datasets and efficiently tunes performance
SystemML Architecture
Figure: SystemML architecture REF[10]
Matrix Factorization - DML Script REF[4]
SystemML – Program Analysis
• Break DML scripts into smaller units called statement blocks
REF[4]
SystemML Components
SystemML program analysis: for each statement block, do
1. High-level operator (HOP) component analysis
2. Low-level operator (LOP) component analysis
3. Runtime
REF[4]
High-Level Operator Component (HOP)
• Input: statement blocks
• Output: high-level execution plan (HOP DAGs)
• Action: HOPs represent the basic operations on matrices and scalars (an operation or transformation); HOPs are instantiated from the parsed DML representation
• Optimizations: algebraic rewrites, selection of the physical representation for intermediate matrices, and cost-based optimizations
REF[4]
Low-Level Operator Component (LOP)
• Input: high-level execution plan (HOP DAGs)
• Output: low-level execution plan (LOP DAGs)
• Action: LOPs represent the basic operations in the MapReduce environment
REF[4]
Runtime
• Matrices as key-value pairs
• Block representation of matrices (using a block operation)
• Generic MapReduce job
  • Main execution engine in SystemML
  • Instantiated by the piggybacking algorithm (multiple LOPs inside a single MR job)
• Control module
  • Orchestrates the execution of the instantiated MapReduce jobs for a DML script
• Multiple optimizations are performed in the runtime component (decided dynamically based on data characteristics)
Example Component Analysis for DML Script in SystemML –REF[4]
Blocking
Figure: Representation of blocking – REF[4]
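The key-value block layout can be sketched as follows (an illustrative layout only; SystemML's actual block format differs in detail):

```python
import numpy as np

# Blocking sketch: a matrix is stored on MapReduce as key-value pairs,
# keyed by block index, with each value a small dense sub-matrix.

def to_blocks(M, bsize):
    m, n = M.shape
    return {(bi, bj): M[bi*bsize:(bi+1)*bsize, bj*bsize:(bj+1)*bsize]
            for bi in range(m // bsize) for bj in range(n // bsize)}

M = np.arange(16.0).reshape(4, 4)
blocks = to_blocks(M, bsize=2)
print(sorted(blocks))            # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(blocks[(1, 0)])            # the bottom-left 2x2 block
```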
Piggybacking
Figure: two packings of the same LOP DAG (Data W → transform → mmcj → group → Aggr.(+)). After a topological sort, the LOPs are assigned to the map (M), reduce (R), or map-and-reduce (M&R) phases of as few MapReduce jobs as possible
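The greedy packing idea can be sketched with a toy model in which each LOP declares the phase it runs in; this is a loose simplification of SystemML's piggybacking algorithm, not its actual implementation:

```python
# Piggybacking sketch: after topologically sorting the LOP DAG, greedily
# pack consecutive operators into a single MapReduce job until an
# operator needs a phase the current job has already passed (here: a
# second shuffle forces a new job).

def piggyback(sorted_lops):
    jobs, current, shuffled = [], [], False
    for name, phase in sorted_lops:       # phase: "map", "shuffle", or "reduce"
        if phase == "shuffle" and shuffled:
            jobs.append(current)          # start a new MR job
            current, shuffled = [], False
        current.append(name)
        shuffled = shuffled or phase in ("shuffle", "reduce")
    if current:
        jobs.append(current)
    return jobs

lops = [("Data W", "map"), ("transform", "map"), ("mmcj", "shuffle"),
        ("group", "shuffle"), ("Aggr(+)", "reduce")]
print(piggyback(lops))   # two jobs: the second shuffle (group) starts a new one
```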
Summary
• Linear algebra is the computational engine for machine learning algorithms
• Gradient descent and SGD
• DSGD: a distributed matrix factorization algorithm that can efficiently handle web-scale matrices
• Overhead in parallelizing ML algorithms
• Declarative ML on MapReduce – IBM SystemML
References
[1] Parallel Computing at a Glance, http://www.buyya.com/microkernel/chap1.pdf
[2] CS229 Lecture Notes, Andrew Ng, http://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf
[3] Gemulla, Rainer, et al. "Large-scale matrix factorization with distributed stochastic gradient descent." Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011.
[4] Ghoting, Amol, et al. "SystemML: Declarative machine learning on MapReduce." Data Engineering (ICDE), 2011 IEEE 27th International Conference on. IEEE, 2011.
References (contd.)
[5] http://www.quora.com/Machine-Learning/
[6] http://web.cs.wpi.edu/~cs525/f13b-EAR//cs525-homepage/lectures/lectures-papers/Large-ScaleMatrixFactorization-ppt.pdf
[7] Neoklis Polyzotis (UCSC, Google), Tyson Condie (UCLA, Microsoft), Markus Weimer (Microsoft), Machine Learning on Big Data
[8] http://www.cliffsnotes.com/math/algebra/linear-algebra/real-euclidean-vector-spaces/the-rank-of-a-matrix
[9] http://en.wikipedia.org/wiki/Permutation
[10] https://www-950.ibm.com/events/wwe/grp/grp004.nsf/vLookupPDFs/Bruce%20Brown%20-%20BigInsights-1-16-12-external/$file/Bruce%20Brown%20-%20BigInsights-1-16-12-external.pdf