Parallel Machine Learning: DSGD and SystemML
TRANSCRIPT
Big Data Analytics Seminar - Parallel Machine Learning
Janani Chakkaradhari, 09/01/2014
Parallel Machine Learning 2
Outline
• Parallelism
• Machine Learning
• The computational engine for ML
• Large Scale Matrix Factorization with DSGD
• Overhead in parallelizing ML algorithms
• Declarative Machine Learning: SystemML
• Summary
• References
Parallelism
• Parallel processing: processing multiple tasks simultaneously on multiple processors
• Parallel programming: programming a multiprocessor system using a divide-and-conquer technique; the work is shared across processors, giving higher computing power
Machine Learning
• Types
  • Supervised learning - teach the machine
  • Unsupervised learning - let it learn by itself
• Cite: Machine Learning on Big Data REF[7]
Figure: data (e.g. ratings) is fed into a learning model
The computational engine for ML
• Why so?
• The best way to see why is through examples.
Example - Supervised Learning
Figure: scatter plot of house prices (in $1000s, 0-600) against living area (1000-3500 square feet)
We want to predict the price of other houses as a function of the size of their living areas.
Example - Supervised Learning
• We wish to infer the mapping implied by the data
• Learn a function h : X → Y such that h(x) is a good predictor for the corresponding value of y
• Approximate y as a linear function of x: h_θ(x) = θ₀ + θ₁x
Figure: the fitted line drawn over the price vs. living-area scatter plot
Linear Algebra - The computational engine for ML
• Linear Discriminant Analysis (LDA)
  • Finds a linear combination of features
  • Linear classifier
• Principal Component Analysis
  • Covariance matrix
  • Eigenvalues and eigenvectors (what's stable under repetitions of a linear transform)
• PageRank
  • Eigenvalues and eigenvectors
• Recommender systems, topic modeling
  • Non-negative matrix factorization
• And more . . .
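The "stable under repetitions of a linear transform" idea behind PCA and PageRank can be seen in a few lines: repeatedly applying a matrix to a vector converges to its dominant eigenvector. A minimal sketch (the matrix values and helper name are illustrative, not from the slides):

```python
import numpy as np

# Power iteration: repeatedly applying a linear transform A reveals its
# dominant eigenvector -- the direction that stays "stable" under the
# transform. PageRank applies the same idea to a link matrix.

def power_iteration(A, iters=100):
    v = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)          # renormalize each step
    eigenvalue = v @ A @ v              # Rayleigh quotient
    return eigenvalue, v

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])              # symmetric toy matrix, eigenvalues 3 and 1
lam, v = power_iteration(A)
print(lam)   # ≈ 3.0, the dominant eigenvalue
```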
Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent
Outline
• Introduction to Gradient Descent
• Matrix Factorization
• SGD for Matrix Factorization
• DSGD Matrix Factorization Algorithm
Example – Supervised Learning (contd.)
• The value of h_θ(x) should be approximately equal to y, at least for the example data
• So, the values of θ should be carefully chosen
• To measure how close h_θ(x⁽ⁱ⁾) is to the corresponding y⁽ⁱ⁾ for each value of i, define the cost (or loss) function as
  J(θ) = ½ Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
• Goal: minimize the cost function and update θ
  (h_θ(x⁽ⁱ⁾) is the predicted value, y⁽ⁱ⁾ the actual value)
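The cost function above can be computed directly. A minimal sketch with made-up living-area/price numbers (not the data from the slides):

```python
import numpy as np

# Cost function for linear regression, J(theta) = 1/2 * sum((h(x) - y)^2),
# with the linear hypothesis h_theta(x) = theta0 + theta1 * x.

def hypothesis(theta, x):
    return theta[0] + theta[1] * x

def cost(theta, X, y):
    residuals = hypothesis(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

X = np.array([1000.0, 2000.0, 3000.0])   # living area (sq ft), illustrative
y = np.array([200.0, 400.0, 600.0])      # price ($1000s), illustrative

print(cost(np.array([0.0, 0.2]), X, y))  # near-perfect fit -> ~0
print(cost(np.array([0.0, 0.1]), X, y))  # worse theta -> much larger cost
```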
Gradient Descent
Figure: Intuition behind Gradient Descent REF[5]
Parallel Machine Learning 14
Gradient Descent (aka Batch Gradient Descent)
• Starts by assigning some random value to θ
• Repeatedly updates θ to make J(θ) smaller until it converges
• Since we are changing the values of θ, the rate of change is expressed with a partial derivative; the update rule is
  θ_j := θ_j − α ∂J(θ)/∂θ_j
• Produces a smoother cost-function curve, but the computation is costly (every update scans all training data)
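The batch update rule can be sketched as follows; the learning rate, toy data, and helper name are illustrative assumptions, not from the slides:

```python
import numpy as np

# Batch gradient descent for the linear-regression cost above:
# every update theta_j := theta_j - alpha * dJ/dtheta_j sums the
# gradient over ALL training examples (hence "batch").

def batch_gradient_descent(X, y, alpha=0.05, epochs=200):
    theta = np.zeros(2)                       # [intercept, slope]
    for _ in range(epochs):
        pred = theta[0] + theta[1] * X
        err = pred - y
        grad0 = np.sum(err)                   # dJ/dtheta0
        grad1 = np.sum(err * X)               # dJ/dtheta1
        theta -= alpha * np.array([grad0, grad1])
    return theta

X = np.array([0.0, 1.0, 2.0, 3.0])            # scaled feature
y = 2.0 * X + 1.0                             # true line: y = 1 + 2x
theta = batch_gradient_descent(X, y)
print(theta)   # converges toward [1.0, 2.0]
```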
Stochastic Gradient Descent
• An iterative stochastic optimization algorithm
• For each training example (x⁽ⁱ⁾, y⁽ⁱ⁾), the parameter θ gets updated independently of the other data!
• Produces a noisy cost-function curve, but is very fast and easy to implement
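For contrast with the batch version, the stochastic variant updates θ after each example. Same toy setup as above; the hyperparameters are illustrative:

```python
import numpy as np

# Stochastic gradient descent: instead of summing the gradient over the
# whole training set, update theta after EACH example. Updates are noisy
# but cheap.

def sgd(X, y, alpha=0.05, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):     # visit examples in random order
            err = theta[0] + theta[1] * X[i] - y[i]
            theta -= alpha * np.array([err, err * X[i]])  # single-example gradient
    return theta

X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * X + 1.0
theta = sgd(X, y)
print(theta)   # noisy path, but ends near [1.0, 2.0]
```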
Matrix Factorization
Figure: an original image, a partially observed version of it, and the reconstructed image obtained from a low-rank approximation
Matrix Factorization
            Avatar   The Matrix   Up
Alice         _          4         2
Bob           3          2         _
Charlie       5          _         3
Matrix Factorization
Factor the ratings into one latent factor per user (left column) and per movie (top row); each predicted rating is their product:

              Avatar    The Matrix    Up
               2.24        1.92      1.18
Alice  1.98     _        4 | 3.8   2 | 2.3
Bob    1.21   3 | 2.7    2 | 2.3      _
Charlie 2.30  5 | 5.2       _      3 | 2.7

(each cell shows observed rating | predicted rating)
The factors are chosen to minimize the loss function over the observed entries.
Matrix Factorization
The learned factors also predict the missing entries:

              Avatar    The Matrix    Up
               2.24        1.92      1.18
Alice  1.98   _ | 4.4    4 | 3.8   2 | 2.3
Bob    1.21   3 | 2.7    2 | 2.3   _ | 1.4
Charlie 2.30  5 | 5.2    _ | 4.4   3 | 2.7

The loss at each observed element is the squared difference between the observed and the predicted rating; in practice, bias and regularization terms are added.
Problem Definition
• Given an input matrix V (m × n) with rank r
• Find the best model V ≈ W H, where
  • W (m × r) is the user feature matrix / latent user factors
  • H (r × n) is the movie feature matrix / latent item factors
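The slide's toy factorization can be checked numerically: with one latent factor per user and per movie, the predicted rating matrix is simply the product of the user-factor column and the movie-factor row (factor values taken from the table above):

```python
import numpy as np

# Rank-1 factorization check for the toy example: the predicted rating
# matrix is the product W H of the user factors and the movie factors.

W = np.array([[1.98], [1.21], [2.30]])        # Alice, Bob, Charlie
H = np.array([[2.24, 1.92, 1.18]])            # Avatar, The Matrix, Up

pred = W @ H
print(np.round(pred, 1))
# [[4.4 3.8 2.3]
#  [2.7 2.3 1.4]
#  [5.2 4.4 2.7]]
```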
SGD for Matrix Factorization
SGD for Matrix Factorization
• As we know, SGD steps depend on each other
• But not all steps are dependent
• Interchangeable: two training points are interchangeable with respect to any loss function L having summation form if they share neither row nor column
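The interchangeability test is just a row/column comparison. A minimal sketch (the function name is illustrative):

```python
# Two training points (i, j) and (i', j') of the ratings matrix are
# interchangeable w.r.t. a loss in summation form when they share
# neither a row nor a column: their SGD updates touch disjoint rows of W
# and columns of H, so they can run in either order (or in parallel).

def interchangeable(p1, p2):
    (i1, j1), (i2, j2) = p1, p2
    return i1 != i2 and j1 != j2

print(interchangeable((0, 1), (2, 2)))   # True: different user, different movie
print(interchangeable((0, 1), (0, 2)))   # False: same user row
```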
DSGD Matrix Factorization Algorithm
• Loss in summation form: L(W, H) = Σ over observed (i, j) of l(V_ij, W_i*, H_*j)
• Goal – minimize the loss function value and update the (W, H) factor matrices
Figure: the input matrix partitioned into blocks across Node 1, Node 2 and Node 3
DSGD Matrix Factorization Algorithm
• Select a subset of the blocks
• Run SGD on each block independently, and then sum up the results
Figure: selected blocks processed on Node 1, Node 2 and Node 3
DSGD Matrix Factorization Algorithm
• Select a subset of the blocks (for example the block diagonal)
• Run SGD on each block independently, and then sum up the results
Figure: the diagonal blocks assigned to Node 1, Node 2 and Node 3
DSGD Matrix Factorization Algorithm
• How to get the set of interchangeable sub-matrices?
• Answer: permutations! E.g. the 6 permutations of 3 balls
Figure: The 6 permutations of 3 balls REF[9]
Stratified SGD
• How to get the set of interchangeable sub-matrices?
• Randomly permute the rows and columns of Z, and then create d × d blocks of size (m/d) × (n/d) each
Figure: Stratified SGD – REF[3]
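The permute-then-block construction can be sketched as follows (a simplification of REF[3]; the helper name and block layout are illustrative):

```python
import numpy as np

# Stratification sketch: randomly permute the rows and columns of the
# m x n training matrix Z, then cut it into a d x d grid of blocks of
# size (m/d) x (n/d). Blocks that share neither a block-row nor a
# block-column are interchangeable, so a stratum such as a block
# diagonal can be processed in parallel.

def stratify(Z, d, seed=0):
    rng = np.random.default_rng(seed)
    m, n = Z.shape
    Zp = Z[rng.permutation(m)][:, rng.permutation(n)]   # permute rows, then columns
    bm, bn = m // d, n // d
    return [[Zp[bi*bm:(bi+1)*bm, bj*bn:(bj+1)*bn]
             for bj in range(d)] for bi in range(d)]

Z = np.arange(36.0).reshape(6, 6)
blocks = stratify(Z, d=3)
print(len(blocks), len(blocks[0]), blocks[0][0].shape)   # 3 3 (2, 2)
```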
DSGD Matrix Factorization Algorithm – MapReduce
• Divide the processing into d independent map tasks
• Each task takes a block of V, together with the corresponding blocks of W and H, as input
• Runs SGD to find a local minimum of the loss function and updates the factor matrices
• The updated factor blocks, obtained by running sequential SGD on the training sequence (stratum), are collected to form the new W and H
Overhead in Parallelizing ML Algorithms
• Cost of implementing a large class of ML algorithms as low-level MapReduce jobs
• Each individual MapReduce job in an ML algorithm has to be hand-coded
• For better performance, the actual execution plan for the same ML algorithm has to be hand-tuned for different input and cluster sizes
Challenge in Optimization
• Matrix multiplication can be computed in two ways. Which one to choose?

  [2 1] [1]   [ 4]
  [8 7] [2] = [22]

• Way 1 – inner products, one per result entry: 2×(1) + 1×(2) = 4 and 8×(1) + 7×(2) = 22
• Way 2 – a linear combination of the columns: 1×[2 8]ᵀ + 2×[1 7]ᵀ = [4 22]ᵀ
Challenge in Optimization – MR
The choice of RMM (replication-based matrix multiplication) or CPMM (cross-product-based matrix multiplication) depends on the characteristics of the matrices involved in the multiplication
REF[4]
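The two execution strategies can be mimicked in a few lines on the slide's 2×2 example; this is only an analogy for the data flow of RMM and CPMM, not SystemML's actual implementation:

```python
import numpy as np

# RMM-style: one pass computing an inner product per result entry.
# CPMM-style: a cross product per shared dimension k (an outer product
# of column k of A with row k of B), then a sum over k.

A = np.array([[2.0, 1.0],
              [8.0, 7.0]])
B = np.array([[1.0],
              [2.0]])

# inner products, one per result entry
rmm = np.array([[A[i] @ B[:, j] for j in range(B.shape[1])]
                for i in range(A.shape[0])])

# sum of outer products over the shared index k
cpmm = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

print(rmm.ravel(), cpmm.ravel())   # both give [ 4. 22.]
```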
Declarative Machine Learning on MapReduce
Outline
• SystemML Architecture
• SystemML Components
• Matrix Block Representation
• Piggybacking
SystemML
• SystemML: originated from IBM Almaden and Watson Research Center
• Machine learning algorithms are expressed in a high-level language (DML – Declarative Machine Learning Language, with syntax similar to R)
• E.g. DML expresses the transpose of a matrix X as t(X), as in R
• So, the user can focus on writing scripts that answer "What?" and not "How?"
• Covers linear statistical models, PCA, PageRank, matrix factorization, iterative algorithms (while and for loops), and so on
• Scales to very large datasets and efficiently tunes performance
SystemML Architecture
Figure: SystemML architecture REF[10]
Matrix Factorization - DML Script REF[4]
SystemML – Program Analysis
• Break DML scripts into smaller units called statement blocks
REF[4]
SystemML Components
SystemML program analysis: for each statement block, do
1. High-level operator (HOP) component analysis
2. Low-level operator (LOP) component analysis
3. Runtime
REF[4]
High-Level Operator Component (HOP)
• Input: statement blocks
• Output: high-level execution plan (HOP DAGs)
• Action: HOPs represent the basic operations on matrices and scalars (an operation or transformation); HOPs are instantiated from the parsed DML representation
• Optimizations: algebraic rewrites, selection of the physical representation for intermediate matrices, and cost-based optimizations
REF[4]
Low-Level Operator Component (LOP)
• Input: high-level execution plan (HOP DAGs)
• Output: low-level execution plan (LOP DAGs)
• Action: LOPs represent the basic operations in the MapReduce environment
REF[4]
Runtime
• Matrices as key-value pairs
• Block representation of matrices (using a block operation)
• Generic MapReduce job
  • Main execution engine in SystemML
  • Instantiated by the piggybacking algorithm (multiple LOPs inside a single MR job)
• Control module
  • Orchestrates the execution of the instantiated MapReduce jobs for a DML script
• Multiple optimizations are performed in the runtime component (decided dynamically based on data characteristics)
Example Component Analysis for DML Script in SystemML –REF[4]
Blocking
Figure: Representation of blocking – REF[4]
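The key-value block layout can be sketched as follows (an illustrative layout only; SystemML's actual block format differs in detail):

```python
import numpy as np

# Blocking sketch: a matrix is stored on MapReduce as key-value pairs,
# keyed by block index, with each value a small dense sub-matrix.

def to_blocks(M, bsize):
    m, n = M.shape
    return {(bi, bj): M[bi*bsize:(bi+1)*bsize, bj*bsize:(bj+1)*bsize]
            for bi in range(m // bsize) for bj in range(n // bsize)}

M = np.arange(16.0).reshape(4, 4)
blocks = to_blocks(M, bsize=2)
print(sorted(blocks))            # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(blocks[(1, 0)])            # the bottom-left 2x2 block
```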
Piggybacking
Figure: two packings of the same LOP DAG (Data W → transform → mmcj → group → Aggr.(+)). After a topological sort, the LOPs are assigned to the map (M), reduce (R), or map-and-reduce (M&R) phases of as few MapReduce jobs as possible
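The greedy packing idea can be sketched with a toy model in which each LOP declares the phase it runs in; this is a loose simplification of SystemML's piggybacking algorithm, not its actual implementation:

```python
# Piggybacking sketch: after topologically sorting the LOP DAG, greedily
# pack consecutive operators into a single MapReduce job until an
# operator needs a phase the current job has already passed (here: a
# second shuffle forces a new job).

def piggyback(sorted_lops):
    jobs, current, shuffled = [], [], False
    for name, phase in sorted_lops:       # phase: "map", "shuffle", or "reduce"
        if phase == "shuffle" and shuffled:
            jobs.append(current)          # start a new MR job
            current, shuffled = [], False
        current.append(name)
        shuffled = shuffled or phase in ("shuffle", "reduce")
    if current:
        jobs.append(current)
    return jobs

lops = [("Data W", "map"), ("transform", "map"), ("mmcj", "shuffle"),
        ("group", "shuffle"), ("Aggr(+)", "reduce")]
print(piggyback(lops))   # two jobs: the second shuffle (group) starts a new one
```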
Summary
• Linear algebra is the computational engine for machine learning algorithms
• Gradient descent and SGD
• DSGD: a distributed matrix factorization algorithm that can efficiently handle web-scale matrices
• Overhead in parallelizing ML algorithms
• Declarative ML on MapReduce – IBM SystemML
References
[1] Parallel Computing at a Glance, http://www.buyya.com/microkernel/chap1.pdf
[2] CS229 Lecture Notes, Andrew Ng, http://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf
[3] Gemulla, Rainer, et al. "Large-scale matrix factorization with distributed stochastic gradient descent." Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011.
[4] Ghoting, Amol, et al. "SystemML: Declarative machine learning on MapReduce." Data Engineering (ICDE), 2011 IEEE 27th International Conference on. IEEE, 2011.
References (contd.)
[5] http://www.quora.com/Machine-Learning/
[6] http://web.cs.wpi.edu/~cs525/f13b-EAR//cs525-homepage/lectures/lectures-papers/Large-ScaleMatrixFactorization-ppt.pdf
[7] Neoklis Polyzotis (UCSC, Google), Tyson Condie (UCLA, Microsoft), Markus Weimer (Microsoft), Machine Learning on Big Data
[8] http://www.cliffsnotes.com/math/algebra/linear-algebra/real-euclidean-vector-spaces/the-rank-of-a-matrix
[9] http://en.wikipedia.org/wiki/Permutation
[10] https://www-950.ibm.com/events/wwe/grp/grp004.nsf/vLookupPDFs/Bruce%20Brown%20-%20BigInsights-1-16-12-external/$file/Bruce%20Brown%20-%20BigInsights-1-16-12-external.pdf