Parallel Machine Learning - DSGD and SystemML

Big Data Analytics Seminar - Parallel Machine Learning
Janani Chakkaradhari, 09/01/2014

TRANSCRIPT

Page 1: Parallel Machine Learning- DSGD and SystemML

Big Data Analytics Seminar- Parallel Machine Learning

Janani Chakkaradhari, 09/01/2014

Page 2: Parallel Machine Learning- DSGD and SystemML


Outline
• Parallelism
• Machine Learning
• The computational engine for ML
• Large-Scale Matrix Factorization with DSGD
• Overhead in parallelizing ML Algorithms
• Declarative Machine Learning: SystemML
• Summary
• References

Page 3: Parallel Machine Learning- DSGD and SystemML


Parallelism
• Parallel processing
  • Processing of multiple tasks simultaneously on multiple processors
• Parallel programming
  • Programming a multiprocessor system using a divide-and-conquer technique
  • Work is shared across processors
  • Higher computing power

Page 4: Parallel Machine Learning- DSGD and SystemML


Machine Learning

• Types
  • Supervised learning – teach the machine
  • Unsupervised learning – let it learn by itself

Figure: Data → Learning → Model (e.g., producing ratings) – from Machine Learning on Big Data, REF[7]

Page 5: Parallel Machine Learning- DSGD and SystemML


The computational engine for ML

• Why so?
• The best way to see it is through examples.

Page 6: Parallel Machine Learning- DSGD and SystemML


Example - Supervised Learning

Figure: scatter plot of house prices ("Price($)") against living-area size

We want to predict the price of other houses as a function of the size of their living areas.

Page 7: Parallel Machine Learning- DSGD and SystemML


Example - Supervised Learning
• We wish to infer the mapping implied by the data
• Find $h : X \to Y$, where $h(x)$ is a good predictor for the corresponding value of $y$
• Approximate $y$ as a linear function of $x$: $h_\theta(x) = \theta_0 + \theta_1 x$

Figure: linear fit of price ("Price($)") against living-area size

Page 8: Parallel Machine Learning- DSGD and SystemML


Linear Algebra - The computational engine for ML

• Linear Discriminant Analysis (LDA)
  • Finds a linear combination of features
  • Linear classifier
• Principal Component Analysis (PCA)
  • Covariance matrix
  • Eigenvalues and eigenvectors (what's stable under repetitions of a linear transform)
• PageRank
  • Eigenvalues and eigenvectors
• Recommender systems, topic modeling
  • Non-negative matrix factorization
• And more . . .
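Several of these methods boil down to finding a dominant eigenvector. A minimal sketch of power iteration (the toy link matrix and all names here are illustrative, not from the slides): repeatedly applying a linear transform and renormalizing converges to the eigenvector that is stable under repetition, which is exactly the PageRank idea.

    import numpy as np

    # Toy column-stochastic link matrix: entry (i, j) is the probability
    # of following a link from page j to page i. Illustrative values.
    A = np.array([[0.0, 0.5, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])

    v = np.ones(3) / 3               # start from the uniform distribution
    for _ in range(50):
        v = A @ v                    # apply the linear transform ...
        v /= v.sum()                 # ... and renormalize
    print(v)                         # dominant eigenvector ~ [0.444 0.222 0.333]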

Page 9: Parallel Machine Learning- DSGD and SystemML


Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Page 10: Parallel Machine Learning- DSGD and SystemML


Outline
• Introduction to Gradient Descent
• Matrix Factorization
• SGD for Matrix Factorization
• DSGD Matrix Factorization Algorithm

Page 11: Parallel Machine Learning- DSGD and SystemML


Example – Supervised Learning (contd.)

• The value of $h_\theta(x)$ should be approximately equal to $y$ (at least for the example data)
• So, the value of $\theta$ should be carefully chosen
• To measure how close $h_\theta(x^{(i)})$ is to the corresponding $y^{(i)}$ for each value of $i$, define the cost function (or loss function) as
  $J(\theta) = \frac{1}{2} \sum_i \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$

Page 12: Parallel Machine Learning- DSGD and SystemML


Example – Supervised Learning (contd.)

• The value of $h_\theta(x)$ (the actual, predicted value) should be approximately equal to $y$ (the desired value)
• So, the value of $\theta$ should be carefully chosen
• To measure how close $h_\theta(x^{(i)})$ is to the corresponding $y^{(i)}$ for each value of $i$, define the cost function as
  $J(\theta) = \frac{1}{2} \sum_i \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$

Goal is to minimize the cost function and update $\theta$

Page 13: Parallel Machine Learning- DSGD and SystemML


Gradient Descent

Figure: Intuition behind Gradient Descent REF[5]

Page 14: Parallel Machine Learning- DSGD and SystemML


Gradient Descent (aka Batch Gradient Descent)
• Starts by assigning some random value to $\theta$
• Repeatedly updates $\theta$ to make $J(\theta)$ smaller, until it converges
• Since we are changing the value of $\theta$, the rate of change is captured by the partial derivative, giving the update rule
  $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
  where $\alpha$ is the learning rate
• Produces a smoother cost-function curve, but the computation is costly (every step scans all the training data)
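To make the update rule concrete, here is a minimal sketch of batch gradient descent for the housing example; the toy data, the feature scaling, and the division by the number of examples are illustrative choices, not something prescribed by the slides.

    import numpy as np

    # Toy data: living area (sq. ft.) and price (in $1000s); made-up values.
    area  = np.array([1400.0, 1600.0, 1700.0, 1875.0, 2350.0])
    price = np.array([245.0, 312.0, 279.0, 308.0, 405.0])

    x = (area - area.mean()) / area.std()   # feature scaling keeps steps stable
    theta0, theta1, alpha = 0.0, 0.0, 0.1
    for _ in range(500):                     # repeat until (near) convergence
        h = theta0 + theta1 * x              # h_theta(x) for EVERY example
        theta0 -= alpha * (h - price).sum() / len(x)        # dJ/dtheta_0
        theta1 -= alpha * ((h - price) * x).sum() / len(x)  # dJ/dtheta_1
    print(theta0, theta1)                    # fitted line (in scaled units)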

Page 15: Parallel Machine Learning- DSGD and SystemML


Stochastic Gradient Descent
• An iterative stochastic optimization algorithm
• For each training example $(x^{(i)}, y^{(i)})$, the parameter $\theta$ gets updated independently of the other data:
  $\theta_j := \theta_j + \alpha \big(y^{(i)} - h_\theta(x^{(i)})\big)\, x_j^{(i)}$
• Produces a noisy cost-function curve, but is very fast and the updates are easy to compute
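For contrast with the batch sketch above, a minimal SGD version (again with illustrative toy data, pre-scaled living areas, and an arbitrary step size):

    import random

    # (scaled living area, price in $1000s); made-up values.
    data = [(-1.2, 245.0), (-0.5, 312.0), (-0.1, 279.0),
            (0.4, 308.0), (1.4, 405.0)]
    theta0, theta1, alpha = 0.0, 0.0, 0.05
    for _ in range(200):                       # epochs over the data
        random.shuffle(data)                   # visit examples in random order
        for xi, yi in data:                    # update after EACH example
            err = yi - (theta0 + theta1 * xi)  # y - h_theta(x)
            theta0 += alpha * err              # noisy but very cheap steps
            theta1 += alpha * err * xi
    print(theta0, theta1)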

Page 16: Parallel Machine Learning- DSGD and SystemML


Matrix Factorization

Figure: original image, partially observed image, and the image reconstructed from a low-rank approximation

Page 17: Parallel Machine Learning- DSGD and SystemML


Matrix Factorization

             Avatar   The Matrix   Up
  Alice        –          4         2
  Bob          3          2         –
  Charlie      5          –         3

Page 18: Parallel Machine Learning- DSGD and SystemML


Matrix Factorization

                  Avatar (2.24)   The Matrix (1.92)   Up (1.18)
  Alice (1.98)        –               4 / 3.8          2 / 2.3
  Bob (1.21)        3 / 2.7           2 / 2.3             –
  Charlie (2.30)    5 / 5.2              –             3 / 2.7

Each cell shows the observed rating / the rating predicted by the factor model ($w_i h_j$); the headers show the latent factors. The factors are those that minimize the loss function over the observed entries.

Page 19: Parallel Machine Learning- DSGD and SystemML


Matrix Factorization

                  Avatar (2.24)   The Matrix (1.92)   Up (1.18)
  Alice (1.98)      – / 4.4           4 / 3.8          2 / 2.3
  Bob (1.21)        3 / 2.7           2 / 2.3          – / 1.4
  Charlie (2.30)    5 / 5.2           – / 4.4          3 / 2.7

The loss function at each observed element is $(v_{ij} - w_i h_j)^2$; the missing entries ("–") are filled in with the model's predictions.

In practice the model is extended with bias and regularization terms.
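As a quick check (a sketch; the factor values are the ones shown in the table above), every predicted rating is just the product of a user factor and a movie factor:

    # Latent factors taken from the table above (rank-1 model).
    w = {"Alice": 1.98, "Bob": 1.21, "Charlie": 2.30}      # user factors
    h = {"Avatar": 2.24, "The Matrix": 1.92, "Up": 1.18}   # movie factors

    for user, wi in w.items():
        for movie, hj in h.items():
            # e.g. Alice x Avatar: 1.98 * 2.24 = 4.4 (the filled-in entry)
            print(f"{user:8s} {movie:11s} -> {wi * hj:.1f}")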

Page 20: Parallel Machine Learning- DSGD and SystemML


Problem Definition
• Given an input matrix $V$ ($m \times n$) and a target rank $r$
• Find the best model $V \approx W H$, where
  • $W$ ($m \times r$) is the user feature matrix (latent user factors)
  • $H$ ($r \times n$) is the movie feature matrix (latent item factors)

Page 21: Parallel Machine Learning- DSGD and SystemML


SGD for Matrix Factorization

Page 22: Parallel Machine Learning- DSGD and SystemML


SGD for Matrix Factorization
• As we know, SGD steps depend on each other

Page 23: Parallel Machine Learning- DSGD and SystemML


SGD for Matrix Factorization
• As we know, SGD steps depend on each other
• But not all steps are dependent
• Interchangeability:
  • Two training points are interchangeable with respect to any loss function L in summation form if they share neither a row nor a column
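A minimal sketch of plain (sequential) SGD for matrix factorization, using the standard squared-loss gradient steps; the rank, step size, and initialization are illustrative:

    import random
    import numpy as np

    def sgd_mf(ratings, m, n, r=2, alpha=0.01, epochs=500, seed=0):
        """ratings: list of (i, j, v) observed entries of an m x n matrix."""
        rng = np.random.default_rng(seed)
        W = 0.1 * rng.standard_normal((m, r))    # latent user factors
        H = 0.1 * rng.standard_normal((r, n))    # latent item factors
        for _ in range(epochs):
            random.shuffle(ratings)              # stochastic: random order
            for i, j, v in ratings:
                err = v - W[i] @ H[:, j]         # error at entry (i, j)
                wi = W[i].copy()
                W[i]    += alpha * err * H[:, j] # step on the row factor
                H[:, j] += alpha * err * wi      # step on the column factor
        return W, H

    # The 3x3 ratings example above (users x movies, 0-indexed, "_" omitted).
    obs = [(0, 1, 4), (0, 2, 2), (1, 0, 3), (1, 1, 2), (2, 0, 5), (2, 2, 3)]
    W, H = sgd_mf(obs, 3, 3)
    print(np.round(W @ H, 1))                    # reconstructed rating matrix

Note that each update reads and writes W[i] and H[:, j], which is why two steps that share a row or a column depend on each other, while steps on disjoint rows and columns are interchangeable.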

Page 24: Parallel Machine Learning- DSGD and SystemML


DSGD Matrix Factorization Algorithm
• $V \approx W H$, with loss $L(W, H) = \sum_{(i,j) \in Z} \big(v_{ij} - [W H]_{ij}\big)^2$ over the training set $Z$
• Goal – minimize the loss function value and update the (W, H) factor matrices

Figure: the input matrix blocked and distributed across Node 1, Node 2, Node 3

Page 25: Parallel Machine Learning- DSGD and SystemML


DSGD Matrix Factorization Algorithm
• Select a subset of the blocks
• Run SGD on each block independently, and then sum up the results

Figure: a subset of blocks processed on Node 1, Node 2, Node 3

Page 26: Parallel Machine Learning- DSGD and SystemML


DSGD Matrix Factorization Algorithm
• Select a subset of the blocks (for example, the block diagonal)
• Run SGD on each block independently, and then sum up the results

Figure: the block-diagonal blocks processed in parallel on Node 1, Node 2, Node 3

Page 27: Parallel Machine Learning- DSGD and SystemML


DSGD Matrix Factorization Algorithm
• How to get the set of interchangeable submatrices?
• Answer: permutations! E.g. the 6 permutations of 3 balls

Figure: The 6 permutations of 3 balls REF[9]

Page 28: Parallel Machine Learning- DSGD and SystemML


Stratified SGD
• How to get the set of interchangeable submatrices?
• Randomly permute the rows and columns of Z, and then create d × d blocks of size (m/d) × (n/d) each

Figure: Stratified SGD – REF[3]
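A sketch of the stratification step (the function and variable names are mine): after permuting rows and columns, stratum s consists of the blocks (b, (b + s) mod d), which pairwise share no rows or columns:

    import numpy as np

    def strata(m, n, d, seed=0):
        """Yield d strata, each a list of d interchangeable blocks,
        given as (row indices, column indices) pairs."""
        rng = np.random.default_rng(seed)
        rows = np.array_split(rng.permutation(m), d)   # d row groups, ~m/d each
        cols = np.array_split(rng.permutation(n), d)   # d col groups, ~n/d each
        for s in range(d):
            # Blocks (b, (b + s) mod d) share neither rows nor columns,
            # so the SGD steps inside them can run in parallel.
            yield [(rows[b], cols[(b + s) % d]) for b in range(d)]

    for s, stratum in enumerate(strata(6, 6, d=3)):
        print("stratum", s, [(r.tolist(), c.tolist()) for r, c in stratum])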

Page 29: Parallel Machine Learning- DSGD and SystemML


DSGD Matrix Factorization Algorithm
• MapReduce:
  • Divide the processing into d independent map tasks
  • Each task takes a block $Z_b$ and the corresponding factor blocks $W_b$, $H_b$ as input
  • Runs SGD to find the local minimum of the loss function and updates the factor matrices
  • The updated $W'_b$ and $H'_b$ are the matrices obtained by running sequential SGD on the training sequence (stratum), and they feed into the next sub-epoch
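A sketch of one DSGD epoch (a simulation only: the inner loop over blocks is written sequentially here, but each iteration touches disjoint rows and columns and would be one map task; the step size and initialization are illustrative):

    import numpy as np

    def dsgd_epoch(V, W, H, d, alpha=0.01, seed=0):
        """One epoch = d sub-epochs; each sub-epoch is a stratum of d
        blocks that share no rows/columns and can run in parallel."""
        m, n = V.shape
        rng = np.random.default_rng(seed)
        rows = np.array_split(rng.permutation(m), d)
        cols = np.array_split(rng.permutation(n), d)
        for s in range(d):                         # one sub-epoch per stratum
            for b in range(d):                     # independent -> map tasks
                for i in rows[b]:
                    for j in cols[(b + s) % d]:
                        if not np.isnan(V[i, j]):  # observed entries only
                            err = V[i, j] - W[i] @ H[:, j]
                            wi = W[i].copy()
                            W[i]    += alpha * err * H[:, j]
                            H[:, j] += alpha * err * wi
        return W, H

    # The 3x3 ratings example, with NaN marking the missing entries.
    V = np.array([[np.nan, 4, 2], [3, 2, np.nan], [5, np.nan, 3]], float)
    rng = np.random.default_rng(1)
    W, H = 0.5 * rng.random((3, 2)), 0.5 * rng.random((2, 3))
    for _ in range(500):
        W, H = dsgd_epoch(V, W, H, d=3)
    print(np.round(W @ H, 1))       # approximates the observed ratings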

Pages 30–31: Parallel Machine Learning- DSGD and SystemML

(Figure-only slides.)

Page 32: Parallel Machine Learning- DSGD and SystemML


Overhead in parallelizing ML Algorithms

• Cost of implementing a large class of ML algorithms as low-level MapReduce jobs

• Each individual MapReduce job in an ML algorithm has to be hand-coded

• For better performance, the actual execution plan for the same ML algorithm has to be hand-tuned for different input and cluster sizes.

Page 33: Parallel Machine Learning- DSGD and SystemML


Challenge in Optimization
• Matrix multiplication can be evaluated in two ways. Which one to choose?

Row view (one inner product per output entry):

  [ 2  1 ] [ 1 ]   [ 2·1 + 1·2 ]   [  4 ]
  [ 8  7 ] [ 2 ] = [ 8·1 + 7·2 ] = [ 22 ]

Column view (a linear combination of the columns):

  1 × [ 2 ]  +  2 × [ 1 ]  =  [  4 ]
      [ 8 ]        [ 7 ]      [ 22 ]
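The same tiny product computed both ways in Python (a sketch; either order gives the same result, but on blocked, distributed matrices the two evaluation strategies have very different costs):

    import numpy as np

    A = np.array([[2, 1], [8, 7]])
    x = np.array([1, 2])

    # Way 1: one inner product per output row.
    rows = np.array([A[i] @ x for i in range(A.shape[0])])

    # Way 2: a linear combination of A's columns.
    cols = sum(x[k] * A[:, k] for k in range(A.shape[1]))

    print(rows, cols)   # both print [ 4 22 ]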

Page 34: Parallel Machine Learning- DSGD and SystemML


Challenge in Optimization – MR

The choice between RMM (replication-based matrix multiplication, a single MR job) and CPMM (cross-product-based matrix multiplication, two MR jobs) depends on the characteristics of the matrices involved in the multiplication.

REF[4]
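A sketch of the two strategies over block-partitioned matrices, simulated with plain dictionaries (this mimics the data flow only; the names are mine and nothing here is SystemML's actual implementation):

    from collections import defaultdict
    import numpy as np

    def rmm(Ablocks, Bblocks):
        """RMM: one job. A[i,k] is replicated to every output block (i,j)
        (and B[k,j] likewise); the reducer for (i,j) sums over k."""
        out = defaultdict(lambda: 0)
        for (i, k), a in Ablocks.items():
            for (k2, j), b in Bblocks.items():
                if k == k2:
                    out[(i, j)] = out[(i, j)] + a @ b
        return dict(out)

    def cpmm(Ablocks, Bblocks):
        """CPMM: two jobs. Job 1 groups by the common index k and emits
        the cross products A[i,k] @ B[k,j]; job 2 aggregates by (i,j)."""
        partial = defaultdict(list)
        ks = {k for (_, k) in Ablocks} | {k for (k, _) in Bblocks}
        for k in ks:                                       # job 1: group by k
            for (i, k1), a in Ablocks.items():
                for (k2, j), b in Bblocks.items():
                    if k1 == k == k2:
                        partial[(i, j)].append(a @ b)
        return {ij: sum(p) for ij, p in partial.items()}   # job 2: aggregate

    # 2x2 blocking of two 4x4 matrices; keys are (rowBlock, colBlock).
    rng = np.random.default_rng(0)
    A, B = rng.random((4, 4)), rng.random((4, 4))
    Ab = {(i, k): A[2*i:2*i+2, 2*k:2*k+2] for i in range(2) for k in range(2)}
    Bb = {(k, j): B[2*k:2*k+2, 2*j:2*j+2] for k in range(2) for j in range(2)}
    r, c = rmm(Ab, Bb), cpmm(Ab, Bb)
    print(all(np.allclose(r[key], c[key]) for key in r))   # True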

Page 35: Parallel Machine Learning- DSGD and SystemML


Declarative Machine Learning on MapReduce

Page 36: Parallel Machine Learning- DSGD and SystemML


Outline
• SystemML Architecture
• SystemML Components
• Matrix Block Representation
• Piggybacking

Page 37: Parallel Machine Learning- DSGD and SystemML


SystemML
• SystemML: originated at IBM Almaden and Watson Research Center
• Machine learning algorithms are expressed in a high-level language (DML – Declarative Machine learning Language, with syntax similar to R)
• E.g. the DML expression for the transpose of a matrix W is t(W)
• So, the user can focus on writing scripts that answer the "What?" and not the "How?"
• Covers linear statistical models, PCA, PageRank, matrix factorization, iterative algorithms (while and for loops), and so on
• Scales to very large datasets and tunes performance efficiently

Page 38: Parallel Machine Learning- DSGD and SystemML

SystemML Architecture

Figure: SystemML architecture – REF[10]

Page 39: Parallel Machine Learning- DSGD and SystemML


Figure: Matrix Factorization DML script – REF[4]

Page 40: Parallel Machine Learning- DSGD and SystemML


SystemML – Program Analysis

• Break DML scripts into smaller units called statement blocks

REF[4]

Page 41: Parallel Machine Learning- DSGD and SystemML


SystemML Components

SystemML program analysis – for each statement block, do:

1. High-Level Operator component analysis
2. Low-Level Operator component analysis
3. Runtime

REF[4]

Page 42: Parallel Machine Learning- DSGD and SystemML


High-Level Operator component (HOP)
• Input: statement blocks
• Output: high-level execution plan (HOP DAGs)
• Action: HOPs represent the basic operations on matrices and scalars (an operation or transformation); HOPs are instantiated from the parsed DML representation
• Optimizations: algebraic rewrites, selection of the physical representation for intermediate matrices, and cost-based optimizations

REF[4]

Page 43: Parallel Machine Learning- DSGD and SystemML


Low-Level Operator component (LOP)
• Input: high-level execution plan (HOP DAGs)
• Output: low-level execution plan (LOP DAGs)
• Action: LOPs represent the basic operations in the MapReduce environment

REF[4]

Page 44: Parallel Machine Learning- DSGD and SystemML


Runtime
• Matrices as key-value pairs
  • Block representation of matrices (via a blocking operation)
• Generic MapReduce job
  • The main execution engine in SystemML
  • Instantiated by the piggybacking algorithm (multiple LOPs inside a single MR job)
• Control module
  • Orchestrates the execution of the instantiated MapReduce jobs for a DML script
• Multiple optimizations are performed in the runtime component (decided dynamically based on data characteristics)
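A sketch of the blocked key-value representation (the block size and names are illustrative; the real runtime also picks a sparse or dense format per block):

    import numpy as np

    def to_blocks(M, bs):
        """Matrix as key-value pairs: key = (rowBlock, colBlock),
        value = the submatrix for that block."""
        m, n = M.shape
        return {(i // bs, j // bs): M[i:i + bs, j:j + bs]
                for i in range(0, m, bs) for j in range(0, n, bs)}

    M = np.arange(16.0).reshape(4, 4)
    blocks = to_blocks(M, bs=2)
    print(sorted(blocks))        # [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(blocks[(1, 0)])        # bottom-left 2x2 submatrix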

Page 45: Parallel Machine Learning- DSGD and SystemML


Figure: Example component analysis for a DML script in SystemML – REF[4]

Page 46: Parallel Machine Learning- DSGD and SystemML


Blocking

Figure: Representation of blocking – REF[4]

Page 47: Parallel Machine Learning- DSGD and SystemML


Piggybacking

Figure: Piggybacking – the LOP DAG (Data W → transform → mmcj → group → aggregate(+)) is topologically sorted and its operators are packed into the Map, Map & Reduce, and Reduce phases of as few MR jobs as possible.
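A much-simplified sketch of the packing idea (the real algorithm in REF[4] also handles operators such as mmcj that occupy both phases, data dependencies, and more; everything below is an illustrative reduction): walk the topologically sorted LOPs and keep appending them to the current MR job as long as the map-then-reduce phase order is respected.

    # Each LOP declares where it can execute.
    PHASE = {"map": 0, "map_or_reduce": 1, "reduce": 2}

    def piggyback(lops):
        """lops: topologically sorted (name, phase) pairs -> list of MR jobs."""
        jobs, current, level = [], [], 0
        for name, phase in lops:
            if PHASE[phase] >= level:            # fits at/after current phase
                current.append(name)
                level = max(level, PHASE[phase])
            else:                                # would need an earlier phase:
                jobs.append(current)             # close the job, open a new one
                current, level = [name], PHASE[phase]
        jobs.append(current)
        return jobs

    # The LOPs from the figure above, in topological order.
    lops = [("Data W", "map"), ("transform", "map"),
            ("mmcj", "map_or_reduce"), ("group", "reduce"),
            ("aggregate(+)", "reduce")]
    print(piggyback(lops))   # here all five fit into a single MR job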

Page 48: Parallel Machine Learning- DSGD and SystemML


Summary
• Linear algebra is the computational engine for machine learning algorithms
• Gradient descent and SGD
• DSGD: a distributed matrix factorization algorithm that can efficiently handle web-scale matrices
• Overhead in parallelizing ML algorithms
• Declarative ML on MapReduce – IBM SystemML

Page 49: Parallel Machine Learning- DSGD and SystemML


References
[1] Parallel Computing at a Glance, http://www.buyya.com/microkernel/chap1.pdf
[2] CS229 Lecture Notes – Andrew Ng, http://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf
[3] Gemulla, Rainer, et al. "Large-scale matrix factorization with distributed stochastic gradient descent." Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011.
[4] Ghoting, Amol, et al. "SystemML: Declarative machine learning on MapReduce." Data Engineering (ICDE), 2011 IEEE 27th International Conference on. IEEE, 2011.

Page 50: Parallel Machine Learning- DSGD and SystemML


References (contd.)
[5] http://www.quora.com/Machine-Learning/
[6] http://web.cs.wpi.edu/~cs525/f13b-EAR//cs525-homepage/lectures/lectures-papers/Large-ScaleMatrixFactorization-ppt.pdf
[7] Neoklis Polyzotis (UCSC, Google), Tyson Condie (UCLA, Microsoft), Markus Weimer (Microsoft), Machine Learning on Big Data
[8] http://www.cliffsnotes.com/math/algebra/linear-algebra/real-euclidean-vector-spaces/the-rank-of-a-matrix
[9] http://en.wikipedia.org/wiki/Permutation
[10] https://www-950.ibm.com/events/wwe/grp/grp004.nsf/vLookupPDFs/Bruce%20Brown%20-%20BigInsights-1-16-12-external/$file/Bruce%20Brown%20-%20BigInsights-1-16-12-external.pdf
