Scaling up Machine Learning algorithms for classification

Department of Mathematical Informatics, The University of Tokyo

Shin Matsushima

How can we scale up Machine Learning to Massive datasets?

• Exploit hardware traits
– Disk IO is the bottleneck
– Dual Cached Loops
– Run disk IO and computation simultaneously

• Distributed asynchronous optimization (ongoing)
– Current work using multiple machines

LINEAR SUPPORT VECTOR MACHINES VIA DUAL CACHED LOOPS


• Intuition of linear SVM

– x_i: the i-th datapoint

– y_i: the i-th label, +1 or -1

– y_i w · x_i: larger is better, smaller is worse
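The quantity y_i w · x_i can be computed directly. The following is a minimal sketch in plain Python; the function name `margin` and the toy data are illustrative assumptions, not taken from the slides.

```python
def margin(w, x, y):
    """Signed margin y * (w . x); positive means x lies on the correct side."""
    return y * sum(wj * xj for wj, xj in zip(w, x))

# Toy weight vector and three labelled datapoints (made up for illustration).
w = [1.0, 1.0]
data = [([2.0, 1.0], +1), ([-1.0, -3.0], -1), ([0.5, -0.5], +1)]

# One margin per datapoint: larger is better, smaller is worse.
margins = [margin(w, x, y) for x, y in data]
```

In this toy data the third point has margin zero: it sits exactly on the decision boundary of `w`.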

[Figure: scatter plot of datapoints; the legend distinguishes y_i = +1 from y_i = -1]

• Formulation of Linear SVM

– n: number of datapoints
– d: number of features
– Convex non-smooth optimization

• Formulation of Linear SVM
– Primal
– Dual
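The equations on this slide did not survive extraction. As a reference point, the standard hinge-loss (L1-loss) linear SVM without a bias term, as used for dual coordinate descent by Hsieh et al. 2008, can be written as follows. This is an assumption about what the slide showed; the original may use a different but equivalent scaling of the regularizer.

```latex
% Primal: n datapoints x_i in R^d, labels y_i in {+1, -1}, trade-off parameter C > 0
\min_{w \in \mathbb{R}^d} \; P(w) = \frac{1}{2}\lVert w \rVert^2
  + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i \, w \cdot x_i\bigr)

% Dual: one variable alpha_i per datapoint
\min_{\alpha \in \mathbb{R}^n} \; D(\alpha) = \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n}
  \alpha_i \alpha_j \, y_i y_j \, x_i \cdot x_j \;-\; \sum_{i=1}^{n} \alpha_i
\quad \text{subject to } 0 \le \alpha_i \le C,
\qquad w = \sum_{i=1}^{n} \alpha_i \, y_i \, x_i
```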

Coordinate descent

• Coordinate Descent Method
– Each update solves a one-variable optimization problem with respect to the variable being updated.

• Applying Coordinate Descent for Dual formulation of SVM


Dual Coordinate Descent [Hsieh et al. 2008]
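A minimal sketch of dual coordinate descent for the hinge-loss linear SVM, in plain Python. It assumes the standard dual with box constraints 0 <= alpha_i <= C and no bias term; the function name, the toy data and the fixed epoch count are illustrative choices, not the reference implementation.

```python
import random

def dual_coordinate_descent(X, y, C=1.0, epochs=50, seed=0):
    """Hinge-loss linear SVM via coordinate descent on the dual variables alpha."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d                      # maintained as sum_i alpha_i * y_i * x_i
    sq_norms = [sum(v * v for v in x) for x in X]
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            if sq_norms[i] == 0.0:
                continue
            # Gradient of the dual objective with respect to alpha_i.
            g = y[i] * sum(wj * xj for wj, xj in zip(w, X[i])) - 1.0
            # Exact one-variable minimizer, clipped to the box [0, C].
            new_alpha = min(max(alpha[i] - g / sq_norms[i], 0.0), C)
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                for j in range(d):
                    w[j] += delta * y[i] * X[i][j]
                alpha[i] = new_alpha
    return w, alpha

# Toy linearly separable data (made up for illustration).
X = [[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]]
y = [1, 1, -1, -1]
w, alpha = dual_coordinate_descent(X, y)
margins = [yi * sum(wj * xj for wj, xj in zip(w, xi)) for xi, yi in zip(X, y)]
```

Each update touches a single datapoint and keeps `w` in sync with `alpha`, so one pass costs about as much as one pass of SGD.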


Attractive properties

• Suitable for large-scale learning
– Only one datapoint is needed for each update.

• Theoretical guarantees
– Linear convergence (cf. SGD)

• Shrinking [Joachims 1999]
– We can eliminate “uninformative” data

Shrinking [Joachims 1999]

• Intuition: a datapoint far from the current decision boundary is unlikely to become a support vector

[Figure: datapoints marked × and ○]

• Condition

• Available only in the dual problem
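The condition itself did not survive extraction. A commonly used simplified form, assumed here, marks a dual variable as shrinkable when it sits at a bound of the box [0, C] and its gradient pushes it further outside. The function name `can_shrink` and the tolerance parameter are hypothetical.

```python
def can_shrink(alpha_i, grad_i, C, tol=0.0):
    """Simplified shrinking test for one dual variable.

    grad_i is the dual gradient y_i * (w . x_i) - 1 for datapoint i.
    """
    at_lower = alpha_i == 0.0 and grad_i > tol   # well beyond the margin
    at_upper = alpha_i == C and grad_i < -tol    # stuck at the upper bound
    return at_lower or at_upper

# A point with alpha = 0 and margin 1.5 (gradient 0.5) can be set aside.
far_point = can_shrink(0.0, 0.5, C=1.0)
# A point strictly inside the box is still informative.
active_point = can_shrink(0.3, 0.5, C=1.0)
```

The test needs alpha_i, which is why this kind of shrinking is tied to the dual problem.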


Problem in scaling up to massive data

• In dealing with small-scale data, we first copy the entire dataset into main memory

• In dealing with large-scale data, we cannot copy the dataset at once

[Figure: Read. Data moves from disk to memory]

• Schemes when data cannot fit in memory
1. Block Minimization [Yu et al. 2010]
– Split the entire dataset into blocks so that each block can fit in memory

[Figure: Block Minimization [Yu et al. 2010], alternating “Read Data” and “Train” steps on the block held in RAM]
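The block scheme can be sketched as a loop that alternates loading and training. This is a simplified illustration in plain Python, assuming dual coordinate descent as the per-block solver; `load_block` is a hypothetical stand-in for a disk read, and the toy blocks are made up.

```python
def train_block(block, alpha, w, C, inner_epochs):
    """Dual coordinate descent restricted to the datapoints of one in-memory block."""
    for _ in range(inner_epochs):
        for i, x, y in block:
            q = sum(v * v for v in x)
            if q == 0.0:
                continue
            g = y * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
            new_a = min(max(alpha[i] - g / q, 0.0), C)
            delta = new_a - alpha[i]
            for j, xj in enumerate(x):
                w[j] += delta * y * xj
            alpha[i] = new_a

def block_minimization(load_block, num_blocks, n, d, C=1.0, outer=10, inner_epochs=5):
    alpha, w = [0.0] * n, [0.0] * d
    for _ in range(outer):
        for b in range(num_blocks):
            block = load_block(b)      # training waits while the block is read
            train_block(block, alpha, w, C, inner_epochs)  # reading waits here
    return w, alpha

# Two toy blocks of (index, x, y) triples standing in for files on disk.
blocks = [
    [(0, [2.0, 2.0], 1), (1, [-2.0, -1.0], -1)],
    [(2, [1.0, 3.0], 1), (3, [-3.0, -2.0], -1)],
]
w, alpha = block_minimization(lambda b: blocks[b], num_blocks=2, n=4, d=2)
```

Loading and training never overlap in this loop, so each resource sits idle while the other works.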

• Schemes when data cannot fit in memory
2. Selective Block Minimization [Chang and Roth 2011]
– Keep “informative data” in memory

[Figure: Selective Block Minimization [Chang and Roth 2011], alternating “Read Data” and “Train” steps, with a block kept in RAM]

• Previous schemes alternate between CPU and disk IO
– Training (CPU) is idle while reading
– Reading (disk IO) is idle while training

• We want to exploit modern hardware
1. Multicore processors are commonplace
2. CPU (memory IO) is often 10-100 times faster than hard disk IO

1. Make the reader and the trainer run simultaneously and almost asynchronously.

2. The trainer updates the parameter many times faster than the reader loads new datapoints.

3. Keep informative data in main memory (i.e., evict primarily uninformative data from main memory).
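The three design points above can be sketched with two Python threads sharing a bounded cache. This is a toy illustration under several assumptions, not the authors' implementation: the "disk" is an in-memory list, the reader evicts a random cached point when the cache is full, and the trainer evicts a point when its dual variable sits at a bound with the gradient pointing outward. All names are hypothetical.

```python
import random
import threading

def dual_cached_loops(disk, d, C=1.0, capacity=2, reader_passes=200, seed=0):
    n = len(disk)
    alpha, w = [0.0] * n, [0.0] * d        # written only by the trainer thread
    cache, lock = {}, threading.Lock()     # RAM cache: index -> (x, y)
    done = threading.Event()
    rng = random.Random(seed)

    def reader():
        for _ in range(reader_passes):
            for i, (x, y) in enumerate(disk):          # sequential "disk" scan
                with lock:
                    if i not in cache and len(cache) >= capacity:
                        del cache[rng.choice(list(cache))]   # make room
                    cache[i] = (x, y)
        done.set()

    def sweep():
        with lock:
            items = list(cache.items())    # snapshot of what is cached right now
        for i, (x, y) in items:
            q = sum(v * v for v in x)
            if q == 0.0:
                continue
            g = y * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
            new_a = min(max(alpha[i] - g / q, 0.0), C)
            delta = new_a - alpha[i]
            for j, xj in enumerate(x):
                w[j] += delta * y * xj
            alpha[i] = new_a
            # Evict "uninformative" points: at a bound, gradient pointing outward.
            if (new_a == 0.0 and g > 0.0) or (new_a == C and g < 0.0):
                with lock:
                    cache.pop(i, None)

    def trainer():
        while not done.is_set():
            sweep()
        sweep()                            # final pass over what is still cached

    threads = [threading.Thread(target=reader), threading.Thread(target=trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w, alpha

disk = [([2.0, 2.0], 1), ([1.0, 3.0], 1), ([-2.0, -1.0], -1), ([-3.0, -2.0], -1)]
w, alpha = dual_cached_loops(disk, d=2)
```

The cache holds only two of the four datapoints, so the trainer revisits cached points many times while the reader cycles through the rest.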

Dual Cached Loops

[Figure: Dual Cached Loops. Labels: Reader Thread, Trainer Thread, Parameter, RAM, Disk, Memory, Data]

[Figure: Read. Labels: Disk, Memory, Data; W: working index set]

[Figure: Train. Labels: Parameter, Memory]

Which data is “uninformative”?

• A datapoint far from the current decision boundary is unlikely to become a support vector

• Ignore the datapoint for a while.

[Figure: datapoints marked × and ○]

– Condition


• Datasets with Various Characteristics:

• 2GB memory for storing datapoints
• Measured relative function value

• Comparison with (Selective) Block Minimization (implemented in Liblinear)
– ocr: dense, 45GB
– dna: dense, 63GB
– webspam: sparse, 20GB
– kddb: sparse, 4.7GB

• When C gets larger (dna, C=1)

• When C gets larger (dna, C=10)

• When memory gets larger (ocr, C=1)

• Expanding features on the fly
– Expand features explicitly when the reader thread loads an example into memory.
• Read (y, x) from the disk
• Compute f(x) and load (y, f(x)) into RAM

[Figure: Read. Data on disk, x = GTCCCACCT…, is expanded to f(x) ∈ R^12495340]
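As a sketch of expanding features while reading, the generator below turns each raw string into sparse k-mer counts before handing it on. The k-mer map is an assumed stand-in for f; the slides do not specify the actual feature map, and all names are hypothetical.

```python
from itertools import product

def kmer_index(k, alphabet="ACGT"):
    """Map every length-k string over the alphabet to a feature id."""
    return {"".join(p): j for j, p in enumerate(product(alphabet, repeat=k))}

def expand(x, k, index):
    """Sparse k-mer count features of string x as {feature_id: count}."""
    feats = {}
    for s in range(len(x) - k + 1):
        j = index.get(x[s:s + k])
        if j is not None:              # skip k-mers with unknown characters
            feats[j] = feats.get(j, 0) + 1
    return feats

def reader(records, k=2):
    """Yield (y, f(x)) so that only expanded examples are placed in RAM."""
    index = kmer_index(k)
    for y, x in records:
        yield y, expand(x, k, index)

examples = list(reader([(+1, "GTCCCACCT")], k=2))
```

Only the compact raw string has to be stored on disk; the much larger expanded vector exists only in memory.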

2TB data, 16GB memory, 10hrs

50M examples, 12M features, corresponding to 2TB

• Summary
– Linear SVM optimization when data cannot fit in memory
– Use the scheme of Dual Cached Loops
– Outperforms state of the art by orders of magnitude
– Can be extended to
• Logistic regression
• Support vector regression
• Multiclass classification

DISTRIBUTED ASYNCHRONOUS OPTIMIZATION (CURRENT WORK)


Future/Current Work

• Utilize the same principle as Dual Cached Loops in a multi-machine algorithm
– Transportation of data can be done efficiently without harming optimization performance
– The key is to run communication and computation simultaneously and asynchronously
– Can we handle the more sophisticated communication that emerges in multi-machine optimization?

• (Selective) Block Minimization scheme for Large-scale SVM

[Figure: Move data / Process optimization. Labels: HDD/file system, One machine, One machine]

• Map-Reduce scheme for multi-machine algorithm

[Figure: Move parameters / Process optimization. Labels: Master node, Worker node, Worker node]

≈ Stratified Stochastic Gradient Descent [Gemulla, 2011]


Asynchronous multi-machine scheme

[Figure: Parameter Communication and Parameter Updates]

[Figure: NOMAD]

• Each machine holds a subset of the data
• Machines keep communicating a portion of the parameters to each other
• Simultaneously, each machine updates the parameters it currently possesses
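The communication pattern in these bullets can be illustrated with threads standing in for machines. This is a toy sketch, not NOMAD or the author's algorithm: each parameter coordinate travels around a ring of workers as a token, and each worker takes an SGD step on it using only local data. The objective (a per-coordinate least-squares fit, whose optimum is the column mean) is an assumption chosen so the result is easy to check.

```python
import queue
import threading

def async_ring(local_rows, d, hops):
    """Circulate d parameter tokens around a ring of workers holding local data."""
    P = len(local_rows)
    inbox = [queue.Queue() for _ in range(P)]
    result = [None] * d
    finished = queue.Queue()

    def worker(p):
        rows = local_rows[p]                   # this worker's share of the data
        while True:
            token = inbox[p].get()
            if token is None:                  # shutdown signal
                return
            j, wj, t = token
            t += 1
            local_mean = sum(r[j] for r in rows) / len(rows)
            wj -= (wj - local_mean) / t        # SGD step with step size 1/t
            if t == hops:
                result[j] = wj
                finished.put(j)
            else:
                inbox[(p + 1) % P].put((j, wj, t))   # pass the parameter on

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(P)]
    for th in threads:
        th.start()
    for j in range(d):
        inbox[j % P].put((j, 0.0, 0))          # spread tokens over the workers
    for _ in range(d):
        finished.get()                         # wait until every token retires
    for q in inbox:
        q.put(None)
    for th in threads:
        th.join()
    return result

# Two workers, two rows each, two parameter coordinates (toy data).
local_rows = [[[1.0, 10.0], [3.0, 30.0]], [[5.0, 50.0], [7.0, 70.0]]]
w = async_ring(local_rows, d=2, hops=4)
```

While one worker updates the token it holds, the other token is in transit or being updated elsewhere, so no worker waits for a global synchronization step.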

• Distributed stochastic gradient descent for saddle point problems
– Another formulation of SVM (Regularized Risk Minimization in general)
– Suitable for parallelization
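The slide does not show the saddle point formulation. One standard way to obtain it, assumed here, starts from the hinge-loss linear SVM with datapoints x_i, labels y_i in {+1, -1} and trade-off parameter C, and rewrites each hinge term using max(0, z) = max over a in [0, 1] of a·z. The constants may differ from the author's version.

```latex
\min_{w \in \mathbb{R}^d} \; \max_{\alpha \in [0,1]^n} \;
  \frac{1}{2}\lVert w \rVert^2
  + C \sum_{i=1}^{n} \alpha_i \bigl(1 - y_i \, w \cdot x_i\bigr)

% Expanding w \cdot x_i = \sum_j w_j x_{ij} gives the same objective as
\sum_{j=1}^{d} \frac{1}{2} w_j^2 \;+\; C \sum_{i=1}^{n} \alpha_i
  \;-\; C \sum_{i=1}^{n} \sum_{j=1}^{d} \alpha_i \, y_i \, x_{ij} \, w_j
```

Each term of the double sum involves only one w_j and one alpha_i, so updates for pairs (i, j) that share neither index can run in parallel.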

How can we scale up Machine Learning to Massive datasets?

• Exploit hardware traits
– Disk IO is the bottleneck
– Run disk IO and computation simultaneously

• Distributed asynchronous optimization (ongoing)
– Current work using multiple machines
