Transcript
Page 1: Scaling up Machine Learning Algorithms for Classification

Scaling up Machine Learning algorithms for classification

Department of Mathematical Informatics, The University of Tokyo

Shin Matsushima

Page 2: Scaling up Machine Learning Algorithms for Classification

How can we scale up Machine Learning to Massive datasets?

• Exploit hardware traits
  – Disk IO is the bottleneck
  – Dual Cached Loops
  – Run Disk IO and computation simultaneously

• Distributed asynchronous optimization (ongoing)
  – Current work using multiple machines

2

Page 3: Scaling up Machine Learning Algorithms for Classification

LINEAR SUPPORT VECTOR MACHINES VIA DUAL CACHED LOOPS

3

Page 4: Scaling up Machine Learning Algorithms for Classification

• Intuition of linear SVM

– xi: i-th datapoint

– yi: i-th label. +1 or -1

– yi w · xi : larger is better, smaller is worse

4

[Figure: datapoints on either side of a linear decision boundary; ×: yi = +1, ×: yi = -1]

Page 5: Scaling up Machine Learning Algorithms for Classification

• Formulation of Linear SVM

– n: number of data points
– d: number of features
– Convex non-smooth optimization

5

Page 6: Scaling up Machine Learning Algorithms for Classification

• Formulation of Linear SVM
  – Primal
  – Dual
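The equations on these slides did not survive extraction. For reference, the standard primal and dual forms of the L2-regularized, L1-loss (hinge loss) linear SVM, in the notation above (n data points, d features, regularization parameter C), are:

\min_{w \in \mathbb{R}^d} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i\, w \cdot x_i\bigr)  \quad \text{(primal)}

\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{2}\,\alpha^{\top} Q\, \alpha - \mathbf{1}^{\top}\alpha
\quad \text{s.t. } 0 \le \alpha_i \le C, \quad Q_{ij} = y_i y_j\, x_i \cdot x_j  \quad \text{(dual)}

with the primal solution recovered as w = \sum_i \alpha_i y_i x_i.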

6

Page 7: Scaling up Machine Learning Algorithms for Classification

Coordinate descent

7

Page 8: Scaling up Machine Learning Algorithms for Classification

8

Page 9: Scaling up Machine Learning Algorithms for Classification

9

Page 10: Scaling up Machine Learning Algorithms for Classification

10

Page 11: Scaling up Machine Learning Algorithms for Classification

11

Page 12: Scaling up Machine Learning Algorithms for Classification

12

Page 13: Scaling up Machine Learning Algorithms for Classification

13

Page 14: Scaling up Machine Learning Algorithms for Classification

14

Page 15: Scaling up Machine Learning Algorithms for Classification

• Coordinate Descent Method
  – For each update, we solve a one-variable optimization problem with respect to the variable being updated.
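As a one-line illustration (not from the slides): to update coordinate i of an objective D(\alpha), all other coordinates are held fixed and we set

\alpha_i \leftarrow \operatorname*{argmin}_{a} \; D(\alpha_1, \ldots, \alpha_{i-1}, a, \alpha_{i+1}, \ldots, \alpha_n).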

15

Page 16: Scaling up Machine Learning Algorithms for Classification

• Applying Coordinate Descent for Dual formulation of SVM

16

Page 17: Scaling up Machine Learning Algorithms for Classification

17

• Applying Coordinate Descent for Dual formulation of SVM

Page 18: Scaling up Machine Learning Algorithms for Classification

Dual Coordinate Descent [Hsieh et al. 2008]

18
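The update rule is not reproduced in the transcript. Below is a minimal Python sketch of the dual coordinate descent method of Hsieh et al. (2008) applied to the dual above: it maintains w = sum_i alpha_i y_i x_i so that each single-variable update touches only one datapoint. Function and variable names are illustrative, not the authors' implementation.

import numpy as np

def dual_coordinate_descent(X, y, C, n_epochs=10):
    # Dual coordinate descent for the L1-loss linear SVM (sketch).
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = (X ** 2).sum(axis=1)              # ||x_i||^2, precomputed once
    for _ in range(n_epochs):
        for i in np.random.permutation(n):       # each update needs only datapoint i
            if sq_norms[i] == 0.0:
                continue
            G = y[i] * np.dot(w, X[i]) - 1.0     # gradient of the dual objective w.r.t. alpha_i
            old = alpha[i]
            alpha[i] = min(max(old - G / sq_norms[i], 0.0), C)   # project onto [0, C]
            w += (alpha[i] - old) * y[i] * X[i]  # keep w = sum_i alpha_i y_i x_i consistent
    return w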

Page 19: Scaling up Machine Learning Algorithms for Classification

Attractive properties

• Suitable for large-scale learning
  – We need only one datapoint for each update.

• Theoretical guarantees
  – Linear convergence (cf. SGD)

• Shrinking [Joachims 1999]
  – We can eliminate “uninformative” data.

19

Page 20: Scaling up Machine Learning Algorithms for Classification

Shrinking [Joachims 1999]

• Intuition: a datapoint far from the current decision boundary is unlikely to become a support vector

20


Page 21: Scaling up Machine Learning Algorithms for Classification

Shrinking [Joachims 1999]

• Condition

• Available only in the dual problem
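The condition did not survive extraction. As background (the exact form on the slide may differ), the shrinking rule used with dual coordinate descent removes a variable from the active set when it appears pinned at a bound of the box constraint, roughly:

\alpha_i = 0 \;\text{ and }\; \nabla_i D(\alpha) > \bar{M}, \qquad \text{or} \qquad \alpha_i = C \;\text{ and }\; \nabla_i D(\alpha) < \bar{m},

where \bar{M} and \bar{m} are thresholds maintained from the previous pass over the data; this is why the rule is only available in the dual problem.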

21

Page 22: Scaling up Machine Learning Algorithms for Classification

Problem in scaling up to massive data

• When dealing with small-scale data, we first copy the entire dataset into main memory

• When dealing with large-scale data, we cannot copy the entire dataset into memory at once

22

[Diagram: data is read from disk into memory]

Page 23: Scaling up Machine Learning Algorithms for Classification

Read Data

• Schemes when data cannot fit in memory
  1. Block Minimization [Yu et al. 2010]
     – Split the entire dataset into blocks so that each block can fit in memory

Page 24: Scaling up Machine Learning Algorithms for Classification

Train RAM

• Schemes when data cannot fit in memory
  1. Block Minimization [Yu et al. 2010]
     – Split the entire dataset into blocks so that each block can fit in memory

Page 25: Scaling up Machine Learning Algorithms for Classification

Read Data

• Schemes when data cannot fit in memory
  1. Block Minimization [Yu et al. 2010]
     – Split the entire dataset into blocks so that each block can fit in memory

Page 26: Scaling up Machine Learning Algorithms for Classification

Train RAM

• Schemes when data cannot fit in memory
  1. Block Minimization [Yu et al. 2010]
     – Split the entire dataset into blocks so that each block can fit in memory

Page 27: Scaling up Machine Learning Algorithms for Classification

Block Minimization [Yu et al. 2010]

27
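A minimal sketch of the block minimization loop described on these slides. Assumptions: blocks is a list of (X, y) arrays standing in for blocks that would each be read from disk in turn, and train_block is one warm-started pass of the dual coordinate descent update sketched earlier; all names are illustrative.

import numpy as np

def train_block(X, y, C, w, alpha, epochs=1):
    # One warm-started pass of dual coordinate descent over an in-memory block.
    sq_norms = (X ** 2).sum(axis=1)
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            if sq_norms[i] == 0.0:
                continue
            G = y[i] * np.dot(w, X[i]) - 1.0
            old = alpha[i]
            alpha[i] = min(max(old - G / sq_norms[i], 0.0), C)
            w += (alpha[i] - old) * y[i] * X[i]
    return w, alpha

def block_minimization(blocks, d, C, n_outer=10):
    # Block Minimization [Yu et al. 2010], sketched: only one block needs to be in RAM at a time.
    w = np.zeros(d)
    alphas = [np.zeros(len(y)) for _, y in blocks]   # dual variables are kept per block
    for _ in range(n_outer):
        for b, (X, y) in enumerate(blocks):          # in the real scheme, block b is read from disk here
            w, alphas[b] = train_block(X, y, C, w, alphas[b])
    return w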

Page 28: Scaling up Machine Learning Algorithms for Classification

Read Data

• Schemes when data cannot fit in memory
  2. Selective Block Minimization [Chang and Roth 2011]
     – Keep “informative data” in memory

Page 29: Scaling up Machine Learning Algorithms for Classification

Train RAM

Block

• Schemes when data cannot fit in memory
  2. Selective Block Minimization [Chang and Roth 2011]
     – Keep “informative data” in memory

Page 30: Scaling up Machine Learning Algorithms for Classification

Train RAM

Block

• Schemes when data cannot fit in memory
  2. Selective Block Minimization [Chang and Roth 2011]
     – Keep “informative data” in memory

Page 31: Scaling up Machine Learning Algorithms for Classification

Read Data

• Schemes when data cannot fit in memory
  2. Selective Block Minimization [Chang and Roth 2011]
     – Keep “informative data” in memory

Page 32: Scaling up Machine Learning Algorithms for Classification

Train RAM

Block

• Schemes when data cannot fit in memory
  2. Selective Block Minimization [Chang and Roth 2011]
     – Keep “informative data” in memory

Page 33: Scaling up Machine Learning Algorithms for Classification

Train RAM

Block

• Schemes when data cannot fit in memory
  2. Selective Block Minimization [Chang and Roth 2011]
     – Keep “informative data” in memory

Page 34: Scaling up Machine Learning Algorithms for Classification

Selective Block Minimization [Chang and Roth 2011]

34
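Selective block minimization differs in which datapoints stay in RAM between blocks. A minimal, heuristic sketch of the selection step (the actual rule of Chang and Roth may differ; names are illustrative):

import numpy as np

def select_informative(X, y, w, cache_size):
    # Rank the just-trained block by margin y_i * (w . x_i): small or negative margins
    # are near (or on the wrong side of) the boundary, so those points are likely
    # support vectors and worth keeping cached in RAM for the next blocks.
    margins = y * (X @ w)
    return np.argsort(margins)[:cache_size]   # indices of points to keep in memory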

Page 35: Scaling up Machine Learning Algorithms for Classification

• Previous schemes alternate between CPU and Disk IO
  – Training (CPU) is idle while reading
  – Reading (Disk IO) is idle while training

35

Page 36: Scaling up Machine Learning Algorithms for Classification

• We want to exploit modern hardware
  1. Multicore processors are commonplace
  2. CPU (memory IO) is often 10-100 times faster than hard disk IO

36

Page 37: Scaling up Machine Learning Algorithms for Classification

1. Make the reader and trainer run simultaneously and almost asynchronously (a minimal sketch follows below).

2. The trainer updates the parameter many times faster than the reader loads new datapoints.

3. Keep informative data in main memory (i.e., preferentially evict uninformative data from main memory).

37

Dual Cached Loops
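A minimal, illustrative Python sketch of the two loops. The actual system is a multithreaded C++ implementation; here Python threads only illustrate the structure, and the cache evicts the oldest entry rather than the least informative one.

import random
import threading
import numpy as np

def dual_cached_loops(stream, d, C, capacity, n_updates):
    # Reader loop: keep filling a bounded in-memory cache with datapoints from disk.
    # Trainer loop: keep running dual coordinate descent updates over whatever is cached.
    w = np.zeros(d)
    cache = []                                   # entries are [x, y, alpha]
    lock = threading.Lock()
    done = threading.Event()

    def reader():
        for x, y in stream:                      # stream yields (x, y) pairs read from disk
            with lock:
                if len(cache) >= capacity:
                    cache.pop(0)                 # sketch: FIFO eviction; the real scheme evicts uninformative points
                cache.append([x, y, 0.0])
        done.set()

    def trainer():
        nonlocal w
        updates = 0
        while updates < n_updates:
            with lock:
                if not cache:
                    if done.is_set():
                        break
                    continue
                entry = random.choice(cache)     # one cached datapoint per update
                x, y, alpha = entry
                G = y * np.dot(w, x) - 1.0
                new_alpha = min(max(alpha - G / max(np.dot(x, x), 1e-12), 0.0), C)
                w = w + (new_alpha - alpha) * y * x
                entry[2] = new_alpha
            updates += 1

    threads = [threading.Thread(target=reader), threading.Thread(target=trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

Because the trainer only reads from RAM, its update rate is decoupled from disk speed; the reader's only job is to keep the cache populated with fresh (and, in the real method, informative) points.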

Page 38: Scaling up Machine Learning Algorithms for Classification

[Diagram: Dual Cached Loops. A reader thread loads data from disk into RAM while a trainer thread updates the parameter using the data in memory]

38

Page 39: Scaling up Machine Learning Algorithms for Classification

[Diagram: Dual Cached Loops (continued): the reader and trainer threads operate on the in-memory data simultaneously]

39

Page 40: Scaling up Machine Learning Algorithms for Classification

[Diagram: the reader thread loads data from disk into memory; W: working index set]

40

Page 41: Scaling up Machine Learning Algorithms for Classification

[Diagram: the trainer thread updates the parameter using data in memory]

41

Page 42: Scaling up Machine Learning Algorithms for Classification

Which data is “uninformative”?

• A datapoint far from the current decision boundary is unlikely to become a support vector

• Ignore the datapoint for a while.

42

[Figure: datapoints near and far from the current decision boundary]

Page 43: Scaling up Machine Learning Algorithms for Classification

Which data is “uninformative”?

– Condition
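The condition did not survive extraction. Matching the intuition above, one would expect a rule of roughly this shape (stated only as background; the paper's exact rule may differ): a cached point i is treated as uninformative and becomes a candidate for eviction when

\alpha_i = 0 \quad \text{and} \quad y_i\, w \cdot x_i > 1 + \delta

for some tolerance \delta > 0, i.e., the point is confidently classified and its dual variable sits at the lower bound.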

43

Page 44: Scaling up Machine Learning Algorithms for Classification

• Datasets with Various Characteristics:

• 2GB Memory for storing datapoints
• Measured Relative Function Value

45

Page 45: Scaling up Machine Learning Algorithms for Classification

• Comparison with (Selective) Block Minimization (implemented in Liblinear)

– ocr : dense, 45GB

46

Page 46: Scaling up Machine Learning Algorithms for Classification

47

• Comparison with (Selective) Block Minimization (implemented in Liblinear)

– dna : dense, 63GB

Page 47: Scaling up Machine Learning Algorithms for Classification

48

• Comparison with (Selective) Block Minimization (implemented in Liblinear)

– webspam : sparse, 20GB

Page 48: Scaling up Machine Learning Algorithms for Classification

49

• Comparison with (Selective) Block Minimization (implemented in Liblinear)

– kddb : sparse, 4.7GB

Page 49: Scaling up Machine Learning Algorithms for Classification

• When C gets larger (dna C=1)

51

Page 50: Scaling up Machine Learning Algorithms for Classification

• When C gets larger (dna C=10)

52

Page 51: Scaling up Machine Learning Algorithms for Classification

• When C gets larger (dna C=100)

53

Page 52: Scaling up Machine Learning Algorithms for Classification

• When C gets larger (dna C=1000)

54

Page 53: Scaling up Machine Learning Algorithms for Classification

• When memory gets larger (ocr C=1)

55

Page 54: Scaling up Machine Learning Algorithms for Classification

• Expanding Features on the fly
  – Expand features explicitly when the reader thread loads an example into memory.
    • Read (y, x) from the disk
    • Compute f(x) and load (y, f(x)) into RAM

[Diagram: the reader thread reads x = GTCCCACCT… from disk and loads the expanded feature vector f(x) ∈ R^12495340 into RAM]

56
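A minimal sketch of on-the-fly expansion in the reader thread, assuming a hypothetical k-mer count feature map as f (the actual feature map behind the 12M-dimensional dna features is not described in the transcript; names are illustrative):

from collections import Counter

def kmer_features(seq, k=4):
    # Hypothetical feature map f: raw DNA string -> sparse k-mer counts.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def read_and_expand(path, cache, k=4):
    # Reader thread body: read (y, x) from disk, compute f(x), load (y, f(x)) into RAM.
    with open(path) as lines:
        for line in lines:
            label, seq = line.split(maxsplit=1)   # e.g. "+1 GTCCCACCT..."
            cache.append((int(label), kmer_features(seq.strip(), k)))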

Page 55: Scaling up Machine Learning Algorithms for Classification

2TB data

16GB memory

10 hrs

50M examples, 12M features, corresponding to 2TB

57

Page 56: Scaling up Machine Learning Algorithms for Classification

• Summary
  – Linear SVM Optimization when data cannot fit in memory
  – Use the scheme of Dual Cached Loops
  – Outperforms state of the art by orders of magnitude
  – Can be extended to
    • Logistic regression
    • Support vector regression
    • Multiclass classification

58

Page 57: Scaling up Machine Learning Algorithms for Classification

DISTRIBUTED ASYNCHRONOUS OPTIMIZATION (CURRENT WORK)

59

Page 58: Scaling up Machine Learning Algorithms for Classification

Future/Current Work

• Utilize the same principle as dual cached loops in a multi-machine algorithm
  – Transportation of data can be done efficiently without harming optimization performance
  – The key is to run communication and computation simultaneously and asynchronously
  – Can we handle the more sophisticated communication patterns that emerge in multi-machine optimization?

60

Page 59: Scaling up Machine Learning Algorithms for Classification

• (Selective) Block Minimization scheme for Large-scale SVM

61

[Diagram: data is moved from the HDD / file system to one machine, which processes the optimization]

Page 60: Scaling up Machine Learning Algorithms for Classification

• Map-Reduce scheme for multi-machine algorithm

62

[Diagram: parameters are moved between a master node and worker nodes, which process the optimization]

Page 61: Scaling up Machine Learning Algorithms for Classification

63

Page 62: Scaling up Machine Learning Algorithms for Classification

64

Page 63: Scaling up Machine Learning Algorithms for Classification

65

Page 64: Scaling up Machine Learning Algorithms for Classification

Stratified Stochastic Gradient Descent [Gemulla, 2011]

66

Page 65: Scaling up Machine Learning Algorithms for Classification

67

Page 66: Scaling up Machine Learning Algorithms for Classification

68

Page 67: Scaling up Machine Learning Algorithms for Classification

• Map-Reduce scheme for multi-machine algorithm

69

[Diagram: parameters are moved between a master node and worker nodes, which process the optimization]

Page 68: Scaling up Machine Learning Algorithms for Classification

Asynchronous multi-machine scheme

70

Parameter Communication

Parameter Updates

Page 69: Scaling up Machine Learning Algorithms for Classification

NOMAD

71

Page 70: Scaling up Machine Learning Algorithms for Classification

NOMAD

72

Page 71: Scaling up Machine Learning Algorithms for Classification

73

Page 72: Scaling up Machine Learning Algorithms for Classification

74

Page 73: Scaling up Machine Learning Algorithms for Classification

75

Page 74: Scaling up Machine Learning Algorithms for Classification

76

Page 75: Scaling up Machine Learning Algorithms for Classification

Asynchronous multi-machine scheme

• Each machine holds a subset of the data
• The machines keep communicating portions of the parameters to each other
• Simultaneously, each machine keeps updating the parameters for the data it possesses

77

Page 76: Scaling up Machine Learning Algorithms for Classification

• Distributed stochastic gradient descent for saddle point problems
  – Another formulation of SVM (Regularized Risk Minimization in general)
  – Suitable for parallelization
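For reference (not shown in the transcript), the hinge loss can be rewritten as a maximum over a bounded dual variable, C\max(0, 1 - y_i\, w \cdot x_i) = \max_{0 \le \alpha_i \le C} \alpha_i (1 - y_i\, w \cdot x_i), which turns the primal problem into a convex-concave saddle point problem:

\min_{w} \; \max_{0 \le \alpha_i \le C} \;\; \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \alpha_i \bigl(1 - y_i\, w \cdot x_i\bigr)

Each term couples one datapoint with only the coordinates of w where x_i is nonzero, which is what makes distributed stochastic updates over blocks of (w, \alpha) attractive.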

78

Page 77: Scaling up Machine Learning Algorithms for Classification

How can we scale up Machine Learning to Massive datasets?

• Exploit hardware traits
  – Disk IO is the bottleneck
  – Run Disk IO and computation simultaneously

• Distributed asynchronous optimization (ongoing)
  – Current work using multiple machines

79

