StreamSVM: Linear SVMs and Logistic Regression When Data Does Not Fit in Memory
S.V.N. (Vishy) Vishwanathan
Purdue University and [email protected]
October 9, 2012
S.V. N. Vishwanathan (Purdue, Microsoft) StreamSVM 1 / 41
Linear Support Vector Machines
Outline
1 Linear Support Vector Machines
2 StreamSVM
3 Experiments - SVM
4 Logistic Regression
5 Experiments - Logistic Regression
6 Conclusion
Binary Classification

[Figure: two classes of points, y_i = −1 and y_i = +1, separated by the hyperplane {x | ⟨w, x⟩ = 0}]
Linear Support Vector Machines

[Figure: separating hyperplane {x | ⟨w, x⟩ = 0} with margin hyperplanes {x | ⟨w, x⟩ = −1} and {x | ⟨w, x⟩ = 1}; points x1 and x2 lie on the two margins]

⟨w, x1 − x2⟩ = 2  ⟹  ⟨w/‖w‖, x1 − x2⟩ = 2/‖w‖
Linear Support Vector Machines

Optimization Problem

min_w  (1/2)‖w‖² + C ∑_{i=1}^m max(0, 1 − y_i⟨w, x_i⟩)

[Figure: hinge loss as a function of the margin y⟨w, x⟩]
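The primal objective above is straightforward to evaluate. A minimal sketch (NumPy; the function name and toy data are my own, not from the talk):

```python
import numpy as np

def svm_primal_objective(w, X, y, C):
    """Primal linear-SVM objective: 0.5*||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>)."""
    margins = y * (X @ w)                    # y_i * <w, x_i> per example
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per example
    return 0.5 * float(w @ w) + C * float(hinge.sum())

# Toy check: if every example has margin >= 1, only the regularizer remains.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(svm_primal_objective(w, X, y, C=1.0))  # prints 0.5
```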
The Dual

Objective

min_α  D(α) := (1/2) ∑_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ − ∑_i α_i
s.t.  0 ≤ α_i ≤ C

Convex and quadratic objective; linear (box) constraints
w = ∑_i α_i y_i x_i
Each α_i corresponds to an x_i
Coordinate Descent

[Figure sequence: a 2-D quadratic surface minimized by repeatedly fixing one coordinate and minimizing the resulting one-dimensional function, alternating between the x and y coordinates]
Coordinate Descent in the Dual

One-dimensional function:

D(α_t) = (α_t²/2)⟨x_t, x_t⟩ + ∑_{i≠t} α_t α_i y_i y_t ⟨x_i, x_t⟩ − α_t + const.
s.t.  0 ≤ α_t ≤ C

Solve:

∇D(α_t) = α_t⟨x_t, x_t⟩ + ∑_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩ − 1 = 0

α_t = median(0, C, (1 − ∑_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩) / ⟨x_t, x_t⟩)
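The update is just a Newton step on a 1-D quadratic, clipped to the box. A minimal sketch (NumPy; names and toy data are mine) that also maintains w = ∑_i α_i y_i x_i incrementally:

```python
import numpy as np

def cd_step(t, alpha, w, X, y, C):
    """One dual coordinate-descent update of alpha_t; keeps w = sum_i alpha_i y_i x_i."""
    q_tt = float(X[t] @ X[t])                           # <x_t, x_t>
    if q_tt == 0.0:
        return
    grad = y[t] * float(w @ X[t]) - 1.0                 # gradient of D w.r.t. alpha_t
    new_at = min(max(alpha[t] - grad / q_tt, 0.0), C)   # median(0, C, unconstrained optimum)
    w += (new_at - alpha[t]) * y[t] * X[t]              # incremental weight update
    alpha[t] = new_at

# A few sweeps on a toy separable problem:
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = np.zeros(4), np.zeros(2)
for _ in range(100):
    for t in range(4):
        cd_step(t, alpha, w, X, y, C=1.0)
print(np.sign(X @ w))  # matches y
```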
StreamSVM

We are Collecting Lots of Data . . .

[Figure sequence: a growing stream of labeled points, y_i = −1 and y_i = +1]
What if Data Does not Fit in Memory?

Idea 1: Block Minimization [Yu et al., KDD 2010]
- Split data into blocks B1, B2, . . . such that each Bj fits in memory
- Compress and store each block separately
- Load one block of data at a time and optimize only those αi's

Idea 2: Selective Block Minimization [Chang and Roth, KDD 2011]
- Split data into blocks B1, B2, . . . such that each Bj fits in memory
- Compress and store each block separately
- Load one block of data at a time and optimize only those αi's
- In addition, retain informative samples from each block in main memory
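The Block Minimization idea can be sketched in a few lines. This toy version is my own (dense NumPy arrays stand in for compressed on-disk blocks): it runs a dual coordinate-descent sweep over one resident block at a time.

```python
import numpy as np

def cd_sweep(X, y, alpha, w, C):
    """One in-memory dual coordinate-descent sweep over a block; updates alpha and w in place."""
    for t in range(len(y)):
        q_tt = float(X[t] @ X[t])
        if q_tt == 0.0:
            continue
        grad = y[t] * float(w @ X[t]) - 1.0
        new_at = min(max(alpha[t] - grad / q_tt, 0.0), C)
        w += (new_at - alpha[t]) * y[t] * X[t]
        alpha[t] = new_at

def block_minimization(blocks, d, C=1.0, n_passes=20):
    """blocks: list of (X_j, y_j) pairs, each small enough to fit in memory."""
    w = np.zeros(d)
    alphas = [np.zeros(len(y_j)) for _, y_j in blocks]
    for _ in range(n_passes):
        for (X_j, y_j), a_j in zip(blocks, alphas):   # "load" one block at a time
            cd_sweep(X_j, y_j, a_j, w, C)             # optimize only this block's alphas
    return w
```

Note how optimization touches only the currently loaded block's αi's; everything else is frozen, which is exactly what the talk criticizes next.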
What are Informative Samples?
Some Observations

SBM and BM are wasteful:
- Both split the data into blocks and compress the blocks; this requires reading the entire dataset at least once (expensive)
- Both pause optimization while a block is loaded into memory

Hardware 101:
- Disk I/O is slower than the CPU (sometimes by a factor of 100)
- Random access on an HDD is terrible; sequential access is reasonably fast (faster by roughly a factor of 10)
- Multi-core processors are becoming commonplace. How can we exploit this?
Our Architecture

[Figure: a Reader thread streams data from the HDD into a working set held in RAM; a Trainer thread reads the working set and updates the weight vector, also in RAM]
Our Philosophy

Iterate over the data in main memory while streaming data from disk. Primarily evict those examples from main memory that are "uninformative".
Reader

for k = 1, . . . , max_iter do
    for i = 1, . . . , n do
        if |A| = Ω then
            randomly select i′ ∈ A
            A = A \ {i′}
            delete y_i′, Q_i′i′, x_i′ from RAM
        end if
        read y_i, x_i from disk
        calculate Q_ii = ⟨x_i, x_i⟩
        store y_i, Q_ii, x_i in RAM
        A = A ∪ {i}
    end for
    if stopping criterion is met then
        exit
    end if
end for
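In Python-like form the reader loop might look as follows. This is a sketch with invented names (`parse_line`, `reader`); it assumes LIBSVM-style text lines and runs the passes directly, whereas the actual system runs the reader on its own thread concurrently with the trainer.

```python
import random

def parse_line(line):
    """LIBSVM-style line '<label> idx:val idx:val ...' -> (y, sparse x as a dict)."""
    parts = line.split()
    y = float(parts[0])
    x = {int(k): float(v) for k, v in (p.split(":", 1) for p in parts[1:])}
    return y, x

def reader(path, cache, capacity, max_iter=1):
    """Stream examples from disk into the bounded in-RAM working set `cache`."""
    for _ in range(max_iter):
        with open(path) as f:
            for i, line in enumerate(f):
                if len(cache) >= capacity:                # |A| = Omega
                    victim = random.choice(list(cache))   # randomly select i' in A
                    del cache[victim]                     # evict i' from RAM
                y_i, x_i = parse_line(line)
                q_ii = sum(v * v for v in x_i.values())   # Q_ii = <x_i, x_i>
                cache[i] = (y_i, q_ii, x_i)               # A = A ∪ {i}
```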
Trainer

α¹ = 0, w¹ = 0, ε = ∞, ε_new = 0, β = 0.9
while stopping criterion is not met do
    for t = 1, . . . , n do
        if |A| > 0.9 × Ω then ε = βε
        randomly select i ∈ A and read y_i, Q_ii, x_i from RAM
        compute ∇_i D := y_i ⟨w^t, x_i⟩ − 1
        if (α_i^t = 0 and ∇_i D > ε) or (α_i^t = C and ∇_i D < −ε) then
            A = A \ {i} and delete y_i, Q_ii, x_i from RAM
            continue
        end if
        α_i^{t+1} = median(0, C, α_i^t − ∇_i D / Q_ii),  w^{t+1} = w^t + (α_i^{t+1} − α_i^t) y_i x_i
        ε_new = max(ε_new, |∇_i D|)
    end for
    update stopping criterion
    ε = ε_new
end while
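A single-threaded sketch of one trainer pass (names are mine; sparse vectors are dicts; in the real system this runs concurrently with the reader thread):

```python
import random

def trainer_pass(cache, w, alpha, C, eps, capacity, beta=0.9):
    """One pass over the working set A: shrink eps when the cache is nearly full,
    evict examples whose alpha will stay at its bound, otherwise take a clipped
    coordinate step. Returns (eps, eps_new) for the caller's stopping criterion."""
    if len(cache) > 0.9 * capacity:
        eps *= beta
    eps_new = 0.0
    for i in random.sample(list(cache), len(cache)):    # randomly visit i in A
        y_i, q_ii, x_i = cache[i]
        grad = y_i * sum(w.get(k, 0.0) * v for k, v in x_i.items()) - 1.0
        a_i = alpha.get(i, 0.0)
        if (a_i == 0.0 and grad > eps) or (a_i == C and grad < -eps):
            del cache[i]                                # evict uninformative example
            continue
        new_a = min(max(a_i - grad / q_ii, 0.0), C)     # median(0, C, a_i - grad/Q_ii)
        for k, v in x_i.items():                        # w += (new_a - a_i) * y_i * x_i
            w[k] = w.get(k, 0.0) + (new_a - a_i) * y_i * v
        alpha[i] = new_a
        eps_new = max(eps_new, abs(grad))
    return eps, eps_new
```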
Proof of Convergence (Sketch)

Definition (Luo and Tseng Problem)

minimize_α  g(Eα) + b⊤α
subject to  L_i ≤ α_i ≤ U_i

where α, b ∈ R^n, E ∈ R^{d×n}, L_i ∈ [−∞, ∞), U_i ∈ (−∞, ∞]. The above optimization problem is a Luo and Tseng problem if the following conditions hold:
1. E has no zero columns
2. the set of optimal solutions A is non-empty
3. the function g is strictly convex and twice continuously differentiable
4. for all optimal solutions α* ∈ A, ∇²g(Eα*) is positive definite
Proof of Convergence (Sketch)

Lemma (see Theorem 1 of Hsieh et al., ICML 2008)
If no data point x_i = 0, then the SVM dual is a Luo and Tseng problem.

Definition (Almost cyclic rule)
There exists an integer B ≥ n such that every coordinate is iterated upon at least once every B successive iterations.

Theorem (Theorem 2.1 of Luo and Tseng, JOTA 1992)
Let α^t be a sequence of iterates generated by a coordinate descent method using the almost cyclic rule. Then α^t converges linearly to an element of A.
Proof of Convergence (Sketch)

Lemma
If the trainer accesses the cached training data sequentially, and the following conditions hold:
- The trainer is at most κ ≥ 1 times faster than the reader thread; that is, the trainer performs at most κ coordinate updates in the time it takes the reader to read one training example from disk.
- A point is never evicted from RAM unless the α_i corresponding to that point has been updated.
Then the trainer conforms to the almost cyclic rule.
Experiments - SVM

dataset    | n       | d       | s(%)  | n+ : n− | data size
ocr        | 3.5 M   | 1156    | 100   | 0.96    | 45.28 GB
dna        | 50 M    | 800     | 25    | 3e−3    | 63.04 GB
webspam-t  | 0.35 M  | 16.61 M | 0.022 | 1.54    | 20.03 GB
kddb       | 20.01 M | 29.89 M | 1e−4  | 6.18    | 4.75 GB
Does Active Eviction Work?

[Figure: relative function value difference vs. wall clock time for a linear SVM on webspam-t (C = 1.0), comparing Random and Active eviction]
Comparison with Block Minimization

[Figures: relative function value difference vs. wall clock time for StreamSVM, SBM, and BM with C = 1.0 on ocr, webspam-t, kddb, and dna]
Effect of Varying C

[Figures: relative function value difference vs. wall clock time for StreamSVM, SBM, and BM on dna with C = 10.0, C = 100.0, and C = 1000.0]
Varying Cache Size

[Figure: relative function value difference vs. wall clock time on kddb (C = 1.0) with cache sizes of 256 MB, 1 GB, 4 GB, and 16 GB]
Expanding Features

[Figure: relative function value difference vs. wall clock time on dna with expanded features (C = 1.0), with 16 GB and 32 GB caches]
Logistic Regression

Optimization Problem

min_w  (1/2)‖w‖² + C ∑_{i=1}^m log(1 + exp(−y_i⟨w, x_i⟩))

[Figure: hinge loss and logistic loss as functions of the margin y⟨w, x⟩]
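As with the SVM, the objective is easy to evaluate directly. A small sketch (NumPy; the helper name and toy data are mine; `np.logaddexp` gives a numerically stable log(1 + exp(·))):

```python
import numpy as np

def logreg_objective(w, X, y, C):
    """Regularized logistic loss: 0.5*||w||^2 + C * sum_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    losses = np.logaddexp(0.0, -margins)     # stable log(1 + exp(-margin))
    return 0.5 * float(w @ w) + C * float(losses.sum())

# At w = 0 every example contributes log 2:
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
print(logreg_objective(np.zeros(2), X, y, C=1.0))  # 2*log(2) ≈ 1.386
```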
The Dual

Objective

min_α  D(α) := (1/2) ∑_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + ∑_i [α_i log α_i + (C − α_i) log(C − α_i)]
s.t.  0 ≤ α_i ≤ C

w = ∑_i α_i y_i x_i
Convex objective (but not quadratic); box constraints
Coordinate Descent

One-dimensional function:

D(α_t) := (α_t²/2)⟨x_t, x_t⟩ + ∑_{i≠t} α_t α_i y_i y_t ⟨x_i, x_t⟩ + α_t log α_t + (C − α_t) log(C − α_t)
s.t.  0 ≤ α_t ≤ C

Solve: there is no closed-form solution, but a modified Newton method does the job! See Yu, Huang, and Lin (2011) for details.
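The derivative of the 1-D subproblem is strictly increasing, so a damped (modified) Newton iteration converges quickly. A sketch under my own parameterization (this is not the exact method of Yu, Huang, and Lin): collecting the fixed cross terms into a single scalar r, we minimize f(a) = 0.5·q·a² + r·a + a·log a + (C − a)·log(C − a) over 0 < a < C:

```python
import math

def solve_coordinate(q, r, C, a0=None, tol=1e-10, max_iter=100):
    """Damped Newton on f(a) = 0.5*q*a^2 + r*a + a*log(a) + (C-a)*log(C-a), 0 < a < C.
    f'(a) = q*a + r + log(a/(C-a));  f''(a) = q + C/(a*(C-a)) > 0."""
    a = a0 if a0 is not None else 0.5 * C
    a = min(max(a, 1e-12 * C), (1.0 - 1e-12) * C)   # start strictly inside (0, C)
    for _ in range(max_iter):
        g = q * a + r + math.log(a / (C - a))       # f'(a)
        step = g / (q + C / (a * (C - a)))          # Newton step g/f''(a)
        a_new = a - step
        while a_new <= 0.0 or a_new >= C:           # damp the step to stay feasible
            step *= 0.5
            a_new = a - step
        if abs(a_new - a) < tol:
            return a_new
        a = a_new
    return a
```

The barrier-like log terms keep the optimum strictly inside (0, C), which is why the damping never gets stuck.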
Determining Importance: SVM case

Shrinking: example i can be shrunk when

α_i^t = 0 and ∇_i D(α) > ε,  or
α_i^t = C and ∇_i D(α) < −ε
Determining Importance: SVM case

[Figure: gradient of the hinge loss as a function of the margin]

∇_w loss(w, x_i, y_i) = α_i (KKT condition)
∇_i^π D(α) = 0
|⟨w, x_i⟩ − 1| ≥ ε

Computing ε:  ε = max_{i∈A} |∇_i^π D(α)|
Determining Importance: Logistic Regression

[Figure: gradient of the logistic loss as a function of the margin]

∇_w loss(w, x_i, y_i) ≈ α_i (KKT condition)
∇_i D(α) ≈ 0
|⟨w, x_i⟩| ≥ ε

Computing ε:  ε = max_{i∈A} |∇_i D(α)|
Experiments - Logistic Regression

Does Active Eviction Work?

[Figure: relative function value difference vs. wall clock time for logistic regression on webspam-t (C = 1.0), comparing Random and Active eviction]
Comparison with Block Minimization

[Figure: relative function value difference vs. wall clock time for StreamSVM and BM on webspam-t (C = 1.0)]
Conclusion

Things I did not talk about / Work in Progress

StreamSVM:
- Amortizing the cost of training different models
- Other storage hierarchies (e.g. solid state drives)

Extensions to other problems:
- Multiclass problems
- Structured output prediction
- Distributed version of the algorithm
References

Shin Matsushima, S. V. N. Vishwanathan, and Alexander J. Smola. Linear Support Vector Machines via Dual Cached Loops. KDD 2012.
Code for StreamSVM is available from http://www.r.dl.itc.u-tokyo.ac.jp/~masin/streamsvm.html
Joint work with

[Figure: detailed system architecture — the reading thread reads the dataset from disk (sequential access) and loads it into the cached data (working set) in RAM (random access); the training thread reads the working set (random access) and updates the weight vector in RAM]