StreamSVM: Linear SVMs and Logistic Regression When Data Does Not Fit in Memory
S.V.N. (Vishy) Vishwanathan
Purdue University and [email protected]
October 9, 2012
S.V. N. Vishwanathan (Purdue, Microsoft) StreamSVM 1 / 41
Linear Support Vector Machines
Outline
1 Linear Support Vector Machines
2 StreamSVM
3 Experiments - SVM
4 Logistic Regression
5 Experiments - Logistic Regression
6 Conclusion
Binary Classification

[Figure: two classes of points, y_i = −1 and y_i = +1, separated by the hyperplane {x | ⟨w, x⟩ = 0}]
Linear Support Vector Machines

[Figure: separating hyperplane {x | ⟨w, x⟩ = 0} with margin hyperplanes {x | ⟨w, x⟩ = −1} and {x | ⟨w, x⟩ = 1}; points x1 and x2 lie on the two margins]

⟨w, x1 − x2⟩ = 2  ⟹  ⟨w/‖w‖, x1 − x2⟩ = 2/‖w‖
Linear Support Vector Machines

Optimization Problem

min_w  (1/2)‖w‖² + C ∑_{i=1}^m max(0, 1 − y_i⟨w, x_i⟩)

[Figure: hinge loss as a function of the margin y⟨w, x⟩]
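The primal objective above is straightforward to evaluate. A minimal sketch (NumPy; the function name and toy data are my own, not from the talk):

```python
import numpy as np

def svm_primal_objective(w, X, y, C):
    """Primal linear-SVM objective: 0.5*||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>)."""
    margins = y * (X @ w)                    # y_i * <w, x_i> per example
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per example
    return 0.5 * float(w @ w) + C * float(hinge.sum())

# Toy check: if every example has margin >= 1, only the regularizer remains.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(svm_primal_objective(w, X, y, C=1.0))  # prints 0.5
```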
The Dual

Objective

min_α  D(α) := (1/2) ∑_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ − ∑_i α_i
s.t.  0 ≤ α_i ≤ C

Convex and quadratic objective; linear (box) constraints
w = ∑_i α_i y_i x_i
Each α_i corresponds to an x_i
Coordinate Descent

[Figure sequence: a 2-D quadratic surface minimized by repeatedly fixing one coordinate and minimizing the resulting one-dimensional function, alternating between the x and y coordinates]
Coordinate Descent in the Dual

One-dimensional function:

D(α_t) = (α_t²/2)⟨x_t, x_t⟩ + ∑_{i≠t} α_t α_i y_i y_t ⟨x_i, x_t⟩ − α_t + const.
s.t.  0 ≤ α_t ≤ C

Solve:

∇D(α_t) = α_t⟨x_t, x_t⟩ + ∑_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩ − 1 = 0

α_t = median(0, C, (1 − ∑_{i≠t} α_i y_i y_t ⟨x_i, x_t⟩) / ⟨x_t, x_t⟩)
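The update is just a Newton step on a 1-D quadratic, clipped to the box. A minimal sketch (NumPy; names and toy data are mine) that also maintains w = ∑_i α_i y_i x_i incrementally:

```python
import numpy as np

def cd_step(t, alpha, w, X, y, C):
    """One dual coordinate-descent update of alpha_t; keeps w = sum_i alpha_i y_i x_i."""
    q_tt = float(X[t] @ X[t])                           # <x_t, x_t>
    if q_tt == 0.0:
        return
    grad = y[t] * float(w @ X[t]) - 1.0                 # gradient of D w.r.t. alpha_t
    new_at = min(max(alpha[t] - grad / q_tt, 0.0), C)   # median(0, C, unconstrained optimum)
    w += (new_at - alpha[t]) * y[t] * X[t]              # incremental weight update
    alpha[t] = new_at

# A few sweeps on a toy separable problem:
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = np.zeros(4), np.zeros(2)
for _ in range(100):
    for t in range(4):
        cd_step(t, alpha, w, X, y, C=1.0)
print(np.sign(X @ w))  # matches y
```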
StreamSVM

We are Collecting Lots of Data . . .

[Figure sequence: a growing stream of labeled points, y_i = −1 and y_i = +1]
What if Data Does not Fit in Memory?

Idea 1: Block Minimization [Yu et al., KDD 2010]
- Split data into blocks B1, B2, . . . such that each Bj fits in memory
- Compress and store each block separately
- Load one block of data at a time and optimize only those αi's

Idea 2: Selective Block Minimization [Chang and Roth, KDD 2011]
- Split data into blocks B1, B2, . . . such that each Bj fits in memory
- Compress and store each block separately
- Load one block of data at a time and optimize only those αi's
- In addition, retain informative samples from each block in main memory
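The Block Minimization idea can be sketched in a few lines. This toy version is my own (dense NumPy arrays stand in for compressed on-disk blocks): it runs a dual coordinate-descent sweep over one resident block at a time.

```python
import numpy as np

def cd_sweep(X, y, alpha, w, C):
    """One in-memory dual coordinate-descent sweep over a block; updates alpha and w in place."""
    for t in range(len(y)):
        q_tt = float(X[t] @ X[t])
        if q_tt == 0.0:
            continue
        grad = y[t] * float(w @ X[t]) - 1.0
        new_at = min(max(alpha[t] - grad / q_tt, 0.0), C)
        w += (new_at - alpha[t]) * y[t] * X[t]
        alpha[t] = new_at

def block_minimization(blocks, d, C=1.0, n_passes=20):
    """blocks: list of (X_j, y_j) pairs, each small enough to fit in memory."""
    w = np.zeros(d)
    alphas = [np.zeros(len(y_j)) for _, y_j in blocks]
    for _ in range(n_passes):
        for (X_j, y_j), a_j in zip(blocks, alphas):   # "load" one block at a time
            cd_sweep(X_j, y_j, a_j, w, C)             # optimize only this block's alphas
    return w
```

Note how optimization touches only the currently loaded block's αi's; everything else is frozen, which is exactly what the talk criticizes next.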
What are Informative Samples?
Some Observations

SBM and BM are wasteful:
- Both split the data into blocks and compress the blocks; this requires reading the entire dataset at least once (expensive)
- Both pause optimization while a block is loaded into memory

Hardware 101:
- Disk I/O is slower than the CPU (sometimes by a factor of 100)
- Random access on an HDD is terrible; sequential access is reasonably fast (faster by roughly a factor of 10)
- Multi-core processors are becoming commonplace. How can we exploit this?
Our Architecture

[Figure: a Reader thread streams data from the HDD into a working set held in RAM; a Trainer thread reads the working set and updates the weight vector, also in RAM]
Our Philosophy

Iterate over the data in main memory while streaming data from disk. Primarily evict those examples from main memory that are "uninformative".
Reader

for k = 1, . . . , max_iter do
    for i = 1, . . . , n do
        if |A| = Ω then
            randomly select i′ ∈ A
            A = A \ {i′}
            delete y_i′, Q_i′i′, x_i′ from RAM
        end if
        read y_i, x_i from disk
        calculate Q_ii = ⟨x_i, x_i⟩
        store y_i, Q_ii, x_i in RAM
        A = A ∪ {i}
    end for
    if stopping criterion is met then
        exit
    end if
end for
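In Python-like form the reader loop might look as follows. This is a sketch with invented names (`parse_line`, `reader`); it assumes LIBSVM-style text lines and runs the passes directly, whereas the actual system runs the reader on its own thread concurrently with the trainer.

```python
import random

def parse_line(line):
    """LIBSVM-style line '<label> idx:val idx:val ...' -> (y, sparse x as a dict)."""
    parts = line.split()
    y = float(parts[0])
    x = {int(k): float(v) for k, v in (p.split(":", 1) for p in parts[1:])}
    return y, x

def reader(path, cache, capacity, max_iter=1):
    """Stream examples from disk into the bounded in-RAM working set `cache`."""
    for _ in range(max_iter):
        with open(path) as f:
            for i, line in enumerate(f):
                if len(cache) >= capacity:                # |A| = Omega
                    victim = random.choice(list(cache))   # randomly select i' in A
                    del cache[victim]                     # evict i' from RAM
                y_i, x_i = parse_line(line)
                q_ii = sum(v * v for v in x_i.values())   # Q_ii = <x_i, x_i>
                cache[i] = (y_i, q_ii, x_i)               # A = A ∪ {i}
```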
Trainer

α¹ = 0, w¹ = 0, ε = ∞, ε_new = 0, β = 0.9
while stopping criterion is not met do
    for t = 1, . . . , n do
        if |A| > 0.9 × Ω then ε = βε
        randomly select i ∈ A and read y_i, Q_ii, x_i from RAM
        compute ∇_i D := y_i ⟨w^t, x_i⟩ − 1
        if (α_i^t = 0 and ∇_i D > ε) or (α_i^t = C and ∇_i D < −ε) then
            A = A \ {i} and delete y_i, Q_ii, x_i from RAM
            continue
        end if
        α_i^{t+1} = median(0, C, α_i^t − ∇_i D / Q_ii),  w^{t+1} = w^t + (α_i^{t+1} − α_i^t) y_i x_i
        ε_new = max(ε_new, |∇_i D|)
    end for
    update stopping criterion
    ε = ε_new
end while
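A single-threaded sketch of one trainer pass (names are mine; sparse vectors are dicts; in the real system this runs concurrently with the reader thread):

```python
import random

def trainer_pass(cache, w, alpha, C, eps, capacity, beta=0.9):
    """One pass over the working set A: shrink eps when the cache is nearly full,
    evict examples whose alpha will stay at its bound, otherwise take a clipped
    coordinate step. Returns (eps, eps_new) for the caller's stopping criterion."""
    if len(cache) > 0.9 * capacity:
        eps *= beta
    eps_new = 0.0
    for i in random.sample(list(cache), len(cache)):    # randomly visit i in A
        y_i, q_ii, x_i = cache[i]
        grad = y_i * sum(w.get(k, 0.0) * v for k, v in x_i.items()) - 1.0
        a_i = alpha.get(i, 0.0)
        if (a_i == 0.0 and grad > eps) or (a_i == C and grad < -eps):
            del cache[i]                                # evict uninformative example
            continue
        new_a = min(max(a_i - grad / q_ii, 0.0), C)     # median(0, C, a_i - grad/Q_ii)
        for k, v in x_i.items():                        # w += (new_a - a_i) * y_i * x_i
            w[k] = w.get(k, 0.0) + (new_a - a_i) * y_i * v
        alpha[i] = new_a
        eps_new = max(eps_new, abs(grad))
    return eps, eps_new
```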
Proof of Convergence (Sketch)

Definition (Luo and Tseng Problem)

minimize_α  g(Eα) + b⊤α
subject to  L_i ≤ α_i ≤ U_i

where α, b ∈ R^n, E ∈ R^{d×n}, L_i ∈ [−∞, ∞), U_i ∈ (−∞, ∞]. The above optimization problem is a Luo and Tseng problem if the following conditions hold:
1. E has no zero columns
2. the set of optimal solutions A is non-empty
3. the function g is strictly convex and twice continuously differentiable
4. for all optimal solutions α* ∈ A, ∇²g(Eα*) is positive definite
Proof of Convergence (Sketch)

Lemma (see Theorem 1 of Hsieh et al., ICML 2008)
If no data point x_i = 0, then the SVM dual is a Luo and Tseng problem.

Definition (Almost cyclic rule)
There exists an integer B ≥ n such that every coordinate is iterated upon at least once every B successive iterations.

Theorem (Theorem 2.1 of Luo and Tseng, JOTA 1992)
Let α^t be a sequence of iterates generated by a coordinate descent method using the almost cyclic rule. Then α^t converges linearly to an element of A.
Proof of Convergence (Sketch)

Lemma
If the trainer accesses the cached training data sequentially, and the following conditions hold:
- The trainer is at most κ ≥ 1 times faster than the reader thread; that is, the trainer performs at most κ coordinate updates in the time it takes the reader to read one training example from disk.
- A point is never evicted from RAM unless the α_i corresponding to that point has been updated.
Then the trainer conforms to the almost cyclic rule.
Experiments - SVM

dataset    | n       | d       | s(%)  | n+ : n− | data size
ocr        | 3.5 M   | 1156    | 100   | 0.96    | 45.28 GB
dna        | 50 M    | 800     | 25    | 3e−3    | 63.04 GB
webspam-t  | 0.35 M  | 16.61 M | 0.022 | 1.54    | 20.03 GB
kddb       | 20.01 M | 29.89 M | 1e−4  | 6.18    | 4.75 GB
Does Active Eviction Work?

[Figure: relative function value difference vs. wall clock time for a linear SVM on webspam-t (C = 1.0), comparing Random and Active eviction]
Comparison with Block Minimization

[Figures: relative function value difference vs. wall clock time for StreamSVM, SBM, and BM with C = 1.0 on ocr, webspam-t, kddb, and dna]
Effect of Varying C

[Figures: relative function value difference vs. wall clock time for StreamSVM, SBM, and BM on dna with C = 10.0, C = 100.0, and C = 1000.0]
Varying Cache Size

[Figure: relative function value difference vs. wall clock time on kddb (C = 1.0) with cache sizes of 256 MB, 1 GB, 4 GB, and 16 GB]
Expanding Features

[Figure: relative function value difference vs. wall clock time on dna with expanded features (C = 1.0), with 16 GB and 32 GB caches]
Logistic Regression

Optimization Problem

min_w  (1/2)‖w‖² + C ∑_{i=1}^m log(1 + exp(−y_i⟨w, x_i⟩))

[Figure: hinge loss and logistic loss as functions of the margin y⟨w, x⟩]
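As with the SVM, the objective is easy to evaluate directly. A small sketch (NumPy; the helper name and toy data are mine; `np.logaddexp` gives a numerically stable log(1 + exp(·))):

```python
import numpy as np

def logreg_objective(w, X, y, C):
    """Regularized logistic loss: 0.5*||w||^2 + C * sum_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    losses = np.logaddexp(0.0, -margins)     # stable log(1 + exp(-margin))
    return 0.5 * float(w @ w) + C * float(losses.sum())

# At w = 0 every example contributes log 2:
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
print(logreg_objective(np.zeros(2), X, y, C=1.0))  # 2*log(2) ≈ 1.386
```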
The Dual

Objective

min_α  D(α) := (1/2) ∑_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + ∑_i [α_i log α_i + (C − α_i) log(C − α_i)]
s.t.  0 ≤ α_i ≤ C

w = ∑_i α_i y_i x_i
Convex objective (but not quadratic); box constraints
Coordinate Descent

One-dimensional function:

D(α_t) := (α_t²/2)⟨x_t, x_t⟩ + ∑_{i≠t} α_t α_i y_i y_t ⟨x_i, x_t⟩ + α_t log α_t + (C − α_t) log(C − α_t)
s.t.  0 ≤ α_t ≤ C

Solve: there is no closed-form solution, but a modified Newton method does the job! See Yu, Huang, and Lin (2011) for details.
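The derivative of the 1-D subproblem is strictly increasing, so a damped (modified) Newton iteration converges quickly. A sketch under my own parameterization (this is not the exact method of Yu, Huang, and Lin): collecting the fixed cross terms into a single scalar r, we minimize f(a) = 0.5·q·a² + r·a + a·log a + (C − a)·log(C − a) over 0 < a < C:

```python
import math

def solve_coordinate(q, r, C, a0=None, tol=1e-10, max_iter=100):
    """Damped Newton on f(a) = 0.5*q*a^2 + r*a + a*log(a) + (C-a)*log(C-a), 0 < a < C.
    f'(a) = q*a + r + log(a/(C-a));  f''(a) = q + C/(a*(C-a)) > 0."""
    a = a0 if a0 is not None else 0.5 * C
    a = min(max(a, 1e-12 * C), (1.0 - 1e-12) * C)   # start strictly inside (0, C)
    for _ in range(max_iter):
        g = q * a + r + math.log(a / (C - a))       # f'(a)
        step = g / (q + C / (a * (C - a)))          # Newton step g/f''(a)
        a_new = a - step
        while a_new <= 0.0 or a_new >= C:           # damp the step to stay feasible
            step *= 0.5
            a_new = a - step
        if abs(a_new - a) < tol:
            return a_new
        a = a_new
    return a
```

The barrier-like log terms keep the optimum strictly inside (0, C), which is why the damping never gets stuck.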
Determining Importance: SVM case

Shrinking: example i can be shrunk when

α_i^t = 0 and ∇_i D(α) > ε,  or
α_i^t = C and ∇_i D(α) < −ε
Determining Importance: SVM case

[Figure: gradient of the hinge loss as a function of the margin]

∇_w loss(w, x_i, y_i) = α_i (KKT condition)
∇_i^π D(α) = 0
|⟨w, x_i⟩ − 1| ≥ ε

Computing ε:  ε = max_{i∈A} |∇_i^π D(α)|
Determining Importance: Logistic Regression

[Figure: gradient of the logistic loss as a function of the margin]

∇_w loss(w, x_i, y_i) ≈ α_i (KKT condition)
∇_i D(α) ≈ 0
|⟨w, x_i⟩| ≥ ε

Computing ε:  ε = max_{i∈A} |∇_i D(α)|
Experiments - Logistic Regression

Does Active Eviction Work?

[Figure: relative function value difference vs. wall clock time for logistic regression on webspam-t (C = 1.0), comparing Random and Active eviction]
Comparison with Block Minimization

[Figure: relative function value difference vs. wall clock time for StreamSVM and BM on webspam-t (C = 1.0)]
Conclusion

Things I did not talk about / Work in Progress

StreamSVM:
- Amortizing the cost of training different models
- Other storage hierarchies (e.g. solid state drives)

Extensions to other problems:
- Multiclass problems
- Structured output prediction
- Distributed version of the algorithm
References

Shin Matsushima, S. V. N. Vishwanathan, and Alexander J. Smola. Linear Support Vector Machines via Dual Cached Loops. KDD 2012.
Code for StreamSVM is available from http://www.r.dl.itc.u-tokyo.ac.jp/~masin/streamsvm.html
Joint work with

[Figure: detailed system architecture — the reading thread reads the dataset from disk (sequential access) and loads it into the cached data (working set) in RAM (random access); the training thread reads the working set (random access) and updates the weight vector in RAM]