©Sham Kakade 2016 1
Machine Learning for Big Data CSE547/STAT548, University of Washington
Sham Kakade
May 5, 2016
LASSO Solvers – Part 2:
SCD for LASSO (Shooting)
Parallel SCD (Shotgun)
Parallel SGD
Averaging Solutions
Case Study 3: fMRI Prediction
Scaling Up LASSO Solvers
• Another way to solve the LASSO problem:
– Stochastic Coordinate Descent (SCD)
– Minimizing a coordinate in LASSO
• A simple SCD for LASSO (Shooting)
– Your HW, a more efficient implementation!
– Analysis of SCD
• Parallel SCD (Shotgun)
• Other parallel learning approaches for linear models
– Parallel stochastic gradient descent (SGD)
– Parallel independent solutions then averaging
• ADMM
Coordinate Descent
• Given a function F
– Want to find minimum
• Often, hard to find minimum for all coordinates, but easy for one coordinate
• Coordinate descent:
• How do we pick a coordinate?
• When does this converge to optimum?
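A minimal sketch of the idea, assuming a convex quadratic F(x) = ½ xᵀAx − bᵀx (an illustrative objective, not one from the slides) and a uniformly random coordinate choice. Each step minimizes F exactly over one coordinate with the others held fixed; the function name and defaults are hypothetical.

```python
import numpy as np

def coordinate_descent_quadratic(A, b, n_iters=200, seed=0):
    """Minimize F(x) = 0.5 x^T A x - b^T x for symmetric positive definite A."""
    rng = np.random.default_rng(seed)
    d = len(b)
    x = np.zeros(d)
    for _ in range(n_iters):
        j = rng.integers(d)  # pick a coordinate at random
        # Setting dF/dx_j = A[j] @ x - b[j] to zero and solving for x_j:
        x[j] = (b[j] - A[j] @ x + A[j, j] * x[j]) / A[j, j]
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_hat = coordinate_descent_quadratic(A, b)
```

For this smooth, strongly convex F the iterates approach the solution of Ax = b; cycling through the coordinates in order is another common choice.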
Soft Thresholding
From Kevin Murphy textbook
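The figure is not reproduced here. As a sketch, the soft-thresholding operator is usually written soft(c, λ) = sign(c) · max(|c| − λ, 0): values within λ of zero are set to exactly zero, and all others are shrunk toward zero by λ. The function name below is an assumption.

```python
import numpy as np

def soft_threshold(c, lam):
    """Shrink c toward zero by lam; values with |c| <= lam become exactly 0."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

shrunk = soft_threshold(np.array([3.0, -3.0, 0.5]), 1.0)
```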
Stochastic Coordinate Descent for LASSO (aka Shooting Algorithm)
• Repeat until convergence
– Pick a coordinate j at random
• Set:
• Where:
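The update formulas are not reproduced in these notes. One standard form, assumed here for the objective F(β) = ||Xβ − y||₂² + λ||β||₁, sets β_j = soft(c_j, λ) / a_j, where a_j = 2 Σ_i x_ij² and c_j = 2 Σ_i x_ij (y_i − Σ_{k≠j} x_ik β_k). The sketch below keeps a running residual so each update costs O(n); the function names and defaults are hypothetical.

```python
import numpy as np

def soft_threshold(c, lam):
    return np.sign(c) * max(abs(c) - lam, 0.0)

def shooting_lasso(X, y, lam, n_iters=5000, seed=0):
    """Stochastic coordinate descent for ||X beta - y||^2 + lam * ||beta||_1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    residual = y - X @ beta              # y - X beta, kept up to date
    a = 2.0 * (X ** 2).sum(axis=0)       # a_j = 2 * sum_i x_ij^2
    for _ in range(n_iters):
        j = rng.integers(d)              # pick a coordinate at random
        if a[j] == 0.0:
            continue                     # all-zero feature: nothing to update
        # c_j = 2 x_j^T (y - X_{-j} beta_{-j}) = 2 x_j^T residual + a_j beta_j
        c_j = 2.0 * X[:, j] @ residual + a[j] * beta[j]
        new_beta_j = soft_threshold(c_j, lam) / a[j]
        residual -= X[:, j] * (new_beta_j - beta[j])
        beta[j] = new_beta_j
    return beta

beta_hat = shooting_lasso(np.eye(2), np.array([3.0, 0.5]), lam=2.0, n_iters=50)
```

With an identity design the coordinates decouple, so the first coefficient is shrunk from 3 to 2 and the second is set to exactly 0.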
Analysis of SCD [Shalev-Shwartz, Tewari ’09/’11]
• Analysis works for LASSO, L1 regularized logistic regression, and other objectives!
• For (coordinate-wise) strongly convex functions:
• Theorem:
– Starting from
– After T iterations
– Where E[ ] is wrt random coordinate choices of SCD
• Natural question: How do the convergence rates of SCD and SGD differ?
Shooting: Sequential SCD
Stochastic Coordinate Descent (SCD) (e.g., Shalev-Shwartz & Tewari, 2009)
Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda \|\beta\|_1$
While not converged,
Choose random coordinate j,
Update βj (closed-form minimization)
[Figure: contours of F(β)]
Shotgun: Parallel SCD [Bradley et al ‘11]
Shotgun (Parallel SCD)
Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda \|\beta\|_1$
While not converged,
On each of P processors,
Choose random coordinate j,
Update βj (same as for Shooting)
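A minimal single-process simulation of the loop above, under stated assumptions: each round picks P distinct coordinates (so P ≤ d), all P updates are computed from the same current β as if on separate processors, and then applied together. The coordinate update is the standard closed form for F(β) = ||Xβ − y||₂² + λ||β||₁ and assumes no all-zero feature columns; the function name is hypothetical.

```python
import numpy as np

def shotgun_lasso(X, y, lam, P, n_rounds=200, seed=0):
    """Simulate P parallel coordinate updates per round on one machine."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    a = 2.0 * (X ** 2).sum(axis=0)          # a_j = 2 * sum_i x_ij^2, assumed > 0
    for _ in range(n_rounds):
        coords = rng.choice(d, size=P, replace=False)
        residual = y - X @ beta             # every "processor" reads the same beta
        delta = np.zeros(d)
        for j in coords:
            c_j = 2.0 * X[:, j] @ residual + a[j] * beta[j]
            new_beta_j = np.sign(c_j) * max(abs(c_j) - lam, 0.0) / a[j]
            delta[j] = new_beta_j - beta[j]
        beta += delta                       # collective update
    return beta

beta_hat = shotgun_lasso(np.eye(3), np.array([3.0, 0.5, -4.0]), lam=2.0, P=2, n_rounds=20)
```

With uncorrelated (here orthonormal) features the parallel updates do not interfere, so the result matches sequential Shooting; with strongly correlated features the collective update can overshoot.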
Is SCD inherently sequential?
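Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda \|\beta\|_1$
Coordinate update (closed-form minimization):
$\beta_j \leftarrow \beta_j + \delta\beta_j$
Collective update:
$\Delta\beta = (0, \ldots, \delta\beta_i, \ldots, 0, \ldots, \delta\beta_j, \ldots, 0)^\top$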
Convergence Analysis
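Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda \|\beta\|_1$
Theorem: Shotgun Convergence
Assume $P < d/\rho + 1$, where $\rho$ = spectral radius of $X^\top X$. Then
$$E\left[F(\beta^{(T)})\right] - F(\beta^*) \le \frac{d\left(\frac{1}{2}\|\beta^*\|_2^2 + F(\beta^{(0)})\right)}{TP}$$
Nice case: Uncorrelated features
$\rho$ = __ ⇒ $P_{max}$ = __
Bad case: Correlated features
$\rho$ = __ ⇒ $P_{max}$ = __ (at worst)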
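To get a feel for the condition P < d/ρ + 1, the sketch below computes ρ and the largest integer P that satisfies it for a given design matrix. It assumes feature columns are first scaled to unit norm (the usual normalization in this kind of analysis, and an assumption here); the function name is hypothetical.

```python
import math
import numpy as np

def shotgun_max_parallelism(X):
    """Return (rho, P_max): spectral radius of X^T X after scaling columns
    to unit norm, and the largest integer P with P < d / rho + 1."""
    Xn = X / np.linalg.norm(X, axis=0)
    rho = float(np.max(np.abs(np.linalg.eigvalsh(Xn.T @ Xn))))
    d = X.shape[1]
    # Largest integer strictly below d/rho + 1 is ceil(d/rho); the small
    # tolerance guards against floating-point round-off in rho.
    p_max = math.ceil(d / rho - 1e-9)
    return rho, p_max

rho_nice, p_nice = shotgun_max_parallelism(np.eye(4))        # orthonormal features
rho_bad, p_bad = shotgun_max_parallelism(np.ones((5, 4)))    # identical features
```

Orthonormal columns give the smallest possible ρ and allow the most parallel updates; identical columns give the largest ρ and allow only one update at a time.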
Stepping Back…
• Stochastic coordinate descent
– Optimization:
– Parallel SCD:
– Issue:
– Solution:
• Natural counterpart:
– Optimization:
– Parallel
– Issue:
– Solution:
What you need to know
• Sparsistency
• Fused LASSO
• LASSO Solvers
– LARS
– A simple SCD for LASSO (Shooting)
• Your HW, a more efficient implementation!
• Analysis of SCD
– Parallel SCD (Shotgun)
“Scalable” LASSO Solvers:
Parallel SCD (Shotgun)
Parallel SGD
Averaging Solutions
ADMM
Case Study 3: fMRI Prediction
Parallel SGD with No Locks [e.g., Hogwild!, Niu et al. ’11]
• Each processor in parallel:
– Pick data point i at random
– For j = 1…p:
• Assume atomicity of:
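A minimal sketch of the lock-free pattern, with several assumptions not taken from the slides: squared loss ½(xᵢᵀw − yᵢ)², a fixed step size `eta`, and Python threads sharing one NumPy array. Python's interpreter lock means this shows the access pattern rather than a real speedup. Each thread reads the shared weights without locking and writes only the coordinates where its data point is nonzero, one coordinate at a time.

```python
import threading
import numpy as np

def hogwild_least_squares(X, y, n_threads=4, steps_per_thread=2000, eta=0.1, seed=0):
    """Lock-free parallel SGD on 0.5 * (x_i^T w - y_i)^2 with sparse rows."""
    n, d = X.shape
    w = np.zeros(d)                          # shared weights, no lock

    def worker(thread_id):
        rng = np.random.default_rng(seed + thread_id)
        for _ in range(steps_per_thread):
            i = rng.integers(n)              # pick data point i at random
            nz = np.flatnonzero(X[i])        # only touch nonzero features
            err = X[i, nz] @ w[nz] - y[i]    # may use slightly stale weights
            for j in nz:
                w[j] -= eta * err * X[i, j]  # per-coordinate update

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

Sparse data points keep the chance that two threads touch the same coordinate at the same time small, which is why stale reads and occasional overwrites do little harm here.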
Addressing Interference in Parallel SGD
• Key issues:
– Old gradients
– Processors overwrite each other’s work
• Nonetheless:
– Can achieve convergence and some parallel speedups
– Proof uses weak interactions, but through sparsity of data points
Problem with Parallel SCD and SGD
• Both Parallel SCD & SGD assume access to current estimate of weight vector
• Works well on shared memory machines
• Very difficult to implement efficiently in distributed memory
• Open problem: Good parallel SGD and SCD for distributed setting…
– Let’s look at a trivial approach
Simplest Distributed Optimization Algorithm Ever Made
• Given N data points & P machines
• Stochastic optimization problem:
• Distribute data:
• Solve problems independently
• Merge solutions
• Why should this work at all????
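The steps above can be sketched as follows, assuming a least-squares objective solved exactly on each machine (an illustrative choice; the slides do not fix the local solver) and data that is already in random order. The function name is hypothetical.

```python
import numpy as np

def distribute_then_average(X, y, P):
    """Split N points across P machines, solve each subproblem
    independently, then average the P solutions."""
    n = X.shape[0]
    parts = np.array_split(np.arange(n), P)          # distribute data
    solutions = [
        np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]   # solve independently
        for idx in parts
    ]
    return np.mean(solutions, axis=0)                # merge by averaging

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
w_true = rng.normal(size=5)
y = X @ w_true                                       # noiseless for illustration
w_avg = distribute_then_average(X, y, P=4)
```

The only communication is sending P solution vectors to one place at the end.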
For Convex Functions…
• Convexity:
• Thus:
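The formulas are not reproduced in these notes. One way to state the argument, in notation assumed here ($\hat\beta_p$ for the solution from machine $p$, $\bar\beta$ for their average, $F$ for a convex objective), is Jensen's inequality:

$$F(\bar\beta) = F\left(\frac{1}{P}\sum_{p=1}^{P}\hat\beta_p\right) \le \frac{1}{P}\sum_{p=1}^{P} F(\hat\beta_p)$$

So the averaged solution is no worse than the average quality of the individual solutions.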
Hopefully…
• Convexity only guarantees:
• But, estimates from independent data!
Figure from John Duchi
Analysis of Distribute-then-Average [Zhang et al. ’12]
• Under some conditions, including strong convexity, lots of smoothness, etc.
• If all data were in one machine, converge at rate:
• With P machines, converge at a rate:
Tradeoffs, tradeoffs, tradeoffs,…
• Distribute-then-Average:
– “Minimum possible” communication
– Bias term can be a killer with finite data
• Issue definitely observed in practice
– Significant issues for L1 problems:
• Parallel SCD or SGD
– Can have much better convergence in practice for multicore setting
– Preserves sparsity (especially SCD)
– But, hard to implement in distributed setting
Alternating Direction Method of Multipliers (ADMM)
• A tool for solving convex problems with separable objectives:
• LASSO example:
• Know how to minimize f(β) or g(β) separately
ADMM Insight
• Try this instead:
• Solve using method of multipliers
• Define the augmented Lagrangian:
– Issue: L2 penalty destroys separability of Lagrangian
– Solution: Replace minimization over (x, z) by alternating minimization
ADMM Algorithm
• Augmented Lagrangian:
• Alternate between:
1. x
2. z
3. y
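The formulas are not reproduced in these notes. In the unscaled form used by Boyd et al. (2011), and assuming the constraint $x = z$ with dual variable $y$ and penalty parameter $\rho > 0$, the augmented Lagrangian and the three alternating steps can be written as:

$$L_\rho(x, z, y) = f(x) + g(z) + y^\top (x - z) + \frac{\rho}{2}\|x - z\|_2^2$$

$$x^{k+1} = \arg\min_x L_\rho(x, z^k, y^k)$$

$$z^{k+1} = \arg\min_z L_\rho(x^{k+1}, z, y^k)$$

$$y^{k+1} = y^k + \rho\,(x^{k+1} - z^{k+1})$$

Each minimization involves only one of $f$ or $g$ plus a quadratic term, which is what makes the alternation tractable.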
ADMM for LASSO
• Objective:
• Augmented Lagrangian:
• Alternate between:
1. β
2. z
3. a
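A minimal sketch of these three steps, under stated assumptions: the split is f(β) = ||Xβ − y||₂², g(z) = λ||z||₁ with constraint β = z, the dual variable is a, and the penalty parameter `rho` and iteration count are illustrative defaults. The β step is a ridge-like linear solve, the z step is soft thresholding, and the a step is a dual ascent update. The function name is hypothetical.

```python
import numpy as np

def admm_lasso(X, y, lam, rho=50.0, n_iters=2000):
    """ADMM for ||X beta - y||^2 + lam * ||z||_1 subject to beta = z."""
    n, d = X.shape
    beta = np.zeros(d)
    z = np.zeros(d)
    a = np.zeros(d)                                   # dual variable
    A = 2.0 * X.T @ X + rho * np.eye(d)               # fixed across iterations
    Xty2 = 2.0 * X.T @ y
    for _ in range(n_iters):
        # 1. beta: minimize ||X b - y||^2 + a^T (b - z) + (rho/2) ||b - z||^2
        beta = np.linalg.solve(A, Xty2 + rho * z - a)
        # 2. z: soft-threshold beta + a / rho at level lam / rho
        v = beta + a / rho
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # 3. a: dual update on the constraint residual beta - z
        a = a + rho * (beta - z)
    return z                                          # z carries the exact zeros

z_hat = admm_lasso(np.eye(2), np.array([3.0, 0.5]), lam=2.0)
```

Returning z rather than β gives a solution with exact zeros, since z is the output of the soft-thresholding step.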
ADMM Wrap-Up
• When does ADMM converge?
– Under very mild conditions
– Basically, f and g must be convex
• ADMM is useful in cases where
– f(x) + g(x) is challenging to solve due to coupling
– We can minimize
• f(x) + (x − a)²
• g(x) + (x − a)²
• Reference
– Boyd, Parikh, Chu, Peleato, Eckstein (2011) “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends in Machine Learning, 3(1):1-122.
What you need to know
• A simple SCD for LASSO (Shooting)
– Your HW, a more efficient implementation!
– Analysis of SCD
• Parallel SCD (Shotgun)
• Other parallel learning approaches for linear models
– Parallel stochastic gradient descent (SGD)
– Parallel independent solutions then averaging
• ADMM
– General idea
– Application to LASSO