Divide & Conquer Methods for Large-Scale Data Analysis
Inderjit S. Dhillon
Dept of Computer Science
UT Austin
International Conference on Machine Learning and Applications
Detroit, MI
Dec 4, 2014
Joint work with C.-J. Hsieh and S. Si
Inderjit S. Dhillon Dept of Computer Science UT Austin Divide & Conquer Methods for Big Data
Machine Learning Applications
Link prediction
LinkedIn.
gene-gene network
fMRI
Image classification
Spam classification
How to develop machine learning algorithms that scale to massive datasets?
Divide & Conquer Method
Divide original (large) problem into smaller instances
Solve the smaller instances
Obtain solution to original problem by combining these solutions
[Figure: the problem is split into subproblems, each subproblem is split again into compute subproblems, and the solutions are merged back up. Example: merge sort.]
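The three steps listed above (divide, solve, combine) can be sketched with merge sort, the example the slide names. This is a minimal illustration in Python; the function names `merge` and `merge_sort` are chosen here and do not come from the talk.

```python
def merge(left, right):
    """Combine step: merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of the two tails is non-empty.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Divide the list in half, sort each half recursively, then merge."""
    if len(items) <= 1:
        return list(items)  # a list of length 0 or 1 is already sorted
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))
```

The subproblems here are independent, which is what makes the divide step cheap; the methods later in the talk spend most of their effort on making the combine step work when subproblems are only approximately independent.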
Divide & Conquer Method
Strategy:
[Figure: D1, D2, D3 at the first level, refined into D11, D12; D21, D22, D23; D31, D32, D33, D34 at the second level]
Three Examples in Machine Learning
classification using kernel SVM
dimensionality reduction for social network analysis
learning graphical model structure
Divide & Conquer Kernel SVM
Linear Support Vector Machines (SVM)
SVM is a widely used classifier.
Given:
Training data points $x_1, \dots, x_n$.
Each $x_i \in \mathbb{R}^d$ is a feature vector.
Consider a simple case with two classes: $y_i \in \{+1, -1\}$.
Goal: Find a hyperplane to separate these two classes:
if $y_i = +1$, $w^T x_i \ge 1 - \xi_i$; if $y_i = -1$, $w^T x_i \le -1 + \xi_i$.
Support Vector Machines (SVM)
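Given training data $x_1, \dots, x_n \in \mathbb{R}^d$ with labels $y_i \in \{+1, -1\}$.
SVM primal problem:
$$\min_{w, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \dots, n.$$
SVM dual problem:
$$\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \ i = 1, \dots, n,$$
where $Q_{ij} = y_i y_j x_i^T x_j$ and $e = [1, \dots, 1]^T$.
At optimum: $w = \sum_{i=1}^{n} \alpha_i y_i x_i$.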
Support Vector Machines (SVM)
What if the data is not linearly separable?
Solution: map data $x_i$ to a higher dimensional (maybe infinite) feature space $\varphi(x_i)$, where the classes are linearly separable.
Kernel trick: $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.
Various types of kernels:
Gaussian kernel: $K(x, y) = e^{-\gamma \|x - y\|_2^2}$;
Linear kernel: $K(x, y) = x^T y$;
Polynomial kernel: $K(x, y) = (\gamma x^T y + c)^d$.
Solve the dual problem for SVM.
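The three kernels listed above can be written down directly. The sketch below uses plain Python lists as vectors; the function names and the default parameter values are illustrative choices, not taken from the talk.

```python
import math


def linear_kernel(x, y):
    """K(x, y) = x^T y."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, gamma=1.0, c=1.0, degree=2):
    """K(x, y) = (gamma * x^T y + c)^degree."""
    return (gamma * linear_kernel(x, y) + c) ** degree


def gaussian_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||_2^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def kernel_matrix(points, kernel):
    """Dense n-by-n matrix of kernel values; this is the O(n^2) object
    that makes kernel SVM training expensive for large n."""
    return [[kernel(p, q) for q in points] for p in points]
```

For example, `kernel_matrix(points, gaussian_kernel)` has ones on the diagonal, and its off-diagonal entries decay toward zero as points move apart.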
Linear vs. Nonlinear SVM
UCI Forest Cover dataset (transformed to a binary version):
Predicting forest cover type from cartographic features (elevation, slope, soil type, ...)
464,810 training samples, 54 features
Transform to binary problem: Lodgepole Pine vs. the remaining 6 types
Method                     Training Time   Prediction Time   Prediction Accuracy
Liblinear                  3.6 secs        1x                76.35%
LibSVM (Gaussian kernel)   1 day           78862x            96.15%
Figure from (Blackard and Dean, 1999)
Kernel SVM
Recently, (Fernandez-Delgado et al., 2014) evaluated 179 classifiers from 17 families using 121 UCI data sets.
Kernel SVM is the second best classifier (slightly outperformed by random forest).
Table from (Fernandez-Delgado et al., 2014)
Scalability for kernel SVM
Challenge for solving kernel SVMs:
Space: $O(n^2)$;
Training Time: $O(n^3)$;
Prediction Time: $O(nd)$ (here $n$ is the number of support vectors).
Our solution: Divide-and-Conquer SVM (DC-SVM)
Method           Training Time   Prediction Time   Prediction Accuracy
Liblinear        3.6 secs        1x                76.35%
LibSVM           1 day           78862x            96.15%
DC-SVM (early)   10 mins         308x              96.12%
DC-Pred++        6 mins          18.8x             95.19%
DC-SVM with a single level – divide step
Partition α into k subsets $\{V_1, \dots, V_k\}$.
Solve each subproblem independently:
$$\min_{\alpha_{(i)}} \ \frac{1}{2} (\alpha_{(i)})^T Q_{(i,i)} \alpha_{(i)} - e^T \alpha_{(i)} \quad \text{s.t.} \quad 0 \le \alpha_{(i)} \le C.$$
Approximate solution for the whole problem:
$\bar{\alpha} = [\bar{\alpha}_{(1)}, \dots, \bar{\alpha}_{(k)}]$.
Space complexity for k subproblems: $O(n^2) \to O(n^2/k^2)$.
Time complexity for k subproblems: $O(n^3) \to O(n^3/k^2)$.
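As a quick check on these figures, the arithmetic can be written out under assumptions the slide does not state: the k clusters have equal size n/k, a dense solver needs quadratic memory and cubic time in the subproblem size, and the subproblems are solved one after another so that only one kernel block is held in memory at a time.

```latex
\begin{align*}
\text{space per subproblem:} \quad & O\!\left((n/k)^2\right) = O\!\left(n^2/k^2\right) \\
\text{time per subproblem:} \quad & O\!\left((n/k)^3\right) = O\!\left(n^3/k^3\right) \\
\text{time for all } k \text{ subproblems:} \quad & k \cdot O\!\left(n^3/k^3\right) = O\!\left(n^3/k^2\right)
\end{align*}
```

For example, with k = 10 the total cost of the divide step drops by a factor of about 100 under these assumptions.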
DC-SVM with a single level – conquer step
Use the subproblem solution $\bar{\alpha}$ to initialize a global coordinate descent solver.
Converges quickly if
(1) $\|\bar{\alpha} - \alpha^*\|$ is small.
(2) Support vectors in $\alpha^*$ are correctly identified in $\bar{\alpha}$.
Question: how to find a partitioning that satisfies both?
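The divide and conquer steps can be sketched together. This is a minimal illustration, not the authors' implementation: it assumes a small dense matrix Q stored as nested lists, uses plain cyclic coordinate descent on the dual (minimize 0.5 * a^T Q a - sum(a) subject to 0 <= a_i <= C), and takes the clusters as given. The names `coordinate_descent`, `dc_svm_single_level`, and `alpha_bar` are hypothetical.

```python
def coordinate_descent(Q, C, alpha0=None, tol=1e-6, max_sweeps=1000):
    """Cyclic coordinate descent for: min 0.5*a^T Q a - sum(a), 0 <= a_i <= C.

    Q is assumed symmetric. Returns (alpha, number_of_sweeps_used).
    """
    n = len(Q)
    alpha = [0.0] * n if alpha0 is None else list(alpha0)
    # Gradient of the objective: g = Q alpha - e.
    grad = [sum(Q[i][j] * alpha[j] for j in range(n)) - 1.0 for i in range(n)]
    for sweep in range(1, max_sweeps + 1):
        max_change = 0.0
        for i in range(n):
            if Q[i][i] <= 0.0:
                continue  # skip degenerate coordinates in this sketch
            # Exact minimizer along coordinate i, clipped to the box [0, C].
            new_value = min(max(alpha[i] - grad[i] / Q[i][i], 0.0), C)
            delta = new_value - alpha[i]
            if delta != 0.0:
                for j in range(n):
                    grad[j] += Q[j][i] * delta  # keep the gradient up to date
                alpha[i] = new_value
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            return alpha, sweep
    return alpha, max_sweeps


def dc_svm_single_level(Q, C, clusters):
    """Divide: solve each diagonal block of Q independently.
    Conquer: warm-start the global solver from the concatenated solution."""
    alpha_bar = [0.0] * len(Q)
    for indices in clusters:
        Q_block = [[Q[i][j] for j in indices] for i in indices]
        block_alpha, _ = coordinate_descent(Q_block, C)
        for position, i in enumerate(indices):
            alpha_bar[i] = block_alpha[position]
    return coordinate_descent(Q, C, alpha0=alpha_bar)
```

When the off-diagonal blocks of Q are small, `alpha_bar` is already close to the global solution, which is the situation condition (1) above describes.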
Data Partitioning
Theorem 1
For a given partition π, the corresponding subproblem solution $\bar{\alpha}$ satisfies
$$0 \le f(\bar{\alpha}) - f(\alpha^*) \le \frac{1}{2} C^2 D(\pi),$$
where $D(\pi) = \sum_{i,j:\, \pi(x_i) \ne \pi(x_j)} |K(x_i, x_j)|$.
Want a partition which
(1) Minimizes D(π).
(2) Has balanced cluster sizes (for efficient training).
Use kernel kmeans (but slow).
Two step kernel kmeans:
Run kernel kmeans on a subset of samples with size $m \ll n$.
Identify the clusters for the rest of data.
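The quantity D(π) in Theorem 1 is simple to evaluate for a candidate partition. The sketch below assumes a dense kernel matrix as nested lists and reads the sum as running over ordered pairs (i, j), so each cross-cluster pair is counted twice; the function name `partition_cost` is hypothetical.

```python
def partition_cost(K, assignment):
    """D(pi): sum of |K[i][j]| over ordered pairs (i, j) whose points
    fall in different clusters. assignment[i] is the cluster id of point i."""
    n = len(K)
    return sum(
        abs(K[i][j])
        for i in range(n)
        for j in range(n)
        if assignment[i] != assignment[j]
    )
```

A partition that keeps strongly related points together leaves only small kernel values across clusters and therefore gives a small D(π), which tightens the bound in Theorem 1.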
Demonstration of the bound
Covertype dataset with 10000 samples and γ = 32 (best in cross validation).
Our data partition scheme leads to a good approximation to the global solution $\alpha^*$.
Quickly identifying support vectors
Divide-and-conquer scheme helps to quickly identify support vectors.
Theorem 2
If $\bar{\alpha}_i = 0$ and
$$\nabla_i f(\bar{\alpha}) > C D(\pi) \left(1 + \sqrt{n}\, K_{\max} / \sqrt{\sigma_n D(\pi)}\right),$$
where $K_{\max} = \max_i K(x_i, x_i)$, then $x_i$ will not be a support vector of the whole problem (i.e., $\alpha^*_i = 0$).
DC-SVM with multiple levels
A tradeoff in choosing k (number of partitions):
Small k ⇒ smaller $\|\bar{\alpha} - \alpha^*\|$ but more time to solve subproblems.
Large k ⇒ larger $\|\bar{\alpha} - \alpha^*\|$ but less time to solve subproblems.
Therefore we propose to run DC-SVM with multiple levels:
(1) Data division.
(2) Solve the leaf level problems.
(3) Solve the intermediate level problems.
(4) Solve the original problem.
Early Prediction
Prediction using the l-th level solution:
faster training time; the prediction accuracy is close to or even better than the global SVM solution.
Prediction: $\text{sign}\left(\sum_{i \in V_{\pi(x)}} y_i \alpha_i K(x_i, x)\right)$.
Prediction time reduced from $O(d(\#SV))$ to $O(d(\#SV)/k)$.
Covtype (k = 256): 78862x → 308x
Still not fast enough for real-time applications!
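Early prediction can be sketched as two steps: route the test point to one cluster, then sum over that cluster's support vectors only. The slide does not say how the cluster of a new point is found; the sketch below assumes it is the cluster whose mean in kernel feature space is nearest, and it recomputes the within-cluster term on every call for brevity (a real implementation would cache it). All function names are hypothetical.

```python
import math


def gaussian_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def nearest_cluster(x, X, clusters, kernel):
    """Index of the cluster whose feature-space mean is closest to x.

    Squared distance to the mean of cluster V, without the K(x, x) term
    shared by all clusters:
        (1/|V|^2) * sum_{j,l in V} K(x_j, x_l) - (2/|V|) * sum_{j in V} K(x, x_j)
    """
    best_index, best_distance = None, float("inf")
    for c, indices in enumerate(clusters):
        size = len(indices)
        cross = sum(kernel(x, X[j]) for j in indices) / size
        within = sum(kernel(X[j], X[l]) for j in indices for l in indices) / (size * size)
        distance = within - 2.0 * cross
        if distance < best_distance:
            best_index, best_distance = c, distance
    return best_index


def early_predict(x, X, y, alpha, clusters, kernel):
    """sign( sum over i in the cluster of x of y_i * alpha_i * K(x_i, x) )."""
    c = nearest_cluster(x, X, clusters, kernel)
    score = sum(
        y[i] * alpha[i] * kernel(X[i], x)
        for i in clusters[c]
        if alpha[i] != 0.0  # only support vectors contribute
    )
    return 1 if score >= 0 else -1
```

Only the support vectors of one cluster are touched in the final sum, which is where the factor of k in the prediction time comes from.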
Illustrative Toy Examples
Two Circle Data: each circle is a class; separable by kernel kmeans.
[Figure panels: 1st cluster, 2nd cluster, DC-SVM (early), RBF SVM]
Two Circle Data: not separable by kernel kmeans.
[Figure panels: 1st cluster, 2nd cluster, 3rd cluster, 4th cluster, DC-SVM (early), RBF SVM]
Methods included in comparisons
DC-SVM: our proposed method for solving the global SVM problem.
DC-SVM (early): our proposed method with early stopping (at 64 clusters).
LIBSVM (Chang and Lin, 2011)
Cascade SVM (Graf et al., 2005)
Fastfood (Le et al., 2013)
LaSVM (Bordes et al., 2005)
LLSVM (Zhang et al., 2012)
SpSVM (Keerthi et al., 2006)
LTPU (Moody and Darken, 1989)
Budgeted SVM (Wang et al., 2012; Djuric et al., 2013)
AESVM (Nandan et al., 2014)
Results with Gaussian kernel.
webspam: $n = 2.8 \times 10^5$, $d = 254$; $C = 8$, $\gamma = 32$
covtype: $n = 4.65 \times 10^5$, $d = 54$; $C = 32$, $\gamma = 32$
mnist8m: $n = 8 \times 10^6$, $d = 784$; $C = 1$, $\gamma = 2^{-21}$
Method           webspam             covtype             mnist8m
                 time(s)   acc(%)    time(s)   acc(%)    time(s)   acc(%)
DC-SVM (early)   670       99.13     672       96.12     10287     99.85
DC-SVM           10485     99.28     11414     96.15     71823     99.93
LIBSVM           29472     99.28     83631     96.15     298900    99.91
LaSVM            20342     99.25     102603    94.39     171400    98.95
CascadeSVM       3515      98.1      5600      89.51     64151     98.3
LLSVM            2853      97.74     4451      84.21     65121     97.64
FastFood         5563      96.47     8550      80.1      14917     96.5
SpSVM            6235      95.3      15113     83.37     121563    96.3
LTPU             4005      96.12     11532     83.25     105210    97.82
Budgeted SVM     2194      98.94     3839      87.83     29266     98.8
AESVM            3027      98.90     3821      87.03     16239     96.6
Results with Gaussian kernel
Time vs Prediction Accuracy
[Figure panels: Covtype dataset, MNIST dataset]
[Figure panels: covtype prediction accuracy, MNIST8m prediction accuracy]
Results with grid of C , γ
Total time on the grid of parameters C , γ:
$C = 2^{-10}, 2^{-6}, 2^{1}, 2^{6}, 2^{10}$
$\gamma = 2^{-10}, 2^{-6}, 2^{1}, 2^{6}, 2^{10}$
Fast Prediction for Kernel SVM
Prediction for SVM
Linear SVM: $y = \text{sign}(w^T x + b)$; $O(d)$.
Kernel SVM: $y = \text{sign}\left(\sum_{i=1}^{n} \alpha_i K(x, x_i)\right)$; $O(d(\#SV))$.
DC-SVM with early prediction: $y = \text{sign}\left(\sum_{i \in V_{\pi(x)}} \alpha_i K(x, x_i)\right)$; $O(d(\#SV/k))$.
DC-Pred++: $O(dt)$; $t \ll \#SV/k$.
Goal: achieve kernel SVM's accuracy using linear SVM prediction time.
DC-SVM with Nystrom Approximation
DC-SVM with early prediction
Drawback: still too many support vectors in each block
Use Nystrom approximation within each block
Nystrom Kernel Approximation
Goal: rank-m approximation $\bar{G}$ to kernel matrix $G$ based on landmark points $u_1, \dots, u_m$:
$$G \approx \bar{G} = C W C^T.$$
Perform prediction on a new point x given the model α:
$$\sum_{i=1}^{n} \alpha_i \bar{K}(x, x_i) = \bar{x}^T W C^T \alpha = \bar{x}^T \beta,$$
where $\bar{x} = [K(x, u_1), \dots, K(x, u_m)]^T$ and $\beta = W C^T \alpha$.
The main computation is to form $\bar{x}$.
Time complexity for prediction: $O(md)$.
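The prediction formula above can be sketched as follows. The slide does not define C and W; the sketch assumes the standard Nystrom construction, where C holds kernel values between training points and landmarks and W is the inverse of the landmark-by-landmark kernel matrix (assumed invertible here, so a linear solve replaces the pseudo-inverse). The helper names are hypothetical.

```python
import math


def gaussian_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        partial = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - partial) / M[r][r]
    return z


def nystrom_beta(X, alpha, landmarks, gamma=1.0):
    """Offline step: beta = W C^T alpha, computed once after training."""
    ct_alpha = [
        sum(a * gaussian_kernel(x, u, gamma) for a, x in zip(alpha, X))
        for u in landmarks
    ]
    K_uu = [[gaussian_kernel(u, v, gamma) for v in landmarks] for u in landmarks]
    return solve_linear_system(K_uu, ct_alpha)  # W = K_uu^{-1} (assumption)


def nystrom_decision_value(x, landmarks, beta, gamma=1.0):
    """Online step: form x_bar with m kernel evaluations, then one dot product."""
    x_bar = [gaussian_kernel(x, u, gamma) for u in landmarks]
    return sum(xb * b for xb, b in zip(x_bar, beta))
```

At prediction time only the m landmark kernel values are computed, which matches the O(md) cost stated above. When every training point is used as a landmark, the approximation is exact and the decision value coincides with the full kernel sum.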
Improvements over Traditional Nystrom Approximation
Traditional Nystrom Approximation emphasizes kernel approximation error $\|G - \bar{G}\|_F$.
We are interested in model approximation error $\|\alpha^* - \bar{\alpha}\|$.
Use α-weighted kmeans centroids as the landmark points.
Add pseudo landmark points for faster prediction.
DC-Pred++
Training:
Prediction:
Kernel SVM Results
Dataset         Metric            DC-Pred++   LDKL     kmeans Nystrom   AESVM    Liblinear
Letter          Prediction Time   12.8x       29x      140x             1542x    1x
(n = 12,000)    Accuracy          95.90%      95.78%   87.58%           80.97%   73.08%
CovType         Prediction Time   18.8x       35x      200x             3157x    1x
(n = 522,910)   Accuracy          95.19%      89.53%   73.63%           75.81%   76.35%
USPS            Prediction Time   14.4x       12.01x   200x             5787x    1x
(n = 7,291)     Accuracy          95.56%      95.96%   92.53%           85.97%   83.65%
Webspam         Prediction Time   20.5x       23x      200x             4375x    1x
(n = 280,000)   Accuracy          98.4%       95.15%   95.01%           98.4%    93.10%
Social Network Analysis
Social Networks
Nodes: individual actors (people or groups of people)
Edges: relationships (social interactions) between nodes
Example: Karate Club Network:
[Figure: the social network of friendships within a 34-person university karate club, studied by the anthropologist Wayne Zachary in the 1970s. People are drawn as small circles, with lines joining pairs who are friends outside the club.]
Multi-Scale Link Prediction
Combine predictions at each level to make final predictions
[Figure: three levels (level 0, level 1, level 2) with components labeled U(0), U1(1), U2(1), U2(2), U3(2); each level yields a prediction, and these are combined into the final prediction]
Experimental Results
Flickr dataset: 1.9M users & 42M links.
Test set: sampled 5K users.
Precision at top-100 & AUC results:
Method                    Prec    AUC
Preferential Attachment   1.02    0.6981
Common Neighbor           7.08    0.8649
Adamic-Adar               7.29    0.8758
Random Walk w/ Restarts   5.49    0.7872
Logistic Regression*      2.54    0.7115
Katz                      7.17    0.8429
MSLP-Katz                 13.34   0.8924
MF                        12.05   0.9078
MSLP-MF                   13.07   0.9145
*using network-based features.
[Figure: precision vs. top-k (k = 10 to 100) for MSLP-Katz, MSLP-MF, Katz, CN, RWR]
Learning Graphical Model Structure
Sparse Inverse Covariance Estimation
Given: n i.i.d. samples $\{y_1, \dots, y_n\}$, $y_i \sim \mathcal{N}(\mu, \Sigma)$.
Goal: Estimate the inverse covariance $\Theta = \Sigma^{-1}$.
The resulting optimization problem:
$$\Theta = \arg\min_{X \succ 0} \left\{ -\log\det X + \mathrm{tr}(SX) + \lambda \|X\|_1 \right\} = \arg\min_{X \succ 0} f(X),$$
where $\|X\|_1 = \sum_{i,j=1}^{p} |X_{ij}|$ (with $p$ the number of variables) and $S = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mu)(y_i - \mu)^T$.
Regularization parameter λ > 0 controls the sparsity.
Real world example: graphical model which reveals the relationships between Senators (Figure from Banerjee et al, 2008).
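To make the objective concrete, the sketch below evaluates f(X) for a small dense matrix, computing log det X through a Cholesky factorization (which also checks that X is positive definite). It only evaluates the objective and is not the QUIC solver; the function names are hypothetical.

```python
import math


def cholesky(A):
    """Lower-triangular L with A = L L^T; raises ValueError if A is not
    positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def sparse_inverse_covariance_objective(X, S, lam):
    """f(X) = -log det X + tr(S X) + lam * sum_{i,j} |X_ij|."""
    p = len(X)
    L = cholesky(X)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(p))  # det X = prod(L_ii)^2
    trace_SX = sum(S[i][j] * X[j][i] for i in range(p) for j in range(p))
    l1_norm = sum(abs(X[i][j]) for i in range(p) for j in range(p))
    return -log_det + trace_SX + lam * l1_norm
```

The Cholesky step alone costs time cubic in the number of variables, so even evaluating the objective becomes expensive in high dimensions.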
Divide-and-Conquer QUIC
How can we solve high dimensional problems?
Time complexity is $O(p^3)$, so it is computationally expensive to solve directly.
Solution: Divide & Conquer!
[Figure panels: Θ*, Θ from level-1 clusters, Θ from level-2 clusters]
Performance of Divide & Conquer QUIC
[Figure panels: (a) Time for Leukemia, p = 1,255; (b) Time for Climate, p = 10,512]
Conclusions
Divide & Conquer approach for big data problems
Divide & Conquer Kernel SVM: fast training and fast prediction
Social Network Analysis: Multi-Scale Link Prediction
Sparse Inverse Covariance Estimation: Divide & Conquer QUIC
Interesting questions
What other data analysis problems can benefit from a divide & conquer approach? Regression? Matrix completion? Multi-label learning?
What are the corresponding algorithms?
Theoretical/statistical guarantees?
Can one exploit the multiplicity of local and global models for better prediction?
Parallelization? Needs dynamic load balancing, asynchronous parallelism.
References
Divide and Conquer Kernel SVM
[1] C.-J. Hsieh, S. Si and I. S. Dhillon. A Divide-and-Conquer Solver for Kernel Support Vector Machines. ICML, 2014.
[2] S. Si, C.-J. Hsieh and I. S. Dhillon. Memory Efficient Kernel Approximation. ICML, 2014.
[3] C.-J. Hsieh, S. Si and I. S. Dhillon. Fast Prediction for Large-Scale Kernel Machines. NIPS, 2014.
Divide and Conquer Link-Prediction/Eigen-decomposition
[4] D. Shin, S. Si and I. S. Dhillon. Multi-Scale Link Prediction. CIKM, 2012.
[5] S. Si, D. Shin, I. S. Dhillon and B. Parlett. Multi-Scale Spectral Decomposition of Massive Graphs. NIPS, 2014.
Divide and Conquer Sparse Inverse Covariance Estimation
[6] C.-J. Hsieh, M. Sustik, I. S. Dhillon and P. Ravikumar. Sparse Inverse Covariance Matrix Estimation using Quadratic Approximation. NIPS, 2011.
[7] C.-J. Hsieh, I. S. Dhillon, P. Ravikumar and A. Banerjee. A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation. NIPS, 2012.
[8] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. Ravikumar and R. A. Poldrack. BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables. NIPS, 2013.
[9] C.-J. Hsieh, I. S. Dhillon, P. Ravikumar, S. Becker and P. A. Olsen. QUIC & DIRTY: A Quadratic Approximation Approach for Dirty Statistical Models. NIPS, 2014.