Using Analytic QP and Sparseness to Speed Training of
Support Vector Machines
John C. Platt
Presented by: Travis Desell
Overview
• Introduction
– Motivation
– General SVMs
– General SVM training
– Related Work
• Sequential Minimal Optimization (SMO)
– Choosing the smallest optimization problem
– Solving the smallest optimization problem
• Benchmarks
• Conclusion
• Remarks & Future Work
• References
Motivation
• Traditional SVM training algorithms
– Require a quadratic programming (QP) package
– SVM training is slow, especially for large problems
• Sequential Minimal Optimization (SMO)
– Requires no QP package
– Easy to implement
– Often faster
– Good scalability properties
General SVMs
u = Σ_i α_i y_i K(x_i, x) − b    (1)
• u : SVM output
• α_i : weights to blend the different kernel evaluations
• y_i in {-1, +1} : desired output
• b : threshold
• x_i : stored training example (vector)
• x : input (vector)
• K : kernel function to measure the similarity of x_i to x
General SVMs (2)
• For linear SVMs, K is linear, thus (1) can be expressed as the dot product of w and x minus the threshold:
u = w · x − b    (2)
w = Σ_i α_i y_i x_i    (3)
• Where w, x, and x_i are vectors
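As a concrete illustration of (1)-(3), here is a minimal Python sketch of both output forms. The names (rbf_kernel, svm_output, gamma, etc.) are illustrative choices, not notation from the paper, and the Gaussian kernel is just one possible K.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """Gaussian kernel; gamma is an arbitrary illustrative value."""
    d = x1 - x2
    return np.exp(-gamma * np.dot(d, d))

def svm_output(x, train_x, alphas, ys, b, kernel=rbf_kernel):
    """Kernel form (1): u = sum_i alpha_i * y_i * K(x_i, x) - b."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alphas, ys, train_x)) - b

def linear_svm_output(x, train_x, alphas, ys, b):
    """Linear form (2)-(3): fold the expansion into a single weight vector w."""
    w = sum(a * y * xi for a, y, xi in zip(alphas, ys, train_x))  # eq. (3)
    return float(np.dot(w, x)) - b                                # eq. (2)
```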
General SVM Training
• Training an SVM means finding the α_i, expressed as minimizing a dual quadratic form:
min_α Ψ(α) = min_α ½ Σ_i Σ_j y_i y_j K(x_i, x_j) α_i α_j − Σ_i α_i    (4)
• Subject to the box constraints:
0 ≤ α_i ≤ C, for all i    (5)
• And the linear equality constraint:
Σ_i y_i α_i = 0    (6)
• The α_i are Lagrange multipliers of a primal QP problem: there is a one-to-one correspondence between each α_i and each training example x_i
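For reference, a small sketch of the dual objective (4) and the feasibility checks (5)-(6). It assumes the full n × n kernel matrix K has been precomputed, which is only practical here for illustration; the function names are mine.

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Psi(alpha) from (4): 0.5 * sum_ij y_i y_j K_ij alpha_i alpha_j - sum_i alpha_i."""
    v = alpha * y
    return 0.5 * v @ K @ v - alpha.sum()

def is_feasible(alpha, y, C, tol=1e-8):
    """Box constraint (5) and linear equality constraint (6), within a small tolerance."""
    return (np.all(alpha >= -tol) and np.all(alpha <= C + tol)
            and abs(float(alpha @ y)) <= tol)
```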
General SVM Training (2)
• SMO solves the QP expressed in (4-6)
• Terminates when all of the Karush-Kuhn-Tucker (KKT) optimality conditions are fulfilled:
α_i = 0  →  y_i u_i ≥ 1    (7)
0 < α_i < C  →  y_i u_i = 1    (8)
α_i = C  →  y_i u_i ≤ 1    (9)
• Where ui is the SVM output for the ith training example
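The termination test can be written as a per-example KKT check within a tolerance. The sketch below is illustrative; it assumes α_i, u_i and C are available and uses eps for the tolerance ε discussed later in the talk.

```python
def violates_kkt(alpha_i, y_i, u_i, C, eps=1e-3):
    """True if example i violates the KKT conditions (7)-(9) by more than eps."""
    r = y_i * u_i - 1.0              # how far y_i*u_i is from the margin
    if alpha_i < eps:                # alpha_i ~ 0   -> require y_i*u_i >= 1
        return r < -eps
    if alpha_i > C - eps:            # alpha_i ~ C   -> require y_i*u_i <= 1
        return r > eps
    return abs(r) > eps              # 0 < alpha_i < C -> require y_i*u_i == 1
```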
Related Work
• “Chunking” [9]
– Removing training examples with α_i = 0 does not change the solution.
– Breaks the large QP problem down into smaller sub-problems to identify the non-zero α_i.
– Each QP sub-problem consists of every non-zero α_i from the previous sub-problem, combined with the M worst examples that violate (7-9), for some M [1].
– The last step solves the entire QP problem, as all non-zero α_i have been found.
– Cannot handle large-scale training problems if standard QP techniques are used; Kaufman [3] describes a QP algorithm to overcome this.
Related Work (2)
• Decomposition [6]
– Breaks the large QP problem into smaller QP sub-problems.
– Osuna et al. [6] suggest using a fixed-size matrix for every sub-problem, which allows very large training sets.
– Joachims [2] suggests adding and subtracting examples according to heuristics for rapid convergence.
– Until SMO, decomposition required a numerical QP library, which can be costly or slow.
Sequential Minimal Optimization
• SMO decomposes the overall QP problem (4-6) into fixed-size QP sub-problems.
• Chooses the smallest optimization problem (SOP) at each step.
– This smallest problem involves exactly two elements of α, because of the linear equality constraint (6).
• SMO repeatedly chooses two elements of α to jointly optimize until the overall QP problem is solved.
Choosing the SOP
• Heuristic-based approach
• Terminates when the entire training set obeys (7-9) within a tolerance ε (typically ε ≤ 10^-3)
• Repeatedly finds α_1 and α_2 and optimizes them until termination
Finding α_1
• “First choice heuristic”
– Searches through the examples most likely to violate the KKT conditions (the non-bound subset)
– α_i at the bounds are likely to stay there; non-bound α_i will move as others are optimized
• “Shrinking heuristic”
– Finds examples which fulfill (7-9) by more than the worst example failed them
– Ignores these examples until a final pass at the end, which ensures all examples fulfill (7-9)
Finding α_2
• Chosen to maximize the size of the step taken during the joint optimization of α_1 and α_2
• A cached error value E_i is kept for each non-bound example
• If E_1 is positive, chooses the α_2 with minimum E_2
• If E_1 is negative, chooses the α_2 with maximum E_2
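A sketch of this second-choice heuristic over the cached errors. Names and index handling are illustrative, and the fallback hierarchies Platt uses when this choice makes no progress are omitted.

```python
def choose_second(i1, E, non_bound):
    """Pick i2 among the non-bound examples to maximize the step proxy |E_1 - E_2|."""
    E1 = E[i1]
    candidates = [i for i in non_bound if i != i1]
    if not candidates:
        return None
    if E1 > 0:
        return min(candidates, key=lambda i: E[i])   # E1 positive: take the smallest E2
    return max(candidates, key=lambda i: E[i])       # E1 negative: take the largest E2
```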
Solving the SOP
• Computes the minimum along the direction of the linear equality constraint:
α_2^new = α_2 + y_2 (E_1 − E_2) / (K(x_1, x_1) + K(x_2, x_2) − 2 K(x_1, x_2))    (10)
E_i = u_i − y_i    (11)
• Clips α_2^new within [L, H]:
L = max(0, α_2 + s α_1 − ½ (s + 1) C)    (12)
H = min(C, α_2 + s α_1 − ½ (s − 1) C)    (13)
s = y_1 y_2    (14)
• Calculates α_1^new:
α_1^new = α_1 + s (α_2 − α_2^new,clipped)    (15)
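Putting (10)-(15) together, here is a minimal sketch of the analytic two-variable step. It assumes the curvature η = K11 + K22 − 2·K12 is positive (Platt handles the degenerate case by evaluating the objective at L and H, which is omitted here), and the argument names are mine rather than the paper's.

```python
def analytic_step(a1, a2, y1, y2, E1, E2, K11, K22, K12, C):
    """Solve the two-variable sub-problem analytically, per (10)-(15).
    Returns (a1_new, a2_new_clipped), or None if the feasible segment is empty
    or the curvature is non-positive (degenerate case, not handled in this sketch)."""
    s = y1 * y2                                    # eq. (14)
    L = max(0.0, a2 + s * a1 - 0.5 * (s + 1) * C)  # eq. (12)
    H = min(C,   a2 + s * a1 - 0.5 * (s - 1) * C)  # eq. (13)
    if L >= H:
        return None
    eta = K11 + K22 - 2.0 * K12                    # curvature along the constraint line
    if eta <= 0:
        return None
    a2_new = a2 + y2 * (E1 - E2) / eta             # eq. (10), unconstrained minimum
    a2_new = max(L, min(H, a2_new))                # clip into [L, H]
    a1_new = a1 + s * (a2 - a2_new)                # eq. (15)
    return a1_new, a2_new
```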
Benchmarks
• UCI Adult: the SVM is given 14 attributes from a census form and asked to predict whether the household income is greater than $50k. The 8 categorical and 6 continuous attributes are encoded as 123 sparse binary attributes.
• Web: classify if a web page is in a category or not. 300 sparse binary keyword attributes.
• MNIST: One classifier is trained. 784-dimensional, non-binary vectors stored as sparse vectors.
Description of Benchmarks
• Web and Adult are trained with linear and Gaussian SVMs.
• Performed with and without sparse inputs, with and without kernel caching
• PCG chunking always uses caching
Benchmarking SMO
Conclusions
• PCG chunking is slower than SMO; SMO ignores examples whose Lagrange multipliers are at C.
• The overhead of PCG chunking is not in the kernel evaluations (kernel optimizations do not greatly affect its time).
Conclusions (2)
• SVMlight solves 10-dimensional QP sub-problems.
• The differences are mostly due to kernel optimizations and numerical QP overhead.
• SMO is faster on linear problems due to linear SVM folding, but SVMlight could potentially use this as well.
• SVMlight benefits from its complex kernel cache, while SMO has no such cache and thus does not benefit at large problem sizes.
Remarks & Future Work
• Heuristic-based approach to finding α_1 and α_2 to optimize:
– Possible to determine optimal choice strategy to minimize the number of steps?
• Proof that SMO always minimizes the QP problem?
References
• [1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
• [2] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184. MIT Press, 1998.
References (2)
• [3] L. Kaufman. Solving the quadratic programming problem arising in support vector classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 147–168. MIT Press, 1998.
• [6] E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. In Proc. IEEE Neural Networks in Signal Processing ’97, 1997.
References (3)
• [9] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.