Lecture Slides for
INTRODUCTION TO Machine Learning
ETHEM ALPAYDIN © The MIT Press, 2004
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml

CHAPTER 10: Linear Discrimination
Likelihood- vs. Discriminant-based Classification
n Likelihood-based: Assume a model for p(x|Ci), use Bayes’ rule to calculate P(Ci|x)
Choose Ci if gi(x) = log P(Ci|x) is maximum
n Discriminant-based: Assume a model for the discriminant gi(x|Φi); no density estimation
¨ Estimating the boundaries is enough; no need to accurately estimate the densities inside the boundaries
Linear Discriminant
n Linear discriminant:
n Advantages:
¨ Simple: O(d) space/computation
¨ Knowledge extraction: Weighted sum of attributes; positive/negative weights, magnitudes (credit scoring)
¨ Optimal when p(x|Ci) are Gaussian with shared cov matrix; useful when classes are (almost) linearly separable
$$g_i\!\left(\mathbf{x} \mid \mathbf{w}_i, w_{i0}\right) = \mathbf{w}_i^T\mathbf{x} + w_{i0} = \sum_{j=1}^{d} w_{ij}\,x_j + w_{i0}$$
Generalized Linear Model
n Quadratic discriminant:
n Instead of moving to a more complex model, we can keep a linear classifier by adding higher-order (product) terms as new features.
n Map from x to z using nonlinear basis functions and use a linear discriminant in z-space
n The linear function defined in the z space corresponds to a non-linear function in the x space.
$$g_i\!\left(\mathbf{x} \mid \mathbf{W}_i, \mathbf{w}_i, w_{i0}\right) = \mathbf{x}^T\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$$

Example basis functions: $z_1 = x_1,\; z_2 = x_2,\; z_3 = x_1^2,\; z_4 = x_2^2,\; z_5 = x_1 x_2$

Generalized linear discriminant in z-space:
$$g_i(\mathbf{x}) = \sum_{j=1}^{k} w_{ij}\,\phi_j(\mathbf{x})$$
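To make the mapping concrete, here is a minimal sketch (my own illustration, not from the slides) that expands x into quadratic basis functions and fits an ordinary linear classifier in z-space; the resulting boundary is linear in z but quadratic in the original x:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                         # 2-d inputs x = (x1, x2)
r = (X[:, 0]**2 + X[:, 1]**2 > 1.0).astype(int)       # circular boundary: not linearly separable in x

# z = phi(x) contains x1, x2 and the product terms x1^2, x1*x2, x2^2
phi = PolynomialFeatures(degree=2, include_bias=False)
Z = phi.fit_transform(X)

clf = LogisticRegression().fit(Z, r)                  # linear discriminant in z-space
print(clf.score(Z, r))                                # near 1.0: linear in z handles a quadratic boundary
```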
Two Classes
Define the difference of the two discriminants:

$$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) = \left(\mathbf{w}_1^T\mathbf{x} + w_{10}\right) - \left(\mathbf{w}_2^T\mathbf{x} + w_{20}\right) = \left(\mathbf{w}_1 - \mathbf{w}_2\right)^T\mathbf{x} + \left(w_{10} - w_{20}\right) = \mathbf{w}^T\mathbf{x} + w_0$$

Choose C1 if g1(x) > g2(x) and C2 if g2(x) > g1(x); equivalently:

$$\text{choose } \begin{cases} C_1 & \text{if } g(\mathbf{x}) > 0 \\ C_2 & \text{otherwise} \end{cases}$$
Learning the Discriminants
As we have seen before, when p (x | Ci ) ~ N ( μi , ∑),
the optimal discriminant is a linear one:
$$g_i\!\left(\mathbf{x} \mid \mathbf{w}_i, w_{i0}\right) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$$
$$\mathbf{w}_i = \Sigma^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\tfrac{1}{2}\,\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i + \log P(C_i)$$
So, estimate µi and Σ from data, and plug into the gi’s to find the linear discriminant functions. Of course any way of learning can be used (e.g. perceptron, gradient descent, logistic regression...).
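A minimal sketch of this plug-in estimation (my own illustration; the function names are made up):

```python
import numpy as np

def fit_linear_discriminants(X, y):
    """Estimate class means, a shared covariance, and the plug-in w_i, w_i0."""
    classes = np.unique(y)
    n, d = X.shape
    Sigma = np.zeros((d, d))
    stats = {}
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma += (Xc - mu).T @ (Xc - mu)        # pooled within-class scatter
        stats[c] = (mu, len(Xc) / n)            # class mean and prior P(C_i)
    Sigma /= n                                  # shared covariance estimate
    Sigma_inv = np.linalg.inv(Sigma)
    W = {}
    for c, (mu, prior) in stats.items():
        w = Sigma_inv @ mu                                      # w_i = Sigma^{-1} mu_i
        w0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)         # w_i0 as above
        W[c] = (w, w0)
    return W

def predict(W, x):
    # choose the class with the largest g_i(x) = w_i^T x + w_i0
    return max(W, key=lambda c: W[c][0] @ x + W[c][1])
```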
n When K > 2:
¨ Combine K two-class problems, each one separating one class from all other classes
Multiple Classes
$$g_i\!\left(\mathbf{x} \mid \mathbf{w}_i, w_{i0}\right) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$$
Why? Any problem? The decision regions based on the gi are convex (indicated in blue in the figure); the distance of x to boundary i is |gi(x)|/||wi||. This assumes each class is linearly separable from the rest; a reject option may be used when no gi(x) is positive.
$$\text{Choose } C_i \text{ if } g_i(\mathbf{x}) = \max_{j=1}^{K} g_j(\mathbf{x})$$

How to train? How to decide on a test instance?
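A small one-vs-rest sketch (my own illustration, using logistic regression as the base linear discriminant) of training K two-class problems and deciding by the maximum score:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_rest(X, y):
    # one linear discriminant g_i per class, separating class i from the rest
    return {c: LogisticRegression().fit(X, (y == c).astype(int)) for c in np.unique(y)}

def predict_one_vs_rest(models, X):
    labels = sorted(models)
    # g_i(x): the signed score (decision_function) of each binary classifier
    scores = np.column_stack([models[c].decision_function(X) for c in labels])
    return np.array([labels[i] for i in scores.argmax(axis=1)])
```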
Pairwise Separation
$$g_{ij}\!\left(\mathbf{x} \mid \mathbf{w}_{ij}, w_{ij0}\right) = \mathbf{w}_{ij}^T\mathbf{x} + w_{ij0}$$
$$g_{ij}(\mathbf{x}) \begin{cases} > 0 & \text{if } \mathbf{x} \in C_i \\ \le 0 & \text{if } \mathbf{x} \in C_j \\ \text{don't care} & \text{otherwise} \end{cases}$$

$$\text{Choose } C_i \text{ if } g_{ij}(\mathbf{x}) > 0,\; \forall j \ne i$$
If the classes are not linearly separable from the union of the other classes, pairwise separation can still be used; it requires K(K-1)/2 linear discriminants.
A Bit of Geometry
n Dot product and projection:
$$\langle \mathbf{w}, \mathbf{p} \rangle = \mathbf{w}^T\mathbf{p} = \|\mathbf{w}\|\,\|\mathbf{p}\|\cos\theta$$
n Projection of p onto w:
$$\|\mathbf{p}\|\cos\theta = \frac{\mathbf{w}^T\mathbf{p}}{\|\mathbf{w}\|}$$
(Figure: vectors w and p with angle θ between them.)
Geometry
The points x on the separating hyperplane have g(x) = wTx + w0 = 0, so on the boundary wTx = -w0. Thus these points all have the same projection onto the weight vector w, namely wTx/||w|| (by the definition of projection and the dot product), and this projection equals -w0/||w||. Hence:
The perpendicular distance of the boundary to the origin is |w0|/||w||. The distance of any point x to the decision boundary is |g(x)|/||w||.
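A quick numerical check of these two distances (my own toy numbers):

```python
import numpy as np

w = np.array([3.0, 4.0])                # example weight vector, ||w|| = 5
w0 = -10.0
x = np.array([4.0, 2.0])

g = w @ x + w0                          # g(x) = w^T x + w0 = 12 + 8 - 10 = 10
print(abs(w0) / np.linalg.norm(w))      # 2.0 -> distance of the boundary to the origin
print(abs(g) / np.linalg.norm(w))       # 2.0 -> distance of x to the boundary
```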
Support Vector Machines
n Vapnik and Chervonenkis – 1963
n Boser, Guyon and Vapnik – 1992 (kernel trick)
n Cortes and Vapnik – 1995 (soft margin)
n The SVM is a machine learning algorithm which
¨ solves classification problems
¨ uses a flexible representation of the class boundaries
¨ implements automatic complexity control to reduce overfitting
¨ has a single global minimum which can be found in polynomial time
n It is popular because
• it can be easy to use
• it often has good generalization performance
• the same algorithm solves a variety of problems with little tuning
SVM Concepts
n Convex programming and duality
n Using maximum margin to control complexity
n Representing non-linear boundaries with feature expansion
n The kernel trick for efficient optimization
Linear Separators
n Which of the linear separators is optimal?
Classification Margin
n Distance from example xi to the separator is
$$r = \frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|}$$
n Examples closest to the hyperplane are support vectors.
n Margin ρ of the separator is the distance between the support vectors from the two classes.
Maximum Margin Classification
n Maximizing the margin is good according to intuition.
n Implies that only support vectors matter; other training examples are ignorable.
SVM as 2-class Linear Classifier
$$X = \left\{\mathbf{x}^t, r^t\right\}_{t=1}^{N} \quad \text{where } r^t = \begin{cases} +1 & \text{if } \mathbf{x}^t \in C_1 \\ -1 & \text{if } \mathbf{x}^t \in C_2 \end{cases}$$

Find w and w0 such that
$$\mathbf{w}^T\mathbf{x}^t + w_0 \ge +1 \quad \text{for } r^t = +1$$
$$\mathbf{w}^T\mathbf{x}^t + w_0 \le -1 \quad \text{for } r^t = -1$$
Optimal separating hyperplane: Separating hyperplane maximizing the margin
(Cortes and Vapnik, 1995; Vapnik, 1995)
Note the condition >= 1 (not just 0). We can always do this if the classes are linearly separable by rescaling w and w0, without affecting the separating hyperplane:
$$\mathbf{w}^T\mathbf{x} + w_0 = 0$$
Optimal Separating Hyperplane
(Cortes and Vapnik, 1995; Vapnik, 1995)
Must satisfy:
$$\mathbf{w}^T\mathbf{x}^t + w_0 \ge +1 \quad \text{for } r^t = +1$$
$$\mathbf{w}^T\mathbf{x}^t + w_0 \le -1 \quad \text{for } r^t = -1$$
which can be rewritten as
$$r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge +1$$
Maximizing the Margin
The distance from the discriminant to the closest instances on either side is called the margin. In general (from the geometry above), the distance of x to the discriminant is
$$d = \frac{|g(\mathbf{x})|}{\|\mathbf{w}\|}$$
So, for the support vectors, which satisfy |g(x)| = 1,
$$d = \frac{|\pm 1|}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}$$
and the margin is
$$\rho = 2d = \frac{2}{\|\mathbf{w}\|}$$
To maximize the margin, minimize the Euclidean norm of the weight vector w.
Maximizing the Margin-Alternate explanation
n Distance from the discriminant to the closest instances on either side is called the margin
n Distance of x to the hyperplane is
n We require that this distance is at least some value ρ > 0.
n We would like to maximize ρ, but we can do so in infinitely many ways by scaling w.
n For a unique sol’n, we fix ρ||w||=1 and minimize ||w||.
Distance of xt to the hyperplane:
$$\frac{\left|\mathbf{w}^T\mathbf{x}^t + w_0\right|}{\|\mathbf{w}\|}$$
Requirement:
$$\frac{r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right)}{\|\mathbf{w}\|} \ge \rho, \quad \forall t$$
$$\min \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to } r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge +1, \;\forall t$$

Unconstrained problem using Lagrange multipliers αt (non-negative numbers):
$$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t\left[r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) - 1\right]$$
The solution, if it exists, is always at a saddle point of the Lagrangian: Lp should be minimized with respect to w and w0 and maximized with respect to the αt.
$$\min \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to } r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge +1, \;\forall t$$

$$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t\left[r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) - 1\right] = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) + \sum_{t=1}^{N} \alpha^t$$
This convex quadratic optimization problem can be solved in the dual form, where we substitute the following stationarity (local minimum) conditions and maximize with respect to the αt:
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t\mathbf{x}^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_{t=1}^{N} \alpha^t r^t = 0$$
from: http://math.oregonstate.edu/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/lagrang/lagrang.html
Substituting w = Σt αt rt xt into Lp gives the dual Ld:

$$L_d = \tfrac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T\sum_t \alpha^t r^t\mathbf{x}^t - w_0\sum_t \alpha^t r^t + \sum_t \alpha^t = -\tfrac{1}{2}\mathbf{w}^T\mathbf{w} + \sum_t \alpha^t$$
$$\;\;= -\tfrac{1}{2}\sum_t\sum_s \alpha^t\alpha^s r^t r^s\left(\mathbf{x}^t\right)^T\mathbf{x}^s + \sum_t \alpha^t$$

subject to
$$\sum_t \alpha^t r^t = 0 \quad \text{and} \quad \alpha^t \ge 0, \;\forall t$$
• Maximize Ld with respect to the αt only.
• This is a quadratic programming problem.
• Thanks to the convexity of the problem, the optimal value of Lp equals that of Ld.
n To every convex program corresponds a dual.
n Solving the original (primal) problem is equivalent to solving the dual.
Support vectors are those for which
$$r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) = 1$$
$$L_d = -\tfrac{1}{2}\sum_t\sum_s \alpha^t\alpha^s r^t r^s\left(\mathbf{x}^t\right)^T\mathbf{x}^s + \sum_t \alpha^t$$
subject to
$$\sum_t \alpha^t r^t = 0 \quad \text{and} \quad \alpha^t \ge 0, \;\forall t$$

• Maximize Ld with respect to the αt only: a quadratic programming problem.
• Thanks to the convexity of the problem, the optimal value of Lp equals that of Ld.
• The size of the dual depends on N, not on d.
Calculating the parameters w and w0
Note that for each training instance:
¨ either the constraint is exactly satisfied (= 1), and αt can be non-zero,
¨ or the constraint is strictly satisfied (> 1), and then αt must be zero.
n Once we solve for the αt, we see that most of them are 0 and only a small number have αt > 0
¨ the corresponding xt are called the support vectors
$$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t\left[r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) - 1\right]$$
Calculating the parameters w and w0
Once we have the Lagrange multipliers, we can compute w and w0, where SV denotes the set of support vectors:
$$\mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t\mathbf{x}^t = \sum_{t \in SV} \alpha^t r^t\mathbf{x}^t$$
$$w_0 = r^t - \mathbf{w}^T\mathbf{x}^t \quad \text{(using any support vector } \mathbf{x}^t\text{)}$$
n We make decisions by comparing each query x with only the support vectors
n Choose class C1 if the result is positive, C2 if it is negative
$$y = \operatorname{sign}\!\left(\mathbf{w}^T\mathbf{x} + w_0\right) = \operatorname{sign}\!\left(\sum_{t \in SV} \alpha^t r^t\left(\mathbf{x}^t\right)^T\mathbf{x} + w_0\right)$$
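The following is a minimal sketch (not from the slides, and assuming the cvxopt package as a generic QP solver) of putting these pieces together: solve the hard-margin dual, keep the non-zero αt as support vectors, and recover w, w0, and the decision rule.

```python
import numpy as np
from cvxopt import matrix, solvers

def fit_hard_margin_svm(X, r):
    """X: (N, d) inputs; r: (N,) labels in {-1, +1}; assumes separable data."""
    N = X.shape[0]
    K = X @ X.T                                             # Gram matrix of inner products
    P = matrix(np.outer(r, r) * K)                          # P[t,s] = r^t r^s (x^t)' x^s
    q = matrix(-np.ones(N))                                 # minimize (1/2)a'Pa - sum(a) = -Ld
    G = matrix(-np.eye(N)); h = matrix(np.zeros(N))         # -a_t <= 0, i.e. a_t >= 0
    A = matrix(r.reshape(1, -1).astype(float)); b = matrix(0.0)   # sum_t a_t r^t = 0
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    sv = alpha > 1e-6                                       # support vectors: non-zero alphas
    w = ((alpha[sv] * r[sv])[:, None] * X[sv]).sum(axis=0)  # w = sum_t a_t r^t x^t
    w0 = np.mean(r[sv] - X[sv] @ w)                         # w0 from the SVs, averaged
    return w, w0, sv

def predict(w, w0, X):
    return np.sign(X @ w + w0)                              # y = sign(w'x + w0)
```

Averaging w0 over the support vectors, rather than using a single one as on the slide, is a numerically safer variant of the same formula.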
Not-Linearly Separable Case
n In the non-separable case, the previous approach has no feasible solution
¨ The dual objective (Ld) grows arbitrarily large
n Relax the constraints, but only when necessary
¨ Introduce a further cost for doing so
Soft Margin Hyperplane
n Not linearly separable
$$r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) \ge 1 - \xi^t, \qquad \xi^t \ge 0$$

Three cases (shown in the figure):
Case 1: ξt = 0 (correctly classified and outside the margin)
Case 2: 0 < ξt < 1 (correctly classified but inside the margin)
Case 3: ξt ≥ 1 (misclassified)
Soft Margin Hyperplane
n Define the soft error as
$$\sum_t \xi^t$$
which is an upper bound on the number of training errors.
n The new primal is
$$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \xi^t - \sum_t \alpha^t\left[r^t\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) - 1 + \xi^t\right] - \sum_t \mu^t\xi^t$$
where the μt are Lagrange multipliers that enforce the positivity of the ξt.
n Parameter C can be viewed as a way to control overfitting: it trades off the relative importance of maximizing the margin and fitting the training data.
Soft Margin Hyperplane
n The new dual is the same as the old one:
$$L_d = -\tfrac{1}{2}\sum_t\sum_s \alpha^t\alpha^s r^t r^s\left(\mathbf{x}^t\right)^T\mathbf{x}^s + \sum_t \alpha^t$$
subject to
$$\sum_t \alpha^t r^t = 0 \quad \text{and} \quad 0 \le \alpha^t \le C, \;\forall t$$
n As in the separable case, instances that are not support vectors vanish with their αt = 0 and the remaining ones define the boundary.
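As a usage note (my own example, not from the slides), the soft-margin machine is what standard libraries implement; in scikit-learn the C below is exactly the box constraint 0 ≤ αt ≤ C of this dual:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),     # two overlapping Gaussian classes,
               rng.normal(+1.0, 1.0, (50, 2))])    # so the data are not separable
r = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, r)        # smaller C -> wider margin, more slack
print(clf.n_support_)                              # number of support vectors per class
print(clf.score(X, r))                             # training accuracy
```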
Kernel Functions in SVM
n Overfitting can be handled: even with many parameters, a large margin keeps the classifier simple
n “All” that is left is efficiency
n Solution: kernel trick
Kernel Functions
n Instead of trying to fit a non-linear model, we can
¨ map the problem to a new space through a non-linear transformation, and
¨ use a linear model in the new space
n Say we have the new space calculated by the basis functions z = φ(x), where zj = φj(x), j = 1, ..., k
(d-dimensional x space → k-dimensional z space)
φ(x) = [φ1(x) φ2(x) ... φk(x)]
Kernel Functions
$$g(\mathbf{x}) = \sum_{j=1}^{k} w_j\,\phi_j(\mathbf{x}) + b$$
$$g(\mathbf{x}) = \sum_{j} w_j\,\phi_j(\mathbf{x}) \quad \text{if we assume } \phi_0(\mathbf{x}) = 1 \text{ for all } \mathbf{x}$$
Kernel Machines
n Preprocess input x by basis functions: z = φ(x), g(z) = wTz, so g(x) = wTφ(x)
n SVM solution: the weight vector is a linear combination of the mapped training instances,
$$\mathbf{w} = \sum_t \alpha^t r^t\mathbf{z}^t = \sum_t \alpha^t r^t\boldsymbol{\phi}\!\left(\mathbf{x}^t\right)$$
so the discriminant becomes
$$g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t\,\boldsymbol{\phi}\!\left(\mathbf{x}^t\right)^T\boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t\,K\!\left(\mathbf{x}^t, \mathbf{x}\right)$$
n Find kernel functions K(x, y) such that the inner product of basis functions is replaced by a kernel function evaluated in the original input space
Kernel Functions
n Consider polynomials of degree q:
$$K(\mathbf{x}, \mathbf{y}) = \left(\mathbf{x}^T\mathbf{y} + 1\right)^q$$
(Cherkassky and Mulier, 1998)
For q = 2 and d = 2:
$$K(\mathbf{x}, \mathbf{y}) = \left(\mathbf{x}^T\mathbf{y} + 1\right)^2 = \left(x_1 y_1 + x_2 y_2 + 1\right)^2$$
$$= 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$
which corresponds to the basis functions
$$\boldsymbol{\phi}(\mathbf{x}) = \left[1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; \sqrt{2}\,x_1 x_2,\; x_1^2,\; x_2^2\right]^T$$
so that K(x, y) = φ(x)Tφ(y).
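A quick numerical check of this identity (my own toy example, plain NumPy):

```python
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     np.sqrt(2)*x1*x2, x1**2, x2**2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 3.0])

lhs = (x @ y + 1.0) ** 2      # kernel computed in the original 2-d space
rhs = phi(x) @ phi(y)         # inner product in the 6-d feature space
print(np.isclose(lhs, rhs))   # True
```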
Examples of Kernel Functions
n Linear: K(xi, xj) = xiTxj
¨ Mapping Φ: x → φ(x), where φ(x) is x itself
n Polynomial of power p: K(xi, xj) = (1 + xiTxj)p
¨ Mapping Φ: x → φ(x), where φ(x) has $\binom{d+p}{p}$ dimensions
n Gaussian (radial-basis function): $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$
¨ Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian)
n Higher-dimensional space still has intrinsic dimensionality d, but linear separators in it correspond to non-linear separators in the original space.
n Typically k is much larger than d, and possibly larger than N
¨ Using the dual, where the complexity depends on N rather than k, is advantageous
n We use the soft margin hyperplane
¨ If C is too large: too high a penalty for non-separable points (too many support vectors)
¨ If C is too small: we may have underfitting
n Decide by cross-validation (see the sketch below)
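A minimal cross-validation sketch (my own, using scikit-learn; the parameter grids are illustrative, not from the slides):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
    cv=5,                                    # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)   # chosen C/gamma and its CV accuracy
```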
Other Kernel Functions
n Polynomials of degree q:
$$K\!\left(\mathbf{x}^t, \mathbf{x}\right) = \left(\mathbf{x}^T\mathbf{x}^t\right)^q \quad \text{or} \quad K\!\left(\mathbf{x}^t, \mathbf{x}\right) = \left(\mathbf{x}^T\mathbf{x}^t + 1\right)^q$$
n Radial-basis functions:
$$K\!\left(\mathbf{x}^t, \mathbf{x}\right) = \exp\!\left[-\frac{\left\|\mathbf{x} - \mathbf{x}^t\right\|^2}{2\sigma^2}\right]$$
n Sigmoidal functions such as:
$$K\!\left(\mathbf{x}^t, \mathbf{x}\right) = \tanh\!\left(2\,\mathbf{x}^T\mathbf{x}^t + 1\right)$$
What Functions are Kernels? Advanced
n For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)Tφ(xj) can be cumbersome.
n Any function that satisfies certain constraints, called the Mercer conditions, can be a kernel function (Cherkassky and Mulier, 1998):
Every positive semi-definite symmetric function is a kernel.
n Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:
$$\mathbf{K} = \begin{bmatrix}
K(x_1,x_1) & K(x_1,x_2) & K(x_1,x_3) & \cdots & K(x_1,x_n) \\
K(x_2,x_1) & K(x_2,x_2) & K(x_2,x_3) & \cdots & K(x_2,x_n) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
K(x_n,x_1) & K(x_n,x_2) & K(x_n,x_3) & \cdots & K(x_n,x_n)
\end{bmatrix}$$
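In practice this condition can be checked numerically; a small sketch (my own) for a candidate kernel:

```python
import numpy as np

def gram_matrix(kernel, X):
    # build the N x N Gram matrix K[i, j] = kernel(x_i, x_j)
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

rbf = lambda x, y, s=1.0: np.exp(-np.sum((x - y) ** 2) / (2 * s ** 2))

X = np.random.default_rng(0).normal(size=(20, 3))
K = gram_matrix(rbf, X)

eigvals = np.linalg.eigvalsh(K)                        # eigvalsh: for symmetric matrices
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)    # True True for a valid kernel
```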
n Informally, kernel methods implicitly define the class of possible patterns by introducing a notion of similarity between data
¨ Choice of similarity -> choice of relevant features
n More formally, kernel methods exploit information about the inner products between data items
¨ Many standard algorithms can be rewritten so that they only require inner products between data (inputs)
¨ Kernel functions = inner products in some feature space (potentially very complex)
¨ If the kernel is given, there is no need to specify what features of the data are being used
¨ Kernel functions make it possible to use infinite dimensions efficiently in time / space
String kernels
n For example, given two documents, D1 and D2, the number of words appearing in both may form a kernel.
n Define φ(D1) as the M-dimensional binary vector where dimension i is 1 if word wi appears in D1; 0 otherwise.
n Then φ(D1)Tφ(D2) indicates the number of shared words.
n If we define K(D1, D2) as the number of shared words (see the sketch below):
¨ no need to preselect the M words
¨ no need to create the bag-of-words model explicitly
¨ M can be as large as we want
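A toy sketch (my own) of this shared-word kernel; the two example documents are made up:

```python
def shared_word_kernel(doc1: str, doc2: str) -> int:
    # K(D1, D2) = number of words appearing in both documents, which equals the
    # inner product of the binary bag-of-words vectors phi(D1) and phi(D2)
    return len(set(doc1.lower().split()) & set(doc2.lower().split()))

d1 = "the support vector machine maximizes the margin"
d2 = "a large margin makes the classifier simple"
print(shared_word_kernel(d1, d2))   # shared words: {"the", "margin"} -> 2
```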
Projecting into Higher Dimensions
SVM Applications
n Cortes and Vapnik 1995:
¨ Handwritten digit classification
¨ 16x16 bitmaps -> 256 dimensions
¨ Polynomial kernel with q = 3 -> feature space with ~10^6 dimensions
¨ No overfitting on a training set of 7300 instances
¨ Average of 148 support vectors over different training sets
Expected test error rate:
ExpN[P(error)] = ExpN[#support vectors] / N (= 0.02 for the above example)
SVM history and applications
n SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
n SVMs represent a general methodology for many PR problems: classification, regression, feature extraction, clustering, novelty detection, etc.
n SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
n SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
n Most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of αi’s at a time, e.g. SMO [Platt ’99] and [Joachims ’99]
Advantages of SVMs
¨ There are no problems with local minima, because the solution is a Quadratic Programming problem with a global minimum.
¨ The optimal solution can be found in polynomial time
¨ There are few model parameters to select: the penalty term C, the kernel function and parameters (e.g., spread σ in the case of RBF kernels)
¨ The final results are stable and repeatable (e.g., no random initial weights)
¨ The SVM solution is sparse; it only involves the support vectors
¨ SVMs rely on elegant and principled learning methods
¨ SVMs provide a method to control complexity independently of dimensionality
¨ SVMs have been shown (theoretically and empirically) to have excellent generalization capabilities
Challenges
n Can the kernel functions be selected in a principled manner?
n SVMs still require selection of a few parameters, typically through cross-validation
n How does one incorporate domain knowledge?
¨ Currently this is performed through the selection of the kernel and the introduction of "artificial" examples
n How interpretable are the results provided by an SVM?
n What is the optimal data representation for SVM? What is the effect of feature weighting? How does an SVM handle categorical or missing features?
n Do SVMs always perform best? Can they beat a hand-crafted solution for a particular problem?
n Do SVMs eliminate the model selection problem?
n More explanations or demonstrations can be found at:
¨ http://www.support-vector-machines.org/index.html
¨ Haykin Chp. 6 pp. 318-339
¨ Burges tutorial (under/reading/)
n Burges, CJC "A Tutorial on Support Vector Machines for Pattern Recognition" Data Mining and Knowledge Discovery, Vol 2 No 2, 1998.
¨ http://www.dtreg.com/svm.htm
n Software
¨ SVMlight, by Joachims, is one of the most widely used SVM classification and regression packages. Distributed as C++ source and binaries for Linux, Windows, Cygwin, and Solaris. Kernels: polynomial, radial basis function, and neural (tanh).
¨ LibSVM (Library for Support Vector Machines), http://www.csie.ntu.edu.tw/~cjlin/libsvm/, is developed by Chang and Lin and is also widely used. Developed in C++ and Java, it also supports multi-class classification, weighted SVM for unbalanced data, cross-validation, and automatic model selection. It has interfaces for Python, R, Splus, MATLAB, Perl, Ruby, and LabVIEW. Kernels: linear, polynomial, radial basis function, and neural (tanh).
n Applets to play with:
¨ http://lcn.epfl.ch/tutorial/english/svm/html/index.html
¨ http://cs.stanford.edu/people/karpathy/svmjs/demo/