A Gentle Introduction to Support Vector Machines in Biomedicine
Alexander Statnikov*, Douglas Hardin#, Isabelle Guyon†, Constantin F. Aliferis*
(Materials about SVM Clustering were contributed by Nikita Lytkin*)
*New York University, #Vanderbilt University, †ClopiNet
Part I
•Introduction
•Necessary mathematical concepts
•Support vector machines for binary classification: classical formulation
•Basic principles of statistical machine learning
Introduction
About this tutorial

Main goal: Fully understand support vector machines (and important extensions) with a modicum of mathematics knowledge.
•This tutorial is both modest (it does not invent anything new) and ambitious (support vector machines are generally considered mathematically quite difficult to grasp).
•Tutorial approach: learning problem → main idea of the SVM solution → geometrical interpretation → math/theory → basic algorithms → extensions → case studies.
Data-analysis problems of interest

1. Build computational classification models (or "classifiers") that assign patients/samples into two or more classes.
- Classifiers can be used for diagnosis, outcome prediction, and other classification tasks.
- E.g., build a decision-support system to diagnose primary and metastatic cancers from gene expression profiles of the patients:

[Figure: Patient → Biopsy → Gene expression profile → Classifier model → Primary Cancer / Metastatic Cancer]
Data-analysis problems of interest

2. Build computational regression models to predict values of some continuous response variable or outcome.
- Regression models can be used to predict survival, length of stay in the hospital, laboratory test values, etc.
- E.g., build a decision-support system to predict the optimal dosage of a drug to be administered to the patient. This dosage is determined by the values of patient biomarkers, and clinical and demographics data:

[Figure: Patient → Biomarkers, clinical and demographics data → Regression model → "Optimal dosage is 5 IU/Kg/week"]
Data-analysis problems of interest

3. Out of all measured variables in the dataset, select the smallest subset of variables that is necessary for the most accurate prediction (classification or regression) of some variable of interest (e.g., phenotypic response variable).
- E.g., find the most compact panel of breast cancer biomarkers from microarray gene expression data for 20,000 genes:

[Figure: gene expression heatmap of breast cancer tissues vs. normal tissues]
Data-analysis problems of interest

4. Build a computational model to identify novel or outlier patients/samples.
- Such models can be used to discover deviations in sample handling protocol when doing quality control of assays, etc.
- E.g., build a decision-support system to identify aliens.
Data-analysis problems of interest

5. Group patients/samples into several clusters based on their similarity.
- These methods can be used to discover disease sub-types and for other tasks.
- E.g., consider clustering of brain tumor patients into 4 clusters based on their gene expression profiles. All patients have the same pathological sub-type of the disease, and clustering discovers new disease subtypes that happen to have different characteristics in terms of patient survival and time to recurrence after treatment.

[Figure: survival curves for Cluster #1 through Cluster #4]
Basic principles of classification

•Want to classify objects as boats and houses.
•All objects before the coast line are boats and all objects after the coast line are houses.
•The coast line serves as a decision surface that separates the two classes.
•If the coast line is drawn poorly, some boats will be misclassified as houses.
Basic principles of classification

•The methods that build classification models (i.e., "classification algorithms") operate very similarly to the previous example.
•First, all objects are represented geometrically (e.g., each object becomes a point with Longitude and Latitude coordinates, labeled "Boat" or "House").
•Then the algorithm seeks to find a decision surface that separates the classes of objects.
•Unseen (new) objects are classified as "boats" if they fall below the decision surface and as "houses" if they fall above it.
The Support Vector Machine (SVM) approach

•Support vector machines (SVMs) are a binary classification algorithm that offers a solution to problem #1.
•Extensions of the basic SVM algorithm can be applied to solve problems #1-#5.
•SVMs are important because of (a) theoretical reasons:
- Robust to very large numbers of variables and small samples
- Can learn both simple and highly complex classification models
- Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.
Main ideas of SVMs

•Consider an example dataset described by 2 genes, gene X and gene Y.
•Represent patients geometrically (by "vectors"), labeled as cancer patients or normal patients.
Main ideas of SVMs

•Find a linear decision surface ("hyperplane") that can separate the patient classes and has the largest distance (i.e., largest "gap" or "margin") between border-line patients (i.e., "support vectors").
Main ideas of SVMs

•If such a linear decision surface does not exist, the data is mapped into a much higher dimensional space ("feature space") where the separating decision surface is found.
•The feature space is constructed via a very clever mathematical projection ("kernel trick").

[Figure: non-separable Cancer/Normal data in the gene X - gene Y plane mapped by a kernel into a feature space where a separating decision surface exists]
History of SVMs and usage in the literature

•Support vector machine classifiers have a long history of development starting from the 1960's.
•The most important milestone for development of modern SVMs is the 1992 paper by Boser, Guyon, and Vapnik ("A training algorithm for optimal margin classifiers").

[Chart: Use of Support Vector Machines in the Literature, 1998-2007 - yearly publication counts grow from a few hundred to several thousand in the general sciences and from single digits to over a thousand in biomedicine]
[Chart: Use of Linear Regression in the Literature, 1998-2007 - yearly publication counts remain roughly in the 10,000-25,000 range in both the general sciences and biomedicine]
Necessary mathematical concepts

How to represent samples geometrically? Vectors in n-dimensional space (Rⁿ)

•Assume that a sample/patient is described by n characteristics ("features" or "variables").
•Representation: Every sample/patient is a vector in Rⁿ with tail at the point with 0 coordinates and arrow-head at the point with the feature values.
•Example: Consider a patient described by 2 features: Systolic BP = 110 and Age = 29. This patient can be represented as a vector in R²:

[Figure: vector from (0, 0) to (110, 29) in the Systolic BP - Age plane]
How to represent samples geometrically? Vectors in n-dimensional space (Rⁿ)

Patient id | Cholesterol (mg/dl) | Systolic BP (mmHg) | Age (years) | Tail of the vector | Arrow-head of the vector
1          | 150                 | 110                | 35          | (0,0,0)            | (150, 110, 35)
2          | 250                 | 120                | 30          | (0,0,0)            | (250, 120, 30)
3          | 140                 | 160                | 65          | (0,0,0)            | (140, 160, 65)
4          | 300                 | 180                | 45          | (0,0,0)            | (300, 180, 45)

[Figure: Patients 1-4 plotted as vectors in the Cholesterol - Systolic BP - Age space]
How to represent samples geometrically? Vectors in n-dimensional space (Rⁿ)

Since we assume that the tail of each vector is at the point with 0 coordinates, we will also depict vectors as points (where the arrow-head is pointing).
Purpose of vector representation

•Having represented each sample/patient as a vector allows us now to geometrically represent the decision surface that separates two groups of samples/patients.
•In order to define the decision surface, we need to introduce some basic math elements…

[Figure: a decision surface in R² (a line) and a decision surface in R³ (a plane)]
Basic operation on vectors in Rⁿ
1. Multiplication by a scalar

Consider a vector a = (a₁, a₂, …, aₙ) and a scalar c. Define:
  c·a = (c·a₁, c·a₂, …, c·aₙ)

When you multiply a vector by a scalar, you "stretch" it in the same or opposite direction depending on whether the scalar is positive or negative.

Examples:
  a = (1, 2), c = 2   →  c·a = (2, 4)
  a = (1, 2), c = -1  →  c·a = (-1, -2)
Basic operation on vectors in Rⁿ
2. Addition

Consider vectors a = (a₁, a₂, …, aₙ) and b = (b₁, b₂, …, bₙ). Define:
  a + b = (a₁+b₁, a₂+b₂, …, aₙ+bₙ)

Recall addition of forces in classical mechanics.

Example: a = (1, 2), b = (3, 0)  →  a + b = (4, 2)
Basic operation on vectors in Rⁿ
3. Subtraction

Consider vectors a = (a₁, a₂, …, aₙ) and b = (b₁, b₂, …, bₙ). Define:
  a - b = (a₁-b₁, a₂-b₂, …, aₙ-bₙ)

What vector do we need to add to b to get a? I.e., similar to subtraction of real numbers.

Example: a = (1, 2), b = (3, 0)  →  a - b = (-2, 2)
Basic operation on vectors in Rⁿ
4. Euclidean length or L2-norm

Consider a vector a = (a₁, a₂, …, aₙ). Define the L2-norm:
  ||a||₂ = √(a₁² + a₂² + … + aₙ²)

We often denote the L2-norm without a subscript, i.e. ||a||.

Example: a = (1, 2)  →  ||a||₂ = √5 ≈ 2.24

The L2-norm is a typical way to measure the length of a vector; other methods to measure length also exist.
Basic operation on vectors in Rⁿ
5. Dot product

Consider vectors a = (a₁, a₂, …, aₙ) and b = (b₁, b₂, …, bₙ). Define the dot product:
  a·b = a₁b₁ + a₂b₂ + … + aₙbₙ = Σᵢ aᵢbᵢ

The law of cosines says that a·b = ||a||₂ ||b||₂ cos θ, where θ is the angle between a and b. Therefore, when the vectors are perpendicular, a·b = 0.

Examples:
  a = (1, 2), b = (3, 0)  →  a·b = 3
  a = (0, 2), b = (3, 0)  →  a·b = 0
Basic operation on vectors in Rⁿ
5. Dot product (continued)

•Property: a·a = a₁² + a₂² + … + aₙ² = ||a||₂².
•In the classical regression equation y = w·x + b, the response variable y is just a dot product of the vector representing patient characteristics (x) and the regression weights vector (w), which is common across all patients, plus an offset b.
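These vector operations map directly onto array libraries. Below is a minimal NumPy sketch reusing the numbers from the preceding examples; NumPy itself is an assumption for illustration, not something the tutorial prescribes.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.0])

print(2 * a)                 # scalar multiplication: [2. 4.]
print(a + b)                 # addition:              [4. 2.]
print(a - b)                 # subtraction:           [-2. 2.]
print(np.linalg.norm(a))     # L2-norm: sqrt(5) ~ 2.236
print(np.dot(a, b))          # dot product: 1*3 + 2*0 = 3

# property: a.a equals ||a||^2
print(np.isclose(np.dot(a, a), np.linalg.norm(a) ** 2))  # True
```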
Hyperplanes as decision surfaces

•A hyperplane is a linear decision surface that splits the space into two parts.
•It is obvious that a hyperplane is a binary classifier.

A hyperplane in R² is a line; a hyperplane in R³ is a plane; a hyperplane in Rⁿ is an (n-1)-dimensional subspace.
Equation of a hyperplane

First we show the definition of a hyperplane by an interactive demonstration (source: http://www.math.umn.edu/~nykamp/), or go to http://www.dsl-lab.org/svm_tutorial/planedemo.html
Equation of a hyperplane

Consider the case of R³. An equation of a hyperplane is defined by a point P₀ and a vector w perpendicular to the plane at that point.

Define vectors x₀ = OP₀ and x = OP, where P is an arbitrary point on the hyperplane.

A condition for P to be on the plane is that the vector x - x₀ is perpendicular to w:
  w·(x - x₀) = 0
  w·x - w·x₀ = 0
Defining b = -w·x₀, this becomes
  w·x + b = 0

The above equations also hold for Rⁿ when n > 3.
Equation of a hyperplane

Example:
  P₀ = (0, 1, 7), w = (4, 1, 6)
  b = -w·x₀ = -(4·0 + 1·1 + 6·7) = -43
  The hyperplane is w·x + b = 0, i.e. 4x⁽¹⁾ + 1x⁽²⁾ + 6x⁽³⁾ - 43 = 0.

What happens if the b coefficient changes? The hyperplane moves along the direction of w (the "+" direction or the "-" direction depending on the sign of the change). We obtain "parallel hyperplanes", e.g. w·x - 10 = 0 and w·x - 50 = 0.

The distance between two parallel hyperplanes w·x + b₁ = 0 and w·x + b₂ = 0 is equal to D = |b₁ - b₂| / ||w||.
(Derivation of the distance between two parallel hyperplanes)

Take a point x₁ on the hyperplane w·x + b₁ = 0 and move along the direction of w until you reach the hyperplane w·x + b₂ = 0 at the point x₂ = x₁ + t·w, for some scalar t. Then:

  w·x₂ + b₂ = 0
  w·(x₁ + t·w) + b₂ = 0
  w·x₁ + t·||w||² + b₂ = 0
  -b₁ + t·||w||² + b₂ = 0        (since w·x₁ = -b₁)
  t = (b₁ - b₂) / ||w||²

  D = ||x₂ - x₁|| = ||t·w|| = |t|·||w|| = |b₁ - b₂| / ||w||
Recap

We know…
•How to represent patients (as "vectors")
•How to define a linear decision surface ("hyperplane")

We need to know…
•How to efficiently compute the hyperplane that separates two classes with the largest "gap"?
⇒ Need to introduce basics of relevant optimization theory
Basics of optimization: Convex functions

•A function is called convex if the function lies below the straight line segment connecting two points, for any two points in the interval.
•Property: Any local minimum is a global minimum!

[Figure: a convex function with a single global minimum vs. a non-convex function with both a local minimum and a global minimum]
Basics of optimization: Quadratic programming (QP)

•Quadratic programming (QP) is a special optimization problem: the function to optimize ("objective") is quadratic, subject to linear constraints.
•Convex QP problems have convex objective functions.
•These problems can be solved easily and efficiently by greedy algorithms (because every local minimum is a global minimum).

Basics of optimization: Example QP problem

Consider x = (x₁, x₂).
Minimize (1/2)||x||₂²  subject to  x₁ + x₂ - 1 ≥ 0   (quadratic objective, linear constraints)

This is a QP problem, and it is a convex QP as we will see later. We can rewrite it as:
Minimize (1/2)(x₁² + x₂²)  subject to  x₁ + x₂ - 1 ≥ 0   (quadratic objective, linear constraints)
Basics of optimization: Example QP problem

[Figure: the surface f(x₁, x₂) = (1/2)(x₁² + x₂²) with the feasible region x₁ + x₂ - 1 ≥ 0; the constrained minimum lies on the line x₁ + x₂ = 1]

The solution is x₁ = 1/2 and x₂ = 1/2.
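This toy QP can be checked numerically with a general-purpose solver. The sketch below uses scipy.optimize.minimize; SciPy and the starting point are assumptions for illustration (a dedicated QP solver would normally be used for larger problems).

```python
import numpy as np
from scipy.optimize import minimize

# objective: (1/2)(x1^2 + x2^2)
objective = lambda x: 0.5 * (x[0] ** 2 + x[1] ** 2)

# linear inequality constraint: x1 + x2 - 1 >= 0
constraints = [{"type": "ineq", "fun": lambda x: x[0] + x[1] - 1.0}]

result = minimize(objective, x0=np.zeros(2), constraints=constraints)
print(result.x)  # ~[0.5, 0.5], matching the analytical solution
```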
Congratulations! You have mastered all math elements needed to understand support vector machines.

Now, let us strengthen your knowledge by a quiz.
Quiz

1) Consider the hyperplane shown in white, defined by the equation w·x - 10 = 0. Which of the three other hyperplanes can be defined by the equation w·x - 3 = 0?
- Orange
- Green
- Yellow

2) What is the dot product between vectors a = (3, 3) and b = (-1, 1)?
Quiz

3) What is the dot product between vectors a = (3, 3) and b = (1, 0)?

4) What is the length of the vector a = (2, 0), and what is the length of all other red vectors in the figure?

5) Which of the four functions is/are convex?

[Figure: four function plots labeled 1-4]
Support vector machines for binary classification: classical formulation
Case 1: Linearly separable data; "Hard-margin" linear SVM

Given training data: x₁, x₂, …, x_N ∈ Rⁿ and labels y₁, y₂, …, y_N ∈ {-1, +1}
(positive instances have y = +1, negative instances have y = -1).

•Want to find a classifier (hyperplane) to separate negative instances from the positive ones.
•An infinite number of such hyperplanes exist.
•SVMs find the hyperplane that maximizes the gap between data points on the boundaries (so-called "support vectors").
•If the points on the boundaries are not informative (e.g., due to noise), SVMs will not do well.
Statement of linear SVM classifier

The positive and negative instances are separated by the hyperplane w·x + b = 0. The gap is the distance between the parallel hyperplanes:
  w·x + b = -1  and  w·x + b = +1
or equivalently:
  w·x + (b+1) = 0  and  w·x + (b-1) = 0

We know that the distance between parallel hyperplanes is D = |b₁ - b₂| / ||w||.
Therefore: D = 2 / ||w||.

Since we want to maximize the gap, we need to minimize ||w||, or equivalently minimize (1/2)||w||²
(the factor 1/2 is convenient for taking derivatives later on).
Statement of linear SVM classifier

In addition, we need to impose constraints that all instances are correctly classified. In our case:
  w·xᵢ + b ≤ -1  if  yᵢ = -1
  w·xᵢ + b ≥ +1  if  yᵢ = +1
Equivalently:  yᵢ(w·xᵢ + b) ≥ 1

In summary:
Want to minimize (1/2)||w||² subject to yᵢ(w·xᵢ + b) ≥ 1 for i = 1,…,N.
Then, given a new instance x, the classifier is f(x) = sign(w·x + b).
SVM optimization problem: Primal formulation

Minimize (1/2)Σᵢ wᵢ²   (objective function)
subject to yᵢ(w·xᵢ + b) - 1 ≥ 0 for i = 1,…,N   (constraints)

•This is called the "primal formulation of linear SVMs".
•It is a convex quadratic programming (QP) optimization problem with n variables (wᵢ, i = 1,…,n), where n is the number of features in the dataset.
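In practice one rarely solves this QP by hand; off-the-shelf implementations do it. Below is a minimal sketch with scikit-learn's SVC using a linear kernel; the library, the toy 2-gene data, and the very large C (to approximate the hard margin) are assumptions for illustration, not part of the tutorial.

```python
import numpy as np
from sklearn.svm import SVC

# toy 2-gene dataset: rows are patients, columns are gene X and gene Y
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # cancer patients
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.0]])   # normal patients
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

print(clf.coef_, clf.intercept_)               # the learned w-vector and b
print(clf.predict([[2.0, 2.0], [5.0, 5.0]]))   # sign(w.x + b) for new patients
```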
SVM optimization problem: Dual formulation

•The previous problem can be recast in the so-called "dual form", giving rise to the "dual formulation of linear SVMs".
•It is also a convex quadratic programming problem, but with N variables (αᵢ, i = 1,…,N), where N is the number of samples.

Maximize  Σᵢ αᵢ - (1/2) Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ)   (objective function)
subject to  αᵢ ≥ 0  and  Σᵢ αᵢyᵢ = 0   (constraints)

Then the w-vector is defined in terms of αᵢ:
  w = Σᵢ αᵢ yᵢ xᵢ

And the solution becomes:
  f(x) = sign( Σᵢ αᵢ yᵢ (xᵢ·x) + b )
SVM optimization problem: Benefits of using dual formulation

1) No need to access the original data; we need to access only dot products.
   Objective function:  Σᵢ αᵢ - (1/2) Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ)
   Solution:  f(x) = sign( Σᵢ αᵢ yᵢ (xᵢ·x) + b )

2) The number of free parameters is bounded by the number of support vectors and not by the number of variables (beneficial for high-dimensional problems).
   E.g., if a microarray dataset contains 20,000 genes and 100 patients, then we need to find only up to 100 parameters!
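Most solvers work in this dual and expose the quantities that appear above. The sketch below shows how the support vectors and the products αᵢyᵢ can be inspected in scikit-learn; the library and toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.0]])
y = np.array([+1, +1, +1, -1, -1, -1])
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)   # the support vectors x_i
print(clf.dual_coef_)         # the products alpha_i * y_i for the support vectors

# w can be reconstructed from the dual solution: w = sum_i alpha_i y_i x_i
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))    # True for a linear kernel
```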
(Derivation of dual formulation)

Minimize (1/2)Σᵢ wᵢ² subject to yᵢ(w·xᵢ + b) - 1 ≥ 0 for i = 1,…,N.

Apply the method of Lagrange multipliers. Define the Lagrangian
  Λ_P(w, b, α) = (1/2)||w||² - Σᵢ αᵢ [ yᵢ(w·xᵢ + b) - 1 ]
where w is a vector with n elements and α is a vector with N elements.

We need to minimize this Lagrangian with respect to w and b, and simultaneously require that the derivative with respect to α vanishes, all subject to the constraints that αᵢ ≥ 0.
(Derivation of dual formulation)

If we set the derivatives with respect to w and b to 0, we obtain:
  ∂Λ_P/∂b = 0  →  Σᵢ αᵢ yᵢ = 0
  ∂Λ_P/∂w = 0  →  w = Σᵢ αᵢ yᵢ xᵢ

We substitute the above into the equation for Λ_P(w, b, α) and obtain the "dual formulation of linear SVMs":
  Λ_D(α) = Σᵢ αᵢ - (1/2) Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ)

We seek to maximize the above Lagrangian with respect to α, subject to the constraints that αᵢ ≥ 0 and Σᵢ αᵢ yᵢ = 0.
Case 2: Not linearly separable data; "Soft-margin" linear SVM

What if the data is not linearly separable? E.g., there are outliers or noisy measurements, or the data is slightly non-linear. We want to handle this case without changing the family of decision functions.

Approach: Assign a "slack variable" ξᵢ ≥ 0 to each instance, which can be thought of as the distance from the separating hyperplane if an instance is misclassified, and 0 otherwise.

Want to minimize (1/2)||w||² + C Σᵢ ξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ for i = 1,…,N.
Then, given a new instance x, the classifier is f(x) = sign(w·x + b).
56
Minim
ize subject to for i= 1,…,N
¦¦
�Ni
i
niiC
w1
1
22 1
[
Objective function
Constraints
ii
ib
xw
y[
�t
��
1)
(&
&
¦¦
��
Nji
ji
ji
ji
nii
xx
yy
1,
2 1
1
&&
DD
DC
i ddD
00
1
¦ Ni
ii y
DM
inimize subject to and
for i= 1,…,N.
Objective function
Constraints
Prim
al fo
rmu
latio
n:
Du
al fo
rmu
latio
n:
Parameter C in soft-margin SVM

Minimize (1/2)||w||² + C Σᵢ ξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ for i = 1,…,N

•When C is very large, the soft-margin SVM is equivalent to the hard-margin SVM;
•When C is very small, we admit misclassifications in the training data at the expense of having a w-vector with small norm;
•C has to be selected for the distribution at hand, as will be discussed later in this tutorial.

[Figure: decision boundaries and margins for C = 100, C = 1, C = 0.15, C = 0.1]
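The behaviour described above can be observed by sweeping C with a linear soft-margin SVM. A minimal sketch, assuming scikit-learn and randomly generated, overlapping toy classes:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# two overlapping Gaussian classes, so some training errors are unavoidable
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([+1] * 50 + [-1] * 50)

for C in [100.0, 1.0, 0.15, 0.1]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w_norm = np.linalg.norm(clf.coef_)
    train_err = np.mean(clf.predict(X) != y)
    # small C -> smaller ||w|| (wider margin) but more training misclassifications
    print(f"C={C:6.2f}  ||w||={w_norm:.3f}  training error={train_err:.2f}")
```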
Case 3: Not linearly separable data; Kernel trick

[Figure: Tumor vs. Normal samples described by Gene 1 and Gene 2. The data is not linearly separable in the input space, but it is linearly separable in the feature space obtained by a kernel mapping Φ: Rᴺ → H.]
Kernel trick

Original data (in input space):
  f(x) = sign(w·x + b),  with  w = Σᵢ αᵢ yᵢ xᵢ
Data in a higher dimensional feature space Φ(x):
  f(x) = sign(w·Φ(x) + b),  with  w = Σᵢ αᵢ yᵢ Φ(xᵢ)

Therefore:
  f(x) = sign( Σᵢ αᵢ yᵢ (Φ(xᵢ)·Φ(x)) + b ) = sign( Σᵢ αᵢ yᵢ K(xᵢ, x) + b )

Therefore, we do not need to know Φ explicitly; we just need to define the kernel function K(·,·): Rᴺ × Rᴺ → R.

Not every function Rᴺ × Rᴺ → R can be a valid kernel; it has to satisfy the so-called Mercer conditions. Otherwise, the underlying quadratic program may not be solvable.
Popular kernels

A kernel is a dot product in some feature space: K(xᵢ, xⱼ) = Φ(xᵢ)·Φ(xⱼ)

Examples:
  Linear kernel:       K(xᵢ, xⱼ) = xᵢ·xⱼ
  Gaussian kernel:     K(xᵢ, xⱼ) = exp(-γ ||xᵢ - xⱼ||²)
  Exponential kernel:  K(xᵢ, xⱼ) = exp(-γ ||xᵢ - xⱼ||)
  Polynomial kernel:   K(xᵢ, xⱼ) = (p + xᵢ·xⱼ)^q
  Hybrid kernel:       K(xᵢ, xⱼ) = (p + xᵢ·xⱼ)^q exp(-γ ||xᵢ - xⱼ||²)
  Sigmoidal kernel:    K(xᵢ, xⱼ) = tanh(k (xᵢ·xⱼ) - δ)
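Two of these kernels computed explicitly and compared against common library helpers. A minimal sketch; scikit-learn's pairwise helpers, the parameter values, and the test points are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

X = np.array([[1.0, 2.0], [3.0, 0.0], [0.0, 1.0]])
gamma, p, q = 0.5, 1.0, 3

# Gaussian kernel: K(xi, xj) = exp(-gamma * ||xi - xj||^2)
diff = X[:, None, :] - X[None, :, :]
K_gauss = np.exp(-gamma * np.sum(diff ** 2, axis=-1))
print(np.allclose(K_gauss, rbf_kernel(X, gamma=gamma)))                          # True

# Polynomial kernel: K(xi, xj) = (p + xi . xj)^q
K_poly = (p + X @ X.T) ** q
print(np.allclose(K_poly, polynomial_kernel(X, degree=q, gamma=1.0, coef0=p)))   # True
```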
Understanding the Gaussian kernel

Consider the Gaussian kernel: K(x, xⱼ) = exp(-γ ||x - xⱼ||²)

Geometrically, this is a "bump" or "cavity" centered at the training data point xⱼ.

The resulting mapping function is a combination of bumps and cavities.
Understanding the Gaussian kernel

[Figure: several more views of the data as it is mapped to the feature space by the Gaussian kernel; a linear hyperplane in that space separates the two classes]
Understanding the polynomial kernel

Consider the polynomial kernel: K(xᵢ, xⱼ) = (1 + xᵢ·xⱼ)³

Assume that we are dealing with 2-dimensional data (i.e., in R²). Where will this kernel map the data?

The 2-dimensional input (x⁽¹⁾, x⁽²⁾) is mapped (up to constant factors) to the 10-dimensional space of monomials of degree ≤ 3:
  1, x⁽¹⁾, x⁽²⁾, x⁽¹⁾², x⁽¹⁾x⁽²⁾, x⁽²⁾², x⁽¹⁾³, x⁽¹⁾²x⁽²⁾, x⁽¹⁾x⁽²⁾², x⁽²⁾³
Example of benefits of using a kernel

•Data (points x₁, x₂, x₃, x₄ in the x⁽¹⁾ - x⁽²⁾ plane) is not linearly separable in the input space (R²).
•Apply the kernel K(x, z) = (x·z)² to map the data to a higher dimensional (3-dimensional) space where it is linearly separable:

  K(x, z) = (x·z)²
          = (x⁽¹⁾z⁽¹⁾ + x⁽²⁾z⁽²⁾)²
          = x⁽¹⁾²z⁽¹⁾² + 2 x⁽¹⁾z⁽¹⁾ x⁽²⁾z⁽²⁾ + x⁽²⁾²z⁽²⁾²
          = (x⁽¹⁾², √2 x⁽¹⁾x⁽²⁾, x⁽²⁾²) · (z⁽¹⁾², √2 z⁽¹⁾z⁽²⁾, z⁽²⁾²)
          = Φ(x)·Φ(z)
Example of benefits of using a kernel

Therefore, the explicit mapping is
  Φ(x) = (x⁽¹⁾², √2 x⁽¹⁾x⁽²⁾, x⁽²⁾²)

[Figure: after the mapping, the points x₁, x₂ and x₃, x₄ become linearly separable in the 3-dimensional feature space]
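A minimal NumPy check that the kernel K(x, z) = (x·z)² equals the dot product of the explicit mapping Φ above; the specific test points are assumptions for illustration.

```python
import numpy as np

def phi(v):
    # explicit mapping for K(x, z) = (x . z)^2 in R^2
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = np.dot(x, z) ** 2          # kernel evaluated in the input space
k_explicit = np.dot(phi(x), phi(z))     # dot product in the 3-D feature space
print(np.isclose(k_implicit, k_explicit))  # True
```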
Comparison with methods from classical statistics & regression

•Classical regression needs ≥ 5 samples for each parameter of the regression model to be estimated:

Number of variables | Polynomial degree | Number of parameters | Required sample
2                   | 3                 | 10                   | 50
10                  | 3                 | 286                  | 1,430
10                  | 5                 | 3,003                | 15,015
100                 | 3                 | 176,851              | 884,255
100                 | 5                 | 96,560,646           | 482,803,230

•SVMs do not have such a requirement and often require a much smaller sample than the number of variables, even when a high-degree polynomial kernel is used.
Basic principles of statistical machine learning

Generalization and overfitting

•Generalization: A classifier or a regression algorithm learns to correctly predict output from given inputs not only in previously seen samples but also in previously unseen samples.
•Overfitting: A classifier or a regression algorithm learns to correctly predict output from given inputs in previously seen samples but fails to do so in previously unseen samples.
•Overfitting ⇒ Poor generalization.
Example of overfitting and generalization

There is a linear relationship between the predictor X and the outcome of interest Y (plus some Gaussian noise). [Figure: training data and test data with two fitted curves, Algorithm 1 and Algorithm 2]

•Algorithm 1 learned non-reproducible peculiarities of the specific sample available for learning but did not learn the general characteristics of the function that generated the data. Thus, it is overfitted and has poor generalization.
•Algorithm 2 learned the general characteristics of the function that produced the data. Thus, it generalizes.
"Loss + penalty" paradigm for learning to avoid overfitting and ensure generalization

•Many statistical learning algorithms (including SVMs) search for a decision function by solving the following optimization problem:
  Minimize (Loss + λ Penalty)
•Loss measures the error of fitting the data
•Penalty penalizes the complexity of the learned function
•λ is a regularization parameter that balances Loss and Penalty
SVMs in "loss + penalty" form

SVMs build classifiers of the form: f(x) = sign(w·x + b)

Consider the soft-margin linear SVM formulation:
Find w and b that minimize
  (1/2)||w||² + C Σᵢ ξᵢ  subject to  yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ  for i = 1,…,N

This can also be stated as: find w and b that minimize
  Σᵢ [1 - yᵢ f(xᵢ)]₊  +  λ ||w||₂²
  (loss: "hinge loss")     (penalty)

(in fact, one can show that λ = 1/(2C)).
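The "loss + penalty" objective for a given (w, b) can be computed directly, which makes the hinge-loss term concrete. A minimal sketch; NumPy and the toy values are assumptions for illustration.

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    # hinge loss: sum_i [1 - y_i (w.x_i + b)]_+
    margins = y * (X @ w + b)
    hinge = np.sum(np.maximum(0.0, 1.0 - margins))
    # penalty: lambda * ||w||_2^2
    penalty = lam * np.dot(w, w)
    return hinge + penalty

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, lam=0.1))
```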
Meaning of SVM loss function

Consider the loss function: Σᵢ [1 - yᵢ f(xᵢ)]₊

•Recall that […]₊ indicates the positive part
•For a given sample/patient i, the loss is non-zero if 1 - yᵢ f(xᵢ) > 0
•In other words, the loss is non-zero if yᵢ f(xᵢ) < 1
•Since yᵢ ∈ {-1, +1}, this means that the loss is non-zero if
    f(xᵢ) < 1 for yᵢ = +1
    f(xᵢ) > -1 for yᵢ = -1
•In other words, the loss is non-zero if
    w·xᵢ + b < 1 for yᵢ = +1
    w·xᵢ + b > -1 for yᵢ = -1
Meaning of SVM loss function

[Figure: regions 1-4 defined by the hyperplanes w·x + b = -1, w·x + b = 0, and w·x + b = +1, with positive (y=+1) and negative (y=-1) instances]

•If an instance is negative, it is penalized only in regions 2, 3, 4
•If an instance is positive, it is penalized only in regions 1, 2, 3
Flexibility of "loss + penalty" framework

Minimize (Loss + λ Penalty)

Loss function                                | Penalty function       | Resulting algorithm
Hinge loss: Σᵢ [1 - yᵢ f(xᵢ)]₊               | λ ||w||₂²              | SVMs
Mean squared error: Σᵢ (yᵢ - f(xᵢ))²         | λ ||w||₂²              | Ridge regression
Mean squared error: Σᵢ (yᵢ - f(xᵢ))²         | λ ||w||₁               | Lasso
Mean squared error: Σᵢ (yᵢ - f(xᵢ))²         | λ₁ ||w||₁ + λ₂ ||w||₂² | Elastic net
Hinge loss: Σᵢ [1 - yᵢ f(xᵢ)]₊               | λ ||w||₁               | 1-norm SVM
Part 2

•Model selection for SVMs
•Extensions to the basic SVM model:
  1. SVMs for multicategory data
  2. Support vector regression
  3. Novelty detection with SVM-based methods
  4. Support vector clustering
  5. SVM-based variable selection
  6. Computing posterior class probabilities for SVM classifiers
Model selection for SVMs
Need for model selection for SVMs

•[Left figure: Tumor vs. Normal samples in the Gene 1 - Gene 2 plane that are not linearly separable] It is impossible to find a linear SVM classifier that separates tumors from normals! We need a non-linear SVM classifier; e.g., an SVM with a polynomial kernel of degree 2 solves this problem without errors.
•[Right figure: linearly separable Tumor vs. Normal data] We should not apply a non-linear SVM classifier when we can perfectly solve this problem using a linear SVM classifier!
A data-driven approach for model selection for SVMs

•Do not know a priori what type of SVM kernel and what kernel parameter(s) to use for a given dataset?
•Need to examine various combinations of parameters, e.g. consider searching the following grid of (parameter C, polynomial degree d) pairs:

(0.1, 1)  (1, 1)  (10, 1)  (100, 1)  (1000, 1)
(0.1, 2)  (1, 2)  (10, 2)  (100, 2)  (1000, 2)
(0.1, 3)  (1, 3)  (10, 3)  (100, 3)  (1000, 3)
(0.1, 4)  (1, 4)  (10, 4)  (100, 4)  (1000, 4)
(0.1, 5)  (1, 5)  (10, 5)  (100, 5)  (1000, 5)

•How to search this grid while producing an unbiased estimate of classification performance?
Nested cross-validation

Recall the main idea of cross-validation: split the data into train/test folds, so that each part of the data serves as the test set once.

What combination of SVM parameters to apply on the training data? Split each training fold further into train/validation folds and perform a "grid search" using this additional, nested loop of cross-validation.
Example of nested cross-validation

Consider that we use 3-fold cross-validation and we want to optimize parameter C, which takes values "1" and "2". The data is split into parts P1, P2, P3.

Inner loop (for the outer fold that tests on P3):
  Training set P1, validation set P2, C=1: accuracy 86%; training set P2, validation set P1, C=1: accuracy 84% → average 85%
  Training set P1, validation set P2, C=2: accuracy 70%; training set P2, validation set P1, C=2: accuracy 90% → average 80%
  → choose C=1

Outer loop:
  Training set P1, P2 | Testing set P3 | C=1 | accuracy 89%
  Training set P1, P3 | Testing set P2 | C=2 | accuracy 84%
  Training set P2, P3 | Testing set P1 | C=1 | accuracy 76%
  → average accuracy 83%
On use of cross-validation

•Empirically we found that cross-validation works well for model selection for SVMs in many problem domains;
•Many other approaches exist that can be used for model selection for SVMs, e.g.:
  - Generalized cross-validation
  - Bayesian information criterion (BIC)
  - Minimum description length (MDL)
  - Vapnik-Chervonenkis (VC) dimension
  - Bootstrap
SVMs for multicategory data
One-versus-rest multicategory SVM method

[Figure: Tumor I, Tumor II and Tumor III samples in the Gene 1 - Gene 2 plane; each class is separated from all remaining classes by its own binary SVM, and new samples "?" are assigned accordingly]
One-versus-one multicategory SVM method

[Figure: the same Tumor I, Tumor II and Tumor III data; a binary SVM is built for every pair of classes, and new samples "?" are assigned by combining the pairwise classifiers]
DAGSVM multicategory SVM method

[Figure: decision DAG over the classes AML, ALL B-cell and ALL T-cell. The root applies the classifier "AML vs. ALL T-cell"; depending on which class is rejected ("Not AML" or "Not ALL T-cell"), the next node applies "ALL B-cell vs. ALL T-cell" or "AML vs. ALL B-cell", until a single class (AML, ALL B-cell, or ALL T-cell) remains.]
SVM multicategory methods by Weston and Watkins and by Crammer and Singer

[Figure: the same three-class data; these methods solve a single optimization problem that separates all classes simultaneously, rather than combining binary classifiers]
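The one-versus-rest and one-versus-one decompositions can be reproduced with generic meta-estimators around a binary SVM. A minimal sketch; scikit-learn and the synthetic 3-class data are assumptions for illustration (scikit-learn's SVC also implements one-versus-one natively).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

rng = np.random.RandomState(0)
# three "tumor types" as Gaussian blobs in a 2-gene space
X = np.vstack([rng.randn(30, 2) + c for c in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 30)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # 3 classifiers: class k vs. rest
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # 3 classifiers: one per class pair
print(ovr.predict([[3.5, 0.5]]), ovo.predict([[3.5, 0.5]]))
```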
Support vector regression
ε-Support vector regression (ε-SVR)

Given training data: x₁, x₂, …, x_N ∈ Rⁿ and y₁, y₂, …, y_N ∈ R

Main idea: Find a function f(x) = w·x + b that approximates y₁,…,y_N:
•it has at most ε deviation from the true values yᵢ
•it is as "flat" as possible (to avoid overfitting)

[Figure: data points inside a ribbon of width ±ε around the fitted line]

E.g., build a model to predict survival of cancer patients that can admit a one month error (= ε).
Formulation of "hard-margin" ε-SVR

Find f(x) = w·x + b by minimizing (1/2)||w||² subject to the constraints:
  yᵢ - (w·xᵢ + b) ≤ ε
  (w·xᵢ + b) - yᵢ ≤ ε
for i = 1,…,N.

I.e., the difference between yᵢ and the fitted function should be smaller than ε and larger than -ε ⇔ all points yᵢ should be in the "ε-ribbon" around the fitted function.
Formulation of "soft-margin" ε-SVR

If we have points outside the ε-ribbon (e.g., outliers or noise) we can either:
a) increase ε to ensure that these points are within the new ε-ribbon, or
b) assign a penalty ("slack" variable) to each of these points (as was done for "soft-margin" SVMs).
Formulation of "soft-margin" ε-SVR

Find f(x) = w·x + b by minimizing
  (1/2)||w||² + C Σᵢ (ξᵢ + ξᵢ*)
subject to the constraints:
  yᵢ - (w·xᵢ + b) ≤ ε + ξᵢ
  (w·xᵢ + b) - yᵢ ≤ ε + ξᵢ*
  ξᵢ, ξᵢ* ≥ 0
for i = 1,…,N.

Notice that only points outside the ε-ribbon are penalized!
Nonlinear ε-SVR

[Figure: strongly non-linear data that cannot be approximated well by a linear function with small ε; after a kernel mapping Φ the data in the feature space fits inside an ε-ribbon, and the fit is mapped back to the input space via Φ⁻¹]
ε-Support vector regression in "loss + penalty" form

Build a decision function of the form: f(x) = w·x + b

Find w and b that minimize
  Σᵢ max(0, |yᵢ - f(xᵢ)| - ε)  +  λ ||w||₂²
  (loss: "linear ε-insensitive loss")   (penalty)

[Figure: the linear ε-insensitive loss as a function of the error in approximation - zero inside [-ε, ε] and growing linearly outside]
Comparing ε-SVR with popular regression methods

Loss function                                                   | Penalty function | Resulting algorithm
Linear ε-insensitive loss: Σᵢ max(0, |yᵢ - f(xᵢ)| - ε)          | λ ||w||₂²        | ε-SVR
Quadratic ε-insensitive loss: Σᵢ max(0, |yᵢ - f(xᵢ)| - ε)²      | λ ||w||₂²        | Another variant of ε-SVR
Mean squared error: Σᵢ (yᵢ - f(xᵢ))²                            | λ ||w||₂²        | Ridge regression
Mean linear error: Σᵢ |yᵢ - f(xᵢ)|                              | λ ||w||₂²        | Another variant of ridge regression
Comparing loss functions of regression methods

[Figure: four plots of loss function value vs. error in approximation - linear ε-insensitive loss, quadratic ε-insensitive loss, mean squared error, and mean linear error]
Applying ε-SVR to real data

In the absence of domain knowledge about decision functions, it is recommended to optimize the following parameters (e.g., by cross-validation using grid-search):
•parameter C
•parameter ε
•kernel parameters (e.g., degree of polynomial)

Notice that parameter ε depends on the ranges of variables in the dataset; therefore it is recommended to normalize/re-scale the data prior to applying ε-SVR.
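A minimal sketch of ε-SVR on synthetic data, with C, ε and the kernel exposed as the parameters the slide recommends tuning, and the data re-scaled first; scikit-learn and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

X_scaled = StandardScaler().fit_transform(X)   # re-scaling before choosing epsilon

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
svr.fit(X_scaled, y)
print(svr.predict(X_scaled[:5]))   # fitted values; errors inside the epsilon-ribbon are not penalized
```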
Novelty detection with SVM-based methods
98
99
What is it about?
•Find the sim
plest and most
compact region in the space of
predictors where the m
ajority of data sam
ples “live” (i.e., w
ith the highest density of sam
ples). •
Build a decision function that takes value +1 in this region and -1 elsew
here.•
Once w
e have such a decision function, w
e can identify novel or outlier sam
ples/patients in the data.
Predictor X
*** *
*
*
* ***
***
* **
**
** *
* **
**
*
**
**
***
*
**
***
* **
*
**
**
* ** *
* ****
****
* * **
*
** ***
* **
**
** *
***
*** *
** * *
**
** * * *
* ****
**
Predictor Y
**
* *
Decision function = +1
Decision function = -1
* **** **
** ***
* ***
Key assumptions

•We do not know classes/labels of samples (positive or negative) in the data available for learning ⇒ this is not a classification problem.
•All positive samples are similar, but each negative sample can be different in its own way.

Thus, we do not need to collect data for negative samples!
Sample applications

[Figure: "Normal" samples form one group; "Novel" samples fall outside it]
Modified from: www.cs.huji.ac.il/course/2004/learns/NoveltyDetection.ppt
Sample applications

Discover deviations in sample handling protocol when doing quality control of assays.

[Figure: samples in the Protein X - Protein Y plane. Samples with high quality of processing form the dense "normal" region; separate outlying groups correspond to samples with low quality of processing from infants, from patients with lung cancer, from ICU patients, and from the lab of Dr. Smith.]
Sample applications

Identify websites that discuss benefits of proven cancer treatments.

[Figure: websites plotted by weighted frequency of word X and word Y. Websites that discuss benefits of proven cancer treatments form the dense "normal" region; outlying groups correspond to websites that discuss side-effects of proven cancer treatments, websites that discuss unproven cancer treatments, websites that discuss cancer prevention methods, and blogs of cancer patients.]
One-class SVM

Main idea: Find the maximal-gap hyperplane w·x + b = 0 that separates the data from the origin (i.e., the only member of the second class is the origin).

Use "slack variables" ξᵢ, ξⱼ as in soft-margin SVMs to penalize the instances that fall on the origin side of the hyperplane.
Formulation of one-class SVM: linear case

Given training data: x₁, x₂, …, x_N ∈ Rⁿ

Find f(x) = sign(w·x + b) by minimizing
  (1/2)||w||² + (1/(νN)) Σᵢ ξᵢ + b
subject to the constraints:
  w·xᵢ + b ≥ -ξᵢ,  ξᵢ ≥ 0
for i = 1,…,N.

Here ν is an upper bound on the fraction of outliers (i.e., points outside the decision surface) allowed in the data, and the constraints mean that the decision function should be positive in all training samples except for small deviations.
Formulation of one-class SVM: linear and non-linear cases

Linear case:
Find f(x) = sign(w·x + b) by minimizing
  (1/2)||w||² + (1/(νN)) Σᵢ ξᵢ + b
subject to w·xᵢ + b ≥ -ξᵢ, ξᵢ ≥ 0, for i = 1,…,N.

Non-linear case (use "kernel trick"):
Find f(x) = sign(w·Φ(x) + b) by minimizing
  (1/2)||w||² + (1/(νN)) Σᵢ ξᵢ + b
subject to w·Φ(xᵢ) + b ≥ -ξᵢ, ξᵢ ≥ 0, for i = 1,…,N.
More about one-class SVM

•One-class SVMs inherit most of the properties of SVMs for binary classification (e.g., the "kernel trick", sample efficiency, ease of finding a solution by efficient optimization methods, etc.);
•The choice of the parameter ν significantly affects the resulting decision surface.
•The choice of origin is arbitrary and also significantly affects the decision surface returned by the algorithm.
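A minimal sketch of one-class SVM for novelty detection, with ν playing the role of the outlier-fraction bound discussed above; scikit-learn, the kernel choice and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)              # "normal" samples only; no labels needed
X_new = np.array([[0.1, -0.1], [3.0, 3.0]])    # one typical sample, one outlier

oc = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5)   # nu bounds the fraction of outliers
oc.fit(X_train)
print(oc.predict(X_new))   # +1 for samples inside the learned region, -1 for novelties
```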
Support vector clustering
(Contributed by Nikita Lytkin)
Goal of clustering (aka class discovery)

Given a heterogeneous set of data points x₁, x₂, …, x_N ∈ Rⁿ, assign labels y₁, y₂, …, y_N ∈ {1, 2, …, K} such that points with the same label are highly "similar" to each other and are distinctly different from the rest.

[Figure: the clustering process assigns each data point to one of several clusters]
Support vector domain description

•The Support Vector Domain Description (SVDD) of the data is a set of vectors lying on the surface of the smallest hyper-sphere enclosing all data points in a feature space.
  - These surface points are called Support Vectors.

[Figure: data mapped by a kernel into a feature space and enclosed by the smallest hyper-sphere of radius R]
111
Minim
ize subject to for i= 1,…,N
2R
Squared radius of the sphereConstraints
22
||)
(||
Ra
xi
d�
)
Form
ula
tion
with
ha
rd co
nstra
ints:
RRR
'Ra
'a
Main idea behind Support Vector Clustering

•Cluster boundaries in the input space are formed by the set of points that, when mapped from the input space to the feature space, fall exactly on the surface of the minimal enclosing hyper-sphere.
  - SVs identified by SVDD are a subset of the cluster boundary points.

[Figure: the pre-image Φ⁻¹ of the hyper-sphere surface traces the cluster contours in the input space]
Cluster assignment in SVC

•Two points xᵢ and xⱼ belong to the same cluster (i.e., have the same label) if every point of the line segment (xᵢ, xⱼ), projected to the feature space, lies within the hyper-sphere.

[Figure: for a segment within one cluster, every point maps inside the hyper-sphere; for a segment connecting different clusters, some points map outside the hyper-sphere]
Cluster assignment in SVC (continued)

•A point-wise adjacency matrix is constructed by testing the line segments between every pair of points
•Connected components are extracted
•Points belonging to the same connected component are assigned the same label

Example adjacency matrix for points A-E (1 = the segment stays within the hyper-sphere):
    A B C D E
A   1 1 1 0 0
B     1 1 0 0
C       1 0 0
D         1 1
E           1
This yields the connected components {A, B, C} and {D, E}.
Effects of noise and cluster overlap

•In practice, data often contains noise, outlier points and overlapping clusters, which would prevent contour separation and result in all points being assigned to the same cluster.

[Figure: "ideal data" with well-separated clusters vs. "typical data" with noise, outliers, and overlap]
SVDD with soft constraints

•SVC can be used on noisy data by allowing a fraction of points, called Bounded SVs (BSV), to lie outside the hyper-sphere.
  - BSVs are not considered as cluster boundary points and are not assigned to clusters by SVC.

[Figure: typical data with noise, outliers and overlap mapped by a kernel to the feature space; a few points are allowed to remain outside the minimal hyper-sphere]
Soft SVDD optimization criterion

Primal formulation with soft constraints:
Minimize R²  (squared radius of the sphere)
subject to ||Φ(xᵢ) - a||² ≤ R² + ξᵢ,  ξᵢ ≥ 0  for i = 1,…,N

The introduction of slack variables ξᵢ mitigates the influence of noise and overlap on the clustering process.
Dual formulation of soft SVDD

Maximize  Σᵢ βᵢ K(xᵢ, xᵢ) - Σᵢ,ⱼ βᵢβⱼ K(xᵢ, xⱼ)
subject to 0 ≤ βᵢ ≤ C for i = 1,…,N

•As before, K(xᵢ, xⱼ) = Φ(xᵢ)·Φ(xⱼ) denotes a kernel function, e.g. the Gaussian kernel K(xᵢ, xⱼ) = exp(-γ ||xᵢ - xⱼ||²) with γ > 0.
•Parameter 0 < C ≤ 1 gives a trade-off between the volume of the sphere and the number of errors (C = 1 corresponds to hard constraints).
•The Gaussian kernel tends to yield tighter contour representations of clusters than the polynomial kernel.
•The Gaussian kernel width parameter γ influences the tightness of cluster boundaries, the number of SVs and the number of clusters.
•Increasing γ causes an increase in the number of clusters.
SVM-based variable selection
Understanding the weight vector w

Recall the standard SVM formulation:
Find w and b that minimize (1/2)||w||² subject to yᵢ(w·xᵢ + b) ≥ 1 for i = 1,…,N.
Use the classifier: f(x) = sign(w·x + b).

•The weight vector w contains as many elements as there are input variables in the dataset, i.e. w ∈ Rⁿ.
•The magnitude of each element denotes the importance of the corresponding variable for the classification task.
Understanding the weight vector w

Examples:
  w = (1, 1), decision surface x₁ + x₂ + b = 0:  X1 and X2 are equally important
  w = (1, 0), decision surface x₁ + b = 0:       X1 is important, X2 is not
  w = (0, 1), decision surface x₂ + b = 0:       X2 is important, X1 is not
  w = (1, 1, 0), decision surface x₁ + x₂ + b = 0 in three dimensions:  X1 and X2 are equally important, X3 is not
Understanding the weight vector w

[Figure: Melanoma vs. Nevi samples in the Gene X1 - Gene X2 plane; in the true model X1 determines the phenotype and X2 depends on X1]

•In the true model, X1 is causal and X2 is redundant.
•The SVM decision surface x₁ + x₂ + b = 0 (w = (1, 1)) implies that X1 and X2 are equally important; thus it is locally causally inconsistent.
•There exists a causally consistent decision surface for this example, e.g. the decision surface of another classifier x₁ + b = 0 (w = (1, 0)).
•Causal discovery algorithms can identify that X1 is causal and X2 is redundant.
Simple SVM-based variable selection algorithm

Algorithm:
1. Train an SVM classifier using data for all variables to estimate the vector w
2. Rank each variable based on the magnitude of the corresponding element in vector w
3. Using the above ranking of variables, select the smallest nested subset of variables that achieves the best SVM prediction accuracy.
124
Simple SV
M-based variable selection
algorithmConsider that w
e have 7 variables: X1 , X
2 , X3 , X
4 , X5 , X
6 , X7
The vector is: (0.1, 0.3, 0.4, 0.01, 0.9, -0.99, 0.2)The ranking of variables is: X
6 , X5 , X
3 , X2 , X
7 , X1 , X
4
Subset of variablesClassification
accuracy
X6
X5
X3
X2
X7
X1
X4
0.920
X6
X5
X3
X2
X7
X1
0.920
X6
X5
X3
X2
X7
0.919
X6
X5
X3
X2
0.852
X6
X5
X3
0.843
X6
X5
0.832
X6
0.821
Best classification accuracy
Classification accuracy that is statistically indistinguishable from
the best one
ÎSelect the follow
ing variable subset: X6 , X
5 , X3 , X
2, X
7
w &
Simple SVM-based variable selection algorithm

•SVM weights are not locally causally consistent → we may end up with a variable subset that is not causal and not necessarily the most compact one.
•The magnitude of a variable in vector w estimates the effect of removing that variable on the objective function of the SVM (e.g., the function that we want to minimize). However, this algorithm becomes sub-optimal when considering the effect of removing several variables at a time… This pitfall is addressed in the SVM-RFE algorithm that is presented next.
SVM-RFE variable selection algorithm

Algorithm:
1. Initialize V to all variables in the data
2. Repeat
3.   Train an SVM classifier using data for variables in V to estimate the vector w
4.   Estimate prediction accuracy of variables in V using the above SVM classifier (e.g., by cross-validation)
5.   Remove from V a variable (or a subset of variables) with the smallest magnitude of the corresponding element in vector w
6. Until there are no variables in V
7. Select the smallest subset of variables with the best prediction accuracy
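scikit-learn ships a recursive feature elimination wrapper that follows essentially this loop. A minimal sketch with a linear SVM as the underlying estimator; the library and synthetic data are assumptions for illustration (RFECV would additionally cross-validate the subset size, as in step 7).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)   # only the first 3 variables are relevant

svm = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=svm, n_features_to_select=3, step=1)  # drop one variable per iteration
rfe.fit(X, y)
print(np.where(rfe.support_)[0])   # indices of the retained variables
print(rfe.ranking_)                # 1 = kept; larger numbers were eliminated earlier
```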
SVM-RFE variable selection algorithm

[Figure: 10,000 genes → SVM model → 5,000 genes important for classification, 5,000 discarded → SVM model → 2,500 genes important, 2,500 discarded → …; prediction accuracy is estimated at every step]

•Unlike the simple SVM-based variable selection algorithm, SVM-RFE estimates vector w many times to establish the ranking of the variables.
•Notice that the prediction accuracy should be estimated at each step in an unbiased fashion, e.g. by cross-validation.
SVM variable selection in feature space

The real power of SVMs comes with application of the kernel trick, which maps data to a much higher dimensional space ("feature space") where the data is linearly separable.

[Figure: Tumor vs. Normal data, not separable in the input space, becomes separable in the feature space obtained by a kernel]
SVM variable selection in feature space

•We have data for 100 SNPs (X1,…,X100) and some phenotype.
•We allow up to 3rd order interactions, e.g. we consider:
  - X1,…,X100
  - X1², X1X2, X1X3,…,X1X100,…,X100²
  - X1³, X1X2X3, X1X2X4,…,X1X99X100,…,X100³
•Task: find the smallest subset of features (either SNPs or their interactions) that achieves the best predictive accuracy of the phenotype.
•Challenge: If we have a limited sample, we cannot explicitly construct and evaluate all SNPs and their interactions (176,851 features in total) as is done in classical statistics.
SVM variable selection in feature space

Heuristic solution: Apply the algorithm SVM-FSMB that:
1. Uses SVMs with a polynomial kernel of degree 3 and selects M features (not necessarily input variables!) that have the largest weights in the feature space.
   E.g., the algorithm can select features like: X10, (X1X2), (X9X2X22), (X7²X98), and so on.
2. Applies the HITON-MB Markov blanket algorithm to find the Markov blanket of the phenotype using the M features from step 1.
Computing posterior class probabilities for SVM classifiers
131
132
Output of SV
M classifier
1.SVM
s output a class label (positive or negative) for each sam
ple:
2. One can also com
pute distance from
the hyperplane that separates classes, e.g. . These distances can be used to com
pute performance m
etrics like area under RO
C curve.
Positive samples (y=+1)
Negative sam
ples (y=-1)
0
��
bx
w&
&
)(
bx
wsign
�� &
&
bx
w�
� &&
Question:H
ow can one use SV
Ms to estim
ate posterior class probabilities, i.e., P(class positive | sam
ple x)?
Simple binning method

1. Train the SVM classifier in the Training set.
2. Apply it to the Validation set and compute distances from the hyperplane to each sample.
   Sample #   1   2   3   4   5  ...  98   99   100
   Distance   2  -1   8   3   4  ...  0.3  0.8  ...
3. Create a histogram with Q (e.g., say 10) bins using the above distances. Each bin has an upper and lower value in terms of distance.

[Figure: histogram of the number of validation samples per distance bin]
Simple binning method

4. Given a new sample from the Testing set, place it in the corresponding bin.
   E.g., sample #382 has distance to hyperplane = 1, so it is placed in the bin [0, 2.5].
5. Compute the probability P(positive class | sample #382) as the fraction of true positives in this bin.
   E.g., this bin has 22 samples (from the Validation set), out of which 17 are true positives, so we compute P(positive class | sample #382) = 17/22 = 0.77.

[Figure: histogram of validation samples per distance bin and the resulting P(positive class | sample) as a function of distance]
Platt's method

Convert distances output by the SVM to probabilities by passing them through the sigmoid filter:
  P(positive class | sample) = 1 / (1 + exp(A·d + B))
where d is the distance from the hyperplane and A and B are parameters.
Platt's method

1. Train the SVM classifier in the Training set.
2. Apply it to the Validation set and compute distances from the hyperplane to each sample.
   Sample #   1   2   3   4   5  ...  98   99   100
   Distance   2  -1   8   3   4  ...  0.3  0.8  ...
3. Determine parameters A and B of the sigmoid function by minimizing the negative log likelihood of the data from the Validation set.
4. Given a new sample from the Testing set, compute its posterior probability using the sigmoid function.
Part 3

•Case studies (taken from our research)
  1. Classification of cancer gene expression microarray data
  2. Text categorization in biomedicine
  3. Prediction of clinical laboratory values
  4. Modeling clinical judgment
  5. Using SVMs for feature selection
  6. Outlier detection in ovarian cancer proteomics data
•Software
•Conclusions
•Bibliography
1. Classification of cancer gene expression microarray data
Comprehensive evaluation of algorithms for classification of cancer microarray data

Main goals:
•Find the best performing decision support algorithms for cancer diagnosis from microarray gene expression data;
•Investigate benefits of using gene selection and ensemble classification methods.
Classifiers

•K-Nearest Neighbors (KNN) (instance-based)
•Backpropagation Neural Networks (NN) (neural networks)
•Probabilistic Neural Networks (PNN) (neural networks)
•Multi-Class SVM: One-Versus-Rest (OVR) (kernel-based)
•Multi-Class SVM: One-Versus-One (OVO) (kernel-based)
•Multi-Class SVM: DAGSVM (kernel-based)
•Multi-Class SVM by Weston & Watkins (WW) (kernel-based)
•Multi-Class SVM by Crammer & Singer (CS) (kernel-based)
•Weighted Voting: One-Versus-Rest (voting)
•Weighted Voting: One-Versus-One (voting)
•Decision Trees: CART (decision trees)
Ensemble classifiers

[Figure: the dataset is given to Classifier 1, Classifier 2, …, Classifier N; their predictions are combined by an ensemble classifier into a final prediction]
Gene selection methods

1. Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion;
2. Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion;
3. Kruskal-Wallis nonparametric one-way ANOVA (KW);
4. Ratio of genes between-categories to within-category sum of squares (BW).

[Figure: genes ranked from uninformative genes to highly discriminatory genes]
Performance metrics and statistical comparison

1. Accuracy
   + can compare to previous studies
   + easy to interpret & simplifies statistical comparison
2. Relative classifier information (RCI)
   + easy to interpret & simplifies statistical comparison
   + not sensitive to distribution of classes
   + accounts for difficulty of a decision problem

•Randomized permutation testing to compare accuracies of the classifiers (α = 0.05)
Microarray datasets

Dataset name    | Samples | Variables (genes) | Categories | Reference
11_Tumors       | 174     | 12533             | 11         | Su, 2001
14_Tumors       | 308     | 15009             | 26         | Ramaswamy, 2001
9_Tumors        | 60      | 5726              | 9          | Staunton, 2001
Brain_Tumor1    | 90      | 5920              | 5          | Pomeroy, 2002
Brain_Tumor2    | 50      | 10367             | 4          | Nutt, 2003
Leukemia1       | 72      | 5327              | 3          | Golub, 1999
Leukemia2       | 72      | 11225             | 3          | Armstrong, 2002
Lung_Cancer     | 203     | 12600             | 5          | Bhattacherjee, 2001
SRBCT           | 83      | 2308              | 4          | Khan, 2001
Prostate_Tumor  | 102     | 10509             | 2          | Singh, 2002
DLBCL           | 77      | 5469              | 2          | Shipp, 2002

Total:
•~1300 samples
•74 diagnostic categories
•41 cancer types and 12 normal tissue types
Summary of methods and datasets

Gene expression datasets (11): 9_Tumors, 14_Tumors, 11_Tumors, Brain_Tumor1, Brain_Tumor2, Leukemia1, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumors, DLBCL (multicategory and binary diagnostic tasks)

Gene selection methods (4): S2N One-Versus-Rest, S2N One-Versus-One, Non-parametric ANOVA (KW), BW ratio

Cross-validation designs (2): 10-fold CV, LOOCV

Performance metrics (2): Accuracy, RCI

Classifiers (11): MC-SVM (One-Versus-Rest, One-Versus-One, DAGSVM, method by WW, method by CS), KNN, Backprop. NN, Prob. NN, Weighted Voting (One-Versus-Rest, One-Versus-One), Decision Trees

Ensemble classifiers (7): based on MC-SVM outputs (Decision Trees, Majority Voting) and based on outputs of all classifiers (Decision Trees, Majority Voting, MC-SVM OVR, MC-SVM OVO, MC-SVM DAGSVM)

Statistical comparison: randomized permutation testing
Results without gene selection

[Figure: accuracy (%) of the MC-SVM methods (OVR, OVO, DAGSVM, WW, CS) and of KNN, NN, PNN on the 11 datasets (9_Tumors, 14_Tumors, Brain_Tumor1, Brain_Tumor2, 11_Tumors, Leukemia1, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumor, DLBCL)]
Results with gene selection

[Figure: improvement in accuracy (%) by gene selection for OVR, OVO, DAGSVM, WW, CS (SVM) and KNN, NN, PNN (non-SVM), averaged over four datasets (9_Tumors, 14_Tumors, Brain_Tumor1, Brain_Tumor2), together with diagnostic performance before and after gene selection for SVM and non-SVM methods]

The average reduction of genes is 10-30 times.
Comparison with previously published results

[Figure: accuracy (%) of multiple specialized classification methods (original primary studies) vs. multiclass SVMs (this study) on the 11 datasets]
Summary of results

•Multi-class SVMs are the best family among the tested algorithms, outperforming KNN, NN, PNN, DT, and WV.
•Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms;
•Ensemble classification does not improve performance of SVM and other classifiers;
•Results obtained by SVMs compare favorably with the literature.
Random Forest (RF) classifiers

•Appealing properties
  - Work when # of predictors > # of samples
  - Embedded gene selection
  - Incorporate interactions
  - Based on theory of ensemble learning
  - Can work with binary & multiclass tasks
  - Do not require much fine-tuning of parameters
•Strong theoretical claims
•Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods
Key principles of RF classifiers

1) Generate bootstrap samples from the training data
2) Random gene selection at each split
3) Fit unpruned decision trees
4) Apply to testing data & combine predictions
151
Results w
ithout gene selection
152
•SV
Ms n
om
ina
llyoutperform
RFs is 15 datasets, RFs outperform SV
Ms in 4 datasets,
algorithms are exactly the sam
e in 3 datasets.•
In 7 datasets SVM
s outperform RFs sta
tistically
sign
ifican
tly.•
On average, the perform
ance advantage of SVM
s is 0.033 AU
C and 0.057 RCI.
Results with gene selection

•SVMs nominally outperform RFs in 17 datasets, RFs outperform SVMs in 3 datasets, and the algorithms perform exactly the same in 2 datasets.
•In 1 dataset SVMs outperform RFs statistically significantly.
•On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI.
2. Text categorization in biomedicine
Models to categorize content and quality: Main idea

1. Utilize existing (or easy to build) training corpora.
2. Use simple document representations as "bag-of-words" (i.e., typically stemmed and weighted words in title and abstract, MeSH terms if available; occasionally addition of Metamap CUIs and author info).
Models to categorize content and quality: Main idea

3. Train SVM models on labeled examples that capture implicit categories of meaning or quality criteria, then apply them to unseen examples.
4. Evaluate models' performances
   - with nested cross-validation or other appropriate error estimators
   - using primarily AUC as well as other metrics (sensitivity, specificity, PPV, Precision/Recall curves, HIT curves, etc.)
5. Evaluate performance prospectively & compare to prior cross-validation estimates.

[Figure: estimated performance vs. 2005 prospective performance for the Treatment (Txmt), Diagnosis (Diag), Prognosis (Prog) and Etiology (Etio) categories]
Models to categorize content and quality: Some notable results

1. SVM models have excellent ability to identify high-quality PubMed documents according to the ACPJ gold standard:

Category  | Average AUC | Range over n folds
Treatment | 0.97*       | 0.96 - 0.98
Etiology  | 0.94*       | 0.89 - 0.95
Prognosis | 0.95*       | 0.92 - 0.97
Diagnosis | 0.95*       | 0.93 - 0.98

2. SVM models have better classification performance than PageRank, Yahoo ranks, Impact Factor, Web page hit counts, and bibliometric citation counts on the Web according to the ACPJ gold standard:

Method                      | Treatment AUC | Etiology AUC | Prognosis AUC | Diagnosis AUC
Google PageRank             | 0.54          | 0.54         | 0.43          | 0.46
Yahoo Webranks              | 0.56          | 0.49         | 0.52          | 0.52
Impact Factor 2005          | 0.67          | 0.62         | 0.51          | 0.52
Web page hit count          | 0.63          | 0.63         | 0.58          | 0.57
Bibliometric Citation Count | 0.76          | 0.69         | 0.67          | 0.60
Machine Learning Models     | 0.96          | 0.95         | 0.95          | 0.95
Models to categorize content and quality: Some notable results

3. SVM models have better classification performance than PageRank, Impact Factor and Citation count in Medline for the SSOAB gold standard:

Gold standard: SSOAB       | Area under the ROC curve*
SSOAB-specific filters     | 0.893
Citation Count             | 0.791
ACPJ Txmt-specific filters | 0.548
Impact Factor (2001)       | 0.549
Impact Factor (2005)       | 0.558

4. SVM learning models have better sensitivity/specificity in PubMed than clinical query filters (CQFs) at comparable thresholds according to the ACPJ gold standard.

[Figure: bar charts comparing query filters vs. learning models at fixed specificity and at fixed sensitivity for the Treatment, Etiology, Prognosis and Diagnosis categories; e.g. for Diagnosis at fixed specificity 0.97, sensitivity is 0.65 for query filters vs. 0.82 for learning models, and at fixed sensitivity 0.96, specificity is 0.68 vs. 0.88]
Other applications of SVMs to text categorization
159
1. Identifying Web pages with misleading treatment information according to a special-purpose gold standard (Quack Watch). SVM models outperform the Quackometer and Google ranks in the tested domain of cancer treatment.

Model                     Area under the curve
Machine learning models   0.93
Quackometer*              0.67
Google                    0.63

2. Prediction of future paper citation counts (work of L. Fu and C.F. Aliferis, AMIA 2008).
3. Prediction of clinical laboratory values
160
Dataset generation and experimental design
• The StarPanel database contains ~8·10^6 lab measurements of ~100,000 in-patients from Vanderbilt University Medical Center.
• Lab measurements were taken between 01/1998 and 10/2002.
• Data split: Training 01/1998-05/2001 (with 25% of Training set aside for Validation) and Testing 06/2001-10/2002.
• For each combination of lab test and normal range, we generated the following datasets.
Query-based approach for prediction of clinical lab values
161
[Diagram: for every data model, an SVM classifier is trained on the training data and evaluated on the validation data; the optimal data model and its SVM classifier are then applied to every testing sample drawn from the database to produce a prediction and measure performance.]
A sketch of this model-selection loop is given below.
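The sketch below illustrates the loop in the diagram under assumptions (it is not the StarPanel code; build_dataset is a hypothetical helper): train an SVM for every candidate data model, score it on the validation data, and keep the best data model for application to the testing samples.

```python
# Sketch of the model-selection loop above (illustrative assumptions, not the StarPanel code).
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

def select_best_data_model(data_models, build_dataset, train_records, valid_records):
    """data_models: candidate descriptions (e.g., how many prior measurements to use).
    build_dataset: hypothetical helper turning raw records into (X, y) under a data model."""
    best = (None, None, -1.0)  # (data model, fitted classifier, validation AUC)
    for dm in data_models:
        X_tr, y_tr = build_dataset(train_records, dm)
        X_va, y_va = build_dataset(valid_records, dm)
        clf = SVC(kernel="linear").fit(X_tr, y_tr)               # train SVM classifier
        auc = roc_auc_score(y_va, clf.decision_function(X_va))   # validation performance
        if auc > best[2]:
            best = (dm, clf, auc)
    return best  # the optimal data model + classifier, later applied to each testing sample
```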
Classification results
162
Area under ROC curve (without feature selection), including cases with K=0 (i.e., samples with no prior lab measurements). Column headings are the ranges of normal values:

Laboratory test   >1      <99     [1, 99]   >2.5    <97.5   [2.5, 97.5]
BUN               80.4%   99.1%   76.6%     87.1%   98.2%   70.7%
Ca                72.8%   93.4%   55.6%     81.4%   81.4%   63.4%
CaIo              74.1%   60.0%   50.1%     64.7%   72.3%   57.7%
CO2               82.0%   93.6%   59.8%     84.4%   94.5%   56.3%
Creat             62.8%   97.7%   89.1%     91.5%   98.1%   87.7%
Mg                56.9%   70.0%   49.1%     58.6%   76.9%   59.1%
Osmol             50.9%   60.8%   60.8%     91.0%   90.5%   68.0%
PCV               74.9%   99.2%   66.3%     80.9%   80.6%   67.1%
Phos              74.5%   93.6%   64.4%     71.7%   92.2%   69.7%

Area under ROC curve (without feature selection), excluding cases with K=0 (i.e., samples with no prior lab measurements):

Laboratory test   >1      <99     [1, 99]   >2.5    <97.5   [2.5, 97.5]
BUN               75.9%   93.4%   68.5%     81.8%   92.2%   66.9%
Ca                67.5%   80.4%   55.0%     77.4%   70.8%   60.0%
CaIo              63.5%   52.9%   58.8%     46.4%   66.3%   58.7%
CO2               77.3%   88.0%   53.4%     77.5%   90.5%   58.1%
Creat             62.2%   88.4%   83.5%     88.4%   94.9%   83.8%
Mg                58.4%   71.8%   64.2%     67.0%   72.5%   62.1%
Osmol             77.9%   64.8%   65.2%     79.2%   82.4%   71.5%
PCV               62.3%   91.6%   69.7%     76.5%   84.6%   70.2%
Phos              70.8%   75.4%   60.4%     68.0%   81.8%   65.9%

A total of 84,240 SVM classifiers were built for 16,848 possible data models.
163
Improving predictive power and parsimony
of a BUN model using feature selection
164
Model description:
• Test name: BUN
• Range of normal values: < 99th percentile
• Data modeling: SRT
• Number of previous measurements: 5
• Use variables corresponding to hospitalization units? Yes
• Number of prior hospitalizations used: 2

Dataset description (3442 variables in each set):
                 N samples (total)   N abnormal samples
Training set     3749                78
Validation set   1251                27
Testing set      836                 16

[Figure: histogram of test BUN (frequency of measurements vs. test value), with the ranges of normal and abnormal values marked.]

Classification performance (area under ROC curve):
                     All      RFE_Linear   RFE_Poly   HITON_PC   HITON_MB
Validation set       95.29%   98.78%       98.76%     99.12%     98.90%
Testing set          94.72%   99.66%       99.63%     99.16%     99.05%
Number of features   3442     26           3          11         17

Selected features (per method):
• RFE_Linear (26): LAB: PM_1(BUN); LAB: PM_2(Cl); LAB: DT(PM_3(K)); LAB: DT(PM_3(Creat)); LAB: Test Unit J018 (Test Ca, PM 3); LAB: DT(PM_4(Cl)); LAB: DT(PM_3(Mg)); LAB: PM_1(Cl); LAB: PM_3(Gluc); LAB: DT(PM_1(CO2)); LAB: DT(PM_4(Gluc)); LAB: PM_3(Mg); LAB: DT(PM_5(Mg)); LAB: PM_1(PCV); LAB: PM_2(BUN); LAB: Test Unit 11NM (Test PCV, PM 2); LAB: Test Unit 7SCC (Test Mg, PM 3); LAB: DT(PM_2(Phos)); LAB: DT(PM_3(CO2)); LAB: DT(PM_2(Gluc)); LAB: DT(PM_5(CaIo)); DEMO: Hospitalization Unit TVOS; LAB: PM_1(Phos); LAB: PM_2(Phos); LAB: Test Unit 11NM (Test K, PM 5); LAB: Test Unit VHR (Test CaIo, PM 1)
• RFE_Poly (3): LAB: PM_1(BUN); LAB: Indicator(PM_1(Mg)); LAB: Test Unit NO_TEST_MEASUREMENT (Test CaIo, PM 1)
• HITON_PC (11): LAB: PM_1(BUN); LAB: PM_5(Creat); LAB: PM_1(Phos); LAB: Indicator(PM_1(BUN)); LAB: Indicator(PM_5(Creat)); LAB: Indicator(PM_1(Mg)); LAB: DT(PM_4(Creat)); LAB: Test Unit 7SCC (Test Ca, PM 1); LAB: Test Unit RADR (Test Ca, PM 5); LAB: Test Unit 7SMI (Test PCV, PM 4); DEMO: Gender
• HITON_MB (17): LAB: PM_1(BUN); LAB: PM_5(Creat); LAB: PM_3(PCV); LAB: PM_1(Mg); LAB: PM_1(Phos); LAB: Indicator(PM_4(Creat)); LAB: Indicator(PM_5(Creat)); LAB: Indicator(PM_3(PCV)); LAB: Indicator(PM_1(Phos)); LAB: DT(PM_4(Creat)); LAB: Test Unit 11NM (Test BUN, PM 2); LAB: Test Unit 7SCC (Test Ca, PM 1); LAB: Test Unit RADR (Test Ca, PM 5); LAB: Test Unit 7SMI (Test PCV, PM 4); LAB: Test Unit CCL (Test Phos, PM 1); DEMO: Gender; DEMO: Age
165
4. Modeling clinical judgment
166
Methodological framework and study outline
[Diagram: for each of the physicians (Physician 1 through Physician 6), every patient 1..N is described by features f1 ... fm, the physician's clinical diagnosis cd, and the gold-standard diagnosis hd; the patient features and gold standard are the same across physicians, while the clinical diagnoses differ across physicians.]
Study goals:
• Predict clinical decisions
• Identify predictors ignored by physicians
• Explain each physician's diagnostic model
• Compare physicians with each other and with guidelines
Clinical context of experiment
Malignant melanoma is the most dangerous form of skin cancer; its incidence and mortality have been constantly increasing in the last decades.
Physicians and patients:
• Dermatologists: N = 6 (3 experts, 3 non-experts)
• Patients: N = 177 (76 melanomas, 101 nevi)
Features:
Lesion location, Family history of melanoma, Irregular border, Streaks (radial streaming, pseudopods), Max-diameter, Fitzpatrick's photo-type, Number of colors, Slate-blue veil, Min-diameter, Sunburn, Atypical pigmented network, Whitish veil, Evolution, Ephelis, Abrupt network cut-off, Globular elements, Age, Lentigos, Regression-Erythema, Comedo-like openings/milia-like cysts, Gender, Asymmetry, Hypo-pigmentation, Telangiectasia
Data collection:
• Patients seen prospectively from 1999 to 2002 at the Department of Dermatology, S. Chiara Hospital, Trento, Italy
• Inclusion criteria: histological diagnosis and >1 digital image available
• Diagnoses made in 2004
Method to explain physician-specific SVM models
[Diagram: regular learning builds an SVM "black box" from the data (with feature selection, FS); meta-learning then applies the SVM and builds a decision tree (DT) on its outputs, so the tree explains the SVM's behavior.]
A small illustrative sketch of this meta-learning step is given below.
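The sketch below shows the meta-learning idea on synthetic data under assumptions (it is not the study's code or its dermoscopic dataset): fit the SVM "black box", then fit a decision tree to the SVM's predictions so the tree provides an explicit, readable approximation of the model.

```python
# Sketch of the meta-learning idea (assumed implementation on synthetic data):
# 1) regular learning fits an SVM "black box";
# 2) meta-learning fits a decision tree to the SVM's predictions to explain it.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
features = ["blue_veil", "irregular_border", "streaks"]       # toy binary dermoscopic features
X = rng.integers(0, 2, size=(300, 3))                          # toy feature values
y = ((X[:, 0] | X[:, 2]) & X[:, 1]).astype(int)                # toy "physician" diagnoses

svm = SVC(kernel="rbf", gamma="scale").fit(X, y)               # regular learning
tree = DecisionTreeClassifier(max_depth=3, random_state=0)     # meta-learning
tree.fit(X, svm.predict(X))                                    # learn on SVM outputs, not on y

print(export_text(tree, feature_names=features))               # human-readable explanation
```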
Results: Predicting physicians' judgments

Physicians    All (features)   HITON_PC (features)   HITON_MB (features)   RFE (features)
Expert 1      0.94 (24)        0.92 (4)              0.92 (5)              0.95 (14)
Expert 2      0.92 (24)        0.89 (7)              0.90 (7)              0.90 (12)
Expert 3      0.98 (24)        0.95 (4)              0.95 (4)              0.97 (19)
NonExpert 1   0.92 (24)        0.89 (5)              0.89 (6)              0.90 (22)
NonExpert 2   1.00 (24)        0.99 (6)              0.99 (6)              0.98 (11)
NonExpert 3   0.89 (24)        0.89 (4)              0.89 (6)              0.87 (10)
Results: Physician-specific models
Results: Explaining physician agreement
Expert 1: AUC = 0.92, R2 = 99%. Expert 3: AUC = 0.95, R2 = 99%.

              Blue veil   Irregular border   Streaks
Patient 001   yes         no                 yes
Results: Explaining physician disagreement
Expert 1: AUC = 0.92, R2 = 99%. Expert 3: AUC = 0.95, R2 = 99%.

              Blue veil   Irregular border   Streaks   Number of colors   Evolution
Patient 002   no          no                 yes       3                  no
Results: Guideline compliance
• Experts 1, 2, 3 and non-expert 1 (reported guideline: pattern analysis). Non-compliant: they ignore the majority of the features (17 to 20) recommended by pattern analysis.
• Non-expert 2 (reported guideline: ABCDE rule). Non-compliant: asymmetry, irregular border, and evolution are ignored.
• Non-expert 3 (reported guideline: non-standard; reports using 7 features). Non-compliant: 2 of the 7 reported features are ignored, while some non-reported features are used.
On the contrary: in all guidelines, the more predictors present, the higher the likelihood of melanoma. All physicians were compliant with this principle.
5. Using SVMs for feature selection
176
Feature selection methods
177
Feature selection methods (non-causal):
• SVM-RFE (an SVM-based feature selection method; see the sketch below)
• Univariate + wrapper
• Random forest-based
• LARS-Elastic Net
• RELIEF + wrapper
• L0-norm
• Forward stepwise feature selection
• No feature selection
Causal feature selection methods:
• HITON-PC
• HITON-MB (this method outputs a Markov blanket of the response variable, under assumptions)
• IAMB
• BLCD
• K2MB
• ...
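For concreteness, here is a minimal sketch of one simple SVM-RFE variant (recursive elimination of the features with the smallest linear-SVM weights, in the spirit of Guyon et al. 2002), written with scikit-learn on synthetic data; it is not the exact set of SVM-RFE variants evaluated in this study.

```python
# Minimal SVM-RFE sketch (one simple variant on synthetic data, not the exact
# variants evaluated here): repeatedly fit a linear SVM and drop the features
# with the smallest absolute weights.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

selector = RFE(
    estimator=LinearSVC(C=1.0, max_iter=10000),  # |w_i| of the linear SVM is the ranking criterion
    n_features_to_select=5,                      # stop when 5 features remain
    step=0.1,                                    # eliminate 10% of the remaining features per round
)
selector.fit(X, y)
print("Selected feature indices:", list(selector.get_support(indices=True)))
```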
13 real datasets were used to evaluate feature selection methods
178

Dataset name       Domain              Number of variables   Number of samples   Target                                      Data type
Infant_Mortality   Clinical            86                    5,337               Died within the first year                  discrete
Ohsumed            Text                14,373                5,000               Relevant to neonatal diseases               continuous
ACPJ_Etiology      Text                28,228                15,779              Relevant to etiology                        continuous
Lymphoma           Gene expression     7,399                 227                 3-year survival: dead vs. alive             continuous
Gisette            Digit recognition   5,000                 7,000               Separate 4 from 9                           continuous
Dexter             Text                19,999                600                 Relevant to corporate acquisitions          continuous
Sylva              Ecology             216                   14,394              Ponderosa pine vs. everything else          continuous & discrete
Ovarian_Cancer     Proteomics          2,190                 216                 Cancer vs. normals                          continuous
Thrombin           Drug discovery      139,351               2,543               Binding to thrombin                         discrete (binary)
Breast_Cancer      Gene expression     17,816                286                 Estrogen-receptor positive (ER+) vs. ER-    continuous
Hiva               Drug discovery      1,617                 4,229               Activity to AIDS HIV infection              discrete (binary)
Nova               Text                16,969                1,929               Separate politics from religion topics      discrete (binary)
Bankruptcy         Financial           147                   7,063               Personal bankruptcy                         continuous & discrete
Classification performance vs. proportion of selected features
179
[Figure: classification performance (AUC) vs. proportion of selected features for HITON-PC with G2 test and for RFE; an original plot over the full range plus a magnified view for proportions of about 0.05-0.2, where AUC lies roughly between 0.85 and 0.90.]
Statistical comparison of predictivity and reduction of features
180
Comparison: SVM-RFE (4 variants) vs. HITON-PC

Predictivity                  Reduction of features
P-value    Nominal winner     P-value    Nominal winner
0.9754     SVM-RFE            0.0046     HITON-PC
0.8030     SVM-RFE            0.0042     HITON-PC
0.1312     HITON-PC           0.3634     HITON-PC
0.1008     HITON-PC           0.6816     SVM-RFE

• Null hypothesis: SVM-RFE and HITON-PC perform the same.
• A permutation-based statistical test with alpha = 0.05 is used (a small sketch of such a test is given below).
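The sketch below shows one common form of paired permutation test (random sign flipping of per-dataset performance differences). It is an assumed illustration; the exact test used in the study may differ in its details.

```python
# Sketch of a paired permutation test for comparing two methods across datasets
# (an assumed form; the study's exact test may differ).
import numpy as np

def permutation_p_value(scores_a, scores_b, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diff.mean())
    hits = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diff.size)  # randomly swap the two methods per dataset
        if abs((signs * diff).mean()) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical example: per-dataset AUCs of two methods; reject at alpha = 0.05 if p < 0.05.
p = permutation_p_value([0.91, 0.88, 0.93, 0.85], [0.89, 0.87, 0.90, 0.86])
print("p-value:", p)
```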
Simulated datasets with known causal structure used to compare algorithms
181
182
Comparison of SVM-RFE and HITON-PC
Comparison of all methods in terms of causal graph distance
183
[Figure: causal graph distance for SVM-RFE and for the HITON-PC based causal methods.]
Summary results
184
[Figure: summary of results for SVM-RFE, the HITON-PC based causal methods, and the HITON-PC-FDR methods.]
Statistical comparison of graph distance
185
Comparison: average HITON-PC-FDR with G2 test vs. average SVM-RFE

                     P-value    Nominal winner
Sample size = 200    <0.0001    HITON-PC-FDR
Sample size = 500    0.0028     HITON-PC-FDR
Sample size = 5000   <0.0001    HITON-PC-FDR

• Null hypothesis: SVM-RFE and HITON-PC-FDR perform the same.
• A permutation-based statistical test with alpha = 0.05 is used.
6. Outlier detection in ovarian cancer proteomics data
186
Ovarian cancer data
[Figure: proteomic profiles for Data Set 1 (top) and Data Set 2 (bottom), with Cancer, Normal, and Other samples plotted against clock tick (4000, 8000, 12000).]
• Same set of 216 patients, obtained using the Ciphergen H4 ProteinChip array (dataset 1) and using the Ciphergen WCX2 ProteinChip array (dataset 2).
• The gross break at the "benign disease" juncture in dataset 1 and the similarity of the profiles to those in dataset 2 suggest a change of protocol in the middle of the first experiment.
Experiments with one-class SVM
Assume that sets {A, B} are normal and {C, D, E, F} are outliers. Also assume that we do not know which samples are normal and which are outliers.
• Experiment 1: Train a one-class SVM on {A, B, C} and test on {A, B, C}: area under ROC curve = 0.98.
• Experiment 2: Train a one-class SVM on {A, C} and test on {B, D, E, F}: area under ROC curve = 0.98.
[Figure: Data Set 1 (top) and Data Set 2 (bottom) profiles (Cancer, Normal, Other vs. clock tick), repeated for reference.]
188
A minimal sketch of one-class SVM outlier detection is given below.
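The following sketch runs a one-class SVM on synthetic data (illustrative only, not the actual proteomic profiles): train on samples assumed to be mostly normal, then score new samples, with low decision values flagged as outliers.

```python
# Sketch of one-class SVM outlier detection (synthetic data, not the actual proteomic profiles).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train_profiles = rng.normal(loc=0.0, scale=1.0, size=(200, 20))   # mostly "normal" samples
new_normals    = rng.normal(loc=0.0, scale=1.0, size=(20, 20))
new_outliers   = rng.normal(loc=4.0, scale=1.0, size=(20, 20))    # e.g., protocol-shifted samples

# nu upper-bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(train_profiles)

# decision_function: larger = more "normal"; negative values are flagged as outliers.
print("mean score, new normal samples :", ocsvm.decision_function(new_normals).mean())
print("mean score, new outlier samples:", ocsvm.decision_function(new_outliers).mean())
```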
Software
189
Interactive media and animations
190
SVM Applets
• http://www.csie.ntu.edu.tw/~cjlin/libsvm/
• http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
• http://www.smartlab.dibe.unige.it/Files/sw/Applet%20SVM/svmapplet.html
• http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html
• http://www.dsl-lab.org/svm_tutorial/demo.html (requires Java 3D)
Animations
• Support Vector Machines:
http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2moons.avi
http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2Gauss.avi
http://www.youtube.com/watch?v=3liCbRZPrZA
• Support Vector Regression:
http://www.cs.ust.hk/irproj/Regularization%20Path/movie/ga0.5lam1.avi
Several SVM implementations for beginners
191
• GEMS: http://www.gems-system.org
• Weka: http://www.cs.waikato.ac.nz/ml/weka/
• Spider (for Matlab): http://www.kyb.mpg.de/bs/people/spider/
• CLOP (for Matlab): http://clopinet.com/CLOP/
Several SVM implementations for intermediate users
192
• LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  - General purpose
  - Implements binary SVM, multiclass SVM, SVR, one-class SVM
  - Command-line interface
  - Code/interface for C/C++/C#, Java, Matlab, R, Python, Perl
• SVMLight: http://svmlight.joachims.org/
  - General purpose (designed for text categorization)
  - Implements binary SVM, multiclass SVM, SVR
  - Command-line interface
  - Code/interface for C/C++, Java, Matlab, Python, Perl
More software links at http://www.support-vector-machines.org/SVM_soft.html and http://www.kernel-machines.org/software
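For readers who work in Python, scikit-learn's SVC class is built on top of LIBSVM, so a minimal train/test run looks like the sketch below (illustrative defaults; in practice C and the kernel would be tuned by cross-validation, as discussed earlier in the tutorial).

```python
# Minimal SVM training/prediction example via scikit-learn's SVC (built on LIBSVM).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # Gaussian (RBF) kernel SVM
clf.fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, clf.decision_function(X_test)))
```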
Conclusions
193
194
Strong points of SVM-based learning methods
• Empirically achieve excellent results in high-dimensional data with very few samples
• Internal capacity control to avoid overfitting
• Can learn both simple linear and very complex nonlinear functions by using the "kernel trick"
• Robust to outliers and noise (use "slack variables")
• Convex QP optimization problem (thus it has a global minimum and can be solved efficiently)
• Solution is defined only by a small subset of training points ("support vectors")
• Number of free parameters is bounded by the number of support vectors and not by the number of variables
• Do not require direct access to data; work only with dot-products of data points
195
Weak points of SVM-based learning methods
• Measures of uncertainty of parameters are not currently well developed
• Interpretation is less straightforward than for classical statistics
• Lack of parametric statistical significance tests
• Power/sample size analysis and research design considerations are less developed than for classical statistics
Bibliography
196
197
Part 1: Support vector machines for binary classification: classical formulation
• Boser BE, Guyon IM, Vapnik VN: A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT) 1992:144-152.
• Burges CJC: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 1998, 2:121-167.
• Cristianini N, Shawe-Taylor J: An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press; 2000.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Herbrich R: Learning kernel classifiers: theory and algorithms. Cambridge, Mass: MIT Press; 2002.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning. Cambridge, Mass: MIT Press; 1999.
• Shawe-Taylor J, Cristianini N: Kernel methods for pattern analysis. Cambridge, UK: Cambridge University Press; 2004.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.
198
Part 1: Basic principles of statistical machine learning
• Aliferis CF, Statnikov A, Tsamardinos I: Challenges in the analysis of mass-throughput data: a technical commentary from the statistical machine learning perspective. Cancer Informatics 2006, 2:133-162.
• Duda RO, Hart PE, Stork DG: Pattern classification. 2nd edition. New York: Wiley; 2001.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Mitchell T: Machine learning. New York, NY, USA: McGraw-Hill; 1997.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.
199
Part 2: Model selection for SVMs
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI) 1995, 2:1137-1145.
• Scheffer T: Error estimation and model selection. Ph.D. Thesis, Technischen Universität Berlin, School of Computer Science; 1999.
• Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 2005, 74:491-503.
200
Part 2: SVMs for multicategory data
• Crammer K, Singer Y: On the learnability and design of output codes for multiclass problems. Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT) 2000.
• Platt JC, Cristianini N, Shawe-Taylor J: Large margin DAGs for multiclass classification. Advances in Neural Information Processing Systems (NIPS) 2000, 12:547-553.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning. Cambridge, Mass: MIT Press; 1999.
• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21:631-643.
• Weston J, Watkins C: Support vector machines for multi-class pattern recognition. Proceedings of the Seventh European Symposium on Artificial Neural Networks 1999, 4:6.
201
Part 2: Support vector regression
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Smola AJ, Schölkopf B: A tutorial on support vector regression. Statistics and Computing 2004, 14:199-222.
Part 2: Novelty detection with SVM-based methods and Support Vector Clustering
• Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC: Estimating the support of a high-dimensional distribution. Neural Computation 2001, 13:1443-1471.
• Tax DMJ, Duin RPW: Support vector domain description. Pattern Recognition Letters 1999, 20:1191-1199.
• Ben-Hur A, Horn D, Siegelmann HT, Vapnik V: Support vector clustering. Journal of Machine Learning Research 2001, 2:125-137.
202
Part 2: SVM-based variable selection
• Guyon I, Elisseeff A: An introduction to variable and feature selection. Journal of Machine Learning Research 2003, 3:1157-1182.
• Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Machine Learning 2002, 46:389-422.
• Hardin D, Tsamardinos I, Aliferis CF: A theoretical characterization of linear SVM-based feature selection. Proceedings of the Twenty-First International Conference on Machine Learning (ICML) 2004.
• Statnikov A, Hardin D, Aliferis CF: Using SVM weight-based methods to identify causally relevant and non-causally relevant variables. Proceedings of the NIPS 2006 Workshop on Causality and Feature Selection 2006.
• Tsamardinos I, Brown LE: Markov blanket-based variable selection in feature space. Technical report DSL-08-01 2008.
• Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V: Feature selection for SVMs. Advances in Neural Information Processing Systems (NIPS) 2000, 13:668-674.
• Weston J, Elisseeff A, Schölkopf B, Tipping M: Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research 2003, 3:1439-1461.
• Zhu J, Rosset S, Hastie T, Tibshirani R: 1-norm support vector machines. Advances in Neural Information Processing Systems (NIPS) 2004, 16.
203
Part 2: Computing posterior class probabilities for SVM classifiers
• Platt JC: Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers. Edited by Smola A, Bartlett B, Schölkopf B, Schuurmans D. Cambridge, MA: MIT Press; 2000.
Part 3: Classification of cancer gene expression microarray data (Case Study 1)
• Diaz-Uriarte R, Alvarez de Andres S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7:3.
• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21:631-643.
• Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 2005, 74:491-503.
• Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008, 9:319.
204
Part 3: Text categorization in biomedicine (Case Study 2)
• Aphinyanaphongs Y, Aliferis CF: Learning Boolean queries for article quality filtering. Medinfo 2004, 11:263-267.
• Aphinyanaphongs Y, Tsamardinos I, Statnikov A, Hardin D, Aliferis CF: Text categorization models for high-quality article retrieval in internal medicine. J Am Med Inform Assoc 2005, 12:207-216.
• Aphinyanaphongs Y, Statnikov A, Aliferis CF: A comparison of citation metrics to machine learning filters for the identification of high quality MEDLINE documents. J Am Med Inform Assoc 2006, 13:446-455.
• Aphinyanaphongs Y, Aliferis CF: Prospective validation of text categorization models for identifying high-quality content-specific articles in PubMed. AMIA 2006 Annual Symposium Proceedings 2006.
• Aphinyanaphongs Y, Aliferis C: Categorization models for identifying unproven cancer treatments on the Web. MEDINFO 2007.
• Fu L, Aliferis C: Models for predicting and explaining citation count of biomedical articles. AMIA 2008 Annual Symposium Proceedings 2008.
205
Part 3: Modeling clinical judgment (Case Study 4)
• Sboner A, Aliferis CF: Modeling clinical judgment and implicit guideline compliance in the diagnosis of melanomas using machine learning. AMIA 2005 Annual Symposium Proceedings 2005:664-668.
Part 3: Using SVMs for feature selection (Case Study 5)
• Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local causal and Markov blanket induction for causal discovery and feature selection for classification. Part II: Analysis and extensions. Journal of Machine Learning Research 2008.
• Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local causal and Markov blanket induction for causal discovery and feature selection for classification. Part I: Algorithms and empirical evaluation. Journal of Machine Learning Research 2008.
206
Part 3: Outlier detection in ovarian cancer proteomics data (Case Study 6)
• Baggerly KA, Morris JS, Coombes KR: Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics 2004, 20:777-785.
• Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002, 359:572-577.
Thank you for your attention!
Questions/Comments?
Email: Alexander.Statnikov@med.nyu.edu
URL: http://www.nyuinformatics.org
207