A Gentle Introduction to Support Vector Machines in Biomedicine
Alexander Statnikov*, Douglas Hardin#, Isabelle Guyon†, Constantin F. Aliferis*
(Materials about SVM Clustering were contributed by Nikita Lytkin*)
*New York University, #Vanderbilt University, †ClopiNet
Part I
•Introduction
•Necessary mathematical concepts
•Support vector machines for binary classification: classical formulation
•Basic principles of statistical machine learning
Introduction
About this tutorial

Main goal: Fully understand support vector machines (and important extensions) with a modicum of mathematics knowledge.
•This tutorial is both modest (it does not invent anything new) and ambitious (support vector machines are generally considered mathematically quite difficult to grasp).
•Tutorial approach: learning problem → main idea of the SVM solution → geometrical interpretation → math/theory → basic algorithms → extensions → case studies.
Data-analysis problems of interest

1. Build computational classification models (or "classifiers") that assign patients/samples into two or more classes.
- Classifiers can be used for diagnosis, outcome prediction, and other classification tasks.
- E.g., build a decision-support system to diagnose primary and metastatic cancers from gene expression profiles of the patients:

[Figure: Patient → Biopsy → Gene expression profile → Classifier model → Primary Cancer / Metastatic Cancer]
Data-analysis problems of interest

2. Build computational regression models to predict values of some continuous response variable or outcome.
- Regression models can be used to predict survival, length of stay in the hospital, laboratory test values, etc.
- E.g., build a decision-support system to predict the optimal dosage of a drug to be administered to the patient. This dosage is determined by the values of patient biomarkers, and clinical and demographics data:

[Figure: Patient → Biomarkers, clinical and demographics data → Regression model → "Optimal dosage is 5 IU/Kg/week"]
Data-analysis problems of interest

3. Out of all measured variables in the dataset, select the smallest subset of variables that is necessary for the most accurate prediction (classification or regression) of some variable of interest (e.g., phenotypic response variable).
- E.g., find the most compact panel of breast cancer biomarkers from microarray gene expression data for 20,000 genes:

[Figure: gene expression heatmap of breast cancer tissues vs. normal tissues]
Data-analysis problems of interest

4. Build a computational model to identify novel or outlier patients/samples.
- Such models can be used to discover deviations in sample handling protocol when doing quality control of assays, etc.
- E.g., build a decision-support system to identify aliens.
Data-analysis problems of interest

5. Group patients/samples into several clusters based on their similarity.
- These methods can be used to discover disease sub-types and for other tasks.
- E.g., consider clustering of brain tumor patients into 4 clusters based on their gene expression profiles. All patients have the same pathological sub-type of the disease, and clustering discovers new disease subtypes that happen to have different characteristics in terms of patient survival and time to recurrence after treatment.

[Figure: survival curves for Cluster #1 through Cluster #4]
Basic principles of classification

•Want to classify objects as boats and houses.
•All objects before the coast line are boats and all objects after the coast line are houses.
•The coast line serves as a decision surface that separates the two classes.
•If the coast line is drawn poorly, some boats will be misclassified as houses.
Basic principles of classification

•The methods that build classification models (i.e., "classification algorithms") operate very similarly to the previous example.
•First, all objects are represented geometrically (e.g., each object becomes a point with Longitude and Latitude coordinates, labeled "Boat" or "House").
•Then the algorithm seeks to find a decision surface that separates the classes of objects.
•Unseen (new) objects are classified as "boats" if they fall below the decision surface and as "houses" if they fall above it.
The Support Vector Machine (SVM) approach

•Support vector machines (SVMs) are a binary classification algorithm that offers a solution to problem #1.
•Extensions of the basic SVM algorithm can be applied to solve problems #1-#5.
•SVMs are important because of (a) theoretical reasons:
- Robust to very large numbers of variables and small samples
- Can learn both simple and highly complex classification models
- Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.
Main ideas of SVMs

•Consider an example dataset described by 2 genes, gene X and gene Y.
•Represent patients geometrically (by "vectors"), labeled as cancer patients or normal patients.
Main ideas of SVMs

•Find a linear decision surface ("hyperplane") that can separate the patient classes and has the largest distance (i.e., largest "gap" or "margin") between border-line patients (i.e., "support vectors").
Main ideas of SVMs

•If such a linear decision surface does not exist, the data is mapped into a much higher dimensional space ("feature space") where the separating decision surface is found.
•The feature space is constructed via a very clever mathematical projection ("kernel trick").

[Figure: non-separable Cancer/Normal data in the gene X - gene Y plane mapped by a kernel into a feature space where a separating decision surface exists]
History of SVMs and usage in the literature

•Support vector machine classifiers have a long history of development starting from the 1960's.
•The most important milestone for development of modern SVMs is the 1992 paper by Boser, Guyon, and Vapnik ("A training algorithm for optimal margin classifiers").

[Chart: Use of Support Vector Machines in the Literature, 1998-2007 - yearly publication counts grow from a few hundred to several thousand in the general sciences and from single digits to over a thousand in biomedicine]
[Chart: Use of Linear Regression in the Literature, 1998-2007 - yearly publication counts remain roughly in the 10,000-25,000 range in both the general sciences and biomedicine]
Necessary mathematical concepts

How to represent samples geometrically? Vectors in n-dimensional space (Rⁿ)

•Assume that a sample/patient is described by n characteristics ("features" or "variables").
•Representation: Every sample/patient is a vector in Rⁿ with tail at the point with 0 coordinates and arrow-head at the point with the feature values.
•Example: Consider a patient described by 2 features: Systolic BP = 110 and Age = 29. This patient can be represented as a vector in R²:

[Figure: vector from (0, 0) to (110, 29) in the Systolic BP - Age plane]
How to represent samples geometrically? Vectors in n-dimensional space (Rⁿ)

Patient id | Cholesterol (mg/dl) | Systolic BP (mmHg) | Age (years) | Tail of the vector | Arrow-head of the vector
1          | 150                 | 110                | 35          | (0,0,0)            | (150, 110, 35)
2          | 250                 | 120                | 30          | (0,0,0)            | (250, 120, 30)
3          | 140                 | 160                | 65          | (0,0,0)            | (140, 160, 65)
4          | 300                 | 180                | 45          | (0,0,0)            | (300, 180, 45)

[Figure: Patients 1-4 plotted as vectors in the Cholesterol - Systolic BP - Age space]
How to represent samples geometrically? Vectors in n-dimensional space (Rⁿ)

Since we assume that the tail of each vector is at the point with 0 coordinates, we will also depict vectors as points (where the arrow-head is pointing).
Purpose of vector representation

•Having represented each sample/patient as a vector allows us now to geometrically represent the decision surface that separates two groups of samples/patients.
•In order to define the decision surface, we need to introduce some basic math elements…

[Figure: a decision surface in R² (a line) and a decision surface in R³ (a plane)]
Basic operation on vectors in Rⁿ
1. Multiplication by a scalar

Consider a vector a = (a₁, a₂, …, aₙ) and a scalar c. Define:
  c·a = (c·a₁, c·a₂, …, c·aₙ)

When you multiply a vector by a scalar, you "stretch" it in the same or opposite direction depending on whether the scalar is positive or negative.

Examples:
  a = (1, 2), c = 2   →  c·a = (2, 4)
  a = (1, 2), c = -1  →  c·a = (-1, -2)
Basic operation on vectors in Rⁿ
2. Addition

Consider vectors a = (a₁, a₂, …, aₙ) and b = (b₁, b₂, …, bₙ). Define:
  a + b = (a₁+b₁, a₂+b₂, …, aₙ+bₙ)

Recall addition of forces in classical mechanics.

Example: a = (1, 2), b = (3, 0)  →  a + b = (4, 2)
Basic operation on vectors in Rⁿ
3. Subtraction

Consider vectors a = (a₁, a₂, …, aₙ) and b = (b₁, b₂, …, bₙ). Define:
  a - b = (a₁-b₁, a₂-b₂, …, aₙ-bₙ)

What vector do we need to add to b to get a? I.e., similar to subtraction of real numbers.

Example: a = (1, 2), b = (3, 0)  →  a - b = (-2, 2)
Basic operation on vectors in Rⁿ
4. Euclidean length or L2-norm

Consider a vector a = (a₁, a₂, …, aₙ). Define the L2-norm:
  ||a||₂ = √(a₁² + a₂² + … + aₙ²)

We often denote the L2-norm without a subscript, i.e. ||a||.

Example: a = (1, 2)  →  ||a||₂ = √5 ≈ 2.24

The L2-norm is a typical way to measure the length of a vector; other methods to measure length also exist.
Basic operation on vectors in Rⁿ
5. Dot product

Consider vectors a = (a₁, a₂, …, aₙ) and b = (b₁, b₂, …, bₙ). Define the dot product:
  a·b = a₁b₁ + a₂b₂ + … + aₙbₙ = Σᵢ aᵢbᵢ

The law of cosines says that a·b = ||a||₂ ||b||₂ cos θ, where θ is the angle between a and b. Therefore, when the vectors are perpendicular, a·b = 0.

Examples:
  a = (1, 2), b = (3, 0)  →  a·b = 3
  a = (0, 2), b = (3, 0)  →  a·b = 0
Basic operation on vectors in Rⁿ
5. Dot product (continued)

•Property: a·a = a₁² + a₂² + … + aₙ² = ||a||₂².
•In the classical regression equation y = w·x + b, the response variable y is just a dot product of the vector representing patient characteristics (x) and the regression weights vector (w), which is common across all patients, plus an offset b.
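These vector operations map directly onto array libraries. Below is a minimal NumPy sketch reusing the numbers from the preceding examples; NumPy itself is an assumption for illustration, not something the tutorial prescribes.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.0])

print(2 * a)                 # scalar multiplication: [2. 4.]
print(a + b)                 # addition:              [4. 2.]
print(a - b)                 # subtraction:           [-2. 2.]
print(np.linalg.norm(a))     # L2-norm: sqrt(5) ~ 2.236
print(np.dot(a, b))          # dot product: 1*3 + 2*0 = 3

# property: a.a equals ||a||^2
print(np.isclose(np.dot(a, a), np.linalg.norm(a) ** 2))  # True
```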
Hyperplanes as decision surfaces

•A hyperplane is a linear decision surface that splits the space into two parts.
•It is obvious that a hyperplane is a binary classifier.

A hyperplane in R² is a line; a hyperplane in R³ is a plane; a hyperplane in Rⁿ is an (n-1)-dimensional subspace.
Equation of a hyperplane

First we show the definition of a hyperplane by an interactive demonstration (source: http://www.math.umn.edu/~nykamp/), or go to http://www.dsl-lab.org/svm_tutorial/planedemo.html
Equation of a hyperplane

Consider the case of R³. An equation of a hyperplane is defined by a point P₀ and a vector w perpendicular to the plane at that point.

Define vectors x₀ = OP₀ and x = OP, where P is an arbitrary point on the hyperplane.

A condition for P to be on the plane is that the vector x - x₀ is perpendicular to w:
  w·(x - x₀) = 0
  w·x - w·x₀ = 0
Defining b = -w·x₀, this becomes
  w·x + b = 0

The above equations also hold for Rⁿ when n > 3.
Equation of a hyperplane

Example:
  P₀ = (0, 1, 7), w = (4, 1, 6)
  b = -w·x₀ = -(4·0 + 1·1 + 6·7) = -43
  The hyperplane is w·x + b = 0, i.e. 4x⁽¹⁾ + 1x⁽²⁾ + 6x⁽³⁾ - 43 = 0.

What happens if the b coefficient changes? The hyperplane moves along the direction of w (the "+" direction or the "-" direction depending on the sign of the change). We obtain "parallel hyperplanes", e.g. w·x - 10 = 0 and w·x - 50 = 0.

The distance between two parallel hyperplanes w·x + b₁ = 0 and w·x + b₂ = 0 is equal to D = |b₁ - b₂| / ||w||.
(Derivation of the distance between two parallel hyperplanes)

Take a point x₁ on the hyperplane w·x + b₁ = 0 and move along the direction of w until you reach the hyperplane w·x + b₂ = 0 at the point x₂ = x₁ + t·w, for some scalar t. Then:

  w·x₂ + b₂ = 0
  w·(x₁ + t·w) + b₂ = 0
  w·x₁ + t·||w||² + b₂ = 0
  -b₁ + t·||w||² + b₂ = 0        (since w·x₁ = -b₁)
  t = (b₁ - b₂) / ||w||²

  D = ||x₂ - x₁|| = ||t·w|| = |t|·||w|| = |b₁ - b₂| / ||w||
Recap

We know…
•How to represent patients (as "vectors")
•How to define a linear decision surface ("hyperplane")

We need to know…
•How to efficiently compute the hyperplane that separates two classes with the largest "gap"?
⇒ Need to introduce basics of relevant optimization theory
Basics of optimization: Convex functions

•A function is called convex if the function lies below the straight line segment connecting two points, for any two points in the interval.
•Property: Any local minimum is a global minimum!

[Figure: a convex function with a single global minimum vs. a non-convex function with both a local minimum and a global minimum]
Basics of optimization: Quadratic programming (QP)

•Quadratic programming (QP) is a special optimization problem: the function to optimize ("objective") is quadratic, subject to linear constraints.
•Convex QP problems have convex objective functions.
•These problems can be solved easily and efficiently by greedy algorithms (because every local minimum is a global minimum).

Basics of optimization: Example QP problem

Consider x = (x₁, x₂).
Minimize (1/2)||x||₂²  subject to  x₁ + x₂ - 1 ≥ 0   (quadratic objective, linear constraints)

This is a QP problem, and it is a convex QP as we will see later. We can rewrite it as:
Minimize (1/2)(x₁² + x₂²)  subject to  x₁ + x₂ - 1 ≥ 0   (quadratic objective, linear constraints)
Basics of optimization: Example QP problem

[Figure: the surface f(x₁, x₂) = (1/2)(x₁² + x₂²) with the feasible region x₁ + x₂ - 1 ≥ 0; the constrained minimum lies on the line x₁ + x₂ = 1]

The solution is x₁ = 1/2 and x₂ = 1/2.
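This toy QP can be checked numerically with a general-purpose solver. The sketch below uses scipy.optimize.minimize; SciPy and the starting point are assumptions for illustration (a dedicated QP solver would normally be used for larger problems).

```python
import numpy as np
from scipy.optimize import minimize

# objective: (1/2)(x1^2 + x2^2)
objective = lambda x: 0.5 * (x[0] ** 2 + x[1] ** 2)

# linear inequality constraint: x1 + x2 - 1 >= 0
constraints = [{"type": "ineq", "fun": lambda x: x[0] + x[1] - 1.0}]

result = minimize(objective, x0=np.zeros(2), constraints=constraints)
print(result.x)  # ~[0.5, 0.5], matching the analytical solution
```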
Congratulations! You have mastered all math elements needed to understand support vector machines.

Now, let us strengthen your knowledge by a quiz.
Quiz

1) Consider the hyperplane shown in white, defined by the equation w·x - 10 = 0. Which of the three other hyperplanes can be defined by the equation w·x - 3 = 0?
- Orange
- Green
- Yellow

2) What is the dot product between vectors a = (3, 3) and b = (-1, 1)?
Quiz

3) What is the dot product between vectors a = (3, 3) and b = (1, 0)?

4) What is the length of the vector a = (2, 0), and what is the length of all other red vectors in the figure?

5) Which of the four functions is/are convex?

[Figure: four function plots labeled 1-4]
Support vector machines for binary classification: classical formulation
Case 1: Linearly separable data; "Hard-margin" linear SVM

Given training data: x₁, x₂, …, x_N ∈ Rⁿ and labels y₁, y₂, …, y_N ∈ {-1, +1}
(positive instances have y = +1, negative instances have y = -1).

•Want to find a classifier (hyperplane) to separate negative instances from the positive ones.
•An infinite number of such hyperplanes exist.
•SVMs find the hyperplane that maximizes the gap between data points on the boundaries (so-called "support vectors").
•If the points on the boundaries are not informative (e.g., due to noise), SVMs will not do well.
Statement of linear SVM classifier

The positive and negative instances are separated by the hyperplane w·x + b = 0. The gap is the distance between the parallel hyperplanes:
  w·x + b = -1  and  w·x + b = +1
or equivalently:
  w·x + (b+1) = 0  and  w·x + (b-1) = 0

We know that the distance between parallel hyperplanes is D = |b₁ - b₂| / ||w||.
Therefore: D = 2 / ||w||.

Since we want to maximize the gap, we need to minimize ||w||, or equivalently minimize (1/2)||w||²
(the factor 1/2 is convenient for taking derivatives later on).
Statement of linear SVM classifier

In addition, we need to impose constraints that all instances are correctly classified. In our case:
  w·xᵢ + b ≤ -1  if  yᵢ = -1
  w·xᵢ + b ≥ +1  if  yᵢ = +1
Equivalently:  yᵢ(w·xᵢ + b) ≥ 1

In summary:
Want to minimize (1/2)||w||² subject to yᵢ(w·xᵢ + b) ≥ 1 for i = 1,…,N.
Then, given a new instance x, the classifier is f(x) = sign(w·x + b).
SVM optimization problem: Primal formulation

Minimize (1/2)Σᵢ wᵢ²   (objective function)
subject to yᵢ(w·xᵢ + b) - 1 ≥ 0 for i = 1,…,N   (constraints)

•This is called the "primal formulation of linear SVMs".
•It is a convex quadratic programming (QP) optimization problem with n variables (wᵢ, i = 1,…,n), where n is the number of features in the dataset.
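In practice one rarely solves this QP by hand; off-the-shelf implementations do it. Below is a minimal sketch with scikit-learn's SVC using a linear kernel; the library, the toy 2-gene data, and the very large C (to approximate the hard margin) are assumptions for illustration, not part of the tutorial.

```python
import numpy as np
from sklearn.svm import SVC

# toy 2-gene dataset: rows are patients, columns are gene X and gene Y
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # cancer patients
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.0]])   # normal patients
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

print(clf.coef_, clf.intercept_)               # the learned w-vector and b
print(clf.predict([[2.0, 2.0], [5.0, 5.0]]))   # sign(w.x + b) for new patients
```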
SVM optimization problem: Dual formulation

•The previous problem can be recast in the so-called "dual form", giving rise to the "dual formulation of linear SVMs".
•It is also a convex quadratic programming problem, but with N variables (αᵢ, i = 1,…,N), where N is the number of samples.

Maximize  Σᵢ αᵢ - (1/2) Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ)   (objective function)
subject to  αᵢ ≥ 0  and  Σᵢ αᵢyᵢ = 0   (constraints)

Then the w-vector is defined in terms of αᵢ:
  w = Σᵢ αᵢ yᵢ xᵢ

And the solution becomes:
  f(x) = sign( Σᵢ αᵢ yᵢ (xᵢ·x) + b )
SVM optimization problem: Benefits of using dual formulation

1) No need to access the original data; we need to access only dot products.
   Objective function:  Σᵢ αᵢ - (1/2) Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ)
   Solution:  f(x) = sign( Σᵢ αᵢ yᵢ (xᵢ·x) + b )

2) The number of free parameters is bounded by the number of support vectors and not by the number of variables (beneficial for high-dimensional problems).
   E.g., if a microarray dataset contains 20,000 genes and 100 patients, then we need to find only up to 100 parameters!
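Most solvers work in this dual and expose the quantities that appear above. The sketch below shows how the support vectors and the products αᵢyᵢ can be inspected in scikit-learn; the library and toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.0]])
y = np.array([+1, +1, +1, -1, -1, -1])
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)   # the support vectors x_i
print(clf.dual_coef_)         # the products alpha_i * y_i for the support vectors

# w can be reconstructed from the dual solution: w = sum_i alpha_i y_i x_i
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))    # True for a linear kernel
```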
(Derivation of dual formulation)

Minimize (1/2)Σᵢ wᵢ² subject to yᵢ(w·xᵢ + b) - 1 ≥ 0 for i = 1,…,N.

Apply the method of Lagrange multipliers. Define the Lagrangian
  Λ_P(w, b, α) = (1/2)||w||² - Σᵢ αᵢ [ yᵢ(w·xᵢ + b) - 1 ]
where w is a vector with n elements and α is a vector with N elements.

We need to minimize this Lagrangian with respect to w and b, and simultaneously require that the derivative with respect to α vanishes, all subject to the constraints that αᵢ ≥ 0.
(Derivation of dual formulation)

If we set the derivatives with respect to w and b to 0, we obtain:
  ∂Λ_P/∂b = 0  →  Σᵢ αᵢ yᵢ = 0
  ∂Λ_P/∂w = 0  →  w = Σᵢ αᵢ yᵢ xᵢ

We substitute the above into the equation for Λ_P(w, b, α) and obtain the "dual formulation of linear SVMs":
  Λ_D(α) = Σᵢ αᵢ - (1/2) Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ)

We seek to maximize the above Lagrangian with respect to α, subject to the constraints that αᵢ ≥ 0 and Σᵢ αᵢ yᵢ = 0.
Case 2: Not linearly separable data; "Soft-margin" linear SVM

What if the data is not linearly separable? E.g., there are outliers or noisy measurements, or the data is slightly non-linear. We want to handle this case without changing the family of decision functions.

Approach: Assign a "slack variable" ξᵢ ≥ 0 to each instance, which can be thought of as the distance from the separating hyperplane if an instance is misclassified, and 0 otherwise.

Want to minimize (1/2)||w||² + C Σᵢ ξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ for i = 1,…,N.
Then, given a new instance x, the classifier is f(x) = sign(w·x + b).
56
Minim
ize subject to for i= 1,…,N
¦¦
�Ni
i
niiC
w1
1
22 1
[
Objective function
Constraints
ii
ib
xw
y[
�t
��
1)
(&
&
¦¦
��
Nji
ji
ji
ji
nii
xx
yy
1,
2 1
1
&&
DD
DC
i ddD
00
1
¦ Ni
ii y
DM
inimize subject to and
for i= 1,…,N.
Objective function
Constraints
Prim
al fo
rmu
latio
n:
Du
al fo
rmu
latio
n:
Parameter C in soft-margin SVM

Minimize (1/2)||w||² + C Σᵢ ξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ for i = 1,…,N

•When C is very large, the soft-margin SVM is equivalent to the hard-margin SVM;
•When C is very small, we admit misclassifications in the training data at the expense of having a w-vector with small norm;
•C has to be selected for the distribution at hand, as will be discussed later in this tutorial.

[Figure: decision boundaries and margins for C = 100, C = 1, C = 0.15, C = 0.1]
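The behaviour described above can be observed by sweeping C with a linear soft-margin SVM. A minimal sketch, assuming scikit-learn and randomly generated, overlapping toy classes:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# two overlapping Gaussian classes, so some training errors are unavoidable
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([+1] * 50 + [-1] * 50)

for C in [100.0, 1.0, 0.15, 0.1]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w_norm = np.linalg.norm(clf.coef_)
    train_err = np.mean(clf.predict(X) != y)
    # small C -> smaller ||w|| (wider margin) but more training misclassifications
    print(f"C={C:6.2f}  ||w||={w_norm:.3f}  training error={train_err:.2f}")
```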
Case 3: Not linearly separable data; Kernel trick

[Figure: Tumor vs. Normal samples described by Gene 1 and Gene 2. The data is not linearly separable in the input space, but it is linearly separable in the feature space obtained by a kernel mapping Φ: Rᴺ → H.]
Kernel trick

Original data (in input space):
  f(x) = sign(w·x + b),  with  w = Σᵢ αᵢ yᵢ xᵢ
Data in a higher dimensional feature space Φ(x):
  f(x) = sign(w·Φ(x) + b),  with  w = Σᵢ αᵢ yᵢ Φ(xᵢ)

Therefore:
  f(x) = sign( Σᵢ αᵢ yᵢ (Φ(xᵢ)·Φ(x)) + b ) = sign( Σᵢ αᵢ yᵢ K(xᵢ, x) + b )

Therefore, we do not need to know Φ explicitly; we just need to define the kernel function K(·,·): Rᴺ × Rᴺ → R.

Not every function Rᴺ × Rᴺ → R can be a valid kernel; it has to satisfy the so-called Mercer conditions. Otherwise, the underlying quadratic program may not be solvable.
Popular kernels

A kernel is a dot product in some feature space: K(xᵢ, xⱼ) = Φ(xᵢ)·Φ(xⱼ)

Examples:
  Linear kernel:       K(xᵢ, xⱼ) = xᵢ·xⱼ
  Gaussian kernel:     K(xᵢ, xⱼ) = exp(-γ ||xᵢ - xⱼ||²)
  Exponential kernel:  K(xᵢ, xⱼ) = exp(-γ ||xᵢ - xⱼ||)
  Polynomial kernel:   K(xᵢ, xⱼ) = (p + xᵢ·xⱼ)^q
  Hybrid kernel:       K(xᵢ, xⱼ) = (p + xᵢ·xⱼ)^q exp(-γ ||xᵢ - xⱼ||²)
  Sigmoidal kernel:    K(xᵢ, xⱼ) = tanh(k (xᵢ·xⱼ) - δ)
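Two of these kernels computed explicitly and compared against common library helpers. A minimal sketch; scikit-learn's pairwise helpers, the parameter values, and the test points are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

X = np.array([[1.0, 2.0], [3.0, 0.0], [0.0, 1.0]])
gamma, p, q = 0.5, 1.0, 3

# Gaussian kernel: K(xi, xj) = exp(-gamma * ||xi - xj||^2)
diff = X[:, None, :] - X[None, :, :]
K_gauss = np.exp(-gamma * np.sum(diff ** 2, axis=-1))
print(np.allclose(K_gauss, rbf_kernel(X, gamma=gamma)))                          # True

# Polynomial kernel: K(xi, xj) = (p + xi . xj)^q
K_poly = (p + X @ X.T) ** q
print(np.allclose(K_poly, polynomial_kernel(X, degree=q, gamma=1.0, coef0=p)))   # True
```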
Understanding the Gaussian kernel

Consider the Gaussian kernel: K(x, xⱼ) = exp(-γ ||x - xⱼ||²)

Geometrically, this is a "bump" or "cavity" centered at the training data point xⱼ.

The resulting mapping function is a combination of bumps and cavities.
Understanding the Gaussian kernel

[Figure: several more views of the data as it is mapped to the feature space by the Gaussian kernel; a linear hyperplane in that space separates the two classes]
Understanding the polynomial kernel

Consider the polynomial kernel: K(xᵢ, xⱼ) = (1 + xᵢ·xⱼ)³

Assume that we are dealing with 2-dimensional data (i.e., in R²). Where will this kernel map the data?

The 2-dimensional input (x⁽¹⁾, x⁽²⁾) is mapped (up to constant factors) to the 10-dimensional space of monomials of degree ≤ 3:
  1, x⁽¹⁾, x⁽²⁾, x⁽¹⁾², x⁽¹⁾x⁽²⁾, x⁽²⁾², x⁽¹⁾³, x⁽¹⁾²x⁽²⁾, x⁽¹⁾x⁽²⁾², x⁽²⁾³
Example of benefits of using a kernel

•Data (points x₁, x₂, x₃, x₄ in the x⁽¹⁾ - x⁽²⁾ plane) is not linearly separable in the input space (R²).
•Apply the kernel K(x, z) = (x·z)² to map the data to a higher dimensional (3-dimensional) space where it is linearly separable:

  K(x, z) = (x·z)²
          = (x⁽¹⁾z⁽¹⁾ + x⁽²⁾z⁽²⁾)²
          = x⁽¹⁾²z⁽¹⁾² + 2 x⁽¹⁾z⁽¹⁾ x⁽²⁾z⁽²⁾ + x⁽²⁾²z⁽²⁾²
          = (x⁽¹⁾², √2 x⁽¹⁾x⁽²⁾, x⁽²⁾²) · (z⁽¹⁾², √2 z⁽¹⁾z⁽²⁾, z⁽²⁾²)
          = Φ(x)·Φ(z)
Example of benefits of using a kernel

Therefore, the explicit mapping is
  Φ(x) = (x⁽¹⁾², √2 x⁽¹⁾x⁽²⁾, x⁽²⁾²)

[Figure: after the mapping, the points x₁, x₂ and x₃, x₄ become linearly separable in the 3-dimensional feature space]
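A minimal NumPy check that the kernel K(x, z) = (x·z)² equals the dot product of the explicit mapping Φ above; the specific test points are assumptions for illustration.

```python
import numpy as np

def phi(v):
    # explicit mapping for K(x, z) = (x . z)^2 in R^2
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = np.dot(x, z) ** 2          # kernel evaluated in the input space
k_explicit = np.dot(phi(x), phi(z))     # dot product in the 3-D feature space
print(np.isclose(k_implicit, k_explicit))  # True
```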
Comparison with methods from classical statistics & regression

•Classical regression needs ≥ 5 samples for each parameter of the regression model to be estimated:

Number of variables | Polynomial degree | Number of parameters | Required sample
2                   | 3                 | 10                   | 50
10                  | 3                 | 286                  | 1,430
10                  | 5                 | 3,003                | 15,015
100                 | 3                 | 176,851              | 884,255
100                 | 5                 | 96,560,646           | 482,803,230

•SVMs do not have such a requirement and often require a much smaller sample than the number of variables, even when a high-degree polynomial kernel is used.
Basic principles of statistical machine learning

Generalization and overfitting

•Generalization: A classifier or a regression algorithm learns to correctly predict output from given inputs not only in previously seen samples but also in previously unseen samples.
•Overfitting: A classifier or a regression algorithm learns to correctly predict output from given inputs in previously seen samples but fails to do so in previously unseen samples.
•Overfitting ⇒ Poor generalization.
Example of overfitting and generalization

There is a linear relationship between the predictor X and the outcome of interest Y (plus some Gaussian noise). [Figure: training data and test data with two fitted curves, Algorithm 1 and Algorithm 2]

•Algorithm 1 learned non-reproducible peculiarities of the specific sample available for learning but did not learn the general characteristics of the function that generated the data. Thus, it is overfitted and has poor generalization.
•Algorithm 2 learned the general characteristics of the function that produced the data. Thus, it generalizes.
"Loss + penalty" paradigm for learning to avoid overfitting and ensure generalization

•Many statistical learning algorithms (including SVMs) search for a decision function by solving the following optimization problem:
  Minimize (Loss + λ Penalty)
•Loss measures the error of fitting the data
•Penalty penalizes the complexity of the learned function
•λ is a regularization parameter that balances Loss and Penalty
SVMs in "loss + penalty" form

SVMs build classifiers of the form: f(x) = sign(w·x + b)

Consider the soft-margin linear SVM formulation:
Find w and b that minimize
  (1/2)||w||² + C Σᵢ ξᵢ  subject to  yᵢ(w·xᵢ + b) ≥ 1 - ξᵢ  for i = 1,…,N

This can also be stated as: find w and b that minimize
  Σᵢ [1 - yᵢ f(xᵢ)]₊  +  λ ||w||₂²
  (loss: "hinge loss")     (penalty)

(in fact, one can show that λ = 1/(2C)).
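The "loss + penalty" objective for a given (w, b) can be computed directly, which makes the hinge-loss term concrete. A minimal sketch; NumPy and the toy values are assumptions for illustration.

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    # hinge loss: sum_i [1 - y_i (w.x_i + b)]_+
    margins = y * (X @ w + b)
    hinge = np.sum(np.maximum(0.0, 1.0 - margins))
    # penalty: lambda * ||w||_2^2
    penalty = lam * np.dot(w, w)
    return hinge + penalty

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, lam=0.1))
```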
Meaning of SVM loss function

Consider the loss function: Σᵢ [1 - yᵢ f(xᵢ)]₊

•Recall that […]₊ indicates the positive part
•For a given sample/patient i, the loss is non-zero if 1 - yᵢ f(xᵢ) > 0
•In other words, the loss is non-zero if yᵢ f(xᵢ) < 1
•Since yᵢ ∈ {-1, +1}, this means that the loss is non-zero if
    f(xᵢ) < 1 for yᵢ = +1
    f(xᵢ) > -1 for yᵢ = -1
•In other words, the loss is non-zero if
    w·xᵢ + b < 1 for yᵢ = +1
    w·xᵢ + b > -1 for yᵢ = -1
Meaning of SVM loss function

[Figure: regions 1-4 defined by the hyperplanes w·x + b = -1, w·x + b = 0, and w·x + b = +1, with positive (y=+1) and negative (y=-1) instances]

•If an instance is negative, it is penalized only in regions 2, 3, 4
•If an instance is positive, it is penalized only in regions 1, 2, 3
Flexibility of "loss + penalty" framework

Minimize (Loss + λ Penalty)

Loss function                                | Penalty function       | Resulting algorithm
Hinge loss: Σᵢ [1 - yᵢ f(xᵢ)]₊               | λ ||w||₂²              | SVMs
Mean squared error: Σᵢ (yᵢ - f(xᵢ))²         | λ ||w||₂²              | Ridge regression
Mean squared error: Σᵢ (yᵢ - f(xᵢ))²         | λ ||w||₁               | Lasso
Mean squared error: Σᵢ (yᵢ - f(xᵢ))²         | λ₁ ||w||₁ + λ₂ ||w||₂² | Elastic net
Hinge loss: Σᵢ [1 - yᵢ f(xᵢ)]₊               | λ ||w||₁               | 1-norm SVM
Part 2

•Model selection for SVMs
•Extensions to the basic SVM model:
  1. SVMs for multicategory data
  2. Support vector regression
  3. Novelty detection with SVM-based methods
  4. Support vector clustering
  5. SVM-based variable selection
  6. Computing posterior class probabilities for SVM classifiers
Model selection for SVMs
Need for model selection for SVMs

•[Left figure: Tumor vs. Normal samples in the Gene 1 - Gene 2 plane that are not linearly separable] It is impossible to find a linear SVM classifier that separates tumors from normals! We need a non-linear SVM classifier; e.g., an SVM with a polynomial kernel of degree 2 solves this problem without errors.
•[Right figure: linearly separable Tumor vs. Normal data] We should not apply a non-linear SVM classifier when we can perfectly solve this problem using a linear SVM classifier!
A data-driven approach for model selection for SVMs

•Do not know a priori what type of SVM kernel and what kernel parameter(s) to use for a given dataset?
•Need to examine various combinations of parameters, e.g. consider searching the following grid of (parameter C, polynomial degree d) pairs:

(0.1, 1)  (1, 1)  (10, 1)  (100, 1)  (1000, 1)
(0.1, 2)  (1, 2)  (10, 2)  (100, 2)  (1000, 2)
(0.1, 3)  (1, 3)  (10, 3)  (100, 3)  (1000, 3)
(0.1, 4)  (1, 4)  (10, 4)  (100, 4)  (1000, 4)
(0.1, 5)  (1, 5)  (10, 5)  (100, 5)  (1000, 5)

•How to search this grid while producing an unbiased estimate of classification performance?
Nested cross-validation

Recall the main idea of cross-validation: split the data into train/test folds, so that each part of the data serves as the test set once.

What combination of SVM parameters to apply on the training data? Split each training fold further into train/validation folds and perform a "grid search" using this additional, nested loop of cross-validation.
Example of nested cross-validation

Consider that we use 3-fold cross-validation and we want to optimize parameter C, which takes values "1" and "2". The data is split into parts P1, P2, P3.

Inner loop (for the outer fold that tests on P3):
  Training set P1, validation set P2, C=1: accuracy 86%; training set P2, validation set P1, C=1: accuracy 84% → average 85%
  Training set P1, validation set P2, C=2: accuracy 70%; training set P2, validation set P1, C=2: accuracy 90% → average 80%
  → choose C=1

Outer loop:
  Training set P1, P2 | Testing set P3 | C=1 | accuracy 89%
  Training set P1, P3 | Testing set P2 | C=2 | accuracy 84%
  Training set P2, P3 | Testing set P1 | C=1 | accuracy 76%
  → average accuracy 83%
On use of cross-validation

•Empirically we found that cross-validation works well for model selection for SVMs in many problem domains;
•Many other approaches exist that can be used for model selection for SVMs, e.g.:
  - Generalized cross-validation
  - Bayesian information criterion (BIC)
  - Minimum description length (MDL)
  - Vapnik-Chervonenkis (VC) dimension
  - Bootstrap
SVMs for multicategory data
One-versus-rest multicategory SVM method

[Figure: Tumor I, Tumor II and Tumor III samples in the Gene 1 - Gene 2 plane; each class is separated from all remaining classes by its own binary SVM, and new samples "?" are assigned accordingly]
One-versus-one multicategory SVM method

[Figure: the same Tumor I, Tumor II and Tumor III data; a binary SVM is built for every pair of classes, and new samples "?" are assigned by combining the pairwise classifiers]
DAGSVM multicategory SVM method

[Figure: decision DAG over the classes AML, ALL B-cell and ALL T-cell. The root applies the classifier "AML vs. ALL T-cell"; depending on which class is rejected ("Not AML" or "Not ALL T-cell"), the next node applies "ALL B-cell vs. ALL T-cell" or "AML vs. ALL B-cell", until a single class (AML, ALL B-cell, or ALL T-cell) remains.]
SVM multicategory methods by Weston and Watkins and by Crammer and Singer

[Figure: the same three-class data; these methods solve a single optimization problem that separates all classes simultaneously, rather than combining binary classifiers]
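The one-versus-rest and one-versus-one decompositions can be reproduced with generic meta-estimators around a binary SVM. A minimal sketch; scikit-learn and the synthetic 3-class data are assumptions for illustration (scikit-learn's SVC also implements one-versus-one natively).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

rng = np.random.RandomState(0)
# three "tumor types" as Gaussian blobs in a 2-gene space
X = np.vstack([rng.randn(30, 2) + c for c in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 30)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # 3 classifiers: class k vs. rest
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # 3 classifiers: one per class pair
print(ovr.predict([[3.5, 0.5]]), ovo.predict([[3.5, 0.5]]))
```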
Support vector regression
ε-Support vector regression (ε-SVR)

Given training data: x₁, x₂, …, x_N ∈ Rⁿ and y₁, y₂, …, y_N ∈ R

Main idea: Find a function f(x) = w·x + b that approximates y₁,…,y_N:
•it has at most ε deviation from the true values yᵢ
•it is as "flat" as possible (to avoid overfitting)

[Figure: data points inside a ribbon of width ±ε around the fitted line]

E.g., build a model to predict survival of cancer patients that can admit a one month error (= ε).
Formulation of "hard-margin" ε-SVR

Find f(x) = w·x + b by minimizing (1/2)||w||² subject to the constraints:
  yᵢ - (w·xᵢ + b) ≤ ε
  (w·xᵢ + b) - yᵢ ≤ ε
for i = 1,…,N.

I.e., the difference between yᵢ and the fitted function should be smaller than ε and larger than -ε ⇔ all points yᵢ should be in the "ε-ribbon" around the fitted function.
Formulation of "soft-margin" ε-SVR

If we have points outside the ε-ribbon (e.g., outliers or noise) we can either:
a) increase ε to ensure that these points are within the new ε-ribbon, or
b) assign a penalty ("slack" variable) to each of these points (as was done for "soft-margin" SVMs).
Formulation of "soft-margin" ε-SVR

Find f(x) = w·x + b by minimizing
  (1/2)||w||² + C Σᵢ (ξᵢ + ξᵢ*)
subject to the constraints:
  yᵢ - (w·xᵢ + b) ≤ ε + ξᵢ
  (w·xᵢ + b) - yᵢ ≤ ε + ξᵢ*
  ξᵢ, ξᵢ* ≥ 0
for i = 1,…,N.

Notice that only points outside the ε-ribbon are penalized!
Nonlinear ε-SVR

[Figure: strongly non-linear data that cannot be approximated well by a linear function with small ε; after a kernel mapping Φ the data in the feature space fits inside an ε-ribbon, and the fit is mapped back to the input space via Φ⁻¹]
ε-Support vector regression in "loss + penalty" form

Build a decision function of the form: f(x) = w·x + b

Find w and b that minimize
  Σᵢ max(0, |yᵢ - f(xᵢ)| - ε)  +  λ ||w||₂²
  (loss: "linear ε-insensitive loss")   (penalty)

[Figure: the linear ε-insensitive loss as a function of the error in approximation - zero inside [-ε, ε] and growing linearly outside]
Comparing ε-SVR with popular regression methods

Loss function                                                   | Penalty function | Resulting algorithm
Linear ε-insensitive loss: Σᵢ max(0, |yᵢ - f(xᵢ)| - ε)          | λ ||w||₂²        | ε-SVR
Quadratic ε-insensitive loss: Σᵢ max(0, |yᵢ - f(xᵢ)| - ε)²      | λ ||w||₂²        | Another variant of ε-SVR
Mean squared error: Σᵢ (yᵢ - f(xᵢ))²                            | λ ||w||₂²        | Ridge regression
Mean linear error: Σᵢ |yᵢ - f(xᵢ)|                              | λ ||w||₂²        | Another variant of ridge regression
Comparing loss functions of regression methods

[Figure: four plots of loss function value vs. error in approximation - linear ε-insensitive loss, quadratic ε-insensitive loss, mean squared error, and mean linear error]
Applying ε-SVR to real data

In the absence of domain knowledge about decision functions, it is recommended to optimize the following parameters (e.g., by cross-validation using grid-search):
•parameter C
•parameter ε
•kernel parameters (e.g., degree of polynomial)

Notice that parameter ε depends on the ranges of variables in the dataset; therefore it is recommended to normalize/re-scale the data prior to applying ε-SVR.
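A minimal sketch of ε-SVR on synthetic data, with C, ε and the kernel exposed as the parameters the slide recommends tuning, and the data re-scaled first; scikit-learn and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

X_scaled = StandardScaler().fit_transform(X)   # re-scaling before choosing epsilon

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
svr.fit(X_scaled, y)
print(svr.predict(X_scaled[:5]))   # fitted values; errors inside the epsilon-ribbon are not penalized
```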
Novelty detection with SVM-based methods
98
99
What is it about?
•Find the sim
plest and most
compact region in the space of
predictors where the m
ajority of data sam
ples “live” (i.e., w
ith the highest density of sam
ples). •
Build a decision function that takes value +1 in this region and -1 elsew
here.•
Once w
e have such a decision function, w
e can identify novel or outlier sam
ples/patients in the data.
Predictor X
*** *
*
*
* ***
***
* **
**
** *
* **
**
*
**
**
***
*
**
***
* **
*
**
**
* ** *
* ****
****
* * **
*
** ***
* **
**
** *
***
*** *
** * *
**
** * * *
* ****
**
Predictor Y
**
* *
Decision function = +1
Decision function = -1
* **** **
** ***
* ***
Key assumptions

•We do not know classes/labels of samples (positive or negative) in the data available for learning ⇒ this is not a classification problem.
•All positive samples are similar, but each negative sample can be different in its own way.

Thus, we do not need to collect data for negative samples!
Sample applications

[Figure: "Normal" samples form one group; "Novel" samples fall outside it]
Modified from: www.cs.huji.ac.il/course/2004/learns/NoveltyDetection.ppt
Sample applications

Discover deviations in sample handling protocol when doing quality control of assays.

[Figure: samples in the Protein X - Protein Y plane. Samples with high quality of processing form the dense "normal" region; separate outlying groups correspond to samples with low quality of processing from infants, from patients with lung cancer, from ICU patients, and from the lab of Dr. Smith.]
Sample applications

Identify websites that discuss benefits of proven cancer treatments.

[Figure: websites plotted by weighted frequency of word X and word Y. Websites that discuss benefits of proven cancer treatments form the dense "normal" region; outlying groups correspond to websites that discuss side-effects of proven cancer treatments, websites that discuss unproven cancer treatments, websites that discuss cancer prevention methods, and blogs of cancer patients.]
One-class SVM

Main idea: Find the maximal-gap hyperplane w·x + b = 0 that separates the data from the origin (i.e., the only member of the second class is the origin).

Use "slack variables" ξᵢ, ξⱼ as in soft-margin SVMs to penalize the instances that fall on the origin side of the hyperplane.
Formulation of one-class SVM: linear case

Given training data: x₁, x₂, …, x_N ∈ Rⁿ

Find f(x) = sign(w·x + b) by minimizing
  (1/2)||w||² + (1/(νN)) Σᵢ ξᵢ + b
subject to the constraints:
  w·xᵢ + b ≥ -ξᵢ,  ξᵢ ≥ 0
for i = 1,…,N.

Here ν is an upper bound on the fraction of outliers (i.e., points outside the decision surface) allowed in the data, and the constraints mean that the decision function should be positive in all training samples except for small deviations.
Formulation of one-class SVM: linear and non-linear cases

Linear case:
Find f(x) = sign(w·x + b) by minimizing
  (1/2)||w||² + (1/(νN)) Σᵢ ξᵢ + b
subject to w·xᵢ + b ≥ -ξᵢ, ξᵢ ≥ 0, for i = 1,…,N.

Non-linear case (use "kernel trick"):
Find f(x) = sign(w·Φ(x) + b) by minimizing
  (1/2)||w||² + (1/(νN)) Σᵢ ξᵢ + b
subject to w·Φ(xᵢ) + b ≥ -ξᵢ, ξᵢ ≥ 0, for i = 1,…,N.
More about one-class SVM

•One-class SVMs inherit most of the properties of SVMs for binary classification (e.g., the "kernel trick", sample efficiency, ease of finding a solution by efficient optimization methods, etc.);
•The choice of the parameter ν significantly affects the resulting decision surface.
•The choice of origin is arbitrary and also significantly affects the decision surface returned by the algorithm.
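A minimal sketch of one-class SVM for novelty detection, with ν playing the role of the outlier-fraction bound discussed above; scikit-learn, the kernel choice and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)              # "normal" samples only; no labels needed
X_new = np.array([[0.1, -0.1], [3.0, 3.0]])    # one typical sample, one outlier

oc = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5)   # nu bounds the fraction of outliers
oc.fit(X_train)
print(oc.predict(X_new))   # +1 for samples inside the learned region, -1 for novelties
```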
Support vector clustering
(Contributed by Nikita Lytkin)
Goal of clustering (aka class discovery)

Given a heterogeneous set of data points x₁, x₂, …, x_N ∈ Rⁿ, assign labels y₁, y₂, …, y_N ∈ {1, 2, …, K} such that points with the same label are highly "similar" to each other and are distinctly different from the rest.

[Figure: the clustering process assigns each data point to one of several clusters]
Support vector domain description

•The Support Vector Domain Description (SVDD) of the data is a set of vectors lying on the surface of the smallest hyper-sphere enclosing all data points in a feature space.
  - These surface points are called Support Vectors.

[Figure: data mapped by a kernel into a feature space and enclosed by the smallest hyper-sphere of radius R]
111
Minim
ize subject to for i= 1,…,N
2R
Squared radius of the sphereConstraints
22
||)
(||
Ra
xi
d�
)
Form
ula
tion
with
ha
rd co
nstra
ints:
RRR
'Ra
'a
Main idea behind Support Vector Clustering

•Cluster boundaries in the input space are formed by the set of points that, when mapped from the input space to the feature space, fall exactly on the surface of the minimal enclosing hyper-sphere.
  - SVs identified by SVDD are a subset of the cluster boundary points.

[Figure: the pre-image Φ⁻¹ of the hyper-sphere surface traces the cluster contours in the input space]
Cluster assignment in SVC

•Two points xᵢ and xⱼ belong to the same cluster (i.e., have the same label) if every point of the line segment (xᵢ, xⱼ), projected to the feature space, lies within the hyper-sphere.

[Figure: for a segment within one cluster, every point maps inside the hyper-sphere; for a segment connecting different clusters, some points map outside the hyper-sphere]
Cluster assignment in SVC (continued)

•A point-wise adjacency matrix is constructed by testing the line segments between every pair of points
•Connected components are extracted
•Points belonging to the same connected component are assigned the same label

Example adjacency matrix for points A-E (1 = the segment stays within the hyper-sphere):
    A B C D E
A   1 1 1 0 0
B     1 1 0 0
C       1 0 0
D         1 1
E           1
This yields the connected components {A, B, C} and {D, E}.
Effects of noise and cluster overlap

•In practice, data often contains noise, outlier points and overlapping clusters, which would prevent contour separation and result in all points being assigned to the same cluster.

[Figure: "ideal data" with well-separated clusters vs. "typical data" with noise, outliers, and overlap]
SVDD with soft constraints

•SVC can be used on noisy data by allowing a fraction of points, called Bounded SVs (BSV), to lie outside the hyper-sphere.
  - BSVs are not considered as cluster boundary points and are not assigned to clusters by SVC.

[Figure: typical data with noise, outliers and overlap mapped by a kernel to the feature space; a few points are allowed to remain outside the minimal hyper-sphere]
Soft SVDD optimization criterion

Primal formulation with soft constraints:
Minimize R²  (squared radius of the sphere)
subject to ||Φ(xᵢ) - a||² ≤ R² + ξᵢ,  ξᵢ ≥ 0  for i = 1,…,N

The introduction of slack variables ξᵢ mitigates the influence of noise and overlap on the clustering process.
Dual formulation of soft SVDD

Maximize  Σᵢ βᵢ K(xᵢ, xᵢ) - Σᵢ,ⱼ βᵢβⱼ K(xᵢ, xⱼ)
subject to 0 ≤ βᵢ ≤ C for i = 1,…,N

•As before, K(xᵢ, xⱼ) = Φ(xᵢ)·Φ(xⱼ) denotes a kernel function, e.g. the Gaussian kernel K(xᵢ, xⱼ) = exp(-γ ||xᵢ - xⱼ||²) with γ > 0.
•Parameter 0 < C ≤ 1 gives a trade-off between the volume of the sphere and the number of errors (C = 1 corresponds to hard constraints).
•The Gaussian kernel tends to yield tighter contour representations of clusters than the polynomial kernel.
•The Gaussian kernel width parameter γ influences the tightness of cluster boundaries, the number of SVs and the number of clusters.
•Increasing γ causes an increase in the number of clusters.
SVM-based variable selection
Understanding the weight vector w

Recall the standard SVM formulation:
Find w and b that minimize (1/2)||w||² subject to yᵢ(w·xᵢ + b) ≥ 1 for i = 1,…,N.
Use the classifier: f(x) = sign(w·x + b).

•The weight vector w contains as many elements as there are input variables in the dataset, i.e. w ∈ Rⁿ.
•The magnitude of each element denotes the importance of the corresponding variable for the classification task.
Understanding the weight vector w

Examples:
  w = (1, 1), decision surface x₁ + x₂ + b = 0:  X1 and X2 are equally important
  w = (1, 0), decision surface x₁ + b = 0:       X1 is important, X2 is not
  w = (0, 1), decision surface x₂ + b = 0:       X2 is important, X1 is not
  w = (1, 1, 0), decision surface x₁ + x₂ + b = 0 in three dimensions:  X1 and X2 are equally important, X3 is not
Understanding the weight vector w

[Figure: Melanoma vs. Nevi samples in the Gene X1 - Gene X2 plane; in the true model X1 determines the phenotype and X2 depends on X1]

•In the true model, X1 is causal and X2 is redundant.
•The SVM decision surface x₁ + x₂ + b = 0 (w = (1, 1)) implies that X1 and X2 are equally important; thus it is locally causally inconsistent.
•There exists a causally consistent decision surface for this example, e.g. the decision surface of another classifier x₁ + b = 0 (w = (1, 0)).
•Causal discovery algorithms can identify that X1 is causal and X2 is redundant.
Simple SVM-based variable selection algorithm

Algorithm:
1. Train an SVM classifier using data for all variables to estimate the vector w
2. Rank each variable based on the magnitude of the corresponding element in vector w
3. Using the above ranking of variables, select the smallest nested subset of variables that achieves the best SVM prediction accuracy.
124
Simple SV
M-based variable selection
algorithmConsider that w
e have 7 variables: X1 , X
2 , X3 , X
4 , X5 , X
6 , X7
The vector is: (0.1, 0.3, 0.4, 0.01, 0.9, -0.99, 0.2)The ranking of variables is: X
6 , X5 , X
3 , X2 , X
7 , X1 , X
4
Subset of variablesClassification
accuracy
X6
X5
X3
X2
X7
X1
X4
0.920
X6
X5
X3
X2
X7
X1
0.920
X6
X5
X3
X2
X7
0.919
X6
X5
X3
X2
0.852
X6
X5
X3
0.843
X6
X5
0.832
X6
0.821
Best classification accuracy
Classification accuracy that is statistically indistinguishable from
the best one
ÎSelect the follow
ing variable subset: X6 , X
5 , X3 , X
2, X
7
w &
Simple SVM-based variable selection algorithm

•SVM weights are not locally causally consistent → we may end up with a variable subset that is not causal and not necessarily the most compact one.
•The magnitude of a variable in vector w estimates the effect of removing that variable on the objective function of the SVM (e.g., the function that we want to minimize). However, this algorithm becomes sub-optimal when considering the effect of removing several variables at a time… This pitfall is addressed in the SVM-RFE algorithm that is presented next.
SVM-RFE variable selection algorithm

Algorithm:
1. Initialize V to all variables in the data
2. Repeat
3.   Train an SVM classifier using data for variables in V to estimate the vector w
4.   Estimate prediction accuracy of variables in V using the above SVM classifier (e.g., by cross-validation)
5.   Remove from V a variable (or a subset of variables) with the smallest magnitude of the corresponding element in vector w
6. Until there are no variables in V
7. Select the smallest subset of variables with the best prediction accuracy
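scikit-learn ships a recursive feature elimination wrapper that follows essentially this loop. A minimal sketch with a linear SVM as the underlying estimator; the library and synthetic data are assumptions for illustration (RFECV would additionally cross-validate the subset size, as in step 7).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)   # only the first 3 variables are relevant

svm = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=svm, n_features_to_select=3, step=1)  # drop one variable per iteration
rfe.fit(X, y)
print(np.where(rfe.support_)[0])   # indices of the retained variables
print(rfe.ranking_)                # 1 = kept; larger numbers were eliminated earlier
```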
SVM-RFE variable selection algorithm

[Figure: 10,000 genes → SVM model → 5,000 genes important for classification, 5,000 discarded → SVM model → 2,500 genes important, 2,500 discarded → …; prediction accuracy is estimated at every step]

•Unlike the simple SVM-based variable selection algorithm, SVM-RFE estimates vector w many times to establish the ranking of the variables.
•Notice that the prediction accuracy should be estimated at each step in an unbiased fashion, e.g. by cross-validation.
SVM variable selection in feature space

The real power of SVMs comes with application of the kernel trick, which maps data to a much higher dimensional space ("feature space") where the data is linearly separable.

[Figure: Tumor vs. Normal data, not separable in the input space, becomes separable in the feature space obtained by a kernel]
SVM variable selection in feature space

•We have data for 100 SNPs (X1,…,X100) and some phenotype.
•We allow up to 3rd order interactions, e.g. we consider:
  - X1,…,X100
  - X1², X1X2, X1X3,…,X1X100,…,X100²
  - X1³, X1X2X3, X1X2X4,…,X1X99X100,…,X100³
•Task: find the smallest subset of features (either SNPs or their interactions) that achieves the best predictive accuracy of the phenotype.
•Challenge: If we have a limited sample, we cannot explicitly construct and evaluate all SNPs and their interactions (176,851 features in total) as is done in classical statistics.
SVM variable selection in feature space

Heuristic solution: Apply the algorithm SVM-FSMB that:
1. Uses SVMs with a polynomial kernel of degree 3 and selects M features (not necessarily input variables!) that have the largest weights in the feature space.
   E.g., the algorithm can select features like: X10, (X1X2), (X9X2X22), (X7²X98), and so on.
2. Applies the HITON-MB Markov blanket algorithm to find the Markov blanket of the phenotype using the M features from step 1.
Computing posterior class probabilities for SVM classifiers
131
132
Output of SV
M classifier
1.SVM
s output a class label (positive or negative) for each sam
ple:
2. One can also com
pute distance from
the hyperplane that separates classes, e.g. . These distances can be used to com
pute performance m
etrics like area under RO
C curve.
Positive samples (y=+1)
Negative sam
ples (y=-1)
0
��
bx
w&
&
)(
bx
wsign
�� &
&
bx
w�
� &&
Question:H
ow can one use SV
Ms to estim
ate posterior class probabilities, i.e., P(class positive | sam
ple x)?
Simple binning method

1. Train the SVM classifier in the Training set.
2. Apply it to the Validation set and compute distances from the hyperplane to each sample.
   Sample #   1   2   3   4   5  ...  98   99   100
   Distance   2  -1   8   3   4  ...  0.3  0.8  ...
3. Create a histogram with Q (e.g., say 10) bins using the above distances. Each bin has an upper and lower value in terms of distance.

[Figure: histogram of the number of validation samples per distance bin]
Simple binning method

4. Given a new sample from the Testing set, place it in the corresponding bin.
   E.g., sample #382 has distance to hyperplane = 1, so it is placed in the bin [0, 2.5].
5. Compute the probability P(positive class | sample #382) as the fraction of true positives in this bin.
   E.g., this bin has 22 samples (from the Validation set), out of which 17 are true positives, so we compute P(positive class | sample #382) = 17/22 = 0.77.

[Figure: histogram of validation samples per distance bin and the resulting P(positive class | sample) as a function of distance]
Platt's method

Convert distances output by the SVM to probabilities by passing them through the sigmoid filter:
  P(positive class | sample) = 1 / (1 + exp(A·d + B))
where d is the distance from the hyperplane and A and B are parameters.
Platt's method

1. Train the SVM classifier in the Training set.
2. Apply it to the Validation set and compute distances from the hyperplane to each sample.
   Sample #   1   2   3   4   5  ...  98   99   100
   Distance   2  -1   8   3   4  ...  0.3  0.8  ...
3. Determine parameters A and B of the sigmoid function by minimizing the negative log likelihood of the data from the Validation set.
4. Given a new sample from the Testing set, compute its posterior probability using the sigmoid function.
Part 3

•Case studies (taken from our research)
  1. Classification of cancer gene expression microarray data
  2. Text categorization in biomedicine
  3. Prediction of clinical laboratory values
  4. Modeling clinical judgment
  5. Using SVMs for feature selection
  6. Outlier detection in ovarian cancer proteomics data
•Software
•Conclusions
•Bibliography
1. Classification of cancer gene expression microarray data
Comprehensive evaluation of algorithms for classification of cancer microarray data

Main goals:
•Find the best performing decision support algorithms for cancer diagnosis from microarray gene expression data;
•Investigate benefits of using gene selection and ensemble classification methods.
Classifiers

•K-Nearest Neighbors (KNN) (instance-based)
•Backpropagation Neural Networks (NN) (neural networks)
•Probabilistic Neural Networks (PNN) (neural networks)
•Multi-Class SVM: One-Versus-Rest (OVR) (kernel-based)
•Multi-Class SVM: One-Versus-One (OVO) (kernel-based)
•Multi-Class SVM: DAGSVM (kernel-based)
•Multi-Class SVM by Weston & Watkins (WW) (kernel-based)
•Multi-Class SVM by Crammer & Singer (CS) (kernel-based)
•Weighted Voting: One-Versus-Rest (voting)
•Weighted Voting: One-Versus-One (voting)
•Decision Trees: CART (decision trees)
Ensemble classifiers

[Figure: the dataset is given to Classifier 1, Classifier 2, …, Classifier N; their predictions are combined by an ensemble classifier into a final prediction]
Gene selection methods

1. Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion;
2. Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion;
3. Kruskal-Wallis nonparametric one-way ANOVA (KW);
4. Ratio of genes between-categories to within-category sum of squares (BW).

[Figure: genes ranked from uninformative genes to highly discriminatory genes]
Performance metrics and statistical comparison

1. Accuracy
   + can compare to previous studies
   + easy to interpret & simplifies statistical comparison
2. Relative classifier information (RCI)
   + easy to interpret & simplifies statistical comparison
   + not sensitive to distribution of classes
   + accounts for difficulty of a decision problem

•Randomized permutation testing to compare accuracies of the classifiers (α = 0.05)
Microarray datasets

Dataset name    | Samples | Variables (genes) | Categories | Reference
11_Tumors       | 174     | 12533             | 11         | Su, 2001
14_Tumors       | 308     | 15009             | 26         | Ramaswamy, 2001
9_Tumors        | 60      | 5726              | 9          | Staunton, 2001
Brain_Tumor1    | 90      | 5920              | 5          | Pomeroy, 2002
Brain_Tumor2    | 50      | 10367             | 4          | Nutt, 2003
Leukemia1       | 72      | 5327              | 3          | Golub, 1999
Leukemia2       | 72      | 11225             | 3          | Armstrong, 2002
Lung_Cancer     | 203     | 12600             | 5          | Bhattacherjee, 2001
SRBCT           | 83      | 2308              | 4          | Khan, 2001
Prostate_Tumor  | 102     | 10509             | 2          | Singh, 2002
DLBCL           | 77      | 5469              | 2          | Shipp, 2002

Total:
•~1300 samples
•74 diagnostic categories
•41 cancer types and 12 normal tissue types
Summary of methods and datasets

Gene expression datasets (11): 9_Tumors, 14_Tumors, 11_Tumors, Brain_Tumor1, Brain_Tumor2, Leukemia1, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumors, DLBCL (multicategory and binary diagnostic tasks)

Gene selection methods (4): S2N One-Versus-Rest, S2N One-Versus-One, Non-parametric ANOVA (KW), BW ratio

Cross-validation designs (2): 10-fold CV, LOOCV

Performance metrics (2): Accuracy, RCI

Classifiers (11): MC-SVM (One-Versus-Rest, One-Versus-One, DAGSVM, method by WW, method by CS), KNN, Backprop. NN, Prob. NN, Weighted Voting (One-Versus-Rest, One-Versus-One), Decision Trees

Ensemble classifiers (7): based on MC-SVM outputs (Decision Trees, Majority Voting) and based on outputs of all classifiers (Decision Trees, Majority Voting, MC-SVM OVR, MC-SVM OVO, MC-SVM DAGSVM)

Statistical comparison: randomized permutation testing
Results without gene selection

[Figure: accuracy (%) of the MC-SVM methods (OVR, OVO, DAGSVM, WW, CS) and of KNN, NN, PNN on the 11 datasets (9_Tumors, 14_Tumors, Brain_Tumor1, Brain_Tumor2, 11_Tumors, Leukemia1, Leukemia2, Lung_Cancer, SRBCT, Prostate_Tumor, DLBCL)]
Results with gene selection

[Figure: improvement in accuracy (%) by gene selection for OVR, OVO, DAGSVM, WW, CS (SVM) and KNN, NN, PNN (non-SVM), averaged over four datasets (9_Tumors, 14_Tumors, Brain_Tumor1, Brain_Tumor2), together with diagnostic performance before and after gene selection for SVM and non-SVM methods]

The average reduction of genes is 10-30 times.
Comparison with previously published results

[Figure: accuracy (%) of multiple specialized classification methods (original primary studies) vs. multiclass SVMs (this study) on the 11 datasets]
Summary of results

•Multi-class SVMs are the best family among the tested algorithms, outperforming KNN, NN, PNN, DT, and WV.
•Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms;
•Ensemble classification does not improve performance of SVM and other classifiers;
•Results obtained by SVMs compare favorably with the literature.
Random Forest (RF) classifiers

•Appealing properties
  - Work when # of predictors > # of samples
  - Embedded gene selection
  - Incorporate interactions
  - Based on theory of ensemble learning
  - Can work with binary & multiclass tasks
  - Do not require much fine-tuning of parameters
•Strong theoretical claims
•Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods
Key principles of RF classifiers

1) Generate bootstrap samples from the training data
2) Random gene selection at each split
3) Fit unpruned decision trees
4) Apply to testing data & combine predictions
151
Results w
ithout gene selection
152
•SV
Ms n
om
ina
llyoutperform
RFs is 15 datasets, RFs outperform SV
Ms in 4 datasets,
algorithms are exactly the sam
e in 3 datasets.•
In 7 datasets SVM
s outperform RFs sta
tistically
sign
ifican
tly.•
On average, the perform
ance advantage of SVM
s is 0.033 AU
C and 0.057 RCI.
Results with gene selection

•SVMs nominally outperform RFs in 17 datasets, RFs outperform SVMs in 3 datasets, and the algorithms perform exactly the same in 2 datasets.
•In 1 dataset SVMs outperform RFs statistically significantly.
•On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI.
2. Text categorization in biomedicine
Models to categorize content and quality: Main idea

1. Utilize existing (or easy to build) training corpora.
2. Use simple document representations as "bag-of-words" (i.e., typically stemmed and weighted words in title and abstract, MeSH terms if available; occasionally addition of Metamap CUIs and author info).
Models to categorize content and quality: Main idea

3. Train SVM models on labeled examples that capture implicit categories of meaning or quality criteria, then apply them to unseen examples.
4. Evaluate models' performances
   - with nested cross-validation or other appropriate error estimators
   - using primarily AUC as well as other metrics (sensitivity, specificity, PPV, Precision/Recall curves, HIT curves, etc.)
5. Evaluate performance prospectively & compare to prior cross-validation estimates.

[Figure: estimated performance vs. 2005 prospective performance for the Treatment (Txmt), Diagnosis (Diag), Prognosis (Prog) and Etiology (Etio) categories]
Models to categorize content and quality: Some notable results

1. SVM models have excellent ability to identify high-quality PubMed documents according to the ACPJ gold standard:

Category  | Average AUC | Range over n folds
Treatment | 0.97*       | 0.96 - 0.98
Etiology  | 0.94*       | 0.89 - 0.95
Prognosis | 0.95*       | 0.92 - 0.97
Diagnosis | 0.95*       | 0.93 - 0.98

2. SVM models have better classification performance than PageRank, Yahoo ranks, Impact Factor, Web page hit counts, and bibliometric citation counts on the Web according to the ACPJ gold standard:

Method                      | Treatment AUC | Etiology AUC | Prognosis AUC | Diagnosis AUC
Google PageRank             | 0.54          | 0.54         | 0.43          | 0.46
Yahoo Webranks              | 0.56          | 0.49         | 0.52          | 0.52
Impact Factor 2005          | 0.67          | 0.62         | 0.51          | 0.52
Web page hit count          | 0.63          | 0.63         | 0.58          | 0.57
Bibliometric Citation Count | 0.76          | 0.69         | 0.67          | 0.60
Machine Learning Models     | 0.96          | 0.95         | 0.95          | 0.95
Models to categorize content and quality: Some notable results

3. SVM models have better classification performance than PageRank, Impact Factor and Citation count in Medline for the SSOAB gold standard:

Gold standard: SSOAB       | Area under the ROC curve*
SSOAB-specific filters     | 0.893
Citation Count             | 0.791
ACPJ Txmt-specific filters | 0.548
Impact Factor (2001)       | 0.549
Impact Factor (2005)       | 0.558

4. SVM learning models have better sensitivity/specificity in PubMed than clinical query filters (CQFs) at comparable thresholds according to the ACPJ gold standard.

[Figure: bar charts comparing query filters vs. learning models at fixed specificity and at fixed sensitivity for the Treatment, Etiology, Prognosis and Diagnosis categories; e.g. for Diagnosis at fixed specificity 0.97, sensitivity is 0.65 for query filters vs. 0.82 for learning models, and at fixed sensitivity 0.96, specificity is 0.68 vs. 0.88]
Other applications of SVMs to text categorization
159
1. Identifying Web pages with misleading treatment information according to a special-purpose gold standard (Quack Watch). SVM models outperform the Quackometer and Google ranks in the tested domain of cancer treatment.

Model                     Area under the curve
Machine learning models   0.93
Quackometer*              0.67
Google                    0.63

2. Prediction of future paper citation counts (work of L. Fu and C.F. Aliferis, AMIA 2008).
3. Prediction of clinical laboratory values
160
Dataset generation and experimental design
• The StarPanel database contains ~8·10^6 lab measurements of ~100,000 in-patients from Vanderbilt University Medical Center.
• Lab measurements were taken between 01/1998 and 10/2002.
• Data split: Training 01/1998-05/2001 (with 25% of Training set aside for Validation) and Testing 06/2001-10/2002.
• For each combination of lab test and normal range, we generated the following datasets.
Query-based approach for prediction of clinical lab values
161
[Diagram: for every data model, an SVM classifier is trained on the training data and evaluated on the validation data; the optimal data model and its SVM classifier are then applied to every testing sample drawn from the database to produce a prediction and measure performance.]
A sketch of this model-selection loop is given below.
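The sketch below illustrates the loop in the diagram under assumptions (it is not the StarPanel code; build_dataset is a hypothetical helper): train an SVM for every candidate data model, score it on the validation data, and keep the best data model for application to the testing samples.

```python
# Sketch of the model-selection loop above (illustrative assumptions, not the StarPanel code).
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

def select_best_data_model(data_models, build_dataset, train_records, valid_records):
    """data_models: candidate descriptions (e.g., how many prior measurements to use).
    build_dataset: hypothetical helper turning raw records into (X, y) under a data model."""
    best = (None, None, -1.0)  # (data model, fitted classifier, validation AUC)
    for dm in data_models:
        X_tr, y_tr = build_dataset(train_records, dm)
        X_va, y_va = build_dataset(valid_records, dm)
        clf = SVC(kernel="linear").fit(X_tr, y_tr)               # train SVM classifier
        auc = roc_auc_score(y_va, clf.decision_function(X_va))   # validation performance
        if auc > best[2]:
            best = (dm, clf, auc)
    return best  # the optimal data model + classifier, later applied to each testing sample
```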
Classification results
162
Area under ROC curve (without feature selection), including cases with K=0 (i.e., samples with no prior lab measurements). Column headings are the ranges of normal values:

Laboratory test   >1      <99     [1, 99]   >2.5    <97.5   [2.5, 97.5]
BUN               80.4%   99.1%   76.6%     87.1%   98.2%   70.7%
Ca                72.8%   93.4%   55.6%     81.4%   81.4%   63.4%
CaIo              74.1%   60.0%   50.1%     64.7%   72.3%   57.7%
CO2               82.0%   93.6%   59.8%     84.4%   94.5%   56.3%
Creat             62.8%   97.7%   89.1%     91.5%   98.1%   87.7%
Mg                56.9%   70.0%   49.1%     58.6%   76.9%   59.1%
Osmol             50.9%   60.8%   60.8%     91.0%   90.5%   68.0%
PCV               74.9%   99.2%   66.3%     80.9%   80.6%   67.1%
Phos              74.5%   93.6%   64.4%     71.7%   92.2%   69.7%

Area under ROC curve (without feature selection), excluding cases with K=0 (i.e., samples with no prior lab measurements):

Laboratory test   >1      <99     [1, 99]   >2.5    <97.5   [2.5, 97.5]
BUN               75.9%   93.4%   68.5%     81.8%   92.2%   66.9%
Ca                67.5%   80.4%   55.0%     77.4%   70.8%   60.0%
CaIo              63.5%   52.9%   58.8%     46.4%   66.3%   58.7%
CO2               77.3%   88.0%   53.4%     77.5%   90.5%   58.1%
Creat             62.2%   88.4%   83.5%     88.4%   94.9%   83.8%
Mg                58.4%   71.8%   64.2%     67.0%   72.5%   62.1%
Osmol             77.9%   64.8%   65.2%     79.2%   82.4%   71.5%
PCV               62.3%   91.6%   69.7%     76.5%   84.6%   70.2%
Phos              70.8%   75.4%   60.4%     68.0%   81.8%   65.9%

A total of 84,240 SVM classifiers were built for 16,848 possible data models.
163
Improving predictive power and parsimony
of a BUN model using feature selection
164
Model description:
• Test name: BUN
• Range of normal values: < 99th percentile
• Data modeling: SRT
• Number of previous measurements: 5
• Use variables corresponding to hospitalization units? Yes
• Number of prior hospitalizations used: 2

Dataset description (3442 variables in each set):
                 N samples (total)   N abnormal samples
Training set     3749                78
Validation set   1251                27
Testing set      836                 16

[Figure: histogram of test BUN (frequency of measurements vs. test value), with the ranges of normal and abnormal values marked.]

Classification performance (area under ROC curve):
                     All      RFE_Linear   RFE_Poly   HITON_PC   HITON_MB
Validation set       95.29%   98.78%       98.76%     99.12%     98.90%
Testing set          94.72%   99.66%       99.63%     99.16%     99.05%
Number of features   3442     26           3          11         17

Selected features (per method):
• RFE_Linear (26): LAB: PM_1(BUN); LAB: PM_2(Cl); LAB: DT(PM_3(K)); LAB: DT(PM_3(Creat)); LAB: Test Unit J018 (Test Ca, PM 3); LAB: DT(PM_4(Cl)); LAB: DT(PM_3(Mg)); LAB: PM_1(Cl); LAB: PM_3(Gluc); LAB: DT(PM_1(CO2)); LAB: DT(PM_4(Gluc)); LAB: PM_3(Mg); LAB: DT(PM_5(Mg)); LAB: PM_1(PCV); LAB: PM_2(BUN); LAB: Test Unit 11NM (Test PCV, PM 2); LAB: Test Unit 7SCC (Test Mg, PM 3); LAB: DT(PM_2(Phos)); LAB: DT(PM_3(CO2)); LAB: DT(PM_2(Gluc)); LAB: DT(PM_5(CaIo)); DEMO: Hospitalization Unit TVOS; LAB: PM_1(Phos); LAB: PM_2(Phos); LAB: Test Unit 11NM (Test K, PM 5); LAB: Test Unit VHR (Test CaIo, PM 1)
• RFE_Poly (3): LAB: PM_1(BUN); LAB: Indicator(PM_1(Mg)); LAB: Test Unit NO_TEST_MEASUREMENT (Test CaIo, PM 1)
• HITON_PC (11): LAB: PM_1(BUN); LAB: PM_5(Creat); LAB: PM_1(Phos); LAB: Indicator(PM_1(BUN)); LAB: Indicator(PM_5(Creat)); LAB: Indicator(PM_1(Mg)); LAB: DT(PM_4(Creat)); LAB: Test Unit 7SCC (Test Ca, PM 1); LAB: Test Unit RADR (Test Ca, PM 5); LAB: Test Unit 7SMI (Test PCV, PM 4); DEMO: Gender
• HITON_MB (17): LAB: PM_1(BUN); LAB: PM_5(Creat); LAB: PM_3(PCV); LAB: PM_1(Mg); LAB: PM_1(Phos); LAB: Indicator(PM_4(Creat)); LAB: Indicator(PM_5(Creat)); LAB: Indicator(PM_3(PCV)); LAB: Indicator(PM_1(Phos)); LAB: DT(PM_4(Creat)); LAB: Test Unit 11NM (Test BUN, PM 2); LAB: Test Unit 7SCC (Test Ca, PM 1); LAB: Test Unit RADR (Test Ca, PM 5); LAB: Test Unit 7SMI (Test PCV, PM 4); LAB: Test Unit CCL (Test Phos, PM 1); DEMO: Gender; DEMO: Age
165
4. Modeling clinical judgment
166
Methodological framework and study outline
[Diagram: for each of the physicians (Physician 1 through Physician 6), every patient 1..N is described by features f1 ... fm, the physician's clinical diagnosis cd, and the gold-standard diagnosis hd; the patient features and gold standard are the same across physicians, while the clinical diagnoses differ across physicians.]
Study goals:
• Predict clinical decisions
• Identify predictors ignored by physicians
• Explain each physician's diagnostic model
• Compare physicians with each other and with guidelines
Clinical context of experiment
Malignant melanoma is the most dangerous form of skin cancer; its incidence and mortality have been constantly increasing in the last decades.
Physicians and patients:
• Dermatologists: N = 6 (3 experts, 3 non-experts)
• Patients: N = 177 (76 melanomas, 101 nevi)
Features:
Lesion location, Family history of melanoma, Irregular border, Streaks (radial streaming, pseudopods), Max-diameter, Fitzpatrick's photo-type, Number of colors, Slate-blue veil, Min-diameter, Sunburn, Atypical pigmented network, Whitish veil, Evolution, Ephelis, Abrupt network cut-off, Globular elements, Age, Lentigos, Regression-Erythema, Comedo-like openings/milia-like cysts, Gender, Asymmetry, Hypo-pigmentation, Telangiectasia
Data collection:
• Patients seen prospectively from 1999 to 2002 at the Department of Dermatology, S. Chiara Hospital, Trento, Italy
• Inclusion criteria: histological diagnosis and >1 digital image available
• Diagnoses made in 2004
Method to explain physician-specific SVM models
[Diagram: regular learning builds an SVM "black box" from the data (with feature selection, FS); meta-learning then applies the SVM and builds a decision tree (DT) on its outputs, so the tree explains the SVM's behavior.]
A small illustrative sketch of this meta-learning step is given below.
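The sketch below shows the meta-learning idea on synthetic data under assumptions (it is not the study's code or its dermoscopic dataset): fit the SVM "black box", then fit a decision tree to the SVM's predictions so the tree provides an explicit, readable approximation of the model.

```python
# Sketch of the meta-learning idea (assumed implementation on synthetic data):
# 1) regular learning fits an SVM "black box";
# 2) meta-learning fits a decision tree to the SVM's predictions to explain it.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
features = ["blue_veil", "irregular_border", "streaks"]       # toy binary dermoscopic features
X = rng.integers(0, 2, size=(300, 3))                          # toy feature values
y = ((X[:, 0] | X[:, 2]) & X[:, 1]).astype(int)                # toy "physician" diagnoses

svm = SVC(kernel="rbf", gamma="scale").fit(X, y)               # regular learning
tree = DecisionTreeClassifier(max_depth=3, random_state=0)     # meta-learning
tree.fit(X, svm.predict(X))                                    # learn on SVM outputs, not on y

print(export_text(tree, feature_names=features))               # human-readable explanation
```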
Results: Predicting physicians' judgments

Physicians    All (features)   HITON_PC (features)   HITON_MB (features)   RFE (features)
Expert 1      0.94 (24)        0.92 (4)              0.92 (5)              0.95 (14)
Expert 2      0.92 (24)        0.89 (7)              0.90 (7)              0.90 (12)
Expert 3      0.98 (24)        0.95 (4)              0.95 (4)              0.97 (19)
NonExpert 1   0.92 (24)        0.89 (5)              0.89 (6)              0.90 (22)
NonExpert 2   1.00 (24)        0.99 (6)              0.99 (6)              0.98 (11)
NonExpert 3   0.89 (24)        0.89 (4)              0.89 (6)              0.87 (10)
Results: Physician-specific models
Results: Explaining physician agreement
Expert 1: AUC = 0.92, R2 = 99%. Expert 3: AUC = 0.95, R2 = 99%.

              Blue veil   Irregular border   Streaks
Patient 001   yes         no                 yes
Results: Explaining physician disagreement
Expert 1: AUC = 0.92, R2 = 99%. Expert 3: AUC = 0.95, R2 = 99%.

              Blue veil   Irregular border   Streaks   Number of colors   Evolution
Patient 002   no          no                 yes       3                  no
Results: Guideline compliance
• Experts 1, 2, 3 and non-expert 1 (reported guideline: pattern analysis). Non-compliant: they ignore the majority of the features (17 to 20) recommended by pattern analysis.
• Non-expert 2 (reported guideline: ABCDE rule). Non-compliant: asymmetry, irregular border, and evolution are ignored.
• Non-expert 3 (reported guideline: non-standard; reports using 7 features). Non-compliant: 2 of the 7 reported features are ignored, while some non-reported features are used.
On the contrary: in all guidelines, the more predictors present, the higher the likelihood of melanoma. All physicians were compliant with this principle.
5. Using SVMs for feature selection
176
Feature selection methods
177
Feature selection methods (non-causal):
• SVM-RFE (an SVM-based feature selection method; see the sketch below)
• Univariate + wrapper
• Random forest-based
• LARS-Elastic Net
• RELIEF + wrapper
• L0-norm
• Forward stepwise feature selection
• No feature selection
Causal feature selection methods:
• HITON-PC
• HITON-MB (this method outputs a Markov blanket of the response variable, under assumptions)
• IAMB
• BLCD
• K2MB
• ...
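For concreteness, here is a minimal sketch of one simple SVM-RFE variant (recursive elimination of the features with the smallest linear-SVM weights, in the spirit of Guyon et al. 2002), written with scikit-learn on synthetic data; it is not the exact set of SVM-RFE variants evaluated in this study.

```python
# Minimal SVM-RFE sketch (one simple variant on synthetic data, not the exact
# variants evaluated here): repeatedly fit a linear SVM and drop the features
# with the smallest absolute weights.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

selector = RFE(
    estimator=LinearSVC(C=1.0, max_iter=10000),  # |w_i| of the linear SVM is the ranking criterion
    n_features_to_select=5,                      # stop when 5 features remain
    step=0.1,                                    # eliminate 10% of the remaining features per round
)
selector.fit(X, y)
print("Selected feature indices:", list(selector.get_support(indices=True)))
```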
13 real datasets were used to evaluate feature selection methods
178

Dataset name       Domain              Number of variables   Number of samples   Target                                      Data type
Infant_Mortality   Clinical            86                    5,337               Died within the first year                  discrete
Ohsumed            Text                14,373                5,000               Relevant to neonatal diseases               continuous
ACPJ_Etiology      Text                28,228                15,779              Relevant to etiology                        continuous
Lymphoma           Gene expression     7,399                 227                 3-year survival: dead vs. alive             continuous
Gisette            Digit recognition   5,000                 7,000               Separate 4 from 9                           continuous
Dexter             Text                19,999                600                 Relevant to corporate acquisitions          continuous
Sylva              Ecology             216                   14,394              Ponderosa pine vs. everything else          continuous & discrete
Ovarian_Cancer     Proteomics          2,190                 216                 Cancer vs. normals                          continuous
Thrombin           Drug discovery      139,351               2,543               Binding to thrombin                         discrete (binary)
Breast_Cancer      Gene expression     17,816                286                 Estrogen-receptor positive (ER+) vs. ER-    continuous
Hiva               Drug discovery      1,617                 4,229               Activity to AIDS HIV infection              discrete (binary)
Nova               Text                16,969                1,929               Separate politics from religion topics      discrete (binary)
Bankruptcy         Financial           147                   7,063               Personal bankruptcy                         continuous & discrete
Classification performance vs. proportion of selected features
179
[Figure: classification performance (AUC) vs. proportion of selected features for HITON-PC with G2 test and for RFE; an original plot over the full range plus a magnified view for proportions of about 0.05-0.2, where AUC lies roughly between 0.85 and 0.90.]
Statistical comparison of predictivity and reduction of features
180
Comparison: SVM-RFE (4 variants) vs. HITON-PC

Predictivity                  Reduction of features
P-value    Nominal winner     P-value    Nominal winner
0.9754     SVM-RFE            0.0046     HITON-PC
0.8030     SVM-RFE            0.0042     HITON-PC
0.1312     HITON-PC           0.3634     HITON-PC
0.1008     HITON-PC           0.6816     SVM-RFE

• Null hypothesis: SVM-RFE and HITON-PC perform the same.
• A permutation-based statistical test with alpha = 0.05 is used (a small sketch of such a test is given below).
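The sketch below shows one common form of paired permutation test (random sign flipping of per-dataset performance differences). It is an assumed illustration; the exact test used in the study may differ in its details.

```python
# Sketch of a paired permutation test for comparing two methods across datasets
# (an assumed form; the study's exact test may differ).
import numpy as np

def permutation_p_value(scores_a, scores_b, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diff.mean())
    hits = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diff.size)  # randomly swap the two methods per dataset
        if abs((signs * diff).mean()) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical example: per-dataset AUCs of two methods; reject at alpha = 0.05 if p < 0.05.
p = permutation_p_value([0.91, 0.88, 0.93, 0.85], [0.89, 0.87, 0.90, 0.86])
print("p-value:", p)
```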
Simulated datasets with known causal structure used to compare algorithms
181
182
Comparison of SVM-RFE and HITON-PC
Comparison of all methods in terms of causal graph distance
183
[Figure: causal graph distance for SVM-RFE and for the HITON-PC based causal methods.]
Summary results
184
[Figure: summary of results for SVM-RFE, the HITON-PC based causal methods, and the HITON-PC-FDR methods.]
Statistical comparison of graph distance
185
Comparison: average HITON-PC-FDR with G2 test vs. average SVM-RFE

                     P-value    Nominal winner
Sample size = 200    <0.0001    HITON-PC-FDR
Sample size = 500    0.0028     HITON-PC-FDR
Sample size = 5000   <0.0001    HITON-PC-FDR

• Null hypothesis: SVM-RFE and HITON-PC-FDR perform the same.
• A permutation-based statistical test with alpha = 0.05 is used.
6. Outlier detection in ovarian cancer proteomics data
186
Ovarian cancer data
[Figure: proteomic profiles for Data Set 1 (top) and Data Set 2 (bottom), with Cancer, Normal, and Other samples plotted against clock tick (4000, 8000, 12000).]
• Same set of 216 patients, obtained using the Ciphergen H4 ProteinChip array (dataset 1) and using the Ciphergen WCX2 ProteinChip array (dataset 2).
• The gross break at the "benign disease" juncture in dataset 1 and the similarity of the profiles to those in dataset 2 suggest a change of protocol in the middle of the first experiment.
Experiments with one-class SVM
Assume that sets {A, B} are normal and {C, D, E, F} are outliers. Also assume that we do not know which samples are normal and which are outliers.
• Experiment 1: Train a one-class SVM on {A, B, C} and test on {A, B, C}: area under ROC curve = 0.98.
• Experiment 2: Train a one-class SVM on {A, C} and test on {B, D, E, F}: area under ROC curve = 0.98.
[Figure: Data Set 1 (top) and Data Set 2 (bottom) profiles (Cancer, Normal, Other vs. clock tick), repeated for reference.]
188
A minimal sketch of one-class SVM outlier detection is given below.
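The following sketch runs a one-class SVM on synthetic data (illustrative only, not the actual proteomic profiles): train on samples assumed to be mostly normal, then score new samples, with low decision values flagged as outliers.

```python
# Sketch of one-class SVM outlier detection (synthetic data, not the actual proteomic profiles).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train_profiles = rng.normal(loc=0.0, scale=1.0, size=(200, 20))   # mostly "normal" samples
new_normals    = rng.normal(loc=0.0, scale=1.0, size=(20, 20))
new_outliers   = rng.normal(loc=4.0, scale=1.0, size=(20, 20))    # e.g., protocol-shifted samples

# nu upper-bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(train_profiles)

# decision_function: larger = more "normal"; negative values are flagged as outliers.
print("mean score, new normal samples :", ocsvm.decision_function(new_normals).mean())
print("mean score, new outlier samples:", ocsvm.decision_function(new_outliers).mean())
```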
Software
189
Interactive media and animations
190
SVM Applets
• http://www.csie.ntu.edu.tw/~cjlin/libsvm/
• http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
• http://www.smartlab.dibe.unige.it/Files/sw/Applet%20SVM/svmapplet.html
• http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html
• http://www.dsl-lab.org/svm_tutorial/demo.html (requires Java 3D)
Animations
• Support Vector Machines:
http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2moons.avi
http://www.cs.ust.hk/irproj/Regularization%20Path/svmKernelpath/2Gauss.avi
http://www.youtube.com/watch?v=3liCbRZPrZA
• Support Vector Regression:
http://www.cs.ust.hk/irproj/Regularization%20Path/movie/ga0.5lam1.avi
Several SVM implementations for beginners
191
• GEMS: http://www.gems-system.org
• Weka: http://www.cs.waikato.ac.nz/ml/weka/
• Spider (for Matlab): http://www.kyb.mpg.de/bs/people/spider/
• CLOP (for Matlab): http://clopinet.com/CLOP/
Several SVM implementations for intermediate users
192
• LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  - General purpose
  - Implements binary SVM, multiclass SVM, SVR, one-class SVM
  - Command-line interface
  - Code/interface for C/C++/C#, Java, Matlab, R, Python, Perl
• SVMLight: http://svmlight.joachims.org/
  - General purpose (designed for text categorization)
  - Implements binary SVM, multiclass SVM, SVR
  - Command-line interface
  - Code/interface for C/C++, Java, Matlab, Python, Perl
More software links at http://www.support-vector-machines.org/SVM_soft.html and http://www.kernel-machines.org/software
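For readers who work in Python, scikit-learn's SVC class is built on top of LIBSVM, so a minimal train/test run looks like the sketch below (illustrative defaults; in practice C and the kernel would be tuned by cross-validation, as discussed earlier in the tutorial).

```python
# Minimal SVM training/prediction example via scikit-learn's SVC (built on LIBSVM).
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # Gaussian (RBF) kernel SVM
clf.fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, clf.decision_function(X_test)))
```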
Conclusions
193
194
Strong points of SVM-based learning methods
• Empirically achieve excellent results in high-dimensional data with very few samples
• Internal capacity control to avoid overfitting
• Can learn both simple linear and very complex nonlinear functions by using the "kernel trick"
• Robust to outliers and noise (use "slack variables")
• Convex QP optimization problem (thus it has a global minimum and can be solved efficiently)
• Solution is defined only by a small subset of training points ("support vectors")
• Number of free parameters is bounded by the number of support vectors and not by the number of variables
• Do not require direct access to data; work only with dot-products of data points
195
Weak points of SVM-based learning methods
• Measures of uncertainty of parameters are not currently well developed
• Interpretation is less straightforward than for classical statistics
• Lack of parametric statistical significance tests
• Power/sample size analysis and research design considerations are less developed than for classical statistics
Bibliography
196
197
Part 1: Support vector machines for binary classification: classical formulation
• Boser BE, Guyon IM, Vapnik VN: A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT) 1992:144-152.
• Burges CJC: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 1998, 2:121-167.
• Cristianini N, Shawe-Taylor J: An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press; 2000.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Herbrich R: Learning kernel classifiers: theory and algorithms. Cambridge, Mass: MIT Press; 2002.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning. Cambridge, Mass: MIT Press; 1999.
• Shawe-Taylor J, Cristianini N: Kernel methods for pattern analysis. Cambridge, UK: Cambridge University Press; 2004.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.
198
Part 1: Basic principles of statistical machine learning
• Aliferis CF, Statnikov A, Tsamardinos I: Challenges in the analysis of mass-throughput data: a technical commentary from the statistical machine learning perspective. Cancer Informatics 2006, 2:133-162.
• Duda RO, Hart PE, Stork DG: Pattern classification. 2nd edition. New York: Wiley; 2001.
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Mitchell T: Machine learning. New York, NY, USA: McGraw-Hill; 1997.
• Vapnik VN: Statistical learning theory. New York: Wiley; 1998.
199
Part 2: Model selection for SVMs
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI) 1995, 2:1137-1145.
• Scheffer T: Error estimation and model selection. Ph.D. Thesis, Technischen Universität Berlin, School of Computer Science; 1999.
• Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 2005, 74:491-503.
200
Part 2: SVMs for multicategory data
• Crammer K, Singer Y: On the learnability and design of output codes for multiclass problems. Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT) 2000.
• Platt JC, Cristianini N, Shawe-Taylor J: Large margin DAGs for multiclass classification. Advances in Neural Information Processing Systems (NIPS) 2000, 12:547-553.
• Schölkopf B, Burges CJC, Smola AJ: Advances in kernel methods: support vector learning. Cambridge, Mass: MIT Press; 1999.
• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21:631-643.
• Weston J, Watkins C: Support vector machines for multi-class pattern recognition. Proceedings of the Seventh European Symposium on Artificial Neural Networks 1999, 4:6.
201
Part 2: Support vector regression
• Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2001.
• Smola AJ, Schölkopf B: A tutorial on support vector regression. Statistics and Computing 2004, 14:199-222.
Part 2: Novelty detection with SVM-based methods and Support Vector Clustering
• Schölkopf B, Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC: Estimating the support of a high-dimensional distribution. Neural Computation 2001, 13:1443-1471.
• Tax DMJ, Duin RPW: Support vector domain description. Pattern Recognition Letters 1999, 20:1191-1199.
• Ben-Hur A, Horn D, Siegelmann HT, Vapnik V: Support vector clustering. Journal of Machine Learning Research 2001, 2:125-137.
202
Part 2: SVM-based variable selection
• Guyon I, Elisseeff A: An introduction to variable and feature selection. Journal of Machine Learning Research 2003, 3:1157-1182.
• Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Machine Learning 2002, 46:389-422.
• Hardin D, Tsamardinos I, Aliferis CF: A theoretical characterization of linear SVM-based feature selection. Proceedings of the Twenty-First International Conference on Machine Learning (ICML) 2004.
• Statnikov A, Hardin D, Aliferis CF: Using SVM weight-based methods to identify causally relevant and non-causally relevant variables. Proceedings of the NIPS 2006 Workshop on Causality and Feature Selection 2006.
• Tsamardinos I, Brown LE: Markov blanket-based variable selection in feature space. Technical report DSL-08-01 2008.
• Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V: Feature selection for SVMs. Advances in Neural Information Processing Systems (NIPS) 2000, 13:668-674.
• Weston J, Elisseeff A, Schölkopf B, Tipping M: Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research 2003, 3:1439-1461.
• Zhu J, Rosset S, Hastie T, Tibshirani R: 1-norm support vector machines. Advances in Neural Information Processing Systems (NIPS) 2004, 16.
203
Part 2: Computing posterior class probabilities for SVM classifiers
• Platt JC: Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers. Edited by Smola A, Bartlett B, Schölkopf B, Schuurmans D. Cambridge, MA: MIT Press; 2000.
Part 3: Classification of cancer gene expression microarray data (Case Study 1)
• Diaz-Uriarte R, Alvarez de Andres S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7:3.
• Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21:631-643.
• Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF: GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 2005, 74:491-503.
• Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008, 9:319.
204
Part 3: Text categorization in biomedicine (Case Study 2)
• Aphinyanaphongs Y, Aliferis CF: Learning Boolean queries for article quality filtering. Medinfo 2004, 11:263-267.
• Aphinyanaphongs Y, Tsamardinos I, Statnikov A, Hardin D, Aliferis CF: Text categorization models for high-quality article retrieval in internal medicine. J Am Med Inform Assoc 2005, 12:207-216.
• Aphinyanaphongs Y, Statnikov A, Aliferis CF: A comparison of citation metrics to machine learning filters for the identification of high quality MEDLINE documents. J Am Med Inform Assoc 2006, 13:446-455.
• Aphinyanaphongs Y, Aliferis CF: Prospective validation of text categorization models for identifying high-quality content-specific articles in PubMed. AMIA 2006 Annual Symposium Proceedings 2006.
• Aphinyanaphongs Y, Aliferis C: Categorization models for identifying unproven cancer treatments on the Web. MEDINFO 2007.
• Fu L, Aliferis C: Models for predicting and explaining citation count of biomedical articles. AMIA 2008 Annual Symposium Proceedings 2008.
205
Part 3: Modeling clinical judgment (Case Study 4)
• Sboner A, Aliferis CF: Modeling clinical judgment and implicit guideline compliance in the diagnosis of melanomas using machine learning. AMIA 2005 Annual Symposium Proceedings 2005:664-668.
Part 3: Using SVMs for feature selection (Case Study 5)
• Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local causal and Markov blanket induction for causal discovery and feature selection for classification. Part II: Analysis and extensions. Journal of Machine Learning Research 2008.
• Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local causal and Markov blanket induction for causal discovery and feature selection for classification. Part I: Algorithms and empirical evaluation. Journal of Machine Learning Research 2008.
206
Part 3: Outlier detection in ovarian cancer proteomics data (Case Study 6)
• Baggerly KA, Morris JS, Coombes KR: Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics 2004, 20:777-785.
• Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002, 359:572-577.
Thank you for your attention!
Questions/Comments?
Email: Alexander.Statnikov@med.nyu.edu
URL: http://www.nyuinformatics.org
207