a gentle introduction to support vector machines in ... · december 21, 2010 9:41 9 x 7.5...

A GENTLE INTRODUCTION TO SUPPORT VECTOR MACHINES IN BIOMEDICINE - Volume 1: Theory and Methods
© World Scientific Publishing Co. Pte. Ltd.
http://www.worldscibooks.com/lifesci/7922.html


CHAPTER 1

Introduction

Classes of Data-Analytic Problems Considered in This Book

This book focuses on several classes of data-analytic problems that are described below.

Problem class I (classification): Build computational classification models (or “classifiers”) that assign objects (e.g., patients/samples) into two or more classes. A classifier can be used for differential diagnosis, outcome prediction, and other classification tasks. Figure 1.1 illustrates an example classifier, which is a decision support system to diagnose primary and metastatic cancers from gene expression profiles of patients with lung cancer.1

Figure 1.1. Patient with lung cancer -> biopsy -> gene expression profile -> classifier model -> “Primary Lung Cancer” or “Metastatic Lung Cancer”.
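To make the idea of a classifier concrete, here is a minimal sketch in Python. It uses a nearest-centroid rule as a simple stand-in, not an SVM, and the expression values, gene count, and function names are hypothetical and chosen only for illustration.

```python
from math import dist

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train_nearest_centroid(samples, labels):
    """Return one centroid per class label."""
    by_class = {}
    for vector, label in zip(samples, labels):
        by_class.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in by_class.items()}

def classify(model, vector):
    """Assign the vector to the class whose centroid is closest."""
    return min(model, key=lambda label: dist(model[label], vector))

# Hypothetical expression levels of three genes for four patients.
profiles = [[2.0, 0.5, 1.0], [2.2, 0.4, 1.1], [0.3, 1.9, 1.0], [0.2, 2.1, 0.9]]
diagnoses = ["primary", "primary", "metastatic", "metastatic"]

model = train_nearest_centroid(profiles, diagnoses)
print(classify(model, [1.9, 0.6, 1.0]))  # prints "primary"
```

Whatever rule sits inside, the output is always one of a fixed set of discrete classes, which is what makes the model a classifier.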

1 The use of SVMs has grown with the genomic era, and in this book we use many examples drawn from microarray gene expression analysis. Genes are the basic elements of the genome, stored in the cell nuclei as DNA (Deoxyribonucleic Acid), which encodes the program of all living organisms. Genes are selectively activated during development and, after maturity of an organism, perform a variety of vital functions. The advent of microarrays has revolutionized genomics and opened the doors to new ways of studying gene regulation and diagnosing complex diseases, including cancers. Microarrays now record the expression of tens of thousands of genes simultaneously. A patient may thus be associated with a vector of thousands of gene expression coefficients representing his/her gene activation status, which is characteristic of his/her health condition. This numeric representation allows the computational methods described in this book to automatically perform diagnosis, prognosis, and other functions.


Problem: Which of the model(s) in Figure 1.2 (a), (b), or (c) is/are not classifiers?

Figure 1.2. a) Article -> model -> “Relevant to clinical trials” or “Irrelevant to clinical trials”. b) Patient -> blood sample -> mass spectrometry proteomics profile -> model -> “Will respond to treatment” or “Will not respond to treatment”. c) Patient -> biopsy -> gene expression profile -> model -> “Will survive for 3.5 years”. The figure also carries the output label “Cannot make decision”.

Answer: The model in Figure 1.2(c) is not a classifier because it does not assign objects into discrete classes.

Problem class II (regression): Build computational regression models to predict values of some continuous response variable. Regression models can be used to predict patient survival, length of stay in the hospital, laboratory test values, etc. Figure 1.3 illustrates an example regression model, which is a decision support system to predict optimal dosage of a drug of choice to be administered to the patient. This dosage is determined by the values of patient biomarkers, and clinical and demographics data.

Figure 1.3. Patient -> biomarkers, clinical and demographics data (a vector of numeric values) -> regression model -> “Optimal dosage is 5 IU/Kg/week”.
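As a contrast with classification, the sketch below fits a regression model whose output is a continuous number. It uses ordinary least squares with a single predictor as a simple stand-in; the biomarker and dosage values are hypothetical and were constructed to lie exactly on the line dosage = 2 * biomarker + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: one biomarker value and the dosage that worked.
biomarker = [1.0, 2.0, 3.0, 4.0]
dosage = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(biomarker, dosage)

# Predict the dosage for a new patient with biomarker value 2.5.
predicted = slope * 2.5 + intercept
print(slope, intercept, predicted)  # 2.0 1.0 6.0
```

The prediction can take any value on a continuous scale, unlike the class labels returned by a classifier.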




Problem: What type of model is shown in Figure 1.2(c)?

Answer: This is a regression model; it predicts a continuous value (time) of patient survival.

Problem class III (feature/variable selection): Out of all measured variables in the dataset, select the smallest subset of variables that is necessary for accurate prediction (classification or regression) of some response variable of interest (e.g., phenotypic response variable). Figure 1.4 gives a graphical illustration of the problem of finding the most compact set of breast cancer biomarkers from microarray gene expression data for 20,000 genes. The figure shows a heat-map (matrix) of gene expression coefficients rendered with the following color coding: green values represent under-expressed genes and red values over-expressed genes. Each column of the matrix represents a gene and each row a patient sample. The problem is to select the smallest subset of genes that cleanly separates normal from cancer patients (this is called “gene selection”). Notice that none of the genes outside the yellow box can accurately classify breast cancer. However, any single gene in the yellow box classifies breast cancer with 100% accuracy and thus solves the current problem.

Figure 1.4. Heat-map of 20,000 genes measured by a microarray, with samples grouped into breast cancer tissues and normal tissues.
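The gene selection problem above can be sketched in a few lines of Python. The function below looks for single genes that separate the two classes perfectly with one threshold, which mirrors the genes inside the yellow box. The small expression matrix and the function names are hypothetical, and real feature selection methods are considerably more involved.

```python
def separates_perfectly(values, labels):
    """True if one threshold on this variable splits the two classes without error."""
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    return max(negatives) < min(positives) or max(positives) < min(negatives)

def perfect_single_genes(matrix, labels):
    """Indices of the columns (genes) that classify every sample correctly on their own."""
    columns = list(zip(*matrix))
    return [j for j, column in enumerate(columns) if separates_perfectly(column, labels)]

# Hypothetical data: rows are patient samples, columns are genes.
expression = [
    [0.9, 0.2, 0.5],
    [0.8, 0.7, 0.4],
    [0.1, 0.3, 0.6],
    [0.2, 0.8, 0.5],
]
is_cancer = [1, 1, 0, 0]

print(perfect_single_genes(expression, is_cancer))  # prints [0]
```

Here only the first gene qualifies, so the smallest sufficient subset contains that one gene.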

Problem: Consider a dataset with 4 variables: systolic blood pressure, favorite newspaper, first letter of the last name, and age of the pet. What will be the smallest subset of variables that is necessary for accurate prediction of hypertension?

Answer: Systolic blood pressure.




Problem class IV (novelty detection): Build a computational model to identify novel or outlier objects (e.g., patients/samples). For example, such a model can be used to discover deviations in sample handling protocol when doing quality control of assays. Figure 1.5 shows a more humorous illustration of novelty detection. The problem here is to build a decision support system to identify aliens.

Figure 1.5
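A minimal sketch of novelty detection for the assay quality-control example: learn what "normal" looks like from reference measurements, then flag anything that deviates too far. A simple z-score rule stands in for a real model here, and the measurements, threshold, and function names are hypothetical.

```python
from statistics import mean, stdev

def fit_reference(values):
    """Summarize the reference (normal) measurements by their mean and spread."""
    return mean(values), stdev(values)

def is_novel(value, reference, threshold=3.0):
    """Flag a value that lies more than `threshold` standard deviations from the mean."""
    center, spread = reference
    return abs(value - center) / spread > threshold

# Hypothetical control measurements from assay runs that followed the protocol.
reference_runs = [10.1, 9.9, 10.0, 10.2, 9.8]
reference = fit_reference(reference_runs)

print(is_novel(10.05, reference))  # False: consistent with the reference runs
print(is_novel(12.0, reference))   # True: flagged as a possible protocol deviation
```

Note that the model is built from normal examples only; it never sees labeled outliers during fitting.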

Problem class V (clustering): Group objects (e.g., patients/samples) into several clusters based on their similarity. Figure 1.6 gives a graphical illustration of the clustering of brain tumor patients into 4 clusters based on their gene expression profiles. The figure shows a heat-map (matrix) of gene expression coefficients rendered with the following color coding: green values represent under-expressed genes and red values over-expressed genes. Each column of the matrix represents a gene and each row a patient sample. All patients have the same pathological type of the disease, and clustering defines four new disease subtypes (corresponding to the clusters shown). These subtypes may have different characteristics in terms of patient survival and time to recurrence after treatment.




Figure 1.6. Heat-map of gene expression profiles of brain tumor patients grouped into Cluster #1, Cluster #2, Cluster #3, and Cluster #4.
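Grouping by similarity can be sketched with a tiny two-cluster version of the k-means procedure, used here as a simple stand-in for the clustering method behind the figure. The data are hypothetical expression values of a single gene, and the choice of the minimum and maximum as starting centers is an assumption made to keep the example deterministic.

```python
def two_means(values, iterations=10):
    """Split one-dimensional values into two clusters around two moving centers."""
    centers = [min(values), max(values)]
    assignment = []
    for _ in range(iterations):
        # Assign every value to its nearest center.
        assignment = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values
        ]
        # Move each center to the mean of the values assigned to it.
        for k in (0, 1):
            members = [v for v, a in zip(values, assignment) if a == k]
            if members:
                centers[k] = sum(members) / len(members)
    return assignment, centers

# Hypothetical expression levels of one gene in six patients.
expression = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]

assignment, centers = two_means(expression)
print(assignment)  # prints [0, 0, 0, 1, 1, 1]
```

No class labels are supplied; the two groups emerge only from how close the values are to each other.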

Problem: The dataset contains the age of 20 patients admitted to the emergency room this morning. The patients’ ages were recorded as “child”, “adult”, and “senior”.
a) How should one cluster this data to identify patients who should be seen by the geriatrician?
b) How should one cluster this data to identify patients who should be seen by the pediatrician?

Answer: There is no universally optimal way to cluster this data. It depends on the intended application.
a) To identify patients who should be seen by the geriatrician, one can group children & adults in one cluster and keep seniors in the other one.
b) To identify patients who should be seen by the pediatrician, one can group adults & seniors in one cluster and keep children in the other one.




Basic Principles of Classification

Let us consider basic principles of classification. Imagine a situation where one would like to classify objects as boats and houses from the picture shown in Figure 1.7.

Figure 1.7

One simple way to do it is to say that all objects before the coast line are boats and all objects after the coast line are houses. In this case, the coast line serves as a decision surface that separates two classes. Consider Figure 1.8 where the decision surface is shown in yellow, boats are shown inside red circles and houses are shown inside green squares.

Obviously such classification is not ideal and can lead to misclassifications. For example, if there are boats that are located on the shore, they will be misclassified as houses (see Figure 1.9).




Figure 1.8

Figure 1.9. These boats will be misclassified as houses.




Problem: What will be the classification of this house that is located before the coast line?

Figure 1.10

Answer: It will be classified as a boat because it is before the coast line.
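The coast-line rule and its failure cases can be written down directly. In this sketch each object is reduced to a single hypothetical number, its position along an axis that crosses the coast line, and the coast line itself sits at an arbitrary assumed position of 100.

```python
COAST_LINE = 100.0  # hypothetical position of the coast line

def classify_by_coast_line(position, coast_line=COAST_LINE):
    """Objects before the coast line are boats; objects after it are houses."""
    return "boat" if position < coast_line else "house"

# (description, hypothetical position, true class)
objects = [
    ("boat at sea", 40.0, "boat"),
    ("house inland", 160.0, "house"),
    ("boat pulled onto the shore", 110.0, "boat"),
    ("house before the coast line", 90.0, "house"),
]

def misclassified(items):
    """Descriptions of the objects whose predicted class differs from the true class."""
    return [name for name, position, truth in items
            if classify_by_coast_line(position) != truth]

print(misclassified(objects))
# prints ['boat pulled onto the shore', 'house before the coast line']
```

The rule uses only which side of the decision surface an object falls on, so it gets both of the unusual objects wrong.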

The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the example above. First, all objects (boats and houses) are represented geometrically, for example see Figure 1.11. In this example, the horizontal axis is longitude and the vertical one is latitude. The details of this representation are provided in Chapter 2.

Then the algorithm seeks to find a decision surface that separates classes of objects (Figure 1.12). The example decision surface is shown in yellow.




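Figure 1.11. Boats and houses plotted by longitude (horizontal axis) and latitude (vertical axis).

Figure 1.12. Boats and houses plotted by longitude and latitude, with a decision surface separating the two classes.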




The coast line is one possible decision surface; however, there are infinitely many other decision surfaces to separate these two classes of objects without errors. Two more decision surfaces are shown in Figure 1.13 in blue and magenta, respectively.

Once we have defined a decision surface, we can use the following rule for classification: unseen (new) objects are classified as “boats” if they fall below the decision surface and otherwise as “houses” (Figure 1.14). The decision surface together with the above classification rule defines the classification model.

Figure 1.13. Boats and houses plotted by longitude and latitude, with additional decision surfaces.




Figure 1.14. Unseen objects (marked “?”) plotted by longitude and latitude, with the labels “These objects are classified as boats” and “These objects are classified as houses”.

Problem: What will be the classification of the three objects shown in Figure 1.15 below?

Figure 1.15. Object #1, Object #2, and Object #3 plotted by longitude and latitude.

Answer: Object #1 will be classified as a house (it is above the decision surface), object #2 will be classified as a boat (it is below the decision surface), and object #3 will be classified as a boat (it is below the decision surface).
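The classification rule can be sketched for a straight-line decision surface in the longitude/latitude plane. The slope, intercept, and object coordinates below are hypothetical; they were picked so that the three objects land on the same sides of the surface as in the answer above.

```python
def make_classifier(slope, intercept):
    """Build a classifier from the line latitude = slope * longitude + intercept."""
    def classify(longitude, latitude):
        surface_latitude = slope * longitude + intercept
        # Below the decision surface: boat. Otherwise: house.
        return "boat" if latitude < surface_latitude else "house"
    return classify

# Hypothetical decision surface and hypothetical coordinates of the three objects.
classify = make_classifier(slope=0.5, intercept=1.0)
unseen = {"Object #1": (2.0, 3.0), "Object #2": (4.0, 1.0), "Object #3": (6.0, 3.5)}

for name, (longitude, latitude) in unseen.items():
    print(name, classify(longitude, latitude))
# Object #1 house
# Object #2 boat
# Object #3 boat
```

The pair (decision surface, rule) is the whole model: once the line is fixed, classifying a new object takes one comparison.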




Main Ideas of the Support Vector Machine (SVM) Classification Algorithm

In this book we focus on a family of classification algorithms (i.e., they solve problems of class I) known as Support Vector Machines (SVMs). Extensions of the SVM algorithm can be applied to solve problems of classes II-V.

Consider a dataset shown in Figure 1.16. This graph is a representation (scatter plot) of the health status of subjects in a two-dimensional space. Each coordinate represents a gene expression level. The symbols (stars and circles) represent the two categories of subjects. We want to build a classifier to differentiate cancer patients from normal subjects based on two genes X and Y.

Figure 1.16. Scatter plot of cancer patients and normal subjects by expression of Gene X and Gene Y.

Support vector machines seek a linear decision surface (e.g., a line in a two-dimensional space) that can separate classes of objects and has the largest distance (or largest “gap” or “margin”) between border-line objects (that are also called “support vectors”). See Figure 1.17 for an example of several linear decision surfaces that can separate classes of objects (shown as lines of different colors); an infinite number of such decision surfaces exist in this data. See Figure 1.18 for an example of a linear decision surface that can separate classes of objects and also has the largest gap between support vectors (border-line objects); only one such decision surface exists. The support vectors are the three objects (one normal subject and two patients with cancer) that are shown with yellow highlighting in the figure.





Figure 1.17

Figure 1.18
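The notion of the "gap" can be made concrete by measuring it. The sketch below computes, for each of several candidate lines w1*x + w2*y + b = 0, the distance from the line to the closest correctly classified object, and keeps the candidate with the largest such distance. It only compares a handful of hand-picked lines and does not solve the optimization problem that an SVM solves; the data points and candidate lines are hypothetical.

```python
from math import hypot

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any object to the line; negative if the line misclassifies."""
    norm = hypot(*w)
    return min(
        y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in zip(points, labels)
    )

# Hypothetical expression levels (Gene X, Gene Y); +1 = cancer patient, -1 = normal subject.
points = [(3.0, 3.0), (4.0, 3.0), (1.0, 1.0), (0.0, 1.0)]
labels = [+1, +1, -1, -1]

# Three hypothetical candidate decision surfaces, each given as (w, b).
candidates = {
    "A": ((1.0, 1.0), -4.0),  # the line x + y = 4
    "B": ((1.0, 0.0), -2.0),  # the line x = 2
    "C": ((0.0, 1.0), -2.0),  # the line y = 2
}

margins = {
    name: geometric_margin(w, b, points, labels) for name, (w, b) in candidates.items()
}
best = max(margins, key=margins.get)
print(best, round(margins[best], 3))  # prints: A 1.414
```

All three lines separate the classes without error, but they leave different gaps. For line A, the objects at (3, 3) and (1, 1) sit exactly at the margin distance, so they play the role of the border-line objects.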




Problem: Which of the decision surfaces in Figure 1.19 a) or b) are linear and have maximum gap between border-line objects?

Figure 1.19. Panels a) and b).

Answer: None. The decision surface in a) does not have maximum gap, and the one in b) is non-linear.

Now consider another dataset shown in Figure 1.20. Obviously, there is no linear decision surface that separates normal subjects from cancer patients on the basis of genes X and Y.

What SVMs do in such cases is “map” the data into a higher dimensional space (often much higher) known as the “feature space”, where a separating linear decision surface exists and can be determined. The feature space results from a clever mathematical construction known as the “kernel trick” (see Figure 1.21).




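Figure 1.20. Scatter plot of cancer and normal subjects by expression of Gene X and Gene Y.

Figure 1.21. A kernel maps the data from the input space (Gene X, Gene Y) into a feature space, where a linear decision surface separates cancer from normal subjects.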
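A small sketch may help to see what the mapping buys. It uses one particular, commonly cited feature map for two-dimensional data, phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), together with the kernel k(x, z) = (x . z)^2, and checks two things: that the kernel equals the dot product in the feature space, and that objects arranged in concentric rings (not linearly separable in the plane) are separated by a linear surface after mapping. The data points and the chosen surface are hypothetical.

```python
from math import isclose, sqrt

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def feature_map(x):
    """Map a 2-D point into a 3-D feature space."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return dot(x, z) ** 2

# The kernel value equals the dot product of the mapped points.
x, z = (1.0, 2.0), (3.0, -1.0)
print(isclose(poly_kernel(x, z), dot(feature_map(x), feature_map(z))))  # True

def classify_in_feature_space(point):
    """Linear rule in feature space: w . phi(x) + b with w = (1, 0, 1) and b = -1."""
    score = dot((1.0, 0.0, 1.0), feature_map(point)) - 1.0
    return "cancer" if score > 0 else "normal"

# Hypothetical data: normal subjects near the origin, cancer patients farther out.
normal = [(0.5, 0.0), (0.0, -0.5), (0.3, 0.3)]
cancer = [(2.0, 0.0), (0.0, -2.0), (1.5, 1.5)]
print([classify_in_feature_space(p) for p in normal + cancer])
# ['normal', 'normal', 'normal', 'cancer', 'cancer', 'cancer']
```

Because the kernel gives the feature-space dot product without ever constructing the mapped vectors, an algorithm that only needs dot products can work in a very high dimensional feature space at low cost.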

The above examples were concerned with two-dimensional data (described by two genes); however, SVMs are particularly helpful with data that has many thousands of dimensions (e.g., genes, proteins, SNPs, etc.) when the analyst cannot make such intuitive drawings.




History of SVMs and Their Use in the Literature

Support vector machines have a long history of development starting from the early 1950s. In 1950 Aronszajn introduced the theory of reproducing kernels, which broadly constitutes a theoretical basis of support vector machines (Aronszajn, 1950). The next milestone was the invention by Rosenblatt of a linear classifier called the Perceptron in 1957 (Rosenblatt, 1962). Then in 1963 Vapnik and Lerner introduced the Generalized Portrait algorithm, a particular case of which computes the optimal margin linear classifier, the linear version of SVMs (Vapnik and Lerner, 1963). In 1964 Aizerman, Braverman and Rozonoer introduced the geometrical interpretation of kernels as inner products in a feature space (Aizerman et al., 1964). They formally proved the duality (equivalence) of Perceptrons and Potential functions (a particular class of radial basis function), which was later referred to as the “kernel trick”. This interpretation is a key component of non-linear SVMs. In 1965 Cover discussed large margin hyperplanes and also talked about data sparseness, both of which are fundamental aspects of SVMs (Cover, 1965). Then in 1968 Smith introduced the notion of slack variables to deal with noisy and linearly non-separable data (Smith, 1968). This was a very important milestone because slack variables are instrumental in soft-margin SVMs. In 1974–1979 Vapnik and Chervonenkis laid the foundation of learning theory, which gives strong theoretical backing to SVMs and explains their superior classification performance (Vapnik, 1979; Vapnik and Chervonenkis, 1974). In 1975 Poggio proposed the polynomial kernel that is much used in SVMs (Poggio, 1975). In 1990 Wahba advanced the field of kernel methods for regression and contributed the celebrated “Representer Theorem”, which states that regularized risk functionals in a reproducing kernel Hilbert space (RKHS) admit solutions that are linear combinations of kernel functions of the training examples (Wahba, 1990). In 1990 Poggio and Girosi established connections between kernel regression methods and neural networks (Poggio and Girosi, 1990a; Poggio and Girosi, 1990b). SVMs in their modern form were first introduced in the milestone work by Boser, Guyon and Vapnik in 1992 (Boser et al., 1992). Specifically, Boser, Guyon and Vapnik extended the optimal margin linear classifier that was proposed by Vapnik in 1963 to non-linear cases. They suggested a way to create non-linear classifiers by applying the kernel trick (originally proposed by Aizerman et al. in 1964) to maximum-margin classifiers. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function, and it is what is now known as the SVM.2 In 1995 Cortes and Vapnik introduced the soft-margin version of SVM classifiers, which use slack variables and are suitable for noisy and linearly non-separable data (Cortes and Vapnik, 1995).

2 The group of Vapnik and that of Aizerman both worked in the sixties in the same institution in Moscow, but it took another 30 years to put together the two algorithms they proposed and give birth to the modern SVMs!




Figure 1.22. Use of Support Vector Machines in the Literature (number of publications per year).

Year    General sciences    Biomedicine
1998    359                 4
1999    621                 12
2000    906                 46
2001    1,430               99
2002    2,330               201
2003    3,530               351
2004    4,950               521
2005    6,660               726
2006    8,180               917
2007    8,860               1,190

Figure 1.22 shows the number of publications that reported the use of support vector machines in biomedicine and general sciences. The data was obtained from the Google Scholar system (http://scholar.google.com/) on 10/23/2009 using the query “support vector machines” OR “support vector machine” OR “support vector classifier” OR “support vector classifiers” OR “support vector classification” to retrieve relevant publications. For biomedical publications, we used the subject categories “Biology, Life Sciences, and Environmental Science” and “Medicine, Pharmacology, and Veterinary Science”. For general sciences, we used the remaining subject categories. Figure 1.23 presents similar statistics for the use of linear regression (querying “linear regression”). As can be seen from Figure 1.22, the use of SVMs is increasing significantly from year to year in both biomedicine and general sciences; however, general sciences utilize support vector machines much more than biomedicine. On the other hand, an established and mathematically simpler method such as linear regression is used to more or less the same extent in biomedicine and general sciences, and its use is not increasing significantly over time (Figure 1.23). In our experience of teaching and applying SVMs, the relatively low adoption rate of these methods in biomedicine can be attributed in large part to a lack of technical background among biomedical researchers, which impedes grasping both the theory and the applications of SVMs. That is why the primary purpose of this book is to introduce SVMs and their extensions in an accessible manner, so that biomedical researchers can understand and apply these important methods to real-life problems.
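The comparison drawn from Figure 1.22 can be checked with a little arithmetic on the data labels of that figure. The sketch below assumes that the larger series belongs to general sciences and the smaller one to biomedicine, which is how the text above describes them; the variable and function names are illustrative.

```python
years = list(range(1998, 2008))

# Data labels from Figure 1.22; the series assignment is an assumption (see above).
general_sciences = [359, 621, 906, 1430, 2330, 3530, 4950, 6660, 8180, 8860]
biomedicine = [4, 12, 46, 99, 201, 351, 521, 726, 917, 1190]

def growth_factor(series):
    """How many times larger the last yearly count is than the first."""
    return series[-1] / series[0]

# Publications in general sciences per publication in biomedicine, year by year.
ratios = {year: g / b for year, g, b in zip(years, general_sciences, biomedicine)}

print(round(growth_factor(general_sciences), 1))  # 24.7
print(round(growth_factor(biomedicine), 1))       # 297.5
print(round(ratios[2007], 1))                     # 7.4
```

Under these assumptions, both fields grew substantially over the decade, and in 2007 general sciences still reported roughly seven times as many SVM publications as biomedicine.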




Figure 1.23. Use of Linear Regression in the Literature (number of publications per year, 1998 to 2007; series: general sciences and biomedicine). Data labels of the two plotted series, in the order printed:
9,770; 10,800; 12,000; 13,500; 14,900; 16,000; 17,700; 19,500; 20,000; 19,600
14,900; 15,500; 19,200; 18,700; 19,100; 22,200; 24,100; 20,100; 17,700; 18,300
