
Visualization of Support Vector Machines with Unsupervised Learning

Lutz Hamel
Department of Computer Science and Statistics
University of Rhode Island, Kingston, RI 02881

lutz@inductive-reasoning.com

Abstract - The visualization of support vector machines in realistic settings is a difficult problem due to the high dimensionality of the typical datasets involved. However, such visualizations usually aid the understanding of the model and the underlying processes, especially in the biosciences. Here we propose a novel visualization technique of support vector machines based on unsupervised learning, specifically self-organizing maps. Conceptually, self-organizing maps can be thought of as neural networks that investigate a high-dimensional data space for clusters of data points and then project the clusters onto a two-dimensional map, preserving the topologies of the original clusters as much as possible. This allows for the visualization of high-dimensional datasets together with their support vector models. With this technique we investigate a number of support vector machine visualization scenarios based on real world biomedical datasets.

I. INTRODUCTION

Support vector machines represent a powerful new machine learning paradigm introduced in the early 1990's by Vapnik [1] based on kernel functions. Although these algorithms exhibit excellent machine learning performance, it is often difficult to obtain an intuitive understanding of the induced model¹, since support vector machines do not share the same transparency that decision trees [2] or inductive logic programming models [3] possess. However, an intuitive understanding of the obtained classifier is important, particularly in the biosciences, not only for the validation of the model but also to deepen the insight into the underlying biological processes.

Support vector machine models consist of points in data space (the "support vectors") that identify a separating hyperplane between classes. The task of the support vector machine algorithm is to identify such a set of support vectors in a given dataset. Understanding where the support vectors are located with respect to the overall dataset and the kind of decision surface they induce provides substantial insight into the model.

Visualization of support vector models is a difficult problem due to the high dimensionality of the typical dataset.

¹ Here we only consider support vector machine classification.

Here we propose a visualization technique of support vector machines that makes use of unsupervised learning in order to compute an appropriate visualization of the given dataset together with the support vector machine model. More specifically, we use self-organizing maps [4] to visualize the data and the support vector model.

Conceptually, self-organizing maps can be thought of as neural networks that investigate a high-dimensional data space for clusters of data points and then project the clusters onto a two-dimensional map, preserving the topologies of the original clusters as much as possible. In the simplest case, where we have two linearly separable clusters in high-dimensional space, each representing a different class, we would expect the self-organizing map to project the clusters onto the map with the support vectors and the discriminating hyperplane appropriately placed between them. Section IV.A describes this baseline experiment with a three-dimensional dataset.

It is interesting to observe that our visualization naturally guides further analysis of a given support vector model. For instance, when we observe simple clusters with smooth decision boundaries between them on the projected map for high-dimensional data, we can infer that a lower dimensional subspace exists that allows for the near perfect discrimination between the classes. In this case, a dimension reduction or feature selection on the original dataset would seem appropriate to simplify the support vector model. We show that the support vector model of the original, high-dimensional data can be used to suggest how to approach this dimension reduction [5]. The converse is true as well; if we observe an intricate cluster structure with complicated decision boundaries, we can assume that a high-dimensional subspace is necessary in order to be able to discriminate between the various classes. This in turn implies that we need a complex support vector model to describe the induced decision surface.

The approach to the visualization of support vector machines proposed here exhibits two major advantages over existing techniques: (1) No preprocessing of the data is necessary for the visualization, that is, we neither need to perform feature selection nor do we have to guess which features to choose for the display. (2) It seems that the projection of high-dimensional data onto a two-dimensional map can provide a "big picture" overview of the support vector decision surface not possible with other visualization approaches. In some sense we can say that our approach is decision boundary oriented whereas existing techniques are feature oriented.


We have implemented a prototype that produces PDF maps of the dataset projections and the visualization of the support vector models.

The remainder of the paper is structured as follows: Section II introduces some basic notions of support vector machines. Section III briefly introduces self-organizing maps. Section IV describes various visualization experiments; most notably we describe three experiments with biomedical data. Section V discusses related work in more detail. We conclude the paper with final remarks and notes on further research in Section VI.

II. SUPPORT VECTOR MACHINES

Support vector machines were introduced in the early 1990's as a new breed of classification algorithms based on optimal margins [1, 6, 7]. Conceptually, especially in the case of two linearly separable classes, support vector machines are fairly straightforward: find a hyperplane that separates the two classes in such a way that the hyperplane is equidistant from both clusters. In technical jargon, we aim to compute a hyperplane that maximizes the margin between the two classes.

Fig. 1: Optimal margin hyperplane separating two classes.

Fig. 1 shows an optimal margin hyperplane separating two classes. We can construct a classifier f based on this hyperplane that can classify any unknown point x,

    f(x) = label(w · x + b),   (1)

where

    label(n) = ball      if n ≥ 0,
               triangle  if n < 0.   (2)

Here w is the normal vector of the given hyperplane, w · x represents the dot product between the normal w and some point x, and b is the intercept. Instead of assigning symbolic labels to the classes it is usual to assign the numeric labels +1 and -1 in such a way that the +1 label is assigned to the class being pointed to by w,

    label(n) = +1  if n ≥ 0,
               -1  if n < 0,   (3)

simplifying the underlying mathematics.

Notice that in Fig. 1 the margin is limited by three data points. These three points fully define the margin and with it the orientation of the decision surface. These three points are called the support vectors. The goal of training a support vector machine with a given dataset is to identify such support vectors and with them the optimal decision surface. The training algorithm can be stated as a quadratic programming problem,

    min_α  (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} y_i y_j α_i α_j k(x_i, x_j)  -  Σ_{i=1}^{m} α_i,
    s.t.   Σ_{i=1}^{m} y_i α_i = 0,
           α_i ≥ 0,  i = 1, ..., m,   (4)

where y_i represents the numeric class label for data point x_i and α_i is the support vector coefficient for data point x_i. The indices i and j range over all m points of a dataset. In general, the value of α_i is 0 for any point x_i not considered a support vector. A number of well established algorithms exist to optimize (4), e.g. [8, 9].

Perhaps the most interesting feature of (4) is the use of the kernel function (or just kernel) k. When applying support vector machines to particular datasets the user typically has a choice of which kernel to use. Choices for kernel functions include the linear kernel, which is just the dot product in data space,

    k(x, z) = x · z,   (5)

the polynomial kernel of degree d,

    k(x, z) = (x · z + 1)^d,   (6)

and the gaussian kernel,

    k(x, z) = exp( -||x - z||^2 / (2σ^2) ),   (7)

where σ is a free parameter. Here, the construction ||x - z|| typically represents the Euclidean distance between the vectors x and z. The polynomial and gaussian kernels allow for the construction of non-linear hypersurfaces for the separation of the classes in data space.

In the case that none of the induced models can separate the classes perfectly, support vector machines allow for models to admit certain modeling exceptions. At the theoretical level this is accomplished via slack variables. At the algorithmic level the only thing that changes is that the support vector coefficients are further constrained compared to (4),

    min_α  (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} y_i y_j α_i α_j k(x_i, x_j)  -  Σ_{i=1}^{m} α_i,
    s.t.   Σ_{i=1}^{m} y_i α_i = 0,
           0 ≤ α_i ≤ C,  i = 1, ..., m,   (8)

to reflect the fact that some data points have to be modeled by making them explicit support vectors that do not necessarily support the decision surface directly. Without this further constraint value C the support vector coefficients for these modeling exceptions would grow to infinity. The value of C is a user defined constant at model construction time and is called the "cost" value.
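As a hedged illustration of (4)-(8), and not of the prototype used in this paper, the following scikit-learn sketch fits soft-margin support vector machines with the linear, polynomial, and gaussian kernels of (5)-(7) on arbitrary synthetic data and reads off the support vectors and their coefficients y_i·α_i. The data, the cost C, and the kernel parameters are all assumptions of the sketch.

```python
# Illustrative sketch only (not the paper's prototype): fit soft-margin SVMs
# with the linear, polynomial, and gaussian (RBF) kernels of (5)-(7) and
# inspect the resulting support vectors and their coefficients y_i * alpha_i.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two arbitrary gaussian blobs standing in for two classes.
X = np.vstack([rng.normal(0.0, 1.0, (20, 3)), rng.normal(4.0, 1.0, (20, 3))])
y = np.array([-1] * 20 + [+1] * 20)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3, "coef0": 1.0}),
                       ("rbf", {"gamma": 0.5})]:          # gamma plays the role of 1/(2*sigma^2) in (7)
    model = SVC(kernel=kernel, C=1.0, **params).fit(X, y)  # C is the "cost" value of (8)
    # dual_coef_ holds y_i * alpha_i for the support vectors only.
    print(kernel,
          "support vectors:", model.n_support_.sum(),
          "y_i*alpha_i:", np.round(model.dual_coef_.ravel(), 2))
```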

We can construct a binary classifier f in this general setting given the support vector coefficients and the kernel function,

    f(x) = label( Σ_{i=1}^{m} y_i α_i k(x, x_i) + b ),   (9)

where the offset b can also be computed from the support vector coefficients. This is easily extended to the multi-class setting by using schemes such as one-against-the-rest [7].

A. Feature Selection with Linear Kernels

In general, it is difficult to ascertain how a particular support vector machine uses dataset attributes or features. That is especially true for the non-linear models produced by the polynomial and gaussian kernels. Here we outline an approach developed in [5] based on linear models. If we have a linear support vector model, then we can construct the normal w to the decision surface using the support vectors,

    w = Σ_{i=1}^{m} y_i α_i x_i.   (10)

Now, the orientation of the normal vector holds substantial information on the relative importance of the features in the dataset, that is, the larger the projected component of the normal onto a particular dimension, the higher the importance of that particular feature.

If we are dealing with a non-linear model we can approximate the non-linear model with a linear model making use of slack variables and the associated modeling exceptions in order to obtain an approximate importance ranking on the features.

III. SELF-ORGANIZING MAPS

Self-organizing maps [4] were introduced by Kohonen in 1982 and can be viewed as tools to visualize structure in high-dimensional data. Self-organizing maps are considered members of the class of unsupervised machine learning algorithms, since they do not require a predefined concept but will learn the structure of a target domain without supervision.

Typically, a self-organizing map consists of a rectangular grid of processing units. Multidimensional observations are represented as feature vectors. Each processing unit in the self-organizing map also consists of a feature vector called a reference vector or reference model. The goal of the map is to represent clusters of similar entities.

The map is trained by a regression analysis in the sense that the reference models cannot take on arbitrary values but are subject to a smoothing function called the neighborhood function. During training the values of the reference models on the map become ordered so that similar reference models are close to each other on the map and dissimilar ones are further apart from each other.

The training of the map is carried out by a sequential regression process, where t = 1, 2, ... is the step index. For each observation x(t), we first identify the index c of the reference model which represents the best match in terms of Euclidean distance by the condition,

    c = argmin_i || x(t) - m_i(t) ||,  ∀i.   (11)

Here, the index i ranges over all reference models on the map. Next, all reference models on the map are updated with the following regression rule, where model index c is the reference model index as computed in (11),

    m_i(t+1) = m_i(t) + h_ci [ x(t) - m_i(t) ],  ∀i.   (12)

Here h_ci is the neighborhood function that is defined as follows,

    h_ci = η  if || c - i || ≤ β,
           0  if || c - i || > β,   (13)

where || c - i || represents the distance between the best matching reference model c and some other reference model i on the map, β is the neighborhood distance, and η is the learning rate. It is customary to express η and β also as functions of time. This regression is usually repeated over the available observations many times during the training phase of the map.

An advantage of self-organizing maps is that they have an appealing visual representation. Fig. 2 shows animals mapped onto a self-organizing map. Each animal is described by a set of 13 features [4] such as how many legs it has, does it possess feathers, does it hunt, etc. In effect, each animal can be considered a point in thirteen-dimensional space and the map in Fig. 2 is a two-dimensional projection of this thirteen-dimensional space.

Fig. 2: Mapping animals onto a self-organizing map. Each square in the map represents a reference model.
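The update rules (11)-(13) are compact enough to sketch directly. The following NumPy sketch is only an illustration of these rules, not the SOM software used to produce the maps in this paper; the grid dimensions, learning rate, neighborhood radius, decay schedule, and the random stand-in data are all arbitrary choices of this sketch.

```python
# Minimal sketch of sequential SOM training following (11)-(13); grid size,
# learning rate, neighborhood radius, and decay schedule are arbitrary choices.
import numpy as np

def train_som(data, rows=10, cols=10, epochs=50, eta=0.5, beta=3.0, seed=0):
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    # Reference models m_i, one per grid cell, initialized inside the data range.
    models = rng.uniform(data.min(0), data.max(0), (rows, cols, dim))
    grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))
    steps = epochs * n
    for t in range(steps):
        x = data[rng.integers(n)]                        # observation x(t)
        # (11): best matching unit c by Euclidean distance.
        dists = np.linalg.norm(models - x, axis=2)
        c = np.unravel_index(np.argmin(dists), (rows, cols))
        # eta and beta shrink over time, as is customary.
        eta_t = eta * (1.0 - t / steps)
        beta_t = max(1.0, beta * (1.0 - t / steps))
        # (13): neighborhood function h_ci -- eta_t within radius beta_t, 0 outside.
        grid_dist = np.linalg.norm(grid - np.array(c), axis=2)
        h = np.where(grid_dist <= beta_t, eta_t, 0.0)[..., None]
        # (12): regression update of all reference models.
        models += h * (x - models)
    return models

# Example with random stand-in data in place of a real dataset.
som = train_som(np.random.default_rng(1).normal(size=(100, 13)))
print(som.shape)   # (10, 10, 13)
```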

Besides low quantization error, proximity is also a clustering indicator: the further apart, the more dissimilar two objects are on the map. For example, in Fig. 2 we find two major clusters, mammals and birds. Within these major clusters we can find sub-clusters such as large hooved mammals in the top right corner of the map and predatory birds in the bottom left corner.

In this paper we make use of this ability of self-organizing maps to visualize high-dimensional spaces in order to visualize high-dimensional biomedical datasets together with their support vector models.

IV. VISUALIZATION EXPERIMENTS

We discuss four experiments. The first experiment is a baseline experiment with a synthetic dataset illustrating the basic functionality of our visual approach. The remaining three datasets are biomedical datasets.

A. Baseline Visualization

The first visualization is intended to provide an overview of the functionality of our technique and can also be thought of as a baseline: we test whether the support vectors are mapped to reasonable locations on the self-organizing map. The actual dataset is given in Table 1.

TABLE 1. LINEARLY SEPARABLE SYNTHETIC DATASET.

    ID   x   y   z   Class   alpha
    1    0   0   0   A       0.00
    2    1   0   0   A       0.81
    3    0   1   0   A       1.00
    4    0   0   1   A       1.00
    5    4   0   0   B       0.90
    6    0   4   0   B       0.95
    7    0   0   4   B       0.95
    8    5   5   5   B       0.00

It is a three-dimensional dataset and a quick inspection reveals that the classes A and B are linearly separable in an optimal way by a plane that intersects the axes at (2.5, 0, 0), (0, 2.5, 0), and (0, 0, 2.5) (see Fig. 3; the balls represent class A and the squares class B). The columns of the table should be self-explanatory with the exception of the rightmost one. This column lists the α-values for each data point computed by a linear support vector machine separating the two classes A and B. Given the position of the separating hyperplane, it should be no surprise that all six data points close to the decision surface are chosen as support vectors, i.e., have an α-value > 0. In Fig. 3 the support vectors are indicated with asterisks. Conversely, the points at (0, 0, 0) and (5, 5, 5) are not chosen as support vectors; their α-values are 0.

Fig. 3: Two classes with their separating hyperplane and the corresponding support vectors (asterisks).

Fig. 4 shows the projection of the dataset in Table 1 onto a two-dimensional self-organizing map. The classes are indicated with their corresponding letters. The numbers correspond to the instance IDs given in Table 1. Finally, support vectors are again indicated via asterisks.

Fig. 4: Visualization of data points, support vectors, and decision surface on a self-organizing map.

A number of interesting observations can be made on the projected data. First, as one would expect, the linear decision surface in three-dimensional space is projected as a higher-order polynomial surface in two-dimensional space. The projected decision surface is indicated on the map with a dashed white line. Second, we can observe that the support vectors are mapped along the decision surface. Third, the data points classified by the decision surface but which themselves are not support vectors lie in appropriate regions of the map suitably demarked by support vectors. Finally, class A forms a much tighter cluster than class B, that is, the map displays a much smaller quantization error for class A than for class B. This seems self-evident given the data structure shown in Fig. 3. Here, class A forms a tight cluster around the origin and class B forms a loose cluster around the point (5, 5, 5).
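The α column of Table 1 can be reproduced in spirit with any off-the-shelf SVM library. The sketch below uses scikit-learn, not the software used for the paper; it fits a linear support vector machine to the eight points of Table 1 and prints the α-values, the support vector IDs, and the normal w of (10). The exact α-values depend on the cost C and the solver, so they need not match the table digit for digit.

```python
# Illustrative check of Table 1 with scikit-learn (not the paper's software):
# fit a linear SVM to the eight points and read off the alpha values and the
# normal w = sum_i y_i * alpha_i * x_i of (10).
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
              [4, 0, 0], [0, 4, 0], [0, 0, 4], [5, 5, 5]], dtype=float)
y = np.array([+1, +1, +1, +1, -1, -1, -1, -1])   # class A -> +1, class B -> -1

svm = SVC(kernel="linear", C=1.0).fit(X, y)

alphas = np.zeros(len(X))
alphas[svm.support_] = np.abs(svm.dual_coef_.ravel())   # dual_coef_ holds y_i * alpha_i
print("alpha values:      ", np.round(alphas, 2))
print("support vector IDs:", svm.support_ + 1)           # IDs as numbered in Table 1
print("normal w, eq. (10):", np.round((alphas * y) @ X, 2))
print("w from the library:", np.round(svm.coef_.ravel(), 2))
```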

As the last thing we investigate how the support vector machine uses the features X, Y, and Z. Using (10) from above allows us to compute the normal w of the induced decision surface given the α-values of the support vectors. The normal here computes to w = (-2.81, -2.81, -2.81), that is, all three features are of equal importance and class A is assigned the numeric label +1.

B. The Wisconsin Breast Cancer Dataset

Our next experiment is the visualization of a support vector model of the Wisconsin breast cancer dataset [10], publicly available from the UCI dataset repository [11]. The dataset contains 569 instances of fine needle cell aspirates of which 212 are malignant and 357 are benign. Each instance is described by thirty features such as cell radius, smoothness of the cell surface, and concavity, among others. It is known that in thirty-dimensional space the two classes are linearly separable and a number of techniques have been used to construct linear classifiers, e.g. [12]. Here we construct a linear support vector machine in thirty-dimensional space and then visualize this model with a self-organizing map. Fig. 5 shows this map.

Fig. 5: Visualization of the Wisconsin breast cancer data with the support vector model and the decision surface.

Unfortunately, due to space constraints we cannot show the map in its original size and therefore the labels are not very readable, but the map is divided into two regions: a malignant region (right side) and a benign region (left side). With the exception of minor irregularities right at the boundary between the two classes, the decision surface is smooth considering that we are projecting a thirty-dimensional space onto two dimensions. Also, note that the support vectors (long labels) are placed along the decision surface. There are a few exceptions and we could consider "drilling through" to the instances representing these support vectors to find out why these instances are considered outliers by the self-organizing map and if there is any biological significance to that (to be studied in a follow-on paper).

In order to provide a more concrete sense of the kind of information shown on the map in Fig. 5, Fig. 6 displays a detailed view of a section of the map right at the border between the two classes. The decision surface is indicated with the dashed line. The labels B and M indicate that benign and malignant instances were mapped to these locations, respectively. Support vectors are indicated with asterisks. The number in parentheses is the instance identifier of the support vector data points.

That the two classes can be projected so smoothly from a thirty-dimensional space onto a two-dimensional surface begs the question whether a much lower dimensional feature space might suffice for the classification. Ranking the features as outlined above, we obtain the graph given in Fig. 7. From this graph it seems that a four-dimensional feature space might be sufficient for the classification of the dataset. Additional experiments confirm that reducing the number of features does not significantly alter the map.

Fig. 7: Feature ranking of the Wisconsin breast cancer data.
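A feature ranking like the one in Fig. 7 can be sketched as follows. The code uses scikit-learn's bundled copy of the Wisconsin diagnostic breast cancer data (569 instances, thirty features) and ranks the features by the magnitude of the components of the normal w of a linear support vector machine, as in (10). This is an illustration rather than the paper's pipeline; the cost C and the absence of any preprocessing are assumptions of the sketch and will shift the exact ranking.

```python
# Sketch of a feature ranking in the spirit of Fig. 7: rank the thirty features
# of the Wisconsin diagnostic breast cancer data by |w_j|, where w is the normal
# of a linear SVM's decision surface (eq. (10)).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

data = load_breast_cancer()                      # 569 instances, 30 features (WDBC)
svm = SVC(kernel="linear", C=1.0).fit(data.data, data.target)

w = svm.coef_.ravel()                            # normal of the decision surface
ranking = np.argsort(-np.abs(w))                 # most important features first
for rank, j in enumerate(ranking[:5], start=1):
    print(rank, data.feature_names[j], round(abs(w[j]), 3))
```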


C. The AML-ALL Leukemia Dataset

Our third visualization experiment uses the AML-ALL leukemia dataset published by the Broad Institute at MIT [13]. The goal is to distinguish between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The dataset consists of 38 bone marrow samples (27 ALL and 11 AML; we are using only the training set, a test set is also available from the website), with over 7129 probes from 6817 human genes. The two classes can be separated perfectly with a linear support vector machine with cost C = 1. Fig. 8 depicts the dataset together with the support vector model and the decision surface. The AML class is at the bottom right of the map and the remainder of the map is dedicated to the ALL class. Due to space limitations the map is not very readable; an excerpt of the map is given in Fig. 9.

Fig. 9: Detail of the AML-ALL dataset visualization.

Considering that we are projecting a 7000-dimensional space onto a two-dimensional map, the decision surface is remarkably smooth, suggesting that there exists a much lower dimensional space that allows for the classification of the instances. On the other hand, due to the fact that the separating hyperplane is induced in 7000-dimensional space, we have support vectors that support the hyperplane in 7000-dimensional space but are projected onto locations very far away from the decision surface on the map.

A glance at the feature ranking (Fig. 10) confirms our suspicion: only a handful of features are ranked highly; the majority of the features has less than half the support of the highly ranked features. Performing feature selection on the original dataset with the top five features given in Fig. 10 (gene accession numbers M96326_rna1_at, M25079_s_at, Z19554_s_at, M27891_at, and Y00433_at) and recomputing the support vector model and the self-organizing map gives us Fig. 11. Here the top right of the map contains the AML class and the remainder of the map is dedicated to the ALL class. Notice that fewer support vectors are necessary for this model and also notice that the majority of the support vectors are now close to the decision surface.

D. The Pap Smear Dataset

Our final experiment is the visualization of a support vector model of a Pap smear dataset [14]. The dataset consists of 52 instances, where each instance is labeled as precancerous (Y) or not (N). The wavenumbers describing each instance represent near-infrared spectroscopy measurements ranging from 4,000 wavenumbers upward.

Fig. 12: The Pap smear dataset visualization with model and decision surface.
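The maps in Section IV were produced with the author's prototype. Purely to illustrate the mechanics of placing a support vector model on a map, the sketch below reuses the train_som function from the Section III sketch, fits a linear support vector machine to stand-in data, and looks up the best matching unit of every instance; marking the cells of the support vectors is the information a renderer would need to draw the asterisks and interpolate a decision boundary on the grid. Data, labels, and all parameters are assumptions of this sketch; paste it together with the earlier train_som sketch to run it.

```python
# Illustration of the mechanics behind the maps in Section IV (not the author's
# prototype): place every data point, and in particular every support vector,
# on a trained SOM by looking up its best matching unit, as in (11).
# Requires the train_som sketch from Section III to be in scope.
import numpy as np
from sklearn.svm import SVC

def best_matching_unit(models, x):
    """Grid coordinates of the reference model closest to x."""
    dists = np.linalg.norm(models - x, axis=2)
    return np.unravel_index(np.argmin(dists), dists.shape)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (30, 10)), rng.normal(3.0, 1.0, (30, 10))])
y = np.array([-1] * 30 + [+1] * 30)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
som = train_som(X)                                  # sketch from Section III

support = set(svm.support_)
for i, x in enumerate(X):
    row, col = best_matching_unit(som, x)
    marker = "*" if i in support else " "           # asterisks mark support vectors
    print(f"instance {i:2d} class {y[i]:+d} {marker} -> map cell ({row}, {col})")
```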

Another interesting research direction to pursue is to use manifold learning, such as locally linear embedding (e.g. [18]), instead of self-organizing maps. It would be interesting to see if in this setting the decision surfaces will be detected as distinguished structures, perhaps reducing the support vector outliers we have observed in the visualizations above.

Finally, in all of the maps above the decision surfaces are interpolated manually given the locations of the support vectors. It would greatly enhance the visualization if these interpolations could be done by machine on the projection maps.

ACKNOWLEDGEMENTS

The author would like to thank Professor Chris Brown from the Chemistry Department at the University of Rhode Island for providing access to the Pap smear dataset.

REFERENCES

[1] V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd ed., New York: Springer, 2000.
[2] L. Breiman, Classification and Regression Trees, Pacific Grove, Calif.: Wadsworth & Brooks/Cole, 1984.
[3] S. Muggleton and L. De Raedt, "Inductive Logic Programming: Theory and Methods," Journal of Logic Programming, vol. 19, pp. 629-679, 1994.
[4] T. Kohonen, Self-Organizing Maps, 3rd ed., Berlin, New York: Springer, 2001.
[5] J. Brank, M. Grobelnik, N. Milic-Frayling and D. Mladenic, "Feature selection using linear support vector machines," MSR-TR-2002-63, Microsoft Research, 2002.
[6] K. P. Bennett and C. Campbell, "Support vector machines: hype or hallelujah?" ACM SIGKDD Explorations Newsletter, vol. 2, pp. 1-13, 2000.
[7] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
[8] C. C. Chang and C. J. Lin, LIBSVM: a library for support vector machines, 2006. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[9] J. Platt, "Sequential minimal optimization: A fast algorithm for training support vector machines," in Advances in Kernel Methods - Support Vector Learning, pp. 185-208, 1999.
[10] W. Street, W. Wolberg and O. Mangasarian, "Nuclear feature extraction for breast tumor diagnosis," Proc. SPIE, vol. 1905, pp. 861-870, 1993.
[11] D. J. Newman, S. Hettich, C. L. Blake and C. J. Merz, UCI repository of machine learning databases, 1998. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[12] O. L. Mangasarian, W. N. Street and W. H. Wolberg, "Breast Cancer Diagnosis and Prognosis via Linear Programming," Oper. Res., vol. 43, pp. 570-577, 1995.
[13] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing and M. Caligiuri, "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring," Science, vol. 286, pp. 531-537, 1999.
[14] Z. Ge, C. W. Brown and H. J. Kisner, "Screening Pap Smears with Near-Infrared Spectroscopy," Appl. Spectrosc., vol. 49, pp. 432-436, 1995.
[15] F. Poulet, "SVM and graphical algorithms: a cooperative approach," in Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM 2004), pp. 499-502, 2004.
[16] D. Caragea, D. Cook and V. G. Honavar, "Gaining insights into support vector machine pattern classifiers using projection-based tour methods," in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 251-256, 2001.