Support Vector Machines and Kernel Methods: The New Generation of Learning Machines

Nello Cristianini and Bernhard Schölkopf

AI Magazine Volume 23, Number 3 (2002). Copyright © 2002, American Association for Artificial Intelligence. All rights reserved. 0738-4602-2002 / $2.00

■ Kernel methods, a new generation of learning algorithms, utilize techniques from optimization, statistics, and functional analysis to achieve maximal generality, flexibility, and performance. These algorithms differ from earlier techniques used in machine learning in many respects: For example, they are explicitly based on a theoretical model of learning rather than on loose analogies with natural learning systems or other heuristics. They come with theoretical guarantees about their performance and have a modular design that makes it possible to separately implement and analyze their components. They are not affected by the problem of local minima because their training amounts to convex optimization. In the last decade, a sizable community of theoreticians and practitioners has formed around these methods, and a number of practical applications have been realized. Although the research is not concluded, kernel methods are already considered the state of the art in several machine learning tasks. Their ease of use, theoretical appeal, and remarkable performance have made them the system of choice for many learning problems. Successful applications range from text categorization to handwriting recognition to classification of gene-expression data.

In many respects, the last few years have witnessed a paradigm shift in the area of machine learning, comparable to the one of the mid-1980s when the nearly simultaneous introduction of decision trees and neural network algorithms revolutionized the practice of pattern recognition and data mining. In just a few years, a new community has gathered, involving several thousands of researchers and engineers, a yearly workshop, web sites, and textbooks. The focus of their research: support vector machines (SVMs) and kernel methods.

Such paradigm shifts are not unheard of in the field of machine learning. Dating back at least to Alan Turing's famous article in Mind in 1950, this discipline has grown and changed with time. It has gradually become a standard piece of computer science and even of software engineering, invoked in situations where an explicit model of the data is not available but, instead, we are given many training examples. Such is the case, for example, of handwriting recognition or gene finding in biosequences.

The nonlinear revolution of the 1980s, initiated when the introduction of back-propagation networks and decision trees opened the possibility of efficiently learning nonlinear decision rules, deeply influenced the evolution of many fields and paved the way for the creation of entire disciplines, such as data mining and a significant part of bioinformatics. Until then, most data analysis was performed with linear methods essentially based on the systems developed in the 1960s, such as the PERCEPTRON. The asymmetry, however, remained: A number of optimal algorithms and theoretical results were available for learning linear dependencies from data, but for nonlinear ones, greedy algorithms, local minima, and heuristic searches were all that was known. In a way, researchers had come to accept that the theoretical elegance and practical convenience of linear systems was not achievable in the more powerful setting of nonlinear rules.


A decade later, however, kernel methods made it possible to deal with nonlinear rules in a principled yet efficient fashion, which is why we identify them with a new generation, following the linear learning machines of the 1960s and the first generation of nonlinear ones in the 1980s. The consequences of this new revolution could be even more far reaching. In many ways, kernel methods represent an evolution of the subsymbolic learning approaches such as neural networks, but in other ways, they are an entirely new family of algorithms and have more in common with statistical methods than with classical AI.

The differences with the previous approaches are worth mentioning. Most of the learning algorithms proposed in the past 20 years had been based to a large extent on heuristics or on loose analogies with natural learning systems, such as the concept of evolution, or models of nervous systems. They were mostly the result of creativity and extensive tuning by the designer, and the underlying reasons for their performance were not fully understood. A large part of the work was devoted to designing heuristics to avoid local minima in the hypothesis-search process. The new pattern-recognition algorithms overcome many such limitations by using a number of mathematical tools.

To start with, in the last decade, a general theory of learning machines has emerged and with it the possibility of analyzing existing algorithms and designing new ones to explicitly maximize their chances of success. The impact of learning theory on machine learning has been profound, and its effects have also been felt in industrial applications. Furthermore, and somewhat independently, new efficient representations of nonlinear functions have been discovered and used for the design of learning algorithms. This representation makes use of so-called kernel functions, discussed later, and has a number of useful properties.

The combination of these two elements has led to powerful algorithms whose training often amounts to convex optimization. In other words, they are free from local minima. This use of convex optimization theory, largely a consequence of the novel representation, marks a radical departure from previous greedy search algorithms and, furthermore, enables researchers to analyze general formal properties of the learning system's output.

In a way, researchers now have the power of nonlinear function learning together with the conceptual and computational convenience that was, to this point, a characteristic of linear systems.

SVMs are probably the best-known example of this class of algorithms. Introduced in 1992 at the Conference on Computational Learning Theory (Boser, Guyon, and Vapnik 1992), the SVM has since been studied, greatly generalized, and applied to a number of different problems. The general class of algorithms resulting from this process is known as kernel methods or kernel machines. They exploit the mathematical techniques mentioned earlier to achieve maximal flexibility, generality, and performance, both in terms of generalization and in terms of computational cost. They owe their name to one of the central concepts in their design: the notion of kernel functions, used in the representation of the nonlinear relations discovered in the data and discussed later.

The growing impact of this new approach in the larger field of machine learning can be gauged by looking at the number of researchers and events related to it. The research community gathered around these algorithms is very diverse, including people from classic machine learning, neural networks, statistics, optimization, and functional analysis. It meets at a yearly workshop, held at the Neural Information Processing Systems Conference for the past five years. Since the first such workshop, the growth of this field has been rapid: Textbooks have appeared (Cristianini and Shawe-Taylor 2000; Schölkopf and Smola 2002); many hundreds of papers have been published on this topic; and all the major conferences and journals in machine learning, neural networks, and pattern recognition have devoted increasing attention to it, with special issues, dedicated sessions, and tutorials. The recently launched Journal of Machine Learning Research has a regular section for kernel methods.1 An increasing number of universities teach courses entirely or partly dedicated to kernel machines.

Currently, SVMs hold records in performance benchmarks for handwritten digit recognition, text categorization, information retrieval, and time-series prediction and have become routine tools in the analysis of DNA microarray data.

In this article, we survey the main concepts in the theory of SVMs and kernel methods; their statistical and algorithmic foundations; and their application to several problems, such as text categorization, machine vision, handwriting recognition, and computational biology.


Statistical Learning Theory

The kind of learning algorithm that we are talking about can mathematically be described as a system that receives data (or observations) as input and outputs a function that can be used to predict some features of future data. In other words, it automatically builds a model of the data being observed and exploits it to make predictions. Generalization is the activity of inferring from specific examples a general rule, which also applies to new examples. Of course, building a model capable of generalizing requires detecting and exploiting regularities (or patterns) in the data.

For example, a learning system can be trained to distinguish a handwritten digit 5 from the other nine classes of digits. Such a system would initially be presented with a number of images of handwritten digits and their correct class labels, and after learning, it would be required to correctly label new, unseen images of handwritten digits.

Statistical learning theory (Vapnik 1998) models this as a function estimation problem: The function in this case maps representations of images to their correct class labels. Given some observations of this function at random positions (the training set of labeled images), the task is to produce a hypothesis that correctly labels new examples with high accuracy. The performance in predicting labels in a test set of images is known as generalization performance. Vladimir Vapnik and his coworkers at the Institute of Control Sciences of the Russian Academy of Science in Moscow pioneered this approach to statistical pattern recognition, and since then, many other researchers have refined their early results.

It turns out that the risk of a learned function making wrong predictions depends on both its performance on the training set and a measure of its complexity. In other words, it is not sufficient to accurately describe the training data; it is necessary to do so with a suitably simple hypothesis (an idea related to the well-known Occam's razor principle). For example, it is always possible to interpolate 5 points in the plane with a polynomial of, say, degree 25, but nobody expects the resulting function to have any predictive power; however, we are more likely to trust the predictions of a linear function interpolating them.
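To make this intuition concrete, the following sketch (not from the original article; it assumes the numpy library and invented toy data) compares an exactly interpolating polynomial with a simple least-squares line on five noisy samples of a linear trend. Fitting a degree-25 polynomial to five points is underdetermined, so the sketch uses the degree-4 interpolant, which makes the same point: the flexible fit reproduces the training points perfectly but extrapolates poorly.

```python
import numpy as np

# Five noisy samples of a roughly linear relationship (toy data, not from the article).
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 5)
y = 1.5 * x + rng.normal(scale=0.3, size=5)

# Degree-4 polynomial: enough parameters to pass exactly through all five points.
interpolant = np.polyfit(x, y, deg=4)
# Degree-1 polynomial: a simple least-squares line.
line = np.polyfit(x, y, deg=1)

x_new = 5.0  # a test point just outside the training range
print("degree-4 prediction:", np.polyval(interpolant, x_new))  # typically far from the trend
print("degree-1 prediction:", np.polyval(line, x_new))          # typically close to the trend
print("underlying trend   :", 1.5 * x_new)
```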

In Vapnik's theory, the key observation is that the complexity of a function is not an absolute concept but depends on the class of functions from which it was extracted. The complexity of a learning machine is measured by counting the number of possible data sets that the learning machine could perfectly explain without errors using elements of its associated function class. Now suppose the learning machine has explained the one data set at hand. Certainly, if the learning machine's capacity is high enough that it can explain all possible data sets, then it will come as no surprise that it can explain the given one. Thus, from the mathematical point of view, no generalization to new data points can be guaranteed. If, however, the learning machine explains the given data although its capacity is small, then we have reason to believe that it will also work well on new data points that it has not seen during training.

This insight is formalized in statistical learning theory, and it leads to bounds on the risk of the learned hypothesis making wrong predictions, bounds that depend on many factors, including the training sample size and the complexity of the hypothesis (see sidebar 1).
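One classical bound of this flavor, in the form popularized by Vapnik, is reproduced here purely for orientation (the exact constants and side conditions vary between statements of the theorem and are not taken from this article): with probability at least 1 − η over a random training sample of size ℓ, every function f from a class of VC dimension h satisfies

```latex
R(f) \;\le\; R_{\mathrm{emp}}(f) \;+\; \sqrt{\frac{h\left(\ln\tfrac{2\ell}{h} + 1\right) - \ln\tfrac{\eta}{4}}{\ell}}
```

where R(f) is the expected risk and R_emp(f) is the error measured on the training sample.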

The theory identifies which mathematical characteristics of learning machines control this quantity. Named after its inventors Vapnik and Chervonenkis, the most famous such characteristic is called the VC dimension. More refined notions of complexity (or capacity) have been proposed in recent years.

The resulting bounds can point algorithm designers to those features of the system that should be controlled to improve generalization. In many classical learning algorithms, the main such feature is the dimensionality of the function class or data representation being used. The class of algorithms described in this article, however, exploits very high dimensional spaces, and for this purpose, it is crucial to obtain bounds that are not affected by dimensionality. One such bound is at the basis of the support vector machine algorithm and is a reason why this algorithm does not suffer from the “curse of dimensionality.”

Although these general ideas had been around (but, to some extent, neglected) since the 1960s, the crucial development for practitioners occurred in the 1990s. It turned out that not only did this theory explain the success of several existing learning procedures, but more importantly, it enabled researchers to design entirely new learning algorithms, not motivated by any analogy, just aimed at directly optimizing the generalization performance.

Support Vector Machines

In introducing the main ideas of SVMs, our reasoning is informal, confining mathematical arguments to separate sidebars and pointing the interested readers to the references at the end of the article.

The fundamental idea of kernel methods is to embed the data into a vector space, where linear algebra and geometry can be performed. One of the simplest operations one can perform in such a space is to construct a linear separation between two classes of points. Other operations can involve clustering the data or organizing them in other ways. The use of linear machines is easier if the data are embedded in a high dimensional space.

However, it is important to embed the data into the correct space, so that the sought-after regularities can easily be detected, for example, by means of linear separators. Informally, one would like “similar” data items to be represented by nearby points in the embedding space. High dimensional spaces are often used for this purpose.

Two major problems occur when pursuing this line of action. First, on the statistical side, if the dimensionality of such a space is very high, then it is trivial to find a separation consistent with the labels in the training data. Hence, according to the learning-theoretic reasoning sketched earlier, one cannot expect this trivial separation to predict well the labels of new unseen points. For this reason, we have to incorporate ideas from statistical learning theory to rule out meaningless explanations of the data, which would lead to overfitting. We see later how this is performed in practice.

Second, on the computational side, working directly in very high dimensional spaces can be demanding, and this cost can pose a severe limit on the size of problems that can be solved with this algorithm. This problem is addressed in a rather clever way by using kernel functions, as follows.

Kernels are inner products in some space but not necessarily in the space where the input data come from. They can often be computed efficiently. By reformulating the learning algorithms in a way that only requires knowledge of the inner product between points in the embedding space, one can use kernels without explicitly performing such embedding. In other words, rather than writing the coordinates of each point in the embedding space, we directly compute the inner products between each pair of points in such space. This is sometimes called implicit mapping or implicit embedding. If we call φ the embedding function, then a kernel can be written as k(x, z) = ⟨φ(x), φ(z)⟩. Surprisingly, many learning algorithms can be formulated in this special way, that is, kernelized. Among these are many standard linear discriminant techniques, clustering procedures, regression methods such as ridge regression, and principal components analysis (PCA).
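As a concrete illustration of the identity k(x, z) = ⟨φ(x), φ(z)⟩ (this example is not from the article; it assumes numpy, and the vectors are made up), the following sketch compares a homogeneous degree-2 polynomial kernel with an explicit quadratic feature map for two-dimensional inputs:

```python
import numpy as np

def phi(v):
    """Explicit quadratic feature map for a 2-d vector: (v1^2, v2^2, sqrt(2)*v1*v2)."""
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2.0) * v[0] * v[1]])

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: the squared inner product."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The kernel value and the inner product of the explicitly embedded points coincide.
print(poly2_kernel(x, z))          # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(z)))      # same value, computed in the 3-d feature space
```

For higher degrees, or for the Gaussian kernel, the explicit feature space becomes very large or even infinite dimensional, while the kernel value remains just as cheap to compute.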

The functions learned by kernel machines can be represented as linear combinations of kernels computed at the training points:

f(x) = Σi αi yi k(xi, x) + b

In approximation theory and functional analysis, this representation was already used, although never combined with statistical learning theory results and pattern-recognition algorithms such as the ones described here. Importantly, such functions correspond to linear functions in the embedding space, and hence, powerful linear methods and analysis can be applied.

The basic algorithm we analyze deals with the problem of learning classifications into two categories. SVMs achieve this by first embedding the data into a suitable space and then separating the two classes with a hyperplane (see sidebar 2). Results from statistical learning theory show that the generalization power of a hyperplane depends on its margin, defined as the distance from the nearest data point, rather than on the dimensionality of the embedding space. This result provides the motivation for the following simple algorithm:

Embed the data into a high dimensional space and then separate it with a maximum margin hyperplane. Because the margin measures the distance of the closest training point to the hyperplane, the maximum margin hyperplane not only correctly classifies all points, it actually does so with the largest possible “safety margin.”

The decision function learned by such a machine has the form

f(x) = sign(Σi αi yi k(xi, x) + b)

where xi, yi, and k are the training points, their labels, and the kernel function, respectively, and x is a generic test point.

It is important to note that the problem of constructing the maximal margin hyperplane reduces to a quadratic programming problem, whose convex objective function can always be maximized efficiently under the given constraints. The absence of so-called local minima in the cost function marks a radical departure from the standard situation in systems such as neural networks and decision trees (to cite the most popular ones), which were forced to utilize greedy algorithms to find locally optimal solutions. An entire set of empirical tricks, which often absorbed most of the researchers' attention, can thus be replaced by a well-developed field of mathematics: convex optimization.
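The kernel expansion above can be exercised with something much simpler than the maximal margin machinery. The following sketch (not from the article) trains a kernel perceptron, a plain perceptron rewritten to use only kernel evaluations, on a toy problem that no hyperplane in the input space can solve. It is shown only to illustrate the form f(x) = sign(Σi αi yi k(xi, x)), here with the bias term omitted; numpy is assumed.

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d = x - z
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def train_kernel_perceptron(X, y, kernel, epochs=20):
    """Learn one non-negative coefficient alpha_i per training point.

    The hypothesis is f(x) = sign(sum_i alpha_i y_i k(x_i, x)); whenever a training
    point is misclassified, its coefficient is simply incremented.
    """
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            f = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n))
            if y[i] * f <= 0:
                alpha[i] += 1.0
    return alpha

def predict(X_train, y, alpha, kernel, x):
    return np.sign(sum(alpha[j] * y[j] * kernel(X_train[j], x) for j in range(len(X_train))))

# A tiny XOR-like two-class data set: not separable by any hyperplane in input space.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
alpha = train_kernel_perceptron(X, y, rbf)
print([predict(X, y, alpha, rbf, x) for x in X])  # recovers the training labels
```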

SVMs can always be trained to the optimal solution in polynomial time, which is one of the reasons for their fast uptake among practitioners in pattern recognition. However, other important features contribute to their performance.

With convex optimization concepts, it is possible to show that in the solution, only some of the α coefficients are nonzero, namely, the ones corresponding to the points lying nearest to the separating hyperplane. The other points could be removed from the training set without loss of information about the classification function. Thus, what is called sparseness of the solution is induced, which is the key to many efficient algorithmic techniques for optimization as well as to an analysis of SVM generalization performance based on the concept of data compression. Such points are called support vectors (hence the name of the entire approach).
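The sparseness of the solution is easy to observe with any off-the-shelf SVM implementation. The following sketch uses scikit-learn, a library that is not mentioned in the article and stands in here only as one convenient implementation; the data are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs; most points lie far from the boundary.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)), rng.normal(loc=+2.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="rbf", C=10.0)   # Gaussian kernel, soft-margin SVM
clf.fit(X, y)

# Sparseness: only the points nearest the decision boundary receive nonzero alpha.
print("training points  :", len(X))
print("support vectors  :", clf.support_vectors_.shape[0])
print("accuracy on train:", clf.score(X, y))
```

On data such as these, typically only a small fraction of the 100 training points are reported as support vectors; the rest could be discarded without changing the learned function.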

Another major property of this class of algorithms is their modularity: The maximal margin algorithm is independent of the kernel being used, and vice versa. Domain knowledge and other techniques dictate the choice of kernel function, after which the same margin-maximizing module is used. It is important to note that the maximal margin hyperplane in the embedding space will usually correspond to a nonlinear boundary in the input space, as illustrated by figure 1.

The high generalization power of large margin classifiers is the result of the explicit optimization of statistical bounds controlling the risk of “overfitting.” The choice of the kernel function is the only major design choice. However, this choice can be guided by domain expertise, keeping in mind that a kernel can be regarded as a similarity measure and that often domain experts already know effective similarity measures between data points. Such has been the case, for example, in the field of text categorization, where a kernel inspired by information-retrieval techniques has successfully been used (Joachims 1998). Domain-specific kernels are also increasingly used in bioinformatics.

The maximal margin hyperplane is, however, not the only classification algorithm to be used in combination with kernels. For example, the well-known Fisher discriminant procedure and methods motivated by the Bayesian approach to statistics and machine learning have been combined with kernels, leading to other interesting nonlinear algorithms whose training often amounts to convex optimization.

Figure 1. A Simple Two-Dimensional Classification Problem (Separate Balls from Circles), with Three Possible Solutions of Increasing Complexity. The first solution, chosen from a class of functions with low capacity, misclassifies even some rather simple points. The last solution, however, gets all points right but is so complex that it might not work well on test points because it is basically just memorizing the training points. The medium-capacity solution represents a compromise between the other two. It disregards the outlier in the top left corner but correctly classifies all other points.

Regression

Similarly, several kernel-based methods for regression have been proposed. For example, the classic ridge regression algorithm acquires a completely new flavor when executed in a nonlinear feature space. Interestingly, the resulting algorithm is identical to a well-known statistical procedure called Gaussian process regression and to a method known in geostatistics as kriging.

Also in this case, the nonlinear regression function is found as a linear combination of kernels by solving a convex optimization problem. A special choice of loss function also leads to sparseness and to an algorithm known as SVM regression.
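A minimal sketch of the kernel version of ridge regression makes the construction explicit: the dual coefficients solve (K + λI)α = y, and predictions are the kernel expansion f(x) = Σi αi k(xi, x). The code below is illustrative only (it assumes numpy, a Gaussian kernel, and invented data) and shows the ridge-regression analogue discussed above, not the sparse SVM regression algorithm.

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=0.5):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

# Noisy samples of a nonlinear target function.
rng = np.random.default_rng(1)
X = np.linspace(0, 2 * np.pi, 40)[:, None]
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=40)

lam = 0.1                                              # regularization strength
K = rbf_kernel_matrix(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual coefficients

X_test = np.array([[1.0], [4.0]])
y_pred = rbf_kernel_matrix(X_test, X) @ alpha          # f(x) = sum_i alpha_i k(x_i, x)
print(y_pred, np.sin([1.0, 4.0]))                      # close to the noiseless targets
```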

Unsupervised Learning

The same principles and techniques deployed in the case of SVMs have been exported to a number of other machine learning tasks. Unsupervised learning deals with the problem of discovering significant structures in data sets without an external signal (such as labels provided by a human expert).

For example, PCA is a technique aimed at discovering a maximally informative set of coordinates in the input space to obtain a more efficient representation of the data. This technique was routinely performed in a linear fashion, using as coordinate basis vectors the eigenvectors of the data set's covariance matrix (roughly speaking, the directions in which the data vary most). Several tricks existed to overcome the obvious limitations of linear systems, but the discovery that this algorithm can be used in any kernel-induced embedding space opened the possibility to perform nonlinear PCA in a principled way, now known as kernel PCA (figures 2, 3, and 4).

Novelty detection is the task of spotting anomalous observations, that is, points that appear to be different from the others. It can be used, for example, in credit card fraud detection, the spotting of unusual transactions of a customer, engine fault detection, and medical diagnosis. An efficient SVM method for representing nonlinear regions of the input space where normal data should lie makes use of kernel embedding.

Clustering is the task of detecting clusters within a data set, that is, sets of points that are “similar to each other and different from all the others.” Obviously, performing clustering requires a notion of similarity, which has usually been a distance in the data space. Being able to reformulate this problem in a generic embedding space created by a nonlinear kernel has paved the way to more general versions of clustering.
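Kernel PCA as described above can be sketched in a few lines of numpy: form the Gram matrix, center it (which corresponds to centering the data in feature space), and take its leading eigenvectors. The sketch below is illustrative only, and normalization conventions differ between presentations.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_pca(X, n_components=2, sigma=1.0):
    """Project the training points onto the leading principal components in feature space."""
    n = len(X)
    K = rbf_gram(X, sigma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one       # center the data in feature space
    eigval, eigvec = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    idx = np.argsort(eigval)[::-1][:n_components]    # keep the largest eigenvalues
    # Projections of the training points; columns are the nonlinear "principal components".
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))

# Three well-separated clusters; the leading kernel components reflect the cluster structure.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2)) for c in (-2.0, 0.0, 2.0)])
print(kernel_pca(X).shape)   # (90, 2)
```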

Applications

The number of successful applications of this class of methods has been growing steadily in the last few years. The record for performance in text categorization, handwriting recognition, and some genomics applications is currently held by SVMs. Furthermore, in many other domains, SVMs deliver performance comparable to the best systems, requiring just a minimum amount of tuning.

We review just a few cases: text categorization, handwritten digit recognition, and gene expression data classification. We briefly summarize here the results of these benchmarks but strongly recommend that readers access the original papers to obtain a full description and justification of the cost functions being used for the comparison, the data sets, the experimental design, and other fundamental statistical considerations that we could not address here because of space limitations. A proper discussion would require several pages for each experiment.

Figure 2. The First (and Only) Two Principal Components of a Two-Dimensional Toy Data Set. Shown are the feature-extraction functions, that is, contour plots of the projections onto the first two eigenvectors of the covariance matrix. Being a linear technique, principal component analysis cannot capture the nonlinear structure in the data set.

Text Categorization

Researchers have constructed a kernel from a representation of text documents known in information retrieval as a bag of words. In this representation, a document is associated with a sparse vector that has as many dimensions as there are entries in a dictionary. Words that are present in the document are associated with a positive entry, whereas words that are absent have a zero entry in the vector. The weight given to each word also depends on its information content within the given corpus. The similarity between two documents is measured by the cosine between their vectors, which corresponds to a kernel.
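A minimal sketch of this construction (not taken from the cited work; it uses plain term counts rather than the corpus-dependent weighting mentioned above) shows how the cosine of two bag-of-words vectors behaves as a kernel:

```python
import numpy as np
from collections import Counter

def bag_of_words(doc, vocabulary):
    """Term-count vector over a fixed vocabulary (dense here for simplicity)."""
    counts = Counter(doc.lower().split())
    return np.array([counts[w] for w in vocabulary], dtype=float)

def cosine_kernel(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

docs = ["the cat sat on the mat", "the dog sat on the log", "stocks fell sharply today"]
vocabulary = sorted({w for d in docs for w in d.lower().split()})
vectors = [bag_of_words(d, vocabulary) for d in docs]

# Documents on similar topics share words and therefore have a larger kernel value.
print(cosine_kernel(vectors[0], vectors[1]))   # relatively high
print(cosine_kernel(vectors[0], vectors[2]))   # zero: no shared words
```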

The combination of this kernel with SVMs has created the most efficient algorithm for the classification of news wire stories from Reuters, a standard benchmark.

Comparisons with four conventional algorithms (k-nearest neighbor, the decision tree learner C4.5, the standard Rocchio algorithm, and naïve Bayes) reported by Thorsten Joachims (1998) on two standard benchmarks (the Reuters and the Ohsumed corpora) show SVMs consistently outperforming these methods. Performance was measured by the microaveraged precision/recall breakeven point across 10 categories for Reuters and 23 categories for Ohsumed. In particular, for the Reuters data, the average performance across the 10 categories ranges from 72 to 82 for the conventional methods, whereas SVMs achieve 86.0 with polynomial kernels and 86.4 with Gaussian kernels. For the Ohsumed data, the conventional systems ranged from 50.0 to 59.1, but SVMs achieved 65.9 with polynomial kernels and 66.0 with Gaussian kernels.

Since then, SVMs have been used in a number of text categorization applications.

Handwritten-Digit Recognition

During the 1990s, the so-called MNIST set emerged as the “gold standard” benchmark for pattern-recognition systems in what was then the Adaptive Systems Research Group at AT&T Bell Labs. This group pioneered the industrial use of machine learning as well as contributed fundamental insights to the development of the field in general. The MNIST training set contains 60,000 handwritten digits, scanned at a resolution of 28 x 28 pixels. All major algorithms were tested on this set, and a sophisticated multilayer neural net (called LENET) had long held the benchmark record. Recently, an SVM invariant to a class of image transformations achieved a new record result, thus taking over one of the previous strongholds of neural nets. The previous record was held by a boosted version of the sophisticated neural network LENET4, which had been refined for years by the AT&T labs, with 0.7-percent test error. SVMs with so-called jittered kernels (DeCoste and Schoelkopf 2002) achieved 0.56 percent. On the same data set, 3-nearest neighbors obtains 2.4 percent, 2-layer neural networks 1.6 percent, and SVMs without specialized kernels 1.4 percent, illustrating the importance of kernel design.

Figure 3. The First Two Principal Components for Kernel Principal Components Analysis (PCA), That Is, PCA Performed in the Feature Space Associated with a Kernel Function. The two feature extractors identify the cluster structure in the data set: The first one (left) separates the bottom left cluster from the top one; the second one (right) then disregards this distinction and focuses on separating these two clusters from the third one.

Gene Expression Data

Modern biology deals with data mining as much as it does with biochemical reactions. The huge mass of data generated first by the Genome Project and then by the many postgenomic techniques such as DNA microarrays calls for new methods to analyze data. Nowadays, the problem of obtaining biological information lies just as much in the data analysis as it lies in the development of actual measuring devices.

However, the high dimensionality and the extreme properties of bioinformatics data sets represent a challenge for most machine learning systems. In particular, gene-expression data generated by DNA microarrays are high dimensional and noisy and usually expensive to obtain (hence only available in small data sets). A typical microarray experiment produces a several-thousand-dimension vector, and its cost means that usually only a few dozen such vectors are produced.

SVMs are trained on microarray data to accurately recognize tissues that are, for example, cancerous or to detect the function of genes and, thus, for genomic annotation.

SVMs were first used for gene-function prediction based on expression data at the University of California at Santa Cruz in 1999 (Brown et al. 2000). The same group later pioneered the use of SVMs for cancer diagnosis based on tissue samples. In both cases, they delivered state-of-the-art performance, and since then, these experiments have been repeated and extended by a number of groups in essentially all the leading institutions for bioinformatics research. Kernel methods are now routinely used to analyze such data.

In this first experiment, SVMs were compared to four conventional algorithms (the decision tree learner C4.5, the perceptron decision tree learner MOC1, Parzen windows, and the Fisher discriminant) on a data set of about 2500 genes, each described by 79 expression values and labeled according to one of 5 functions as stated by the MIPS database. The performance was measured by the number of false positives plus twice the number of false negatives. SVMs with Gaussian kernels achieved a cost of 24 for the first class (conventional methods ranging between 28 and 41), 21 for the second class (conventional methods between 30 and 61), 17 for the third class (conventional methods between 22 and 78), 17 for the fourth class (conventional methods between 31 and 44), and 4 for the fifth class (with standard approaches ranging between 6 and 12). Other successful applications of SVMs to computational biology involve protein homology prediction, protein fold classification, and drug activity prediction.

Figure 4. Because It Is Performed in a High-Dimensional Feature Space, Kernel Principal Components Analysis Can Extract More Than Just the Two Feature Extractors Shown in Figure 2. It turns out that the higher-order components analyze the internal structure of the clusters; for example, component number 4 (top right) splits the bottom left cluster in a way that is orthogonal to component number 8 (bottom right).

Sidebar 1. Learning Classifications with Large Margin Hyperplanes

In pattern classification, the objective is to estimate a function f: ℜn → {–1, +1} using training data, that is, n-dimensional vectors xi and class labels yi, such that f will correctly classify new examples (x, y) with high probability. More precisely, we would like the predicted label f(x) to equal the true label y for examples (x, y) that were generated from the same underlying probability distribution P(x, y) as the training data.

If we put no restriction on the class of functions that we choose our estimate f from, however, even a function that does well on the training data, for example, by satisfying f(xi) = yi for all i, need not generalize well to unseen examples. Suppose we have no additional information about f (for example, about its smoothness). Then the values on the training patterns carry no information whatsoever about values on novel patterns. Hence, learning is impossible, and minimizing the training error does not imply a small expected test error.

Statistical learning theory shows that it is crucial to restrict the class of functions that the learning machine can implement to one with a capacity that is suitable for the amount of available training data.

We call the error of the function f the probability of mislabeling a point x whose label is y,

ε = P({x | f(x) ≠ y}),

and the theory aims at upper bounding this quantity with an expression that includes observable characteristics of the learning machine.

For example, the probability of mislabeling a new point from the same distribution, with a function that perfectly labels the training sample, increases with a measure of the function's complexity known as the VC dimension (but other measures of capacity exist). Controlling the capacity is, hence, a crucial step in the design of a learning machine: Given two functions that perform identically on the training set, the one with lower capacity has the higher probability of generalizing correctly to new points.

It follows from the previous considerations that to design effective learning algorithms, we must come up with a class of functions whose capacity can be computed. The algorithm should then attempt to keep the capacity low and also fit the training data. Support vector classifiers are based on the class of hyperplanes

⟨w, x⟩ + b = 0,  with w, x ∈ ℜn and b ∈ ℜ,

corresponding to decision functions

f(x) = sign(⟨w, x⟩ + b).

It is possible to prove that the optimal hyperplane, defined as the one with the maximal margin of separation between the two classes (figure A), stems from the function class with the lowest capacity. This hyperplane can be constructed uniquely by solving a constrained quadratic optimization problem whose solution w has an expansion in terms of a subset of training patterns that lie closest to the boundary (figure A). These training patterns, called support vectors, carry all relevant information about the classification problem. Omitting many details of the calculations, there is just one crucial property of the algorithm that we need to emphasize: Both the quadratic programming problem and the final decision function depend only on dot products between patterns, which is precisely what lets us generalize to the nonlinear case.

The key to setting up the optimization problem of finding a maximal margin hyperplane is to observe that for the two points x+ and x– that lie nearest to it, we can rescale w and b so that

⟨w, x+⟩ + b = +1
⟨w, x–⟩ + b = –1,

from which it follows that

⟨w, x+ – x–⟩ = 2  and  ⟨w/‖w‖, x+ – x–⟩ = 2/‖w‖.

Hence, the margin is inversely proportional to the norm of w. Therefore, we need to minimize the norm of w subject to the given constraints:

minimize ½⟨w, w⟩  subject to  yi[⟨w, xi⟩ + b] ≥ 1 for all i.

This problem can be solved by constructing a Lagrangian function

L(w, b, α) = ½⟨w, w⟩ – Σi αi [yi(⟨w, xi⟩ + b) – 1],  with αi ≥ 0,

where the coefficients αi are Lagrange multipliers, and by transforming it into the corresponding dual Lagrangian by imposing the optimality conditions

∂L/∂w = w – Σi αi yi xi = 0  and  ∂L/∂b = Σi αi yi = 0.

Figure A. A Maximal Margin Separating Hyperplane. Notice that its position is determined by the nearest points, called support vectors.

The result is a quadratic programming problem with linear constraints:

maximize W(α) = Σi αi – ½ Σi,j αi αj yi yj ⟨xi, xj⟩
subject to αi ≥ 0 and Σi αi yi = 0.

This problem has just a global maximum and can always be solved exactly and efficiently. The resulting solution has the property that

w = Σi αi yi xi,

and in fact, often most of the coefficients αi are equal to zero. The only positive coefficients correspond to the points that lie closest to the hyperplane, and for this reason, such points go under the name of support vectors.

The final decision function can be written as

f(x) = ⟨w, x⟩ + b = Σi αi yi ⟨xi, x⟩ + b,

where the index i runs only over the support vectors. In other words, if all data points other than the support vectors were removed, the algorithm would find the same solution. This property, known as sparseness, has many consequences, both in the implementation and in the analysis of the algorithm.

Notice also that both in training and in testing, the algorithm uses the data only inside inner products. This property paves the way for the use of kernel functions, which can be regarded as generalized inner products, with this algorithm (see sidebar 2).
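The construction can be reproduced numerically on a tiny, linearly separable data set. The sketch below is not part of the original sidebar: it assumes numpy and scipy, and a general-purpose constrained optimizer stands in for a dedicated quadratic programming routine. It maximizes W(α) under αi ≥ 0 and Σi αi yi = 0, then recovers w and b from the solution.

```python
import numpy as np
from scipy.optimize import minimize

# A tiny linearly separable problem in the plane.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

G = (y[:, None] * X) @ (y[:, None] * X).T   # G[i, j] = y_i y_j <x_i, x_j>

def neg_dual(alpha):
    # Negative of W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>
    return 0.5 * alpha @ G @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * len(X)                          # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(len(X)), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x

w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                    # any point with alpha_i > 0
b = y[sv] - w @ X[sv]                    # the margin constraint is tight there
print("alpha:", np.round(alpha, 3))      # only the points nearest the boundary get nonzero alpha
print("decision:", np.sign(X @ w + b))   # recovers the training labels
```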



Sidebar 2. Inner Products, Feature Spaces, and Kernels

The main idea behind kernel methods is to embed the data items into a vector space and use linear algebra and geometry to detect structure in the data. For example, one can learn classifications in the form of maximal margin hyperplanes, as described in sidebar 1. There are several reasons to embed the data into a feature space. By mapping the data into a suitable space, it is possible to transform nonlinear relations within the data into linear ones. Furthermore, the input domain does not need to be a vector space; just the embedding space needs to be.

Rather than writing down explicitly the positions of the data points within some reference frame, we make use of information about the relative positions of the points with respect to each other. In particular, we use the inner products between all pairs of vectors in the embedding space. Such information can often be obtained in a way that is independent of the dimensionality of this space. Many algorithms (for example, the maximal margin classifier described in this article) can make use of inner product information.

Figure A shows the basic idea behind support vector machines, which is to map the data into some dot product space (called the feature space) F using a nonlinear map φ: ℜn → F and perform the maximal margin algorithm in F. Clearly, if F is high dimensional, working in the feature space can be expensive.

Figure A. A Nonlinear Map Is Used to Embed the Data in a Space Where Linear Relations Are Sought.

However, in sidebar 1, we saw that all we need to know to perform the maximal margin algorithm is the inner product between vectors in the feature space. This quantity can often be computed more efficiently by means of a kernel function k(x, z) = ⟨φ(x), φ(z)⟩.

This computational shortcut can be used by a large class of algorithms, including principal components analysis and ridge regression.

As an example of a kernel function, consider the following map from a two-dimensional space to a three-dimensional one:

φ(x1, x2) = (x1², x2², √2 x1x2).

The inner product in such a space can easily be computed without explicitly rewriting the data in the new representation. Consider two points x = (x1, x2) and z = (z1, z2), and consider the kernel function obtained by squaring their inner product:

⟨x, z⟩² = (x1z1 + x2z2)²
        = x1²z1² + x2²z2² + 2x1z1x2z2
        = ⟨(x1², x2², √2 x1x2), (z1², z2², √2 z1z2)⟩
        = ⟨φ(x), φ(z)⟩.

This function corresponds to the inner product between two three-dimensional vectors. If we had used a higher exponent, we would have virtually embedded these two vectors in a much higher-dimensional space at a very low computational cost.

More generally, we can prove that for every kernel that gives rise to a positive definite matrix Kij = k(xi, xj), we can construct a map φ such that k(x, z) = ⟨φ(x), φ(z)⟩.
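This positive-definiteness property is easy to check numerically for a particular kernel and data set. The following sketch (assuming numpy; the points are random and purely illustrative) verifies that a Gaussian Gram matrix has no negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))          # 20 random points in 4 dimensions

# Gaussian Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)), with sigma = 1.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq / 2.0)

eigenvalues = np.linalg.eigvalsh(K)   # K is symmetric, so use the Hermitian routine
print(eigenvalues.min())              # non-negative up to floating-point round-off
```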

Other examples of kernels include the Gaussian kernel

k(x, z) = exp(−‖x − z‖² / 2σ²),

the polynomial kernel

k(x, z) = (⟨x, z⟩ + 1)^d,

and kernels defined over sets such as strings and text documents. Such kernels virtually embed elements of general sets into a Euclidean space, where learning algorithms can be executed. This capability to naturally deal with general data types (for example, biosequences, images, or hypertext documents) is one of the main innovations introduced by the kernel approach.
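One simple example of such a kernel on strings is the spectrum kernel, which counts shared substrings of a fixed length k. The sketch below (not from the article, with made-up sequences) never constructs the underlying feature space of all possible k-mers explicitly:

```python
from collections import Counter

def spectrum(s, k=3):
    """Counts of all length-k substrings (k-mers) of a string."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    """Inner product of the k-mer count vectors; the embedding is never built explicitly."""
    cs, ct = spectrum(s, k), spectrum(t, k)
    return sum(cs[m] * ct[m] for m in cs)

print(spectrum_kernel("GATTACAGATTACA", "GATTACA"))   # related sequences: large value
print(spectrum_kernel("GATTACAGATTACA", "CCCCCCC"))   # unrelated sequences: zero
```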



Conclusions

The development of kernel-based learning systems in the mid-1990s represented another turning point in the history of pattern-recognition methods, comparable to the “nonlinear revolution” of the mid-1980s, when the introduction of back-propagation networks and decision trees triggered a series of rapid advances that ultimately spawned entire fields such as data mining. The present revolution comes with a new level of generalization performance, theoretical rigor, and computational efficiency. The extensive use of optimization methods, the deeper understanding of the phenomenon of overfitting from the statistical point of view, and the use of kernels as nonlinear similarity measures all represent elements of novelty, directly addressing weaknesses of the previous generation of pattern-recognition systems.

The resulting systems have rapidly become part of the toolbox of practitioners in addition to still being the object of much theoretical attention. A rich research community has emerged, and many universities and software companies participate in the research in this field, including startups that use it for mining postgenomic data. The future of this field ultimately depends on the performance of the algorithms. As long as kernel-based learning methods continue to deliver state-of-the-art performance in strategic applications such as text categorization, handwriting recognition, gene function prediction, and cancer tissue recognition, the interest in them is bound to remain high.2

Kernel methods also mark an important development in AI research, demonstrating that, at least in very specific domains, a rigorous mathematical theory is not only possible but also pays off from a practical point of view. The impact of this research direction would be even larger if it could inspire neighboring fields to introduce analogous tools in their methodology. There is no doubt that a mathematical theory of intelligent systems is still far in the future, but the success of learning theory in delivering effective learning methods shows that such a theory is at least not an impossibility.

Notes

1. www.jmlr.org.

2. Official kernel machines web site: www.kernel-machines.org.

References

Boser, B.; Guyon, I.; and Vapnik, V. 1992. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Conference on Computational Learning Theory, 144–152. New York: Association for Computing Machinery.

Brown, M.; Grundy, W.; Lin, D.; Cristianini, N.; Sugnet, C.; Furey, T.; Ares, M.; and Haussler, D. 2000. Knowledge-Based Analysis of Microarray Gene Expression Data Using Support Vector Machines. Proceedings of the National Academy of Sciences 97:262–267.

Cristianini, N., and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines. Cambridge, U.K.: Cambridge University Press. Also see www.support-vector.net.

DeCoste, D., and Schoelkopf, B. 2002. Training Invariant Support Vector Machines. Machine Learning 46(1–3): 161–190.

Joachims, T. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the European Conference on Machine Learning, 137–142. Berlin: Springer-Verlag.

Schölkopf, B., and Smola, A. 2002. Learning with Kernels. Cambridge, Mass.: MIT Press. Also see www.learning-with-kernels.org.

Vapnik, V. 1998. Statistical Learning Theory. New York: Wiley.

Nello Cristianini works for BIOwulf, a small startup with offices in Berkeley, California, and New York City. He is a visiting lecturer at the University of California at Berkeley and honorary research associate at Royal Holloway, University of London. He is an active researcher in the area of kernel methods, was the author of a recent textbook, and was instrumental in building the primary web site of the community.

Bernhard Schölkopf works for BIOwulf, a small startup with offices in Berkeley, California, and New York City. He is also director of the Max-Planck-Institute for Biological Cybernetics, Tübingen, Germany. He is an active researcher in the area of kernel methods, is the author of a recent textbook, and was instrumental in building the primary web site for the community.

