Learning theory: Past performance and future results

Carlo Tomasi

Learning from experience is hard, and predicting how well what we have learned will serve us in the future is even harder. The most useful lessons turn out to be those that are insensitive to small changes in our experience.




news and views

378 NATURE | VOL 428 | 25 MARCH 2004 | www.nature.com/nature

A hallmark of intelligent learning is that we can apply what we have learned to new situations. In the mathematical theory of learning, this ability is called generalization. On page 419 of this issue¹, Poggio et al. formulate an elegant condition for a learning system to generalize well.

As an illustration, consider practising how to hit a tennis ball. We see the trajectory of the incoming ball, and we react with complex motions of our bodies. Sometimes we hit the ball with the racket’s sweet spot and send it where we want; sometimes we do less well. In the theory of supervised learning, an input–output pair exemplified by a trajectory and the corresponding reaction is called a training sample. A learning algorithm observes many training samples and computes a function that maps inputs to outputs. The learned function generalizes well if it does about as well on new inputs as on the old ones: if this is true, our performance during tennis practice is a reliable indication of how well we will play during the game.

Given an appropriate measure for the ‘cost’ of a poor hit, the algorithm could choose the least expensive function over the set of training samples, an approach to learning called empirical risk minimization. A classical result² in learning theory shows that the functions learned through empirical risk minimization generalize well only if the ‘hypothesis space’ from which they are chosen is simple enough. That there may be trouble in a poor choice of hypotheses is a familiar concept in most scientific disciplines. For instance, a high-degree polynomial fitted to a set of data points can swing wildly between them, and these swings decrease our confidence in the ability of the polynomial to make correct predictions about function values between available data points. For similar reasons, we have come to trust Kepler’s simple description of the elliptical motion of heavenly bodies more than the elaborate system of deferents, epicycles and equants of Ptolemy’s Almagest, no matter how well the latter fit the observations.
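The gap between fitting and generalizing can be made concrete with a small numerical sketch (an illustration only, not taken from the paper; the underlying function sin(πx), the noise level and the polynomial degrees are arbitrary choices). Empirical risk minimization by least squares is run over two hypothesis spaces, and the empirical risk on the training samples is compared with the risk on fresh points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy training samples from a smooth underlying function.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(10)

def empirical_risk(coeffs, x, y):
    """Mean squared 'cost' of a fitted polynomial on the samples (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Empirical risk minimization (least squares) over two hypothesis
# spaces: cubic polynomials, and degree-9 polynomials rich enough to
# pass through every training point.
simple = np.polyfit(x_train, y_train, deg=3)
rich = np.polyfit(x_train, y_train, deg=9)

# Fresh inputs between the training points, with noiseless targets.
x_test = np.linspace(-0.95, 0.95, 200)
y_test = np.sin(np.pi * x_test)

for name, fit in [("simple", simple), ("rich", rich)]:
    print(name,
          empirical_risk(fit, x_train, y_train),  # risk during 'practice'
          empirical_risk(fit, x_test, y_test))    # risk during the 'game'
```

The degree-9 fit drives the risk on the training samples to essentially zero, yet its swings between the data points mean that the training risk is a poor prediction of the risk on fresh inputs; the cubic’s two risks typically stay of comparable size.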

The classical definition of a ‘simple enough’ hypothesis space is brilliant but technically involved. For instance, the set of linear functions defined on the plane has a complexity (or Vapnik–Chervonenkis dimension²) of three, because three is the greatest number of points that can be arranged on the plane so that suitable linear functions assume any desired combination of signs (positive or negative) when evaluated at the points. This definition is a mouthful already for this simple case. Although this approach has generated powerful learning algorithms², the complexity of hypothesis spaces for many realistic scenarios quickly becomes too hard to measure with this yardstick. In addition, not all learning problems can be formulated through empirical risk minimization, so classical results might not apply.
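This shattering condition can be checked by brute force for small point sets. In the sketch below (illustrative code; the point coordinates and the coefficient grid are arbitrary choices), a linear function f(x, y) = w1·x + w2·y + b labels each point with the sign of its value:

```python
import itertools

def achievable_sign_patterns(points, grid):
    """Collect every sign pattern that linear functions
    f(x, y) = w1*x + w2*y + b, with coefficients drawn from a finite
    grid, can produce on the given points."""
    patterns = set()
    for w1, w2, b in itertools.product(grid, repeat=3):
        values = [w1 * x + w2 * y + b for x, y in points]
        if all(v != 0 for v in values):  # ignore ties on the boundary
            patterns.add(tuple(1 if v > 0 else -1 for v in values))
    return patterns

grid = [k / 2 for k in range(-6, 7)]  # coefficients from -3.0 to 3.0

# Three points in general position are shattered: all 2**3 = 8
# combinations of signs occur.
three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(len(achievable_sign_patterns(three, grid)))  # 8

# Four points in the XOR arrangement cannot be shattered: no linear
# function puts (0,0) and (1,1) on one side and (1,0), (0,1) on the other.
four = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
print((1, 1, -1, -1) in achievable_sign_patterns(four, grid))  # False
```

The search can only confirm achievable patterns, but the missing XOR pattern is genuinely impossible (combining the four sign constraints yields a contradiction), and no arrangement of four points in the plane can be shattered by linear functions; hence the dimension of three.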

Poggio et al.¹ propose an elegant solution to these difficulties that builds on earlier intuitions³⁻⁵ and shifts attention away from the hypothesis space. Instead, they require the learning algorithm to be stable if it is to produce functions that generalize well. In a nutshell, an algorithm is stable if the removal of any one training sample from any large set of samples almost always results in a small change in the learned function. Post facto, this makes intuitive sense: if removing one sample has little consequence (stability), then adding a new one should cause little surprise (generalization). For example, we expect that adding or removing an observation in Kepler’s catalogue will usually not perturb his laws of planetary motion substantially.

The simplicity and generality of the stability criterion promise practical utility. For example, neuronal synapses in the brain may have to adapt (learn) with little or no memory of past training samples. In these cases, empirical risk minimization does not help, because computing the empirical risk requires access to all past inputs and outputs. In contrast, stability is a natural criterion to use in this context, because it implies predictable behaviour. In addition, stability could conceivably lead to a so-called online algorithm — that is, one that improves its output as new data become available.
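As a toy example of such an online, memoryless learner (illustrative only, not from the paper), consider estimating a constant by a running mean: the learner stores just the current estimate and a sample count, improves with every new sample, and is stable in the sense that one extra sample moves the estimate by an amount that shrinks as the count grows.

```python
def online_mean():
    """Online learner that keeps no past training samples: only the
    current estimate and how many samples it has seen."""
    n, estimate = 0, 0.0

    def update(sample):
        nonlocal n, estimate
        n += 1
        estimate += (sample - estimate) / n  # running-mean update, O(1) memory
        return estimate

    return update

learn = online_mean()
for sample in [2.0, 4.0, 6.0]:
    current = learn(sample)
print(current)  # 4.0
```

Each update costs constant memory, matching the constraint that a synapse cannot revisit all past inputs and outputs, whereas empirical risk minimization over the full history would need them all.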

Of course, stability is not the whole story, just as being able to predict our tennis performance does not mean that we will play well. If after practice we play as well as the best game contemplated in our hypothesis space, then our learning algorithm is said to be consistent. Poggio et al.¹ show that stability is equivalent to consistency for empirical risk minimization, whereas for other learning approaches stability only ensures good generalization. Even so, stability can become a practically important learning tool, as long as some key challenges are met. Specifically, Poggio et al.¹ define stability in asymptotic form, by requiring certain limits to vanish as the size of the training set becomes large. In addition, they require this to be the case for all possible probability distributions of the training samples.

True applicability to real situations will depend on how well these results can be rephrased for finite set sizes. In other words, can useful measures of stability and generalization be estimated from finite training samples? And is it feasible to develop statistical confidence tests for them? A new, exciting research direction has been opened. ■

Carlo Tomasi is in the Department of Computer Science, Duke University, Durham, North Carolina 27708, USA.
e-mail: [email protected]

1. Poggio, T., Rifkin, R., Mukherjee, S. & Niyogi, P. Nature 428, 419–422 (2004).
2. Vapnik, V. N. Statistical Learning Theory (Wiley, New York, 1998).
3. Devroye, L. & Wagner, T. IEEE Trans. Information Theory 25, 601–604 (1979).
4. Bousquet, O. & Elisseeff, A. J. Machine Learning Res. 2, 499–526 (2002).
5. Kutin, S. & Niyogi, P. in Proc. 18th Conf. Uncertainty in Artificial Intelligence, Edmonton, Canada, 275–282 (Morgan Kaufmann, San Francisco, 2002).


In or out: success rests on learning algorithms that are stable against slight changes in input conditions¹.

FREDERIKE HELWIG/GETTY IMAGES


© 2004 Nature Publishing Group