Lessons in Neural Network Training: Overfitting May be Harder than Expected, Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI-97, AAAI Press, Menlo Park, California, pp. 540-545, 1997. Copyright AAAI.

Lessons in Neural Network Training: Overfitting May be Harder than Expected

Steve Lawrence, C. Lee Giles, Ah Chung Tsoi

NEC Research, 4 Independence Way, Princeton, NJ 08540
Faculty of Informatics, Uni. of Wollongong, Australia

Abstract

For many reasons, neural networks have become very popular AI machine learning models. Two of the most important aspects of machine learning models are how well the model generalizes to unseen data, and how well the model scales with problem complexity. Using a controlled task with known optimal training error, we investigate the convergence of the backpropagation (BP) algorithm. We find that the optimal solution is typically not found. Furthermore, we observe that networks larger than might be expected can result in lower training and generalization error. This result is supported by another real world example. We further investigate the training behavior by analyzing the weights in trained networks (excess degrees of freedom are seen to do little harm and to aid convergence), and contrasting the interpolation characteristics of multi-layer perceptron neural networks (MLPs) and polynomial models (overfitting behavior is very different - the MLP is often biased towards smoother solutions). Finally, we analyze relevant theory outlining the reasons for significant practical differences. These results bring into question common beliefs about neural network training regarding convergence and optimal network size, suggest alternate guidelines for practical use (lower fear of excess degrees of freedom), and help to direct future work (e.g. methods for creation of more parsimonious solutions, importance of the MLP/BP bias and possibly worse performance of "improved" training algorithms).

Introduction

Neural networks are one of the most popular AI machine learning models, and much has been written about them. A common belief is that the number of parameters in the network should be related to the number of data points and the expressive power of the network. The results in this paper suggest that the characteristics of the training algorithm should also be considered.

Generalization and Overfitting

Neural networks and other AI machine learning models are prone to "overfitting". Figure 1 illustrates the concept using polynomial approximation. A training dataset was created which contained 21 points according to the equation y = sin(x/3) + v, where v is a uniformly distributed random variable between -0.25 and 0.25. The equation was evaluated at 0, 1, 2, ..., 20. This dataset was then used to fit polynomial models with orders between 2 and 20. For order 2, the approximation is poor. For order 10, the approximation is reasonably good. However, as the order (and number of parameters) increases, significant overfitting and increasingly poor generalization is evident. At order 20, the approximated function fits the training data very well, however the interpolation between training points is very poor. Overfitting can also be a very important problem in MLPs, and much work has been devoted to preventing overfitting with techniques such as model selection, early stopping, weight decay, and pruning.

Copyright 1997, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
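To make the setup concrete, the following is a minimal sketch of the polynomial experiment described above using NumPy. The function y = sin(x/3) + v, the 21 sample points, and the noise range are taken from the text; the use of numpy.polyfit and the particular random seed are illustrative assumptions, not the authors' original code.

    import numpy as np

    rng = np.random.default_rng(0)            # illustrative seed, not from the paper
    x = np.arange(21)                          # evaluation points 0, 1, ..., 20
    noise = rng.uniform(-0.25, 0.25, x.size)   # uniform noise in [-0.25, 0.25]
    y = np.sin(x / 3.0) + noise                # noisy training targets

    for order in (2, 10, 16, 20):
        coeffs = np.polyfit(x, y, order)       # least-squares polynomial fit
        x_dense = np.linspace(0, 20, 401)      # dense grid to expose interpolation behavior
        y_fit = np.polyval(coeffs, x_dense)
        train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
        print(f"order {order:2d}: training MSE = {train_err:.4f}")

As the order approaches the number of data points the training MSE falls towards zero, but evaluating y_fit between the sample points exposes the oscillations visible in Figure 1.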

[Figure 1: four panels (Order 2, Order 10, Order 16, Order 20) plotting the approximation, the training data, and the target function without noise; axes are x (0 to 20) and y (-1.5 to 1.5).]

Figure 1. Polynomial interpolation of the function y = sin(x/3) + v in the range 0 to 20 as the order of the model is increased from 2 to 20. v is a uniformly distributed random variable between -0.25 and 0.25. Significant overfitting can be seen for orders 16 and 20.

Theory

The selection of a model size which maximizes generalization is an important topic. There are several theories for determining the optimal network size, e.g. the NIC (Network Information Criterion) which is a generalization of the AIC (Akaike Information Criterion) (Akaike 1973) widely used in statistical inference, the generalized final prediction error (GPE) as proposed by (Moody 1992), and the VC dimension (Vapnik 1995) - which is a measure of the expressive power of a network. NIC relies on a single well-defined minimum to the fitting function and can be unreliable when there are several local minima (Ripley 1995). There is very little published computational experience of the NIC, or the GPE. Their evaluation is prohibitively expensive for large networks. VC bounds have been calculated for various network types. Early VC-dimension work handles only the case of discrete outputs. For the case of real valued outputs, a more general notion of a "dimension" is required.

Such a "pseudo-dimension" can be defined by considering a loss function which measures the deviation of predictions from the target values. VC bounds are likely to be too conservative because they provide generalization guarantees simultaneously for any probability distribution and any training algorithm. The computation of VC bounds for practical networks is difficult.

Student-Teacher Task

To investigate empirical performance we will use a student-teacher task (Crane et al. 1995) so that we know the optimal solution and can carefully control the problem. The task is as follows:

1. An MLP with N_i input nodes, N_h hidden nodes, and N_o output nodes (the "teacher" network, denoted by N_i:N_h:N_o) is initialized with uniform random weights in the range -K to K, except for the bias weights which are within the range (-0.1, 0.1).

2. N_train training data points and N_test test points are created by selecting Gaussian random inputs with zero mean and unit variance and propagating them through the network to find the corresponding outputs. N_test is 5,000.

3. The training dataset is used to train new MLPs ("student" networks), with the following architecture: N_i:N'_h:N_o, where N'_h is varied from N_h to M, where M > N_h. The initial weights of these new networks are set randomly using the procedure suggested in (Haykin 1994) (i.e. they are not equal to the weights in the network used to create the dataset). Theoretically, if N'_h >= N_h, as it is throughout this paper, then the optimal training set error is zero (for the case where no noise is added to the data). A sketch of this setup appears below.

Simulation Results

This section investigates the training and generalization behavior of the networks for the student-teacher task with the teacher network size fixed but the student network size increasing. For all cases, the data was created with a teacher network architecture 20:10:1 (where 20, 10, and 1 were chosen to represent a typical network where the number of inputs is greater than the number of hidden nodes, and the specific values were chosen such that the total training time of the simulations was reasonable), and the random weight maximum value, K, was 1. The student networks had the following architecture: 20:N'_h:1, where N'_h was varied from 10 to 50. Theoretically, the optimal training set error for all networks tested is zero, as N'_h >= N_h. However, none of the networks trained here obtained the optimal error (using backpropagation (BP) (Rumelhart, Hinton, & Williams 1986) for 5 x 10^5 updates)1.

Each configuration of the MLP was tested with ten simulations, each with a different starting condition (random weights). No method of controlling generalization was used (other than a maximum number of updates) in order to demonstrate this case (not because we advocate the practice). All networks were trained for an identical number of stochastic updates (5 x 10^5). It is expected that overfitting could occur. The initial learning rate was 0.5 and was reduced linearly to zero during training. We used the standard MLP. Batch update was also investigated - convergence was found to be very poor even when training times were extended by an order of magnitude. The quadratic cost function was used.

1 Alternative optimization techniques (e.g. conjugate gradient) can improve convergence in many cases. However, these techniques often lose their advantage with larger problems and may sometimes be detrimental because the training algorithm bias in BP may be beneficial, see later in the paper.
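The training procedure just described (stochastic updates, quadratic cost, learning rate decayed linearly from 0.5 to zero) can be sketched roughly as below for a single-hidden-layer student; this is an illustrative reconstruction, not the authors' implementation, and the small uniform initialization stands in for the (Haykin 1994) procedure.

    import numpy as np

    def train_student(X, y, n_hidden, n_updates=500_000, lr0=0.5, seed=0):
        """Stochastic backpropagation on an n_in:n_hidden:1 tanh MLP, quadratic cost."""
        rng = np.random.default_rng(seed)
        n_in = X.shape[1]
        W1 = rng.uniform(-0.1, 0.1, (n_hidden, n_in)); b1 = np.zeros(n_hidden)
        W2 = rng.uniform(-0.1, 0.1, (1, n_hidden));    b2 = np.zeros(1)
        for t in range(n_updates):
            lr = lr0 * (1.0 - t / n_updates)           # linear decay to zero
            i = rng.integers(len(X))                   # one randomly chosen pattern
            h = np.tanh(W1 @ X[i] + b1)
            out = np.tanh(W2 @ h + b2)
            err = out - y[i]                           # gradient of 0.5 * err**2
            delta_out = err * (1.0 - out ** 2)         # tanh derivative at the output
            delta_h = (W2.T @ delta_out) * (1.0 - h ** 2)
            W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
            W1 -= lr * np.outer(delta_h, X[i]); b1 -= lr * delta_h
        return W1, b1, W2, b2

Calling train_student(X_train, y_train, n_hidden=30) would correspond to one 20:30:1 student run; ten such runs with different seeds give one box in Figure 2.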

Considering that networks with more than 10 hidden units contain more degrees of freedom than is necessary for zero error, a reasonable expectation would be for the performance to be worse, on average, as the number of hidden units is increased. Figure 2 shows the training and test set error as the number of hidden units in the student network is varied from 10 to 50. The results are presented using both box-whiskers plots2 and the usual mean plus and minus one standard deviation plots. We performed the experiments using three different values for N_train, the number of training points (200, 2,000 and 20,000). On average, the best generalization error corresponds to networks with more than 10 hidden units (30, 40, and 40 respectively for N_train = 200, 2,000, and 20,000)3 4. The number of parameters in the networks is greater than 200, even for the case of 10 hidden units (the numbers of parameters as N'_h is varied from 10 to 50 are 221, 441, 661, 881, and 1101). It is of interest to observe the effect of noise on this problem. Figure 2 also shows the results for the case of 200 training points when Gaussian noise is added to the input data with a standard deviation equal to 1% of the standard deviation of the input data. A similar trend is observed.
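The parameter counts quoted above follow directly from the 20:N'_h:1 topology (each hidden node has 20 input weights plus a bias, and the single output has N'_h weights plus a bias); a quick check:

    # Weights plus biases in a fully connected 20:n_h:1 MLP.
    def n_params(n_in, n_h, n_out=1):
        return (n_in + 1) * n_h + (n_h + 1) * n_out

    print([n_params(20, n_h) for n_h in (10, 20, 30, 40, 50)])
    # -> [221, 441, 661, 881, 1101], matching the figures quoted in the text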

2 The distribution of results is often not Gaussian and alternative means of presenting results other than the mean and standard deviation can be more informative (Giles & Lawrence 1997). Box-whiskers plots (Tukey 1977) show the interquartile range (IQR) with a box and the median as a bar across the box. The whiskers extend from the ends of the box to the minimum and maximum values. The median and the IQR are simple statistics which are not as sensitive to outliers as the mean and the standard deviation. The median is the value in the middle when arranging the distribution in order from the smallest to the largest value. If the data is divided into two equal groups about the median, then the IQR is the difference between the medians of these groups. The IQR contains 50% of the points.

3 Caruana presented a tutorial at NIPS 93 (Caruana 1993) with generalization results on a variety of problems as the size of the networks was varied from "too small" to "too large". "Too small" and "too large" are related to the number of parameters in the model (without consideration of the distribution of the data, the error surface, etc.). Caruana reported that large networks rarely do worse than small networks on the problems he investigated. The results in this paper partially correlate with that observation. Caruana suggested that "backprop ignores excess parameters".

4 This trend varies according to the teacher network size (number of inputs, hidden nodes and outputs), the nature of the target function, etc. For example, the optimal size networks perform best for certain tasks, and in other cases the advantage of larger networks can be even greater.

[Figure 2: six panels of training NMSE (left) and test NMSE (right) versus number of hidden nodes, for 200 training points, 2,000 training points, and 200 training points + noise.]

Figure 2. The error for networks with a topology 20:N'_h:1 using 200 (with and without noise) and 2,000 training points. For 20,000 points (not shown) the results were similar to the 2,000 points case. The graphs on the left are the training errors and the graphs on the right are the test errors. The abscissa corresponds to the number of hidden nodes. Box-whiskers plots are shown on the left in each case along with the mean plus or minus one standard deviation which is shown on the right in each case.

These results should not be taken to indicate that oversized networks should always be used. However, they do indicate that oversized networks may generalize well. Additionally, the results indicate that if training is more successful in the larger networks, then it is possible for the larger networks to also generalize better than the smaller networks. A few observations:

1. It remains desirable to find solutions with the smallest number of parameters.

2. A similar result would not be expected if a globally optimal solution was found in the small networks, i.e. if the 10 hidden unit networks were trained to zero error then it would be expected that any networks with extra degrees of freedom would result in worse performance.

3. The distribution of the results is important. For example, observe in figure 2 that the advantage of the larger networks for 2,000 training points is decreased when considering the minimum error rather than the mean error.

4. The number of trials is important. If sufficiently many trials are performed then it should be possible to find a near optimal solution in the optimal size networks (in the limit of an infinite number of random starting points, finding a global optimum is guaranteed with appropriate initialization). Any advantage from using larger size networks would be expected to disappear.

5. Note that there has deliberately been no control of the generalization capability of the networks (e.g. using a validation set or weight decay), other than a maximum number of updates. There are many solutions which fit the training data well that will not generalize well. Yet, contrary to what might be expected, the results indicate that it is possible for oversized networks to provide better generalization. Successive pruning and retraining of a larger network may arrive at a network with similar size to the smaller networks here but with improved training and generalization error.

Note that BP did not find the optimal solution in any of the cases presented here. Also of interest is how the solution found scales with problem complexity. The parameter K can be controlled to investigate this. As K is increased, the function mapping generally becomes more "complex" and less "smooth". Experiments with increasing K show that the solution found becomes progressively worse with respect to the optimal error of zero as K is increased. Analysis of the operation of BP (not given here) supports these results.

Degrees of Freedom  Rules based on the degrees of freedom in the model have been proposed for selecting the topology of an MLP, e.g. "The number of parameters in the network should be (significantly) less than the number of examples" or "Each parameter in an MLP can comfortably store 1.5 bits of information. A network with more than this will tend to memorize the data.". These rules aim to prevent overfitting, but they are unreliable as the optimal number of parameters is likely to depend on other factors, e.g. the quality of the solution found, the distribution of the data points, the amount of noise, any bias in the training algorithm, and the nature of the function being approximated. Specific rules, such as those mentioned above, are not commonly believed to be accurate. However, the stipulation that the number of parameters must be less than the number of examples is typically believed to be true for common datasets. The results here indicate that this is not always the case.

[Figure 3: test error (%) versus number of hidden nodes (6 to 14) for the face recognition example.]

Figure 3. Face recognition example: the best generalizing network has 364 times more parameters than training points (18,210 parameters).

Face Recognition Example  This section presents results on real data. Figure 3 shows the results of training an MLP to classify 10 people from images of their faces5. The training set contains 5 images per person, for a total of 50 training patterns6. The test set contained a different set of 5 images per person. A small window was stepped over the images and the image samples at each point were quantized using a two dimensional self-organizing map. The outputs of the self-organizing map for each image sample were used as the inputs to the MLP. In each case, the networks were trained for 25,000 updates. The networks used contain many more parameters than the number of training points (for hidden layer sizes of 6, 8, 10, 12, and 14 the number of weights is 7,810, 10,410, 13,010, 15,610, and 18,210), yet the best training error and the best generalization error correspond to the largest model. Note that a) generalization has not been controlled using, for example, a validation set or weight decay, and b) overfitting would be expected with sufficiently large networks and sufficiently "successful" training.

When simulated on serial machines, larger networks require longer training times for the same number of updates. Hence, it is of interest to compare what happens when the smaller networks are trained for longer than the larger networks. For this and other problems we have investigated, training for equal time rather than equal numbers of updates does not significantly affect the results or conclusions.

5 This is not proposed as a practical face recognition technique.

6 The database used is the ORL database, which contains a set of faces taken between April 1992 and April 1994 at the Olivetti Research Laboratory in Cambridge and is available from http://www.cam-orl.co.uk/facedatabase.html. There are 10 different images of 40 distinct subjects in the database. There are variations in facial expression and facial details. All the images are taken against a dark homogeneous background with the subjects in an up-right, frontal position, with tolerance for some tilting and rotation of up to about 20 degrees. There is some variation in scale of up to about 10%. The images are greyscale (256 levels) with a resolution of 92x112.

Polynomial and MLP Interpolation

Figure 4 shows the results of using an MLP to approximate the same training set as used earlier in the polynomial approximation example7. As for the polynomial case, the smallest network with one hidden unit (4 weights including bias weights) did not approximate the data well. With two hidden units (7 weights), the approximation is reasonably good. In contrast to the polynomial case however, networks with 10 hidden units (31 weights) and 50 hidden units (151 weights) also resulted in reasonably good approximations. Hence, for this particular (very simple) example, MLP networks trained with backpropagation do not lead to a large degree of overfitting, even with more than 7 times as many parameters as data points. It is certainly true that overfitting can be a serious problem with MLPs. However, this example highlights the possibility that MLPs trained with backpropagation may be biased towards smoother approximations. We list a number of possibilities which can lead to such a bias:

1. Training an MLP is NP-complete in general and it is well known that practical training algorithms used for MLPs often result in sub-optimal solutions (e.g. due to local minima)8. Often, a result of attaining a sub-optimal solution is that not all of the network resources are efficiently used. Experiments with a controlled task have indicated that the sub-optimal solutions often have smaller weights on average (Lawrence, Giles, & Tsoi 1996). An intuitive explanation for this is that weights typically start out reasonably small (for good reason), and may get trapped in local minima before reaching large values.

2. MLPs are universal approximators (Hornik, Stinchcombe, & White 1989). However, the universal approximation result requires an infinite number of hidden nodes. For a given number of hidden nodes a network may be incapable of representing the required function and instead implement a simpler function which approximates the required function.

3. Weight decay (Krogh & Hertz 1992) or weight elimination (Weigend, Rumelhart, & Huberman 1991) are often used in MLP training and aim to minimize a cost function which penalizes large weights. These techniques tend to result in networks with smaller weights.

4. A commonly recommended technique with MLP classification is to set the training targets away from the bounds of the activation function (e.g. (-0.8, 0.8) instead of (-1, 1) for the tanh activation function) (Haykin 1994).
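As an illustration of the last point, targets for a tanh output unit are often pulled in from the saturation limits; a minimal sketch follows, where the labels are made up for illustration and the 0.8 factor follows the (-0.8, 0.8) recommendation quoted above.

    import numpy as np

    labels = np.array([0, 1, 1, 0])               # example binary class labels (illustrative)
    targets = np.where(labels == 1, 1.0, -1.0)    # nominal tanh targets in {-1, +1}
    targets_soft = 0.8 * targets                  # train towards (-0.8, 0.8) instead
    # Keeping targets away from the asymptotes avoids driving the weights towards
    # very large values, which is one source of the bias towards smaller weights.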

MLP networks are, of course, not always this resistant to overfitting. For example, when repeating the above experiment but only evaluating the equation at 0, 1, 2, ..., 5 (creating 6 data points), overfitting is seen with only three hidden nodes.

7 Training details were as follows. A single hidden layer MLP, backpropagation, 100,000 stochastic training updates, and a learning rate schedule with an initial learning rate of 0.5 were used.

8 The results in this paper show that BP training often results in sub-optimal solutions. Commonly, these solutions are referred to as local minima, about which much has been written and proven (e.g. (Yu 1992)). However, it is not only local minima that create trouble for BP - other error surface features such as "ravines" and "plateaus" or "flat spots" can also be troublesome. The error surface for two different problems may have no local minima yet one may be far more amenable to gradient descent optimization.
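For readers who want to reproduce the flavor of Figure 4 (and the 6-point variant just mentioned), here is a minimal sketch using scikit-learn's MLPRegressor as a stand-in. The paper trained its own backpropagation code (see footnote 7), so the solver, iteration count, and seed below are assumptions rather than the original setup.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    x = np.arange(21.0)                  # 0, 1, ..., 20 (use np.arange(6.0) for the 6-point variant)
    y = np.sin(x / 3.0) + rng.uniform(-0.25, 0.25, x.size)

    for n_hidden in (1, 2, 10, 50):
        mlp = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation="tanh",
                           solver="lbfgs", max_iter=5000, random_state=0)
        mlp.fit(x.reshape(-1, 1), y)
        x_dense = np.linspace(0, 20, 401).reshape(-1, 1)
        y_fit = mlp.predict(x_dense)     # inspect interpolation between training points
        train_mse = np.mean((mlp.predict(x.reshape(-1, 1)) - y) ** 2)
        print(f"{n_hidden:2d} hidden: training MSE = {train_mse:.4f}")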

[Figure 4: four panels (1, 2, 10, and 50 hidden nodes) plotting the approximation, the training data, and the target function without noise; axes are x (0 to 20) and y (-1.5 to 1.5).]

Figure 4. MLP interpolation of the function y = sin(x/3) + v in the range 0 to 20 as the number of hidden nodes is increased from 1 to 50. v is a uniformly distributed random variable between -0.25 and 0.25. A large degree of overfitting cannot be observed.

Network Size and Degrees of Freedom

A simple explanation for why larger networks can sometimes provide improved training and generalization error is that the extra degrees of freedom can aid convergence, i.e. the addition of extra parameters can decrease the chance of becoming stuck in local minima or on "plateaus", etc. (Krose & van der Smagt 1993). This section presents a visualization technique for showing the weights in the student networks as the network size is varied. A smaller task was used to aid visualization: the teacher network topology was 5:5:1 and the student networks contained 5, 15, and 25 hidden nodes. 1,000 training points were used and K was 2.

Figure 5 shows the weights in the student networks for the case when Gaussian noise with standard deviation 5% of the input standard deviation was added to the inputs (we also performed experiments with 0% and 10% which produced similar results). The diagrams are plotted as follows: the columns (1 to 6) correspond to the weights from the hidden nodes to the bias and the 5 input nodes. The rows are organized into groups of two with a space between each group. The number of groups is equal to the number of hidden nodes in the student network. For the two rows in each group, the top row corresponds to the teacher network and the bottom row corresponds to the student network. The idea is to compare the weights in the teacher and student networks. A couple of difficulties arise in this comparison which are resolved as follows. Firstly, there is no reason for hidden node 1 in the teacher network to correspond to hidden node 1 in the student network, etc. This problem is resolved by finding the best matching set of weights in the student network for each hidden unit in the teacher network, and matching the hidden nodes accordingly. These matches are ordered according to the quality of the match, i.e. the top two rows show the teacher network hidden node which was best approximated by a student hidden node. Likewise, the worst match is at the bottom. A second problem is that trying to match the weights from the hidden nodes to the input nodes does not take into account the output layer weights, e.g. exactly the same hidden node function could be computed with different weights if the hidden node weights are scaled and the output layer weights are scaled accordingly. For the case of only one output which is considered here, the solution is simple: the hidden layer weights are scaled according to the respective output layer weight. Each individual weight (scaled by the appropriate output weight) is plotted as follows: the square is shaded in proportion to the magnitude of the weight, where white equals 0 and black equals the maximum value for all weights in the networks. Negative weights are indicated by a white square inside the outer black square which surrounds each weight.
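A rough sketch of the matching and scaling step described above is given below: each hidden unit's bias and input weights are scaled by its output weight, and teacher units are greedily paired with their closest unused student unit. The paper only says "best matching", so the Euclidean distance used here is an assumption.

    import numpy as np

    def scaled_hidden_weights(W1, b1, w_out):
        """Per hidden unit: [bias, input weights], scaled by that unit's output weight."""
        return np.column_stack([b1, W1]) * w_out[:, None]

    def match_hidden_units(teacher, student):
        """Greedily pair each teacher hidden unit with its closest unused student unit."""
        pairs, used = [], set()
        for t_idx, t_row in enumerate(teacher):
            dists = [(np.linalg.norm(t_row - s_row), s_idx)
                     for s_idx, s_row in enumerate(student) if s_idx not in used]
            d, s_idx = min(dists)
            used.add(s_idx)
            pairs.append((t_idx, s_idx, d))
        return sorted(pairs, key=lambda p: p[2])   # best matches first, as in Figure 5

Given teacher weights (W1_t, b1_t, W2_t) and student weights (W1_s, b1_s, W2_s), calling match_hidden_units(scaled_hidden_weights(W1_t, b1_t, W2_t[0]), scaled_hidden_weights(W1_s, b1_s, W2_s[0])) orders the groups of rows as they are plotted.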

Observations: a) the teacher network weights are matched more closely by the larger networks (consider the fourth and fifth best matching groups of two rows), b) the extra weights in the larger networks contribute to the final approximation in only a minor way, c) the hidden units in the larger networks do not appear to be used redundantly in this case - this may be related to the artificial nature of the task, and d) the results indicate that pruning (and optionally retraining) the larger networks may perform well. A conclusion is that backpropagation can result in the under-utilization of network resources in certain cases (i.e. some parameters may be ineffective or only partially effective due to sub-optimal convergence).

[Figure 5: weight diagrams for student networks with 5, 15, and 25 hidden units.]

Figure 5. The weights after training in networks with 5, 15, and 25 hidden units for the case of Gaussian noise with standard deviation 5% of the standard deviation of the inputs. In each case, the results are shown for two networks with different random starting weights. The plotting method is described in the text.

Learning Theory

The results are not in contradiction with statistical learning theory. (Vapnik 1995) states that machines with a small VC dimension are required to avoid overfitting. However, he also states that "it is difficult to approximate the training data", i.e. for a given problem in MLP approximation, the goal is to find the appropriate network size in order to minimize the tradeoff between overfitting and poor approximation. Vapnik suggests that the use of a priori knowledge may be required for small training error and small generalization error. For the case of linear output neurons, Barron (1991) has derived the following bound on the total risk for an MLP estimator: O(C_f^2 / N_h) + O((N_h N_i / N_train) log N_train), where C_f is the first absolute moment of the Fourier magnitude distribution of the target function f and is a measure of the "complexity" of f. Again, a tradeoff can be observed between the accuracy of the best approximation (which requires larger N_h), and the avoidance of overfitting (which requires a smaller N_h / N_train ratio). The left-hand term (the approximation error) corresponds to the error between the target function and the closest function which the MLP can implement. For the noise-free artificial task, the approximation error is zero for N_h >= 10. Based on this equation, it is likely that N_h = 10 would be selected as the optimal network size (note that the results reported here use sigmoidal rather than linear output neurons). Why do the theory and practical results differ? Because the domain of applicability of the theory does not cover the practical case and the assumptions incorporated in the theory are not always reasonable. Specifically, this theory does not take into account limited training time, different rates of convergence for different f, or sub-optimal solutions.
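The tradeoff in Barron's bound can be visualized numerically; the constants below are arbitrary placeholders (the bound only specifies orders of growth), so this sketch shows the shape of the tradeoff rather than actual error values.

    import numpy as np

    C_f, N_i, N_train = 1.0, 20, 200       # placeholder complexity, plus the paper's task sizes
    N_h = np.arange(1, 51)
    approx_term = C_f ** 2 / N_h                          # O(C_f^2 / N_h)
    estim_term = (N_h * N_i / N_train) * np.log(N_train)  # O((N_h N_i / N_train) log N_train)
    total = approx_term + estim_term
    print("minimizing N_h for this choice of constants:", N_h[np.argmin(total)])

With the hidden constants set to one, the bound is minimized by a small N_h; the point made in the text is that this prescription ignores limited training time and sub-optimal convergence, which is where the empirical results diverge.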

Recent work by (Bartlett 1996) correlates with the results reported here. Bartlett comments: "the VC-bounds seem loose; neural networks often perform successfully with training sets that are considerably smaller than the number of network parameters". Bartlett shows (for classification) that the number of training samples only needs to grow according to A^(2l) (ignoring log factors) to avoid overfitting, where A is a bound on the total weight magnitude for a neuron and l is the number of layers in the network. This result and either an explicit (weight decay etc.) or implicit bias towards smaller weights leads to the phenomenon observed here, i.e. larger networks may generalize well, and better generalization is possible from larger networks if they can be trained more successfully than the smaller networks (e.g. reduced difficulty with local minima). For the task considered in this paper, the distribution of weights after training moves towards smaller weights as the size of the student network increases.

Conclusions

It can be seen that backpropagation fails to find an optimal solution in many cases. Furthermore, networks with more weights than might be expected can result in lower training and generalization error in certain cases. Overfitting behavior is significantly different in MLP and polynomial models - MLPs trained with BP are biased towards smoother solutions. Given infinite time and an appropriate alternate training algorithm, an optimal solution could be found for an MLP. However, the examples in this paper illustrate that the mode of failure exhibited by backpropagation can in fact be beneficial and result in better generalization over "improved" algorithms, in so much as the implicit smoothness bias created by the network structure and training algorithm matches the desired target function. This bias may account for part of the success MLPs have encountered over competing methods in real-world problems.

Acknowledgments

We would like to acknowledge useful discussions with and comments from R. Caruana, S. Hanson, G. Heath, B.G. Horne, T. Lin, K.-R. Muller, T. Rognvaldsson, and W. Sarle. The opinions expressed herein are those of the authors. Thanks to the Olivetti Research Laboratory and F. Samaria for compiling and maintaining the ORL database.

References

Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. In Petrov, B. N., and Csaki, F., eds., Proceedings of the 2nd International Symposium on Information Theory. Budapest: Akademia Kiado. 267-281.

Barron, A. 1991. Complexity regularization with application to artificial neural networks. In Roussas, G., ed., Nonparametric Functional Estimation and Related Topics. Dordrecht, The Netherlands: Kluwer Academic Publishers. 561-576.

Bartlett, P. 1996. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. Technical report, Australian National University.

Caruana, R. 1993. Generalization vs. Net Size. Denver, CO: Neural Information Processing Systems, Tutorial.

Crane, R.; Fefferman, C.; Markel, S.; and Pearson, J. 1995. Characterizing neural network error surfaces with a sequential quadratic programming algorithm. In Machines That Learn.

Giles, C. L., and Lawrence, S. 1997. Presenting and analyzing the results of AI experiments: Data averaging and data snooping. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, AAAI-97. Menlo Park, California: AAAI Press. 362-367.

Haykin, S. 1994. Neural Networks, A Comprehensive Foundation. New York, NY: Macmillan.

Hornik, K.; Stinchcombe, M.; and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2:359-366.

Krogh, A., and Hertz, J. 1992. A simple weight decay can improve generalization. In Moody, J.; Hanson, S. J.; and Lippmann, R. P., eds., Advances in Neural Information Processing Systems, volume 4. San Mateo, CA: Morgan Kaufmann. 950-957.

Krose, B., and van der Smagt, P. 1993. An Introduction to Neural Networks. University of Amsterdam, fifth edition.

Lawrence, S.; Giles, C. L.; and Tsoi, A. 1996. What size neural network gives optimal generalization? Convergence properties of backpropagation. Technical Report UMIACS-TR-96-22 and CS-TR-3617, Institute for Advanced Computer Studies, University of Maryland, College Park MD 20742.

Moody, J. 1992. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Moody, J.; Hanson, S. J.; and Lippmann, R. P., eds., Advances in Neural Information Processing Systems, volume 4. San Mateo, CA: Morgan Kaufmann. 847-854.

Ripley, B. 1995. Statistical ideas for selecting network architectures. Invited Presentation, Neural Information Processing Systems 8.

Rumelhart, D.; Hinton, G.; and Williams, R. 1986. Learning internal representations by error propagation. In Rumelhart, D., and McClelland, J., eds., Parallel Distributed Processing, volume 1. Cambridge: MIT Press. Chapter 8, 318-362.

Tukey, J. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.

Vapnik, V. 1995. The Nature of Statistical Learning Theory. Springer.

Weigend, A.; Rumelhart, D.; and Huberman, B. 1991. Generalization by weight-elimination with application to forecasting. In Lippmann, R. P.; Moody, J.; and Touretzky, D. S., eds., Advances in Neural Information Processing Systems, volume 3. San Mateo, CA: Morgan Kaufmann. 875-882.

Yu, X.-H. 1992. Can backpropagation error surface not have local minima. IEEE Transactions on Neural Networks 3:1019-1021.