
The Constraint Based Decomposition (CBD) training architecture

Sorin Drăghici
431 State Hall, Department of Computer Science, Wayne State University, Detroit, MI 48202, USA

The Constraint Based Decomposition (CBD) is a constructive neural network technique that builds a three or four layer neural network and is guaranteed to find a solution for any classification problem. CBD is shown to be able to solve complicated problems in a simple, fast and reliable manner. The technique is further enhanced by two modifications (locking detection and redundancy elimination) which address the training speed and the efficiency of the internal representation built by the network. The redundancy elimination aims at building more compact architectures while the locking detection aims at improving the training speed. The computational cost of the redundancy elimination is negligible and this enhancement can be used for any problem. However, the computational cost of the locking detection is exponential in the number of dimensions and it should only be used in low dimensional spaces. The experimental results show the performance of the presented algorithm on a series of classical benchmark problems including the 2-spiral problem and the Iris, Wine, Glass, Lenses, Ionosphere, Lung cancer, Pima indians, Bupa, Tic-Tac-Toe, Balance and Zoo data sets from the UCI machine learning repository. CBD's generalization accuracy is compared with that of C4.5, C4.5 with rules, incremental decision trees, oblique classifiers, linear machine decision trees, CN2, learning vector quantization (LVQ), backpropagation, nearest neighbor, Q* and radial basis functions (RBFs). CBD provides the second best average accuracy on the problems tested as well as the best reliability (the lowest standard deviation).

Introduction

Many training algorithms require that the initial architecture of the network be specified as a prerequisite for the training. If this is the case, the neural network practitioner is confronted with the difficult task of choosing an architecture well suited to the task at hand. If the chosen architecture is not powerful enough for the given task (e.g. it does not have enough hidden units or layers), the training will fail. On the other hand, if the chosen architecture is too rich, the representation built by the network will be inappropriate and the network will exhibit bad generalization properties (e.g. overfitting). There are two fundamentally different approaches to overcoming this problem. One of them is to build the network from scratch, adding units as needed. The category of algorithms following this approach is known as constructive algorithms. The other approach is to start with a very rich initial architecture and eliminate some of the units either during or after the training. This category of algorithms is known as pruning algorithms.

Constructive algorithms share a number of interesting properties. Most such algorithms are very fast and very reliable in the sense that they are guaranteed to converge to a solution for any problem in their scope. However, some constructive algorithms have a very limited scope and, in general, they are believed to give poor generalization results. Non-constructive algorithms are much less reliable than constructive algorithms and can fail to converge even when a solution exists (Brady, Raghavan, & Slawny, 1988). Some non-constructive algorithms can even converge to 'false' solutions (Brady, Raghavan, & Slawny, 1989). However, some researchers believe that they offer better generalization results and, therefore, are better suited to real world problems (Śmieja, 1993).

More recently, the issue of the 'transparency' of a neural network (or its ability to provide an explanation for its output) has become very important. Very often, neural network techniques are used in combination with a separate rule extraction module in which a different technique is used to translate the internal architecture of the network into some human understandable symbolic form. In many applications, the output of the network inspires little confidence without such a symbolic backup which can be analyzed by human experts.

The technique presented in this paper, the Constraint Based Decomposition (CBD), is a constructive algorithm that exhibits all the advantages of its category of training algorithms: CBD is guaranteed to find a solution for any classification problem, it is fast and it builds an architecture suitable for the given problem. Apart from these characteristics shared by all algorithms in its class, CBD has other interesting properties: 1) it is flexible and can be used with a variety of weight changing mechanisms; 2) it is more transparent than other neural network techniques inasmuch as it can provide a symbolic description of the internal structure of the network built during the training; 3) the generalization abilities of the network built during the training are comparable with or better than those of other neural and non-neural machine learning techniques; 4) it is very stable in the sense that the generalization accuracy has a very low standard deviation over many trials.

The paper is organized as follows. First, we present the basic version of the CBD technique. The CBD description is divided into several parts. The first subsection describes how the algorithm constructs the first hidden layer. The second subsection deals with subsequent layers and the third one presents a proof of convergence for the algorithm. Finally, the last two subsections in this part discuss some issues related to multiclass classification and extending the algorithm to problems with continuous outputs. Subsequently, the paper presents some enhancements of the basic technique. Although inspired by and designed for the CBD technique, these enhancements can be applied to other constructive algorithms as well. Two such enhancements are presented.

It has been stated that some constructive algorithms tend to generate rather cumbersome architectures which yield poor generalization (Śmieja, 1993). One of the enhancements deals with this issue. The technique is called redundancy elimination and is aimed at reducing the number of hyperplanes by eliminating those hyperplanes whose local discrimination is already implemented by some existing hyperplane. The technique is essentially different from any pruning technique inasmuch as it performs redundancy checks during the training as opposed to after the training. Thus, the technique prevents the training algorithm from deploying redundant hyperplanes in the first place instead of eliminating them in a post-training pruning phase.

The second enhancement is a technique called locking detection. This technique can be used with any constructive algorithm using subgoals and is aimed at identifying those situations in which the position of a particular hyperplane is locked and any further training would be wasteful. The paper presents two techniques able to identify such situations.

The paper then proceeds to present some experiments. The experimental part is divided into three subsections. The first such subsection presents the hypotheses to be verified by experiments. The second subsection presents the methods used and the third subsection presents the experimental results.

The two main aims of the experiments were i) to assess the impact of each individual enhancement upon the performance of the plain CBD algorithm and ii) to assess the training and generalization performances of the CBD algorithm against other neural and non-neural machine learning techniques. The training and generalization performances of the CBD are assessed on 2 classical benchmarks and 11 problems from the UCI machine learning repository. The results are compared with those of 10 other neural and non-neural machine learning techniques.

A discussion section reviews several algorithms and techniques that are related in various ways to CBD. Finally, the last section of the paper presents some conclusions.

The Constraint Based Decomposition (CBD) Algorithm

The CBD algorithm consists of i) a pattern presentation algorithm, ii) a construction mechanism for building the net and iii) a weight change algorithm for a single layer network. The pattern presentation algorithm stipulates how various patterns in the pattern set are presented to the network. The construction mechanism specifies how the units are added to form the final architecture of the trained network. Both the pattern presentation algorithm and the construction mechanism are specific to the CBD technique. The weight change algorithm specifies quantitatively how the weights are changed from one iteration to another. CBD can work with any weight change algorithm able to train a single neuron (gradient descent, conjugate gradient, perceptron, etc.) and some of its properties will depend on the choice of such algorithm.
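As an illustration of this division of labor (a sketch, not the author's code; the names SingleUnit and perceptron_train are hypothetical), the weight change component can be any routine that reliably trains one threshold unit:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SingleUnit:
    """One hidden unit: the hyperplane w.x + b = 0 in input space."""
    w: np.ndarray
    b: float = 0.0

    def side(self, x):
        # +1 for the positive half-space, -1 for the negative one
        return 1 if np.dot(self.w, x) + self.b >= 0 else -1

def perceptron_train(unit, patterns, targets, max_iter=1000):
    """A 'reliable' weight change mechanism for a single unit: the
    perceptron rule finds a separating position in finite time
    whenever one exists."""
    for _ in range(max_iter):
        errors = 0
        for x, t in zip(patterns, targets):
            if unit.side(x) != t:
                unit.w = unit.w + t * np.asarray(x, dtype=float)
                unit.b += t
                errors += 1
        if errors == 0:
            return True   # all subgoal patterns separated
    return False          # no separating position found in time
```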

The technique derives its name from a more general approach based on the idea of reducing the dimensionality of the search space through decomposing the problem into subproblems using subgoals and constraints defined in the problem space. However, the description of the general approach is outside the scope of the present paper. We shall limit ourselves to describing the CBD algorithm and discussing the related issues. For a formal definition of constraints and a complete description of the more general CBD approach, the reader is referred to (Draghici, 1995, 1994).

The first hidden layer

In its simplest form, CBD is able to solve any classification problem involving binary or real valued inputs. Without loss of generality, we shall consider a classification problem involving patterns from R^d (d ∈ N) belonging to two classes C1 and C2. The goal of our classification problem is to separate the patterns in the set S = C1 ∪ C2. For now, we shall assume that the units used implement a threshold function. Extensions of the algorithm to other types of units will be addressed later. We shall also assume that the chosen weight changing mechanism is able to train reliably a single neuron (e.g. perceptron for binary values). A reliable weight changing mechanism is defined as an algorithm that will find a solution in a finite time if a solution exists.

The CBD algorithm starts by choosing one random pattern from each class. Let these patterns be x1 belonging to C1 and x2 belonging to C2. Let S be the union of C1 and C2 and m+ and m− be the numbers of patterns in C1 and C2, respectively. The two patterns x1 and x2 constitute the current subgoal Scurrent = {x1, x2} and are removed from the set S. The training starts with just one hidden unit. This hidden unit will implement a hyperplane in input space. The algorithm uses the weight changing mechanism to move the hyperplane so that it separates the given patterns x1 and x2. If the pattern set is consistent (i.e. each pattern belongs to only one class), this problem is linearly separable¹. Therefore, a solution will always exist and the weight changing mechanism (assumed to be reliable) will find it. Then, the CBD algorithm will choose another random pattern from S and add it to the current subgoal Scurrent. The weight changing mechanism will be invoked again and will try to adjust the weights so that the current subgoal (now including three patterns) is separated.

¹ In n dimensions, one can start with an initial training set that includes n patterns chosen randomly from the two classes. If these n patterns are in general positions (i.e. they are not linearly dependent), they are linearly separable.


This subgoal training may or may not be successful. If the weight changing mechanism has been successful in adjusting the position of the hyperplane so that even the last added pattern is classified correctly, the algorithm will choose another random pattern from S, remove it from there, add it to the current subgoal Scurrent and continue in the same way.

However, unless the classes C1 and C2 are linearly separable (which would make the problem trivial), one such subgoal training will eventually fail. Since all previous subgoal trainings were successful, the pattern that caused the failure was the one which was added last. This pattern will be removed from the current subgoal. Also recall that this pattern was removed from S when it was added to Scurrent. Another pattern will be chosen and removed from S, added to the current subgoal and a new subgoal training will be attempted. Note that the weight search problem posed by the CBD algorithm to the weight changing mechanism is always the simplest possible problem since it involves just one layer, one unit and a pattern set containing at most one misclassified pattern. This process will continue until all patterns in S have been considered. At this point in time, the algorithm has trained completely the first hidden unit of the architecture. The position of the hyperplane implemented by this hidden unit is such that it correctly classifies the patterns in the set Scurrent and it misclassifies the patterns in the set (C1 ∪ C2) \ Scurrent. If the set of misclassified patterns is empty, the algorithm will stop because the current architecture solves the given problem. If there are patterns which are misclassified by the current architecture, the algorithm will analyze the two half-spaces determined by the hyperplane implemented by the previous hidden unit. At least one such half-space will be inconsistent in the sense that it will contain patterns from both C1 and C2. The algorithm will form a new goal S with the patterns from both C1 and C2 that are found in this inconsistent half-space. A new hidden unit will be added and trained in the same way so that it separates the patterns in the new goal S.
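A minimal sketch of this growing-subgoal loop follows (hypothetical names; train_unit stands for any reliable single-unit trainer such as the perceptron above, and units are assumed to expose copyable w and b attributes):

```python
import random

def train_hidden_unit(C1, C2, new_unit, train_unit):
    """Grow one hidden unit's subgoal pattern by pattern, CBD style.
    C1, C2: lists of input vectors; new_unit() -> fresh unit;
    train_unit(unit, patterns, targets) -> True on success."""
    C1, C2 = list(C1), list(C2)
    x1 = C1.pop(random.randrange(len(C1)))     # one random seed per class
    x2 = C2.pop(random.randrange(len(C2)))
    subgoal = [(x1, +1), (x2, -1)]
    rest = [(x, +1) for x in C1] + [(x, -1) for x in C2]
    random.shuffle(rest)
    unit = new_unit()
    # two patterns from a consistent set are always linearly separable
    assert train_unit(unit, [p for p, _ in subgoal], [t for _, t in subgoal])
    for x, t in rest:
        subgoal.append((x, t))
        w_copy, b_copy = unit.w.copy(), unit.b   # save h in hcopy
        ok = train_unit(unit, [p for p, _ in subgoal],
                        [t for _, t in subgoal])
        if not ok:                               # subgoal training failed:
            unit.w, unit.b = w_copy, b_copy      # restore h and drop the
            subgoal.pop()                        # last added pattern
    return unit, subgoal  # subgoal = Scurrent, correctly classified by unit
```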

The CBD algorithm is presented in Fig. 1 in a recursive form which underlines the divide-and-conquer strategy used. This version of the algorithm also constructs the symbolic representation of the solution as described below. The algorithm is presented as a recursive function which takes as parameters a region of interest in the input space, one set of patterns from each class and a factor. Initially, the algorithm will be called with region = R^d, all patterns in C1 and C2 and a null initial factor.

Subsequent layers

Since the patterns are now separated by the hyperplanes implemented by the hidden units, the outputs can now be obtained using a variety of methods depending on the particular needs of the problem. One such method is to synthesize the output using a layer of units implementing a logical AND and another layer implementing a logical OR (see Fig. 2). This particular method allows the algorithm to produce a symbolic description of the binary function implemented by the network. This symbolic description is built by the algorithm during the training and will be described in the following.


Let us assume that the result of the CBD search is a set of hidden units which implement the hyperplanes h1, h2, ..., hn. These hyperplanes will determine a set of regions in the input space. These regions will be consistent from the point of view of the given classification problem in the sense that they will contain only patterns of one class. Each such region can be expressed as an intersection of some half-spaces determined by the given hyperplanes. In the symbolic description provided by the algorithm, a consistent region will be described by a term. A term will have the form Ti = sign(h1)h1 · sign(h2)h2 · ... · sign(hn)hn → Cj, where sign(hi) can be 1, −1 or nil and Cj is the class to which the region belongs. A nil sign means that the corresponding hyperplane does not actually appear in the given term. A negative sign will be denoted by overlining the corresponding hyperplane, as in h̄i. No overline means the sign of the hyperplane is positive.

Each hyperplane will divide the input space into two half-spaces, one positive and one negative. A hyperplane and its sign will be represented by a factor. A factor can be used to represent one of the two half-spaces determined by the hyperplane. A term is obtained by performing a logical AND between factors. Not all hyperplanes will contribute with a factor to all terms. Finally, a logical OR is performed between terms in order to obtain the expression of the solution for each class. Fig. 3 presents an example involving 2 hyperplanes (lines) in 2D. The '+' character marks the positive half-space of each hyperplane. Using the notation above, the regions in Fig. 3 can be described as follows: A = h̄1h2, B = h̄1h̄2, C = h1h̄2 and D = h1h2. If class C1 included the union of regions B and D, we would describe it as C1 = h̄1h̄2 + h1h2.
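To make the factor/term notation concrete, here is a small illustration (hypothetical code, not from the paper) that encodes the worked example above, with each term stored as a tuple of required signs over (h1, h2) and 0 standing for a nil sign:

```python
import numpy as np

# Hyperplanes as (w, b); the sign of w.x + b selects a half-space.
hyperplanes = [(np.array([1.0, 0.0]), 0.0),   # h1: a vertical line
               (np.array([0.0, 1.0]), 0.0)]   # h2: a horizontal line

def factor(h, x):
    w, b = h
    return 1 if np.dot(w, x) + b >= 0 else -1

# C1 = h1_bar.h2_bar + h1.h2 (regions B and D of the example);
# C2 gets the two remaining regions.
class_terms = {"C1": [(-1, -1), (1, 1)], "C2": [(-1, 1), (1, -1)]}

def classify(x):
    for cls, terms in class_terms.items():
        for term in terms:                           # OR over terms
            if all(s == 0 or factor(h, x) == s       # AND over factors
                   for s, h in zip(term, hyperplanes)):
                return cls

print(classify(np.array([0.5, 0.5])))   # both signs positive (region D) -> C1
```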

The final network constructed by CBD will be described in a symbolic form by a set of expressions as explained above. Although some interesting conclusions can always be extracted from such symbolic descriptions (e.g. the problem is linearly separable or not; some classes are linearly separable, etc.), they are not always meaningful to the end user. Also, these symbolic forms of the solutions might become complicated for highly non-linear or high dimensional problems. More work needs to be done in order to explore fully the potential of CBD in this direction but this is beyond the scope of the present paper.

Proof of convergence

A simplified version of the algorithm in Fig. 1 will be used to prove the convergence of the CBD training. This simplified version is presented in Fig. 4. This simplified algorithm does not store the solution and it does not look for a good solution. An inefficient solution will suffice. In the worst case, the solution given by this algorithm will construct regions containing only a single pattern, which is very wasteful.


separate(region, C1 patterns from class 1, C2 patterns from class 2, factor)
    if C1 is empty or C2 is empty then
        return
    Build a subgoal S with patterns x1^C1 ∈ C1 and x1^C2 ∈ C2.
    Delete x1^C1 and x1^C2 from C1 and C2.
    Add a hidden unit and train it to separate x1^C1 and x1^C2. Let h be the hyperplane that separates them.
    for each pattern p in C1 ∪ C2 do
        Add p to the current subgoal S
        Save h in hcopy
        Train with the current subgoal S
        if not success then
            Restore h from hcopy
            Remove p from S
    end for
    Let newfactor = h+ · factor
    if the region determined by newfactor is consistent then
        Let Cj be the class of the patterns in this half-space
        Add newfactor as a new term in the description of the class Cj
    else
        Delete from C1 and C2 all patterns not in h+. Store the result in C1new and C2new.
        separate(h+, C1new, C2new, newfactor)
    Let newfactor = h− · factor
    if the region determined by newfactor is consistent then
        Let Cj be the class of the patterns in this half-space
        Add newfactor as a new term in the description of the class Cj
    else
        Delete from C1 and C2 all patterns not in h−. Store the result in C1new and C2new.
        separate(h−, C1new, C2new, newfactor)

Figure 1. The CBD algorithm.

Figure 2. The architecture built by the algorithm: a hyperplane layer (each unit connected to all input units), an AND layer (the units are connected only to the relevant hyperplane units) and an OR layer (the output units are connected only to the units which are turned on by (some of) the patterns of the given class).


Figure 3. The formalism used in the symbolic expression of the internal representation built by the network. Two hyperplanes h1 and h2 (the '+' marks their positive half-spaces) determine four regions: A = h̄1h2, B = h̄1h̄2, C = h1h̄2 and D = h1h2.

Let us assume there are m (distinct) patterns xi, i = 1, ..., m, of n input variables. Each pattern xi belongs to one of two classes C1 and C2. It is assumed that the number m of patterns is finite.

One can prove that this algorithm converges under the assumption that, for any two given patterns, a hyperplane that separates them can be found in a finite time. This assumption is used in step 3. From the termination condition 1 it is clear that if the algorithm terminates, this happens because the region contains only patterns from the same class or no patterns at all. Because the two patterns chosen in step 2 are separated (using the assumption above) in step 3, and because they are removed from the pattern set, both pattern set+ and pattern set− will contain at least one pattern less than pattern set. This implies that after at most m − 1 recursive steps the procedure separate will be called with a region containing just one pattern. Such a region satisfies the termination condition 1 and the algorithm will terminate. This worst-case situation happens when the consistent regions containing only patterns from the same class all contain just one pattern. In this case, the input space is shattered into m regions.

Note that the termination condition and the recursive mechanism are identical for both the simplified CBD algorithm in Fig. 4 and the CBD algorithm in Fig. 1. The difference between them is that the CBD algorithm in Fig. 1 tries to decrease the number of hyperplanes used by trying to optimize their positions with respect to all the patterns in the current training set (the first for loop in the algorithm in Fig. 1). This means that the CBD algorithm cannot perform worse than the simplified algorithm, i.e. it is guaranteed to converge in at most m − 1 recursive steps for any pattern set containing m patterns. The number of such steps, and therefore the number of hidden units deployed by the CBD algorithm in Fig. 1, will be somewhere between log2(min(m+, m−)) in the best case and max(m+, m−) − 1 in the worst case. The best case corresponds to a situation in which the algorithm positions the hyperplanes such that for each subgoal an equal number of patterns from the least numerous class is found in each half-space. The worst case corresponds to a situation in which each hyperplane only separates one single pattern from the most numerous class. One should note that the worst-case scenario takes into consideration only the pattern presentation algorithm presented above. In practice, d + 1 patterns in arbitrary positions in d dimensions will be linearly separable so, for a reliable weight changing mechanism, the worst case will never happen.
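Restated compactly, with k the number of hidden units deployed by the algorithm of Fig. 1 on a pattern set with m+ and m− patterns per class:

```latex
\log_2\!\bigl(\min(m^{+}, m^{-})\bigr) \;\le\; k \;\le\; \max(m^{+}, m^{-}) - 1
```

For example, with 100 patterns in each class, anywhere from 7 units (the best case, rounding up) to 99 units may be deployed.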

Multiclass classification

The algorithm can be extended to classification problems involving more than one class in several different ways. Let us assume the problem involves classes C1, C2, ..., Ck. The first approach is to choose two arbitrary classes (let us say C1 and C2) and separate them. This will produce a number of hyperplanes which determine a partition of the input space into regions R1, R2, ..., Rp. This first approach then iterates on these regions. The algorithm will be called recursively on each such region until all regions produced contain only patterns belonging to a single class. Since at every iteration the number of patterns is reduced by at least one, the algorithm will eventually stop for any finite number of patterns.

The second approach iterates on the classes of the problem. For each remaining class Cj, each pattern will be taken and fed to the network in order to identify the region in which this pattern lies. Let this region be Ri and the class assigned to this region be Cik. The algorithm will be called to separate Cj from Cik in Ri. In each such subgoal training, the patterns from classes subsequent to Cj (in the arbitrary order in which patterns are considered) will be ignored. The algorithm need not worry about the classes precedent to Cj because they have already been considered and the current region is consistent from their point of view. This process will be repeated until all patterns from all classes have been considered.

A third approach involves some modifications of the algorithm. The modified algorithm is presented in Fig. 5. In this case, the algorithm will choose a random pattern xi.


main
begin
    region = input space
    pattern set = whole pattern set
    separate(region, pattern set)
end

separate(region, pattern set)
begin
    1. if region contains only patterns from one class (or no class) then end proc
    2. take one pattern from each class and remove them from pattern set
    3. separate these two patterns with a hyperplane hp
    4. region+ = positive half-space of hp
    5. region− = negative half-space of hp
    6. pattern set+ = patterns in pattern set and in region+
    7. pattern set− = patterns in pattern set and in region−
    8. separate(region+, pattern set+)
    9. separate(region−, pattern set−)
end

Figure 4. A simplified version of the algorithm.
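The simplified algorithm is short enough to transcribe directly. The sketch below is illustrative only: it replaces the trained two-pattern separator of step 3 with the perpendicular bisector, which always separates two distinct points, and it leaves the two seed patterns in their half-spaces, which does not affect termination:

```python
import numpy as np

def separate(patterns):
    """patterns: list of (x, label) pairs from a consistent set (no
    identical points with different labels). Returns the list of
    hyperplanes (w, b) that split the set into consistent regions."""
    labels = {lab for _, lab in patterns}
    if len(labels) <= 1:                       # step 1: consistent region
        return []
    la, lb = sorted(labels)[:2]                # step 2: one pattern per class
    xa = np.asarray(next(x for x, lab in patterns if lab == la), float)
    xb = np.asarray(next(x for x, lab in patterns if lab == lb), float)
    # step 3: the perpendicular bisector of the segment xa-xb separates them
    w = xa - xb
    b = -np.dot(w, (xa + xb) / 2.0)
    on_plus = lambda x: np.dot(w, x) + b >= 0
    # steps 4-7: split the patterns between the two half-spaces
    plus  = [(x, lab) for x, lab in patterns if on_plus(x)]
    minus = [(x, lab) for x, lab in patterns if not on_plus(x)]
    # steps 8-9: recurse; each call sees strictly fewer patterns,
    # so at most m - 1 hyperplanes are created for m patterns
    return [(w, b)] + separate(plus) + separate(minus)
```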

Subsequently, all other patterns considered which do not belong to the same class Ci will be assigned a new target value T. Thus the weight changing mechanism will still be called on the simplest possible problem: separating just one misclassified pattern in a two-class problem. Practically, the hyperplanes will be positioned so that the class of the pattern chosen first will be separated from all others. Since the first pattern is chosen at random for each subgoal pattern (i.e. for each hidden unit), no special preference will be given to any particular class. This is the approach implemented in the version which was tested on the problems from the UCI machine learning repository.

If the problem to be solved involves many classes and if a distributed environment is available, one could consider another approach to multiclass classification. In this approach, the algorithm will be executed in parallel on a number of processors equal to the number of classes in the problem. Each processor will be assigned a class and will separate the assigned class from all the others. This can be achieved through re-labeling the patterns of the n − 1 classes which are not assigned. The solution provided by this version of the algorithm will be more wasteful through the redundancy of the solutions but it might still be interesting due to the parallel features and the availability of 'class experts', i.e. networks specialized in individual classes. We are currently working on a more interesting implementation of this algorithm in a distributed environment that allows the use of an arbitrary number of heterogeneous processors.
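The re-labeling needed for these 'class experts' is a one-line transformation (a sketch with hypothetical names):

```python
def one_vs_rest_tasks(patterns):
    """patterns: list of (x, class_label) pairs. Yields one binary
    task per class, suitable for training the class experts in
    parallel, one per processor."""
    for c in sorted({lab for _, lab in patterns}):
        yield c, [(x, +1 if lab == c else -1) for x, lab in patterns]
```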

Continuous outputs

The CBD algorithm was described above in the context of classification problems. However, this is not an intrinsic limitation of this algorithm. The algorithm can be extended to deal with continuous outputs in a very simple manner that will be described in the following.

A first step is to use only the sign of the patterns in order to construct the network using the CBD algorithm presented above. Using only the sign of the patterns transforms the problem into a classification problem and the algorithm can be applied directly. The architecture obtained from this step is already guaranteed to separate the patterns with the hyperplanes implemented by the first hidden layer. Now, the threshold activation functions of all neurons will be substituted by sigmoid activation functions and the training will proceed with the desired analog values.

Since the hidden layer already separates the patterns, a low error (correct classification) weight state is guaranteed to exist for the given problem. The weights of the first hidden layer will be initialized with weight vectors having the same direction as the weight vectors found in the previous step. A number of weight changing mechanisms (e.g. backpropagation, quickprop, conjugate gradient descent, etc.) can now be used in order to achieve the desired continuous I/O behavior by minimizing the chosen error measure.
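A minimal sketch of this initialization (illustrative; the steepness constant k and all names are assumptions, not the paper's values): the trained hyperplanes keep their directions while the thresholds become steep sigmoids, leaving the network in a low-error state from which any gradient-based method can continue:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hyperplanes found in the classification phase (one row per hidden unit).
W_found = np.array([[1.0, -2.0],
                    [0.5,  3.0]])
b_found = np.array([0.2, -1.0])

k = 10.0                          # sigmoid(k*z) approximates a threshold
W, b = k * W_found, k * b_found   # same directions => separation preserved

def hidden_layer(x):
    """Differentiable stand-in for the threshold hidden units."""
    return sigmoid(W @ x + b)

# From this starting point, backpropagation, quickprop, conjugate
# gradient, etc. can adjust all layers toward the desired analog outputs.
print(hidden_layer(np.array([1.0, 1.0])))
```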

Enhancements of the Constraint Based Decomposition

Efficient use of the hyperplanes

Due to its divide and conquer approach and the lack of interaction between solving different problems in different areas of the input space, the solution built by the CBD algorithm will not necessarily use the minimum number of hyperplanes. This is a characteristic common to several constructive algorithms and some authors use this as an argument to sustain the idea that constructive algorithms offer generalization properties which are worse than those of other types of training algorithms (Śmieja, 1993).


separate(region, C1, C2, ..., Ck, factor)
    if all Ci, i = 1, ..., k are empty then
        return
    Build a subgoal S with an arbitrary pattern xi^Ci ∈ Ci and xj^Cj ∈ Cj, j ≠ i.
    Assign to xj^Cj the new target value T ≠ Tl, l = 1, ..., k
    Delete xi^Ci and xj^Cj from Ci and Cj.
    Add a hidden unit and train it to separate xi^Ci and xj^Cj. Let h be the hyperplane that separates them.
    for each pattern p in C1 ∪ C2 ∪ ... ∪ Ck do
        Add p to the current subgoal S. If p ∉ Ci assign it the target value T
        Save h in hcopy
        Train with the current subgoal S
        if not success then
            Restore h from hcopy
            Remove p from S
    end for
    Let newfactor = h+ · factor
    if the region determined by newfactor is consistent then
        Let Cj be the class of the patterns in this half-space
        Add newfactor as a new term in the description of the class Cj
    else
        Delete from C1 and C2 all patterns not in h+. Store the result in C1new and C2new.
        separate(h+, C1new, C2new, newfactor)
    Let newfactor = h− · factor
    if the region determined by newfactor is consistent then
        Let Cj be the class of the patterns in this half-space
        Add newfactor as a new term in the description of the class Cj
    else
        Delete from C1 and C2 all patterns not in h−. Store the result in C1new and C2new.
        separate(h−, C1new, C2new, newfactor)

Figure 5. The multiclass CBD algorithm.

The remainder of the current section will present the redundancy problem and some approaches to addressing it. Fig. 6 presents a non-optimal solution which uses three hyperplanes in a situation in which two hyperplanes would be sufficient. Since the problem is not linearly separable, the CBD algorithm will start by separating some of the patterns using a first hidden unit. This first hidden unit will implement either hyperplane 1 or hyperplane 2. Let us suppose it implements hyperplane 1. Subsequently, the search will be performed in each of the half-spaces determined by hyperplane 1. At least one hyperplane will be required to solve the problem in each half-space even though one single hyperplane could separate all patterns.

There are different types of optimizations which can be performed. In Fig. 7, hyperplanes 4 and 5 are redundant because they classify in the same way (up to a sign reversal) all patterns in the training set. This type of redundancy will be called global redundancy because the hyperplanes perform the same classification at the level of the entire training set. This type of redundancy can be eliminated by checking at the end of the training whether there are two different hidden units which classify all patterns in the same way.

This type of redundant unit is equivalent to the "non contributing units" described in (Sietsma & Dow, 1991). These non contributing units are described as "units which [...] have outputs across the training set which mimic the outputs of another unit". Eliminating this type of redundancy as in (Sietsma & Dow, 1991) involves changing all the weights connected to the unit which will be preserved. In the constructive CBD approach, eliminating this type of redundancy at the end of the training would involve only removing the redundant unit and reconnecting all its outgoing weights to the other unit performing the same global classification. However, a better redundancy elimination method will be presented shortly.

In the same Fig. 7, hyperplanes 2 and 3 are only locally redundant. This means they perform the same separation in a limited region of the input space, in this case the positive half-space of hyperplane 1.


Figure 6. A non-optimal solution (three hyperplanes, labeled 1-3, where two would suffice).

Figure 7. Characterizing redundancy. Hyperplanes 4 and 5 are globally redundant whereas 2 and 3 are locally redundant.

However, the hyperplanes are not globally redundant because they classify patterns (a) and (b) differently. Consequently, a global search for redundant units cannot eliminate this type of redundancy.

It is interesting to note that, in the case of the CBD algorithm (and most other constructive techniques), other types of redundancy discussed in (Sietsma & Dow, 1991), such as units which have constant output across the training set, are avoided by the algorithm itself. This is because the algorithm trains each unit so that it separates at least one pattern (or for another specific purpose) and therefore the output will not be constant across the training set. The same can be said for unnecessary-information units which do not appear in the CBD constructive algorithm for the same reasons.

Eliminating redundancy.

A straightforward approach is to optimize the position of a hyperplane with respect to all patterns and not only with respect to the patterns in the current region. Thus, after obtaining the best position of a hyperplane with respect to the region of the space in which the algorithm currently works, the position of the hyperplane can be optimized with respect to all patterns which have not been correctly classified yet. Subsequently, each region determined by the new hyperplane (by splitting old regions crossed by it) is checked for consistency and labeled if possible.

This approach brings some difficulties. Firstly, the pattern set of a subgoal will be formed by two types of patterns. Suppose the algorithm is called in region R of the input space. The first type will contain the patterns contained in R. These patterns must be separated as well as possible because their separation ensures the convergence of the algorithm. The second type contains the patterns outside R whose separation is desired but not compulsory. Measures must be taken to ensure that the patterns outside R will not determine a depreciation of the classification score in R. For instance, one could check the classification score in R after each weight change and stop when this score starts to decrease.


In order to improve the general efficiency, after a hyperplane is positioned as well as possible, the half-space with more patterns could be considered first for further separation of the patterns which are not yet separated. This would ensure that in the optimization stage, when all patterns are taken into consideration, the possibility of the classification score in the current region R being worsened by the patterns outside R is diminished. This feature offers a better use of the hidden units for the price of a slower training. The training speed is reduced for two reasons. Firstly, each subgoal training set will consider more patterns than in the standard CBD approach. Secondly, for each new hyperplane, all new regions determined by it must be checked for consistency and this could lead to a combinatorial explosion of the number of checks to be performed.

The redundancy elimination approach as discussed so far also tends to express the solution as a union of small regions even if a better expression (as only one bigger region) exists. This determines a shattering of the input space into unnecessarily small classification areas. However, this problem can be solved by analyzing (automatically) the symbolic expression generated and reducing it to its simplest form. Although the inputs to the network maintain the ability to cope with real values, the rules generated as a result of the learning are expressed as Boolean functions. This allows the use of classical optimization algorithms (like Quine-McCluskey (Quine, 1955; McCluskey, 1956)) suitable for automation.
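As an illustration of this simplification step (not from the paper), sympy's SOPform performs exactly this kind of two-level minimization; here three adjacent one-region terms collapse into two larger regions:

```python
from sympy import symbols
from sympy.logic import SOPform

h1, h2, h3 = symbols("h1 h2 h3")

# Minterms over (h1, h2, h3) as they might be produced by training:
# h1.h2.h3 + h1.h2.h3_bar + h1.h2_bar.h3
minterms = [[1, 1, 1], [1, 1, 0], [1, 0, 1]]

print(SOPform([h1, h2, h3], minterms))
# -> (h1 & h2) | (h1 & h3): two larger regions instead of three small ones
```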

Although this redundancy elimination approach is feasible, it also has some disadvantages: a more complicated treatment of the patterns, slower training and the solution needlessly expressed as a union of small regions. Furthermore, this approach waits until the training is finished before attempting to do anything. A much better approach is to check for redundancies during the construction of the network. Thus, precious training time can be spared, the solution can be obtained directly in a more compact form and the generalization can be improved by having a more compact internal representation. Such an algorithm will be presented in the following.

Let us consider the problem presented in Fig. 7. Let us suppose the first hyperplane introduced during the training is hyperplane 1. Its negative half-space is checked for consistency, found to be inconsistent and the algorithm will try to separate the patterns in this negative half-space (the others will be ignored for the moment). Then, let us assume that hyperplane 2 will be placed as shown (see also Fig. 8). Its positive half-space (intersected with the negative half-space of hyperplane 1) is consistent and will be labeled as a black region. The algorithm will now consider the region determined by the intersection of the negative half-spaces of 1 and 2, which is inconsistent. Hyperplane 4 will be added to separate the patterns in this region and the global solution for the negative half-space of 1 will be h̄1h2 + h̄1h̄2h4 as a black region and h̄1h̄2h̄4 as a white region.

The situation after adding hyperplanes 1, 2 and 4 is presented in Fig. 8. Then, the algorithm will consider the positive half-space of hyperplane 1 and will try to separate the patterns in this region. A new hyperplane will be introduced to separate the patterns in the positive half-space of hyperplane 1. Eventually, this hyperplane will end up between one of the groups of white patterns and the group of black patterns. Let us suppose it will end up between groups (a) and (b). This hyperplane will be redundant because hyperplane 2 could perform exactly the same task.

The redundancy is caused by the fact that the algorithm takes into consideration only local pattern information, i.e. only the patterns situated in the area currently under consideration, ignoring the others. At the same time, this is one of the essential features of the algorithm, the feature which ensures the convergence and yields a high training speed. Considering all the patterns in the training set is a source of problems (consider for instance the herd effect which characterizes a situation in which several hidden units follow the same weight evolution (Fahlman & Lebiere, 1990)). One seems to face an insoluble question. Can an algorithm be local so that the training is easy and global so that too much redundancy is avoided? The answer, at least for the constructive CBD algorithm, is affirmative. The solution is to consider the patterns locally, as in standard CBD, but to take into consideration previous solutions as well. Thus, although the patterns are not considered globally, which would make the problem difficult, some global information is used which will eliminate some redundancy from the final solution.

Let us reconsider separating the patterns in the positive half-space of hyperplane 1 with a new hyperplane between groups (a) and (b). Instead of automatically accepting this position, the algorithm could check whether there are other hyperplanes that classify the patterns in the positive half-space of 1 in the same way. In this case, hyperplane 2 does this and it will be used instead of a new hyperplane. Note that this does not affect in any way the previous partial solutions and therefore the convergence of the algorithm is still ensured. At the same time this modification ensures the elimination of both global and local redundancy and is done without taking into consideration all patterns.

Computationally, this check needs only two passes (to cater for the possibly different signs) through the current subgoal. In each pass, the output of the candidate hyperplane unit is compared with the outputs of the existing units. If at the end of this, an existing unit is found to behave like the candidate unit, the existing unit will substitute the candidate unit, which will be discarded.
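In code, the check reduces to comparing sign patterns over the current subgoal, once directly and once with the signs reversed (a sketch with hypothetical names):

```python
import numpy as np

def find_equivalent_unit(candidate, old_units, subgoal):
    """Return an existing unit (w, b) that classifies every pattern in
    subgoal like `candidate` up to a global sign reversal, or None.
    If a unit is returned, the candidate can be discarded."""
    def signs(unit):
        w, b = unit
        return np.array([1 if np.dot(w, x) + b >= 0 else -1
                         for x in subgoal])

    target = signs(candidate)
    for unit in old_units:
        s = signs(unit)
        # same classification, or the same with all signs flipped
        if np.array_equal(s, target) or np.array_equal(s, -target):
            return unit
    return None
```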

This idea for redundancy elimination has been implemented and tested as an enhancement of the constructive CBD algorithm. The enhanced algorithm is presented in Fig. 9.

In order to further improve the efficiency of the algorithm, one should consider the elimination of the potential redundancies, i.e. those determined by two hyperplanes which are not redundant (either locally or globally) but could be so without affecting the solution. In other words, there is no perfect equivalent for the candidate hyperplane among the old hyperplanes but one of them could become such without affecting the previous solutions. For instance, in Fig. 7, the old hyperplane 4 can substitute the candidate hyperplane 5 with a small change in its position and without affecting the separation of the patterns in the negative half-space of hyperplane 1, i.e. the region for which hyperplane 4 has been introduced.


Figure 8. Hyperplanes 2 and 4 separate the patterns in the negative half-space of hyperplane 1. A new hyperplane is needed to separate the patterns in the positive half-space of 1.


In order to implement this, the redundancy elimination algorithm described could perhaps be further modified so that the old hyperplanes are not eliminated at the first pattern which is classified differently from the classification given by the candidate hyperplane, but only after further patterns have been classified differently. The number of further patterns taken might perhaps be determined by how important it is to eliminate the global redundancy for the given problem. If this could be done, at the end of the cycle through the patterns the set of old hyperplanes will contain hyperplanes which classify a majority of the patterns (in the current region) in the same way as the candidate hyperplane. Then, an attempt can be made to train these potentially redundant hyperplanes to classify correctly the patterns in the region they have been introduced for and to classify the patterns in the current region as the candidate hyperplane does. Once again, precautions must be taken to ensure that the optimization training does not affect the classification of previously considered patterns because this is the element which guarantees the convergence of the algorithm.

Locking detection

One of the characteristics of the constraint based decomposition algorithm is that the actual weight updating is performed only in very simple networks with just one non-input neuron. In geometrical terms, only one hyperplane is moved at any single time. In the following discussion we shall consider a problem in a d dimensional space. There are two qualitatively distinct training situations. The first situation is that of a first training after a new neuron has been added. In this situation, the pattern set contains patterns which form a linearly separable problem and the problem can always be solved. This is because the number of the patterns is restricted to at most d and the patterns are assumed to be in general position (i.e. not belonging to a sub-space with fewer dimensions). The second situation is that of adding a pattern to a training set containing more than d patterns. In this case, the problem is to move the existing hyperplane so that even the last added pattern is correctly classified. There is no guarantee that a solution exists for this problem because the last pattern could have made the problem linearly inseparable. This determines a termination problem. When should the training be stopped if the error will never go below its error limit?

The simplest solution is to use a time-out condition. The training is halted if no solution has been found in a given number of iterations Nmax. This condition is used by the simplest implementation of the CBD algorithm (the standard CBD) and by the vast majority of constructive algorithms like tiling (Mezard & Nadal, 1989), the upstart (Frean, 1990), extentron (Baffes & Zelle, 1992), the pocket algorithm (Gallant, 1986), divide and conquer networks (Romaniuk & Hall, 1993) and so forth. A notable exception to this approach is the CARVE algorithm (Young & Downs, 1998) which substitutes the a priori chosen timeout limit with an (equally arbitrarily chosen) pair of numbers Ninit, Nrot which determines the number of initial hyperplanes and the number of rotations involved in the process of finding a linearly separable subset of patterns of one class.

If the timeout condition is used, the choice of Nmax is crucial for the performance of the algorithm. A large Nmax will mean that the algorithm could spend a long time trying to solve problems which do not have a solution and this will dramatically affect the total training time. A small Nmax will cause many training sub-sessions (subgoal sessions for CBD) to be declared insoluble even if they have a solution. For CBD, this second situation will result in the use of a large number of hidden units and a fragmentation of the global solution which can be undesirable for some problems. An excessively small Nmax will have negative effects upon the overall I/O mapping, the training time or both, for all algorithms which use this termination condition. Unfortunately, the number of iterations required is not the same for all training sessions and cannot be decided a priori. Some heuristics are needed to ensure that i) most of the training problems which have a solution will be solved and ii) not too much time (ideally no time at all) will be spent on those problems which cannot be solved.


separate(region, C1 patterns from class 1, C2 patterns from class 2, factor)
    if C1 is empty or C2 is empty then
        return
    Build a subgoal S with patterns x1^C1 ∈ C1 and x1^C2 ∈ C2.
    Delete x1^C1 and x1^C2 from C1 and C2.
    Add a hidden unit and train it to separate x1^C1 and x1^C2. Let h be the hyperplane that separates them.
    for each pattern p in C1 ∪ C2 do
        Add p to the current subgoal S
        Save h in hcopy
        Train with the current subgoal S
        if not success then
            Restore h from hcopy
            Remove p from S
    end for
    Initialize old_hp_set with the set of existing hyperplanes
    for each pattern p ∈ C1 ∪ C2 do            /* this is the redundancy check */
        for each hp in old_hp_set do
            if p is classified differently by h and hp then
                /* h and hp are not redundant */
                remove hp from old_hp_set
    if old_hp_set is not empty then
        /* any of the elements of old_hp_set is redundant with h; pick any of them */
        h = any of the elements of old_hp_set
    Let newfactor = h+ · factor
    if the region determined by newfactor is consistent then
        Let Cj be the class of the patterns in this half-space
        Add newfactor as a new term in the description of the class Cj
    else
        Delete from C1 and C2 all patterns not in h+. Store the result in C1new and C2new.
        separate(h+, C1new, C2new, newfactor)
    Let newfactor = h− · factor
    if the region determined by newfactor is consistent then
        Let Cj be the class of the patterns in this half-space
        Add newfactor as a new term in the description of the class Cj
    else
        Delete from C1 and C2 all patterns not in h−. Store the result in C1new and C2new.
        separate(h−, C1new, C2new, newfactor)

Figure 9. The CBD with redundancy elimination.

These heuristics will be discussed for CBD here but can be easily extended to other training algorithms which use the same termination condition.

The idea of locking was inspired by a symbolic AI technique (the candidate elimination (Mitchell, 1977, 1978)). The present paper describes a locking technique which can be used to improve the training performance of constructive algorithms. Although the use of locking will be discussed here in connection with the CBD algorithm, this is not a limitation and the idea can be used with other constructive algorithms as well.

A locking situation can be defined as a situation in which the patterns contained in the current training subset determine the position of the dividing hyperplane up to a suitably chosen tolerance. In this situation, adding new patterns to the training set and continuing the training is useless because the position of the dividing hyperplane cannot be changed outside the given tolerance without misclassifying some patterns. An example of such a situation is given in Fig. 10. The training speed of many constructive algorithms can be improved if attempts to train further in such situations are eliminated.

We shall start by considering the following definitions.

Definition. Two sets are said to be linearly separable if and only if there exists a hyperplane H that separates them.


Figure 10. Two locking situations in two dimensions (Case A and Case B). The color of the patterns represents their class. The position of the hyperplane cannot be changed without misclassifying some patterns.

Let C1 and C2 be two linearly separable sets of patterns from two classes. C1 and C2 determine a locking situation with tolerance ε0 if and only if, for all

di(x1, ..., xn) = a1x1 + ... + anxn + an+1
dj(x1, ..., xn) = b1x1 + ... + bnxn + bn+1

such that:

1. di(pk) ≥ 0, ∀ pk ∈ C1
2. di(pk) < 0, ∀ pk ∈ C2
3. dj(pk) ≥ 0, ∀ pk ∈ C1
4. dj(pk) < 0, ∀ pk ∈ C2

we have

arccos[ (ni · nj) / (‖ni‖ ‖nj‖) ] ≤ ε0,

where ni = (a1, a2, a3, ..., an) and nj = (b1, b2, b3, ..., bn) are the gradient vectors of di and dj respectively.

In the definition above, conditions (1) and (2) mean that di is a separating hyperplane for the classes C1 and C2. Similarly, conditions (3) and (4) mean that dj is a separating hyperplane as well. The sign of the classification is not relevant because for any di classifying C1 in its negative half-space there exists a −di such that the negative half-space of −di is the positive half-space of di. The definition says that the patterns determine a locking situation with tolerance ε0 if and only if the angle between the normals of any two separating hyperplanes is less than ε0.

Note that according to this definition, any two linearly separable sets of patterns create a locking situation with some tolerance. In practice, we are interested in those locking situations that have a tolerance such that any further training is likely to be of little use. These instances will be called tight tolerance locking situations.

A locking detection mechanism by itself would not be very useful unless one could calculate a meaningful locking tolerance using some simple statistical properties of the data set of the given problem. We shall consider a locking situation as presented in case B in Fig. 10. It is clear that the patterns impose more restrictions when the pairs of opposite patterns are farther apart (see Fig. 11). For this reason, we shall consider the furthest two such pairs of patterns for any given data set.

We can consider the smallest hypersphere S that includes all patterns. Let the radius of this hypersphere be r. We are interested in the most restricted situation so we will consider that the two pairs of patterns from Fig. 11 are situated as far apart as possible, i.e. on the surface of this hypersphere. We shall keep the distance between the patterns in pair A fixed and we shall study what happens when the distance between the patterns in pair B increases. Keeping the distance between the patterns in pair A fixed and assuming that this distance is much smaller than the diameter of the hypersphere means that we assume the hyperplane can move only by rotation around the pair A (see Fig. 12).

Furthermore, let us assume that the minimum Euclidean distance between two patterns from different classes is dmin and that the two patterns p1 and p2 in pair B are separated by this distance. If we consider the hyperplane rotating around the pair A, we obtain an angle β = arcsin(dmin/(4r)). This angle can be taken as a reasonable value for the locking tolerance since we are interested in movements of the hyperplane that will change the classification of at least one point. Note that considering the case in which the points p1 and p2 are on the hypersphere is a worst-case situation in the sense that, if another pair of patterns of opposite classes, let us say p′1 and p′2, existed closer to the center, the hyperplane would need to move by more than α in order to change the classification of any such point. Also note that a tolerance calculated in this way has only a heuristic value because one can easily construct artificial examples in which changing the position of the hyperplane by less than α would change the classification of some patterns. However, the experiments showed that such a value is useful in practice and determines an improvement of the training time.
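Computing this heuristic tolerance from a data set is inexpensive. The sketch below is illustrative only: it approximates the enclosing hypersphere from the centroid rather than solving the exact smallest-sphere problem, and all names are assumptions:

```python
import numpy as np
from scipy.spatial.distance import cdist

def locking_tolerance(C1, C2):
    """beta = arcsin(dmin / (4 r)), where dmin is the smallest distance
    between patterns of opposite classes and r is the radius of a
    hypersphere enclosing all patterns."""
    C1, C2 = np.asarray(C1, float), np.asarray(C2, float)
    dmin = cdist(C1, C2).min()                    # closest opposite-class pair
    allp = np.vstack([C1, C2])
    r = np.linalg.norm(allp - allp.mean(axis=0), axis=1).max()
    return np.arcsin(dmin / (4.0 * r))            # argument is at most 1/2
```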

The following criterion will identify some elements which influence how tight a locking situation is.

Figure 11. Two pairs of opposite patterns locking a hyperplane. For the same distance between patterns within a pair, the further away the pairs are, the more restricted the position of the hyperplane will be.

Figure 12. If we rotate a hyperplane around the point A with an angle less than β, the classification of either p1 or p2 will remain the same.

Characterization of a locking situation. Let us consider two sets of patterns from two linearly separable classes Ci and Cj. Let hk be a separating hyperplane, xi be an arbitrary pattern from Ci and Hkj be the convex hull determined by the projections on hk of all points in Cj. The following observations will characterize a locking situation:

1. A necessary condition for a tight tolerance locking situation determined by xi and Cj is that the projection of xi on any separating hyperplane hk fall in the convex hull Hkj (the projection condition).

2. A necessary condition for a tight tolerance locking situation determined by x1, ..., x_{m1} from C1 and y1, ..., y_{m2} from C2 is that the intersection of the convex hulls of the projections of x1, ..., x_{m1} and y1, ..., y_{m2} on any dividing hyperplane be non-degenerate² (the intersection condition).

3. The tolerance of the locking situation determined by xi and Cj is inversely proportional to the maximum distance from xi to a separating hyperplane hk.

4. The tolerance of the locking situation determined by xi and Cj is inversely proportional to the maximum distance from the projection of xi on a dividing hyperplane hk to the centroid³ of the convex hull Hkj.

A full justification for the criterion above will be omitted for space reasons. However, cases 1-4 above are exemplified for the two-dimensional case in Fig. 13, Fig. 14, Fig. 15 and Fig. 16, respectively. In these figures, the shaded areas represent the set of possible positions for a dividing hyperplane. The smaller the surface of the shaded areas, the tighter the locking situation.

Figure 13. A loose locking situation (condition 1). The projection of the white pattern does not fall within the convex hull determined by the projections on the separating hyperplane of the patterns from the opposite class.

Figure 14. A loose locking situation (condition 2). In the situation above, the convex hulls of the projections of the white and black patterns do not intersect on all dividing hyperplanes. In the situation below, the hulls do intersect and the locking is tight.

Figure 15. A loose locking situation (condition 3). The white pattern is far from the furthest separating hyperplane.

Figure 16. A loose locking situation (condition 4). The projection of the white pattern xi is far from the centroid of the convex hull determined by the projections of patterns from Cj.

² There is no (d-1)-dimensional space that includes the intersection. This also implies that the intersection is not empty.

³ The centroid of a set of points p1, ..., pn ∈ R^d is their arithmetic mean (p1 + ... + pn)/n, i.e. the point whose coordinates are the arithmetic means of the corresponding coordinates of the points p1, ..., pn.

Two locking detection heuristics

We shall present and discuss two heuristics for locking detection based on the discussion above. Although these heuristics do not take into consideration all possible types of locking, they have been shown to be able to improve the training speed. For a more detailed discussion of various types of locking see (Draghici, 1995).

Let us suppose there are two classes C1 and C2 in a d-dimensional space. A possibility for pinpointing the position of a hyperplane in a d-dimensional space is to have d+1 points of which d are on one side of the hyperplane and one is on the other side of it. This would be the d-dimensional generalization of the locking situation determined by three patterns (case A in Fig. 10). Although there are other types of locking situations, the analysis will concentrate on this type first.

For this situation, the extreme positions of the dividing hyperplane such that all patterns are still correctly classified are determined by all combinations of d-1 patterns from C1 and a pattern from C2. Each such combination (d points in a d-dimensional space) determines a hyperplane and all these hyperplanes determine a simplex⁴ in the d-dimensional space. If all these hyperplanes are close (the simplex is squashed towards the dividing hyperplane) then this hyperplane is pinpointed and the locking has occurred. The locking test can be a comparison of the gradients of the hyperplanes determined by combinations of d-1 points from one class and one point from the other class. A heuristic for locking detection based on the ideas presented above is given in Fig. 17.

An exhaustive search for this type of locking would include all the possible combinations of d patterns from one class and one pattern from the other class. If the number of patterns in both classes (C1 ∪ C2) is much larger than the number d of dimensions of the space (which is usually the case), this search could take a long time. However, this exhaustive search is not necessary because the purpose of the search is to investigate the locking of the current hyperplane in its present position. Only those patterns which are close to the hyperplane are likely to lock it. Consequently, only the d patterns closest to the hyperplane need be taken into consideration when locking is investigated and this is the justification for giving the algorithm only for the particular case involving d points from one class and one point from the other class. Since at every step only one pattern is added, we only need to compare the distance from this pattern to the separating hyperplane. If the newly added pattern is not closer to this hyperplane than any of the d closest patterns at the previous step, no further computation is necessary. If the pattern is closer, then the list of the closest d patterns will be updated and the locking will be checked again. Therefore, the best case complexity of this heuristic is constant in the number of patterns. However, on average, it is reasonable to expect a linear dependency on the number of patterns in the problem.

detect locking with ε0 tolerance (p1, ..., pd ∈ C1, pk ∈ C2)
begin
    if the projection condition is not satisfied then
        locking has not been detected; return
    for each i from 1 to d do
        construct hi with the points {pk, p1, ..., p_{i-1}, p_{i+1}, ..., pd}
        calculate the angle α between the hyperplane hi and h
        if α > ε0 then
            locking has not been detected; return
    locking has been detected
end

Figure 17. Heuristic 1 for locking detection in d dimensions.

⁴ A simplex is the geometrical figure consisting, in d dimensions, of d+1 points (or vertices) and all their interconnecting line segments, polygonal faces, etc. In two dimensions, a simplex is a triangle. In three dimensions it is a tetrahedron, not necessarily the regular tetrahedron.

However, there are tight locking situations that are not detected by the heuristic presented above. An example of such a locking situation in two dimensions is presented in case B in Fig. 10. In this particular situation, no subset of two points from one class and another point from the opposite class would determine a tight locking situation. A heuristic able to detect such locking situations will be presented in the following.

The heuristic involves considering the d points closest to the boundary from each class. Finding such points requires O(m1 + m2) computations where m1 and m2 are the numbers of patterns in C1 and C2, respectively. The number of combinations of one pattern from one class and d patterns from the opposite class as required by the algorithm given is O(d). In the worst case, d hyperplanes must be compared for each set of points. Therefore, the algorithm needs O(m1 + m2) + O(d²) computations up to this stage. The number of operations needed to construct the convex hull of the d patterns closest to the dividing hyperplane is O((d-1)^(⌊(d+1)/2⌋+1)) in the worst case (Preparata & Shamos, 1985). In general, the convex hull can be constructed using any classical computational geometry technique. However, one particularly interesting technique is the beneath-beyond method which has the advantage that it is suitable for on-line use (i.e. it constructs the hull by introducing one point at a time). In consequence, if the iterative step of the CBD algorithm adds the new pattern in such a way that the previous d-1 patterns from the hull are still among the closest d patterns, the effort necessary to construct the convex hull will only be O(φ_{d-2}), i.e. linear in the number of (d-2)-dimensional facets (note that we are considering the convex hull of the projections on the dividing hyperplane, which is a (d-1)-dimensional space). The number of operations needed to check whether the projection of pk is internal to this convex hull is O(d-1). In the best case, the convex hull will not be affected by the introduction of the new point so the effort will be limited to checking the distances from the dividing hyperplane which is O(m). In conclusion, the algorithm will need O(m1 + m2) in the best case and O(m1 + m2) + O((d-1)^(⌊(d+1)/2⌋+1)) in the worst case. Note that d is the number of dimensions of the input space and therefore is always lower than the number of patterns. If the number of dimensions is larger than the number of patterns, the patterns (assumed to be in general position) are linearly independent and the classes are separable so there is no need to check for locking.
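A small numpy sketch of Heuristic 1 follows (our own illustration, not the original implementation; the projection condition is assumed to be tested separately and is omitted here). A hyperplane through d points in R^d is recovered as the null-space direction of the matrix of difference vectors.

import numpy as np

def hyperplane_normal(points):
    """Normal of the hyperplane through d points in R^d (SVD null space)."""
    P = np.asarray(points, float)
    A = P[1:] - P[0]                 # (d-1) x d matrix of difference vectors
    return np.linalg.svd(A)[2][-1]   # direction orthogonal to all rows of A

def angle(u, v):
    """Sign-insensitive angle between two normals."""
    c = abs(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, 0.0, 1.0))

def heuristic1_locked(p_c1, p_k, h_normal, eps0):
    """p_c1: the d patterns from C1 closest to the current hyperplane;
    p_k: one pattern from C2; h_normal: normal of the current hyperplane."""
    d = len(p_c1)
    for i in range(d):
        pts = [p_k] + [p for j, p in enumerate(p_c1) if j != i]
        if angle(hyperplane_normal(pts), h_normal) > eps0:
            return False   # some extreme hyperplane deviates too much
    return True            # all extreme hyperplanes within eps0: locked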

The second locking detection heuristic assumes that the number of patterns is larger than the dimensionality d (as justified above) and that the locking is determined by several patterns from each class.

In these conditions, one could consider the d points closest to the dividing hyperplane from each class and calculate the hyperplane determined by them. If the two hyperplanes are close enough (the angle of their normals is less than the chosen tolerance ε0) and if the convex hulls determined by the projections on the dividing hyperplane intersect (and the intersection is not degenerate), then locking is present. The d closest points to the boundary are found and used to determine a border hyperplane for each class. The current separating hyperplane is compared with the two border hyperplanes. If all three are close to each other, the d closest points from each class are projected on the boundary. If the hulls thus determined intersect in a non-degenerate way, locking is present.
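The angle tests of this second heuristic can be sketched as follows (again our own illustration, reusing hyperplane_normal and angle from the sketch above; the non-degenerate intersection test of the projected hulls is omitted for brevity).

import numpy as np

def heuristic2_angles_pass(C1, C2, h_normal, h_bias, eps0, d):
    """Angle part of Heuristic 2: build a border hyperplane from the d
    points of each class closest to the dividing hyperplane (h_normal,
    h_bias) and require all three normals to be within eps0."""
    def closest_d(X):
        X = np.asarray(X, float)
        dist = np.abs(X @ h_normal + h_bias) / np.linalg.norm(h_normal)
        return X[np.argsort(dist)[:d]]
    b1 = hyperplane_normal(closest_d(C1))   # border hyperplane for C1
    b2 = hyperplane_normal(closest_d(C2))   # border hyperplane for C2
    return (angle(b1, b2) <= eps0 and angle(b1, h_normal) <= eps0
            and angle(b2, h_normal) <= eps0)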

The tolerance of the locking is still inversely proportional to the distance from the patterns to the dividing hyperplane as described in the third item of the locking characterization above. Thus, theoretically, there exist situations in which the three hyperplanes used by the heuristic above are (almost) parallel⁵ without a tight locking. This can happen if the two 'clouds' of patterns are far from each other but the d closest patterns from each class happen to determine two parallel hyperplanes which are also parallel with the current separating hyperplane. The probability of such a situation is negligible and the heuristic does not take it into account.

This heuristic needs O(m1 + m2) operations to find the d closest patterns and construct the border hyperplanes. In the best case, nothing more needs to be done. In the worst case, both convex hulls need to be re-constructed and this will take O((d-1)^(⌊(d+1)/2⌋+1)) computations. In general, calculating the intersection of two polyhedra in a d-dimensional space is computationally intensive. However, the algorithm does not require the computation of the intersection but only the detection of it. A simple (but rather inefficient) method for detecting such an intersection is to consider each point in one of the hulls and classify all vertices of the other hull as either concave, supporting or reflex (as in Section 3.3.6 in (Preparata & Shamos, 1985)), which in turn can be done in O((d-1)^(⌊(d+1)/2⌋+1)) in the worst case.

Figure 18. A different type of locking (case B). No subset of three patterns would determine a tight locking situation. Note that the convex hulls of the projections of the patterns from the two classes intersect on any dividing hyperplane.

detect locking with ε0 tolerance (p1, ..., p_{m1} ∈ C1, q1, ..., q_{m2} ∈ C2, h)
begin
    find the d closest points to h from C1; use them to construct a hyperplane b1
    find the d closest points to h from C2; use them to construct a hyperplane b2
    if ∠(b1, b2) > ε0 ∨ ∠(b1, h) > ε0 ∨ ∠(b2, h) > ε0 then
        locking has not been detected; return
    endif
    project p1, ..., pd on h and construct the convex hull H1 of the projections
    project q1, ..., qd on h and construct the convex hull H2 of the projections
    if dim(H1 ∩ H2) = d - 1 then
        locking has been detected
    else
        locking has not been detected
end

Figure 19. Heuristic 2 for locking detection in d dimensions.

The characteristics of the heuristics presented above make them well suited for problems involving a large number of patterns in a relatively low dimensional space (less than 5 variables) due to their worst-case behavior. In higher dimensional spaces, the savings in training time provided by detecting locking situations can be offset by the complexity of the computations required by the algorithms above and the use of the CBD technique without this enhancement is suggested. However, in low dimensional spaces, the locking detection can save up to 50% of the training time as shown by some of the experiments presented in the next section.

⁵ In d dimensions, parallelism is not necessarily a symmetric relation. In this context, parallel is meant as "strictly parallel" (according to the definitions used in (Borsuk, 1969)), which is a symmetric relation.

Experimental results

Hypotheses to be verified by the experiments

This section will present some experimental results obtained with the Constraint Based Decomposition technique. The experiments will be focused on several aspects. A first category of experiments was designed to investigate the effectiveness of the enhancements proposed above. These experiments will compare results obtained with the plain CBD algorithm to results obtained by activating one enhancement at a time. Two criteria have been monitored: the size of the network and the duration of the training. Due to the fact that CBD can be used to generate either a 3-layer (hidden-AND-OR layers) or a 2-layer (hidden and output layers) architecture, the total number of units depends on this choice. Therefore, the size of the network was characterized only by the number of hyperplanes used (which is equal to the number of units on the first hidden layer) as opposed to the total number of units. The locking detection is expected to bring an improvement of the training speed without affecting the number of hyperplanes used in the solution. The redundancy elimination is expected to reduce the number of hyperplanes used in the solution without affecting the training speed. These expectations, if confirmed, would allow the redundancy elimination and the locking detection to be used simultaneously, summing up their positive effects.

All enhancements have been tested on 3 problems: the 2-spiral problem with 194 patterns (see Fig. 20), the 2-spiral problem with 770 patterns (for the same total length - 3 times around the origin) and the 2-grid problem. The 2-grid problem (see Fig. 21) can be seen as a 2D extension of the XOR problem or as a 2D extension of the parity problem (in the 2-grid problem, the output should be 1 - black - if the sum of the Cartesian co-ordinates is odd). We have chosen this extension to the parity problem for two reasons. Firstly, we believe it is more difficult to solve such a problem by adding patterns in the same number of dimensions instead of increasing the number of dimensions and looking merely at the corners of the unit hypercube. Secondly, we wanted to emphasize the ability of the technique to cope with real-valued input values, an ability that differentiates CBD from many other constructive algorithms.

Figure 20. The 2-spiral problem with 194 patterns.

Figure 21. The 2-grid problem with 20 patterns.
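For illustration, here is a tiny generator for 2-grid-style labels (our own sketch; the exact 20-point grid of Fig. 21 is not listed in the text, so the point set below is an assumption):

import numpy as np

def two_grid_labels(coords):
    """Class 1 ('black') when the sum of the integer co-ordinates is odd."""
    coords = np.asarray(coords, int)
    return (coords.sum(axis=1) % 2 != 0).astype(int)

# a hypothetical 20-point integer grid, only to show the labeling rule
pts = np.array([(x, y) for x in range(-2, 3) for y in range(1, 5)])
labels = two_grid_labels(pts)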

Due to the fact that different algorithms use a different amount of work per epoch or pattern, a machine independent speed comparison of various techniques is not a simple matter. In order to provide results independent of any particular machine, operating system or programming language, the training speed is reported by counting the operations performed by the algorithm, in a manner similar to that widely used in the analysis of algorithm complexity (Knuth, 1976) and to the "connection-crossings" used by Fahlman in (Fahlman & Lebiere, 1990). Due to its constructive character and its guaranteed convergence properties, CBD exhibits excellent speed characteristics. A detailed explanation of the method used to assess the training speed and speed comparisons with other algorithms are presented in (Draghici, 1995). Those experiments showed CBD is a few orders of magnitude faster than backpropagation in dedicated, custom designed architectures and a few times faster than other constructive techniques like divide-and-conquer networks (Romaniuk & Hall, 1993), on specific problems. The focus of the experiments presented here was to investigate the generalization properties of CBD and how they compare with the generalization properties of other neural and non-neural machine learning techniques. The aim of such experiments was to show that constructive learning neural techniques can yield generalization performances comparable with those provided by a large category of classical and/or state-of-the-art machine learning techniques.

Methods used

In order to investigate the hypotheses above, a number of experiments have been performed and their results have been processed statistically. A short explanation of the terms used and of the processing performed is given in the following. More details about these and other statistical methods can be found in sources like (Hugill, 1985).

Most of the experiments performed involve two samples x1, x2, ..., xm and y1, y2, ..., ym of readings (for instance the number of hyperplanes used in the solutions with and without redundancy reduction, the number of operations used with and without locking, etc.). It is assumed that these are random samples from two populations of possible readings. In these conditions, the "null hypothesis" states that the two populations are (in fact) the same. The "one-sided alternative hypothesis" is that the x values tend to be consistently bigger (or consistently smaller) than the y values. The statistical argument works by believing the null hypothesis until the value of a certain test statistic is so extreme that it would be very unlikely under the null hypothesis. In these conditions, one is forced to change one's mind and accept the alternative hypothesis. This argument, however, does not guarantee the falsity of the null hypothesis. The only claim of this argument is that the null hypothesis is unlikely to be true. The probability of rejecting the null hypothesis when it is, in fact, true is called the "significance level" or "confidence level" and is usually taken to be 5%. The confidence level of 5% has been used in all tests performed.

In order to compare the effects of a particular enhancement, a number of tests were performed using the algorithm with and without the enhancement. Each individual experiment was run in the same conditions for the two versions: the same initial weight state, order of the patterns, weight-changing mechanism, etc. In these conditions, an appropriate statistical test is the paired two-sample t-test for means (small samples). This test uses the t-distribution (also known as Student's t-distribution) and tests whether the means of the two samples are significantly different. The test assumes the populations from which the samples have been drawn have normal distributions⁶ with the same variance. The test uses the samples to calculate the value of the t variable. The value resulting from this calculation is compared with the critical value of t for the given confidence level. The comparison decides whether the null hypothesis can be rejected or not.
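In modern terms, the same paired test can be run in a few lines; the sketch below uses synthetic, purely illustrative numbers (not the paper's measurements).

import numpy as np
from scipy import stats

# paired readings from the two versions run under identical conditions
rng = np.random.default_rng(0)
plain = rng.normal(65e6, 5e6, 20)             # e.g. operation counts, plain CBD
enhanced = plain * rng.normal(0.5, 0.05, 20)  # e.g. with locking detection

t, p = stats.ttest_rel(enhanced, plain)
# reject the null hypothesis of equal means at the 5% level when p < 0.05
# (equivalently, compare t with the critical value of Student's t)
print(f"t = {t:.2f}, p = {p:.2g}")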

A second category of experiments was aimed at comparing the performances of the CBD algorithm with the performances of other existing machine learning techniques. Such techniques included some other neural network constructive algorithms and non-constructive neural network training algorithms. Here, the focus of the experiments was the generalization abilities of the techniques. We used the classical method of cross-validation (Breiman, Friedman, Olsen, & Stone, 1984). The idea behind cross-validation is that the learning technique must be checked on data which belongs to the same distribution but was not used during the training. This can be implemented in several ways depending on the number of patterns available. If only few patterns are available, reducing the size of the training set even further by setting patterns aside for generalization testing could jeopardize the training. In such cases, the algorithm is used with only n-1 of the n available patterns and tested on the remaining one. This is done n times, each time leaving out a different pattern. An average is calculated over the n experiments. This is known as the leave-one-out method. If more patterns are available, the pattern set can be divided into n different subsets of patterns. Then one subset will be left out of the training and used to test the generalization. Again, the value reported is an average of the n trials performed leaving out each of the n subsets. This method is known as n-fold cross-validation. Finally, if the pattern set is very large, it can simply be divided into a training set and a validation set. In this case, the generalization abilities of the technique will be characterized by its performance on the validation set.
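A minimal sketch of the n-fold splitting described above (leave-one-out is the special case n_folds == n_patterns); the fit/predict calls in the trailing comment are hypothetical stand-ins for a learner.

import numpy as np

def n_fold_splits(n_patterns, n_folds, rng=None):
    """Yield (train, test) index arrays; each subset is held out once."""
    idx = np.arange(n_patterns)
    if rng is not None:
        rng.shuffle(idx)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        yield train, folds[k]

# e.g. 10-fold CV accuracy with hypothetical fit()/predict() functions:
# scores = [np.mean(predict(fit(X[tr], y[tr]), X[te]) == y[te])
#           for tr, te in n_fold_splits(len(X), 10, np.random.default_rng(0))]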

Experimental results

Locking detection. A locking situation is a situation in which the existing patterns determine the position of the hyperplane within a given tolerance. In such a situation, the hyperplane cannot be moved anymore (outside the tolerance) without misclassifying some existing patterns. From this definition it is clear that such a situation may or may not appear frequently in a problem. Consequently, the locking detection may or may not bring an important improvement to the training of a particular problem. Therefore, one expects this mechanism to be more important for some problems and less important for others. In order to assess correctly the efficiency of this mechanism, a problem of each category has been chosen. A problem for which the locking detection mechanism was expected to bring an important improvement is the 2-spiral problem (proposed originally by Wieland and cited in (Lang & Witbrock, 1989)). This is because this problem is characterized by a relatively large number of patterns in the training set and a high degree of non-linearity, and therefore needs a solution with a relatively large number of hyperplanes. Due to the distribution of the patterns, it was expected that some hyperplanes be locked into position by some close patterns. The results of 20 trials using 20 different orderings of the patterns in the training set containing 194 patterns are presented in Fig. 22. The average number of operations (connection crossings) used by the variation without locking was 65,340,103.5 whereas the average number of operations (connection crossings) used by the variation with locking detection was 32,192,110.8. This corresponds to an average improvement of 50.73% of the training speed.

The t-test was performed on the number of operations used in 20 training sessions with and without the locking detection mechanism. The calculated value of t is -14.97 and is smaller than the critical value of t for the 5% level of confidence, which is -1.72. In these conditions, the null hypothesis can be rejected. In other words, the improvement brought by the locking detection mechanism in terms of convergence speed is significant. The same test was performed on data regarding the number of hyperplanes. The value of t that results from these data is -0.84 and is smaller than the critical value of t (5%) which is 2.09 for the two-sided alternative hypothesis. In these conditions, one can conclude that there is no evidence to support the claim that the locking detection affects the number of hyperplanes.

⁶ The normality assumption is necessary because the number of readings in the two samples was relatively small (20). However, checks performed on the data showed that this normality assumption is reasonable.


Figure 22. Comparison between the number of operations (connection crossings) used by the technique with and without the locking detection in solving the 2-spiral problem (194 patterns).

Figure 23. Comparison between the number of operations (connection crossings) used by the technique with and without the locking detection in solving the 2-spiral problem (770 patterns).

A problem for which the locking detection mechanism was not expected to bring such a spectacular improvement is the 2-grid problem presented in Fig. 21. This problem is characterized by a relatively small number of patterns in the training set which are relatively sparsely distributed in the input space. If one looks at a solution, one can note that in general, the patterns do not determine the position of the hyperplanes within a tolerance comparable to the tolerance which characterizes the 2-spiral problem. In other words, it is probable that locking situations (with the same tolerance) appear less frequently during the training. Under these conditions, it is normal for the version with the locking detection to show less improvement over the standard version of the algorithm. The results of 20 trials using 20 different orderings of the patterns in the training set are presented in Fig. 24. The average number of operations (connection crossings) used by the variation without locking was 1,149,094.2 whereas the average number of operations (connection crossings) used by the variation with locking detection was 1,038,254.1. This corresponds to an average improvement of 9.65%. Although important even for this problem, the improvement brought by the locking detection mechanism is not as spectacular as for the 2-spiral case. The t-test shows that the improvement is statistically significant for the given confidence level even for this problem.

Redundancy elimination. As the constraint based decomposition technique is more sensitive to the order of the patterns in the training set than to the initial weight state, 20 trials with different orderings of the patterns in the pattern set were performed with and without checking for redundant hyperplanes. The results are presented in Fig. 25. The pattern set is that of the 2-spiral problem and contains 194 patterns. The order of the patterns in the training set was changed randomly before each trial. For each trial, the same random permutation of the patterns in the pattern set was used for both the standard and the enhanced version of the algorithm. The standard version of the algorithm solved the problem with an average number of 87.65 hyperplanes (the average is performed over the 20 trials). The enhanced version of the algorithm with the redundancy check solved the same problem with an average of 58.8 hyperplanes, which represents an average improvement of 32.92%.

The t-test performed on the number-of-hyperplanes data coming from experiments with this problem shows that the effect of the redundancy elimination mechanism upon the number of hyperplanes used in the solution is significant (t = -18.95 compared with t_critical = -1.72). The same test performed on the number-of-operations data shows that the redundancy elimination mechanism does not affect the results significantly from this point of view.


Figure 24. Comparison between the number of operations (connection crossings) used by the technique with and without the locking detection in solving the 2-grid problem (20 patterns).

Figure 25. Comparison between the number of hyperplanes (hidden units on the first layer) used by the technique with and without the check for redundancy. The training set is that of the 2-spiral problem containing 194 patterns.

The algorithm was also tested on a pattern set of the same 2-spiral problem containing 770 patterns. The results are summarized in Fig. 26. The standard version of the algorithm solved the problem with an average number of 186.50 hyperplanes (the average is performed over 16 trials). The enhanced version of the algorithm with the redundancy check solved the same problem with an average of 99.19 hyperplanes, which represents an average improvement of 46.82%. The t-test shows that the improvement is statistically significant for the given confidence level.

The same comparison was performed for the 2-grid problem. The results of the experiments (the numbers of hyperplanes used in the solution) are presented in Fig. 27. The standard version of the algorithm solved the problem with an average number of 12.45 hyperplanes (the average is performed over the 20 trials). The enhanced version of the algorithm with the redundancy check solved the same problem with an average of 11.05 hyperplanes, which represents an average improvement of 11.24%. The t-test shows that the improvement is statistically significant for the given confidence level.

The experiments performed showed that the redundancy elimination is a reliable mechanism that provides more compact architectures on a consistent basis with minimal computational effort. However, the locking detection mechanism proposed is only useful in problems with a low number of dimensions. This is because the computational effort necessary to compute the convex hulls and their intersection depends exponentially on the number of dimensions and quickly becomes more expensive than the computational effort involved in using a time-out limit. Therefore, we propose CBD with redundancy elimination as the main tool for practical use, while the locking detection is only an option useful in low dimensional problems.

Generalization experiments. In order to evaluate the generalization properties of the algorithm, we have conducted experiments on a number of classical real world machine learning data sets. The experiments aimed at assessing the generalization abilities of the networks constructed by the CBD algorithm were performed with redundancy elimination but without locking detection. The current implementation is able to deal with multiclass problems and does so using the third approach described in the section dealing with multiclass classification. This approach chooses randomly one pattern from any class for every region in which the function is called. Subsequently, the patterns from the same class as the pattern chosen will be separated from the other patterns. We shall refer to the CBD with redundancy elimination, without locking detection and with the multiclass capabilities as the standard CBD (or simply CBD) for the remainder of this paper.
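The multiclass step described above can be sketched as follows (our own paraphrase of the description, not the actual implementation; patterns and labels are assumed to be numpy arrays):

import numpy as np

def multiclass_subgoal(patterns, labels, rng):
    """Pick a random pattern in the current region and pose the two-class
    subgoal 'its class versus everything else', as described above."""
    i = rng.integers(len(patterns))
    target = labels[i]
    pos = patterns[labels == target]   # to be separated from the rest
    neg = patterns[labels != target]
    return target, pos, neg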


Figure 26. Comparison between the number of hyperplanes (hidden units on the first layer) used by the technique with and without the check for redundancy. The training set is that of the 2-spiral problem containing 770 patterns.

Figure 27. Comparison between the number of hyperplanes (hidden units on the first layer) used by the technique with and without the check for redundancy. The training set is that of the 2-grid problem containing 20 patterns.

The balance data set models psychological experimental results. Each pattern represents the position of a balance described by the weights suspended on the two arms and the distance of the suspension points from the center. The set contains 625 patterns of three classes (49 balanced, 288 tipped to the left and 288 tipped to the right). Each example has four numeric attributes (left-weight, left-distance, right-weight, right-distance) and a class label (balanced, left or right). The class distribution is 46.08% left, 7.84% balanced and 46.08% right.

The glass data set contains examples of glass classification based on the chemical composition, used for criminological investigation. The data set has 9 attributes representing the concentration of various minerals in the chemical composition of the glass; the class is the type of glass. There are 214 examples distributed in 7 classes as follows: 163 instances of window glass, of which 87 float processed (70 building windows and 17 vehicle windows) and 76 non-float processed (76 building windows and 0 vehicle windows), and 51 instances of non-window glass, of which 13 are containers, 9 are tableware and 29 are headlamps.

The iris data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. There are 4 real-valued attributes (sepal length, sepal width, petal length and petal width, all in cm) and 3 classes (Iris Setosa, Iris Versicolour, Iris Virginica). One class is linearly separable from the other 2; the latter are not linearly separable from each other. The class distribution is 33.33% for each class. Note that there are two attributes with a high class correlation, which may indicate that this is a less difficult classification problem.

The lenses data set contains 24 patterns with 4 attributes each (age of patient, type of prescription, astigmatism and tear production rate). There are 3 classes (soft contact lenses, hard contact lenses and no contact lenses). The attributes are either binary or ternary. Due to the reduced number of patterns this problem was tested with the leave-one-out method while all other problems were tested with the 10-fold cross validation.

The wine data set contains data resulting from a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis provided the quantities of 13 constituents found in each of the three types of wines. There are 3 classes containing 59, 71 and 48 patterns, respectively. All attributes are real-valued.

The zoo data set contains 101 patterns divided in 7 classes. There are 16 mixed attributes (14 Boolean and 2 integer). The numbers of examples in each of the 7 classes are 41, 20, 5, 13, 4, 8 and 10, respectively. A summary of these and the other data sets used is given in Table 1. More details about the problems used can be found in (Blake, Keogh, & Merz, 1998).

Table 1
The data sets used in assessing the generalization abilities of CBD.

Dataset        Classes   Attributes   Examples   Domain
glass          7         9            214        Glass identification for criminological investigation
iris           3         4            150        Iris plant classification
lenses         3         4            24         Database for fitting contact lenses
wine           3         13           178        Wine recognition data
zoo            7         16           101        Animal classification
balance        3         4            625        Balance-scale classification
ionosphere     2         34           351        Atmospheric radar data
lung cancer    3         56           32         Lung cancer data
Pima Indians   2         8            768        Diabetes cases in Pima population
bupa           2         6            345        Liver disorders data
tictactoe      2         9            958        End-game board configurations in tictactoe

We have compared the results provided by CBD with the results obtained with several machine learning techniques: CN2 (Clark & Niblett, 1989), C4.5 (Quinlan, 1993), CLEF (Precup & Utgoff, 1998; Utgoff & Precup, 1997, 1998) and a symbolic AI system, ELEM2 (An & Cercone, 1998). CN2 and C4.5 are well known techniques for constructing classifiers from examples. CLEF is a variation of a linear machine combined with a non-linear function approximator that constructs its own features. ELEM2 is a state-of-the-art technique able to assess the degree of relevance of various instances and generate classification rules accordingly. Also, ELEM2 is able to handle inconsistent data sets and uses pruning and post-pruning quality measures to improve the generalization abilities of the model constructed.

A summary of the experimental results comparing CN2, C4.5, ELEM2 and CBD is given in Table 2. The accuracy values reported for the CBD were obtained by averaging the accuracies obtained on the validation set over a set of 10 trials for each problem. The weight changing mechanism was the delta rule with a learning rate of 0.9. The values for CN2, C4.5 and ELEM2 are quoted from (An & Cercone, 1998). The accuracy values for CLEF are those reported in (Precup & Utgoff, 1998). Of the six real-world problems studied, CBD yielded the best generalization results in three of them and second best on two of the remaining three. Even on the glass problem on which CBD performed worst (third best overall), the difference between CBD's performance and the best performing algorithm in this group is only 6%.

The results provided by the CBD algorithm do not depend very much on the initial weight state. Several trials performed with different initial weight states yielded practically the same solutions. Also, CBD seems to be rather insensitive to large variations of the learning rate for many weight changing mechanisms used. For instance, varying the learning rate between 0.25 and 0.9 produced differences of only 2% in the average accuracy. These characteristics of the CBD algorithm can be seen as advantages and show its robustness to parameter variations. The single most important factor for CBD is the order of the patterns in the training set. As shown in (Draghici, 1995), this can be used to improve the generalization and to force the CBD to find solutions with certain desired characteristics. In the experiments presented here, this feature of the algorithm was not used. This was done in order to study its performances in those situations when there is no a priori knowledge about the pattern set and/or no bias towards certain types of solutions. In order to study the numerical stability of the algorithm and its sensitivity to the order of the patterns in the training set, each experiment was performed 10 times for each problem and was preceded by a random permutation of the patterns. The number of trials was chosen equal to 10 so that the figures obtained for the standard deviations could be compared with the figures reported in (An & Cercone, 1998). The comparison of these standard deviations (shown in Table 3) illustrates the fact that CBD is extremely consistent and reliable. CBD's standard deviation oscillates between 1/10 and 1/2 of that of the next best algorithm.

Another set of experiments compared CBD further with the results of several other machine learning algorithms including C4.5 using classification rules (Quinlan, 1993), incremental decision tree induction (ITI) (Utgoff, 1989; Utgoff, Berkman, & Clouse, 1997), linear machine decision trees (LMDT) (Utgoff & Brodley, 1991), learning vector quantization (LVQ) (Kohonen, 1988, 1995), induction of oblique trees (OC1) (Murthy, Kasif, Salzberg, & Beigel, 1993; Heath & Salzberg, 1993), Nevada backpropagation (NEVP) (http://www.scsr.nevada.edu/nevprop/), k-nearest neighbors with k=5 (K5), Q* and radial basis functions (RBF) (Musavi, Ahmed, Chan, Faris, & Hummels, 1992; Poggio & Girosi, 1990). The Q* and RBF are available as options in Tooldiag (Rauber, Barata, & Steiger-Garcao, 1993). An excellent comparison between these algorithms can be found in (Eklund, 2000).

Figure 30 presents the comparison between the average accuracy obtained by CBD and the algorithms mentioned above. For this set of experiments, the data was randomly split into a training set containing 80% of the data and a validation set containing 20% of the data. The average reported was obtained over 5 trials with the accuracy measured on the validation set only. CBD was used with the delta rule (0.9 learning rate), redundancy elimination and no locking detection. The timeout limit was set at 700 epochs in all experiments.

Table 2
The generalization performance comparison of CBD with C4.5, CN2, ELEM2 and CLEF. Dashes stand for unavailable data.

Dataset    C4.5    CN2     ELEM2   CLEF   CBD
glass      68.7    55.17   72.88   -      67.38
iris       95.3    88.68   94.67   94.4   95.66
lenses     73.3    75.01   76.67   -      77.08
wine       92.8    86.00   97.78   94.2   97.19
zoo        92.1    92.09   98.00   96.4   92.27
balance    78.10   79.05   81.14   92.5   91.63

Table 3
The standard deviation comparison of CBD with C4.5, CN2 and ELEM2. Standard deviation data was not available for CLEF.

Dataset    C4.5    CN2     ELEM2   CBD
glass      13.96   12.11   10.11   4.07
iris       6.31    8.34    5.26    0.78
lenses     34.43   22.56   26.29   4.30
wine       7.44    5.91    3.88    1.56
zoo        6.30    7.87    4.22    3.55
balance    5.63    3.23    5.29    0.83

The comparison presented in Fig. 30 shows that CBD yielded accuracies close to the best accuracy for most individual problems. Standard deviations for the same set of problems and algorithms are given in Fig. 31. CBD has the second best average accuracy and the best standard deviation. The problem on which CBD performed worst in this set of experiments is TicTacToe, on which CBD yielded an accuracy about 24% lower than that of the best algorithm (C4.5 rules with 99.17%). All results were obtained using the standard CBD and default values for all parameters. Better accuracies for this (and any) particular problem may be obtained by tweaking the various parameters that control the training.

If the delta rule is used, the maximum number of epochs specifies the amount of effort that the algorithm should spend in trying to optimize the position of one individual hyperplane. If this value is low (e.g. 50), CBD will produce a network with more hyperplanes. If this value is higher (e.g. 1500), CBD will spend a fair amount of time optimizing individual hyperplanes and the architecture produced will be more parsimonious with hyperplanes. Conventional wisdom based on Occam's razor and the number of free parameters in the model suggests that a more compact architecture is likely to provide better generalization performance. Such expectations are confirmed on most problems. Fig. 28 and Fig. 29 present the variations in the number of hyperplanes and the accuracy as functions of the maximum number of epochs for the balance and iris data sets, respectively. Each data point on these curves was obtained as an average of 5 trials. It is interesting to note that both the accuracy and the number of hyperplanes remain practically constant for a very large range of values of the maximum epochs parameter, showing that CBD is very stable with respect to this parameter. It is only at the very low end of the range (maximum number of epochs less than 200) that one can see a noticeable increase in the number of hyperplanes used and a corresponding decrease in the accuracy.

Redundancy elimination and generalization. Another question that one may ask concerns the effects of the redundancy elimination upon the generalization performance of the algorithm. This question is particularly important since the redundancy elimination option is on by default in the standard version of CBD.

If reducing the number of hyperplanes used has the consequence of deteriorating the generalization performance, then one may not always want the solution with the fewest hyperplanes. In order to find out the effects of the redundancy elimination mechanism upon the generalization performance of the solution network, a number of experiments have been performed with and without the redundancy elimination. As an example, Fig. 32 presents some results obtained with and without redundancy elimination for several problems from the UCI machine learning repository. No statistically significant differences were observed between the generalization performance obtained with and without redundancy elimination.

Trainingspeed.The training speedis influencedmainly by two factors:

the weight changingmechanismand the locking detectionmechanism.Thedefault weightchangingmechanismis thedeltarule. Theperceptronruleprovidestrainingtimesslight-ly shorter. Morecomplex weightchangingmechanismsmaybe expectedto yield training times directly proportionaltotheir complexity.

As discussed, the locking detection mechanism is only effective in spaces with a low number of dimensions. In higher dimensional spaces, it is more efficient to use a timeout condition on the number of epochs. In general, the shorter the time out value, the faster the training. However, such adjustments need to be made with care since the time out value also influences the generalization performance through the number of hyperplanes used (see the discussion in the section on generalization). The default value for the time out condition is 700 epochs. This value has been found to provide a good compromise between the training speed and the generalization results. However, a value going as low as 150 epochs is sufficient to provide good generalization in many cases. Table 4 presents typical training times for 10 problems from the UCI machine learning repository on a modest laptop with a Pentium III processor running at 500MHz and 256M of RAM. The values have been averaged over 5 trials and rounded to the nearest second. Note that each trial is a 10-fold cross-validation experiment which involves training 90% of the patterns 10 times. Such training speed represents an improvement of several orders of magnitude with respect to standard backpropagation and of approximately 300% with respect to comparable constructive algorithms such as divide and conquer networks (Romaniuk & Hall, 1993). A more detailed speed comparison can be found in (Draghici, 1995). However, such comparisons are only approximate due to the fact that the speed of most other algorithms is reported only in CPU time on various platforms.

Figure 28. The variation of the number of hyperplanes used and accuracy as a function of the maximum number of epochs for the balance data set. The accuracy is measured on the right axis while the number of hyperplanes is measured on the left axis. Both curves are very flat for a large range of the maximum number of epochs, showing that CBD is very stable with respect to this parameter. However, as the timeout limit becomes very small (less than 300) the algorithm deploys more hyperplanes than necessary and the accuracy starts to degrade.

Figure 29. The variation of the number of hyperplanes used and accuracy as a function of the maximum number of epochs for the iris data set. The accuracy is measured on the right axis while the number of hyperplanes is measured on the left axis. Again, both curves are very flat, showing that CBD is very stable.

In order to gain a better understanding of the relationship between the maximum number of epochs and the results of the training, we have plotted the number of hyperplanes used and the training time as functions of the maximum number of epochs.


DATASET       C4.5    C4.5 r  ITI     LMDT    CN2     LVQ     OC1     NEVP    K5      Q*      RBF     CBD
GLASS         70.23   67.96   67.49   60.59   70.23   60.69   57.72   44.08   69.09   74.78   69.54   68.37
IONOSPHERE    91.56   91.82   93.65   86.89   90.98   88.58   88.29   83.8    85.91   89.7    87.6    88.17
LUNG CANCER   40.17   39.84   38.47   55.49   37.17   55.71   54.28   33.12   68.54   60      65.7    60
WINE          91.09   91.9    91.09   95.4    91.09   68.9    87.31   95.41   69.49   74.35   67.87   94.44
PIMA INDIANS  71.02   71.55   73.16   73.51   72.19   71.28   50      68.52   71.37   68.5    70.57   68.72
BUPA          65.14   65.39   63      71.54   64.31   64.13   65.57   77.72   66.43   61.43   59.85   62.32
TICTACTOE     83.52   99.17   92.89   89.61   98.18   65.61   78.56   96.91   84.32   65.7    72.19   75.1
BALANCE       64.61   75.01   76.76   93.27   80.89   89.54   92.5    91.04   83.96   69.21   89.06   90.08
IRIS          91.6    91.58   91.25   95.45   91.92   92.55   93.89   90.34   91.94   92.1    85.64   96
ZOO           90.27   90      90.93   96.61   91.91   91.42   66.68   92.86   67.64   74.94   X       94.29
AVG           75.92   78.42   77.87   81.84   78.89   74.84   73.48   77.38   75.87   73.07   74.22   79.75

Figure 30. The generalization performance comparison between CBD and several existing neural and non-neural machine learning algorithms on various problems from the UCI machine learning repository. 'X' stands for unavailable data. CBD provides the second best average accuracy on the problems tested.

DATASET       C4.5    C4.5 r  ITI     LMDT    CN2     LVQ     OC1     NEVP    K5      Q*      RBF     CBD
GLASS         7.23    6.28    7.96    11.25   8.34    10.24   9.1     6.29    7.81    6.98    7.35    2.08
IONOSPHERE    2.82    2.58    2.71    3.51    3.29    3.36    2.21    3.81    4.14    4.7     6.45    2.56
LUNG CANCER   14.2    18.92   13.52   32.2    13.79   12.48   17.53   14.83   11.96   18.6    16.27   12.6
WINE          5.84    5.09    6.24    5.22    6.11    4.84    8.45    2.22    6.86    6.64    5.16    1.96
PIMA INDIANS  2.1     3.92    2.16    4.3     2.36    4.46    22.4    3.19    3.67    8.19    2.39    3.02
BUPA          5.74    6.05    4.23    6.63    7.99    7.14    8.45    11.97   7.22    4.25    7.92    2.05
TICTACTOE     2.44    1.05    2.38    8.79    0.95    2.99    5.88    1.32    2.7     3.16    3.35    9.43
BALANCE       3.35    3.98    3       2.95    3.38    4.39    2.07    7.12    7.53    19.09   2.38    3.03
IRIS          5.09    5.09    4.81    4.71    5.95    3.73    4.68    7.45    4.1     5.28    27.37   4.35
ZOO           7.59    7.24    6.11    1.56    5.95    6.26    30.36   4.62    20.03   23.8    X       2.13
AVG           5.64    6.02    5.312   8.112   5.811   5.989   11.11   6.282   7.602   10.07   8.738   4.321

Figure 31. The standard deviations of CBD and the other machine learning algorithms on several problems from the UCI machine learning repository. CBD yielded the lowest standard deviation on the problems tested.

              Balance          Glass            Iris             Lenses           Wine             Zoo
Experiment    NR      R        NR      R        NR      R        NR      R        NR      R        NR      R
1             93.12   91.68    62.14   57.94    95.33   95.33    75      75       94.38   96.06    82.17   81.18
2             91.52   90.56    64.95   64.95    92.66   95.33    79.16   62.5     94.38   95.5     88.11   74.25
3             90.08   90.72    63.55   61.21    95.33   96.66    75      66.66    94.94   96.06    83.16   84.15
4             88.16   89.76    62.61   65.42    94      94.66    66.66   70.83    94.38   96.62    84.15   83.16
5             90.24   88.32    66.35   64.48    94      96       75      83.33    94.38   96.06    90.09   80.19
6             90.4    90.56    66.35   67.28    96      95.33    83.33   75       96.06   94.94    82.17   81.18
7             89.6    89.12    63.08   64.01    95.33   95.33    75      75       94.38   96.06    83.16   81.18
8             90.4    91.68    64.95   62.61    94.66   96       70.83   79.16    94.38   93.82    74.25   87.12
9             88.64   89.44    65.42   59.34    95.33   94       58.33   79.16    94.38   96.06    79.2    86.13
10            88.96   88.64    62.61   64.95    95.33   95.33    79.16   83.33    96.06   96.06    82.17   83.16
Average       90.112  90.048   64.201  63.219   94.797  95.397   73.747  74.997   94.772  95.724   82.863  82.170
Std. Dev.     1.370   1.122    1.511   2.772    0.934   0.696    6.732   6.454    0.665   0.760    4.131   3.401

Figure 32. The effects of the redundancy elimination mechanism on the generalization capabilities of the solution. "R" denotes experiments performed with the redundancy elimination mechanism enabled while "NR" denotes experiments performed with the mechanism disabled. No statistically significant differences were observed.


Table 4
The training time of CBD on several problems from the UCI machine learning repository. The values reported are averages over 5 trials using a timeout limit of 700 epochs and 10-fold cross validation. Note that one such trial actually trains 90% of the patterns 10 times. The values have been rounded to the nearest second. The computer used was a modest Pentium III@500MHz laptop with 256M of RAM.

Dataset        Dimensions   Patterns   Time (hh:mm:ss)   Avg. hyperplanes   Avg. terms
Glass          9            214        1:32              29                 35
Ionosphere     34           351        7:25              3                  4
Lung cancer    56           32         4                 2                  3
Wine           13           178        16                2                  3
Pima Indians   8            768        1:10:08           68                 79
Bupa           6            345        10:05             43                 51
Tic-Tac-Toe    9            958        2:38:07           47                 49
Balance        4            625        6:45              22                 29
Iris           4            150        10                4                  5
Zoo            16           101        5                 8                  11

In order to identify easily the point where the generalization starts to deteriorate, we have also plotted the generalization accuracy on the same plot. Fig. 33 presents these dependencies for the balance data set and Fig. 34 presents the same dependencies for the iris data set. Each data point on these curves was calculated as an average of five trials. As expected, there is an almost linear dependency between the maximum epochs parameter and the total number of epochs performed. Furthermore, the total number of epochs is directly proportional to the training time. However, in all cases, the accuracy fluctuates very little for a very large range of the maximum number of epochs. This shows that CBD is very stable with respect to the choice of this parameter. However, if the maximum number of epochs is chosen to be very small, the performance does deteriorate. This is to be expected since, for a very small timeout limit, the algorithm does not have the chance to adjust the position of the hyperplane in order to classify correctly the patterns in the given subgoal. In consequence, the algorithm will use more hyperplanes than normally needed and will overfit the data. This can be seen clearly on the graph for a maximum number of epochs less than 100, when the number of hyperplanes starts to increase and the accuracy starts to decrease. This is followed by a more abrupt degradation for values less than 50.

Figure 33. The variation of the number of epochs and training time as a function of the maximum number of epochs for the balance data set. The accuracy (in percentages) and the training time (in seconds) are measured on the left axis; the total number of epochs is measured on the right axis.

Figure 34. The variation of the number of epochs and training time as a function of the maximum number of epochs for the iris data set. The accuracy (in percentages) and the training time (in seconds) are measured on the left axis; the total number of epochs is measured on the right axis.

Notes on the CBD implementation. The CBD implementation is available in C++ and Java. The experimental results presented above were obtained with the Java version implemented by Christopher Gallivan. This version will be available as part of the WEKA machine learning package at http://www.cs.waikato.ac.nz/ml/weka/ or http://vortex.cs.wayne.edu.

Discussion

This section will present some techniques that are related in various ways with CBD. The main differences and similarities between CBD and each such technique will be summarized briefly. This enumeration of related techniques is by no means exhaustive. Some reviews of these and other very good constructive algorithms together with a general review of various constructive approaches can be found in (Smieja, 1993; Kwok & Yeung, 1997b; Fiesler, 1994; Fiesler & Beale, 1996; Thimm & Fiesler, 1997).

In its simplest form, the CBD algorithm can be seen as building a decision tree. For an excellent review of decision tree techniques see (Murthy, 1995). In particular, the entropy nets of Sethi (see (Sethi, 1990; Sethi & Otten, 1990)) use a decision tree to classify the regions and two layers of weights, one for logical AND and one for logical OR. These layers are similar to those used by CBD. However, the building of the decision tree can be a very lengthy process because it involves testing many candidate questions for each node in the tree. For instance, CART (Classification and Regression Trees) uses a standard set of candidate questions with one candidate test value between each pair of data points. A candidate question is of the form "Is xm ≤ c?" where xm is a variable and c is the test value for that particular variable. At each node, CART searches through all the variables xm, finding the best split c for each. Then the best of the best is found (Breiman et al., 1984). For a problem in a high dimensionality space with many input patterns, this can be a very time consuming process. On the other hand, the techniques which build a network by converting a decision tree offer some intrinsic optimization. Usually, in the process of building the tree, some measures are taken to ensure that the splits optimize some factors such as the information gain.
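As an illustration of why this search is expensive, here is a sketch of a CART-style exhaustive split search (our own code; CART may optimize Gini impurity or information gain, and Gini is shown here):

import numpy as np

def best_question(X, y):
    """Search every variable x_m and every candidate threshold c (one
    midpoint between each pair of consecutive values) for the question
    'Is x_m <= c?' that minimizes the weighted Gini impurity."""
    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    best = (np.inf, None, None)         # (impurity, variable m, threshold c)
    for m in range(X.shape[1]):
        v = np.unique(X[:, m])
        for c in (v[:-1] + v[1:]) / 2:  # midpoints between data values
            left, right = y[X[:, m] <= c], y[X[:, m] > c]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[0]:
                best = (score, m, c)
    return best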

CBD builds up the desired I/O surface gradually, one region after another. The idea of locally constructing the I/O shape is present in all radial basis function (RBF) algorithms (Broomhead & Lowe, 1988; Moody & Darken, 1989; Musavi et al., 1992; Poggio & Girosi, 1990). In the RBF case, one unit with a localized activation function will ensure the desired response for a small region of the I/O space. However, there are situations in which a net building piecewise linear boundaries is better than an RBF net. Furthermore, for an RBF net to be efficient, a pre-processing stage must be performed and parameters such as the radii of the activation functions, their shape and orientation, the clustering, etc. must be calculated. By contrast, CBD is relatively simple.

On the other hand, an RBF network will respond only to inputs which are close to the inputs contained in the training set. For completely unfamiliar inputs, the RBF network will remain silent, automatically signaling its incompetence. In contrast, CBD networks (as any other networks using hyperplanes) automatically extend their trained behavior to infinity. This means that they produce some response for any input, no matter how unfamiliar it is. The potential problems introduced by such behavior can be eliminated by using techniques for validating individual outputs (Bishop, 1994; Courrieu, 1994).
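
The contrast can be illustrated with a small sketch (our own, with assumed parameter values): a Gaussian RBF unit's response decays to zero away from its center, while a hyperplane-based threshold unit assigns a confident output to every point of the input space, however remote:

```python
import math

def rbf_unit(x, center, radius):
    """Gaussian RBF unit: response fades to ~0 far from the training data."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-d2 / (2.0 * radius ** 2))

def threshold_unit(x, w, bias):
    """Hyperplane unit: always answers 0 or 1, no matter how unfamiliar x is."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias >= 0 else 0

far_away = [1000.0, 1000.0]                        # nothing like the training data
print(rbf_unit(far_away, center=[0.0, 0.0], radius=1.0))   # ~0: unit stays "silent"
print(threshold_unit(far_away, w=[1.0, -1.0], bias=0.0))   # 1: unit still responds
```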

A constructive algorithm related both to CBD and to RBFs is DistAI, presented in (Yang, Parekh, & Honavar, 1998; Yang, Parekh, Honavar, & Dobbs, 1999). DistAI constructs a single layer of hyper-spherical threshold neurons which exclude clusters of training patterns belonging to the same class. However, in DistAI, the weights and thresholds of the neurons are determined directly by comparing the distances between various patterns, whereas in CBD the weights are trained in an iterative process.
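
A minimal sketch of such a distance-based construction, under our own assumptions about the details, could center a spherical unit on a training pattern and set its radius directly from the distance to the nearest pattern of a different class, with no iterative weight training:

```python
# Sketch of a DistAI-style hyper-spherical threshold unit (our illustration):
# the radius is set just below the distance to the nearest pattern of another
# class, so the sphere covers a same-class cluster without iterative training.

def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def spherical_unit(center_idx, patterns, labels):
    center = patterns[center_idx]
    other = [dist(center, p) for p, y in zip(patterns, labels)
             if y != labels[center_idx]]
    radius = min(other) * 0.999 if other else float('inf')
    # the unit fires for any input falling inside the sphere
    return lambda x: 1 if dist(x, center) < radius else 0
```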

The ideas of training only one neuron at a time and building the net gradually are present in most constructive algorithms. The upstart algorithm (Frean, 1990) builds a hierarchical structure (which can eventually be reduced to a 2-layer net) by starting with a unit and adding daughter


units which cater for the misclassifications of the parents. Sirat and Nadal proposed a similar algorithm in (Sirat & Nadal, 1990). Other interesting constructive techniques are sequential learning (Marchand, Golea, & Rujan, 1990), the patch algorithm (Barkema, Andree, & Taal, 1993), the oil-spot algorithm (Frattale-Mascioli & Martinelli, 1995) and the techniques presented in (Muselli, 1995; Rujan & Marchand, 1989). Mezard and Nadal (Mezard & Nadal, 1989) proposed a tiling algorithm which starts by training a single unit on the whole training set. The training is stopped when this unit produces the correct target on as many patterns as possible. This partial-solution weight state is given by Gallant's pocket algorithm (Gallant, 1986), which assumes that if the problem is not linearly separable the algorithm will spend most of its time in a region of the weight space providing the fewest errors. However, all these techniques work for binary units only. Even in its simplest form, CBD can deal with both binary and real-valued inputs. Furthermore, it can easily be extended in several ways to continuous inputs and outputs.
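
The pocket idea itself is simple enough to sketch (the code below is our illustration, not Gallant's original formulation): run ordinary perceptron updates, but keep "in the pocket" the weight vector that has misclassified the fewest training patterns so far:

```python
import random

def pocket(patterns, targets, epochs=100, lr=1.0):
    """Sketch of the pocket idea: standard perceptron updates, while
    remembering the weight vector with the fewest errors seen so far."""

    def predict(w, x):
        s = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]   # w[-1] is the bias
        return 1 if s >= 0 else 0

    def errors(w):
        return sum(1 for x, t in zip(patterns, targets) if predict(w, x) != t)

    w = [0.0] * (len(patterns[0]) + 1)
    pocket_w, pocket_err = list(w), errors(w)

    data = list(zip(patterns, targets))
    for _ in range(epochs):
        x, t = random.choice(data)
        y = predict(w, x)
        if y != t:                                 # standard perceptron update
            for i in range(len(x)):
                w[i] += lr * (t - y) * x[i]
            w[-1] += lr * (t - y)
            e = errors(w)
            if e < pocket_err:                     # better than the pocket? keep it
                pocket_w, pocket_err = list(w), e
    return pocket_w
```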

A recent paper (Parekh, Yang, & Honavar, 2000) presents two constructive algorithms able to cope with multiple classes. MPyramid-real and MTiling-real are two algorithms that extend the pyramid and tiling algorithms respectively and are able to deal with real inputs and multiple output classes. The problem of how to extend constructive algorithms from two classes to multiclass classification is also thoroughly discussed in (Draghici, 1995; Parekh, Yang, & Honavar, 1995).

(Biehl & Opper, 1991) presents a tiling-like constructive algorithm for a parity machine. (Buhot & Gordon, 2000) derives upper and lower bounds for the typical storage capacity of this constructive algorithm.

The Cascade Correlation (CC) net proposed by Fahlman and Lebiere in (Fahlman & Lebiere, 1990) uses the whole pattern set to construct an architecture of cascading units. In general, using the whole training set has the advantage that the solution can be optimized from some point of view. In CC's case, the weights are chosen so that the correlation between the output of the last added unit and the residual output error is maximum. This ensures that the unit is as useful as possible in the given context. Because of this, in general, CC will construct networks that use slightly fewer hidden units than CBD. However, the generalization provided by CBD is comparable with the generalization offered by CC. Various extensions and modifications of the original cascade correlation have been proposed. These include improvements of the initialization (Liang & Dai, 1998; Lehtokangas, 1999a), improving generalization through avoiding overfitting (Tetko & Villa, 1997b, 1997a), a projection pursuit perspective (Hwang, You, Lay, & Jou, 1996) and many others. Also, cascade correlation has been modified for recurrent networks (Fahlman, 1991), parallel execution (Springer & Gulati, 1995), use with limited precision weights in hardware implementations (Hohfeld & Fahlman, 1992) and combined with other techniques such as ARTMAP (Tan, 1997). An empirical evaluation of six algorithms from the cascade correlation family is presented in (Prechelt, 1997).
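
In (Fahlman & Lebiere, 1990), the quantity maximized when a candidate unit is installed is S = Σ_o | Σ_p (V_p − mean(V)) (E_{p,o} − mean(E_o)) |, where V_p is the candidate's output on pattern p and E_{p,o} is the residual error of network output o on pattern p. The sketch below (with variable names of our choosing) computes this score:

```python
# Sketch of the cascade correlation candidate score: the summed magnitude
# of the covariance between the candidate unit's output and the residual
# error of each network output, taken over all training patterns.

def cc_score(V, E):
    """V: candidate unit output per pattern; E: residual errors, E[p][o]."""
    n_patterns, n_outputs = len(V), len(E[0])
    v_mean = sum(V) / n_patterns
    score = 0.0
    for o in range(n_outputs):
        e_mean = sum(E[p][o] for p in range(n_patterns)) / n_patterns
        cov = sum((V[p] - v_mean) * (E[p][o] - e_mean)
                  for p in range(n_patterns))
        score += abs(cov)
    return score
```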

A technique closely related to Cascade Correlation is the Cascade Error Projection (CEP) proposed in (Duong & Daud, 1999). The architecture generated by CEP is very similar to the original CC architecture and is generated in a similar manner, but using a different criterion. Because of this, the resulting network has several characteristics that make it suitable for hardware implementation in VLSI.

The extentron proposed by Baffes and Zelle in (Baffes & Zelle, 1992) grows multilayer networks capable of distinguishing non-linearly separable data using the perceptron rule for linear threshold units. The extentron looks for the best hyperplane relative to the examples in the training set. The extentron approaches the problem of separating non-linearly separable sets by connecting each hidden unit both to all the inputs and to all the previous units. Thus, the dimension of the problem's space is extended and the problem could become linearly separable. In the worst case, each unit will separate a single pattern, but experiments showed that this does not happen unless the problem is "pathologically difficult" (the two-spiral problem is quoted as such a problem). Experiments showed that CBD is more efficient than this even in solving the two-spiral problem. If the extentron is coupled with an output layer trained by backpropagation (in order to cope with continuous values), the large number of layers is an important disadvantage due to error attenuation across layers. The architecture generated by CBD always contains the same number of layers and does not have this problem if the network is to be subsequently trained with backpropagation. Furthermore, if the extentron architecture were to be implemented in hardware, synchronization problems might arise due to the existence of paths with very different lengths between input and output. This problem is not present in the case of the solution generated by CBD.

Romaniuk and Hall (Romaniuk & Hall, 1993) proposed a divide-and-conquer net which builds an architecture similar to that of a Cascade Correlation network. Their divide-and-conquer strategy starts with one neuron and the entire training set. If the problem is linearly inseparable (which is the usual situation), the first training is bound to fail and this is detected by a time-out condition. In comparison, CBD starts with the minimum problem which is guaranteed to have a solution. The divide-and-conquer technique has a more complicated pattern presentation algorithm and requires a pre-processing stage in which the nearest neighbor is found for each pattern in the training set.

The CARVE algorithm (Young & Downs, 1998) constructs the network through adding hidden units which carve the patterns of one class out of the whole pattern set. The locking detection mechanism of the CBD technique and the CARVE algorithm are both inspired by and make use of classical computational geometry techniques like the gift wrapping method of constructing the convex hull. However, the convex hull search is the centerpiece of the CARVE algorithm whereas in CBD, it is used only as an optional enhancement (Draghici, 1996). The connection between linear separability and convex hulls has also been studied extensively by others (see for instance (Bennett & Bredensteiner, 1998)).
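
For reference, here is a minimal two-dimensional sketch of the classical gift wrapping (Jarvis march) construction both techniques draw on; the helper names are ours and degenerate (collinear) configurations are ignored:

```python
# Sketch of 2-D gift wrapping: starting from an extreme point, the hull edge
# repeatedly pivots around the current vertex until it returns to the start.

def cross(o, a, b):
    # > 0 if the turn o -> a -> b is counter-clockwise
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def gift_wrap(points):
    """Wrap a list of >= 3 distinct 2-D points; returns hull vertices in order."""
    start = min(points)                  # lowest-leftmost point is on the hull
    hull, p = [], start
    while True:
        hull.append(p)
        q = points[0] if points[0] != p else points[1]
        for r in points:
            if cross(p, q, r) < 0:       # r lies clockwise of edge p -> q: pivot
                q = r
        p = q
        if p == start:
            return hull
```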

The CLEF algorithm proposed in (Precup & Utgoff, 1998)


combines a version of a linear machine with a non-linear function approximation that constructs its own features (Utgoff & Precup, 1997, 1998). The algorithm finds the necessary non-linear decision boundaries and was shown to provide results better than C4.5.

(Kwok & Yeung, 1997b) studies a number of objective functions for training hidden units in constructive algorithms. (Kwok & Yeung, 1997a) studies the use of constructive algorithms for structure learning in regression problems. (Treadgold & Gedeon, 1999a) investigates the generalization abilities of constructive cascade algorithms. The Casper algorithm (Treadgold & Gedeon, 1998, 1999b, 1997) addresses the issue of the amount of regularization necessary in order to obtain good generalization. (Hosseini & Jutten, 1999) proposes a constructive approach in which the target network is constructed by adding small, variable-size accessory networks as opposed to adding simple units.

(Campbell & Vicente, 1995) presents a constructive algorithm that can perform classification of binary or analogue input data using only binary weights. Such weights are particularly interesting for a hardware implementation.

A constructive backpropagation algorithm is presented in (Lehtokangas, 1999a). This algorithm combines the classical idea of backpropagating the error with the constructive approach of adding units as they are needed. Structure adaptation features of constructive backpropagation include the ability to delete units in order to improve generalization. The algorithm was tested on modeling problems and shown to be better than cascade correlation for the problems studied. A version of this algorithm adapted for recurrent networks is presented in (Lehtokangas, 1999b).

(Thimm & Fiesler, 1997) uses a boolean approximation of the given problem to construct an initial neural network topology. Subsequently, this network is trained on the original real-valued data. The resulting topology is shown to exhibit better generalization capabilities than fully connected high order multilayer perceptrons.

In (Hammadi, Ohmameunda, Kaneo, & Ito, 1998), a dynamic constructive algorithm is used to construct fault tolerant feedforward networks. This dynamic constructive fault tolerant algorithm (DCFTA) estimates a relevance factor for each weight and uses it to update the weights in a selective manner.

Conclusions

This paper presented the Constraint Based Decomposition (CBD) as a constructive algorithm with guaranteed convergence, fast training and good generalization properties. The CBD training algorithm is composed of any weight updating algorithm able to train a single neuron (perceptron, delta rule, etc.), a pattern presentation algorithm and a network construction method. CBD is able to deal with binary, n-ary, class-labeled and real-valued problems as exemplified in the experiments presented. Like any constructive algorithm, CBD does not need an a priori guess of the necessary network architecture since it constructs a sufficiently powerful architecture during the training. Furthermore, CBD can

provide a symbolic description of the solution found for any classification problem, thus being more transparent than other neural network algorithms. For real-valued problems, this symbolic solution is limited to describing the partitioning of the input space.

The paper also presented two enhancements of the CBD approach: the locking detection and the redundancy elimination. The locking detection technique uses a computational geometry approach to detect those situations in which the current hyperplane is locked into its current position by some of the patterns in the given subgoal. The elimination of such problems can increase the speed of the training. However, in high dimensional spaces the computational effort required by this detection might offset the speed gain obtained through the elimination of the linearly inseparable problems. Therefore, the use of this enhancement is only recommended for low dimensional spaces (fewer than 5 dimensions).

The redundancy elimination technique eliminates locally and globally redundant hyperplanes. Thus, the algorithm ensures that the final solution will not contain any useless hyperplanes (neurons). The technique achieves the goal normally accomplished by a pruning technique. However, the redundancy elimination has the added advantages that: i) the technique is integrated into the convergence process; ii) the units are eliminated on-line, during the training, and thus iii) no training time is wasted on units that will eventually be eliminated.

The experiments presented showed that the technique is able to cope successfully with a variety of problems ranging from classical toy problems like XOR, parity and 2-spirals up to real-world problems.

CBD was compared with C4.5, C4.5 with rules, incremental decision trees, oblique classifiers, linear machine decision trees, CN2, learning vector quantization, backpropagation, nearest neighbor, Q* and radial basis functions on Glass, Ionosphere, Lung cancer, Wine, Pima indians, Bupa, TicTacToe, Balance, Iris, Lenses and Zoo from the UCI machine learning repository. The cross-validation experiments performed showed that the technique builds solutions quickly and reliably, with generalization performances that compare favorably with those of several other machine learning algorithms. The CBD technique may be an argument against the idea that constructive techniques trade generalization for training speed. Perhaps one can have both reliability and good generalization at the same time, after all. By performing comparisons with non-neural techniques, we think we have also shown that modern neural network techniques can compete successfully with any other learning paradigm.

References

An, A., & Cercone, N. (1998). ELEM2: A learning system for more accurate classifications. In R. E. Mercer & E. Neufeld (Eds.), Lecture notes in artificial intelligence (Vol. 1418, pp. 426–441). Vancouver.

Baffes, P. T., & Zelle, J. M. (1992). Growing layers of perceptrons: Introducing the extentron algorithm. In International joint conference on neural networks (Vol. 2, pp. 392–397). Baltimore.


Barkema, G. T., Andree, H. M. A., & Taal, A. (1993). The patch algorithm: Fast design of binary feedforward neural networks. Network, 5, 393–407.

Bennett, K. P., & Bredensteiner, E. J. (1998). In C. Gorini, E. Hart, W. Meyer, & T. Phillips (Eds.), Geometry at work. Mathematical Association of America.

Biehl, M., & Opper, M. (1991). Tiling-like learning in the parity machine. Physical Review A, 44(10), 6888–6894.

Bishop, C. M. (1994). Novelty detection and neural network validation. IEE Proceedings on Vision, Image and Signal Processing, 141(4), 217–222.

Blake, C., Keogh, E., & Merz, C. (1998). UCI repository of machine learning databases.

Borsuk, K. (1969). Multidimensional analytic geometry. Polish Scientific Publishers.

Brady, M., Raghavan, R., & Slawny, J. (1988). Gradient descent fails to separate. In Proceedings of the IEEE international conference on neural networks (Vol. 1, pp. 649–656).

Brady, M. L., Raghavan, R., & Slawny, J. (1989). Backpropagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5), 665–674.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth and Brooks.

Broomhead, D. S., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321–323.

Buhot, A., & Gordon, M. B. (2000). Storage capacity of a constructive learning algorithm. Journal of Physics A - Mathematical and General, 33(9), 1713–1727.

Campbell, C., & Vicente, C. P. (1995). The target switch algorithm: A constructive learning procedure for feed-forward neural networks [Letter]. Neural Computation, 7(6), 1245–1264.

Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261–283.

Courrieu, P. (1994). Three algorithms for estimating the domain of validity of feedforward neural networks. Neural Networks, 7(1), 169–174.

Draghici, S. (1994). The constraint based decomposition training architecture. In Proceedings of the world congress on neural networks (Vol. 3, pp. 545–555). San Diego, CA.

Draghici, S. (1995). Using constraints to improve generalization and training of feedforward neural networks: Constraint based decomposition and complex backpropagation. PhD thesis, University of St Andrews.

Draghici, S. (1996). Enhancements of the constraint based decomposition training architecture. In Proceedings of the IEEE international conference on neural networks ICNN'96 (Vol. 1, pp. 317–322). Washington, DC.

Duong, T. A., & Daud, T. (1999). Cascade error projection: A learning algorithm for hardware implementation. In J. Mira & J. V. Sánchez-Andrés (Eds.), Foundations and tools for neural modeling, Lecture Notes in Computer Science (pp. 450–457). Springer-Verlag.

Eklund, P. W. (2000). Comparative study of public-domain supervised machine-learning accuracy on the UCI database. In Proceedings of data mining and knowledge discovery: Theory, tools and technology (pp. 39–50). Orlando, Florida: SPIE.

Fahlman, S. E. (1991). The recurrent cascade-correlation architecture. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 190–196). Denver, CO: Morgan Kaufmann.

Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture (Tech. Rep.). Carnegie Mellon University.

Fiesler, E. (1994). Comparative bibliography of ontogenetic neural networks. In Proceedings of the international conference on neural networks (ICANN'94) (pp. 793–796). Sorrento, Italy.

Fiesler, E., & Beale, R. (Eds.). (1996). Handbook of neural computation. Oxford University Press and the Institute of Physics Publishing.

Frattale-Mascioli, F. M., & Martinelli, G. (1995). The oil-spot algorithm. IEEE Transactions on Neural Networks, 6, 794–797.

Frean, M. (1990). The Upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2, 198–209.

Gallant, S. I. (1986). Optimal linear discriminants. In Eighth international conference on pattern recognition (pp. 849–852). IEEE.

Hammadi, N., Ohmameunda, T., Kaneo, K., & Ito, H. (1998). Dynamic constructive fault tolerant algorithm for feedforward neural networks. IEICE Transactions on Information and Systems, E81D(1), 115–123.

Heath, D., Kasif, S., & Salzberg, S. (1993). Induction of oblique decision trees. In IJCAI-93. Washington, D.C.

Hohfeld, M., & Fahlman, S. E. (1992). Learning with limited numerical precision using the Cascade-Correlation algorithm. IEEE Transactions on Neural Networks, 3(4), 602–611.

Hosseini, S., & Jutten, C. (1999). Weight freezing in constructive neural networks: A novel approach. In Lecture notes in computer science: Engineering applications of bio-inspired artificial neural networks (Vol. 1607, pp. 10–20). Berlin: Springer-Verlag.

Hugill, M. (1985). Advanced statistics. Bell and Hyman.

Hwang, J.-N., You, S.-S., Lay, S.-R., & Jou, I.-C. (1996). The Cascade-Correlation learning: A projection pursuit learning perspective [Paper]. IEEE Transactions on Neural Networks, 7(2), 278–289.

Knuth, D. E. (1976). Big omicron and big omega and big theta. SIGACT News, 8(2), 18–24.

Kohonen, T. (1988). Learning vector quantization. Neural Networks, 1(suppl. 1), 303.

Kohonen, T. (1995). Learning vector quantization. In The handbook of brain theory and neural networks (pp. 537–540). Cambridge, Massachusetts: The MIT Press.

Kwok, T.-Y., & Yeung, D.-Y. (1997a). Constructive algorithms for structure learning in feedforward neural networks for regression problems [Paper]. IEEE Transactions on Neural Networks, 8(3), 630–645.

Kwok, T.-Y., & Yeung, D.-Y. (1997b). Objective functions for training new hidden units in constructive neural networks [Paper]. IEEE Transactions on Neural Networks, 8(5), 1131–1148.

Lang, K. J., & Witbrock, M. J. (1989). Learning to tell two spirals apart. In Proceedings of the 1988 connectionist models summer school (pp. 52–61). Morgan Kaufmann.

Lehtokangas, M. (1999a). Fast initialization for cascade-correlation learning [Brief Paper]. IEEE Transactions on Neural Networks, 10(2), 410.

Lehtokangas, M. (1999b). Fast initialization for cascade-correlation learning [Brief Paper]. IEEE Transactions on Neural Networks, 10(2), 410.

Liang, H., & Dai, G. L. (1998). Improvement of cascade correlation learning algorithm with an evolutionary initialization. Information Sciences, 112(1-4), 1–6.

Marchand, M., Golea, M., & Rujan, P. (1990). A convergence theorem for sequential learning in two-layer perceptrons. Europhysics Letters, 11(6), 487–492.


McCluskey, E. J., Jr. (1956). Minimization of boolean functions. Bell System Technical Journal, 35, 1417–1444.

Mezard, M., & Nadal, J. P. (1989). Learning in feedforward layered networks: The tiling algorithm. J. Phys. A: Math. Gen., 22, 2191–2203.

Mitchell, T. M. (1977). Version spaces: A candidate elimination approach to rule learning. IJCAI, 1, 1139–1151.

Mitchell, T. M. (1978). Version spaces: An approach to concept learning. Unpublished doctoral dissertation, Stanford University.

Moody, J., & Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.

Murthy, K. V. S. (1995). On growing better decision trees from data. Unpublished doctoral dissertation, Johns Hopkins University.

Murthy, S. K., Kasif, S., Salzberg, S., & Beigel, R. (1993). OC1: Randomized induction of oblique decision trees. In Proceedings of the eleventh national conference on artificial intelligence (pp. 322–327). Washington, D.C.

Musavi, M., Ahmed, W., Chan, K. H., Faris, K., & Hummels, D. (1992). On the training of radial basis function classifiers. Neural Networks, 5, 595–603.

Muselli, M. (1995). On sequential construction of binary neural networks. IEEE Transactions on Neural Networks, 6, 678–690.

Parekh, R., Yang, J., & Honavar, V. (2000). Constructive neural-network learning algorithms for pattern classification [Paper]. IEEE Transactions on Neural Networks, 11(2), 436.

Parekh, R. G., Yang, J., & Honavar, V. (1995). Constructive neural network learning algorithms for multi-category pattern classification (Tech. Rep. No. ISU-CS-TR 95-15). Ames, Iowa: Department of Computer Science, Iowa State University.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1481–1497.

Prechelt, L. (1997). Investigation of the CasCor family of learning algorithms. Neural Networks, 10(5), 885–896.

Precup, D., & Utgoff, P. E. (1998). Classification using Φ-machines and constructive function approximation. In Proc. 15th international conf. on machine learning (pp. 439–444). San Francisco, CA: Morgan Kaufmann.

Preparata, F. P., & Shamos, M. I. (1985). Computational geometry - an introduction. Springer-Verlag.

Quine, W. V. (1955). A way to simplify truth functions. American Mathematical Monthly, 62, 627–631.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Rauber, T. W., Barata, M. M., & Steiger-Garcao, A. (1993). A toolbox for analysis and visualization of sensor data in supervision. In Proceedings of the international conference on fault diagnosis. Toulouse, France.

Romaniuk, S. T., & Hall, L. O. (1993). Divide and conquer neural networks. Neural Networks, 6, 1105–1116.

Rujan, P., & Marchand, M. (1989). Learning by minimizing resources in neural networks. Complex Systems, 3, 229–241.

Sethi, I. K. (1990). Entropy nets: From decision trees to neural networks. Proceedings of the IEEE, 78, 1605–1613.

Sethi, I. K., & Otten, M. (1990). Comparison between entropy nets and decision tree classifiers. In International joint conference on neural networks (Vol. 3, pp. 41–46). San Diego, CA.

Sietsma, J., & Dow, R. J. F. (1991). Creating artificial neural networks that generalize. Neural Networks, 4, 67–79.

Sirat, J. A., & Nadal, J. P. (1990). Neural trees: A new tool for classification. Network: Computation in Neural Systems, 1(4), 423–438.

Smieja, F. J. (1993). Neural network constructive algorithms: Trading generalization for learning efficiency? Circuits Systems Signal Processing, 12(2), 331–374.

Springer, P. L., & Gulati, S. (1995). Parallelizing the cascade-correlation algorithm using time warp. Neural Networks, 8(4), 571–577.

Tan, A.-H. (1997). Cascade ARTMAP: Integrating neural computation and symbolic knowledge processing [Paper]. IEEE Transactions on Neural Networks, 8(2), 237–250.

Tetko, I. V., & Villa, A. E. P. (1997a). An efficient partition of training data set improves speed and accuracy of cascade-correlation algorithm. Neural Processing Letters, 6(1/2), 51–59.

Tetko, I. V., & Villa, A. E. P. (1997b). An enhancement of generalization ability in cascade correlation algorithm by avoidance of overfitting/overtraining problem. Neural Processing Letters, 6(1/2), 43–50.

Thimm, G., & Fiesler, E. (1997). Two neural network construction methods. Neural Processing Letters, 6(1/2), 25–31.

Treadgold, N. K., & Gedeon, T. D. (1997). Extending and benchmarking the CasPer algorithm. In A. Sattar (Ed.), Proceedings of the 10th Australian joint conference on artificial intelligence: Advanced topics in artificial intelligence (AI-97) (Vol. 1342, pp. 398–406). Berlin: Springer.

Treadgold, N. K., & Gedeon, T. D. (1998). Constructing higher order neurons of increasing complexity in cascade networks. Lecture Notes in Computer Science, 1416, 557–563.

Treadgold, N. K., & Gedeon, T. D. (1999a). Exploring constructive cascade networks [Paper]. IEEE Transactions on Neural Networks, 10(6), 1335.

Treadgold, N. K., & Gedeon, T. D. (1999b). A constructive cascade network with adaptive regularization. In J. Mira & J. Sanchez-Andres (Eds.), Proceedings of IWANN'99, to appear as Lecture Notes in Computer Science (pp. 40–49). Alicante, Spain: Springer-Verlag.

Utgoff, P. E. (1989). Incremental induction of decision trees. Machine Learning, 4, 161–186.

Utgoff, P. E., Berkman, N. C., & Clouse, J. A. (1997). Decision tree induction based on efficient tree restructuring. Machine Learning, 29, 5–44.

Utgoff, P. E., & Brodley, C. E. (1991). Linear machine decision trees (Technical Report No. UM-CS-1991-010). University of Massachusetts, Amherst, Computer Science.

Utgoff, P. E., & Precup, D. (1997). Constructive function approximation (Tech. Rep. No. 97-04). Amherst, MA: Department of Computer Science, University of Massachusetts.

Utgoff, P. E., & Precup, D. (1998). Constructive function approximation. In H. Liu & H. Motoda (Eds.), Feature extraction, construction and selection: A data mining perspective (Vol. 453). Kluwer Academic Publishers.

Yang, J., Parekh, R., & Honavar, V. (1998). DistAI: An inter-pattern distance-based constructive learning algorithm. In World congress on computational intelligence (pp. 2208–2213). Anchorage, Alaska.

Yang, J., Parekh, R., Honavar, V., & Dobbs, D. (1999). Data-driven theory refinement using KBDistAl. Lecture Notes in Computer Science, 1642, 331–??

Young, S., & Downs, T. (1998). CARVE - a constructive algorithm for real-valued examples. IEEE Transactions on Neural Networks, 9(6), 1180–1190.