

Hidden Unit Reduction of Artificial Neural Network on English Capital Letter Recognition

    Kietikul JEARANAITANAKIJ

Department of Computer Engineering, Faculty of Engineering
King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand

    Ouen PINNGERN

Department of Computer Engineering, Faculty of Engineering
King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand

Abstract: We present an analysis of the minimum number of hidden units that an artificial neural network requires to recognize English capital letters. The letter font that we use as a case study is the System font. In order to obtain the minimum number of hidden units, the number of input features has to be minimized. Firstly, we apply our heuristic for pruning unnecessary features from the data set. The small number of remaining features leads the artificial neural network to have a correspondingly small number of input units, because each feature has a one-to-one mapping onto an input unit. Next, the hidden units are pruned away from the network by using the hidden unit pruning heuristic. Both pruning heuristics are based on the notion of information gain, and they can efficiently prune away unnecessary features and hidden units from the network. The experimental results show the minimum number of hidden units required to train the artificial neural network to recognize English capital letters in System font. In addition, the classification accuracy produced by the artificial neural network is practically high. As a result, the final artificial neural network that we produce is remarkably compact and reliable.

Keywords: Artificial Neural Network, letter recognition, hidden unit, pruning, information gain

I. INTRODUCTION

An artificial neural network can be defined as a model of reasoning based on the human brain. Recent developments in artificial neural networks have been used widely in character recognition because of their ability to generalize well on unseen patterns [1-8]. Recognition of both printed and handwritten letters is a typical domain where neural networks have been successfully applied. Letter recognition, commonly called OCR (Optical Character Recognition), is the ability of a computer to translate character images into a text file using special software. It allows us to take a printed document and put it into a computer in editable form without the need of retyping the document (Negnevitsky, 2002, [9]).

One issue in letter recognition using an artificial neural network as the learning model is the suitable number of hidden units. The number of neurons in the hidden layer affects both the accuracy of character recognition and the speed of training the network. Complex patterns cannot be detected by a small number of hidden units; however, too many of them can severely increase the computational burden. Another problem is overfitting. The greater the number of hidden units, the greater the ability of the network to recognize existing patterns. However, if the number of hidden units is too large, the network might simply memorize all training examples. This may prevent it from generalizing, causing it to produce incorrect outputs when presented with patterns that were not used in training.

There are some proposed methods that can be used to reduce the number of hidden units in an artificial neural network. Sietsma and Dow [10], [11] suggested an interactive method in which they inspect a trained network and identify a hidden unit that has a constant activation over all training patterns. Then, the hidden unit which does not influence the output is pruned away. Murase et al. [12] measured the Goodness Factors of the hidden units in the trained network. The unit which has the lowest value of the Goodness Factor is removed from the hidden layer. Hagiwara [13] presented the Consuming Energy and the Weights Power methods for the removal of hidden units and weights, respectively. Jearanaitanakij and Pinngern [14] proposed an information-gain based pruning heuristic that can efficiently remove unnecessary hidden units in nearly minimum time.

In this paper, we analyze the reduction of hidden units of the artificial neural network for recognizing English capital letters printed in System font. Each image of an English capital letter consists of 10x10 pixels, and each pixel (or feature) is represented by either 1 or 0. Our objective is to determine the minimum number of hidden units required to classify the 26 English letters at a practical recognition rate. Firstly, unnecessary features are filtered out of the data set by the feature pruning heuristic [14]. Then the hidden unit pruning heuristic [14] is utilized in order to find a suitable number of hidden units. The analysis of the experimental results shows that an exceedingly low number of hidden units is required for the classification process. In addition, the results support our heuristics [14] in terms of the compact network and the nearly minimum pruning time.

The rest of this paper is organized as follows. In Section 2, we give a brief review of information gain and our hidden unit pruning heuristic. In Section 3, the data set of English capital letters is described. Next, in Section 4, we describe the experimental results and analysis. Finally, in Section 5, the conclusions and possible future work are discussed.


II. HIDDEN UNIT PRUNING

We begin by briefly reviewing the notion of information gain and our hidden unit pruning heuristic.

A. Information Gain

Entropy, a measure commonly used in information theory, characterizes the (im)purity of an arbitrary collection of examples. Given a collection S containing examples of each of the C outcomes, the entropy of S is

$Entropy(S) = -\sum_{I \in C} p(I)\,\log_2 p(I)$,   (1)

where p(I) is the proportion of S belonging to class I. Note that S is not a feature but an entire sample set. Entropy is 0 if all members of S belong to the same class; its scale runs from 0 (purity) to 1 (impurity). The next measure is information gain. This was first defined by Shannon and Weaver [15] to measure the expected reduction in entropy. For a particular feature A, Gain(S, A) denotes the information gain of the sample set S on the feature A and is defined by the following equation:

$Gain(S, A) = Entropy(S) - \sum_{v \in A} \frac{|S_v|}{|S|}\,Entropy(S_v)$,   (2)

where the summation is over all possible values v of the feature A; $S_v$ is the subset of S for which feature A has value v; $|S_v|$ is the number of elements in $S_v$; and $|S|$ is the number of elements in S.

The merit of the information gain is that it indicates the degree of significance that a particular feature has on the classification output. Therefore, the more information gain a feature has, the more significant the feature is. We always prefer a feature which has a high value of information gain to those which have lower values.
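To make the two measures concrete, here is a small Python sketch of equations (1) and (2) for binary features; the toy pixel/label data and the function names are illustrative only and are not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p(I) * log2 p(I), as in Eq. (1)."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v), as in Eq. (2)."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy example: one binary feature (e.g. a pixel) over six labelled patterns.
pixel = [1, 1, 0, 0, 1, 0]
letter = ['A', 'A', 'B', 'B', 'A', 'B']
print(entropy(letter))                  # 1.0 (evenly split between two classes)
print(information_gain(pixel, letter))  # 1.0 (the pixel separates the classes perfectly)
```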

B. Hidden Unit Pruning Heuristic

We describe a hidden unit pruning heuristic (Jearanaitanakij and Pinngern, 2005, [14]) used as the ordering criterion for hidden unit pruning in the artificial neural network. Before performing the hidden unit pruning, we must calculate the information gains of all features and then pass these gains to the hidden units in the next layer.

The hidden unit pruning heuristic is based on the propagated information gains from the feature units. Before going further, let us define some notations used in this section: the information gain of feature unit i ($Gain_i$), the incoming information gain of a hidden unit ($Gain_{In}$), the outgoing information gain of a hidden unit ($Gain_{Out}$), the weight from the i-th unit of the (n-1)-th layer to the j-th unit of the n-th layer ($w_{ji}^{n-1,n}$), and, similarly, the weight from the j-th unit of the n-th layer to the k-th unit of the (n+1)-th layer ($w_{kj}^{n,n+1}$). All notations are shown in Fig. 1.

    Figure 1. Network notations

The amount of information received at a hidden unit is the summation, over training patterns, of the squared products of the weights, which connect the feature units to a hidden unit in a hidden layer, and the information gains of all feature units. The result is then averaged over the number of training patterns and the number of feature units. We define the incoming information gain of the j-th hidden unit in the n-th layer ($Gain_{In_j}^{n}$) as follows:

$Gain_{In_j}^{n} = \frac{1}{PI} \sum_{P} \sum_{i} (w_{ji}^{n-1,n})^2 \, Gain_i^{n-1}$,   (3)

where P and I are the number of training patterns and the number of feature units in the (n-1)-th layer, respectively. This $Gain_{In_j}^{n}$ is, in turn, used for calculating the outgoing information gain of the j-th hidden unit. The degree of importance of a particular hidden unit can be determined by its outgoing information gain ($Gain_{Out_j}^{n}$). The outgoing information gain of a particular hidden unit is the summation, over training patterns, of the squared products of the weights, which connect the hidden unit to the output units, and the incoming information gain of that hidden unit. The result is then averaged over the number of training patterns and the number of output units. The outgoing information gain of the j-th hidden unit in the n-th layer ($Gain_{Out_j}^{n}$) is given by:

$Gain_{Out_j}^{n} = \frac{1}{PO} \sum_{P} \sum_{k} (w_{kj}^{n,n+1})^2 \, Gain_{In_j}^{n}$,   (4)

where O is the number of output units in the (n+1)-th layer. Note that the number of training patterns, P, in both (3) and (4) is the number of training patterns that the network has seen so far. The hidden unit which has the lowest outgoing information gain should be removed first from the trained network because it does not affect the convergence time for retraining the network very much. Only one hidden unit is removed at a time, until the network cannot converge. Then, the last pruned unit and the network configuration are restored.
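As an illustration of equations (3) and (4), the following sketch computes the incoming and outgoing information gains of every hidden unit from the trained weight matrices and the feature gains, and reports the unit that would be pruned first. The array shapes, variable names, and the simplification of the sum over training patterns (taken here with the fixed trained weights, so the 1/P factor cancels) are assumptions for illustration rather than the authors' exact procedure.

```python
import numpy as np

def hidden_unit_gains(w_in, w_out, feature_gains):
    """GainIn (Eq. 3) and GainOut (Eq. 4) for each hidden unit.

    w_in  : (H, I) weights from the I feature units to the H hidden units
    w_out : (O, H) weights from the H hidden units to the O output units
    feature_gains : (I,) information gains of the feature units

    With the weights held fixed after training, the sum over the P training
    patterns in Eqs. (3)-(4) is P copies of the same term, so the 1/P factor
    cancels and is omitted here (a simplifying assumption).
    """
    H, I = w_in.shape
    O, _ = w_out.shape
    gain_in = (w_in ** 2) @ feature_gains / I            # Eq. (3), shape (H,)
    gain_out = (w_out ** 2).sum(axis=0) * gain_in / O    # Eq. (4), shape (H,)
    return gain_in, gain_out

# Illustrative sizes taken from the paper's initial network: 100-10-26.
rng = np.random.default_rng(0)
w_in = rng.uniform(-1, 1, size=(10, 100))
w_out = rng.uniform(-1, 1, size=(26, 10))
feature_gains = rng.uniform(0, 1, size=100)

gain_in, gain_out = hidden_unit_gains(w_in, w_out, feature_gains)
print("prune hidden unit", int(np.argmin(gain_out)))  # lowest outgoing gain goes first
```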


III. DATA SET

The data set used as the case study is the set of twenty-six English capital letters (A to Z) printed in System font. Each letter image is represented by 10x10 pixels, and a particular pixel can be either on (1) or off (0). We scan each pixel in the enhanced letter image from top to bottom and from left to right to locate the capital printed letter on the paper. An assumption has been made that the letters are clearly separated from each other.

    Figure 2. Image transformation

Figure 3. The 26 English capital letter images without noise

As shown in Figure 2, all pixels in an extracted letter are transformed to either 1 or 0. These pixels represent the features of the training set of the artificial neural network. For a particular pattern, there can be only one letter that corresponds to it. A set of noise-free images of the 26 letters is shown in Figure 3. In order to be realistic, we add more letter images in which each pixel has a noise probability of 0.05. Each letter has 5 noise-free and 5 noisy images; therefore, 260 letter images are used as the data set. The data set is randomly decomposed into 130 letter images for the training set and 130 letter images for the test set. After a letter image has been transformed into an array of 10x10 binary-valued features, the artificial neural network is brought into the training procedure. All features connect to input units in a one-to-one relationship. The outputs are encoded into 26 units, each standing for an English capital letter. For a particular classification, one of the 26 output units has the value 1 whereas all other output units must contain 0.
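As an illustration of the data layout described above, the sketch below flips each pixel of a 10x10 binary letter bitmap with probability 0.05 and builds the corresponding 100-element feature vector and 26-unit one-hot target; the random bitmap is a stand-in for the actual System-font glyphs.

```python
import numpy as np

rng = np.random.default_rng(42)
NOISE_P = 0.05  # per-pixel flip probability used in the paper

def add_noise(image, p=NOISE_P):
    """Flip each binary pixel independently with probability p."""
    flips = rng.random(image.shape) < p
    return np.where(flips, 1 - image, image)

def one_hot(letter_index, num_classes=26):
    """Target vector: 1 at the letter's output unit, 0 elsewhere."""
    target = np.zeros(num_classes)
    target[letter_index] = 1.0
    return target

# A stand-in 10x10 bitmap (the real data set uses System-font glyphs).
clean_letter = rng.integers(0, 2, size=(10, 10))

features = add_noise(clean_letter).flatten()  # 100 binary input features
target = one_hot(0)                           # e.g. the letter 'A'
print(features.shape, target.shape)           # (100,) (26,)
```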

IV. EXPERIMENTAL RESULTS

We train the 26-letter data set with an initial artificial neural network which has 100 input units, 10 hidden units, and 26 output units. There is only one hidden layer between the input and output layers. The learning algorithm used in the training process is the standard back-propagation algorithm (Rumelhart et al., 1986, [17]), without momentum. All the weights and thresholds are randomly initialized in the range between -1 and +1. In order to obtain the highest recognition rate, the sum-squared error is required to converge below 0.3.

Note that the number of hidden units here, i.e. 10, is not the minimum number but rather a number that allows the artificial neural network to converge easily. Our goal, however, is to find the minimum number of hidden units of the artificial neural network that still correctly classifies the patterns at a high recognition rate.
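For reference, a minimal sketch of the initial 100-10-26 network and the standard back-propagation update described above, with weights and thresholds initialized uniformly in [-1, +1] and training stopped once the sum-squared error falls below 0.3. The sigmoid activation and the learning rate are assumptions, since the paper does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Initial architecture from the paper: 100 inputs, 10 hidden units, 26 outputs.
W1 = rng.uniform(-1, 1, size=(10, 100)); b1 = rng.uniform(-1, 1, size=10)
W2 = rng.uniform(-1, 1, size=(26, 10));  b2 = rng.uniform(-1, 1, size=26)
LR = 0.1          # assumed learning rate (not reported in the paper)
SSE_TARGET = 0.3  # convergence criterion used in the paper

def train_epoch(X, T):
    """One epoch of standard on-line back-propagation; returns the sum-squared error."""
    global W1, b1, W2, b2
    sse = 0.0
    for x, t in zip(X, T):
        h = sigmoid(W1 @ x + b1)            # hidden activations
        y = sigmoid(W2 @ h + b2)            # output activations
        e = t - y
        sse += float(e @ e)
        d2 = e * y * (1 - y)                # output deltas
        d1 = (W2.T @ d2) * h * (1 - h)      # hidden deltas
        W2 += LR * np.outer(d2, h); b2 += LR * d2
        W1 += LR * np.outer(d1, x); b1 += LR * d1
    return sse

# Dummy stand-ins for the 130 binary training patterns and one-hot targets.
X = rng.integers(0, 2, size=(130, 100)).astype(float)
T = np.eye(26)[rng.integers(0, 26, size=130)]
print(train_epoch(X, T))  # training would repeat epochs until the SSE drops below SSE_TARGET
```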

Since the number of hidden units depends on the number of input features, it is worthwhile to remove feature units before the hidden unit pruning begins. The idea of feature removal is similar to the hidden unit pruning: instead of the outgoing information gain, the information gain of every feature is used as the pruning criterion. When the initial network is trained, the feature which has the lowest information gain is removed first from the network. Only one feature unit is removed at a time, until the network cannot converge. Then the final number of features is returned. The experimental result on the number of input features is depicted in Fig. 4.
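The same remove-retrain-restore loop drives both the feature pruning and the hidden unit pruning. The sketch below captures that control flow; `criterion`, `train_until_converged`, and `remove_unit` are hypothetical callbacks standing in for the information-gain ranking, the retraining step, and the network-editing step.

```python
import copy

def prune_units(network, data, criterion, train_until_converged, remove_unit):
    """Generic one-at-a-time pruning loop, following the procedure described above.

    criterion(network, data) -> list of scores, one per prunable unit
        (feature information gain, or outgoing gain for hidden units).
    train_until_converged(network, data) -> True if the sum-squared error drops
        below the target within the allowed number of epochs, else False.
    remove_unit(network, index) -> the network with that unit deleted.
    """
    while True:
        scores = criterion(network, data)
        if not scores:
            return network
        worst = min(range(len(scores)), key=scores.__getitem__)
        candidate = remove_unit(copy.deepcopy(network), worst)
        if train_until_converged(candidate, data):
            network = candidate   # pruning succeeded, continue with the smaller network
        else:
            return network        # restore the last network that still converged
```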

Figure 4. Number of features during the training (number of features vs. number of epochs)

The number of features is constant at 100 units during the first 1473 training epochs. This is the period during which we train the initial neural network to reach convergence. When the network


is trained, the number of features keeps decreasing until it settles at 37. This means that the essential number of features for classifying the 26 English capital letters in System font is 37. This is not only the minimum number of features, but also the number of features that still maintains the highest recognition rate of the artificial neural network.

Figure 5. Number of hidden units during the training process (number of hidden units vs. number of epochs)

Fig. 5 illustrates the number of hidden units during the training process. After the feature pruning has been completed, at the 2000th epoch, the hidden unit which has the lowest outgoing information gain is pruned away. The hidden unit pruning process then pauses, shown as a horizontal line, until the network is retrained. The hidden units are removed in succession until the network cannot be retrained. We discover that the final number of hidden units that maintains the highest recognition rate is 6.

Figure 6. Sum-squared error during the training (sum-squared error vs. number of epochs)

Fig. 6 shows the sum-squared error throughout the training process. The error gradually decreases from 4.75 to 0.3 within 1473 training epochs. When the network reaches a convergence point, the feature pruning process starts removing features one by one. This causes the sum-squared error to increase slightly; however, the error then decreases again within a short period of time. A similar situation happens with the hidden unit pruning, which starts at the 2001st training epoch. At this point, the sum-squared error suddenly rises to 3.5. This high error does not prolong the network training because the abrupt reduction of the error allows the network to converge quickly. The ripples on the sum-squared error indicate the places where the hidden unit reductions take place. The pruning process finishes when the sum-squared error does not decrease for 500 consecutive epochs. Then the network is restored to the previous convergence point. This restoration of the network back to the previous convergence point explains why the sum-squared error at the end of the graph in Fig. 6 suddenly falls below 0.3.

TABLE I. ACCURACY RATE ON THE TEST SET (%)

Conventional Neural Network    Proposed Method
94.10                          97.71

Table I shows the accuracy rates on the test set for the conventional neural network and our proposed method. The conventional approach has 10 hidden units and 100 input units, while the proposed method has 6 hidden units and 37 input units. We intentionally use different numbers of hidden units in order to verify that having a lower number of hidden units does not degrade the accuracy when classifying unseen data. The result in Table I shows that the conventional neural network has a lower accuracy rate than the proposed method. This can be explained as the effect of the overfitting problem that occurs in the conventional network. By having unnecessary hidden units, the conventional network memorizes all training patterns instead of learning them. Moreover, our approach removes unimportant features from the original feature set, which filters some noise out of the training set.

V. CONCLUSIONS

We give an analysis of the hidden units that are necessary to recognize the English capital letters printed in System font. The 10x10 pixels in the letter image are the features that are passed into the input units of the artificial neural network. In the input layer, information gain indicates the degree of importance of a feature. The feature which has the smallest information gain is not important to the output classification and should be ignored. As a result, we have the smallest number of epochs needed for retraining when that feature is pruned away. In the hidden layer, the hidden unit which has the smallest outgoing information gain tends to propagate only a small amount of information to the output units in the next layer. Consequently, removing that unit from the network has little effect on the retraining time. The experimental results show that the number of hidden units necessary to identify the 26 English capital letters in System font, which has 37 essential features, is 6 units. In addition, this small-sized artificial neural network gives a testing accuracy rate of 97.17%. Removing unnecessary hidden units reduces the overfitting problem that may occur in the network. If the network has too many hidden units, it will memorize all the training patterns instead of learning them. This situation may prevent the network from generalizing, causing it to produce incorrect outputs when presented with patterns that were not used in training.


    REFERENCES

[1] B. Widrow et al., "Layered neural nets for pattern recognition," IEEE Trans. on ASSP, vol. 36, no. 7, July 1988.
[2] V.K. Govindan and A.P. Shivaprasad, "Character recognition - a review," Pattern Recognition, vol. 23, no. 2, pp. 671-679, 1990.
[3] B. Boser et al., "Hardware requirements for neural network pattern classifiers," IEEE Micro, pp. 32-40, 1992.
[4] A. Shustorovich and C.W. Thrasher, "Neural network positioning and classification of handwritten characters," Neural Networks, vol. 9, no. 4, pp. 685-693, 1996.
[5] R. Parekh, J. Yang, and V. Honavar, "Constructive neural network learning algorithms for pattern classification," IEEE Transactions on Neural Networks, vol. 11, no. 2, pp. 436-451, 2000.
[6] J. Kamruzzaman, Y. Kumagai, and S. Mahfuzul Aziz, "Character recognition by double backpropagation neural network," Proceedings of IEEE Region 10 Annual Conference, Speech and Image Technologies for Computing and Telecommunications, vol. 1, pp. 411-414, 1997.
[7] J. Kamruzzaman, "Comparison of feed-forward neural net algorithms in application to character recognition," Proceedings of IEEE Region 10 International Conference on Electrical and Electronic Technology, vol. 1, pp. 165-169, 2001.
[8] D. Jacquet and G. Saucier, "Design of a digital neural chip: application to optical character recognition by neural network," Proceedings European Design and Test Conference, pp. 256-260, 1994.
[9] M. Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Addison-Wesley, 2002.
[10] J. Sietsma and R.J.F. Dow, "Creating artificial neural networks that generalize," Neural Networks, vol. 4, no. 1, pp. 67-69, 1991.
[11] J. Sietsma and R.J.F. Dow, "Neural net pruning - why and how," in Proc. IEEE Int. Conf. Neural Networks, vol. I (San Diego), pp. 325-333, 1988.
[12] K. Murase, Y. Matsunaga, and Y. Nakade, "A back-propagation algorithm which automatically determines the number of association units," Proc. IEEE Int. Conf. Neural Networks, vol. 1, pp. 783-788, 1991.
[13] M. Hagiwara, "Removal of hidden units and weights for back propagation networks," Proc. IJCNN'93, vol. 1, pp. 351-354, 1993.
[14] K. Jearanaitanakij and O. Pinngern, "Determining the orders of feature and hidden unit prunings of artificial neural networks," Proc. IEEE 2005 Fifth Int. Conf. on Information, Communications and Signal Processing (ICICS), w3c.3, pp. 353-356, 2005.
[15] C.E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, Illinois, 1949.
[16] J.R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, issue 1, pp. 81-106, 1986.
[17] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Exploration in the Microstructure of Cognition, vol. 1: Foundations, eds. D.E. Rumelhart and J.L. McClelland, pp. 318-362, The MIT Press, Cambridge, Massachusetts, 1986.