
An Optimum Classifier Approximation for Network-Based Handwritten Character Recognition

Marcello Federico, Stefano Messelodi and Luigi Stringa

IRST - Istituto per la Ricerca Scientifica e Tecnologica, 38100 Trento, Italy

Internal Report - Draft Version, December 1991

Abstract

An approximation of the Bayes decision rule and its implementation on a two-layered network are described. The net is trained in two phases: first, probabilities of the discrete-valued input features are learnt by applying a Good-Turing based estimator; second, net weights are estimated by applying an adaptive gradient descent technique. Experiments were performed on a database of 67,000 real life handwritten numerals. By using input units that read sub-patterns of the character bitmap, a recognition rate of 93.30% is achieved, with 1.39% substitution rate. The paper shows that computational complexity and implementation characteristics make this approach a possible competitor of artificial neural networks described in the literature.

1 Introduction

Classification is the problem of mapping a set of patterns into a fixed number of classes. With a statistical approach, classification is usually based on an a posteriori probability. Moreover, many non-statistical classifiers can be seen as approximations of statistical classifiers. For example, recent work shows that both radial basis function and sigmoid artificial neural networks (ANNs) can be trained to approximate the performance of a Bayesian discriminant classifier [3, 4, 5].

Classification is usually performed on some measurements (features) of the pattern to be classified. Feature selection can be viewed as a mapping from the input pattern space, usually unstructured and high dimensional, to a space with lower dimension. The design of the feature extractor is considered to be the problem-dependent aspect of pattern classification and requires most of the ad-hoc skill needed to compete with human performance. Generally, in statistical pattern recognition, features generate a metric space and are considered as a random vector. By contrast, our approach uses metric-less discrete features [6, 13]. This peculiarity proves to be particularly useful for classification problems where no good metric is available, or when both qualitative and quantitative features can be used.

This paper is organized as follows. The next section introduces the Bayes decision rule. In Section 3 the curse of dimensionality problem is discussed and some possible solutions are given. The approximation of the likelihood formula proposed here follows. Sections 4 and 5 introduce the classifier architecture and the training algorithms. Sections 6 and 7 respectively report results on handwritten character classification and provide information about the time and space complexity of the algorithms. Section 8 concludes the paper with comments and proposals for future developments.

2 Bayes decision rule

Bayes classification can be formalized in terms of an optimization problem [1]. Given a pattern x and m possible classes ω1, . . . , ωm to assign it to, that class is chosen which minimizes the conditional risk

    L_x(\omega_i) = \sum_{j=1}^{m} \lambda(\omega_i/\omega_j)\, P(\omega_j/x)    (1)

where λ(ωi/ωj) represents the loss of deciding to place x into class ωi while it belongs to class ωj, and P(ωj/x) is the probability of class ωj given that pattern x was observed. By using the usual symmetric loss function

    \lambda(\omega_i/\omega_j) = 1 - \delta_{i,j}

with δi,j = 1 if i = j and δi,j = 0 if i ≠ j, minimum loss is obtained by choosing a class ωi that maximizes P(ωi/x). That is, given a pattern x the classifier should decide in favour of a class ωi satisfying the following relation:

    P(\omega_i/x) = \max_{1 \le j \le m} P(\omega_j/x)    (2)

It can be shown that, in this way, the probability of errors is also minimized. The well known Bayes' formula allows the right hand side probability of (2) to be rewritten as

    P(\omega_j/x) = \frac{P(x/\omega_j)\, P(\omega_j)}{P(x)}    (3)

where P(ωj) is the (a priori) probability of class ωj, P(x/ωj) is the probability that when class ωj occurs, pattern x is observed, and P(x) is the average probability that pattern x is observed; that is,

    P(x) = \sum_{j=1}^{m} P(\omega_j)\, P(x/\omega_j)    (4)

Since maximization in (2) is carried out with a fixed value of x, it follows that the aim of the classifier is to find the class ωi that maximizes the likelihood

    \max_{1 \le j \le m} P(x/\omega_j)\, P(\omega_j)    (5)

If x = (x1, . . . , xn) is a realization of a random vector X = (X1, . . . , Xn) of n discrete-valued features, a general form of (5) becomes

    \max_{j} \left( P(\omega_j) \prod_{i=1}^{n} P(x_i/x_{i+1}, \ldots, x_n, \omega_j) \right)    (6)

where P(xi/xi+1, . . . , xn, ωj) represents a progression of dependencies among the features. In the next sections we will indicate random variables by upper-case letters, and values of them by lower-case letters.

3 Product approximations

In many applications, the Bayesian decision rule involves the computation of large space probabilities (6). If probabilities are estimated from a finite set of samples, the so called curse of dimensionality arises. That is, the sample size needed in order to compute reasonable estimates grows exponentially with the dimension n. To overcome this difficulty the likelihood formula is often approximated by introducing some simplifying assumption. The simplest approximation is obtained by assuming mutual independence between features conditioned on a fixed class. Formally, given a class ωj:

    P(X_i/X_{i+1}, \ldots, X_n, \omega_j) = P(X_i/\omega_j) \qquad \forall i = 1, \ldots, n    (7)

The well known version of (6) follows:

    \max_{j} \left( P(\omega_j) \prod_{i=1}^{n} P(x_i/\omega_j) \right)    (8)

Other more sophisticated approximations have been proposed in the literature, with different degrees of generality or problem dependency. For example, if a notion of time can be introduced into the problem, a Markovian source assumption can provide many solutions. Moreover, a general solution was presented in [7] which is based on a first order approximation of (6). This technique introduces a partial ordering between the features, called a dependence tree. Thereafter, the problem is reduced to finding an optimal dependence tree. We propose a formula that modifies the product of equation (8) by introducing an exponent for each class-feature probability. Hence, the following maximization formula is proposed:

    \max_{j} \left( O_j(x) = P(\omega_j) \prod_{i=1}^{n} P(x_i/\omega_j)^{\theta_{ij}} \right)    (9)

Exponents are estimated over a sample by minimizing an error function that guarantees asymptotic convergence to the Optimum Classifier's performance [4]. Estimates of the probabilities and exponents in (9) are computed in two steps. First, the distributions P(Xi/ωj) are estimated locally for each class; then a global optimization process is performed in the distribution space in order to find optimal exponent values. This latter phase turns out to be similar to training a generalized linear classifier [1]. The two processes are performed over different training samples. Further, in order to avoid overtraining, the parameter optimization is monitored by means of a cross-validation sample.

Decision rule (9) induces an efficient network-like implementation that will be described in the following section.


4 The classifier architecture

The resulting classifier is implemented with a two-layered network based on the S-NET architecture (U.S. Patent pending) presented in [14]. Figure 1 shows the network and outlines how the features work in the application proposed here. The (first layer) input units are fed with the feature values, which are defined as small portions of the character bitmap. Input units maintain the probabilities of formula (9) by means of efficient hash tables that map each feature value to a probability vector. Each component of this vector is passed to the corresponding output unit through a weighted link. Each output unit ωj computes a weighted product of its inputs, multiplies it by the local class probability P(ωj), and outputs the result. Classification is performed by choosing the class with the highest output level.
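To make the forward pass of this architecture concrete, here is a minimal Python sketch of decision rule (9) computed with dictionary-based lookup tables. It is an illustration only, not the S-NET implementation; the names classify, tables, theta and priors are ours, and the product is taken in log space to avoid numerical underflow.

    import math

    def classify(x, tables, theta, priors):
        """Sketch of the two-layer net output for decision rule (9).

        x      : tuple of n discrete feature values read from the bitmap
        tables : tables[i] maps a value of feature i to the vector
                 (P(x_i/w_1), ..., P(x_i/w_m)) stored in the input unit
        theta  : theta[i][j] is the exponent on the link from input unit i
                 to output unit j
        priors : priors[j] = P(w_j)
        """
        m = len(priors)
        outputs = []
        for j in range(m):
            # weighted product of formula (9), computed in log space
            log_o = math.log(priors[j])
            for i, xi in enumerate(x):
                p = tables[i][xi][j]          # P(x_i / w_j) from the hash table
                log_o += theta[i][j] * math.log(p)
            outputs.append(log_o)
        # the class with the highest output level wins
        return max(range(m), key=lambda j: outputs[j])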

5 Training algorithms

Network training involves two phases: first, probability distributions are estimated over a large training sample; second, optimal values for the exponents are searched for by means of an error-minimizing process. Both phases will be described in the following subsections.

5.1 Probability estimation

In order to estimate the probability distribution functions of each discrete feature, two types of histogram-based approximation techniques [2] have been evaluated.

In general, the probability distribution function of a discrete valued feature Xi, conditioned on a class ωk, can be estimated by means of frequencies measured over a training sample. Thus, for each value xi of Xi:

    P(x_i/\omega_k) \approx \frac{f(x_i, \omega_k)}{\sum_{u \in X_i} f(u, \omega_k)}    (10)

where f(xi, ωk) indicates the frequency of the pair (xi, ωk) in the training sample, and where the summation is computed over all possible values of Xi. Problems with this estimation arise if the data used for training are very sparse and incomplete. In other words, if many "possible" items never occur in the training sample, biased probability estimates may be computed. In our character classification task only 20% of the possible configurations occur in the training sample, while several never seen ("unknown") configurations occur in the testing sample. Several methods for smoothing frequencies computed from sparse data are known (see [8] for a review). Two of them have been tested: the method we call naive smoothing and a technique based on the Good-Turing formula.

Figure 1: Classifier architecture.

The naive smoothing. In this approach, zero frequencies are eliminated by adding a positive constant ε to all the frequencies. A renormalization is then performed to re-establish the total sum to one. The adjusted frequency f∗(xi, ωk) becomes:

    f^*(x_i, \omega_k) = \frac{f(x_i, \omega_k) + \varepsilon}{\sum_{u \in X_i} \sum_{j} \left( f(u, \omega_j) + \varepsilon \right)}    (11)

where the summations range over all possible values of the variable Xi and over all classes, respectively. The ε parameter is computed by maximizing correct classifications on a different data sample.
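As an illustration of formula (11), a minimal sketch of this add-ε smoothing follows. The function name and the data layout (a list of observed (value, class) pairs for one feature) are assumptions made for the example; the resulting adjusted frequencies would still be plugged into (10).

    from collections import Counter

    def naive_smoothing(pairs, feature_values, classes, eps):
        """Adjusted frequencies f*(x_i, w_k) of formula (11) for one feature X_i.

        pairs          : observed (value, class) pairs in the training sample
        feature_values : all possible values of the feature X_i
        classes        : the class labels w_1, ..., w_m
        eps            : the positive smoothing constant epsilon
        """
        f = Counter(pairs)
        # denominator: double summation over all values and all classes
        total = sum(f[(u, w)] + eps for u in feature_values for w in classes)
        return {(u, w): (f[(u, w)] + eps) / total
                for u in feature_values for w in classes}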

The Good-Turing formula. A more sophisticated frequency estimate is provided by the Good-Turing formula. This estimator, which Good applied in the 1950s in population biology, is based on two assumptions: first, that the values of the variable are finite, and second, that the distribution of each value is binomial. While the former assumption is trivially satisfied, the latter is generally acceptable as the sample can be viewed as a sequence of random and independent trials.

To outline this method, whose theoretical properties can be found in [9], the notion of frequency distribution must be introduced. For each frequency value f measured on the training sample, we indicate by Nf the number of different realizations (xi, ωk) that appear f times. This distribution explains all frequencies measured on the training sample. The value N0 for the zero frequency is obtained by subtracting from the universe size the number of all observed different pairs (xi, ωk). Normally, this distribution presents many large holes or zero-valued intervals. The Nf distribution approximation can be enhanced by averaging along the holes and by using a local cross-validation smoothing method. The smoothed values S(Nf) are then used to compute the Good-Turing formula in order to obtain the new adjusted frequencies f∗:

    f^* = (f + 1)\, \frac{S(N_{f+1})}{S(N_f)} \qquad f = 0, 1, 2, \ldots    (12)

Corrected frequencies are then used to estimate the probability distributions of each feature by means of formula (10). These computations are accomplished by a training algorithm which passes once through the training set. Details about the complexity of this algorithm are provided in Section 7.
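The sketch below illustrates the Good-Turing correction of formula (12) under the assumptions above. The smooth argument is a placeholder for the hole-averaging and cross-validated smoothing of the Nf distribution described in the text, and all names are chosen for the example only.

    from collections import Counter

    def good_turing(freqs, universe_size, smooth):
        """Adjusted frequencies f* of formula (12) for one feature.

        freqs         : dict mapping each observed (value, class) pair to its frequency f
        universe_size : number of possible (value, class) pairs
        smooth        : function (n_f, f) -> S(N_f), the smoothed frequency distribution
                        (stands in for the hole-averaging / cross-validation smoothing)
        """
        # N_f: number of different pairs that appear exactly f times
        n_f = Counter(freqs.values())
        n_f[0] = universe_size - len(freqs)        # pairs never seen in the sample

        def adjusted(f):
            return (f + 1) * smooth(n_f, f + 1) / smooth(n_f, f)

        f_star = {pair: adjusted(f) for pair, f in freqs.items()}
        return f_star, adjusted(0)                 # corrections for seen and unseen pairs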

5.2 Parameter estimation

After probability estimation is completed, values for the free parameters θi,j are estimated. It can be shown that this phase is equivalent to training a generalized linear classifier [1]. If the logarithm of the decision criterion (9) is taken, each net output can be rewritten as a linear discriminant function of the form:

    G_j(x) = \log O_j(x) = \Theta_j^{T} Y_j(x) + h_j    (13)

where Θj is the weight vector (θ1,j, . . . , θn,j), Yj(x) is the log probability vector (log P(x1/ωj), . . . , log P(xn/ωj)), and hj corresponds to the bias log P(ωj). A new sample of data is used to compute the exponents by minimizing the classification errors of the net. Errors are measured with the mean squared error function (MSE), which is one of the "reasonable error measures" identified in [4]. In fact, it is proved that minimization of the MSE leads to classifier outputs which asymptotically equal the a posteriori probabilities. The MSE is:

    E = 0.5 \sum_{(x,\omega_k) \in S} \sum_{j=1}^{m} \left( O_j(x) - \delta_{k,j} \right)^2    (14)

where the outermost summation ranges over all the examples of the sample S, Oj(x) represents the normalized output response for class ωj, and δk,j represents the desired output for class ωj given a pattern of class ωk. Briefly, the desired output is imposed to be equal to the a posteriori probability of class ωj:

    \delta_{k,j} = P(\omega_j/(x, \omega_k)) = \begin{cases} 1 & \text{if } \omega_j = \omega_k \\ 0 & \text{otherwise} \end{cases}

In order to correctly compute the a posteriori probability, the classifier's outputs are normalized as follows:

    O_j(x) = \frac{P(\omega_j) \prod_{i=1}^{n} P(x_i/\omega_j)^{\theta_{ij}}}{\sum_{k=1}^{m} P(\omega_k) \prod_{i=1}^{n} P(x_i/\omega_k)^{\theta_{ik}}}    (15)

for j = 1, . . . , m. It is straightforward to show that normalized outputs approximate a posteriori probabilities.

By applying a standard gradient descent technique, each parameter θij is updated after a complete pass through the whole training set by the quantity:

    \Delta \theta_{ij} = -\alpha \frac{\partial E}{\partial \theta_{ij}}    (16)

The learning rate α varies according to the following adaptive strategy. It starts from a relatively small value and grows as the error is reduced, while it drops to its initial value when the error starts to increase. Moreover, if α is less than or equal to the initial value while the error increases, it is reduced further. The value of ∂E/∂θij is obtained by differentiating Eq. (14) and Eq. (15):

    \frac{\partial E}{\partial \theta_{ij}} = - \sum_{(x,\omega_k) \in S} O_j(x) \log\left( P(x_i/\omega_j) \right) \sum_{h=1}^{m} \left( O_h(x) - \delta_{h,k} \right) \left( O_h(x) - \delta_{h,j} \right)

To promote generalization, a cross-validation set has been adopted to check generalization performance after each training cycle.
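A schematic version of one update cycle could look as follows. It mirrors Eq. (16) together with the adaptive rule for α just described; the accumulated gradient is assumed to be available, and all names and the growth/shrink factors are illustrative.

    def training_cycle(theta, grad_E, alpha, alpha0, prev_error, error,
                       shrink=0.5, grow=1.1):
        """One pass of the exponent update of Eq. (16) with the adaptive learning rate.

        theta      : dict (i, j) -> current exponent value
        grad_E     : dict (i, j) -> dE/dtheta_ij accumulated over the training set
        alpha      : current learning rate, alpha0 its initial value
        prev_error, error : MSE before and after the previous cycle
        """
        # adaptive strategy: grow alpha while the error decreases,
        # fall back to the initial value when it increases,
        # and shrink further if it is already at or below the initial value
        if error < prev_error:
            alpha = alpha * grow
        elif alpha > alpha0:
            alpha = alpha0
        else:
            alpha = alpha * shrink
        new_theta = {ij: theta[ij] - alpha * grad_E[ij] for ij in theta}
        return new_theta, alpha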

6 Handwritten character classification

Experiments were performed on handwritten digit classification. Machine recognition of handwritten characters is an interesting problem and efficient solutions to it can be very useful in many practical applications. Basically, automatic recognition of unconstrained handwritten characters is a difficult task because of the large variety of different styles and shapes of the characters. Research in this field has provided many different approaches and solutions [10]. Unfortunately, most results appear to be very data-dependent and no commonly accepted standard database for assessment exists yet.

6.1 Character database.

The sample of characters used here originates from three different corpora supplied by ELSAG S.p.A. (Genova, Italy). The sample contains 19,000 numerals (1,900 per digit) digitized from hand-filled tax returns, 19,000 numerals (1,900 per digit) digitized from samples collected among ELSAG employees, and 29,000 numerals (2,900 per digit) digitized from postal envelopes which passed through different German postal offices. Part of the preprocessing, i.e. acquisition, segmentation of the single digits, and height and width normalization, was performed by the database suppliers. Linear transformations were computed to obtain a standard 16x8 bitmap from each character. In Figure 5 some randomly selected characters are shown.

Feature extraction followed three major requirements: computational efficiency, availability of estimates, and discrimination capability. The first point suggested the use of bitmap portions as features. The second and third points made us consider the size of these features. In fact, small features provide better estimates, but larger features are certainly more discriminant. This trade-off led us to define features as pairs of two 4x2 pixel rectangles. Four groups of eight such features were selected. This determines four different partitions of the character bitmap. In Figure 4 the resulting architecture of the classifier is shown. For reasons of readability the four groups are visualized separately and only a few links are shown for each group.
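The exact pairing of rectangles into features is defined by the partitions of Figure 4 and is not reproduced here; the sketch below only illustrates the kind of extraction involved, with a hypothetical pairing of horizontally adjacent 4x2 rectangles standing in for the four partitions actually used.

    def extract_features(bitmap, pairing):
        """Read discrete features from a 16x8 binary bitmap.

        bitmap  : list of 16 rows, each a list of 8 pixels (0/1)
        pairing : list of features, each a pair of 4x2 rectangles given by the
                  (row, col) of their top-left corner; the real pairings follow
                  the four partitions of Figure 4, here they are an assumption
        """
        def rect_bits(r, c):
            # flatten a 4x2 rectangle into a tuple of 8 pixels
            return tuple(bitmap[r + dr][c + dc] for dr in range(4) for dc in range(2))

        return [rect_bits(r1, c1) + rect_bits(r2, c2)
                for (r1, c1), (r2, c2) in pairing]

    # a hypothetical pairing: the 16 non-overlapping 4x2 rectangles of the bitmap,
    # joined into 8 features by pairing horizontally adjacent rectangles
    corners = [(r, c) for r in range(0, 16, 4) for c in range(0, 8, 2)]
    example_pairing = [(corners[k], corners[k + 1]) for k in range(0, 16, 2)]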

6.2 Estimator comparison.

The first two experiments we report compare the probability estimates obtained from the naive smoothing and from the Good-Turing formula. Training data for these experiments consist of 50,000 characters and testing data of 17,000 characters. No parameter optimization is performed in either experiment and all exponents of formula (9) are set to one. In order to apply the naive estimation, a noise constant ε was computed in the way suggested in Section 5. The following results were obtained. With the naive smoothing approach correct classification was 99.18% on the training data (0.82% error rate) and 95.89% on the testing data (4.11% error rate). With the Good-Turing formula correct classification was 99.43% on the training set (0.57% error rate) and 96.21% on the testing set (3.79% error rate). These results confirm that the Good-Turing method provides better probability estimates of the training data and, more importantly, also of all the "unknown" features found in the testing data. This latter point means that better generalization is achieved.

Figure 2: MSE trend on the weight learning set.

6.3 Weight learning experiments.

Experiments were then performed in order to evaluate the parameter estimation algorithm. For these experiments the available sample has been divided as follows: 45,000 characters are used for probability estimation (learning set), 5,000 characters for parameter estimation (weight learning set), 5,000 characters as a cross-validation set, and 12,000 characters as a testing set. After termination of the probability estimation phase, we let the parameter optimization cycle about 1500 times. The mean squared error on the weight learning sample was reduced from 0.016 to 0.010. A plot of the complete trend of the MSE is reported in Figure 2. The trend of the recognition rate on the cross-validation set is shown in Figure 3. The best classification rate over the cross-validation set (96.72%) was achieved after about 740 iterations. After the optimization phase, probabilities were re-estimated over the whole training set (55,000 characters) and tests were performed over the testing set (12,000 characters). For the purpose of comparison three different sets of exponent values were considered: in order, the exponents best performing on the cross-validation data, the exponents with the lowest MSE on the training set, and exponents all fixed to one. In the first row of Table 1 results for each set of weights are reported. These figures confirm that the introduction of the parameters improves performance and that the cross-validation technique avoids undesirable overtraining effects.

Figure 3: Recognition rate trend on the cross-validation set.

Figure 4: Architecture for handwritten character classification experiments.

6.4 Rejection criteria.

In order to effectively evaluate a character classifier it becomes crucial to introduce some rejection criteria. The reason is that in real applications misclassifications of characters are considered much more costly than rejections. Usually, a performance measure is defined which assigns different weights to correct classifications and errors. Different rejection criteria are considered, based on the following performance index:

    I = (Recognition Rate) - W \times (Error Rate)

with W being the error rate cost, given that a unitary cost is assigned to the recognition rate.

The rejection criteria are defined by three tests that apply to the two highest output units of the net, O_{ω_best}(x) and O_{ω_sec best}(x), and to the number of input units n_unk(x) which read an "unknown" feature value:

    1)  O_{\omega_{best}}(x) > \theta_{abs}

    2)  \frac{O_{\omega_{best}}(x) - O_{\omega_{sec\,best}}(x)}{O_{\omega_{best}}(x)} > \theta_{rel}

    3)  n_{unk}(x) < \theta_{unk}

where θabs, θrel, θunk are three fixed thresholds. Hence, a pattern x is classified in class ωbest only if all three criteria are satisfied. Further, in order to find optimal thresholds, an algorithm was developed which maximizes the performance index I given W.

For the purpose of comparison, the three sets of exponents were considered again. In each case, increasing costs W were tried and thresholds were computed that maximize I on the testing set examples. All results that produced an error rate ranging from 1.0% to 2.0% are reported in Table 1. The best performances were achieved by the first set of weights, namely 93.30% correct classifications with 1.39% errors, and 92.00% correct classifications with 1.04% errors. The confusion table of the first result is shown in Table 2. By comparing the results in Table 1, it clearly follows that the exponent learning algorithm considerably improves the generalization capabilities of the classifier when rejection criteria are introduced. As a matter of fact, by making the thresholds more and more selective, about 1% of difference between the recognition rates is observable.
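For illustration, the three tests could be combined as in the following sketch, where outputs holds the normalized responses O_j(x), n_unk the count of input units reading unknown values, and the thresholds are those found by maximizing I; the names are ours.

    def accept(outputs, n_unk, th_abs, th_rel, th_unk):
        """Apply the three rejection tests to the normalized outputs of the net.

        outputs : list of O_j(x) for the m classes
        n_unk   : number of input units that read an "unknown" feature value
        Returns the winning class index, or None if the pattern is rejected.
        """
        ranked = sorted(range(len(outputs)), key=lambda j: outputs[j], reverse=True)
        best, second = ranked[0], ranked[1]
        o_best, o_second = outputs[best], outputs[second]
        if (o_best > th_abs
                and (o_best - o_second) / o_best > th_rel
                and n_unk < th_unk):
            return best
        return None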

Best Recognition Exponents    Min MSE Exponents             Unitary Exponents
Rec %   Err %   Rej %         Rec %   Err %   Rej %         Rec %   Err %   Rej %
96.30   3.70    0.00          96.27   3.73    0.00          96.03   3.97    0.00
94.30   1.80    3.90          94.50   1.91    3.59          94.03   1.99    3.98
93.60   1.48    4.92          94.04   1.68    4.28          93.09   1.64    5.27
93.30   1.39    5.31          92.59   1.20    6.21          92.25   1.40    6.35
92.26   1.10    6.64          92.05   1.06    6.89          91.45   1.11    7.44
92.00   1.04    6.96          91.18   1.10    7.72

Table 1: Correct recognition, error and rejection rates for different exponent values and threshold values θabs, θrel, θunk.

In conclusion, the results are promising when compared with those achieved with ANNs [11, 12] when similar input patterns (i.e. raw character bitmaps) are used.

7 Time and space complexity

An account of the time and space complexity of the training and testing algorithms is now given. Orders of complexity are provided for the average case. The Good-Turing probability estimation algorithm has time complexity

    O(|S| + n\, m\, |X|_S + n\, |N_f|_S\, n_\sigma)

where |S| denotes the sample size, n the number of features, m the number of classes, |X|_S the average number of different values found in the sample S for each feature Xi, |N_f|_S the average dimension of the frequency distribution Nf for each feature Xi, and n_σ the number of different Gaussian filters used for smoothing the Nf distributions. A cycle of the weight training algorithm has complexity

    O(|S|\, m\, (m + n))

The recognition algorithm takes time O(nm) for each example. The space complexity of the architecture in terms of floating point numbers is O(n m |X|_S). All the experiments presented in this section used 32 features (n), 10 classes (m), 45,000 characters for probability estimation (|S|), 5,000 characters for weight learning (|S|), and 200 Gaussian filters (n_σ). Average values of |X|_S and |N_f|_S were about 3,500 and 7,500, respectively. The simulator, written in the C language, ran on a SUN-4 330 workstation. Training time for each experiment took about three minutes for the probability estimation phase and about one minute for each weight learning cycle. Character recognition throughput is about 100 characters per second.

Figure 5: A random subsample of size-normalized characters.

8 Conclusions

An approximation of the Optimum Classifier and its network-based implementation were presented. Two learning algorithms are used to train the classifier. First, probability estimates are computed with the Good-Turing method. Second, values of the free parameters are learned through a gradient-descent based algorithm. The classifier has been applied to handwritten character recognition. A database of 67,000 real life numeric characters was used and only a size and orientation normalization was performed on each isolated character. By using small bitmap portions as input features, state-of-the-art results were obtained. Experiments have shown, first, that the Good-Turing formula provides better estimates than simply adding a constant to all frequencies; second, that the introduction of the exponent parameters further improves classification; third, that the cross-validation technique provides an effective way of monitoring generalization.

      0     1     2     3     4     5     6     7     8     9   Rej
0  1181     0     0     0     0     0     0     0     0     0    19
1     0  1130    11     0     0     0     2     6     0     1    50
2     0     5  1121     0     2     0     2     2     1     2    65
3     1     0    10  1086     1     4     1     2     2     6    87
4     0    10     1     0  1120     1     1     2     0     6    59
5     0     0     1     0     1  1141     7     2     0     1    47
6     1     0     1     0     2     8  1137     0     0     0    51
7     0     7     6     0    14     0     0  1114     0     0    59
8     1     2     2     2     0     2     0     0  1080     4   107
9     0     1     2     5     9     0     0     3     1  1085    94

Table 2: Confusion matrix with 93.30% rec. rate and 1.39% error rate.

Possible enhancements of the classifier will be considered in the near future. First, the feature extraction process will be improved in order to introduce more discriminant and task-dependent features. Second, some constraints on the parameters will be analysed with the aim of improving generalization. Third, other optimization techniques will be investigated, in order to improve convergence and generalization. Fourth, performance assessment over larger corpora will be considered; in particular, preliminary work has begun with the NIST (National Institute of Standards and Technology, Gaithersburg, Maryland, USA) corpus of 1,000,000 handwritten characters. As a final remark, it is worth noticing that both the hardware requirements and the computational efficiency of the training algorithms make this architecture a potential competitor of ANNs.

9 Acknowledgments

The authors thank ELSAG S.p.A., Italy, for supplying the character database, and all the researchers and staff of IRST who have contributed to this work.

References

[1] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, N.Y., 1973.

[2] Keinosuke Fukunaga, Introduction to Statistical Pattern Recognition, Electrical Science Series, Academic Press, London, England, 1972.

[3] Tomaso Poggio and Federico Girosi, A Theory of Networks for Approximation and Learning, A.I. Memo No. 1140, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1989.

[4] John B. Hampshire and Barak A. Pearlmutter, Equivalence Proofs for Multi-Layer Perceptron Classifiers and the Bayesian Discriminant Function, Proceedings of the 1990 Connectionist Models Summer School, Touretzky, Elman, Sejnowski and Hinton, eds., Morgan Kaufmann, Los Altos, Cal., 1990.

[5] Halbert White, Learning in Artificial Neural Networks: A Statistical Perspective, Neural Computation, 1, pages 425-464, 1989.

[6] L. A. Kamentsky and C. N. Liu, Computer-Automated Design of Multifont Print Recognition Logic, IBM Journal, pages 2-12, January 1963.

[7] C. K. Chow and C. N. Liu, Approximating Discrete Probability Distributions with Dependence Trees, IEEE Trans. Inform. Theory, IT-14, pages 462-467, 1968.

[8] Kenneth W. Church and William A. Gale, A Comparison of the Enhanced Good-Turing and Deleted Estimation Methods for Estimating Probabilities of English Bigrams, Computer Speech and Language, 5, pages 19-54, 1991.

[9] A. Nadas, On Turing's formula for word probabilities, IEEE Trans. Acoust., Speech and Signal Proc., ASSP-32, pages 1414-1416, 1985.

[10] Frontiers in Handwriting Recognition, Ching Y. Suen, ed., CENPARMI, Concordia University, Montreal, 1990.

[11] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel, Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1, 4, pages 541-551, 1989.

[12] Gale L. Martin and James A. Pittman, Recognizing Hand-Printed Letters and Digits Using Backpropagation Learning, Neural Computation, 3, 2, pages 258-267, 1991.

[13] Luigi Stringa, A Structural Approach to Automatic Primitive Extraction in Hand-Printed Character Recognition, Frontiers in Handwriting Recognition, Ching Y. Suen, ed., CENPARMI, Montreal, 1990.

[14] Luigi Stringa, S-NETS: A Short Presentation, Proc. of the Second Italian Workshop on Parallel Architectures and Neural Networks, E. R. Caianiello, ed., World Scientific Publishing Co., 1990.
