May 28, 2004 14:48 WSPC/115-IJPRAI 00334
International Journal of Pattern Recognition and Artificial Intelligence, Vol. 18, No. 4 (2004) 519–539. © World Scientific Publishing Company
INTEGRATED HANDWRITING RECOGNITION AND
INTERPRETATION USING FINITE-STATE MODELS∗
A. H. TOSELLI
Instituto Tecnologico de Informatica, Universidad Politecnica de Valencia, Camino de Vera s/n, 46022 Valencia, Spain
A. JUAN†, J. GONZALEZ, I. SALVADOR, E. VIDAL and F. CASACUBERTA
Departamento de Sistemas Informaticos y Computacion, Universidad Politecnica de Valencia, Camino de Vera s/n, 46022 Valencia, Spain
D. KEYSERS‡ and H. NEY
Lehrstuhl fur Informatik VI, Computer Science Department, RWTH Aachen University of Technology, 52056 Aachen, Germany
The interpretation of handwritten sentences is carried out using a holistic approach in which both text image recognition and the interpretation itself are tightly integrated. Conventional approaches follow a serial, first-recognition, then-interpretation scheme which cannot adequately use semantic–pragmatic knowledge to recover from recognition errors. Stochastic finite-state transducers are shown to be suitable models for this integration, permitting a full exploitation of the final interpretation constraints. Continuous-density hidden Markov models are embedded in the edges of the transducer to account for lexical and morphological constraints. Robustness with respect to stroke vertical variability is achieved by integrating tangent vectors into the emission densities of these models. Experimental results are reported on a syntax-constrained interpretation task which show the effectiveness of the proposed approaches. These results are also shown to be comparatively better than those achieved with other conventional, N-gram-based techniques which do not take advantage of full integration.
Keywords: Handwriting recognition and interpretation; hidden Markov models; stochastic finite-state transducers; preprocessing and feature extraction; tangent vectors.
1. Introduction
The recognition of a handwritten sentence, i.e. decoding its symbolic representation
in terms of characters, digits and/or words, is not the ultimate purpose in many
∗Work supported by the Spanish MCT under grant TIC2000-1703-CO3-01.
tasks involving handwritten input. On the contrary, in these tasks, the handwritten
text is often used just as an intermediate means to express some semantic message.
The goal of an ideal automatic system in these cases is to obtain an adequate
interpretation of the handwritten message, rather than achieving a good recognition
of the individual text constituents of this message. This is clearly illustrated in two
prominent tasks: postal address processing and bank check reading.
A system for postal address processing uses knowledge about postal domains
and tries to guess the correct destination even if only incomplete or contradictory
information appears or can be recognized in the postal address. Here recognition
would consist in getting adequate hypotheses about the words and numbers written
in the envelope; interpretation, in contrast, should yield a unique entry to the
postal database containing the right addresses. Similarly, a bank check reading
system has to interpret the legal amount (written in letters) to determine the
real numeric sum (and to optionally verify whether this sum matches the courtesy
amount — written in digits). Here recognition would consist in getting adequate
hypotheses about the written words, while the goal of interpretation is
to come up with a numeric expression which, overall, reflects what was written
in letters as accurately as possible. It is not of great importance whether all the
words comprising the legal amount were correctly written or whether they can be
exactly recognized or not; only the reliability of the interpreted numeric result really
matters.
From this point of view, the role of an interpretation system is to map input
images into adequate target meanings and the written words or numbers should be
considered just as intermediate results or hidden variables.
Under this paradigm, accurate handwriting interpretation requires a tight
cooperation of lexical, syntactic and semantic/pragmatic knowledge. Each source of
knowledge adds valuable, possibly redundant information which is best exploited
in conjunction with that obtained from the other sources. This is just the very
same situation that appears in the field of continuous speech recognition.16 In this
field, knowledge integration benefits are attained by following three basic principles:
(i) adopt simple, homogeneous and easily understandable models for all the
knowledge sources; (ii) formulate the problem as one of searching for an optimal path
through an adequate structure based on these models; and (iii) use appropriate
techniques to learn the different models from training data. These principles are
actually the basis of those systems developed using finite-state devices such as
hidden Markov models and stochastic finite-state grammars.17
Inspired by the success of finite-state technology in continuous speech
recognition, several systems based on this technology have been proposed or adapted for
handwritten input in the last few years (e.g. see Refs. 10 and 12, and the
references therein). However, these systems often break the above-mentioned principles
at the lexical or the syntactic levels. A typical example violating these principles at
the lexical level is given by those systems based on the segmentation of sentences
into single words (or even into individual characters). Clearly, it is quite difficult to
locate individual characters or words in a sentence without considering lexical or
syntactic knowledge. Moreover, it is even harder to recover from errors produced
during segmentation and hence this approach is generally unreliable and the more
recent, advanced systems do not rely on word segmentation1 (see also Ref. 9).
Although adequate, homogeneous, finite-state based solutions have been
recently developed for handwriting recognition, no such solutions have yet been
proposed for the more general problem of handwriting interpretation. For instance,
the approach followed by Kaufmann and Bunke7 for check processing is to first
decode the handwritten legal amount into a word sequence and then translate this
recognized sentence into a digit string. No attempt is made to do recognition and
interpretation simultaneously and, in fact, it is difficult to do so since the authors use
a translation scheme which is not easily amenable to integration into a finite-state
framework. As in the case of systems relying on word segmentation, this is another
example breaking the basic principles described above. In this case, however, these
principles are broken at the semantic level, thereby preventing semantic knowledge
from being fully exploited in the whole process.
Other works (concerning legal amount handwriting recognition) which also
break these principles in one way or another are worth mentioning. The approaches
followed by Paquet–Lecourtier14 and Gillevic–Suen6 are directly based on
recognition of previously segmented words (belonging to a restricted lexicon) using
structural and morphological features. The former approach uses a template
matching word classifier, whereas the latter uses a Bayesian word classifier. Further,
the approach proposed by Gillevic–Suen6 also supported the idea of the
above-mentioned three different levels of knowledge to get more accuracy at the
interpretation level, but only preliminary results, with integration up to the lexical
level, were reported.
In this paper, we propose the integration of handwriting recognition and
interpretation via finite-state models. As usual, images of handwritten text are
modeled at the lexical level by continuous density, left-to-right hidden Markov
models. To achieve integration of recognition and interpretation, we advocate the
use of stochastic finite-state transducers. The details of the proposed techniques
are given in Sec. 3. A syntax-constrained interpretation task resembling legal
amount interpretation for bank checks is adopted as an illustrative example. In
Sec. 4, experimental results are reported showing the effectiveness of the proposed
approach. This approach is also shown to clearly outperform a conventional,
N-gram-based scheme which cannot easily take advantage of full integration.
Overall, the results constitute a significant improvement over previous
(preliminary) results obtained for the same task.4 Apart from the benefits of integration,
this improvement is due to two refinements: on the one hand, we included elaborate
preprocessing and feature extraction techniques (Sec. 2); on the other hand, we
introduced the use of tangent vectors in the emission densities of the hidden Markov
models to help cope with the vertical variability of the input images (Sec. 3.3).
2. Preprocessing and Feature Extraction
Preprocessing of handwritten text lines has not yet been given a general, standard
solution and it can be said that each handwriting recognition (interpretation)
system has its own, particular solution. There are, however, generic preprocessing
operations such as skew and slant correction for which robust, more or less
equivalent techniques are available.15 But in many cases, other not so generic preprocessing
operations are also needed to compensate for a weakness in the ability of the system
to model pattern variability. In particular, this is the case of approaches like ours
that use (one-dimensional) hidden Markov models for a handwritten text line image.
Although these models do properly model (nonlinear) horizontal image distortions,
they are to some extent limited for vertical distortion modeling. Therefore, apart
from the usual skew and slant correction preprocessing steps, we have decided to
include a third step aimed at reducing a major source of vertical variability: the
height of ascenders and descenders. These steps are discussed hereafter. See Fig. 1
for an illustrative example.
Skew correction processes an original image to put the text line into horizontal
position. As each word or multiword segment in the text line may be skewed at
a different angle, the original image is divided into segments surrounded by wide
blank spaces and skew correction is applied to each segment separately. The aim is
not to obtain a segmentation of the text line into words, and it is not necessary for
each segment to contain exactly one word. The complete skew correction process
is carried out in four steps: (a) horizontal run-length smoothing of the segments
comprising the original image (panel b.1 in Fig. 1); (b) computation of the upper
and lower contours for each segment (panel b.2); (c) eigenvector line fitting of
the contours (panels b.3 and b.4); and (d) segment deskewing in accordance with
the average angle of the contour lines (panel b.5). Although this process involves
significant computing time, we have found it to be more robust than other simpler
approaches.5
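The eigenvector line fitting of step (c) amounts to finding the principal axis of a set of contour points. The following is a minimal sketch (the function name and the toy contour are illustrative, not taken from the described system); for 2-D points the dominant eigenvector of the covariance matrix has a closed form:

```python
import math

def fit_line_angle(points):
    """Eigenvector line fitting: angle (radians) of the principal axis
    of a set of 2-D contour points.  For a 2x2 covariance matrix the
    dominant eigenvector direction is 0.5 * atan2(2*cxy, cxx - cyy)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    return 0.5 * math.atan2(2 * cxy, cxx - cyy)

# Lower contour of a segment skewed by about 5 degrees; the angle found
# here is what step (d) would average with the other contour's angle
# before deskewing the segment.
skew = math.radians(5.0)
contour = [(x, math.tan(skew) * x) for x in range(100)]
angle = fit_line_angle(contour)   # close to 5 degrees
```

This closed form avoids an explicit eigendecomposition, which is convenient since only the dominant direction is needed for deskewing.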
Slant correction shears the deskewed image horizontally to bring the writing
into an upright position. Following the procedure proposed by Yanikoglu and
Sandon,18 the dominant slant angle of the writing is obtained by computing the
slant histogram using Sobel edge operators.
As said above, the third step is aimed at reducing a major source of vertical
variability: the height of ascenders and descenders (not that of the main text body). The
reference lines computed for each image segment during skew correction are updated
and joined together to separate the main text body from the zones with ascenders
and descenders. Then, each of these zones is linearly scaled in height to a size
determined as a percentage of the main body vertical size (30% for ascenders and
15% for descenders). This percentage was empirically determined through simple
informal tests. Since these zones are often large, nearly blank areas, this scaling
operation has the effect of filtering out most of the uninformative background. It
also compensates for the large variability of the height of ascenders and descenders
as compared with that of the main text body.
Fig. 1. Preprocessing and feature extraction example. From top to bottom: (a) original image (“four millions” in Spanish); (b) skew angle estimation and correction (block of 5 joint panels); (c) slant correction; (d) height normalization for ascenders and descenders; and (e) extracted sequence of feature vectors (normalized gray levels, horizontal and vertical derivatives). From top to bottom in the block of five joint panels describing skew angle estimation and correction: (b.1) horizontal run-length smoothing of the two segments (words) comprising the original image; (b.2) upper and lower contours; (b.3) eigenvector line fitting of the contours; (b.4) fitted lines; and (b.5) deskewed image.
As with any approach based on (one-dimensional) hidden Markov models,
feature extraction is required to transform the preprocessed image into a sequence of
(fixed-dimension) feature vectors. To do this, the preprocessed image is first divided
into a grid of square cells whose size is a small fraction of the image height (such as
1/16, 1/20, 1/24 or 1/28). We call this fraction vertical resolution. Then each cell is
characterized by the following features: normalized gray level, horizontal gray-level
derivative and vertical gray-level derivative. To obtain smoothed values of these
features, feature extraction is extended to a 5 × 5 window centered at the
current cell, weighted with a Gaussian function. The derivatives are computed by least
squares fitting of a linear function.
Columns of cells are processed from left to right and a feature vector is built for
each column by stacking the features computed in its constituent cells (panel e in
Fig. 1). This process is similar to that followed by Bazzi et al.1
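The per-cell feature computation described above can be sketched as follows (a simplified illustration with hypothetical names; the exact window weighting and border handling of the described system may differ):

```python
import math

def cell_features(img, r, c, size, win=5, sigma=1.0):
    """Features of the grid cell at (row r, column c): Gaussian-weighted
    mean gray level plus horizontal and vertical gray-level derivatives,
    obtained by weighted least-squares fitting of a linear function over
    a win x win neighborhood of cells.  `img` is a 2-D list of gray
    levels in [0, 1]; window cells falling outside the image are
    ignored.  (A sketch: names and border handling are our own.)"""
    rows, cols, half = len(img), len(img[0]), win // 2
    samples = []   # (dy, dx, weight, mean gray level of the cell)
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            y0, x0 = (r + dy) * size, (c + dx) * size
            if y0 < 0 or x0 < 0 or y0 + size > rows or x0 + size > cols:
                continue
            g = sum(img[y][x] for y in range(y0, y0 + size)
                    for x in range(x0, x0 + size)) / size ** 2
            w = math.exp(-(dx * dx + dy * dy) / (2 * sigma ** 2))
            samples.append((dy, dx, w, g))
    wsum = sum(w for _, _, w, _ in samples)
    gray = sum(w * g for _, _, w, g in samples) / wsum
    def slope(axis):   # weighted least-squares slope along one axis
        num = sum(w * (dy if axis == 0 else dx) * (g - gray)
                  for dy, dx, w, g in samples)
        den = sum(w * (dy if axis == 0 else dx) ** 2
                  for dy, dx, w, _ in samples)
        return num / den if den else 0.0
    return gray, slope(1), slope(0)   # gray, horiz. deriv., vert. deriv.

# A 32 x 32 image with a pure left-to-right gray-level ramp: the
# horizontal derivative is constant and the vertical one vanishes.
img = [[x / 31 for x in range(32)] for _ in range(32)]
gray, dh, dv = cell_features(img, 4, 4, size=4)
```

A feature vector for one column of the grid would then be the concatenation of these triples over all cells of that column.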
3. Integrated Recognition and Interpretation via
Finite-State Models
3.1. Probabilistic framework
In order to develop a true holistic approach to interpretation, it is useful to think
of recognition as a hidden process and start facing the basic problem, i.e. to search
for an optimal interpretation
\hat{t} = \arg\max_t P(t \mid x) \qquad (1)
where x is the sequence of feature vectors extracted from an image of handwritten
text, and P (t | x) is the posterior probability for t to be the correct interpretation
of x in the semantic or target language.a
To uncover the underlying recognition process, P(t | x) can be seen as a marginal
of the joint probability function P(s, t | x), where s is a decoded sentence in the
source^a language. Using Bayes' rule and assuming that, in practice, P(x | s, t) is
independent^b of t, we have
\hat{t} = \arg\max_t \sum_s P(s, t \mid x) \qquad (2)
    = \arg\max_t \sum_s P(x \mid s, t) P(s, t) \qquad (3)
    = \arg\max_t \sum_s P(x \mid s) P(s, t) \qquad (4)
It is convenient to approximate the sum in Eq. (4) by the max operator to
facilitate the search for t:
\hat{t} \approx \arg\max_t \max_s P(x \mid s) P(s, t) \qquad (5)
Moreover, this simplification also makes it possible to simultaneously search for
both t and its associated most probable decoding, s:
(\hat{s}, \hat{t}) \approx \arg\max_{(s,t)} P(x \mid s) P(s, t) \qquad (6)
^a In the context of bank check legal amount interpretation, a “target language” is any adequate formal representation of numeric amounts, e.g. decimal digit sequences (see Sec. 3.4), and the “source language” is the language used to write legal (worded) amounts.
^b That is, the writing style of the source text is not conditioned by the overall numerical meaning.
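A tiny numeric illustration of the approximation in Eq. (5), with scores invented for the demo: when the probability mass of an interpretation is concentrated on a single decoding, ranking interpretations by the maximum over s tends to agree with ranking them by the sum in Eq. (4):

```python
# Toy joint scores P(x|s) * P(s,t) over competing decodings s, for two
# candidate interpretations t (all numbers invented for the demo).
scores = {
    "t1": [0.20, 0.02],          # mass concentrated on one decoding
    "t2": [0.08, 0.07, 0.06],    # mass spread over several decodings
}
t_sum = max(scores, key=lambda t: sum(scores[t]))   # criterion of Eq. (4)
t_max = max(scores, key=lambda t: max(scores[t]))   # criterion of Eq. (5)
# Both criteria pick "t1" here (0.22 vs 0.21 by the sum, 0.20 vs 0.08 by
# the max); in general they can disagree, which is why Eq. (5) is only
# an approximation.
```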
This optimization problem serves as the basis for our integrated approach to
handwriting recognition and interpretation via finite-state models. On the one hand,
we adopt conventional hidden Markov models, extended with tangent vectors for
increased robustness, to estimate P(x | s) (Sec. 3.2). On the other hand, we advocate
the use of stochastic finite-state transducers to model P (s, t) (Sec. 3.4). Thanks to
the homogeneous finite-state nature of these models, they can be easily integrated
into a single global finite-state network, and both recognition and interpretation
can be efficiently performed at the same time by solving (6), using the well-known
Viterbi algorithm (Sec. 3.5).
3.2. Hidden Markov Models
Hidden Markov Models (HMMs) have received significant attention in handwriting
recognition in recent years. As speech recognizers do for acoustic data,7,16
HMMs are used to estimate the probability for a sequence of feature vectors to be
seen as an “image realization” of a given text sentence. Sentence models are built by
concatenation of word models which, in turn, are often obtained by concatenation of
continuous left-to-right HMMs for individual characters. Figure 2 shows an example
of a character HMM.
Basically, each character HMM is a stochastic finite-state device that models
the succession, along the horizontal axis, of (vertical) feature vectors which are
extracted from instances of this character. Each HMM state generates feature
vectors following an adequate parametric probabilistic law; typically, a mixture of
Gaussian densities. The required number of densities in the mixture depends, along
with many other factors, on the “vertical variability” typically associated with each
state. This number needs to be empirically tuned in each task.
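As an illustration, the emission density of a single state can be evaluated in the log domain; we use a mixture of diagonal-covariance Gaussians here for simplicity (the covariance structure is our assumption, and all numbers are invented):

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_emission(x, mixture):
    """Log emission probability of one HMM state whose output density is
    a Gaussian mixture: log sum_k w_k N(x | mu_k, var_k), computed with
    the log-sum-exp trick for numerical stability."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in mixture]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

# A state with two mixture components over toy 2-D feature vectors:
mixture = [(0.6, [0.0, 0.0], [1.0, 1.0]),
           (0.4, [2.0, 2.0], [1.0, 1.0])]
lp = log_emission([0.1, -0.1], mixture)
```

Working in the log domain is what makes long products of emission and transition probabilities numerically tractable during decoding.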
Fig. 2. HMM modeling of instances of the character “a” within the word “cuarenta”. The states are shared among all the instances of the same character class.
The number of states that is adequate to model a certain character depends on
the underlying horizontal variability. For instance, to ideally model a capital “E”
character, only two states might be enough (one to model the vertical bar and the
other for the three horizontal strokes), while three states may be more adequate to
model a capital “H” (one for the left vertical bar, another for the central horizontal
stroke and the last one for the right vertical bar). Note that the optional blank
space that may appear between characters should also be modeled by each
character HMM. In many cases the adequate number of states for a given task may
be conditioned by the available amount of training data.
Once an HMM “topology” (number of states and structure) has been adopted,
the model parameters can be easily trained from continuously handwritten text
(without any kind of segmentation) accompanied by the transcription of this text
into the corresponding sequence of characters. This training process is carried
out using a well-known instance of the EM algorithm called forward-backward or
Baum-Welch re-estimation.16
3.3. Tangent vectors in Hidden Markov Models
Even with our treatment of ascenders and descenders described in Sec. 2, vertical
shift variability remains difficult to model in left-to-right one-dimensional HMMs.
As an additional effective method for coping with this problem we propose the use
of tangent vectors.13 Tangent vectors can be used to enhance tolerance with respect
to small variations of the input patterns in a classifier. Their name is due to the
fact that they are computed as derivatives of the considered transformations and are
therefore tangential to the manifold that a transformed pattern describes in pattern space.
Tangent vectors have been successfully applied to various pattern recognition tasks,
most notably (isolated) handwritten digit recognition. The method is especially
suitable for integration into Gaussian models, as it can be shown to be equivalent to
a modification of the covariance matrix in the Gaussian case.8
For the use of tangent vectors in our task, let µ denote a mean vector of one
Gaussian emission density of one HMM state. Let further f(µ, α) denote a
transformation of µ, e.g. vertical shift, that depends on a parameter α. This
transformation can be approximated by a linear subspace for small values of α using a Taylor
expansion around α = 0:
f(\mu, \alpha) = \mu + \alpha v + O(\alpha^2) \approx \mu + \alpha v \qquad (7)
The tangent vector v that spans the resulting subspace is the partial derivative of
the transformation f with respect to the parameter α, i.e. v = ∂f(µ, α)/∂α. Using
this first-order approximation, we obtain the probability density for an observation
vector x:
p(x \mid \mu, \alpha, \Sigma) = N(x \mid \mu + \alpha v, \Sigma) \qquad (8)
Now, by integrating out the parameter α and assuming that the distribution of α
is N (0, γ2) and independent of µ and Σ, we obtain the following expression8:
p(x \mid \mu, \Sigma) = \int p(\alpha) \, p(x \mid \mu, \alpha, \Sigma) \, d\alpha = N(x \mid \mu, \Sigma'), \qquad \Sigma' = \Sigma + \gamma^2 v v^T \qquad (9)
Here, we want the character HMMs to be robust with respect to small vertical
shifts. This can be achieved by applying the following procedure to each Gaussian
density N (µ,Σ) of each mixture of the trained HMMs:
• calculate the tangent vector v as the vertical derivative of the mean vector µ;
• modify the covariance matrix Σ by setting Σ ← Σ + γ^2 v v^T, where the factor γ
controls the variance along the tangent vector direction.
The increased variance in the direction of the tangent vectors leads to emission
densities which assign higher probability to slightly transformed feature vectors.
This has the effect that the resulting model is more robust with respect to this
transformation, in this case with respect to vertical variability.
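The two-step procedure above can be sketched directly (a toy illustration; the finite-difference tangent vector and the diagonal starting covariance are our simplifying assumptions):

```python
def vertical_derivative(mu):
    """Tangent vector v for a vertical shift: central-difference
    derivative of the mean vector along the stacking direction
    (a simple finite-difference stand-in for the analytic derivative)."""
    d = len(mu)
    return [(mu[min(i + 1, d - 1)] - mu[max(i - 1, 0)]) / 2
            for i in range(d)]

def tangent_update(sigma, v, gamma):
    """Rank-one covariance update of Eq. (9): Sigma' = Sigma + g^2 v v^T."""
    d = len(v)
    return [[sigma[i][j] + gamma ** 2 * v[i] * v[j] for j in range(d)]
            for i in range(d)]

mu = [0.1, 0.5, 0.9, 0.5, 0.1]    # toy mean vector (one image column)
v = vertical_derivative(mu)
sigma = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
sigma2 = tangent_update(sigma, v, gamma=2.0)

# Variance along the tangent direction grows by gamma^2 * (v.v)^2, so
# vertically shifted versions of mu receive higher emission probability.
def quad(M, u):
    return sum(u[i] * M[i][j] * u[j]
               for i in range(len(u)) for j in range(len(u)))
```

Note that the update is a rank-one modification, so it adds essentially no cost to training and only a modest cost to likelihood evaluation.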
3.4. Stochastic finite-state transducers
As discussed in Sec. 3.1, in this work we propose the use of stochastic finite-state
transducers (stochastic FSTs, SFSTs) to model P(s, t) in Eq. (6). Basically, an SFST
is a finite-state network whose transitions are labeled by three items2,17:
(a) an input symbol (a word from the source lexicon);
(b) an output string (a sequence of tokens from the target symbol set); and
(c) a transition probability.
In addition, each state has an associated probability of being an initial state and a
probability of being a final state.
If an SFST is unambiguous, P(s, t) is computed as the product of the probabilities
of the transitions of the unique path that matches (s, t). Otherwise, it is the sum of
the probabilities computed for all paths matching (s, t). In most cases of interest,
this sum can be conveniently approximated by the maximum.
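For an unambiguous SFST this computation is a single sweep over the matching path. A minimal sketch (the tiny transducer and its probabilities are invented for illustration):

```python
# A toy unambiguous SFST as a transition table:
#   state -> {input word: (output string, probability, next state)}
# (states, words and probabilities invented for the demo).
sfst = {0: {"mil": ("1000", 0.5, 1), "dos": ("2", 0.5, 2)},
        1: {"dos": ("+2", 0.8, 2)}}
final_prob = {2: 1.0}

def joint_prob(words):
    """P(s, t) and the output t for a source sentence s: follow the
    unique matching path, multiplying its transition probabilities and
    the final-state probability."""
    state, prob, out = 0, 1.0, []
    for w in words:
        output, p, state = sfst[state][w]
        out.append(output)
        prob *= p
    return prob * final_prob.get(state, 0.0), " ".join(out)

p, t = joint_prob(["mil", "dos"])   # -> p = 0.4, t = "1000 +2"
```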
FSTs can be automatically learned from training data,11 or they can be built
by hand in accordance with previous knowledge about the task. A key factor of the
difficulty of manually (or automatically) building an FST is the degree of
monotonicity or “sequentiality” between source and target subsequences of the considered
task.17 The simplest case is where translation may proceed from left to right, in
a sequential sweep that considers only one source word at a time, producing a
bounded number of output tokens. This kind of task can be properly modeled by
Sequential Transducers,2 a kind of FST that is amenable to manual construction.
If the required source/target mapping is more complex, Subsequential Transducers2
can be used although, in this case, manual construction often becomes exceedingly
difficult even for small, nontrivial tasks.
For illustration purposes, let us consider a simple syntax-constrained
interpretation task that will also be considered for the experiments reported in Sec. 4.
It consists of interpreting the Spanish numbers from 0 to 10^12 − 1, i.e. translating
Fig. 3. A piece of the hand-designed numbers transducer. Solid-line edges correspond to a path that accepts “doscientos sesenta y dos mil veinte” (two hundred sixty two thousand and twenty), yielding “+(200 + 60 + 2) ∗ 1000 + 20”.
instances of these numbers in text form to their corresponding numerical
representation. It will be referred to as the Spanish numbers task. The source-target mapping
underlying this task is a typical case of Subsequential Transduction.17 However, we
can slightly modify the task specification in order to allow for a simple sequential
mapping.
The source lexicon of this modified task comprises Spanish words such as
“uno”, “dos”, “diez”, “sesenta”, “cien”, “mil”, “millon”, etc. (one, two, ten, sixty,
hundred, thousand, million, etc.). The target set of symbols consists of the ten digits
plus four arithmetic operators: “(, ), +, ∗”. For instance, given the source (Spanish
number) sentence: “doscientos sesenta y dos mil veinte” (two hundred sixty two
thousand and twenty), the corresponding target sequence should be the arithmetic
expression: “+(200 + 60 + 2) ∗ 1000 + 20”. Clearly, from this expression the target
(decimal) number (262,020) can be readily computed.
For this modified task, we wrote a simple sequential SFST that accepts any
Spanish number in text form in the range given above and outputs an arithmetic expression
giving its corresponding numerical value. A small fragment of this transducer is
shown in Fig. 3. Its basic features are: 51 source words, 14 target symbols, 32 states
and 660 transitions. Its source language (test-set) perplexity is 6.2.
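A convenient property of this target language is that interpretations are directly machine-evaluable; e.g. the target string from the example above yields the decimal number by plain arithmetic evaluation:

```python
# Target string produced for "doscientos sesenta y dos mil veinte"; the
# target alphabet contains only digits and ( ) + *, so the string is a
# valid arithmetic expression and eval() recovers the decimal number.
# (eval is acceptable here only because the alphabet is this restricted.)
target = "+(200+60+2)*1000+20"
value = eval(target)   # -> 262020
```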
3.5. Recognition and interpretation as a best path search
Trained character HMMs and the SFST chosen for the task can be easily integrated
into a global finite-state recognition network. To this end each edge of the SFST
is expanded by a concatenation of the HMMs of the successive characters which
constitute the source-language word of this edge. To deal with possible inter-word
white space (as a complement to the inter-character blank-space modeling mentioned
in Sec. 3.2), a special blank (“@”) HMM can be trained and also integrated in the
network. This network expansion, illustrated in Fig. 4, realizes the integration,
discussed in Sec. 3.1, of character, lexical and syntactic-semantic levels.
Given an input sequence of feature vectors x, the pair (s, t) in Eq. (6) is obtained
by searching for a best path in the integrated network. This global search process
is very efficiently carried out by the well known (beam-search-accelerated) Viterbi
algorithm.16 This technique allows integration to be performed “on the fly” during
the search process. In this way, only the memory strictly required for the search is
actually allocated.
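The beam-search-accelerated Viterbi idea can be sketched on a toy network (discrete observations and invented probabilities; the real system operates on Gaussian emission densities over feature vectors):

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit, beam=10.0):
    """Best-path search over a finite-state network in the log domain.
    Simple beam pruning: partial hypotheses scoring more than `beam`
    below the current best are dropped, so only the memory needed for
    surviving paths is actually used."""
    hyps = {s: (log_init[s] + log_emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new = {}
        for s, (score, path) in hyps.items():
            for s2, lt in log_trans.get(s, {}).items():
                sc = score + lt + log_emit[s2][o]
                if s2 not in new or sc > new[s2][0]:
                    new[s2] = (sc, path + [s2])
        best = max(sc for sc, _ in new.values())
        hyps = {s: h for s, h in new.items() if h[0] >= best - beam}
    return max(hyps.values())

lg = math.log
states = ["A", "B"]
log_init = {"A": 0.0, "B": -math.inf}
log_trans = {"A": {"A": lg(0.7), "B": lg(0.3)}, "B": {"B": 0.0}}
log_emit = {"A": {"x": lg(0.9), "y": lg(0.1)},
            "B": {"x": lg(0.2), "y": lg(0.8)}}
score, path = viterbi("xxy", states, log_init, log_trans, log_emit)
# Best state sequence for the observations "x", "x", "y" is A, A, B.
```

In the integrated network the same recursion runs over HMM states expanded inside the SFST edges, and the surviving best path simultaneously determines the decoded word sequence and its interpretation.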
Fig. 4. A small piece of an integrated finite-state model, using three-state character HMMs. The part shown stands for the sentences “mil”, “mil uno” and “mil dos” (1,000; 1,001; 1,002). Output arithmetic-expression tokens are omitted for the sake of clarity.
Table 1. Some details about the image database and the training and test partitions.

                Training     Test     Total
# writers             18       11        29
# sentences          298      187       485
# words            1,300      827     2,127
# letters          9,220    5,852    15,072
# digits           1,543    2,480     4,023
4. Experiments
The proposed approach was empirically evaluated on the Spanish numbers task
described in Sec. 3.4. It was also compared with a more conventional approach
based on a serial combination of word recognition using N -gram syntactic modeling,
followed by text-to-number translation based on a perfect text-to-number
transducer. To acquire a database of handwritten sentences of Spanish numbers, two
independent lists of numbers were randomly generated: one of 10,000 items (L1)
and another of 300 items (L2), and 29 writers were asked to write numbers from these
lists. Each writer was given a blank sheet of paper and a pencil, and asked to write
numbers in well-separated lines to facilitate their segmentation. Filled sheets were
scanned at 300 dpi in 8-bit grayscale. After segmentation of the scanned sheets, 485
images of handwritten numbers were collected, from which 298 were transcribed
from list L1 by 18 writers and 187 from list L2 by the remaining 11 writers.c
Some examples are shown in Fig. 7. For the experiments reported hereafter, the
298 sentence images were employed as a training set and the 187 sentence images
as a test set. Details about this database and the partitions used in the experi-
ments are shown in Table 1. In addition, the text-only sentences of L1 were used to
train N -gram language models in some of the comparative experiments described
in Sec. 4.2.
Two measures are used to assess the empirical results: Word Error Rate (WER)
and Digit Error Rate (DER). Both measure the percentage of tokens that have to
^c The acquired database is available upon request. To our knowledge, there are no publicly available databases of syntax-constrained handwritten text and thus it may be useful for researchers interested in handwriting recognition and interpretation.
be substituted, inserted or deleted in the system hypotheses in order to match the
corresponding reference sequences. Tokens are words for WER and decimal digits for
DER. WER and DER measure recognition and interpretation errors, respectively. It
should be noted that, for practical application to the proposed task, WER values are
of little interest. In practice, the output of a legal amount reading system generally
needs to be compared with the result of a digit recognizer, which provides a
complementary hypothesis of the check sum, based on the courtesy amount. The automatic
(or manual) work needed to compare and eventually correct the results is directly
related to the number of digit errors. Therefore, since this is a digit-by-digit
comparison, it is the DER measure that actually matters. In principle, WER and
DER are not directly comparable measures. For the Spanish numbers, for instance,
the average number of words in a text sentence of our database is 5.6, while the
average number of digits in the corresponding decimal representation is 8.3. So, at
first sight, DER might be expected to be lower than WER. However, many typical
single-word errors correspond to two or more digit errors. For instance, mistaking
“diez” for “dos” (a typical error in our system) in a sentence like “mil dos”
corresponds to changing “1002” into “1010”, which entails two digit errors. And
mistaking the word “noventa” for “millones” in a sentence such as “mil millones”
corresponds to changing “1000000000” into “1090”; i.e. seven digit errors! On
average, for a good system, DER is expected to be somewhat lower than WER
(for Spanish number sentences), with the difference approaching zero with increasing
system accuracy.
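Both WER and DER reduce to a token-level edit distance. A minimal sketch reproducing the digit-error counts of the examples above:

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions needed
    to turn `hyp` into `ref` (classical dynamic programming)."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1][-1]

def error_rate(refs, hyps):
    """WER when the sequences are word lists, DER when digit strings."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * errors / sum(len(r) for r in refs)

# The examples from the text: one word error can cost several digits.
two = edit_distance("1002", "1010")            # -> 2 digit errors
seven = edit_distance("1000000000", "1090")    # -> 7 digit errors
```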
4.1. Tuning feature extraction and HMM parameters
There are three main parameters that need to be adjusted to design an accurate
Spanish numbers recognizer/interpreter in accordance with our approach. They are
the vertical resolution (VR) for feature extraction, and the number of states (NS)
and Gaussian densities per state (NG) for each character HMM. Automatically
determining optimal values for these parameters is not an easy task. In particular,
it is difficult to determine independent, optimal values of NS and NG for each
character HMM. For simplicity, we decided to use the same values of NS and NG
for all HMMs. Taking into account previous (preliminary) results for the Spanish
numbers database,4 we decided to test the following parameter values: VR = 1/16,
1/20, 1/24 and 1/28; NG = 8, 16, 32 and 64; and NS = 4, 5, 6, 7, 8 and 9. First,
we observed the influence of these parameters without using tangent vectors, which
introduce an additional parameter, the variance factor γ.
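Under these choices the tuning reduces to an exhaustive sweep over all parameter combinations; a sketch of the enumeration (variable names are ours, not the paper's):

```python
from itertools import product

# Parameter values tested in the text
VR_values = ["1/16", "1/20", "1/24", "1/28"]   # vertical resolutions
NS_values = [4, 5, 6, 7, 8, 9]                 # states per character HMM
NG_values = [8, 16, 32, 64]                    # Gaussian densities per state

# Exhaustive sweep: every (VR, NS, NG) combination to be trained and scored
configs = list(product(VR_values, NS_values, NG_values))
num_configs = len(configs)  # 4 * 6 * 4 = 96 candidate systems
```

The figures that follow report WER only along selected slices of this 96-point grid.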
The acquired database was preprocessed as described in Sec. 2. Then, feature
extraction was applied to the preprocessed database to obtain a sequence of
(3 ·VR)-dimensional feature vectors for each handwritten number image (Sec. 2).
As discussed in Sec. 3.2, left-to-right continuous-density HMMs of NS states and
NG Gaussian densities per state were used for character modeling. These HMMs
were trained through four iterations of the Baum–Welch algorithm. This algorithm
Fig. 5. Test-set recognition word error rate (WER) as a function of the vertical resolution (VR), the number of states (NS) and Gaussian densities (NG) per HMM. (Left) WER as a function of VR for NS = 6 and varying NG. (Right) WER as a function of NS for VR = 1/20 and varying NG.
was initialized by a linear segmentation of each training image into a number of
equal-length segments (according to the number of characters in the orthographic
transcription of the sentence). As mentioned in Sec. 3.4, both the sequential
SFST and the lexical models were built by hand. For each test sentence, the
Viterbi algorithm was performed on the integrated finite-state network (Sec. 3.5) to
obtain an optimal decoding of the input image into a sequence of words (along with
its corresponding interpretation in numerical form). The training and test proce-
dures were implemented on the basis of the well-known and widely available Hidden
Markov Model Toolkit (HTK).19
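A minimal sketch of the linear segmentation used to initialize Baum–Welch training (function name and frame counts are hypothetical):

```python
def linear_segmentation(num_frames, transcription):
    """Flat-start initialization: split the frame sequence of a training
    image into equal-length segments, one per character of its
    orthographic transcription (boundaries rounded to whole frames)."""
    chars = list(transcription)
    bounds = [round(i * num_frames / len(chars)) for i in range(len(chars) + 1)]
    return [(c, bounds[i], bounds[i + 1]) for i, c in enumerate(chars)]

# a hypothetical 100-frame image of the word "mil": ~33 frames per character
segments = linear_segmentation(100, "mil")
```

Each character's HMM is then initialized from the feature vectors falling inside its segment, before re-estimation refines the alignment.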
Since our interest here is tuning character and word-modeling parameters, only
word recognition results matter. Figure 5 shows these results in terms of WER,
for several selected parameter values. The left panel shows the WER as a function
of VR for NS = 6 and varying NG . Similarly, the right panel shows the WER as
a function of NS for VR = 1/20 (the best resolution) and varying NG . The best
result in this figure is a WER of 5.8% with a 95% confidence interval20 of [4.3%,
7.6%], which corresponds to a vertical resolution of 1/20, and character HMMs of
six states with 16 Gaussian densities per state. It is worth noting that this number
of Gaussian densities is consistently optimal (or close to it) for all vertical
resolutions and numbers of states.
4.2. N-gram modeling
Two experiments were carried out in order to compare the proposed integrated
interpretation approach with a more conventional serial, first-recognition,
then-interpretation paradigm.
Fig. 6. Test-set recognition word error rate (WER) and interpretation digit error rate (DER) as a function of the number of Gaussian densities (NG) per HMM state, for HMMs of six states and a vertical resolution of 1/20. (Top) Using a two-gram language model trained with the 298 transcriptions of the training set, followed by perfect number translation. (Bottom-left) Using a two-gram language model trained with 10,000 Spanish number text sentences, followed by perfect number translation. (Bottom-right) Using the SFST for integrated recognition and interpretation.
In both cases, the HMMs and lexical models discussed above were used in
conjunction with two-grams for word language modeling. In the first experiment, a
two-gram language model was trained using the 298 text transcriptions of the 298
images of the training set. In the second, the two-gram was trained using 10,000
Spanish number text sentences obtained from the L1 list. In both cases, training
(with standard back-off smoothing) was performed using the CMU-Cambridge
Statistical Language Modeling toolkit.d
d http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
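The back-off idea can be illustrated with a toy maximum-likelihood bigram; this is a simplified stand-in for the toolkit's standard back-off smoothing, not its actual algorithm, and all names are ours:

```python
from collections import Counter

def train_bigram(sentences):
    """Collect unigram and bigram counts with sentence-boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, alpha=0.4):
    """Seen bigram: relative frequency; unseen: back off to a scaled
    unigram probability (a crude surrogate for proper back-off weights)."""
    if (w_prev, w) in bigrams:
        return bigrams[(w_prev, w)] / unigrams[w_prev]
    return alpha * unigrams[w] / sum(unigrams.values())

unigrams, bigrams = train_bigram(["mil dos", "mil millones"])
p_seen = bigram_prob("mil", "dos", unigrams, bigrams)      # 1/2
p_backoff = bigram_prob("dos", "mil", unigrams, bigrams)   # 0.4 * 2/8
```

With so little training text (the 298-transcription case), most test bigrams fall into the back-off branch, which is what degrades the WER reported below.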
Using Viterbi search with these models, the test images were optimally recog-
nized in terms of word sequences. The resulting sentences were translated into
numerical form by means of the very same SFST used in our integrated interpreta-
tion experiments. Test-set recognition WER and the corresponding interpretation
DER are shown in Fig. 6 (top and bottom-left, respectively). They are plotted as a
function of NG for the best values of NS and VR found in the previous experiments
(NS = 6 and VR = 1/20). As expected, the WERs for the two-gram trained with
the small training set are clearly worse than those of the two-gram trained with
10,000 text sentences. It is worth noting that, in both cases, DER is systematically
worse than WER.
4.3. Integrated transduction
One experiment was carried out to assess the impact of integrated interpretation
following the approach proposed in Sec. 3. This experiment is similar to the one
presented in Sec. 4.1 for VR = 1/20, NS = 6 and varying NG . Here, integrated
recognition and interpretation was carried out using the SFST outlined in Sec. 3.4
(the same one used as a back-end in the serial two-gram experiments).
As in the previous subsection, both test-set recognition WER and interpretation
DER are shown in Fig. 6 (bottom-right), plotted as a function of NG . WER results
are similar to those obtained with the two-gram model trained with 10,000 text
sentences. However, in contrast with the two-gram serial approach, here the DER is
systematically better than the WER. The best result is a DER of 4.6% (with a 95%
confidence interval of [3.6%, 5.7%]; this corresponds to 3.6% digit substitution
errors, 0.7% deletions and 0.3% insertions). Compared with this result, the best
DER of the two-gram serial approach was worse by 44% relative.
Three examples of sentences recognized and interpreted by both the integrated
system and the two-gram serial approach are shown in Fig. 7. The first one
was perfectly recognized and interpreted by the SFST integrated system, but the
Fig. 7. Examples of test sentences (correctly and/or incorrectly) recognized and interpreted by both the two-gram serial approach and the integrated system.
two-gram recognizer produced a word sequence with a single word error (“treinta”
for “trescientos”). This error, however, makes the sequence syntactically incorrect,
thereby preventing the words-to-numbers transducer from providing an adequate
numeric output. The second example corresponds to a rather bad-quality sentence
which is misrecognized by both systems. Here the two-gram recognizer produces a
syntactically incorrect sentence, with many (five) word errors, which likewise cannot
be adequately parsed into numeric form. In contrast, the integrated approach produces
a syntactically correct sentence with only two word errors and a corresponding digit
sequence with two digit errors. The last example shows a sentence that is correctly
recognized and interpreted by the serial two-gram method but is slightly
misrecognized by the integrated approach (with just one word error and one digit error).
As these examples illustrate, a significant number of digit errors of the two-gram
serial approach are due to the inability of the words-to-numbers translator to
parse syntactically incorrect word sequences. However, in some of these cases there
are segments (typically the final parts) of the word sequences provided by the two-
gram recognizer which do admit some parsing. While this never leads to correct
digit sequences, it would at least provide a few digit hypotheses rather than a null
output. We have recomputed the DER for some of the best results in Fig. 6
(bottom-left) by parsing the two-gram word sequences in the error-tolerant
manner suggested above. This produced noticeable improvements, but the best DER achieved was
still larger than the corresponding WER and clearly worse than the best DER
obtained with the integrated approach.
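The suggested error-tolerant parsing can be sketched as a suffix search (the `parse` callback stands in for the words-to-numbers SFST; all names here are ours):

```python
def error_tolerant_parse(words, parse):
    """Try the full word sequence first; if the transducer rejects it,
    drop words from the front until some suffix parses, else return None
    (the null output of the strict serial pipeline)."""
    for start in range(len(words)):
        result = parse(words[start:])
        if result is not None:
            return result
    return None

# toy transducer stand-in: only the single word "noventa" parses (to "90")
toy_parse = lambda ws: "90" if ws == ["noventa"] else None
recovered = error_tolerant_parse(["garbled", "noventa"], toy_parse)  # "90"
```

As the text notes, recovering a partial digit hypothesis this way reduces the null outputs but cannot repair the digits already lost to the misrecognized prefix.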
4.4. Impact of tangent vectors
Starting with the above result, we incorporated the tangent vectors into the trained
HMMs in order to increase robustness of the approach with respect to changing
Fig. 8. Test-set recognition word error rate (WER) and interpretation digit error rate (DER) as a function of the variance factor γ for the vertical translation tangent vector, for 16 Gaussian densities per mixture, HMMs of six states and a vertical resolution of 1/20.
Fig. 9. Examples of new sentences which have been correctly recognized by the system: 5,225 and 5,457.
vertical shift within each word. Recognition results are given in Fig. 8 as a function
of the variance factor γ. Using this method, the WER was reduced to 5.0% (with a
95% confidence interval of [3.6%, 6.8%]), which is a relative improvement of about
14%. The digit error rate could be further reduced from 4.6% to 4.1% (with a 95%
confidence interval of [3.2%, 5.2%]) which corresponds to a relative improvement of
about 10%. This accuracy can be considered very satisfactory given the difficulty
of the task.
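One common way to fold a tangent vector into a Gaussian emission density, in the spirit of Refs. 10 and 17, is to inflate the covariance along the tangent direction by the variance factor γ; the following is a toy sketch under that assumption, not the paper's exact formulation:

```python
import numpy as np

def tangent_augmented_covariance(sigma_diag, tangent, gamma):
    """Add variance gamma along the (normalized) tangent direction of
    vertical translation to a diagonal Gaussian covariance matrix."""
    t = np.asarray(tangent, dtype=float)
    t = t / np.linalg.norm(t)
    return np.diag(sigma_diag) + gamma * np.outer(t, t)

# toy 2-D feature space: tangent along the second component, gamma = 3
cov = tangent_augmented_covariance([1.0, 1.0], [0.0, 2.0], 3.0)
```

Intuitively, a large γ makes the density nearly flat along small vertical shifts of the strokes, while γ → 0 recovers the original diagonal model; Fig. 8 sweeps this trade-off.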
4.5. Additional field tests
The purpose of all the experiments described in the previous subsections was twofold.
In the first place, they were aimed at showing the impact of the different system
design choices and parameters on the results. In the second place, as a result, we
ended up with a tuned system ready for real work on the task for which it was
designed.
This system recently underwent an informal field test in which 100 new images
of Spanish number sentences (411 words)e were processed. These sentences were
written by a large number of writers, completely disjoint from those in our
database, and were scanned and segmented using scanners and segmentation
software different from those used for our database. This resulted in images of much
lower quality, as shown in Fig. 9. The results were encouraging. Only a couple of
sentences failed in the preprocessing phase, and more than half of the remaining
images were perfectly recognized. Most of the misrecognized sentences contained
just one word error. It should be mentioned that varying (some of) the recognition
parameters did not change the accuracy significantly. This confirms that the
values of these parameters, tuned throughout the experiments reported in this
section, are adequate for using the system in real-world applications.
5. Conclusions
Integrated recognition and interpretation of handwritten text via finite-state models
has been proposed. We advocate the use of HMMs with tangent vectors for increased
robustness with respect to vertical shift, and of stochastic finite-state transducers for
their ability to globally model all the relevant constraints. A syntax-constrained
interpretation task resembling legal amount interpretation for bank checks has
e Provided by a company potentially interested in this technology.
been adopted as an illustrative example. Experimental results have been reported
showing the effectiveness of the proposed approach. They constitute a significant
improvement over previous (preliminary) results obtained on the same task.4 Apart
from the impact of integrated processing, this improvement is due to the inclusion
of more elaborate preprocessing and feature extraction techniques.
References
1. I. Bazzi, R. Schwartz and J. Makhoul, An omnifont open-vocabulary OCR system for English and Arabic, IEEE Trans. PAMI 21 (1999) 495–504.
2. J. Berstel, Transductions and Context-Free Languages (Teubner, 1979).
3. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis (John Wiley, 1973).
4. J. Gonzalez, I. Salvador, A. H. Toselli, A. Juan, E. Vidal and F. Casacuberta, Off-line recognition of syntax-constrained cursive handwritten text, in Proc. S+SSPR 2000, Alicante, Spain, 2000, pp. 143–153.
5. D. Guillevic, Unconstrained Handwriting Recognition Applied to the Processing of Bank Cheques, Ph.D. thesis, Concordia University, 1995.
6. D. Guillevic and C. Y. Suen, Cursive script recognition: a sentence level recognition scheme, in Proc. 3rd Int. Workshop on Frontiers in Handwriting Recognition, 1994, pp. 216–223.
7. X. D. Huang, Y. Ariki and M. A. Jack, Hidden Markov Models for Speech Recognition, Edinburgh Information Technology Series, 1990.
8. F. Jelinek, Statistical Methods for Speech Recognition (MIT Press, 1998).
9. G. Kaufmann and H. Bunke, Amount translation and error localization in check processing using syntax-directed translation, in Proc. ICPR'98, Vol. 2, Brisbane, Australia, 1998, pp. 1530–1534.
10. D. Keysers, W. Macherey, J. Dahmen and H. Ney, Learning of variability for invariant statistical pattern recognition, in Proc. ECML 2001, Freiburg, Germany, 2001, pp. 263–275.
11. U.-V. Marti and H. Bunke, Handwritten sentence recognition, in Proc. ICPR'00, Vol. 3, Barcelona, Spain, 2000, pp. 467–470.
12. G. Nagy, Twenty years of document image analysis in PAMI, IEEE Trans. PAMI 22 (2000) 38–62.
13. J. Oncina, P. García and E. Vidal, Learning subsequential transducers for pattern recognition interpretation tasks, IEEE Trans. PAMI 15 (1993) 448–458.
14. T. Paquet and Y. Lecourtier, Recognition of handwritten sentences using a restricted lexicon, Patt. Recogn. 26 (1993) 391–407.
15. R. Plamondon and S. N. Srihari, On-line and off-line handwriting recognition: a comprehensive survey, IEEE Trans. PAMI 22 (2000) 63–84.
16. L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition (Prentice-Hall PTR, 1993).
17. P. Simard, Y. Le Cun, J. Denker and B. Victorri, Transformation Invariance in Pattern Recognition — Tangent Distance and Tangent Propagation, Lecture Notes in Computer Science, Vol. 1524 (Springer, 1998), pp. 239–274.
18. P. Slavik, Equivalence of different methods for slant and skew corrections in word recognition applications, IEEE Trans. PAMI 23 (2001) 323–326.
19. E. Vidal, Language learning, understanding and translation, in CRIM/FORWISS Workshop on Progress and Prospects of Speech Research and Technology, eds. R. de Mori, H. Niemann and G. Hanrieder (Infix, 1994), pp. 131–140.
20. B. Yanikoglu and P. A. Sandon, Segmentation of off-line cursive handwriting using linear programming, Patt. Recogn. 31 (1998) 1825–1833.
21. S. J. Young, P. C. Woodland and W. J. Byrne, HTK: Hidden Markov Model Toolkit V1.5, Technical Report, Cambridge University Engineering Department Speech Group and Entropic Research Laboratories Inc., 1993.
Alejandro H. Toselli received the M.S. degree in electrical engineering from Universidad Nacional de Tucuman (Argentina) in 1997 and the Ph.D. degree in computer science from Universidad Politecnica de Valencia (Spain) in 2004. Dr. Toselli is a member of the Spanish Society for Pattern Recognition and Image Analysis (AERFAI) and the International Association for Pattern Recognition (IAPR). His current research interest lies in the areas of pattern recognition, computer vision and human language technology.
Jorge Gonzalez received the M.S. degree in computer science from the Universidad Politecnica de Valencia (UPV) in 1999. He was a graduate research assistant from 2000 to 2003 and is now under contract on a research project. His research interests are in the areas of speech recognition and machine translation.
Alfons Juan received the M.S. and Ph.D. degrees in computer science from the Universidad Politecnica de Valencia (UPV) in 1991 and 2000, respectively. He has been a Professor at the UPV since 1995. Dr. Juan is a member of the Spanish Society for Pattern Recognition and Image Analysis (AERFAI) and the International Association for Pattern Recognition (IAPR). His research interests are in the areas of pattern recognition, computer vision and human language technology.
Ismael Salvador received the M.S. in computer science engineering from the Polytechnic University of Valencia in 1999, where he started his Ph.D. studies in pattern recognition the same year. In 2002 he obtained the Advanced Studies Diploma. He worked for three years on a computer vision system to recognize identification digits on containers and is currently researching fast nearest-neighbor methods. His research interests are in computer vision, image processing and pattern recognition.
Enrique Vidal received the Licenciado degree in physics in 1978 and the Doctor en Ciencias Físicas (Ph.D. in physics) degree in 1985, both from the Universitat de Valencia. From 1978 to 1986 he was with this university, serving in computer system programming and teaching positions. In the same period he coordinated a research group in the fields of pattern recognition and automatic speech recognition. In 1986 he joined the Departamento de Sistemas Informaticos y Computacion of the Universidad Politecnica de Valencia (UPV), where he has since served as a full professor of the Facultad de Informatica. In 1995 he joined the Instituto Tecnologico de Informatica, where he has been coordinating several projects on pattern recognition and machine translation. He is co-leader of the Pattern Recognition and Human Language Technology group of the UPV. Dr. Vidal is a member of the Spanish Society for Pattern Recognition and Image Analysis (AERFAI) and the International Association for Pattern Recognition (IAPR). His current fields of interest include statistical and syntactic pattern recognition and their applications to language, speech and image processing. In these fields, he has published more than one hundred papers in journals, conference proceedings and books.
Daniel Keysers received the Dipl. degree in computer science (with honors) from the RWTH Aachen University, Germany, in 2000. Since then, he has been a Ph.D. student and research assistant at the Department of Computer Science of the RWTH, where he is currently the Head of the Image Processing and Object Recognition Group at the Chair of Computer Science VI. His research interests include statistical modeling for pattern recognition, invariance in image object recognition and computer vision, and (medical) image retrieval.
Francisco Casacuberta received the Master and Ph.D. degrees in physics from the University of Valencia, Spain, in 1976 and 1981, respectively. From 1976 to 1979 he worked with the Department of Electricity and Electronics at the University of Valencia as an FPI fellow. From 1980 to 1986 he was with the Computing Center of the University of Valencia. Since 1980 he has been with the Department of Information Systems and Computation of the Polytechnic University of Valencia, first as an Associate Professor and, since 1990, as a Full Professor. Since 1981 he has been an active member of a research group in the fields of automatic speech recognition and machine translation. Dr. Casacuberta is a member of the Spanish Society for Pattern Recognition and Image Analysis (AERFAI), which is an affiliate society of the IAPR, the IEEE Computer Society and the Spanish Association for Artificial Intelligence (AEPIA). His current research interest lies in the areas of speech recognition, machine translation, syntactic pattern recognition, statistical pattern recognition and machine learning.
Hermann Ney received the Dipl. degree in physics from the University of Goettingen, Germany, in 1977 and the Dr.-Ing. degree in electrical engineering from the TU Braunschweig (University of Technology), Germany, in 1982. In 1977 he joined Philips Research Laboratories (Hamburg and Aachen, Germany), where he worked on various aspects of speaker verification, isolated and connected word recognition and large-vocabulary continuous-speech recognition. In 1985 he was appointed head of the Speech and Pattern Recognition group. In 1988–1989 he was a visiting scientist at AT&T Bell Laboratories, Murray Hill, NJ. In July 1993 he joined RWTH Aachen (University of Technology), Germany, as a professor for computer science. His work is concerned with the application of statistical techniques and dynamic programming for decision-making in context. His current interests cover pattern recognition and the processing of spoken and written language, in particular signal processing, search strategies for speech recognition, language modeling, automatic learning and language translation.