
M.R. Ogiela and L.C. Jain (Eds.): Computational Intelligence Paradigms, SCI 386, pp. 5–24. springerlink.com © Springer-Verlag Berlin Heidelberg 2012

Chapter 2 Neural Networks for Handwriting Recognition

Marcus Liwicki1, Alex Graves2, and Horst Bunke3

1 German Research Center for Artificial Intelligence, Trippstadter Str. 122, 67663 Kaiserslautern, Germany, e-mail: [email protected]
2 Institute for Informatics 6, Technical University of Munich, Boltzmannstr. 3, 85748 Garching bei München, Germany, e-mail: [email protected]
3 Institute for Computer Science and Applied Mathematics, University of Bern, Neubrückstr. 10, 3012 Bern, Switzerland, e-mail: [email protected]

Abstract. In this chapter a novel kind of Recurrent Neural Network (RNN) is described. Bi- and multidimensional RNNs combined with Connectionist Temporal Classification allow for the direct recognition of raw stroke data or raw pixel data. In general, recognizing lines of unconstrained handwritten text is a challenging task. The difficulty of segmenting cursive or overlapping characters, combined with the need to assimilate context information, has led to low recognition rates for even the best current recognizers. Most recent progress in the field has been made either through improved preprocessing or through advances in language modeling. Relatively little work has been done on the basic recognition algorithms. Indeed, most systems rely on the same hidden Markov models that have been used for decades in speech and handwriting recognition, despite their well-known shortcomings. This chapter describes an alternative approach based on a novel type of recurrent neural network, specifically designed for sequence labeling tasks where the data is hard to segment and contains long-range, bidirectional or multidirectional interdependencies. In experiments on two unconstrained handwriting databases, the new approach achieves word recognition accuracies of 79.7% on on-line data and 74.1% on off-line data, significantly outperforming a state-of-the-art HMM-based system. Promising experimental results on various other datasets from different countries are also presented. A toolkit implementing the networks is freely available to the public.

1 Introduction

Handwriting recognition is traditionally divided into on-line and off-line recognition. In on-line recognition a time-ordered sequence of coordinates, representing


the movement of the pen-tip, is captured, while in the off-line case only an image of the text is available. Because of the greater ease of extracting relevant features, online recognition generally yields better results [1]. Another crucial division is that between the recognition of isolated characters or words, and the recognition of whole lines of text. Unsurprisingly, the latter is substantially harder, and the excellent results that have been obtained for, e.g., digit and character recognition [2], [3] have never been matched for complete lines. Lastly, handwriting recognition can be split into cases where the writing style is constrained in some way—for example, only hand printed characters are allowed—and the more challenging scenario where it is unconstrained. Despite more than 40 years of handwriting recognition research [2], [3], [4], [5], developing a reliable, general-purpose system for unconstrained text line recognition remains an open problem.

1.1 State-of-the-Art

A well-known testbed for isolated handwritten character recognition is the UNIPEN database [6]. Systems that have been found to perform well on UNIPEN include: a writer-independent approach based on hidden Markov models [7]; a hybrid technique called cluster generative statistical dynamic time warping (CSDTW) [8], which combines dynamic time warping with HMMs and embeds clustering and statistical sequence modeling in a single feature space; and a support vector machine with a novel Gaussian dynamic time warping kernel [9]. Typical error rates on UNIPEN range from 3% for digit recognition, to about 10% for lower case character recognition.

Similar techniques can be used to classify isolated words, and this has given good results for small vocabularies (e.g., a writer-dependent word error rate of about 4.5% for 32 words [10]). However, an obvious drawback of whole word classification is that it does not scale to large vocabularies.

For large vocabulary recognition tasks, such as those considered in this chapter, the usual approach is to recognize individual characters and map them onto complete words using a dictionary. Naively, we could do this by presegmenting words into characters and classifying each segment. However, segmentation is difficult for cursive or unconstrained text, unless the words have already been recognized. This creates a circular dependency between segmentation and recognition that is often referred to as Sayre’s paradox [11]. Nonetheless, approaches have been proposed where segmentation is carried out before recognition. Some techniques for character segmentation, based on unsupervised learning and data-driven methods, are given in [3]. Other strategies first segment the text into basic strokes, rather than characters. The stroke boundaries may be defined in various ways, such as the minima of the velocity, the minima of the y-coordinates, or the points of maximum curvature. For example, one online approach first segments the data at the minima of the y-coordinates and then applies self-organizing maps [12]. Another, offline, approach [13] uses the minima of the vertical histogram for an initial estimation of the character boundaries and then applies various heuristics to improve the segmentation.


A more satisfactory solution to Sayre’s paradox would be to segment and recognize at the same time. Hidden Markov models (HMMs) are able to do this, which is one reason for their popularity for unconstrained handwriting [14], [15], [16], [17], [18], [19]. The idea of applying HMMs to handwriting recognition was originally motivated by their success in speech recognition [20], where a similar conflict exists between recognition and segmentation. Over the years, numerous refinements of the basic HMM approach have been proposed, such as the writer-independent system considered in [7], which combines point-oriented and stroke-oriented input features.

However, HMMs have several well-known drawbacks. One of these is that they assume the probability of each observation depends only on the current state, which makes contextual effects difficult to model. Another is that HMMs are generative, while discriminative models generally give better performance on labeling and classification tasks.

Recurrent neural networks (RNNs) do not suffer from these limitations, and would therefore seem a promising alternative to HMMs. However, the application of RNNs alone to handwriting recognition has so far been limited to isolated character recognition (e.g., [21]). One reason for this is that the traditional neural network objective functions require a separate training signal for every point in the input sequence, which in turn requires presegmented data.

A more successful use of neural networks for handwriting recognition has been to combine them with HMMs in the so-called hybrid approach [22], [23]. A variety of network architectures have been tried for hybrid handwriting recognition, including multilayer perceptrons [24], [25], time delay neural networks (TDNNs) [18], [26], [27], and RNNs [28], [29], [30]. However, although hybrid models alleviate the difficulty of introducing context to HMMs, they still suffer from many of the drawbacks of HMMs, and they do not realize the full potential of RNNs for sequence modeling.

1.2 Contribution

This chapter describes a recently introduced alternative approach, in which a single RNN is trained directly for sequence labeling. The network uses connectionist temporal classification (CTC) combined with the bidirectional Long Short-Term Memory (BLSTM) architecture, which provides access to long-range input context in both directions. A further enhancement, which allows the network to work in multiple dimensions, is also presented in this chapter. The so-called Multidimensional LSTM (MDLSTM) is very successful even on raw pixel data.

The rest of this chapter is organized as follows. Section 2 presents the handwritten data and the feature extraction techniques. Subsequently, Section 3 describes the novel neural network classifier. Experimental results are presented in Section 4. Finally, Section 5 concludes the chapter.


Fig. 1 Processing steps of the handwriting recognition system

2 Data Processing

As stated above, handwritten data can be acquired in two formats, online and offline. In this section typical preprocessing and feature extraction techniques are presented; these techniques were applied in our experiments.


The online and offline databases used are the IAM-OnDB [31] and the IAM-DB [32], respectively. Note that these do not correspond to the same handwriting samples: the IAM-OnDB was acquired from a whiteboard, while the IAM-DB consists of scanned images of handwritten forms.¹

2.1 General Processing Steps

A recognition system for unconstrained Roman script is usually divided into consecutive units which iteratively process the handwritten input data to finally obtain the transcription. The main units are illustrated in Fig. 1 and summarized in this section. Certainly, there are differences between offline and online processing, but the principles are the same. Only the methodology for performing the individual steps differs.

First, preprocessing steps are applied to reduce noise in the raw data. The input is raw handwritten data and the output usually consists of extracted text lines. The amount of effort that needs to be invested into the preprocessing depends on the given data. If the data have been acquired from a system that does not produce any noise and only single words have been recorded, there is nothing to do in this step. But usually the data contain noise which needs to be removed to improve the quality of the handwriting, e.g., by means of image enhancement. The offline images are furthermore binarized, and the online data, which usually contain noisy points and gaps within strokes, are processed with some heuristics to recover from these artifacts. These operations are described in Ref. [33]. The cleaned text data is then automatically divided into lines using some simple heuristics.

Next, the data is normalized, i.e., an attempt is made to remove writer-specific characteristics of the handwriting so that writings from different authors look more similar to each other. This is a very important step in any handwriting recognition system, because the writing styles of the writers differ with respect to skew, slant, height, and width of the characters. In the literature there is no standard way of normalizing the data, but many systems use similar techniques. First, the text line is corrected with regard to its skew, i.e., it is rotated so that the baseline is parallel to the x-axis. Then, slant correction is performed so that the writing becomes upright. The next important step is the computation of the baseline and the corpus line. These two lines divide the text into three areas: the upper area, which mainly contains the ascenders of the letters; the middle area, where the corpus of the letters is present; and the lower area with the descenders of some letters. These three areas are normalized to predefined heights. Often, some additional normalization steps are performed, depending on the domain. In offline recognition, thinning and binarization may be applied. In online recognition the delayed strokes, e.g., the crossing of a “t” or the dot of an “i”, are usually removed, and equidistant resampling is applied.
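The skew-correction step can be illustrated with a small sketch. This is not the exact procedure of the system described here: it assumes, for simplicity, a least-squares baseline fit over all points (a real system would typically fit only the lower contour or local minima), and the function name is hypothetical.

```python
import numpy as np

def correct_skew(points):
    """Rotate a text line so its (estimated) baseline is parallel to the x-axis.

    `points` is an (N, 2) array of (x, y) pen coordinates. The baseline angle
    is estimated here by a plain least-squares line fit, a deliberate
    simplification.
    """
    x, y = points[:, 0], points[:, 1]
    slope, _intercept = np.polyfit(x, y, 1)   # estimated baseline slope
    angle = np.arctan(slope)
    c, s = np.cos(-angle), np.sin(-angle)
    rot = np.array([[c, -s], [s, c]])         # 2D rotation matrix
    return points @ rot.T                     # rotate all points

# Example: a line written with a slight upward skew becomes near-horizontal.
pts = np.array([[t, 0.2 * t + np.sin(t)] for t in np.linspace(0, 10, 50)])
flat = correct_skew(pts)
print(np.polyfit(flat[:, 0], flat[:, 1], 1)[0])  # residual slope ~ 0
```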

¹ The databases and benchmark tasks are available at http://www.iam.unibe.ch/fki/databases


Subsequently, features are extracted from the normalized data. This step is needed because the recognizers require numerical data as their input. However, no standard method for computing the features exists in the literature. One common method in offline recognition of handwritten text lines is the use of a sliding window moving in the writing direction over the text. Features are extracted at every window position, resulting in a sequence of feature vectors. In the case of online recognition the points are already available in a time-ordered sequence, which makes it easier to obtain a sequence of feature vectors in writing order. If the input pattern has a fixed size, such as in character or word recognition, one feature vector of a constant size can be extracted for each pattern.

Fig. 2 Features of the vicinity

2.2 Our Online System

In the system described in this chapter state-of-the-art feature extraction methods are applied to extract the features from the preprocessed data. The feature set input to the online recognizer consists of 25 features which utilize information from both the real online data, stored in XML format, and pseudo offline information automatically generated from the online data. For each (x, y)-coordinate recorded by the acquisition device a set of 25 features is extracted, resulting in a sequence of 25-dimensional vectors for each given text line. These features can be divided into two classes. The first class consists of features extracted for each point by considering the neighbors with respect to time. The second class takes the offline matrix representation into account, i.e., it is based on spatial information. The features of the first class are the following:

• pen-up/pen-down: a boolean variable indicating whether the pen-tip touches the board or not. Consecutive strokes are connected with straight lines, for which this feature has the value false.
• hat-feature: this binary feature indicates whether a delayed stroke has been removed at the same horizontal position as the considered point.
• speed: the velocity is computed before resampling and then interpolated.
• x-coordinate: the x-position is taken after high-pass filtering, i.e., after subtracting a moving average from the real horizontal position.
• y-coordinate: this feature represents the vertical position of the point after normalization.
• writing direction: here we have a pair of features, given by the cosine and sine of the angle between the line segment starting at the point and the x-axis.
• curvature: similarly to the writing direction, this is a pair of features, given by the cosine and sine of the angle between the lines to the previous and the next point.
• vicinity aspect: this feature is equal to the aspect of the trajectory in the vicinity of the current point (see Fig. 2).
• vicinity slope: this pair of features is given by the cosine and sine of the angle of the straight line from the first to the last vicinity point (see Fig. 2).
• vicinity curliness: this feature is defined as the length of the trajectory in the vicinity divided by max(Δx(t), Δy(t)) (see Fig. 2).
• vicinity linearity: here we use the average squared distance d² of each point in the vicinity to the straight line from the first to the last vicinity point (see Fig. 2).
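To make the time-based features concrete, here is a minimal sketch of how the writing-direction and curvature pairs from the list above could be computed from a resampled point sequence. The function name and the skipping of border points are illustrative assumptions, not the chapter's exact implementation.

```python
import numpy as np

def direction_and_curvature(points):
    """Per-point writing-direction and curvature features (cos/sin pairs).

    `points` is an (N, 2) array of resampled pen coordinates. Border points
    are skipped for brevity (a hypothetical simplification).
    """
    feats = []
    for i in range(1, len(points) - 1):
        d_next = points[i + 1] - points[i]
        d_prev = points[i] - points[i - 1]
        a_next = np.arctan2(d_next[1], d_next[0])
        a_prev = np.arctan2(d_prev[1], d_prev[0])
        # writing direction: angle between the next line segment and the x-axis
        wd = (np.cos(a_next), np.sin(a_next))
        # curvature: angle between the segments to the previous and next point
        diff = a_next - a_prev
        cv = (np.cos(diff), np.sin(diff))
        feats.append((*wd, *cv))
    return np.array(feats)

pts = np.array([[0, 0], [1, 0], [2, 1], [2, 2]], dtype=float)
print(direction_and_curvature(pts))  # (N-2, 4) feature rows
```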

Fig. 3 Pseudo offline features

The features of the second class are all computed using a two-dimensional matrix B representing the offline version of the data. For each position the number of points on the trajectory of the strokes is stored. This can be seen as a low-resolution image of the handwritten data. The following features are used:

• ascenders/descenders: these two features count the number of points above the corpus line (ascenders) and below the baseline (descenders). Only points which have an x-coordinate in the vicinity of the current point are considered. Additionally, the points must have a minimal distance to the lines to be considered as part of an ascender or descender. The corresponding distances are set to a predefined fraction of the corpus height.
• context map: the two-dimensional vicinity of the current point is transformed to a 3×3 map. The number of black points in each region is taken as a feature value, so we obtain altogether nine features of this type.

2.3 Our Offline System

To extract the feature vectors from the offline images, a sliding window approach is used. The width of the window is one pixel, and nine geometrical features are computed at each window position. Each text line image is therefore converted to a sequence of 9-dimensional vectors. The nine features are as follows:

• The mean gray value of the pixels
• The center of gravity of the pixels
• The second order vertical moment of the center of gravity
• The positions of the uppermost and lowermost black pixels
• The rate of change of these positions (with respect to the neighboring windows)
• The number of black-white transitions between the uppermost and lowermost pixels
• The proportion of black pixels between the uppermost and lowermost pixels

For a more detailed description of the offline features, see [17].
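As a rough illustration of the sliding-window idea, the sketch below computes four of the nine features (mean value, center of gravity, uppermost and lowermost black pixel) per one-pixel-wide window of a binary image. The convention "1 = ink" and the helper name are assumptions made for this example only.

```python
import numpy as np

def window_features(img):
    """Slide a one-pixel-wide window over a binary text-line image.

    `img` is a 2D array with 1 = black (ink) and row 0 at the top; each
    column of the image is one window position.
    """
    feats = []
    for col in img.T:                          # one column per window
        black = np.nonzero(col)[0]
        mean_val = col.mean()                  # feature: mean value
        cog = (np.arange(len(col)) * col).sum() / max(col.sum(), 1)
        top = black[0] if len(black) else 0    # uppermost black pixel
        bottom = black[-1] if len(black) else 0
        feats.append([mean_val, cog, top, bottom])
    return np.array(feats)                     # shape: (width, 4)

img = np.zeros((6, 4))
img[2:5, 1] = 1
img[3, 2] = 1
print(window_features(img))
```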

In the next phase indicated in Fig. 1, a classification system is applied which generates a list of candidates or even a recognition lattice. This step and the last step, the postprocessing, are described in the next section.

3 Neural Network Based Recognition

The main focus of this chapter is the recently introduced neural network classifier based on CTC combined with bidirectional or multidimensional LSTM. This section describes the different aspects of the architecture and gives brief insights into the underlying algorithms.

3.1 Recurrent Neural Networks (RNNs)

Recurrent neural networks (RNNs) are a connectionist model containing a self-connected hidden layer. One benefit of the recurrent connection is that a ‘memory’ of previous inputs remains in the network’s internal state, allowing it to make use


of past context. Context plays an important role in handwriting recognition, as illustrated in Figure 4. Another important advantage of recurrency is that the rate of change of the internal state can be finely modulated by the recurrent weights, which builds in robustness to localized distortions of the input data.

Fig. 4 Importance of context. The characters “ur” would be hard to recognize without the context of the word “entourage”.
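A bare-bones forward pass makes the recurrence explicit: the hidden state is a function of both the current input and its own previous value, so earlier inputs leave a trace in later states. The weight shapes and the tanh squashing below are generic illustrative choices, not specific to the system described in this chapter.

```python
import numpy as np

def rnn_forward(xs, W_in, W_h, b):
    """Forward pass of a simple (unidirectional) recurrent hidden layer.

    Shapes: xs (T, n_in), W_in (n_hid, n_in), W_h (n_hid, n_hid), b (n_hid,).
    """
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_in @ x + W_h @ h + b)  # new state depends on the old one
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))                 # 5 time steps, 3 input features
H = rnn_forward(xs, rng.normal(size=(4, 3)) * 0.1,
                rng.normal(size=(4, 4)) * 0.1, np.zeros(4))
print(H.shape)                               # (5, 4): one state per time step
```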

3.2 Long Short-Term Memory (LSTM)

Unfortunately, the range of contextual information that standard RNNs can access is quite limited. The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network’s recurrent connections and is repeatedly scaled by the connection weights. In practice this shortcoming (referred to in the literature as the vanishing gradient problem) makes it hard for an RNN to bridge gaps of more than about 10 time steps between relevant input and target events.

Long Short-Term Memory (LSTM) is an RNN architecture specifically designed to address the vanishing gradient problem. An LSTM hidden layer consists of multiple recurrently connected subnets, known as memory blocks. Each block contains a set of internal units, or cells, whose activation is controlled by three multiplicative units: the input gate, forget gate and output gate. Figure 5 provides a detailed illustration of an LSTM memory block with a single cell.

The effect of the gates is to allow the cells to store and access information over long periods of time. For example, as long as the input gate remains closed (i.e., has an activation close to 0), the activation of the cell will not be overwritten by new inputs arriving in the network. Similarly, the cell activation is only available to the rest of the network when the output gate is open, and the cell’s recurrent connection is switched on and off by the forget gate.

Page 10: [Studies in Computational Intelligence] Computational Intelligence Paradigms in Advanced Pattern Classification Volume 386 || Neural Networks for Handwriting Recognition

14 M. Liwicki, A. Graves, and H. Bunke

Fig. 5 LSTM memory block with one cell

The mathematical background of LSTM is described in depth in [40], [41], [34]. A short description follows. A conventional recurrent multi-layer perceptron network (MLP) contains a hidden layer where all neurons of the hidden layer are fully connected with all neurons of the same layer (the recurrent connections). The activation of a single cell at timestamp t is a weighted sum of the inputs $x_i^t$ plus the weighted sum of the outputs of the previous timestamp $b_h^{t-1}$. This can be expressed as follows (or, equivalently, in matrix form):

$$a^t = \sum_i w_i x_i^t + \sum_h w_h b_h^{t-1}, \qquad A^t = W_i X^t + W_h B^{t-1}$$

$$b^t = f(a^t), \qquad B^t = f(A^t)$$

Since the outputs of the previous timestamp are just calculated by applying the squashing function to the corresponding cell activations, the influence of the network input at the previous timestamp can be considered smaller, since it has already been weighted a second time. Thus the overall network activation can roughly be rewritten as:

$$A^t = g\bigl(X^t,\; W_h X^{t-1},\; W_h^2 X^{t-2},\; \ldots,\; W_h^{t-1} X^1\bigr)$$


where $X^t$ is the overall net input at timestamp t and $W_h$ is the weight matrix of the hidden layer. Note that for clarity we use this abbreviated form of the more complex formula, in which the input weights do not directly appear (everything is hidden in the function $g(\cdot)$). This formula reveals that the influence of earlier timestamps $t-n$ vanishes rapidly, as the time difference n appears in the exponent of the weight matrix. Since all values of the weight matrix $W_h$ are smaller than 1, the n-th power of $W_h$ is close to zero.
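The vanishing influence can be checked numerically. The following toy snippet (an illustration, not part of the original chapter) applies increasing powers of a random recurrent weight matrix with small entries to a fixed input and prints the shrinking norm of the result:

```python
import numpy as np

# Numerical illustration of the argument above: the contribution of an input
# n steps in the past is scaled by (roughly) the n-th power of the recurrent
# weight matrix, which shrinks rapidly when the weights are small.
rng = np.random.default_rng(0)
W_h = rng.uniform(-0.5, 0.5, size=(4, 4))
x = rng.normal(size=4)
for n in (1, 5, 10, 20):
    print(n, np.linalg.norm(np.linalg.matrix_power(W_h, n) @ x))
# The norm decays towards zero: a gap of 10-20 steps is already hard for a
# standard RNN to bridge.
```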

Introducing the LSTM cell brings in three new gate units, which all receive the weighted sum of the outputs of the hidden layer at the previous timestamp as an input, i.e., for the input gate:

$$a_\iota^t = W_{i,\iota} X^t + W_{h,\iota} B^{t-1} + w_{c,\iota}\, s_c^{t-1}$$

where $s_c^{t-1}$ is the cell state of the previous timestamp and $W_{i,\iota}$ and $W_{h,\iota}$ are the weights for the current net input and the hidden layer output of the previous timestamp, respectively. The activation of the forget gate is:

$$a_\theta^t = W_{i,\theta} X^t + W_{h,\theta} B^{t-1} + w_{c,\theta}\, s_c^{t-1}$$

which is the same formula, just with other weights (those trained for the forget gate). The cell activation is usually calculated by:

$$a_c^t = W_{i,c} X^t + W_{h,c} B^{t-1}$$

However, the cell state is then weighted with the outputs of the two gate cells:

$$s_c^t = \sigma(a_\iota^t)\, g(a_c^t) + \sigma(a_\theta^t)\, s_c^{t-1}$$

where $\sigma$ indicates that the sigmoid function is used as a squashing function for the gates and $g(\cdot)$ is the cell’s activation function. As the sigmoid function often returns a value close to zero or one, the formula can be interpreted as:

$$s_c^t = [1 \text{ or } 0]\cdot g(a_c^t) + [1 \text{ or } 0]\cdot s_c^{t-1}$$

or in words: the cell state depends either on the input activation (if the input gate opens, i.e., the first factor is close to 1) or on the previous cell state (if the forget gate opens, i.e., the second factor is close to 1). This particular property enables the LSTM cell to bridge over long time periods. The value of the output gate is calculated similarly to the other gates, i.e.:

$$a_\omega^t = W_{i,\omega} X^t + W_{h,\omega} B^{t-1} + w_{c,\omega}\, s_c^t$$

and the final cell output is:

$$b_c^t = \sigma(a_\omega^t)\, h(s_c^t)$$

which again is either close to zero or the usual output of the cell, $h(s_c^t)$.
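The gate equations above translate almost line for line into code. The following sketch implements one timestep of an LSTM layer with peephole weights; tanh is used for both g and h (a common but here assumed choice), and parameter names such as Wi_in are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, b_prev, s_prev, p):
    """One timestep of an LSTM layer with peephole connections, following
    the gate equations above. `p` holds input weights Wi_*, recurrent
    weights Wh_* and peephole vectors wc_* (names are illustrative)."""
    a_in  = p['Wi_in']  @ x + p['Wh_in']  @ b_prev + p['wc_in']  * s_prev
    a_fg  = p['Wi_fg']  @ x + p['Wh_fg']  @ b_prev + p['wc_fg']  * s_prev
    a_c   = p['Wi_c']   @ x + p['Wh_c']   @ b_prev
    s     = sigmoid(a_in) * np.tanh(a_c) + sigmoid(a_fg) * s_prev
    a_out = p['Wi_out'] @ x + p['Wh_out'] @ b_prev + p['wc_out'] * s
    b     = sigmoid(a_out) * np.tanh(s)   # g and h chosen as tanh here
    return b, s

n_in, n_c = 3, 4
rng = np.random.default_rng(1)
p = {k: rng.normal(scale=0.1, size=(n_c, n_in) if k.startswith('Wi')
                   else (n_c, n_c) if k.startswith('Wh') else n_c)
     for k in ['Wi_in', 'Wh_in', 'wc_in', 'Wi_fg', 'Wh_fg', 'wc_fg',
               'Wi_c', 'Wh_c', 'Wi_out', 'Wh_out', 'wc_out']}
b, s = np.zeros(n_c), np.zeros(n_c)
for x in rng.normal(size=(5, n_in)):          # run over 5 timesteps
    b, s = lstm_step(x, b, s, p)
print(b)
```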


3.3 Bidirectional Recurrent Neural Networks

For many tasks it is useful to have access to future as well as past context. In handwriting recognition, for example, the identification of a given letter is helped by knowing the letters both to the right and left of it. Bidirectional Recurrent Neural Networks (BRNNs) [35] are able to access context in both directions along the input sequence. BRNNs contain two separate hidden layers, one of which processes the inputs forwards, while the other processes them backwards. Both hidden layers are connected to the output layer, which therefore has access to all past and future context of every point in the sequence.

Combining BRNNs and LSTM gives bidirectional LSTM (BLSTM) [42].
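A bidirectional layer is conceptually just two recurrent passes over the same sequence. The sketch below uses a plain recurrent layer as a stand-in for LSTM to keep it short (an assumed simplification); concatenating the time-realigned forward and backward states gives each output position access to the full past and future context:

```python
import numpy as np

def hidden_pass(xs, W_in, W_h):
    """Simple recurrent hidden layer (stand-in for an LSTM layer)."""
    h, out = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(W_in @ x + W_h @ h)
        out.append(h)
    return np.array(out)

def bidirectional(xs, params_fwd, params_bwd):
    """Bidirectional wrapper: one layer reads the sequence forwards, one
    backwards; the output layer sees both at every position."""
    fwd = hidden_pass(xs, *params_fwd)
    bwd = hidden_pass(xs[::-1], *params_bwd)[::-1]   # re-align in time
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 3))
pf = (rng.normal(size=(4, 3)), rng.normal(size=(4, 4)))
pb = (rng.normal(size=(4, 3)), rng.normal(size=(4, 4)))
print(bidirectional(xs, pf, pb).shape)  # (6, 8): forward + backward states
```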

3.4 Connectionist Temporal Classification (CTC)

Standard RNN objective functions require a presegmented input sequence with a separate target for every segment. This has limited the applicability of RNNs in domains such as cursive handwriting recognition, where segmentation is difficult to determine. Moreover, because the outputs of a standard RNN are a series of independent, local classifications, some form of post-processing is required to transform them into the desired label sequence. Connectionist Temporal Classification (CTC) [36], [34] is an RNN output layer specifically designed for sequence labeling tasks. It does not require the data to be presegmented, and it directly outputs a probability distribution over label sequences. CTC has been shown to outperform RNN-HMM hybrids in a speech recognition task [36].

A CTC output layer contains as many units as there are labels in the task, plus an additional ‘blank’ or ‘no label’ unit. The output activations are normalized using the softmax function, so that they sum to 1 and each lies in the range (0, 1):

$$y_k^t = \frac{e^{a_k^t}}{\sum_{k'} e^{a_{k'}^t}},$$

where $a_k^t$ is the unsquashed activation of output unit k at time t, and $y_k^t$ is the activation of the same unit after the softmax function is applied. The above activations are used to estimate the conditional probabilities $p(k, t \mid x)$ of observing the label (or blank) with index k at time t in the input sequence x:

$$y_k^t = p(k, t \mid x).$$


The conditional probability $p(\pi \mid x)$ of observing a particular path $\pi$ through the lattice of label observations is then found by multiplying together the label and blank probabilities at every time step:

$$p(\pi \mid x) = \prod_{t=1}^{T} p(\pi_t, t \mid x) = \prod_{t=1}^{T} y_{\pi_t}^t,$$

where $\pi_t$ is the label observed at time t along path $\pi$.

Paths are mapped onto label sequences $l \in L^{\leq T}$, where $L^{\leq T}$ denotes the set of all strings on the alphabet L of length at most T, by an operator $\mathcal{B}$ that removes first the repeated labels, then the blanks. For example, both $\mathcal{B}(a, -, a, b, -)$ and $\mathcal{B}(-, a, a, -, -, a, b, b)$ yield the labeling $(a, a, b)$. Since the paths are mutually exclusive, the conditional probability of a given labeling $l \in L^{\leq T}$ is the sum of the probabilities of all the paths corresponding to it:

$$p(l \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi \mid x).$$

The above step is what allows the network to be trained with unsegmented data. The intuition is that, because we don’t know where the labels within a particular transcription will occur, we sum over all the places where they could occur.

In general, a large number of paths will correspond to the same label sequence, so a naïve calculation of the sum above is infeasible. However, it can be efficiently evaluated using a graph-based algorithm, similar to the forward-backward algorithm for HMMs. More details about the CTC forward-backward algorithm appear in [39].
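For tiny sequences the label probability can nevertheless be computed by brute force, which makes the role of the operator B and the path sum explicit. This is purely illustrative (the forward-backward algorithm is what makes training practical); the label indices and the blank index 0 are assumed conventions:

```python
import itertools
import numpy as np

def collapse(path, blank=0):
    """The operator B: remove repeated labels, then remove blanks."""
    merged = [k for k, _ in itertools.groupby(path)]   # merge repeats
    return tuple(k for k in merged if k != blank)      # drop blanks

def p_labelling(y, label, blank=0):
    """Brute-force p(l|x): sum the probabilities of every path that
    collapses to `label`. Only feasible for tiny T."""
    T, K = y.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) == label:
            total += np.prod([y[t, k] for t, k in enumerate(path)])
    return total

# Toy output distribution: T=4 timesteps, labels {blank=0, a=1, b=2}.
y = np.full((4, 3), 1 / 3)
print(p_labelling(y, (1, 2)))   # probability of the labelling "ab"
```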

3.5 Multidimensional Recurrent Neural Networks

Ordinary RNNs are designed for time series and other data with a single spatio-temporal dimension. However, the benefits of RNNs (such as robustness to input distortion, and flexible use of surrounding context) are also advantageous for multidimensional data, such as images and video sequences.

Multidimensional recurrent neural networks (MDRNNs) [43], [34], a special case of Directed Acyclic Graph RNNs [44], generalize the basic structure of RNNs to multidimensional data. Rather than having a single recurrent connection, MDRNNs have as many recurrent connections as there are spatio-temporal dimensions in the data. This allows them to access previous context information along all input directions.

Multidirectional MDRNNs are the generalization of bidirectional RNNs to multiple dimensions. For an n-dimensional data sequence, 2^n different hidden layers are used to scan through the data in all directions. As with bidirectional RNNs, all


the layers are connected to a single output layer, which therefore has access to context information in both directions along all dimensions.

Multidimensional LSTM (MDLSTM) is the generalization of bidirectional LSTM to multidimensional data.

3.6 Hierarchical Subsampling Recurrent Neural Networks

Hierarchical subsampling is a common technique in computer vision [45] and other domains with large input spaces. The basic principle is to iteratively re-represent the data at progressively lower resolutions, using a hierarchy of feature extractors. The features extracted at each level are subsampled and used as input to the next level. The number and complexity of the features typically increases as one climbs the hierarchy. This is much more efficient for high-resolution data than a single ‘flat’ feature extractor, since most of the computations are carried out on low-resolution feature maps rather than, for example, raw pixels.

A well-known connectionist hierarchical subsampling architecture is the Convolutional Neural Network [46]. Hierarchical subsampling is also possible with RNNs, and hierarchies of MDLSTM layers have been applied to offline handwriting recognition [47]. Hierarchical subsampling with LSTM is equally useful for long 1D sequences, such as raw speech data or online handwriting trajectories with a high sampling rate.

From the point of view of handwriting recognition, the most interesting aspect of hierarchical subsampling RNNs is that they can be applied directly to the raw input data (offline images or online point-sequences) without any normalization or feature extraction.
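The subsampling step between levels can be sketched in a few lines: blocks of consecutive feature vectors are merged, so the sequence seen by the next (recurrent) level is shorter but has wider feature vectors. The block-concatenation scheme and the factor below are illustrative assumptions, not the exact scheme of the networks described here.

```python
import numpy as np

def subsample(seq, factor=3):
    """Hierarchical subsampling step: concatenate each block of `factor`
    consecutive feature vectors into one, shortening the sequence. In a
    full system a recurrent layer runs before each such step, so higher
    levels see coarser, richer representations."""
    T = (len(seq) // factor) * factor            # drop the remainder
    return seq[:T].reshape(T // factor, -1)

seq = np.arange(24, dtype=float).reshape(12, 2)  # 12 steps, 2 features
print(subsample(seq).shape)                      # (4, 6): 3x shorter, wider
```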

4 Experiments

The experiments were performed with the freely available RNNLIB tool by Alex Graves.² This tool implements the network architecture and furthermore provides examples for the recognition of several scripts.

4.1 Comparison with HMMs on the IAM Databases

The aim of the first experiments was to evaluate the performance of the complete RNN handwriting recognition system, illustrated in Figure 6, for both online and offline handwriting. In particular, we wanted to see how it compared to an HMM-based system. The online and offline databases used were the IAM-OnDB and the IAM-DB, respectively (see above). Note that these do not correspond to the same handwriting samples: the IAM-OnDB was acquired from a whiteboard, while the IAM-DB consists of scanned images of handwritten forms.

² http://sourceforge.net/projects/rnnl/


Fig. 6 Complete RNN handwriting recognition system (here applied to offline Arabic data)

To make the comparisons fair, the same online and offline preprocessing was used for both the HMM and RNN systems. In addition, the same dictionaries and language models were used for the two systems.

For all the experiments, the task was to transcribe the text lines in the test set, using the words in the dictionary. The basic performance measure was the word accuracy:

$$100 \cdot \left(1 - \frac{\text{insertions} + \text{substitutions} + \text{deletions}}{\text{number of words in transcription}}\right)$$

where the number of word insertions, substitutions and deletions is summed over the whole test set. For the RNN system, we also recorded the character accuracy, defined as above except with characters instead of words.
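The word accuracy can be computed from the standard edit distance between reference and recognized word sequences. A small self-contained sketch (illustrative, not the evaluation code used in the experiments):

```python
import numpy as np

def edit_ops(ref, hyp):
    """Levenshtein distance between word lists = minimal number of
    insertions + substitutions + deletions."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution / match
    return d[-1, -1]

def word_accuracy(refs, hyps):
    errors = sum(edit_ops(r.split(), h.split()) for r, h in zip(refs, hyps))
    n_words = sum(len(r.split()) for r in refs)
    return 100.0 * (1 - errors / n_words)

print(word_accuracy(["the quick brown fox"], ["the quick brawn fox box"]))
# one substitution + one insertion over 4 reference words -> 50.0
```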


Table 1 Main results for online data

System        Word Accuracy   Character Accuracy
HMM           65.0%           -
CTC (BLSTM)   79.7%           88.5%

Table 2 Main results for offline data

System        Word Accuracy   Character Accuracy
HMM           64.5%           -
CTC (BLSTM)   74.1%           81.8%

As can be seen from Tables 1 and 2, the RNN substantially outperformed the HMM on both databases. To put these results in perspective, the Microsoft tablet PC handwriting recognizer [37] gave a word accuracy of 71.32% on the online test set. This result is not directly comparable to our own, since the Microsoft system was trained on a different training set and uses considerably more sophisticated language modeling than the HMM and RNN systems we implemented. However, it indicates that the RNN-based recognizer is competitive with the best commercial systems for unconstrained handwriting.

4.2 Recognition Performance of MDLSTM on Contest Data

The MDLSTM system participated in three handwriting recognition contests at ICDAR 2009 (see the proceedings in [38]). The recognition tasks were based on different scripts. In all cases, the systems had to recognize handwriting from unknown writers.

Table 3 Summarized results from the online Arabic handwriting recognition competition

System           Word Accuracy   Time/Image
REGIM HMM        52.67%          6402.24 ms
Vision Objects   98.99%          69.41 ms
CTC (BLSTM)      95.70%          1377.22 ms

Table 4 Summarized results from the offline Arabic handwriting recognition competition

System             Word Accuracy   Time/Image
Arab-Reader HMM    76.66%          2583.64 ms
Multi-Stream HMM   74.51%          143269.81 ms
CTC (MDLSTM)       81.06%          371.61 ms


Table 5 Summarized results from the offline French handwriting recognition competition

System                  Word Accuracy
HMM+MLP Combination     83.17%
Non-Symmetric HMM       83.17%
CTC (MDLSTM)            93.17%

A summary of the results appears in Tables 3-5. As can be seen, the approach described in this chapter always outperformed the other systems in the offline case. This observation is very promising, because the system just uses the 2-dimensional raw pixel data as input. For the online competition (Table 3) a commercial recognizer performed better than the CTC approach. However, if the CTC system were combined with state-of-the-art preprocessing and feature extraction methods, it would probably reach a higher performance. This observation was made in [39], where experiments extending those in Section 4.1 were performed.

A look at the calculation time (milliseconds per text line) also reveals very promising results. The MDLSTM combined with CTC was among the fastest recognizers in the competitions. Using some pruning strategies could further increase the recognition speed.

5 Conclusion

This chapter described a novel approach for recognizing unconstrained handwritten text, using a recurrent neural network. The key features of the network are the bidirectional Long Short-Term Memory architecture, which provides access to long-range, bidirectional contextual information, and the Connectionist Temporal Classification output layer, which allows the network to be trained on unsegmented sequence data. In experiments on online and offline handwriting data, the new approach outperformed state-of-the-art HMM-based classifiers and several other recognizers. We conclude that this system represents a significant advance in the field of unconstrained handwriting recognition, and merits further research. A toolkit implementing the presented architecture is freely available to the public.

References

[1] Seiler, R., Schenkel, M., Eggimann, F.: Off-line cursive handwriting recognition compared with on-line recognition. In: ICPR 1996: Proceedings of the International Conference on Pattern Recognition (ICPR 1996), vol. IV-7472, p. 505. IEEE Computer Society, Washington, DC, USA (1996)

[2] Tappert, C., Suen, C., Wakahara, T.: The state of the art in online handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(8), 787–808 (1990)


[3] Plamondon, R., Srihari, S.N.: On-line and off-line handwriting recognition: A comprehensive survey. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 63–84 (2000)

[4] Vinciarelli, A.: A survey on off-line cursive script recognition. Pattern Recognition 35(7), 1433–1446 (2002)

[5] Bunke, H.: Recognition of cursive Roman handwriting – past, present and future. In: Proc. 7th Int. Conf. on Document Analysis and Recognition, vol. 1, pp. 448–459 (2003)

[6] Guyon, I., Schomaker, L., Plamondon, R., Liberman, M., Janet, S.: Unipen project of on-line data exchange and recognizer benchmarks. In: Proc. 12th Int. Conf. on Pattern Recognition, pp. 29–33 (1994)

[7] Hu, J., Lim, S., Brown, M.: Writer independent on-line handwriting recognition using an HMM approach. Pattern Recognition 33(1), 133–147 (2000)

[8] Bahlmann, C., Burkhardt, H.: The writer independent online handwriting recognition system frog on hand and cluster generative statistical dynamic time warping. IEEE Trans. Pattern Anal. and Mach. Intell. 26(3), 299–310 (2004)

[9] Bahlmann, C., Haasdonk, B., Burkhardt, H.: Online handwriting recognition with support vector machines - a kernel approach. In: Proc. 8th Int. Workshop on Frontiers in Handwriting Recognition, pp. 49–54 (2002)

[10] Wilfong, G., Sinden, F., Ruedisueli, L.: On-line recognition of handwritten symbols. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(9), 935–940 (1996)

[11] Sayre, K.M.: Machine recognition of handwritten words: A project report. Pattern Recognition 5(3), 213–228 (1973)

[12] Schomaker, L.: Using stroke- or character-based self-organizing maps in the recognition of on-line, connected cursive script. Pattern Recognition 26(3), 443–450 (1993)

[13] Kavallieratou, E., Fakotakis, N., Kokkinakis, G.: An unconstrained handwriting recognition system. Int. Journal on Document Analysis and Recognition 4(4), 226–242 (2002)

[14] Bercu, S., Lorette, G.: On-line handwritten word recognition: An approach based on hidden Markov models. In: Proc. 3rd Int. Workshop on Frontiers in Handwriting Recognition, pp. 385–390 (1993)

[15] Starner, T., Makhoul, J., Schwartz, R., Chou, G.: Online cursive handwriting recognition using speech recognition techniques. In: Int. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 125–128 (1994)

[16] Hu, J., Brown, M., Turin, W.: HMM based online handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(10), 1039–1045 (1996)

[17] Marti, U.-V., Bunke, H.: Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system. Int. Journal of Pattern Recognition and Artificial Intelligence 15, 65–90 (2001)

[18] Schenkel, M., Guyon, I., Henderson, D.: On-line cursive script recognition using time delay neural networks and hidden Markov models. Machine Vision and Applications 8, 215–223 (1995)

[19] El-Yacoubi, A., Gilloux, M., Sabourin, R., Suen, C.: An HMM-based approach for off-line unconstrained handwritten word modeling and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(8), 752–760 (1999)

[20] Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE 77(2), 257–286 (1989)


[21] Bourbakis, N.G.: Handwriting recognition using a reduced character method and neural nets. In: Proc. SPIE Nonlinear Image Processing VI, vol. 2424, pp. 592–601 (1995)

[22] Bourlard, H., Morgan, N.: Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers (1994)

[23] Bengio, Y.: Markovian models for sequential data. Neural Computing Surveys 2, 129–162 (1999)

[24] Brakensiek, A., Kosmala, A., Willett, D., Wang, W., Rigoll, G.: Performance evaluation of a new hybrid modeling technique for handwriting recognition using identical on-line and off-line data. In: Proc. 5th Int. Conf. on Document Analysis and Recognition, Bangalore, India, pp. 446–449 (1999)

[25] Marukatat, S., Artires, T., Dorizzi, B., Gallinari, P.: Sentence recognition through hybrid neuro-markovian modelling. In: Proc. 6th Int. Conf. on Document Analysis and Recognition, pp. 731–735 (2001)

[26] Jaeger, S., Manke, S., Reichert, J., Waibel, A.: Online handwriting recognition: the NPen++ recognizer. Int. Journal on Document Analysis and Recognition 3(3), 169–180 (2001)

[27] Caillault, E., Viard-Gaudin, C., Ahmad, A.R.: MS-TDNN with global discriminant trainings. In: Proc. 8th Int. Conf. on Document Analysis and Recognition, pp. 856–861 (2005)

[28] Senior, A.W., Fallside, F.: An off-line cursive script recognition system using recurrent error propagation networks. In: International Workshop on Frontiers in Handwriting Recognition, Buffalo, NY, USA, pp. 132–141 (1993)

[29] Senior, A.W., Robinson, A.J.: An off-line cursive handwriting recognition system. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 309–321 (1998)

[30] Schenk, J., Rigoll, G.: Novel hybrid NN/HMM modelling techniques for on-line handwriting recognition. In: Proc. 10th Int. Workshop on Frontiers in Handwriting Recognition, pp. 619–623 (2006)

[31] Liwicki, M., Bunke, H.: IAM-OnDB – an on-line English sentence database acquired from handwritten text on a whiteboard. In: Proc. 8th Int. Conf. on Document Analysis and Recognition, pp. 956–961 (2005)

[32] Marti, U.-V., Bunke, H.: The IAM-database: an English sentence database for offline handwriting recognition. Int. Journal on Document Analysis and Recognition 5, 39–46 (2002)

[33] Liwicki, M., Bunke, H.: Handwriting recognition of whiteboard notes – studying the influence of training set size and type. Int. Journal of Pattern Recognition and Artificial Intelligence 21(1), 83–98 (2007)

[34] Graves, A.: Supervised Sequence Labelling with Recurrent Neural Networks. Ph.D. thesis, Technical University of Munich (2008)

[35] Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Processing 45, 2673–2681 (1997)

[36] Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labeling unsegmented sequence data with recurrent neural networks. In: Proc. Int. Conf. on Machine Learning, pp. 369–376 (2006)

[37] Pitman, J.A.: Handwriting recognition: Tablet PC text input. Computer 40(9), 49–54 (2007)

[38] Proc. 10th Int. Conf. on Document Analysis and Recognition (2009)


[39] Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5), 855–868 (2009)

[40] Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9(8), 1735–1780 (1997)

[41] Gers, F.: Long Short-Term Memory in Recurrent Neural Networks. Ph.D. thesis, EPFL (2001)

[42] Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6), 602–610 (2005)

[43] Graves, A., Fernández, S., Schmidhuber, J.: Multidimensional recurrent neural networks. In: Proc. Int. Conf. on Artificial Neural Networks (2007)

[44] Baldi, P., Pollastri, G.: The principled design of large-scale recursive neural network architectures–DAG-RNNs and the protein structure prediction problem. J. Mach. Learn. Res. 4, 575–602 (2003)

[45] Riesenhuber, M., Poggio, T.: Hierarchical models of object recognition in cortex. Nature Neuroscience 2(11), 1019–1025 (1999)

[46] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)

[47] Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimensional recurrent neural networks. Advances in Neural Information Processing Systems 21, 545–552 (2009)