
IJDAR (2001) 3: 169–180

Online handwriting recognition: the NPen++ recognizer

S. Jaeger 1,2, S. Manke 1,2, J. Reichert 1,2, A. Waibel 1,2

1 Interactive Systems Laboratories, University of Karlsruhe, Computer Science Department, 76128 Karlsruhe, Germany; e-mail: [email protected]

2 Interactive Systems Laboratories, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA 15213-3890, USA

Received September 3, 2000 / Revised October 9, 2000

Correspondence to: S. Jaeger

Abstract. This paper presents the online handwriting recognition system NPen++ developed at the University of Karlsruhe and Carnegie Mellon University. The NPen++ recognition engine is based on a multi-state time delay neural network and yields recognition rates from 96% for a 5,000-word dictionary to 93.4% for a 20,000-word dictionary and 91.2% for a 50,000-word dictionary. The proposed tree search and pruning technique reduces the search space considerably without losing much recognition performance compared to an exhaustive search. This enables the NPen++ recognizer to run in real time with large dictionaries. Initial recognition rates for whole sentences are promising and show that the MS-TDNN architecture is suited to recognizing handwritten data ranging from single characters to whole sentences.

Key words: Online handwriting recognition – Neural networks – Pen-based computing – Pattern recognition – Human-computer interaction

1 Introduction

Pen-based interfaces are becoming increasingly popular and will play an important role in human-computer interaction in the near future. A clear indication for this is the advent of personal digital assistants, i.e., small hand-held computers accepting pen-based input and combining agenda, address book, and telecommunication facilities. These palm computers are already widely used pen-based systems.

Using a pen as an input device has several advantages: a pen is much easier to carry around, which is a very important feature for small portable computers, and it can also take over many functions of a conventional mouse. For instance, clicking and dragging are two basic operations that a pen can perform just as well. One major advantage of the pen over the mouse is the fact that a pen is a natural writing tool, while the mouse is very cumbersome when used as a writing tool.

However, the reliable transformation of handwritten text into a coding that can be directly processed by a computer, e.g., ASCII, is essential for the pen to become a preferred input device. Handwriting needs reliable recognition rates to be accepted as an alternative input modality.

This paper presents the online handwriting recognition system called NPen++, which is based on a multi-state time delay neural network (MS-TDNN). NPen++ yields recognition rates from 96% for a 5,000-word dictionary to 93.4% for a 20,000-word dictionary and 91.2% for a 50,000-word dictionary. The recognition times are virtually independent of the size of the dictionary. In this paper, the preprocessing steps, the computation of features, the recognizer with training and testing, and the dictionary-based search are described in detail. Section 2 begins with the description of the preprocessing steps in NPen++ that normalize the handwritten raw data. Section 3 shows the different features computed from the normalized raw data. The core of the NPen++ recognition engine, i.e., the MS-TDNN, is presented in Sect. 4. Training and recognition are described in Sect. 5. Section 6 describes the search technique used in the NPen++ system for words and sentences, and results for both word recognition and sentence recognition are presented in Sect. 7. Section 8 contains a summary that concludes this paper.

2 Preprocessing

This section describes the preprocessing steps of NPen++. Before features are derived from the handwritten word, the raw data recorded by the hardware goes through several preprocessing steps. Raw data for the NPen++ system was collected at a sampling rate of 200 points/s using Wacom graphic tablets (the opaque SD-421E tablet and the HD-648A tablet with integrated LCD screen), where only the pen-down segments were recorded by the hardware.


Fig. 1. Preprocessing (overview)

The main objective of the preprocessing steps is to normalize words and remove variations that would otherwise complicate recognition. Typical variations are size, slant, and rotation. These variations can provide important hints in signature verification and writer identification tasks, but they complicate handwriting recognition in general.

While several preprocessing steps in NPen++ are modeled on existing online handwriting recognition systems [4,7,20–22], some steps, such as context maps, are unique to NPen++.

Figure 1 gives an overview of the preprocessing steps of NPen++.

2.1 Computing baselines

Computing baselines is a standard technique in handwriting recognition and is implemented in several online as well as offline (OCR) systems (e.g., [4,22]). Baselines are utilized for many purposes, e.g., normalizing sizes, correcting rotations, or deriving features. Many systems, including NPen++, compute the following two lines:

– Baseline: The baseline corresponds to the original writing line on which a word or text was written.

– Corpus line: The corpus line goes through the tops of lower-case characters such as "n" or "w".

In the approach described in this paper, however, we compute two additional lines as described in [2]. The first additional line goes through the ascenders (the tops of characters such as "E" or "l") and the second goes through the descenders (the bottoms of characters such as "p" or "g"). Figure 2 shows all four lines for the word "writing." Baselines are often computed using linear regression lines that approximate the local minima (baseline) or the local maxima (corpus line) of the trajectory [4,21,22]. In the NPen++ system, however, all four lines are defined as second-degree polynomials:

y = f_i(x) = k (x - x_0)^2 + s (x - x_0) + y_i,    i = 1, ..., 4

Fig. 2. Baselines


Fig. 3. Normalizing size

Fig. 4. Normalizing rotation

The parameters k, s, and x_0 are shared among all four curves, whereas each curve has its own vertical translation parameter y_i. Parameter k denotes the curvature and parameter s the rotation of all lines. All parameters are determined by fitting this geometrical model to the pen trajectory. This is accomplished using the Expectation-Maximization (EM) algorithm [2,5]. Before applying the EM step, the parameters are initialized with values derived from a simple linear regression to obtain a good seed for the EM procedure. This technique is an advantage over other methods, where baselines and corpus lines are computed separately. In particular, computing corpus lines is facilitated here by coupling them with the baselines.
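To make the shared-parameter model concrete, the following sketch (Python with NumPy, our choice of language) fits the equivalent linear reparameterization y = a x^2 + b x + c_i by least squares. It assumes each trajectory extremum has already been assigned to one of the four lines; the EM procedure that refines these assignments, as used in the actual system, is omitted.

    import numpy as np

    def fit_shared_quadratics(points, labels, n_lines=4):
        """Least-squares seed for the four writing lines.

        With x_0 folded into the per-line offsets, the model
        y = k (x - x_0)^2 + s (x - x_0) + y_i becomes the linear model
        y = a x^2 + b x + c_i, solved here in one shot.

        points : (N, 2) array of extrema of the trajectory
        labels : (N,)  int array assigning each extremum to a line 0..3
        """
        points = np.asarray(points, dtype=float)
        labels = np.asarray(labels)
        x, y = points[:, 0], points[:, 1]
        # Shared quadratic and linear columns, plus one indicator column
        # per line carrying that line's offset c_i.
        A = np.column_stack([x ** 2, x] +
                            [(labels == i).astype(float) for i in range(n_lines)])
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        a, b, offsets = coeffs[0], coeffs[1], coeffs[2:]
        return a, b, offsets          # line i: y = a x^2 + b x + offsets[i]

Because x_0 can be folded into the offsets y_i, the shared part of the model stays linear in its parameters, which is what makes this one-shot least-squares seed possible.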

The next two preprocessing steps normalize size and rotation by exploiting the information about baselines and corpus lines.

2.2 Normalizing size

The main goal of this preprocessing step is to ensure that the same characters have approximately the same height in every handwritten word. This is accomplished by scaling every word to a given corpus height, where the corpus height is the distance between the baseline and the corpus line. Note that the total height of words may still vary after normalizing size, even for words having the same corpus height. Figure 3 shows the normalization of the word "Clinton."

2.3 Normalizing rotations

Rotations of words, i.e., deviations of the computed baseline from the horizontal writing line, are corrected by means of a Cartesian rotation based on the parameter s, which denotes the rotation of the fitted lines. Figure 4 again shows the normalization of the word "Clinton."

2.4 Interpolating missing points

If the distance between two neighboring points exceeds a certain threshold, we interpolate the trajectory between both points using a Bezier curve [1]. In order to do so, we compute two additional points between the two neighboring points and approximate all four points with a Bezier curve [1]. The position of the additional points depends on how the trajectory enters the first and leaves the second neighbor. This preprocessing step may be necessary in situations where the window-manager software is not able to capture all points of the trajectory or cannot keep up with the speed of writing. It has only a minor impact on the recognition rates reported in this paper. However, we have reduced error rates by more than 15% using this step on some small subsets of data collected on a specific pressure-sensitive hardware platform. Figure 5 presents some examples before and after interpolating missing points.

Fig. 5. Interpolating missing points using Bezier curves
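As an illustration, the sketch below fills a gap with a cubic Bezier curve whose two inner control points follow the directions with which the trajectory leaves the first neighbor and enters the second, in the spirit of the description above; the 1/3 control-point spacing and the sample count n are assumptions, not values from the paper.

    import numpy as np

    def bezier_gap(p_prev, p0, p1, p_next, n=10):
        """Fill the gap between samples p0 and p1 with a cubic Bezier curve
        whose inner control points follow the trajectory's exit direction at
        p0 and entry direction at p1 (estimated from p_prev and p_next)."""
        p_prev, p0, p1, p_next = (np.asarray(p, dtype=float)
                                  for p in (p_prev, p0, p1, p_next))
        d = np.linalg.norm(p1 - p0)
        t_out = (p0 - p_prev) / (np.linalg.norm(p0 - p_prev) + 1e-9)
        t_in = (p_next - p1) / (np.linalg.norm(p_next - p1) + 1e-9)
        c1 = p0 + t_out * d / 3.0     # leave p0 along its exit direction
        c2 = p1 - t_in * d / 3.0      # enter p1 along its entry direction
        t = np.linspace(0.0, 1.0, n)[:, None]
        return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * c1
                + 3 * (1 - t) * t ** 2 * c2 + t ** 3 * p1)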

2.5 Smoothing

To remove jitter from the handwritten text, we replace every point (x(t), y(t)) in the trajectory by the weighted mean of its neighbors:

x'(t) = [x(t-N) + ... + x(t-1) + α x(t) + x(t+1) + ... + x(t+N)] / (2N + α)

and

y'(t) = [y(t-N) + ... + y(t-1) + α y(t) + y(t+1) + ... + y(t+N)] / (2N + α).

The weight α of the center point is based on the angle subtended by the preceding and succeeding curve segments of (x(t), y(t)) and is empirically optimized. This helps to avoid smoothing away sharp edges, which provide important information where there is a sudden change in direction. The smoothed image of the word "Clinton" is depicted in Fig. 6. In our experiments, smoothing has improved our overall recognition rate by about 0.5%.

Fig. 6. Smoothing
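A minimal sketch of this filter, assuming a fixed center weight α rather than the angle-dependent weight the paper describes:

    import numpy as np

    def smooth(traj, alpha=1.0, N=2):
        """Replace every inner point by the weighted mean of its 2N
        neighbors, with the center point weighted by alpha (fixed here;
        the paper derives it from the angle at the point so that sharp
        corners survive the filter)."""
        traj = np.asarray(traj, dtype=float)
        out = traj.copy()
        for t in range(N, len(traj) - N):
            window = traj[t - N:t + N + 1].copy()
            window[N] *= alpha                  # weight the center point
            out[t] = window.sum(axis=0) / (2 * N + alpha)
        return out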

2.6 Normalizing slant

To normalize the different slants of words, we shear every word according to its slant. The slant is determined by a histogram over all angles subtended by the lines connecting two successive points of the trajectory and the horizontal line. The computed angles are weighted with the distance between every pair of successive points. A simple search for the maximum entry in the histogram provides us with the slant of the word [14,21]. Figure 7 shows the normalization of slant for the word "Clinton"; both the histogram containing all computed angles and the normalized word are shown. Normalizing the slant has improved our overall recognition rate by just under 1%. However, we have observed major improvements for left-handed writers.

Fig. 7. Normalizing slant
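The following sketch estimates the slant from a length-weighted angle histogram and shears the word upright; the bin count and the shear formula x' = x - y/tan(slant) are our assumptions about details the paper leaves open.

    import numpy as np

    def deslant(traj, n_bins=60):
        """Estimate the dominant slant from a length-weighted histogram of
        segment angles and shear the word upright."""
        traj = np.asarray(traj, dtype=float)
        seg = np.diff(traj, axis=0)
        angles = np.arctan2(seg[:, 1], seg[:, 0]) % np.pi   # fold into [0, pi)
        lengths = np.hypot(seg[:, 0], seg[:, 1])            # weight = segment length
        hist, edges = np.histogram(angles, bins=n_bins,
                                   range=(0.0, np.pi), weights=lengths)
        k = int(np.argmax(hist))
        slant = 0.5 * (edges[k] + edges[k + 1])             # bin center
        out = traj.copy()
        if abs(np.tan(slant)) > 1e-6:                       # guard horizontal bin
            out[:, 0] -= out[:, 1] / np.tan(slant)          # x' = x - y / tan(slant)
        return out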

2.7 Computing equidistant points (Resampling)

This preprocessing step is implemented in almost every online handwriting recognition system. In general, the points captured during writing are equidistant in time but not in space. Hence, the number of captured points varies depending on the velocity of writing and the hardware used. To normalize the number of points, the sequence of captured points is replaced with a sequence of points having the same spatial distance. In NPen++, this distance is a fraction of the corpus height that has been empirically optimized; the optimal value found in our experiments is h/13, where h denotes the corpus height. Figure 8 illustrates the resampling of points. We have achieved major improvements of more than 5% by applying this preprocessing step.

Fig. 8. Resampling
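A sketch of arc-length resampling with the reported spacing of h/13:

    import numpy as np

    def resample(traj, corpus_height):
        """Resample the trajectory to points equidistant in space, with the
        spacing h/13 reported above (h = corpus height)."""
        traj = np.asarray(traj, dtype=float)
        step = corpus_height / 13.0
        seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
        s = np.concatenate([[0.0], np.cumsum(seg)])   # arc length at each sample
        targets = np.arange(0.0, s[-1], step)         # equidistant arc lengths
        x = np.interp(targets, s, traj[:, 0])
        y = np.interp(targets, s, traj[:, 1])
        return np.column_stack([x, y])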

2.8 Removing delayed strokes

Delayed strokes, e.g., the crossing of a "t" or the dot of an "i", are a well-known problem in online handwriting recognition. These strokes introduce additional temporal variation and complicate online recognition because the writing order of delayed strokes is not fixed and varies between different writers. This stands in contrast to offline recognition, since delayed strokes do not occur in static images. It is one of the reasons why there have been recent attempts to exploit offline data in order to improve online recognition rates [11,14]. Many online recognizers apply a simple technique to cope with delayed strokes: they use heuristics for detecting such strokes and remove them before proceeding with feature computation and recognition. NPen++ handles delayed strokes the same way. A delayed stroke is usually a short sequence written in the upper region of the writing pad, above already written parts of a word, and accompanied by a pen movement to the left. NPen++ uses simple threshold values to characterize these features of delayed strokes. Figure 9 illustrates some examples of delayed strokes and the corresponding trajectories of the pen. Removing delayed strokes has improved our overall recognition rate by about 0.5%.

Fig. 9. Delayed strokes

3 Computing features

Computing features is undoubtedly a very important processing step in every online recognition system. However, neither a standard method for computing features nor a widely accepted feature set currently exists. In the NPen++ system, feature computation takes the normalized sequence of captured coordinates (x(t), y(t)) as input and computes a sequence of features along this trajectory, which is then fed directly into the recognizer. This section presents the features of NPen++, some of which can already be found in the literature and others which have been implemented in NPen++ for the first time.

3.1 Vertical position

The vertical position of a point (x(t), y(t)) is the vertical distance between y(t) and b(x(t)), where b(x(t)) is the y-value of the baseline at time t. The vertical distance is positive if (x(t), y(t)) is above the baseline and negative if it is below. All distances are normalized by the corpus height of the word [14].

3.2 Writing direction

The local writing direction at a point (x(t), y(t)) is described using the cosine and sine of the angle α(t) [8]:

cos α(t) = Δx(t) / Δs(t),
sin α(t) = Δy(t) / Δs(t),

where Δs(t), Δx(t), and Δy(t) are defined as follows:

Δs(t) = sqrt(Δx²(t) + Δy²(t)),
Δx(t) = x(t-1) - x(t+1),
Δy(t) = y(t-1) - y(t+1).

The angles involved in this computation are illustrated in Fig. 10.

Fig. 10. Writing direction and curvature: a) writing direction (angle α); b) curvature (angle β)

3.3 Curvature

The curvature at a point (x(t), y(t)) is represented by the cosine and sine of the angle defined by the following sequence of points [8]:

(x(t-2), y(t-2)), (x(t), y(t)), (x(t+2), y(t+2)).

This is again illustrated in Fig. 10. Strictly speaking, this signal does not represent curvature but an angular difference signal; the curvature proper would be 1/r, where r is the radius of a circle touching and partially fitting the curve. Cosine and sine can be computed from the values of the previous Sect. 3.2:

cos β(t) = cos α(t-1) · cos α(t+1) + sin α(t-1) · sin α(t+1),
sin β(t) = cos α(t-1) · sin α(t+1) - sin α(t-1) · cos α(t+1).
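Both features can be computed in a few lines; the sketch below vectorizes the definitions above and simply clips the border points, whose handling the paper does not specify.

    import numpy as np

    def direction_and_curvature(traj):
        """Writing direction (cos a, sin a) per inner point and the
        angular-difference 'curvature' (cos b, sin b) derived from it."""
        traj = np.asarray(traj, dtype=float)
        dx = traj[:-2, 0] - traj[2:, 0]        # x(t-1) - x(t+1)
        dy = traj[:-2, 1] - traj[2:, 1]        # y(t-1) - y(t+1)
        ds = np.hypot(dx, dy) + 1e-9
        cos_a, sin_a = dx / ds, dy / ds
        # beta(t) = alpha(t+1) - alpha(t-1), via the angle-difference identities
        cos_b = cos_a[:-2] * cos_a[2:] + sin_a[:-2] * sin_a[2:]
        sin_b = cos_a[:-2] * sin_a[2:] - sin_a[:-2] * cos_a[2:]
        return cos_a, sin_a, cos_b, sin_b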

3.4 Pen-up/pen-down

The pen-up/pen-down feature is a binary feature indicating whether or not the pen has contact with the writing pad at time t. Invisible parts of the trajectory, where the pen has no contact with the pad, are linearly interpolated in NPen++, i.e., a pen-up is connected with the next pen-down by a straight line. Note that the pen-up/pen-down information depends on the position-sensing device. Electromagnetic systems are able to return approximate planar positions while the pen is in the air, whereas pressure-sensitive technologies cannot. Since the NPen++ system does not exploit information about pen-up segments, this distinction has no effect on its features.

3.5 “Hat”-feature

This is a binary feature that indicates whether the current position is below a delayed stroke, e.g., below a t-stroke [21]. It is the only information about delayed strokes available to the NPen++ system. Note that NPen++ detects and deletes delayed strokes using a simple heuristic (see Sect. 2).


Fig. 11. Aspect

3.6 Aspect

The aspect of the trajectory in the vicinity of a point (x(t), y(t)) is described by a single value A(t) [21]:

A(t) = (Δy(t) - Δx(t)) / (Δy(t) + Δx(t)).

It characterizes the height-to-width ratio of the bounding box containing the preceding and succeeding points of (x(t), y(t)). Figure 11 illustrates the computation of the aspect. The vicinity of a point (x(t), y(t)) shown in Fig. 11 is also used to define the following three features: curliness, linearity, and slope, which describe the shape of the trajectory in the vicinity of (x(t), y(t)).

3.7 Curliness

Curliness C(t) is a feature that describes the deviation from a straight line in the vicinity of (x(t), y(t)) (Fig. 11). It is based on the ratio of the length of the trajectory and the maximum side of the bounding box:

C(t) = L(t) / max(Δx, Δy) - 2,

where L(t) denotes the length of the trajectory in the vicinity of (x(t), y(t)), i.e., the sum of the lengths of all line segments. Δx and Δy are the width and height of the bounding box containing all points in the vicinity of (x(t), y(t)) (see Fig. 11). According to this definition, the values of curliness are in the range [-1; N-3]. However, values greater than 1 are rare in practice.

3.8 Linearity

The average squared distance between every point in the vicinity of (x(t), y(t)) and the straight line joining the first and last point in this vicinity is called linearity, where d_i denotes the distance of the i-th point to this line:

LN(t) = (1/N) Σ_i d_i².


3.9 Slope

The slope of the straight line joining the first and last point in the vicinity of (x(t), y(t)) is described by the cosine of its angle α (see Fig. 11).
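The four vicinity features can be read off directly from a small window around the point; in this sketch the vicinity size N is an assumption, and small constants guard against degenerate windows.

    import numpy as np

    def vicinity_features(traj, t, N=2):
        """Aspect, curliness, linearity, and slope of the vicinity
        traj[t-N .. t+N], following the definitions above."""
        v = np.asarray(traj[t - N:t + N + 1], dtype=float)
        dx = v[:, 0].max() - v[:, 0].min()         # bounding-box width
        dy = v[:, 1].max() - v[:, 1].min()         # bounding-box height
        aspect = (dy - dx) / (dy + dx + 1e-9)
        length = np.linalg.norm(np.diff(v, axis=0), axis=1).sum()   # L(t)
        curliness = length / (max(dx, dy) + 1e-9) - 2.0
        chord = v[-1] - v[0]                       # first -> last point
        chord = chord / (np.linalg.norm(chord) + 1e-9)
        rel = v - v[0]
        dist = rel[:, 0] * chord[1] - rel[:, 1] * chord[0]   # distance to chord
        linearity = float(np.mean(dist ** 2))
        slope = chord[0]                           # cos of the chord's angle
        return aspect, curliness, linearity, slope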

The next two features are global features that consider a wider temporal context, i.e., not only the vicinity of a point (x(t), y(t)). These features can be seen as an approach to incorporate additional information about the offline image into the feature vector.

3.10 Ascenders/descenders

These two features count the number of points above the corpus line (ascenders) and the number of points below the baseline (descenders) in the word image at a given time instance t. Only points falling into a local region determined by the x-coordinate of the current point (x(t), y(t)), i.e., only points in the word image whose x-coordinate satisfies x(t) - d < x < x(t) + d, are considered part of ascenders or descenders at time t. Additionally, points must have a minimum distance to both lines to be considered part of ascenders or descenders. This ensures that inaccurately computed baselines have only a minor impact on the computation of these features. Both features are computed by simply counting the number of these points, where every point is weighted with its distance to the corpus line (ascenders) or baseline (descenders). Figure 12 illustrates the computation of both features. The main idea of these features is to help identify specific characters, like "t" or "g".

Fig. 12. Ascenders and descenders

3.11 Context maps

A context map is an offline, gray-scale image B = b(i, j) of the vicinity of a point (x(t), y(t)), where b(i, j) is the number of points of the trajectory falling into pixel (i, j). In particular, a context map is a low-resolution image with the pixel in the center containing (x(t), y(t)). The size of the context map depends on the corpus height of the word. In our experiments, we have transformed the computed map into a 3 × 3 map before adding it to the feature vector [15]. Figure 13 shows the computation of context maps on a gray-scale representation computed for the word "Clinton" and the transformation to 3 × 3 maps. Every pixel of a context map is a feature and is thus added to the set of features described above. The main idea of context maps is to add features that consider a wider context. Context maps allow us to capture information about parts of the trajectory that are in the spatial vicinity of a point (x(t), y(t)) but have a long time distance to this point. Context maps are an example of combining offline and online information in handwriting recognition. This is an interesting research topic, which has been investigated by several researchers in the recent past (see [3,6,11,13–15] for more details).

Fig. 13. Context maps: a) generating a sequence of context maps (9 × 9); b) scaling of context maps (to 3 × 3)
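A sketch of the computation: a 9 × 9 counting bitmap centered on the current point is averaged down to 3 × 3, as in Fig. 13. Tying the window size to the corpus height is left to the caller, and the parameter names are ours.

    import numpy as np

    def context_map(traj, t, half_width, high_res=9, low_res=3):
        """Offline context map of the vicinity of point t: count trajectory
        points per pixel of a high_res x high_res bitmap centered on
        (x(t), y(t)), then average down to low_res x low_res."""
        traj = np.asarray(traj, dtype=float)
        cx, cy = traj[t]
        bmp = np.zeros((high_res, high_res))
        cell = 2.0 * half_width / high_res
        for x, y in traj:
            i = int(np.floor((y - cy + half_width) / cell))
            j = int(np.floor((x - cx + half_width) / cell))
            if 0 <= i < high_res and 0 <= j < high_res:
                bmp[i, j] += 1.0                   # b(i, j): points per pixel
        k = high_res // low_res                    # 9x9 -> 3x3 block averaging
        return bmp.reshape(low_res, k, low_res, k).mean(axis=(1, 3))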

As to the relevance of local and global features, we have carried out the following experiments for a 20,000-word dictionary: we achieved recognition rates of about 90% when taking only the following small subset of local features for training: vertical position, writing direction, curvature, and pen-up/pen-down information. The overall recognition rate of a system trained only with local features is 2% higher than that of a system trained with global features only. The performance of global features without any local features involved in the training, however, is close to the best overall performance. Nevertheless, optimal performance has been achieved by combining both local and global features into a single feature vector.

4 Multi-state time delay neural networks (MS-TDNN)

The core recognition engine of NPen++ is an MS-TDNN. An MS-TDNN is a connectionist recognizer that integrates recognition and segmentation into a single network architecture. This architecture was originally proposed for continuous-speech recognition tasks [10,16,23]. We adopted this technique, including the training procedure described in the next section, and applied it to the handwriting recognition problem. The MS-TDNN is an extension of the time delay neural network (TDNN) [23], which has been applied successfully to online single-character recognition tasks. The main feature of a TDNN is its time-shift invariant architecture, i.e., a TDNN is able to recognize a pattern independently of its position in time. The MS-TDNN combines the high-accuracy character recognition capabilities of a TDNN with a nonlinear time alignment procedure to find an optimal alignment between strokes and characters in handwritten words [9,10,14]. Thus, the MS-TDNN is capable of recognizing words by integrating a dynamic time-warping algorithm (DTW) into the TDNN architecture. In MS-TDNNs, words are represented as a sequence of characters, with each character being modeled by one or more states. In the experiments reported in this paper, each character is modeled with three states representing the first, middle, and last part of the character. Hence, the MS-TDNN can be regarded as a hybrid recognizer that combines features of both neural networks and hidden Markov models (HMMs). Figure 14 shows the basic architecture of the MS-TDNN. The first three layers, which are respectively called input layer, hidden layer, and state layer, constitute a TDNN. The input layer consists of the set of feature vectors computed for every time instance t. A sliding window moving from left to right integrates several neighboring input vectors into activation vectors in the hidden layer, which in turn are integrated by a sliding window into activation vectors in the state layer. The width of the sliding windows is a parameter of the MS-TDNN architecture. Every element ("neuron") in the state layer represents a state of a character in the alphabet (see Fig. 14). In single-character recognition, the score for a character is computed by finding an optimal alignment path through its states and summing the activations along this path. In word recognition, the score of a word is computed by finding an optimal alignment path through the states of the characters composing the word. The final score is again derived by summing all activations along this path. Figures 15 and 16 illustrate the computation of the optimal alignment path through the word "Clinton", where dark squares indicate highly activated states.

Fig. 14. The architecture of a multi-state time delay neural network (input layer, hidden layer, state layer, character models, output layer)

Fig. 15. Optimal alignment path through the word "Clinton" (1)

Fig. 16. Optimal alignment path through the word "Clinton" (2)
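The alignment underlying these word scores is a standard dynamic-programming recursion; the sketch below scores one word against frame-wise state activations, allowing each state to be repeated or advanced by one per frame. It is a simplification: the real system shares this computation across the whole dictionary via the search tree of Sect. 6.

    import numpy as np

    def word_score(act, state_seq):
        """Sum of activations along the best alignment path of frames
        0..T-1 to the word's states (stay in a state or advance by one).
        act : (T, n_states) state-layer activations;
        state_seq : state indices of the word; requires T >= len(state_seq)."""
        T, S = act.shape[0], len(state_seq)
        score = np.full((T, S), -np.inf)
        score[0, 0] = act[0, state_seq[0]]
        for t in range(1, T):
            for s in range(S):
                stay = score[t - 1, s]
                step = score[t - 1, s - 1] if s > 0 else -np.inf
                score[t, s] = max(stay, step) + act[t, state_seq[s]]
        return score[T - 1, S - 1]   # the path must end in the word's last state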

5 Training

The MS-TDNN is trained in three steps with back-propagation. Figure 17 shows a schematic diagram of the training procedure. The first and second training steps operate in a forced-alignment mode, during which the MS-TDNN is trained with hand-segmented training data, i.e., the character boundaries are known for these words. In the first step (training at the state level), it is assumed that the Viterbi path remains for the same duration in every state of a character belonging to a word of the training set. The states along this Viterbi path constitute the training data for the back-propagation procedure starting at the state layer of the MS-TDNN. Figure 18 shows this first step of training. After some iterations, the assumption that the Viterbi path remains for the same duration in every state is abandoned in favor of computing the actual Viterbi path through a character model, which marks the beginning of Step 2 (training at the character level). This is shown in Fig. 19. Then, after some iterations, the third step commences by replacing the forced alignment of Step 1 and Step 2 with the free alignment provided by the Viterbi algorithm (training at the word level). This has the advantage that training can now be performed on unsegmented data. Thus, only a small part of the training data needs to be labeled manually with character boundaries, as required by the first and second steps. When the network has successfully learned character boundaries on the small segmented training set, the forced alignment is replaced by a free alignment and training can be performed on large databases containing unsegmented training data.

Fig. 17. Training of a multi-state time delay neural network (training at the state, character, and word level, on segmented and unsegmented words)

Fig. 18. Forced alignment with equal durations for each state

Fig. 19. Forced alignment together with the Viterbi algorithm

This mechanism also works for the next level in the hierarchy: whole sentences. We have trained on unsegmented sentences with an MS-TDNN that was already pre-trained on single words. The Viterbi algorithm finds the best segmentation of all unsegmented sentences based on the information the neural net has already learned at the word level. Our first results with this technique at the sentence level are presented in Sect. 7.

In our experiments, we used the cross entropy E_CE for propagating the error back through the MS-TDNN [14]:

E_CE = - Σ_j [ d_j log(y_j) + (1 - d_j) log(1 - y_j) ],

where y_j is the output of unit j and d_j is the teaching input for unit j.
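For reference, a direct transcription of this error function, with outputs clipped away from 0 and 1 for numerical safety (the clipping is our addition):

    import numpy as np

    def cross_entropy(y, d, eps=1e-9):
        """E_CE = -sum_j [ d_j log y_j + (1 - d_j) log(1 - y_j) ]."""
        y = np.clip(np.asarray(y, dtype=float), eps, 1.0 - eps)
        d = np.asarray(d, dtype=float)
        return float(-np.sum(d * np.log(y) + (1.0 - d) * np.log(1.0 - y)))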

6 Search technique

The NPen++ online handwriting recognition system is based on a dictionary determining the set of recognizable words. In general, the size of the dictionary influences both the recognition performance and the response time of a dictionary-based recognizer, and thus has an effect on the degree of user acceptance. NPen++ supplements the MS-TDNN approach with a tree-based search engine [17]. It combines a tree representation of the dictionary with efficient pruning techniques to reduce the search space without losing much recognition performance compared to a flat, exhaustive search through all words in the dictionary. A search tree is built for every character, representing all words starting with this specific character. The nodes in every tree are HMMs representing individual characters. Every tree contains distinguished end nodes. A path from the root of a tree to a distinguished end node is a sequence of HMMs describing an entry of the dictionary. Figure 20 shows a dictionary represented as a tree.

Fig. 20. Tree representation of a dictionary (root node, internal nodes, end nodes)

In order to achieve real-time performance for very large dictionaries, NPen++ does not attempt to apply the exact Viterbi algorithm. Instead, NPen++ introduces the concept of active and inactive HMMs and defines a set of pruning rules which specify when to turn on an inactive HMM and when to turn off an active one. Every HMM node in the tree can be marked as either active or inactive. When the search is initialized, only the roots of the trees are turned on, whereas all other nodes are set to inactive. There are two lists whose elements point to active nodes: the first list points to the nodes active in the current frame, and the second list gathers pointers to those nodes expected to be active in the next frame. Based on these two lists, the search algorithm goes through the following three steps for each frame, i.e., for each feature vector:

– Evaluation: For every active hidden Markov model, a Viterbi step is computed to find the accumulated scores s_ij for the next frame, where s_ij is the score at state j in node i. The best state score s_i within every node and the best score s = max_i s_i over all evaluated models are computed.

– Pruning: Turn off all currently active nodes in the search tree for which the following pruning criterion is fulfilled:

s_i < s - beam.

That means that all nodes whose best accumulated score falls more than a certain threshold, called the beam, below the best score s will become inactive in the next frame.

– Expansion: For every node active in the current frame, test whether a transition from the last state of model i to the first state of any child HMM j leads to a higher accumulated score s_j0 in the first state of that model. If so, and the new score is above the pruning threshold, HMM j is marked active for the next frame. Go to step 1.

The tree search with pruning is about 15 times faster than a flat search and allows us to run the recognizer in real time with large dictionary sizes. Moreover, the run time is virtually independent of the dictionary size. In practical applications, we can adjust the beam size to find a good compromise between recognition accuracy and speed, as will be shown in the next section.
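A compact sketch of one evaluation/pruning/expansion cycle over such a tree, with each character HMM simplified to a single score per node and a dictionary of per-character log-scores standing in for the state-layer activations; all names here are ours.

    import numpy as np

    class Node:
        """One character HMM in the dictionary tree, simplified to a single
        accumulated score (-inf means inactive)."""
        def __init__(self, char, children=(), is_end=False):
            self.char, self.children, self.is_end = char, list(children), is_end
            self.score = -np.inf

    def beam_step(active, frame_logp, beam):
        """One evaluation / pruning / expansion cycle; frame_logp maps a
        character to its log-score for the current frame."""
        for node in active:                        # evaluation: Viterbi-like step
            node.score += frame_logp[node.char]
        best = max(node.score for node in active)
        survivors = []
        for node in active:                        # pruning: s_i < s - beam
            if node.score >= best - beam:
                survivors.append(node)
            else:
                node.score = -np.inf               # turn the node off
        for node in list(survivors):               # expansion into children
            for child in node.children:
                if node.score > child.score:
                    if child.score == -np.inf:
                        survivors.append(child)    # child becomes active
                    child.score = node.score       # enter the child's first state
        return survivors                           # active nodes for the next frame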

Figure 21 shows how we can easily extend the tree structure in Fig. 20 in order to recognize whole sentences. We insert an additional node that represents the white space between two words within a sentence. Every end node of the tree is connected with this node, which in turn is connected with every root node. Thus, the traversal of this node marks the beginning of a new word. Realizing this feedback by connecting end nodes directly with root nodes did not perform well in our experiments, so we decided to represent white spaces explicitly using this node. Recognizing sentences requires only minor modifications of the search algorithm and the training of the MS-TDNN. In particular, a new state representing the new node in the search tree is added to the state layer of the MS-TDNN, i.e., the representation of spaces consists of a single state in the state layer. In addition, we have added a further feature to the feature set described in Sect. 3 that helps to train this state. We compute this feature using a simple heuristic that detects white spaces between words: a space between words is taken to be accompanied by a pen-up and a movement of the pen to the right. For a sequence of pen-downs, the new feature is always 0. However, for the first pen-down encountered after a pen-up, it grows with the distance the pen has moved to the right.

Fig. 21. Tree representation for recognizing sentences (end nodes feed a SPACE node, which connects back to the root nodes)

The next section also presents our first results forrecognizing sentences.

7 Evaluation

This section describes the practical experiments we carried out and presents the recognition rates and recognition times of our system [14]. Section 7.1 describes the chosen design parameters of the MS-TDNN and the data we used for testing. Section 7.4 provides the recognition rates achieved.

7.1 Network architecture

The number of input units ("input neurons") is identical to the number of features described in Sect. 3. Hence, the input layer of the NPen++ recognizer contains 22 input units, one for each feature. The number of units in the state layer is determined by the number of characters in the alphabet and the number of states in the statistical model of each character. Since we want to recognize the Latin characters (lower and upper case) and model each character using three states, we have 26 × 2 × 3 = 156 states in the state layer. The number of hidden units was empirically set to 120. The widths of the sliding windows in the input layer and the hidden layer are set to 7. While the time shift of the window in the hidden layer is 1, the window in the input layer is moved two frames to the right at every cycle. This ensures that the hidden layer is about half as long as the input layer, which reduces computational costs.

7.2 Dictionaries

In our experiments, dictionaries were compiled by extracting the most likely words from the Wall Street Journal Continuous Speech Recognition Corpus of the Linguistic Data Consortium. For instance, a 20,000-word dictionary contains the 20,000 most likely words from the Wall Street Journal corpus and subsumes any smaller dictionary derived from this corpus in the same way.

7.3 Datasets

We used three different databases for training and testing: the CMU database collected at Carnegie Mellon University, the UKA database compiled at the University of Karlsruhe, and the MIT database from the Massachusetts Institute of Technology [14]. While the CMU and UKA databases contain printed and cursive handwriting data, the MIT database contains only printed data, where mixed styles are considered cursive. In addition, the CMU database contains whole sentences.

Table 1. Data sets

                    Printed               Cursive
  Data set          # Writers  # Words    # Writers  # Words
  UKA - training        18       2313         70       3230
  UKA - test             5        314         21       1085
  CMU - training        39       2020        126       9308
  CMU - test            15        347         35        900
  MIT - training       119       6307          0          0
  MIT - test            40       2120          0          0

Table 2. Sentences of the CMU database

                    # Writers  # Sentences  # Words
  CMU - training       163        1757       13221
  CMU - test            50         161        2322

The major part of these data was written by students at their respective institutes.

The UKA and CMU databases were collected using a Wacom graphic tablet (SD-421E) with a sampling rate of 200 points/s. While part of this data was written directly on the tablet, a substantial part was written with an ink pen on paper lying on the tablet. Collecting the former data requires writers to look at the monitor while writing on the tablet to see the graphically emulated inking. Thus, the latter data can be considered more natural. Instructions to the writers were reduced to a minimum to guarantee the most natural way of writing. This also ensures that the numbers of collected printed and cursive samples reflect their true proportions in practice. The MIT database was compiled using a Wacom tablet (HD-648A) with an integrated LCD screen and the same sampling rate as in the UKA and CMU databases. Hence, this data can also be considered very natural.

In brief, the collected data covers a wide range of national peculiarities, collected with several position-sensing devices. Almost 500 writers wrote just under 30,000 words. Table 1 gives an overview of all three databases, the data they contain, and the division into training and test sets. Table 2 shows the number of sentences contained in the CMU database and their division into training and test set. A detailed description of these data sets is given in [14].

Since we need some hand-segmented data to start training (see Sect. 5), we have explicitly segmented 4,000 cursive words from the UKA database and 2,000 cursive words from the CMU database by visual inspection.

7.4 Recognition results

Figure 22 shows the recognition rates of NPen++ on different data sets using several dictionaries. NPen++ achieves recognition rates of 96% for a 5,000-word dictionary when it is trained with printed as well as cursive data from all three databases. The error rate of this completely trained system approximately doubles when the ten-times larger 50,000-word dictionary is used instead; the recognition rate is 91.2% for a dictionary of this size.

Fig. 22. Recognition rates achieved by NPen++ (% recognition rate per test set and writing style: UKA, CMU, MIT printed; UKA, CMU cursive; all combined; for 5K, 10K, 20K, and 50K dictionaries)

Figure 22 shows the recognition rates when NPen++ is trained with the training sets from all databases and tested separately for each database and each writing style using the test subsets given in Table 1. As one would expect, the recognition rates for printed data are higher than those for cursive data.

In our experiments, we have observed that the error rate of the completely trained system increases by more than 1% on the 20,000-word dictionary when context maps are excluded from the feature set, compared to the rate shown in Fig. 22.

In order to evaluate the performance of NPen++ on sentences, we have computed the word accuracy for every test sentence. Word accuracy (WA) is based on the edit distance E, i.e., the smallest number of insertions, deletions, and replacements necessary to transform one word sequence into another:

WA = (N - E) / N,

where N is the number of words in the test sentence. The average word accuracy measured on the 20,000-word dictionary is more than 81% for recognizing sentences without any language model. In our first experiments with language models, which show great promise, we have been able to increase this word accuracy to more than 86% using delayed bigrams as they are used for continuous speech recognition [18]. Note that NPen++ is among the first online handwriting recognition systems that allow recognition of whole sentences with a neural network architecture and are not exclusively based on hidden Markov models [19].
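For concreteness, a sketch computing WA from the word-level edit distance (dynamic programming over insertions, deletions, and replacements):

    import numpy as np

    def word_accuracy(ref, hyp):
        """WA = (N - E) / N, with E the word-level edit distance."""
        n, m = len(ref), len(hyp)
        D = np.zeros((n + 1, m + 1), dtype=int)
        D[:, 0] = np.arange(n + 1)
        D[0, :] = np.arange(m + 1)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = D[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
                D[i, j] = min(sub, D[i - 1, j] + 1, D[i, j - 1] + 1)
        return (n - D[n, m]) / n

    # Example: one substitution in three words gives WA = 2/3.
    # word_accuracy("the quick fox".split(), "the quack fox".split())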

Figures 23 and 24 illustrate the tradeoff between recognition time and recognition accuracy measured on the 20,000-word dictionary for the completely trained word recognizer. Figure 23 shows the recognition times depending on the width of the beam b (see Sect. 6). The rightmost column shows the recognition time for the flat search with no pruning. The column to its left shows the recognition time for the tree search with an indefinite beam width, i.e., no pruning. Since the tree search requires some bookkeeping during its computation, its recognition time is slightly higher than the time for the flat search. With shrinking beam size, however, recognition times rapidly decrease. All recognition times were measured on a PC with a 200 MHz Pentium processor and 64 MB SDRAM.

Fig. 23. Recognition times (seconds/word over beam widths 0, 25, 50, 75, 100, 150, 200, max, and flat search)

Fig. 24. Recognition rates (% recognition rate over beam widths 0, 25, 50, 75, 100, 150, 200, max, and flat search)

Figure 24 shows the recognition rates depending on the width of the beam. It can be seen that the recognition rates remain almost constant for beam widths equal to or greater than 50. Hence, we can considerably reduce the recognition time without losing recognition performance by simply reducing the beam width. A beam width of b = 75 or b = 50 would be an appropriate choice for interactive applications of NPen++.

8 Summary and conclusion

In this paper, we have presented the online handwriting recognition system NPen++ based on an MS-TDNN, a hybrid architecture combining features of neural networks and hidden Markov models. Two main features of an MS-TDNN are its time-shift invariant architecture and its nonlinear time alignment procedure.

Preprocessing in NPen++ is guided by the application of well-known techniques in handwriting recognition. In particular, the following preprocessing steps are performed in NPen++: computing baselines, normalizing size, correcting rotations, interpolating missing points, smoothing, normalizing slant, computing equidistant points (resampling), and removing delayed strokes.

The set of features computed in NPen++ integrates local as well as global information about the trajectory of the pen. While local features describe the shape of the trajectory at a certain time instance, global features deal with spatial information covering a wider time context. NPen++ considers both types of information important. The following local and global features are computed in NPen++: vertical position, writing direction, curvature, pen-up/pen-down, "hat"-feature, aspect, curliness, linearity, slope, ascenders/descenders, and context maps.

Context maps are a new approach combining global offline features with local online information. They provide additional spatial information about parts of the trajectory that can have a long time distance to each other. Context maps increase overall recognition rates by about 1%. However, one disadvantage of context maps is that they increase the size of the feature vectors by more than 50%, which in turn increases the complexity of the neural network. Nevertheless, we think that context maps are an important approach for combining online and offline recognition that goes beyond combining two separate online and offline recognizers.

NPen++ uses an efficient tree search and pruning technique to ensure real-time performance for very large dictionary sizes. The recognition rates reported in this paper range from 96% for a 5,000-word dictionary to 91.2% for a 50,000-word dictionary. Hence, it is justified to say that NPen++ is a successful example of transferring the MS-TDNN from the speech-recognition domain to the handwriting-recognition task. Moreover, the MS-TDNN is a powerful architecture that allows training and recognition across the whole hierarchy of handwritten data, ranging from single characters to whole sentences.

We suppose that transferring the main techniques of NPen++ to other languages, such as Russian, Greek, or Arabic, requires only slight modifications within the NPen++ system. For Asian languages, however, some major changes will be necessary due to the different structure of these languages. For instance, the baselines computed by NPen++ will not work for languages such as Chinese or Japanese. Since an Asian character resembles a Latin word in terms of complexity, the optimal number of states representing a character will presumably differ from the number of states in Western scripts. In practical experiments, we have achieved the optimal performance of the completely trained NPen++ system with three states. However, we have observed slight improvements with four or even five states for some smaller subsets of our three databases.

A possible improvement that we have not utilized in NPen++ is context-dependent states, which are already used in speech recognition [12]. Representing different writing styles, such as printed and cursive styles, with separate models could also be a potential improvement of NPen++ and deserves future research.

In principle, the basic architecture of NPen++ allows run-on recognition, i.e., recognizing parts of the trajectory during writing. However, run-on recognition may require modifications within some preprocessing steps. In particular, baselines computed at the beginning of the trajectory will be unreliable due to the insufficient number of local maxima and minima.

The NPen++ system presented here can easily be extended to enable recognition of complete sentences. In order to do so, we have embedded the tree search in a feedback loop joining the output of the tree structure with the root nodes via an additional state representing spaces between words. Our first experiments with sentence recognition have been very encouraging. Nevertheless, these experiments are still at an early stage and there is much room for improvement. Most importantly, the investigation of more elaborate language models, such as trigrams, should be the next major step.

References

1. S. Abramowski, H. Müller: Geometrisches Modellieren (in German), vol. 75 of Reihe Informatik. BI, Mannheim Wien Zürich, 1991

2. Y. Bengio, Y. LeCun: Word Normalization for On-Line Handwritten Word Recognition. In: Proc. Int. Conf. on Pattern Recognition, pp. 409–413, 1994

3. G. Boccignone, A. Chianese, L.P. Cordella, A. Marcelli: Recovering Dynamic Information from Static Handwriting. Pattern Recognition, 26(3): 409–418, 1993

4. T. Caesar, J.M. Gloger, E. Mandler: Preprocessing and Feature Extraction for a Handwriting Recognition System. In: 2nd IAPR Int. Conf. on Document Analysis and Recognition, pp. 408–411, Tsukuba, Japan, 1993

5. A.P. Dempster, N.M. Laird, D.B. Rubin: Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc., 39: 1–38, 1977

6. D.S. Doermann, A. Rosenfeld: Recovery of Temporal Information from Static Images of Handwriting. Int. J. Comput. Vision, 15: 143–164, 1995

7. J.G.A. Dolfing, R. Haeb-Umbach: Signal Representations for Hidden Markov Model Based On-Line Handwriting Recognition. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Munich, 1997

8. I. Guyon, P. Albrecht, Y. Le Cun, J. Denker, W. Hubbard: Design of a Neural Network Character Recognizer for a Touch Terminal. Pattern Recognition, 24(2): 105–119, 1991

9. H. Hild: Buchstabiererkennung mit neuronalen Netzen in Auskunftssystemen (in German). PhD thesis, University of Karlsruhe, 1997. Shaker, Germany

10. H. Hild, A. Waibel: Speaker-Independent Connected Letter Recognition with a Multi-State Time Delay Neural Network. In: 3rd European Conf. on Speech, Communication and Technology (EUROSPEECH), vol. 2, pp. 1481–1484, Berlin, 1993

11. S. Jaeger: Recovering Dynamic Information from Static, Handwritten Word Images. PhD thesis, University of Freiburg, 1998. Foelbach, Germany

12. A. Kosmala, J. Rottland, G. Rigoll: An Investigation of the Use of Trigraphs for Large Vocabulary Cursive Handwriting Recognition. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Munich, 1997

13. P.M. Lallican, C. Viard-Gaudin: A Kalman Approach for Stroke Order Recovering from Off-line Handwriting. In: 4th Int. Conf. on Document Analysis and Recognition (ICDAR), pp. 519–522, 1997

14. S. Manke: On-line Erkennung kursiver Handschrift bei grossen Vokabularen (in German). PhD thesis, University of Karlsruhe, 1998. Shaker, Germany

15. S. Manke, M. Finke, A. Waibel: Combining Bitmaps with Dynamic Writing Information for On-Line Handwriting Recognition. In: Proc. 12th Int. Conf. on Pattern Recognition, pp. 596–598, 1994

16. S. Manke, M. Finke, A. Waibel: NPen++: A Writer Independent, Large Vocabulary On-Line Cursive Handwriting Recognition System. In: Proc. Int. Conf. on Document Analysis and Recognition, Montreal, Canada, 1995

17. S. Manke, M. Finke, A. Waibel: A Fast Search Technique for Large Vocabulary On-Line Handwriting Recognition. In: Int. Workshop on Frontiers in Handwriting Recognition (IWFHR), Colchester, 1996

18. S. Ortmanns, H. Ney, X. Aubert: A Word Graph Algorithm for Large Vocabulary Continuous Speech Recognition. Comput. Speech Lang., 11: 43–72, 1997

19. E.H. Ratzlaff, K.S. Nathan, H. Maruyama: Search Issues in the IBM Large Vocabulary Unconstrained Handwriting Recognizer. In: Proc. Int. Workshop on Frontiers in Handwriting Recognition, Colchester, UK, 1996

20. M. Schenkel, I. Guyon, D. Henderson: On-Line Cursive Script Recognition Using Time Delay Neural Networks and Hidden Markov Models. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Adelaide, 1994

21. M.E. Schenkel: Handwriting Recognition Using Neural Networks and Hidden Markov Models, vol. 45 of Series in Microelectronics. Hartung-Gorre, Konstanz, 1995

22. G. Seni: Large Vocabulary Recognition of On-Line Handwritten Cursive Words. PhD thesis, Department of Computer Science, State University of New York at Buffalo, N.Y., USA, 1995

23. A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K. Lang: Phoneme Recognition Using Time-Delay Neural Networks. IEEE Trans. on Acoustics, Speech, and Signal Processing, pp. 328–339, 1989

Stefan Jaeger was born in Trier, Germany, on June 23, 1967. He graduated from the University of Kaiserslautern, Germany, in Computer Science in 1994 and received his Ph.D. degree in Computer Science from Albert-Ludwigs University, Freiburg, Germany, in 1998. From 1994 to 1998 he worked as a Ph.D. student at the Daimler-Benz Research Center Ulm, Germany, where he was engaged in off-line cursive handwriting recognition for postal mail sorting. In 1998 he joined the Interactive Systems Laboratories located at Carnegie Mellon University, Pittsburgh, USA, and at the University of Karlsruhe, Germany, where he was a member of the Computer Science Research Staff and responsible for on-line handwriting recognition and pen-computing. He is currently working as an Invited Research Scientist at the Department of Computer, Information, and Communication Sciences at Tokyo University of A&T, Japan. His Ph.D. thesis addressed the problem of recovering dynamic information from static, handwritten word images, and was awarded the Dissertation Prize of the German Research Centers for Artificial Intelligence in 1999. His current research interests include handwriting recognition, pen-based interfaces, architectures for cognition and perception, and machine learning.

Stefan Manke was born in Germany in 1966. He received the diploma and Ph.D. degrees in Computer Science in 1991 and 1998 from the University of Karlsruhe, Germany. His Ph.D. thesis laid the foundations of the NPen++ system and dealt with recognizing handwritten cursive script for large vocabularies. Since 1998 he has been with the TLC Company, the systems house of the German Federal Railway, where he is currently Systems Engineer for Inter- and Intranet Applications.

Juergen Reichert received his diploma in computer science from the University of Karlsruhe, Germany, in 1998. He is currently working as a Ph.D. candidate at the Interactive Systems Laboratories located at Carnegie Mellon University in Pittsburgh and at the University of Karlsruhe in Germany. His research interests include handwriting recognition, speech recognition, machine translation, multimodal user interfaces, and artificial intelligence.

Alex Waibel is a Professor of Computer Science at Carnegie Mellon University, Pittsburgh, and at the University of Karlsruhe (Germany). He directs the Interactive Systems Laboratories at both universities with research emphasis on speech recognition, handwriting recognition, language processing, speech translation, machine learning, and multimodal and multimedia interfaces. At Carnegie Mellon, he also serves as Associate Director of the Language Technology Institute and as Director of the Language Technology PhD program. He was also one of the founding members of CMU's Human Computer Interaction Institute (HCII) and continues on its steering committee. Dr. Waibel was one of the founders of C-STAR, the international consortium for speech translation research, and currently serves as its chairman. He codirects Verbmobil, the German national speech translation initiative. His work on Time Delay Neural Networks was awarded the IEEE best paper award in 1990, and his work on speech translation systems the Alcatel SEL research prize for technical communication in 1994. Dr. Waibel received his B.S. in Electrical Engineering from the Massachusetts Institute of Technology in 1979, and his M.S. and Ph.D. degrees in Computer Science from Carnegie Mellon University in 1980 and 1986.