
IJDAR (2009) 12:153–164
DOI 10.1007/s10032-009-0095-7

ORIGINAL PAPER

Using topic models for OCR correction

Faisal Farooq · Anurag Bhardwaj · Venu Govindaraju

Received: 19 December 2008 / Revised: 24 August 2009 / Accepted: 26 August 2009 / Published online: 25 September 2009
© Springer-Verlag 2009

Abstract Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents, which typically use a restricted vocabulary (lexicon). But in the case of unconstrained handwritten documents, state-of-the-art word recognition accuracy is still below acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing or OCR correction technique to the word recognition output. In this paper, we present two different methods for this purpose. First, we describe a lexicon reduction-based method by topic categorization of handwritten documents, which is used to generate smaller topic-specific lexicons for improving the recognition accuracy. Second, we describe a method which uses topic-specific language models and a maximum-entropy-based topic categorization model to refine the recognition output. We present the relative merits of each of these methods and report results on the publicly available IAM database.

Keywords OCR correction · Topic models · Lexicon reduction · Language models · Document categorization · Handwritten documents · Unconstrained handwriting

F. Farooq (B)
Image and Knowledge Management, Siemens Medical Solutions, Malvern, PA, USA
e-mail: [email protected]

A. Bhardwaj · V. Govindaraju
Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY, USA
e-mail: [email protected]

V. Govindaraju
e-mail: [email protected]

1 Introduction

Handwritten document analysis and recognition continues to remain a challenging task. Diversity in writing styles, inconsistent spacing between words and lines, and uncertainty of the number of lines in a page as well as the number of words in a line all contribute towards the difficulty of this problem [1,2]. However, the single most important contributor is the subtask of handwritten word recognition (Table 1), which relies heavily upon a lexicon. Successful applications involving handwriting recognizers (e.g., postal address interpretation [3], bank check reading [4], and form reading [5]) owe their success primarily to the availability of domain knowledge which translates to constraints on the size of the lexicon. When the lexicon size is fairly large (≈10,000 words), state-of-the-art recognizers perform at a poor accuracy level of ≈40% [6]. Accuracy usually improves as the lexicon size decreases. Recent research has thus focused on techniques of lexicon reduction even when a context is not available.

We have developed a methodology to achieve lexicon reduction by first estimating the topic of the document from a predefined set. The output of the recognizer is cast as the features for the topic classes, and the full lexicon is viewed as a probabilistic distribution over the set of topics. We use statistical topic categorization techniques such as Maximum Entropy to automatically infer the topic of a document from the error-laden text returned by the recognizer during training. A desirable property of our approach is that the training procedure does not consider the internal workings of the recognizer: rather, it treats the recognizer as a black box which returns top-n choices at each word position. The idea is to run the (same or different) recognizer a second time with a dynamically generated lexicon where all the words are present but also carry an additional weight corresponding to the likelihood of encountering the word under a particular topic. Thus our modeling is sandwiched between the two runs of the recognizer, and it can therefore be considered a paradigm for OCR correction of the output of the first run as well as a paradigm for lexicon reduction (weighting) before the second run. This is an innovative departure from lexicon reduction methods which actually prune the lexicon and suffer from the problem of out-of-vocabulary words at subsequent stages.

Table 1 Performance of line separation, word segmentation, and word recognition modules

Writing style                      Discrete   Cursive    Mixed    Total
Line separation    # of images           6         5        9       20
                   # of lines          118       101      210      429
                   # of lines          114        97      198      409
                   separated         96.6%     96.0%    94.3%    95.3%
Word segmentation  # of words          641       692    1,427    2,760
                   # of words          597       631    1,379    2,607
                   segmented         93.1%     91.2%    96.6%    94.5%
Word recognition   top 1               432       268      750    1,450
                                     72.4%     42.5%    54.4%    55.6%
                   top 10              541       459    1,081    2,081
                                     90.6%     72.7%    78.4%    79.8%

Figure 1 illustrates the results obtained by the same document analysis techniques on a page from Newton's notes. Word recognition for this example would involve the entire English dictionary. Perhaps, if the context were known, a limited but still large dictionary could be used. For example, if the page is taken from Newton's notes on Physics, the glossary of terms from a Physics textbook can be used. We propose to use this context knowledge to improve recognition accuracies.

Prior work on topic categorization is not directly applicable because of the extremely noisy output generated by handwriting recognizers when presented with large raw lexicons. We have adapted the commonly used Naive Bayes and Maximum Entropy models to be able to pipe the top-n choices as features. The intuition underlying our approach is that similar word images tend to generate a similarly ranked list of lexicon entries, which can be used as the correction model for the recognizer. We have tested our approach on the publicly available IAM handwritten document images dataset with 13 categories. Results are encouraging, with a relative improvement of 25% observed with the correction model proposed in this paper.

Following is the outline of the rest of the paper. In Sect. 2, we provide a brief survey of techniques for OCR correction. Our proposed method for lexicon reduction is presented in Sect. 3. Section 4 describes the other variant of our method, viz. the topic-specific language model. In Sect. 5, we discuss the models used for topic categorization, which is the underlying foundation for both systems. We also describe the features extracted from the handwritten documents for training the categorization models. In Sect. 6 we describe the dataset that is used for experiments, followed by experimental results. We finally conclude in Sect. 7.

2 Background

This section reviews some significant related work in this area. We organize the related work in two sub-sections. Section 2.1 explains some post-processing methods which have previously been applied for OCR correction. Since our methodology involves reduction of the effective vocabularies, in Sect. 2.2 we describe some lexicon reduction-based strategies present in the literature, which have been employed for eliminating spurious choices from the word recognition outputs.

Fig. 1 a Page of Newton's handwritten manuscript, b line and word segmentation for the first ten lines of text, c top ten choices using a random lexicon of size 1000

2.1 Post-processing strategies

A general survey of the research in OCR post-processing can be found in [7]. Perez-Cortes et al. [8] describe a system which uses a stochastic finite-state machine that accepts the smallest k-testable language consistent with a representative language sample. The set of strings accepted by such an FSM is equivalent to a k-gram language model. Depending on the value of k, the model may behave deterministically or non-deterministically. They report reducing the error rate from 33% to 2% on OCR output of handwritten Spanish names from forms. Pal et al. [9] describe a method for OCR error correction of an inflectional Indian language, Bangla. Their technique is based on morphological parsing, where grammatical agreement between all candidate root–suffix pairs of each input string is tested. This allows the detection of the root/suffix part in which the error has occurred. The correction is made by means of a fast dictionary access technique. They report correcting 84% of the words with a single character error. Taghva et al. [10] propose an interactive spelling correction system specifically designed for OCR error correction. The system uses multiple information resources to propose correction candidates and lets the user review the candidates and make corrections. In a previous work [11], we describe a phrase-based direct model for OCR correction. Correction is modeled as a simplified translation task. Noisy OCR output is taken as the source language and clean output (ground truth) is taken as the target language. The alignment is obtained using a simple dynamic programming-based minimum edit distance alignment. Other important research, by Wick et al. [12], used the concept of a topic-based newsgroup training corpus to correct lexical errors and showed a decrease in the error rate of approximately 7%. Even though this work has a similar motivation, it is based on an invalid assumption that the incorrect words are known a priori. The correction is performed from non-lexicon words to topic-specific lexicon words. This is not possible in word-model-based handwriting recognition systems, where the outputs (correct or otherwise) are always valid lexical entries.
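The dynamic programming-based minimum edit distance alignment mentioned above can be sketched as follows. This is a generic Levenshtein alignment, not the authors' exact implementation, and all names are ours:

```python
def align(noisy, truth):
    """Align two token sequences by minimum edit distance and return
    (noisy_token, truth_token) pairs; None marks an insertion/deletion gap."""
    n, m = len(noisy), len(truth)
    # dist[i][j] = edit distance between noisy[:i] and truth[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (noisy[i - 1] != truth[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the alignment
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (noisy[i - 1] != truth[j - 1]):
            pairs.append((noisy[i - 1], truth[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((noisy[i - 1], None))
            i -= 1
        else:
            pairs.append((None, truth[j - 1]))
            j -= 1
    return pairs[::-1]
```

Substitution pairs extracted from such an alignment (e.g. a noisy token aligned to its clean counterpart) provide the training data for a phrase-based correction model.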

Some of the above-mentioned techniques are highly dependent on the language since they use language-specific features. Therefore, such techniques are not applicable to a different language. Second, some of these techniques fail to use multiple recognition choices from the OCR, and their performance is limited to the top-choice OCR output. The methods proposed in this paper are able to use top-n choices from the OCR in eliminating the spurious choices. Third, none of these techniques uses domain knowledge or topic information of the document to reject noisy words from the recognition output. In finite-state machine-based methods, transition probabilities are obtained using a generic language model which is assumed to be independent of the topic of the document in question. In this work, we relax this assumption of a global language model by incorporating knowledge of the context (topic).

2.2 Lexicon reduction-based strategies

The task of word recognition is a pattern recognition problem that can be simply stated as 'given a word image, what is the best possible corresponding word in a given lexicon'. One of the elements that contributes most to the complexity of the recognition task is the size of the lexicon. Table 2 describes how an increase in the size of the lexicon reduces the recognition performance of a handwriting recognition system developed at CEDAR [13]. The problem with large lexicons is the number of times that the observation sequence extracted from the input image has to be matched against the words in the lexicon. So, a more intuitive approach attempts to limit the number of words to be compared during the recognition.

Whereas this is straightforward in a task that is geared towards a particular application such as postal automation or bank check recognition, it is non-trivial in large-vocabulary and unconstrained systems. A general survey of the research in this area can be found in [14]. Kaufmann et al. [15] use the length of the word image as a simple criterion for lexicon reduction. A length-based model is trained for every word image. The test word image is classified into one of these models. Words belonging to this selected model are preserved in the lexicon and the rest of the words are removed, thereby creating a reduced lexicon for the word image. Powalka et al. [16] estimate the length of cursive words based on the number of times an imaginary horizontal line drawn through the middle of the word intersects the trace of the pen in its densest area. Guillevic et al. [17] adopt a similar approach of estimating word length and reducing the lexicon size. They estimate the number of characters by counting the number of strokes crossing within the main body of a word. Madhvanath et al. [18,19] use holistic features based on a coarse representation of the word shape for pruning large lexicons.
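A length-based reduction of the kind used in [15–17] can be sketched minimally as follows; in practice the estimated length would come from image features, and it is hard-coded here purely as an illustrative assumption:

```python
def reduce_by_length(lexicon, estimated_len, tolerance=1):
    """Keep only lexicon entries whose character length lies within
    `tolerance` of the length estimated from the word image."""
    return [w for w in lexicon if abs(len(w) - estimated_len) <= tolerance]

# Suppose a length estimator returned 5 characters for a test word image
reduced = reduce_by_length(["bank", "export", "trade", "commerce"], 5)
```

As the surrounding text notes, such hard pruning can drop the true word whenever the length estimate is off, which motivates the weighting scheme proposed later in this paper.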

Table 2 Recognition accuracy (%) for the system described in [13]

Rank      Lex. size 10   Lex. size 100   Lex. size 1000
Top 1            96.80           88.23            68.70
Top 2            98.63           93.36            81.40
Top 20                           98.40            97.60
Top 50                                            98.70


Most of these techniques rely on features extracted from the word image, which are susceptible to errors. In addition, the coverage of the lexicon (percentage of test words in the lexicon) tends to reduce drastically using these techniques. Most of the features extracted for word-shape-based reduction are non-trivial to extract and prone to errors. Also, in large-vocabulary systems the number of words that have the same length or global shape tends to be large, thus leading to ineffective reduction in the size of the lexicon. We present a method that automatically categorizes a document into a pre-defined topic and reduces the size of the lexicon based on this information.

3 Lexicon reduction using topic categorization

Lexicon reduction cannot be achieved trivially. It is either a painstaking manual effort or can be automated if we can constrain the lexicon by the domain we are interested in. This is not only important for indexing of documents; it can be exploited in the reduction of the input lexicon to a word recognizer as well. Milewski and Govindaraju [20] have described initial work in this area where the domain is restricted to medical forms. Each medical form is associated with a 'concept', e.g., leg injury, heart attack, etc. They have shown that given this information, restricting the lexicon by concepts increases the efficiency of the recognizer, and hence the retrieval, significantly.

The rationale behind our approach is that the choice of words in a document is characteristic of the topic under discussion. For example, the presence of words like {sensory, brain, cortex, nerve, …} immediately leads us to believe that the document relates to the medical literature. Similarly, words like {commerce, export, import, bank, …} suggest a trade document. However, the knowledge of the topic of discourse is not available a priori. This leads to the necessity of automatically categorizing documents into topics. Figure 2 shows the schematic of our system that targets this problem. We explore automatically creating small and representative lexicons Lexi for topics of interest Ti from a large vocabulary V (Fig. 3), where |Lexi| ≪ |V|. Given the automatically detected topic from Fig. 2, we can then use this smaller lexicon in the second phase of recognition instead of the large vocabulary in order to procure better recognition outputs.

Document categorization is a well-researched area in the field of information retrieval when it comes to text categorization. Various topic models have been proposed that treat documents as bag-of-words models (SVD decomposition followed by cosine similarity measures [21], Naive Bayes [22], etc.). For a given handwritten document to be categorized, the raw output of a recognizer is extremely noisy. In the presence of noise, the accuracies of all methods suffer badly. Work presented in [23] demonstrated how the accuracy of document categorization reduces drastically as the accuracy of OCR reduces. This problem is magnified in the case of handwriting recognition. For a large lexicon, the word error rate for the top choice is generally very high. However, the error rates for the same lexicon when taking the top N outputs of the recognizer into consideration are significantly lower than in the first case (see Table 3). We propose to utilize this fact and will present novel features to categorize a document based on the n-best hypotheses for a word using a word recognizer.

Fig. 2 Topic categorization in handwritten document images

Fig. 3 Generating topic-specific lexicons

Assume that a document is related to a single topic (an assumption which we could relax later; however, it is valid for handwritten documents at this point) and is composed of a set of words W = {w1, w2, …, wn}. Let us assume that we fix the topics to a set T = {t1, t2, …, tn}. Then the probability of a topic given the words in the document can be estimated as

P(T = ti | W) = P(W | T = ti) · P(T = ti) / P(W)    (1)

Whereas in a system where we are dealing with text, the term P(W | T = ti) could be estimated from the number of occurrences of the word wi given a topic-labeled document, in our case each wi is a set of n-best hypotheses for each word image instead of being an atomic unit. The model can be thought of as a bag of words where each word, instead of being one entry, is a bunch of hypotheses with their posterior probabilities. Thus, our model would have to consider a document as a bag of as many bags as there are words in the document. The size of each smaller bag would be the value of 'n' as in the n-best hypotheses. Figure 2 is a pictorial representation of the task of categorization of the handwritten document.

Table 3 Top-n recognition accuracies of the word recognizer

Correct choice in   Percentage
Top 1                    32.22
Top 10                   43.50
Top 20                   65.89
Top 50                   73.24

Fig. 4 An example of topic categorization
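The bag-of-bags representation just described can be written down directly; the hypothesis lists and probabilities below are illustrative only, and the names are ours:

```python
# A document is a bag of smaller bags: one bag per word image,
# each holding the top-n recognizer hypotheses with their posteriors.
doc = [
    [("brain", 0.52), ("bruin", 0.31), ("train", 0.17)],  # word image 1
    [("cortex", 0.64), ("vortex", 0.36)],                  # word image 2
]

def partial_count(doc, word):
    """Sum the posterior of `word` over all hypothesis bags: the
    'partial count' that replaces a hard word count in the models below."""
    return sum(p for bag in doc for w, p in bag if w == word)
```

A word absent from every bag simply contributes a partial count of zero.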

Figure 4 shows an illustration of topic categorization. Figure 5 illustrates an example of constructing smaller topic-specific lexicons from an initial lexicon after topic categorization. This translates to extracting high-frequency words from the topic-specific distribution of words in the vocabulary. As we note, this method solves the problem of creating lexicons beforehand, since the lexicons are secondary outputs of the topic categorization algorithm. However, this method still suffers from the problem of out-of-vocabulary words, since pruning lexicons based on highly probable words can remove some words from the tail of the distribution that may still occur in the test documents. This leads to the necessity of the second variant of our proposed method, where instead of pruning, the whole distribution is considered.
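Constructing the topic-specific lexicons Lexi from topic-labeled training text can be sketched as follows; a simple frequency cutoff stands in for whatever pruning rule is actually used, and the toy data are our own:

```python
from collections import Counter

def topic_lexicons(docs_by_topic, top_k):
    """Build a reduced lexicon per topic by keeping each topic's
    top_k most frequent words from its labeled training documents."""
    lexicons = {}
    for topic, docs in docs_by_topic.items():
        counts = Counter(w for doc in docs for w in doc)
        lexicons[topic] = [w for w, _ in counts.most_common(top_k)]
    return lexicons

train = {"medical": [["brain", "nerve", "brain"], ["cortex", "nerve"]],
         "trade":   [["export", "bank"], ["import", "export", "bank"]]}
lex = topic_lexicons(train, top_k=2)
```

The out-of-vocabulary problem discussed above is visible even in this toy example: the tail words "cortex" and "import" fall outside the reduced lexicons.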

4 Topic-based language models

Word recognition can be understood as obtaining a maximum likelihood (ML) estimate between the input word image and the entries of the lexicon. Each estimate in this case represents the likelihood of the word image having the corresponding lexicon term as its correct transcription. These likelihoods are generally represented as probabilities and are obtained from the word recognizers. In this paper, we propose refining this process by finding the maximum a posteriori (MAP) value of the term in the lexicon corresponding to the input word image, as shown in Eq. 2. Previous approaches may also be explained identically using this MAP approach, with the main difference being that they assume a uniform prior probability for all lexicon terms, which reduces this approach to the maximum likelihood (ML)-based approach.

tk = arg maxk P(tk | w)    (2)

P(t | w) = P(w | t) · P(t) / P(w)    (3)

P(t | w) ∝ P(w | t) · P(t)    (4)

where P(w|t) represents the likelihood of the word image given the lexicon entry, which is obtained from an OCR, and P(t) represents the prior probability of observing the term in the lexicon. Most of the previous approaches assume a uniform prior probability of the terms in the lexicon. A non-uniform prior probability of lexicon entries can be learnt using an n-gram language model, and such methods have also been previously used in the information retrieval community [24].

Fig. 5 An example of reduced lexicon construction from initial lexicon

Typically, language model-based approaches use a single training corpus. A unigram term probability is obtained by counting the frequency of every term in the corpus and then normalizing it by the total number of words present in the document. However, the major drawback of this approach is that it assumes the term count to be independent of the document topic, which is not true. In practice, it has been observed that documents consist of topics, and each topic is well represented by a set of terms. Therefore, a single non-topic-based language model will not allow a better estimate of the term probability P(t), since the term probability is assumed to be independent of the document topic (Fig. 6).

We propose a topic-based language model for estimating the prior term probability P(t) of every term in the lexicon. First of all, training data are created where documents belonging to different topics are manually categorized. Separate topic-based language models are generated for each of these topics, with the assumption that each language model is consistent within a given topic. Let us assume that the training set consists of n topics. Each topic-based language model is represented by LMi, where i = (1, 2, …, n). The global language model is now replaced by a collection of individual topic-based language models LMi. Given an input test document, the distribution over all the trained topics is computed. Let di represent the topic distribution of a document d, which is basically the probability of document d belonging to topic i. Word likelihood scores P(w|t) for every word w in document d are obtained from the word recognizer. For every word w ∈ d, the posterior term probability P(t|w) is calculated for all the terms t in the lexicon using Eqs. 4 and 5 (the latter is used for computing the new value of P(t)). Finally, the term tk in the lexicon having the maximum posterior probability is output as the corrected recognizer choice for the word w. The overall system architecture is shown in Fig. 7.

P(t) = Σi P(t | LMi) · P(LMi)    (5)

where P(LMi) ≡ P(di) in this case. This probability is again computed using topic categorization methods, and from among the methods described in this paper we will use the Maximum Entropy model in this setting. We will touch upon this later.

This methodology can also be understood from the perspective of dynamic lexicon construction. The proposed method is equivalent to constructing a dynamic lexicon for every word image in the document. Each dynamic lexicon consists of all the entries of the complete lexicon but contains an additional weight P(t) associated with every term t, which is learnt using Eq. 5. The word recognition likelihood P(w|t) obtained from the word recognizer is multiplied with P(t) to obtain a corrected posterior probability (Fig. 8).
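The two steps just described — computing the prior P(t) as a topic mixture (Eq. 5) and multiplying it into the recognizer likelihood — can be sketched as follows. The language models, topic distribution, and scores are illustrative, and the names are ours:

```python
def term_prior(term, topic_lms, topic_dist):
    """Eq. 5: P(t) as a mixture of topic language models LM_i,
    weighted by the document's topic distribution P(d_i)."""
    return sum(lm.get(term, 0.0) * topic_dist[i]
               for i, lm in enumerate(topic_lms))

def correct_word(hypotheses, topic_lms, topic_dist):
    """Rescore the recognizer's top-n list: multiply each likelihood
    P(w|t) by the prior P(t) and return the best posterior term."""
    return max(hypotheses,
               key=lambda h: h[1] * term_prior(h[0], topic_lms, topic_dist))[0]

lms = [{"nerve": 0.040, "serve": 0.010},   # unigram LM for a medical topic
       {"nerve": 0.005, "serve": 0.020}]   # unigram LM for a second topic
hyps = [("serve", 0.50), ("nerve", 0.45)]  # recognizer output for one image
best = correct_word(hyps, lms, topic_dist=[0.9, 0.1])
```

Here the recognizer's top choice "serve" is overridden by "nerve" because the document's topic distribution is predominantly medical.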

The foundation of both systems described above is the topic categorization technique for the noisy handwriting recognition output. In the following section, we describe this most crucial part of the solution.

5 Models for topic categorization

Researchers have used techniques like Naive Bayes [22], Maximum Entropy models [25], Conditional Random Fields, Hidden Markov Models and, more recently, Latent Dirichlet Allocation for text classification. As described in Sect. 3, our task can be compared to classification with noisy data. We cannot directly apply these methods due to low accuracies, and hence in this paper we describe how we adapted the Naive Bayes model and also how we used the top-n information as a feature in the Maximum Entropy model.

Fig. 6 Motivation for topic-based language models

Fig. 7 System architecture

Fig. 8 Schematic illustration of dynamic lexicon

5.1 Naive Bayes

The Naive Bayes classifier is a simple model for text classification that assumes independence amongst the words in a given context. Because of the independence assumption, the task of learning is greatly simplified. A document di is assigned to a category cj under a model θ using the formula

j0 = arg maxj=1…m P(cj | di, θ) = arg maxj=1…m P(cj | θ) · P(di | cj, θ)    (6)


The independence assumption dictates

P(di | cj, θ) = Πk=1…|di| P(wk | cj, θ)    (7)

where |di| is the length of the document di. If |D| denotes the total number of documents, then the class prior is given by

P(cj | θ) = ( Σi=1…|D| P(cj | di) ) / |D|    (8)

McCallum and Nigam [22] describe how two variants of Naive Bayes can be utilized for text classification. The first variant depends only on the presence or absence of words and thus can be characterized by a distribution based on a multi-variate Bernoulli. In this case, the generative model is

P(wt | cj, θ) = ( 1 + Σi=1…|D| Bit · P(cj | di) ) / ( 2 + Σi=1…|D| P(cj | di) )    (9)

where

Bit = 1 if wt ∈ di, and 0 if wt ∉ di

As we note, this model only captures the existence of a word given a topic and fails to capture the word counts and occurrences, which are an important feature of the document topic. Also, in our case, since the accuracies are low and we utilize the top-n (n = 10 in our case) outputs from the recognizer, this model is not suited for our task. On the other hand, the second variant of Naive Bayes, where the document is represented by the set of word occurrences, captures this information. Here it can be characterized by a multinomial distribution in which a document is an ordered sequence of word events, drawn from the same vocabulary V. This is similar to what can be referred to as the 'unigram language model'. The generative model is thus modified to

P(wt | cj, θ) = ( 1 + Σi=1…|D| Nit · P(cj | di) ) / ( |V| + Σs=1…|V| Σi=1…|D| Nis · P(cj | di) )    (10)

where

Nit = Σ P(im = wt) if wt ∈ di, and 0 if wt ∉ di

Note that in the equation for Nit, we have replaced the count of the word by the sum of the probabilities P(im = wt). This is the probability from the word recognizer of the image im being the word wt. This is what we will refer to as the partial count of the word. Thus, we were easily able to introduce the top-n choices from the word recognizer by replacing the count for each choice by its partial count (probability score). Figure 9 shows an example of the partial word count feature.
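A minimal sketch of the smoothed estimate in Eq. 10 with partial counts follows; training documents are assumed here to carry a single hard topic label, so P(cj|di) is 0 or 1, and the toy data are our own:

```python
def word_given_class(word, cls, train_docs, vocab):
    """Eq. 10: Laplace-smoothed multinomial P(w_t | c_j) where hard
    counts are replaced by partial counts. Each document is a list of
    top-n hypothesis bags [(hypothesis, probability), ...]."""
    def partial(doc, w):  # partial count N_it of word w in document doc
        return sum(p for bag in doc for h, p in bag if h == w)
    docs = [d for d, label in train_docs if label == cls]
    num = 1.0 + sum(partial(d, word) for d in docs)
    den = len(vocab) + sum(partial(d, w) for d in docs for w in vocab)
    return num / den

# One training document with one word image and its top-2 hypotheses
train = [([[("brain", 0.6), ("bruin", 0.4)]], "medical")]
p = word_given_class("brain", "medical", train, vocab=["brain", "bruin"])
```

With this toy data the estimate is (1 + 0.6) / (2 + 1.0), the partial counts in the denominator summing to the number of word images.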

5.2 Maximum entropy

Naive Bayes, however, suffers from the independence assumption, which may not be valid. This is often referred to as the bag-of-words model, where the words are exchangeable and the topic does not depend on the mutual dependence of the words. To account for this, we modified the Maximum Entropy model. The motivating idea behind maximum entropy is that the most uniform model should be preferred when no information is present, and whatever information is present should constrain that uniform model. Thus, the Maximum Entropy model prefers the most uniform model satisfying the constraint

P(c | d) = exp( Σi λi fi(d, c) ) / Σc′ exp( Σi λi fi(d, c′) )    (11)

λi's represent parameters that are estimated during the course of the Maximum Entropy training. The fi's are any real-valued functions describing features of the (document, class) relationship, d represents the document, and c represents a category. In order to include partial counts as in Sect. 5.1, we use the real-valued feature function

fw,c′(d, c) = 0 if c ≠ c′, and N(d, w) / |d| otherwise
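The real-valued feature function above can be sketched directly. The document structure (hypothesis bags with partial counts, as in Sect. 5.1) and all names are ours:

```python
def feature(word, target_class, doc, doc_class):
    """MaxEnt feature f_{w,c'}(d, c): 0 when the class c does not match
    c'; otherwise the partial count N(d, w) normalized by the number
    of word images |d|."""
    if doc_class != target_class:
        return 0.0
    n = sum(p for bag in doc for h, p in bag if h == word)  # N(d, w)
    return n / len(doc)

doc = [[("bank", 0.7), ("rank", 0.3)], [("export", 1.0)]]
val = feature("bank", "trade", doc, "trade")  # 0.7 / 2 word images
```

One such feature is instantiated per (word, class) pair, and training fits a weight λ for each.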

Fig. 9 Partial word count feature


Fig. 10 Chunk distribution feature

where N(d, w) is the 'partial' count (probability from the recognizer) of the word w in document d. We also experiment with extracting other features from the document, such as the chunk distribution probability. Given a word image and its top-n recognition hypotheses, we divide the top-n results into n/10 chunks, each of size 10. For example, the top-20 results are divided into 2 chunks, where the first chunk contains choices from rank 1–10 and the second chunk consists of choices from rank 11–20. We maintain a count of each chunk from the OCR results of the training documents and convert them into probability scores by normalizing each chunk count by the sum of all chunk counts. The primary motivation behind using this feature is the observation that similar word images tend to generate a similar ranked list of lexicon entries, which can be used to represent the noise model or the correction model of the OCR. Figure 10 illustrates an example of extracting chunks from the OCR results.
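One plausible reading of the chunk distribution feature is sketched below: for each training image we record which chunk of the ranked list contains the correct word, then normalize the chunk counts into probabilities. The function name and exact bookkeeping are our own assumptions:

```python
def chunk_distribution(true_ranks, n):
    """Divide ranks 1..n into n/10 chunks of size 10, count how often
    the correct word falls in each chunk, and normalize the counts
    into a probability distribution (ranks beyond n are ignored)."""
    counts = [0] * (n // 10)
    for r in true_ranks:              # rank of the true word per image
        if 1 <= r <= n:
            counts[(r - 1) // 10] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

# Top-20 results: ranks 1, 3, 2 fall in chunk one; 12 and 15 in chunk two
dist = chunk_distribution([1, 3, 12, 15, 2], n=20)
```

The resulting distribution summarizes how deep in the ranked list the recognizer tends to bury the correct word.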

As mentioned in Nigam et al. [25], since Maximum Likelihood training overfits, we use Maximum A Posteriori training with Gaussian priors over the feature functions. The prior probability of the model is the product of Gaussians of each feature value λ_i with variance σ_i²:

$$\prod_i \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\lambda_i^2/(2\sigma_i^2)} \qquad (12)$$

The training has been implemented using the Improved Iterative Scaling (IIS) algorithm, which is of the Quasi-Newton family of numerical algorithms. We used the Mallet (http://mallet.cs.umass.edu) package from the University of Massachusetts at Amherst to train the model. The technical report [26] provides further details of the Maximum Entropy model used in the context of natural language processing.
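Given trained weights, evaluating the posterior of Eq. (11) reduces to a softmax over weighted feature sums. A minimal sketch, assuming a dictionary-based representation of features and weights (the actual training in the paper used Mallet):

```python
import math

def maxent_posterior(feature_fn, weights, classes):
    """Compute P(c|d) as in Eq. (11): for each class c, sum
    lambda_i * f_i(d, c) over the active features, then normalize
    with a softmax over all classes."""
    scores = {c: sum(weights.get(f, 0.0) * v
                     for f, v in feature_fn(c).items())
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())  # partition function
    return {c: math.exp(s) / z for c, s in scores.items()}
```

Here `feature_fn(c)` plays the role of the f_{w,c'}(d, c) feature functions for a fixed document d; it would return the partial-count values N(d, w)/|d| for the class being scored.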

6 Experiments

For the purpose of our experiments we have used the publicly available IAM database [27].

Table 4 Topic categories of the IAM database

S. no. Label Topic category

1 A Press:reportage

2 B Press:editorial

3 C Press:reviews

4 D Religion

5 E Skills, trades and hobbies

6 F Popular lore

7 G Biographies, essays

8 H Miscellaneous

9 J Learned and scientific writings

10 K General fiction

11 L Mystery

12 M Science fiction

13 N Adventure

The IAM database has the following characteristics:

– 657 writers contributed samples of their handwriting
– 1,539 pages of scanned text
– 5,685 isolated and labeled sentences
– 13,353 isolated and labeled text lines
– 115,320 isolated and labeled words
– ∼15k lexicon entries

There are 13 topic categories, ranging from Press to Religion, as listed in Table 4. More details on the database can be found in [27]. This dataset is unconstrained English and the pages are classified according to the topics. Figure 11 shows an image from the dataset.

6.1 Topic categorization results

The test set (40%) of the data was used to evaluate the topic classification performance. In order to prove statistical significance, we conducted these experiments in an n-fold cross-validation setting (n = 100 in our case). We split the data



Fig. 11 Example of images from IAM database

randomly into 60% for training and 40% for testing and repeated this a hundred times. We also experimented with an 80–20 split, but we report results on the former configuration. During each iteration, we stored the information about the data that were used for testing, as this would be required for the lexicon reduction experiment. Figure 12 shows the average performance of the topic classification algorithms. After classification, we selected the iteration number with topic classification performance approximately equal to the overall average. As mentioned earlier, the information about the test data was stored in each iteration, and thus we extracted the portion of the data used for testing in this particular iteration. Table 5 shows the result of the fold selection method for 100-fold validation. 0.8 denotes that 80% of the data is randomly sampled as training data and 0.6 denotes that 60% represents the training data. The fold value denotes the iteration number for which the classification accuracy was approximately equal to the overall average classification accuracy.
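The fold-selection procedure above (repeat a random split, then keep the fold whose accuracy is closest to the overall mean) can be sketched as follows; the `evaluate` callback, function name, and fixed seed are hypothetical conveniences:

```python
import random

def representative_fold(documents, evaluate, n_folds=100, train_frac=0.6, seed=0):
    """Repeat a random train/test split n_folds times, record each fold's
    test accuracy via the evaluate(train, test) callback, and return the
    (index, test_split, accuracy) of the fold whose accuracy is closest
    to the overall average."""
    rng = random.Random(seed)
    folds = []
    for i in range(n_folds):
        docs = documents[:]
        rng.shuffle(docs)
        cut = int(train_frac * len(docs))
        train, test = docs[:cut], docs[cut:]
        folds.append((i, test, evaluate(train, test)))
    mean = sum(acc for _, _, acc in folds) / len(folds)
    return min(folds, key=lambda f: abs(f[2] - mean))
```

Storing the returned test split is what makes the later lexicon-reduction experiment reproducible on the same held-out documents.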

6.2 Lexicon reduction results

After topic classification, the lexicon was reduced based onthe highest mutual information of the words with the topic

Fig. 12 Topic classification results (NB Naive Bayes, ME MaxEnt; suffix 10 implies top-10 choices for each word image)

Table 5 Fold selection results

Hypothesis Data split Selected fold

TOP 0.8 4

TOP 0.6 1

TOP-5 0.8 14

TOP-5 0.6 6

TOP-10 0.8 11

TOP-10 0.6 9

Table 6 Recognition accuracies before and after lexicon reduction

Top (%) Top-10 (%)

Before reduction 32.22 43.4

After reduction 40.01 69.1

and then recognition was performed again with the new set of words. The immediate advantage of this method is the availability of the lexicon, since it is a by-product of the classification model. On average, the size of the reduced lexicon lies between 1.5k and 2k entries, an average reduction of about 85%. The recognition accuracies before and after lexicon reduction by topic classification are reported in Table 6.
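A sketch of the mutual-information-based reduction: for a predicted topic, score each lexicon word by the pointwise mutual information between its document-level presence and the topic, and keep the top-k words. The binary presence counts and add-one smoothing here are assumptions for illustration, not the authors' exact estimator:

```python
import math
from collections import Counter

def reduce_lexicon(docs_by_topic, topic, k):
    """Keep the k lexicon words with the highest mutual information with
    `topic`, computed from word-presence counts over labelled training
    documents (add-one smoothing is a simplifying assumption)."""
    n_docs = sum(len(docs) for docs in docs_by_topic.values())
    in_topic = Counter()   # docs of `topic` containing w
    overall = Counter()    # docs of any topic containing w
    for t, docs in docs_by_topic.items():
        for doc in docs:
            for w in set(doc):
                overall[w] += 1
                if t == topic:
                    in_topic[w] += 1
    n_topic = len(docs_by_topic[topic])

    def mi(w):
        # pointwise MI between word presence and the topic label
        p_joint = (in_topic[w] + 1) / (n_docs + 2)
        p_w = (overall[w] + 1) / (n_docs + 2)
        p_t = (n_topic + 1) / (n_docs + 2)
        return p_joint * math.log(p_joint / (p_w * p_t))

    return sorted(overall, key=mi, reverse=True)[:k]
```

Words that occur mostly outside the predicted topic receive a low (even negative) score and are the first to be pruned, which is how the roughly 85% reduction is obtained.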

6.3 Dynamic lexicon results

As noted in Fig. 12, the Maximum Entropy model outperforms the others. So, for this experiment we have used this method in conjunction with the topic-specific language models.

Table 7 shows the accuracy of the recognizer using the topic-specific language models. Raw accuracy here represents the initial word recognition accuracy obtained from the



Table 7 Recognition accuracies with and without topic language models

Method Word recognition accuracy

Raw 32.33%

Corrected 40.63%

recognizer. Corrected refers to the accuracy obtained after applying topic-based language models to each word. Since the output in this case is a single decoded sequence, we do not report top-10 accuracy. As shown in the table, the proposed method significantly improves the raw accuracy of the word recognizer by ≈25% (relative). Moreover, the method is completely trainable and flexible, since it does not assume any internal details of the word recognizer.
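One simple way to realize this correction step is to rescore each word's top-n hypotheses by interpolating the recognizer's likelihood with the topic-specific language model probability. The linear interpolation and the mixing weight `alpha` are hypothetical choices for illustration, not necessarily the authors' combination rule:

```python
def rescore(topn, topic_lm, alpha=0.5):
    """Pick the best word from a top-n list of (word, recognizer_score)
    pairs by mixing the recognizer score with a topic-specific unigram
    LM probability; unseen words get a small floor probability."""
    return max(topn,
               key=lambda wc: alpha * wc[1]
               + (1 - alpha) * topic_lm.get(wc[0], 1e-6))[0]
```

A word that the recognizer slightly prefers can thus be overruled by a lower-ranked hypothesis that is far more probable under the document's topic model.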

7 Conclusion

We have presented a post-processing method to improve unconstrained handwriting recognition accuracies by reducing lexicon sizes using topic categorization of noisy word recognizer output. The method is novel in that it uses topic-based lexicons or language models, instead of a global language model, to refine the likelihood scores obtained from the recognizer. The proposed method is trainable and statistical. It is also very flexible, since it does not assume any internal information about the word recognizer. Moreover, we believe this method can also be extended to other languages, since no language-specific information is assumed. We modified state-of-the-art statistical methods of topic categorization (Naive Bayes, maximum entropy) to account for top-n choices and seamlessly include probability scores of the handwriting recognizers. Without this, the methods are close to ineffective on the erroneous data produced by the recognizers. These features are also generic and can find application in similar systems, such as speech recognition and word-based machine-print OCR of cursive scripts like Arabic, where language models and vocabularies are used. Our current work is focused on improving the topic categorization results of the noisy OCRed documents, which are crucial to the overall performance of the system. Currently, the topic categorization is independent of the underlying noise level of the word recognition output. We are working on automatically identifying noise levels in the test documents and then associating a confidence level with each document. This confidence level can then be integrated with the topic categorization score to enable a more robust estimation of the topic distribution of the documents.

Acknowledgments We would like to thank our colleagues Huaigu Cao and Gaurav Chandalia for their valuable inputs. We also acknowledge the feedback from the reviewers, which has greatly helped improve this paper.

References

1. Kim, G., Govindaraju, V., Srihari, S.: Architecture for handwriting recognition systems. Int. J. Doc. Anal. Recognit. 2(1), 37–44 (1999)

2. Senior, A., Robinson, A.: An off-line cursive handwriting recognition system. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 309–321 (1998)

3. Srihari, S., Keubert, E.: Integration of handwritten address interpretation technology into the United States Postal Service remote computer reader system. In: Proceedings of 4th International Conference on Document Analysis and Recognition, pp. 892–896. Ulm, Germany (1997)

4. Impedovo, S., Wang, P.S.P., Bunke, H. (eds.): Automatic Bankcheck Processing. Series in Machine Perception and Artificial Intelligence, vol. 28. World Scientific (1997)

5. Govindaraju, V., Ramanaprasad, V., Lee, D., Srihari, S.: Reading handwritten US census forms. In: Proceedings of 3rd International Conference on Document Analysis and Recognition, pp. 82–85. Montreal, Canada (1997)

6. Vinciarelli, A., Bengio, S., Bunke, H.: Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Trans. Pattern Anal. Mach. Intell. 26(6), 709–720 (2004)

7. Kukich, K.: Techniques for automatically correcting words in text. ACM Comput. Surv. 24(4), 377–439 (1992)

8. Perez-Cortes, J., Amengual, J., Arlandis, J., Llobet, R.: Stochastic error-correcting parsing for OCR postprocessing. In: International Conference on Pattern Recognition, pp. 4405–4408. Barcelona, Spain (2000)

9. Pal, U., Kundu, P., Chaudhuri, B.: OCR error correction of an inflectional Indian language using morphological parsing. J. Inform. Sci. Eng. 16(6), 903–922 (2000)

10. Taghva, K., Stofsky, E.: OCRSpell: an interactive spelling correction system for OCR errors in text. Int. J. Doc. Anal. Recognit. 3(3), 125–137 (2001)

11. Farooq, F., Jose, D., Govindaraju, V.: Phrase based direct model for improving handwriting recognition accuracies. In: Proceedings of International Conference on Frontiers in Handwriting Recognition. Montreal, Canada (2008)

12. Wick, M., Ross, M., Learned-Miller, E.: Context-sensitive error correction: using topic models to improve OCR. In: Proceedings of 9th International Conference on Document Analysis and Recognition, pp. 1168–1172. Brazil (2007)

13. Kim, G., Govindaraju, V.: A lexicon driven approach to handwritten word recognition for real-time applications. IEEE Trans. Pattern Anal. Mach. Intell. 19(4), 366–379 (1997)

14. Koerich, A., Sabourin, R., Suen, C.: Large vocabulary offline handwriting recognition using a constrained level building algorithm. Pattern Anal. Appl. 6(2), 97–121 (2003)

15. Kaufmann, G., Bunke, H., Hadorn, M.: Lexicon reduction in an HMM-framework based on quantized feature vectors. In: Proceedings of the 4th International Conference on Document Analysis and Recognition, pp. 1097–1101. Ulm, Germany (1997)

16. Powalka, R.K., Sherkat, N., Whitrow, R.J.: Word shape analysis for a hybrid recognition system. Pattern Recognit. 30(3), 421–445 (1997)

17. Guillevic, D., Nishiwaki, D., Yamada, K.: Word lexicon reduction by character spotting. In: Proceedings of the Seventh International Workshop on Frontiers in Handwriting Recognition, pp. 373–382 (2000)

18. Madhvanath, S., Govindaraju, V.: Holistic lexicon reduction for handwritten word recognition. In: Proceedings of the SPIE Document Recognition III, pp. 224–234. San Jose, CA (1996)

19. Madhvanath, S., Govindaraju, V.: Syntactic methodology of pruning large lexicons in cursive script recognition. Pattern Recognit. 34(1), 37–46 (2001)



20. Milewski, R., Setlur, S., Govindaraju, V.: A lexicon reduction strategy in the context of handwritten medical forms. In: Proceedings of Eighth International Conference on Document Analysis and Recognition, pp. 1146–1150. Seoul, Korea (2005)

21. Yang, Y., Chute, C.: An example-based mapping method for text categorization and retrieval. ACM Trans. Inform. Syst. 12(3), 252–277 (1994)

22. McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: Proceedings of AAAI Workshop on Learning for Text Categorization, pp. 41–48. Madison, USA (1998)

23. Price, R., Zukas, A.: Accurate document categorization of OCR generated text. In: Proceedings of Symposium on Document Image Understanding Technology, pp. 97–102. Maryland, USA (2005)

24. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)

25. Nigam, K., Lafferty, J., McCallum, A.: Using maximum entropy for text classification. In: Proceedings of Workshop on Machine Learning for Information Filtering, IJCAI, pp. 61–67. Stockholm, Sweden (1999)

26. Ratnaparkhi, A.: A simple introduction to maximum entropy models for natural language processing. IRCS Report 97–08. University of Pennsylvania (1997)

27. Marti, U., Bunke, H.: The IAM-database: an English sentence database for off-line handwriting recognition. Int. J. Doc. Anal. Recognit. 5, 39–46 (2002)
