

Statistical script independent word spotting in offline handwritten documents

Safwan Wshah, Gaurav Kumar, Venu Govindaraju
Department of Computer Science and Engineering, University at Buffalo, 113 Davis Hall, Amherst, NY 14260-2500, United States

Article info

Available online 10 October 2013

Keywords: Script independent; Keyword spotting; Hidden Markov models

Abstract

We propose a statistical script independent line based word spotting framework for offline handwritten documents based on Hidden Markov Models. We propose and compare an exhaustive study of filler models and background models for better representation of background or non-keyword text. The candidate keywords are pruned in a two stage spotting framework using the character based and lexicon based background models. The system deals with large vocabulary without the need for word or character segmentation. The script independent word spotting system is evaluated on a mixed corpus of public datasets from several scripts, such as IAM for English, AMA for Arabic and LAW for Devanagari.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

The recognition of unconstrained offline handwritten documents has been a major area of research during the last decades. Due to the large variability in writing styles and the huge vocabulary, the problem is still far from being completely solved [22,34]. As a result, word spotting has been proposed as an alternative to full transcription for retrieving keywords from document images [27]. The inputs to a word-spotting system are document sets or databases and an element denoted as the query; the output is a set of images or sub-images from the database that are relevant to the query, making it similar to a classical information retrieval system. Word spotting finds its application in many areas such as information retrieval and indexing of handwritten documents that are made available for searching and browsing [9]. An extensive number of multilingual handwritten documents and forms are sent every day to companies for processing [6]. An efficient retrieval system for these documents saves these companies time and money. As another example, many libraries around the world have recently digitized their valuable handwritten books, transcribed in many scripts and ranging from ancient to modern times. An application of word spotting is to make these books searchable.

The general trend in word spotting systems is to propose methods that achieve high accuracy and high speed and work on any language with minimal preprocessing steps, such as preparing the query format or word segmentation. Our goal in this work is to develop approaches that improve word spotting performance in handwritten documents by simulating any keyword, even those unseen in the training corpus, effectively dealing with a large background vocabulary without the need for word or character segmentation, and making the system scalable over many languages such as English, Arabic, and Devanagari. We also evaluate the performance of the system and compare it to current approaches. In the proposed method, we assume minimal preprocessing during training and validation. Our method does not require segmentation of lines into words because it works on the line as one unit. The required keyword and non-keyword models are generated at run time to look for a certain keyword with minimal preprocessing.

In this work we elaborate on our previous work [36], where we introduced a script independent, word-segmentation-free keyword spotting framework based on Hidden Markov Models (HMMs). This framework is scalable across multiple scripts. We learn HMMs of trained characters and combine them to simulate any keyword, even those unseen in the training corpus. We use filler models for a better representation of non-keyword image regions, avoiding the limitations of line-based keyword spotting techniques that rely largely on lexicon-free score normalization and white space separation. Our system is capable of dealing with a large background vocabulary without the need for word or character segmentation and is scalable over many languages such as English, Arabic and Devanagari. The main characteristic of the proposed approach is its use of script independent methods for feature extraction, training and recognition.

This work provides a detailed account of the framework setup and an extensive evaluation of the proposed technique. Key attributes such as feature extraction and the different filler and background models proposed in this work are evaluated extensively, along with an analysis of system complexity. In the experimental evaluation section, the system is evaluated on individual and mixed public datasets of English, Arabic and Devanagari manuscripts using different sizes of the keyword list. The proper modeling of the filler and background models is investigated, and the system results are compared with the model presented by Fischer et al. [13] on English, Arabic and Devanagari, showing better performance over all the languages.

Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/pr

0031-3203/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2013.09.019

Corresponding author. Tel.: +1 716 587 1594. E-mail addresses: [email protected], [email protected] (S. Wshah).

Pattern Recognition 47 (2014) 1039–1050

The rest of the paper is structured as follows. We present related work and categorize existing approaches in Section 2. In Section 3 we describe our approach, including image preprocessing, feature extraction and the spotting framework. We discuss the complexity of our system in Section 4 and the experimental evaluation in Section 5.

2. Related work

One of the first word spotting approaches for document images was proposed by [19]. Since then, many word spotting approaches have been proposed. Word spotting approaches are mainly divided into two types based on the query input: query-by-example and query-by-string [25]. The query-by-example or template based approach requires query images, which are hard to prepare and may not exist in the training set. In the template based approach, the input image is matched to a set of template keyword images and the outputs are the images most similar to the query image. The image is represented as a sequence of features and usually compared with the dynamic time warping (DTW) technique [24,19,32]. The main advantage of this approach is that minimal learning is involved. However, it has limitations in dealing with a wide variety of unknown writers [13]. There are certain segmentation free techniques, such as Rusinol et al. [26] and Leydier et al. [18], that work at the document level, detecting interest points using gradient or scale-invariant transform features. Query-by-example is not the focus of our work.

Query-by-string refers to the word-spotting techniques where the input is the string that needs to be located [7,5]. Query-by-string is more complicated than query-by-example, since the keyword models need to be created even when no samples of the keyword exist in the training set. We further categorize the query-by-string techniques by processing level as below.

2.1. Word recognition based spotting

In word based spotting, such as Rodríguez-Serrano and Perronnin [25] and Saabni and El-Sana [27], the HMM model for each keyword is trained separately. The score of each HMM is normalized with respect to the score of a same-topology HMM trained on all non-keywords. This approach relies heavily on perfect word segmentation and requires several samples of each keyword in the training set. In a similar way, [8,33] use a word-segmentation-free technique to train character models that build the keywords as well as non-keywords. The drawbacks of their approach are the confidence measure with respect to a general non-keyword model that represents everything but the keywords, and the dependence on the white space model to segment the words.

2.2. Line recognition based spotting

In the line based approach, the word or character segmentation step is done during the spotting process. Chan et al. [4] and Edwards et al. [7] train character HMMs from manually segmented templates, assuming small variation in the data. Fischer et al. [13] proposed a line level approach using HMM character models under the assumption that no more than one keyword can be spotted in a given line. Their approach outperformed the template based methods for a single writer with few training samples and for multiple writers with many training samples. A major drawback of their approach is the dependency on the white space to separate keywords from the rest of the text. This not only has a large influence on the spotting results but also prevents the system from being scalable to other languages such as Arabic, in which spaces can occur within as well as between words, revealing little information about word boundaries [4]. Besides, the lexicon free approach to modeling the non-keywords has a large negative effect on their system performance as well.

Frinken et al. [15] proposed a neural network-based spotting system. It parses the line to recognize the sequence of characters and maps each character's position and its probability. It then takes the sequence of character probabilities, a dictionary, and a language model and computes a likely sequence of words. The drawback of this approach is the dependency on the recognition system. In addition, increasing the number of keywords increases the accuracy due to the use of an efficient language model based on a big dictionary, making it more like a recognizer than a spotting system. In general, neural network-based keyword spotting depends crucially on the amount of training data [14], and it is hard to segment and train for other languages.

Fig. 1. Feature extraction using a sliding window.

Fig. 2. Main model: (a) keywords and filler models, (b) score normalization using background models.

3. Proposed approach

We propose a novel script independent line based spotting framework. Given a line image, the goal is to locate and recognize one or more occurrences of a keyword. The algorithm starts by detecting the candidate keywords using a recognizer that searches for the keywords in a line. The keyword models consist of all keywords built by concatenating their HMM character models. The HMM based recognizer uses the Viterbi beam search decoder [23] to parse the entire line, finding the maximum probability path between the keywords and filler models, as shown in Fig. 2. Each candidate keyword is processed by extracting it from the line using the start and end positions. Then the score of each candidate keyword is normalized with respect to the score of the word background models. Our approach utilizes both the filler and background models; the models are explained in Sections 3.3 and 3.4.
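To make the decoding step concrete, the following is a minimal log-space Viterbi pass over a single HMM. This is only a sketch: it omits the beam pruning and the combined keyword/filler network of the actual recognizer, and the matrices in the test below are made up for illustration.

```python
import numpy as np

def viterbi_log(log_A, log_pi, log_B):
    """Most-likely state path for one observation sequence.

    log_A:  (S, S) log transition matrix
    log_pi: (S,)   log initial-state distribution
    log_B:  (T, S) log emission likelihoods per frame
    Returns (best_path, best_log_prob).
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A                # (from, to)
        back[t] = np.argmax(scores, axis=0)            # best predecessor per state
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())
```

In the full system, the state space is the concatenation of all keyword character models and the filler models, and the decoded path yields the candidate keyword boundaries.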

3.1. Preprocessing and feature extraction

The quality of the input image is enhanced by removing noise through smoothing and morphological operations. Common shapes such as dots, lines and hole punches are also removed using additional preprocessing steps. A skew correction algorithm, such as the one proposed by Yan [37], is applied at both the document and line levels.

For line segmentation, the robust algorithm presented by Shi et al. [31] is used. The height of each segmented line image is resized to a fixed value while maintaining the aspect ratio. In analytical recognition systems, representing the image as a sequence of features allows a powerful use of hidden Markov models (HMMs), which are mainly used for sequence modeling. Thus, instead of extracting features from the whole image, a sliding window is moved over the image from left to right. At each position n, a feature vector f_n is computed from only the pixels inside the sliding window, as shown in Fig. 1. The sliding window preserves the left-to-right writing nature of the document as well as the variable word length property.
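The sliding-window pass can be sketched as below. The 20-pixel width and 85% overlap come from the text (Section 3.1); the exact rounding of the step size is an assumption.

```python
import numpy as np

def sliding_windows(line_img, win_w=20, overlap=0.85):
    """Yield window pixel blocks left-to-right over a line image.

    line_img: 2-D array (H, W), already height-normalized.
    The step size follows from the window width and overlap ratio
    (20 px with 85% overlap -> a 3 px step).
    """
    step = max(1, int(round(win_w * (1.0 - overlap))))
    h, w = line_img.shape
    for x in range(0, w - win_w + 1, step):
        yield line_img[:, x:x + win_w]
```

Each yielded block is then converted into one feature vector f_n, producing the observation sequence for the HMM decoder.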

The most popular sliding window features were presented by Favata and Srikantan [10] and Vinciarelli and Luettin [35]. In Vinciarelli and Luettin [35], the sliding window is split into 4×2 cells known as bins, and the pixel count in each bin is considered to be a feature, resulting in a 16-dimensional feature vector. We denote these as intensity features. The advanced gradient, structural, and concavity (GSC) features presented by Favata and Srikantan [10] showed state-of-the-art results for Arabic handwritten word recognition [28]. The GSC features are multi-resolution features that combine three different attributes of the character shape: the gradient (representing the local orientation of strokes), the structural features (which extend the gradient to longer distances and provide information about stroke trajectories) and the concavity features (which capture stroke relationships at long distances). The best performance was found with the combination of the gradient features from GSC and the intensity features. A sliding window 20 pixels wide with 85% overlap was used; these values were determined empirically on a validation dataset.

Gradient and intensity features were extracted for each window. For the gradient features, the sliding window was divided into two vertical bins according to the center of mass. For each bin, the gradient direction of each pixel was calculated; the gradient direction at any pixel of image I(x, y) was defined as

φ = tan⁻¹(G_y / G_x)  (1)

where

G_x = I(x+1, y) − I(x−1, y)
G_y = I(x, y+1) − I(x, y−1)  (2)

For each pixel, the gradient was calculated and the angle was uniformly quantized into eight directions. Each orientation was accumulated into a histogram, and after processing all the pixels in the bin, the gradient histogram was normalized with respect to the maximum value in the histogram. Each value in the normalized histogram was considered an independent feature. Thus, the dimension of the gradient features was 8 directions × 2 bins = 16. For the intensity features, the adjusted sliding window was divided horizontally into four regions based on the center of mass. For each region, the black-to-white pixel ratio was calculated and considered an independent feature, and the total number of intensity features was 8. Thus, the total number of features per sliding window was 24.

Table 1. Filler models.

Model                                       Description
Characters                                  Group of character models used as fillers
Bigram context-dependent character models   All bigram context-dependent characters not found in the keyword list are used as filler models
Bigram context-independent sub-words        All bigram character sequences not found in the keyword list are modeled and used as filler models
Trigram context-dependent character models  All trigram context-dependent characters not found in the keyword list are used as filler models
Trigram context-independent sub-words       All trigram character sequences not found in the keyword list are modeled and used as filler models

Fig. 3. Lexicon based word background model.

Fig. 4. Word background model using character filler models.
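The 24-dimensional feature computation described above can be sketched as follows. The 8-direction histograms in two vertical halves follow the text; the 4×2 cell grid for the eight intensity features is an assumption (the paper splits regions on the center of mass, which is not reproduced here).

```python
import numpy as np

def window_features(win):
    """24-dim feature vector for one binary sliding window.

    16 gradient features: an 8-direction histogram in each of 2 vertical
    halves, normalized by the histogram maximum (as in the text).
    8 intensity features: ink ratio in a 4x2 grid of cells (assumed layout).
    """
    win = win.astype(float)
    h, w = win.shape
    feats = []
    for half in (win[:, : w // 2], win[:, w // 2:]):
        gx = half[:, 2:] - half[:, :-2]       # central differences, eq. (2)
        gy = half[2:, :] - half[:-2, :]
        gx, gy = gx[1:-1, :], gy[:, 1:-1]     # crop to common interior
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx)              # gradient direction, eq. (1)
        dirs = ((ang + np.pi) / (2 * np.pi) * 8).astype(int) % 8
        hist = np.bincount(dirs[mag > 0], minlength=8).astype(float)
        if hist.max() > 0:
            hist /= hist.max()                # normalize by histogram maximum
        feats.extend(hist)
    for r in range(4):                        # 4x2 intensity cells
        for c in range(2):
            cell = win[r * h // 4:(r + 1) * h // 4,
                       c * w // 2:(c + 1) * w // 2]
            feats.append(cell.mean())         # ink (black pixel) ratio
    return np.array(feats)
```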

3.2. HMM character models

The HMM is a powerful statistical tool for modeling generative sequences that can be characterized by an underlying process generating an observable sequence. HMMs are widely used in the fields of machine learning and pattern recognition for sequence modeling tasks such as speech, handwriting and gesture recognition. State-of-the-art systems for speech recognition [12], online [30] and offline [34,3] handwriting recognition, behavior analysis in videos [2], and others are all HMM based. In addition, when training the HMM character models, there is no need to provide exact character positions along with the training transcription, as the character positions are estimated by the Baum–Welch algorithm. This makes it possible to reduce the time required to prepare the transcription of the training data.

The feature extraction procedure converts the text line into a set of features F, as shown in Fig. 1. The generated sequence is considered as observations O, where the i-th observation value O_i corresponds to the i-th 24-dimensional feature vector f_i. For each character, a 14-state linear topology HMM is implemented; the number of states per character was identified empirically. For each state S_i, the observation probability distribution is modeled by a continuous probability density function of a Gaussian mixture given as

b_i(o) = Σ_{m=1}^{M} c_m · N(o; μ_m, Σ_m)  (3)

where o is the observation vector, M is the number of mixture components, c_m is the weight of the m-th component such that c_1 + c_2 + ⋯ + c_M = 1, and N is the 24-dimensional Gaussian PDF with mean vector μ_m and diagonal covariance matrix Σ_m:

N(o; μ_m, Σ_m) = (1 / √((2π)^24 |Σ_m|)) · exp(−½ (o − μ_m)ᵀ Σ_m⁻¹ (o − μ_m))  (4)

The character models are concatenated to form words and sequences of words. As noted above, there is no need to provide exact character positions along with the training transcription; during training, the character positions are estimated by the Baum–Welch algorithm, which reduces the time required to prepare the transcription of the training data.
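The mixture emission density of eq. (3), with the diagonal-covariance Gaussians of eq. (4), can be evaluated with a few lines of numpy. This is a sketch of the formula only, not the authors' implementation.

```python
import numpy as np

def gmm_emission(o, weights, means, variances):
    """State emission likelihood b_i(o) of eq. (3): a weighted sum of
    diagonal-covariance Gaussians, eq. (4).

    o:         (D,) observation (D = 24 in the paper)
    weights:   (M,) mixture weights summing to 1
    means:     (M, D) component mean vectors
    variances: (M, D) diagonal covariance entries
    """
    diff = o - means                                       # (M, D)
    log_norm = -0.5 * (means.shape[1] * np.log(2 * np.pi)
                       + np.log(variances).sum(axis=1))    # log of 1/sqrt((2pi)^D |Sigma|)
    log_comp = log_norm - 0.5 * (diff**2 / variances).sum(axis=1)
    return float(np.sum(weights * np.exp(log_comp)))
```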

3.3. Filler models

Filler models are used to model the non-keywords without explicitly defining them. They allow separation of keyword text from non-keyword background. While proper modeling of the non-keywords reduces the false positive rate, proper modeling of the keywords increases the true positive rate. We investigate several filler models based on sub-words, i.e. characters or sequences of characters. Table 1 summarizes the filler models that we propose and evaluate, including characters, bigram sub-words and trigram sub-words. All these models are compared based on accuracy, computational complexity and simplicity of implementation.

In the case of the context-dependent bigram and trigram models, all character models are trained based on their position in the context. All context-dependent characters not appearing in the keyword models are used as fillers. This technique requires an exceptionally large amount of training data in order to train the large number of context-dependent character models. In the case of the context-independent filler models, all non-keyword character sequences not appearing in the keywords are used as filler models. Since the number of non-keyword sequences is huge, this adds more complexity to the system.

Character filler models (CFMs) can significantly reduce the computational complexity, making them more attractive for real applications due to the smaller number of models and their high efficiency. Each CFM is an HMM with exactly the same implementation as the character models but trained on different classes. The number of CFMs is expected to affect the performance, so different numbers of CFMs are evaluated for each language. The clustering of these CFMs is implemented as described in Algorithm 1. The candidate keywords from the filler models are pruned using the word background models to efficiently reduce the false positive rate.

Algorithm 1. Clustering of character filler models.

INPUT: HMM character models, validation dataset, number of required filler models.
OUTPUT: Character filler models.
Initialization: INPUT ← HMM character models; OUTPUT ← ∅.

Step 1:
for all character models in INPUT do
    for all other character models in INPUT do
        Merge the character pair models.
        PAIRS[accuracy, character pair] ← validation set accuracy after merging.
    end for
    MaxPair ← pair with maximum accuracy in PAIRS.
    Merge the corresponding pair (MaxPair) and store it in the OUTPUT array.
    Delete the pair from INPUT.
end for

Step 2: Label the validation dataset according to the new models.

Step 3:
if size of OUTPUT == number of filler models then
    stop
else
    INPUT ← OUTPUT; OUTPUT ← ∅; go to Step 1.
end if
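A simplified, runnable analog of this greedy merging is sketched below. The `eval_accuracy` callback stands in for re-decoding the validation set after each candidate merge, and the single-merge-per-iteration loop is a simplification of the paper's round-based Step 1–3 structure.

```python
def cluster_fillers(models, n_fillers, eval_accuracy):
    """Greedily merge character models into n_fillers groups.

    models:        list of character-model identifiers
    n_fillers:     desired number of character filler models
    eval_accuracy: callback scoring a candidate merged group
                   (stand-in for validation-set accuracy)
    """
    groups = [frozenset([m]) for m in models]
    while len(groups) > n_fillers:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                acc = eval_accuracy(groups[i] | groups[j])
                if best is None or acc > best[0]:
                    best = (acc, i, j)      # keep the highest-scoring merge
        _, i, j = best
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return groups
```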

Table 2. Number of text lines used for training, validation and testing, and the number of unique characters of each script.

Dataset                Training lines  Testing lines  Validation lines  # Writers  # Characters  Avg. char/model  # Unique characters
IAM Data (English)     3000            1000           1000              650        168,013       2625             64
AMA Data (Arabic)      5700            1000           1000              35         272,180       1756             155
LAW Data (Devanagari)  3000            1000           1000              1000       119,980       1714             70

Fig. 5. Dataset samples, (a) Arabic, (b) English, (c) Devanagari.


3.4. Background model

Score normalization is an effective method for enhancing accuracy. It is applied as a rejection strategy to decrease the false positive rate [25]. In this paper we present two novel methods for score normalization. The first method is based on score normalization between the candidate keyword and non-keyword scores, as shown in Fig. 3; we refer to it as the lexicon based background model. The other method is based on the character filler models, as shown in Fig. 4, and is referred to as the character based background model.

3.4.1. Lexicon based background model

In this technique, the background model is represented by all or a subset of the non-keywords. A reduced lexicon is used to overcome the high computational complexity that results from using all non-keywords. The candidate keyword recognized in the filler model stage is either correct or similar to the keyword, due to the fact that the filler models represent the non-keyword regions. The reduction in size of the background model is based on the Levenshtein distance between all non-keywords in the dictionary and the candidate keyword text: a non-keyword is added to the reduced lexicon if its edit distance to the candidate keyword is less than a certain threshold. Different reduction ratios are studied in the Experiments section. The reduced lexicon can be computed once for each keyword without adding more computational cost to the system. In general, for lexicon based background models the likelihood ratio R between the keyword candidate score (S_score) and the word background model score (S_lexicon_score) is given by

R = S_score / S_lexicon_score  (5)

If R is larger than 1, the candidate is most likely a keyword. The likelihood ratio R is normalized by the keyword width W, and a positive match is declared if the normalized likelihood score is greater than a certain threshold T:

R / W > T  (6)
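The lexicon reduction and the decision rule of eqs. (5)–(6) can be sketched as below. The edit-distance threshold value is illustrative; the paper reports reduction ratios rather than a fixed distance.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def reduced_lexicon(candidate_text, non_keywords, max_dist=2):
    """Keep only non-keywords close to the candidate in edit distance."""
    return [w for w in non_keywords if edit_distance(candidate_text, w) <= max_dist]

def is_keyword(s_score, s_lexicon_score, width, threshold):
    """Eqs. (5)-(6): accept if (S_score / S_lexicon_score) / W > T."""
    r = s_score / s_lexicon_score
    return r / width > threshold
```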

3.4.2. Character based background model

The second background model is based on the character filler models, as shown in Fig. 4. The candidate keyword is evaluated over the background models as the best path between the candidate keyword characters and their corresponding character filler models, thus obtaining the amount of separation between the keyword and the background. The complexity of this technique is very low compared to the lexicon free and reduced lexicon techniques. The normalized likelihood score is the ratio R between the keyword candidate score (S_score) and the sum of the background character scores (S_bkscore):

R = S_score / Σ_i S_bkscore(i)  (7)

The closer R is to 1, the more likely the candidate is a keyword. The likelihood ratio R is normalized by the keyword width (W), and a positive match is declared if the normalized likelihood score lies within certain thresholds; the values of T1 and T2 are evaluated experimentally on a validation dataset to maximize the mean average precision:

T1 > R/W > T2  (8)

Fig. 6. Filler types performance for 100 keywords.

Table 3. MAP results for combinations of sets of features.

Features                Description                                                       MAP (English)  MAP (Arabic)  MAP (Devanagari)
Profile features        The nine profile features presented by Marti and Bunke [20]      47.2           46.1          45.8
GSC [11]                Gradient, structural, and concavity (Favata and Srikantan [11])  48.6           50.2          51.4
Intensity features (I)  4×2 cells known as bins, as described by Vinciarelli             41.1           40.1          42.6
IG                      Intensity and gradient combined                                  54.2           53.7          53.8
IS                      Intensity and structural combined                                49.6           48.5          50.3
IC                      Intensity and concavity combined                                 51.7           50.9          51.3
IGSC                    Intensity, gradient, structural, and concavity combined          50.9           50.7          50.6
IGS                     Intensity, gradient, and structural combined                     47.6           48.6          49.6
ISC                     Intensity, structural, and concavity combined                    49.5           49.7          50.5
IGC                     Intensity, gradient, and concavity combined                      52.2           51.1          51.3

Table 4. Mean precision of the filler types according to the different numbers of keywords in the list.

Filler type                  Keyword list size
                             30      100     500
Character filler models      63.2    58.98   52.69
Context-independent bigram   21.44   22.99   24.36
Context-independent trigram  17.14   17.36   18.38
Context-dependent bigram     28.58   26.37   24.27
Context-dependent trigram    35.18   32.64   30.33

Fig. 7. Number of character filler models vs. the mean average precision (MAP) for English.

Fig. 8. Number of character filler models vs. the MAP for Arabic.

Fig. 9. Number of character filler models vs. the MAP for Devanagari.
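The decision rule of eqs. (7)–(8) is a one-liner; the threshold values in the test are illustrative, not the tuned values from the paper.

```python
def char_background_match(s_score, char_bk_scores, width, t1, t2):
    """Eqs. (7)-(8): R = S_score / sum_i S_bkscore(i);
    accept the candidate if T1 > R/W > T2."""
    r = s_score / sum(char_bk_scores)   # ratio to summed character filler scores
    return t1 > r / width > t2
```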

4. System complexity

The word-spotting framework contains two main parts: thekeyword, and the filler and the background models. We found thatCFMs are the best filler models due to their small number andtheir high efficiency. In this section, we calculate the complexity ofthe full system for the lexicon-based and the character-basedbackground models described in Section 3. For the word spottingsystem, the complexity of the Viterbi algorithm recognizing a lineof length L is measured by Oðm2LÞ, where m is the number ofmodels.

The complexity of the system using the lexicon-based background models, complexity_Lb, is given by

complexity_Lb = keyword models' complexity + background models' complexity
             = O(k²L) + O(r²Ln)  (9)

where k is the number of keywords, L is the length of the text line, Ln is the length of the candidate keywords detected in line L, and r is the reduced lexicon size.

The complexity of the system using the character-based background models is given by

complexity_cb = keyword models' complexity + background models' complexity
             = O(k²L) + O(c²Ln)  (10)

where c is the average number of keyword characters. Note that, since the number of character filler models is constant and usually small for each script, we ignore them when evaluating the system complexity.

In general, as the size of the keyword list (k) increases, the complexity also increases. The lexicon size (r) has a large effect on the complexity: if the lexicon is strongly reduced, the complexity is also significantly reduced. The experimental results showed that a 93% reduced lexicon works approximately as efficiently as the full lexicon. In the character-based background model, the complexity is lower than in the lexicon-based approach because c is much smaller than r; the average number of keyword characters does not exceed 15 for English, Arabic, or Devanagari, resulting in a huge reduction in the complexity.
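The two cost expressions can be compared numerically. All concrete numbers below are illustrative assumptions (k, L, Ln and the lexicon size are made up); only c = 15 and the 93% reduction figure come from the text.

```python
def spotting_cost(k, line_len, m, cand_len):
    """Operation-count estimate from eqs. (9)-(10): k^2*L + m^2*Ln,
    where m is r (reduced lexicon size) or c (avg. keyword characters)."""
    return k**2 * line_len + m**2 * cand_len

# Illustrative comparison: a hypothetical 10,000-word lexicon,
# the same lexicon reduced by 93%, and the character-based model (c = 15).
full = spotting_cost(100, 500, 10_000, 50)
reduced = spotting_cost(100, 500, 700, 50)
char_based = spotting_cost(100, 500, 15, 50)
```

With these made-up inputs the background term dominates the full-lexicon cost, while in the character-based model it is negligible next to the keyword pass, matching the paper's conclusion that c ≪ r yields a huge reduction.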

5. Experimental setup

We evaluate our system on three publicly available datasets: the IAM dataset [21] for English, the AMA dataset [1] for Arabic, and the LAW dataset [16] for Devanagari.

IAM English dataset: a modern English handwritten dataset consisting of 1539 pages of text from the Lancaster-Oslo/Bergen corpus [17], written by 657 writers. For more details about this dataset, refer to Marti and Bunke [21].

AMA Arabic dataset: a modern Arabic handwritten dataset of 200 unique documents, consisting of 5000 documents transcribed by 25 writers using various writing utensils. For more details about this dataset, refer to AMA [1].

LAW Devanagari dataset: the Devanagari lines are synthetically formed by randomly concatenating up to 5 words, separated by a random space size, from a dataset containing 26,720 handwritten words written in the Hindi and Marathi languages (Devanagari script). For more details about this dataset, refer to Jayadevan et al. [16].

Table 2 summarizes the statistics of the datasets used for training and validation, and sample line images are shown in Fig. 5.

Fig. 10. Performance of the background models.

Table 5
MAP results for the different types of background models.

BKM type            MAP

0% reduction        58.98
87% reduction       54.63
93% reduction       55.36
96% reduction       54.02
Char-based model    40.12

S. Wshah et al. / Pattern Recognition 47 (2014) 1039–1050 1045

5.1. Performance evaluation

We evaluate the proposed word spotting system with filler and word background models on the IAM data because of its complexity and large variation. The results are measured using recall, precision, and mean average precision (MAP):

Recall = TP / (TP + FN)        (11)

Precision = TP / (TP + FP)        (12)

where TP is the number of true positives, FN the number of false negatives, and FP the number of false positives. Mean average precision (MAP) is given by the area under the recall vs. precision curve.
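Under Eqs. (11) and (12) these metrics are one-liners; the trapezoidal integration below is our reading of "area under the recall vs. precision curve", not code from the paper:

```python
def recall(tp, fn):
    # Eq. (11): fraction of true keyword instances that were found
    return tp / (tp + fn)

def precision(tp, fp):
    # Eq. (12): fraction of detections that are真 -- correct hits
    return tp / (tp + fp)

def mean_average_precision(recalls, precisions):
    """Area under the recall-precision curve via the trapezoidal
    rule; the points are assumed sorted by increasing recall."""
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * \
                (precisions[i] + precisions[i - 1]) / 2
    return area
```

For example, with 8 true positives, 2 false negatives, and 2 false positives, both recall and precision are 0.8.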

5.2. Performance with different features

We implemented the three main methods that extract features from the sliding window, as discussed in Section 3.1: the profile features by Marti and Bunke [20], and the window features by Vinciarelli and Luettin [35] and by Favata and Srikantan [10]. The results of these features and their combinations on the word-spotting system are reported in Table 3, which shows the performance of the word-spotting system with the different types of features. We randomly selected 100 keywords and used the 93% reduced lexicon based BKM. The profile features presented in [20] use a sliding window of one pixel width consisting of nine geometrical features, as shown in Section 3.1. In the features presented by Vinciarelli and Luettin [35] and Favata and Srikantan [11], the width of the sliding window is more than one pixel; we refer to these as window features. Vinciarelli and Luettin [35] split the sliding window into 4x4 cells known as bins. The pixel count in each bin is taken as a feature, resulting in a 16-dimensional feature vector; we refer to these as intensity features (I). The gradient, structural, and concavity (GSC) features presented by Favata and Srikantan [11] showed state-of-the-art performance for Arabic handwritten word recognition [29]. GSC features measure the image characteristics at local, intermediate, and large scales, as discussed in Section 3.1. Profile features and window features cannot be combined due to the locality of the profile features, which operate on a one-pixel-wide window. The combinations of the intensity (I) and GSC features are shown in Table 3.

For each feature, we chose the optimum parameters that achieve the highest performance. In the combined features, we used two vertical bins, as described in Section 3.1. The results showed that the best features are the intensity and the gradient features. As the features are extracted from the sliding window area only, the sliding window acts as a local measurement unit, and the overlap between successive windows captures the sequential relationship between the extracted windows. The intensity and the gradient features showed high accuracy because they represent a local description of the image pixels. The structural and concavity features did not yield promising results because they capture intermediate and large-scale characteristics, which are not usually represented within a sliding window.
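The binned pixel-count (intensity) features can be sketched as follows; the binary-window assumption and a 4x4 grid (chosen to match the 16-dimensional vector mentioned above) are ours:

```python
import numpy as np

def intensity_features(window, rows=4, cols=4):
    """Intensity features from one sliding window: split the
    binary window into rows x cols cells (bins) and count the
    foreground pixels in each, giving a 16-dim vector for a
    4x4 grid. The exact grid split is an assumption here."""
    h, w = window.shape
    feats = []
    for r in range(rows):
        for c in range(cols):
            cell = window[r * h // rows:(r + 1) * h // rows,
                          c * w // cols:(c + 1) * w // cols]
            feats.append(cell.sum())   # pixel count in this bin
    return np.array(feats, dtype=float)
```

In a full pipeline these vectors, one per overlapping window position, would form the observation sequence fed to the HMMs.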

5.3. Evaluation of filler models

We evaluate the five filler models discussed in Section 3.3 on the IAM dataset, which consists of samples taken from a large number of writers with about 9000 unique words. The background model was held fixed and consisted of all non-keywords in the corpus. Three experiments were carried out for all filler types with different keyword list sizes: 30, 100, and 500. Fig. 6 shows the results for 100 random keywords, and the mean precision of the filler types for different numbers of keywords in the list is shown in Table 4. The character filler models outperform all the other types. Their superior performance and the low complexity associated with their small number make them the most attractive model. The system performance was affected by the size of the keyword list because the rate of false positives increases as the list grows, thereby decreasing the precision.

5.4. Evaluation of the number of character fillers

As the character filler models outperformed all the other filler types, we evaluated our system with different numbers of character filler models. The models' excellent performance and low

Fig. 11. Performance with varying numbers of keywords.

Table 6
Performance with varying numbers of keywords.

Keyword list size    MAP

30     62.02
100    57.70
500    49.32


complexity make them the most attractive model. Each language has a different number of character models and, thus, a different optimum number of filler models. The optimum number of character filler models was evaluated experimentally. Figs. 7, 8, and 9 show the performance for different numbers of character fillers for English, Arabic, and Devanagari, respectively. The best number of CFMs found for English, Arabic, and Devanagari was 4, 11, and 7, respectively.

5.5. Word background

Both the lexicon-based background model and the character-based word background model are evaluated on the English IAM dataset with 100 randomly selected keywords. The number of character filler models was fixed at 4. The main reason for using the lexicon-based background model is that it allows lexicon reduction, based on the Levenshtein distance between lexicon words and the keyword text, which cuts the complexity with only a slight effect on performance. Many experiments were run with the full and reduced lexicons at different reduction ratios, as shown in Fig. 10 and Table 5. The results of the character-based background models are also shown in Fig. 10, demonstrating that this method can spot keywords without knowing the full non-keyword list.
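A sketch of Levenshtein-based lexicon reduction, assuming the reduced lexicon keeps the words closest in edit distance to the keyword; the function names and the keep ratio are illustrative, not the paper's:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def reduce_lexicon(lexicon, keyword, keep_ratio=0.07):
    """Keep only the words closest in edit distance to the keyword;
    a keep_ratio of 0.07 corresponds to the 93% reduction used in
    the experiments (the exact selection rule is an assumption)."""
    n_keep = max(1, int(len(lexicon) * keep_ratio))
    return sorted(lexicon, key=lambda w: levenshtein(w, keyword))[:n_keep]
```

Keeping the near neighbours of the keyword preserves exactly the confusable non-keywords the background model needs, which is why the reduction barely hurts performance.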

The main reason for using the lexicon reduction method is to reduce the computational complexity without affecting the performance. The reduction based on Levenshtein distance is an effective method for shrinking large lexicons; it has only a slight effect on the system's performance, due to the similarity between the detected candidate and the keyword text, while achieving a huge reduction in computational complexity. Fig. 10 and Table 5 show the performance of the different background models, including the character-based method and the lexicon-based method with different reduction ratios. The experiments were performed on an

Fig. 13. Performance comparison with Fischer's algorithm on different keyword sizes.

Table 7
MAP results of our proposed system and Fischer's algorithm for different numbers of keywords.

System                 Number of keywords    MAP

Fischer                100                    9.39
                       500                   11.54
Our proposed system    100                   57.70
                       500                   49.32

Fig. 12. System performance with English, Arabic, and Devanagari.


English IAM dataset using 100 randomly selected keywords and the four optimum character filler models. The IAM dataset was used due to its large number of writers and its large lexicon size (up to 9000 unique words).

The lexicon-based background models showed high performance at high lexicon reduction ratios. For a lexicon reduction of 93%, the performance decreased by only 6.5%, due to the similarity between the detected candidates and the keyword text. The low-complexity character-based models showed promising results relative to their level of complexity.

5.6. Performance with different numbers of keywords

The size of the keyword list is also a major criterion for evaluating a word-spotting system. The performance of a word-spotting system gradually decreases as the number of keywords increases, but the drop should not be too steep. We investigated the effect of the number of keywords on the IAM dataset. The best performing filler models, the CFMs, were used with a background model lexicon reduced at a 93% ratio. The results are shown in Fig. 11 and Table 6. As the number of keywords increased, more false positives were detected and thus the performance of the system decreased. The system showed high performance for a reasonable number of keywords (100).

5.7. Performance on Arabic and Devanagari

The proposed system is independent of the script used. We evaluated the proposed approach on other scripts, namely Arabic and Devanagari. We used the IAM dataset for English, the AMA dataset for Arabic, and the LAW dataset for Devanagari. We randomly selected 100 keywords for Arabic and English and 30 keywords for Devanagari. For the fillers, we used the CFMs and two BKM types: 93% reduced lexicon background models and character-based background models. The data used for training and validation is summarized in Table 2. The results showed high performance for English, Arabic, and Devanagari. To the best of our knowledge, we are the first to apply a line-based approach to Arabic and Devanagari. The results are shown in Fig. 12.

5.8. System comparison

We compared the proposed system with the method described in [13]. In their work, no specific information was reported about the experimental setup, such as the keywords used. We therefore implemented their method using the same features described in Section 3.1, so as to compare the strength of the models only. The systems were compared on keyword lists of sizes 100 and 500, as shown in Fig. 13 and Table 7, and on other scripts, namely Arabic and Devanagari, as shown in Fig. 14 and Table 8. We used 100 keywords for Arabic and English and 30 keywords for Devanagari. We also compared the speed of [13] with that of our proposed system. We built their algorithm using the same features and HMM models used in our approach. Both systems were compared in exactly the same environment: the same machine with 4 GB RAM and an Intel Core 2 Duo 3.17 GHz CPU, and the same training and testing data. Tables 7 and 8 show that our proposed system outperforms Fischer's algorithm on English, Arabic, and Devanagari for different keyword list sizes. The best results of Fischer's algorithm are obtained when each line contains only one keyword. Table 9 shows the speed comparison of Fischer et al. [13] with our proposed system for the two proposed background models (lexicon based and character based). The results in Table 9 show that the lexicon-free approach presented by Fischer was faster than the proposed lexicon-based approach when there are few keywords. For larger numbers of keywords (100), both systems had approximately the same speed. On the other hand, the proposed character-based system was much faster than [13], irrespective of the number of keywords.

6. Conclusion

We proposed a statistical script independent word spotting system based on hidden Markov models (HMMs). The system uses a line-based learning technique to learn individual character models. We propose and experiment with different types of filler

Table 8
MAP results of our proposed system and Fischer's algorithm for different scripts.

System                 Script        MAP

Fischer                English        9.39
                       Arabic         4.18
                       Devanagari     8.72
Our proposed system    English       57.70
                       Arabic        55.10
                       Devanagari    55.30

Fig. 14. Performance on English, Arabic, and Devanagari compared with Fischer's line-based approach.


and background models on three different scripts: English [21], Arabic [1], and Devanagari [16]. Filler models are used for better representation of non-keyword image regions. Efficient score normalization techniques using the lexicon-based and character-based background models are implemented to reduce the false positive rate. Apart from being script independent, the system has the added advantage of avoiding word segmentation, which is crucial for scripts such as Arabic where word segmentation accuracy is quite low. We also outperform the state-of-the-art line-based learning approach. In the future, language modeling of keywords and non-keywords will be investigated to increase system performance: a language model assigns a probability to a sequence of words, so language models between keywords and non-keywords, as well as separate language models for keywords and for non-keywords, can be evaluated. Working on the top-n recognition results instead of the top-1 can also be investigated for better performance. In addition, extending the research to regular expressions would be interesting.

Conflict of interest statement

None declared.

References

[1] Applied Media Analysis, Arabic-Handwritten-1.0, 2007. URL ⟨http://appliedmediaanalysis.com/Datasets.htm⟩.

[2] M. Brand, V. Kettnaker, Discovery and segmentation of activities in video, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 844–851.

[3] H. Bunke, Recognition of cursive Roman handwriting—past, present and future, in: ICDAR, 2003, p. 448.

[4] J. Chan, C. Ziftci, D. Forsyth, Searching off-line Arabic documents, in: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition—Volume 2, CVPR '06, IEEE Computer Society, Washington, DC, USA, 2006, pp. 1455–1462. URL http://dx.doi.org/10.1109/CVPR.2006.269.

[5] C. Choisy, Dynamic handwritten keyword spotting based on the NSHP-HMM, in: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 1, September 2007, pp. 242–246.

[6] D.S. Doermann, The indexing and retrieval of document images: a survey, Computer Vision and Image Understanding 70 (3) (1998) 287–298.

[7] J. Edwards, Y.W. Teh, D. Forsyth, R. Bock, M. Maire, G. Vesom, Making Latin manuscripts searchable using gHMMs, in: NIPS, vol. 17, 2005, pp. 385–392.

[8] M. El-Yacoubi, M. Gilloux, J.-M. Bertille, A statistical approach for phrase location and recognition within a text line: an application to street name recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (February (2)) (2002) 172–188.

[9] R. Farrahi Moghaddam, M. Cheriet, M.M. Adankon, K. Filonenko, R. Wisnovsky, Ibn Sina: a database for research on processing and understanding of Arabic manuscripts images, in: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, DAS '10, ACM, New York, NY, USA, 2010, pp. 11–18. URL http://doi.acm.org/10.1145/1815330.1815332.

[10] J.T. Favata, G. Srikantan, A multiple feature/resolution approach to handprinted digit and character recognition, International Journal of Imaging Systems and Technology 7 (4) (1996) 304–311. URL http://dx.doi.org/10.1002/(SICI)1098-1098(199624)7:4<304::AID-IMA5>3.0.CO;2-C.

[11] J.T. Favata, G. Srikantan, A multiple feature/resolution approach to handprinted digit and character recognition, International Journal of Imaging Systems and Technology 7 (4) (1996) 304–311. URL http://dx.doi.org/10.1002/(SICI)1098-1098(199624)7:4<304::AID-IMA5>3.0.CO;2-C.

[12] G.A. Fink, Markov Models for Pattern Recognition: From Theory to Applications, Springer, 2008.

[13] A. Fischer, A. Keller, V. Frinken, H. Bunke, HMM-based word spotting in handwritten documents using subword models, in: Proceedings of the 2010 20th International Conference on Pattern Recognition, ICPR '10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 3416–3419. URL http://dx.doi.org/10.1109/ICPR.2010.834.

[14] V. Frinken, A. Fischer, H. Bunke, R. Manmatha, Adapting BLSTM neural network based keyword spotting trained on modern data to historical documents, in: Proceedings of the 2010 12th International Conference on Frontiers in Handwriting Recognition, ICFHR '10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 352–357. URL http://dx.doi.org/10.1109/ICFHR.2010.61.

[15] V. Frinken, A. Fischer, R. Manmatha, H. Bunke, A novel word spotting method based on recurrent neural networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012) 211–224.

[16] R. Jayadevan, S.R. Kolhe, P.M. Patil, U. Pal, Database development and recognition of handwritten Devanagari legal amount words, in: International Conference on Document Analysis and Recognition, 2011, pp. 304–308.

[17] S. Johansson, G. Leech, H. Goodluck, Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers, 1978.

[18] Y. Leydier, A. Ouji, F. LeBourgeois, H. Emptoz, Towards an omnilingual word retrieval system for ancient manuscripts, Pattern Recognition 42 (September (9)) (2009) 2089–2105. URL http://dx.doi.org/10.1016/j.patcog.2009.01.026.

[19] R. Manmatha, C. Han, E.M. Riseman, Word spotting: a new approach to indexing handwriting, in: Proceedings of the 1996 Conference on Computer Vision and Pattern Recognition (CVPR '96), IEEE Computer Society, Washington, DC, USA, 1996, p. 631. URL ⟨http://dl.acm.org/citation.cfm?id=794190.794536⟩.

[20] U.-V. Marti, H. Bunke, Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system, International Journal of Pattern Recognition and Artificial Intelligence 15 (1) (2001) 65–90.

[21] U.-V. Marti, H. Bunke, The IAM-database: an English sentence database for offline handwriting recognition, International Journal on Document Analysis and Recognition 5 (2002) 39–46. URL http://dx.doi.org/10.1007/s100320200071.

[22] R. Plamondon, S.N. Srihari, On-line and off-line handwriting recognition: a comprehensive survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (January (1)) (2000) 63–84. URL http://dx.doi.org/10.1109/34.824821.

[23] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, in: Proceedings of the IEEE, 1989, pp. 257–286.

[24] T.M. Rath, R. Manmatha, Word spotting for historical documents, International Journal on Document Analysis and Recognition (2007) 139–152.

[25] J.A. Rodríguez-Serrano, F. Perronnin, Handwritten word-spotting using hidden Markov models and universal vocabularies, Pattern Recognition 42 (9) (2009) 2106–2116. URL ⟨http://www.sciencedirect.com/science/article/pii/S0031320309000673⟩.

[26] M. Rusiñol, D. Aldavert, R. Toledo, J. Lladós, Browsing heterogeneous document collections by a segmentation-free word spotting method, in: 2011 International Conference on Document Analysis and Recognition, 2011, pp. 63–67. URL ⟨http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6065277⟩.

[27] R. Saabni, J. El-Sana, Keyword searching for Arabic handwritten documents, in: 11th International Conference on Frontiers in Handwriting Recognition (ICFHR 2008), Montreal, 2008, pp. 716–722.

[28] S. Saleem, H. Cao, K. Subramanian, M. Kamali, R. Prasad, P. Natarajan, Improvements in BBN's HMM-based offline Arabic handwriting recognition system, in: Proceedings of the 2009 10th International Conference on Document Analysis and Recognition, ICDAR '09, IEEE Computer Society, Washington, DC, USA, 2009, pp. 773–777. URL http://dx.doi.org/10.1109/ICDAR.2009.282.

[29] S. Saleem, H. Cao, K. Subramanian, M. Kamali, R. Prasad, P. Natarajan, Improvements in BBN's HMM-based offline Arabic handwriting recognition system, in: Proceedings of the 2009 10th International Conference on Document Analysis and Recognition, ICDAR '09, IEEE Computer Society, Washington, DC, USA, 2009, pp. 773–777. URL http://dx.doi.org/10.1109/ICDAR.2009.282.

Table 9
Speed of our proposed system and Fischer's algorithm [13] on IAM and AMA datasets with different numbers of keywords.

Dataset           Number of    Proposed system       Proposed system       Fischer's system
                  keywords     (char. based)         (lexicon based)       speed/line (ms)
                               speed/line (ms)       speed/line (ms)

IAM for English   50            69                   230                   180
                  100          142                   260                   255
                  500          472                   623                   758

AMA for Arabic    50            70                   236                   223
                  100          136                   273                   295
                  500          485                   676                   1023


[30] M. Schenkel, I. Guyon, D. Henderson, On-line cursive script recognition using time-delay neural networks and hidden Markov models, Machine Vision and Applications 8 (4) (1995) 215–223.

[31] Z. Shi, S. Setlur, V. Govindaraju, A steerable directional local profile technique for extraction of handwritten Arabic text lines, in: ICDAR, 2009, pp. 176–180.

[32] S.N. Srihari, H. Srinivasan, C. Huang, S. Shetty, Spotting words in Latin, Devanagari and Arabic scripts, Vivek: A Quarterly Indian Journal of Artificial Intelligence (2006).

[33] S. Thomas, C. Chatelain, L. Heutte, T. Paquet, An information extraction model for unconstrained handwritten documents, in: Proceedings of the 2010 20th International Conference on Pattern Recognition, ICPR '10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 3412–3415. URL http://dx.doi.org/10.1109/ICPR.2010.833.

[34] A. Vinciarelli, A survey on off-line cursive word recognition, Idiap-RR-43-2000, IDIAP, 2000.

[35] A. Vinciarelli, J. Luettin, Off-line cursive script recognition based on continuous density HMMs, Idiap-RR-25-1999, IDIAP, 1999.

[36] S. Wshah, G. Kumar, V. Govindaraju, Script independent word spotting in offline handwritten documents based on hidden Markov models, in: Proceedings of the 2012 13th International Conference on Frontiers in Handwriting Recognition, 2012.

[37] H. Yan, Skew correction of document images using interline cross-correlation, CVGIP: Graphical Model and Image Processing 55 (6) (1993) 538–543.

Safwan Wshah completed his PhD degree in Computer Science and Engineering at the University at Buffalo, State University of New York, USA, in 2012, and he is now working as a research scientist at Xerox. His areas of interest include document image processing, natural language processing, pattern recognition, and biometrics.

Gaurav Kumar received his MS degree in Computer Science from the University at Buffalo, State University of New York, USA, in 2011. Since then he has been a PhD student at the University at Buffalo under Dr. Venu Govindaraju. He has worked as a Research and Graduate Assistant at the University at Buffalo. His research interests include document analysis, keyword spotting, graphical models, and computer vision.

Venu Govindaraju is a SUNY Distinguished Professor of Computer Science and Engineering at the University at Buffalo, State University of New York. He received his B-Tech (Honors) from the Indian Institute of Technology (IIT), Kharagpur, and his PhD degree from UB. He is the founding director of the Center for Unified Biometrics and Sensors. He has authored more than 350 scientific papers and graduated 25 doctoral students. He has been a lead investigator on projects funded by government and industry for about 60 million dollars. Dr. Govindaraju is a recipient of the IEEE Technical Achievement award, and is a fellow of the AAAS, the ACM, the IAPR, the IEEE, and SPIE.
