[acm press the 2011 joint workshop - beijing, china (2011.09.17-2011.09.17)] proceedings of the 2011...

Script based Text Identification: A Multi-level Architecture

Ehtesham Hassan∗

Department of Electrical Engg., IIT Delhi

Ritu Garg†


Santanu Chaudhury‡


M Gopal§


ABSTRACT
Script identification in a multi-lingual document environment has numerous applications in the field of document image analysis, such as indexing and retrieval, or as an initial step towards optical character recognition. In this paper, we propose a novel hierarchical framework for script identification in bi-lingual documents. The framework presents a top-down approach by performing page, block/paragraph and word level script identification in multiple stages. We utilize texture and shape based information embedded in the documents at different levels for feature extraction. The prediction task at different levels of the hierarchy is performed by a Support Vector Machine (SVM) and a rejection based classifier defined using AdaBoost. Experimental evaluation of the proposed concept on document collections of Hindi/English and Bangla/English scripts has shown promising results.

Keywords
Script identification, shape descriptor, edge direction features

∗E-mail: [email protected]
†E-mail: [email protected]
‡E-mail: [email protected]
§E-mail: [email protected]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
J-MOCR-AND ’11 Beijing, China
Copyright 2011 ACM 978-1-4503-0685-0/11/09 ...$10.00.

1. INTRODUCTION
With increasing demand to create a paperless environment, script identification plays an important role in the recognition, retrieval and storage of digitized documents. India being a multi-lingual country, there exists a large collection of multi-lingual document images. The major sources of such documents are dictionaries, regional language books with translation in Hindi/English, official documents, and examination questionnaires. Existing OCR techniques assume that the script information is known beforehand. Additionally, the development of document indexing and retrieval applications also requires the document script information. In the existing literature, a large amount of work has been reported on script identification at page, block, line or word level. The techniques at page and block levels assume the existence of a single script in a given example, whereas word level techniques perform script prediction by processing each word individually. In practice, a multi-lingual document may contain words belonging to different scripts embedded in sentences at different places, e.g. figure 1. In such a scenario, the performance of existing techniques cannot be guaranteed.

We present an efficient solution for such problems by combining page, block and word level script identification in a hierarchical framework. We use the texture properties at page and block level to define the document feature representation. The block identification is performed by following the split and merge approach. An SVM classifier is applied for script prediction at page and block level. In this paper, we present a novel application of a rejection based classifier for script identification at word level by exploiting the structural features of word objects. We have considered bi-lingual document images with the following combinations, English/Hindi and English/Bangla, for testing the framework. The dataset does not contain synthetically generated document images. Rather, it has been taken from multiple real world sources such as books, magazines and newspapers, which guarantee the presence of multiple scripts at page/block/line level in non-uniform fashion.

The paper is organized as follows: Section 2 gives a brief survey of the related state-of-the-art. Section 3 describes document pre-processing, feature extraction, segmentation techniques and the proposed hierarchical framework for script identification. The experiments conducted to validate our framework are discussed in Section 4. Finally, we conclude the paper in Section 5.

2. RELATED WORK
Script identification has been a topic of regular research. A. Busch et al. [1] proposed a texture based script identification system using wavelet features. Spitz [19] examined upward concavities of connected components for distinguishing between Asian and European languages. Tan et al. [13] described texture analysis using multiple channel (Gabor) filters and gray level co-occurrence matrices, and Tan [20] proposed rotation invariant features for automatic script identification.

Figure 1: Bi-lingual document pages considered for script identification

Indian script recognition has been attempted by many researchers [10, 17, 2, 3, 12]. Harit et al. [2] proposed a method for identification of Indian languages by combining Gabor filter based techniques and a direction distance histogram classifier, considering Hindi, English, Malayalam, Bengali, Telugu and Urdu. Basavaraj and Subbareddy [12] proposed a neural network based system for script identification. Dhandra et al. [3] propose a method based on morphological reconstruction that overcomes the constraints over different font sizes. Sharma et al. [17] describe discriminating curvature features for script identification of document images considering eight Indian scripts.

A substantial amount of research has been reported in the literature on identifying scripts in multi-lingual Indian documents. Among the algorithms proposed for script identification at word level, Dhanya et al. [4] use Gabor filters and spatial spread features, Pal et al. [18] use topological and structural features, and Padma et al. [6] employ discriminating features. However, the recognition rates decrease sharply for longer words. Pati et al. [11] propose a word level script identification algorithm in a multi-script scenario for 11 Indian scripts. A combination of Gabor and discrete cosine transform (DCT) features has shown promising results, i.e., over 98% for bi-script and tri-script cases and above 89% for the eleven-script scenario. However, the identification accuracy decreases as the number of scripts in the documents increases. Additionally, word based identification is computationally expensive.

Many approaches for automatic script identification at text block and line level have been proposed in the literature. Pal et al. [9] proposed an automatic technique for separating the text lines from 12 Indian scripts. Padma et al. [8] described a projection profile method to determine scripts from tri-lingual documents. Lijun Zhou et al. [22] describe a method for Bangla/English script identification based on analysis of connected component profiles. S. F. Rashid et al. [16] present a discriminative learning approach for multi-script identification at connected component level using a convolutional neural network.

3. PROPOSED FRAMEWORK
The following discussion presents the novel framework for script identification. The proposed framework consists of three prediction stages and exploits the combination of texture and shape based features. The processing of subsequent stages depends on the classifier confidence for script prediction. At the first level of the hierarchy we aim to classify a page as mono-lingual or bi-lingual. Texture features provide a strong cue for examining the script dependent curvature distributions for identifying the dominant script at the page level. The overall architecture of the proposed framework is shown in figure 2.

Figure 2: Architecture of the proposed framework

An SVM based classifier assigns script labels to the pages and blocks with confidence scores δ1 and δ2 respectively. The confidence score is based on the object distance from the classifier boundary in the kernel space. We compute the SVM confidence measure by applying the sigmoid function over the SVM output. The SVM output for input x is defined as

$$y = \sum_{i=1}^{N_{sv}} w_i y_i K(X_i, x) + b \qquad (1)$$

In SVM theory [15], $\{w_i\}_{i=1}^{N_{sv}}$ are the weight parameters learned by the training process for the selected kernel function K. $\{X_i\}_{i=1}^{N_{sv}}$ is the set of support vectors defining the classification boundary margin, $\{y_i\}_{i=1}^{N_{sv}}$ are the corresponding labels and b is the bias. The probabilistic conversion of the SVM output [14] is obtained as

$$p(C_1 \mid x) = \frac{1}{\exp(-y) + 1} \qquad (2)$$

The measure p(C1|x) represents the posterior probability of point x belonging to class C1. We consider p(·|x) as the confidence score of the SVM output, with the assigned label given by sign(y). The confidence score of the prediction is the criterion for further processing of the page/block at the next stage. For our experimental dataset, we have observed the confidence score range 0.3 < δ1 < 0.65 for block level script identification. The transition from one stage to another requires block and word segmentation from pages and blocks respectively. The details of the proposed framework are discussed as follows.
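As an illustration of Eqs. (1) and (2), the following is a minimal Python sketch and not the implementation used in the experiments. The RBF kernel, the value of `gamma`, and the toy support vectors, labels and weights are assumptions made only for the example; the paper does not state which kernel was used.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel; the kernel choice and gamma are assumptions."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def svm_output(x, support_vectors, labels, weights, bias, kernel=rbf_kernel):
    """Eq. (1): y = sum_i w_i * y_i * K(X_i, x) + b."""
    return sum(w * y * kernel(sv, x)
               for sv, y, w in zip(support_vectors, labels, weights)) + bias

def confidence(y):
    """Eq. (2): p(C1 | x) = 1 / (exp(-y) + 1)."""
    return 1.0 / (math.exp(-y) + 1.0)

# Toy model with two support vectors (all values are made up).
svs = [(0.0, 0.0), (1.0, 1.0)]
labels = [+1, -1]
weights = [1.0, 1.0]

y = svm_output((0.1, 0.0), svs, labels, weights, bias=0.0)
delta = confidence(y)
print("SVM output:", y, "confidence:", delta)
```

A point close to the positive support vector gives y > 0 and hence a confidence above 0.5, while a point on the decision boundary (y = 0) gives exactly 0.5.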

Document Pre-Processing
The document collection consists of gray-scale images. The pre-processing steps performed on our document collection are as follows:

1. Gray-scale to binary conversion, using the adaptive binarization technique proposed in [7].

2. Skew detection and correction.

3. Noise removal. The binarized document may contain noisy pixels that may degrade the overall system performance. To remove small noise such as speckle from the binary image, we have used an adaptive median filtering technique.
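To make the noise removal step concrete, the sketch below applies a plain 3x3 median filter to a small binary image. This is a simplified stand-in: the adaptive variant used in the paper is not specified in detail, and the fixed window size and the border handling (coordinate clamping) are assumptions. Binarization [7] and skew correction are not shown.

```python
from statistics import median

def median_filter_3x3(image):
    """Apply a 3x3 median filter to a 2-D list of 0/1 pixels.

    Border pixels are handled by clamping coordinates to the image.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            ]
            out[r][c] = int(median(window))
    return out

noisy = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],   # isolated speckle at row 1, column 2
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],   # a two-pixel-thick stroke
    [1, 1, 1, 1, 1],
]
clean = median_filter_3x3(noisy)
```

In this toy image the isolated pixel is removed while the thick stroke is preserved; thin one-pixel strokes would be eroded by a plain median filter, which is one motivation for adaptive variants.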

Texture Feature Extraction
We use the edge direction based features described in [17]. The edge based feature extracts the statistical distribution of curvature found in different Indian scripts. Indian scripts have either horizontal and vertical straight lines or curly construction with almost no straight lines. The Edge Direction Histogram (EDH) features are computed on a global image patch containing the script to extract the orientation distribution of the script dependent curvatures. The edge direction features are computed as follows. First, convolve the document image with horizontal and vertical Sobel masks. Then, threshold these edges to retain only strong edges. Next, compute the direction of each edge and create an edge histogram with b bins (b = 100 at page level, b = 50 at block level) to obtain a b-dimensional feature vector.
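The three steps above (Sobel filtering, thresholding, direction histogram) can be sketched in plain Python as follows. This is an illustrative sketch only: the threshold value, the use of the full [0, 2π) angle range, and the normalisation of the histogram to unit sum are assumptions not stated in the text.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def edge_direction_histogram(image, bins, threshold):
    """Return a normalised histogram of edge directions for a 2-D image.

    image: list of rows of numeric pixel values.
    bins: number of histogram bins b.
    threshold: minimum gradient magnitude for an edge to count (assumed).
    """
    rows, cols = len(image), len(image[0])
    hist = [0.0] * bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0.0
            for i in range(3):
                for j in range(3):
                    p = image[r + i - 1][c + j - 1]
                    gx += SOBEL_X[i][j] * p
                    gy += SOBEL_Y[i][j] * p
            if math.hypot(gx, gy) < threshold:
                continue  # keep only strong edges
            angle = math.atan2(gy, gx) % (2 * math.pi)
            index = min(int(angle / (2 * math.pi) * bins), bins - 1)
            hist[index] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# A vertical step edge: every strong gradient points along +x (angle 0).
step = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
features = edge_direction_histogram(step, bins=50, threshold=1.0)
```

For the step image all strong edges fall into the first bin, whereas a script with curved strokes would spread its mass over many bins; this difference in distribution is what the classifier exploits.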

Based on the confidence score of the script prediction, document pages are further segmented into blocks. The segmentation details are discussed below.

Profile Based Segmentation for Blocks and Words
The segmentation algorithm described here segments a document image into text blocks. The segmentation approach analyzes the horizontal (X) and vertical (Y) projection profiles of the document image. The widths of the valleys of the horizontal and vertical projection profiles help in efficient tuning of the segmentation parameters to detect appropriate vertical or horizontal separators. The position with the least profile height is marked as a valid separator. This approach, when applied to the horizontal profile, leads to line level segmentation.

Traditionally, in many image based applications, feature extraction is preferred over blocks instead of single text line stripes, because the relative distribution of edges over a block provides a strong argument for the classification. Hence, in the segmentation approach we further merge single text lines to obtain blocks/paragraphs. Merging single text lines into blocks/paragraphs is based on typographic parameters such as:

• Alignment
• Interline spacing
• Indentation
• White space surrounding the block

Blocks having the same alignment and interline spacing are merged. In order to merge the indented blocks correctly, we analyze the white space surrounding the block, and based on it we merge the current block with either the subsequent or the previous text line or block.

The word segmentation from text lines is performed by computing the vertical projection profile. The discontinuities in the text lines are dealt with by performing a morphological operation before the vertical profile computation. We have performed a dilation operation using a line structuring element of length l. By experimental observation we selected l = 7 as the optimum value. The local minima in the vertical profile define the possible word separators. These separators are projected over the original text lines for word segmentation. The word level script identification is described below.
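The projection profile idea for lines and words can be sketched as below. This is a simplified illustration under stated assumptions: separators are taken to be zero-valued runs of the profile rather than general local minima, the structuring element is centred on each pixel, and the helper names (`ink_runs`, `segment_lines`, `segment_words`) are hypothetical. The typographic merging of lines into blocks is not shown.

```python
def row_profile(image):
    """Number of ink pixels in each row (used to find text lines)."""
    return [sum(row) for row in image]

def column_profile(image):
    """Number of ink pixels in each column (used to find words)."""
    return [sum(col) for col in zip(*image)]

def ink_runs(profile):
    """Return (start, end) pairs, end exclusive, of non-zero runs."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def dilate_horizontal(image, length=7):
    """Binary dilation with a centred horizontal line of the given length."""
    half = length // 2
    cols = len(image[0])
    return [[1 if any(row[max(c - half, 0):min(c + half + 1, cols)]) else 0
             for c in range(cols)]
            for row in image]

def segment_lines(page):
    return ink_runs(row_profile(page))

def segment_words(line_image, length=7):
    """Word extents found on the dilated line (slightly wider than the ink)."""
    return ink_runs(column_profile(dilate_horizontal(line_image, length)))

# One text line: two character blobs one pixel apart, then a wide gap.
line = [[1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]]
raw_runs = ink_runs(column_profile(line))   # three runs: characters split
words = segment_words(line, length=7)       # two runs: small gap bridged
```

Without dilation the one-pixel gap between characters produces a spurious separator; with l = 7 the gap is bridged while the eight-pixel gap between words is kept.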

3.1 Script Identification at Word Level
The block level script identification may overlook the usage of multilingual words in a text block/paragraph. The SVM identifies the dominant script in the text block with confidence score δ2. In the case of blocks having composite words, the SVM classifier may assign correct script labels. However, the classifier confidence may not be sufficient. In this scenario, we proceed to script identification at word level. First, we need to identify the classifier confidence region for filtering such blocks. The confidence region should be sufficiently large to filter all the composite blocks. Additionally, the region should not filter many single script blocks, as this would increase the system complexity. We selected the confidence region 0.35 < δ2 < 0.8 based on preliminary observation on the training dataset. The word image segmentation from blocks is performed following the horizontal and vertical profile based approach discussed above. The feature representation for word images is discussed below.
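The confidence driven transition between stages can be sketched as a small routing function. This is one reading of the framework, not a definitive implementation: the classifier and segmentation callables are hypothetical placeholders, and the page level region (0.3, 0.65) is taken from the range reported for δ1 above, on the assumption that pages inside that range are passed on to block level processing.

```python
def in_region(score, region):
    """True when the score lies strictly inside the (low, high) region."""
    low, high = region
    return low < score < high

def identify(page, classify_page, split_blocks, classify_block,
             split_words, classify_word,
             page_region=(0.3, 0.65), block_region=(0.35, 0.8)):
    """Top-down script labelling; returns a list of (unit, label) pairs."""
    label, delta1 = classify_page(page)
    if not in_region(delta1, page_region):
        return [(page, label)]              # confident page level decision
    results = []
    for block in split_blocks(page):
        label, delta2 = classify_block(block)
        if not in_region(delta2, block_region):
            results.append((block, label))  # confident block level decision
            continue
        # Ambiguous block: fall through to word level identification.
        results.extend((w, classify_word(w)) for w in split_words(block))
    return results

# Stub components: a "page" is a list of strings, words starting with "E"
# stand for English words, everything else for Hindi words.
page = ["H1 H2", "H3 E1 H4"]
result = identify(
    page,
    classify_page=lambda p: ("Hindi", 0.5),
    split_blocks=lambda p: p,
    classify_block=lambda b: (
        "Hindi", 0.5 if any(w.startswith("E") for w in b.split()) else 0.9),
    split_words=lambda b: b.split(),
    classify_word=lambda w: "English" if w.startswith("E") else "Hindi",
)
```

Only the second block, whose confidence falls inside the region, is broken into words; the first block keeps its block level label, which is where the computational saving of the hierarchy comes from.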

3.1.1 Shape Based Word Representation
We apply the shape descriptor proposed in [5] for word image representation. The feature computation process extracts a set P of descriptor points on the word image by overlaying a logical grid. A descriptor point is identified as a transition point while traversing over a grid line. For the set P of n descriptor points on the object boundary, there are n(n − 1) point-pair arrangements. The distribution of these point-pair arrangements with respect to distance and orientation is referred to as the point distribution histogram (pdh). The shape descriptor is computed as the Fourier coefficients of the normalized pdh. To preserve the character sequence information in the word formation, the descriptor is computed by splitting the word image into a constant number of partitions. The descriptor is computed by arranging the partition-wise pdh in their respective order and computing the magnitude of the Fourier transform of the resulting image. For each partition, the pdh is computed from the points located in the corresponding partition. The partition based approach for descriptor computation localizes the effect of distortion and noise to the respective partition. Additionally, it helps to retrieve partially similar matches and preserves character sequence information. The descriptor computation steps are as follows:

• Computation of the pdh for all partitions w.r.t. the points located in each partition

• $h_{final} = \{h_1 : h_2 : \dots : h_{num\_parts}\}$, where $h_i$ represents the pdh for the i-th partition

• The shape descriptor F(P) is defined by the magnitude of the Fourier transform of $h_{final}$
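The descriptor steps listed above can be sketched as follows. This is a simplified illustration and not the descriptor of [5] in full: descriptor points are assumed to be given, partitions are equal-width vertical strips, a 1-D DFT over the concatenated histogram replaces the Fourier transform of the 2-D arrangement, and small bin counts are used instead of the 40 distance and 36 angular bins reported later.

```python
import cmath
import math

def pdh(points, dist_bins, angle_bins, max_dist):
    """Normalised distance/orientation histogram over ordered point pairs."""
    hist = [[0.0] * angle_bins for _ in range(dist_bins)]
    for i, (x1, y1) in enumerate(points):
        for j, (x2, y2) in enumerate(points):
            if i == j:
                continue  # n(n - 1) ordered pairs in total
            d = math.hypot(x2 - x1, y2 - y1)
            a = math.atan2(y2 - y1, x2 - x1) % (2 * math.pi)
            di = min(int(d / max_dist * dist_bins), dist_bins - 1)
            ai = min(int(a / (2 * math.pi) * angle_bins), angle_bins - 1)
            hist[di][ai] += 1.0
    total = sum(map(sum, hist))
    return [v / total if total else 0.0 for row in hist for v in row]

def dft_magnitude(signal):
    """Magnitude of the discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(signal)))
            for k in range(n)]

def shape_descriptor(points, width, height, num_parts=4,
                     dist_bins=4, angle_bins=6):
    """Concatenate per-partition pdh vectors and take the DFT magnitude."""
    part_width = width / num_parts
    max_dist = math.hypot(part_width, height)
    h_final = []
    for p in range(num_parts):
        lo, hi = p * part_width, (p + 1) * part_width
        part = [(x, y) for x, y in points if lo <= x < hi]
        h_final.extend(pdh(part, dist_bins, angle_bins, max_dist))
    return dft_magnitude(h_final)

# Eight made-up descriptor points on a 20 x 8 word image, two per partition.
points = [(1, 1), (2, 3), (6, 2), (7, 5), (11, 1), (12, 4), (17, 2), (18, 6)]
descriptor = shape_descriptor(points, width=20, height=8)
```

Because each partition contributes its own histogram in reading order, a distortion in one part of the word changes only that part of the concatenated vector before the transform is applied.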

3.1.2 Rejection Based Script Classification
We have applied a rejection based multistage classifier for word level script identification (figure 3). The classifier hierarchy is defined by following the classifier cascade presented in [21]. The hierarchical classifier targets the noisy words, i.e. words of a different script, as positives and rejects the dominant script words as negatives. The base classifiers at each stage are tuned with different threshold parameters to achieve a 100% true positive rate at different false positive rates. The rejection based approach therefore significantly reduces the identification time while improving performance. In this paper, we have used a four stage classifier hierarchy for prediction.

Figure 3: Hierarchical classifier for word level script identification
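The early rejection behaviour of such a cascade can be sketched as below. This is a generic illustration in the spirit of [21], not the trained classifier of the paper: each stage is represented by a hypothetical scoring function and threshold, and the toy stages simply compare a scalar feature against increasing thresholds.

```python
def cascade_predict(x, stages):
    """Return +1 (different-script word) only if every stage accepts x.

    stages: list of (score_fn, threshold) pairs. A word whose score falls
    below a stage threshold is rejected as dominant script (-1) right there.
    """
    for score_fn, threshold in stages:
        if score_fn(x) < threshold:
            return -1   # rejected early; later stages are never evaluated
    return +1

# Record which stages are evaluated to show the saving from early rejection.
calls = []

def make_stage(threshold):
    def score(x):
        calls.append(threshold)
        return x
    return (score, threshold)

# Four stages, matching the four stage hierarchy used in the paper;
# the threshold values themselves are made up.
stages = [make_stage(t) for t in (0.1, 0.3, 0.5, 0.7)]

easy_negative = cascade_predict(0.2, stages)
evaluated_for_negative = list(calls)
calls.clear()
positive = cascade_predict(0.9, stages)
evaluated_for_positive = list(calls)
```

Most dominant script words are discarded by the first stages, so the full set of stages is evaluated only for the comparatively few candidate words of the other script.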

We use Adaboost as the base classifier at each node of the hierarchy. The shape based feature extraction process represents word images by high dimensional vectors. Adaboost adaptively selects the most discriminative and complementary features from the training set, and formulates a strong classifier by a learning based combination of several weak classifiers. The strong classifier applied for the prediction task is defined as a parameterized linear combination of several weak classifiers:

$$H(x) = \operatorname{sign}\left\{ \sum_{m=1}^{T} w_m h_m(x) \right\} \qquad (3)$$

The parameter wm defines the contribution of each weakhypothesis hm for the prediction task. Weight parametersw are obtained by training. The iterative training processis defined in table 1.
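The training loop of table 1 and the decision rule of Eq. (3) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: decision stumps on single features are used as weak classifiers (the paper does not name the weak learner), the indicator in the error and update steps is taken to be 1 for a misclassified example as in the standard algorithm, and the error is clipped to avoid a division by zero.

```python
import math

def train_stump(X, y, alpha):
    """Pick the (feature, threshold, polarity) stump with least weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (+1, -1):
                err = sum(a for x, t, a in zip(X, y, alpha)
                          if (pol if x[f] >= thr else -pol) != t)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    _, f, thr, pol = best
    return lambda x: pol if x[f] >= thr else -pol

def adaboost_train(X, y, T):
    """Return a list of (w_m, h_m) pairs after T boosting rounds."""
    n = len(X)
    alpha = [1.0 / n] * n                      # initial example weights
    ensemble = []
    for _ in range(T):
        h = train_stump(X, y, alpha)
        miss = [1.0 if h(x) != t else 0.0 for x, t in zip(X, y)]
        eps = sum(a * m for a, m in zip(alpha, miss)) / sum(alpha)
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # guard against log(0)
        w = math.log((1 - eps) / eps)
        # Increase the weight of the examples that h misclassified.
        alpha = [a * math.exp(w * m) for a, m in zip(alpha, miss)]
        ensemble.append((w, h))
    return ensemble

def adaboost_predict(ensemble, x):
    """Eq. (3): H(x) = sign(sum_m w_m * h_m(x))."""
    s = sum(w * h(x) for w, h in ensemble)
    return 1 if s >= 0 else -1

# A 1-D toy problem that no single stump can solve (negatives in the middle).
X = [[0], [1], [2], [3], [4], [5]]
y = [+1, +1, -1, -1, +1, +1]
model = adaboost_train(X, y, T=3)
predictions = [adaboost_predict(model, x) for x in X]
```

The toy data illustrates why the weighted combination matters: each stump alone misclassifies at least two of the six points, but the weighted vote of the three stumps separates the middle interval from the rest.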

Table 1: Adaboost training algorithm

Consider the given training data as $\{(x_n, y_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$ and $y_n \in \{-1, +1\}$. Initialize $\alpha_1^n = 1/N$.
For each m = 1, ..., T:
1. Train the weak classifier $h_m$ using the training dataset with examples weighted by $\alpha_m$.
2. Compute the classification error for $h_m$ as
$$\epsilon_m = \frac{\sum_{n=1}^{N} \alpha_m^n \Delta(h_m(x_n), y_n)}{\sum_{n=1}^{N} \alpha_m^n}$$
where $\Delta(a, b)$ is an indicator function defined as 1 if a and b differ and 0 otherwise, so that $\epsilon_m$ is the weighted fraction of misclassified examples.
3. Compute the weight $w_m$ for the weak classifier $h_m$ as
$$w_m = \log\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$$
4. Update the data distribution as
$$\forall n,\quad \alpha_{m+1}^n = \alpha_m^n \exp\left(w_m \Delta(h_m(x_n), y_n)\right)$$
End

4. RESULTS AND DISCUSSION
The document collection for the validation of the proposed framework is compiled by scanning the supplementary books available for different types of courses, e.g. guide books and language training books. The document images are scanned in gray-scale at 300 dpi. A document image collection of 384 pages containing English/Hindi scripts and 526 pages containing English/Bangla scripts is used for the experiments. The document images have been annotated according to the dominant script. For page level script identification, the document collections are partitioned as 70% for training and 30% for testing by random sampling. The feature details are discussed in section 3. The SVM based average classification accuracy over five iterations for both document collections is listed in table 2.
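The evaluation protocol (random 70/30 partitioning, accuracy averaged over five iterations) can be sketched as follows. The function names, the use of the iteration index as random seed, and the `train_fn` / `accuracy_fn` callables are hypothetical; the paper does not describe how the random sampling was implemented.

```python
import random

def split_70_30(items, seed):
    """Randomly partition items into 70% training and 30% testing."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def average_accuracy(items, train_fn, accuracy_fn, iterations=5):
    """Average test accuracy over several independent random splits."""
    scores = []
    for seed in range(iterations):
        train, test = split_70_30(items, seed)
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / len(scores)

# 384 page identifiers, matching the size of the English/Hindi collection.
pages = list(range(384))
train_pages, test_pages = split_70_30(pages, seed=0)
```

Averaging over several random partitions reduces the dependence of the reported accuracy on one particular choice of test pages.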

Table 2: Page Level Script Identification

Document collection   Accuracy   Average confidence scores
English/Hindi         98.20%     Hindi 0.81, English 0.23
English/Bangla        98.76%     Bangla 0.78, English 0.25

Figure 4 shows the confidence scores obtained at block level script identification for a few sampled blocks. For the English/Hindi document collection we had 640 blocks belonging to Hindi and 810 blocks belonging to English. Similarly, for the English/Bangla document collection we had 741 blocks for Bangla and 912 blocks for English. SVM training is performed by conventional 70/30 partitioning as discussed above. The classification results with average confidence scores are presented in table 3.

Table 3: Block Level Script Identification

Document collection   Accuracy   Average confidence scores
English/Hindi         97.46%     Hindi 0.87, English 0.28
English/Bangla        96.89%     Bangla 0.85, English 0.29

The word image collection details are as follows. The Hindi word image collection contains 4606 words, the English word set contains 8416 words and the Bangla word dataset is composed of 6718 examples. The shape descriptor computation is performed with 40 distance and 36 angular bins by splitting the word image into four partitions, i.e. num_parts = 4. Word level identification accuracy of 98.6% for English/Hindi documents and 97.9% for English/Bangla documents is achieved. A few sampled blocks and the corresponding script labeled images are shown in figure 5 to establish the validity of the framework.

5. CONCLUSION
We have presented a novel framework to solve the script identification problem in bi-lingual documents. The proposed framework presents a robust and fast approach by unifying page, block and word level script identification. The approach therefore presents an efficient algorithm for script identification in real applications having documents with non-uniform distribution of languages/scripts. We have also shown a novel application of structural features in combination with a rejection based classifier for word level script identification. The evaluation of the proposed concept is presented on two document collections. The extension of the proposed framework to multi-lingual document script identification is part of future work.

6. REFERENCES
[1] A. Busch, W. W. Boles, and S. Sridharan. Texture for script identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1720–1732, 2005.

[2] S. Chaudhury, G. Harit, S. Madnani, and R. B. Shet. Identification of scripts of Indian languages by combining trainable classifiers. Proc. of ICVGIP, 2000.

[3] B. V. Dhandra, P. Nagabhushan, M. Hangarge, R. Hegadi, and V. S. Malemath. Script identification based on morphological reconstruction in document images. In International Conference on Pattern Recognition, pages 950–953, 2006.

[4] D. Dhanya and A. G. Ramakrishnan. Script identification in printed bilingual documents. In Document Analysis Systems, pages 13–24, 2002.

[5] E. Hassan, S. Chaudhury, and M. Gopal. Document image retrieval using feature combination in kernel space. In International Conference on Pattern Recognition, pages 2009–2012, 2010.

[6] M. C. Padma and P. Nagabhushan. Identification and separation of text words of Kannada, Hindi and English languages through discriminating features. In Proc. of NCDAR, pages 252–260, 2003.

[7] S. Mo and V. J. Mathews. Adaptive, quadratic preprocessing of document images for binarization. IEEE Transactions on Image Processing, 7:992–999, 1998.

[8] M. C. Padma and P. A. Vijya. Script identification from trilingual documents using profile based segmentation. In International Journal of Computer Science and Applications, volume 7, pages 16–33, 2010.

Figure 4: Block level script identification and corresponding confidence scores

[9] U. Pal and B. B. Chaudhuri. Script line separation from Indian multi-script documents. In International Conference on Document Analysis and Recognition, pages 406–409, 1999.

[10] U. Pal, S. Sinha, and B. B. Chaudhuri. Multi-script line identification from Indian document. In International Conference on Document Analysis and Recognition, pages 880–884, 2003.

[11] P. B. Pati and A. G. Ramakrishnan. Word level multi-script identification. Pattern Recognition Letters, 29:1218–1229, 2008.

[12] S. B. Patil and N. V. Subbareddy. Neural network based system for script identification in Indian documents. Sadhana-academy Proceedings in Engineering Sciences, 27:83–97, 2002.

[13] G. S. Peake and T. N. Tan. Script and language identification from document images. In British Machine Vision Conference, 1997.

[14] J. C. Platt. Probabilistic output for support vector machines and comparisons to regularized likelihood methods. 199.

[15] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.

Figure 5: Script identification at word level

[16] S. F. Rashid, F. Shafait, and T. Breuel. Connected Component level Multiscript Identification from Ancient Document Images. In Ninth IAPR Workshop on Document Analysis System, pages 1–4, 2010.

[17] G. Sharma, R. Garg, and S. Chaudhury. Curvature feature distribution based classification of Indian scripts from document images. In Proceedings of the International Workshop on Multilingual OCR, MOCR ’09, pages 3:1–3:6, New York, NY, USA, 2009. ACM.

[18] S. Sinha, U. Pal, and B. B. Chaudhuri. Word-wise script identification from Indian documents. In Document Analysis Systems, pages 310–321, 2004.

[19] A. L. Spitz. Determination of the script and language content of document images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:235–245, 1997.

[20] T. N. Tan. Rotation invariant texture features and their use in automatic script identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:751–756, 1998.

[21] P. Viola and M. J. Jones. Robust real-time face detection. Int. J. Comput. Vision, 57:137–154, May 2004.

[22] L. Zhou, Y. Lu, and C. L. Tan. Bangla/English script identification based on analysis of connected component profiles. In Document Analysis Systems, pages 243–254, 2006.
