[ieee 2014 world symposium on computer applications & research (wscar) - sousse, tunisia...

An Interactive Annotation Tool for Indexing Historical Manuscripts

Haneen Khader, Abeer Al-Marridi, Hena Alpona, Suchithra Kunhoth, Abdulaali Hassaine, and Somaya Al-maadeed

Department of Computer Science and Engineering, Qatar University

Doha, Qatar {200759075, aa1107191, hal 106514, suchithra, hassaine, s_alali}@qu.edu.qa

Abstract-When a huge collection of digitized old manuscripts is obtained as part of their preservation, an appropriate indexing technique is essential to search through them. In this paper, we propose a novel tool to assist the indexing of offline handwritten historical documents. The annotation tool we have developed has a semi-automatic interface that can deal with possible errors in the text segmentation. We give a full description and analysis of our tool, which can be used for English as well as Arabic documents with different performance rates.

Keywords-historical manuscript; indexing; annotation; semi-automatic; word spotting; binarization; segmentation

I. INTRODUCTION

A large number of libraries all over the world have resorted to digitization as an effective method for preserving historical manuscripts. Ancient manuscripts, which are considered to be the richest source of a nation's cultural heritage and moral values, may degrade over time, so it is desirable to save these resources in digitized form. Equipment such as scanners and high-resolution digital cameras can accomplish the task of scanning and thus creating digital versions of the documents. However, the information in these copies exists as mere images rather than as textual data. To access a particular topic, we have to flip through the entire set of document images. It is impossible to retrieve useful data by a searching or indexing strategy unless it is available in textual form.

One way around the problem is to transcribe the data manually. However, this is not a feasible option in terms of cost and effort for a large number of collections. This leads us to rely on automatic approaches such as optical character recognition (OCR). Although extensive research has taken place in the field of OCR, it is still difficult to apply to documents that are blurred, set in small fonts, or handwritten. These limitations make OCR an inappropriate choice for the transcription of handwritten historical documents. With the cursive style of Arabic handwriting in particular, an automatic OCR system cannot be expected to produce satisfactory results. This calls for a method somewhere between the aforementioned automatic and manual approaches. Thus, extensive research is going on in the field of word spotting techniques, which can greatly help in annotation when performing a manual transcription. Word spotting reduces the amount of annotation work that has to be performed by grouping all the words into clusters, where each cluster contains words with the same annotation. This idea can succeed only with accurate line segmentation and word segmentation of the document image. Hence, the annotation tool with partial human intervention that we propose in this paper is useful for performing annotations in old handwritten documents, including Arabic ones. The system we have developed is an interactive tool that can be used for both Arabic and English documents.

The paper is organized as follows. Section II gives a brief review of papers relating to searching and indexing in ancient documents. Section III provides a detailed description of our system. Our experimental results are presented in Section IV. Finally, the paper is concluded in Section V.

II. RELATED WORK

Annotation and indexing are mostly based on word spotting approaches. Keyword spotting, the identification of keywords in a portion of text, can be done by locating the keywords using features such as the word shapes in the text images. The main idea behind word spotting is an image matching technique. One of the first works on word spotting is by Spitz [1], who proposed the use of character shape codes to code the characters of printed text. The method starts with the identification of words; for each extracted word, the features of its characters are extracted based on the connected components and their position in relation to the two baselines. The characters are then coded according to their features to form a word shape token. Query words are mapped to word shape tokens in the same way, so that indexing and retrieval can be done on these tokens instead of the actual words. Later, Smeaton and Spitz [2] showed that this technique is useful only if the images are of bad quality, implying a failure of the OCR. Manmatha, Han and Riseman [3] proposed a semi-automatic approach to index handwritten documents, which is much more difficult than indexing printed text. In this work, similar images of words are grouped into equivalence classes using suitable image matching algorithms. The most frequently occurring classes are eliminated, and the remaining classes are manually coded in ASCII and used as the index. They investigated different word matching algorithms in their later papers.

978-1-4799-2806-4/14/$31.00 ©2014 IEEE

Rath, Lavrenko and Manmatha proposed an automatic retrieval system for historical handwritten documents using relevance models [4]. Sari and Kefali introduced a method of searching in images of old Arabic manuscripts [5]. They represent each document by the structural features of its subwords, coded as ASCII codes. The textual query of the user is coded in the same manner, and the comparison is performed using the dynamic time warping (DTW) algorithm. However, they mention some problems that remain unsolved, such as document slant correction and reliable feature extraction, which degrade the coding and retrieval performance. Reference [6] proposes a semi-automatic approach to document indexing and retrieval. After a manual step of choosing indexes, the method detects their contours and represents them as chain codes. At search time, the sought word is coded in the same manner as in the indexing, and the resulting code is then compared with the codes of the indexes using the DTW algorithm. A semi-automatic approach is also proposed by Kefali and Chemmam in [9], where indexes are chosen manually from the document. The structural features are then coded in character string form, and a search query is answered by comparing its code string with those of the indexes of all documents already stored in the database using a DTW algorithm.
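Several of the systems reviewed above compare a coded query with coded index entries using DTW. The following is a minimal sketch of that comparison step, assuming that words are already coded as plain character strings and that the local cost is 0 for equal symbols and 1 otherwise. The function names `dtw_distance` and `best_match` and the cost function are illustrative assumptions and are not taken from [5], [6] or [9].

```python
from typing import Iterable


def dtw_distance(a: str, b: str) -> float:
    """Dynamic time warping distance between two code strings.

    Assumed local cost: 0 when two symbols are equal, 1 otherwise.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0.0 if a[i - 1] == b[j - 1] else 1.0
            cost[i][j] = local + min(
                cost[i - 1][j],      # symbol of a aligned again with b[j-1]
                cost[i][j - 1],      # symbol of b aligned again with a[i-1]
                cost[i - 1][j - 1],  # both strings advance
            )
    return cost[n][m]


def best_match(query_code: str, index_codes: Iterable[str]) -> str:
    """Return the index code with the smallest DTW distance to the query."""
    return min(index_codes, key=lambda code: dtw_distance(query_code, code))
```

Because warping lets one symbol align with a run of repeated symbols, a code string such as "abc" is at distance zero from "aabbcc", which is what makes the measure tolerant to stretched or compressed handwriting features.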

A search engine is proposed by Makhfi, Bannay and Benslimane in [7] to digitize and create a library of ancient Arabic manuscripts. The authors stress the importance of indexing and disseminating the contents of old Arabic manuscripts either manually or automatically. The document coding is validated by XML schemas, which check the format, type and semantics of the data in the XML files. Based on these metadata and annotations, the search engine provides the process of identifying, collecting and registering information. Mainly, the search engine relies on metadata and XML annotations that allow handwritten transcribed documents and indexed images in the database to be searched in response to users' queries. The same authors presented in [8] an indexing system for Arabic manuscripts that transcribes and annotates the manuscript images according to the metadata. Their system allows the user to create works of Arabic manuscripts in a TEI XML format. Pantke et al. proposed a semi-automatic transcription system for handwritten Arabic documents in [10]. After the scanning and preprocessing of a document, and if no previously trained recognizer is available, the process begins with a manual transcription of a certain number of page images. As soon as a relevant number of page images has been manually processed and verified, the first training of a recognizer that employs, e.g., hidden Markov models (HMMs) is carried out on this data. The next set of scanned document pages is then transcribed automatically using the trained recognizer, again followed by a manual verification. The recognizer is then retrained on the extended dataset to increase the recognition accuracy. If a recognizer for a specific font or writing style is already available at the beginning of the transcription process, an automatic transcription is performed.

III. SYSTEM ARCHITECTURE

Our proposed method comprises an entire system that starts with the scanned copy of a document image and ends with an XML file that contains the annotation information for the respective words. A binarized image serves as the input to our segmentation stages, which perform line segmentation followed by word segmentation. At this step, users can rectify any word segmentation errors and then enter their annotations for the desired words. Finally, the system enables the user to store the corresponding annotations in XML files. Fig. 1 depicts all the stages of operation involved in our system. A detailed description of our annotation tool is given in the following sections.

Figure 1. Block Diagram of the proposed system

A. Preprocessing

For degraded historical documents, efficient preprocessing operations are needed to obtain satisfactory results in the subsequent segmentation steps. The interface of our tool offers a choice of thresholding algorithm between the Otsu and the K-means methods. Depending on the quality of the document image, the more suitable method can be chosen to achieve a good binarization.
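The two thresholding options can be illustrated with a minimal sketch. It assumes an 8-bit grayscale image given as a list of rows, dark ink on a light background, and a two-cluster one-dimensional K-means. The function names and the choice of initial cluster centers are illustrative assumptions, not a description of the implementation inside our tool.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Gray level t (0-255) that maximizes the between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def kmeans_threshold(pixels: Sequence[int], iterations: int = 50) -> float:
    """Two-cluster 1-D K-means; returns the midpoint of the two centers.

    Assumption: centers start at the darkest and lightest gray levels.
    """
    c_dark, c_light = float(min(pixels)), float(max(pixels))
    for _ in range(iterations):
        mid = (c_dark + c_light) / 2
        dark = [p for p in pixels if p <= mid]
        light = [p for p in pixels if p > mid]
        if not dark or not light:
            break  # only one cluster is populated (e.g. a uniform image)
        new_dark = sum(dark) / len(dark)
        new_light = sum(light) / len(light)
        if new_dark == c_dark and new_light == c_light:
            break  # converged
        c_dark, c_light = new_dark, new_light
    return (c_dark + c_light) / 2


def binarize(image: List[List[int]], threshold: float) -> List[List[int]]:
    """Return 1 for ink (pixel <= threshold) and 0 for background."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


if __name__ == "__main__":
    page = [[200, 210, 30], [220, 40, 35], [205, 215, 225]]
    flat = [p for row in page for p in row]
    print(binarize(page, otsu_threshold(flat)))
    print(binarize(page, kmeans_threshold(flat)))
```

On a clean page both methods tend to give the same binary image; they differ mainly when the gray-level histogram is not clearly bimodal, which is why leaving the choice to the user is useful for degraded manuscripts.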

Figure 2. Results of binarization. Left: input image. Right: image after binarization

Figure 3. Line Segmentation and Word Segmentation Results on a page from an Arabic manuscript

B. Segmentation

The segmentation stage extracts lines and then words from the binarized text image. The implementation of line and word segmentation is based on the algorithms we have already proposed in [11]. After a smoothing step, the image is thresholded in order to extract the main line components. Small components that do not correspond to line components are then filtered out. The text lines are then isolated using connected component labeling. The basic idea for word segmentation is to classify the gaps between the components of a line as inter-word distances or intra-word distances. For that, a chamfer distance image is first computed, and the best threshold between the inter-word and intra-word distances is then found so that the line is split at the correct word boundaries.
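The gap classification idea can be sketched in a simplified form. The sketch below assumes that each connected component of a text line is reduced to its horizontal extent and that the gap between neighbours is the horizontal distance between those extents, which stands in for the chamfer distance used in [11]. The threshold is chosen by maximizing the between-class variance of the gaps, which is an illustrative choice and not necessarily the criterion of [11]. All names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (x_start, x_end) of one connected component on a line


def best_gap_threshold(gaps: List[float]) -> float:
    """Split the sorted gaps into small/large so between-class variance is maximal.

    Returns infinity when no split exists (fewer than two distinct gap values),
    in which case every gap is treated as intra-word.
    """
    ordered = sorted(gaps)
    best_score, best_thr = -1.0, float("inf")
    for i in range(1, len(ordered)):
        small, large = ordered[:i], ordered[i:]
        if small[-1] == large[0]:
            continue  # equal values cannot be separated by a threshold
        mean_s = sum(small) / len(small)
        mean_l = sum(large) / len(large)
        score = len(small) * len(large) * (mean_s - mean_l) ** 2
        if score > best_score:
            best_score = score
            best_thr = (small[-1] + large[0]) / 2
    return best_thr


def group_into_words(components: List[Span]) -> List[Span]:
    """Merge neighbouring components whose gap is classified as intra-word."""
    comps = sorted(components)
    if not comps:
        return []
    gaps = [max(0, b[0] - a[1]) for a, b in zip(comps, comps[1:])]
    thr = best_gap_threshold(gaps)
    words = [comps[0]]
    for comp, gap in zip(comps[1:], gaps):
        if gap > thr:
            words.append(comp)  # inter-word gap: start a new word
        else:
            last = words[-1]    # intra-word gap: extend the current word
            words[-1] = (last[0], max(last[1], comp[1]))
    return words


if __name__ == "__main__":
    line = [(0, 10), (12, 20), (22, 30), (45, 55), (57, 66), (80, 90)]
    print(group_into_words(line))
```

A single global threshold per line works when inter-word gaps are clearly wider than intra-word gaps. When the two gap populations overlap, any threshold of this kind will split or merge some words incorrectly, which is where the manual correction in the annotation interface comes in.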

C. Annotation

Once the segmentation is over, we can move on to the annotation stage, which first requires the user to select whether the text contained in the document is English or Arabic. The word ordering in the text is determined accordingly, and rectangular bounding boxes appear around each word. The annotation interface also allows any faults in the word segmentation to be rectified, with options such as adding new words, deleting words, and merging two word bounding boxes. The reset button discards the entire word segmentation result so that the user can start from scratch, creating drag-and-drop resizable bounding boxes around each word with the Add word option. We can then add annotations to the desired words either by entering them directly in the text box on the interface or by loading a text file containing the metadata. Placing the cursor over a word box highlights that box and displays, in the annotation window, the word number according to the English or Arabic ordering along with the corresponding annotation. If the text file option is used for adding annotations, errors may arise when the entire text is mapped to the entire set of segmented words. These errors can also be corrected with controls such as adding words before or after a given word, with or without annotation shifting.
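The bookkeeping behind these controls can be sketched as follows. The sketch assumes that each word box records the index of its text line, that Arabic words are ordered right to left within a line and English words left to right, and that "annotation shifting" means remapping the loaded text over the updated sequence of boxes. The class `WordBox` and all function names are hypothetical and only illustrate one possible reading of the interface behaviour.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordBox:
    x: int      # left edge
    y: int      # top edge
    w: int      # width
    h: int      # height
    line: int   # index of the text line containing the word
    annotation: str = ""


def reading_order(boxes: List[WordBox], language: str) -> List[WordBox]:
    """Sort boxes line by line, right to left for Arabic, left to right otherwise."""
    rtl = language.lower() == "arabic"
    return sorted(boxes, key=lambda b: (b.line, -(b.x + b.w) if rtl else b.x))


def merge_boxes(a: WordBox, b: WordBox) -> WordBox:
    """Bounding box enclosing both inputs; annotations are joined with a space."""
    x0, y0 = min(a.x, b.x), min(a.y, b.y)
    x1 = max(a.x + a.w, b.x + b.w)
    y1 = max(a.y + a.h, b.y + b.h)
    text = " ".join(t for t in (a.annotation, b.annotation) if t)
    return WordBox(x0, y0, x1 - x0, y1 - y0, a.line, text)


def assign_annotations(boxes: List[WordBox], words: List[str]) -> None:
    """Map the i-th word of the loaded text to the i-th box in reading order."""
    for i, box in enumerate(boxes):
        box.annotation = words[i] if i < len(words) else ""


def insert_word(boxes: List[WordBox], index: int, new_box: WordBox,
                words: List[str], shift: bool) -> None:
    """Insert a box at `index` of the ordered list.

    With shift=True the loaded text is remapped, so later annotations move
    one box further; with shift=False existing annotations stay in place
    and the new box is left empty for manual entry.
    """
    boxes.insert(index, new_box)
    if shift:
        assign_annotations(boxes, words)


if __name__ == "__main__":
    text = ["the", "old", "book"]
    boxes = [WordBox(10, 5, 20, 10, 0), WordBox(90, 5, 20, 10, 0)]
    assign_annotations(boxes, text)
    insert_word(boxes, 1, WordBox(50, 5, 20, 10, 0), text, shift=True)
    print([b.annotation for b in boxes])
```

Keeping the loaded text separate from the boxes is the design choice that makes shifting cheap: a correction to the segmentation only requires remapping, not retyping.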

Finally comes the stage of saving the annotations, in which an XML file is created for the particular image. The file stores the image file details, the bounding box coordinates and the respective annotation for each word.
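A file of this kind could be produced as in the following sketch, which uses the Python standard library module `xml.etree.ElementTree`. The element and attribute names (`document`, `image`, `words`, `word`, `id`, `x`, `y`, `width`, `height`) are assumptions made for illustration; the actual schema written by our tool is not specified here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# (x, y, width, height, annotation) for one word; a hypothetical record layout
WordRecord = Tuple[int, int, int, int, str]


def annotations_to_xml(image_name: str, image_width: int, image_height: int,
                       words: List[WordRecord]) -> str:
    """Serialize image details, word bounding boxes and annotations to XML text."""
    root = ET.Element("document")
    ET.SubElement(root, "image", name=image_name,
                  width=str(image_width), height=str(image_height))
    words_el = ET.SubElement(root, "words")
    for number, (x, y, w, h, text) in enumerate(words, start=1):
        word_el = ET.SubElement(words_el, "word", id=str(number),
                                x=str(x), y=str(y),
                                width=str(w), height=str(h))
        word_el.text = text  # the annotation; may be Arabic or English
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    xml_text = annotations_to_xml(
        "page1.png", 800, 600,
        [(10, 20, 30, 15, "كتاب"), (50, 20, 40, 15, "old")],
    )
    print(xml_text)
```

Storing the word number together with the coordinates keeps the reading order recoverable when the file is later loaded for indexing.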

Figure 4. Snapshot of our Annotation Interface

IV. EXPERIMENTS AND RESULTS

We used the Microsoft Visual C++ platform to build our annotation interface. To test our system, we collected document images from three different sources, one of which we acquired ourselves using a digital camera. A brief description of the databases we used is provided in the following sections.

1) Princeton University Digital Library

The database consists of the digitized collection of a large number of historic books in English, Arabic and other languages. We collected a few sample pages from the 14th century Arabic manuscript 'Al Jamia fe al nahw' for testing our system [12].

2) Qatar National Library

The database contains tens of old collections of Arabic manuscripts that are preserved in the National Library of Qatar.

3) Our own Database

We have been experimenting with manuscript imaging using digital cameras such as the Canon EOS 5D Mark II and the Sony Cyber-shot DSC-H200. The high-resolution document scans provided by the Canon camera make it an effective capture device for digitization. We therefore also collected a few sample pages from an Arabic manuscript using the Canon camera to test our annotation tool.

Figure 5. A page from the manuscript captured using the Canon EOS 5D Mark II camera

Although manual intervention is allowed for correcting the word segmentation results prior to the annotation stage, effective word segmentation can greatly reduce the work of the user. We performed the segmentation for documents in both Arabic and English scripts and found that the word segmentation results for Arabic still need to be improved when compared to the high accuracy obtained for English. We are currently using the same algorithm for both the English and Arabic word segmentations. Some improvements are needed to cope with the difference in inter-word and intra-word distances when segmenting Arabic words.

V. CONCLUSION

In this paper, we have proposed a novel tool for annotating scanned or digitized copies of historical handwritten documents in English as well as Arabic script. Although line and word segmentation are required, as in word spotting approaches, the interactive interface of our annotation tool allows the user to eliminate the segmentation errors. The annotated text and the corresponding word locations are then saved to an XML file, which can later be used for indexing.

ACKNOWLEDGMENT

This paper was made possible by a UREP award [UREP 12-125-1-015] from the Qatar National Research Fund (a member of The Qatar Foundation). The statements made herein are solely the responsibility of the authors.

REFERENCES

[1] A. Spitz, "Using character shape codes for word spotting in document images", in D. Dori and A. Bruckstein (Eds.), Shape, Structure and Pattern Recognition, World Scientific, Singapore, pp. 382-389, 1995.

[2] Smeaton, A. Spitz, "Using character shape coding for information retrieval", in the 4th ICDAR, pp. 974-978, 1997.

[3] R. Manmatha, C. Han, E. Riseman, "Word spotting: a new approach to indexing handwriting", IEEE Conference on Computer Vision and Pattern Recognition, pp. 631-637, 1996.

[4] T. M. Rath, V. Lavrenko, R. Manmatha, "A search engine for historical manuscript images", in Proc. of the 27th Annual Int'l ACM SIGIR Conf. (Sheffield, UK, July 25-29, 2004), pp. 369-376.

[5] T. Sari, A. Kefali, "A search engine for Arabic documents", Dixieme Colloque International Francophone sur l'Ecrit et le Document (CIFED), Octobre 2008.

[6] A. Benmohamed, T. Sari, M. Sellami, « Une approche semi automatique pour la recherche de documents anciens », Journees Gestion Electronique de Documents & Reseaux de Recherche en sciences et Technologies d'information, pp. 156-163, Annaba-Algerie, Mai 2009.

[7] N. Makhfi, O. Bannay, R. Benslimane, "Search engine of ancient Arabic manuscripts based on metadata and XML annotations", Information Science and Technology Conference (CIST), May 2011, pp. 1-10.

[8] N. El Makhfi, O. El Bannay, R. Benslimane, N. Rais, "System of indexing, annotation and search in the old Arabic manuscripts", Information Science and Technology Conference (CIST), May 2011, p. 9.

[9] A. Kefali, C. Chemmam, "A semi-automatic approach of old Arabic documents indexing", CIIA 2011.

[10] W. Pantke, V. Margner, D. Fecker, T. Fingscheidt, A. Asi, O. Biller, J. El-Sana, R. Saabni, M. Yehia, "HADARA A Software System for Semi-Automatic Processing of Historical Handwritten Arabic Documents", in Archiving Conference, vol. 2013, no. 1, pp. 161-166, Society for Imaging Science and Technology, 2013.

[11] A. Hassaine, "A robust method for line and word segmentation in handwritten text", Qatar Foundation Annual Research Forum Proceedings, vol. 2013, ICTP 057.

[12] Princeton University Digital Library, http://pudl.princeton.edu/objects/kk91fk603