
Report on Internship Project: Optical Character Recognition in

Multilingual Text

Christa Mabee
University of Washington

Seattle, WA

May 27, 2012

Abstract

This is the final report on an internship project for the Master of Science in Computational Linguistics program at the University of Washington. The project is an evaluation of existing OCR technologies in the context of recognizing text from multi-script dictionaries.


1 Introduction

The goal of this project was to evaluate existing OCR systems with respect to recognizing mixed-script data, particularly in the context of collecting data for the Utilika Foundation's PanLex project [7].

This work was completed between May 2011 and May 2012. Using Abbyy FineReader version 10.0 and Tesseract 3.02, a newly scanned corpus of some three thousand pages in Thai, Tibetan, Farsi, and English has been processed.

2 Activities and Results

2.1 Building a Corpus

The first challenge for this examination, and one of the most time-consuming, was procuring a test corpus. Google Books has some multilingual dictionaries in its collection, but even the open-domain books this writer found were not available for download in sufficiently high resolution. According to the documentation for both OCR systems evaluated, the page images need to be at least 300 dpi.

Seven multilingual dictionaries employing Farsi, Thai, Tibetan, and Latin scripts were acquired for this project. These range from a high-gloss pocket dictionary to an English onion-paper primer on Tibetan. The varying paper weights allowed me to evaluate the feasibility of using each in an automatic document feeder. The volumes were between 3 and 30 years old, and the print quality varied significantly.

The dictionaries are as follows:

• A New English Thai Dictionary [28]

• A Tibetan English Dictionary [13]

• A Tibetan English Dictionary [12]

• English Persian Dictionary [9]

• English-Persian Dictionary [10]

• Pocket Thai Dictionary [11]

• New Thai-English, English-Thai Compact Dictionary [1]

There are 4,627 scanned pages in this collection, in more than 19 fonts. This writer had time to assemble sufficient training data for six of the fonts, which for this project means at least fifteen examples per character. More common characters have 20-25 examples per font in the training data collection.

The pages were scanned using the document feeder of a desktop scanner. Dictionaries often have very thin pages, which document feeders can damage if the settings aren't calibrated properly, so it required some experimentation to find the best speed for each paper weight. Further, since the pages are typically taller than they are wide, feeding them in sideways generally resulted in less jamming. The pages later had to be digitally rotated into the correct orientation.


It took roughly 43 hours to scan in the full collection at 600 dpi. Scanning speed aside, I suspect that a higher-quality document feeder could have sped up the process. The one employed takes 15 pages at a time, and the thinner dictionary pages often would get pulled in together and then need to be re-scanned separately.

Furthermore, each of the volumes had to be trimmed (removing the spine) in order to use the document feeder. Non-destructive methods of book scanning would require the use of professional book-scanning hardware. Several professional firms will perform this service. Quotes received for this project (seven books, 4,627 pages) ranged between $250 at Bound Book Scanning [23] and $3,000 at Crowley, Inc. [3]. Kirtas' Kabis I (base model) for non-destructive book scanning can scan 1,600 pages an hour, but costs in the area of $70,000 [16]. In retrospect, it may have been better to hire Bound Book Scanning and have more time for coding and analysis.

2.2 Postprocessing

To prepare the scans, I wrote a batch script to straighten out the images. The script both rotates the pages that were scanned sideways and deskews the images. I later experimented with a variety of despeckling and contrast-enhancing methods, but did not find that any of them improved recognition on this corpus.

The deskewing package I used was from a system called pagetools, released under GPL v2 [14]. The deskew package uses a very fast Radon transform algorithm; deskewing the entire collection took less than an hour. I also wrote a GIMP script to do the same, with the hope of using other GIMP plugins to improve the image quality, but this proved unnecessary. Both this script and the plugin are available from the project SourceForge code repository [4].¹
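To illustrate what skew estimation involves, the sketch below scores candidate rotation angles by the variance of a row-projection profile: straight text lines produce a peaky histogram, skewed ones a smeared one. This is a toy, pure-Python version of the projection idea, not the Radon-transform algorithm pagetools actually uses; the page data is synthetic.

```python
import math

def row_profile_variance(points, angle, rows=32):
    """Score a candidate skew angle: rotate the black-pixel coordinates
    by -angle and histogram their y values. Straight text lines give a
    peaky profile (high variance); skewed text smears it (low variance)."""
    counts = [0] * rows
    c, s = math.cos(-angle), math.sin(-angle)
    for x, y in points:
        y_rot = x * s + y * c
        counts[int(y_rot) % rows] += 1
    mean = sum(counts) / rows
    return sum((n - mean) ** 2 for n in counts) / rows

def estimate_skew(points, max_deg=5.0, step_deg=0.5):
    """Brute-force search for the rotation that maximizes the profile
    variance; returns the estimated skew in degrees."""
    steps = int(max_deg / step_deg)
    candidates = [i * step_deg for i in range(-steps, steps + 1)]
    return max(candidates,
               key=lambda d: row_profile_variance(points, math.radians(d)))

# Synthetic page: three horizontal "text lines" skewed by 2 degrees.
skew = math.radians(2.0)
points = [(x, round(y0 + x * math.tan(skew)))
          for y0 in (5, 15, 25) for x in range(100)]
```

A brute-force search like this is far slower than the Radon-transform approach, which is why pagetools could deskew the whole collection in under an hour.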

3 Tesseract

3.1 About

Tesseract was originally developed by Hewlett-Packard and UNLV in the 1990s. In 2005 it was released as open source, and in 2006 it became sponsored by Google [26]. Today, as of version 3 (released in October of 2011), Tesseract can support multiple-language documents and already has language packs of varying quality for more than 30 languages, even including some resource-poor languages such as Cherokee and classic Quechua. Only six of these are officially supported, but the open-source community contributes language packs. As of the time of writing, 3.02 is not available as a binary but must be compiled from the source accessible from the code repository.

Tesseract now has some rudimentary page-layout analysis capabilities, which do extend to recognizing when text is in multiple columns on the page. Accuracy since 1995 has only increased by around 1%, and on English text can be expected to approach 97% character-level recognition accuracy. The system was unique in its day for implementing component analysis before attempting segmentation, which enabled it to detect sections with inverse-color text [25].

¹ Technical note for interested parties: on Ubuntu 12.04 the makefile failed due to a missing header, but installing the libnetpbm10-dev package solved that problem.

3.2 Training

Training Tesseract is a fair amount of work. For a detailed guide, see TrainingTesseract3 [29]. I wrote a shell script to assist the user through each of the steps, as it is easy to forget one or mistype. train_tesseract.sh is available from the project SourceForge code repository [4]. The details for each font and character set must be specified, but most of the work is in generating "boxfiles," or text files that specify the coordinates of the boundaries of each character on an associated page. The documentation for Tesseract [29] specifies that characters should have between 10 and 20 training examples. The examples are processed into feature vectors used by the classifier. As described in Smith 2009 [24]:

In training, a 4-dimensional feature vector of (x, y-position, direction, length) is derived from each element of the polygonal approximation, and clustered to form prototypical feature vectors. (Hence the name: Tesseract.)

The instructions for generating training data stress that it is vital to have characters spaced out on the training page.

Overlapping boxes result in features being created or dropped inappropriately in the training stages. The suggested method is to print out the desired characters on paper, so that you have an appropriate number of characters and so that they are spaced evenly. That is not possible with this dictionary project, as many of the fonts used are either proprietary or reproductions of handwritten characters.

In order to generate valid boxfile/image sets, I created pages with the desired characters essentially copy/pasted out of the corpus, cleaned up so that features from overlapping characters are removed. This made it possible to generate clean boxes, but means any baseline data (in the sense of physical spacing) is lost. The results were not impressive compared to the default language pack, which was not trained on the given fonts, and so I suspect that either the importance of the baseline data or the sheer number of training examples far outweighs the benefit of having font-specific training data. Additionally, creating training data this way was very time-consuming.
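Because overlapping boxes are such a hazard, a boxfile sanity check is easy to automate. The sketch below assumes the standard Tesseract boxfile line format, `char left bottom right top page` (pixel coordinates, origin at the bottom-left of the page image), and flags pairs of boxes that overlap; the sample lines are invented.

```python
def parse_boxfile(text):
    """Parse Tesseract boxfile lines of the form
    'char left bottom right top page' (pixel coordinates, origin at
    the bottom-left of the page image)."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6:
            ch, coords = parts[0], [int(p) for p in parts[1:]]
            boxes.append((ch, *coords))  # (char, l, b, r, t, page)
    return boxes

def overlapping_pairs(boxes):
    """Return character pairs whose boxes overlap on the same page;
    these are the boxes the training documentation warns about."""
    bad = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            same_page = a[5] == b[5]
            if (same_page and a[1] < b[3] and b[1] < a[3]
                    and a[2] < b[4] and b[2] < a[4]):
                bad.append((a[0], b[0]))
    return bad

# Invented sample: 'a' and 'b' overlap; 'c' is well separated.
sample = "a 10 10 30 40 0\nb 25 12 45 42 0\nc 100 10 120 40 0"
```

Running such a check before clustering would catch the dropped/spurious-feature problem early, rather than after a training run.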

Creating boxfiles took over 60 hours for the seven fonts I prepared, and several of the fonts had characters not presented frequently enough to provide the recommended 15 examples. As a result I do not recommend attempting to collect font-specific training data for use in Tesseract unless the digital font is available. In addition to being incredibly tedious, it did not result in improved accuracy for any of the tested fonts. Bootstrapping a new character set would probably be better done by finding similar fonts and printing suitable training pages. (See: Future Work.)

There are multiple boxfile-editing GUIs available; I tried several of them. I found the moshpytt [18] interface to be the best of the installable ones, though it suffers from not allowing the user to drag box boundaries; instead you have to use the directional arrow keys to move or resize them pixel by pixel. This slows down work considerably. The AJAX-based web interface Tesseract OCR Chopper by Dino Beslagic [27] does not suffer from that problem; it allows resizing boxes with the mouse. Unfortunately it doesn't always process the entire page, and will sometimes skip over entire columns or paragraphs. In my opinion, it would be useful to this community to have a multi-platform boxfile editor with draggable box boundaries.

As a further note on training, when I began this project multi-language support was not yet in place, so I attempted to create a custom language that essentially just combined the resources for English and Thai. This was a complete failure. The resulting text was mostly Latin characters with a smattering of Thai diacritics superimposed, in place of the Thai script. I suspect that this is because the English resources outnumbered the Thai resources by a fair margin (the system does not use grammar rules). In Adapting the Tesseract Open Source OCR Engine for Multilingual OCR [24], Ray Smith mentions Thai specifically as being particularly ambiguous and so introducing multilingual recognition difficulties.

One can optionally include a wordlist when training Tesseract. The wordlist, if present, is one of the few linguistically informed features and will be used during word segmentation [25].

Tesseract does not output confidence data by default, and although it can be modified to do so, the numbers it uses do not appear to be very useful.

3.3 Results

Overall, I was not very impressed with Tesseract's performance. The accuracies observed in this corpus do not resemble those reported by the project. Figure 1 shows a sample of these results for Thai.


Figure 1: Tesseract Results Using Official Thai Language Pack

As you can see, character recognition is decent so long as Tesseract recognizes the language of the word. However, it frequently fails to recognize that the word is in Thai and not English; here this error rate is about 80%. It appears that Tesseract 3.02's new multi-language recognition feature is not ready for practical use just yet. Perhaps the disparity in confidence between the Latin and Thai characters could be corrected so that the system prefers Thai at a lower confidence threshold. (See Future Work.)
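The kind of correction suggested here could take the form of a small post-processing step rather than a change to Tesseract itself. The sketch below is purely hypothetical (Tesseract exposes no such hook): given per-script candidate readings with confidences, an additive bias lets the lower-confidence script win.

```python
def pick_script(candidates, bias=None):
    """candidates maps a script code to (recognized_text, confidence).
    bias maps a script code to an additive confidence offset, letting
    that script win at a lower raw confidence threshold."""
    bias = bias or {}
    return max(candidates,
               key=lambda s: candidates[s][1] + bias.get(s, 0.0))

# Invented example: Latin narrowly outscores Thai on a Thai word.
candidates = {
    "eng": ("nnw", 0.71),                 # misread as Latin letters
    "tha": ("\u0e19\u0e49\u0e33", 0.65),  # Thai reading
}
```

With no bias the Latin reading wins; a modest Thai offset flips the decision while leaving clear-cut Latin words (with a large confidence gap) unaffected.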


4 Abbyy

4.1 About

Abbyy is headquartered in Moscow, Russia. They produce a range of language products, such as digital dictionaries, translation software and services, and data capture/document processing software. One of their products, Abbyy FineReader, is a particularly powerful document-processing engine with an accompanying SDK. Through their representative Mark Smock I obtained a 60-day trial license for the FineReader SDK.

4.2 Training

No training was required for Thai or Farsi. Abbyy provides a GUI-based training tool to generate data files for new character sets. Unfortunately, I did not have enough time to evaluate this functionality, and I have so far been unable to launch the interface due to runtime errors.

The SDK was easy to set up; it comes with a GUI installer. While FineReader 9.0 is available for Linux, Windows, and Mac, the 10.0 version's Linux distribution is not yet available. As such, I used the 10.0 Windows version. The SDK provides document analysis features, grammar features, and character recognition data.

Starting with one of the SDK examples, I wrote a fairly simple Java program to analyze the page, using the languages specified through the command line, and output data in raw text and XML format. This program and an accompanying bash script are available through the project SourceForge page. However, for licensing reasons, I cannot provide the jarfile for the engine itself, which is required to run the program. The program records the uncertainty and the unrecognizable characters as reported by FineReader to generate a results file. The results from sample sets can be seen in Table II.

Following in the footsteps of Doermann et al. [6], I randomly selected two pages from two of the books and manually prepared ground truth data for them. I then compared them to the output of the FineReader-based program. As I am not familiar with any good tools to do text alignment, I used Microsoft Word's "Compare Document" feature to see differences in the text. Table III has the results, if all whitespace characters are considered equal (tab vs. space).
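For future reference, character-level accuracy against ground truth can also be computed directly with Levenshtein edit distance, which sidesteps the need for an alignment tool. A minimal sketch (not the tool used for Table III), with whitespace normalization mirroring the "all whitespace equal" convention:

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def normalize_ws(s):
    """Collapse runs of any whitespace to single spaces (tab == space)."""
    return " ".join(s.split())

def char_accuracy(ocr, truth):
    """Character-level accuracy: 1 - edit distance / length of truth."""
    ocr, truth = normalize_ws(ocr), normalize_ws(truth)
    return 1.0 - edit_distance(ocr, truth) / len(truth)
```

The same routine works unchanged for Thai or Farsi output, since Python strings compare by code point.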

Thai fared considerably better than Farsi. The problem with both Farsi texts is that the transliteration column is written in a partially cursive English script on which FineReader performed poorly.

4.3 Results

Table II: Character Recognition Confidence

Book            Total   Failed   Suspicious
Pocket Thai     50900       57         2301
Thai English    62645       40         2825
                         0.08%         4.5%
English Farsi   77461      785        26868
                         1.01%       34.68%


Table III: FineReader Sample Accuracy

Book            Sample Characters Correct
Pocket Thai     99.56%
Thai English    99.9%

5 Future Work

As there is no Tibetan language pack for Tesseract yet, and I've gone to the trouble of learning to train Tesseract, I plan to work on that over the summer and contribute it back to the Tesseract project. Given my experience with the Thai language pack, I am going to try using existing Tibetan fonts to generate training pages in the suggested fashion (printing spaced-out characters and scanning the printed pages) rather than continuing with the attempt to collect all the examples from the existing pages. It will be interesting to see how the two sets of samples compare in terms of recognition accuracy, and what the real threshold is for sufficient character examples in this script.

It would be interesting to see if the font-specific data, when added to existing language packs, can improve Tesseract's recognition accuracy. Other work in Tesseract worth examining: finding a way to boost recognition of non-Latin characters on a page with Latin text. I suspect that the ability to give a weighted preference to one language or another could be useful in texts where one script has more ambiguous characters. Ideally this would be a parameter one could provide on the command line to test until a good balance is found.

For Abbyy's FineReader, I still wish to examine the training tool. Currently I am unable to get the program to launch, and I suspect that there were issues with the SDK installation which the installer did not flag.

One idea which I briefly experimented with was using the OpenCV [20] (Open Computer Vision) library. While it is easy enough to train a character recognizer, I was unable to construct a working document analyzer (at least casually). It is my expectation that OpenCV could do very well at segmenting dictionary entries, given its successes at recognizing and parsing far more complex objects.

6 Time Spent

Table I: Time Spent (approximate)

Hours   Activity
26      Preliminary research
8       Acquiring dictionaries and scanner
3       Preparing dictionaries for scanning
43      Scanning dictionaries
3       Deskewing pages
3       Trying out boxfile editors
69      Preparing spaced boxfiles
9       Installing Tesseract, scripting
10      Running Tesseract experiments
2       Acquiring Abbyy SDK
9       Installing Abbyy, learning SDK
3       Writing Abbyy scripts
4       Running Abbyy experiments
9+      Writeup, data collation
201+    Total


7 Relevant Coursework

The content of this internship, while relevant to my personal interests, has been a nearly complete departure from the coursework I had in the CLMA program. This surprises me in retrospect. One of my primary original motivations for joining the CLMA program was to better enable myself to take on a project I first attempted as an undergraduate: a machine transliteration system for cuneiform tablets. This internship might be the only directly relevant work in the program.

Advanced Statistical NLP was useful; it helped me when reading the literature and attempting to understand the algorithms used by these software packages. Particularly useful was the assignment on Support Vector Machines, which are discussed in Adaptive Hindi OCR Using Generalized Hausdorff Image Comparison [17].

Another honourable mention goes to the Deep Language Processing class, which strengthened my understanding of Hidden Markov Models, as made use of in Use of OCR for Rapid Construction of Bilingual Lexicons [6].

8 Literature Review

While I will not take the time to review all of the documents that were interesting or informative in the course of this investigation, here are a few that stood out. My thanks to Jonathan Pool for several additional papers of interest: [8], [6], and [22].

8.1 Stochastic Language Models for Style-Directed Layout Analysis of Document Images

Kanungo and Mao 2003 [15] experiment with a stochastic grammar describing the physical layout of a page (headers, columns, etc.). Using the Viterbi algorithm, they determine the optimal state sequence for weighted automata constructed from trees representing black pixels in strips drawn on the page. The state sequence provides 1-D segmentation, hierarchical from the page down to the text lines.

They tested this algorithm on artificially noisy test images at sampling resolutions of 200-400 DPI. One version of the algorithm, Model-I, does not use explicit state duration densities, while Model-II does. They found that Model-II performed better than Model-I, especially as image noise increased.

Essentially: a projection of pixel values on the page is partitioned into strips, the darkness of each strip becomes an observation symbol in an FSA, and the optimal state transitions (representing physical boundaries) are determined a la Viterbi.
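That recipe can be sketched with a toy two-state strip model, where each strip is observed as "dark" or "light" and the decoder recovers text versus gap regions. The states, symbols, and probabilities below are invented for illustration, not taken from Kanungo and Mao.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Standard Viterbi decoding: the most probable state sequence
    for an observation sequence under an HMM."""
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {
            s: max((best[p][0] * trans_p[p][s] * emit_p[s][o],
                    best[p][1] + [s]) for p in states)
            for s in states
        }
    return max(best.values())[1]

# Toy strip model: 'dark' strips tend to come from text lines,
# 'light' strips from the gaps between them. All numbers invented.
states = ("text", "gap")
start_p = {"text": 0.5, "gap": 0.5}
trans_p = {"text": {"text": 0.7, "gap": 0.3},
           "gap":  {"text": 0.3, "gap": 0.7}}
emit_p = {"text": {"dark": 0.9, "light": 0.1},
          "gap":  {"dark": 0.2, "light": 0.8}}
```

Run over a strip sequence like `["dark", "dark", "light", "dark"]`, the decoder labels the light strip as a gap between two text regions, which is exactly the 1-D boundary information the paper's segmenter extracts.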


8.2 Adaptive Hindi OCR Using Generalized Hausdorff Image Comparison

Ma and Doermann 2003 [17] claim to have a "rapidly retargetable" system with 88-95% character-level accuracy. As part of a DARPA TIDES project at the University of Maryland to acquire bilingual dictionaries, Ma and Doermann took one month to develop and train the system described.

The system scans Devanagari text at 300-400 DPI; the scans are then despeckled and deskewed. The system performs segmentation using methods described in O'Gorman 1993 [21]. Word-level script detection identifies Devanagari versus Roman words.

The Roman words are fed to "a commercial English OCR" while the Hindi words are further segmented into characters, which are passed to the character classifier.

The Devanagari segmenter segments characters by removing the top and bottom strips found in that script and detecting the characters and modifiers before re-inserting the strip. There is some work to segment the "shadow characters", characters that do not touch other characters but cannot be separated by a vertical line.

Each character is classified using Generalized Hausdorff Image Comparison (GHIC), an algorithm which computes the Hausdorff distance, measuring the similarity between two images (assuming there is only one translation between them). Without belabouring the details of GHIC, suffice it to say that this algorithm provides a useful confidence measure.
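The underlying distance is simple to state. The sketch below implements the plain (ungeneralized) Hausdorff distance between two glyphs represented as sets of black-pixel coordinates; GHIC generalizes this, for example by replacing the max with a rank statistic so that a few noise pixels do not dominate the score, which is omitted here.

```python
import math

def directed_hausdorff(A, B):
    """h(A, B): the farthest any point of A sits from its nearest
    point of B (Euclidean distance)."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets, e.g. the
    black-pixel coordinates of two glyph images. Small values mean the
    glyphs nearly coincide (up to the assumed single translation)."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))
```

Since the score is a distance in pixels, thresholding it (or ranking template glyphs by it) yields exactly the kind of confidence measure the paper describes.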

The system was applied to the Oxford Hindi-English dictionary, a corpus of 1,083 pages scanned at 400 dpi, as well as the original PDFs. Accuracy was evaluated by randomly selecting seven pages from the corpus and preparing ground truth data.

With printed-scanned images, the character-level accuracy was 87.75%, while the images taken from a PDF yielded 95% accuracy. The authors state that the classifier may be trained on small amounts of data, but do not provide specifics.

8.3 Applying the OCRopus OCR System to Scholarly Sanskrit Literature

Thomas Breuel (2009) [2] of the University of Kaiserslautern gives a high-level description of the OCRopus system (whose development he has led). His description centres on the case of mixed-script scholarly Sanskrit literature using Latin and Devanagari scripts.

Breuel discusses the system itself briefly. It is a strictly feed-forward system (why he stresses this in the paper is a bit of a puzzlement to me, as I have not heard of any OCR system with backtracking between modules) which supports multi-lingual and multi-script OCR. He provides a gloss of each of the modules:


1. Preprocessing - despeckling, deskewing.

2. Layout analysis - computational geometry algorithms with least-squares matching; Breuel claims that Voronoi methods don't perform as well.

3. Text line recognition - OCRopus employs four recognizers here, including Tesseract. Previous to the current version of 0.4, it only used Tesseract [19]. Interesting note: diacritics are handled by treating a character and its diacritic as one unique character.

4. Language modelling - selecting the best representation of the text.
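The strictly feed-forward structure Breuel stresses amounts to plain function composition: each module consumes the previous module's output and nothing flows backward. A schematic sketch with placeholder stages (toy string operations, not OCRopus APIs):

```python
def run_pipeline(image, stages):
    """Strictly feed-forward: each stage consumes the previous stage's
    output, and no information ever flows back upstream."""
    data = image
    for stage in stages:
        data = stage(data)
    return data

# Placeholder stages mirroring the four modules described above.
stages = [
    lambda page: page.strip(),                 # 1. preprocessing
    lambda page: page.split("|"),              # 2. layout analysis
    lambda lines: [l.upper() for l in lines],  # 3. line recognition
    lambda texts: " ".join(texts),             # 4. language modelling
]
```

A backtracking system, by contrast, would need later stages to be able to re-invoke earlier ones, which this simple composition cannot express.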

The documentation for OCRopus [19] does not cover training of its built-in recognizers, but Tesseract may still be used as the recognizer, and its training I have already described in this paper. In the plans posted to the project website in 2010, the site maintainers state that training scripts will become available in version 0.5.

One of the interesting things about applying the algorithms Breuel describes to line segmentation, in particular with Latin and Devanagari together, is that it illustrates how similar the two scripts are. Distinct graphemes are laid out left-to-right, graphemes may be consonants or vowels, and as Breuel notes, "...phenomena like diacritics, word spacing and kerning are common to both writing systems".

My evaluation of OCRopus, from a quick installation and test run on English text (note: it will not compile in versions of Ubuntu after 10.04 due to outdated packages), was that it may become relevant if work on the project continues, but at the moment it offers no off-the-shelf functionality that Tesseract on its own does not. Furthermore, work on the project seems to have stalled.


9 Manifest

Table IV: List of Project Deliverables, with Descriptions

Item                                          Location      Description
batch-rotate90.scm                            SourceForge   Gimp script for batch rotate, deskewing
gimp-deskew-plugin.tar.gz                     SourceForge   the Deskew plugin
deskew.sh                                     SourceForge   shell script for the pagetools deskewer [14]
AbbyyMultiLang                                SourceForge   Java program using FineReader
A New Thai English Dictionary                 tarball       1,508 jpg files
A Tibetan English Dictionary, 2003 printing   tarball       697 jpg files
A Tibetan English Dictionary, 1881 printing   tarball       120 jpg files
English Farsi (Persian) Dictionary            tarball       684 jpg files
English Persian Dictionary                    tarball       570 jpg files
Pocket Thai Dictionary                        tarball       97 jpg files
Thai English Dictionary                       tarball       954 jpg files
Training Data, A Tibetan English Dictionary   SourceForge   Tesseract output files, four font boxfiles
Training Data, Pocket Thai Dictionary         SourceForge   Tesseract output files, three font boxfiles
Abbyy Results, A New Thai English Dictionary  tarball       text and XML files for subset of pages
Abbyy Results, Pocket Thai Dictionary         tarball       text and XML files for subset of pages
Abbyy Results, Thai English Dictionary        tarball       text and XML files for subset of pages

Some items listed in Table IV are available in the SourceForge project [5]. The boxfile/image pairs are available as a download; the code is in the code repository. For copyright reasons, I will not be posting the page images to SourceForge. Should Utilika wish to take ownership of the physical material as well as the scanned images, I will be happy to transfer them via post. The scans will be provided on an SD card; the total size, even after gzipping, is over 3 GB.

10 Conclusions

In my opinion, while Tesseract and OCRopus are certainly promising projects, Abbyyis the most practical solution for obtaining clean data. More investigation would berequired for me to comment on the practicality of training Abbyy.


References

[1] Benjawan Poomsan Becker and Chris Pirazzi. New Thai-English, English-Thai Compact Dictionary for English Speakers with Tones and Classifiers. Paiboon Publishing, 2009, p. 982. ISBN: 1887521321. URL: http://www.amazon.com/Thai-English-English-Thai-Dictionary-Speakers-Classifiers/dp/1887521321.

[2] Thomas Breuel. "Applying the OCRopus OCR System to Scholarly Sanskrit Literature". In: Sanskrit Computational Linguistics (2009), pp. 391-402. DOI: 10.1007/978-3-642-00155-0_21. URL: http://portal.acm.org/citation.cfm?id=1530303.1530324.

[3] The Crowley Company. Thecrowleycompany.com. Mar. 2012. URL: http://www.thecrowleycompany.com/.

[4] DictionaryReader. May 2012. URL: https://sourceforge.net/projects/dictreader/.

[5] DictionaryReader SourceForge Project. May 2012. URL: https://sourceforge.net/p/dictreader/.

[6] D. Doermann and B. Karagol-Ayan. "Use of OCR for Rapid Construction of Bilingual Lexicons". In: (2003). URL: http://onlinelibrary.wiley.com/doi/10.1002/cbdv.200490137/abstract; http://www.stormingmedia.us/28/2865/A286554.html.

[7] Utilika Foundation. Utilika.org. Mar. 2011. URL: http://utilika.org/info/intern2011.html.

[8] Utpal Garain and D.S. Doermann. "Maryland at FIRE 2011: Retrieval of OCR'd Bengali". In: lampsrv02.umiacs.umd.edu (2011). URL: http://lampsrv02.umiacs.umd.edu/pubs/Papers/utpalgarain-11b/utpalgarain-11b.pdf.

[9] Dariush Gilani. English Persian Dictionary. San Jose: Maple Press Inc, 1983. ISBN: 0936347953.

[10] Sulayman Hayyim. English-Persian Dictionary. 5th ed. New York: Hippocrene Books, 2006. ISBN: 0781800560. URL: http://books.google.com/books?hl=en&lr=&id=OxBMU6P-4S8C&pgis=1.

[11] Benjawan Jai-Ua and Michael Golding. Pocket Thai Dictionary. Singapore: Periplus Editions, 2004, p. 96. ISBN: 079460045X. URL: http://www.amazon.com/Pocket-Thai-Dictionary-Thai-English-English-Thai/dp/079460045X.

[12] H. A. Jaschke. A Tibetan-English Dictionary (Dover Language Guides). Dover Publications, 2003. ISBN: 0486426971. URL: http://www.amazon.com/Tibetan-English-Dictionary-Dover-Language-Guides/dp/0486426971.

[13] H. A. Jaschke. Tibetan-English Dictionary (With Special Reference to the Prevailing Dialects, to which is added an English-Tibetan Vocabulary). Delhi: Motilal Banarsidass Publishers Private Limited, 1998, p. 671. ISBN: 8120803213. URL: http://www.amazon.com/Tibetan-English-Dictionary-Prevailing-English-Tibetan-Vocabulary/dp/8120803213.

[14] kale4. pagetools. Mar. 2012. URL: http://sourceforge.net/projects/pagetools/.

[15] Tapas Kanungo and Song Mao. "Stochastic language models for style-directed layout analysis of document images". In: IEEE Transactions on Image Processing 12.5 (Jan. 2003), pp. 583-596. ISSN: 1057-7149. DOI: 10.1109/TIP.2003.811487. URL: http://www.ncbi.nlm.nih.gov/pubmed/18237934.

[16] Kirtas. Kirtas.com. Mar. 2012. URL: http://www.kirtas.com/kabisI.php.

[17] Huanfeng Ma. "Adaptive Hindi OCR using generalized Hausdorff image comparison". In: ACM Transactions on Asian Language Information (2003). URL: http://dl.acm.org/citation.cfm?id=979875.

[18] moshpyTT. Mar. 2012. URL: http://code.google.com/p/moshpytt/.

[19] ocropus. May 2012. URL: http://code.google.com/p/ocropus/.

[20] OpenCV Library. Mar. 2012. URL: http://opencv.willowgarage.com/.

[21] L. O'Gorman. "The document spectrum for page layout analysis". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 15.11 (1993), pp. 1162-1173. URL: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=244677.

[22] Paul Rodrigues et al. "Quality control for digitized dictionaries". In: prodrigues.com. URL: http://prodrigues.com/presentations/rodrigues_etal_quality-control-for-digitized-dictionaries.pdf.

[23] Bound Book Scanning. Boundbookscanning.com. Mar. 2012. URL: http://boundbookscanning.com/order/.

[24] R. Smith and D. Antonova. "Adapting the Tesseract open source OCR engine for multilingual OCR". In: Workshop on Multilingual OCR (2009). URL: http://dl.acm.org/citation.cfm?id=1577804.

[25] Ray Smith. "An overview of the Tesseract OCR engine". In: Document Analysis and Recognition, 2007. ICDAR (2007). URL: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4376991.

[26] tesseract-ocr. Mar. 2012. URL: http://code.google.com/p/tesseract-ocr/.

[27] Tesseract OCR Chopper. Mar. 2012. URL: http://pp19dd.com/tesseract-ocr-chopper/.

[28] Wit Thiengburanatham. A New English-Thai Dictionary. Ruamsan Publishers, 2000. ISBN: 9742464952. URL: http://www.amazon.com/New-English-Thai-Dictionary-Sentence-Structures/dp/9742464952.

[29] Training Tesseract. Mar. 2012. URL: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3.
