
Information Processing from Document Images

M. N. S. S. K. Pavan Kumar and C. V. Jawahar

Centre for Visual Information Technology
International Institute of Information Technology
Gachibowli, Hyderabad - 500 019, INDIA
{pavan@gdit., jawahar@}iiit.net

1 Introduction

Analysis of document images for information extraction has become very prominent in the recent past. A wide variety of information, conventionally stored on paper, is now being converted into electronic form for better storage and intelligent processing. This needs processing of documents using image analysis algorithms. Document image analysis differs from conventional image processing in its format and information content. Document images are usually rich in formally presented information. The subjectiveness associated with natural image analysis is therefore not present in document images. Information in these images is more structured, and presented in a natural language with the help of a grammar and a script. Consider an image of a document page as shown in Figure 1(a). This contains text blocks and images. Text blocks can be paragraphs of text in various fonts and sizes, titles or captions. Extracting information from the image (graphics block) is difficult compared to extracting it from the text block. A text block can be converted to editable text if the constituent script, font, characters etc. can be recognised. This recognised text can provide useful information about the graphics block. This can be of immense help in situations where one searches for information in a large database of document images. In this information-rich modern era, one often comes across situations where search results are needed at one's finger tips. There are two basic issues associated with this: (a) representing the bulky raw data in a compact and interactive form, and (b) retrieving relevant information from the database.

An image of a newspaper, even if represented as a binary image, needs megabytes or even gigabytes of storage space. Compression of these document images can result in a drastic reduction of storage space. However, most compression algorithms are designed for subjectively complex natural images, and for well-structured document images we need a totally different approach. Data compression in document images can be achieved by converting the images of characters into corresponding symbolic codes (like ASCII or ISCII). This may introduce loss in the visual content, but not in the information content. More importantly, such a representation of a document image allows easy manipulation of the content for many applications like word processing and information retrieval. Consider the problem of searching for all relevant information about a particular person in a huge newspaper archive. A document image analysis system can quickly carry out the search and provide the necessary results. This involves a number of preprocessing, segmentation and classification steps.

Figure 1: (a) A sample Indian language document page which contains text in various fonts, varieties of graphics and colour combinations. (b) A sample document with tables and text. Locating text and image blocks and tables, and defining appropriate algorithms, is the major challenge in document image analysis.

A document image analysis system starts from an input image similar to the ones shown in Figure 1. The system initially attempts to cancel out noise effects and makes the necessary geometric corrections. The document image in Figure 1(a) contains text blocks in different fonts, colours, indentations and orientations. Images are inserted at different places in the document to give a pleasing appearance. This makes it very difficult to identify the boundaries of an image and separate it from the textual region. One of the challenges the image in Figure 1(b) poses is identifying the table boundaries, graph boundaries, text boundaries, etc. This is followed by recognition of the contents of the text, the tables and, if possible, the graphics blocks.

Document image analysis introduces many new challenges in the Indian context. Multilingual, multi-script documents pose a Himalayan challenge to this discipline. Most of the practical systems developed in other parts of the world have focused on the interpretation of a single script. In addition, most Indian scripts have more characters in the alphabet than English, complicating the pattern recognition problem. The presence of cursive scripts and matras further complicates the situation. Any practical document image analysis system developed in the Indian context should also address specific sections of the society. Handwritten documents from partially literate people may contain typographical and grammatical errors. Also, the paper quality and ink quality of documents may vary greatly between users.

Document image analysis has become a mature branch of research over the last twenty years [1]. Though research on document images has made remarkable progress in the West [2], research on the analysis and interpretation of images with Indian language content has received attention only in the recent past [3, 4, 5, 6]. In the International Conference on Document Analysis and Recognition (ICDAR), the growth of work on Indian language documents is clearly visible. In ICDAR'95, around 3 papers were published on Indian language OCR. ICDAR'97 introduced a special section on Asian character recognition. In ICDAR'99, a separate section on Indian language document image analysis was introduced in order to accommodate the growing number of publications. Conferences like ICVGIP, KBCS, NCC and NCDAR also encourage research in this area. This chapter discusses the major concepts and issues involved in the field of document image analysis. Section 2 introduces the terminology and complexities in document image analysis, with details about different categories of documents and their application domains. The challenges involved in document image analysis are discussed in Section 3, with special emphasis on Indian language documents. An overview of optical and intelligent character recognition is presented in Section 4. Storage and management of document datasets is presented in Section 5, and the latest technologies that have emerged in this field are discussed in the last section.

Figure 2: Some of the popular digitizing devices.

2 Document Categories and Application Domains

A document is a written or printed paper that bears the original, official, or legal form of data and can be used to furnish decisive evidence or required information. There is a wide variety of documents that we encounter in day-to-day life. This includes documents that are used to communicate information, in the form of letters and newspapers, and documents which archive information for later validation or use. Of late, the conventional definition of a document has been modified by the emergence of electronic documents. There exists a spectrum of documents on the World Wide Web and in electronic media. The challenges associated with processing such electronic documents are considerably different from those of information extraction from digitized or converted electronic documents. In this chapter, we limit our attention to the latter case of processing document images, where we presume that the document image consists of textual and visual information. These are primarily digitized versions of handwritten or printed paper documents. A document can be characterized based on its content or its generation mechanism. The broad categories of documents are described below.

2.1 Document Categories

2.1.1 Hard and Soft documents

A hard document is a physical form of document where information is present in textual or graphical form. Soft documents are the ones which are created with the help of electronic devices. There are a number of electronic devices which allow the conversion of hard documents to soft ones. Most of these devices convert an analog signal (say illumination or pressure) into a digital representation. Examples of digitizing devices are scanners, cameras etc. Some such popular digitizing devices are shown in Figure 2; the figure contains a camera, flat-bed scanner, hand-held scanner, stripe reader, keyboard etc. Soft documents include documents generated using markup languages, word processors, synthetic document images, scanned images etc. Soft documents can be structured or unstructured. Structured documents present information in a well-organized manner to aid information extraction. They typically use a hierarchical representation in the form of hypertext, XML etc. Understanding unstructured electronic documents with the help of computational models is the challenge associated with mimicking the perception and cognition of human beings. Problems in structured document processing are often macro in nature. They involve more language or information processing than image processing or pattern analysis, which form the core of unstructured document analysis. The emphasis of this chapter is on information extraction from unstructured documents.

2.1.2 Printed and Handwritten documents

Printed documents belong to the class of documents which are generated either mechanically or electronically from existing data. Most of the documents we encounter nowadays are of this kind. This class of documents has the advantage that all occurrences of any single character are apparently similar. Therefore, DIA systems aiming at recognition of such documents can make use of this favourable property. Such documents are characterized by the deterministic nature of the characters, their repetition patterns and well-defined layouts. Usually such documents are of very high quality.

Handwritten documents are also very popular. Processing and recognition of handwritten documents is relatively difficult. Cursive scripts, the variability and separation of characters, document skew, accommodating the variability of strokes, and learning human writing characteristics are very difficult to model. Most handwritten document understanding systems are employed in specific narrow domains. There are also hybrid documents where printed and handwritten words are present together. These documents may also contain text, tables, graphs and graphics.

2.1.3 Single language and multilingual documents

An important component of any document image analysis system is the character recognizer. A character recognizer converts a character image into ASCII (American Standard Code for Information Interchange) or ISCII (Indian Script Code for Information Interchange) text. The performance of the recognition algorithm deteriorates with an increase in the number of classes and the variability within a class. The accuracy one achieves on a single language like English cannot usually be extended to Indian languages, where the alphabet size and variability are very high. The problem gets compounded when multiple scripts are present.

India is a country with diverse languages and scripts. Therefore real-life documents can also be multilingual. Many documents use English, Hindi (the national language) and a regional language for official and commercial purposes. Processing of such documents is usually approached by identifying the script and employing the appropriate recognition scheme. The performance of a character recognition system can be improved if the script is identified a priori. In real life, it is very rare to find handwritten or printed documents in which even an isolated foreign language word is absent.


2.1.4 Online and Offline Documents

A new class of documents, online documents, finds tremendous applications in handheld devices and natural interfaces. In this category, the digitizer provides time information along with the spatial and intensity content of the image. The time of writing of each point on the curve is stored along with the intensity/colour/pressure information. Conventional document analysis has focused on offline documents, which treat a character as a single image, whereas online documents represent characters as a sequence of strokes. Algorithms for processing online documents utilize the time information and map the problem into an ordered sequence analysis.
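The stroke-sequence view described above can be made concrete with a small sketch. The names (`Point`, `Stroke`) and the sample data are hypothetical, chosen only to illustrate that an online character carries temporal order that an offline image lacks:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical minimal representation of online handwriting: each
# sampled point carries position, a timestamp, and pen pressure.
@dataclass
class Point:
    x: float
    y: float
    t: float            # time of writing, e.g. seconds from pen-down
    pressure: float = 1.0

@dataclass
class Stroke:
    points: List[Point] = field(default_factory=list)

    def duration(self) -> float:
        """Elapsed time between the first and last sample of the stroke."""
        return self.points[-1].t - self.points[0].t if self.points else 0.0

# An online character is an ordered sequence of strokes; the temporal
# order is preserved, unlike an offline character, which is one image.
character = [
    Stroke([Point(0, 0, 0.00), Point(0, 5, 0.10)]),   # first stroke
    Stroke([Point(-2, 3, 0.25), Point(2, 3, 0.30)]),  # second stroke
]
```

Sequence models (for example, ordered-sequence matching over such stroke lists) operate directly on this representation rather than on a rendered bitmap.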

2.2 Typical Application Domains

2.2.1 Newspaper Documents

Newspapers are one of the first mass communication media introduced by human beings. Gradually, other media like radio, television and now the Internet became popular. Even then, newspapers remain the most appealing information source for many. Newspapers are hard, printed, offline documents with very rich information content. These documents have existed ever since printing was invented. The number of available newspapers is now so abundant that it is very difficult to maintain them in a perishable paper format. This draws attention to the importance of digitization, understanding and archival of newspaper documents.

A newspaper page of size 24 in × 36 in, scanned at 300 dpi with 8 bits per pixel, needs close to 78 megabytes of storage uncompressed. Digital preservation of newspaper archives aims both at the preservation of endangered material (paper) and at the creation of digital library services that allow full utilization of the archives. DIA applications aim at developing integrated systems that provide solutions to problems related to the digitization and classification of newspaper images. The results can be further used in information retrieval and content-based indexing. For example, one should be able to search and retrieve all relevant information about a particular national figure with minimal effort from a large newspaper database.
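The storage figure above is simple back-of-the-envelope arithmetic:

```python
# Storage for the scanned newspaper page described above:
# 24 in x 36 in at 300 dpi, 8 bits (one byte) per pixel.
width_px = 24 * 300              # 7,200 columns
height_px = 36 * 300             # 10,800 rows
raw_bytes = width_px * height_px  # one byte per pixel at 8 bpp

print(raw_bytes / 1e6)  # megabytes, uncompressed
print(raw_bytes / 8e6)  # a 1-bpp binary scan is 8x smaller
```

Symbol-level coding (storing recognized characters as ASCII/ISCII, as discussed in the introduction) shrinks this by several further orders of magnitude.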

2.2.2 Form Processing

Any document requesting or collecting information from a user in a specific format is a form. Forms are one of the most common classes of documents which organizations encounter. For example, forms are used in railway reservations, handling deposits and withdrawals in banks, educational institutions, applications, data collection etc. In a form, each field has a corresponding value. The problem of form understanding is extracting the information related to each field. Since forms are meant for very specific applications, one usually expects very high accuracy and processing speeds.

The problem of information processing from forms is more structured in nature, while the recognition part is complex since the text is usually handwritten. In addition to character recognition, form processing also introduces challenges such as fields left blank and spurious marks on the fields, such as overwriting and erasures. The accuracy of form processing can be improved with better design of the forms [7].

Forms also contain many standard characteristics like lines, boxes etc. The sequence of operations in the layout analysis depends on prior knowledge about the forms. Model-driven and data-driven approaches can be adopted for form processing. If the form structure is unknown, a possible approach is to classify documents into a fixed number of classes and use the model corresponding to each document class to carry out the layout analysis.

2.2.3 Envelopes and Letters

Even in this modern era, regular mail forms a major communication mode. A good amount of time and effort is spent on sorting and classifying mail based on handwritten or printed addresses. An automated system can make this accurate and fast. Machines helping humans to do this is very common in many countries. ZIP codes or PIN codes were introduced primarily to increase accuracy and thereby reduce delivery time. Recognition of handwritten or printed numerals is relatively simple compared to the recognition of text [8]. In the Indian context, addresses are often multilingual and written in cursive scripts without PIN codes. This makes the problem extremely difficult to handle. Most of the time, even the localization of the address and other details is itself difficult. People also tend to use their own abbreviations and spellings for many cities and villages. At the same time, the finite number of post offices and city names can provide dictionary information to improve the recognition accuracy. The postal address recognition system developed at CEDAR [9] is a prime example of a successful document image analysis system.

2.2.4 Archival of Existing Documents

The most common application of document image analysis is to convert an existing hard document into a soft text form such that manipulation of the document is possible. Many commercial scanners also provide software to do this. However, they do not support Indian multilingual documents. Such intelligent digitization of paper documents finds tremendous application in various domains.

An important class of documents falling in this category is judgments given by courts. The judicial system in India started about 50 years ago, and court documents have been filed and stored since then. Proper digitization of these documents can help the judiciary to provide speedy and better judgments. Earlier judgments were handwritten, later ones were typewritten, and some of the recent ones are completely electronic. One has to keep this diversity in mind to design a versatile system. The recognition performance on official documents like judgments can be improved largely by using domain knowledge, because of the limited and specialized vocabulary used in official documents. But again, this makes systems specific to different domains and increases the difficulty of building a general system.

Document image analysis systems can aim at the rapid digitization of the abundant official and statistical information that exists in the administrative offices of India. A close observation of these kinds of documents shows that they contain a lot of information in addition to text, in the form of tables, bar graphs, pie charts, etc. These are graphical components of the documents which are different from images and need to be interpreted separately depending on the context and the information they convey. These introduce new problems to be handled, like text and image separation from the document images, analysis of the graphical sections of the image, etc.

2.2.5 Speech Applications

With advancements in speech technology, applications involving speech generation and OCR have been developed. Applications like document reading devices involve OCR followed by a Text-to-Speech (TTS) converter. The text obtained from the OCR output of the document image is converted into a sound signal by the TTS converter. These applications are of immense help to blind and illiterate people. Advanced applications in this field may also introduce a language translation module between the OCR and the TTS. This helps a person to understand a document in an unknown language.

3 Document Image Processing: Challenges

3.1 Preprocessing

3.1.1 Filtering and Binarization

Noise is an unavoidable issue anywhere in image analysis. Noise gets introduced in many processes which involve transmission or a change of media. For example, conversion of information from paper media to digital media may introduce noise. Noise reduction methods generally aim to reduce the effect of noise for the better performance of subsequent algorithms. Noise also gets introduced in transmission over networks. This can be mathematically modeled and corresponding reduction methods can be derived. The most popular methods for noise removal include mean filters and median filters. These filters replace each pixel with the mean/median of its neighborhood. Image processing filters are designed either in the spatial domain or in the frequency domain, based on the problem and the computational complexity [10, 11].
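A minimal sketch of the median filter described above (pure Python, with the window clamped at the image border; a real system would use an optimized library routine):

```python
from statistics import median

def median_filter(img, k=3):
    """Replace each pixel with the median of its k x k neighbourhood.
    `img` is a list of equal-length rows of grey values; at the borders
    the window is clamped to the image (one common boundary choice)."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = median(window)
    return out

# A single salt-noise pixel (255) in a flat region is removed entirely,
# which is why median filters suit the impulsive noise of scanned pages.
noisy = [[0, 0, 0],
         [0, 255, 0],
         [0, 0, 0]]
print(median_filter(noisy)[1][1])  # -> 0
```

Unlike a mean filter, the median never invents intermediate grey values, so character edges stay sharp.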

In any image analysis or enhancement problem, it is essential to identify the objects of interest and separate them from the rest. The group of pixels representing the objects of interest are called foreground pixels and the rest are called background pixels. Finding the boundary between foreground and background is the process of segmentation. The image segmentation process may also partition the scene into more than two mutually exclusive and collectively exhaustive regions [10].


Figure 3: A skewed handwritten character and its skew-corrected image; a skewed and corrected document. Note that in real images skew/rotation can be present for the entire document or for individual characters/components.

Segmentation can be achieved in the spatial domain or the gray-scale domain. Spatial segmentation draws a geometrical boundary between the objects present in the scene based on operations like edge detection, boundary identification etc. The second approach, gray-scale thresholding, divides the pixels into foreground and background based on a threshold gray value. The pixels on one side of the threshold value are the foreground pixels and those on the other side are identified as background pixels. This process is called thresholding. Thresholding algorithms may be broadly classified into global, local or adaptive techniques depending on the way they work [12]. These algorithms compute thresholds which optimize some objective function.
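As one concrete instance of a global technique that optimizes an objective function, the classic Otsu criterion picks the grey level maximizing between-class variance. The sketch below (hypothetical names, flat pixel list instead of a 2-D image for brevity) illustrates the idea:

```python
def otsu_threshold(pixels):
    """Return the grey level that maximizes the between-class variance
    (Otsu's global thresholding criterion) for 8-bit pixel values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))
    best_t, best_var = 0, -1.0
    w_b, sum_b = 0, 0.0                 # background weight and sum
    for t in range(256):
        w_b += hist[t]
        if w_b == 0:
            continue
        w_f = total - w_b               # foreground weight
        if w_f == 0:
            break
        sum_b += t * hist[t]
        mean_b = sum_b / w_b
        mean_f = (sum_all - sum_b) / w_f
        between_var = w_b * w_f * (mean_b - mean_f) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t

# Dark ink (values near 20) against bright paper (values near 230):
page = [20, 25, 22, 230, 235, 228, 231, 24]
t = otsu_threshold(page)
binary = [1 if p <= t else 0 for p in page]  # 1 = foreground ink
print(binary)  # -> [1, 1, 1, 0, 0, 0, 0, 1]
```

Local and adaptive techniques apply the same kind of criterion per window rather than once for the whole page.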

A noisy document usually undergoes filtering and binarization to separate the characters from the background. The quality of the output at this stage is usually very critical for the overall performance of the system.

3.1.2 Skew Detection and Correction

Document images often get skewed, or misaligned with respect to the standard axes, during creation. Skewed characters and documents can result in more misclassification and misinterpretation. Therefore compensation for skew is very important, and skew detection and correction form a major preprocessing step in document image analysis. Some amount of skew is usually introduced when a document is scanned. Before any further processing, it is necessary to ensure that the document is aligned properly. There can be skew associated with the entire document or with individual characters (Figure 3). In both cases, it needs to be corrected before further processing.

Projection profiles can be used for skew correction. A projection profile can be defined as the histogram of the number of foreground pixels in each scan line of the image. In a document with horizontal text lines and white space between lines, projection profiles provide clear evidence of white lines as an absence of black or coloured pixels. If the profile does not have the expected crests and troughs, the image is rotated by some expected or computed value. This method is very effective for scripts like Devanagari, where the presence of a bar (sirorekha) is a characteristic of the script [13]. There are many fast and efficient algorithms based on projection profiles which converge quickly and accurately to the solution. One variant of this algorithm divides the document into vertical or horizontal strips and takes a projection profile for each strip. The skew is then determined by measuring the average shift in zero crossings between strips. These variations, though fast, are restricted to skews between -10 and 10 degrees [14, 15].
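The profile criterion can be sketched as a search over candidate skews: undo each candidate, and keep the one whose profile shows the sharpest crests and troughs. All names and the synthetic data below are hypothetical, and the image is assumed to be given as a list of foreground `(y, x)` coordinates:

```python
from collections import Counter

def profile_score(pixels, slope):
    """Variance of the projection profile after undoing a candidate
    skew: foreground pixel (y, x) is projected onto scan line
    y - x*slope. Sharp text-line peaks separated by empty inter-line
    gaps give a high score; a residual skew smears the profile."""
    bins = Counter(round(y - x * slope) for y, x in pixels)
    lo, hi = min(bins), max(bins)
    profile = [bins.get(b, 0) for b in range(lo, hi + 1)]
    mean = sum(profile) / len(profile)
    return sum((p - mean) ** 2 for p in profile) / len(profile)

def estimate_skew(pixels, slopes):
    """Exhaustive search: the slope whose deskewed profile is sharpest."""
    return max(slopes, key=lambda s: profile_score(pixels, s))

# Two synthetic 'text lines', each drawn with a true skew of 0.2 pixels
# of vertical drift per column:
pixels = [(round(10 + 0.2 * x), x) for x in range(40)] + \
         [(round(30 + 0.2 * x), x) for x in range(40)]
candidates = [i / 10 for i in range(-5, 6)]
print(estimate_skew(pixels, candidates))  # -> 0.2
```

The faster published variants cited above replace this exhaustive search with coarse-to-fine refinement or the strip/zero-crossing measurement.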

Hough transform based techniques are also employed for document skew detection. Document images are preprocessed to produce black blobs where text is present. These blobs are then thinned into single lines. These lines, when Hough transformed, give rise to peaks in the parametric space. These peaks are identified, and the skew of the document is estimated and corrected. This method has some inherent disadvantages: the Hough transform is computationally intensive, and if the text lines in the document image are sparse, it does not give a good estimate of the skew [16].

A third method uses nearest-neighbour clustering to determine the skew. In this method, the 1-nearest neighbours of all connected components are found. The direction vectors of all the nearest-neighbour pairs are accumulated in a histogram, and the histogram peak gives the skew angle [17]. The appropriateness of a skew correction algorithm often depends on the specific script and font. The inherent curliness of Indian scripts complicates the problem of line detection and skew correction. At the same time, the presence of the sirorekha simplifies the skew detection problem in scripts like Devanagari, Bangla etc.

3.2 Document Segmentation

Separation of text and graphics, and their identification and recognition, is a very important problem to be addressed in document analysis. The methods used for solving this problem generally make use of the differences in the properties of textual and image regions within the document. Text generally possesses a spatially distinctive sequence and order. Textual regions also respond differently from images to some of the filters which identify textural properties. Projection profile based approaches perform equally well on Manhattan layouts. Recent techniques have introduced agents into image processing, based on image pyramids [18]. There are two basic paradigms in which one can look at a document: top-down and bottom-up.

The top-down view describes the document starting with a hypothesized format. Initially the whole document is taken, with a few assumptions regarding the document, and high-level processing follows. The document is decomposed into textual and non-textual regions based on that. These segments are then broken up into finer sections by recursively applying the same technique. This method, since it assumes a document format, works efficiently and effectively for images of known document formats [19]. Bottom-up approaches build up the document from the pixel level. The pixels in the document are grouped together, subject to some constraint, into small components. The components are then grouped, and the steps are repeated until the required area is totally covered. Each of these segments is then processed to identify the text, script and image content.
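A classic instance of the top-down paradigm is the recursive X-Y cut, which alternately splits a block at empty scan lines and empty columns. The toy sketch below (hypothetical names; any single blank line counts as a cut, whereas real systems require a minimum gap width) shows the recursion on a binary image given as rows of 0/1:

```python
def runs(flags):
    """Maximal runs of True in a boolean list, as half-open (start, end)."""
    out, start = [], None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i
        if not f and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(flags)))
    return out

def xy_cut(img, y0=0, x0=0):
    """Toy recursive X-Y cut: split at empty scan lines first, then at
    empty columns, recursing until a block has no internal gap.
    Leaves are returned as half-open (top, left, bottom, right) boxes."""
    if not img or not any(any(row) for row in img):
        return []
    row_runs = runs([any(row) for row in img])
    if len(row_runs) > 1:
        return [box for a, b in row_runs
                for box in xy_cut(img[a:b], y0 + a, x0)]
    col_runs = runs([any(img[y][x] for y in range(len(img)))
                     for x in range(len(img[0]))])
    if len(col_runs) > 1:
        return [box for a, b in col_runs
                for box in xy_cut([row[a:b] for row in img], y0, x0 + a)]
    (ra, rb), (ca, cb) = row_runs[0], col_runs[0]
    return [(y0 + ra, x0 + ca, y0 + rb, x0 + cb)]

# A page with two 'columns' on top and a full-width 'block' below:
page = [[1, 1, 0, 0, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1]]
print(xy_cut(page))  # -> [(0, 0, 2, 2), (0, 4, 2, 5), (3, 0, 4, 5)]
```

The hypothesized format enters through the cut order and gap thresholds, which is why the method suits known Manhattan layouts.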

3.2.1 Some Popular Approaches

Smearing can be used to group areas of the document image with similar features together. Smearing a document image horizontally converts the textual regions into black bands. Document images are smeared repeatedly until certain constraints are met. Smearing in both directions results in an image consisting of a set of blocks. Features are then extracted from these blocks, which are classified into either text blocks or images. Finer classifiers are required to separate halftones from images. These methods are heuristic based, and the results depend on the size of the font and other content in the image. After segmentation, the content of the segmented blocks has to be analyzed. The analysis can be based on different features, of which texture based features form an important set. The texture property of a block can be measured using filters like the Gabor filter.
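The smearing step is usually implemented as the run-length smoothing algorithm (RLSA). A one-row sketch (hypothetical names; the same pass is applied to every row, and to every column for vertical smearing):

```python
def smear_row(row, threshold):
    """Horizontal run-length smearing (RLSA): flip short runs of
    background (0) to foreground (1) when the run is no longer than
    `threshold`, fusing nearby characters into solid bands."""
    out, run = row[:], 0
    for i, p in enumerate(row):
        if p == 0:
            run += 1
        else:
            if 0 < run <= threshold and i - run > 0:
                # short inter-character gap (not a leading margin): fill it
                for j in range(i - run, i):
                    out[j] = 1
            run = 0
    return out

# Characters separated by small gaps fuse into one band, while the wide
# gap between two words/columns survives:
line = [1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(smear_row(line, threshold=2))
# -> [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
```

The choice of `threshold` is exactly the font-size-dependent heuristic mentioned above: too small and characters stay separate, too large and distinct blocks merge.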

Horizontal and vertical projection profiles give structural information about the document image under consideration. This information can be used to split the area into blocks, and also to identify what kind of data is present in the segmented blocks. The horizontal projection consists of periodic crests and troughs in the textual regions and a uniform band of high values in the image regions. This gives the horizontal boundaries of the images and text present. Similarly, the vertical bounds can be obtained by taking vertical projections. The intersections of the horizontal and vertical boundaries give the boundaries of blocks. This method directly yields the kind of data present in each block. Because of the variety of images that might be present, further validation of the blocks is required. This method expects the document to be in a Manhattan layout [20]. Projections in directions other than horizontal and vertical can also be used. The Radon transform can be used to project the block content onto any arbitrary line. Textual and image regions give different projections in different directions, say at 45 and 135 degrees. These responses can be analyzed and the segmented blocks can be labeled.

Recently, agent based methods combined with pyramidal image processing have also been employed for document segmentation. The preprocessing for this kind of pyramid construction involves constructing 8-connected components and removing large objects which potentially belong to the category of images. This removal process takes the area histogram into consideration. Thus, the pyramid based algorithm builds the next layer with more textual content than image content. In an agent based approach, each agent starts at the highest layer of the pyramid, where it analyzes whether a connected component in the previous layer now becomes disconnected because of the increase in spatial resolution. If so, it observes the density of the connected components. If the density of the connected components is greater than the average connected component density in that layer, the agent follows its path to the next layer. Otherwise, the agent stops at that point. The components at each level are clustered based on some distance criterion. This method does not assume any particular orientation of the textual regions of the image [18].

3.3 Document Decomposition

The components of a document can be grouped into different classes based on the function they perform. Some important classifications are studied below.


3.3.1 Physical Components

These components can be defined as the ones which make up the document in a macro way; that is, they constitute the geometrical components of a document. Examples of such components are tables, images, etc. These are important features for the document classification and categorization problem.

One of the most intuitive methods of physical component analysis is to examine the horizontal and vertical profiles of foreground pixels. Consider a document with text and gray-level or half-tone images in it. The horizontal profile of the foreground pixels shows a continuous strip of high counts in the region where there is an image and a periodic count in the textual areas, as illustrated in the figure. Horizontal profile analysis gives the vertical extents of the contents of a document; similarly, the vertical profile gives the horizontal extents. Combining these results gives the extents of the geometrical components. Once the geometrical separation of the components is obtained, it is relatively easy to determine whether a region is text or image. Profiling itself suggests the content type of each component, which can be confirmed by applying methods like texture analysis: textual regions form a coarse texture, whereas image regions form a smooth texture.

3.3.2 Logical Components

Logical components are the primitives which convey some semantics to the reader. For example, consider a system for postal applications in Indian states, where letters from different places, with addresses written in different languages, are to be separated into different categories for dispatch. It is known that each address obeys a certain grammar. Identifying an instance of this grammar in the document is the task of logical component extraction. This can be done by decomposing a rectangular address block into sub-images recursively by horizontal and vertical application of block grammars to determine the major logical components. A permissible form of the last line of a postal address would generally be a city name followed by its PIN code. Searching for this grammar in the document image separates the image into parts which obey the grammar, from which the meaning can be easily identified. The PIN code extracted in this way can be sent to a digit recognition system which identifies the place to which the letter has to be sent.
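The last-line grammar described above can be sketched with a regular expression over recognized text; the pattern and the separator conventions below are hypothetical simplifications of a real block grammar:

```python
import re

# Hypothetical grammar for the last line of an Indian postal address:
# a city name followed by a 6-digit PIN code.
LAST_LINE = re.compile(r"^(?P<city>[A-Za-z .]+?)[-,\s]+(?P<pin>\d{6})$")

def parse_last_line(line):
    """Return (city, pin) if the line matches the grammar, else None."""
    m = LAST_LINE.match(line.strip())
    return (m.group("city").strip(), m.group("pin")) if m else None

print(parse_last_line("Hyderabad - 500019"))  # -> ('Hyderabad', '500019')
print(parse_last_line("no pin here"))         # -> None
```

In a full system, this matching would run on the OCR output of the bottom-most text line of the address block, and the extracted PIN digits would be passed to the digit recognizer.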

3.3.3 Recognition of Graphical Components

Graphics recognition is a vast and important area of document image analysis. Some of the problems it deals with include converting raster drawings to vector drawings, recognition of graphical primitives, recognition of shapes and symbols like logos, and analysis and interpretation of diagrams such as engineering drawings, logic diagrams, maps, charts, line drawings, tables, etc. It also includes graphics-based information retrieval without an explicit optical character recognizer. A vector drawing stores the image in a vector representation rather than storing all the pixels of the image. For example, a line in the image is stored by its end points, or by its magnitude, direction and translation, rather than by all the pixels on the line. Conversion of machine drawings into vector drawings reduces the storage size to a great extent. Vectorized images are well suited for scaling and other transforms, whereas ordinary bitmaps give blurred and jagged effects. This conversion makes the drawings editable and makes content analysis possible. This feature can be used to build a database of machine drawings or line art which can be indexed automatically and searched.

While conversion of images of engineering drawings to vector drawings plays an important role in reducing size and increasing the availability of engineering drawings, image interpretation from drawings and maps also forms a major field of interest. Map analysis is needed for automating many GPS applications, and also for analyzing satellite images, topographical analysis, automating weather forecasts from satellite images, etc.

Logo recognition has emerged as an important field of research in recent years. Automatic grouping of official documents based on the logos and other pictorial content present on them forms an interesting and useful application of this field.

3.4 Multiple Language Documents

One of the major problems to be addressed in multiple language document processing is determining the location of the individual script and language content in the document image. If the text regions are separated into blocks of known scripts, each text block can be further processed and its characters sent directly to the corresponding OCR module.

There are many general strategies for separating and recognizing text segments in a document image. They are based on either the spatial-domain or the frequency-domain characteristics of the scripts. In the spatial domain, linearity, curvature, etc. form the basic characteristics; in the frequency domain, the responses to various band-pass filters are employed for the same purpose [21].

Optical density in text images provides a clue for script recognition. This feature was applied to recognize scripts within the Han script class [22]. Yet another language-set-specific method is based on the projection profile analysis of the individual components. A pioneering work [23] in this area uses a decision-tree-based method for separating English, Urdu, Bengali and Devanagari. The method uses statistical, topological and stroke-based features in addition to the projection profiles. These methods require reliable segmentation of the characters.

Script-specific templates can also be created for different languages, and the input image is segmented using these templates [24]. A direction-distance based classifier first takes the connected components and analyzes their neighborhoods to recognize scripts. Considering each pixel in the connected component under analysis as the origin, a few imaginary lines in pre-decided directions are drawn, and the numbers of black pixels along the lines at specific distances are recorded. This is called the direction-distance histogram. The histograms for all the components are then added up, which finally yields the feature set of that specific region. This is matched to the existing database using the Mahalanobis or Euclidean distance. Other features which can be used for script identification are the number of upward concavities, the number of downward concavities, crossing counts, etc. There are also algorithms which treat each language region in the document as a texture and apply texture segmentation procedures. One of the most reliable texture segmentation procedures is based on Gabor filters. A Gabor filter is a band-pass filter characterized by its center frequency and bandwidth in the transformed domain. The input image is passed through a set of Gabor filters and the outputs are used to compute features. Since no script-specific shape features are involved, this method is very general in the sense that it quantifies the frequency components in different directions of the pattern. Consider the application of this method to some languages. English and Malayalam, which look very different, give similar responses to the 0-degree Gabor filter because both of them contain vertical strokes. Scripts for languages like Telugu and Kannada, which are circular in nature, respond similarly to Gabor filters of any given direction.
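A minimal sketch of Gabor-filter feature extraction, assuming a real-valued cosine Gabor kernel and FFT-based convolution; the kernel parameters (frequency, sigma, size) and the synthetic stripe image are illustrative choices, not values from the works cited above:

```python
import numpy as np

def gabor_kernel(theta, freq=0.125, sigma=3.0, size=15):
    """Real Gabor kernel: Gaussian envelope times a cosine carrier along theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * freq * xr)

def gabor_energy(image, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Feature vector: filter response energy per orientation (FFT convolution)."""
    feats = []
    for t in thetas:
        k = gabor_kernel(t)
        # Convolve in the frequency domain, zero-padding the kernel.
        F = np.fft.fft2(image) * np.fft.fft2(k, s=image.shape)
        resp = np.real(np.fft.ifft2(F))
        feats.append(float((resp**2).sum()))
    return feats

# Vertical stripes of period 8 vary along x, matching the 0-degree carrier.
img = np.tile((np.arange(64) % 8 < 4).astype(float), (64, 1))
e = gabor_energy(img)
print([round(v, 1) for v in e])
```

For this stripe pattern the 0-degree filter gives the largest energy; a real script identifier would compute such energies over many orientations and frequencies per block.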

4 Optical and Intelligent Character Recognition

Once a document is segmented into its constituent components, more specific techniques are needed to extract information, among which Optical Character Recognition (OCR) is of primary importance. Much research on document processing since the 1960s has concentrated on OCR. OCR acts at the character level of a document image. It is often assumed that skew-corrected text, words and characters are extracted from the document image prior to the application of OCR. OCR converts the individual character images into a character code like ASCII, ISCII or Unicode, which can then be used to identify or understand a particular word, sentence, paragraph and finally the document.

The popularity of OCR has been increasing each year, with the advent of fast microprocessors providing the vehicle for vastly improved computational techniques. Many commercial and open source OCR packages have been developed for languages like English and Chinese, whereas few have been reported for Indian languages to date. OCR activity on the Indian scene was initiated in the early 1970s and has acquired a good amount of momentum during the last few years. A recent workshop held at the central university in Hyderabad concentrated on OCR activities alone. The Indian Statistical Institute has made remarkable contributions to the development and commercialization of an OCR for Devanagari. OCRs with varying amounts of success have been reported for many Indian languages, including Devanagari, Bangla, Gurmukhi [25], Kannada [26], Oriya [27], etc.

Indian languages and scripts introduce many new challenges compared to English. Indian scripts have originated from the ancient Brahmi script, which is phonetic in nature. Indian language character sets are better referred to as syllabaries than alphabets. Any Indian language alphabet is composed of consonants, anuswars, nasalization signs, visarg, vowels, vowel omission signs, conjuncts, diacritic marks, etc. Languages like English have characters as their building blocks; the smallest entity that can be extracted from a document in such languages is the character, whereas in Indian languages syllables form the basic building blocks. Each syllable is formed by the adjunction of a consonant and an optional vowel sign, nasalization sign or conjunct. This introduces a great difficulty in separating a syllable into its constituent components. For example, an exhaustive sign list of the Devanagari script would include about 469 characters, which can produce the syllabic forms (Barahkhadi) of about 5274 signs. OCR technology must take these complexities into account in order to make culturally significant text readable by machines. The punctuation marks used in Indian scripts are borrowed from English, except for the full stop, which is now increasingly being substituted by '.'. Each Indian language has its own numerals and other symbols or signs (like the avagrah) which are no longer popularly used in modern Indian scripts due to internationalization.

Figure 4: The figure shows the basic building blocks of a character recognition system. This involves an offline training process and classification/recognition based on the trained models.

4.1 General Architecture for an OCR

A pattern recognition system is at the heart of an OCR. Its performance greatly depends on the quality of the document image and on the processing algorithms employed. Figure 4 shows the basic building blocks of a typical OCR, though the finer details and interconnections may vary between implementations and applications. In general, the three basic modules are (a) image preprocessing, (b) feature extraction and (c) classification.

An OCR is initially trained using a set of training samples. A set of characters is collected for this purpose and is cleaned and labeled either manually or automatically. The features extracted from this set are used for training. Any new character to be recognized is preprocessed first; the features extracted from it are sent to the classifier, which uses the knowledge gained from training. The classifier recognizes the character, and the output is post-processed and displayed.
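The train/recognize flow just described can be sketched as a pipeline of pluggable stages. Everything below (the class names, the toy density feature, the nearest-mean classifier) is illustrative, not a real OCR component:

```python
class NearestMean:
    """Assigns the class whose mean feature vector is closest to the input."""
    def fit(self, X, y):
        groups = {}
        for feats, label in zip(X, y):
            groups.setdefault(label, []).append(feats)
        self.means = {
            lab: [sum(col) / len(col) for col in zip(*vecs)]
            for lab, vecs in groups.items()
        }
    def predict(self, x):
        dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
        return min(self.means, key=lambda lab: dist(x, self.means[lab]))

class OCRPipeline:
    """Preprocess -> extract features -> classify, as in Figure 4."""
    def __init__(self, preprocess, extract_features, classifier):
        self.preprocess = preprocess
        self.extract_features = extract_features
        self.classifier = classifier
    def train(self, labeled_images):
        X = [self.extract_features(self.preprocess(img)) for img, _ in labeled_images]
        self.classifier.fit(X, [lab for _, lab in labeled_images])
    def recognize(self, image):
        return self.classifier.predict(self.extract_features(self.preprocess(image)))

# Toy "images" as flat pixel lists; the feature is mean foreground density.
density = lambda img: [sum(img) / len(img)]
ocr = OCRPipeline(lambda img: img, density, NearestMean())
ocr.train([([0, 0, 0, 1], "A"), ([1, 1, 1, 1], "B")])
print(ocr.recognize([0, 1, 1, 1]))  # -> "B"
```

Real systems substitute genuine preprocessing (binarization, size normalization), the features of Section 4.1.3, and the classifiers of Section 4.1.4 for these placeholders.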

4.1.1 OCRs and Document Models

An OCR which works on scanned document images is called an offline OCR; the temporal information is missing in scanned documents. Online OCRs are now getting more attention with the introduction of new, advanced digitizing techniques. Offline OCRs lack information regarding the order in which characters and strokes were written. For example, in an offline OCR there is no direct way to distinguish a line drawn from top to bottom from one drawn from bottom to top.


4.1.2 Image Preprocessing

In addition to the generic preprocessing strategies for document images described in the previous section, character images may be preprocessed to improve the performance of the pattern recognition system. This involves algorithms like scaling the characters to a standard size, reducing noise in the characters, etc. Preprocessing algorithms may also be employed to make the character images font-independent.

4.1.3 Feature Extraction and Feature Selection

A feature is a compact representation of the information content in data. In the case of an OCR, the shape information of characters is generally characterized as features. Subjective measures like curvature, linearity, etc. are usually converted into numerical features. Deciding what features to use for a specific script is critical, and it is the phase where most of the development time of an OCR is spent. Below we list some of the popular features employed for character characterization and recognition. Note that it is not compulsory to work on a small set of extracted features like these; one could also consider the character image itself as the feature.

Boundary Descriptive Features Character boundaries can be effectively represented using methods like chain codes, which are popular for 2D shape representation. Features extracted from representations like chain codes, directional chain codes, etc. encode the directional information of the boundary of the character. Such features are, however, not very popular for character recognition.
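As an illustration, an 8-direction chain code for an already-traced boundary can be computed as follows; the direction numbering is one common convention, and boundary tracing itself is omitted:

```python
# Direction codes for 8-connected moves, as (row, col) offsets:
# 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE.
DIRS = {(0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
        (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7}

def chain_code(points):
    """Chain code of an ordered boundary: one code per move between neighbors."""
    codes = []
    for (r0, c0), (r1, c1) in zip(points, points[1:]):
        codes.append(DIRS[(r1 - r0, c1 - c0)])
    return codes

# Boundary of a 2x2 square, traversed clockwise from the top-left pixel.
square = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
print(chain_code(square))  # -> [0, 6, 4, 2]
```

Differences between successive codes give a rotation-tolerant variant (the differential chain code).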

Fourier Features A Fourier-domain representation of the boundary results in a different set of features which can characterize the shape of the characters. The Fourier transform is one of the most popular mathematical operations for transforming a signal from one domain to another. Some interesting features can be observed by transforming the image into its frequency domain, which can be attained by applying a two-dimensional Fourier transform to the given image.
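One common way to turn a boundary into Fourier features treats the boundary points as complex numbers; the sketch below, with an illustrative normalization, shows that the resulting descriptors are scale invariant:

```python
import numpy as np

def fourier_descriptors(boundary, n=8):
    """Magnitudes of the first n Fourier coefficients of a closed boundary.

    Points are treated as complex numbers x + iy; dropping the DC term makes
    the descriptors translation invariant, and dividing by the magnitude of
    the first coefficient makes them scale invariant.
    """
    z = np.array([complex(x, y) for x, y in boundary])
    coeffs = np.fft.fft(z)
    mags = np.abs(coeffs[1:n + 1])
    return mags / mags[0]

# The same square at two scales yields identical descriptors.
sq = [(0, 0), (2, 0), (2, 2), (0, 2)]
big = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(np.allclose(fourier_descriptors(sq, 3), fourier_descriptors(big, 3)))  # True
```

Using only the magnitudes also discards the phase, which makes the descriptors insensitive to the choice of starting point on the boundary.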

Structural Features Structural features are extracted by analyzing the topology of the boundary or the medial axis of the characters. Characters are encoded as the spatial arrangement of different primitive structures which are easy to identify. Horizontal and vertical line segments, their relative ordering, the number of knots and junctions, etc. characterize the character. Segmentation of a given character into its component geometry is, however, very difficult and may not be achieved easily.


Statistical Features These are simple mathematical measures which give useful information about the distribution of pixels in the given character image. The basic statistical feature is the mean, or centroid, of the pixels. Translating the pixels so that the origin coincides with the centroid makes the feature extraction translation invariant. Variance quantifies the spread of the character around its mean. Together, mean and variance approximately define a character's extent.

Moments Moments are among the features most extensively used by character recognition researchers. The moment $m_{pq}$ of a 2D discrete function $f(x, y)$ is defined as

$$m_{pq} = \sum_x \sum_y x^p y^q f(x, y)$$

Moments can be made invariant to translation, rotation and scaling, and they are also easy to compute. Zernike moments are also popular for character recognition [11].
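The raw and central moments can be computed directly from the definition; the sketch below also demonstrates the translation invariance of central moments on a toy image:

```python
import numpy as np

def raw_moment(img, p, q):
    """m_pq = sum_x sum_y x^p y^q f(x, y) for a 2D array f (x = column, y = row)."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    return float((x**p * y**q * img).sum())

def central_moment(img, p, q):
    """Moment about the centroid; invariant to translation of the pattern."""
    m00 = raw_moment(img, 0, 0)
    xc = raw_moment(img, 1, 0) / m00
    yc = raw_moment(img, 0, 1) / m00
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    return float(((x - xc)**p * (y - yc)**q * img).sum())

img = np.zeros((10, 10))
img[2:5, 3:7] = 1  # a 3x4 block of foreground pixels
print(raw_moment(img, 0, 0))  # -> 12.0 (the pixel count)

# Shifting the pattern leaves the central moments unchanged.
shifted = np.roll(img, (3, 2), axis=(0, 1))
print(central_moment(img, 2, 0) == central_moment(shifted, 2, 0))  # True
```

Normalized central moments can be combined further (e.g. into Hu's invariants) to obtain rotation and scale invariance as well.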

Selection of an appropriate feature set for character recognition is a very complex task. The major parameters which govern the choice are (a) the script and alphabet size, (b) whether the recognition problem is monolingual or multilingual, (c) the presence or absence of temporal (online) information, (d) the available computational power, (e) invariance to geometric transformations and (f) invariance to individual handwriting.

4.1.4 Training & Classification

Once the features are extracted, the classification module assigns the input feature vector to a specific class. The most important step in any pattern recognition system is the design of the classifier. Though a classifier can be supervised or unsupervised, a supervised classifier is employed for OCR so that class labels are output at the end. This requires training the system a priori with the help of a training set. The choice of classifier can also be made based on the availability of a training set (labeled samples). One could also design systems which learn over time and adapt to individuals.

Nearest Neighbor Classifiers The nearest neighbor classifier is one of the most basic and intuitive classifiers. A given vector is assigned to the class of its nearest neighbor. Feature vectors or templates for each of the classes are stored a priori and are used to decide the nearest neighbor of the given feature vector: the distance between the input vector and each sample vector is calculated, and the class of the closest sample vector is assigned to the input vector. The K-nearest neighbor classifier is an improvement over the nearest neighbor classifier; it assigns the class to which the majority of the K nearest neighbors belong. The distance between patterns can be computed using any distance measure, such as the Euclidean, Kullback-Leibler, or Mahalanobis distance.
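A minimal K-nearest neighbor classifier with Euclidean distance might look like this; the toy two-class feature vectors are illustrative:

```python
from collections import Counter

def knn_classify(x, samples, labels, k=3):
    """Label x by majority vote among its k nearest stored samples (Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, s)) ** 0.5, lab)
        for s, lab in zip(samples, labels)
    )
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D feature vectors for two character classes.
samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["o", "o", "o", "x", "x", "x"]
print(knn_classify((0.5, 0.5), samples, labels))  # -> "o"
```

With k = 1 this reduces to the plain nearest neighbor classifier; swapping the distance expression gives the other measures mentioned above.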

Neural Networks A neural network is a massively parallel distributed processor that can store experiential knowledge and make predictions using it. Neural networks have the ability to learn how to solve problems from the data given for training. Multilayer perceptrons trained with the backpropagation algorithm are very popular for character recognition. The synaptic weights are adjusted based on the error made by the network so that the error is minimized on subsequent samples. Initially, the synaptic weights of the network are initialized to random values and the first input is given to it. The network computes the corresponding output vector for the given input feature vector. This output is compared with the desired output and an error vector is calculated. Using this error vector, the weights are adjusted so that the next time the vector is encountered, the computed output is closer to the desired output. This process is applied over a large number of input samples, with the weights adjusted each time. Neural networks can approximate nonlinear decision boundaries and therefore provide excellent classification [28].

Multiple Classifiers The schemes we have seen so far involve a single feature vector and a single corresponding classifier. These are often insufficient for problems involving noisy inputs and a large number of classes, which is very common for OCRs. Using different feature vectors and classifiers can improve the overall performance of the system, because features and classifiers of different types complement each other in classification performance. The combination function for merging the classifiers' decisions should exploit the strengths of the different classifiers while suppressing their weaknesses, and it must receive useful representations of the classifier decisions. Majority voting and maximum detection are popular integration mechanisms. Recent methods rank the decisions made by each classifier and use a combination which is a function of the ranks of the individual classifiers. Class set reduction and class set reordering are two measures used for comparing the decisions made by different classifiers. Class set reduction aims at extracting from a given set of classes a subset of minimal size that contains the correct class. Class set reordering aims at ranking the given set of classes such that the correct class is given a rank close to the top [29].
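Majority voting, the simplest of the integration mechanisms mentioned above, can be sketched as follows; the optional confidence weights are an illustrative extension:

```python
from collections import Counter

def majority_vote(decisions, confidences=None):
    """Combine classifier decisions by (optionally weighted) majority vote."""
    weights = confidences or [1.0] * len(decisions)
    tally = Counter()
    for label, w in zip(decisions, weights):
        tally[label] += w
    return tally.most_common(1)[0][0]

# Three hypothetical classifiers disagree on a character image.
print(majority_vote(["e", "c", "e"]))                   # -> "e"
print(majority_vote(["e", "c", "c"], [0.9, 0.3, 0.4]))  # -> "e" (weighted)
```

Rank-based combination generalizes this by tallying full ranked class lists from each classifier rather than single decisions.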

Multistage Classifiers Multiple classifiers used in a cascaded manner are referred to as multistage classifiers. The basic idea is coarse classification in the initial stage and finer classification at later stages. Simple features and heuristics can be used initially to select a subset of the total classes, and the second-stage classifiers then try to fit the feature vector into one of the classes in the selected subset. This considerably reduces the number of classes each classifier at each stage has to handle, and hence the classifiers perform better.

4.2 Post-processing and Intelligent Character Recognition

Intelligent Character Recognition (ICR) places a layer over normal OCR. This layer helps in refining the output of the OCR. In ICR, the information recognized by the OCR is further interpreted using contextual information such as occurrence probabilities and frequencies of words and characters.

Some of the proposed applications of ICR include the creation of sophisticated indexes for document databases, improving OCR performance, document classification and content analysis. The technology used in the additional layer of ICR is closely related to that used in grammar correctors, spell checkers, etc. Although OCR systems exhibit high overall performance, they do not perform reliably when the characters are small or noisy. In such cases either general or domain-specific information can be used to improve the performance. The OCR output corresponding to mismatched words is matched against dictionary entries by approximate string-matching techniques.

4.2.1 Contextual Information and Linguistic Resources

ICR systems are often fine-tuned to a specific domain, as this gives better performance than a general purpose OCR. For example, an application to read the bibliographies from old books and prepare a research index can be improved using domain knowledge of the general terms used in naming research papers, along with a list of names and authors. This information can be used to correct the output, or to guess the output where the words are not clear. It is possible to create a database of words that frequently occur in the field, against which the output is cross-checked.

General approaches for correcting outputs include extracting information from linguistic resources and using statistical or distance-based methods to find the correct word.

Dictionaries Dictionaries are the most easily obtained exhaustive online resources for a language. There are many good online dictionaries for languages like English; unfortunately, not many Indian language dictionaries are available. For a general purpose OCR, dictionaries can be used to perform approximate string matching and correct the output. If there is information about the kind of documents the OCR system is going to handle, information specific to that field can be used. For example, an OCR developed for digitizing court judgments can have a list of frequently used legal terms in addition to a normal dictionary. The words in the domain-specific dictionary are matched against the input word first, and the general language dictionary is consulted next.

Availability of Language Heuristics Language heuristics, although difficult to find, can be used for a global correction procedure. One simple example of a language heuristic is that in English, in almost all occurrences of the letter q, it is followed by a u. If we are sure that the letter under consideration is q, then we can be equally sure that the next letter is u. This information may be used to minimise mismatches between u and v. Heuristic-based methods always have the inherent restriction that they are not universal. In some domains, heuristics can be derived by observing the behavior of OCRs and the frequencies of the errors committed.
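The q-u heuristic can be sketched as a post-processing rule; the function below is a hypothetical example, not a complete correction procedure:

```python
def apply_q_heuristic(word):
    """In English, 'q' is almost always followed by 'u', so a 'v' recognized
    immediately after 'q' is likely a misread 'u'."""
    out = []
    for i, ch in enumerate(word):
        if ch == "v" and i > 0 and word[i - 1] == "q":
            out.append("u")
        else:
            out.append(ch)
    return "".join(out)

print(apply_q_heuristic("qvick"))  # -> "quick"
print(apply_q_heuristic("vivid"))  # -> "vivid" (unchanged)
```

A practical system would maintain a table of such confusion-prone pairs (u/v, l/1, O/0) together with the contexts in which each substitution is safe.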


Corpora A text corpus is a large collection of language information in the form of text. A corpus, while not directly useful in the way a dictionary is, provides an ICR with a lot of statistical information about the language, such as word frequencies and n-gram frequencies of characters and word stems. It is also common to model a sentence as an nth-order hidden Markov model and compute the occurrence probability of the sentence. N-gram based methods prove very useful in refining the output of the OCR. This information can be obtained only from corpora and can be used to prioritize the words obtained as close matches to the input word from the OCR: each of the closest words is substituted into the sentence, the occurrence probabilities are calculated, and the word which yields the highest-probability sentence is given the highest priority.
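A toy sketch of n-gram based re-ranking, using a bigram model with add-alpha smoothing built from a tiny illustrative corpus:

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word, alpha=0.1):
    """Smoothed P(word | prev); alpha is an add-alpha smoothing constant."""
    v = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * v)

def sentence_score(words):
    """Product of bigram probabilities over the word sequence."""
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(prev, word)
    return score

# The OCR is unsure between "cat" and "cut"; the corpus prefers "cat".
candidates = ["cat", "cut"]
best = max(candidates, key=lambda w: sentence_score(["the", w, "sat"]))
print(best)  # -> "cat"
```

A real ICR would use a much larger corpus, log probabilities for numerical stability, and character n-grams as well as word n-grams.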

4.2.2 Text Matching Schemes

Nearest Neighbor Matching Nearest neighbor matching is similar to the nearest neighbor classifier; the only difference is the distance measure employed. The measure used here is the Hamming distance, popular in networking for comparing two messages consisting of finite strings of characters. The Hamming distance between two equal-length strings is the number of positions at which they differ; a Hamming distance of 0 indicates an exact match.
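A sketch of nearest neighbor matching under the Hamming distance; the sample dictionary is illustrative:

```python
def hamming(a, b):
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def nearest_word(word, dictionary):
    """Closest same-length dictionary entry under the Hamming distance."""
    same_len = [d for d in dictionary if len(d) == len(word)]
    return min(same_len, key=lambda d: hamming(word, d)) if same_len else None

dictionary = ["forum", "forms", "front", "frame"]
print(nearest_word("forns", dictionary))  # -> "forms"
```

Because the Hamming distance ignores insertions and deletions, words whose length was misjudged by the OCR need an edit-distance measure instead.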

Partial Matching One common kind of OCR error is a word with one or more character errors but the correct length; this occurs when there is a classification error. Partial matching is a technique that can be used to find dictionary words for such errors. It uses a ternary tree data structure to represent the words in the dictionary. Ternary trees provide extremely fast access and allow searching with regular expressions made with '.' or '*'. In this technique, '.' replaces the characters in the OCR output word that have a low recognition confidence. Ternary search trees combine the time efficiency of digital tries with the space efficiency of binary search trees; they are fast and provide more useful operations than hashing, which is commonly used for storing strings.
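The wildcard query itself can be illustrated with a linear scan over the dictionary; a production system would store the words in a ternary search tree for speed, which is omitted here:

```python
def matches(pattern, word):
    """'.' in the pattern matches any single character; lengths must agree."""
    return len(pattern) == len(word) and all(
        p == "." or p == c for p, c in zip(pattern, word)
    )

def partial_match(pattern, dictionary):
    """All dictionary words consistent with the wildcard pattern."""
    return [w for w in dictionary if matches(pattern, w)]

# Low-confidence characters from the OCR are replaced by '.'.
dictionary = ["scan", "scar", "star", "spam"]
print(partial_match("sca.", dictionary))  # -> ['scan', 'scar']
print(partial_match("s.a.", dictionary))  # -> ['scan', 'scar', 'star', 'spam']
```

The more low-confidence characters are wildcarded, the larger the candidate set, so the surviving candidates are usually re-ranked with frequency or probability information.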

Probability based matching Probability matching calculates the probability that the input word from the OCR was produced by each dictionary word. The words are then ranked by their probability values and returned with a confidence and ranking. The probabilities are calculated from the probability of a substitution error by the OCR and the probability of occurrence of each word in the dictionary.

Frequency information Frequency information can be incorporated into any matching technique along with the distance. After finding some closest matches to the input word from the OCR, the frequencies of the matches can be used to prioritize them over one another. Although the correct word is not always the one with the highest frequency, frequency certainly allows a more structured ordering when combined with the distance-based methods.

4.3 Form Design to Improve the Performance

Improved form interpretation and understanding can be achieved by developing better OCR systems. Effort can also be put in the other direction, by reforming or refining the input given to these systems. Forms can be designed so that the individual components of an OCR system perform better. Complexities are introduced into form understanding systems because of writers' idiosyncrasies, and it is difficult to model these idiosyncrasies when designing an OCR. It would be advantageous for OCR systems if forms were designed in such a way that these errors are partially prevented in the input form itself. The forms currently in use provide the user with either a blank space beside a field or a single line to write above. Such forms are difficult to process, as they inherently allow all the idiosyncrasies of the form fillers listed above. One solution is to restrict the user from writing haphazardly by imposing strict space restrictions [7]. It has been experimentally shown that representing the fields with touching bounding character boxes eases segmentation, which in turn improves the overall performance of an OCR system: the segmentation problem becomes simply cutting along the separations in the bounding box, or along the edges of each character box. Even this does not prevent all of the above problems, so a further improvement is to separate each character bounding box by some space. Thus, by redesigning the forms, errors due to user behavior are reduced to classification errors only [30].

5 Storage and Management of Document Datasets

Another important problem in document image analysis involves the efficient storage and retrieval of document data sets. Documents may be a single image or a composition of text and graphic components. In both cases, one is usually interested in retrieving pages which are similar in structure and/or content.

Documents which are electronically created using word processors are more structured than scanned document images or documents generated by an OCR. One needs to index these electronic documents for efficient retrieval from a large set of stored documents. Processing of queries in document databases is complex due to the presence of non-numerical and imprecise queries. Deciding the fields of a database for text documents which have many variations in style, and populating them, is a difficult task. The long-term goal of creating a document database is to provide the user with an answer to a query by rephrasing the information obtained. As this is very difficult, a practical goal is to provide the user with a small subset of documents which contain information close to the query.

Query formulation can be done in a variety of ways in document databases. For text-only documents, the query can be a question or a statement asking for documents containing text similar or related to the query statement. For document image databases, queries can be more complex, involving layout specifications, image content search, etc.

Automating the indexing of documents requires converting the given documents into an indexable representation. Providing such representations for text has been possible, but for images, graphs, etc. it still remains an unsolved, or at best partly solved, problem.

5.1 Indexing and Retrieving Text in Documents

The basic idea of indexing a document is to represent it in the form of a vector consisting of the most important features of the document. When a query is given, it too can be converted into a vector, and the results can be ranked based on the distance between each document's representative vector and the query vector.

Filtering of ‘stop’ words is the first step performed while indexing. These are the most commonly used words, which are not specific to any document and do not carry document-specific semantics. Words are then reduced to their root forms, a process called stemming, and the frequencies of the resulting stems are computed. Early systems represented documents using the top N most frequent words. These words, although they represent the current document properly, do not distinguish it from others, so the discriminative nature of the words is characterized using tf-idf (term frequency - inverse document frequency) weighting. With this weighting, document-specific words are highlighted: even if some words occur often in the document, they are assigned less weight if they also occur frequently across the database of documents [31].
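The tf-idf weighting can be sketched directly from its definition; the three toy documents below are illustrative:

```python
import math
from collections import Counter

docs = [
    "document image analysis and recognition".split(),
    "image compression and storage".split(),
    "text retrieval from document databases".split(),
]

def tfidf(doc, corpus):
    """tf-idf weights: term frequency scaled by inverse document frequency."""
    tf = Counter(doc)
    n = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)   # document frequency
        weights[term] = (count / len(doc)) * math.log(n / df)
    return weights

w = tfidf(docs[0], docs)
# "recognition" appears in one document, "and" in two: the rarer term weighs more.
print(w["recognition"] > w["and"])  # True
```

Each document's weight dictionary can then be treated as a sparse vector and compared to the query vector with a cosine distance, as described below.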

After representation, string matching or cosine distances can be used to find the closest documents. If the documents involve a lot of noise, which is generally the case with OCR-converted documents, approximate string matching and fuzzy systems may be employed to index and retrieve the related documents. Indexing OCR-converted documents after cleaning them using post-processing methods (similar to an ICR) is definitely helpful in situations where the information retrieval methods are sensitive to noise.
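One common form of approximate string matching is edit (Levenshtein) distance. The sketch below, with a hypothetical `fuzzy_match` helper and an arbitrary distance threshold, shows how an OCR-noisy word can still be matched against a query:

```python
def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance: the minimum number of
    # insertions, deletions, and substitutions turning string a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(query, words, max_dist=2):
    # Accept words within a small edit distance of the query, so that
    # OCR confusions (e.g. 'm' read as 'rn') do not prevent a match.
    return [w for w in words if edit_distance(query, w) <= max_dist]
```

For example, `fuzzy_match("document", ...)` still retrieves the OCR-garbled form "docurnent", which exact string matching would miss.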

5.2 Indexing and Retrieving Document Images

A scanned document gives a high-quality and exact representation of the original document when compared to the OCR-generated one. The main problems with an image representation of a document are that the content is not directly accessible as in text documents, and that the size of the document becomes huge. Electronic documents may be indexed based on the textual data or the image content.

There are many instances where one is interested in retrieving similar document images without worrying about the content. The layout of the document provides the critical clue for this. Holistic analysis of written text, size and coarse properties of graphical units can also be used. This can be done by characterizing some parts of the text with special properties, such as proper nouns, which are capitalized and have special feature dependencies on the neighboring words. Spotting key words from the document image using image properties can also be helpful in indexing document images. Key word spotting techniques are generalizations of proper noun finding methods. Trainable models like HMMs are employed to recognize key words in an image using the shape and boundary features of the word. Some word-level matching techniques involve generating images of the query word in different fonts and matching these generated images against the original document. Instead of encoding the word image as text, it can be encoded using different features like crossing counts, densities etc.; the input query can be encoded in the same way and matching can be done directly on these features. Automatic abstraction of document images aims at spotting key words along the document with internal consistencies, i.e. the system should be able to provide a summary of a given document image without converting it into text. This can again be done using the statistical models used in key word spotting. Stop words and other irrelevant words can be filtered out using frequency and word-width features, which reduces the search space.
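As an illustration of such feature-based word encoding, the sketch below computes a column-wise crossing-count profile (the number of background-to-ink transitions per column) for a binary word image and compares two profiles. The padding scheme and the L1 profile distance are assumptions for illustration, not a method from the literature cited here:

```python
import numpy as np

def crossing_counts(word_image):
    # word_image: 2-D binary array (1 = ink, 0 = background).
    # For each column, count 0->1 transitions from top to bottom;
    # this crude profile characterizes the word shape without
    # recognizing individual characters.
    img = np.asarray(word_image)
    transitions = (img[1:, :] == 1) & (img[:-1, :] == 0)
    # A stroke starting at the very first row also counts as one crossing.
    return transitions.sum(axis=0) + img[0, :]

def profile_distance(p, q):
    # Compare two words by their crossing-count profiles,
    # zero-padding the shorter profile to the longer length.
    n = max(len(p), len(q))
    p = np.pad(p, (0, n - len(p)))
    q = np.pad(q, (0, n - len(q)))
    return np.abs(p - q).sum()
```

A query word rendered (or written) as an image can then be matched against stored word images purely by comparing profiles, with no character recognition involved.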

Along with the content, document structure also plays an important role in indexing. Document structure acts as a key when documents with different layouts exist in the database. Document segmentation and zoning methods are very useful in detecting the document layout. Texture-based descriptions of the document can also be used to segment and index its structure. Logos form a unique indexing method for documents like official letters; logo recognition is a classical pattern recognition problem and is extremely useful in indexing. Indexing documents like maps, graphs and engineering drawings forms a very important application for document databases. Text-based methods and structure-based methods are used in indexing such documents.

6 Emerging Issues and Technologies

With the availability of high computational power and excellent algorithms, document image analysis has identified many new domains. Document image processing algorithms also find applications in situations where the graphical content dominates and the text content is minimal. Document images are no longer created only by orthographic projection; they can involve general perspective views. In addition to scanners, cameras and online digitizing devices can also be employed for creating documents. This has introduced a new set of problems in document image analysis. We briefly discuss some of these issues here.

6.1 Detection and Recognition of Text in Images and Videos

Most of the document image analysis algorithms were developed primarily for documents with mostly textual content and minimal graphical content. There are many situations where this is reversed. For example, consider the need to extract some information from an image of a commercial street, where one may need to recognise text regions that form only a small part of the image. Indexing the vehicles in videos by reading their number plates is yet another example.

In such situations, we need to detect the text first. This is achieved by computing structural or textural characteristics of the regions. In the next phase, the text blocks are processed using the appropriate character recognition modules.
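A very simple texture-based detector in this spirit classifies each image block by the density of strong horizontal intensity transitions, since text regions tend to be rich in stroke edges. The block size, gradient threshold and density threshold below are arbitrary illustrative values, not tuned parameters from any cited system:

```python
import numpy as np

def detect_text_blocks(gray, block=32, threshold=0.15):
    # gray: 2-D array of intensities in [0, 255].
    # Returns (row, col) indices of blocks whose fraction of strong
    # horizontal transitions exceeds the threshold.
    img = np.asarray(gray, dtype=float)
    h, w = img.shape
    hits = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            patch = img[r:r + block, c:c + block]
            # Fraction of pixel pairs with a large horizontal gradient.
            strong = np.abs(np.diff(patch, axis=1)) > 60
            if strong.mean() > threshold:
                hits.append((r // block, c // block))
    return hits
```

Blocks flagged by such a detector would then be passed to the character recognition modules; smooth regions (sky, road surface) are rejected cheaply without any recognition attempt.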

6.2 Modern Digitizing Devices

Many advances in hardware and human-computer interfacing have taken place in recent years. The main paradigms of input devices at present are keyboards for character-based input, mice for position-based input, and pens for a more natural human-computer interaction. There are also input devices which convert audio-visual data into electronic form. Each of these input devices has its own limitations, which has led to the evolution of newer, simpler and more portable input devices. Input devices may be broadly classified into the following categories: fixed-beam or moving-beam, hand-held or fixed-mount, and contact or non-contact. The following section briefly describes the evolution of newer hardware which makes data entry into the computer simpler.

Scanners are the most widely used input devices for document image analysis. There are three kinds of scanners based on the purpose they serve: page scanners, mark-sensing scanners, and bar code readers. They can be either fixed or hand-held. Digital cameras have recently been gaining popularity for imaging natural scenes and analyzing the textual content in them. While these are methods for offline recognition, the online recognition field has also made wide progress.

A graphics tablet is a flat board on which a magnetic pen is moved. The tablet has a fine grid of wires below it; the movements of the pen cause electrical disturbances in those wires, and the co-ordinates of the pen can be tracked. It is currently being used in sophisticated computer art packages and video-editing systems for special effects [32]. Along with this, different kinds of sensors based on the same concept, which are more portable and wireless, have also been developed. Generally, graphics tablets are used together with a keyboard and are still not sophisticated enough to carry along with a palmtop. The evolution of a new set of input devices solved this problem, built around the very natural writing instrument: a pen. These pens need no special graphics tablet as a base. MEMO-PEN (2000) is a recent development which has a ball-point pen with a CCD close to its tip [33]. It memorizes a series of partial snapshots of the handwriting captured by the CCD. Development of this system aimed at reducing the learning time for novices. This pen can be used to write on any surface, including walls, boards and the palm. Here, the limitation of using pen-based input together with a tablet is removed, which makes it more portable. Light pens [34] are mostly used in drawing programs and CAD (Computer-Aided Design) applications. Light pens work on the principle of sensing reflected light: a light pen emits a light beam and detects the amount of light reflected. Pen-based input is also favored because of its ergonomic advantages.



Figure 5: (a) Distortion introduced while scanning (b) Document image restored using Shape from Shading techniques

A number of alternate and more comfortable forms of input are being devised to help people with different types of physical handicaps. Devices held in the mouth and used to point to keys are manufactured. Foot-controlled devices similar to mice, and helmets which sense where the user is looking, have also been developed.

6.3 Recovery of Distorted Document Images

When documents are scanned from a thick bound book, it is a common observation that the central region of the document image becomes black (see Figure 5). This distortion is introduced by the proximal and moving light source, interreflections, specular reflections, non-uniform albedo etc. in the imaging system [35]. The distortion makes it almost impossible for a human eye to recognize what information is present in the black region. Content in these regions can be identified if the 3D structure of the book is known a priori; this structure can be obtained using shape-from-shading algorithms. Finding the structure of the book helps in reconstructing the document image without the distortion. Very interesting results have been reported for this problem; a sample result is shown in Figure 5.

References

[1] L. O’Gorman and R. Kasturi, “Document image analysis : An executive briefing,” 1999.

[2] G. Nagy, “Twenty years of document image analysis in pami,” Pattern Analysis and Machine Intelligence, vol. 22, pp. 38–62, January 2000.

[3] Sadhana, vol. 27. Indian Academy Of Sciences, February 2002.

[4] V. Bansal, “Integrating knowledge sources in devanagari text recognition,” doctoral thesis, IIT Kanpur, Department of Computer Science and Engineering, March 1999.

[5] V. Bansal and R. M. K. Sinha, “A devanagari ocr and a brief overview of ocr research for indian scripts,” in Proceedings of STRANS01, IIT Kanpur, 2001.

[6] U. Pal and B. B. Chaudhuri, “Script line separation from indian multi-script documents,”in Proc. Int. Conf. Document Analysis and Recognition (ICDAR), pp. 406–409, 1999.

[7] M. D. Garris and D. L. Dimmick, “Form design for high accuracy optical character recognition,” Pattern Analysis and Machine Intelligence, pp. 653–656, 1996.


[8] E. Cohen, J. J. Hull, and S. N. Srihari, “Control structure for interpreting handwritten addresses,” Pattern Analysis and Machine Intelligence, pp. 1049–1055, 1994.

[9] “Center for document analysis and recognition, buffalo, new york. http://cedar.buffalo.edu.”

[10] R. C. Gonzalez and R. E. Woods, Digital Image Processing, ch. 7, pp. 413–478. Prentice-Hall, 1994.

[11] A. K. Jain, Fundamentals of Digital Image Processing, ch. 7. Prentice-Hall, 1989.

[12] V. Wu and R. Manmatha, “Document image clean-up and binarization,” in Proceedings of SPIE Conference on Document Recognition V, San Jose, California, January 24-30, 1998.

[13] U. Pal and B. B. Chaudhuri, “An improved document skew estimation technique,” Pattern Recognition Letters, pp. 899–904, 1996.

[14] B. Yu and A. K. Jain, “A robust and fast skew detection algorithm for generic documents,” Pattern Recognition, pp. 1599–1629, 1996.

[15] G. Farrow, M. Ireton, and C. Xydeas, “Detecting the skew angle in document images,”SP:IC, vol. 6, pp. 101–114, May 1994.

[16] Y. Min, S. B. Cho, and Y. Lee, “A data reduction method for efficient document skew estimation based on Hough transformation,” in Proceedings of 13th International Conference on Pattern Recognition, pp. 732–736, 1996.

[17] L. O’Gorman, “The document spectrum for page layout analysis,” Pattern Analysis andMachine Intelligence, pp. 1162–1173, 1993.

[18] C. L. Tan, B. Yuani, and C. H. Ang, “Agent-based text extraction from pyramid images,” in International Conference on Advances in Pattern Recognition, Plymouth, UK, pp. 344–352, November 23-25, 1998.

[19] R. G. Casey and E. Lecolinet, “A survey of methods and strategies in character segmentation,” Pattern Analysis and Machine Intelligence, pp. 690–706, 1996.

[20] M. Krishnamoorthy, G. Nagy, S. Seth, and M. Viswanathan, “Syntactic segmentation and labeling of digitized pages from technical journals,” Pattern Analysis and Machine Intelligence, pp. 737–747, 1993.

[21] A. K. Jain and B. Yu, “Document representation and its application to page decomposition,” Pattern Analysis and Machine Intelligence, pp. 294–308, 1998.

[22] A. Spitz, “Determination of the script and language content of document images,” Pattern Analysis and Machine Intelligence, pp. 235–245, 1997.

[23] U. Pal and B. B. Chaudhuri, “Automatic separation of words in multi-lingual multi-script indian document,” in Proceedings of International Conference on Document Analysis and Recognition (ICDAR), pp. 576–583, 1997.


[24] J. Hochberg, P. Kelly, T. Thomas, and L. Kerns, “Automatic script identification from document images using cluster-based templates,” Pattern Analysis and Machine Intelligence, pp. 176–187, 1997.

[25] G. S. Lehal and C. Singh, “A gurmukhi script recognition system,” in Proceedings of International Conference on Pattern Recognition, Barcelona, Spain, vol. 2, pp. 557–560, 2000.

[26] T. V. Ashwin and P. S. Sastry, “A font and size-independent ocr system for printed kannada documents using support vector machines,” Sadhana, vol. 27, pp. 35–58, February 2002.

[27] B. B. Chaudhuri, U. Pal, and M. Mitra, “Automatic recognition of printed oriya script,” Sadhana, vol. 27, pp. 23–34, February 2002.

[28] S. Haykin, Neural Networks: A Comprehensive Foundation, ch. 4, pp. 156–175. Addison-Wesley, second ed., 2001.

[29] T. K. Ho, J. J. Hull, and S. N. Srihari, “Decision combination in multiple classifier systems,” Pattern Analysis and Machine Intelligence, vol. 16, no. 1, pp. 66–75, 1994.

[30] F. Cesarini, M. Gori, S. Marinai, and G. Soda, “Informys: A flexible invoice-like form-reader system,” Pattern Analysis and Machine Intelligence, pp. 730–745, 1998.

[31] D. Doermann, “The indexing and retrieval of document images: A survey,” Computer Vision and Image Understanding: CVIU, vol. 70, no. 3, pp. 287–298, 1998.

[32] “http://www.wacom.com/graphire/index.cfm.”

[33] “http://www.acm.org/sigchi/chi95/proceedings/shortppr/sn bdy.htm.”

[34] “http://www.ftgdata.com/release/pxl595nr.pdf.”

[35] T. Wada, H. Ukida, and T. Matsuyama, “Shape from shading with interreflections under proximal light source – 3d shape reconstruction of unfolded book surface from a scanned image –,” in Proceedings of the Fifth International Conference on Computer Vision, 1995.
