Content of Text Line Segmentation

Upload: m-bharath-reddy

Post on 07-Apr-2018


  • 8/6/2019 Content of Text Line Segmantation


    CHAPTER - 1


    1.1 INTRODUCTION

    Text line extraction is generally seen as a preprocessing step for tasks

such as document structure extraction, printed character or handwriting recognition. Many techniques have been developed for page segmentation of

printed documents (newspapers, scientific journals, magazines, business letters)

    produced with modern editing tools. The segmentation of handwritten documents

    has also been addressed with the segmentation of address blocks on envelopes

    and mail pieces, and for authentication or recognition purposes. More recently,

    the development of handwritten text databases provides new material for

    handwritten page segmentation.

    Ancient and historical documents, printed or handwritten, strongly differ

    from the documents mentioned above because layout formatting requirements

    were looser. Their physical structure is thus harder to extract. In addition,

    historical documents are of low quality, due to aging or faint typing. They include

    various disturbing elements such as holes, spots, writing from the verso

    appearing on the recto, ornamentation, or seals. Handwritten pages include

narrowly spaced lines with overlapping and touching components. Characters and

    words have unusual and varying shapes, depending on the writer, the period and

    the place concerned. The vocabulary is also large and may include unusual

names and words.

    Full text recognition is in most cases not yet available, except for printed

    documents for which dedicated OCR can be developed. However, invaluable

    collections of historical documents are already digitized and indexed for

consulting, exchange and distant access purposes, which protects them from

    direct manipulation. In some cases, highly structured editions have been

    established by scholars. But a huge amount of documents are still to be exploited

    electronically. To produce an electronic searchable form, a document has to be


    indexed. The simplest way of indexing a document consists in attaching its main

characteristics such as date, place and author (the so-called metadata).

    Indexing can be enhanced when the document structure and content are

    exploited. When a transcription (published version, diplomatic transcription) is

    available, it can be attached to the digitized document: this allows users to

    retrieve documents from textual queries. Since text based representations do not

    reflect the graphical features of such documents, a better representation is

    obtained by linking the transcription to the document image. A direct

    correspondence can then be established between the document image and its

content by text/image alignment techniques. This allows the creation of indexes

    where the position of each word can be recorded, and of links between both

    representations.

    Clicking on a word on the transcription or in the index through a GUI

    allows users to visualize the corresponding image area and vice versa. The

    document analysis embedded in such systems provides tools to search for

    blocks, lines and words, and may include a dedicated handwriting recognition

system. Interactive tools are generally offered for segmentation and recognition correction purposes. Several projects also concern printed material. However,

    document structure can also be used when no transcription is available. Word

    spotting techniques can retrieve similar words in the image document through an

image query. When words of the image document are extracted by top-down

    segmentation, which is generally the case, text lines are extracted first.

    The authentication of manuscripts in the paleographic sense can also

    make use of document analysis and text line extraction. Authentication consists

    in retrieving writer characteristics independently from document content. It

    generally consists in dating documents, localizing the place where the document

    was produced, identifying the writer by using characteristics and features

    extracted from blank spaces, line orientations and fluctuations, word or character


    shapes. Page segmentation into text lines is performed in most tasks mentioned

    above and overall performance strongly relies on the quality of this process.

    Fig 1 Line Segmentation Block diagram

The purpose of this project is to survey the efforts made for historical documents on the text line segmentation task. Section 2 describes the

    characteristics of text line structures in historical documents and the different

    ways of defining a text line. Preprocessing of document images (gray level, color

or black and white) is often necessary before text line extraction to prune

superfluous information (non-textual elements, textual elements from the verso)

or to correctly binarize the image.

This problem is addressed in preprocessing. In related methods such as

projection-based methods, smearing methods, grouping methods, methods

based on the Hough transform, the repulsive-attractive network method and the

stochastic method, we survey the different approaches to segmenting the clean

image into text lines. A taxonomy is proposed, listed as projection profiles,


    smearing, grouping, Hough-based, repulsive-attractive network and stochastic

    methods. The majority of these techniques have been developed for the projects

    on historical documents mentioned above. We address the specific problem of

overlapping and touching components further on.
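As an illustration of the projection-based family, text lines can be found by summing ink pixels along each row of a binarized image and cutting at the empty valleys between lines. The sketch below assumes a clean, roughly horizontal binary image; the ink threshold is illustrative:

```python
import numpy as np

def segment_lines_by_projection(binary, min_ink=1):
    """Split a binary image (1 = ink) into text line row ranges.

    Rows with fewer than `min_ink` ink pixels are treated as interline
    gaps; maximal runs of inked rows become line candidates.
    """
    profile = binary.sum(axis=1)            # ink pixels per row
    inked = profile >= min_ink
    lines, start = [], None
    for y, on in enumerate(inked):
        if on and start is None:
            start = y                       # a line begins
        elif not on and start is not None:
            lines.append((start, y))        # a line ends (exclusive)
            start = None
    if start is not None:
        lines.append((start, len(inked)))
    return lines

# Two synthetic "lines" of ink separated by a blank gap
img = np.zeros((10, 20), dtype=int)
img[1:3, :] = 1
img[6:9, 5:15] = 1
print(segment_lines_by_projection(img))     # [(1, 3), (6, 9)]
```

Real historical pages need more care (fluctuating baselines, touching components), which is precisely why the other method families exist.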

    Document image understanding algorithms are expected to work with a

    document, irrespective of its layout, script, font, color, etc. Segmentation aims to

    partition a document image into various homogeneous regions such as text

    blocks, image blocks, lines, words etc. Page segmentation algorithms can be

    broadly classified into three categories: bottom-up, top-down, and hybrid

    algorithms. The classification is based on the order in which the regions in a

    document are identified and labeled. The layout of the document is represented

    by a hierarchy of regions: page, image or text blocks, lines, words, components,

    and pixels. The traditional document segmentation algorithms give good results

    on most documents with complex layouts but assume the script in the document

    to be simple as in English. These algorithms fail to give good results on the

    documents with complex scripts such as African, Persian and Indian scripts.


    Fig 2 Reference lines and interfering lines with overlapping and touching

    components.

    1.2 CHARACTERISTICS AND REPRESENTATION OF TEXT

    LINES

    To have a good idea of the physical structure of a document image, one

    only needs to look at it from a certain distance: the lines and the blocks are

    immediately visible. These blocks consist of columns, annotations in margins,

stanzas, etc. As blocks generally have no rectangular shape in historical documents, the text line structure becomes the dominant physical structure. We

    first give some definitions about text line components and text line segmentation.

    Then we describe the factors which make this text line segmentation hard.

    Finally, we describe how a text line can be represented.


    1.2.1 DEFINITION

Baseline: Fictitious line which follows and joins the lower part of the character bodies in a text line (Fig. 2).

    Median line: Fictitious line which follows and joins the upper part of the

    character bodies in a text line.

    Upper line: Fictitious line which joins the top of ascenders.

    Lower line: Fictitious line which joins the bottom of descenders.

    Overlapping components: Overlapping components are descenders and

    ascenders located in the region of an adjacent line (Fig. 2).

Touching components: Touching components are ascenders and descenders

    belonging to consecutive lines which are thus connected. These components are

    large but hard to discriminate before text lines are known.

    Text line segmentation: Text line segmentation is a labeling process which

    consists in assigning the same label to spatially aligned units (such as pixels,

    connected components or characteristic points).

    There are two categories of text line segmentation approaches:

    searching for (fictitious) separating lines or paths, or searching for aligned

    physical units. The choice of a segmentation technique depends on the

    complexity of the text line structure of the document.
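The second category, searching for aligned physical units, can be illustrated by a simplified grouping rule (not any particular published method): connected components whose bounding boxes overlap vertically are assigned the same line label.

```python
def group_into_lines(boxes):
    """Assign a line label to each box (x0, y0, x1, y1).

    Boxes whose vertical extents overlap are given the same label;
    a greedy single pass over boxes sorted by top edge suffices for
    well-separated lines.
    """
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][1])
    labels = [None] * len(boxes)
    lines = []  # (y0, y1) vertical extent of each open line
    for i in order:
        _, y0, _, y1 = boxes[i]
        for lbl, (ly0, ly1) in enumerate(lines):
            if y0 < ly1 and ly0 < y1:       # vertical overlap with line
                labels[i] = lbl
                lines[lbl] = (min(ly0, y0), max(ly1, y1))
                break
        else:
            labels[i] = len(lines)          # open a new line
            lines.append((y0, y1))
    return labels

# Three words on one line, one word on the next
boxes = [(0, 0, 5, 10), (6, 1, 12, 9), (13, 2, 20, 11), (0, 20, 8, 30)]
print(group_into_lines(boxes))  # [0, 0, 0, 1]
```

Overlapping and touching components break this simple rule, which is why dedicated treatment is surveyed later.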

    1.2.2 INFLUENCE OF AUTHOR STYLE


    Baseline fluctuation: The baseline may vary due to writer movement. It may

    be straight, straight by segments, or curved.

    Line orientations: There may be different line orientations, especially on

    authorial works where there are corrections and annotations.

Line spacing: Lines that are rather widely spaced are easy to find. The

process of extracting text lines grows more difficult as interlines narrow: the

lower baseline of the first line comes closer to the upper baseline of the

second line, and descenders and ascenders start to fill the blank space left for separating two adjacent text lines.

    Insertions: Words or short text lines may appear between the principal text

    lines, or in the margins.

    1.2.3 INFLUENCE OF POOR IMAGE QUALITY

    Imperfect preprocessing: Smudges, variable background intensity and the

    presence of seeping ink from the other side of the document make image

    preprocessing particularly difficult and produce binarization errors.

Stroke fragmentation and merging: Punctuation, dots and broken strokes due

    to low-quality images and/or binarization may produce many connected

    components; conversely, words, characters and strokes may be split into several

    connected components. The broken components are no longer linked to the


    median baseline of the writing and become ambiguous and hard to segment into

    the correct text line.

    1.2.4 TEXT LINE REPRESENTATION

    Separating paths and delimited strip: Separating lines (or paths) are

    continuous fictitious lines which can be uniformly straight, made of straight

    segments, or of curving joined strokes. The delimited strip between two

    consecutive separating lines receives the same text line label. So the text line

can be represented by a strip with its pair of separating lines (Fig. 3).

    Clusters: Clusters are a general set-based way of defining text lines. A label is

    associated with each cluster. Units within the same cluster belong to the same

    text line. They may be pixels, connected components, or blocks enclosing pieces

    of writing. A text line can be represented by a list of units with the same label.

    Strings: Strings are lists of spatially aligned and ordered units. Each string

    represents one text line.

Baselines: Baselines follow line fluctuations but only partially define a text line. Units

    connected to a baseline are assumed to belong to it. Complementary processing

    has to be done to cluster non-connected units and touching components.
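For illustration, the four representations above can be captured in a small data structure; the field names below are invented for this sketch and do not come from any specific system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[int, int]

@dataclass
class SeparatingPath:
    """A fictitious separating line as a polyline of (x, y) points."""
    points: List[Point]

@dataclass
class TextLine:
    """One text line under the strip/cluster/string/baseline views."""
    label: int
    upper_path: Optional[SeparatingPath] = None   # strip: delimiting paths
    lower_path: Optional[SeparatingPath] = None
    units: List[int] = field(default_factory=list)  # cluster/string: unit ids
    baseline: List[Point] = field(default_factory=list)

line = TextLine(label=0,
                upper_path=SeparatingPath([(0, 10), (100, 12)]),
                lower_path=SeparatingPath([(0, 40), (100, 38)]),
                units=[3, 7, 9],
                baseline=[(0, 35), (100, 34)])
print(line.label, len(line.units))  # 0 3
```

The string view is the cluster view plus an ordering of `units`, so one structure can serve both.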


    Fig. 3 Various text line representations: paths, strings and baselines.


    1.3 DOCUMENT IMAGE ANALYSIS

    Document image analysis refers to algorithms and techniques that are

    applied to images of documents to obtain a computer-readable description from

pixel data. A well-known document image analysis product is the Optical

Character Recognition (OCR) software that recognizes characters in a scanned

document. OCR makes it possible for the user to edit or search the document's

contents. In this paper we briefly describe various components of a document

analysis system. Many of these basic building blocks are found in most document

analysis systems.

    The objective of document image analysis is to recognize the text and

    graphics component in images of documents, and to extract the intended

    information as a human would. Two categories of document image analysis can

    be defined (see figure 4). Textual processing deals with the text components of

    a document image. Some tasks here are: determining the skew (any tilt at which

    the document may have been scanned into the computer), finding columns,

    paragraphs, text lines, and words, and finally recognizing the text (and possibly

    its attributes such as size, font etc.) by optical character recognition (OCR).

    Graphics processing deals with the non-textual line and symbol components that

    make up line diagrams, delimiting straight lines between text sections, company

    logos etc.


    Pictures are a third major component of documents, but except for

    recognizing their location on a page, further analysis of these is usually the task

    of other image processing and machine vision techniques. After application of

    these text and graphics analysis techniques, the several megabytes of initial data

    are culled to yield a much more concise semantic description of the document.

    Fig 4 A hierarchy of document processing subareas listing the types of

    document components dealt with in each subarea.

    Document analysis systems will become increasingly more evident in the

form of everyday document systems. For instance, OCR systems will be more widely used to store, search, and excerpt from paper-based documents. Page-layout

analysis techniques will recognize a particular form or page format and

    allow its duplication. Diagrams will be entered from pictures or by hand, and

    logically edited. Pen-based computers will translate handwritten entries into

    electronic documents. Archives of paper documents in libraries and engineering


    companies will be electronically converted for more efficient storage and instant

    delivery to a home or office computer.

    Consider three specific examples of the need for document analysis

    presented here.

(1) Typical documents in today's office are computer-generated, but even so,

    inevitably by different computers and software such that even their electronic

    formats are incompatible. Some include formatted text and tables as well as

    handwritten entries. There are different sizes, from a business card to a large

    engineering drawing. Document analysis systems recognize types of documents,

    enable the extraction of their functional parts, and translate from one computer

    generated format to another.

(2) Automated mail-sorting machines performing sorting and address recognition

    have been used for several decades, but there is the need to process more mail,

    more quickly, and more accurately.

    (3) In a traditional library, loss of material, misfiling, limited numbers of each

copy, and even degradation of materials are common problems, and may be improved by document analysis techniques. All these examples serve as

    applications ripe for the potential solutions of document image analysis.

    1.3.1 DATA CAPTURE

    Data in a paper document are usually captured by optical scanning and

    stored in a file of picture elements, called pixels that are sampled in a grid pattern

    throughout the document. These pixels may have values: OFF (0) or ON (1) for

binary images, 0–255 for gray-scale images, and 3 channels of 0–255 color

    values for color images.


    At a typical sampling resolution of 120 pixels per centimeter, a 20 x 30 cm

    page would yield an image of 2400x3600 pixels. When the document is on a

    different medium such as microfilm, palm leaves, or fabric, photographic methods

    are often used to capture images. In any case, it is important to understand that

    the image of the document contains only raw data that must be further analyzed

    to glean the information.
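The sampling arithmetic in this section is easy to reproduce; the short sketch below computes the pixel dimensions from the document's own example (120 pixels per centimeter on a 20 x 30 cm page) and the raw storage for each pixel type.

```python
def image_size(width_cm, height_cm, samples_per_cm):
    """Pixel dimensions of a page scanned at a given sampling rate."""
    return width_cm * samples_per_cm, height_cm * samples_per_cm

def raw_bytes(width_px, height_px, kind):
    """Uncompressed storage: 1 bit/pixel binary, 1 byte gray, 3 bytes color."""
    bits = {"binary": 1, "gray": 8, "color": 24}[kind]
    return width_px * height_px * bits // 8

w, h = image_size(20, 30, 120)
print(w, h)                          # 2400 3600
print(raw_bytes(w, h, "color"))      # 25920000 bytes for 3-channel color
```

This is why gray-scale or color scans of a single page run to several megabytes before any analysis is done.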

    1.4 OPTICAL CHARACTER RECOGNITION (OCR)

    Optical Character Recognition (OCR) lies at the core of the discipline of

    pattern recognition where the objective is to interpret a sequence of characters

    taken from an alphabet. Characters of the alphabet are usually rich in shape. In

    fact, the characters can be subject to many variations in terms of fonts and

    handwriting styles. Despite these variations, there is perhaps a basic abstraction

    of the shapes that identifies any of their instantiations. Developing computer

    algorithms to identify the characters of the alphabet is the principal task of OCR.

The challenge to the research community is the following: while humans can

    recognize neatly handwritten characters with 100% accuracy, there is no OCR

    that can match that performance. OCR difficulty can increase on several counts.

Increases in fonts, the size of the alphabet set, unconstrained handwriting,

touching of adjacent characters, broken strokes due to poor binarization, noise,

etc. all contribute to the difficulty. For example, handwritten 0s and 6s are easily confused by a digit recognizer. There are many applications that

    require the recognition of unconstrained handwriting. A word can be either purely

    numeric as in the case of a Zip code, or purely alphabetic as in the case of US

state abbreviations, or mixed as in an apartment number.


    The task becomes particularly challenging when adjacent characters in a

    character string are touching as shown in figure 8. Unlike purely alphabetic

    strings where joining of the characters is natural and takes place by means of

    ligatures, the joining of numerals in a numeric word and the upper-case

    characters in an abbreviation are accidental. There are various ways in which

    two digits can touch. Some of the categories lend themselves to natural

    segmentation, whereas for some a holistic approach is the only option available.

    1.4.1 WHAT IS OCR?

    OCR is the acronym for Optical Character Recognition. This technology

    allows a machine to automatically recognize characters through an optical

mechanism. Human beings recognize many objects in this manner; our eyes are

    the "optical mechanism." But while the brain "sees" the input, the ability to

    comprehend these signals varies in each person according to many factors. By

    reviewing these variables, we can understand the challenges faced by the

    technologist developing an OCR system.

    First, if we read a page in a language other than our own, we may

    recognize the various characters, but be unable to recognize words. However, on

    the same page, we are usually able to interpret numerical statements - the

    symbols for numbers are universally used. This explains why many OCR

    systems recognize numbers only, while relatively few understand the full

    alphanumeric character range. Second, there is similarity between many

    numerical and alphabetical symbol shapes. For example, while examining a

    string of characters combining letters and numbers, there is very little visible

    difference between a capital letter "O" and the numeral "0."


    As humans, we can re-read the sentence or entire paragraph to help us

    determine the accurate meaning. This procedure, however, is much more difficult

    for a machine. Third, we rely on contrast to help us recognize characters. We

    may find it very difficult to read text which appears against a very dark

    background, or is printed over other words or graphics. Again, programming a

    system to interpret only the relevant data and disregard the rest is a difficult task

    for OCR engineers. There are many other problems which challenge the

    developers of OCR systems. In this paper, we will review the history,

    advancements, abilities and limitations of existing systems. This analysis should

    help determine if OCR is the correct application for your company's needs, and if

    so, which type of system to implement.

    1.4.2 HISTORY OF OCR

    The engineering attempts at automated recognition of printed characters

    started prior to World War II. But it was not until the early 1950's that a

commercial venture was identified that justified the necessary funding for research and development of the technology. This impetus was provided by the American

    Bankers Association and the Financial Services Industry. They challenged all the

    major equipment manufacturers to come up with a "Common Language" to

    automatically process checks. After the war, check processing had become the

    single largest paper processing application in the world. Although the banking

industry eventually chose Magnetic Ink Character Recognition (MICR), some vendors had

    proposed the use of an optical recognition technology.

    However, OCR was still in its infancy at the time and did not perform as

    acceptably as MICR. The advantage of MICR was that it is relatively impervious

to change, fraudulent alteration and interference from non-MICR inks. The "eye"

    of early OCR equipment utilized lights, mirrors, fixed slits for the reflected light to


    pass through, and a moving disk with additional slits. The reflected image was

    broken into discrete bits of black and white data, presented to a photo-multiplier

    tube, and converted to electronic bits.

The "brain's" logic required the presence or absence of "black" or "white"

    data bits at prescribed intervals. This allowed it to recognize a very limited,

    specially designed character set. To accomplish this, the units required

    sophisticated transports for documents to be processed. The documents were

    required to run at a consistent speed and the printed data had to occur in a fixed

    location on each and every form.

    This technology also introduced the concept of blue, non-reading inks as

    the system was sensitive to the ultraviolet spectrum. The third generation of

    recognition devices, introduced in the early 1970's, consisted of photo-diode

arrays. These tiny sensors were aligned in an array so the reflected image of

    a document would pass by at a prescribed speed. These devices were most

    sensitive in the infra-red portion of the visual spectrum so "red" inks were used

as non-reading inks. That brings us to the current generation of hardware.

    1.4.3 LIMITATIONS OF OCR

    OCR has never achieved a read rate that is 100% perfect. Because of

    this, a system which permits rapid and accurate correction of rejects is a major

    requirement. Exception item processing is always a problem because it delays

    the completion of the job entry, particularly the balancing function. Of even

    greater concern is the problem of misreading a character (substitutions). In

    particular, if the system does not accurately balance dollar data, customer

    dissatisfaction will occur. The success of any OCR device to read accurately

    without substitutions is not the sole responsibility of the hardware manufacturer.

    Much depends on the quality of the items to be processed.


    Through the years, the desire has been to increase the accuracy of

reading: that is, to reduce rejects and substitutions, to reduce the sensitivity of

scanning in order to read less-controlled input, to eliminate the need for specially

designed fonts (characters), and to read handwritten characters. However,

    today's systems, while much more forgiving of printing quality and more accurate

    than earlier equipment, still work best when specially designed characters are

    used and attention to printing quality is maintained. However, these limits are not

    objectionable to most applications, and dedicated users of OCR systems are

    growing each year. But the ability to read a special character is not, by itself,

sufficient to create a successful system.

    1.4.4 WHAT DOES IT TAKE TO MAKE A SUCCESSFUL OCR SYSTEM?

1. It takes a complementary merging of the input document stream with the

    processing requirements of the particular application with a total system concept

    that provides for convenient entry of exception type items with an output that

    provides cost effective entry to complete the system. To show a successful

    example, let's review the early credit card OCR applications.

    2. Input was a carbon imprinted document. However, if the carbon was wrinkled,

    the imprinter was misaligned, or any one of a variety of reasons existed, the

    imprinted characters were impossible to read accurately.

    3. To compensate for this problem, the processing system permitted direct key

entry of the failed-to-read items at a fairly high speed. Directly keyed items from the

    misread document were under intelligent computer control which placed the

    proper data in the right location for the data record. Important considerations in

    designing the system encouraged the use of modulus controlled check digits for

    the embossed credit card account number. This, coupled with tight monetary

    controls by batch totals, reduced the chance of read substitutions.


4. The output of these early systems provided a "country club" type of billing.

That is, each of the credit card sales slips was returned to the original purchaser. This

    provided the credit card customer with the opportunity to review his own

purchases to ensure the final accuracy of billing. This has been a very successful

    operation through the years. Today's systems improve the process by increasing

    the amount of data to be read, either directly or through reproduction of details on

    the sales draft. This provides customers with a "descriptive" billing statement

    which itemizes each transaction. Attention to the details of each application step

    is a requirement for successful OCR systems.

    1.4.6 PHASES IN OCR

PREPROCESSING

SEGMENTATION

RECOGNITION

POST PROCESSING


Fig 5 Phases of OCR

    1.4.6.1 PREPROCESSING

    Optical Character Recognition (OCR) refers to the process of converting

printed Tamil text documents into software-translated Unicode Tamil text. The

    printed documents available in the form of books, papers, magazines, etc. are

    scanned using standard scanners which produce an image of the scanned

    document. As part of the preprocessing phase the image file is checked for

    skewing. If the image is skewed, it is corrected by a simple rotation technique in

    the appropriate direction. Then the image is passed through a noise elimination

    phase and is binarized.
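One simple way to implement the skew check mentioned above is to try a range of candidate angles and keep the one whose sheared row histogram is sharpest; the correction is then a rotation by the negative of that angle. This is an illustrative sketch, not the system's actual routine:

```python
import numpy as np

def estimate_skew(binary, angles=np.linspace(-5, 5, 41)):
    """Estimate the skew angle (degrees) of a binary image (1 = ink).

    For each candidate angle, ink pixel rows are sheared accordingly
    and the variance of the resulting row histogram is measured; the
    correct angle gives the sharpest (highest-variance) profile.
    """
    ys, xs = np.nonzero(binary)
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        shifted = ys - xs * np.tan(np.radians(a))   # shear rows by angle a
        hist = np.bincount((shifted - shifted.min()).astype(int))
        score = hist.var()
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle

# A synthetic page with two text "lines" sloping about 2 degrees
page = np.zeros((60, 100), dtype=int)
for x in range(100):
    y = int(10 + x * np.tan(np.radians(2.0)))
    page[y, x] = 1
    page[y + 25, x] = 1
print(round(estimate_skew(page), 2))
```

On the synthetic page the estimate recovers approximately the 2-degree slant; a real system would follow this with a rotation and the noise elimination and binarization steps described above.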

    The preprocessed image is segmented using an algorithm which

decomposes the scanned text into paragraphs using a special space detection

    technique and then the paragraphs into lines using vertical histograms, and lines

    into words using horizontal histograms, and words into character image glyphs

using horizontal histograms. Each image glyph comprises 32x32 pixels.

    Thus a database of character image glyphs is created out of the segmentation

    phase. Then all the image glyphs are considered for recognition using Unicode

    mapping. Each image glyph is passed through various routines which extract the

    features of the glyph.
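The line-to-word step of this histogram-based decomposition can be sketched along the other axis: columns with no ink mark gaps, and only gaps wider than a threshold (inter-word spacing) split words. The gap threshold below is illustrative:

```python
import numpy as np

def segment_words(line_strip, min_gap=3):
    """Split a binary line strip (1 = ink) into word column ranges.

    Columns with no ink form gaps; only gaps of at least `min_gap`
    columns (inter-word spacing) separate words, so narrow
    inter-character gaps stay inside one word.
    """
    profile = line_strip.sum(axis=0)      # ink pixels per column
    words, start, gap = [], None, 0
    for x, ink in enumerate(profile):
        if ink:
            if start is None:
                start = x                 # a word begins
            end = x + 1
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:            # wide gap ends the word
                words.append((start, end))
                start = None
    if start is not None:
        words.append((start, end))
    return words

strip = np.zeros((8, 30), dtype=int)
strip[:, 2:5] = 1; strip[:, 6:9] = 1      # two characters, one word
strip[:, 15:20] = 1                       # a second word
print(segment_words(strip))               # [(2, 9), (15, 20)]
```

The word-to-glyph step repeats the same idea with the gap threshold set to a single empty column.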



    1.4.6.2 SEGMENTATION

    In computer vision, segmentation refers to the process of partitioning a

    digital image into multiple segments (sets of pixels, also known as superpixels).

    The goal of segmentation is to simplify and/or change the representation of an

    image into something that is more meaningful and easier to analyze. Image

    segmentation is typically used to locate objects and boundaries (lines, curves,

    etc.) in images. More precisely, image segmentation is the process of assigning

    a label to every pixel in an image such that pixels with the same label share

    certain visual characteristics.

    The result of image segmentation is a set of segments that collectively

    cover the entire image, or a set of contours extracted from the image (see edge

detection). Each of the pixels in a region is similar with respect to some

    characteristic or computed property, such as color, intensity, or texture. Adjacent

    regions are significantly different with respect to the same characteristic(s).
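The labeling view of segmentation can be made concrete with connected-component labeling, the simplest such assignment: every pixel of one 4-connected ink region receives the same label. A minimal breadth-first sketch:

```python
from collections import deque
import numpy as np

def label_components(binary):
    """4-connected component labeling of a binary image (1 = ink).

    Returns an int image: 0 for background, 1..n for the n regions.
    """
    labels = np.zeros_like(binary, dtype=int)
    current = 0
    h, w = binary.shape
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not labels[sy, sx]:
                current += 1                      # a new region starts here
                queue = deque([(sy, sx)])
                labels[sy, sx] = current
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] and not labels[ny, nx]):
                            labels[ny, nx] = current
                            queue.append((ny, nx))
    return labels

img = np.array([[1, 1, 0, 0],
                [0, 1, 0, 1],
                [0, 0, 0, 1]])
out = label_components(img)
print(out.max())   # 2 regions
```

Richer criteria (color, intensity, texture) replace the simple "is ink" test, but the labeling structure stays the same.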

    1.4.6.3 RECOGNITION

    Often abbreviated OCR, optical character recognition refers to the branch

    of computer science that involves reading text from paper and translating the

    images into a form that the computer can manipulate (for example, into ASCII

    codes). An OCR system enables you to take a book or a magazine article, feed it

    directly into an electronic computer file, and then edit the file using a word

    processor.

All OCR systems include an optical scanner for reading text, and sophisticated software for analyzing images. Most OCR systems use a

    combination of hardware (specialized circuit boards) and software to recognize

    characters, although some inexpensive systems do it entirely through software.

Advanced OCR systems can read text in a large variety of fonts, but they still have

    difficulty with handwritten text.


    The potential of OCR systems is enormous because they enable users to

    harness the power of computers to access printed documents. OCR is already

    being used widely in the legal profession, where searches that once required

    hours or days can now be accomplished in a few seconds.

    1.4.6.4 POST PROCESSING

In most OCR systems, an independent character recognition engine is often

used to recognize each segmented part of an image, where only the shape and

structure of the character are considered. In order to improve the recognition

    accuracy rate, it is necessary in post-processing to use language knowledge,

    which introduces context information, to correct the image recognition results.

    Post-processing approaches based on language knowledge include using a

    lexicon or some syntax and semantic rules to correct the spelling of words, and

using some statistical language models (SLM) to select the best sequence

    from the candidate characters given by the OCR engine. Because of the

    complexity of language, all kinds of language knowledge sometimes are used

    together to obtain better performance.

    An OCR engine outputs not only candidate characters, but also candidate

    distance information of each candidate character, which is also important in OCR

    post-processing. Currently, candidate distance is usually transformed to reliability

    of the corresponding candidate character to be utilized. Generally speaking, the

    bigger the reliability of a candidate character, the smaller the corresponding

candidate distance. In the early period, the reliability was calculated by using some

    empirical formulas. Afterwards, a statistical approach was proposed, which

    calculates the reliability according to the distribution of candidate characters and

    correct characters with different candidate distances. It reflects some statistical

    characteristics, and its complexity is low, therefore it achieves good results in

    some applications. However, the use of candidate distance is still limited in OCR

    post-processing.
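As an illustration, the transformation from candidate distances to reliabilities can be sketched as follows. The softmax over negated distances used here is one simple choice made for illustration, not the formula of any particular OCR engine:

```python
import math

def reliabilities(candidate_distances):
    # Smaller candidate distance -> larger reliability; a softmax over the
    # negated distances yields values that sum to 1.
    exps = [math.exp(-d) for d in candidate_distances]
    total = sum(exps)
    return [e / total for e in exps]
```

The resulting values can then be combined with a statistical language model to re-rank the candidate characters.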


    1.4.7 STEPS INVOLVED IN OCR

    Optical character recognition is the recognition of printed or written text by

    a computer. This involves photo scanning of the text, which converts the paper

    document into an image, and then translation of the text image into character

    codes such as ASCII. Any OCR implementation consists of a number of

preprocessing steps followed by the actual recognition. The number and types of

    preprocessing algorithms employed on the scanned image depend on many

    factors such as age of the document, paper quality, resolution of the scanned

image, the amount of skew in the image, and the format and layout of the images and
text. Typical preprocessing includes the following stages:

    Binarization,

    Noise removing,

    Thinning,

    Skew detection and correction,

    Line segmentation,

    Word segmentation, and Character segmentation

    Recognition consists of

    Feature extraction,

    Feature selection, and

    Classification


    Fig 6 Steps in an OCR

1.4.7.1 Binarization

Binarization is a technique by which gray scale images are converted to binary images. In any image analysis or enhancement problem, it is very

    essential to identify the objects of interest from the rest. Binarization separates

    the foreground (text) and background information. The most common method for

    binarization is to select a proper threshold for the intensity of the image and then

    convert all the intensity values above the threshold to one intensity value (for


example, white), and all intensity values below the threshold to the other

    chosen intensity (black).
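A minimal sketch of this global thresholding, assuming a gray-scale image stored as a NumPy array (the threshold value 128 is an arbitrary example):

```python
import numpy as np

def binarize(gray, threshold=128):
    # Intensities above the threshold become white (255); the rest black (0).
    return np.where(gray > threshold, 255, 0).astype(np.uint8)
```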

Binarization can be performed either globally or locally. Global methods apply one threshold value to the entire image. Local or adaptive

    thresholding methods apply different intensity values to different regions of the

    image. These threshold values are determined by the neighborhood of the pixel

    to which the thresholding is being applied. Several binarization techniques are

    discussed in (Anuradha & Koteswarrao 2006). Figure 7(a) shows the scanned

    image of a paper document printed in Telugu, a south Indian language. Figure

    7(b) is the same image after binarization in which the text pixels are separated

    from the background.

    (A)

    (B)

    Fig 7 (A) Original image, (B) Binarized image


1.4.7.2 Noise Removal

    Scanned documents often contain noise that arises due to printer,

    scanner, print quality, age of the document, etc. Therefore, it is necessary to filter

    this noise before we process the image. The commonly used approach is to low-

    pass filter the image and to use it for later processing. The objective in the design

    of a filter to reduce noise is that it should remove as much of the noise as

    possible while retaining the entire signal (Rangachar et al 2002).

1.4.7.3 Thinning

Thinning, or skeletonization, is a process by which a one-pixel-width

    representation (or the skeleton) of an object is obtained, by preserving the

    connectedness of the object and its end points. The purpose of thinning is to

    reduce the image components to their essential information so that further

    analysis and recognition are facilitated. This enables easier subsequent detection

    of pertinent features. Figure 8 shows an image before and after thinning. A

number of thinning algorithms have been proposed and are being used. The most common algorithm used is the classical Hilditch algorithm and its variants.

Fig 8 A character image (left) before thinning, and (right) after thinning


1.4.7.4 Skew detection and correction

    When a document is fed to the scanner either mechanically or by a human

    operator, a few degrees of skew (tilt) are unavoidable. Skew angle is the angle

    that the lines of text in the digital image make with the horizontal direction. Figure

    9(a) shows an image with skew.

    Fig 9 An image (a) with skew, (b) without skew, and its horizontal profiles


    There exist many techniques for skew estimation. One skew estimation

    technique is based on the projection profile of the document; another class of

    approach is based on nearest neighbor clustering of connected components.

    Techniques based on the Hough transform and Fourier transform are also

    employed for skew estimation. A popular method for skew detection employs the

    projection profile. A horizontal projection profile is a one-dimensional array where

    each element denotes the number of black pixels along a row in the image.

When the page is scanned horizontally, the horizontal projection profile has peaks whose widths

    are equal to the character height and valleys whose widths are equal to the

    spacing between lines. At the correct skew angle, since scan lines are aligned to

    text lines, the projection profile has maximum height peaks for text and valleys

    for line spacing. In the image of figure 9(a), its horizontal projection profile can be

    seen with no clear valleys due to the presence of skew. Figure 9(b) is an image

    in which the skew is removed. The peaks and valleys in the projection profile can

    be clearly seen.
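The profile-based skew estimation described above can be sketched as follows, assuming a binary image whose foreground pixels equal 1. The variance of the profile serves as the sharpness criterion (an illustrative choice; peak-height criteria work similarly):

```python
import numpy as np

def profile_at_angle(binary, angle_deg):
    # Project each foreground pixel onto the vertical axis along the
    # candidate text direction and histogram the result.
    ys, xs = np.nonzero(binary)
    proj = ys - xs * np.tan(np.radians(angle_deg))
    hist, _ = np.histogram(proj, bins=binary.shape[0])
    return hist

def estimate_skew(binary, angles=np.arange(-5.0, 5.25, 0.25)):
    # At the correct skew angle scan lines align with text lines, so the
    # profile has the sharpest peaks and valleys, i.e. the largest variance.
    return max(angles, key=lambda a: profile_at_angle(binary, a).var())
```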

1.4.7.5 Line, word, and character segmentation

    After the tilt is corrected, the text has to be segmented first into lines; each

line then into words, and finally each word has to be segmented into its
constituent characters. Horizontal projection of a document image is most

    commonly employed to extract the lines from the document. If the lines are well

    separated, and are not tilted, the horizontal projection will have separated peaks

    and valleys, as shown in figure 9(b), which serve as the separators of the text

lines. Figure 10 shows an image consisting of 3 text lines (left), and the 3

    segmented lines (right), using horizontal projection profiles.

    Similarly a vertical projection profile gives the column sums. One can

    separate lines by looking for minima in horizontal projection profile of the page

    and then separate words by looking at minima in vertical projection profile of a


    single line. Figure 11(a) shows a line consisting of 4 words, along with vertical

    projection profiles, and figure 11(b) shows the 4 words, after segmentation. In

    Figure 11(c), a word is shown segmented into its constituting 3 characters.

    Overlapping, adjacent characters in a word (called kerned characters) cannot be

    segmented using zero-valued valleys in the vertical projection profile.
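The valley-based line and word segmentation described above can be sketched as follows, assuming a binary image with text pixels equal to 1; lines and words are taken as the zero-separated runs of the horizontal and vertical profiles respectively:

```python
import numpy as np

def segment_runs(profile):
    # (start, end) index pairs of the non-zero runs of a projection profile.
    runs, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_lines_and_words(binary):
    # Lines from the horizontal profile, then words from each line strip's
    # vertical profile.
    lines = segment_runs(binary.sum(axis=1))
    return [(top, bottom, segment_runs(binary[top:bottom].sum(axis=0)))
            for top, bottom in lines]
```

As noted in the text, this zero-valley criterion fails on kerned characters, whose vertical profiles never reach zero.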

Special techniques have to be employed to solve this problem.

Feature extraction and selection

Feature extraction can be considered as finding a set of

    parameters (features) that define the shape of the underlying character as

    precisely and uniquely as possible. The features have to be selected in such a

    way that they help in discriminating between characters. Thinned data is

    analyzed to detect features such as straight lines, curves, and significant points

    along the curves.

    Fig 10 Line segmentation


    Fig 11 (a). A line segment, (b). Word segmentation, (c). Character

    segmentation

1.4.7.6 Techniques used

Any OCR system contains more or less the same steps described below. The
exact number and techniques differ slightly from one language to another. We now

    present the studies in different OCRs, along with a detailed description of the

    methods used in them. Recognition of isolated and continuous printed multi font

    Bengali characters is reported in the work by Mahmud et al (2003).

This is based on Freeman chain-code features, which are explained as

    follows. When objects are described by their skeletons or contours, they can be

    represented by chain coding, where the ON pixels are represented as sequences

of connected neighbors along lines and curves. Instead of storing the absolute location of each ON pixel, the direction from its previously coded neighbor is

    stored.

The chain codes from the center pixel are 0 for East, 1 for North-East, and so

    on. This is represented pictorially in figure 12(a) and (b). Chain code gives the


    boundary of the character image; slope distribution of chain code implies the

    curvature properties of the character. In this work, connected components from

each character are divided into four regions with the center of mass as the

    origin. Slope distribution of chain code, in these four regions is used as local

    feature. Using chain code representation, classification is done by a feed forward

    neural network.
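The chain-code representation itself can be sketched as follows (a simplified illustration; the slope distributions over four regions and the neural classifier of the cited work are omitted):

```python
# Freeman 8-direction codes: 0 = East, 1 = North-East, 2 = North, ...,
# counter-clockwise.  In image coordinates x grows rightward and y grows
# downward, so "North" is dy = -1.
DIRECTIONS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
              (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code(points):
    # Store, for each boundary point, the direction from its predecessor
    # instead of its absolute location.
    return [DIRECTIONS[(x1 - x0, y1 - y0)]
            for (x0, y0), (x1, y1) in zip(points, points[1:])]
```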

    Testing on three types of fonts with accuracy of approximately 98% for

    isolated characters and 96% for continuous characters is reported. Ray &

    Chatterjee (1984) presented a recognition system based on a nearest neighbor

classifier employing features extracted by using a string connectivity criterion. A
complete OCR for printed Bangla is reported in the work by Chaudhuri & Pal

    (1998), in which a combination of template and feature-matching approach is

    used.

    A histogram-based thresholding approach is used to convert the image

    into binary images. For a clear document the histogram shows two prominent

    peaks corresponding to white and black regions. The threshold value is chosen

as the midpoint of the two histogram peaks. The skew angle is determined from the skew of the headline.

    Text lines are partitioned into three zones and the horizontal and vertical

    projection profiles are used to segment the text into lines, words, and characters.

    Primary grouping of characters into the basic, modified and compound

    characters is made before the actual classification. A few stroke features are

    used for this purpose along with a tree classifier where the decision at each node

    of the tree is taken on the basis of presence/absence of a particular feature.


The compound character recognition is done in two stages:

    1) In the first stage the characters are grouped into small sub-sets by the

    above tree classifier.

    2) At the second stage, characters in each group are recognized by a

    run-based template matching approach. Some character level statistics like

    individual character occurrence frequency, bigram and trigram statistics etc. are

utilized to aid the recognition process. For single-font, clear documents, 99.10%

    character level recognition accuracy is reported.

    Fig 12 Chain code and graphical representations


    CHAPTER - 2


    BACKGROUND WORK

2.1.1 Projection-based Methods

    Projection-profiles are commonly used for printed document

    segmentation. This technique can also be adapted to handwritten documents

    with little overlap. The vertical projection profile is obtained by summing pixel

    values along the horizontal axis for each y value. From the vertical profile, the

    gaps between the text lines in the vertical direction can be observed (Fig. 13).

Profile(y) = Σ_{x=1..M} f(x, y)

    The vertical profile is not sensitive to writing fragmentation. Variants for

    obtaining a profile curve may consist in projecting black/white transitions such as

    in number of connected components, rather than pixels. The profile curve can be

    smoothed, e.g. by a Gaussian or median filter to eliminate local maxima. The

    profile curve is then analysed to find its maxima and minima.
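The smoothing and extrema analysis can be sketched as follows, using a median filter (the window size is an arbitrary example):

```python
import statistics

def smooth_profile(profile, window=5):
    # Median filtering eliminates spurious local extrema caused by writing
    # fragmentation.
    half = window // 2
    return [statistics.median(profile[max(0, i - half):i + half + 1])
            for i in range(len(profile))]

def local_minima(profile):
    # Interior indices where the profile dips: candidate line separators.
    return [i for i in range(1, len(profile) - 1)
            if profile[i - 1] > profile[i] <= profile[i + 1]]
```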

There are two drawbacks: short lines will provide low peaks, and very narrow lines, as

    Short lines will provide low peaks, and very narrow lines, as

    well as those including many overlapping components will not produce significant

    peaks. In case of skew or moderate fluctuations of the text lines, the image may

    be divided into vertical strips and profiles sought inside each strip. These

    piecewise projections are thus a means of adapting to local fluctuations within a

    more global scheme.

In one approach, the global orientation (skew angle) of a handwritten page is first

    searched by applying a Hough transform on the entire image. Once this skew

    angle is obtained, projections are achieved along this angle. The number of

maxima of the profile gives the number of lines. Low maxima are discarded based on
their value, which is compared to the highest maximum. Lines are delimited by

    strips, searching for the minima of projection profiles around each maximum.


    This technique has been tested on a set of 200 pages within a word

    segmentation task.

In another approach, each minimum of the profile curve is a potential

    segmentation point. Potential points are then scored according to their distance

    to adjacent segmentation points. The reference distance is obtained from the

    histogram of distances between adjacent potential segmentation points. The

    highest scored segmentation point is used as an anchor to derive the remaining

    ones. The method is applied to printed records of the Second World War which

    have regularly spaced text lines. The logical structure is used to derive the text

    regions where the names of interest can be found.

    Fig. 13 Vertical projection-profile extracted on an autograph of Jean-Paul

    Sartre.

The RXY cuts method applied in He and Downton uses alternating

    projections along the X and the Y axis. This results in a hierarchical tree

    structure. Cuts are found within white spaces. Thresholds are necessary to

    derive inter-line or inter-block distances. This method can be applied to printed

    documents (which are assumed to have these regular distances) or well

    separated handwritten lines.


    2.1.2 Smearing Methods

    For printed and binarized documents, smearing methods such as the Run-

    Length Smoothing Algorithm can be applied. Consecutive black pixels along the

    horizontal direction are smeared: i.e. the white space between them is filled with

    black pixels if their distance is within a predefined threshold. The bounding boxes

    of the connected components in the smeared image enclose text lines.
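A one-row sketch of the Run-Length Smoothing Algorithm, where the threshold is the predefined maximum gap length; applying it to every row and then taking connected components yields the text-line boxes:

```python
def rlsa_row(row, threshold):
    # Fill white gaps (0s) between black pixels (1s) whose length does not
    # exceed the threshold.
    out = row[:]
    gap_start, seen_black = None, False
    for i, v in enumerate(row):
        if v == 1:
            if seen_black and gap_start is not None and i - gap_start <= threshold:
                out[gap_start:i] = [1] * (i - gap_start)
            seen_black, gap_start = True, None
        elif gap_start is None:
            gap_start = i
    return out
```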

    A variant of this method adapted to gray level images and applied to

    printed books from the sixteenth century consists in accumulating the image

    gradient along the horizontal direction. This method has been adapted to old

printed documents within the Debora project. For this purpose, numerous adjustments in the method concern the tolerance for character alignment and line

    justification.

    Text line patterns are found in the work of Shi and Govindaraju by building

    a fuzzy run length matrix. At each pixel, the fuzzy run-length is the maximal

    extent of the background along the horizontal direction. Some foreground pixels

    may be skipped if their number does not exceed a predefined value. This matrix

is thresholded to make pieces of text lines appear without ascenders and

    descenders (Fig. 14). Parameters have to be accurately and dynamically tuned.

    2.1.3 Grouping methods

    These methods consist in building alignments by aggregating units in a

    bottom-up strategy. The units may be pixels or of higher level, such as connected

    components, blocks or other features such as salient points. Units are then

    joined together to form alignments. The joining scheme relies on both local and

    global criteria, which are used for checking local and global consistency

    respectively.


    Fig 14 Text line patterns extracted from a letter of Georges Washington

(reprinted from Shi and Govindaraju). Foreground pixels have been smeared along the horizontal direction.

    Contrary to printed documents, a simple nearest-neighbor joining scheme

    would often fail to group complex handwritten units, as the nearest neighbor

    often belongs to another line. The joining criteria used in the methods described

    below are adapted to the type of the units and the characteristics of the

    documents under study.


But every method has to face the following issues:

    1) Initiating alignments: one or several seeds for each alignment.

2) Defining a unit's neighborhood for reaching the next unit. It is generally a
rectangular or angular area (Fig. 15).

    3) Solving conflicts: As one unit may belong to several alignments under

    construction, a choice has to be made: discard one alignment or keep

    both of them, cutting the unit into several parts.

    Hence, these methods include one or several quality measures which ensure

    that the text line under construction is of good quality. When comparing the

    quality measures of two alignments in conflict, the alignment of lower quality can

be discarded (Fig. 15). Also, during the grouping process, it is possible to choose

    between the different units that can be aggregated within the same neighborhood

    by evaluating the quality of each of the so-formed alignments.

Fig. 15 Angular and rectangular neighborhoods from point and rectangular units (left). Neighborhood defined by a cluster of units (upper right). Two alignments A and B in conflict: a quality measure will choose A and discard B (lower right).


    Quality measures generally include the strength of the alignment, i.e. the

    number of units included. Other quality elements may concern component size,

component spacing, or a measure of the alignment's straightness.

    Fig. 16 Text lines extracted on Church Registers

Likforman-Sulem and Faure have developed an iterative method based

    on perceptual grouping for forming alignments, which has been applied to

    handwritten pages, author drafts and historical documents. Anchors are detected

by selecting connected components elongated in specific directions (0°, 45°, 90°,
135°). Each of these anchors becomes the seed of an alignment. First, each

    anchor, then each alignment, is extended to the left and to the right.


    This extension uses three Gestalt criteria for grouping components:

    proximity, similarity and direction continuity. The threshold is iteratively

    incremented in order to group components within a broader neighborhood until

    no change occurs. Between each iteration, alignment quality is checked by a

    quality measure which gives higher rates to long alignments including anchors of

    the same direction. A penalty is given when the alignment includes anchors of

    different directions. Two alignments may cross each other, or overlap. A set of

    rules is applied to solve these conflicts taking into account the quality of each

    alignment and neighboring components of higher order (Fig. 16).

In the work of Feldbach and Tönnies, body baselines are searched in

    Church Registers images. These documents include lots of fluctuating and

overlapping lines. Baseline units are the minima points of the writing (obtained

    here from the skeleton). First basic line segments (BLS) are constructed, joining

    each minima point to its neighbors. This neighborhood is defined by an angular

region (±20°) for the first unit grouped, then by a rectangular region enclosing

    the points already joined for the remaining ones. Unwanted basic segments are

    found from minima points detected in descenders and ascenders.

    These segments may be isolated or in conflict with others. Various

heuristics are defined to eliminate alignments based on their size, or the local inter-line

    distance and on a quality measure which favors alignments whose units are in

    the same direction rather than nearer units but positioned lower or higher than

    the current direction. Conflicting alignments can be reconstructed depending on

    the topology of the conflicting alignments. The median line is searched from the

    baseline and from maxima points (Fig. 16). Pixels lying within a given baseline

    and median line are clustered in the corresponding text line, while ascenders and

    descenders are not segmented. Correct segmentation rates are reported

    between 90% and 97 % with adequate parameter adjustment. The seven

    documents tested range from the 17th to the 19th century.


    2.1.4 Methods based on the Hough transform

    The Hough transform is a very popular technique for finding straight lines

in images. In Likforman-Sulem et al., a method has been developed based on a hypothesis

    validation scheme. Potential alignments are hypothesized in the Hough domain

    and validated in the Image domain. Thus, no assumption is made about text line

    directions (several may exist within the same page). The centroids of the

    connected components are the units for the Hough transform. A set of aligned

units in the image along a line with parameters (ρ, θ) is included in the
corresponding cell (ρ, θ) of the Hough domain. Alignments including many units

    correspond to high peaked cells of the Hough domain. To take into account

    fluctuations of handwritten text lines, i.e. the fact that units within a text line are

    not perfectly aligned, two hypotheses are considered for each alignment and an

alignment is formed from units of the cell structure of a primary cell.

Fig. 17 Hypothesized cells (ρ0, θ0) and (ρ1, θ1) in Hough space. Each peak

    corresponds to perfectly aligned units. An alignment is composed of units

    belonging to a cluster of cells (the cell structure) around a primary cell.


A cell structure of a cell (ρ, θ) includes all the cells lying in a cluster
centered on (ρ, θ). Consider the cell (ρ0, θ0) having the greatest count of units. A
second hypothesis (ρ1, θ1) is searched in the cell structure of (ρ0, θ0). The

    alignment chosen between these two hypotheses is the strongest one, i.e. the

    one which includes the highest number of units in its cell structure. And the

corresponding cell, (ρ0, θ0) or (ρ1, θ1), is the primary cell (Fig. 17). However,

    actual text lines rarely correspond to alignments with the highest number of units

since crossing alignments (from top to bottom for writing in the horizontal direction)
may contain more units than actual text lines.

A potential alignment is validated (or invalidated) using contextual information, i.e. considering its internal and external neighbors. An internal
neighbor of a unit j is a neighbor within the same Hough alignment. An external
neighbor is a neighbor outside the Hough alignment which lies within a circle of radius δj from unit
j. The distance δj is the average of the internal-neighbor distances from unit
j. To be validated, a potential alignment must contain fewer external units than

    internal ones. This enables the rejection of alignments which have no perceptual

    relevance. This method can extract oriented text lines and sloped annotations

    under the assumption that such lines are almost straight (Fig. 18).
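The accumulation step of the Hough transform can be sketched as follows, assuming component centroids as (x, y) points and the usual normal parameterization ρ = x cos θ + y sin θ; the cell-structure clustering and the validation step described above are omitted:

```python
import numpy as np

def hough_accumulate(centroids, n_theta=180, n_rho=200):
    # Each centroid votes, for every theta, in the (rho, theta) cell it falls
    # into; high-count cells are candidate text-line alignments.
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    pts = np.asarray(centroids, dtype=float)
    rho_max = np.hypot(np.abs(pts[:, 0]).max(), np.abs(pts[:, 1]).max()) + 1.0
    acc = np.zeros((n_rho, n_theta), dtype=int)
    for x, y in pts:
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        bins = ((rhos + rho_max) / (2.0 * rho_max) * n_rho).astype(int)
        acc[bins, np.arange(n_theta)] += 1
    return acc, thetas
```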

    The Hough transform can also be applied to fluctuating lines of

    handwritten drafts such as in Pu and Shi . The Hough transform is first applied to

    minima points (units) in a vertical strip on the left of the image. The alignments in

    the Hough domain are searched starting from a main direction, by grouping cells

    in an exhaustive search in 6 directions. Then a moving window, associated with a

    clustering scheme in the image domain, assigns the remaining units to

    alignments. The clustering scheme (Natural Learning Algorithm) allows the

    creation of new lines starting in the middle of the page.


    Fig. 18 Text lines extracted on an autograph of Miguel Angel Asturias. The

    orientations of traced lines correspond to those of the primary cells found

    in Hough space.


    2.1.5 Repulsive-Attractive network method

    An approach based on attractive-repulsive forces is presented in Oztop et

    al. It works directly on grey-level images and consists in iteratively adapting the

    y-position of a predefined number of baseline units. Baselines are constructed

    one by one from the top of the image to bottom. Pixels of the image act as

    attractive forces for baselines and already extracted baselines act as repulsive

    forces. The baseline to extract is initialized just under the previously examined

    one, in order to be repelled by it and attracted by the pixels of the line below (the

    first one is initialized in the blank space at top of the document). The lines must

have similar lengths. The result is a set of pseudo-baselines, each one passing through word bodies (Fig. 19). The method is applied to ancient Ottoman

    document archives and Latin texts.

    Fig. 19 Pseudo baselines extracted by a Repulsive-Attractive network on

    an Ancient Ottoman text (reprinted from Oztop et al).


    2.1.6 Stochastic method

We present here a method based on a probabilistic Viterbi algorithm

    (Tseng and Lee), which derives non-linear paths between overlapping text lines.

    Although this method has been applied to modern Chinese handwritten

    documents, this principle could be enlarged to historical documents which often

    include overlapping lines. Lines are extracted through hidden Markov modeling.

    The image is first divided into little cells (depending on stroke width), each one

    corresponding to a state of the HMM (Hidden Markov Model). The best

    segmentation paths are searched from left to right; they correspond to paths

    which do not cross lots of black points and which are as straight as possible.

    However, the displacement in the graph is limited to immediately superior

or inferior cells. All the best paths ending at each y location of the image are

    considered first. Elimination of some of these paths uses a quality threshold T: a

    path whose probability is less than T is discarded. Shifted paths are easily

    eliminated (and close paths are removed on quality criteria). The method

    succeeds when the ground truth path between text lines is slightly changing

    along the y-direction (Fig. 20). In the case of touching components, the path of

highest probability will cross the touching component at points with as few black pixels as possible. But the method may fail if the contact point contains a lot of

    black pixels.
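The path search can be sketched as a simple dynamic program over a cost grid, a simplified stand-in for the Viterbi formulation: the cell cost stands for the black-pixel count, and a bend penalty keeps the path as straight as possible. The function name and penalty value are illustrative assumptions:

```python
def best_path(cost, bend_penalty=1.0):
    # Minimum-cost left-to-right path; from one column to the next the row
    # may change by at most one (immediately superior or inferior cell).
    rows, cols = len(cost), len(cost[0])
    INF = float("inf")
    dp = [[INF] * cols for _ in range(rows)]
    back = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        dp[r][0] = cost[r][0]
    for c in range(1, cols):
        for r in range(rows):
            for dr in (-1, 0, 1):
                pr = r + dr
                if 0 <= pr < rows:
                    cand = dp[pr][c - 1] + cost[r][c] + (bend_penalty if dr else 0.0)
                    if cand < dp[r][c]:
                        dp[r][c], back[r][c] = cand, pr
    # Backtrack from the cheapest end cell in the last column.
    r = min(range(rows), key=lambda i: dp[i][cols - 1])
    path = [r]
    for c in range(cols - 1, 0, -1):
        r = back[r][c]
        path.append(r)
    return path[::-1]
```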

Fig. 20 Segmentation paths obtained by a stochastic method

2.1.7 Water reservoir principle


    The water reservoir principle is as follows. If water is poured from the top

    (bottom) of a component, the cavity regions of the component where water is

    stored are considered the top (bottom) reservoirs (Pal et al 2003). Here, two

    Oriya characters touch and create a large space which represents the bottom

    reservoir. This large space is very useful for touching character detection and

    segmentation. Owing to the shape of Oriya characters a small top reservoir is

    also generated due to touching (see figure 21).

    This small top reservoir also helps in touching character detection and

    segmentation. All reservoirs are not considered for future processing. Reservoirs

    having heights greater than a threshold T1 are selected for future use. For a

    component the value ofT1 is chosen as 1/9 times the component height. (The

    threshold is determined from experiment.) We now discuss here some terms

    relating to water reservoirs that will be used in feature extraction.

Top reservoir: By top reservoir of a component, we mean the reservoir obtained when water is poured from the top of the component.

Bottom reservoir: By bottom reservoir of a component, we mean the reservoir obtained when water is filled from the bottom of the component. A bottom reservoir of a component is visualized as a top reservoir when water is poured from the top after rotating the component by 180°.

Left (right) reservoir: If water is poured from the left (right) side of a component, the cavity regions of the component where water is stored are considered the left (right) reservoirs. A left (right) reservoir of a component is visualized as a top reservoir when water is poured from the top after rotating the component by 90° clockwise (anti-clockwise).

    Water reservoir area: By area of a reservoir we mean the area of the

    cavity region where water can be stored if water is poured from a particular side

    of the component. The number of pixels inside a reservoir is computed and this

    number is considered the area of the reservoir. Water flow level: The level from

    which water overflows from a reservoir is called the water flow level of the

    reservoir (see figure 22).


Reservoir baseline: A line passing through the deepest point of a reservoir and parallel to the water flow level of the reservoir is called the reservoir baseline (see figure 21).

Height of a reservoir: By height of a reservoir, we mean the depth of water in the reservoir. In other words, the height of a reservoir is the normal distance between the reservoir baseline and the water flow level of the reservoir. In figure 22, H denotes the reservoir height.

Width of a reservoir: By width of a reservoir, we mean the normal distance between the two extreme boundaries (perpendicular to the baseline) of a reservoir.
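The reservoir definitions above can be made concrete with a small sketch. Treating a component as a binary matrix (1 = black pixel), water poured from the top settles exactly as in the classic trapped-rain-water computation on the component's top profile, and a bottom reservoir is the top reservoir of the component rotated by 180°. This is a minimal illustration under those assumptions; the function names are not from the original method:

```python
def top_profile(component):
    """Row index of the first black pixel in each column (number of rows
    if the column is empty, i.e. a maximally deep column)."""
    rows, cols = len(component), len(component[0])
    return [next((r for r in range(rows) if component[r][c]), rows)
            for c in range(cols)]

def top_reservoir_area(component):
    """Water trapped when poured from the top. We work in 'depth'
    coordinates (larger value = lower surface), so the water level over a
    column is the lower of the tallest walls on its two sides."""
    depth = top_profile(component)
    area = 0
    for c in range(len(depth)):
        left = min(depth[:c + 1])   # tallest wall to the left (incl. self)
        right = min(depth[c:])      # tallest wall to the right (incl. self)
        level = max(left, right)    # water settles at the lower wall
        area += max(0, depth[c] - level)
    return area

def bottom_reservoir_area(component):
    """Bottom reservoir = top reservoir of the component rotated by 180 degrees."""
    rotated = [row[::-1] for row in component[::-1]]
    return top_reservoir_area(rotated)
```

For a U-shaped component the cavity is a top reservoir; for an n-shaped one it is a bottom reservoir. The reservoir height of the definition would be the maximum per-column water depth rather than the summed area.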

    Fig 21 Examples of big reservoirs created by touching (because of the

    touching of two characters a big bottom reservoir is formed here).


    Figure 22 Illustration of different features obtained from water

    reservoir principle. H denotes the height of bottom reservoir. Gray area of

    the zoomed portion represents reservoir base area.

In each selected reservoir we compute its base-area points. By base-area points of a reservoir, we mean those border points of the reservoir whose height from the baseline of the reservoir is less than 2RL. Base-area points for a component are shown in the zoomed-in portion of figure 22. Here RL is the length of the most frequently occurring black runs of a component; in other words, RL is the statistical mode of the black run lengths of a component. The value of RL is calculated as follows. The component is scanned both horizontally and vertically. If for a component we get n different run lengths r1, r2, . . ., rn with frequencies f1, f2, . . ., fn respectively, then RL = ri, where fi = max(fj), j = 1 . . . n.
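The computation of RL described above can be sketched as follows; the component is again assumed to be a binary matrix of 0/1 pixels, and the helper names are invented for illustration:

```python
from collections import Counter

def black_run_lengths(component):
    """Lengths of all black runs found by scanning the component both
    horizontally (rows) and vertically (columns)."""
    rows, cols = len(component), len(component[0])
    scans = [component[r] for r in range(rows)]
    scans += [[component[r][c] for r in range(rows)] for c in range(cols)]
    runs = []
    for line in scans:
        run = 0
        for px in list(line) + [0]:   # trailing 0 terminates a final run
            if px:
                run += 1
            elif run:
                runs.append(run)
                run = 0
    return runs

def run_length_mode(component):
    """RL: the statistical mode of the black run lengths."""
    return Counter(black_run_lengths(component)).most_common(1)[0][0]
```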


    3.1 PROCESSING OF OVERLAPPING COMPONENTS

Overlapping components are the main challenge for text line extraction, since no white space is left between lines. Some of the methods surveyed above do not need to detect such components, either because they extract only baselines, or because some criterion in the method itself makes paths avoid crossing black pixels. This section only deals with methods where ambiguous (overlapping) components are actually detected before, during or after text line segmentation. Criteria such as component size, the fact that the component belongs to several alignments, or on the contrary to no alignment, can be used for detecting ambiguous components.

Once a component is detected as ambiguous, it must be classified into one of three categories: an overlapping component which belongs to the upper alignment, an overlapping component which belongs to the lower alignment, or a touching component which has to be decomposed into several parts (two or more, as components may belong to three or more alignments in historical documents). Separation along the vertical direction is a hard problem which can be done roughly (horizontal cut), or more accurately by analysing stroke contours and referring to typical configurations (Fig. 23).

    Fig 23 Set of typical overlapping configurations


The grouping technique presented in the section on grouping methods detects an ambiguous component during

    the grouping process when a conflict occurs between two alignments. A set of

    rules is applied to label the component as overlapping or touching. The

    ambiguous component extends in each alignment region. The rules use as

    features the density of black pixels of the component in each alignment region,

    alignment proximity and contextual information (positions of both alignments

    around the component). An overlapping component will be assigned to only one

    alignment.

In the piece-wise method, the document page is first cut into eight equal columns. A

    projection-profile is performed on each column. In each histogram, two

    consecutive minima delimit a text block. In order to detect overlapping

    components, a k-means clustering scheme is used to classify the text blocks so

extracted into three classes: big, average and small. Overlapping components necessarily belong to big physical blocks, so all the overlapping cases are found in the big text block class, while all the one-line blocks are grouped in the average text block class. A second k-means clustering scheme finds the actual inter-line blocks; put together with the one-line block size, this determines the number of pieces a large text block must be cut into (cf. Fig. 24). The document is divided into vertical strips. Profile cuts within each strip are computed to obtain anchor

    points of segmentation (PSLs) which do not cross any black pixels. These points

    are grouped through strips by neighboring criteria.

    Fig 24 Text line segmentation


    If no segmentation point is present in the adjacent strip, the baseline is

    extended near the first black pixel encountered which belongs to an overlapping

    or touching component. This component is classified as overlapping or touching

    by analyzing its vertical extension (upper, lower) from each side of the

    intersection point. An empirical rule classifies the component. In the touching

    case, the component is horizontally cut at the intersection point (Fig. 25).

Fig 25 Overlapping component (circle) separated into two parts (rectangle) in Bangla writing.

    Some solutions for separation of units belonging to several text lines can

    be found also in the case of mail pieces and handwritten databases where efforts

have been made for recognition purposes. In one approach, separation is made

    from the skeleton of touching characters and the use of a dictionary of possible

    touching configurations (Fig. 23). In Bruzzone and Coffetti, the contact point

    between ambiguous strokes is detected and processed from their external

    border.


    An accurate analysis of the contour near the contact point is performed in

    order to separate the strokes according to two registered configurations: a loop in

    contact with a stroke, or two loops in contact. In simple cases of handwritten

    pages the center of gravity of the connected component is used either to

    associate the component to the current line or to the following line, or to cut the

    component into two parts. This works well if the component is a single character.

    It may fail if the component is a word, or part of a word, or even several words.
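The centre-of-gravity heuristic mentioned here can be sketched in a few lines; the parameter names and string labels are illustrative assumptions, not from the surveyed work:

```python
def assign_by_centroid(black_ys, upper_line_y, lower_line_y):
    """Assign an ambiguous connected component to the nearer of two text
    lines using the vertical centre of gravity of its black pixels.
    black_ys: row coordinates of the component's black pixels;
    upper_line_y / lower_line_y: vertical positions of the two lines."""
    cy = sum(black_ys) / len(black_ys)
    if abs(cy - upper_line_y) <= abs(cy - lower_line_y):
        return "upper"
    return "lower"
```

As the text notes, this works well for a single character but can fail when the component spans a word or several words, since the centroid of a long component says little about where it should be cut.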


    CHAPTER - 3


    3.2 PROPOSED METHOD

    PIECE-WISE PROJECTION METHOD

    The global horizontal projection method computes the sum of all black

    pixels on every row and constructs the corresponding histogram. Based on the

    peak/valley points of the histogram, individual lines are generally segmented.

Although this global horizontal projection method is applicable for line segmentation of printed documents, it cannot be used for unconstrained handwritten documents because the characters of two consecutive text lines

    may touch or overlap. For example, see the 4th and 5th text lines of the

    document shown in figure 26 a.

    Figure 26 (a) N-stripes and PSL lines in each stripe are shown for a sample

    of handwritten text. (b) Potential PSLs of figure 26 (a) are shown.

Here these two lines mostly overlap. To handle unconstrained handwritten documents, we use a piece-wise projection method as described below. First, we divide the text into vertical stripes of width W

    (here we assume that a document page is in portrait mode). Width of the last


stripe may differ from W. If the text width is Z and the number of stripes is N, the width of the last stripe is Z − W(N − 1).

    Computation of W is discussed later. Next, we compute piece-wise

    separating lines (PSL) from each of these stripes. We compute the row-wise sum

    of all black pixels of a stripe. The row where this sum is zero is a PSL. We may

    get a few consecutive rows where the sum of all black pixels is zero. Then the

    first row of such consecutive rows is the PSL. The PSLs of different stripes of a

    text are shown in figure 26 a by horizontal lines. All these PSLs may not be

    useful for line segmentation. We choose some potential PSLs as follows. We

    compute the normal distances between two consecutive PSLs in a stripe. So if

there are n PSLs in a stripe we get n − 1 distances.

This is done for all stripes. We compute the statistical mode (MPSL) of such distances. If the distance between any two consecutive PSLs of a stripe is less than MPSL, we remove the upper of these two PSLs. The PSLs obtained after this removal are the potential PSLs. The potential PSLs obtained from the PSLs of figure 26 a are shown in figure 26 b. We note the left and right co-ordinates of each potential PSL for future use. By properly joining these potential PSLs, we get individual text lines. It may be noted that sometimes, because of overlapping or touching of a component of the upper line with a component of the lower line, we may not get PSLs in some regions. Also, because of some modified characters of Telugu, we find some extra PSLs in a stripe. We take care of them during PSL joining, as explained next. Joining of PSLs is done in two steps.

    In the first step, we join PSLs from right to left and, in the second step, we first

    check whether line-wise PSL joining is complete or not. If for a line it is not

    complete, joining from left to right is done to obtain complete segmentation. We

    say PSLs joining of a line is complete if the length of the joined PSLs is equal to

    the column (width) of the document image. This two-step approach is done to get

    good results even if two consecutive text lines are overlapping or connected.
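The PSL extraction and potential-PSL filtering described above might be sketched like this, assuming a binarized page as a nested list (1 = black pixel); the function names and the tie-breaking of the statistical mode are assumptions of this sketch:

```python
from collections import Counter

def psl_rows(page, x0, x1):
    """PSLs of the stripe spanning columns [x0, x1): the first row of each
    maximal run of rows whose black-pixel sum inside the stripe is zero."""
    psls, prev_blank = [], False
    for r, row in enumerate(page):
        blank = not any(row[x0:x1])
        if blank and not prev_blank:
            psls.append(r)
        prev_blank = blank
    return psls

def potential_psls(page, width):
    """Per-stripe PSLs after filtering: if two consecutive PSLs of a stripe
    are closer than the mode (MPSL) of all consecutive-PSL distances, the
    upper of the pair is removed."""
    cols = len(page[0])
    stripes = [psl_rows(page, x, min(x + width, cols))
               for x in range(0, cols, width)]
    gaps = [b - a for s in stripes for a, b in zip(s, s[1:])]
    if not gaps:
        return stripes
    m_psl = Counter(gaps).most_common(1)[0][0]
    filtered = []
    for s in stripes:
        out = list(s)
        i = 0
        while i + 1 < len(out):
            if out[i + 1] - out[i] < m_psl:
                del out[i]   # drop the upper PSL of the close pair
            else:
                i += 1
        filtered.append(out)
    return filtered
```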


To join a PSL of the ith stripe, say Ki, to a PSL of the (i − 1)th stripe, we check whether any PSL whose normal distance from Ki is less than MPSL exists in the (i − 1)th stripe. If it exists, we join the left co-ordinate of Ki with the right co-ordinate of that PSL in the (i − 1)th stripe. If it does not exist, we extend Ki horizontally in the left direction until it reaches the left boundary of the (i − 1)th stripe or intersects a black pixel of a component in the (i − 1)th stripe. If the extended part intersects a black pixel of a component of the (i − 1)th stripe, we decide whether the component belongs to the upper line or the lower line.

    Based on the belongingness of this component, we extend this line in such a way

    that the component falls in its actual line. Belongingness of a component is

    decided as follows.

    We compute the distances from the intersecting point to the topmost and

    bottommost point of the component. Let d1 be the top distance and d2 the

    bottom distance.

If d1 < d2 and d1 < (MPSL/2), then the component belongs to the lower line.

If d2 ≤ d1 and d2 < (MPSL/2), then the component belongs to the upper line.

If d1 > (MPSL/2) and d2 > (MPSL/2), then we assume the component touches another component of the lower line.
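The three belongingness rules can be written directly as a small classifier; d1, d2 and MPSL are as defined above, and the string labels are illustrative:

```python
def belongingness(d1, d2, m_psl):
    """Classify a component hit while extending a PSL. d1 is the distance
    from the intersection point to the component's topmost point, d2 to its
    bottommost point, and m_psl is the mode of consecutive-PSL distances."""
    if d1 < d2 and d1 < m_psl / 2:
        return "lower"     # intersection near the top: body lies below
    if d2 <= d1 and d2 < m_psl / 2:
        return "upper"     # intersection near the bottom: body lies above
    return "touching"      # extends far on both sides: must be split
```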

    If the component belongs to the upper-line (lower-line) then the line is

    extended following the contour of the lower part (upper part) of the component so

    that the component can be included in the upper line (lower line).

    The line extension is done until it reaches the left boundary of the (i 1)th

stripe. If the component is touching, we detect possible touching points based on the structural shape of the touching component. From experiments, we notice that in most touching cases there exist junction/crossing shapes, or some obstacle points of low black-pixel density in the middle portion of the touching component. These obstacle points and the junction/crossing shapes help to find the


    touching position. Extension of PSL is done through this touching point to segment the

    component into two parts.

Fig 27 Line-segmented result of the text shown in figure 26. Text line segmentation is shown by solid lines. (a) The two end points of a mis-segmented line XY are marked by circles. (b) The correct segmentation is shown.

Sometimes, because of some modified characters, we may get some wrongly segmented lines; for example, see the line marked XY (figure 27 a). To take care of such wrong lines, we compute the density of black pixels and compare this value with the candidate length of the line. (By candidate length of a line, we mean the distance between the leftmost column of the leftmost component and the rightmost column of the rightmost component of that line.)

Let L be the candidate length of a line. Now we scan each column of the portion of the line that belongs to the candidate length to check for the presence of black pixels. If a black pixel does not exist in at least 50% of the columns of that line, then the line is not a valid line and we delete the lower boundary of this line to merge it with its lower line. Thus a mis-segmented line like XY of figure 27 a is corrected. The corrected line segmentation result is shown in figure 27 b.

To get a size-independent measure, computation of W is done as follows. We compute the statistical mode (md) of the widths of the bottom reservoirs obtained from the text. This mode is generally equal to the character width. Since


the average number of characters in a word is four, the value of W is taken as 4md to make the stripe width close to the word width. We computed word-length statistics. The

proposed line segmentation method does not depend on the size and style of the

    handwriting. Even if the handwritten lines overlap, touch or are curved, the

    proposed scheme works. For word segmentation from a line, we compute vertical

    histograms of the line. In general, the distance between two consecutive words of

    a line is greater than the distance between two consecutive characters in a word.

    Taking the vertical histogram of the line and using the above distance criteria we

segment words from lines.
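A minimal sketch of this gap-based word segmentation: the vertical histogram of the line is scanned, and a run of at least min_gap empty columns is taken as an inter-word gap. The min_gap parameter is an assumption of this sketch; the text only states that inter-word gaps are wider than inter-character gaps:

```python
def word_spans(line_img, min_gap):
    """Split a text line (binary nested list, 1 = black) into word spans:
    (start_col, end_col) pairs separated by runs of >= min_gap empty
    columns in the vertical histogram."""
    cols = len(line_img[0])
    hist = [sum(row[c] for row in line_img) for c in range(cols)]
    spans, start, end, gap = [], None, None, 0
    for c, h in enumerate(hist):
        if h:
            if start is None:
                start = c
            end, gap = c, 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:       # gap wide enough: close the word
                spans.append((start, end))
                start = None
    if start is not None:
        spans.append((start, end))
    return spans
```

In practice min_gap would be derived from the histogram of inter-run gaps of the line, so that character gaps fall below it and word gaps above it.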


    3.2.1 Flowchart

    Fig 28 Flow chart of the algorithm

Divide the text into vertical stripes → Compute piece-wise separating lines (PSLs) → Choose the potential PSLs → Join these PSLs → Compute the belongingness of each intersected component


    3.2.2 ALGORITHM

In short, the line segmentation algorithm (LINE-SEGM) is as follows:

Algorithm LINE-SEGM

Step 1: Divide the text into vertical stripes of width W.

Step 2: Compute piece-wise separating lines (PSLs) from each of these stripes as discussed earlier.

Step 3: Compute potential PSLs from the PSLs obtained in step 2.

Step 4: Choose the rightmost top potential PSL and extend (from right to left) this PSL up to the previous stripe.

Step 5: Continue this PSL joining from right to left until we reach the left boundary of the left-most stripe.

Step 6: Check whether the length of the line drawn equals the width of the document. If yes, go to step 7. Else, PSL line extension is done to the right until we reach the right boundary of the document.

Step 7: Repeat steps 4 to 6 for the potential PSLs not considered for joining so far. If there are no more PSLs for joining, stop.


Let us see all these steps in detail.

Step 1: Divide the given text into a number of vertical stripes.

Fig 29 (a) Original text image, divided into stripes (1st stripe, 2nd stripe, . . ., nth stripe). (b) Output text image.


    Step 2: Compute the piece-wise separating lines.

    Fig 30 PSL of the text

    Compute the row-wise sum of all black pixels of a stripe. The row where

the sum is zero is a PSL. If there are a few consecutive rows where the black-pixel sum is zero, then the first row of such rows is the PSL.

    Step 3: Choose only potential PSLs

Fig 31 Potential PSLs of the text


    All the PSLs may not be useful for line segmentation, so choose

    some potential PSLs among these. Compute the normal distances between two

consecutive PSLs in a stripe. So if there are n PSLs we get n − 1 distances. This

    is done for all stripes. Compute the statistical mode Mpsl of such distances. If the

    distance between any two consecutive PSLs of a stripe is less than Mpsl then

    remove the upper PSL of these two PSLs. PSLs obtained after this removal are

    the potential PSLs.

    Step 4: Join the PSLs

    Fig 32 Joining the PSLs

Joining of PSLs is done in two steps:

i) In the first step, we join PSLs from right to left.

ii) Then we check whether line-wise PSL joining is complete or not. If for a line it is not complete, joining from left to right is done to obtain complete segmentation.


We say PSL joining of a line is complete if the length of the joined PSLs is equal to the column size (width) of the document image. This two-step approach is done to get good results even if two consecutive text lines are overlapping or connected.

    Step 5: Compute belongingness of the component

    Fig 33 Belongingness of component

If the extended part intersects a black pixel of any component, then compute the belongingness of the component. Compute the distances from the intersecting point to the topmost and bottommost points of the component. Let d1 be the top distance and d2 the bottom distance. If d1 < d2 and d1 < (MPSL/2), the component belongs to the lower line; if d2 ≤ d1 and d2 < (MPSL/2), it belongs to the upper line; otherwise the component is treated as touching.


The following figure is obtained after all the steps.

Fig 34 Complete line segmentation obtained after all the steps

    3.3 APPLICATIONS


    3.3.1 Practical Applications

    In recent years, OCR (Optical Character Recognition) technology has been

    applied throughout the entire spectrum of industries, revolutionizing the

    document management process. OCR has enabled scanned documents to

    become more than just image files, turning into fully searchable documents with

    text content that is recognized by computers. With the help of OCR, people no

    longer need to manually retype important documents when entering them into

    electronic databases. Instead, OCR extracts relevant information and enters it

    automatically. The result is accurate, efficient information processing in less time.

    3.3.2 Banking

    The uses of OCR vary across different fields. One widely known

    application is in banking, where OCR is used to process checks without human

    involvement. A check can be inserted into a machine, the writing on it is scanned

    instantly, and the correct amount of money is transferred. This technology has

    nearly been perfected for printed checks, and is fairly accurate for handwritten

    checks as well, though it occasionally requires manual confirmation. Overall, this

    reduces wait times in many banks.

    3.3.3 Legal

    In the legal industry, there has also been a significant movement to

    digitize paper documents. In order to save space and eliminate the need to sift

    through boxes of paper files, documents are being scanned and entered into

    computer databases. OCR further simplifies the process by making documents

text-searchable, so that they are easier to locate and work with once in the database. Legal professionals now have fast, easy access to a huge library of

    documents in electronic format, which they can find simply by typing in a few

    keywords.

    3.3.4 Healthcare


    Healthcare has also seen an increase in the use of OCR technology to

    process paperwork. Healthcare professionals always have to deal with large

    volumes of forms for each patient, including insurance forms as well as general

    health forms. To keep up with all of this information, it is useful to input relevant

    data into an electronic database that can be accessed as necessary. Form

    processing tools, powered by OCR, are able to extract information from forms

    and put it into databases, so that every patient's data is promptly recorded. As a

    result, healthcare providers can focus on delivering the best possible service to

    every patient.

    3.3.5 OCR in Other Industries

    OCR is widely used in many other fields, including education, finance, and

    government agencies. OCR has made countless texts available online, saving

    money for students and allowing knowledge to be shared. Invoice imaging

    applications are used in many businesses to keep track of financial records and

    prevent a backlog of payments from piling up. In government agencies and

    independent organizations, OCR simplifies data collection and analysis, among

    other processes. As the technology continues to develop, more and more

    applications are found for OCR technology, including increased use of

    handwriting recognition. Furthermore, other technologies related to OCR, such

as barcode recognition, are used daily in retail and other industries. Examples of commercial products include Maestro Recognition Server, CVISION's OCR toolkit, and Trapeze, an automated form-processing solution.

    3.3.6 Resume processing

    Several of the industry leaders in resume processing software use Prime

    OCR to generate high accuracy results. Some customers use the text results

    straight from Prime OCR while others choose to manually verify OCR results with

    Prime Verify for maximum accuracy. One of the largest resume processing


    facilities leverages Prime OCR's increased accuracy by providing recruiting

    customers the same accuracy of results without having to manually verify each

    resume. They take the results straight from Prime OCR and deliver them to the

customer, passing on the savings of processing large batches of resumes. What used to take days of sending offshore for OCR and manual verification can now be done overnight in a local facility, all with Prime OCR software.

    3.3.7 Library archives/Digital Library

    Digital library initiatives are adopting advanced OCR technology like Prime

    OCR to convert large book collections for on-line viewing of content. Not only is

    Prime OCR designed to generate accurate results but it can also provide a level

    of reliability that cannot be found in traditional desktop OCR software.

    A large university's project of converting large collections and providing

    the content on-line was improved with Prime OCR's unique ability to provide high

    accuracy results. The results were so impressive that all of the material that had

been previously processed was run through Prime OCR a second time to

    improve the ability to find textual information in the collection.

    3.3.8 Document identification

    An added option of Prime OCR allows for the software to accurately

    identify different types of documents. Using high a