Content of Text Line Segmentation

Upload: m-bharath-reddy

Post on 07-Apr-2018


  • 8/6/2019 Content of Text Line Segmantation


    CHAPTER - 1


    1.1 INTRODUCTION

    Text line extraction is generally seen as a preprocessing step for tasks

such as document structure extraction, printed character or handwriting recognition. Many techniques have been developed for page segmentation of

printed documents (newspapers, scientific journals, magazines, business letters)

    produced with modern editing tools. The segmentation of handwritten documents

    has also been addressed with the segmentation of address blocks on envelopes

    and mail pieces, and for authentication or recognition purposes. More recently,

    the development of handwritten text databases provides new material for

    handwritten page segmentation.

    Ancient and historical documents, printed or handwritten, strongly differ

    from the documents mentioned above because layout formatting requirements

    were looser. Their physical structure is thus harder to extract. In addition,

    historical documents are of low quality, due to aging or faint typing. They include

    various disturbing elements such as holes, spots, writing from the verso

    appearing on the recto, ornamentation, or seals. Handwritten pages include

narrowly spaced lines with overlapping and touching components. Characters and

    words have unusual and varying shapes, depending on the writer, the period and

    the place concerned. The vocabulary is also large and may include unusual

names and words.

    Full text recognition is in most cases not yet available, except for printed

    documents for which dedicated OCR can be developed. However, invaluable

    collections of historical documents are already digitized and indexed for

consulting, exchange and distant access purposes, which protects them from

    direct manipulation. In some cases, highly structured editions have been

    established by scholars. But a huge amount of documents are still to be exploited

    electronically. To produce an electronic searchable form, a document has to be


    indexed. The simplest way of indexing a document consists in attaching its main

characteristics such as date, place and author (the so-called metadata).

    Indexing can be enhanced when the document structure and content are

    exploited. When a transcription (published version, diplomatic transcription) is

    available, it can be attached to the digitized document: this allows users to

    retrieve documents from textual queries. Since text based representations do not

    reflect the graphical features of such documents, a better representation is

    obtained by linking the transcription to the document image. A direct

    correspondence can then be established between the document image and its

content by text/image alignment techniques. This allows the creation of indexes

    where the position of each word can be recorded, and of links between both

    representations.

    Clicking on a word on the transcription or in the index through a GUI

    allows users to visualize the corresponding image area and vice versa. The

    document analysis embedded in such systems provides tools to search for

    blocks, lines and words, and may include a dedicated handwriting recognition

system. Interactive tools are generally offered for segmentation and recognition correction purposes. Several projects also concern printed material. However,

    document structure can also be used when no transcription is available. Word

    spotting techniques can retrieve similar words in the image document through an

image query. When words of the image document are extracted by top-down

    segmentation, which is generally the case, text lines are extracted first.

    The authentication of manuscripts in the paleographic sense can also

    make use of document analysis and text line extraction. Authentication consists

    in retrieving writer characteristics independently from document content. It

    generally consists in dating documents, localizing the place where the document

    was produced, identifying the writer by using characteristics and features

    extracted from blank spaces, line orientations and fluctuations, word or character


    shapes. Page segmentation into text lines is performed in most tasks mentioned

    above and overall performance strongly relies on the quality of this process.

    Fig 1 Line Segmentation Block diagram

The purpose of this project is to survey the efforts made for historical documents on the text line segmentation task. Section 2 describes the

    characteristics of text line structures in historical documents and the different

    ways of defining a text line. Preprocessing of document images (gray level, color

or black and white) is often necessary before text line extraction to prune

superfluous information (non-textual elements, textual elements from the verso)

or to correctly binarize the image.

This problem is addressed in preprocessing. In related methods such as

projection-based methods, smearing methods, grouping methods, methods

based on the Hough transform, the repulsive-attractive network method and the

stochastic method, we survey the different approaches to segmenting the clean

image into text lines. A taxonomy is proposed, listed as projection profiles,


    smearing, grouping, Hough-based, repulsive-attractive network and stochastic

    methods. The majority of these techniques have been developed for the projects

    on historical documents mentioned above. We address the specific problem of

overlapping and touching components further on.
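As an illustration of the projection-based family, text lines can be found by summing ink pixels along each row of a binarized image and cutting at the empty valleys between lines. The sketch below assumes a clean, roughly horizontal binary image; the ink threshold is illustrative:

```python
import numpy as np

def segment_lines_by_projection(binary, min_ink=1):
    """Split a binary image (1 = ink) into text line row ranges.

    Rows with fewer than `min_ink` ink pixels are treated as interline
    gaps; maximal runs of inked rows become line candidates.
    """
    profile = binary.sum(axis=1)            # ink pixels per row
    inked = profile >= min_ink
    lines, start = [], None
    for y, on in enumerate(inked):
        if on and start is None:
            start = y                       # a line begins
        elif not on and start is not None:
            lines.append((start, y))        # a line ends (exclusive)
            start = None
    if start is not None:
        lines.append((start, len(inked)))
    return lines

# Two synthetic "lines" of ink separated by a blank gap
img = np.zeros((10, 20), dtype=int)
img[1:3, :] = 1
img[6:9, 5:15] = 1
print(segment_lines_by_projection(img))     # [(1, 3), (6, 9)]
```

Real historical pages need more care (fluctuating baselines, touching components), which is precisely why the other method families exist.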

    Document image understanding algorithms are expected to work with a

    document, irrespective of its layout, script, font, color, etc. Segmentation aims to

    partition a document image into various homogeneous regions such as text

    blocks, image blocks, lines, words etc. Page segmentation algorithms can be

    broadly classified into three categories: bottom-up, top-down, and hybrid

    algorithms. The classification is based on the order in which the regions in a

    document are identified and labeled. The layout of the document is represented

    by a hierarchy of regions: page, image or text blocks, lines, words, components,

    and pixels. The traditional document segmentation algorithms give good results

    on most documents with complex layouts but assume the script in the document

    to be simple as in English. These algorithms fail to give good results on the

    documents with complex scripts such as African, Persian and Indian scripts.


    Fig 2 Reference lines and interfering lines with overlapping and touching

    components.

    1.2 CHARACTERISTICS AND REPRESENTATION OF TEXT

    LINES

    To have a good idea of the physical structure of a document image, one

    only needs to look at it from a certain distance: the lines and the blocks are

    immediately visible. These blocks consist of columns, annotations in margins,

stanzas, etc. As blocks generally have no rectangular shape in historical documents, the text line structure becomes the dominant physical structure. We

    first give some definitions about text line components and text line segmentation.

    Then we describe the factors which make this text line segmentation hard.

    Finally, we describe how a text line can be represented.


    1.2.1 DEFINITION

Baseline: Fictitious line which follows and joins the lower part of the character bodies in a text line (Fig. 2).

    Median line: Fictitious line which follows and joins the upper part of the

    character bodies in a text line.

    Upper line: Fictitious line which joins the top of ascenders.

    Lower line: Fictitious line which joins the bottom of descenders.

    Overlapping components: Overlapping components are descenders and

    ascenders located in the region of an adjacent line (Fig. 2).

Touching components: Touching components are ascenders and descenders

    belonging to consecutive lines which are thus connected. These components are

    large but hard to discriminate before text lines are known.

    Text line segmentation: Text line segmentation is a labeling process which

    consists in assigning the same label to spatially aligned units (such as pixels,

    connected components or characteristic points).

    There are two categories of text line segmentation approaches:

    searching for (fictitious) separating lines or paths, or searching for aligned

    physical units. The choice of a segmentation technique depends on the

    complexity of the text line structure of the document.
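The second category, searching for aligned physical units, can be illustrated by a simplified grouping rule (not any particular published method): connected components whose bounding boxes overlap vertically are assigned the same line label.

```python
def group_into_lines(boxes):
    """Assign a line label to each box (x0, y0, x1, y1).

    Boxes whose vertical extents overlap are given the same label;
    a greedy single pass over boxes sorted by top edge suffices for
    well-separated lines.
    """
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][1])
    labels = [None] * len(boxes)
    lines = []  # (y0, y1) vertical extent of each open line
    for i in order:
        _, y0, _, y1 = boxes[i]
        for lbl, (ly0, ly1) in enumerate(lines):
            if y0 < ly1 and ly0 < y1:       # vertical overlap with line
                labels[i] = lbl
                lines[lbl] = (min(ly0, y0), max(ly1, y1))
                break
        else:
            labels[i] = len(lines)          # open a new line
            lines.append((y0, y1))
    return labels

# Three words on one line, one word on the next
boxes = [(0, 0, 5, 10), (6, 1, 12, 9), (13, 2, 20, 11), (0, 20, 8, 30)]
print(group_into_lines(boxes))  # [0, 0, 0, 1]
```

Overlapping and touching components break this simple rule, which is why dedicated treatment is surveyed later.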

    1.2.2 INFLUENCE OF AUTHOR STYLE


    Baseline fluctuation: The baseline may vary due to writer movement. It may

    be straight, straight by segments, or curved.

    Line orientations: There may be different line orientations, especially on

    authorial works where there are corrections and annotations.

Line spacing: Lines that are rather widely spaced are easy to find. The

process of extracting text lines grows more difficult as interlines narrow: the

lower baseline of the first line comes closer to the upper baseline of the

second line, and descenders and ascenders start to fill the blank space left for separating two adjacent text lines.

    Insertions: Words or short text lines may appear between the principal text

    lines, or in the margins.

    1.2.3 INFLUENCE OF POOR IMAGE QUALITY

    Imperfect preprocessing: Smudges, variable background intensity and the

    presence of seeping ink from the other side of the document make image

    preprocessing particularly difficult and produce binarization errors.

Stroke fragmentation and merging: Punctuation, dots and broken strokes due

    to low-quality images and/or binarization may produce many connected

    components; conversely, words, characters and strokes may be split into several

    connected components. The broken components are no longer linked to the


    median baseline of the writing and become ambiguous and hard to segment into

    the correct text line.

    1.2.4 TEXT LINE REPRESENTATION

    Separating paths and delimited strip: Separating lines (or paths) are

    continuous fictitious lines which can be uniformly straight, made of straight

    segments, or of curving joined strokes. The delimited strip between two

    consecutive separating lines receives the same text line label. So the text line

can be represented by a strip with its pair of separating lines (Fig. 3).

    Clusters: Clusters are a general set-based way of defining text lines. A label is

    associated with each cluster. Units within the same cluster belong to the same

    text line. They may be pixels, connected components, or blocks enclosing pieces

    of writing. A text line can be represented by a list of units with the same label.

    Strings: Strings are lists of spatially aligned and ordered units. Each string

    represents one text line.

Baselines: Baselines follow line fluctuations but only partially define a text line. Units

    connected to a baseline are assumed to belong to it. Complementary processing

    has to be done to cluster non-connected units and touching components.
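For illustration, the four representations above can be captured in a small data structure; the field names below are invented for this sketch and do not come from any specific system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[int, int]

@dataclass
class SeparatingPath:
    """A fictitious separating line as a polyline of (x, y) points."""
    points: List[Point]

@dataclass
class TextLine:
    """One text line under the strip/cluster/string/baseline views."""
    label: int
    upper_path: Optional[SeparatingPath] = None   # strip: delimiting paths
    lower_path: Optional[SeparatingPath] = None
    units: List[int] = field(default_factory=list)  # cluster/string: unit ids
    baseline: List[Point] = field(default_factory=list)

line = TextLine(label=0,
                upper_path=SeparatingPath([(0, 10), (100, 12)]),
                lower_path=SeparatingPath([(0, 40), (100, 38)]),
                units=[3, 7, 9],
                baseline=[(0, 35), (100, 34)])
print(line.label, len(line.units))  # 0 3
```

The string view is the cluster view plus an ordering of `units`, so one structure can serve both.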


    Fig. 3 Various text line representations: paths, strings and baselines.


    1.3 DOCUMENT IMAGE ANALYSIS

    Document image analysis refers to algorithms and techniques that are

    applied to images of documents to obtain a computer-readable description from

pixel data. A well-known document image analysis product is the Optical

Character Recognition (OCR) software that recognizes characters in a scanned

document. OCR makes it possible for the user to edit or search the document's

contents. In this paper we briefly describe various components of a document

analysis system. Many of these basic building blocks are found in most document

analysis systems.

    The objective of document image analysis is to recognize the text and

    graphics component in images of documents, and to extract the intended

    information as a human would. Two categories of document image analysis can

    be defined (see figure 4). Textual processing deals with the text components of

    a document image. Some tasks here are: determining the skew (any tilt at which

    the document may have been scanned into the computer), finding columns,

    paragraphs, text lines, and words, and finally recognizing the text (and possibly

    its attributes such as size, font etc.) by optical character recognition (OCR).

    Graphics processing deals with the non-textual line and symbol components that

    make up line diagrams, delimiting straight lines between text sections, company

    logos etc.


    Pictures are a third major component of documents, but except for

    recognizing their location on a page, further analysis of these is usually the task

    of other image processing and machine vision techniques. After application of

    these text and graphics analysis techniques, the several megabytes of initial data

    are culled to yield a much more concise semantic description of the document.

    Fig 4 A hierarchy of document processing subareas listing the types of

    document components dealt with in each subarea.

    Document analysis systems will become increasingly more evident in the

form of everyday document systems. For instance, OCR systems will be more widely used to store, search, and excerpt from paper-based documents. Page-layout

analysis techniques will recognize a particular form or page format and

    allow its duplication. Diagrams will be entered from pictures or by hand, and

    logically edited. Pen-based computers will translate handwritten entries into

    electronic documents. Archives of paper documents in libraries and engineering


    companies will be electronically converted for more efficient storage and instant

    delivery to a home or office computer.

    Consider three specific examples of the need for document analysis

    presented here.

(1) Typical documents in today's office are computer-generated, but even so,

    inevitably by different computers and software such that even their electronic

    formats are incompatible. Some include formatted text and tables as well as

    handwritten entries. There are different sizes, from a business card to a large

    engineering drawing. Document analysis systems recognize types of documents,

    enable the extraction of their functional parts, and translate from one computer

    generated format to another.

(2) Automated mail-sorting machines performing sorting and address recognition

    have been used for several decades, but there is the need to process more mail,

    more quickly, and more accurately.

    (3) In a traditional library, loss of material, misfiling, limited numbers of each

copy, and even degradation of materials are common problems, and may be improved by document analysis techniques. All these examples serve as

    applications ripe for the potential solutions of document image analysis.

    1.3.1 DATA CAPTURE

    Data in a paper document are usually captured by optical scanning and

    stored in a file of picture elements, called pixels that are sampled in a grid pattern

    throughout the document. These pixels may have values: OFF (0) or ON (1) for

binary images, 0–255 for gray-scale images, and 3 channels of 0–255 color

    values for color images.


    At a typical sampling resolution of 120 pixels per centimeter, a 20 x 30 cm

    page would yield an image of 2400x3600 pixels. When the document is on a

    different medium such as microfilm, palm leaves, or fabric, photographic methods

    are often used to capture images. In any case, it is important to understand that

    the image of the document contains only raw data that must be further analyzed

    to glean the information.
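The sampling arithmetic in this section is easy to reproduce; the short sketch below computes the pixel dimensions from the document's own example (120 pixels per centimeter on a 20 x 30 cm page) and the raw storage for each pixel type.

```python
def image_size(width_cm, height_cm, samples_per_cm):
    """Pixel dimensions of a page scanned at a given sampling rate."""
    return width_cm * samples_per_cm, height_cm * samples_per_cm

def raw_bytes(width_px, height_px, kind):
    """Uncompressed storage: 1 bit/pixel binary, 1 byte gray, 3 bytes color."""
    bits = {"binary": 1, "gray": 8, "color": 24}[kind]
    return width_px * height_px * bits // 8

w, h = image_size(20, 30, 120)
print(w, h)                          # 2400 3600
print(raw_bytes(w, h, "color"))      # 25920000 bytes for 3-channel color
```

This is why gray-scale or color scans of a single page run to several megabytes before any analysis is done.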

    1.4 OPTICAL CHARACTER RECOGNITION (OCR)

    Optical Character Recognition (OCR) lies at the core of the discipline of

    pattern recognition where the objective is to interpret a sequence of characters

    taken from an alphabet. Characters of the alphabet are usually rich in shape. In

    fact, the characters can be subject to many variations in terms of fonts and

    handwriting styles. Despite these variations, there is perhaps a basic abstraction

    of the shapes that identifies any of their instantiations. Developing computer

    algorithms to identify the characters of the alphabet is the principal task of OCR.

The challenge to the research community is the following: while humans can

    recognize neatly handwritten characters with 100% accuracy, there is no OCR

    that can match that performance. OCR difficulty can increase on several counts.

Increases in fonts, the size of the alphabet set, unconstrained handwriting,

touching of adjacent characters, broken strokes due to poor binarization, noise,

etc. all contribute to the difficulty. For example, handwritten 0s and 6s are easily confused by a digit recognizer. There are many applications that

    require the recognition of unconstrained handwriting. A word can be either purely

    numeric as in the case of a Zip code, or purely alphabetic as in the case of US

state abbreviations, or mixed as in an apartment number.


    The task becomes particularly challenging when adjacent characters in a

    character string are touching as shown in figure 8. Unlike purely alphabetic

    strings where joining of the characters is natural and takes place by means of

    ligatures, the joining of numerals in a numeric word and the upper-case

    characters in an abbreviation are accidental. There are various ways in which

    two digits can touch. Some of the categories lend themselves to natural

    segmentation, whereas for some a holistic approach is the only option available.

    1.4.1 WHAT IS OCR?

    OCR is the acronym for Optical Character Recognition. This technology

    allows a machine to automatically recognize characters through an optical

mechanism. Human beings recognize many objects in this manner; our eyes are

    the "optical mechanism." But while the brain "sees" the input, the ability to

    comprehend these signals varies in each person according to many factors. By

    reviewing these variables, we can understand the challenges faced by the

    technologist developing an OCR system.

    First, if we read a page in a language other than our own, we may

    recognize the various characters, but be unable to recognize words. However, on

    the same page, we are usually able to interpret numerical statements - the

    symbols for numbers are universally used. This explains why many OCR

    systems recognize numbers only, while relatively few understand the full

    alphanumeric character range. Second, there is similarity between many

    numerical and alphabetical symbol shapes. For example, while examining a

    string of characters combining letters and numbers, there is very little visible

    difference between a capital letter "O" and the numeral "0."


    As humans, we can re-read the sentence or entire paragraph to help us

    determine the accurate meaning. This procedure, however, is much more difficult

    for a machine. Third, we rely on contrast to help us recognize characters. We

    may find it very difficult to read text which appears against a very dark

    background, or is printed over other words or graphics. Again, programming a

    system to interpret only the relevant data and disregard the rest is a difficult task

    for OCR engineers. There are many other problems which challenge the

    developers of OCR systems. In this paper, we will review the history,

    advancements, abilities and limitations of existing systems. This analysis should

    help determine if OCR is the correct application for your company's needs, and if

    so, which type of system to implement.

    1.4.2 HISTORY OF OCR

    The engineering attempts at automated recognition of printed characters

    started prior to World War II. But it was not until the early 1950's that a

commercial venture was identified that justified the necessary funding for research and development of the technology. This impetus was provided by the American

    Bankers Association and the Financial Services Industry. They challenged all the

    major equipment manufacturers to come up with a "Common Language" to

    automatically process checks. After the war, check processing had become the

    single largest paper processing application in the world. Although the banking

industry eventually chose Magnetic Ink Character Recognition (MICR), some vendors had

    proposed the use of an optical recognition technology.

    However, OCR was still in its infancy at the time and did not perform as

    acceptably as MICR. The advantage of MICR was that it is relatively impervious

to change, fraudulent alteration and interference from non-MICR inks. The "eye"

    of early OCR equipment utilized lights, mirrors, fixed slits for the reflected light to


    pass through, and a moving disk with additional slits. The reflected image was

    broken into discrete bits of black and white data, presented to a photo-multiplier

    tube, and converted to electronic bits.

The "brain's" logic required the presence or absence of "black" or "white"

    data bits at prescribed intervals. This allowed it to recognize a very limited,

    specially designed character set. To accomplish this, the units required

    sophisticated transports for documents to be processed. The documents were

    required to run at a consistent speed and the printed data had to occur in a fixed

    location on each and every form.

    This technology also introduced the concept of blue, non-reading inks as

    the system was sensitive to the ultraviolet spectrum. The third generation of

    recognition devices, introduced in the early 1970's, consisted of photo-diode

arrays. These tiny sensors were aligned in an array so the reflected image of

    a document would pass by at a prescribed speed. These devices were most

    sensitive in the infra-red portion of the visual spectrum so "red" inks were used

as non-reading inks. That brings us to the current generation of hardware.

    1.4.3 LIMITATIONS OF OCR

    OCR has never achieved a read rate that is 100% perfect. Because of

    this, a system which permits rapid and accurate correction of rejects is a major

    requirement. Exception item processing is always a problem because it delays

    the completion of the job entry, particularly the balancing function. Of even

    greater concern is the problem of misreading a character (substitutions). In

    particular, if the system does not accurately balance dollar data, customer

    dissatisfaction will occur. The success of any OCR device to read accurately

    without substitutions is not the sole responsibility of the hardware manufacturer.

    Much depends on the quality of the items to be processed.


    Through the years, the desire has been to increase the accuracy of

reading: that is, to reduce rejects and substitutions, to reduce the sensitivity of

scanning in order to read less-controlled input, to eliminate the need for specially

designed fonts (characters), and to read handwritten characters. However,

    today's systems, while much more forgiving of printing quality and more accurate

    than earlier equipment, still work best when specially designed characters are

    used and attention to printing quality is maintained. However, these limits are not

    objectionable to most applications, and dedicated users of OCR systems are

    growing each year. But the ability to read a special character is not, by itself,

sufficient to create a successful system.

    1.4.4 WHAT DOES IT TAKE TO MAKE A SUCCESSFUL OCR SYSTEM?

1. It takes a complementary merging of the input document stream with the

    processing requirements of the particular application with a total system concept

    that provides for convenient entry of exception type items with an output that

    provides cost effective entry to complete the system. To show a successful

    example, let's review the early credit card OCR applications.

    2. Input was a carbon imprinted document. However, if the carbon was wrinkled,

    the imprinter was misaligned, or any one of a variety of reasons existed, the

    imprinted characters were impossible to read accurately.

    3. To compensate for this problem, the processing system permitted direct key

entry of the failed-to-read items at a fairly high speed. Directly keyed items from the

    misread document were under intelligent computer control which placed the

    proper data in the right location for the data record. Important considerations in

    designing the system encouraged the use of modulus controlled check digits for

    the embossed credit card account number. This, coupled with tight monetary

    controls by batch totals, reduced the chance of read substitutions.


4. The output of these early systems provided a "country club" type of billing.

That is, each of the credit card sales slips was returned to the original purchaser. This

    provided the credit card customer with the opportunity to review his own

purchases to ensure the final accuracy of billing. This has been a very successful

    operation through the years. Today's systems improve the process by increasing

    the amount of data to be read, either directly or through reproduction of details on

    the sales draft. This provides customers with a "descriptive" billing statement

    which itemizes each transaction. Attention to the details of each application step

    is a requirement for successful OCR systems.

    1.4.6 PHASES IN OCR

PREPROCESSING

SEGMENTATION

RECOGNITION

POST PROCESSING


Fig 5 Phases of OCR

    1.4.6.1 PREPROCESSING

    Optical Character Recognition (OCR) refers to the process of converting

printed Tamil text documents into software-translated Unicode Tamil text. The

    printed documents available in the form of books, papers, magazines, etc. are

    scanned using standard scanners which produce an image of the scanned

    document. As part of the preprocessing phase the image file is checked for

    skewing. If the image is skewed, it is corrected by a simple rotation technique in

    the appropriate direction. Then the image is passed through a noise elimination

    phase and is binarized.
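One simple way to implement the skew check mentioned above is to try a range of candidate angles and keep the one whose sheared row histogram is sharpest; the correction is then a rotation by the negative of that angle. This is an illustrative sketch, not the system's actual routine:

```python
import numpy as np

def estimate_skew(binary, angles=np.linspace(-5, 5, 41)):
    """Estimate the skew angle (degrees) of a binary image (1 = ink).

    For each candidate angle, ink pixel rows are sheared accordingly
    and the variance of the resulting row histogram is measured; the
    correct angle gives the sharpest (highest-variance) profile.
    """
    ys, xs = np.nonzero(binary)
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        shifted = ys - xs * np.tan(np.radians(a))   # shear rows by angle a
        hist = np.bincount((shifted - shifted.min()).astype(int))
        score = hist.var()
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle

# A synthetic page with two text "lines" sloping about 2 degrees
page = np.zeros((60, 100), dtype=int)
for x in range(100):
    y = int(10 + x * np.tan(np.radians(2.0)))
    page[y, x] = 1
    page[y + 25, x] = 1
print(round(estimate_skew(page), 2))
```

On the synthetic page the estimate recovers approximately the 2-degree slant; a real system would follow this with a rotation and the noise elimination and binarization steps described above.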

    The preprocessed image is segmented using an algorithm which

decomposes the scanned text into paragraphs using a special space detection

    technique and then the paragraphs into lines using vertical histograms, and lines

    into words using horizontal histograms, and words into character image glyphs

using horizontal histograms. Each image glyph comprises 32x32 pixels.

    Thus a database of character image glyphs is created out of the segmentation

    phase. Then all the image glyphs are considered for recognition using Unicode

    mapping. Each image glyph is passed through various routines which extract the

    features of the glyph.
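The line-to-word step of this histogram-based decomposition can be sketched along the other axis: columns with no ink mark gaps, and only gaps wider than a threshold (inter-word spacing) split words. The gap threshold below is illustrative:

```python
import numpy as np

def segment_words(line_strip, min_gap=3):
    """Split a binary line strip (1 = ink) into word column ranges.

    Columns with no ink form gaps; only gaps of at least `min_gap`
    columns (inter-word spacing) separate words, so narrow
    inter-character gaps stay inside one word.
    """
    profile = line_strip.sum(axis=0)      # ink pixels per column
    words, start, gap = [], None, 0
    for x, ink in enumerate(profile):
        if ink:
            if start is None:
                start = x                 # a word begins
            end = x + 1
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:            # wide gap ends the word
                words.append((start, end))
                start = None
    if start is not None:
        words.append((start, end))
    return words

strip = np.zeros((8, 30), dtype=int)
strip[:, 2:5] = 1; strip[:, 6:9] = 1      # two characters, one word
strip[:, 15:20] = 1                       # a second word
print(segment_words(strip))               # [(2, 9), (15, 20)]
```

The word-to-glyph step repeats the same idea with the gap threshold set to a single empty column.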



    1.4.6.2 SEGMENTATION

    In computer vision, segmentation refers to the process of partitioning a

    digital image into multiple segments (sets of pixels, also known as superpixels).

    The goal of segmentation is to simplify and/or change the representation of an

    image into something that is more meaningful and easier to analyze. Image

    segmentation is typically used to locate objects and boundaries (lines, curves,

    etc.) in images. More precisely, image segmentation is the process of assigning

    a label to every pixel in an image such that pixels with the same label share

    certain visual characteristics.

    The result of image segmentation is a set of segments that collectively

    cover the entire image, or a set of contours extracted from the image (see edge

detection). Each of the pixels in a region is similar with respect to some

    characteristic or computed property, such as color, intensity, or texture. Adjacent

    regions are significantly different with respect to the same characteristic(s).
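The labeling view of segmentation can be made concrete with connected-component labeling, the simplest such assignment: every pixel of one 4-connected ink region receives the same label. A minimal breadth-first sketch:

```python
from collections import deque
import numpy as np

def label_components(binary):
    """4-connected component labeling of a binary image (1 = ink).

    Returns an int image: 0 for background, 1..n for the n regions.
    """
    labels = np.zeros_like(binary, dtype=int)
    current = 0
    h, w = binary.shape
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not labels[sy, sx]:
                current += 1                      # a new region starts here
                queue = deque([(sy, sx)])
                labels[sy, sx] = current
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] and not labels[ny, nx]):
                            labels[ny, nx] = current
                            queue.append((ny, nx))
    return labels

img = np.array([[1, 1, 0, 0],
                [0, 1, 0, 1],
                [0, 0, 0, 1]])
out = label_components(img)
print(out.max())   # 2 regions
```

Richer criteria (color, intensity, texture) replace the simple "is ink" test, but the labeling structure stays the same.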

    1.4.6.3 RECOGNITION

    Often abbreviated OCR, optical character recognition refers to the branch

    of computer science that involves reading text from paper and translating the

    images into a form that the computer can manipulate (for example, into ASCII

    codes). An OCR system enables you to take a book or a magazine article, feed it

    directly into an electronic computer file, and then edit the file using a word

    processor.

All OCR systems include an optical scanner for reading text, and sophisticated software for analyzing images. Most OCR systems use a

    combination of hardware (specialized circuit boards) and software to recognize

    characters, although some inexpensive systems do it entirely through software.

Advanced OCR systems can read text in a large variety of fonts, but they still have

    difficulty with handwritten text.


    The potential of OCR systems is enormous because they enable users to

    harness the power of computers to access printed documents. OCR is already

    being used widely in the legal profession, where searches that once required

    hours or days can now be accomplished in a few seconds.

    1.4.6.4 POST PROCESSING

In most OCR systems, an independent character recognition engine is often

used to recognize each segmented part of an image, where only the shape and

structure of the character are considered. In order to improve the recognition

    accuracy rate, it is necessary in post-processing to use language knowledge,

    which introduces context information, to correct the image recognition results.

    Post-processing approaches based on language knowledge include using a

    lexicon or some syntax and semantic rules to correct the spelling of words, and

using some statistical language models (SLM) to select the best sequence

    from the candidate characters given by the OCR engine. Because of the

    complexity of language, all kinds of language knowledge sometimes are used

    together to obtain better performance.

    An OCR engine outputs not only candidate characters, but also candidate

    distance information of each candidate character, which is also important in OCR

    post-processing. Currently, candidate distance is usually transformed to reliability

    of the corresponding candidate character to be utilized. Generally speaking, the

    bigger the reliability of a candidate character, the smaller the corresponding

candidate distance. In the early period, the reliability was calculated by using some

    empirical formulas. Afterwards, a statistical approach was proposed, which

    calculates the reliability according to the distribution of candidate characters and

    correct characters with different candidate distances. It reflects some statistical

    characteristics, and its complexity is low, therefore it achieves good results in

    some applications. However, the use of candidate distance is still limited in OCR

    post-processing.
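As an illustration, the transformation from candidate distances to reliabilities can be sketched as follows. The softmax over negated distances used here is one simple choice made for illustration, not the formula of any particular OCR engine:

```python
import math

def reliabilities(candidate_distances):
    # Smaller candidate distance -> larger reliability; a softmax over the
    # negated distances yields values that sum to 1.
    exps = [math.exp(-d) for d in candidate_distances]
    total = sum(exps)
    return [e / total for e in exps]
```

The resulting values can then be combined with a statistical language model to re-rank the candidate characters.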


    1.4.7 STEPS INVOLVED IN OCR

    Optical character recognition is the recognition of printed or written text by

    a computer. This involves photo scanning of the text, which converts the paper

    document into an image, and then translation of the text image into character

    codes such as ASCII. Any OCR implementation consists of a number of

preprocessing steps followed by the actual recognition. The number and types of

    preprocessing algorithms employed on the scanned image depend on many

    factors such as age of the document, paper quality, resolution of the scanned

image, the amount of skew in the image, and the format and layout of the images and
text. Typical preprocessing includes the following stages:

    Binarization,

    Noise removing,

    Thinning,

    Skew detection and correction,

    Line segmentation,

    Word segmentation, and Character segmentation

    Recognition consists of

    Feature extraction,

    Feature selection, and

    Classification


    Fig 6 Steps in an OCR

1.4.7.1 Binarization

Binarization is a technique by which gray scale images are converted to binary images. In any image analysis or enhancement problem, it is very

    essential to identify the objects of interest from the rest. Binarization separates

    the foreground (text) and background information. The most common method for

    binarization is to select a proper threshold for the intensity of the image and then

    convert all the intensity values above the threshold to one intensity value (for


example, white), and all intensity values below the threshold to the other

    chosen intensity (black).
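A minimal sketch of this global thresholding, assuming a gray-scale image stored as a NumPy array (the threshold value 128 is an arbitrary example):

```python
import numpy as np

def binarize(gray, threshold=128):
    # Intensities above the threshold become white (255); the rest black (0).
    return np.where(gray > threshold, 255, 0).astype(np.uint8)
```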

Binarization can be performed either globally or locally. Global methods apply one threshold value to the entire image. Local or adaptive

    thresholding methods apply different intensity values to different regions of the

    image. These threshold values are determined by the neighborhood of the pixel

    to which the thresholding is being applied. Several binarization techniques are

    discussed in (Anuradha & Koteswarrao 2006). Figure 7(a) shows the scanned

    image of a paper document printed in Telugu, a south Indian language. Figure

    7(b) is the same image after binarization in which the text pixels are separated

    from the background.

    (A)

    (B)

    Fig 7 (A) Original image, (B) Binarized image


1.4.7.2 Noise Removal

    Scanned documents often contain noise that arises due to printer,

    scanner, print quality, age of the document, etc. Therefore, it is necessary to filter

    this noise before we process the image. The commonly used approach is to low-

    pass filter the image and to use it for later processing. The objective in the design

    of a filter to reduce noise is that it should remove as much of the noise as

    possible while retaining the entire signal (Rangachar et al 2002).

1.4.7.3 Thinning

Thinning, or skeletonization, is a process by which a one-pixel-width

    representation (or the skeleton) of an object is obtained, by preserving the

    connectedness of the object and its end points. The purpose of thinning is to

    reduce the image components to their essential information so that further

    analysis and recognition are facilitated. This enables easier subsequent detection

    of pertinent features. Figure 8 shows an image before and after thinning. A

number of thinning algorithms have been proposed and are being used. The most common algorithm used is the classical Hilditch algorithm and its variants.

Fig 8 A character image (left) before thinning, and (right) after thinning


1.4.7.4 Skew detection and correction

    When a document is fed to the scanner either mechanically or by a human

    operator, a few degrees of skew (tilt) are unavoidable. Skew angle is the angle

    that the lines of text in the digital image make with the horizontal direction. Figure

    9(a) shows an image with skew.

    Fig 9 An image (a) with skew, (b) without skew, and its horizontal profiles


    There exist many techniques for skew estimation. One skew estimation

    technique is based on the projection profile of the document; another class of

    approach is based on nearest neighbor clustering of connected components.

    Techniques based on the Hough transform and Fourier transform are also

    employed for skew estimation. A popular method for skew detection employs the

    projection profile. A horizontal projection profile is a one-dimensional array where

    each element denotes the number of black pixels along a row in the image.

When the page is scanned horizontally, the horizontal projection profile has peaks whose widths

    are equal to the character height and valleys whose widths are equal to the

    spacing between lines. At the correct skew angle, since scan lines are aligned to

    text lines, the projection profile has maximum height peaks for text and valleys

    for line spacing. In the image of figure 9(a), its horizontal projection profile can be

    seen with no clear valleys due to the presence of skew. Figure 9(b) is an image

    in which the skew is removed. The peaks and valleys in the projection profile can

    be clearly seen.
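The profile-based skew estimation described above can be sketched as follows, assuming a binary image whose foreground pixels equal 1. The variance of the profile serves as the sharpness criterion (an illustrative choice; peak-height criteria work similarly):

```python
import numpy as np

def profile_at_angle(binary, angle_deg):
    # Project each foreground pixel onto the vertical axis along the
    # candidate text direction and histogram the result.
    ys, xs = np.nonzero(binary)
    proj = ys - xs * np.tan(np.radians(angle_deg))
    hist, _ = np.histogram(proj, bins=binary.shape[0])
    return hist

def estimate_skew(binary, angles=np.arange(-5.0, 5.25, 0.25)):
    # At the correct skew angle scan lines align with text lines, so the
    # profile has the sharpest peaks and valleys, i.e. the largest variance.
    return max(angles, key=lambda a: profile_at_angle(binary, a).var())
```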

1.4.7.5 Line, word, and character segmentation

    After the tilt is corrected, the text has to be segmented first into lines; each

line then into words, and finally each word has to be segmented into its
constituent characters. Horizontal projection of a document image is most

    commonly employed to extract the lines from the document. If the lines are well

    separated, and are not tilted, the horizontal projection will have separated peaks

    and valleys, as shown in figure 9(b), which serve as the separators of the text

lines. Figure 10 shows an image consisting of 3 text lines (left), and the 3

    segmented lines (right), using horizontal projection profiles.

    Similarly a vertical projection profile gives the column sums. One can

    separate lines by looking for minima in horizontal projection profile of the page

    and then separate words by looking at minima in vertical projection profile of a


    single line. Figure 11(a) shows a line consisting of 4 words, along with vertical

    projection profiles, and figure 11(b) shows the 4 words, after segmentation. In

    Figure 11(c), a word is shown segmented into its constituting 3 characters.

    Overlapping, adjacent characters in a word (called kerned characters) cannot be

    segmented using zero-valued valleys in the vertical projection profile.
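The valley-based line and word segmentation described above can be sketched as follows, assuming a binary image with text pixels equal to 1; lines and words are taken as the zero-separated runs of the horizontal and vertical profiles respectively:

```python
import numpy as np

def segment_runs(profile):
    # (start, end) index pairs of the non-zero runs of a projection profile.
    runs, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_lines_and_words(binary):
    # Lines from the horizontal profile, then words from each line strip's
    # vertical profile.
    lines = segment_runs(binary.sum(axis=1))
    return [(top, bottom, segment_runs(binary[top:bottom].sum(axis=0)))
            for top, bottom in lines]
```

As noted in the text, this zero-valley criterion fails on kerned characters, whose vertical profiles never reach zero.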

Special techniques have to be employed to solve this problem.

Feature extraction and selection

Feature extraction can be considered as finding a set of

    parameters (features) that define the shape of the underlying character as

    precisely and uniquely as possible. The features have to be selected in such a

    way that they help in discriminating between characters. Thinned data is

    analyzed to detect features such as straight lines, curves, and significant points

    along the curves.

    Fig 10 Line segmentation


    Fig 11 (a). A line segment, (b). Word segmentation, (c). Character

    segmentation

1.4.7.6 Techniques used

Any OCR system contains more or less the same steps described below. The
exact number and techniques differ slightly from one language to another. We now

    present the studies in different OCRs, along with a detailed description of the

    methods used in them. Recognition of isolated and continuous printed multi font

    Bengali characters is reported in the work by Mahmud et al (2003).

This is based on Freeman chain-code features, which are explained as

    follows. When objects are described by their skeletons or contours, they can be

    represented by chain coding, where the ON pixels are represented as sequences

of connected neighbors along lines and curves. Instead of storing the absolute location of each ON pixel, the direction from its previously coded neighbor is

    stored.

The chain codes from the center pixel are 0 for East, 1 for North-East, and so

    on. This is represented pictorially in figure 12(a) and (b). Chain code gives the


    boundary of the character image; slope distribution of chain code implies the

    curvature properties of the character. In this work, connected components from

each character are divided into four regions with the center of mass as the

    origin. Slope distribution of chain code, in these four regions is used as local

    feature. Using chain code representation, classification is done by a feed forward

    neural network.
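The chain-code representation itself can be sketched as follows (a simplified illustration; the slope distributions over four regions and the neural classifier of the cited work are omitted):

```python
# Freeman 8-direction codes: 0 = East, 1 = North-East, 2 = North, ...,
# counter-clockwise.  In image coordinates x grows rightward and y grows
# downward, so "North" is dy = -1.
DIRECTIONS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
              (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code(points):
    # Store, for each boundary point, the direction from its predecessor
    # instead of its absolute location.
    return [DIRECTIONS[(x1 - x0, y1 - y0)]
            for (x0, y0), (x1, y1) in zip(points, points[1:])]
```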

    Testing on three types of fonts with accuracy of approximately 98% for

    isolated characters and 96% for continuous characters is reported. Ray &

    Chatterjee (1984) presented a recognition system based on a nearest neighbor

classifier employing features extracted by using a string connectivity criterion. A
complete OCR for printed Bangla is reported in the work by Chaudhuri & Pal

    (1998), in which a combination of template and feature-matching approach is

    used.

    A histogram-based thresholding approach is used to convert the image

    into binary images. For a clear document the histogram shows two prominent

    peaks corresponding to white and black regions. The threshold value is chosen

as the midpoint of the two histogram peaks. The skew angle is determined from the skew of the headline.

    Text lines are partitioned into three zones and the horizontal and vertical

    projection profiles are used to segment the text into lines, words, and characters.

    Primary grouping of characters into the basic, modified and compound

    characters is made before the actual classification. A few stroke features are

    used for this purpose along with a tree classifier where the decision at each node

    of the tree is taken on the basis of presence/absence of a particular feature.


The compound character recognition is done in two stages:

    1) In the first stage the characters are grouped into small sub-sets by the

    above tree classifier.

    2) At the second stage, characters in each group are recognized by a

    run-based template matching approach. Some character level statistics like

    individual character occurrence frequency, bigram and trigram statistics etc. are

utilized to aid the recognition process. For single-font, clear documents, 99.10%

    character level recognition accuracy is reported.

    Fig 12 Chain code and graphical representations


    CHAPTER - 2


    BACKGROUND WORK

2.1.1 Projection-based Methods

    Projection-profiles are commonly used for printed document

    segmentation. This technique can also be adapted to handwritten documents

    with little overlap. The vertical projection profile is obtained by summing pixel

    values along the horizontal axis for each y value. From the vertical profile, the

    gaps between the text lines in the vertical direction can be observed (Fig. 13).

Profile(y) = Σ_{x=1..M} f(x, y)

    The vertical profile is not sensitive to writing fragmentation. Variants for

    obtaining a profile curve may consist in projecting black/white transitions such as

    in number of connected components, rather than pixels. The profile curve can be

    smoothed, e.g. by a Gaussian or median filter to eliminate local maxima. The

    profile curve is then analysed to find its maxima and minima.
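The smoothing and extrema analysis can be sketched as follows, using a median filter (the window size is an arbitrary example):

```python
import statistics

def smooth_profile(profile, window=5):
    # Median filtering eliminates spurious local extrema caused by writing
    # fragmentation.
    half = window // 2
    return [statistics.median(profile[max(0, i - half):i + half + 1])
            for i in range(len(profile))]

def local_minima(profile):
    # Interior indices where the profile dips: candidate line separators.
    return [i for i in range(1, len(profile) - 1)
            if profile[i - 1] > profile[i] <= profile[i + 1]]
```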

There are two drawbacks: short lines will provide low peaks, and very narrow lines, as

    Short lines will provide low peaks, and very narrow lines, as

    well as those including many overlapping components will not produce significant

    peaks. In case of skew or moderate fluctuations of the text lines, the image may

    be divided into vertical strips and profiles sought inside each strip. These

    piecewise projections are thus a means of adapting to local fluctuations within a

    more global scheme.

In one approach, the global orientation (skew angle) of a handwritten page is first

    searched by applying a Hough transform on the entire image. Once this skew

    angle is obtained, projections are achieved along this angle. The number of

maxima of the profile gives the number of lines. Low maxima are discarded based on
their value, which is compared to the highest maximum. Lines are delimited by

    strips, searching for the minima of projection profiles around each maximum.


    This technique has been tested on a set of 200 pages within a word

    segmentation task.

In another approach, each minimum of the profile curve is a potential

    segmentation point. Potential points are then scored according to their distance

    to adjacent segmentation points. The reference distance is obtained from the

    histogram of distances between adjacent potential segmentation points. The

    highest scored segmentation point is used as an anchor to derive the remaining

    ones. The method is applied to printed records of the Second World War which

    have regularly spaced text lines. The logical structure is used to derive the text

    regions where the names of interest can be found.

    Fig. 13 Vertical projection-profile extracted on an autograph of Jean-Paul

    Sartre.

The RXY cuts method applied in He and Downton uses alternating

    projections along the X and the Y axis. This results in a hierarchical tree

    structure. Cuts are found within white spaces. Thresholds are necessary to

    derive inter-line or inter-block distances. This method can be applied to printed

    documents (which are assumed to have these regular distances) or well

    separated handwritten lines.


    2.1.2 Smearing Methods

    For printed and binarized documents, smearing methods such as the Run-

    Length Smoothing Algorithm can be applied. Consecutive black pixels along the

    horizontal direction are smeared: i.e. the white space between them is filled with

    black pixels if their distance is within a predefined threshold. The bounding boxes

    of the connected components in the smeared image enclose text lines.
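A one-row sketch of the Run-Length Smoothing Algorithm, where the threshold is the predefined maximum gap length; applying it to every row and then taking connected components yields the text-line boxes:

```python
def rlsa_row(row, threshold):
    # Fill white gaps (0s) between black pixels (1s) whose length does not
    # exceed the threshold.
    out = row[:]
    gap_start, seen_black = None, False
    for i, v in enumerate(row):
        if v == 1:
            if seen_black and gap_start is not None and i - gap_start <= threshold:
                out[gap_start:i] = [1] * (i - gap_start)
            seen_black, gap_start = True, None
        elif gap_start is None:
            gap_start = i
    return out
```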

    A variant of this method adapted to gray level images and applied to

    printed books from the sixteenth century consists in accumulating the image

    gradient along the horizontal direction. This method has been adapted to old

printed documents within the Debora project. For this purpose, numerous adjustments in the method concern the tolerance for character alignment and line

    justification.

    Text line patterns are found in the work of Shi and Govindaraju by building

    a fuzzy run length matrix. At each pixel, the fuzzy run-length is the maximal

    extent of the background along the horizontal direction. Some foreground pixels

    may be skipped if their number does not exceed a predefined value. This matrix

is thresholded to make pieces of text lines appear without ascenders and

    descenders (Fig. 14). Parameters have to be accurately and dynamically tuned.

    2.1.3 Grouping methods

    These methods consist in building alignments by aggregating units in a

    bottom-up strategy. The units may be pixels or of higher level, such as connected

    components, blocks or other features such as salient points. Units are then

    joined together to form alignments. The joining scheme relies on both local and

    global criteria, which are used for checking local and global consistency

    respectively.


    Fig 14 Text line patterns extracted from a letter of Georges Washington

(reprinted from Shi and Govindaraju). Foreground pixels have been smeared along the horizontal direction.

    Contrary to printed documents, a simple nearest-neighbor joining scheme

    would often fail to group complex handwritten units, as the nearest neighbor

    often belongs to another line. The joining criteria used in the methods described

    below are adapted to the type of the units and the characteristics of the

    documents under study.


But every method has to face the following issues:

    1) Initiating alignments: one or several seeds for each alignment.

2) Defining a unit's neighborhood for reaching the next unit. It is generally a
rectangular or angular area (Fig. 15).

    3) Solving conflicts: As one unit may belong to several alignments under

    construction, a choice has to be made: discard one alignment or keep

    both of them, cutting the unit into several parts.

    Hence, these methods include one or several quality measures which ensure

    that the text line under construction is of good quality. When comparing the

    quality measures of two alignments in conflict, the alignment of lower quality can

be discarded (Fig. 15). Also, during the grouping process, it is possible to choose

    between the different units that can be aggregated within the same neighborhood

    by evaluating the quality of each of the so-formed alignments.

Fig. 15 Angular and rectangular neighborhoods from point and rectangular units (left). Neighborhood defined by a cluster of units (upper right). Two alignments A and B in conflict: a quality measure will choose A and discard B (lower right).


    Quality measures generally include the strength of the alignment, i.e. the

    number of units included. Other quality elements may concern component size,

component spacing, or a measure of the alignment's straightness.

    Fig. 16 Text lines extracted on Church Registers

Likforman-Sulem and Faure have developed an iterative method based

    on perceptual grouping for forming alignments, which has been applied to

    handwritten pages, author drafts and historical documents. Anchors are detected

by selecting connected components elongated in specific directions (0°, 45°, 90°,
135°). Each of these anchors becomes the seed of an alignment. First, each

    anchor, then each alignment, is extended to the left and to the right.


    This extension uses three Gestalt criteria for grouping components:

    proximity, similarity and direction continuity. The threshold is iteratively

    incremented in order to group components within a broader neighborhood until

    no change occurs. Between each iteration, alignment quality is checked by a

    quality measure which gives higher rates to long alignments including anchors of

    the same direction. A penalty is given when the alignment includes anchors of

    different directions. Two alignments may cross each other, or overlap. A set of

    rules is applied to solve these conflicts taking into account the quality of each

    alignment and neighboring components of higher order (Fig. 16).

In the work of Feldbach and Tönnies, body baselines are searched in

    Church Registers images. These documents include lots of fluctuating and

overlapping lines. Baseline units are the minima points of the writing (obtained

    here from the skeleton). First basic line segments (BLS) are constructed, joining

    each minima point to its neighbors. This neighborhood is defined by an angular

region (±20°) for the first unit grouped, then by a rectangular region enclosing

    the points already joined for the remaining ones. Unwanted basic segments are

    found from minima points detected in descenders and ascenders.

    These segments may be isolated or in conflict with others. Various

heuristics are defined to eliminate alignments based on their size, or the local inter-line

    distance and on a quality measure which favors alignments whose units are in

    the same direction rather than nearer units but positioned lower or higher than

    the current direction. Conflicting alignments can be reconstructed depending on

    the topology of the conflicting alignments. The median line is searched from the

    baseline and from maxima points (Fig. 16). Pixels lying within a given baseline

    and median line are clustered in the corresponding text line, while ascenders and

    descenders are not segmented. Correct segmentation rates are reported

    between 90% and 97 % with adequate parameter adjustment. The seven

    documents tested range from the 17th to the 19th century.


    2.1.4 Methods based on the Hough transform

    The Hough transform is a very popular technique for finding straight lines

in images. In Likforman-Sulem et al., a method has been developed based on a hypothesis

    validation scheme. Potential alignments are hypothesized in the Hough domain

    and validated in the Image domain. Thus, no assumption is made about text line

    directions (several may exist within the same page). The centroids of the

    connected components are the units for the Hough transform. A set of aligned

units in the image along a line with parameters (ρ, θ) is included in the
corresponding cell (ρ, θ) of the Hough domain. Alignments including many units

    correspond to high peaked cells of the Hough domain. To take into account

    fluctuations of handwritten text lines, i.e. the fact that units within a text line are

    not perfectly aligned, two hypotheses are considered for each alignment and an

alignment is formed from units of the cell structure of a primary cell.

Fig. 17 Hypothesized cells (ρ0, θ0) and (ρ1, θ1) in Hough space. Each peak

    corresponds to perfectly aligned units. An alignment is composed of units

    belonging to a cluster of cells (the cell structure) around a primary cell.


A cell structure of a cell (ρ, θ) includes all the cells lying in a cluster
centered on (ρ, θ). Consider the cell (ρ0, θ0) having the greatest count of units. A
second hypothesis (ρ1, θ1) is searched in the cell structure of (ρ0, θ0). The

    alignment chosen between these two hypotheses is the strongest one, i.e. the

    one which includes the highest number of units in its cell structure. And the

corresponding cell, (ρ0, θ0) or (ρ1, θ1), is the primary cell (Fig. 17). However,

    actual text lines rarely correspond to alignments with the highest number of units

since crossing alignments (from top to bottom for writing in the horizontal direction)
may contain more units than actual text lines.

A potential alignment is validated (or invalidated) using contextual information, i.e. considering its internal and external neighbors. An internal
neighbor of a unit j is a neighbor within the same Hough alignment. An external
neighbor is a neighbor outside the Hough alignment which lies within a circle of radius δj from unit
j. The distance δj is the average of the internal-neighbor distances from unit
j. To be validated, a potential alignment must contain fewer external units than

    internal ones. This enables the rejection of alignments which have no perceptual

    relevance. This method can extract oriented text lines and sloped annotations

    under the assumption that such lines are almost straight (Fig. 18).
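The accumulation step of the Hough transform can be sketched as follows, assuming component centroids as (x, y) points and the usual normal parameterization ρ = x cos θ + y sin θ; the cell-structure clustering and the validation step described above are omitted:

```python
import numpy as np

def hough_accumulate(centroids, n_theta=180, n_rho=200):
    # Each centroid votes, for every theta, in the (rho, theta) cell it falls
    # into; high-count cells are candidate text-line alignments.
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    pts = np.asarray(centroids, dtype=float)
    rho_max = np.hypot(np.abs(pts[:, 0]).max(), np.abs(pts[:, 1]).max()) + 1.0
    acc = np.zeros((n_rho, n_theta), dtype=int)
    for x, y in pts:
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        bins = ((rhos + rho_max) / (2.0 * rho_max) * n_rho).astype(int)
        acc[bins, np.arange(n_theta)] += 1
    return acc, thetas
```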

    The Hough transform can also be applied to fluctuating lines of

    handwritten drafts such as in Pu and Shi . The Hough transform is first applied to

    minima points (units) in a vertical strip on the left of the image. The alignments in

    the Hough domain are searched starting from a main direction, by grouping cells

    in an exhaustive search in 6 directions. Then a moving window, associated with a

    clustering scheme in the image domain, assigns the remaining units to

    alignments. The clustering scheme (Natural Learning Algorithm) allows the

    creation of new lines starting in the middle of the page.


    Fig. 18 Text lines extracted on an autograph of Miguel Angel Asturias. The

    orientations of traced lines correspond to those of the primary cells found

    in Hough space.


    2.1.5 Repulsive-Attractive network method

    An approach based on attractive-repulsive forces is presented in Oztop et

    al. It works directly on grey-level images and consists in iteratively adapting the

    y-position of a predefined number of baseline units. Baselines are constructed

    one by one from the top of the image to bottom. Pixels of the image act as

    attractive forces for baselines and already extracted baselines act as repulsive

    forces. The baseline to extract is initialized just under the previously examined

    one, in order to be repelled by it and attracted by the pixels of the line below (the

    first one is initialized in the blank space at top of the document). The lines must

have similar lengths. The result is a set of pseudo-baselines, each one passing through word bodies (Fig. 19). The method is applied to ancient Ottoman

    document archives and Latin texts.

    Fig. 19 Pseudo baselines extracted by a Repulsive-Attractive network on

    an Ancient Ottoman text (reprinted from Oztop et al).


    2.1.6 Stochastic method

We present here a method based on a probabilistic Viterbi algorithm

    (Tseng and Lee), which derives non-linear paths between overlapping text lines.

    Although this method has been applied to modern Chinese handwritten

    documents, this principle could be enlarged to historical documents which often

    include overlapping lines. Lines are extracted through hidden Markov modeling.

    The image is first divided into little cells (depending on stroke width), each one

    corresponding to a state of the HMM (Hidden Markov Model). The best

    segmentation paths are searched from left to right; they correspond to paths

    which do not cross lots of black points and which are as straight as possible.

    However, the displacement in the graph is limited to immediately superior

or inferior cells. All the best paths ending at each y location of the image are

    considered first. Elimination of some of these paths uses a quality threshold T: a

    path whose probability is less than T is discarded. Shifted paths are easily

    eliminated (and close paths are removed on quality criteria). The method

    succeeds when the ground truth path between text lines is slightly changing

    along the y-direction (Fig. 20). In the case of touching components, the path of

highest probability will cross the touching component at points with as few black pixels as possible. But the method may fail if the contact point contains a lot of

    black pixels.
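The path search can be sketched as a simple dynamic program over a cost grid, a simplified stand-in for the Viterbi formulation: the cell cost stands for the black-pixel count, and a bend penalty keeps the path as straight as possible. The function name and penalty value are illustrative assumptions:

```python
def best_path(cost, bend_penalty=1.0):
    # Minimum-cost left-to-right path; from one column to the next the row
    # may change by at most one (immediately superior or inferior cell).
    rows, cols = len(cost), len(cost[0])
    INF = float("inf")
    dp = [[INF] * cols for _ in range(rows)]
    back = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        dp[r][0] = cost[r][0]
    for c in range(1, cols):
        for r in range(rows):
            for dr in (-1, 0, 1):
                pr = r + dr
                if 0 <= pr < rows:
                    cand = dp[pr][c - 1] + cost[r][c] + (bend_penalty if dr else 0.0)
                    if cand < dp[r][c]:
                        dp[r][c], back[r][c] = cand, pr
    # Backtrack from the cheapest end cell in the last column.
    r = min(range(rows), key=lambda i: dp[i][cols - 1])
    path = [r]
    for c in range(cols - 1, 0, -1):
        r = back[r][c]
        path.append(r)
    return path[::-1]
```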

Fig. 20 Segmentation paths obtained by a stochastic method

2.1.7 Water reservoir principle


    The water reservoir principle is as follows. If water is poured from the top

    (bottom) of a component, the cavity regions of the component where water is

    stored are considered the top (bottom) reservoirs (Pal et al 2003). Here, two

    Oriya characters touch and create a large space which represents the bottom

    reservoir. This large space is very useful for touching character detection and

    segmentation. Owing to the shape of Oriya characters a small top reservoir is

    also generated due to touching (see figure 21).

    This small top reservoir also helps in touching character detection and

    segmentation. All reservoirs are not considered for future processing. Reservoirs

    having heights greater than a threshold T1 are selected for future use. For a

    component the value ofT1 is chosen as 1/9 times the component height. (The

    threshold is determined from experiment.) We now discuss here some terms

    relating to water reservoirs that will be used in feature extraction.

Top reservoir: By top reservoir of a component, we mean the reservoir obtained when water is poured from the top of the component.

Bottom reservoir: By bottom reservoir of a component, we mean the reservoir obtained when water is filled from the bottom of the component. A bottom reservoir of a component is visualized as a top reservoir when water is poured from the top after rotating the component by 180°.

Left (right) reservoir: If water is poured from the left (right) side of a component, the cavity regions of the component where water is stored are considered the left (right) reservoirs. A left (right) reservoir of a component is visualized as a top reservoir when water is poured from the top after rotating the component by 90° clockwise (anti-clockwise).

    Water reservoir area: By area of a reservoir we mean the area of the

    cavity region where water can be stored if water is poured from a particular side

    of the component. The number of pixels inside a reservoir is computed and this

    number is considered the area of the reservoir. Water flow level: The level from

    which water overflows from a reservoir is called the water flow level of the

    reservoir (see figure 22).


Reservoir baseline: A line passing through the deepest point of a reservoir and parallel to the water flow level of the reservoir is called the reservoir baseline (see figure 21).

Height of a reservoir: By height of a reservoir, we mean the depth of water in the reservoir. In other words, the height of a reservoir is the normal distance between the reservoir baseline and the water flow level of the reservoir. In figure 22, H denotes the reservoir height.

Width of a reservoir: By width of a reservoir, we mean the normal distance between the two extreme boundaries (perpendicular to the baseline) of a reservoir.
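The reservoir definitions above can be made concrete with a small sketch. Treating a component as a binary matrix (1 = black pixel), water poured from the top settles exactly as in the classic trapped-rain-water computation on the component's top profile, and a bottom reservoir is the top reservoir of the component rotated by 180°. This is a minimal illustration under those assumptions; the function names are not from the original method:

```python
def top_profile(component):
    """Row index of the first black pixel in each column (number of rows
    if the column is empty, i.e. a maximally deep column)."""
    rows, cols = len(component), len(component[0])
    return [next((r for r in range(rows) if component[r][c]), rows)
            for c in range(cols)]

def top_reservoir_area(component):
    """Water trapped when poured from the top. We work in 'depth'
    coordinates (larger value = lower surface), so the water level over a
    column is the lower of the tallest walls on its two sides."""
    depth = top_profile(component)
    area = 0
    for c in range(len(depth)):
        left = min(depth[:c + 1])   # tallest wall to the left (incl. self)
        right = min(depth[c:])      # tallest wall to the right (incl. self)
        level = max(left, right)    # water settles at the lower wall
        area += max(0, depth[c] - level)
    return area

def bottom_reservoir_area(component):
    """Bottom reservoir = top reservoir of the component rotated by 180 degrees."""
    rotated = [row[::-1] for row in component[::-1]]
    return top_reservoir_area(rotated)
```

For a U-shaped component the cavity is a top reservoir; for an n-shaped one it is a bottom reservoir. The reservoir height of the definition would be the maximum per-column water depth rather than the summed area.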

    Fig 21 Examples of big reservoirs created by touching (because of the

    touching of two characters a big bottom reservoir is formed here).


    Figure 22 Illustration of different features obtained from water

    reservoir principle. H denotes the height of bottom reservoir. Gray area of

    the zoomed portion represents reservoir base area.

In each selected reservoir we compute its base-area points. By base-area points of a reservoir, we mean those border points of the reservoir whose height from the baseline of the reservoir is less than 2RL. Base-area points for a component are shown in the zoomed-in portion of figure 22. Here RL is the length of the most frequently occurring black runs of a component; in other words, RL is the statistical mode of the black run lengths of a component. The value of RL is calculated as follows. The component is scanned both horizontally and vertically. If for a component we get n different run lengths r1, r2, . . ., rn with frequencies f1, f2, . . ., fn respectively, then RL = ri, where fi = max(fj), j = 1 . . . n.
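The computation of RL described above can be sketched as follows; the component is again assumed to be a binary matrix of 0/1 pixels, and the helper names are invented for illustration:

```python
from collections import Counter

def black_run_lengths(component):
    """Lengths of all black runs found by scanning the component both
    horizontally (rows) and vertically (columns)."""
    rows, cols = len(component), len(component[0])
    scans = [component[r] for r in range(rows)]
    scans += [[component[r][c] for r in range(rows)] for c in range(cols)]
    runs = []
    for line in scans:
        run = 0
        for px in list(line) + [0]:   # trailing 0 terminates a final run
            if px:
                run += 1
            elif run:
                runs.append(run)
                run = 0
    return runs

def run_length_mode(component):
    """RL: the statistical mode of the black run lengths."""
    return Counter(black_run_lengths(component)).most_common(1)[0][0]
```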


    3.1 PROCESSING OF OVERLAPPING COMPONENTS

Overlapping components are the main challenge for text line extraction, since no white space is left between lines. Some of the methods surveyed above do not need to detect such components, either because they extract only baselines, or because some criterion in the method itself makes paths avoid crossing black pixels. This section only deals with methods where ambiguous (overlapping) components are actually detected before, during or after text line segmentation. Criteria such as component size, the fact that the component belongs to several alignments, or on the contrary to no alignment, can be used for detecting ambiguous components.

Once a component is detected as ambiguous, it must be classified into one of three categories: an overlapping component which belongs to the upper alignment, an overlapping component which belongs to the lower alignment, or a touching component which has to be decomposed into several parts (two or more, as components may belong to three or more alignments in historical documents). Separation along the vertical direction is a hard problem which can be done roughly (horizontal cut), or more accurately by analysing stroke contours and referring to typical configurations (Fig. 23).

    Fig 23 Set of typical overlapping configurations


The grouping technique presented in the section on grouping methods detects an ambiguous component during

    the grouping process when a conflict occurs between two alignments. A set of

    rules is applied to label the component as overlapping or touching. The

    ambiguous component extends in each alignment region. The rules use as

    features the density of black pixels of the component in each alignment region,

    alignment proximity and contextual information (positions of both alignments

    around the component). An overlapping component will be assigned to only one

    alignment.

In the piece-wise method, the document page is first cut into eight equal columns. A

    projection-profile is performed on each column. In each histogram, two

    consecutive minima delimit a text block. In order to detect overlapping

    components, a k-means clustering scheme is used to classify the text blocks so

extracted into three classes: big, average and small. Overlapping components necessarily belong to big physical blocks, so all the overlapping cases are found in the big text block class, while all the one-line blocks are grouped in the average text block class. A second k-means clustering scheme finds the actual inter-line blocks; put together with the one-line block size, this determines the number of pieces a large text block must be cut into (cf. Fig. 24). The document is divided into vertical strips. Profile cuts within each strip are computed to obtain anchor

    points of segmentation (PSLs) which do not cross any black pixels. These points

    are grouped through strips by neighboring criteria.

    Fig 24 Text line segmentation


    If no segmentation point is present in the adjacent strip, the baseline is

    extended near the first black pixel encountered which belongs to an overlapping

    or touching component. This component is classified as overlapping or touching

    by analyzing its vertical extension (upper, lower) from each side of the

    intersection point. An empirical rule classifies the component. In the touching

    case, the component is horizontally cut at the intersection point (Fig. 25).

Fig 25 Overlapping component (circle) separated into two parts (rectangle) in Bangla writing.

    Some solutions for separation of units belonging to several text lines can

    be found also in the case of mail pieces and handwritten databases where efforts

have been made for recognition purposes. In one approach, separation is made

    from the skeleton of touching characters and the use of a dictionary of possible

    touching configurations (Fig. 23). In Bruzzone and Coffetti, the contact point

    between ambiguous strokes is detected and processed from their external

    border.


    An accurate analysis of the contour near the contact point is performed in

    order to separate the strokes according to two registered configurations: a loop in

    contact with a stroke, or two loops in contact. In simple cases of handwritten

    pages the center of gravity of the connected component is used either to

    associate the component to the current line or to the following line, or to cut the

    component into two parts. This works well if the component is a single character.

    It may fail if the component is a word, or part of a word, or even several words.
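The centre-of-gravity heuristic mentioned here can be sketched in a few lines; the parameter names and string labels are illustrative assumptions, not from the surveyed work:

```python
def assign_by_centroid(black_ys, upper_line_y, lower_line_y):
    """Assign an ambiguous connected component to the nearer of two text
    lines using the vertical centre of gravity of its black pixels.
    black_ys: row coordinates of the component's black pixels;
    upper_line_y / lower_line_y: vertical positions of the two lines."""
    cy = sum(black_ys) / len(black_ys)
    if abs(cy - upper_line_y) <= abs(cy - lower_line_y):
        return "upper"
    return "lower"
```

As the text notes, this works well for a single character but can fail when the component spans a word or several words, since the centroid of a long component says little about where it should be cut.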


    CHAPTER - 3


    3.2 PROPOSED METHOD

    PIECE-WISE PROJECTION METHOD

    The global horizontal projection method computes the sum of all black

    pixels on every row and constructs the corresponding histogram. Based on the

    peak/valley points of the histogram, individual lines are generally segmented.

Although this global horizontal projection method is applicable for line segmentation of printed documents, it cannot be used for unconstrained handwritten documents because the characters of two consecutive text lines

    may touch or overlap. For example, see the 4th and 5th text lines of the

    document shown in figure 26 a.

    Figure 26 (a) N-stripes and PSL lines in each stripe are shown for a sample

    of handwritten text. (b) Potential PSLs of figure 26 (a) are shown.

Here these two lines mostly overlap. To handle unconstrained handwritten documents, we use a piece-wise projection method as described below. First, we divide the text into vertical stripes of width W

    (here we assume that a document page is in portrait mode). Width of the last


stripe may differ from W. If the text width is Z and the number of stripes is N, the width of the last stripe is Z − W(N − 1).

    Computation of W is discussed later. Next, we compute piece-wise

    separating lines (PSL) from each of these stripes. We compute the row-wise sum

    of all black pixels of a stripe. The row where this sum is zero is a PSL. We may

    get a few consecutive rows where the sum of all black pixels is zero. Then the

    first row of such consecutive rows is the PSL. The PSLs of different stripes of a

    text are shown in figure 26 a by horizontal lines. All these PSLs may not be

    useful for line segmentation. We choose some potential PSLs as follows. We

    compute the normal distances between two consecutive PSLs in a stripe. So if

there are n PSLs in a stripe we get n − 1 distances.

This is done for all stripes. We compute the statistical mode (MPSL) of such distances. If the distance between any two consecutive PSLs of a stripe is less than MPSL, we remove the upper of these two PSLs. The PSLs obtained after this removal are the potential PSLs. The potential PSLs obtained from the PSLs of figure 26 a are shown in figure 26 b. We note the left and right co-ordinates of each potential PSL for future use. By properly joining these potential PSLs, we get individual text lines. It may be noted that sometimes, because of overlapping or touching of a component of the upper line with a component of the lower line, we may not get PSLs in some regions. Also, because of some modified characters of Telugu, we find some extra PSLs in a stripe. We take care of them during PSL joining, as explained next. Joining of PSLs is done in two steps.

    In the first step, we join PSLs from right to left and, in the second step, we first

    check whether line-wise PSL joining is complete or not. If for a line it is not

    complete, joining from left to right is done to obtain complete segmentation. We

    say PSLs joining of a line is complete if the length of the joined PSLs is equal to

    the column (width) of the document image. This two-step approach is done to get

    good results even if two consecutive text lines are overlapping or connected.
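The PSL extraction and potential-PSL filtering described above might be sketched like this, assuming a binarized page as a nested list (1 = black pixel); the function names and the tie-breaking of the statistical mode are assumptions of this sketch:

```python
from collections import Counter

def psl_rows(page, x0, x1):
    """PSLs of the stripe spanning columns [x0, x1): the first row of each
    maximal run of rows whose black-pixel sum inside the stripe is zero."""
    psls, prev_blank = [], False
    for r, row in enumerate(page):
        blank = not any(row[x0:x1])
        if blank and not prev_blank:
            psls.append(r)
        prev_blank = blank
    return psls

def potential_psls(page, width):
    """Per-stripe PSLs after filtering: if two consecutive PSLs of a stripe
    are closer than the mode (MPSL) of all consecutive-PSL distances, the
    upper of the pair is removed."""
    cols = len(page[0])
    stripes = [psl_rows(page, x, min(x + width, cols))
               for x in range(0, cols, width)]
    gaps = [b - a for s in stripes for a, b in zip(s, s[1:])]
    if not gaps:
        return stripes
    m_psl = Counter(gaps).most_common(1)[0][0]
    filtered = []
    for s in stripes:
        out = list(s)
        i = 0
        while i + 1 < len(out):
            if out[i + 1] - out[i] < m_psl:
                del out[i]   # drop the upper PSL of the close pair
            else:
                i += 1
        filtered.append(out)
    return filtered
```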


To join a PSL of the ith stripe, say Ki, to a PSL of the (i − 1)th stripe, we check whether any PSL whose normal distance from Ki is less than MPSL exists in the (i − 1)th stripe. If it exists, we join the left co-ordinate of Ki with the right co-ordinate of that PSL in the (i − 1)th stripe. If it does not exist, we extend Ki horizontally in the left direction until it reaches the left boundary of the (i − 1)th stripe or intersects a black pixel of a component in the (i − 1)th stripe. If the extended part intersects a black pixel of a component of the (i − 1)th stripe, we decide whether the component belongs to the upper line or the lower line.

    Based on the belongingness of this component, we extend this line in such a way

    that the component falls in its actual line. Belongingness of a component is

    decided as follows.

    We compute the distances from the intersecting point to the topmost and

    bottommost point of the component. Let d1 be the top distance and d2 the

    bottom distance.

If d1 < d2 and d1 < (MPSL/2), then the component belongs to the lower line.

If d2 ≤ d1 and d2 < (MPSL/2), then the component belongs to the upper line.

If d1 > (MPSL/2) and d2 > (MPSL/2), then we assume the component touches another component of the lower line.
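The three belongingness rules can be written directly as a small classifier; d1, d2 and MPSL are as defined above, and the string labels are illustrative:

```python
def belongingness(d1, d2, m_psl):
    """Classify a component hit while extending a PSL. d1 is the distance
    from the intersection point to the component's topmost point, d2 to its
    bottommost point, and m_psl is the mode of consecutive-PSL distances."""
    if d1 < d2 and d1 < m_psl / 2:
        return "lower"     # intersection near the top: body lies below
    if d2 <= d1 and d2 < m_psl / 2:
        return "upper"     # intersection near the bottom: body lies above
    return "touching"      # extends far on both sides: must be split
```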

    If the component belongs to the upper-line (lower-line) then the line is

    extended following the contour of the lower part (upper part) of the component so

    that the component can be included in the upper line (lower line).

    The line extension is done until it reaches the left boundary of the (i 1)th

stripe. If the component is touching, we detect possible touching points based on the structural shape of the touching component. From experiments, we notice that in most touching cases there exist junction/crossing shapes, or some obstacle points of low black-pixel density in the middle portion of the touching component. These obstacle points and the junction/crossing shapes help to find the


    touching position. Extension of PSL is done through this touching point to segment the

    component into two parts.

Fig 27 Line-segmented result of the text shown in figure 26. Text line segmentation is shown by solid lines. (a) The two end points of a mis-segmented line XY are marked by circles. (b) The correct segmentation is shown.

Sometimes, because of some modified characters, we may get some wrongly segmented lines; for example, see the line marked XY (figure 27 a). To take care of such wrong lines, we compute the density of black pixels and compare this value with the candidate length of the line. (By candidate length of a line, we mean the distance between the leftmost column of the leftmost component and the rightmost column of the rightmost component of that line.)

Let L be the candidate length of a line. Now we scan each column of the portion of the line that belongs to the candidate length to check for the presence of black pixels. If a black pixel does not exist in at least 50% of the columns of that line, then the line is not a valid line and we delete the lower boundary of this line to merge it with its lower line. Thus a mis-segmented line like XY of figure 27 a is corrected. The corrected line segmentation result is shown in figure 27 b.

To get a size-independent measure, computation of W is done as follows. We compute the statistical mode (md) of the widths of the bottom reservoirs obtained from the text. This mode is generally equal to the character width. Since


the average number of characters in a word is four, the value of W is taken as 4md to make the stripe width close to the word width. We computed word-length statistics. The

proposed line segmentation method does not depend on the size and style of the

    handwriting. Even if the handwritten lines overlap, touch or are curved, the

    proposed scheme works. For word segmentation from a line, we compute vertical

    histograms of the line. In general, the distance between two consecutive words of

    a line is greater than the distance between two consecutive characters in a word.

    Taking the vertical histogram of the line and using the above distance criteria we

segment words from lines.
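A minimal sketch of this gap-based word segmentation: the vertical histogram of the line is scanned, and a run of at least min_gap empty columns is taken as an inter-word gap. The min_gap parameter is an assumption of this sketch; the text only states that inter-word gaps are wider than inter-character gaps:

```python
def word_spans(line_img, min_gap):
    """Split a text line (binary nested list, 1 = black) into word spans:
    (start_col, end_col) pairs separated by runs of >= min_gap empty
    columns in the vertical histogram."""
    cols = len(line_img[0])
    hist = [sum(row[c] for row in line_img) for c in range(cols)]
    spans, start, end, gap = [], None, None, 0
    for c, h in enumerate(hist):
        if h:
            if start is None:
                start = c
            end, gap = c, 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:       # gap wide enough: close the word
                spans.append((start, end))
                start = None
    if start is not None:
        spans.append((start, end))
    return spans
```

In practice min_gap would be derived from the histogram of inter-run gaps of the line, so that character gaps fall below it and word gaps above it.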


    3.2.1 Flowchart

    Fig 28 Flow chart of the algorithm

Divide the text into vertical stripes → Compute piece-wise separating lines (PSLs) → Choose the potential PSLs → Join these PSLs → Compute the belongingness of each intersected component


    3.2.2 ALGORITHM

In short, the line segmentation algorithm (LINE-SEGM) is as follows:

Algorithm LINE-SEGM

Step 1: Divide the text into vertical stripes of width W.

Step 2: Compute piece-wise separating lines (PSLs) from each of these stripes as discussed earlier.

Step 3: Compute potential PSLs from the PSLs obtained in step 2.

Step 4: Choose the rightmost top potential PSL and extend (from right to left) this PSL up to the previous stripe.

Step 5: Continue this PSL joining from right to left until we reach the left boundary of the left-most stripe.

Step 6: Check whether the length of the line drawn equals the width of the document. If yes, go to step 7. Else, PSL line extension is done to the right until we reach the right boundary of the document.

Step 7: Repeat steps 4 to 6 for the potential PSLs not considered for joining so far. If there are no more PSLs for joining, stop.


Let us see all these steps in detail.

Step 1: Divide the given text into a number of vertical stripes.

Fig 29 (a) Original text image, divided into stripes (1st stripe, 2nd stripe, . . ., nth stripe). (b) Output text image.


    Step 2: Compute the piece-wise separating lines.

    Fig 30 PSL of the text

    Compute the row-wise sum of all black pixels of a stripe. The row where

the sum is zero is a PSL. If there are a few consecutive rows where the black-pixel sum is zero, then the first row of such rows is the PSL.

    Step 3: Choose only potential PSLs

Fig 31 Potential PSLs of the text


    All the PSLs may not be useful for line segmentation, so choose

    some potential PSLs among these. Compute the normal distances between two

consecutive PSLs in a stripe. So if there are n PSLs we get n − 1 distances. This

    is done for all stripes. Compute the statistical mode Mpsl of such distances. If the

    distance between any two consecutive PSLs of a stripe is less than Mpsl then

    remove the upper PSL of these two PSLs. PSLs obtained after this removal are

    the potential PSLs.

    Step 4: Join the PSLs

    Fig 32 Joining the PSLs

Joining of PSLs is done in two steps:

i) In the first step, we join PSLs from right to left.

ii) Then we check whether line-wise PSL joining is complete or not. If for a line it is not complete, joining from left to right is done to obtain complete segmentation.


We say PSL joining of a line is complete if the length of the joined PSLs is equal to the column size (width) of the document image. This two-step approach is done to get good results even if two consecutive text lines are overlapping or connected.

    Step 5: Compute belongingness of the component

    Fig 33 Belongingness of component

If the extended part intersects a black pixel of any component, then compute the belongingness of the component. Compute the distances from the intersecting point to the topmost and bottommost points of the component. Let d1 be the top distance and d2 the bottom distance. If d1 < d2 and d1 < (MPSL/2), the component belongs to the lower line; if d2 ≤ d1 and d2 < (MPSL/2), it belongs to the upper line; otherwise the component is treated as touching.


The following figure is obtained after all the steps.

Fig 34 Complete line segmentation obtained after all the steps

    3.3 APPLICATIONS


    3.3.1 Practical Applications

    In recent years, OCR (Optical Character Recognition) technology has been

    applied throughout the entire spectrum of industries, revolutionizing the

    document management process. OCR has enabled scanned documents to

    become more than just image files, turning into fully searchable documents with

    text content that is recognized by computers. With the help of OCR, people no

    longer need to manually retype important documents when entering them into

    electronic databases. Instead, OCR extracts relevant information and enters it

    automatically. The result is accurate, efficient information processing in less time.

    3.3.2 Banking

    The uses of OCR vary across different fields. One widely known

    application is in banking, where OCR is used to process checks without human

    involvement. A check can be inserted into a machine, the writing on it is scanned

    instantly, and the correct amount of money is transferred. This technology has

    nearly been perfected for printed checks, and is fairly accurate for handwritten

    checks as well, though it occasionally requires manual confirmation. Overall, this

    reduces wait times in many banks.

    3.3.3 Legal

    In the legal industry, there has also been a significant movement to

    digitize paper documents. In order to save space and eliminate the need to sift

    through boxes of paper files, documents are being scanned and entered into

    computer databases. OCR further simplifies the process by making documents

text-searchable, so that they are easier to locate and work with once in the database. Legal professionals now have fast, easy access to a huge library of

    documents in electronic format, which they can find simply by typing in a few

    keywords.

    3.3.4 Healthcare


    Healthcare has also seen an increase in the use of OCR technology to

    process paperwork. Healthcare professionals always have to deal with large

    volumes of forms for each patient, including insurance forms as well as general

    health forms. To keep up with all of this information, it is useful to input relevant

    data into an electronic database that can be accessed as necessary. Form

    processing tools, powered by OCR, are able to extract information from forms

    and put it into databases, so that every patient's data is promptly recorded. As a

    result, healthcare providers can focus on delivering the best possible service to

    every patient.

    3.3.5 OCR in Other Industries

    OCR is widely used in many other fields, including education, finance, and

    government agencies. OCR has made countless texts available online, saving

    money for students and allowing knowledge to be shared. Invoice imaging

    applications are used in many businesses to keep track of financial records and

    prevent a backlog of payments from piling up. In government agencies and

    independent organizations, OCR simplifies data collection and analysis, among

    other processes. As the technology continues to develop, more and more

    applications are found for OCR technology, including increased use of

    handwriting recognition. Furthermore, other technologies related to OCR, such

as barcode recognition, are used daily in retail and other industries. Examples of commercial products include Maestro Recognition Server, CVISION's OCR toolkit, and Trapeze, an automated form-processing solution.

    3.3.6 Resume processing

    Several of the industry leaders in resume processing software use Prime

    OCR to generate high accuracy results. Some customers use the text results

    straight from Prime OCR while others choose to manually verify OCR results with

    Prime Verify for maximum accuracy. One of the largest resume processing


    facilities leverages Prime OCR's increased accuracy by providing recruiting

    customers the same accuracy of results without having to manually verify each

    resume. They take the results straight from Prime OCR and deliver them to the

customer, passing on the savings of processing large batches of resumes. What used to take days of sending offshore for OCR and manual verification can now be done overnight in a local facility, all with Prime OCR software.

    3.3.7 Library archives/Digital Library

    Digital library initiatives are adopting advanced OCR technology like Prime

    OCR to convert large book collections for on-line viewing of content. Not only is

    Prime OCR designed to generate accurate results but it can also provide a level

    of reliability that cannot be found in traditional desktop OCR software.

    A large university's project of converting large collections and providing

    the content on-line was improved with Prime OCR's unique ability to provide high

    accuracy results. The results were so impressive that all of the material that had

been previously processed was run through Prime OCR a second time to

    improve the ability to find textual information in the collection.

    3.3.8 Document identification

    An added option of Prime OCR allows for the software to accurately

    identify different types of documents. Using high a