
8/9/2019 Document Engineering Research Paper

K. Y. Wong
R. G. Casey
F. M. Wahl

Document Analysis System

This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing. Several critical functions have been investigated and the technical approaches are discussed. The first is the segmentation and classification of digitized printed documents into regions of text and images. A nonlinear, run-length smoothing algorithm has been used for this purpose. By using the regular features of text lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A preclassifier is constructed during the input process and used to speed up a well-known pattern-matching method for clustering characters from an arbitrary print source into a small sample of prototypes. Experimental results are included.

    Introduction

The decreasing cost of hardware will eventually make commonplace the storage and distribution of documents by electronic means. However, today most documents are being saved, distributed, and presented on paper. Paper is the primary medium for books, journals, newspapers, magazines, and business correspondence. This paper describes an experimental Document Analysis System, which assists a user in the encoding of printed documents for computer processing.

A document analysis system could be applied to extract information from printed documents to create data bases (e.g., patents, court decisions). Such systems could also create computer document files from papers submitted to journals and assist in revising books and manuals. In addition, a document analysis system could provide a general data compression facility, since encoding optically scanned text from a bit-map into symbol codes greatly reduces the cost of storing or transmitting the information.

Existing commercial systems cannot adequately convert unconstrained printed documents into a format suitable for computer processing. Data entry by operator keying is not only expensive and time-consuming but also precludes the encoding of graphics and images contained in the document. Current optical character recognition (OCR) machines only recognize certain preprogrammed fonts; they do not accept documents containing a mixture of text and images. The Document Analysis System thus provides a new capability for converting information stored on printed documents, including both images and text, to computer-processable form.

This paper is organized into three sections. The first section gives an overview of the system requirements. Not all the components described in the system have been designed and tested yet. The next two sections present the results of the technical investigations of some of the key components: the segmentation of a scanned document into regions of text and images, and the recognition of text. Initial experimental data are included to illustrate these two methods.

System requirements and overall structure

The design of an interactive computer system to read printed documents was investigated by Nagy [1]. Some of the system concepts used by him are reflected in the Document Analysis System, but the solutions in most areas are different.

© Copyright 1982 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

64   K. Y. WONG ET AL.   IBM J. RES. DEVELOP. VOL. 26 NO. 6 NOVEMBER 1982


Figure 1  System overview.

The system is designed to have the following capabilities:

• Manual editing of scanned documents (e.g., functions such as crop, copy, merge, erase, save, etc.).
• Separating a document into regions of text and images automatically.
• Representing text by standard symbols such as EBCDIC, ASCII, etc., possibly with the assistance of user interaction. Automatic character recognition will be provided for documents printed in certain conventional fonts.
• Checking the recognized text against a dictionary.
• Enhancing the images and converting graphics from bit-maps into vector format.
• Analyzing the page layout so that appropriate typographical commands can be generated automatically or interactively to reproduce the document with printer subsystems.
• Allowing the user to customize the system for his particular needs.


Figure 1 shows the modules of the proposed Document Analysis System. The shaded modules have been implemented or investigated in detail. The preliminary version of the system permits meaningful experiments but does not yet provide full interactive capability.

Documents are scanned as black/white images with a commercial laser scanner at 240 pixels per inch into a Series/1 minicomputer, which then transfers the data over a network to the host IBM 3033 computer. The display device is a Tektronix 618 storage tube attached to an IBM 3277 terminal, which communicates with the host. The system can automatically separate a document into regions of text, graphics, and halftone images. The technique for doing this is presented in the next section. Alternatively, the user can choose to perform the same function manually by interactively editing the image. The user can invoke the Image Manipulation Editor (called IEDIT), which reads and stores


images on disk and provides a variety of image-editing functions (e.g., copy, move, erase, magnify, reduce, rotate, scroll, extract subimage, and save).

The blocks containing text can be further analyzed with a pattern-matching program that groups similar symbols from the document and creates a prototype pattern to represent each group. The prototype patterns are then identified either manually using an interactive display or by automatic recognition logic if this is available for the fonts used in the document. Inclusion of a dictionary-checking technique is planned to help correct errors incurred during recognition.

Other functions will be provided in future extensions to the system. For instance, the graphics and halftone images separated from the document may need further processing in certain applications. Thus, in a maintenance manual the text labels may need revision without changing the line drawing itself, and vice versa. In such cases, text information must be extracted from the image. Also, due to spatial digitization errors, scanned line drawings and graphics generally have missing or extra dots along edges. Consequently, after scanning and thresholding, identical lines may vary in width in the binary bit-maps. Such errors can be corrected with an image enhancement function. Scanned halftone images may also have moiré patterns [2]. Additional processing functions are required to remove the moiré patterns and convert the images into gray scale format so that conventional digital halftoning techniques can be used to reproduce the image. In another application, existing line drawings and graphics may have to be encoded into an engineering data base. Here, conversion of the drawing from bit-maps into vector or polynomial representation of objects is necessary.

In order to produce the proper sequential list for the text blocks, a layout analysis is required. For example, the text blocks would be linked differently for a single-column than for a multiple-column text page. If the document is to be reproduced exactly, control codes for indenting, paragraphing, and line spacing must be supplied. Finally, typesetting commands may be required to reproduce the document with a printer subsystem.

Block segmentation and text discrimination

Some earlier approaches to segmentation and discrimination [3, 4] required knowledge of the character size in order to separate a document into text and nontext areas. The method developed for the Document Analysis System consists of two steps. First, a segmentation procedure subdivides the area of a document into regions (blocks), each of which should contain only one type of data (text, graphic, halftone image, etc.). Next, some basic features of these blocks are calculated. A linear classifier which adapts itself to varying character heights discriminates between text and images. Uncertain classifications are resolved by means of an additional classification process using more powerful, complex feature information. This method is outlined below; for a more detailed description see [5].

Figure 2(a) shows an example of a document comprised of text, graphics, halftone images, and solid black lines. A run-length smoothing algorithm (RLSA) had been used earlier by one of the authors [6, 7] to detect long vertical and horizontal white lines. This algorithm has been extended to obtain a bit-map of white and black areas representing blocks containing the various types of data [see Fig. 2(d)].

The basic RLSA is applied to a binary sequence in which white pixels are represented by 0's and black pixels by 1's. The algorithm transforms a binary sequence x into an output sequence y according to the following rules:

1. 0's in x are changed to 1's in y if the number of adjacent 0's is less than or equal to a predefined limit C.
2. 1's in x are unchanged in y.

For example, with C = 4 the sequence x is mapped into y as follows:

x : 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
y : 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1
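The smoothing rule above can be sketched in a few lines of Python. This is an illustrative sketch only; the paper gives no code, and the original implementation was not in Python.

```python
def rlsa_1d(x, c):
    """Run-length smoothing of a binary sequence: runs of 0's (white)
    no longer than c are filled with 1's (black); 1's pass through."""
    y = list(x)
    i, n = 0, len(x)
    while i < n:
        if x[i] == 0:
            j = i
            while j < n and x[j] == 0:   # find the end of the white run
                j += 1
            if j - i <= c:               # short run: change its 0's to 1's
                y[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return y

# The paper's example with C = 4:
x = [0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0]
y = rlsa_1d(x, 4)
# y == [1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1]
```

Note that the runs at the sequence borders are also filled when they are short enough, matching the paper's worked example.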

When applied to pattern arrays, the RLSA has the effect of linking together neighboring black areas that are separated by less than C pixels. With an appropriate choice of C, the linked areas will be regions of a common data type. The degree of linkage depends on C, the distribution of white and black in the document, and the scanning resolution. (The Document Analysis System scans at 240 pixels per inch.)

The RLSA is applied row-by-row as well as column-by-column to a document, yielding two distinct bit-maps. Because spacings of document components tend to differ horizontally and vertically, different values of C are used for row (C = 300) and column (C = 500) processing. Figures 2(b) and 2(c) show the results of applying the RLSA in the horizontal and in the vertical directions of Fig. 2(a). The two bit-maps are then combined in a logical AND operation. Additional horizontal smoothing using the RLSA (C = 30) produces the final segmentation result illustrated in Fig. 2(d). Reference [8] describes in more detail an efficient two-pass algorithm for the implementation of the RLSA in two dimensions.
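The two-dimensional combination just described (row smoothing, column smoothing, logical AND, final horizontal pass) can be sketched as follows. This is a naive illustration, not the efficient two-pass algorithm of [8]; the default constants are the paper's, while the toy example uses small ones.

```python
def smooth_runs(bits, c):
    # 1-D RLSA: fill white (0) runs of length <= c with black (1)
    out, i, n = list(bits), 0, len(bits)
    while i < n:
        if bits[i] == 0:
            j = i
            while j < n and bits[j] == 0:
                j += 1
            if j - i <= c:
                out[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return out

def segment_blocks(image, c_row=300, c_col=500, c_final=30):
    """Row-wise RLSA, column-wise RLSA, AND of the two bit-maps,
    then one more horizontal smoothing pass (constants as in the paper)."""
    horiz = [smooth_runs(row, c_row) for row in image]
    vert_cols = [smooth_runs(list(col), c_col) for col in zip(*image)]
    vert = [list(row) for row in zip(*vert_cols)]
    anded = [[h & v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]
    return [smooth_runs(row, c_final) for row in anded]

# Toy 3x5 image: two short "text lines" separated by a blank row.
page = [[1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0],
        [1, 0, 0, 1, 0]]
blocks = segment_blocks(page, c_row=2, c_col=2, c_final=2)
# blocks == [[1,1,1,1,1], [0,0,0,0,0], [1,1,1,1,1]]:
# each line is linked horizontally, but the wide blank row keeps them apart.
```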

The blocks shown in Fig. 2(d) must next be located and classified according to content. To provide identification for subsequent processing, a unique label is assigned to each block. This is done by a labeling technique described in [9]. Simultaneously with block labeling, the following measurements are taken:


Figure 2  (a) Block segmentation example of a mixed text/image document, which here is the original digitized document. (b) and (c) Results of applying the RLSA in the horizontal and vertical directions. (d) Final result of block segmentation. (e) Results for blocks considered to be text data (class 1).

• Total number of black pixels in a segmented block (BC).
• Minimum x-y coordinates of a block and its x, y lengths (x_min, Δx, y_min, Δy).
• Total number of black pixels in original data from the block (DC).
• Horizontal white-black transitions of original data (TC).

These measurements are stored in a table (see Table 1) and are used to calculate the following features:

1. The height of each block segment: H = Δy.
2. The eccentricity of the rectangle surrounding the block: E = Δx/Δy.
3. The ratio of the number of block pixels to the area of the surrounding rectangle: S = BC/(Δx·Δy). If S is close to one, the block segment is approximately rectangular.
4. The mean horizontal length of the black runs of the original data from each block: R = DC/TC.
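In Python, the four features might be computed from the stored measurements like this (a sketch for illustration; the variable names are ours, not the paper's):

```python
def block_features(bc, dc, tc, dx, dy):
    """Features from the per-block measurements:
    bc = black pixels in the smoothed block, dc = black pixels in the
    original data, tc = horizontal white-black transitions,
    dx, dy = side lengths of the bounding rectangle."""
    H = dy               # 1. block height
    E = dx / dy          # 2. eccentricity of the surrounding rectangle
    S = bc / (dx * dy)   # 3. fill ratio; S near 1 => roughly rectangular
    R = dc / tc          # 4. mean horizontal black-run length
    return H, E, S, R

# Hypothetical measurements for a single text line:
H, E, S, R = block_features(bc=500, dc=400, tc=100, dx=100, dy=25)
# H == 25, E == 4.0, S == 0.2, R == 4.0
```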

These features are used to classify the block. Because text is the predominating data type in a typical office document and text lines are basically textured stripes of approximately constant height H and mean length of black runs R, text blocks tend to cluster with respect to these features. This is illustrated by plotting block segments in the R-H plane (see Fig. 3). Each table entry is equal to the number of block segments in the corresponding range of R and H. Thus, such a plot can be considered as a two-dimensional histogram. The text lines of the document shown in Fig. 2(a) form a clustered population within the range 20 < H < 35 and 2 < R < 8. The two solid black lines in the lower right part of the original document have high R and low H values in the R-H plane, whereas the graphic and halftone images have high values of H. Note that the scale used in Fig. 3 is highly nonlinear.

The mean value of block height, H_m, and the block mean black pixel run length, R_m, for the text cluster may vary for different types of documents, depending on character size and font. Furthermore, the text cluster's standard deviations σ(H_m) and σ(R_m) may also vary depending on whether a document is in a single font or multiple fonts and character sizes. To permit self-adjustment of the decision boundaries for text discrimination, estimates are calculated for the mean values H_m and R_m of blocks from a tightly defined text region of the R-H plane. Additional heuristic rules are applied to confirm that each such block is likely to be text before it is included in the cluster. The members of the cluster are then used to estimate new bounds on the features to detect

additional text blocks [5]. Finally, a variable, linear, separable classification scheme assigns the following four classes to the blocks:

Class 1  Text: R < C_21 · R_m and H < C_22 · H_m.
Class 2  Horizontal solid black lines: R > C_21 · R_m and H < C_22 · H_m.
Class 3  Graphic and halftone images: E > 1/C_23 and H > C_22 · H_m.
Class 4  Vertical solid black lines: E < 1/C_23 and H > C_22 · H_m.

Values have been assigned to the parameters based on several training documents. With C_21 = 3, C_22 = 3, and C_23 = 5, the outlined method has been tested on a number of test documents with satisfactory performance. Figure 2(e) shows the result of the blocks which are considered text data (class 1) of the original document in Fig. 2(a).
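The four decision rules, with the trained constants C_21 = 3, C_22 = 3, C_23 = 5, can be sketched in Python as follows. The cluster means Hm and Rm would be estimated from the document as described above; the values in the example are hypothetical.

```python
C21, C22, C23 = 3, 3, 5   # parameter values fitted on training documents

def classify_block(H, E, R, Hm, Rm):
    """Assign class 1 (text), 2 (horizontal solid line), 3 (graphic or
    halftone image), or 4 (vertical solid line) from the block features
    H, E, R and the estimated text-cluster means Hm, Rm."""
    if H < C22 * Hm:
        return 1 if R < C21 * Rm else 2   # short blocks: text vs. solid line
    else:
        return 3 if E > 1 / C23 else 4    # tall blocks: image vs. vertical line

# With hypothetical cluster means Hm = 28, Rm = 4:
print(classify_block(H=30, E=4.0, R=5.0, Hm=28, Rm=4.0))     # 1: a text line
print(classify_block(H=20, E=40.0, R=100.0, Hm=28, Rm=4.0))  # 2: horizontal rule
print(classify_block(H=600, E=0.8, R=30.0, Hm=28, Rm=4.0))   # 3: halftone image
```

The paper's rules use strict inequalities on both sides; how ties are broken is left open, and the sketch simply folds each boundary case into the second alternative.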

There are certain limitations to the block segmentation and text discrimination method described so far. On some documents, text lines are linked together by the block segmentation algorithm due to small line-to-line spacing, and thus are assigned to class 3. A line of characters exceeding 3 times H_m (with C_22 = 3), such as a title or heading in a document, is assigned to class 3. An isolated line of text printed vertically, e.g., a text description of a diagram along the vertical axis, may be classified as a number of small text blocks, or else as a class 3 block.

Table 1  Processing result of document in Fig. 2. Columns 1 to 7 contain the results of the measurements performed simultaneously with labeling (BC, x_min, Δx, y_min, Δy, DC, TC). The last column is the text classification result. [The table's numeric entries, one row per labeled block, are not legible in this transcript; most blocks are assigned class 1 (text), with a few in classes 2 and 3.]

Consequently, in order to further classify different data types within class 3, a second discrimination based on shape factors [10] is used. The method uses measurements of border-to-border distance within an arbitrary pattern. These measurements can be used to calculate meaningful features like the "line-shapeness" or "compactness" of objects within binary images. While the calculations are complex and time-consuming, they are done only for class 3, and thus add only a small increment to the overall processing time. This hierarchical decision procedure, in which "easy" class assignments are made quickly, while the difficult ones are deferred to a more complex process, is a promising approach to analyzing the immense amount of data in a scanned document.

Figure 3  Feature histogram (R-H plane) of the document shown in Fig. 2.

Recognition of text data

Hundreds of font styles and sizes are available for the printing of documents. Mathematical symbols, foreign characters, and other special typographic elements may also appear. The tremendous range of possible input material for the Document Analysis System makes it impractical to attempt to design completely automatic recognition logic. Instead, an algorithm is used to collect examples (called "prototypes") of the various patterns in the text.

The procedure, known as "pattern matching," is conducted as follows. The text is examined, character by character, in left-to-right, top-to-bottom sequence. Two output storage areas are maintained, one to hold the prototypes and the other to record for each text pattern both its position and the index of a matching prototype. The positions saved can be, for example, the coordinates of the lower left corner of the array representing the pattern.

Initially the prototype set is empty; thus, there can be no matching prototype for the first input pattern, and it is stored as the first prototype. In addition, its position and the index "1" are placed in the second output area. The second character is then compared to the first by means of a correlation test (for which one of many variants may be used). If it matches, then its position and the index 1 are recorded. If there is no match, it is added to the library as a second prototype. The third character is then read and a match sought from among the members of the library. This operation is repeated for each input pattern, in turn, with the pattern being added to the prototype collection whenever a match cannot be found.
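The loop just described reduces to a short sketch in Python; `matches` stands in for the correlation test, which the paper deliberately leaves open to many variants:

```python
def pattern_match(patterns, matches):
    """Cluster input patterns into prototypes.  `patterns` yields
    (position, pattern) pairs in reading order; `matches(a, b)` is the
    correlation test.  Returns the prototype library and, for each
    input, its position plus the 1-based index of its prototype."""
    prototypes = []
    encoded = []
    for pos, pat in patterns:
        for idx, proto in enumerate(prototypes, start=1):
            if matches(pat, proto):
                encoded.append((pos, idx))
                break
        else:                          # no match: pattern becomes a prototype
            prototypes.append(pat)
            encoded.append((pos, len(prototypes)))
    return prototypes, encoded

# Toy run with exact equality as the "correlation test":
protos, coded = pattern_match(
    [((0, 0), "a"), ((10, 0), "b"), ((20, 0), "a")],
    matches=lambda a, b: a == b)
# protos == ["a", "b"];  coded == [((0, 0), 1), ((10, 0), 2), ((20, 0), 1)]
```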

Pattern matching reduces the recognition problem to one of identifying the prototypes, since the prototype character codes can be substituted for the sequence of indexes obtained for the text. The codes and accompanying positions constitute a document encoding that is sufficient for creating a computerized version of the original text or for conducting information retrieval searches within its content. Other merits of the approach are that (1) the pattern-matching scheme is simple to implement, (2) it permits the entry of a wide variety of font styles and sizes, and (3) the recognition logic need not be designed in advance and no design data need be collected.

The prototypes can be identified in several ways. If the document is known by the user to have been printed in any of a number of specified fonts, then prestored OCR recognition logic may be applied [11]. Text in a given language source can be partially recognized using dictionary look-up, tables of bigram or trigram frequencies, etc.; OCR errors can be corrected using such context aids as well. Finally, even if automatic techniques cannot be used, the prototypes can be displayed at a keyboard terminal and identified interactively by the user. The number of keystrokes needed to identify the prototypes is an order of magnitude less than that required to enter typical documents manually.

Pattern matching has previously been used by Nagy [1] for text encoding and by Pratt [12] for facsimile compression. The technique used for pattern matching in both of these systems, however, suffers from one major defect, particularly when the algorithm is implemented by software that is to run on a serial processor. The problem is that each text character is compared pixel-by-pixel with all of the prototypes that are close to it in simple measurements such as width, height, area, etc. As the size of the library increases, many prototypes may qualify, and the time required to do the correlations with a conventional sequential instruction computer can be excessive.

To reduce the amount of computation, the Document Analysis System employs a preclassifying technique described in more detail in [13]. The preclassifier is a decision network whose input is a pattern array and whose output is the index of a possible matching prototype pattern.


Each interior node is a pixel location, and each terminal node is a prototype index. Two branches lead from each interior node. One branch is labeled "black," the other, "white." Starting with the root, a node is accessed, and the pixel location that it specifies is examined in the input array. The color of this pixel determines the branch to the next node. When a terminal node is accessed, the corresponding prototype index is assigned to the input.

Each successive input array is first classified by this network, which examines only a small subset of the pixels before producing an index. The network is initially null, but it is modified each time a new prototype is added to the library. The modification consists of adding an extension to the decision tree (or network) that permits it to distinguish the new prototype as well as all previously selected prototypes.

The preclassification is either verified or nullified by comparing the input pattern with the prototype selected by the network. The two-step operation is repeated with the input array shifted into several registration positions until either a matching prototype is found, or else none is found and the input is designated as a new prototype.
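A minimal sketch of the classify-then-verify step (the data layout is ours, not the paper's, and the registration-shift loop is omitted): interior nodes hold a pixel location with white/black branches, terminal nodes are bare prototype indices, and the chosen prototype is verified by direct comparison.

```python
class Node:
    """Interior node of the decision network: a pixel location and the
    branches taken on a white (0) or black (1) pixel at that location.
    Terminal nodes are represented by bare prototype indices (ints)."""
    def __init__(self, pixel, white, black):
        self.pixel, self.white, self.black = pixel, white, black

def preclassify(root, pattern):
    """Follow pixel colors down the network to a prototype index."""
    node = root
    while isinstance(node, Node):
        r, c = node.pixel
        node = node.black if pattern[r][c] else node.white
    return node

def classify(root, prototypes, pattern):
    """Preclassify, then verify against the selected prototype; return
    the prototype index, or None (a candidate for a new prototype)."""
    idx = preclassify(root, pattern)
    return idx if prototypes[idx] == pattern else None

# Two 2x2 prototypes distinguishable by the top-left pixel alone:
prototypes = {0: [[0, 1], [1, 1]], 1: [[1, 0], [0, 0]]}
root = Node(pixel=(0, 0), white=0, black=1)
print(classify(root, prototypes, [[1, 0], [0, 0]]))  # 1
print(classify(root, prototypes, [[0, 0], [0, 0]]))  # None: verification fails
```

Only one full pattern comparison is performed per lookup, which is the source of the speedup the text describes.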

The advantage of this approach is that only one pattern correlation is needed to classify an input in each registration position used. Thus, the classification is greatly accelerated compared with straightforward pattern matching. The reduction in classification time must be balanced against the time expended in modifying the decision network each time a new class is found and against the possibility of creating extra classes due to preclassification errors made by the network. Considerable effort has gone into developing an efficient network adaptation algorithm.

The design of the preclassifier is based on a scheme in which the pixels of each prototype are categorized as "reliable" or "unreliable." A pixel is defined to be "reliable" for a given prototype if its four horizontal and vertical neighbor pixels are of the same color. Using this categorization, the decision network is constructed such that for any input pattern the sequence of pixels examined has the following properties. First, every pixel examined has the same color in both the input array and in the selected prototype. Second, if these pixels in the input are compared with the corresponding positions in any of the other prototypes, at least one pixel will be both different in color in the two patterns and a "reliable" pixel for the prototype. In this way, the preclassification is based on pixel comparisons away from the edges, where most variation occurs in printed characters.
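Reading the definition as "the four 4-neighbors share the pixel's own color," and treating border pixels as unreliable (both are our interpretation, not stated in the paper), a reliability mask can be computed like this:

```python
def reliable_pixels(proto):
    """Mark pixels whose four horizontal/vertical neighbors all have
    the pixel's own color; such pixels sit inside uniform regions,
    away from the noisy character edges."""
    rows, cols = len(proto), len(proto[0])
    rel = [[False] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            v = proto[r][c]
            rel[r][c] = all(proto[r + dr][c + dc] == v
                            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)))
    return rel

# The center of a solid 3x3 block is reliable; a center touching a
# differently colored neighbor is not:
print(reliable_pixels([[1, 1, 1], [1, 1, 1], [1, 1, 1]])[1][1])  # True
print(reliable_pixels([[1, 0, 1], [1, 1, 1], [1, 1, 1]])[1][1])  # False
```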

Figure 4  Strategies for modifying the decision network to include class k + 1: (a) top extension; (b) bottom extension; and (c) critical node extension.

Three methods of designing the preclassifier have been investigated. The top extension approach (called strategy A) adds pixels to be examined above the root node of the previously designed decision network. The strategy is to examine one or more "reliable" pixels such that the added network can distinguish the new prototype class G(k + 1) from the remaining k classes [see Fig. 4(a)]. The bottom extension approach (called strategy B) retraces the path of G(k + 1) through the decision network as in preclassification, but with the difference that when a pixel that is unreliable in G(k + 1) is encountered, both branches from the node are followed. Thus, the extension routine traverses multiple paths through the network. Each terminal node reached in this way is then replaced by a three-node subgraph. The top node of the subgraph specifies a pixel that is a "reliable" pixel for G(k + 1), and is "reliable" and of opposite value for the prototype P identified by the terminal node. The branches from this node lead to terminal nodes identified with P and with G(k + 1), respectively. The extension procedure is repeated for each terminal node reached, as illustrated in Fig. 4(b).

The third approach (called strategy C) is called "critical node extension." Here, prototype G(k + 1) is entered into network N(k) as in the bottom extension strategy, and its path is traced until it reaches a node n specifying an unreliable pixel in G(k + 1). At this point it is determined which subset of prototypes G(1), G(2), . . ., G(k) can also reach node n. Call this prototype subset H. The branch into node n is reconnected to a subnetwork consisting of a sequence of pixels that are reliable in G(k + 1) and that discriminate G(k + 1) from members of H. The specification of this subnetwork is identical to strategy A except that only members of H, rather than all the prototypes, are discriminated from G(k + 1). The critical node extension is illustrated in Fig. 4(c).

Figure 5  Prototype patterns from the document shown in Fig. 2(b). [The prototype character images are not reproduced legibly in this transcript.]

Each strategy has its advantages and limitations. Strategy A is simplest in concept. Each time a new class is added, a local modification to the network is made at its root. The disadvantage of the approach is that, as more and more classes are added, the structure resembles a chain of subnetworks, each link of which corresponds to a class. If a pattern belonging to a class near the bottom of the chain is entered into the network, then one or more pixels must be examined at each of the higher links. Since the links require more nodes as the number of classes increases, the time required to classify a pattern grows more than linearly with the number of classes.

Strategy B, on the other hand, results in a relatively balanced structure. Starting from a null network, strategy B produces a tree graph, since only subtrees are appended at each step. The exact shape of the tree depends on the prototype pixels, but in general it has a shorter average path length than the network obtained using strategy A. In fact, since any path is increased in length by at most one node each time a class is added, a network designed using strategy B can have no path longer than K nodes when it has been extended to accommodate K prototypes, and typically the growth in average path length is approximately logarithmic. However, strategy B calls for modifying the network in a number of locations. As the network grows, many terminal nodes may have to be extended in order to add a single class. This effect increases both the time required for adaptation and the storage requirement for the network.
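The path-length bound for strategy B can be restated as a one-line recurrence (a worked restatement of the argument above, with L(K) denoting the longest root-to-terminal path after K prototypes have been entered):

```latex
% Each new class lengthens any existing path by at most one node:
L(K) \le L(K-1) + 1, \qquad L(1) = 1
\quad\Longrightarrow\quad L(K) \le K,
% while a well-balanced tree exhibits the typical average behavior
\bar{L}(K) \approx \log_2 K .
```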

Strategy C offers a compromise between the other two approaches. The network is modified only at a single node each time a class is added, but the strategy offers the potential of a more balanced configuration than the chain produced by top extension. In a particular case, if each successive prototype followed a path consisting only of interior pixels, then only terminal nodes would ever be modified, and strategy C would be equivalent to strategy B. On the other hand, if for each successive extension the root node specifies an edge pixel for the new prototype, then strategy C yields the same chain configuration obtained using strategy A. In practice, the results lie between these extremes, as shown below.
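As a concrete illustration, the top extension idea (strategy A) can be sketched in a few lines: a chain of "reliable" pixel tests is prepended above the old root, so that a pattern agreeing with the new prototype at every chain pixel reaches the new terminal, while any disagreement drops it back into the previously built network. The binary-vector prototypes and the greedy pixel selection below are our own simplifications, not the paper's exact procedure.

```python
class Node:
    """One node of the decision network: internal nodes test a single pixel,
    terminal nodes name a prototype class."""
    def __init__(self, pixel=None, cls=None):
        self.pixel = pixel          # pixel index tested here (internal node)
        self.cls = cls              # prototype class id (terminal node)
        self.child = [None, None]   # successors for pixel values 0 and 1

def preclassify(root, pattern):
    """Walk the network, testing one pixel per node, down to a terminal class."""
    node = root
    while node.cls is None:
        node = node.child[pattern[node.pixel]]
    return node.cls

def top_extend(root, old_protos, new_proto, new_cls, reliable):
    """Strategy A (top extension), simplified.  'reliable' lists the pixel
    indices whose values are stable for new_proto (away from the character
    edges).  Pixels are chosen greedily until every old prototype differs
    from new_proto at some chosen pixel.  (Assumes the reliable pixels
    suffice to separate new_proto from every old prototype.)"""
    remaining = dict(old_protos)        # old classes not yet separated
    chain = []
    while remaining:
        best = max(reliable,
                   key=lambda px: sum(p[px] != new_proto[px]
                                      for p in remaining.values()))
        assert any(p[best] != new_proto[best] for p in remaining.values()), \
            "reliable pixels cannot separate the new prototype"
        chain.append(best)
        remaining = {c: p for c, p in remaining.items()
                     if p[best] == new_proto[best]}
    # Build the chain bottom-up: agree with new_proto -> continue toward its
    # terminal; disagree -> fall back into the old network.
    node = Node(cls=new_cls)
    for px in reversed(chain):
        parent = Node(pixel=px)
        parent.child[new_proto[px]] = node
        parent.child[1 - new_proto[px]] = root
        node = parent
    return node
```

Repeated calls to top_extend reproduce the chain structure criticized above: each added class contributes another link that patterns of later classes must traverse before reaching the older part of the network.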

Figure 5 shows the prototype patterns obtained using the proposed text data recognition method on the document shown in Fig. 2(a). While there is some redundancy in the prototype set, only a small part of this is due to preclassifier errors; the major portion is due to failures of the correlation stage to declare similarity between an input and a prototype of the same class. (The threshold for declaring a correlation match has to be set rather tightly to avoid giving different character types the same designation, a more serious error.) A comparison with a conventional pattern-matching algorithm yielded about the same size prototype set. The efficiency advantage of the preclassifier system is indicated in the plot of Fig. 6, in which a typewritten document was the input. An average of only 130 pixels per character pattern was examined by the preclassifier (out of approximately 1000 in each pattern array) using the critical node extension method. In this same experiment, appreciably more CPU time was spent in using the decision network for classification than in constructing it. In the least favorable case, bottom extension, the ratio of preclassification time to adaptation time was 3:1; for the critical node method the ratio rose to 10:1. These data were obtained from simulation experiments programmed in APL and do not provide estimates of the timings obtainable with efficient programming of the method. An effort is currently under way to reprogram the algorithm and to optimize its speed.

Figure 6 Efficiency advantage of the preclassifier: cumulative number of pixels examined in the decision network versus input sequence number (0 to 1500), for the top extension, bottom extension, and critical node strategies.

A side benefit of the preclassification approach has been a recursive technique for the segmentation of touching character patterns (described in detail in [14]). The proper segmentation of closely spaced printed characters is one of the major problems in optical character recognition. The Document Analysis System combines segmentation with the classification of the component patterns. Following the pattern-matching stage, each prototype wider than twice the width of the narrowest prototype undergoes a series of trial segmentations. The two-stage preclassification/correlation technique attempts to find matching prototypes for the trial segments. The recursive algorithm segments a sequence of touching characters only if it can decompose the segment into patterns that positively match previously found prototypes. Because the number of patterns tested for segmentation is small and because the use of the preclassifier makes the process efficient, the computational overhead introduced by recursive segmentation is relatively insignificant. Figure 7 shows a sampling of composite patterns successfully segmented by this technique.
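The decomposition logic can be sketched as a small recursion. Two simplifications in the sketch below: exact column equality stands in for the paper's preclassification/correlation matching, and the restriction of trial segmentation to unusually wide prototypes is omitted.

```python
def segment(pattern, prototypes):
    """Recursively decompose a composite pattern (a list of pixel columns)
    into a sequence of prototype names.  A decomposition is accepted only if
    the whole pattern is covered by positive matches; otherwise return None.
    Exact column equality is an illustrative stand-in for the paper's
    preclassification/correlation matching."""
    if not pattern:
        return []                       # fully decomposed
    for name, proto in prototypes.items():
        w = len(proto)
        if pattern[:w] == proto:        # trial segment matches a prototype
            rest = segment(pattern[w:], prototypes)
            if rest is not None:        # remainder also decomposes
                return [name] + rest
    return None                         # no complete decomposition found
```

The recursion backtracks automatically: if a trial prefix leaves a remainder that cannot be decomposed, the loop moves on to the next candidate prototype, so a composite pattern is split only when every piece matches positively.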

Figure 7 Illustration of the recursive touching character segmentation method: composite input arrays (e.g., "aki," "ai," "al," "an," "di," "fl," "gi," "hi," "Pu") shown with their matching prototypes.

Conclusion
This paper has described some of the requirements for and a system overview of a document analysis system which encodes printed documents to a form suitable for computer processing. Not all the system components mentioned have been designed and built. The initial work on the project has focused on techniques for the separation of a document into regions of text and images and on the recognition of text data.

The image segmentation method has been tested with a variety of printed documents from different sources with one common set of parameters. Performance is satisfactory although misclassifications occur occasionally. With the user in an interactive feedback loop, misclassification errors can be checked and corrected quite readily in the classification table. A more sophisticated pattern discrimination method [10] is being tested to further classify those blocks that are labeled as non-text.

Pattern matching has been investigated as a fundamental approach to reducing a large number of unknown character patterns to a small number of prototypes. A new, highly efficient approach to perform the matching has been implemented by employing an adaptive decision network for preclassification. The reduction of computational load is very important when processing is done by a conventional computer. Preliminary results on trial documents have yielded prototype sets of low redundancy, as well as low rates of misclassification.

Acknowledgments
We wish to acknowledge Y. Takao of IBM Tokyo Scientific Center in Japan for his contribution in writing a package of image-editing software (called EDIT) for our Document Analysis System. We also appreciate the stimulating discussions with G. Nagy during his visit to our laboratory.

References
1. R. N. Ascher, G. M. Koppelman, M. J. Miller, G. Nagy, and G. L. Shelton, Jr., "An Interactive System for Reading Unformatted Printed Text," IEEE Trans. Computers C-20, 1527-1543 (December 1971).
2. A. Steinbach and K. Y. Wong, "An Understanding of Moiré Patterns in the Reproduction of Halftone Images," Proceedings of the Pattern Recognition and Image Processing Conference, Chicago, Aug. 6-8, 1979, pp. 545-552.
3. G. Nagy, "Preliminary Investigation of Techniques for Automated Reading of Unformatted Text," Commun. ACM 11, 480-487 (July 1968).
4. E. Johnstone, "Printed Text Discrimination," Computer Graph. Image Process. 3, 83-89 (1974).
5. F. M. Wahl, K. Y. Wong, and R. G. Casey, "Block Segmentation and Text Extraction in Mixed Text/Image Documents," Research Report RJ 3356, IBM Research Laboratory, San Jose, CA, 1981.
6. F. Wahl, L. Abele, and W. Scherl, "Merkmale für die Segmentation von Dokumenten zur Automatischen Textverarbeitung," Proceedings of the 4th DAGM-Symposium, Hamburg, Federal Republic of Germany, Springer-Verlag, Berlin, 1981.
7. L. Abele, F. Wahl, and W. Scherl, "Procedures for an Automatic Segmentation of Text Graphic and Halftone Regions in Documents," Proceedings of the 2nd Scandinavian Conference on Image Analysis, Helsinki, 1981.
8. F. M. Wahl and K. Y. Wong, "An Efficient Method of Running a Constrained Run Length Algorithm (CRLA) in Vertical and Horizontal Directions on Binary Image Data," Research Report RJ 3438, IBM Research Laboratory, San Jose, CA, 1982.
9. A. Rosenfeld and A. C. Kak, Digital Picture Processing, Academic Press, Inc., New York, 1976, pp. 347-348.
10. F. M. Wahl, "A New Distance Mapping and its Use for Shape Measurement on Binary Patterns," Research Report RJ 3361, IBM Research Laboratory, San Jose, CA, 1982.
11. R. G. Casey and G. Nagy, "Decision Tree Design Using a Probabilistic Model," Research Report RJ 3358, IBM Research Laboratory, San Jose, CA, 1981.
12. W. H. Chen, J. L. Douglas, W. K. Pratt, and R. H. Wallis, "Dual-mode Hybrid Compressor for Facsimile Images," SPIE J. 207, 226-232 (1979).
13. R. G. Casey and K. Y. Wong, "Unsupervised Construction of Decision Networks for Pattern Classification," IBM Research Report, to appear.
14. R. G. Casey and G. Nagy, "Recursive Segmentation and Classification of Composite Character Patterns," presented at the 6th International Conference on Pattern Recognition, Munich, October 1982.

Received May 3, 1982; revised July 7, 1982

Richard G. Casey  IBM Research Division, 5600 Cottle Road, San Jose, California 95193. Dr. Casey has worked primarily in pattern recognition, particularly optical character recognition, since joining IBM in 1963. Until 1970, he carried out projects in this area at the Thomas J. Watson Research Center in Yorktown Heights, New York. After transferring to San Jose, he worked initially on the analysis of data bases, but returned to the character recognition area in 1975, soon after completing a one-year assignment for IBM in the United Kingdom. Dr. Casey received a B.E.E. from Manhattan College in 1954 and an M.S. and Eng.Sc.D. from Columbia University, New York, in 1958 and 1965. From 1957 to 1958, he was a teaching assistant at Columbia University, and from 1958 to 1963, he investigated radar data processing techniques at the Columbia University Electronics Research Laboratories. While on leave of absence from IBM in 1969, he taught at the University of Florida.

Friedrich M. Wahl  Lst. f. Nachrichtentechnik, Technische Universitaet, Arcisstrasse 21, 8 Munich 2, Federal Republic of Germany. Dr. Wahl studied electrical engineering at the Technical University of Munich, where he received his Diplom and his Ph.D. in 1974 and 1980. In 1974 he started a digital image processing group at the Institute of Communication Engineering at the Technical University of Munich. Dr. Wahl worked in the image processing project in the Computer Science Department at the IBM Research laboratory in San Jose, California, for one year, 1981 to 1982, as a visiting scientist. At San Jose, he worked on problems of document analysis systems, image segmentation, shape description, clustering algorithms, and line drawing decomposition. He has also been involved with automatic image inspection for manufacturing. Dr. Wahl is a member of the Institute of Electrical and Electronics Engineers.

Kwan Y. Wong  IBM Research Division, 5600 Cottle Road, San Jose, California 95193. Dr. Wong received his B.E. (with honors) and his M.E. in electrical engineering from the University of New South Wales, Sydney, Australia, in 1960 and 1963. Since he received his Ph.D. from the University of California, Berkeley, in 1966, he has been working at the IBM Research laboratory in San Jose. He was the leader on projects in process control systems, air-jet guiding for flexible media, and image processing. Currently he is the manager of the image processing project in the Computer Science Department. His recent research activities are in the areas of image processing and pattern recognition systems for office automation, manufacturing automated visual inspection systems, special image processing hardware, and algorithms.
