
8/9/2019 Document Engineering Research Paper

K. Y. Wong
R. G. Casey
F. M. Wahl

Document Analysis System

This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing. Several critical functions have been investigated and the technical approaches are discussed. The first is the segmentation and classification of digitized printed documents into regions of text and images. A nonlinear, run-length smoothing algorithm has been used for this purpose. By using the regular features of text lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A preclassifier is constructed during the input process and used to speed up a well-known pattern-matching method for clustering characters from an arbitrary print source into a small sample of prototypes. Experimental results are included.

    Introduction

The decreasing cost of hardware will eventually make commonplace the storage and distribution of documents by electronic means. However, today most documents are being saved, distributed, and presented on paper. Paper is the primary medium for books, journals, newspapers, magazines, and business correspondence. This paper describes an experimental Document Analysis System, which assists a user in the encoding of printed documents for computer processing.

A document analysis system could be applied to extract information from printed documents to create data bases (e.g., patents, court decisions). Such systems could also create computer document files from papers submitted to journals and assist in revising books and manuals. In addition, a document analysis system could provide a general data compression facility, since encoding optically scanned text from a bit-map into symbol codes greatly reduces the cost of storing or transmitting the information.

Existing commercial systems cannot adequately convert unconstrained printed documents into a format suitable for computer processing. Data entry by operator keying is not only expensive and time-consuming but also precludes the encoding of graphics and images contained in the document. Current optical character recognition (OCR) machines only recognize certain preprogrammed fonts; they do not accept documents containing a mixture of text and images. The Document Analysis System thus provides a new capability for converting information stored on printed documents, including both images and text, to computer-processable form.

This paper is organized into three sections. The first section gives an overview of the system requirements. Not all the components described in the system have been designed and tested yet. The next two sections present the results of the technical investigations of some of the key components: the segmentation of a scanned document into regions of text and images, and the recognition of text. Initial experimental data are included to illustrate these two methods.

System requirements and overall structure

The design of an interactive computer system to read printed documents was investigated by Nagy [1]. Some of the system concepts used by him are reflected in the Document Analysis System, but the solutions in most areas are different.

© Copyright 1982 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

64   K. Y. WONG ET AL.   IBM J. RES. DEVELOP. VOL. 26 NO. 6 NOVEMBER 1982


Figure 1  System overview.

The system is designed to have the following capabilities:

• Manual editing of scanned documents (e.g., functions such as crop, copy, merge, erase, save, etc.).
• Separating a document into regions of text and images automatically.
• Representing text by standard symbols such as EBCDIC, ASCII, etc., possibly with the assistance of user interaction. Automatic character recognition will be provided for documents printed in certain conventional fonts.
• Checking the recognized text against a dictionary.
• Enhancing the images and converting graphics from bit-maps into vector format.
• Analyzing the page layout so that appropriate typographical commands can be generated automatically or interactively to reproduce the document with printer subsystems.
• Allowing the user to customize the system for his particular needs.


Figure 1 shows the modules of the proposed Document Analysis System. The shaded modules have been implemented or investigated in detail. The preliminary version of the system permits meaningful experiments but does not yet provide full interactive capability.

Documents are scanned as black/white images with a commercial laser scanner at 240 pixels per inch into a Series/1 minicomputer, which then transfers the data over a network to the host IBM 3033 computer. The display device is a Tektronix 618 storage tube attached to an IBM 3277 terminal, which communicates with the host. The system can automatically separate a document into regions of text, graphics, and halftone images. The technique for doing this is presented in the next section. Alternatively, the user can choose to perform the same function manually by interactively editing the image. The user can invoke the Image Manipulation Editor (called IEDIT), which reads and stores


images on disk and provides a variety of image-editing functions (e.g., copy, move, erase, magnify, reduce, rotate, scroll, extract subimage, and save).

The blocks containing text can be further analyzed with a pattern-matching program that groups similar symbols from the document and creates a prototype pattern to represent each group. The prototype patterns are then identified either manually using an interactive display or by automatic recognition logic if this is available for the fonts used in the document. Inclusion of a dictionary-checking technique is planned to help correct errors incurred during recognition.

Other functions will be provided in future extensions to the system. For instance, the graphics and halftone images separated from the document may need further processing in certain applications. Thus, in a maintenance manual the text labels may need revision without changing the line drawing itself, and vice versa. In such cases, text information must be extracted from the image. Also, due to spatial digitization errors, scanned line drawings and graphics generally have missing or extra dots along edges. Consequently, after scanning and thresholding, identical lines may vary in width in the binary bit-maps. Such errors can be corrected with an image enhancement function. Scanned halftone images may also have moiré patterns [2]. Additional processing functions are required to remove the moiré patterns and convert the images into gray scale format so that conventional digital halftoning techniques can be used to reproduce the image. In another application, existing line drawings and graphics may have to be encoded into an engineering data base. Here, conversion of the drawing from bit-maps into vector or polynomial representation of objects is necessary.

In order to produce the proper sequential list for the text blocks, a layout analysis is required. For example, the text blocks would be linked differently for a single-column than for a multiple-column text page. If the document is to be reproduced exactly, control codes for indenting, paragraphing, and line spacing must be supplied. Finally, typesetting commands may be required to reproduce the document with a printer subsystem.

Block segmentation and text discrimination

Some earlier approaches to segmentation and discrimination [3, 4] required knowledge of the character size in order to separate a document into text and nontext areas. The method developed for the Document Analysis System consists of two steps. First, a segmentation procedure subdivides the area of a document into regions (blocks), each of which should contain only one type of data (text, graphic, halftone image, etc.). Next, some basic features of these blocks are calculated. A linear classifier which adapts itself to varying character heights discriminates between text and images. Uncertain classifications are resolved by means of an additional classification process using more powerful, complex feature information. This method is outlined below; for a more detailed description see [5].

Figure 2(a) shows an example of a document comprised of text, graphics, halftone images, and solid black lines. A run-length smoothing algorithm (RLSA) had been used earlier by one of the authors [6, 7] to detect long vertical and horizontal white lines. This algorithm has been extended to obtain a bit-map of white and black areas representing blocks containing the various types of data [see Fig. 2(d)].

The basic RLSA is applied to a binary sequence in which white pixels are represented by 0's and black pixels by 1's. The algorithm transforms a binary sequence x into an output sequence y according to the following rules:

1. 0's in x are changed to 1's in y if the number of adjacent 0's is less than or equal to a predefined limit C.
2. 1's in x are unchanged in y.

For example, with C = 4 the sequence x is mapped into y as follows:

x : 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
y : 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1
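The smoothing rule above can be sketched in a few lines of Python. This is an illustrative sketch only; the paper gives no code, and the original implementation was not in Python.

```python
def rlsa_1d(x, c):
    """Run-length smoothing of a binary sequence: runs of 0's (white)
    no longer than c are filled with 1's (black); 1's pass through."""
    y = list(x)
    i, n = 0, len(x)
    while i < n:
        if x[i] == 0:
            j = i
            while j < n and x[j] == 0:   # find the end of the white run
                j += 1
            if j - i <= c:               # short run: change its 0's to 1's
                y[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return y

# The paper's example with C = 4:
x = [0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0]
y = rlsa_1d(x, 4)
# y == [1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1]
```

Note that the runs at the sequence borders are also filled when they are short enough, matching the paper's worked example.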

When applied to pattern arrays, the RLSA has the effect of linking together neighboring black areas that are separated by less than C pixels. With an appropriate choice of C, the linked areas will be regions of a common data type. The degree of linkage depends on C, the distribution of white and black in the document, and the scanning resolution. (The Document Analysis System scans at 240 pixels per inch.)

The RLSA is applied row-by-row as well as column-by-column to a document, yielding two distinct bit-maps. Because spacings of document components tend to differ horizontally and vertically, different values of C are used for row (C = 300) and column (C = 500) processing. Figures 2(b) and 2(c) show the results of applying the RLSA in the horizontal and in the vertical directions of Fig. 2(a). The two bit-maps are then combined in a logical AND operation. Additional horizontal smoothing using the RLSA (C = 30) produces the final segmentation result illustrated in Fig. 2(d). Reference [8] describes in more detail an efficient two-pass algorithm for the implementation of the RLSA in two dimensions.
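The two-dimensional combination just described (row smoothing, column smoothing, logical AND, final horizontal pass) can be sketched as follows. This is a naive illustration, not the efficient two-pass algorithm of [8]; the default constants are the paper's, while the toy example uses small ones.

```python
def smooth_runs(bits, c):
    # 1-D RLSA: fill white (0) runs of length <= c with black (1)
    out, i, n = list(bits), 0, len(bits)
    while i < n:
        if bits[i] == 0:
            j = i
            while j < n and bits[j] == 0:
                j += 1
            if j - i <= c:
                out[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return out

def segment_blocks(image, c_row=300, c_col=500, c_final=30):
    """Row-wise RLSA, column-wise RLSA, AND of the two bit-maps,
    then one more horizontal smoothing pass (constants as in the paper)."""
    horiz = [smooth_runs(row, c_row) for row in image]
    vert_cols = [smooth_runs(list(col), c_col) for col in zip(*image)]
    vert = [list(row) for row in zip(*vert_cols)]
    anded = [[h & v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]
    return [smooth_runs(row, c_final) for row in anded]

# Toy 3x5 image: two short "text lines" separated by a blank row.
page = [[1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0],
        [1, 0, 0, 1, 0]]
blocks = segment_blocks(page, c_row=2, c_col=2, c_final=2)
# blocks == [[1,1,1,1,1], [0,0,0,0,0], [1,1,1,1,1]]:
# each line is linked horizontally, but the wide blank row keeps them apart.
```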

The blocks shown in Fig. 2(d) must next be located and classified according to content. To provide identification for subsequent processing, a unique label is assigned to each block. This is done by a labeling technique described in [9]. Simultaneously with block labeling, the following measurements are taken:


Figure 2  (a) Block segmentation example of a mixed text/image document, which here is the original digitized document. (b) and (c) Results of applying the RLSA in the horizontal and vertical directions. (d) Final result of block segmentation. (e) Results for blocks considered to be text data (class 1).

• Total number of black pixels in a segmented block (BC).
• Minimum x-y coordinates of a block and its x, y lengths (x_min, Δx, y_min, Δy).
• Total number of black pixels in original data from the block (DC).
• Horizontal white-black transitions of original data (TC).

These measurements are stored in a table (see Table 1) and are used to calculate the following features:

1. The height of each block segment: H = Δy.
2. The eccentricity of the rectangle surrounding the block: E = Δx/Δy.
3. The ratio of the number of block pixels to the area of the surrounding rectangle: S = BC/(Δx·Δy). If S is close to one, the block segment is approximately rectangular.
4. The mean horizontal length of the black runs of the original data from each block: R = DC/TC.
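In Python, the four features might be computed from the stored measurements like this (a sketch for illustration; the variable names are ours, not the paper's):

```python
def block_features(bc, dc, tc, dx, dy):
    """Features from the per-block measurements:
    bc = black pixels in the smoothed block, dc = black pixels in the
    original data, tc = horizontal white-black transitions,
    dx, dy = side lengths of the bounding rectangle."""
    H = dy               # 1. block height
    E = dx / dy          # 2. eccentricity of the surrounding rectangle
    S = bc / (dx * dy)   # 3. fill ratio; S near 1 => roughly rectangular
    R = dc / tc          # 4. mean horizontal black-run length
    return H, E, S, R

# Hypothetical measurements for a single text line:
H, E, S, R = block_features(bc=500, dc=400, tc=100, dx=100, dy=25)
# H == 25, E == 4.0, S == 0.2, R == 4.0
```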

These features are used to classify the block. Because text is the predominating data type in a typical office document and text lines are basically textured stripes of approximately constant height H and mean length of black runs R, text blocks tend to cluster with respect to these features. This is illustrated by plotting block segments in the R-H plane (see Fig. 3). Each table entry is equal to the number of block segments in the corresponding range of R and H. Thus, such a plot can be considered as a two-dimensional histogram. The text lines of the document shown in Fig. 2(a) form a clustered population within the range 20 < H < 35 and 2 < R < 8. The two solid black lines in the lower right part of the original document have high R and low H values in the R-H plane, whereas the graphic and halftone images have high values of H. Note that the scale used in Fig. 3 is highly nonlinear.

The mean value of block height, H_m, and the block mean black pixel run length, R_m, for the text cluster may vary for different types of documents, depending on character size and font. Furthermore, the text cluster's standard deviations σ(H_m) and σ(R_m) may also vary depending on whether a document is in a single font or multiple fonts and character sizes. To permit self-adjustment of the decision boundaries for text discrimination, estimates are calculated for the mean values H_m and R_m of blocks from a tightly defined text region of the R-H plane. Additional heuristic rules are applied to confirm that each such block is likely to be text before it is included in the cluster. The members of the cluster are then used to estimate new bounds on the features to detect

additional text blocks [5]. Finally, a variable, linear, separable classification scheme assigns the following four classes to the blocks:

Class 1  Text: R < C_21 · R_m and H < C_22 · H_m.
Class 2  Horizontal solid black lines: R > C_21 · R_m and H < C_22 · H_m.
Class 3  Graphic and halftone images: E > 1/C_23 and H > C_22 · H_m.
Class 4  Vertical solid black lines: E < 1/C_23 and H > C_22 · H_m.

Values have been assigned to the parameters based on several training documents. With C_21 = 3, C_22 = 3, and C_23 = 5, the outlined method has been tested on a number of test documents with satisfactory performance. Figure 2(e) shows the result of the blocks which are considered text data (class 1) of the original document in Fig. 2(a).
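The four decision rules, with the trained constants C_21 = 3, C_22 = 3, C_23 = 5, can be sketched in Python as follows. The cluster means Hm and Rm would be estimated from the document as described above; the values in the example are hypothetical.

```python
C21, C22, C23 = 3, 3, 5   # parameter values fitted on training documents

def classify_block(H, E, R, Hm, Rm):
    """Assign class 1 (text), 2 (horizontal solid line), 3 (graphic or
    halftone image), or 4 (vertical solid line) from the block features
    H, E, R and the estimated text-cluster means Hm, Rm."""
    if H < C22 * Hm:
        return 1 if R < C21 * Rm else 2   # short blocks: text vs. solid line
    else:
        return 3 if E > 1 / C23 else 4    # tall blocks: image vs. vertical line

# With hypothetical cluster means Hm = 28, Rm = 4:
print(classify_block(H=30, E=4.0, R=5.0, Hm=28, Rm=4.0))     # 1: a text line
print(classify_block(H=20, E=40.0, R=100.0, Hm=28, Rm=4.0))  # 2: horizontal rule
print(classify_block(H=600, E=0.8, R=30.0, Hm=28, Rm=4.0))   # 3: halftone image
```

The paper's rules use strict inequalities on both sides; how ties are broken is left open, and the sketch simply folds each boundary case into the second alternative.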

There are certain limitations to the block segmentation and text discrimination method described so far. On some documents, text lines are linked together by the block segmentation algorithm due to small line-to-line spacing, and thus are assigned to class 3. A line of characters exceeding 3 times H_m (with C_22 = 3), such as a title or heading in a document, is assigned to class 3. An isolated line of text printed vertically, e.g., a text description of a diagram along the vertical axis, may be classified as a number of small text blocks, or else as a class 3 block.

Table 1  Processing result of document in Fig. 2. Columns 1 to 7 contain the results of the measurements performed simultaneously with labeling (BC, x_min, Δx, y_min, Δy, DC, TC). The last column is the text classification result. [The table's numeric entries, one row per labeled block, are not legible in this transcript; most blocks are assigned class 1 (text), with a few in classes 2 and 3.]

Consequently, in order to further classify different data types within class 3, a second discrimination based on shape factors [10] is used. The method uses measurements of border-to-border distance within an arbitrary pattern. These measurements can be used to calculate meaningful features like the "line-shapeness" or "compactness" of objects within binary images. While the calculations are complex and time-consuming, they are done only for class 3, and thus add only a small increment to the overall processing time. This hierarchical decision procedure, in which "easy" class assignments are made quickly, while the difficult ones are deferred to a more complex process, is a promising approach to analyzing the immense amount of data in a scanned document.

Figure 3  Feature histogram (R-H plane) of the document shown in Fig. 2.

Recognition of text data

Hundreds of font styles and sizes are available for the printing of documents. Mathematical symbols, foreign characters, and other special typographic elements may also appear. The tremendous range of possible input material for the Document Analysis System makes it impractical to attempt to design completely automatic recognition logic. Instead, an algorithm is used to collect examples (called "prototypes") of the various patterns in the text.

The procedure, known as "pattern matching," is conducted as follows. The text is examined, character by character, in left-to-right, top-to-bottom sequence. Two output storage areas are maintained, one to hold the prototypes and the other to record for each text pattern both its position and the index of a matching prototype. The positions saved can be, for example, the coordinates of the lower left corner of the array representing the pattern.

Initially the prototype set is empty; thus, there can be no matching prototype for the first input pattern, and it is stored as the first prototype. In addition, its position and the index "1" are placed in the second output area. The second character is then compared to the first by means of a correlation test (for which one of many variants may be used). If it matches, then its position and the index 1 are recorded. If there is no match, it is added to the library as a second prototype. The third character is then read and a match sought from among the members of the library. This operation is repeated for each input pattern, in turn, with the pattern being added to the prototype collection whenever a match cannot be found.
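The loop just described reduces to a short sketch in Python; `matches` stands in for the correlation test, which the paper deliberately leaves open to many variants:

```python
def pattern_match(patterns, matches):
    """Cluster input patterns into prototypes.  `patterns` yields
    (position, pattern) pairs in reading order; `matches(a, b)` is the
    correlation test.  Returns the prototype library and, for each
    input, its position plus the 1-based index of its prototype."""
    prototypes = []
    encoded = []
    for pos, pat in patterns:
        for idx, proto in enumerate(prototypes, start=1):
            if matches(pat, proto):
                encoded.append((pos, idx))
                break
        else:                          # no match: pattern becomes a prototype
            prototypes.append(pat)
            encoded.append((pos, len(prototypes)))
    return prototypes, encoded

# Toy run with exact equality as the "correlation test":
protos, coded = pattern_match(
    [((0, 0), "a"), ((10, 0), "b"), ((20, 0), "a")],
    matches=lambda a, b: a == b)
# protos == ["a", "b"];  coded == [((0, 0), 1), ((10, 0), 2), ((20, 0), 1)]
```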

Pattern matching reduces the recognition problem to one of identifying the prototypes, since the prototype character codes can be substituted for the sequence of indexes obtained for the text. The codes and accompanying positions constitute a document encoding that is sufficient for creating a computerized version of the original text or for conducting information retrieval searches within its content. Other merits of the approach are that (1) the pattern-matching scheme is simple to implement, (2) it permits the entry of a wide variety of font styles and sizes, and (3) the recognition logic need not be designed in advance and no design data need be collected.

The prototypes can be identified in several ways. If the document is known by the user to have been printed in any of a number of specified fonts, then prestored OCR recognition logic may be applied [11]. Text in a given language source can be partially recognized using dictionary look-up, tables of bigram or trigram frequencies, etc.; OCR errors can be corrected using such context aids as well. Finally, even if automatic techniques cannot be used, the prototypes can be displayed at a keyboard terminal and identified interactively by the user. The number of keystrokes needed to identify the prototypes is an order of magnitude less than that required to enter typical documents manually.

Pattern matching has previously been used by Nagy [1] for text encoding and by Pratt [12] for facsimile compression. The technique used for pattern matching in both of these systems, however, suffers from one major defect, particularly when the algorithm is implemented by software that is to run on a serial processor. The problem is that each text character is compared pixel-by-pixel with all of the prototypes that are close to it in simple measurements such as width, height, area, etc. As the size of the library increases, many prototypes may qualify, and the time required to do the correlations with a conventional sequential instruction computer can be excessive.

To reduce the amount of computation, the Document Analysis System employs a preclassifying technique described in more detail in [13]. The preclassifier is a decision network whose input is a pattern array and whose output is the index of a possible matching prototype pattern.


Each interior node is a pixel location, and each terminal node is a prototype index. Two branches lead from each interior node. One branch is labeled "black," the other, "white." Starting with the root, a node is accessed, and the pixel location that it specifies is examined in the input array. The color of this pixel determines the branch to the next node. When a terminal node is accessed, the corresponding prototype index is assigned to the input.

Each successive input array is first classified by this network, which examines only a small subset of the pixels before producing an index. The network is initially null, but it is modified each time a new prototype is added to the library. The modification consists of adding an extension to the decision tree (or network) that permits it to distinguish the new prototype as well as all previously selected prototypes.

The preclassification is either verified or nullified by comparing the input pattern with the prototype selected by the network. The two-step operation is repeated with the input array shifted into several registration positions until either a matching prototype is found, or else none is found and the input is designated as a new prototype.
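A minimal sketch of the classify-then-verify step (the data layout is ours, not the paper's, and the registration-shift loop is omitted): interior nodes hold a pixel location with white/black branches, terminal nodes are bare prototype indices, and the chosen prototype is verified by direct comparison.

```python
class Node:
    """Interior node of the decision network: a pixel location and the
    branches taken on a white (0) or black (1) pixel at that location.
    Terminal nodes are represented by bare prototype indices (ints)."""
    def __init__(self, pixel, white, black):
        self.pixel, self.white, self.black = pixel, white, black

def preclassify(root, pattern):
    """Follow pixel colors down the network to a prototype index."""
    node = root
    while isinstance(node, Node):
        r, c = node.pixel
        node = node.black if pattern[r][c] else node.white
    return node

def classify(root, prototypes, pattern):
    """Preclassify, then verify against the selected prototype; return
    the prototype index, or None (a candidate for a new prototype)."""
    idx = preclassify(root, pattern)
    return idx if prototypes[idx] == pattern else None

# Two 2x2 prototypes distinguishable by the top-left pixel alone:
prototypes = {0: [[0, 1], [1, 1]], 1: [[1, 0], [0, 0]]}
root = Node(pixel=(0, 0), white=0, black=1)
print(classify(root, prototypes, [[1, 0], [0, 0]]))  # 1
print(classify(root, prototypes, [[0, 0], [0, 0]]))  # None: verification fails
```

Only one full pattern comparison is performed per lookup, which is the source of the speedup the text describes.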

The advantage of this approach is that only one pattern correlation is needed to classify an input in each registration position used. Thus, the classification is greatly accelerated compared with straightforward pattern matching. The reduction in classification time must be balanced against the time expended in modifying the decision network each time a new class is found and against the possibility of creating extra classes due to preclassification errors made by the network. Considerable effort has gone into developing an efficient network adaptation algorithm.

The design of the preclassifier is based on a scheme in which the pixels of each prototype are categorized as "reliable" or "unreliable." A pixel is defined to be "reliable" for a given prototype if its four horizontal and vertical neighbor pixels are of the same color. Using this categorization, the decision network is constructed such that for any input pattern the sequence of pixels examined has the following properties. First, every pixel examined has the same color in both the input array and in the selected prototype. Second, if these pixels in the input are compared with the corresponding positions in any of the other prototypes, at least one pixel will be both different in color in the two patterns and a "reliable" pixel for the prototype. In this way, the preclassification is based on pixel comparisons away from the edges, where most variation occurs in printed characters.
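Reading the definition as "the four 4-neighbors share the pixel's own color," and treating border pixels as unreliable (both are our interpretation, not stated in the paper), a reliability mask can be computed like this:

```python
def reliable_pixels(proto):
    """Mark pixels whose four horizontal/vertical neighbors all have
    the pixel's own color; such pixels sit inside uniform regions,
    away from the noisy character edges."""
    rows, cols = len(proto), len(proto[0])
    rel = [[False] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            v = proto[r][c]
            rel[r][c] = all(proto[r + dr][c + dc] == v
                            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)))
    return rel

# The center of a solid 3x3 block is reliable; a center touching a
# differently colored neighbor is not:
print(reliable_pixels([[1, 1, 1], [1, 1, 1], [1, 1, 1]])[1][1])  # True
print(reliable_pixels([[1, 0, 1], [1, 1, 1], [1, 1, 1]])[1][1])  # False
```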

Figure 4  Strategies for modifying the decision network to include class k + 1: (a) top extension; (b) bottom extension; and (c) critical node extension.

Three methods of designing the preclassifier have been investigated. The top extension approach (called strategy A) adds pixels to be examined above the root node of the previously designed decision network. The strategy is to examine one or more "reliable" pixels such that the added network can distinguish the new prototype class G(k + 1) from the remaining k classes [see Fig. 4(a)]. The bottom extension approach (called strategy B) retraces the path of G(k + 1) through the decision network as in preclassification, but with the difference that when a pixel that is unreliable in G(k + 1) is encountered, both branches from the node are followed. Thus, the extension routine traverses multiple paths through the network. Each terminal node reached in this way is then replaced by a three-node subgraph. The top node of the subgraph specifies a pixel that is a "reliable" pixel for G(k + 1), and is "reliable" and of opposite value for the prototype P identified by the terminal node. The branches from this node lead to terminal nodes identified with P and with G(k + 1), respectively. The extension procedure is repeated for each terminal node reached, as illustrated in Fig. 4(b).

The third approach (called strategy C) is called "critical node extension." Here, prototype G(k + 1) is entered into network N(k) as in the bottom extension strategy, and its path is traced until it reaches a node n specifying an unreliable pixel in G(k + 1). At this point it is determined which subset of prototypes G(1), G(2), . . ., G(k) can also reach node n. Call this prototype subset H. The branch into node n is reconnected to a subnetwork consisting of a sequence of pixels that are reliable in G(k + 1) and that discriminate G(k + 1) from members of H. The specification of this subnetwork is identical to strategy A except that only members of H, rather than all the prototypes, are discriminated from G(k + 1). The critical node extension is illustrated in Fig. 4(c).

Figure 5  Prototype patterns from the document shown in Fig. 2(b). [The prototype character images are not reproduced legibly in this transcript.]

Each strategy has its advantages and limitations. Strategy A is simplest in concept. Each time a new class is added, a local modification to the network is made at its root. The disadvantage of the approach is that, as more and more classes are added, the structure resembles a chain of subnetworks, each link of which corresponds to a class. If a pattern belonging to a class near the bottom of the chain is entered into the network, then one or more pixels must be examined at each of the higher links. Since the links require more nodes as the number of classes increases, the time required to classify a pattern grows more than linearly with the number of classes.

Strategy B, on the other hand, results in a relatively balanced structure. Starting from a null network, strategy B produces a tree graph, since only subtrees are appended at each step. The exact shape of the tree depends on the prototype pixels, but in general it has a shorter average path length than the network obtained using strategy A. In fact, since any path is increased in length by at most one node each time a class is added, a network designed using strategy B can have no path longer than K nodes when it has been extended to accommodate K prototypes, and typically the growth in average path length is approximately logarithmic. However, strategy B calls for modifying the network in a number of locations. As the network grows, many terminal nodes may have to be extended in order to add a single class. This effect increases both the time required for adaptation and the storage requirement for the network.
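The path-length bound for strategy B can be restated as a one-line recurrence (a worked restatement of the argument above, with L(K) denoting the longest root-to-terminal path after K prototypes have been entered):

```latex
% Each new class lengthens any existing path by at most one node:
L(K) \le L(K-1) + 1, \qquad L(1) = 1
\quad\Longrightarrow\quad L(K) \le K,
% while a well-balanced tree exhibits the typical average behavior
\bar{L}(K) \approx \log_2 K .
```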

Strategy C offers a compromise between the other two approaches. The network is modified only at a single node each time a class is added, but the strategy offers the potential of a more balanced configuration than the chain produced by top extension. In a particular case, if each successive prototype followed a path consisting only of interior pixels, then only terminal nodes would ever be modified, and strategy C would be equivalent to strategy B. On the other hand, if for each successive extension the root node specifies an edge pixel for the new prototype, then strategy C yields the same chain configuration obtained using strategy A. In practice, the results lie between these extremes, as shown below.
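As a concrete illustration, the top extension idea (strategy A) can be sketched in a few lines: a chain of "reliable" pixel tests is prepended above the old root, so that a pattern agreeing with the new prototype at every chain pixel reaches the new terminal, while any disagreement drops it back into the previously built network. The binary-vector prototypes and the greedy pixel selection below are our own simplifications, not the paper's exact procedure.

```python
class Node:
    """One node of the decision network: internal nodes test a single pixel,
    terminal nodes name a prototype class."""
    def __init__(self, pixel=None, cls=None):
        self.pixel = pixel          # pixel index tested here (internal node)
        self.cls = cls              # prototype class id (terminal node)
        self.child = [None, None]   # successors for pixel values 0 and 1

def preclassify(root, pattern):
    """Walk the network, testing one pixel per node, down to a terminal class."""
    node = root
    while node.cls is None:
        node = node.child[pattern[node.pixel]]
    return node.cls

def top_extend(root, old_protos, new_proto, new_cls, reliable):
    """Strategy A (top extension), simplified.  'reliable' lists the pixel
    indices whose values are stable for new_proto (away from the character
    edges).  Pixels are chosen greedily until every old prototype differs
    from new_proto at some chosen pixel.  (Assumes the reliable pixels
    suffice to separate new_proto from every old prototype.)"""
    remaining = dict(old_protos)        # old classes not yet separated
    chain = []
    while remaining:
        best = max(reliable,
                   key=lambda px: sum(p[px] != new_proto[px]
                                      for p in remaining.values()))
        assert any(p[best] != new_proto[best] for p in remaining.values()), \
            "reliable pixels cannot separate the new prototype"
        chain.append(best)
        remaining = {c: p for c, p in remaining.items()
                     if p[best] == new_proto[best]}
    # Build the chain bottom-up: agree with new_proto -> continue toward its
    # terminal; disagree -> fall back into the old network.
    node = Node(cls=new_cls)
    for px in reversed(chain):
        parent = Node(pixel=px)
        parent.child[new_proto[px]] = node
        parent.child[1 - new_proto[px]] = root
        node = parent
    return node
```

Repeated calls to top_extend reproduce the chain structure criticized above: each added class contributes another link that patterns of later classes must traverse before reaching the older part of the network.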

Figure 5 shows the prototype patterns obtained using the proposed text data recognition method on the document shown in Fig. 2(a). While there is some redundancy in the prototype set, only a small part of this is due to preclassifier errors; the major portion is due to failures of the correlation stage to declare similarity between an input and a prototype of the same class. (The threshold for declaring a correlation match has to be set rather tightly to avoid giving different character types the same designation, a more serious error.) A comparison with a conventional pattern-matching algorithm yielded about the same size prototype set. The efficiency advantage of the preclassifier system is indicated in the plot of Fig. 6, in which a typewritten document was the input. An average of only 130 pixels per character pattern was examined by the preclassifier (out of approximately 1000 in each pattern array) using the critical node extension method. In this same experiment, appreciably more CPU time was spent in using the decision network for classification than in constructing it. In the least favorable case, bottom extension, the ratio of preclassification time to adaptation time was 3:1; for the critical node method the ratio rose to 10:1. These data were obtained from simulation experiments programmed in APL and do not provide estimates of the timings obtainable with efficient programming of the method. An effort is currently under way to reprogram the algorithm and to optimize its speed.

Figure 6 Efficiency advantage of the preclassifier: cumulative number of pixels examined in the decision network versus input sequence number (0 to 1500), for the top extension, bottom extension, and critical node strategies.

A side benefit of the preclassification approach has been a recursive technique for the segmentation of touching character patterns (described in detail in [14]). The proper segmentation of closely spaced printed characters is one of the major problems in optical character recognition. The Document Analysis System combines segmentation with the classification of the component patterns. Following the pattern-matching stage, each prototype wider than twice the width of the narrowest prototype undergoes a series of trial segmentations. The two-stage preclassification/correlation technique attempts to find matching prototypes for the trial segments. The recursive algorithm segments a sequence of touching characters only if it can decompose the segment into patterns that positively match previously found prototypes. Because the number of patterns tested for segmentation is small and because the use of the preclassifier makes the process efficient, the computational overhead introduced by recursive segmentation is relatively insignificant. Figure 7 shows a sampling of composite patterns successfully segmented by this technique.
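The decomposition logic can be sketched as a small recursion. Two simplifications in the sketch below: exact column equality stands in for the paper's preclassification/correlation matching, and the restriction of trial segmentation to unusually wide prototypes is omitted.

```python
def segment(pattern, prototypes):
    """Recursively decompose a composite pattern (a list of pixel columns)
    into a sequence of prototype names.  A decomposition is accepted only if
    the whole pattern is covered by positive matches; otherwise return None.
    Exact column equality is an illustrative stand-in for the paper's
    preclassification/correlation matching."""
    if not pattern:
        return []                       # fully decomposed
    for name, proto in prototypes.items():
        w = len(proto)
        if pattern[:w] == proto:        # trial segment matches a prototype
            rest = segment(pattern[w:], prototypes)
            if rest is not None:        # remainder also decomposes
                return [name] + rest
    return None                         # no complete decomposition found
```

The recursion backtracks automatically: if a trial prefix leaves a remainder that cannot be decomposed, the loop moves on to the next candidate prototype, so a composite pattern is split only when every piece matches positively.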

Figure 7 Illustration of the recursive touching character segmentation method: composite input arrays (e.g., "aki," "ai," "al," "an," "di," "fl," "gi," "hi," "Pu") shown with their matching prototypes.

Conclusion
This paper has described some of the requirements for and a system overview of a document analysis system which encodes printed documents to a form suitable for computer processing. Not all the system components mentioned have been designed and built. The initial work on the project has focused on techniques for the separation of a document into regions of text and images and on the recognition of text data.

The image segmentation method has been tested with a variety of printed documents from different sources with one common set of parameters. Performance is satisfactory although misclassifications occur occasionally. With the user in an interactive feedback loop, misclassification errors can be checked and corrected quite readily in the classification table. A more sophisticated pattern discrimination method [10] is being tested to further classify those blocks that are labeled as non-text.

Pattern matching has been investigated as a fundamental approach to reducing a large number of unknown character patterns to a small number of prototypes. A new, highly efficient approach to perform the matching has been implemented by employing an adaptive decision network for preclassification. The reduction of computational load is very important when processing is done by a conventional computer. Preliminary results on trial documents have yielded prototype sets of low redundancy, as well as low rates of misclassification.

Acknowledgments
We wish to acknowledge Y. Takao of IBM Tokyo Scientific Center in Japan for his contribution in writing a package of image-editing software (called EDIT) for our Document Analysis System. We also appreciate the stimulating discussions with G. Nagy during his visit to our laboratory.

References
1. R. N. Ascher, G. M. Koppelman, M. J. Miller, G. Nagy, and G. L. Shelton, Jr., "An Interactive System for Reading Unformatted Printed Text," IEEE Trans. Computers C-20, 1527-1543 (December 1971).
2. A. Steinbach and K. Y. Wong, "An Understanding of Moiré Patterns in the Reproduction of Halftone Images," Proceedings of the Pattern Recognition and Image Processing Conference, Chicago, Aug. 6-8, 1979, pp. 545-552.
3. G. Nagy, "Preliminary Investigation of Techniques for Automated Reading of Unformatted Text," Commun. ACM 11, 480-487 (July 1968).
4. E. Johnstone, "Printed Text Discrimination," Computer Graph. Image Process. 3, 83-89 (1974).
5. F. M. Wahl, K. Y. Wong, and R. G. Casey, "Block Segmentation and Text Extraction in Mixed Text/Image Documents," Research Report RJ 3356, IBM Research Laboratory, San Jose, CA, 1981.
6. F. Wahl, L. Abele, and W. Scherl, "Merkmale für die Segmentation von Dokumenten zur Automatischen Textverarbeitung," Proceedings of the 4th DAGM-Symposium, Hamburg, Federal Republic of Germany, Springer-Verlag, Berlin, 1981.
7. L. Abele, F. Wahl, and W. Scherl, "Procedures for an Automatic Segmentation of Text Graphic and Halftone Regions in Documents," Proceedings of the 2nd Scandinavian Conference on Image Analysis, Helsinki, 1981.
8. F. M. Wahl and K. Y. Wong, "An Efficient Method of Running a Constrained Run Length Algorithm (CRLA) in Vertical and Horizontal Directions on Binary Image Data," Research Report RJ 3438, IBM Research Laboratory, San Jose, CA, 1982.
9. A. Rosenfeld and A. C. Kak, Digital Picture Processing, Academic Press, Inc., New York, 1976, pp. 347-348.
10. F. M. Wahl, "A New Distance Mapping and its Use for Shape Measurement on Binary Patterns," Research Report RJ 3361, IBM Research Laboratory, San Jose, CA, 1982.
11. R. G. Casey and G. Nagy, "Decision Tree Design Using a Probabilistic Model," Research Report RJ 3358, IBM Research Laboratory, San Jose, CA, 1981.
12. W. H. Chen, J. L. Douglas, W. K. Pratt, and R. H. Wallis, "Dual-mode Hybrid Compressor for Facsimile Images," SPIE J. 207, 226-232 (1979).
13. R. G. Casey and K. Y. Wong, "Unsupervised Construction of Decision Networks for Pattern Classification," IBM Research Report, to appear.
14. R. G. Casey and G. Nagy, "Recursive Segmentation and Classification of Composite Character Patterns," presented at the 6th International Conference on Pattern Recognition, Munich, October 1982.

Received May 3, 1982; revised July 7, 1982

Richard G. Casey  IBM Research Division, 5600 Cottle Road, San Jose, California 95193. Dr. Casey has worked primarily in pattern recognition, particularly optical character recognition, since joining IBM in 1963. Until 1970, he carried out projects in this area at the Thomas J. Watson Research Center in Yorktown Heights, New York. After transferring to San Jose, he worked initially on the analysis of data bases, but returned to the character recognition area in 1975, soon after completing a one-year assignment for IBM in the United Kingdom. Dr. Casey received a B.E.E. from Manhattan College in 1954 and an M.S. and Eng.Sc.D. from Columbia University, New York, in 1958 and 1965. From 1957 to 1958, he was a teaching assistant at Columbia University, and from 1958 to 1963, he investigated radar data processing techniques at the Columbia University Electronics Research Laboratories. While on leave of absence from IBM in 1969, he taught at the University of Florida.

Friedrich M. Wahl  Lst. f. Nachrichtentechnik, Technische Universitaet, Arcisstrasse 21, 8 Munich 2, Federal Republic of Germany. Dr. Wahl studied electrical engineering at the Technical University of Munich, where he received his Diplom and his Ph.D. in 1974 and 1980. In 1974 he started a digital image processing group at the Institute of Communication Engineering at the Technical University of Munich. Dr. Wahl worked in the image processing project in the Computer Science Department at the IBM Research laboratory in San Jose, California, for one year, 1981 to 1982, as a visiting scientist. At San Jose, he worked on problems of document analysis systems, image segmentation, shape description, clustering algorithms, and line drawing decomposition. He has also been involved with automatic image inspection for manufacturing. Dr. Wahl is a member of the Institute of Electrical and Electronics Engineers.

Kwan Y. Wong  IBM Research Division, 5600 Cottle Road, San Jose, California 95193. Dr. Wong received his B.E. (with honors) and his M.E. in electrical engineering from the University of New South Wales, Sydney, Australia, in 1960 and 1963. Since he received his Ph.D. from the University of California, Berkeley, in 1966, he has been working at the IBM Research laboratory in San Jose. He was the leader on projects in process control systems, air-jet guiding for flexible media, and image processing. Currently he is the manager of the image processing project in the Computer Science Department. His recent research activities are in the areas of image processing and pattern recognition systems for office automation, manufacturing automated visual inspection systems, special image processing hardware, and algorithms.
