document engineering research paper
K. Y. Wong
R. G. Casey
F. M. Wahl
Document Analysis System
This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in
encoding printed documents for computer processing. Several critical functions have been investigated and the technical
approaches are discussed. The first is the segmentation and classification of digitized printed documents into regions of text
and images. A nonlinear, run-length smoothing algorithm has been used for this purpose. By using the regular features of text
lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an
adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A
preclassifier is constructed during the input process and used to speed up a well-known pattern-matching method for clustering
characters from an arbitrary print source into a small sample of prototypes. Experimental results are included.
Introduction
The decreasing cost of hardware will eventually make commonplace
the storage and distribution of documents by
electronic means. However, today most documents are being
saved, distributed, and presented on paper. Paper is the
primary medium for books, journals, newspapers, magazines,
and business correspondence. This paper describes an
experimental Document Analysis System, which assists a
user in the encoding of printed documents for computer
processing.
A document analysis system could be applied to extract
information from printed documents to create data bases
(e.g., patents, court decisions). Such systems could also
create computer document files from papers submitted to
journals and assist in revising books and manuals. In addition,
a document analysis system could provide a general
data compression facility, since encoding optically scanned
text from a bit-map into symbol codes greatly reduces the
cost of storing or transmitting the information.
Existing commercial systems cannot adequately convert
unconstrained printed documents into a format suitable for
computer processing. Data entry by operator keying is not
only expensive and time-consuming but also precludes the
encoding of graphics and images contained in the document.
Current optical character recognition (OCR) machines only
recognize certain preprogrammed fonts; they do not accept
documents containing a mixture of text and images. The
Document Analysis System thus provides a new capability for
converting information stored on printed documents, including
both images and text, to computer-processable form.
This paper is organized into three sections. The first
section gives an overview of the system requirements. Not all
the components described in the system have been designed
and tested yet. The next two sections present the results of
the technical investigations of some of the key components:
the segmentation of a scanned document into regions of text
and images, and the recognition of text. Initial experimental
data are included to illustrate these two methods.
System requirements and overall structure
The design of an interactive computer system to read printed
documents was investigated by Nagy [1]. Some of the system
concepts used by him are reflected in the Document Analysis
System, but the solutions in most areas are different.
© Copyright 1982 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of
royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on
the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by
computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the
Editor.
K. Y. WONG ET AL. IBM J. RES. DEVELOP. VOL. 26 NO. 6 NOVEMBER 1982
Figure 1 System overview.
The system is designed to have the following capabilities:
• Manual editing of scanned documents (e.g., functions such
as crop, copy, merge, erase, save, etc.).
• Separating a document into regions of text and images
automatically.
• Representing text by standard symbols such as EBCDIC,
ASCII, etc., possibly with the assistance of user interaction.
Automatic character recognition will be provided for
documents printed in certain conventional fonts.
• Checking the recognized text against a dictionary.
• Enhancing the images and converting graphics from bit-maps
into vector format.
• Analyzing the page layout so that appropriate typographical
commands can be generated automatically or interactively
to reproduce the document with printer subsystems.
• Allowing the user to customize the system for his particular
needs.
Figure 1 shows the modules of the proposed Document
Analysis System. The shaded modules have been implemented
or investigated in detail. The preliminary version of
the system permits meaningful experiments but does not yet
provide full interactive capability.
Documents are scanned as black/white images with a
commercial laser scanner at 240 pixels per inch into a
Series/1 minicomputer, which then transfers the data over a
network to the host IBM 3033 computer. The display device
is a Tektronix 618 storage tube attached to an IBM 3277
terminal, which communicates with the host. The system can
automatically separate a document into regions of text,
graphics, and halftone images. The technique for doing this
is presented in the next section. Alternatively, the user can
choose to perform the same function manually by interactively
editing the image. The user can invoke the Image
Manipulation Editor (called IEDIT), which reads and stores
images on disk and provides a variety of image-editing
functions (e.g., copy, move, erase, magnify, reduce, rotate,
scroll, extract subimage, and save).
The blocks containing text can be further analyzed with a
pattern-matching program that groups similar symbols from
the document and creates a prototype pattern to represent
each group. The prototype patterns are then identified either
manually using an interactive display or by automatic
recognition logic if this is available for the fonts used in the
document. Inclusion of a dictionary-checking technique is
planned to help correct errors incurred during recognition.
Other functions will be provided in future extensions to the
system. For instance, the graphics and halftone images
separated from the document may need further processing in
certain applications. Thus, in a maintenance manual the text
labels may need revision without changing the line drawing
itself, and vice versa. In such cases, text information must be
extracted from the image. Also, due to spatial digitization
errors, scanned line drawings and graphics generally have
missing or extra dots along edges. Consequently, after scanning
and thresholding, identical lines may vary in width in
the binary bit-maps. Such errors can be corrected with an
image enhancement function. Scanned halftone images may
also have moiré patterns [2]. Additional processing functions
are required to remove the moiré patterns and convert the
images into gray scale format so that conventional digital
halftoning techniques can be used to reproduce the image. In
another application, existing line drawings and graphics may
have to be encoded into an engineering data base. Here,
conversion of the drawing from bit-maps into vector or
polynomial representation of objects is necessary.
In order to produce the proper sequential list for the text
blocks, a layout analysis is required. For example, the text
blocks would be linked differently for a single-column than
for a multiple-column text page. If the document is to be
reproduced exactly, control codes for indenting, paragraphing,
and line spacing must be supplied. Finally, typesetting
commands may be required to reproduce the document with
a printer subsystem.
Block segmentation and text discrimination
Some earlier approaches to segmentation and discrimination
[3, 4] required knowledge of the character size in order to
separate a document into text and nontext areas. The method
developed for the Document Analysis System consists of two
steps. First, a segmentation procedure subdivides the area of
a document into regions (blocks), each of which should
contain only one type of data (text, graphic, halftone image,
etc.). Next, some basic features of these blocks are calculated.
A linear classifier which adapts itself to varying
character heights discriminates between text and images.
Uncertain classifications are resolved by means of an additional
classification process using more powerful, complex
feature information. This method is outlined below; for a
more detailed description see [5].
Figure 2(a) shows an example of a document composed of
text, graphics, halftone images, and solid black lines. A
run-length smoothing algorithm (RLSA) had been used
earlier by one of the authors [6, 7] to detect long vertical and
horizontal white lines. This algorithm has been extended to
obtain a bit-map of white and black areas representing
blocks containing the various types of data [see Fig. 2(d)].
The basic RLSA is applied to a binary sequence in which
white pixels are represented by 0's and black pixels by 1's.
The algorithm transforms a binary sequence x into an output
sequence y according to the following rules:
1. 0's in x are changed to 1's in y if the number of adjacent
0's is less than or equal to a predefined limit C.
2. 1's in x are unchanged in y.
For example, with C = 4 the sequence x is mapped into y as
follows:
x: 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
y: 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1
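The two rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `rlsa_1d` is hypothetical, and it assumes (as the worked example suggests) that white runs touching either end of the sequence are treated like any other run.

```python
def rlsa_1d(x, c):
    """Run-length smoothing of a binary sequence (0 = white, 1 = black).

    Every run of 0's whose length is <= c becomes 1's (rule 1);
    1's are copied unchanged (rule 2).
    """
    y = list(x)
    n = len(x)
    i = 0
    while i < n:
        if x[i] == 1:
            i += 1
            continue
        j = i
        while j < n and x[j] == 0:   # find the end of this white run
            j += 1
        if j - i <= c:               # short run: fill it with black
            y[i:j] = [1] * (j - i)
        i = j
    return y


# The example sequence from the text, with C = 4.
x = [int(ch) for ch in "00010000010100001000000011000"]
print("".join(str(v) for v in rlsa_1d(x, 4)))
```

With C = 4 the runs of length 3, 1, and 4 are filled, while the runs of length 5 and 7 are left white, which reproduces the y sequence shown in the example.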
When applied to pattern arrays, the RLSA has the effect of
linking together neighboring black areas that are separated
by C or fewer pixels. With an appropriate choice of C, the
linked areas will be regions of a common data type. The
degree of linkage depends on C, the distribution of white and
black in the document, and the scanning resolution. (The
Document Analysis System scans at 240 pixels per inch.)
The RLSA is applied row-by-row as well as column-by-column
to a document, yielding two distinct bit-maps.
Because spacings of document components tend to differ
horizontally and vertically, different values of C are used for
row (C = 300) and column (C = 500) processing. Figures
2(b) and 2(c) show the results of applying the RLSA in the
horizontal and in the vertical directions of Fig. 2(a). The two
bit-maps are then combined in a logical AND operation.
Additional horizontal smoothing using the RLSA (C = 30)
produces the final segmentation result illustrated in Fig.
2(d). Reference [8] describes in more detail an efficient
two-pass algorithm for the implementation of the RLSA in
two dimensions.
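The row pass, column pass, AND, and final horizontal pass described above can be sketched as follows. This is a straightforward illustration rather than the efficient two-pass algorithm of [8]; the function names and the list-of-lists image representation are assumptions, and the tiny thresholds in the usage lines stand in for the values 300, 500, and 30 quoted in the text.

```python
def rlsa_1d(x, c):
    """Fill runs of 0's of length <= c with 1's (0 = white, 1 = black)."""
    y = list(x)
    i, n = 0, len(x)
    while i < n:
        if x[i] == 1:
            i += 1
            continue
        j = i
        while j < n and x[j] == 0:
            j += 1
        if j - i <= c:
            y[i:j] = [1] * (j - i)
        i = j
    return y


def segment_blocks(image, c_row, c_col, c_final):
    """Return the block bit-map for a binary image given as a list of rows."""
    # Horizontal smoothing, one row at a time.
    by_rows = [rlsa_1d(row, c_row) for row in image]
    # Vertical smoothing: smooth each column, then transpose back.
    smoothed_cols = [rlsa_1d(list(col), c_col) for col in zip(*image)]
    by_cols = [list(row) for row in zip(*smoothed_cols)]
    # Logical AND of the two bit-maps.
    combined = [[a & b for a, b in zip(ra, rb)]
                for ra, rb in zip(by_rows, by_cols)]
    # Additional horizontal smoothing of the combined bit-map.
    return [rlsa_1d(row, c_final) for row in combined]


image = [[1, 0, 0, 1],
         [0, 0, 0, 0],
         [1, 0, 0, 1]]
blocks = segment_blocks(image, c_row=2, c_col=1, c_final=2)
print(blocks)
```

In this toy image the two black pixels in each outer row are linked into one horizontal block, while the all-white middle row keeps the two blocks apart.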
The blocks shown in Fig. 2(d) must next be located and
classified according to content. To provide identification for
subsequent processing, a unique label is assigned to each
block. This is done by a labeling technique described in [9].
Simultaneously with block labeling, the following measurements
are taken:
Figure 2 (a) Block segmentation example of a mixed text/image document, which here is the original digitized document. (b) and (c) Results
of applying the RLSA in the horizontal and vertical directions. (d) Final result of block segmentation. (e) Results for blocks considered to be text
data (class 1).
• Total number of black pixels in a segmented block (BC).
• Minimum x-y coordinates of a block and its x, y lengths
($x_{\min}$, $\Delta x$, $y_{\min}$, $\Delta y$).
• Total number of black pixels in original data from the
block (DC).
• Horizontal white-black transitions of original data (TC).
These measurements are stored in a table (see Table 1) and
are used to calculate the following features:
1. The height of each block segment: $H = \Delta y$.
2. The eccentricity of the rectangle surrounding the block:
$E = \Delta x / \Delta y$.
3. The ratio of the number of block pixels to the area of the
surrounding rectangle: $S = BC / (\Delta x \, \Delta y)$. If S is close to
one, the block segment is approximately rectangular.
4. The mean horizontal length of the black runs of the
original data from each block: $R = DC / TC$.
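As a worked illustration of the four features, the sketch below computes H, E, S, and R from one set of block measurements. The function name and dictionary layout are hypothetical; the sample numbers are the values that appear to form the first row of Table 1 (BC = 702, Δx = 68, Δy = 23, DC = 302, TC = 76), read from this copy of the paper and therefore to be treated as illustrative.

```python
def block_features(bc, dx, dy, dc, tc):
    """Compute the block features from the labeling measurements.

    bc: black pixels in the segmented (smoothed) block
    dx, dy: x and y lengths of the surrounding rectangle
    dc: black pixels of the original data inside the block
    tc: horizontal white-black transitions of the original data
    """
    return {
        "H": dy,                 # block height
        "E": dx / dy,            # eccentricity of the rectangle
        "S": bc / (dx * dy),     # fill ratio of the rectangle
        "R": dc / tc,            # mean horizontal black run length
    }


features = block_features(bc=702, dx=68, dy=23, dc=302, tc=76)
for name, value in features.items():
    print(f"{name} = {value:.3f}")
```

For these inputs the block is 23 pixels high with a mean black run length just under 4 pixels, which falls inside the text cluster range quoted later in the section.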
These features are used to classify the block. Because text
is the predominating data type in a typical office document
and text lines are basically textured stripes of approximately
a constant height H and mean length of black runs R, text
blocks tend to cluster with respect to these features. This is
illustrated by plotting block segments in the R-H plane (see
Fig. 3). Each table entry is equal to the number of block
segments in the corresponding range of R and H. Thus, such
a plot can be considered as a two-dimensional histogram.
The text lines of the document shown in Fig. 2(a) form a
clustered population within the range 20 < H < 35 and 2 < R
< 8. The two solid black lines in the lower right part of the
original document have high R and low H values in the R-H
plane, whereas the graphic and halftone images have high
values of H. Note that the scale used in Fig. 3 is highly
nonlinear.
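A two-dimensional histogram over nonlinear bins, of the kind described above, can be sketched with the standard library. The bin edges and sample blocks below are illustrative assumptions only; they are not the edges used in Fig. 3.

```python
from bisect import bisect_right
from collections import Counter


def rh_histogram(blocks, r_edges, h_edges):
    """Count blocks per (R bin, H bin).

    blocks: iterable of (R, H) pairs
    r_edges, h_edges: increasing bin edges; bin k holds values v with
    edges[k-1] <= v < edges[k], bin 0 lies below the first edge.
    """
    counts = Counter()
    for r, h in blocks:
        counts[(bisect_right(r_edges, r), bisect_right(h_edges, h))] += 1
    return counts


# Hypothetical edges and blocks: two text-like blocks and one thin line.
r_edges = [2, 4, 8]
h_edges = [20, 35]
blocks = [(3.9, 23), (4.7, 35), (33.0, 13)]
hist = rh_histogram(blocks, r_edges, h_edges)
print(dict(hist))
```

Unequal bin widths are handled naturally because each value is located by binary search in the edge list rather than by dividing by a fixed bin size.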
The mean value of block height, $H_m$, and the block mean
black pixel run length, $R_m$, for the text cluster may vary for
different types of documents, depending on character size
and font. Furthermore, the text cluster's standard deviations
$\sigma(H_m)$ and $\sigma(R_m)$ may also vary depending on whether a
document is in a single font or multiple fonts and character
sizes. To permit self-adjustment of the decision boundaries
for text discrimination, estimates are calculated for the mean
values $H_m$ and $R_m$ of blocks from a tightly defined text region
of the R-H plane. Additional heuristic rules are applied to
confirm that each such block is likely to be text before it is
included in the cluster. The members of the cluster are then
used to estimate new bounds on the features to detect
additional text blocks [5]. Finally, a variable, linear, separable
classification scheme assigns the following four classes to
the blocks:
Class 1 Text: $R < C_{21} R_m$ and $H < C_{22} H_m$.
Class 2 Horizontal solid black lines: $R > C_{21} R_m$ and $H < C_{22} H_m$.
Class 3 Graphic and halftone images: $E > 1/C_{23}$ and $H > C_{22} H_m$.
Class 4 Vertical solid black lines: $E < 1/C_{23}$ and $H > C_{22} H_m$.
Values have been assigned to the parameters based on
several training documents. With $C_{21} = 3$, $C_{22} = 3$, and
$C_{23} = 5$ the outlined method has been tested on a number of test
documents with satisfactory performance. Figure 2(e) shows
the result of the blocks which are considered text data (class
1) of the original document in Fig. 2(a).
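The four-class rule can be written as a short decision function. This is a sketch under stated assumptions: the function name is hypothetical, the cluster means `r_m` and `h_m` are supplied by the caller (their estimation is described in [5] and is not reproduced here), and the handling of exact equality is a choice made for this sketch because the text gives only strict inequalities.

```python
def classify_block(r, h, e, r_m, h_m, c21=3.0, c22=3.0, c23=5.0):
    """Assign class 1-4 to a block from its features R, H, E.

    r_m, h_m: mean run length and mean height of the text cluster.
    Boundary cases (exact equality) fall into classes 2 and 4 here;
    that tie-breaking is an assumption of this sketch.
    """
    if h < c22 * h_m:
        # Low blocks: text if the black runs are short, else a horizontal line.
        return 1 if r < c21 * r_m else 2
    # Tall blocks: images if reasonably wide, else a vertical line.
    return 3 if e > 1.0 / c23 else 4


# Hypothetical cluster means, chosen only for illustration.
R_M, H_M = 4.0, 30.0
print(classify_block(r=302 / 76, h=23, e=68 / 23, r_m=R_M, h_m=H_M))      # text-like block
print(classify_block(r=5608 / 170, h=13, e=771 / 13, r_m=R_M, h_m=H_M))   # long black runs
```

Because the thresholds are multiples of the cluster means rather than fixed pixel counts, the same rule adapts to documents printed in larger or smaller type.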
There are certain limitations to the block segmentation
and text discrimination method described so far. On some
documents, text lines are linked together by the block
segmentation algorithm due to small line-to-line spacing, and
thus are assigned to class 3. A line of characters exceeding 3
times $H_m$ (with $C_{22} = 3$), such as a title or heading in a
Table 1 Processing result of document in Fig. 2. Columns 1 to 7 contain the results of the measurements performed simultaneously
with labeling. The last column is the text classification result.
    BC  xmin    Δx  ymin    Δy      DC     TC Class
   702   995    68  2341    23     302     76     1
  6089  1090   771  2266    13    5608    170     2
  6387   307   265  2245    32    1142    396     1
 15396   307   657  2208    31    3118   1005     1
  9341  1090   366  2184    31    2469    580     1
 16706   185   779  2171    24    3489   1070     1
 19447  1090   771  2147    35    5181   1110     1
  9244   185   385  2098    31    1780    528     1
  3244  1401   152  2077    23    1587    303     1
 18502   185   779  2060    32    4112   1183     1
 17592   185   779  2024    30    3613   1018     1
  5667  1091   770  2008    13    5394     72     2
 12300   185   474  1947    28    4751    735     1
 19318  1144   717  1923    35    4436   1155     1
 19252  1101   760  1887    32    3981   1059     1
 11101   188   483  1830    29    2713    756     1
  1258  1144   171  1821    22     434    123     1
 20200  1082   779  1782    34    4349   1229     1
276147   188   780  1037   766   47742   8738     3
418268  1084   780  1209   546  195328  59906     3
 10935  1088   503  1117    30    2139    648     1
 18370  1088   773  1080    31    3406    996     1
 19120  1088   773  1043    32    3632   1046     1
  3434  1205   126   969    32     770    226     1
 13694   193   489   954    30    2251    758     1
 17336  1088   773   934    31    3417   1012     1
 21586   193   783   917    31    3574   1164     1
 17601  1088   773   897    32    3580   1040     1
 19174   193   779   880    31    3437   1165     1
 23306   193   778   840    31    4167   1256     1
 15500  1088   773   824    32    3544   1038     1
 20263   193   778   804    30    3800   1183     1
 16619  1088   773   788    25    3310    989     1
 10112   193   393   731    31    2911    666     1
 16777  1088   773   705    31    3749   1083     1
385911  1087   778   175   504  175114  51527     3
 19512   193   777   643    29    4046   1205     1
406563   193   777    75   538   63103  12598     3
 10467  1089   477   115    28    1835    542     1
 18717  1089   776    78    30    3427    968     1
document, is assigned to class 3. An isolated line of text
printed vertically, e.g., a text description of a diagram along
the vertical axis, may be classified as a number of small text
blocks, or else as a class 3 block.
Consequently, in order to further classify different data
types within class 3, a second discrimination based on shape
factors [10] is used. The method uses measurements of
border-to-border distance within an arbitrary pattern. These
measurements can be used to calculate meaningful features
like the “line-shapeness” or “compactness” of objects within
binary images. While the calculations are complex and
time-consuming, they are done only for class 3, and thus add
only a small increment to the overall processing time. This
hierarchical decision procedure, in which “easy” class
Figure 3 Feature histogram (R-H plane) of the document shown in Fig. 2. (Horizontal axis: mean run length of black pixels, R.)
assignments are made quickly, while the difficult ones are
deferred to a more complex process, is a promising approach
to analyzing the immense amount of data in a scanned
document.
Recognition of text data
Hundreds of font styles and sizes are available for the
printing of documents. Mathematical symbols, foreign characters,
and other special typographic elements may also
appear. The tremendous range of possible input material for
the Document Analysis System makes it impractical to
attempt to design completely automatic recognition logic.
Instead, an algorithm is used to collect examples (called
“prototypes”) of the various patterns in the text.
The procedure, known as “pattern matching,” is conducted
as follows. The text is examined, character by character,
in left-to-right, top-to-bottom sequence. Two output
storage areas are maintained, one to hold the prototypes and
the other to record for each text pattern both its position and
the index of a matching prototype. The positions saved can
be, for example, the coordinates of the lower left corner of the
array representing the pattern.
Initially the prototype set is empty; thus, there can be no
matching prototype for the first input pattern, and it is stored
as the first prototype. In addition, its position and the index
“1” are placed in the second output area. The second
character is then compared to the first by means of a
correlation test (for which one of many variants may be
used). If it matches, then its position and the index 1 are
recorded. If there is no match, it is added to the library as a
second prototype. The third character is then read and a
match sought from among the members of the library. This
operation is repeated for each input pattern, in turn, with the
pattern being added to the prototype collection whenever a
match cannot be found.
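The pattern-matching loop just described can be sketched as follows. The correlation test here is a plain count of differing pixels between equally sized arrays, which is only one of the many variants the text alludes to; the function names and the `max_mismatch` threshold are assumptions of this sketch.

```python
def mismatch(a, b):
    """Number of differing pixels, or None if the arrays differ in size."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return None
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def match_patterns(patterns, max_mismatch):
    """Cluster character arrays into prototypes.

    patterns: iterable of (position, array) in reading order.
    Returns (prototypes, records); records holds (position, index)
    with prototype indexes starting at 1, as in the text.
    """
    prototypes = []
    records = []
    for position, array in patterns:
        index = None
        for k, proto in enumerate(prototypes, start=1):
            d = mismatch(array, proto)
            if d is not None and d <= max_mismatch:
                index = k
                break
        if index is None:              # no match: the input becomes a prototype
            prototypes.append(array)
            index = len(prototypes)
        records.append((position, index))
    return prototypes, records


A = [[1, 0], [0, 1]]
B = [[1, 1], [1, 1]]
protos, recs = match_patterns([((0, 0), A), ((5, 0), B), ((10, 0), A)], max_mismatch=0)
print(len(protos), recs)
```

Note that every input is compared against the library in sequence, which is exactly the cost that grows with library size and motivates the preclassifier discussed later.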
Pattern matching reduces the recognition problem to one
of identifying the prototypes, since the prototype character
codes can be substituted for the sequence of indexes obtained
for the text. The codes and accompanying positions constitute
a document encoding that is sufficient for creating a
computerized version of the original text or for conducting
information retrieval searches within its content. Other
merits of the approach are that (1) the pattern-matching
scheme is simple to implement, (2) it permits the entry of a
wide variety of font styles and sizes, and (3) the recognition logic
need not be designed in advance and no design data need be
collected.
The prototypes can be identified in several ways. If the
document is known by the user to have been printed in any of
a number of specified fonts, then prestored OCR recognition
logic may be applied [11]. Text in a given language source
can be partially recognized using dictionary look-up, tables
of bigram or trigram frequencies, etc.; OCR errors can be
corrected using such context aids as well. Finally, even if
automatic techniques cannot be used, the prototypes can be
displayed at a keyboard terminal and identified interactively
by the user. The number of keystrokes needed to identify the
prototypes is an order of magnitude less than that required to
enter typical documents manually.
Pattern matching has previously been used by Nagy [1]
for text encoding and by Pratt [12] for facsimile compression.
The technique used for pattern matching in both of
these systems, however, suffers from one major defect,
particularly when the algorithm is implemented by software
that is to run on a serial processor. The problem is that each
text character is compared pixel-by-pixel with all of the
prototypes that are close to it in simple measurements such as
width, height, area, etc. As the size of the library increases,
many prototypes may qualify, and the time required to do the
correlations with a conventional sequential instruction computer
can be excessive.
To reduce the amount of computation, the Document
Analysis System employs a preclassifying technique
described in more detail in [13]. The preclassifier is a
decision network whose input is a pattern array and whose
output is the index of a possible matching prototype pattern.
Each interior node is a pixel location, and each terminal node
is a prototype index. Two branches lead from each interior
node. One branch is labeled “black,” the other, “white.”
Starting with the root, a node is accessed, and the pixel
location that it specifies is examined in the input array. The
color of this pixel determines the branch to the next node.
When a terminal node is accessed, the corresponding prototype
index is assigned to the input.
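The traversal described above can be sketched with two small node types. The class and function names are hypothetical, and the hand-built three-prototype network in the usage lines is purely illustrative.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    index: int                      # prototype index at a terminal node


@dataclass
class Node:
    row: int                        # pixel location examined at this node
    col: int
    white: Union["Node", Leaf]      # branch taken when the pixel is 0
    black: Union["Node", Leaf]      # branch taken when the pixel is 1


def preclassify(root, array):
    """Walk from the root to a terminal node and return its prototype index."""
    node = root
    while isinstance(node, Node):
        node = node.black if array[node.row][node.col] == 1 else node.white
    return node.index


# A hand-built network that examines at most two pixels.
network = Node(0, 0,
               white=Leaf(1),
               black=Node(1, 1, white=Leaf(2), black=Leaf(3)))
print(preclassify(network, [[1, 0], [0, 1]]))
```

Only the pixels named along one root-to-leaf path are read, so the cost of preclassification depends on path length rather than on the size of the array or the number of prototypes.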
Each successive input array is first classified by this
network, which examines only a small subset of the pixels
before producing an index. The network is initially null, but
it is modified each time a new prototype is added to the
library. The modification consists of adding an extension to
the decision tree (or network) that permits it to distinguish
the new prototype as well as all previously selected prototypes.
The preclassification is either verified or nullified by
comparing the input pattern with the prototype selected by
the network. The two-step operation is repeated with the
input array shifted into several registration positions until
either a matching prototype is found, or else none is found
and the input is designated as a new prototype.
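The two-step operation (preclassify, then verify by a single correlation, repeated over registration shifts) might look like the sketch below. Everything named here is an assumption made for illustration: the preclassifier is passed in as a plain function, the correlation is a pixel-mismatch count, shifted-in pixels are white, and the list of shifts is chosen by the caller.

```python
def shift(array, dr, dc):
    """Translate a binary array by (dr, dc), filling vacated pixels with white."""
    h, w = len(array), len(array[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            sr, sc = r - dr, c - dc
            if 0 <= sr < h and 0 <= sc < w:
                out[r][c] = array[sr][sc]
    return out


def classify(array, prototypes, preclassify, max_mismatch, shifts):
    """Return a verified prototype index (1-based), or None for a new prototype."""
    for dr, dc in shifts:
        candidate = shift(array, dr, dc)
        index = preclassify(candidate)        # step 1: decision network
        if index is None:
            continue
        proto = prototypes[index - 1]
        d = sum(p != q for rp, rq in zip(candidate, proto)
                for p, q in zip(rp, rq))      # step 2: one correlation
        if d <= max_mismatch:
            return index
    return None


prototypes = [[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]
toy_network = lambda a: 1 if a[1][1] == 1 else None   # stand-in preclassifier
off_centre = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
print(classify(off_centre, prototypes, toy_network, 0, [(0, 0), (0, -1), (0, 1)]))
```

In each registration position at most one prototype is correlated against the input, which is the source of the speed-up over comparing against every qualifying prototype.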
The advantage of this approach is that only one pattern
correlation is needed to classify an input in each registration
position used. Thus, the classification is greatly accelerated
compared with straightforward pattern matching. The
reduction in classification time must be balanced against the
time expended in modifying the decision network each time
a new class is found and against the possibility of creating
extra classes due to preclassification errors made by the
network. Considerable effort has gone into developing an
efficient network adaptation algorithm.
The design of the preclassifier is based on a scheme in
which the pixels of each prototype are categorized as “reliable”
or “unreliable.” A pixel is defined to be “reliable” for a
given prototype if its four horizontal and vertical neighbor
pixels are of the same color. Using this categorization, the
decision network is constructed such that for any input
pattern the sequence of pixels examined has the following
properties. First, every pixel examined has the same color in
both the input array and in the selected prototype. Second, if
these pixels in the input are compared with the corresponding
positions in any of the other prototypes, at least one pixel will
be both different in color in the two patterns and a “reliable”
pixel for the prototype. In this way, the preclassification is
based on pixel comparisons away from the edges, where most
variation occurs in printed characters.
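The "reliable pixel" definition translates directly into a mask computation. The sketch below assumes that pixels outside the array are white, which is not stated in the text; the function name is hypothetical.

```python
def reliable_mask(proto):
    """Mark each pixel whose four horizontal/vertical neighbours share its colour.

    Pixels outside the array are treated as white (0); this border
    convention is an assumption of the sketch.
    """
    h, w = len(proto), len(proto[0])

    def at(r, c):
        return proto[r][c] if 0 <= r < h and 0 <= c < w else 0

    neighbours = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return [[all(at(r + dr, c + dc) == proto[r][c] for dr, dc in neighbours)
             for c in range(w)]
            for r in range(h)]


solid = [[1, 1, 1],
         [1, 1, 1],
         [1, 1, 1]]
for row in reliable_mask(solid):
    print(row)
```

For a solid black square only the centre pixel is reliable under this border convention, which matches the intent of basing decisions on pixels away from character edges.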
Three methods of designing the preclassifier have been
investigated. The top extension approach (called strategy A)
Figure 4 Strategies for modifying the decision network to include class k + 1: (a) top extension; (b) bottom extension; and (c) critical
node extension.
adds pixels to be examined above the root node of the
previously designed decision network. The strategy is to
examine one or more “reliable” pixels such that the added
network can distinguish the new prototype class G(k + 1)
from the remaining k classes [see Fig. 4(a)]. The bottom
extension approach (called strategy B) retraces the path of
G(k + 1) through the decision network as in preclassification,
but with the difference that when a pixel that is
unreliable in G(k + 1) is encountered, both branches from
the node are followed. Thus, the extension routine traverses
multiple paths through the network. Each terminal node
reached in this way is then replaced by a three-node
subgraph. The top node of the subgraph specifies a pixel that
is a “reliable” pixel for G(k + 1), and is “reliable” and of
opposite value for the prototype P identified by the terminal
node. The branches from this node lead to terminal nodes
identified with P and with G(k + 1), respectively. The
extension procedure is repeated for each terminal node
reached, as illustrated in Fig. 4(b).
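A rough sketch of bottom extension (strategy B) as described above is given below. It is an interpretation, not the algorithm of [13]: the names are hypothetical, pixels outside an array are assumed white, the first suitable discriminating pixel in scan order is used, and the sketch simply raises an error if no pixel is reliable and of opposite color in both patterns.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    index: int


@dataclass
class Node:
    row: int
    col: int
    white: Union["Node", Leaf]
    black: Union["Node", Leaf]


def reliable(p, r, c):
    """True if the four neighbours of (r, c) share its colour (outside = white)."""
    h, w = len(p), len(p[0])
    at = lambda i, j: p[i][j] if 0 <= i < h and 0 <= j < w else 0
    return all(at(r + dr, c + dc) == p[r][c]
               for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)))


def split_leaf(leaf, new, new_index, protos):
    """Replace a terminal node by a three-node subgraph separating old and new."""
    old = protos[leaf.index]
    for r in range(len(new)):
        for c in range(len(new[0])):
            if new[r][c] != old[r][c] and reliable(new, r, c) and reliable(old, r, c):
                a, b = Leaf(leaf.index), Leaf(new_index)
                return Node(r, c, a, b) if old[r][c] == 0 else Node(r, c, b, a)
    raise ValueError("no reliable discriminating pixel")


def bottom_extend(tree, new, new_index, protos):
    """Extend the network so that it also distinguishes prototype new_index."""
    if tree is None:                       # null network: a single terminal node
        return Leaf(new_index)
    if isinstance(tree, Leaf):
        return split_leaf(tree, new, new_index, protos)
    both = not reliable(new, tree.row, tree.col)   # unreliable: follow both branches
    colour = new[tree.row][tree.col]
    if both or colour == 0:
        tree.white = bottom_extend(tree.white, new, new_index, protos)
    if both or colour == 1:
        tree.black = bottom_extend(tree.black, new, new_index, protos)
    return tree


blank = [[0] * 5 for _ in range(5)]
plus = [[0] * 5 for _ in range(5)]
for r, c in ((2, 2), (1, 2), (3, 2), (2, 1), (2, 3)):
    plus[r][c] = 1
protos = {1: blank, 2: plus}
tree = bottom_extend(None, blank, 1, protos)
tree = bottom_extend(tree, plus, 2, protos)
print(tree)
```

In this toy case the only pixel that is reliable in both patterns and differs in color is the centre of the plus sign, so the single terminal node is replaced by one interior node with two terminal nodes.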
The third approach (called strategy C) is called “critical
node extension.” Here, prototype G(k + 1) is entered into
Figure 5 Prototype patterns from the document shown in Fig. 2(a).
network N(k) as in the bottom extension strategy, and its
path is traced until it reaches a node n specifying an
unreliable pixel in G(k + 1). At this point it is determined
which subset of prototypes G(1), G(2), ..., G(k) can also
reach node n. Call this prototype subset H. The branch into
node n is reconnected to a subnetwork consisting of a
sequence of pixels that are reliable in G(k + 1) and that
discriminate G(k + 1) from members of H. The specification
of this subnetwork is identical to strategy A except that only
members of H, rather than all the prototypes, are discriminated
from G(k + 1). The critical node extension is illustrated
in Fig. 4(c).
Each strategy has its advantages and limitations. Strategy
A is simplest in concept. Each time a new class is added, a
local modification to the network is made at its root. The
disadvantage of the approach is that, as more and more
classes are added, the structure resembles a chain of subnetworks,
each link of which corresponds to a class. If a pattern
belonging to a class near the bottom of the chain is entered
into the network, then one or more pixels must be examined
at each of the higher links. Since the links require more nodes
as the number of classes increases, the time required to
classify a pattern grows more than linearly with the number
of classes.
Strategy B, on the other hand, results in a relatively
balanced structure. Starting from a null network, strategy B
produces a tree graph, since only subtrees are appended at
each step. The exact shape of the tree depends on the
prototype pixels, but in general it has a shorter average path
length than the network obtained using strategy A. In fact,
since any path is increased in length by at most one node each
time a class is added, a network designed using strategy
B can have no path longer than K nodes when it has been
extended to accommodate K prototypes, and typically the
growth in average path length is approximately logarithmic.
However, strategy B calls for modifying the network in a
number of locations. As the network grows, many terminal
nodes may have to be extended in order to add a single class.
This effect increases both the time required for adaptation
and the storage requirement for the network.
Strategy C offers a compromise between the other two
approaches. The network is modified only at a single node
each time a class is added, but the strategy offers the
potential of a more balanced configuration than the chain
produced by top extension. In a particular case, if each
successive prototype followed a path consisting only of
interior pixels, then only terminal nodes would ever be
modified, and strategy C would be equivalent to strategy B.
On the other hand, if for each successive extension the root
node specifies an edge pixel for the new prototype, then
strategy C yields the same chain configuration obtained
using strategy A. In practice, the results lie between these
extremes, as shown below.
Figure 5 shows the prototype patterns obtained using the proposed
text data recognition method on the document shown in Fig.
2(a). While there is some redundancy in the prototype set,
only a small part of this is due to preclassifier errors; the
major portion is due to failures of the correlation stage to
declare similarity between an input and a prototype of the
same class. (The threshold for declaring a correlation match
has to be set rather tightly to avoid giving different character
types the same designation, a more serious error.) A comparison
with a conventional pattern-matching algorithm yielded
about the same size prototype set. The efficiency advantage
of the preclassifier system is indicated in the plot of Fig. 6, in
which a typewritten document was the input. An average of
only 130 pixels per character pattern was examined by the
preclassifier (out of approximately 1000 in each pattern
Figure 6 Efficiency advantage of preclassifier. (Cumulative plot of pixels examined in the decision network versus input sequence number, for critical node, bottom extension, and top extension.)
arr ay) using the criticalnode extension method. In this same
experiment, appreciably more C PU time was spent in using
the decision network for classification than in const ructing it.
In
the least fav orable case, bottom extension, the rat io of
preclassification time oadaptation ime was 3:l; for the
critical node method the ratio rose to
1O:l .
These data were
obtained from simulation experiments programm ed in
APL
and do not provide estimates of the timings obtainab lewith
efficient programming of the method. An effort is currently
under way to reprogram the algorithm and to optimize its
speed.
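The two-stage scheme evaluated above can be made concrete with a small
sketch. It is only an illustration under stated assumptions: the node
layout (Node, Leaf), the pixel-mismatch count used in place of the
correlation measure, the 3 x 3 toy prototypes, and the threshold value
are all hypothetical and are not taken from the APL simulation. The
sketch shows why the preclassifier is cheap: it examines one pixel per
node along a single path, and only the one candidate prototype it
proposes is then compared in full.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union

Pattern = Tuple[Tuple[int, ...], ...]  # rows of 0/1 pixels


@dataclass
class Leaf:
    label: str  # candidate prototype proposed by the preclassifier


@dataclass
class Node:
    row: int
    col: int
    white: Union["Node", Leaf]  # branch taken when the tested pixel is 0
    black: Union["Node", Leaf]  # branch taken when the tested pixel is 1


def preclassify(tree: Union[Node, Leaf], pattern: Pattern) -> Tuple[str, int]:
    """Walk one path of the network; return (candidate, pixels examined)."""
    examined = 0
    node = tree
    while isinstance(node, Node):
        examined += 1
        node = node.black if pattern[node.row][node.col] else node.white
    return node.label, examined


def mismatch(a: Pattern, b: Pattern) -> int:
    """Count differing pixels (an assumed stand-in for correlation)."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(tree: Union[Node, Leaf], prototypes: Dict[str, Pattern],
             pattern: Pattern, threshold: int) -> Tuple[Optional[str], int]:
    """Two-stage match: the preclassifier proposes, the second stage verifies."""
    label, examined = preclassify(tree, pattern)
    if mismatch(prototypes[label], pattern) <= threshold:
        return label, examined
    return None, examined  # no match: the input would become a new prototype


# Hypothetical 3 x 3 prototypes and a hand-built network that separates them.
PROTOTYPES: Dict[str, Pattern] = {
    "I": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "T": ((1, 1, 1), (0, 1, 0), (0, 1, 0)),
}
TREE = Node(0, 0, white=Leaf("I"),
            black=Node(0, 1, white=Leaf("L"), black=Leaf("T")))

NOISY_L: Pattern = ((1, 0, 0), (1, 0, 0), (1, 1, 0))  # one pixel flipped
BLANK: Pattern = ((0, 0, 0), (0, 0, 0), (0, 0, 0))

label, examined = classify(TREE, PROTOTYPES, NOISY_L, threshold=1)
```

In this toy case the noisy pattern is accepted after only two of its nine
pixels are tested by the network, while the blank pattern is routed to a
candidate that the tight threshold then rejects. Loosening the threshold
would accept more inputs at the risk of merging different character types,
which is the trade-off noted in the text.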
A side benefit of the preclassification approach has been in a recursive
technique for the segmentation of touching character patterns (described
in detail in [14]). The proper segmentation of closely spaced printed
characters is one of the major problems in optical character recognition.
The Document Analysis System combines segmentation with the
classification of the component patterns. Following the pattern-matching
stage, each prototype wider than twice the width of the narrowest
prototype undergoes a series of trial segmentations. The two-stage
preclassification/correlation technique attempts to find matching
prototypes for the trial segments. The recursive algorithm segments a
sequence of touching characters only if it can decompose the segment into
patterns that positively match previously found prototypes. Because the
number of patterns tested for segmentation is small and because the use
of the preclassifier makes the process efficient, the computational
overhead introduced by recursive segmentation is relatively
insignificant. Figure 7 shows a sampling of composite patterns
successfully segmented by this technique.
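The recursive idea described above can be sketched as follows. This is a
minimal illustration, not the algorithm of [14]: patterns are represented
as left-to-right runs of pixel columns, the match function uses exact
equality as a hypothetical stand-in for the preclassification/correlation
stages, and the prototypes "r" and "n" are invented toy shapes. The point
it shows is that a composite is split only when every piece matches a
previously found prototype; otherwise it is left unsegmented.

```python
from typing import Dict, List, Optional, Tuple

Column = Tuple[int, ...]       # one pixel column, top to bottom
Pattern = Tuple[Column, ...]   # a pattern as a left-to-right run of columns


def match(segment: Pattern, prototypes: Dict[str, Pattern]) -> Optional[str]:
    """Return the label of a prototype equal to `segment`, else None.

    Exact equality is an assumption made for brevity; the paper uses
    preclassification followed by correlation against a threshold.
    """
    for label, proto in prototypes.items():
        if proto == segment:
            return label
    return None


def needs_segmentation(pattern: Pattern, prototypes: Dict[str, Pattern]) -> bool:
    """Apply the width test: wider than twice the narrowest prototype."""
    narrowest = min(len(p) for p in prototypes.values())
    return len(pattern) > 2 * narrowest


def segment_touching(pattern: Pattern,
                     prototypes: Dict[str, Pattern]) -> Optional[List[str]]:
    """Decompose `pattern` into known prototypes, or return None."""
    if not pattern:
        return []  # nothing left over: the decomposition succeeded
    for width in sorted({len(p) for p in prototypes.values()}):
        if width > len(pattern):
            break
        label = match(pattern[:width], prototypes)
        if label is None:
            continue  # this trial cut does not yield a known character
        rest = segment_touching(pattern[width:], prototypes)
        if rest is not None:
            return [label] + rest
    return None  # no trial segmentation matched everywhere


# Hypothetical prototypes, 3 pixels high; "r" is a prefix of "n".
PROTOTYPES: Dict[str, Pattern] = {
    "r": ((1, 1, 1), (1, 0, 0)),
    "n": ((1, 1, 1), (1, 0, 0), (1, 1, 1)),
}
RN: Pattern = PROTOTYPES["r"] + PROTOTYPES["n"]  # touching "rn" composite

labels = segment_touching(RN, PROTOTYPES) if needs_segmentation(RN, PROTOTYPES) else None
```

Because "r" is a prefix of "n" in this toy set, the search first tries a
cut that leaves an unmatched remainder and has to back up, which is why
the procedure is written recursively rather than as a single greedy pass.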
Figure 7 Illustration of the recursive touching character segmentation
method. (Each panel pairs an input array of touching characters with its
matching prototypes.)
Conclusion
This paper has described some of the requirements for and a system
overview of a document analysis system which encodes printed documents to
a form suitable for computer processing. Not all the system components
mentioned have been designed and built. The initial work on the project
has focused on techniques for the separation of a document into regions
of text and images and on the recognition of text data.
The image segmentation method has been tested with a variety of printed
documents from different sources with one common set of parameters.
Performance is satisfactory although misclassifications occur
occasionally. With the user in an interactive feedback loop,
misclassification errors can be checked and corrected quite readily in
the classification table. A more sophisticated pattern discrimination
method [10] is being tested to further classify those blocks that are
labeled as non-text.
Pattern matching has been investigated as a fundamental approach to
reducing a large number of unknown character patterns to a small number
of prototypes. A new, highly efficient approach to perform the matching
has been implemented by employing an adaptive decision network for
preclassification. The reduction of computational load is very important
when processing is done by a conventional computer. Preliminary results
on trial documents have yielded prototype sets of low redundancy, as well
as low rates of misclassification.
Acknowledgments
We wish to acknowledge Y. Takao of IBM Tokyo Scientific Center in Japan
for his contribution in writing a package of image-editing software
(called EDIT) for our Document Analysis System. We also appreciate the
stimulating discussions with G. Nagy during his visit to our laboratory.
References
1. R. N. Ascher, G. M. Koppelman, M. J. Miller, G. Nagy, and G. L.
Shelton, Jr., “An Interactive System for Reading Unformatted Printed
Text,” IEEE Trans. Computers C-20, 1527-1543 (December 1971).
2. A. Steinbach and K. Y. Wong, “An Understanding of Moiré Patterns in
the Reproduction of Halftone Images,” Proceedings of the Pattern
Recognition and Image Processing Conference, Chicago, Aug. 6-8, 1979,
pp. 545-552.
3. G. Nagy, “Preliminary Investigation of Techniques for Automated
Reading of Unformatted Text,” Commun. ACM 11, 480-487 (July 1968).
4. E. Johnstone, “Printed Text Discrimination,” Computer Graph. Image
Process. 3, 83-89 (1974).
5. F. M. Wahl, K. Y. Wong, and R. G. Casey, “Block Segmentation and Text
Extraction in Mixed Text/Image Documents,” Research Report RJ 3356, IBM
Research Laboratory, San Jose, CA, 1981.
6. F. Wahl, L. Abele, and W. Scherl, “Merkmale fuer die Segmentation von
Dokumenten zur Automatischen Textverarbeitung,” Proceedings of the 4th
DAGM-Symposium, Hamburg, Federal Republic of Germany, Springer-Verlag,
Berlin, 1981.
7. L. Abele, F. Wahl, and W. Scherl, “Procedures for an Automatic
Segmentation of Text Graphic and Halftone Regions in Documents,”
Proceedings of the 2nd Scandinavian Conference on Image Analysis,
Helsinki, 1981.
8. F. M. Wahl and K. Y. Wong, “An Efficient Method of Running a
Constrained Run Length Algorithm (CRLA) in Vertical and Horizontal
Directions on Binary Image Data,” Research Report RJ 3438, IBM Research
Laboratory, San Jose, CA, 1982.
9. A. Rosenfeld and A. C. Kak, “Digital Picture Processing,” Academic
Press, Inc., New York, 1976, pp. 347-348.
10. F. M. Wahl, “A New Distance Mapping and its Use for Shape Measurement
on Binary Patterns,” Research Report RJ 3361, IBM Research Laboratory,
San Jose, CA, 1982.
11. R. G. Casey and G. Nagy, “Decision Tree Design Using a Probabilistic
Model,” Research Report RJ 3358, IBM Research Laboratory, San Jose, CA,
1981.
12. W. H. Chen, J. L. Douglas, W. K. Pratt, and R. H. Wallis, “Dual-mode
Hybrid Compressor for Facsimile Images,” SPIE J. 207, 226-232 (1979).
13. R. G. Casey and K. Y. Wong, “Unsupervised Construction of Decision
Networks for Pattern Classification,” IBM Research Report, to appear.
14. R. G. Casey and G. Nagy, “Recursive Segmentation and Classification
of Composite Character Patterns,” presented at the 6th International
Conference on Pattern Recognition, Munich, October 1982.
Received May 3, 1982; revised July 7, 1982
Richard G. Casey IBM Research Division, 5600 Cottle Road, San Jose,
California 95193. Dr. Casey has worked primarily in pattern recognition,
particularly optical character recognition, since joining IBM in 1963.
Until 1970, he carried out projects in this area at the Thomas J. Watson
Research Center in Yorktown Heights, New York. After transferring to San
Jose, he worked initially on the analysis of data bases, but returned to
the character recognition area in 1975, soon after completing a one-year
assignment for IBM in the United Kingdom. Dr. Casey received a B.E.E.
from Manhattan College in 1954 and an M.S. and Eng.Sc.D. from Columbia
University, New York, in 1958 and 1965. From 1957 to 1958, he was a
teaching assistant at Columbia University and from 1958 to 1963, he
investigated radar data processing techniques at the Columbia University
Electronics Research Laboratories. While on leave of absence from IBM in
1969, he taught at the University of Florida.
Friedrich M. Wahl Lst. f. Nachrichtentechnik, Technische Universitaet,
Arcisstrasse 21, 8 Munich 2, Federal Republic of Germany. Dr. Wahl
studied electrical engineering at the Technical University of Munich,
where he received his Diplom and his Ph.D. in 1974 and 1980. In 1974 he
started a digital image processing group at the Institute of
Communication Engineering at the Technical University of Munich. Dr. Wahl
worked in the image processing project in the Computer Science Department
at the IBM Research laboratory in San Jose, California, for one year,
1981 to 1982, as a visiting scientist. At San Jose, he worked on problems
of document analysis systems, image segmentation, shape description,
clustering algorithms, and line drawing decomposition. He has also been
involved with automatic image inspection for manufacturing. Dr. Wahl is a
member of the Institute of Electrical and Electronics Engineers.
Kwan Y. Wong IBM Research Division, 5600 Cottle Road, San Jose,
California 95193. Dr. Wong received his B.E. (with honors) and his M.E.
in electrical engineering from the University of New South Wales, Sydney,
Australia, in 1960 and 1963. Since he received his Ph.D. from the
University of California, Berkeley, in 1966, he has been working at the
IBM Research laboratory in San Jose. He was the leader on projects in
process control systems, air-jet guiding for flexible media, and image
processing. Currently he is the manager of the image processing project
in the Computer Science Department. His recent research activities are in
the areas of image processing and pattern recognition systems for office
automation, manufacturing automated visual inspection systems, special
image processing hardware, and algorithms.
656 K. Y. WONG ET AL. IBM J. RES. DEVELOP. VOL. 26 NO. 6 NOVEMBER 1982