recognition of printed chinese characters by automatic pattern analysis

COMPUTER GRAPHICS AND IMAGE PROCESSING (1972) 1, (47-65)

Recognition of Printed Chinese Characters by Automatic Pattern Analysis*

WILLIAM STALLINGS

Honeywell Information Systems, Inc. Waltham, Massachusetts 02154

Communicated by T. S. Huang

Received August 13, 1971

An approach to pattern recognition by computer, using analysis of pattern structure, is explored. The approach is programmed and tested on a set of Chinese characters.

The input to the program is a black-white matrix of points depicting a single character. The program produces a description of the character on two levels: (i) the internal structure of each connected part of the character, and (ii) the arrangement in two dimensions of the connected parts. The description is achieved by producing a structural representation of the character. The structure corresponding to level (i) is a graph, the edges of which correspond to parts of strokes. The structure corresponding to level (ii) is a tree, whose terminal elements correspond to connected parts of the character and whose nodes correspond to geometric relations.

A method has been devised for producing a numeric code for each character. The code is generated from the structural representation of a character, and is used for recognition.

1. INTRODUCTION

An Approach to Pattern Recognition

This paper reports on a study of an approach to pattern recognition based on the description and analysis of pattern structure. The author feels that recognition should imply more than just the classification of patterns according to the features they possess; it should mean the structuring of the pattern, i.e., the determination of the relationship among the elements of the pattern.¹

With this point of view, a scheme for automatic pattern recognition has been developed which includes the following tasks:

(i) Description. A systematic scheme for the description of the pictorial structure of the patterns to be recognized is developed.

(ii) Analysis. An algorithm is designed which analyzes the structure of the patterns, producing a representation of the structure conforming to the descriptive scheme.

* This paper is based on a thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Electrical Engineering at the Massachusetts Institute of Technology, 1971.

¹ This point of view is propounded by Sayre [11] and Grenander [4].

Copyright © 1972 by Academic Press, Inc.


(iii) Encoding. From the structural representation of a pattern, a code is generated which uniquely identifies the pattern.

This method has been applied to the recognition of Chinese characters. A program has been written which analyzes Chinese characters; the program produces a data structure which describes a character in terms of basic picture elements and the relationship among them. A procedure has been developed for generating a numeric code from the structural representation. Recognition is achieved by building up a dictionary matching characters with their codes; the code for any new instance of a character can then be looked up in the dictionary.

Chinese Characters

Chinese is a pictorial and symbolic language which differs markedly from written Western languages. The characters are of uniform dimension; they are generally square; they are not alphabetic but are composed of strokes.

Chinese characters possess a great deal of structure and hence are well-suited to the method of recognition outlined above. Many regularities of stroke configuration occur. Quite frequently, a character is simply a two-dimensional arrangement of two or more simpler characters. Nevertheless, the system is rich; strokes and collections of strokes are combined in many different ways to produce thousands of different character patterns.

The author feels that the method of pattern recognition by analysis is suited for application to a class of patterns which display a rich structure developed from a small number of simple basic elements. Hence Chinese characters were chosen. In addition, the development of a successful Chinese character recognition device is desirable in itself.² It is hoped that the method presented here can be the basis of a feasible Chinese character recognition device, which would increase the access of the West to the vast quantity of Chinese writing.

2. THE STRUCTURE OF CHINESE CHARACTERS

Chinese characters consist of strokes, which are drawn roughly along a straight line.³ Nearly all strokes appear as horizontal, vertical, or in a direction along one of the main diagonals. Strokes are combined to form connected units, hereafter referred to as components. Each character consists of a two-dimensional arrangement of one or more disjoint components. Figure 1 shows a character having three components.

The structure of a Chinese character may therefore be specified on two levels:

(i) a description of the internal structure of each component, and (ii) a description of the arrangement of components in two dimensions.

² The author is aware of one previous investigation, by Casey and Nagy [1]. The method used was template matching.

³ This is not quite correct. A native Chinese would draw 7 with a single stroke, not lifting his pen. For the sake of this discussion, however, it is simpler to say that 7 is composed of two "strokes", namely ~ and /.

FIG. 1. Character with Three Components.

Components

Two questions are involved in the decision of how to describe the internal structure of a component:

(i) What class of objects shall be considered as the basic picture element?

(ii) What sort of structure shall be used to indicate the relationship between elements?

Three criteria were used in answering these questions:

(i) The structure mentioned in question (ii) should be relatively easy to generate from the original pattern.

(ii) It should be relatively easy to generate a unique numeric code from the structure.

(iii) The structure should represent the pattern in a natural manner.

A quite natural method of representing the internal structure of a component would be in terms of strokes. This indeed is the approach taken by several previous recognition schemes [5,7]. These schemes make use of on-line input, in which strokes are drawn one at a time. The difficulty with taking this approach for printed characters is that strokes do overlap and are not easily isolated. Further, the description of the relationship between strokes is not straightforward.⁴

FIG. 2. Component (a) and Graph (b).

A much more promising approach is to describe components in terms of stroke segments. This can best be understood with reference to Fig. 2. As can be seen, a component can be depicted as a graph. The branches of the graph correspond to segments of strokes. These segments are bounded by stroke intersections and ends of strokes.⁵

It will be shown in later sections that this representation satisfies criteria (i) and (ii). That it satisfies criterion (iii) is fairly clear. To the human observer, the graph of a component is readily apparent.

Characters

The arrangement of components in two dimensions to form characters can be described using the concept of frame. Each character is viewed as occupying a hypothetical square. The segmentation of a character into components segments its square accordingly. The square, or frame, may be segmented in one of three ways: (a) East-West, (b) North-South, (c) Border-Interior. Each of these segmentations corresponds to a two-component character. For example, [character] would be represented by (a), which decomposes the character into [component] and [component]. [character] would be represented by (b). Finally, either partial or complete enclosure, such as [character] and [character], would be represented by (c). Frames for characters composed of more than two components are obtained by embedding (a), (b), or (c) in one of the sub-frames of (a), (b), or (c). The process of embedding is recursive, in that any sub-frame of a derived frame may be used for further embedding.

⁴ For a discussion of a system for the description of Chinese characters in terms of strokes, see Fujimura and Kagaya [3]. The authors are primarily interested in computer generation of Chinese characters.

⁵ The numbers on the nodes and branches are for the sake of discussion later in the text.

For example, the four-component character of Fig. 3 can be described by the frame arrangement of Fig. 4a. The frame description can be conveniently represented by a tree, as indicated in Fig. 4b.

This description of the arrangement of components is based on the work of Rankin [10], who introduced the concept of frame-embedding. The definition of component used here is slightly different from that of Rankin. Despite this, Rankin's claim that the three relations used in his scheme are sufficient to describe accurately nearly all characters seems to apply.
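The recursive frame description lends itself to a nested data structure. The following Python sketch is illustrative only: the relation labels, the component identifiers "A" to "D", and the particular four-component arrangement are assumptions made here, not the character of Fig. 3. A tree is written as a triple (left son, relation, right son), and a terminal element is simply a component identifier.

```python
# Illustrative labels for the three permitted frame segmentations.
EW, NS, BI = "East-West", "North-South", "Border-Interior"

# A hypothetical four-component character (not the one in Fig. 3).
# Each sub-frame is either a component identifier or another triple.
tree = ("A", EW, ("B", NS, ("C", EW, "D")))


def components(t):
    """Return the terminal elements (components) of a frame tree, left to right."""
    if isinstance(t, str):
        return [t]
    left, _relation, right = t
    return components(left) + components(right)


def depth(t):
    """Return the number of nested frame embeddings in a frame tree."""
    if isinstance(t, str):
        return 0
    left, _relation, right = t
    return 1 + max(depth(left), depth(right))
```

Because every relation is binary, a character with n components always yields a tree with n terminal elements and n - 1 relation nodes.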

3. INPUT

The program operates on a representation of one character at a time. The representation is in the form of a matrix whose entries have value 0 or 1 corresponding to white or black in the original picture. The matrix is obtained by means of a flying-spot scanner. The printed characters used were taken from a number of different sources; the characters were all of roughly the same style but varied considerably in size.

Certain functions of the program depend on the fact that there are no gaps or holes in any of the strokes. This is not always the case, due to the quality of the printed input. Accordingly, a smoothing operation is performed to fill in the gaps. The resulting matrix is used as the data base for the program.

The digitized form of a character can be displayed on a CRT. Figures 1, 2a, 3 and 11 are photographs of such displays.

FIG. 3. Chinese Character.

FIG. 4. Frame Description (a) and Tree Representation (b).

4. ANALYSIS OF COMPONENTS

A program has been written to perform the analysis of components. For a given component, the output of the program is a graph in which branches correspond to stroke segments and nodes correspond to the endpoints of stroke segments.

To construct the graph of a component, one principal procedure, BUILD, is used. In addition, use is made of some auxiliary routines. It will be helpful to describe these first.

Contour Tracing

Contour tracing is the process of finding a series of black points on the boundary of a black region in a white field. Two routines are used: one which keeps the black region on the left as the tracing proceeds, and one which keeps the black region on the right.

To keep the black region on the left, the tracing proceeds from point to point, turning right after encountering a black point and left after encountering a white point. An additional rule is used to increase the speed of the algorithm: if three points of the same color are encountered in succession, the next point is assumed to be of the opposite color. Thus, two steps may be taken at once. The operation of the algorithm is depicted in Fig. 5. The last step shown is diagonal, indicating the effect of the 2-move rule.

The algorithm for keeping the black region on the right is similar. Both algorithms were developed by Prerau [9].
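The turning rule for keeping the black region on the left can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Prerau's routine: the function name `trace_contour`, the default start heading, the stopping test (return to the start point with the start heading), and the treatment of points outside the matrix as white are all choices made here. The two-steps-at-once speed-up is omitted.

```python
def trace_contour(grid, start, heading=(-1, 0), max_steps=10_000):
    """Trace a black region keeping it on the left of the direction of travel.

    grid    -- list of rows of 0 (white) / 1 (black)
    start   -- (row, col) of a black boundary point entered from a white point
    heading -- (d_row, d_col) of the move that entered `start`; default is "up"
    Returns the distinct black points met, in order of first encounter.
    """
    rows, cols = len(grid), len(grid[0])

    def is_black(r, c):
        # Assumption: anything outside the matrix counts as white.
        return 0 <= r < rows and 0 <= c < cols and grid[r][c] == 1

    pos, d = start, heading
    contour = []
    for _ in range(max_steps):
        if is_black(*pos):
            if pos not in contour:
                contour.append(pos)
            d = (d[1], -d[0])    # black point: turn right
        else:
            d = (-d[1], d[0])    # white point: turn left
        pos = (pos[0] + d[0], pos[1] + d[1])
        if pos == start and d == heading:
            break                # back at the start in the start state
    return contour
```

On a solid 3 x 3 block of black points the sketch visits the eight border points and never the interior point, which is the behaviour expected of a contour tracer. The mirror-image routine (black region on the right) would swap the two turns.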

Search

The task of the SEARCH routine is to find some stroke segment to be used as a starting point. It is unimportant which particular segment of a component is found.

The output of the SEARCH routine is the coordinates of the endpoints of a strip of black points straddling a stroke segment. SEARCH proceeds by scanning alternately from left to right and from top to bottom along various rows and columns of the pattern. This continues until a series or strip of black points is encountered. If the strip is too long (more than ¼ the width of the pattern), it is assumed that the strip is lying along the length of a stroke. This is rejected and the scanning continues.

FIG. 5. Contour Tracing.

Similarly, if the strip is too small (1 or 2 points), it is rejected as being a speck. Otherwise, it is assumed that the strip is straddling a stroke segment and the endpoints of the strip are returned.

Figure 6 shows examples of all possible outcomes of scanning a single row.
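The acceptance test applied to a single scanned row can be sketched as below. The function name `find_strip` and its parameters are hypothetical; the length limits follow the description above (strips of 1 or 2 points are specks, strips longer than a fixed fraction of the pattern width lie along a stroke), with the fraction left as a parameter since its exact value is an assumption here.

```python
def find_strip(row, max_fraction=0.25, min_len=3):
    """Return (first, last) column of the first acceptable black strip in a row.

    A strip shorter than `min_len` is treated as a speck; a strip longer than
    `max_fraction` of the row width is treated as lying along a stroke.
    Returns None when the row holds no acceptable strip.
    """
    width = len(row)
    c = 0
    while c < width:
        if row[c] == 1:
            first = c
            while c < width and row[c] == 1:
                c += 1                      # advance to the end of the black run
            length = c - first
            if min_len <= length <= max_fraction * width:
                return (first, c - 1)       # strip straddling a stroke
        else:
            c += 1
    return None
```

The same test would be applied to columns when scanning from top to bottom; only the indexing changes.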

(a) No black points found. (b) Speck found. (c) Line along stroke found. (d) Line straddling stroke found.

FIG. 6. Outcomes of Scan by SEARCH Algorithm.


Crawl

The CRAWL routine is used for "crawling along" a stroke segment. The routine proceeds along a stroke segment in a given direction, halting when a node is encountered, i.e., when an intersection or the tip of a stroke is reached.

The input to CRAWL is (i) a location on a segment, in the form of the two endpoints of a horizontal or vertical strip of points straddling the segment, and (ii) one of four directions (left, right, up, down) in which the crawl is to proceed. The output is the location on the segment where the crawl halted, again in the form of two endpoints of a strip.

The crawl is accomplished by moving from each of the input points along the contour of the segment. Tracing from the left-hand input point (with respect to the direction of the crawl) is done keeping the black region on the right, and conversely for the right-hand input point. The crawling proceeds by advancing both "tracers" one unit in the specified direction at a time. This is depicted in Fig. 7. For each move from one line to the next, each tracer goes through one or more contour points.

Figure 8 shows the four conditions under which a crawl will be halted. All four cases correspond to a node being encountered:

(i) If the two tracers, instead of advancing, meet each other, then the tip of a stroke has been encountered.

(ii) If the two tracers do advance, but not all of the points between them are black, then a fork has been encountered.

(iii) If the new strip of black points on which the two tracers sit is significantly longer than the previous strip, then an intersection has been encountered.

(iv) If one of the two tracers reverses direction, then again a fork has been encountered, but this time by coming up one of the two arms rather than the main road.

FIG. 7. Crawling along a Stroke. Circled points indicate position of tracers at end of each move. Both tracers are always on the same line. Lines are numbered for sake of discussion. See text.

FIG. 8. Conditions for halting CRAWL procedure: (a) Tip, (b) Fork, (c) Intersection, (d) Turn-around. Circled points indicate location of two tracers just before crawl is halted. Direction of crawl same as in Figure 7.

Although only horizontal and vertical directions of crawl are specified, the routine works on diagonally-oriented segments. Notice that in Fig. 7 both tracers move diagonally from line 1 to line 2. This could continue along the entire length of a diagonal segment.
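The halting conditions can be illustrated with a simplified sketch for a downward crawl. It is not the CRAWL routine itself: instead of two contour tracers it re-derives the strip on each new row by a direct scan, it handles only the downward direction, and it does not detect the turn-around case (iv). The name `crawl_down` and the threshold `growth` for "significantly longer" are assumptions introduced here.

```python
def crawl_down(grid, row, left, right, growth=1.5):
    """Advance a horizontal strip downward until a halting condition is met.

    (row, left, right) is a strip of black points straddling the segment.
    Returns (row, left, right, reason) for the last strip before the halt,
    where reason is "tip", "fork" or "intersection".
    """
    width = len(grid[0])
    while True:
        nxt = row + 1
        # Black points on the next row directly below the current strip.
        below = [c for c in range(left, right + 1)
                 if nxt < len(grid) and grid[nxt][c] == 1]
        if not below:
            return row, left, right, "tip"           # nothing to advance onto
        new_left, new_right = min(below), max(below)
        # Slide each end outward to the edge of the black run, as a tracer would.
        while new_left > 0 and grid[nxt][new_left - 1] == 1:
            new_left -= 1
        while new_right < width - 1 and grid[nxt][new_right + 1] == 1:
            new_right += 1
        if any(grid[nxt][c] == 0 for c in range(new_left, new_right + 1)):
            return row, left, right, "fork"          # white between the ends
        if (new_right - new_left + 1) > growth * (right - left + 1):
            return row, left, right, "intersection"  # strip suddenly much longer
        row, left, right = nxt, new_left, new_right
```

Because the ends are allowed to slide sideways from one row to the next, the sketch also follows a diagonal segment, in the same spirit as the remark above about diagonally-oriented segments.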

Node

After CRAWL has encountered a node, NODE is called to investigate it. The input to NODE is the output of CRAWL: the endpoints of a strip of points which marks the termination of a segment at an intersection. The task of the NODE routine is to find all other stroke segments radiating from this intersection. For each segment found, NODE returns the endpoints of a strip straddling that segment at the intersection. Also, the direction of the segment away from the node is indicated.

The operation of NODE is shown in Fig. 9. The routine starts at one of the input points and proceeds along contour points around the intersection. This continues until a contour point is found which is the endpoint of a horizontal or vertical strip straddling a segment (i.e., the endpoint of a small black strip). This strip and the direction perpendicular to it away from the node are noted. The routine then continues from the other endpoint of the strip. This process of going a few contour points, finding a segment, crossing it, going a few contour points, etc., continues until the other input point is encountered.

In addition to locating the segments leading from a node, the routine assigns a position to the node. This is done by averaging the X and Y coordinates (with respect to an origin in the upper left-hand corner of the matrix) of the endpoints of all the strips found, including the input points.
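The averaging step is simple enough to state directly. In the sketch below, the function name `node_position` and the representation of a strip as a pair of (x, y) endpoints are assumptions made for illustration.

```python
def node_position(strips):
    """Average the endpoint coordinates of all strips found around a node.

    strips -- list of ((x1, y1), (x2, y2)) endpoint pairs, input strip included
    Returns the (x, y) position assigned to the node.
    """
    points = [p for strip in strips for p in strip]
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
```

For two strips with endpoints (2, 4), (6, 4) and (2, 8), (6, 8), the node is placed at (4.0, 6.0), the center of the intersection they bound.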

FIG. 9. The NODE Algorithm.

Build

The construction of a graph can now be described. As a graph is a collection of interconnected nodes, it is represented in the computer as a collection of interconnected blocks of data. For each node in a graph, a block of contiguous memory words is allocated. The length of a block depends on how many branches there are at the corresponding node. If two nodes are adjacent in a graph, their data blocks will contain pointers to each other. Each of these pairs of pointers represents a branch.

To begin construction of a graph for a particular component, SEARCH is called to find some initial stroke segment. SEARCH returns a position somewhere along the length of a segment. From this position, CRAWL is used to crawl along the segment in both directions to its two endpoints. Thus two initial nodes are found. NODE is called once for each endpoint to determine the segments leading from them. Storage blocks are allocated for each node. Pointers are placed in each block linking the two together.

From this start, the graph is completed using BUILD. BUILD is called once for each segment leading from each of the two initial nodes. The arguments to BUILD are (i) a pointer to a block of data corresponding to a node (the input node), and (ii) the starting point of some segment (the input segment) leading from the input node. BUILD performs the following operations:

1. The input segment is crawled along to reach its endpoint, using CRAWL.

2. NODE is called to examine this endpoint, or node. The coordinates of the node and the segments leading from it are determined.

3. a. The coordinates of this node are compared to those of all previously encountered nodes (those for which data blocks already exist). If a match is found, then a pointer to the existing block for this node is placed in the block of the input node, and the routine stops.

b. If the encountered node is new, then a block is allocated for it, and it is linked back to the block of the input node. BUILD is then called once for each segment leading from the new node. Then the routine stops.

It can be seen that BUILD is a recursive routine. BUILD is described more formally in Fig. 10.

As an example, the analysis of the component of Fig. 2 will be described. The two nodes initially found are marked 1 and 2. The branch between them corresponds to the initial segment found by SEARCH. Blocks of data are allocated for 1 and 2. Then, all the segments leading from 1 are examined, clockwise, by BUILD. Crawling along the first segment, node 3 is found. This is linked back to 1. The segment leading from 3 is examined next, finding node 4. The procedure unwinds back to node 1 and examines its next segment. As a result, 5 and 6 are found. From 6, node 2 is encountered. Node 6 is linked to node 2 and the procedure again returns to node 1, which is seen to be completed. Next BUILD is applied to node 2, which finds first 6 and then 7. At this point, 2 is complete and the analysis terminates.
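The recursion can be exercised on the example just described with the following Python sketch. Several things here are assumptions rather than the author's program: the table `EDGES` stands in for what CRAWL and NODE would discover in the picture, and its branch list is inferred from the walkthrough above; the class `Block` and the functions `build` and `analyze` are hypothetical names; and nodes are matched by number rather than by comparing coordinates.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Block:
    """Data block for one node; `links` holds pointers to adjacent blocks."""
    number: int
    links: list = field(default_factory=list, repr=False)


# Stand-in for the picture: node -> nodes reached by crawling each branch.
# Inferred from the walkthrough of Fig. 2; not taken from the figure itself.
EDGES = {1: [2, 3, 5], 2: [1, 6, 7], 3: [1, 4], 4: [3],
         5: [1, 6], 6: [5, 2], 7: [2]}


def build(block, far_end, blocks):
    """Follow one branch from `block` to the node `far_end` and extend the graph."""
    old = blocks.get(far_end)
    if old is not None:              # node already has a block: link and stop
        block.links.append(old)
        return
    new = Block(far_end)             # new node: allocate and link both ways
    blocks[far_end] = new
    block.links.append(new)
    new.links.append(block)
    for nxt in EDGES[far_end]:       # recurse on the other branches only
        if nxt != block.number:
            build(new, nxt, blocks)


def analyze(first, second):
    """Start from the two nodes bounding the initial segment found by SEARCH."""
    a, b = Block(first), Block(second)
    a.links.append(b)
    b.links.append(a)
    blocks = {first: a, second: b}
    for start, other in ((a, b), (b, a)):
        for nxt in EDGES[start.number]:
            if nxt != other.number:
                build(start, nxt, blocks)
    return blocks
```

Calling `analyze(1, 2)` allocates blocks in the order 1, 2, 3, 4, 5, 6, 7, matching the order of discovery in the example. A branch that closes a cycle, such as the one between nodes 6 and 2, receives its two pointers on two separate visits, one from each end.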

5. ANALYSIS OF CHARACTERS

The algorithm for analyzing a character is in two parts:

1. A collection of graphs is produced, one for each component.

2. The relationship between components is determined.

Finding All Components

The first part of the algorithm involves a few modifications to the program discussed in the previous section. The objective is to keep track of which components in a pattern have already been analyzed.

To do this, the following procedure is employed. As a component is being analyzed, its outline is drawn on a separate pattern. That is, the contour points of a component are filled in on a new pattern as they are encountered. The new pattern contains, at any time, the outline of all components of a character which have been processed. The SEARCH routine is modified to test the endpoints of any strip of black points it finds against the new pattern. If the corresponding points are black in the new pattern, then the strip is rejected and SEARCH continues to scan. If no new strip is found after scanning a sufficiently large number of rows and columns, it is assumed that no new components remain to be found. After each component is analyzed, SEARCH is called to locate a stroke segment on a new component. The process of analyzing components continues until no new components can be found. The result is to produce a collection of connected graphs.

    procedure build (block, stroke) ;
    begin
        node := find node at end of stroke ;
        n := number of other branches at node ;
        branch := n-vector of other branches at node ;
        if node = oldblock* then
            place pointer to oldblock in block
        else begin
            newblock := create block of length n+5 ;
            place pointer to newblock in block ;
            place pointer to block in newblock ;
            place number, x, y in newblock ;
            for i := 1 step 1 until n do
                build (newblock, branch(i))
        end
    end

* i.e., node is compared to all nodes previously encountered. The value is true if node is the same as another node represented by the data block "oldblock".

FIG. 10. BUILD Procedure.

FIG. 11. Outline of a Character. (a) Step one; (b) step two; (c) step three; (d) step four.

Figure 11 shows the result of applying the algorithm to the character of Fig. 3.


Constructing the Frame

Represen ta t ion o f the f l ame descr ip t ion of a character is done c o n v e n i e n t l y by means of a h'ee. T h e root node of the t ree has as its va lue one of the t h r e e relat ions indicat ing how the overal l f lame is b roken into two subframes. T h e two sons r ep re sen t the s t rue ture of the two subffames. TmTninal e l emen t s cor- r e spond to c o m p o n e n t s (see Fig. 4).

The m e t h o d of obta in ing such a tree will be briefly descr ibed . First, e a c h c o m p o n e n t in the charac te r is inscr ibed in a rectangle. Th i s is easy to do s ince the coordinates of each n o d e are known. T h e re la t ionship b e t w e e n all pos- s ible pairs o f componen t s is detmTnined by de t e rmin ing the re la t ionsh ip b e t w e e n the i r rectangles . T h e one of the three p e rm i t t ed re la t ionships (East- West , North-South , Border- In ter ior ) which most near ly approximates the t rue re la t ionship is chosen. T h e n i t is d e t e r m i n e d if one of the co m p o n en t s has the same relat ion to all o ther components . This will usual ly b e the case. I f so, that c o m p o n e n t b e c o m e s one son of the root node of the tree; the value of the n o d e is the appropr ia te re la t ion; the o ther son is a t ree r ep re sen ta t i on d e v e l o p e d for the remain ing components . This subt ree is d e t e r m i n e d in the stone way.

If no single component is found, a more complicated procedure is used to determine if any two components have the same relation to all others, and so on. A procedure for constructing the tree representation of a frame description is described formally in Fig. 12.
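The common case described above, in which one component bears the same relation to all the others, can be sketched as follows. This is an illustration under stated assumptions, not the program of the paper: the rule for picking the relation that "most nearly approximates" the true one (containment first, then the larger of the horizontal and vertical offsets between rectangle centers) is a guess, and the names `box_relation`, `build_frame`, and `NODE` are hypothetical. Rectangles are given as (top, left, bottom, right) with rows increasing downward.

```python
def box_relation(a, b):
    """Classify rectangle a relative to rectangle b (assumed heuristic)."""
    at, al, ab, ar = a
    bt, bl, bb, br = b
    if at <= bt and al <= bl and ab >= bb and ar >= br:
        return "surrounds"
    if bt <= at and bl <= al and bb >= ab and br >= ar:
        return "inside"
    dx = (bl + br) / 2 - (al + ar) / 2  # horizontal offset of centers
    dy = (bt + bb) / 2 - (at + ab) / 2  # vertical offset of centers
    if abs(dx) >= abs(dy):
        return "left_of" if dx > 0 else "right_of"
    return "above" if dy > 0 else "below"


# relation -> (node value, True if the single component is the first son)
NODE = {
    "left_of": ("left", True), "right_of": ("left", False),
    "above": ("above", True), "below": ("above", False),
    "surrounds": ("surround", True), "inside": ("surround", False),
}


def build_frame(boxes):
    """Return a component name or a triple (first son, node, second son)."""
    names = list(boxes)
    if len(names) == 1:
        return names[0]
    for name in names:
        others = [n for n in names if n != name]
        relations = {box_relation(boxes[name], boxes[o]) for o in others}
        if len(relations) == 1:  # same relation to all other components
            node, name_first = NODE[relations.pop()]
            rest = build_frame({o: boxes[o] for o in others})
            return (name, node, rest) if name_first else (rest, node, name)
    # The grouping of two or more components is not covered by this sketch.
    raise NotImplementedError("no single component relates uniformly")


boxes = {"A": (0, 0, 10, 3), "B": (0, 5, 4, 10), "C": (6, 5, 10, 10)}
print(build_frame(boxes))  # ('A', 'left', ('B', 'above', 'C'))
```

The triple returned mirrors the left son, node, right son form used by the FRAME procedure of Fig. 12.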

6. ENCODING OF COMPONENTS

For recognition purposes, a procedure has been developed for generating a numeric code for each character. The first step in this procedure is the generation of a code for each component in a character.

procedure frame (list, tree);
begin
    list1 := first group of components;
    list2 := second group of components;
    node := relation between two groups;
    if list1 is a list
        then frame (list1, tree1)
        else tree1 := list1;
    if list2 is a list
        then frame (list2, tree2)
        else tree2 := list2;
    tree := tree1, node, tree2
end

Notes:

1. The input to frame is the argument list, which is a list of combinations of two or more components taken two at a time.

2. The output of frame is the argument tree, which is a triple corresponding to the left son, node, and right son of a tree.

3. list1 and list2 represent disjoint groups of components such that the two groups have one of the three allowed relations between them. If either group contains only one component, the corresponding variable (list1 or list2) is simply an identifier of that component and not a list.

FIG. 12. FRAME Procedure.


The code for a component is generated from its graph. To this end, the branches of a graph are labeled at each end. The label on a branch at a node indicates the direction or slope of that branch quantized into eight directions. All the branch labels at a node are stored in the data block of that node. An algorithm can then be specified for starting at a particular node of a graph and traversing all of its branches. The sequence of branch numbers encountered is the code produced. An example appears in Fig. 13.
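Quantizing a branch direction into eight labels can be sketched as below. The numbering convention is an assumption, since the text here does not state one: label 0 is taken to point east, with labels increasing counter-clockwise in 45-degree steps, and image rows are taken to grow downward. The function name `direction_label` is hypothetical.

```python
import math


def direction_label(x0, y0, x1, y1):
    """Quantize the direction from (x0, y0) to (x1, y1) into a label 0..7.

    Assumed convention: 0 = east, 2 = north, 4 = west, 6 = south.
    The y difference is negated because image rows grow downward.
    """
    angle = math.atan2(-(y1 - y0), x1 - x0)
    return round(angle / (math.pi / 4)) % 8


print(direction_label(0, 0, 1, 0))   # 0 (east)
print(direction_label(0, 0, 0, -1))  # 2 (north)
print(direction_label(0, 0, 1, 1))   # 7 (south-east)
```

Under this convention the two ends of a straight branch carry labels that differ by 4 modulo 8.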

The algorithm obeys the following rules:

1. Start at the node in the upper left-hand corner of the graph. Exit by the branch with the lowest-valued label. Mark the exiting branch to indicate its having been taken, and write down the branch label.

2. Upon entering a node, check to see if it is being visited for the first time. If so, mark the entering branch to indicate this.

3. Upon leaving a node, if there are available unused directions other than along the first entering branch, choose the one among these with the lowest-valued label. Leave by the first entering branch only as a last resort. Mark the exiting branch to indicate its having been taken and write down the label on the branch.

Since at each node there are just as many exiting branches as entering branches, the procedure can only halt at the starting node. At the starting node, all exiting branches have been used (otherwise the procedure could have been continued), hence all entering branches have been used since there are just as many of these. The same reasoning can be applied to the second node that is visited. The first entering branch is from the starting node and this branch has been covered both ways. But this branch would only have been used for exit from the second node if all other exits had been exhausted. Therefore all branches at the second node have been covered both ways. In this manner, we find that the branches of all nodes visited have been traversed both ways. Since the graph is connected, this means that the whole graph has been covered.

FIG. 13. Encoding a Graph. [Figure: a component graph with direction labels at the branch ends; the code produced is 00246206734426.]

All branches are traversed exactly once in each direction by this procedure, so all labels are picked up. The code consists of the branch labels in the graph written down in the order in which they are encountered.
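The three rules can be sketched in a few lines. The data layout is an assumption made for illustration: `nodes` maps a node name to its (x, y) position, each branch is a tuple (u, v, label at u, label at v), and the "upper left-hand" node is taken to be the one with the smallest y, ties broken by smallest x. The function name `encode_component` is hypothetical, and the labels in the example follow an assumed convention of 0 = east, counted counter-clockwise.

```python
def encode_component(nodes, branches):
    """Traverse every branch once in each direction and return the code."""
    # incident[n]: (label at n, branch index, node at the other end)
    incident = {n: [] for n in nodes}
    for i, (u, v, label_u, label_v) in enumerate(branches):
        incident[u].append((label_u, i, v))
        incident[v].append((label_v, i, u))

    start = min(nodes, key=lambda n: (nodes[n][1], nodes[n][0]))
    used = set()        # (branch index, node left): exits already taken
    first_entry = {}    # node -> branch by which it was first entered
    visited = {start}
    current, code = start, []
    while True:
        free = [t for t in incident[current] if (t[1], current) not in used]
        if not free:
            break  # can only happen at the starting node
        # Rule 3: keep the first entering branch as a last resort.
        preferred = [t for t in free if t[1] != first_entry.get(current)]
        label, index, nxt = min(preferred or free)  # lowest-valued label
        used.add((index, current))
        code.append(label)
        if nxt not in visited:  # Rule 2: remember the first entering branch
            visited.add(nxt)
            first_entry[nxt] = index
        current = nxt
    return "".join(str(d) for d in code)


nodes = {"A": (0, 0), "B": (2, 0), "C": (2, 2)}
branches = [("A", "B", 0, 4), ("B", "C", 6, 2)]
print(encode_component(nodes, branches))  # 0624
```

The code has twice as many digits as the graph has branches, in line with each branch being traversed once in each direction.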

This algorithm is based on a procedure for traversing graphs described in Ore [8].

While this scheme will always generate the same code for a given component, the goal of generating a unique code for each component is not achieved. For example, _+ and +_ are represented by the same graph, hence the same code. Fortunately, this type of situation is rare. Characters with this property could be treated as special cases without seriously impairing the efficiency of the algorithm.

7. ENCODING OF CHARACTERS

The representation of a character is in the form of a tree. The nodes of the tree are binary relations; the terminal elements correspond to components. Considering the relations as binary operators, the tree can easily be flattened to prefix form. This is done by walking around the tree counter-clockwise, starting from the root node, and picking up nodes and terminals the first time they are encountered. As is well known, the string generated in such a fashion is unique; the tree can readily be reconstructed from it. To generate a numeric string, the following code can be used:

0 <-> terminals (components)
1 <-> left node
2 <-> above node
3 <-> surround node
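Flattening to prefix form, and the reverse step of rebuilding the tree from the string, can be sketched as follows. The tree is assumed to be stored as nested triples (left son, node, right son) with any non-tuple value standing for a component; the names `NODE_CODE`, `flatten`, and `rebuild` are hypothetical.

```python
NODE_CODE = {"left": "1", "above": "2", "surround": "3"}


def flatten(tree):
    """Prefix walk: emit each node before its two sons, 0 for a terminal."""
    if not isinstance(tree, tuple):
        return "0"
    left, node, right = tree
    return NODE_CODE[node] + flatten(left) + flatten(right)


def rebuild(code):
    """Rebuild the tree shape from a prefix string of digits 0..3."""
    names = {digit: name for name, digit in NODE_CODE.items()}
    position = 0

    def parse():
        nonlocal position
        digit = code[position]
        position += 1
        if digit == "0":
            return "component"  # component identity is not in the string
        left = parse()
        right = parse()
        return (left, names[digit], right)

    tree = parse()
    if position != len(code):
        raise ValueError("trailing digits after a complete tree")
    return tree


tree = ("a", "left", (("b", "above", "c"), "left", "d"))
print(flatten(tree))  # 1012000
```

The example tree is one whose shape yields the string 1012000 given in Fig. 14; the component names in it are placeholders.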

Figure 14 shows the generation of code from the tree of Fig. 4.

We can consider that the code so generated defines a class of Chinese characters all of which have the same frame description. Therefore, a Chinese character may be specified by first giving its frame description code and then giving the code for each of the components that fits into one of the subframes. A character having $n$ components will have a code consisting of the concatenation of $n + 1$ numbers:

$N_0, N_1, \ldots, N_n$

where $N_0$ is the code generated from the tree and $N_1$ through $N_n$ are the codes of the components listed according to the order in which the components were encountered in the tree flattening.
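Assembling the $n + 1$ numbers can be sketched with a single prefix walk that collects the frame digits and the component codes in the same order. The tree layout (nested triples of left son, node, right son), the name `character_code`, and the component codes in the example are all assumptions made for illustration.

```python
NODE_CODE = {"left": "1", "above": "2", "surround": "3"}


def character_code(tree, component_codes):
    """Return [N_0, N_1, ..., N_n] for a frame tree.

    `component_codes` maps each component identifier in the tree to the
    code generated from its graph.
    """
    frame_digits = []
    ordered = []

    def walk(subtree):
        if isinstance(subtree, tuple):
            left, node, right = subtree
            frame_digits.append(NODE_CODE[node])
            walk(left)
            walk(right)
        else:
            frame_digits.append("0")  # terminal
            ordered.append(component_codes[subtree])

    walk(tree)
    return ["".join(frame_digits)] + ordered


tree = ("a", "left", ("b", "above", "c"))
codes = {"a": "0624", "b": "04", "c": "26"}
print(character_code(tree, codes))  # ['10200', '0624', '04', '26']
```

Keeping the numbers as separate list items, rather than one joined string, is a design choice of the sketch; it avoids ambiguity about where one component code ends and the next begins.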

FIG. 14. Flattening a Tree. [Figure: a frame tree and the code generated from it, 1012000.]

8. RESULTS

The algorithms discussed in this paper have been implemented as a computer program. The program is written in FORTRAN augmented by a package of assembly language routines to permit structured data and recursive procedures. The program runs on a PDP-9 computer.

The program has been tested with a number of characters from several different sources. The tests were designed to consider four questions:

1. How successful is the program in analyzing the structure of Chinese characters?

2. Does the program generate consistent codes for characters of the same font? That is, will two instances of the same character from the same source yield the same code?

3. Does the program work for characters from different sources?

4. Do factors such as character size and character complexity affect program performance?

Initial results were obtained with a set of characters from a Taiwan printer. A sample of this set appears in Fig. 15. To start, 225 different characters were processed. This was to provide a dictionary for later tests, and to test the pattern analysis capabilities of the program.

The results show a reasonable structural representation produced for about 94% of the characters. The failures were all due to a particular component not being analyzed; for all characters the relationship among components was correctly determined. The problems all occurred in the NODE routine, which is supposed to isolate a node and locate all segments leading from it. The NODE routine would sometimes make mistakes if, for example, two nodes were very close together or one node covered a large area. The characters involved were typically quite complex.

From the characters that were successfully analyzed, 25 were chosen for additional testing. Four additional instances of each character from the same source were processed, for a total of 100 new characters.

All new instances of the 25 characters produced reasonable structural representations. For 5 of the characters, one of the new instances produced a slightly different representation, hence a different code. No character generated more than two codes. In all cases, the discrepancy was caused by the fact that two strokes which were very close in one instance touched in another instance of the same character.

Additional testing was done using two other sources. Characters from issues of a Chinese magazine were used. These were approximately half the size of the characters in the original set. Also, some computer-generated characters [6] were used. These were about double the size of the originals. Both were of about the same style. Fifty instances were taken from each source. The percentage of instances generating the same code as the corresponding character from the original set was 89% for the magazine source and 95% for the computer source. Discrepancies mostly had to do with stroke segments appearing at somewhat different angles and with strokes touching in one case but not the other.

FIG. 15. Example of Character Set.

9. CONCLUSIONS

Pattern Analysis

A descriptive scheme for the structure of Chinese characters has been proposed and a program for computer analysis conforming to the scheme has been written. The description is on two levels: the internal structure of components, and the relationship among components.

The first level of description is straightforward: a connected part of a character is represented by a graph. This representation is adequate for the description of components; it is reasonable for the human percipient to think of components as graphs.

Analysis on this level works fairly well; difficulty is encountered with some complex characters. Some work has been done on modifying the described approach. The modification consists of "shrinking" a component to a skeleton and obtaining the graph from the skeleton. This procedure is sensitive to contour noise, and it seems that use of this method would result in many components generating several different graphs from different instances.

The second level of description is based on the work of Rankin. With the exception of a very few characters whose components do not fit neatly into the frame description, it is an effective means of describing the structure of Chinese characters in terms of components. The analysis program for this level has been successful for all characters tested.

Character Recognition

Chinese character recognition is made difficult by the size of the character set and the complexity of the individual characters. Test results indicate that use of the approach described here would necessitate a dictionary in which some characters are associated with several codes.
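A dictionary in which one character may own several codes can be sketched as a pair of mappings. Nothing here reflects the structure of the actual program; the class `CodeDictionary`, its methods, and the placeholder character names are assumptions. A lookup returns a set because, as noted in Section 6, distinct components can occasionally share a code.

```python
from collections import defaultdict


class CodeDictionary:
    """Maps character codes to characters; a character may have several codes."""

    def __init__(self):
        self._characters_by_code = defaultdict(set)
        self._codes_by_character = defaultdict(set)

    def add(self, character, code):
        """Register one code (a sequence N_0, N_1, ..., N_n) for a character."""
        key = tuple(code)
        self._characters_by_code[key].add(character)
        self._codes_by_character[character].add(key)

    def lookup(self, code):
        """Return the set of characters registered under this code."""
        return set(self._characters_by_code.get(tuple(code), ()))

    def codes_for(self, character):
        """Return every code registered for a character."""
        return set(self._codes_by_character.get(character, ()))


dictionary = CodeDictionary()
dictionary.add("char_A", ["10200", "0624", "04", "26"])
dictionary.add("char_A", ["100", "0624", "0426"])  # strokes touching
dictionary.add("char_B", ["0", "0624"])
print(dictionary.lookup(["100", "0624", "0426"]))  # {'char_A'}
```

The second entry for `char_A` stands for the situation reported in the tests, where two strokes that are separate in one instance touch in another and so produce a different code.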

Several possibilities exist which could improve the chances of constructing a practical character recognition device:

1. High standards of print quality. A device restricted to use only with very high quality print should be more consistent in code generation, thus reducing the size of the required dictionary.

2. Stylized font. A specially-designed font tailored to the recognition algorithm would improve the algorithm's performance.

3. Language simplification. A particularly hopeful development in this regard is the Communist program to reduce the number of characters in general use and the complexity of individual characters [2].


The results reported here lead the author to believe that pattern analysis can be a fruitful approach to Chinese character recognition.

ACKNOWLEDGMENTS

The author would like to thank Professor Francis Lee of M.I.T., whose guidance and advice were invaluable to this project. The author is also grateful to Professor Thomas S. Huang of M.I.T. and Professor Herbert Teager of Boston University for their help.

REFERENCES

1. R. CASEY AND G. NAGY, Recognition of printed Chinese characters, IEEE Trans. Electronic Computers, 1966, Vol. EC-15, 91-101.

2. Y. CHU, A comparative study of language reforms in China and Japan, Skidmore College Faculty Research Lecture, Skidmore College Bulletin, Saratoga Springs, N. Y., 1969.

3. FUJIMURA AND KAGAYA, Structural Patterns of Chinese Characters, Research Institute of Logopedics and Phoniatrics, University of Tokyo, Annual Bulletin No. 3, April 1968-July 1969, pp. 131-148.

4. U. GRENANDER, A unified approach to pattern analysis, Advances in Computers, Academic Press, 1970, Vol. 10, pp. 175-216.

5. GRONER, HEAFNER, AND ROBINSON, On-line computer classification of handprinted Chinese characters as a translation aid, IEEE Trans. Electronic Computers, 1967, Vol. EC-16, 856-860.

6. A. V. HERSHEY, Calligraphy for Computers, U. S. Naval Weapons Lab., Dahlgren, Virginia, AD 662 398, 1967.

7. J. H. LIU, Real Time Chinese Handwriting Recognition Machine, Thesis, M.I.T., Cambridge, Mass., 1966.

8. O. ORE, Theory of Graphs, American Mathematical Society, Providence, R. I., 1962.

9. D. S. PRERAU, Computer Pattern Recognition of Standard Engraved Music Notation, Ph. D. Thesis, M.I.T., Cambridge, Mass., 1970.

10. RANKIN AND TAN, Component combination and frame-embedding in Chinese character grammars, NBS Tech. Note 492, National Bureau of Standards, Washington, D.C., 1970.

11. K. M. SAYRE, Recognition: a study in the philosophy of artificial intelligence, University of Notre Dame Press, Notre Dame, Indiana, 1965.

12. W. W. STALLINGS, Computer Analysis of Printed Chinese Characters, Ph. D. thesis, M.I.T., Cambridge, Mass., 1971.
