[ieee 2010 international conference on artificial intelligence and computational intelligence (aici)...

Study on Printed Tibetan Character Recognition

Ngodrup

Engineering Research Center of Tibetan Information Technology, Ministry of Education, Tibet University

Lhasa, Tibet, China

Dongcai Zhao, Putsren, Daluosanglangjie, Fang Liu, Bianbawangdui

Department of Computer Science, Engineering School, Tibet University

Lhasa, Tibet, China

Abstract—Owing to the special structure of Tibetan characters, traditional Tibetan character recognition suffers from low recognition rates and poor recognition quality. Based on an in-depth study of the features of printed Tibetan characters, this paper develops a series of methods that increase the recognition rate and improve recognition quality even in the presence of interference. These methods include a locally adaptive binarization algorithm, segmentation based on connected domains, and grid-based fuzzy stroke feature extraction. Experimental results indicate that the methods increase the recognition rate of the printed Tibetan character recognition system and improve its robustness to interference.

Keywords-Printed Tibetan character; Segmentation; Adaptive Local Binarization; OCR

I. INTRODUCTION

A. Information Processing

The development of Tibetan recognition systems is still at an early stage. With the promotion of Tibetan information technology, there is an increasing demand for storing and processing large quantities of Tibetan scripture books, texts, and other materials on computers. However, entering these texts and materials into a computer by hand consumes a great deal of labor and time. To process Tibetan documents properly and efficiently, there is an urgent need for a Tibetan character recognition system (Tibetan OCR) that converts Tibetan texts and written materials into machine-readable form.

B. The Features of Tibetan Scripts

Written Tibetan is composed of a series of horizontally and/or vertically arranged letters that combine to form composite syllables and words.

1 Height Feature

Tibetan script is written from left to right. A Tibetan character is a vertical combination of consonants and vowels. Within the general frame of a single character width, the forms of a character fall into three groups: short form, medium form, and long form. Each group has a number of subgroups.

The short form group is divided into two subgroups: the standard short form, with a character width of 5 pen points, and the wider standard short form, with a character width of 6 pen points.

The medium form group is divided into medium form A and medium form B. Medium form A contains two types of characters, double-stacked and flat triple-stacked, both with a character width of 7.5 pen points. The double-stacked form is further divided into two subgroups: a combination of thick and flat, and a combination of thick and thick. Medium form B consists mainly of three-script stacked characters, which are very common in Tibetan.

The long form group is divided into the single long form and the composite long form. A single long form character consists of one single script, with a character width of 12.5 pen points. There are two types of composite long form characters, the secondary long form and the standard long form.

1 The secondary long form lies between medium form A and the standard long form, with a character width between 10 and 11.5 pen points. It is categorized into thick triple-stacked and thick flat fourfold-stacked characters.

2 The standard long form consists of characters ranging from thick double-stacked to sixfold-stacked, with a character width of 12.5 pen points. For fivefold or sixfold stacks, the end point of the writing may extend 0.5 pen point beyond the standard width, but this happens very rarely in practice.

On the basis of the above, the widths of Tibetan characters are similar, but their heights differ greatly.

2 Baseline Feature

The “baseline” is one of the most important features of Tibetan characters. A flat line runs along the top of Tibetan scripts and sets the reference for the height of every single script. Only vowel marks can appear above the baseline. In the original illustration, the blue line represents the baseline.

The vertical structure of a Tibetan character

2010 International Conference on Artificial Intelligence and Computational Intelligence

978-0-7695-4225-6/10 $26.00 © 2010 IEEE

DOI 10.1109/AICI.2010.66


consists, in sequence, of an upper vowel, prefix, root letter, suffix, and post-suffix. The structure of a Tibetan character is therefore fairly complex. The baseline gives rise to a writing system organized around it: scripts are uniform above the baseline and stretch freely in height below it, which forms the structure and unique style of Tibetan characters. This is also one of the major features for Tibetan OCR.

3 Direction Feature

The direction of writing is another major feature of Tibetan fonts. Generally speaking, the basic orientation of a Tibetan character is the direction of its first stroke. The opening of the first stroke, or the recess in the character, faces the direction of the first stroke, and the closing part of the character usually appears as a closed vertical line.

This feature is closely related to the structure of the Tibetan character. A character is displayed in different forms when it occupies different positions; for example, the same letter takes different forms when it serves as a root letter, prefix, or suffix.

II. TIBETAN OCR SYSTEM

A. The Structure of Printed Tibetan OCR System

The recognition function of the Tibetan OCR system is realized by the components shown in Figure 1, which gives the basic structure of the system:

Figure 1. The Structure of Tibetan OCR system

B. Major Steps and Processing Methods

1) Adaptive Local Binarization

Since Tibetan script consists of simple, distinct lines, key information is easily lost if binarization is not done properly. For example, Figure 2 shows an original image.

Figure 2. 256 colored bitmap

If a common binarization method is applied, the binarized image shown in Figure 3 results.

Figure 3. Bitmap processed by common binarization methods

One of the most significant consequences is that one Tibetan character is easily recognized as another, which may cause a series of recognition mistakes later.

The adaptive local binarization method used in the Tibetan OCR system uses the information in the neighborhood of each pixel to calculate the threshold value $k$. Let the image have $L$ grey levels. For every $k$ with $1 < k < L$, the grey levels $1, \dots, L$ are divided into two groups. The first group has pixel count $\omega_1(k)$, average grey value $M_1(k)$, and variance $\sigma_1^2(k)$; the second group has pixel count $\omega_2(k)$, average grey value $M_2(k)$, and variance $\sigma_2^2(k)$, where the pixel counts are taken as fractions of the total number of pixels. Then:

Variance within groups:

$$\sigma_W^2 = \omega_1 \sigma_1^2 + \omega_2 \sigma_2^2$$

Variance between groups:

$$\sigma_B^2 = \omega_1 \omega_2 \left( M_1(k) - M_2(k) \right)^2$$

It can be verified that, for a given image, $\sigma_B^2 + \sigma_W^2 = \text{constant}$. Therefore, the $k$ that maximizes $\sigma_B^2$ also minimizes $\sigma_W^2$.

With this method, when the histogram has two crests separated by one trough, $k$ is the grey value at the trough. Even if no trough appears, the best separating threshold can still be calculated.

The Tibetan OCR system applies binarization twice during recognition. The first pass binarizes the entire image in order to carry out line segmentation and character segmentation. The second pass binarizes the local region of each segmented character in order to extract the critical information of the individual character. According to the experimental results, when the threshold obtained during local binarization is above 128, increasing it by 10 yields a more accurate binarized image of the Tibetan script.
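The threshold selection and the second-pass adjustment can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it reads the criterion as maximizing the between-group variance, assumes 8-bit grey values with dark ink on a light background, and the function names and the `bump` and `pivot` parameters are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level k that maximizes the between-group variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(g * c for g, c in enumerate(hist))

    best_k, best_var = 0, -1.0
    count1 = 0  # number of pixels with grey value <= k
    sum1 = 0    # sum of their grey values
    for k in range(levels - 1):
        count1 += hist[k]
        sum1 += k * hist[k]
        count2 = total - count1
        if count1 == 0 or count2 == 0:
            continue  # one group is empty, no split at this k
        w1, w2 = count1 / total, count2 / total
        m1, m2 = sum1 / count1, (total_sum - sum1) / count2
        var_between = w1 * w2 * (m1 - m2) ** 2
        if var_between > best_var:
            best_k, best_var = k, var_between
    return best_k


def binarize(pixels, threshold):
    """Map pixels at or below the threshold to 1 (ink, assumed dark), others to 0."""
    return [1 if p <= threshold else 0 for p in pixels]


def local_threshold(region_pixels, bump=10, pivot=128):
    """Second-pass threshold: raise it by `bump` when it exceeds `pivot`."""
    k = otsu_threshold(region_pixels)
    return min(k + bump, 255) if k > pivot else k
```

In this sketch the first pass would call `otsu_threshold` on the pixels of the whole page, and the second pass would call `local_threshold` on the pixels of each segmented character region.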

2) Segmentation Based on Connected Domain Projection Method

The segmentation of Tibetan scripts is divided into two parts: line segmentation and word segmentation.

a) Line Segmentation

Tibetan script is relatively unique in that both horizontally and vertically adjacent letters can sometimes overlap and impair recognition. In addition, recognition is often disturbed by imperfect or uneven inking and imaging of the letters. The four examples below (Figures 4 to 7) are the most typical cases.

(1) Radian Interference

Figure 4. Radian Interference

In this case, when a book is scanned, the areas near the book's midline appear curved because the pages bend, which causes the vowel in the following line to overlap with the lower vowel in the previous line.

(2) Adhesion Interference

Figure 5. Adhesion Interference

Remarks:

Letter adhesion, with non-uniform width;

separation between the upper vowel and the baseline, with the wave trough at the vowel in the lowest position;

a close distance between lines, so that the wave trough is not significant;

adhesion between lines, with non-uniform height;

when the number of scripts changes between the upper and lower lines, there is no way to determine whether the line with fewer scripts is an individual line of script or just part of the lower vowels of the previous line.

In this case, there is adhesion between lines, letter adhesion within a line, and separation between upper and lower vowels, and the wave trough is not significant.

(3) Separation Interference

Figure 6. Separation Interference

In this case, the vowel is separated from the main component, and it lies too close to the main component of the previous line. When font sizes differ, it is hard to determine whether the part is a script or a vowel.

(4) Number of Script Interference

Figure 7. Number of Script Interference

In this case, the number of scripts differs from line to line, and the wave trough is not significant.

For the four typical cases above, common discrimination methods cannot produce proper segmentations. Based on a large number of experiments, the connected domain projection method can be used for line segmentation of Tibetan script.

Connected Domain Projection Method:

First, search for connected domains: starting from the first endpoint $F(x, y)$ at the leftmost position of the image, recursively collect the class of all interconnected points $D(x, y)$ within the 8-neighborhood. The connected domain $Q(x, y)$ is then identified from $D(x, y)$. For instance, Figure 8 shows an image to be recognized. The connected domain search yields all of its connected domains (shown in Figure 9).

Figure 8. Image for Recognition

Figure 9. The Connected Domains Obtained
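The connected domain search can be sketched as follows. This is an illustrative sketch under stated assumptions: the image is a list of rows of 0/1 values with 1 for ink, each domain is reported as an inclusive bounding box, and an explicit stack replaces the recursion described in the text to avoid deep call chains. The function name is hypothetical.

```python
def connected_domains(image):
    """Label 8-connected groups of ink pixels; return one bounding box per group.

    `image` is a list of rows of 0/1 values (1 = ink).
    Each box is (left, top, right, bottom), inclusive.
    """
    rows = len(image)
    cols = len(image[0]) if image else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    # Scan column by column so the leftmost ink pixel is visited first.
    for x in range(cols):
        for y in range(rows):
            if image[y][x] != 1 or seen[y][x]:
                continue
            left = right = x
            top = bottom = y
            stack = [(x, y)]
            seen[y][x] = True
            while stack:
                cx, cy = stack.pop()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                # Visit the 8 neighbors of the current pixel.
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < cols and 0 <= ny < rows
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes
```

Because diagonal neighbors count as connected, two pixels that touch only at a corner end up in the same domain.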

The connected domains obtained are usually separate regions of different sizes that may overlap or lie apart from each other. To keep only the domains that are useful to the system, each one is checked for validity as follows.

Take the height and width of the connected domain, and keep the domain if it fulfills all of the following conditions:

$width < height$: in standard Tibetan script, the width of a script is most likely smaller than its height.

$height < 3 \times width$: in standard Tibetan script, the height of a script is most likely smaller than three times its width.

$width \times height > 100$: the minimum script size that the OCR system can accept.

The connected domains that satisfy these three conditions are retained. See Figure 10.

Figure 10. The Connected Domain that Matches the conditions
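The three validity conditions translate directly into a small predicate. The sketch below assumes each domain is given as an inclusive bounding box `(left, top, right, bottom)` measured in pixels; the function and parameter names are hypothetical.

```python
def is_usable_domain(box, min_area=100, max_aspect=3):
    """Check the three validity conditions for one connected domain."""
    left, top, right, bottom = box
    width = right - left + 1
    height = bottom - top + 1
    return (width < height                    # taller than wide
            and height < max_aspect * width   # but not more than 3x as tall
            and width * height > min_area)    # large enough to be a script
```

For example, a 10 by 20 pixel box passes all three checks, while a 5 by 10 box fails the area condition.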

Next, the connected domains are projected as follows: set the width of each connected domain to 1, divide its height by 2, and thereby form the line segment $X_n = 0$ ($n$ ranges up to the number of connected domains). A new wave is formed by accumulating the projections of all line segments on the Y axis, $Y = Y + Y_n \times M$, where $M$ is the amplification factor. See Figure 11.

Figure 11. New Waves

In the end, the segmentation points are identified according to the positions of wave trough and wave crest to complete the line segmentation.
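One possible reading of the projection step is sketched below. Several details are assumptions that the text does not fix: the halved line segment is centered vertically on its domain, $Y_n$ is taken as 1 on the rows the segment covers, and a cut is placed in the middle of each empty gap between two crests. The function names are hypothetical.

```python
def line_profile(boxes, image_height, m=1):
    """Accumulate the Y-axis projection of each domain's half-height segment."""
    profile = [0] * image_height
    for left, top, right, bottom in boxes:
        height = bottom - top + 1
        half = max(1, height // 2)
        start = top + (height - half) // 2  # assumption: segment centered vertically
        for y in range(start, start + half):
            profile[y] += m                 # Y = Y + Y_n * M, with Y_n = 1 on covered rows
    return profile


def line_cuts(profile):
    """Return cut rows placed midway through each empty gap between two crests."""
    cuts = []
    gap_start = None
    seen_crest = False
    for y, value in enumerate(profile):
        if value > 0:
            if seen_crest and gap_start is not None:
                cuts.append((gap_start + y - 1) // 2)
            seen_crest, gap_start = True, None
        elif seen_crest and gap_start is None:
            gap_start = y
    return cuts
```

Shrinking every domain to a thin, half-height segment keeps tall vowels and stacked letters from filling the gap between lines, which is a plausible reason the resulting wave has clearer troughs than a plain pixel projection.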

b) Word Segmentation

Word segmentation builds on the principles of line segmentation. It extracts a single word image from the image of the entire line. Word segmentation also uses the connected domain method, adapted to the characteristics of Tibetan script images.

The most typical cases for word segmentation are as follows (see Figures 12, 13, and 14):

(1) Words Overlapping

Figure 12. Words Overlapping Interference

In this case, two individual letters adhere to each other.

(2) Tsheg Point Adhesion

Figure 13. Tsheg Point Adhesion Interference

In this case, the syllable marker tsheg adheres to another letter because of scanning problems or severe interference.

(3) Overlapping

Figure 14. Overlapping interference

In this case, the letters do not adhere to each other but overlap along the longitudinal axis.

The algorithm used in word segmentation is the same as the one used in line segmentation. Running the connected domain projection method produces the result shown in Figure 15.

Figure 15. The Obtained Connected Domain in Word Segmentation

Once the connected domains are obtained, the system integrates or splits them. The connected domains of a Tibetan script image stand in one of three relationships: up and down, left and right, or overlapping. See Figure 16.

Figure 16. Relationship between Connected Domains (Up and Down, Left and Right, Overlap)

Rules for integrating and splitting:

Since $D(x, y)$ contains many small regions that represent the syllable marker tsheg and vowels, the system first computes the average width $D_p$ of the segmented regions $D(x, y)$:

$$D_p = \frac{1}{n} \sum_{i=1}^{n} D_i(x, y)$$

where $n$ is the number of regions $D(x, y)$. The system then eliminates all domains $D_m(x, y)$ whose width is smaller than $D_p$, and applies the same averaging to the remaining domains to obtain $D_s$. $D_s$ can be regarded as the width of an actual script, and $J_s$, the width of the syllable marker tsheg, is then calculated as $J_s = D_p - D_s$.
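The two averaging passes can be sketched as follows. This sketch covers only the averages: it assumes the input is a plain list of domain widths in pixels and that the first pass discards every domain narrower than the overall average. The function name is hypothetical.

```python
def width_averages(widths):
    """Return (average width of all domains, average width of the wide ones)."""
    d_p = sum(widths) / len(widths)
    # Assumption: domains narrower than the overall average are tsheg or vowel
    # fragments and are dropped before the second average is taken.
    wide = [w for w in widths if w >= d_p]
    d_s = sum(wide) / len(wide)
    return d_p, d_s
```

With three script-sized domains of width 20 and two small marks of width 4, the overall average is 13.6 and the second average recovers the script width of 20.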

Once the values of $D_s$ and $J_s$ are obtained, the system decides whether to integrate or split domains according to the following conditions:

If $D(x, y) > D_s + 4 \times J_s - 3$ holds, the domain is split. The split divides the domain at its middle position into two domains.

If $D_1$ and $D_2$ are two connected domains with boundaries $D_1(Left_1, Right_1)$ and $D_2(Left_2, Right_2)$, then the leftmost boundary of $D_1$ and $D_2$ is

$$MLeft = Left_1 < Left_2 \;?\; Left_1 : Left_2$$

and the rightmost boundary is

$$MRight = Right_1 > Right_2 \;?\; Right_1 : Right_2$$

If the sum of the widths of the two connected domains is larger than the distance from the leftmost to the rightmost boundary of the region they form together, that is,

$$(Right_1 - Left_1) + (Right_2 - Left_2) > MRight - MLeft$$

then the domain with the smaller width is determined as

$$MIN(Right_1 - Left_1,\; Right_2 - Left_2)$$

If the overlapping area is larger than the non-overlapping area, the two connected domains are integrated: a new domain spanning $MLeft$ to $MRight$ is created and the two original domains are deleted. If the overlapping area is smaller than the non-overlapping area, the two connected domains are split: $Right_1$ of $D_1$ and $Left_2$ of $D_2$ are both set to the middle position of the overlapping area.

Result after integration or split is shown in figure 17

Figure 17. Result after integration or split
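The decision for a pair of horizontally overlapping domains can be sketched as below. The comparison of overlapping and non-overlapping area is interpreted here on the narrower of the two domains, which is an assumption; the function name and the return format are hypothetical.

```python
def resolve_pair(d1, d2):
    """Decide whether two domains, each a (left, right) pair, integrate or split.

    Returns ("merge", (MLeft, MRight)), ("split", (domain1, domain2)),
    or ("keep", (domain1, domain2)) when the domains do not overlap.
    """
    (left1, right1), (left2, right2) = sorted([d1, d2])
    m_left = left1 if left1 < left2 else left2
    m_right = right1 if right1 > right2 else right2
    width1, width2 = right1 - left1, right2 - left2
    # Sum of widths minus the combined span is the length of the overlap.
    overlap = width1 + width2 - (m_right - m_left)
    if overlap <= 0:
        return "keep", ((left1, right1), (left2, right2))
    narrow = min(width1, width2)
    # Assumption: overlapping vs. non-overlapping area is measured on the
    # narrower domain.
    if overlap > narrow - overlap:
        return "merge", (m_left, m_right)
    cut = (left2 + min(right1, right2)) // 2  # middle of the overlapping stretch
    return "split", ((left1, cut), (cut, right2))
```

A narrow vowel mark sitting almost entirely inside the span of a wider letter is integrated into it, while two letters that share only a thin strip are cut apart in the middle of that strip.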

3) Vague Stroke and Profile Feature Extraction Based on Grid

Feature extraction for Tibetan script is based on a grid design. The system normalizes the image of each segmented script to a $66 \times 34$ dot matrix, smooths the image, partitions it into a grid, and extracts the vague stroke and profile features in each grid zone. It consists of two major components: grid partition and grid feature description.

a) Grid Partition

The grid partition uses the even sharing measure of the histogram; the result is shown in Figure 18.

Figure 18. Even Sharing Measure of Histogram

The even sharing measure of the histogram projects the Tibetan script in the horizontal and vertical directions and divides the projection values into equal shares, which yields $3 \times 2$ grid zones of unequal size. The system then divides every resulting zone again by the same method into $3 \times 2$ unequal zones. In the end, the total number of grid zones is 36. The results of the division are shown in Figure 19.

Figure 19. Result of Grid Division
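The two-level partition can be sketched as follows. The sketch assumes a 0/1 image with 66 rows and 34 columns, that "3 × 2" means three rows by two columns, and that each cut is placed where the cumulative projection reaches the next equal share of the total; none of these details are stated explicitly in the text. The function names are hypothetical.

```python
def equal_mass_cuts(projection, parts):
    """Split the index range [0, len) into `parts` spans of roughly equal mass."""
    total = sum(projection)
    n = len(projection)
    if total == 0:
        # No ink at all: fall back to spans of equal length.
        return [round(i * n / parts) for i in range(parts + 1)]
    cuts, running, target = [0], 0, 1
    for i, value in enumerate(projection):
        running += value
        while target < parts and running * parts >= target * total:
            cuts.append(i + 1)
            target += 1
    cuts.append(n)
    return cuts


def partition(image, top, bottom, left, right, rows=3, cols=2):
    """Divide a rectangle of a 0/1 image into rows x cols equal-mass zones."""
    row_proj = [sum(image[y][left:right]) for y in range(top, bottom)]
    col_proj = [sum(image[y][x] for y in range(top, bottom))
                for x in range(left, right)]
    ys = [top + c for c in equal_mass_cuts(row_proj, rows)]
    xs = [left + c for c in equal_mass_cuts(col_proj, cols)]
    return [(ys[r], ys[r + 1], xs[c], xs[c + 1])
            for r in range(rows) for c in range(cols)]


def grid_zones(image):
    """Apply the 3 x 2 partition twice, giving 6 x 6 = 36 zones."""
    height, width = len(image), len(image[0])
    zones = []
    for zone in partition(image, 0, height, 0, width):
        zones.extend(partition(image, *zone))
    return zones
```

Each zone is returned as `(top, bottom, left, right)` with exclusive end indices, so zones with dense strokes come out small and sparse regions come out large.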

b) Grid Feature Description

The extraction of the grid features is based on algorithms for stroke direction decomposition and stroke outline decomposition. The strokes themselves are represented by runs of black pixels, while the stroke outlines are represented by runs of white pixels.

Taking black pixel strokes as an example: start from the first upper-left black pixel $P(x, y)$ in a given grid zone, and trace the black pixels recursively from $P$ along the 8 directions of its 8-neighborhood (as shown in Figure 20). The recursion in a direction terminates when a white pixel is met, which gives the lengths in the eight directions, f1 to f8.

Figure 20. Eight directions

As shown in Figure 21, the black pixels are traced in eight directions from the point P in the image of a Tibetan script, which gives the black pixel lengths in the eight directions, f1 to f8.

Figure 21. Black Pixels Recursing to Eight Direction from Point P

Because of interference introduced during scanning or by variations in the script, the system uses a fuzzy definition in the direction decomposition. The exact stroke directions of Tibetan script, such as Heng (horizontal), Shu (vertical), Pie (left-falling), Zhe (turning), and arcs, are processed fuzzily according to the following formulas:

D1 = f1 + f5 + f2 + f6 + f8 + f4;

D2 = f2 + f6 + f3 + f7 + f1 + f5;

D3 = f3 + f7 + f4 + f8 + f2 + f6;

D4 = f4 + f8 + f1 + f5 + f3 + f7;

The four directions obtained in this way are the vague directions of the corresponding strokes. This processing suppresses the interference. The system then finds the maximum of D1 to D4 and accumulates it into the counters G1 to G4:

if Max(D1, D2, D3, D4) = D1 then G1++;

if Max(D1, D2, D3, D4) = D2 then G2++;

if Max(D1, D2, D3, D4) = D3 then G3++;

if Max(D1, D2, D3, D4) = D4 then G4++;

Finally, the system determines the distribution of every direction, where n is the number of black pixels in the grid zone:

F1 = G1 / n; F2 = G2 / n; F3 = G3 / n; F4 = G4 / n;

The same method is applied to the white pixel strokes to obtain F5, F6, F7, and F8.
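The feature computation for one grid zone can be sketched as follows. The numbering of f1 to f8 is an assumption (counter-clockwise starting from "east", so that f1/f5, f2/f6, f3/f7, and f4/f8 are opposite pairs), since Figure 20 is not reproduced here. The sketch also assumes that the run lengths are measured for every ink pixel of the zone, that runs stop at the zone border, and that ties between D1 to D4 increment every matching counter, following the four independent conditions. The names are hypothetical.

```python
# Assumed numbering of f1..f8: counter-clockwise from "east" as (dx, dy) steps.
DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, -1),
              (-1, 0), (-1, 1), (0, 1), (1, 1)]


def run_lengths(grid, x, y, ink=1):
    """Lengths f1..f8 of the ink runs leaving (x, y) in the eight directions."""
    rows, cols = len(grid), len(grid[0])
    lengths = []
    for dx, dy in DIRECTIONS:
        length, cx, cy = 0, x + dx, y + dy
        while 0 <= cx < cols and 0 <= cy < rows and grid[cy][cx] == ink:
            length += 1
            cx, cy = cx + dx, cy + dy
        lengths.append(length)
    return lengths


def direction_features(grid, ink=1):
    """Return [F1, F2, F3, F4] for one grid zone given as rows of 0/1 values."""
    g = [0, 0, 0, 0]
    n = 0
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            if value != ink:
                continue
            n += 1
            f1, f2, f3, f4, f5, f6, f7, f8 = run_lengths(grid, x, y, ink)
            d = [f1 + f5 + f2 + f6 + f8 + f4,
                 f2 + f6 + f3 + f7 + f1 + f5,
                 f3 + f7 + f4 + f8 + f2 + f6,
                 f4 + f8 + f1 + f5 + f3 + f7]
            top = max(d)
            for i in range(4):
                if d[i] == top:  # ties increment every matching counter
                    g[i] += 1
    return [count / n for count in g] if n else [0.0] * 4
```

Calling `direction_features(zone, ink=0)` on the same zone gives the four white-pixel values, so the two calls together produce the eight features of one grid zone.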

Each grid zone thus yields eight feature values. With 36 grid zones per script image, 8 × 36 = 288 feature dimensions are collected for one script. A number of experiments confirmed that 288-dimensional features give the best results in both recognition quality and recognition time for the Tibetan OCR system.

4) Recognition Processing

Recognition uses a distance classifier. Commonly used distance measures include the Euclidean distance, the weighted distance, and the city block distance. The Tibetan OCR system uses a weighted error equalizing distance, and the function that gives the distance between two feature vectors is as follows:

2 2

1

, [ ]( ) ( )n

i i i i i

i

f X Y w x y wε=

= − +

where

$$w_i = \frac{n}{(\sigma_i + a) \sum_{j=1}^{n} \frac{1}{\sigma_j + a}}$$

$\sigma$ is the variance, $\varepsilon$ is 10, and $a$ is 8. The candidate with the minimum value of $f$ is taken as the final recognition result for the script.
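The weighting and the minimum-distance decision can be sketched as follows. The sketch reads the weight definition as $w_i = n / \big((\sigma_i + a) \sum_j 1/(\sigma_j + a)\big)$, which makes the weights sum to $n$. The distance inside `classify` is a plain weighted squared difference used only as a stand-in: it is an assumption for illustration and is not the paper's function $f$, and it omits $\varepsilon$. All names are hypothetical.

```python
def feature_weights(sigmas, a=8.0):
    """w_i = n / ((sigma_i + a) * sum_j 1 / (sigma_j + a))."""
    n = len(sigmas)
    inv_sum = sum(1.0 / (s + a) for s in sigmas)
    return [n / ((s + a) * inv_sum) for s in sigmas]


def classify(x, templates, weights):
    """Return the label whose template has the smallest distance to x.

    The distance used here, sum_i w_i * (x_i - y_i)^2, is a stand-in chosen
    for illustration; it is not the paper's function f.
    """
    def distance(y):
        return sum(w * (xi - yi) ** 2 for w, xi, yi in zip(weights, x, y))
    return min(templates, key=lambda label: distance(templates[label]))
```

Features with a small variance receive a large weight, so stable feature dimensions dominate the comparison, and the constant `a` keeps the weight bounded when a variance is close to zero.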

III. TEST RESULTS

With the formulation of the Chinese national standards for the Tibetan coded character set (including the Basic Set, Extension Set A, and Extension Set B), these standards are now widely applied in all kinds of printed matter. The Tibetan OCR system, developed by integrating all the measures and algorithms described above, was tested on 1,625 Tibetan scripts from the Chinese national standard Tibetan Coded Character Basic Set (partial) and Extension Set A (all). With zero scanning interference, the average recognition rate reaches 98% and the average recognition speed reaches 260 Tibetan scripts per second. With common-mode interference, the average recognition rate reaches 92% and the average recognition speed again reaches 260 Tibetan scripts per second. The detailed test data are shown in Table 1 below:

Table 1. Average Recognition Rates of Samples (%)

Data               The Sample Quality
                   Zero           Common Scanning Interference
                   Interference   Partial Field   Entire Text
                                  Recognition     Recognition
Training Samples   99.39          95.76           94.55
Testing Samples    98.78          93.93           92.74

IV. CONCLUSIONS

Through research and experiments on a large quantity of printed Tibetan characters, we developed a series of measures and methods suited to Tibetan character recognition. We integrated these measures and methods into a complete recognition scheme, and we continue to study and optimize it to build a more effective recognition system.

ACKNOWLEDGMENT

We thank the National Natural Science Foundation of China for financial support (No. 60863013).

