[ieee 2010 10th international symposium on communications and information technologies (iscit) -...

New Approach on Structural Feature Extraction for Character Recognition

Premnath Dubey Image Technology Laboratory

National Electronics and Computer Technology Center Thailand

[email protected]

Wasin Sinthupinyo Image Technology Laboratory

National Electronics and Computer Technology Center Thailand

[email protected]

Abstract: This paper proposes a new structural feature extraction technique. Features are extracted from character images using a structural feature model called a feature template. The feature template is created from feature vectors that are extracted from a character image corpus and then clustered into N classes. These N classes of feature vectors serve as the template for generalized structural feature extraction. The structural features are defined automatically through K-Means clustering, unlike traditional techniques in which they are defined manually. With a very simple template matching scheme for character recognition, the technique shows promising results, with an average accuracy of about 95%.

Keywords: feature extraction, structural feature, statistical feature, pattern recognition, character recognition

I. INTRODUCTION

In pattern recognition applications, there are two main approaches: statistical and structural. Statistical pattern recognition adopts quantitative, or numerical, features [1][2][3], while structural approaches apply qualitative features, usually called primitives [4][5][6]. Features in the statistical approach are measurable data and can therefore be classified by means of decision theory. Structural features, in contrast, are non-numerical data such as types of shape, e.g. line, circle, or triangle. The strength of the structural method over the statistical one is that it represents a pattern in a way similar to how humans perceive it. Recognizing patterns with this method involves extracting the features and analyzing their interconnectivity. However, structural features, i.e. the various types of shape mentioned above, are very difficult to extract accurately. Moreover, the set of primitives is usually application oriented and needs a domain expert to define it manually. This paper proposes a new approach to structural feature extraction for character recognition. It aims to reduce the time and complexity of defining the primitives and implementing an algorithm for every new character set by introducing a generalized structural feature extraction technique.

In the proposed technique, features are extracted from character images using a data structure called a feature template. The feature template is created from feature vectors that are extracted from a character image corpus and then clustered into N classes. These N classes of feature vectors serve as the template for generalized structural feature extraction. Since the structures here are defined automatically through K-Means clustering, we cannot expect them to be the same as manually defined ones. Most structural feature extraction methods apply a set of rules to identify the structures and topology of characters. The proposed feature extractor instead identifies the structures by mapping a feature vector onto the feature template, which returns the label of the closest class. Because each character is divided into 16 zones and each zone is represented by one feature vector, the feature extraction outputs a sequence of 16 cluster labels. In the recognition step, we simply compare this string of feature classes with user-provided samples. Although the method can be applied to many types of pattern, character recognition is used here to demonstrate the technique. The implementation is divided into two main processes, the offline process and the system evaluation process, which are detailed in the following sections.

II. THE OFFLINE PROCESS

In the offline step, feature vectors are extracted from a character corpus and clustered to create a feature template. A total of 50,750 characters is used in this process. They cover 10 fonts of the Thai character set at 6 point sizes, scanned with a number of different scanner configurations to introduce some variation into the images. Normal Thai characters are composed of basic structures such as straight lines (vertical or horizontal), curves, circles, and curls. The processes that create the feature template are shown in Fig. 1.

978-1-4244-7010-5/10/$26.00 ©2010 IEEE ISCIT 2010

A. Feature vector

The feature vectors extracted in this step are numerical features that are ultimately used to create the feature template. The feature is based on the contour direction of the character image. The contour direction can be obtained as a product of an edge detection algorithm such as Sobel (1).

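$$G_x = (z_7 + 2z_8 + z_9) - (z_1 + 2z_2 + z_3), \qquad G_y = (z_3 + 2z_6 + z_9) - (z_1 + 2z_4 + z_7)$$

$$\alpha(x, y) = \arctan\left(\frac{G_y}{G_x}\right) \tag{1}$$

Here $\alpha(x, y)$ is the direction of the gradient on the character image contour (edge), and $z_n$ is the gray level of the n-th point in the 3×3 mask [7].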
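As a minimal sketch of Eq. (1), the direction at a single pixel can be computed from its 3×3 neighbourhood using the Sobel masks as given in [7]. The function name `contour_direction`, the row-major numbering z1..z9, and the use of `atan2` (so that the angle covers the full 2π range used by the histogram step) are assumptions of this sketch rather than details stated in the paper.

```python
import math

def contour_direction(mask):
    """Gradient direction (radians, in [0, 2*pi)) at the centre of a 3x3 mask.

    `mask` is a 3x3 list of gray levels, numbered row-major z1..z9.
    Returns None where the gradient is zero (no edge).
    """
    (z1, z2, z3), (z4, z5, z6), (z7, z8, z9) = mask
    gx = (z7 + 2 * z8 + z9) - (z1 + 2 * z2 + z3)  # bottom row minus top row
    gy = (z3 + 2 * z6 + z9) - (z1 + 2 * z4 + z7)  # right column minus left column
    if gx == 0 and gy == 0:
        return None  # flat region: direction is undefined
    # atan2 resolves the quadrant; the modulo maps (-pi, pi] onto [0, 2*pi)
    return math.atan2(gy, gx) % (2 * math.pi)
```

For example, a mask whose right column is bright and whose other pixels are dark gives G_x = 0 and G_y > 0, so the returned direction is π/2.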

The contour direction is a feature that preserves some local information about the shape of a character (Fig. 2). To obtain a higher-level description, we divide the character image into smaller regions, or zones, and collect the direction information of each zone into a histogram. The 2π range of the contour direction is divided equally into 6 groups, giving a 6-bin histogram. This histogram is our feature vector, which is used as a descriptor of the shape in a particular zone.
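The binning described above can be sketched as follows. The function name `direction_histogram` is hypothetical, and the sketch assumes raw counts with angles already expressed in radians in [0, 2π); the paper does not say whether the histogram is normalized.

```python
import math

def direction_histogram(directions, bins=6):
    """Count directions (radians in [0, 2*pi)) into `bins` equal-width bins."""
    hist = [0] * bins
    width = 2 * math.pi / bins  # each bin spans pi/3 when bins == 6
    for angle in directions:
        # min() guards against an angle of exactly 2*pi landing past the last bin
        hist[min(int(angle / width), bins - 1)] += 1
    return hist

# Five contour pixels of one zone, directions in radians
print(direction_histogram([0.0, 0.5, 1.2, 3.2, 6.2]))  # [2, 1, 0, 1, 0, 1]
```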

Figure 1. Illustration of the offline processes.

B. Zoning

Since the character images may differ in size, they first have to be normalized to the same size. All input images are normalized to 32×32 pixels. The normalized image is divided into 4 by 4, or 16, zones with some overlapping area, as seen in Fig. 3. Each zone has its own histogram of contour direction. This histogram is the feature vector that should ultimately represent the shape in the zone.
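One possible way to lay out the 16 overlapping zones is sketched below. The paper does not state how large the overlap is, so the `margin` of 2 pixels, like the names `zone_bounds` and `split_into_zones`, is purely an assumption for illustration.

```python
def zone_bounds(size=32, grid=4, margin=2):
    """Return (top, bottom, left, right) pixel bounds for grid*grid zones.

    Each zone is a size/grid square widened by `margin` pixels on every
    side (clipped at the image border), so neighbouring zones overlap.
    """
    step = size // grid
    bounds = []
    for row in range(grid):
        for col in range(grid):
            top = max(row * step - margin, 0)
            bottom = min((row + 1) * step + margin, size)
            left = max(col * step - margin, 0)
            right = min((col + 1) * step + margin, size)
            bounds.append((top, bottom, left, right))
    return bounds

def split_into_zones(image, **kwargs):
    """Cut a square image (list of rows) into its overlapping zones."""
    return [[row[left:right] for row in image[top:bottom]]
            for top, bottom, left, right in zone_bounds(len(image), **kwargs)]
```

With these defaults an interior zone covers 12×12 pixels and a corner zone 10×10 pixels; a different margin only changes how much context each zone histogram sees.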

Figure 2. A Thai character image and its direction of contour.

C. Feature vector clustering

The feature vector, or histogram, of each zone is clustered to create the feature template. The K-Means algorithm is used to cluster a total of 812,000 feature vectors generated from the 50,750 characters (50,750 × 16 = 812,000).

Different values of K were tested, and K = 40 was found to give the best clustering as well as the best final character recognition result.
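The clustering step can be illustrated with a plain Lloyd-style K-Means iteration. This is a sketch on toy data, not the authors' implementation: the random initialization, the fixed seed, the stopping rule, and the helper names are all assumptions. In the paper the input would be the 812,000 six-bin histograms with K = 40.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vector, centroids):
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(vector, centroids[k]))

def k_means(vectors, k, iterations=50, seed=0):
    """Plain Lloyd iteration; returns the list of k centroid vectors."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # Assignment step: group every vector with its nearest centroid
        members = [[] for _ in range(k)]
        for v in vectors:
            members[nearest(v, centroids)].append(v)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid)
        updated = [
            [sum(col) / len(group) for col in zip(*group)] if group else centroids[i]
            for i, group in enumerate(members)
        ]
        if updated == centroids:
            break  # centroids no longer move
        centroids = updated
    return centroids

# Toy data: two well-separated groups of 2-D points
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids = sorted(k_means(points, k=2))
```

The returned centroid list plays the role of the feature template: its index positions are the cluster labels.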

Figure 3. Illustration of zoning applied to the 32×32 normalized image.

D. Feature template

The feature template can be considered as the set of all possible basic structures, or primitives, of a character set. As mentioned in the previous section, the number of clusters K with the best result is 40, so we have 40 types of primitive. Normally the representative of a cluster in K-Means clustering is its centroid (mean) vector, and here too we use the centroid vectors as the feature template. In this research we also tested another way to model the feature template, namely a back-propagation neural network, which is intended to handle noise better and execute faster.

[Figure 1 block labels: Feature Extraction, Character Corpus, Feature vector clustering, Clustered feature vector, Feature template, Generate feature template]

The feature vectors that fall into each cluster under the K-Means algorithm are used as training samples for the corresponding class of the feature template. The neural network therefore has 6 inputs and 40 output classes. Different numbers of hidden nodes were tested, and 100 was found to give the best performance.
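To make the reported network size concrete, the sketch below builds an untrained 6-100-40 network and runs one forward pass. Only the three layer sizes come from the text; the single hidden layer, the sigmoid activation, the bias terms, the random weight range, and all names are assumptions, and no back-propagation training is shown.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 6, 100, 40  # sizes reported in the text

def init_layer(n_in, n_out, rng):
    """Small random weights plus one bias weight per output node (untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(layer, inputs):
    """Sigmoid activations of one fully connected layer."""
    extended = list(inputs) + [1.0]  # the constant 1.0 feeds the bias weight
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(node, extended))))
            for node in layer]

def classify(hidden, output, histogram):
    """Label (0..39) of the output node with the strongest response."""
    scores = forward(output, forward(hidden, histogram))
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
hidden = init_layer(N_INPUT, N_HIDDEN, rng)
output = init_layer(N_HIDDEN, N_OUTPUT, rng)
label = classify(hidden, output, [0.2, 0.1, 0.0, 0.3, 0.0, 0.4])
```

Under these assumptions the network has (6 + 1) × 100 + (100 + 1) × 40 = 4,740 weights, which is small enough that one classification costs only a few thousand multiplications.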

III. THE SYSTEM EVALUATION

To evaluate our method, a simple template matching system was constructed. The feature template is intended as a general-purpose structural feature descriptor, which means it should be applicable to any set of characters. We therefore chose another set of images for testing, containing 2 fonts of Thai characters and 2 fonts of English characters, all of which are unknown to the system. To perform the template matching, a few samples of each character are randomly chosen as templates.

Figure 4. Illustration of the system evaluation processes.

A. Preparing template

To perform character recognition by template matching, we have to provide some sample images of each character to use as templates. These samples also go through the normalization and zoning described previously. The histogram of each zone is compared against the feature template to find the closest cluster. The output for each character is therefore a sequence of feature cluster labels, called here the descriptor string. The length of the descriptor string corresponds to the number of zones, which is 16 in this case. The descriptor strings, together with the character codes of the sample characters, are kept for comparison in the testing process.
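Building a descriptor string from zone histograms can be sketched as a nearest-centroid lookup. The squared Euclidean distance and the names `nearest_cluster` and `descriptor_string` are assumptions; the paper only says that the closest cluster is found. The toy template below has 3 centroids instead of 40 and the character has 3 zones instead of 16.

```python
def nearest_cluster(histogram, template):
    """Label of the template centroid with the smallest squared Euclidean distance."""
    def distance(centroid):
        return sum((h - c) ** 2 for h, c in zip(histogram, centroid))
    return min(range(len(template)), key=lambda label: distance(template[label]))

def descriptor_string(zone_histograms, template):
    """One cluster label per zone, in zone order."""
    return [nearest_cluster(h, template) for h in zone_histograms]

# Toy feature template: three 6-bin centroids
template = [
    [10, 0, 0, 0, 0, 0],
    [0, 10, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 10],
]
zones = [[9, 1, 0, 0, 0, 0], [0, 8, 1, 0, 0, 0], [1, 0, 0, 0, 0, 7]]
print(descriptor_string(zones, template))  # [0, 1, 2]
```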

B. Testing Process

In the testing process, the feature template is used to generate the descriptor string as in the previous step. The string is then compared with the strings of the samples. The comparison is straightforward: each element of the string is compared with the corresponding element of a sample, and a point is scored for every pair of matching elements. The total score indicates the closeness to the sample, and the sample with the highest score gives the recognition result.
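The scoring rule just described amounts to counting position-wise matches between two descriptor strings. The sketch below uses hypothetical names and 4-element strings for brevity (the paper uses 16); how ties are broken is not stated in the paper, so returning the first best-scoring sample is an assumption.

```python
def match_score(descriptor, sample):
    """Number of zones whose cluster labels agree."""
    return sum(1 for a, b in zip(descriptor, sample) if a == b)

def recognize(descriptor, samples):
    """Character code of the stored sample with the highest match score.

    `samples` is a list of (character_code, descriptor) pairs; on a tie
    the earliest sample in the list wins.
    """
    best_code, _ = max(samples, key=lambda item: match_score(descriptor, item[1]))
    return best_code

samples = [("A", [1, 2, 3, 4]), ("B", [1, 2, 9, 9]), ("C", [0, 0, 0, 0])]
print(recognize([1, 2, 3, 0], samples))  # A
```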

IV. EXPERIMENT

The test was done with three sets of unknown fonts: two fonts of the Thai character set and one font of the English character set. The number of classes is 46 in the Thai character set and 26 in the English one. The number of test images per font is 1150 for Thai and 520 for English. The different aspects that were tested are described below.

First, we tested our system with a variable number of samples: 5, 7, and 10. The test results are shown in Tables I and II. Using 10 samples gives the highest score in all cases, and the recognition rate increases with the number of samples. For the English character set, the test with 5 samples of the Iris font gives an accuracy of 94.8%, which is higher than its Thai counterpart, and English achieves the higher accuracy in every case. One reason is that a number of Thai characters look almost identical, which is where most of the errors in the test occur. Some of these characters are shown in Fig. 5.

TABLE I. TEST RESULT WITH THAI FONTS (ACCURACY IN %)

Font      5 samples   7 samples   10 samples
Iris      92.0837     93.7391     95.1404
Jasmine   91.0435     93.4783     94.9565

TABLE II. TEST RESULT WITH ENGLISH FONT (ACCURACY IN %)

Font      5 samples   7 samples   10 samples
Iris      94.83       95.6701     96.0643
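One way to arrive at the reported average of roughly 95% is to average the 10-sample results of the three tested fonts. Whether the authors computed their figure this way is an assumption; the values themselves are copied from Tables I and II.

```python
# Accuracy (%) with 10 samples per character, copied from Tables I and II
accuracy_10 = {
    ("Thai", "Iris"): 95.1404,
    ("Thai", "Jasmine"): 94.9565,
    ("English", "Iris"): 96.0643,
}
average = sum(accuracy_10.values()) / len(accuracy_10)
print(f"{average:.2f}")  # 95.39
```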


Figure 5. Some of the Thai characters that look almost identical.

As stated previously, there are two methods of feature template classification: K-Means centroid matching and the back-propagation neural network. The result is shown in Table II. The neural network outperforms the K-Means classifier in almost all tests, by about 1% on average.

V. CONCLUSION

In this paper, a new approach to structural feature extraction was proposed. We introduced the feature template, which mimics a structural feature extractor by mapping a feature vector onto the template and returning the label of the closest class. The feature template is a set of K-Means clusters of feature vectors. We applied very simple string matching as our recognition process. The experimental results show an average accuracy of about 95% on both the Thai and English character sets.

REFERENCES

[1] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley, New York, second edition, 2001.

[2] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, Boston, second edition, 1990.

[3] Anil K. Jain, Robert P. W. Duin, and Jianchang Mao. Statistical Pattern Recognition: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, January 2000.

[4] K. S. Fu. Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, New Jersey, 1982.

[5] Rafael C. Gonzalez and Michael G. Thomason. Syntactic Pattern Recognition: An Introduction. Addison-Wesley, Reading, Massachusetts, 1978.

[6] T. Pavlidis. Structural Pattern Recognition. Springer-Verlag, Berlin, 1977.

[7] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing. Prentice Hall, second edition, January 2002.
