CHAPTER 1
INTRODUCTION
1.1 TAMIL
Tamil is a South Indian language spoken widely in Tamil Nadu,
one of the states of India. Tamil is one of the oldest languages in the world
and the Tamil script is used to write the Tamil language in Tamil Nadu state
of India, Sri Lanka, Singapore and parts of Malaysia as well as to write
minority languages such as Badaga. Although Tamil has been influenced by
Sanskrit to a certain degree, Tamil, along with the other South Indian
languages, is genetically unrelated to the descendants of Sanskrit such as
Hindi, Bengali and Gujarati.
As a result of intensive research and development efforts,
sophisticated character recognition systems are available for the English
language, for Chinese and Japanese, and for handwritten numerals. However,
less attention has been given to Indian languages. Some efforts have been
reported in the literature for Devanagari, Tamil and Bangla scripts. The
significance of the thesis stems from the fact that the Tamil language is
ancient and widespread and owns a rich body of literature. Realizing the
research potential in the Tamil text image problem, this thesis is focused on
Tamil text image restoration and Tamil character recognition.
Most Tamil letters have circular shapes, partly because
they were originally carved with needles on palm leaves, a technology
that favored circular shapes. Tamil has 12 vowels, 18 consonants, 216
composite characters and 1 special character (Aayutha Ezhuthu), for a total of
247 characters. The character set used herein comprises the vowels and
consonants and is reproduced in Table 1.1 for quick reference.
Table 1.1 Tamil Fonts Set
Processing of a Tamil text image, like that of any other text image,
primarily involves two components: restoration and character recognition.
The initial part of the thesis is focused on the restoration of noisy Tamil text
images. A spatially adaptive restoration algorithm based on Expectation
Maximization (EM) operating in the wavelet transform domain is proposed.
The second part is dedicated to Tamil character recognition in terms of feature
extraction and classification. Two novel feature extraction methods, viz. the
Slope method and a Discrete Wavelet Transform (DWT) domain based method,
are proposed. Also, a tree classifier is proposed which operates on the
features extracted by the methods proposed herein. The feature extraction
algorithms and the classifier algorithm presented herein prove to be promising
candidates for processing of Tamil text images.
1.2 LITERATURE SURVEY
Restoration is the first step in the processing of any text image in
most of the applications and is used to reconstruct or recover an image that
has been degraded by using a priori knowledge of the degradation
phenomenon. Feature extraction for character recognition deals with
determination of various attributes as well as properties associated with a
character. In the following sections, a survey of the existing restoration and
character recognition algorithms is provided.
1.2.1 Restoration
There is a wide range of restoration algorithms applied in different
contexts available in the literature. The restoration methods for document
images are presented below.
Moghaddam and Cheriet (2010) have proposed a multi-scale
framework for the adaptive binarization of degraded document images. The
framework can be used along with any adaptive threshold-based binarization
method, and it is able to improve the binarization results and to restore weak
connections and strokes, especially in the case of degraded historical
documents.
Moghaddam and Cheriet (2009) have addressed the problem of
enhancing and restoring low-quality single-sided document
images. Initially, a series of multi-level classifiers is introduced covering
several levels, including the regional and content levels. These classifiers can
then be integrated into any enhancement or restoration method to generalize
or improve them. El-Sallam et al (2009) have presented a blind image
restoration algorithm to construct a high resolution image from a degraded
and noisy low resolution image captured by Thin Observation Module by
Bound Optics (TOMBO) imaging systems. In this method, all spectral
information in each captured image is used to restore each output pixel in the
reconstructed high resolution image.
Meng et al (2007) have proposed a very fast and robust algorithm
for locating and removing a special kind of circular noise caused by scanning
documents with punched holes. First, the original image is reduced according to
a carefully selected ratio. After reduction, punched holes leave distinctive
small regions; by examining these regions, hole noise can be quickly detected
and located. To diminish false detections, the Hough transform is applied to
the roughly located regions to further confirm the located holes. Finally, the
circular noise is eliminated by fitting a bi-linear blending Coons surface
which interpolates along the four edges of the noisy region.
Tan et al (2006) have proposed an efficient restoration method
based on the discovery of the 3D shape of a book surface from the shading
information in a scanned document image. Scanning a document page from a
thick bound volume often results in two kinds of distortions in the scanned
image, i.e., shade along the "spine" of the book and warping in the shade area.
From a technical point of view, this Shape From Shading (SFS) problem in
real-world environments is characterized by a proximal and moving light
source, Lambertian reflection and document skew. Taking all these factors
into account, practical models consisting of a 3D geometric model and a 3D
optical model have been built for the practical scanning conditions to
reconstruct the 3D shape of the book surface. The scanned document image
has then been restored using this shape, based on deshading and dewarping
models.
Molton et al (2003) have proposed a phase congruency technique
for visual enhancement of incised text. Incised stroke information was
deduced by imaging the document under a carefully selected set of lighting
conditions that cause shadows to be cast and observing the position and
motion of shadow areas. The technique was also used to interpolate broken
strokes and to recover 3D surface structure. Fan and Tan (2003) have
proposed adaptive image smoothing using a coplanar matrix and its
application to document image binarization. For document images corrupted
by various kinds of noise, directly binarized images may be severely blurred
and degraded. A common treatment for this problem is to pre-smooth the input
images using noise-suppressing filters. An image-smoothing method is used
for pre-filtering before document image binarization. Conceptually, it has been
proposed that the influence range of each pixel affecting its neighbors should
depend on local image statistics. This property adapts the smoothing process
to the contrast, orientation, and spatial size of local image structures.
Fan et al (2002) have presented a local thresholding method and a
region growing method for the removal of marginal noise in gray-scale
document images and binary document images respectively. Marginal noise
is a common phenomenon in document analysis, which results from the
scanning of thick or skewed documents. It usually appears in the form
of a large dark region around the margin of document images.
Marginal noise might cover meaningful document objects, such as text,
graphics and forms.
Tan et al (2002) have presented a wavelet technique for restoration
of archival documents. This paper addresses a problem of restoring
handwritten archival documents by recovering their contents from the
interfering handwriting on the reverse side caused by the seeping of ink. A
novel method that works by first matching both sides of a document such that
the interfering strokes are mapped with the corresponding strokes originating
from the reverse side has been presented. This facilitates the identification of
the foreground and interfering strokes. A wavelet reconstruction process then
iteratively enhances the foreground strokes and smears the interfering strokes
so as to strengthen the discriminating capability of an improved Canny edge
detector against the interfering strokes.
Fan and Tan (2002) have proposed a structure tensor based image
restoration algorithm. Structure tensors capture the structural and textural
distribution of similar pixels at each site. This property adapts the smoothing
process to the contrast, orientation and spatial size of local image structures.
Zheng and Kanungo (2001) have presented a model-based
restoration algorithm for document image restoration. The restoration
algorithm first estimates the parameters of a degradation model and then uses
the estimated parameters to construct a lookup table for restoring the
degraded image. The estimated degradation model is used to estimate the
probability of an ideal binary pattern, given the noisy observed pattern. This
probability is estimated by degrading noise-free document images and then
computing the frequency of corresponding noise-free and noisy pattern pairs.
This conditional probability is then used to construct a lookup table to restore
noisy images.
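The lookup-table idea above can be sketched in a few lines. The following is a minimal illustration, not the cited authors' implementation: noise-free binary images are degraded under a hypothetical independent bit-flip noise model, the frequencies of (noisy 3x3 pattern, clean centre pixel) pairs are counted, and a majority vote over each observed pattern becomes the lookup table. The function names, the bit-flip model and the 3x3 pattern size are all assumptions made for illustration.

```python
import random

def degrade(img, flip_prob, rng):
    """Degradation model (an assumption): flip each pixel independently."""
    return [[1 - p if rng.random() < flip_prob else p for p in row]
            for row in img]

def pattern_at(img, r, c):
    """3x3 neighbourhood around (r, c), zero-padded at the borders."""
    h, w = len(img), len(img[0])
    return tuple(
        img[r + dr][c + dc] if 0 <= r + dr < h and 0 <= c + dc < w else 0
        for dr in (-1, 0, 1) for dc in (-1, 0, 1))

def build_lut(clean_images, flip_prob, trials=10, seed=0):
    """Degrade noise-free images, count (noisy pattern, clean centre) pairs,
    and keep the most frequent clean value for each observed pattern."""
    rng = random.Random(seed)
    counts = {}
    for clean in clean_images:
        for _ in range(trials):
            noisy = degrade(clean, flip_prob, rng)
            for r in range(len(clean)):
                for c in range(len(clean[0])):
                    key = pattern_at(noisy, r, c)
                    zeros, ones = counts.get(key, (0, 0))
                    counts[key] = ((zeros, ones + 1) if clean[r][c]
                                   else (zeros + 1, ones))
    return {k: (1 if ones > zeros else 0)
            for k, (zeros, ones) in counts.items()}

def restore(noisy, lut):
    """Replace each pixel by the table's most probable clean value;
    patterns never seen during training are left unchanged."""
    return [[lut.get(pattern_at(noisy, r, c), noisy[r][c])
             for c in range(len(noisy[0]))] for r in range(len(noisy))]
```

In the actual method, the degradation model's parameters are first estimated from the observed documents; the fixed flip probability here merely stands in for that estimated model.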
Ye et al (2001) have proposed a generic method of cleaning and
enhancing handwritten data from business forms. Preparing clean and clear
images for the recognition engines is often taken for granted as a trivial task
that requires little attention. In reality, handwritten data usually touch or cross
the preprinted form frames and texts, creating tremendous problems for the
recognition engines. Here, a generic system including only cleaning and
enhancing phases has been proposed. In the cleaning phase, the system
registers a template to the input form by aligning corresponding landmarks. A
unified morphological scheme was proposed to remove the form frames and
restore the broken handwriting from gray or binary images. When the
handwriting is found touching or crossing preprinted texts, morphological
operations based on statistical features are used to clean it. In applications
where a black-and-white scanning mode is adopted, handwriting may contain
broken or hollow strokes due to improper thresholding parameters. Therefore,
a module has been designed to enhance the image quality based on
morphological operations.
Sattar and Tay (1998) have proposed a fuzzy logic approach for
enhancing document images which are blurred and corrupted binary images
obtained from a scanner. Sattar et al (1997) have proposed a nonlinear
multiscale method for image restoration by successively combining each
coarser scale image with the corresponding modified interscale image.
Banham and Katsaggelos (1996) have presented a spatially
adaptive approach to the restoration of noisy blurred images which is
particularly effective at producing sharp deconvolution while suppressing the
noise in the flat regions of an image. This is accomplished through a
multiscale Kalman smoothing filter applied to a prefiltered observed image in
the discrete, separable, 2-D wavelet domain. The prefiltering step involves
constrained least-squares filtering based on optimal choices for the
regularization parameter. This leads to a reduction in the support of the
required state vectors of the multiscale restoration filter in the wavelet domain
and improvement in the computational efficiency of the multiscale filter. The
proposed method has the benefit that the majority of the regularization or
noise suppression of the restoration is accomplished by the efficient
multiscale filtering of wavelet detail coefficients ordered on quadtrees. Not
only does this lead to potential parallel implementation schemes, but it
permits adaptivity to the local edge information in the image. In particular,
this method changes filter parameters depending on scale, local Signal-to-
Noise Ratio (SNR), and orientation. Because the wavelet detail coefficients
are a manifestation of the multiscale edge information in an image, this
algorithm has been viewed as an edge-adaptive multiscale restoration
approach.
1.2.2 Character Recognition
A wide range of algorithms has been developed for a number of
languages around the world. Developments in character recognition for
languages other than Tamil are surveyed first, followed by a survey of the
algorithms for Tamil.
Extensive experimentation on the recognition of different
handwritten scripts has been carried out during the last three decades. Anita
Pal and Dayashankar Singh (2010) have described a Fourier descriptor
method for English character recognition. Each character is resized into a
normalized image, which is then converted into a binary image. Fourier
coefficients for each character are calculated and then fed as input features to
a neural network classifier.
Mujtaba and Shahid (2009) have described English letter
classification using Bayesian decision theory and feature extraction using
Principal Component Analysis. Bayesian Decision Theory (BDT), one of the
statistical techniques for pattern classification, is used to identify each of the
large number of black-and-white rectangular pixel displays as one of the 26
capital letters in the English alphabet. Principal Component Analysis (PCA) is
used for feature extraction to reduce the dimensions of the pattern data.
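As an illustration of how PCA reduces the dimensionality of pattern data, the sketch below estimates the first principal axis of a small data set by power iteration on the covariance matrix and projects patterns onto it. This is a generic PCA sketch under simplifying assumptions (one component, pure Python), not the cited authors' code; the function names are invented for illustration.

```python
def pca_first_axis(data, iters=200):
    """Estimate the first principal axis by power iteration on the
    sample covariance matrix of the (mean-centred) data."""
    n, d = len(data), len(data[0])
    mean = [sum(x[j] for x in data) / n for j in range(d)]
    centred = [[x[j] - mean[j] for j in range(d)] for x in data]
    cov = [[sum(row[i] * row[j] for row in centred) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]       # renormalize each iteration
    return mean, v

def project(x, mean, axis):
    """Reduced 1-D feature: projection of a pattern onto the axis."""
    return sum((x[j] - mean[j]) * axis[j] for j in range(len(x)))
```

In practice several leading components would be retained, giving a short feature vector in place of the raw pixel vector.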
Bhattacharya and Chaudhuri (2009) have presented the pioneering
development of two databases for the recognition of the two most popular
Indian scripts, namely Devnagari and Bangla. They have presented a multistage
cascaded recognition scheme using wavelet-based multi-resolution
representations. They implemented their work for the recognition of mixed
handwritten numerals of three Indian scripts namely Devnagari, Bangla and
English.
Lajish (2008) has proposed a feature extraction method for the
recognition of Malayalam characters from their gray scale images without the
usual step of binarization. They investigated a new approach to model
Malayalam characters using the state space map and state space point
distribution stage. Sandhya et al (2008) have reported multiple feature
extraction techniques for handwritten Devnagari character recognition. They
computed shadow features globally for a character image and segmented the
character image to compute the intersection and line fitting features. Raju
(2008) has presented the wavelet and projection profile-based feature
extraction method for Malayalam characters. They computed vertical and
horizontal projection profiles. The projection profiles have been subjected to
'n' levels of the wavelet transform. They took the average component of the
coefficient set as the feature vector.
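The projection-profile-plus-wavelet idea can be sketched as follows: compute the horizontal and vertical projection profiles of a binary character image, then keep only the approximation (average) part of a Haar-like transform over 'n' levels. The unnormalized pairwise average below stands in for the true Haar approximation coefficients, and all names are illustrative assumptions, not the cited author's implementation.

```python
def projection_profiles(img):
    """Horizontal profile = row sums, vertical profile = column sums
    of foreground (1) pixels in a binary character image."""
    horizontal = [sum(row) for row in img]
    vertical = [sum(col) for col in zip(*img)]
    return horizontal, vertical

def haar_approx(profile, levels):
    """Keep only the approximation part of a Haar-like transform:
    repeated pairwise averaging, 'levels' times (normalization omitted)."""
    s = list(profile)
    for _ in range(levels):
        if len(s) % 2:              # pad odd-length signals
            s = s + [s[-1]]
        s = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
    return s

def profile_features(img, levels):
    """Feature vector: averaged coefficients of both profiles."""
    h, v = projection_profiles(img)
    return haar_approx(h, levels) + haar_approx(v, levels)
```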
Nasser and Shamsuddin (2008) have presented handwritten digit
recognition using Particle Swarm Optimization (PSO). Here PSO based
method is exploited to recognize unconstrained handwritten digits. Each class
is encoded as a centroid in multidimensional feature space and PSO is
employed to probe the optimal position for each centroid. The algorithm is
evaluated using 5-fold cross validation on handwritten digit data, and the
results reveal that PSO gives promising performance and stable behavior in
recognizing these digits. Sagar et al (2008) have described a character
recognition system for printed text documents in Kannada, a South Indian
language. Partha Pratim et al (2008) have presented a convex hull based
approach to the recognition of English characters in multi-scale and
multi-oriented environments. Graphical documents such as maps consist of
text lines that appear in different orientations. Sometimes, the characters in a
single word may follow a curvilinear path to annotate a graphical curve.
For recognition of such multi-scale and multi-oriented characters, a Support
recognition of such multi-scale and multi-oriented characters, a Support
Vector Machine (SVM) based scheme has been presented. The feature used
here is invariant to character orientation. Circular ring and convex hull have
been used along with angular information of the contour pixels of the
character to make the feature rotation invariant.
Much attention has been paid to the recognition of English,
Chinese, Japanese and Korean characters because they provide a handy test
case for various techniques, viz. preprocessing, feature extraction and
classification, and because they have many applications, viz. postal mail
sorting, cheque reading, form processing, etc. In Oriya, many characters
have a shape similarity and most of
them have a curve-like stroke. Pal et al (2007) have reported this curvature
feature for recognition purposes. They extracted the feature by segmenting the
input image into (49 × 49) blocks. These blocks are then down-sampled by a
Gaussian filter, and the features obtained from the down-sampled blocks are
fed to a modified quadratic classifier for recognition. They used principal
component analysis for feature dimension reduction. Hanmandlu et al (2007)
have presented a zone-based feature extraction method for the recognition of
Hindi characters. They divided the character image into 24 zones. By
considering the bottom left corner of the image as an absolute reference, the
average vector distance for the foreground pixels present in the zone is
determined.
Manjunath et al (2007) have proposed a system based on the Radon
transform for Kannada digit recognition. They extracted the features using the
Radon function, which represents an image as a collection of projections along
various directions. They used the nearest neighbor classifier for subsequent
classification and recognition. Jayaraman et al (2007) have described some
issues in developing a character recognition system for the Telugu script. They
proposed a modular approach for the recognition of strokes. Based on the
relative position of a stroke in a character, the stroke set has been divided into
three subsets, namely base line strokes, bottom strokes and top strokes. They
used the Hidden Markov Model (HMM) and Support Vector Machine (SVM)
classifiers for subsequent classification and recognition. Lajish (2007) has
proposed a feature extraction method for recognition of Malayalam characters
based on fuzzy zoning and average vector distance measures. They used a
modular neural network for subsequent classification.
Rajput and Mallikarjun (2007) have proposed the feature extraction
method using an image fusion technique for the recognition of isolated
Kannada characters. They divided the character image into equal zones and
computed the pixel foreground density for each zone. They compared each
zone pixel density with the threshold. When the zone pixel density exceeds
the threshold, 1 is stored for that particular zone; otherwise 0 is stored.
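The zone-density thresholding described above amounts to the following sketch. The zone layout, the divisibility assumption and the function name are illustrative choices, not the authors' code.

```python
def zone_density_features(img, zones_per_side, threshold):
    """Divide a binary character image into equal square zones and store
    1 for a zone whose foreground pixel density exceeds the threshold,
    otherwise 0 (image size assumed divisible by zones_per_side)."""
    h, w = len(img), len(img[0])
    zh, zw = h // zones_per_side, w // zones_per_side
    features = []
    for zr in range(zones_per_side):
        for zc in range(zones_per_side):
            block = [img[r][c]
                     for r in range(zr * zh, (zr + 1) * zh)
                     for c in range(zc * zw, (zc + 1) * zw)]
            density = sum(block) / len(block)
            features.append(1 if density > threshold else 0)
    return features
```

The resulting binary vector, one bit per zone, is what the classifier then compares across characters.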
Michael et al (2007) have provided a novel feature extraction technique for
the recognition of cursive characters. They used the modified direction feature
extraction technique that extracts information from the structure of character
contours. Wen et al (2007) have proposed two approaches for Bangla
character recognition. The first is based on the image reconstruction error
in a principal subspace, and the other is based on the Kirsch gradient, with
dimensionality reduction by Principal Component Analysis (PCA) and
classification by the SVM. Pal et al (2007) have reported the results of
character recognition of six Indian scripts namely Devnagari, Bangla, Telugu,
Oriya, Kannada and Tamil. They used two types of features; the one for high
accuracy classification is the 16-direction gradient histogram feature,
extracted by mean filtering of the binary image (to obtain a gray scale
image), Roberts gradient filtering, tangent direction quantization and
down-sampling.
Liana and Govindaraju (2006) have provided an excellent survey of
character recognition for the Arabic script. Majumdar and Chaudhuri (2006)
have presented a zone-based feature extraction method for printed and
handwritten Bangla numeral recognition. Hani Khasawneh (2006) has
presented a novel Arabic character recognition system. The system
accepts a scanned-page image containing a set of text lines affected by typical
noise levels. The system preprocesses the image before the separated lines are
handed to the character extraction phase. A neural network is used to process
the features and to classify them into one of the characters.
Lajish et al (2005) have proposed the weighted minimum rectangle
feature extraction method for Malayalam characters. They cropped the image
into a minimum rectangle such that the image fits exactly into the rectangle.
They divided the rectangle into equal zones. Then they computed the number
of crossings of the character curve with each of the sides. Further, they
computed the ratio of the foreground (black) pixels to the background (white)
pixels for each zone. They used statistical classification techniques for
subsequent classification.
Pal and Chaudhuri (2004) have provided an excellent review of the
character recognition work done on Indian language scripts and the different
methodologies applied in character recognition development in the
international scenario. Romesh et al (2004) have described a fuzzy method for
English character recognition where each character is divided into a number of
segments. For each segment, a fuzzy membership value is calculated. Based on
these membership values, characters are recognized by means of a min-max
composition procedure.
Suen et al (2003) have provided an excellent review of the analysis
and recognition of Asian scripts (Chinese/Japanese/Korean). Bhattacharya
and Chaudhuri (2003) have proposed a multi-resolution wavelet analysis and
majority voting approach and applied it to Bangla character recognition. One
of the main features of the proposed scheme is that it is not script
dependent. Another interesting feature is that it is sufficiently fast for real-life
applications. In contrast to the usual practices, the efficiency of a majority
voting approach is studied when all the classifiers involved are Multi-Layer
Perceptrons (MLP) of different sizes and respective features are based on
wavelet transforms at different resolution levels. The rationale for this
approach is to explore how one can improve the recognition performance
without adding much to the requirements for computational time and
resources. For simplicity and efficiency, only three coarse-to-fine resolution
levels of wavelet representation are considered.
Nafiz and Fatos (2001) have provided an overview of character
recognition focused on the English script. Dhanya and Ramakrishnan (2001)
have suggested recognition via script identification and a bilingual approach
for character recognition from bilingual text. Plamondon and Srihari
(2000) have proposed a comprehensive survey for English character
recognition. Tao and Tang (1999) have described the feature extraction of
Chinese character based on contour information.
The geometrical and topological representation of an image
provides various global and local properties. Such a type of representation
yields high tolerance to distortions and style variations. The topological
representation of an image uses predefined structures like strokes. The
strokes are searched in a character/image and the number or relative position
of these structures within the character forms a descriptive representation.
Characters and words can be represented by extracting and counting many
topological features. These features include extreme points, maxima and
minima, cross and end points, loops etc., that make up a character (Nishida
1995). The geometrical representation of a character includes geometrical
quantities such as the aspect ratio, the relative distance between horizontal
and vertical points, the comparative length between two strokes, change in
curvature, etc. (Kundu and He 1991, Okamoto and Yamamoto 1997). The
Freeman's chain code is the most popular coding scheme and this code is
obtained by mapping the strokes of a character into a two-dimensional
parameter space, which is made up of codes. A set of topological primitives,
such as strokes, loops, cross points, etc., are obtained by partitioning the
characters. Then, these primitives are represented using attributed or
relational graphs (Li et al 1997). Trees can also be used to represent the
characters with a set of features, which have a hierarchical relation
(Madhvanath et al 1997).
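Freeman chain coding of a stroke can be sketched as follows, assuming the stroke is already available as an ordered list of 8-connected (row, col) pixel coordinates. The contour-tracing step that produces that list is omitted, and the direction numbering below is one common convention; both are assumptions made for illustration.

```python
# 8-direction Freeman codes (rows grow downward, so "north" is -1 in row):
# 0 = E, 1 = NE, 2 = N, 3 = NW, 4 = W, 5 = SW, 6 = S, 7 = SE
MOVES = {(0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
         (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7}

def chain_code(points):
    """Freeman chain code of an ordered list of 8-connected (row, col)
    stroke points: one direction code per step between consecutive points."""
    return [MOVES[(r1 - r0, c1 - c0)]
            for (r0, c0), (r1, c1) in zip(points, points[1:])]
```

The resulting code sequence is the two-dimensional parameter space representation referred to above: the stroke's shape is captured by its sequence of direction codes rather than by absolute pixel positions.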
Trier et al (1996) have presented a good survey of feature
extraction methods such as Template matching, Deformable templates,
Unitary image transforms, Graph description, Contour profiles, Zoning,
Geometric moment invariants, Zernike moments, Spline curve approximation
and Fourier descriptors. In template matching, all the pixels in the gray scale
character image are used as features. A similarity or dissimilarity measure
between each template and the character image is computed. In the case of a
similarity measure, the template Tk having the highest similarity measure is
identified and if this similarity is above a specified threshold, then the
character is assigned the class label k. Else, the character remains
unclassified. In the case of a dissimilarity measure, the template Tk having the
lowest dissimilarity measure is identified and if the dissimilarity is below a
specified threshold, the character is given the class label k. Deformable
templates are used for character recognition in gray scale images of credit
card slips with poor print quality. The templates used are character skeletons.
Unitary transform is applied to character images, obtaining a reduction in the
number of features while preserving most of the information about the
character shape. In the transformed space, the pixels are ordered by their
variance and the pixels with the highest variance are used as features. The
unitary transform has to be applied to a training set to obtain estimates of the
variances of the pixels in the transformed space. Zernike moments are
projections of the input image onto the space spanned by the orthogonal V-
functions. The amplitudes of the Zernike moments are used as features for
character recognition of binary solid symbols.
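The template-matching decision rule described above (assign the class label k when template Tk has the highest similarity and that similarity exceeds a threshold, else leave the character unclassified) can be sketched as follows. The pixel-agreement similarity used here is one simple choice among the many measures the survey covers, and the names are illustrative.

```python
def similarity(image, template):
    """Fraction of positions at which two binary patterns agree
    (one simple similarity measure; many others are possible)."""
    total = agree = 0
    for img_row, tpl_row in zip(image, template):
        for p, q in zip(img_row, tpl_row):
            total += 1
            agree += (p == q)
    return agree / total

def classify(image, templates, threshold):
    """Assign the label k of the template Tk with the highest similarity,
    provided it reaches the threshold; else the character is unclassified."""
    best_label, best_sim = None, -1.0
    for label, template in templates.items():
        s = similarity(image, template)
        if s > best_sim:
            best_label, best_sim = label, s
    return best_label if best_sim >= threshold else None
```

With a dissimilarity measure, the rule is mirrored: the template with the lowest dissimilarity wins, provided the dissimilarity falls below the threshold.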
Some major statistical features used for image representation are
zoning, crossings and distances, and projections. An image is divided into
several overlapping or non-overlapping zones, and the representation is
computed from the pixel densities of the zones, or from other features
analyzed in the different regions. Mohiuddin and Mao (1994) have
described contour direction features which are generated by dividing the
image array into rectangular and diagonal zones and computing histograms of
a chain code in these zones. Lee and Park (1994) have reviewed nonlinear
shape normalization methods in order to compensate for shape distortions in
large-set handwritten Korean characters. Chen et al (1994) have presented a
complete scheme for totally unconstrained handwritten word recognition
based on a single contextual Hidden Markov Model (HMM) type stochastic
network. This scheme includes a morphology and heuristics based
segmentation algorithm and a training algorithm that can adapt itself to a
changing dictionary. Amin and Al-Sadoun (1994) have proposed a structural
technique for automatic recognition of hand printed Arabic characters. The
advantages of this technique are that it is more efficient for large and complex
sets such as Arabic characters, that feature extraction is inexpensive, and that
its execution time does not depend on either the font or the size of the
characters.
Yamada et al (1990) have proposed a nonlinear normalization
method called the line density equalization for handprinted kanji character
recognition. Here, a resampling is done so as to equate the product of a local
line density and a sampling pitch. Consequently, the line density in the space
is homogenized, the efficiency of utilization of the space is increased, and a
stable normalization is obtained for partially irregular shape variations.
Srihari et al (1989) have described a system to automatically locate
and recognize ZIP codes in handwritten English addresses. Given a gray-scale
image of a handwritten address block, the system preprocesses the image by
thresholding, border removal and underline removal. One or more candidate
words for the ZIP Code are isolated. Each candidate is divided into 5 or 9
segments and recognition is attempted on each segment. Digit recognition is
accomplished by means of an arbitration procedure that takes as input the
decisions of three different classifiers: template matching using stored
prototypes, a mixed approach that uses statistical and structural analysis of
digit boundary and a rule-based approach to analyze digit strokes. The result
of ZIP Code recognition is verified using a postal directory. Tsukumo and
Tanaka (1988) have presented a system for the classification of hand printed
Chinese characters using correlation methods for fast classification and a
nonlinear normalization based on uniform relocation of the strokes used to
form the character. Experimental results for handprinted Chinese character
classification are presented.
A popular statistical feature is the number of crossings of a contour
by a line segment in a specified direction. Also, the distance of the line
segments from a given boundary such as the upper and lower portions of the
frame can be used as statistical features (Brown and Ganapathy 1983). Suen
et al (1980) have presented a survey in the challenging field of character
recognition. Recognition algorithms, databases, character models and
handprint standards are examined. Achievements in the recognition of hand
printed numerals, alphanumerics, FORTRAN and Katakana characters are
analyzed and compared. Data quality and constraints, as well as human and
machine factors are also described. Characteristics, problems and actual
results on on-line recognition of handprinted characters for different
applications have been discussed. New emphases and directions are
suggested.
Ishwarya et al (2010) have proposed a Convolutional Neural
Network which recognizes Tamil characters. Convolutional Neural
Networks are a special kind of multi-layer neural network. They are trained
with a version of the back-propagation algorithm and are designed to
recognize visual patterns directly from pixel images with minimal
preprocessing. They can recognize patterns with extreme variability and with
robustness to distortions and geometric transformations. Recognition of the
test sample is performed using a nearest neighbor classifier.
Shashikiran et al (2010) have studied the performance of HMM and
Statistical Dynamic Time Warping (SDTW) for Tamil Character Recognition.
HMM is used for a 156-class problem. Different feature sets and values for
the HMM states and mixtures are tried, and the best combination is found to
be 16 states and 14 mixtures, giving an accuracy of 85%. The features used in
this combination are retained, and an SDTW model with 20 states and a single
Gaussian is used as the classifier.
Ramanathan et al (2010) have proposed a new technique of optical
character recognition using Gabor filters and support vector machines
(SVM). This method proves to be very effective, using Gabor filters for
feature extraction and SVMs for developing the model. The proposed model
is trained and validated for two languages, English and Tamil, and the results
are found to be very encouraging. The model works for the entire character
set in both languages, including symbols and numerals. In addition, the model
can recognize the characters of six different fonts in English and twelve
different fonts in Tamil. The average recognition accuracy for Tamil is 84%,
achieved in just three iterations of training. The method can turn out to be a
suitable candidate for future applications in this area.
Abirami and Manjula (2009) have presented feature-string-based
intelligent information retrieval from Tamil document images. The
methodology generates a feature string for every word image by extracting
features that rely on the basic characteristics and shapes of the letters, such
as black/white disposition rates and lines in characters. Sundaram and
Ramakrishnan (2009) have proposed script-specific post-processing schemes
for improving the recognition rate of Tamil characters. At the first level,
features derived at each sample point of the preprocessed character are used to
construct a subspace using the 2D-PCA algorithm. Recognition of the test
sample is performed using a nearest neighbor classifier. Based on the analysis
of the confusion matrix, multiple pairs of confused characters are identified.
At the second level, they have used script-specific cues to resolve the
ambiguities among the confused characters, substantially reducing the
recognition error among the confused character sets handled. This
approach can be applied irrespective of the nature of the classifier
used for the first level of recognition, though the nature of the confusion set
might vary.
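The first-level confusion analysis in such a two-stage scheme can be sketched as follows; the function name and the toy three-class matrix are illustrative, not taken from the cited work:

```python
# Hypothetical sketch: find the most frequently confused class pairs from a
# confusion matrix, the step that precedes script-specific disambiguation.
from collections import Counter

def confused_pairs(confusion, top_k=2):
    """confusion[i][j] = number of class-i samples predicted as class j."""
    off_diag = Counter()
    n = len(confusion)
    for i in range(n):
        for j in range(n):
            if i != j and confusion[i][j] > 0:
                pair = tuple(sorted((i, j)))
                off_diag[pair] += confusion[i][j]
    return [pair for pair, _ in off_diag.most_common(top_k)]

conf = [[50, 8, 0],
        [6, 44, 1],
        [0, 2, 48]]
print(confused_pairs(conf))  # [(0, 1), (1, 2)]
```

The pairs returned this way would then be handed to the second-level, script-specific disambiguation stage.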
Shanthi and Duraiswamy (2008) have described a method for Tamil
character recognition in which each character image is divided into equal
numbers of horizontal and vertical stripes, resulting in a grid of
square-shaped zones. For each zone, the pixel density is calculated and used
as the feature vector for character recognition. Jagadeesh Kannan and Prabhakar
(2008) have presented a Hidden Markov Model (HMM) method for Tamil
character recognition in which two HMMs are created for each unknown
character. One HMM is for modeling the horizontal information and the other
HMM is for modeling the vertical information. The created HMMs are then
trained for character recognition.
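The zone-based pixel-density feature described above can be sketched as follows; the helper name, grid size and binary 0/1 image representation are assumptions for illustration:

```python
# Minimal sketch of zone-based pixel-density features on a binary image.
def zone_density(img, zones=2):
    """Split a binary image into zones x zones blocks and return the
    fraction of black (1) pixels in each block, row-major."""
    h, w = len(img), len(img[0])
    zh, zw = h // zones, w // zones
    feats = []
    for zi in range(zones):
        for zj in range(zones):
            block = [img[r][c]
                     for r in range(zi * zh, (zi + 1) * zh)
                     for c in range(zj * zw, (zj + 1) * zw)]
            feats.append(sum(block) / len(block))
    return feats

# An 8x8 image whose top-left quadrant is fully inked:
img = [[1] * 4 + [0] * 4 for _ in range(4)] + [[0] * 8 for _ in range(4)]
print(zone_density(img))  # [1.0, 0.0, 0.0, 0.0]
```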
Batmavady and Manivannan (2007) have presented a polynomial
method for Tamil character recognition in which the unknown character
image is divided into blocks of equal size. Each block contains a part of the
character, which is fitted to a fifth-order polynomial. The polynomial
coefficients are used as features and compared with a standard template, and
the error percentage is computed. If the error percentage is below a
threshold, the character is declared a match to the template. Shivsubramani
et al (2007) have presented an efficient method for recognizing printed
Tamil characters by exploring the interclass relationship between them,
accomplished using multiclass hierarchical Support Vector Machines. This
variant of the multiclass SVM constructs a hyperplane that separates each
class of data from the other classes; character recognition thus involves
classification into multiple classes.
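One plausible reading of the block-wise polynomial fitting is to fit each block's projection profile to a fifth-order polynomial and keep the coefficients as features; the fitted quantity is an assumption here, so this is only a sketch, not the cited method:

```python
# Sketch: fit a block's vertical projection profile (ink count per column)
# to a degree-5 polynomial and return the coefficients as features.
import numpy as np

def block_poly_features(block, degree=5):
    profile = block.sum(axis=0).astype(float)  # ink count per column
    x = np.arange(profile.size)
    return np.polyfit(x, profile, degree)      # degree+1 coefficients
```

Template matching would then compare these coefficient vectors against those of stored templates and threshold the resulting error.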
Bharath and Sriganesh (2007) have proposed a Data-driven Hidden
Markov Model (HMM) based online handwritten word recognition system for
Tamil. A symbol set consisting of 84 symbols has been defined for the word
recognition task and each symbol has been modeled using a left-to-right
HMM. Inter-symbol pen-up strokes have been modeled explicitly using two
state left-to-right HMMs to capture the relative positions between symbols in
the word context. Independently built symbol models and inter-symbol pen-
up stroke models have been concatenated to form the word models. The
relatively low performance in the case of high lexicon size can be improved
by the use of statistical language models, which are commonly applied in
Western cursive recognition.
Suresh and Ganesan (2005) have described an approach that applies
fuzzy concepts to handwritten Tamil characters, classifying each as one of
the prototype characters using a feature called distance from the frame and a
suitable membership function. The unknown and prototype characters are
preprocessed before being considered for recognition. Fuzzy set theory
provides an approximate but effective means of describing the behavior of
ill-defined systems, and patterns of human origin, such as handwritten
characters, are to some extent fuzzy in nature, which motivates the fuzzy
conceptual approach.
Seethalakshmi et al (2005) have discussed the various strategies
and techniques involved in the recognition of Tamil text, referring to
Optical Character Recognition (OCR) as the process of converting printed
Tamil text documents into software-translated Unicode Tamil text.
The printed documents available in the form of books, papers, magazines, etc.
are scanned using standard scanners which produce an image of the scanned
document. As part of the preprocessing phase, the image file is checked for
skewing. The skewed image is corrected by a simple rotation technique in the
appropriate direction and then it is passed through a noise elimination phase
and is binarized. The preprocessed image is segmented using an algorithm
that decomposes the scanned text into paragraphs using a special
space-detection technique, the paragraphs into lines using vertical
histograms, the lines into words using horizontal histograms, and the words
into character image glyphs, again using horizontal histograms. Each image glyph is
comprised of 32×32 pixels. Thus a database of character image glyphs is
created out of the segmentation phase. Then all the image glyphs are
considered for recognition using Unicode mapping. Each image glyph is
passed through various routines, which extract the features of the glyph. The
extracted features are passed to a Support Vector Machine (SVM), where the
characters are classified by a supervised learning algorithm. These classes are
mapped onto Unicode for recognition.
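The histogram-based segmentation in such pipelines reduces to splitting a projection profile at its empty runs; a minimal sketch follows (the function name is assumed, not from the cited system):

```python
# Minimal sketch of histogram-based segmentation: split a 1-D projection
# profile into ink runs separated by empty (all-zero) gaps.
def split_on_gaps(profile):
    """Return (start, end) index pairs for each run of non-zero values."""
    segments, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments

# Two text lines separated by a blank gap in the projection:
print(split_on_gaps([0, 3, 5, 0, 0, 2, 2, 0]))  # [(1, 3), (5, 7)]
```

Applying this to the vertical histogram yields lines; applying it again to each line's horizontal histogram yields words, and then glyphs.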
Joshi et al (2004) have compared elastic matching schemes for
writer-dependent on-line handwriting recognition of isolated Tamil
characters. Three different features are considered, namely preprocessed x-y
coordinates, quantized slope values and dominant-point coordinates. Seven
schemes based on these three features are compared using an elastic distance
measure based on dynamic time warping, in terms of recognition accuracy,
recognition speed and number of training templates. The results show that
the dominant-points-based two-stage scheme and the combination of rigid
and elastic matching schemes perform better than the rest, especially from
the point of view of implementation in a real-time application. Efforts are
underway to devise character grouping schemes for hierarchical
classification and classifier combination schemes so as to obtain a
computationally more efficient recognition scheme with improved accuracy.
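The elastic distance underlying these comparisons is dynamic time warping; below is a minimal sketch over (x, y) pen coordinates with a Euclidean local cost, a generic formulation rather than the authors' exact variant:

```python
# Classic dynamic time warping between two point sequences, allowing one
# sequence to stretch or compress in time to best align with the other.
import math

def dtw_distance(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch a
                                 D[i][j - 1],      # stretch b
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

A nearest-template classifier then assigns the character whose training template minimizes this distance.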
Aparna et al (2004) have proposed a generalized framework for
Indic script character recognition and Tamil character recognition is discussed
as a special case. Unique strokes in the script are manually identified and each
stroke is represented as a string of shape features. The test stroke is compared
with the database of such strings using the proposed flexible string-matching
algorithm. The sequence of stroke labels is then converted into horizontal
blocks using a rule list and the sequence of horizontal blocks is recognized as
a character using a Finite State Automaton (FSA).
Deepu and Madhvanath (2004) have proposed a subspace-based
method using Principal Component Analysis (PCA) for Tamil character
recognition. The input is a temporally ordered sequence of (x, y) pen
coordinates corresponding to an isolated character obtained from a digitizer.
The input is converted into a feature vector of constant dimensions following
smoothing and normalization. Each class is modeled as a subspace and for
classification, the orthogonal distance of the test sample to the subspace of
each class is computed.
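The orthogonal distance of a test sample to a class subspace, as used in this PCA scheme, can be sketched as follows (the basis is assumed to have orthonormal rows, as produced by PCA):

```python
# Sketch of subspace classification: distance from a sample to a class
# subspace; the test sample is assigned to the class minimizing it.
import numpy as np

def subspace_distance(x, mean, basis):
    """basis: rows are orthonormal principal directions of one class."""
    d = x - mean
    proj = basis.T @ (basis @ d)   # projection onto the class subspace
    return float(np.linalg.norm(d - proj))
```

For example, the distance of the point (2, 3, 0) to the x-axis subspace is 3, the length of the component orthogonal to the axis.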
Aparna et al (2002) have presented a complete character
recognition system for Tamil newsprint that includes the full suite of
processes from skew correction, binarization, segmentation, text and non-text
block classification, line, word and character segmentation and character
recognition to final reconstruction.
Hewavitharana and Fernando (2002) have described a system to
recognize handwritten Tamil characters using a two-stage classification
approach for a subset of the Tamil alphabet, which is a hybrid of structural
and statistical techniques. In the first stage, an unknown character is pre-
classified into one of the three groups: core, ascending and descending
characters. Structural properties of the text line are used for this classification.
Then, in the second stage, members of the pre-classified group are further
analyzed using a statistical classifier for final recognition. The main
recognition errors were due to abnormal writing and ambiguity among
similarly shaped characters. These could be reduced by using a word
dictionary to look up possible character compositions, since contextual
knowledge helps eliminate the ambiguity. The method of pre-classification would
have much higher recognition accuracy if applied to Optical Character
Recognition, since printed characters preserve the correct positioning on
three-zone frame.
Chinnuswamy and Krishnamoorthy (1980) have proposed an
approach for hand-printed Tamil character recognition. Here, the characters
are assumed to be composed of line-like elements, called primitives,
satisfying certain relational constraints. Labeled graphs are used to describe
the structural composition of characters in terms of the primitives and the
relational constraints satisfied by them. The recognition procedure consists of
converting the input image into a labeled graph representing the input
character and computing correlation coefficients with the labeled graphs
stored for a set of basic symbols. This algorithm uses a topological matching
procedure to compute the correlation coefficients and selects the basic
symbol that maximizes the correlation coefficient.
Siromoney et al (1978) have described a method for recognition of
machine printed Tamil characters using an encoded character string
dictionary. The scheme employs string features extracted by row- and
column-wise scanning of the character matrix. The features in each row and column are
encoded suitably depending upon the complexity of the script to be
recognized. A given text is presented symbol by symbol and information from
each symbol is extracted in the form of a string and compared with the strings
in the dictionary. When there is agreement, the letters are recognized and
printed out in Roman letters following a special method of transliteration. The
lengthening of vowels and hardening of consonants are indicated by numerals
printed above each letter.
1.3 CONTRIBUTIONS OF THE THESIS
From the above literature survey, it is clear that the character
recognition of Indian scripts is in demand and provides challenging and
interesting applications in the field of document image analysis. In the
literature, many papers have been published with research detailing new
techniques for the classification of characters. This thesis is focused towards
restoration of noisy Tamil Text Images followed by Tamil Character
recognition. A Tamil text image restoration algorithm based on Expectation
Maximization (EM) is proposed. As part of Tamil Character recognition, two
novel feature extraction methods viz. Slope method and Discrete Wavelet
Transform (DWT) domain based method are proposed. Also, a tree classifier
is proposed which would operate on the features extracted by the methods
proposed herein. The restoration algorithm, the feature extraction algorithms
and the classifier algorithm presented herein prove to be promising candidates
for processing of Tamil text images.
The contributions of the research work constituting the proposed
thesis can be summarized as follows:
- An Expectation Maximization (EM) restoration algorithm with an
adaptive custom thresholding technique for noisy Tamil text images.
- A feature extraction technique based on the spatial distribution of
slope for Tamil character recognition.
- A feature extraction technique operating in the Discrete Wavelet
Transform domain for Tamil character recognition.
- A decision tree classifier based on information gain for Tamil
character recognition.
The contributions mentioned above are dealt with in detail in the
subsequent chapters.
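The information-gain criterion driving the proposed decision tree classifier can be sketched as follows; this is the textbook formulation, not the thesis's exact implementation:

```python
# Information gain: the reduction in label entropy achieved by a split,
# used to choose the best attribute at each node of a decision tree.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction when `labels` is partitioned into `groups`."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# A perfect split of two balanced classes yields a gain of 1 bit:
print(information_gain(['a', 'a', 'b', 'b'], [['a', 'a'], ['b', 'b']]))
```

At each node the tree would pick the feature test whose induced partition maximizes this gain.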
1.4 ORGANISATION OF THE THESIS
The organization of the proposed thesis is as follows: Chapter 1
deals with literature survey followed by the contributions of the proposed
thesis. Chapter 2 deals with the proposed spatially adaptive restoration
algorithm based on Expectation Maximization (EM) operating in the wavelet
transform domain. Chapter 3 deals with the proposed feature extraction
algorithms for Tamil character recognition. Chapter 4 deals with classification
techniques for Tamil character recognition. Chapter 5 deals with conclusions
and future scope of research.