CHAPTER 1
INTRODUCTION
1.1 TAMIL
Tamil is a South Indian language spoken widely in Tamil Nadu,
one of the states of India. Tamil is one of the oldest languages in the world
and the Tamil script is used to write the Tamil language in Tamil Nadu state
of India, Sri Lanka, Singapore and parts of Malaysia as well as to write
minority languages such as Badaga. Although Tamil has been influenced by
Sanskrit to a certain degree, Tamil, along with the other South Indian
languages, is genetically unrelated to the descendants of Sanskrit such as
Hindi, Bengali and Gujarati.
As a result of intensive research and development efforts,
sophisticated character recognition systems are available for the English
language, for Chinese and Japanese, and for handwritten numerals. However,
less attention has been given to Indian languages. Some efforts have been
reported in the literature for Devanagari, Tamil and Bangla scripts. The
significance of the thesis stems from the fact that the Tamil language is
ancient and widespread and owns a rich body of literature. Realizing the
research potential in the Tamil text image problem, this thesis is focused on
Tamil text image restoration and Tamil character recognition.
Most Tamil letters have circular shapes, partly because
they were originally carved with needles on palm leaves, a technology
that favored circular shapes. Tamil has 12 vowels, 18 consonants, 216
composite characters and 1 special character (Aayutha Ezhuthu), for a total of
247 characters. The character set used herein comprises the vowels and
consonants and is reproduced in Table 1.1 for quick reference.
Table 1.1 Tamil Fonts Set
Processing of a Tamil text image, like that of any other text image,
primarily involves two components: restoration and character recognition.
The initial part of the thesis is focused on the restoration of noisy Tamil text
images. A spatially adaptive restoration algorithm based on Expectation
Maximization (EM) operating in the wavelet transform domain is proposed.
The second part is dedicated to Tamil character recognition in terms of feature
extraction and classification. Two novel feature extraction methods, viz. the
Slope method and a Discrete Wavelet Transform (DWT) domain based method,
are proposed. Also, a tree classifier is proposed which operates on the
features extracted by the methods proposed herein. The feature extraction
algorithms and the classifier algorithm presented herein prove to be promising
candidates for processing of Tamil text images.
1.2 LITERATURE SURVEY
Restoration is the first step in the processing of any text image in
most of the applications and is used to reconstruct or recover an image that
has been degraded by using a priori knowledge of the degradation
phenomenon. Feature extraction for character recognition deals with
determination of various attributes as well as properties associated with a
character. In the following sections, a survey of the existing restoration and
character recognition algorithms is provided.
1.2.1 Restoration
There is a wide range of restoration algorithms applied in different
contexts available in the literature. The restoration methods for document
images are presented below.
Moghaddam and Cheriet (2010) have proposed a multi-scale
framework for the adaptive binarization of degraded document images. The
framework can be used along with any adaptive threshold-based binarization
method, and it is able to improve the binarization results and to restore weak
connections and strokes, especially in the case of degraded historical
documents.
Moghaddam and Cheriet (2009) have addressed the problem of
enhancing and restoring low-quality single-sided document
images. Initially, a series of multi-level classifiers is introduced covering
several levels, including the regional and content levels. These classifiers can
then be integrated into any enhancement or restoration method to generalize
or improve them. El-Sallam et al (2009) have presented a blind image
restoration algorithm to construct a high resolution image from a degraded
and noisy low resolution image captured by Thin Observation Module by
Bound Optics (TOMBO) imaging systems. In this method, all spectral
information in each captured image is used to restore each output pixel in the
reconstructed high resolution image.
Meng et al (2007) have proposed a very fast and robust algorithm
for locating and removing a special kind of circular noise caused by scanning
documents with punched holes. First, the original image is reduced according to
a carefully selected ratio. After reduction, punched holes leave distinctive
small regions; by examining these regions, hole noise can be quickly detected
and located. To diminish false detections, the Hough transform is applied to
the roughly located regions to further confirm the located holes. Finally, the
circular noise is eliminated by fitting a bi-linear blending Coons surface
which interpolates along the four edges of the noisy region.
Tan et al (2006) have proposed an efficient restoration method
based on the discovery of the 3D shape of a book surface from the shading
information in a scanned document image. Scanning a document page from a
thick bound volume often results in two kinds of distortions in the scanned
image, i.e., shade along the "spine" of the book and warping in the shade area.
From a technical point of view, this Shape From Shading (SFS) problem in
real-world environments is characterized by a proximal and moving light
source, Lambertian reflection and document skew. Taking all these factors
into account, practical models consisting of a 3D geometric model and a 3D
optical model have been built for the practical scanning conditions to
reconstruct the 3D shape of the book surface. The scanned document image
has then been restored using this shape, based on deshading and dewarping
models.
Molton et al (2003) have proposed a phase congruency technique
for visual enhancement of incised text. Incised stroke information was
deduced by imaging the document under a carefully selected set of lighting
conditions that cause shadows to be cast and observing the position and
motion of shadow areas. The technique was also used to interpolate broken
strokes and to recover 3D surface structure. Fan and Tan (2003) have
proposed adaptive image smoothing using a coplanar matrix and its
application to document image binarization. For document images corrupted
by various kinds of noise, directly binarized images may be severely blurred
and degraded. A common treatment for this problem is to pre-smooth the input
images using noise-suppressing filters. An image-smoothing method is used
for pre-filtering before document image binarization. Conceptually, it has been
proposed that the influence range of each pixel affecting its neighbors should
depend on local image statistics. This property adapts the smoothing process
to the contrast, orientation, and spatial size of local image structures.
Fan et al (2002) have presented a local thresholding method and a
region growing method for the removal of marginal noise in gray-scale
document images and binary document images respectively. Marginal noise
is a common phenomenon in document analysis, which results from the
scanning of thick or skewed documents. It usually appears in the form
of a large dark region around the margin of document images.
Marginal noise might cover meaningful document objects, such as text,
graphics and forms.
Tan et al (2002) have presented a wavelet technique for restoration
of archival documents. This paper addresses a problem of restoring
handwritten archival documents by recovering their contents from the
interfering handwriting on the reverse side caused by the seeping of ink. A
novel method that works by first matching both sides of a document such that
the interfering strokes are mapped with the corresponding strokes originating
from the reverse side has been presented. This facilitates the identification of
the foreground and interfering strokes. A wavelet reconstruction process then
iteratively enhances the foreground strokes and smears the interfering strokes
so as to strengthen the discriminating capability of an improved Canny edge
detector against the interfering strokes.
Fan and Tan (2002) have proposed a structure tensor based image
restoration algorithm. Structure tensors capture the structural and textural
distribution of similar pixels at each site. This property adapts the smoothing
process to the contrast, orientation and spatial size of local image structures.
Zheng and Kanungo (2001) have presented a model-based
restoration algorithm for document image restoration. The restoration
algorithm first estimates the parameters of a degradation model and then uses
the estimated parameters to construct a lookup table for restoring the
degraded image. The estimated degradation model is used to estimate the
probability of an ideal binary pattern, given the noisy observed pattern. This
probability is estimated by degrading noise-free document images and then
computing the frequency of corresponding noise-free and noisy pattern pairs.
This conditional probability is then used to construct a lookup table to restore
noisy images.
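The lookup-table idea above can be sketched in a few lines. The following is a minimal illustration, not the cited authors' implementation: noise-free binary images are degraded under a hypothetical independent bit-flip noise model, the frequencies of (noisy 3x3 pattern, clean centre pixel) pairs are counted, and a majority vote over each observed pattern becomes the lookup table. The function names, the bit-flip model and the 3x3 pattern size are all assumptions made for illustration.

```python
import random

def degrade(img, flip_prob, rng):
    """Degradation model (an assumption): flip each pixel independently."""
    return [[1 - p if rng.random() < flip_prob else p for p in row]
            for row in img]

def pattern_at(img, r, c):
    """3x3 neighbourhood around (r, c), zero-padded at the borders."""
    h, w = len(img), len(img[0])
    return tuple(
        img[r + dr][c + dc] if 0 <= r + dr < h and 0 <= c + dc < w else 0
        for dr in (-1, 0, 1) for dc in (-1, 0, 1))

def build_lut(clean_images, flip_prob, trials=10, seed=0):
    """Degrade noise-free images, count (noisy pattern, clean centre) pairs,
    and keep the most frequent clean value for each observed pattern."""
    rng = random.Random(seed)
    counts = {}
    for clean in clean_images:
        for _ in range(trials):
            noisy = degrade(clean, flip_prob, rng)
            for r in range(len(clean)):
                for c in range(len(clean[0])):
                    key = pattern_at(noisy, r, c)
                    zeros, ones = counts.get(key, (0, 0))
                    counts[key] = ((zeros, ones + 1) if clean[r][c]
                                   else (zeros + 1, ones))
    return {k: (1 if ones > zeros else 0)
            for k, (zeros, ones) in counts.items()}

def restore(noisy, lut):
    """Replace each pixel by the table's most probable clean value;
    patterns never seen during training are left unchanged."""
    return [[lut.get(pattern_at(noisy, r, c), noisy[r][c])
             for c in range(len(noisy[0]))] for r in range(len(noisy))]
```

In the actual method, the degradation model's parameters are first estimated from the observed documents; the fixed flip probability here merely stands in for that estimated model.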
Ye et al (2001) have proposed a generic method of cleaning and
enhancing handwritten data from business forms. Preparing clean and clear
images for the recognition engines is often taken for granted as a trivial task
that requires little attention. In reality, handwritten data usually touch or cross
the preprinted form frames and texts, creating tremendous problems for the
recognition engines. Here, a generic system including only cleaning and
enhancing phases has been proposed. In the cleaning phase, the system
registers a template to the input form by aligning corresponding landmarks. A
unified morphological scheme was proposed to remove the form frames and
restore the broken handwriting from gray or binary images. When the
handwriting is found touching or crossing preprinted texts, morphological
operations based on statistical features are used to clean it. In applications
where a black-and-white scanning mode is adopted, handwriting may contain
broken or hollow strokes due to improper thresholding parameters. Therefore,
a module has been designed to enhance the image quality based on
morphological operations.
Sattar and Tay (1998) have proposed a fuzzy logic approach for
enhancing document images which are blurred and corrupted binary images
obtained from a scanner. Sattar et al (1997) have proposed a nonlinear
multiscale method for image restoration by successively combining each
coarser scale image with the corresponding modified interscale image.
Banham and Katsaggelos (1996) have presented a spatially
adaptive approach to the restoration of noisy blurred images which is
particularly effective at producing sharp deconvolution while suppressing the
noise in the flat regions of an image. This is accomplished through a
multiscale Kalman smoothing filter applied to a prefiltered observed image in
the discrete, separable, 2-D wavelet domain. The prefiltering step involves
constrained least-squares filtering based on optimal choices for the
regularization parameter. This leads to a reduction in the support of the
required state vectors of the multiscale restoration filter in the wavelet domain
and improvement in the computational efficiency of the multiscale filter. The
proposed method has the benefit that the majority of the regularization or
noise suppression of the restoration is accomplished by the efficient
multiscale filtering of wavelet detail coefficients ordered on quadtrees. Not
only does this lead to potential parallel implementation schemes, but it
permits adaptivity to the local edge information in the image. In particular,
this method changes filter parameters depending on scale, local Signal-to-
Noise Ratio (SNR), and orientation. Because the wavelet detail coefficients
are a manifestation of the multiscale edge information in an image, this
algorithm has been viewed as an edge-adaptive multiscale restoration
approach.
1.2.2 Character Recognition
A wide range of algorithms has been developed for a number of
languages around the world. Developments in character recognition for
languages other than Tamil are surveyed first, followed by a survey of the
algorithms for Tamil.
Extensive experimentation on the recognition of different
handwritten scripts has been carried out during the last three decades. Anita
Pal and Dayashankar Singh (2010) have described a Fourier descriptor
method for English character recognition. Each character is resized into a
normalized image, which is then converted into a binary image. Fourier
coefficients for each character are calculated and then fed as input features to
a neural network classifier.
Mujtaba and Shahid (2009) have described English letter
classification using Bayesian decision theory and feature extraction using
Principal Component Analysis. Bayesian Decision Theory (BDT), one of the
statistical techniques for pattern classification, is used to identify each of the
large number of black-and-white rectangular pixel displays as one of the 26
capital letters in the English alphabet. Principal Component Analysis (PCA) is
used for feature extraction to reduce the dimensions of the pattern data.
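As an illustration of how PCA reduces the dimensionality of pattern data, the sketch below estimates the first principal axis of a small data set by power iteration on the covariance matrix and projects patterns onto it. This is a generic PCA sketch under simplifying assumptions (one component, pure Python), not the cited authors' code; the function names are invented for illustration.

```python
def pca_first_axis(data, iters=200):
    """Estimate the first principal axis by power iteration on the
    sample covariance matrix of the (mean-centred) data."""
    n, d = len(data), len(data[0])
    mean = [sum(x[j] for x in data) / n for j in range(d)]
    centred = [[x[j] - mean[j] for j in range(d)] for x in data]
    cov = [[sum(row[i] * row[j] for row in centred) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]       # renormalize each iteration
    return mean, v

def project(x, mean, axis):
    """Reduced 1-D feature: projection of a pattern onto the axis."""
    return sum((x[j] - mean[j]) * axis[j] for j in range(len(x)))
```

In practice several leading components would be retained, giving a short feature vector in place of the raw pixel vector.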
Bhattacharya and Chaudhuri (2009) have presented the pioneering
development of two databases for the recognition of the two most popular
Indian scripts, namely Devnagari and Bangla. They have presented a multistage
cascaded recognition scheme using wavelet-based multi-resolution
representations. They implemented their work for the recognition of mixed
handwritten numerals of three Indian scripts namely Devnagari, Bangla and
English.
Lajish (2008) has proposed a feature extraction method for the
recognition of Malayalam characters from their gray scale images without the
usual step of binarization. They investigated a new approach to model
Malayalam characters using the state space map and state space point
distribution stage. Sandhya et al (2008) have reported multiple feature
extraction techniques for handwritten Devnagari character recognition. They
computed shadow features globally for a character image and segmented the
character image to compute the intersection and line fitting features. Raju
(2008) has presented the wavelet and projection profile-based feature
extraction method for Malayalam characters. They computed vertical and
horizontal projection profiles. The projection profiles have been subjected to
'n' levels of the wavelet transform. They took the average component of the
coefficient set as the feature vector.
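The projection-profile-plus-wavelet idea can be sketched as follows: compute the horizontal and vertical projection profiles of a binary character image, then keep only the approximation (average) part of a Haar-like transform over 'n' levels. The unnormalized pairwise average below stands in for the true Haar approximation coefficients, and all names are illustrative assumptions, not the cited author's implementation.

```python
def projection_profiles(img):
    """Horizontal profile = row sums, vertical profile = column sums
    of foreground (1) pixels in a binary character image."""
    horizontal = [sum(row) for row in img]
    vertical = [sum(col) for col in zip(*img)]
    return horizontal, vertical

def haar_approx(profile, levels):
    """Keep only the approximation part of a Haar-like transform:
    repeated pairwise averaging, 'levels' times (normalization omitted)."""
    s = list(profile)
    for _ in range(levels):
        if len(s) % 2:              # pad odd-length signals
            s = s + [s[-1]]
        s = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
    return s

def profile_features(img, levels):
    """Feature vector: averaged coefficients of both profiles."""
    h, v = projection_profiles(img)
    return haar_approx(h, levels) + haar_approx(v, levels)
```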
Nasser and Shamsuddin (2008) have presented handwritten digit
recognition using Particle Swarm Optimization (PSO). Here PSO based
method is exploited to recognize unconstrained handwritten digits. Each class
is encoded as a centroid in multidimensional feature space and PSO is
employed to probe the optimal position for each centroid. The algorithm is
evaluated using 5-fold cross validation on handwritten digit data, and the
results reveal that PSO gives promising performance and stable behavior in
recognizing these digits. Sagar et al (2008) have described a character
recognition system for printed text documents in Kannada, a South Indian
language. Partha Pratim et al (2008) have presented a convex hull based
approach to the recognition of English characters in multi-scale and
multi-oriented environments. Graphical documents such as maps consist of
text lines that appear in different orientations. Sometimes, the characters in a
single word may follow a curvilinear path to annotate a graphical curve.
For recognition of such multi-scale and multi-oriented characters, a Support
recognition of such multi-scale and multi-oriented characters, a Support
Vector Machine (SVM) based scheme has been presented. The feature used
here is invariant to character orientation. Circular ring and convex hull have
been used along with angular information of the contour pixels of the
character to make the feature rotation invariant.
Much attention has been paid to the recognition of English,
Chinese, Japanese and Korean characters because they provide a handy test
case for various techniques, viz. preprocessing, feature extraction and
classification, and because they have many applications, viz. postal mail
sorting, cheque reading, form processing, etc. In Oriya, many characters
have a shape similarity and most of
them have a curve-like stroke. Pal et al (2007) have reported this curvature
feature for recognition purposes. They extracted the feature by segmenting the
input image into (49 × 49) blocks. These blocks are then down-sampled by a
Gaussian filter, and the features obtained from the down-sampled blocks are
fed to a modified quadratic classifier for recognition. They used principal
component analysis for feature dimension reduction. Hanmandlu et al (2007)
have presented a zone-based feature extraction method for the recognition of
Hindi characters. They divided the character image into 24 zones. By
considering the bottom left corner of the image as an absolute reference, the
average vector distance for the foreground pixels present in the zone is
determined.
Manjunath et al (2007) have proposed a system based on the Radon
transform for Kannada digit recognition. They extracted the features using the
Radon function, which represents an image as a collection of projections along
various directions. They used the nearest neighbor classifier for subsequent
classification and recognition. Jayaraman et al (2007) have described some
issues in developing a character recognition system for the Telugu script. They
proposed a modular approach for the recognition of strokes. Based on the
relative position of a stroke in a character, the stroke set has been divided into
three subsets, namely base line strokes, bottom strokes and top strokes. They
used the Hidden Markov Model (HMM) and Support Vector Machine (SVM)
classifiers for subsequent classification and recognition. Lajish (2007) has
proposed a feature extraction method for recognition of Malayalam characters
based on fuzzy zoning and average vector distance measures. They used a
modular neural network for subsequent classification.
Rajput and Mallikarjun (2007) have proposed the feature extraction
method using an image fusion technique for the recognition of isolated
Kannada characters. They divided the character image into equal zones and
computed the pixel foreground density for each zone. They compared each
zone pixel density with the threshold. When the zone pixel density exceeds
the threshold, 1 is stored for that particular zone; otherwise 0 is stored.
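The zone-density thresholding described above amounts to the following sketch. The zone layout, the divisibility assumption and the function name are illustrative choices, not the authors' code.

```python
def zone_density_features(img, zones_per_side, threshold):
    """Divide a binary character image into equal square zones and store
    1 for a zone whose foreground pixel density exceeds the threshold,
    otherwise 0 (image size assumed divisible by zones_per_side)."""
    h, w = len(img), len(img[0])
    zh, zw = h // zones_per_side, w // zones_per_side
    features = []
    for zr in range(zones_per_side):
        for zc in range(zones_per_side):
            block = [img[r][c]
                     for r in range(zr * zh, (zr + 1) * zh)
                     for c in range(zc * zw, (zc + 1) * zw)]
            density = sum(block) / len(block)
            features.append(1 if density > threshold else 0)
    return features
```

The resulting binary vector, one bit per zone, is what the classifier then compares across characters.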
Michael et al (2007) have provided a novel feature extraction technique for
the recognition of cursive characters. They used the modified direction feature
extraction technique that extracts information from the structure of character
contours. Wen et al (2007) have proposed two approaches for Bangla
character recognition. The first is based on the image reconstruction error
in a principal subspace, and the other is based on the Kirsch gradient, with
dimensionality reduction by Principal Component Analysis (PCA) and
classification by the SVM. Pal et al (2007) have reported the results of
character recognition of six Indian scripts namely Devnagari, Bangla, Telugu,
Oriya, Kannada and Tamil. They used two types of features; the one for high
accuracy classification is the 16-direction gradient histogram feature,
extracted by mean filtering of the binary image (to obtain a gray scale
image), Roberts gradient filtering, tangent direction quantization and
down-sampling.
Liana and Govindaraju (2006) have provided an excellent survey of
character recognition for the Arabic script. Majumdar and Chaudhuri (2006)
have presented a zone-based feature extraction method for printed and
handwritten Bangla numeral recognition. Hani Khasawneh (2006) has
presented a novel Arabic character recognition system. The system
accepts a scanned-page image containing a set of text lines affected by typical
noise levels. The system preprocesses the image before the separated lines are
handed to the character extraction phase. A neural network is used to process
the features and to classify them into one of the characters.
Lajish et al (2005) have proposed the weighted minimum rectangle
feature extraction method for Malayalam characters. They cropped the image
into a minimum rectangle such that the image fits exactly into the rectangle.
They divided the rectangle into equal zones. Then they computed the number
of crossings of the character curve with each of the sides. Further, they
computed the ratio of the foreground (black) pixels to the background (white)
pixels for each zone. They used statistical classification techniques for
subsequent classification.
Pal and Chaudhuri (2004) have provided an excellent review of the
character recognition work done on Indian language scripts and the different
methodologies applied in character recognition development in the
international scenario. Romesh et al (2004) have described a fuzzy method for
English character recognition where each character is divided into a number of
segments. For each segment, a fuzzy membership value is calculated. Based on
these membership values, characters are recognized by means of a min-max
composition procedure.
Suen et al (2003) have provided an excellent review of the analysis
and recognition of Asian scripts (Chinese/Japanese/Korean). Bhattacharya
and Chaudhuri (2003) have proposed a multi-resolution wavelet analysis and
majority voting approach and applied it to Bangla character recognition. One
of the main features of the proposed scheme is that it is not script
dependent. Another interesting feature is that it is sufficiently fast for real-life
applications. In contrast to the usual practices, the efficiency of a majority
voting approach is studied when all the classifiers involved are Multi-Layer
Perceptrons (MLP) of different sizes and respective features are based on
wavelet transforms at different resolution levels. The rationale for this
approach is to explore how one can improve the recognition performance
without adding much to the requirements for computational time and
resources. For simplicity and efficiency, only three coarse-to-fine resolution
levels of wavelet representation are considered.
Nafiz and Fatos (2001) have provided an overview of character
recognition focused on the English script. Dhanya and Ramakrishnan (2001)
have suggested recognition via script identification and a bilingual approach
for character recognition from bilingual text. Plamondon and Srihari
(2000) have proposed a comprehensive survey for English character
recognition. Tao and Tang (1999) have described the feature extraction of
Chinese character based on contour information.
The geometrical and topological representation of an image
provides various global and local properties. Such a type of representation
yields high tolerance to distortions and style variations. The topological
representation of an image uses predefined structures like strokes. The
strokes are searched in a character/image and the number or relative position
of these structures within the character forms a descriptive representation.
Characters and words can be represented by extracting and counting many
topological features. These features include extreme points, maxima and
minima, cross and end points, loops etc., that make up a character (Nishida
1995). The geometrical representation of a character includes geometrical
quantities such as the aspect ratio, the relative distance between horizontal
and vertical points, the comparative length between two strokes, change in
curvature, etc. (Kundu and He 1991, Okamoto and Yamamoto 1997). The
Freeman's chain code is the most popular coding scheme and this code is
obtained by mapping the strokes of a character into a two-dimensional
parameter space, which is made up of codes. A set of topological primitives,
such as strokes, loops, cross points, etc., are obtained by partitioning the
characters. Then, these primitives are represented using attributed or
relational graphs (Li et al 1997). Trees can also be used to represent the
characters with a set of features, which have a hierarchical relation
(Madhvanath et al 1997).
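Freeman chain coding of a stroke can be sketched as follows, assuming the stroke is already available as an ordered list of 8-connected (row, col) pixel coordinates. The contour-tracing step that produces that list is omitted, and the direction numbering below is one common convention; both are assumptions made for illustration.

```python
# 8-direction Freeman codes (rows grow downward, so "north" is -1 in row):
# 0 = E, 1 = NE, 2 = N, 3 = NW, 4 = W, 5 = SW, 6 = S, 7 = SE
MOVES = {(0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
         (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7}

def chain_code(points):
    """Freeman chain code of an ordered list of 8-connected (row, col)
    stroke points: one direction code per step between consecutive points."""
    return [MOVES[(r1 - r0, c1 - c0)]
            for (r0, c0), (r1, c1) in zip(points, points[1:])]
```

The resulting code sequence is the two-dimensional parameter space representation referred to above: the stroke's shape is captured by its sequence of direction codes rather than by absolute pixel positions.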
Trier et al (1996) have presented a good survey of feature
extraction methods such as Template matching, Deformable templates,
Unitary image transforms, Graph description, Contour profiles, Zoning,
Geometric moment invariants, Zernike moments, Spline curve approximation
and Fourier descriptors. In template matching, all the pixels in the gray scale
character image are used as features. A similarity or dissimilarity measure
between each template and the character image is computed. In the case of a
similarity measure, the template Tk having the highest similarity measure is
identified and if this similarity is above a specified threshold, then the
character is assigned the class label k. Else, the character remains
unclassified. In the case of a dissimilarity measure, the template Tk having the
lowest dissimilarity measure is identified and if the dissimilarity is below a
specified threshold, the character is given the class label k. Deformable
templates are used for character recognition in gray scale images of credit
card slips with poor print quality. The templates used are character skeletons.
Unitary transform is applied to character images, obtaining a reduction in the
number of features while preserving most of the information about the
character shape. In the transformed space, the pixels are ordered by their
variance and the pixels with the highest variance are used as features. The
unitary transform has to be applied to a training set to obtain estimates of the
variances of the pixels in the transformed space. Zernike moments are
projections of the input image onto the space spanned by the orthogonal V-
functions. The amplitudes of the Zernike moments are used as features for
character recognition of binary solid symbols.
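The template-matching decision rule described above (assign the class label k when template Tk has the highest similarity and that similarity exceeds a threshold, else leave the character unclassified) can be sketched as follows. The pixel-agreement similarity used here is one simple choice among the many measures the survey covers, and the names are illustrative.

```python
def similarity(image, template):
    """Fraction of positions at which two binary patterns agree
    (one simple similarity measure; many others are possible)."""
    total = agree = 0
    for img_row, tpl_row in zip(image, template):
        for p, q in zip(img_row, tpl_row):
            total += 1
            agree += (p == q)
    return agree / total

def classify(image, templates, threshold):
    """Assign the label k of the template Tk with the highest similarity,
    provided it reaches the threshold; else the character is unclassified."""
    best_label, best_sim = None, -1.0
    for label, template in templates.items():
        s = similarity(image, template)
        if s > best_sim:
            best_label, best_sim = label, s
    return best_label if best_sim >= threshold else None
```

With a dissimilarity measure, the rule is mirrored: the template with the lowest dissimilarity wins, provided the dissimilarity falls below the threshold.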
Some major statistical features used for image representation are
zoning, crossings and distances, and projections. An image is divided into
several overlapping or non-overlapping zones, and the representation is
computed from the pixel densities of the zones, or from other features
analyzed in the different regions. Mohiuddin and Mao (1994) have
described contour direction features which are generated by dividing the
image array into rectangular and diagonal zones and computing histograms of
a chain code in these zones. Lee and Park (1994) have reviewed nonlinear
shape normalization methods in order to compensate for shape distortions in
large-set handwritten Korean characters. Chen et al (1994) have presented a
complete scheme for totally unconstrained handwritten word recognition
based on a single contextual Hidden Markov Model (HMM) type stochastic
network. This scheme includes a morphology and heuristics based
segmentation algorithm and a training algorithm that can adapt itself to a
changing dictionary. Amin and Al-Sadoun (1994) have proposed a structural
technique for automatic recognition of hand printed Arabic characters. The
advantages of this technique are that it is more efficient for large and complex
sets such as Arabic characters, that feature extraction is inexpensive, and that
its execution time does not depend on either the font or the size of the
characters.
Yamada et al (1990) have proposed a nonlinear normalization
method called the line density equalization for handprinted kanji character
recognition. Here, a resampling is done so as to equate the product of a local
line density and a sampling pitch. Consequently, the line density in the space
is homogenized, the efficiency of utilization of the space is increased, and a
stable normalization is obtained for partially irregular shape variations.
Srihari et al (1989) have described a system to automatically locate
and recognize ZIP codes in handwritten English addresses. Given a gray-scale
image of a handwritten address block, the system preprocesses the image by
thresholding, border removal and underline removal. One or more candidate
words for the ZIP Code are isolated. Each candidate is divided into 5 or 9
segments and recognition is attempted on each segment. Digit recognition is
accomplished by means of an arbitration procedure that takes as input the
decisions of three different classifiers: template matching using stored
prototypes, a mixed approach that uses statistical and structural analysis of
digit boundary and a rule-based approach to analyze digit strokes. The result
of ZIP Code recognition is verified using a postal directory. Tsukumo and
Tanaka (1988) have presented a system for the classification of hand printed
Chinese characters using correlation methods for fast classification and a
nonlinear normalization based on uniform relocation of the strokes used to
form the character. Experimental results for handprinted Chinese character
classification are presented.
A popular statistical feature is the number of crossings of a contour
by a line segment in a specified direction. Also, the distance of the line
segments from a given boundary such as the upper and lower portions of the
frame can be used as statistical features (Brown and Ganapathy 1983). Suen
et al (1980) have presented a survey in the challenging field of character
recognition. Recognition algorithms, databases, character models and
handprint standards are examined. Achievements in the recognition of hand
printed numerals, alphanumerics, FORTRAN and Katakana characters are
analyzed and compared. Data quality and constraints, as well as human and
machine factors are also described. Characteristics, problems and actual
results on on-line recognition of handprinted characters for different
applications have been discussed. New emphases and directions are
suggested.
Ishwarya et al (2010) have proposed a Convolutional Neural
Network which recognizes Tamil characters. Convolutional Neural
Networks are a special kind of multi-layer neural network. They are trained
with a version of the back-propagation algorithm and are designed to
recognize visual patterns directly from pixel images with minimal
preprocessing. They can recognize patterns with extreme variability and with
robustness to distortions and geometric transformations. Recognition of the
test sample is performed using a nearest neighbor classifier.
Shashikiran et al (2010) have studied the performance of HMM and
Statistical Dynamic Time Warping (SDTW) for Tamil Character Recognition.
HMM is used for a 156-class problem. Different feature sets and values for
the HMM states and mixtures are tried, and the best combination is found to
be 16 states and 14 mixtures, giving an accuracy of 85%. The features used in
this combination are retained, and an SDTW model with 20 states and a single
Gaussian is used as the classifier.
Ramanathan et al (2010) have proposed a new technique of optical
character recognition using Gabor filters and support vector machines
(SVM). This method proves to be very effective, using Gabor filters for
feature extraction and SVMs for developing the model. The proposed model
is trained and validated for two languages, English and Tamil, and the results
are found to be very encouraging. The model works for the entire character
set in both languages, including symbols and numerals. In addition, the model
can recognize the characters of six different fonts in English and twelve
different fonts in Tamil. The average recognition accuracy for Tamil is 84%,
achieved in just three iterations of training. The method can turn out to be a
suitable candidate for future applications in this area.
Abirami and Manjula (2009) have presented feature-string-based
intelligent information retrieval from Tamil document images. The
methodology generates a feature string for every word image by extracting
features that rely on the basic characteristics and shapes of the letters, such
as black/white disposition rates and lines in characters. Sundaram and
Ramakrishnan (2009) have proposed script-specific post-processing schemes
for improving the recognition rate of Tamil characters. At the first level,
features derived at each sample point of the preprocessed character are used to
construct a subspace using the 2D-PCA algorithm. Recognition of the test
sample is performed using a nearest neighbor classifier. Based on the analysis
of the confusion matrix, multiple pairs of confused characters are identified.
At the second level, they have used script-specific cues to resolve the
ambiguities among the confused characters, substantially reducing the
recognition error among the confused character sets handled. This
approach can be applied irrespective of the nature of the classifier
used for the first level of recognition, though the nature of the confusion set
might vary.
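The first-level confusion analysis in such a two-stage scheme can be sketched as follows; the function name and the toy three-class matrix are illustrative, not taken from the cited work:

```python
# Hypothetical sketch: find the most frequently confused class pairs from a
# confusion matrix, the step that precedes script-specific disambiguation.
from collections import Counter

def confused_pairs(confusion, top_k=2):
    """confusion[i][j] = number of class-i samples predicted as class j."""
    off_diag = Counter()
    n = len(confusion)
    for i in range(n):
        for j in range(n):
            if i != j and confusion[i][j] > 0:
                pair = tuple(sorted((i, j)))
                off_diag[pair] += confusion[i][j]
    return [pair for pair, _ in off_diag.most_common(top_k)]

conf = [[50, 8, 0],
        [6, 44, 1],
        [0, 2, 48]]
print(confused_pairs(conf))  # [(0, 1), (1, 2)]
```

The pairs returned this way would then be handed to the second-level, script-specific disambiguation stage.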
Shanthi and Duraiswamy (2008) have described a method for Tamil
character recognition in which each character image is divided into equal
numbers of horizontal and vertical stripes, resulting in a grid of
square-shaped zones. For each zone, the pixel density is calculated and used
as the feature vector for character recognition. Jagadeesh Kannan and Prabhakar
(2008) have presented a Hidden Markov Model (HMM) method for Tamil
character recognition in which two HMMs are created for each unknown
character. One HMM is for modeling the horizontal information and the other
HMM is for modeling the vertical information. The created HMMs are then
trained for character recognition.
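The zone-based pixel-density feature described above can be sketched as follows; the helper name, grid size and binary 0/1 image representation are assumptions for illustration:

```python
# Minimal sketch of zone-based pixel-density features on a binary image.
def zone_density(img, zones=2):
    """Split a binary image into zones x zones blocks and return the
    fraction of black (1) pixels in each block, row-major."""
    h, w = len(img), len(img[0])
    zh, zw = h // zones, w // zones
    feats = []
    for zi in range(zones):
        for zj in range(zones):
            block = [img[r][c]
                     for r in range(zi * zh, (zi + 1) * zh)
                     for c in range(zj * zw, (zj + 1) * zw)]
            feats.append(sum(block) / len(block))
    return feats

# An 8x8 image whose top-left quadrant is fully inked:
img = [[1] * 4 + [0] * 4 for _ in range(4)] + [[0] * 8 for _ in range(4)]
print(zone_density(img))  # [1.0, 0.0, 0.0, 0.0]
```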
Batmavady and Manivannan (2007) have presented a polynomial
method for Tamil character recognition in which the unknown character
image is divided into blocks of equal size. Each block contains a part of the
character, which is fitted to a fifth-order polynomial. The polynomial
coefficients are used as features and compared with a standard template, and
the error percentage is computed. If the error percentage is below a
threshold, the character is declared a match to the template. Shivsubramani
et al (2007) have presented an efficient method for recognizing printed
Tamil characters by exploring the interclass relationship between them,
accomplished using multiclass hierarchical Support Vector Machines. This
variant of the multiclass SVM constructs a hyperplane that separates each
class of data from the other classes; character recognition thus involves
classification into multiple classes.
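One plausible reading of the block-wise polynomial fitting is to fit each block's projection profile to a fifth-order polynomial and keep the coefficients as features; the fitted quantity is an assumption here, so this is only a sketch, not the cited method:

```python
# Sketch: fit a block's vertical projection profile (ink count per column)
# to a degree-5 polynomial and return the coefficients as features.
import numpy as np

def block_poly_features(block, degree=5):
    profile = block.sum(axis=0).astype(float)  # ink count per column
    x = np.arange(profile.size)
    return np.polyfit(x, profile, degree)      # degree+1 coefficients
```

Template matching would then compare these coefficient vectors against those of stored templates and threshold the resulting error.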
Bharath and Sriganesh (2007) have proposed a Data-driven Hidden
Markov Model (HMM) based online handwritten word recognition system for
Tamil. A symbol set consisting of 84 symbols has been defined for the word
recognition task and each symbol has been modeled using a left-to-right
HMM. Inter-symbol pen-up strokes have been modeled explicitly using two
state left-to-right HMMs to capture the relative positions between symbols in
the word context. Independently built symbol models and inter-symbol pen-
up stroke models have been concatenated to form the word models. The
relatively low performance in the case of high lexicon size can be improved
by the use of statistical language models, which are commonly applied in
Western cursive recognition.
Suresh and Ganesan (2005) have described an approach that applies
fuzzy concepts to handwritten Tamil characters, classifying each as one of
the prototype characters using a feature called distance from the frame and a
suitable membership function. The unknown and prototype characters are
preprocessed before being considered for recognition. Fuzzy set theory
provides an approximate but effective means of describing the behavior of
ill-defined systems, and patterns of human origin, such as handwritten
characters, are to some extent fuzzy in nature, which motivates the fuzzy
conceptual approach.
Seethalakshmi et al (2005) have discussed the various strategies
and techniques involved in the recognition of Tamil text, referring to
Optical Character Recognition (OCR) as the process of converting printed
Tamil text documents into software-translated Unicode Tamil text.
The printed documents available in the form of books, papers, magazines, etc.
are scanned using standard scanners which produce an image of the scanned
document. As part of the preprocessing phase, the image file is checked for
skewing. The skewed image is corrected by a simple rotation technique in the
appropriate direction and then it is passed through a noise elimination phase
and is binarized. The preprocessed image is segmented using an algorithm
that decomposes the scanned text into paragraphs using a special
space-detection technique, the paragraphs into lines using vertical
histograms, the lines into words using horizontal histograms, and the words
into character image glyphs, again using horizontal histograms. Each image glyph is
comprised of 32×32 pixels. Thus a database of character image glyphs is
created out of the segmentation phase. Then all the image glyphs are
considered for recognition using Unicode mapping. Each image glyph is
passed through various routines, which extract the features of the glyph. The
extracted features are passed to a Support Vector Machine (SVM), where the
characters are classified by a supervised learning algorithm. These classes are
mapped onto Unicode for recognition.
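The histogram-based segmentation in such pipelines reduces to splitting a projection profile at its empty runs; a minimal sketch follows (the function name is assumed, not from the cited system):

```python
# Minimal sketch of histogram-based segmentation: split a 1-D projection
# profile into ink runs separated by empty (all-zero) gaps.
def split_on_gaps(profile):
    """Return (start, end) index pairs for each run of non-zero values."""
    segments, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments

# Two text lines separated by a blank gap in the projection:
print(split_on_gaps([0, 3, 5, 0, 0, 2, 2, 0]))  # [(1, 3), (5, 7)]
```

Applying this to the vertical histogram yields lines; applying it again to each line's horizontal histogram yields words, and then glyphs.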
Joshi et al (2004) have compared elastic matching schemes for
writer-dependent on-line handwriting recognition of isolated Tamil
characters. Three different features are considered, namely preprocessed x-y
coordinates, quantized slope values and dominant-point coordinates. Seven
schemes based on these three features are compared using an elastic distance
measure based on dynamic time warping, in terms of recognition accuracy,
recognition speed and number of training templates. The results show that
the dominant-points-based two-stage scheme and the combination of rigid
and elastic matching schemes perform better than the rest, especially from
the point of view of implementation in a real-time application. Efforts are
underway to devise character grouping schemes for hierarchical
classification and classifier combination schemes so as to obtain a
computationally more efficient recognition scheme with improved accuracy.
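The elastic distance underlying these comparisons is dynamic time warping; below is a minimal sketch over (x, y) pen coordinates with a Euclidean local cost, a generic formulation rather than the authors' exact variant:

```python
# Classic dynamic time warping between two point sequences, allowing one
# sequence to stretch or compress in time to best align with the other.
import math

def dtw_distance(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch a
                                 D[i][j - 1],      # stretch b
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

A nearest-template classifier then assigns the character whose training template minimizes this distance.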
Aparna et al (2004) have proposed a generalized framework for
Indic script character recognition and Tamil character recognition is discussed
as a special case. Unique strokes in the script are manually identified and each
stroke is represented as a string of shape features. The test stroke is compared
with the database of such strings using the proposed flexible string-matching
algorithm. The sequence of stroke labels is then converted into horizontal
blocks using a rule list and the sequence of horizontal blocks is recognized as
a character using a Finite State Automaton (FSA).
Deepu and Madhvanath (2004) have proposed a subspace-based
method using Principal Component Analysis (PCA) for Tamil character
recognition. The input is a temporally ordered sequence of (x, y) pen
coordinates corresponding to an isolated character obtained from a digitizer.
The input is converted into a feature vector of constant dimensions following
smoothing and normalization. Each class is modeled as a subspace and for
classification, the orthogonal distance of the test sample to the subspace of
each class is computed.
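The orthogonal distance of a test sample to a class subspace, as used in this PCA scheme, can be sketched as follows (the basis is assumed to have orthonormal rows, as produced by PCA):

```python
# Sketch of subspace classification: distance from a sample to a class
# subspace; the test sample is assigned to the class minimizing it.
import numpy as np

def subspace_distance(x, mean, basis):
    """basis: rows are orthonormal principal directions of one class."""
    d = x - mean
    proj = basis.T @ (basis @ d)   # projection onto the class subspace
    return float(np.linalg.norm(d - proj))
```

For example, the distance of the point (2, 3, 0) to the x-axis subspace is 3, the length of the component orthogonal to the axis.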
Aparna et al (2002) have presented a complete character
recognition system for Tamil newsprint that includes the full suite of
processes from skew correction, binarization, segmentation, text and non-text
block classification, line, word and character segmentation and character
recognition to final reconstruction.
Hewavitharana and Fernando (2002) have described a system to
recognize handwritten Tamil characters using a two-stage classification
approach for a subset of the Tamil alphabet, which is a hybrid of structural
and statistical techniques. In the first stage, an unknown character is pre-
classified into one of the three groups: core, ascending and descending
characters. Structural properties of the text line are used for this classification.
Then, in the second stage, members of the pre-classified group are further
analyzed using a statistical classifier for final recognition. The main
recognition errors were due to abnormal writing and ambiguity among
similarly shaped characters. These could be reduced by using a word
dictionary to look up possible character compositions, since contextual
knowledge helps eliminate the ambiguity. The method of pre-classification would
have much higher recognition accuracy if applied to Optical Character
Recognition, since printed characters preserve the correct positioning on
three-zone frame.
Chinnuswamy and Krishnamoorthy (1980) have proposed an
approach for hand-printed Tamil character recognition. Here, the characters
are assumed to be composed of line-like elements, called primitives,
satisfying certain relational constraints. Labeled graphs are used to describe
the structural composition of characters in terms of the primitives and the
relational constraints satisfied by them. The recognition procedure consists of
converting the input image into a labeled graph representing the input
character and computing correlation coefficients with the labeled graphs
stored for a set of basic symbols. This algorithm uses a topological matching
procedure to compute the correlation coefficients and selects the basic
symbol that maximizes the correlation coefficient.
Siromoney et al (1978) have described a method for recognition of
machine printed Tamil characters using an encoded character string
dictionary. The scheme employs string features extracted by row- and
column-wise scanning of the character matrix. The features in each row and column are
encoded suitably depending upon the complexity of the script to be
recognized. A given text is presented symbol by symbol and information from
each symbol is extracted in the form of a string and compared with the strings
in the dictionary. When there is agreement, the letters are recognized and
printed out in Roman letters following a special method of transliteration. The
lengthening of vowels and hardening of consonants are indicated by numerals
printed above each letter.
1.3 CONTRIBUTIONS OF THE THESIS
From the above literature survey, it is clear that the character
recognition of Indian scripts is in demand and provides challenging and
interesting applications in the field of document image analysis. In the
literature, many papers have been published with research detailing new
techniques for the classification of characters. This thesis is focused towards
restoration of noisy Tamil Text Images followed by Tamil Character
recognition. A Tamil text image restoration algorithm based on Expectation
Maximization (EM) is proposed. As part of Tamil Character recognition, two
novel feature extraction methods viz. Slope method and Discrete Wavelet
Transform (DWT) domain based method are proposed. Also, a tree classifier
is proposed which would operate on the features extracted by the methods
proposed herein. The restoration algorithm, the feature extraction algorithms
and the classifier algorithm presented herein prove to be promising candidates
for processing of Tamil text images.
The contributions of the research work constituting the proposed
thesis can be summarized as follows:
- An Expectation Maximization (EM) restoration algorithm with an
adaptive custom thresholding technique for noisy Tamil text images.
- A feature extraction technique based on the spatial distribution of
slope for Tamil character recognition.
- A feature extraction technique operating in the Discrete Wavelet
Transform domain for Tamil character recognition.
- A decision tree classifier based on information gain for Tamil
character recognition.
The contributions mentioned above are dealt with in detail in the
subsequent chapters.
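The information-gain criterion driving the proposed decision tree classifier can be sketched as follows; this is the textbook formulation, not the thesis's exact implementation:

```python
# Information gain: the reduction in label entropy achieved by a split,
# used to choose the best attribute at each node of a decision tree.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction when `labels` is partitioned into `groups`."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# A perfect split of two balanced classes yields a gain of 1 bit:
print(information_gain(['a', 'a', 'b', 'b'], [['a', 'a'], ['b', 'b']]))
```

At each node the tree would pick the feature test whose induced partition maximizes this gain.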
1.4 ORGANISATION OF THE THESIS
The organization of the proposed thesis is as follows: Chapter 1
deals with literature survey followed by the contributions of the proposed
thesis. Chapter 2 deals with the proposed spatially adaptive restoration
algorithm based on Expectation Maximization (EM) operating in the wavelet
transform domain. Chapter 3 deals with the proposed feature extraction
algorithms for Tamil character recognition. Chapter 4 deals with classification
techniques for Tamil character recognition. Chapter 5 deals with conclusions
and future scope of research.