D3.1.2: Description and evaluation of tools for DIA, HTR and KWS
Basilis Gatos (NCSR)
Giorgos Louloudis (NCSR) Nikolaos Stamatopoulos (NCSR)
Kostas Ntirogiannis (NCSR) Alexandros Papandreou (NCSR)
Ioannis Pratikakis (NCSR) Konstantinos Zagoris (NCSR)
Joan Andreu Sánchez (UPVLC) Verónica Romero (UPVLC) Alejandro Toselli (UPVLC)
Enrique Vidal (UPVLC) Mauricio Villegas (UPVLC) Francisco Álvaro (UPVLC) Vicente Bosch (UPVLC)
Distribution: Public
tranScriptorium
ICT Project 600707 Deliverable 3.1.2
December 31, 2013
Project funded by the European Community
under the Seventh Framework Programme for Research and Technological Development
Project ref no. ICT-600707
Project acronym tranScriptorium
Project full title tranScriptorium
Instrument STREP
Thematic Priority ICT-2011.8.2 ICT for access to cultural resources
Start date / duration 01 January 2013 / 36 Months
Distribution Public
Contractual date of delivery December 31, 2013
Actual date of delivery December 31, 2013
Date of last update March 18, 2014
Deliverable number 3.1.2
Deliverable title Description and evaluation of tools for DIA, HTR and KWS
Type Report
Status & version Final
Number of pages 154
Contributing WP(s) 3
WP / Task responsible NCSR
Other contributors UPVLC
Internal reviewer Joan Andreu Sánchez, Jesse de Does, Katrien Depuydt
Author(s) Basilis Gatos, Giorgos Louloudis, Nikolaos Stamatopoulos, Kostas Ntirogiannis,
Alexandros Papandreou, Ioannis Pratikakis, Konstantinos Zagoris, Joan Andreu
Sánchez, Verónica Romero, Alejandro Toselli, Enrique Vidal, Mauricio Villegas,
Francisco Álvaro, Vicente Bosch
EC project officer José María del Águila Gómez
Keywords
The partners in tranScriptorium are:
Universitat Politècnica de València - UPVLC (Spain)
University of Innsbruck - UIBK (Austria)
National Center for Scientific Research “Demokritos” - NCSR (Greece)
University College London - UCL (UK)
Institute for Dutch Lexicology - INL (Netherlands)
University of London Computer Centre - ULCC (UK)
For copies of reports, updates on project activities and other tranScriptorium related information, contact:
The tranScriptorium Project Co-ordinator
Joan Andreu Sánchez,
Universitat Politècnica de València
Camí de Vera s/n. 46022 València, Spain
Phone (34) 96 387 7358 - (34) 699 348 523
Copies of reports and other material can also be accessed via the project’s homepage: http://www.transcriptorium.eu/
© 2013, The Individual Authors
No part of this document may be reproduced or transmitted in any form, or by any means,
electronic or mechanical, including photocopy, recording, or any information storage and
retrieval system, without permission from the copyright owner.
Executive Summary
Historical handwritten documents often suffer from several degradations, have low quality, exhibit dense layouts, and may contain touching adjacent text lines and arbitrary text line skew. In WP3, we strive
towards the development of innovative methodologies in document image analysis (DIA),
handwritten text recognition (HTR) and keyword spotting (KWS) in order to provide the core
elements of an HTR engine. Our aim is to go beyond state-of-the-art techniques in DIA in order to efficiently enhance the quality of historical handwritten documents and segment them, as well as to prepare the necessary HTR and KWS modules which will be used in the subsequent recognition stages. This deliverable includes an overview of the current state of the art as well as a description and evaluation of all tools for DIA, HTR and KWS developed during the first year of the project.
Contents

1. Introduction
2. Related work
   2.1 Image Pre-processing
       2.1.1 Binarization
       2.1.2 Border Removal
       2.1.3 Image Enhancement
       2.1.4 Deskew
       2.1.5 Deslant
   2.2 Image Segmentation
       2.2.1 Detection of main page elements
       2.2.2 Text Line Detection
       2.2.3 Word Detection
   2.3 Handwritten Text Recognition
       2.3.1 Optical Character Recognition
       2.3.2 Handwritten Text Recognition
   2.4 Keyword Spotting
       2.4.1 The query-by-example approaches
       2.4.2 The query-by-string approaches
3. Developed Methods
   3.1 Image Pre-processing
       3.1.1 Binarization
       3.1.2 Border Removal
       3.1.3 Image Enhancement
       3.1.4 Deskew
       3.1.5 Deslant
   3.2 Image Segmentation
       3.2.1 Detection of main page elements
       3.2.2 Text Line Detection
       3.2.3 Word Detection
   3.3 Handwritten Text Recognition
   3.4 Keyword Spotting
       3.4.1 The query-by-example approach
       3.4.2 The query-by-string approach
4. Evaluation
   4.1 Image Pre-processing
       4.1.1 Binarization
       4.1.2 Border Removal
       4.1.3 Image Enhancement
       4.1.4 Deskew
       4.1.5 Deslant
   4.2 Image Segmentation
       4.2.1 Detection of main page elements
       4.2.2 Text Line Detection
       4.2.3 Word Detection
   4.3 Handwritten Text Recognition
   4.4 Keyword Spotting
       4.4.1 The query-by-example approach
       4.4.2 The query-by-string approach
5. Description of Tools
   5.1 Image Pre-processing
       5.1.1 Binarization
       5.1.2 Border Removal
       5.1.3 Image Enhancement
       5.1.4 Deskew
       5.1.5 Deslant
   5.2 Image Segmentation
       5.2.1 Detection of main page elements
       5.2.2 Text Line Detection
       5.2.3 Word Detection
   5.3 Handwritten Text Recognition
   5.4 Keyword Spotting
       5.4.1 Query by example
       5.4.2 Query by string
6. References
1. Introduction
The aim of this project is to develop innovative, cost-effective solutions for the indexing, search and
full transcription of historical handwritten document images using HTR and KWS technologies.
Existing Optical Character Recognition (OCR) systems are not suitable because, in handwritten text images, several factors such as degradation, low quality, dense layout, touching adjacent text lines and arbitrary text line skew significantly affect the recognition performance. To
this end, in WP3 we focus on the development of innovative methodologies in DIA, HTR and KWS
in order to enhance the quality and segment historical handwritten documents as well as to provide
the core elements of HTR and KWS modules.
Image pre-processing is necessary as a first step in order to (i) process color or grayscale historical
handwritten documents so that the foreground (regions of handwritten text) is separated from the
background (paper), (ii) eliminate the presence of unwanted and noisy regions as well as to enhance
the quality of text regions, (iii) correct dominant page skew and (iv) identify and correct character
slant. Special focus is given to the challenging characteristics of handwritten documents, which include low contrast, uneven background illumination, bleed-through, and shine-through or shadow-through effects.
In order to achieve accurate HTR performance, a robust and efficient solution to a hierarchical
segmentation task is necessary. In tranScriptorium, we envisage a hierarchical segmentation model
that consists of three levels. The first level is dedicated to the detection and separation of
handwritten text blocks in the document. A document structure and layout analysis will be involved
in order to detect the spatial position of text information. In the second level, text lines will be
identified while the third level will involve the segmentation of each text line into words.
Challenges to be addressed concern local skew and noise, touching adjacent text lines, and the non-uniform spacing among text areas.
Current HTR technology will be adapted to the tranScriptorium goals. Efficient techniques for
training morphological (HMM) transcription models will be used. These models will be trained
from scratch or by adapting previously trained models. The language models needed in the HTR systems will be investigated, while a word graph associated with each line will be obtained as a byproduct of the HTR process. These word graphs can be considered metadata associated with each document.
Concerning KWS, we will follow three different paths which will be unified at the end. In
particular, we will consider not only word spotting approaches starting from handwritten documents
that have been previously segmented into words and whole text lines, but we will also consider a
segmentation-free approach. In particular, we will implement a segmentation-based and a
segmentation-free approach to the query by example task, and a segmentation-free approach to the
query by string task. The resulting ranked lists from all three independent word spotting approaches
will be treated as different modalities for which a late fusion algorithm will merge them into one
coherent final result.
2. Related work
2.1 Image Pre-processing
2.1.1 Binarization
Image binarization refers to the conversion of a gray-scale or color image into a binary image. In
document image processing, binarization is used to separate the text from the background regions
by using a threshold selection technique in order to categorize all pixels as text or non-text. It is an
important and critical stage in the document image analysis and recognition pipeline since it reduces the required image storage space, enhances the readability of text areas and allows efficient and quick further processing for page segmentation and recognition.
Document image binarization methods are usually classified in two main categories, namely global
and local. Global thresholding methods use a single threshold for the entire image, while local
methods find a local threshold based on local statistics and characteristics within a window (e.g.
mean μ, standard deviation σ and edge contrast). Well-known binarization techniques are the global
method of Otsu [Otsu 1979] and the local methods of Niblack [Niblack 1986] and Sauvola et al.
[Sauvola 2000]. In the case of document images with a bimodal histogram, Otsu yields satisfactory results but cannot effectively handle documents with degradations such as faint characters, bleed-through or uneven background. Niblack calculates for each pixel the local statistics (μ, σ) within a window and adapts the local threshold accordingly. Niblack has the advantage of detecting the text well but introduces a lot of background noise. Sauvola modified the thresholding formula of Niblack to decrease the background noise, but the text detection rate is also decreased, while bleed-through still remains in many cases.
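As an illustration of the local thresholding rules just described, the following sketch computes the Niblack and Sauvola thresholds over a square window; the window size and the parameters k and R are illustrative defaults of this sketch, not values prescribed in this deliverable.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_stats(img, w):
    """Local mean and standard deviation over a w x w window."""
    img = img.astype(np.float64)
    mean = uniform_filter(img, size=w)
    sq_mean = uniform_filter(img ** 2, size=w)
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
    return mean, std

def niblack_binarize(img, w=25, k=-0.2):
    """Niblack: T = mean + k * std (k is typically negative for dark text)."""
    mean, std = local_stats(img, w)
    return (img <= mean + k * std).astype(np.uint8)   # 1 = text (dark), 0 = background

def sauvola_binarize(img, w=25, k=0.5, R=128):
    """Sauvola: T = mean * (1 + k * (std / R - 1)), with R the dynamic range of std."""
    mean, std = local_stats(img, w)
    threshold = mean * (1 + k * (std / R - 1))
    return (img <= threshold).astype(np.uint8)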
Certain binarization methods have incorporated background estimation and removal steps (e.g.
[Gatos 2006] and [Lu 2010]), as well as local contrast computations to provide improved
binarization results (e.g. [Su 2010] and [Howe 2012]). Other binarization methods, aiming at an
increased binarization performance, proposed combination methodologies of binarization methods
(e.g. [Gatos 2008] and [Su 2011]), while water-flow based methods (e.g. [Kim 2002]), consider the
image as 3D terrain with valleys and mountains corresponding to text and background regions,
respectively. The logical level methods of Kamel et al. [Kamel 1993] and Yang et al. [Yang 2000]
are also considered reference points in binarization. Representative binarization methods of the
aforementioned categories are described in the following.
The logical level techniques of Kamel et al. [Kamel 1993] and Yang et al. [Yang 2000] use the
stroke width to set the local window size. Moreover, for the computation of the local threshold,
information from anti-diametric points is considered. The distance of those anti-diametric points is
set according to the stroke width. In the initial logical level technique of Kamel et al. [Kamel 1993],
the local threshold and the stroke width are given by the user, while in [Yang 2000], those
parameters were automatically computed.
In [Kim 2002], the original image was considered as a 3D terrain on which water was poured to fill
the valleys that represented the textual components. The final binarization result was produced by
applying [Otsu 1979] to the compensated image, i.e. the difference between the original image and
the water-filled image. In [Gatos 2006], a Wiener filter was used for pre-processing and the
background was estimated taking into account [Sauvola 2000] binarization output. The final
threshold was based on the difference between the estimated background and the preprocessed
image, while post-processing enhanced the final result. Although this method achieves better OCR
performance than Sauvola, it inherits some drawbacks of Sauvola. For instance, faint characters
cannot be successfully restored, and bleed-through remains. In the recent work of Lu et al. [Lu
2010], the background was estimated using polynomial smoothing for each row and column of the
original image. Then, the original image was normalized and Otsu was performed to detect the text
stroke edges. Furthermore, their local threshold formula was based on the local number of the
detected text stroke edges and their mean intensity. Although this method provides high overall
results, achieving the top position in the DIBCO'09 competition ([Gatos 2011]), it has certain limitations, as stated by the authors: it relies on the local contrast for the final thresholding, and hence some bleed-through noise or some high-contrast noisy background components remain.
In Su et al. [Su 2010], the authors calculated the image contrast (based on the local maximum and
minimum intensity) and binarized the contrast image using Otsu to detect the text edge pixels. They
used a local thresholding formula similar to [Lu 2010] and presented better results. The authors also
participated with a modified version of this method in the binarization contests of H-DIBCO’10
([Pratikakis 2010]) and DIBCO’11 ([Pratikakis 2011]), where they achieved 1st and 2nd rank,
respectively. This method is capable of removing the majority of the background noise and bleed-
through but it does not detect the faint characters effectively. [Howe 2012] proposed a method
based on the Laplacian of the image intensity, in which an energy function was minimized via a
graph-cut computation. It incorporates Canny edge information ([Canny 1986]) in the graph construction to encourage solutions where discontinuities align with detected edges. It is an efficient method, although it relies on several parameters. The author tuned the parameters on the DIBCO’09 set and
achieved 3rd position on H-DIBCO’10 and the same position on DIBCO’11 (including both printed
and handwritten images). However, this method could miss faint character parts, introduce
background noise or even fail in some cases.
As far as the combination methodologies are concerned, Su et al. [Su 2011] proposed a framework
for the combination of binarization methods in which pixels are classified into three categories: foreground, background and uncertain. Specifically, a pixel is considered foreground if it is a foreground pixel in all the combined binarization outputs, and the same holds for the background pixels. Uncertain pixels are classified iteratively, according to their difference (in terms of intensity and local contrast) from the already classified background and foreground pixels within a neighbourhood, until all uncertain pixels have been assigned. Local contrast is calculated according to the image intensity and the
maximum intensity. In [Su 2011], the authors presented good results on the whole dataset (both
printed and handwritten) of the DIBCO’09. Furthermore, the proposed contrast computation
augments the contrast in the faint characters but it also augments the contrast in background noisy
components and bleed-through. Another framework for combination of binarization methods was
proposed by Gatos et al. [Gatos 2008], in which foreground pixels from different binarization
methods were selected using a majority voting. Canny edges were also incorporated to enhance the
combination result by filling white runs between the Canny edges that correspond to foreground.
This method improves the results of the authors’ previous work ([Gatos 2006]), but it depends on the Canny edges, through which faint characters can be missed while background noise, including bleed-through, can be detected as text.
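The voting core of such combination frameworks is simple. The sketch below combines several binarization outputs by majority voting, in the spirit of [Gatos 2008]; the edge-based refinement step of that method (filling runs between Canny edges) is deliberately omitted, and the tie-breaking rule is an assumption of the sketch.

```python
import numpy as np

def majority_vote_combination(binary_maps):
    """Combine several binarization outputs (1 = foreground) by majority voting.
    A pixel is kept as foreground if at least half of the methods agree."""
    stack = np.stack(binary_maps, axis=0)            # shape: (n_methods, H, W)
    votes = stack.sum(axis=0)
    return (votes * 2 >= stack.shape[0]).astype(np.uint8)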
2.1.2 Border Removal
Approaches proposed for document segmentation and character recognition usually consider ideal
scanned images without noise. However, there are many factors that may generate imperfect
document images. When a page of a book is scanned, text from an adjacent page may also be
captured into the current page image. These unwanted regions are called “noisy text regions”.
Additionally, whenever a scanned page does not completely cover the scanner setup image size,
there will usually be black borders in the image. These unwanted regions are called “noisy black
borders”. Fig 2.1.2.1 shows noisy black borders as well as noisy text regions. All these problems
influence the performance of segmentation and recognition processes. When page segmentation algorithms take noisy text regions into account, the text recognition accuracy decreases, since the text recognition system usually outputs several extra characters in these regions. The goal of border
detection is to find the principal text region and to ignore the noisy text and black borders.
Figure 2.1.2.1: Example of an image with noisy black border and noisy text region.
The most common approach to eliminate marginal noise is to perform document cleaning by
filtering out connected components based on their size and aspect ratio. However, when characters
from the adjacent page are also present, they usually cannot be filtered out using only these features.
There are only a few techniques in the literature for page border detection, and they are mainly focused on printed document images.
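As a concrete illustration of the common component-filtering approach mentioned above, the following sketch removes connected components whose size or aspect ratio is atypical of text. The thresholds are hypothetical and would have to be tuned to the resolution and script of the collection.

```python
import numpy as np
from scipy.ndimage import label, find_objects

def filter_components(binary, min_height=5, max_height=200, max_aspect=20.0):
    """Keep only connected components of a binary image (1 = foreground) whose
    height and aspect ratio look like text; everything else is dropped."""
    labeled, _ = label(binary)
    cleaned = np.zeros_like(binary)
    for i, sl in enumerate(find_objects(labeled), start=1):
        if sl is None:
            continue
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        aspect = max(h, w) / max(min(h, w), 1)
        if min_height <= h <= max_height and aspect <= max_aspect:
            cleaned[sl][labeled[sl] == i] = 1
    return cleaned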
Le et al. [Le 1996] propose a method for border removal which is based on classification of blank,
textual and non-textual rows and columns, location of border objects, and an analysis of projection
profiles and crossing counts of textual squares. This approach uses several heuristics. Moreover, it
is based on the assumption that the page borders are very close to edges of images and borders are
separated from image contents by a white space. However, this assumption is often violated. Fan et
al. [Fan 2002] propose a method for removing noisy black regions overlapping the text region, but
do not consider noisy text regions. They propose a scheme to remove the black borders of scanned
documents by reducing the resolution of the document image, which makes the textual part of the
document disappear, thus leaving only blocks to be classified either as images or borders by a
threshold filter. The deletion process should be performed on the original image instead of the
image of reduced size.
In [Avila 2004] Avila et al. propose invading and non-invading border algorithms which work as
“flood–fill” algorithms. The invading algorithm assumes that the noisy black border does not invade
the black areas of the document. In the case that the document text region is merged to noisy black
borders, the whole area, including part of the text region, is flooded and removed. On the other
hand, the non-invading border algorithm assumes that the noisy black border does merge with
document information. In order to restrain flooding in the whole connected area, it takes into
account two parameters related to the nature of the documents, the maximum size of a segment
belonging to the document text as well as the maximum distance between lines. Dey et al. [Dey
2012] propose a technique for removing margin noise from printed document images. Firstly, they
perform layout analysis to detect words, lines, and paragraphs in the document image and the
detected elements are classified into text and non-text components based on their characteristics
(size, position, etc.). The geometric properties of the text blocks are then exploited to detect and remove the
margin noise. Finally, Agrawal and Doermann [Agrawal 2013] present a clutter detection and
removal algorithm for complex document images. They propose a distance transform-based
approach which aims to remove irregular and non-periodic clutter noise from binary document images, independently of the clutter’s position, size, shape and connectivity with the text.
2.1.3 Image Enhancement
Document image processing, analysis and recognition consist of many different steps. The output of each step is the input to the next step, and thus the performance of each step is important.
Consequently, the initial image quality and the quality of the very first steps (e.g. binarization) play
an important role in the whole document image recognition process.
Document images may contain degradations due to the image acquisition process, such as speckle
or salt and pepper noise, blurring, shadows, etc. as well as degradations due to the environmental
conditions (humidity), ageing or extended use. Specifically, the use of quill pens in handwritten
documents is responsible for several degradations, including faint characters, bleed-through and
large stains. Image enhancement methods aim to increase the quality of the original color or
grayscale image as well as the quality of the binarization.
The enhancement of the original color or grayscale image is usually performed by filtering the
image. Common filters used for image enhancement are the Mean, the Median and the Gaussian
filter, as well as the Wiener filter ([Gatos 2006]), the Total Variation regularization and the Non-
local means ([Barney Smith 2010]). The use of a low-pass Wiener filter assures contrast
enhancement between background and text areas, smoothing of background texture as well as the
elimination of noisy areas. Total variation regularization flattens background intensity and reduces
the background noise. Non-local means filtering can smooth character parts and improve character
quality based on neighboring data information. The main role of the aforementioned filters is to
smooth the image and hence reduce the background noise. In contrast, the use of the unsharp masking technique (Ramponi et al. [Ramponi 1996]) enhances the local contrast and the text becomes clearer. Unfortunately, any noise present is also enhanced.
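A minimal sketch of the unsharp masking idea is given below: a blurred copy is subtracted from the image and the difference is added back, which boosts local contrast (and, unavoidably, any noise). The Gaussian blur and the sigma/amount values are assumptions of this sketch rather than the exact formulation of [Ramponi 1996].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp_mask(img, sigma=2.0, amount=1.0):
    """Unsharp masking for an 8-bit grayscale image: sharpened = img + amount * (img - blurred)."""
    img = img.astype(np.float64)
    blurred = gaussian_filter(img, sigma=sigma)
    sharpened = img + amount * (img - blurred)
    return np.clip(sharpened, 0, 255).astype(np.uint8)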
Another enhancement technique is based on the background estimation. The estimated background
can be used to subtract it or divide it from the original image in an attempt to remove large stains
and smooth the background. This procedure has been used in [Gatos 2006], as well as in [Lu 2010].
Histogram equalization could also be used for image enhancement purposes; it is actually a contrast
enhancement technique. A generalized fuzzy operator pre-processed with histogram equalization
and partially overlapped sub-block histogram equalization is used from Leung et al. [Leung 2005]
for reducing background noise and increasing the readability of text by contrast enhancement. This
method achieves good performance in enhancement of extremely low contrast and low illuminated
document images. Another image enhancement method worth mentioning was presented by Drira et
al. [Drira 2011], which suggests the application of the Partial Diffusion Equation (PDE). This
method was created for restoring degraded text characters by repairing the shapes of the features as
well as extrapolating lost information.
There are also methods aiming at removing bleed-through ([Tonazzini 2010] and [Fadoua 2006]). In
[Fadoua 2006], a recursive unsupervised segmentation approach was proposed, which was applied
on the decorrelated data space by principal component analysis. This technique does not require any
specific learning process or any input parameters. The stopping criterion for the proposed recursive
approach has been determined empirically and set to a fixed number of iterations. However, bleed-
through removal techniques may also remove faint characters.
Apart from image enhancement performed on the original color or grayscale image, enhancement
could also be performed on the binarization output. In many cases, binarization methods incorporate
post-processing steps for enhancing their output. Common operations that are applied include
masks, connected component analysis, morphological operations as well as shrink and swell
operations ([Gatos 2006]).
Morphological closing, which consists of a morphological dilation followed by a morphological
erosion of the same mask, is used to restore small broken character parts. Broken characters are
usually encountered due to the existence of faint characters in the original color or grayscale image.
Unfortunately, small character holes, such as the hole of the "e", could also be filled. An example
of morphological closing is presented in Fig. 2.1.3.1. On the other hand, morphological opening,
which consists of morphological erosion followed by morphological dilation of the same mask, is
used to remove small unwanted foreground components. Those parts could be either noise of salt-
and-pepper type, or unwanted bleed-through. Unfortunately, punctuation marks (dots) could also be
removed.
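A minimal sketch of these two post-processing operations, using a 3x3 structuring element as in Fig. 2.1.3.1, could look as follows; it illustrates the general behaviour described above rather than a specific method from this deliverable.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_opening

def enhance_binary(binary):
    """Post-process a binary text image (1 = foreground) with a 3x3 structuring element:
    closing reconnects small breaks in faint strokes, opening removes isolated speckles.
    Both may also alter legitimate detail (filled holes, lost punctuation), as noted above."""
    selem = np.ones((3, 3), dtype=bool)
    closed = binary_closing(binary.astype(bool), structure=selem)
    opened = binary_opening(closed, structure=selem)
    return opened.astype(np.uint8)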
Figure 2.1.3.1: Example application of morphological closing (3x3 mask): (a) original image; (b) initial binary image; (c) the enhanced binary image after morphological closing, which has fewer holes and some reconnected components.
Another way for enhancing the binarization output is to remove components of size below or above
a certain threshold. This method requires connected component labeling, and several parameters are set heuristically. Usually, connected components smaller than a certain height or containing fewer than a certain number of pixels are considered noise and are removed. The despeckle filter falls into this category: it removes connected components whose width and height are below a certain, usually small, value. A representative example of
despeckle filtering is shown in Figure 2.1.3.2.
Figure 2.1.3.2: Example of despeckle filtering (7x7 mask): (a) initial binary image; (b) the
enhanced binary image after despeckle.
Binarization often introduces a certain amount of single-pixel holes, concavities and convexities
along the contour of the text. These single pixel defects are actually artifacts, which can be removed
by using certain logical operators that can be simply set according to their neighborhood patterns
([Lu 2010]).
2.1.4 Deskew
The detection and correction of the skew is one of the most important document image analysis
steps. Some degree of skew is unavoidable when a document is scanned manually or mechanically.
This may seriously affect the performance of subsequent stages of segmentation and recognition,
while a skew of as little as 0.1° may be apparent to a human observer. To this end, a reliable skew
estimation and correction technique has to be used for scanned documents as a pre-processing stage
in almost all document analysis and recognition systems [Sadri 2009]. In the literature, a variety of
page level skew detection techniques are available and fall broadly into the following four
categories according to the basic approach they adopt [Jung 2004]: Hough transform ([Srihari
1989], [Hinds 1990], [Yu 1996], [Wang 1997] and [Singh 2008]), nearest neighbour clustering
([Hashizume 1986], [Gorman 1993] and [Lu 2003]), interline cross correlation ([Yan 1993], [Gatos
1997], [Chou 2007], [Deya 2010] and [Alireza 2011]) and projection profile ([Postl 1986], [Baird
1987], [Ciardiello 1988], [Ishitani 1993] and [Papandreou 2011]) based methods.
The Hough Transform (HT) is a popular technique in machine vision applications. Despite its computational cost, it has been widely employed to detect lines and curves in digital images.
According to Srihari and Govindaraju [Srihari 1989], each black pixel is mapped to the Hough
space (ρ,θ) and the skew is estimated as the angle in the parameter space that gives the maximum
sum of squares of the gradient along the ρ component. In order to improve the computational
efficiency of the HT based approaches, several variants that reduce the number of points which are
mapped in the Hough space have been proposed. Hinds et al. [Hinds 1990] fuse HT and run length
encoding in order to reduce data. They create a burst image which is built by replacing each vertical
black run with its length placed in the bottom-most pixel of the run. The bin with maximum value
in the Hough space determines the skew angle. Furthermore, Wang et al. [Wang 1997] use bottom
pixels of the candidate objects within a selected region for the HT. Accordingly, document skew is
estimated by finding a local peak in Hough space. Also, Yu and Jain in [Yu 1996] fuse hierarchical
HT and centroids of connected components. They first compute the connected components and their
centroids and then the HT is applied to the centroids using two angular resolutions. Finally, Singh et
al. [Singh 2008] introduce a block adjacency graph (BAG) in a pre-processing stage, and document
skew is calculated based on voting using the HT. As mentioned by the authors, the technique is language dependent. Researchers have proposed different strategies to reduce the amount of input data, but the computational cost of the HT is still very high. Furthermore, prior to applying the HT, it is a prerequisite to extract the text regions of the document images. However, this is very hard since layouts may be arbitrary and complex.
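For illustration, a much simplified Hough-style skew estimator might look like the sketch below: each foreground pixel votes in a (ρ, θ) accumulator, and the angle whose accumulator column is most sharply peaked is returned. The peak criterion (sum of squared bin counts) and the search range are assumptions of this sketch, not a reproduction of any of the cited methods.

```python
import numpy as np

def hough_skew(binary, angle_range=15.0, angle_step=0.5):
    """Estimate the page skew (degrees) of a binary image (1 = text) by voting each
    foreground pixel into a rho-histogram per candidate angle and keeping the angle
    whose histogram is most sharply peaked (text lines collapse onto few rho bins)."""
    ys, xs = np.nonzero(binary)
    thetas = np.deg2rad(np.arange(-angle_range, angle_range + angle_step, angle_step))
    n_bins = int(np.hypot(*binary.shape)) + 1
    scores = np.empty(len(thetas))
    for i, t in enumerate(thetas):
        rho = ys * np.cos(t) - xs * np.sin(t)           # signed distance along the normal
        hist, _ = np.histogram(rho, bins=n_bins)
        scores[i] = np.sum(hist.astype(np.float64) ** 2)
    return np.degrees(thetas[np.argmax(scores)])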
Concerning nearest neighbour approaches, spatial relationships and mutual distances of connected
components are used to estimate the page skew. In the work of Hashizume et al. [Hashizume 1986], the direction vectors of all nearest-neighbour pairs of connected components are accumulated in a histogram, and the peak in the histogram gives the dominant skew. This method is generalized by
O'Gorman [Gorman 1993] where nearest-neighbour clustering uses the K-neighbours for each
connected component. However, it is accurate up to a resolution of 0.5°. Lu and Tan [Lu 2003]
propose an improved nearest-neighbour chain based approach for document skew estimation with
high accuracy and language-independent capability. The methods of this class aim at exploiting the
general assumptions that characters in a line are aligned and close to each other. They are
characterized by a bottom up process which starts from a set of objects and utilizes their mutual
distances and spatial relationships to estimate the document skew. Few researchers prefer nearest neighbour clustering techniques for document skew removal, since text lines can be grouped in any direction and connections with noisy subparts of characters reduce the accuracy of the method. Furthermore, skew detection based on nearest neighbour clustering methods is rather slow due to the component labelling phase, which is performed bottom-up and has quadratic time complexity O(n²).
The interline cross-correlation approaches are based on the assumption that de-skewed images
present a homogeneous horizontal structure. Therefore, such approaches aim at estimating the
document skew by measuring vertical and horizontal deviations among foreground pixels along the
document image. Cross-correlation between lines at a fixed distance is used by Yan [Yan 1993].
This method is based on the observation that the correlation between two vertical lines of a skewed
document image is maximized if one line is shifted relatively to the other in such a way that the
character base line levels for the two lines are coincident. The accumulated correlation for many
pairs of lines can provide a precise estimation of the page skew. In [Gatos 1997], Gatos et al.
proposed that the image should first be smoothed by a horizontal run-length algorithm and then,
only the information existing in two or more vertical lines is to be used for skew detection. Chou et
al. in [Chou 2007] proposed piece-wise coverings of objects by parallelograms (PCP) for estimating
the skew angle of a scanned document. A document image is initially split into a number of non-
overlapping slabs, and subsequently parallelograms at various angles cover the components in each
slab. The angle at which maximum white space is obtained indicates the estimated skew angle of
the document. In [Deya 2010], Deya and Noushath have enhanced PCP (e-PCP) with a criterion to
classify a document as a landscape or portrait mode, while they have introduced a confidence
measure for filtering the estimated skew angles that may not be reliable. Finally, Alireza et al.
[Alireza 2011], proposed an algorithm which applies piece-wise painting on both directions of a
document and creates two painted images. Those images provide statistical information about
regions with specific height and width. Points of those regions are grouped, a few lines are obtained
with the use of linear regression and line drawing concepts and their individual slopes are
calculated.
The traditional projection profile approach is a simple solution to detect skew angle of a document
image. It was initially proposed by Postl [Postl 1986] and is based on horizontal projection profiles.
According to this approach, a series of horizontal projection profiles are calculated at a range of
angles. The profile with maximum variation corresponds to the best alignment of the text lines. In
order to reduce high computational cost, several variations of this basic method have been
proposed. Papandreou and Gatos have proposed an improvement of the traditional horizontal
projection profile approach by combining it with the minimization of the bounding box technique
[Papandreou 2011]. According to this approach, the maximum variations are divided by the
bounding box area that includes the text. This amount is minimized when the document is
horizontally aligned. Baird [Baird 1987] proposed a technique for selecting the points to be projected: for each connected component, only the midpoint of the bottom side of the bounding box is projected. The objective function is the sum of the squares of the profile values.
Ciardiello et al. [Ciardiello 1988] calculate projections on selected sub-regions, with high density of
black pixels per row, of the document image. The algorithm detects the skew angle by maximizing
the mean square deviation of the profile. Ishitani [Ishitani 1993] uses a profile which is defined in a
different style. A cluster of parallel lines on the image is selected and the bins of the profile store the
number of black to white transitions along the lines. Finally, Papandreou and Gatos in [Papandreou
2011] have proposed a vertical projection profile approach, taking advantage of the alignment of
vertical strokes of the Latin alphabet when the document is deskewed, which proved to be more efficient in the presence of noise and warp.
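A minimal sketch of the basic projection-profile criterion described above is given below: the page is rotated over a small range of candidate angles and the angle that maximises the variation of the horizontal profile is kept. The variance criterion and the angular range are assumptions of the sketch, standing in for the several variants discussed above.

```python
import numpy as np
from scipy.ndimage import rotate

def projection_profile_skew(binary, angle_range=5.0, angle_step=0.25):
    """Return the rotation angle (degrees) that best aligns the text lines of a binary
    page (1 = text) horizontally, i.e. the angle maximising the variance of the
    horizontal projection profile (row sums of foreground pixels)."""
    best_angle, best_score = 0.0, -1.0
    page = binary.astype(np.uint8)
    for angle in np.arange(-angle_range, angle_range + angle_step, angle_step):
        rotated = rotate(page, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1).astype(np.float64)
        score = profile.var()
        if score > best_score:
            best_score, best_angle = score, angle
    return best_angle
```

Correcting the page then amounts to rotating it by the returned angle.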
For skew estimation at the text line level, the Hough transform based and projection profile based methods are used. The performance of these methods is slightly reduced, since there is much less information to exploit for the estimation, although the results are still acceptable. Additionally, Madhvanath et al. in [Madhvanath 1999] used the image contour to detect the line's minima and determine the baseline as the regression line through those minima. The inclination of the regression line is regarded as the skew of the line.
As far as skew estimation at the word level is concerned, several methods have been proposed in the
literature. The regression approach of Madhvanath et al. [Madhvanath 1999] applies also to words,
with significantly lower performance especially in words with great skew angles. Morita et al. in
[Morita 1999] proposed a method based on mathematical morphology to obtain a pseudo-convex
hull image. Then, the minima are detected on the pseudo-convex image, a reference line is fit
through those points and the inclination of this line is considered to be the word skew. The primary
challenge in these two methods is the rejection of spurious minima. Also, the regression-based
methods do not work well on words of short length because of the lack of a sufficient number of
minima points. Other approaches for handwritten word skew detection are based on density
distribution. In [Cote 1998], several histograms are computed for different vertical projections.
Then, the entropy is calculated for each of them. The histogram with the lowest entropy determines
the word skew angle. In [Kavallieratou 1999], Kavallieratou et al. calculate the Wigner-Ville
distribution for several horizontal projection histograms. The word skew angle is selected by the
Wigner-Ville distribution with the maximal intensity. The main problem for these distribution based
methods is the high computational cost, since the image has to be rotated for each angle. Jian-xiong Dong et al. proposed in [Dong 2005] to maximize a global measure defined by the Radon transform of the word image, calculating its gradient in order to estimate the word skew.
Furthermore, Blumenstein et al. in [Blumenstein 2002] identify the skew by detecting the center of
mass in each half (right and left) of a word image. The skew angle is estimated by hypothesizing a
line between the two centers and by measuring its angle with the x-axis. Finally, in [Papandreou 2013a], Papandreou and Gatos propose a coarse-to-fine method that is based on the detection of the core-region and the calculation of the centers of mass of the right and left overlapping parts of the word image. Detecting the corresponding center of mass in each part and calculating the inclination of the line that connects them gives a rough estimation of the skew. After the skew is roughly detected and corrected, the core-region of the word is detected and the same procedure is repeated using only the information inside the core-region, providing a finer estimation of the skew angle.
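The centre-of-mass idea behind these last two methods can be sketched as follows; the sketch implements only the rough, whole-word step and leaves out the core-region refinement of [Papandreou 2013a].

```python
import numpy as np

def word_skew_by_centroids(word_binary):
    """Rough word-skew estimate: split the word image (1 = ink) into left and right
    halves, compute the centre of mass of the ink in each half, and return the angle
    (degrees) of the connecting line with respect to the x-axis."""
    h, w = word_binary.shape
    centers = []
    for offset, half in ((0, word_binary[:, : w // 2]), (w // 2, word_binary[:, w // 2 :])):
        ys, xs = np.nonzero(half)
        if len(xs) == 0:
            return 0.0                                  # not enough ink to estimate a skew
        centers.append((xs.mean() + offset, ys.mean()))
    (x1, y1), (x2, y2) = centers
    # the image y-axis points downwards, so negate dy to get a conventional angle
    return np.degrees(np.arctan2(-(y2 - y1), x2 - x1))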
2.1.5 Deslant
A crucial preprocessing step in order to improve segmentation and recognition accuracy of
handwritten words is the estimation and correction of word slant. By the term “word slant” we refer
to the average angle in degrees clockwise from vertical at which the characters are drawn in a word.
Word slant estimation can be very helpful in handwritten text processing. Knowledge of the slant value facilitates further stages of processing and recognition. Moreover, the character slant is thought to be very important information which can help to identify the writer of a text.
Slant estimation methodologies proposed in the literature fall broadly into the following categories.
The average slant angle can be estimated by averaging the angles of near-vertical strokes [Bozinovic 1989], [Kim 1997], [Papandreou 2012] and [Papandreou 2014], by analyzing projection histograms
[Vinciarelli 2001] and [Kavallieratou 2001] and by using statistics of chain code contours [Kimura
1993], [Ding 2000] and [Ding 2004].
Concerning near-vertical stroke techniques, according to Bozinovic and Srihari [Bozinovic 1989],
for a given word all horizontal lines which contain at least one run of length greater than a
parameter M (depending on the width of the letters) are removed. Additionally, all horizontal strips
of small height are removed. By deleting these horizontal lines, only orthogonal window parts of
the picture remain in the text. For each letter the parts that remain are those that have relatively
small horizontal slant. For each of these parts, the angle of the line defined by the centers of gravity of its upper and lower halves is estimated, and the mean of these angles gives the overall character slant of the text. In [Papandreou 2012], Papandreou and Gatos propose an extension of the method of Bozinovic and Srihari, integrating information on the location and the height of the remaining parts. The main differences between the proposed technique and the method presented in
[Bozinovic 1989] are the automatic parameter setting and the weights on the contribution of the
character fragments based on their height and position relative to the core-region. Furthermore,
Papandreou and Gatos in [Papandreou 2014] propose a further improved method that incorporates a
new technique for core-region detection, has more accurate parameter setting and reduces the
involved parameters. In the approach of Kim and Govindaraju [Kim 1997] the whole page is
considered as input and the global slant is estimated by averaging over all lines. Vertical and near-
vertical lines are extracted with a chain code representation of the word contour using a pair of one
dimensional filters. Coordinates of the start and end points of each vertical line extracted provide
the slant angle. The global slant angle is the average of the angles of all the lines, weighted by their vertical extent, since a longer line gives a more accurate angle than a shorter one.
Regarding the analysis of projection histograms, Vinciarelli and Luettin [Vinciarelli 2001] have proposed a
deslanting technique based on the hypothesis that the word has no slant when the number of
columns containing a continuous stroke is maximum. On the other hand, Kavallieratou et al.
[Kavallieratou 2001] have proposed a slant removal algorithm based on the use of the vertical
projection profile of word images and the Wigner-Ville distribution (WVD). The word image is
artificially slanted and for each of the extracted word images, the vertical histogram as well as the
WVD of these histograms is calculated. The curve of maximum intensity of the WVDs corresponds
to the histogram with the most intense alternations and as a result to the dominant word slant.
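The projection-based deslanting idea can be sketched as follows: the word image is sheared over a range of candidate angles, and the angle maximising the number of columns containing a single continuous vertical stroke is retained, in the spirit of the hypothesis of [Vinciarelli 2001]. The shear implementation, the angular range and the scoring details are assumptions of this sketch, not the exact criteria of the cited methods.

```python
import numpy as np

def shear_image(binary, slant_deg):
    """Shear a binary word image horizontally, shifting each row in proportion to its
    distance from the bottom row (padding is added so nothing is clipped)."""
    h, w = binary.shape
    shift = np.tan(np.deg2rad(slant_deg))
    pad = int(np.ceil(abs(shift) * h)) + 1
    out = np.zeros((h, w + 2 * pad), dtype=binary.dtype)
    for y in range(h):
        dx = int(round((h - 1 - y) * shift))
        out[y, pad + dx : pad + dx + w] = binary[y]
    return out

def estimate_deslant_angle(binary, max_slant=45, step=1):
    """Return the shear angle (degrees) that best de-slants the word: the angle at
    which the largest number of columns consists of one unbroken vertical run."""
    def continuous_columns(img):
        count = 0
        for col in img.T:
            ys = np.nonzero(col)[0]
            if len(ys) > 0 and ys[-1] - ys[0] + 1 == len(ys):
                count += 1
        return count
    angles = range(-max_slant, max_slant + 1, step)
    scores = [continuous_columns(shear_image(binary, a)) for a in angles]
    return angles[int(np.argmax(scores))]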
Techniques based on statistics of chain code contours can estimate average slant of a handwritten
word by using the 4-directional chain code histogram of border pixels (Kimura et al. [Kimura
1993]). This method tends to underestimate the slant when its absolute value is close to or greater than 45°. To solve this problem, Ding et al. [Ding 2000] proposed a slant detection method using an 8-
directional chain code. Chain code methods of 12 and 16 directions are also examined in [Ding
2004] but the experimental results show that these methods tend to overestimate the slant. In [Ding
1997], new methods are proposed for evaluating and improving the linearity and accuracy of slant
estimation based on chain contours.
Additionally, a slant estimation method for handwritten characters by means of Zernike moments
has been proposed by Ballesteros et al. [Ballesteros 2005], which is based on the average inclination
of the Zernike reconstructed images for low moments. It is claimed that this method improves the
slant estimation accuracy in comparison with the chain code based methods.
2.2 Image Segmentation
2.2.1 Detection of main page elements
Detection of main text zones
A reading system requires the detection of main page elements as well as the discrimination of text
zones from non-textual ones. Historical handwritten documents do not have strict layout rules and
thus, a layout analysis method needs to be invariant to layout inconsistencies, irregularities in script
and writing style, skew, fluctuating text lines, and variable shapes of decorative entities. Most of the
state-of-the-art methodologies focus on machine-printed or modern handwritten documents and
only a few deal with historical handwritten documents. Representative page segmentation methods
can be found in [Shafait 2008a].
Among existing approaches to page segmentation for historical handwritten documents we mention:
- [Malleron 2009], which focuses on the layout analysis of 19th century handwritten documents.
First, text line detection is modelled as an image segmentation problem by enhancing text line
structure using Hough transform and a clustering of connected components so as to make text line
boundaries appear. Then, snippets decomposition is performed for page layout analysis. This is
based on a first step of content pages classification in five visual and genetic taxonomies, and a
second step of text line extraction and snippets decomposition. Snippets are built by linking
centroids of borders components (see Fig. 2.2.1.1).
Figure 2.2.1.1: Building snippets for page segmentation
- [Bulacu 2007] proposed a method for layout analysis of handwritten historical documents in order
to enable searching the archive of the Cabinet of the Dutch Queen (1798 – 1988). It is based on the
detection of the rule lines of the tables and the page margins. It also uses contour tracing that
generates curvilinear separation paths between text lines in order to preserve the ascenders and
descenders. A layout analysis result of this method is shown in Fig. 2.2.1.2.
Figure 2.2.1.2: An example of layout analysis result of method [Bulacu 2007]
- [Bukhari 2012] on layout analysis of Arabic historical document images using machine learning. Simple and discriminative features are extracted at the connected-component level, and subsequently
robust feature vectors are generated. A multi-layer perceptron classifier is then exploited to classify
connected components to the relevant class of text.
- [Nicolas 2006] on complex handwritten page segmentation using contextual models. Stochastic
and contextual models are used in order to cope with local spatial variability, while some prior
knowledge about the global structure of the document image is taken into account. Two tasks are
defined: a) to label the main regions of the manuscripts such as text body, margins, header, footer,
page number and marginal annotations, b) to detect pseudo words, deletions, diacritics and
background.
Page Split
Document images are usually produced by scanning books or periodicals. Scanning two pages at
the same time is a very common practice as it helps to accelerate the scanning process. However, it
may affect the performance of subsequent processing such as document analysis and optical
character recognition (OCR) since the majority of approaches are able to process only single page
images. Furthermore, another drawback of scanning two pages at the same time is the appearance of
noisy black borders around text areas as well as of noisy black stripes between the two pages (Fig.
2.2.1.3).
Figure 2.2.1.3: Example of a double page document image.
There are only a few techniques in the literature that address the problem of page splitting. According to these approaches, double page documents are split in the middle after border removal or by
defining the coordinates of the print space. Yacoub et al. [Yacoub 2005] involve a splitting process
as a pre-processing stage for the conversion of a large collection of complex documents and
deployment for online web access to its information-rich content. First, the double page images are
cropped to the border of the page, using several criteria, which include searching for horizontal and
vertical projections of straight page edges and checking the computed dimension versus a database
of magazine sizes sampled throughout history. Once the double-pages are cropped, the centrefold of
the double-page is identified and two individual pages are generated using a splitting operation.
Furthermore, a page splitting approach is proposed in the frame of the Metadata Engine Project
(http://meta-e.aib.uni-linz.ac.at/). This method is based on the assumption that books are printed
according to a clearly defined printing space. Only a limited number of special elements may appear
in the surrounding margins. Therefore, the method determines the coordinates of the print space
used in the given document and then applies this zone to the actual page image. Next, a virtual
margin is added around the printing space and the pages are cut. If a document contains
supplements that do not conform to the default print space, such as maps, tables, graphs, and
pictures, this variation is detected automatically.
2.2.2 Text Line Detection
One of the early tasks in a handwriting recognition system is the segmentation of a handwritten document image into text lines, defined as the process of determining the region of every text line on a document image. The overall performance of a handwritten character recognition system
strongly relies on the results of the text line segmentation process. If the quality of the results
produced by the stage of text line segmentation is poor, this will affect the accuracy of the text
recognition procedure. Thus, the algorithms employed for these two stages are critical for the
overall recognition procedure.
We can group existing text line segmentation methods into four basic categories: methods making use of the
projection profiles, methods that are based on the Hough transform, smearing methods and, finally,
methods based on the principle of dynamic programming. Also, several methods exist that cannot
be clearly classified in a specific category, since they employ particular techniques.
Methods that make use of projection profiles include [Bruzzone 1999] and [Arivazhagan 2007]. In [Bruzzone 1999], the initial image is partitioned into vertical strips. For each vertical strip, the histogram of horizontal runs is calculated. This technique assumes that the text lines appearing in a single strip are almost parallel to each other. Arivazhagan et al. [Arivazhagan 2007] partition the initial
image into vertical strips called chunks. The projection profile of every chunk is calculated. The
first candidate lines are extracted among the first chunks. These lines traverse around any
obstructing handwritten connected component by associating it to the text line above or below. This
decision is made by either (i) modeling the text lines as bivariate Gaussian densities and evaluating
the probability of the component for each Gaussian or (ii) the probability obtained from a distance
metric.
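A minimal sketch of the basic idea shared by this family of methods is given below: the binary page is divided into vertical strips, a horizontal projection is computed per strip and the valleys of the smoothed profile are taken as candidate line separators. The number of strips, the smoothing window and the valley criterion are illustrative choices, not those of [Bruzzone 1999] or [Arivazhagan 2007].

```python
import numpy as np

def strip_line_separators(binary, n_strips=4, smooth=15):
    """Return, for each vertical strip, the y-positions of projection valleys that
    can serve as candidate text line separators.
    `binary` is a 2D array with 1 for ink and 0 for background."""
    h, w = binary.shape
    separators = []
    for s in range(n_strips):
        strip = binary[:, s * w // n_strips:(s + 1) * w // n_strips]
        profile = strip.sum(axis=1).astype(float)            # horizontal projection
        kernel = np.ones(smooth) / smooth
        profile = np.convolve(profile, kernel, mode="same")   # smooth the profile
        # valleys: local minima of the smoothed profile with (near) zero ink
        valleys = [y for y in range(1, h - 1)
                   if profile[y] <= profile[y - 1] and profile[y] <= profile[y + 1]
                   and profile[y] < 0.1 * profile.max()]
        separators.append(valleys)
    return separators
```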
Methods that make use of the Hough transform include [Fletcher 1988], [Likforman 1995],
[Louloudis 2008] and [Pu 1998]. The Hough transform is a powerful tool used in many areas of
document analysis that is able to locate skewed lines of text. By starting from some points of the
initial image, the method extracts the lines that fit best to these points. The points considered in the
voting procedure of the Hough transform are usually either the gravity centers [Fletcher 1988],
[Likforman 1995], [Louloudis 2008] or minima points [Pu 1998] of the connected components.
In more detail, Likforman [Likforman 1995] developed a method based on a hypothesis – validation
scheme. Potential alignments are hypothesized in the Hough domain and validated in the image
domain. The centroids of the connected components are the units for the Hough transform. A set of
aligned units in the image along a line with parameters (ρ, θ) is included in the corresponding cell
(ρ, θ) of the Hough domain. Alignments including many units correspond to highly peaked cells of
the Hough domain.
A recent method using the Hough transform was proposed by Louloudis et al. [Louloudis 2008].
The main contributions of the approach are a) the partitioning of the connected component space into
three distinct spatial sub-domains (small, normal and large), of which only the normal connected
components are used in the Hough transform step, b) a block-based Hough transform step for the
detection of potential text lines and c) a post-processing step for the detection of text lines that the
Hough transform did not reveal, as well as for the separation of vertically connected parts of adjacent text lines.
The Hough transform can also be applied to fluctuating lines of handwritten drafts such as in [Pu
1998]. The Hough transform is first applied to minima points (units) in a vertical strip on the left of
the image. The alignments in the Hough domain are searched starting from a main direction, by
grouping cells in an exhaustive search in six directions. Then a moving window, associated with a
clustering scheme in the image domain, assigns the remaining units to alignments. The clustering
scheme (Natural Learning Algorithm) allows the creation of new lines starting in the middle of the
page.
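The hypothesis step shared by the Hough-based methods above can be sketched as follows: the centroids of the connected components vote in a (ρ, θ) accumulator and highly voted cells are kept as candidate text line alignments. The angle range, ρ resolution and vote threshold below are illustrative assumptions rather than the parameters of any of the cited methods.

```python
import numpy as np
from scipy import ndimage

def hough_line_hypotheses(binary, theta_deg=np.arange(80, 101), rho_res=5, min_votes=5):
    """Accumulate connected-component centroids in a (rho, theta) Hough space and
    return the highly voted cells as candidate text line alignments."""
    labels, n = ndimage.label(binary)
    centroids = ndimage.center_of_mass(binary, labels, range(1, n + 1))  # (y, x) pairs
    thetas = np.deg2rad(theta_deg)                        # near-horizontal alignments
    diag = int(np.hypot(*binary.shape))
    acc = np.zeros((2 * diag // rho_res + 1, len(thetas)), dtype=int)
    for y, x in centroids:
        rho = x * np.cos(thetas) + y * np.sin(thetas)     # one sinusoid per centroid
        idx = np.round((rho + diag) / rho_res).astype(int)
        acc[idx, np.arange(len(thetas))] += 1
    peaks = np.argwhere(acc >= min_votes)
    return [((r * rho_res) - diag, float(np.rad2deg(thetas[t]))) for r, t in peaks]
```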
Smearing methods mainly include the fuzzy RLSA [Shi 2004] and the adaptive RLSA [Makridis
2007]. The fuzzy RLSA measure is calculated for every pixel on the initial image and describes
“how far one can see when standing at a pixel along horizontal direction”. By applying this
measure, a new grayscale image is created, which is binarized and the lines of text are extracted
from the new image. The adaptive RLSA [Makridis 2007] is an extension of the classical RLSA, in
the sense that additional smoothing constraints are set in regard to the geometrical properties of
neighboring connected components. The replacement of background pixels with foreground pixels
is performed when these constraints are satisfied.
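Both the fuzzy and the adaptive variants build on the classical run-length smoothing idea, which can be sketched as follows: horizontal background runs shorter than a threshold between two foreground pixels are filled with foreground. The threshold value below is an illustrative assumption; the adaptive RLSA would instead derive such constraints from the geometry of the neighboring components.

```python
import numpy as np

def horizontal_rlsa(binary, max_gap=30):
    """Classical horizontal run-length smoothing: background runs shorter than
    `max_gap` pixels between two foreground pixels are filled with foreground."""
    out = binary.copy()
    for row in out:
        fg = np.flatnonzero(row)           # indices of foreground pixels in this row
        for a, b in zip(fg[:-1], fg[1:]):
            if 1 < b - a <= max_gap:       # short background run between two ink pixels
                row[a + 1:b] = 1
    return out
```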
Text line segmentation methods based on the dynamic programming principle were recently
presented [Nicolaou 2009], [Saabni 2014]. They try to segment text lines by finding an optimal path
on the background of the document image travelling from the left to the right edge. The approach of
Nicolaou et al. [Nicolaou 2009] is based on the topological assumption that for each text line, a path
exists from one side of the image to the other which traverses a single text line. The image is first
blurred and, at a second step, tracers are used to follow the white-most and black-most paths from
left to right as well as from right to left. The final goal is to shred the image into text line areas.
Saabni et al. [Saabni 2014] propose a method which computes an energy map of the text image and
determines the seams that pass across and between text lines. Two different algorithms are
described (one for binary and one for grayscale images). In the first algorithm (binary case), each
seam passes through the middle of a text line, along its length, and marks the components that make
up its letters and words. At a final step, the unmarked components are assigned to the closest text
line. In the second algorithm (grayscale case), the seams are calculated on the distance transform of
the grayscale image.
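The common dynamic-programming core of these methods can be illustrated by the following sketch, which finds a left-to-right path of minimum accumulated ink on a blurred version of the image; such paths tend to run between text lines. The blurring and the simple ink-based energy are assumptions of this sketch and not the exact energy maps of [Nicolaou 2009] or [Saabni 2014].

```python
import numpy as np
from scipy import ndimage

def min_ink_path(gray):
    """Dynamic programming: cheapest left-to-right path through a blurred ink map;
    such paths tend to run along the background between two text lines.
    Returns one row index per column."""
    ink = ndimage.gaussian_filter(255.0 - gray.astype(float), sigma=5)  # dark pixels are costly
    h, w = ink.shape
    cost = np.empty((h, w))
    back = np.zeros((h, w), dtype=int)
    cost[:, 0] = ink[:, 0]
    for x in range(1, w):
        prev = cost[:, x - 1]
        up = np.r_[np.inf, prev[:-1]]        # predecessor one row above
        same = prev                           # predecessor in the same row
        down = np.r_[prev[1:], np.inf]        # predecessor one row below
        choices = np.vstack([up, same, down])
        best = choices.argmin(axis=0)
        cost[:, x] = ink[:, x] + choices[best, np.arange(h)]
        back[:, x] = best - 1                 # row offset (-1, 0, +1) of the chosen predecessor
    path = np.empty(w, dtype=int)
    path[-1] = int(cost[:, -1].argmin())      # backtrack from the cheapest end point
    for x in range(w - 1, 0, -1):
        path[x - 1] = path[x] + back[path[x], x]
    return path
```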
Other methodologies include the work of Nicolas et al. [Nicolas 2004]. In this work, the text line
extraction problem is seen from an Artificial Intelligence perspective. The aim is to cluster the
connected components of the document into homogeneous sets that correspond to the text lines of
the document. To solve this problem, a search over the graph that is defined by the connected
components as vertices and the distances among them as edges is applied. A recent paper [Shi
2005] makes use of the Adaptive Local Connectivity Map: the input to the method is a grayscale
image, and a new image is calculated by summing the intensities of each pixel’s neighbors in the
horizontal direction. Since the new image is also a grayscale image, a thresholding technique is
applied and the connected components are grouped into location maps by using a grouping method.
In [Kennard 2006], the method to segment text lines uses the count of foreground/background
transitions in a binarized image to determine areas of the document that are likely to be text lines.
Also, a min-cut/max-flow graph cut algorithm is used to split up text areas that appear to encompass
more than one line of text. A merging of text lines containing relatively little text information to
nearby text lines is then applied. Yi Li [Li 2008] describes a technique that models text line
detection as an image segmentation problem by enhancing text line structures using a Gaussian
window and adopting the level set method to evolve text line boundaries. The method described in
[Lemaitre 2006] is based on a notion of perceptive vision: at a certain distance, text lines can be
seen as line segments. This method is based on the theory of Kalman filtering to detect text lines on
low resolution images. Weliwitage et al. [Weliwitage 2005] describe a method that involves cut text
minimization for segmentation of text lines from handwritten English documents. In doing so, an
optimization technique is applied which varies the cutting angle and start location to minimize the
text pixels cut while tracking between two text lines. In [Basu 2007], a text line extraction technique
is presented for multi-skewed documents of handwritten English or Bengali text. It assumes a
hypothetical water flow from both the left and right sides of the image frame, which faces obstruction
from the characters of the text lines. The strips of areas left unwetted on the image frame are finally
labeled for the extraction of text lines. In [Zahour 2007], a text line segmentation method for printed or
handwritten historical Arabic documents is presented. Documents are first classified into two classes
using a K-means scheme. These classes correspond to document complexity (easy or not easy to segment). A
document which includes overlapping and touching characters is divided into vertical strips. The
extracted text blocks obtained by the horizontal projection are classified into three categories: small,
average and large text blocks. After segmenting the large text blocks, the lines are obtained by
matching adjacent blocks within two successive strips using spatial relationships. Documents
without overlapping or touching characters are segmented without resorting to the large-block
segmentation module. The authors claim 96% accuracy on a collection of 100 historical
documents. Yin [Yin 2007] proposes an approach which is based on minimum spanning tree (MST)
clustering with new distance measures. First, the connected components of the document image are
grouped into a tree by MST clustering with a new distance measure. The edges of the tree are then
dynamically cut to form text lines by using a new objective function for finding the number of
clusters. This approach is totally parameter-free and can apply to various documents with multi-
skewed and curved lines. Stamatopoulos et al. [Stamatopoulos 2009] present a combination method
of different segmentation techniques. The goal is to exploit the segmentation results of
complementary techniques and specific features of the initial image so as to generate improved
segmentation results. Roy et al. [Roy 2008] propose a technique that is based on morphological
operations and run-length smearing. At first, RLSA is applied to get an individual word as a
component. Next, the foreground portion of this smoothed image is eroded to get some seed
components from the individual words of the document. Erosion is also done on background
portions to find some boundary information of text lines. Finally, using the positional information of
the seed components and the boundary information, the lines are segmented. Finally, Du et al. [Du
2008] propose a method which is based on the Mumford–Shah model. The algorithm is claimed to
be script independent. In addition, morphing is used to remove overlaps between neighboring text
lines and connect broken ones. For a more detailed description of text line segmentation
methodologies, the interested reader is referred to [Likforman 2007].
2.2.3 Word Detection
Word segmentation refers to the process of defining the word regions of a text line. Algorithms
dealing with word segmentation in the literature are based primarily on an analysis of the geometric
relationships of adjacent components. Components are either connected components or overlapped
components. An overlapped component is defined as a set of connected components whose
projection profiles overlap in the vertical direction. Related work for the problem of word
segmentation differs in two aspects. The first aspect is the way the distance of adjacent components
is calculated, while the second aspect concerns the approach used to classify the previously
calculated distances as either between-word gaps or within-word gaps. Most of the methodologies
described in the literature have a preprocessing stage which includes noise removal, skew and slant
correction.
Many distance metrics are defined in the literature. Seni et al. [Seni 1994] presented eight different
distance metrics. These include the bounding box distance, the minimum and average run-length
distance, the Euclidean distance and different combinations of them which depend on several
heuristics. A thorough evaluation of the proposed metrics was described (see examples of the
metrics in Figure 2.2.3.1).
Figure 2.2.3.1: Example of types of distance metrics between a pair of connected components. (a)
original image (b) bounding boxes; the horizontal bounding box distance is the horizontal distance
between the boxes (c) Euclidean distance; the arrow indicates the distance between the closest
points in each component. (d) run-length distance; arrows show some run-lengths between the
components. Different distance algorithms could use the average or minimum of the run-lengths.
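Two of the simpler metrics of [Seni 1994] can be illustrated with the following sketch, which computes the horizontal bounding box distance and the Euclidean distance between the closest points of two adjacent components (both given as arrays of pixel coordinates); the run-length and combined metrics are omitted.

```python
import numpy as np

def bounding_box_distance(comp_a, comp_b):
    """Horizontal gap between the bounding boxes of two components.
    Each component is an (N, 2) array of (y, x) foreground pixel coordinates;
    comp_a is assumed to lie to the left of comp_b."""
    return max(0, comp_b[:, 1].min() - comp_a[:, 1].max())

def euclidean_distance(comp_a, comp_b):
    """Distance between the closest pair of pixels of the two components."""
    diff = comp_a[:, None, :] - comp_b[None, :, :]        # pairwise coordinate differences
    return float(np.sqrt((diff ** 2).sum(axis=2)).min())
```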
A different distance metric, called the convex hull-based metric, was defined by Mahadevan
[Mahadevan 1995] (see Figure 2.2.3.2). The author, after comparing this metric with some of the
metrics of [Seni 1994], concludes that the convex hull-based metric performs better than the other
ones. Kim et al. [Kim 2001] investigated the problem of word segmentation in handwritten Korean
text lines. To this end, they used three well-known metrics in their experiments: the bounding box
distance, the run-length/Euclidean distance and the convex hull-based distance. For the
classification of the distances, the authors considered three clustering techniques: the average
linkage method, the modified Max method and the sequential clustering. Their experimental results
showed that the best performance was obtained by the sequential clustering technique using all
three gap metrics. Varga and Bunke [Varga 2005] tried to extend classical word extraction
techniques by incorporating a tree structure. Since classical word segmentation techniques depend
solely on a single threshold value, they tried to improve the existing approach by letting the decision
about a gap be taken not only in terms of a threshold, but also in terms of its context, i.e.
considering the relative sizes of the surrounding gaps. Experiments conducted with different gap
metrics as well as threshold types showed that their methodology yielded improvements over
conventional word extraction methods.
Figure 2.2.3.2: Three examples demonstrating the convex-hull based distance metric.
In all the aforementioned methodologies, the gap classification threshold used is derived: (i) from the
processing of the calculated distances, (ii) from the processing of the whole text line image or (iii)
after the application of a clustering technique over the estimated distances. There also exist
methodologies in the literature that make use of classifiers for the final decision of whether a gap is
a between-word gap or a within-word gap [Kim 1998], [Huang 2008], [Luthy 2007]. An early
published work making use of classifiers for the word segmentation problem is the work of Kim
and Govindaraju [Kim 1998]. A similar approach was presented in [Huang 2008] by Huang and
Srihari. This approach claimed two differences from previous methods: (i) the gap metric was
computed by combining three different distance measures, which avoided the weaknesses of the
individual measures and thus provided a more reliable distance measure, and (ii) besides the local
features, such as the current gap, a new set of global features were also extracted to help the
classifier make a better decision. Finally, a different approach was presented by Luthy et al. [Luthy
2007]. The problem of segmenting a text line into words was considered as a text line recognition
task, adapted to the characteristics of segmentation. That is, at a certain position of a text line, it had
to be decided whether the considered position belonged to a letter of a word, or to a space between
two words. For this purpose, three different recognizers based on Hidden Markov Models were
designed, and results from writer-dependent as well as writer-independent experiments were
reported.
An analytic evaluation of both steps of the word segmentation stage, namely the distance
calculation and the gap classification step is presented in the work of Louloudis et al. [Louloudis
2009a]. According to the proposed evaluation methodology the distance computation and the gap
classification stages are treated independently. The detection rate calculated for every distance
metric corresponds to the maximum detection rate which could have been achieved if the perfect
classifier for the gap classification stage existed. The proposed evaluation framework was applied to
several state-of-the-art techniques using a handwritten as well as a historical typewritten document
set, and the best combination of distance metric computation and gap classification state-of-the-art
techniques was proposed. According to this evaluation method, the combination of the Euclidean
distance metric and the convex hull metric produced the best result. For the gap classification stage,
the Gaussian mixture modelling together with the sequential clustering algorithm had the best
performance.
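As an illustration of the gap classification stage, the following sketch fits a two-component Gaussian mixture to the 1-D gap distances of a text line and labels the component with the larger mean as between-word gaps. It uses scikit-learn's GaussianMixture purely for convenience; the cited works do not prescribe a particular implementation, and the sequential clustering step is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def classify_gaps(distances):
    """Fit a 2-component Gaussian mixture to the 1-D gap distances and label each
    gap as a between-word gap (True) or within-word gap (False)."""
    d = np.asarray(distances, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(d)
    labels = gmm.predict(d)
    between_word = int(np.argmax(gmm.means_.ravel()))   # component with the larger mean
    return labels == between_word

# e.g. classify_gaps([3, 2, 14, 4, 18, 3]) labels the 14 and 18 pixel gaps as between-word gaps
```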
2.3 Handwritten Text Recognition
Handwritten text recognition (HTR) can be defined as the ability of a computer to transform
handwritten input represented in its spatial form of graphical marks into an equivalent symbolic
representation as ASCII text. Usually, this handwritten input comes from sources such as paper
documents, photographs or electronic pens and touch-screens.
HTR is a relatively new field of computer vision. Optical character recognition (OCR) had its
beginnings in 1951 with the optical reader invented by David H. Shepard. This reader,
known as GISMO, was able to read typewritten text, Morse code and musical notes [Shepard 1953].
HTR first appeared in the late 1960s in restricted applications with very limited vocabularies, such
as the recognition of postal addresses on envelopes or the recognition of bank cheques. The computer
capacity and the technology available in those days were not sufficient for unconstrained handwritten
documents (such as old manuscripts and/or unconstrained, spontaneous text). Even the recognition
of printed text was only adequate for simplified fonts, for example the fonts that have been used on
credit cards ever since. However, the increase in computer capacity, the development of adequate
technologies and the necessity of automatically processing large quantities of handwritten documents
have turned handwritten text recognition into a focus of attention for both industry and the
research community, leading to a significant improvement in the accuracy of these systems.
According to [Bertolami 2008], [Bunke 1995], nowadays, OCR systems are capable of accurately
recognizing typewriting in several fonts. However, high recognition accuracy is still difficult to
achieve in continuous handwritten text. The reason for this lies in the high variability that comes with
handwriting, such as individual writing styles, the large number of different word classes and the
difficulty of the word segmentation problem.
Current research in HTR mainly follows two different basic approaches which will be discussed in
the next subsections. The first approach recognizes isolated characters or digits. Therefore, when
dealing with non-isolated characters, for example in words or running text, the data has to be segmented
beforehand. This method originates from the beginnings of HTR, when the input images contained
only one single symbol. The second approach performs continuous recognition, meaning that the
beginning and ending of each character do not need to be known beforehand; they are hypothesized
as a byproduct of the recognition process itself.
2.3.1 Optical Character Recognition
Optical character recognition, usually abbreviated to OCR, is the mechanical translation of images
of handwritten, typewritten or printed text (letters, numbers or symbols) into machine-editable text.
A historical review of OCR can be found in [Impedovo 1991], [Mori 1992]. OCR is mainly divided
into two areas: printed character or word recognition, and handwritten character or word recognition.
Considerable improvement in the recognition of printed text has been achieved to date, mainly due to the
fact that it is easy to segment the text into characters. This improvement has made possible the
development of OCR systems with very good accuracy, around 99% or higher at the character
level for modern printed material. However, given the kind of text image documents involved in
this project, OCR products, or even the most advanced OCR research prototypes [Ratzlaff 2003],
are very far from offering useful solutions to the transcription problem. They are simply not usable
since, in the vast majority of the handwritten text images of interest, characters can by no means be
isolated automatically.
In the OCR field, recognition is performed using one or several classifiers. A naive approach to the
recognition of an image containing a single symbol is the nearest neighbour (NN) or k-NN
approach. More sophisticated algorithms which achieve better results are artificial neural networks
(ANNs) and support vector machines (SVMs) [Lee 1996], [Srihari 1993]. Moreover, some work has
been carried out using tangent vectors and local representations [Keysers 1999] or Bernoulli
mixture models [Khoury 2010], [Juan 2001], [Romero 2007] with good results. In [Fenrich 1992],
[Nagy 2000] an overview of the current state of the art in OCR is given and the different
applications of this technology are discussed.
2.3.2 Handwritten Text Recognition
HTR can be considered a relatively new subfield of pattern recognition. An overview of the
state of the art can be found in [Kim 1999], [Plamondon 2000], [Vinciarelli 2002]. In [Bunke 2003]
one can find a detailed description of previous work on HTR, the current state of the art, and the
most likely paths for further progress in the near future.
Depending on the way the initial data is acquired, automatic handwriting recognition systems can
be classified into on-line and off-line. In off-line systems the handwriting is given as an image or
scanned text, without time sequence information. In on-line systems the handwriting is given as a
temporal sequence of coordinates that represents the pen tip trajectory.
Some off-line handwriting transcription products rely on technology for isolated character
recognition (OCR) developed in the last decade. First, the input image is segmented into characters
or pieces of a character. Thereafter, isolated character recognition is performed on these segments.
In [Bozinovic 1989], [Kavallieratou 2002], [Senior 1998] this approach is followed, using
segmentation techniques based on dynamic programming. As was previously explained, the
difficulty of this approach is the segmentation step. In fact, the difficulty of character segmentation
for some handwritten documents makes it impossible to apply this approach to this kind of
documents.
The required technology should be able to recognize all text elements (sentences, words and
characters) as a whole, without any prior segmentation of the image into these elements. This
technology is generally referred to as segmentation-free “off-line Handwritten Text Recognition”
(HTR).
Several approaches have been proposed in the literature for HTR that resemble the noisy channel
approach that is currently used in Automatic Speech Recognition. Thus, HTR systems are based
on Hidden Markov Models (HMMs) [Bazzi 1999], [Plamondon 2000], [Marti 2001], [Koerich
2003], [Toselli 2004], [Romero 2007] or hybrid HMM and Artificial Neural Networks [Boquera
2011], [Graves 2009].
However, the results obtained with these systems are still very far from offering usable accuracy.
They have proven to be suited for restricted applications with very limited vocabulary or
constrained handwriting, achieving relatively high recognition rates in this kind of task. However,
for (high quality, modern) unrestricted text images, current fully automated HTR state-of-the-art
(research) prototypes provide accuracy levels that range from 30 to 80% at the word level. This fact
has led us to propose the development of computer-assisted solutions based on novel approaches
studied in WP5.
2.4 Keyword Spotting
2.4.1 The query-by-example approaches
Nowadays, many digitized ancient manuscripts are not exploited due to lack of proper browsing and
indexing tools. These documents are characterized by complicated layouts and a variety of writing
styles, while many of them are degraded. The strategies that are applied for OCR of modern printed
documents are not suitable as they are sensitive to noise, character variation and text layout
complexity. Accordingly, up to this day, there is not a single system available to analyze, index and
search for words in a collection of ancient handwritten manuscripts in a satisfactory way.
A valid strategy to deal with this kind of unindexed documents is a word matching procedure that
relies on low-level pattern matching, called word spotting [Manmatha 1996]. It can be defined as
the task of identifying locations on a document image which have a high probability of containing an
instance of a queried word, without explicitly recognizing it. The first approaches attempted to
locate the words inside the document using advanced procedures for binarization, pre-processing
and text layout analysis. Then, by analyzing each segmented word, feature vectors called word
descriptors are extracted from it. Based on these descriptors, a distance measure is used to quantify the
similarity between the query word and the segmented template word. These systems are called
segmentation-based word spotting systems. Although there is an abundance of systems suitable for
both modern [Hassan 2012], [Zagoris 2010] and historical [Kesidis 2011], [Konidaris 2008], [Zirari
2013], [Khurshid 2012] printed material, very few of these systems are applicable to historical
handwriting, especially if the document database contains material from different writers.
Rath and Manmatha [Rath 2003b], [Lavrenko 2004], [Rath 2003a] initially de-slant, de-skew and
normalize the segmented word. Afterwards, they extract two families of features from it. On
the one hand, there are scalar features that include aspect ratio, area, etc. On the other
hand, there are profile-based features that are based on the horizontal and vertical word projections
and the upper and lower word profiles. The latter features are encoded by the Discrete
Fourier Transform. The feature comparison algorithm is dynamic time warping (DTW), as the
features are of variable length.
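The DTW comparison of such variable-length feature sequences can be sketched as follows; the extraction of the projection and word profile features themselves is omitted, and the length normalization at the end is an assumption of this sketch rather than the exact normalization of [Rath 2003b].

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two variable-length feature sequences
    given as numpy arrays of shapes (n, d) and (m, d)."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)        # normalize by an upper bound on the path length
```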
Zagoris et al. [Zagoris 2011] created a similar set of profile-based features, encoded in a different
way through the Discrete Cosine Transform, normalization by the first coefficients and quantization
with the Gustafson–Kessel [Kessel 1978] fuzzy algorithm. The result was a very short, fixed-length
descriptor, which has been tested on a Greek handwriting database from different writers, the
Washington words database and the MPEG-7 CE1 Set B database.
Llados et al. [Llados 2012] evaluate the performance of various word descriptors, including a bag
of visual words procedure (BoVW), a pseudo-structural representation based on Loci Features, a
structural approach by using words as graphs, and sequences of column features based on DTW.
They tested them on the George Washington database and the marriage records from the Barcelona
Cathedral (ESPOSALLES database). They found that the statistical approach of the BoVW
produces the best results, although the memory requirements to store the descriptors are significant.
Rodríguez and Perronnin [Rodriguez 2008] extract features from a sliding window, based on the
first-order gradient and inspired by the SIFT keypoint descriptor. The evaluation scenarios make use of
two different matching paradigms: DTW and Hidden Markov Models (HMM). They run tests on 630
real scanned letters (written in French) containing different writing styles, artifacts, spelling
mistakes and other types of noise.
Finally, Srihari et al. [Srihari 2005] present a system for searching handwritten Arabic documents
based on a set of binary shape features suitable for Arabic script along with a correlation distance
that performed best for matching those features.
More recently, new techniques have emerged which bypass the procedure of locating and segmenting
the words from the document and instead try to locate and match them directly on the document
image. These techniques stemmed from the inability of state-of-the-art segmentation methods to
produce meaningful results on historical handwritten documents.
Leydier et al. [Leydier 2007] detect zones on the images that represent the most informative parts
based on the gradient orientation calculated by the convolution of the image with the first and
second derivatives of the Gaussian kernel. Their matching procedure is based on elastic naive
matching. In order to create guides to the matching, they use morphological operations; an opening
with a heuristic structuring element based on the document type. The evaluation experiments are
conducted on medieval manuscripts of Latin and Semitic alphabets. They found that their algorithm
is not suitable for short words (fewer than four characters), since the shorter the word, the less
information becomes available for comparison. In [Leydier 2009], they introduced an improved
version of their initial segmentation-free approach. The main updated parts are their matching
engine and their guide detection, which is now based on the second derivative along the isophote
curves. Although it is an improvement, the way they extract and match the features is very
sensitive to writing variations and to matching words of different font sizes. Moreover,
their matching engine is very costly, making it impractical for large databases.
Gatos and Pratikakis [Gatos 2009] calculate block-based document image descriptors that are used
as a template. They create scaled and rotated versions of the query: the versions are scaled by
factors ranging from 0.8 to 1.2 with a step of 0.1 and are rotated in the range from -1° to 1° with a step
of 1°, producing a total of 15 different word instances. For each word instance, they calculate 5
different sets of feature vectors, based on computing the pixel density of all 5x5 non-overlapping
windows and then applying various word image translations. The 15 different queries create a lot of
ambiguity in the final merging stage, especially when there are many writing variations.
Rusinyol et al. [Rusinyol 2011] present a patch-based framework. The patches are represented by a
BoVW model, in which the local features are SIFT descriptors. They then apply the Spatial
Pyramid Matching method to their BoVW model to compensate for the lack of spatial information.
After normalizing the BoVW descriptor with a tf-idf model, they apply a latent semantic
indexing technique. They report experiments on the George Washington and Lord Byron datasets.
Apart from the costly, database-dependent operation of creating a codebook for the BoVW model,
their method depends on the query font size, thus making it impractical for matching words of
different size.
It is worth noting that both aforementioned strategies – segmentation-based and segmentation-free –
are valid under some circumstances. The segmentation-based procedure performs better, when the
layout is simple enough to correctly segment the words. By contrast, the segmentation-free
algorithms perform better in the case that there is considerable degradation on the document.
Unfortunately, both strategies are incompatible with each other, as they often use different pipelines,
features and matching procedures.
Finally, another important point to be considered is that the features for word spotting which rely
only on word shape characteristics are not effective in dealing with a document collection created
by different writers, containing significant writing style variations. Although slant and skew
preprocessing techniques can reduce the shape variations, they cannot eliminate the problem, as the
overall structure of the word often differs. We therefore argue that although the shape
information is meaningful, texture information in a spatial context is more reliable in such cases.
Prompted by the above considerations, the proposed segmentation-based and segmentation-free
techniques employ the same base features, which are called Document-Specific Local Information
(DSLI), and belong to the family of local descriptors. The only difference is that the matching
procedure in the segmentation-free scenario is an extension of the one used in the segmentation-
based approach. Moreover, the proposed features - although they use the spatial information of the
current point’s location - are based on texture information.
2.4.2 The query-by-string approaches
The idea of using word-specific and filler or “garbage” models was proposed a long time ago for
keyword spotting (KWS) in the field of automatic speech recognition, and it has been in use for many
years in that field - see for example [Lleida 1993], [Keshet 2009], [Szke 2005]. More recently, the
same basic idea has also been used for KWS in handwriting images [Fischer 2010], [Rodriguez-
Serrano 2009], [Thomas 2010]. In this approach, character hidden Markov models (HMMs) are
employed to build both the filler model and a word-specific model for each query word. This idea is
often referred to as “HMM-Filler”. One of its attractive features is that it is “lexicon-free”; that is, it
does not need any predefined lexicon or set of candidate query words. In this research we follow
the work presented in [Fischer 2010], where HMM-Filler is used for line-oriented KWS. In this
setting, whole text line images, without any kind of segmentation into words or characters, are
analyzed to determine the degree of confidence that a given keyword appears in the image.
HMM-Filler is often proposed as a KWS method for keyword searching “on-the-fly”. However, the
very large computing time of word-specific Viterbi decoding, for all the line images and all the
query words to be spotted, renders this idea unfeasible in practice, even for small collections of
handwritten document images. Nevertheless, if large “off-line” computing cost can be afforded,
HMM-Filler scores can still be used for indexing purposes.
Both the classical HMM-Filler and the approach here proposed rely on segmentation-free, HMM
character recognition. The basics of HTR based on HMMs were originally presented in [Bazzi
1999] and further developed in [Vinciarelli 2004], [Toselli 2013] among others. Recognizers of this
kind accept a given handwritten text line image, represented as a sequence of D-dimensional feature
vectors x = x1 x2 ... xn as input and find a most likely character sequence ĉ = ĉ1 ĉ2 ... ĉl:
ĉ = argmax_c P(c|x) = argmax_c p(x|c)·P(c) (2.4.2.1)
The score associated with ĉ (the so-called “Viterbi score”) is:
S(x) = max_c log( p(x|c)·P(c) ) (2.4.2.2)
The conditional density p(x|c) is approximated by previously trained morphological character
HMMs, while the prior P(c) is given by a character-level “language” model.
Eqs. (2.4.2.1)-(2.4.2.2) are commonly solved by means of Viterbi decoding [Jelinek 1998]. As a
byproduct, a huge set of most likely recognized character sequences can be obtained and
represented in the form of a character lattice (CL). These lattices are better known in the literature
as “word-graphs” [Wessel 2001].
In this approach, as presented in [Fischer 2010], character HMMs are used to build both a “filler”
model, F, and a keyword-specific model, Kυ, for each individual keyword υ to be spotted. F is built
by arranging all the trained character HMMs in parallel (with equiprobable priors P(c) in Eqs.
(2.4.2.1)-(2.4.2.2)) with a loop-back. This model accounts for any unrestricted sequence of characters
(see Fig. 2.4.2.1a). On the other hand, Kυ is constructed to model the exact character sequence of υ,
surrounded by the space character and the same unrestricted character sequences modeled by F (see
Fig. 2.4.2.1b).
Figure 2.4.2.1: (a) “Filler HMM” (F) and (b) “keyword HMM” (Kυ) built for the keyword υ=“bore”.
In a preprocessing phase, F can be used once and for all to compute the Viterbi decoding score Sf(x)
(Eq. (2.4.2.2)) for each line image x. Similarly, in the searching phase, for each
keyword υ to be spotted and for each line image x, the Viterbi score sυ(x) is computed using the
keyword-specific model Kυ. A spotting score S(υ,x) is then defined as:
S(υ,x) = ( sυ(x) − Sf(x) ) / Lυ (2.4.2.3)
where Lυ is the length of υ, in number of frames, between the detected word borders. If the spotted
word υ is actually in the line image x, a matching is expected with high (negative) values of S(υ,x),
upper bounded by S(υ,x)=0.
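Given the two Viterbi log-scores, the spotting score of Eq. (2.4.2.3), as reconstructed above, is simply their length-normalized difference. The following minimal sketch assumes the scores have already been computed by the filler and keyword decodings; the decodings themselves are not shown and the numeric values in the usage comment are hypothetical.

```python
def spotting_score(keyword_logscore, filler_logscore, n_frames):
    """Length-normalized log-likelihood ratio of Eq. (2.4.2.3).
    Values are <= 0; the closer to 0, the more likely the keyword is present."""
    return (keyword_logscore - filler_logscore) / n_frames

# e.g. a line where the keyword model scores almost as well as the filler model:
# spotting_score(-1210.0, -1205.5, 90) -> -0.05 (a likely match)
```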
To analyze the computational cost of this approach, we consider separately that of the character
filler model F and that of the keyword model Kυ. The computational complexity of the
preprocessing phase is that of the Viterbi algorithm [Jelinek 1998]; that is O(N·n·KF), where N is
the number of line images in the collection, n is the length of x, and KF depends on the total number
of states in all the character HMMs in F and on the square of the number of character models in F.
Similarly, taking into account the topology of Kυ, the search phase complexity is O(M·N·n·Kυ′),
where Kυ′ is the corresponding factor for the keyword model and M is the number of spotted
keywords. It is important to remark that while the F decoding runs only once per text line image,
the Kυ decoding runs M times per text line image (once for each keyword).
3. Developed Methods
3.1 Image Pre-processing
3.1.1 Binarization
The developed binarization method is based on the method of Ntirogiannis et al. [Ntirogiannis
2014] which was developed specifically for handwritten document images. The aforementioned
method has several steps, such as background estimation and image normalization, combination of
the global Otsu [Otsu 1979] and local Niblack [Niblack 1986] method, contrast calculation, as well
as intermediate processing to remove small noisy connected components. The performance of the
initial binarization method was very high, outperforming in most cases current state-of-the-art
methods. However, the evaluation was performed on the handwritten images of the H/DIBCO
datasets from 2009-2011, which did not contain cases from the TranScriptorium project. Therefore,
to test the effectiveness of this method on images from TranScriptorium, we used the DIBCO'13
dataset (Pratikakis et al. [Pratikakis 2013]), which contains portions of 3 images from the
TranScriptorium project. The initial method was ranked second (F-Measure=94.32) with a very
small difference from the first (F-Measure=94.43) concerning those TranScriptorium images.
After extensive experimentation on the TranScriptorium images, we observed that the binarization
method of Ntirogiannis et al. [Ntirogiannis 2014] could miss textual information in an attempt to
clear the background from noisy components or bleed-through (Fig. 3.1.1.1(a)-(b)). Under those
circumstances, some modifications were introduced to preserve textual information (faint characters
or low contrast characters) (Fig. 3.1.1.1(c)). The main adjustments made in the final developed
method concern the Otsu threshold selection, the incorporation of a third binarization method
(Gatos et al. [Gatos 2006]) into the combination stage and the noisy connected components removal
stage. Secondary adjustments concern the speed enhancement as well as the handling of big black
borders, which the initial method could not manage. In the following, the final developed
binarization method is detailed.
(a)
(b)
(c)
Figure 3.1.1.1: Initial and modified binarization results: (a) Original images from the
TranScriptorium project; (b) Ntirogiannis et al. [Ntirogiannis 2014] binarization; (c) developed
binarization based on Ntirogiannis et al. [Ntirogiannis 2014].
Background Estimation and Image Normalization
Background estimation is performed using inpainting. The inpainting mask is considered the
foreground of two binarization methods, namely Niblack [Niblack 1986] and Otsu [Otsu 1979]. To
achieve the inpainting result, five passes of the image are required. The first four image passes are
performed following the LRTB, LRBT, RLTB and RLBT directions, where L, R, T and B refer to
Left, Right, Top and Bottom, respectively. In each pass i, we use as input the original image and get
as output an inpainting image Pi(x,y). In particular, for each "mask" pixel, we calculate the average
of the "non-mask" pixels in the 4-connected neighborhood (cross-type) and consider that pixel as a
"non-mask pixel" for consecutive computations of that pass. At the final (fifth) image pass, for any
direction chosen, we take into account all four images of the four previous inpainting passes (Pi(x,
y), i=1,...,4) and for each pixel we keep the minimum (darkest) intensity. Finally, the original image
is divided by the estimated background image. The values of the new image are stretched in the
range of the original grayscale image (Fig. 3.1.1.2b). The impact on binarization of the
aforementioned background smoothing procedure is shown in Fig. 3.1.1.2c-d.
(a)
(b)
(c)
(d)
Figure 3.1.1.2: Background estimation and image normalization: (a) Original grayscale image
I(x,y); (b) original image after background estimation and normalization N(x,y); (c)-(d) global Otsu
[Otsu 1979] of (a) and (b) respectively.
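A minimal sketch of the described inpainting procedure is given below: one directional pass scans the image and replaces every mask pixel by the average of its already known 4-connected neighbours, and the four directional passes are combined by a per-pixel minimum. The handling of mask pixels with no known neighbour is simplified in this sketch.

```python
import numpy as np

def inpaint_pass(gray, mask, flip_x=False, flip_y=False):
    """One directional inpainting pass: scan the image in the given direction and
    replace every "mask" pixel by the average of its already known 4-connected
    neighbours, then treat it as known for the rest of the pass.
    `mask` is a boolean array, True where pixels belong to the inpainting mask."""
    img = gray.astype(float).copy()
    known = ~mask.astype(bool)
    ys = range(img.shape[0])[::-1] if flip_y else range(img.shape[0])
    xs = range(img.shape[1])[::-1] if flip_x else range(img.shape[1])
    for y in ys:
        for x in xs:
            if known[y, x]:
                continue
            vals = [img[ny, nx] for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= ny < img.shape[0] and 0 <= nx < img.shape[1] and known[ny, nx]]
            if vals:
                img[y, x] = np.mean(vals)
                known[y, x] = True
    return img

def estimate_background(gray, mask):
    """Combine the four directional passes (LRTB, LRBT, RLTB, RLBT) by keeping,
    for every pixel, the minimum (darkest) intensity."""
    passes = [inpaint_pass(gray, mask, fx, fy) for fx in (False, True) for fy in (False, True)]
    return np.minimum.reduce(passes)
```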
First Combination and Small Components Removal
Given that the original image has been normalized using background estimation, we can efficiently
apply the global binarization method of Otsu to the normalized image N(x,y). In this way, we detect
the main text of the image, from which we can calculate the image contrast that is required in
consecutive steps. However, it is important to remove small noisy components to minimize the
noise contribution in our calculations and to reduce false combinations when Otsu [Otsu 1979]
is combined at the connected component level with the Niblack [Niblack 1986] binarization result.
In order to further enhance the text detection of this step, we also use the binarization method of
Gatos et al. [Gatos 2006]. This method uses Wiener filtering as a pre-processing step and hence the
original image is given as input. The image normalization using the background estimation and the
Wiener filtering has similar effect, considering that both methods result in an image of smoothed
background. In addition, before binarization we apply a 3x3 Gaussian filter while after binarization
we apply a morphological closing operation in order to remove small noisy dots.
We consider that the aforementioned noise consists of several components corresponding to a small
part of the detected foreground. Hence, the ratio between the total number of pixels that correspond
to components of a small height and the number of those components is expected to be very low. In
particular, we consider O(x,y) to be the combination of the Otsu [Otsu 1979] and Gatos et al. [Gatos 2006] binarization
result with "1" corresponding to foreground and "0" to background. Furthermore, let Oi(x,y) be the
i-th connected component of O(x,y) with i = 1,...,NCO, where NCO is the number of the connected
components of O(x,y) and Oi(x, y) equals to "1" only at foreground pixels of the i-th connected
component and "0" elsewhere. Then, we discard all connected components smaller than half of the
height h, as determined by the criterion specified by Eq. (3.1.1.1).
Σj=1..h ( RPj / RCj ) ≥ 1 (3.1.1.1)
where h denotes the minimum connected component height for which the above relation Eq.
(3.1.1.1) is satisfied,
RPj = ( Σx Σy Oj(x,y) ) / ( Σx Σy O(x,y) ) denotes the ratio between the number of foreground pixels of
the connected components of O(x,y) of height j (i.e. Oj(x,y)) and the total number of foreground pixels
of O(x,y), and
RCj = NCj / NCO denotes the ratio between the number NCj of connected components of height j and
the total number NCO of connected components of O(x,y).
Based upon the aforementioned post-processing criterion, the first combination image OP(x,y) after
post-processing can be defined by Eq. (3.1.1.2) as follows:
OP(x,y) = ∪i Oi(x,y), ∀i: H(i) ≥ h/2, i = 1,...,NCO (3.1.1.2)
where h is given by Eq. (3.1.1.1) and H(i) denotes the height of the i-th connected component Oi(x,y).
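A sketch of this removal step, following the reconstruction of Eqs. (3.1.1.1)-(3.1.1.2) given above, is shown below; the accumulation of RPj/RCj until it reaches 1 and the H(i) ≥ h/2 rule are taken from that reconstruction and should be checked against [Ntirogiannis 2014].

```python
import numpy as np
from scipy import ndimage

def remove_small_components(O):
    """Discard connected components of the binary image O whose height is below h/2,
    with h given by the criterion of Eq. (3.1.1.1) as reconstructed in the text."""
    labels, n = ndimage.label(O)
    if n == 0:
        return O.copy()
    slices = ndimage.find_objects(labels)
    heights = np.array([s[0].stop - s[0].start for s in slices])     # per-component height
    sizes = np.asarray(ndimage.sum(O, labels, range(1, n + 1)))      # per-component pixel count
    total_px, total_cc = float(O.sum()), float(n)
    h = heights.max()
    acc = 0.0
    for j in range(1, heights.max() + 1):
        of_height_j = heights == j
        if not of_height_j.any():
            continue
        RP_j = sizes[of_height_j].sum() / total_px      # pixel ratio of height-j components
        RC_j = of_height_j.sum() / total_cc             # count ratio of height-j components
        acc += RP_j / RC_j
        if acc >= 1.0:
            h = j                                       # minimum height satisfying Eq. (3.1.1.1)
            break
    keep = np.flatnonzero(heights >= h / 2.0) + 1       # Eq. (3.1.1.2): keep components with H(i) >= h/2
    return np.isin(labels, keep).astype(O.dtype)
```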
Niblack Parameter Estimation and Final Combination
Local binarization of Niblack [Niblack 1986] is to be applied in the normalized image N(x,y).
However, Niblack requires two parameters, the window size "w" and the strictness "k". In the
binarization method of Ntirogiannis et al. [Ntirogiannis 2014], the window size was set according to
the stroke width, the computation of which required skeletonization and distance transformation
which increased the computational time. Under those circumstances, we use a fixed window size for
Niblack w=50, while parameter "k" is set according to the global image contrast C, as denoted by
Eq. (3.1.1.3).
k = −0.2 − 0.1·⌊C/10⌋ (3.1.1.3)
Concerning the global contrast of the image, the contrast ratio is often used, i.e. the ratio between
the foreground and the background intensity, and the logarithm (log) of the contrast ratio has also
been used, including in document image applications (Roufs et al. [Roufs 1997]). In particular, we use
the log contrast ratio C (Eq. (3.1.1.4)) modified for degraded document images. As numerator, we
use the average foreground intensity FGave increased by the foreground standard deviation FGstd
in order to be closer to the faint character’s intensity and as denominator we use the average
background intensity BGave decreased by the background standard deviation BGstd in order to be
closer to the background variations such as shadows and stains. In Eq. (3.1.1.4), the constant "-50"
is used to limit the values between 0–100, approximately.
C = −50·log10( (FGave + FGstd) / (BGave − BGstd) ) (3.1.1.4)
where the foreground statistics are calculated taking into account information of the original image
that corresponds to foreground of the OP(x,y) (excluding contour pixels). The background statistics
are calculated using a new estimated background image that results after considering for each pixel
the average of the individual inpainting passes performed at the background estimation stage.
The result of the first combination followed by the post-processing step OP(x,y) contains low
background noise but fails to retain the faint parts of the characters. On the contrary, the Niblack
result NB(x,y) contains much background noise but detects the faint characters efficiently enough.
Thus, we apply a combination at connected component level between the OP(x,y) and the NB(x,y)
images. In more detail, we keep each connected component of NB(x,y) that has at least one
common foreground pixel with the OP(x,y) image. Example results are presented in Fig. 3.1.1.3.
Finally, some enhancement is performed on the binary image to smooth the contours from the
"teeth" effect.
(a)
(b)
Figure 3.1.1.3: Example results of the first and final combination: (a) Combination of Otsu [Otsu
1979] and Gatos et al. [Gatos 2006] binarization; (b) final result after additional combination with
Niblack [Niblack 1986].
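The component-level combination itself is straightforward and can be sketched as follows: every connected component of the Niblack result NB(x,y) is kept if it shares at least one foreground pixel with the first combination image OP(x,y).

```python
import numpy as np
from scipy import ndimage

def combine_at_component_level(OP, NB):
    """Keep each connected component of the Niblack result NB that has at least one
    foreground pixel in common with the first combination image OP."""
    labels, n = ndimage.label(NB)
    overlap = np.asarray(ndimage.sum(OP, labels, range(1, n + 1)))  # common foreground pixels per NB component
    keep = np.flatnonzero(overlap > 0) + 1
    return np.isin(labels, keep).astype(NB.dtype)
```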
3.1.2 Border Removal
Our methodology detects and removes noisy black borders as well as noisy text regions. It is based
on [Stamatopoulos 2007] and focuses on handwritten document characteristics. The border removal
method relies upon the projection profiles combined with a connected component labelling process.
Signal cross-correlation is also used in order to verify the detected noisy text areas.
At a first step, the noisy black borders (vertical and horizontal) of the image are removed. The
flowchart for noisy black borders detection and removal is shown in Fig. 3.1.2.1. In order to achieve
this, we first proceed to an image smoothing using the Run Length Smoothing Algorithm (RLSA)
[Wahl 1982]. Then the vertical and horizontal histograms of the image are calculated in order to
detect the starting and ending offsets of borders and text regions. The final clean image without the
noisy black borders is calculated by using the connected components of the image [Chang 2004].
Figure 3.1.2.1: Flowchart for noisy black borders detection and removal.
Once the noisy black borders have been removed, we proceed to detect noisy text regions of the
image. Our aim is to calculate the limits of the text region. The flowchart for noisy text regions
detection and removal is shown in Fig. 3.1.2.2.
We first proceed to an image smoothing by using dynamic parameters which depend on average
character height. Our aim is to connect all pixels that belong to the same line. Then, we calculate
the vertical histogram and detect the regions which are separated by a blank area. Finally, we detect
noisy text regions with the help of the signal cross-correlation function [Sauvola 1995] and remove
the noisy text regions by using the connected components of the image.
Figure 3.1.2.2: Flowchart for noisy text regions detection and removal.
An example of the developed border removal methodology is given in Fig. 3.1.2.3.
(a) (b) (c)
Figure 3.1.2.3: Border removal example: (a) Original image; (b) Binary image (c) image after
border removal.
3.1.3 Image Enhancement
Color Enhancement
Background Estimation and Image Normalization
As enhancement methods of the original grayscale image, we use background smoothing based on
background estimation. In particular, the background is estimated using a new fast inpainting
technique. Afterwards, the estimated background is used to smooth the original image. In the
following, the background estimation and smoothing technique is detailed.
For the background estimation, an inpainting procedure is followed for which the Niblack [Niblack
1986] binarization output (window size w=60 and k=-0.2) is used as the inpainting mask.
Afterwards, Niblack’s foreground result (inpainting mask) is dilated using a 3x3 mask in order to
exclude background pixels near the character edges which may have intensity closer to the
foreground intensity. In addition, Otsu [Otsu 1979] is performed, the foreground of which is
considered as part of the inpainting mask (Fig. 3.1.3.1b). This is essential to handle the big black
borders for which Niblack could fail to binarize effectively because of the fixed window size.
Finally, the estimated background image (Fig. 3.1.3.1c) inherits the intensity of the original image
for "non-mask" pixels mask, while the inpainting mask is filled with surrounding background
information of the original image according to the following procedure.
To achieve the inpainting result, five passes of the image are required. The first four image passes
are performed following the LRTB, LRBT, RLTB and RLBT directions, where L, R, T and B refer
to Left, Right, Top and Bottom, respectively. In each pass i, we use as input the original image and
get as output an inpainting image Pi(x,y). In particular, for each "mask" pixel, we calculate the
average of the "non-mask" pixels in the 4-connected neighborhood (cross-type) and consider that
pixel as a "non-mask pixel" for consecutive computations of that pass. At the final (fifth) image
pass, for any direction chosen, we take into account all four images of the four previous inpainting
passes (Pi(x, y), i=1,...,4) and for each pixel we keep the minimum (darkest) intensity (Fig.
3.1.3.1c).
Background estimation followed by the image normalization procedure is often used to balance the
illumination of a picture that is taken under uneven lighting conditions. This method modifies the
image towards a bi-modal distribution that can be globally binarized with improved success. The
main principle of the aforementioned method is that the observed image I(x,y) equals the product
of the illumination (scene lighting conditions) S(x,y) and the reflectance R(x,y) of the scene
objects. If the lighting conditions can be reproduced without any scene objects, as an image I'(x,y),
then the normalized image equals I(x,y)/I'(x,y). In the case of document images, we consider that the
image corresponding to the lighting conditions without any scene objects, I'(x,y), is represented by
the estimated background image BG(x,y). Then, according to the aforementioned discussion, the
normalized image F(x,y) equals I(x,y)/BG(x,y). The F(x,y) image is also normalized to the range
between the minimum and maximum values of the original image (Eq. (3.1.3.1)); alternatively, we
could stretch the F(x,y) image to [0,255], but this leads to a potential loss of faint characters and
leaves some noise. An example image normalization result is presented in Fig. 3.1.3.1d.
N(x,y) = ⌈ (Imax − Imin)·(F(x,y) − Fmin) / (Fmax − Fmin) + Imin ⌉ (3.1.3.1)
where F(x,y) = (I(x,y) + 1) / (BG(x,y) + 1), in order to be theoretically correct and eliminate any
possibility of division by zero, and Imin, Imax and Fmin, Fmax denote the minimum and maximum
values of the I(x,y) and F(x,y) images, respectively.
(a)
(b)
(c)
(d)
Figure 3.1.3.1: Background estimation and image normalization for background smoothing: (a)
original grayscale image; (b) inpainting mask; (c) estimated background using an inpainting
procedure; (d) final enhanced image after background normalization with the estimated background.
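A minimal sketch of the normalization of Eq. (3.1.3.1) is given below; the +1 constants in F(x,y) follow the reconstruction above (they guard against division by zero) and the ceiling mirrors the equation.

```python
import numpy as np

def normalize_with_background(I, BG):
    """Divide the image by its estimated background and rescale the result to the
    intensity range of the original image (Eq. (3.1.3.1))."""
    I = I.astype(float)
    F = (I + 1.0) / (BG.astype(float) + 1.0)                 # background-normalized ratio image
    out = (I.max() - I.min()) * (F - F.min()) / (F.max() - F.min() + 1e-12) + I.min()
    return np.ceil(out).astype(np.uint8)
```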
Background and noise removal
One of the image pre-processing steps required for handwritten text recognition, and one that is
especially important for scanned antique documents in which it is common to observe some degree
of paper degradation, is the extraction of the text from noisy backgrounds. A standard approach to
deal with this problem is to threshold the images [Sauvola 2000], i.e., assigning pixels to be either
black or white depending on whether a pixel value is below or above a certain threshold, thus
obtaining a 1-bit depth image. However, in the context of ancient texts, it can be observed that the
writing strokes vary greatly in width and darkness, not only between different pages or lines, but
also within single words. Because of this, when applying thresholding techniques, there is always a
risk that parts of the strokes are lost due to the hard decision of black or white. Furthermore, state-
of-the-art recognizers are based on Gaussian Mixture HMMs or neural networks, which do not
require having as input a binary image, and in fact better performance can be obtained if more of
the original image information is provided.
To this end, instead of thresholding, we aim for a technique that extracts text from noisy
backgrounds while having as output a grayscale image that better resembles the information
contained in the original images. Local thresholding techniques work rather well, so to take
advantage of this, we can use an existing thresholding method and relax it such that values in a band
near the threshold are mapped to a smooth transition between black and white. Due to its good
performance, we have taken as reference Sauvola’s method [Sauvola 2000], and modified it
following the previously mentioned idea.
Given an input image, for every pixel a threshold is computed based on the pixel’s neighborhood.
Like the original Niblack’s method [Niblack 1986], Sauvola’s method computes the threshold based
on the neighborhood's mean and standard deviation values. The Sauvola threshold T(x,y) for a
pixel position (x,y) is given as follows:
T(x,y) = μ(x,y) · [ 1 + k·( σ(x,y)/R − 1 ) ] (3.1.3.2)
where μ(x,y) and σ(x,y) are the neighborhood's mean and standard deviation, respectively, R is the
dynamic range of the standard deviation (usually R=128), and k is a parameter that needs to be
adjusted although it is common to have k=0.5. The neighborhood is a square region of w pixels
wide, centered at the corresponding pixel. This algorithm is very fast, since it is well known that the
mean and standard deviation of any subwindow of an image can be efficiently computed by using
integral images [Shafait 2008].
The proposed modification of the Sauvola method only requires one additional parameter, which we
have called s, that determines how fast the transition between black and white is. As the value of s is
decreased, the transition from black to white becomes narrower, which means that the images tend
to get closer to a 1-bit depth image, and in the limit when s=0 the result is exactly the same as the
original Sauvola algorithm.
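A minimal sketch of this relaxation is shown below: the Sauvola threshold T(x,y) of Eq. (3.1.3.2) is computed from local means and standard deviations (via uniform filters over a w×w neighborhood), and pixel values inside a band around the threshold, whose width grows with s, are mapped to a gray ramp instead of a hard black/white decision. The linear ramp and the way the band width depends on s are assumptions of this sketch; only the limit behaviour at s=0 (exact Sauvola) is taken from the text.

```python
import numpy as np
from scipy import ndimage

def soft_sauvola(gray, w=31, k=0.5, R=128.0, s=0.2):
    """Grayscale relaxation of Sauvola binarization: pixels well below the local
    threshold map to black, well above to white, and values inside a band whose
    width grows with `s` map to a linear gray ramp (the ramp is an assumption)."""
    g = gray.astype(float)
    mean = ndimage.uniform_filter(g, size=w)
    sq_mean = ndimage.uniform_filter(g * g, size=w)
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
    T = mean * (1.0 + k * (std / R - 1.0))            # Sauvola threshold, Eq. (3.1.3.2)
    band = s * np.maximum(T, 1.0)                      # half-width of the soft band
    out = (g - (T - band)) / (2.0 * band + 1e-6)       # 0 at T-band, 1 at T+band
    return (255.0 * np.clip(out, 0.0, 1.0)).astype(np.uint8)

# with s=0 the band collapses and the result is the hard Sauvola thresholding
```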
The Sauvola method assumes that the text is dark on a light background. Because of this
assumption, when the text is relatively light, the thresholds tend to be too high, so parts of the
strokes tend to be removed, or become light gray for the case of the proposed technique. To prevent
this, a simple strategy is to pre-process the images by stretching the contrast so that the
darkest and brightest pixels of the image reach the lowest and highest values of the grayscale range.
However, care must be taken when doing a contrast stretch. The image must contain text (not be all
background), otherwise the stretch will enhance all of the background noise.
After the extraction of the text using the proposed technique (with or without a contrast stretch),
some of the remaining noise can be removed using more standard approaches. Specifically, a certain
percentage of the smallest connected components are removed after applying the Run Length
Smoothing Algorithm (RLSA) [Wong 1982] to the image.
Show-through reduction
Recently we have been working on a method for removing some of the show-through noise.
Show-through [Tonazzini 2007] is the effect caused by the slight transparency of the paper, whereby,
when scanning, the content of the reverse side of the page tends to appear. The method being developed is
still in preliminary stages, so not much detail will be given. The method is based on the fact that the
color of the content for the reverse side of the page is a combination of the original color and the
color of the paper. This means that in color space the content of the two sides of the page can be distinguished, and this can be exploited to reduce the show-through effect.
Binary Enhancement
As enhancement methods performed on a binary image, we use the shrink and swell filters of Gatos
et al. [Gatos 2006] along with a contour enhancement technique. The shrink and swell operations use a local neighborhood to check whether the majority of the pixels belong to the foreground or the background and set the central pixel to foreground or background, respectively. These operations are described in detail in Gatos et al. [Gatos 2006].
A further enhancement uses the masks of the binarization method of Lu et al. [Lu 2010] (Fig. 3.1.3.2). These masks, of size 3x3, are applied to the binary image to handle the "teeth" effect along character contours. An example application of contour enhancement is shown in Fig. 3.1.3.3.
Figure 3.1.3.2: Masks used for the enhancement of the binary image.
Figure 3.1.3.3: Contour enhancement of the binary image: binarization result before and after the
contour enhancement.
3.1.4 Deskew
Page and Text Line Skew estimation method
The algorithm that was developed for the estimation of the skew at both the page and the text line
level is based on enhanced horizontal projection profiles. Furthermore, this method was combined
with the minimum bounding box technique which is commonly used in the literature [Safabakhsh
2000] and [Sarfraz 2008] as a skew estimation technique. The area of the rectangle that contains all
the foreground pixels of a document image is expected to be minimal when the document image is
properly deskewed. Furthermore, the running time of the developed method is improved by the
implementation of coarse-to-fine convergence.
This approach is more efficient than state-of-the-art skew detection algorithms. The developed method addresses the main drawbacks of projection profile methods: it is resistant to noise and warping, it gives more accurate results and it is fast due to the proposed coarse-to-fine convergence. For these reasons, it can be efficiently applied to historical documents, providing accurate results over a wide range of angles. The motivation for this work was the recent publication of Papandreou and Gatos [Papandreou 2011], which is extended in the developed method.
In order to estimate the skew, the document image is rotated over a range of angles θ ∈ [θ_min, θ_max]. Let I^θ(x, y) be the binary document image rotated by angle θ, having 1s for black (foreground) pixels and 0s for white (background) pixels, and let I_x^θ and I_y^θ be the width and the height of the rotated document image. At a first step, the rectangular box that bounds the foreground pixels is computed and then a histogram of enhanced horizontal profiles is calculated for every angle. Finally, the profiles for each angle are divided by the corresponding bounding box area, and the angle for which the enhanced histogram has the greatest values is regarded as the skew angle.
In order to compute the minimum rectangle E_θ that bounds the foreground pixels for each angle θ, we calculate the 4 extreme points (x_l^θ, x_r^θ, y_u^θ, y_d^θ):

E_θ = (x_r^θ − x_l^θ) × (y_d^θ − y_u^θ)    (3.1.4.1)
For each angle θ, after the extreme points of the image have been detected, the enhanced horizontal histogram Ĥ_h^θ(y) is calculated within these boundaries.
In order to calculate Ĥ_h^θ(y), the initial horizontal histogram H_h^θ(y) is computed for each angle θ, where H_h^θ(y) is defined as follows:

H_h^θ(y) = ∑_{x = x_l^θ}^{x_r^θ} I^θ(x, y)    (3.1.4.2)

and it is then divided by the area of the bounding box E_θ in order to form Ĥ_h^θ(y):

Ĥ_h^θ(y) = H_h^θ(y) / E_θ    (3.1.4.3)
In order to estimate the correct skew of the image, we seek the angle θ_h for which the variance between three successive values of the Ĥ_h^θ() histogram is maximized. θ_h is defined as follows:

θ_h = argmax_θ D_h(θ)    (3.1.4.4)

where θ varies in the range of the expected document skew, while D_h(θ) is defined as:

D_h(θ) = max_y | 2·Ĥ_h^θ(y) − Ĥ_h^θ(y − 1) − Ĥ_h^θ(y − 2) |    (3.1.4.5)

where x ∈ [x_l^θ, x_r^θ], y ∈ [y_u^θ, y_d^θ], and values of Ĥ_h^θ(y) with x and y exceeding their range are considered to be zero.
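As an illustration of Eqs. (3.1.4.1)–(3.1.4.5), the following Python sketch scores one candidate angle; it rotates only the foreground pixel coordinates (rather than the whole image) and rounds them to the nearest row, which is an assumption of this sketch, as are the function and variable names.

import numpy as np

def skew_score(binary, theta):
    # D_h(theta): variance between three successive values of the enhanced
    # horizontal profile of the image rotated by theta degrees.
    ys, xs = np.nonzero(binary)                      # foreground pixel coordinates
    t = np.deg2rad(theta)
    xr = xs * np.cos(t) - ys * np.sin(t)             # rotated coordinates
    yr = xs * np.sin(t) + ys * np.cos(t)
    E = (xr.max() - xr.min()) * (yr.max() - yr.min())   # bounding box area, Eq. (3.1.4.1)
    rows = np.round(yr - yr.min()).astype(int)
    H = np.bincount(rows)                            # horizontal histogram, Eq. (3.1.4.2)
    Hn = H / E                                       # enhanced profile, Eq. (3.1.4.3)
    Hp = np.pad(Hn, (2, 0))                          # out-of-range values count as zero
    return np.max(np.abs(2 * Hp[2:] - Hp[1:-1] - Hp[:-2]))   # D_h(theta), Eq. (3.1.4.5)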
Faster convergence – coarse-to-fine technique
In order to make the developed method faster and reduce the computational cost, a coarse-to-fine
technique, similar to the one proposed by Shutao Li et al [Shutao 2007], is applied. With the use of
this technique the number of angles that the document image is rotated is reduced without
restricting the range of the angles that the document is rotated. This coarse-to-fine technique,
furthermore, enables the extension of the range of the expected skew angle since it accelerates the
proposed algorithm multiple times, depending on the range of angles of the expected skew, without
affecting significantly the efficiency of the proposed method. The flowchart of the method using
this coarse-to-fine technique is presented in Fig.3.1.4.1.
[Flowchart omitted: the image is processed in four passes K = 1…4; the first pass sweeps the full range [A1, A2] = [−θ, θ] with step 1°, and each subsequent pass sweeps a narrow interval around the previous estimate S with steps 0.5°, 0.1° and 0.05°, applying the skew estimation method at every angle.]
Figure 3.1.4.1: Flowchart of the coarse-to-fine technique used for faster convergence.
Given a wide range of angles [θ_min, θ_max] in which the document skew is expected, a classical projection profile method would rotate the document to every angle from θ_min to θ_max with a certain step which corresponds to the required accuracy. In that way, the algorithm would perform [(θ_max − θ_min + 1) × 20] rotations in order to achieve an accuracy of 0.05°. Instead, in the proposed method the document image is rotated from θ_min to θ_max with step 1° and the dominant skew angle θ_d is determined. Afterwards, the document image is rotated from (θ_d − 0.5) to (θ_d + 0.5) with step 0.5° in order to find the new dominant skew angle θ_d2. Then, the document image is rotated from (θ_d2 − 0.4) to (θ_d2 + 0.4) with step 0.1° and the dominant skew angle is readjusted to θ_d3, while finally, the image is rotated from (θ_d3 − 0.05) to (θ_d3 + 0.05) with step 0.05° and so the skew angle θ_s is determined. With the use of this coarse-to-fine technique the rotations required are reduced to (θ_max − θ_min + 13). As a conclusion, the proposed algorithm achieves faster convergence: it performs about 9 times faster in a range between −5° and 5° and about 17.5 times faster in a range between −45° and 45°.
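A minimal sketch of the coarse-to-fine search, reusing the hypothetical skew_score() function from the previous sketch; the pass structure (1°, 0.5°, 0.1°, 0.05° steps) follows the description above, while everything else is an assumption.

import numpy as np

def coarse_to_fine_skew(binary, theta_min=-45.0, theta_max=45.0):
    # Pass 1: full range with step 1 degree.
    angles = np.arange(theta_min, theta_max + 0.5, 1.0)
    best = max(angles, key=lambda a: skew_score(binary, a))
    # Passes 2-4: narrow intervals around the current estimate with finer steps.
    for half_range, step in [(0.5, 0.5), (0.4, 0.1), (0.05, 0.05)]:
        angles = np.arange(best - half_range, best + half_range + step / 2, step)
        best = max(angles, key=lambda a: skew_score(binary, a))
    return best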
The developed method can be applied to complete document pages as well as to individual handwritten text lines of historical documents. The current implementation differs from the one presented in [Papandreou 2011] since a coarse-to-fine technique is integrated in order to accelerate the developed algorithm, and the estimated skew angle derives from the maximization of the variance between three successive values of the Ĥ_h^θ() histogram.
Word Skew estimation method
The algorithm that was developed in order to estimate the skew of handwritten word images is a
coarse-to-fine method that is fast, has low computational cost, is accurate and doesn’t depend on the
minima of the word. In that way, it provides an accurate estimation of the word skew, even for
words with small length. The developed method was published in ICDAR2013 in Washington DC
[Papandreou 2013a].
The developed coarse-to-fine method is based on two steps which are demonstrated in the flowchart
in Fig. 3.1.4.2. At first, (a) the word image is divided vertically into two equal overlapping parts, (b)
the corresponding center of mass of the foreground pixels is detected in each part and (c) the
inclination of the line that connects the two centers of mass is calculated. This leads to a first rough
estimation of the skew angle and the corresponding correction is applied in the handwritten word
image. Afterwards, in the next step, the core-region of the word is detected and the word is divided
again vertically in two equal overlapping parts. The corresponding centers of mass are calculated
again, though this time only the information inside the core-region is taken under consideration. The
inclination of the line that connects the updated centers of mass is calculated and a new finer skew
correction is made in the word image. The second step is repeated iteratively until the finest skew estimation is reached.
[Flowchart omitted: Step 1 computes a coarse skew estimate s from the centers of mass of the two overlapping parts and corrects it; Step 2 detects the core-region, recomputes the centers of mass using only the core-region information and applies a finer correction. Step 2 is repeated, accumulating the total skew Ts = Ts + s and incrementing the counter c, until s is smaller than the required accuracy ac or c reaches the maximum number of iterations M.]
Figure 3.1.4.2: Flowchart of the developed coarse-to-fine methodology.
In order to define the overlapping parts, the word image is divided along its width into three equal stripes, as demonstrated in Fig. 3.1.4.3. The first two thirds of the image compose the left part, with dimensions I_x^l × I_y, and the last two thirds form the right part, with dimensions I_x^r × I_y.
Figure 3.1.4.3: The word is divided into two equal overlapping parts with widths I_x^l and I_x^r for the left and the right part, respectively.
Afterwards, the corresponding centers of mass (x_cm^l, y_cm^l) and (x_cm^r, y_cm^r) of the overlapping parts are calculated. In order to have a coarse estimation of the skew angle of the word image, the inclination of the line connecting the two reference points (see Fig. 3.1.4.4) is calculated using Eq. 3.1.4.6:

s = tan⁻¹( (y_cm^r − y_cm^l) / (x_cm^r − x_cm^l) )    (3.1.4.6)
Figure 3.1.4.4: The inclination of the line connecting the two reference points is the coarse
estimation of the word image skew angle.
In the developed method, the word image is separated into two equal overlapping parts. In that way, the overlapping parts provide continuity between the information of the left and the right side. In the
next step the overlapping parts are calculated again for the corrected word image and the core-
region of the word is detected. In that way, the main part of the word image (inside the core-region)
is distinguished from the ascenders and descenders (see Fig. 3.1.4.5).
Figure 3.1.4.5: The overlapping parts are calculated for the corrected image and the core-region of
the handwritten word is detected.
In order to detect the core-region of the word image the algorithm of reinforced projection profiles
described in [Papandreou 2014] is used, and in that way the upper and lower baselines of the word
are defined. At this point the centers of mass of the two overlapping parts are calculated again,
though only the information inside the core-region is taken under consideration. The coordinates of
the centers of mass are now updated to (x′_cm^l, y′_cm^l) and (x′_cm^r, y′_cm^r). After defining the updated
centers of mass (see Fig. 3.1.4.6) the inclination of the line that connects them is calculated by eq.
3.1.4.7 and in that way a finer estimation of the word image skew angle is achieved.
s = tan⁻¹( (y′_cm^r − y′_cm^l) / (x′_cm^r − x′_cm^l) )    (3.1.4.7)
Figure 3.1.4.6: The inclination of the line that connects the updated reference points is a finer
estimation of the word image skew angle.
After correcting the detected skew, the second step is repeated iteratively until the finest skew estimation is reached. When the last estimation of the word image skew angle is smaller than a defined accuracy, the algorithm stops and the total skew is defined as the sum of all previously calculated values, given by the following equation:

T_s = ∑_i s_i    (3.1.4.8)

where i runs over the skew angle estimations, whose number depends on the accuracy needed, and s_i is the estimated word skew of the i-th calculation.
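The following Python sketch shows a single estimation step (Eqs. 3.1.4.6 / 3.1.4.7): the iteration, skew correction and accumulation of T_s are omitted, the optional core_rows mask stands in for the core-region detector of Section 3.1.5, and all names are illustrative.

import numpy as np

def word_skew_step(binary, core_rows=None):
    # One skew estimate from the centres of mass of two overlapping parts
    # (the first and the last two thirds of the word width).
    img = binary if core_rows is None else binary * core_rows[:, None]
    ys, xs = np.nonzero(img)
    W = binary.shape[1]
    left = xs < 2 * W / 3                            # left overlapping part
    right = xs >= W / 3                              # right overlapping part
    ycl, xcl = ys[left].mean(), xs[left].mean()      # centre of mass, left part
    ycr, xcr = ys[right].mean(), xs[right].mean()    # centre of mass, right part
    return np.degrees(np.arctan2(ycr - ycl, xcr - xcl))   # Eq. (3.1.4.6)/(3.1.4.7)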
By excluding the information outside the core-region of the word and by updating the centers of
mass considering only the remaining information, a finer estimation of the skew angle is achieved. This is due to the fact that the ascenders and descenders of a word make a major contribution to the vertical average deviation of the foreground pixels and thus erroneously affect the skew estimation.
3.1.5 Deslant
Text Line and Word Slant estimation method
The algorithm that was developed for the estimation of the slant of text lines and words takes
advantage of the orientation of the non-horizontal strokes as well as their position relative to the
word's core-region. As a first step, the word core-region is detected with the use of reinforced
horizontal black run profiles which permits to detect the core-region scan lines more accurately.
Then, the near-horizontal parts of the word are removed, and the orientation and height of the remaining non-horizontal fragments, as well as their location in relation to the word's core-region, are calculated. The word slant is estimated taking into consideration the orientation and the height of each fragment, while an additional weight is applied if a fragment is partially outside the core-region of the word: such a fragment corresponds to a part of a character stroke that has a significant contribution to the overall word slant and should by definition be perpendicular to the orientation of the word. This work is an extension of [Bozinovic 1989] and
[Papandreou 2012]; it has been published in [Papandreou 2014].
The proposed slant estimation method consists of three steps. At a first step, the word core-region is
detected with the use of horizontal black run profiles. Then, the orientation and the height of non-
horizontal fragments of the word as well as their location in relation to the word core-region are
calculated. Finally, the word slant is estimated taking into consideration the orientation and the
height of each fragment while an additional weight is applied if a fragment is outside the core-
region of the word which indicates that this fragment corresponds to a part of character stroke that
has a significant contribution to the overall word slant.
Core-region detection
In order to detect the core-region, we focus on the horizontal black runs of the word image and
introduce the reinforced horizontal black run profile 𝐻. The motivation for proposing this profile is
the need to stress the existence of long horizontal black runs as well as the number of horizontal
black runs of every line of the word image. The parts outside the core-region are most of the time vertical strokes of no significant width, or horizontal strokes lying in a sparse area with no significant concentration of foreground pixels. On the other hand, the core-region is the main area of the word where most of the information can be found; in this area there are more black-white transitions than outside the core-region. The reinforced horizontal black run profile H is defined as
follows:

H(y) = ∑_{i=0}^{B(y)} ∑_{j=0}^{L(i,y)} [B(y) · j]²    (3.1.5.1)
where B(y) is the number of black runs in the horizontal scan of line y and L(i, y) is the length of the i-th black run of line y. Afterwards, a threshold T_h is set in order to distinguish the core-region
from the ascenders and descenders.
Examples of word core-region detection are given in Fig. 3.1.5.1.
Figure 3.1.5.1: Examples of word core-region detection: (a),(d) original word images and the
detected core-regions; (b),(e) the horizontal black run profiles H() and (c),(f) the Boolean horizontal
profiles HB().
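A simplified Python sketch of the core-region detection: each scan line is scored by the number of its black runs times the sum of their squared run lengths, which follows the stated motivation but is only a stand-in for the exact profile of Eq. (3.1.5.1); the fraction-of-maximum threshold is likewise an assumption, not the T_h used in the project.

import numpy as np

def core_region_rows(binary, frac=0.2):
    # Reinforced horizontal black run profile (simplified) and core-region mask.
    H = np.zeros(binary.shape[0])
    for y, row in enumerate(binary):
        padded = np.concatenate(([0], row.astype(int), [0]))
        starts = np.flatnonzero(np.diff(padded) == 1)    # black run starts
        ends = np.flatnonzero(np.diff(padded) == -1)     # black run ends
        lengths = ends - starts                          # run lengths L(i, y)
        H[y] = len(lengths) * np.sum(lengths ** 2)       # stress run count and long runs
    return H >= frac * H.max()                           # boolean mask of core-region scan lines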
Calculation of the orientation of non-horizontal fragments
As a next step we remove all scan lines that contain either horizontal or almost horizontal strokes.
For a given word or text line image, all scan lines which contain at least one black run with length greater than a threshold are removed, where the threshold is defined as follows for word images:

2.5 × l    (3.1.5.2)

while for text lines we have:

5 × l    (3.1.5.3)

where l is the modal value of all horizontal black run lengths L() and corresponds to the character stroke width. The factor 2.5 is used in word images in order not to exclude neighboring connected strokes with width approximately l. On the other hand, in text line images the factor is set to 5 so that most of the scan lines are not removed. An example of removing word
image scan lines is shown in Fig. 3.1.5.2.
Portions of each strip separable by vertical lines are isolated in boxes. For each box, the centers of
gravity for its upper and lower halves are computed and connected. In case of an empty half, with
no foreground pixels, the whole box is disregarded. The slant 𝑠𝑖 of the connecting line defined by
the centers of gravity for upper and lower halves of box 𝑖 corresponds to the slant of this box.
Figure 3.1.5.2: Image boxes that contribute to word slant: (a) original word images and the detected
core-region; (b) scan lines removed; (c) detected image boxes and (d) valid image boxes.
Overall word slant estimation
The weighted average of 𝑠𝑖 for all boxes is used to calculate the slant 𝑆 of the word image as
follows:
S = ( ∑_{i=1}^{b} s_i · h_i · c_i ) / ( ∑_{i=1}^{b} h_i · c_i )    (3.1.5.4)

where b is the number of valid boxes, h_i is the height of box i and c_i is a parameter used to weigh the contribution of box i based on its position in relation to the word core-region. Parameter c_i is used to assign a weight of 1 if the box is inside the core-region and 2 otherwise. The rationale behind this
parameter is that character parts outside the core region have a significant contribution to the overall
word slant. As we observe in Fig. 3.1.5.3, the boxes that bound the largest fragments and lie outside the core region contain the strokes that should be perpendicular to the orientation of the word, and consequently our algorithm considers such strokes as dominant.
Figure 3.1.5.3: The fragments whose bounding boxes have a significant height and lie outside the core region are those that by definition should be perpendicular to the orientation of the word.
3.2 Image Segmentation
3.2.1 Detection of main page elements
Detection of main text zones
In the approach developed in the first year of the project we followed a two-step procedure in order
to detect the main text zones. First, vertical lines are detected based on a fuzzy smoothing method.
Then, we process the information appearing in the vertical white runs of the image. Based on those
two steps we are able to detect the main text zones that may appear in one or two columns.
Detection of vertical lines
In this step, we first proceed to a vertical smoothing based on the Run Length Smoothing
Algorithm (RLSA) [Wahl 1982] in order to connect vertical broken lines (see Fig. 3.2.1.1). Then, we detect pixels that belong to long vertical lines by applying the vertical fuzzy RLSA [Shi 2004]. The
fuzzy RLSA measure is calculated for every pixel on the initial image and describes “how far one
can see when standing at a pixel along vertical direction”. By applying this measure, a new
grayscale image is created, which is then binarized, and the lines of text are extracted from the new
image (see Fig. 3.2.1.2).
Figure 3.2.1.1: An example of a part of an image before (a) and after (b) vertical smoothing.
Figure 3.2.1.2: Vertical fuzzy RLSA: (a) original image; (b) the fuzzy RLSA result.
In order to detect vertical lines, we proceed to a vertical projection of the resulting image. Since we
want to focus on the vertical black runs of the document image we define the vertical black run
profile V(x) as follows:
V(x) = ∑_{i=0}^{B(x)} [L(i, x)]²    (3.2.1.1)
where B(x) is the number of black runs in the vertical scan of column x and L(i,x) is the length of
the i-th black run of column x.
An example of vertical projections is presented in Fig. 3.2.1.3(a). By applying a threshold to these
projections and calculating the minimum and maximum image points that contribute to the
corresponding vertical lines, we can define the vertical line segments of the image (see Fig.
3.2.1.3(b)). The main text body can be defined as the zone between those consecutive vertical lines
that have maximum distance between them (see Fig. 3.2.1.3(b)).
Figure 3.2.1.3: (a) vertical projections of image in Fig. 3.2.1.2 and (b) the resulting vertical lines as
well as the detected main text zone
The same procedure can be followed to detect horizontal lines. Combining information about
horizontal and vertical lines can lead to the segmentation of the image based on the defined grid as
well as to the detection of the main text region, the title and the side notes zone. Some preliminary
results for this procedure are presented in Fig. 3.2.1.4.
Figure 3.2.1.4: (a) original image; (b) detection of the main text region, the title and the side notes
zone
Processing vertical white runs
Processing the background of the image can be used to efficiently detect the main text zone. In our
approach, we first count the vertical white runs of the image that are of significant length
(>image_height/3). Some examples of this procedure for single and two-column documents are
presented in Fig. 3.2.1.5.
Figure 3.2.1.5: (a),(c) original images; (b),(d) their long vertical white runs (in red) and the
detected main text zones
At a next step we detect and tag image columns that do not have long vertical white runs. If we
calculate and sort the lengths of all runs of successive tagged columns, and the two longest lengths have almost the same value, the document is classified as a two-column document. Otherwise, the main
text area is detected. Examples of both cases are presented in Fig. 3.2.1.5.
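The following Python sketch illustrates this procedure; the thresholds (white runs longer than one third of the image height, two zones considered of "almost the same" length when their ratio exceeds 0.9) and the function name are assumptions.

import numpy as np

def main_text_zones(binary):
    # Tag columns without a long vertical white run and keep the longest runs
    # of tagged columns as candidate text zones.
    h, w = binary.shape
    tagged = np.zeros(w, dtype=bool)
    for x in range(w):
        col = np.concatenate(([1], binary[:, x].astype(int), [1]))   # pad with foreground
        starts = np.flatnonzero(np.diff(col) == -1)                  # white run starts
        ends = np.flatnonzero(np.diff(col) == 1)                     # white run ends
        long_run = len(starts) > 0 and np.max(ends - starts) > h / 3
        tagged[x] = not long_run
    edges = np.flatnonzero(np.diff(np.concatenate(([0], tagged.astype(int), [0]))))
    runs = sorted(zip(edges[1::2] - edges[0::2], edges[0::2], edges[1::2]), reverse=True)
    if not runs:
        return "no text zone", []
    if len(runs) >= 2 and runs[1][0] >= 0.9 * runs[0][0]:
        return "two-column", [runs[0][1:], runs[1][1:]]              # (start, end) column zones
    return "single-column", [runs[0][1:]]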
Segmentation of structured documents using stochastic grammars
Given an image of a page containing a sequence of marriage registers, the page segmentation is
performed in two steps. First, the image is divided into square cells of pixels. For each cell we use
a model to compute the probability of belonging to one of the considered classes. Then those
probabilities are parsed using a grammar that accounts for the structure of the document in order to
obtain the most likely segmentation.
Text Classification Features
Following the outline in [Cruz 2012] we used two different sets of features, Gabor features as texture
descriptors and Relative Location Features [Gould 2008] for classifying small regions of pixels.
According to the structure of the registers (see Fig. 3.2.1.6) there are four possible classes: name,
body, tax and background.
Figure 3.2.1.6: Sample image of part of a page with two registers. Each register has three parts:
Name (green), Body (Blue) and Tax (Red).
A common procedure to analyze the texture information of a document image is the use of a multi-
resolution filter bank in order to compute a set of responses for several orientations and frequencies
of the filter. For this research we defined the set of features ℝ𝑚× which corresponds with the
set of Gabor filter responses computed for m frequencies and n orientations at pixel level.
Relative Location Features (RLF) were introduced by Gould et al. in [Gould 2008] as a way to
encode inter-class spatial relationships as local features. Each of these features models the probability of assigning the class label c to a region z, taking into account the information provided by the rest of the image regions about their positions and their initial label predictions.
Once we obtained the set of features, we calculated the probability P (c|z) that region z belongs to
class c by computing the probability density function for each possible class using a Gaussian
mixture model (GMM).
2D Stochastic Context-Free Grammars
We define a 2D Stochastic Context-Free Grammar (2D-SCFG) as a generalization of SCFG that is
able to deal with bidimensional matrices [Alvaro 2013]. The binary rules of a 2D SCFG have an
additional parameter r ∈ {H, V} that describes a spatial relation: horizontal concatenation (H) or
vertical concatenation (V). Given a rule A →ʳ B C, the combined subproblems must be arranged
according to the spatial relation constraint.
Given a page image, the problem is to obtain the most likely parsing according to a 2D SCFG. For
this purpose, the input page is considered as a bidimensional matrix I with dimensions w × h, and
each cell of the matrix can be either a pixel or a cell of 𝑑 × 𝑑 pixels. Then, we define an extension
of the well-known CYK algorithm to account for bidimensional structures. We basically extended
the algorithm described in [Crespi 2008] to include the stochastic information of our model.
The CYK algorithm for 2D SCFG is based on building a parsing table Ƭ. Each hypothesis accounts
for a region z_{x,y;x+i,y+j}. Each region z is defined as a rectangle delimited by its top-left corner (x, y) and its bottom-right corner (x+i, y+j), and l_{i×j} represents a subproblem of size i×j. An element of the parsing table Ƭ_{x,y;x+i,y+j}[A] represents that nonterminal A can derive the region z_{x,y;x+i,y+j} with a
certain probability.
The parsing process is defined as a dynamic programming algorithm. The derivation probability of
a certain region is marginalized according to the terminal classes c and taking into account the size
of the regions. The most likely parsing is approximated by maximizing the posterior probability and
by considering usual assumptions, the parsing table is initialized for each region z of size 1x1 as
Ƭ_{x,y;x+1,y+1}[A] = ∑_c P(A → c) · P(c | z_{x,y;x+1,y+1}) · P(l_{1×1} | A)    (3.2.1.2)

where P(A → c) is the probability of the model for terminal c; P(c | z) represents the probability that region z belongs to class c (described in the previous section); and P(l_{1×1} | A) is the probability that nonterminal A derives a subproblem of size 1×1. Second, the algorithm continues analyzing new
hypotheses of increasing size using the binary rules of the grammar. The general case is computed
for all i and j, 2 ≤i≤w; 2≤j≤h as
Ƭ_{x,y;s,t}[A] = max_{A →ʳ B C} max_{z_B, z_C} P(A →ʳ B C) · Ƭ_{z_B}[B] · Ƭ_{z_C}[C] · P(l_{i×j} | A)    (3.2.1.3)

where a new subproblem Ƭ_{x,y;s,t}[A] is computed from two smaller subproblems Ƭ_{z_B}[B] and Ƭ_{z_C}[C], and the probability is maximized over every possible vertical and horizontal decomposition of the region z_{x,y;s,t} into subregions z_B and z_C consistent with the spatial relation r.
The 2D SCFG provides syntactic and spatial constraints P(A →ʳ B C), and we have also included the expected size for a nonterminal, P(l_{i×j} | A), as a source of information.
The expected size P(l_{i×j} | A) has two effects on the parsing process. First, it helps to model the
spatial relations among every part of a given page because there is a specific nonterminal for each
zone of interest. This can be seen in Fig. 3.2.1.6 where the expected size of the background region
on top of the page will be different from the expected size of the background zone over a Tax
region. Furthermore, many unlikely hypotheses are pruned during the parsing process due to its size
information; hence, it speeds up the algorithm.
Finally, the most likely parsing of the full input page is obtained in Ƭ_{0,0;w,h}[S]. The time complexity of the algorithm is O(w³h³|R|) and the spatial complexity is O(w²h²).
Page Split
Our methodology detects the optimal page frames of double page document images and splits the
image into the two pages as well as removes noisy borders. It is based on white run projections
which have been proved efficient for detecting zones of text areas. It is an extension of
[Stamatopoulos 2010] with special focus on historical handwritten documents.
The page split methodology consists of three distinct steps. At a first step, a pre-processing which
includes noise removal and image smoothing based on ARLSA [Nikolaou 2010] is applied. At a
next step, the vertical zones of the two pages are detected. Finally, the frame of both pages is
detected after calculating the horizontal zones for each page.
In order to detect the vertical zones we focus on the white pixels of the image and introduce the
vertical white run projections HV() which have been proved efficient for detecting vertical zones of
text areas. The motivation for proposing these projections is the need to stress the existence of long
vertical white runs in the image. The vertical white run projections HV() are defined as follows:
HV(x) = ∑_{j=1}^{wv_i} (y_{j2} − y_{j1})² / ((1 − 2·a) · I_y)    (3.2.1.4)

where wv_i is the number of white runs (i, y_{j1})–(i, y_{j2}) of column x = i in the range y = a·I_y … (1 − a)·I_y, I_y denotes the image height and HV(x) ∈ [0 … (1 − 2·a)·I_y].
Figure 3.2.1.7: (a) Binary image (b) vertical white pixel projections and (c) vertical white run
projections.
An example of the vertical white run projections compared to classical vertical white pixel
projections is given in Fig. 3.2.1.7. In this figure it is demonstrated that using white run projections
we can better discriminate text from non-text vertical zones. We can safely consider that if
HV(x)<(1-2*a)Iy/2 then the corresponding image column may contain page information. Also, if we
have n consecutive image columns that satisfy this condition, with n > b·Ix, then a vertical page zone is
detected. Although detecting two vertical page zones is the most common case, we examine all the
cases where more than two, just one or no vertical page zones are detected. Once the vertical page
zones have been detected, a similar process is applied in order to detect the horizontal page zones.
An example of the developed page split methodology is given in Fig. 3.2.1.8.
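A Python sketch of the vertical zone detection based on Eq. (3.2.1.4); the default values of the parameters a and b are assumptions, as is the simple run-length scan used to enumerate vertical white runs.

import numpy as np

def vertical_page_zones(binary, a=0.05, b=0.05):
    # HV(x) of Eq. (3.2.1.4) and detection of vertical page zones.
    Iy, Ix = binary.shape
    lo, hi = int(a * Iy), int((1 - a) * Iy)
    HV = np.zeros(Ix)
    for x in range(Ix):
        col = np.concatenate(([1], binary[lo:hi, x].astype(int), [1]))
        starts = np.flatnonzero(np.diff(col) == -1)                  # white run starts
        ends = np.flatnonzero(np.diff(col) == 1)                     # white run ends
        HV[x] = np.sum((ends - starts) ** 2) / ((1 - 2 * a) * Iy)
    candidate = HV < (1 - 2 * a) * Iy / 2            # columns that may contain page content
    zones, start = [], None
    for x, c in enumerate(np.concatenate((candidate, [False]))):
        if c and start is None:
            start = x
        elif not c and start is not None:
            if x - start > b * Ix:                   # long enough sequence of columns
                zones.append((start, x))
            start = None
    return zones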
Figure 3.2.1.8: Page Split example: (a) original image; (b) binary image (c) image after page split.
3.2.2 Text Line Detection
Baseline Detection
The text line analysis and detection (TLAD) system used in this task is based on HMM and has four
main phases (see Fig. 3.2.2.1): image preprocessing, feature extraction, HMMs and LM training and
decoding. All of the above phases are required, at some level, for the application of pattern
recognition techniques to a problem. Although these processing steps are described in other sections
of this deliverable, it is important to note that our TLAD system uses specific software to perform
them.
Figure 3.2.2.1: Global scheme of the handwritten text line detection process.
Preprocessing
The input data for TLAD are page images. Hence as a first step each image undergoes conventional
noise reduction, geometry normalization and other preprocessing procedures in order to meet the
assumptions of the approach.
The preprocessing phase is composed of several techniques that are applied on the input page
images in order to suppress unwanted issues (noise, background, bleed through, distortions, etc.)
and/or to enhance some image features (specific for TLAD), which are important for subsequent
feature extraction and line detection processes.
In Fig. 3.2.2.2 we can see the step-by-step results of the successive application of each
preprocessing technique to a sample portion of a page image.
As one can see, in the last step shown in Fig. 3.2.2.2 we smear the line with the RLSA algorithm.
This is performed in order to enhance the text line regions previous to the feature extraction phase.
Feature extraction
Since our TLAD approach is based on continuous density HMMs, each preprocessed image must be
represented as a sequence of feature vectors. This is done by dividing the image resulting after
applying RLSA into D non-overlapping vertical rectangular regions all with height equal to the
image-height, L.
For each pixel line of each rectangular region we calculate the projection profile of the black pixels
and smooth it with a rolling average filter. By stacking, for each of the L pixel lines of the page image, the values of the projection profiles of the D regions, we obtain a sequence of L feature vectors of dimension D that represents the image.
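A Python sketch of this feature extraction step; the number of regions D, the width of the rolling average filter and the function name are assumptions.

import numpy as np

def tlad_features(smeared, D=20, window=11):
    # One D-dimensional feature vector per pixel line of the RLSA-smeared page:
    # per-stripe horizontal projection profiles smoothed with a rolling average.
    stripes = np.array_split(smeared.astype(float), D, axis=1)
    profiles = np.stack([s.sum(axis=1) for s in stripes], axis=1)    # shape (L, D)
    kernel = np.ones(window) / window
    return np.apply_along_axis(lambda p: np.convolve(p, kernel, mode="same"), 0, profiles)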
Figure 3.2.2.2: Figure shows the effects of each of the preprocessing techniques on a handwritten
text image and the final feature extraction
Modelling and decoding
Analogous to the speech recognition and handwritten text recognition (ASR,HTR) problems, the
handwritten text line detection problem can be formulated as the problem of finding a most likely
text line sequence hypothesis, ĥ = ⟨h₁, h₂, …, h_N⟩, for a given handwritten page image (or selected
region image) represented by an observation sequence1, that is:
1Henceforward, in the context of this formal framework, each time we mention the image of a page or selected text, we
are implicitly referring to its input feature vector sequence “o” describing it
ĥ = argmax_h P(o | h) · P(h)    (3.2.2.1)
The above formula (Please read [Bosch 2012] for more details) can be optimally solved by using
the Viterbi search algorithm [Jelinek 1998]. The solution for a given page image not only provides
us the most likely text line sequence hypothesis ĥ, but also yields the physical location of the text
line by making explicit the “hidden” alignment b (a sequence of marks in the page) between the
input feature vector subsequence and the text line sequence hypothesis (see Fig. 3.2.2.3).
Our line detection approach is based on two modelling levels: morphological and syntactical. The
morphological level, expressed as the P(o|h) term (see Eq. 3.2.2.1) is modelled by using HMMs and
is in charge of explaining the different vertical line region classes that appear on the input images
along its vertical direction. In our line detection approach four different kinds of vertical regions are
defined:
Normal text Line-region (NL): Region occupied by the main body of a normal handwritten text
line.
Inter Line-region (IL): Defined as the region found within two consecutive normal text lines,
characterized by being crossed by the ascenders and descenders belonging to the adjacent text lines.
Blank Line-region (BL): Large rectangular region of blank space usually found at the start and end
of page image (top and bottom margins).
Non-text Line-region (NT): Stands for everything which does not belong to any of the other
regions.
Figure 3.2.2.3: Schematic representation for a page image extraction of L feature vectors of D
components.
Figure 3.2.2.4: Examples of the different classes of line regions.
We model each of these regions by an HMM which is trained with instances of such regions.
Basically, each line-region HMM is a stochastic finite-state device that models the succession of
feature vectors extracted from instances of the specific line-region images. In turn, each HMM state
generates feature vectors following an adequate parametric probabilistic law; typically a mixture of
Gaussian densities. The adequate number of states and Gaussians per state may be conditioned by
the available amount of training data. Once an HMM “topology” (number of states and structure)
has been adopted, the model parameters can be easily trained from instances (sequences of features
vectors) of full images containing a sequence of line-regions (without any kind of segmentation)
accompanied by the reference labels of these images that correspond to the actual sequence of line-
region classes. This training process is carried out using a well-known instance of the EM algorithm
called forward-backward or Baum-Welch re-estimation [Jelinek 1998]. The syntactic modelling
level, expressed as the P(h) term (see Eq.(3.2.2.1)), is responsible for defining the way that the
different line regions can be concatenated in order to produce a valid page structure. It is worth
noting that at this level, NL and NT line regions are always forced to be followed by IL region:
NL+IL and NT+IL. We can also use the LM to impose restrictions about the minimum or maximum
number of line-regions to be detected. The LM for our text line detection approach is implemented
as a stochastic finite state grammar (SFSG) which recognizes valid sequences of elements (line
regions): NL+IL, NT+IL and BL.
Both modelling levels, morphological and syntactical, which are represented by finite-state
automata, can be integrated into a single global model on which Eq. (3.2.2.1) is easily solved; that
is, given an input sequence of raw feature vectors, an output string consisting of the recognized
sequence of line region class labels along with their corresponding coordinate positions is obtained.
It should be mentioned that, in practice, HMM and LMs probabilities (usually expressed in
logarithms) are generally “balanced” before being used in Eq. (3.2.2.1). This is carried out by using
a “Grammar Scale Factor” (GSF), the value of which is tuned empirically.
Polygon Based Text Line Segmentation
The proposed method for text line segmentation in handwritten documents is an extension of the
work presented in [Louloudis 2008] and consists of three main steps. The first step includes
binarization, vertical line removal, connected component extraction, average character height
estimation and partitioning of the connected component domain into three distinct spatial sub-
domains. In the second step, a block-based Hough transform is used for the detection of potential
text lines, while a third step is used to correct possible splitting, to detect possible text lines which
the previous step did not reveal and, finally, to separate vertically connected parts and assign them
to text lines. A detailed description of these stages is given in the following subsections. For a more
detailed description of the method the interested reader should see [Louloudis 2008]. In this
extended method, a novel vertical character separation algorithm is presented.
Preprocessing
The preprocessing step consists of four stages. First, an adaptive binarization and vertical line
removal is applied. Then, the connected components of the binary image are extracted and the
bounding box coordinates for each connected component are calculated. The average character
height AH for the whole document image is calculated. We assume that the average character height
equals the average character width AW.
The connected components domain contains components of different profiles with respect to width
and height, since it is frequent to have components describing one character, multiple characters, a
whole word, accents and characters from adjacent touching text lines. The aforementioned
connected components variability has motivated us to divide the connected component domain into
different sub-domains, in order to deal with these categories separately. More specifically, in the
proposed approach, we divide the connected components domain into three distinct spatial sub-
domains denoted as “Subset 1”, “Subset 2” and “Subset3”.
“Subset 1” (corresponding to the majority of the normal character components) contains all
components with size which satisfies the following constraints:
0.5·AH ≤ H ≤ 3·AH  AND  0.5·AW ≤ W    (3.2.2.2)
where H, W denote the component’s height and width, respectively, and AH, AW denote the average
character height and the average character width, respectively. The motivation for “Subset 1”
definition is to exclude accents and components that are large in height and are likely to belong to
more than one text line.
“Subset 2” contains all large connected components. Large components are either capital letters or
characters from adjacent text lines touching. The size of these components is described by the
following equation:
H > 3·AH    (3.2.2.3)
The motivation for “Subset 2” definition is to grasp all connected components that exist due to
touching text lines. We assume that the corresponding height will exceed three times the average
character height.
Finally, “Subset 3” should contain characters as accents, punctuation marks and small characters.
The equation describing this set is:
(H < 3·AH AND W < 0.5·AW)  OR  (H < 0.5·AH AND W ≥ 0.5·AW)    (3.2.2.4)
The motivation for ‘Subset 3’ definition is that accents usually have width less than half the average
character width or height less than half the average character height.
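As a compact illustration of the three sub-domains (Eqs. 3.2.2.2–3.2.2.4), a small Python helper is sketched below; the function name is ours and the strict/non-strict inequalities follow the reconstruction given above.

def classify_component(H, W, AH, AW):
    # Assign a connected component of height H and width W to a sub-domain,
    # given the average character height AH and width AW.
    if 0.5 * AH <= H <= 3 * AH and W >= 0.5 * AW:
        return "Subset 1"        # normal character components
    if H > 3 * AH:
        return "Subset 2"        # large components, possibly touching text lines
    return "Subset 3"            # accents, punctuation marks, small characters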
Hough Transform Mapping
In this step, the Hough transform takes into consideration only connected components that belong to
sub-domain “Subset 1”. Selection of this sub-domain is made for the following reasons: (i) It is
guaranteed that components which span across more than one text line will not vote in the Hough
domain; (ii) it rejects components, such as accents, which have a small size. This avoids false text
line detection by connecting all the accents above the core text line.
In our approach, instead of having only one representative point for every connected component, a
partitioning is applied for each connected component lying in “Subset 1”, in order to have more
representative points voting in the Hough domain. In particular, every connected component lying
in this subset is partitioned to equally-sized blocks. The number of the blocks is defined by the
following equation:
N_b = W_c / AW    (3.2.2.5)

where W_c denotes the width of the connected component and AW the average character width of all
connected components in the image. For a detailed description of this step see [Louloudis 2008].
Postprocessing
The post-processing step consists of two stages. At the first stage (i) a merging technique over the
result of the Hough transform is applied to correct possible false alarms and (ii) connected
components of “Subset 1” that were not clustered to any text line are checked to see whether they
create a new text line that the Hough transform did not reveal.
The second post-processing stage deals with components lying in “Subset 2”. This subset includes
components whose height exceeds three times the average height AH. All components of this subset
mainly belong to n detected text lines (n>1). Our novel method for splitting these components
consists of the following:
(A) Calculate yi, which are the average y values of the intersection of detected line i and the
connected component’s bounding box (i =1..n) (see Figure 3.2.2.5a).
(B) Then, compute the skeleton of the connected component (Figure 3.2.2.5b), detect all
junction points and remove them from the skeleton (Figure 3.2.2.5c). If no junction point exists in the zone between yi and yi+1, remove all skeleton points in the center of the zone.
(C) Assign the skeleton connected components to the closest line yi (Figure 3.2.2.5d).
(D) Each pixel of the initial image (Figure 3.2.2.5a) inherits the id of the closest skeleton pixel
(Figure 3.2.2.5e).
Figure 3.2.2.5: Steps for assigning pixels of vertically connected characters to corresponding text lines (colored lines): (a)-(f) Initial image, (b)-(g)
skeleton, (c)-(h) removal of junction points, (d)-(i) assignment of skeleton connected components to closest line, (e)-(j) final result.
3.2.3 Word Detection
Word segmentation refers to the process of detecting the word boundaries starting from a text line
image. Although the reader may think that the solution to this problem is trivial, it is still considered
an open problem among researchers of handwriting recognition and document analysis areas due to
several challenges that need to be addressed. These challenges include: i) the appearance of skew
along a single text line (Figure 3.2.3.1), ii) the existence of slant (Figure 3.2.3.2), iii) the existence
of punctuation marks (Figure 3.2.3.3) and iv) the non-uniform spacing of words (Figure
3.2.3.4). The proposed word segmentation method is an extension of the method presented in
[Louloudis 2009b], adapted to historical handwritten documents.
Figure 3.2.3.1: Text line image which presents different skew angle along its width.
Figure 3.2.3.2: Example of a slanted text line image.
Figure 3.2.3.3: Part of a document image showing the existence of punctuation marks.
Figure 3.2.3.4: A document image portion showing the non-uniform spacing of words.
A word segmentation algorithm usually contains two steps. The first step deals with the
computation of the distances of adjacent components in the text line image and the second step
concerns the classification of the previously computed distances as either inter-word gaps or inter-
character gaps. For the first step, we used the Euclidean distance metric [Louloudis 2009a]. For the
classification of the computed distances we used a well-known methodology from the area of
unsupervised clustering techniques, namely Gaussian Mixtures. A more detailed description of the
two steps is provided in the following subsections.
Distance Computation
In order to calculate the distance of adjacent components in the text line image, a pre-processing
procedure is applied. The pre-processing procedure concerns the correction of the skew angle as
well as the dominant slant angle [Vinciarelli 2001] of the text line image (Figure 3.2.3.5, Figure
3.2.3.6). The computation of the gap metric is considered not on the connected components (CCs)
but on the overlapped components (OCs), where an OC is defined as a set of CCs whose projection
profiles overlap in the vertical direction (see Figure 3.2.3.7).
We define as distance of two adjacent overlapped components (OCs) their Euclidean distance. The
Euclidean distance between two adjacent overlapped components is defined as the minimum among
the Euclidean distances of all pairs of points of the two adjacent overlapped components (Figure
3.2.3.8). For the calculation of the Euclidean distance we apply a fast scheme that takes into
consideration only a subset of the pixels of the left and right OCs instead of the whole number of
black pixels. In order to define the subset of pixels of the left OC, we include in this subset the
rightmost black pixel of every scanline. The subset of pixels for the right OC is defined by
including the leftmost black pixel of every scanline. Finally, the Euclidean distance of the two OCs
is defined as the minimum of the Euclidean distances of all pairs of pixels.
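A Python sketch of this distance computation; the two overlapped components are assumed to be given as binary masks in the same (page) coordinate system, and the function name is illustrative.

import numpy as np

def oc_distance(left_oc, right_oc):
    # Minimum Euclidean distance between the rightmost pixels per scanline of
    # the left OC and the leftmost pixels per scanline of the right OC.
    ys_l, xs_l = np.nonzero(left_oc)
    ys_r, xs_r = np.nonzero(right_oc)
    pts_l = np.array([(y, xs_l[ys_l == y].max()) for y in np.unique(ys_l)], float)
    pts_r = np.array([(y, xs_r[ys_r == y].min()) for y in np.unique(ys_r)], float)
    diffs = pts_l[:, None, :] - pts_r[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()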
Figure 3.2.3.5: (a) a text line image and (b) the same text line after skew correction.
Figure 3.2.3.6: (a) a text line image and (b) the same text line after slant correction.
Figure 3.2.3.7: The overlapped components of the text line image are defined based on the
projection profile.
Figure 3.2.3.8: The arrows define the Euclidean distances between adjacent overlapped
components.
Gap Classification
Mixture model based clustering relies on the idea that each cluster is mathematically represented by a parametric distribution. We have a two-cluster problem (inter-word and intra-word gaps), so every cluster is modeled with a Gaussian distribution. The algorithm that is used to calculate the
parameters for the Gaussians is the Expectation Maximization (EM) algorithm. We use this
methodology since Gaussian Mixture Modeling is a well-known unsupervised clustering technique
with many advantages which include: (i) the mixture model covers the data well, (ii) an estimation
of the density for each cluster can be obtained and (iii) a “soft” classification is available. For a
detailed description of Gaussian Mixtures, the interested reader is referred to [Marin 2005]. For the
calculation of the number of parameters and the number of Gaussians we used the software package
CLUSTER, which implements an unsupervised algorithm for modeling Gaussian mixtures
[https://engineering.purdue.edu/~bouman/software/cluster/].
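The following Python sketch performs the same two-cluster classification with scikit-learn's GaussianMixture, used here only as a stand-in for the CLUSTER package; the convention that the component with the larger mean models the inter-word gaps is an assumption.

import numpy as np
from sklearn.mixture import GaussianMixture

def classify_gaps(distances):
    # Fit a two-component Gaussian mixture to the OC distances of a text line
    # and return a boolean array: True where a gap is classified as inter-word.
    d = np.asarray(distances, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(d)
    inter_word = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict(d) == inter_word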
3.3 Handwritten Text Recognition
The HTR system used here is based on hidden Markov models (HMM) and N-grams. A complete
description of the system can be found in [Toselli 2004],[Vidal 2012]. It follows the classical
architecture composed of three main modules (see Figure 3.3.1): document image preprocessing, line image feature extraction and model training/decoding. The preprocessing module is in charge of reducing the noise of the images, enhancing the quality of text regions, dividing the image into lines and reducing the variability of text styles. This module has been studied in the previous sections (T3.1 Image pre-processing and T3.2 Image segmentation). Therefore, in this section we focus on the rest of the modules. In the feature extraction module a feature vector sequence is obtained as the representation of a handwritten text image. The training module is in charge of training the different models used in the decoding step: HMMs, language models and lexical models. Finally, the decoding module obtains the most likely word sequence for a sequence of feature vectors.
Figure 3.3.1: Diagram representing the different modules of a handwritten text recognition system.
Feature extraction
Each preprocessed line image is represented as a sequence of feature vectors using a sliding
window with a width of W pixels that sweeps across the text line from left to right in steps of T
pixels. At each of the M positions of the sliding window, N features are extracted.
The features used here are known as PRHLT-features and were introduced in [Toselli 2004]. The
sliding window is vertically divided into R cells, where R is tuned empirically and T is the height of
the line image divided by R. Then W is defined as W = 5T. In each cell, three features are
calculated: the normalized gray level, the horizontal and vertical components of the grey level
gradient. The feature vector is constructed for each window by stacking the three values of each cell
(N = 3R). Hence, at the end of this process, a sequence of (3R)-dimensional feature vectors is
obtained.
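A simplified Python sketch of PRHLT-style feature extraction: it divides the line into R vertical cells and computes, per image column, the normalised grey level and the two grey-level gradient components of each cell, i.e. one window per column rather than the exact W = 5T windowing with step T described above; names and defaults are assumptions.

import numpy as np

def prhlt_features(line_img, R=20):
    # One (3R)-dimensional feature vector per analysis position (here: per column).
    img = line_img.astype(float) / 255.0
    gy, gx = np.gradient(img)                        # vertical / horizontal gradients
    cells = np.array_split(np.arange(img.shape[0]), R)
    feats = []
    for rows in cells:
        feats.append(img[rows].mean(axis=0))         # normalised grey level
        feats.append(gx[rows].mean(axis=0))          # horizontal gradient component
        feats.append(gy[rows].mean(axis=0))          # vertical gradient component
    return np.stack(feats, axis=1)                   # shape: (columns, 3R)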
Modelling and decoding
Given a handwritten line image represented by a feature vector sequence, x = x1x2 ...xm, the HTR
problem can be formulated as the problem of finding the most likely word sequence, ŵ = ŵ₁ ŵ₂ … ŵ_n, i.e., ŵ = argmax_w Pr(w | x). Using Bayes’ rule:
ŵ = argmax_w Pr(x | w) · Pr(w)    (3.3.1)
Pr(x|w) is typically approximated by concatenated character models, usually Hidden Markov
Models [Jelinek 1998], and Pr(w) is approximated by a word language model, usually n-grams
[Jelinek 1998].
In practice, the simple multiplication of P(x|w) and P(w) needs to be modified in order to balance
the absolute values of both probabilities. The most common modification is to usea language weight
α (Grammar Scale Factor, GSF), which weights the influence of the language model on the
recognition result, and an insertion penalty β (Word Insertion Penalty, WIP), which helps to control
the word insertion rate of the recognizer [Ogawa 1998]. In addition, usually log-probabilities are
used to avoid the numeric underflow problems that can appear using probabilities. So, Eq. 3.3.1can
be rewritten as:
ŵ = argmax_w { log Pr(x | w) + α · log Pr(w) + β · |w| }    (3.3.2)

where |w| denotes the number of words in w, and α and β are optimized over all the training sentences of the corpus.
Each character is modelled by a continuous density left-to-right HMM, which models the horizontal
succession of feature vectors representing this character. Fig. 3.3.2 shows an example of how a
HMM models two feature vector subsequences corresponding to instances of the character “a”. A Gaussian
mixture model governs the emission of feature vectors in each HMM state. The required number of
Gaussians (N_G) in the mixture depends, along with many other factors, on the vertical variability typically associated with each state. On the other hand, the adequate number of states (N_S) to
model a certain character depends on the underlying horizontal variability. The number of states and
Gaussians define the total amount of parameters to be estimated. Therefore, these numbers need to
be empirically tuned to optimize the overall performance for the given amount of training vectors
available.
Figure 3.3.2: Example of 5-states HMM modelling (feature vectors sequences of) instances of the
character “a” within the Spanish word “Castilla”. The states are shared among all instances of
characters of the same class.
The model parameters can be easily trained from images of continuously handwritten text (without
any kind of segmentation) accompanied by the transcription of these images into the corresponding
sequence of characters. This training process is carried out using a well-known instance of the EM
algorithm called forward-backward or Baum-Welch.
The concatenation of characters to form words is modelled by simple lexical models. Each word is
modelled by a stochastic finite-state automaton which represents all possible concatenations of
individual characters that may compose the word. This automaton takes into account optional
character capitalizations. By embedding the character HMMs into the edges of this automaton, a
lexical HMM is obtained. These HMMs estimate the word-conditional probability P(x|w) of Eq.
3.3.1. Finally, the concatenation of words into text lines or sentences is modelled by an n-gram
language model, with Kneser-Ney back-off smoothing [Kneser 1995].
Once all the character, word and language models are available, recognition of new test sentences
can be performed. Thanks to the homogeneous finite-state (FS) nature of all these models, they can
be easily integrated into a single global FS model by replacing each word of the n-gram model by the corresponding lexical HMM. The search for decoding the input feature vector sequence x into the output word sequence ŵ is performed over this global model. This search is optimally
done by using the Viterbi algorithm [Jelinek 1998].
Word graphs
A word graph (WG) is a data structure that represents a set of strings in a very efficient way. In
handwritten recognition, a WG represents a large amount of sequences of words whose posterior
probability Pr(w|x) is large enough, according to the morphological (likelihood) and language
(prior) models used to decode a text image, represented as a sequence of feature vectors. In other
words, the word graph is just (a pruned version of) the Viterbi search trellis obtained when
transcribing the whole image sentence. A detailed description of word graphs in HTR can be found
in [Vidal 2012].
3.4 Keyword Spotting
3.4.1 The query-by-example approach
Word spotting in document images is related to Content-Based Image Retrieval systems by
searching for a word image in a set of unindexed document images using the image content as the only information source. As a final outcome, the system returns a ranked list of document image
words to the user.
There are two fundamentally different approaches to the above systems, namely, the segmentation-
based and the segmentation-free approach. As the name implies, the former requires an initial step
for partitioning the document image into word images which are subsequently matched to the query
word image based on a similarity measure. The latter is based on matching the reference query
word image directly to the document image without any imposed constraints in the search space.
In the following sections, the description of the proposed methodologies for each of the
aforementioned approaches will be detailed. As an outline, the proposed methodologies share a
common base, i.e. the computation of local document-specific features while they follow a different
path at the level of matching between the query word image and the corresponding information
which resides in the document image. This difference is due to, on the one hand, the availability of
segmented word images (segmentation-based case) and on the other hand, the lack of information
about the localization of word images since no segmentation has been applied (segmentation-free
case).
KeyPoints Detection
Figure 3.4.1.1 shows the consecutive steps of the proposed methodology for the detection of
characteristic local points (keypoints) in a document image.
[Block diagram omitted: a preprocessing stage (gradient vector and orientation quantization, median filtering, corner/edge detection, local point grouping and filtering) yields the final local points; keypoint detection (gradient orientation quantization, convex hull corner detection) followed by keypoint selection (keypoint entropy calculation and filtering) yields the final keypoints.]
Figure 3.4.1.1: The steps for the detection and selection of characteristic keypoints
Initially, the gradient vector $G_k$ of the image $k$ and its orientation $\theta(G_k)$ are calculated, defined as:

$G_k = (I^*_x, I^*_y), \qquad \theta(G_k) = \tan^{-1}\!\left(\frac{I^*_y}{I^*_x}\right)$ (3.4.1.1)

where $I^*_x$ and $I^*_y$ are calculated by the convolution of the 1-D kernels $[-1,0,1]$ and $[-1,0,1]^T$ with the image $k$, respectively. The resulting $I^*_x$ and $I^*_y$ are very sensitive to noise. In order to make them more robust, we assume that their values belong to two distinct categories: noise and meaningful data. The threshold that separates these two clusters is estimated by minimizing the intra-class variance, as in the Otsu approach [Otsu 1975]. Finally, the values that are below this dynamic threshold are rejected.
Then, the gradient orientation is calculated (Eq. 3.4.1.1). Fig. 3.4.1.2(b) shows an example of the feature $\theta(G_k)$ using a grey-level representation. The grey values correspond to zero angles, which are not taken into consideration by the algorithm in the subsequent steps. The dark colors represent negative angles, while the bright colors represent positive angles.
features are also used in character recognition. Most of the authors use 4- or 8-direction histograms
computed in zones [Fujisawa 2010], [Koerich 2002]. This first-order feature is not invariant to image rotation. We argue that incorporating invariance properties is not beneficial in document images, as it results in noise amplification. This is further supported by the observation in [Leydier 2009], where rotation-invariant features resulted in worse performance than rotation-dependent ones; the authors attribute this to rotation-invariant features being more sensitive to the noise and the complex texture of the background.
The next step involves the quantization of $\theta(G_k)$ to $n$ levels. The purpose of this step is to detect changes in the writing direction, as these points carry important and descriptive information. The number of quantization levels $n$ is a parameter of the proposed algorithm and controls the number of final local points: the more quantization levels, the more sensitive the process becomes to the writing directions. Extensive experimentation has shown that 3 levels are sufficient in the context of word spotting retrieval. Fig. 3.4.1.2(d) shows the output of the quantization of the $\theta(G_k)$ values to 3 levels.
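To make this stage concrete, the following Python sketch (our own illustration, not project code; it assumes a grayscale page image as a NumPy array and uses scikit-image only for the Otsu threshold) computes the [-1,0,1] derivatives, rejects values below the Otsu threshold, and quantizes the surviving orientations into 3 levels.

```python
import numpy as np
from scipy.ndimage import convolve
from skimage.filters import threshold_otsu

def quantized_gradient_orientation(image, n_levels=3):
    """Illustrative sketch of the gradient orientation quantization stage."""
    kernel = np.array([[-1.0, 0.0, 1.0]])
    ix = convolve(image.astype(float), kernel)    # convolution with [-1, 0, 1]
    iy = convolve(image.astype(float), kernel.T)  # convolution with [-1, 0, 1]^T

    # The Otsu threshold separates "noise" from "meaningful data"; values
    # below the dynamic threshold are rejected (cf. Eq. 3.4.1.1 context).
    ix = np.where(np.abs(ix) >= threshold_otsu(np.abs(ix)), ix, 0.0)
    iy = np.where(np.abs(iy) >= threshold_otsu(np.abs(iy)), iy, 0.0)

    magnitude = np.hypot(ix, iy)
    theta = np.arctan2(iy, ix)                    # orientation theta(G_k)

    # Quantize the orientations of the surviving pixels into n_levels bins;
    # pixels with no meaningful gradient are labelled 0 (ignored later).
    bins = np.linspace(-np.pi, np.pi, n_levels + 1)
    levels = np.digitize(theta, bins[1:-1]) + 1   # values in 1..n_levels
    return np.where(magnitude > 0, levels, 0), magnitude
```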
Examples of the connected components (CCs) attributed to each quantization level are shown in Fig. 3.4.1.2(e)-(g), respectively. These CCs represent chunks of strokes that have different writing directions. The most descriptive and important points on those CCs are the endpoints that appear along the edge. For the endpoint detection, the convex hull that contains each CC is taken into consideration and its corners are nominated as the initial keypoints kP. Figure 3.4.1.3 shows a visual representation of a CC's convex hull (in green); the red dots are the initial keypoints. The endpoint detection step is applied to each CC at each quantization level. It is worth mentioning that, in order to reduce the computational cost, small CCs (width and height less than 5 pixels) are rejected.
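A possible realization of the endpoint detection step is sketched below (again illustrative: function names and the use of SciPy's labelling and convex hull routines are our assumptions). For each quantization level, connected components are labelled, very small ones are discarded, and the convex hull vertices of each remaining component are kept as initial keypoints.

```python
import numpy as np
from scipy.ndimage import label
from scipy.spatial import ConvexHull

def initial_keypoints(quantized, min_size=5):
    """Convex hull corners of the CCs of each quantization level (sketch)."""
    keypoints = []
    for level in range(1, int(quantized.max()) + 1):
        labelled, n_cc = label(quantized == level)
        for cc in range(1, n_cc + 1):
            ys, xs = np.nonzero(labelled == cc)
            # Small CCs (width and height below 5 pixels) are rejected.
            if np.ptp(xs) < min_size and np.ptp(ys) < min_size:
                continue
            pts = np.column_stack([xs, ys])
            try:
                hull = ConvexHull(pts)
            except Exception:        # degenerate (e.g. collinear) components
                continue
            # The hull corners are nominated as initial keypoints kP (x, y).
            keypoints.extend((int(x), int(y)) for x, y in pts[hull.vertices])
    return keypoints
```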
Figure 3.4.1.2: The steps for the keypoint detection: (a) original document image, (b) orientation of
the gradient vector, (c) quantization of the gradient vector orientation (greyscale version), (d)
quantization of the gradient vector orientation (color version), (e-g) connected components of the
distinct quantization levels, (h) the initial keypoints and (i) the final keypoints
The next steps involve the selection of those kPs that mark the most descriptive information of the word. This is achieved by initially calculating the entropy of the quantized gradient angles around the keypoint using the Shannon entropy equation:

$E(G_k) = -\frac{1}{N}\sum_{i} occ(\theta_i(G_k)) \ln\frac{occ(\theta_i(G_k))}{N}$ (3.4.1.2)

where $W$ is the set of pixels in the window, $occ(\theta_i(G_k))$ is the number of occurrences of the $i$-th quantized gradient angle in $W$, and $N = \sum_{i} occ(\theta_i(G_k))$.
The kPs are finally selected starting from those that have the maximum entropy; if other kPs are found in their neighborhood of size W, those kPs are rejected. The remaining points are those with the maximum entropy in their neighborhood and, consequently, the most significant ones. The size of the window W is a parameter that directly affects the final number of keypoints produced by the detection algorithm. It is set to W = 18x18, which equals the size of the neighborhood over which the local point descriptor is calculated, as detailed in the next section. Fig. 3.4.1.2(i) shows the final keypoints kP.
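The selection step can then be sketched as follows (illustrative; window handling at image borders and the tie-breaking order are our assumptions): the entropy of Eq. (3.4.1.2) is computed in an 18x18 window around every initial keypoint, keypoints are visited in decreasing entropy order, and any keypoint falling inside the window of an already accepted one is rejected.

```python
import numpy as np

def select_keypoints(quantized, keypoints, win=18):
    """Greedy entropy-based keypoint selection (illustrative sketch)."""
    half = win // 2

    def entropy(x, y):
        patch = quantized[max(0, y - half):y + half, max(0, x - half):x + half]
        counts = np.bincount(patch[patch > 0].ravel().astype(int))
        counts = counts[counts > 0]
        n = counts.sum()
        if n == 0:
            return 0.0
        p = counts / n
        return float(-(p * np.log(p)).sum())   # Shannon entropy, Eq. (3.4.1.2)

    ranked = sorted(keypoints, key=lambda kp: entropy(*kp), reverse=True)
    selected = []
    for x, y in ranked:
        # Keep a keypoint only if no stronger one lies within its neighborhood.
        if all(abs(x - sx) >= half or abs(y - sy) >= half for sx, sy in selected):
            selected.append((x, y))
    return selected
```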
Figure 3.4.1.3: Detection of keypoints: (a) example connected components; (b) the corner points of the convex hull as the keypoints of the connected components
Feature Extraction
The feature for each local keypoint is calculated from the quantized gradient angles. An area of 18x18 pixels around the kP is divided into 9 cells of size 6x6 each, as shown in Fig. 3.4.1.4. Each cell is represented by a 3-bin histogram (each bin corresponds to a quantization level). Each pixel accumulates a vote in the corresponding angle histogram bin. The strength of the vote depends on the norm of the gradient vector and on the distance from the location of the local point, as expressed by the following equation:

$V_{x,y} = s_{x,y} \cdot \|G_{x,y}\|$ (3.4.1.3)

where $V_{x,y}$ is the contribution of pixel $(x,y)$ to the distribution of the corresponding cell histogram, $\|G_{x,y}\|$ is the norm of the gradient vector $G$, and $s_{x,y}$ denotes a linear weighting factor of the distance between pixel $(x,y)$ and the keypoint, based on the following equation:

$s_{x,y} = 1 - \frac{2\sqrt{(x - x_{LP})^2 + (y - y_{LP})^2}}{3 \cdot 9\sqrt{2}}$ (3.4.1.4)

where $(x_{LP}, y_{LP})$ denotes the position of the LP. The weighting factor values range in the interval [1/3, 1]; the maximum value is achieved at $(x, y) = (x_{LP}, y_{LP})$, while the minimum is reached at $|x - x_{LP}| = 9$, $|y - y_{LP}| = 9$. The role of $s_{x,y}$ is to weight the participation of each pixel in the histogram according to its distance from the kP.

Finally, all nine (9) histograms are concatenated into one 27-bin histogram which is normalized by its norm. In order to make the descriptor illumination independent, all values above 0.2 are clipped to 0.2 and the resulting values are re-normalized.
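The descriptor computation of Eq. (3.4.1.3)-(3.4.1.4) can be sketched as follows (illustrative; border handling and the exact cell geometry are assumptions on our part):

```python
import numpy as np

def keypoint_descriptor(quantized, magnitude, keypoint, n_levels=3):
    """27-bin descriptor of the 18x18 neighborhood of a keypoint (sketch)."""
    x0, y0 = keypoint
    hist = np.zeros((3, 3, n_levels))
    for dy in range(-9, 9):
        for dx in range(-9, 9):
            x, y = x0 + dx, y0 + dy
            if not (0 <= y < quantized.shape[0] and 0 <= x < quantized.shape[1]):
                continue
            level = int(quantized[y, x])
            if level == 0:                   # pixel with no meaningful gradient
                continue
            # Linear distance weighting s_{x,y} of Eq. (3.4.1.4):
            # 1 at the keypoint, 1/3 at the corner of the 18x18 area.
            s = 1.0 - 2.0 * np.hypot(dx, dy) / (3.0 * 9.0 * np.sqrt(2.0))
            cell_row, cell_col = (dy + 9) // 6, (dx + 9) // 6
            hist[cell_row, cell_col, level - 1] += s * magnitude[y, x]  # Eq. (3.4.1.3)

    desc = hist.ravel()                      # 9 cells x 3 bins = 27 values
    norm = np.linalg.norm(desc)
    desc = desc / norm if norm > 0 else desc
    desc = np.minimum(desc, 0.2)             # clip for illumination independence
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```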
Figure 3.4.1.4: (a) example keypoint at the global connected components level; (b) example
keypoint at the local level; (c) Neighborhood used for the computation of the descriptor at the
keypoint.
Matching in a Segmentation-based Word Spotting context
In the case of segmentation-based word spotting, the aim is to match the query keypoints to the
corresponding keypoints of any word image in the document. For this task, the descriptor that has
been presented in the previous section is taken into consideration along with a Local Proximity
Nearest Neighbor (LPNN) search. The advantage of LPNN search is two-fold: (i) it enables a search
in focused areas instead of searching in a brute force manner and (ii) it goes beyond the typical use
of a descriptor by the incorporation of spatial context in the local search addressed. In the sequel,
the complete matching step will be detailed.
Figure 3.4.1.5: (a) the query image, (b) the word image, (c) the projection of query local points to
the word image: the red lines are the matching local points; the green area is the local proximity
area of the nearest neighbor search.
First of all, it is worth noting that the description which follows assumes that the outcome of a word image segmentation method is available; no specific segmentation methodology is discussed here. In particular, for the experiments, the word image segmentation information is taken from the ground truth corpora.
The initial stage in the matching step is a normalization which is applied to every word image, including the query word image. The aim of this stage is to alleviate any scale variability for the same word. The normalization procedure comprises the following steps:

Calculation of the mean center in the x- and y-axis of the keypoint set in a word image:

$(c_x, c_y) = \left(\frac{1}{k}\sum_i p^i_x,\; \frac{1}{k}\sum_i p^i_y\right)$ (3.4.1.5)

where $(p^i_x, p^i_y)$ denotes the location of the $i$-th keypoint and $k$ denotes the total number of keypoints in the word image.

Calculation of the mean distance of the keypoints from the mean center:

$D_x = \frac{1}{k}\sum_i |p^i_x - c_x|, \qquad D_y = \frac{1}{k}\sum_i |p^i_y - c_y|$ (3.4.1.6)

Calculation of the updated location of each keypoint, which is transformed to a new space wherein $(c_x, c_y)$ is the new coordinate origin:

$p^i_x \leftarrow \frac{p^i_x - c_x}{D_x}, \qquad p^i_y \leftarrow \frac{p^i_y - c_y}{D_y}$ (3.4.1.7)
After normalization, all word images are directly comparable due to the achieved scale invariance. In the next stage, an LPNN search is performed for each keypoint that resides in the query image. The LPNN search area is computed by taking a percentage (25%) of the already calculated distances $D_x$, $D_y$. During the search, if there are one or more word keypoints in the proximity of the query keypoint under consideration, the Euclidean distance between their descriptors is calculated and the minimum distance is kept. This is repeated for each keypoint in the query image. The final similarity measure is the sum of all the minimal distances. If there is no local point in the proximity of a query keypoint, a penalty value is added to the similarity measure, equal to the maximum Euclidean distance that can occur between keypoint descriptors, which results in a value of $\sqrt{27}$.

As a final stage, the system presents to the user all the word images in ascending order of the calculated similarity measure.
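A minimal sketch of the segmentation-based matching, assuming keypoints and descriptors have already been extracted for the query and for every segmented word image (the 25% proximity rule is applied in the normalized coordinate space and the penalty is taken as sqrt(27), following the description above; names are ours):

```python
import numpy as np

def normalize_keypoints(points):
    """Scale normalization of Eq. (3.4.1.5)-(3.4.1.7) (illustrative)."""
    pts = np.asarray(points, dtype=float)
    centre = pts.mean(axis=0)                    # (c_x, c_y)
    spread = np.abs(pts - centre).mean(axis=0)   # (D_x, D_y)
    spread[spread == 0] = 1.0
    return (pts - centre) / spread

def word_similarity(query_pts, query_desc, word_pts, word_desc, proximity=0.25):
    """LPNN-based similarity; lower values mean more similar word images."""
    q_pts = normalize_keypoints(query_pts)
    w_pts = normalize_keypoints(word_pts)
    w_desc = np.asarray(word_desc)
    penalty = np.sqrt(27.0)                      # assumed maximum descriptor distance

    score = 0.0
    for qp, qd in zip(q_pts, np.asarray(query_desc)):
        near = np.all(np.abs(w_pts - qp) <= proximity, axis=1)
        if near.any():
            score += np.linalg.norm(w_desc[near] - qd, axis=1).min()
        else:
            score += penalty                     # no keypoint in the proximity area
    return score

# Word images are then presented to the user sorted by ascending similarity score.
```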
Matching in a Segmentation-free Word Spotting context
The segmentation-free word spotting scenario using the document-specific keypoint and
corresponding descriptor is an extension of the procedure described for the segmentation-based
word spotting scenario. The fundamental difference is that there is no information about the
potential word image that should be matched to the query.
To deal with this challenge, a strategy is followed which comprises the following stages:
As a first stage, the normalization described in the previous section using Eq. (3.4.1.5)-(3.4.1.6) is applied to the query word image. The computed center $(c_x, c_y)$ is not guaranteed to be a keypoint; therefore, in the case that $(c_x, c_y)$ is not a keypoint, the nearest keypoint is selected and the proposed descriptor is computed for it.

In a second stage, the similarity between the selected keypoint of the query word image and all keypoints in the document image is calculated. From the resulting similarity set, a ranked list is created, out of which the top N matches are kept for further use. N is chosen as half of the total number of keypoints in the document image.
Figure 3.4.1.6: (a) The original document (b) the initial keypoints (c) the top N keypoints.
In the sequel, for each keypoint that belongs to the top N matches, an LPNN search is performed. As described in the previous section, an LPNN search requires the computation of $(D_x, D_y)$, which in turn depends on knowing the positions of the keypoints of the word image under consideration. Since, in a segmentation-free approach, there is no knowledge about the word image, a stepwise process is used that generates multiple instances of $(D'_x, D'_y)$ over an interval guided by the $(D_x, D_y)$ computed on the query word image. In particular, $D'_x$ takes values in the range $[D_x/2, 3D_x/2]$ and $D'_y$ takes values in the range $[D_y/2, 3D_y/2]$, with steps of $D_x/10$ and $D_y/10$, respectively. In other words, there are ten (10) instances of $(D'_x, D'_y)$, for which a correlation test is used to determine the optimal $(D'_x, D'_y)$ for each keypoint that belongs to the top N matches in the document. This information permits an LPNN search to be realized and, subsequently, a matching between the keypoints of the query word image and the keypoints contained in the search area that is defined by the optimal $(D'_x, D'_y)$ and centered at one of the keypoints belonging to the top N matches.

The optimal $(D'_x, D'_y)$ for each keypoint in the document image is the one that maximizes the correlation of the histograms between the query and the document.
The correlation coefficient between the query vector q and the word vector w is

$cor(q, w) = \frac{\sum_{i=1}^{p}(q_i - \bar{q})(w_i - \bar{w})}{\sqrt{\sum_{i=1}^{p}(q_i - \bar{q})^2 \sum_{i=1}^{p}(w_i - \bar{w})^2}}$ (3.4.1.8)

where $\bar{q}$ and $\bar{w}$ are the mean values of vectors q and w, and p is the length of the vectors, equal to the number of query keypoints.
It is worth noting that, in order to use such a correlation, there must be a correspondence between the vectors. In the problem at hand, each vector is built by sorting the keypoints by their orientation θ with respect to the mean center of the search area (Figure 3.4.1.7(a),(b)). In this way a sorted list is obtained for the query word image and for each of the search areas around the keypoints in the document, wherein the value at each slot of the list equals the relative distance ρ (Figure 3.4.1.7(b)).
After finding the optimal $(D'_x, D'_y)$ for each candidate center point, the matching procedure described in the previous section (segmentation-based case) is applied, wherein a similarity measure is computed between this candidate center point and the query word image.
As a final stage, the system presents to the user all the word images based on ascending sort order
of the calculated similarity measure.
Figure 3.4.1.7: (a) The keypoints considered in an LPNN search for the query word image, (b)
keypoints position in (ρ,θ) terms.
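The correlation test described above can be sketched as follows (a rough illustration under strong assumptions: the (ρ, θ) profiles are truncated or zero-padded to a common length so that the Pearson correlation of Eq. (3.4.1.8) can be applied, which is our simplification rather than the project's exact procedure):

```python
import numpy as np

def rho_profile(points, centre):
    """Distances rho of keypoints from `centre`, sorted by their angle theta."""
    pts = np.asarray(points, dtype=float) - np.asarray(centre, dtype=float)
    theta = np.arctan2(pts[:, 1], pts[:, 0])
    return np.hypot(pts[:, 0], pts[:, 1])[np.argsort(theta)]

def best_search_area(query_pts, query_centre, doc_pts, candidate, Dx, Dy, steps=10):
    """Choose (D'_x, D'_y) maximizing the profile correlation (illustrative)."""
    q = rho_profile(query_pts, query_centre)
    doc = np.asarray(doc_pts, dtype=float)
    best, best_cor = (Dx, Dy), -np.inf
    # Ten joint instances of (D'_x, D'_y) in [D/2, 3D/2], cf. the text above.
    for dx, dy in zip(np.linspace(Dx / 2, 3 * Dx / 2, steps),
                      np.linspace(Dy / 2, 3 * Dy / 2, steps)):
        inside = (np.abs(doc[:, 0] - candidate[0]) <= dx) & \
                 (np.abs(doc[:, 1] - candidate[1]) <= dy)
        if inside.sum() < 2:
            continue
        w = rho_profile(doc[inside], candidate)
        w = np.pad(w, (0, max(0, len(q) - len(w))))[:len(q)]   # equalize lengths
        cor = np.corrcoef(q, w)[0, 1]                          # Eq. (3.4.1.8)
        if np.isfinite(cor) and cor > best_cor:
            best_cor, best = cor, (dx, dy)
    return best
```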
3.4.2 The query-by-string approach
We outline the basic character-lattice (CL) concepts and present the proposed CL-Based KWS
HMM-Filler approach along with its computing cost.
Character-Lattice Basics
A CL on a vector sequence x is a weighted directed acyclic graph with a finite set of nodes Q, including an initial node $q_I \in Q$ and a set of final nodes $F \subseteq (Q - \{q_I\})$. Each node q is associated with a horizontal position in x, given by $t(q) \in [0, n]$, where n is the length of x. For an edge $(q', q) \in E$ ($q' \neq q$, $q' \notin F$, $q \neq q_I$), $c = w(q', q)$ is its associated character and $s(q', q)$ is its score, corresponding to the HMM likelihood that the character $w(q', q)$ appears between frames $t(q')+1$ and $t(q)$, computed during the Viterbi decoding of x. For an illustrative example of a CL, see Fig. 3.4.2.1.
A complete path of a CL is a sequence of edges $\mathcal{P} \equiv (q'_1, q_1)(q'_2, q_2)\cdots(q'_L, q_L)$ such that $q'_1 = q_I$, $q_i = q'_{i+1}$ for $1 \leq i < L$, and $q_L \in F$. A complete path corresponds to a whole line decoding hypothesis and its score is the product of the scores of all its edges: $S(\mathcal{P}) = \prod_{i=1}^{L} s(q'_i, q_i)$.
The Viterbi score of x, S(x), is the maximum of the scores of all complete CL paths. It can be easily and efficiently computed by Dynamic Programming. For our purposes here, we will obtain S(x) by previously obtaining forward (Ψ) or backward (Φ) partial path scores, also very efficiently computed by Dynamic Programming:
$\Psi(q_I) = 1; \qquad \Psi(q) = \max_{(q', q) \in E} \Psi(q')\, s(q', q)$ (3.4.2.1)

$\Phi(q) = 1,\ q \in F; \qquad \Phi(q) = \max_{(q, q'') \in E} s(q, q'')\, \Phi(q'')$ (3.4.2.2)
These equations are similar to those used in the standard backward-forward computations in word-
graphs (see [Wessel 2001]).
Figure 3.4.2.1: A small, partial example of a CL for a handwritten word image. Two edge sequences, (q1, q2), ..., (q4, q5) and (q1, q2), ..., (q7, q8) (in dashed lines), correspond to the character sequence "bore".
CL-based KWS Score for Character Sequences
Let a sub-path e of a CL be defined as a sequence of connected edges of the form (see examples in Fig. 3.4.2.1): $\boldsymbol{e} \equiv (q'_1, q_1)(q'_2, q_2)\cdots(q'_L, q_L)$, with $q_i = q'_{i+1}$, $1 \leq i < L$. The maximum score of a complete path $\mathcal{P}$ which contains a sub-path e is denoted as φ(e, x). It can be efficiently computed using the backward and forward partial scores of Eq. (3.4.2.1)-(3.4.2.2):

$\varphi(\boldsymbol{e}, x) = \max_{\mathcal{P}:\, \boldsymbol{e} \in \mathcal{P}} S(\mathcal{P}) = \Psi(q'_1)\left(\prod_{i=1}^{L} s(q'_i, q_i)\right)\Phi(q_L)$ (3.4.2.3)

where $\boldsymbol{e} \in \mathcal{P}$ means that e is a sub-path of $\mathcal{P}$. Given a character sequence c, let S'(c, x) be the maximum score of a complete path $\mathcal{P}$ containing a sub-path e associated with c:

$S'(c, x) = \max_{\boldsymbol{e}:\, \Omega(\boldsymbol{e}) = c} \varphi(\boldsymbol{e}, x)$ (3.4.2.4)

where $\Omega(\boldsymbol{e})$ is the character sequence $c_1 c_2 \cdots c_L$ associated with e; that is, for $\boldsymbol{e} \equiv (q'_1, q_1)\cdots(q'_L, q_L)$, $c_i = w(q'_i, q_i)$, $1 \leq i \leq L$. Eq. (3.4.2.3)-(3.4.2.4) can be easily computed as shown in Alg. 1.
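Algorithm 1 itself is not reproduced here; the following sketch (with our own, simplified data structures, not the project's implementation) shows how the forward/backward maxima of Eq. (3.4.2.1)-(3.4.2.2) and the sub-path score of Eq. (3.4.2.3) can be computed by dynamic programming over a topologically sorted CL.

```python
from collections import defaultdict

def forward_backward_max(nodes, edges, q_initial, finals):
    """Max-score forward (Psi) and backward (Phi) passes over a CL (sketch).

    `nodes` is a list in topological order, `edges` maps (q_prev, q) -> score,
    `q_initial` is the initial node and `finals` is the set of final nodes.
    """
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for (qp, q), s in edges.items():
        incoming[q].append((qp, s))
        outgoing[qp].append((q, s))

    psi = {q: 0.0 for q in nodes}
    psi[q_initial] = 1.0
    for q in nodes:                                     # Eq. (3.4.2.1)
        for qp, s in incoming[q]:
            psi[q] = max(psi[q], psi[qp] * s)

    phi = {q: (1.0 if q in finals else 0.0) for q in nodes}
    for q in reversed(nodes):                           # Eq. (3.4.2.2)
        for qn, s in outgoing[q]:
            phi[q] = max(phi[q], s * phi[qn])
    return psi, phi

def subpath_score(subpath, psi, phi, edges):
    """phi(e, x) of Eq. (3.4.2.3) for a sub-path e = [(q'_1, q_1), ...]."""
    score = 1.0
    for edge in subpath:
        score *= edges[edge]
    return psi[subpath[0][0]] * score * phi[subpath[-1][1]]
```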
Assuming the CL is complete, if c is the character sequence of a word v to be spotted (including blank spaces), it can be readily seen that S(x) = s_f(x) and S'(c, x) = s_v(x) (cf. Eq. 2.4.2.3). Therefore,

$score(c, x) = \frac{1}{L_c}\left(\log S'(c, x) - \log S(x)\right)$ (3.4.2.5)

where the length normalization is now $L_c = t(q_L) - t(q_1)$.
Computational Complexity
The preprocessing cost is that of generating the CLs for the N text line images and computing the
corresponding forward-backward scores. The cost of producing the CLs is $O(\Gamma\,\bar{n}\,N)$, where $\bar{n}$ is the average line image length and Γ is a (generally large) constant which depends on the size of the CLs. On the other hand, forward-backward computing time is negligible with respect to that of CL generation [Toselli 2013]. In the search phase, according to Alg. 1, the worst-case cost of computing S'(c, x) is $O(m + BL)$, where m is the number of edges of the CL of x, L is the number of characters of c and B is the maximum CL branching factor (≈30 for the CLs of Table 4.4.2.3). Finally, the cost of spotting M keywords in a collection of N text line images is $O((m + BL)\,N\,M)$. The relative improvement with respect to the classical HMM-filler method depends on how $\Gamma\,\bar{n}$ compares with $m + BL$. Real computing times of the proposed and the classical HMM-filler approaches will be reported later in Table 4.4.2.5.
4. Evaluation
4.1 Image Pre-processing
4.1.1 Binarization
At first we tested the developed binarization method on the three TranScriptorium images of the DIBCO 2013 competition [Pratikakis 2013]. Representative results are presented in Fig. 4.1.1.1(c)-(d). In addition, the corresponding results of the best DIBCO 2013 algorithm for the three TranScriptorium images and for the whole dataset are presented in Fig. 4.1.1.1(e)-(f) and Fig. 4.1.1.1(g)-(h), respectively. Finally, the results of the state-of-the-art method [Ntirogiannis 2014] are presented in Fig. 4.1.1.1(i)-(j). From Fig. 4.1.1.1, we can observe that the newly developed method preserves the textual information better.
In the following, representative binarization results are shown in Fig. 4.1.1.2 - 4.1.1.4 for images from the Bentham, Hattem and Plantas collections. In particular, the binarization methods used are the newly developed method, the method of Ntirogiannis et al. [Ntirogiannis 2014] and that of Gatos et al. [Gatos 2006]. From those figures we can conclude that the newly developed binarization method can efficiently handle images from different collections, while the initial method of Ntirogiannis et al. [Ntirogiannis 2014] cannot handle images from the Hattem collection (Fig. 4.1.1.3). In Fig. 4.1.1.4 the developed method does not remove all the bleed-through, but the binarization result is still satisfactory. Lastly, in Fig. 4.1.1.2 it is clear that more textual information is preserved by the newly developed method.
The developed binarization method was tested against the algorithms of the DIBCO 2013 competition [Pratikakis 2013]. The corresponding results are presented in Table 4.1.1.1. The performance of the developed method is very high; in particular, it is ranked 3rd in terms of F-Measure. It is worth mentioning that the developed method has a very high recall rate, since our goal is to detect as much textual information as possible.
Table 4.1.1.1: Quantitative results for the TS images of the DIBCO 2013 competition
Method F-Measure Recall Precision
3 94.43292 92.04163 96.95178
5 93.67852 90.20371 97.43178
New 92.02438 97.30573 87.35464
17 91.24978 89.43323 93.14166
13 90.07855 85.59744 95.05475
7 89.89771 85.26649 95.0609
2 89.79226 85.01224 95.14184
9 89.25338 84.22794 94.91656
8.2 88.94308 82.50688 96.4684
18 88.93362 83.38049 95.2792
10.1 88.90041 84.4482 93.8482
15.2 88.86937 81.13896 98.22791
10.2 88.29652 82.9598 94.36705
10.3 88.13874 82.52364 94.57376
8.1 87.56595 82.45027 93.35843
11 85.43979 76.60897 96.57174
12 85.17727 79.9144 91.18219
8.3 84.55081 74.59911 97.5664
15.1 83.65741 79.38634 88.41419
4 83.01694 71.65871 98.65406
6 81.53499 70.79074 96.12423
16 81.09038 69.92609 96.49693
14 80.26672 69.35751 95.24829
1 36.3517 92.9747 22.59251
Figure 4.1.1.1: Comparison between the new binarization method and state-of-the-art methods: (a)-(b) original images; (c)-(d) the newly developed method; (e)-(f) best DIBCO 2013 method for the three TranScriptorium images (Howe [Howe 2012]); (g)-(h) DIBCO 2013 winning method for the whole dataset; (i)-(j) method of Ntirogiannis et al. [Ntirogiannis 2014].
Figure 4.1.1.2: Representative Bentham case: (a) Original image; (b) developed binarization; (c)
Gatos et al. [Gatos 2006] binarization; (d) Ntirogiannis et al. [Ntirogiannis 2014].
Figure 4.1.1.3: Representative Hattem case: (a) Original image; (b) developed binarization; (c)
Gatos et al. [Gatos 2006] binarization; (d) Ntirogiannis et al. [Ntirogiannis 2014].
Figure 4.1.1.4: Representative Plantas case: (a) Original image; (b) developed binarization; (c)
Gatos et al. [Gatos 2006] binarization; (d) Ntirogiannis et al. [Ntirogiannis 2014].
4.1.2 Border Removal
In order to record the efficiency of the developed border removal algorithm, we manually marked the correct text region in the binary images to create the ground truth set (see Fig. 4.1.2.1(a)). The performance evaluation is based on a pixel-based approach and counts the pixels in the correct text region and in the result image after border removal (see Fig. 4.1.2.1(b)). Let G be the
set of all pixels inside the correct text region in ground truth, R the set of all pixels inside the result
image and T(s) a function that counts the elements of set s. We calculate the precision and recall as
follows:
$Precision = \frac{T(G \cap R)}{T(R)}, \qquad Recall = \frac{T(G \cap R)}{T(G)}$ (4.1.2.1)

A performance metric FM can be extracted if we combine the values of precision and recall:

$FM = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$ (4.1.2.2)
In our example (see Fig. 4.1.2.1) the precision is 100% because all the black borders have been removed, while the recall is 94% because marginal text (which belongs to the correct text region) has been cropped.
Figure 4.1.2.1: (a) Marked text region (ground truth); (b) result after borders removal.
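The pixel-based metrics of Eq. (4.1.2.1)-(4.1.2.2) are straightforward to compute from two binary masks; the snippet below is a small illustration (names are ours):

```python
import numpy as np

def region_metrics(ground_truth_mask, result_mask):
    """Pixel-based Precision, Recall and FM (Eq. 4.1.2.1 and 4.1.2.2)."""
    g = np.asarray(ground_truth_mask, dtype=bool)   # correct text region
    r = np.asarray(result_mask, dtype=bool)         # region kept after border removal
    intersection = np.logical_and(g, r).sum()
    precision = intersection / r.sum() if r.sum() else 0.0
    recall = intersection / g.sum() if g.sum() else 0.0
    fm = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, fm
```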
We used a set of 533 randomly selected handwritten historical images (433 images from the
Bentham dataset, 50 images from the Plantas dataset and 50 images from Hattem dataset). For
comparison purposes, we applied the state-of-the-art method of D.X. Le [Le 1996] to the same set. Table 4.1.2.1 illustrates the average precision, recall and FM of the two methods. As can be observed, the proposed border removal method outperforms the state-of-the-art method and achieves FM = 93.76%.
Table 4.1.2.1: Border Removal - Evaluation results
Method Precision Recall FM
TranScriptorium 90.53% 97.23% 93.76%
D.X. Le 85.40% 99.54% 91.93%
4.1.3 Image Enhancement
Color Enhancement
Background Estimation and Image Normalization
For the enhancement of the grayscale image we conducted an end-to-end experiment using the HTR/HTK tool and the initial set of 53 images from the Bentham collection. We randomly partitioned the image collection into two sets, train (1203 text line images) and test (140 text line images). The word error rate using the original images was approximately 29.1. Using a smoothing filter, specifically a 5x5 Wiener filter (Fig. 4.1.3.1d), the word error rate decreased to 28.9, while using a 3x3 Gaussian filter (Fig. 4.1.3.1c) it decreased to 28.3, enhancing the overall performance. On the contrary, using a filter with the opposite effect, i.e. one highlighting local contrast and details, such as an unsharp filter (Fig. 4.1.3.1e), the word error rate increased to approximately 30.1. Unexpectedly, with background smoothing using the background estimation method (Fig. 4.1.3.1b), even though it produces visually satisfactory results, the word error rate increased to 30.3.
In addition, we also used a much larger dataset consisting of 433 completely ground-truthed images from the Bentham collection. This set contains approximately 11470 line images, partitioned into a train set of 9195 line images, a validation set of 1415 line images and a test set of 860 line images. For the HTR/HTK experiments we used the train and the validation sets. The word error rate using the original images was approximately 41.55. Using a smoothing filter, for instance the 5x5 Wiener filter, the word error rate slightly decreased to 41.00, while using the background smoothing filter the word error rate remained almost the same at 41.53.
Figure 4.1.3.1: Example enhancement methods: (a) original text line image; (b) background smoothing using background estimation; (c) Gaussian filter (3x3); (d) Wiener filter (5x5); (e) unsharp filter (3x3).
Background and noise removal
The methods developed for background and noise removal have been evaluated using two datasets, Esposalles and Plantas, for which details are provided in the section on the evaluation of the handwritten text recognition task (Section 4.3).
To make a fair comparison of the different pre-processing methods, in the experiments only the pre-
processing is varied. The rest of the HTR system parameters were chosen arbitrarily and kept fixed.
Figure 4.1.3.2: Example images from the Esposalles corpus comparing the original image with: the
previous background removal method, Sauvola thresholding, the proposed method, the proposed
method with contrast stretching and the proposed method with both contrast stretching and RLSA
cleaning.
The Esposalles corpus was used for preliminary testing. A few images were chosen at random, and
these were visually inspected in order to observe the results obtained for each of the methods and
the impact that each parameter had. In Fig. 4.1.3.2 a few example images are presented comparing
the original and the result after applying the thresholding or the proposed technique. It can be
observed that strokes that are locally lighter than other neighboring strokes, with thresholding there
is a risk that these are discarded. On the other hand, with the proposed technique there is a much
lower chance that locally light strokes be completely removed, instead these are mapped to a shade
of gray. Furthermore, the handwritten text looks smoother and more natural since all of the
boundaries between background and strokes have a softer transition.
After this visual inspection, a single set of parameters was chosen, specifically W = 30, R = 128, k = 0.2 and s = 1, and an experiment was run taking as performance measure the well-known Word Error Rate (WER). The rest of the processing steps and the training of the models were exactly the same as in [Romero 2013]. Table 4.1.3.1 presents the WER for the best previously published result on that dataset, as well as the results obtained with Sauvola thresholding and with the proposed technique. The previously used technique performs considerably worse than local thresholding, which obtains a relative improvement of about 16.8%. Comparing the results of the proposed technique and the thresholding, there is a smaller, although still significant, improvement. From this we can observe that the hard decision imposed by a thresholding technique discards information that is useful to a recognizer based on HMMs and Gaussian mixtures.
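For reference, the Sauvola baseline used in this comparison is available in scikit-image (which requires an odd window size, hence 31 rather than the W = 30 above). The soft variant sketched below only illustrates the idea of avoiding a hard decision by mapping lighter pixels towards white instead of discarding them; it is an assumption-laden approximation (in particular the role of the parameter s and the exact gray mapping are our guesses), not the actual proposed technique.

```python
import numpy as np
from skimage.filters import threshold_sauvola

def sauvola_binarize(gray, window=31, k=0.2, r=128):
    """Hard Sauvola thresholding, as used for the baseline in Table 4.1.3.1."""
    t = threshold_sauvola(gray, window_size=window, k=k, r=r)
    return (gray <= t).astype(np.uint8)              # 1 = text (dark) pixel

def soft_background_removal(gray, window=31, k=0.2, r=128, s=1.0):
    """Illustrative soft alternative: no hard decision on locally light strokes."""
    gray = gray.astype(float)
    t = threshold_sauvola(gray, window_size=window, k=k, r=r)
    # Pixels at or below the local threshold keep their value (strokes);
    # pixels above it are pushed towards white proportionally to how far
    # above the threshold they are (hypothetical mapping, scaled by s).
    lighter = np.clip((gray - t) * s, 0.0, None)
    out = np.where(gray <= t, gray, np.minimum(255.0, gray + lighter))
    return out.astype(np.uint8)
```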
The Plantas corpus was used for a more in-depth evaluation of the proposed technique. The
parameters of the algorithms were varied in order to obtain the best performance in each case. The
results are also presented in Table 4.1.3.1, although in this case Character Error Rate (CER) was
used in order to observe the benefit at a character level. For this corpus the proposed technique also
performs significantly better than Sauvola thresholding.
Table 4.1.3.1: Recognition results for the Esposalles and the Plantas corpora comparing the
proposed technique with previously published results and the Sauvola thresholding.
Dataset Approach WER (%) 95% conf. int.*
Esposalles Previously published [Romero 2013] 16.1 15.8–16.4
Esposalles Sauvola thresholding 13.4 13.1–13.7
Esposalles Proposed technique 13.1** 12.8–13.4
Dataset Approach CER (%) 95% conf. int.
Plantas Sauvola thresholding 25.4 25.0–25.7
Plantas Proposed technique 24.7*** 24.3–25.0
*Wilson interval estimation.
**Better than Sauvola for a confidence level of 90% using a two-proportion z-test.
***Better than Sauvola for a confidence level of 99% using a two-proportion z-test.
Show-through reduction
As mentioned earlier, the show-through reduction technique being developed is still at an early stage. Work remains to be done on the theory behind the approach and on exhaustive experimentation. For the time being, we have very interesting results based on simple visual inspection of the images. Figure 4.1.3.3 presents an example image from the Plantas corpus that shows the kind of results being obtained.
[Figure 4.1.3.3 panels: original color scanned image; standard grayscale converted image; image after removal of show-through]
Figure 4.1.3.3: Example image from the Plantas corpus comparing the original image and the result
obtained after show-through reduction.
Binary Enhancement
As far as the enhancement of the binary image is concerned, it has only been evaluated visually. A representative example of contour enhancement of the binary image is shown in Fig. 4.1.3.4. The contour is smoothed and the "teeth" effect is greatly reduced.
Figure 4.1.3.4: Contour enhancement of the binary image: (a) original image; (b)-(c) binarization result before and after the contour enhancement.
4.1.4 Deskew
In order to directly evaluate the estimation accuracy of the described methods we developed a tool
which allowed us to parse the pages of the Bentham dataset, to extract the text lines from the xml
ground truth file and to extract the words from each text line. For each word the skew angle was
marked and its length was computed. After all the words of each text line had been treated, the tool calculated a weighted average of the skew angles computed for the words, weighted by their length, and proposed a skew angle for the text line. The user was able to mark and further correct the skew of each text line if the calculation was not satisfactory. After all the text lines of a page had been parsed, a weighted average of the skew angles of the text lines was computed and proposed as the skew angle of the page. In this way, after treating 10 representative pages of the Bentham
collection, randomly chosen from the different boxes of the first part, 1949 words, 247 text lines
and 10 pages with known ground truth were created. Those images were used to form the datasets
for direct evaluation.
The performance evaluation of the algorithm is based on a well-established method for document
skew estimation similar to the one used in ICDAR Document Image Skew Estimation Contest 2013
(DISEC’13) [Papandreou 2013b]. For every document image of the dataset the distance between the
ground truth and the estimation of the algorithm was calculated. In order to measure the
performance of the methods the following three criteria were used: (a) the Average Error Deviation
(AED), (b) the Average Error Deviation of the Top 80% of the results of each algorithm (TOP80)
and (c) the percentage of Correct Estimations (CE) (error deviation lower than 0.2º). The evaluation
will record the accuracy and the efficiency of the methodologies.
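These three criteria can be computed as follows (a small illustration; we assume the "top 80%" refers to the 80% smallest error deviations):

```python
import numpy as np

def skew_evaluation(gt_angles, estimated_angles, correct_tolerance=0.2):
    """AED, TOP80 and CE criteria for skew estimation (illustrative sketch)."""
    deviations = np.abs(np.asarray(gt_angles) - np.asarray(estimated_angles))
    aed = deviations.mean()                                        # AED
    best_80 = np.sort(deviations)[:int(np.ceil(0.8 * len(deviations)))]
    top80 = best_80.mean()                                         # TOP80
    ce = 100.0 * (deviations < correct_tolerance).mean()           # CE (%)
    return aed, top80, ce
```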
In order to obtain a larger evaluation set for the page-level evaluation, we rotated each page by 11 different angles, from -5 to 5 degrees with a one-degree step. In this way, a dataset of
110 pages with known ground truth was created. In order to compare the developed method with the
current state-of-the-art skew estimation techniques we implemented (i) the classical Projection
Profiles (PP) [Postl 1986], (ii) the enhanced vertical profile (eVP) [Papandreou 2011], (iii) the
classical Hough Transform (HT) [Srihari 1989] and (iv) the Yan Cross Correlation (CC) [Yan 1993].
The range of expected skew for all projection algorithms was set from −6° to 6°, while the rotation step, and consequently the accuracy, was 0.05°. It should also be noted that the estimations of
each algorithm were rounded to the second decimal. The results presented in Table 4.1.4.1
demonstrate the performance of the developed method in comparison with the state-of-the-art skew
estimation algorithms.
Table 4.1.4.1: Results of Skew Estimation Algorithms in Page Level
Skew Estimation Technique
Page Level Dataset
AED (°) TOP80 (°) CE (%)
Projection Profiles (PP) 0.92 0.72 15.45
Enhanced Vertical Profile (eVP) 1.13 0.76 8.18
Hough Transform (HT) 0.95 0.76 14.27
Cross Correlation (CC) 2.63 1.33 15.45
Developed Method (eHP c-f) 0.78 0.49 23.63
Concerning the text line level performance evaluation, we excluded from the 247 cropped text lines
those that consisted of less than 2 words and a dataset was formed with the remaining 215 text lines.
The same three metrics were used in order to evaluate the performance of the algorithms. In order to
make a comparison of the developed method with the current state-of-the-art techniques the four
above mentioned algorithms were used and moreover, we also implemented the regression (R)
algorithm [Madhvanath 1999]. The results presented in Table 4.1.4.2 demonstrate the performance
of the developed method in comparison with the state-of-the-art skew estimation algorithms.
Table 4.1.4.2: Results of Skew Estimation Algorithms in Text Line Level
Skew Estimation Technique
Text Line Level Dataset
AED (°) TOP80 (°) CE (%)
Projection Profiles (PP) 0.35 0.19 47.44
Enhanced Vertical Profile (eVP) 0.88 0.58 16.74
Hough Transform (HT) 0.24 0.14 59.06
Cross Correlation (CC) 2.32 1.39 7.9
Regression (R) 0.28 0.15 55.34
Developed Method (eHP c-f) 0.32 0.18 51.62
Concerning the word level evaluation, we excluded from the 1949 cropped word images the words that consisted of fewer than 3 letters, as well as the smeared words, and a dataset was formed with the remaining 1376 marked words; the same three metrics were used to evaluate the performance of the algorithms. In this case, though, we regard as correct the estimations whose error deviation is lower than 0.5°. In order to make a comparison of the developed method with
the current state-of-the-art skew estimation techniques we implemented (i) (PP), (ii) (eVP), (iii)
(HT), (iv) (R), (v) the method that was developed for the page and text line level (eHP c-f) and (vi)
the algorithm of Blumenstein et al. [Blumenstein 2002] that takes advantage of the centers of mass
(CM) of the word image. The results presented in Table 4.1.4.3 demonstrate the performance of the
developed method in comparison with the other skew estimation algorithms.
Table 4.1.4.3: Results of Skew Estimation Algorithms in Word Level
Skew Estimation Technique
Word Level Dataset
AED (°) TOP80 (°) CE (%)
Projection Profiles (PP) 1.25 0.77 40.82
Enhanced Vertical Profile (eVP) 1.94 1.43 12.42
Hough Transform (HT) 2.45 0.82 31.24
Regression (R) 3.94 0.71 37.86
Developed Method (eHP c-f) 0.96 0.72 43.01
Centers of Mass (CM) 0.79 0.64 51.38
Developed Method 0.66 0.58 57.99
As can be observed in Table 4.1.4.2, when it is applied to text lines, the performance of the
developed skew estimation method is close to HT and R, which proved to be the most efficient
state-of-the-art methods. Moreover, as can be seen in Table 4.1.4.3 it outperforms both the HT and
R methods on word skew estimation. The fact that it performs well both on text lines and on words is a great advantage of the method, since there are several text lines in the documents that contain only one word or one letter; in those cases both the HT and R algorithms fail with a large error deviation.
Table 4.1.4.3 also shows that the developed algorithm outperforms all state-of-the-art algorithms at the word level, and proves to be more efficient. Our proposal is that eHP c-f should be applied at the page level (before text line segmentation), then at the text line level (preceding word segmentation), and finally the word skew estimation should be applied in order to correct the skew of each word.
4.1.5 Deslant
In order to directly evaluate the slant estimation accuracy of the described methods we developed a
tool which allows us to parse the pages of the Bentham dataset, extract the text lines from the xml
ground truth file and then extract the words from each text line. For each word the slant angle was
marked. After all the words of each text line were treated the tool calculated the average of the slant
angles computed for the words and estimated the dominant slant angle for the text line. In this way,
after processing the same 10 pages of the Bentham collection, that were chosen for the skew
estimation evaluation, the same 1949 words and 247 text lines with known ground truth were
created. Those images were used to form the datasets for direct evaluation.
The performance evaluation of the algorithms is based on the same method that was used for the
skew estimation algorithms. For every document image of the dataset the distance between the
ground truth and the estimation of the algorithm was calculated. In order to measure the
performance of the methods the following three criteria were used: (a) (AED), (b) (TOP80) and (c)
(CE) (correct is regarded an error deviation less than 5º). The evaluation will record the accuracy
and the efficiency of the methodologies.
Concerning the text line level performance evaluation, we excluded from the 247 cropped text lines
those that were smeared or consisted of one or two letters and a dataset was formed with the
remaining 228 text lines. In order to make a comparison of the developed method with the current
state-of-the-art skew estimation techniques we implemented (i) the Bozinovic and Srihari
[Bozinovic 1989], (ii) the Papandreou and Gatos [Papandreou 2012], (iii) the Kimura et al. [Kimura
1993], (iv) the Ding et al. [Ding 2000] and (v) Vinciarelli and Juergen [Vinciarelli 2001]
algorithms. The range of expected slant for the projection algorithm of Vinciarelli and Juergen was set from −50° to 50°, while the step, and consequently the accuracy, was 1°. The results
presented in Table 4.1.5.1 demonstrate the performance of the developed method in comparison
with the state-of-the-art slant estimation algorithms.
Table 4.1.5.1: Results of Slant Estimation Algorithms in Text Line Level
Slant Estimation Technique
Text Line Level Dataset
AED (°) TOP80 (°) CE (%)
Bozinovic and Srihari 11.75 6.86 38.59
Papandreou and Gatos 9.36 5.57 49.56
Kimura et al. 35.88 33.81 2.19
Ding et al. 23.64 20.95 10.52
Vinciarelli and Juergen 8.47 5.29 57.92
Developed Method 7.18 3.90 60.32
As can be observed, the developed method outperforms the state-of-the-art algorithms. Although its performance is similar to that of the Vinciarelli and Juergen technique, the latter is a projection profile algorithm that applies the shear transform several times to every image in order to detect the dominant slant; due to this fact it is more computationally expensive and time consuming, especially for text line images.
Concerning the word level evaluation, we excluded from the 1949 cropped word images the words that consisted of fewer than 2 letters, as well as the smeared words, and a dataset was formed with the remaining 1693 marked words; the same three metrics were used in order to evaluate
the performance of the algorithms. In order to make a comparison of the developed method with the
current state-of-the-art techniques the same five, above mentioned, algorithms were used. The
results presented in Table 4.1.5.2 demonstrate the performance of the developed method in
comparison with the state-of-the-art slant estimation algorithms.
Table 4.1.5.2: Results of Slant Estimation Algorithms in Word Level
Slant Estimation Technique
Word Level Dataset
AED (°) TOP80 (°) CE (%)
Bozinovic and Srihari 13.12 7.26 33.72
Papandreou and Gatos 10.35 6.03 46.96
Kimura et al. 37.6 35.19 1.83
Ding et al. 23 20.88 11.04
Vinciarelli and Juergen 8.06 4.97 59.20
Developed Method 7.24 3.96 59.87
The performance of the slant estimation algorithm at the word level is close to its performance at the text line level. The developed algorithm outperforms the state-of-the-art algorithms and has a performance close to that of the Vinciarelli and Juergen algorithm, without its drawbacks.
4.2 Image Segmentation
4.2.1 Detection of main page elements
Detection of main text zones
In order to record the efficiency of the developed text zone detection method, we created a set of 300 handwritten historical images from the Bentham, Plantas and Hattem datasets. These images had also been used for the border removal evaluation, so we used their border-removed versions (the ground truth of the border removal evaluation). As a next step, we manually marked the main text zones. These correspond to one area for single-column documents or two areas for two-column documents (see Fig. 4.2.1.1).
Figure 4.2.1.1: The ground truth of main document text zones
The performance evaluation of the developed method is based on comparing the number and the
coordinates of areas between the result and the ground truth. A result is assumed correct only if the
number of areas is the same and the horizontal coordinates of the detected and ground truth areas
are almost the same (their absolute difference is less than image_width / 30). The results showed
that in 254 out of the 300 images, the main text zones were correctly detected. This leads to an
accuracy of ~ 85%.
Segmentation of structured documents using stochastic grammars
We used a 2D-SCFG model for segmenting structured documents of marriage license books. The input image of a page is represented as a matrix of w x h cells, each cell accounting for a region of 50 x 50 pixels. We carried out an experiment using 80 pages of this dataset. The training
set contained 60 pages and the remaining 20 pages were used as test set. We followed this
experimentation in order to obtain results comparable to those in [Cruz 2012]. In that paper they
used the same text classification features (Gabor and RLF) but a different logical model:
Conditional Random Fields (CRF).
As a result of the segmentation process we obtain the most likely layout class for each cell, hence,
we evaluated the recognition performance by reporting the average precision, recall and F-measure
computed at cell level for each page in the test set. Table 4.2.1.1 shows the experimental results
obtained by both sets of text features and both logical models. More details can be found in [Alvaro
2013] and [Cruz 2012].
Table 4.2.1.1: Classification results for different models and text classification features.
Model | Class | Gabor (Precision / Recall / F-measure) | Gabor + RLF (Precision / Recall / F-measure)
CRF | Body | 0.89 / 0.87 / 0.88 | 0.88 / 0.96 / 0.92
CRF | Name | 0.27 / 0.82 / 0.40 | 0.77 / 0.73 / 0.74
CRF | Tax | 0.35 / 0.91 / 0.50 | 0.69 / 0.66 / 0.66
2D SCFG | Body | 0.88 / 0.97 / 0.92 | 0.88 / 0.97 / 0.92
2D SCFG | Name | 0.72 / 0.93 / 0.81 | 0.82 / 0.83 / 0.82
2D SCFG | Tax | 0.39 / 0.94 / 0.54 | 0.63 / 0.69 / 0.64
Results showed that the classes Body and Name were classified with good F-measure rates, whereas the class Tax represented the most challenging part. This is related to the percentage of the page that each region occupies, since small regions are more difficult to classify properly.
There are two factors to take into account: the text classification features and the page segmentation
model. Regarding text classification features, results using CRF as page segmentation model
showed that the inclusion of the RLF information significantly outperformed the results obtained
with Gabor features. However, RLF did not produce such a great improvement when grammars were the segmentation model. The 2D SCFG took advantage of the knowledge about the document structure; thus, grammars were able to overcome the shortcomings of the Gabor features, obtaining good results for the classes Body and Name without the additional spatial information provided by RLF. Even so, the inclusion of RLF in the 2D SCFG classification slightly improved the segmentation results for the class Name and also achieved a good F-measure value for the class Tax.
Comparing the performance of 2D SCFG with CRF, results showed that grammars achieved a great
improvement when Gabor features were used, which was not so prominent with RLF features.
Moreover, CRF recognized regions but did not identify the explicit segmentation in records.
However, 2D SCFG can provide the coordinates of every region in the most likely hypothesis.
Using this information we also carried out a semantic evaluation. We manually evaluated the results according to two different metrics: the number of records properly segmented, in terms of F-measure; and the number of lines correctly detected within the Body of the proper record, in view of a future line segmentation process. Results showed that the combination of 2D-SCFG and RLF improved the final record detection as well as the number of lines properly segmented, with F-measure values of 0.80 and 0.89, respectively.
Page Split
In order to record the efficiency of the developed page split algorithm, we used the same performance evaluation method as for the border removal algorithm. We manually marked the correct page frames in the binary double-page images in order to create the ground truth set (see Fig. 4.2.1.2). The performance evaluation is based on a pixel-based approach and counts the pixels in the correct page frames and in the result page frames after the page split. Let G be the set of all pixels inside the correct page frame in the ground truth, R the set of all pixels inside the result image and T(s) a function that counts the elements of set s. We calculate the precision and recall as follows:
$Precision = \frac{T(G \cap R)}{T(R)}, \qquad Recall = \frac{T(G \cap R)}{T(G)}$ (4.2.1.1)

A performance metric FM can be extracted if we combine the values of precision and recall:

$FM = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$ (4.2.1.2)
Figure 4.2.1.2: Marked page frames in a double document image (ground truth)
We used a set of 140 randomly selected double-page handwritten historical images from 4 different books. In order to calculate the Precision, Recall and FM of a double-page image, we calculate the values for each individual page frame and then extract the average values. Table 4.2.1.2 illustrates the average precision, recall and FM of the developed page split methodology. As can be observed, it achieves FM = 93.21%.
Table 4.2.1.2: Page Split - Evaluation results
Method Precision Recall FM
TranScriptorium 87.68% 99.50% 93.21%
4.2.2 Text Line Detection
Polygon Based Text Line Segmentation
For the evaluation of the text line segmentation methods we follow the same protocol which was
used in the ICDAR 2013 Handwriting Segmentation Competition [Stamatopoulos 2013]. The
performance evaluation method is based on counting the number of one-to-one matches between
the areas detected by the algorithm and the areas in the ground truth (manual annotation of correct
text lines). We use a MatchScore table whose values are calculated according to the intersection of
the ON pixel sets of the result and the ground truth. Let $G_i$ be the set of all points of the $i$-th ground truth region, $R_j$ the set of all points of the $j$-th result region, and $T(s)$ a function that counts the elements of set s. The table entry MatchScore(i, j) represents the matching result of the $i$-th ground truth region and the $j$-th result region as follows:
$MatchScore(i,j) = \frac{T(G_i \cap R_j)}{T(G_i \cup R_j)}$ (4.2.2.1)

We consider a region pair as a one-to-one match only if the matching score is equal to or above the evaluator's acceptance threshold Ta. If N is the count of ground-truth elements, M is the count of result elements, and o2o is the number of one-to-one matches, we calculate the detection rate (DR) and recognition accuracy (RA) as follows:

$DR = \frac{o2o}{N}$ (4.2.2.2)

$RA = \frac{o2o}{M}$ (4.2.2.3)

A performance metric FM can be extracted if we combine the values of detection rate and recognition accuracy:

$FM = \frac{2 \cdot DR \cdot RA}{DR + RA}$ (4.2.2.4)
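A compact illustration of this protocol follows (with a simple greedy one-to-one pairing, whereas the competition tool builds the full MatchScore table; representing regions as pixel sets is our choice):

```python
def segmentation_metrics(gt_regions, result_regions, ta=0.95):
    """DR, RA and FM from one-to-one matches (Eq. 4.2.2.1-4.2.2.4, sketch).

    Each region is a set of (x, y) pixel tuples; `ta` is the acceptance
    threshold (0.95 for the text line segmentation evaluation).
    """
    o2o, used = 0, set()
    for gi in gt_regions:
        for j, rj in enumerate(result_regions):
            if j in used:
                continue
            match_score = len(gi & rj) / len(gi | rj)          # MatchScore(i, j)
            if match_score >= ta:
                o2o += 1
                used.add(j)
                break
    dr = o2o / len(gt_regions) if gt_regions else 0.0          # detection rate
    ra = o2o / len(result_regions) if result_regions else 0.0  # recognition accuracy
    fm = 2 * dr * ra / (dr + ra) if dr + ra else 0.0
    return dr, ra, fm
```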
The proposed text line segmentation method (denoted as NCSR in the experimental results table)
has been tested on 433 handwritten document images derived from the Bentham collection. For all
these document images, we have manually created the corresponding text line segmentation ground
truth using the Aletheia tool and stored it using the PAGE xml format. The total number of text lines
appearing on those images was 11468. The acceptance threshold for the text line segmentation
evaluation is set to Ta = 0.95. For the sake of comparison, we also tested the winning method of the
ICDAR2013 Handwriting Segmentation Competition [Stamatopoulos 2013] for the text line
segmentation task as well as the method proposed by Nicolaou et al. [Nicolaou 2009]. Table 4.2.2.1
shows comparative experimental results in terms of detection rate (DR), recognition accuracy (RA) and FM. As can be observed from Table 4.2.2.1, the proposed method (NCSR) outperforms the other two approaches, achieving a detection rate of 83.08%, a recognition accuracy of 86.35% and an FM of 84.68%.
Table 4.2.2.1: Comparative experimental results on the dataset of the Bentham collection.
Method | N | M | o2o | DR (%) | RA (%) | FM (%)
NCSR | 11468 | 11034 | 9528 | 83.08 | 86.35 | 84.68
ICDAR2013 Handwriting Segmentation Competition winner [Stamatopoulos 2013] | 11468 | 11051 | 9418 | 82.12 | 85.22 | 83.64
Nicolaou [Nicolaou 2009] | 11468 | 10022 | 8617 | 75.13 | 85.98 | 80.19
Baseline Detection
In order to assess the quality of the baseline detection approach, two kinds of measures have been adopted: the "line error rate" (LER), considered a qualitative measure, is calculated as the number of incorrectly assigned line labels divided by the total number of correct line regions; and the "segmentation error rate" (SER) which, addressing the evaluation from a more quantitative point of view, measures the geometrical accuracy of the detected line region positions with respect to the corresponding (correct) reference marks. The LER is obtained by comparing the sequences of automatically obtained region labels (ĥ in Eq. 3.2.2.1) with the corresponding reference sequences. This is computed in the same way as the well-known WER, with equal editing costs assigned to deletions, insertions and substitutions [McCowan 2004]. The SER computation is carried out in two phases. First, for each page, we find the best alignment between the system-proposed line position marks (Eq. 3.2.2.1) and the page reference marks by minimizing their accumulated absolute differences using dynamic programming. Then, in the second phase, taking into account only the detected regions marked as NL, the SER is calculated as the average value (and standard deviation) of the already computed absolute differences of the position marks over all pages globally.
In order to evaluate the approach used for this task, we performed experiments on a number of different corpora.
In the following sections we describe each corpus used and provide experimental results for it.
Cristo Salvador Database
Experiments were carried out using a corpus compiled from a XIX century Spanish manuscript
identified as “Cristo-Salvador” (CS), which was kindly provided by the Biblioteca Valenciana
Digital (BiVaLDi)2. This is a rather small document composed of 53 color images of text pages,
scanned at 300 dpi and written by a single writer.
In this case, we employ the already predefined book partition [Romero 2007]. This partition
divides the dataset into a test set containing 20 page images and a training set composed of the 33
remaining pages.
For this corpus, text regions were classified into one of the following types: normal line, short line,
paragraph line, blank space, non-text zone and inter line.
Esposalles Database
The ESPOSALLES database [Romero 2013] is compiled from a book of the Marriage License Books collection (called Llibres d'Esposalles), conserved at the Archives of the Cathedral of Barcelona. This collection is composed of 291 books with information on approximately 600,000 marriage licenses given in 250 parishes between 1451 and 1905.
These demographic documents have already been proven to be useful for genealogical research and population investigation, which renders their complete transcription an interesting and relevant problem [Esteve 2009]. The contained data can be useful for research into population estimates, marriage dynamics, cycles, and indirect estimations of fertility, migration and survival, as well as socio-economic studies related to social homogamy, intra-generational social mobility, and inter-generational transmission and sibling differentials in social and occupational positions.
In our experiments, we used only volume 69, which encompasses 173 pages [Vidal 2012] containing a total of 5,831 text lines. We used the following page hold-out partition:
Training - consisting of the first 150 pages; used for parameter training and parameter
tuning.
Test - consisting of the last 23 pages; used only for final evaluation.
For this corpus, text regions were classified into one of the following types: normal line, start paragraph line, end paragraph line, end paragraph with ruler, header, footer, blank space, inter line and inter line with added text.
2http://bv2.gva.es
Bentham Database
The Bentham collection has more than 80,000 documents, most of them digitized. From the
digitized documents, more than 6,000 have been transcribed during the Transcribe Bentham
initiative3. The transcripts are recorded in TEI-compliant XML format.
The digitized material covers legal reform, punishment, the constitution, religion, and his panopticon prison scheme. The Bentham Papers include manuscripts written by Bentham himself over a period of sixty years, as well as fair copies written by Bentham's secretarial staff. This is important to the tranScriptorium project, as Bentham's handwriting is complex enough to challenge the HTR software, while manuscripts written by secretarial staff provide variety. Bentham's manuscripts are often complicated by deletions, marginalia, interlineal additions and other features. Some manuscripts are written on large sheets artificially divided, and a few are divided into tables with several columns. Portions of the collection are written in French, and Bentham occasionally uses Latin and ancient Greek in his writings.
In order to produce the Ground Truth (GT) as much automatically as possible and to carry out HTR
experiments, the images were classified according to their difficulty, for being processed in
increasing order of difficult. In [Gatos 2014] a detailed description of the GT creation process can
be found. Three criteria have been used for classifying the documents: first, a score was computed
for each image according to the TEI elements that appear in the transcripts. Thus, a document image
with many deletions, additions, stroke-out words, etc. had lower score than a document image that
had few of these elements. Second, after sorting the images according to this score, they were
classified according to their physical geometry. For the first round, only the images that had 2731 x
4096 pixels were chosen. Third, the resulting images were automatically clustered in three classes:
images that had a main text block in one column and a left margin, images that had a main text block
in one column and a right margin, and others. Horizontal projections were used to perform this
clustering. Finally, the initially chosen images were those that belonged to the first two clusters
and whose score was above a given threshold. The initial set is composed of 798 pages.
This initial set has been divided into two batches: the first one is composed of the 433 easier pages
and the second batch is composed of the rest.
We used the 1st Batch to evaluate our approach on this corpus by dividing it into two sets:
Training - consisting of the first 215 pages; used for parameter training and parameter
tuning.
Test - consisting of the last 218 pages; used only for final evaluation.
Due to the large size of this corpus, text regions were classified into one of the following types:
normal line, blank space and inter line. This simple labelling was performed in order to reduce the
correction and labelling effort required to create the ground truth, but it implies that the HMMs are
burdened with coping with much more variability.
Plantas Database
The PLANTAS database has been compiled from the manuscript piece entitled “Historia de las
plantas” written by Bernardo de Cienfuegos in the 17th century. In this case, HTR experiments were
only carried out on the PROLOGO chapter, which belongs to the first volume of the manuscript.
The PROLOGO chapter is composed of 38 pages and was divided, for experimentation, into two
sets:
Training - consisting of the first 20 pages; used for parameter training and parameter tuning.
Test - consisting of the last 18 pages; used only for final evaluation.
3http://blogs.ucl.ac.uk/transcribe-bentham/
For all of the above databases, our experimentation process consisted of performing training and
parameter tuning on the database's training partition; with the best parameters obtained, we then
performed a single experiment on the test partition.
In Table 4.2.2.2 we report the best LER and SER figures achieved for classification through the
indicated experimentation process. The SER mean and standard deviation are given in this case as a
percentage of the average text line width (which depends on the actual corpus).
Table 4.2.2.2: LER and SER achieved
Database LER(%) SER μr(%) SER σr(%)
CS 0.7 8.81 15.40
Esposalles 8.9 61.3 21.31
Bentham 13.8 30.22 15.71
Plantas 0.66 21.81 20.3
As we can see in Table 4.2.2.2, the approach used provided adequate performance on all tested
corpora, with worse values for the more complex corpora or where more classes were considered,
but in general it yielded good results.
4.2.3 Word Detection
For the evaluation of the word segmentation methods we follow the same protocol which was used
in the ICDAR 2013 Handwriting Segmentation Competition [Stamatopoulos 2013]. An analytic
description of the protocol is provided in the abovementioned section for evaluation of text line
segmentation methods. The proposed word segmentation method (denoted as Gaussian_Mixtures in
the experimental results tables) has been tested on 433 handwritten document images derived from
the Bentham collection. For all these document images, we have manually created the
corresponding word segmentation ground truth using the Aletheia tool and stored it using the PAGE
XML format. The total number of words appearing in those images was 94,886. The acceptance
threshold Ta was set to 0.9.
For the sake of comparison two different experiments were conducted. The first experiment used as
input the text line segmentation result which came from the ground truth (after manual annotation).
In this case, the input of the word segmentation algorithms was correct text lines. In the second
experiment, the input of the word segmentation algorithms came from the result of the automatic
text line segmentation method proposed in the previous section (NCSR). The motivation behind the
two experiments was to measure the drop in performance when starting from erroneous input (the
automatic text line segmentation result).
For both scenarios, we measure the performance of four different word segmentation algorithms
which correspond to i) the proposed method (Gaussian_Mixtures), ii) the sequential clustering
method (Sequential_Clustering) [Kim 2001], iii) the modified max method (Modified_Max) [Kim
2001] and iv) the average linkage method (Average_Linkage) [Kim 2001]. For ii, iii and iv the first
three hypotheses were considered (more details in [Kim 2001]).
Table 4.2.3.1 presents comparative experimental results in terms of detection rate (DR), recognition
accuracy (RA) and FM when the input to the word segmentation algorithms is correct text lines
whereas Table 4.2.3.2 shows experimental results when the input to the word segmentation
algorithms is text lines produced by the automatic text line segmentation method (NCSR).
Table 4.2.3.1: Comparative experimental results on the dataset of the Bentham collection starting
from correct text lines (ground truth).
Method N M o2o DR (%) RA (%) FM (%)
Gaussian_Mixtures 94886 103229 77308 81.47 74.89 78.04
Sequential_Clustering1 94886 92124 74037 78.03 80.37 79.18
Sequential_Clustering2 94886 91712 73744 77.72 80.41 79.04
Sequential_Clustering3 94886 91327 73490 77.45 80.47 78.93
Average_Linkage1 94886 86798 69546 73.29 80.12 76.56
Average_Linkage2 94886 62143 38251 40.31 61.55 48.72
Average_Linkage3 94886 69801 42255 44.53 60.54 51.32
Modified_Max1 94886 47035 26330 27.75 55.98 37.11
Modified_Max2 94886 58530 36159 38.11 61.78 47.14
Modified_Max3 94886 65161 34065 35.90 52.28 42.57
Table 4.2.3.2: Comparative experimental results on the dataset of the Bentham collection starting
from text lines produced by automatic method (NCSR).
Method N M o2o DR (%) RA (%) FM (%)
Gaussian_Mixtures 94886 102566 74066 78.06 72.21 75.02
Sequential_Clustering1 94886 70467 70211 74.00 77.61 75.76
Sequential_Clustering2 94886 90096 69959 73.73 77.65 75.64
Sequential_Clustering3 94886 89702 69692 73.45 77.69 75.51
Average_Linkage1 94886 84465 64976 68.48 76.93 72.46
Average_Linkage2 94886 59460 35075 36.97 58.99 45.45
Average_Linkage3 94886 68089 39904 42.05 58.61 48.97
Modified_Max1 94886 43659 25280 26.64 57.90 36.49
Modified_Max2 94886 59249 32938 34.71 55.59 42.74
Modified_Max3 94886 57310 30387 32.02 53.02 39.93
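For reference, the detection rate, recognition accuracy and F-measure values reported above follow directly from the counts N (ground-truth words), M (detected words) and o2o (one-to-one matches); a minimal Python sketch of this relation (the function name is ours, not part of the evaluation toolkit):

def segmentation_scores(N, M, o2o):
    """Detection rate, recognition accuracy and F-measure from the number of
    ground-truth words N, detected words M and one-to-one matches o2o."""
    dr = o2o / N                        # detection rate
    ra = o2o / M                        # recognition accuracy
    fm = 2 * dr * ra / (dr + ra)        # harmonic mean (F-measure)
    return dr, ra, fm

# Example: the Gaussian_Mixtures row of Table 4.2.3.1
print(segmentation_scores(94886, 103229, 77308))  # ~ (0.8147, 0.7489, 0.7804)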
4.3 Handwritten Text Recognition
The quality of the automatic transcriptions obtained by the system is measured by means of the
Word Error Rate (WER). It is defined as the minimum number of words that need to be substituted,
deleted, or inserted to match the recognition output with the corresponding reference ground truth,
divided by the total number of words in the reference transcriptions.
The Character Error Rate (CER) is also used to assess the quality of the automatic transcriptions
obtained. Its definition is analogous to the previous one, but substituting words by characters.
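Both measures thus reduce to an edit (Levenshtein) distance between the recognized output and the reference, computed over words for WER and over characters for CER. A minimal illustrative sketch in Python (not the evaluation code actually used in the experiments):

def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions needed to
    transform hyp into ref (Levenshtein distance), using a rolling row."""
    d = list(range(len(hyp) + 1))            # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(d[j] + 1,                # deletion
                                        d[j - 1] + 1,            # insertion
                                        prev_diag + (r != h))    # substitution (free if equal)
    return d[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character Error Rate: the same definition applied to character sequences."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)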
HTR in Bentham database
The Bentham collection has more than 80,000 documents, most of them digitized. From the
digitized documents, more than 6,000 have been transcribed during the Transcribe Bentham
initiative4. The transcripts are recorded in TEI-compliant XML format.
The digitized material covers topics such as legal reform, punishment, the constitution, religion, and
Bentham's panopticon prison scheme. The Bentham Papers include manuscripts written by Bentham himself
over a period of sixty years, as well as fair copies written by Bentham's secretarial staff. This is
important to the tranScriptorium project, as handwriting is complex enough to challenge the HTR
software, while manuscripts written by secretarial staff will provide variety. Bentham's manuscripts
are often complicated by deletions, marginalia, interlineal additions and other features. Some
manuscripts are written on large sheets artificially divided, and a few are divided into tables with
several columns. Portions of the collection are written in French, and Bentham occasionally uses
Latin and ancient Greek in his writings.
In order to produce the Ground Truth (GT) as automatically as possible and to carry out HTR
experiments, the images were classified according to their difficulty, so that they could be processed
in increasing order of difficulty. A detailed description of the GT creation process can be found in
[Gatos 2014]. Three criteria were used for classifying the documents: first, a score was computed
for each image according to the TEI elements that appear in the transcripts. Thus, a document image
with many deletions, additions, struck-out words, etc. had a lower score than a document image that
had few of these elements. Second, after sorting the images according to this score, they were
classified according to their physical geometry. For the first round, only the images that had 2731 x
4096 pixels were chosen. Third, the resulting images were automatically clustered in three classes:
images that had a main text block in one column and a left margin, images that had a main text block
in one column and a right margin, and others. Horizontal projections were used to perform this
clustering. Finally, the initially chosen images were those that belonged to the first two clusters
and whose score was above a given threshold. The initial set is composed of 798 pages.
This initial set has been divided into two batches: the first one is composed of the 433 easier pages
and the second batch is composed of the rest. In addition, to carry out fast HTR experiments to test
different techniques, 53 pages from the first batch were chosen. In the next subsections the
experiments carried out with the selected 53 pages and with the complete first batch are presented.
Preliminary HTR experiments
In this subsection we provide the results obtained in the experiments carried out with the 53 selected
pages. These 53 pages contain 1,342 lines with nearly 12,000 running words and a vocabulary of
more than 2,000 different words. The last column of the Table 4.3.1 summarizes the basic statistics
of these pages.
4 http://blogs.ucl.ac.uk/transcribe-bentham/
Table 4.3.1: Basic statistics of the different partitions in selected 53 pages of the Bentham database.
Number of: P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 Total
Pages 5 6 6 6 5 5 5 5 5 5 53
Lines 138 145 140 138 122 94 126 150 146 144 1,342
Run. words 1,174 1,334 1,216 1,287 1,032 824 1,111 1,429 1,282 1,246 11,935
Run. OOV 130 158 115 114 115 74 174 120 189 211 -
Lex. OOV 115 143 108 98 109 73 169 114 178 176 -
Lexicon 445 504 470 447 434 326 472 497 532 465 2,181
Character set size 81 81 81 81 81 81 81 81 81 81 81
Run. Characters 6,384 7,013 6,536 6,669 5,450 4,401 5,963 7,778 6,854 6,326 63,400
The 53 pages were divided into ten blocks of 5 or 6 pages each, aimed at performing cross-
validation experiments. That is, we carry out ten rounds, with each of the partitions used once as
validation data and the remaining nine partitions used as training data. Table 4.3.1 contains some
basic statistics of the different partitions defined. The number of running words for each partition
that do not appear in the other nine partitions is shown in the running out-of-vocabulary (Run.
OOV) row.
The first two rows of Table 4.3.2 show the best results obtained with both Open and Closed
Vocabulary (OV and CV). The values of the parameters that give the best results are: NS = 8, NG =
64, α = 50 and β = 50. In the OV setting, only the words seen in the training transcriptions were
included in the recognition lexicon. In CV, on the other hand, all the words which appear in the test
set but were not seen in the training transcriptions (i.e., the Lex. OOV) were added to the lexicon.
Except for this difference, the language model in both cases was the same; i.e., bi-grams estimated
only from the training transcriptions. Also, the morphological character models were obviously
trained using only training images and transcriptions.
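The difference between the two settings therefore only concerns how the recognition lexicon is assembled; schematically (a hypothetical helper, shown only to make the OV/CV distinction explicit):

def build_lexicon(train_words, test_words, closed_vocabulary=False):
    """Open vocabulary: only words seen in the training transcriptions.
    Closed vocabulary: training words plus the test-only (Lex. OOV) words."""
    lexicon = set(train_words)
    if closed_vocabulary:
        lexicon |= set(test_words)   # add the Lex. OOV entries
    return sorted(lexicon)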
These two lexicon settings represent extreme cases with respect to real use of HTR systems.
Clearly, only the words which appear in the lexicon of a (word-based) HTR system can ever have
the chance to be output as recognition hypotheses. Correspondingly, an OV system will commit at
least one error for every OOV word instance which appears in test images. Therefore, in practice
the training lexicon is often extended with missing common words obtained from some adequate
vocabulary of the task considered. In this sense, the CV setting represents the best which could be
done in practice, while OV corresponds to the worst situation. In the next experiments we only use
OV lexicon.
Table 4.3.2: Transcription Word Error Rate (WER) for the different partitions defined. All results
are percentages.
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 Total
WER OV 29.8 31.6 26.2 30.3 31.1 30.9 47.0 23.2 39.7 49.0 33.7
WER CV 19.9 20.6 17.5 22.1 22.3 22.1 35.1 15.3 25.7 29.3 22.8
Optional BS 31.0 34.4 27.9 30.8 33.1 32.4 51.7 24.3 41.1 51.6 35.7
BS with skips 29.5 33.2 27.9 30.9 29.5 32.6 49.9 23.5 41.2 52.0 34.8
Variable NS 27.6 31.5 26.5 28.2 30.8 28.8 41.5 23.3 36.7 41.5 31.5
Figure 4.3.1: Normalized histogram of the length of words that take part in error events in
BENTHAM database. Punctuation marks are not considered.
As expected, the OV experiment has a higher WER than the CV experiment. Table 4.3.3 shows an
analysis of the errors involved in the OV experiment. In this table we can see that punctuation marks
are responsible for around 20% of the errors, numbers for 3.5% and stop words for 27.2%. The rest
of the errors are due to other words.
Table 4.3.3: Errors analysis in the OV experiment for 53 selected pages of Bentham database.
Involving to Insertion Deletions Substitutions Total (%)
Punc-Marks 0.4 9.3 10.2 19.9
Numbers 0.1 0.8 2.6 3.5
Stop-Words 4.4 1.3 21.5 27.2
Others 9.1 4.6 35.6 49.3
In Fig. 4.3.1 we can see a normalized histogram of the length of the words involved in the error
events. From the graph we can see that 15% of the errors correspond to words of 4 characters.
Punctuation marks were not considered in this analysis.
In Fig. 4.3.2 a normalized histogram of the position of the error within the line is shown. To compute
this histogram we only considered lines longer than 5 words that contain recognition errors. In the
graph we can see that the first and the last positions have more errors than the others. This can be
due to the effect of hyphenated words and to the lack of context for the first word of each line. In
future work we plan to address the hyphenated-word problem.
Figure 4.3.2: Normalized histogram of the position of the error in the line in BENTHAM database.
Only lines longer than 5 words with recognition errors are considered.
Some ideas to solve this problem are: paste all the line images of a page sequentially into a very
long line (= page) image and try to recognize this long image as a whole; paste lines pairwise with a
one-line overlap; or model possible/probable hyphenation at the lexical level.
In order to improve the quality of the automatic transcriptions obtained, we have carried out several
experiments with different HMMs. More specifically, we have tested several ways to model the
blank space that can appear between words. This blank space is usually modelled by a specific
HMM. Then, this specific HMM is added to the end of the stochastic finite-state automata that
represent each word, forcing the recognition of a blank space after each word. However, in historical
documents, this separation between words does not always exist.
The first idea that we tested consisted in making the blank space optional in the stochastic finite-
state automata. In the third row of Table 4.3.2 (Optional BS) we can see the results obtained
after tuning the parameters. Contrary to expectations, the results obtained with this optional blank
space are worse than those obtained when it is compulsory. After analyzing the obtained
transcriptions we saw that the increase in WER is due to the recognition of some words as the
concatenation of several words. For example, the word “somebody”, instead of being recognized as
“somebody”, is recognized as “some” and “body”.
The second idea that we tested consisted in changing the structure of the blank-space HMM. In the
previous experiments all the HMMs are modelled as constrained left-to-right HMMs (each state
only has connections with itself and with the next state; see Fig. 3.3.2). In this new experiment we
have modified the blank-space HMM to allow skipping states (see Fig. 4.3.3). In the row "BS with
skips" of Table 4.3.2 we can see the result obtained with this new HMM. Although this result
improves on the optional BS, it is still worse than the baseline.
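Figure 4.3.3 is not reproduced here; the topology change amounts to adding state-skip transitions to an otherwise strictly left-to-right chain. The sketch below builds such a transition matrix for a 5-state model with illustrative probabilities (the actual values are estimated during HMM training):

import numpy as np

N_STATES = 5
A = np.zeros((N_STATES, N_STATES))
for i in range(N_STATES):
    A[i, i] = 0.6                      # self loop
    if i + 1 < N_STATES:
        A[i, i + 1] = 0.3              # advance to the next state
    if i + 2 < N_STATES:
        A[i, i + 2] = 0.1              # skip one state (the "BS with skips" extension)

# Rows must sum to 1; renormalize the final states where a skip is impossible.
A /= A.sum(axis=1, keepdims=True)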
Finally, we have also tested a variable number of states for each character class instead of using a
fixed number of states for all HMMs. The number of states NSc chosen for each HMM character
class was computed as NSc = lc / f, where lc is the average length of the sequences of feature vectors
used to train the character class, and f is a design parameter measuring the average number of
feature vectors modelled per state (state load factor). After tuning, the value of f that gave the best
result was 14, with 32 Gaussians per state. The last row of Table 4.3.2 shows the results obtained.
This result slightly improves on the baseline reported in the first row.
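As a small illustration of this rule, the following sketch assigns a state count to each character from the average length of its training feature sequences; the rounding and the minimum of three states are our own illustrative choices, not taken from the experiments:

def states_per_character(avg_lengths, f=14, min_states=3):
    """NSc = lc / f: lc is the average feature-sequence length observed for
    character c in the training data, f is the state load factor."""
    return {c: max(min_states, round(l / f)) for c, l in avg_lengths.items()}

# e.g. a character whose training samples average 70 feature vectors gets 5 states (f = 14)
print(states_per_character({"a": 70, "m": 112, ".": 20}))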
Figure 4.3.3: Example of a 5-state blank-space HMM.
First batch HTR experiments
In this subsection we provide the results obtained in the experiments carried out with the first batch
formed by 433 pages. These 433 pages contain 11,537 lines with nearly 110,000 running words
and a vocabulary of more than 9,500 different words. The last column of the Table 4.3.4
summarizes the basic statistics of these pages.
Table 4.3.4: Basic statistics of the different partitions in selected 433 pages of the Bentham
database.
Number of: Training Validation Test Total
Pages 350 50 33 433
Lines 9,198 1,415 860 11,473
Run. words 86,075 12,962 7,868 106,905
Lexicon 8,659 2,710 1,947 9,717
Run. OOV - 857 417 -
Lex. OOV - 681 377 -
Character set size 86 86 86 86
Run. Characters 442,336 67,400 40,938 550,674
The 433 pages were divided into three sets: training, composed of 350 pages; validation, composed
of 50 images; and the test set, composed of the 33 remaining pages. Table 4.3.4 contains some basic
statistics of these partitions. The OOV words reported in the Validation column are words that do
not appear in the training set. On the other hand, the OOV words reported in the Test column are
words that appear neither in the training set nor in the validation set.
Table 4.3.5 shows the WER(%) obtained for the Validation set for different values of the number of
states (NS) and the number of Gaussians (NG) parameters. The experiments have been carried out
with an open vocabulary. The best WER, 32.6%, has been obtained using NS = 12 and NG = 128.
Table 4.3.5: Transcription Word Error Rate (WER) for the validation set for different values of the
parameters NS and NG. All results are percentages.
NG/NS 6 8 10 12
32 43.8 38.6 36.3 34.5
64 41.4 37.0 34.6 33.4
128 40.1 36.2 34.0 32.6
HTR in ESPOSALLES database
The ESPOSALLES database [Romero 2013] is compiled from a book of the Marriage License
Books collection (called Llibres d'Esposalles), conserved at the Archives of the Cathedral of
Barcelona. This collection is composed of 291 books with information of approximately 600,000
marriage licenses given in 250 parishes between 1451 and 1905.
These demographic documents have already been proven to be useful for genealogical research and
population investigation, which renders their complete transcription an interesting and relevant
problem [Esteve 2009]. The contained data can be useful for research into population estimates,
marriage dynamics, cycles, and indirect estimations for fertility, migration and survival, as well as
socio-economic studies related to social homogamy, intra-generational social mobility, and inter-
generational transmission and sibling differentials in social and occupational positions.
The ESPOSALLES database, used here, consists of a relatively small part of the whole collection
outlined above. More specifically, the current database version encompasses 173 page images
containing 1,747 marriage licenses.
ESPOSALLES experiments
In this subsection we provide the results obtained in the experiments carried out with the
ESPOSALLES database. A more detailed description of the database and additional HTR results can
be found in [Romero 2013].
The database consists of marriage license manuscripts digitized by experts at 300 dpi in true color
and saved in TIFF format. The book was written between 1617 and 1619 by a single writer in
old Catalan. It contains 1,747 licenses on 173 pages. The main block of the whole manuscript was
transcribed line by line by an expert paleographer exactly as it appears in the text,
without correcting orthographic mistakes or writing out abbreviations.
The complete annotation contains around 60,000 running words in over 5,000 lines, split up into
nearly 2,000 licenses. The last column of Table 4.3.6 summarizes the basic statistics of the text
transcriptions.
Each page is divided vertically into individual license records. The marriage license contains
information about the husband's and wife's names, the husband's occupation, the husband's and wife's
former marital status, and the socioeconomic position given by the amount of the fee. In some
cases, even further information is given, such as the fathers' names and occupations, information
about a deceased parent, place of residence, or geographical origin.
The 173 pages of the database were divided into 7 consecutive blocks of 25 pages each (1–25,
26–50, …, 151–173; the last block contains 23 pages), aimed at performing cross-validation
experiments. Table 4.3.6 contains some
basic statistics of the different partitions defined. The number of running words for each partition
that do not appear in the other six partitions is shown in the running out-of-vocabulary (Run. OOV)
row.
Table 4.3.6: Basic statistics of the different partitions for the database LICENSES
Number of: P0 P1 P2 P3 P4 P5 P6 Total
Pages 25 25 25 25 25 25 23 173
Licenses 256 246 246 249 243 255 252 1,747
Lines 827 779 786 768 771 773 743 5,447
Run. words 8,893 8,595 8,802 8,506 8,572 8,799 8,610 60,777
Run. OOV 426 374 368 340 329 373 317 -
Lexicon 1,119 1,096 1,106 1,036 1,046 1,078 1,011 3,465
Characters 48,464 46,459 47,902 45,728 46,135 47,529 46,012 328,229
Character set size 85 85 85 85 85 85 85 85
Two different recognition tasks were performed with the ESPOSALLES data set. In the first one,
called line level, each text line image was independently recognized. In the second, called license
level, the feature sequences extracted from all the line images belonging to each license record are
concatenated. Hence, in the license level recognition more context information is available. In
addition, two experimental conditions were considered: Open and Closed Vocabulary (OV and CV).
The results can be seen in Table 4.3.7.
Table 4.3.7: Transcription Word Error Rate (WER) in the ESPOSALLES database. All results are
percentages.
P0 P1 P2 P3 P4 P5 P6 Total
Lines OV 19.4 15.3 13.4 16.1 14.1 18.5 15.8 16.1
Lines CV 15.2 10.9 9.2 11.7 9.7 14.6 12.6 12.0
Licenses OV 17.3 14.1 12.8 15.5 14.0 16.6 15.0 15.0
Licenses CV 13.3 9.8 8.8 11.2 10.0 12.6 11.3 11.0
The characters were modelled by continuous density left-to-right HMMs with 6 states and 64
Gaussian mixture components per state. The optimal number of HMM states and Gaussian densities
per state were tuned empirically. In the line recognition problem the WER was 12.0% and 16.1% in
the closed and open vocabulary settings, respectively. In the whole license recognition case, the
WER decreased as expected, given that the language model in this case is more informative. In
particular, the system achieved a WER of 11.0% with closed vocabulary and 15.0% with open
vocabulary.
HTR in PLANTAS database
The PLANTAS database has been compiled from the manuscript piece entitled “Historia de las
plantas” written by Bernardo de Cienfuegos in the 17th century. In this case, HTR experiments were
only carried out on the PROLOGO chapter, which belongs to the first volume of the manuscript.
Table 4.3.8 shows basic statistics of this part of the dataset.
Table 4.3.8: Basic statistics for the PROLOGO chapter of the PLANTAS manuscript.
Num. of Pages 38
Num. of Lines 1 206
Running Words 11 642
Lexicon 3 899
Running Chars 61 973
Num. of Dif. Chars 71
Experimental setup
Because of the small size of the PROLOGO chapter of the PLANTAS manuscript (38 pages), we
have performed a tenfold cross validation for experimental evaluation. Therefore, Table 4.3.9
includes information about number of text lines used for training and test along with the
corresponding running words, lexicon and out-of-vocabulary words (OOV) over all cross test sets.
Table 4.3.9: Basic statistics of the tenfold cross validation partition defined for the PROLOGO
chapter of the PLANTAS manuscript. OOV stands for out-of-vocabulary words.
Blk-ID Lines (Training) Lines (Test) Words (RW-Test) Words (OOV) OOV(%) Lex-Test Lex-OOV
00 1 080 126 1 490 337 22.61 565 304
01 1 074 132 1 528 314 20.54 595 300
02 1 081 125 1 435 388 27.03 592 346
03 1 080 126 1 493 317 21.23 573 277
04 1 078 128 1 444 357 24.72 642 337
05 1 078 128 1 446 384 26.55 662 364
06 1 080 126 1 445 358 24.77 666 348
07 1 078 128 1 560 335 21.47 606 306
08 1 082 124 1 529 356 23.28 631 341
09 1 143 63 798 171 21.42 36 165
For each cross-validation run, using only the corresponding training samples, an open-vocabulary
dictionary was built and a Kneser-Ney smoothed n-gram model was trained. On the other hand, the
71 different characters were modelled by continuous density left-to-right HMMs with different
numbers of states and Gaussian mixture components per state. As in the previous datasets, the
optimal number of HMM states and Gaussian densities per state has also been tuned empirically.
PLANTAS experiments
Preliminary recognition results over the ten cross-validation runs on the PROLOGO chapter of the
PLANTAS manuscript are reported in Table 4.3.10 for different numbers of Gaussian densities per
HMM state, optimized for 12 HMM states and the corresponding values of the grammar scale factor
(GSF) and word insertion penalty (WIP). WER and CER figures were computed after filtering out
tags and punctuation marks.
The best WER/CER figures of 50.06%/22.40% are achieved by using character HMMs of 12 states
and 32 Gaussians per state.
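For context, the GSF and WIP reported in Table 4.3.10 enter the decoding criterion as a log-linear combination of the optical (HMM) and language-model scores, following the usual convention of HTK-style decoders; a schematic sketch, not the actual decoder code:

def hypothesis_score(log_p_optical, log_p_lm, num_words, gsf, wip):
    """Log-linear decoding score: the optical (HMM) log-likelihood plus the
    language-model log-probability scaled by the grammar scale factor (GSF),
    plus a word insertion penalty (WIP) per output word."""
    return log_p_optical + gsf * log_p_lm + wip * num_words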
Table 4.3.10: Best Word Error Rate (WER(%)) and Character Error Rate (CER(%)) reported results
on Prologo PLANTAS database, for HMMs with 12 states and for different number of Gaussian
Densities per each HMM state (NG) and optimized parameter values of grammar scale factor (GSF)
and word insertion penalty (WIP).
NG GSF WIP WER(%) CER(%)
8 105.81 -43.89 50.68 23.60
16 120.71 -66.96 50.11 23.46
32 107.96 -13.33 50.06 22.40
64 102.68 -72.01 50.99 22.82
HTR in Hattem database
The Hattem database used in the experiments described in this section is transcribed at line level
for only 40 page images. From the layout point of view, these images are characterized by text lines
lying very close to each other. In fact, the images have many touching characters between
consecutive lines, which makes line detection and line extraction very difficult. For the
experiments reported in this section, the baselines were manually annotated as polylines.
From the HTR point of view, this is a single-writer collection. Another characteristic is that there
are no punctuation symbols. Paragraphs are marked by special symbols that have not been taken into
account in the following experiments. The manuscript has many abbreviations, but this
characteristic has not been taken into account either. That means that "diplomatic" HTR experiments
have been carried out.
Preliminary HTR experiments
This subsection shows the results obtained in the experiments carried out with the 40 selected
pages. These 40 pages contain 1,552 lines with nearly 10,330 running words and a vocabulary of
2,602 different words. The last column of Table 4.3.11 summarizes the basic statistics
of these pages.
Table 4.3.11: Basic statistics of the different partitions in selected 40 pages of the Hattem database.
Number of: P0 P1 P2 P3 P4 P5 P6 P7 Total
Pages 5 5 5 5 5 5 5 5 40
Lines 186 200 195 185 191 191 203 201 1,552
Run. words 1,280 1,351 1,318 1,227 1,237 1,230 1,344 1,343 10,330
Run. OOV 244 276 264 261 250 285 306 248 -
Lex. OOV 219 241 228 234 210 221 264 225 -
Lexicon 551 575 562 581 554 544 611 570 2,602
Character set size 45 43 44 46 43 52 44 45 60
Run. Characters 5,148 5,506 5,446 5,063 5,242 5,006 5,656 5,645 42,712
The 40 pages were divided into 8 blocks of 5 pages each, aimed at performing cross-validation
experiments. That is, we carry out 8 rounds, with each of the partitions used once as validation data
and the remaining seven partitions used as training data. Table 4.3.11 contains some basic statistics
of the different partitions defined. The number of running words of each partition that do not
appear in the other seven partitions is shown in the running out-of-vocabulary (Run. OOV) row.
The first two rows of Table 4.3.12 show the best results obtained with both Open and Closed
Vocabulary (OV and CV). The values of the parameters that give the best results are: NS = 8, NG =
64. In the OV setting, only the words seen in the training transcriptions were included in the
recognition lexicon. In CV, on the other hand, all the words which appear in the test set but were not
seen in the training transcriptions (i.e., the Lex. OOV) were added to the lexicon. Except for this
difference, the language models in both cases were the same; i.e., bi-grams estimated only from the
training transcriptions. Also, the morphological character models were obviously trained using only
training images and transcriptions.
These two lexicon settings represent extreme cases with respect to real use of HTR systems.
Clearly, only the words which appear in the lexicon of a (word-based) HTR system can ever have
the chance to be output as recognition hypotheses. Correspondingly, an OV system will commit at
least one error for every OOV word instance which appears in test images. Therefore, in practice
the training lexicon is often extended with missing common words obtained from some adequate
vocabulary of the task considered. In this sense, the CV setting represents the best which could be
done in practice, while OV corresponds to the worst situation. In the next experiments we only use
OV lexicon.
Table 4.3.12: Best Word Error Rate (WER(%)) and Character Error Rate (CER(%)) reported results
on Hattem database, for HMMs with 8 states and 32 and 64 Gaussian Densities per each HMM state
(NG) and optimized parameter values of grammar scale factor (GSF) and word insertion penalty
(WIP). Only the best results are reported.
NG GSF WIP WER(%) CER(%) WER(%) CER(%)
32 70 -110 25.5 13.7 43.3 21.6
64 70 -110 24.5 13.1 42.8 21.0
4.4 Keyword Spotting
4.4.1 The query-by-example approach
We evaluate the segmentation-based and segmentation-free word spotting on a handwritten database
from the writing of Bentham's secretaries. It consists of 433 high-quality (approximately 3000 pixels
wide and 4000 pixels high) handwritten documents from a variety of authors. It poses several
very difficult problems, of which the most difficult is word variability. The variation of the same
word is extreme and involves writing style, font size and noise, as well as their combinations. Figure
4.4.1.1 shows some examples of these instances.
Figure 4.4.1.1: Types of word variation found in the Bentham Database: (a) word “England”; (b)
word “Embezzlement”.
The metrics employed in the performance evaluation of the proposed segmentation-based and
segmentation-free algorithms are the Precision at the k Top Retrieved words (P@k) and the Mean
Average Precision (MAP). To further detail the metrics, let us define Precision and P@k as follows:
Precision = \frac{|\{\text{relevant words}\} \cap \{\text{retrieved words}\}|}{|\{\text{retrieved words}\}|}        (4.4.1.1)

P@k = \frac{|\{\text{relevant words}\} \cap \{k\ \text{retrieved words}\}|}{|\{k\ \text{retrieved words}\}|}        (4.4.1.2)
Precision is the fraction of retrieved words that are relevant to the search, while when precision
must be determined for the k top retrieved words, P@k is computed. In particular, in the proposed
evaluation we use P@5, which is the precision at the top 5 retrieved words. This metric measures
how successfully the algorithms place relevant results in the first 5 positions of the ranking list.
This is an important measurement because the current word-spotting scenarios act as recommender
systems for a transcription operation. The above metrics are clearly suited to binary classification,
i.e. a retrieved word is either relevant or non-relevant to the query. This may pose problems for
segmentation-free systems, as they may not detect the whole word or may include parts of another
word. This situation is resolved by detecting the common area between the ground truth word area
and the output of the method; if it exceeds a threshold, the result is
considered relevant.
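A sketch of this overlap criterion, assuming word regions are represented as axis-aligned bounding boxes and that the threshold is expressed as a fraction of the ground-truth word area (both assumptions are ours; the exact representation is not specified above):

def is_relevant(gt_box, det_box, threshold=0.5):
    """gt_box/det_box are (x1, y1, x2, y2); the detection counts as relevant
    if the common area covers at least `threshold` of the ground-truth word."""
    ix1, iy1 = max(gt_box[0], det_box[0]), max(gt_box[1], det_box[1])
    ix2, iy2 = min(gt_box[2], det_box[2]), min(gt_box[3], det_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return inter / gt_area >= threshold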
The second metric used in the proposed evaluation is the Mean Average Precision (MAP), which is a
typical measure of the performance of information retrieval systems. There are two implementation
strategies for the calculation of Average Precision (AP): the interpolated AP and the non-
interpolated AP [Nist 2013], [Chatzichristofis 2011], both implemented by the Text Retrieval
Conference (TREC) community of the National Institute of Standards and Technology (NIST). For
the proposed evaluation, the non-interpolated AP is used, which is defined as the average of the
precision values obtained after each relevant word is retrieved:
AP = \frac{\sum_{k=1}^{n} P@k \cdot rel(k)}{|\{\text{relevant words}\}|}        (4.4.1.3)

where

rel(k) = \begin{cases} 1 & \text{if the word at rank } k \text{ is relevant} \\ 0 & \text{if the word at rank } k \text{ is not relevant} \end{cases}        (4.4.1.4)
Finally, the Mean Average Precision (MAP) is calculated by averaging the AP for all the queries and
it is defined as:
MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP_q        (4.4.1.5)
where Q is the total number of queries.
This metric is chosen because it rewards systems that retrieve highly ranked relevant words.
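All three measures can be computed directly from each query's ranked retrieval list together with binary relevance labels; a compact illustrative sketch (not the evaluation code used here):

def precision_at_k(relevance, k):
    """relevance: list of 0/1 flags in ranking order for one query."""
    return sum(relevance[:k]) / k

def average_precision(relevance, num_relevant):
    """Non-interpolated AP: mean of P@k over the ranks k of the relevant words."""
    hits, ap = 0, 0.0
    for k, rel in enumerate(relevance, 1):
        if rel:
            hits += 1
            ap += hits / k          # P@k at this relevant rank
    return ap / num_relevant if num_relevant else 0.0

def mean_average_precision(per_query_aps):
    """MAP: average of the per-query AP values."""
    return sum(per_query_aps) / len(per_query_aps)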
Initially, the proposed segmentation-based word spotting was evaluated against two previous
segmentation-based, profile-based strategies. The results are shown in Table 4.4.1.1.
Subsequently, the proposed segmentation-free word-spotting procedure was evaluated against two
previous segmentation-free procedures: one employs a sliding-window strategy and the other detects
and matches informative parts of the document. Another consideration for segmentation-free
procedures is efficiency. As the word spotting system is a component that plays the role of a
recommendation system in the transcription stage, the retrieved results should be obtained quickly,
otherwise they will be of no use. For this purpose, the time expense of the matching step has been
computed, as shown in Table 4.4.1.1. The experiments were performed
single-threaded on an Intel Core 2 Duo E8400 CPU (3 GHz). The number of query word images
used was 5. For each word image, two variations were used, resulting in a total of 10 queries. This
number should be increased in future experiments; however, the high time expense of the methods
against which comparisons were made did not permit us to achieve such a goal. In the case of
segmentation-based methods, this obstacle did not exist and therefore, to improve the statistical
significance of our results, we made additional experiments with the segmentation-based
approaches using 20 query word images. For each query word, two different variations were used,
resulting in a total of 40 queries. Table 4.4.1.2 shows the performance evaluation results for the
augmented query set.
Table 4.4.1.1: Overall Performance Evaluation Results
Methodology type Methodology P@5 MAP Time Expense for Matching
Segmentation-free Sliding Window [Gatos 2009] 10% 0.53% ~12 sec/page
Segmentation-free Patch-based [Leydier 2009] 51.33% 15.01% ~263 sec/page
Segmentation-free Proposed Method 35.33% 6.54% ~3 sec/page
Segmentation-based Proposed Method 57.33% 20.42% ~0.5 sec/page
Segmentation-based WS [Manmatha 1996] 21.33% 9.89% ~0.001 sec/page
Segmentation-based CSPD [Zagoris 2010] 55.33% 19.6% ~0.001 sec/page
Table 4.4.1.2: Performance Evaluation Results for Segmentation–based Methods
P@5 MAP
Manmatha [Manmatha 1996] 20.00% 2.42%
CSPD [Zagoris 2010] 30.00% 3.81%
Proposed Method 45.33% 10.02%
The performance evaluation results shown in Tables 4.4.1.1 and 4.4.1.2 confirm the extreme
difficulty of the Bentham database, due to writing variations and the variety of degradations
appearing in the documents. Moreover, they show that the segmentation-based methods performed
better, both in effectiveness and efficiency, than the segmentation-free methods. However, we should
note that in our experiments we assumed perfect word segmentation, since the ground truth was
used. This means that these results are an upper limit on the achievable performance, which will be
reduced when an imperfect state-of-the-art word segmentation method is used.
Last but not least, the role of the word spotting architecture as a recommender system for a
transcription operation makes it imperative to take both effectiveness and efficiency into
consideration. In this respect, the results produced by the proposed methods proved to be promising.
4.4.2 The query-by-string approach
To compare the effectiveness of the proposed and the classical HMM-Filler KWS approaches,
several experiments were carried out. The evaluation measures, corpora, experimental setup and the
results are presented next.
Evaluation Measures
The standard recall and interpolated precision measures [Manning 2008] are used here. Interpolated
precision is widely used in the literature to avoid cases in which plain precision can be ill-defined.
In addition, we employ another popular scalar KWS assessment measure called average precision
(AP) [Zhu 2004][Robertson 2008] which, actually, is the area under the Recall-Precision curve.
On the other hand, real computing times are reported in terms of total elapsed times needed on
dedicated 64-bit Intel Xeon computers running at 2.5GHz.
Corpora Description
Experiments were carried out with two different corpora: “Cristo-Salvador” (CS) and IAMDB.
CS is a XIX century Spanish manuscript, entitled “Noticia histórica de las fiestas con que Valencia
celebró el siglo sexto de la venida a esta capital de la milagrosa imagen del Salvador”, which was
kindly provided by the Biblioteca Valenciana Digital (BiVaLDi). It is composed of 50 color images
of 5.1’’ x 7.2’’ relevant text pages, written by a single writer and scanned at 300 dpi. Some examples
are shown in Fig. 4.4.2.1. Line images, extracted from the original page images in previous works
with this corpus [Romero 2007], are also available. The whole CS corpus, along with directions for
its use in HTR experiments, is publicly available for research purposes from
https://prhlt.iti.upv.es/page/data.
The first 29 pages (675 lines) are used for training and the test set encompasses the remaining 21
pages (497 lines), many of which contain notes and short comments written in a style rather
different from that of the training pages. Table 4.4.2.1 summarizes this information.
Word statistics were computed ignoring capitalization and punctuation marks. The number of
different words of the test partition that do not appear in the training partition is shown in the
column OOV (out of vocabulary words).
Figure 4.4.2.1: Examples of page images from CS corpus.
Table 4.4.2.1: Basic statistics of the partition of the CS database. OOV stands for Out-Of-
Vocabulary words.
Training Test Total
Running Chars 35 863 26 353 62 216
Running Words 6 223 4 637 10 860
Running OOV(%) 0 29.03 -
# Lines 675 497 1 172
# Pages 29 21 50
Char Lex. Size 78 78 78
Word Lex. Size 2 236 1 671 3 287
OOV Lex. Size 0 1 051 -
Figure 4.4.2.2: Examples of text line images from IAMDB.
IAMDB is also a publicly accessible corpus compiled by the Research Group on Computer Vision
and Artificial Intelligence (FKI) at the Institute of Computer Science and Applied Mathematics
(IAM). The acquisition was based on the Lancaster-Oslo/Bergen Corpus (LOB). Fig. 4.4.2.2 shows
examples of handwritten sentence images from this corpus.
The last released version (3.0) is composed of 1539 scanned text pages, handwritten by 657
different writers. The database is provided at different segmentation levels: words, lines, sentences
and page images. In our case, the line segmentation level is considered [Marti 2002]. The corpus
was partitioned into training, validation and test sets containing text lines from disjoint sets of
writers, whose basic statistics are summarized in Table 4.4.2.2.
Table 4.4.2.2: Basic statistics of the partition of the IAMDB database. OOV stands for Out-Of-
Vocabulary words.
Training Valid. Test Total
Running Chars 269 270 39 318 39 130 347 718
Running Words 47 615 7 291 7 197 62 103
Running OOV(%) 0 18.26 14.81 -
# Lines 6 161 920 929 8 010
# Pages 1 539
Char Lex. Size 72 69 65 81
Word Lex. Size 7 778 2 442 2 488 9 809
OOV Lex. Size 0 1 095 936 -
Sets of Word Queries to be Spotted
The sets of keywords to be spotted for the CS and IAMDB corpora are the ones already used in
[Toselli 2013] and [Frinken 2012], respectively. The criterion followed for keyword selection was
described in [Frinken 2012], where all the words seen in the training partition are tested as
keywords. However, in contrast to IAMDB, stop words were not removed from the CS query set.
For the CS corpus, the selected query vocabulary consists of M = 2236 words, including
numeric expressions and a few symbols. It is worth mentioning that, from these query words, only
620 actually appear in the test line images. We say that these query words are pertinent, whereas the
remaining 1616 words are not pertinent. Clearly, spotting non-pertinent words is also challenging,
since the system may erroneously find other similar words, thereby leading to important precision
degradations. On the other hand, many of the pertinent query words appear more than once in the
test set. Concretely, there are 3291 pertinent query word occurrences, representing around 85% of
the total number of running words in the test images. Each of the 2236 word queries has to be tested
against all N = 497 test lines. So the total number of query-line events, N·M, is 1,111,292, of which
only 3005 (corresponding to the above-mentioned running pertinent words) are pertinent and
relevant. All this information is summed up in Table 4.4.2.3, along with the corresponding
information for IAMDB.
Experimental Set-up
CS and IAMDB character HMMs were trained from the corresponding training partitions. In
general, a left-to-right HMM was trained for each of the elements appearing in the training text
images, such as lowercase and uppercase letters, symbols, blank spaces, special abbreviations,
possible blank spaces between words and characters, crossed-words, etc. However, specific details
of the image preprocessing and HMM training setup usually adopted for CS and IAMDB are fairly
different. In particular, different line image preprocessing, writing style attribute normalization and
feature extraction are usually adopted for each corpus. In addition, a caseless HMM topology was
utilized for CS; that is, each character HMM models, “in parallel”, both lower- and upper-case
letters.
Table 4.4.2.3: Basic statistics of the selected query word sets for CS and IAMDB along with their
related information.
CS IAMDB
Total Relev/Pert Total Relev/Pert
# Line images: N 497 497 929 855
# Query words: M 2 236 620 3 421 1 098
# Running pertinent query words - 3 291 - 1 925
# Line-query events: N·M 1 111 292 3 005 3 178 109 1 916
See [Toselli 2013] and [Fischer 2010] for details about the meta-parameters of line-image
preprocessing, feature extraction and HMMs (which were optimized through cross-validation on the
training data for CS and on the validation data for IAMDB).
In the preprocessing phase of the proposed CL approach, the CL of each line image of the test
partition was obtained, using only the “filler” model (see Fig. 2.4.2.1a). Table 4.4.2.4 shows some
statistics of the resulting CLs, for both corpora, which were generated using the HTK toolkit
[Young 1997] setting the maximum input branching factor to 30. Then, the corresponding forward-
backward scores were computed according to Eq. 3.4.2.1-3.4.2.2. In the query phase, the character
sequence scores, S’(c,x), were computed for each keyword, υ = c, and line image, x, according to
Alg. 1. Finally, the KWS scores S(c, x) were determined for all the keywords, according to
Eq. 3.4.2.5.
Table 4.4.2.4: Statistics of the CS and IAMDB CLs. All the figures are numbers of elements,
averaged over all the CLs.
Corpora Nodes Edges Branch. Fact. Paths
CS 35 608 1 058 890 30 ~10^305
IAMDB 37 313 1 112 070 30 ~10^307
The experimental setup for HMM-Filler approach was the usual one [Fischer 2010]. In the
preprocessing phase the filler model scores sf (x) for all the test line images x were computed. Then
in the query phase, for each keyword υ to be spotted a keyword-specific model was assembled and
the corresponding decoding score sυ(x) was obtained for each line image x. Finally the KWS scores
were computed according to Eq. 2.4.2.3. HTK was also used for HMM decoding.
Results
The goal of the experiments carried out is twofold: to assess both the effectiveness and the efficiency of the
proposed CL-based HMM-Filler with respect to its classical version.
Experiments were carried out with the CS and IAMDB corpora using both the classical HMM-Filler
and the CL-based approach proposed here. Since, for each corpus, both approaches share identical
text-line image processing and character HMMs, precision-recall performance results should be
identical for both approaches, according to Eq. 3.4.2.5. However, since CL-based KWS
(necessarily) relies on incomplete (pruned) CLs, some degradation is actually expected. On the
other hand, according to Secs. 2.4.2 and 3.4.2, a much lower keyword search computing time is
expected for the proposed CL-based approach.
Recall-precision curves for both corpora and both approaches are shown in Fig. 4.4.2.3, while
Table 4.4.2.5 shows the corresponding Average Precision (AP) figures. This table also contains the
relevant computing times, split into preprocessing and search (query) times, the latter given in
average minutes required for each single keyword search. The table also shows the total time (in
days) needed to fully index each corpus with the corresponding selected keywords (2236 for CS and
3421 for IAMDB).
Figure 4.4.2.3: Recall-Precision curves for CS (left) and IAMDB (right), using the reference
HMM-Filler approach (REF) and the CL-Based version (CL).
As expected, a large gain in efficiency is observed for the proposed CL-based HMM-Filler method
with respect to the classical approach. The average query response time is reduced by a factor of 43
for CS and by a factor of 76 for IAMDB. The corresponding overall time needed for fully indexing
all the selected keywords is affordable using the proposed CL-based KWS method, while it
becomes clearly prohibitive for the classical HMM-Filler approach. These great speedups are
achieved without any significant loss in effectiveness, as supported by the overall AP figures of
Table 4.4.2.5 and by the detailed recall-precision graphs of Fig.4.4.2.3.
Table 4.4.2.5: Effectiveness (AP) and efficiency in terms of preprocessing and average query times
and total indexing times, for classical (REF) and CL-based HMM-Filler KWS.
CS IAMDB
REF CL REF CL
Average Precision 0.62 0.61 0.36 0.36
Preprocessing Time (min) 124.8 197.6 66.7 219.5
Average Query Time (min) 124.4 2.9 68.4 0.9
Total Indexing Time (days) 193.3 4.6 162.5 2.3
To conclude, it is important to remark that in all the experiments, full Viterbi decoding has been
carried out. At the expense of very small accuracy degradations, processing times for both methods
can be easily reduced (perhaps by one order of magnitude) by using beam-search pruning
techniques. Additionally, the CL-based approach still has much room for further speedups simply
by using smaller CLs. Informal tests suggest that, in this way, CL-based KWS can become about
one order of magnitude faster without significant additional accuracy degradation.
On the other hand, small accuracy degradations can be more than compensated by taking into
account contextual information in the filler model, for instance using character N-Grams. Clearly,
this will result in significantly larger filler models which will make the computing time of the
classical HMM-Filler approach much more prohibitive. In contrast, using these larger models no
significant computing time increase is expected for the proposed CL-based approach.
5. Description of Tools
5.1 Image Pre-processing
5.1.1 Binarization
The first version of the binarization toolkit was delivered in the form of a console application:
binarization [infile] [outfile] [Param]
where [infile] is the input image, [outfile] the output binary image after binarization and
[Param] is a parameter that can be 0, 1, or 2. If no [Param] is given, [Param=0] is considered.
Notice that for [Param=1] binarization gives better results for bleed-through cases, for [Param=2]
binarization gives better results for faint character detection and [Param=0] is the default
binarization that can handle most cases in general. Examples of the developed binarization
methodology are given in Fig.5.1.1.1.
Figure 5.1.1.1: Binarization examples; (a)-(b) original image and binarization result of a typical
case; (c)-(e) original image and binarization results using [Param=0] and [Param=1], respectively, of
a bleed-through case; (f)-(h) original image and binarization results using [Param=0] and
[Param=2], respectively, of a case with faint characters.
5.1.2 Border Removal
The first version of the border removal toolkit was delivered in the form of a console application:
BorderRemoval [infile] [outfile]
where [infile] is the input binary image and [outfile] the output binary image after border removal.
Examples of the developed border removal methodology are given in Fig.5.1.2.1.
Figure 5.1.2.1: Border removal examples; (a)-(c) original images; (d)-(f) images after binarization;
(g)-(i) images after border removal.
5.1.3 Image Enhancement
Text image enhancement tool
The imgtxtenh command line tool implements the proposed modification of the Sauvola algorithm.
It takes as input an image which can be in any of the following formats: PPM, PGM, BMP, PNG,
JPEG and TIFF. The output is the resulting image in grayscale, and the output format can be
requested to be in any of the following formats: PGM, PNG, JPEG and TIFF. By default the
original image is read from standard input and the resulting image is sent to standard output. It can
be requested that the original image be read from a file and that the output image be written to a
file, by using the arguments -i {INFILE} and -o {OUTFILE}.
The algorithm parameters that can be varied are the window width w, the mean threshold factor k
(both from the original Sauvola algorithm) and the black-to-white transition factor s. These three
parameters are specified to the tool as arguments with the respective letters, i.e., -w {INT}, -k
{FLOAT} and -s {FLOAT}. To apply contrast stretching to the image before the algorithm is
applied, use the argument -S {SATU}, where SATU is the percentage of pixels that will be saturated
at the extremes of the range.
For example, to apply the original Sauvola algorithm with parameters w = 30 and k = 0.2 to an
image Esposalles.jpg and save the result to out.jpg, the following command would be issued:
imgtxtenh -s 0 -w 30 -k 0.2 -i Esposalles.jpg -o out.jpg
To apply the proposed algorithm with the same parameters as in the previous example and a black-
to-white transition factor s = 0.5, the following command would be issued:
imgtxtenh -s 0.5 -w 30 -k 0.2 -i Esposalles.jpg -o out.jpg
Finally, for the same parameters but also applying contrast stretching with a saturation of 0.01%, the
following command would be issued:
imgtxtenh -S 0.01 -s 0.5 -w 30 -k 0.2 -i Esposalles.jpg -o out.jpg
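For readers unfamiliar with the parameters, w and k correspond to those of the original Sauvola rule, which derives a local threshold from the mean m and standard deviation s of the grey values in a w x w window. A minimal per-window sketch of that original rule follows (the black-to-white transition factor of the modified algorithm is not reproduced here; R = 128 is the usual choice for 8-bit images):

import numpy as np

def sauvola_threshold(window, k=0.2, R=128.0):
    """Original Sauvola rule for one local window of an 8-bit grey image:
    T = m * (1 + k * (s / R - 1)), with m/s the window mean/std-dev and
    R the assumed dynamic range of the standard deviation."""
    m, s = window.mean(), window.std()
    return m * (1.0 + k * (s / R - 1.0))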
Show-through reduction tool (imgtogray)
The imgtogray command line tool, as the name suggests, simply receives a color image as input and
produces a grayscale image. As with the imgtxtenh tool, the input image can be in any of the
following formats: PPM, PGM, BMP, PNG, JPEG and TIFF, and the output in any of the following
formats: PGM, PNG, JPEG and TIFF. By default the original image is read from standard input and
the resulting image is sent to standard output. It can be requested that the original image be read
from a file and that the output image be written to a file by using the arguments -i {INFILE} and -o
{OUTFILE}.
This imgtogray tool can be used for show-through reduction by providing a color to gray
conversion model that has been optimized for this task. A model that has been optimized for the
Plantas corpus has been hard coded in the tool, and can be selected by giving as argument -1. For
example to process an image Plantas.jpg, one would issue the command:
imgtogray -1 -i Plantas.jpg -o out.jpg
Grayscale image enhancement
The first version of the grayscale enhancement toolkit was delivered in the form of a console
application:
enhanceGray [infile] [outfile] [Param]
where [infile] is the input grayscale or color image, [outfile] the output grayscale image after
enhancement and [Param] is a parameter that can be 0, 1, 2 or 3. If no [Param] is given, [Param=0]
is considered. For [Param=0] background smoothing is performed using background estimation, for
[Param=1] a 5x5 Wiener filter is applied, for [Param=2] a 3x3 Gaussian filter with σ=1 is applied and
for [Param=3] an unsharp filter using a 9x9 mask followed by a 3x3 Wiener filter is applied.
Examples of the enhancing methodologies are provided in Fig. 5.1.3.1.
Figure 5.1.3.1: Representative examples of grayscale image enhancement; (a) original image; (b)
background smoothing using background estimation; (c) 5x5 Wiener filtering; (d) 3x3 Gaussian
filtering; (e) 9x9 unsharp filtering followed by a 3x3 Wiener filter.
Binary image enhancement
The first version of the binary image enhancement toolkit was delivered in the form of a console
application:
enhanceBinary [infile] [outfile] [Param]
if [Param=2]: enhanceBinary [infile] [outfile] [Param] [Param2]
where [infile] is the input binary image, [outfile] the output binary image after enhancement and
[Param] is a parameter that can be 0, 1, or 2. If no [Param] is given, [Param=0] is considered,
while in the case of [Param=2] an additional parameter, i.e. [Param2], can be given. For [Param=0]
contour enhancement is performed, for [Param=1] a morphological closing using a 3x3 mask is
applied and for [Param=2] a despeckle filter with a 7x7 mask is applied. In the last case of the
despeckle filter, if an additional parameter with value "Z" is given, then the despeckle mask will be
of size ZxZ. Examples of the enhancing methodologies are given in Fig. 5.1.3.2.
Figure 5.1.3.2: Examples of enhancing a binary image; (a) initial binary image; (b) contour enhancement; (c) 3x3 morphological closing; (d) despeckle filtering (7x7 mask).
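As an illustration with hypothetical file names, applying the despeckle filter with a 5x5 mask instead of the default 7x7 one would be done with:
enhanceBinary page_bin.png page_clean.png 2 5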
5.1.4 Deskew
Deskew at page and text line level
The first version of the skew correction toolkit was delivered in the form of a console application:
Page_Line_Skew [infile] [outfile]
where [infile] is the input binary image and [outfile] the output binary image after skew correction.
Examples of the developed skew correction methodology are given in Fig. 5.1.4.1.
Figure 5.1.4.1: Skew correction examples; (a), (c), (e) images after binarization; (b), (d), (f) images after skew correction.
Deskew at word level
The first version of the skew correction toolkit was delivered in the form of a console application:
Word_Skew [infile] [outfile]
where [infile] is the input binary image and [outfile] the output binary image after skew correction.
Examples of the developed skew correction methodology are given in Fig. 5.1.4.2.
Figure 5.1.4.2: Skew correction examples; (a)-(e) images after binarization; (f)-(j) images after skew correction.
5.1.5 Deslant
Deslant at text line level
The first version of the slant correction toolkit was delivered in the form of a console application:
Text_Line_Slant [infile] [outfile]
where [infile] is the input binary image and [outfile] the output binary image after slant correction.
Examples of the developed slant correction methodology are given in Fig. 5.1.5.1.
Figure 5.1.5.1: Slant correction examples; (a), (c), (e) images after binarization; (b), (d), (f) images after slant correction.
Deslant at word level
The first version of the slant correction toolkit was delivered in the form of a console application:
Word_Slant [infile] [outfile]
where [infile] is the input binary image and [outfile] the output binary image after slant correction.
Examples of the developed slant correction methodology are given in Fig. 5.1.5.2.
Figure 5.1.5.2: Slant correction examples; (a)-(e) images after binarization; (f)-(j) images after slant correction.
5.2 Image Segmentation
5.2.1 Detection of main page elements
Page split
The first version of the page split toolkit was delivered in the form of a console application:
PageSplit [infile1] [infile2] [outfile]
where [infile1] is the input original color image, [infile2] the corresponding binary image and [outfile] the resulting PAGE xml file in which the detected page frames are marked as different regions. Examples of the developed page split methodology are given in Fig. 5.2.1.1.
Figure 5.2.1.1: Page split examples; (a), (d) original images; (b), (e) images after binarization; (c), (f) images after page split.
Detection of text zones
The first version of the text zones detection toolkit was delivered in the form of a console
application:
TextZonesDetection [infile] [outfile]
where [infile] is the input binary image and [outfile] an ASCII file containing the coordinates of the main text zones (x1, x2 if the document is single-column, or x1, x2, x3, x4 if the document has two columns). Examples of the developed text zones detection methodology are given in Fig. 5.2.1.2.
Figure 5.2.1.2: Document text zones detection results
Segmentation of structured documents using stochastic grammars
The first version of the segmentation toolkit of structured documents was delivered in the form of a
console application:
espoparser -c config -i infile -o outfile
where [infile] is the input file with information about the class probabilities for the cells of the image and [outfile] the parsed image.
5.2.2 Text Line Detection
Baseline detection
Preliminaries
In this section you can find the instructions to install and use the tools developed for Baseline Detection. As input for this experiment we will use 10 pages of the “Plantas Prologo” database, and at the end of the process we will obtain the baselines and the corresponding Line Error Rate (LER) and Segmentation Error Rate (SER).
All the scripts and code provided have been developed for Linux.
The following software must be installed in order to proceed:
• HTK 3.4.1 toolkit
• Ruby Virtual Machine 1.9.2
• C++ Libraries: OpenCV, Boost, Log4C++, Eigen
• Ruby Libraries: narray, ruby-debug, trollop, ap, fileutils
For details on the installation of this base software, please follow the official installation instructions referenced on the project's webpage.
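As an illustration only (exact gem names may vary across Ruby setups, and fileutils ships with the Ruby standard library so it needs no separate installation), the Ruby libraries listed above would typically be installed with RubyGems:
gem install narray ruby-debug trollop ap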
Baseline detection experiment
The script launch-training-detection.rb launches a complete baseline detection experiment with a subset of the “Plantas Prologo” data set. More specifically, it launches a hold-out experiment using the first 5 pages of the corpus for training and the remaining 5 pages for testing.
To execute the script:
cd baseline_detection
./launch_training_detection.rb
The time required to carry out the complete experiment is approximately 30 minutes.
The baseline detection experiment described here involves four different phases:
Preprocessing - The script assumes we are providing clean images and applies RLSA to them in order to enhance the zones with lines.
Feature Extraction - Extracts feature vector of all pages.
Training - Trains the statistical model with the training pages feature vectors.
Test - Provides a hypothesis of the detected text lines and their location in the test pages.
When the experiment is launched it will leave the results in the folder experimentation/experiment1. In this directory you will find the following:
recogAlg_*.mlf: a series of result files obtained by models trained with 1, 2, 4, 8, 16, 32, 64 and 128 Gaussians. Inside, for each test file, you can find the labels and location of each detected line.
TESTGLOBAL: a file that contains the LER and SER of the test set for each of the models trained with a different number of Gaussians.
Adding new images for training and test
In order to change the pages used for training and test, the following steps must be performed (an illustrative command sketch is given below):
A clean preprocessed page image must be added to the directory images/clean_extracted.
If the page will be used for training, or if we want to be able to measure the error on the test pages, a .inf file for that image must be added to the inf folder. We explain later how to obtain a .inf file from a PAGE xml file.
The new page must be added to a list file, either experimentation/data/train.lst or experimentation/data/test.lst, so that the new sample is used for training or testing, respectively.
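For instance, adding a new page as a test sample could look as follows (file names are hypothetical and the exact list-entry format should match the existing entries in the .lst files):
cp new_page.jpg images/clean_extracted/
cp new_page.inf inf/
echo new_page >> experimentation/data/test.lst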
Adding the extracted baselines to PAGE format
In order to add the baselines to the PAGE format files the following must be performed:
Enter the experiment folder
Copy the recogAlg_*.mlf and the test list file to the asses_line_detection_accuracy/bin
Extract the baseline info from the .mlf file using the line_limits command
Copy the generated .l files to the images folder
Add the baseline info to the PAGE xmls with the page_format_add_baseline command
cd experimentation/experiment1
cp recogAlg_32.mlf ../../asses_line_detection_accuracy/bin
cp data/test.lst ../../asses_line_detection_accuracy/bin
cd ../../asses_line_detection_accuracy/bin
./line_limits -t test.lst -e recogAlg_32.mlf
cp *.l ../../images
cd ../../images
for f in *.l; do
../page_format_add_baseline -i ${f/l/jpg} -l ${f/l/xml} -t $f -m FILE -v 0
cp output.xml ${f/l/xml}
done
Obtaining an inf file for training from a PAGE xml file
In order to generate an inf file with the baseline information from a PAGE file, the command page2inf is used:
cd images
for f in *.xml; do ../page2inf -i ${f/xml/jpg} -l $f -m FILE -v 0 > ${f/xml/inf}; done
mv *.inf ../inf
Polygon based text line detection
The first version of the text line segmentation tool was delivered in the form of a console
application:
TextLineSegmentation [inputimage] [inputxml] [outputxml]
where [inputimage] is the input binary image, [inputxml] the xml file in Page format containing
information about the regions of the document image and [outputxml] the result PAGE xml file in
which the detected text lines are marked using polygons. Examples of the developed text line
segmentation method are provided in Fig. 5.2.2.1.
Figure 5.2.2.1: Text line segmentation example. (a) Initial binary image, (b) text line segmentation result.
5.2.3 Word Detection
The first version of the word segmentation tool was delivered in the form of a console application:
WordSegmentation [inputimage] [inputxml] [outputxml]
where [inputimage] is the input binary image, [inputxml] the xml file in Page format containing
information about the text lines of the document image and [outputxml] the result PAGE xml file in
which the detected words are marked using polygons. Examples of the developed word segmentation method are provided in Fig. 5.2.3.1.
Figure 5.2.3.1: Word segmentation example. (a) Initial binary image, (b) word segmentation result.
5.3 Handwritten Text Recognition
Preliminaries
In this section you will find directions to install the required tools for Handwritten Text Recognition (HTR) and to carry out an HTR experiment. The extracted lines of 50 pages of Bentham are the input of this experiment, and at the end of the process we will obtain their transcriptions and the corresponding Word Error Rate (WER). All the scripts are developed for Linux.
It is necessary to install some standard Linux tools and libraries:
sudo apt-get install netpbm libnetpbm10-dev libpng12-dev libtiff4-dev
HTK and software tools for HTR
To carry out the HTR experiment we are going to use the HTK toolkit, a toolkit for building and manipulating hidden Markov models. This toolkit is publicly available at:
http://htk.eng.cam.ac.uk
Once you have downloaded it, the steps to compile and install it are:
tar -xzvf HTK-3.4.1.tar.gz
cd htk
CFLAGS=-m64 ./configure --prefix=$HOME --without-x
make all
make install
The HTK programs and scripts will be installed in “$HOME/bin” (unless another path has been specified with the --prefix parameter).
The rest of the tools, utilities and examples for HTR can be downloaded from the SVN in wp3/T3.3/HTR-toolsUtils.tgz. To decompress it:
tar -xzvf HTR-toolsUtils.tgz -C $HOME
This will create the directory “HTR” in $HOME. Inside this directory there are four subdirectories:
“src”, “bin”, “scripts” and “BenthamData”.
In “src” is the source code of the preprocessing tools. To compile and install them:
cd $HOME/HTR/src
./compile.sh ../bin
The tools will be installed in “$HOME/HTR/bin”. The following executable files are found:
imgtxtenh: performs noise reduction using a modification of Sauvola
method.
pgmskew: detects the skew angle in the image and corrects it.
pgmslant: detects the slant of words (connected components) in the image
and corrects each slant by a horizontal shift transform (see [Pastor 2004]).
pgmnormsize: performs size normalization of each detected word in the image (see
[Romero 2006]).
pgmtextfea: transforms the already preprocessed image into a sequence of feature vectors.
pfl2htk: converts sequences of real-valued vectors in ASCII format to HTK format.
tasas: given the HTR output and the reference text for each line, computes the WER.
In the next section the use of these tools will be explained in detail.
In order to make these tools and the HTK tools available on the command line, do not forget to include their respective paths in the PATH variable of the shell:
export PATH=$PATH:$HOME/bin:$HOME/HTR/bin:.
In “$HOME/HTR/scripts” the following bash-shell scripts can be found:
HTR-Lines-preprocessing.sh: launches the complete text image preprocessing
on a given set of text line images.
HTR-feature-extraction.sh: launches the features extraction process on a given
set of text line images.
Script_Train.sh: launches a complete HMMs training process.
Script_Test.sh: launches a complete Viterbi testing process.
HTR-Bentham-Experiment.sh: launches a complete HTR experiment
(preprocessing, training and test) with the Bentham corpus.
HTR-LM-testing.sh: launches an HTR test with the language model that it receives as a parameter.
Create_HMMs-TOPOLOGY.sh: sets the initial HMM topology.
Create_WER_Bentham.sh: computes the WER given the system transcription hypothesis and the reference.
You may want to have a look into each of these scripts to learn more about what exactly they are
doing. Most of them just employ HTK commands.
Finally, the directory “$HOME/HTR/BenthamData” has the models, images and partitions required
to carry out an experiment.
An HTR experiment
The script HTR-Bentham-Experiment.sh launches a complete HTR experiment with the Bentham data set. More specifically, it launches a cross-validation experiment with the 10 partitions defined for 50 pages. To execute the script:
cd $HOME/HTR
./scripts/HTR-Bentham-Experiment.sh
The time required to carry out the complete experiment can be several days. The HTR experiment described here involves three different phases: preprocessing and feature extraction, training, and test. In the next subsections we explain in detail what the HTR-Bentham-Experiment.sh script does.
Preprocessing and feature extraction
First, the HTR preprocessing tools are applied on the line images of the corpus. To do this the script HTR-Bentham-Experiment.sh calls the script HTR-Lines-preprocessing.sh:
$D/scripts/HTR-Lines-preprocessing.sh Lines Preprocessed
This script takes as arguments:
1. the directory containing the images in PNG format.
2. the output directory where the preprocessed images will be stored.
This script performs, for each of the images, the following sequence of commands:
imgtxtenh -i $f -S 4 -a | # Clean the background of the image.
pgmskew | # Perform "skew/slope" correction.
pgmslant -M M | # Perform slant correction.
pgmnormsize -c 5| # Perform size normalization.
pnmcrop -white # Remove left and right white zones.
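For reference, the same chain can also be applied manually to a single line image; a minimal sketch with hypothetical file names would be:
imgtxtenh -i line.png -S 4 -a | pgmskew | pgmslant -M M | pgmnormsize -c 5 | pnmcrop -white > line_prep.pgm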
To test different preprocessing operations this script must be changed. For example, if a new pgmslant method is going to be tested, it will be necessary to call the new executable instead of pgmslant. Once all the images have been preprocessed, it is time for the feature extraction process.
In this work we use the features introduced in [Toselli 2004]. The image is divided into square cells
and from each cell three features are calculated: the normalized gray level, the horizontal and
vertical components of the grey level gradient. To do this the script HTR-feature-extraction.sh is
used:
$D/scripts/HTR-feature-extraction.sh Preprocessed Features 20 2 2 2
This script takes as arguments:
1. the directory containing the preprocessed images.
2. the output directory where the feature vector files will be stored.
3. the feature extraction resolution (20).
4. the analysis window size (2 2).
5. the frequency (2).
Training phase
The next step in the HTR-Bentham-Experiment.sh script is to generate an HMM prototype per character with an initial continuous density left-to-right topology.
The script Create_HMMs-TOPOLOGY.sh is used for this:
$D/scripts/Create_HMMs-TOPOLOGY.sh 60 8 > proto
This script accepts as input the dimension of the feature vectors and the number of states to be set
up for HMMs. It outputs an initial HMM topology with only one Gaussian density in the mixture
per state, whose dimension (in this case) is set to 60 = 20 x 3, according to the feature extraction
resolution of 20 computed in the preprocessing step (20 grey level features, 20 horizontal derivative
features and 20 vertical derivative features). The number of states per HMM has been set to 8, each of which has a self-transition probability of 0.6 and a transition to the next state (on the right) with probability 0.4.
Finally (once the processing and feature extraction have finished for all images), the script to train
the HMMs is called:
$D/scripts/Script_Train.sh train hmms proto labels HMM_list
This accepts as arguments:
1. the list of sample files to use in the training process.
2. the directory where the trained HMMs will be stored.
3. the initial HMM topology file (HMM prototype).
4. the transcriptions file in MLF format (this file contains the transcription of the training lines in HTK format).
5. the file of HMMs names.
Recognition phase
Finally, the actual recognition is performed with the HTK command HVite. To do this the script Script_Test.sh is used:
$D/scripts/Script_Test.sh list.test ../Train_${i}/hmms dic bgram.gram HMM_list 50 -50 64
This accepts as arguments:
1. the list of sample files to use in the recognition process.
2. the directory where the trained HMMs are stored.
3. the lexicon in HTK format.
4. the language model in HTK format.
5. the file of HMMs name lists.
6. the grammar scale factor (50) (see [Ogawa 1998]).
7. the word insertion penalty (-50) (see [Ogawa 1998]).
8. the number of Gaussians per state in the HMMs (64).
Once the recognition has finished, it is time to assess the accuracy of the recognized hypotheses. To
do this the following steps are carried out:
1. First, the recognized hypotheses for all the partitions from 0 to 9 are joined in the directory
Test:
for f in Test_?; do
cp $f/resultados/res-64gauss Test; done
2. The script Create_WER_Bentham.sh returns the WER for the obtained transcripts:
$D/Create_WER_Bentham.sh Test $D/BenthamData/Transcription
which accepts as arguments the directory with all hypotheses and the reference transcripts.
Therefore, at the end of this process you will have a directory with all the hypothesis transcripts
($HOME/HTR/Experiments/Test), and the WER obtained by Create_WER_Bentham.sh.
Testing a new lexicon and language model
To test a new lexicon and a new language model we can use the script HTR-LM-testing.sh.
cd $HOME/HTR
./scripts/HTR-LM-testing.sh Lexicon LanguageModel GSF WIP
This accepts as arguments:
1. the new lexicon.
2. the new language model.
3. the grammar scale factor.
4. the word insertion penalty.
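For instance, to test a hypothetical lexicon and bigram language model while keeping the grammar scale factor and word insertion penalty used in the experiment above, one might run:
./scripts/HTR-LM-testing.sh new_lexicon.dic new_bigram.gram 50 -50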
5.4 Keyword Spotting
5.4.1 Query by example
For the execution of the program, the VCImageProcessingProgram.exe file must be run. The program depends on Microsoft .NET Framework 4.5.1. The offline installer for the framework is in the dependencies folder. Run dotNetFx40_Full_x86_x64.exe first and then the file NDP451-KB2858728-x86-x64-AllOS-ENU.exe, which is the update to 4.5.1. The supported operating systems are Microsoft Windows Vista and later editions. Fig. 5.4.1.1 depicts the main interface of the program.
Figure 5.4.1.1: The main interface
The program at the moment supports only the Bentham database, but future versions will support adding databases through the graphical user interface. Now, as shown in Figure 5.4.1.2, in the Configuration -> Debug section you must point to the directory where the Bentham directory lies, with the following directory structure: bentham/images.
Figure 5.4.1.2: Setting a local path to the Bentham database
In order to run the word spotting procedure, you select an image from the list on the right and then click on a word in the image in the main area. A pull-down section will appear with the selected word image as the query. You will have options to query the current document or the database and to select the method, as seen in Figure 5.4.1.3.
Figure 5.4.1.3: Querying a word image
Figure 5.4.1.4: Word spotting results
5.4.2 Query by string
A series of software programs have been developed for the query-by-string tool. The query-by-string tool accepts a word graph and a list of words, and provides the degree to which each word in the list can appear in the word graph and, consequently, in a line.
The provided script Launch-Demo.sh as well as the binary WGPhraseSpooting have been developed for Linux. The following software must be installed in order to proceed:
• HTK 3.4.1 toolkit (for word-graph generation)
• GLIBC 2.13 or later
• Bash shell for scripting
In order to give an example of how to use this tool, we will employ (as input) a word graph of a text
line image belonging to the “IAMDB” database. This word graph was previously obtained during
the decoding process of the corresponding text line image using HTK's HVite command with a
character HMM filler model (see Sec. 4.4.2).
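In its simplest form, the underlying binary can be invoked directly on a single word graph with the same options used inside the loop below (paths taken from the example distribution):
../bin/WGPhraseSpooting -i f07-084b-04.lat.gz -w spotWrdList_SPACE.lst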
Obtain the spotting confidence score for each word listed in spotWrdList_SPACE.lst:
for f in WG_Den30/*.lat.gz; do
  F=$(basename $f .lat.gz)
  echo $F
  ../bin/WGPhraseSpooting -i $f -w spotWrdList_SPACE.lst 2>/dev/null |
    awk 'BEGIN{while (getline < "puntMap.inf" > 0) M[$1]=$2}
         !/^#/{ CAD=M[$1]
                for (i=2;i<NF-4;i++) CAD=CAD"#"M[$i]
                print CAD,$(NF-4),$(NF-3),$(NF-2),$(NF-1) }' > WG_Den30/${F}_Vit.lst
done
The script Launch-Demo.sh launches WGPhraseSpooting taking as inputs the word graph file “f07-084b-04.lat.gz”, in Standard Lattice Format (SLF), and the file “spotWrdList_SPACE.lst” listing the query words to be spotted. In this case, each word entry of this list is expanded into the corresponding character sequence separated by spaces.
To execute the script Launch-Demo.sh:
cd pub/deliverables/D3.1/D3.1.1/5.4.2/Index-Generation-Tool/Example
./launch.sh f07-084b-04.lat.gz spotWrdList_SPACE.lst > scores
The final spotting scores for all the entries in “spotWrdList_SPACE.lst” are stored in the file “scores”.
For this case, the approximate total elapsed time required by this tool to process the whole query word list “spotWrdList_SPACE.lst” is around 150 seconds, assuming that a dedicated single core of a 64-bit Intel Core Quad CPU running at 2.83 GHz has been employed.
It is important to remark that this tool was conceived only for empirically testing the approach addressed in Sec. 4.4.2; it currently lacks a user-friendly interface for interacting with the user and for showing the spotting results, which is planned for future deliverables.
6. References
[Agrawal 2013] M. Agrawal and D. Doermann, “Clutter Noise Removal in Binary Document
Images”, International Journal on Document Analysis and Recognition (IJDAR), vol. 16, no. 4, pp.
351-369, 2013.
[Alireza 2011] Alireza A., Umapada P., Nagabhushan P. and Kimura F., "A Painting Based
Technique for Skew Estimation of Scanned Documents", International Conference on Document
Analysis and Recognition, pp.299-303, 2011.
[Alvaro 2013] F. Alvaro, F. Cruz, J.-A. Sanchez, O. Ramos-Terrades, and J.-M. Benedí. Page segmentation of structured documents using 2D stochastic context-free grammars. In 6th Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), LNCS 7887, pages 133-140.
Springer, 2013.
[Arivazhagan 2007] M. Arivazhagan, H. Srinivasan, and S. N. Srihari, "A Statistical Approach to
Handwritten Line Segmentation", Document Recognition and Retrieval XIV, Proceedings of SPIE,
2007, pp. 6500T-1-11.
[Avila 2004] B.T. Avila and R.D. Lins, “A New Algorithm for Removing Noisy Borders from
Monochromatic Documents”, Proc. of ACM-SAC’2004, pp 1219-1225, Cyprus, ACM Press, March
2004.
[Baird 1987] Baird H. S, "The skew angle of printed documents", Proc. SPSE 40th symposium
hybrid imaging systems, Rochester, NY, pp 739–743M, 1987.
[Ballesteros 2005] Ballesteros J., Travieso C.M., Alonso J.B. and M.A. Ferrer, "Slant estimation of
handwritten characters by means of Zernike moments", Electronics Letters, Vol. 41, No 20, pp.
1110-1112, 2005.
[Barney Smith 2010] E.H. Barney Smith, L. Likforman-Sulem and J. Darbon, “Effect of pre-
processing on binarization,” in Proc. SPIE Conf. on Document Recognition Retrieval (DRR’10),
vol. 7534, San Jose, CA USA, Jan. 2010.
[Basu 2007] S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri, and D.K. Basu, "Text Line Extraction
from Multi – Skewed Handwritten Documents", Pattern Recognition, vol. 40, no. 6, June 2007, pp.
1825-1839.
[Bazzi 1999] I. Bazzi, R. Schwartz, and J. Makhoul. An omnifont open-vocabulary OCR system for
English and Arabic. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(6):495-
504, 1999.
[Bertolami 2008] R. Bertolami and H. Bunke. Hidden markov model-based ensemble methods for
offline handwritten text line recognition. Pattern Recognition, 41(11):3452-3460, 2008.
[Blumenstein 2002] Blumenstein M., Cheng C. K. and Liu X. Y., "New Preprocessing Techniques
for Handwritten Word Recognition," Second IASTED International Conference on Visualization,
Imaging and Image Processing, pp.332-336, 2002.
[Boquera 2011] S. España-Boquera, M.J. Castro-Bleda, J. Gorbe-Moya, and F. Zamora-Martínez. Improving offline handwriting text recognition with hybrid HMM/ANN models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):767-779, 2011.
[Bosch 2012] V. Bosch, A. H. Toselli, and E. Vidal. “Statistical text line analysis in handwritten
documents”, In International Conference on Frontiers in Handwriting Recog-nition (ICFHR), pages
201-206, 2012.
[Bozinovic 1989] Bozinovic A. and Srihari A., "Off-Line Cursive Script Word Recognition",
Transactions on Pattern Analysis and Machine Intelligence, Vol II, No 1, pp. 69-82, 1989.
[Bruzzone 1999] E. Bruzzone, and M. C. Coffetti, ''An Algorithm for Extracting Cursive Text
Lines'', Proc. 5th Int’l Conf. on Document Analysis and Recognition (ICDAR’99), 1999, pp. 749-
752.
[Bukhari 2012] S. S. Bukhari, T. M. Breuel, A Asi, J. El-Sana, “Layout Analysis for Arabic
Historical Document Images Using Machine Learning”, ICFHR 2012, pp. 635-640.
[Bulacu 2007] M. Bulacu, R. van Koert, L.Schomaker, T. van der Zant, "Layout analysis of
handwritten historical documents for searching the archive of the Cabinet of the Dutch Queen",
ICDAR 2007, pp. 357-361.
[Bunke 1995] H. Bunke, M. Roth, and E.G. Suchakat-Talamazzini. Off-line cursive handwriting recognition using hidden Markov models. Pattern Recognition, 28(9):1399-1413, 1995.
[Bunke 2003] H. Bunke. Recognition of cursive Roman handwriting - past, present and future. In Proceedings of the 7th International Conference on Document Analysis and Recognition (ICDAR 03), page 448, Washington, DC, USA, 2003. IEEE Computer Society.
[Canny 1986] J. Canny, “A Computational Approach to Edge Detection”, IEEE Trans. on Pattern
Analysis and Machine Intelligence, vol. 8, no. 6, Nov. 1986, pp. 679-698.
[Chang 2004] F. Chang, C.J. Chen and C.J. Lu, “A linear-time component-labeling algorithm using
contour tracing technique”, Computer Vision and Image Understanding, Vol. 93, No.2, February
2004, pp. 206-220.
[Chatzichristofis 2011] S. A. Chatzichristofis, K. Zagoris, and A. Arampatzis. The TREC files: the (ground) truth is out there. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 1289–1290, New York, NY, USA, 2011. ACM.
[Chou 2007] Chou C.-H., Chu S.-Y.and Chang F., "Estimation of skew angles for scanned
documents based on piecewise covering by parallelograms", Pattern Recognition vol. 40, pp.443–
455, 2007.
[Ciardiello 1988] Ciardiello G., Scafuro G., Degrandi M. T., Spada M. R. and Roccotelli M. P., "An
experimental system for office document handling and text recognition", Proc. 9th international
conference on pattern recognition, pp. 739–743, 1988.
[Cote 1998] Cote M., Lecolinet E., Cheriet M. and Suen C., "Automatic reading of cursive scripts
using a reading model and perceptual concepts," International Journal on Document Analysis and
Recognition, vol. 1, no. 1, pp. 3–17, 1998.
[Crespi 2008] S. Crespi Reghizzi and M. Pradella. A CKY parser for picture grammars. Information Processing Letters, 105(6):213-217, February 2008.
[Cruz 2012] F. Cruz and O. R. Terrades. Document segmentation using relative location features. In
International Conference on Pattern Recognition, pages 1562-1565, Japan, 2012.
[Dey 2012] S. Dey, J. Mukhopadhyay, S. Sural and P. Bhowmick, “Margin noise removal from
printed document images”, Workshop on Document Analysis and Recognition (DAR 2012). ACM,
USA, pp. 86-93, 2012.
[Deya 2010] Deya P. and Noushath S., "e-PCP: A robust skew detection method for scanned
document images", Pattern Recognition vol. 43, pp.937-948, 2010.
[Ding 1999] Ding Y., Kimura F., Miyake Y. and Shridhar M., "Evaluation and Improvement of slant
estimation for handwritten words", Proceedings of the Fifth International Conference on Document
Analysis and Recognition (ICDAR '99), pp. 753-756, 1999.
[Ding 2000] Ding Y., Kimura F. and Miyake Y., "Slant estimation for handwritten words by
directionally refined chain code", 7th International Workshop on Frontiers in Handwriting
Recognition (IWFHR 2000), pp. 53-62, 2000.
[Ding 2004] Ding Y., Ohyama W., Kimura F. and Shridhar M., "Local Slant Estimation for
Handwritten English Words", 9th International Workshop on Frontiers in Handwriting Recognition
(IWFHR 2004), pp. 328-333, 2004.
[Dong 2005] Dong J., Ponson D., Krzyÿzak A. and Suen C. Y., "Cursive word skew/slant
corrections based on Radon transform," Proc. of 8th International Conference on Document
Analysis and Recognition, pp.478-483, 2005.
[Drira 2011] F. Drira, F. LeBourgeois and H. Emptoz, “A new PDE-based approach for singularity-
preserving regularization: application to degraded characters restoration”, International Journal on
Document Analysis and Recognition, DOI 10.1007/s10032-011-0165-5, 2011.
[Du 2008] X. Du, W. Pan, and T. D. Bui, "Text line segmentation in handwritten documents using
Mumford–Shah model", Proc. 11th Int’l Conf. on Frontiers in Handwriting Recognition
(ICFHR'08), 2008, pp. 253–258.
[Esteve 2009] A. Esteve, C. Cortina, and A. Cabre. Long term trends in marital age homogamy
patterns: Spain, 1992-2006. Population, 64(1):173-202, 2009.
[Fadoua 2006] D. Fadoua, F. Le Bourgeois and H. Emptoz, “Restoring Ink Bleed-Through
Degraded Document Images Using a Recursive Unsupervised Classification Technique”, 7th
international conference on Document Analysis Systems, pp. 38-49, 2006.
[Fan 2002] K.C. Fan, Y.K. Wang and T.R. Lay, “Marginal Noise Removal of Document Images”,
Pattern Recognition, vol.35, no. 11, pp. 2593-2611, 2002.
[Fenrich 1992] R. Fenrich, S. Lam, and S. N. Srihari. Chapter optical character recognition. In
Encyclopedia of Computer Science and Engineering, pages 993-1000. Third ed., Van Nostrand,
1992.
[Fischer 2010] A. Fischer, A. Keller, V. Frinken, and H. Bunke. Lexicon-free handwritten word spotting using character HMMs. Pattern Recognition Letters, 33(7):934-942, 2012. Special Issue on Awards from ICPR 2010.
[Fletcher 1988] L. A. Fletcher, and R. Kasturi, "A Robust Algorithm for Text String Separation from
Mixed Text/Graphics Images", IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol.10, no.6, Nov. 1988, pp. 910-918.
[Frinken 2012] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke. A Novel Word Spotting Method Based on Recurrent Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(2):211-224, Feb. 2012.
[Fujisawa 2010] H. Fujisawa and C.-L. Liu. Directional pattern matching for character recognition
revisited. algorithms, 13:14, 2010.
[Gatos 1997] Gatos B., Papamarkos N. and Chamzas C., "Skew detection and text line position
determination in digitized documents", Pattern Recognition, vol.30, issue 9, pp.1505-1519, 1997.
[Gatos 2006] B. Gatos, I. Pratikakis, and S.J. Perantonis, “Adaptive degraded document image
binarization,” Pattern Recognition, vol. 39, no. 3, pp. 317–327, Mar. 2006.
[Gatos 2008] B. Gatos, I. Pratikakis, and S.J. Perantonis, “Improved document image binarization
by using a combination of multiple binarization techniques and adapted edge information”, in
Proceedings of the 19th International Conference on Pattern Recognition (ICPR’08), Tampa,
Florida, USA, Dec.2008, pp. 1-4.
[Gatos 2009] B. Gatos and I. Pratikakis. Segmentation-free word spotting in historical printed documents. In Document Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on, pages 271–275. IEEE, 2009.
[Gatos 2011] B. Gatos, K. Ntirogiannis and I. Pratikakis, “DIBCO 2009: Document Image
Binarization Contest,” International Journal on Document Analysis and Recognition, vol. 14, no. 1,
pp. 35–44, 2011.
[Gatos 2014] B. Gatos, G. Louloudis, T. Causer, K. Grint, V. Romero, J.A. Sanchez, A.H. Toselli, and E. Vidal. Ground-Truth Production in the tranScriptorium Project. In Proceedings of the DAS, 2014.
[Gorman 1993] Gorman L., "The document spectrum for page layout analysis," IEEE Trans Pattern
Anal Mach Intell vol. 15 issue 11, pp. 1162–1173, 1993.
[Gould 2008] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-class segmentation with relative location prior. Int. Journal of Computer Vision, 80(3):300-316, 2008.
[Graves 2009] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855-868, 2009.
[Gustafson 1978] D. E Gustafson and W. C Kessel. Fuzzy clustering with a fuzzy covariance
matrix. In Decision and Control including the 17th Symposium on Adaptive Processes, 1978 IEEE
Conference on, volume 17, pages 761–766. IEEE, 1978.
[Hashizume 1986] Hashizume A., Yeh P.S. and Rosenfeld A., "A method of detecting the orientation
of aligned components", Pattern Recognition Letters, vol. 4, pp. 125-132, 1986.
[Hassan 2012] E. Hassan, S. Chaudhury, and M. Gopal. Word shape descriptor-based document
image indexing: a new dbh-based approach. International Journal on Document Analysis and
Recognition (IJDAR), pages 1–20, 2012.
[Hinds 1990] Hinds J., Fisher L. and D’Amato D. P., "A document skew detection method using
run-length encoding and the Hough transform", Proceedings of the 10th international conference
pattern recognition. IEEE CS Press, Los Alamitos, CA, pp 464–468, 1990.
[Howe 2012] N.R. Howe, “Document Binarization with Automatic Parameter Tuning”,
International Journal on Document Analysis and Recognition, DOI: 10.1007/s10032-012-0192-x,
Sep. 2012.
[Huang 2008] C. Huang, and S. Srihari, "Word segmentation of off-line handwritten documents",
Proc. Annual Symposium on Document Recognition and Retrieval (DRR) XV, IST/SPIE, 2008.
[Impedovo 1991] S. Impedovo, L. Ottaviano, and S. Occhiegro. Optical character recognition: a survey. International Journal of Pattern Recognition and Artificial Intelligence, 5(1):1-24, 1991.
[Ishitani 1993] Ishitani Y., "Document skew detection based on local region complexity", Proc. 2nd
international conference on document analysis and recognition, Tsukuba, Japan, pp. 49–52, 1993.
[Jain 1989] A. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs,
NJ, 1989.
[Jelinek 1998] F. Jelinek. Statistical methods for speech recognition. MIT Press, 1998.
[Juan 2001] A. Juan and E. Vidal. On the use of Bernoulli mixture models for text classification. In
Proceedings of the Workshop on Pattern Recognition in Information Systems (PRIS 01), Setubal
(Portugal), July 2001.
[Jung 2004] Jung K., Kim K.I. and Jain A.K., "Text information extraction in images and video: a survey", Pattern Recognition vol. 37, issue 5, pp. 977-997, 2004.
[Kamel 1993] M. Kamel, and A. Zhao, “Extraction of Binary Character/Graphics Images from Grayscale Document Images”, CVGIP: Computer Vision Graphics and Image Processing, vol. 55(3), pp. 203-217, May 1993.
[Kavallieratou 1999] Kavallieratou E., Fakotakis N. and Kokkinakis G., "New algorithms for
skewing correction and slant removal on word-level," Proc. IEEE 6th International Conference on
Electronics, Circuits and Systems, Pafos, Cyprus, pp. 1159–1162, 1999.
[Kavallieratou 2001] Kavallieratou E., Fakotakis N., Kokkinakis G., "Slant estimation algorithm for
OCR systems", Pattern Recognition, Vol. 34, pp. 2515-2522, 2001.
[Kavallieratou 2002] E. Kavallieratou, N. Fakotakis, and G. Kokkinakis. An unconstrained handwriting recognition system. International Journal of Document Analysis and Recognition, 4:226-242, 2002.
[Kennard 2006] D. J. Kennard, and W. A. Barrett, "Separating Lines of Text in Free-Form
Handwritten Historical Documents", Proc. 2nd Int’l Conf. on Document Image Analysis for
Libraries (DIAL'06), 2006, pp. 12-23.
[Keshet 2009] J. Keshet, D. Grangier, and S. Bengio. Discriminative keyword spotting. Speech
Communication, 51(4):317-329, April 2009.
[Kesidis 2011] A. L. Kesidis, E. Galiotou, B. Gatos, and I. Pratikakis. A word spotting framework for historical machine-printed documents. International Journal on Document Analysis and Recognition (IJDAR), 14(2):131–144, 2011.
[Keysers 2002] D. Keysers, R. Paredes, H. Ney, and E. Vidal. Combination of tangent vectors and
local representations for handwritten digit recognition. In International Workshop on Statistical
Pattern Recognition, Lecture Notes in Computer Science, pages 407-441. Springer-Verlag, 2002.
[Khoury 2010] I. Khoury, A. Gimenez and A. Juan. Windowed Bernoulli Mixture HMMs for Arabic Handwritten Word Recognition. In Proceedings of the 12th International Conference on Frontiers in Handwriting Recognition (ICFHR), Kolkata (India), Nov. 2010.
[Khurshid 2012] K. Khurshid, C. Faure, and N. Vincent. Word spotting in historical printed documents using shape and sequence comparisons. Pattern Recognition, 45(7):2598–2609, 2012.
[Kim 1997] Kim G. and Govindaraju V., “A Lexicon Driven Approach to Handwritten Word
Recognition for Real-Time Applications”, IEEE Trans. Pattern Anal.Mach. Intell., Vol. 19, Issue 4,
pp. 366-379, 1997.
[Kim 1998] G. Kim, and V. Govindaraju, "Handwritten Phrase Recognition as Applied to Street
Name Images", Pattern Recognition, vol. 31, no. 1, Jan. 1998, pp. 41-51.
[Kim 1999] G. Kim, V. Govindaraju, and S. Srihari. An architecture for handwritten text
recognition systems. International Journal of Document Analysis and Recognition, 2(1):37-44,
1999.
[Kim 2001] S.H. Kim, S. Jeong, G.- S. Lee, C.Y. Suen, "Word segmentation in handwritten Korean
text lines based on gap clustering techniques", Proc. 6th Int’l Conf. on Document Analysis and
Recognition (ICDAR’01), 2001, pp. 189-193.
[Kim 2002] I.K. Kim, D.W. Jung, and R.H. Park, “Document image binarization based on
topographic analysis using a water flow model,” Pattern Recognition, vol. 35, no. 1, pp. 265–277,
2002.
[Kimura 1993] Kimura F., Shridhar M. and Chen Z., "Improvements of a Lexicon Directed
Algorithm for Recognition of Unconstrained Handwritten Words", 2nd International Conference on
Document Analysis and Recognition (ICDAR 1993), pp. 18-22, Oct. 1993.
[Kneser 1995] R. Kneser and H. Ney. Improved backing-off for m-gram language modeling.
volume 1, pages 181-184, Detroit, USA, 1995.
[Koerich 2002] A.L. Koerich, Y. Leydier, R. Sabourin, and C. Y. Suen. A hybrid large vocabulary handwritten word recognition system using neural networks with hidden Markov models. In Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth International Workshop on, pages 99–104. IEEE, 2002.
[Koerich 2003] A. L. Koerich, R. Sabourin, and C. Y. Suens. Large vocabulary off-line
handwriting recognition: A survey. Pattern Analysis Applications, 6(2):97-121, 2003.
[Konidaris 2008] T. Konidaris, B. Gatos, S.J. Perantonis, and A. Kesidis. Keyword matching in
historical machine-printed documents using synthetic data, word portions and dynamic time
warping.In Document Analysis Systems, 2008.DAS ’08. The Eighth IAPR International Workshop
on, pages 539–545, 2008.
[Lavrenko 2004] V. Lavrenko, T. M. Rath, and R. Manmatha. Holistic word recognition for handwritten historical documents. In Document Image Analysis for Libraries, 2004. Proceedings. First International Workshop on, pages 278–287. IEEE, 2004.
[Le 1996] D.X. Le and G.R. Thoma, “Automated Borders Detection and Adaptive Segmentation for
Binary Document Images”, International Conference on Pattern Recognition, p. III: 737-741, 1996.
[Lee 1996] S. W. Lee. Off-line recognition of totally unconstrained handwritten numerals using MCNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:648-652, 1996.
[Lemaitre 2006] A. Lemaitre, and J. Camillerapp, "Text Line Extraction in Handwritten Document
with Kalman Filter Applied on Low Resolution Image", Proc. 2nd Int’l Conf. on Document Image
Analysis for Libraries (DIAL'06), 2006, pp. 38-45.
[Leung 2005] C.C. Leung, K.S. Chan, H.M. Chan and W.K. Tsui, “A new approach for image
enhancement applied to low-contrast–low-illumination IC and document images”, Pattern
Recognition Letters, Vol.26, pp. 769–778, 2005.
[Leydier 2007] Y. Leydier, F. Lebourgeois, and H. Emptoz. Text search for medieval manuscript images. Pattern Recognition, 40(12):3552–3567, 2007.
[Leydier 2009] Y. Leydier, A. Ouji, F. LeBourgeois, and H. Emptoz. Towards an omnilingual word retrieval system for ancient manuscripts. Pattern Recognition, 42(9):2089–2105, 2009.
[Li 2008] Y. Li, Y. Zheng, and D. Doermann, "Script-Independent Text Line Segmentation in
Freestyle Handwritten Documents'', IEEE Trans. Pattern Anal. Machine Intell. (T-PAMI), vol. 30,
no. 8, pp. 1313-1329, August, 2008.
[Likforman 1995] L. Likforman-Sulem, A. Hanimyan, and C. Faure, ''A Hough Based Algorithm
for Extracting Text Lines in Handwritten Documents'', Proc. 3rd Int’l Conf. on Document Analysis
and Recognition (ICDAR’95), 1995, pp. 774-777.
[Likforman 2007] L. Likforman-Sulem, A. Zahour, and B. Taconet, ''Text Line Segmentation of Historical Documents: A Survey'', International Journal on Document Analysis and Recognition (IJDAR), vol. 9, no. 2-4, April 2007, pp. 123-138.
[Link 1] https://engineering.purdue.edu/~bouman/software/cluster/
[Liwicki 2007] M. Liwicki, E. Indermuhle, and H. Bunke. On-line handwritten text line detection using dynamic programming. In Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, volume 1, pages 447-451, Sept. 2007.
[Llados 2012] J. Llados, M. Rusinol, A. Fornes, D. Fernandez, and A. Dutta. On the influence of word representations for handwritten word spotting in historical documents. International Journal of Pattern Recognition and Artificial Intelligence, 26(05), 2012.
[Lleida 1993] E. Lleida, J. B. Marino, J. M. Salavedra, A. Bonafonte, E. Monte, and A. Martinez.
Out-of-vocabulary word modelling and rejection for keyword spotting. In Proceedings of the Third
European Conference on Speech Communication and Technology, pages 1265-1268, Berlin,
Germany, 1993.
[Louloudis 2008] G. Louloudis, B. Gatos, I. Pratikakis, C. Halatsis, "Text line detection in
handwritten documents", Pattern Recognition, Vol. 41, Issue 12, pp. 3758-3772, 2008.
[Louloudis 2009a] G. Louloudis, N. Stamatopoulos and B. Gatos, "A Novel Two Stage Evaluation
Methodology for Word Segmentation Techniques", 10th International Conference on Document
Analysis and Recognition (ICDAR'09), pp. 686-690, Barcelona, Spain, July 2009.
[Louloudis 2009b] G. Louloudis, B. Gatos, I. Pratikakis, C. Halatsis, "Text line and word
segmentation of handwritten documents", Pattern Recognition, Vol. 42, Issue 12, pp. 3169-3183,
2009.
[Lu 2000] Z. Lu, R. Schwartz, and C. Raphael. Script-independent, HMM-based text line finding for OCR. In Pattern Recognition, 2000. Proceedings. 15th International Conference on, volume 4, pages 551-554, 2000.
[Lu 2003] Lu Y. and Tan C. L., "A nearest-neighbor chain based approach to skew estimation in document images", Pattern Recognition Letters vol. 24, pp. 2315–2323, 2003.
[Lu 2010] S. Lu, B. Su and C.L. Tan, “Document image binarization using background estimation
and stroke edges,” International Journal on Document Analysis and Recognition, vol. 13, no. 4, pp.
303-314, 2010.
[Luthy 2007] F. Luthy, T. Varga, and H. Bunke, ''Using Hidden Markov Models as a Tool for
Handwritten Text Line Segmentation'', Proc. 9th Int’l Conf. on Document Analysis and Recognition
(ICDAR’07), 2007, pp. 8-12.
[Madhvanath 1999] Madhvanath S., Kim G., and Govindaraju V., “Chaincode contour processing
for handwritten word recognition,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 21, no. 9, pp. 928–932, 1999.
[Mahadevan 1995] U. Mahadevan, and R. C. Nagabushnam, ''Gap metrics for word separation in
handwritten lines'', Proc. 3rd Int’l Conf. on Document Analysis and Recognition (ICDAR’95),
1995, pp. 124-127.
[Makridis 2007] M. Makridis, N. Nikolaou and B. Gatos, "An Efficient Word Segmentation
Technique for Historical and Degraded Machine-Printed Documents", Proc. 9th Int’l Conf. on
Document Analysis and Recognition (ICDAR'07), 2007, pp. 178-182.
[Malleron 2009] V. Malleron, V. Eglin, H. Emptoz, S. Dord-Crousle, P. Regnier, “Text Lines and
Snippets Extraction for 19th Century Handwriting Documents Layout Analysis”, ICDAR 2009, pp.
1001-1005.
[Manmatha 1996] R. Manmatha, C. Han, and E. M. Riseman. Word spotting: A new approach to indexing handwriting. In Computer Vision and Pattern Recognition, 1996. Proceedings CVPR’96, 1996 IEEE Computer Society Conference on, pages 631–637. IEEE, 1996.
[Manning 2008] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[Marin 2005] J.M. Marin, K. Mengersen, and C.P. Robert, "Bayesian Modelling and Inference on
Mixtures of Distributions", Handbook of Statistics, vol. 25, Elsevier-Sciences, 2005.
[Marti 2001] U.-V. Marti and H. Bunke. Using a Statistical Language Model to improve the performance of an HMM-Based Cursive Handwriting Recognition System. International Journal of Pattern Recognition and Artificial Intelligence, 15(1):65-90, 2001.
[Marti 2002] U.-V. Marti and H. Bunke. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5:39-46, 2002.
[McCowan 2004] I. A. McCowan, D. Moore, J. Dines, D. G.-Perez, M. Flynn, P. Wellner, and H. Bourlard. On the use of information retrieval measures for speech recognition evaluation. Idiap-RR Idiap-RR-73-2004, IDIAP, Martigny, Switzerland, 2004.
[Mori 1992] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development. Proceedings of the IEEE, 80(7):1029-1058, 1992.
[Morita 1999] Morita M., Facon J., Bortolozzi F., Garnes S. and Sabourin R., "Mathematical
morphology and weighted least squares to correct handwriting baseline skew," Proc. of 5th
International Conference on Document Analysis and Recognition, Bangalore, India, pp. 430–433,
1999.
[Nagy 2000] G. Nagy. Twenty years of document image analysis. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 22(1):38-62, 2000.
[Niblack 1986] W. Niblack, An Introduction to Digital Image Processing. Englewood Cliffs, NJ:
Prentice Hall, 1986, pp. 115–116.
[Nicolas 2006] S. Nicolas, T. Paquet and L. Heutte, "Complex Handwritten Page Segmentation
Using Contextual Models“, DIAL 2006, pp. 46-59.
[Nicolaou 2009] A. Nicolaou and B. Gatos, "Handwritten Text Line Segmentation by Shredding
Text into its Lines", 10th International Conference on Document Analysis and Recognition
(ICDAR'09), pp. 626-630, Barcelona, Spain, July 2009.
[Nikolaou 2010] N. Nikolaou, M. Makridis, B. Gatos, N. Stamatopoulos and N. Papamarkos,
“Segmentation of historical machine-printed documents using Adaptive Run Length Smoothing and
skeleton segmentation paths”, Image and Vision Computing, vol. 28, no. 4, pp. 590-604, 2010.
[Nicolas 2004] S. Nicolas, T. Paquet, and L. Heutte, "Text Line Segmentation in Handwritten
Document Using a Production System", Proc. 9th Int’l Workshop on Frontiers in Handwriting
Recognition (IWFHR’04), 2004, pp. 245-250.
[Nist 2013] TREC NIST. http://trec.nist.gov/pubs/trec16/appendices/measures.pdf, 2013.
[Ntirogiannis 2014] K. Ntirogiannis, B. Gatos and I. Pratikakis, "A Combined Approach for the
Binarization of Handwritten Document Images", Pattern Recognition Letters - Special Issue on
Frontiers in Handwriting Processing, vol. 35, no.1, pp. 3-15, Jan. 2014.
[Ogawa 1998] A. Ogawa, K. Takeda, and F. Itakura. Balancing acoustic and linguistic probabilities. Volume 1, pages 181-184, Seattle, WA, USA, 1998.
[Otsu 1975] N. Otsu. A threshold selection method from gray-level histograms. Automatica, 11(285-296):23–27, 1975.
[Otsu 1979] N. Otsu, “A Thresholding Selection Method from Gray-level Histogram,” IEEE Trans.
Syst., Man, Cybern., vol. 9, no. 1, pp. 62–66, 1979.
[Oztop 1999] E. Oztop, A.Y. Mlayim, V. Atalay, and F. Yarman-Vural. Repulsive attractive network
for baseline extraction on document images. Signal Processing, 75(1):1 - 10, 1999.
[Papandreou 2011] Papandreou A. and Gatos B., "A Novel Skew Detection Technique Based on
Vertical Projections", Proc. 11th international conference on document analysis and recognition,
ICDAR '11, pp. 384-388, 2011.
[Papandreou 2012] Papandreou A. and Gatos B., "Word Slant Estimation using Non-Horizontal
Character Parts and Core-Region Information", 10th IAPR International Workshop on Document
Analysis Systems (DAS 2012), pp. 307-311, 2012.
[Papandreou 2013a] Papandreou A. and Gatos B., "A Coarse to Fine Skew Estimation Technique
for Handwritten Words", Proc. 12th international conference on document analysis and recognition,
ICDAR '13, 2013.
[Papandreou 2013b] Papandreou A., Gatos B., Louloudis G. and Stamatopoulos N., "ICDAR2013
Document Image Skew Estimation Contest (DISEC’13)", Proc. 12th international conference on
document analysis and recognition, ICDAR '13, 2013.
[Papandreou 2014] Papandreou A. and Gatos B., "Slant estimation and core-region detection for
handwritten Latin words", Pattern Recognition Letters Vol. 35 pp. 16-22, 2014.
[Pastor 2004] M. Pastor, A. Toselli, and E. Vidal. Projection Profile Based Algorithm for Slant
Removal. In International Conference on Image Analysis and Recognition (ICIAR’04), Lecture
Notes in Computer Science, pages 183-190, Porto, Portugal, September 2004. Springer-Verlag.
[Plamondon 2000] R. Plamondon and S. N. Srihari. On-Line and Off-Line Handwriting
Recognition: A Comprehensive Survey. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(1):63-84, 2000.
[Postl 1986] Postl W., "Detection of linear oblique structures and skew scan in digitized
documents", Proc. 8th international conference on pattern recognition, pp. 687-689, 1986.
[Pratikakis 2010] I. Pratikakis, B. Gatos and K. Ntirogiannis, “H-DIBCO 2010 Handwritten
Document Image Binarization Competition,” in Proc. Int. Conf. on Frontiers in Handwriting
Recognition (ICFHR’10), Kolkata, India, Nov. 2010, pp. 727–732.
[Pratikakis 2011] I. Pratikakis, B. Gatos and K. Ntirogiannis, “ICDAR 2011 Document Image
Binarization Contest - DIBCO 2011,” in Proc. Int. Conf. on Document Analalysis and Recognition
(ICDAR’11), Beijing, China, Sep. 2011, pp. 1506–1510.
[Pratikakis 2013] I. Pratikakis, B. Gatos and K. Ntirogiannis, "ICDAR 2013 Document Image
Binarization Contest (DIBCO 2013)", 2013 International Conference on Document Analysis and
Recognition (ICDAR'13), Washington DC, USA, Aug. 2013, pp. 1471-1476.
[Pu 1998] Y. Pu, and Z. Shi, '' A Natural Learning Algorithm Based on Hough Transform for Text
Lines Extraction in Handwritten Documents'', Proc. 6th Int’l Workshop on Frontiers in Handwriting
Recognition (IWFHR’98), 1998, pp. 637-646.
[Ramponi 1996] G. Ramponi, N. Strobel, S. K. Mitra, and T. Yu, “Nonlinear unsharp masking
methods for image contrast enhancement,” J. Electron. Imag.,vol. 5, pp. 353–366, Jul. 1996.
[Rath 2003a] T. M. Rath and R. Manmatha. Features for word spotting in historical manuscripts. In Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on, pages 218–222. IEEE, 2003.
[Rath 2003b] T. M. Rath and R. Manmatha. Word image matching using dynamic time warping. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II–521. IEEE, 2003.
[Ratzlaff 2003] E. H. Ratzlaff. Methods, Report and Survey for the Comparison of Diverse Isolated
Character Recognition Results on the UNIPEN Database. In Proceedings of the Seventh Inter-
national Conference on Document Analysis and Recognition (ICDAR ’03), volume 1, pages 623-
628, Edinburgh, Scotland, August 2003.
[Robertson 2008] S. Robertson. A new interpretation of average precision. In Proceedings of the
31st annual international ACM SIGIR conference on Research and development in information
retrieval, SIGIR, pages 689-690, New York, NY, USA, 2008. ACM.
[Rodriguez 2008] J. A. Rodriguez and F. Perronnin. Local gradient histogram features for word
spotting in unconstrained handwritten documents. In Int. Conf. on Frontiers in Handwriting
Recognition, 2008.
[Romero 2006] V. Romero, M. Pastor, A. H. Toselli, and E. Vidal. Criteria for handwritten off-line text size normalization. In Proc. of the Sixth IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP 06), Palma de Mallorca, Spain, August 2006.
[Romero 2007] V. Romero, A. Giménez, and A. Juan. Explicit Modelling of Invariances in
Bernoulli Mixtures for Binary Images. In 3rd Iberian Conference on Pattern Recognition and
Image Analysis, volume 4477 of Lecture Notes in Computer Science, pages 539-546. Springer-
Verlag, Girona (Spain), June 2007.
[Romero 2007] V. Romero, A. H. Toselli, L. Rodriguez, and E. Vidal. Computer Assisted
Transcription for Ancient Text Images. In International Conference on Image Analysis and
Recognition (ICIAR 2007), volume 4633 of LNCS, pages 1182-1193. Springer-Verlag, Montreal
(Canada), August 2007.
[Romero 2007] V. Romero, V. Alabau, and J. M. Benedí. Combination of N-grams and Stochastic
Context- Free Grammars in an Offline Handwritten Recognition System. In 3rd Iberian Conference
on Pattern Recognition and Image Analysis, volume 4477 of LNCS, pages 467-474. Springer-
Verlag, Girona (Spain), June 2007.
[Romero 2013] V. Romero, A. Fornés, N. Serrano, J.A. Sánchez, A.H. Toselli, V. Frinken, E. Vidal,
and J. Lladós. The ESPOSALLES database: An ancient marriage license corpus for off-line
handwriting recognition. Pattern Recognition, 46:1658–1669, 2013.
http://dx.doi.org/10.1016/j.patcog.2012.11.024.
[Roufs 1997] J.A.J. Roufs and M.C. Boschman, “Text Quality Metrics for Visual Display Units: I.
Methodological Aspects”, Displays, vol. 18, no.1, pp. 37–43, 1997.
[Roy 2008] P.P. Roy, U. Pal, and J. Llados, "Morphology based handwritten line segmentation using
foreground and background information", Proc. 11th Int’l Conf. on Frontiers in Handwriting
Recognition (ICFHR'08), 2008, pp. 241–246.
[Rusinol 2011] M. Rusinol, D. Aldavert, R. Toledo, and J. Lladós. Browsing heterogeneous
document collections by a segmentation-free word spotting method. In Document Analysis and
Recognition (ICDAR), 2011 International Conference on, pages 63–67. IEEE, 2011.
[Saabni 2014] Raid Saabni, Abedelkadir Asi, Jihad El-Sana, Text line extraction for historical
document images, Pattern Recognition Letters, Vol. 35, Iss. 1, pp. 23-33, 2014.
[Sadri 2009] Sadri J. and Cheriet M., "A new approach for skew correction of documents based on
particle swarm optimization", Proceedings of 10th international conference on document analysis
and recognition, ICDAR‘09, pp 1066–1070, 2009.
[Safabakhsh 2000] Safabakhsh R. and Khadivi S, "Document Skew Detection Using Minimum-
Area Bounding Rectangle", Proc. The International Conference on Information Technology: Coding
and Computing ITCC 00, pp. 253 – 258, 2000.
[Sarfraz 2008] Sarfraz M. and Rasheed Z., "Skew estimation and correction of text using bounding
box", Proceedings of fifth international conference on computer graphics, imaging and
visualization, (CGIV ‘08), pp 259–264, 2008.
[Sauvola 1995] J. Sauvola and M. Pietikainen, “Page segmentation and classification using fast
feature extraction and connectivity analysis”, International Conference on Document Analysis and
Recognition, pp. 1127-1131, 1995.
[Sauvola 2000] J. Sauvola and M. Pietikainen, “Adaptive document image binarization,” Pattern
Recognition, vol. 33, no. 2, pp. 225–236, 2000.
[Seni 1994] G. Seni, and E. Cohen, ''External Word Segmentation of Off-line Handwritten Text
Lines'', Pattern Recognition, vol. 27, no. 1, Jan. 1994, pp. 41-52.
[Senior 1998] A. Senior and A. Robinson. An off-line cursive handwriting recognition system. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):309-321, 1998.
[Serrano 2009] J. A. Rodriguez-Serrano and F. Perronnin. Handwritten word-spotting using hidden Markov models and universal vocabularies. Pattern Recognition, 42:2106-2116, September 2009.
[Shafait 2008] F. Shafait, D. Keysers, and T. M. Breuel. Efficient implementation of local adaptive
thresholding techniques using integral images. In Proc. SPIE 6815, Document Recognition and
Retrieval XV, 681510, pages 1–6, January 2008.
[Shafait 2008a] F. Shafait, D. Keysers, and T. M. Breuel, "Performance evaluation and benchmarking of six-page segmentation algorithms", IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6):941-954, June 2008. doi: 10.1109/TPAMI.2007.70837.
[Shepard 1953] D. H. Shepard. Apparatus for reading. US Patent No. 2663758, 1953.
[Shi 2004] Z. Shi, and V. Govindaraju, "Line Separation for Complex Document Images Using
Fuzzy Runlength", Proc. 1st Int’l Workshop on Document Image Analysis for Libraries (DIAL’04),
2004, pp. 306-312.
[Shi 2005] Z. Shi, S. Setlur, and V. Govindaraju, "Text Extraction from Gray Scale Historical
Document Images Using Adaptive Local Connectivity Map", Proc. 8th Int’l Conf. on Document
Analysis and Recognition (ICDAR’05), 2005, pp. 794-798.
[Shutao 2007] Shutao Li, Qinghua S., and Jun S., "Skew detection using wavelet decomposition and
projection profile analysis", Pattern Recognition Letters vol. 28, issue 5, pp. 555–562, 2007.
[Singh 2008] Singh C., Bhatia N. and Kaur A., "Hough transform based fast skew detection and
accurate skew correction methods", Pattern Recognition vol. 41, pp. 3528–3546, 2008.
[Srihari 1989] Srihari S. N. and Govindaraju V., "Analysis of textual images using the Hough transform", Machine Vision and Applications, pp. 141-153, 1989.
[Srihari 1993] S. Srihari. Recognition of handwritten and machine printed text for postal address interpretation. Pattern Recognition Letters, 14:291-302, 1993.
[Srihari 2005] S. Srihari, H. Srinivasan, P. Babu, and C. Bhole. Handwritten arabic word spotting
using the cedarabic document analysis system. In Proc. Symposium on Document Image
Understanding Technology (SDIUT-05), pages 123–132, 2005.
[Stamatopoulos 2007] N. Stamatopoulos, B. Gatos and A. Kesidis, “Automatic Borders Detection
of Camera Document Images”, 2nd International Workshop on Camera-Based Document Analysis
and Recognition (CBDAR'07), pp.71-78, 2007.
[Stamatopoulos 2009] N. Stamatopoulos, B. Gatos, and S. J. Perantonis, "A method for combining
complementary techniques for document image segmentation", Pattern Recognition, Vol. 42, Issue
12, pp. 3158-3168, 2009.
[Stamatopoulos 2010] N. Stamatopoulos, B. Gatos and T. Georgiou, “Page Frame Detection for
Double Page Document Images”, 9th International Workshop on Document Analysis Systems
(DAS'10), pp. 401-408, Boston, MA, USA, June 2010.
[Stamatopoulos 2013] N. Stamatopoulos, G. Louloudis, B. Gatos, U. Pal and A. Alaei,
''ICDAR2013 Handwriting Segmentation Contest'', 12th International Conference on Document
Analysis and Recognition (ICDAR'13), 2013, pp. 1434-1438.
[Su 2010] B. Su, S. Lu and C.L. Tan. “Binarization of Historical Document Images using the Local
Maximum and Minimum”, in Proc. Int. Workshop on Document Analysis Systems (DAS’10),
Boston, MA USA, Jun. 2010, pp. 159-166.
[Su 2010] S. Lu, B. Su and C.L. Tan, “Document image binarization using background estimation
and stroke edges,” International Journal on Document Analysis and Recognition, vol. 13, no. 4, pp.
303-314, 2010.
[Su 2011] B. Su, S. Lu and C.L. Tan, “Combination of document image binarization techniques”, in
Proceedings of the 11th International Conference on Document Analysis and Recognition
(ICDAR’11), Beijing, China, Sep. 2011, pp. 22-26.
[Szke 2005] I. Szöke, P. Schwarz, L. Burget, M. Karafiát, and J. Černocký. Phoneme based acoustics keyword spotting in informal continuous speech. In Radioelektronika 2005, pages 195-198. Faculty of Electrical Engineering and Communication BUT, 2005.
[Thomas 2010] S. Thomas, C. Chatelain, L. Heutte, and T. Paquet. Alpha-Numerical Sequences
Extraction in Handwritten Documents. In Proceedings of the 12th International Conference on
Frontiers in Handwriting Recognition, ICFHR, pages 232-237, Kolkata, India, 2010. IEEE
Computer Society.
[Tonazzini 2007] A. Tonazzini, E. Salerno, and L. Bedini. Fast correction of bleed-through
distortion in grayscale documents by a blind source separation technique. International Journal of
Document Analysis and Recognition (IJDAR), 10(1):17–25, 2007.
[Tonazzini 2010] A. Tonazzini, Color Space Transformations for Analysis and Enhancement of
Ancient Degraded Manuscripts, Pattern Recognition and Image Analysis, Vol. 20, No. 3, pp. 404–
417, 2010.
[Toselli 2004] A. H. Toselli, A. Juan, D. Keysers, J. Gonzalez, I. Salvador, H. Ney, E. Vidal, and F.
Casacuberta. Integrated Handwriting Recognition and Interpretation using Finite-State Models. Int.
Journal of Pattern Recognition and Artificial Intelligence, 18(4):519-539, June 2004.
[Toselli 2004a] A. H. Toselli et al. Integrated Handwriting Recognition and Interpretation using Finite-State Models. Int. Journal on Pattern Recognition and Artificial Intelligence, 18(4):519-539, 2004.
[Toselli 2004b] A. H. Toselli, A. Juan, and E. Vidal. Spontaneous Handwriting Recognition and Classification. In Proceedings of the 17th International Conference on Pattern Recognition, volume 1, pages 433-436, Cambridge, United Kingdom, August 2004.
[Toselli 2009] A. H. Toselli, V. Romero, M. Pastor, and E. Vidal. Multimodal interactive transcription of text images. Pattern Recognition, 43(5):1824-1825, 2009.
[Toselli 2011] A. H. Toselli, V. Romero, and E. Vidal. Language Technology for Cultural Heritage, chapter Alignment between Text Images and their Transcripts for Handwritten Documents, pages 23-37. Theory and Applications of Natural Language Processing. Springer, 2011. Caroline Sporleder, Antal van den Bosch and Kalliopi Zervanou (Eds.).
[Toselli 2013] A. H. Toselli, E. Vidal, V. Romero, and V. Frinken. Word-graph based keyword spotting and indexing of handwritten document images. Technical report, Universidad Politécnica de Valencia, 2013.
[Varga 2005] T. Varga, and H. Bunke, ''Tree structure for word extraction from handwritten text
lines'', Proc. 8th Int’l Conf. on Document Analysis and Recognition (ICDAR’05), 2005, pp. 352-
356.
[Vidal 2012] E. Vidal, V. Romero, and A. H. Toselli. Multimodal Interactive Handwritten Text Transcription. World Scientific Publishing, 2012.
[Vinciarelli 2001] Vinciarelli A. and Luettin J., "A new normalization technique for cursive handwritten words", Pattern Recognition Letters, Vol. 22, Num. 9, pp. 1043-1050, 2001.
[Vinciarelli 2002] A. Vinciarelli. A survey on off-line cursive word recognition. Pattern Recognition, 35(7):1433-1446, 2002.
[Vinciarelli 2004] A. Vinciarelli, S. Bengio, and H. Bunke. Off-line recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):709-720, June 2004.
[Wahl 1982] F.M. Wahl, K.Y. Wong, and R.G. Casey, “Block Segmentation and Text Extraction in
Mixed Text/Image Documents”, Computer Graphics and Image Processing, vol. 20, pp. 375-390,
1982.
[Wang 1997] Wang J., Leung M. K. H. and Hui S.C. "Cursive word reference line detection",
Pattern Recognition vol. 30 issue 3, pp. 503–511, 1997.
[Weliwitage 2005] Weliwitage, A. L. Harvey, and A. B. Jennings, "Handwritten Document Offline
Text Line Segmentation", Proc. of Digital Imaging Computing: Techniques and Applications
(DICTA’05), 2005, pp. 184-187.
[Wessel 2001] F. Wessel, R. Schlüter, K. Macherey, and H. Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288-298, March 2001.
[Wong 1982] K. Y. Wong, R. G. Casey, and F. M. Wahl. Document analysis system. IBM J. Res. Dev., 26(6):647-656, November 1982.
[Yacoub 2005] S. Yacoub, J. Burns, P. Faraboschi, D. Ortega, J.A. Peiro, and V. Saxena, “Document
Digitization Lifecycle for Complex Magazine Collection”, Proceedings of the ACM symposium on
Document engineering, pp. 197-206, 2005.
[Yan 1993] Yan H., "Skew Correction of Document Images Using Interline Cross-Correlation",
CVGIP: Graphical Models and Image Processing, vol. 55, issue 6, pp. 538-543, 1993.
[Yang 2000] Y. Yang and H. Yan, “An adaptive logical method for binarization of degraded
document images,” Pattern Recognition, vol. 33, no. 5, pp. 787–807, 2000.
[Yin 2007] F. Yin, and C. Liu, ''Handwritten Text Line Extraction Based on Minimum Spanning
Tree Clustering'', Proc. Int’l Conf. on Wavelet Analysis and Pattern Recognition (ICWAPR’07),
vol.3, 2007, pp. 1123-1128.
[Young 1997] S. Young, J. Odell, D. Ollason, V. Valtchev, and P. Woodland. The HTK Book:
Hidden Markov Models Toolkit V2.1. Cambridge Research Laboratory Ltd, March 1997.
[Yu 1996] Yu B. and Jain A. K., "A robust and fast skew detection algorithm for generic documents", Pattern Recognition, vol. 29, issue 10, pp. 1599-1629, 1996.
[Zagoris 2010] K. Zagoris, E. Kavallieratou, and N. Papamarkos. A document image retrieval system. Engineering Applications of Artificial Intelligence, 23(6):872-879, 2010.
[Zagoris 2011] K. Zagoris, E. Kavallieratou, and N. Papamarkos. Image retrieval systems based on
compact shape descriptor and relevance feedback information. Journal of Visual Communication
and Image Representation, 22(5):378 – 390, 2011.
[Zahour 2007] A. Zahour, L. Likforman-Sulem, W. Boussalaa, and B. Taconet, "Text Line Segmentation of Historical Arabic Documents", Proc. 9th Int'l Conf. on Document Analysis and Recognition (ICDAR'07), 2007, pp. 138-142.
[Zhu 2004] M. Zhu. Recall, Precision and Average Precision. Working Paper 2004-09, Department of Statistics & Actuarial Science, University of Waterloo, August 26, 2004.
[Zirari 2013] F. Zirari, A. Ennaji, S. Nicolas, and D. Mammass. A methodology to spot words in historical Arabic documents. In Computer Systems and Applications (AICCSA), 2013 ACS International Conference on, pages 1-4, 2013.