D3.1.2: Description and evaluation of tools for DIA, HTR and KWS
Basilis Gatos (NCSR)
Giorgos Louloudis (NCSR) Nikolaos Stamatopoulos (NCSR)
Kostas Ntirogiannis (NCSR) Alexandros Papandreou (NCSR)
Ioannis Pratikakis (NCSR) Konstantinos Zagoris (NCSR)
Joan Andreu Sánchez (UPVLC) Verónica Romero (UPVLC) Alejandro Toselli (UPVLC)
Enrique Vidal (UPVLC) Mauricio Villegas (UPVLC) Francisco Álvaro (UPVLC) Vicente Bosch (UPVLC)
Distribution: Public
tranScriptorium
ICT Project 600707 Deliverable 3.1.2
December 31, 2013
Project funded by the European Community
under the Seventh Framework Programme for Research and Technological Development
Project ref no. ICT-600707
Project acronym tranScriptorium
Project full title tranScriptorium
Instrument STREP
Thematic Priority ICT-2011.8.2 ICT for access to cultural resources
Start date / duration 01 January 2013 / 36 Months
Distribution Public
Contractual date of delivery December 31, 2013
Actual date of delivery December 31, 2013
Date of last update March 18, 2014
Deliverable number 3.1.2
Deliverable title Description and evaluation of tools for DIA, HTR and KWS
Type Report
Status & version Final
Number of pages 154
Contributing WP(s) 3
WP / Task responsible NCSR
Other contributors UPVLC
Internal reviewer Joan Andreu Sánchez, Jesse de Does, Katrien Depuydt
Author(s) Basilis Gatos, Giorgos Louloudis, Nikolaos Stamatopoulos, Kostas Ntirogiannis,
Alexandros Papandreou, Ioannis Pratikakis, Konstantinos Zagoris, Joan Andreu
Sánchez, Verónica Romero, Alejandro Toselli, Enrique Vidal, Mauricio Villegas,
Francisco Álvaro, Vicente Bosch
EC project officer José María del Águila Gómez
Keywords
The partners in tranScriptorium are:
Universitat Politècnica de València - UPVLC (Spain)
University of Innsbruck - UIBK (Austria)
National Center for Scientific Research “Demokritos” - NCSR (Greece)
University College London - UCL (UK)
Institute for Dutch Lexicology - INL (Netherlands)
University of London Computer Centre - ULCC (UK)
For copies of reports, updates on project activities and other tranScriptorium related information, contact:
The tranScriptorium Project Co-ordinator
Joan Andreu Sánchez,
Universitat Politècnica de València
Camí de Vera s/n. 46022 València, Spain
Phone (34) 96 387 7358 - (34) 699 348 523
Copies of reports and other material can also be accessed via the project’s homepage: http://www.transcriptorium.eu/
© 2013, The Individual Authors
No part of this document may be reproduced or transmitted in any form, or by any means,
electronic or mechanical, including photocopy, recording, or any information storage and
retrieval system, without permission from the copyright owner.
Executive Summary
Historical handwritten documents often suffer from several degradations, have low quality, exhibit dense layouts, and may contain touching adjacent text lines and arbitrary text line skew. In WP3, we strive
towards the development of innovative methodologies in document image analysis (DIA),
handwritten text recognition (HTR) and keyword spotting (KWS) in order to provide the core
elements of an HTR engine. Our aim is to go beyond state-of-the-art techniques in DIA in order to efficiently enhance the quality of historical handwritten documents and segment them, as well as to prepare the necessary HTR and KWS modules which will be used in the subsequent recognition stages. This deliverable includes an overview of the current state of the art as well as a description and evaluation of all tools for DIA, HTR and KWS developed during the first year of the project.
Contents

1. Introduction
2. Related work
   2.1 Image Pre-processing
       2.1.1 Binarization
       2.1.2 Border Removal
       2.1.3 Image Enhancement
       2.1.4 Deskew
       2.1.5 Deslant
   2.2 Image Segmentation
       2.2.1 Detection of main page elements
       2.2.2 Text Line Detection
       2.2.3 Word Detection
   2.3 Handwritten Text Recognition
       2.3.1 Optical Character Recognition
       2.3.2 Handwritten Text Recognition
   2.4 Keyword Spotting
       2.4.1 The query-by-example approaches
       2.4.2 The query-by-string approaches
3. Developed Methods
   3.1 Image Pre-processing
       3.1.1 Binarization
       3.1.2 Border Removal
       3.1.3 Image Enhancement
       3.1.4 Deskew
       3.1.5 Deslant
   3.2 Image Segmentation
       3.2.1 Detection of main page elements
       3.2.2 Text Line Detection
       3.2.3 Word Detection
   3.3 Handwritten Text Recognition
   3.4 Keyword Spotting
       3.4.1 The query-by-example approach
       3.4.2 The query-by-string approach
4. Evaluation
   4.1 Image Pre-processing
       4.1.1 Binarization
       4.1.2 Border Removal
       4.1.3 Image Enhancement
       4.1.4 Deskew
       4.1.5 Deslant
   4.2 Image Segmentation
       4.2.1 Detection of main page elements
       4.2.2 Text Line Detection
       4.2.3 Word Detection
   4.3 Handwritten Text Recognition
   4.4 Keyword Spotting
       4.4.1 The query-by-example approach
       4.4.2 The query-by-string approach
5. Description of Tools
   5.1 Image Pre-processing
       5.1.1 Binarization
       5.1.2 Border Removal
       5.1.3 Image Enhancement
       5.1.4 Deskew
       5.1.5 Deslant
   5.2 Image Segmentation
       5.2.1 Detection of main page elements
       5.2.2 Text Line Detection
       5.2.3 Word Detection
   5.3 Handwritten Text Recognition
   5.4 Keyword Spotting
       5.4.1 Query by example
       5.4.2 Query by string
6. References
1. Introduction
The aim of this project is to develop innovative, cost-effective solutions for the indexing, search and
full transcription of historical handwritten document images using HTR and KWS technologies.
Existing Optical Character Recognition (OCR) systems are not suitable because, in handwritten text images, several factors such as degradation, low quality, dense layout, touching adjacent text lines and arbitrary text line skew significantly affect the recognition performance. To
this end, in WP3 we focus on the development of innovative methodologies in DIA, HTR and KWS
in order to enhance the quality and segment historical handwritten documents as well as to provide
the core elements of HTR and KWS modules.
Image pre-processing is necessary as a first step in order to (i) process color or grayscale historical
handwritten documents so that the foreground (regions of handwritten text) is separated from the
background (paper), (ii) eliminate the presence of unwanted and noisy regions as well as to enhance
the quality of text regions, (iii) correct dominant page skew and (iv) identify and correct character
slant. Special focus is given to the challenging characteristics of handwritten documents, which include low contrast, uneven background illumination, bleed-through, and shine-through or shadow-through effects.
In order to achieve accurate HTR performance, a robust and efficient solution to a hierarchical
segmentation task is necessary. In tranScriptorium, we envisage a hierarchical segmentation model
that consists of three levels. The first level is dedicated to the detection and separation of
handwritten text blocks in the document. A document structure and layout analysis will be involved
in order to detect the spatial position of text information. In the second level, text lines will be
identified while the third level will involve the segmentation of each text line into words.
Challenges to be addressed concern local skew and noise, touching adjacent text lines, and the non-uniform spacing among text areas.
Current HTR technology will be adapted to the tranScriptorium goals. Efficient techniques for
training morphological (HMM) transcription models will be used. These models will be trained
from scratch or by adapting previously trained models. The language models needed in the HTR systems will be investigated, while a word graph associated with each line will be obtained as a byproduct of the HTR process. These word graphs can be considered metadata associated with each document.
Concerning KWS, we will follow three different paths which will be unified at the end. In
particular, we will consider not only word spotting approaches starting from handwritten documents
that have been previously segmented into words and whole text lines, but we will also consider a
segmentation-free approach. In particular, we will implement a segmentation-based and a
segmentation-free approach to the query by example task, and a segmentation-free approach to the
query by string task. The resulting ranked lists from all three independent word spotting approaches
will be treated as different modalities for which a late fusion algorithm will merge them into one
coherent final result.
2. Related work
2.1 Image Pre-processing
2.1.1 Binarization
Image binarization refers to the conversion of a gray-scale or color image into a binary image. In
document image processing, binarization is used to separate the text from the background regions
by using a threshold selection technique in order to categorize all pixels as text or non-text. It is an
important and critical stage in the document image analysis and recognition pipeline since it reduces the required image storage space, enhances the readability of text areas and allows efficient and quick further processing for page segmentation and recognition.
Document image binarization methods are usually classified in two main categories, namely global
and local. Global thresholding methods use a single threshold for the entire image, while local
methods find a local threshold based on local statistics and characteristics within a window (e.g.
mean μ, standard deviation σ and edge contrast). Well-known binarization techniques are the global
method of Otsu [Otsu 1979] and the local methods of Niblack [Niblack 1986] and Sauvola et al.
[Sauvola 2000]. In the case of document images with a bimodal histogram, Otsu yields satisfactory results but cannot effectively handle documents with degradations such as faint characters, bleed-through or uneven background. Niblack calculates for each pixel the local statistics (μ, σ) within a window and adapts the local threshold accordingly. Niblack has the advantage of detecting the text well but introduces a lot of background noise. Sauvola modified the thresholding formula of Niblack to decrease the background noise, but the text detection rate is also decreased, while bleed-through still remains in many cases.
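As an illustration of the local thresholding rules just described, the following sketch computes the Niblack and Sauvola thresholds over a square window; the window size and the parameters k and R are illustrative defaults of this sketch, not values prescribed in this deliverable.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_stats(img, w):
    """Local mean and standard deviation over a w x w window."""
    img = img.astype(np.float64)
    mean = uniform_filter(img, size=w)
    sq_mean = uniform_filter(img ** 2, size=w)
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
    return mean, std

def niblack_binarize(img, w=25, k=-0.2):
    """Niblack: T = mean + k * std (k is typically negative for dark text)."""
    mean, std = local_stats(img, w)
    return (img <= mean + k * std).astype(np.uint8)   # 1 = text (dark), 0 = background

def sauvola_binarize(img, w=25, k=0.5, R=128):
    """Sauvola: T = mean * (1 + k * (std / R - 1)), with R the dynamic range of std."""
    mean, std = local_stats(img, w)
    threshold = mean * (1 + k * (std / R - 1))
    return (img <= threshold).astype(np.uint8)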
Certain binarization methods have incorporated background estimation and removal steps (e.g.
[Gatos 2006] and [Lu 2010]), as well as local contrast computations to provide improved
binarization results (e.g. [Su 2010] and [Howe 2012]). Other binarization methods, aiming at an
increased binarization performance, proposed combination methodologies of binarization methods
(e.g. [Gatos 2008] and [Su 2011]), while water-flow based methods (e.g. [Kim 2002]), consider the
image as 3D terrain with valleys and mountains corresponding to text and background regions,
respectively. The logical level methods of Kamel et al. [Kamel 1993] and Yang et al. [Yang 2000]
are also considered reference points in binarization. Representative binarization methods of the
aforementioned categories are described in the following.
The logical level techniques of Kamel et al. [Kamel 1993] and Yang et al. [Yang 2000] use the
stroke width to set the local window size. Moreover, for the computation of the local threshold,
information from anti-diametric points is considered. The distance of those anti-diametric points is
set according to the stroke width. In the initial logical level technique of Kamel et al. [Kamel 1993],
the local threshold and the stroke width are given by the user, while in [Yang 2000], those
parameters were automatically computed.
In [Kim 2002], the original image was considered as a 3D terrain on which water was poured to fill
the valleys that represented the textual components. The final binarization result was produced by
applying [Otsu 1979] to the compensated image, i.e. the difference between the original image and
the water-filled image. In [Gatos 2006], a Wiener filter was used for pre-processing and the
background was estimated taking into account [Sauvola 2000] binarization output. The final
threshold was based on the difference between the estimated background and the preprocessed
image, while post-processing enhanced the final result. Although this method achieves better OCR
performance than Sauvola, it inherits some drawbacks of Sauvola. For instance, faint characters
cannot be successfully restored, and bleed-through remains. In the recent work of Lu et al. [Lu
2010], the background was estimated using polynomial smoothing for each row and column of the
original image. Then, the original image was normalized and Otsu was performed to detect the text
stroke edges. Furthermore, their local threshold formula was based on the local number of the
detected text stroke edges and their mean intensity. Although this method provides high overall
results, achieving the top position in the DIBCO'09 competition ([Gatos 2011]), it has certain limitations, as stated by the authors: it relies on the local contrast for the final thresholding, and hence some bleed-through noise or some high-contrast noisy background components remain.
In Su et al. [Su 2010], the authors calculated the image contrast (based on the local maximum and
minimum intensity) and binarized the contrast image using Otsu to detect the text edge pixels. They
used a local thresholding formula similar to [Lu 2010] and presented better results. The authors also
participated with a modified version of this method in the binarization contests of H-DIBCO’10
([Pratikakis 2010]) and DIBCO’11 ([Pratikakis 2011]), where they achieved 1st and 2nd rank,
respectively. This method is capable of removing the majority of the background noise and bleed-
through but it does not detect the faint characters effectively. [Howe 2012] proposed a method
based on the Laplacian of the image intensity, in which an energy function was minimized via a
graph-cut computation. It incorporates Canny edge information ([Canny 1986]) in the graph construction to encourage solutions where discontinuities align with detected edges. It is an efficient method, although it relies on several parameters. The author tuned the parameters on the DIBCO’09 set and
achieved 3rd position on H-DIBCO’10 and the same position on DIBCO’11 (including both printed
and handwritten images). However, this method could miss faint character parts, introduce
background noise or even fail in some cases.
As far as the combination methodologies are concerned, Su et al. [Su 2011] proposed a framework
for the combination of binarization methods in which pixels are classified into three categories: foreground, background and uncertain. Specifically, a pixel is considered foreground if it is a foreground pixel in all the combined binarization outputs, and the same holds for the background pixels. Uncertain pixels are classified iteratively, according to their difference (in terms of intensity and local contrast) from the already classified background and foreground pixels within a neighbourhood, until all uncertain pixels have been assigned. Local contrast is calculated according to the image intensity and the
maximum intensity. In [Su 2011], the authors presented good results on the whole dataset (both
printed and handwritten) of the DIBCO’09. Furthermore, the proposed contrast computation
augments the contrast in the faint characters but it also augments the contrast in background noisy
components and bleed-through. Another framework for combination of binarization methods was
proposed by Gatos et al. [Gatos 2008], in which foreground pixels from different binarization
methods were selected using a majority voting. Canny edges were also incorporated to enhance the
combination result by filling white runs between the Canny edges that correspond to foreground.
This method improves the results of the authors’ previous work ([Gatos 2006]), but it depends on the Canny edges, through which faint characters can be missed while background noise, including bleed-through, can be detected as text.
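The voting core of such combination frameworks is simple. The sketch below combines several binarization outputs by majority voting, in the spirit of [Gatos 2008]; the edge-based refinement step of that method (filling runs between Canny edges) is deliberately omitted, and the tie-breaking rule is an assumption of the sketch.

```python
import numpy as np

def majority_vote_combination(binary_maps):
    """Combine several binarization outputs (1 = foreground) by majority voting.
    A pixel is kept as foreground if at least half of the methods agree."""
    stack = np.stack(binary_maps, axis=0)            # shape: (n_methods, H, W)
    votes = stack.sum(axis=0)
    return (votes * 2 >= stack.shape[0]).astype(np.uint8)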
2.1.2 Border Removal
Approaches proposed for document segmentation and character recognition usually consider ideal
scanned images without noise. However, there are many factors that may generate imperfect
document images. When a page of a book is scanned, text from an adjacent page may also be
captured into the current page image. These unwanted regions are called “noisy text regions”.
Additionally, whenever a scanned page does not completely cover the scanner setup image size,
there will usually be black borders in the image. These unwanted regions are called “noisy black
borders”. Fig 2.1.2.1 shows noisy black borders as well as noisy text regions. All these problems
influence the performance of segmentation and recognition processes. When page segmentation algorithms take noisy text regions into account, the text recognition accuracy decreases, since the text recognition system usually outputs several extra characters in these regions. The goal of border
detection is to find the principal text region and to ignore the noisy text and black borders.
Figure 2.1.2.1: Example of an image with noisy black border and noisy text region.
The most common approach to eliminate marginal noise is to perform document cleaning by
filtering out connected components based on their size and aspect ratio. However, when characters
from the adjacent page are also present, they usually cannot be filtered out using only these features.
There are only a few techniques in the literature for page border detection, and they are mainly focused on printed document images.
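As a concrete illustration of the common component-filtering approach mentioned above, the following sketch removes connected components whose size or aspect ratio is atypical of text. The thresholds are hypothetical and would have to be tuned to the resolution and script of the collection.

```python
import numpy as np
from scipy.ndimage import label, find_objects

def filter_components(binary, min_height=5, max_height=200, max_aspect=20.0):
    """Keep only connected components of a binary image (1 = foreground) whose
    height and aspect ratio look like text; everything else is dropped."""
    labeled, _ = label(binary)
    cleaned = np.zeros_like(binary)
    for i, sl in enumerate(find_objects(labeled), start=1):
        if sl is None:
            continue
        h = sl[0].stop - sl[0].start
        w = sl[1].stop - sl[1].start
        aspect = max(h, w) / max(min(h, w), 1)
        if min_height <= h <= max_height and aspect <= max_aspect:
            cleaned[sl][labeled[sl] == i] = 1
    return cleaned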
Le et al. [Le 1996] propose a method for border removal which is based on classification of blank,
textual and non-textual rows and columns, location of border objects, and an analysis of projection
profiles and crossing counts of textual squares. This approach uses several heuristics. Moreover, it
is based on the assumption that the page borders are very close to edges of images and borders are
separated from image contents by a white space. However, this assumption is often violated. Fan et
al. [Fan 2002] propose a method for removing noisy black regions overlapping the text region, but
do not consider noisy text regions. They propose a scheme to remove the black borders of scanned
documents by reducing the resolution of the document image, which makes the textual part of the
document disappear, thus leaving only blocks to be classified either as images or borders by a
threshold filter. The deletion process should be performed on the original image instead of the
image of reduced size.
In [Avila 2004] Avila et al. propose invading and non-invading border algorithms which work as
“flood–fill” algorithms. The invading algorithm assumes that the noisy black border does not invade
the black areas of the document. In the case that the document text region is merged to noisy black
borders, the whole area, including part of the text region, is flooded and removed. On the other
hand, the non-invading border algorithm assumes that the noisy black border does merge with
document information. In order to restrain flooding in the whole connected area, it takes into
account two parameters related to the nature of the documents, the maximum size of a segment
belonging to the document text as well as the maximum distance between lines. Dey et al. [Dey
2012] propose a technique for removing margin noise from printed document images. Firstly, they
perform layout analysis to detect words, lines, and paragraphs in the document image and the
detected elements are classified into text and non-text components based on their characteristics
(size, position, etc.). The geometric properties of the text blocks are then exploited to detect and remove the
margin noise. Finally, Agrawal and Doermann [Agrawal 2013] present a clutter detection and
removal algorithm for complex document images. They propose a distance transform-based
approach which aims to remove irregular and non-periodic clutter noise from binary document images, independently of the clutter’s position, size, shape and connectivity with the text.
2.1.3 Image Enhancement
Document image processing, analysis and recognition consist of many different steps. The output of each step is the input to the next step, and thus the performance of each step is important.
Consequently, the initial image quality and the quality of the very first steps (e.g. binarization) play
an important role in the whole document image recognition process.
Document images may contain degradations due to the image acquisition process, such as speckle
or salt and pepper noise, blurring, shadows, etc. as well as degradations due to the environmental
conditions (humidity), ageing or extended use. Specifically, the use of quill pens in handwritten
documents is responsible for several degradations, including faint characters, bleed-through and
large stains. Image enhancement methods aim to increase the quality of the original color or
grayscale image as well as the quality of the binarization.
The enhancement of the original color or grayscale image is usually performed by filtering the
image. Common filters used for image enhancement are the Mean, the Median and the Gaussian
filter, as well as the Wiener filter ([Gatos 2006]), the Total Variation regularization and the Non-
local means ([Barney Smith 2010]). The use of a low-pass Wiener filter assures contrast
enhancement between background and text areas, smoothing of background texture as well as the
elimination of noisy areas. Total variation regularization flattens background intensity and reduces
the background noise. Non-local means filtering can smooth character parts and improve character
quality based on neighboring data information. The main role of the aforementioned filters is to
smooth the image and hence reduce the background noise. In contrast, the use of the unsharp masking technique (Ramponi et al. [Ramponi 1996]) enhances the local contrast and the text becomes clearer. Unfortunately, any noise present is also enhanced.
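A minimal sketch of the unsharp masking idea is given below: a blurred copy is subtracted from the image and the difference is added back, which boosts local contrast (and, unavoidably, any noise). The Gaussian blur and the sigma/amount values are assumptions of this sketch rather than the exact formulation of [Ramponi 1996].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp_mask(img, sigma=2.0, amount=1.0):
    """Unsharp masking for an 8-bit grayscale image: sharpened = img + amount * (img - blurred)."""
    img = img.astype(np.float64)
    blurred = gaussian_filter(img, sigma=sigma)
    sharpened = img + amount * (img - blurred)
    return np.clip(sharpened, 0, 255).astype(np.uint8)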
Another enhancement technique is based on the background estimation. The estimated background
can be used to subtract it or divide it from the original image in an attempt to remove large stains
and smooth the background. This procedure has been used in [Gatos 2006], as well as in [Lu 2010].
Histogram equalization could also be used for image enhancement purposes; it is actually a contrast
enhancement technique. A generalized fuzzy operator pre-processed with histogram equalization
and partially overlapped sub-block histogram equalization is used from Leung et al. [Leung 2005]
for reducing background noise and increasing the readability of text by contrast enhancement. This
method achieves good performance in enhancement of extremely low contrast and low illuminated
document images. Another image enhancement method worth mentioning was presented by Drira et
al. [Drira 2011], which suggests the application of the Partial Diffusion Equation (PDE). This
method was created for restoring degraded text characters by repairing the shapes of the features as
well as extrapolating lost information.
There are also methods aiming at removing bleed-through ([Tonazzini 2010] and [Fadoua 2006]). In
[Fadoua 2006], a recursive unsupervised segmentation approach was proposed, which was applied
on the decorrelated data space by principal component analysis. This technique does not require any
specific learning process or any input parameters. The stopping criterion for the proposed recursive
approach has been determined empirically and set to a fixed number of iterations. However, bleed-
through removal techniques may also remove faint characters.
Apart from image enhancement performed on the original color or grayscale image, enhancement
could also be performed on the binarization output. In many cases, binarization methods incorporate
post-processing steps for enhancing their output. Common operations that are applied include
masks, connected component analysis, morphological operations as well as shrink and swell
operations ([Gatos 2006]).
Morphological closing, which consists of a morphological dilation followed by a morphological
erosion of the same mask, is used to restore small broken character parts. Broken characters are
usually encountered due to the existence of faint characters in the original color or grayscale image.
Unfortunately, small character holes, such as the hole of the "e", could also be filled. An example
of morphological closing is presented in Fig. 2.1.3.1. On the other hand, morphological opening,
which consists of morphological erosion followed by morphological dilation of the same mask, is
used to remove small unwanted foreground components. Those parts could be either noise of salt-
and-pepper type, or unwanted bleed-through. Unfortunately, punctuation marks (dots) could also be
removed.
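A minimal sketch of these two post-processing operations, using a 3x3 structuring element as in Fig. 2.1.3.1, could look as follows; it illustrates the general behaviour described above rather than a specific method from this deliverable.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_opening

def enhance_binary(binary):
    """Post-process a binary text image (1 = foreground) with a 3x3 structuring element:
    closing reconnects small breaks in faint strokes, opening removes isolated speckles.
    Both may also alter legitimate detail (filled holes, lost punctuation), as noted above."""
    selem = np.ones((3, 3), dtype=bool)
    closed = binary_closing(binary.astype(bool), structure=selem)
    opened = binary_opening(closed, structure=selem)
    return opened.astype(np.uint8)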
Figure 2.1.3.1: Example application of morphological closing (3x3 mask): (a) original image; (b) initial binary image; (c) the enhanced binary image after morphological closing, which has fewer holes and some reconnected components.
Another way for enhancing the binarization output is to remove components of size below or above
a certain threshold. This method requires connected component labeling, and several parameters are set heuristically. Usually, connected components smaller than a certain height or containing fewer than a certain number of pixels are considered noise and are removed. The despeckle filter falls into this category: it removes connected components whose width and height are below a certain, usually small, value. A representative example of
despeckle filtering is shown in Figure 2.1.3.2.
Figure 2.1.3.2: Example of despeckle filtering (7x7 mask): (a) initial binary image; (b) the
enhanced binary image after despeckle.
Binarization often introduces a certain amount of single-pixel holes, concavities and convexities
along the contour of the text. These single pixel defects are actually artifacts, which can be removed
by using certain logical operators that can be simply set according to their neighborhood patterns
([Lu 2010]).
2.1.4 Deskew
The detection and correction of the skew is one of the most important document image analysis
steps. Some degree of skew is unavoidable when a document is scanned manually or mechanically.
This may seriously affect the performance of subsequent stages of segmentation and recognition,
while a skew of as little as 0.1° may be apparent to a human observer. To this end, a reliable skew
estimation and correction technique has to be used for scanned documents as a pre-processing stage
in almost all document analysis and recognition systems [Sadri 2009]. In the literature, a variety of
page level skew detection techniques are available and fall broadly into the following four
categories according to the basic approach they adopt [Jung 2004]: Hough transform ([Srihari
1989], [Hinds 1990], [Yu 1996], [Wang 1997] and [Singh 2008]), nearest neighbour clustering
([Hashizume 1986], [Gorman 1993] and [Lu 2003]), interline cross correlation ([Yan 1993], [Gatos
1997], [Chou 2007], [Deya 2010] and [Alireza 2011]) and projection profile ([Postl 1986], [Baird
1987], [Ciardiello 1988], [Ishitani 1993] and [Papandreou 2011]) based methods.
The Hough Transform (HT) is a popular technique in machine vision applications. Despite its computational cost, it has been widely employed to detect lines and curves in digital images.
According to Srihari and Govindaraju [Srihari 1989], each black pixel is mapped to the Hough
space (ρ,θ) and the skew is estimated as the angle in the parameter space that gives the maximum
sum of squares of the gradient along the ρ component. In order to improve the computational
efficiency of the HT based approaches, several variants that reduce the number of points which are
mapped in the Hough space have been proposed. Hinds et al. [Hinds 1990] fuse HT and run length
encoding in order to reduce data. They create a burst image which is built by replacing each vertical
black run with its length placed in the bottom-most pixel of the run. The bin with maximum value
in the Hough space determines the skew angle. Furthermore, Wang et al. [Wang 1997] use bottom
pixels of the candidate objects within a selected region for the HT. Accordingly, document skew is
estimated by finding a local peak in Hough space. Also, Yu and Jain in [Yu 1996] fuse hierarchical
HT and centroids of connected components. They first compute the connected components and their
centroids and then the HT is applied to the centroids using two angular resolutions. Finally, Singh et
al. [Singh 2008] introduce a block adjacency graph (BAG) in a pre-processing stage, and document
skew is calculated based on voting using the HT. As mentioned by the authors, the technique is language dependent. Researchers have proposed different strategies to reduce the amount of input data, but the computational cost of the HT is still very high. Furthermore, prior to applying the HT, it is a prerequisite to extract the text regions of the document images. However, this is very hard since layouts may be arbitrary and complex.
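For illustration, a much simplified Hough-style skew estimator might look like the sketch below: each foreground pixel votes in a (ρ, θ) accumulator, and the angle whose accumulator column is most sharply peaked is returned. The peak criterion (sum of squared bin counts) and the search range are assumptions of this sketch, not a reproduction of any of the cited methods.

```python
import numpy as np

def hough_skew(binary, angle_range=15.0, angle_step=0.5):
    """Estimate the page skew (degrees) of a binary image (1 = text) by voting each
    foreground pixel into a rho-histogram per candidate angle and keeping the angle
    whose histogram is most sharply peaked (text lines collapse onto few rho bins)."""
    ys, xs = np.nonzero(binary)
    thetas = np.deg2rad(np.arange(-angle_range, angle_range + angle_step, angle_step))
    n_bins = int(np.hypot(*binary.shape)) + 1
    scores = np.empty(len(thetas))
    for i, t in enumerate(thetas):
        rho = ys * np.cos(t) - xs * np.sin(t)           # signed distance along the normal
        hist, _ = np.histogram(rho, bins=n_bins)
        scores[i] = np.sum(hist.astype(np.float64) ** 2)
    return np.degrees(thetas[np.argmax(scores)])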
Concerning nearest neighbour approaches, spatial relationships and mutual distances of connected
components are used to estimate the page skew. In the work of Hashizume et al. [Hashizume 1986], the direction vectors of all nearest-neighbour pairs of connected components are accumulated in a histogram, and the peak in the histogram gives the dominant skew. This method is generalized by
O'Gorman [Gorman 1993] where nearest-neighbour clustering uses the K-neighbours for each
connected component. However, it is accurate up to a resolution of 0.5°. Lu and Tan [Lu 2003]
propose an improved nearest-neighbour chain based approach for document skew estimation with
high accuracy and language-independent capability. The methods of this class aim at exploiting the
general assumptions that characters in a line are aligned and close to each other. They are
characterized by a bottom up process which starts from a set of objects and utilizes their mutual
distances and spatial relationships to estimate the document skew. Few researchers prefer nearest neighbour clustering techniques for document skew removal, since text lines can be grouped in any direction and connections with noisy subparts of characters reduce the accuracy of the method. Furthermore, skew detection based on nearest neighbour clustering methods is rather slow due to the component labelling phase, which is performed bottom-up and has quadratic time complexity O(n²).
The interline cross-correlation approaches are based on the assumption that de-skewed images
present a homogeneous horizontal structure. Therefore, such approaches aim at estimating the
document skew by measuring vertical and horizontal deviations among foreground pixels along the
document image. Cross-correlation between lines at a fixed distance is used by Yan [Yan 1993].
This method is based on the observation that the correlation between two vertical lines of a skewed
document image is maximized if one line is shifted relatively to the other in such a way that the
character base line levels for the two lines are coincident. The accumulated correlation for many
pairs of lines can provide a precise estimation of the page skew. In [Gatos 1997], Gatos et al.
proposed that the image should first be smoothed by a horizontal run-length algorithm and then,
only the information existing in two or more vertical lines is to be used for skew detection. Chou et
al. in [Chou 2007] proposed piece-wise coverings of objects by parallelograms (PCP) for estimating
the skew angle of a scanned document. A document image is initially split into a number of non-
overlapping slabs, and subsequently parallelograms at various angles cover the components in each
slab. The angle at which maximum white space is obtained indicates the estimated skew angle of
the document. In [Deya 2010], Deya and Noushath have enhanced PCP (e-PCP) with a criterion to
classify a document as a landscape or portrait mode, while they have introduced a confidence
measure for filtering the estimated skew angles that may not be reliable. Finally, Alireza et al.
[Alireza 2011], proposed an algorithm which applies piece-wise painting on both directions of a
document and creates two painted images. Those images provide statistical information about
regions with specific height and width. Points of those regions are grouped, a few lines are obtained
with the use of linear regression and line drawing concepts and their individual slopes are
calculated.
The traditional projection profile approach is a simple solution to detect skew angle of a document
image. It was initially proposed by Postl [Postl 1986] and is based on horizontal projection profiles.
According to this approach, a series of horizontal projection profiles are calculated at a range of
angles. The profile with maximum variation corresponds to the best alignment of the text lines. In
order to reduce high computational cost, several variations of this basic method have been
proposed. Papandreou and Gatos have proposed an improvement of the traditional horizontal
projection profile approach by combining it with the minimization of the bounding box technique
[Papandreou 2011]. According to this approach, the maximum variations are divided by the
bounding box area that includes the text. This amount is minimized when the document is
horizontally aligned. Baird [Baird 1987] proposed a technique for selecting the points to be projected: for each connected component, only the midpoint of the bottom side of the bounding box is projected. The objective function is the sum of the squares of the profile values.
Ciardiello et al. [Ciardiello 1988] calculate projections on selected sub-regions, with high density of
black pixels per row, of the document image. The algorithm detects the skew angle by maximizing
the mean square deviation of the profile. Ishitani [Ishitani 1993] uses a profile which is defined in a
different style. A cluster of parallel lines on the image is selected and the bins of the profile store the
number of black to white transitions along the lines. Finally, Papandreou and Gatos in [Papandreou
2011] have proposed a vertical projection profile approach, taking advantage of the alignment of
vertical strokes of the Latin alphabet when the document is deskewed, which proved to be more efficient in the presence of noise and warp.
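A minimal sketch of the basic projection-profile criterion described above is given below: the page is rotated over a small range of candidate angles and the angle that maximises the variation of the horizontal profile is kept. The variance criterion and the angular range are assumptions of the sketch, standing in for the several variants discussed above.

```python
import numpy as np
from scipy.ndimage import rotate

def projection_profile_skew(binary, angle_range=5.0, angle_step=0.25):
    """Return the rotation angle (degrees) that best aligns the text lines of a binary
    page (1 = text) horizontally, i.e. the angle maximising the variance of the
    horizontal projection profile (row sums of foreground pixels)."""
    best_angle, best_score = 0.0, -1.0
    page = binary.astype(np.uint8)
    for angle in np.arange(-angle_range, angle_range + angle_step, angle_step):
        rotated = rotate(page, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1).astype(np.float64)
        score = profile.var()
        if score > best_score:
            best_score, best_angle = score, angle
    return best_angle
```

Correcting the page then amounts to rotating it by the returned angle.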
For skew estimation at the text line level, the Hough transform based and projection profile based methods are used. The performance of these methods is slightly reduced, since there is much less information to exploit for the estimation, although the results are still acceptable. Additionally, Madhvanath et al. in [Madhvanath 1999] used the image contour to detect the line's minima and determine the baseline as the regression line through those minima. The inclination of the regression line is regarded as the skew of the line.
As far as skew estimation at the word level is concerned, several methods have been proposed in the
literature. The regression approach of Madhvanath et al. [Madhvanath 1999] applies also to words,
with significantly lower performance especially in words with great skew angles. Morita et al. in
[Morita 1999] proposed a method based on mathematical morphology to obtain a pseudo-convex
hull image. Then, the minima are detected on the pseudo-convex image, a reference line is fit
through those points and the inclination of this line is considered to be the word skew. The primary
challenge in these two methods is the rejection of spurious minima. Also, the regression-based
methods do not work well on words of short length because of the lack of a sufficient number of
minima points. Other approaches for handwritten word skew detection are based on density
distribution. In [Cote 1998], several histograms are computed for different vertical projections.
Then, the entropy is calculated for each of them. The histogram with the lowest entropy determines
the word skew angle. In [Kavallieratou 1999], Kavallieratou et al. calculate the Wigner-Ville
distribution for several horizontal projection histograms. The word skew angle is selected by the
Wigner-Ville distribution with the maximal intensity. The main problem for these distribution based
methods is the high computational cost, since the image has to be rotated for each angle. Jian-xiong Dong et al. proposed in [Dong 2005] to maximize a global measure defined by the Radon transform of the word image, calculating its gradient in order to estimate the word skew.
Furthermore, Blumenstein et al. in [Blumenstein 2002] identify the skew by detecting the center of
mass in each half (right and left) of a word image. The skew angle is estimated by hypothesizing a
line between the two centers and by measuring its angle with the x-axis. Finally, in [Papandreou 2013a], Papandreou and Gatos propose a coarse-to-fine method that is based on the detection of the core-region and the calculation of the centers of mass of the right and left overlapping parts of the word image. Detecting the corresponding center of mass in each part and calculating the inclination of the line that connects them gives a rough estimation of the skew. After the skew is roughly detected and corrected, the core-region of the word is detected and the same procedure is repeated using only the information inside the core-region, providing a finer estimation of the skew angle.
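The centre-of-mass idea behind these last two methods can be sketched as follows; the sketch implements only the rough, whole-word step and leaves out the core-region refinement of [Papandreou 2013a].

```python
import numpy as np

def word_skew_by_centroids(word_binary):
    """Rough word-skew estimate: split the word image (1 = ink) into left and right
    halves, compute the centre of mass of the ink in each half, and return the angle
    (degrees) of the connecting line with respect to the x-axis."""
    h, w = word_binary.shape
    centers = []
    for offset, half in ((0, word_binary[:, : w // 2]), (w // 2, word_binary[:, w // 2 :])):
        ys, xs = np.nonzero(half)
        if len(xs) == 0:
            return 0.0                                  # not enough ink to estimate a skew
        centers.append((xs.mean() + offset, ys.mean()))
    (x1, y1), (x2, y2) = centers
    # the image y-axis points downwards, so negate dy to get a conventional angle
    return np.degrees(np.arctan2(-(y2 - y1), x2 - x1))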
2.1.5 Deslant
A crucial preprocessing step in order to improve segmentation and recognition accuracy of
handwritten words is the estimation and correction of word slant. By the term “word slant” we refer
to the average angle in degrees clockwise from vertical at which the characters are drawn in a word.
Word slant estimation can be very helpful in handwritten text processing. Knowledge of the slant value facilitates further stages of processing and recognition. Moreover, the character slant is thought to be very important information which can help to identify the writer of a text.
Slant estimation methodologies proposed in the literature fall broadly into the following categories.
The average slant angle can be estimated by averaging the angles of near-vertical strokes [Bozinovic 1989], [Kim 1997], [Papandreou 2012] and [Papandreou 2014], by analyzing projection histograms
[Vinciarelli 2001] and [Kavallieratou 2001] and by using statistics of chain code contours [Kimura
1993], [Ding 2000] and [Ding 2004].
Concerning near-vertical stroke techniques, according to Bozinovic and Srihari [Bozinovic 1989],
for a given word all horizontal lines which contain at least one run of length greater than a
parameter M (depending on the width of the letters) are removed. Additionally, all horizontal strips
of small height are removed. By deleting these horizontal lines, only orthogonal window parts of
the picture remain in the text. For each letter the parts that remain are those that have relatively
small horizontal slant. For each of these parts, the angle of the line defined by the centers of gravity of its upper and lower halves is estimated, and the mean of these angles gives the overall character slant of the text. In [Papandreou 2012], Papandreou and Gatos propose an extension of the method of Bozinovic and Srihari, integrating information on the location and the height of the remaining parts. The main differences between the proposed technique and the method presented in
[Bozinovic 1989] are the automatic parameter setting and the weights on the contribution of the
character fragments based on their height and position relative to the core-region. Furthermore,
Papandreou and Gatos in [Papandreou 2014] propose a further improved method that incorporates a
new technique for core-region detection, has more accurate parameter setting and reduces the
involved parameters. In the approach of Kim and Govindaraju [Kim 1997] the whole page is
considered as input and the global slant is estimated by averaging over all lines. Vertical and near-
vertical lines are extracted with a chain code representation of the word contour using a pair of one
dimensional filters. Coordinates of the start and end points of each vertical line extracted provide
the slant angle. The global slant angle is the average of the angles of all the lines, weighted by their vertical extent, since a longer line gives a more accurate angle than a shorter one.
Regarding the analysis of projection histograms, Vinciarelli and Luettin [Vinciarelli 2001] have proposed a
deslanting technique based on the hypothesis that the word has no slant when the number of
columns containing a continuous stroke is maximum. On the other hand, Kavallieratou et al.
[Kavallieratou 2001] have proposed a slant removal algorithm based on the use of the vertical
projection profile of word images and the Wigner-Ville distribution (WVD). The word image is
artificially slanted and for each of the extracted word images, the vertical histogram as well as the
WVD of these histograms is calculated. The curve of maximum intensity of the WVDs corresponds
to the histogram with the most intense alternations and as a result to the dominant word slant.
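The projection-based deslanting idea can be sketched as follows: the word image is sheared over a range of candidate angles, and the angle maximising the number of columns containing a single continuous vertical stroke is retained, in the spirit of the hypothesis of [Vinciarelli 2001]. The shear implementation, the angular range and the scoring details are assumptions of this sketch, not the exact criteria of the cited methods.

```python
import numpy as np

def shear_image(binary, slant_deg):
    """Shear a binary word image horizontally, shifting each row in proportion to its
    distance from the bottom row (padding is added so nothing is clipped)."""
    h, w = binary.shape
    shift = np.tan(np.deg2rad(slant_deg))
    pad = int(np.ceil(abs(shift) * h)) + 1
    out = np.zeros((h, w + 2 * pad), dtype=binary.dtype)
    for y in range(h):
        dx = int(round((h - 1 - y) * shift))
        out[y, pad + dx : pad + dx + w] = binary[y]
    return out

def estimate_deslant_angle(binary, max_slant=45, step=1):
    """Return the shear angle (degrees) that best de-slants the word: the angle at
    which the largest number of columns consists of one unbroken vertical run."""
    def continuous_columns(img):
        count = 0
        for col in img.T:
            ys = np.nonzero(col)[0]
            if len(ys) > 0 and ys[-1] - ys[0] + 1 == len(ys):
                count += 1
        return count
    angles = range(-max_slant, max_slant + 1, step)
    scores = [continuous_columns(shear_image(binary, a)) for a in angles]
    return angles[int(np.argmax(scores))]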
Techniques based on statistics of chain code contours can estimate average slant of a handwritten
word by using the 4-directional chain code histogram of border pixels (Kimura et al. [Kimura
1993]). This method tends to underestimate the slant when its absolute value is close to or greater than 45°. To solve this problem, Ding et al. [Ding 2000] proposed a slant detection method using an 8-
directional chain code. Chain code methods of 12 and 16 directions are also examined in [Ding
2004] but the experimental results show that these methods tend to overestimate the slant. In [Ding
1997], new methods are proposed for evaluating and improving the linearity and accuracy of slant
estimation based on chain contours.
Additionally, a slant estimation method for handwritten characters by means of Zernike moments
has been proposed by Ballesteros et al. [Ballesteros 2005], which is based on the average inclination
of the Zernike reconstructed images for low moments. It is claimed that this method improves the
slant estimation accuracy in comparison with the chain code based methods.
2.2 Image Segmentation
2.2.1 Detection of main page elements
Detection of main text zones
A reading system requires the detection of main page elements as well as the discrimination of text
zones from non-textual ones. Historical handwritten documents do not have strict layout rules and
thus, a layout analysis method needs to be invariant to layout inconsistencies, irregularities in script
and writing style, skew, fluctuating text lines, and variable shapes of decorative entities. Most of the
state-of-the-art methodologies focus on machine-printed or modern handwritten documents and
only a few deal with historical handwritten documents. Representative page segmentation methods
can be found in [Shafait 2008a].
Among existing approaches to page segmentation for historical handwritten documents we mention:
- [Malleron 2009], which focuses on the layout analysis of 19th century handwritten documents.
First, text line detection is modelled as an image segmentation problem by enhancing text line
structure using Hough transform and a clustering of connected components so as to make text line
boundaries appear. Then, snippets decomposition is performed for page layout analysis. This is
based on a first step of content pages classification in five visual and genetic taxonomies, and a
second step of text line extraction and snippets decomposition. Snippets are built by linking
centroids of borders components (see Fig. 2.2.1.1).
Figure 2.2.1.1: Building snippets for page segmentation
- [Bulacu 2007] proposed a method for layout analysis of handwritten historical documents in order
to enable searching the archive of the Cabinet of the Dutch Queen (1798 – 1988). It is based on the
detection of the rule lines of the tables and the page margins. It also uses contour tracing that
generates curvilinear separation paths between text lines in order to preserve the ascenders and
descenders. A layout analysis result of this method is shown in Fig. 2.2.1.2.
Figure 2.2.1.2: An example of layout analysis result of method [Bulacu 2007]
- [Bukhari 2012] on layout analysis of Arabic historical document images using machine learning. Simple and discriminative features are extracted at the connected-component level, and subsequently
robust feature vectors are generated. A multi-layer perceptron classifier is then exploited to classify
connected components to the relevant class of text.
- [Nicolas 2006] on complex handwritten page segmentation using contextual models. Stochastic
and contextual models are used in order to cope with local spatial variability, while some prior
knowledge about the global structure of the document image is taken into account. Two tasks are
defined: a) to label the main regions of the manuscripts such as text body, margins, header, footer,
page number and marginal annotations, b) to detect pseudo words, deletions, diacritics and
background.
Page Split
Document images are usually produced by scanning books or periodicals. Scanning two pages at
the same time is a very common practice as it helps to accelerate the scanning process. However, it
may affect the performance of subsequent processing such as document analysis and optical
character recognition (OCR) since the majority of approaches are able to process only single page
images. Furthermore, another drawback of scanning two pages at the same time is the appearance of
noisy black borders around text areas as well as of noisy black stripes between the two pages (Fig.
2.2.1.3).
Figure 2.2.1.3: Example of a double page document image.
There are only a few techniques in the literature that address the problem of page splitting. According to these approaches, double page documents are split in the middle after border removal or by
defining the coordinates of the print space. Yacoub et al. [Yacoub 2005] involve a splitting process
as a pre-processing stage for the conversion of a large collection of complex documents and
deployment for online web access to its information-rich content. First, the double page images are
cropped to the border of the page, using several criteria, which include searching for horizontal and
vertical projections of straight page edges and checking the computed dimension versus a database
of magazine sizes sampled throughout history. Once the double-pages are cropped, the centrefold of
the double-page is identified and two individual pages are generated using a splitting operation.
Furthermore, a page splitting approach is proposed in the frame of the Metadata Engine Project
(http://meta-e.aib.uni-linz.ac.at/). This method is based on the assumption that books are printed
according to a clearly defined printing space. Only a limited number of special elements may appear
in the surrounding margins. Therefore, the method determines the coordinates of the print space
used in the given document and then applies this zone to the actual page image. Next, a virtual
margin is added around the printing space and the pages are cut. If a document contains
supplements that do not conform to the default print space, such as maps, tables, graphs, and
pictures, this variation is detected automatically.
2.2.2 Text Line Detection
One of the early tasks in a handwriting recognition system is the segmentation of a handwritten document image into text lines, defined as the process of determining the region of every text line on a document image. The overall performance of a handwritten character recognition system
strongly relies on the results of the text line segmentation process. If the quality of the results
produced by the stage of text line segmentation is poor, this will affect the accuracy of the text
recognition procedure. Thus, the algorithms employed for these two stages are critical for the
overall recognition procedure.
We can group existing text line segmentation methods into four basic categories: methods making use of the
projection profiles, methods that are based on the Hough transform, smearing methods and, finally,
methods based on the principle of dynamic programming. Also, several methods exist that cannot
be clearly classified in a specific category, since they employ particular techniques.
Methods that make use of projection profiles include [Bruzzone 1999] and [Arivazhagan 2007]. In [Bruzzone 1999], the initial image is partitioned into vertical strips. For each vertical strip, the histogram of horizontal runs is calculated. This technique assumes that the text lines appearing in a single strip are almost parallel to each other. Arivazhagan et al. [Arivazhagan 2007] partition the initial
image into vertical strips called chunks. The projection profile of every chunk is calculated. The
first candidate lines are extracted among the first chunks. These lines traverse around any
obstructing handwritten connected component by associating it to the text line above or below. This
decision is made by either (i) modeling the text lines as bivariate Gaussian densities and evaluating
the probability of the component for each Gaussian or (ii) the probability obtained from a distance
metric.
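A minimal sketch of the basic idea shared by this family of methods is given below: the binary page is divided into vertical strips, a horizontal projection is computed per strip and the valleys of the smoothed profile are taken as candidate line separators. The number of strips, the smoothing window and the valley criterion are illustrative choices, not those of [Bruzzone 1999] or [Arivazhagan 2007].

```python
import numpy as np

def strip_line_separators(binary, n_strips=4, smooth=15):
    """Return, for each vertical strip, the y-positions of projection valleys that
    can serve as candidate text line separators.
    `binary` is a 2D array with 1 for ink and 0 for background."""
    h, w = binary.shape
    separators = []
    for s in range(n_strips):
        strip = binary[:, s * w // n_strips:(s + 1) * w // n_strips]
        profile = strip.sum(axis=1).astype(float)            # horizontal projection
        kernel = np.ones(smooth) / smooth
        profile = np.convolve(profile, kernel, mode="same")   # smooth the profile
        # valleys: local minima of the smoothed profile with (near) zero ink
        valleys = [y for y in range(1, h - 1)
                   if profile[y] <= profile[y - 1] and profile[y] <= profile[y + 1]
                   and profile[y] < 0.1 * profile.max()]
        separators.append(valleys)
    return separators
```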
Methods that make use of the Hough transform include [Fletcher 1988], [Likforman 1995],
[Louloudis 2008] and [Pu 1998]. The Hough transform is a powerful tool used in many areas of
document analysis that is able to locate skewed lines of text. By starting from some points of the
initial image, the method extracts the lines that fit best to these points. The points considered in the
voting procedure of the Hough transform are usually either the gravity centers [Fletcher 1988],
[Likforman 1995], [Louloudis 2008] or minima points [Pu 1998] of the connected components.
In more detail, Likforman [Likforman 1995] developed a method based on a hypothesis – validation
scheme. Potential alignments are hypothesized in the Hough domain and validated in the image
domain. The centroids of the connected components are the units for the Hough transform. A set of
aligned units in the image along a line with parameters (ρ, θ) is included in the corresponding cell
(ρ, θ) of the Hough domain. Alignments including many units correspond to highly peaked cells of
the Hough domain.
A recent method using the Hough transform was proposed by Louloudis et al. [Louloudis 2008].
The main contributions of the approach are a) the partitioning of the connected component space into
three distinct spatial sub-domains (small, normal and large), of which only the normal connected
components are used in the Hough transform step, b) a block-based Hough transform step for the
detection of potential text lines and c) a post-processing step for the detection of text lines that the
Hough transform did not reveal, as well as for the separation of vertically connected parts of adjacent text lines.
The Hough transform can also be applied to fluctuating lines of handwritten drafts such as in [Pu
1998]. The Hough transform is first applied to minima points (units) in a vertical strip on the left of
the image. The alignments in the Hough domain are searched starting from a main direction, by
grouping cells in an exhaustive search in six directions. Then a moving window, associated with a
clustering scheme in the image domain, assigns the remaining units to alignments. The clustering
scheme (Natural Learning Algorithm) allows the creation of new lines starting in the middle of the
page.
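The hypothesis step shared by the Hough-based methods above can be sketched as follows: the centroids of the connected components vote in a (ρ, θ) accumulator and highly voted cells are kept as candidate text line alignments. The angle range, ρ resolution and vote threshold below are illustrative assumptions rather than the parameters of any of the cited methods.

```python
import numpy as np
from scipy import ndimage

def hough_line_hypotheses(binary, theta_deg=np.arange(80, 101), rho_res=5, min_votes=5):
    """Accumulate connected-component centroids in a (rho, theta) Hough space and
    return the highly voted cells as candidate text line alignments."""
    labels, n = ndimage.label(binary)
    centroids = ndimage.center_of_mass(binary, labels, range(1, n + 1))  # (y, x) pairs
    thetas = np.deg2rad(theta_deg)                        # near-horizontal alignments
    diag = int(np.hypot(*binary.shape))
    acc = np.zeros((2 * diag // rho_res + 1, len(thetas)), dtype=int)
    for y, x in centroids:
        rho = x * np.cos(thetas) + y * np.sin(thetas)     # one sinusoid per centroid
        idx = np.round((rho + diag) / rho_res).astype(int)
        acc[idx, np.arange(len(thetas))] += 1
    peaks = np.argwhere(acc >= min_votes)
    return [((r * rho_res) - diag, float(np.rad2deg(thetas[t]))) for r, t in peaks]
```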
Smearing methods mainly include the fuzzy RLSA [Shi 2004] and the adaptive RLSA [Makridis
2007]. The fuzzy RLSA measure is calculated for every pixel on the initial image and describes
“how far one can see when standing at a pixel along horizontal direction”. By applying this
measure, a new grayscale image is created, which is binarized and the lines of text are extracted
from the new image. The adaptive RLSA [Makridis 2007] is an extension of the classical RLSA, in
the sense that additional smoothing constraints are set in regard to the geometrical properties of
neighboring connected components. The replacement of background pixels with foreground pixels
is performed when these constraints are satisfied.
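Both the fuzzy and the adaptive variants build on the classical run-length smoothing idea, which can be sketched as follows: horizontal background runs shorter than a threshold between two foreground pixels are filled with foreground. The threshold value below is an illustrative assumption; the adaptive RLSA would instead derive such constraints from the geometry of the neighboring components.

```python
import numpy as np

def horizontal_rlsa(binary, max_gap=30):
    """Classical horizontal run-length smoothing: background runs shorter than
    `max_gap` pixels between two foreground pixels are filled with foreground."""
    out = binary.copy()
    for row in out:
        fg = np.flatnonzero(row)           # indices of foreground pixels in this row
        for a, b in zip(fg[:-1], fg[1:]):
            if 1 < b - a <= max_gap:       # short background run between two ink pixels
                row[a + 1:b] = 1
    return out
```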
Text line segmentation methods based on the dynamic programming principle were recently
presented [Nicolaou 2009], [Saabni 2014]. They try to segment text lines by finding an optimal path
on the background of the document image travelling from the left to the right edge. The approach of
Nicolaou et al. [Nicolaou 2009] is based on the topological assumption that for each text line, a path
exists from one side of the image to the other which traverses a single text line. The image is first
blurred and, at a second step, tracers are used to follow the white-most and black-most paths from
left to right as well as from right to left. The final goal is to shred the image into text line areas.
Saabni et al. [Saabni 2014] propose a method which computes an energy map of the text image and
determines the seams that pass across and between text lines. Two different algorithms are
described (one for binary and one for grayscale images). In the first algorithm (binary case), each
seam passes through the middle of a text line, along its length, and marks the components that make
up its letters and words. At a final step, the unmarked components are assigned to the closest text
line. In the second algorithm (grayscale case), the seams are calculated on the distance transform of
the grayscale image.
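The common dynamic-programming core of these methods can be illustrated by the following sketch, which finds a left-to-right path of minimum accumulated ink on a blurred version of the image; such paths tend to run between text lines. The blurring and the simple ink-based energy are assumptions of this sketch and not the exact energy maps of [Nicolaou 2009] or [Saabni 2014].

```python
import numpy as np
from scipy import ndimage

def min_ink_path(gray):
    """Dynamic programming: cheapest left-to-right path through a blurred ink map;
    such paths tend to run along the background between two text lines.
    Returns one row index per column."""
    ink = ndimage.gaussian_filter(255.0 - gray.astype(float), sigma=5)  # dark pixels are costly
    h, w = ink.shape
    cost = np.empty((h, w))
    back = np.zeros((h, w), dtype=int)
    cost[:, 0] = ink[:, 0]
    for x in range(1, w):
        prev = cost[:, x - 1]
        up = np.r_[np.inf, prev[:-1]]        # predecessor one row above
        same = prev                           # predecessor in the same row
        down = np.r_[prev[1:], np.inf]        # predecessor one row below
        choices = np.vstack([up, same, down])
        best = choices.argmin(axis=0)
        cost[:, x] = ink[:, x] + choices[best, np.arange(h)]
        back[:, x] = best - 1                 # row offset (-1, 0, +1) of the chosen predecessor
    path = np.empty(w, dtype=int)
    path[-1] = int(cost[:, -1].argmin())      # backtrack from the cheapest end point
    for x in range(w - 1, 0, -1):
        path[x - 1] = path[x] + back[path[x], x]
    return path
```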
Other methodologies include the work of Nicolas et al. [Nicolas 2004]. In this work, the text line
extraction problem is seen from an Artificial Intelligence perspective. The aim is to cluster the
connected components of the document into homogeneous sets that correspond to the text lines of
the document. To solve this problem, a search over the graph that is defined by the connected
components as vertices and the distances among them as edges is applied. A recent paper [Shi
2005] makes use of the Adaptive Local Connectivity Map: the input to the method is a grayscale
image, and a new image is calculated by summing the intensities of each pixel’s neighbors in the
horizontal direction. Since the new image is also a grayscale image, a thresholding technique is
applied and the connected components are grouped into location maps by using a grouping method.
In [Kennard 2006], the method to segment text lines uses the count of foreground/background
transitions in a binarized image to determine areas of the document that are likely to be text lines.
Also, a min-cut/max-flow graph cut algorithm is used to split up text areas that appear to encompass
more than one line of text. A merging of text lines containing relatively little text information to
nearby text lines is then applied. Yi Li [Li 2008] describes a technique that models text line
detection as an image segmentation problem by enhancing text line structures using a Gaussian
window and adopting the level set method to evolve text line boundaries. The method described in
[Lemaitre 2006] is based on a notion of perceptive vision: at a certain distance, text lines can be
seen as line segments. This method is based on the theory of Kalman filtering to detect text lines on
low resolution images. Weliwitage et al. [Weliwitage 2005] describe a method that involves cut text
minimization for segmentation of text lines from handwritten English documents. In doing so, an
optimization technique is applied which varies the cutting angle and start location to minimize the
text pixels cut while tracking between two text lines. In [Basu 2007], a text line extraction technique
is presented for multi-skewed documents of handwritten English or Bengali text. It assumes a
hypothetical water flow from both the left and right sides of the image frame, which faces obstruction
from the characters of the text lines. The strips of areas left unwetted on the image frame are finally
labeled for the extraction of text lines. In [Zahour 2007], a text line segmentation method for printed or
handwritten historical Arabic documents is presented. Documents are first classified into two classes
using a K-means scheme. These classes correspond to document complexity (easy or not easy to segment). A
document which includes overlapping and touching characters is divided into vertical strips. The
extracted text blocks obtained by the horizontal projection are classified into three categories: small,
average and large text blocks. After segmenting the large text blocks, the lines are obtained by
matching adjacent blocks within two successive strips using spatial relationships. Documents
without overlapping or touching characters are segmented without resorting to the large-block
segmentation module. The authors claim 96% accuracy on a collection of 100 historical
documents. Yin [Yin 2007] proposes an approach which is based on minimum spanning tree (MST)
clustering with new distance measures. First, the connected components of the document image are
grouped into a tree by MST clustering with a new distance measure. The edges of the tree are then
dynamically cut to form text lines by using a new objective function for finding the number of
clusters. This approach is totally parameter-free and can apply to various documents with multi-
skewed and curved lines. Stamatopoulos et al. [Stamatopoulos 2009] present a combination method
of different segmentation techniques. The goal is to exploit the segmentation results of
complementary techniques and specific features of the initial image so as to generate improved
segmentation results. Roy et al. [Roy 2008] propose a technique that is based on morphological
operations and run-length smearing. At first, RLSA is applied to get an individual word as a
component. Next, the foreground portion of this smoothed image is eroded to get some seed
components from the individual words of the document. Erosion is also done on background
portions to find some boundary information of text lines. Finally, using the positional information of
the seed components and the boundary information, the lines are segmented. Finally, Du et al. [Du
2008] propose a method which is based on the Mumford–Shah model. The algorithm is claimed to
be script independent. In addition, morphing is used to remove overlaps between neighboring text
lines and connect broken ones. For a more detailed description of text line segmentation
methodologies, the interested reader is referred to [Likforman 2007].
2.2.3 Word Detection
Word segmentation refers to the process of defining the word regions of a text line. Algorithms
dealing with word segmentation in the literature are based primarily on an analysis of the geometric
relationships of adjacent components. Components are either connected components or overlapped
components. An overlapped component is defined as a set of connected components whose
projection profiles overlap in the vertical direction. Related work for the problem of word
segmentation differs in two aspects. The first aspect is the way the distance of adjacent components
is calculated, while the second aspect concerns the approach used to classify the previously
calculated distances as either between-word gaps or within-word gaps. Most of the methodologies
described in the literature have a preprocessing stage which includes noise removal, skew and slant
correction.
Many distance metrics are defined in the literature. Seni et al. [Seni 1994] presented eight different
distance metrics. These include the bounding box distance, the minimum and average run-length
distance, the Euclidean distance and different combinations of them which depend on several
heuristics. A thorough evaluation of the proposed metrics was described (see examples of the
metrics in Figure 2.2.3.1).
Figure 2.2.3.1: Example of types of distance metrics between a pair of connected components. (a)
original image (b) bounding boxes; the horizontal bounding box distance is the horizontal distance
between the boxes (c) Euclidean distance; the arrow indicates the distance between the closest
points in each component. (d) run-length distance; arrows show some run-lengths between the
components. Different distance algorithms could use the average or minimum of the run-lengths.
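Two of the simpler metrics of [Seni 1994] can be illustrated with the following sketch, which computes the horizontal bounding box distance and the Euclidean distance between the closest points of two adjacent components (both given as arrays of pixel coordinates); the run-length and combined metrics are omitted.

```python
import numpy as np

def bounding_box_distance(comp_a, comp_b):
    """Horizontal gap between the bounding boxes of two components.
    Each component is an (N, 2) array of (y, x) foreground pixel coordinates;
    comp_a is assumed to lie to the left of comp_b."""
    return max(0, comp_b[:, 1].min() - comp_a[:, 1].max())

def euclidean_distance(comp_a, comp_b):
    """Distance between the closest pair of pixels of the two components."""
    diff = comp_a[:, None, :] - comp_b[None, :, :]        # pairwise coordinate differences
    return float(np.sqrt((diff ** 2).sum(axis=2)).min())
```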
A different distance metric, called the convex hull-based metric, was defined by Mahadevan
[Mahadevan 1995] (see Figure 2.2.3.2). The author, after comparing this metric with some of the
metrics of [Seni 1994], concludes that the convex hull-based metric performs better than the other
ones. Kim et al. [Kim 2001] investigated the problem of word segmentation in handwritten Korean
text lines. To this end, they used three well-known metrics in their experiments: the bounding box
distance, the run-length/Euclidean distance and the convex hull-based distance. For the
classification of the distances, the authors considered three clustering techniques: the average
linkage method, the modified Max method and the sequential clustering. Their experimental results
showed that the best performance was obtained by the sequential clustering technique using all
three gap metrics. Varga and Bunke [Varga 2005] tried to extend classical word extraction
techniques by incorporating a tree structure. Since classical word segmentation techniques depend
solely on a single threshold value, they tried to improve the existing approach by letting the decision
about a gap be taken not only in terms of a threshold, but also in terms of its context, i.e.
considering the relative sizes of the surrounding gaps. Experiments conducted with different gap
metrics as well as threshold types showed that their methodology yielded improvements over
conventional word extraction methods.
Figure 2.2.3.2: Three examples demonstrating the convex-hull based distance metric.
In all the aforementioned methodologies, the gap classification threshold used is derived: (i) from the
processing of the calculated distances, (ii) from the processing of the whole text line image or (iii)
after the application of a clustering technique over the estimated distances. There also exist
methodologies in the literature that make use of classifiers for the final decision of whether a gap is
a between-word gap or a within-word gap [Kim 1998], [Huang 2008], [Luthy 2007]. An early
published work making use of classifiers for the word segmentation problem is the work of Kim
and Govindaraju [Kim 1998]. A similar approach was presented in [Huang 2008] by Huang and
Srihari. This approach claimed two differences from previous methods: (i) the gap metric was
computed by combining three different distance measures, which avoided the weaknesses of the
individual measures and thus provided a more reliable distance measure, and (ii) besides the local
features, such as the current gap, a new set of global features were also extracted to help the
classifier make a better decision. Finally, a different approach was presented by Luthy et al. [Luthy
2007]. The problem of segmenting a text line into words was considered as a text line recognition
task, adapted to the characteristics of segmentation. That is, at a certain position of a text line, it had
to be decided whether the considered position belonged to a letter of a word, or to a space between
two words. For this purpose, three different recognizers based on Hidden Markov Models were
designed, and results from writer-dependent as well as writer-independent experiments were
reported.
An analytic evaluation of both steps of the word segmentation stage, namely the distance
calculation and the gap classification step is presented in the work of Louloudis et al. [Louloudis
2009a]. According to the proposed evaluation methodology the distance computation and the gap
classification stages are treated independently. The detection rate calculated for every distance
metric corresponds to the maximum detection rate which could have been achieved if the perfect
classifier for the gap classification stage existed. The proposed evaluation framework was applied to
several state-of-the-art techniques using a handwritten as well as a historical typewritten document
set, and the best combination of distance metric computation and gap classification state-of-the-art
techniques was proposed. According to this evaluation method, the combination of the Euclidean
distance metric and the convex hull metric produced the best result. For the gap classification stage,
the Gaussian mixture modelling together with the sequential clustering algorithm had the best
performance.
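As an illustration of the gap classification stage, the following sketch fits a two-component Gaussian mixture to the 1-D gap distances of a text line and labels the component with the larger mean as between-word gaps. It uses scikit-learn's GaussianMixture purely for convenience; the cited works do not prescribe a particular implementation, and the sequential clustering step is not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def classify_gaps(distances):
    """Fit a 2-component Gaussian mixture to the 1-D gap distances and label each
    gap as a between-word gap (True) or within-word gap (False)."""
    d = np.asarray(distances, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(d)
    labels = gmm.predict(d)
    between_word = int(np.argmax(gmm.means_.ravel()))   # component with the larger mean
    return labels == between_word

# e.g. classify_gaps([3, 2, 14, 4, 18, 3]) labels the 14 and 18 pixel gaps as between-word gaps
```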
2.3 Handwritten Text Recognition
Handwritten text recognition (HTR) can be defined as the ability of a computer to transform
handwritten input represented in its spatial form of graphical marks into an equivalent symbolic
representation as ASCII text. Usually, this handwritten input comes from sources such as paper
documents, photographs or electronic pens and touch-screens.
HTR is a relatively new field of computer vision. Optical character recognition (OCR) had its
beginnings in 1951 with the optical reader invented by David H. Shepard. This reader,
known as GISMO, was able to read typewritten text, Morse code and musical notes [Shepard 1953].
HTR first appeared in the late 1960s in restricted applications with very limited vocabularies, such
as the recognition of postal addresses on envelopes or the recognition of bank cheques. The computer
capacity and the technology available in those days were not sufficient for unconstrained handwritten
documents (such as old manuscripts and/or unconstrained, spontaneous text). Even the recognition
of printed text was only adequate for simplified fonts, for example the fonts that have been used on
credit cards ever since. However, the increase in computer capacity, the development of adequate
technologies and the necessity of automatically processing large quantities of handwritten documents
have turned handwritten text recognition into a focus of attention for both industry and the
research community, leading to a significant improvement in the accuracy of these systems.
According to [Bertolami 2008], [Bunke 1995], nowadays, OCR systems are capable of accurately
recognizing typewriting in several fonts. However, high recognition accuracy is still difficult to
achieve in continuous handwritten text. The reason for this lies in the high variability that comes with
handwriting, such as individual writing styles, the large number of different word classes and the
difficulty of the word segmentation problem.
Current research in HTR mainly follows two different basic approaches which will be discussed in
the next subsections. The first approach recognizes isolated characters or digits. Therefore, when
dealing with non-isolated characters, for example in words or running text, the data has to be segmented
beforehand. This method originates from the beginnings of HTR, when the input images contained
only one single symbol. The second approach performs continuous recognition, meaning that the
beginning and ending of each character do not need to be known beforehand; they are hypothesized
as a byproduct of the recognition process itself.
2.3.1 Optical Character Recognition
Optical character recognition, usually abbreviated to OCR, is the mechanical translation of images
of handwritten, typewritten or printed text (letters, numbers or symbols) into machine-editable text.
A historical review of OCR can be found in [Impedovo 1991], [Mori 1992]. OCR is mainly divided
into two areas: printed character or word recognition, and handwritten character or word recognition.
Considerable improvement in the recognition of printed text has been achieved to date, mainly due to the
fact that it is easy to segment the text into characters. This improvement has made possible the
development of OCR systems with very good accuracy, around 99% or higher at the character
level for modern printed material. However, given the kind of text image documents involved in
this project, OCR products, or even the most advanced OCR research prototypes [Ratzlaff 2003],
are very far from offering useful solutions to the transcription problem. They are simply not usable
since, in the vast majority of the handwritten text images of interest, characters can by no means be
isolated automatically.
In the OCR field, recognition is performed using one or several classifiers. A naive approach to the
recognition of an image containing a single symbol is the nearest neighbour (NN) or k-NN
approach. More sophisticated algorithms which achieve better results are artificial neural networks
(ANNs) and support vector machines (SVMs) [Lee 1996], [Srihari 1993]. Moreover, some work has
been carried out using tangent vectors and local representations [Keysers 1999] or Bernoulli
mixture models [Khoury 2010], [Juan 2001], [Romero 2007] with good results. In [Fenrich 1992],
[Nagy 2000] an overview of the current state of the art in OCR is given and the different
applications of this technology are discussed.
2.3.2 Handwritten Text Recognition
HTR can be considered a relatively new subfield of pattern recognition. An overview of the
state of the art can be found in [Kim 1999], [Plamondon 2000], [Vinciarelli 2002]. In [Bunke 2003]
one can find a detailed description of previous work on HTR, the current state of the art, and the
most likely paths for further progress in the near future.
Depending on the way the initial data is acquired, automatic handwriting recognition systems can
be classified into on-line and off-line. In off-line systems the handwriting is given as an image or
scanned text, without time sequence information. In on-line systems the handwriting is given as a
temporal sequence of coordinates that represents the pen tip trajectory.
Some off-line handwriting transcription products rely on technology for isolated character
recognition (OCR) developed in the last decade. First, the input image is segmented into characters
or pieces of a character. Thereafter, isolated character recognition is performed on these segments.
In [Bozinovic 1989], [Kavallieratou 2002], [Senior 1998] this approach is followed, using
segmentation techniques based on dynamic programming. As was previously explained, the
difficulty of this approach is the segmentation step. In fact, the difficulty of character segmentation
for some handwritten documents makes it impossible to apply this approach to this kind of
documents.
The required technology should be able to recognize all text elements (sentences, words and
characters) as a whole, without any prior segmentation of the image into these elements. This
technology is generally referred to as segmentation-free “off-line Handwritten Text Recognition”
(HTR).
Several approaches have been proposed in the literature for HTR that resemble the noisy channel
approach that is currently used in Automatic Speech Recognition. Thus, HTR systems are based
on Hidden Markov Models (HMMs) [Bazzi 1999], [Plamondon 2000], [Marti 2001], [Koerich
2003], [Toselli 2004], [Romero 2007] or hybrid HMM and Artificial Neural Networks [Boquera
2011], [Graves 2009].
However, the results obtained with these systems are still very far from offering usable accuracy.
They have proven to be suited for restricted applications with very limited vocabulary or
constrained handwriting, achieving relatively high recognition rates in this kind of task. However,
for (high quality, modern) unrestricted text images, current fully automated HTR state-of-the-art
(research) prototypes provide accuracy levels that range from 30 to 80% at the word level. This fact
has led us to propose the development of computer-assisted solutions based on novel approaches
studied in WP5.
2.4 Keyword Spotting
2.4.1 The query-by-example approaches
Nowadays, many digitized ancient manuscripts are not exploited due to lack of proper browsing and
indexing tools. These documents are characterized by complicated layouts and a variety of writing
styles, while many of them are degraded. The strategies that are applied for OCR of modern printed
documents are not suitable as they are sensitive to noise, character variation and text layout
complexity. Accordingly, up to this day, there is not a single system available to analyze, index and
search for words in a collection of ancient handwritten manuscripts in a satisfactory way.
A valid strategy to deal with this kind of unindexed documents is a word matching procedure that
relies on low-level pattern matching, called word spotting [Manmatha 1996]. It can be defined as
the task of identifying locations on a document image which have a high probability of containing an
instance of a queried word, without explicitly recognizing it. The first approaches attempted to
locate the words inside the document using advanced procedures for binarization, pre-processing
and text layout analysis. Then, by analyzing each segmented word, feature vectors called word
descriptors are extracted from it. Based on these descriptors, a distance measure is used to quantify the
similarity between the query word and the segmented template word. These systems are called
segmentation-based word spotting systems. Although there is an abundance of systems suitable for
both modern [Hassan 2012], [Zagoris 2010] and historical [Kesidis 2011], [Konidaris 2008], [Zirari
2013], [Khurshid 2012] printed material, very few of these systems are applicable to historical
handwriting, especially if the document database contains material from different writers.
Rath and Manmatha [Rath 2003b], [Lavrenko 2004], [Rath 2003a] initially de-slant, de-skew and
normalize the segmented word. Afterwards, they extract two families of features from it. On
the one hand, there are scalar features that include aspect ratio, area, etc. On the other
hand, there are profile-based features that are based on the horizontal and vertical word projections
and the upper and lower word profiles. The latter features are encoded by the Discrete
Fourier Transform. The feature comparison algorithm is dynamic time warping (DTW), as the
features are of variable length.
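The DTW comparison of such variable-length feature sequences can be sketched as follows; the extraction of the projection and word profile features themselves is omitted, and the length normalization at the end is an assumption of this sketch rather than the exact normalization of [Rath 2003b].

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two variable-length feature sequences
    given as numpy arrays of shapes (n, d) and (m, d)."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)        # normalize by an upper bound on the path length
```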
Zagoris et al. [Zagoris 2011] created a similar set of profile-based features, encoded in a different
way through the Discrete Cosine Transform, normalization by the first coefficients and quantization
with the Gustafson–Kessel [Kessel 1978] fuzzy algorithm. The result was a very short, fixed-length
descriptor, which has been tested on a Greek handwriting database from different writers, the
Washington words database and the MPEG-7 CE1 Set B database.
Llados et al. [Llados 2012] evaluate the performance of various word descriptors, including a bag
of visual words procedure (BoVW), a pseudo-structural representation based on Loci Features, a
structural approach by using words as graphs, and sequences of column features based on DTW.
They tested them on the George Washington database and the marriage records from the Barcelona
Cathedral (ESPOSALLES database). They found that the statistical approach of the BoVW
produces the best results, although the memory requirements to store the descriptors are significant.
Rodríguez and Perronnin [Rodriguez 2008] extract features from a sliding window, based on the
first-order gradient and inspired by the SIFT keypoint descriptor. The evaluation scenarios make use of
two different matching paradigms: DTW and Hidden Markov Models (HMM). They run tests on 630
real scanned letters (written in French) containing different writing styles, artifacts, spelling
mistakes and other types of noise.
Finally, Srihari et al. [Srihari 2005] present a system for searching handwritten Arabic documents
based on a set of binary shape features suitable for Arabic script along with a correlation distance
that performed best for matching those features.
More recently, new techniques have emerged which bypass the procedure of locating and segmenting
the words from the document and instead try to locate and match them directly on the document
image. These techniques stemmed from the inability of state-of-the-art segmentation methods to
produce meaningful results on historical handwritten documents.
Leydier et al. [Leydier 2007] detect zones on the images that represent the most informative parts
based on the gradient orientation calculated by the convolution of the image with the first and
second derivatives of the Gaussian kernel. Their matching procedure is based on elastic naive
matching. In order to create guides to the matching, they use morphological operations; an opening
with a heuristic structuring element based on the document type. The evaluation experiments are
conducted on medieval manuscripts of Latin and Semitic alphabets. They found that their algorithm
is not suitable for short words (fewer than four characters), since the shorter the word, the less
information becomes available for comparison. In [Leydier 2009], they introduced an improved
version of their initial segmentation-free approach. The main updated parts are their matching
engine and their guide detection, which is now based on the second derivative along the isophote
curves. Although it is an improvement, the way they extract and match the features is very
sensitive to writing variations and to matching words of different font sizes. Moreover,
their matching engine is very costly, making it impractical for large databases.
Gatos and Pratikakis [Gatos 2009] calculate block-based document image descriptors that are used
as a template. They create scaled and rotated versions of the query: the versions are scaled by
factors ranging from 0.8 to 1.2 with a step of 0.1 and are rotated in the range from -1° to 1° with a step
of 1°, producing a total of 15 different word instances. For each word instance, they calculate 5
different sets of feature vectors, based on computing the pixel density of all 5x5 non-overlapping
windows and then applying various word image translations. The 15 different queries create a lot of
ambiguity in the final merging stage, especially when there are many writing variations.
Rusinyol et al. [Rusinyol 2011] present a patch-based framework. The patches are represented by a
BoVW model, in which the local features are SIFT descriptors. They then apply the Spatial
Pyramid Matching method to their BoVW model to compensate for the lack of spatial information.
After normalizing the BoVW descriptor with a tf-idf model, they apply a latent semantic
indexing technique. They report experiments on the George Washington and Lord Byron datasets.
Apart from the costly, database-dependent operation of creating a codebook for the BoVW model,
their method depends on the query font size, thus making it impractical for matching words of
different size.
It is worth noting that both aforementioned strategies – segmentation-based and segmentation-free –
are valid under some circumstances. The segmentation-based procedure performs better, when the
layout is simple enough to correctly segment the words. By contrast, the segmentation-free
algorithms perform better in the case that there is considerable degradation on the document.
Unfortunately, both strategies are incompatible with each other, as they often use different pipelines,
features and matching procedures.
Finally, another important point to be considered is that the features for word spotting which rely
only on word shape characteristics are not effective in dealing with a document collection created
by different writers, containing significant writing style variations. Although slant and skew
preprocessing techniques can reduce the shape variations, they cannot eliminate the problem, as the
overall structure of the word often differs. We therefore argue that although the shape
information is meaningful, texture information in a spatial context is more reliable in such cases.
Prompted by the above considerations, the proposed segmentation-based and segmentation-free
techniques employ the same base features, which are called Document-Specific Local Information
(DSLI), and belong to the family of local descriptors. The only difference is that the matching
procedure in the segmentation-free scenario is an extension of the one used in the segmentation-
based approach. Moreover, the proposed features - although they use the spatial information of the
current point’s location - are based on texture information.
2.4.2 The query-by-string approaches
The idea of using word-specific and filler or “garbage” models was proposed a long time ago for
keyword spotting (KWS) in the field of automatic speech recognition, and it has been in use for many
years in that field - see for example [Lleida 1993], [Keshet 2009], [Szke 2005]. More recently, the
same basic idea has also been used for KWS in handwriting images [Fischer 2010], [Rodriguez-
Serrano 2009], [Thomas 2010]. In this approach, character hidden Markov models (HMMs) are
employed to build both the filler model and a word-specific model for each query word. This idea is
often referred to as “HMM-Filler”. One of its attractive features is that it is “lexicon-free”; that is, it
does not need any predefined lexicon or set of candidate query words. In this research we follow
the work presented in [Fischer 2010], where HMM-Filler is used for line-oriented KWS. In this
setting, whole text line images, without any kind of segmentation into words or characters, are
analyzed to determine the degree of confidence that a given keyword appears in the image.
HMM-Filler is often proposed as a KWS method for keyword searching “on-the-fly”. However, the
very large computing time of word-specific Viterbi decoding, for all the line images and all the
query words to be spotted, renders this idea unfeasible in practice, even for small collections of
handwritten document images. Nevertheless, if large “off-line” computing cost can be afforded,
HMM-Filler scores can still be used for indexing purposes.
Both the classical HMM-Filler and the approach here proposed rely on segmentation-free, HMM
character recognition. The basics of HTR based on HMMs were originally presented in [Bazzi
1999] and further developed in [Vinciarelli 2004], [Toselli 2013] among others. Recognizers of this
kind accept a given handwritten text line image, represented as a sequence of D-dimensional feature
vectors x = x1 x2 ... xn as input and find a most likely character sequence ĉ = ĉ1 ĉ2 ... ĉl:
ĉ = argmax_c P(c|x) = argmax_c p(x|c)·P(c) (2.4.2.1)
The score associated with ĉ (the so-called “Viterbi score”) is:
S(x) = max_c log( p(x|c)·P(c) ) (2.4.2.2)
The conditional density p(x|c) is approximated by previously trained morphological character
HMMs, while the prior P(c) is given by a character-level “language” model.
Eqs. (2.4.2.1)-(2.4.2.2) are commonly solved by means of Viterbi decoding [Jelinek 1998]. As a
byproduct, a huge set of most likely recognized character sequences can be obtained and
represented in the form of a character lattice (CL). These lattices are better known in the literature
as “word-graphs” [Wessel 2001].
In this approach, as presented in [Fischer 2010], character HMMs are used to build both a “filler”
model, F, and a keyword-specific model, Kυ, for each individual keyword υ to be spotted. F is built
by arranging all the trained character HMMs in parallel (with equiprobable priors P(c) in Eqs.
(2.4.2.1)-(2.4.2.2)) with a loop-back. This model accounts for any unrestricted sequence of characters
(see Fig. 2.4.2.1a). On the other hand, Kυ is constructed to model the exact character sequence of υ,
surrounded by the space character and the same unrestricted character sequences modeled by F (see
Fig. 2.4.2.1b).
Figure 2.4.2.1: (a) “Filler HMM” (F) and (b) “keyword HMM” (Kυ) built for the keyword υ=“bore”.
In a preprocessing phase, F can be used once and for all to compute the Viterbi decoding score Sf(x)
(Eq. (2.4.2.2)) for each line image x. Similarly, in the searching phase, for each
keyword υ to be spotted and for each line image x, the Viterbi score sυ(x) is computed using the
keyword-specific model Kυ. A spotting score S(υ,x) is then defined as:
S(υ,x) = ( sυ(x) − Sf(x) ) / Lυ (2.4.2.3)
where Lυ is the length of υ, in number of frames, between the detected word borders. If the spotted
word υ is actually in the line image x, a matching is expected with high (negative) values of S(υ,x),
upper bounded by S(υ,x)=0.
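Given the two Viterbi log-scores, the spotting score of Eq. (2.4.2.3), as reconstructed above, is simply their length-normalized difference. The following minimal sketch assumes the scores have already been computed by the filler and keyword decodings; the decodings themselves are not shown and the numeric values in the usage comment are hypothetical.

```python
def spotting_score(keyword_logscore, filler_logscore, n_frames):
    """Length-normalized log-likelihood ratio of Eq. (2.4.2.3).
    Values are <= 0; the closer to 0, the more likely the keyword is present."""
    return (keyword_logscore - filler_logscore) / n_frames

# e.g. a line where the keyword model scores almost as well as the filler model:
# spotting_score(-1210.0, -1205.5, 90) -> -0.05 (a likely match)
```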
To analyze the computational cost of this approach, we consider separately that of the character
filler model F and that of the keyword model Kυ. The computational complexity of the
preprocessing phase is that of the Viterbi algorithm [Jelinek 1998]; that is O(N·n·KF), where N is
the number of line images in the collection, n is the length of x, and KF depends on the total number
of states in all the character HMMs in F and on the square of the number of character models in F.
Similarly, taking into account the topology of Kυ, the search phase complexity is O(M·N·n·Kυ′),
where Kυ′ is the corresponding factor for the keyword model and M is the number of spotted
keywords. It is important to remark that while the F decoding runs only once per text line image,
the Kυ decoding runs M times per text line image (once for each keyword).
3. Developed Methods
3.1 Image Pre-processing
3.1.1 Binarization
The developed binarization method is based on the method of Ntirogiannis et al. [Ntirogiannis
2014] which was developed specifically for handwritten document images. The aforementioned
method has several steps, such as background estimation and image normalization, combination of
the global Otsu [Otsu 1979] and local Niblack [Niblack 1986] method, contrast calculation, as well
as intermediate processing to remove small noisy connected components. The performance of the
initial binarization method was very high, outperforming in most cases current state-of-the-art
methods. However, the evaluation was performed on the handwritten images of the H/DIBCO
datasets from 2009-2011, which did not contain cases from the TranScriptorium project. Therefore,
to test the effectiveness of this method on images from TranScriptorium, we used the DIBCO'13
dataset (Pratikakis et al. [Pratikakis 2013]), which contains portions of 3 images from the
TranScriptorium project. The initial method was ranked second (F-Measure=94.32) with a very
small difference from the first (F-Measure=94.43) concerning those TranScriptorium images.
After extensive experimentation on the TranScriptorium images, we observed that the binarization
method of Ntirogiannis et al. [Ntirogiannis 2014] could miss textual information in an attempt to
clear the background from noisy components or bleed-through (Fig. 3.1.1.1(a)-(b)). Under those
circumstances, some modifications were introduced to preserve textual information (faint characters
or low contrast characters) (Fig. 3.1.1.1(c)). The main adjustments made in the final developed
method concern the Otsu threshold selection, the incorporation of a third binarization method
(Gatos et al. [Gatos 2006]) into the combination stage and the noisy connected components removal
stage. Secondary adjustments concern the speed enhancement as well as the handling of big black
borders, which the initial method could not manage. In the following, the final developed
binarization method is detailed.
(a)
(b)
(c)
Figure 3.1.1.1: Initial and modified binarization results: (a) Original images from the
TranScriptorium project; (b) Ntirogiannis et al. [Ntirogiannis 2014] binarization; (c) developed
binarization based on Ntirogiannis et al. [Ntirogiannis 2014].
Background Estimation and Image Normalization
Background estimation is performed using inpainting. The inpainting mask is considered the
foreground of two binarization methods, namely Niblack [Niblack 1986] and Otsu [Otsu 1979]. To
achieve the inpainting result, five passes of the image are required. The first four image passes are
performed following the LRTB, LRBT, RLTB and RLBT directions, where L, R, T and B refer to
Left, Right, Top and Bottom, respectively. In each pass i, we use as input the original image and get
as output an inpainting image Pi(x,y). In particular, for each "mask" pixel, we calculate the average
of the "non-mask" pixels in the 4-connected neighborhood (cross-type) and consider that pixel as a
"non-mask pixel" for consecutive computations of that pass. At the final (fifth) image pass, for any
direction chosen, we take into account all four images of the four previous inpainting passes (Pi(x,
y), i=1,...,4) and for each pixel we keep the minimum (darkest) intensity. Finally, the original image
is divided by the estimated background image. The values of the new image are stretched in the
range of the original grayscale image (Fig. 3.1.1.2b). The impact on binarization of the
aforementioned background smoothing procedure is shown in Fig. 3.1.1.2c-d.
(a)
(b)
(c)
(d)
Figure 3.1.1.2: Background estimation and image normalization: (a) Original grayscale image
I(x,y); (b) original image after background estimation and normalization N(x,y); (c)-(d) global Otsu
[Otsu 1979] of (a) and (b) respectively.
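A minimal sketch of the described inpainting procedure is given below: one directional pass scans the image and replaces every mask pixel by the average of its already known 4-connected neighbours, and the four directional passes are combined by a per-pixel minimum. The handling of mask pixels with no known neighbour is simplified in this sketch.

```python
import numpy as np

def inpaint_pass(gray, mask, flip_x=False, flip_y=False):
    """One directional inpainting pass: scan the image in the given direction and
    replace every "mask" pixel by the average of its already known 4-connected
    neighbours, then treat it as known for the rest of the pass.
    `mask` is a boolean array, True where pixels belong to the inpainting mask."""
    img = gray.astype(float).copy()
    known = ~mask.astype(bool)
    ys = range(img.shape[0])[::-1] if flip_y else range(img.shape[0])
    xs = range(img.shape[1])[::-1] if flip_x else range(img.shape[1])
    for y in ys:
        for x in xs:
            if known[y, x]:
                continue
            vals = [img[ny, nx] for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= ny < img.shape[0] and 0 <= nx < img.shape[1] and known[ny, nx]]
            if vals:
                img[y, x] = np.mean(vals)
                known[y, x] = True
    return img

def estimate_background(gray, mask):
    """Combine the four directional passes (LRTB, LRBT, RLTB, RLBT) by keeping,
    for every pixel, the minimum (darkest) intensity."""
    passes = [inpaint_pass(gray, mask, fx, fy) for fx in (False, True) for fy in (False, True)]
    return np.minimum.reduce(passes)
```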
First Combination and Small Components Removal
Given that the original image has been normalized using background estimation, we can efficiently
apply the global binarization method of Otsu to the normalized image N(x,y). In this way, we detect
the main text of the image, from which we can calculate the image contrast that is required in
consecutive steps. However, it is important to remove small noisy components to minimize the
noise contribution in our calculations and to reduce false combinations when Otsu [Otsu 1979]
is combined at the connected component level with the Niblack [Niblack 1986] binarization result.
In order to further enhance the text detection of this step, we also use the binarization method of
Gatos et al. [Gatos 2006]. This method uses Wiener filtering as a pre-processing step and hence the
original image is given as input. The image normalization using the background estimation and the
Wiener filtering has similar effect, considering that both methods result in an image of smoothed
background. In addition, before binarization we apply a 3x3 Gaussian filter while after binarization
we apply a morphological closing operation in order to remove small noisy dots.
We consider that the aforementioned noise consists of several components corresponding to a small
part of the detected foreground. Hence, the ratio between the total number of pixels that correspond
to components of a small height and the number of those components is expected to be very low. In
particular, we consider O(x,y) to be the combination of the Otsu [Otsu 1979] and Gatos et al. [Gatos 2006] binarization
result with "1" corresponding to foreground and "0" to background. Furthermore, let Oi(x,y) be the
i-th connected component of O(x,y) with i = 1,...,NCO, where NCO is the number of the connected
components of O(x,y) and Oi(x, y) equals to "1" only at foreground pixels of the i-th connected
component and "0" elsewhere. Then, we discard all connected components smaller than half of the
height h, as determined by the criterion specified by Eq. (3.1.1.1).
Σj=1..h ( RPj / RCj ) ≥ 1 (3.1.1.1)
where h denotes the minimum connected component height for which the above relation Eq.
(3.1.1.1) is satisfied,
RPj = ( Σx Σy Oj(x,y) ) / ( Σx Σy O(x,y) ) denotes the ratio between the number of foreground pixels of
the connected components of O(x,y) of height j (i.e. Oj(x,y)) and the total number of foreground pixels
of O(x,y), and
RCj = NCj / NCO denotes the ratio between the number NCj of connected components of height j and
the total number NCO of connected components of O(x,y).
Based upon the aforementioned post-processing criterion, the first combination image OP(x,y) after
post-processing can be defined by Eq. (3.1.1.2) as follows:
OP(x,y) = ∪i Oi(x,y), ∀i: H(i) ≥ h/2, i = 1,...,NCO (3.1.1.2)
where h is given by Eq. (3.1.1.1) and H(i) denotes the height of the i-th connected component Oi(x,y).
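A sketch of this removal step, following the reconstruction of Eqs. (3.1.1.1)-(3.1.1.2) given above, is shown below; the accumulation of RPj/RCj until it reaches 1 and the H(i) ≥ h/2 rule are taken from that reconstruction and should be checked against [Ntirogiannis 2014].

```python
import numpy as np
from scipy import ndimage

def remove_small_components(O):
    """Discard connected components of the binary image O whose height is below h/2,
    with h given by the criterion of Eq. (3.1.1.1) as reconstructed in the text."""
    labels, n = ndimage.label(O)
    if n == 0:
        return O.copy()
    slices = ndimage.find_objects(labels)
    heights = np.array([s[0].stop - s[0].start for s in slices])     # per-component height
    sizes = np.asarray(ndimage.sum(O, labels, range(1, n + 1)))      # per-component pixel count
    total_px, total_cc = float(O.sum()), float(n)
    h = heights.max()
    acc = 0.0
    for j in range(1, heights.max() + 1):
        of_height_j = heights == j
        if not of_height_j.any():
            continue
        RP_j = sizes[of_height_j].sum() / total_px      # pixel ratio of height-j components
        RC_j = of_height_j.sum() / total_cc             # count ratio of height-j components
        acc += RP_j / RC_j
        if acc >= 1.0:
            h = j                                       # minimum height satisfying Eq. (3.1.1.1)
            break
    keep = np.flatnonzero(heights >= h / 2.0) + 1       # Eq. (3.1.1.2): keep components with H(i) >= h/2
    return np.isin(labels, keep).astype(O.dtype)
```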
Niblack Parameter Estimation and Final Combination
Local binarization of Niblack [Niblack 1986] is to be applied in the normalized image N(x,y).
However, Niblack requires two parameters, the window size "w" and the strictness "k". In the
binarization method of Ntirogiannis et al. [Ntirogiannis 2014], the window size was set according to
the stroke width, the computation of which required skeletonization and distance transformation
which increased the computational time. Under those circumstances, we use a fixed window size for
Niblack w=50, while parameter "k" is set according to the global image contrast C, as denoted by
Eq. (3.1.1.3).
k = −0.2 − 0.1·⌊C/10⌋ (3.1.1.3)
Concerning the global contrast of the image, the contrast ratio is often used, i.e. the ratio between
the foreground and the background intensity, and the logarithm (log) of the contrast ratio has also
been used, including in document image applications (Roufs et al. [Roufs 1997]). In particular, we use
the log contrast ratio C (Eq. (3.1.1.4)) modified for degraded document images. As numerator, we
use the average foreground intensity FGave increased by the foreground standard deviation FGstd
in order to be closer to the faint character’s intensity and as denominator we use the average
background intensity BGave decreased by the background standard deviation BGstd in order to be
closer to the background variations such as shadows and stains. In Eq. (3.1.1.4), the constant "-50"
is used to limit the values between 0–100, approximately.
C = −50·log10( (FGave + FGstd) / (BGave − BGstd) ) (3.1.1.4)
where the foreground statistics are calculated taking into account information of the original image
that corresponds to foreground of the OP(x,y) (excluding contour pixels). The background statistics
are calculated using a new estimated background image that results after considering for each pixel
the average of the individual inpainting passes performed at the background estimation stage.
The result of the first combination followed by the post-processing step OP(x,y) contains low
background noise but fails to retain the faint parts of the characters. On the contrary, the Niblack
result NB(x,y) contains much background noise but detects the faint characters efficiently enough.
Thus, we apply a combination at connected component level between the OP(x,y) and the NB(x,y)
images. In more detail, we keep each connected component of NB(x,y) that has at least one
common foreground pixel with the OP(x,y) image. Example results are presented in Fig. 3.1.1.3.
Finally, some enhancement is performed on the binary image to smooth the contours from the
"teeth" effect.
(a)
(b)
Figure 3.1.1.3: Example results of the first and final combination: (a) Combination of Otsu [Otsu
1979] and Gatos et al. [Gatos 2006] binarization; (b) final result after additional combination with
Niblack [Niblack 1986].
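The component-level combination itself is straightforward and can be sketched as follows: every connected component of the Niblack result NB(x,y) is kept if it shares at least one foreground pixel with the first combination image OP(x,y).

```python
import numpy as np
from scipy import ndimage

def combine_at_component_level(OP, NB):
    """Keep each connected component of the Niblack result NB that has at least one
    foreground pixel in common with the first combination image OP."""
    labels, n = ndimage.label(NB)
    overlap = np.asarray(ndimage.sum(OP, labels, range(1, n + 1)))  # common foreground pixels per NB component
    keep = np.flatnonzero(overlap > 0) + 1
    return np.isin(labels, keep).astype(NB.dtype)
```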
3.1.2 Border Removal
Our methodology detects and removes noisy black borders as well as noisy text regions. It is based
on [Stamatopoulos 2007] and focuses on handwritten document characteristics. The border removal
method relies upon the projection profiles combined with a connected component labelling process.
Signal cross-correlation is also used in order to verify the detected noisy text areas.
At a first step, the noisy black borders (vertical and horizontal) of the image are removed. The
flowchart for noisy black borders detection and removal is shown in Fig. 3.1.2.1. In order to achieve
this, we first proceed to an image smoothing using the Run Length Smoothing Algorithm (RLSA)
[Wahl 1982]. Then the vertical and horizontal histograms of the image are calculated in order to
detect the starting and ending offsets of borders and text regions. The final clean image without the
noisy black borders is calculated by using the connected components of the image [Chang 2004].
Figure 3.1.2.1: Flowchart for noisy black borders detection and removal.
Once the noisy black borders have been removed, we proceed to detect noisy text regions of the
image. Our aim is to calculate the limits of the text region. The flowchart for noisy text regions
detection and removal is shown in Fig. 3.1.2.2.
We first proceed to an image smoothing by using dynamic parameters which depend on average
character height. Our aim is to connect all pixels that belong to the same line. Then, we calculate
the vertical histogram and detect the regions which are separated by a blank area. Finally, we detect
noisy text regions with the help of the signal cross-correlation function [Sauvola 1995] and remove
the noisy text regions by using the connected components of the image.
Figure 3.1.2.2: Flowchart for noisy text regions detection and removal.
An example of the developed border removal methodology is given in Fig. 3.1.2.3.
(a) (b) (c)
Figure 3.1.2.3: Border removal example: (a) Original image; (b) Binary image (c) image after
border removal.
3.1.3 Image Enhancement
Color Enhancement
Background Estimation and Image Normalization
As enhancement methods of the original grayscale image, we use background smoothing based on
background estimation. In particular, the background is estimated using a new fast inpainting
technique. Afterwards, the estimated background is used to smooth the original image. In the
following, the background estimation and smoothing technique is detailed.
For the background estimation, an inpainting procedure is followed for which the Niblack [Niblack
1986] binarization output (window size w=60 and k=-0.2) is used as the inpainting mask.
Afterwards, Niblack’s foreground result (inpainting mask) is dilated using a 3x3 mask in order to
exclude background pixels near the character edges which may have intensity closer to the
foreground intensity. In addition, Otsu [Otsu 1979] is performed, the foreground of which is
considered as part of the inpainting mask (Fig. 3.1.3.1b). This is essential to handle the big black
borders for which Niblack could fail to binarize effectively because of the fixed window size.
Finally, the estimated background image (Fig. 3.1.3.1c) inherits the intensity of the original image
for "non-mask" pixels mask, while the inpainting mask is filled with surrounding background
information of the original image according to the following procedure.
To achieve the inpainting result, five passes of the image are required. The first four image passes
are performed following the LRTB, LRBT, RLTB and RLBT directions, where L, R, T and B refer
to Left, Right, Top and Bottom, respectively. In each pass i, we use as input the original image and
get as output an inpainting image Pi(x,y). In particular, for each "mask" pixel, we calculate the
average of the "non-mask" pixels in the 4-connected neighborhood (cross-type) and consider that
pixel as a "non-mask pixel" for consecutive computations of that pass. At the final (fifth) image
pass, for any direction chosen, we take into account all four images of the four previous inpainting
passes (Pi(x, y), i=1,...,4) and for each pixel we keep the minimum (darkest) intensity (Fig.
3.1.3.1c).
Background estimation followed by the image normalization procedure is often used to balance the
illumination of a picture that is taken under uneven lighting conditions. This method modifies the
image towards a bi-modal distribution that can be globally binarized with improved success. The
main principle of the aforementioned method is that the observed image I(x,y) equals the product
of the illumination (scene lighting conditions) S(x,y) and the reflectance R(x,y) of the scene
objects. If the lighting conditions can be reproduced without any scene objects, as an image I'(x,y),
then the normalized image equals I(x,y)/I'(x,y). In the case of document images, we consider that the
image corresponding to the lighting conditions without any scene objects, I'(x,y), is represented by
the estimated background image BG(x,y). Then, according to the aforementioned discussion, the
normalized image F(x,y) equals I(x,y)/BG(x,y). The F(x,y) image is also normalized to the range
between the minimum and maximum values of the original image (Eq. (3.1.3.1)); alternatively, we
could stretch the F(x,y) image to [0,255], but this leads to a potential loss of faint characters and
leaves some noise. An example image normalization result is presented in Fig. 3.1.3.1d.
N(x,y) = ⌈ (Imax − Imin)·(F(x,y) − Fmin) / (Fmax − Fmin) + Imin ⌉ (3.1.3.1)
where F(x,y) = (I(x,y) + 1) / (BG(x,y) + 1), in order to be theoretically correct and eliminate any
possibility of division by zero, and Imin, Imax and Fmin, Fmax denote the minimum and maximum
values of the I(x,y) and F(x,y) images, respectively.
(a)
(b)
(c)
(d)
Figure 3.1.3.1: Background estimation and image normalization for background smoothing: (a)
original grayscale image; (b) inpainting mask; (c) estimated background using an inpainting
procedure; (d) final enhanced image after background normalization with the estimated background.
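A minimal sketch of the normalization of Eq. (3.1.3.1) is given below; the +1 constants in F(x,y) follow the reconstruction above (they guard against division by zero) and the ceiling mirrors the equation.

```python
import numpy as np

def normalize_with_background(I, BG):
    """Divide the image by its estimated background and rescale the result to the
    intensity range of the original image (Eq. (3.1.3.1))."""
    I = I.astype(float)
    F = (I + 1.0) / (BG.astype(float) + 1.0)                 # background-normalized ratio image
    out = (I.max() - I.min()) * (F - F.min()) / (F.max() - F.min() + 1e-12) + I.min()
    return np.ceil(out).astype(np.uint8)
```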
Background and noise removal
One of the image pre-processing steps required for handwritten text recognition, and one that is
especially important for scanned antique documents in which it is common to observe some degree
of paper degradation, is the extraction of the text from noisy backgrounds. A standard approach to
deal with this problem is to threshold the images [Sauvola 2000], i.e., assigning pixels to be either
black or white depending on whether a pixel value is below or above a certain threshold, thus
obtaining a 1-bit depth image. However, in the context of ancient texts, it can be observed that the
writing strokes vary greatly in width and darkness, not only between different pages or lines, but
also within single words. Because of this, when applying thresholding techniques, there is always a
risk that parts of the strokes are lost due to the hard decision of black or white. Furthermore, state-
of-the-art recognizers are based on Gaussian Mixture HMMs or neural networks, which do not
require having as input a binary image, and in fact better performance can be obtained if more of
the original image information is provided.
To this end, instead of thresholding, we aim for a technique that extracts text from noisy
backgrounds while having as output a grayscale image that better resembles the information
contained in the original images. Local thresholding techniques work rather well, so to take
advantage of this, we can use an existing thresholding method and relax it such that values in a band
near the threshold are mapped to a smooth transition between black and white. Due to its good
performance, we have taken as reference Sauvola’s method [Sauvola 2000], and modified it
following the previously mentioned idea.
Given an input image, for every pixel a threshold is computed based on the pixel’s neighborhood.
Like the original Niblack’s method [Niblack 1986], Sauvola’s method computes the threshold based
on the neighborhood's mean and standard deviation values. The Sauvola threshold T(x,y) for a
pixel position (x,y) is given as follows:
T(x,y) = μ(x,y) · [ 1 + k·( σ(x,y)/R − 1 ) ] (3.1.3.2)
where μ(x,y) and σ(x,y) are the neighborhood's mean and standard deviation, respectively, R is the
dynamic range of the standard deviation (usually R=128), and k is a parameter that needs to be
adjusted although it is common to have k=0.5. The neighborhood is a square region of w pixels
wide, centered at the corresponding pixel. This algorithm is very fast, since it is well known that the
mean and standard deviation of any subwindow of an image can be efficiently computed by using
integral images [Shafait 2008].
The proposed modification of the Sauvola method only requires one additional parameter, which we
have called s, that determines how fast the transition between black and white is. As the value of s is
decreased, the transition from black to white becomes narrower, which means that the images tend
to get closer to a 1-bit depth image, and in the limit when s=0 the result is exactly the same as the
original Sauvola algorithm.
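A minimal sketch of this relaxation is shown below: the Sauvola threshold T(x,y) of Eq. (3.1.3.2) is computed from local means and standard deviations (via uniform filters over a w×w neighborhood), and pixel values inside a band around the threshold, whose width grows with s, are mapped to a gray ramp instead of a hard black/white decision. The linear ramp and the way the band width depends on s are assumptions of this sketch; only the limit behaviour at s=0 (exact Sauvola) is taken from the text.

```python
import numpy as np
from scipy import ndimage

def soft_sauvola(gray, w=31, k=0.5, R=128.0, s=0.2):
    """Grayscale relaxation of Sauvola binarization: pixels well below the local
    threshold map to black, well above to white, and values inside a band whose
    width grows with `s` map to a linear gray ramp (the ramp is an assumption)."""
    g = gray.astype(float)
    mean = ndimage.uniform_filter(g, size=w)
    sq_mean = ndimage.uniform_filter(g * g, size=w)
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
    T = mean * (1.0 + k * (std / R - 1.0))            # Sauvola threshold, Eq. (3.1.3.2)
    band = s * np.maximum(T, 1.0)                      # half-width of the soft band
    out = (g - (T - band)) / (2.0 * band + 1e-6)       # 0 at T-band, 1 at T+band
    return (255.0 * np.clip(out, 0.0, 1.0)).astype(np.uint8)

# with s=0 the band collapses and the result is the hard Sauvola thresholding
```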
The Sauvola method assumes that the text is dark on a light background. Because of this
assumption, when the text is relatively light, the thresholds tend to be too high, so parts of the
strokes tend to be removed, or become light gray for the case of the proposed technique. To prevent
this, a simple strategy is to pre-process the images by stretching the contrast so that the
darkest and brightest pixels of the image reach the lowest and highest values of the grayscale range.
However, care must be taken when doing a contrast stretch. The image must contain text (not be all
background), otherwise the stretch will enhance all of the background noise.
After the extraction of the text using the proposed technique (with or without a contrast stretch),
some of the remaining noise can be removed using more standard approaches. Specifically, a certain
percentage of the smallest connected components are removed after applying the Run Length
Smoothing Algorithm (RLSA) [Wong 1982] to the image.
Show-through reduction
Recently we have been working on a method for removing some of the show-through noise.
Show-through [Tonazzini 2007] is the effect caused by the slight transparency of the paper, whereby,
when scanning, the content of the reverse side of the page tends to appear. The method being developed is
still in preliminary stages, so not much detail will be given. The method is based on the fact that the
color of the content for the reverse side of the page is a combination of the original color and the
color of the paper. This means that in color space the content of the two sides of the page can be distinguished, and this can be exploited to reduce the show-through effect.
Binary Enhancement
As enhancement methods performed on a binary image, we use the shrink and swell filters of Gatos
et al. [Gatos 2006] along with a contour enhancement technique. The shrink and swell operations use a local neighborhood to check whether the majority of the pixels belong to the foreground or the background and set the central pixel to foreground or background, respectively. These operations are described in detail in Gatos et al. [Gatos 2006].
A further enhancement uses the masks of the binarization method of Lu et al. [Lu 2010] (Fig. 3.1.3.2). These masks, of size 3x3, are applied to the binary image to handle the "teeth" effect along character contours. An example application of contour enhancement is shown in Fig. 3.1.3.3.
Figure 3.1.3.2: Masks used for the enhancement of the binary image.
Figure 3.1.3.3: Contour enhancement of the binary image: binarization result before and after the
contour enhancement.
3.1.4 Deskew
Page and Text Line Skew estimation method
The algorithm that was developed for the estimation of the skew at both the page and the text line
level is based on enhanced horizontal projection profiles. Furthermore, this method was combined
with the minimum bounding box technique which is commonly used in the literature [Safabakhsh
2000] and [Sarfraz 2008] as a skew estimation technique. The area of the rectangle that contains all
the foreground pixels of a document image is expected to be minimal when the document image is
properly deskewed. Furthermore, the running time of the developed method is improved by the
implementation of coarse-to-fine convergence.
This approach is more efficient than state-of-the-art skew detection algorithms. The developed method addresses the main drawbacks of projection profile methods: it is resistant to noise and warping, it gives more accurate results and it is fast due to the proposed coarse-to-fine convergence. For these reasons, it can be efficiently applied to historical documents, providing accurate results over a wide range of angles. The motivation for this work was the recent publication of Papandreou and Gatos [Papandreou 2011], which is extended in the developed method.
In order to estimate the skew, the document image is rotated over a range of angles θ ∈ [θ_min, θ_max]. Let I^θ(x, y) be the binary document image rotated by angle θ, having 1s for black (foreground) pixels and 0s for white (background) pixels, and let I_x^θ and I_y^θ be the width and the height of the rotated document image. At a first step, the rectangular box that bounds the foreground pixels is computed and then a histogram of enhanced horizontal profiles is calculated for every angle. Finally, the profiles for each angle are divided by the corresponding bounding box area, and the angle for which the enhanced histogram has the greatest values is regarded as the skew angle.
In order to compute the minimum rectangle E_θ that bounds the foreground pixels for each angle θ, we calculate the 4 extreme points (x_l^θ, x_r^θ, y_u^θ, y_d^θ):

E_θ = (x_r^θ − x_l^θ) × (y_d^θ − y_u^θ)    (3.1.4.1)
For each angle θ, after the extreme points of the image have been detected, the enhanced horizontal histogram Ĥ_h^θ(y) is calculated within these boundaries.
In order to calculate Ĥ_h^θ(y), the initial horizontal histogram H_h^θ(y) is computed for each angle θ, where H_h^θ(y) is defined as follows:

H_h^θ(y) = ∑_{x = x_l^θ}^{x_r^θ} I^θ(x, y)    (3.1.4.2)

and it is then divided by the area of the bounding box E_θ in order to form Ĥ_h^θ(y):

Ĥ_h^θ(y) = H_h^θ(y) / E_θ    (3.1.4.3)
In order to estimate the correct skew of the image, we seek the angle θ_h for which the variance between three successive values of the Ĥ_h^θ() histogram is maximized. θ_h is defined as follows:

θ_h = argmax_θ D_h(θ)    (3.1.4.4)

where θ varies in the range of the expected document skew, while D_h(θ) is defined as:

D_h(θ) = max_y | 2·Ĥ_h^θ(y) − Ĥ_h^θ(y − 1) − Ĥ_h^θ(y − 2) |    (3.1.4.5)

where x ∈ [x_l^θ, x_r^θ], y ∈ [y_u^θ, y_d^θ], and values of Ĥ_h^θ(y) with x and y exceeding their range are considered to be zero.
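As an illustration of Eqs. (3.1.4.1)–(3.1.4.5), the following Python sketch scores one candidate angle; it rotates only the foreground pixel coordinates (rather than the whole image) and rounds them to the nearest row, which is an assumption of this sketch, as are the function and variable names.

import numpy as np

def skew_score(binary, theta):
    # D_h(theta): variance between three successive values of the enhanced
    # horizontal profile of the image rotated by theta degrees.
    ys, xs = np.nonzero(binary)                      # foreground pixel coordinates
    t = np.deg2rad(theta)
    xr = xs * np.cos(t) - ys * np.sin(t)             # rotated coordinates
    yr = xs * np.sin(t) + ys * np.cos(t)
    E = (xr.max() - xr.min()) * (yr.max() - yr.min())   # bounding box area, Eq. (3.1.4.1)
    rows = np.round(yr - yr.min()).astype(int)
    H = np.bincount(rows)                            # horizontal histogram, Eq. (3.1.4.2)
    Hn = H / E                                       # enhanced profile, Eq. (3.1.4.3)
    Hp = np.pad(Hn, (2, 0))                          # out-of-range values count as zero
    return np.max(np.abs(2 * Hp[2:] - Hp[1:-1] - Hp[:-2]))   # D_h(theta), Eq. (3.1.4.5)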
Faster convergence – coarse-to-fine technique
In order to make the developed method faster and reduce the computational cost, a coarse-to-fine
technique, similar to the one proposed by Shutao Li et al [Shutao 2007], is applied. With the use of
this technique the number of angles that the document image is rotated is reduced without
restricting the range of the angles that the document is rotated. This coarse-to-fine technique,
furthermore, enables the extension of the range of the expected skew angle since it accelerates the
proposed algorithm multiple times, depending on the range of angles of the expected skew, without
affecting significantly the efficiency of the proposed method. The flowchart of the method using
this coarse-to-fine technique is presented in Fig.3.1.4.1.
[Flowchart omitted: the image is processed in four passes K = 1…4; the first pass sweeps the full range [A1, A2] = [−θ, θ] with step 1°, and each subsequent pass sweeps a narrow interval around the previous estimate S with steps 0.5°, 0.1° and 0.05°, applying the skew estimation method at every angle.]
Figure 3.1.4.1: Flowchart of the coarse-to-fine technique used for faster convergence.
Given a wide range of angles [θ_min, θ_max] in which the document skew is expected, a classical projection profile method would rotate the document to every angle from θ_min to θ_max with a certain step which corresponds to the required accuracy. In that way, the algorithm would perform [(θ_max − θ_min + 1) × 20] rotations in order to achieve an accuracy of 0.05°. Instead, in the proposed method the document image is rotated from θ_min to θ_max with step 1° and the dominant skew angle θ_d is determined. Afterwards, the document image is rotated from (θ_d − 0.5) to (θ_d + 0.5) with step 0.5° in order to find the new dominant skew angle θ_d2. Then, the document image is rotated from (θ_d2 − 0.4) to (θ_d2 + 0.4) with step 0.1° and the dominant skew angle is readjusted to θ_d3, while finally, the image is rotated from (θ_d3 − 0.05) to (θ_d3 + 0.05) with step 0.05° and so the skew angle θ_s is determined. With the use of this coarse-to-fine technique the rotations required are reduced to (θ_max − θ_min + 13). As a conclusion, the proposed algorithm achieves faster convergence: it performs about 9 times faster in a range between −5° and 5° and about 17.5 times faster in a range between −45° and 45°.
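A minimal sketch of the coarse-to-fine search, reusing the hypothetical skew_score() function from the previous sketch; the pass structure (1°, 0.5°, 0.1°, 0.05° steps) follows the description above, while everything else is an assumption.

import numpy as np

def coarse_to_fine_skew(binary, theta_min=-45.0, theta_max=45.0):
    # Pass 1: full range with step 1 degree.
    angles = np.arange(theta_min, theta_max + 0.5, 1.0)
    best = max(angles, key=lambda a: skew_score(binary, a))
    # Passes 2-4: narrow intervals around the current estimate with finer steps.
    for half_range, step in [(0.5, 0.5), (0.4, 0.1), (0.05, 0.05)]:
        angles = np.arange(best - half_range, best + half_range + step / 2, step)
        best = max(angles, key=lambda a: skew_score(binary, a))
    return best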
The developed method can be applied to complete document pages as well as to individual handwritten text lines of historical documents. The current implementation differs from the one presented in [Papandreou 2011] since a coarse-to-fine technique is integrated in order to accelerate the developed algorithm, and the estimated skew angle derives from the maximization of the variance between three successive values of the Ĥ_h^θ() histogram.
Word Skew estimation method
The algorithm that was developed in order to estimate the skew of handwritten word images is a
coarse-to-fine method that is fast, has low computational cost, is accurate and doesn’t depend on the
minima of the word. In that way, it provides an accurate estimation of the word skew, even for
words with small length. The developed method was published in ICDAR2013 in Washington DC
[Papandreou 2013a].
The developed coarse-to-fine method is based on two steps which are demonstrated in the flowchart
in Fig. 3.1.4.2. At first, (a) the word image is divided vertically into two equal overlapping parts, (b)
the corresponding center of mass of the foreground pixels is detected in each part and (c) the
inclination of the line that connects the two centers of mass is calculated. This leads to a first rough
estimation of the skew angle and the corresponding correction is applied in the handwritten word
image. Afterwards, in the next step, the core-region of the word is detected and the word is divided
again vertically in two equal overlapping parts. The corresponding centers of mass are calculated
again, though this time only the information inside the core-region is taken under consideration. The
inclination of the line that connects the updated centers of mass is calculated and a new finer skew
correction is made in the word image. The second step is repeated iteratively until the finest skew estimation is reached.
[Flowchart omitted: Step 1 computes a coarse skew estimate s from the centers of mass of the two overlapping parts and corrects it; Step 2 detects the core-region, recomputes the centers of mass using only the core-region information and applies a finer correction. Step 2 is repeated, accumulating the total skew Ts = Ts + s and incrementing the counter c, until s is smaller than the required accuracy ac or c reaches the maximum number of iterations M.]
Figure 3.1.4.2: Flowchart of the developed coarse-to-fine methodology.
In order to define the overlapping parts, the word image is divided along its width into three equal stripes, as demonstrated in Fig. 3.1.4.3. The first two thirds of the image compose the left part, with dimensions I_x^l × I_y, and the last two thirds form the right part, with dimensions I_x^r × I_y.
Figure 3.1.4.3: The word is divided into two equal overlapping parts with widths I_x^l and I_x^r for the left and the right part, respectively.
Afterwards, the corresponding centers of mass (x_cm^l, y_cm^l) and (x_cm^r, y_cm^r) of the overlapping parts are calculated. In order to have a coarse estimation of the skew angle of the word image, the inclination of the line connecting the two reference points (see Fig. 3.1.4.4) is calculated using Eq. 3.1.4.6:

s = tan⁻¹( (y_cm^r − y_cm^l) / (x_cm^r − x_cm^l) )    (3.1.4.6)
Figure 3.1.4.4: The inclination of the line connecting the two reference points is the coarse
estimation of the word image skew angle.
In the developed method, the word image is separated into two equal overlapping parts. In that way, the overlapping parts provide continuity between the information of the left and the right side. In the
next step the overlapping parts are calculated again for the corrected word image and the core-
region of the word is detected. In that way, the main part of the word image (inside the core-region)
is distinguished from the ascenders and descenders (see Fig. 3.1.4.5).
Figure 3.1.4.5: The overlapping parts are calculated for the corrected image and the core-region of
the handwritten word is detected.
In order to detect the core-region of the word image the algorithm of reinforced projection profiles
described in [Papandreou 2014] is used, and in that way the upper and lower baselines of the word
are defined. At this point the centers of mass of the two overlapping parts are calculated again,
though only the information inside the core-region is taken under consideration. The coordinates of
the centers of mass are now updated to (x′_cm^l, y′_cm^l) and (x′_cm^r, y′_cm^r). After defining the updated
centers of mass (see Fig. 3.1.4.6) the inclination of the line that connects them is calculated by eq.
3.1.4.7 and in that way a finer estimation of the word image skew angle is achieved.
s = tan⁻¹( (y′_cm^r − y′_cm^l) / (x′_cm^r − x′_cm^l) )    (3.1.4.7)
Figure 3.1.4.6: The inclination of the line that connects the updated reference points is a finer
estimation of the word image skew angle.
After correcting the detected skew, the second step is repeated iteratively until the finest skew estimation is reached. When the last estimation of the word image skew angle is smaller than a defined accuracy, the algorithm stops and the total skew is defined as the sum of all previously calculated values, given by the following equation:

T_s = ∑_i s_i    (3.1.4.8)

where i runs over the skew angle estimations, whose number depends on the accuracy needed, and s_i is the estimated word skew of the i-th calculation.
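The following Python sketch shows a single estimation step (Eqs. 3.1.4.6 / 3.1.4.7): the iteration, skew correction and accumulation of T_s are omitted, the optional core_rows mask stands in for the core-region detector of Section 3.1.5, and all names are illustrative.

import numpy as np

def word_skew_step(binary, core_rows=None):
    # One skew estimate from the centres of mass of two overlapping parts
    # (the first and the last two thirds of the word width).
    img = binary if core_rows is None else binary * core_rows[:, None]
    ys, xs = np.nonzero(img)
    W = binary.shape[1]
    left = xs < 2 * W / 3                            # left overlapping part
    right = xs >= W / 3                              # right overlapping part
    ycl, xcl = ys[left].mean(), xs[left].mean()      # centre of mass, left part
    ycr, xcr = ys[right].mean(), xs[right].mean()    # centre of mass, right part
    return np.degrees(np.arctan2(ycr - ycl, xcr - xcl))   # Eq. (3.1.4.6)/(3.1.4.7)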
By excluding the information outside the core-region of the word and by updating the centers of
mass considering only the remaining information, a finer estimation of the skew angle is achieved. This is due to the fact that the ascenders and descenders of a word make a major contribution to the vertical average deviation of the foreground pixels and thus erroneously affect the skew estimation.
3.1.5 Deslant
Text Line and Word Slant estimation method
The algorithm that was developed for the estimation of the slant of text lines and words takes
advantage of the orientation of the non-horizontal strokes as well as their position relative to the
word's core-region. As a first step, the word core-region is detected with the use of reinforced
horizontal black run profiles which permits to detect the core-region scan lines more accurately.
Then, the near-horizontal parts of the word are removed, and the orientation and height of the remaining non-horizontal fragments, as well as their location in relation to the word's core-region, are calculated. The word slant is estimated taking into consideration the orientation and the height of each fragment, while an additional weight is applied if a fragment is partially outside the core-region of the word: such a fragment corresponds to a part of a character stroke that has a significant contribution to the overall word slant and should by definition be perpendicular to the orientation of the word. This work is an extension of [Bozinovic 1989] and
[Papandreou 2012]; it has been published in [Papandreou 2014].
The proposed slant estimation method consists of three steps. At a first step, the word core-region is
detected with the use of horizontal black run profiles. Then, the orientation and the height of non-
horizontal fragments of the word as well as their location in relation to the word core-region are
calculated. Finally, the word slant is estimated taking into consideration the orientation and the
height of each fragment while an additional weight is applied if a fragment is outside the core-
region of the word which indicates that this fragment corresponds to a part of character stroke that
has a significant contribution to the overall word slant.
Core-region detection
In order to detect the core-region, we focus on the horizontal black runs of the word image and
introduce the reinforced horizontal black run profile 𝐻. The motivation for proposing this profile is
the need to stress the existence of long horizontal black runs as well as the number of horizontal
black runs of every line of the word image. The parts outside the core-region are most of the time vertical strokes of no significant width, or horizontal strokes lying in a sparse area with no significant concentration of foreground pixels. On the other hand, the core-region is the main area of the word where most of the information can be found; in this area there are more black-white transitions than outside the core-region. The reinforced horizontal black run profile H is defined as
follows:

H(y) = ∑_{i=0}^{B(y)} ∑_{j=0}^{L(i,y)} [B(y) · j]²    (3.1.5.1)
where B(y) is the number of black runs in the horizontal scan of line y and L(i, y) is the length of the i-th black run of line y. Afterwards, a threshold T_h is set in order to distinguish the core-region
from the ascenders and descenders.
Examples of word core-region detection are given in Fig. 3.1.5.1.
Figure 3.1.5.1: Examples of word core-region detection: (a),(d) original word images and the
detected core-regions; (b),(e) the horizontal black run profiles H() and (c),(f) the Boolean horizontal
profiles HB().
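A simplified Python sketch of the core-region detection: each scan line is scored by the number of its black runs times the sum of their squared run lengths, which follows the stated motivation but is only a stand-in for the exact profile of Eq. (3.1.5.1); the fraction-of-maximum threshold is likewise an assumption, not the T_h used in the project.

import numpy as np

def core_region_rows(binary, frac=0.2):
    # Reinforced horizontal black run profile (simplified) and core-region mask.
    H = np.zeros(binary.shape[0])
    for y, row in enumerate(binary):
        padded = np.concatenate(([0], row.astype(int), [0]))
        starts = np.flatnonzero(np.diff(padded) == 1)    # black run starts
        ends = np.flatnonzero(np.diff(padded) == -1)     # black run ends
        lengths = ends - starts                          # run lengths L(i, y)
        H[y] = len(lengths) * np.sum(lengths ** 2)       # stress run count and long runs
    return H >= frac * H.max()                           # boolean mask of core-region scan lines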
Calculation of the orientation of non-horizontal fragments
As a next step we remove all scan lines that contain either horizontal or almost horizontal strokes.
For a given word or text line image, all scan lines which contain at least one black run with length greater than a threshold are removed, where the threshold is defined as follows for word images:

2.5 × l    (3.1.5.2)

while for text lines we have:

5 × l    (3.1.5.3)

where l is the modal value of all horizontal black run lengths L() and corresponds to the character stroke width. The factor 2.5 is used in word images in order not to exclude neighboring connected strokes with width approximately l. On the other hand, in text line images the factor is set to 5 so that most of the scan lines are not removed. An example of removing word
image scan lines is shown in Fig. 3.1.5.2.
Portions of each strip separable by vertical lines are isolated in boxes. For each box, the centers of
gravity for its upper and lower halves are computed and connected. In case of an empty half, with
no foreground pixels, the whole box is disregarded. The slant 𝑠𝑖 of the connecting line defined by
the centers of gravity for upper and lower halves of box 𝑖 corresponds to the slant of this box.
Figure 3.1.5.2: Image boxes that contribute to word slant: (a) original word images and the detected
core-region; (b) scan lines removed; (c) detected image boxes and (d) valid image boxes.
Overall word slant estimation
The weighted average of 𝑠𝑖 for all boxes is used to calculate the slant 𝑆 of the word image as
follows:
S = ( ∑_{i=1}^{b} s_i · h_i · c_i ) / ( ∑_{i=1}^{b} h_i · c_i )    (3.1.5.4)

where b is the number of valid boxes, h_i is the height of box i and c_i is a parameter used to weigh the contribution of box i based on its position in relation to the word core-region. Parameter c_i is used to assign a weight of 1 if the box is inside the core-region and 2 otherwise. The rationale behind this
parameter is that character parts outside the core region have a significant contribution to the overall
word slant. As we observe in Fig. 3.1.5.3, the boxes that bound the largest fragments and lie outside the core region contain the strokes that should be perpendicular to the orientation of the word, and consequently our algorithm considers such strokes as dominant.
Figure 3.1.5.3: The fragments whose bounding boxes have a significant height and lie outside the core region are those that by definition should be perpendicular to the orientation of the word.
3.2 Image Segmentation
3.2.1 Detection of main page elements
Detection of main text zones
In the approach developed in the first year of the project we followed a two-step procedure in order
to detect the main text zones. First, vertical lines are detected based on a fuzzy smoothing method.
Then, we process the information appearing in the vertical white runs of the image. Based on those
two steps we are able to detect the main text zones that may appear in one or two columns.
Detection of vertical lines
In this step, we first proceed to a vertical smoothing based on the Run Length Smoothing
Algorithm (RLSA) [Wahl 1982] in order to connect vertical broken lines (see Fig. 3.2.1.1). Then, we detect pixels that belong to long vertical lines by applying the vertical fuzzy RLSA [Shi 2004]. The
fuzzy RLSA measure is calculated for every pixel on the initial image and describes “how far one
can see when standing at a pixel along vertical direction”. By applying this measure, a new
grayscale image is created, which is then binarized, and the lines of text are extracted from the new
image (see Fig. 3.2.1.2).
Figure 3.2.1.1: An example of a part of an image before (a) and after (b) vertical smoothing.
Figure 3.2.1.2: Vertical fuzzy RLSA: (a) original image; (b) the fuzzy RLSA result.
In order to detect vertical lines, we proceed to a vertical projection of the resulting image. Since we
want to focus on the vertical black runs of the document image we define the vertical black run
profile V(x) as follows:
V(x) = ∑_{i=0}^{B(x)} [L(i, x)]²    (3.2.1.1)
where B(x) is the number of black runs in the vertical scan of column x and L(i,x) is the length of
the i-th black run of column x.
An example of vertical projections is presented in Fig. 3.2.1.3(a). By applying a threshold to these
projections and calculating the minimum and maximum image points that contribute to the
corresponding vertical lines, we can define the vertical line segments of the image (see Fig.
3.2.1.3(b)). The main text body can be defined as the zone between those consecutive vertical lines
that have maximum distance between them (see Fig. 3.2.1.3(b)).
Figure 3.2.1.3: (a) vertical projections of image in Fig. 3.2.1.2 and (b) the resulting vertical lines as
well as the detected main text zone
The same procedure can be followed to detect horizontal lines. Combining information about
horizontal and vertical lines can lead to the segmentation of the image based on the defined grid as
well as to the detection of the main text region, the title and the side notes zone. Some preliminary
results for this procedure are presented in Fig. 3.2.1.4.
Figure 3.2.1.4: (a) original image; (b) detection of the main text region, the title and the side notes
zone
Processing vertical white runs
Processing the background of the image can be used to efficiently detect the main text zone. In our
approach, we first count the vertical white runs of the image that are of significant length
(>image_height/3). Some examples of this procedure for single and two-column documents are
presented in Fig. 3.2.1.5.
Figure 3.2.1.5: (a),(c) original images; (b),(d) their long vertical white runs (in red) and the
detected main text zones
At a next step we detect and tag image columns that do not have long vertical white runs. If we
calculate and sort the lengths of all runs of successive tagged columns, and the two longest lengths have almost the same value, the document is classified as a two-column document. Otherwise, the main
text area is detected. Examples of both cases are presented in Fig. 3.2.1.5.
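The following Python sketch illustrates this procedure; the thresholds (white runs longer than one third of the image height, two zones considered of "almost the same" length when their ratio exceeds 0.9) and the function name are assumptions.

import numpy as np

def main_text_zones(binary):
    # Tag columns without a long vertical white run and keep the longest runs
    # of tagged columns as candidate text zones.
    h, w = binary.shape
    tagged = np.zeros(w, dtype=bool)
    for x in range(w):
        col = np.concatenate(([1], binary[:, x].astype(int), [1]))   # pad with foreground
        starts = np.flatnonzero(np.diff(col) == -1)                  # white run starts
        ends = np.flatnonzero(np.diff(col) == 1)                     # white run ends
        long_run = len(starts) > 0 and np.max(ends - starts) > h / 3
        tagged[x] = not long_run
    edges = np.flatnonzero(np.diff(np.concatenate(([0], tagged.astype(int), [0]))))
    runs = sorted(zip(edges[1::2] - edges[0::2], edges[0::2], edges[1::2]), reverse=True)
    if not runs:
        return "no text zone", []
    if len(runs) >= 2 and runs[1][0] >= 0.9 * runs[0][0]:
        return "two-column", [runs[0][1:], runs[1][1:]]              # (start, end) column zones
    return "single-column", [runs[0][1:]]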
Segmentation of structured documents using stochastic grammars
Given an image of a page containing a sequence of marriage registers, the page segmentation is
performed in two steps. First, the image is divided into square cells of pixels. For each cell we use
a model to compute the probability of belonging to one of the considered classes. Then those
probabilities are parsed using a grammar that accounts for the structure of the document in order to
obtain the most likely segmentation.
Text Classification Features
Following the outline in [Cruz 2012] we used two different sets of features, Gabor features as texture
descriptors and Relative Location Features [Gould 2008] for classifying small regions of pixels.
According to the structure of the registers (see Fig. 3.2.1.6) there are four possible classes: name,
body, tax and background.
Figure 3.2.1.6: Sample image of part of a page with two registers. Each register has three parts:
Name (green), Body (Blue) and Tax (Red).
A common procedure to analyze the texture information of a document image is the use of a multi-
resolution filter bank in order to compute a set of responses for several orientations and frequencies
of the filter. For this research we defined the set of features ℝ𝑚× which corresponds with the
set of Gabor filter responses computed for m frequencies and n orientations at pixel level.
Relative Location Features (RLF) were introduced by Gould et al. in [Gould 2008] as a way to
encode inter-class spatial relationships as local features. Each of these features models the probability of assigning the class label c to a region z, taking into account the information provided by the rest of the image regions about their positions and their initial label predictions.
Once we obtained the set of features, we calculated the probability P (c|z) that region z belongs to
class c by computing the probability density function for each possible class using a Gaussian
mixture model (GMM).
2D Stochastic Context-Free Grammars
We define a 2D Stochastic Context-Free Grammar (2D-SCFG) as a generalization of SCFG that is
able to deal with bidimensional matrices [Alvaro 2013]. The binary rules of a 2D SCFG have an
additional parameter r ∈ {H, V} that describes a spatial relation: horizontal concatenation (H) or
vertical concatenation (V). Given a rule A →ʳ B C, the combined subproblems must be arranged
according to the spatial relation constraint.
Given a page image, the problem is to obtain the most likely parsing according to a 2D SCFG. For
this purpose, the input page is considered as a bidimensional matrix I with dimensions w × h, and
each cell of the matrix can be either a pixel or a cell of 𝑑 × 𝑑 pixels. Then, we define an extension
of the well-known CYK algorithm to account for bidimensional structures. We basically extended
the algorithm described in [Crespi 2008] to include the stochastic information of our model.
The CYK algorithm for 2D SCFG is based on building a parsing table Ƭ. Each hypothesis accounts
for a region z_{x,y;x+i,y+j}. Each region z is defined as a rectangle delimited by its top-left corner (x, y) and its bottom-right corner (x+i, y+j), and l_{i×j} represents a subproblem of size i×j. An element of the parsing table Ƭ_{x,y;x+i,y+j}[A] represents that nonterminal A can derive the region z_{x,y;x+i,y+j} with a
certain probability.
The parsing process is defined as a dynamic programming algorithm. The derivation probability of
a certain region is marginalized according to the terminal classes c and taking into account the size
of the regions. The most likely parsing is approximated by maximizing the posterior probability and
by considering usual assumptions, the parsing table is initialized for each region z of size 1x1 as
Ƭ_{x,y;x+1,y+1}[A] = ∑_c P(A → c) · P(c | z_{x,y;x+1,y+1}) · P(l_{1×1} | A)    (3.2.1.2)

where P(A → c) is the probability of the model for terminal c; P(c | z) represents the probability that region z belongs to class c (described in the previous section); and P(l_{1×1} | A) is the probability that nonterminal A derives a subproblem of size 1×1. Second, the algorithm continues analyzing new
hypotheses of increasing size using the binary rules of the grammar. The general case is computed
for all i and j, 2 ≤i≤w; 2≤j≤h as
Ƭ_{x,y;s,t}[A] = max_{A →ʳ B C} max_{z_B, z_C} P(A →ʳ B C) · Ƭ_{z_B}[B] · Ƭ_{z_C}[C] · P(l_{i×j} | A)    (3.2.1.3)

where a new subproblem Ƭ_{x,y;s,t}[A] is computed from two smaller subproblems Ƭ_{z_B}[B] and Ƭ_{z_C}[C], and the probability is maximized over every possible vertical and horizontal decomposition of the region z_{x,y;s,t} into subregions z_B and z_C consistent with the spatial relation r.
The 2D SCFG provides syntactic and spatial constraints P(A →ʳ B C), and we have also included the expected size for a nonterminal, P(l_{i×j} | A), as a source of information.
The expected size P(l_{i×j} | A) has two effects on the parsing process. First, it helps to model the
spatial relations among every part of a given page because there is a specific nonterminal for each
zone of interest. This can be seen in Fig. 3.2.1.6 where the expected size of the background region
on top of the page will be different from the expected size of the background zone over a Tax
region. Furthermore, many unlikely hypotheses are pruned during the parsing process due to its size
information; hence, it speeds up the algorithm.
Finally, the most likely parsing of the full input page is obtained in Ƭ_{0,0;w,h}[S]. The time complexity of the algorithm is O(w³h³|R|) and the spatial complexity is O(w²h²).
Page Split
Our methodology detects the optimal page frames of double page document images and splits the
image into the two pages as well as removes noisy borders. It is based on white run projections
which have been proved efficient for detecting zones of text areas. It is an extension of
[Stamatopoulos 2010] with special focus on historical handwritten documents.
The page split methodology consists of three distinct steps. At a first step, a pre-processing which
includes noise removal and image smoothing based on ARLSA [Nikolaou 2010] is applied. At a
next step, the vertical zones of the two pages are detected. Finally, the frame of both pages is
detected after calculating the horizontal zones for each page.
In order to detect the vertical zones we focus on the white pixels of the image and introduce the
vertical white run projections HV() which have been proved efficient for detecting vertical zones of
text areas. The motivation for proposing these projections is the need to stress the existence of long
vertical white runs in the image. The vertical white run projections HV() are defined as follows:
HV(x) = ∑_{j=1}^{wv_i} (y_{j2} − y_{j1})² / ((1 − 2·a) · I_y)    (3.2.1.4)

where wv_i is the number of white runs (i, y_{j1})–(i, y_{j2}) of column x = i in the range y = a·I_y … (1 − a)·I_y, I_y denotes the image height and HV(x) ∈ [0 … (1 − 2·a)·I_y].
Figure 3.2.1.7: (a) Binary image (b) vertical white pixel projections and (c) vertical white run
projections.
An example of the vertical white run projections compared to classical vertical white pixel
projections is given in Fig. 3.2.1.7. In this figure it is demonstrated that using white run projections
we can better discriminate text from non-text vertical zones. We can safely consider that if
HV(x)<(1-2*a)Iy/2 then the corresponding image column may contain page information. Also, if we
have n consecutive image columns that satisfy this condition, with n > b·Ix, then a vertical page zone is
detected. Although detecting two vertical page zones is the most common case, we examine all the
cases where more than two, just one or no vertical page zones are detected. Once the vertical page
zones have been detected, a similar process is applied in order to detect the horizontal page zones.
An example of the developed page split methodology is given in Fig. 3.2.1.8.
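A Python sketch of the vertical zone detection based on Eq. (3.2.1.4); the default values of the parameters a and b are assumptions, as is the simple run-length scan used to enumerate vertical white runs.

import numpy as np

def vertical_page_zones(binary, a=0.05, b=0.05):
    # HV(x) of Eq. (3.2.1.4) and detection of vertical page zones.
    Iy, Ix = binary.shape
    lo, hi = int(a * Iy), int((1 - a) * Iy)
    HV = np.zeros(Ix)
    for x in range(Ix):
        col = np.concatenate(([1], binary[lo:hi, x].astype(int), [1]))
        starts = np.flatnonzero(np.diff(col) == -1)                  # white run starts
        ends = np.flatnonzero(np.diff(col) == 1)                     # white run ends
        HV[x] = np.sum((ends - starts) ** 2) / ((1 - 2 * a) * Iy)
    candidate = HV < (1 - 2 * a) * Iy / 2            # columns that may contain page content
    zones, start = [], None
    for x, c in enumerate(np.concatenate((candidate, [False]))):
        if c and start is None:
            start = x
        elif not c and start is not None:
            if x - start > b * Ix:                   # long enough sequence of columns
                zones.append((start, x))
            start = None
    return zones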
Figure 3.2.1.8: Page Split example: (a) original image; (b) binary image (c) image after page split.
3.2.2 Text Line Detection
Baseline Detection
The text line analysis and detection (TLAD) system used in this task is based on HMM and has four
main phases (see Fig. 3.2.2.1): image preprocessing, feature extraction, HMMs and LM training and
decoding. All of the above phases are required, at some level, for the application of pattern
recognition techniques to a problem. Although these processing steps are described in other sections
of this deliverable, it is important to note that our TLAD system uses specific software to perform
them.
Figure 3.2.2.1: Global scheme of the handwritten text line detection process.
Preprocessing
The input data for TLAD are page images. Hence as a first step each image undergoes conventional
noise reduction, geometry normalization and other preprocessing procedures in order to meet the
assumptions of the approach.
The preprocessing phase is composed of several techniques that are applied on the input page
images in order to suppress unwanted issues (noise, background, bleed through, distortions, etc.)
and/or to enhance some image features (specific for TLAD), which are important for subsequent
feature extraction and line detection processes.
In Fig. 3.2.2.2 we can see the step-by-step results of the successive application of each
preprocessing technique to a sample portion of a page image.
As one can see, in the last step shown in Fig. 3.2.2.2 we smear the line with the RLSA algorithm.
This is performed in order to enhance the text line regions previous to the feature extraction phase.
Feature extraction
Since our TLAD approach is based on continuous density HMMs, each preprocessed image must be
represented as a sequence of feature vectors. This is done by dividing the image resulting after
applying RLSA into D non-overlapping vertical rectangular regions all with height equal to the
image-height, L.
For each pixel line of each rectangular region we calculate the projection profile of the black pixels
and smooth it with a rolling average filter. By stacking, for each of the L pixel lines of the page image, the values of the projection profiles of the D regions, we obtain a sequence of L feature vectors of dimension D that represents the image.
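A Python sketch of this feature extraction step; the number of regions D, the width of the rolling average filter and the function name are assumptions.

import numpy as np

def tlad_features(smeared, D=20, window=11):
    # One D-dimensional feature vector per pixel line of the RLSA-smeared page:
    # per-stripe horizontal projection profiles smoothed with a rolling average.
    stripes = np.array_split(smeared.astype(float), D, axis=1)
    profiles = np.stack([s.sum(axis=1) for s in stripes], axis=1)    # shape (L, D)
    kernel = np.ones(window) / window
    return np.apply_along_axis(lambda p: np.convolve(p, kernel, mode="same"), 0, profiles)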
Figure 3.2.2.2: Figure shows the effects of each of the preprocessing techniques on a handwritten
text image and the final feature extraction
Modelling and decoding
Analogous to the speech recognition and handwritten text recognition (ASR,HTR) problems, the
handwritten text line detection problem can be formulated as the problem of finding a most likely
text line sequence hypothesis, ĥ = ⟨h₁, h₂, …, h_N⟩, for a given handwritten page image (or selected
region image) represented by an observation sequence1, that is:
1Henceforward, in the context of this formal framework, each time we mention the image of a page or selected text, we
are implicitly referring to its input feature vector sequence “o” describing it
ĥ = argmax_h P(o | h) · P(h)    (3.2.2.1)
The above formula (Please read [Bosch 2012] for more details) can be optimally solved by using
the Viterbi search algorithm [Jelinek 1998]. The solution for a given page image not only provides
us the most likely text line sequence hypothesis ĥ, but also yields the physical location of the text
line by making explicit the “hidden” alignment b (a sequence of marks in the page) between the
input feature vector subsequence and the text line sequence hypothesis (see Fig. 3.2.2.3).
Our line detection approach is based on two modelling levels: morphological and syntactical. The
morphological level, expressed as the P(o|h) term (see Eq. 3.2.2.1) is modelled by using HMMs and
is in charge of explaining the different vertical line region classes that appear on the input images
along its vertical direction. In our line detection approach four different kinds of vertical regions are
defined:
Normal text Line-region (NL): Region occupied by the main body of a normal handwritten text
line.
Inter Line-region (IL): Defined as the region found within two consecutive normal text lines,
characterized by being crossed by the ascenders and descenders belonging to the adjacent text lines.
Blank Line-region (BL): Large rectangular region of blank space usually found at the start and end
of page image (top and bottom margins).
Non-text Line-region (NT): Stands for everything which does not belong to any of the other
regions.
Figure 3.2.2.3: Schematic representation for a page image extraction of L feature vectors of D
components.
Figure 3.2.2.4: Examples of the different classes of line regions.
We model each of these regions by an HMM which is trained with instances of such regions.
Basically, each line-region HMM is a stochastic finite-state device that models the succession of
feature vectors extracted from instances of the specific line-region images. In turn, each HMM state
generates feature vectors following an adequate parametric probabilistic law; typically a mixture of
Gaussian densities. The adequate number of states and Gaussians per state may be conditioned by
the available amount of training data. Once an HMM “topology” (number of states and structure)
has been adopted, the model parameters can be easily trained from instances (sequences of features
vectors) of full images containing a sequence of line-regions (without any kind of segmentation)
accompanied by the reference labels of these images that correspond to the actual sequence of line-
region classes. This training process is carried out using a well-known instance of the EM algorithm
called forward-backward or Baum-Welch re-estimation [Jelinek 1998]. The syntactic modelling
level, expressed as the P(h) term (see Eq.(3.2.2.1)), is responsible for defining the way that the
different line regions can be concatenated in order to produce a valid page structure. It is worth
noting that at this level, NL and NT line regions are always forced to be followed by IL region:
NL+IL and NT+IL. We can also use the LM to impose restrictions about the minimum or maximum
number of line-regions to be detected. The LM for our text line detection approach is implemented
as a stochastic finite state grammar (SFSG) which recognizes valid sequences of elements (line
regions): NL+IL, NT+IL and BL.
Both modelling levels, morphological and syntactical, which are represented by finite-state
automata, can be integrated into a single global model on which Eq. (3.2.2.1) is easily solved; that
is, given an input sequence of raw feature vectors, an output string consisting of the recognized
sequence of line region class labels along with their corresponding coordinate positions is obtained.
It should be mentioned that, in practice, HMM and LMs probabilities (usually expressed in
logarithms) are generally “balanced” before being used in Eq. (3.2.2.1). This is carried out by using
a “Grammar Scale Factor” (GSF), the value of which is tuned empirically.
Polygon Based Text Line Segmentation
The proposed method for text line segmentation in handwritten documents is an extension of the
work presented in [Louloudis 2008] and consists of three main steps. The first step includes
binarization, vertical line removal, connected component extraction, average character height
estimation and partitioning of the connected component domain into three distinct spatial sub-
domains. In the second step, a block-based Hough transform is used for the detection of potential
text lines, while a third step is used to correct possible splitting, to detect possible text lines which
the previous step did not reveal and, finally, to separate vertically connected parts and assign them
to text lines. A detailed description of these stages is given in the following subsections. For a more
detailed description of the method the interested reader should see [Louloudis 2008]. In this
extended method, a novel vertical character separation algorithm is presented.
Preprocessing
The preprocessing step consists of four stages. First, an adaptive binarization and vertical line
removal is applied. Then, the connected components of the binary image are extracted and the
bounding box coordinates for each connected component are calculated. The average character
height AH for the whole document image is calculated. We assume that the average character height
equals the average character width AW.
The connected components domain contains components of different profiles with respect to width
and height, since it is frequent to have components describing one character, multiple characters, a
whole word, accents and characters from adjacent touching text lines. The aforementioned
connected components variability has motivated us to divide the connected component domain into
different sub-domains, in order to deal with these categories separately. More specifically, in the
proposed approach, we divide the connected components domain into three distinct spatial sub-
domains denoted as “Subset 1”, “Subset 2” and “Subset3”.
“Subset 1” (corresponding to the majority of the normal character components) contains all
components with size which satisfies the following constraints:
0.5·AH ≤ H ≤ 3·AH  AND  0.5·AW ≤ W    (3.2.2.2)
where H, W denote the component’s height and width, respectively, and AH, AW denote the average
character height and the average character width, respectively. The motivation for “Subset 1”
definition is to exclude accents and components that are large in height and are likely to belong to
more than one text line.
“Subset 2” contains all large connected components. Large components are either capital letters or
characters from adjacent text lines touching. The size of these components is described by the
following equation:
H > 3·AH    (3.2.2.3)
The motivation for “Subset 2” definition is to grasp all connected components that exist due to
touching text lines. We assume that the corresponding height will exceed three times the average
character height.
Finally, “Subset 3” should contain characters as accents, punctuation marks and small characters.
The equation describing this set is:
(H < 3·AH AND W < 0.5·AW)  OR  (H < 0.5·AH AND W ≥ 0.5·AW)    (3.2.2.4)
The motivation for ‘Subset 3’ definition is that accents usually have width less than half the average
character width or height less than half the average character height.
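As a compact illustration of the three sub-domains (Eqs. 3.2.2.2–3.2.2.4), a small Python helper is sketched below; the function name is ours and the strict/non-strict inequalities follow the reconstruction given above.

def classify_component(H, W, AH, AW):
    # Assign a connected component of height H and width W to a sub-domain,
    # given the average character height AH and width AW.
    if 0.5 * AH <= H <= 3 * AH and W >= 0.5 * AW:
        return "Subset 1"        # normal character components
    if H > 3 * AH:
        return "Subset 2"        # large components, possibly touching text lines
    return "Subset 3"            # accents, punctuation marks, small characters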
Hough Transform Mapping
In this step, the Hough transform takes into consideration only connected components that belong to
sub-domain “Subset 1”. Selection of this sub-domain is made for the following reasons: (i) It is
guaranteed that components which span across more than one text line will not vote in the Hough
domain; (ii) it rejects components, such as accents, which have a small size. This avoids false text
line detection by connecting all the accents above the core text line.
In our approach, instead of having only one representative point for every connected component, a
partitioning is applied for each connected component lying in “Subset 1”, in order to have more
representative points voting in the Hough domain. In particular, every connected component lying
in this subset is partitioned to equally-sized blocks. The number of the blocks is defined by the
following equation:
N_b = W_c / AW    (3.2.2.5)

where W_c denotes the width of the connected component and AW the average character width of all
connected components in the image. For a detailed description of this step see [Louloudis 2008].
Postprocessing
The post-processing step consists of two stages. At the first stage (i) a merging technique over the
result of the Hough transform is applied to correct possible false alarms and (ii) connected
components of “Subset 1” that were not clustered to any text line are checked to see whether they
create a new text line that the Hough transform did not reveal.
The second post-processing stage deals with components lying in “Subset 2”. This subset includes
components whose height exceeds three times the average height AH. All components of this subset
mainly belong to n detected text lines (n>1). Our novel method for splitting these components
consists of the following:
(A) Calculate yi, which are the average y values of the intersection of detected line i and the
connected component’s bounding box (i =1..n) (see Figure 3.2.2.5a).
(B) Then, compute the skeleton of the connected component (Figure 3.2.2.5b), detect all
junction points and remove them from the skeleton (Figure 3.2.2.5c). If no junction point exists in the zone between yi and yi+1, remove all skeleton points in the center of the zone.
(C) Assign the skeleton connected components to the closest line yi (Figure 3.2.2.5d).
(D) Each pixel of the initial image (Figure 3.2.2.5a) inherits the id of the closest skeleton pixel
(Figure 3.2.2.5e).
Figure 3.2.2.5: Steps for assigning pixels of vertically connected characters to corresponding text lines (colored lines): (a)-(f) Initial image, (b)-(g)
skeleton, (c)-(h) removal of junction points, (d)-(i) assignment of skeleton connected components to closest line, (e)-(j) final result.
3.2.3 Word Detection
Word segmentation refers to the process of detecting the word boundaries starting from a text line
image. Although the reader may think that the solution to this problem is trivial, it is still considered
an open problem among researchers of handwriting recognition and document analysis areas due to
several challenges that need to be addressed. These challenges include: i) the appearance of skew
along a single text line (Figure 3.2.3.1), ii) the existence of slant (Figure 3.2.3.2), iii) the existence
of punctuation marks (Figure 3.2.3.3) and iv) the non-uniform spacing of words (Figure
3.2.3.4). The proposed word segmentation method is an extension of the method presented in
[Louloudis 2009b], adapted to historical handwritten documents.
Figure 3.2.3.1: Text line image which presents different skew angle along its width.
Figure 3.2.3.2: Example of a slanted text line image.
Figure 3.2.3.3: Part of a document image showing the existence of punctuation marks.
Figure 3.2.3.4: A document image portion showing the non-uniform spacing of words.
A word segmentation algorithm usually contains two steps. The first step deals with the
computation of the distances of adjacent components in the text line image and the second step
concerns the classification of the previously computed distances as either inter-word gaps or inter-
character gaps. For the first step, we used the Euclidean distance metric [Louloudis 2009a]. For the
classification of the computed distances we used a well-known methodology from the area of
unsupervised clustering techniques, namely Gaussian Mixtures. A more detailed description of the
two steps is provided in the following subsections.
Distance Computation
In order to calculate the distance of adjacent components in the text line image, a pre-processing
procedure is applied. The pre-processing procedure concerns the correction of the skew angle as
well as the dominant slant angle [Vinciarelli 2001] of the text line image (Figure 3.2.3.5, Figure
3.2.3.6). The computation of the gap metric is considered not on the connected components (CCs)
but on the overlapped components (OCs), where an OC is defined as a set of CCs whose projection
profiles overlap in the vertical direction (see Figure 3.2.3.7).
We define as distance of two adjacent overlapped components (OCs) their Euclidean distance. The
Euclidean distance between two adjacent overlapped components is defined as the minimum among
the Euclidean distances of all pairs of points of the two adjacent overlapped components (Figure
3.2.3.8). For the calculation of the Euclidean distance we apply a fast scheme that takes into
consideration only a subset of the pixels of the left and right OCs instead of the whole number of
black pixels. In order to define the subset of pixels of the left OC, we include in this subset the
rightmost black pixel of every scanline. The subset of pixels for the right OC is defined by
including the leftmost black pixel of every scanline. Finally, the Euclidean distance of the two OCs
is defined as the minimum of the Euclidean distances of all pairs of pixels.
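A Python sketch of this distance computation; the two overlapped components are assumed to be given as binary masks in the same (page) coordinate system, and the function name is illustrative.

import numpy as np

def oc_distance(left_oc, right_oc):
    # Minimum Euclidean distance between the rightmost pixels per scanline of
    # the left OC and the leftmost pixels per scanline of the right OC.
    ys_l, xs_l = np.nonzero(left_oc)
    ys_r, xs_r = np.nonzero(right_oc)
    pts_l = np.array([(y, xs_l[ys_l == y].max()) for y in np.unique(ys_l)], float)
    pts_r = np.array([(y, xs_r[ys_r == y].min()) for y in np.unique(ys_r)], float)
    diffs = pts_l[:, None, :] - pts_r[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min()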
Figure 3.2.3.5: (a) a text line image and (b) the same text line after skew correction.
Figure 3.2.3.6: (a) a text line image and (b) the same text line after slant correction.
Figure 3.2.3.7: The overlapped components of the text line image are defined based on the
projection profile.
Figure 3.2.3.8: The arrows define the Euclidean distances between adjacent overlapped
components.
Gap Classification
Mixture model based clustering relies on the idea that each cluster is mathematically represented by a parametric distribution. We have a two-cluster problem (inter-word and intra-word gaps), so every cluster is modeled with a Gaussian distribution. The algorithm that is used to calculate the
parameters for the Gaussians is the Expectation Maximization (EM) algorithm. We use this
methodology since Gaussian Mixture Modeling is a well-known unsupervised clustering technique
with many advantages which include: (i) the mixture model covers the data well, (ii) an estimation
of the density for each cluster can be obtained and (iii) a “soft” classification is available. For a
detailed description of Gaussian Mixtures, the interested reader is referred to [Marin 2005]. For the
calculation of the number of parameters and the number of Gaussians we used the software package
CLUSTER, which implements an unsupervised algorithm for modeling Gaussian mixtures
[https://engineering.purdue.edu/~bouman/software/cluster/].
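The following Python sketch performs the same two-cluster classification with scikit-learn's GaussianMixture, used here only as a stand-in for the CLUSTER package; the convention that the component with the larger mean models the inter-word gaps is an assumption.

import numpy as np
from sklearn.mixture import GaussianMixture

def classify_gaps(distances):
    # Fit a two-component Gaussian mixture to the OC distances of a text line
    # and return a boolean array: True where a gap is classified as inter-word.
    d = np.asarray(distances, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(d)
    inter_word = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict(d) == inter_word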
3.3 Handwritten Text Recognition
The HTR system used here is based on hidden Markov models (HMM) and N-grams. A complete
description of the system can be found in [Toselli 2004],[Vidal 2012]. It follows the classical
architecture composed of three main modules (see Figure 3.3.1): document image preprocessing, line image feature extraction and model training/decoding. The preprocessing module is in charge of reducing the noise of the images, enhancing the quality of text regions, dividing the image into lines and reducing the variability of text styles. This module has been studied in the previous sections (T3.1 Image pre-processing and T3.2 Image segmentation). Therefore, in this section we focus on the rest of the modules. In the feature extraction module a feature vector sequence is obtained as the representation of a handwritten text image. The training module is in charge of training the different models used in the decoding step: HMMs, language models and lexical models. Finally, the decoding module obtains the most likely word sequence for a sequence of feature vectors.
Figure 3.3.1: Diagram representing the different modules of a handwritten text recognition system.
Feature extraction
Each preprocessed line image is represented as a sequence of feature vectors using a sliding
window with a width of W pixels that sweeps across the text line from left to right in steps of T
pixels. At each of the M positions of the sliding window, N features are extracted.
The features used here are known as PRHLT-features and were introduced in [Toselli 2004]. The
sliding window is vertically divided into R cells, where R is tuned empirically and T is the height of
the line image divided by R. Then W is defined as W = 5T. In each cell, three features are
calculated: the normalized gray level, the horizontal and vertical components of the grey level
gradient. The feature vector is constructed for each window by stacking the three values of each cell
(N = 3R). Hence, at the end of this process, a sequence of (3R)-dimensional feature vectors is
obtained.
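A simplified Python sketch of PRHLT-style feature extraction: it divides the line into R vertical cells and computes, per image column, the normalised grey level and the two grey-level gradient components of each cell, i.e. one window per column rather than the exact W = 5T windowing with step T described above; names and defaults are assumptions.

import numpy as np

def prhlt_features(line_img, R=20):
    # One (3R)-dimensional feature vector per analysis position (here: per column).
    img = line_img.astype(float) / 255.0
    gy, gx = np.gradient(img)                        # vertical / horizontal gradients
    cells = np.array_split(np.arange(img.shape[0]), R)
    feats = []
    for rows in cells:
        feats.append(img[rows].mean(axis=0))         # normalised grey level
        feats.append(gx[rows].mean(axis=0))          # horizontal gradient component
        feats.append(gy[rows].mean(axis=0))          # vertical gradient component
    return np.stack(feats, axis=1)                   # shape: (columns, 3R)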
Modelling and decoding
Given a handwritten line image represented by a feature vector sequence, x = x1x2 ...xm, the HTR
problem can be formulated as the problem of finding the most likely word sequence, ŵ = ŵ₁ ŵ₂ … ŵ_n, i.e., ŵ = argmax_w Pr(w | x). Using Bayes’ rule:
ŵ = argmax_w Pr(x | w) · Pr(w)    (3.3.1)
Pr(x|w) is typically approximated by concatenated character models, usually Hidden Markov
Models [Jelinek 1998], and Pr(w) is approximated by a word language model, usually n-grams
[Jelinek 1998].
In practice, the simple multiplication of P(x|w) and P(w) needs to be modified in order to balance
the absolute values of both probabilities. The most common modification is to usea language weight
α (Grammar Scale Factor, GSF), which weights the influence of the language model on the
recognition result, and an insertion penalty β (Word Insertion Penalty, WIP), which helps to control
the word insertion rate of the recognizer [Ogawa 1998]. In addition, usually log-probabilities are
used to avoid the numeric underflow problems that can appear using probabilities. So, Eq. 3.3.1can
be rewritten as:
ŵ = argmax_w { log Pr(x | w) + α · log Pr(w) + β · |w| }    (3.3.2)

where |w| denotes the number of words in w, and α and β are optimized over all the training sentences of the corpus.
Each character is modelled by a continuous density left-to-right HMM, which models the horizontal
succession of feature vectors representing this character. Fig. 3.3.2 shows an example of how a
HMM models two feature vector subsequences corresponding to instances of the character “a”. A Gaussian
mixture model governs the emission of feature vectors in each HMM state. The required number of
Gaussians (N_G) in the mixture depends, along with many other factors, on the vertical variability typically associated with each state. On the other hand, the adequate number of states (N_S) to
model a certain character depends on the underlying horizontal variability. The number of states and
Gaussians define the total amount of parameters to be estimated. Therefore, these numbers need to
be empirically tuned to optimize the overall performance for the given amount of training vectors
available.
Figure 3.3.2: Example of 5-states HMM modelling (feature vectors sequences of) instances of the
character “a” within the Spanish word “Castilla”. The states are shared among all instances of
characters of the same class.
The model parameters can be easily trained from images of continuously handwritten text (without
any kind of segmentation) accompanied by the transcription of these images into the corresponding
sequence of characters. This training process is carried out using a well-known instance of the EM
algorithm called forward-backward or Baum-Welch.
The concatenation of characters to form words is modelled by simple lexical models. Each word is
modelled by a stochastic finite-state automaton which represents all possible concatenations of
individual characters that may compose the word. This automaton takes into account optional
character capitalizations. By embedding the character HMMs into the edges of this automaton, a
lexical HMM is obtained. These HMMs estimate the word-conditional probability P(x|w) of Eq.
3.3.1. Finally, the concatenation of words into text lines or sentences is modelled by an n-gram
language model, with Kneser-Ney back-off smoothing [Kneser 1995].
Once all the character, word and language models are available, recognition of new test sentences
can be performed. Thanks to the homogeneous finite-state (FS) nature of all these models, they can
be easily integrated into a single global FS model by replacing each word of the n-gram model by the corresponding lexical HMM. The search for decoding the input feature vector sequence x into the output word sequence ŵ is performed over this global model. This search is optimally
done by using the Viterbi algorithm [Jelinek 1998].
Word graphs
A word graph (WG) is a data structure that represents a set of strings in a very efficient way. In
handwritten recognition, a WG represents a large amount of sequences of words whose posterior
probability Pr(w|x) is large enough, according to the morphological (likelihood) and language
(prior) models used to decode a text image, represented as a sequence of feature vectors. In other
words, the word graph is just (a pruned version of) the Viterbi search trellis obtained when
transcribing the whole image sentence. A detailed description of word graphs in HTR can be found
in [Vidal 2012].
3.4 Keyword Spotting
3.4.1 The query-by-example approach
Word spotting in document images is related to Content-Based Image Retrieval systems by
searching for a word image in a set of unindexed document images using the image content as the only information source. As a final outcome, the system returns a ranked list of document image
words to the user.
There are two fundamentally different approaches to the above systems, namely, the segmentation-
based and the segmentation-free approach. As the name implies, the former requires an initial step
for partitioning the document image into word images which are subsequently matched to the query
word image based on a similarity measure. The latter is based on matching the reference query
word image directly to the document image without any imposed constraints in the search space.
In the following sections, the description of the proposed methodologies for each of the
aforementioned approaches will be detailed. As an outline, the proposed methodologies share a
common base, i.e. the computation of local document-specific features while they follow a different
path at the level of matching between the query word image and the corresponding information
which resides in the document image. This difference is due to, on the one hand, the availability of
segmented word images (segmentation-based case) and on the other hand, the lack of information
about the localization of word images since no segmentation has been applied (segmentation-free
case).
KeyPoints Detection
Figure 3.4.1.1 shows the consecutive steps of the proposed methodology for the detection of
characteristic local points (keypoints) in a document image.
[Block diagram omitted: a preprocessing stage (gradient vector and orientation quantization, median filtering, corner/edge detection, local point grouping and filtering) yields the final local points; keypoint detection (gradient orientation quantization, convex hull corner detection) followed by keypoint selection (keypoint entropy calculation and filtering) yields the final keypoints.]
Figure 3.4.1.1: The steps for the detection and selection of characteristic keypoints
Initially, the gradient vector $G_k$ of the image $k$ and its orientation $\theta(G_k)$ are calculated, defined as:

$G_k = (I^*_x, I^*_y), \qquad \theta(G_k) = \tan^{-1}\!\left(\frac{I^*_y}{I^*_x}\right)$ (3.4.1.1)

where $I^*_x$ and $I^*_y$ are calculated by the convolution of the 1-D kernels $[-1,0,1]$ and $[-1,0,1]^T$ with the image $k$, respectively. The resulting $I^*_x$ and $I^*_y$ are very sensitive to noise. In order to make them more robust, we assume that their values belong to two distinct categories: noise and meaningful data. The threshold that separates these two clusters is estimated by minimizing the intra-class variance, as in the Otsu approach [Otsu 1975]. Finally, the values that are below this dynamic threshold are rejected.
Then, the gradient orientation is calculated (Eq. 3.4.1.1). Fig. 3.4.1.2(b) shows an example of the feature $\theta(G_k)$ using a grey-level representation. The grey values correspond to zero angles, which are not taken into consideration by the algorithm in the subsequent steps. The dark colors represent negative angles, while the bright colors represent positive angles.
features are also used in character recognition. Most of the authors use 4- or 8-direction histograms
computed in zones [Fujisawa 2010], [Koerich 2002]. This first-order feature is not invariant to image rotation. We argue that incorporating invariance properties is not beneficial in document images, as it results in noise amplification. This is further supported by the observation in [Leydier 2009], where rotation-invariant features resulted in worse performance than rotation-dependent ones; the authors attribute this to rotation-invariant features being more sensitive to the noise and the complex texture of the background.
The next step involves the quantization of $\theta(G_k)$ to $n$ levels. The purpose of this step is to detect changes in the writing direction, as these points carry important and descriptive information. The number of quantization levels $n$ is a parameter of the proposed algorithm and controls the number of final local points: the more quantization levels, the more sensitive the process becomes to the writing directions. Extensive experimentation has shown that 3 levels are sufficient in the context of word spotting retrieval. Fig. 3.4.1.2(d) shows the output of the quantization of the $\theta(G_k)$ values to 3 levels.
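To make this stage concrete, the following Python sketch (our own illustration, not project code; it assumes a grayscale page image as a NumPy array and uses scikit-image only for the Otsu threshold) computes the [-1,0,1] derivatives, rejects values below the Otsu threshold, and quantizes the surviving orientations into 3 levels.

```python
import numpy as np
from scipy.ndimage import convolve
from skimage.filters import threshold_otsu

def quantized_gradient_orientation(image, n_levels=3):
    """Illustrative sketch of the gradient orientation quantization stage."""
    kernel = np.array([[-1.0, 0.0, 1.0]])
    ix = convolve(image.astype(float), kernel)    # convolution with [-1, 0, 1]
    iy = convolve(image.astype(float), kernel.T)  # convolution with [-1, 0, 1]^T

    # The Otsu threshold separates "noise" from "meaningful data"; values
    # below the dynamic threshold are rejected (cf. Eq. 3.4.1.1 context).
    ix = np.where(np.abs(ix) >= threshold_otsu(np.abs(ix)), ix, 0.0)
    iy = np.where(np.abs(iy) >= threshold_otsu(np.abs(iy)), iy, 0.0)

    magnitude = np.hypot(ix, iy)
    theta = np.arctan2(iy, ix)                    # orientation theta(G_k)

    # Quantize the orientations of the surviving pixels into n_levels bins;
    # pixels with no meaningful gradient are labelled 0 (ignored later).
    bins = np.linspace(-np.pi, np.pi, n_levels + 1)
    levels = np.digitize(theta, bins[1:-1]) + 1   # values in 1..n_levels
    return np.where(magnitude > 0, levels, 0), magnitude
```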
Examples of the connected components (CCs) attributed to each quantization level are shown in Fig. 3.4.1.2(e)-(g), respectively. These CCs represent chunks of strokes that have different writing directions. The most descriptive and important points on those CCs are the endpoints that appear along the edge. For the endpoint detection, the convex hull that contains each CC is taken into consideration and its corners are nominated as the initial keypoints kP. Figure 3.4.1.3 shows a visual representation of a CC's convex hull (in green); the red dots are the initial keypoints. The endpoint detection step is applied to each CC at each quantization level. It is worth mentioning that, in order to reduce the computational cost, small CCs (width and height less than 5 pixels) are rejected.
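A possible realization of the endpoint detection step is sketched below (again illustrative: function names and the use of SciPy's labelling and convex hull routines are our assumptions). For each quantization level, connected components are labelled, very small ones are discarded, and the convex hull vertices of each remaining component are kept as initial keypoints.

```python
import numpy as np
from scipy.ndimage import label
from scipy.spatial import ConvexHull

def initial_keypoints(quantized, min_size=5):
    """Convex hull corners of the CCs of each quantization level (sketch)."""
    keypoints = []
    for level in range(1, int(quantized.max()) + 1):
        labelled, n_cc = label(quantized == level)
        for cc in range(1, n_cc + 1):
            ys, xs = np.nonzero(labelled == cc)
            # Small CCs (width and height below 5 pixels) are rejected.
            if np.ptp(xs) < min_size and np.ptp(ys) < min_size:
                continue
            pts = np.column_stack([xs, ys])
            try:
                hull = ConvexHull(pts)
            except Exception:        # degenerate (e.g. collinear) components
                continue
            # The hull corners are nominated as initial keypoints kP (x, y).
            keypoints.extend((int(x), int(y)) for x, y in pts[hull.vertices])
    return keypoints
```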
Figure 3.4.1.2: The steps for the keypoint detection: (a) original document image, (b) orientation of
the gradient vector, (c) quantization of the gradient vector orientation (greyscale version), (d)
quantization of the gradient vector orientation (color version), (e-g) connected components of the
distinct quantization levels, (h) the initial keypoints and (i) the final keypoints
The next steps involve the selection of those kPs that mark the most descriptive information of the word. This is achieved by initially calculating the entropy of the quantized gradient angles around the keypoint using the Shannon entropy equation:

$E(G_k) = -\frac{1}{N}\sum_{i} occ(\theta_i(G_k)) \ln\frac{occ(\theta_i(G_k))}{N}$ (3.4.1.2)

where $W$ is the set of pixels in the window, $occ(\theta_i(G_k))$ is the number of occurrences of the $i$-th quantized gradient angle in $W$, and $N = \sum_{i} occ(\theta_i(G_k))$.
The kPs are finally selected starting from those that have the maximum entropy; if other kPs are found in their neighborhood of size W, those kPs are rejected. The remaining points are those with the maximum entropy in their neighborhood and, consequently, the most significant ones. The size of the window W is a parameter that directly affects the final number of keypoints produced by the detection algorithm. It is set to W = 18x18, which equals the size of the neighborhood over which the local point descriptor is calculated, as detailed in the next section. Fig. 3.4.1.2(i) shows the final keypoints kP.
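The selection step can then be sketched as follows (illustrative; window handling at image borders and the tie-breaking order are our assumptions): the entropy of Eq. (3.4.1.2) is computed in an 18x18 window around every initial keypoint, keypoints are visited in decreasing entropy order, and any keypoint falling inside the window of an already accepted one is rejected.

```python
import numpy as np

def select_keypoints(quantized, keypoints, win=18):
    """Greedy entropy-based keypoint selection (illustrative sketch)."""
    half = win // 2

    def entropy(x, y):
        patch = quantized[max(0, y - half):y + half, max(0, x - half):x + half]
        counts = np.bincount(patch[patch > 0].ravel().astype(int))
        counts = counts[counts > 0]
        n = counts.sum()
        if n == 0:
            return 0.0
        p = counts / n
        return float(-(p * np.log(p)).sum())   # Shannon entropy, Eq. (3.4.1.2)

    ranked = sorted(keypoints, key=lambda kp: entropy(*kp), reverse=True)
    selected = []
    for x, y in ranked:
        # Keep a keypoint only if no stronger one lies within its neighborhood.
        if all(abs(x - sx) >= half or abs(y - sy) >= half for sx, sy in selected):
            selected.append((x, y))
    return selected
```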
Figure 3.4.1.3: Detection of keypoints: (a) example connected components; (b) the corner points of the convex hull as the keypoints of the connected components
Feature Extraction
The feature for each local keypoint is calculated from the quantized gradient angles. An area of 18x18 pixels around the kP is divided into 9 cells of size 6x6 each, as shown in Fig. 3.4.1.4. Each cell is represented by a 3-bin histogram (each bin corresponds to a quantization level). Each pixel accumulates a vote in the corresponding angle histogram bin. The strength of the vote depends on the norm of the gradient vector and on the distance from the location of the local point, as expressed by the following equation:

$V_{x,y} = s_{x,y} \cdot \|G_{x,y}\|$ (3.4.1.3)

where $V_{x,y}$ is the contribution of pixel $(x,y)$ to the distribution of the corresponding cell histogram, $\|G_{x,y}\|$ is the norm of the gradient vector $G$, and $s_{x,y}$ denotes a linear weighting factor of the distance between pixel $(x,y)$ and the keypoint, based on the following equation:

$s_{x,y} = 1 - \frac{2\sqrt{(x - x_{LP})^2 + (y - y_{LP})^2}}{3 \cdot 9\sqrt{2}}$ (3.4.1.4)

where $(x_{LP}, y_{LP})$ denotes the position of the LP. The weighting factor values range in the interval [1/3, 1]; the maximum value is achieved at $(x, y) = (x_{LP}, y_{LP})$, while the minimum is reached at $|x - x_{LP}| = 9$, $|y - y_{LP}| = 9$. The role of $s_{x,y}$ is to weight the participation of each pixel in the histogram according to its distance from the kP.

Finally, all nine (9) histograms are concatenated into one 27-bin histogram which is normalized by its norm. In order to make the descriptor illumination independent, all values above 0.2 are clipped to 0.2 and the resulting values are re-normalized.
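The descriptor computation of Eq. (3.4.1.3)-(3.4.1.4) can be sketched as follows (illustrative; border handling and the exact cell geometry are assumptions on our part):

```python
import numpy as np

def keypoint_descriptor(quantized, magnitude, keypoint, n_levels=3):
    """27-bin descriptor of the 18x18 neighborhood of a keypoint (sketch)."""
    x0, y0 = keypoint
    hist = np.zeros((3, 3, n_levels))
    for dy in range(-9, 9):
        for dx in range(-9, 9):
            x, y = x0 + dx, y0 + dy
            if not (0 <= y < quantized.shape[0] and 0 <= x < quantized.shape[1]):
                continue
            level = int(quantized[y, x])
            if level == 0:                   # pixel with no meaningful gradient
                continue
            # Linear distance weighting s_{x,y} of Eq. (3.4.1.4):
            # 1 at the keypoint, 1/3 at the corner of the 18x18 area.
            s = 1.0 - 2.0 * np.hypot(dx, dy) / (3.0 * 9.0 * np.sqrt(2.0))
            cell_row, cell_col = (dy + 9) // 6, (dx + 9) // 6
            hist[cell_row, cell_col, level - 1] += s * magnitude[y, x]  # Eq. (3.4.1.3)

    desc = hist.ravel()                      # 9 cells x 3 bins = 27 values
    norm = np.linalg.norm(desc)
    desc = desc / norm if norm > 0 else desc
    desc = np.minimum(desc, 0.2)             # clip for illumination independence
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```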
Figure 3.4.1.4: (a) example keypoint at the global connected components level; (b) example
keypoint at the local level; (c) Neighborhood used for the computation of the descriptor at the
keypoint.
Matching in a Segmentation-based Word Spotting context
In the case of segmentation-based word spotting, the aim is to match the query keypoints to the
corresponding keypoints of any word image in the document. For this task, the descriptor that has
been presented in the previous section is taken into consideration along with a Local Proximity
Nearest Neighbor (LPNN) search. The advantage of LPNN search is two-fold: (i) it enables a search
in focused areas instead of searching in a brute force manner and (ii) it goes beyond the typical use
of a descriptor by the incorporation of spatial context in the local search addressed. In the sequel,
the complete matching step will be detailed.
Figure 3.4.1.5: (a) the query image, (b) the word image, (c) the projection of query local points to
the word image: the red lines are the matching local points; the green area is the local proximity
area of the nearest neighbor search.
First of all, it is worth noting that the description which follows assumes that the outcome of a word image segmentation method is available; no specific segmentation methodology is discussed here. In particular, for the experiments, the word image segmentation information is taken from the ground truth corpora.
The initial stage in the matching step is a normalization which is applied to every word image, including the query word image. The aim of this stage is to alleviate any scale variability for the same word. The normalization procedure comprises the following steps:

Calculation of the mean center in the x- and y-axis of the keypoint set in a word image:

$(c_x, c_y) = \left(\frac{1}{k}\sum_i p^i_x,\; \frac{1}{k}\sum_i p^i_y\right)$ (3.4.1.5)

where $(p^i_x, p^i_y)$ denotes the location of the $i$-th keypoint and $k$ denotes the total number of keypoints in the word image.

Calculation of the mean distance of the keypoints from the mean center:

$D_x = \frac{1}{k}\sum_i |p^i_x - c_x|, \qquad D_y = \frac{1}{k}\sum_i |p^i_y - c_y|$ (3.4.1.6)

Calculation of the updated location of each keypoint, which is transformed to a new space wherein $(c_x, c_y)$ is the new coordinate origin:

$p^i_x \leftarrow \frac{p^i_x - c_x}{D_x}, \qquad p^i_y \leftarrow \frac{p^i_y - c_y}{D_y}$ (3.4.1.7)
After normalization, all word images are directly comparable due to the achieved scale invariance. In the next stage, an LPNN search is performed for each keypoint that resides in the query image. The LPNN search area is computed by taking a percentage (25%) of the already calculated distances $D_x$, $D_y$. During the search, if there are one or more word keypoints in the proximity of the query keypoint under consideration, the Euclidean distance between their descriptors is calculated and the minimum distance is kept. This is repeated for each keypoint in the query image. The final similarity measure is the sum of all the minimal distances. If there is no local point in the proximity of a query keypoint, a penalty value is added to the similarity measure, equal to the maximum Euclidean distance that can occur between keypoint descriptors, which results in a value of $\sqrt{27}$.

As a final stage, the system presents to the user all the word images in ascending order of the calculated similarity measure.
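A minimal sketch of the segmentation-based matching, assuming keypoints and descriptors have already been extracted for the query and for every segmented word image (the 25% proximity rule is applied in the normalized coordinate space and the penalty is taken as sqrt(27), following the description above; names are ours):

```python
import numpy as np

def normalize_keypoints(points):
    """Scale normalization of Eq. (3.4.1.5)-(3.4.1.7) (illustrative)."""
    pts = np.asarray(points, dtype=float)
    centre = pts.mean(axis=0)                    # (c_x, c_y)
    spread = np.abs(pts - centre).mean(axis=0)   # (D_x, D_y)
    spread[spread == 0] = 1.0
    return (pts - centre) / spread

def word_similarity(query_pts, query_desc, word_pts, word_desc, proximity=0.25):
    """LPNN-based similarity; lower values mean more similar word images."""
    q_pts = normalize_keypoints(query_pts)
    w_pts = normalize_keypoints(word_pts)
    w_desc = np.asarray(word_desc)
    penalty = np.sqrt(27.0)                      # assumed maximum descriptor distance

    score = 0.0
    for qp, qd in zip(q_pts, np.asarray(query_desc)):
        near = np.all(np.abs(w_pts - qp) <= proximity, axis=1)
        if near.any():
            score += np.linalg.norm(w_desc[near] - qd, axis=1).min()
        else:
            score += penalty                     # no keypoint in the proximity area
    return score

# Word images are then presented to the user sorted by ascending similarity score.
```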
Matching in a Segmentation-free Word Spotting context
The segmentation-free word spotting scenario using the document-specific keypoint and
corresponding descriptor is an extension of the procedure described for the segmentation-based
word spotting scenario. The fundamental difference is that there is no information about the
potential word image that should be matched to the query.
To deal with this challenge, a strategy is followed which comprises the following stages:
As a first stage, the normalization described in the previous section using Eq. (3.4.1.5)-(3.4.1.6) is applied to the query word image. The computed center $(c_x, c_y)$ is not guaranteed to be a keypoint; therefore, in the case that $(c_x, c_y)$ is not a keypoint, the nearest keypoint is selected and the proposed descriptor is computed for it.

In a second stage, the similarity between the selected keypoint of the query word image and all keypoints in the document image is calculated. From the resulting similarity set, a ranked list is created, out of which the top N matches are kept for further use. N is chosen as half of the total number of keypoints in the document image.
Figure 3.4.1.6: (a) The original document (b) the initial keypoints (c) the top N keypoints.
In the sequel, for each keypoint that belongs to the top N matches, an LPNN search is performed. As described in the previous section, an LPNN search requires the computation of $(D_x, D_y)$, which in turn depends on knowing the positions of the keypoints of the word image under consideration. Since, in a segmentation-free approach, there is no knowledge about the word image, a stepwise process is used that generates multiple instances of $(D'_x, D'_y)$ over an interval guided by the $(D_x, D_y)$ computed on the query word image. In particular, $D'_x$ takes values in the range $[D_x/2, 3D_x/2]$ and $D'_y$ takes values in the range $[D_y/2, 3D_y/2]$, with steps of $D_x/10$ and $D_y/10$, respectively. In other words, there are ten (10) instances of $(D'_x, D'_y)$, for which a correlation test is used to determine the optimal $(D'_x, D'_y)$ for each keypoint that belongs to the top N matches in the document. This information permits an LPNN search to be realized and, subsequently, a matching between the keypoints of the query word image and the keypoints contained in the search area that is defined by the optimal $(D'_x, D'_y)$ and centered at one of the keypoints belonging to the top N matches.

The optimal $(D'_x, D'_y)$ for each keypoint in the document image is the one that maximizes the correlation of the histograms between the query and the document.
The correlation coefficient between the query vector q and the word vector w is

$cor(q, w) = \frac{\sum_{i=1}^{p}(q_i - \bar{q})(w_i - \bar{w})}{\sqrt{\sum_{i=1}^{p}(q_i - \bar{q})^2 \sum_{i=1}^{p}(w_i - \bar{w})^2}}$ (3.4.1.8)

where $\bar{q}$ and $\bar{w}$ are the mean values of vectors q and w, and p is the length of the vectors, equal to the number of query keypoints.
It is worth noting that, in order to use such a correlation, there must be a correspondence between the vectors. In the problem at hand, each vector is built by sorting the keypoints by their orientation θ with respect to the mean center of the search area (Figure 3.4.1.7(a),(b)). In this way a sorted list is obtained for the query word image and for each of the search areas around the keypoints in the document, wherein the value at each slot of the list equals the relative distance ρ (Figure 3.4.1.7(b)).
After finding the optimal $(D'_x, D'_y)$ for each candidate center point, the matching procedure described in the previous section (segmentation-based case) is applied, wherein a similarity measure is computed between this candidate center point and the query word image.
As a final stage, the system presents to the user all the word images based on ascending sort order
of the calculated similarity measure.
Figure 3.4.1.7: (a) The keypoints considered in an LPNN search for the query word image, (b)
keypoints position in (ρ,θ) terms.
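The correlation test described above can be sketched as follows (a rough illustration under strong assumptions: the (ρ, θ) profiles are truncated or zero-padded to a common length so that the Pearson correlation of Eq. (3.4.1.8) can be applied, which is our simplification rather than the project's exact procedure):

```python
import numpy as np

def rho_profile(points, centre):
    """Distances rho of keypoints from `centre`, sorted by their angle theta."""
    pts = np.asarray(points, dtype=float) - np.asarray(centre, dtype=float)
    theta = np.arctan2(pts[:, 1], pts[:, 0])
    return np.hypot(pts[:, 0], pts[:, 1])[np.argsort(theta)]

def best_search_area(query_pts, query_centre, doc_pts, candidate, Dx, Dy, steps=10):
    """Choose (D'_x, D'_y) maximizing the profile correlation (illustrative)."""
    q = rho_profile(query_pts, query_centre)
    doc = np.asarray(doc_pts, dtype=float)
    best, best_cor = (Dx, Dy), -np.inf
    # Ten joint instances of (D'_x, D'_y) in [D/2, 3D/2], cf. the text above.
    for dx, dy in zip(np.linspace(Dx / 2, 3 * Dx / 2, steps),
                      np.linspace(Dy / 2, 3 * Dy / 2, steps)):
        inside = (np.abs(doc[:, 0] - candidate[0]) <= dx) & \
                 (np.abs(doc[:, 1] - candidate[1]) <= dy)
        if inside.sum() < 2:
            continue
        w = rho_profile(doc[inside], candidate)
        w = np.pad(w, (0, max(0, len(q) - len(w))))[:len(q)]   # equalize lengths
        cor = np.corrcoef(q, w)[0, 1]                          # Eq. (3.4.1.8)
        if np.isfinite(cor) and cor > best_cor:
            best_cor, best = cor, (dx, dy)
    return best
```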
3.4.2 The query-by-string approach
We outline the basic character-lattice (CL) concepts and present the proposed CL-Based KWS
HMM-Filler approach along with its computing cost.
Character-Lattice Basics
A CL on a vector sequence x is a weighted directed acyclic graph with a finite set of nodes Q, including an initial node $q_I \in Q$ and a set of final nodes $F \subseteq (Q - \{q_I\})$. Each node q is associated with a horizontal position in x, given by $t(q) \in [0, n]$, where n is the length of x. For an edge $(q', q) \in E$ ($q' \neq q$, $q' \notin F$, $q \neq q_I$), $c = w(q', q)$ is its associated character and $s(q', q)$ is its score, corresponding to the HMM likelihood that the character $w(q', q)$ appears between frames $t(q')+1$ and $t(q)$, computed during the Viterbi decoding of x. For an illustrative example of a CL, see Fig. 3.4.2.1.
A complete path of a CL is a sequence of edges $\mathcal{P} \equiv (q'_1, q_1)(q'_2, q_2)\cdots(q'_L, q_L)$ such that $q'_1 = q_I$, $q_i = q'_{i+1}$ for $1 \leq i < L$, and $q_L \in F$. A complete path corresponds to a whole line decoding hypothesis and its score is the product of the scores of all its edges: $S(\mathcal{P}) = \prod_{i=1}^{L} s(q'_i, q_i)$.
The Viterbi score of x, S(x), is the maximum of the scores of all complete CL paths. It can be easily and efficiently computed by Dynamic Programming. For our purposes here, we will obtain S(x) by previously obtaining forward (Ψ) or backward (Φ) partial path scores, also very efficiently computed by Dynamic Programming:
$\Psi(q_I) = 1; \qquad \Psi(q) = \max_{(q', q) \in E} \Psi(q')\, s(q', q)$ (3.4.2.1)

$\Phi(q) = 1,\ q \in F; \qquad \Phi(q) = \max_{(q, q'') \in E} s(q, q'')\, \Phi(q'')$ (3.4.2.2)
These equations are similar to those used in the standard backward-forward computations in word-
graphs (see [Wessel 2001]).
Figure 3.4.2.1: A small, partial example of a CL for a handwritten word image. Two edge sequences, (q1, q2), ..., (q4, q5) and (q1, q2), ..., (q7, q8) (in dashed lines), correspond to the character sequence "bore".
CL-based KWS Score for Character Sequences
Let a sub-path e of a CL be defined as a sequence of connected edges of the form (see examples in Fig. 3.4.2.1): $\boldsymbol{e} \equiv (q'_1, q_1)(q'_2, q_2)\cdots(q'_L, q_L)$, with $q_i = q'_{i+1}$, $1 \leq i < L$. The maximum score of a complete path $\mathcal{P}$ which contains a sub-path e is denoted as φ(e, x). It can be efficiently computed using the backward and forward partial scores of Eq. (3.4.2.1)-(3.4.2.2):

$\varphi(\boldsymbol{e}, x) = \max_{\mathcal{P}:\, \boldsymbol{e} \in \mathcal{P}} S(\mathcal{P}) = \Psi(q'_1)\left(\prod_{i=1}^{L} s(q'_i, q_i)\right)\Phi(q_L)$ (3.4.2.3)

where $\boldsymbol{e} \in \mathcal{P}$ means that e is a sub-path of $\mathcal{P}$. Given a character sequence c, let S'(c, x) be the maximum score of a complete path $\mathcal{P}$ containing a sub-path e associated with c:

$S'(c, x) = \max_{\boldsymbol{e}:\, \Omega(\boldsymbol{e}) = c} \varphi(\boldsymbol{e}, x)$ (3.4.2.4)

where $\Omega(\boldsymbol{e})$ is the character sequence $c_1 c_2 \cdots c_L$ associated with e; that is, for $\boldsymbol{e} \equiv (q'_1, q_1)\cdots(q'_L, q_L)$, $c_i = w(q'_i, q_i)$, $1 \leq i \leq L$. Eq. (3.4.2.3)-(3.4.2.4) can be easily computed as shown in Alg. 1.
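Algorithm 1 itself is not reproduced here; the following sketch (with our own, simplified data structures, not the project's implementation) shows how the forward/backward maxima of Eq. (3.4.2.1)-(3.4.2.2) and the sub-path score of Eq. (3.4.2.3) can be computed by dynamic programming over a topologically sorted CL.

```python
from collections import defaultdict

def forward_backward_max(nodes, edges, q_initial, finals):
    """Max-score forward (Psi) and backward (Phi) passes over a CL (sketch).

    `nodes` is a list in topological order, `edges` maps (q_prev, q) -> score,
    `q_initial` is the initial node and `finals` is the set of final nodes.
    """
    incoming, outgoing = defaultdict(list), defaultdict(list)
    for (qp, q), s in edges.items():
        incoming[q].append((qp, s))
        outgoing[qp].append((q, s))

    psi = {q: 0.0 for q in nodes}
    psi[q_initial] = 1.0
    for q in nodes:                                     # Eq. (3.4.2.1)
        for qp, s in incoming[q]:
            psi[q] = max(psi[q], psi[qp] * s)

    phi = {q: (1.0 if q in finals else 0.0) for q in nodes}
    for q in reversed(nodes):                           # Eq. (3.4.2.2)
        for qn, s in outgoing[q]:
            phi[q] = max(phi[q], s * phi[qn])
    return psi, phi

def subpath_score(subpath, psi, phi, edges):
    """phi(e, x) of Eq. (3.4.2.3) for a sub-path e = [(q'_1, q_1), ...]."""
    score = 1.0
    for edge in subpath:
        score *= edges[edge]
    return psi[subpath[0][0]] * score * phi[subpath[-1][1]]
```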
Assuming the CL is complete, if c is the character sequence of a word v to be spotted (including blank spaces), it can be readily seen that S(x) = s_f(x) and S'(c, x) = s_v(x) (cf. Eq. 2.4.2.3). Therefore,

$score(c, x) = \frac{1}{L_c}\left(\log S'(c, x) - \log S(x)\right)$ (3.4.2.5)

where the length normalization is now $L_c = t(q_L) - t(q_1)$.
Computational Complexity
The preprocessing cost is that of generating the CLs for the N text line images and computing the
corresponding forward-backward scores. The cost of producing the CLs is $O(\Gamma\,\bar{n}\,N)$, where $\bar{n}$ is the average line image length and Γ is a (generally large) constant which depends on the size of the CLs. On the other hand, forward-backward computing time is negligible with respect to that of CL generation [Toselli 2013]. In the search phase, according to Alg. 1, the worst-case cost of computing S'(c, x) is $O(m + BL)$, where m is the number of edges of the CL of x, L is the number of characters of c and B is the maximum CL branching factor (≈30 for the CLs of Table 4.4.2.3). Finally, the cost of spotting M keywords in a collection of N text line images is $O((m + BL)\,N\,M)$. The relative improvement with respect to the classical HMM-filler method depends on how $\Gamma\,\bar{n}$ compares with $m + BL$. Real computing times of the proposed and the classical HMM-filler approaches will be reported later in Table 4.4.2.5.
4. Evaluation
4.1 Image Pre-processing
4.1.1 Binarization
At first we tested the developed binarization method on the three TranScriptorium images of the DIBCO 2013 competition [Pratikakis 2013]. Representative results are presented in Fig. 4.1.1.1(c)-(d). In addition, the corresponding results of the best DIBCO 2013 algorithm for the three TranScriptorium images and for the whole dataset are presented in Fig. 4.1.1.1(e)-(f) and Fig. 4.1.1.1(g)-(h), respectively. Finally, the results of the state-of-the-art method [Ntirogiannis 2014] are presented in Fig. 4.1.1.1(i)-(j). From Fig. 4.1.1.1, we can observe that the newly developed method preserves the textual information better.
In the following, representative binarization results are shown in Fig. 4.1.1.2 - 4.1.1.4 for images from the Bentham, Hattem and Plantas collections. In particular, the binarization methods used are the newly developed method, the method of Ntirogiannis et al. [Ntirogiannis 2014] and that of Gatos et al. [Gatos 2006]. From those figures we can conclude that the newly developed binarization method can efficiently handle images from different collections, while the initial method of Ntirogiannis et al. [Ntirogiannis 2014] cannot handle images from the Hattem collection (Fig. 4.1.1.3). In Fig. 4.1.1.4 the developed method does not remove all the bleed-through, but the binarization result is still satisfactory. Lastly, in Fig. 4.1.1.2 it is clear that more textual information is preserved by the newly developed method.
The developed binarization method was tested against the algorithms of the DIBCO 2013 competition [Pratikakis 2013]. The corresponding results are presented in Table 4.1.1.1. The performance of the developed method is very high; in particular, it is ranked 3rd in terms of F-Measure. It is worth mentioning that the developed method has a very high recall rate, since our goal is to detect as much textual information as possible.
Table 4.1.1.1: Quantitative results for the TS images of the DIBCO 2013 competition
Method F-Measure Recall Precision
3 94.43292 92.04163 96.95178
5 93.67852 90.20371 97.43178
New 92.02438 97.30573 87.35464
17 91.24978 89.43323 93.14166
13 90.07855 85.59744 95.05475
7 89.89771 85.26649 95.0609
2 89.79226 85.01224 95.14184
9 89.25338 84.22794 94.91656
8.2 88.94308 82.50688 96.4684
18 88.93362 83.38049 95.2792
10.1 88.90041 84.4482 93.8482
15.2 88.86937 81.13896 98.22791
10.2 88.29652 82.9598 94.36705
10.3 88.13874 82.52364 94.57376
8.1 87.56595 82.45027 93.35843
11 85.43979 76.60897 96.57174
12 85.17727 79.9144 91.18219
8.3 84.55081 74.59911 97.5664
15.1 83.65741 79.38634 88.41419
4 83.01694 71.65871 98.65406
6 81.53499 70.79074 96.12423
16 81.09038 69.92609 96.49693
14 80.26672 69.35751 95.24829
1 36.3517 92.9747 22.59251
Figure 4.1.1.1: Comparison between the new binarization method and state-of-the-art methods: (a)-(b) original images; (c)-(d) the newly developed method; (e)-(f) best DIBCO 2013 method for the three TranScriptorium images (Howe [Howe 2012]); (g)-(h) DIBCO 2013 winning method for the whole dataset; (i)-(j) method of Ntirogiannis et al. [Ntirogiannis 2014].
Figure 4.1.1.2: Representative Bentham case: (a) Original image; (b) developed binarization; (c)
Gatos et al. [Gatos 2006] binarization; (d) Ntirogiannis et al. [Ntirogiannis 2014].
Figure 4.1.1.3: Representative Hattem case: (a) Original image; (b) developed binarization; (c)
Gatos et al. [Gatos 2006] binarization; (d) Ntirogiannis et al. [Ntirogiannis 2014].
Figure 4.1.1.4: Representative Plantas case: (a) Original image; (b) developed binarization; (c)
Gatos et al. [Gatos 2006] binarization; (d) Ntirogiannis et al. [Ntirogiannis 2014].
4.1.2 Border Removal
In order to record the efficiency of the developed border removal algorithm, we manually marked the correct text region in the binary images to create the ground truth set (see Fig. 4.1.2.1(a)). The performance evaluation is based on a pixel-based approach and counts the pixels in the correct text region and in the result image after border removal (see Fig. 4.1.2.1(b)). Let G be the
set of all pixels inside the correct text region in ground truth, R the set of all pixels inside the result
image and T(s) a function that counts the elements of set s. We calculate the precision and recall as
follows:
$Precision = \frac{T(G \cap R)}{T(R)}, \qquad Recall = \frac{T(G \cap R)}{T(G)}$ (4.1.2.1)

A performance metric FM can be extracted if we combine the values of precision and recall:

$FM = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$ (4.1.2.2)
In our example (see Fig. 4.1.2.1) the precision is 100% because all the black borders have been removed, while the recall is 94% because marginal text (which belongs to the correct text region) has been cropped.
Figure 4.1.2.1: (a) Marked text region (ground truth); (b) result after borders removal.
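The pixel-based metrics of Eq. (4.1.2.1)-(4.1.2.2) are straightforward to compute from two binary masks; the snippet below is a small illustration (names are ours):

```python
import numpy as np

def region_metrics(ground_truth_mask, result_mask):
    """Pixel-based Precision, Recall and FM (Eq. 4.1.2.1 and 4.1.2.2)."""
    g = np.asarray(ground_truth_mask, dtype=bool)   # correct text region
    r = np.asarray(result_mask, dtype=bool)         # region kept after border removal
    intersection = np.logical_and(g, r).sum()
    precision = intersection / r.sum() if r.sum() else 0.0
    recall = intersection / g.sum() if g.sum() else 0.0
    fm = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, fm
```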
We used a set of 533 randomly selected handwritten historical images (433 images from the
Bentham dataset, 50 images from the Plantas dataset and 50 images from Hattem dataset). For
comparison purposes, we applied the state-of-the-art method of D.X. Le [Le 1996] to the same set. Table 4.1.2.1 illustrates the average precision, recall and FM of the two methods. As can be observed, the proposed border removal method outperforms the state-of-the-art method and achieves FM = 93.76%.
Table 4.1.2.1: Border Removal - Evaluation results
Method Precision Recall FM
TranScriptorium 90.53% 97.23% 93.76%
D.X. Le 85.40% 99.54% 91.93%
4.1.3 Image Enhancement
Color Enhancement
Background Estimation and Image Normalization
For the enhancement of the grayscale image we conducted an end-to-end experiment using the HTR/HTK tool and the initial set of 53 images from the Bentham collection. We randomly partitioned the image collection into two sets, train (1203 text line images) and test (140 text line images). The word error rate using the original images was approximately 29.1. Using a smoothing filter, specifically a 5x5 Wiener filter (Fig. 4.1.3.1d), the word error rate decreased to 28.9, while using a 3x3 Gaussian filter (Fig. 4.1.3.1c) it decreased to 28.3, enhancing the overall performance. On the contrary, using a filter with the opposite effect, i.e. one highlighting local contrast and details, such as an unsharp filter (Fig. 4.1.3.1e), the word error rate increased to approximately 30.1. Unexpectedly, with background smoothing using the background estimation method (Fig. 4.1.3.1b), even though it produces visually satisfactory results, the word error rate increased to 30.3.
In addition, we also used a much larger dataset consisting of 433 completely ground-truthed images from the Bentham collection. This set contains approximately 11470 line images, partitioned into a train set of 9195 line images, a validation set of 1415 line images and a test set of 860 line images. For the HTR/HTK experiments we used the train and the validation sets. The word error rate using the original images was approximately 41.55. Using a smoothing filter, for instance the 5x5 Wiener filter, the word error rate slightly decreased to 41.00, while using the background smoothing filter the word error rate remained almost the same at 41.53.
Figure 4.1.3.1: Example enhancement methods: (a) original text line image; (b) background smoothing using background estimation; (c) Gaussian filter (3x3); (d) Wiener filter (5x5); (e) unsharp filter (3x3).
Background and noise removal
The methods developed for background and noise removal have been evaluated using two datasets, Esposalles and Plantas, for which details are provided in the section on the evaluation of the handwritten text recognition task (Section 4.3).
To make a fair comparison of the different pre-processing methods, in the experiments only the pre-
processing is varied. The rest of the HTR system parameters were chosen arbitrarily and kept fixed.
Figure 4.1.3.2: Example images from the Esposalles corpus comparing the original image with: the
previous background removal method, Sauvola thresholding, the proposed method, the proposed
method with contrast stretching and the proposed method with both contrast stretching and RLSA
cleaning.
The Esposalles corpus was used for preliminary testing. A few images were chosen at random, and
these were visually inspected in order to observe the results obtained for each of the methods and
the impact that each parameter had. In Fig. 4.1.3.2 a few example images are presented comparing
the original and the result after applying the thresholding or the proposed technique. It can be
observed that strokes that are locally lighter than other neighboring strokes, with thresholding there
is a risk that these are discarded. On the other hand, with the proposed technique there is a much
lower chance that locally light strokes be completely removed, instead these are mapped to a shade
of gray. Furthermore, the handwritten text looks smoother and more natural since all of the
boundaries between background and strokes have a softer transition.
After this visual inspection, a single set of parameters was chosen, specifically W = 30, R = 128, k = 0.2 and s = 1, and an experiment was run taking as performance measure the well-known Word Error Rate (WER). The rest of the processing steps and the training of the models were exactly the same as in [Romero 2013]. Table 4.1.3.1 presents the WER for the best previously published result on that dataset, as well as the results obtained with Sauvola thresholding and with the proposed technique. The previously used technique performs considerably worse than local thresholding, which obtains a relative improvement of about 16.8%. Comparing the results of the proposed technique and the thresholding, there is a smaller, although still significant, improvement. From this we can observe that the hard decision imposed by a thresholding technique discards information that is useful to a recognizer based on HMMs and Gaussian mixtures.
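For reference, the Sauvola baseline used in this comparison is available in scikit-image (which requires an odd window size, hence 31 rather than the W = 30 above). The soft variant sketched below only illustrates the idea of avoiding a hard decision by mapping lighter pixels towards white instead of discarding them; it is an assumption-laden approximation (in particular the role of the parameter s and the exact gray mapping are our guesses), not the actual proposed technique.

```python
import numpy as np
from skimage.filters import threshold_sauvola

def sauvola_binarize(gray, window=31, k=0.2, r=128):
    """Hard Sauvola thresholding, as used for the baseline in Table 4.1.3.1."""
    t = threshold_sauvola(gray, window_size=window, k=k, r=r)
    return (gray <= t).astype(np.uint8)              # 1 = text (dark) pixel

def soft_background_removal(gray, window=31, k=0.2, r=128, s=1.0):
    """Illustrative soft alternative: no hard decision on locally light strokes."""
    gray = gray.astype(float)
    t = threshold_sauvola(gray, window_size=window, k=k, r=r)
    # Pixels at or below the local threshold keep their value (strokes);
    # pixels above it are pushed towards white proportionally to how far
    # above the threshold they are (hypothetical mapping, scaled by s).
    lighter = np.clip((gray - t) * s, 0.0, None)
    out = np.where(gray <= t, gray, np.minimum(255.0, gray + lighter))
    return out.astype(np.uint8)
```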
The Plantas corpus was used for a more in-depth evaluation of the proposed technique. The
parameters of the algorithms were varied in order to obtain the best performance in each case. The
results are also presented in Table 4.1.3.1, although in this case Character Error Rate (CER) was
used in order to observe the benefit at a character level. For this corpus the proposed technique also
performs significantly better than Sauvola thresholding.
Table 4.1.3.1: Recognition results for the Esposalles and the Plantas corpora comparing the
proposed technique with previously published results and the Sauvola thresholding.
Dataset Approach WER (%) 95% conf. int.*
Esposalles Previously published [Romero 2013] 16.1 15.8–16.4
Esposalles Sauvola thresholding 13.4 13.1–13.7
Esposalles Proposed technique 13.1** 12.8–13.4
Dataset Approach CER (%) 95% conf. int.
Plantas Sauvola thresholding 25.4 25.0–25.7
Plantas Proposed technique 24.7*** 24.3–25.0
*Wilson interval estimation.
**Better than Sauvola for a confidence level of 90% using a two-proportion z-test.
***Better than Sauvola for a confidence level of 99% using a two-proportion z-test.
Show-through reduction
As mentioned earlier, the show-through reduction technique being developed is still at an early stage. Work remains to be done on the theory behind the approach and on exhaustive experimentation. For the time being, we have very interesting results based on simple visual inspection of the images. Figure 4.1.3.3 presents an example image from the Plantas corpus that shows the kind of results being obtained.
[Figure 4.1.3.3 panels: original color scanned image; standard grayscale converted image; image after removal of show-through]
Figure 4.1.3.3: Example image from the Plantas corpus comparing the original image and the result
obtained after show-through reduction.
Binary Enhancement
As far as the enhancement of the binary image is concerned, it has only been evaluated visually. A representative example of contour enhancement of the binary image is shown in Fig. 4.1.3.4. The contour is smoothed and the "teeth" effect is greatly reduced.
Figure 4.1.3.4: Contour enhancement of the binary image: (a) original image; (b)-(c) binarization result before and after the contour enhancement.
4.1.4 Deskew
In order to directly evaluate the estimation accuracy of the described methods we developed a tool
which allowed us to parse the pages of the Bentham dataset, to extract the text lines from the xml
ground truth file and to extract the words from each text line. For each word the skew angle was
marked and its length was computed. After all the words of each text line had been treated, the tool calculated a weighted average of the skew angles computed for the words, weighted by their length, and proposed a skew angle for the text line. The user was able to mark and further correct the skew of each text line if the calculation was not satisfactory. After all the text lines of a page had been parsed, a weighted average of the skew angles of the text lines was computed and proposed as the skew angle of the page. In this way, after treating 10 representative pages of the Bentham
collection, randomly chosen from the different boxes of the first part, 1949 words, 247 text lines
and 10 pages with known ground truth were created. Those images were used to form the datasets
for direct evaluation.
The performance evaluation of the algorithm is based on a well-established method for document
skew estimation similar to the one used in ICDAR Document Image Skew Estimation Contest 2013
(DISEC’13) [Papandreou 2013b]. For every document image of the dataset the distance between the
ground truth and the estimation of the algorithm was calculated. In order to measure the
performance of the methods the following three criteria were used: (a) the Average Error Deviation
(AED), (b) the Average Error Deviation of the Top 80% of the results of each algorithm (TOP80)
and (c) the percentage of Correct Estimations (CE) (error deviation lower than 0.2º). The evaluation
will record the accuracy and the efficiency of the methodologies.
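These three criteria can be computed as follows (a small illustration; we assume the "top 80%" refers to the 80% smallest error deviations):

```python
import numpy as np

def skew_evaluation(gt_angles, estimated_angles, correct_tolerance=0.2):
    """AED, TOP80 and CE criteria for skew estimation (illustrative sketch)."""
    deviations = np.abs(np.asarray(gt_angles) - np.asarray(estimated_angles))
    aed = deviations.mean()                                        # AED
    best_80 = np.sort(deviations)[:int(np.ceil(0.8 * len(deviations)))]
    top80 = best_80.mean()                                         # TOP80
    ce = 100.0 * (deviations < correct_tolerance).mean()           # CE (%)
    return aed, top80, ce
```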
In order to obtain a larger evaluation set for the page-level evaluation, we rotated each page by 11 different angles, from -5 to 5 degrees with a one-degree step. In this way, a dataset of
110 pages with known ground truth was created. In order to compare the developed method with the
current state-of-the-art skew estimation techniques we implemented (i) the classical Projection
Profiles (PP) [Postl 1986], (ii) the enhanced vertical profile (eVP) [Papandreou 2011], (iii) the
classical Hough Transform (HT) [Srihari 1989] and (iv) the Yan Cross Correlation (CC) [Yan 1993].
The range of expected skew for all projection algorithms was set from −6° to 6°, while the rotation step, and consequently the accuracy, was 0.05°. It should also be noted that the estimations of
each algorithm were rounded to the second decimal. The results presented in Table 4.1.4.1
demonstrate the performance of the developed method in comparison with the state-of-the-art skew
estimation algorithms.
Table 4.1.4.1: Results of Skew Estimation Algorithms in Page Level
Skew Estimation Technique
Page Level Dataset
AED (°) TOP80 (°) CE (%)
Projection Profiles (PP) 0.92 0.72 15.45
Enhanced Vertical Profile (eVP) 1.13 0.76 8.18
Hough Transform (HT) 0.95 0.76 14.27
Cross Correlation (CC) 2.63 1.33 15.45
Developed Method (eHP c-f) 0.78 0.49 23.63
Concerning the text line level performance evaluation, we excluded from the 247 cropped text lines
those that consisted of less than 2 words and a dataset was formed with the remaining 215 text lines.
The same three metrics were used in order to evaluate the performance of the algorithms. In order to
make a comparison of the developed method with the current state-of-the-art techniques the four
above mentioned algorithms were used and moreover, we also implemented the regression (R)
algorithm [Madhvanath 1999]. The results presented in Table 4.1.4.2 demonstrate the performance
of the developed method in comparison with the state-of-the-art skew estimation algorithms.
Table 4.1.4.2: Results of Skew Estimation Algorithms in Text Line Level
Skew Estimation Technique
Text Line Level Dataset
AED (°) TOP80 (°) CE (%)
Projection Profiles (PP) 0.35 0.19 47.44
Enhanced Vertical Profile (eVP) 0.88 0.58 16.74
Hough Transform (HT) 0.24 0.14 59.06
Cross Correlation (CC) 2.32 1.39 7.9
Regression (R) 0.28 0.15 55.34
Developed Method (eHP c-f) 0.32 0.18 51.62
Concerning the word level evaluation, we excluded from the 1949 cropped word images the words that consisted of fewer than 3 letters, as well as the smeared words, and a dataset was formed with the remaining 1376 marked words; the same three metrics were used to evaluate the performance of the algorithms. In this case, though, we regard as correct the estimations whose error deviation is lower than 0.5°. In order to make a comparison of the developed method with
the current state-of-the-art skew estimation techniques we implemented (i) (PP), (ii) (eVP), (iii)
(HT), (iv) (R), (v) the method that was developed for the page and text line level (eHP c-f) and (vi)
the algorithm of Blumenstein et al. [Blumenstein 2002] that takes advantage of the centers of mass
(CM) of the word image. The results presented in Table 4.1.4.3 demonstrate the performance of the
developed method in comparison with the other skew estimation algorithms.
Table 4.1.4.3: Results of Skew Estimation Algorithms in Word Level
Skew Estimation Technique
Word Level Dataset
AED (°) TOP80 (°) CE (%)
Projection Profiles (PP) 1.25 0.77 40.82
Enhanced Vertical Profile (eVP) 1.94 1.43 12.42
Hough Transform (HT) 2.45 0.82 31.24
Regression (R) 3.94 0.71 37.86
Developed Method (eHP c-f) 0.96 0.72 43.01
Centers of Mass (CM) 0.79 0.64 51.38
Developed Method 0.66 0.58 57.99
As can be observed in Table 4.1.4.2, when it is applied to text lines, the performance of the
developed skew estimation method is close to HT and R, which proved to be the most efficient
state-of-the-art methods. Moreover, as can be seen in Table 4.1.4.3 it outperforms both the HT and
R methods on word skew estimation. The fact that it performs well both on text lines and on words is a great advantage of the method, since there are several text lines in the documents that contain only one word or one letter; in those cases both the HT and R algorithms fail with a large error deviation.
Table 4.1.4.3 also shows that the developed algorithm outperforms all state-of-the-art algorithms at the word level, and proves to be more efficient. Our proposal is that eHP c-f should be applied at the page level (before text line segmentation), then at the text line level (preceding word segmentation), and finally the word skew estimation should be applied in order to correct the skew of each word.
4.1.5 Deslant
In order to directly evaluate the slant estimation accuracy of the described methods we developed a
tool which allows us to parse the pages of the Bentham dataset, extract the text lines from the xml
ground truth file and then extract the words from each text line. For each word the slant angle was
marked. After all the words of each text line were treated the tool calculated the average of the slant
angles computed for the words and estimated the dominant slant angle for the text line. In this way,
after processing the same 10 pages of the Bentham collection, that were chosen for the skew
estimation evaluation, the same 1949 words and 247 text lines with known ground truth were
created. Those images were used to form the datasets for direct evaluation.
The performance evaluation of the algorithms is based on the same method that was used for the
skew estimation algorithms. For every document image of the dataset the distance between the
ground truth and the estimation of the algorithm was calculated. In order to measure the
performance of the methods the following three criteria were used: (a) (AED), (b) (TOP80) and (c)
(CE) (correct is regarded an error deviation less than 5º). The evaluation will record the accuracy
and the efficiency of the methodologies.
Concerning the text line level performance evaluation, we excluded from the 247 cropped text lines
those that were smeared or consisted of one or two letters and a dataset was formed with the
remaining 228 text lines. In order to make a comparison of the developed method with the current
state-of-the-art skew estimation techniques we implemented (i) the Bozinovic and Srihari
[Bozinovic 1989], (ii) the Papandreou and Gatos [Papandreou 2012], (iii) the Kimura et al. [Kimura
1993], (iv) the Ding et al. [Ding 2000] and (v) Vinciarelli and Juergen [Vinciarelli 2001]
algorithms. The range of expected slant for the projection algorithm of Vinciarelli and Juergen was set from −50° to 50°, while the step, and consequently the accuracy, was 1°. The results
presented in Table 4.1.5.1 demonstrate the performance of the developed method in comparison
with the state-of-the-art slant estimation algorithms.
Table 4.1.5.1: Results of Slant Estimation Algorithms in Text Line Level
Slant Estimation Technique
Text Line Level Dataset
AED (°) TOP80 (°) CE (%)
Bozinovic and Srihari 11.75 6.86 38.59
Papandreou and Gatos 9.36 5.57 49.56
Kimura et al. 35.88 33.81 2.19
Ding et al. 23.64 20.95 10.52
Vinciarelli and Juergen 8.47 5.29 57.92
Developed Method 7.18 3.90 60.32
As can be observed, the developed method outperforms the state-of-the-art algorithms. Although its performance is similar to that of the Vinciarelli and Juergen technique, the latter is a projection profile algorithm that applies the shear transform several times to every image in order to detect the dominant slant; due to this fact it is more computationally expensive and time consuming, especially for text line images.
Concerning the word level evaluation, we excluded from the 1949 cropped word images the words that consisted of fewer than 2 letters, as well as the smeared words, and a dataset was formed with the remaining 1693 marked words; the same three metrics were used in order to evaluate
the performance of the algorithms. In order to make a comparison of the developed method with the
current state-of-the-art techniques the same five, above mentioned, algorithms were used. The
results presented in Table 4.1.5.2 demonstrate the performance of the developed method in
comparison with the state-of-the-art slant estimation algorithms.
Table 4.1.5.2: Results of Slant Estimation Algorithms in Word Level
Slant Estimation Technique
Word Level Dataset
AED (°) TOP80 (°) CE (%)
Bozinovic and Srihari 13.12 7.26 33.72
Papandreou and Gatos 10.35 6.03 46.96
Kimura et al. 37.6 35.19 1.83
Ding et al. 23 20.88 11.04
Vinciarelli and Juergen 8.06 4.97 59.20
Developed Method 7.24 3.96 59.87
The performance of the slant estimation algorithm at the word level is close to its performance at the text line level. The developed algorithm outperforms the state-of-the-art algorithms and has a performance close to that of the Vinciarelli and Juergen algorithm, without its drawbacks.
4.2 Image Segmentation
4.2.1 Detection of main page elements
Detection of main text zones
In order to record the efficiency of the developed text zone detection method, we created a set of 300 handwritten historical images from the Bentham, Plantas and Hattem datasets. These images had also been used for the border removal evaluation, so we used their border-removed versions (the ground truth of the border removal evaluation). As a next step, we manually marked the main text zones. These correspond to one area for single-column documents or two areas for two-column documents (see Fig. 4.2.1.1).
Figure 4.2.1.1: The ground truth of main document text zones
The performance evaluation of the developed method is based on comparing the number and the
coordinates of areas between the result and the ground truth. A result is assumed correct only if the
number of areas is the same and the horizontal coordinates of the detected and ground truth areas
are almost the same (their absolute difference is less than image_width / 30). The results showed
that in 254 out of the 300 images, the main text zones were correctly detected. This leads to an
accuracy of ~ 85%.
Segmentation of structured documents using stochastic grammars
We used a 2D-SCFG model for segmenting structured documents of marriage license books. The input image of a page is represented as a matrix of w x h cells, each cell accounting for a region of 50 x 50 pixels. We carried out an experiment using 80 pages of this dataset. The training
set contained 60 pages and the remaining 20 pages were used as test set. We followed this
experimentation in order to obtain results comparable to those in [Cruz 2012]. In that paper they
used the same text classification features (Gabor and RLF) but a different logical model:
Conditional Random Fields (CRF).
As a result of the segmentation process we obtain the most likely layout class for each cell, hence,
we evaluated the recognition performance by reporting the average precision, recall and F-measure
computed at cell level for each page in the test set. Table 4.2.1.1 shows the experimental results
obtained by both sets of text features and both logical models. More details can be found in [Alvaro
2013] and [Cruz 2012].
Table 4.2.1.1: Classification results for different models and text classification features.
Model | Class | Gabor (Precision / Recall / F-measure) | Gabor + RLF (Precision / Recall / F-measure)
CRF | Body | 0.89 / 0.87 / 0.88 | 0.88 / 0.96 / 0.92
CRF | Name | 0.27 / 0.82 / 0.40 | 0.77 / 0.73 / 0.74
CRF | Tax | 0.35 / 0.91 / 0.50 | 0.69 / 0.66 / 0.66
2D SCFG | Body | 0.88 / 0.97 / 0.92 | 0.88 / 0.97 / 0.92
2D SCFG | Name | 0.72 / 0.93 / 0.81 | 0.82 / 0.83 / 0.82
2D SCFG | Tax | 0.39 / 0.94 / 0.54 | 0.63 / 0.69 / 0.64
Results showed that the classes Body and Name were classified with good F-measure rates, whereas the class Tax represented the most challenging part. This is related to the percentage of the page that each region occupies, since small regions are more difficult to classify properly.
There are two factors to take into account: the text classification features and the page segmentation
model. Regarding text classification features, results using CRF as page segmentation model
showed that the inclusion of the RLF information significantly outperformed the results obtained
with Gabor features. However, RLF did not produce such a great improvement when grammars were the segmentation model. The 2D SCFG took advantage of the knowledge about the document structure; thus, grammars were able to overcome the shortcomings of the Gabor features, obtaining good results for the classes Body and Name without the additional spatial information provided by RLF. Even so, the inclusion of RLF in the 2D SCFG classification slightly improved the segmentation results for the class Name and also achieved a good F-measure value for the class Tax.
Comparing the performance of 2D SCFG with CRF, results showed that grammars achieved a great
improvement when Gabor features were used, which was not so prominent with RLF features.
Moreover, CRF recognized regions but did not identify the explicit segmentation in records.
However, 2D SCFG can provide the coordinates of every region in the most likely hypothesis.
Using this information we also carried out a semantic evaluation. We manually evaluated the results according to two different metrics: the number of records properly segmented, in terms of F-measure; and the number of lines correctly detected within the Body of the proper record, in view of a future line segmentation process. Results showed that the combination of 2D-SCFG and RLF improved the final record detection as well as the number of lines properly segmented, with F-measure values of 0.80 and 0.89, respectively.
Page Split
In order to record the efficiency of the developed page split algorithm, we used the same performance evaluation method as for the border removal algorithm. We manually marked the correct page frames in the binary double-page images in order to create the ground truth set (see Fig. 4.2.1.2). The performance evaluation is based on a pixel-based approach and counts the pixels in the correct page frames and in the result page frames after the page split. Let G be the set of all pixels inside the correct page frame in the ground truth, R the set of all pixels inside the result image and T(s) a function that counts the elements of set s. We calculate the precision and recall as follows:
$Precision = \frac{T(G \cap R)}{T(R)}, \qquad Recall = \frac{T(G \cap R)}{T(G)}$ (4.2.1.1)

A performance metric FM can be extracted if we combine the values of precision and recall:

$FM = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$ (4.2.1.2)
Figure 4.2.1.2: Marked page frames in a double document image (ground truth)
We used a set of 140 randomly selected double-page handwritten historical images from 4 different books. In order to calculate the Precision, Recall and FM of a double-page image, we calculate the values for each individual page frame and then extract the average values. Table 4.2.1.2 illustrates the average precision, recall and FM of the developed page split methodology. As can be observed, it achieves FM = 93.21%.
Table 4.2.1.2: Page Split - Evaluation results
Method Precision Recall FM
TranScriptorium 87.68% 99.50% 93.21%
4.2.2 Text Line Detection
Polygon Based Text Line Segmentation
For the evaluation of the text line segmentation methods we follow the same protocol which was
used in the ICDAR 2013 Handwriting Segmentation Competition [Stamatopoulos 2013]. The
performance evaluation method is based on counting the number of one-to-one matches between
the areas detected by the algorithm and the areas in the ground truth (manual annotation of correct
text lines). We use a MatchScore table whose values are calculated according to the intersection of
the ON pixel sets of the result and the ground truth. Let $G_i$ be the set of all points of the $i$-th ground truth region, $R_j$ the set of all points of the $j$-th result region, and $T(s)$ a function that counts the elements of set s. The table entry MatchScore(i, j) represents the matching result of the $i$-th ground truth region and the $j$-th result region as follows:
$MatchScore(i,j) = \frac{T(G_i \cap R_j)}{T(G_i \cup R_j)}$ (4.2.2.1)

We consider a region pair as a one-to-one match only if the matching score is equal to or above the evaluator's acceptance threshold Ta. If N is the count of ground-truth elements, M is the count of result elements, and o2o is the number of one-to-one matches, we calculate the detection rate (DR) and recognition accuracy (RA) as follows:

$DR = \frac{o2o}{N}$ (4.2.2.2)

$RA = \frac{o2o}{M}$ (4.2.2.3)

A performance metric FM can be extracted if we combine the values of detection rate and recognition accuracy:

$FM = \frac{2 \cdot DR \cdot RA}{DR + RA}$ (4.2.2.4)
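A compact illustration of this protocol follows (with a simple greedy one-to-one pairing, whereas the competition tool builds the full MatchScore table; representing regions as pixel sets is our choice):

```python
def segmentation_metrics(gt_regions, result_regions, ta=0.95):
    """DR, RA and FM from one-to-one matches (Eq. 4.2.2.1-4.2.2.4, sketch).

    Each region is a set of (x, y) pixel tuples; `ta` is the acceptance
    threshold (0.95 for the text line segmentation evaluation).
    """
    o2o, used = 0, set()
    for gi in gt_regions:
        for j, rj in enumerate(result_regions):
            if j in used:
                continue
            match_score = len(gi & rj) / len(gi | rj)          # MatchScore(i, j)
            if match_score >= ta:
                o2o += 1
                used.add(j)
                break
    dr = o2o / len(gt_regions) if gt_regions else 0.0          # detection rate
    ra = o2o / len(result_regions) if result_regions else 0.0  # recognition accuracy
    fm = 2 * dr * ra / (dr + ra) if dr + ra else 0.0
    return dr, ra, fm
```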
The proposed text line segmentation method (denoted as NCSR in the experimental results table)
has been tested on 433 handwritten document images derived from the Bentham collection. For all
these document images, we have manually created the corresponding text line segmentation ground
truth using the Aletheia tool and stored it using the PAGE xml format. The total number of text lines
appearing on those images was 11468. The acceptance threshold for the text line segmentation
evaluation is set to Ta = 0.95. For the sake of comparison, we also tested the winning method of the
ICDAR2013 Handwriting Segmentation Competition [Stamatopoulos 2013] for the text line
segmentation task as well as the method proposed by Nicolaou et al. [Nicolaou 2009]. Table 4.2.2.1
shows comparative experimental results in terms of detection rate (DR), recognition accuracy (RA) and FM. As can be observed from Table 4.2.2.1, the proposed method (NCSR) outperforms the other two approaches, achieving a detection rate of 83.08%, a recognition accuracy of 86.35% and an FM of 84.68%.
Table 4.2.2.1: Comparative experimental results on the dataset of the Bentham collection.
Method | N | M | o2o | DR (%) | RA (%) | FM (%)
NCSR | 11468 | 11034 | 9528 | 83.08 | 86.35 | 84.68
ICDAR2013 Handwriting Segmentation Competition winner [Stamatopoulos 2013] | 11468 | 11051 | 9418 | 82.12 | 85.22 | 83.64
Nicolaou [Nicolaou 2009] | 11468 | 10022 | 8617 | 75.13 | 85.98 | 80.19
Baseline Detection
In order to assess the quality of the baseline detection approach, two kinds of measures have been adopted: the "line error rate" (LER), considered a qualitative measure, is calculated as the number of incorrectly assigned line labels divided by the total number of correct line regions; and the "segmentation error rate" (SER) which, addressing the evaluation from a more quantitative point of view, measures the geometrical accuracy of the detected line region positions with respect to the corresponding (correct) reference marks. The LER is obtained by comparing the sequences of automatically obtained region labels (ĥ in Eq. 3.2.2.1) with the corresponding reference sequences. This is computed in the same way as the well-known WER, with equal editing costs assigned to deletions, insertions and substitutions [McCowan 2004]. The SER computation is carried out in two phases. First, for each page, we find the best alignment between the system-proposed line position marks (Eq. 3.2.2.1) and the page reference marks by minimizing their accumulated absolute differences using dynamic programming. Then, in the second phase, taking into account only the detected regions marked as NL, the SER is calculated as the average value (and standard deviation) of the already computed absolute differences of the position marks over all pages globally.
In order to evaluate the approach used for this task, we performed experiments on a number of different corpora.
In the following sections we describe each corpus used and provide experimental results for it.
Cristo Salvador Database
Experiments were carried out using a corpus compiled from a XIX century Spanish manuscript
identified as “Cristo-Salvador” (CS), which was kindly provided by the Biblioteca Valenciana
Digital (BiVaLDi)2. This is a rather small document composed of 53 color images of text pages,
scanned at 300 dpi and written by a single writer.
In this case, we employ the already predefined book partition [Romero 2007]. This partition
divides the dataset into a test set containing 20 page images and a training set composed of the 33
remaining pages.
For this corpus, text regions were classified into one of the following types: normal line, short line,
paragraph line, blank space, non-text zone and inter line.
Esposalles Database
The ESPOSALLES database [Romero 2013] is compiled from a book of the Marriage License Books collection (called Llibres d'Esposalles), conserved at the Archives of the Cathedral of Barcelona. This collection is composed of 291 books with information on approximately 600,000 marriage licenses given in 250 parishes between 1451 and 1905.
These demographic documents have already been proven to be useful for genealogical research and population investigation, which renders their complete transcription an interesting and relevant problem [Esteve 2009]. The contained data can be useful for research into population estimates, marriage dynamics, cycles, and indirect estimations of fertility, migration and survival, as well as socio-economic studies related to social homogamy, intra-generational social mobility, and inter-generational transmission and sibling differentials in social and occupational positions.
In our experiments, we used only volume 69, which encompasses 173 pages [Vidal 2012] containing a total of 5,831 text lines. We used the following page hold-out partition:
Training - consisting of the first 150 pages; used for parameter training and parameter
tuning.
Test - consisting of the last 23 pages; used only for final evaluation.
For this corpus, text regions were classified into one of the following types: normal line, start paragraph line, end paragraph line, end paragraph with ruler, header, footer, blank space, inter line and inter line with added text.
2http://bv2.gva.es
Bentham Database
The Bentham collection has more than 80,000 documents, most of them digitized. From the
digitized documents, more than 6,000 have been transcribed during the Transcribe Bentham
initiative3. The transcripts are recorded in TEI-compliant XML format.
The digitized material covers legal reform, punishment, the constitution, religion, and his panopticon prison scheme. The Bentham Papers include manuscripts written by Bentham himself over a period of sixty years, as well as fair copies written by Bentham's secretarial staff. This is important to the tranScriptorium project, as Bentham's handwriting is complex enough to challenge the HTR software, while manuscripts written by secretarial staff provide variety. Bentham's manuscripts are often complicated by deletions, marginalia, interlineal additions and other features. Some manuscripts are written on large sheets artificially divided, and a few are divided into tables with several columns. Portions of the collection are written in French, and Bentham occasionally uses Latin and ancient Greek in his writings.
In order to produce the Ground Truth (GT) as much automatically as possible and to carry out HTR
experiments, the images were classified according to their difficulty, for being processed in
increasing order of difficult. In [Gatos 2014] a detailed description of the GT creation process can
be found. Three criteria have been used for classifying the documents: first, a score was computed
for each image according to the TEI elements that appear in the transcripts. Thus, a document image
with many deletions, additions, stroke-out words, etc. had lower score than a document image that
had few of these elements. Second, after sorting the images according to this score, they were
classified according to their physical geometry. For the first round, only the images that had 2731 x
4096 pixels were chosen. Third, the resulting images were automatically clustered in three classes:
images that had a main text block in one column and a left margin, images that had a main text block
in one column and a right margin, and others. Horizontal projections were used to perform this
clustering. Finally, the initially chosen images were those that belonged to the first two clusters
and whose score was above a given threshold. The initial set is composed of 798 pages.
This initial set has been divided into two batches: the first one is composed of the 433 easier pages
and the second batch is composed of the rest.
We used the 1st Batch to evaluate our approach on this corpus by dividing it into two sets:
Training - consisting of the first 215 pages; used for parameter training and parameter
tuning.
Test - consisting of the last 218 pages; used only for final evaluation.
Due to the large size of this corpus, text regions were classified into one of the following types:
normal line, blank space and inter line. This simple labelling was performed in order to reduce the
correction and labelling effort required to create the ground truth, but it implies that the HMMs are
burdened with coping with much more variability.
Plantas Database
The PLANTAS database has been compiled from the manuscript piece entitled “Historia de las
plantas” written by Bernardo de Cienfuegos in the 17th century. In this case, HTR experiments were
only carried out on the PROLOGO chapter, which belongs to the first volume of the manuscript.
The PROLOGO chapter is composed of 38 pages and was divided, for experimentation, into two
sets:
Training - consisting of the first 20 pages; used for parameter training and parameter tuning.
Test - consisting of the last 18 pages; used only for final evaluation.
3http://blogs.ucl.ac.uk/transcribe-bentham/
For all of the above databases, our experimentation process consisted of performing training and
parameter tuning on the database's training partition; with the best parameters obtained, we then
performed a single experiment on the test partition.
In Table 4.2.2.2 we report the best LER and SER figures achieved for classification through the
indicated experimentation process. The SER mean and standard deviation are given in this case as a
percentage of the average text line width (which depends on the actual corpus).
Table 4.2.2.2: LER and SER achieved
Database LER(%) SER μr(%) SER σr(%)
CS 0.7 8.81 15.40
Esposalles 8.9 61.3 21.31
Bentham 13.8 30.22 15.71
Plantas 0.66 21.81 20.3
As we can see in Table 4.2.2.2, the approach used provided adequate performance on all tested
corpora, with worse values for the more complex corpora or where more classes were considered,
but in general it yielded good results.
4.2.3 Word Detection
For the evaluation of the word segmentation methods we follow the same protocol which was used
in the ICDAR 2013 Handwriting Segmentation Competition [Stamatopoulos 2013]. An analytic
description of the protocol is provided in the abovementioned section for evaluation of text line
segmentation methods. The proposed word segmentation method (denoted as Gaussian_Mixtures in
the experimental results tables) has been tested on 433 handwritten document images derived from
the Bentham collection. For all these document images, we have manually created the
corresponding word segmentation ground truth using the Aletheia tool and stored it using the PAGE
XML format. The total number of words appearing in those images was 94,886. The acceptance
threshold Ta was set to 0.9.
For the sake of comparison two different experiments were conducted. The first experiment used as
input the text line segmentation result which came from the ground truth (after manual annotation).
In this case, the input of the word segmentation algorithms was correct text lines. In the second
experiment, the input of the word segmentation algorithms came from the result of the automatic
text line segmentation method proposed in the previous section (NCSR). The motivation behind the
two experiments was to measure the drop in performance when starting from erroneous input (the
automatic text line segmentation result).
For both scenarios, we measure the performance of four different word segmentation algorithms
which correspond to i) the proposed method (Gaussian_Mixtures), ii) the sequential clustering
method (Sequential_Clustering) [Kim 2001], iii) the modified max method (Modified_Max) [Kim
2001] and iv) the average linkage method (Average_Linkage) [Kim 2001]. For ii, iii and iv the first
three hypotheses were considered (more details in [Kim 2001]).
Table 4.2.3.1 presents comparative experimental results in terms of detection rate (DR), recognition
accuracy (RA) and FM when the input to the word segmentation algorithms is correct text lines
whereas Table 4.2.3.2 shows experimental results when the input to the word segmentation
algorithms is text lines produced by the automatic text line segmentation method (NCSR).
Table 4.2.3.1: Comparative experimental results on the dataset of the Bentham collection starting
from correct text lines (ground truth).
Method N M o2o DR (%) RA (%) FM (%)
Gaussian_Mixtures 94886 103229 77308 81.47 74.89 78.04
Sequential_Clustering1 94886 92124 74037 78.03 80.37 79.18
Sequential_Clustering2 94886 91712 73744 77.72 80.41 79.04
Sequential_Clustering3 94886 91327 73490 77.45 80.47 78.93
Average_Linkage1 94886 86798 69546 73.29 80.12 76.56
Average_Linkage2 94886 62143 38251 40.31 61.55 48.72
Average_Linkage3 94886 69801 42255 44.53 60.54 51.32
Modified_Max1 94886 47035 26330 27.75 55.98 37.11
Modified_Max2 94886 58530 36159 38.11 61.78 47.14
Modified_Max3 94886 65161 34065 35.90 52.28 42.57
Table 4.2.3.2: Comparative experimental results on the dataset of the Bentham collection starting
from text lines produced by automatic method (NCSR).
Method N M o2o DR (%) RA (%) FM (%)
Gaussian_Mixtures 94886 102566 74066 78.06 72.21 75.02
Sequential_Clustering1 94886 70467 70211 74.00 77.61 75.76
Sequential_Clustering2 94886 90096 69959 73.73 77.65 75.64
Sequential_Clustering3 94886 89702 69692 73.45 77.69 75.51
Average_Linkage1 94886 84465 64976 68.48 76.93 72.46
Average_Linkage2 94886 59460 35075 36.97 58.99 45.45
Average_Linkage3 94886 68089 39904 42.05 58.61 48.97
Modified_Max1 94886 43659 25280 26.64 57.90 36.49
Modified_Max2 94886 59249 32938 34.71 55.59 42.74
Modified_Max3 94886 57310 30387 32.02 53.02 39.93
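For reference, the detection rate, recognition accuracy and F-measure values reported above follow directly from the counts N (ground-truth words), M (detected words) and o2o (one-to-one matches); a minimal Python sketch of this relation (the function name is ours, not part of the evaluation toolkit):

def segmentation_scores(N, M, o2o):
    """Detection rate, recognition accuracy and F-measure from the number of
    ground-truth words N, detected words M and one-to-one matches o2o."""
    dr = o2o / N                        # detection rate
    ra = o2o / M                        # recognition accuracy
    fm = 2 * dr * ra / (dr + ra)        # harmonic mean (F-measure)
    return dr, ra, fm

# Example: the Gaussian_Mixtures row of Table 4.2.3.1
print(segmentation_scores(94886, 103229, 77308))  # ~ (0.8147, 0.7489, 0.7804)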
4.3 Handwritten Text Recognition
The quality of the automatic transcriptions obtained by the system is measured by means of the
Word Error Rate (WER). It is defined as the minimum number of words that need to be substituted,
deleted, or inserted to match the recognition output with the corresponding reference ground truth,
divided by the total number of words in the reference transcriptions.
The Character Error Rate (CER) is also used to assess the quality of the automatic transcriptions
obtained. Its definition is analogous to the previous one, but substituting words by characters.
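Both measures thus reduce to an edit (Levenshtein) distance between the recognized output and the reference, computed over words for WER and over characters for CER. A minimal illustrative sketch in Python (not the evaluation code actually used in the experiments):

def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions needed to
    transform hyp into ref (Levenshtein distance), using a rolling row."""
    d = list(range(len(hyp) + 1))            # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(d[j] + 1,                # deletion
                                        d[j - 1] + 1,            # insertion
                                        prev_diag + (r != h))    # substitution (free if equal)
    return d[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character Error Rate: the same definition applied to character sequences."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)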
HTR in Bentham database
The Bentham collection has more than 80,000 documents, most of them digitized. From the
digitized documents, more than 6,000 have been transcribed during the Transcribe Bentham
initiative4. The transcripts are recorded in TEI-compliant XML format.
The digitized material covers topics such as legal reform, punishment, the constitution, religion, and
Bentham's panopticon prison scheme. The Bentham Papers include manuscripts written by Bentham himself
over a period of sixty years, as well as fair copies written by Bentham's secretarial staff. This is
important to the tranScriptorium project, as handwriting is complex enough to challenge the HTR
software, while manuscripts written by secretarial staff will provide variety. Bentham's manuscripts
are often complicated by deletions, marginalia, interlineal additions and other features. Some
manuscripts are written on large sheets artificially divided, and a few are divided into tables with
several columns. Portions of the collection are written in French, and Bentham occasionally uses
Latin and ancient Greek in his writings.
In order to produce the Ground Truth (GT) as automatically as possible and to carry out HTR
experiments, the images were classified according to their difficulty, so that they could be processed
in increasing order of difficulty. A detailed description of the GT creation process can be found in
[Gatos 2014]. Three criteria were used for classifying the documents: first, a score was computed
for each image according to the TEI elements that appear in the transcripts. Thus, a document image
with many deletions, additions, struck-out words, etc. had a lower score than a document image that
had few of these elements. Second, after sorting the images according to this score, they were
classified according to their physical geometry. For the first round, only the images that had 2731 x
4096 pixels were chosen. Third, the resulting images were automatically clustered in three classes:
images that had a main text block in one column and a left margin, images that had a main text block
in one column and a right margin, and others. Horizontal projections were used to perform this
clustering. Finally, the initially chosen images were those that belonged to the first two clusters
and whose score was above a given threshold. The initial set is composed of 798 pages.
This initial set has been divided into two batches: the first one is composed of the 433 easier pages
and the second batch is composed of the rest. In addition, to carry out fast HTR experiments to test
different techniques, 53 pages from the first batch were chosen. In the next subsections the
experiments carried out with the selected 53 pages and with the complete first batch are presented.
Preliminary HTR experiments
In this subsection we provide the results obtained in the experiments carried out with the 53 selected
pages. These 53 pages contain 1,342 lines with nearly 12,000 running words and a vocabulary of
more than 2,000 different words. The last column of the Table 4.3.1 summarizes the basic statistics
of these pages.
4 http://blogs.ucl.ac.uk/transcribe-bentham/
Table 4.3.1: Basic statistics of the different partitions in selected 53 pages of the Bentham database.
Number of: P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 Total
Pages 5 6 6 6 5 5 5 5 5 5 53
Lines 138 145 140 138 122 94 126 150 146 144 1,342
Run. words 1,174 1,334 1,216 1,287 1,032 824 1,111 1,429 1,282 1,246 11,935
Run. OOV 130 158 115 114 115 74 174 120 189 211 -
Lex. OOV 115 143 108 98 109 73 169 114 178 176 -
Lexicon 445 504 470 447 434 326 472 497 532 465 2,181
Character set size 81 81 81 81 81 81 81 81 81 81 81
Run. Characters 6,384 7,013 6,536 6,669 5,450 4,401 5,963 7,778 6,854 6,326 63,400
The 53 pages were divided into ten blocks of 5 or 6 pages each, aimed at performing cross-
validation experiments. That is, we carry out ten rounds, with each of the partitions used once as
validation data and the remaining nine partitions used as training data. Table 4.3.1 contains some
basic statistics of the different partitions defined. The number of running words for each partition
that do not appear in the other nine partitions is shown in the running out-of-vocabulary (Run.
OOV) row.
The first two rows of Table 4.3.2 show the best results obtained with both Open and Closed
Vocabulary (OV and CV). The values of the parameters that give the best results are: NS = 8, NG =
64, α = 50 and β = 50. In the OV setting, only the words seen in the training transcriptions were
included in the recognition lexicon. In CV, on the other hand, all the words which appear in the test
set but were not seen in the training transcriptions (i.e., the Lex. OOV) were added to the lexicon.
Except for this difference, the language model in both cases was the same; i.e., bi-grams estimated
only from the training transcriptions. Also, the morphological character models were obviously
trained using only training images and transcriptions.
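The difference between the two settings therefore only concerns how the recognition lexicon is assembled; schematically (a hypothetical helper, shown only to make the OV/CV distinction explicit):

def build_lexicon(train_words, test_words, closed_vocabulary=False):
    """Open vocabulary: only words seen in the training transcriptions.
    Closed vocabulary: training words plus the test-only (Lex. OOV) words."""
    lexicon = set(train_words)
    if closed_vocabulary:
        lexicon |= set(test_words)   # add the Lex. OOV entries
    return sorted(lexicon)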
These two lexicon settings represent extreme cases with respect to real use of HTR systems.
Clearly, only the words which appear in the lexicon of a (word-based) HTR system can ever have
the chance to be output as recognition hypotheses. Correspondingly, an OV system will commit at
least one error for every OOV word instance which appears in test images. Therefore, in practice
the training lexicon is often extended with missing common words obtained from some adequate
vocabulary of the task considered. In this sense, the CV setting represents the best which could be
done in practice, while OV corresponds to the worst situation. In the next experiments we only use
OV lexicon.
Table 4.3.2: Transcription Word Error Rate (WER) for the different partitions defined. All results
are percentages.
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 Total
WER OV 29.8 31.6 26.2 30.3 31.1 30.9 47.0 23.2 39.7 49.0 33.7
WER CV 19.9 20.6 17.5 22.1 22.3 22.1 35.1 15.3 25.7 29.3 22.8
Optional BS 31.0 34.4 27.9 30.8 33.1 32.4 51.7 24.3 41.1 51.6 35.7
BS with skips 29.5 33.2 27.9 30.9 29.5 32.6 49.9 23.5 41.2 52.0 34.8
Variable NS 27.6 31.5 26.5 28.2 30.8 28.8 41.5 23.3 36.7 41.5 31.5
Figure 4.3.1: Normalized histogram of the length of words that take part in error events in
BENTHAM database. Punctuation marks are not considered.
As expected, the OV experiment has a higher WER than the CV experiment. Table 4.3.3 shows an
analysis of the errors involved in the OV experiment. In this table we can see that punctuation marks
are responsible for around 20% of the errors, numbers for 3.5% and stop words for 27.2%. The rest
of the errors are due to other words.
Table 4.3.3: Errors analysis in the OV experiment for 53 selected pages of Bentham database.
Involving to Insertion Deletions Substitutions Total (%)
Punc-Marks 0.4 9.3 10.2 19.9
Numbers 0.1 0.8 2.6 3.5
Stop-Words 4.4 1.3 21.5 27.2
Others 9.1 4.6 35.6 49.3
In Fig. 4.3.1 we can see a normalized histogram of the length of the words involved in the error
events. From the graph we can see that 15% of the errors correspond to words of 4 characters.
Punctuation marks were not considered in this analysis.
In Fig. 4.3.2 a normalized histogram of the position of the error within the line is shown. To compute
this histogram we only considered lines longer than 5 words that contain recognition errors. In the
graph we can see that the first and the last positions have more errors than the others. This can be
due to the effect of hyphenated words and to the lack of context for the first word of each line. In
future work we plan to address the hyphenated-word problem.
Figure 4.3.2: Normalized histogram of the position of the error in the line in BENTHAM database.
Only lines longer than 5 words with recognition errors are considered.
Some ideas to solve this problem are: paste all the line images of a page sequentially into a very
long line (= page) image and try to recognize this long image as a whole; paste lines pairwise with a
one-line overlap; or model possible/probable hyphenation at the lexical level.
In order to improve the quality of the automatic transcriptions obtained, we have carried out several
experiments with different HMMs. More specifically, we have tested several ways to model the
blank space that can appear between words. This blank space is usually modelled by a specific
HMM. Then, this specific HMM is added to the end of the stochastic finite-state automata that
represent each word, forcing the recognition of a blank space after each word. However, in historical
documents, this separation between words does not always exist.
The first idea that we tested consisted in making the blank space optional in the stochastic finite-
state automata. In the third row of Table 4.3.2 (Optional BS) we can see the results obtained
after tuning the parameters. Contrary to expectations, the results obtained with this optional blank
space are worse than those obtained when it is compulsory. After analyzing the obtained
transcriptions we saw that the increase in WER is due to the recognition of some words as the
concatenation of several words. For example, the word “somebody”, instead of being recognized as
“somebody”, is recognized as “some” and “body”.
The second idea that we tested consisted in changing the structure of the blank-space HMM. In the
previous experiments all the HMMs are modelled as constrained left-to-right HMMs (each state
only has connections with itself and with the next state; see Fig. 3.3.2). In this new experiment we
have modified the blank-space HMM to allow skipping states (see Fig. 4.3.3). In the row "BS with
skips" of Table 4.3.2 we can see the result obtained with this new HMM. Although this result
improves on the optional BS, it is still worse than the baseline.
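Figure 4.3.3 is not reproduced here; the topology change amounts to adding state-skip transitions to an otherwise strictly left-to-right chain. The sketch below builds such a transition matrix for a 5-state model with illustrative probabilities (the actual values are estimated during HMM training):

import numpy as np

N_STATES = 5
A = np.zeros((N_STATES, N_STATES))
for i in range(N_STATES):
    A[i, i] = 0.6                      # self loop
    if i + 1 < N_STATES:
        A[i, i + 1] = 0.3              # advance to the next state
    if i + 2 < N_STATES:
        A[i, i + 2] = 0.1              # skip one state (the "BS with skips" extension)

# Rows must sum to 1; renormalize the final states where a skip is impossible.
A /= A.sum(axis=1, keepdims=True)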
Finally, we have also tested a variable number of states for each character class instead of using a
fixed number of states for all HMMs. The number of states NSc chosen for each HMM character
class was computed as NSc = lc / f, where lc is the average length of the sequences of feature vectors
used to train the character class, and f is a design parameter measuring the average number of
feature vectors modelled per state (state load factor). After tuning, the value of f that gave the best
result was 14, with 32 Gaussians per state. The last row of Table 4.3.2 shows the results obtained.
This result slightly improves on the baseline reported in the first row.
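As a small illustration of this rule, the following sketch assigns a state count to each character from the average length of its training feature sequences; the rounding and the minimum of three states are our own illustrative choices, not taken from the experiments:

def states_per_character(avg_lengths, f=14, min_states=3):
    """NSc = lc / f: lc is the average feature-sequence length observed for
    character c in the training data, f is the state load factor."""
    return {c: max(min_states, round(l / f)) for c, l in avg_lengths.items()}

# e.g. a character whose training samples average 70 feature vectors gets 5 states (f = 14)
print(states_per_character({"a": 70, "m": 112, ".": 20}))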
Figure 4.3.3: Example of a 5-state blank-space HMM.
First batch HTR experiments
In this subsection we provide the results obtained in the experiments carried out with the first batch
formed by 433 pages. These 433 pages contain 11,537 lines with nearly 110,000 running words
and a vocabulary of more than 9,500 different words. The last column of the Table 4.3.4
summarizes the basic statistics of these pages.
Table 4.3.4: Basic statistics of the different partitions in selected 433 pages of the Bentham
database.
Number of: Training Validation Test Total
Pages 350 50 33 433
Lines 9,198 1,415 860 11,473
Run. words 86,075 12,962 7,868 106,905
Lexicon 8,659 2,710 1,947 9,717
Run. OOV - 857 417 -
Lex. OOV - 681 377 -
Character set size 86 86 86 86
Run. Characters 442,336 67,400 40,938 550,674
The 433 pages were divided into three sets: training, composed of 350 pages; validation, composed
of 50 images; and the test set, composed of the 33 remaining pages. Table 4.3.4 contains some basic
statistics of these partitions. The OOV words reported in the Validation column are words that do
not appear in the training set. On the other hand, the OOV words reported in the Test column are
words that appear neither in the training set nor in the validation set.
Table 4.3.5 shows the WER(%) obtained for the Validation set for different values of the number of
states (NS) and the number of Gaussians (NG) parameters. The experiments have been carried out
with an open vocabulary. The best WER, 32.6%, has been obtained using NS = 12 and NG = 128.
Table 4.3.5: Transcription Word Error Rate (WER) for the validation set for different values of the
parameters NS and NG. All results are percentages.
NG/NS 6 8 10 12
32 43.8 38.6 36.3 34.5
64 41.4 37.0 34.6 33.4
128 40.1 36.2 34.0 32.6
HTR in ESPOSALLES database
The ESPOSALLES database [Romero 2013] is compiled from a book of the Marriage License
Books collection (called Llibres d'Esposalles), conserved at the Archives of the Cathedral of
Barcelona. This collection is composed of 291 books with information of approximately 600,000
marriage licenses given in 250 parishes between 1451 and 1905.
These demographic documents have already been proven to be useful for genealogical research and
population investigation, which renders their complete transcription an interesting and relevant
problem [Esteve 2009]. The contained data can be useful for research into population estimates,
marriage dynamics, cycles, and indirect estimations for fertility, migration and survival, as well as
socio-economic studies related to social homogamy, intra-generational social mobility, and inter-
generational transmission and sibling differentials in social and occupational positions.
The ESPOSALLES database, used here, consists of a relatively small part of the whole collection
outlined above. More specifically, the current database version encompasses 173 page images
containing 1,747 marriage licenses.
ESPOSALLES experiments
In this subsection we provide the results obtained in the experiments carried out with the
ESPOSALLES database. A more detailed description of the database and additional HTR results can
be found in [Romero 2013].
The database consists of marriage license manuscripts digitized by experts at 300 dpi in true color
and saved in TIFF format. The book was written between 1617 and 1619 by a single writer in
old Catalan. It contains 1,747 licenses on 173 pages. The main block of the whole manuscript was
transcribed line by line by an expert paleographer exactly as it appears in the text,
without correcting orthographic mistakes or writing out abbreviations.
The complete annotation contains around 60,000 running words in over 5,000 lines, split up into
nearly 2,000 licenses. The last column of Table 4.3.6 summarizes the basic statistics of the text
transcriptions.
Each page is divided vertically into individual license records. The marriage license contains
information about the husband's and wife's names, the husband's occupation, the husband's and wife's
former marital status, and the socioeconomic position given by the amount of the fee. In some
cases, even further information is given, such as the fathers' names and occupations, information
about a deceased parent, place of residence, or geographical origin.
The 173 pages of the database were divided into 7 consecutive blocks of 25 pages each (1–25,
26–50, …, 151–173; the last block contains 23 pages), aimed at performing cross-validation
experiments. Table 4.3.6 contains some
basic statistics of the different partitions defined. The number of running words for each partition
that do not appear in the other six partitions is shown in the running out-of-vocabulary (Run. OOV)
row.
Table 4.3.6: Basic statistics of the different partitions for the database LICENSES
Number of: P0 P1 P2 P3 P4 P5 P6 Total
Pages 25 25 25 25 25 25 23 173
Licenses 256 246 246 249 243 255 252 1,747
Lines 827 779 786 768 771 773 743 5,447
Run. words 8,893 8,595 8,802 8,506 8,572 8,799 8,610 60,777
Run. OOV 426 374 368 340 329 373 317 -
Lexicon 1,119 1,096 1,106 1,036 1,046 1,078 1,011 3,465
Characters 48,464 46,459 47,902 45,728 46,135 47,529 46,012 328,229
Character set size 85 85 85 85 85 85 85 85
Two different recognition tasks were performed with the ESPOSALLES data set. In the first one,
called line level, each text line image was independently recognized. In the second, called license
level, the feature sequences extracted from all the line images belonging to each license record are
concatenated. Hence, in the license level recognition more context information is available. In
addition, two experimental conditions were considered: Open and Closed Vocabulary (OV and CV).
The results can be seen in Table 4.3.7.
Table 4.3.7: Transcription Word Error Rate (WER) in the ESPOSALLES database. All results are
percentages.
P0 P1 P2 P3 P4 P5 P6 Total
Lines OV 19.4 15.3 13.4 16.1 14.1 18.5 15.8 16.1
Lines CV 15.2 10.9 9.2 11.7 9.7 14.6 12.6 12.0
Licenses OV 17.3 14.1 12.8 15.5 14.0 16.6 15.0 15.0
Licenses CV 13.3 9.8 8.8 11.2 10.0 12.6 11.3 11.0
The characters were modelled by continuous density left-to-right HMMs with 6 states and 64
Gaussian mixture components per state. The optimal number of HMM states and Gaussian densities
per state were tuned empirically. In the line recognition problem the WER was 12.0% and 16.1% in
the closed and open vocabulary settings, respectively. In the whole license recognition case, the
WER decreased as expected, given that the language model in this case is more informative. In
particular, the system achieved a WER of 11.0% with closed vocabulary and 15.0% with open
vocabulary.
HTR in PLANTAS database
The PLANTAS database has been compiled from the manuscript piece entitled “Historia de las
plantas” written by Bernardo de Cienfuegos in the 17th century. In this case, HTR experiments were
only carried out on the PROLOGO chapter, which belongs to the first volume of the manuscript.
Table 4.3.8 shows basic statistics of this part of the dataset.
Table 4.3.8: Basic statistics for the PROLOGO chapter of the PLANTAS manuscript.
Num. of Pages 38
Num. of Lines 1 206
Running Words 11 642
Lexicon 3 899
Running Chars 61 973
Num. of Dif. Chars 71
Experimental setup
Because of the small size of the PROLOGO chapter of the PLANTAS manuscript (38 pages), we
have performed a tenfold cross validation for experimental evaluation. Therefore, Table 4.3.9
includes information about number of text lines used for training and test along with the
corresponding running words, lexicon and out-of-vocabulary words (OOV) over all cross test sets.
Table 4.3.9: Basic statistics of the tenfold cross validation partition defined for the PROLOGO
chapter of the PLANTAS manuscript. OOV stands for out-of-vocabulary words.
Blk-ID Lines (Training) Lines (Test) Words (RW-Test) Words (OOV) OOV(%) Lex-Test Lex-OOV
00 1 080 126 1 490 337 22.61 565 304
01 1 074 132 1 528 314 20.54 595 300
02 1 081 125 1 435 388 27.03 592 346
03 1 080 126 1 493 317 21.23 573 277
04 1 078 128 1 444 357 24.72 642 337
05 1 078 128 1 446 384 26.55 662 364
06 1 080 126 1 445 358 24.77 666 348
07 1 078 128 1 560 335 21.47 606 306
08 1 082 124 1 529 356 23.28 631 341
09 1 143 63 798 171 21.42 36 165
For each cross-validation run, using only the corresponding training samples, an open-vocabulary
dictionary was built and a Kneser-Ney smoothed n-gram model was trained. On the other hand, the
71 different characters were modelled by continuous density left-to-right HMMs with different
numbers of states and Gaussian mixture components per state. As in the previous datasets, the
optimal number of HMM states and Gaussian densities per state has also been tuned empirically.
PLANTAS experiments
Preliminary recognition results over the ten cross-validation runs on the PROLOGO chapter of the
PLANTAS manuscript are reported in Table 4.3.10 for different numbers of Gaussian densities per
HMM state, optimized for 12 HMM states and the corresponding values of the grammar scale factor
(GSF) and word insertion penalty (WIP). WER and CER figures were computed after filtering out
tags and punctuation marks.
The best WER/CER figures of 50.06%/22.40% are achieved by using character HMMs of 12 states
and 32 Gaussians per state.
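For context, the GSF and WIP reported in Table 4.3.10 enter the decoding criterion as a log-linear combination of the optical (HMM) and language-model scores, following the usual convention of HTK-style decoders; a schematic sketch, not the actual decoder code:

def hypothesis_score(log_p_optical, log_p_lm, num_words, gsf, wip):
    """Log-linear decoding score: the optical (HMM) log-likelihood plus the
    language-model log-probability scaled by the grammar scale factor (GSF),
    plus a word insertion penalty (WIP) per output word."""
    return log_p_optical + gsf * log_p_lm + wip * num_words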
Table 4.3.10: Best Word Error Rate (WER(%)) and Character Error Rate (CER(%)) reported results
on Prologo PLANTAS database, for HMMs with 12 states and for different number of Gaussian
Densities per each HMM state (NG) and optimized parameter values of grammar scale factor (GSF)
and word insertion penalty (WIP).
NG GSF WIP WER(%) CER(%)
8 105.81 -43.89 50.68 23.60
16 120.71 -66.96 50.11 23.46
32 107.96 -13.33 50.06 22.40
64 102.68 -72.01 50.99 22.82
HTR in Hattem database
The Hattem database used in the experiments described in this section is transcribed at line level
for only 40 page images. From the layout point of view, these images are characterized by text lines
lying very close to each other. In fact, the images have many touching characters between
consecutive lines, which makes line detection and line extraction very difficult. For the
experiments reported in this section, the baselines were manually annotated as polylines.
From the HTR point of view, this is a single-writer collection. Another characteristic is that there
are no punctuation symbols. Paragraphs are marked by special symbols that have not been taken into
account in the following experiments. The manuscript has many abbreviations, but this
characteristic has not been taken into account either. That means that "diplomatic" HTR experiments
have been carried out.
Preliminary HTR experiments
This subsection shows the results obtained in the experiments carried out with the 40 selected
pages. These 40 pages contain 1,552 lines with nearly 10,330 running words and a vocabulary of
2,602 different words. The last column of Table 4.3.11 summarizes the basic statistics
of these pages.
Table 4.3.11: Basic statistics of the different partitions in selected 40 pages of the Hattem database.
Number of: P0 P1 P2 P3 P4 P5 P6 P7 Total
Pages 5 5 5 5 5 5 5 5 40
Lines 186 200 195 185 191 191 203 201 1,552
Run. words 1,280 1,351 1,318 1,227 1,237 1,230 1,344 1,343 10,330
Run. OOV 244 276 264 261 250 285 306 248 -
Lex. OOV 219 241 228 234 210 221 264 225 -
Lexicon 551 575 562 581 554 544 611 570 2,602
Character set size 45 43 44 46 43 52 44 45 60
Run. Characters 5,148 5,506 5,446 5,063 5,242 5,006 5,656 5,645 42,712
The 40 pages were divided into 8 blocks of 5 pages each, aimed at performing cross-validation
experiments. That is, we carry out 8 rounds, with each of the partitions used once as validation data
and the remaining seven partitions used as training data. Table 4.3.11 contains some basic statistics
of the different partitions defined. The number of running words of each partition that do not
appear in the other seven partitions is shown in the running out-of-vocabulary (Run. OOV) row.
The first two rows of Table 4.3.12 show the best results obtained with both Open and Closed
Vocabulary (OV and CV). The values of the parameters that give the best results are: NS = 8, NG =
64. In the OV setting, only the words seen in the training transcriptions were included in the
recognition lexicon. In CV, on the other hand, all the words which appear in the test set but were not
seen in the training transcriptions (i.e., the Lex. OOV) were added to the lexicon. Except for this
difference, the language models in both cases were the same; i.e., bi-grams estimated only from the
training transcriptions. Also, the morphological character models were obviously trained using only
training images and transcriptions.
These two lexicon settings represent extreme cases with respect to real use of HTR systems.
Clearly, only the words which appear in the lexicon of a (word-based) HTR system can ever have
the chance to be output as recognition hypotheses. Correspondingly, an OV system will commit at
least one error for every OOV word instance which appears in test images. Therefore, in practice
the training lexicon is often extended with missing common words obtained from some adequate
vocabulary of the task considered. In this sense, the CV setting represents the best which could be
done in practice, while OV corresponds to the worst situation. In the next experiments we only use
OV lexicon.
Table 4.3.12: Best Word Error Rate (WER(%)) and Character Error Rate (CER(%)) reported results
on Hattem database, for HMMs with 8 states and 32 and 64 Gaussian Densities per each HMM state
(NG) and optimized parameter values of grammar scale factor (GSF) and word insertion penalty
(WIP). Only the best results are reported.
NG GSF WIP WER(%) CER(%) WER(%) CER(%)
32 70 -110 25.5 13.7 43.3 21.6
64 70 -110 24.5 13.1 42.8 21.0
4.4 Keyword Spotting
4.4.1 The query-by-example approach
We evaluate the segmentation-based and segmentation-free word spotting on a handwritten database
from the writing of Bentham's secretaries. It consists of 433 high-quality (approximately 3000 pixels
wide and 4000 pixels high) handwritten documents from a variety of authors. It poses several
very difficult problems, of which the most difficult is word variability. The variation of the same
word is extreme and involves writing style, font size and noise, as well as their combinations. Figure
4.4.1.1 shows some examples of these instances.
Figure 4.4.1.1: Types of word variation found in the Bentham Database: (a) word “England”; (b)
word “Embezzlement”.
The metrics employed in the performance evaluation of the proposed segmentation-based and
segmentation-free algorithms are the Precision at the k Top Retrieved words (P@k) and the Mean
Average Precision (MAP). To further detail the metrics, let us define Precision and P@k as follows:
Precision = \frac{|\{\text{relevant words}\} \cap \{\text{retrieved words}\}|}{|\{\text{retrieved words}\}|}        (4.4.1.1)

P@k = \frac{|\{\text{relevant words}\} \cap \{k\ \text{retrieved words}\}|}{|\{k\ \text{retrieved words}\}|}        (4.4.1.2)
Precision is the fraction of retrieved words that are relevant to the search, while when precision
must be determined for the k top retrieved words, P@k is computed. In particular, in the proposed
evaluation we use P@5, which is the precision at the top 5 retrieved words. This metric measures
how successfully the algorithms place relevant results in the first 5 positions of the ranking list.
This is an important measurement because the current word-spotting scenarios act as recommender
systems for a transcription operation. The above metrics are clearly suited to binary classification,
i.e. a retrieved word is either relevant or non-relevant to the query. This may pose problems for
segmentation-free systems, as they may not detect the whole word or may include parts of another
word. This situation is resolved by detecting the common area between the ground truth word area
and the output of the method; if it exceeds a threshold, the result is
considered relevant.
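A sketch of this overlap criterion, assuming word regions are represented as axis-aligned bounding boxes and that the threshold is expressed as a fraction of the ground-truth word area (both assumptions are ours; the exact representation is not specified above):

def is_relevant(gt_box, det_box, threshold=0.5):
    """gt_box/det_box are (x1, y1, x2, y2); the detection counts as relevant
    if the common area covers at least `threshold` of the ground-truth word."""
    ix1, iy1 = max(gt_box[0], det_box[0]), max(gt_box[1], det_box[1])
    ix2, iy2 = min(gt_box[2], det_box[2]), min(gt_box[3], det_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return inter / gt_area >= threshold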
The second metric used in the proposed evaluation is the Mean Average Precision (MAP), which is a
typical measure of the performance of information retrieval systems. There are two implementation
strategies for the calculation of Average Precision (AP): the interpolated AP and the non-
interpolated AP [Nist 2013], [Chatzichristofis 2011], both implemented by the Text Retrieval
Conference (TREC) community of the National Institute of Standards and Technology (NIST). For
the proposed evaluation, the non-interpolated AP is used, which is defined as the average of the
precision values obtained after each relevant word is retrieved:
AP = \frac{\sum_{k=1}^{n} P@k \cdot rel(k)}{|\{\text{relevant words}\}|}        (4.4.1.3)

where

rel(k) = \begin{cases} 1 & \text{if the word at rank } k \text{ is relevant} \\ 0 & \text{if the word at rank } k \text{ is not relevant} \end{cases}        (4.4.1.4)
Finally, the Mean Average Precision (MAP) is calculated by averaging the AP for all the queries and
it is defined as:
MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP_q        (4.4.1.5)
where Q is the total number of queries.
This metric is chosen because it rewards systems that retrieve highly ranked relevant words.
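All three measures can be computed directly from each query's ranked retrieval list together with binary relevance labels; a compact illustrative sketch (not the evaluation code used here):

def precision_at_k(relevance, k):
    """relevance: list of 0/1 flags in ranking order for one query."""
    return sum(relevance[:k]) / k

def average_precision(relevance, num_relevant):
    """Non-interpolated AP: mean of P@k over the ranks k of the relevant words."""
    hits, ap = 0, 0.0
    for k, rel in enumerate(relevance, 1):
        if rel:
            hits += 1
            ap += hits / k          # P@k at this relevant rank
    return ap / num_relevant if num_relevant else 0.0

def mean_average_precision(per_query_aps):
    """MAP: average of the per-query AP values."""
    return sum(per_query_aps) / len(per_query_aps)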
Initially, the proposed segmentation-based word spotting was evaluated against two previous
segmentation-based, profile-based strategies. The results are shown in Table 4.4.1.1.
Subsequently, the proposed segmentation-free word-spotting procedure was evaluated against two
previous segmentation-free procedures: one employs a sliding-window strategy and the other detects
and matches informative parts of the document. Another consideration for segmentation-free
procedures is efficiency. As the word spotting system is a component that plays the role of a
recommendation system in the transcription stage, the retrieved results should be obtained quickly,
otherwise they will be of no use. For this purpose, the time expense of the matching step has been
computed, as shown in Table 4.4.1.1. The experiments were performed
single-threaded on an Intel Core 2 Duo E8400 CPU (3 GHz). The number of query word images
used was 5. For each word image, two variations were used, resulting in a total of 10 queries. This
number should be increased in future experiments; however, the high time expense of the methods
against which comparisons were made did not permit us to achieve such a goal. In the case of
segmentation-based methods, this obstacle did not exist and therefore, to improve the statistical
significance of our results, we made additional experiments with the segmentation-based
approaches using 20 query word images. For each query word, two different variations were used,
resulting in a total of 40 queries. Table 4.4.1.2 shows the performance evaluation results for the
augmented query set.
Table 4.4.1.1: Overall Performance Evaluation Results
Methodology type Methodology P@5 MAP Time Expense for Matching
Segmentation-free Sliding Window [Gatos 2009] 10% 0.53% ~12 sec/page
Segmentation-free Patch-based [Leydier 2009] 51.33% 15.01% ~263 sec/page
Segmentation-free Proposed Method 35.33% 6.54% ~3 sec/page
Segmentation-based Proposed Method 57.33% 20.42% ~0.5 sec/page
Segmentation-based WS [Manmatha 1996] 21.33% 9.89% ~0.001 sec/page
Segmentation-based CSPD [Zagoris 2010] 55.33% 19.6% ~0.001 sec/page
Table 4.4.1.2: Performance Evaluation Results for Segmentation–based Methods
P@5 MAP
Manmatha [Manmatha 1996] 20.00% 2.42%
CSPD [Zagoris 2010] 30.00% 3.81%
Proposed Method 45.33% 10.02%
The performance evaluation results shown in Tables 4.4.1.1 and 4.4.1.2 confirm the extreme
difficulty of the Bentham database, due to writing variations and the variety of degradations
appearing in the documents. Moreover, they show that the segmentation-based methods performed
better, both in effectiveness and efficiency, than the segmentation-free methods. However, we should
note that in our experiments we assumed perfect word segmentation, since the ground truth was
used. This means that these results are an upper limit on the achievable performance, which will be
reduced when an imperfect state-of-the-art word segmentation method is used.
Last but not least, the role of the word spotting architecture as a recommender system for a
transcription operation makes it imperative to take both effectiveness and efficiency into
consideration. In this respect, the results produced by the proposed methods proved to be promising.
4.4.2 The query-by-string approach
To compare the effectiveness of the proposed and the classical HMM-Filler KWS approaches,
several experiments were carried out. The evaluation measures, corpora, experimental setup and the
results are presented next.
Evaluation Measures
The standard recall and interpolated precision measures [Manning 2008] are used here. Interpolated
precision is widely used in the literature to avoid cases in which plain precision can be ill-defined.
In addition, we employ another popular scalar KWS assessment measure called average precision
(AP) [Zhu 2004][Robertson 2008] which, actually, is the area under the Recall-Precision curve.
On the other hand, real computing times are reported in terms of total elapsed times needed on
dedicated 64-bit Intel Xeon computers running at 2.5GHz.
Corpora Description
Experiments were carried out with two different corpora: “Cristo-Salvador” (CS) and IAMDB.
CS is a XIX century Spanish manuscript, entitled “Noticia histórica de las fiestas con que Valencia
celebró el siglo sexto de la venida a esta capital de la milagrosa imagen del Salvador”, which was
kindly provided by the Biblioteca Valenciana Digital (BiVaLDi). It is composed of 50 color images
of 5.1’’ x 7.2’’ relevant text pages, written by a single writer and scanned at 300 dpi. Some examples
are shown in Fig. 4.4.2.1. Line images, extracted from the original page images in previous works
with this corpus [Romero 2007], are also available. The whole CS corpus, along with directions for
its use in HTR experiments, is publicly available for research purposes from
https://prhlt.iti.upv.es/page/data.
The first 29 pages (675 lines) are used for training and the test set encompasses the remaining 21
pages (497 lines), many of which contain notes and short comments written in a style rather
different from that of the training pages. Table 4.4.2.1 summarizes this information.
Word statistics were computed ignoring capitalization and punctuation marks. The number of
different words of the test partition that do not appear in the training partition is shown in the
column OOV (out of vocabulary words).
Figure 4.4.2.1: Examples of page images from CS corpus.
Table 4.4.2.1: Basic statistics of the partition of the CS database. OOV stands for Out-Of-
Vocabulary words.
Training Test Total
Running Chars 35 863 26 353 62 216
Running Words 6 223 4 637 10 860
Running OOV(%) 0 29.03 -
# Lines 675 497 1 172
# Pages 29 21 50
Char Lex. Size 78 78 78
Word Lex. Size 2 236 1 671 3 287
OOV Lex. Size 0 1 051 -
Figure 4.4.2.2: Examples of text line images from IAMDB.
IAMDB is also a publicly accessible corpus compiled by the Research Group on Computer Vision
and Artificial Intelligence (FKI) at the Institute of Computer Science and Applied Mathematics
(IAM). The acquisition was based on the Lancaster-Oslo/Bergen Corpus (LOB). Fig. 4.4.2.2 shows
examples of handwritten sentence images from this corpus.
The last released version (3.0) is composed of 1539 scanned text pages, handwritten by 657
different writers. The database is provided at different segmentation levels: words, lines, sentences
and page images. In our case, the line segmentation level is considered [Marti 2002]. The corpus
was partitioned into training, validation and test sets containing text lines from disjoint sets of
writers, whose basic statistics are summarized in Table 4.4.2.2.
Table 4.4.2.2: Basic statistics of the partition of the IAMDB database. OOV stands for Out-Of-
Vocabulary words.
Training Valid. Test Total
Running Chars 269 270 39 318 39 130 347 718
Running Words 47 615 7 291 7 197 62 103
Running OOV(%) 0 18.26 14.81 -
# Lines 6 161 920 929 8 010
# Pages 1 539
Char Lex. Size 72 69 65 81
Word Lex. Size 7 778 2 442 2 488 9 809
OOV Lex. Size 0 1 095 936 -
Sets of Word Queries to be Spotted
The sets of keywords to be spotted for the CS and IAMDB corpora are the ones already used in
[Toselli 2013] and [Frinken 2012], respectively. The criterion followed for keyword selection was
described in [Frinken 2012], where all the words seen in the training partition are tested as
keywords. However, in contrast to IAMDB, stop words were not removed from the CS query set.
For the CS corpus, the selected query vocabulary consists of M = 2236 words, including
numeric expressions and a few symbols. It is worth mentioning that, from these query words, only
620 actually appear in the test line images. We say that these query words are pertinent, whereas the
remaining 1616 words are not pertinent. Clearly, spotting non-pertinent words is also challenging,
since the system may erroneously find other similar words, thereby leading to important precision
degradations. On the other hand, many of the pertinent query words appear more than once in the
test set. Concretely, there are 3291 pertinent query word occurrences, representing around 85% of
the total number of running words in the test images. Each of the 2236 word queries has to be tested
against all N = 497 test lines. So the total number of query-line events, N·M, is 1,111,292, of which
only 3005 (corresponding to the above-mentioned running pertinent words) are pertinent and
relevant. All this information is summed up in Table 4.4.2.3, along with the corresponding
information for IAMDB.
Experimental Set-up
CS and IAMDB character HMMs were trained from the corresponding training partitions. In
general, a left-to-right HMM was trained for each of the elements appearing in the training text
images, such as lowercase and uppercase letters, symbols, blank spaces, special abbreviations,
possible blank spaces between words and characters, crossed-words, etc. However, specific details
of the image preprocessing and HMM training setup usually adopted for CS and IAMDB are fairly
different. In particular, different line image preprocessing, writing style attribute normalization and
feature extraction are usually adopted for each corpus. In addition, a caseless HMM topology was
utilized for CS; that is, each character HMM models, “in parallel”, both lower- and upper-case
letters.
Table 4.4.2.3: Basic statistics of the selected query word sets for CS and IAMDB along with their
related information.
CS IAMDB
Total Relev/Pert Total Relev/Pert
# Line images: N 497 497 929 855
# Query words: M 2 236 620 3 421 1 098
# Running pertinent query words - 3 291 - 1 925
# Line-query events: N·M 1 111 292 3 005 3 178 109 1 916
See [Toselli 2013] and [Fischer 2010] for details about the meta-parameters of line-image
preprocessing, feature extraction and HMMs (which were optimized through cross-validation on the
training data for CS and on the validation data for IAMDB).
In the preprocessing phase of the proposed CL approach, the CL of each line image of the test
partition was obtained, using only the “filler” model (see Fig. 2.4.2.1a). Table 4.4.2.4 shows some
statistics of the resulting CLs, for both corpora, which were generated using the HTK toolkit
[Young 1997] setting the maximum input branching factor to 30. Then, the corresponding forward-
backward scores were computed according to Eq. 3.4.2.1-3.4.2.2. In the query phase, the character
sequence scores, S’(c,x), were computed for each keyword, υ = c, and line image, x, according to
Alg. 1. Finally, the KWS scores S(c, x) were determined for all the keywords, according to
Eq. 3.4.2.5.
Table 4.4.2.4: Statistics of the CS and IAMDB CLs. All the figures are numbers of elements,
averaged over all the CLs.
Corpora Nodes Edges Branch. Fact. Paths
CS 35 608 1 058 890 30 ~10^305
IAMDB 37 313 1 112 070 30 ~10^307
The experimental setup for HMM-Filler approach was the usual one [Fischer 2010]. In the
preprocessing phase the filler model scores sf (x) for all the test line images x were computed. Then
in the query phase, for each keyword υ to be spotted a keyword-specific model was assembled and
the corresponding decoding score sυ(x) was obtained for each line image x. Finally the KWS scores
were computed according to Eq. 2.4.2.3. HTK was also used for HMM decoding.
Results
The goal of the experiments carried out is twofold: to assess both the effectiveness and the efficiency of the
proposed CL-based HMM-Filler with respect to its classical version.
Experiments were carried out with the CS and IAMDB corpora using both the classical HMM-Filler
and the CL-based approach proposed here. Since, for each corpus, both approaches share identical
text-line image processing and character HMMs, precision-recall performance results should be
identical for both approaches, according to Eq. 3.4.2.5. However, since CL-based KWS
(necessarily) relies on incomplete (pruned) CLs, some degradation is actually expected. On the
other hand, according to Secs. 2.4.2 and 3.4.2, a much lower keyword search computing time is
expected for the proposed CL-based approach.
Recall-precision curves for both corpora and both approaches are shown in Fig. 4.4.2.3, while
Table 4.4.2.5 shows the corresponding Average Precision (AP) figures. This table also contains the
relevant computing times, split into preprocessing and search (query) times, the latter given in
average minutes required for each single keyword search. The table also shows the total time (in
days) needed to fully index each corpus with the corresponding selected keywords (2236 for CS and
3421 for IAMDB).
Figure 4.4.2.3: Recall-Precision curves for CS (left) and IAMDB (right), using the reference
HMM-Filler approach (REF) and the CL-Based version (CL).
As expected, a large gain in efficiency is observed for the proposed CL-based HMM-Filler method
with respect to the classical approach. The average query response time is reduced by a factor of 43
for CS and by a factor of 76 for IAMDB. The corresponding overall time needed for fully indexing
all the selected keywords is affordable using the proposed CL-based KWS method, while it
becomes clearly prohibitive for the classical HMM-Filler approach. These great speedups are
achieved without any significant loss in effectiveness, as supported by the overall AP figures of
Table 4.4.2.5 and by the detailed recall-precision graphs of Fig.4.4.2.3.
Table 4.4.2.5: Effectiveness (AP) and efficiency in terms of preprocessing and average query times
and total indexing times, for classical (REF) and CL-based HMM-Filler KWS.
CS IAMDB
REF CL REF CL
Average Precision 0.62 0.61 0.36 0.36
Preprocessing Time (min) 124.8 197.6 66.7 219.5
Average Query Time (min) 124.4 2.9 68.4 0.9
Total Indexing Time (days) 193.3 4.6 162.5 2.3
To conclude, it is important to remark that in all the experiments, full Viterbi decoding has been
carried out. At the expense of very small accuracy degradations, processing times for both methods
can be easily reduced (perhaps by one order of magnitude) by using beam-search pruning
techniques. Additionally, the CL-based approach still has much room for further speedups simply
by using smaller CLs. Informal tests suggest that, in this way, CL-based KWS can become about
one order of magnitude faster without significant additional accuracy degradation.
On the other hand, small accuracy degradations can be more than compensated by taking into
account contextual information in the filler model, for instance using character N-Grams. Clearly,
this will result in significantly larger filler models which will make the computing time of the
classical HMM-Filler approach much more prohibitive. In contrast, using these larger models no
significant computing time increase is expected for the proposed CL-based approach.
5. Description of Tools
5.1 Image Pre-processing
5.1.1 Binarization
The first version of the binarization toolkit was delivered in the form of a console application:
binarization [infile] [outfile] [Param]
where [infile] is the input image, [outfile] the output binary image after binarization and
[Param] is a parameter that can be 0, 1, or 2. If no [Param] is given, [Param=0] is considered.
Notice that for [Param=1] binarization gives better results for bleed-through cases, for [Param=2]
binarization gives better results for faint character detection and [Param=0] is the default
binarization that can handle most cases in general. Examples of the developed binarization
methodology are given in Fig.5.1.1.1.
Figure 5.1.1.1: Binarization examples; (a)-(b) original image and binarization result of a typical
case; (c)-(e) original image and binarization results using [Param=0] and [Param=1], respectively, of
a bleed-through case; (f)-(h) original image and binarization results using [Param=0] and
[Param=2], respectively, of a case with faint characters.
5.1.2 Border Removal
The first version of the border removal toolkit was delivered in the form of a console application:
BorderRemoval [infile] [outfile]
where [infile] is the input binary image and [outfile] the output binary image after border removal.
Examples of the developed border removal methodology are given in Fig.5.1.2.1.
Figure 5.1.2.1: Border removal examples; (a)-(c) original images; (d)-(f) images after binarization;
(g)-(i) images after border removal.
5.1.3 Image Enhancement
Text image enhancement tool
The imgtxtenh command line tool implements the proposed modification of the Sauvola algorithm.
It takes as input an image which can be in any of the following formats: PPM, PGM, BMP, PNG,
JPEG and TIFF. The output is the resulting image in grayscale, and the output format can be
requested to be in any of the following formats: PGM, PNG, JPEG and TIFF. By default the
original image is read from standard input and the resulting image is sent to standard output. It can
be requested that the original image be read from a file and that the output image be written to a
file, by using the arguments -i {INFILE} and -o {OUTFILE}.
The algorithm parameters that can be varied are the window width w, the mean threshold factor k
(both from the original Sauvola algorithm) and the black-to-white transition factor s. These three
parameters are specified to the tool as arguments with the respective letters, i.e., -w {INT}, -k
{FLOAT} and -s {FLOAT}. To apply contrast stretching to the image before the algorithm is
applied, use the argument -S {SATU}, where SATU is the percentage of pixels that will be saturated
at the extremes of the range.
For example, to apply the original Sauvola algorithm with parameters w = 30 and k = 0.2 to an
image Esposalles.jpg and save the result to out.jpg, the following command would be issued:
imgtxtenh -s 0 -w 30 -k 0.2 -i Esposalles.jpg -o out.jpg
To apply the proposed algorithm with the same parameters as in the previous example and a black-
to-white transition factor s = 0.5, the following command would be issued:
imgtxtenh -s 0.5 -w 30 -k 0.2 -i Esposalles.jpg -o out.jpg
Finally, for the same parameters but also applying contrast stretching with a saturation of 0.01%, the
following command would be issued:
imgtxtenh -S 0.01 -s 0.5 -w 30 -k 0.2 -i Esposalles.jpg -o out.jpg
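For readers unfamiliar with the parameters, w and k correspond to those of the original Sauvola rule, which derives a local threshold from the mean m and standard deviation s of the grey values in a w x w window. A minimal per-window sketch of that original rule follows (the black-to-white transition factor of the modified algorithm is not reproduced here; R = 128 is the usual choice for 8-bit images):

import numpy as np

def sauvola_threshold(window, k=0.2, R=128.0):
    """Original Sauvola rule for one local window of an 8-bit grey image:
    T = m * (1 + k * (s / R - 1)), with m/s the window mean/std-dev and
    R the assumed dynamic range of the standard deviation."""
    m, s = window.mean(), window.std()
    return m * (1.0 + k * (s / R - 1.0))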
Show-through reduction tool (imgtogray)
The imgtogray command line tool, as the name suggests, simply receives a color image as input and
produces a grayscale image. As with the imgtxtenh tool, the input image can be in any of the
following formats: PPM, PGM, BMP, PNG, JPEG and TIFF, and the output in any of the following
formats: PGM, PNG, JPEG and TIFF. By default the original image is read from standard input and
the resulting image is sent to standard output. It can be requested that the original image be read
from a file and that the output image be written to a file by using the arguments -i {INFILE} and -o
{OUTFILE}.
This imgtogray tool can be used for show-through reduction by providing a color to gray
conversion model that has been optimized for this task. A model that has been optimized for the
Plantas corpus has been hard coded in the tool, and can be selected by giving as argument -1. For
example to process an image Plantas.jpg, one would issue the command:
imgtogray -1 -i Plantas.jpg -o out.jpg
Grayscale image enhancement
The first version of the grayscale enhancement toolkit was delivered in the form of a console
application:
enhanceGray [infile] [outfile] [Param]
where [infile] is the input grayscale or color image, [outfile] the output grayscale image after
enhancement and [Param] is a parameter that can be 0, 1, 2 or 3. If no [Param] is given, [Param=0]
is considered. For [Param=0] background smoothing is performed using background estimation, for
[Param=1] a 5x5 Wiener filter is applied, for [Param=2] a 3x3 Gaussian filter with σ=1 is applied and
for [Param=3] an unsharp filter using a 9x9 mask followed by a 3x3 Wiener filter is applied.
Examples of the enhancing methodologies are provided in Fig. 5.1.3.1.
Figure 5.1.3.1: Representative examples of grayscale image enhancement; (a) original image; (b)
background smoothing using background estimation; (c) 5x5 Wiener filtering; (d) 3x3 Gaussian
filtering; (e) 9x9 unsharp filtering followed by a 3x3 Wiener filter.
Binary image enhancement
The first version of the binary image enhancement toolkit was delivered in the form of a console
application:
enhanceBinary [infile] [outfile] [Param]
if [Param=2]: enhanceBinary [infile] [outfile] [Param] [Param2]
where [infile] is the input binary image, [outfile] the output binary image after enhancement and
[Param] is a parameter that can be 0, 1, or 2. If no [Param] is given, [Param=0] is considered,
while in the case of [Param=2] an additional parameter, i.e. [Param2], can be given. For [Param=0]
contour enhancement is performed, for [Param=1] a morphological closing using a 3x3 mask is
applied and for [Param=2] a despeckle filter with a 7x7 mask is applied. In the last case of the
despeckle filter, if an additional parameter with value "Z" is given, then the despeckle mask will be
of size ZxZ. Examples of the enhancing methodologies are given in Fig. 5.1.3.2.
Figure 5.1.3.2: Examples of enhancing a binary image; (a) initial binary image; (b) contour enhancement; (c) 3x3 morphological closing; (d) despeckle filtering (7x7 mask).
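As an illustration with hypothetical file names, applying the despeckle filter with a 5x5 mask instead of the default 7x7 one would be done with:
enhanceBinary page_bin.png page_clean.png 2 5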
5.1.4 Deskew
Deskew at page and text line level
The first version of the skew correction toolkit was delivered in the form of a console application:
Page_Line_Skew [infile] [outfile]
where [infile] is the input binary image and [outfile] the output binary image after skew correction.
Examples of the developed skew correction methodology are given in Fig. 5.1.4.1.
Figure 5.1.4.1: Skew correction examples; (a), (c), (e) images after binarization; (b), (d), (f) images after skew correction.
Deskew at word level
The first version of the skew correction toolkit was delivered in the form of a console application:
Word_Skew [infile] [outfile]
where [infile] is the input binary image and [outfile] the output binary image after skew correction.
Examples of the developed skew correction methodology are given in Fig. 5.1.4.2.
Figure 5.1.4.2: Skew correction examples; (a)-(e) images after binarization; (f)-(j) images after skew correction.
5.1.5 Deslant
Deslant at text line level
The first version of the slant correction toolkit was delivered in the form of a console application:
Text_Line_Slant [infile] [outfile]
where [infile] is the input binary image and [outfile] the output binary image after slant correction.
Examples of the developed slant correction methodology are given in Fig. 5.1.5.1.
Figure 5.1.5.1: Slant correction examples; (a), (c), (e) images after binarization; (b), (d), (f) images after slant correction.
Deslant at word level
The first version of the slant correction toolkit was delivered in the form of a console application:
Word_Slant [infile] [outfile]
where [infile] is the input binary image and [outfile] the output binary image after slant correction.
Examples of the developed slant correction methodology are given in Fig. 5.1.5.2.
Figure 5.1.5.2: Slant correction examples; (a)-(e) images after binarization; (f)-(j) images after slant correction.
5.2 Image Segmentation
5.2.1 Detection of main page elements
Page split
The first version of the page split toolkit was delivered in the form of a console application:
PageSplit [infile1] [infile2] [outfile]
where [infile1] is the input original color image, [infile2] the corresponding binary image and [outfile] the resulting PAGE xml file in which the detected page frames are marked as different regions. Examples of the developed page split methodology are given in Fig. 5.2.1.1.
Figure 5.2.1.1: Page split examples; (a), (d) original images; (b), (e) images after binarization; (c), (f) images after page split.
Detection of text zones
The first version of the text zones detection toolkit was delivered in the form of a console
application:
TextZonesDetection [infile] [outfile]
where [infile] is the input binary image and [outfile] an ASCII file containing the coordinates of the main text zones (x1, x2 if the document is single-column, or x1, x2, x3, x4 if the document has two columns). Examples of the developed text zones detection methodology are given in Fig. 5.2.1.2.
Figure 5.2.1.2: Document text zones detection results
Segmentation of structured documents using stochastic grammars
The first version of the segmentation toolkit of structured documents was delivered in the form of a
console application:
espoparser -c config -i infile -o outfile
where [infile] is the input file with information about the class probabilities for the cells of the image and [outfile] the parsed image.
5.2.2 Text Line Detection
Baseline detection
Preliminaries
In this section you can find the instructions to install and use the tools developed for Baseline Detection. As input for this experiment we will use 10 pages of the “Plantas Prologo” database, and at the end of the process we will obtain the baselines and the corresponding Line Error Rate (LER) and Segmentation Error Rate (SER).
All the scripts and code provided have been developed for Linux.
The following software must be installed in order to proceed:
• HTK 3.4.1 toolkit
• Ruby Virtual Machine 1.9.2
• C++ Libraries: OpenCV, Boost, Log4C++, Eigen
• Ruby Libraries: narray, ruby-debug, trollop, ap, fileutils
For details on the installation of this base software, please follow the official installation instructions referenced on the project's webpage.
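As an illustration only (exact gem names may vary across Ruby setups, and fileutils ships with the Ruby standard library so it needs no separate installation), the Ruby libraries listed above would typically be installed with RubyGems:
gem install narray ruby-debug trollop ap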
Baseline detection experiment
The script launch-training-detection.rb launches a complete baseline detection experiment with a subset of the “Plantas Prologo” data set. More specifically, it launches a hold-out experiment using the first 5 pages of the corpus for training and the remaining 5 pages for testing.
To execute the script:
cd baseline_detection
./launch_training_detection.rb
The time required to carry out the complete experiment is approximately 30 minutes.
The baseline detection experiment described here involves four different phases:
Preprocessing - The script assumes we are providing clean images and applies RLSA to them in order to enhance the zones with lines.
Feature Extraction - Extracts feature vector of all pages.
Training - Trains the statistical model with the training pages feature vectors.
Test - Provides a hypothesis of the detected text lines and their location in the test pages.
When the experiment is launched it will leave the results in the folder experimentation/experiment1. In this directory you will find the following:
recogAlg_*.mlf: a series of result files obtained by models trained with 1, 2, 4, 8, 16, 32, 64 and 128 Gaussians. Inside, for each test file, you can find the labels and location of each detected line.
TESTGLOBAL: a file that contains the LER and SER of the test set for each of the models trained with a different number of Gaussians.
Adding new images for training and test
In order to change the pages used for training and test, the following steps must be performed (an illustrative command sketch is given below):
A clean preprocessed page image must be added to the directory images/clean_extracted.
If the page will be used for training, or if we want to be able to measure the error on the test pages, a .inf file for that image must be added to the inf folder. We explain later how to obtain a .inf file from a PAGE xml file.
The new page must be added to a list file, either experimentation/data/train.lst or experimentation/data/test.lst, so that the new sample is used for training or testing, respectively.
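For instance, adding a new page as a test sample could look as follows (file names are hypothetical and the exact list-entry format should match the existing entries in the .lst files):
cp new_page.jpg images/clean_extracted/
cp new_page.inf inf/
echo new_page >> experimentation/data/test.lst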
Adding the extracted baselines to PAGE format
In order to add the baselines to the PAGE format files the following must be performed:
Enter the experiment folder
Copy the recogAlg_*.mlf and the test list file to the asses_line_detection_accuracy/bin
Extract the baseline info from the .mlf file using the line_limits command
Copy the generated .l files to the images folder
Add the baseline info to the PAGE xmls with the page_format_add_baseline command
cd experimentation/experiment1
cp recogAlg_32.mlf ../../asses_line_detection_accuracy/bin
cp data/test.lst ../../asses_line_detection_accuracy/bin
cd ../../asses_line_detection_accuracy/bin
./line_limits -t test.lst -e recogAlg_32.mlf
cp *.l ../../images
cd ../../images
for f in *.l; do
../page_format_add_baseline -i ${f/l/jpg} -l ${f/l/xml} -t $f -m FILE -v 0
cp output.xml ${f/l/xml}
done
Obtaining an inf file for training from a PAGE xml file
In order to generate an inf file with the baseline information from a PAGE file, the command page2inf is used:
cd images
for f in *.xml; do ../page2inf -i ${f/xml/jpg} -l $f -m FILE -v 0 > ${f/xml/inf}; done
mv *.inf ../inf
Polygon based text line detection
The first version of the text line segmentation tool was delivered in the form of a console
application:
TextLineSegmentation [inputimage] [inputxml] [outputxml]
where [inputimage] is the input binary image, [inputxml] the xml file in Page format containing
information about the regions of the document image and [outputxml] the result PAGE xml file in
which the detected text lines are marked using polygons. Examples of the developed text line
segmentation method are provided in Fig. 5.2.2.1.
Figure 5.2.2.1: Text line segmentation example. (a) Initial binary image, (b) text line segmentation result.
5.2.3 Word Detection
The first version of the word segmentation tool was delivered in the form of a console application:
WordSegmentation [inputimage] [inputxml] [outputxml]
where [inputimage] is the input binary image, [inputxml] the xml file in Page format containing
information about the text lines of the document image and [outputxml] the result PAGE xml file in
which the detected words are marked using polygons. Examples of the developed word segmentation method are provided in Fig. 5.2.3.1.
Figure 5.2.3.1: Word segmentation example. (a) Initial binary image, (b) word segmentation result.
5.3 Handwritten Text Recognition
Preliminaries
In this section you will find directions to install the required tools for Handwritten Text Recognition (HTR) and to carry out an HTR experiment. The extracted lines of 50 pages of Bentham are the input of this experiment, and at the end of the process we will obtain their transcriptions and the corresponding Word Error Rate (WER). All the scripts are developed for Linux.
It is necessary to install some standard Linux tools and libraries:
sudo apt-get install netpbm libnetpbm10-dev libpng12-dev libtiff4-dev
HTK and software tools for HTR
To carry out the HTR experiment we are going to use the HTK toolkit, a toolkit for building and manipulating hidden Markov models. This toolkit is publicly available at:
http://htk.eng.cam.ac.uk
Once you have downloaded it, the steps to compile and install it are:
tar -xzvf HTK-3.4.1.tar.gz
cd htk
CFLAGS=-m64 ./configure --prefix=$HOME --without-x
make all
make install
The HTK programs and scripts will be installed in “$HOME/bin” (unless another path has been specified with the --prefix parameter).
The rest of the tools, utilities and examples for HTR can be downloaded from the SVN in wp3/T3.3/HTR-toolsUtils.tgz. To decompress it:
tar -xzvf HTR-toolsUtils.tgz -C $HOME
This will create the directory “HTR” in $HOME. Inside this directory there are four subdirectories:
“src”, “bin”, “scripts” and “BenthamData”.
In “src” is the source code of the preprocessing tools. To compile and install them:
cd $HOME/HTR/src
./compile.sh ../bin
The tools will be installed in “$HOME/HTR/bin”. The following executable files are found:
imgtxtenh: performs noise reduction using a modification of Sauvola
method.
pgmskew: detects the skew angle in the image and corrects it.
pgmslant: detects the slant of words (connected components) in the image
and corrects each slant by a horizontal shift transform (see [Pastor 2004]).
pgmnormsize: performs size normalization of each detected word in the image (see
[Romero 2006]).
pgmtextfea: transforms the already preprocessed image into a sequence of feature vectors.
pfl2htk: converts sequences of real-valued vectors in ASCII format to HTK format.
tasas: given the HTR output and the reference text for each line, computes the WER.
In the next section the use of these tools will be explained in detail.
In order to make these tools and the HTK tools available on the command line, do not forget to include their respective paths in the PATH variable of the shell:
export PATH=$PATH:$HOME/bin:$HOME/HTR/bin:.
In “$HOME/HTR/scripts” the following bash-shell scripts can be found:
HTR-Lines-preprocessing.sh: launches the complete text image preprocessing
on a given set of text line images.
HTR-feature-extraction.sh: launches the features extraction process on a given
set of text line images.
Script_Train.sh: launches a complete HMMs training process.
Script_Test.sh: launches a complete Viterbi testing process.
HTR-Bentham-Experiment.sh: launches a complete HTR experiment
(preprocessing, training and test) with the Bentham corpus.
HTR-LM-testing.sh: launches an HTR test with the language model that it receives as a parameter.
Create_HMMs-TOPOLOGY.sh: sets the initial HMM topology.
Create_WER_Bentham.sh: computes the WER given the system transcription hypothesis and the reference.
You may want to have a look into each of these scripts to learn more about what exactly they are
doing. Most of them just employ HTK commands.
Finally, the directory “$HOME/HTR/BenthamData” has the models, images and partitions required
to carry out an experiment.
An HTR experiment
The script HTR-Bentham-Experiment.sh launches a complete HTR experiment with the Bentham data set. More specifically, it launches a cross-validation experiment with the 10 partitions defined for 50 pages. To execute the script:
cd $HOME/HTR
./scripts/HTR-Bentham-Experiment.sh
The time required to carry out the complete experiment can be several days. The HTR experiment described here involves three different phases: preprocessing and feature extraction, training, and test. In the next subsections we explain in detail what the HTR-Bentham-Experiment.sh script does.
Preprocessing and feature extraction
First, the HTR preprocessing tools are applied on the line images of the corpus. To do this the script HTR-Bentham-Experiment.sh calls the script HTR-Lines-preprocessing.sh:
$D/scripts/HTR-Lines-preprocessing.sh Lines Preprocessed
This script takes as arguments:
1. the directory containing the images in PNG format.
2. the output directory where the preprocessed images will be stored.
This script performs, for each of the images, the following sequence of commands:
imgtxtenh -i $f -S 4 -a | # Clean the background of the image.
pgmskew | # Perform "skew/slope" correction.
pgmslant -M M | # Perform slant correction.
pgmnormsize -c 5| # Perform size normalization.
pnmcrop -white # Remove left and right white zones.
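For reference, the same chain can also be applied manually to a single line image; a minimal sketch with hypothetical file names would be:
imgtxtenh -i line.png -S 4 -a | pgmskew | pgmslant -M M | pgmnormsize -c 5 | pnmcrop -white > line_prep.pgm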
To test different preprocessing operations this script must be changed. For example, if a new pgmslant method is going to be tested, it will be necessary to call the new executable instead of pgmslant. Once all the images have been preprocessed, it is time for the feature extraction process.
In this work we use the features introduced in [Toselli 2004]. The image is divided into square cells
and from each cell three features are calculated: the normalized gray level, the horizontal and
vertical components of the grey level gradient. To do this the script HTR-feature-extraction.sh is
used:
$D/scripts/HTR-feature-extraction.sh Preprocessed Features 20 2 2 2
This script takes as arguments:
1. the directory containing the preprocessed images.
2. the output directory where the feature vector files will be stored.
3. the feature extraction resolution (20).
4. the analysis window size (2 2).
5. the frequency (2).
Training phase
The next step in the HTR-Bentham-Experiment.sh script is to generate an HMM prototype per character with an initial continuous density left-to-right topology.
The script Create_HMMs-TOPOLOGY.sh is used for this:
$D/scripts/Create_HMMs-TOPOLOGY.sh 60 8 > proto
This script accepts as input the dimension of the feature vectors and the number of states to be set
up for HMMs. It outputs an initial HMM topology with only one Gaussian density in the mixture
per state, whose dimension (in this case) is set to 60 = 20 x 3, according to the feature extraction
resolution of 20 computed in the preprocessing step (20 grey level features, 20 horizontal derivative
features and 20 vertical derivative features). The number of states per HMM has been set to 8, each of which has a self-transition probability of 0.6 and a transition to the next state (on the right) with probability 0.4.
Finally (once the processing and feature extraction have finished for all images), the script to train
the HMMs is called:
$D/scripts/Script_Train.sh train hmms proto labels HMM_list
This accepts as arguments:
1. the list of sample files to use in the training process.
2. the directory where the trained HMMs will be stored.
3. the initial HMM topology file (HMM prototype).
4. the transcriptions file in MLF format (this file contains the transcription of the training lines in HTK format).
5. the file of HMMs names.
Recognition phase
Finally, the actual recognition is performed with the HTK command HVite. To do this the script Script_Test.sh is used:
$D/scripts/Script_Test.sh list.test ../Train_${i}/hmms dic bgram.gram HMM_list 50 -50 64
This accepts as arguments:
1. the list of sample files to use in the recognition process.
2. the directory where the trained HMMs are stored.
3. the lexicon in HTK format.
4. the language model in HTK format.
5. the file of HMMs name lists.
6. the grammar scale factor (50) (see [Ogawa 1998]).
7. the word insertion penalty (-50) (see [Ogawa 1998]).
8. the number of Gaussians per state in the HMMs (64).
Once the recognition has finished, it is time to assess the accuracy of the recognized hypotheses. To
do this the following steps are carried out:
1. First, the recognized hypotheses for all the partitions from 0 to 9 are joined in the directory
Test:
for f in Test_?; do
cp $f/resultados/res-64gauss Test; done
2. The script Create_WER_Bentham.sh returns the WER for the obtained transcripts:
$D/Create_WER_Bentham.sh Test $D/BenthamData/Transcription
which accepts as arguments the directory with all hypotheses and the reference transcripts.
Therefore, at the end of this process you will have a directory with all the hypothesis transcripts
($HOME/HTR/Experiments/Test), and the WER obtained by Create_WER_Bentham.sh.
Testing a new lexicon and language model
To test a new lexicon and a new language model we can use the script HTR-LM-testing.sh.
cd $HOME/HTR
./scripts/HTR-LM-testing.sh Lexicon LanguageModel GSF WIP
This accepts as arguments:
1. the new lexicon.
2. the new language model.
3. the grammar scale factor.
4. the word insertion penalty.
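For instance, to test a hypothetical lexicon and bigram language model while keeping the grammar scale factor and word insertion penalty used in the experiment above, one might run:
./scripts/HTR-LM-testing.sh new_lexicon.dic new_bigram.gram 50 -50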
5.4 Keyword Spotting
5.4.1 Query by example
For the execution of the program, the VCImageProcessingProgram.exe file must be run. The program depends on Microsoft .NET Framework 4.5.1. The offline installer for the framework is in the dependencies folder. Run dotNetFx40_Full_x86_x64.exe first and then the file NDP451-KB2858728-x86-x64-AllOS-ENU.exe, which is the update to 4.5.1. The supported operating systems are Microsoft Windows Vista and later editions. Fig. 5.4.1.1 depicts the main interface of the program.
Figure 5.4.1.1: The main interface
The program at the moment supports only the Bentham database, but future versions will support adding databases through the graphical user interface. Now, as shown in Figure 5.4.1.2, in the Configuration -> Debug section you must point to the directory where the Bentham directory lies, with the following directory structure: bentham/images.
Figure 5.4.1.2: Setting a local path to the Bentham database
In order to run the word spotting procedure, you select an image from the list on the right and then click on a word in the image in the main area. A pull-down section will appear with the selected word image as the query. You will have options to query the current document or the database and to select the method, as seen in Figure 5.4.1.3.
Figure 5.4.1.3: Querying a word image
Figure 5.4.1.4: Word spotting results
5.4.2 Query by string
A series of software programs have been developed for the query-by-string tool. The query-by-string tool accepts a word graph and a list of words, and provides the degree to which each word in the list can appear in the word graph and, consequently, in a line.
The provided script Launch-Demo.sh as well as the binary WGPhraseSpooting have been developed for Linux. The following software must be installed in order to proceed:
• HTK 3.4.1 toolkit (for word-graph generation)
• GLIBC 2.13 or later
• Bash shell for scripting
In order to give an example of how to use this tool, we will employ (as input) a word graph of a text
line image belonging to the “IAMDB” database. This word graph was previously obtained during
the decoding process of the corresponding text line image using HTK's HVite command with a
character HMM filler model (see Sec. 4.4.2).
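In its simplest form, the underlying binary can be invoked directly on a single word graph with the same options used inside the loop below (paths taken from the example distribution):
../bin/WGPhraseSpooting -i f07-084b-04.lat.gz -w spotWrdList_SPACE.lst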
Obtain the spotting confidence score for each word listed in spotWrdList_SPACE.lst:
for f in WG_Den30/*.lat.gz; do
  F=$(basename $f .lat.gz)
  echo $F
  ../bin/WGPhraseSpooting -i $f -w spotWrdList_SPACE.lst 2>/dev/null |
    awk 'BEGIN{while (getline < "puntMap.inf" > 0) M[$1]=$2}
         !/^#/{ CAD=M[$1]
                for (i=2;i<NF-4;i++) CAD=CAD"#"M[$i]
                print CAD,$(NF-4),$(NF-3),$(NF-2),$(NF-1) }' > WG_Den30/${F}_Vit.lst
done
The script Launch-Demo.sh launches WGPhraseSpooting taking as inputs the word graph file “f07-084b-04.lat.gz”, in Standard Lattice Format (SLF), and the file “spotWrdList_SPACE.lst” listing the query words to be spotted. In this case, each word entry of this list is expanded into the corresponding character sequence separated by spaces.
To execute the script Launch-Demo.sh:
cd pub/deliverables/D3.1/D3.1.1/5.4.2/Index-Generation-Tool/Example
./launch.sh f07-084b-04.lat.gz spotWrdList_SPACE.lst > scores
The final spotting scores for all the entries in “spotWrdList_SPACE.lst” are stored in the file “scores”.
For this case, the approximate total elapsed time required by this tool to process the whole query word list “spotWrdList_SPACE.lst” is around 150 seconds, assuming that a dedicated single core of a 64-bit Intel Core Quad CPU running at 2.83 GHz has been employed.
It is important to remark that this tool was conceived only for empirically testing the approach addressed in Sec. 4.4.2; it currently lacks a user-friendly interface for interacting with the user and for showing the spotting results, which is planned for future deliverables.
6. References
[Agrawal 2013] M. Agrawal and D. Doermann, “Clutter Noise Removal in Binary Document
Images”, International Journal on Document Analysis and Recognition (IJDAR), vol. 16, no. 4, pp.
351-369, 2013.
[Alireza 2011] Alireza A., Umapada P., Nagabhushan P. and Kimura F., "A Painting Based
Technique for Skew Estimation of Scanned Documents", International Conference on Document
Analysis and Recognition, pp.299-303, 2011.
[Alvaro 2013] F. Alvaro, F. Cruz, J.-A. Sanchez, O. Ramos-Terrades, and J.-M. Benedí. Page segmentation of structured documents using 2D stochastic context-free grammars. In 6th Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), LNCS 7887, pages 133-140.
Springer, 2013.
[Arivazhagan 2007] M. Arivazhagan, H. Srinivasan, and S. N. Srihari, "A Statistical Approach to
Handwritten Line Segmentation", Document Recognition and Retrieval XIV, Proceedings of SPIE,
2007, pp. 6500T-1-11.
[Avila 2004] B.T. Avila and R.D. Lins, “A New Algorithm for Removing Noisy Borders from
Monochromatic Documents”, Proc. of ACM-SAC’2004, pp 1219-1225, Cyprus, ACM Press, March
2004.
[Baird 1987] Baird H. S, "The skew angle of printed documents", Proc. SPSE 40th symposium
hybrid imaging systems, Rochester, NY, pp 739–743M, 1987.
[Ballesteros 2005] Ballesteros J., Travieso C.M., Alonso J.B. and M.A. Ferrer, "Slant estimation of
handwritten characters by means of Zernike moments", Electronics Letters, Vol. 41, No 20, pp.
1110-1112, 2005.
[Barney Smith 2010] E.H. Barney Smith, L. Likforman-Sulem and J. Darbon, “Effect of pre-
processing on binarization,” in Proc. SPIE Conf. on Document Recognition Retrieval (DRR’10),
vol. 7534, San Jose, CA USA, Jan. 2010.
[Basu 2007] S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri, and D.K. Basu, "Text Line Extraction
from Multi – Skewed Handwritten Documents", Pattern Recognition, vol. 40, no. 6, June 2007, pp.
1825-1839.
[Bazzi 1999] I. Bazzi, R. Schwartz, and J. Makhoul. An omnifont open-vocabulary OCR system for
English and Arabic. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(6):495-
504, 1999.
[Bertolami 2008] R. Bertolami and H. Bunke. Hidden markov model-based ensemble methods for
offline handwritten text line recognition. Pattern Recognition, 41(11):3452-3460, 2008.
[Blumenstein 2002] Blumenstein M., Cheng C. K. and Liu X. Y., "New Preprocessing Techniques
for Handwritten Word Recognition," Second IASTED International Conference on Visualization,
Imaging and Image Processing, pp.332-336, 2002.
[Boquera 2011] S. España-Boquera, M.J. Castro-Bleda, J. Gorbe-Moya, and F. Zamora-Martínez. Improving offline handwriting text recognition with hybrid HMM/ANN models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):767-779, 2011.
[Bosch 2012] V. Bosch, A. H. Toselli, and E. Vidal. “Statistical text line analysis in handwritten
documents”, In International Conference on Frontiers in Handwriting Recog-nition (ICFHR), pages
201-206, 2012.
[Bozinovic 1989] Bozinovic A. and Srihari A., "Off-Line Cursive Script Word Recognition",
Transactions on Pattern Analysis and Machine Intelligence, Vol II, No 1, pp. 69-82, 1989.
[Bruzzone 1999] E. Bruzzone, and M. C. Coffetti, ''An Algorithm for Extracting Cursive Text
Lines'', Proc. 5th Int’l Conf. on Document Analysis and Recognition (ICDAR’99), 1999, pp. 749-
752.
[Bukhari 2012] S. S. Bukhari, T. M. Breuel, A Asi, J. El-Sana, “Layout Analysis for Arabic
Historical Document Images Using Machine Learning”, ICFHR 2012, pp. 635-640.
[Bulacu 2007] M. Bulacu, R. van Koert, L.Schomaker, T. van der Zant, "Layout analysis of
handwritten historical documents for searching the archive of the Cabinet of the Dutch Queen",
ICDAR 2007, pp. 357-361.
[Bunke 1995] H. Bunke, M. Roth, and E.G. Suchakat-Talamazzini. Off-line cursive handwriting recognition using hidden Markov models. Pattern Recognition, 28(9):1399-1413, 1995.
[Bunke 2003] H. Bunke. Recognition of cursive Roman handwriting - past, present and future. In Proceedings of the 7th International Conference on Document Analysis and Recognition (ICDAR 03), page 448, Washington, DC, USA, 2003. IEEE Computer Society.
[Canny 1986] J. Canny, “A Computational Approach to Edge Detection”, IEEE Trans. on Pattern
Analysis and Machine Intelligence, vol. 8, no. 6, Nov. 1986, pp. 679-698.
[Chang 2004] F. Chang, C.J. Chen and C.J. Lu, “A linear-time component-labeling algorithm using
contour tracing technique”, Computer Vision and Image Understanding, Vol. 93, No.2, February
2004, pp. 206-220.
[Chatzichristofis 2011] S. A. Chatzichristofis, K. Zagoris, and A. Arampatzis. The TREC files: the (ground) truth is out there. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 1289–1290, New York, NY, USA, 2011. ACM.
[Chou 2007] Chou C.-H., Chu S.-Y.and Chang F., "Estimation of skew angles for scanned
documents based on piecewise covering by parallelograms", Pattern Recognition vol. 40, pp.443–
455, 2007.
[Ciardiello 1988] Ciardiello G., Scafuro G., Degrandi M. T., Spada M. R. and Roccotelli M. P., "An
experimental system for office document handling and text recognition", Proc. 9th international
conference on pattern recognition, pp. 739–743, 1988.
[Cote 1998] Cote M., Lecolinet E., Cheriet M. and Suen C., "Automatic reading of cursive scripts
using a reading model and perceptual concepts," International Journal on Document Analysis and
Recognition, vol. 1, no. 1, pp. 3–17, 1998.
[Crespi 2008] S. Crespi Reghizzi and M. Pradella. A CKY parser for picture grammars. Information Processing Letters, 105(6):213-217, February 2008.
[Cruz 2012] F. Cruz and O. R. Terrades. Document segmentation using relative location features. In
International Conference on Pattern Recognition, pages 1562-1565, Japan, 2012.
[Dey 2012] S. Dey, J. Mukhopadhyay, S. Sural and P. Bhowmick, “Margin noise removal from
printed document images”, Workshop on Document Analysis and Recognition (DAR 2012). ACM,
USA, pp. 86-93, 2012.
[Deya 2010] Deya P. and Noushath S., "e-PCP: A robust skew detection method for scanned
document images", Pattern Recognition vol. 43, pp.937-948, 2010.
[Ding 1999] Ding Y., Kimura F., Miyake Y. and Shridhar M., "Evaluation and Improvement of slant
estimation for handwritten words", Proceedings of the Fifth International Conference on Document
Analysis and Recognition (ICDAR '99), pp. 753-756, 1999.
[Ding 2000] Ding Y., Kimura F. and Miyake Y., "Slant estimation for handwritten words by
directionally refined chain code", 7th International Workshop on Frontiers in Handwriting
Recognition (IWFHR 2000), pp. 53-62, 2000.
[Ding 2004] Ding Y., Ohyama W., Kimura F. and Shridhar M., "Local Slant Estimation for
Handwritten English Words", 9th International Workshop on Frontiers in Handwriting Recognition
(IWFHR 2004), pp. 328-333, 2004.
[Dong 2005] Dong J., Ponson D., Krzyÿzak A. and Suen C. Y., "Cursive word skew/slant
corrections based on Radon transform," Proc. of 8th International Conference on Document
Analysis and Recognition, pp.478-483, 2005.
[Drira 2011] F. Drira, F. LeBourgeois and H. Emptoz, “A new PDE-based approach for singularity-
preserving regularization: application to degraded characters restoration”, International Journal on
Document Analysis and Recognition, DOI 10.1007/s10032-011-0165-5, 2011.
[Du 2008] X. Du, W. Pan, and T. D. Bui, "Text line segmentation in handwritten documents using
Mumford–Shah model", Proc. 11th Int’l Conf. on Frontiers in Handwriting Recognition
(ICFHR'08), 2008, pp. 253–258.
[Esteve 2009] A. Esteve, C. Cortina, and A. Cabre. Long term trends in marital age homogamy
patterns: Spain, 1992-2006. Population, 64(1):173-202, 2009.
[Fadoua 2006] D. Fadoua, F. Le Bourgeois and H. Emptoz, “Restoring Ink Bleed-Through
Degraded Document Images Using a Recursive Unsupervised Classification Technique”, 7th
international conference on Document Analysis Systems, pp. 38-49, 2006.
[Fan 2002] K.C. Fan, Y.K. Wang and T.R. Lay, “Marginal Noise Removal of Document Images”,
Pattern Recognition, vol.35, no. 11, pp. 2593-2611, 2002.
[Fenrich 1992] R. Fenrich, S. Lam, and S. N. Srihari. Chapter optical character recognition. In
Encyclopedia of Computer Science and Engineering, pages 993-1000. Third ed., Van Nostrand,
1992.
[Fischer 2010] A. Fischer, A. Keller, V. Frinken, and H. Bunke. Lexicon-free handwritten word spotting using character HMMs. Pattern Recognition Letters, 33(7):934-942, 2012. Special Issue on Awards from ICPR 2010.
[Fletcher 1988] L. A. Fletcher, and R. Kasturi, "A Robust Algorithm for Text String Separation from
Mixed Text/Graphics Images", IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol.10, no.6, Nov. 1988, pp. 910-918.
[Frinken 2012] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke. A Novel Word Spotting Method Based on Recurrent Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(2):211-224, Feb. 2012.
[Fujisawa 2010] H. Fujisawa and C.-L. Liu. Directional pattern matching for character recognition
revisited. algorithms, 13:14, 2010.
[Gatos 1997] Gatos B., Papamarkos N. and Chamzas C., "Skew detection and text line position
determination in digitized documents", Pattern Recognition, vol.30, issue 9, pp.1505-1519, 1997.
[Gatos 2006] B. Gatos, I. Pratikakis, and S.J. Perantonis, “Adaptive degraded document image
binarization,” Pattern Recognition, vol. 39, no. 3, pp. 317–327, Mar. 2006.
[Gatos 2008] B. Gatos, I. Pratikakis, and S.J. Perantonis, “Improved document image binarization
by using a combination of multiple binarization techniques and adapted edge information”, in
Proceedings of the 19th International Conference on Pattern Recognition (ICPR’08), Tampa,
Florida, USA, Dec.2008, pp. 1-4.
[Gatos 2009] B. Gatos and I. Pratikakis. Segmentation-free word spotting in historical printed documents. In Document Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on, pages 271–275. IEEE, 2009.
[Gatos 2011] B. Gatos, K. Ntirogiannis and I. Pratikakis, “DIBCO 2009: Document Image
Binarization Contest,” International Journal on Document Analysis and Recognition, vol. 14, no. 1,
pp. 35–44, 2011.
[Gatos 2014] B. Gatos, G. Louloudis, T. Causer, K. Grint, V. Romero, J.A. Sanchez, A.H. Toselli, and E. Vidal. Ground-Truth Production in the tranScriptorium Project. In Proceedings of the DAS, 2014.
[Gorman 1993] Gorman L., "The document spectrum for page layout analysis," IEEE Trans Pattern
Anal Mach Intell vol. 15 issue 11, pp. 1162–1173, 1993.
[Gould 2008] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-class segmentation with relative location prior. Int. Journal of Computer Vision, 80(3):300-316, 2008.
[Graves 2009] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855-868, 2009.
[Gustafson 1978] D. E Gustafson and W. C Kessel. Fuzzy clustering with a fuzzy covariance
matrix. In Decision and Control including the 17th Symposium on Adaptive Processes, 1978 IEEE
Conference on, volume 17, pages 761–766. IEEE, 1978.
[Hashizume 1986] Hashizume A., Yeh P.S. and Rosenfeld A., "A method of detecting the orientation
of aligned components", Pattern Recognition Letters, vol. 4, pp. 125-132, 1986.
[Hassan 2012] E. Hassan, S. Chaudhury, and M. Gopal. Word shape descriptor-based document
image indexing: a new dbh-based approach. International Journal on Document Analysis and
Recognition (IJDAR), pages 1–20, 2012.
[Hinds 1990] Hinds J., Fisher L. and D’Amato D. P., "A document skew detection method using
run-length encoding and the Hough transform", Proceedings of the 10th international conference
pattern recognition. IEEE CS Press, Los Alamitos, CA, pp 464–468, 1990.
[Howe 2012] N.R. Howe, “Document Binarization with Automatic Parameter Tuning”,
International Journal on Document Analysis and Recognition, DOI: 10.1007/s10032-012-0192-x,
Sep. 2012.
[Huang 2008] C. Huang, and S. Srihari, "Word segmentation of off-line handwritten documents",
Proc. Annual Symposium on Document Recognition and Retrieval (DRR) XV, IST/SPIE, 2008.
[Impedovo 1991] S. Impedovo, L. Ottaviano, and S. Occhiegro. Optical character recognition: a survey. International Journal of Pattern Recognition and Artificial Intelligence, 5(1):1-24, 1991.
[Ishitani 1993] Ishitani Y., "Document skew detection based on local region complexity", Proc. 2nd
international conference on document analysis and recognition, Tsukuba, Japan, pp. 49–52, 1993.
[Jain 1989] A. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs,
NJ, 1989.
[Jelinek 1998] F. Jelinek. Statistical methods for speech recognition. MIT Press, 1998.
[Juan 2001] A. Juan and E. Vidal. On the use of Bernoulli mixture models for text classification. In
Proceedings of the Workshop on Pattern Recognition in Information Systems (PRIS 01), Setubal
(Portugal), July 2001.
[Jung 2004] Jung K., Kim K.I. and Jain A.K., "Text information extraction in images and video: a survey", Pattern Recognition vol. 37, issue 5, pp. 977-997, 2004.
[Kamel 1993] M. Kamel, and A. Zhao, “Extraction of Binary Character/Graphics Images from Grayscale Document Images”, CVGIP: Computer Vision Graphics and Image Processing, vol. 55(3), pp. 203-217, May 1993.
[Kavallieratou 1999] Kavallieratou E., Fakotakis N. and Kokkinakis G., "New algorithms for
skewing correction and slant removal on word-level," Proc. IEEE 6th International Conference on
Electronics, Circuits and Systems, Pafos, Cyprus, pp. 1159–1162, 1999.
[Kavallieratou 2001] Kavallieratou E., Fakotakis N., Kokkinakis G., "Slant estimation algorithm for
OCR systems", Pattern Recognition, Vol. 34, pp. 2515-2522, 2001.
[Kavallieratou 2002] E. Kavallieratou, N. Fakotakis, and G. Kokkinakis. An unconstrained handwriting recognition system. International Journal of Document Analysis and Recognition, 4:226-242, 2002.
[Kennard 2006] D. J. Kennard, and W. A. Barrett, "Separating Lines of Text in Free-Form
Handwritten Historical Documents", Proc. 2nd Int’l Conf. on Document Image Analysis for
Libraries (DIAL'06), 2006, pp. 12-23.
[Keshet 2009] J. Keshet, D. Grangier, and S. Bengio. Discriminative keyword spotting. Speech
Communication, 51(4):317-329, April 2009.
[Kesidis 2011] A. L. Kesidis, E. Galiotou, B. Gatos, and I. Pratikakis. A word spotting framework for historical machine-printed documents. International Journal on Document Analysis and Recognition (IJDAR), 14(2):131–144, 2011.
[Keysers 2002] D. Keysers, R. Paredes, H. Ney, and E. Vidal. Combination of tangent vectors and
local representations for handwritten digit recognition. In International Workshop on Statistical
Pattern Recognition, Lecture Notes in Computer Science, pages 407-441. Springer-Verlag, 2002.
[Khoury 2010] I. Khoury, A. Gimenez and A. Juan. Windowed Bernoulli Mixture HMMs for Arabic Handwritten Word Recognition. In Proceedings of the 12th International Conference on Frontiers in Handwriting Recognition (ICFHR), Kolkata (India), Nov. 2010.
[Khurshid 2012] K. Khurshid, C. Faure, and N. Vincent. Word spotting in historical printed documents using shape and sequence comparisons. Pattern Recognition, 45(7):2598–2609, 2012.
[Kim 1997] Kim G. and Govindaraju V., “A Lexicon Driven Approach to Handwritten Word
Recognition for Real-Time Applications”, IEEE Trans. Pattern Anal.Mach. Intell., Vol. 19, Issue 4,
pp. 366-379, 1997.
[Kim 1998] G. Kim, and V. Govindaraju, "Handwritten Phrase Recognition as Applied to Street
Name Images", Pattern Recognition, vol. 31, no. 1, Jan. 1998, pp. 41-51.
[Kim 1999] G. Kim, V. Govindaraju, and S. Srihari. An architecture for handwritten text
recognition systems. International Journal of Document Analysis and Recognition, 2(1):37-44,
1999.
[Kim 2001] S.H. Kim, S. Jeong, G.- S. Lee, C.Y. Suen, "Word segmentation in handwritten Korean
text lines based on gap clustering techniques", Proc. 6th Int’l Conf. on Document Analysis and
Recognition (ICDAR’01), 2001, pp. 189-193.
[Kim 2002] I.K. Kim, D.W. Jung, and R.H. Park, “Document image binarization based on
topographic analysis using a water flow model,” Pattern Recognition, vol. 35, no. 1, pp. 265–277,
2002.
[Kimura 1993] Kimura F., Shridhar M. and Chen Z., "Improvements of a Lexicon Directed
Algorithm for Recognition of Unconstrained Handwritten Words", 2nd International Conference on
Document Analysis and Recognition (ICDAR 1993), pp. 18-22, Oct. 1993.
[Kneser 1995] R. Kneser and H. Ney. Improved backing-off for m-gram language modeling.
volume 1, pages 181-184, Detroit, USA, 1995.
[Koerich 2002] A.L. Koerich, Y. Leydier, R. Sabourin, and C. Y. Suen. A hybrid large vocabulary handwritten word recognition system using neural networks with hidden Markov models. In Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth International Workshop on, pages 99–104. IEEE, 2002.
[Koerich 2003] A. L. Koerich, R. Sabourin, and C. Y. Suens. Large vocabulary off-line
handwriting recognition: A survey. Pattern Analysis Applications, 6(2):97-121, 2003.
[Konidaris 2008] T. Konidaris, B. Gatos, S.J. Perantonis, and A. Kesidis. Keyword matching in
historical machine-printed documents using synthetic data, word portions and dynamic time
warping.In Document Analysis Systems, 2008.DAS ’08. The Eighth IAPR International Workshop
on, pages 539–545, 2008.
[Lavrenko 2004] V. Lavrenko, T. M. Rath, and R. Manmatha. Holistic word recognition for handwritten historical documents. In Document Image Analysis for Libraries, 2004. Proceedings. First International Workshop on, pages 278–287. IEEE, 2004.
[Le 1996] D.X. Le and G.R. Thoma, “Automated Borders Detection and Adaptive Segmentation for
Binary Document Images”, International Conference on Pattern Recognition, p. III: 737-741, 1996.
[Lee 1996] S. W. Lee. Off-line recognition of totally unconstrained handwritten numerals using MCNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:648-652, 1996.
[Lemaitre 2006] A. Lemaitre, and J. Camillerapp, "Text Line Extraction in Handwritten Document
with Kalman Filter Applied on Low Resolution Image", Proc. 2nd Int’l Conf. on Document Image
Analysis for Libraries (DIAL'06), 2006, pp. 38-45.
[Leung 2005] C.C. Leung, K.S. Chan, H.M. Chan and W.K. Tsui, “A new approach for image
enhancement applied to low-contrast–low-illumination IC and document images”, Pattern
Recognition Letters, Vol.26, pp. 769–778, 2005.
[Leydier 2007] Y. Leydier, F. Lebourgeois, and H. Emptoz. Text search for medieval manuscript images. Pattern Recognition, 40(12):3552–3567, 2007.
[Leydier 2009] Y. Leydier, A. Ouji, F. LeBourgeois, and H. Emptoz. Towards an omnilingual word retrieval system for ancient manuscripts. Pattern Recognition, 42(9):2089–2105, 2009.
[Li 2008] Y. Li, Y. Zheng, and D. Doermann, "Script-Independent Text Line Segmentation in
Freestyle Handwritten Documents'', IEEE Trans. Pattern Anal. Machine Intell. (T-PAMI), vol. 30,
no. 8, pp. 1313-1329, August, 2008.
[Likforman 1995] L. Likforman-Sulem, A. Hanimyan, and C. Faure, ''A Hough Based Algorithm
for Extracting Text Lines in Handwritten Documents'', Proc. 3rd Int’l Conf. on Document Analysis
and Recognition (ICDAR’95), 1995, pp. 774-777.
[Likforman 2007] L. Likforman-Sulem, A. Zahour, and B. Taconet, ''Text Line Segmentation of Historical Documents: A Survey'', International Journal on Document Analysis and Recognition (IJDAR), vol. 9, no. 2-4, April 2007, pp. 123-138.
[Link 1] https://engineering.purdue.edu/~bouman/software/cluster/
[Liwicki 2007] M. Liwicki, E. Indermuhle, and H. Bunke. On-line handwritten text line detection using dynamic programming. In Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, volume 1, pages 447-451, Sept. 2007.
[Llados 2012] J. Llados, M. Rusinol, A. Fornes, D. Fernandez, and A. Dutta. On the influence of word representations for handwritten word spotting in historical documents. International Journal of Pattern Recognition and Artificial Intelligence, 26(05), 2012.
[Lleida 1993] E. Lleida, J. B. Marino, J. M. Salavedra, A. Bonafonte, E. Monte, and A. Martinez.
Out-of-vocabulary word modelling and rejection for keyword spotting. In Proceedings of the Third
European Conference on Speech Communication and Technology, pages 1265-1268, Berlin,
Germany, 1993.
[Louloudis 2008] G. Louloudis, B. Gatos, I. Pratikakis, C. Halatsis, "Text line detection in
handwritten documents", Pattern Recognition, Vol. 41, Issue 12, pp. 3758-3772, 2008.
[Louloudis 2009a] G. Louloudis, N. Stamatopoulos and B. Gatos, "A Novel Two Stage Evaluation
Methodology for Word Segmentation Techniques", 10th International Conference on Document
Analysis and Recognition (ICDAR'09), pp. 686-690, Barcelona, Spain, July 2009.
[Louloudis 2009b] G. Louloudis, B. Gatos, I. Pratikakis, C. Halatsis, "Text line and word
segmentation of handwritten documents", Pattern Recognition, Vol. 42, Issue 12, pp. 3169-3183,
2009.
[Lu 2000] Z. Lu, R. Schwartz, and C. Raphael. Script-independent, HMM-based text line finding for OCR. In Pattern Recognition, 2000. Proceedings. 15th International Conference on, volume 4, pages 551-554, 2000.
[Lu 2003] Lu Y. and Tan C. L., "A nearest-neighbor chain based approach to skew estimation in document images", Pattern Recognition Letters vol. 24, pp. 2315–2323, 2003.
[Lu 2010] S. Lu, B. Su and C.L. Tan, “Document image binarization using background estimation
and stroke edges,” International Journal on Document Analysis and Recognition, vol. 13, no. 4, pp.
303-314, 2010.
[Luthy 2007] F. Luthy, T. Varga, and H. Bunke, ''Using Hidden Markov Models as a Tool for
Handwritten Text Line Segmentation'', Proc. 9th Int’l Conf. on Document Analysis and Recognition
(ICDAR’07), 2007, pp. 8-12.
[Madhvanath 1999] Madhvanath S., Kim G., and Govindaraju V., “Chaincode contour processing
for handwritten word recognition,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 21, no. 9, pp. 928–932, 1999.
[Mahadevan 1995] U. Mahadevan, and R. C. Nagabushnam, ''Gap metrics for word separation in
handwritten lines'', Proc. 3rd Int’l Conf. on Document Analysis and Recognition (ICDAR’95),
1995, pp. 124-127.
[Makridis 2007] M. Makridis, N. Nikolaou and B. Gatos, "An Efficient Word Segmentation
Technique for Historical and Degraded Machine-Printed Documents", Proc. 9th Int’l Conf. on
Document Analysis and Recognition (ICDAR'07), 2007, pp. 178-182.
[Malleron 2009] V. Malleron, V. Eglin, H. Emptoz, S. Dord-Crousle, P. Regnier, “Text Lines and
Snippets Extraction for 19th Century Handwriting Documents Layout Analysis”, ICDAR 2009, pp.
1001-1005.
[Manmatha 1996] R. Manmatha, C. Han, and E. M. Riseman. Word spotting: A new approach to indexing handwriting. In Computer Vision and Pattern Recognition, 1996. Proceedings CVPR’96, 1996 IEEE Computer Society Conference on, pages 631–637. IEEE, 1996.
[Manning 2008] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[Marin 2005] J.M. Marin, K. Mengersen, and C.P. Robert, "Bayesian Modelling and Inference on
Mixtures of Distributions", Handbook of Statistics, vol. 25, Elsevier-Sciences, 2005.
[Marti 2001] U.-V. Marti and H. Bunke. Using a Statistical Language Model to improve the performance of an HMM-Based Cursive Handwriting Recognition System. International Journal of Pattern Recognition and Artificial Intelligence, 15(1):65-90, 2001.
[Marti 2002] U.-V. Marti and H. Bunke. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5:39-46, 2002.
[McCowan 2004] I. A. McCowan, D. Moore, J. Dines, D. G.-Perez, M. Flynn, P. Wellner, and H. Bourlard. On the use of information retrieval measures for speech recognition evaluation. Idiap-RR Idiap-RR-73-2004, IDIAP, Martigny, Switzerland, 2004.
[Mori 1992] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development. Proceedings of the IEEE, 80(7):1029-1058, 1992.
[Morita 1999] Morita M., Facon J., Bortolozzi F., Garnes S. and Sabourin R., "Mathematical
morphology and weighted least squares to correct handwriting baseline skew," Proc. of 5th
International Conference on Document Analysis and Recognition, Bangalore, India, pp. 430–433,
1999.
[Nagy 2000] G. Nagy. Twenty years of document image analysis. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 22(1):38-62, 2000.
[Niblack 1986] W. Niblack, An Introduction to Digital Image Processing. Englewood Cliffs, NJ:
Prentice Hall, 1986, pp. 115–116.
[Nicolas 2006] S. Nicolas, T. Paquet and L. Heutte, "Complex Handwritten Page Segmentation
Using Contextual Models“, DIAL 2006, pp. 46-59.
[Nicolaou 2009] A. Nicolaou and B. Gatos, "Handwritten Text Line Segmentation by Shredding
Text into its Lines", 10th International Conference on Document Analysis and Recognition
(ICDAR'09), pp. 626-630, Barcelona, Spain, July 2009.
[Nikolaou 2010] N. Nikolaou, M. Makridis, B. Gatos, N. Stamatopoulos and N. Papamarkos,
“Segmentation of historical machine-printed documents using Adaptive Run Length Smoothing and
skeleton segmentation paths”, Image and Vision Computing, vol. 28, no. 4, pp. 590-604, 2010.
[Nicolas 2004] S. Nicolas, T. Paquet, and L. Heutte, "Text Line Segmentation in Handwritten
Document Using a Production System", Proc. 9th Int’l Workshop on Frontiers in Handwriting
Recognition (IWFHR’04), 2004, pp. 245-250.
[Nist 2013] TREC NIST. http://trec.nist.gov/pubs/trec16/appendices/measures.pdf, 2013.
[Ntirogiannis 2014] K. Ntirogiannis, B. Gatos and I. Pratikakis, "A Combined Approach for the
Binarization of Handwritten Document Images", Pattern Recognition Letters - Special Issue on
Frontiers in Handwriting Processing, vol. 35, no.1, pp. 3-15, Jan. 2014.
[Ogawa 1998] A. Ogawa, K. Takeda, and F. Itakura. Balancing acoustic and linguistic probabilities. Volume 1, pages 181-184, Seattle, WA, USA, 1998.
[Otsu 1975] N. Otsu. A threshold selection method from gray-level histograms. Automatica, 11(285-296):23–27, 1975.
[Otsu 1979] N. Otsu, “A Thresholding Selection Method from Gray-level Histogram,” IEEE Trans.
Syst., Man, Cybern., vol. 9, no. 1, pp. 62–66, 1979.
[Oztop 1999] E. Oztop, A.Y. Mlayim, V. Atalay, and F. Yarman-Vural. Repulsive attractive network
for baseline extraction on document images. Signal Processing, 75(1):1 - 10, 1999.
[Papandreou 2011] Papandreou A. and Gatos B., "A Novel Skew Detection Technique Based on
Vertical Projections", Proc. 11th international conference on document analysis and recognition,
ICDAR '11, pp. 384-388, 2011.
[Papandreou 2012] Papandreou A. and Gatos B., "Word Slant Estimation using Non-Horizontal
Character Parts and Core-Region Information", 10th IAPR International Workshop on Document
Analysis Systems (DAS 2012), pp. 307-311, 2012.
[Papandreou 2013a] Papandreou A. and Gatos B., "A Coarse to Fine Skew Estimation Technique
for Handwritten Words", Proc. 12th international conference on document analysis and recognition,
ICDAR '13, 2013.
[Papandreou 2013b] Papandreou A., Gatos B., Louloudis G. and Stamatopoulos N., "ICDAR2013
Document Image Skew Estimation Contest (DISEC’13)", Proc. 12th international conference on
document analysis and recognition, ICDAR '13, 2013.
[Papandreou 2014] Papandreou A. and Gatos B., "Slant estimation and core-region detection for
handwritten Latin words", Pattern Recognition Letters Vol. 35 pp. 16-22, 2014.
[Pastor 2004] M. Pastor, A. Toselli, and E. Vidal. Projection Profile Based Algorithm for Slant
Removal. In International Conference on Image Analysis and Recognition (ICIAR’04), Lecture
Notes in Computer Science, pages 183-190, Porto, Portugal, September 2004. Springer-Verlag.
[Plamondon 2000] R. Plamondon and S. N. Srihari. On-Line and Off-Line Handwriting
Recognition: A Comprehensive Survey. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(1):63-84, 2000.
[Postl 1986] Postl W., "Detection of linear oblique structures and skew scan in digitized
documents", Proc. 8th international conference on pattern recognition, pp. 687-689, 1986.
[Pratikakis 2010] I. Pratikakis, B. Gatos and K. Ntirogiannis, “H-DIBCO 2010 Handwritten
Document Image Binarization Competition,” in Proc. Int. Conf. on Frontiers in Handwriting
Recognition (ICFHR’10), Kolkata, India, Nov. 2010, pp. 727–732.
[Pratikakis 2011] I. Pratikakis, B. Gatos and K. Ntirogiannis, “ICDAR 2011 Document Image
Binarization Contest - DIBCO 2011,” in Proc. Int. Conf. on Document Analalysis and Recognition
(ICDAR’11), Beijing, China, Sep. 2011, pp. 1506–1510.
[Pratikakis 2013] I. Pratikakis, B. Gatos and K. Ntirogiannis, "ICDAR 2013 Document Image
Binarization Contest (DIBCO 2013)", 2013 International Conference on Document Analysis and
Recognition (ICDAR'13), Washington DC, USA, Aug. 2013, pp. 1471-1476.
[Pu 1998] Y. Pu, and Z. Shi, '' A Natural Learning Algorithm Based on Hough Transform for Text
Lines Extraction in Handwritten Documents'', Proc. 6th Int’l Workshop on Frontiers in Handwriting
Recognition (IWFHR’98), 1998, pp. 637-646.
[Ramponi 1996] G. Ramponi, N. Strobel, S. K. Mitra, and T. Yu, “Nonlinear unsharp masking
methods for image contrast enhancement,” J. Electron. Imag.,vol. 5, pp. 353–366, Jul. 1996.
[Rath 2003a] T. M. Rath and R. Manmatha. Features for word spotting in historical manuscripts. In Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on, pages 218–222. IEEE, 2003.
[Rath 2003b] T. M. Rath and R. Manmatha. Word image matching using dynamic time warping. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II–521. IEEE, 2003.
[Ratzlaff 2003] E. H. Ratzlaff. Methods, Report and Survey for the Comparison of Diverse Isolated
Character Recognition Results on the UNIPEN Database. In Proceedings of the Seventh Inter-
national Conference on Document Analysis and Recognition (ICDAR ’03), volume 1, pages 623-
628, Edinburgh, Scotland, August 2003.
[Robertson 2008] S. Robertson. A new interpretation of average precision. In Proceedings of the
31st annual international ACM SIGIR conference on Research and development in information
retrieval, SIGIR, pages 689-690, New York, NY, USA, 2008. ACM.
[Rodriguez 2008] J. A. Rodriguez and F. Perronnin. Local gradient histogram features for word
spotting in unconstrained handwritten documents. In Int. Conf. on Frontiers in Handwriting
Recognition, 2008.
[Romero 2006] V. Romero, M. Pastor, A. H. Toselli, and E. Vidal. Criteria for handwritten off-line text size normalization. In Proc. of the Sixth IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP 06), Palma de Mallorca, Spain, August 2006.
[Romero 2007] V. Romero, A. Giménez, and A. Juan. Explicit Modelling of Invariances in
Bernoulli Mixtures for Binary Images. In 3rd Iberian Conference on Pattern Recognition and
Image Analysis, volume 4477 of Lecture Notes in Computer Science, pages 539-546. Springer-
Verlag, Girona (Spain), June 2007.
[Romero 2007] V. Romero, A. H. Toselli, L. Rodriguez, and E. Vidal. Computer Assisted
Transcription for Ancient Text Images. In International Conference on Image Analysis and
Recognition (ICIAR 2007), volume 4633 of LNCS, pages 1182-1193. Springer-Verlag, Montreal
(Canada), August 2007.
[Romero 2007] V. Romero, V. Alabau, and J. M. Benedí. Combination of N-grams and Stochastic
Context- Free Grammars in an Offline Handwritten Recognition System. In 3rd Iberian Conference
on Pattern Recognition and Image Analysis, volume 4477 of LNCS, pages 467-474. Springer-
Verlag, Girona (Spain), June 2007.
[Romero 2013] V. Romero, A. Fornés, N. Serrano, J.A. Sánchez, A.H. Toselli, V. Frinken, E. Vidal,
and J. Lladós. The ESPOSALLES database: An ancient marriage license corpus for off-line
handwriting recognition. Pattern Recognition, 46:1658–1669, 2013.
http://dx.doi.org/10.1016/j.patcog.2012.11.024.
[Roufs 1997] J.A.J. Roufs and M.C. Boschman, “Text Quality Metrics for Visual Display Units: I.
Methodological Aspects”, Displays, vol. 18, no.1, pp. 37–43, 1997.
[Roy 2008] P.P. Roy, U. Pal, and J. Llados, "Morphology based handwritten line segmentation using
foreground and background information", Proc. 11th Int’l Conf. on Frontiers in Handwriting
Recognition (ICFHR'08), 2008, pp. 241–246.
[Rusinol 2011] M. Rusinol, D. Aldavert, R. Toledo, and J. Lladós. Browsing heterogeneous
document collections by a segmentation-free word spotting method. In Document Analysis and
Recognition (ICDAR), 2011 International Conference on, pages 63–67. IEEE, 2011.
[Saabni 2014] Raid Saabni, Abedelkadir Asi, Jihad El-Sana, Text line extraction for historical
document images, Pattern Recognition Letters, Vol. 35, Iss. 1, pp. 23-33, 2014.
[Sadri 2009] Sadri J. and Cheriet M., "A new approach for skew correction of documents based on
particle swarm optimization", Proceedings of 10th international conference on document analysis
and recognition, ICDAR‘09, pp 1066–1070, 2009.
[Safabakhsh 2000] Safabakhsh R. and Khadivi S, "Document Skew Detection Using Minimum-
Area Bounding Rectangle", Proc. The International Conference on Information Technology: Coding
and Computing ITCC 00, pp. 253 – 258, 2000.
[Sarfraz 2008] Sarfraz M. and Rasheed Z., "Skew estimation and correction of text using bounding
box", Proceedings of fifth international conference on computer graphics, imaging and
visualization, (CGIV ‘08), pp 259–264, 2008.
[Sauvola 1995] J. Sauvola and M. Pietikainen, “Page segmentation and classification using fast
feature extraction and connectivity analysis”, International Conference on Document Analysis and
Recognition, pp. 1127-1131, 1995.
[Sauvola 2000] J. Sauvola and M. Pietikainen, “Adaptive document image binarization,” Pattern
Recognition, vol. 33, no. 2, pp. 225–236, 2000.
[Seni 1994] G. Seni, and E. Cohen, ''External Word Segmentation of Off-line Handwritten Text
Lines'', Pattern Recognition, vol. 27, no. 1, Jan. 1994, pp. 41-52.
[Senior 1998] A. Senior and A. Robinson. An off-line cursive handwriting recognition system. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):309-321, 1998.
[Serrano 2009] J. A. Rodriguez-Serrano and F. Perronnin. Handwritten word-spotting using hidden Markov models and universal vocabularies. Pattern Recognition, 42:2106-2116, September 2009.
[Shafait 2008] F. Shafait, D. Keysers, and T. M. Breuel. Efficient implementation of local adaptive
thresholding techniques using integral images. In Proc. SPIE 6815, Document Recognition and
Retrieval XV, 681510, pages 1–6, January 2008.
[Shafait 2008a] F. Shafait, D. Keysers, and T. M. Breuel, "Performance evaluation and benchmarking of six-page segmentation algorithms", IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6):941-954, June 2008. doi: 10.1109/TPAMI.2007.70837.
[Shepard 1953] D. H. Shepard. Apparatus for reading. US Patent No. 2663758, 1953.
[Shi 2004] Z. Shi, and V. Govindaraju, "Line Separation for Complex Document Images Using
Fuzzy Runlength", Proc. 1st Int’l Workshop on Document Image Analysis for Libraries (DIAL’04),
2004, pp. 306-312.
[Shi 2005] Z. Shi, S. Setlur, and V. Govindaraju, "Text Extraction from Gray Scale Historical
Document Images Using Adaptive Local Connectivity Map", Proc. 8th Int’l Conf. on Document
Analysis and Recognition (ICDAR’05), 2005, pp. 794-798.
[Shutao 2007] Shutao Li, Qinghua S., and Jun S., "Skew detection using wavelet decomposition and
projection profile analysis", Pattern Recognition Letters vol. 28, issue 5, pp. 555–562, 2007.
[Singh 2008] Singh C., Bhatia N. and Kaur A., "Hough transform based fast skew detection and
accurate skew correction methods", Pattern Recognition vol. 41, pp. 3528–3546, 2008.
[Srihari 1989] Srihari S. N. and Govindaraju V., "Analysis of textual images using the Hough transform", Machine Vision and Applications, pp. 141-153, 1989.
[Srihari 1993] S. Srihari. Recognition of handwritten and machine printed text for postal address interpretation. Pattern Recognition Letters, 14:291-302, 1993.
[Srihari 2005] S. Srihari, H. Srinivasan, P. Babu, and C. Bhole. Handwritten arabic word spotting
using the cedarabic document analysis system. In Proc. Symposium on Document Image
Understanding Technology (SDIUT-05), pages 123–132, 2005.
[Stamatopoulos 2007] N. Stamatopoulos, B. Gatos and A. Kesidis, “Automatic Borders Detection
of Camera Document Images”, 2nd International Workshop on Camera-Based Document Analysis
and Recognition (CBDAR'07), pp.71-78, 2007.
[Stamatopoulos 2009] N. Stamatopoulos, B. Gatos, and S. J. Perantonis, "A method for combining
complementary techniques for document image segmentation", Pattern Recognition, Vol. 42, Issue
12, pp. 3158-3168, 2009.
[Stamatopoulos 2010] N. Stamatopoulos, B. Gatos and T. Georgiou, “Page Frame Detection for
Double Page Document Images”, 9th International Workshop on Document Analysis Systems
(DAS'10), pp. 401-408, Boston, MA, USA, June 2010.
[Stamatopoulos 2013] N. Stamatopoulos, G. Louloudis, B. Gatos, U. Pal and A. Alaei,
''ICDAR2013 Handwriting Segmentation Contest'', 12th International Conference on Document
Analysis and Recognition (ICDAR'13), 2013, pp. 1434-1438.
[Su 2010] B. Su, S. Lu and C.L. Tan. “Binarization of Historical Document Images using the Local
Maximum and Minimum”, in Proc. Int. Workshop on Document Analysis Systems (DAS’10),
Boston, MA USA, Jun. 2010, pp. 159-166.
[Su 2010] S. Lu, B. Su and C.L. Tan, “Document image binarization using background estimation
and stroke edges,” International Journal on Document Analysis and Recognition, vol. 13, no. 4, pp.
303-314, 2010.
[Su 2011] B. Su, S. Lu and C.L. Tan, “Combination of document image binarization techniques”, in
Proceedings of the 11th International Conference on Document Analysis and Recognition
(ICDAR’11), Beijing, China, Sep. 2011, pp. 22-26.
[Szke 2005] I. Szöke, P. Schwarz, L. Burget, M. Karafiát, and J. Černocký. Phoneme based acoustics keyword spotting in informal continuous speech. In Radioelektronika 2005, pages 195-198. Faculty of Electrical Engineering and Communication BUT, 2005.
[Thomas 2010] S. Thomas, C. Chatelain, L. Heutte, and T. Paquet. Alpha-Numerical Sequences
Extraction in Handwritten Documents. In Proceedings of the 12th International Conference on
Frontiers in Handwriting Recognition, ICFHR, pages 232-237, Kolkata, India, 2010. IEEE
Computer Society.
[Tonazzini 2007] A. Tonazzini, E. Salerno, and L. Bedini. Fast correction of bleed-through
distortion in grayscale documents by a blind source separation technique. International Journal of
Document Analysis and Recognition (IJDAR), 10(1):17–25, 2007.
[Tonazzini 2010] A. Tonazzini, Color Space Transformations for Analysis and Enhancement of
Ancient Degraded Manuscripts, Pattern Recognition and Image Analysis, Vol. 20, No. 3, pp. 404–
417, 2010.
[Toselli 2004] A. H. Toselli, A. Juan, D. Keysers, J. Gonzalez, I. Salvador, H. Ney, E. Vidal, and F.
Casacuberta. Integrated Handwriting Recognition and Interpretation using Finite-State Models. Int.
Journal of Pattern Recognition and Artificial Intelligence, 18(4):519-539, June 2004.
[Toselli 2004a] A. H. Toselli et al. Integrated Handwriting Recognition and Interpretation using Finite-State Models. Int. Journal on Pattern Recognition and Artificial Intelligence, 18(4):519-539, 2004.
[Toselli 2004b] A. H. Toselli, A. Juan, and E. Vidal. Spontaneous Handwriting Recognition and Classification. In Proceedings of the 17th International Conference on Pattern Recognition, volume 1, pages 433-436, Cambridge, United Kingdom, August 2004.
[Toselli 2009] A. H. Toselli, V. Romero, M. Pastor, and E. Vidal. Multimodal interactive transcription of text images. Pattern Recognition, 43(5):1824-1825, 2009.
[Toselli 2011] A. H. Toselli, V. Romero, and E. Vidal. Language Technology for Cultural Heritage, chapter Alignment between Text Images and their Transcripts for Handwritten Documents, pages 23-37. Theory and Applications of Natural Language Processing. Springer, 2011. Caroline Sporleder, Antal van den Bosch and Kalliopi Zervanou (Eds.).
[Toselli 2013] A. H. Toselli, E. Vidal, V. Romero, and V. Frinken. Word-graph based keyword spotting and indexing of handwritten document images. Technical report, Universidad Politécnica de Valencia, 2013.
[Varga 2005] T. Varga, and H. Bunke, ''Tree structure for word extraction from handwritten text
lines'', Proc. 8th Int’l Conf. on Document Analysis and Recognition (ICDAR’05), 2005, pp. 352-
356.
[Vidal 2012] E. Vidal, V. Romero, and A. H. Toselli. Multimodal Interactive Handwritten Text Transcription. World Scientific Publishing, 2012.
[Vinciarelli 2001] Vinciarelli A. and Luettin J., "A new normalization technique for cursive handwritten words", Pattern Recognition Letters, Vol. 22, Num. 9, pp. 1043-1050, 2001.
[Vinciarelli 2002] A. Vinciarelli. A survey on off-line cursive word recognition. Pattern Recognition, 35(7):1433-1446, 2002.
[Vinciarelli 2004] A. Vinciarelli, S. Bengio, and H. Bunke. Off-line recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):709-720, June 2004.
[Wahl 1982] F.M. Wahl, K.Y. Wong, and R.G. Casey, “Block Segmentation and Text Extraction in
Mixed Text/Image Documents”, Computer Graphics and Image Processing, vol. 20, pp. 375-390,
1982.
[Wang 1997] Wang J., Leung M. K. H. and Hui S.C. "Cursive word reference line detection",
Pattern Recognition vol. 30 issue 3, pp. 503–511, 1997.
[Weliwitage 2005] Weliwitage, A. L. Harvey, and A. B. Jennings, "Handwritten Document Offline
Text Line Segmentation", Proc. of Digital Imaging Computing: Techniques and Applications
(DICTA’05), 2005, pp. 184-187.
[Wessel 2001] F. Wessel, R. Schlüter, K. Macherey, and H. Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288-298, March 2001.
[Wong 1982] K. Y. Wong, R. G. Casey, and F. M. Wahl. Document analysis system. IBM J. Res. Dev., 26(6):647-656, November 1982.
[Yacoub 2005] S. Yacoub, J. Burns, P. Faraboschi, D. Ortega, J.A. Peiro, and V. Saxena, “Document
Digitization Lifecycle for Complex Magazine Collection”, Proceedings of the ACM symposium on
Document engineering, pp. 197-206, 2005.
[Yan 1993] Yan H., "Skew Correction of Document Images Using Interline Cross-Correlation",
CVGIP: Graphical Models and Image Processing, vol. 55, issue 6, pp. 538-543, 1993.
[Yang 2000] Y. Yang and H. Yan, “An adaptive logical method for binarization of degraded
document images,” Pattern Recognition, vol. 33, no. 5, pp. 787–807, 2000.
[Yin 2007] F. Yin, and C. Liu, ''Handwritten Text Line Extraction Based on Minimum Spanning
Tree Clustering'', Proc. Int’l Conf. on Wavelet Analysis and Pattern Recognition (ICWAPR’07),
vol.3, 2007, pp. 1123-1128.
[Young 1997] S. Young, J. Odell, D. Ollason, V. Valtchev, and P. Woodland. The HTK Book:
Hidden Markov Models Toolkit V2.1. Cambridge Research Laboratory Ltd, March 1997.
[Yu 1996] Yu B. and Jain A. K., "A robust and fast skew detection algorithm for generic documents", Pattern Recognition, vol. 29, issue 10, pp. 1599-1629, 1996.
[Zagoris 2010] K. Zagoris, E. Kavallieratou, and N. Papamarkos. A document image retrieval system. Engineering Applications of Artificial Intelligence, 23(6):872-879, 2010.
[Zagoris 2011] K. Zagoris, E. Kavallieratou, and N. Papamarkos. Image retrieval systems based on
compact shape descriptor and relevance feedback information. Journal of Visual Communication
and Image Representation, 22(5):378 – 390, 2011.
[Zahour 2007] A. Zahour, L. Likforman-Sulem, W. Boussalaa, and B. Taconet, "Text Line Segmentation of Historical Arabic Documents", Proc. 9th Int'l Conf. on Document Analysis and Recognition (ICDAR'07), 2007, pp. 138-142.
[Zhu 2004] M. Zhu. Recall, Precision and Average Precision. Working Paper 2004-09, Department of Statistics & Actuarial Science, University of Waterloo, August 26, 2004.
[Zirari 2013] F. Zirari, A. Ennaji, S. Nicolas, and D. Mammass. A methodology to spot words in historical Arabic documents. In Computer Systems and Applications (AICCSA), 2013 ACS International Conference on, pages 1-4, 2013.