language-independent text line extraction from historical document images first international...


2

Motivation

Historical handwritten manuscripts are valuable cultural heritage

Providing insights into both tangible and intangible cultural aspects from the past

 Efforts to understand, manipulate and archive historical manuscripts

Digitization increases accessibility and allows automatic processing

*Courtesy: - wadod.com - Genizah Project

3

Outline

Background

Challenges

Seam Carving

Text line representation by seams

Energy Map

Seam Generation

Experimental Results

Summary

4

Image representation

N x M (Matrix)

5

Binarization

[Figure: intensity histogram (horizontal axis: intensity; vertical axis: #pixels)]

6

Connectivity & Components

4-Neighborhood 8-Neighborhood

We can define 4- or 8-paths depending on the type of connectivity specified

A set of pixels S is a Connected Component if for each pixel pair (x1,y1) ∈ S and (x2,y2) ∈ S there is a path between them such that every two successive pixels in the path are in S and are X-neighbors (X = 4, 8).
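The definition above can be sketched as a breadth-first flood fill. This is a minimal illustration, not the method used in the slides; the function name `label_components` and the convention that foreground pixels have value 1 are assumptions made for the example.

```python
from collections import deque

# Neighbor offsets (row, col) for the two connectivity types.
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def label_components(image, connectivity=8):
    """Label foreground pixels (value 1); return (label image, component count)."""
    offsets = N8 if connectivity == 8 else N4
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or labels[r][c] != 0:
                continue  # background, or already assigned to a component
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and labels[ny][nx] == 0):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Three pixels on a diagonal: connected only if diagonal neighbors count.
diagonal = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
_, n4 = label_components(diagonal, connectivity=4)
_, n8 = label_components(diagonal, connectivity=8)
print(n4, n8)  # 3 1
```

The diagonal example shows why the choice of X matters: the same pixels form three components under 4-connectivity and one under 8-connectivity.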

7

Connected Component

One word, but 3 connected components

8

Distances

Given 2 points P = (u,v), Q = (x,y):

Euclidean Distance: $d_e(P,Q) = \sqrt{(x-u)^2 + (y-v)^2}$

City Block Distance: $d_4(P,Q) = |x-u| + |y-v|$

Chessboard Distance: $d_8(P,Q) = \max(|x-u|, |y-v|)$

In example: P = (1,8); Q = (4,1)

$d_e = \sqrt{3^2 + (-7)^2} \approx 7.6; \quad d_4 = 10; \quad d_8 = 7$
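As a quick check of the three metrics on the example points, here is a small sketch. The function names are chosen for this illustration only.

```python
import math


def euclidean(p, q):
    """d_e: straight-line distance."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def city_block(p, q):
    """d_4: number of steps when only 4-neighbor moves are allowed."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])


def chessboard(p, q):
    """d_8: number of steps when 8-neighbor (diagonal) moves are allowed."""
    return max(abs(q[0] - p[0]), abs(q[1] - p[1]))


P, Q = (1, 8), (4, 1)
print(round(euclidean(P, Q), 1), city_block(P, Q), chessboard(P, Q))  # 7.6 10 7
```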

9

Distance transform

Given a set of pixels S, calculate the distance of every other pixel to S

The pixels in the set S are considered reference pixels

Let . We scan the image by a pre-defined connectivity :

First pass: Consider Green pixels (N1)

SP

10

Distance transform

In reverse scan, consider Blue pixels (N2)

Distance transform

First scan

11

Distance transform – (cont’d)

Binary Representation (1 = reference pixels)

0 0 0 0 1 1 1 0 0 0
0 0 0 0 1 1 1 0 0 0
0 0 0 1 1 1 1 0 0 0
0 0 0 1 1 1 0 0 0 0
0 0 0 1 1 1 0 0 0 0
0 0 0 1 1 1 0 0 0 0
0 0 1 1 1 1 0 0 0 0
0 0 1 1 1 0 0 0 0 0

Distance transform (chessboard metric)

3 2 2 1 0 0 0 1 2 3
3 2 1 1 0 0 0 1 2 3
3 2 1 0 0 0 0 1 2 3
3 2 1 0 0 0 1 1 2 3
2 2 1 0 0 0 1 2 2 3
2 1 1 0 0 0 1 2 3 3
2 1 0 0 0 0 1 2 3 4
2 1 0 0 0 1 1 2 3 4

Alef Letter - Arabic

Printed

Handwritten
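The two-scan procedure can be sketched as follows for the chessboard metric. The neighbor masks `N1` and `N2` are assumptions: they are the usual "already visited" neighbors of a forward and a reverse raster scan, which may or may not match the green and blue pixels on the slides exactly.

```python
INF = float("inf")

# Assumed masks: N1 holds the neighbors already visited in a forward raster
# scan (top-left to bottom-right), N2 those visited in the reverse scan.
N1 = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]
N2 = [(1, 1), (1, 0), (1, -1), (0, 1)]


def chessboard_distance_transform(binary):
    """Distance of every pixel to the nearest reference pixel (value 1)."""
    rows, cols = len(binary), len(binary[0])
    dist = [[0 if binary[r][c] == 1 else INF for c in range(cols)]
            for r in range(rows)]

    def relax(r, c, mask):
        # Take the smallest "neighbor distance + 1" among the mask neighbors.
        for dr, dc in mask:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                dist[r][c] = min(dist[r][c], dist[nr][nc] + 1)

    for r in range(rows):                 # first (forward) scan
        for c in range(cols):
            relax(r, c, N1)
    for r in reversed(range(rows)):       # reverse scan
        for c in reversed(range(cols)):
            relax(r, c, N2)
    return dist


# A small hypothetical patch with the reference pixels on the right.
binary = [
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
dt = chessboard_distance_transform(binary)
for row in dt:
    print(row)
```

Each scan only propagates distances from one side, so both scans are needed before the values are final.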

12

Sign Distance transform

3 2 2 1 0 0 0 1 2 3

3 2 1 1 0 -1 0 1 2 3

3 2 1 0 -1 -1 0 1 2 3

3 2 1 0 -1 0 1 1 2 3

2 2 1 0 -1 0 1 2 2 3

2 1 1 0 -1 0 1 2 3 3

2 1 0 -1 -1 0 1 2 3 4

2 1 0 0 0 1 1 2 3 4

Sign Distance transform

chessboard metric

Alef Letter

Printed

Handwritten

13

Sign Distance transform – (cont’d)

Sign Distance transform (SDT)

Original Document Image

The brighter the color the larger the distance from reference pixels

14

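Gradient

A gray-scale image I is defined as a two-dimensional function I(x,y) = gray level

The gradient of the image ($\nabla I$) is given by the formula:

$\nabla I = (I_x, I_y) = \left( \frac{\partial I}{\partial x}, \frac{\partial I}{\partial y} \right)$

Where:

$\frac{\partial I}{\partial x}$ is the derivative of the image in the horizontal direction

$\frac{\partial I}{\partial y}$ is the derivative of the image in the vertical direction

The magnitude of the gradient is defined by:

$|\nabla I| = \sqrt{I_x^2 + I_y^2}$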
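A discrete image has no exact derivative, so the gradient is approximated in practice. The sketch below uses central differences with clamped borders; that choice of operator is an assumption for illustration, since the slides do not say which approximation is used.

```python
import math


def gradient(image):
    """Return (Ix, Iy, magnitude) using central differences, clamped at borders."""
    rows, cols = len(image), len(image[0])
    ix = [[0.0] * cols for _ in range(rows)]
    iy = [[0.0] * cols for _ in range(rows)]
    mag = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = image[r][max(c - 1, 0)]
            right = image[r][min(c + 1, cols - 1)]
            up = image[max(r - 1, 0)][c]
            down = image[min(r + 1, rows - 1)][c]
            ix[r][c] = (right - left) / 2.0   # horizontal derivative
            iy[r][c] = (down - up) / 2.0      # vertical derivative
            mag[r][c] = math.hypot(ix[r][c], iy[r][c])
    return ix, iy, mag


# A vertical edge: intensity changes only along the horizontal direction.
image = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
ix, iy, mag = gradient(image)
print(ix[1])   # [0.0, 5.0, 5.0, 0.0]
print(iy[1])   # [0.0, 0.0, 0.0, 0.0]
```

The magnitude is large only at the two columns next to the edge, which is why a gradient image highlights stroke boundaries.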

15

Gradient

[Figure panels: $I$, $\partial I / \partial x$, $\partial I / \partial y$]

16

Background

Pre-Processing: De-noising, Binarization

Segmentation: Page Layout Analysis, Text-line and word Segmentation

Indexation and Recognition

Original

*Courtesy: Islamic manuscript, Leipzig University Library, Germany

17

Text-line Extraction

Assigning the same color to each text line

Original Manuscript

Processed Manuscript

*Courtesy: Juma Al-Majid Center for Culture and Heritage, Dubai.

يــ ث ت ب

حـ خـ جـ


19

Challenges

Historical handwritten documents pose different challenges than machine-printed ones:

Looser layout format

Line proximity

Multi-oriented lines

Touching components

Different slope (within the same line)

Delayed strokes

Overlapping components

A 19th-century master's thesis – SAAB Medical Library, American University of Beirut


21

Seam Carving

Content-aware image resizing

Original Image Calculated seams Resized

An energy function defines an energy value for each pixel

A seam is an optimal 8-connected path of low-energy pixels

Gradient Image

22

Seam Carving – (cont’d)

Let I be an n × m image. Define a vertical seam to be:

$s = \{(i, x(i))\}_{i=1}^{n}, \quad \text{s.t. } \forall i,\ |x(i) - x(i-1)| \le K$

where x is a mapping $x : [1, \dots, n] \rightarrow [1, \dots, m]$.

A seam contains one, and only one, pixel in each row of the image; otherwise a distorted image might be obtained.

The pixels of the path of a seam will therefore be:

$I_s = \{I(s_i)\}_{i=1}^{n} = \{I(i, x(i))\}_{i=1}^{n}$

One can change the value of K in the constraint and get either a simple column for K = 0, or even a completely disconnected set of pixels for larger values of K.

23

Seam Carving – (cont’d)

Given an energy function e, the cost of a seam is:

$E(s) = E(I_s) = \sum_{i=1}^{n} e(I(s_i)) = \sum_{i=1}^{n} e(I(i, x(i)))$

We look for the optimal seam s* that minimizes this cost:

$s^* = \arg\min_{s} E(s) = \arg\min_{s} \sum_{i=1}^{n} e(I(s_i))$

The optimal seam can be found using dynamic programming
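A minimal sketch of the dynamic-programming search for a vertical seam, assuming K = 1 (each row's pixel is at most one column away from the previous row's). The function name and the toy energy values are illustrative; indices are 0-based here, whereas the slides count from 1.

```python
def find_vertical_seam(energy):
    """Return (x, total): x[i] is the seam's column in row i, total its cost."""
    n, m = len(energy), len(energy[0])
    # cost[i][j] = minimal cost of any seam that ends at pixel (i, j)
    cost = [row[:] for row in energy]
    for i in range(1, n):
        for j in range(m):
            lo, hi = max(j - 1, 0), min(j + 1, m - 1)
            cost[i][j] += min(cost[i - 1][lo:hi + 1])

    # Start from the cheapest entry of the last row and walk back up.
    j = min(range(m), key=lambda c: cost[n - 1][c])
    seam = [j]
    for i in range(n - 1, 0, -1):
        lo, hi = max(j - 1, 0), min(j + 1, m - 1)
        j = min(range(lo, hi + 1), key=lambda c: cost[i - 1][c])
        seam.append(j)
    seam.reverse()
    total = sum(energy[i][seam[i]] for i in range(n))
    return seam, total


energy = [
    [5, 1, 9],
    [9, 2, 1],
    [1, 8, 1],
]
seam, total = find_vertical_seam(energy)
print(seam, total)  # [1, 2, 2] 3
```

The table `cost` is filled in one pass over the image, so the search is linear in the number of pixels instead of exponential in the number of rows.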


25

Text line representation by seams

Human perception of text lines

Tracks text lines by ink concentration and in-between line spaces

Two types of seams have been defined

*Courtesy: Wadod Center for Manuscripts.

26

Text line representation by seams – (cont’d)

Original Document Image

Processed

The medial seam crosses the text area of a text line.

A separating seam is a path that passes between two consecutive text lines.

*Courtesy: Wadod Center for Manuscripts.

Medial Seam

Separating Seam

Seam Seed


28

Energy Map

We use the Sign distance transform (SDT) as an energy map

In the SDT, pixel values are assigned according to their distance from the nearest reference pixel

Recall, distance values are negative inside connected components and positive in-between

Intuition: Local minima and maxima points determine the medial and separating seams, respectively
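The intuition can be illustrated with a brute-force sketch. The sign convention below (ink pixels get minus the distance to the nearest background pixel, background pixels get plus the distance to the nearest ink pixel, chessboard metric) is one possible choice and is an assumption; the exact values on the slides may use a different offset for boundary pixels.

```python
def signed_distance_transform(binary):
    """Chessboard distance to the nearest pixel of the opposite class.

    Background pixels (0) get +d, ink pixels (1) get -d.
    Brute force over all pixel pairs; only suitable for tiny examples.
    """
    rows, cols = len(binary), len(binary[0])
    pixels = [(r, c) for r in range(rows) for c in range(cols)]
    sdt = [[0] * cols for _ in range(rows)]
    for r, c in pixels:
        others = [max(abs(r - y), abs(c - x))
                  for y, x in pixels if binary[y][x] != binary[r][c]]
        d = min(others) if others else 0
        sdt[r][c] = -d if binary[r][c] == 1 else d
    return sdt


# Two hypothetical "text lines" (rows 1-3 and row 7) separated by blank rows.
page = [
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
column = [row[1] for row in signed_distance_transform(page)]
print(column)  # [1, -1, -2, -1, 1, 2, 1, -1]
```

Reading the column top to bottom, the local minimum (-2 at row 2) sits in the middle of the first line, where a medial seam would pass, and the local maximum (2 at row 5) sits midway between the two lines, where a separating seam would pass.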

*Courtesy: Wadod Center for Manuscripts

Sign Distance Transform (SDT)

Original Document Image


30

Seam Generation – (cont’d)

The SDT is traversed horizontally to compute a cumulative energy map (the Seam Map) for all possible connected seams. For each entry (i,j):

$map(i,j) = 2\,SDT(i,j) + \min_{l \in \{-1,0,1\}} \big( w_l \cdot map(i+l,\ j-1) \big)$

Sign distance transform

Left-to-right pass

Right-to-left pass

SDT is traversed with two passes to enhance text line patterns

Bi-linearly interpolate the resulting two maps
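The recurrence for the two directional passes can be sketched as below. Several details are assumptions made for the example: the weights w_l are all set to 1, the first column is initialized to 2·SDT, and neighbors outside the image are skipped. The interpolation that merges the two maps is not shown, because the slides do not give its exact form.

```python
def seam_map_left_to_right(sdt, weights=(1.0, 1.0, 1.0)):
    """Cumulative map: map(i,j) = 2*SDT(i,j) + min_l w_l * map(i+l, j-1)."""
    rows, cols = len(sdt), len(sdt[0])
    seam_map = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        seam_map[i][0] = 2.0 * sdt[i][0]          # assumed initialization
    for j in range(1, cols):
        for i in range(rows):
            candidates = [
                weights[l + 1] * seam_map[i + l][j - 1]
                for l in (-1, 0, 1)
                if 0 <= i + l < rows              # skip rows outside the image
            ]
            seam_map[i][j] = 2.0 * sdt[i][j] + min(candidates)
    return seam_map


def seam_map_right_to_left(sdt, weights=(1.0, 1.0, 1.0)):
    """Same recurrence, run on the horizontally mirrored SDT."""
    mirrored = [row[::-1] for row in sdt]
    return [row[::-1] for row in seam_map_left_to_right(mirrored, weights)]


# Hypothetical SDT: one text line (negative values) between two gaps.
sdt = [
    [1, 1, 1],
    [-1, -2, -1],
    [1, 1, 1],
]
ltr = seam_map_left_to_right(sdt)
rtl = seam_map_right_to_left(sdt)
print(ltr[1])  # [-2.0, -6.0, -8.0]
print(rtl[1])  # [-8.0, -6.0, -2.0]
```

In each pass the row through the ink accumulates the lowest values in the direction of travel, which is the pattern the two-pass scheme is meant to reinforce.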

Interpolated map

31

Seam Generation – (cont’d)

The minimal entry of the last column is detected.

Backtrack from the minimal entry to find the medial seam.
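The backtracking step can be sketched as follows, assuming the seam map was built column by column with unit weights so that the predecessor of an entry is the smallest of its three left neighbors. The function name and the map values are hypothetical.

```python
def backtrack_medial_seam(seam_map):
    """Return the seam's row index in every column, listed left to right."""
    rows, cols = len(seam_map), len(seam_map[0])
    # Minimal entry of the last column.
    i = min(range(rows), key=lambda r: seam_map[r][cols - 1])
    seam = [i]
    # Walk back one column at a time, staying within one row of the current one.
    for j in range(cols - 1, 0, -1):
        neighbors = [r for r in (i - 1, i, i + 1) if 0 <= r < rows]
        i = min(neighbors, key=lambda r: seam_map[r][j - 1])
        seam.append(i)
    seam.reverse()
    return seam


# Hypothetical seam map (rows x columns); lower values mean more ink along the path.
seam_map = [
    [2, 0, -4, -2],
    [-2, -6, -8, -6],
    [2, 0, -4, -12],
    [4, 6, 2, -2],
]
seam = backtrack_medial_seam(seam_map)
print(seam)  # [1, 1, 1, 2]
```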

Original Document Image

Seam Map – One pass

Seam Map – Two passes

32

Seam Generation – (cont’d)

Iteratively, all text lines will be extracted

33

Seam Generation – (cont’d)

Then why are separating seams needed?

Avoid recalculation of energy and seam maps after each line extraction

Avoid additional strokes classification (post processing)

34

Seam Generation – (cont’d)

Separating seams define the boundaries of text lines

Generated with respect to the medial seam of the corresponding text line

Grown from seam seeds toward the two sides of the image guided by the SDT

35

Seam Generation – (cont’d)

A seam fragment is a connected group of pixels defined as the closest local maxima along the vertical direction

Seam Map

Seam fragments with low priority are discarded

A seed candidate set is constructed

Medial Seam

Sign Distance Transform

The seed that generates the optimal (maximal cost) seam is chosen

36

Seam Generation – (cont’d)

The separating seams may diverge from the medial seam due to the fork of ridges

Before After

A spring force anchored at the medial seam guides the separating seams

37

Touching/Overlapping Components

Usually, crossing overlapping components is avoided gracefully

Touching components are split too, but not necessarily in the optimal position

Processed

Processed


39

Experimental Results

| Dataset | Description | Language | Lines | Overlapping Components |
| Wadod | Wadod Center for Manuscripts | Arabic and Spanish | 1050 | 516 |
| Al-Majid | Al-Majid Center for Culture and Heritage, Dubai | Arabic | 900 | 258 |
| AUB | American University of Beirut | English | 420 | 485 |
| Congress Library | Congress Library | English | 150 | 317 |
| Total | | | 2520 | 1576 |

40

Experimental Results – (cont’d)

Table 1: correctness of text line extraction

| Dataset | Correctness (%): Line | Lower | Upper | Medial |
| Wadod | 98 | 97 | 97 | 99 |
| Al-Majid | 97 | 97 | 96 | 98 |
| AUB | 95 | 94 | 95 | 96 |
| Congress Library | 94.25 | 94 | 93 | 95 |

Table 2: crossed components

| Dataset | Overlapping Components | Stroke Crossing (%) |
| Wadod | 516 | 9 |
| Al-Majid | 258 | 2 |
| AUB | 485 | 9 |
| Congress Library | 317 | 10 |

41

Experimental Results- (cont’d)


43

Summary

Language-independent approach

Dynamic programming was used to find text lines

Saves energy map re-computing after text line extraction

Post processing steps are avoided

Crossing overlapping components was avoided in most cases

More research is still needed to split touching components optimally

44

Thank you