document image processing 1.fourier transforms 2.hough transforms 3.docstrum 4.text vs graphics

Document Image Processing

1. Fourier Transforms

2. Hough Transforms

3. Docstrum

4. Text vs Graphics

Features

Hamming Distance, HD =

Correlation:

Central Moments:

),(),(1 1

yxfyxfm

x

n

yr

r

r

ff

ff

0220 mm Spread=

Slenderness=

Fourier descriptors

211

20220 4)( mmm

Fourier Transform

Document Images and FT

Hough Transform

•Parametric Form

•Global

•Peaks in Accumulator Space

•Y intercept is infinite

•Use (r, )

0o 180o

r 2

2-r

Accumulator array [r, ]

0

0 18090

(1)Adjust a binary image f(x,y) into portrait mode, if necessary. (2)For each row of the binary image f(x,y), generate and label its

black runs.(3)Build objects based on black runs of the same labels and update

extreme coordinates of objects. (4)Create a simplified binary image by preserving the last black runs

of each "allowable" object. (5)Apply the Hough transform on the simplified binary image. (6)Analyze the local maxima of the Hough accumulator cell array to

detect the skew angle of the binary image as follows: (a) Collect the first and second maxima of Hough accumulator cell array elements. (b) Collect all Hough accumulator cell array elements of which the values are greater than one-half of the second maxima of Hough accumulator cell element. (c) Add these values together based on their angle. (d) The skew angle is the angle corresponding to the maximum of these values.

Document Skew

Document Skew Angle Detection Algorithm, D. X. Le and G. Thoma, Proc. 1993 SPIE Symposium on Aerospace and Remote Sensing -Visual Information Processing II, Orlando, FL, April 14-16, 1993, Vol. 1961, pp. 251-262.

http://archive.nlm.nih.gov/staff/le.php

http://archive.nlm.nih.gov/staff/thoma.php

http://archive.nlm.nih.gov/staff/thoma.php

The description of the physical layout is constructed from the

information extracted from the document image during page

segmentation and classification

• information about the position, dimensions, shape etc.

• Geometrical relationships between components

• Eg., for a technical article may contain information such as: ‘text component with a set S1 of attributes below a line drawing component with a set S2 of attributes’

– the Office Document Architecture (ODA)

– the Standard Generalized Markup Language (SGML)

– The eXtensible Markup Language (XML)

Physical Layout Structure

DocstrumSlope Histograms

Use local information

Connect a mark (component) with K (=4..6) neighbors

Histogram of the slopes

More efficient than projection profiles

Docstrum is the radius and angle plot of the slopes

Extracting Text Strings1. Set H to the average height of the marks being considered

2. Set the Hough space resolution in to 1o and r to 0.2H

3. Apply the Hough transform to all marks, using the ranges

4. Set the mark count threshold to T=20

5. For each cell in the accumulator space with count greater than T

a) For the cluster of cells calculate the average height Hlocal of marks contributing to the cluster

b) Compute a new clustering factor f= Hlocal/R re-cluster cells

c) Perform string segmentation on marks contributing to new clusters

6. Update Hough transform by deleting contributions from discarded components in the step 5c above

7. Decrement T by 1, and if T>2 go to Step 5

8. Compute the Hough transform for the entire range of and go to step 4

00 50 00 9585 00 180175

),5(),4(),,5( rrr

),(),1(),,( frfrfr

• A document contains the information that its author wishes to convey

– By the formatting of characters and pictorial information and the general layout of the document

– The shape and size of paragraphs and illustrations, the font of the characters as well as their positions in the page can carry a message

• The physical layout denotes the organization of the text and graphical components in the document

• Physical layout analysis comprises of

page segmentation

page classification and

physical layout structure extraction

Physical Layout Analysis

• Page segmentation is the identification of areas of interest in the document image

– Identifies the boundaries of the areas in the image that correspond to the printed regions on the page

– A higher-level description of the page is obtained, in terms of the outlines (contours) of these areas

• Methods:

– Connected components aggregation (bottom-up)

– Projection-profiles analysis (top-down)

– Analysis of background space (hybrid)

Page Segmentation

Page classification is the determination of the type of the contents of each area of interest in the document image

• Analyze attributes of the contents of each area and deduce its type

– In OCR applications, one is interested in text and non-text

– In Graphics applications the non-text areas must be assigned line drawing, halftone, photograph, etc.

Page Classification

i j

i jshort jip

jjipF

),(

/),( 2

i j

i jlong jip

jipjF

),(

),(2

Texture analysis method: p(i,j) is formed representing the number of times the image contains a horizontal run of length j whose black and white proportion is in category i.

Categories are made in bins i) less than 10%, (ii) 10-20%, etc.

Logical Structure

Semantic structure:

Eg., Find abstracts of all papers in a database which include a keyword

Physical structure of a newspaper: extraction of blocks of text, graphics, half-tones, identification of attributes such as fonts, size, style

Logical structure is identifying headlines, captions, bylines, grouping paragraphs belonging to same story across columns, pages etc

HTML vs XML :: Physical vs Logical

Physical Layout Analysis (Lit Survey)

Wahl et al: Closing with a horizontal kernel (300) AND Closing with vertical kernel (30)

Nagy and Seth: X-Y tree

Fisher et al: Low resolution image used

Lebourgeois et al: Non-uniform down sampling; Dilation by a horizontal kernel

Bloomberg: Vertical dilation followed by a close-open sequence to remove noise, followed by a hit-or-miss transform to identify seed points of characters to identify italics and bold fonts

Saitoh and Pavlidis: Non-uniform down sampling

Hinds et al: Erosion using 2-pixel vertical (horizontal) kernel. Followed by Hough Transform

Pavlidis and Zhou: Projection profile and clustering

Amamoto et al: Open white space with long horizontal structural element followed by vertical and take union

O’ Gorman: Docstrum

Ishitani: document skew using line complexities to take care of non-text blocks

Chen and Haralick: Recursive opening and closing

Ankindele and Belaid: permit non-rectangular blocks

Logical Layout Analysis (Lit Survey)

Tsujimoto and Asada: Rule based system

Fisher: Rule based system

Chenvoy and Belaid: Blackboard system

Kreich et al: Top-down knowledge based system

Derrien-Peden: Frame-based system

Yamashita et al: Model-based method

Dengel: Busines letters

document image processing 1.fourier transforms 2.hough transforms 3.docstrum 4.text vs graphics

Documents

thoma slide

ft slide

hough transforms

update hough

document image processing

o r r accumulator array

local information

hough space resolution