Datech2014 - Automatic Article Extraction in Old Newspapers Digitized Collections

Automatic Article Extraction in Old Newspapers Digitized Collections
David Hébert, May 19th 2014
David Hébert, Thomas Palfray, Pierrick Tranouez, Stéphane Nicolas, Thierry Paquet

DESCRIPTION

Slides of the presentation of the paper "Automatic Article Extraction in Old Newspapers Digitized Collections" by David Hébert, Thomas Palfray, Pierrick Tranouez, Stéphane Nicolas and Thierry Paquet. #digidays

TRANSCRIPT

Page 1: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Automatic Article Extraction in Old Newspapers Digitized Collections

David Hébert

May 19th 2014

David Hébert, Thomas Palfray, Pierrick Tranouez, Stéphane Nicolas, Thierry Paquet

Page 2: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Document digitization

David Hébert - Datech - May 19th 2014 2

The white of the limestone of the calanques, speckled with the green that sprang from a rainy spring. The blue of the harbour, from which the Planier lighthouse emerges in the distance. All around, the city of concrete and tiles as far as the eye can see. All the way to other hills... The roof terrace of Le Corbusier's building offers a unique panoramic view of Marseille. To this promontory one must add the shouts of the children of the nursery school, for whom the terrace of the Cité radieuse, 56 metres above the ground, forms an incredible playground.

Page 3: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

180 years of diversity

PlaIR : Regional Indexation Platform

Enrichment of the « Journal de Rouen »

• 1762 – 1947

• Approximately 300 000 images

• Various layouts

Page 4: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Plan

1. Proposed Approach

2. Logical labeling at pixel level

3. Logical structure extraction

4. Results

5. Conclusion and future work

Page 5: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Overview of our method

Physico-logical entities extraction (logical labeling at pixel level)

• Labelling at the pixel level

• Contextualisation

• Graphical model

• Discriminative model

=> The CRF

Article reconstruction (logical structure extraction)

• Higher level of analysis

• Block identification

• Taking advantage of the hierarchical organisation of information

• Finding a reading order

Page 7: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Conditional Random Fields

Proposed by Lafferty, McCallum and Pereira in 2001 [Lafferty 01] for part-of-speech tagging

Given a sequence of observations X, find the best label sequence Y

Example: given a sequence of words, find the role of each word in the sentence

=> observations are words (discrete observations)

=> labels describe the role of each word in the sentence

[Lafferty 01] John Lafferty, Andrew McCallum & Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. 18th International Conf. on Machine Learning, pages 282-289, 2001.

[Figure: linear-chain CRF over observations xt-1, xt, xt+1 and labels yt-1, yt, yt+1. The model equation is annotated with: local potentials (CRF parameters), feature functions, model of the decision function, length of the sequence, local combination of potentials, global combination over the sequence.]
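
For reference, the annotations above match the usual form of a linear-chain CRF as defined in [Lafferty 01]. The equation itself is not preserved in this transcript, so the symbols f_k and Z(X) below follow common textbook notation and are an assumption about what the slide displayed.

```latex
p(Y \mid X) = \frac{1}{Z(X)}
  \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \right),
\qquad
Z(X) = \sum_{Y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \right)
```

Read against the annotations: the λ_k are the local potentials (CRF parameters), the f_k are the feature functions, the inner sum over k is the local combination of potentials, the outer sum over t is the global combination over the sequence, and T is the length of the sequence.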

Page 8: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Feature functions

Feature function: generic notation covering two kinds of functions:

- Observation functions

- Transition functions

Each feature function is linked to a parameter λk

[Figure: chain of observations x1, x2, ..., xT with labels yt-1 and yt]

Parameter estimation: maximise the conditional log-likelihood on N labelled examples

Inference: given X, find the best label sequence Y*
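
For a linear-chain model, the inference step (given X, find Y*) is commonly solved with the Viterbi algorithm. The sketch below is a minimal illustration, not the authors' implementation: the function name `viterbi`, the score tables `obs_scores` and `trans_scores`, and the two labels are all hypothetical, and the scores stand for already-summed weighted feature functions.

```python
from typing import Dict, List, Tuple


def viterbi(obs_scores: List[Dict[str, float]],
            trans_scores: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the label sequence with the highest total score.

    obs_scores[t][y]     : summed weighted observation features for label y at position t
    trans_scores[(a, b)] : summed weighted transition features for moving from label a to b
    """
    labels = list(obs_scores[0])
    # best[t][y] = best score of any label path that ends in y at position t
    best = [{y: obs_scores[0][y] for y in labels}]
    back = []
    for t in range(1, len(obs_scores)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans_scores.get((p, y), 0.0))
            scores[y] = best[-1][prev] + trans_scores.get((prev, y), 0.0) + obs_scores[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy scores (made up for illustration): three positions, two labels.
example_obs = [
    {"title": 2.0, "text": 0.0},
    {"title": 0.5, "text": 0.4},
    {"title": 0.0, "text": 1.0},
]
example_trans = {
    ("title", "title"): -1.0, ("title", "text"): 0.5,
    ("text", "text"): 0.5, ("text", "title"): -0.5,
}
print(viterbi(example_obs, example_trans))
```

At position 2 the observation score alone slightly favours "title", but the transition scores make "text" the better choice once the whole sequence is considered, which is the contextualisation a CRF brings over independent per-position decisions.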

Page 9: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Which physico-logical entities?

Pixel description with numerical values

Requires some data adaptation to feed the CRF: multi-scale quantization

[Figure: chain of observations x1, x2, ..., xT with labels y1, y2, ..., yT, fed by numerical descriptors]

D. Hébert, T. Paquet, S. Nicolas. Continuous CRF with Multi-scale Quantization Feature Functions: Application to Structure Extraction in Old Newspaper. ICDAR 2011.
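
The slides do not detail the quantization scheme, so the following is only a sketch of the general idea under a stated assumption: a numerical descriptor is binned uniformly at several resolutions, and each (scale, bin) pair becomes a discrete symbol that a CRF feature function can test. The function name `multiscale_quantize`, the value range and the choice of scales are hypothetical and are not taken from the ICDAR 2011 paper.

```python
from typing import List, Tuple


def multiscale_quantize(value: float, lo: float, hi: float,
                        scales: Tuple[int, ...] = (2, 4, 8)) -> List[Tuple[int, int]]:
    """Map one numerical descriptor to discrete symbols, one per scale.

    Each scale splits [lo, hi] into n_bins equal-width bins; the result is a
    list of (n_bins, bin_index) pairs usable as discrete observations.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(value, lo), hi)  # out-of-range values fall into the end bins
    symbols = []
    for n_bins in scales:
        index = int((clipped - lo) / (hi - lo) * n_bins)
        symbols.append((n_bins, min(index, n_bins - 1)))  # keep value == hi in the last bin
    return symbols


# A run length of 30 pixels, assuming lengths are capped at 100 pixels.
print(multiscale_quantize(30.0, 0.0, 100.0))
```

Coarse scales group similar values together, while fine scales keep more detail, so the model can weight whichever resolution is informative for a given label.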

Page 10: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Experiments

David Hébert - Datech - May 19th 2014 10

Identification of:

- Text lines

- Titles

- Horizontal separators

- Vertical separators

- Noisy areas

- Characters

- Inter-character white spaces

- Inter-word white spaces

• Observations are horizontal runs.

• An observation is described by:

- its length

- the median length of the vertical runs
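
The two descriptors listed above can be sketched as follows on a binary image stored as a list of rows. This is an illustration under assumptions: the slides do not say over which pixels the vertical-run median is taken, so here it is taken over the vertical runs crossing each pixel of the horizontal run. All function names are hypothetical.

```python
from statistics import median
from typing import List, Tuple

Run = Tuple[int, int, int]  # (start column, length, pixel value)


def runs(line: List[int]) -> List[Run]:
    """Split a sequence of 0/1 pixels into maximal runs of equal value."""
    out = []
    start = 0
    for i in range(1, len(line) + 1):
        if i == len(line) or line[i] != line[start]:
            out.append((start, i - start, line[start]))
            start = i
    return out


def vertical_run_length(image: List[List[int]], row: int, col: int) -> int:
    """Length of the vertical run of equal-valued pixels through (row, col)."""
    value = image[row][col]
    top = row
    while top > 0 and image[top - 1][col] == value:
        top -= 1
    bottom = row
    while bottom + 1 < len(image) and image[bottom + 1][col] == value:
        bottom += 1
    return bottom - top + 1


def describe_row(image: List[List[int]], row: int) -> List[Tuple[int, float]]:
    """Describe each horizontal run of a row by (length, median vertical run length)."""
    features = []
    for start, length, _ in runs(image[row]):
        verticals = [vertical_run_length(image, row, c) for c in range(start, start + length)]
        features.append((length, median(verticals)))
    return features


example_image = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
print(describe_row(example_image, 1))
```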

Page 11: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

A generic data model

David Hébert - Datech - May 19th 2014 11

• Not a complete document model

• A model of columns of information

• A model of entity sequences

=> A model generic enough for various layouts
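
To make the idea of a model of entity sequences within a column concrete, here is a small sketch that groups a top-to-bottom sequence of entity labels into articles. The two rules it applies are assumptions chosen for illustration; the slides do not state the actual rules, and the label names "title", "text" and "hsep" are hypothetical.

```python
from typing import List


def split_column_into_articles(entities: List[str]) -> List[List[str]]:
    """Group a top-to-bottom sequence of entity labels into articles.

    Assumed rules (hypothetical, for illustration only):
      - a horizontal separator ("hsep") closes the current article;
      - a "title" that follows body "text" opens a new article.
    """
    articles: List[List[str]] = []
    current: List[str] = []
    for label in entities:
        if label == "hsep":
            if current:
                articles.append(current)
            current = []
        elif label == "title" and "text" in current:
            articles.append(current)
            current = [label]
        else:
            current.append(label)
    if current:
        articles.append(current)
    return articles


column = ["title", "text", "text", "hsep", "title", "title", "text", "title", "text"]
for article in split_column_into_articles(column):
    print(article)
```

Because the rules only look at the order of entities inside a column, the same sketch applies whatever the number or width of the columns, which is the sense in which such a model stays independent of a complete page layout.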

Page 12: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Recap of the approach

Physico-logical entities extraction: pixel-level analysis, DONE

Article reconstruction: higher level of analysis to identify articles

Page 14: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Article reconstruction

Page 15: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Article reconstruction

Page 16: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Article reconstruction

[Figure: page layout with blocks labelled D, O, R, B, F, S, Z, A, P, W]

Page 17: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Article reconstruction

[Figure: reading order over the blocks labelled D, O, R, B, F, S, Z, A, P, W]
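
The figure for this slide is not preserved, so the exact procedure for finding a reading order is unknown. As a rough sketch of what a reading order can look like on a Manhattan layout, the code below orders blocks column by column from left to right, then top to bottom within a column. The `Block` type, the overlap test and the block names are assumptions, not the authors' method.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    name: str
    x: int  # left edge
    y: int  # top edge
    w: int  # width


def reading_order(blocks: List[Block]) -> List[str]:
    """Order blocks column by column (left to right), top to bottom inside a column.

    A block joins an existing column when its horizontal extent overlaps the
    extent of the first block placed in that column.
    """
    columns: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b.x):
        for column in columns:
            ref = column[0]
            if block.x < ref.x + ref.w and ref.x < block.x + block.w:
                column.append(block)
                break
        else:
            columns.append([block])
    return [b.name for column in columns for b in sorted(column, key=lambda b: b.y)]


page = [
    Block("c", 100, 0, 90),
    Block("b", 0, 200, 90),
    Block("a", 0, 0, 90),
    Block("d", 105, 300, 80),
]
print(reading_order(page))
```

A block spanning several columns, such as a headline, would need extra handling that this sketch leaves out.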

Page 19: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Results

Quantitative evaluation:

42 images evaluated manually

226 true articles

245 articles detected

194 correct detections (85.84%)

Over-segmentation rate of 8.41%

• 21 550 documents of 4 pages on average (101 978 images) on the platform: http://plair.univ-rouen.fr

• 550 000 articles

• Approximately 20 days of computation (8 cores)
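
The figures above can be cross-checked with a few lines of arithmetic. The detection rate reproduces the 85.84% on the slide. The other quantities (precision, pages per document, time per image) are derived here and do not appear in the slides; the throughput assumes the 20 days were spent continuously on all 8 cores, which is an approximation.

```python
true_articles = 226
detected = 245
correct = 194

detection_rate = correct / true_articles  # share of true articles recovered
precision = correct / detected            # share of detections that are correct (derived)

documents = 21_550
images = 101_978
articles = 550_000
days, cores = 20, 8

pages_per_document = images / documents
articles_per_image = articles / images
seconds_per_image = days * 24 * 3600 / images  # wall-clock, all cores together
core_seconds_per_image = seconds_per_image * cores

print(f"detection rate:         {detection_rate:.2%}")
print(f"precision:              {precision:.2%}")
print(f"pages per document:     {pages_per_document:.2f}")
print(f"articles per image:     {articles_per_image:.2f}")
print(f"seconds per image:      {seconds_per_image:.1f}")
print(f"core-seconds per image: {core_seconds_per_image:.1f}")
```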

Page 20: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Results on other layouts

Page 21: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

Conclusion and future work

Presentation of a logical segmentation method in two steps:

- Physico-logical entities segmentation with a CRF

- Article identification with a generic layout model

Suitable for complex Manhattan layouts with a small set of rules

Average article detection rate of 85%

Future work:

- Improve the CRF model (descriptors and/or the label descriptions)

- Add variability in the description of an entity (typically the definition of a separator)

Page 22: Datech2014 - Automatic Article Extraction  in Old Newspapers Digitized Collections

The end…

Thanks for your attention

Questions?
