Transcript
Page 1: An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis
Clemens Neudecker, KB National Library of the Netherlands

International workshop on Historical Document Imaging and Processing, Beijing, 17 September 2011

Page 2:

Background
IMPACT – Improving Access to Text (2008 – 2011)

Large-scale integrating research project, funded by the EC
Main objectives:
- Innovate OCR technology
- Capacity building in mass-digitisation

From a technical perspective:
- > 20 software toolkits for solving specific issues
- Prototyping new algorithms

“One ring to rule them all…” IMPACT Interoperability Framework (IIF)

Page 3:

Main requirements
Behavioural:
- Minimize integration effort
- Minimize deployment effort
- Maximize usability
- Maximize scalability

Functional:
- Modular
- Transparent
- Expandable
- Open source
- Platform independent

Page 4:

Architecture
IMPACT Interoperability Framework: Technologies

- Java 6
- Generic Web Service Wrapper
- Apache Ant/Maven
- Apache Tomcat/httpd
- Apache Axis2
- Apache Synapse
- Taverna Workflow Engine

IMPACT Interoperability Framework: Dataset
- more than 500,000 images from digital libraries
- more than 25,000 ground truth transcriptions

Page 5:

So how does it work?
1. Digitisation/OCR challenges are registered and tagged in a database

2. The database contains a 99.99% correct result: the “ground truth”

3. Researcher develops new method to tackle a problem

4. Research prototype is wrapped as a web service

5. Web service is integrated as a workflow module

6. Workflow module can be evaluated, combined, etc.

Page 6:

Framework integration
Easy-to-use generic command line wrapper (open source)
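
To illustrate the idea of a generic command line wrapper, here is a minimal Python sketch. It is not the IMPACT wrapper itself: the function name `run_wrapped_tool` and the `$name` placeholder syntax are assumptions made for this example. It fills a command template with parameters, runs the tool, and returns its exit code and output, which a web service handler could then pass on.

```python
import shlex
import subprocess
import sys
from string import Template


def run_wrapped_tool(command_template, **params):
    """Fill a command-line template with parameters and run the tool.

    Hypothetical helper: returns (exit_code, stdout) of the wrapped tool.
    """
    values = {name: str(value) for name, value in params.items()}
    # Split the template first, then substitute per token, so a value
    # containing spaces always stays a single argument.
    argv = [
        Template(token).substitute(values)
        for token in shlex.split(command_template)
    ]
    completed = subprocess.run(
        argv, capture_output=True, text=True, check=False
    )
    return completed.returncode, completed.stdout


# Stand-in for a research prototype: a one-line Python "tool".
exit_code, output = run_wrapped_tool(
    "$python -c $script",
    python=sys.executable,
    script="print('ocr done')",
)
```

Keeping the tool description as a plain template string means a new prototype can be exposed by changing configuration only, which matches the goal of minimising integration effort.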

Page 7:

Workflow development

OCR workflow = data pipeline

Building blocks = processing steps (nodes)

Integration = interaction between nodes (mashup)
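
The view of a workflow as a data pipeline of nodes can be sketched in a few lines of Python. This is only an illustration, not how Taverna represents workflows; the node functions `binarise`, `recognise` and `correct` are hypothetical stand-ins for real processing modules, and the page is modelled as a plain dictionary.

```python
from functools import reduce


def make_pipeline(*nodes):
    """Chain processing steps so each node's output feeds the next node."""
    def run(data):
        return reduce(lambda value, node: node(value), nodes, data)
    return run


# Hypothetical nodes; each takes a page record and returns an updated copy.
def binarise(page):
    return dict(page, binarised=True)


def recognise(page):
    # A real node would call an OCR web service here.
    return dict(page, text="Tbe quick brown fox")


def correct(page):
    return dict(page, text=page["text"].replace("Tbe", "The"))


ocr_workflow = make_pipeline(binarise, recognise, correct)
result = ocr_workflow({"image": "page-0001.tif"})
```

Because every node has the same shape (record in, record out), nodes can be swapped or reordered without touching the others.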

Collaboration with

Page 8:

Workflow management
Web 2.0 style registry: myExperiment

Local client: Taverna Workbench

Web client: project website

Page 9:

Compute cluster
The Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes

Main effects: process parallelization, load distribution, fail-over
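
Load distribution with fail-over can be sketched as a round-robin dispatcher. This is a simplified illustration and not the Enterprise Service Bus used in the project: the `Dispatcher` class and the worker callables are assumptions, and requests are handled one at a time rather than in parallel.

```python
from itertools import cycle


class Dispatcher:
    """Round-robin dispatcher with simple fail-over across worker nodes."""

    def __init__(self, workers):
        self._workers = list(workers)
        self._order = cycle(range(len(self._workers)))

    def submit(self, request):
        # Try each worker at most once, starting with the next in rotation.
        last_error = None
        for _ in range(len(self._workers)):
            worker = self._workers[next(self._order)]
            try:
                return worker(request)
            except ConnectionError as error:
                last_error = error  # fail over to the next worker
        raise RuntimeError("no worker available") from last_error


# Hypothetical worker nodes: two healthy, one that is down.
def healthy(name):
    return lambda request: f"{name} processed {request}"


def broken(request):
    raise ConnectionError("node down")


dispatcher = Dispatcher([healthy("node-a"), broken, healthy("node-c")])
results = [dispatcher.submit(f"page-{i}") for i in range(3)]
```

In this run the second request is first offered to the broken node and then transparently handled by the next one in rotation.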

Page 10:

Dataset
Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities

Page 11:

Evaluation features
- Text-based comparison of the result with ground truth, using the Levenshtein distance method
- Layout-based comparison of the result with ground truth, using the Page Analysis And Ground Truth Elements Framework

Example:
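
As a minimal sketch of the text-based comparison (not the project's evaluation tool), the Levenshtein distance between an OCR result and its ground truth can be computed with the standard dynamic programming recurrence. The `character_accuracy` helper is an assumed metric added for illustration: one minus the distance divided by the ground truth length.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, ground_truth):
    """Assumed metric: 1 - distance / len(ground_truth), clamped at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))
```

For example, `levenshtein("Tbe quick", "The quick")` is 1, since a single substitution repairs the OCR error.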

Page 12:

Community
Web 2.0 style workflow registry

Community of experts

Sharing of resources

Knowledge exchange

A central meeting point for users and researchers

Page 13:

Summary
Benefits:
- Availability of resources (images, ground truth and tools) to the international research community
- A common baseline for transparent evaluation and comparison
- Sharing of results and know-how
- Enable new research through scalable computing
- Consolidation of support and maintenance

Thank you! Questions?

