IMPACT at OCR Summit

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

An Experimental Workflow Development Platform for Historical Document Digitisation

Clemens Neudecker, KB National Library of the Netherlands

Upload: cneudecker

Post on 15-Jun-2015

DESCRIPTION

OCR Summit Meeting Initiative for Digital Humanities, Media and Culture, Texas A&M University, 17-18 October 2011, College Station, TX, United States.

TRANSCRIPT

Page 1: IMPACT at OCR Summit

An Experimental Workflow Development Platform for Historical Document Digitisation

Clemens Neudecker, KB National Library of the Netherlands

Page 2: IMPACT at OCR Summit

Background

IMPACT – Improving Access to Text (2008 – 2011)

From a technical perspective:

- more than 20 software components for solving specific issues

- Prototyping new algorithms, improving commercial solutions

Different frameworks (C, C++, Java, etc.), platforms (Win/Linux) + 3rd party applications

“One ring to rule them all…”

IMPACT Interoperability Framework (IIF)

Page 3: IMPACT at OCR Summit

Main requirements

Behavioural:

- Minimize integration effort

- Minimize deployment effort

- Maximize usability

- Maximize scalability

Functional:

- Modular

- Transparent

- Expandable

- Open source

- Platform independent

Page 4: IMPACT at OCR Summit

Framework integration

Simple-to-use generic command line wrapper for web services

Page 5: IMPACT at OCR Summit

Architecture

IMPACT Interoperability Framework: Technologies

- Java

- Apache Maven

- Apache Tomcat

- Apache Axis2+Synapse

- Taverna Workflow Engine

IMPACT Interoperability Framework: Dataset

- more than 600,000 images from digital libraries

- more than 50,000 ground truth transcriptions

Page 6: IMPACT at OCR Summit

Generic Web Service Wrapper

Only requirement: Command Line Application

HTML form

Source code available on GitHub: https://github.com/impactcentre/toolwrapper

Easy integration: developers can focus on their application and have to worry less about integration = higher quality software components
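To illustrate the idea of wrapping an arbitrary command line application behind a uniform interface, here is a minimal Python sketch. It is not the code of the toolwrapper project linked above, and it does not expose a web service; it only shows the core step such a wrapper needs, which is turning named request parameters into a command invocation and returning a structured response. The names `build_command` and `run_tool`, the `${...}` template syntax and the response fields are assumptions made for this example.

```python
import subprocess
import sys
from string import Template


def build_command(template: str, params: dict) -> list:
    """Fill a whitespace-separated command template with request parameters."""
    # Substitute token by token so a value containing spaces stays one argument.
    return [Template(token).substitute(params) for token in template.split()]


def run_tool(template: str, params: dict, timeout: float = 60.0) -> dict:
    """Run the wrapped command line tool and return a service-style response."""
    argv = build_command(template, params)
    # No shell is involved, so parameter values are not interpreted as shell syntax.
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return {
        "success": proc.returncode == 0,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


if __name__ == "__main__":
    # Hypothetical "tool": the Python interpreter upper-casing its argument.
    response = run_tool(
        "${python} -c ${script} ${text}",
        {
            "python": sys.executable,
            "script": "import sys; print(sys.argv[1].upper())",
            "text": "historical document",
        },
    )
    print(response["stdout"].strip())
```

With a function like this in place, the tool developer only has to supply the command template; the service layer around it can stay the same for every tool.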

Page 7: IMPACT at OCR Summit

Workflows

OCR workflow = data pipeline

Building blocks = processing modules

Integration = interaction between nodes (mashups)
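The "workflow = data pipeline" view can be sketched in a few lines of Python. This is only an illustration of the concept, not how Taverna represents or executes workflows. The module names `binarise`, `segment` and `recognise` are hypothetical stand-ins that merely record that they ran.

```python
from functools import reduce
from typing import Callable, List

# A processing module takes a document representation and returns a new one.
Module = Callable[[str], str]


def make_workflow(modules: List[Module]) -> Module:
    """Chain modules so that each module's output is the next module's input."""
    def run(document: str) -> str:
        return reduce(lambda data, module: module(data), modules, document)
    return run


# Hypothetical stand-ins for real components; each only records that it ran.
def binarise(document: str) -> str:
    return document + " -> binarised"


def segment(document: str) -> str:
    return document + " -> segmented"


def recognise(document: str) -> str:
    return document + " -> recognised"


if __name__ == "__main__":
    workflow = make_workflow([binarise, segment, recognise])
    print(workflow("page-0001.tif"))
    # prints: page-0001.tif -> binarised -> segmented -> recognised
```

Because every module has the same shape, swapping one building block for another (for example a different binarisation method) does not affect the rest of the pipeline.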

Collaboration with

Page 8: IMPACT at OCR Summit

Page 9: IMPACT at OCR Summit

Workflow management

Web 2.0 style registry: myExperiment

Local client: Taverna Workbench

Web client: Project website

Page 10: IMPACT at OCR Summit

Local client: Taverna Workbench

Background: BioSciences

Developed and maintained by myGrid, UK

Available for Windows/Linux/OSX and as open source (Java)

Page 11: IMPACT at OCR Summit

Web client: Taverna Server/Workflow Parser

SOAP/REST API

Remote execution of workflows

Page 12: IMPACT at OCR Summit

Community

Web 2.0 style workflow registry

Community of experts

Sharing of resources

Knowledge exchange

A central meeting point for users and researchers

Page 13: IMPACT at OCR Summit

Compute cluster

Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes

Main effect: process parallelization, load distribution, fail-over

Test deployment on Dutch Supercomputing Cloud HPC
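The load distribution and fail-over behaviour described on this slide can be illustrated with a small Python sketch. It is not the Enterprise Service Bus used in the project; it is a simplified, single-process model, assuming a round-robin policy, in which worker nodes are plain functions. The class name `Dispatcher` and the node names are hypothetical, and the sketch handles requests one at a time rather than in parallel.

```python
from itertools import cycle
from typing import Callable, Dict, List

# A worker node is modelled as a function that processes one request.
Worker = Callable[[str], str]


class Dispatcher:
    """Round-robin request distribution with fail-over to the next node."""

    def __init__(self, workers: Dict[str, Worker]):
        if not workers:
            raise ValueError("at least one worker node is required")
        self._workers = workers
        self._order = cycle(list(workers))

    def submit(self, request: str) -> str:
        errors: List[str] = []
        # Try each node at most once per request, starting with the next in turn.
        for _ in range(len(self._workers)):
            name = next(self._order)
            try:
                return self._workers[name](request)
            except Exception as exc:  # a failing node triggers fail-over
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all worker nodes failed: " + "; ".join(errors))


def healthy(name: str) -> Worker:
    return lambda request: f"{name} processed {request}"


def broken(request: str) -> str:
    raise ConnectionError("node unreachable")


if __name__ == "__main__":
    bus = Dispatcher({"node-a": healthy("node-a"), "node-b": broken, "node-c": healthy("node-c")})
    for page in ["page-1", "page-2", "page-3"]:
        print(bus.submit(page))
```

In this run the second request is first offered to the failing `node-b` and then transparently handled by `node-c`, which is the fail-over effect from the user's point of view.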

Page 14: IMPACT at OCR Summit

Dataset

Representative and annotated dataset of significant size, with metadata, ground truth and search facilities

Page 15: IMPACT at OCR Summit

Evaluation features

- Text based comparison of result with ground truth, using Levenshtein distance method

- Layout based comparison of result with ground truth, using the Page Analysis And Ground Truth Elements Framework

Example:
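The slide's own example is not part of this transcript. As an independent illustration of the text based comparison, the following Python sketch computes the Levenshtein distance between an OCR result and its ground truth, and derives a character error rate from it. Normalising by the ground truth length is a common convention and an assumption here; the function names and the sample strings are likewise made up for this example and do not come from the IMPACT evaluation tools.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance divided by ground truth length (assumed normalisation)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


if __name__ == "__main__":
    truth = "The quick brown fox"
    ocr = "Tbe quick brovvn fox"  # made-up OCR output with typical confusions
    print(levenshtein(ocr, truth))
    print(round(character_error_rate(ocr, truth), 3))
```

Here the distance is 3: one substitution for "b" instead of "h", and a substitution plus a deletion for "vv" instead of "w".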

Page 16: IMPACT at OCR Summit

Outlook

Online service for testing/evaluation/processing

Results Repository (WebDAV, POI)

Extending the scope:

- Workflows for linguistic analysis: CLARIN

- Workflows for preservation: SCAPE

Even better scalability: MapReduce/Hadoop

Supported by a community of developers & practitioners

Page 17: IMPACT at OCR Summit

Summary

- Availability of resources (images, ground truth and tools) to the international research community

- A common baseline for transparent evaluation and comparison

- Ready-to-use components, reproducible experiments

- Sharing of results and know-how

- Enable scalability for prototypes/data intensive workflows

- Simple and uniform user interface for all embedded tools

- Consolidation of support and maintenance

Thank you! Questions?