IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis
Clemens Neudecker, KB National Library of the Netherlands
International workshop on Historical Document Imaging and Processing, Beijing, 17 September 2011
Background
IMPACT – Improving Access to Text (2008 – 2011)
Large-scale integrating research project, funded by the EC
Main objectives:
- Innovate OCR technology
- Capacity building in mass-digitisation
From a technical perspective:
- > 20 software toolkits for solving specific issues
- Prototyping new algorithms
“One ring to rule them all…” IMPACT Interoperability Framework (IIF)
Main requirements
Behavioural:
- Minimize integration effort
- Minimize deployment effort
- Maximize usability
- Maximize scalability
Functional:
- Modular
- Transparent
- Expandable
- Open source
- Platform independent
Architecture
IMPACT Interoperability Framework: Technologies
- Java 6
- Generic Web Service Wrapper
- Apache Ant/Maven
- Apache Tomcat/httpd
- Apache Axis2
- Apache Synapse
- Taverna Workflow Engine
IMPACT Interoperability Framework: Dataset
- more than 500,000 images from digital libraries
- more than 25,000 ground truth transcriptions
So how does it work?
1. Digitisation/OCR challenges are registered and tagged in a database
2. The database contains a 99.99% correct result: the “ground truth”
3. Researcher develops new method to tackle a problem
4. Research prototype is wrapped to a web service
5. Web service is integrated as a workflow module
6. Workflow module can be evaluated, combined, etc.
Framework integration
Easy-to-use generic command line wrapper (open source)
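The idea of a generic command line wrapper can be illustrated with a minimal Python sketch. It is not the IMPACT wrapper itself (which, per the architecture slide, is Java-based and exposes tools as web services). It only shows the core step of turning a command template plus named parameters into a safe tool invocation. The template syntax and the names `build_command` and `run_tool` are assumptions made for this illustration.

```python
import shlex
import subprocess
import sys


def build_command(template, params):
    """Fill a command template with shell-quoted parameter values.

    The "{name}" placeholder syntax is hypothetical, chosen for this sketch.
    """
    quoted = {name: shlex.quote(str(value)) for name, value in params.items()}
    # Quote first, then split, so values containing spaces stay one argument.
    return shlex.split(template.format(**quoted))


def run_tool(template, params, timeout=60):
    """Run the wrapped tool and return (exit code, stdout, stderr)."""
    argv = build_command(template, params)
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # The Python interpreter stands in for a research prototype here.
    code, out, err = run_tool(
        "{python} -c {code}",
        {"python": sys.executable, "code": "print('ok')"},
    )
    print(code, out.strip())
```

A service layer would then only need to map incoming request fields onto `params`, which is what keeps the integration effort per tool low.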
Workflow development
OCR workflow = data pipeline
Building blocks =
processing steps (nodes)
Integration = interaction between nodes
(mashup)
Collaboration with
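The view of an OCR workflow as a data pipeline of nodes can be sketched in a few lines of Python. This is only an illustration of the concept, not Taverna's execution model. The node names (`binarise`, `segment`, `recognise`) and the dictionary used to represent a page are hypothetical placeholders.

```python
from functools import reduce


def make_node(name):
    """Create a placeholder processing step that records its name on the page."""
    def node(page):
        # Return a new dict so that nodes do not mutate each other's input.
        return dict(page, steps=page["steps"] + [name])
    node.__name__ = name
    return node


def run_workflow(nodes, page):
    """Pass the page through each node in order; the output of one is the input of the next."""
    return reduce(lambda data, node: node(data), nodes, page)


if __name__ == "__main__":
    workflow = [make_node("binarise"), make_node("segment"), make_node("recognise")]
    result = run_workflow(workflow, {"image": "page-0001.tif", "steps": []})
    print(result["steps"])
```

Because every node shares the same input and output shape, nodes can be swapped or recombined without changing the rest of the workflow.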
Workflow management
Web 2.0 style registry: myExperiment
Local client: Taverna Workbench
Web client: project website
Compute cluster
Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes
Main effects:
- Process parallelization
- Load distribution
- Failover
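Load distribution with failover can be sketched as a round-robin dispatcher that skips workers which fail. This is a simplified model of the behaviour described above, not the configuration or API of Apache Synapse. The `Dispatcher` class and the use of `ConnectionError` to signal an unavailable worker are assumptions for this example.

```python
class Dispatcher:
    """Round-robin dispatcher with failover (illustrative model only)."""

    def __init__(self, workers):
        self.workers = list(workers)
        self._next = 0  # index of the worker to try first for the next request

    def submit(self, request):
        count = len(self.workers)
        for attempt in range(count):
            index = (self._next + attempt) % count
            try:
                result = self.workers[index](request)
            except ConnectionError:
                continue  # failover: try the next worker in the ring
            # Start the next request at the worker after the one that succeeded.
            self._next = (index + 1) % count
            return result
        raise RuntimeError("no worker available")


if __name__ == "__main__":
    def healthy(name):
        return lambda request: f"{name}:{request}"

    def broken(request):
        raise ConnectionError("worker down")

    bus = Dispatcher([healthy("a"), broken, healthy("c")])
    print([bus.submit(r) for r in ("r1", "r2", "r3")])
```

In this model, a request that hits the broken worker is transparently retried on the next one, so the caller only sees an error when every worker is down.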
Dataset
Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities
Evaluation features
- Text-based comparison of result with ground truth, using the Levenshtein distance method
- Layout-based comparison of result with ground truth, using the Page Analysis And Ground Truth Elements Framework
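The text-based comparison can be illustrated with a short Python sketch of the Levenshtein distance between an OCR result and its ground truth. The `character_accuracy` score derived from it (one minus the distance divided by the ground truth length) is a common convention assumed here for illustration. The slides do not state which formula the IMPACT evaluation tools use.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row for the shorter string to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, ground_truth):
    """Assumed score: 1 - distance / len(ground_truth), clamped at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(character_accuracy("tbe cat", "the cat"))
```

For example, the OCR result "tbe cat" differs from the ground truth "the cat" by one substitution, so the distance is 1 over 7 ground truth characters.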
Community
Web 2.0 style workflow registry
Community of experts
Sharing of resources
Knowledge exchange
A central meeting point for users and researchers
Summary
Benefits:
- Availability of resources (images, ground truth and tools) to the international research community
- A common baseline for transparent evaluation and comparison
- Sharing of results and know-how
- Enabling new research through scalable computing
- Consolidation of support and maintenance
Thank you! Questions?