An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis

Post on 15-Jun-2015

Category:

Technology

DESCRIPTION

An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis. International Workshop on Historical Document Imaging and Processing (HIP), ICDAR 2011, 16-17 September 2011, Beijing, China.

TRANSCRIPT

1. An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis
   Clemens Neudecker, KB National Library of the Netherlands
   International Workshop on Historical Document Imaging and Processing, Beijing, 17 September 2011
   IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

2. Background
   IMPACT: Improving Access to Text (2008-2011)
   Large-scale integrating research project, funded by the EC
   Main objectives:
   - Innovate OCR technology
   - Capacity building in mass-digitisation
   From a technical perspective:
   - > 20 software toolkits for solving specific issues
   - Prototyping new algorithms
   One ring to rule them all: IMPACT Interoperability Framework (IIF)

3. Main requirements
   Behavioural:
   - Minimize integration effort
   - Minimize deployment effort
   - Maximize usability
   - Maximize scalability
   Functional:
   - Modular
   - Transparent
   - Expandable
   - Open source
   - Platform independent

4. Architecture
   IMPACT Interoperability Framework: Technologies
   - Java 6
   - Generic Web Service Wrapper
   - Apache Ant/Maven
   - Apache Tomcat/httpd
   - Apache Axis2
   - Apache Synapse
   - Taverna Workflow Engine
   IMPACT Interoperability Framework: Dataset
   - more than 500,000 images from digital libraries
   - more than 25,000 ground truth transcriptions

5. So how does it work?
   1. Digitisation/OCR challenges registered and tagged in database
   2. Database contains 99.99% correct result: ground truth
   3. Researcher develops new method to tackle a problem
   4. Research prototype is wrapped to a web service
   5. Web service is integrated as a workflow module
   6. Workflow module can be evaluated, combined, etc.

6. Framework integration
   Easy to use generic command line wrapper (open source)

7. Workflow development
   - OCR workflow = data pipeline
   - Building blocks = processing steps (nodes)
   - Integration = interaction between nodes (mashup)
   Collaboration with

8. Workflow management
   - Web 2.0 style registry: myExperiment
   - Local client: Taverna Workbench
   - Web client: project website

9. Compute cluster
   Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes
   Main effect: Process parallelization, Load distribution, Fail over

10. Dataset
   Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities

11. Evaluation features
   - Text based comparison of result with ground truth, using Levenshtein distance method
   - Layout based comparison of result with ground truth, using the Page Analysis And Ground Truth Elements Framework
   Example:

12. Community
   - Web 2.0 style workflow registry
   - Community of experts
   - Sharing of resources
   - Knowledge exchange
   A central meeting point for users and researchers

13. Summary
   Benefits:
   - Availability of resources (images, ground truth and tools) to the international research community
   - A common baseline for transparent evaluation and comparison
   - Sharing of results and know-how
   - Enable new research through scalable computing
   - Consolidation of support and maintenance
   Thank you! Questions?
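
Editor's illustration for slide 11: the slides name Levenshtein distance as the method for text-based comparison against ground truth but do not show how a score is derived. The sketch below is not the IMPACT evaluation tool. It is a minimal Python example of the standard edit-distance algorithm, and the `character_accuracy` function and its formula (1 minus distance divided by ground-truth length) are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop (less memory)
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Hypothetical score (an assumption, not taken from the slides):
    1 - distance / len(ground_truth), clamped to the range [0, 1]."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))


if __name__ == "__main__":
    # One substituted character in a nine-character ground truth line.
    print(character_accuracy("Tbe quick", "The quick"))
```

Normalising by the ground-truth length makes scores comparable across pages of different length, which is one plausible way to obtain the common evaluation baseline the summary slide mentions.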
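
Editor's illustration for slide 7: the slides describe an OCR workflow as a data pipeline whose building blocks are processing steps (nodes). The sketch below shows that idea in plain Python, under the assumption that each node takes text and returns text. It does not use Taverna or any IMPACT web service, and the names `make_workflow`, `normalise_whitespace` and `replace_long_s` are hypothetical stand-ins.

```python
from typing import Callable, List

# A node is any processing step that takes data and returns data.
Node = Callable[[str], str]


def make_workflow(nodes: List[Node]) -> Node:
    """Chain nodes so that the output of one becomes the input of the next."""
    def run(data: str) -> str:
        for node in nodes:
            data = node(data)
        return data
    return run


# Hypothetical stand-in nodes; in a real deployment each would wrap a tool.
def normalise_whitespace(text: str) -> str:
    return " ".join(text.split())


def replace_long_s(text: str) -> str:
    # Historical prints often use the long s character.
    return text.replace("ſ", "s")


workflow = make_workflow([normalise_whitespace, replace_long_s])

if __name__ == "__main__":
    print(workflow("  Thiſ   is a  teſt "))
```

Because every node shares the same signature, nodes can be reordered, swapped or evaluated in isolation, which is the property step 6 of slide 5 relies on.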
