An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis. In: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing (HIP '11), Beijing, China, 16-17 September 2011. ACM Press.









<ul><li><p>An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis </p><p> Clemens Neudecker </p><p>National Library of the Netherlands P.O. Box 90407 </p><p>2509 LK The Hague, The Netherlands </p><p>Sven Schlarb Austrian National Library </p><p>Josefsplatz 1 1015 Vienna, Austria </p><p>Zeki Mustafa Dogan Goettingen State and University Library </p><p>Papendiek 14 37063 Goettingen, Germany </p><p> Paolo Missier, Shoaib Sufi, </p><p>Alan Williams and Katy Wolstencroft University of Manchester </p><p>Kilburn Building, Oxford Road Manchester, M13 9PL, United Kingdom </p><p>ABSTRACT The paper presents a novel web-based platform for experimental workflow development in historical document digitisation and analysis. The platform has been developed as part of the IMPACT project, providing a range of tools and services for transforming physical documents into digital resources. It explains the main drivers in developing the technical framework and its architecture, how and by whom it can be used and presents some initial results. The main idea lies in setting up an interoperable and distributed infrastructure based on loose coupling of tools via web services that are wrapped in modular workflow templates which can be executed, combined and evaluated in many different ways. As the workflows are registered through a Web 2.0 environment, which is integrated with a workflow management system, users can easily discover, share, rate and tag workflows and thereby support the building of capacity across the whole community. Where ground truth is available, the workflow templates can also be used to compare and evaluate new methods in a transparent and flexible way. </p><p>Categories and Subject Descriptors D.2.11 [Software Engineering]: Software Architectures Service Oriented Architecture (SOA). </p><p>General Terms Design, Experimentation, Management, Measurement. 
</p><p>Keywords Digitisation, Historical Documents, Web Service, Scientific Workflow, Optical Character Recognition, Evaluation. </p><p>1. INTRODUCTION Providing access to information and resources by means of the Internet, everywhere and anytime, is a key factor to sustaining the role of cultural heritage institutions in the digital age [1]. Scholars require access to the vast knowledge that is contained in the collections of cultural heritage institutions, and they demand access to be on par with born-digital resources [2]. However, the aim of fully integrating intellectual content into the modern information and communication technologies environment requires full-text digitisation: transforming digital images of scanned documents into searchable electronic text. The European research project IMPACT - Improving Access to Text [3] aims at significantly improving access to printed historical material by innovating Optical Character Recognition (OCR) software and language technology from multiple directions along with sharing best practice about the operational context for historical document digitisation and analysis. This paper describes the architecture implemented to join the various strands of work and how this is envisaged to benefit the community. </p><p>1.1 Background of the IMPACT project The IMPACT project brings together twenty-six national and regional libraries, research institutions and commercial suppliers that will share their knowledge, compare best practices and develop innovative tools to enhance the capabilities of OCR engines as well as the accessibility of digitised text and lay down the foundations for the mass-digitisation programmes that will take place over the next decade. 
</p><p>Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HIP '11, September 16 - September 17 2011, Beijing, China Copyright 2011 ACM 978-1-4503-0916-5/11/09$10.00. </p><p>IMPACT develops a whole suite of software tools that specifically address many of the challenges encountered in OCR processing of historical material, tackling every individual step in an OCR workflow [4]. The main focus lies on methods which are particularly suitable for printed historical documents. </p></li><li><p>1.2 Requirements While modern documents are typically recognised by integrated OCR software packages with very high accuracy rates, historical documents often exhibit rare defects and challenges that can require additional processing steps targeting problems specific to the nature of historic documents. Solutions for such particular issues can often be found only at the level of individual collections, timeframes or subject matters. This requires that the technical architecture provides sufficient flexibility not only to integrate a variety of specific software tools, but also to allow tailoring of the tools and processing steps in numerous ways so as to derive the optimal combination for the source material. </p><p>These processing steps together constitute a pipeline, where information from a prior step is used in the next processing step. However, loops can also occur.
For example, the OCR engine can utilise a set of candidates obtained from the segmentation step, processing them iteratively to determine the optimal setting. Flexibility is therefore needed on an even higher level - tuning the OCR workflow to a specific kind of input is often the only way to obtain high-quality results. For historical documents in particular, there is currently no ideal workflow that will return optimal results across the whole range of possible inputs. Existing software solutions for OCR often do not provide the user with the means to change and rearrange the workflow chain, even though this can have a considerable impact on the results. As a side effect, know-how regarding the configuration of OCR processes for historic documents is often not accumulated, which can cause uncertainties in OCR projects and lead to re-inventing the wheel. </p><p>This background makes it desirable to have an open, flexible and modular technical infrastructure for managing digitisation workflows via a service-oriented architecture [5], in order to exchange and rearrange the required services while keeping an overall integrated data pipeline and a high degree of automation throughout all the steps in the process. Such an infrastructure needs to meet interoperability demands in terms of integrating cross-platform services and establishing data exchange between them, while supporting a variety of data formats for representing content and being fit for deployment in heterogeneous and distributed IT infrastructures. </p><p>In this paper we present an approach to such a novel infrastructure for experimental workflow development in digitisation, with a particular focus on OCR for historic documents as implemented in the IMPACT project.
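</p><p>As an illustrative sketch (hypothetical class names, not actual IMPACT code), such a linear pipeline with an iterative loop over segmentation candidates might look like this in Java:</p>

```java
import java.util.List;
import java.util.function.ToIntFunction;
import java.util.function.UnaryOperator;

// Hypothetical sketch of an OCR processing pipeline: each step transforms a
// page representation (a String stands in for image plus layout data), and
// the output of each step feeds the next.
class Pipeline {
    private final List<UnaryOperator<String>> steps;

    Pipeline(List<UnaryOperator<String>> steps) { this.steps = steps; }

    // Linear case: a fixed chain of processing steps.
    String run(String page) {
        String current = page;
        for (UnaryOperator<String> step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    // Loop case: run recognition on each segmentation candidate and keep
    // the result with the best score, as in the iterative example above.
    static String recogniseBest(List<String> candidates,
                                UnaryOperator<String> ocr,
                                ToIntFunction<String> score) {
        String best = null;
        int bestScore = Integer.MIN_VALUE;
        for (String candidate : candidates) {
            String text = ocr.apply(candidate);
            int s = score.applyAsInt(text);
            if (s > bestScore) {
                bestScore = s;
                best = text;
            }
        }
        return best;
    }
}
```

<p>The point of the sketch is only the shape of the control flow: a fixed chain for the pipeline case, and a scored loop for the iterative case. </p><p>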
The web-based infrastructure establishes platform-independent interaction between tools, users and data in a user-friendly way and thereby provides the means to easily conduct experiments and evaluations [6] using different combinations of tools and content. </p><p>2. TECHNICAL ARCHITECTURE A core piece of work in IMPACT lies in the development of novel software techniques for numerous tasks connected to OCR, such as image enhancement, segmentation and post-processing, as well as in the improvement of existing OCR engines and experimental prototypes. The variety of development platforms and programming languages used by the developers of these tools makes it necessary to define an overall technical integration architecture for establishing interoperability between the different software components. </p><p>2.1 Design principles An integration architecture is usually built in several layers to break the problem into distinct smaller problems. As presented by [7], four important layers can be distinguished: data-level integration, application integration, business process integration and presentation integration. Before going into detail, we explain how the components generally relate to these layers: </p><p> Data-level integration: Data provision to application level and conversion. For example, image (and ground truth) data repositories that allow access to sample images and metadata along with seamlessly integrated conversion services. </p><p> Application integration: Web services, workflow modules (the concept will be explained later). </p><p> Business process integration: Basic workflows, complex workflows. </p><p> Presentation integration: Web portal, project website, 3rd-party projects. </p><p> The main conceptual principles were to achieve, first, modularity, i.e.
it should be possible to combine individual modules in a vast number of combinations, thereby enabling users to identify the most suitable processing chain for a particular kind of material and guaranteeing the reusability of the components. In this respect, service-oriented architecture has been identified as an appropriate guiding architectural design principle; more specifically, the principle of loose coupling of reusable processing units, minimising interdependencies between them. Second, to achieve transparency, i.e. it should be possible to evaluate and test each individual application/processing step separately, so that it is obvious whether a functional unit produces expected results and contributes to the overall quality of the workflow, and what the cost is in terms of processing time. Third, to achieve flexibility, i.e. it should be possible to easily change and rearrange workflows (chains of software components operating on data) and to create new workflows on the basis of existing ones by adapting them. And, fourth, to achieve extensibility, i.e. it should be possible to install third-party components with little effort. Currently the platform incorporates tools as diverse as the Abbyy FineReader SDK (modified IMPACT version), the Binarisation [8], Dewarping [9], Segmentation [10] and Recognition [11], [12] technologies developed in the course of IMPACT, as well as third-party components such as Tesseract [13] or OCRopus [14]. </p><p>2.2 Components Following these principles, IMPACT has adopted, in the first instance, open source components from the Apache Software Foundation, a highly active and reliable open source software development community, and evaluated alternatives only if the default Apache solutions appeared to be inappropriate or too complex for the project purposes.
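</p><p>The extensibility principle described above, where third-party components plug in with little effort, can be sketched as a simple tool registry (the class and tool names below are illustrative, not part of the IMPACT codebase):</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of loose coupling and extensibility: tools are
// registered under a name and looked up at workflow-construction time, so a
// third-party engine (e.g. a Tesseract wrapper) can be added without
// changing any existing code.
class ToolRegistry {
    private final Map<String, UnaryOperator<String>> tools = new HashMap<>();

    void register(String name, UnaryOperator<String> tool) {
        tools.put(name, tool);
    }

    UnaryOperator<String> lookup(String name) {
        UnaryOperator<String> tool = tools.get(name);
        if (tool == null) {
            throw new IllegalArgumentException("unknown tool: " + name);
        }
        return tool;
    }
}
```

<p>Because workflow definitions refer to tools only by name and a shared interface, the interdependencies between processing units stay minimal, which is the essence of the loose-coupling principle named above. </p><p>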
The main advantage of this approach is that the interdependencies between the different Apache solutions are well defined, and interoperability between the components is easily achieved by following the best practices recommended by the community. </p><p>Figure 1 gives an overview of the different components used in the IMPACT framework. Java technology provides a platform-independent technical ground for most of the framework components. Apache Tomcat has been chosen as the servlet container in which the Axis2 web service stack runs as a web application. Apache Synapse is used as an enterprise service bus (ESB), providing web service load balancing by distributing workload across the various processing units in the distributed network, and failover functionality by skipping web service endpoints that are currently unavailable. The Apache Web Server is optionally used as a filtering and routing instance which redirects requests to the corresponding application server. On top of the web service layer, the Taverna Workflow Engine has been adopted for service orchestration and mashups (see also Sections 2.4 and 2.5). </p><p>Figure 1. IMPACT Framework components. </p><p>2.3 Implementing tools as web services Many of the software components developed in IMPACT can be used via a command line interface, with a set of parameters configuring how the actual processing should be done. As illustrated in Figure 2, it was decided to use the command line interface as the default IMPACT tool interface and to build a generic Java-based wrapper around it. Using a generic skeleton, the Java-based wrapper is then offered as an Axis2 web service, while the operations with the corresponding parameters of the underlying command line tool are exposed as web service operations defined in the Web Service Description Language (WSDL). </p><p>For creating the web services, it was decided to choose the "contract first" development style, i.e.
the WSDL description is created first and the corresponding Java code is generated afterwards [15]. The main advantage of this approach is that it enables an implementation-independent definition of data types using XML Schema, e.g. by adding constraints to simple data types, such as regular expression patterns or restricted value lists for strings (xsd:string), or ranges for integer values (xsd:int). Exactly how these data types are then mapped to Java-specific data types is therefore of secondary importance. </p><p> Figure 2. Generic Web Service Wrapper </p><p> As all of the IMPACT components operate on data streams in some way, by either modifying them or extracting information from them, the web services must support binary data exchange. The generic web service wrapper supports this in two ways: data can be passed "by reference", where the SOAP message contains a URL reference to a file, or "by value", where the binary data is attached to the SOAP message using the MTOM standard for binary data transfer; the former is the default method used in IMPACT. </p><p>Generally, there is a one-to-one mapping from tools to web services, which means that one tool is described as a service in one WSDL file. Different types of functionality are offered as different web service operations, and additional parameters are offered as parameters of the operation if required for higher-level workflow creation. All the web services offered within the IMPACT framework together form the basic layer of the framework. Each web service can be explored and tested as a standalone service, or integrated in another environment by generating appropriate stubs.
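</p><p>As a concrete illustration, schema-level constraints of the kind described above might look like the following fragment (the type names are illustrative, not taken from the actual IMPACT WSDLs):</p>

```xml
<!-- Hypothetical XML Schema fragment: constraining simple parameter types
     as a contract-first WSDL could embed them. -->
<xsd:simpleType name="languageCode">
  <xsd:restriction base="xsd:string">
    <!-- restricted string list -->
    <xsd:enumeration value="de"/>
    <xsd:enumeration value="nl"/>
    <xsd:enumeration value="en"/>
  </xsd:restriction>
</xsd:simpleType>

<xsd:simpleType name="pageIdentifier">
  <xsd:restriction base="xsd:string">
    <!-- regular expression constraint -->
    <xsd:pattern value="p[0-9]{4}"/>
  </xsd:restriction>
</xsd:simpleType>

<xsd:simpleType name="binarisationThreshold">
  <xsd:restriction base="xsd:int">
    <!-- integer range -->
    <xsd:minInclusive value="0"/>
    <xsd:maxInclusive value="255"/>
  </xsd:restriction>
</xsd:simpleType>
```

<p>Code generated from such a contract can then reject invalid parameter values before the underlying command line tool is ever invoked. </p><p>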
The web services are used as a basis for different kinds of clients, like the workflow management system Taverna, described in some detail in the next section, or the website client which has been created in order to enable the seamless website integration of the web services for demonstration purposes. </p><p>Web services can be combined whenever the output of one service is compatible with the input of another service and it makes sense to apply the processing order that is given by this software workflow. For example, binarisation (reducing a colour/greyscale image to black and white considering especially the character informati...</p></li></ul>
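<p>The compatibility condition stated above, that the output of one service must match the input of the next, can be sketched as a typed composition; the class and service names below are illustrative only:</p>

```java
import java.util.function.Function;

// Hypothetical sketch: composing two services is only valid when the output
// type of the first matches the input type of the second. Here Java's type
// system plays the role of the message-type check given by the WSDL contracts.
class Image {}
class BinaryImage extends Image {}

class Services {
    // Binarisation: colour/greyscale image in, black-and-white image out.
    static final Function<Image, BinaryImage> binarise = img -> new BinaryImage();

    // Recognition: black-and-white image in, recognised text out.
    static final Function<BinaryImage, String> ocr = img -> "recognised text";

    // Valid composition: a BinaryImage flows between the two services.
    static final Function<Image, String> binariseThenOcr = binarise.andThen(ocr);
}
```

<p>A composition in the wrong order (recognition before binarisation, say) would simply fail the type check, which mirrors how incompatible service chains are ruled out in the workflow layer.</p>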