[ACM Press the 2011 Workshop - Beijing, China (2011.09.16-2011.09.17)] Proceedings of the 2011 Workshop on Historical Document Imaging and Processing - HIP '11 - An experimental workflow development platform for historical document digitisation and analysis
An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis
National Library of the Netherlands P.O. Box 90407
2509 LK The Hague, The Netherlands
Sven Schlarb Austrian National Library
Josefsplatz 1 1015 Vienna, Austria
Zeki Mustafa Dogan Goettingen State and University Library
Papendiek 14 37063 Goettingen, Germany
Paolo Missier, Shoaib Sufi,
Alan Williams and Katy Wolstencroft University of Manchester
Kilburn Building, Oxford Road Manchester, M13 9PL, United Kingdom
ABSTRACT The paper presents a novel web-based platform for experimental workflow development in historical document digitisation and analysis. The platform has been developed as part of the IMPACT project, providing a range of tools and services for transforming physical documents into digital resources. It explains the main drivers in developing the technical framework and its architecture, how and by whom it can be used and presents some initial results. The main idea lies in setting up an interoperable and distributed infrastructure based on loose coupling of tools via web services that are wrapped in modular workflow templates which can be executed, combined and evaluated in many different ways. As the workflows are registered through a Web 2.0 environment, which is integrated with a workflow management system, users can easily discover, share, rate and tag workflows and thereby support the building of capacity across the whole community. Where ground truth is available, the workflow templates can also be used to compare and evaluate new methods in a transparent and flexible way.
Categories and Subject Descriptors D.2.11 [Software Engineering]: Software Architectures Service Oriented Architecture (SOA).
General Terms Design, Experimentation, Management, Measurement.
Keywords Digitisation, Historical Documents, Web Service, Scientific Workflow, Optical Character Recognition, Evaluation.
1. INTRODUCTION Providing access to information and resources by means of the Internet, everywhere and anytime, is a key factor in sustaining the role of cultural heritage institutions in the digital age. Scholars require access to the vast knowledge that is contained in the collections of cultural heritage institutions, and they demand access to be on par with born-digital resources. However, the aim of fully integrating intellectual content into the modern information and communication technologies environment requires full-text digitisation: transforming digital images of scanned documents into searchable electronic text. The European research project IMPACT (Improving Access to Text) aims at significantly improving access to printed historical material by innovating Optical Character Recognition (OCR) software and language technology from multiple directions along with sharing best practice about the operational context for historical document digitisation and analysis. This paper describes the architecture implemented to join the various strands of work and how this is envisaged to benefit the community.
1.1 Background of the IMPACT project The IMPACT project brings together twenty-six national and regional libraries, research institutions and commercial suppliers that will share their knowledge, compare best practices and develop innovative tools to enhance the capabilities of OCR engines as well as the accessibility of digitised text and lay down the foundations for the mass-digitisation programmes that will take place over the next decade.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HIP '11, September 16 - September 17 2011, Beijing, China Copyright 2011 ACM 978-1-4503-0916-5/11/09$10.00.
IMPACT develops a whole suite of software tools to specifically address many of the challenges encountered in OCR processing of historical material, tackling every individual step in an OCR workflow. The main focus lies on methods which are particularly suitable for printed historical documents.
1.2 Requirements While modern documents are typically recognised by integrated OCR software packages with very high accuracy rates, historical documents often exhibit rare defects and challenges that can require additional processing steps, targeting problems specific to the nature of historic documents. Solutions for such particular issues can often be found only on the level of individual collections or timeframes/subject matters. This requires that the technical architecture provides sufficient flexibility not only to integrate a variety of specific software tools, but also to allow the tools and processing steps to be tailored in numerous ways so as to derive the optimal combination for the source material.
These processing steps together constitute a pipeline in which information from one step is used in the next. Loops can also occur. For example, the OCR engine can iteratively process a set of candidates obtained from the segmentation step, thereby determining the optimal setting. Flexibility is therefore needed on an even higher level: tuning the OCR workflow to a specific kind of input is often the only way to obtain high quality results. In particular for historical documents there is currently no ideal workflow which will return optimal results for the whole range of possible inputs. Existing software solutions for OCR often do not provide the user with means to change and rearrange the workflow chain, even though this can have a considerable impact on the results. As a side effect, know-how regarding the configuration of OCR processes for historic documents is often not accumulated, which can cause uncertainties in OCR projects and lead to re-inventing the wheel.
This background makes it desirable to have an open, flexible and modular technical infrastructure for managing digitisation workflows via a service-oriented architecture  in order to exchange and rearrange the required services while keeping an overall integrated data pipeline and high degree of automation throughout all the steps in the process. Such an infrastructure needs to meet the interoperability demands in terms of integrating cross platform services and establishing data exchange between them while supporting a variety of data formats for representing content and being fit for deployment in heterogeneous and distributed IT infrastructures.
In this paper we present an approach to such a novel infrastructure for experimental workflow development in digitisation, with a particular focus on OCR for historic documents as implemented in the IMPACT project. The web-based infrastructure establishes platform independent interaction between tools, users, and data in a user friendly way and thereby provides the means to easily conduct experiments and evaluations  using different combinations of tools and content.
2. TECHNICAL ARCHITECTURE A core piece of work in IMPACT lies in the development of novel software techniques for numerous tasks connected to OCR, such as image enhancement, segmentation and post-processing, as well as in the improvement of existing OCR engines and experimental prototypes. The variety of development platforms and programming languages used by the developers of these tools makes it necessary to define an overall technical integration architecture for establishing interoperability of the different software components.
2.1 Design principles An integration architecture is usually built in several layers to break the problem into distinct smaller problems. As presented by , four important layers can be distinguished: Data-level integration, application integration, business process integration and presentation integration. Before going into detail, we explain how the components generally relate to these layers:
Data-level integration: Data provision to application level and conversion. For example, image (and ground truth) data repositories that allow access to sample images and metadata along with seamlessly integrated conversion services.
Application integration: Web services, Workflow modules (the concept will be explained later).
Business process integration: Basic workflows, Complex workflows.
Presentation integration: Web portal, Project website, 3rd-party projects.
The main conceptual principles were to achieve, first, modularity, i.e. it should be possible to combine individual modules in a vast number of combinations, thereby enabling users to identify the most suitable processing chain for a particular kind of material being processed and guaranteeing the reusability of the components. In this respect, the service-oriented architecture has been identified as an appropriate guiding architectural design principle; more specifically the principle of loose coupling of reusable processing units, minimising interdependencies between them. Second, to achieve transparency, i.e. it should be possible to evaluate and test each individual application/processing step separately, so that it is obvious whether a functional unit produces expected results and contributes to the overall quality of the workflow and what the cost is in terms of processing time. Third, to achieve flexibility, i.e. it should be possible to easily change and rearrange workflows (chains of software components operating on data) and to create new workflows on the basis of existing ones by adapting them. And, fourth, to achieve extensibility, i.e. it should be possible to install third party components with little effort. Currently the platform incorporates tools as diverse as the ABBYY FineReader SDK (modified IMPACT version) and the Binarisation, Dewarping, Segmentation and Recognition technologies developed in the course of IMPACT, as well as third party components such as Tesseract or OCRopus.
2.2 Components Following these principles, IMPACT has in the first instance adopted open source components from the Apache Software Foundation, a highly active and reliable open source software development community, and evaluated alternatives only if the default Apache solutions appeared to be inappropriate or too complex for the project purposes. The main advantage of this approach is that the interdependencies between the different Apache solutions are well defined and interoperability between the components is easily achieved by following the best practices recommended by the community.
Figure 1 gives an overview of the different components used in the IMPACT framework. Java technology provides a platform-independent technical ground for most of the framework components. Apache Tomcat has been chosen as the servlet container where the Axis2 web service stack is running as a web application. Apache Synapse is used as an enterprise service bus (ESB) providing web service load balancing by distributing workload across various processing units in the distributed network and failover functionality by skipping web service endpoints that are not available for some reason. The Apache Web Server is optionally used as a filtering and routing instance which redirects requests to the corresponding application server. On top of the web service layer, the Taverna Workflow Engine has been adopted for service orchestration and mashups (see also Sections 2.4 and 2.5).
Figure 1. IMPACT Framework components.
2.3 Implementing tools as web services Many of the software components developed in IMPACT can be used via the command line interface, while a set of parameters configures how the actual processing should be done. As illustrated in figure 2, it was decided to use the command line interface as the default IMPACT tool interface and build a generic Java-based wrapper around it. Using a generic skeleton, the Java-based wrapper is then offered as an Axis2 web service while the operations with the corresponding parameters of the underlying command line tool are offered as a web service defined in the Web Service Description Language (WSDL).
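The wrapping idea can be illustrated with a minimal Python sketch. It is not the IMPACT Java wrapper itself: the `ToolWrapper` class, its argument template and the stand-in tool (a one-line Python command) are hypothetical and only show how named operation parameters might be mapped onto a command line.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class ToolWrapper:
    """Hypothetical generic wrapper around a command line tool."""
    base_command: List[str]        # executable plus fixed arguments
    argument_template: List[str]   # e.g. ["{input}", "{output}"]

    def build_command(self, **params: str) -> List[str]:
        # Substitute the named parameters into the argument template only.
        return self.base_command + [a.format(**params) for a in self.argument_template]

    def run(self, **params: str) -> str:
        # check=True raises CalledProcessError on a non-zero exit code.
        result = subprocess.run(self.build_command(**params),
                                capture_output=True, text=True, check=True)
        return result.stdout


# Stand-in "tool" (an assumption for the demo): upper-cases its first argument.
demo_tool = ToolWrapper(
    base_command=[sys.executable, "-c", "import sys; print(sys.argv[1].upper())"],
    argument_template=["{text}"],
)
```

Calling `demo_tool.run(text="ocr")` executes the stand-in tool and returns its standard output. In a service setting, each such parameter would correspond to an element of the operation's request message.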
For creating the web services, it was decided to choose the "contract first" development style, i.e. the WSDL description is created first and the corresponding Java code is generated afterwards. The main advantage of this approach is that it enables an implementation-independent definition of data types using XML Schema, e.g. by adding constraints to simple data types, such as regular expressions or restricted value lists for strings (xsd:string) or ranges for integer values (xsd:int). The exact way in which these data types are then mapped into Java-specific data types is therefore of secondary importance.
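The kinds of constraints mentioned here can be mimicked in a few lines of Python. The sketch below is only an analogy to XML Schema facets (pattern, enumeration, inclusive ranges), not a schema validator. The three example parameters are invented for illustration and do not describe any actual IMPACT service.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StringConstraint:
    pattern: Optional[str] = None             # analogous to an xsd:pattern facet
    allowed: Optional[Sequence[str]] = None   # analogous to xsd:enumeration

    def validate(self, value: str) -> bool:
        # Like xsd:pattern, the expression must match the whole value.
        if self.pattern is not None and re.fullmatch(self.pattern, value) is None:
            return False
        if self.allowed is not None and value not in self.allowed:
            return False
        return True


@dataclass
class IntConstraint:
    minimum: int   # analogous to xsd:minInclusive
    maximum: int   # analogous to xsd:maxInclusive

    def validate(self, value: int) -> bool:
        return self.minimum <= value <= self.maximum


# Hypothetical parameters of an imaginary tool.
output_format = StringConstraint(allowed=["text", "xml"])
language_code = StringConstraint(pattern=r"[a-z]{3}")
threshold = IntConstraint(minimum=0, maximum=255)
```

Because the constraints live in the contract rather than in the implementation, a request with, say, `threshold = 300` can be rejected before the wrapped tool is ever invoked.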
Figure 2. Generic Web Service Wrapper
As all of the IMPACT components operate on data streams, either by modifying them or by extracting information from them, the web services must support binary data exchange. The generic web service wrapper supports this in two ways: data can be passed "by reference", where the SOAP message contains a URL reference to a file, or "by value", where the binary data is attached to the SOAP message using the MTOM standard for binary data transfer. Passing by reference is the default method used in IMPACT.
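The difference between the two passing styles can be sketched as follows. This is a simplified Python illustration that leaves out SOAP and MTOM entirely; the function `resolve_payload` and its parameters are hypothetical, and a local `file://` URL stands in for a remote image repository so that the example runs without a network.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Optional


def resolve_payload(url: Optional[str] = None, data: Optional[bytes] = None) -> bytes:
    """Return the binary content of a request, however it was passed."""
    if (url is None) == (data is None):
        raise ValueError("pass exactly one of url (by reference) or data (by value)")
    if data is not None:
        return data                   # by value: the bytes travel with the message
    with urllib.request.urlopen(url) as response:
        return response.read()        # by reference: the service fetches the file itself


# Demo with a temporary local file instead of a remote repository.
with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "page.tif"
    image.write_bytes(b"fake image bytes")
    by_reference = resolve_payload(url=image.as_uri())

by_value = resolve_payload(data=b"fake image bytes")
```

Both calls yield the same bytes. With by-reference passing the message itself stays small, and only the service that actually needs the image retrieves it.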
Generally, there is a one-to-one mapping from tools to web services, which means that one tool will be described as a service in one WSDL file. Different types of functionality will be offered as different web service operations, and additional parameters are offered as parameters of the operation if required for higher level workflow creation. All the web services offered within the IMPACT framework together form the basic layer of the framework. Each web service can be explored and tested as a standalone service or integrated in another environment by generating appropriate stubs. The web services are used as a basis for different kinds of clients, like the workflow management system Taverna, described in some detail in the next section, or the website client which has been created in order to enable the seamless website integration of the web services for demonstration purposes.
Web services can be combined whenever the output of one service is compatible with the input of another service and it makes sense to apply the processing order that is given by this software workflow. For example, binarisation (reducing a colour/greyscale image to black and white considering especially the character information) can be applied to a document image before handing the results over to the text recognition process, while the inverse processing order would not make sense. After evaluating different possibilities for workflow modelling, IMPACT has opted for the Taverna Workbench, briefly introduced in the next section. The main reason for this was the suitability of the data-driven approach of the Taverna workflow language for modelling OCR workflows and the immediate benefit for non-expert users through the graphical interface Taverna provides. Other workflow management systems typically require the users to model workflows in some dialect of XML, a complexity that is removed in Taverna through the use of graphical design patterns.
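The compatibility condition can be made concrete with a small Python sketch. The `Service` record and the coarse type labels "image" and "text" are assumptions for illustration; real services would be compared by their declared message types. Such a check only covers format compatibility, and whether a given order is meaningful remains a judgement for the workflow designer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Service:
    name: str
    input_type: str
    output_type: str
    run: Callable[[str], str]


def chain(services: List[Service]) -> Callable[[str], str]:
    """Compose services, rejecting chains whose types do not line up."""
    for producer, consumer in zip(services, services[1:]):
        if producer.output_type != consumer.input_type:
            raise TypeError(f"{producer.name} produces {producer.output_type}, "
                            f"but {consumer.name} expects {consumer.input_type}")

    def pipeline(value: str) -> str:
        for service in services:
            value = service.run(value)
        return value

    return pipeline


# Stand-in services that only record what was applied to the input.
binarise = Service("binarise", "image", "image", lambda img: f"bw({img})")
recognise = Service("recognise", "image", "text", lambda img: f"text-of[{img}]")

ocr_workflow = chain([binarise, recognise])
```

Here `chain([recognise, binarise])` raises a `TypeError`, because text recognition produces text while binarisation expects an image.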
2.4 The Taverna workflow system Taverna is a workflow language and computational model originally designed to support the automation of complex, service-based and data-intensive processes. It shares these goals with a growing and mature constellation of other systems, each geared towards different types of applications and, therefore, different areas of science. Taverna in particular has been building its reputation since 2004 as a tool that life scientists and bio-informaticians could rely on to formally describe and then enact the complex data pipelines that characterise the computational portion of their experimental science. Its application space has since expanded into more diverse areas of science, including for example astronomy and image analysis in the biomedical field.¹
¹ In 2008 there were over 4000 active users of Taverna known to the development group.
Its process modelling paradigm is simple: users interactively build data pipelines, where a set of nodes representing data processing elements are connected by directed edges that denote data dependencies amongst the nodes. As nodes can have multiple inputs and outputs, the resulting workflow is a directed graph. Processor nodes are typically web service clients that are dedicated to the invocation of a single service operation, but they can also be local scripts (in the R and Beanshell languages). Examples of complete Taverna workflows are given in the next section, along with a description of how generic, WSDL-described services can be imported as Taverna processors.
Workflows are executed by the Taverna engine, which "pushes" data through the pipeline starting from its input ports, and terminates when all those input tokens have been completely processed. This data-driven model, where the order of the computation is determined solely by the arcs in the graph, provides opportunities to parallelise the computation, in a way that is completely transparent to the workflow designer. Indeed, the designer's interaction with the system is mediated entirely by the Taverna Workbench, a visual environment where users compose workflows from a palette of pre-loaded or custom-made processors, launch them, monitor their execution, and visualise and export the results.
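The data-driven model can be approximated in Python with the standard library's `graphlib` module (Python 3.9 or later). The sketch below is not Taverna's engine, and the workflow graph is hypothetical. It only shows how an execution order, and the opportunities for parallel execution, follow from the data dependencies alone.

```python
from graphlib import TopologicalSorter  # available from Python 3.9

# Hypothetical workflow: each node maps to the nodes whose output it consumes.
dependencies = {
    "binarise": {"input"},
    "dewarp": {"input"},
    "segment": {"binarise", "dewarp"},
    "recognise": {"segment"},
}


def execution_batches(deps):
    """Group nodes into batches; nodes within one batch are mutually independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # every node whose inputs are available
        batches.append(ready)
        sorter.done(*ready)                  # mark the batch as finished
    return batches


batches = execution_batches(dependencies)
```

In this example "binarise" and "dewarp" end up in the same batch: neither depends on the other, so an engine is free to run them concurrently without the designer having to say so.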
While the Taverna development is centrally controlled, it is essentially a large community effort that has been publicly funded over the years.² Conceived from the beginning as an open source project, its plugin-based architecture has made it possible to incorporate contributions from a large community of developers. As a result, dedicated interfaces are now available to specialised classes of services, including for example the caGrid services that underpin the large caBig cancer research project in the US.
² Taverna has been a central software product of the OMII-UK initiative; a version of Taverna has been available since mid 2003, and at the time of writing Taverna has guaranteed funding until 2014.
Faced with the challenges of modern data-intensive science, Taverna has recently been substantially re-engineered, resulting in a scalable architecture that can, for example, manage high-volume processing of large images. At the same time, the user Workbench has been fully integrated with myExperiment (www.myexperiment.org), a Web 2.0-style web site conceived in 2007 to help scientists discover and share scientific workflows. Since its inception, myExperiment has enjoyed broad adoption by many members of the very same community of Taverna users, and it now collects over 1,800 user-contributed workflows in its online repository. The connection between Taverna and myExperiment is as natural as it is user-empowering: a user can now launch Taverna workflows from within the myExperiment site, and conversely, the repository can be searched from within the Taverna workbench.
2.5 From web services to workflows As mentioned, Taverna processors are for the most part web service clients that are responsible for invoking one specific service operation. In fact, Taverna lets workflow designers import WSDL-described services by indicating their URL, and generates one processor for each operation specified in the interface. Once the web service is available as a collection of Taverna processors, those can be used within a workflow, i.e., they become available to the designers in the Workbench's palette. The generation process also includes the creation of input and output data structures that correspond to the input and output data types of the service operations.
IMPACT exploits this Taverna feature to provide so called "workflow modules", which replicate the service's functionality, but at the workflow level, so that the service can now be easily composed with others that have been similarly exposed. By doing that, IMPACT creates an abstraction layer exposing relevant features of the software tools as workflow modules and showing how the components are supposed to interact by creating complex workflows out of basic workflow components (see Section 3.1.3).
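The import step on which this builds, one processor per WSDL operation, can be approximated with a few lines of Python. The WSDL fragment below is heavily abbreviated and hypothetical: messages, bindings and the service element are omitted, and the second operation name is invented. The sketch only lists operation names and does not reproduce Taverna's actual import logic.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

# Abbreviated, hypothetical WSDL 1.1 fragment.
SAMPLE_WSDL = """
<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/" name="OcrService">
  <wsdl:portType name="OcrPortType">
    <wsdl:operation name="ocrImageFileByUrl"/>
    <wsdl:operation name="ocrImageFileByValue"/>
  </wsdl:portType>
</wsdl:definitions>
"""


def list_operations(wsdl_text):
    """Return the operation names declared in the port types of a WSDL document."""
    root = ET.fromstring(wsdl_text)
    return [op.get("name")
            for op in root.findall("wsdl:portType/wsdl:operation", WSDL_NS)]


# One candidate processor per declared operation.
processors = list_operations(SAMPLE_WSDL)
```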
As an example, figure 3 shows the diagram of the "basic workflow" providing access to the ABBYY FineReader OCR web service. In the middle, the web service operation "ocrImageFileByUrl" represents the actual processing node. The nodes "ocrImageFileByUrl_part1" and "ocrImageFileByUrl_part1_2" are so-called XML splitters, which in Taverna are used to offer XML structures that correspond to the data types used by the web service.
Above the "ocrImageFileByUrl_part1" node, the four ports represent the input parameters of the ABBYY FineReader OCR web service and are all exposed as workflow input ports in the first line. Without going into much detail on the meaning of the individual ports, one data type conversion is worth mentioning: the "languages_commaseparated" input port (a comma separated string indicating the dictionaries to be used for OCR processing, e.g. "German, English, French") is converted to a Taverna list type using the local service "Split_string_into_string_list_by_regular_expression", which takes the string as input and returns a list of strings as output. This list is then handed over to the web service, where the port "languages" requires a list of strings. This shows how Taverna is used as a mediation instance between the way the web service is offered to the Taverna user and the way the tool's functionality is offered as a web service.
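The conversion performed by this local service amounts to a regular-expression split. A minimal Python equivalent is sketched below; the function name and the exact expression are assumptions, since the paper does not state which pattern the workflow uses.

```python
import re


def split_languages(languages_commaseparated):
    """Turn a comma separated string into a list of strings."""
    # Split on commas and swallow any whitespace around them.
    parts = re.split(r"\s*,\s*", languages_commaseparated.strip())
    return [part for part in parts if part]   # drop empty entries


languages = split_languages("German, English, French")
```

The resulting list has one entry per dictionary name and could be passed directly to a port that, like "languages", expects a list of strings.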
3. RESULTS In this section, some initial results are discussed. First, the benefits of the chosen approach are outlined, and then some challenges are presented along with how they have been tackled. The described architecture is fully operational and has been used by IMPACT project partners since 2010; at the time of writing, more than a hundred thousand files from various libraries' digital collections have been processed with the system.
3.1 Benefits Benefits of the defined architecture can be observed in the management of the distributed development, the design of demonstrations and the creation of new service mashups, as well as in the reproducibility of evaluations.
3.1.1 Distributed development By using services at remote sites users are freed from the requirement to keep the tools up to date, install software and run complex hardware. This architecture gives users access to a large number of computing resources with very little or no administrative requirements.
Figure 3. Example ABBYY FineReader OCR as a Taverna workflow (all service ports displayed)
The architecture also has several advantages for the developers of new methods and applications. It allows them to focus exclusively on the provision of their own service rather than having to support all the related services as well. This should in turn lead to higher quality tools, as the time saved translates into more resources for the specific tool development. With the generic wrapper, any tool can easily be integrated into the framework and then made available for evaluation and demonstration purposes right away.
3.1.2 Demonstration design Design of new demonstrations is accelerated over alternative approaches through a combination of easy visualisation of the current demonstration and ready availability of new services or workflows with which to extend it. Users can start with something familiar and incorporate new functional modules with very little effort. For example, a simple workflow can send some images to the OCR engine and display the results. A user can extend this to also pre-process images automatically before sending them to OCR. Using traditional approaches this would involve editing the code using some kind of editor, installing the OCR engine and image enhancement tool and then some testing to determine whether the correct results are being achieved. Using the current architecture, it becomes a simple drag and drop operation to incorporate the tool into the workflow and a further operation within the graphical interface to send the images first to the image processing step.
3.1.3 New service mashups The loosely coupled, autonomous and stateless web services and the corresponding basic workflows provide an architecture that follows the "build once, use often" principle. Creating composite workflows by mashing up other web services and workflows can thus be done quickly using the Taverna Workbench. Taverna comes with a library of frequently used shim services, which are created specifically to connect the inputs and outputs of closely related services in order to achieve interoperability between domain services. However, a new mashup often brings about the need to develop a new conversion service, which is then also added to the architecture as a service.
3.1.4 Evaluation Within the IMPACT project, a representative dataset of images and ground truth has been produced. Ground truth is the close to 100 percent correct transcription of the text and layout elements visible on the document pages, which the OCR methods are expected to produce. The dataset currently comprises more than 600,000 high quality images from the major European mass digitisation programmes, for which approximately 25,000 instances of ground truth transcriptions have been created. Using the data pipelining architecture and the ground truth data for OCR, any digitisation workflow which produces OCR results can be statistically evaluated. In this case the OCR results are compared with the ground truth data, and the enhancement achieved by different formats, tools and configurations is measured in terms of text accuracy as well as layout detection rate, making use of the Page Analysis and Ground Truth Elements Framework.
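As an illustration of what such a comparison can look like, the Python sketch below computes a character-level accuracy from the edit (Levenshtein) distance between an OCR result and its ground truth. This particular metric is an assumption chosen for its simplicity; the paper does not specify the exact accuracy measure used in IMPACT, and the sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, ground_truth):
    """Share of ground truth characters recovered, clipped to the range [0, 1]."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


# Invented example: the same word recognised without and with an enhancement step.
ground_truth = "historical"
baseline = character_accuracy("hist0rlcal", ground_truth)
enhanced = character_accuracy("historlcal", ground_truth)
```

Running two workflow variants on the same page and comparing their scores against the same ground truth gives a simple, reproducible measure of whether a processing step helps.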
One possible evaluation method will be to add an additional tool / processing step to an existing workflow and follow the effect on the OCR results. Figure 4 shows the evaluation of an image enhancement step (Dewarping) on the OCR results.
Figure 4. Assessment of the effects of pre-OCR image enhancement (Dewarping) on OCR results.
Another method will be to measure the effects of two competing steps on the OCR results. Figure 5 shows the comparison of JPEG2000 format with TIFF format and their effects on the OCR accuracy.
Figure 5. An ongoing discussion in the digitisation community: JPEG2000 vs. TIFF.
It is also conceivable to use these methods to test different scanners or digital cameras, language libraries or even third party digitisation tools which are available as web services or workflows, evidently with regard to OCR accuracy. Thereby it is possible to quickly identify the best possible configuration of individual tools or even complex workflow chains, as required for processing heterogeneous collections of historic documents.
3.1.5 Scalability Recently, digitisation programmes have seen a shift from what was previously often called boutique digitisation to mass digitisation activities. This entails that software components used in document recognition and analysis are required to scale up to the amounts of processing required in these cases. While there are similar systems, such as Gamera or DAE, they appear to focus more on experimentation, while the IMPACT framework has been developed with a view to large-scale processing from the start.
The proposed architecture takes account of this by several means. In order to keep the data transmission time as short as possible, the architecture encourages the usage of URL references instead of binary attachments in SOAP messages. In addition, an Enterprise Service Bus is used to provide load balancing by distributing the workload across various proxies of identical service copies which are deployed at different locations. Failover functionality is also provided by monitoring the web services and skipping the endpoints that are not available for some reason. The inherent parallelisation of the workflow management system can further reduce the execution time, depending on the overall workflow complexity.
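The combination of load balancing and failover can be sketched in Python as a round-robin dispatcher that skips unavailable endpoints. This is a conceptual illustration only and says nothing about how the enterprise service bus is actually configured; the class and the three stand-in endpoints are hypothetical.

```python
from itertools import count
from typing import Callable, Sequence


class RoundRobinDispatcher:
    """Distribute requests over identical service copies, skipping failed ones."""

    def __init__(self, endpoints: Sequence[Callable[[str], str]]):
        self.endpoints = list(endpoints)
        self._counter = count()

    def dispatch(self, request: str) -> str:
        start = next(self._counter)   # rotate the starting endpoint per request
        for offset in range(len(self.endpoints)):
            endpoint = self.endpoints[(start + offset) % len(self.endpoints)]
            try:
                return endpoint(request)
            except ConnectionError:
                continue              # failover: try the next copy
        raise ConnectionError("no endpoint available")


# Stand-in endpoints; node_b simulates a service copy that is down.
def node_a(request): return f"A:{request}"
def node_b(request): raise ConnectionError("node B is down")
def node_c(request): return f"C:{request}"


dispatcher = RoundRobinDispatcher([node_a, node_b, node_c])
results = [dispatcher.dispatch(f"page{i}") for i in range(3)]
```

The second request would normally go to the unavailable copy and is transparently served by the next one, so the caller never sees the failure.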
3.2 Challenges The architecture contains several abstraction levels, such as the wrapping of command-line tools into web services and the wrapping of web services into basic workflows. Changes in the lower abstraction levels have a high impact on the overall availability of the architecture. If not managed carefully, these changes can cause partial or complete malfunction. For example, a new version of a tool might use different names for the input/output ports than the earlier version, which would cause the related basic workflow to fail. Or a service endpoint might move to another server, which would have the same effect on the workflows that rely upon this service.
The usage of workflow modules limits the impact of changes to the service interfaces. However, at some point there is a need to change the interface of the workflow module and manage versioning without affecting service consumers. Workflow modules with enabled versioning ensure that a) a version of the web service and workflow is always kept available that is known to be working and b) it is relatively straightforward to create fresh mashups from new workflows, whereas it is a complex exercise to adapt all references to a certain tool or nested workflow within already existing mashups.
Another challenge lies in the fact that usually masses of images are used in digitisation workflows (mass digitisation) which are rather big in size. Running workflows dealing with hundreds of images might cause an overhead in time, memory and/or bandwidth, which affects the overall performance. There are several methods to cope with these issues as outlined in Section 3.1.5. In addition, the framework has built-in provenance capabilities, which, for example, enable users to measure processing times for a particular workflow and environment in order to identify potential bottlenecks.
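The idea of using recorded processing times to locate bottlenecks can be illustrated with a small Python sketch. It does not use the framework's provenance facilities; the helper `run_with_timings` and the two stand-in steps are hypothetical and simply time each step of a linear pipeline.

```python
import time


def run_with_timings(steps, value):
    """Run (name, function) steps in order and record the wall-clock time of each."""
    timings = {}
    for name, step in steps:
        started = time.perf_counter()
        value = step(value)
        timings[name] = time.perf_counter() - started
    return value, timings


# Stand-in steps operating on a string instead of an image.
steps = [
    ("binarise", lambda s: s.lower()),
    ("recognise", lambda s: s[::-1]),
]

result, timings = run_with_timings(steps, "PAGE")
slowest = max(timings, key=timings.get)   # candidate bottleneck
```

Collected over many runs and environments, such per-step timings indicate which step or which deployment dominates the overall processing time.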
4. CONCLUSIONS The main motivation of this work was to describe a SOA approach to an infrastructure for experimental workflow development in digitisation and OCR, with a focus on historical document collections. The infrastructure establishes interoperability between tools, users, and data in an efficient and user friendly manner. By introducing yet another abstraction layer through the use of workflow representations of tools, the technical architecture presented here establishes a number of additional benefits in the area of workflow development, evaluation and usability in particular.
Thanks to these resources, it is now possible for researchers to evaluate new methods and prototypes for document image analysis and processing using the tools and data contained within this framework, with a view not only on performance but also their aptitude for mass digitisation. The dataset (described in Section 3.1.4) constitutes a unique resource in the digitisation domain and has sophisticated interfaces with the framework, allowing researchers to evaluate alternative methods on the basis of a common, publicly available, and representative dataset of sufficient size, thereby greatly supporting transparency and comparability of their research. The framework also enables new research regarding the clustering of large data collections and the creation of scalable workflows for processing them.
The concept also entails some challenges. However, if managed carefully, these challenges do not outweigh the opportunities arising from an open and flexible model for developing workflows, especially in terms of experimenting with new services and evaluating different combinations of workflow modules. The Web 2.0 features exposed by the workflow registry add value through the sharing of knowledge and the building of capacity in the wider community.
5. ACKNOWLEDGMENTS The IMPACT Research and Development work presented here is partially supported by the European Community under the Information Society Technologies Programme (IST-1-4.1 Digital libraries and technology-enhanced learning) of the 7th Framework Programme - Project FP 7-ICT-2007-1.