![Page 1: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/1.jpg)
Tools for text digitisation and transcription
Tomasz Parkoła
Poznan Supercomputing and Networking Center
CERL annual seminar, 28.10.2014, Oslo, Norway
![Page 2: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/2.jpg)
Agenda
• IMPACT Center of Competence
• Expertise, tools & events
• PSNC example
• Summary
![Page 3: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/3.jpg)
IMPACT CoC
Content holders
Service providers
Researchers
• Other competence centres and initiatives
• Europeana
• Research infrastructures
IMPACT Centre of Competence in digitisation
![Page 4: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/4.jpg)
IMPACT CoC members
Premium members
• Biblioteca Nacional de España
• Bibliothèque nationale de France
• British Library
• Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung
• Fundación Biblioteca Virtual Miguel de Cervantes (Management and headquarters)
• Instituut voor Nederlandse Lexicologie
• Koninklijke Bibliotheek
• Contentra Technologies (formerly Planman Technologies)
• Poznań Supercomputing and Networking Center
• Universidad de Alicante
Standard members
• Biblioteka Uniwersytecka we Wrocławiu (Wroclaw University Library)
• California Digital Library
• Centro de documentación teatral
• DIGIBIS
• Elzaburu
• Göteborgs Universitet
• Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen (University Library Centre of North Rhine-Westphalia)
• i2s Digibook
• KU Leuven
• Kungliga Biblioteket (National Library of Sweden)
• LIBNOVA
• Ludwig-Maximilians-Universität, Centrum für Informations- und Sprachverarbeitung
Standard members (cont.)
• Narodna in univerzitetna knjižnica (National Library of Slovenia)
• National Library of Czech Republic
• National Library of Egypt
• National Library of Finland
• National Library of Latvia
• National Library of Serbia
• Staats- und Universitätsbibliothek Bremen
• Tecnilógica
• Universitat de Barcelona
• Universidad Complutense de Madrid
• Universidad de Granada
• Universidad de Murcia
• Universidad de Salamanca
• Universidad de Valladolid
• University of Salford
• Vinfra
![Page 5: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/5.jpg)
IMPACT CoC members
source: http://www.amcharts.com/visited_countries/
![Page 6: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/6.jpg)
IMPACT CoC members
source: http://www.amcharts.com/visited_countries/
![Page 7: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/7.jpg)
Main activities led by IMPACT CoC
• Website with tools, resources and guidance for digitisation practices
• Consultation in the context of tools and resources that can be applied in the digitisation workflow
• Knowledge dissemination via social media, events organisation and training materials
• Support for members to create new research initiatives, projects and expert groups
![Page 8: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/8.jpg)
Key benefits for members
Cultural heritage
• Access, validate and identify best digitisation technologies
• Meet the experts and define best practices
• Share experience and guide innovation
Research centres
• Learn about research challenges
• Collaborate and provide solution
• Find project partners, sponsors or facilitators
• Share knowledge and experience
Companies
• Showcase your tools and services
• Meet your target customers
• Introduce innovation
![Page 9: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/9.jpg)
Example: Geometric correction in the demonstrator platform
![Page 10: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/10.jpg)
Expertise and experience: overview
ICT Innovation
Digitisation Services and support
R&D Face challenges
![Page 11: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/11.jpg)
Expertise and experience: examples
• Projects
• Consultancy
• Spike-solutions
• Interoperability, standards, formats and licensing
• Blogs, events, working groups
• Public-private partnership and sponsorship
• Scanning, conversion, etc.
• OCR enhancement & adaptation
• Legal consultancy
• Tools integration
• Workflow management
• Online access
Digitisation projects
Digitisation services
Research and development
Community & cooperation
![Page 12: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/12.jpg)
Tools: overview
IMPACT platform
• Image enhancement
• Segmentation
• OCR engines
• Evaluation
• Other
![Page 13: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/13.jpg)
Tools: examples (http://digitisation.eu/demonstrator-platform)
Image enhancement
• NCSR Border removal
• NCSR Geometric Correction
• NCSR Binarisation
• Abbyy FineReader 10 Binarisation
• Unpaper
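The slide names the binarisation tools without describing how they work. As a rough illustration of what a binarisation step does, here is a minimal sketch using Otsu's global threshold. This is a swapped-in textbook technique, not the NCSR or Abbyy FineReader method, and the function names `otsu_threshold` and `binarise` are hypothetical. The image is assumed to be a list of rows of 8-bit grey values.

```python
from __future__ import annotations


def otsu_threshold(pixels: list[int]) -> int:
    """Return the grey level that maximises between-class variance (Otsu).

    Assumption: pixels are 8-bit grey values in the range 0..255.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    low_count, low_sum = 0, 0
    for level in range(256):
        low_count += histogram[level]
        low_sum += level * histogram[level]
        if low_count == 0:
            continue  # no pixels at or below this level yet
        high_count = total - low_count
        if high_count == 0:
            break  # every pixel is already in the low class
        mean_low = low_sum / low_count
        mean_high = (total_sum - low_sum) / high_count
        # Between-class variance, up to a constant factor.
        variance = low_count * high_count * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(image: list[list[int]]) -> list[list[int]]:
    """Map each pixel to 0 (ink) or 255 (paper) using one global threshold."""
    threshold = otsu_threshold([p for row in image for p in row])
    return [[0 if p <= threshold else 255 for p in row] for row in image]


if __name__ == "__main__":
    page = [[20, 30, 200], [25, 210, 220]]  # toy 2x3 "scan"
    print(binarise(page))
```

A single global threshold is the simplest choice. Historical documents with uneven lighting or bleed-through usually call for adaptive, locally computed thresholds, which is one reason dedicated tools such as those listed above exist.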
![Page 14: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/14.jpg)
Tools: examples (http://digitisation.eu/demonstrator-platform)
Segmentation
• Abbyy FineReader 10 Segmentation
• Uni. Salford region, line, word Segmentation Service
• NCSR character segmentation
• Uni. Innsbruck
![Page 15: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/15.jpg)
Tools: examples (http://digitisation.eu/demonstrator-platform)
OCR engines
Abbyy FineReader 10 OCR
Abbyy FineReader 10 with external dictionary
Uni. Salford Typewritten OCR
Tesseract 3.00
Gocr
Ocropus
Cuneiform
![Page 16: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/16.jpg)
Tools: examples (http://digitisation.eu/demonstrator-platform)
Evaluation
• NCSR OCR Evaluation service
• Uni. Salford layout evaluation service
• INL word evaluation service
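The evaluation services above are listed by name only. To give a sense of what OCR evaluation typically measures, the sketch below computes character and word error rates from the edit distance between a ground-truth transcription and OCR output. It is a generic illustration, not the implementation of the NCSR, Salford or INL services, and all function names are hypothetical.

```python
from __future__ import annotations

from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # delete ref_item
                current[j - 1] + 1,      # insert hyp_item
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance over characters, normalised by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


def word_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance over whitespace-separated tokens, normalised likewise."""
    reference_words = ground_truth.split()
    if not reference_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(reference_words, ocr_output.split()) / len(reference_words)


if __name__ == "__main__":
    truth = "mass digitisation of cultural heritage"
    ocr = "mass digi;sa;on of cultural heritage"
    print(f"CER: {character_error_rate(truth, ocr):.3f}")
    print(f"WER: {word_error_rate(truth, ocr):.3f}")
```

Both rates depend on a reliable ground truth, which is why evaluation is usually paired with ground-truth datasets.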
![Page 17: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/17.jpg)
Tools: examples (http://digitisation.eu/demonstrator-platform)
Other
ALTO and PAGE XML transformation
Uni. Salford ground-truth normalisation
Uni. Salford PAGE XML to svg
JP2
Exif
![Page 18: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/18.jpg)
Resources
• Linguistic data
– OCR/IR lexica: Slovene, German, Spanish, Czech, Polish, Dutch, English, French and others coming soon.
• Images and ground truth
– Czech, Spanish, Polish, Bulgarian, Slovene, Biodiversity Heritage Library and others coming soon.
• Annotated corpora
– Spanish.
![Page 19: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/19.jpg)
Recent activities
• Past events (2013-2014)
– Developer workshop for tools integration
– TPDL tutorial (state-of-the-art tools for text digitisation)
– Digitisation Days, DATecH and Succeed awards
– Workshop to investigate interoperability issues in digitisation
• Upcoming events (2014-2015)
– November 28th, 2014: Succeed in digitisation. Spreading Excellence. http://www.succeed-project.eu/succeed-digitisation
– September, 2015: TPDL 2015 organised by PSNC (premium member of IMPACT CoC) http://tpdl2015.info/
– 2016: Digitisation Days and DATecH
• Supporting take-up of tools and resources
– A dozen cultural heritage institutions validating and integrating state-of-the-art digitisation tools (Tesseract, ScanTailor, ImageMagick, JHOVE, NER, Korrektor, Gimp, Alchemy API, COBaLT, Omnipage, Abbyy FR)
• Cooperation with other initiatives and centres of competence (e.g. Open Preservation Foundation)
![Page 20: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/20.jpg)
PSNC example: who are we?
PSNC is an R&D centre located in Poznań, Poland, focused on ICT in the context of:
• Cloud technologies (archiving & computing) and HPC
• Network technologies (protocols, tools, management)
• Innovative applications & services
![Page 21: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/21.jpg)
PSNC example: who are we?
In the context of cultural heritage we provide:
• Polish National aggregator for Europeana and others
• DInGO toolset for digitisation projects with over 100 production-mode deployments
• Virtual Transcription Laboratory tool with OCR training module and OCR execution support
• Expertise (over 20 R&D projects, Digital Libraries Conference, training, workshops, consultancy)
![Page 22: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/22.jpg)
PSNC example: Virtual Transcription Laboratory
Web portal which provides access to:
• Cutouts tool (creation of customised recognition profiles)
• OCR engine with multiple recognition profiles
• Transcription editor with QA interface and group work support
• TXT, hOCR and ePUB output formats
http://wlt.synat.pcss.pl/
![Page 23: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/23.jpg)
PSNC example: why?
1. High-quality, efficient mass digitisation is currently one of the crucial challenges for cultural heritage institutions.
2. PSNC is involved in the IMPACT CoC to help these institutions overcome existing barriers and face new challenges in this context.
3. We believe that the expertise of IMPACT CoC members can significantly contribute to successful R&D activities and cutting-edge information technologies for digitisation.
![Page 24: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/24.jpg)
Summary: Join us!
Become a standard member
• get support in the digitisation programmes
• access part of the IMPACT CoC resources
• meet the experts and get advice
Become a premium member
• steer activities led by IMPACT CoC
• get full access to IMPACT CoC resources
• actively investigate and innovate digitisation
Contact: info@digitisation.eu
![Page 25: Impact Centre of Competence presentation at CERL 2014 by Tomasz Parkola (PSNC)](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557cc247d8b42a59078b4c70/html5/thumbnails/25.jpg)
Thank you!
Tomasz Parkoła
CERL annual seminar, 28.10.2014, Oslo, Norway