ENP Belgrade WS OLR @ CCS

Europeana Newspapers Workshop on Refinement and Quality Assessment, Belgrade, 14.6.2013. OLR at CCS: From unstructured to structured newspaper data and the role of content providers in the overall process. Claus Gravenhorst, Director Strategic Initiatives, CCS Content Conversion Specialists.

Upload: europeana-newspapers

Post on 11-May-2015


Page 1: ENP Belgrade WS OLR @ CCS

June 14, 2013

Content Conversion Specialists, WS Refinement and Quality Assessment

Claus Gravenhorst, Director Strategic Initiatives

CCS Content Conversion Specialists

europeana newspapers Workshop Refinement and Quality Assessment, Belgrade 14.6.2013

OLR at CCS: From unstructured to structured newspaper data and the role of content providers in the overall process

Claus Gravenhorst

Page 2: ENP Belgrade WS OLR @ CCS


Agenda

About CCS

General workflow for mass digitization of newspapers

OLR – Layout and structure analysis

ENP OLR workflow (involvement of content providers, CPs)

Quality assurance

Output - METS/ALTO package

Demo of first results

Page 3: ENP Belgrade WS OLR @ CCS


About CCS

CCS Content Conversion Specialists GmbH (Hamburg), as technical project partner, will provide its expertise and docWorks technology to set up and operate a mass digitisation workflow to create high quality structured content from 2 million scanned newspaper pages provided by 5 library partners

Page volume:

BNF = 1,000 k, NLE = 500 k, SUB HH = 480 k, NLF = 90 k, SBB = 10 k
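The listed volumes can be checked against the "2 million pages" figure with a few lines of arithmetic. The sketch below only restates the numbers from this slide; the names `PAGE_VOLUME_K`, `total_pages_k` and `shares` are illustrative and not part of any CCS tooling.

```python
# Page volumes per content provider in thousands of pages, as listed on the slide.
PAGE_VOLUME_K = {
    "BNF": 1000,
    "NLE": 500,
    "SUB HH": 480,
    "NLF": 90,
    "SBB": 10,
}


def total_pages_k(volumes):
    """Return the total page volume in thousands of pages."""
    return sum(volumes.values())


def shares(volumes):
    """Return each partner's share of the total as a fraction between 0 and 1."""
    total = total_pages_k(volumes)
    return {partner: pages / total for partner, pages in volumes.items()}


if __name__ == "__main__":
    print(f"total: {total_pages_k(PAGE_VOLUME_K)} k pages")
    for partner, share in shares(PAGE_VOLUME_K).items():
        print(f"{partner}: {share:.1%}")
```

The total comes to 2,080 k pages, which matches the rounded figure of 2 million.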

The distributed OLR workflow enables the contribution of project partners (content providers) to the integrated quality assurance process

CCS will also contribute to the specification of the metadata model

Page 4: ENP Belgrade WS OLR @ CCS


General workflow for mass digitization

[Workflow diagram. Labels recoverable from the figure: Check in; Scanner (robot, book, document, microfilm); Scanning; Imaging; Layout Analysis; OCR; ISR; Conversion; QA + Correction; Automated QA; Manual QA (in-house, near-shore, off-shore, multiple locations); Reject Condition; Re-Scan; random QA; Delivery; Final Output; Check out; Image and Metadata; Database / Repository; Document UID; Barcode Item Tracking; Z 39.50 Metadata.]

Page 5: ENP Belgrade WS OLR @ CCS


Layout and structure analysis

Layout analysis based on a "bottom-up" approach

General rule system enables recognition of words, text lines, text blocks, columns and classification of text blocks, illustrations, advertisements, tables and the following page types:

- title page (the title page of an issue)

- content page (a page that consists of content/text only)

- illustration page (a page that has at least one illustration)

- advertisement page (a page that contains adverts only)
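The four page types can be read as simple rules over the block types found on a page. The following is a minimal sketch of that idea, not the docWorks rule system: the `Block` class, the `is_issue_title_page` flag, the order in which the rules are tested and the "unclassified" fallback for mixed pages are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    kind: str  # "text", "illustration", "advertisement" or "table"


def classify_page(blocks: List[Block], is_issue_title_page: bool = False) -> str:
    """Assign one of the four page types from the slide, or "unclassified"."""
    kinds = {block.kind for block in blocks}
    if is_issue_title_page:
        # Assumption: the caller already knows which page opens the issue.
        return "title page"
    if kinds and kinds <= {"advertisement"}:
        return "advertisement page"   # adverts only
    if "illustration" in kinds:
        return "illustration page"    # at least one illustration
    if kinds and kinds <= {"text"}:
        return "content page"         # text only
    return "unclassified"             # empty or mixed page not covered by the rules


if __name__ == "__main__":
    print(classify_page([Block("text"), Block("illustration")]))
    print(classify_page([Block("advertisement"), Block("advertisement")]))
```

Pages that mix text and adverts without any illustration are not covered by the four definitions, so this sketch reports them as "unclassified" rather than guessing.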

Structure analysis through classification of headlines and grouping of zones into articles (incl. article continuation)
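Grouping zones into articles can be pictured as a single pass over the zones in reading order, where each headline opens a new article and a continuation zone re-attaches to an earlier one. This is a simplified sketch under stated assumptions: the `Zone` and `Article` classes and the `continues` field (the index of the article being continued) are hypothetical, and real structure analysis relies on far richer layout cues.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Zone:
    page: int
    role: str                        # "headline", "text", "caption" or "illustration"
    content: str
    continues: Optional[int] = None  # hypothetical: index of the article this zone continues


@dataclass
class Article:
    headline: str
    zones: List[Zone] = field(default_factory=list)


def group_into_articles(zones: List[Zone]) -> List[Article]:
    """Group zones (given in reading order) into articles."""
    articles: List[Article] = []
    current: Optional[Article] = None
    for zone in zones:
        if zone.continues is not None:
            # Continuation: switch back to the earlier article and keep filling it.
            current = articles[zone.continues]
            current.zones.append(zone)
        elif zone.role == "headline":
            current = Article(headline=zone.content, zones=[zone])
            articles.append(current)
        elif current is not None:
            current.zones.append(zone)
        # Zones that appear before the first headline stay ungrouped in this sketch.
    return articles


if __name__ == "__main__":
    demo = [
        Zone(1, "headline", "Article A"),
        Zone(1, "text", "A, first part"),
        Zone(1, "headline", "Article B"),
        Zone(1, "text", "B, only part"),
        Zone(2, "text", "A, continued", continues=0),
    ]
    for article in group_into_articles(demo):
        print(article.headline, [z.page for z in article.zones])
```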

Page 6: ENP Belgrade WS OLR @ CCS


ENP OLR workflow | Conversion without scanning

[Workflow diagram between the material location and the conversion facility. Labels recoverable from the figure: Digital Image and Metadata Delivery; Inspection / Automatic QA; Reject; Conversion; MD Recording; Doc Delivery; Digital Object Return.]

Page 7: ENP Belgrade WS OLR @ CCS


Possible conversion scenarios

A) Conversion at library (on-site)

B) Conversion off-shore at CCS data center, final QA at the library via internet transfer (remote QA solution)

C) Conversion off-shore at CCS, final QA at the library by backup shipment

Page 8: ENP Belgrade WS OLR @ CCS


Scenario B | Remote QA at library

[Architecture diagram. Offshore processing @ CCS: INPUT (HDD), storage with IN, OUT and POOL areas, dW Share Master, OUTPUT (METS/ALTO). QA on-site @ Library: storage with POOL area, dW Share, RQA. The two sites are connected via the Internet.]

Page 9: ENP Belgrade WS OLR @ CCS


Quality assurance

@ CCS | Automated markup and basic manual correction:

- headlines, illustrations, tables, captions, advertisements, etc.

- article segmentation and grouping of zones into articles (incl. continuation)

@ Content Provider (Library)

Recommended:

- Zoning: correct classification of blocks as "text" or "illustration"

- Article segmentation: correct identification of headlines, text blocks and captions

- Grouping: correct grouping of blocks (text, illustration) into articles

- Metadata: correct title, issue date and issue number

Optional:

- Page types: correct page types

- Page numbers: correct page sequence

- OCR: perform text correction of specific zones (e.g. headlines, captions)
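Two of the checks above, issue metadata and page sequence, lend themselves to simple automated pre-checks before a human reviewer looks at an issue. The sketch below is an illustration only and does not describe the docWorks QA client; the function names and the specific rules (non-empty title, date not in the future, positive issue number, page numbers increasing by one) are assumptions.

```python
from datetime import date
from typing import List, Optional


def check_issue_metadata(title: str,
                         issue_date: Optional[date],
                         issue_number: Optional[int]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no finding."""
    problems = []
    if not title.strip():
        problems.append("missing title")
    if issue_date is None:
        problems.append("missing issue date")
    elif issue_date > date.today():
        problems.append("issue date lies in the future")
    if issue_number is None or issue_number < 1:
        problems.append("missing or invalid issue number")
    return problems


def page_sequence_gaps(page_numbers: List[int]) -> List[int]:
    """Return positions where a page number does not follow its predecessor by one."""
    return [i for i in range(1, len(page_numbers))
            if page_numbers[i] != page_numbers[i - 1] + 1]


if __name__ == "__main__":
    # "Example Gazette" is a made-up title used only for this demonstration.
    print(check_issue_metadata("Example Gazette", date(1913, 6, 14), 137))
    print(page_sequence_gaps([1, 2, 4, 5]))
```

Findings from such pre-checks would only point the reviewer to suspicious issues; the correction itself remains a manual step.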

Page 10: ENP Belgrade WS OLR @ CCS


Output | METS/ALTO package

METS/ALTO metadata schemas to describe the structured digital output object

A newspaper issue processed in docWorks is converted into one METS XML file. It reflects the whole physical and logical structure, manages all links to the image files and the related ALTO XML files. ALTO is based on a standardized page description schema and contains all information of a page (print space, margins, coordinates, OCR results).
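To make the division of labour between the two files concrete, the sketch below reads file links from a METS document and words with coordinates from an ALTO page, using only the Python standard library. The two sample documents are heavily simplified, are not schema-valid, and are not docWorks output; the file names and the `USE` values are made up for illustration. The element and attribute names (`fileGrp`, `FLocat`, `xlink:href`, `String`, `CONTENT`, `HPOS`, `VPOS`) come from the METS and ALTO schemas.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Simplified sample: one image file and one ALTO file for a single page.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="Images">
      <mets:file ID="IMG00001"><mets:FLocat LOCTYPE="URL" xlink:href="images/00001.tif"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="Text">
      <mets:file ID="ALTO00001"><mets:FLocat LOCTYPE="URL" xlink:href="alto/00001.xml"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

# Simplified sample: one text block with one line of two words.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1"><TextLine>
      <String CONTENT="Europeana" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
      <SP/>
      <String CONTENT="Newspapers" HPOS="420" VPOS="200" WIDTH="330" HEIGHT="40"/>
    </TextLine></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def mets_file_links(mets_xml):
    """Return (file group USE, href) pairs for every file location in the METS file."""
    root = ET.fromstring(mets_xml)
    links = []
    for group in root.iter(f"{{{METS_NS}}}fileGrp"):
        for flocat in group.iter(f"{{{METS_NS}}}FLocat"):
            links.append((group.get("USE"), flocat.get(f"{{{XLINK_NS}}}href")))
    return links


def alto_words(alto_xml):
    """Return (content, hpos, vpos) for every String element, whatever the ALTO version."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        # Compare local names only, because the namespace URI differs between ALTO versions.
        if element.tag.rsplit("}", 1)[-1] == "String":
            words.append((element.get("CONTENT"),
                          float(element.get("HPOS")),
                          float(element.get("VPOS"))))
    return words


if __name__ == "__main__":
    print(mets_file_links(SAMPLE_METS))
    print(alto_words(SAMPLE_ALTO))
```

Matching on local element names is a deliberate choice here: it keeps the ALTO reader independent of the schema version a given package was produced with.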

Benefits of structural markup:

- better browsing and more precise text search

- better access and display on tablet and mobile devices

- automated article classification and clustering through data/text mining and linguistic technologies

- user engagement for manual online text correction, article classification, annotation, building personal collections, etc.

- sharing articles via social media platforms like Facebook, Twitter, etc.

METS = Metadata Encoding and Transmission Standard

ALTO = Analyzed Layout and Text Object

Page 11: ENP Belgrade WS OLR @ CCS


Access and Presentation

Access through Europeana as well as content provider portals

Existing newspaper presentation systems at National Library of Australia (Trove), Library of Congress/NDNP (Chronicling America), Dutch National Library (DDD), National Library of Luxembourg (eLuxemburgensia), ...

Veridian demo:

Example of a newspaper presentation system to demonstrate access to already processed ENP newspaper issues

Page 12: ENP Belgrade WS OLR @ CCS


Questions + answers

Page 13: ENP Belgrade WS OLR @ CCS


Contact

Claus Gravenhorst

Director Strategic Initiatives

CCS Content Conversion Specialists GmbH

Weidestr. 134

22083 Hamburg

Germany

[email protected]

www.content-conversion.com