Overview of the Europeana Newspapers Project

Europeana Newspapers Project: Overview London, 9 June 2014 Rossitza Atanassova, British Library @RossiAtanassova

Upload: europeana-newspapers

Post on 11-May-2015

DESCRIPTION

An overview of the Europeana Newspapers Project by Rossitza Atanassova, British Library. Presentation given at the Europeana Newspapers Information Day, held at the British Library on 9 June 2014.

TRANSCRIPT

Page 1: Overview of the Europeana Newspapers Project

Europeana Newspapers Project: Overview

London, 9 June 2014

Rossitza Atanassova, British Library

@RossiAtanassova

Page 2: Overview of the Europeana Newspapers Project

This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community

http://ec.europa.eu/ict_psp

In a nutshell

The Europeana Newspapers Project is a best practice network that aims to aggregate up to 18 million digitised historic newspaper pages from 12 European libraries and to significantly improve the discovery of three centuries of Europeana news articles and events with relevance to the whole of Europe.

In addition, 11 other libraries that have joined the network since the beginning of the project are contributing metadata.

Volume

Across European cultures

Sharing best practices

Improving discovery and access

Page 3: Overview of the Europeana Newspapers Project

Quick overview http://www.europeana-newspapers.eu/

• 3-year project ending January 2015

• Aggregate and make searchable up to 18 million historic newspaper pages from across Europe

• Provide access for Europeana via a dedicated content browser developed by The European Library

• Build tools to better assess the quality of newspaper digitisation in relation to level of detail, speed and costs

• Create best practice recommendations for newspaper metadata

• Grow the best practice network and actively engage users of digitised newspapers

Page 4: Overview of the Europeana Newspapers Project

18 Project Partners

12 content providers

2 networking partners

4 technology providers

1 aggregator

Page 5: Overview of the Europeana Newspapers Project

11 Associated Partners

Page 6: Overview of the Europeana Newspapers Project

Networking Partners

Page 7: Overview of the Europeana Newspapers Project

Workpackages

• WP1 Project coordination led by Staatsbibliothek zu Berlin (SBB)

• WP2 Refinement led by Koninklijke Bibliotheek (KB)

• WP3 Evaluation and quality assessment led by University of Salford (USAL)

• WP4 Aggregation and presentation led by The European Library (TEL)

• WP5 Metadata best practice recommendations led by University of Innsbruck (UIBK)

• WP6 Dissemination led by the Association of European Research Libraries (LIBER)

Page 8: Overview of the Europeana Newspapers Project

WP2 Refinement of digitised newspapers

• Analyse and select digitised newspaper content for refinement (public document available)

• Define digitisation requirements and minimum quality of newspapers for advanced services in Europeana (public document available)

• Develop a workflow for the refinement procedure as part of the aggregation process, to co-ordinate refinement of selected content (full text, structural enrichment, named entity recognition) (ongoing)

• Provide recommendations on best practice for refinement of digitised newspaper collections with full text (due Jan 2015)

Page 9: Overview of the Europeana Newspapers Project

WP3 Evaluation and Quality Assurance

The Europeana Newspapers project will help by developing an evaluation and quality-assessment infrastructure for newspaper digitisation. It will establish accepted baselines for accuracy in relation to the level of detail, speed of digitisation and costs. This will in turn help experts to assess different methods of newspaper digitisation and pick the one that gives the best result.

Page 10: Overview of the Europeana Newspapers Project

WP4 Aggregation and presentation of digitised content for Europeana

• Aggregate content and develop a search interface

• Over 2.5 million pages ingested with content from 8 project partners and metadata from 2 associated partners

• Access via TEL prototype browser: http://www.theeuropeanlibrary.org/tel4/newspapers

• Usability testing and improving functionality: http://www.europeana-newspapers.eu/functionality-newspaper-browser/

Page 11: Overview of the Europeana Newspapers Project

WP5 Metadata best practice for newspapers

• Gather and analyse the metadata models currently in use at libraries for the digitisation of newspapers

• Design and release a comprehensive metadata model based on de-facto standards such as METS, MODS, MARC, ALTO (due Jan 2015)

• Prepare an online resource containing the rules on how to apply the format and how to use it within a digitisation project

• Tool to enrich the newspapers METS/ALTO profile with structural metadata

• IFLA Newspapers Section Pre-Conference (Geneva, 13th-14th August 2014). Title: Structural metadata – a Key for Indexing Digitized Newspapers
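
To illustrate the kind of data an ALTO-based profile carries, here is a minimal Python sketch that reads the recognised words out of an ALTO-style file. The sample document is a hand-written, hypothetical fragment and not project data; the namespace URI is only illustrative, so the code matches element names without it. ALTO stores each recognised word in the CONTENT attribute of a String element inside a TextLine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO-style fragment (an assumption for
# illustration only; it is not taken from the project's data set).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Europeana"/>
            <SP/>
            <String CONTENT="Newspapers"/>
          </TextLine>
          <TextLine>
            <String CONTENT="London"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the text of every TextLine, with words joined by single spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


for line in alto_lines(SAMPLE_ALTO):
    print(line)
```

Matching on local names rather than a fixed namespace is a deliberate simplification so the sketch does not depend on a particular ALTO version.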

Page 12: Overview of the Europeana Newspapers Project

Europeana Newspapers data set

• About 18 million pages provided by 12 partners

• 17th to 20th century material

• 20 different languages

• Over 8 million pages refined through Optical Character Recognition done by UIBK

• Over 2 million pages refined through Optical Layout Recognition done by Content Conversion Specialists (CCS)

• Subset refined with Named Entity Recognition (NER)

• www.europeana-newspapers.eu/wp-content/uploads/2012/04/D-2-1_Dataset_for_refinement.pdf

Page 13: Overview of the Europeana Newspapers Project

Selection Criteria

• Selection done by the participating libraries, taking into account physical condition, demand and copyright

• Free from restrictions, metadata with CC-0 license required for Europeana

• Relevance to end-users – libraries’ own criteria

• Digitisation quality – high resolution uncompressed master images required for the refinement process

• Document characteristics – condition, language, layout, font

• Technical considerations – file formats and metadata standards

Page 14: Overview of the Europeana Newspapers Project

Europeana Newspapers data set

Page 15: Overview of the Europeana Newspapers Project

Volume

Page 16: Overview of the Europeana Newspapers Project

Languages

Page 17: Overview of the Europeana Newspapers Project

Font type for the top 10 languages

Page 18: Overview of the Europeana Newspapers Project

Timeframe

Page 19: Overview of the Europeana Newspapers Project

Workflow for refinement

Page 20: Overview of the Europeana Newspapers Project

Tool support for libraries

1. BCT: Binarisation and Colour Reduction Tool

Page 21: Overview of the Europeana Newspapers Project

Tool support for libraries

1. BCT: Binarisation and Colour Reduction Tool

2. FRT: File Rename Tool

Page 22: Overview of the Europeana Newspapers Project

Tool support for content providing libraries

1. BCT: Binarisation and Colour Reduction Tool

2. FRT: File Rename Tool

3. FAT: File Analyzer Tool
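
The slides name the tools without describing how they work. As a rough illustration of what binarisation means in general, the Python sketch below applies a fixed global threshold to a greyscale image held as nested lists. This is an assumption for illustration only: the function name, the threshold of 128 and the technique itself are not taken from the BCT, which may use a different (for example adaptive) method.

```python
def binarise(pixels, threshold=128):
    """Map each greyscale value (0-255) to 0 (black) or 255 (white).

    Fixed global thresholding; a hypothetical stand-in, not the BCT's method.
    """
    return [[255 if value >= threshold else 0 for value in row]
            for row in pixels]


# A tiny 2x4 "image": dark ink on a lighter, unevenly scanned background.
page = [
    [12, 40, 200, 230],
    [90, 127, 128, 255],
]

for row in binarise(page):
    print(row)
```

Reducing a scan to two levels in this way shrinks the file and can make text easier for OCR to separate from the background, at the cost of losing tonal detail.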

Page 23: Overview of the Europeana Newspapers Project

General status of refinement (as of June 2014)

• 10,355,614 pages in total for refinement

• 7,776,277 pages processed so far

• 2,579,337 pages remaining

• About 75% completed!

• The rest to be completed by November 2014

Page 24: Overview of the Europeana Newspapers Project

Status of refinement OCR and OLR

• 8,149,263 pages in total to be OCR’ed

• 6,844,975 pages OCR’ed so far

• Technology: ABBYY FineReader SDK

• 2,206,351 pages to be OLR’ed

• 1,947,477 pages OLR’ed so far

• Technology: docWorks
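
The progress behind these figures can be worked out with a few lines of Python. This is only a sketch of the arithmetic: the page counts are copied from the slides, while the function and variable names are made up for the example. It also checks that the OCR and OLR targets together add up to the total of 10,355,614 pages quoted for refinement as a whole.

```python
def percent_done(done, total):
    """Share of pages already processed, as a percentage (one decimal place)."""
    return round(100 * done / total, 1)


# Page counts as given on the slides (status as of June 2014).
OCR_TOTAL, OCR_DONE = 8_149_263, 6_844_975
OLR_TOTAL, OLR_DONE = 2_206_351, 1_947_477
REFINEMENT_TOTAL = 10_355_614

print("OCR progress (%):", percent_done(OCR_DONE, OCR_TOTAL))
print("OLR progress (%):", percent_done(OLR_DONE, OLR_TOTAL))
print("OCR pages remaining:", OCR_TOTAL - OCR_DONE)
print("OLR pages remaining:", OLR_TOTAL - OLR_DONE)
print("Targets match overall total:", OCR_TOTAL + OLR_TOTAL == REFINEMENT_TOTAL)
```

On these numbers OCR is roughly 84% complete and OLR roughly 88% complete.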

Page 25: Overview of the Europeana Newspapers Project

Status of refinement NER

Status of languages

• software & trained model delivered, processing data, refining model

• software & trained model delivered, processing data

• data preparation done, training started

• data preparation done, training not yet started

Status of software Collaborations:

• NER attestation tool available

http://kbresearch.dyndns.org/eunews/

• NER training data available (NL):

http://kbresearch.dyndns.org/eunews/data/

• NER tagging tool available (open source)

https://github.com/KBNLresearch/europeananp-ner

• New output format: ALTO 2.1

Page 26: Overview of the Europeana Newspapers Project

NER

http://www.slideshare.net/Europeana_Newspapers

Page 27: Overview of the Europeana Newspapers Project

WP6 Dissemination

RAISE AWARENESS by sharing our goals, results and achievements as widely as possible.

EXPAND OUR NETWORK of content providers, technology producers and other stakeholder groups.

Nationaal Archief: http://www.flickr.com/photos/29998366@N02/3280639091

Page 28: Overview of the Europeana Newspapers Project

Europeana Newspapers website

• Blog: project news, partner features, thematic articles

• Interviews with researchers

• Europeana Newspapers browser updates

• Highlight new content

• Promotional materials

• Project publications and presentations

• Events calendar

Page 29: Overview of the Europeana Newspapers Project

Engage the researcher

Page 30: Overview of the Europeana Newspapers Project

What researchers value

“I see enormous value in an archive that breaks down national boundaries automatically, where I can search for content from a range of countries.” – Bob Nicholson

“The difference lies not just in access but in the conversion of a massive amount of print into a searchable resource … This holds the potential to make connections across newspapers in ways previously unimaginable.” – Matt Rubery

Page 31: Overview of the Europeana Newspapers Project

Audiences

Policy makers

European Library community

International library community

Research community

EC projects

Museums, archives

Technology experts

Teachers

Publishers

Through information days, workshops, conferences and media communication:

Page 32: Overview of the Europeana Newspapers Project

Workshop on refinement & QA

• 13 – 14 June, University Library Belgrade

• Blog: http://www.europeana-newspapers.eu/focus-on-newspaper-refinement-quality-assessment-in-belgrade/

Page 33: Overview of the Europeana Newspapers Project

Final workshop “Newspapers in Europe & the Digital Agenda for Europe”

Goal: Produce a roadmap for improving access to digital newspapers for policy makers

Aimed at: policy makers, researchers, librarians, cultural heritage professionals and newspaper publishers.

British Library, 29-30 September 2014

BiblioArchives: http://www.flickr.com/photos/lac-bac/7639138098/

Page 34: Overview of the Europeana Newspapers Project

Closing week

Promoting the newspaper browser with end-users.

1-5 DECEMBER 2014

• One week of promotional events

• Live browser demos, press articles, coordinated social media activity

• All partner libraries will participate

Page 35: Overview of the Europeana Newspapers Project

Animation

Page 36: Overview of the Europeana Newspapers Project

Thank you

For more information visit

http://www.europeana-newspapers.eu/