the australian newspapers digitisation program: development of the newspapers content management...

Australian Newspapers Digitisation Program: Development of the Newspapers Content Management System. Rose Holley – ANDP Manager. ANPlan/ANDP Workshop, 28 November 2008

Upload: rose-holley

Post on 05-Dec-2014

Page 1: The Australian Newspapers Digitisation Program: Development of the Newspapers Content Management System. Nov 2008

Australian Newspapers Digitisation Program

Development of the Newspapers Content Management System

Rose Holley – ANDP Manager

ANPlan/ANDP Workshop, 28 November 2008

Page 2:

Requirements

Manage, store and organise millions of digital newspaper pages behind the scenes.
Manage the entire digitisation workflow from scanning to public delivery.

Page 3:

How?

Current NLA Digital Content Management System cannot cope with volume of digital newspapers or complex structure of newspapers
No ‘off the shelf’ product available that meets requirements
Need the system now (March 2007)

Page 4:

Solution

NLA team to develop a software solution
Ensure the system uses open source software
System to be standalone and not bolted into other systems
Possibility of sharing system in future/providing as open source to other libraries

Page 5:

Software Development

Agile method of development used
Modules designed in stages as required
Stage 1 – Receipt and checking of scanned images
Stage 2 – Quality Assurance Modules
Stage 3 – Sending/receiving items from OCR
Stage 4 – System Administration and Statistics
Stage 5 – Interface Design and Usability of System

Page 6:

Progress

Software development March 2007 – June 2008
First module in use May 2007
CMS in use for 18 months
CMS in final stages of completion (Jan – June 2009)
Further development required to enable acceptance of contributors' content
Simple user interface yet to be designed

Page 7:

Page 8:

Australian Newspapers CMS

Screenshots of the system follow, with an explanation of the workflows.

Page 9:

Workflow Summary

Preparing for Digitisation
Creation of digital images
Adding metadata and Quality Assurance
Optical Character Recognition
Quality Assurance
Statistics and Admin

Page 10:

Preparing for Digitisation

Identify title to be digitised
Source master microfilm from owner
Send master microfilm to scanning contractors
Add title to Content Management System

Page 11:

CMS - Add Title

Page 12:

Microfilm converted to digital images

Page 13:

Image Reception

Images received from scanning contractor on LTO2 Tape
Tapes added to tape robot and extracted
Reels automatically added to Content Management System
Reel details are checked
Images ingested into Content Management System

Page 14:

CMS - Check Reel Details

Page 15:

CMS - Ingest Reels

Page 16:

CMS - Tasks 1 and 2

Task 1 – Add metadata (dates and page numbers)
Supervisor reviews marked pages
Task 2 – Define batches
Task 2 – Resolve duplicates
Task 2 – Create missing page targets

Page 17:

Identify title to be worked on

Page 18:

Identify reel

Page 19:

CMS - Adding Metadata

Date and Page Sequence number added

Page 20:

Supervisor Review

Supervisor reviews pages marked for attention

Page 21:

CMS - Define Batches

Batches defined by date
Each batch contains 2,000-3,000 images
Batches are automatically assigned a number
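The batching rules on this slide can be sketched in a few lines of Python. This is only an illustration, not the CMS's actual code: the function name `define_batches`, the `(issue_date, image_name)` tuple layout, the 3,000-image cap, and the choice never to split one date across two batches are all assumptions.

```python
from datetime import date
from itertools import groupby

MAX_BATCH_SIZE = 3000  # assumed upper bound, taken from the slide's image count


def define_batches(pages, max_size=MAX_BATCH_SIZE, first_number=1):
    """Group (issue_date, image_name) pairs into numbered, date-ordered batches.

    All images for one date stay together (an assumption); a new batch is
    started when adding the next date would exceed max_size.
    """
    batches = []
    current = []
    for _, issue in groupby(sorted(pages), key=lambda p: p[0]):
        issue = list(issue)
        if current and len(current) + len(issue) > max_size:
            batches.append(current)
            current = []
        current.extend(issue)
    if current:
        batches.append(current)
    # Batch numbers are assigned automatically, in date order.
    return {first_number + i: batch for i, batch in enumerate(batches)}
```

Calling `define_batches(pages)` returns a dictionary keyed by batch number, so each batch covers a contiguous date range.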

Page 22:

CMS - Resolve Duplicates

Duplicate pages compared and the best copy is selected

Page 23:

Missing Pages

Missing page targets are generated
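The slides do not say how missing pages are detected. One plausible approach, assuming gaps in the page sequence numbers added in Task 1 mark the missing pages, is sketched below; the function name `missing_page_numbers` is hypothetical.

```python
def missing_page_numbers(page_numbers):
    """Return page sequence numbers absent between 1 and the highest number seen.

    Assumes pages in an issue are numbered from 1; pages missing after the
    last scanned page cannot be detected this way.
    """
    seen = set(page_numbers)
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]
```

Each number returned would correspond to one missing page target to create for that issue.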

Page 24:

Optical Character Recognition (OCR)

Complete batches are added to a tape
Tapes are generated and written
Tapes sent to OCR contractor
Contractor completes OCR processes
OCR data (not images) is returned via FTP

Page 25:

CMS - Tapes Created

Completed batches added to a tape

Page 26:

Optical Character Recognition (OCR) of pages and article zoning

Page 27:

OCR Data Reception (Automated process)

OCR contractor advises NLA server that a batch has been completed
NLA server downloads the batch
Batch is ingested into Content Management System
Checks are performed on data validity
QA Derivatives are generated
Articles may now be searched, but are not yet publicly accessible
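The automated reception sequence above could be modelled as an ordered list of steps that stops at the first failure. This is a rough sketch only: the step names, the `receive_batch` function and the handler interface are invented for illustration and are not taken from the NLA system.

```python
from typing import Callable, Dict, List

# Step names are assumptions that mirror the slide's bullet order.
STEPS: List[str] = [
    "download",
    "ingest",
    "validate",
    "generate_qa_derivatives",
    "index_for_internal_search",
]


def receive_batch(batch_id: str, handlers: Dict[str, Callable[[str], bool]]) -> List[str]:
    """Run each reception step in order and return the steps that completed.

    Processing stops at the first handler that returns False, so a batch that
    fails validation never reaches the internal search index.
    """
    completed = []
    for step in STEPS:
        if not handlers[step](batch_id):
            break
        completed.append(step)
    return completed
```

A batch is ready for the QA stage only when the returned list equals `STEPS`.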

Page 28:

CMS - Batch information

Page 29:

Quality Assurance (QA)

A random sample of Issues and Articles is checked
Volume and Issue numbers are checked for accuracy
Sample articles are checked against agreed Quality Acceptance Criteria (QAC)
Error rates calculated against QAC on the fly
Supervisor checks final results
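The sampling and error-rate steps can be illustrated with a short sketch. The sample size, the acceptance threshold and the function names here are assumptions; the slides do not state the actual QAC values.

```python
import random


def sample_articles(article_ids, sample_size, seed=None):
    """Pick a random sample of article IDs without replacement."""
    rng = random.Random(seed)
    ids = list(article_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def error_rate(failures):
    """Fraction of checked articles that failed; `failures` is a list of booleans."""
    return sum(failures) / len(failures) if failures else 0.0


def batch_passes(failures, max_error_rate):
    """True when the observed error rate is within the agreed (assumed) threshold."""
    return error_rate(failures) <= max_error_rate
```

Because `error_rate` works on whatever has been checked so far, it can be recomputed after each article, which is one way to read "on the fly".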

Page 30:

CMS - Selecting the batch

Page 31:

Volume & Issue Number Check

Page 32:

Article checked against QAC

Page 33:

Re-keyed fields checked for accuracy

Page 34:

Supervisor checks results (auto or manual accept/reject)

Page 35:

QA Results

Automated email sent to supplier advising the result
Emails for rejected batches include a summary of errors
Summary of errors saved for all batches
Accepted batches are immediately accessible in public search system
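A result notification of this kind could be assembled with Python's standard `email` package. The sketch below only builds the message and does not send it; the wording, the `build_result_email` name and the example address are assumptions.

```python
from email.message import EmailMessage


def build_result_email(batch_number, accepted, errors, supplier_address):
    """Build (but do not send) the QA result email for one batch.

    The error summary is included only for rejected batches, as on the slide.
    """
    status = "accepted" if accepted else "rejected"
    msg = EmailMessage()
    msg["To"] = supplier_address
    msg["Subject"] = f"Batch {batch_number} {status}"
    lines = [f"Batch {batch_number} has been {status}."]
    if not accepted:
        lines.append("Summary of errors:")
        lines.extend(f"- {error}" for error in errors)
    msg.set_content("\n".join(lines))
    return msg
```

The returned message could then be handed to `smtplib.SMTP.send_message` or a similar transport.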

Page 36:

Batch History and details retained

Page 37:

Page 38:

Search or Browse articles within CMS

Page 39:

Statistics

Stats for content received, QA’d and delivered to the public are generated by the Content Management System
(Stats for usage of public search system are collected using Google Analytics)

Page 40:

CMS - Content Statistics

Page 41:

CMS - Work Statistics

Page 42:

Access

Public access to digital newspapers is provided through the Australian Newspapers Search and Delivery System
Users can search or browse newspapers
Search results can be refined using filters
Users can browse by Newspaper title or Date.

Page 43:

http://ndpbeta.nla.gov.au/ndp/del/home