Getting ALL Your Newspaper Data into CONTENTdm: The New Flex Loader CONTENTdm Western Users Group June 3, 2010

Page 1:

Getting ALL Your Newspaper Data into CONTENTdm:
The New Flex Loader

CONTENTdm Western Users Group
June 3, 2010

Page 2:

digitalnewspapers.org

Page 3:

UDN Overview
• Run by J. Willard Marriott Library
• Entirely a “soft money” program
  – Raised $3.5 million in local, state, federal funds
• Launched in December 2002
  – 3 titles, 30K pages
  – On CONTENTdm (version 2)
• Current holdings
  – 60 titles / 89K issues / 960K pages / 10.6 million articles
    o One-millionth page next month
  – 27 of 29 Utah counties represented
    o Can’t find a long newspaper run from either Wayne or Daggett County
  – Covering 1850 (Deseret News) – 1982 (Vernal Express)
• Participant in NEH’s National Digital Newspaper Program
  – Charter member since 2005

Page 4:

Constants For 8 Years
We have always had
  – Article-level metadata
    o Headlines, article type, full text
      Headlines and article classification captured manually (overseas)
      Full text generated by OCR software
  – A 3-tiered compound object structure
    o Issue / Page / Article
      In the early days, we also had text in the page items, but we removed it because searches returned “double” hits
  – Images of both full pages and individual articles
    o Very nice for viewing but bad for the database
      Millions of PDF files
  – CONTENTdm
    o UDN is the largest Cdm server
      11.6 million items
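
The Issue / Page / Article structure above can be sketched as three nested record types. This is only an illustration of the idea, not CONTENTdm's internal model: the class names, field names, and sample values are all assumptions. Text lives only on articles, which mirrors the decision to drop page-level text so that a search returns each match once.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Article:
    headline: str       # keyed manually in the real workflow
    article_type: str   # e.g. "news", "advertisement"
    full_text: str      # generated by OCR in the real workflow


@dataclass
class Page:
    number: int
    # Pages hold no text of their own, only their articles.
    articles: List[Article] = field(default_factory=list)


@dataclass
class Issue:
    title: str
    date: str
    pages: List[Page] = field(default_factory=list)

    def article_count(self) -> int:
        """Total number of articles across all pages of the issue."""
        return sum(len(page.articles) for page in self.pages)

    def search(self, term: str) -> List[Article]:
        """Case-insensitive search over headlines and full text."""
        needle = term.lower()
        return [
            article
            for page in self.pages
            for article in page.articles
            if needle in article.headline.lower()
            or needle in article.full_text.lower()
        ]


# Made-up sample data, for illustration only.
sample_issue = Issue(
    title="Example Gazette",
    date="1900-01-01",
    pages=[
        Page(1, [
            Article("Flood Hits Town", "news", "The river rose overnight."),
            Article("Store Sale", "advertisement", "Boots at half price."),
        ]),
    ],
)
river_hits = sample_issue.search("river")
```

Because only articles carry text, `river_hits` contains one entry per matching article rather than one for the article and another for its page.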

Page 5:

Article-Level
• Question: Is article-level metadata worth the time and expense to create?
  – We (UDN) believe it delivers a higher-quality user experience
    o Pay extra ($0.30/pg.) to have it
  – NDNP (NEH / LC) believes cost outweighs the benefit
• Headlines, sub-headings
  – Keyed manually, nearly 100% accurate
    o Double-keyed and reconciled
  – Contain important keywords
  – Search accuracy is more critical as newspaper databases grow
• Article types
  – Mastheads, advertisements, news
  – Birth, marriage, death announcements
    o Genealogical info is especially important to our users
      62% visit for genealogy
• Full text – generated from OCR

Page 6:

Indexing Newspaper Content
• In the early days, our newspapers wouldn’t ingest using the Acquisition Station (now the Project Client)
  – Complex metadata
  – 3-tier compound objects
• In 2002 DiMeMa (now OCLC) developed specialized software to import our newspaper content
  – The “Indexer”
  – Allowed us to use the Cdm platform for rapid expansion of newspapers
    o Eventually purchased 2nd license (unlimited) just for UDN

Page 7:

Some History
• 2002-2004: DiMeMa indexed UDN content with the Indexer
  – Delivered Cdm-ready files to us
    o $0.15 per page
  – We loaded into Version 3
• Over time became problematic
  – DiMeMa was a software company, not in the “production” business
  – We wanted to run it ourselves and reduce our costs
• 2005: DiMeMa gave us the Indexer software
  – One caveat – it was never made “production-ready”
    o Only an internal operation-type software
    o “Rough around the edges”
• Continued running the Indexer for newspaper content ever since
  – It’s a workhorse
    o Processed 1 million pages
    o But the old, gray mare…

Page 8:

V3 Indexer – Today
• Runs on Windows 2000 server
  – Microsoft is dropping support next month
• OCLC no longer provides support for it
  – Cannot enhance the functionality
  – Cannot install a 2nd server; only running one instance
• Slow and complex
  – Major bottleneck in the process
  – Error messages are difficult to understand
• Command line indexing
  – Web indexing times out
• V3 indexing fails when collection gets too big
  – Requires entire collection to be re-indexed
• Afterwards
  – Correct some metadata and add other metadata
  – Convert to V4

Page 9:

Data Formats
• Receive data in 2 distinctly different formats
  – In-state projects
    o Cdm V3 format for the Indexer
    o Ingest into Cdm V4
      Very long, complex process
    o Migrating to Cdm V5 will add additional steps
  – NDNP
    o NDNP METS/ALTO format
    o Send batches to LC as required
    o Cannot ingest into UDN
      Indexer cannot process NDNP batches
  – iArchives has to support both formats

Page 10:

Recent Developments
• NEH’s National Digital Newspaper Program (NDNP)
  – Newspaper programs are launching all over the country
    o More than 22 states now participate at the national level
  – NDNP standards are rapidly becoming “the” standard
  – Does not provide funding for article-level processing
    o Major barrier for some states to implement article-level
  – NDNP spec requires article coordinates, however
    o For highlighting in the viewer
• JPEG2000
  – Tiles enable online viewers to make a smaller “clip” from a larger image
    o e.g., a newspaper article can be clipped out of a newspaper page
  – We no longer need separate article images
    o Although we still need article metadata
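
The clipping idea can be sketched as follows: given the coordinate zones of an article, compute the rectangle that encloses them, then work out which image tiles a viewer would need to fetch to show just that rectangle. This is a minimal sketch under stated assumptions, not how any particular JPEG2000 viewer works. The box layout `(left, top, right, bottom)`, the function names, the 64-pixel tile size, and the sample zones are all hypothetical.

```python
from typing import Iterable, List, Tuple

# (left, top, right, bottom) in page pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def bounding_box(zones: Iterable[Box]) -> Box:
    """Smallest rectangle that encloses every zone of one article."""
    zones = list(zones)
    if not zones:
        raise ValueError("an article needs at least one zone")
    lefts, tops, rights, bottoms = zip(*zones)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def tiles_for_box(box: Box, tile_size: int) -> List[Tuple[int, int]]:
    """(column, row) indices of the square tiles that overlap the box."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    left, top, right, bottom = box
    first_col, last_col = left // tile_size, (right - 1) // tile_size
    first_row, last_row = top // tile_size, (bottom - 1) // tile_size
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# Made-up zones for an article that wraps across two columns.
article_zones = [(10, 20, 50, 60), (40, 50, 90, 120)]
clip = bounding_box(article_zones)
needed_tiles = tiles_for_box(clip, tile_size=64)
```

Only the tiles in `needed_tiles` have to be decoded to display the clip, which is why a single tiled page image can stand in for separate article images.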

Page 11:

Dilemma
How do we:
• Get new, fully supported ingestion software for newspapers
• Migrate to Cdm V5
• Continue “full” article-level metadata
• Receive only one file format (NDNP)
  – Move away from the Cdm V3 file deliveries

Page 12:

Bridging the Gap
• Idea – extend the NDNP xml spec to include article data
  – Create a new, separate xml file for articles
• We, OCLC, and iArchives developed spec for article xml files
  – Similar xml formatting as NDNP
  – Included “on the side” with NDNP batch files
  – Deliverables now are
    o Standard NDNP batch
    o Article.xml file for each issue in the batch containing the article metadata
  – iArchives has provided a script that collates each article xml file into its respective issue folder
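
The collation step can be pictured with a short sketch. The iArchives script itself is not shown in these slides, so everything here is an assumption made for illustration: that the article files arrive in one side folder, that each is named `<issue-folder-name>_articles.xml`, that issue folder names are unique within the batch, and that the file is stored as `Article.xml` inside the issue folder.

```python
import shutil
from pathlib import Path
from typing import List

SUFFIX = "_articles.xml"  # hypothetical naming convention for side files


def collate_article_xml(side_dir: Path, batch_dir: Path) -> List[Path]:
    """Move each article xml file into the issue folder it belongs to.

    Returns the side files for which no issue folder was found, so they
    can be reviewed instead of being silently dropped.
    """
    # Index every folder in the batch by name (assumes names are unique).
    issue_dirs = {p.name: p for p in batch_dir.rglob("*") if p.is_dir()}

    unmatched = []
    for xml_file in sorted(side_dir.glob("*" + SUFFIX)):
        issue_name = xml_file.name[: -len(SUFFIX)]
        target_dir = issue_dirs.get(issue_name)
        if target_dir is None:
            unmatched.append(xml_file)
            continue
        shutil.move(str(xml_file), str(target_dir / "Article.xml"))
    return unmatched
```

Returning the unmatched files rather than raising on the first miss lets one run report every delivery problem in a batch.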

Page 13:

Old Article Metadata Now at Page-Level
• Full text
  – Per NDNP, searchable text is stored at the page-level
    o A part of page metadata
    o Each word and each article has its own coordinates
      Can be highlighted by the viewer in the page image
• Article images no longer required
  – 12 PDFs per page (on average) replaced by single, higher-res jp2 of full page
  – Reduces the number of image files by 90%
    o Although actual file space is increasing
      12 PDFs ~ 2 MB / 1 jp2 ~ 4 MB
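
The trade-off above can be checked with simple arithmetic. This sketch applies the per-page averages from this slide to the 960K-page holdings figure from the overview; treating every page as average is an assumption, and 1 TB is taken as 1,000,000 MB.

```python
PAGES = 960_000      # current holdings, from the overview slide
PDFS_PER_PAGE = 12   # average number of article PDFs per page
PDF_SET_MB = 2       # ~2 MB for the 12 PDFs of one page
JP2_MB = 4           # ~4 MB for one full-page jp2
MB_PER_TB = 1_000_000

old_files = PAGES * PDFS_PER_PAGE          # article PDFs
new_files = PAGES                          # one jp2 per page
file_reduction = 1 - new_files / old_files

old_storage_tb = PAGES * PDF_SET_MB / MB_PER_TB
new_storage_tb = PAGES * JP2_MB / MB_PER_TB

print(f"files: {old_files:,} -> {new_files:,} ({file_reduction:.1%} fewer)")
print(f"storage: {old_storage_tb:.2f} TB -> {new_storage_tb:.2f} TB")
```

Going from 12 files to 1 per page removes 11 of every 12 files, which the slide rounds to 90%, while storage roughly doubles.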

Page 14:

New Cdm Flex Loader
Combines the benefits of article-level metadata with page-level processing and file structure
• Processes standard NDNP batch with article xml files in each issue folder
  – Supports article xml created by either iArchives or CCS
• Loads directly into Cdm V5
  – Approve and index like any other collection
• Compound object contains
  – Issue / compound object metadata
  – Page images and metadata within each issue
  – Article metadata is stored internally in article xml files
    o In the collection’s “supp” folder
• Will have article xml search-ability in a later release
  – Scheduled for later this year

Page 15:

New Features
• Will be a standard Cdm release with full product support
  – No-cost extension of the software
• “Extendable” beyond newspapers
  – Can support any content with similar xml structure
• Very small client application
  – Most processing done by connecting to a “web service”
• Nice user interface
  – Tabs for entry and mapping of metadata
    o Highly configurable
• Can process TIFFs or jp2s
  – jp2s recommended for speed
    o Remember: newspaper processing is voluminous and speed can be very important

Page 16:

Features - 2
• Loads into Approval Queue
  – Per normal for any collection
• Pretty fast (in beta testing)
  – 45 issues in an hour on my (slow) desktop
  – Should be able to speed this up
• Can continue to load into existing collection
  – Eliminates need to create many separate collections and merge them together later
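
To get a feel for what the beta-test rate means at newspaper scale, here is a rough estimate. It assumes the single observed rate of 45 issues per hour holds steadily, and it uses the 89K-issue holdings figure purely as an example workload; the slides do not say that the whole collection would be reloaded.

```python
def hours_to_load(issues: int, issues_per_hour: float) -> float:
    """Hours needed to load a number of issues at a steady rate."""
    if issues_per_hour <= 0:
        raise ValueError("issues_per_hour must be positive")
    return issues / issues_per_hour


BETA_RATE = 45             # issues per hour, observed on one slow desktop
EXAMPLE_WORKLOAD = 89_000  # issues, borrowed from the holdings figure

hours = hours_to_load(EXAMPLE_WORKLOAD, BETA_RATE)
days_nonstop = hours / 24

print(f"{hours:,.0f} hours, or about {days_nonstop:.0f} days running non-stop")
```

At this rate a workload of that size would take months of continuous running on one machine, which is why loader speed matters for newspaper collections.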

Page 17:

(demo of loader)

Page 18:

Questions?

John Herbert

Head, Digital Technologies

J. Willard Marriott Library

[email protected]

(801) 585-6019
