

Microfilm, Paper, and OCR: Issues in Newspaper Digitization

The Utah Digital Newspapers Program

by Kenning Arlitsch and John Herbert

Kenning Arlitsch and John Herbert are both at the J. Willard Marriott Library, University of Utah. Mr. Arlitsch (kenning.arlitsch@library.utah.edu - 295 S. 1500 East, Room 463, Salt Lake City, UT 84112) is Head of Information Technology, and Mr. Herbert ([email protected] - 295 S. 1500 East, Room 418, Salt Lake City, UT 84112) is Program Director - Utah Digital Newspapers. They would like to gratefully acknowledge the contributions of Scott Christensen and Frederick Zarndt of iArchives Inc., and of Randy Silverman, Preservation Librarian at the Marriott Library, in the preparation of this manuscript.

History of the UDN Program

The Marriott Library at the University of Utah (U of U) has a long history of large-scale newspaper projects, beginning with the National Endowment for the Humanities' United States Newspapers Program (USNP) in the 1980s, in which the Library led the effort to catalog and microfilm Utah newspapers. This involvement continues today with the Utah Digital Newspapers (UDN) program, which is digitizing historic Utah newspapers and making them searchable and available on the Internet.

UDN's Grant History: 2002-2004[1]

With the first of three Library Services and Technology Act (LSTA) grants, the Marriott Library digitized 30 years of three weekly newspapers in 2002. During this first phase of the program, the newspaper digitization process was developed and the UDN website (http://digitalnewspapers.org) was launched with some 30,000 total pages.

A second LSTA grant, which ran from January to September 2003, digitized 106,000 new pages, effectively quadrupling the collection. The grant also funded a project director to run day-to-day operations and secure ongoing funding, and funded a publicity campaign to ensure broad knowledge of the program across the state.

Kenning Arlitsch and John Herbert Microform & Imaging Review

In September 2003, the program was awarded a $1 million grant by the Institute of Museum and Library Services (IMLS), an independent federal agency, to continue for another two years. IMLS is providing $470,000, with the U of U and Brigham Young University (BYU) providing matching funds of $450,000 and $100,000 respectively. With this grant, the program will digitize 264,000 newspaper pages, with portions distributed to other sites, namely BYU and Utah State University (USU). The metadata (including searchable full text) from these sites will be harvested and combined with the metadata from the U of U's collection. This will present a combined, or aggregated, collection to readers so they can search the entire collection at once, regardless of where the data is located. Another major goal of the grant is to administer a training program for other academic and historical institutions in the West, providing information on launching a digital newspapers program, managing the digitization process, and writing compelling grant proposals.
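The aggregation described above can be pictured with a minimal sketch. It is an illustration only: the record fields (`site`, `page_id`, `text`), the sample records, and the functions `aggregate` and `search` are hypothetical, and nothing here is meant to describe the harvesting protocol or software the program actually uses.

```python
# Hypothetical sketch: combine metadata harvested from several sites into one
# searchable collection. Field names and sample records are made up.

def aggregate(*site_records):
    """Merge the record lists harvested from each site into one list."""
    combined = []
    for records in site_records:
        combined.extend(records)
    return combined


def search(collection, term):
    """Return records whose full text contains the term (case-insensitive)."""
    needle = term.lower()
    return [rec for rec in collection if needle in rec["text"].lower()]


# Made-up records standing in for harvested metadata with full text.
u_of_u = [{"site": "U of U", "page_id": "uu-0001", "text": "Railroad reaches the valley"}]
byu = [{"site": "BYU", "page_id": "byu-0001", "text": "New railroad depot opens"}]
usu = [{"site": "USU", "page_id": "usu-0001", "text": "County fair results"}]

collection = aggregate(u_of_u, byu, usu)
hits = search(collection, "railroad")
print([rec["site"] for rec in hits])
```

The reader issues one query and receives matches from every site; each record keeps its `site` value, so the location of the underlying data stays visible without affecting the search.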

In March 2004, the program was awarded a third LSTA grant to digitize 10,000 pages from each of five specific Utah newspapers in five different counties. In administering this grant, the Utah State Library is providing $74,000, with matching funds of $25,000 raised locally, $5,000 each from public libraries in the five newspaper communities. These matching funds in particular show that the program has substantial grassroots support in local communities throughout the state. By the time the two current grants expire in September 2005, the program should have 450,000 newspaper pages digitized.

Impact of the Program

As the program has grown during the past three years, it has had an increasing impact on Utahans. Monthly website usage increased fivefold from June 2003 to March 2004.[2] Numerous emails and phone calls have been received from patrons who either want more information about the program or are willing to support it in some way. What the program has done, at a very high level, is break down the traditional barriers between a major university and the general citizenry. Not only is the program telling the unique story of Utah's history to the world via the Internet, it is also helping to create a new generation of "citizen historians" who are experiencing Utah history more easily and effectively than ever before.

Digitizing Microfilm

The first newspapers digitized by the UDN were scanned from microfilm. After decades of independent newspaper microfilm creation and USNP participation, the U of U's newspaper microfilm was clearly the most complete and accessible source for scanning. Many newspaper originals were destroyed following filming, so the expectation was that paper would be difficult to locate.[3] But problems with the quality and availability of our microfilm caused us to pursue print archives; during 2003, 65% of the 106,000 pages were digitized from paper.

Service Bureaus

Libraries have long used service bureaus to convert their documents to microfilm, and the quality of work performed by these bureaus can have repercussions long after contracts are completed. Lockhart and Swartzell[4] conducted extensive tests on five vendors in the late 1980s, determining that while "all vendors met the basic technical standards ... each test batch had problems which would require detailed attention in project initiation."[5] In the UDN, such problems would have a significant impact.

The U of U began microfilming newspapers through a service bureau in 1948, and by the time the USNP was launched in 1983, "the Marriott Library had almost complete microfilm holdings for 30 years' worth of Utah Newspapers."[6] Some of that microfilm was digitized in 2002, and its defects affected the digitized images, both visually and for optical character recognition (OCR) processes. Uneven lighting plagued many of the newspapers. An image might go from an acceptable exposure on one side of the frame to a difference of one or two f-stops on the other side. Consistent focus across the frame was another challenge; letters were sharp on one side but sometimes more softly focused on the other. (This can easily occur when a copy-stand-mounted camera is not level.) Black smudges marred many frames, blocking out words or entire columns.
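To put the exposure falloff in numbers: one f-stop corresponds to a doubling or halving of light, so the difference between the two sides of a frame can be expressed as a base-2 logarithm of their brightness ratio. The sketch below is a rough illustration, not a procedure used by the UDN; the function name `stop_difference` and the sample frame are hypothetical, and it assumes pixel values proportional to light, which real scans generally are not without linearization.

```python
import math


def stop_difference(frame):
    """Estimate the exposure difference, in f-stops, between the left and
    right halves of a frame.

    `frame` is a list of rows of positive pixel values, assumed to be
    proportional to light (an assumption made for illustration).
    """
    width = len(frame[0])
    half = width // 2
    left = [value for row in frame for value in row[:half]]
    right = [value for row in frame for value in row[width - half:]]
    mean_left = sum(left) / len(left)
    mean_right = sum(right) / len(right)
    # One f-stop is a factor of two in light, hence the base-2 logarithm.
    return math.log2(mean_left / mean_right)


# Made-up 2 x 4 frame: the left half is four times brighter than the right.
frame = [
    [200, 200, 50, 50],
    [200, 200, 50, 50],
]
print(stop_difference(frame))
```

A result near zero indicates even lighting; a magnitude of one or two corresponds to the range of falloff described above.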

Most of these visual defects appear in the early years of the newspapers, leading us to conclude that these years were the first to be microfilmed and that service bureaus of the late 1940s had not yet perfected their techniques. There may also have been little or no quality control effort on the part of the U of U; recommendations from the American Library Association, RLG, and ANSI/AIIM for inspection of microfilm only became available in the 1990s.[7]

Vol.33 No. 2 i Microfilm, Paper, and OCR

Even the ownership of master reels can come into question with a service bureau. The Library's microfilm service bureau had changed ownership several times, and a misunderstanding resulted in the master reels being shipped out of state. The service bureau erroneously believed it had acquired the master reels as part of the purchase from the previous owner. During the 2002 processing, we were shocked to discover that the master reels were in Texas and that the service bureau refused to return them. The University reacted by contacting the Utah Attorney General's office, and after several months of correspondence, the reels were returned. The Library now uses storage and duplication services offered by BYU.

Physical Condition

The physical condition of microfilm can also affect scan and OCR quality. Cellulose acetate film, used widely through the 1970s before being replaced by stronger polyester, is known to tear.[8] Cellulose acetate is also prone to the same kind of "vinegar syndrome" chemical decomposition (though it is not as flammable) as the older cellulose nitrate base.[9] This decomposition leads to "buckling and shrinking, embrittlement, and bubbling,"[10] causing distortions in the image. Separately, chemical "redox blemishes," resulting from oxidative attack in less-than-ideal storage conditions, have been noted in microfilm throughout the country,[11] and have been seen in a few instances in film used by the UDN. These reddish spots adversely affect the quality of the scanned image.

Advantages of Microfilm

Despite the problems mentioned above, scanning newspapers from microfilm offers several distinct advantages:

• Inexpensive scanning. With the right equipment, microfilm can be scanned in an automated fashion, allowing an operator to load a reel of film and essentially walk away from the scanner. These scanners can cost $100,000, but the UDN experience shows that firms with this equipment can offer pricing at approximately $0.15/page.

• Low conservation costs. Whereas paper may require conservation treatment prior to scanning, microfilm is usually physically stable and requires no such treatment. Barring the physical problems described above, preparation costs are limited to making scanning copies from the master reels. Scanning from microfilm is best done from a clean copy, free of the defects found in service copies.

• Availability. Thanks to the USNP, newspaper microfilm collections are available and fairly complete in each state.
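The two price figures quoted for microfilm scanning invite a rough break-even comparison between buying a scanner and paying a vendor per page. The sketch below is a simplification: it uses only the $100,000 equipment figure and the $0.15/page vendor price from the text, and it ignores labor, maintenance, and software, none of which are quantified here. The function name `break_even_pages` is hypothetical.

```python
import math

SCANNER_COST = 100_000        # dollars; equipment figure quoted in the text
VENDOR_PRICE_PER_PAGE = 0.15  # dollars per page; vendor pricing from the text


def break_even_pages(equipment_cost, price_per_page):
    """Pages at which vendor fees alone would equal the equipment cost.

    Ignores labor, maintenance, and software (a deliberate simplification).
    """
    return math.ceil(equipment_cost / price_per_page)


pages = break_even_pages(SCANNER_COST, VENDOR_PRICE_PER_PAGE)
print(pages)
```

Under these assumptions the break-even point is roughly two-thirds of a million pages, more than the 450,000 pages the program expects to have digitized by September 2005.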

Digitizing Paper

In 2003, sixty-five percent of the 106,000 pages were digitized from paper. When in good condition, paper represents original source material, whereas microfilm represents as much as a third-generation copy (paper-to-master-to-scan copy). Our hypothesis that scanning from paper produces better images and more accurate searching is discussed in the section "OCR Accuracy - Microfilm vs. Paper." However, for all its promise of cleaner images and better search accuracy, original newsprint has its own set of challenges, not the least of which is finding the collection in the first place. The UDN is constantly canvassing the state in an effort to locate original collections.

Scanning Equipment

The oversized nature of newspapers makes them difficult to scan on conventional equipment; a book scanner or a high-resolution digital scanning camera with copy stand and lighting is required. Equipment that scans this size at a minimum of 300 dpi (400 dpi is recommended) and at an economically feasible speed costs $50,000 - $100,000. The UDN outsources its scanning at $0.20-$0.30/page, depending on whether the newspaper is loose or bound.
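The effect of the resolution choice on image size can be worked out directly, since pixel dimensions are simply page dimensions in inches multiplied by dpi. The sketch below assumes a 15 x 22 inch page and 8-bit grayscale storage; both are assumptions chosen for illustration and are not figures from the UDN. The function name `pixel_dimensions` is hypothetical.

```python
PAGE_WIDTH_IN = 15   # assumed page width in inches (illustrative only)
PAGE_HEIGHT_IN = 22  # assumed page height in inches (illustrative only)


def pixel_dimensions(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (300, 400):
    width_px, height_px = pixel_dimensions(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, dpi)
    # One byte per pixel, assuming uncompressed 8-bit grayscale.
    megabytes = width_px * height_px / 1_000_000
    print(f"{dpi} dpi: {width_px} x {height_px} pixels, "
          f"about {megabytes:.1f} MB uncompressed")
```

Because pixel count grows with the square of the resolution, moving from 300 to 400 dpi increases the data per page by a factor of about 1.8, whatever the actual page size.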

Conservation Costs

Newsprint from the mid-19th century can still be in very good condition if it was properly stored and handled minimally. Some collections, however, have been stored in adverse conditions and have deteriorated over time, requiring conservation work to render them stable enough for scanning. This repair work generally consists of minor mending and cleaning. While the time and effort for this can vary widely from one collection to another, our overall average cost for this minimal conservation is $0.19/page.
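The per-page figures for paper can be combined into a rough estimate. The sketch below adds the outsourced scanning range ($0.20-$0.30/page) to the average conservation cost ($0.19/page) and applies the result to the 65 percent of the 106,000 pages digitized from paper in 2003. It is an illustrative calculation, not a reported budget: it assumes the average conservation cost applies to every page, and the names used are hypothetical.

```python
# Prices are held in cents so the arithmetic stays exact.
PAPER_SCAN_CENTS = (20, 30)  # outsourced scanning range per page, from the text
CONSERVATION_CENTS = 19      # average minimal conservation per page, from the text
TOTAL_PAGES_2003 = 106_000   # pages digitized under the second LSTA grant
PAPER_SHARE_PERCENT = 65     # share of those pages digitized from paper

paper_pages = TOTAL_PAGES_2003 * PAPER_SHARE_PERCENT // 100


def total_dollars(pages, cents_per_page):
    """Total cost in dollars for a page count and a per-page price in cents."""
    return pages * cents_per_page / 100


low = total_dollars(paper_pages, PAPER_SCAN_CENTS[0] + CONSERVATION_CENTS)
high = total_dollars(paper_pages, PAPER_SCAN_CENTS[1] + CONSERVATION_CENTS)
print(paper_pages, low, high)
```

On these assumptions, paper costs $0.39-$0.49 per page once conservation is included, which is why the condition of a collection matters as much to the budget as the scanning price itself.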

Advantages of Paper

• Cleaner digital images, more accurate OCR. Provided the paper is in fair (or better) condition, better digital images are achieved by scanning directly
