Digital Libraries & Document Image Analysis
Henry S. Baird
Statistical Pattern & Image Analysis research
Information Sciences & Technologies Lab
ICDAR Aug 4, 2003 - HSB
DLs as seen by a DIA Researcher
15 years in DIA R&D
Lucky to have known/collaborated with:
– PARC DL enthusiasts: Masinter, Street, Bloomberg, et al
– UC Berkeley Digital Library project: Wilensky, Fateman, et al
– CMU Universal Library project: Thibadeau, Hauptmann, et al
– Xerox Scanning Service Bureaus: Wallis, et al
– … many others with an interest in DLs
What challenges do DLs pose to DIA R&D?
Digital Library Dreams
Electronic networked DLs promise to provide:
– more books, journals, etc
– to more people
– faster
– at more places & times
than physical libraries can hope to…
The Ideal DL: an international, interoperable, sustainable body of rich cultural materials in digital form
Document Images’ Usefulness in DLs
display, print                                   raster image
+ retrieve (more or less well)                   + OCRed text
+ retrieve well, reuse, summarize, translate, …  + correct text
+ Web publishing                                 + links (e.g. HTML)
+ “semantic web”                                 + functional tags (e.g. XML)
+ reprinting                                     + layout format (e.g. RTF)
+ index, catalogue                               + metadata (title, author, …)
Advantages of Digital Displays versus Ink-on-Paper
Many…
– networked -- potentially unbounded content
– rapidly rewritable -- supports animation
– radiant -- legible in the dark
– sensitive -- markable, interactive
Generally thought to be overwhelming, but …
Advantages of Ink-on-Paper versus Digital Displays
PAPER             DISPLAYS today          DISPLAYS in future
cheap             expensive               less expensive
large, many       small, few              larger, more
high-resolution   low-resolution          higher-resolution
lightweight       heavy                   lighter
thin              thick                   thinner
unpowered         powered                 lower power
stable            requires maintenance
eBooks, e-paper, notebooks, laptops, PDAs, …
A. Dillon, “Reading from Paper versus Screen: a critical review of the empirical literature,” Ergonomics 35(10): 1297-1326, 1992.
The fact is, for many uses Paper is Still Widely Preferred
“Paper [remains today] the medium of choice for reading, even when most high-tech [display] technologies are to hand”
— Sellen & Harper (2002)
Why is this? Paper allows:
– flexible navigation through documents
– cross-referencing of several documents
– annotations
– interweaving of reading and writing
A. J. Sellen & R. H. R. Harper, The Myth of the Paperless Office, The MIT Press, Cambridge, MA, 2002.
Document Images are Doubly Disadvantaged within DLs
They fail to support most uses that symbolically encoded, tagged data do
They lose many key advantages they enjoyed on paper
A Threat: ‘If it’s not in Google, I don’t need it!’
Can they be made as useful in DLs as encoded data?
Can they sometimes work better in DLs than encoded data?
…these are challenges to us, the DIA R&D community.
The British Library ‘The World’s Knowledge’
38.8M items catalogued
website: 18.4M page hits/year
Compare Google:
• >3B pages
• 150M searches/day
“[Reinforcing] the Library’s role as the pre-eminent
global document supplier, digital scanning from print
and microfilm originals will give researchers rapid,
high quality delivery from over 100 million research
articles, reports, and conference papers direct to
their desktop.”
-- Lynne Brindley, Chief Executive
2002-2003 Annual Report
Bibliothèque nationale de France
The Digital Library
– digitization of both printed books and graphic material
– primarily in image mode to begin with
– most out-of-copyright
Gallica 2000
– multimedia documents: Middle Ages -> early 20th century
– 35,000 printed volumes: images
– 1000 titles full text
– “one of the largest DLs free of charge on the web”
Million Book DL Project
1M books to be scanned by 2005
– bitonal, 600 dpi
Free-to-read, universally accessible
Searchable by full text (where OCR is possible)
– ABBYY FineReader OCR
Books in public domain or copyrighted but out of print
Fifteen partners:
– US, India, China; est. 4000 person-years of clerical labor
– Multinational, multilingual (mainly English)
20 Tbyte trusted repository
Research testbed for summarization, OCR, automatic extraction of metadata, machine translation
Reddy, Raj and Gloriana St. Clair, “The Million Book Project,” CMU, Dec. 1, 2001.
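As a rough check on the project figures, dividing the 20 Tbyte repository by the 1M-book target gives an average storage budget per book. The sketch below assumes decimal units (1 Tbyte = 10^12 bytes) and, purely for illustration, a hypothetical 300-page average book; neither assumption comes from the project report.

```python
# Back-of-envelope storage budget for the Million Book Project figures.
# Assumptions (not from the source): decimal units, 300 pages per book.
TBYTE = 10**12
REPOSITORY_BYTES = 20 * TBYTE   # "20 Tbyte trusted repository"
BOOKS = 1_000_000               # "1M books"
PAGES_PER_BOOK = 300            # hypothetical average, for illustration only


def bytes_per_book(repository_bytes: int = REPOSITORY_BYTES,
                   books: int = BOOKS) -> float:
    """Average number of bytes available per book."""
    return repository_bytes / books


def bytes_per_page(pages_per_book: int = PAGES_PER_BOOK) -> float:
    """Average number of bytes available per page, given an assumed page count."""
    return bytes_per_book() / pages_per_book


if __name__ == "__main__":
    print(f"{bytes_per_book() / 1e6:.0f} MB per book")
    print(f"{bytes_per_page() / 1e3:.0f} kB per page")
```

Under these assumptions the budget is about 20 MB per book, which suggests the page images must be stored compressed.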
Google Catalogs
“1000’s” of scanned mail-order catalogs
free for publishers, ‘few days’ turnaround
– for a fee: link products to web sites
free to users: download page images
indexed by: vendor, date, page numbers, etc (not by full text content)
Amazon.com plan ‘Look Inside the Book II’
~500k books: in-copyright, non-fiction
Scan (full color), OCR cover-to-cover
Full-text search, download sample pages
Free but limited access to page images
Can Google be far behind…?
– search document image files found on Web
David D. Kirkpatrick, “Amazon Plan Would Allow Searching Text of Many Books,” The New York Times, July 21, 2003.
Capturing Document Images
To digitize a book: $4 - $1000 each!
cheaply: bitonal, low quality, mass scanning, …
expensively: color, quality control, custom handling, …
“The Price of Digitization,” Proc., NINCH Symposium (National Initiative for a Networked Cultural Heritage), New York, April 8, 2003.
Breakdown of costs:
1/3 cataloging, description, indexing
1/3 scanning, OCR, correction, markup
1/3 quality control, file maintenance, admin
NOTE: DIA can help with all three
Document Image Capture Operations
Usually, large-scale batch operations
Sometimes destructive:
– cut off spines, discard covers, wear & tear
– hot debate over ‘scan-and-discard’ policies
Image quality standards are often subjective
– usually: “completeness”; no missing pages, text
– seldom: checked for human, machine legibility
– rarely: guaranteed suitable for future uses
Scan once, for ever:
– seldom rescanned (Lesk: “not for 5-10y”)
M. Lesk, Practical Digital Libraries: Books, Bytes, & Bucks. Morgan Kaufmann, San Francisco, CA, 1997.
The PARC Rare Book Scanner
• Bulk scanning w/out damaging books
• Zero force on binding
• Book is open 90 degrees
• Pages turned manually
• 280 dpi
• 9.25”x11.75” field
• Throughput
  • 8-bit grey 450 pages/h
  • 24-bit color 120 pages/h
Bob Street & Steve Ready, PARC.
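The scanner figures above imply a page-image size and a scanning time per book. The following is a minimal sketch of that arithmetic, assuming uncompressed storage and that the full 9.25”x11.75” field is captured at 280 dpi; the function names are illustrative only.

```python
# Image geometry and scanning time implied by the scanner figures.
# Assumptions (not from the source): uncompressed pixels, full field captured.
DPI = 280
FIELD_INCHES = (9.25, 11.75)  # width, height of the scan field


def page_pixels(dpi: int = DPI, field=FIELD_INCHES):
    """Width and height of one page image in pixels."""
    return round(field[0] * dpi), round(field[1] * dpi)


def raw_page_bytes(bits_per_pixel: int) -> int:
    """Uncompressed size of one page image in bytes."""
    width, height = page_pixels()
    return width * height * bits_per_pixel // 8


def hours_to_scan(pages: int, pages_per_hour: int) -> float:
    """Scanning time for a given page count at a given throughput."""
    return pages / pages_per_hour


if __name__ == "__main__":
    print(page_pixels())                        # pixel dimensions of one page
    print(raw_page_bytes(8) / 1e6, "MB grey")   # 8-bit grey
    print(raw_page_bytes(24) / 1e6, "MB color") # 24-bit color
    print(hours_to_scan(300, 120), "h for a 300-page book in color")
```

Each page is 2590 x 3290 pixels, roughly 8.5 MB in 8-bit grey and 25.6 MB in 24-bit color before compression.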
GUI & IP for Image Capture
• Capturing Metadata
  • automatic page numbering 1,2,3,.../ i,ii,iii,.../ I,II,III,…
  • section labels
  • comments (manual)
• Image Processing
  • performed on the fly:
    • contrast, cleaning, etc
    • crop, skew-correct
  • processing templates
• Assuring Quality
  • visual inspection
• Calibration
  • color test targets
  • per-pixel gain/offset map
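A per-pixel gain/offset map is commonly applied as a flat-field correction. The sketch below is one plausible form, not necessarily what the PARC scanner software does: it assumes the offset map comes from a dark frame, the gain map from a scan of a uniform white target, and 8-bit output. The names `calibrate` and `correct` are hypothetical.

```python
# Flat-field correction sketch: corrected = gain * (raw - offset), per pixel.
# Assumption: this formula and both reference frames are illustrative choices.
def calibrate(dark, white, target=255.0):
    """Derive per-pixel gain and offset maps from dark and white reference frames."""
    offset = [row[:] for row in dark]
    gain = [[target / max(w - d, 1e-6)           # guard against dead pixels
             for d, w in zip(dark_row, white_row)]
            for dark_row, white_row in zip(dark, white)]
    return gain, offset


def correct(raw, gain, offset):
    """Apply the maps to a raw frame and clamp the result to the 0..255 range."""
    return [[min(255, max(0, round(g * (p - o))))
             for p, g, o in zip(raw_row, gain_row, offset_row)]
            for raw_row, gain_row, offset_row in zip(raw, gain, offset)]


if __name__ == "__main__":
    dark = [[10, 20]]      # sensor response with no light
    white = [[210, 120]]   # response to a uniform white target
    gain, offset = calibrate(dark, white)
    print(correct([[60, 45]], gain, offset))
```

After correction, two pixels with different sensitivities report the same value for the same reflectance.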
DIA R&D for Image Quality Control
Measuring document image quality
– new test target designs
– image processing algorithms
– rigorous, quantitative standards
Assuring quality
– fast algorithms for on-the-fly image quality estimation
Predicting human & machine legibility
What image quality features correlate well with human and OCR legibility? … and with other, later DIA tasks?
K. Summers, “Document Image Improvement for OCR as a Classification Problem,” Proc., DR&RX, Santa Clara, CA, Jan 2003.
E. H. Barney Smith & X. Qiu, “Relating Statistical Image Differences & Degradation Features,” Proc., 5th DAS, Princeton, NJ, Aug 2002.
When Quality Control Goes Wrong
Front Page, 1852 Edition of the New York Times
The Historical New York Times Project, CMU/NYT, 1999.
Scanned from microfilm.
Extracting & Recognizing Content
These are central DIA R&D goals
But existing doc image understanding systems cannot guarantee high accuracy across the full range of documents.
DLs’ scholarly & historical docs are often harder:
- typefaces, h/w styles: old fashioned
- image qualities: poor & variable
- layout geometries: deformed
- writing systems: obsolete
- languages: rare
- domains of discourse: arcane
S. Rice, G. Nagy, T. Nartker, OCR: An Illustrated Guide to the Frontier, Kluwer Academic Publishers: 1999.
Richly Meaningful Typographical Book Designs
Rare Botanical Reference Book
• Jepson’s A Flora of California, 1943.
• Authoritative, still in demand by scholars
• Only a few copies are left
• Difficult to OCR well
• Scanned at PARC, all page images put on the Univ. California, Berkeley Digital Library website
Cut into Word-box Images: layout analysis without OCR
Reflow Word Boxes into Textlines to Fit the Display Geometry
T. Breuel, W. Janssen, K. Popat, H. Baird, “Paper to PDA,” Proc., ICPR, Quebec City, 2002.
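Reflowing can be illustrated with greedy line filling over word boxes taken in reading order. This is a minimal sketch under stated assumptions, not the method of the cited paper: boxes are top-aligned rather than baseline-aligned, and `WordBox`, `gap` and `leading` are hypothetical names and parameters.

```python
# Greedy reflow of word-box images into text lines of a given display width.
# Assumption: boxes arrive in reading order; alignment is simplified to top edges.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordBox:
    width: int   # pixel width of the cropped word image
    height: int  # pixel height of the cropped word image


def reflow(words: List[WordBox], display_width: int,
           gap: int = 4, leading: int = 2) -> List[Tuple[int, int]]:
    """Return the top-left (x, y) position of each word box on the display."""
    positions = []
    x = y = 0
    line_height = 0
    for word in words:
        # Break the line when the word does not fit, unless the line is empty.
        if x > 0 and x + word.width > display_width:
            y += line_height + leading
            x = 0
            line_height = 0
        positions.append((x, y))
        x += word.width + gap
        line_height = max(line_height, word.height)
    return positions


if __name__ == "__main__":
    boxes = [WordBox(30, 10), WordBox(40, 12), WordBox(50, 10)]
    print(reflow(boxes, display_width=80))
```

A word wider than the display is still placed at the start of its own line, so no content is dropped.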
Make Doc-Images Highly Portable, Legible Everywhere
No OCR errors! (Only layout errors.)
Preserve meaningful appearance
Challenges: reading order, non-text, navigation, linking
Other ‘Pure-Image’ DIA for DLs Not Dependent on Accurate Recognition
For Text: seems feasible
– Summarization of doc images w/out OCR
– Outlining, condensing, linking
– Reflowing tables
For Non-text: seems dauntingly hard
– Mathematics
– Chemical formulae
– Line-art drawings
– Graphics generally
Vitally important to try, since recognition & encoding are highly problematic
Personal Digital Libraries
People are beginning to
– collect
– manage
– share
their own small DLs
Scanned & encoded documents, mixed together
How to assist ‘productive reading’
These users lack specialized skills
DIA tools need to be deskilled to a clerical level
… and to work together far better
Thanks to: Jon Hull et al, Ricoh Innovations; Robert Wilensky et al, UC Berkeley; Larry Spitz, DocRec; Kris Popat et al, PARC.
Interactive Digital Libraries
Today’s DIA tools leave many errors in recognition, encoding, tagging etc
How can these mistakes affordably be fixed?
Invite volunteer help:
– e.g. Project Gutenberg, Open Mind Initiative
Challenge: provide interactive tools to
– accept corrections on-line
– enforce review, verification
– efficiently make the most of every correction
– DIA tools able to benefit from correction
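One simple way to enforce review is to accept a volunteer correction only once enough independent volunteers agree on it. The sketch below is an illustrative assumption, not a description of any existing tool: the `(word_id, volunteer, text)` submission format, the `min_agree` threshold and the strict-majority rule are all hypothetical.

```python
# Accept an on-line correction only when independent volunteers agree on it.
# Assumption: each submission is a (word_id, volunteer, proposed_text) triple.
from collections import Counter
from typing import Dict, List, Tuple


def accept_corrections(submissions: List[Tuple[str, str, str]],
                       min_agree: int = 2) -> Dict[str, str]:
    """Return the corrections that pass the agreement rule, keyed by word id."""
    votes: Dict[str, Dict[str, str]] = {}
    for word_id, volunteer, text in submissions:
        # One vote per volunteer per word; a later submission replaces an earlier one.
        votes.setdefault(word_id, {})[volunteer] = text

    accepted = {}
    for word_id, by_volunteer in votes.items():
        text, count = Counter(by_volunteer.values()).most_common(1)[0]
        # Require both a minimum number of agreeing volunteers and a strict majority.
        if count >= min_agree and count > len(by_volunteer) / 2:
            accepted[word_id] = text
    return accepted


if __name__ == "__main__":
    subs = [("w1", "ann", "Flora"), ("w1", "bob", "Flora"),
            ("w1", "cy", "F1ora"), ("w2", "ann", "of")]
    print(accept_corrections(subs))
```

Corrections that fail the rule stay pending, so they can be routed to further reviewers rather than discarded.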
Thanks to: George Nagy, David Stork, Dan Lopresti.
Collaborative DLs: DIA for the Masses
Enable non-professionals to collaborate in improving, manually, on the best that automatic DIA tools can do, e.g.
– one person may correct thresholding
– another corrects OCR errors
– yet another adds tags
Offer DIA tools downloadable from the web, possibly under GPL-like licenses
Dimp ? -- document image processing toolkit
interoperable via common data structures & file formats
Thanks to: Tom Breuel, Kris Popat, Bill Janssen.
DIA R&D Opportunities for DLs
Making Document Images as Useful as Symbolically Encoded Data
Image capture, quality control
Image improvement, rectification, etc
Content extraction, recognition, & analysis
Legibility, presentation, reflowing
Markup, indexing, retrieval, summarization
Personal & interactive DLs
Offering DIA tools to DL users
… many more, no doubt
An Urgent Responsibility?
Vast, irreplaceable, culturally vital legacy collections
of paper documents are competing ineffectively for
attention with billions of digital documents
Thus paper archives are threatened with neglect,
perceived irrelevance, …. & eventually, oblivion?
The DIA community is uniquely qualified
to help the DL community rescue them.
Contact
Henry S. Baird
Statistical Pattern & Image Analysis
[email protected]/baird
+1-650-812-4481 FAX –4374