Metadata Extraction for NASA Collection
June 21, 2007
Kurt Maly, Steve Zeil, Mohammad Zubair
{maly, zeil, zubair}@cs.odu.edu
Outline
Metadata Extraction Project
System overview
Demo
What can ODU do for NASA
Current Status and Required enhancements
Why ODU
Cost Estimate
ODU Metadata Extraction System
Input: PDF documents, processed through OCR (Optical Character Recognition)
Output: metadata in XML format, easily processed for uploading into any database
(demo: 1st document)
System Overview
Processing has two main branches:
Documents with forms (RDPs)
Documents without forms
System Overview
[Figure: system flow diagram. Labels: Input Documents; Input Processing & OCR; XML model of document; Form Processing (Form Templates: sf298_1, sf298_2, ...); Unresolved Documents; Nonform Processing (Nonform Templates: au, eagle, ...); Extracted Metadata; Post Processing; Cleaned Metadata; Validation; trusted outputs; Untrusted Metadata Outputs; Human Review & Correction; corrected metadata; Final Metadata Output]
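The two-branch processing described under System Overview can be illustrated with a minimal sketch. It is not the ODU implementation: the `Template` type, the first-match rule, the marker string, and the toy templates are all assumptions made for illustration. Only the template name `sf298_1` comes from the slides.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]
# Hypothetical template type: returns None when the layout does not match.
Template = Callable[[str], Optional[Metadata]]


def first_match(text: str, templates: Dict[str, Template]) -> Optional[Tuple[str, Metadata]]:
    """Return (template name, metadata) for the first template that matches."""
    for name, template in templates.items():
        metadata = template(text)
        if metadata is not None:
            return name, metadata
    return None


def extract(text: str,
            form_templates: Dict[str, Template],
            nonform_templates: Dict[str, Template]) -> Optional[Tuple[str, Metadata]]:
    """Try the form branch first; unresolved documents go to the nonform branch."""
    match = first_match(text, form_templates)
    if match is None:
        match = first_match(text, nonform_templates)
    return match


# Toy stand-ins for real templates (assumed behavior, illustration only).
def toy_form_template(text: str) -> Optional[Metadata]:
    # Assumed marker string; a real template would inspect the form layout.
    return {"source": "form"} if "Standard Form 298" in text else None


def toy_nonform_template(text: str) -> Optional[Metadata]:
    # Assumes the first non-empty line of the document is the title.
    lines = text.strip().splitlines()
    return {"title": lines[0]} if lines else None


FORM_TEMPLATES = {"sf298_1": toy_form_template}
NONFORM_TEMPLATES = {"toy_layout": toy_nonform_template}  # hypothetical name
```

A document that no template matches returns `None`; in the flow above, such a case would be a candidate for human review.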
Demo
(additional documents)
What Can ODU do for NASA
Automate processing of form-containing documents at the NASA site
Automate document processing for 80% of collection with minimal set of metadata
Provide Interface for Human Intervention for remaining 20%
Develop general reporting tool for management on accuracy of process
Current Status
Completely automated software:
Drop in a PDF file
Process and produce output metadata in XML format
Easy (less than 5 minutes) installation process
Default set of templates for:
RDP-containing documents
Non-form documents
Statistical models of NASA collection (30,000 documents):
Phrase dictionaries: personal authors, corporate authors
Length and English word presence for title and abstract
Structure of dates, report numbers
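As an illustration of how "length and English word presence" could be used to check an extracted title, here is a minimal sketch. The function names, the thresholds, and the tiny vocabulary are hypothetical; the slides do not say how the statistical models are applied.

```python
def english_fraction(text, vocabulary):
    """Fraction of whitespace-separated words found in the vocabulary."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in vocabulary for w in words) / len(words)


def title_is_plausible(title, vocabulary,
                       min_len=10, max_len=300, min_english=0.5):
    """Accept a title whose length and English-word share fall in range.

    All three thresholds are assumed values, not figures from the
    NASA collection model.
    """
    if not (min_len <= len(title) <= max_len):
        return False
    return english_fraction(title, vocabulary) >= min_english


# Tiny stand-in vocabulary; a real check would use a full word list.
VOCABULARY = {"metadata", "extraction", "for", "collection"}
```

With this vocabulary, `title_is_plausible("Metadata Extraction for NASA Collection", VOCABULARY)` accepts the title (four of five words are recognized), while a string of OCR noise such as `"x7#q 00-12 zzkt"` is rejected.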
Current Status
Metadata extraction results for 25 documents randomly selected from the NASA collection

Document Type   Number of documents   Number of templates used   Accuracy *
With RDP        5                     2                          99%
Without RDP     20                    9                          64%
Overall         25                    11                         71%

* Notes
1. Accuracy is defined as successful completion of the extractor with reasonable metadata values extracted.
2. “Reasonable” implies that values could be automatically processed (see required enhancements) into standard format.
3. Accuracy for documents without RDP could be enhanced with additional templates (see required enhancements).
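The overall figure is consistent with a document-weighted average of the two rows. The short check below assumes that is how it was computed, which the slides do not state; the variable names are illustrative.

```python
# (number of documents, accuracy in percent) for each row of the table
rows = {
    "With RDP": (5, 99),
    "Without RDP": (20, 64),
}

total_docs = sum(n for n, _ in rows.values())
# Weight each row's accuracy by its share of the sample.
overall = sum(n * acc for n, acc in rows.values()) / total_docs
```

Here `total_docs` is 25 and `overall` is 71.0, matching the "Overall" row.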
Current Status
Documents with RDP forms:
Extracts high-quality metadata for 2 variants of SF-298
Tested on 154 NASA documents
Documents without RDP forms:
Extracts moderate-quality metadata for 9 common document layouts
Tested on 574 NASA documents
Required Enhancements
Develop complete template set
Standardize output and integrate with existing process at NASA site
Provide tutorial for operation and template writing
Required Enhancements
Develop statistical model of target collection
Write default template set to cover at least 80% of known collection
Provide oracle for detection of problem cases
Required Enhancements
Develop interface for showing scoring of output and location in document
Develop interactive modules for correcting metadata
Develop driver for creating output in desired format
Required Enhancements
Develop statistical description of input flow of documents
Develop statistical descriptions of output flow of metadata records:
Accuracy
Computer time to process
Human time to validate/correct
Why - software from ODU
Research, new technology
ODU digital library research group is world class and has made many contributions to advancing the field.
$2.5M funding in the last five years from various agencies: National Science Foundation, Andrew Mellon Foundation, Los Alamos, Sandia National Laboratory, Air Force Research Laboratory, NASA Langley, DTIC, and IBM
State of the art in automated metadata extraction is good for homogeneous collections but not effective for large, evolving, heterogeneous collections (such as NASA’s)
Need for new methods, techniques and processes
Why - software from ODU
Inexpensive (relatively)
ODU is a university with low overhead (43%)
Universities can use students and pay them assistantships rather than full-time salaries
Department adds matching tuition waivers for research assistants, which is a big incentive for students to apply for research work
Faculty are among the best in the field and require partial funding.
Why - software from ODU
Long-term software maintenance through department
Department commits to continuity independent of faculty on projects
Department will find and assign faculty and students who can become conversant with the code and maintain it (not evolve it)
Likely that there would be other faculty who are interested in evolving the code for appropriate funding
Cost of Possible Project
For a 15-month project on a significant collection, done in isolation, best-estimate cost for NASA: $160,000
For the same 15-month project, done in parallel with DTIC (and possibly GPO), cost for NASA: $90,000