Metadata Extraction for NASA Collection, June 21, 2007. Kurt Maly, Steve Zeil, Mohammad Zubair {maly, zeil, zubair}@cs.odu.edu



Metadata Extraction for NASA Collection

June 21, 2007

Kurt Maly, Steve Zeil, Mohammad Zubair
{maly, zeil, zubair}@cs.odu.edu


Outline

Metadata Extraction Project
  System overview
  Demo

What Can ODU Do for NASA

Current Status and Required Enhancements

Why ODU

Cost Estimate


ODU Metadata Extraction System

Input: PDF documents
  processed through OCR (Optical Character Recognition)

Output: metadata in XML format
  easily processed for uploading into any database

(demo: 1st document)
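To illustrate why XML output is straightforward to upload into a database, here is a minimal Python sketch. The element names (`metadata`, `title`, `author`, `date`) and the table layout are hypothetical; the slides do not show the schema the ODU extractor actually emits.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical extractor output; the real ODU schema is not shown in the slides.
SAMPLE_XML = """
<metadata>
  <title>Example Technical Report</title>
  <author>A. Author</author>
  <author>B. Author</author>
  <date>2007-06-21</date>
</metadata>
"""


def xml_to_record(xml_text):
    """Flatten one metadata document into a dict suitable for a database row."""
    root = ET.fromstring(xml_text.strip())
    return {
        "title": root.findtext("title", default=""),
        "authors": "; ".join(a.text or "" for a in root.findall("author")),
        "date": root.findtext("date", default=""),
    }


def load_record(conn, record):
    """Insert the record into a simple (assumed) table layout."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (title TEXT, authors TEXT, date TEXT)"
    )
    conn.execute("INSERT INTO docs VALUES (:title, :authors, :date)", record)


conn = sqlite3.connect(":memory:")
record = xml_to_record(SAMPLE_XML)
load_record(conn, record)
rows = conn.execute("SELECT title, authors, date FROM docs").fetchall()
```

Any database with an XML or row-oriented loader could replace SQLite here; it is used only because it ships with Python.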


System Overview

Processing has two main branches:
  Documents with forms (RDPs)
  Documents without forms


System Overview

[Figure: system flow diagram. Labels: Input Documents (PDF); Input Processing & OCR; XML model of document; Form Processing (Form Templates: sf298_1, sf298_2, ...); Unresolved Documents; Nonform Processing (Nonform Templates: au, eagle, ...); Extracted Metadata; Post Processing; Cleaned Metadata; Validation; trusted outputs; Untrusted Metadata Outputs; Human Review & Correction; corrected metadata; Final Metadata Output.]
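The two-branch flow can be sketched roughly as follows. This is an illustration under assumptions, not the ODU implementation: templates are modeled as plain Python functions, the ordering (form branch first, then nonform, then validation) is inferred from the diagram labels, the post-processing step is omitted, and all names (`process`, `toy_form_template`, `toy_nonform_template`, `toy_validate`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A template is modeled as a function that returns extracted metadata, or None
# when the document does not match. This is an assumption for illustration,
# not the ODU template format.
Template = Callable[[str], Optional[dict]]


@dataclass
class Result:
    metadata: Optional[dict]
    route: str  # "trusted", "untrusted", or "unresolved"


def first_match(templates: List[Template], doc_text: str) -> Optional[dict]:
    """Return the metadata from the first template that matches, if any."""
    for template in templates:
        metadata = template(doc_text)
        if metadata is not None:
            return metadata
    return None


def process(doc_text, form_templates, nonform_templates, validate) -> Result:
    # Form branch first; documents it cannot resolve go to the nonform branch.
    metadata = first_match(form_templates, doc_text)
    if metadata is None:
        metadata = first_match(nonform_templates, doc_text)
    if metadata is None:
        return Result(None, "unresolved")
    # Validation splits outputs into trusted and untrusted (human review).
    route = "trusted" if validate(metadata) else "untrusted"
    return Result(metadata, route)


def toy_form_template(text):
    # Hypothetical stand-in for a form template such as sf298_1.
    if "REPORT DOCUMENTATION PAGE" in text:
        return {"title": text.splitlines()[-1].strip()}
    return None


def toy_nonform_template(text):
    # Hypothetical stand-in: treat the first non-empty line as the title.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return {"title": lines[0]} if lines else None


def toy_validate(metadata):
    # Hypothetical check: very short titles are not trusted.
    return len(metadata.get("title", "")) >= 10


form_doc = "REPORT DOCUMENTATION PAGE\nA Study of Wing Loads"
result = process(form_doc, [toy_form_template], [toy_nonform_template], toy_validate)
```

Documents routed as "untrusted" correspond to the human review and correction step in the diagram.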


Demo

(additional documents)


What Can ODU do for NASA

Automate processing of form-containing documents at the NASA site

Automate document processing for 80% of the collection with a minimal set of metadata

Provide an interface for human intervention for the remaining 20%

Develop a general reporting tool for management on the accuracy of the process


Current Status

Completely automated software for:
  Drop in a PDF file
  Process and produce output metadata in XML format

Easy (less than 5 minutes) installation process

Default set of templates for:
  RDP-containing documents
  Non-form documents

Statistical models of the NASA collection (30,000 documents)
  Phrase dictionaries: personal authors, corporate authors
  Length and English word presence for title and abstract
  Structure of dates, report numbers
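As a rough illustration of how length and English word presence might be used to judge an extracted title, consider the Python sketch below. The thresholds, the tiny word list, and the function names are hypothetical; the slides do not describe how the ODU statistical models are actually computed or applied.

```python
def english_word_fraction(text, dictionary):
    """Fraction of whitespace-separated tokens found in the word list."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


def title_looks_reasonable(title, dictionary,
                           min_len=10, max_len=300, min_fraction=0.5):
    """Hypothetical check: plausible length and enough recognizable words."""
    return (min_len <= len(title) <= max_len
            and english_word_fraction(title, dictionary) >= min_fraction)


# Tiny stand-in word list; a real model would be built from the collection.
WORDS = {"metadata", "extraction", "for", "collection"}

good = title_looks_reasonable("Metadata Extraction for NASA Collection", WORDS)
bad = title_looks_reasonable("xq7 zz#1 ~~~~", WORDS)
```

In practice the length bounds and word-fraction cutoff would be derived from the collection statistics rather than fixed by hand.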


Current Status

Metadata extraction results for 25 documents randomly selected from the NASA collection

Document Type    Number of documents    Number of templates used    Accuracy *
With RDP         5                      2                           99%
Without RDP      20                     9                           64%
Overall          25                     11                          71%

* Notes
1. Accuracy is defined as successful completion of the extractor with reasonable metadata values extracted
2. “Reasonable” implies that values could be automatically processed (see required enhancements) into standard format
3. Accuracy for documents without RDP could be enhanced with additional templates (see required enhancements)
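The overall accuracy in the results table is consistent with a document-weighted average of the two rows. That reading is an inference from the numbers, not something the slide states; the short Python check below shows the arithmetic.

```python
# (document type, number of documents, accuracy) taken from the results table.
rows = [("With RDP", 5, 0.99), ("Without RDP", 20, 0.64)]

total_docs = sum(n for _, n, _ in rows)
overall = sum(n * acc for _, n, acc in rows) / total_docs

print(f"{total_docs} documents, overall accuracy {overall:.0%}")
```

With 20 of the 25 sampled documents lacking an RDP, the overall figure is dominated by the non-form branch.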


Current Status

Documents with RDP forms
  Extracts high-quality metadata for 2 variants of SF-298
  Tested on 154 NASA documents

Documents without RDP forms
  Extracts moderate-quality metadata for 9 common document layouts
  Tested on 574 NASA documents


Required Enhancements

Develop complete template set

Standardize output and integrate with existing process at NASA site

Provide tutorial for operation and template writing


Required Enhancements

Develop statistical model of target collection

Write default template set to cover at least 80% of known collection

Provide oracle for detection of problem cases


Required Enhancements

Develop interface for showing scoring of output and location in document

Develop interactive modules for correcting metadata

Develop driver for creating output in desired format


Required Enhancements

Develop statistical description of input flow of documents

Develop statistical descriptions of output flow of metadata records
  Accuracy
  Computer time to process
  Human time to validate/correct


Why - software from ODU

Research, new technology
  ODU digital library research group is world class and has made many contributions to advancing the field.
  $2.5M in funding over the last five years from various agencies: National Science Foundation, Andrew Mellon Foundation, Los Alamos, Sandia National Laboratory, Air Force Research Laboratory, NASA Langley, DTIC, and IBM
  State of the art in automated metadata extraction is good for homogeneous collections but not effective for large, evolving, heterogeneous collections (such as NASA’s)
  Need for new methods, techniques and processes


Why - software from ODU

Inexpensive (relatively)
  ODU is a university with low overhead (43%)
  Universities can use students and pay them assistantships rather than full-time salaries
  Department adds matching tuition waivers for research assistants, which is a big incentive for students to apply for research work
  Faculty are among the best in the field and require only partial funding


Why - software from ODU

Long-term software maintenance through department
  Department commits to continuity independent of the faculty on projects
  Department will find and assign faculty and students who can become conversant with the code and maintain it (not evolve it)
  Likely that there would be other faculty who are interested in evolving the code for appropriate funding


Cost of Possible Project

For a 15-month project for a significant collection, best estimate if done in isolation: cost for NASA $160,000

For the same 15-month project if done in parallel with DTIC (and possibly GPO): cost for NASA $90,000