metadata extraction

Metadata Extraction

Progress Report

12/14/2006

Outline

System Overview Detailed Structure with Recent Changes

IDM representation of documents validation & post-hoc classification

Status of Recent & Upcoming Deliverables Future Directions

System Overview

Input Documents

Input Processing

Form Processing

Final Non-form Output

Final Form Output

PDF

Omnipage XML

Unresolved Documents

Extracted MetadataCleanedMetadatasf298_1 sf298_2 ...

Form Templates

au eagle ...

Nonform Templates

Post Processing

Nonform Processing

Extracted Metadata

Detailed Structure with Recent Changes Input Processing Form Processing Post Processing Nonform Processing

Input Processing

OCR – Omnipage update radically changed XML output Details later

Study of 10188 DTIC documents found none with POINT (Page Of INTerest) pages outside 1st and last 5 suspended efforts at more sophisticated POINT page location

Input Documents

PDF ReducedPDF

Omnipage XML

Extract 1st & last 5 pages

Backup

Original PDF

OCR

Omnipage XML

Omnipage XML

Form Processing

Form Processor

Unresolved

Omnipage XML

Resolved Documents

IDM

Unresolved Documents(IDM)

Omnipage XML MetaIDM

Resolved

sf298_1 sf298_2 ...

Form Templates

Extracted Metadata

Bug fixes and Tuning Omnipage XML converted to IDM

Main form template engine rewritten to work from IDM

Independent Document Model (IDM) Platform independent Document Model Motivation

Dramatic XML Schema Change between Omnipage 14 and 15

Tie the template engine to stable specification Protects from linking directly to specific OCR product Allows us to include statistics for enhanced feature usage

Statistics (i.e. avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount, etc..)

Generating IDM

Use XSLT 2.0 stylesheets to transform Supporting new OCR schema only requires

generation of new XSLT stylesheet. -- Engine does not change

Chain a series of sheets to add functionality (CleanML)

Schema Specification Available (http://dtic.cs.odu.edu/devzone/IDM_Specification.doc)

IDM Usage

Each incoming XML schema requires specific XSLT 2.0 Stylesheet

Resulting IDM Doc used for “Form Based” templates

IDM transformed into CleanML for “Non-form” templates

CleanML XML Doc

OmniPage 14 XML Doc

OmniPage 15 XML Doc

Other OCR Output XML Doc

IDM XML Doc

Form Based Extraction

Non Form Extraction

docTreeModelCleanML.xsl

docTreeModelOther.xsl

docTreeModelOmni15.xsl

docTreeModelOmni14.xsl

IDM Tool Status

Converters completed to generate IDM from Omnipage 14 and 15 XML Omnipage 15 proved to have numerous errors in its

representation of an OCR’d document Consequently, not recommended

Form-based extraction engine revised to work from IDM Non-form engine still works from our older “CleanXML”

convertor from IDM to CleanXML completed as stop-gap measure

direct use of IDM deferred pending review of other engine modifications

Post Processing

No significant changes

Final Form Output

Authority FilePost

Processor

Extracted Metadata

PermittedValues

CleanedMetadata

Nonform Processing

Bug fixes & tuning Added new

validation component

Post-hoc classification replaces former a

priori classification schemes

Convert to CleanXML

Extract Metadata

Validation Script

Select Best Metadata

Clean

Final Nonform Output

Unresolved Docs(IDM)

CleanXML

CandidateMetadata

Sets

IDM

Selected Metadata

document

document

au eagle ...

Nonform Templates

Validation

Given a set of extracted metadata mark each field with a confidence value indicating how

trustworthy the extracted value is mark the set with a composite confidence score

Fields and Sets with low confidence scores may be referred for additional processing automated post-processing human intervention and correction

Validating Extracted Metadata

Techniques must be independent of the extraction method A validation specification is written for each collection, combining Field-specific validation rules

statistical models derived for each field of text length % of words from English dictionary % of phrases from knowledge base prepared for that

field pattern matching

Sample Validation Specification

Combines results from multiple fields<val:validate collection="dtic"

xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">

<val:average>

<val:field name="UnclassifiedTitle">...</val:field>

<val:field name="PersonalAuthor">...</val:field>

<val:field name="CorporateAuthor">...</val:field>

<val:field name="ReportDate">...</val:field>

</val:average>

</val:validate>

Validation Spec: Field Tests

Each field is subjected to one or more tests…<val:field name="PersonalAuthor"> <val:average> <val:length/> <val:max>

<val:phrases length="1"/> <val:phrases length="2"/> <val:phrases length="3"/>

</val:max> </val:average> </val:field><val:field name="ReportDate"> <val:reportFormat/></val:field>...

Sample Input Metadata Set

<metadata>

<UnclassifiedTitle>Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>

<PersonalAuthor>Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>

<ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>

</metadata>

Sample Validator Output

<metadata confidence="0.522"><UnclassifiedTitle confidence="0.943">Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>

<PersonalAuthor confidence="0.622">Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>

<ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>

</metadata>

Classification (a priori)

Classify (select best template)


CleanXML

Extracted Metadata

au eagle ...

Nonform Templates

Unresolved Document

Extract Metadata

selectedtemplate

Previously, we had attempted various schemes for a priori classification x-y trees bin classification

Still investigating some visual recognition

Post-Hoc Classification

Apply all templates to document results in multiple candidate sets of metadata

Score each candidate using the validator Select the best-scoring set

Extract Metadata


CleanXML

Selected Metadata

au eagle ...

Nonform Templates

Unresolved Document


CandidateMetadata

Sets

Validation Spec.

validation rules

Demo & Experimental Results Results of 157 documents

http://128.82.7.147:8080/dtic/validsum157.jsp

Class Hand Classification Validation

Au 86 86

Eagle 47 47

Title 24 24

Total 157 157

Future Directions

Input Documents

Input Processing

Form Processing

Final Metadata

Output

PDF

Omnipage XML

Unresolved Documents

Extracted Metadata

CleanedMetadata

sf298_1 sf298_2 ...

Form Templates

au eagle ...

Nonform Templates

Post Processing

Nonform Processing

Extracted Metadata

Validation

Status of Recent & Upcoming Deliverables DTIC - Classifier Development (9/19/06) NASA - Enhance classification algorithm for

two specific classes (10/31/2006) NASA - Process study for inter-organizational

collections – configuration software – (12/1/2006)

NASA - Enhance engine to recognize two major classes (Dec 15, 2006)

Classifier Development

DTIC - Classifier Development (9/19/06) NASA - Enhance classification algorithm for

two specific classes (10/31/2006) Delayed by difficulties with a priori classification

schemes Now replaced by post hoc validation-based

classification some tuning of validation spec required cleaning of metadata sources for statistical models Demo posted 11/15/2006

Configuration

NASA - Process study for inter-organizational collections (12/1/2006) extraction engines differentiate by collection-dependent

template sets validation specifications take collection name as a required

attribute used to locate distinct statistical models built for that collection

Regression test framework established protects against changes or tuning to one collection degrading

performance on others

Engine Enhancements

NASA - Enhance engine to recognize two major classes (12/15/2006) in many ways, already satisfied most planned enhancements deferred due to

work on IDM in short term, emphasis will be on expanding the

template set to exploit existing engine features and availability of new post-hoc classifier

END

Questions?

Current System (Detailed)

Input Documents

Extract 1st & last 5 pages

OCR

Backup

Form Processor

UnresolvedConvert to CleanXML

Omnipage XML

Extract Metadata

Validation Script


Omnipage Clean

Final Form Output


Authority FilePost

Processor

PDFReduced

PDF

Original PDF

Omnipage XML

Omnipage XML

Resolved Documents

IDM

IDM

CleanXML

CandidateMetadata

Sets

IDM

Extracted Metadata

PermittedValues

CleanedMetadata

Selected Metadata

Omnipage XML MetaIDM

Resolved

sf298_1 sf298_2 ...

Form Templates

au eagle ...

Nonform Templates

metadata extraction

Documents

tuningomnipage xml

idmnonform engine

xml outputdetails laterstudy

use of idm

idmmain form template

idm use xslt

specific xslt

field testseach field