metadata extraction
DESCRIPTION
Metadata Extraction. Progress Report 12/14/2006. Outline. System Overview Detailed Structure with Recent Changes IDM representation of documents validation & post-hoc classification Status of Recent & Upcoming Deliverables Future Directions. System Overview. - PowerPoint PPT PresentationTRANSCRIPT
Outline
System Overview Detailed Structure with Recent Changes
IDM representation of documents validation & post-hoc classification
Status of Recent & Upcoming Deliverables Future Directions
System Overview
Input Documents
Input Processing
Form Processing
Final Non-form Output
Final Form Output
Omnipage XML
Unresolved Documents
Extracted MetadataCleanedMetadatasf298_1 sf298_2 ...
Form Templates
au eagle ...
Nonform Templates
Post Processing
Nonform Processing
Extracted Metadata
Detailed Structure with Recent Changes Input Processing Form Processing Post Processing Nonform Processing
Input Processing
OCR – Omnipage update radically changed XML output Details later
Study of 10188 DTIC documents found none with POINT (Page Of INTerest) pages outside 1st and last 5 suspended efforts at more sophisticated POINT page location
Input Documents
PDF ReducedPDF
Omnipage XML
Extract 1st & last 5 pages
Backup
Original PDF
OCR
Omnipage XML
Omnipage XML
Form Processing
Form Processor
Unresolved
Omnipage XML
Resolved Documents
IDM
Unresolved Documents(IDM)
Omnipage XML MetaIDM
Resolved
sf298_1 sf298_2 ...
Form Templates
Extracted Metadata
Bug fixes and Tuning Omnipage XML converted to IDM
Main form template engine rewritten to work from IDM
Independent Document Model (IDM) Platform independent Document Model Motivation
Dramatic XML Schema Change between Omnipage 14 and 15
Tie the template engine to stable specification Protects from linking directly to specific OCR product Allows us to include statistics for enhanced feature usage
Statistics (i.e. avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount, etc..)
Generating IDM
Use XSLT 2.0 stylesheets to transform Supporting new OCR schema only requires
generation of new XSLT stylesheet. -- Engine does not change
Chain a series of sheets to add functionality (CleanML)
Schema Specification Available (http://dtic.cs.odu.edu/devzone/IDM_Specification.doc)
IDM Usage
Each incoming XML schema requires specific XSLT 2.0 Stylesheet
Resulting IDM Doc used for “Form Based” templates
IDM transformed into CleanML for “Non-form” templates
CleanML XML Doc
OmniPage 14 XML Doc
OmniPage 15 XML Doc
Other OCR Output XML Doc
IDM XML Doc
Form Based Extraction
Non Form Extraction
docTreeModelCleanML.xsl
docTreeModelOther.xsl
docTreeModelOmni15.xsl
docTreeModelOmni14.xsl
IDM Tool Status
Converters completed to generate IDM from Omnipage 14 and 15 XML Omnipage 15 proved to have numerous errors in its
representation of an OCR’d document Consequently, not recommended
Form-based extraction engine revised to work from IDM Non-form engine still works from our older “CleanXML”
convertor from IDM to CleanXML completed as stop-gap measure
direct use of IDM deferred pending review of other engine modifications
Post Processing
No significant changes
Final Form Output
Authority FilePost
Processor
Extracted Metadata
PermittedValues
CleanedMetadata
Nonform Processing
Bug fixes & tuning Added new
validation component
Post-hoc classification replaces former a
priori classification schemes
Convert to CleanXML
Extract Metadata
Validation Script
Select Best Metadata
Clean
Final Nonform Output
Unresolved Docs(IDM)
CleanXML
CandidateMetadata
Sets
IDM
Selected Metadata
document
document
au eagle ...
Nonform Templates
Validation
Given a set of extracted metadata mark each field with a confidence value indicating how
trustworthy the extracted value is mark the set with a composite confidence score
Fields and Sets with low confidence scores may be referred for additional processing automated post-processing human intervention and correction
Validating Extracted Metadata
Techniques must be independent of the extraction method A validation specification is written for each collection, combining Field-specific validation rules
statistical models derived for each field of text length % of words from English dictionary % of phrases from knowledge base prepared for that
field pattern matching
Sample Validation Specification
Combines results from multiple fields<val:validate collection="dtic"
xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
<val:average>
<val:field name="UnclassifiedTitle">...</val:field>
<val:field name="PersonalAuthor">...</val:field>
<val:field name="CorporateAuthor">...</val:field>
<val:field name="ReportDate">...</val:field>
</val:average>
</val:validate>
Validation Spec: Field Tests
Each field is subjected to one or more tests…<val:field name="PersonalAuthor"> <val:average> <val:length/> <val:max>
<val:phrases length="1"/> <val:phrases length="2"/> <val:phrases length="3"/>
</val:max> </val:average> </val:field><val:field name="ReportDate"> <val:reportFormat/></val:field>...
Sample Input Metadata Set
<metadata>
<UnclassifiedTitle>Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
<PersonalAuthor>Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
<ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>
Sample Validator Output
<metadata confidence="0.522"><UnclassifiedTitle confidence="0.943">Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
<PersonalAuthor confidence="0.622">Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
<ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>
Classification (a priori)
Classify (select best template)
Final Nonform Output
CleanXML
Extracted Metadata
au eagle ...
Nonform Templates
Unresolved Document
Extract Metadata
selectedtemplate
Previously, we had attempted various schemes for a priori classification x-y trees bin classification
Still investigating some visual recognition
Post-Hoc Classification
Apply all templates to document results in multiple candidate sets of metadata
Score each candidate using the validator Select the best-scoring set
Extract Metadata
Final Nonform Output
CleanXML
Selected Metadata
au eagle ...
Nonform Templates
Unresolved Document
Select Best Metadata
CandidateMetadata
Sets
Validation Spec.
validation rules
Demo & Experimental Results Results of 157 documents
http://128.82.7.147:8080/dtic/validsum157.jsp
Class Hand Classification Validation
Au 86 86
Eagle 47 47
Title 24 24
Total 157 157
Future Directions
Input Documents
Input Processing
Form Processing
Final Metadata
Output
Omnipage XML
Unresolved Documents
Extracted Metadata
CleanedMetadata
sf298_1 sf298_2 ...
Form Templates
au eagle ...
Nonform Templates
Post Processing
Nonform Processing
Extracted Metadata
Validation
Status of Recent & Upcoming Deliverables DTIC - Classifier Development (9/19/06) NASA - Enhance classification algorithm for
two specific classes (10/31/2006) NASA - Process study for inter-organizational
collections – configuration software – (12/1/2006)
NASA - Enhance engine to recognize two major classes (Dec 15, 2006)
Classifier Development
DTIC - Classifier Development (9/19/06) NASA - Enhance classification algorithm for
two specific classes (10/31/2006) Delayed by difficulties with a priori classification
schemes Now replaced by post hoc validation-based
classification some tuning of validation spec required cleaning of metadata sources for statistical models Demo posted 11/15/2006
Configuration
NASA - Process study for inter-organizational collections (12/1/2006) extraction engines differentiate by collection-dependent
template sets validation specifications take collection name as a required
attribute used to locate distinct statistical models built for that collection
Regression test framework established protects against changes or tuning to one collection degrading
performance on others
Engine Enhancements
NASA - Enhance engine to recognize two major classes (12/15/2006) in many ways, already satisfied most planned enhancements deferred due to
work on IDM in short term, emphasis will be on expanding the
template set to exploit existing engine features and availability of new post-hoc classifier
Current System (Detailed)
Input Documents
Extract 1st & last 5 pages
OCR
Backup
Form Processor
UnresolvedConvert to CleanXML
Omnipage XML
Extract Metadata
Validation Script
Select Best Metadata
Omnipage Clean
Final Form Output
Final Nonform Output
Authority FilePost
Processor
PDFReduced
Original PDF
Omnipage XML
Omnipage XML
Resolved Documents
IDM
IDM
CleanXML
CandidateMetadata
Sets
IDM
Extracted Metadata
PermittedValues
CleanedMetadata
Selected Metadata
Omnipage XML MetaIDM
Resolved
sf298_1 sf298_2 ...
Form Templates
au eagle ...
Nonform Templates