EXPLOITING DYNAMIC VALIDATION FOR DOCUMENT LAYOUT
CLASSIFICATION DURING METADATA EXTRACTION
Kurt Maly
Steven Zeil
Mohammad Zubair
WWW/Internet 2007
Vila Real, Portugal
October 5-8, 2007
OUTLINE
1. Background: Robust automatic extraction of metadata from heterogeneous collections
2. Validation of extracted metadata
3. Post-hoc classification of document layouts
4. Conclusions
1. Background
• Diverse, growing government document collections
• Amount of metadata available varies considerably
• Automated system to extract metadata from new documents
  – Classify documents by layout similarity
  – Template defines extraction process for a layout class
Process Overview
[Figure: process flow diagram. Stages: OCR, Layout Classification, Extract Metadata, Validator, Human Correction, Enter into database. Data labels: document (PDF), document (XML), layout templates, selected template, metadata, untrusted metadata, trusted metadata, corrected metadata.]
Sample Metadata Record (including mistakes)
<?xml version="1.0"?>
<metadata>
  <UnclassifiedTitle>Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius</UnclassifiedTitle>
  <PersonalAuthor>Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby>Approved by: Thesis Committee Chair Jack D. Kem, Ph.D. , Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A.</approvedby>
  <acceptedby>Robert F. Baumann, Ph.D.</acceptedby>
</metadata>
Issue: Layout Classification
• Key to keeping extraction templates simple
• Previously explored a variety of techniques based upon geometric position of text and graphics
  – e.g., MX-Y trees, learning machines (??)
• Generally unsatisfactory in either accuracy or in compatibility with template approach
Issue: Robustness
• Sources of errors
  – OCR software failures
  – Poor document quality
  – Classification errors
  – Template errors
  – Extraction engine faults
• Need to detect dubious outputs
  – refer to human for inspection & correction
2. Validation
Exploit statistical and heuristic approaches to evaluate quality of extracted metadata
• Reference Models
• Validation Process
  – tests
  – specifications
Reference Models
• From previously extracted metadata
  – specific to document collection
• Phrase dictionaries constructed for fields with specialized vocabularies
  – e.g., author, organization
• Statistics collected
  – mean and standard deviation
  – permits detection of outputs that are significantly different from collection norms
Statistics collected
• Field length statistics
  – title, abstract, author, ...
• Phrase recurrence rates for fields with specialized vocabularies
  – author and organization
• Dictionary detection rates for words in natural language fields
  – abstract, title, ...
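As a rough illustration of how the field length statistics could be gathered, the sketch below computes the mean and standard deviation of word counts for one field. The function name, the record layout (a dict per document), the sample values, and the choice of the sample standard deviation are all assumptions made for this example; the slides do not say how the reference model was actually built.

```python
from statistics import mean, stdev

def field_length_stats(records, field):
    """Return (mean, sample std dev) of the word count of `field` across records."""
    lengths = [len(r[field].split()) for r in records if r.get(field)]
    return mean(lengths), stdev(lengths)

# Hypothetical, hand-made records standing in for previously extracted metadata.
records = [
    {"PersonalAuthor": "Jane Q. Smith"},
    {"PersonalAuthor": "John Doe"},
    {"PersonalAuthor": "Alan M. Turing"},
]

avg, sd = field_length_stats(records, "PersonalAuthor")
print(f"PersonalAuthor: avg={avg:.2f} words, std dev={sd:.2f}")
```

The same loop, run per field over a whole collection, would yield a table of the kind shown on the following slides.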
Field              Avg.  Std. Dev.
UnclassifiedTitle  9.9   4.8
Abstract           114   58
PersonalAuthor     2.8   0.5
CorporateAuthor    7     2.3
Field Length (in words), DTIC collection
Field              Avg.  Std. Dev.
UnclassifiedTitle  88%   13%
Abstract           94%   5%
Dictionary Detection (% of recognized words), DTIC collection
Field            Phrase length  Avg.  Std. Dev.
PersonalAuthor   1              97%   11%
PersonalAuthor   2              83%   32%
PersonalAuthor   3              71%   45%
CorporateAuthor  1              100%  2%
CorporateAuthor  2              99%   6%
CorporateAuthor  3              99%   10%
CorporateAuthor  4              98%   13%
Phrase Dictionary Hit Percentage, DTIC collection
Validation Process
• Extracted outputs for fields are subjected to a variety of tests
  – Test results are normalized to obtain confidence value in range 0.0-1.0
• Test results for same field are combined to form field confidence
• Field confidences are combined to form overall confidence
Validation Tests
• Deterministic
  – Regular patterns such as date, report numbers
• Probabilistic
  – Length: if value of metadata is close to average -> high score
  – Vocabulary: recurrence rate according to field’s phrase dictionary
  – Dictionary: detection rate of words in English dictionary
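The four kinds of test listed above might look roughly like the following. This is only a sketch: the slides do not give the normalization formulas, so the linear fall-off to 0.0 at three standard deviations, the accepted date formats, and all function names are assumptions, and the scores will not reproduce the validator's actual numbers.

```python
import re
from datetime import datetime

def length_score(text, avg, std_dev):
    """1.0 at the collection average, falling linearly to 0.0 at 3 std devs (assumed mapping)."""
    z = abs(len(text.split()) - avg) / std_dev
    return max(0.0, 1.0 - z / 3.0)

def dictionary_score(text, dictionary):
    """Fraction of alphabetic words that appear in the given word list."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)

def phrase_score(text, phrases, n):
    """Fraction of n-word phrases in the value that recur in the field's phrase dictionary."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return sum(g in phrases for g in grams) / len(grams)

def date_score(text, formats=("%d %B %Y", "%B %Y", "%Y-%m-%d")):
    """Deterministic test: 1.0 if the whole value parses in an accepted format, else 0.0."""
    for fmt in formats:
        try:
            datetime.strptime(text.strip(), fmt)
            return 1.0
        except ValueError:
            pass
    return 0.0

# The ReportDate value from the sample record fails the deterministic test.
print(date_score("Accepted this 18th day of June 2004 by:"))
```

Every test returns a value in the range 0.0-1.0, which is what allows the results to be combined later.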
Combining results
• Validation specification describes
  – which tests to apply to which fields
  – how to combine field tests into field confidence
  – how to combine field confidences into overall confidence
Validation Specification for DTIC Collection
<?xml version="1.0"?>
<val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs…">
  <val:average>
    <val:field name="UnclassifiedTitle">
      <val:average>
        <val:dictionary/>
        <val:length/>
      </val:average>
    </val:field>
    <val:field name="PersonalAuthor">
      <val:min>
        <val:length/>
        <val:max>
          <val:phrases length="1"/>
          <val:phrases length="2"/>
          <val:phrases length="3"/>
        </val:max>
      </val:min>
    </val:field>
Validation Specification - continued
    <val:field name="CorporateAuthor">
      <val:min>
        <val:length/>
        <val:max>
          <val:phrases length="1"/>
          <val:phrases length="2"/>
          <val:phrases length="3"/>
          <val:phrases length="4"/>
        </val:max>
      </val:min>
    </val:field>
    <val:field name="ReportDate">
      <val:dateFormat/>
    </val:field>
  </val:average>
</val:validate>
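The nesting of average, min and max in the specification can be read as a small expression tree. The sketch below evaluates such a tree in Python; the tuple encoding, the `evaluate` function, and the test scores are hypothetical and chosen for illustration only. It does not parse the XML and is not the authors' Jelly-based implementation.

```python
from statistics import mean

# Combinators named in the specification, mapped to Python functions.
COMBINATORS = {"average": mean, "min": min, "max": max}

def evaluate(node, scores):
    """Evaluate a nested (combinator, children) tuple; a string leaf is a test name."""
    if isinstance(node, str):
        return scores[node]
    op, children = node
    return COMBINATORS[op]([evaluate(child, scores) for child in children])

# PersonalAuthor rule from the specification: min(length, max(phrases 1..3)).
personal_author = ("min", ["length", ("max", ["phrases1", "phrases2", "phrases3"])])

# Hypothetical test scores, not taken from the slides.
scores = {"length": 0.4, "phrases1": 0.9, "phrases2": 0.7, "phrases3": 0.2}

confidence = evaluate(personal_author, scores)
print(confidence)
```

With min as the outer combinator, a field is only as trustworthy as its weakest check, while max over the phrase tests lets any one phrase length vouch for the vocabulary.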
<?xml version="1.0"?>
<metadata confidence="0.460"
warning="ReportDate field does not match required pattern">
<UnclassifiedTitle confidence="0.979">Thesis Title: Intrepidity, Iron
Will, and Intellect: General Robert L. Eichelberger and Military Genius
</UnclassifiedTitle>
<PersonalAuthor confidence="0.4"
warning="PersonalAuthor: unusual number of words">
Name of Candidate: Major Matthew H. Fath
</PersonalAuthor>
<ReportDate confidence="0.0"
warning="ReportDate field does not match required pattern">
Accepted this 18th day of June 2004 by:
</ReportDate>
<approvedby warning="unvalidated">Approved by: Thesis Committee Chair Jack D. Kem, Ph.D.
, Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A.
</approvedby>
<acceptedby warning="unvalidated">Robert F. Baumann, Ph.D.</acceptedby>
</metadata>
Sample Output from the Validator
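The overall confidence in this record is consistent with a plain average of the three field confidences that carry a score. That reading assumes that CorporateAuthor, which is absent from the record, and the unvalidated fields are left out of the average; the slides do not state this explicitly.

```python
# Field confidences copied from the sample validator output above.
field_confidences = {
    "UnclassifiedTitle": 0.979,
    "PersonalAuthor": 0.4,
    "ReportDate": 0.0,
}

overall = sum(field_confidences.values()) / len(field_confidences)
print(f"{overall:.3f}")  # prints 0.460
```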
3. Classification
• Post hoc classification
• Experimental Results
Post hoc Classification
• Previously attempted a priori classification
  – choose one layout based on geometry of page
  – apply template for that chosen layout
• Alternative: exploit validator for post hoc selection of layout
  – Apply all templates to given document
  – Score each output using validator
  – Select template which scored highest
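The three post hoc steps amount to an argmax over templates. A minimal sketch follows, assuming each template can be modelled as a function from a document to a metadata dict and the validator as a function from metadata to a confidence; the toy templates, toy validator, and document below are invented for the example.

```python
def classify_post_hoc(document, templates, validate):
    """Apply every template, score each output, and return the best (layout, metadata, confidence)."""
    best = None
    for layout, extract in templates.items():
        metadata = extract(document)
        confidence = validate(metadata)
        if best is None or confidence > best[2]:
            best = (layout, metadata, confidence)
    return best

# Toy stand-ins: each "template" just picks a different line as the author.
templates = {
    "au": lambda doc: {"PersonalAuthor": doc["line1"]},
    "title": lambda doc: {"PersonalAuthor": doc["line2"]},
}

def validate(metadata):
    """Toy validator: prefers author values of two or three words."""
    n = len(metadata["PersonalAuthor"].split())
    return 1.0 if 2 <= n <= 3 else 0.2

doc = {
    "line1": "Approved for public release; distribution unlimited",
    "line2": "Matthew H. Fath",
}
layout, metadata, confidence = classify_post_hoc(doc, templates, validate)
print(layout, confidence)
```

The cost is one extraction and one validation per template per document, in exchange for not needing a separate geometric classifier.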
Experimental Design
• How effective is post-hoc classification?
• Selected several hundred documents recently added to DTIC collection
  – Visually classified by humans,
    • comparing to 4 most common layouts from studies of earlier documents
    • discarded documents not in one of those classes
    • 167 documents remained
• Applied all templates, validated extracted metadata, selected highest confidence as the validator’s choice
• Compared validator’s preferred layout to human choices
Automatic vs. Human Classifications
• Post-hoc classifier agreed with human on 74% of cases
Manually assigned class  Validator au  Validator eagle  Validator rand  Validator title  Total manual
au                       86            0                0               0                86
eagle                    0             8                33              4                45
rand                     0             0                8               4                12
title                    0             0                1               23               24
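The agreement figure can be recomputed from the table by dividing the diagonal (cases where validator and human agree) by the total. The sketch below does this with the counts copied from the table; the variable names are illustrative.

```python
classes = ["au", "eagle", "rand", "title"]

# Rows: manually assigned class; columns: validator's choice (same order).
matrix = [
    [86, 0, 0, 0],
    [0, 8, 33, 4],
    [0, 0, 8, 4],
    [0, 0, 1, 23],
]

agree = sum(matrix[i][i] for i in range(len(classes)))
total = sum(sum(row) for row in matrix)
rate = agree / total
print(agree, total, f"{rate:.3f}")
```

This gives 125 agreements out of 167 documents, about 0.749, in line with the roughly 74% quoted above. Most of the disagreement comes from a single cell: 33 documents manually labelled eagle that the validator assigned to rand.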
Post hoc Classification
• Problem:
  – WYSIWYG extraction often results in extra words in extracted data
    • E.g., in author field ("Name of Candidate", "Major")
  – Not desired in final output
    • post-processing to remove these anticipated but not yet implemented
  – Artificially reduce validator scores
    • not part of phrase dictionary
• Solution:
  – Post-processing must be done prior to validation
Re-interpreting the experiment
• Subjected author metadata to simulated post-processing
  – scripts to remove
    • known extraneous phrases specific to the document layouts
    • military ranks and other honorifics
• Agreement between post-hoc classifier and human classification rose to 99%
  – far exceeds our best a priori classifiers to date
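A post-processing script of the kind described might look like the sketch below. The phrase and rank lists are hypothetical examples, not the lists used in the experiment, and the regular-expression approach is an assumption about how such a script could be written.

```python
import re

# Hypothetical lists; the scripts used in the experiment are not shown in the slides.
LAYOUT_PHRASES = ["name of candidate:"]
RANKS = ["major", "lieutenant colonel", "colonel", "captain", "dr.", "mr."]

def clean_author(value):
    """Strip layout-specific labels and ranks/honorifics from an extracted author value."""
    text = value.strip()
    for phrase in LAYOUT_PHRASES + RANKS:
        # Match the phrase only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    return " ".join(text.split())

print(clean_author("Name of Candidate: Major Matthew H. Fath"))
```

Applied to the author value from the sample record, this leaves only the name, which is the form the phrase dictionary and length statistics were built from.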
4. Conclusions
• Creating a statistical model of existing metadata is a very useful tool for validating metadata extracted from new documents
• Validation can be used to classify documents and select the right template for the automated extraction process