oct. 12, 2007 stev 2007, portland or a scriptable, statistical oracle for a metadata extraction...

Oct. 12, 2007, STEV 2007, Portland, OR

A Scriptable, Statistical Oracle for a Metadata Extraction System

Kurt J. Maly, Steven J. Zeil, Mohammad Zubair, Ashraf Amrou, Ali Aazhar, Naveen Ratkal


The Problem

Dynamic validation of a program that
– mimics human behavior
– is imprecisely specified
– will vary widely in behavior
    – by user/installation
    – over time


Overall Approach

Apply a wide variety of tests on selected output properties
– deterministic
– statistical
Combine tests heuristically
– Combination controlled by scripts for flexibility


Outline

The Application: Metadata Extraction
Dynamic Validation of the Extractor
Evaluating the Validator
Conclusions


The Application: Metadata Extraction

Large, diverse, growing government document collections
– DTIC, NASA, GPO (EPA & Congress)
Automated system to extract metadata from documents
– Input: scanned page images or “text” PDF
– Output: XML containing metadata fields
    – e.g., titles, authors, dates of publication, abstracts, release rights

[Figure: document page annotated with the regions Title, Authors, Authors Affiliation, Abstract, Introduction]


Approach

Classify documents by layout similarity
Templates contain rules for extracting metadata from a specific layout
– To keep templates simple, layout classes must be fairly detailed and specific


Process Overview

[Figure: process flow diagram. Stages: OCR, Layout Classification, Extract Metadata, Validator, Human Correction, Enter into database. Data flows: document (PDF), document (XML), layout templates, selected template, metadata, untrusted metadata, trusted metadata, corrected metadata.]


Sample Metadata Record (including mistakes)

<?xml version="1.0"?>
<metadata>
  <UnclassifiedTitle>Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius</UnclassifiedTitle>
  <PersonalAuthor>Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby>Approved by: Thesis Committee Chair Jack D. Kem, Ph.D. , Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A. </approvedby>
  <acceptedby>Robert F. Baumann, Ph.D.</acceptedby>
</metadata>


Rationale for Dynamic Validation

Sources of error
– Document flaws
– OCR software failures
– Mis-classified layouts
– Template faults
– Extraction engine faults
Software replaces expensive human-intensive process
Moderately high (10-20%) failure rate is tolerable if we can identify which output sets are failures
– route those sets to humans for inspection and correction




Dynamic Validation

Challenges:
– imprecise specification
– low-level internal state not trusted as indicator of correct progress
– input characteristics vary from one document collection to another
– input characteristics may vary over time


Approach

Wide battery of basic tests can be applied to metadata fields
– deterministic
– statistical
Basic test results combined heuristically
– under control of custom scripting language


Basic Tests - Deterministic

date formats
regular expressions
– structured fields, e.g., report numbers
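The two kinds of deterministic test listed above can be sketched in Python as follows. This is a minimal illustration, not the system's implementation: the report-number pattern and the list of accepted date formats are invented for the example, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Hypothetical report-number pattern, invented for this illustration.
REPORT_NUMBER = re.compile(r"[A-Z]{2,6}-\d{2,4}-\d{1,6}")

# Hypothetical list of accepted date formats.
DATE_FORMATS = ("%d %B %Y", "%B %Y", "%Y-%m-%d")


def matches_pattern(value: str, pattern: re.Pattern) -> bool:
    """Deterministic test: the whole field must match the pattern."""
    return pattern.fullmatch(value.strip()) is not None


def is_valid_date(value: str) -> bool:
    """Deterministic test: the field must parse under one accepted format."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value.strip(), fmt)
            return True
        except ValueError:
            continue
    return False
```

Under these assumed formats, a value such as "18 June 2004" passes, while a mis-extracted value such as "Accepted this 18th day of June 2004 by:" fails. Either way the result is a plain pass or fail.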


Basic Tests – Statistical

Reference models from prior metadata (human extracted)
– 850,000 records in DTIC collection
– 20,000 records in NASA
Measured field lengths
Phrase dictionaries constructed for fields with specialized vocabularies
– e.g., author, organization


Statistics collected (mean & std dev)

Field lengths
– title, abstract, author, ...
Dictionary detection rates for words in natural language fields
– abstract, title, ...
Phrase recurrence rates for fields with specialized vocabularies
– author and organization
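As a rough sketch of how two of these statistics might be computed for a single field value, consider the Python functions below. The slides do not give exact definitions, so the tokenization, the punctuation stripping, and the reading of "phrase" as a run of consecutive words are all assumptions; the function names are hypothetical.

```python
def dictionary_detection_rate(text: str, dictionary: set) -> float:
    """Fraction of the words in `text` that appear in `dictionary`."""
    words = [w.strip(".,:;()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


def phrase_recurrence_rate(text: str, phrases: set, length: int) -> float:
    """Fraction of `length`-word phrases in `text` found in the phrase dictionary.

    A phrase is assumed here to be a run of `length` consecutive words.
    """
    words = text.lower().split()
    ngrams = [" ".join(words[i:i + length])
              for i in range(len(words) - length + 1)]
    if not ngrams:
        return 0.0
    return sum(g in phrases for g in ngrams) / len(ngrams)
```

Applying such functions to every record in a reference collection and taking the mean and standard deviation of the results would yield figures of the same kind as the tables that follow.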


Field              Avg.   Std. Dev.
UnclassifiedTitle  9.9    4.8
Abstract           114    58
PersonalAuthor     2.8    0.5
CorporateAuthor    7      2.3

Field Length (in words), DTIC collection


Field              Avg.   Std. Dev.
UnclassifiedTitle  88%    13%
Abstract           94%    5%

Dictionary Detection (% of recognized words), DTIC collection


Phrase Dictionary Recurrence Rate, DTIC collection

Field            Phrase Length  Mean   Std. Dev.
PersonalAuthor   1              97%    11%
PersonalAuthor   2              83%    32%
PersonalAuthor   3              71%    45%
CorporateAuthor  1              100%   2.0%
CorporateAuthor  2              99%    6.0%
CorporateAuthor  3              99%    10%
CorporateAuthor  4              99%    13%


Validation Procedure

Selected basic tests are applied to extracted metadata field values
– deterministic tests will pass or fail
– statistical tests compare to norms from reference models
    – standard score computed
Test results for same field are combined to form field confidence
Field confidences are combined to form overall confidence
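To make the standard score concrete, the sketch below computes it for field length, using the DTIC means and standard deviations from the field-length table. The function name is hypothetical, and whitespace splitting is an assumed way of counting words. How a standard score is then scaled into a confidence is not stated in the slides, so that step is left out.

```python
# (mean, std dev) of field length in words, DTIC collection, from the table.
REFERENCE_LENGTH = {
    "UnclassifiedTitle": (9.9, 4.8),
    "Abstract": (114, 58),
    "PersonalAuthor": (2.8, 0.5),
    "CorporateAuthor": (7, 2.3),
}


def length_standard_score(field: str, value: str) -> float:
    """Standard score z = (observed length - mean) / std dev."""
    mean, std = REFERENCE_LENGTH[field]
    observed = len(value.split())  # assumed: words are whitespace-separated
    return (observed - mean) / std
```

For the mis-extracted author value "Name of Candidate: Major Matthew H. Fath" (7 words), the score is (7 - 2.8) / 0.5 = 8.4 standard deviations above the norm, which fits the "unusual number of words" warning in the sample validator output. The 14-word title scores about 0.85, well within the normal range.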


Combining Basic Test Scores

Validation specification describes
– which tests to apply to which fields
– how to normalize/scale test scores prior to combination
– how to combine field tests into field confidence
– how to combine field confidences into overall confidence


Partial Validation Spec – DTIC

<val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs…">
  <val:average>
    <val:field name="UnclassifiedTitle">
      <val:average>
        <val:dictionary/>
        <val:length/>
      </val:average>
    </val:field>
    <val:field name="PersonalAuthor">
      <val:min>
        <val:length/>
        <val:max>
          <val:phrases length="1"/>
          <val:phrases length="2"/>
          <val:phrases length="3"/>
        </val:max>
      </val:min>
    </val:field>
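The nesting in this partial spec can be read as ordinary function composition. The Python sketch below mirrors that structure with hypothetical function names, assuming each basic test has already been scaled to a score between 0 and 1. It illustrates the combination logic only and says nothing about how the Jelly tags are implemented.

```python
from statistics import mean


def combine_title(dictionary_score: float, length_score: float) -> float:
    """UnclassifiedTitle: average of the dictionary and length scores."""
    return mean([dictionary_score, length_score])


def combine_author(length_score: float, phrase_scores: list) -> float:
    """PersonalAuthor: min of the length score and the best phrase score."""
    return min(length_score, max(phrase_scores))


def combine_record(field_confidences: list) -> float:
    """Overall confidence: average of the field confidences."""
    return mean(field_confidences)
```

Taking the minimum means a poor length score cannot be rescued by good phrase scores, while the inner maximum lets any one phrase length vouch for the field.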


Validation script

Specification combined with extracted data
– to form an executable script
    – Apache Jelly project
Script executed to produce metadata record annotated with
– confidence values for each field
– warning/explanations for low-scoring fields
– overall confidence for output record


Sample Output From Validator

<metadata confidence="0.460"
          warning="ReportDate field does not match required pattern">
  <UnclassifiedTitle confidence="0.979">Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius</UnclassifiedTitle>
  <PersonalAuthor confidence="0.4"
                  warning="PersonalAuthor: unusual number of words">Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate confidence="0.0"
              warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby warning="unvalidated">Approved by: Thesis Committee Chair Jack D. Kem, Ph.D. , Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A. </approvedby>
  <acceptedby warning="unvalidated">Robert F. Baumann, Ph.D.</acceptedby>
</metadata>
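A downstream consumer could read these annotations to decide which records go to human correction. The sketch below uses Python's standard `xml.etree.ElementTree`. The 0.5 threshold and the function names are assumptions made for illustration; the slides do not state a cutoff.

```python
import xml.etree.ElementTree as ET

# Annotated record in the form shown above (long field text shortened).
ANNOTATED = """<metadata confidence="0.460"
    warning="ReportDate field does not match required pattern">
  <UnclassifiedTitle confidence="0.979">Thesis Title: Intrepidity</UnclassifiedTitle>
  <PersonalAuthor confidence="0.4"
      warning="PersonalAuthor: unusual number of words">Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate confidence="0.0"
      warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby warning="unvalidated">Approved by: Thesis Committee</approvedby>
</metadata>"""


def low_confidence_fields(xml_text: str, threshold: float = 0.5) -> list:
    """Return (tag, warning) for validated fields scoring below the threshold."""
    root = ET.fromstring(xml_text)
    flagged = []
    for field in root:
        conf = field.get("confidence")
        # Fields without a confidence attribute were not validated; skip them.
        if conf is not None and float(conf) < threshold:
            flagged.append((field.tag, field.get("warning")))
    return flagged


def needs_human_review(xml_text: str, threshold: float = 0.5) -> bool:
    """True when the record's overall confidence is below the threshold."""
    return float(ET.fromstring(xml_text).get("confidence")) < threshold
```

With the assumed threshold, the sample record is routed to review and the PersonalAuthor and ReportDate fields are flagged along with their warnings.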


Experimental Design

How effective is post-hoc classification?
Selected 2000 documents recently added to DTIC collection
– Visually classified by humans,
    – comparing to 10 most common layouts from studies of earlier documents
    – discarded documents not in one of those classes
    – 646 documents remained
Applied all templates, validated extracted metadata, selected highest confidence as the validator’s choice
Compared validator’s preferred layout to human choices


Exp. Design Justification

Directly models one source of error
– Document flaws
– OCR software failures
– Mis-classified layouts
– Template faults
– Extraction engine faults
Layouts involved include some that are very similar
– single-field failures typical of other error sources
Minimizes disputes among human judges
Relatively unaffected by continuing changes to extraction software


Validation Spec for Experiment

Similar to production spec except
– field scores combined by summation rather than by minimum or average
Simulated post-processing of extracted values
– extractor is WYSIWYG
– not always what is desired
    – e.g., “Major Matthew H. Fath” => “Fath, Matthew H.”
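The kind of post-processing in that example can be sketched as a small Python function. The rank and title list and the function name are hypothetical, and real names (suffixes, multi-word surnames) would need more care than this illustration gives them.

```python
# Hypothetical set of rank and title words to drop before reordering.
RANK_WORDS = {"major", "colonel", "lieutenant", "captain", "dr.", "mr.", "ms."}


def to_last_first(name: str) -> str:
    """Rewrite 'Rank First M. Last' as 'Last, First M.'."""
    parts = [p for p in name.split() if p.lower() not in RANK_WORDS]
    if len(parts) < 2:
        return name  # nothing sensible to reorder
    return f"{parts[-1]}, {' '.join(parts[:-1])}"
```

Under these assumptions, "Major Matthew H. Fath" becomes "Fath, Matthew H.", the form given in the example above.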


Automatic vs. Human Classifications

Post-hoc classifier agreed with human on 91% of cases


Conclusions

Important characteristics of this approach:
Aggressively Opportunistic
– lots of small, simple tests
Pragmatic
– heuristic combination of simple test results
Flexible
– scripting aids in
    – tuning heuristics
    – adaptation to different installations & input sets


Conclusions: Exploiting Validation Internally

Agreement rate between validator and humans far exceeds our best prior classifier algorithms
– based on geometric layout of text and graphic blocks
New classifier:
– apply all available templates to document
– score all outputs using validator
– choose top-scoring output set
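The three steps of the new classifier amount to an argmax over templates. A minimal Python sketch follows, in which `extract` and `validate` are hypothetical callables standing in for the extraction engine and the validator:

```python
def classify_by_validation(document, templates, extract, validate):
    """Apply every template, score each output, keep the top-scoring one.

    templates: dict mapping layout name -> template object
    extract(document, template) -> extracted metadata record
    validate(record) -> overall confidence (higher is better)
    Returns (layout name, confidence, record), or None if no templates.
    """
    best = None
    for name, template in templates.items():
        record = extract(document, template)
        confidence = validate(record)
        if best is None or confidence > best[1]:
            best = (name, confidence, record)
    return best
```

The cost grows linearly with the number of templates, since every template is run on every document.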
