A Scriptable, Statistical Oracle for a Metadata Extraction System
Kurt J. Maly, Steven J. Zeil, Mohammad Zubair, Ashraf Amrou, Ali Aazhar, Naveen Ratkal
STEV 2007, Portland OR, Oct. 12, 2007


Page 1: A Scriptable, Statistical Oracle for a Metadata Extraction System

Kurt J. Maly, Steven J. Zeil, Mohammad Zubair, Ashraf Amrou, Ali Aazhar, Naveen Ratkal

Oct. 12, 2007, STEV 2007, Portland OR

Page 2: The Problem

Dynamic validation of a program that
- mimics human behavior
- is imprecisely specified
- will vary widely in behavior
  – by user/installation
  – over time

Page 3: Overall Approach

- Apply a wide variety of tests on selected output properties
  – deterministic
  – statistical
- Combine tests heuristically
  – Combination controlled by scripts for flexibility

Page 4: Outline

- The Application: Metadata Extraction
- Dynamic Validation of the Extractor
- Evaluating the Validator
- Conclusions

Page 5: The Application: Metadata Extraction

- Large, diverse, growing government document collections
  – DTIC, NASA, GPO (EPA & Congress)
- Automated system to extract metadata from documents
  – Input: scanned page images or “text” PDF
  – Output: XML containing metadata fields, e.g., titles, authors, dates of publication, abstracts, release rights

[Figure: sample document page with labeled regions: Title, Authors, Authors Affiliation, Abstract, Introduction]

Page 6: Approach

- Classify documents by layout similarity
- Templates contain rules for extracting metadata from a specific layout
  – To keep templates simple, layout classes must be fairly detailed and specific

Page 7: Process Overview

[Figure: process flow diagram. Stages: OCR, Layout Classification, Extract Metadata, Validator, Human Correction, Enter into database. Data labels: document (PDF), document (XML), layout templates, selected template, metadata, untrusted metadata, trusted metadata, corrected metadata.]

Page 8: Sample Metadata Record (including mistakes)

<?xml version="1.0"?>
<metadata>
  <UnclassifiedTitle>Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius</UnclassifiedTitle>
  <PersonalAuthor>Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby>Approved by: Thesis Committee Chair Jack D. Kem, Ph.D. , Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A.</approvedby>
  <acceptedby>Robert F. Baumann, Ph.D.</acceptedby>
</metadata>

Page 9: Rationale for Dynamic Validation

- Sources of error
  – Document flaws
  – OCR software failures
  – Mis-classified layouts
  – Template faults
  – Extraction engine faults
- Software replaces expensive human-intensive process
- Moderately high (10-20%) failure rate is tolerable if we can identify which output sets are failures
  – route those sets to humans for inspection and correction

Page 10: Process Overview

[Figure: process flow diagram, repeated from Page 7. Stages: OCR, Layout Classification, Extract Metadata, Validator, Human Correction, Enter into database. Data labels: document (PDF), document (XML), layout templates, selected template, metadata, untrusted metadata, trusted metadata, corrected metadata.]

Page 11: Dynamic Validation

Challenges:
- imprecise specification
- low-level internal state not trusted as indicator of correct progress
- input characteristics vary from one document collection to another
- input characteristics may vary over time

Page 12: Approach

- Wide battery of basic tests can be applied to metadata fields
  – deterministic
  – statistical
- Basic test results combined heuristically
  – under control of custom scripting language

Page 13: Basic Tests - Deterministic

- date formats
- regular expressions
  – structured fields, e.g., report numbers
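A minimal Python sketch of what pass/fail deterministic tests of this kind might look like. The accepted date formats and the report-number pattern are assumptions made for illustration; the slides do not list the actual formats or expressions used.

```python
import re
from datetime import datetime

# Hypothetical date formats and report-number pattern (not from the slides).
DATE_FORMATS = ("%d %B %Y", "%B %Y", "%Y-%m-%d")
REPORT_NUMBER = re.compile(r"[A-Z]{2,6}-\d{2,4}-\d{1,5}")


def is_valid_date(text):
    """Pass if the whole field parses under one of the accepted formats."""
    text = text.strip()
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def matches_pattern(text, pattern=REPORT_NUMBER):
    """Pass if the whole field matches the regular expression."""
    return pattern.fullmatch(text.strip()) is not None
```

Under these assumed formats, a clean value such as "18 June 2004" passes, while the mis-extracted ReportDate from the sample record, "Accepted this 18th day of June 2004 by:", fails.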

Page 14: Basic Tests – Statistical

- Reference models from prior metadata (human extracted)
  – 850,000 records in DTIC collection
  – 20,000 records in NASA
- Measured field lengths
- Phrase dictionaries constructed for fields with specialized vocabularies
  – e.g., author, organization

Page 15: Statistics collected (mean & std dev)

- Field lengths
  – title, abstract, author, ...
- Dictionary detection rates for words in natural language fields
  – abstract, title, ...
- Phrase recurrence rates for fields with specialized vocabularies
  – author and organization

Page 16: Field Length (in words), DTIC collection

Field               Avg.   Std. Dev.
UnclassifiedTitle   9.9    4.8
Abstract            114    58
PersonalAuthor      2.8    0.5
CorporateAuthor     7      2.3
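As a worked illustration of how these norms could feed a statistical test, the sketch below computes a standard score (z-score) for a field's word count against the DTIC means and standard deviations in the table. Counting words by whitespace splitting is an assumption, and the function name is hypothetical.

```python
# Mean and standard deviation of field length in words (DTIC table above).
FIELD_LENGTH_NORMS = {
    "UnclassifiedTitle": (9.9, 4.8),
    "Abstract": (114.0, 58.0),
    "PersonalAuthor": (2.8, 0.5),
    "CorporateAuthor": (7.0, 2.3),
}


def length_standard_score(field, value):
    """Standard score of the field's word count relative to the reference model."""
    mean, std_dev = FIELD_LENGTH_NORMS[field]
    word_count = len(value.split())  # assumption: words are whitespace-separated
    return (word_count - mean) / std_dev


z_bad = length_standard_score("PersonalAuthor", "Name of Candidate: Major Matthew H. Fath")
z_good = length_standard_score("PersonalAuthor", "Matthew H. Fath")
```

The mis-extracted author value has 7 words, giving (7 - 2.8) / 0.5 = 8.4 standard deviations above the mean, while the 3-word name scores 0.4. How a standard score is then scaled into a confidence value is left to the validation specification and is not shown here.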

Page 17: Dictionary Detection (% of recognized words), DTIC collection

Field               Avg.   Std. Dev.
UnclassifiedTitle   88%    13%
Abstract            94%    5%
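A dictionary detection rate of this kind can be sketched as the fraction of a field's words that appear in a word list. The tokenization rule and the tiny word list below are assumptions for illustration only; the slides do not say which dictionary or tokenizer was used.

```python
import re


def dictionary_detection_rate(text, dictionary):
    """Fraction of alphabetic words in `text` that are found in `dictionary`."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    recognized = sum(1 for word in words if word in dictionary)
    return recognized / len(words)


# Hypothetical miniature word list; a real one would be a full dictionary.
toy_dictionary = {"iron", "will", "and", "general", "military", "genius"}
rate = dictionary_detection_rate("Iron Will, and Intellect", toy_dictionary)
```

Here three of the four words are recognized, so the rate is 0.75. That rate would then be compared against the collection norms in the table (for example 88% with a standard deviation of 13% for UnclassifiedTitle).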

Page 18: Phrase Dictionary Recurrence Rate, DTIC collection

Field             Phrase Length   Mean   Std. Dev.
PersonalAuthor    1               97%    11%
                  2               83%    32%
                  3               71%    45%
CorporateAuthor   1               100%   2.0%
                  2               99%    6.0%
                  3               99%    10%
                  4               99%    13%
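One plausible reading of a phrase recurrence rate is the fraction of n-word phrases in an extracted value that already occur somewhere in the human-extracted reference metadata. The sketch below follows that reading; the exact definition, the lower-casing, and the function names are assumptions, not taken from the slides.

```python
def phrases(text, n):
    """All runs of n consecutive whitespace-separated words in `text`."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_phrase_dictionary(reference_values, n):
    """Collect every n-word phrase seen in the reference field values."""
    dictionary = set()
    for value in reference_values:
        dictionary.update(phrases(value, n))
    return dictionary


def phrase_recurrence_rate(value, phrase_dictionary, n):
    """Fraction of the value's n-word phrases that recur in the dictionary."""
    found = phrases(value, n)
    if not found:
        return 0.0
    return sum(1 for p in found if p in phrase_dictionary) / len(found)


# Hypothetical reference values standing in for prior author fields.
reference_authors = ["Fath, Matthew H.", "Kem, Jack D."]
singles = build_phrase_dictionary(reference_authors, 1)
pairs = build_phrase_dictionary(reference_authors, 2)
```

With this toy reference set, "Fath, Matthew H." recurs fully at phrase length 1, while "Name of Candidate" does not recur at all, which is the kind of gap the PersonalAuthor norms above would flag.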

Page 19: Validation Procedure

- Selected basic tests are applied to extracted metadata field values
  – deterministic tests will pass or fail
  – statistical tests compare to norms from reference models
     standard score computed
- Test results for same field are combined to form field confidence
- Field confidences are combined to form overall confidence

Page 20: Combining Basic Test Scores

Validation specification describes
  – which tests to apply to which fields
  – how to normalize/scale test scores prior to combination
  – how to combine field tests into field confidence
  – how to combine field confidences into overall confidence

Page 21: Partial Validation Spec – DTIC

<val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs…">
  <val:average>
    <val:field name="UnclassifiedTitle">
      <val:average>
        <val:dictionary/>
        <val:length/>
      </val:average>
    </val:field>
    <val:field name="PersonalAuthor">
      <val:min>
        <val:length/>
        <val:max>
          <val:phrases length="1"/>
          <val:phrases length="2"/>
          <val:phrases length="3"/>
        </val:max>
      </val:min>
    </val:field>
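The nesting in this partial spec can be read as a small expression tree: average, min, and max nodes combine the scores of the basic tests beneath them. The Python sketch below mimics that reading with nested tuples. It is not the Jelly implementation, and the test names and score values are made up for illustration.

```python
from statistics import mean

# Combinators corresponding to the val:average, val:min, and val:max tags.
COMBINERS = {"average": mean, "min": min, "max": max}


def evaluate(node, scores):
    """Evaluate a spec tree; leaves are basic-test names looked up in `scores`."""
    if isinstance(node, str):
        return scores[node]
    operator, children = node
    return COMBINERS[operator]([evaluate(child, scores) for child in children])


# Trees mirroring the two fields in the partial DTIC spec.
TITLE_SPEC = ("average", ["dictionary", "length"])
AUTHOR_SPEC = ("min", ["length", ("max", ["phrases1", "phrases2", "phrases3"])])

# Hypothetical, already-normalized basic test scores in [0, 1].
title_scores = {"dictionary": 0.9, "length": 0.8}
author_scores = {"length": 0.4, "phrases1": 0.97, "phrases2": 0.5, "phrases3": 0.2}

title_confidence = evaluate(TITLE_SPEC, title_scores)
author_confidence = evaluate(AUTHOR_SPEC, author_scores)
overall_confidence = mean([title_confidence, author_confidence])
```

With these invented scores the author field takes the best of its three phrase tests (0.97) and is then capped by the weaker length score (0.4), so a single bad test can pull a field down even when the others look fine.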

Page 22: Validation script

- Specification combined with extracted data
  – to form an executable script
- Apache Jelly project
- Script executed to produce metadata record annotated with
  – confidence values for each field
  – warning/explanations for low-scoring fields
  – overall confidence for output record

Page 23: Sample Output From Validator

<metadata confidence="0.460" warning="ReportDate field does not match required pattern">
  <UnclassifiedTitle confidence="0.979">Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius</UnclassifiedTitle>
  <PersonalAuthor confidence="0.4" warning="PersonalAuthor: unusual number of words">Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby warning="unvalidated">Approved by: Thesis Committee Chair Jack D. Kem, Ph.D. , Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A.</approvedby>
  <acceptedby warning="unvalidated">Robert F. Baumann, Ph.D.</acceptedby>
</metadata>
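A downstream consumer of this annotated record could use the confidence attributes to decide whether to route the record to a human. The sketch below parses a shortened copy of the sample output with Python's standard ElementTree module. The 0.5 threshold and the function names are assumptions; the slides do not give a cut-off.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample validator output (field text abbreviated).
RECORD = """<metadata confidence="0.460" warning="ReportDate field does not match required pattern">
  <UnclassifiedTitle confidence="0.979">Thesis Title: Intrepidity, Iron Will, and Intellect</UnclassifiedTitle>
  <PersonalAuthor confidence="0.4" warning="PersonalAuthor: unusual number of words">Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby warning="unvalidated">Approved by: Thesis Committee</approvedby>
</metadata>"""

REVIEW_THRESHOLD = 0.5  # hypothetical cut-off, not from the slides


def needs_human_review(xml_text, threshold=REVIEW_THRESHOLD):
    """True if the record-level confidence falls below the threshold."""
    root = ET.fromstring(xml_text)
    return float(root.get("confidence")) < threshold


def low_confidence_fields(xml_text, threshold=REVIEW_THRESHOLD):
    """(tag, warning) pairs for validated fields scoring below the threshold."""
    root = ET.fromstring(xml_text)
    return [
        (child.tag, child.get("warning"))
        for child in root
        if child.get("confidence") is not None
        and float(child.get("confidence")) < threshold
    ]


flagged = low_confidence_fields(RECORD)
```

For the sample record this flags PersonalAuthor and ReportDate. The three field confidences shown (0.979, 0.4, 0.0) average to about 0.460, which matches the record-level value, though the slides do not state that this is how it was computed.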

Page 24: Experimental Design

- How effective is post-hoc classification?
- Selected 2000 documents recently added to DTIC collection
  – Visually classified by humans,
     comparing to 10 most common layouts from studies of earlier documents
     discarded documents not in one of those classes
     646 documents remained
- Applied all templates, validated extracted metadata, selected highest confidence as the validator’s choice
- Compared validator’s preferred layout to human choices

Page 25: Exp. Design Justification

- Directly models one source of error
  – Document flaws
  – OCR software failures
  – Mis-classified layouts
  – Template faults
  – Extraction engine faults
- Layouts involved include some that are very similar
  – single-field failures typical of other error sources
- Minimizes disputes among human judges
- Relatively unaffected by continuing changes to extraction software

Page 26: Validation Spec for Experiment

- Similar to production spec except
  – field scores combined by summation rather than by minimum or average
- Simulated post-processing of extracted values
  – extractor is WYSIWYG
  – not always what is desired
     e.g., “Major Matthew H. Fath” => “Fath, Matthew H.”
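The kind of post-processing in the example above can be sketched as stripping leading rank or title words and moving the surname to the front. The list of rank words and the assumption that the last token is the surname are illustrative only; the slides do not describe the actual rules.

```python
# Hypothetical set of rank/title words to strip from the front of a name.
RANK_WORDS = {"major", "colonel", "lieutenant", "captain", "general", "dr.", "mr.", "ms."}


def to_catalog_form(name):
    """Rewrite 'Rank Given Names Surname' as 'Surname, Given Names'."""
    tokens = name.split()
    while tokens and tokens[0].lower() in RANK_WORDS:
        tokens = tokens[1:]
    if len(tokens) < 2:
        return " ".join(tokens)
    *given, surname = tokens  # assumption: the last token is the surname
    return f"{surname}, {' '.join(given)}"
```

Applied to the slide's example, `to_catalog_form("Major Matthew H. Fath")` yields "Fath, Matthew H.". Names with suffixes or multi-word surnames would need more careful handling than this sketch provides.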

Page 27: Automatic vs. Human Classifications

- Post-hoc classifier agreed with human on 91% of cases

Page 28: Conclusions

Important characteristics of this approach:
- Aggressively Opportunistic
  – lots of small, simple tests
- Pragmatic
  – heuristic combination of simple test results
- Flexible
  – scripting aids in
     tuning heuristics
     adaptation to different installations & input sets

Page 29: Conclusions: Exploiting Validation Internally

- Agreement rate between validator and humans far exceeds our best prior classifier algorithms
  – based on geometric layout of text and graphic blocks
- New classifier:
  – apply all available templates to document
  – score all outputs using validator
  – choose top-scoring output set
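The three steps of the new classifier can be sketched as a loop over templates that keeps the highest-scoring output. In the Python sketch below, `extract` and `validate` are placeholders for the real extraction engine and validator, and the toy document, templates, and scoring rule are invented purely to make the example runnable.

```python
def classify_post_hoc(document, templates, extract, validate):
    """Apply every template, score each output, and return the best one."""
    scored = []
    for name, template in templates.items():
        record = extract(document, template)
        scored.append((validate(record), name, record))
    confidence, name, record = max(scored, key=lambda item: item[0])
    return name, record, confidence


# Toy stand-ins: a document is a list of lines, a template maps fields to lines.
def toy_extract(document, template):
    return {field: document[index] for field, index in template.items()}


def toy_validate(record):
    date_ok = any(ch.isdigit() for ch in record["ReportDate"])
    author_ok = len(record["PersonalAuthor"].split()) <= 4
    return (date_ok + author_ok) / 2


toy_document = ["Intrepidity, Iron Will, and Intellect", "Matthew H. Fath", "18 June 2004"]
toy_templates = {
    "thesis": {"UnclassifiedTitle": 0, "PersonalAuthor": 1, "ReportDate": 2},
    "report": {"UnclassifiedTitle": 2, "PersonalAuthor": 0, "ReportDate": 1},
}

best_name, best_record, best_confidence = classify_post_hoc(
    toy_document, toy_templates, toy_extract, toy_validate
)
```

In this toy setup the "thesis" template puts a date-like value in ReportDate and a short name in PersonalAuthor, so it scores 1.0 and is chosen over "report", which scores 0.0. The cost of the approach is that every template must be run on every document.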