Oct. 12, 2007, STEV 2007, Portland OR
A Scriptable, Statistical Oracle for a Metadata Extraction System
Kurt J. Maly, Steven J. Zeil, Mohammad Zubair, Ashraf Amrou, Ali Aazhar, Naveen Ratkal

The Problem
Dynamic validation of a program that
– mimics human behavior
– is imprecisely specified
– will vary widely in behavior
  – by user/installation
  – over time

Overall Approach
Apply a wide variety of tests on selected output properties
– deterministic
– statistical
Combine tests heuristically
– Combination controlled by scripts for flexibility

Outline
The Application: Metadata Extraction
Dynamic Validation of the Extractor
Evaluating the Validator
Conclusions

The Application: Metadata Extraction
Large, diverse, growing government document collections
– DTIC, NASA, GPO (EPA & Congress)
Automated system to extract metadata from documents
– Input: scanned page images or “text” PDF
– Output: XML containing metadata fields
  – e.g., titles, authors, dates of publication, abstracts, release rights
[Figure: sample document page with labeled regions: Authors, Title, Authors Affiliation, Abstract, Introduction]

Approach
Classify documents by layout similarity
Templates contain rules for extracting metadata from a specific layout
– To keep templates simple, layout classes must be fairly detailed and specific

Process Overview
[Figure: process flow diagram. Stages: OCR, Layout Classification, Extract Metadata, Validator, Human Correction, Enter into database. Flow labels: document (PDF), document (XML), layout templates, selected template, metadata, untrusted metadata, trusted metadata, corrected metadata.]

Sample Metadata Record (including mistakes)
<?xml version="1.0"?>
<metadata>
  <UnclassifiedTitle>Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius</UnclassifiedTitle>
  <PersonalAuthor>Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby>Approved by: Thesis Committee Chair Jack D. Kem, Ph.D. , Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A.</approvedby>
  <acceptedby>Robert F. Baumann, Ph.D.</acceptedby>
</metadata>

Rationale for Dynamic Validation
Sources of error
– Document flaws
– OCR software failures
– Mis-classified layouts
– Template faults
– Extraction engine faults
Software replaces expensive human-intensive process
Moderately high (10-20%) failure rate is tolerable if we can identify which output sets are failures
– route those sets to humans for inspection and correction

Dynamic Validation
Challenges:
– imprecise specification
– low-level internal state not trusted as indicator of correct progress
– input characteristics vary from one document collection to another
– input characteristics may vary over time

Approach
Wide battery of basic tests can be applied to metadata fields
– deterministic
– statistical
Basic test results combined heuristically
– under control of custom scripting language

Basic Tests – Deterministic
date formats
regular expressions
– structured fields, e.g., report numbers
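
A minimal sketch of what deterministic pass/fail tests of this kind might look like. The accepted date layouts and the report-number pattern below are assumptions made for illustration; the slides do not list the actual formats or patterns used.

```python
import re
from datetime import datetime

# Hypothetical accepted date layouts (not taken from the slides).
DATE_FORMATS = ("%d %B %Y", "%B %Y", "%Y-%m-%d")

# Hypothetical report-number pattern, e.g. "AB-2004-0123".
REPORT_NUMBER = re.compile(r"[A-Z]{2,5}-\d{4}-\d{3,5}")


def date_test(value: str) -> bool:
    """Pass if the whole field parses under one of the accepted layouts."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value.strip(), fmt)
            return True
        except ValueError:
            continue
    return False


def pattern_test(value: str, pattern: re.Pattern) -> bool:
    """Pass if the whole field matches the regular expression."""
    return pattern.fullmatch(value.strip()) is not None
```

Under these assumed formats, a clean value such as "18 June 2004" passes, while the mis-extracted "Accepted this 18th day of June 2004 by:" from the sample record fails.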

Basic Tests – Statistical
Reference models from prior metadata (human extracted)
– 850,000 records in DTIC collection
– 20,000 records in NASA
Measured field lengths
Phrase dictionaries constructed for fields with specialized vocabularies
– e.g., author, organization

Statistics collected (mean & std dev)
Field lengths
– title, abstract, author, ...
Dictionary detection rates for words in natural language fields
– abstract, title, ...
Phrase recurrence rates for fields with specialized vocabularies
– author and organization

Field Length (in words), DTIC collection
Field               Avg.    Std. Dev.
UnclassifiedTitle     9.9    4.8
Abstract            114     58
PersonalAuthor        2.8    0.5
CorporateAuthor       7      2.3
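
As a worked illustration of how these norms could feed a statistical test, the sketch below computes the standard score z = (x - mean) / std dev for a field's word count. The means and standard deviations are copied from the table; splitting on whitespace to count words is an assumption, as is the function name.

```python
# (mean, std dev) of field length in words, copied from the DTIC table.
LENGTH_NORMS = {
    "UnclassifiedTitle": (9.9, 4.8),
    "Abstract": (114.0, 58.0),
    "PersonalAuthor": (2.8, 0.5),
    "CorporateAuthor": (7.0, 2.3),
}


def length_z_score(field: str, value: str) -> float:
    """Standard score of the field's word count against the reference norm."""
    mean, std = LENGTH_NORMS[field]
    word_count = len(value.split())  # assumption: words are whitespace-separated
    return (word_count - mean) / std


bad = length_z_score("PersonalAuthor", "Name of Candidate: Major Matthew H. Fath")
good = length_z_score("PersonalAuthor", "Fath, Matthew H.")
print(round(bad, 2), round(good, 2))
```

The mis-extracted author value has 7 words, about 8.4 standard deviations above the norm, while the cleaned-up 3-word form lies within half a standard deviation.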

Dictionary Detection (% of recognized words), DTIC collection
Field               Avg.   Std. Dev.
UnclassifiedTitle   88%    13%
Abstract            94%     5%
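
One plausible reading of "dictionary detection rate" is the fraction of a field's words found in a natural-language word list. The sketch below assumes that definition, a simple alphabetic tokenizer, and a toy dictionary; none of these details are given in the slides.

```python
import re
from typing import Set


def dictionary_detection_rate(text: str, dictionary: Set[str]) -> float:
    """Fraction of alphabetic words in `text` that appear in `dictionary`."""
    words = re.findall(r"[a-z]+", text.lower())  # assumed tokenization
    if not words:
        return 0.0
    recognized = sum(1 for word in words if word in dictionary)
    return recognized / len(words)


# Toy dictionary for illustration only.
toy_dictionary = {"iron", "will", "and", "intellect"}
rate = dictionary_detection_rate("Iron Will, and Intelect", toy_dictionary)
print(rate)
```

Here the misspelled "Intelect" (standing in for an OCR error) is the one unrecognized word out of four. A rate computed this way could then be compared with the collection norms in the table in the same way as a field length.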

Phrase Dictionary Recurrence Rate, DTIC collection
Field             Phrase Length   Mean   Std. Dev.
PersonalAuthor    1               97%    11%
PersonalAuthor    2               83%    32%
PersonalAuthor    3               71%    45%
CorporateAuthor   1               100%   2.0%
CorporateAuthor   2               99%    6.0%
CorporateAuthor   3               99%    10%
CorporateAuthor   4               99%    13%
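
The slides do not define the recurrence rate precisely. A reasonable interpretation, sketched here as an assumption, is the fraction of a field's word n-grams (phrases of a given length) that already occur in a phrase dictionary built from prior, human-extracted values. The helper names and the tokenizer are hypothetical.

```python
import re
from typing import Iterable, List, Set, Tuple

Phrase = Tuple[str, ...]


def tokenize(value: str) -> List[str]:
    """Assumed tokenization: lowercase alphabetic runs."""
    return re.findall(r"[a-z]+", value.lower())


def ngrams(words: List[str], n: int) -> List[Phrase]:
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_phrase_dictionary(prior_values: Iterable[str], n: int) -> Set[Phrase]:
    """Collect every phrase of length n seen in prior metadata values."""
    phrases: Set[Phrase] = set()
    for value in prior_values:
        phrases.update(ngrams(tokenize(value), n))
    return phrases


def recurrence_rate(value: str, phrases: Set[Phrase], n: int) -> float:
    """Fraction of the value's length-n phrases found in the dictionary."""
    grams = ngrams(tokenize(value), n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in phrases) / len(grams)


prior = ["Fath, Matthew H.", "Kem, Jack D."]  # toy reference data
unigrams = build_phrase_dictionary(prior, 1)
bigrams = build_phrase_dictionary(prior, 2)
extracted = "Name of Candidate: Major Matthew H. Fath"
print(recurrence_rate(extracted, unigrams, 1), recurrence_rate(extracted, bigrams, 2))
```

With this toy data, only 3 of the 7 single words and 1 of the 6 two-word phrases in the mis-extracted author field recur, far below the PersonalAuthor means in the table.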

Validation Procedure
Selected basic tests are applied to extracted metadata field values
– deterministic tests will pass or fail
– statistical tests compare to norms from reference models
  – standard score computed
Test results for same field are combined to form field confidence
Field confidences are combined to form overall confidence

Combining Basic Test Scores
Validation specification describes
– which tests to apply to which fields
– how to normalize/scale test scores prior to combination
– how to combine field tests into field confidence
– how to combine field confidences into overall confidence

Partial Validation Spec – DTIC
<val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs…">
  <val:average>
    <val:field name="UnclassifiedTitle">
      <val:average>
        <val:dictionary/>
        <val:length/>
      </val:average>
    </val:field>
    <val:field name="PersonalAuthor">
      <val:min>
        <val:length/>
        <val:max>
          <val:phrases length="1"/>
          <val:phrases length="2"/>
          <val:phrases length="3"/>
        </val:max>
      </val:min>
    </val:field>
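
To make the nesting of combinators concrete, the sketch below mirrors the two field entries of the partial spec as plain Python data and evaluates them bottom-up. This is not the Jelly tag library itself: the tuple encoding, the leaf names such as "phrases1", and the basic test scores (assumed already scaled to 0..1) are all hypothetical.

```python
from statistics import mean

COMBINERS = {"average": mean, "min": min, "max": max}

# A node is either a leaf test name or (combiner, [child nodes]).
FIELD_SPECS = {
    "UnclassifiedTitle": ("average", ["dictionary", "length"]),
    "PersonalAuthor": ("min", ["length",
                               ("max", ["phrases1", "phrases2", "phrases3"])]),
}
OVERALL_COMBINER = "average"  # the outer <val:average> in the spec


def evaluate(node, scores):
    """Recursively combine basic test scores according to the spec tree."""
    if isinstance(node, str):
        return scores[node]
    combiner, children = node
    return COMBINERS[combiner]([evaluate(child, scores) for child in children])


def validate(record_scores):
    """Return (per-field confidences, overall confidence)."""
    field_conf = {name: evaluate(spec, record_scores[name])
                  for name, spec in FIELD_SPECS.items()}
    overall = COMBINERS[OVERALL_COMBINER](list(field_conf.values()))
    return field_conf, overall


# Hypothetical, already-normalized basic test scores for one record.
scores = {
    "UnclassifiedTitle": {"dictionary": 0.9, "length": 1.0},
    "PersonalAuthor": {"length": 0.4, "phrases1": 0.5,
                       "phrases2": 0.2, "phrases3": 0.1},
}
field_conf, overall = validate(scores)
print(field_conf, overall)
```

Reading the PersonalAuthor entry this way, the best of the three phrase tests is taken first, and the field is then only as trustworthy as the weaker of that result and the length test.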

Validation script
Specification combined with extracted data
– to form an executable script
  – Apache Jelly project
Script executed to produce metadata record annotated with
– confidence values for each field
– warnings/explanations for low-scoring fields
– overall confidence for output record

Sample Output From Validator
<metadata confidence="0.460"
          warning="ReportDate field does not match required pattern">
  <UnclassifiedTitle confidence="0.979">Thesis Title: Intrepidity, Iron Will, and Intellect: General Robert L. Eichelberger and Military Genius</UnclassifiedTitle>
  <PersonalAuthor confidence="0.4"
                  warning="PersonalAuthor: unusual number of words">Name of Candidate: Major Matthew H. Fath</PersonalAuthor>
  <ReportDate confidence="0.0"
              warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
  <approvedby warning="unvalidated">Approved by: Thesis Committee Chair Jack D. Kem, Ph.D. , Member Mr. Charles S. Soby, M.B.A. , Member Lieutenant Colonel John A. Suprin, M.A.</approvedby>
  <acceptedby warning="unvalidated">Robert F. Baumann, Ph.D.</acceptedby>
</metadata>
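
A quick arithmetic check on the sample: the overall confidence of 0.460 is consistent with a plain average of the three validated field confidences, assuming fields marked "unvalidated" are left out of the combination. That exclusion is an inference from the numbers, not something the slides state.

```python
# Field confidences copied from the sample validator output.
field_confidences = {
    "UnclassifiedTitle": 0.979,
    "PersonalAuthor": 0.4,
    "ReportDate": 0.0,
}

# Assumption: unvalidated fields (approvedby, acceptedby) do not contribute.
overall = sum(field_confidences.values()) / len(field_confidences)
print(f"{overall:.3f}")  # prints 0.460
```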

Experimental Design
How effective is post-hoc classification?
Selected 2000 documents recently added to DTIC collection
– Visually classified by humans,
  – comparing to 10 most common layouts from studies of earlier documents
  – discarded documents not in one of those classes
  – 646 documents remained
Applied all templates, validated extracted metadata, selected highest confidence as the validator’s choice
Compared validator’s preferred layout to human choices

Exp. Design Justification
Directly models one source of error
– Document flaws
– OCR software failures
– Mis-classified layouts
– Template faults
– Extraction engine faults
Layouts involved include some that are very similar
– single-field failures typical of other error sources
Minimizes disputes among human judges
Relatively unaffected by continuing changes to extraction software

Validation Spec for Experiment
Similar to production spec except
– field scores combined by summation rather than by minimum or average
Simulated post-processing of extracted values
– extractor is WYSIWYG
– not always what is desired
  – e.g., “Major Matthew H. Fath” => “Fath, Matthew H.”
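
The kind of post-processing in the example above can be sketched as follows. The list of rank and title prefixes and the "last word is the surname" rule are simplifying assumptions for illustration; the slides do not describe how the simulated post-processing was actually done.

```python
# Hypothetical set of rank/title prefixes to strip before reordering.
RANK_PREFIXES = {"major", "lieutenant", "colonel", "captain", "mr.", "ms.", "dr."}


def normalize_author(raw: str) -> str:
    """Rewrite 'Rank Given Names Surname' as 'Surname, Given Names'."""
    words = raw.split()
    # Drop leading rank or title words.
    while words and words[0].lower() in RANK_PREFIXES:
        words.pop(0)
    if len(words) < 2:
        return " ".join(words)
    *given, surname = words  # assumption: the last word is the surname
    return f"{surname}, {' '.join(given)}"


print(normalize_author("Major Matthew H. Fath"))  # prints Fath, Matthew H.
```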

Automatic vs. Human Classifications
Post-hoc classifier agreed with human on 91% of cases

Conclusions
Important characteristics of this approach:
Aggressively Opportunistic
– lots of small, simple tests
Pragmatic
– heuristic combination of simple test results
Flexible
– scripting aids in
  – tuning heuristics
  – adaptation to different installations & input sets

Conclusions: Exploiting Validation Internally
Agreement rate between validator and humans far exceeds our best prior classifier algorithms
– based on geometric layout of text and graphic blocks
New classifier:
– apply all available templates to document
– score all outputs using validator
– choose top-scoring output set
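
The three steps of the new classifier can be sketched as a simple arg-max over templates. The extractor and validator here are stand-in callables with toy behavior; their signatures are assumptions, not the system's real interfaces.

```python
from typing import Callable, Dict, Tuple

Extractor = Callable[[str], Dict[str, str]]   # document -> metadata fields
Validator = Callable[[Dict[str, str]], float]  # metadata -> overall confidence


def classify_by_validation(document: str,
                           templates: Dict[str, Extractor],
                           validate: Validator) -> Tuple[str, Dict[str, str], float]:
    """Apply every template, score each output, keep the top-scoring one."""
    if not templates:
        raise ValueError("no templates supplied")
    best = None
    for name, extract in templates.items():
        metadata = extract(document)
        confidence = validate(metadata)
        if best is None or confidence > best[2]:
            best = (name, metadata, confidence)
    return best


# Toy stand-ins: one template misses the author, the other finds both fields.
toy_templates = {
    "thesis": lambda doc: {"title": doc, "author": ""},
    "report": lambda doc: {"title": doc, "author": "Fath, Matthew H."},
}
toy_validate = lambda md: sum(1 for v in md.values() if v) / len(md)

layout, metadata, confidence = classify_by_validation(
    "Some Title", toy_templates, toy_validate)
print(layout, confidence)
```

The cost of this design is one extraction and one validation pass per template, which trades extra computation for not needing a separate up-front layout classifier.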