ListReader: Inducing Wrappers for OCRed Lists to Efficiently Populate Ontologies Thomas L. Packer October 6, 2014 1


ListReader: Inducing Wrappers for OCRed Lists to Efficiently Populate Ontologies

Thomas L. Packer
October 6, 2014


Dissertation in a Nutshell

• Challenge: Low-cost ontology population
• Focus: Data in lists of OCRed documents
• Thesis: Wrapper induction is effective


Outline

• Motivation & Challenges
• Solutions
  – Data Mapping
  – Local Wrapper Induction
  – Global Wrapper Induction
• Conclusions


Motivations & Challenges


Motivation


Other Domains and Applications

• Printed receipts
  – Marketing app. (Itemize.com)
  – Personal finance app. (Itemize.com)
  – Travel reimbursement app.
  – Nutrition app. (Noshly.com)
• Digitize library catalog
• Citation metrics
• Museum Specimen Labels


OCR Text Challenges

OCR

Captain Donald "Dude" Bakken ............... Right Half Back\nLeRov "Sonny' Johnson ........ ..........,.... Lcft Half Back\nOrley Bakken ...........,........ ... ,.......... Quarter Back\nRoger Myhrum .............. ..................... Full Back\nBill "Schnozz" Krohg .............. ................ Center\nHoward "Little Huby" Megorden ................ Right Guard\nRoyce "Shorty" Norgaard ....................... Left Guard\nEugene "Mad Russian" Easthind ............... Right Tackle\n

<HTML>… <OL> <LI>Captain Donald "Dude" Bakken ............... Right Half Back</LI> <LI>LeRoy "Sonny” Johnson ....................... Left Half Back</LI> <LI>Orley Bakken .................................. Quarter Back</LI> <LI>Roger Myhrum ................................... Full Back</LI> <LI>Bill "Schnozz" Krohg .............................. Center</LI> <LI>Howard "Little Huby" Megorden ................ Right Guard</LI> <LI>Royce "Shorty" Norgaard ....................... Left Guard</LI> <LI>Eugene "Mad Russian" Eastlind ............... Right Tackle</LI>…

vs. a Web Page


“Big Data” Challenges


Human Effort Challenges

Given Names: Ali, Alison, Alex, Andie, Andy, Ariel, Caroline, Chris, Cindy, Claire, Corey, Daisy, Diane, …

Surnames: Adams, Allen, Anderson, Baker, Brown, Campbell, Clark, Davis, García, González, Green, Hall, Harris, …

Months: Jan., January, Feb., February, Mar., March, Apr., April, May, Jun., June, Jul., July, …

Predicates: b., born, born on, baptized, was baptized, was baptized on, was baptized in, d., died, died on, m., married, was married to, …

Regular Expressions for Dates
^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$
^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d$
…

Regular Expression for Records
(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z …


Solutions


Summary of Related Work and Present Contributions

Criteria

Input Handling and Project Scope
  1. OCR / Noise Tolerant
  2. General Text List Detection
  3. General Text List Info. Extr.
Process Scalability
  4. Alg. Time & Space Complexity
  5. Human Cost: Engineering Knowledge
  6. Human Cost: Annotating Examples
IE Output
  7. Rich Ontology Population
  8. Precision
  9. Recall

System                  1          2          3          4          5          6          7          8          9

Related Work
Grammar Induction       NR         NR         NR         Bad        Very Good  Very Good  NR         NR         NR
Traditional IE *        Good       OK         OK         OK         Bad        OK         OK         Good       Good
Web Wrapper Induction   NR         NR         OK         Good       OK         Good       OK         Good       Good
Specialized DAR Systems Good       OK         OK         NR         Bad        Bad        OK         Good       Good

Dissertation Work
List Detection          OK         Good       NR         Very Good  OK         Very Good  NR         NR         NR
Local Regex             Good       OK         Good       Bad        Good       OK         Good       Very Good  OK
Local HMM               Good       OK         Good       OK         Good       OK         Good       Good       Good
Global Regex            OK         Good       Good       Very Good  Good       Good       Good       Very Good  OK
Global HMM              Good       Good       Good       Good       Good       Good       Good       Good       Good

* Our baseline comparison system is a Conditional Random Field (CRF)


Data Mapping (Packer and Embley, 2014b, 2014a, 2013)


Typical Data Extraction

Named Entity Recognition (Categorizing)

Flat Attribute Extraction (Grouping and Categorizing)

Relation Extraction (Grouping and Categorizing)

Object-Set Labels


Ontology-Path Labels

ListReader Data Extraction

Flexible, Unified Ontology

Labels
Person.(MarriageDate)SpouseName[1] (Edward Hill)
Person.(SpouseName)MarriageDate[1] (1801)

Predicates
Person(p1)
SpouseName(“Edward Hill”)
MarriageDate(“1801”)
Person-SpouseName-MarriageDate(p1, “Edward Hill”, “1801”)


Local Regex (Packer and Embley, 2013)


ListReader (Local)

[Workflow diagram. Labels shown: Scan Book; Images; OCR Text; ListReader (Find Contiguous List; Build Form & Label First Record; Induce Wrapper; Label Selected Insertions); Data Entry Form (“click”); Text to Ontology Mapping; Database Querying & Reasoning.]


Local Wrapper Induction

Input text:
\n1. Andrew b. 1772\n\n2. William Lee h, i774\n\n3. Nathaniel Griswold\n

1. Initialize Regex (from user-supplied labels)
\n<BO>1</BO>. <GN>Andrew</GN> b. <BY>1772</BY>\n\n2. William Lee h, i774\n\n3. Nathaniel Griswold\n
BO GN BY: \n(1)\. (Andrew) b\. (1772)\n

2. Generalize Regex
BO GN BY: \n([\dlio])[.,] (\w{6}) [bh][.,] ([\dlio]{4})\n

3. Alignment-Search (Deletion, Insertion, Expansion)
Deletion: BO GN BY: \n([\dlio])[.,] (\w{6}) [bh][.,] ([\dlio]{4})\n X
Insertion: BO GN Unknown BY: \n([\dlio])[.,] (\w{6,7}) (\S{1,10}) [bh][.,] ([\dlio]{4})\n

4. Evaluate (edit sim. × match freq.)
One match! / No match

5. User labels insertions
\n<BO>1</BO>. <GN>Andrew</GN> b. <BY>1772</BY>\n\n2. William <GN>Lee</GN> h, i774\n\n3. Nathaniel Griswold\n
BO GN GN BY: \n([\dlio])[.,] (\w{4,5}) (\w{3}) [bh][.,] ([\dlio]{4})\n

6. Extract
\n<BO>1</BO>. <GN>Andrew</GN> b. <BY>1772</BY>\n\n<BO>2</BO>. <GN>William</GN> <GN>Lee</GN> h, <BY>i774</BY>\n\n<BO>3</BO>. <GN>Nathaniel</GN> <GN>Griswold</GN>\n
Many more …

<BO> = Birth Order
<GN> = Given Name
<BY> = Birth Year
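
As a minimal sketch of how a generalized regex and an insertion candidate behave on the sample text of this slide, the two patterns can be run with Python's `re` module. The `extract` helper and the label tuples are illustrative names chosen here, not part of ListReader, and the label order assumes the first capture group is the birth order.

```python
import re

# Sample OCR text from the slide, including the OCR errors "h," and "i774".
TEXT = "\n1. Andrew b. 1772\n\n2. William Lee h, i774\n\n3. Nathaniel Griswold\n"

# Generalized regex: [\dlio] tolerates digits that OCR misread as l, i or o.
GENERALIZED = re.compile(r"\n([\dlio])[.,] (\w{6}) [bh][.,] ([\dlio]{4})\n")

# Insertion candidate: one extra, not yet labeled field after the given name.
WITH_INSERTION = re.compile(
    r"\n([\dlio])[.,] (\w{6,7}) (\S{1,10}) [bh][.,] ([\dlio]{4})\n"
)


def extract(pattern, labels, text):
    """Pair each capture group with its label for every match in text."""
    return [dict(zip(labels, m.groups())) for m in pattern.finditer(text)]


first = extract(GENERALIZED, ("BO", "GN", "BY"), TEXT)
second = extract(WITH_INSERTION, ("BO", "GN", "Unknown", "BY"), TEXT)
print(first)   # only record 1 matches the generalized regex
print(second)  # only record 2 matches the insertion candidate
```

The "Unknown" group is the field a user would be asked to label before extraction continues.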


Algorithmic Complexity

\n1. Andrew, b. 1772.\n2. Clarissa, b. 1774.\n

\n1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.\n

[Four panels, numbered 1 to 4, each showing the pair of texts above.]

4 of 2^(mn) = 1.987e+233
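
The printed magnitude can be sanity-checked with a few lines of Python. The slide does not state m or n; the exponent below is only inferred from the printed value, so treat it as an assumption.

```python
import math

printed = 1.987e233                 # value shown on the slide
mn = round(math.log2(printed))      # exponent implied by the printed value
print(mn)
print(f"{2 ** mn:.3e}")             # format 2^(mn) the same way as the slide
```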


Alignment using A* Search

Start State (Regex for first record)
Goal State (Regex matching next record)

[Search-tree diagram. States shown: A B C E F G H; A B C E F G; A B C’ D E F; A B C’ E F G; A B C’ E F. Edit operations shown: Insertion @ 4; Substitution @ 3; Insertion @ 7; Deletion @ 6 (never traverses this branch).]

f(r) = g(r) + h(r): 4 = 1 + 3
f(r) = g(r) + h(r): 3 = 1 + 2

Branching Factor = 2 * 4 = 8
Search Tree Depth = 3
This search space size = ~10 instead of 13,824
Other search space sizes = ~1000 instead of 587,068,342,272

X = Part of candidate regex that does not match the next record
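
The idea of ranking candidate edits by f(r) = g(r) + h(r) can be sketched as a small A* search over symbol sequences. This is a simplified illustration under stated assumptions, not the dissertation's algorithm: the goal here is a fixed target sequence rather than "any regex that matches the next record", the start and goal sequences are one possible reading of the diagram, and the heuristic (a multiset-difference lower bound on the number of edits) is a stand-in chosen for this sketch.

```python
import heapq
from collections import Counter

# Assumed reading of the diagram: start and goal symbol sequences.
START = ("A", "B", "C", "E", "F", "G")
GOAL = ("A", "B", "C'", "D", "E", "F")


def heuristic(state, goal):
    """Lower bound on remaining edits: each edit fixes at most one
    missing symbol and at most one extra symbol."""
    have, want = Counter(state), Counter(goal)
    missing = sum((want - have).values())
    extra = sum((have - want).values())
    return max(missing, extra)


def successors(state, alphabet):
    """All states reachable by one insertion, deletion or substitution."""
    for i in range(len(state) + 1):
        for sym in alphabet:
            yield state[:i] + (sym,) + state[i:]          # insertion
    for i in range(len(state)):
        yield state[:i] + state[i + 1:]                   # deletion
        for sym in alphabet:
            if sym != state[i]:
                yield state[:i] + (sym,) + state[i + 1:]  # substitution


def a_star(start, goal):
    """Return (edit cost, number of expanded states) for the cheapest path."""
    alphabet = sorted(set(goal))
    best_cost = {start: 0}
    frontier = [(heuristic(start, goal), 0, start)]  # entries are (f, g, state)
    expanded = 0
    while frontier:
        _, cost, state = heapq.heappop(frontier)
        if state == goal:
            return cost, expanded
        if cost > best_cost[state]:
            continue  # stale queue entry
        expanded += 1
        for nxt in successors(state, alphabet):
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                f = new_cost + heuristic(nxt, goal)
                heapq.heappush(frontier, (f, new_cost, nxt))
    return None


cost, expanded = a_star(START, GOAL)
print(cost, expanded)
```

Because the heuristic never overestimates the remaining edits, the first time the goal is popped its cost is minimal, which corresponds to the tree depth of 3 shown on the slide.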


ListReader Accuracy

Tested on 60 hand-labeled lists

(1254 fields)

Conditional Random Field from http://mallet.cs.umass.edu/


Global Regex & HMM (Packer and Embley, 2014(a, b))


ListReader (Global)

[Workflow diagram. Labels shown: Scan Book; Images; OCR Text; ListReader (Find Records; Induce Grammar; Build Form & Label Selected Patterns); Data Entry Form (“click”); Text to Ontology Mapping; Database Querying & Reasoning.]


Global Induction: Basic Ideas

Document Text

1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.
2. Elizabeth, b. 1774, d. 1851, m. 1801 Edward Hill.
3. Lucia, b. 1777, d. 1778.
4. Lucia Mather, b. 1779, d. 1870, m. John Marvin.
5. Polly, b. 1782.
6. Phebe, b. 1783, d. 1805.
7. William Richard Henry, b. 1787, d. 1796.
8. Margaret Stoutenburgh, b. 1794.

Clustered and Aligned Text

5. Polly, b. 1782.
8. Margaret Stoutenburgh, b. 1794.

3. Lucia, b. 1777, d. 1778.
6. Phebe, b. 1783, d. 1805.
7. William Richard Henry, b. 1787, d. 1796.

1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.
4. Lucia Mather, b. 1779, d. 1870, m. John Marvin.

2. Elizabeth, b. 1774, d. 1851, m. 1801 Edward Hill.
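
A rough way to reproduce this grouping is to reduce every record to a coarse signature and group records whose signatures are equal. The `signature` function and its placeholder tokens (`#`, `NAME`, `YEAR`) are assumptions made for this sketch; ListReader itself works on conflated text patterns rather than on these hand-picked substitutions.

```python
import re
from collections import defaultdict

RECORDS = [
    "1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.",
    "2. Elizabeth, b. 1774, d. 1851, m. 1801 Edward Hill.",
    "3. Lucia, b. 1777, d. 1778.",
    "4. Lucia Mather, b. 1779, d. 1870, m. John Marvin.",
    "5. Polly, b. 1782.",
    "6. Phebe, b. 1783, d. 1805.",
    "7. William Richard Henry, b. 1787, d. 1796.",
    "8. Margaret Stoutenburgh, b. 1794.",
]


def signature(record):
    """Replace the record number, name sequences and years by placeholders."""
    sig = re.sub(r"^\d+", "#", record)
    sig = re.sub(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", "NAME", sig)
    sig = re.sub(r"\d{4}", "YEAR", sig)
    return sig


clusters = defaultdict(list)
for record in RECORDS:
    number = int(record.split(".")[0])
    clusters[signature(record)].append(number)

for sig, numbers in clusters.items():
    print(numbers, sig)
```

Records 5 and 8 share one signature, 3, 6 and 7 another, 1 and 4 a third, and record 2 stands alone, matching the four groups above.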


Conflated Superfluous Distinctions

Conflated pattern: … [UpLo+];[Sp][Dg].[Sp][UpLo+] … .[Sp][Dg].[Sp][UpLo+] …

OCR text: … Rich-\nard Mather ;\n5. PoUy … .\n6. Phebe …

1. Word Split over Newline
2. Numeral
3. Word
4. Space
5. Space Erroneously Inserted by OCR
6. Capitalized Word Sequence
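
The conflation shown on this slide can be approximated with a small tokenizer. The sketch below is an assumption-laden illustration, not ListReader's implementation: the `conflate` function, the token regex and the two preprocessing steps are chosen here only to reproduce the slide's example.

```python
import re

# One alternative per conflated class; anything else is kept literally.
TOKEN = re.compile(
    r"(?P<caps>[A-Z][A-Za-z]*(?:[ \n][A-Z][A-Za-z]*)*)"  # capitalized word sequence
    r"|(?P<lower>[a-z]+)"                                # lowercase word
    r"|(?P<digits>\d+)"                                  # numeral
    r"|(?P<space>\s+)"                                   # space or newline
    r"|(?P<other>.)"                                     # punctuation
)


def conflate(text):
    """Map raw OCR text to a pattern such as [Sp][Dg].[Sp][UpLo+]."""
    text = text.replace("-\n", "")            # rejoin words split over a newline
    text = re.sub(r" +(?=[,;.])", "", text)   # drop spaces OCR put before punctuation
    parts = []
    for m in TOKEN.finditer(text):
        kind = m.lastgroup
        if kind == "caps":
            parts.append("[UpLo+]")
        elif kind == "lower":
            parts.append("[Lo]")
        elif kind == "digits":
            parts.append("[" + "Dg" * len(m.group()) + "]")
        elif kind == "space":
            parts.append("[Sp]")
        else:
            parts.append(m.group())
    return "".join(parts)


print(conflate("Rich-\nard Mather ;\n5. PoUy"))
print(conflate("\n1. Andrew, b. 1772.\n"))
print(conflate("\n5. PoUy , b. 1782.\n"))
```

With these rules a clean record and a noisy one ("PoUy ," with a stray space) map to the same pattern, which is the point of conflating superfluous distinctions.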


Suffix Tree


[UpLo+];[Sp][Dg].[Sp][UpLo+].[Sp][Dg].[Sp][UpLo+].[Sp]

[Sp][Dg].[Sp][UpLo+].[Sp]
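
To illustrate what the suffix tree is used for here, the repeated pattern can be recovered from the longer one with a much simpler stand-in: sort all suffixes of the token sequence and take the longest common prefix of neighbouring suffixes. This sorted-suffix approach is an assumption for the sketch (it is far less efficient than a real suffix tree), and `tokenize` and `longest_repeated` are hypothetical helper names.

```python
import re

PATTERN = "[UpLo+];[Sp][Dg].[Sp][UpLo+].[Sp][Dg].[Sp][UpLo+].[Sp]"


def tokenize(pattern):
    """Split a conflated pattern into bracketed classes and single characters."""
    return re.findall(r"\[[^\]]+\]|.", pattern)


def longest_repeated(tokens):
    """Longest token run that occurs at least twice (occurrences may overlap)."""
    suffixes = sorted(tokens[i:] for i in range(len(tokens)))
    best = []
    for a, b in zip(suffixes, suffixes[1:]):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        if n > len(best):
            best = a[:n]
    return best


repeated = "".join(longest_repeated(tokenize(PATTERN)))
print(repeated)
```

The two occurrences share the boundary [Sp], which is why overlapping repeats are allowed in this sketch.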


Identify and Parse Field Groups

[Sp][Dg].[Sp][UpLo+],[Sp][Lo].[Sp][DgDgDgDg].[Sp]
\n1. Andrew, b. 1772.\n\n2. Clarissa, b. 1774.\n\n3. Elias, b. 1776.\n\n5. PoUy , b. 1782.\n\n5. Sylvester, b. 1782.\n\n7. Charles, b. 1787.\n\n8. Margaret Stoutenburgh, b. 1794.\n

[Sp][Dg].[Sp][UpLo+],[Sp][Lo].[Sp][DgDgDgDg],[Sp][Lo].[Sp][DgDgDgDg].[Sp]
\n4. William Lee, b. 1779, d. 1802.\n\n6. Nathaniel Griswold, b. 1784, d. 1785.\n\n3. Lucia, b. 1777, d. 1778.\n\n6. Phebe, b. 1783, d. 1805.\n\n7. William Richard Henry, b. 1787, d. 1796.\n

Field groups: \n[Dg].[Sp][UpLo+] | , b. [DgDgDgDg] | , d. [DgDgDgDg] | .\n


Continue Parsing the Text


Generate Regex from Grammar

(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((\.)([\n]))

(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((\.)([\n]))

\n6. Nathaniel Griswold, b. 1784, d. 1785.\n

ID # 1234


Query the User

(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((\.)([\n]))

\n1. Andrew, b. 1772.\n\n2. Clarissa, b. 1774.\n\n3. Elias, b. 1776.\n\n5. PoUy , b. 1782.\n\n5. Sylvester, b. 1782.\n\n7. Charles, b. 1787.\n\n8. Margaret Stoutenburgh, b. 1794.\n

(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((\.)([\n]))

\n4. William Lee, b. 1779, d. 1802.\n\n6. Nathaniel Griswold, b. 1784, d. 1785.\n\n3. Lucia, b. 1777, d. 1778.\n\n6. Phebe, b. 1783, d. 1805.\n\n7. William Richard Henry, b. 1787, d. 1796.\n


Propagate Labels via Capture Group IDs

(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((\.)([\n]))

\n1. Andrew, b. 1772.\n\n2. Clarissa, b. 1774.\n\n3. Elias, b. 1776.\n\n5. PoUy , b. 1782.\n\n5. Sylvester, b. 1782.\n\n7. Charles, b. 1787.\n\n8. Margaret Stoutenburgh, b. 1794.\n

(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((\.)([\n]))

\n4. William Lee, b. 1779, d. 1802.\n\n6. Nathaniel Griswold, b. 1784, d. 1785.\n\n3. Lucia, b. 1777, d. 1778.\n\n6. Phebe, b. 1783, d. 1805.\n\n7. William Richard Henry, b. 1787, d. 1796.\n

ID # 1234
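
Label propagation through capture groups can be sketched with the first regex on this slide. The mapping from group numbers to labels (`GROUP_LABELS`) is an assumption made for illustration, standing in for what a user would supply by labeling one record; the label abbreviations BO, GN and BY follow the local wrapper induction slide.

```python
import re

# First regex from the slide, split over three lines for readability.
RECORD_REGEX = re.compile(
    r"(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))"
    r"((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))"
    r"((\.)([\n]))"
)

# Hypothetical result of the user labeling one record: capture group -> label.
GROUP_LABELS = {3: "BO", 6: "GN", 14: "BY"}

TEXT = "\n1. Andrew, b. 1772.\n\n2. Clarissa, b. 1774.\n\n5. PoUy , b. 1782.\n"


def propagate(text):
    """Apply the labels of one record to every record the regex matches."""
    return [
        {label: m.group(group_id) for group_id, label in GROUP_LABELS.items()}
        for m in RECORD_REGEX.finditer(text)
    ]


labeled = propagate(TEXT)
for fields in labeled:
    print(fields)
```

One labeled record is enough to label every record in the same cluster, including the noisy "PoUy ," record.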


F-measure Results

Tested on 68 hand-labeled pages (13,748 fields)
Statistically significant at p < 0.01 using an unpaired t-test


Precision Results

Tested on 68 hand-labeled pages (13,748 fields)
Statistically significant at p < 0.01 using an unpaired t-test


Recall Results

Area Under the Learning Curve
ListReader (Two-phase): 39.0%
CRF: 34.4%
ListReader (One-phase): 32.7%


HMM Field Group Templates


Complete HMM


HMM Recall Results

Shaver Family History

Kilbarchan Parish Register


Conclusion


Dissertation Contributions

• List detection
• List wrapper induction
• Mapping between in-line labeled text and ontology
• A* admissible heuristic for regex induction
• HMMs for list wrapping
• AL query strategies
• Linear/linear end-to-end ontology population with good learning curve, requiring no expert input (Regex)
• Linear/quadratic end-to-end ontology population with better learning curve (HMM)


Contributions in a Nutshell

• Wrapper induction is effective
• Data in lists of OCRed documents
• Low-cost ontology population


Appendix


Related Work


Grammar Induction

Papers
• Adriaans, 2006
• Wolff, 2003
• Grünwald, 1996
• Stolcke, 1993
• Wolff, 1977

Purpose
Automatically infer a natural formal grammar for a language given a sample of unlabeled strings

Limitations
• Higher algorithmic complexity
• Not an end-to-end solution


Probabilistic Finite State Automata for Information Extraction

Papers
• Li, 2011
• Heidorn, 2008
• Borkar, 2001

Purpose
Infer a PFSA to automatically infer a label sequence from a word sequence

Limitations
• Requires customized knowledge resources
• Requires supervision
• Lower output versatility


Web Wrapper Induction

Papers
• Gentile, 2013
• Dalvi, 2010
• Elmeleegy, 2009
• Gupta, 2009
• Carlson, 2008
• Chang, 2003
• Crescenzi, 2001

Purpose
Scalably infer a specialized grammar of a Web site and a mapping from pages to database

Limitations
• Not designed for OCR text
• Usually limited in output expressiveness


Specialized Document Analysis and Recognition Systems

Papers
• Le, 2005
• Besagni, 2004
• Besagni, 2003
• Belaïd, 2001
• Belaïd, 1998

Purpose
Automatically extract information from specific types of machine-printed records

Limitations
• Requires a lot of custom rule and resource engineering
• Less scalable over multiple domains or kinds of lists
• Many rely on page images which may not be available


Appendix


from OCRed Text

ListReader

to Populated Ontology


General List Reading Pipeline

OCR

List Detection

List Structure Recognition

Information Extraction


Challenges


OCR Challenges

OCR

\nFirst row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig,\nH. Megorden, D Wynne\nSecond row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun,\nMr. Bohnsack.\nThird row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom.\nFourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny-\nQootLaM "leam\nCaptain Donald "Dude" Bakken ............... Right Half Back\nLeRov "Sonny' Johnson ........ ..........,.... Lcft Half Back\nOrley Bakken ...........,........ ... ,.......... Quarter Back\nRoger Myhrum .............. ..................... Full Back\nBill "Schnozz" Krohg .............. ................ Center\nHoward "Little Huby" Megorden ................ Right Guard\nRoyce "Shorty" Norgaard ....................... Left Guard\nEugene "Mad Russian" Easthind ............... Right Tackle\nAlvin "Stuben" Hagen ......................... Left Tackle\nRichard "Dick" Nienabcr ........................ Right End\nJames "Oakie" Wogsland .......................... Lcft End\n\nOther lettermen were-\nGlenn "Doc" Whaley\nAllen "Swede" Enckson\nJames "Snooky" Mittun\nCurtis "Curt" Paulson\nArthur "Art" Vig\nForrest "Forry" Knudson\nRobert "Bobby" Roysland\nPage 26\n

<HTML>… <OL> <LI>Captain Donald "Dude" Bakken ............... Right Half Back</LI> <LI>LeRoy "Sonny” Johnson ....................... Left Half Back</LI> <LI>Orley Bakken .................................. Quarter Back</LI> <LI>Roger Myhrum ................................... Full Back</LI> <LI>Bill "Schnozz" Krohg .............................. Center</LI> <LI>Howard "Little Huby" Megorden ................ Right Guard</LI> <LI>Royce "Shorty" Norgaard ....................... Left Guard</LI> <LI>Eugene "Mad Russian" Eastlind ............... Right Tackle</LI>…

vs. HTML


Diversity Challenges


Versatile Output Schemas


Precision & Recall

[Figure: the record \n1. Andrew, b. 1772.\n shown twice, once under Ground Truth and once under Extractor Predictions. Field labels shown: Sibling Birth Order; Given Name; Birth Year. Error annotations shown: False Positive (Precision Error); False Negative (Recall Error); False Positive and False Negative.]
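
The error types named on this slide translate directly into precision, recall and F-measure. The sketch below uses hypothetical predictions for the record "1. Andrew, b. 1772." (they are not taken from the slide) to show that a mislabeled field counts as both a false positive and a false negative, while a missed field counts only as a false negative.

```python
def precision_recall_f1(truth, predicted):
    """Compute precision, recall and F1 over sets of (text, label) pairs."""
    tp = len(truth & predicted)   # correct predictions
    fp = len(predicted - truth)   # precision errors
    fn = len(truth - predicted)   # recall errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


truth = {
    ("1", "Sibling Birth Order"),
    ("Andrew", "Given Name"),
    ("1772", "Birth Year"),
}
# Hypothetical extractor output: "1" is mislabeled and "1772" is missed.
predicted = {
    ("Andrew", "Given Name"),
    ("1", "Birth Year"),
}

precision, recall, f1 = precision_recall_f1(truth, predicted)
print(precision, recall, f1)
```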


Human Effort

OCR Text

“click”

Given Names: Ali, Alison, Alex, Andie, Andy, Ariel, Caroline, Chris, Cindy, Claire, Corey, Daisy, Diane, Emmy, Francis, Heather, Janey, Jojo, Kat, Katharine, Linda, …

Surnames: Adams, Allen, Anderson, Baker, Brown, Campbell, Clark, Davis, García, González, Green, Hall, Harris, Hernández, Hill, Jackson, Johnson, Jones, King, Lee, Lewis, …

Months: Jan., January, Feb., February, Mar., March, Apr., April, May, Jun., June, Jul., July, Aug., August, Sep., Sept., September, Oct., October, Nov., …

Predicates: b., born, born on, baptized, was baptized, was baptized on, was baptized in, d., died, died on, m., married, was married to, was married in, p., parish, parishioner, c., christening, was christened at, was christened in, …

Regular Expressions for Dates
^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$
^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d$
…

Regular Expression for Records
(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z …


ListReader Solution


ListReader

[Workflow diagram. Labels shown: Scan Book; Images; OCR Text; ListReader (List & Record Discovery; Field Labeling); Data Entry Form (“click”); Text to Ontology Mapping; Database Querying & Reasoning.]


Another High-level View of ListReader

1555. Elias Mather, b. 1750, d. 1788, son of Deborah Ely and Rich-
ard Mather; m. 1771, Lucinda Lee, who was b. 1752, dau. of Abner Lee
and EHzabeth Lee. Their children :—
1. Andrew, b. 1772.
2. Clarissa, b. 1774.
3. Elias, b. 1776.
4. William Lee, b. 1779, d. 1802.
5. Sylvester, b. 1782.
6. Nathaniel Griswold, b. 1784, d. 1785.
7. Charles, b. 1787.
1556. Deborah Mather, b. 1752, d. 1826, dau. of Deborah Ely and
Richard Mather; m. 1771, Ezra Lee, who was b. 1749 and d. 1821, son
of Abner Lee and Elizabeth Lee. Their children :—
1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.
2. Elizabeth, b. 1774, d. 1851, m. 1801 Edward Hill.
3. Lucia, b. 1777, d. 1778.
4. Lucia Mather, b. 1779, d. 1870, m. John Marvin.
5. PoUy , b. 1782.
6. Phebe, b. 1783, d. 1805.
7. William Richard Henry, b. 1787, d. 1796.
8. Margaret Stoutenburgh, b. 1794.


Data Mapping (Packer and Embley, 2014b, 2014a, 2013)


ListReader Mapping

Example record: 2. Elizabeth, b. 1774, d. 1851, m. 1801 Edward Hill.

Each entry lists a Structural Distinction with its Predicates and Labels (labeled text in parentheses).

1. Lexical vs. non-lexical
   Predicates: Person(p1) vs. GivenName(“Elizabeth”)
   Labels: Person.GivenName[1] (Elizabeth)

2. N-ary relationships
   Predicates: Person-SpouseName-MarriageDate(p1, “Edward”, “1801”)
   Labels: Person.(MarriageDate)SpouseName (Edward)

3. M degrees of separation
   Predicates: Person-DeathDate(p1, dd1), DeathDate-Year(dd1, “1851”)
   Labels: Person.DeathDate.Year (1851)

4. Functionality and optionality
   Predicates: Person-GivenName() vs. Person-Surname()
   Labels: Person.GivenName[1] (Elizabeth), Person.Surname (Hill)

5. Generalization-specialization class hierarchies
   Predicates: Person vs. Child
   Labels: Child.ChildNr (2)

6. Non-tree ontology structure
   Predicates: Person-BirthDate(p1, bd1), BirthDate-Year(bd1, “1774”), Person-DeathDate(p1, dd1), DeathDate-Year(dd1, “1851”)
   Labels: Person.BirthDate.Year (1774), Person.DeathDate.Year (1851)
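
For the simplest case in this mapping, a plain dotted path label such as Person.DeathDate.Year, the expansion into a chain of binary predicates can be sketched as follows. This covers only plain dotted paths; the parenthesized n-ary form and the [1] index are not handled. The function names and the rule for building object identifiers (lowercased capitals plus "1", giving p1, dd1, bd1) are assumptions inferred from the examples, not ListReader's actual code.

```python
import re


def object_id(object_set):
    """Build an identifier such as 'dd1' from the capitals of 'DeathDate'."""
    return "".join(re.findall(r"[A-Z]", object_set)).lower() + "1"


def path_to_predicates(label, text):
    """Expand a dotted ontology-path label into binary predicates."""
    parts = label.split(".")
    predicates = []
    for i in range(len(parts) - 1):
        parent, child = parts[i], parts[i + 1]
        is_leaf = i + 1 == len(parts) - 1
        # The last step carries the labeled text; earlier steps carry object ids.
        argument = f'"{text}"' if is_leaf else object_id(child)
        predicates.append(f"{parent}-{child}({object_id(parent)}, {argument})")
    return predicates


death = path_to_predicates("Person.DeathDate.Year", "1851")
birth = path_to_predicates("Person.BirthDate.Year", "1774")
print(death)
print(birth)
```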


Unsupervised List Detection (Packer and Embley, 2012)


Literal Pattern Area

Score = 3 x 7 = 21


Pattern Area

Score = 6 x 7 = 42


Label Selector 1: Naïve Bayes Classifier

OCR Text Noisy Word Categories


OCR Text Noisy Word Categories

Label Selector 2: Standard Deviation


26 Pages for Dev / Parameter Setting, F-measure on 16 Separate Test Pages

[Bar chart: F-measure on a 0% to 100% axis. Series: Averaged over Pages; Averaged over Words. Values as extracted: 12%, 32%, 38%, 51%, 77%, 84%, 86%, 86%.]


Local Regex (Packer and Embley, 2013)


OCR

newline First row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig, newline H. Megorden, D Wynne newline Second row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun, newline Mr. Bohnsack. newline Third row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom. newLine Fourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny- newline newline QootLaM "leam newline Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline


Hand Labeling

newline First row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig, newline H. Megorden, D Wynne newline Second row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun, newline Mr. Bohnsack. newline Third row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom. newLine Fourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny- newline newline QootLaM "leam newline Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline



70

List Start Detection

newline First row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig, newline H. Megorden, D Wynne newline Second row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun, newline Mr. Bohnsack. newline Third row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom. newLine Fourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny- newline newline QootLaM "leam newline Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline

newline Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline


71

newline Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline

Record Boundary Detection

Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline


72

Wrapper for First Record



73

Update and Label using First Wrapper



74

Update and Label using First Wrapper



75

Final Wrapper & Extraction



76

Precision & Recall

\n1. Andrew, b. 1772.\n

[Figure: the record above shown twice, once annotated with Ground Truth labels and once with Extractor Predictions.]

Field labels: Birth Order, Given Name, Birth Year

Error annotations: False Positive (Precision Error); False Negative (Recall Error); False Positive and False Negative
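
The error types named on this slide determine precision and recall in the usual way. The following is a minimal sketch, assuming predictions are compared to ground truth as (token, label) pairs. The ground-truth pairing of tokens to labels and the extractor output below are hypothetical and are not read off the slide.

```python
def precision_recall_f1(truth, predicted):
    """Score predicted (token, label) pairs against ground-truth pairs."""
    truth, predicted = set(truth), set(predicted)
    true_pos = len(truth & predicted)
    false_pos = len(predicted - truth)  # precision errors
    false_neg = len(truth - predicted)  # recall errors
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumed ground truth for the record "1. Andrew, b. 1772."
truth = [("1", "Birth Order"), ("Andrew", "Given Name"), ("1772", "Birth Year")]

# Hypothetical extractor output: one correct field, one mislabeled field
# (a false positive and a false negative at once), and one missed field.
predicted = [("1772", "Birth Year"), ("Andrew", "Birth Order")]

p, r, f1 = precision_recall_f1(truth, predicted)
```

A mislabeled field counts against both metrics, which is why the slide marks one annotation as both a false positive and a false negative.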


77

Evaluation

Evaluated on 60 short-record lists from “The Ely Ancestry” family history book, containing 3088 non-space word tokens and 1254 extractable field strings.


78

ListReader Efficiency


79

Local HMM (Packer and Embley, 2013)


80

HMM Induction

Initialize Active Learning (Novelty Detection)


81

HMM Modeling Details

Sparse, Noisy Data:
1. Parameter smoothing: prior knowledge as non-zero Dirichlet priors
2. Emission model parameter tying for shared lexical object sets
3. Cluster field words in the emission model with 5 character classes

List Structure:
4. Transition model is a fine-grained total order among word states
5. Transition model is cyclical only at record delimiters
6. Transition model is a total order: non-zero priors allow for deletions
7. Transition model has unique “unknown” states to allow for insertions
8. Delimiter state emission models don’t use word clustering like field states
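
Item 1 can be illustrated with additive smoothing, which is the posterior-mean estimate of a multinomial under a symmetric Dirichlet prior. This is a minimal sketch, not the dissertation's implementation. The state names, the counts, and the prior value `alpha` are hypothetical.

```python
from collections import Counter


def smoothed_probs(counts, outcomes, alpha=0.1):
    """Posterior-mean multinomial estimate under a symmetric Dirichlet prior.

    Every outcome receives the pseudo-count alpha, so events never seen in
    the (sparse) training data still keep a non-zero probability.
    """
    total = sum(counts.get(o, 0) for o in outcomes) + alpha * len(outcomes)
    return {o: (counts.get(o, 0) + alpha) / total for o in outcomes}


# Hypothetical transition counts out of a "GivenName" state.
states = ["Nickname", "Surname", "Delimiter"]
counts = Counter({"Nickname": 8, "Surname": 2})  # "Delimiter" never observed

probs = smoothed_probs(counts, states, alpha=0.1)
```

Because the unobserved transition keeps a small probability, a record that skips a field can still be decoded, which is the role item 6 gives to non-zero priors.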


82

Global Regex (Packer and Embley, 2014a)


83

AUL Results

Area Under the Learning Curve

                          Prec.   Rec.    F1
CRF                       54.1    34.4    39.5
ListReader (One-phase)    95.6    32.7    48.6
ListReader (Two-phase)    94.4    39.0    55.1
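
An area under the learning curve summarizes a metric over the whole labeling budget rather than at a single point. The sketch below uses the trapezoidal rule and divides by the width of the x-range so the result stays on the metric's own scale. That normalization, the function name, and the sample curve are assumptions and are not taken from the slides.

```python
def area_under_learning_curve(labels_used, scores):
    """Trapezoidal area under a score-vs-labels curve, normalized by x-range."""
    if len(labels_used) != len(scores) or len(scores) < 2:
        raise ValueError("need two or more points of equal-length x and y")
    area = 0.0
    for i in range(1, len(scores)):
        width = labels_used[i] - labels_used[i - 1]
        area += width * (scores[i] + scores[i - 1]) / 2.0
    return area / (labels_used[-1] - labels_used[0])


# Hypothetical learning curve: F1 (in percent) after a given number of labels.
labels = [0, 10, 20, 40]
f1_curve = [0.0, 40.0, 50.0, 60.0]

aul = area_under_learning_curve(labels, f1_curve)
```

A system that reaches a given score with fewer labels gets a larger area than one that reaches the same final score later.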


84

Global HMM (Packer and Embley, 2014b)


85

HMM Field Group Template


86

F-measure Results


87

Precision Results


88

Recall Results


89


90

ListReader

ListReader Form

OCR Text

Unsupervised Active Learning Pipeline

• Conflate Text
• Build Suffix Tree
• Find Record Patterns in Tree
• Request Labels from User
• Find Major Fields in Record Clusters
• Generate Regex Templates

---------------------- Phase One ---------------------- ---- Phase Two ----

“click”


91

ListReader

Child(p1)

Person(p1)

Child-ChildNumber(p1, “1”)

Child-Name(p1, n1)
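
The predicates above can be read as the result of mapping one labeled list record into the ontology. This is a minimal sketch, assuming the labeled fields arrive as a dict. The function name, the dict keys, and the identifier scheme (`p1`, `n1`) are hypothetical, and only the four predicate names come from the slide.

```python
def record_to_predicates(fields, index=1):
    """Map the labeled fields of one list record to ontology predicates."""
    person = f"p{index}"
    name = f"n{index}"
    # Every record in a child list asserts a Child who is also a Person.
    predicates = [f"Child({person})", f"Person({person})"]
    if "ChildNumber" in fields:
        number = fields["ChildNumber"]
        predicates.append(f'Child-ChildNumber({person}, "{number}")')
    if "Name" in fields:
        # The name is a separate object (n1), not a string literal.
        predicates.append(f"Child-Name({person}, {name})")
    return predicates


preds = record_to_predicates({"ChildNumber": "1", "Name": "Andrew"})
```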


92

ListReader Pieces

Four types of knowledge representation and mappings:

• Predicates in ontologies

• Labeled fields in plain text

• List wrappers (regex)

• Data entry forms

\n[Dg].[Sp][UpLo+].\n
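
The pattern above is what a record looks like once its tokens are conflated into character classes. This is a minimal sketch, assuming `Dg` stands for a single digit, `Sp` for a space, and `UpLo+` for a capitalized word. The function name, the `Lo+` and `Up+` classes, and the exact rules are hypothetical and may differ from ListReader's.

```python
import re

# Tokens: a run of letters, a single digit, a single whitespace character,
# or any other single character (punctuation).
TOKEN = re.compile(r"[A-Za-z]+|\d|\s|.", re.DOTALL)


def conflate(text):
    """Replace each token with a character-class symbol; keep punctuation."""
    out = []
    for tok in TOKEN.findall(text):
        if tok.isdigit():
            out.append("[Dg]")
        elif tok == "\n":
            out.append("\\n")  # keep record delimiters visible as literal \n
        elif tok.isspace():
            out.append("[Sp]")
        elif re.fullmatch(r"[A-Z][a-z]+", tok):
            out.append("[UpLo+]")
        elif re.fullmatch(r"[a-z]+", tok):
            out.append("[Lo+]")
        elif re.fullmatch(r"[A-Z]+", tok):
            out.append("[Up+]")
        else:
            out.append(tok)  # punctuation and mixed-case words pass through
    return "".join(out)


pattern = conflate("\n1. Andrew.\n")
print(pattern)  # \n[Dg].[Sp][UpLo+].\n
```

Conflating before pattern discovery lets records such as "1. Andrew." and "2. Mary." collapse to the same string, so repeated record structure can be found by exact matching.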