listreader: inducing wrappers for ocred lists to efficiently populate ontologies thomas l. packer...
TRANSCRIPT
1
ListReader: Inducing Wrappers for OCRed Lists to Efficiently Populate Ontologies
Thomas L. Packer
October 6, 2014
2
Dissertation in a Nutshell
• Challenge: Low-cost ontology population
• Focus: Data in lists of OCRed documents
• Thesis: Wrapper induction is effective
3
Outline
• Motivation & Challenges
• Solutions
  – Data Mapping
  – Local Wrapper Induction
  – Global Wrapper Induction
• Conclusions
4
Motivations & Challenges
5
Motivation
6
Other Domains and Applications
• Printed receipts
  – Marketing app. (Itemize.com)
  – Personal finance app. (Itemize.com)
  – Travel reimbursement app.
  – Nutrition app. (Noshly.com)
• Digitize library catalog
• Citation metrics
• Museum Specimen Labels
7
OCR Text Challenges
OCR
…
Captain Donald "Dude" Bakken ............... Right Half Back\nLeRov "Sonny' Johnson ........ ..........,.... Lcft Half Back\nOrley Bakken ...........,........ ... ,.......... Quarter Back\nRoger Myhrum .............. ..................... Full Back\nBill "Schnozz" Krohg .............. ................ Center\nHoward "Little Huby" Megorden ................ Right Guard\nRoyce "Shorty" Norgaard ....................... Left Guard\nEugene "Mad Russian" Easthind ............... Right Tackle\n
…
<HTML>… <OL> <LI>Captain Donald "Dude" Bakken ............... Right Half Back</LI> <LI>LeRoy "Sonny” Johnson ....................... Left Half Back</LI> <LI>Orley Bakken .................................. Quarter Back</LI> <LI>Roger Myhrum ................................... Full Back</LI> <LI>Bill "Schnozz" Krohg .............................. Center</LI> <LI>Howard "Little Huby" Megorden ................ Right Guard</LI> <LI>Royce "Shorty" Norgaard ....................... Left Guard</LI> <LI>Eugene "Mad Russian" Eastlind ............... Right Tackle</LI>…
vs. a Web Page
8
“Big Data” Challenges
9
Human Effort Challenges
Given Names: Ali, Alison, Alex, Andie, Andy, Ariel, Caroline, Chris, Cindy, Claire, Corey, Daisy, Diane, …
Surnames: Adams, Allen, Anderson, Baker, Brown, Campbell, Clark, Davis, García, González, Green, Hall, Harris, …
Months: Jan., January, Feb., February, Mar., March, Apr., April, May, Jun., June, Jul., July, …
Predicates: b., born, born on, baptized, was baptized, was baptized on, was baptized in, d., died, died on, m., married, was married to, …
Regular Expressions for Dates:
^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$
^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d$
…
Regular Expression for Records:
(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z …
10
Solutions
11
Summary of Related Work and Present Contributions
Input Handling and Project Scope
  1. OCR / Noise Tolerant
  2. General Text List Detection
  3. General Text List Info. Extr.
Process Scalability
  4. Alg. Time & Space Complexity
  5. Human Cost Engineering Knowledge
  6. Human Cost Annotating Examples
IE Output
  7. Rich Ontology Population
  8. Precision
  9. Recall

System                   1          2          3          4          5          6          7          8          9
Related Work
Grammar Induction        NR         NR         NR         Bad        Very Good  Very Good  NR         NR         NR
Traditional IE *         Good       OK         OK         OK         Bad        OK         OK         Good       Good
Web Wrapper Induction    NR         NR         OK         Good       OK         Good       OK         Good       Good
Specialized DAR Systems  Good       OK         OK         NR         Bad        Bad        OK         Good       Good
Dissertation Work
List Detection           OK         Good       NR         Very Good  OK         Very Good  NR         NR         NR
Local Regex              Good       OK         Good       Bad        Good       OK         Good       Very Good  OK
Local HMM                Good       OK         Good       OK         Good       OK         Good       Good       Good
Global Regex             OK         Good       Good       Very Good  Good       Good       Good       Very Good  OK
Global HMM               Good       Good       Good       Good       Good       Good       Good       Good       Good

* Our baseline comparison system is a Conditional Random Field (CRF)
12
Data Mapping (Packer and Embley, 2014b, 2014a, 2013)
13
Typical Data Extraction
Named Entity Recognition (Categorizing)
Flat Attribute Extraction(Grouping and Categorizing)
Relation Extraction(Grouping and Categorizing)
Object-Set Labels
14
Ontology-Path Labels
ListReader Data Extraction
Flexible, Unified Ontology
Person.(MarriageDate)SpouseName[1] Edward Hill
Person-SpouseName-MarriageDate(p1, “Edward Hill”, “1801”)
Person(p1)
SpouseName(“Edward Hill”)
MarriageDate(“1801”)
Person.(SpouseName)MarriageDate[1] 1801
15
Local Regex (Packer and Embley, 2013)
16
ListReader (Local)
Data Entry Form
OCR Text
Find Contiguous List
Build Form & Label First Record
Text to Ontology Mapping
“click”
ListReader
OCR Text
Scan Book Images
Database Querying & Reasoning
Induce Wrapper
Label Selected Insertions
17
Local Wrapper Induction
\n1. Andrew b. 1772\n\n2. William Lee h, i774\n\n3. Nathaniel Griswold\n
1. Initialize Regex
\n<BO>1</BO>. <GN>Andrew</GN> b. <BY>1772</BY>\n\n2. William Lee, h. i774.\n \n3. Nathaniel Griswold\n
BO GN BY
\n(1)\. (Andrew) b\. (1772)\n
2. Generalize Regex
BO GN BY
\n([\dlio])[.,] (\w{6}) [bh][.,] ([\dlio]{4})\n
3. Alignment-Search
Deletion
BO GN BY
\n([\dlio])[.,] (\w{6}) [bh][.,] ([\dlio]{4})\nX
Insertion
BO GN Unknown BY
\n([\dlio])[.,] (\w{6,7}) (\S{1,10}) [bh][.,] ([\dlio]{4})\n
\n1. Andrew b. 1772\n\n2. William Lee h, i774\n\n3. Nathaniel Griswold\n
Expansion
4. Evaluate (edit sim. × match freq.)
One match! No match
5. User labels insertions
\n<BO>1</BO>. <GN>Andrew</GN> b. <BY>1772</BY>\n\n2. William <GN>Lee</GN> h, i774\n\n3. Nathaniel Griswold\n
BO GN GN BY
\n([\dlio])[.,] (\w{4,5}) (\w{3}) [bh][.,] ([\dlio]{4})\n
6. Extract
\n<BO>1</BO>. <GN>Andrew</GN> b. <BY>1772</BY>\n\n<BO>2</BO>. <GN>William</GN> <GN>Lee</GN> h, <BY>i774</BY>\n\n<BO>3</BO>. <GN>Charles</GN> <GN>Conrad</GN>\n
Many more …
<BO> = Birth Order
<GN> = Given Name
<BY> = Birth Year
User-supplied labels
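The generalize-and-match steps on this slide can be tried directly. The sketch below is only an illustration, not ListReader's implementation: it takes the generalized regex from step 2 and the "Insertion" candidate from step 3 as written on the slide and runs them over the slide's OCR text. The helper name `extract` and the use of label lists as dictionary keys are assumptions made for the example.

```python
import re

# OCR text from the slide; note the OCR errors "h," for "b." and "i774" for "1774".
TEXT = "\n1. Andrew b. 1772\n\n2. William Lee h, i774\n\n3. Nathaniel Griswold\n"

# Step 2 on the slide: the first-record regex generalized to tolerate OCR noise.
GENERALIZED = re.compile(r"\n([\dlio])[.,] (\w{6}) [bh][.,] ([\dlio]{4})\n")

# Step 3 "Insertion" candidate: one extra, still unlabeled field after the name.
WITH_INSERTION = re.compile(
    r"\n([\dlio])[.,] (\w{6,7}) (\S{1,10}) [bh][.,] ([\dlio]{4})\n"
)


def extract(pattern, labels, text=TEXT):
    """Pair each capture group with its label for every match (hypothetical helper)."""
    return [dict(zip(labels, m.groups())) for m in pattern.finditer(text)]


first = extract(GENERALIZED, ["BO", "GN", "BY"])
second = extract(WITH_INSERTION, ["BO", "GN", "Unknown", "BY"])
print(first)   # only record 1 matches
print(second)  # only record 2 matches; "Lee" lands in the Unknown group
```

Each candidate matches exactly one record here, which is why the slide asks the user to label the inserted field before the wrapper is extended.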
18
Algorithmic Complexity
\n1. Andrew, b. 1772.\n2. Clarissa, b. 1774.\n
\n1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.\n
\n1. Andrew, b. 1772.\n2. Clarissa, b. 1774.\n
\n1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.\n
\n1. Andrew, b. 1772.\n2. Clarissa, b. 1774.\n
\n1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.\n
\n1. Andrew, b. 1772.\n2. Clarissa, b. 1774.\n
\n1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.\n
1. 2. 3. 4.
4 of 2^(mn) = 1.987e+233
19
A B C E F G H
Alignment using A* Search
A B C E F G
A B C’ D E F
Branching Factor = 2 * 4 = 8
A B C’ E F G
Goal State(Regex matching next record)
Start State(Regex for first record)
Insertion @ 4
Substitution @ 3 Insertion @ 7
A B C’ E F
Deletion @ 6
Never traverses this branch
Search Tree Depth = 3
This search space size = ~10 instead of 13,824
Other search space sizes = ~1000 instead of 587,068,342,272
f(r) = g(r) + h(r): 4 = 1 + 3
f(r) = g(r) + h(r): 3 = 1 + 2
X = Part of candidate regex that does not match the next record
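As a rough illustration of the f(r) = g(r) + h(r) search on this slide, the sketch below runs A* over plain token sequences rather than regexes. It is a simplified stand-in: the function name `astar_align`, the unit edit costs, and the length-difference heuristic are assumptions for the example, not the dissertation's admissible heuristic.

```python
import heapq


def astar_align(start, goal):
    """Find a cheapest edit script turning `start` into `goal` (hypothetical helper).

    A state (i, j) means start[:i] has been aligned with goal[:j].
    """
    n, m = len(start), len(goal)

    def h(i, j):
        # Lower bound on remaining cost: the leftover length difference
        # must be covered by insertions or deletions.
        return abs((n - i) - (m - j))

    frontier = [(h(0, 0), 0, 0, 0, [])]  # entries are (f, g, i, j, ops)
    best = {}
    while frontier:
        f, g, i, j, ops = heapq.heappop(frontier)
        if (i, j) == (n, m):
            return g, ops
        if best.get((i, j), float("inf")) <= g:
            continue  # already expanded this state at equal or lower cost
        best[(i, j)] = g
        successors = []
        if i < n and j < m:
            if start[i] == goal[j]:
                successors.append((g, i + 1, j + 1, ops))  # free match
            else:
                op = f"substitute {start[i]} -> {goal[j]}"
                successors.append((g + 1, i + 1, j + 1, ops + [op]))
        if j < m:
            successors.append((g + 1, i, j + 1, ops + [f"insert {goal[j]}"]))
        if i < n:
            successors.append((g + 1, i + 1, j, ops + [f"delete {start[i]}"]))
        for g2, i2, j2, ops2 in successors:
            heapq.heappush(frontier, (g2 + h(i2, j2), g2, i2, j2, ops2))
    return None


cost, script = astar_align("A B C E F G".split(), "A B C' D E F".split())
print(cost, script)
```

With the slide's start and goal sequences the cheapest script has three edits, matching the stated search tree depth of 3.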
20
ListReader Accuracy
Tested on 60 hand-labeled lists
(1254 fields)
Conditional Random Field from http://mallet.cs.umass.edu/
21
Global Regex & HMM (Packer and Embley, 2014(a, b))
22
ListReader (Global)
Data Entry Form
OCR Text
Find Records
Build Form & Label Selected Patterns
Text to Ontology Mapping
“click”
ListReader
OCR Text
Scan Book Images
Database Querying & Reasoning
Induce Grammar
23
Global Induction: Basic Ideas
Document Text
1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.
2. Elizabeth, b. 1774, d. 1851, m. 1801 Edward Hill.
3. Lucia, b. 1777, d. 1778.
4. Lucia Mather, b. 1779, d. 1870, m. John Marvin.
5. Polly, b. 1782.
6. Phebe, b. 1783, d. 1805.
7. William Richard Henry, b. 1787, d. 1796.
8. Margaret Stoutenburgh, b. 1794.
Clustered and Aligned Text
5. Polly, b. 1782.
8. Margaret Stoutenburgh, b. 1794.

3. Lucia, b. 1777, d. 1778.
6. Phebe, b. 1783, d. 1805.
7. William Richard Henry, b. 1787, d. 1796.

1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.
4. Lucia Mather, b. 1779, d. 1870, m. John Marvin.

2. Elizabeth, b. 1774, d. 1851, m. 1801 Edward Hill.
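One way to see how the records above fall into clusters is to conflate each record into a coarse pattern and group records that share a pattern. The sketch below assumes a simplified conflation (capitalized word sequences, digit runs, lowercase words, whitespace) written with the bracket notation used elsewhere in these slides. The names `TOKEN`, `conflate`, and `cluster` are hypothetical, and the grouping uses a dictionary instead of the suffix tree the slides mention.

```python
import re
from collections import defaultdict

RECORDS = [
    "1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.",
    "2. Elizabeth, b. 1774, d. 1851, m. 1801 Edward Hill.",
    "3. Lucia, b. 1777, d. 1778.",
    "4. Lucia Mather, b. 1779, d. 1870, m. John Marvin.",
    "5. Polly, b. 1782.",
    "6. Phebe, b. 1783, d. 1805.",
    "7. William Richard Henry, b. 1787, d. 1796.",
    "8. Margaret Stoutenburgh, b. 1794.",
]

# Simplified conflation classes (an assumption, not ListReader's exact rules).
TOKEN = re.compile(
    r"(?P<UpLo>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"  # capitalized word sequence
    r"|(?P<Dg>\d+)"                            # run of digits
    r"|(?P<Lo>[a-z]+)"                         # lowercase word
    r"|(?P<Sp>\s+)"                            # whitespace
    r"|(?P<Other>.)"                           # punctuation, kept literally
)


def conflate(record):
    """Replace each token with its class symbol, keeping punctuation as-is."""
    parts = []
    for m in TOKEN.finditer(record):
        kind = m.lastgroup
        if kind == "UpLo":
            parts.append("[UpLo+]")
        elif kind == "Dg":
            parts.append("[" + "Dg" * len(m.group()) + "]")
        elif kind == "Lo":
            parts.append("[Lo]")
        elif kind == "Sp":
            parts.append("[Sp]")
        else:
            parts.append(m.group())
    return "".join(parts)


def cluster(records):
    """Group record numbers by their conflated pattern."""
    groups = defaultdict(list)
    for record in records:
        groups[conflate(record)].append(int(record.split(".")[0]))
    return dict(groups)


clusters = cluster(RECORDS)
for pattern, numbers in clusters.items():
    print(numbers, pattern)
```

On these eight records the grouping reproduces the four clusters shown on the slide.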
24
Conflated Superfluous Distinctions
… [UpLo+];[Sp][Dg].[Sp][UpLo+] … .[Sp][Dg].[Sp][UpLo+] …
… Rich-\nard Mather ;\n5. PoUy … .\n6. Phebe …
1. Word Split over Newline
2. Numeral
3. Word
4. Space
5. Space Erroneously Inserted by OCR
6. Capitalized Word Sequence
Suffix Tree
25
[UpLo+];[Sp][Dg].[Sp][UpLo+].[Sp][Dg].[Sp][UpLo+].[Sp]
[Sp][Dg].[Sp][UpLo+].[Sp]
26
Identify and Parse Field Groups
[Sp][Dg].[Sp][UpLo+],[Sp][Lo].[Sp][DgDgDgDg].[Sp]
\n1. Andrew, b. 1772.\n\n2. Clarissa, b. 1774.\n\n3. Elias, b. 1776.\n\n5. PoUy , b. 1782.\n\n5. Sylvester, b. 1782.\n\n7. Charles, b. 1787.\n\n8. Margaret Stoutenburgh, b. 1794.\n
[Sp][Dg].[Sp][UpLo+],[Sp][Lo].[Sp][DgDgDgDg],[Sp][Lo].[Sp][DgDgDgDg].[Sp]
\n4. William Lee, b. 1779, d. 1802.\n\n6. Nathaniel Griswold, b. 1784, d. 1785.\n\n3. Lucia, b. 1777, d. 1778.\n\n6. Phebe, b. 1783, d. 1805.\n\n7. William Richard Henry, b. 1787, d. 1796.\n
\n[Dg].[Sp][UpLo+] , b. [DgDgDgDg] , d. [DgDgDgDg] .\n
27
Continue Parsing the Text
28
Generate Regex from Grammar
(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((\.)([\n]))
(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((( )?,)([ \n])(d)(\.)([ \n])([\d]{4}))((\.)([\n]))
\n6. Nathaniel Griswold, b. 1784, d. 1785.\n
ID # 1234
29
Query the User
(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((\.)([\n]))
\n1. Andrew, b. 1772.\n\n2. Clarissa, b. 1774.\n\n3. Elias, b. 1776.\n\n5. PoUy , b. 1782.\n\n5. Sylvester, b. 1782.\n\n7. Charles, b. 1787.\n\n8. Margaret Stoutenburgh, b. 1794.\n
(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((( )?,)([ \n])(d)(\.)([ \n])([\d]{4}))((\.)([\n]))
\n4. William Lee, b. 1779, d. 1802.\n\n6. Nathaniel Griswold, b. 1784, d. 1785.\n\n3. Lucia, b. 1777, d. 1778.\n\n6. Phebe, b. 1783, d. 1805.\n\n7. William Richard Henry, b. 1787, d. 1796.\n
30
Propagate Labels via Capture Group IDs
(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((\.)([\n]))
\n1. Andrew, b. 1772.\n\n2. Clarissa, b. 1774.\n\n3. Elias, b. 1776.\n\n5. PoUy , b. 1782.\n\n5. Sylvester, b. 1782.\n\n7. Charles, b. 1787.\n\n8. Margaret Stoutenburgh, b. 1794.\n
(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))((( )?,)([ \n])(d)(\.)([ \n])([\d]{4}))((\.)([\n]))
\n4. William Lee, b. 1779, d. 1802.\n\n6. Nathaniel Griswold, b. 1784, d. 1785.\n\n3. Lucia, b. 1777, d. 1778.\n\n6. Phebe, b. 1783, d. 1805.\n\n7. William Richard Henry, b. 1787, d. 1796.\n
ID # 1234
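A small sketch of propagating labels through capture-group IDs follows. It uses the first record regex from this slide as written. The mapping from group numbers to field labels (`GROUP_LABELS`), the label names, and the helper `propagate` are assumptions for illustration: the idea is that once a user labels one matched record, every other match of the same regex inherits those labels by group number.

```python
import re

# Regex for "number. Name, b. year." records, copied from the slide.
RECORD_REGEX = re.compile(
    r"(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)))"
    r"((( )?,)([ \n])(b)(\.)([ \n])([\d]{4}))"
    r"((\.)([\n]))"
)

# Hypothetical result of one user labeling action: capture-group ID -> field label.
GROUP_LABELS = {3: "BirthOrder", 6: "GivenName", 14: "BirthYear"}

TEXT = "\n1. Andrew, b. 1772.\n\n2. Clarissa, b. 1774.\n\n3. Elias, b. 1776.\n"


def propagate(regex, group_labels, text):
    """Apply the labeled capture groups to every record the regex matches."""
    return [
        {label: m.group(group_id) for group_id, label in group_labels.items()}
        for m in regex.finditer(text)
    ]


fields = propagate(RECORD_REGEX, GROUP_LABELS, TEXT)
for row in fields:
    print(row)
```

One labeling action on the first record is enough to label all three records in this cluster, which is the source of the low human cost claimed for the approach.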
31
F-measure Results
Tested on 68 hand-labeled pages (13,748 fields)
Statistically significant at p < 0.01 using an unpaired t-test
32
Precision Results
Tested on 68 hand-labeled pages (13,748 fields)
Statistically significant at p < 0.01 using an unpaired t-test
33
Recall Results
Area Under the Learning Curve
ListReader (Two-phase): 39.0%
CRF: 34.4%
ListReader (One-phase): 32.7%
34
HMM Field Group Templates
35
Complete HMM
36
HMM Recall Results
Shaver Family History
Kilbarchan Parish Register
37
Conclusion
38
Dissertation Contributions
• List detection
• List wrapper induction
• Mapping between in-line labeled text and ontology
• A* admissible heuristic for regex induction
• HMMs for list wrapping
• AL query strategies
• Linear/linear end-to-end ontology population with good learning curve, requiring no expert input (Regex)
• Linear/quadratic end-to-end ontology population with better learning curve (HMM)
39
Contributions in a Nutshell
• Wrapper induction is effective
• Data in lists of OCRed documents
• Low-cost ontology population
40
Appendix
41
Related Work
42
Grammar Induction
Papers
• Adriaans, 2006
• Wolff, 2003
• Grünwald, 1996
• Stolcke, 1993
• Wolff, 1977
Purpose
Automatically infer a natural formal grammar for a language given a sample of unlabeled strings
Limitations
• Higher algorithmic complexity
• Not an end-to-end solution
43
Probabilistic Finite State Automata for Information Extraction
Papers
• Li, 2011
• Heidorn, 2008
• Borkar, 2001
Purpose
Infer a PFSA to automatically infer a label sequence from a word sequence
Limitations
• Requires customized knowledge resources
• Requires supervision
• Lower output versatility
44
Web Wrapper Induction
Papers
• Gentile, 2013
• Dalvi, 2010
• Elmeleegy, 2009
• Gupta, 2009
• Carlson, 2008
• Chang, 2003
• Crescenzi, 2001
Purpose
Scalably infer a specialized grammar of a Web site and a mapping from pages to database
Limitations
• Not designed for OCR text
• Usually limited in output expressiveness
45
Specialized Document Analysis and Recognition Systems
Papers
• Le, 2005
• Besagni, 2004
• Besagni, 2003
• Belaïd, 2001
• Belaïd, 1998
Purpose
Automatically extract information from specific types of machine-printed records
Limitations
• Requires a lot of custom rule and resource engineering
• Less scalable over multiple domains or kinds of lists
• Many rely on page images which may not be available
46
Appendix
47
from OCRed Text
ListReader
to Populated Ontology
48
General List Reading Pipeline
OCR
List Detection
List Structure Recognition
Information Extraction
49
Challenges
50
OCR Challenges
OCR
\nFirst row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig,\nH. Megorden, D Wynne\nSecond row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun,\nMr. Bohnsack.\nThird row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom.\nFourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny-\nQootLaM "leam\nCaptain Donald "Dude" Bakken ............... Right Half Back\nLeRov "Sonny' Johnson ........ ..........,.... Lcft Half Back\nOrley Bakken ...........,........ ... ,.......... Quarter Back\nRoger Myhrum .............. ..................... Full Back\nBill "Schnozz" Krohg .............. ................ Center\nHoward "Little Huby" Megorden ................ Right Guard\nRoyce "Shorty" Norgaard ....................... Left Guard\nEugene "Mad Russian" Easthind ............... Right Tackle\nAlvin "Stuben" Hagen ......................... Left Tackle\nRichard "Dick" Nienabcr ........................ Right End\nJames "Oakie" Wogsland .......................... Lcft End\n\nOther lettermen were-\nGlenn "Doc" Whaley\nAllen "Swede" Enckson\nJames "Snooky" Mittun\nCurtis "Curt" Paulson\nArthur "Art" Vig\nForrest "Forry" Knudson\nRobert "Bobby" Roysland\nPage 26\n
<HTML>… <OL> <LI>Captain Donald "Dude" Bakken ............... Right Half Back</LI> <LI>LeRoy "Sonny” Johnson ....................... Left Half Back</LI> <LI>Orley Bakken .................................. Quarter Back</LI> <LI>Roger Myhrum ................................... Full Back</LI> <LI>Bill "Schnozz" Krohg .............................. Center</LI> <LI>Howard "Little Huby" Megorden ................ Right Guard</LI> <LI>Royce "Shorty" Norgaard ....................... Left Guard</LI> <LI>Eugene "Mad Russian" Eastlind ............... Right Tackle</LI>…
vs. HTML
51
Diversity Challenges
52
Versatile Output Schemas
53
Precision & Recall
\n1. Andrew, b. 1772.\n
\n1. Andrew, b. 1772.\n
Sibling Birth Order
Given Name
Birth Year
Birth Year
Ground Truth
False Positive (Precision Error)
False Negative (Recall Error)
Sibling Birth Order
False Positive and False Negative
Extractor Predictions
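The error categories on this slide translate into precision and recall as sketched below. The ground-truth fields come from the example record; the extractor predictions are invented for illustration (one correct field, one field with the wrong label, one field missed), and the helper `precision_recall_f1` is a hypothetical name.

```python
def precision_recall_f1(truth, predicted):
    """Score predicted (text, label) fields against hand-labeled ground truth."""
    true_pos = len(truth & predicted)
    false_pos = len(predicted - truth)  # precision errors
    false_neg = len(truth - predicted)  # recall errors
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Ground truth for the record "1. Andrew, b. 1772."
TRUTH = {
    ("1", "Sibling Birth Order"),
    ("Andrew", "Given Name"),
    ("1772", "Birth Year"),
}

# Hypothetical extractor output: "Andrew" carries the wrong label, which counts
# as both a false positive and a false negative; "1772" is missed entirely.
PREDICTED = {
    ("1", "Sibling Birth Order"),
    ("Andrew", "Birth Year"),
}

p, r, f1 = precision_recall_f1(TRUTH, PREDICTED)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```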
54
Human Effort
OCR Text
“click”
Given Names: Ali, Alison, Alex, Andie, Andy, Ariel, Caroline, Chris, Cindy, Claire, Corey, Daisy, Diane, Emmy, Francis, Heather, Janey, Jojo, Kat, Katharine, Linda, …
Surnames: Adams, Allen, Anderson, Baker, Brown, Campbell, Clark, Davis, García, González, Green, Hall, Harris, Hernández, Hill, Jackson, Johnson, Jones, King, Lee, Lewis, …
Months: Jan., January, Feb., February, Mar., March, Apr., April, May, Jun., June, Jul., July, Aug., August, Sep., Sept., September, Oct., October, Nov., …
Predicates: b., born, born on, baptized, was baptized, was baptized on, was baptized in, d., died, died on, m., married, was married to, was married in, p., parish, parishioner, c., christening, was christened at, was christened in, …
Regular Expressions for Dates:
^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$
^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d$
…
Regular Expression for Records:
(([\n])([\d]{1})(\.[ \n])(([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)([ \n])([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z]+)|([A-Z]+[a-z]+|[A-Z]+[a-z]+[A-Z]+[a-z …
55
ListReader Solution
56
ListReader
Data Entry Form
OCR Text
List & Record Discovery
Field Labeling
Text to Ontology Mapping
“click”
ListReader
OCR Text
Scan Book Images
Database Querying & Reasoning
57
Another High-level View of ListReader
1555. Elias Mather, b. 1750, d. 1788, son of Deborah Ely and Rich-
ard Mather; m. 1771, Lucinda Lee, who was b. 1752, dau. of Abner Lee
and EHzabeth Lee. Their children :—
1. Andrew, b. 1772.
2. Clarissa, b. 1774.
3. Elias, b. 1776.
4. William Lee, b. 1779, d. 1802.
5. Sylvester, b. 1782.
6. Nathaniel Griswold, b. 1784, d. 1785.
7. Charles, b. 1787.
1556. Deborah Mather, b. 1752, d. 1826, dau. of Deborah Ely and
Richard Mather; m. 1771, Ezra Lee, who was b. 1749 and d. 1821, son
of Abner Lee and Elizabeth Lee. Their children :—
1. Samuel Holden Parsons, b. 1772, d. 1870, m. Elizabeth Sullivan.
2. Elizabeth, b. 1774, d. 1851, m. 1801 Edward Hill.
3. Lucia, b. 1777, d. 1778.
4. Lucia Mather, b. 1779, d. 1870, m. John Marvin.
5. PoUy , b. 1782.
6. Phebe, b. 1783, d. 1805.
7. William Richard Henry, b. 1787, d. 1796.
8. Margaret Stoutenburgh, b. 1794.
58
Data Mapping (Packer and Embley, 2014b, 2014a, 2013)
59
ListReader Mapping
Structural Distinction / Predicates / Labels
1. Lexical vs. non-lexical
   Predicates: Person(p1) vs. GivenName(“Elizabeth”)
   Labels: Person.GivenName[1] (Elizabeth)
2. N-ary relationships
   Predicates: Person-SpouseName-MarriageDate(p1, “Edward”, “1801”)
   Labels: Person.(MarriageDate)SpouseName (Edward)
3. M degrees of separation
   Predicates: Person-DeathDate(p1, dd1), DeathDate-Year(dd1, “1851”)
   Labels: Person.DeathDate.Year (1851)
4. Functionality and optionality
   Predicates: Person-GivenName() vs. Person-Surname()
   Labels: Person.GivenName[1] (Elizabeth), Person.Surname (Hill)
5. Generalization-specialization class hierarchies
   Predicates: Person vs. Child
   Labels: Child.ChildNr (2)
6. Non-tree ontology structure
   Predicates: Person-BirthDate(p1, bd1), BirthDate-Year(bd1, “1774”), Person-DeathDate(p1, dd1), DeathDate-Year(dd1, “1851”)
   Labels: Person.BirthDate.Year (1774), Person.DeathDate.Year (1851)
2. Elizabeth, b. 1774, d. 1851, m. 1801 Edward Hill.
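To make the label-to-predicate direction of this mapping concrete, the sketch below expands a plain dotted path label such as Person.DeathDate.Year into the chain of binary predicates shown in row 3. It is an assumption-laden illustration: the helper names, and the convention of building object identifiers like p1 and dd1 from the capital letters of the object-set name, are inferred from the examples on the slide, and the bracketed index and parenthesized forms of labels are not handled.

```python
def object_id(object_set):
    """Hypothetical naming: lowercase initials of the CamelCase name plus '1'."""
    return "".join(ch.lower() for ch in object_set if ch.isupper()) + "1"


def path_to_predicates(label, text):
    """Expand a dotted ontology-path label into a chain of binary predicates."""
    names = label.split(".")
    predicates = []
    for k in range(len(names) - 1):
        parent, child = names[k], names[k + 1]
        is_last = k == len(names) - 2
        # The final object set is lexical, so it holds the labeled text itself.
        arg = f'"{text}"' if is_last else object_id(child)
        predicates.append(f"{parent}-{child}({object_id(parent)}, {arg})")
    return predicates


death = path_to_predicates("Person.DeathDate.Year", "1851")
birth = path_to_predicates("Person.BirthDate.Year", "1774")
print(death)
print(birth)
```

Both paths start from the same Person object p1, which is how two labeled years in one record end up attached to one person (row 6).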
60
Unsupervised List Detection (Packer and Embley, 2012)
61
Literal Pattern Area
Score = 3 x 7 = 21
62
Pattern Area
Score = 6 x 7 = 42
64
Label Selector 1: Naïve Bayes Classifier
OCR Text Noisy Word Categories
65
OCR Text Noisy Word Categories
Label Selector 2: Standard Deviation
66
26 Pages for Dev / Parameter Setting, F-measure on 16 Separate Test Pages
[Bar chart: F-measure on a 0% to 100% axis, with series "Averaged over Pages" and "Averaged over Words"; bar values 12%, 32%, 38%, 51%, 77%, 84%, 86%, 86%]
67
Local Regex (Packer and Embley, 2013)
68
OCR
newline First row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig, newline H. Megorden, D Wynne newline Second row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun, newline Mr. Bohnsack. newline Third row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom. newLine Fourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny- newline newline QootLaM "leam newline Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline
69
Hand Labeling
newline First row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig, newline H. Megorden, D Wynne newline Second row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun, newline Mr. Bohnsack. newline Third row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom. newLine Fourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny- newline newline QootLaM "leam newline Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline
70
List Start Detection
newline First row, left to right: C. Paulson, G. Whaley, E Eastlund, B. Krohg, D. Bakken, R. Norgaard, 0. Bakken, A. Vig, newline H. Megorden, D Wynne newline Second row- Mr. See bach, D. Colligan, J. Wogsland, F Knudson, A. Hagen, R. Myhrum, R. Nienaber, J. Mittun, newline Mr. Bohnsack. newline Third row: G. Carlm, R. Reterson, K Larson, J Skatvold, A. Enckson, R Roysland, L.Johnson, L. Nystrom. newLine Fourth row: R. Kvare, H. Haugen, R. Lubken, R Larson, A. Carlson, A. Nienaber, W Ram bo I, V Hanson, K. Ny- newline newline QootLaM "leam newline Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline
newline Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline
71
newline Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline
Record Boundary Detection
Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline
72
Wrapper for First Record
Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline
73
Update and Label using First Wrapper
Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline
74
Update and Label using First Wrapper
Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline
75
Final Wrapper & Extraction
Captain Donald "Dude" Bakken ............... Right Half Back newline LeRoy "Sonny' Johnson ..................,.... Lcft Half Back newline Orley Bakken ...........,...........,.......... Quarter Back newline Roger Myhrum ................................... Full Back newline Bill "Schnozz" Krohg .............................. Center newline Howard "Little Huby" Megorden ................ Right Guard newline Royce "Shorty" Norgaard ....................... Left Guard newline Eugene "Mad Russian" Easthind ............... Right Tackle newline Alvin "Stuben" Hagen ......................... Left Tackle newline Richard "Dick" Nienabcr ........................ Right End newline James "Oakie" Wogsland .......................... Lcft End newline newline Other lettermen were- newline Glenn "Doc" Whaley newline Allen "Swede" Enckson newline James "Snooky" Mittun newline Curtis "Curt" Paulson newline Arthur "Art" Vig newline Forrest "Forry" Knudson newline Robert "Bobby" Roysland newline Page 26 newline
76
Precision & Recall
\n1. Andrew, b. 1772.\n
\n1. Andrew, b. 1772.\n
Birth Order
Given Name
Birth Year
Birth Year
Ground Truth
False Positive (Precision Error)
False Negative (Recall Error)
Birth Order
False Positive and False Negative
Extractor Predictions
77
Evaluation
Evaluated on 60 short-record lists from “The Ely Ancestry” family history book containing 3088 non-space word tokens, 1254 extractable field strings
78
ListReader Efficiency
79
Local HMM (Packer and Embley, 2013)
80
HMM Induction
Initialize Active Learning (Novelty Detection)
81
HMM Modeling Details
Sparse, Noisy Data:
1. Parameter smoothing: prior knowledge as non-zero Dirichlet priors
2. Emission model parameter tying for shared lexical object sets
3. Cluster field words in the emission model with 5 character classes
List Structure:
4. Transition model is a fine-grained total order among word states
5. Transition model is cyclical only at record delimiters
6. Transition model is a total order: non-zero priors allow for deletions
7. Transition model has unique “unknown” states to allow for insertions
8. Delimiter-state emission models don’t use word clustering like field states
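Item 1 above (non-zero Dirichlet priors) can be illustrated with a short sketch. It estimates the transition distribution out of one HMM state as the posterior mean under a symmetric Dirichlet prior, which amounts to adding a pseudo-count to every possible next state. The state names, the prior value of 0.1, and the function name `smoothed_transitions` are assumptions for the example and are not taken from the dissertation.

```python
from collections import Counter


def smoothed_transitions(observed_next_states, states, prior=0.1):
    """Estimate P(next state) with a symmetric Dirichlet pseudo-count per state."""
    counts = Counter(observed_next_states)
    total = sum(counts.values()) + prior * len(states)
    return {state: (counts[state] + prior) / total for state in states}


# Hypothetical states that can follow a GivenName word state.
STATES = ["GivenName", "Unknown", "BirthYear"]

# Transitions seen in the few labeled records; "Unknown" was never observed.
OBSERVED = ["BirthYear", "BirthYear", "GivenName"]

probs = smoothed_transitions(OBSERVED, STATES)
print(probs)
```

Because the unseen transition keeps a small non-zero probability, a record with an extra (inserted) or missing (deleted) field can still be decoded instead of being assigned probability zero.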
82
Global Regex (Packer and Embley, 2014a)
83
AUL Results
Area Under the Learning Curve
        CRF    ListReader (One-phase)   ListReader (Two-phase)
Prec.   54.1   95.6                     94.4
Rec.    34.4   32.7                     39.0
F1      39.5   48.6                     55.1
84
Global HMM (Packer and Embley, 2014b)
85
HMM Field Group Template
86
F-measure Results
87
Precision Results
88
Recall Results
89
90
ListReader
ListReader Form
OCR Text
Unsupervised Active Learning Pipeline
Conflate Text
Build Suffix Tree
Find Record Patterns in Tree
Request Labels from User
Find Major Fields in Record Clusters
Generate Regex Templates
---------------------- Phase One ---------------------- ---- Phase Two ----
“click”
91
ListReader
Child(p1)
Person(p1)
Child-ChildNumber(p1, “1”)
Child-Name(p1, n1)
…
92
ListReader Pieces
Four types of knowledge rep. and mappings:
• Predicates in ontologies
• Labeled fields in plain text
• List wrappers (regex)
• Data entry forms
\n[Dg].[Sp][UpLo+].\n