Automating the Extraction of Genealogical Information from Historical Documents
Aaron P. Stewart
David W. Embley
March 20, 2011
Part I: Vision
Current projects at the BYU Data Extraction Group
Goal: Search books for names
[Figure: the book "History of the Jones Family" is scanned and searched for the name "George Jones"]
Original Document
Extracted Facts
Names
• William Gerard Lathrop
• Mary Ely
• Gerard Lathrop
• Charlotte Brackett Jennings
• Nathan Tilestone Jennings
• Maria Miller
• Maria Jennings [Lathrop]
• Donald McKenzie [Lathrop]
• Anna Margaretta [Lathrop]
• Anna Catherine [Lathrop]
Relationships
• William Gerard Lathrop : son of : Mary Ely
• William Gerard Lathrop : son of : Gerard Lathrop
• William Gerard Lathrop : m. : Charlotte Brackett Jennings
• Charlotte Brackett Jennings : dau. of : Nathan Tilestone Jennings
• Charlotte Brackett Jennings : dau. of : Maria Miller
Relationships (continued)
• Maria Jennings : child of : William Gerard Lathrop
• Maria Jennings : child of : Charlotte Brackett
• William Gerard : child of : William Gerard Lathrop
• William Gerard : child of : Charlotte Brackett
• Donald McKenzie : child of : William Gerard Lathrop
• Donald McKenzie : child of : Charlotte Brackett
• Anna Margaretta : child of : William Gerard Lathrop
• Anna Margaretta : child of : Charlotte Brackett
• Anna Catherine : child of : William Gerard Lathrop
• Anna Catherine : child of : Charlotte Brackett
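The relationship facts above follow a "subject : relation : object" pattern, which maps naturally onto triples. The following is a minimal sketch of that idea, not the project's actual data model; the `Fact` type and `parse_fact` helper are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One extracted relationship (hypothetical structure, for illustration)."""
    subject: str
    relation: str
    obj: str


def parse_fact(line: str) -> Fact:
    """Split a 'subject : relation : object' line into a Fact."""
    parts = [part.strip() for part in line.split(" : ")]
    if len(parts) != 3:
        raise ValueError(f"expected 'subject : relation : object', got {line!r}")
    return Fact(*parts)


# A few of the relationship lines shown on the slide.
lines = [
    "William Gerard Lathrop : son of : Mary Ely",
    "William Gerard Lathrop : m. : Charlotte Brackett Jennings",
    "Maria Jennings : child of : Charlotte Brackett",
]
facts = [parse_fact(line) for line in lines]

for fact in facts:
    print(fact)
```

Storing facts as triples keeps the relation label ("son of", "m.", "child of") as data, so later searches can filter on it without any per-relation code.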
Inferred Facts
Inferred Relationships
• Maria Jennings : grandchild of : Mary Ely
• Maria Jennings : grandchild of : Gerard Lathrop
• Maria Jennings : grandchild of : Nathan Tilestone Jennings
• Maria Jennings : grandchild of : Maria Miller
• William Gerard : grandchild of : Mary Ely
• William Gerard : grandchild of : Gerard Lathrop
• William Gerard : grandchild of : Nathan Tilestone Jennings
• William Gerard : grandchild of : Maria Miller
• …
Keywords
Chief Justice
Queries
• Is there a chief justice related to Mary Ely?
• Who are the sons of Gerard Lathrop?
• Who are the grandchildren of Mary Ely?
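A query such as "Who are the sons of Gerard Lathrop?" can be read as a filter over extracted relationship triples. The sketch below assumes the facts are held as plain (subject, relation, object) tuples; the `subjects_of` helper is a hypothetical name, and this is not a description of how the project's search is implemented.

```python
# A subset of the extracted relationships, as (subject, relation, object) tuples.
facts = [
    ("William Gerard Lathrop", "son of", "Mary Ely"),
    ("William Gerard Lathrop", "son of", "Gerard Lathrop"),
    ("Charlotte Brackett Jennings", "dau. of", "Nathan Tilestone Jennings"),
    ("Charlotte Brackett Jennings", "dau. of", "Maria Miller"),
]


def subjects_of(facts, relation, obj):
    """Return every subject that stands in `relation` to `obj`, sorted by name."""
    return sorted({s for s, r, o in facts if r == relation and o == obj})


# "Who are the sons of Gerard Lathrop?"
sons = subjects_of(facts, "son of", "Gerard Lathrop")
print(sons)
```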
Part II: Implementation
Ontology Editor
Data Frame Editor
Rule Editor
Name Query
HyKSS Indexing
Keyword Search
Relationship Search
Inferred Relationship Search
Maria Jennings is a grandchild of Mary Ely
GrandchildOf(Maria Jennings, Mary Ely) :- Child-Parent(Maria Jennings, William Gerard Lathrop), Child-Parent(William Gerard Lathrop, Mary Ely)
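The rule above joins two Child-Parent facts on the shared middle person. As a rough illustration of that join, the Python sketch below derives grandchild pairs from a small set of child-parent pairs; the `infer_grandchildren` function and the set-of-tuples representation are assumptions made for this example, not the Rule Editor's actual mechanism.

```python
# Child-Parent facts as (child, parent) pairs, taken from the extracted relationships.
child_parent = {
    ("William Gerard Lathrop", "Mary Ely"),
    ("William Gerard Lathrop", "Gerard Lathrop"),
    ("Maria Jennings", "William Gerard Lathrop"),
    ("Donald McKenzie", "William Gerard Lathrop"),
}


def infer_grandchildren(child_parent):
    """Apply GrandchildOf(x, z) :- Child-Parent(x, y), Child-Parent(y, z)."""
    # Index parents by child so the join on the middle person y is a lookup.
    parents_of = {}
    for child, parent in child_parent:
        parents_of.setdefault(child, set()).add(parent)
    return {
        (child, grandparent)
        for child, parents in parents_of.items()
        for parent in parents
        for grandparent in parents_of.get(parent, ())
    }


grandchild_of = infer_grandchildren(child_parent)
for child, grandparent in sorted(grandchild_of):
    print(f"{child} : grandchild of : {grandparent}")
```

The join only fires when the same name string appears as both a parent and a child, so in practice names would need to be resolved to the same person before rules like this are applied.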
Part III: Improvements
Extraction Tools
Need Better Extraction Results
From Packer et al., http://deg.byu.edu/papers/Ancestry_NAACL_HLT_Paper.pdf
------- Lists -------
Example of a Better Extractor (Margin Finder)
B\ liee (OCR error)
Buekman (OCR error)
Jobsph (OCR error)
Baseline errors
Uuckkman (OCR error)
Charles. (OCR error)
[Figure: Margin Finder output, with lines of the page labeled LEVEL 1 or LEVEL 2]
Need Annotation Tools
Credits
• Ontology Editor – Numerous past students
• Data Frame Editor – Numerous past students
• Rule Editor – Nathan Tate
• Hybrid Keyword and Semantic Search (HyKSS) – Andrew Zitzelberger
• This presentation contains both actual screenshots and mock-ups of projected results