Cui Tao
PhD Dissertation Defense
Ontology Generation, Information Harvesting and Semantic Annotation for Machine-Generated Web Pages
Motivation
• Birth date of my great grandpa
• Price and mileage of red Nissans, 1990 or newer
• Protein and amino acids information of gene cdk-4?
• US states with property crime rates above 1%
Search by Search Engine
Search the Hidden Web
• The Hidden Web:
  – Hidden behind forms
  – Hard to query
“cdk-4”
Query for Data
• The Hidden Web:
  – Hidden behind forms
  – Hard to query
Find the protein and the amino-acid information for gene “cdk-4”
A Web of Pages → A Web of Knowledge
• Web of Knowledge
  – Machine-“understandable”
  – Publicly accessible
  – Queryable by standard query languages
• Semantic annotation
  – Domain ontologies
  – Populated conceptual model
• Problems to resolve
  – How do we create ontologies?
  – How do we annotate pages for ontologies?
Contributions of Dissertation Work
• Web of Pages → Web of Knowledge
  – Knowledge & meta-knowledge extraction
  – Reformulation as machine-“understandable” knowledge
• Automatic & semi-automatic solutions via:
  – Sibling tables (TISP/TISP++)
  – User-created forms (FOCIH)
Automatic Annotation with TISP (Table Interpretation with Sibling Pages)
• Recognize tables (discard non-tables)
• Locate table labels
• Locate table values
• Find label/value associations
Recognize Tables
Data Table
Layout Tables (discard)
Nested Data Tables
Find Label/Value Associations
Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
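The example above keys a value by a column-label path and a row path. A minimal sketch of how such keys could be built for a header-row table follows; the function name `associate`, the column label "Gene model", and every cell except WP:CE28918 are assumptions made for illustration, not taken from the slides.

```python
def associate(path, header, rows):
    """Pair every value in a header-row table with a (label path, row path) key.

    path   -- labels of the enclosing tables, outermost first
    header -- column labels of the table
    rows   -- data rows, each as long as the header
    """
    prefix = ".".join(path)
    cells = {}
    for row_number, row in enumerate(rows, start=1):
        for label, value in zip(header, row):
            key = (f"{prefix}.{label}", f"{prefix}.{row_number}")
            cells[key] = value
    return cells


# Hypothetical table fragment; only the value WP:CE28918 appears on the slide.
cells = associate(
    ["Identification", "Gene model(s)"],
    ["Gene model", "Protein"],
    [["model-a", "protein-a"], ["model-b", "WP:CE28918"]],
)
key = ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
print(cells[key])
```

Keeping the enclosing labels in the key is what lets a value from a nested table stay distinguishable after the tables are unnested.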
Interpretation Technique: Sibling Page Comparison
[Figure series comparing sibling pages; callouts: “Same”, “Almost Same”, “Different”]
Technique Details
• Unnest tables
• Match tables in sibling pages
  – “Perfect” match (layout table: discard)
  – “Reasonable” match (sibling table)
• Determine & use table-structure pattern
  – Discover pattern
  – Pattern usage
  – Dynamic pattern adjustment
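The matching step can be pictured with a small sketch. The slides do not give the actual similarity measure, so the position-wise cell comparison and the thresholds `low` and `high` below are assumptions chosen only to illustrate the idea: identical tables are treated as layout, partly identical tables as siblings whose unchanged cells are label candidates.

```python
def compare_sibling_tables(table_a, table_b):
    """Return the fraction of aligned cells whose text is identical."""
    cells_a = [cell for row in table_a for cell in row]
    cells_b = [cell for row in table_b for cell in row]
    if not cells_a or len(cells_a) != len(cells_b):
        return 0.0
    same = sum(a == b for a, b in zip(cells_a, cells_b))
    return same / len(cells_a)


def classify_match(table_a, table_b, low=0.3, high=1.0):
    """Label a table pair; `low` and `high` are illustrative thresholds."""
    score = compare_sibling_tables(table_a, table_b)
    if score >= high:
        return "layout"      # "perfect" match: nothing varies, so discard
    if score >= low:
        return "sibling"     # "reasonable" match: shared labels, varying values
    return "unrelated"


def split_labels_and_values(table_a, table_b):
    """Cells equal in both pages are label candidates; the rest are values."""
    labels, values = [], []
    for row_a, row_b in zip(table_a, table_b):
        for a, b in zip(row_a, row_b):
            (labels if a == b else values).append(a)
    return labels, values


# Hypothetical sibling pages from a car-ads site.
page_1 = [["Make", "Nissan"], ["Year", "1994"]]
page_2 = [["Make", "Honda"], ["Year", "2001"]]
navigation = [["Home", "Search"]]

print(classify_match(page_1, page_2))
print(classify_match(navigation, navigation))
print(split_labels_and_values(page_1, page_2))
```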
Table Unnesting
Table Structure Patterns
Regularity Expectations:
• (<tr><(td|th)> {L} <(td|th)> {V})ⁿ
• <tr>(<(td|th)> {L})ⁿ (<tr>(<(td|th)> {V})ⁿ)+
• …
Pattern combinations are also possible.
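The two listed patterns can be checked mechanically once each cell is tagged as a label (L) or a value (V). The sketch below assumes that tagging has already been done; the string encoding and the function names are illustrative choices, not the dissertation's implementation.

```python
import re


def encode(rows):
    """Encode rows of 'L'/'V' cell tags as one string, 'R' marking each <tr>."""
    return "".join("R" + "".join(row) for row in rows)


def matches_label_value_pairs(rows):
    """First pattern: every row is one label cell followed by one value cell."""
    return re.fullmatch(r"(?:RLV)+", encode(rows)) is not None


def matches_header_row(rows):
    """Second pattern: a row of n labels, then one or more rows of n values."""
    if len(rows) < 2:
        return False
    n = len(rows[0])
    pattern = rf"RL{{{n}}}(?:RV{{{n}}})+"
    return n > 0 and re.fullmatch(pattern, encode(rows)) is not None


vertical = [["L", "V"], ["L", "V"], ["L", "V"]]
horizontal = [["L", "L", "L"], ["V", "V", "V"], ["V", "V", "V"]]

print(matches_label_value_pairs(vertical), matches_header_row(vertical))
print(matches_label_value_pairs(horizontal), matches_header_row(horizontal))
```

The second check builds its regular expression from the header width so that every value row must have exactly as many cells as there are labels.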
Pattern Usage
Dynamic Pattern Adjustment
TISP++
• Automatic ontology generation
• Automatic information annotation
Ontology Generation – OSM
• Object set: table labels
  – Lexical: labels that associate with actual values
  – Non-lexical: labels that associate with other tables
• Relationship set: table nesting
• Constraints: updates based on observation
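As a rough illustration of these rules, the sketch below walks an interpreted table represented as a nested dictionary (an assumed representation) and sorts labels into lexical and non-lexical object sets, recording one relationship per nesting step. Constraint discovery is left out, and the sample page is hypothetical.

```python
def derive_schema(label, content, schema=None):
    """Walk a nested table and collect object sets and relationship sets.

    `content` is either a plain value (the label is lexical) or a dict of
    nested labels (the label is non-lexical and relates to each child).
    """
    if schema is None:
        schema = {"lexical": set(), "non_lexical": set(), "relationships": set()}
    if isinstance(content, dict):
        schema["non_lexical"].add(label)
        for child, child_content in content.items():
            schema["relationships"].add((label, child))
            derive_schema(child, child_content, schema)
    else:
        schema["lexical"].add(label)
    return schema


# Hypothetical interpreted page; the root label "Gene" is an assumption.
page = {
    "Name": "cdk-4",
    "Identification": {"Gene model(s)": {"Protein": "WP:CE28918"}},
}
schema = derive_schema("Gene", page)
print(sorted(schema["lexical"]))
print(sorted(schema["non_lexical"]))
print(sorted(schema["relationships"]))
```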
Ontology Generation – OWL
• Object set: OWL class
• Relationship set: OWL object property
• Lexical object set:
  – OWL datatype property
  – Different annotation properties to keep track of the provenance
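One way to picture this mapping is to print Turtle statements for a small schema. The sketch below is an assumption-laden illustration: the naming scheme (`has_` prefix, underscores in local names) is invented here, prefix declarations are omitted, and the provenance annotation properties mentioned above are not modeled.

```python
import re


def local_name(label):
    """Turn a table label into a name usable after the ':' prefix."""
    return re.sub(r"\W+", "_", label).strip("_")


def to_turtle(non_lexical, lexical, relationships):
    """Emit one Turtle statement per object set and relationship set."""
    lines = []
    for label in sorted(non_lexical):
        lines.append(f":{local_name(label)} a owl:Class .")
    for parent, child in sorted(relationships):
        domain = local_name(parent)
        name = local_name(child)
        if child in lexical:
            # Lexical object sets become datatype properties of their parent.
            lines.append(f":{name} a owl:DatatypeProperty ; rdfs:domain :{domain} .")
        else:
            # Nesting between non-lexical object sets becomes an object property.
            lines.append(
                f":has_{name} a owl:ObjectProperty ; "
                f"rdfs:domain :{domain} ; rdfs:range :{name} ."
            )
    return lines


# Hypothetical schema for illustration.
lines = to_turtle(
    non_lexical={"Gene", "Identification"},
    lexical={"Protein"},
    relationships={("Gene", "Identification"), ("Identification", "Protein")},
)
print("\n".join(lines))
```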
Generated Ontology
RDF Graph
Query the Data
Find the protein and the amino-acid information for gene “cdk-4”
TISP Evaluation
• Applications
  – Commercial: car ads
  – Scientific: molecular biology
  – Geopolitical: US states and countries
• Data: > 2,000 tables in 35 sites
• Evaluation
  – Initial two sibling pages
    • Correct separation of data tables from layout tables?
    • Correct pattern recognition?
  – Remaining tables in site
    • Information properly extracted?
    • Able to detect and adjust for pattern variations?
Experimental Results
• Table recognition: correctly discarded 157 of 158 layout tables
• Pattern recognition: correctly found 69 of 72 structure patterns
• Extraction and adjustments: 5 path adjustments and 34 label adjustments all correct
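Expressed as rates, the counts above work out to roughly 99.4% for table recognition and 95.8% for pattern recognition, with all 39 adjustments correct. The short script below only redoes that arithmetic from the slide's numbers; the dictionary keys are descriptive names chosen here.

```python
def rate(correct, total):
    """Share of correct decisions, as a fraction between 0 and 1."""
    return correct / total


counts = {
    "layout tables discarded": (157, 158),
    "structure patterns found": (69, 72),
    "path and label adjustments": (5 + 34, 5 + 34),
}
for name, (correct, total) in counts.items():
    print(f"{name}: {correct}/{total} = {rate(correct, total):.1%}")
```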
TISP++ Performance
• Performance depends on TISP
• TISP test set
  – Generates all ontologies correctly
  – Annotates all information in tables correctly
Form-based Ontology Creation and Information Harvesting (FOCIH)
• Personalized ontology creation by form
  – General familiarity
  – Reasonable conceptual framework
  – Appropriate correspondence
• Transformable to ontological descriptions
• Capable of accepting source data
• Automated ontology creation
• Automated information harvesting
Form Creation
Created Sample Form
Generated Ontology View
Source-to-Form Mapping
Almost Ready to Harvest
• Need reading path: DOM-tree structure
• Need to resolve mapping problems
  – Pattern recognition
  – Instance recognition
Reading Path
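A reading path through the DOM tree can be thought of as an XPath-like route from the document root to the node holding a mapped value. The sketch below records such a route with the standard-library `xml.etree.ElementTree` on a tiny well-formed snippet; the function name, the path notation, and the sample markup are assumptions for illustration, and real pages would need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET


def reading_path(root, target_text):
    """Return an XPath-like path to the first element whose text is `target_text`."""

    def walk(node, path):
        if (node.text or "").strip() == target_text:
            return path
        counts = {}
        for child in node:
            # Number siblings per tag so the path pins down one position.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
            if found:
                return found
        return None

    return walk(root, f"/{root.tag}")


# Hypothetical page fragment.
html = (
    "<html><body><table><tr>"
    "<td>Protein</td><td>WP:CE28918</td>"
    "</tr></table></body></html>"
)
root = ET.fromstring(html)
print(reading_path(root, "WP:CE28918"))
```

Because sibling pages share their structure, a path recorded on one page is a plausible place to look for the corresponding value on the others.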
Pattern & Instance Recognition
regular expression for decimal number
left context
right context
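The combination of a left context, a value expression, and a right context can be sketched as one compiled pattern. The decimal-number expression, the function names, and the sample text below are assumptions; the slides do not show the expressions FOCIH actually generates.

```python
import re

# Assumed regular expression for a decimal number.
DECIMAL = r"\d+(?:\.\d+)?"


def build_extractor(left_context, right_context, value_pattern=DECIMAL):
    """Compile a pattern that captures a value between two literal contexts."""
    return re.compile(
        re.escape(left_context)
        + r"\s*(" + value_pattern + r")\s*"
        + re.escape(right_context)
    )


def extract(extractor, text):
    """Return the captured value, or None when the contexts do not match."""
    match = extractor.search(text)
    return match.group(1) if match else None


# Hypothetical cell text from two sibling pages.
extractor = build_extractor("weight:", "kDa")
print(extract(extractor, "Molecular weight: 34.5 kDa"))
print(extract(extractor, "Molecular weight: 112 kDa"))
print(extract(extractor, "Length: 303 aa"))
```

Anchoring on both contexts is what lets a substring value learned from one page be found again when the number itself changes on a sibling page.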
list pattern, delimiter is “,”
list pattern, delimiter is regular expression for percentage numbers and a comma
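Both list patterns amount to splitting a cell's text on a delimiter expression. A minimal sketch follows, assuming `re.split` as the mechanism; the delimiter expressions and the sample strings are hypothetical stand-ins for what the slides show only as figures.

```python
import re

# Assumed delimiter: a percentage number followed by a comma or the end of text.
PERCENT_THEN_COMMA = r"\s*\d+(?:\.\d+)?%\s*(?:,|$)\s*"


def split_list(text, delimiter):
    """Split `text` on the delimiter expression and drop empty items."""
    return [item.strip() for item in re.split(delimiter, text) if item.strip()]


print(split_list("alpha, beta, gamma", r","))
print(split_list("Group A 74%, Group B 15.5%, Group C 10.5%", PERCENT_THEN_COMMA))
```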
Can Now Harvest
Semantic Annotation
Semantic Query
FOCIH Performance
• Ontology creation
• Semantic annotation
  – Depends on TISP performance
  – Depends on pattern and instance recognition performance
FOCIH Performance
• Pattern and instance recognition:
  – Works with highly regular data
  – Tested 71 mappings
  – 25 full-string values (25/25 correct)
  – 38 substring values (29/38 correct)
  – 8 list patterns (6/8 correct)
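Summing the three categories gives 60 correct out of the 71 tested mappings, about 84.5% overall. The snippet below is just that arithmetic on the slide's counts; the category keys are shorthand chosen here.

```python
mappings = {
    "full-string": (25, 25),
    "substring": (29, 38),
    "list": (6, 8),
}

total_correct = sum(correct for correct, _ in mappings.values())
total_tested = sum(tested for _, tested in mappings.values())

for kind, (correct, tested) in mappings.items():
    print(f"{kind}: {correct}/{tested} = {correct / tested:.1%}")
print(f"overall: {total_correct}/{total_tested} = {total_correct / total_tested:.1%}")
```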
FOCIH Difficulties
No selection
WoK via TISP
WoK via FOCIH
Contributions
• TISP: automatic sibling table interpretation
• TISP++:
  – Automatic ontology generation based on interpreted tables
  – Automatic semantic annotation for interpreted tables
• FOCIH:
  – Semi-automatic personalized ontology creation
  – Automatic personalized information harvesting and semantic annotation
• All together: contributes to turning the current web of pages into a web of knowledge
Future Work
• Sibling pages in addition to sibling tables
• Reverse engineer from ontologies to forms as a basis for information harvesting for already defined ontologies.