Automatic Creation and Simplified Querying of Semantic Web Content
An Approach Based on Information-Extraction Ontologies
Yihong Ding, David W. Embley, and Stephen W. Liddle
Brigham Young University

Post on 22-Dec-2015


Page 1: Automatic Creation and Simplified Querying of Semantic Web Content An Approach Based on Information-Extraction Ontologies Yihong Ding, David W. Embley,

Automatic Creation and Simplified Querying of Semantic Web Content

An Approach Based on Information-Extraction Ontologies

Yihong Ding, David W. Embley, and Stephen W. Liddle
Brigham Young University

Page 2

Fundamental Problems

Lack of semantic web content
Difficulty of content creation
Inability to use semantic web content easily

Page 3

Proposed Solutions

Automatically annotate data-rich web pages (turning them into semantic web pages)
Provide for free-form, textual queries of semantic web content

Page 4

A Show-Case Vision

Find me the price and mileage of red Nissans – I want a 1990 or newer.

Page 5

Demo I: Data Extraction

Page 6

Demo II: Semantic Annotation

Page 7

Demo III: Free-Form Query

Page 8

Explanation: How it Works

Extraction Ontologies
Semantic Annotation
Free-Form Query Interpretation

Page 9

Extraction Ontologies

[Figure: extraction-ontology diagram with callouts for object sets (lexical, non-lexical, primary object set), relationship sets, participation constraints, aggregation, and generalization/specialization]

Page 10

Formalism & Extraction Ontologies

Fully formalized in predicate calculus
  Object set ~ 1-place predicate
  N-ary relationship set ~ n-place predicate
  Constraint ~ closed predicate-calculus formula
As a description logic ~ ALCN (Attributive Language with Complement and Numeric Restrictions)

(a quick side note)
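As an illustration of the correspondence above, here is a small sketch in predicate-calculus notation. The object sets Car and Price and the relationship set CarHasPrice are hypothetical names chosen for the example; they are not taken from the slides. The first formula types the relationship set, the second is a participation constraint saying a car has at most one price, and the third is one possible ALCN-style rendering of that same constraint.

```latex
% Hypothetical names: Car, Price (1-place predicates), CarHasPrice (2-place predicate)
\begin{align*}
  &\forall x\,\forall y\,\bigl(\mathit{CarHasPrice}(x,y) \rightarrow \mathit{Car}(x) \land \mathit{Price}(y)\bigr) \\
  &\forall x\,\bigl(\mathit{Car}(x) \rightarrow \exists^{\le 1} y\;\mathit{CarHasPrice}(x,y)\bigr) \\
  &\mathit{Car} \sqsubseteq (\le 1\;\mathit{hasPrice})
\end{align*}
```

The first two formulas have no free variables, which is what "closed" means on the slide.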

Page 11

Extraction Ontologies

Data Frame:
  Internal Representation: float
  Values
    External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
    Left Context: $
  Key Word Phrase
    Key Words: ([Pp]rice)|([Cc]ost)| …
  Operators
    Operator: >
    Key Words: (more\s*than)|(more\s*costly)|…
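The data frame above can be tried out directly. The following is a minimal sketch, assuming Python's `re` module stands in for whatever matcher the authors used. The three patterns are copied from the slide, with the trailing "…" alternatives left out; the function name `extract_prices` is made up for this example.

```python
import re

# External representation, copied verbatim from the slide.
PRICE_VALUE = re.compile(r"\s*[$]\s*(\d{1,3})*(\.\d{2})?")
# Keyword alternatives shown on the slide (the elided "…" part is omitted).
PRICE_KEYWORDS = re.compile(r"([Pp]rice)|([Cc]ost)")
GREATER_THAN_KEYWORDS = re.compile(r"(more\s*than)|(more\s*costly)")


def extract_prices(text):
    """Return every matched price as a float, the slide's internal representation."""
    prices = []
    for match in PRICE_VALUE.finditer(text):
        # Drop the "$" left context and surrounding whitespace before converting.
        digits = match.group(0).replace("$", "").strip()
        if digits:  # a bare "$" with no digits is not a price
            prices.append(float(digits))
    return prices


if __name__ == "__main__":
    print(extract_prices("Asking price $4500 or best offer"))
    print(bool(GREATER_THAN_KEYWORDS.search("anything more than $3000")))
```

As transcribed, the value pattern stops at a comma, so a price written with a thousands separator would be cut short.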

Page 12

Data-Extraction Results: Car Ads

Training set for tuning ontology: 100
Test set: 116

Salt Lake Tribune

           Recall %   Precision %
Year         100         100
Make          97         100
Model         82         100
Mileage       90         100
Price        100         100
PhoneNr       94         100
Feature       91          99
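For readers unfamiliar with the two measures in the table, here is a short sketch of how recall and precision are conventionally computed for an extraction task. The counts in the usage line are hypothetical and are not the experiment's raw data, which the slides do not give.

```python
def recall_precision(correct, extracted, actual):
    """Return (recall %, precision %).

    correct   -- extracted values that are right
    extracted -- all values the extractor produced
    actual    -- all values really present in the documents
    """
    recall = 100.0 * correct / actual
    precision = 100.0 * correct / extracted
    return recall, precision


if __name__ == "__main__":
    # Hypothetical counts: 50 values present, 41 extracted, all 41 correct.
    print(recall_precision(correct=41, extracted=41, actual=50))
```

Full precision with lower recall, as in several rows of the table, means the extractor missed some values but made no wrong extractions.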

Page 13

Car Ads: Comments

Dynamic sets
  Missed: MERC, Town Car, 98 Royale
  Could use lexicon of makes and models
Unspecified variation in lexical patterns
  Missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  Could adjust lexical patterns
Misidentification of attributes
  Classified AUTO in AUTO SALES as automatic transmission
  Could adjust exceptions in lexical patterns
Typographical errors
  “Chrystler”, “DODG ENeon”, “I-15566-2441”
  Could look for spelling variations and common typos
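The suggestion to use a lexicon and look for spelling variations could be sketched as follows. This is only one possible approach, using fuzzy string matching from Python's standard `difflib` module, and it is not something the slides describe. The lexicon `MAKES`, the function name, and the cutoff value are assumptions made for the example.

```python
import difflib

# Hypothetical lexicon; a real one would list many more makes and models.
MAKES = ["Chrysler", "Dodge", "Ford", "Mercury", "Nissan"]


def normalize_make(token, cutoff=0.8):
    """Map a possibly misspelled token to a lexicon entry, or None if nothing is close."""
    by_lower = {make.lower(): make for make in MAKES}
    close = difflib.get_close_matches(token.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None


if __name__ == "__main__":
    print(normalize_make("Chrystler"))
    print(normalize_make("MERC"))
```

With this cutoff the typo "Chrystler" resolves to "Chrysler", while the abbreviation "MERC" is too far from "Mercury" to match, so abbreviations would need their own lexicon entries.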

Page 14

General Extraction Results

~ 20 Domains (cars, obituaries, cameras, jobs, games, prescription drugs, …)
Simple, unified domains: nearly 100% recall and precision
Complex, loosely defined domains (e.g. obituaries: 82% recall and 74% precision)
Typical: 80%+ recall and precision

Page 15

Generality & Resiliency of Extraction Ontologies

Assumptions about web pages (generality)
  Data rich
  Narrow domain
  Document types
    Simple multiple-record documents (easiest)
    Single-record documents (harder)
    Records with scattered components (even harder)
Declarative (resiliency)
  Still works when web pages change
  Works for new, unseen pages in the same domain
  Scalable, but takes work to declare the extraction ontology

(another quick side note)

Page 16

Semantic Annotation

Page 17

Free-Form Query Interpretation

Parse Free-Form Query (with data extraction ontology)
Select Ontology
Formulate Query Expression
Run Query Over Semantically Annotated Data

Page 18

Parse Free-Form Query

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”

Recognized: price, mileage, red, Nissan, 1996, or newer (>= Operator)
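The parsing step can be illustrated with a toy version of an extraction ontology applied to the query itself. This is a minimal sketch, not the authors' implementation: the pattern tables `KEYWORDS`, `VALUES`, and `OPERATORS` are heavily simplified, hypothetical stand-ins for the data frames of a car-ads ontology.

```python
import re

# Hypothetical, simplified recognizers for a car-ads extraction ontology.
KEYWORDS = {"Price": r"\bprice\b", "Mileage": r"\bmileage\b"}
VALUES = {
    "Color": r"\b(red|blue|white|black)\b",
    "Make": r"\b(Nissan|Ford|Toyota)",
    "Year": r"\b(19|20)\d{2}\b",
}
OPERATORS = {">=": r"\bor\s+newer\b"}

QUERY = "Find me the price and mileage of all red Nissans – I want a 1996 or newer"


def parse_query(query):
    """Mark up a free-form query with the object sets, values, and operators it mentions."""
    found = {"object_sets": [], "values": {}, "operators": []}
    for name, pattern in KEYWORDS.items():
        if re.search(pattern, query, re.IGNORECASE):
            found["object_sets"].append(name)
    for name, pattern in VALUES.items():
        match = re.search(pattern, query, re.IGNORECASE)
        if match:
            found["values"][name] = match.group(0)
    for symbol, pattern in OPERATORS.items():
        if re.search(pattern, query, re.IGNORECASE):
            found["operators"].append(symbol)
    return found


if __name__ == "__main__":
    print(parse_query(QUERY))
```

The same recognizers that extract data from web pages are reused on the query, which is why no separate query grammar is needed in this sketch.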

Page 19

Select Ontology

Similarity value: 5

Similarity value: 2

“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
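The slide does not say how the similarity value is computed. One simple reading, assumed here, is that each candidate ontology is scored by how many of its recognizers fire on the query, and the highest-scoring ontology is selected. The two ontologies and their patterns below are hypothetical; with these made-up patterns the counts come out as 5 and 2 only by construction.

```python
import re

# Hypothetical recognizer lists for two candidate ontologies.
ONTOLOGIES = {
    "car-ads": [
        r"\bprice\b",
        r"\bmileage\b",
        r"\b(red|blue|white|black)\b",
        r"\b(Nissan|Ford|Toyota)",
        r"\b(19|20)\d{2}\b",
    ],
    "real-estate": [
        r"\bprice\b",
        r"\bbedrooms?\b",
        r"\bacres?\b",
        r"\b(19|20)\d{2}\b",
    ],
}

QUERY = "Find me the price and mileage of all red Nissans – I want a 1996 or newer"


def similarity(patterns, query):
    """Assumed measure: the number of recognizers that match somewhere in the query."""
    return sum(1 for pattern in patterns if re.search(pattern, query, re.IGNORECASE))


def select_ontology(query):
    """Pick the ontology whose recognizers match the query most often."""
    return max(ONTOLOGIES, key=lambda name: similarity(ONTOLOGIES[name], query))


if __name__ == "__main__":
    for name, patterns in ONTOLOGIES.items():
        print(name, similarity(patterns, QUERY))
    print(select_ontology(QUERY))
```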

Page 20

Formulate Query Expression

Conjunctive queries and aggregate queries
Mentioned object sets are all of interest in the result.
Values and operator keywords determine conditions.
  Color = “red”
  Make = “Nissan”
  Year >= 1996 (>= Operator)
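The two rules on this slide can be sketched in a few lines. This is an illustrative reading, not the authors' code: the function name `formulate` and its input shapes are assumptions, and equality is assumed as the default comparison when no operator keyword was recognized for a value.

```python
def formulate(object_sets, values, operators):
    """Build a conjunctive query from a parsed free-form query.

    object_sets -- object sets mentioned by keyword, e.g. ["Price", "Mileage"]
    values      -- recognized values by object set, e.g. {"Color": "red"}
    operators   -- recognized operators by object set, e.g. {"Year": ">="}
    """
    # Values and operator keywords determine the conditions; "=" is the assumed default.
    conditions = [(name, operators.get(name, "="), value) for name, value in values.items()]
    # Every mentioned object set is of interest in the result,
    # including those mentioned only through a value.
    result_sets = list(object_sets) + [name for name in values if name not in object_sets]
    return result_sets, conditions


if __name__ == "__main__":
    print(formulate(
        ["Price", "Mileage"],
        {"Color": "red", "Make": "Nissan", "Year": 1996},
        {"Year": ">="},
    ))
```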

Page 21

Formulate Query Expression

[Figure: query expression with callouts for its For, Let, Where, and Return clauses]
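The clause names on this slide match those of an XQuery FLWOR expression. As a rough sketch of what generating such an expression might look like, the function below renders a list of conditions as query text. The document name `car-ads.xml`, the element name `Car`, and the flat element-per-object-set layout are all assumptions; the slides do not show how the annotated data is actually stored, and the Let clause is omitted here.

```python
def literal(value):
    """Render a Python value as an XQuery literal (numbers bare, strings double-quoted)."""
    if isinstance(value, (int, float)):
        return str(value)
    return '"' + str(value) + '"'


def to_flwor(result_sets, conditions, source="car-ads.xml", record="Car"):
    """Render a conjunctive query as FLWOR-style text (hypothetical data layout)."""
    where = " and ".join(f"$r/{name} {op} {literal(value)}" for name, op, value in conditions)
    returned = ", ".join(f"$r/{name}" for name in result_sets)
    return (
        f'for $r in doc("{source}")//{record}\n'
        f"where {where}\n"
        f"return ({returned})"
    )


if __name__ == "__main__":
    print(to_flwor(["Price", "Mileage"], [("Color", "=", "red"), ("Year", ">=", 1996)]))
```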

Page 22

Run Query Over Semantically Annotated Data
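To show what running the formulated query amounts to, here is a small sketch that evaluates conjunctive conditions over annotated records held in memory. The records, the dictionary layout, and the function name `run_query` are hypothetical; in the system described, the query runs over stored semantic annotations rather than Python dictionaries.

```python
import operator

# Comparison operators a condition may use.
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}


def run_query(records, result_sets, conditions):
    """Return the requested object sets for every record that satisfies all conditions."""
    rows = []
    for record in records:
        satisfied = all(
            name in record and OPS[op](record[name], value)
            for name, op, value in conditions
        )
        if satisfied:
            rows.append({name: record.get(name) for name in result_sets})
    return rows


if __name__ == "__main__":
    # Hypothetical annotated car ads.
    ads = [
        {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500, "Mileage": 90000},
        {"Make": "Nissan", "Color": "blue", "Year": 2001, "Price": 7000, "Mileage": 60000},
        {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 2000, "Mileage": 140000},
    ]
    conditions = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)]
    print(run_query(ads, ["Price", "Mileage"], conditions))
```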

Page 23

Query Interpretation Results: Pilot Experiment with Car Ads

15 car-ads free-form queries from 3 volunteer CS students
Results
  Recognizing object sets of interest
    Recall: 85%
    Precision: 90%
  Recognizing constraints
    Recall: 61%
    Precision: 79%
Problems
  Regular expressions not tuned up and lexicons incomplete
  Ambiguities: “Are there any Ford mustangs, 2002, that are red?” (Is 2002 a year, mileage, or price?)
Caveats
  No disjunction
  No negation

Page 24

General Query Interpretation Results

AskOntos (Pilot Experiment on 5 domains: cars, real estate, countries, movies, diamonds)
  Object sets of interest recognized
    Recall: 90%
    Precision: 90%
  Conditions recognized
    Recall: 71%
    Precision: 88%

Page 25

Pragmatics

Technical problems
  Extraction and query-interpretation accuracy
  Execution speed
  Harvesting
    Crawling?!
    Information behind forms on the hidden web
Social problems
  Cooperation from web site developers
  End-user concerns
    Motivation
    Trust

All is not rosy …

Page 26

Conclusions

Automatically create semantic-web content
  Do data extraction over an ordinary web page
  Create semantic-web page
    Cache page
    Store external semantic annotation wrt an ontology
Query semantic web pages
  Free-form queries
  Return results
    Table
    Link to original web page (scrolled and highlighted)
Pragmatic considerations

www.deg.byu.edu