TOSS: An Extension of TAX with Ontologies and Similarity Queries Edward Hung Yu Deng V.S. Subrahmanian Presentation by: Valentina Bonsi Roberto Gamboni Giuseppe Vitalone Speaker: Roberto Gamboni


Post on 16-Dec-2015


TOSS: An Extension of TAX with Ontologies and Similarity Queries

Edward Hung Yu Deng V.S. Subrahmanian

Presentation by: Valentina Bonsi, Roberto Gamboni, Giuseppe Vitalone

Speaker: Roberto Gamboni

Outline

• Abstract
• TAX overview
• Quality problems
• TOSS architecture
• TOSS algebra
• Experiments
• Conclusions & Related works

Abstract

Tree Algebra for XML (TAX)
• an algebra developed for XML databases
• 100% precision but low recall
• semantics not considered

TAX with Ontologies and Similarity Queries (TOSS)
• ontology
• similarity enhancement
• improves recall

Much higher quality!

Tree Algebra for XML

Semistructured instance: I = (V, E, t)
• G = (V, E) is a set of rooted directed trees, where V is a set of nodes and E ⊆ V × V is a set of edges.
• t assigns to each object o ∈ V a type for its tag and a type for its content, e.g. o.tag = string and o.content = int.

Pattern tree: P = (T, F)
• T = (V, E) is a tree whose objects are labeled with distinct integers and whose edges are labeled ‘pc’ (parent-child) or ‘ad’ (ancestor-descendant).
• F is a selection condition applicable to the objects in T.

TAX selection example

DB1:

  car
    carModel [Toyota/Yaris]
    price [10000]
    year [2002]
    km [30000]
    carDealer [RBV]
    fuelCons [10]
  car
    carModel [Vw/Polo]
    price [14000]
    year [2004]
    km [40000]
    carDealer [Pico]
    fuelCons [12]
  car
    carModel [Vw/Golf]
    price [20000]
    year [2005]
    km [10000]
    carDealer [RBV S.p.A.]
    fuelCons [13]

Pattern tree:

  #1
    #2 (pc)
    #3 (pc)

  #1.tag = car & #2.tag = price & #3.tag = carModel & #2.content < 15000

Witness trees:

  car
    price [10000]
    carModel [Toyota/Yaris]
  car
    price [14000]
    carModel [Vw/Polo]
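The selection on this slide can be sketched in Python. This is only an illustration under stated assumptions: the `Node` class, the `select` function and the in-memory encoding of DB1 are hypothetical names that do not come from TAX, and the sketch handles only a pattern root with ‘pc’ children (no ‘ad’ edges).

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical in-memory object: a tag, optional content, children."""
    tag: str
    content: Optional[object] = None
    children: List["Node"] = field(default_factory=list)


def descendants_or_self(node: Node) -> Iterator[Node]:
    """Yield the node and every node below it, in document order."""
    yield node
    for child in node.children:
        yield from descendants_or_self(child)


def select(db: Node, child_labels: List[int],
           condition: Callable[[Dict[int, Node]], bool]) -> List[Node]:
    """Pattern tree: root #1 with one 'pc' child per label in child_labels.

    Every binding of pattern nodes to data nodes that satisfies the
    condition produces one witness tree (root plus the bound children).
    """
    witnesses = []
    for root in descendants_or_self(db):
        for combo in product(root.children, repeat=len(child_labels)):
            binding = {1: root, **dict(zip(child_labels, combo))}
            if condition(binding):
                witnesses.append(
                    Node(root.tag, root.content,
                         [Node(c.tag, c.content) for c in combo]))
    return witnesses


def car(model: str, price: int) -> Node:
    # Only the two tags used by the pattern are modelled here.
    return Node("car", children=[Node("carModel", model), Node("price", price)])


db1 = Node("DB1", children=[car("Toyota/Yaris", 10000),
                            car("Vw/Polo", 14000),
                            car("Vw/Golf", 20000)])


def cond(b: Dict[int, Node]) -> bool:
    # Tag tests come first, so the numeric comparison only sees prices.
    return (b[1].tag == "car" and b[2].tag == "price"
            and b[3].tag == "carModel" and b[2].content < 15000)


witness_trees = select(db1, [2, 3], cond)
for w in witness_trees:
    print(w.tag, [(c.tag, c.content) for c in w.children])
```

Under these assumptions the Golf is filtered out by `#2.content < 15000`, which matches the two witness trees on the slide.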

TAX similarity problems

  biblio
    book
      title [Operating Systems]
      price [45.50]
      author [W. Stallings]
      publisher [MacMillan]
      year [1992]
      ISBN [002945671]
    book
      title [Cryptography]
      price [42.50]
      author [William Stallings]
      publisher [Prentice Hall]
      year [2003]
      ISBN [003456783]

Pattern tree:

  #1
    #2 (pc)
    #3 (pc)

  #1.tag = book & #2.tag = title & #3.tag = author & #3.content = “W. Stallings”

Low recall!!!

W. Stallings and William Stallings are probably the same person, but TAX does not use any notion of similarity between terms.

Solution: improve TAX with some similarity measure.

ds(W. Stallings, William Stallings) = 0.1 (very similar)
ds(W. Stallings, Shakespeare) = 5 (much less similar)
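The slide does not say which similarity measure ds is. As a stand-in, the sketch below uses the Levenshtein edit distance; the function names `edit_distance` and `similar` are assumptions made for illustration, and the distances it produces are not the 0.1 and 5 shown above. Only the ordering (the two Stallings spellings are closer than Stallings and Shakespeare) carries over.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar(a: str, b: str, threshold: int) -> bool:
    """Two strings count as similar when their distance is within the threshold."""
    return edit_distance(a, b) <= threshold


near = edit_distance("W. Stallings", "William Stallings")
far = edit_distance("W. Stallings", "Shakespeare")
print(near, far, near < far)
```

Any other string distance could be plugged in the same way, as long as a threshold separates "similar" from "not similar".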

TAX multi-DB example

DB1:

  cars
    car
      carModel [Toyota/Yaris]
      price [10000]
      year [2002]
      km [30000]
      carDealer [RBV]
      fuelCons [10]
    car
      carModel [Vw/Polo]
      price [14000]
      year [2004]
      km [40000]
      carDealer [Pico]
      fuelCons [12]
    car
      carModel [Vw/Golf]
      price [20000]
      year [2005]
      km [10000]
      carDealer [RBV]
      fuelCons [13]

DB2:

  automobiles
    vendor
      dealerName [RVB]
      location [Bologna]
      feedback [5]
      car
        make [Volkswagen]
        model [Fox]
        year [2005]
        miles [30000]
        cost [5000]
        fuelCons [15]
      car
        make [AstonMartin]
        model [Vanquish]
        year [2004]
        miles [10000]
        cost [70000]
        fuelCons [6]
      car
        make [Ferrari]
        model [360]
        year [2002]
        miles [15000]
        cost [80000]
        fuelCons [6]

TAX problems with multi-DB

• Different tags can refer to the same thing.
• The same content can be stored differently.
• Tags like km and miles, or price and cost, may contain values expressed in different units (e.g. EUR or USD).

Inter-term lexical relationships

Ontology:

  Google isa Web Search Company isa Computer Company isa Company

Data:

  authors
    author
      firstName [Marco]
      lastName [Pivi]
      company [Google]
    author
      firstName [Samuele]
      lastName [Salti]
      company [Eclipse Found.]

“Return all authors of papers written by someone in a Web Search Company”

Pattern tree:

  #1
    #2 (pc)
    #3 (pc)

  #1.tag = author & #2.tag = lastName & #3.tag = company & #3.content = “Web Search Company”

Google’s authors are never returned!
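A small Python sketch of why the ontology matters for this query. The `ISA` dictionary and the `is_a` helper are illustrative assumptions (a child-to-parent map read off the isa edges of the slide, with `is_a` treated as reflexive); they are not the TOSS data structures.

```python
# isa edges from the slide, stored as child -> parent (assumed encoding).
ISA = {
    "Google": "Web Search Company",
    "Web Search Company": "Computer Company",
    "Computer Company": "Company",
}


def is_a(term, ancestor):
    """True if ancestor is reachable from term by following isa edges."""
    while term is not None:
        if term == ancestor:
            return True
        term = ISA.get(term)
    return False


authors = [
    {"firstName": "Marco", "lastName": "Pivi", "company": "Google"},
    {"firstName": "Samuele", "lastName": "Salti", "company": "Eclipse Found."},
]

# Exact match on the content, as in the pattern tree condition above.
exact = [a["lastName"] for a in authors
         if a["company"] == "Web Search Company"]

# Ontology-aware match: Google isa Web Search Company.
with_ontology = [a["lastName"] for a in authors
                 if is_a(a["company"], "Web Search Company")]

print(exact)
print(with_ontology)
```

The exact comparison finds nothing, while the ontology-aware test returns the author who works at Google.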

TOSS: Architecture’s bird’s-eye view

[Figure: architecture diagram. Components: XML files, Xindice system, Ontology Maker, WordNet, user-specified rules, Fusion of Ontologies, threshold, similarity measure, Similarity Enhancer, SEO, Query Executor, user queries, results.]

Goal: extend and enhance TAX to return high-quality answers using ontologies and similarity measures.

Ontology maker

XML DB:

  animals
    elephant
      name [Fuffi]
      race [African]
      age [50]
    dog
      name [Fido]
      race [Collie]
      age [4]
    black widow
      name [Pito]
      race [Mactans]
      age [7]

Derived ontology:

[Figure: isa hierarchy. Chains: elephant isa proboscidean isa mammal; dog isa canine isa carnivore isa mammal; black widow isa spider isa arachnid.]

Ontology Integration

Ontology 1:

  cars
    car
      carModel
      price
      year
      km
      carDealer
      fuelCons

Ontology 2:

  automobiles
    vendor
      dealerName
      location
      feedback
      car
        make
        model
        year
        miles
        cost
        fuelCons

Interoperation Constraints (specified by user)

Fusion of Ontologies

[Figure: fused ontology. Nodes: cars, car, carModel, price, year, km, fuelCons, automobiles, vendor, dealerName, location, feedback, miles, cost.]

make:2 and model:2 are both mapped into carModel

km / miles and price / cost:
• not grouped!
• as different units might be used in instances, the administrator has to define a conversion function to compare these values
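A conversion function of the kind the administrator would supply can be sketched as follows. Everything here is an assumption for illustration: the `CONVERSIONS` registry, the `normalise` helper, and the reading that the `miles` tag of DB2 really holds statute miles. A price/cost conversion would need an exchange rate, which the slides do not give, so it is left out.

```python
KM_PER_MILE = 1.609344  # international mile, exact by definition


def miles_to_km(miles: float) -> float:
    return miles * KM_PER_MILE


def km_to_miles(km: float) -> float:
    return km / KM_PER_MILE


# Hypothetical registry: (source tag, target tag) -> conversion function.
CONVERSIONS = {
    ("miles", "km"): miles_to_km,
    ("km", "miles"): km_to_miles,
}


def normalise(tag: str, value: float, target: str) -> float:
    """Express a value stored under `tag` in the unit of the `target` tag."""
    if tag == target:
        return value
    return CONVERSIONS[(tag, target)](value)


# Compare the DB1 Polo (km [40000]) with the DB2 Fox (miles [30000]).
polo_km = normalise("km", 40000, "km")
fox_km = normalise("miles", 30000, "km")
print(polo_km, fox_km, fox_km > polo_km)
```

Once both values are in kilometres, a condition such as "distance travelled below 45000 km" can be evaluated uniformly across the two databases.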

TOSS: Architecture’s bird’s-eye view

[Figure: the architecture diagram, repeated.]

Similarity Enhancer

[Figure: ontology fragment. Under airports: LAX – CA (Los Angeles), LB – CA (Long Beach), London City Airport, London BAA Heathrow, London Gatwick, Roma Fiumicino. Airlines: British Airways, American Airlines, Delta Airlines, Alitalia, United Airlines.]

Threshold = 2

d(LAX, LB) = 1.5
d(London City, London Heathrow) = 1
d(London City, London Gatwick) = 1.3
d(London Gatwick, London Heathrow) = 1.6
d(London City, Roma Fiumicino) = 3.5
d(Roma Fiumicino, LAX) = 9

The similarity enhanced ontology (SEO):

1. Preserves the original partial order.
2. All nodes mapped into the same node are similar to each other.
3. Two strings are similar iff they are mapped into the same node.
4. There are no redundant nodes (no node is a subset of another).
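Conditions 2 to 4 can be read as asking for the maximal groups of pairwise similar terms. The sketch below illustrates that reading with a plain Bron-Kerbosch enumeration of maximal cliques; it is not claimed to be the TOSS algorithm. The function name `similarity_nodes`, the use of "distance ≤ threshold" as the similarity test, and the treatment of every pair without a listed distance as dissimilar are all assumptions. Condition 1 (preserving the partial order) is not modelled.

```python
from itertools import combinations


def similarity_nodes(terms, dist, threshold):
    """Group terms into maximal sets of pairwise similar terms."""
    adj = {t: set() for t in terms}
    for a, b in combinations(terms, 2):
        d = dist.get(frozenset((a, b)))
        if d is not None and d <= threshold:  # unlisted pairs: not similar
            adj[a].add(b)
            adj[b].add(a)

    cliques = []

    def expand(r, p, x):
        # r: current group, p: candidates, x: already explored terms.
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended: maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(terms), set())
    return set(cliques)


airports = ["LAX", "LB", "London City", "London Heathrow",
            "London Gatwick", "Roma Fiumicino"]

# Distances from the slide.
d = {
    frozenset(("LAX", "LB")): 1.5,
    frozenset(("London City", "London Heathrow")): 1.0,
    frozenset(("London City", "London Gatwick")): 1.3,
    frozenset(("London Gatwick", "London Heathrow")): 1.6,
    frozenset(("London City", "Roma Fiumicino")): 3.5,
    frozenset(("Roma Fiumicino", "LAX")): 9.0,
}

nodes = similarity_nodes(airports, d, threshold=2)
for node in sorted(nodes, key=lambda n: sorted(n)):
    print(sorted(node))
```

With threshold 2 the two Californian airports share a node, the three London airports share another, and Roma Fiumicino stays alone.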

TOSS: Architecture’s bird’s-eye view

[Figure: the architecture diagram, repeated.]

Query Executor

• Transforms a user query into a query that takes the similarity-enhanced (and fused) ontology into account.
• Implements an ontology-extended algebra that improves on the TAX algebra.
• In the TOSS algebra, a simple selection condition is X op Y, where op ∈ {=, ≠, <, ≤, >, ≥, ~, instance_of, is_a, subtype_of, above, below} and X, Y are terms (attributes, types, etc.).

TOSS Algebra

A selection condition is a simple selection condition, or a conjunction, disjunction, or negation of selection conditions.

• C = X ~ Y is true iff some node of the SEO contains both X and Y;
• C = X instance_of Y is true iff the type of X is a subtype of Y and the value of X ∈ dom(Y);
• C = X subtype_of Y is true iff type(X) ≤ type(Y);
• C = X below Y is true iff X instance_of Y or X subtype_of Y;
• C = X above Y is true iff Y below X.
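The operator definitions of the TOSS algebra can be tried out on toy data. This is a sketch under loud assumptions: `SEO_NODES`, `PARENT`, `TYPE_OF` and `DOM` are hypothetical stand-ins for the SEO, the type hierarchy, the typing of values and the domains, and type names are compared directly when evaluating `subtype_of`.

```python
# Hypothetical SEO: each node is a set of mutually similar strings.
SEO_NODES = [{"W. Stallings", "William Stallings"}, {"Shakespeare"}]

# Hypothetical type hierarchy, child -> parent (s ≤ t if t is reachable).
PARENT = {
    "elephant": "proboscidean", "proboscidean": "mammal",
    "dog": "canine", "canine": "carnivore", "carnivore": "mammal",
    "black widow": "spider", "spider": "arachnid",
}

# Hypothetical typing of values and domains of types.
TYPE_OF = {"Fuffi": "elephant", "Fido": "dog", "Pito": "black widow"}
DOM = {"mammal": {"Fuffi", "Fido"}, "arachnid": {"Pito"}}


def similar(x, y):
    """X ~ Y: some SEO node contains both X and Y."""
    return any(x in node and y in node for node in SEO_NODES)


def type_leq(s, t):
    """s ≤ t in the type hierarchy (reflexive, transitive)."""
    while s is not None:
        if s == t:
            return True
        s = PARENT.get(s)
    return False


def subtype_of(x, y):
    return type_leq(x, y)


def instance_of(x, y):
    """The type of X is a subtype of Y and the value X is in dom(Y)."""
    return (x in TYPE_OF and type_leq(TYPE_OF[x], y)
            and x in DOM.get(y, ()))


def below(x, y):
    return instance_of(x, y) or subtype_of(x, y)


def above(x, y):
    return below(y, x)


print(similar("W. Stallings", "William Stallings"))
print(below("Fido", "mammal"), below("dog", "mammal"), below("Pito", "mammal"))
print(above("mammal", "elephant"))
```

Composite conditions (conjunction, disjunction, negation) would then be ordinary Boolean combinations of these predicates.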

Query Example

  biblio
    book
      title [Operating Systems]
      price [45.50]
      author [W. Stallings]
      publisher [MacMillan]
      year [1992]
      ISBN [002945671]
    book
      title [Cryptography]
      price [42.50]
      author [William Stallings]
      publisher [Prentice Hall]
      year [2003]
      ISBN [003456783]

Pattern tree:

  #1
    #2 (pc)
    #3 (pc)

  #1.tag = book & #2.tag = title & #3.tag = author & #3.content ~ “W. Stallings”

ds(W. Stallings, William Stallings) < threshold

Witness trees:

  book
    author [William Stallings]
    title [Cryptography]
  book
    author [W. Stallings]
    title [Operating Systems]

NOW all correct answers are returned!

Query Example (2)

  animals
    elephant
      Name [Fuffi]
      Age [50]
    dog
      Name [Fido]
      Age [4]
    black widow
      Name [Pito]
      Age [7]

“Return the list of all mammals”

Mammal ???

Ontology:
• Elephant IS A mammal
• Dog IS A mammal

Result:

  elephant
    Name [Fuffi]
    Age [50]
  dog
    Name [Fido]
    Age [4]

Implementation and Experiments

• TOSS implemented in Java.
• Built on top of the Xindice DBMS.
• Experiments over DBLP:
  • Recall and precision
  • 12 selection queries on 3 data sets (each containing 100 random papers)

Recall and precision

[Plot: recall and precision per query for TAX, TOSS (threshold = 2, marked X) and TOSS (threshold = 3, marked +).]

• TAX always gets 100% precision but low recall!
• TOSS keeps its precision close to 1 with much higher recall!
• For the queries with the lowest TOSS precision, a precision degradation of 1/3 corresponds to a threefold increase in recall.
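For reference, precision and recall can be computed as below. The function `precision_recall` and the paper identifiers are made up for illustration and are not the experimental data; the numbers are chosen only to mirror the trade-off described above (precision drops by one third while recall triples). Treating an empty answer as precision 1.0 is a convention assumed here.

```python
def precision_recall(returned, relevant):
    """Precision = hits / |returned|, recall = hits / |relevant|."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 1.0  # assumed convention
    recall = hits / len(relevant) if relevant else 1.0     # assumed convention
    return precision, recall


# Hypothetical query: papers 1..6 are the correct answers.
relevant = {1, 2, 3, 4, 5, 6}

# Exact matching returns only correct answers, but misses most of them.
exact_p, exact_r = precision_recall({1, 2}, relevant)

# Similarity matching finds all of them, plus three wrong ones.
sim_p, sim_r = precision_recall({1, 2, 3, 4, 5, 6, 7, 8, 9}, relevant)

print(exact_p, exact_r)
print(sim_p, sim_r)
```

In this made-up case precision goes from 1 to 2/3 while recall goes from 1/3 to 1.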

Recall and precision (2)

[Plot comparing TAX, TOSS (threshold = 2, marked X) and TOSS (threshold = 3, marked +).]

TOSS quality is always better than TAX!

Recall and precision (3)

[Plot: improvement per query, X = improvement (threshold = 2), + = improvement (threshold = 3).]

• In TOSS, most of the queries have their normalized recall more than doubled.
• TOSS results with threshold = 3 are not necessarily better than those with threshold = 2.

Conclusions & Related works

• Ontologies to improve the quality of answers to queries (Wiederhold’s group);
• Merges ontologies under interoperation constraints;
• Semistructured instances with associated ontologies can be queried;
• Introduces the concept of similarity search in semistructured DBs;
• Scored pattern tree (TIX).

Bibliography

• H.V. Jagadish, L.V.S. Lakshmanan, D. Srivastava and K. Thompson. TAX: A tree algebra for XML. In Proc. DBPL Conf., Rome, Italy, 2001.
• G.A. Miller et al. WordNet: a lexical database for English. Cognitive Science Laboratory, Princeton University.
• G. Wiederhold. Interoperation, mediation and ontologies. In International Symp. on Fifth Generation Computer Systems, Workshop on Heterogeneous Cooperative Knowledge Bases, ICOT, pages 33–48, 1994.
• SIGMOD Record in XML. Available at http://www.acm.org/sigmod/record/xml/, Nov 2002.


Questions and answers