TOSS: An Extension of TAX with Ontologies and Similarity Queries
Edward Hung, Yu Deng, V.S. Subrahmanian
Presentation by: Valentina Bonsi, Roberto Gamboni, Giuseppe Vitalone
Speaker: Roberto Gamboni
Outline
• Abstract
• TAX overview
• Quality problems
• TOSS architecture
• TOSS algebra
• Experiments
• Conclusions & Related works
Abstract
Tree Algebra for XML (TAX)
• an algebra developed for XML databases
• 100% precision but low recall
• semantics not considered
TAX with Ontologies and Similarity Queries (TOSS)
• ontology
• similarity enhancement
• improves recall
Much higher quality!
Tree Algebra for XML
Semistructured instance: I = (V, E, t)
• G = (V, E) is a set of rooted directed trees, where V is a set of nodes and E ⊆ V × V is a set of edges.
• t assigns to each object o ∈ V a type for its tag and a type for its content, e.g. o.tag = string and o.content = int.
Pattern tree: P = (T, F)
• T = (V, E) is an object-labeled (each node carries a distinct integer) and edge-labeled ('pc' for parent-child, 'ad' for ancestor-descendant) tree.
• F is a selection condition applicable to objects in T.
TAX selection example
DB1:
car: carModel [Toyota/Yaris], price [10000], year [2002], km [30000], carDealer [RBV], fuelCons [10]
car: carModel [Vw/Polo], price [14000], year [2004], km [40000], carDealer [Pico], fuelCons [12]
car: carModel [Vw/Golf], price [20000], year [2005], km [10000], carDealer [RBV S.p.A.], fuelCons [13]
Pattern tree:
#1, with pc edges to #2 and #3
F: #1.tag = car ∧ #2.tag = price ∧ #3.tag = carModel ∧ #2.content < 15000
Witness trees:
car: price [10000], carModel [Toyota/Yaris]
car: price [14000], carModel [Vw/Polo]
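The selection on this slide can be sketched in a few lines of Python. This is only an illustration of how a pattern tree with two pc edges picks out witness trees; the `Node` class, the `walk` helper and the function name are hypothetical and are not part of TAX or of the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List


@dataclass
class Node:
    """A node of a semistructured instance: a tag, optional content, children."""
    tag: str
    content: Any = None
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def select_cars_cheaper_than(root: Node, limit: int) -> List[Node]:
    """Match #1 -pc-> #2 and #1 -pc-> #3 with
    #1.tag = car, #2.tag = price, #3.tag = carModel, #2.content < limit."""
    witnesses = []
    for n1 in walk(root):
        if n1.tag != "car":
            continue
        for n2 in n1.children:  # pc edge: direct children only
            if n2.tag != "price" or not n2.content < limit:
                continue
            for n3 in n1.children:
                if n3.tag == "carModel":
                    # A witness tree keeps only the matched nodes.
                    witnesses.append(Node("car", children=[
                        Node("price", n2.content),
                        Node("carModel", n3.content),
                    ]))
    return witnesses


def car(model: str, price: int) -> Node:
    """Build a reduced car record (only the two fields the pattern uses)."""
    return Node("car", children=[Node("carModel", model), Node("price", price)])


db1 = Node("cars", children=[
    car("Toyota/Yaris", 10000),
    car("Vw/Polo", 14000),
    car("Vw/Golf", 20000),
])
witnesses = select_cars_cheaper_than(db1, 15000)
for w in witnesses:
    print([(c.tag, c.content) for c in w.children])
```

Only the Yaris and the Polo satisfy the price condition, which matches the two witness trees shown on the slide.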
TAX similarity problems
biblio:
book: title [Operating Systems], price [45,50], author [W. Stallings], publisher [MacMillan], year [1992], ISBN [002945671]
book: title [Cryptography], price [42,50], author [William Stallings], publisher [Prentice Hall], year [2003], ISBN [003456783]
Pattern tree:
#1, with pc edges to #2 and #3
F: #1.tag = book ∧ #2.tag = title ∧ #3.tag = author ∧ #3.content = "W. Stallings"
Low recall!
W. Stallings and William Stallings are probably the same person, but TAX has no notion of similarity between terms, so only the first book is returned.
Solution: extend TAX with a similarity measure ds.
ds(W. Stallings, William Stallings) = 0.1 (very similar)
ds(W. Stallings, Shakespeare) = 5 (much less similar)
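The slides do not say which string measure ds stands for. As a stand-in, the sketch below uses the classic Levenshtein edit distance (an assumption, not the measure used in TOSS); its values differ from the illustrative 0.1 and 5 above, but the ordering is the same: the two spellings of Stallings are much closer to each other than either is to Shakespeare.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) ca
            ))
        previous = current
    return previous[-1]


near = edit_distance("W. Stallings", "William Stallings")
far = edit_distance("W. Stallings", "Shakespeare")
print(near, far, near < far)
```

Any other measure with the same signature could be plugged in; the rest of the pipeline only needs a distance and a threshold.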
TAX multi-DB example
DB1:
cars
car: carModel [Toyota/Yaris], price [10000], year [2002], km [30000], carDealer [RBV], fuelCons [10]
car: carModel [Vw/Polo], price [14000], year [2004], km [40000], carDealer [Pico], fuelCons [12]
car: carModel [Vw/Golf], price [20000], year [2005], km [10000], carDealer [RBV], fuelCons [13]
DB2:
automobiles
vendor: dealerName [RVB], location [Bologna], feedback [5]
car: make [Volkswagen], model [Fox], year [2005], miles [30000], cost [5000], fuelCons [15]
car: make [AstonMartin], model [Vanquish], year [2004], miles [10000], cost [70000], fuelCons [6]
car: make [Ferrari], model [360], year [2002], miles [15000], cost [80000], fuelCons [6]
TAX problems with multi-DB
• Different tags can refer to the same thing.
• The same content can be stored differently.
• Tags like km and miles, or price and cost, may contain values expressed in different units (e.g. EUR or USD).
Inter-term lexical relationships
Ontology: Web Search Company, Computer Company and Company, linked by isa edges.
authors:
author: firstName [Marco], lastName [Pivi], company [Google]
author: firstName [Samuele], lastName [Salti], company [Eclipse Found.]
Query: "Return all authors of papers written by someone in a Web Search Company"
Pattern tree:
#1, with pc edges to #2 and #3
F: #1.tag = author ∧ #2.tag = lastName ∧ #3.tag = company ∧ #3.content = "Web Search Company"
Google's authors are never returned!
TOSS: Architecture’s birdseye view
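Diagram components:
• Inputs: XML files, WordNet, user-specified rules, similarity measure, threshold, user queries
• Modules: Ontology Maker, Fusion of Ontologies, Similarity Enhancer, Query Executor
• Intermediate product: SEO (similarity enhanced ontology)
• Storage: Xindice system
• Output: results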
Goal: extend and enhance TAX to return high quality answers using ontology and similarity measures
Ontology maker
XML DB:
animals
elephant: name [Fuffi], race [African], age [50]
dog: name [Fido], race [Collie], age [4]
black widow: name [Pito], race [Mactans], age [7]
Derived ontology (terms linked by isa edges): mammal, proboscidean, carnivore, canine, arachnid, spider
Ontology Integration
Ontology 1:
cars
car: carModel, price, year, km, carDealer, fuelCons
Ontology 2:
automobiles
vendor: dealerName, location, feedback
car: make, model, year, miles, cost, fuelCons
Interoperation Constraints (specified by user)
Fusion of Ontologies
Fused ontology (terms shown): cars, car, carModel, price, year, km, fuelCons, automobiles, vendor, dealerName, location, feedback, miles, cost
• make:2 and model:2 are both mapped into carModel.
• km and miles, and price and cost, are not grouped: since different units might be used in the instances, the administrator has to define a conversion function to compare these values.
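A conversion function of the kind the administrator would supply might look like the sketch below. The registry `TO_KM` and the function name are hypothetical; only the kilometre/mile factor is a fixed constant. A price/cost conversion would work the same way but needs an exchange rate chosen by the administrator, so it is not shown.

```python
KM_PER_MILE = 1.609344  # exact, by definition of the international mile

# Hypothetical registry: tag -> function converting a value to kilometres.
TO_KM = {
    "km": lambda v: float(v),
    "miles": lambda v: v * KM_PER_MILE,
}


def distance_in_km(tag: str, value: float) -> float:
    """Normalize a distance stored under `tag` so values can be compared."""
    try:
        return TO_KM[tag](value)
    except KeyError:
        raise ValueError(f"no conversion registered for tag {tag!r}") from None


yaris_km = distance_in_km("km", 30000)    # DB1: Toyota/Yaris, km [30000]
fox_km = distance_in_km("miles", 30000)   # DB2: Volkswagen Fox, miles [30000]
print(fox_km > yaris_km)
```

Both records store the number 30000, but after normalization the Fox has clearly travelled further, which is why the two tags cannot simply be merged.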
Similarity Enhancer
airports: LAX – CA (Los Angeles), LB – CA (Long Beach), London City Airport, London BAA Heathrow, London Gatwick, Roma Fiumicino
British Airways, American Airlines, Delta Airlines, Alitalia, United Airlines
Threshold = 2
d(LAX, LB) = 1.5
d(London City, London Heathrow) = 1
d(London City, London Gatwick) = 1.3
d(London Gatwick, London Heathrow) = 1.6
d(London City, Roma Fiumicino) = 3.5
d(Roma Fiumicino, LAX) = 9
1. Preserves the original partial order
2. All nodes mapped into the same node are similar to each other
3. Two strings are similar iff they are mapped into the same node
4. There are no redundant nodes (no node is a subset of another)
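Conditions 2 to 4 amount to grouping terms into maximal sets of pairwise similar strings. The brute-force sketch below illustrates this on the airport distances from the slide. It makes two assumptions that the slide does not state: "similar" is taken to mean distance ≤ threshold, and any pair without a listed distance is treated as not similar. It is not the algorithm used in TOSS, and condition 1 (preserving the partial order) is not modelled.

```python
from itertools import combinations

THRESHOLD = 2

# Distances listed on the slide; unlisted pairs are assumed to be far apart.
DISTANCES = {
    frozenset(("LAX", "LB")): 1.5,
    frozenset(("London City", "London Heathrow")): 1.0,
    frozenset(("London City", "London Gatwick")): 1.3,
    frozenset(("London Gatwick", "London Heathrow")): 1.6,
    frozenset(("London City", "Roma Fiumicino")): 3.5,
    frozenset(("Roma Fiumicino", "LAX")): 9.0,
}


def similar(a: str, b: str, threshold: float) -> bool:
    """True if a and b are the same string or within the threshold."""
    if a == b:
        return True
    return DISTANCES.get(frozenset((a, b)), float("inf")) <= threshold


def similarity_groups(terms, threshold):
    """Return the maximal sets of pairwise similar terms (brute force)."""
    groups = []
    # Largest candidate sets first, so supersets are found before subsets.
    for size in range(len(terms), 0, -1):
        for subset in combinations(terms, size):
            if all(similar(a, b, threshold) for a, b in combinations(subset, 2)):
                candidate = set(subset)
                # Condition 4: skip sets contained in an existing group.
                if not any(candidate < g for g in groups):
                    groups.append(candidate)
    return groups


airports = ["LAX", "LB", "London City", "London Heathrow",
            "London Gatwick", "Roma Fiumicino"]
groups = similarity_groups(airports, THRESHOLD)
for g in groups:
    print(sorted(g))
```

With threshold 2 the three London airports collapse into one node, LAX and LB into another, and Roma Fiumicino stays on its own. Brute force is exponential in the number of terms and is only meant to make the conditions concrete.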
Query Executor
• Transforms a user query into a query that takes the similarity enhanced (and fused) ontology into account.
• Implements an ontology-extended algebra that improves on the TAX algebra.
TOSS Algebra
• A simple selection condition is X op Y, where op ∈ {=, ≠, <, ≤, >, ≥, ~, instance_of, is_a, subtype_of, above, below} and X, Y are terms (attributes, types, etc.).
• A selection condition is a simple selection condition, or a conjunction, disjunction or negation of selection conditions.
• C = X ~ Y is true iff some node of the SEO contains both X and Y.
• C = X instance_of Y is true iff the type of X is a subtype of Y and the value of X ∈ dom(Y).
• C = X subtype_of Y is true iff type(X) ≤ type(Y).
• C = X below Y is true iff X instance_of Y or X subtype_of Y.
• C = X above Y is true iff Y below X.
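The definitions of ~, instance_of, subtype_of, below and above can be written down almost literally. The sketch below does so over a toy type hierarchy; the `Value` class, the `SUPERTYPE` and `DOMAIN` tables and the single SEO node are hypothetical illustration data (the isa edges between the company types are assumed from the earlier ontology slide), and none of this is the TOSS implementation.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Value:
    """A piece of content together with its declared type."""
    data: Any
    type: str


# Assumed type hierarchy: type -> direct supertype.
SUPERTYPE = {"web search company": "company", "computer company": "company"}
# Assumed domains: type -> membership test for dom(type).
DOMAIN = {"company": lambda v: isinstance(v, str)}
# Assumed SEO: each node is a set of mutually similar strings.
SEO_NODES = [frozenset({"W. Stallings", "William Stallings"})]


def type_leq(t1: str, t2: str) -> bool:
    """t1 ≤ t2 in the reflexive, transitive closure of SUPERTYPE."""
    current = t1
    while current is not None:
        if current == t2:
            return True
        current = SUPERTYPE.get(current)
    return False


def type_of(x) -> str:
    return x.type if isinstance(x, Value) else x


def subtype_of(x, y) -> bool:
    return type_leq(type_of(x), type_of(y))


def instance_of(x, y_type: str) -> bool:
    in_domain = DOMAIN.get(y_type, lambda _: False)
    return isinstance(x, Value) and type_leq(x.type, y_type) and in_domain(x.data)


def below(x, y) -> bool:
    return instance_of(x, y) or subtype_of(x, y)


def above(x, y) -> bool:
    return below(y, x)


def similar(x: str, y: str) -> bool:
    """X ~ Y: some SEO node contains both strings."""
    return any(x in node and y in node for node in SEO_NODES)


google = Value("Google", "web search company")
print(instance_of(google, "company"), above("company", google),
      similar("W. Stallings", "William Stallings"))
```

A full selection condition would combine such predicates with conjunction, disjunction and negation, which map directly onto `and`, `or` and `not`.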
Query Example
biblio:
book: title [Operating Systems], price [45,50], author [W. Stallings], publisher [MacMillan], year [1992], ISBN [002945671]
book: title [Cryptography], price [42,50], author [William Stallings], publisher [Prentice Hall], year [2003], ISBN [003456783]
Pattern tree:
#1, with pc edges to #2 and #3
F: #1.tag = book ∧ #2.tag = title ∧ #3.tag = author ∧ #3.content ~ "W. Stallings"
Witness trees:
book: author [W. Stallings], title [Operating Systems]
book: author [William Stallings], title [Cryptography]
ds(W. Stallings, William Stallings) < threshold
Now all correct answers are returned!
Query Example(2)
animals:
elephant: Name [Fuffi], Age [50]
black widow: Name [Pito], Age [7]
dog: Name [Fido], Age [4]
Query: "Return the list of all mammals"
No node is tagged mammal, so the ontology is consulted:
• Elephant IS A mammal
• Dog IS A mammal
Result:
elephant: Name [Fuffi], Age [50]
dog: Name [Fido], Age [4]
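The mammal query can be sketched as a lookup in the isa hierarchy. The slide only states that elephant and dog are mammals; the intermediate edges in `ISA` below (through proboscidean, canine, carnivore, spider) are assumptions based on the terms of the derived ontology, whose exact edge layout is not recoverable from the slides. The function and variable names are hypothetical.

```python
# Assumed isa edges: term -> direct parent in the ontology.
ISA = {
    "elephant": "proboscidean",
    "proboscidean": "mammal",
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "black widow": "spider",
    "spider": "arachnid",
}


def is_a(term: str, ancestor: str) -> bool:
    """True if ancestor is reachable from term by following isa edges."""
    current = term
    while current is not None:
        if current == ancestor:
            return True
        current = ISA.get(current)
    return False


# The animals database, reduced to (tag, fields) pairs.
animals = [
    ("elephant", {"Name": "Fuffi", "Age": 50}),
    ("black widow", {"Name": "Pito", "Age": 7}),
    ("dog", {"Name": "Fido", "Age": 4}),
]

# Exact tag matching finds nothing: no node is tagged "mammal".
exact = [rec for tag, rec in animals if tag == "mammal"]
# Matching through the ontology finds the elephant and the dog.
mammals = [(tag, rec) for tag, rec in animals if is_a(tag, "mammal")]
print(exact, [tag for tag, _ in mammals])
```

The same lookup explains the earlier company example: a query for "Web Search Company" only reaches Google's authors once Google is linked to that term in the ontology.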
Implementation and Experiments
• TOSS implemented in Java.
• Built on top of the Xindice DBMS.
• Experiments over DBLP.
• Recall and precision: 12 selection queries on 3 data sets (each containing 100 random papers).
Recall and precision
[Plot. Legend: TAX; X = TOSS (threshold = 2); + = TOSS (threshold = 3)]
TAX always gets 100% precision but low recall!
TOSS keeps its precision close to 1 with much higher recall!
For the queries with the lowest TOSS precision, a precision degradation of 1/3 corresponds to a threefold increase in recall.
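For reference, precision and recall are computed from the set of returned answers and the set of relevant answers. The sketch below uses made-up paper identifiers chosen only to reproduce the shape of the trade-off described above (precision dropping by one third while recall triples); they are not the experimental data from the slides.

```python
import math


def precision_recall(returned, relevant):
    """Precision = |returned ∩ relevant| / |returned|,
    recall = |returned ∩ relevant| / |relevant|."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


# Illustrative toy data, not results from the experiments.
relevant = {"p1", "p2", "p3", "p4", "p5", "p6"}
exact_answers = {"p1", "p2"}                    # exact matching only
enhanced_answers = relevant | {"x1", "x2", "x3"}  # similarity matching

exact_p, exact_r = precision_recall(exact_answers, relevant)
enh_p, enh_r = precision_recall(enhanced_answers, relevant)
print(exact_p, exact_r)
print(enh_p, enh_r)
print(math.isclose(enh_r / exact_r, 3.0))
```

In this toy case exact matching returns only correct answers but misses most of them, while similarity matching returns every relevant answer at the cost of a few spurious ones.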
Recall and precision (2)
TOSS quality is always better than TAX!
[Plot. Legend: TAX; X = TOSS (threshold = 2); + = TOSS (threshold = 3)]
Recall and precision (3)
With TOSS, most queries more than double their normalized recall.
TOSS results with threshold = 3 are not necessarily better than those with threshold = 2.
[Plot. Legend: X = improvement (threshold = 2); + = improvement (threshold = 3)]
Conclusions & Related works
Ontologies to improve the quality of answers to queries (Wiederhold’s group);
Merge ontologies under interoperation constraints;
Semistructured instances with associated ontologies can be queried;
Introduces the concept of similarity search in semistructured DBs.
Scored pattern tree (TIX)
Bibliography
H.V. Jagadish, L.V.S. Lakshmanan, D. Srivastava and K. Thompson. TAX: A tree algebra for XML. In Proc. DBPL Conf., Rome, Italy, 2001.
G.A. Miller et al. WordNet: a lexical database for English. Cognitive Science Laboratory, Princeton University.
G. Wiederhold. Interoperation, mediation and ontologies. In International Symp. on Fifth Generation Computer Systems, Workshop on Heterogeneous Cooperative Knowledge Bases, ICOT, pages 33–48, 1994.
SIGMOD Record in XML. Available at http://www.acm.org/sigmod/record/xml/, Nov 2002.
Questions and answers