Supporting Location-based Approximate-Keyword Queries
ACM International conference on Geographical Information Systems 2010
S Alsubaiee, A Behm, C Li – University of California, Irvine
Presenter: Raghav Karumur
Date: 3/30/2011
Course: [CSCI 8735] Advanced Database Systems
Department of Computer Science and Engineering
University of Minnesota, Twin Cities
Spring 2011
Lunch Time!
2Advanced Database Systems Raghav Karumur Spring 2011
I’ll go for Chinese food! What was the restaurant’s name??? Uh… Ch-o-chi???
Let me Find It!
Errr… Just one typo!
Outline
• Overview
• Problem Formulation and Preliminaries
• Contributions
• Algorithms
• Index Construction
• Experiments and Analysis
• Conclusion
• References
Overview
Terminology – Clear?
Location-based keyword search: a query consists of a set of keywords plus a spatial location.
Goal: Find objects that contain these keywords and are close to the location.
Ex: A user is looking for a restaurant named Chaochi close to San Jose. Consider the query:
Q1 : (Chaochi) near (San Jose)
The website returns listings close to San Jose that contain the keyword Chaochi.
Problem: Inconsistencies can exist in user queries, in the data, or in both.
- Users or uploaders may enter a wrong spelling!
Q1’ : (Chochi) near (San Jose)
Therefore, Q1’ may not find the restaurant, because the query keyword is mistyped.
Hence, support for approximate keyword search is necessary!
Overview
Approach used so far:
Build a collection of keywords similar to the mistyped keyword, then either suggest another query or find objects with these keywords.
Drawback of this approach:
No support for handling spatial and textual information simultaneously.
Problem Formulation
Object Collection
chaochi restaurant <37.39, -121.87>
starbucks <37.79, -122.40>
starbucks <40.72, -73.99>
apple store <44.59, -92.99>
sam’s club <43.59, -116.47>
…
Find objects in “San Jose” with keywords similar to “chochi” & “resturant”
Problem Formulation
Location-Based Keyword Search:
• Given a collection of strings, find those that are similar to the given query string.
• Consider a collection of spatial objects o1, …, on, each having a set of keywords and a location.
• A spatial approximate-keyword query Q = <Qs, Qt> consists of two conditions:
- a spatial condition Qs, such as a rectangle or a circle, and
- an approximate keyword condition Qt having a set of k pairs
$\{\langle w_1, \delta_1 \rangle, \langle w_2, \delta_2 \rangle, \ldots, \langle w_k, \delta_k \rangle\}$,
each representing a keyword wi with an associated similarity threshold δi.
• Goal: Find all objects in the collection that are within Qs and satisfy Qt.
• An object satisfies Qt if, for each keyword wi in Qt, the object has a keyword in its description whose similarity to wi is within the corresponding threshold δi.
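The satisfaction test defined above can be sketched in a few lines. This is only an illustration under stated assumptions: edit distance is used as the similarity function, Qs is an axis-aligned rectangle, and the names `SpatialObject`, `in_rectangle`, `satisfies`, and the bounding box standing in for "San Jose" are hypothetical.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


@dataclass
class SpatialObject:
    keywords: list
    lat: float
    lon: float


def in_rectangle(obj, rect):
    """rect = (min_lat, min_lon, max_lat, max_lon); plays the role of Qs."""
    min_lat, min_lon, max_lat, max_lon = rect
    return min_lat <= obj.lat <= max_lat and min_lon <= obj.lon <= max_lon


def satisfies(obj, rect, keyword_conditions):
    """True if obj lies in Qs and, for every pair (w_i, delta_i) in Qt,
    has some keyword within edit distance delta_i of w_i."""
    if not in_rectangle(obj, rect):
        return False
    return all(any(edit_distance(w, k) <= delta for k in obj.keywords)
               for w, delta in keyword_conditions)


# Hypothetical bounding box around San Jose and the query from the slides.
san_jose = (37.1, -122.1, 37.5, -121.6)
query = [("chochi", 1), ("resturant", 1)]
chaochi = SpatialObject(["chaochi", "restaurant"], 37.39, -121.87)
starbucks = SpatialObject(["starbucks"], 37.79, -122.40)
matches = [o for o in (chaochi, starbucks) if satisfies(o, san_jose, query)]
```

A linear scan like this checks every object; the indexes discussed in the rest of the talk exist to avoid exactly that.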
Problem Formulation
Approaches:
• Combine these two indexes
• Search the resultant index, called the LBAK-tree, to find answers
Trie-based method
Inverted-index method
Preliminaries: Location-Based Keyword Search
Find objects within a given spatial region that have a given set of keywords
Augment a hierarchical spatial index with textual information
Preliminaries: Approximate String Search
Collection of strings s: …, chaochi, chucho, church
Query q: chochi
Search
Output: strings s that satisfy Sim(q, s) ≤ δ
Sim functions: edit distance, Jaccard, cosine, etc.
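As a small illustration of one of the listed Sim functions, the sketch below computes Jaccard similarity over sets of 2-character substrings (grams). The helper names `grams` and `jaccard` are hypothetical. Note that for a distance such as edit distance the threshold δ is an upper bound, whereas for a similarity such as Jaccard a match requires the value to be at least the threshold.

```python
def grams(s, q=2):
    """Set of all substrings of length q in s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard similarity of the gram sets: |A ∩ B| / |A ∪ B|."""
    ga, gb = grams(a, q), grams(b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


collection = ["chaochi", "chucho", "church"]
# Rank the collection by similarity to the mistyped query.
ranked = sorted(collection, key=lambda s: jaccard("chochi", s), reverse=True)
```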
Preliminaries: Approximate String Search
Example string: chaochi
Sliding window of length 2 gives the 2-grams {ch, ha, ao, oc, ch, hi}
Intuition: similar strings share a certain number of grams
Index structure: gram-based inverted index
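The intuition above can be sketched as a gram-based inverted index with a count filter. This is a minimal sketch, not the Flamingo implementation: the function names are hypothetical, and the filter relies on the standard bound that one edit operation can destroy at most q grams, so a string within edit distance k of the query shares at least (|query| − q + 1) − k·q grams with it.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """All overlapping substrings of length q (sliding window), with repeats."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Inverted index: gram -> {string id: occurrences of the gram in it}."""
    index = defaultdict(Counter)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g][sid] += 1
    return index


def candidates(query, strings, index, k, q=2):
    """Strings sharing enough grams with the query to possibly be within
    edit distance k. Survivors still need an exact edit-distance check."""
    needed = (len(query) - q + 1) - k * q
    if needed <= 0:
        return set(strings)  # the filter cannot prune anything
    shared = Counter()
    for g, in_query in Counter(qgrams(query, q)).items():
        for sid, in_string in index.get(g, {}).items():
            shared[sid] += min(in_query, in_string)
    return {strings[sid] for sid, n in shared.items() if n >= needed}


strings = ["chaochi", "chucho", "church"]
index = build_index(strings)
result = candidates("chochi", strings, index, k=1)
```

Here church is pruned by the filter alone, while chucho survives as a false positive that a final edit-distance verification would remove.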
Solution
Tree-based spatial index + keyword search capability + approximate string search capability = LBAK-tree
Contributions
• How to combine those indexes
• Three algorithms:
1) Simple fixed-level solution
2) Utilizing the local spatial distribution of objects
3) Exploiting the frequency distribution of keywords
What is used:
• Queries with a spatial condition are typically supported by a tree-based index such as an R*-tree, KD-tree, or Quad-tree.
• The R*-tree is used in this paper.
• Most trie-based indexes are specific to edit distance and its variants, and do not support other similarity measures such as Jaccard.
• Inverted indexes, however, usually support a family of similarity metrics such as edit distance, Jaccard, etc.
• An inverted index is therefore used in this paper.
• The resulting LBAK-tree is the R*-tree augmented with capabilities for approximate keyword search.
• A gram-based inverted index is used to perform approximate string search.
The LBAK tree
LBAK-tree nodes may be classified into three categories:
• S-Nodes:
- Do not store any textual information.
- Used only for pruning based on the spatial condition.
• SA-Nodes:
- Store the union of the keywords of their subtree.
- Store an approximate index on these keywords.
- Used for finding similar keywords.
- Used for pruning based on the spatial and approximate-keyword conditions.
• SK-Nodes:
- Store the union of the keywords of their subtree.
- Used for pruning with the spatial condition and keywords.
- The relevant similar keywords must already have been identified by the time the search reaches such a node.
Alg 1: Simple Fixed Level Solution
Query: objects in “San Jose” with keywords similar to “chochi” & “resturant”
– Based on an edit distance of 1
– Expressed as Q: <{San Jose}; {<chochi, 1>, <resturant, 1>}>
• The query clearly has typos.
• Assume nodes A, B, C, D satisfy the spatial condition San Jose.
• Throughout the traversal of the tree we always check the spatial condition.
• At the S-Node A, we rely only on the spatial condition for pruning.
• When we reach SA-Node B, we search its approximate index to find keywords similar to chochi and resturant within the edit-distance threshold of 1.
• We find two keywords similar to chochi (namely, chaochi and choochi) and one keyword similar to resturant (namely, restaurant).
• Once we visit the SK-Nodes C and D, we intersect each node’s stored keywords with the similar-keyword sets {chaochi, choochi} and {restaurant}.
• Node C can be pruned because it does not have the keyword restaurant.
• Since node D has the keywords chaochi and restaurant, we traverse its children.
• We repeat the process until we find the answers.
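The walk-through above can be sketched as a recursive traversal. This is a simplified illustration, not the paper's code: the `Node` and `Obj` classes are hypothetical, the approximate index at an SA-Node is replaced by a plain scan with edit distance, objects hang directly under the SK-Nodes, and the second object ("choochi cafe") is invented so that node C has something to hold.

```python
from dataclasses import dataclass, field
from functools import lru_cache


def edit_distance(a, b):
    """Levenshtein distance via memoized recursion (fine for short keywords)."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0 or j == 0:
            return i + j
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))


@dataclass
class Obj:
    keywords: frozenset
    lat: float
    lon: float


@dataclass
class Node:
    kind: str                            # "S", "SA" or "SK"
    mbr: tuple                           # (min_lat, min_lon, max_lat, max_lon)
    keywords: frozenset = frozenset()    # union of subtree keywords (SA/SK)
    children: list = field(default_factory=list)
    objects: list = field(default_factory=list)


def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def inside(o, region):
    return region[0] <= o.lat <= region[2] and region[1] <= o.lon <= region[3]


def search(node, region, conditions, similar=None):
    if not overlaps(node.mbr, region):            # spatial pruning, every node
        return []
    if node.kind == "SA":                         # find similar keywords once
        similar = [frozenset(k for k in node.keywords
                             if edit_distance(w, k) <= delta)
                   for w, delta in conditions]
    if node.kind in ("SA", "SK") and any(not (s & node.keywords) for s in similar):
        return []                                 # some query keyword has no match
    hits = [o for o in node.objects
            if inside(o, region) and all(s & o.keywords for s in similar)]
    for child in node.children:
        hits += search(child, region, conditions, similar)
    return hits


box = (37.0, -122.2, 37.6, -121.5)                # hypothetical "San Jose"
chaochi = Obj(frozenset({"chaochi", "restaurant"}), 37.39, -121.87)
choochi = Obj(frozenset({"choochi", "cafe"}), 37.30, -121.90)
C = Node("SK", box, frozenset({"choochi", "cafe"}), objects=[choochi])
D = Node("SK", box, frozenset({"chaochi", "restaurant"}), objects=[chaochi])
B = Node("SA", box, C.keywords | D.keywords, children=[C, D])
A = Node("S", box, children=[B])
answers = search(A, box, [("chochi", 1), ("resturant", 1)])
```

As in the slides, node C is pruned because none of its keywords matches restaurant, and only the object under D is returned.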
How to Choose Level L?
Trade-off between space and time, until “some” level (both increase)
• Usually, about 90% of query time is spent in approximate-index lookups.
• Therefore, choosing a good level L for the placement of approximate indexes can greatly improve the average query time.
Observations
• Query time and index size are sensitive to the locations of the approximate indexes.
• The fixed-level solution ignores the local spatial distribution of objects.
• If a node is sparse, we might consider placing the index at its descendants.
• If a node is dense, we build the index at the node itself, because a query region is likely to overlap with many of its children.
[Figure labels: “Prefer to build approximate index at parent”; “Prefer to build approximate indexes at children”]
Algorithm 2: Placing Approximate Indexes at Variable Levels
(Spatial Nodes)
(Spatial-Approximate Nodes)
(Spatial-Keyword Nodes)
Selecting Nodes for Approximate Indexes
• Goal: Find the optimal set of nodes that should have approximate indexes.
• Optimization problem: “Given an R*-tree and a space budget, choose nodes from the tree to store approximate indexes, such that the average query time of a given workload is minimized.”
-- NP-hard problem!
Greedy Algorithm: Selecting Nodes for Approximate Indexes
[Figure: example tree with nodes N1 to N15; check marks (✔) on three nodes]
• A greedy algorithm, SelectSANodes, traverses the tree top-down and tries to push approximate indexes down the most promising paths.
Selecting Nodes for Approximate Indexes
• The algorithm maintains a priority queue of nodes to be traversed.
• The priority of a node n is defined as the benefit of storing multiple approximate indexes at its children as compared to building a single index at n.
• For each visited node n, if the benefit of building multiple approximate indexes at n’s children is negative, the algorithm selects n to be an SA-Node and does not traverse its children.
• If the algorithm reaches a leaf node, it immediately selects the leaf to be an SA-Node.
• The algorithm terminates when the space budget is exhausted or there is no more benefit in pushing approximate indexes down the tree.
• If pTime(n) denotes the average query time of probing the approximate index at the parent n, cTime(n) denotes this time if the indexes were built at n’s children, and pSpace(n) and cSpace(n) are the corresponding space costs of the indexes, then the benefit of node n is
$$b(n) = \frac{\mathrm{pTime}(n) - \mathrm{cTime}(n)}{|\mathrm{pSpace}(n) - \mathrm{cSpace}(n)|}$$
Selecting Nodes for Approximate Indexes
• $W_n$ denotes the set of keywords stored at node n.
• With $n_1, \ldots, n_m$ the children of n, p(n) the probability that n satisfies the spatial condition of a query in the workload, t(W) the lookup time of an approximate index on W, and s(W) its space cost, the benefit of a node can also be given as
$$b(n) = \frac{p(n) \cdot t(W_n) - \sum_{i=1}^{m} p(n_i) \cdot t(W_{n_i})}{\left| s(W_n) - \sum_{i=1}^{m} s(W_{n_i}) \right|}$$
• The algorithm first computes, for the root r, the benefit of storing the approximate index at r’s children; it then traverses the tree by popping the pair with the highest benefit.
• The cost of building approximate indexes is called the space cost; for an index on a keyword set W it is computed by
$$s(W) = |W| \cdot (\text{avg. keyword length} - q + 1) \cdot (\text{size of each inverted-list element})$$
where q is the gram length, (avg. keyword length − q + 1) is the number of grams per keyword, W is the set of keywords, and the average keyword length is that of a particular data set.
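A rough sketch of the greedy selection described above follows. It is written under several assumptions that are not from the paper: the constants in `t` and `s` are made-up values, each node only records the size of its keyword set, a node with non-positive benefit keeps its index, and a push-down that would exceed the space budget is simply skipped. The class and function names are hypothetical.

```python
import heapq
from dataclasses import dataclass, field

SLOPE, INTERCEPT = 0.00002, 0.02        # assumed constants of t(W)
AVG_LEN, Q, ELEM_SIZE = 8, 2, 4         # assumed avg. keyword length, gram length, element size


def t(n_keywords):
    """Estimated lookup time of an approximate index on n_keywords keywords."""
    return SLOPE * n_keywords + INTERCEPT


def s(n_keywords):
    """Space cost: keywords * grams per keyword * element size."""
    return n_keywords * (AVG_LEN - Q + 1) * ELEM_SIZE


@dataclass
class Node:
    name: str
    p: float                 # probability that a query region overlaps the node
    n_keywords: int          # |W_n|
    children: list = field(default_factory=list)


def extra_space(n):
    """cSpace(n) - pSpace(n): additional space if the index moves to the children."""
    return sum(s(c.n_keywords) for c in n.children) - s(n.n_keywords)


def benefit(n):
    """b(n) = (pTime - cTime) / |pSpace - cSpace|."""
    saved = n.p * t(n.n_keywords) - sum(c.p * t(c.n_keywords) for c in n.children)
    space = abs(extra_space(n))
    if space == 0:
        return float("inf") if saved > 0 else float("-inf")
    return saved / space


def select_sa_nodes(root, budget):
    """Return the names of the nodes chosen to hold approximate indexes."""
    if not root.children:
        return [root.name]
    selected, used, tick = [], s(root.n_keywords), 0
    heap = [(-benefit(root), tick, root)]          # max-heap via negated benefit
    while heap:
        neg_b, _, n = heapq.heappop(heap)
        if -neg_b <= 0 or used + extra_space(n) > budget:
            selected.append(n.name)                # keep a single index at n
            continue
        used += extra_space(n)                     # push the index down
        for c in n.children:
            if c.children:
                tick += 1
                heapq.heappush(heap, (-benefit(c), tick, c))
            else:
                selected.append(c.name)            # leaves are selected immediately
    return selected


sparse = Node("N1", 1.0, 1000, [Node("N2", 0.3, 600), Node("N3", 0.3, 600)])
dense = Node("D", 1.0, 1000, [Node("a", 1.0, 900), Node("b", 1.0, 900)])
```

With these made-up numbers, the sparse node pushes its index to the children when the budget allows it, while the dense node keeps a single index at the parent, which matches the earlier observations.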
Cost/Benefit Estimation
• Effects of pushing an index down
– Increases space cost
– Increases or decreases average query time
• Typically
– Higher levels: good to push the index down
– Intermediate levels: unclear whether to push it down
Lookup time of an approximate index
• Clearly depends on the size of the index.
• Experimentally determined to be roughly linear in the index size, with a near-constant slope.
• Thus the average lookup time of an approximate index on a keyword set W is estimated to be
$$t(W) = \text{slope} \cdot |W| + \text{intercept}$$
where the slope and intercept are implementation dependent and can be experimentally determined.

Size      Time      Slope
1         0.02      -
10000     0.207     0.000019
1M        22.253    0.000022
10M       210.152   0.000021
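The slope column can be reproduced from the size and time columns by taking the difference quotient between consecutive rows. The sketch below does that and then builds a linear estimate of t(W); averaging the three slopes and anchoring the intercept at the first row are choices made here for illustration only.

```python
# (index size in keywords, measured lookup time) taken from the table above
measurements = [(1, 0.02), (10_000, 0.207), (1_000_000, 22.253), (10_000_000, 210.152)]

# Slope between each pair of consecutive measurements.
slopes = [(t2 - t1) / (n2 - n1)
          for (n1, t1), (n2, t2) in zip(measurements, measurements[1:])]

# Illustrative linear model: mean slope, intercept fixed by the first row.
slope = sum(slopes) / len(slopes)
intercept = measurements[0][1] - slope * measurements[0][0]


def estimated_lookup_time(num_keywords):
    """t(W) = slope * |W| + intercept."""
    return slope * num_keywords + intercept
```

The three slopes stay close to one another across four orders of magnitude of index size, which is what justifies the linear model.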
Algorithm 3: Exploiting Frequency Distribution of Keywords
• The frequency distribution of keywords is generally skewed. Ex: In a business-listings dataset, a keyword such as restaurant occurs more frequently than consulate.
• To reduce the number of keywords in the approximate indexes, we remove frequent keywords from sibling nodes and place them in their common parent instead.
• As a result, approximate indexes now appear even in the S-Nodes.
• Thus, S-Nodes now contain approximate indexes for frequent keywords, whereas SA-Nodes contain approximate indexes for infrequent keywords.
Index Construction
Index Construction
• A keyword is said to be frequent at a node n if the fraction of n’s children having that keyword is greater than a certain threshold value.
• A small threshold decreases the space cost of approximate indexes.
• On the other hand, the average query time may increase because we could visit false-positive nodes, since not all of n’s children actually contain the frequent keywords.
• Those false positives will be pruned at SK-Nodes.
• Updated benefit of a node, with $F_n$ the set of frequent keywords and $W_n$ the set of infrequent keywords at n:
$$\mathrm{pTime}(n) = p(n) \cdot t(W_n \cup F_n)$$
$$\mathrm{cTime}(n) = p(n) \cdot t(F_n) + \sum_{i=1}^{m} p(n_i) \cdot t\big((W_{n_i} \cup F_{n_i}) \setminus F_n\big)$$
$$\mathrm{pSpace}(n) = s(W_n \cup F_n)$$
$$\mathrm{cSpace}(n) = s(F_n) + \sum_{i=1}^{m} s\big((W_{n_i} \cup F_{n_i}) \setminus F_n\big)$$
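The frequency rule can be illustrated with a short sketch that, for one node, separates the keywords that are frequent among its children from those that remain with each child. The function name `split_frequent` and the sample keyword sets are hypothetical.

```python
from collections import Counter


def split_frequent(children_keywords, threshold):
    """children_keywords: one keyword set per child of a node n.
    Returns (frequent keywords kept at n, remaining keyword set per child)."""
    m = len(children_keywords)
    counts = Counter(k for kws in children_keywords for k in kws)
    # Frequent: the fraction of children having the keyword exceeds the threshold.
    frequent = {k for k, c in counts.items() if c / m > threshold}
    # Frequent keywords are removed from every child's keyword set.
    remaining = [kws - frequent for kws in children_keywords]
    return frequent, remaining


children = [
    {"restaurant", "chaochi"},
    {"restaurant", "starbucks"},
    {"restaurant", "consulate"},
    {"apple", "store"},
]
frequent, remaining = split_frequent(children, 0.5)
```

In this example restaurant is stored once at the parent instead of three times, but a query for it may now descend into the fourth child, which is the false-positive case mentioned above.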
Index Construction
Updated SelectSANodes Algorithm:
• To discover frequent keywords in the tree, two sets of keywords are maintained for each node n: a set of infrequent keywords Wn and a set of frequent keywords Fn.
• The frequent and infrequent keywords of a node are identified by examining its children.
• It is also ensured that popular keywords appear only at the root of a subtree, i.e., if a keyword w is frequent at node n, then w is removed from the approximate keyword sets of all of n’s children.
• The propagation of frequent and infrequent keywords is performed bottom-up until the keyword sets of all nodes have been filled.
• The next step is to choose the nodes on which to build approximate indexes.
• We use the updated benefit of a node instead of the original benefit.
• p(n) denotes the probability of n satisfying the spatial condition of any query in a workload.
Incremental Maintenance of Indexes
If (split in R*-tree)
• For the two new nodes generated by the split, recompute the stored sets of keywords (frequent and infrequent) by examining their children.
• Propagate all the new keywords up to the root, retraverse the tree, and rebuild approximate indexes at the places where a split has occurred (identified by a split marker).
Else
• First insert the object into the leaf according to the standard R*-tree procedure.
• Then propagate the keywords of the new object bottom-up.
• At an SK-Node, add the new keywords to its stored set of keywords.
• At an SA-Node, add the new keywords to its approximate index.
• At an S-Node, check its children for new frequent keywords, and add them to its approximate index.
Experiments and Analysis
• Datasets used: CoPhIR Test Collection (Flickr); business listings data (Florida International University).
• Packages used: Flamingo
• Approaches evaluated:
- Fixed-level approach (FL)
- Variable-level approach (VL)
• Processed the dataset to extract photos taken in the US, based on their latitude and longitude values.
• Used the keywords in the title, description, and tags of a photo as its textual attribute.
• Compared with the MHR-tree (contemporary paper).
• Used edit distance with threshold 2 for both approaches.
• Since the MHR-tree is probabilistic, it could miss answers, whereas the LBAK-tree does not.
• However, the MHR-tree has a comparatively small index size, which the LBAK-tree does not.
Experiments and Analysis
• Recall of the MHR-tree: constantly below 50%.
• Fig. (b): signature size increased to achieve higher recall.
• Query time also increased as the number of edit-distance computations increased, because the approximate keyword condition is validated at level.
• Comparing VLF with the MHR-tree:
- The MHR-tree has a smaller index size.
- But VLF has a smaller query time.
Experiments and Analysis
Size of index components for various construction algorithms.
• As the approximate indexes are pushed down the tree, the space requirement increases because of redundant keywords in adjacent nodes.
• Query time decreases because smaller indexes are searched instead of one big index.
Experiments and Analysis
• Effect of the index construction method on query performance.
• VL and VLF curves are smoother because they are more flexible than FL!
• They intersect at some point because of redundant keywords.
• At the points of intersection, VLF performs better!
• How frequent are keywords? Decided by the frequency threshold!
• Threshold = 0: every keyword is frequent.
• Threshold > 1: no keyword is frequent.
• The whole range of values in [0, 1] is plotted.
• Clear space-time trade-off with the keyword frequency threshold!
• Increase in threshold → more keywords pushed to lower levels → space overhead due to infrequent keywords being duplicated at multiple nodes.
Conclusion
• Spatial index + approximate index = LBAK-tree
- Simple fixed-level solution
- Utilizing the local spatial distribution of objects
- Exploiting the frequency distribution of keywords
• Developed a cost-based model that reduces index size and query times.
• Conducted experiments and compared against contemporary techniques.
• Room for improvement: further minimizing the index size.
References
[1] http://ir.iit.edu/~dagr/cs529/files/ir_book/CHAP%204%20Inverted%20Index.PDF
[2] http://en.wikipedia.org/wiki/N-gram
[3] http://en.wikipedia.org/wiki/R*-tree
[4] www.cs.fsu.edu/~lifeifei/papers/icde10_sas.pdf
[5] http://flamingo.ics.uci.edu/releases/4.0/
Thank You!
Questions?