
Supporting Location-based Approximate-Keyword Queries

ACM International Conference on Geographical Information Systems 2010

S Alsubaiee, A Behm, C Li – University of California, Irvine

Presenter: Raghav Karumur
Date: 3/30/2011

Course: [CSCI 8735] Advanced Database Systems
Department of Computer Science and Engineering

University of Minnesota, Twin Cities
Spring 2011

Lunch Time!


I’ll go for Chinese food! What was the restaurant’s name??? Uh…… Ch-o-chi???

Let me Find It!

Errr… Just one typo!

Outline

• Overview
• Problem Formulation and Preliminaries
• Contributions
• Algorithms
• Index Construction
• Experiments and Analysis
• Conclusion
• References


Overview


Terminology – Clear?

A location-based keyword search consists of a set of keywords plus a spatial location.

Goal: Find objects that have these keywords and are close to the location.
Ex: A user is looking for a restaurant named Chaochi close to San Jose. Consider the query:
Q1 : (Chaochi) near (San Jose)
The website returns listings close to San Jose that have the keyword Chaochi.

Problem: Inconsistencies can exist in user queries, in the data, or in both.
- Users/uploaders may enter a wrong spelling!
Q1’ : (Chochi) near (San Jose)

Therefore, Q1’ may not find the restaurant, because the keyword is mistyped.

Hence, support for approximate keyword search is necessary!

Overview


Approach used so far:

Build a collection of keywords similar to the mistyped keyword, and either suggest another query or find objects with these keywords.

Drawback of this approach: no support for simultaneous spatial and textual information.

Problem Formulation


Object Collection

chaochi restaurant <37.39, -121.87>
starbucks <37.79, -122.40>
starbucks <40.72, -73.99>
apple store <44.59, -92.99>
sam’s club <43.59, -116.47>
…


Find objects in “San Jose” with keywords similar to “chochi” & “resturant”

Problem Formulation
Location-Based Keyword Search:

• Given a collection of strings, find those that are similar to the given query string.
• Consider a collection of spatial objects o1, … , on, each having a set of keywords and a location.
• A spatial approximate-keyword query Q = <Qs, Qt> consists of two conditions:
  - a spatial condition Qs, such as a rectangle or a circle, and
  - an approximate keyword condition Qt = {<w1, δ1>, <w2, δ2>, …, <wk, δk>}, a set of k pairs, each representing a keyword wi with an associated similarity threshold δi.
• Goal: Find all objects in the collection that are within Qs and satisfy Qt.
• An object satisfies Qt if, for each keyword wi in Qt, the object has a keyword in its description whose similarity to wi is within the corresponding threshold δi.
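The goal stated above can be restated compactly as a set definition. This is only a restatement of the slide text, not a formula from the paper: the notation loc(o) for an object's location and W(o) for its keyword set is an assumption introduced here, and sim stands for whichever similarity function (for example edit distance) the query uses.

```latex
\[
\mathrm{Ans}(Q) = \{\, o \;\mid\; \mathrm{loc}(o) \in Q_s \;\wedge\;
  \forall \langle w_i, \delta_i \rangle \in Q_t \;\;
  \exists\, w \in W(o) : \mathrm{sim}(w, w_i) \le \delta_i \,\}
\]
```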

Problem Formulation

Approaches:

• Combine these two indexes
• Search the resultant index, called the LBAK-tree, to find answers


Trie-based method

Inverted-index method

Preliminaries: Location-Based Keyword Search

Find objects within a given spatial region that have a given set of keywords

Augment a hierarchical spatial index with textual information


Preliminaries: Approximate String Search

Collection of strings s: chaochi, chucho, church, …

Query q: chochi

Search

Output: strings s that satisfy Sim(q,s) ≤ δ
Sim functions: edit distance, Jaccard, cosine, etc.
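The search on this slide can be sketched in a few lines of Python, assuming edit distance as the Sim function. The function names and the brute-force scan over the collection are illustrative assumptions, not the paper's implementation; a real system would use an index rather than compare the query against every string.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                              # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,        # delete ca
                           cur[j - 1] + 1,     # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def approximate_search(query, strings, delta):
    """Return every string s with edit_distance(query, s) <= delta."""
    return [s for s in strings if edit_distance(query, s) <= delta]


collection = ["chaochi", "chucho", "church"]
within_1 = approximate_search("chochi", collection, 1)
within_2 = approximate_search("chochi", collection, 2)
print(within_1, within_2)
```

With a threshold of 1 only chaochi qualifies; raising the threshold to 2 also admits chucho.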


Preliminaries: Approximate String Search

chaochi → 2-grams (sliding window): {ch, ha, ao, oc, ch, hi}

Intuition: similar strings share a certain number of grams

Gram-based inverted index
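A minimal sketch of a gram-based inverted index with a count filter, assuming 2-grams and edit distance. The function names are hypothetical, and the bound used (one edit operation destroys at most q grams) is the standard count-filtering argument rather than something stated on the slide.

```python
from collections import Counter, defaultdict


def grams(s: str, q: int = 2) -> Counter:
    """Multiset of q-grams obtained by sliding a window of length q over s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def build_index(strings, q=2):
    """Inverted index: gram -> {string id: occurrences of that gram}."""
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for g, count in grams(s, q).items():
            index[g][sid] = count
    return index


def candidates(query, strings, index, k, q=2):
    """Ids of strings sharing enough grams with the query to be within edit distance k."""
    # One edit operation destroys at most q grams, so at least this many survive.
    needed = (len(query) - q + 1) - k * q
    if needed <= 0:
        return set(range(len(strings)))        # the filter cannot prune anything
    shared = Counter()
    for g, q_count in grams(query, q).items():
        for sid, s_count in index.get(g, {}).items():
            shared[sid] += min(q_count, s_count)
    return {sid for sid, n in shared.items() if n >= needed}


strings = ["chaochi", "chucho", "church", "restaurant"]
index = build_index(strings)
found = sorted(strings[sid] for sid in candidates("chochi", strings, index, k=1))
print(found)
```

The filter only produces candidates: chucho passes it although its edit distance to chochi is 2, so each candidate still has to be verified with the real similarity function.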


Solution

Tree-based spatial index
+ Approximate string search capability
+ Keyword search capability
= LBAK-Tree


Contributions

• How to combine those indexes
• Three algorithms:
  1) Simple fixed-level solution
  2) Utilizing local spatial distribution of objects
  3) Exploiting frequency distribution of keywords


What is used:
• Queries with a spatial condition are typically supported by a tree-based index such as an R*-tree, KD-tree, or Quad-tree.
• The R*-tree is used in this paper.
• Most trie-based indexes are specific to edit distance and its variants, and do not support other similarity measures such as Jaccard.
• Inverted indexes, however, usually support a family of similarity metrics such as edit distance and Jaccard.
• An inverted index is therefore used in this paper.
• The tree is augmented with capabilities for approximate keyword search; the result is the LBAK-tree.
• A gram-based inverted index is used to perform approximate string search.


The LBAK tree


LBAK nodes may be classified into three categories:
• S-Nodes:
  - Do not store any textual information.
  - Used only for pruning based on the spatial condition.
• SA-Nodes:
  - Store the union of the keywords of their subtree.
  - Store an approximate index on these keywords.
  - Used for finding similar keywords.
  - Used for pruning based on the spatial and approximate conditions.
• SK-Nodes:
  - Store the union of the keywords of their subtree.
  - Used for pruning with the spatial condition and keywords.
  - The relevant similar keywords must already have been identified by the time we reach such a node.

Alg 1: Simple Fixed Level Solution




Query: objects in “San Jose” with keywords similar to “chochi” & “resturant”
– Based on an edit distance of 1
– Expressed as Q: <{San Jose}; {<chochi, 1>, <resturant, 1>}>

• The query clearly has typos.
• Assume nodes A, B, C, D satisfy the spatial condition San Jose.
• Throughout the traversal of the tree we always check the spatial condition.
• At the S-node A, we rely only on the spatial condition for pruning.
• When we reach SA-node B, we search its approximate index to find keywords similar to chochi and resturant according to the edit-distance threshold of 1.
• We find two keywords similar to chochi (namely chaochi and choochi) and one keyword similar to resturant (namely restaurant).
• When we visit the SK-nodes C and D, we intersect the stored keywords of each node with {chaochi, choochi} and with {restaurant}.
• Node C can be pruned because it does not have the keyword restaurant.
• Since node D has the keywords chaochi and restaurant, we traverse its children.
• We repeat the process until we find the answers.
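The traversal above can be sketched as follows. This is a simplified illustration under several assumptions: the Node class, its field names, and the bounding box for San Jose are hypothetical; the approximate index at an SA-node is replaced by a brute-force edit-distance scan; and the SK-nodes C and D are leaves here, whereas on the slide they have children.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def overlaps(r, s):
    """Rectangles are (min_x, min_y, max_x, max_y)."""
    return r[0] <= s[2] and s[0] <= r[2] and r[1] <= s[3] and s[1] <= r[3]


@dataclass
class Node:
    kind: str                                      # "S", "SA" or "SK"
    mbr: tuple                                     # bounding rectangle of the subtree
    keywords: set = field(default_factory=set)     # union of subtree keywords (SA/SK only)
    children: list = field(default_factory=list)
    objects: list = field(default_factory=list)    # leaf entries: (keywords, (x, y))


def search(node, region, conditions, similar=None):
    """conditions: list of (keyword, threshold); similar: one keyword set per condition."""
    if not overlaps(node.mbr, region):             # spatial pruning at every node
        return []
    if node.kind == "SA":                          # stand-in for the approximate index lookup
        similar = [{w for w in node.keywords if edit_distance(w, q) <= t}
                   for q, t in conditions]
    elif node.kind == "SK" and similar is not None:
        similar = [s & node.keywords for s in similar]
    if similar is not None and any(not s for s in similar):
        return []                                  # some query keyword has no match below
    hits = []
    for kws, (x, y) in node.objects:
        inside = region[0] <= x <= region[2] and region[1] <= y <= region[3]
        if inside and all(any(edit_distance(w, q) <= t for w in kws) for q, t in conditions):
            hits.append((kws, (x, y)))
    for child in node.children:
        hits += search(child, region, conditions, similar)
    return hits


box = (37.0, -122.5, 38.0, -121.5)                 # hypothetical rectangle around San Jose
node_c = Node("SK", box, {"choochi", "cafe"},
              objects=[({"choochi", "cafe"}, (37.40, -121.90))])
node_d = Node("SK", box, {"chaochi", "restaurant"},
              objects=[({"chaochi", "restaurant"}, (37.39, -121.87))])
node_b = Node("SA", box, node_c.keywords | node_d.keywords, children=[node_c, node_d])
node_a = Node("S", box, children=[node_b])
answers = search(node_a, box, [("chochi", 1), ("resturant", 1)])
print(answers)
```

Node C is pruned because its keyword set has nothing similar to resturant, so only the object under node D is returned.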

How to Choose Level L?
Trade-off between space and time – until “some” level (both increase)
• Usually, about 90% of query time is spent in approximate-index lookups.
• Therefore, choosing an optimal level L for the placement of the approximate indexes can greatly improve the average query time.


Observations
• Query time and index size are sensitive to the locations of the approximate indexes.
• The fixed-level solution ignores the local spatial distribution of objects.
• If a node is sparse, we might consider placing the index at its descendants.
• If a node is dense, we build the index at the node itself, because a query region is likely to overlap with many of its children.

Dense node: prefer to build the approximate index at the parent

Sparse node: prefer to build approximate indexes at the children


Algorithm 2: Placing Approximate Indexes at Variable Levels

(Spatial Nodes)

(Spatial-Approximate Nodes)

(Spatial-Keyword Nodes)


Selecting Nodes for Approximate Indexes

• Goal: Find optimal set of nodes that should have approximate indexes

• Optimization problem: “Given an R*-tree and a space budget, choose nodes from the tree to store approximate indexes, such that the average query time of a given workload is minimized.”

-- NP-hard problem!


Greedy Algorithm: Selecting Nodes for Approximate Indexes

[Figure: example tree with nodes N1 to N15; three nodes are marked with ✔.]


• A greedy algorithm SelectSANodes is developed that traverses the tree top-down and tries to push approx. indexes down the most promising paths.

Selecting Nodes for Approximate Indexes
• The algorithm maintains a priority queue of nodes to be traversed.
• The priority of node n is defined as the benefit of storing multiple approximate indexes at its children as compared to building a single index at n.
• For each visited node n, if the benefit of building multiple approximate indexes at n’s children is negative, then the algorithm selects n to be an SA-Node and does not traverse its children.
• If the algorithm reaches a leaf node, it immediately selects the leaf to be an SA-Node.
• The algorithm terminates when the space budget is exhausted or there is no more benefit to pushing approximate indexes down the tree.
• If pTime(n) denotes the average query time of probing the approximate index at the parent n, cTime(n) denotes this time if the indexes were built at the children, and pSpace(n) and cSpace(n) are the corresponding space costs of the indexes, then the benefit of node n is

$$b(n) = \frac{pTime(n) - cTime(n)}{|pSpace(n) - cSpace(n)|}$$

Selecting Nodes for Approximate Indexes
• Wn denotes the set of stored keywords at node n.
• For a node n with children n_1, …, n_m, the benefit of storing the approximate index at n’s children can also be given as

$$b(n) = \frac{p(n) \cdot t(W_n) - \sum_{i=1}^{m} p(n_i) \cdot t(W_{n_i})}{\left| s(W_n) - \sum_{i=1}^{m} s(W_{n_i}) \right|}$$

  where p(n) is the probability of n satisfying the spatial condition of a query in the workload, t(W) is the average lookup time of an approximate index on the keywords W, and s(W) is its space cost.
• The algorithm starts traversing the tree by popping the pair with the highest benefit.
• The cost of building multiple approximate indexes at n’s children is called the space cost; for an index on a set of keywords W it is computed by

$$s(W) = |W| \cdot (\mathit{avgLen} - q + 1) \cdot \mathit{elemSize}$$

  where q is the gram length (so avgLen - q + 1 is the number of grams per keyword), W is the set of keywords, avgLen is the average keyword length of a particular data set, and elemSize is the size of each inverted-list element.
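A minimal sketch of the greedy selection described above, written under stated assumptions rather than as the paper's SelectSANodes: the Node class, the numeric constants, and the handling of the budget (a node whose push-down does not fit simply keeps its own index) are all hypothetical choices made for illustration. Each node is summarised by the number of keywords it stores and by the probability that a query overlaps it.

```python
import heapq
from dataclasses import dataclass, field

SLOPE, INTERCEPT = 0.00002, 0.02        # assumed constants of the lookup-time model
AVG_LEN, Q, ELEM_SIZE = 8, 2, 4         # assumed avg. keyword length, gram length, element size


def t(num_keywords):
    """Estimated lookup time of an approximate index on num_keywords keywords."""
    return SLOPE * num_keywords + INTERCEPT


def s(num_keywords):
    """Estimated space of an approximate index on num_keywords keywords."""
    return num_keywords * (AVG_LEN - Q + 1) * ELEM_SIZE


@dataclass
class Node:
    name: str
    p: float                             # probability that a query overlaps this node
    num_keywords: int                    # number of keywords stored at the node
    children: list = field(default_factory=list)


def extra_space(n):
    """Space of the children's indexes minus the space of a single index at n."""
    return sum(s(c.num_keywords) for c in n.children) - s(n.num_keywords)


def benefit(n):
    """Query time saved per unit of space change when the index moves to n's children."""
    saved = n.p * t(n.num_keywords) - sum(c.p * t(c.num_keywords) for c in n.children)
    space = abs(extra_space(n))
    if space == 0:
        return float("inf") if saved > 0 else float("-inf")
    return saved / space


def select_sa_nodes(root, budget):
    """Greedy top-down choice of the nodes that receive an approximate index."""
    if not root.children:
        return [root.name]
    selected, used, tie = [], s(root.num_keywords), 0
    heap = [(-benefit(root), tie, root)]            # max-heap via negated benefit
    while heap:
        neg_b, _, n = heapq.heappop(heap)
        if -neg_b <= 0 or used + extra_space(n) > budget:
            selected.append(n.name)                 # keep a single index at n
            continue
        used += extra_space(n)                      # push the index down to the children
        for c in n.children:
            if c.children:
                tie += 1
                heapq.heappush(heap, (-benefit(c), tie, c))
            else:
                selected.append(c.name)             # leaves are selected immediately
    return selected


root = Node("root", 1.0, 1000, [Node("a", 0.3, 600), Node("b", 0.3, 600)])
roomy = select_sa_nodes(root, budget=40000)
tight = select_sa_nodes(root, budget=28000)
print(roomy, tight)
```

With the roomy budget the index is pushed down to the two children; with the tight budget the extra space does not fit and the root keeps a single index.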

Cost/Benefit Estimation
• Effects of pushing an index down
  – Increases the space cost
  – Increases or decreases the average query time
• Typically
  – Higher levels: good to push the index down
  – Intermediate levels: unclear whether to push it down

Lookup time of an approximate index
• Clearly depends on the size of the index.
• Experimentally determined to be linear in the number of keywords.
• Thus the average lookup time of an approximate index on W keywords is estimated to be

$$t(W) = \mathit{slope} \cdot |W| + \mathit{intercept}$$

where the slope and intercept are implementation dependent and can be experimentally determined.


Size     Time      Slope
1        0.02      -
10000    0.207     0.000019
1M       22.253    0.000022
10M      210.152   0.000021
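The slope column can be reproduced from the size and time columns as the slope between consecutive measurements. The sketch below assumes that 1M and 10M mean 1,000,000 and 10,000,000 keywords; the helper estimate_lookup_time is a hypothetical name for the linear model.

```python
# (number of keywords, measured lookup time) taken from the table
measurements = [(1, 0.02), (10_000, 0.207), (1_000_000, 22.253), (10_000_000, 210.152)]

# Slope between each pair of consecutive measurements
slopes = [(t1 - t0) / (n1 - n0)
          for (n0, t0), (n1, t1) in zip(measurements, measurements[1:])]


def estimate_lookup_time(num_keywords, slope, intercept):
    """Linear model: lookup time = slope * number of keywords + intercept."""
    return slope * num_keywords + intercept


rounded = [round(x, 6) for x in slopes]
print(rounded)
```

The three slopes stay close to 0.00002 across three orders of magnitude of index size, which is what supports the linear estimate.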

Algorithm 3: Exploiting Frequency Distribution of Keywords


• The frequency distribution of keywords is in general skewed. Ex: In a business listings dataset, a keyword such as restaurant occurs more frequently than consulate.

• To reduce the number of keywords in the approximate indexes, we remove frequent keywords from sibling nodes and place them in their common parent instead.
• As a result, approximate indexes now appear even in the S-nodes.

• Thus, S-Nodes now contain approximate indexes for frequent keywords, whereas SA-Nodes contain approximate indexes for infrequent keywords.

Index Construction


Index Construction
• A keyword is said to be frequent at a node n if the fraction of n’s children having that keyword is greater than a certain threshold value.
• A small threshold decreases the space cost of the approximate indexes.
• On the other hand, the average query time may increase because we could visit false-positive nodes, since not all of n’s children actually contain the frequent keywords.
• Those false positives will be pruned at SK-nodes.
• Updated benefit of a node, with Wn the infrequent and Fn the frequent keywords of node n:

$$pTime(n) = p(n) \cdot t(W_n \cup F_n)$$

$$cTime(n) = p(n) \cdot t(F_n) + \sum_{i=1}^{m} p(n_i) \cdot t\big((W_{n_i} \cup F_{n_i}) \setminus F_n\big)$$

$$pSpace(n) = s(W_n \cup F_n)$$

$$cSpace(n) = s(F_n) + \sum_{i=1}^{m} s\big((W_{n_i} \cup F_{n_i}) \setminus F_n\big)$$

Index Construction
Updated SelectSANodes Algorithm:
• To discover frequent keywords in the tree, two sets of keywords are maintained for each node n: a set of infrequent keywords Wn and a set of frequent keywords Fn.
• The frequent and infrequent keywords of a node are identified by examining its children.
• Also, it is ensured that popular keywords appear only at the root of a subtree, i.e., if a keyword w is frequent at node n, then w is removed from the approximate keyword sets in all of n’s children.
• The propagation of frequent and infrequent keywords is performed bottom-up until the keyword sets of all nodes have been filled.
• The next step is to choose the nodes on which to build approximate indexes.
• We use the updated benefit of a node instead of the original benefit.
• p(n) denotes the probability of n satisfying the spatial condition of any query in a workload.
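One step of the bottom-up propagation can be sketched as below: given the keyword sets of a node's children, split them into frequent and infrequent keywords and remove the frequent ones from the children. The function name and the sample keywords are hypothetical; in a multi-level tree the input for each child would be the union of that child's own frequent and infrequent sets.

```python
from collections import Counter


def split_keywords(children_keywords, threshold):
    """Split the keywords under a node into (frequent, infrequent) sets.

    children_keywords holds one keyword set per child; a keyword is frequent
    when the fraction of children containing it exceeds the threshold.
    """
    counts = Counter(w for kws in children_keywords for w in kws)
    m = len(children_keywords)
    frequent = {w for w, c in counts.items() if c / m > threshold}
    infrequent = set(counts) - frequent
    # Frequent keywords are kept at the parent only, so drop them from the children.
    remaining = [kws - frequent for kws in children_keywords]
    return frequent, infrequent, remaining


children = [{"restaurant", "chaochi"}, {"restaurant", "pizza"},
            {"restaurant", "consulate"}, {"cafe"}]
frequent, infrequent, remaining = split_keywords(children, threshold=0.5)
print(frequent, infrequent, remaining)
```

Here restaurant occurs in three of four children and becomes frequent at the parent. A query for it may still visit the fourth child, which is the false-positive case that the SK-nodes prune.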


Incremental Maintenance of Indexes


If (split in R*-tree)

• For the two new nodes generated by the split, recompute the stored sets of keywords (frequent and infrequent) by examining their children.

• Propagate all the new keywords up to the root, re-traverse the tree, and rebuild the approximate indexes at the places where a split has occurred (identified by a split marker).

Else

• First insert the object into the leaf according to the standard R*-tree procedure.

• Then the keywords of the new object are propagated bottom-up.

• At an SK-Node, we add the new keywords to its stored set of keywords.

• At an SA-Node, we add the new keywords to its approximate index.

• At an S-Node, we check its children for new frequent keywords and add them to its approximate index.

Experiments and Analysis


• Datasets used: CoPhIR Test Collection (Flickr); Business listings data (Florida International University).
• Packages used: Flamingo
• Approaches evaluated: Fixed-level approach (FL), Variable-level approach (VL)
• Processed the dataset to extract photos taken in the US based on their latitude and longitude values.
• Used the keywords in the title, description and tags of a photo as its textual attribute.
• Compared with the MHR-tree (contemporary paper).
• Used edit distance with threshold 2 for both approaches.
• Since the MHR-tree is probabilistic, it could miss answers, but the LBAK-tree does not.
• However, the MHR-tree has a comparatively small index size, which the LBAK-tree does not.



• Recall of the MHR-tree: constantly below 50%.
• Fig. (b): signature size increased to achieve higher recall.
• Query time also increased as the number of edit-distance computations increased, because the approximate keyword condition is validated at level.

• Comparing VLF with the MHR-tree:
• The MHR-tree has a smaller index size.
• But VLF has a smaller query time.



Size of index components for various construction algorithms.

• As the approximate indexes are pushed down the tree, the space requirement increases because of redundant keywords in adjacent nodes.
• Query time decreases because a few smaller indexes are searched instead of one big index.



• Effect of the index construction methods on query performance.
• The VL and VLF curves are smoother because they are more flexible than FL!
• They intersect at some point because of redundant keywords.
• At the points of intersection, VLF performs better!

• How frequent are keywords? Decided by the frequency threshold!
• Threshold = 0: every keyword is frequent.
• Threshold > 1: no keyword is frequent.
• The whole range of threshold values in [0, 1] is plotted.
• Clear space-time tradeoff with the keyword frequency threshold!
• Increase in threshold → more keywords pushed to lower levels → space overhead due to infrequent keywords being duplicated at multiple nodes.

Conclusion


• Spatial index + Approximate index = LBAK-tree
  - Simple fixed-level solution
  - Utilizing local spatial distribution of objects
  - Exploiting frequency distribution of keywords
• Developed a cost-based model with reduced index size and query times.
• Conducted experiments and compared with contemporary techniques.
• Can be improved further by minimizing the index size.

References
[1] http://ir.iit.edu/~dagr/cs529/files/ir_book/CHAP%204%20Inverted%20Index.PDF
[2] http://en.wikipedia.org/wiki/N-gram
[3] http://en.wikipedia.org/wiki/R*-tree
[4] http://www.cs.fsu.edu/~lifeifei/papers/icde10_sas.pdf
[5] http://flamingo.ics.uci.edu/releases/4.0/

Thank You!

Questions?

