
Keyword Search on Databases

COMP 9315, October 2007

web search

Features
- Easy to use
- Quick response
- Ranked results

DB search
- Build an interface for each particular database? e.g., imdb.com
- Write SQL query?
- A general-purpose search engine for databases?

A search engine for databases

Up until recently, information discovery in databases required:
- Knowledge of the schema
- Knowledge of a query language (e.g., SQL)
- Knowledge of the role of the keywords

Goal: Enable IR-style keyword search over DBMSs without the above requirements

i.e., how can you exploit structure without understanding the structure?

Examples

I am looking for a movie. I cannot remember the name, but it is an “action” movie about a “president”.

Examples
- University database: info on courses
- Online shopping: Canon Digital Rebel
- Movie database: an “action” movie about a “president”

Keyword search on databases

Introduction

Unstructured vs. structured data

Unstructured data / text
- Fact1: “Peter Jackson directed The Lord of the Rings The Return of the King.”
- Fact2: “Peter Jackson directed King Kong.”
- Fact3: “Peter Tait acts in the movie The Lord of the Rings The Return of the King.”
- Fact4: “Peter Tait acts in The Lord of the Rings The Return of the King, which is directed by Peter Jackson.”
- ……

Structured data / relational

Director
DID  Name
d1   Peter Jackson

Actor
AID  Name
a1   Peter Tait
a2   Peter King

Movie
MID  DID  Title
m1   d1   The Lord of the Rings: The Return of the King
m2   d1   King Kong

Act
MID  AID
m1   a1
m3   a2

Issues
- A “fact” is broken down into multiple pieces → keyword search in databases is like playing jigsaw!
- Try to find some “nice” pictures (facts)
- A few given features (keywords) should appear in the pictures

Challenges

                     How to define a result?       How to search efficiently?   How to rank results?
text                 a document (e.g., a webpage)  use inverted index           IR-style ranking, popularity, etc.
text with structure  ?                             ?                            ?

How to define and rank pictures? How to play fast?

Inverted Indexes the IR Way

Boolean Model
- Simple retrieval model based on set theory
- Example: give me emails that contain “9314” and “timetable”
- If the documents containing a keyword (term) are immediately available, we only need to do a merge

Inverted index
- For each term T, we must store a list of all documents that contain T
- Within each postings list, sort by docID
- Merge
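The merge step can be sketched as follows. This is a minimal illustration of a Boolean AND over two docID-sorted postings lists; the `index` contents and the function name `intersect` are made up for the example.

```python
def intersect(p1, p2):
    """Merge two docID-sorted postings lists, keeping docIDs present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return out

# Hypothetical postings lists (term -> sorted docIDs)
index = {"9314": [1, 2, 3, 7], "timetable": [2, 3, 9]}
result = intersect(index["9314"], index["timetable"])
print(result)
```

Because both lists are sorted, the merge touches each posting at most once, so the cost is linear in the combined list length.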

How Inverted Files are Created

Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

(Term, Doc #) pairs in parse order:
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1,
it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2


After all documents have been parsed, the inverted file is sorted alphabetically.

(Term, Doc #) pairs after sorting:
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2,
night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2

How Inverted Files are Created

Multiple term entries for a single document are merged. Within-document term frequency information is compiled.

(Term, Doc #, Freq) triples:
a 2 1, aid 1 1, all 1 1, and 2 1, come 1 1, country 1 1, country 2 1, dark 2 1, for 1 1, good 1 1, in 2 1, is 1 1, it 2 1, manor 2 1,
men 1 1, midnight 2 1, night 2 1, now 1 1, of 1 1, past 2 1, stormy 2 1, the 1 2, the 2 2, their 1 1, time 1 1, time 2 1, to 1 2, was 2 2


Finally, the file can be split into a Dictionary (or Lexicon) file and a Postings file.

How Inverted Files are Created

Dictionary/Lexicon              Postings
Term      N docs  Tot Freq      (Doc #, Freq)
a         1       1             (2, 1)
aid       1       1             (1, 1)
all       1       1             (1, 1)
and       1       1             (2, 1)
come      1       1             (1, 1)
country   2       2             (1, 1), (2, 1)
dark      1       1             (2, 1)
for       1       1             (1, 1)
good      1       1             (1, 1)
in        1       1             (2, 1)
is        1       1             (1, 1)
it        1       1             (2, 1)
manor     1       1             (2, 1)
men       1       1             (1, 1)
midnight  1       1             (2, 1)
night     1       1             (2, 1)
now       1       1             (1, 1)
of        1       1             (1, 1)
past      1       1             (2, 1)
stormy    1       1             (2, 1)
the       2       4             (1, 2), (2, 2)
their     1       1             (1, 1)
time      2       2             (1, 1), (2, 1)
to        1       2             (1, 2)
was       1       2             (2, 2)
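The construction steps (parse, sort, merge duplicates, split into dictionary and postings) can be sketched in a few lines. This is an illustrative in-memory version only; the tokenizer (lowercase alphabetic runs) and the function name `build_inverted_file` are assumptions, and a real system would write both files to disk.

```python
import re
from collections import Counter, defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def build_inverted_file(docs):
    postings = defaultdict(list)  # term -> [(doc_id, freq), ...] in docID order
    for doc_id in sorted(docs):
        # Parse: extract lowercase tokens (assumed tokenizer)
        tokens = re.findall(r"[a-z]+", docs[doc_id].lower())
        # Merge duplicate (term, doc) entries into one posting with a frequency
        for term, freq in Counter(tokens).items():
            postings[term].append((doc_id, freq))
    # Dictionary: term -> (number of docs, total frequency)
    dictionary = {t: (len(p), sum(f for _, f in p)) for t, p in postings.items()}
    return dictionary, dict(postings)

dictionary, postings = build_inverted_file(docs)
for term in sorted(dictionary):  # alphabetical order, as in the sorted file
    print(term, dictionary[term], postings[term])
```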

More about inverted indexes
- Permit fast search for individual terms and Boolean queries
- For each term, you get a list consisting of:
  - document ID
  - frequency of term in doc (optional)
  - position of term in doc (optional)
- Inverted indexes are used in
  - popular RDBMSs, such as Oracle and MySQL
  - search engines, such as Google

Ranking

What about Ranking?

Why ranking?
- Too many (thousands of) results match a keyword query
- Users are interested in a few (<10) results

Lots of variations of ranking, combining subsets of:
- IR-style relevance: based on term frequencies, proximities, position (e.g., in title), etc.
- Popularity information: frequently visited pages, users’ preference, etc.
- Link analysis information (e.g., Google’s PageRank)

IR-Style Ranking

      9314   timetable   9318   …
D1    5      0           0      …
D2    3      2           100    …
D3    1000   10          6      …
D4    0      0           7      …
…     …      …           …      …

Query: 9314 AND timetable
Answer: 1) D3  2) D2

Consider the number of occurrences of a term in a document: the (raw) term frequency (tf)

$$S(D,Q) = \sum_{t \in Q} tf(t,D)$$

tf is often attenuated:
ntf = 1 + log(tf), if tf > 0
ntf = 0, otherwise
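As a quick check of the ranking above, the sketch below scores the four documents for the query "9314 AND timetable", once with raw tf and once with the attenuated ntf. The natural logarithm is an assumption (the base is not stated), and the function names are illustrative.

```python
import math

# Term frequencies from the table above
tf = {
    "D1": {"9314": 5, "timetable": 0, "9318": 0},
    "D2": {"9314": 3, "timetable": 2, "9318": 100},
    "D3": {"9314": 1000, "timetable": 10, "9318": 6},
    "D4": {"9314": 0, "timetable": 0, "9318": 7},
}

def ntf(raw):
    """Attenuated tf: 1 + log(tf) if tf > 0, else 0 (natural log assumed)."""
    return 1 + math.log(raw) if raw > 0 else 0.0

def rank(query, attenuate=False):
    weight = ntf if attenuate else float
    # AND semantics: keep only documents containing every query term
    hits = [d for d, row in tf.items() if all(row[t] > 0 for t in query)]
    scored = [(d, sum(weight(tf[d][t]) for t in query)) for d in hits]
    return sorted(scored, key=lambda pair: -pair[1])

raw_ranking = rank(["9314", "timetable"])
att_ranking = rank(["9314", "timetable"], attenuate=True)
print(raw_ranking)
print(att_ranking)
```

Both variants rank D3 above D2 here, but attenuation shrinks the gap between them considerably.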

IR-Style Ranking

      9314   timetable   9318   …
D1    5      0           0      …
D2    3      20          100    …
D3    1000   10          6      …
D4    0      0           7      …
…     …      …           …      …

- Forced to compare “apples” with “oranges” “fairly”
- Intuition: frequently appearing words are less important

idf (inverse document frequency) = N/df
- idf(“9318”) = 4/3, idf(“timetable”) = 4/2
- often attenuated by log() too
- idf is a collection-wide (collection-specific) statistic
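The idf values quoted above can be reproduced directly from the table. A minimal sketch, assuming df counts the documents in which a term's frequency is non-zero and using the natural log for the attenuated variant:

```python
import math

# Term frequencies from the table above
tf = {
    "D1": {"9314": 5, "timetable": 0, "9318": 0},
    "D2": {"9314": 3, "timetable": 20, "9318": 100},
    "D3": {"9314": 1000, "timetable": 10, "9318": 6},
    "D4": {"9314": 0, "timetable": 0, "9318": 7},
}

N = len(tf)  # number of documents in the collection
terms = ["9314", "timetable", "9318"]

# df: number of documents containing the term
df = {t: sum(1 for row in tf.values() if row[t] > 0) for t in terms}
idf = {t: N / df[t] for t in terms}
log_idf = {t: math.log(v) for t, v in idf.items()}  # attenuated variant

print(df)
print(idf)
```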

IR-Style Ranking: Pivoted Normalization Weighting

Symbol  Meaning
tf_t    t’s freq in the document
qtf_w   w’s freq in the query
N       #docs in the collection
df_t    #docs containing t
dl      document length (#terms)
avdl    avg doc length
s       a parameter, default = 0.2
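For reference, pivoted normalization weighting is commonly written in the IR literature (following Singhal) in the form below. This is a sketch using the symbols listed above, on the assumption that the query-side subscript w and the document-side subscript t range over the same terms; the exact variant intended here is not stated.

```latex
S(D,Q) \;=\; \sum_{t \in Q \cap D}
  \frac{1 + \ln\bigl(1 + \ln(tf_t)\bigr)}{(1 - s) + s \cdot \dfrac{dl}{avdl}}
  \cdot qtf_t \cdot \ln\frac{N + 1}{df_t}
```

The numerator attenuates tf twice, the denominator normalizes by document length around the pivot avdl, and the last factor is a log-attenuated idf.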

IR-style ranking: MySQL Ranking

local weight = (ln(tf)+1)/sumtf * dl/(1+0.0115*dl)
global weight = ln((N-df)/df)
query weight = Σ_{t∈Q} local weight * global weight

Symbol  Meaning
tf      How many times the term appears in the row
sumtf   The sum of "(ln(tf)+1)" for all terms in the same row
dl      How many unique terms are in the row
N       How many rows are in the table
df      How many rows contain the term
qf      How many times the term appears in the query
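The three formulas above can be combined into a small scoring function. This is only a sketch of the formulas as written here, not MySQL's actual implementation; the function name, the sample row and the df values are hypothetical, and qf is not used because the query weight formula above does not include it.

```python
import math
from collections import Counter

def mysql_style_weight(row_terms, query_terms, n_rows, df):
    """Score one row for a query using the local/global weights above.

    row_terms: list of terms in the row; df: term -> number of rows containing it.
    """
    counts = Counter(row_terms)
    sumtf = sum(math.log(c) + 1 for c in counts.values())
    dl = len(counts)  # number of unique terms in the row
    total = 0.0
    for t in query_terms:
        tf = counts.get(t, 0)
        if tf == 0:
            continue  # term absent from the row contributes nothing
        local = (math.log(tf) + 1) / sumtf * dl / (1 + 0.0115 * dl)
        # Note: this is negative when the term occurs in more than half the rows
        global_w = math.log((n_rows - df[t]) / df[t])
        total += local * global_w
    return total

# Hypothetical row and statistics
row = ["king", "kong", "king"]
w = mysql_style_weight(row, ["king"], n_rows=10, df={"king": 2})
print(round(w, 4))
```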

Precision and Recall

[Figure: Venn diagram of two overlapping sets, A = relevant docs and B = retrieved docs, with intersection C]

Precision: fraction of retrieved docs that are relevant = C/B
Recall: fraction of relevant docs that are retrieved = C/A
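With sets, the two measures are one line each. A small sketch with made-up document IDs, where `relevant` plays the role of A and `retrieved` the role of B:

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document IDs."""
    c = len(relevant & retrieved)  # C: relevant docs that were retrieved
    precision = c / len(retrieved) if retrieved else 0.0
    recall = c / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 relevant docs, 3 retrieved, 2 in common
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
p, r = precision_recall(relevant, retrieved)
print(p, r)
```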

Towards Relational Data

Some Existing Systems

Graph-based Approach
- Proximity Search [1998]
- BANKS [2002 & 2005, IIT]

Relational Approach
- DBXplorer [2002 Microsoft] (SQL Server)
- DISCOVER [2002 & 2003, UCSD]
- SPARK [2007]

                     How to define a result?  How to search efficiently?  How to rank results?
text                 a document               use inverted index          IR-style ranking, popularity, etc.
text with structure  ?                        ?                           ?

BANKS

http://www.cse.iitb.ac.in/banks/

BANKS: Graph-based Approach

Data: Database is modelled as a graph
- Nodes are the tuples
- Edges are the references between tuples (foreign key to primary key join)

Query: set of keywords {k1, k2, .., kn}
- Each keyword ki matches a set of nodes Si

Answer: rooted, directed tree connecting nodes, with one node from each Si

Example query: sudarshan roy

Ranking

Ranking based on proximity + prestige

Proximity
- Forward edges are foreign key to primary key references
- Weight of a forward edge is based on the schema (how tables are correlated)
- Backward edges are added to account for “hubs”
- Weight of backward edge u → v ∝ indegree of u

Prestige
- Calculated from the indegree of the node

Answer tree relevance
- Edge score E = 1 / Σ edge-weights
- Node score N = Σ root- and leaf-node-weights (ignore weights of internal nodes)
- Normalize and combine using weighting factor λ
- Additive: (1 − λ)E + λN; multiplicative: E·N^λ
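The two ways of combining the edge score and the node score can be sketched as below. The normalization step is left out: this sketch assumes the caller passes weights that are already normalized, and the function name, the default λ = 0.2 and the sample weights are all illustrative.

```python
def tree_relevance(edge_weights, root_leaf_node_weights, lam=0.2, mode="additive"):
    """Combine edge score E and node score N of one answer tree.

    Assumes a non-empty tree with positive, already-normalized weights.
    Weights of internal nodes are simply not passed in.
    """
    e = 1.0 / sum(edge_weights)      # E = 1 / sum of edge weights
    n = sum(root_leaf_node_weights)  # N = sum of root and leaf node weights
    if mode == "additive":
        return (1 - lam) * e + lam * n
    if mode == "multiplicative":
        return e * n ** lam
    raise ValueError("mode must be 'additive' or 'multiplicative'")

# Hypothetical tree: two edges of weight 1, root/leaf prestige 0.3 and 0.2
additive = tree_relevance([1, 1], [0.3, 0.2], lam=0.2, mode="additive")
multiplicative = tree_relevance([1, 1], [0.3, 0.2], lam=0.2, mode="multiplicative")
print(additive, multiplicative)
```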

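Proximity Example
- Weight of forward edge based on schema
- Weight of backward edge = indegree of edges pointing to the node

[Figure: tuples stu: A, stu: B and stu: C each reference Uni: UNSW. Each forward edge (student to UNSW) has weight 1; each backward edge (UNSW to student) has weight 3, the indegree of UNSW.]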

Searching [BANKS2 2005]

Backward Expanding Search Algorithm:
- Start at nodes that contain query keywords.
- Run concurrent single-source shortest path algorithms from these nodes.
  - Create an iterator for each node matching a keyword
  - Traverse the graph edges in reverse direction
- Output a node whenever it is in the intersection of the sets of nodes reached from each keyword
- Answers may not be output in the most relevant order
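A much simplified sketch of backward expanding search follows. It deviates from the description above in one labeled way: instead of one iterator per matching node, it runs one multi-source Dijkstra per keyword, all interleaved through a single priority queue. The graph, its weights and the function name are hypothetical; the edge dictionary is assumed to already contain the backward edges.

```python
import heapq

def backward_expanding_search(edges, keyword_nodes):
    """edges: node -> [(neighbor, weight)] for directed edges node -> neighbor.
    keyword_nodes: list of node sets S_i, one per keyword.
    Returns candidate roots as (node, total path weight) in discovery order."""
    # Reverse adjacency: we expand against the edge direction
    reverse = {}
    for u, outs in edges.items():
        for v, w in outs:
            reverse.setdefault(v, []).append((u, w))
    n = len(keyword_nodes)
    dist = [dict() for _ in range(n)]  # dist[i][x] = shortest distance from x to S_i
    heap = [(0.0, i, node) for i, s in enumerate(keyword_nodes) for node in s]
    heapq.heapify(heap)
    roots = []
    while heap:
        d, i, node = heapq.heappop(heap)
        if node in dist[i]:
            continue  # already settled for keyword i
        dist[i][node] = d
        if all(node in dist[j] for j in range(n)):
            # node reaches every keyword: it is the root of an answer tree
            roots.append((node, sum(dist[j][node] for j in range(n))))
        for pred, w in reverse.get(node, []):
            if pred not in dist[i]:
                heapq.heappush(heap, (d + w, i, pred))
    return roots

# Hypothetical graph: writes tuples w1, w2 link paper p1 to authors sud and roy.
# Forward edges have weight 1; backward edges from p1 have weight 2 (its indegree).
edges = {
    "w1": [("p1", 1), ("sud", 1)],
    "w2": [("p1", 1), ("roy", 1)],
    "p1": [("w1", 2), ("w2", 2)],
    "sud": [("w1", 1)],
    "roy": [("w2", 1)],
}
roots = backward_expanding_search(edges, [{"sud"}, {"roy"}])
print(roots)
```

In this toy graph the first root found is the paper p1 (total weight 6.0), yet roots found later have a smaller total weight, which illustrates why the output order need not be the relevance order.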

DISCOVER

Architecture

[Figure: DISCOVER architecture, illustrated with the query [Peter king]]
- Tuple sets: MovieQ = [(m1,title,5), (m2,title,3)]; DirectorQ = [(d3,name,3), (d1,name,2)]; ActorQ; ……
- Candidate networks: MovieQ ← DirectorQ; MovieQ ← Act{} → ActorQ; ……
- Generated SQL: ... SELECT * FROM MovieQ m, DirectorQ d WHERE m.did = d.did AND m.mid = ? AND d.did = ?; ...
- Results: d1 → m1; d1 → m2; a2

Score [DISCOVER2 2003]

Local score (for a single tuple):

Score of a joining tuple tree: average of the local scores
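A one-line sketch of the tree score, using local scores that appear in the DISCOVER examples in these slides. Treating a free tuple (one matching no keyword, such as a tuple from Act{}) as having local score 0 is an assumption, chosen because it reproduces the scores 4 and 2 quoted for the Sparse algorithm.

```python
def tree_score(local_scores):
    """Score of a joining tuple tree: the average of its tuples' local scores."""
    return sum(local_scores) / len(local_scores)

# m1 (score 5) joined with d3 (score 3)
print(tree_score([5, 3]))
# m1 (5) joined through a free Act tuple (assumed 0) with a7 (1)
print(tree_score([5, 0, 1]))
```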

Candidate Networks
- Promising: MovieQ ← DirectorQ ⋈ MovieQ is pruned
- Minimal: ActorQ ⋈ Act ⋈ MovieQ ← Director is pruned

Query Processing
1. Generate tuple set graph from the schema graph and query keywords
2. Breadth-first enumeration of all Candidate Networks (CNs)
3. Rewrite the list of CNs into an execution schedule
4. Execute it

Different strategies of scheduling

Query Processing: Naïve
1. Retrieve top-k results from each CN, e.g., for MovieQ ← DirectorQ:
   SELECT * FROM MovieQ m, DirectorQ d WHERE m.did = d.did;
2. ORDER BY + LIMIT
3. Merge them to obtain top-k query results
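The naïve strategy can be tried out with an in-memory SQLite database. Everything below is a sketch under assumptions: the table layout, the join keys, the single Act row and the convention that a free Act tuple scores 0 are all made up for illustration; only the tuple scores come from the slides.

```python
import heapq
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MovieQ(mid TEXT, did TEXT, score REAL);
CREATE TABLE DirectorQ(did TEXT, score REAL);
CREATE TABLE ActorQ(aid TEXT, score REAL);
CREATE TABLE Act(mid TEXT, aid TEXT);
INSERT INTO MovieQ VALUES ('m1', 'd3', 5), ('m2', 'd3', 3);
INSERT INTO DirectorQ VALUES ('d3', 3), ('d1', 2);
INSERT INTO ActorQ VALUES ('a7', 1), ('a9', 0.5);
INSERT INTO Act VALUES ('m1', 'a7');
""")

# One SQL statement per candidate network, each with ORDER BY + LIMIT
CN_QUERIES = [
    """SELECT m.mid || '-' || d.did, (m.score + d.score) / 2.0 AS tree_score
       FROM MovieQ m, DirectorQ d WHERE m.did = d.did
       ORDER BY tree_score DESC LIMIT ?""",
    """SELECT m.mid || '-' || a.aid, (m.score + 0 + a.score) / 3.0 AS tree_score
       FROM MovieQ m, Act x, ActorQ a WHERE m.mid = x.mid AND x.aid = a.aid
       ORDER BY tree_score DESC LIMIT ?""",
]

def naive_top_k(k):
    candidates = []
    for sql in CN_QUERIES:
        candidates.extend(conn.execute(sql, (k,)).fetchall())  # top-k per CN
    return heapq.nlargest(k, candidates, key=lambda row: row[1])  # merge

top2 = naive_top_k(2)
print(top2)
```

Every CN is executed even when its results cannot reach the final top-k, which is the inefficiency the later strategies address.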

Query Processing: Sparse
1. For each CN, compute an upper bound score of its results: MPS(CNi)
2. While the currently found top-k score is below the MPS of some unexecuted CN:
   1. Execute the CN with maximal MPS
   2. Update the top-k result set

Example (k=2)

movieQ      directorQ    actorQ
m1  5       d3  3        a7  1
m2  3       d1  2        a9  0.5
…   …       …   …        …   …

MPS: (MovieQ ← DirectorQ, 4), (MovieQ ← Act{} → ActorQ, 2)

top-k
m1-d3  4
m2-d3  3
…      …

current top-2 score > 2, stop!
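The Sparse loop can be sketched as follows, using the example's numbers. Each CN is represented by a name, its MPS and a callable that stands in for executing its SQL; this representation and the function name `sparse_top_k` are assumptions of the sketch.

```python
import heapq

def sparse_top_k(cns, k):
    """cns: list of (name, mps, execute), where execute() returns [(result, score)].
    Returns (top-k as [(score, result)], names of the CNs actually executed)."""
    top = []       # min-heap holding the best k (score, result) pairs so far
    executed = []
    # Visit CNs in decreasing order of their upper bound MPS
    for name, mps, execute in sorted(cns, key=lambda cn: -cn[1]):
        if len(top) == k and top[0][0] >= mps:
            break  # no remaining CN can beat the current k-th score
        executed.append(name)
        for result, score in execute():
            heapq.heappush(top, (score, result))
            if len(top) > k:
                heapq.heappop(top)
    return sorted(top, reverse=True), executed

cns = [
    ("MovieQ <- Act{} -> ActorQ", 2.0, lambda: [("m1-a7", 2.0)]),
    ("MovieQ <- DirectorQ", 4.0, lambda: [("m1-d3", 4.0), ("m2-d3", 3.0)]),
]
top, executed = sparse_top_k(cns, k=2)
print(top, executed)
```

Only the first CN is executed: after it, the second-best score (3) already exceeds the other CN's upper bound (2).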

Single pipeline / global pipeline algorithm

MovieQ          DirectorQ
tuple  score    tuple  score
m1     5        d1     3
m2     3        d2     2
m3     2        d3     1
m4     1

k=1

[Figure: grid of candidates with MovieQ scores (5, 3, 2, 1) on one axis and DirectorQ scores (1, 2, 3) on the other; the top-1 result is marked; 9 candidates checked]

When can we stop? score(top-1 result) ≥ score(un-seen candidates)

SPARK

SPARK
- Work from our group
- Searching, Probing & Ranking Top-k Results
- SPARK: Keyword Search on Relational Databases, Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou, SIGMOD 2007

Contributions
- Improved the effectiveness (scoring function)
- Efficient query processing

Effectiveness

Top-1 result for “Nikos Clique” on DBLP data

[Hristidis VLDB03]
  InProceeding: Clique-to-Clique Distance Computation Using a Specific Architecture

[Liu SIGMOD06]
  Joined tuples: Proceeding; Person: Nikos Karatzas; Series; InProceeding: Maximum Clique Transversals; Proceeding; InProceeding: On … Clique-Width and …

Ours
  Joined tuples: Person: Nikos Mamoulis; RPI; InProceeding: Constraint-Based Algorithms for Computing Clique Intersection Joins

SPARK: Top-k Keyword Query in Relational Databases

Score Function

$$score(T,Q) = score_{keyw}(T,Q) \times score_{compl}(T,Q) \times score_{size}(T)$$

score_keyw
- Consider all join results of CN(T). By concatenating text columns, we get a set of virtual documents.
- Then use a tf-idf score!

$$score_{keyw}(T,Q) = \sum_{q \in Q} score_{keyw}(T,q)$$

Example
- JTT: “the Lord of the Ring: the Return of the King Peter Jackson King Kong”
- tf(Peter) = 1, tf(King) = 2
- dl = 8
- idf(Peter): #join results of “Movie ⋈ Director ⋈ Movie” / #results containing Peter

score_compl: penalize incomplete results
- $score_{compl}(T,Q) \le 1$
- Complete result: $score_{compl}(T,Q) = 1$
- AND / OR

score_size: penalize large results
- $score_{size}(T) \le 1$
- |T| = 1: $score_{size}(T) = 1$




CN*(T): virtual documents with the same structure as T

$$score_{keyw}(T,Q) = \sum_{q \in T \cap Q} \frac{1 + \ln\bigl(1 + \ln(tf_q(T))\bigr)}{(1 - s) + s \cdot \dfrac{dl_T}{avdl_{CN^*(T)}}} \cdot idf_q$$

avdl & idf are estimated

$$score_{compl}(T,Q) = 1 - \left( \frac{\sum_{1 \le i \le m} (1 - T.i)^p}{m} \right)^{1/p}$$

T.i: weighted and normalized keyword appearance
p>1: switch of AND/OR query

$$score_{size}(T) = \bigl(1 + s_1 - s_1 \cdot size(CN)\bigr) \cdot \bigl(1 + s_2 - s_2 \cdot size(CN^{nf})\bigr)$$

size(CN^nf): # of non-free tuple sets
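The completeness and size factors are easy to experiment with. The sketch below follows the two formulas as reconstructed above; the parameter values (p = 2, s1 = 0.15, s2 = 0.05) and the sample inputs are arbitrary choices for illustration, not values given in the slides.

```python
import math

def score_compl(keyword_weights, p):
    """1 - (sum((1 - T.i)^p) / m)^(1/p), with T.i in [0, 1] for each of m keywords."""
    m = len(keyword_weights)
    return 1 - (sum((1 - w) ** p for w in keyword_weights) / m) ** (1 / p)

def score_size(size_cn, size_cn_nf, s1, s2):
    """(1 + s1 - s1*size(CN)) * (1 + s2 - s2*size(CN^nf))."""
    return (1 + s1 - s1 * size_cn) * (1 + s2 - s2 * size_cn_nf)

complete = score_compl([1.0, 1.0], p=2.0)    # every keyword fully present
incomplete = score_compl([1.0, 0.0], p=2.0)  # one keyword missing
single = score_size(1, 1, s1=0.15, s2=0.05)  # a result with one tuple
larger = score_size(3, 2, s1=0.15, s2=0.05)  # a three-tuple CN
print(complete, incomplete, single, larger)
```

The output matches the two boundary cases stated earlier: a complete result gets a completeness factor of 1, and a single-tuple result gets a size factor of 1 (up to floating-point rounding).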

Skyline Sweeping Algorithm

MovieQ          DirectorQ
m1  10.0        d1  5.0
m2  1.0         d2  1.1
m3  0.9         d3  0.8

[Figure: candidate grid with MovieQ scores (10.0, 1.0, 0.9) on one axis and DirectorQ scores (5.0, 1.1, 0.8) on the other; the first candidate is m1 ⋈ d1]

Challenge 1: minimize search cost
- 3 x 3 = 9 candidates, but we only need the top-1 result
- If m1 ⋈ d1 really join, we can stop now! score(m1 ⋈ d1) > score(un-seen candidates)

Solution: J*-algorithm
- Prune unnecessary candidates


Skyline Sweeping Algorithm

MovieQ          DirectorQ
m1  1.2         d1  1.2
m2  1.0         d2  1.1
m3  0.9         d3  0.8

Challenge 2: non-monotonic function
- score(mi ⋈ dj) = agg(score(mi), score(dj))
- agg is non-monotonic! e.g., score(m1 ⋈ d1) < score(m2 ⋈ d2)

Solution: find a tight monotonic upper bound
- For stopping criteria


SS: Monotonic Upper Bound

Idea
- Local scores lose information about the term distribution
- But we know the average term frequency

Example (tf normalized; dl & idf ignored)

JTT: m1 ⋈ d1

MovieQ         DirectorQ
tuple  tf      tuple  tf
m1     5       d1     3

- On average, Peter / King appears (5+3)/2 = 4 times
- The average case gives the upper bound score of m1 ⋈ d1:
  (1+ln(1+ln(4))) + (1+ln(1+ln(4))) = 6.8
  (1+ln(1+ln(3))) + (1+ln(1+ln(5))) = 5.1
  …
  (1+ln(1+ln(8))) = 4.1


SS: Monotonic Upper Bound

Suppose

$$sumidf = \sum_{q \in Q} idf_q$$

$$watf(t) = \frac{\sum_{q \in Q \cap CN} tf_q(t) \cdot idf_q}{sumidf}$$

Then

$$\sum_{q \in Q} \Bigl(1 + \ln\bigl(1 + \ln(tf_q(T))\bigr)\Bigr) \cdot idf_q \;\le\; sumidf \cdot \begin{cases} 1 + \ln\Bigl(1 + \ln\bigl(\sum_{t \in T} watf(t)\bigr)\Bigr) \\ \sum_{t \in T} watf(t) \end{cases}$$

- Tuple sets should be sorted by watf(t)
- Upper bound score of all un-seen candidates → stopping criteria!


Skyline Sweeping Algorithm

For Challenge 1 (search cost): lazily perform join

MovieQ          DirectorQ
tuple  watf     tuple  watf
m1     5        d1     3
m2     3        d2     2
m3     2        d3     1
m4     1

k=1

[Figure: candidate grid with MovieQ watf values (5, 3, 2, 1) on one axis and DirectorQ watf values (1, 2, 3) on the other; legend: “In heap”, “Perform join”; 5 candidates checked]

Heap (to do list), as the sweep proceeds:
(5/3,8.0)
(5/2,7.0),(3/3,6.0)
(3/3,6.0)
(3/3,6.0),(2/3,5.0),(3/2,5.0)…

stopping criteria: score (top-1 result) ≥ score (heap head)
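The lazy, heap-driven sweep can be sketched as below for k = 1. Several simplifications are assumed and should not be read as the actual SPARK implementation: the upper bound is simply the sum of the two watf values (which matches the heap entries shown above), a candidate's real score is taken to equal its upper bound, and the set `joins` stands in for probing the database. The single joining pair is made up.

```python
import heapq

def skyline_sweep_top1(left, right, joins, bound):
    """left/right: [(tuple_id, watf)] sorted by watf descending.
    joins: set of (left_id, right_id) pairs that really join (stand-in for a DB probe).
    bound: monotonic upper-bound function of two watf values.
    Returns (best (pair, score) or None, number of candidates checked)."""
    heap = [(-bound(left[0][1], right[0][1]), 0, 0)]  # max-heap via negated bound
    seen = {(0, 0)}
    best, checked = None, 0
    while heap:
        neg_ub, i, j = heap[0]
        if best is not None and best[1] >= -neg_ub:
            break  # score(top-1 result) >= score(heap head): stop
        heapq.heappop(heap)
        checked += 1
        pair = (left[i][0], right[j][0])
        if pair in joins and (best is None or -neg_ub > best[1]):
            best = (pair, -neg_ub)  # sketch: real score assumed equal to the bound
        # Lazily add the two neighbouring candidates to the to-do list
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-bound(left[ni][1], right[nj][1]), ni, nj))
    return best, checked

movie_q = [("m1", 5), ("m2", 3), ("m3", 2), ("m4", 1)]
director_q = [("d1", 3), ("d2", 2), ("d3", 1)]
joins = {("m2", "d2")}  # hypothetical: the only pair that actually joins
best, checked = skyline_sweep_top1(movie_q, director_q, joins, lambda a, b: a + b)
print(best, checked)
```

With these assumptions the sweep checks 5 of the 12 possible candidates before the stopping criterion fires.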


Block Pipeline Algorithm

Observation:
- Bound a non-monotonic function using a monotonic one? Not tight in many places!
- 1,2,3,4,5,…,9,10
- 1,2,8,7,9,…,4,5 ☺

Challenge: how to further reduce search cost?

Idea: keep tf signatures → a tighter, yet non-monotonic upper bound (UB2)
- UB1 (loose, monotonic): for stopping criteria
- UB2 (tight, non-monotonic): for a better join sequence

Block Pipeline Algorithm: 3 levels of laziness
1. Compute UB1
2. Compute UB2
3. Perform join

[Figure: candidate grid with MovieQ watf values (5, 3, 2, 1) and tf signatures (4,1), (3,0), (1,1), (1,0) on one axis, and DirectorQ watf values (1, 2, 3) with tf signatures (1,0), (2,0), (1,2) on the other]

Heap entries (candidate, UB1, UB2), as the algorithm proceeds:
(5/3,6.8,?)
(5/3,6.8,6.7),(5/2,6.5,?)
(3/3,6.2,?),(5/1,6.2,?),(5/2,6.5,5.8),(3/2,5.8,?)…
(5/3,6.8,6.7)(5/2,6.5,5.8),(3/3,6.2,?)
STOP!


Other Issues

Generalization to Multiple CNs
- Initially, push the lower-left points of each CN into the heap, and sort them all together
- Progressively output results

Optimization: join on blocks
- Why?
  1. Low join selectivity (i.e., too many white points)
  2. High database connection overhead
- How?
  - Evaluate a block of candidates in one SQL!
  - Group tuples by: tf signature / local score (watf) / row id


Efficiency: DBLP
- ~ 0.9M tuples in total
- k = 10
- PC: 1.8 GHz, 512 MB

[Figure: query time plot for k=10, with the 1 sec level marked]


References
[Sigmod99] P. J. Haas and J. M. Hellerstein. Ripple joins for online aggregation.
[VLDB01] A. Natsev, Y.-C. Chang, J. R. Smith, C.-S. Li, and J. S. Vitter. Supporting incremental join queries on ranked inputs.
[VLDB02] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases.
[ICDE02-1] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases.
[ICDE02-2] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS.
[VLDB03] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-Style Keyword Search over Relational Databases.
[VLDB05] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases.
[Sigmod06] F. Liu, C. T. Yu, W. Meng, and A. Chowdhury. Effective keyword search in relational databases.