Keyword Search on Databases
COMP 9315, October 2007
Web search
Features:
• Easy to use
• Quick response
• Ranked results

DB search
• Build an interface for each particular database? e.g., imdb.com
• Write an SQL query?
• A general-purpose search engine for databases?
A search engine for databases
Up until recently, information discovery in databases required:
• Knowledge of the schema
• Knowledge of a query language (e.g., SQL)
• Knowledge of the role of the keywords
Goal: enable IR-style keyword search over DBMSs without the above requirements,
i.e., how can you exploit structure without understanding the structure?
Examples
• University database: info on courses
• Online shopping: Canon Digital Rebel
• Movie database: an “action” movie about a “president”
  “I am looking for a movie. I cannot remember the name, but it is an ‘action’ movie about a ‘president’.”
Keyword search on databases
Introduction
Unstructured vs. structured data
Unstructured data / text
Fact1: “Peter Jackson directed The Lord of the Rings: The Return of the King.”
Fact2: “Peter Jackson directed King Kong.”
Fact3: “Peter Tait acts in the movie The Lord of the Rings: The Return of the King.”
Fact4: “Peter Tait acts in The Lord of the Rings: The Return of the King, which is directed by Peter Jackson.”
……
Structured data / relational

Actor
AID | Name
a1 | Peter Tait
a2 | Peter King

Director
DID | Name
d1 | Peter Jackson

Movie
MID | DID | Title
m1 | d1 | The Lord of the Rings: The Return of the King
m2 | d1 | King Kong

Act
MID | AID
m1 | a1
m3 | a2
Issues
A “fact” is broken down into multiple pieces → keyword search in databases is like playing jigsaw!
• Try to find some “nice” pictures (facts)
• A few given features (keywords) should appear in the pictures
Challenges
How to define and rank pictures? How to play fast?

Data | How to define a result? | How to search efficiently? | How to rank results?
text | a document (e.g., a webpage) | use inverted index | IR-style ranking, popularity, etc.
text with structure | ? | ? | ?
Inverted Indexes the IR Way
Boolean Model
• Simple retrieval model based on set theory
• e.g., give me emails that contain “9314” and “timetable”
• If the documents containing each keyword (term) are immediately available, we only need to do a merge
Inverted index
• For each term T, we must store a list of all documents that contain T
• Within each posting list, sort by docID
• Merge the posting lists of the query terms
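The merge step can be sketched as a linear intersection of two docID-sorted posting lists. This is a minimal illustration, not code from the lecture; the function name `intersect` and the docIDs in the toy index are assumptions.

```python
def intersect(p1, p2):
    """Merge two docID-sorted posting lists, keeping docIDs found in both."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return out


# Hypothetical posting lists: term -> sorted docIDs (values are illustrative).
index = {"9314": [1, 2, 3], "timetable": [2, 3]}
matches = intersect(index["9314"], index["timetable"])
```

Because both lists are sorted, the intersection takes one pass over each list, which is why the slide asks for postings to be sorted by docID.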
How Inverted Files are Created
Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term (Doc #):
now (1), is (1), the (1), time (1), for (1), all (1), good (1), men (1), to (1), come (1), to (1), the (1), aid (1), of (1), their (1), country (1),
it (2), was (2), a (2), dark (2), and (2), stormy (2), night (2), in (2), the (2), country (2), manor (2), the (2), time (2), was (2), past (2), midnight (2)
How Inverted Files are Created
After all documents have been parsed, the inverted file is sorted alphabetically.

Term (Doc #), sorted:
a (2), aid (1), all (1), and (2), come (1), country (1), country (2), dark (2), for (1), good (1), in (2), is (1), it (2), manor (2), men (1), midnight (2),
night (2), now (1), of (1), past (2), stormy (2), the (1), the (1), the (2), the (2), their (1), time (1), time (2), to (1), to (1), was (2), was (2)
How Inverted Files are Created
Multiple term entries for a single document are merged.
Within-document term frequency information is compiled.

Term (Doc #, Freq):
a (2, 1), aid (1, 1), all (1, 1), and (2, 1), come (1, 1), country (1, 1), country (2, 1), dark (2, 1), for (1, 1), good (1, 1), in (2, 1), is (1, 1), it (2, 1), manor (2, 1),
men (1, 1), midnight (2, 1), night (2, 1), now (1, 1), of (1, 1), past (2, 1), stormy (2, 1), the (1, 2), the (2, 2), their (1, 1), time (1, 1), time (2, 1), to (1, 2), was (2, 2)
How Inverted Files are Created
Finally, the file can be split into
• a Dictionary or Lexicon file, and
• a Postings file

Dictionary/Lexicon (Term, N docs, Tot Freq) with its Postings (Doc #, Freq):
Term | N docs | Tot Freq | Postings
a | 1 | 1 | (2, 1)
aid | 1 | 1 | (1, 1)
all | 1 | 1 | (1, 1)
and | 1 | 1 | (2, 1)
come | 1 | 1 | (1, 1)
country | 2 | 2 | (1, 1), (2, 1)
dark | 1 | 1 | (2, 1)
for | 1 | 1 | (1, 1)
good | 1 | 1 | (1, 1)
in | 1 | 1 | (2, 1)
is | 1 | 1 | (1, 1)
it | 1 | 1 | (2, 1)
manor | 1 | 1 | (2, 1)
men | 1 | 1 | (1, 1)
midnight | 1 | 1 | (2, 1)
night | 1 | 1 | (2, 1)
now | 1 | 1 | (1, 1)
of | 1 | 1 | (1, 1)
past | 1 | 1 | (2, 1)
stormy | 1 | 1 | (2, 1)
the | 2 | 4 | (1, 2), (2, 2)
their | 1 | 1 | (1, 1)
time | 2 | 2 | (1, 1), (2, 1)
to | 1 | 2 | (1, 2)
was | 1 | 2 | (2, 2)
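The four construction steps (parse, sort, merge, split) can be reproduced on the two example documents with a short sketch. The tokenizer below is a simplifying assumption (lowercasing and stripping trailing punctuation); the names `tokenize`, `dictionary` and `postings` are illustrative only.

```python
from collections import Counter

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def tokenize(text):
    # Assumed tokenizer: split on whitespace, lowercase, strip "." and ",".
    return [w.strip(".,").lower() for w in text.split()]


# Parse and merge: count occurrences of each (term, doc#) pair.
freqs = Counter(
    (term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)
)

# Sort and split: dictionary maps term -> [n_docs, tot_freq],
# postings maps term -> [(doc#, freq), ...] ordered by doc#.
dictionary, postings = {}, {}
for (term, doc_id), freq in sorted(freqs.items()):
    postings.setdefault(term, []).append((doc_id, freq))
    entry = dictionary.setdefault(term, [0, 0])
    entry[0] += 1
    entry[1] += freq
```

For instance, `dictionary["the"]` ends up as `[2, 4]` with postings `[(1, 2), (2, 2)]`, matching the corresponding dictionary and postings entries on the slide.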
More about inverted indexes
• Permit fast search for individual terms and Boolean queries
• For each term, you get a list consisting of:
  document ID
  frequency of term in doc (optional)
  position of term in doc (optional)
• Inverted indexes are used in:
  popular RDBMSs, such as Oracle and MySQL
  search engines, such as Google
Ranking
What about Ranking?
Why ranking?
• Too many (thousands of) results match a keyword query
• Users are interested in a few (<10) results
Lots of variations of ranking, combining subsets of:
• IR-style relevance: based on term frequencies, proximities, position (e.g., in title), etc.
• Popularity information: frequently visited pages, users’ preferences, etc.
• Link analysis information (e.g., Google’s PageRank)
IR-Style Ranking

Doc | 9314 | timetable | 9318 | …
D1 | 5 | 0 | 0 | …
D2 | 3 | 2 | 100 | …
D3 | 1000 | 10 | 6 | …
D4 | 0 | 0 | 7 | …
… | … | … | … | …

Query: 9314 AND timetable
Answer: 1) D3 2) D2

Consider the number of occurrences of a term in a document: the (raw) term frequency (tf).
tf is often attenuated:
ntf = 1 + log(tf), if tf > 0
ntf = 0, otherwise
S(D,Q) = Σ_{t∈Q} tf(t,D)
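As a small check of the tf-based score, the sketch below ranks the four documents from the table for the query "9314 AND timetable", once with raw tf and once with the attenuated ntf. The helper names `ntf` and `rank` are assumptions, and the natural logarithm is assumed for log().

```python
import math

# Term frequencies from the table above (only the two query terms).
tf = {
    "D1": {"9314": 5, "timetable": 0},
    "D2": {"9314": 3, "timetable": 2},
    "D3": {"9314": 1000, "timetable": 10},
    "D4": {"9314": 0, "timetable": 0},
}


def ntf(raw):
    """Attenuated term frequency: 1 + log(tf) if tf > 0, else 0."""
    return 1 + math.log(raw) if raw > 0 else 0.0


def rank(query, weight=lambda x: x):
    # Boolean AND: keep documents that contain every query term.
    matches = [d for d, row in tf.items() if all(row[t] > 0 for t in query)]
    scored = {d: sum(weight(tf[d][t]) for t in query) for d in matches}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


raw_ranking = rank(["9314", "timetable"])
damped_ranking = rank(["9314", "timetable"], weight=ntf)
```

Both variants return D3 before D2, as on the slide; the attenuated version mainly shrinks the gap caused by D3's very large raw count.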
IR-Style Ranking

Doc | 9314 | timetable | 9318 | …
D1 | 5 | 0 | 0 | …
D2 | 3 | 20 | 100 | …
D3 | 1000 | 10 | 6 | …
D4 | 0 | 0 | 7 | …
… | … | … | … | …

Forced to compare “apples” with “oranges” “fairly”
Intuition: frequently appearing words are less important
idf (inverse document frequency) = N/df
idf(“9318”) = 4/3, idf(“timetable”) = 4/2
Often attenuated by log() too.
idf is a collection-wide (collection-specific) statistic.
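The idf values quoted above can be recomputed from the four rows shown in the table. This is a minimal sketch that treats those four rows as the whole collection (N = 4), as the slide does; the function name `idf` and the `log_damped` flag are assumptions.

```python
import math

# The four rows shown in the table; the elided rows are ignored.
tf = {
    "D1": {"9314": 5, "timetable": 0, "9318": 0},
    "D2": {"9314": 3, "timetable": 20, "9318": 100},
    "D3": {"9314": 1000, "timetable": 10, "9318": 6},
    "D4": {"9314": 0, "timetable": 0, "9318": 7},
}


def idf(term, log_damped=False):
    """idf = N / df, optionally attenuated by a natural log."""
    n = len(tf)
    df = sum(1 for row in tf.values() if row[term] > 0)
    ratio = n / df
    return math.log(ratio) if log_damped else ratio
```

Here `idf("9318")` gives 4/3 and `idf("timetable")` gives 4/2, so the rarer term "timetable" counts for more than "9318".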
IR-Style Ranking: Pivoted Normalization Weighting

Symbol | Meaning
tf | the term's frequency in the document
qtf | the term's frequency in the query
N | #docs in the collection
df | #docs containing the term
dl | document length (#terms)
avdl | average document length
s | a parameter, default = 0.2
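The symbols above are those of the pivoted normalization formula as it usually appears in the IR literature: score(D,Q) = Σ over matching terms of (1 + ln(1 + ln(tf))) / ((1 − s) + s · dl/avdl) · qtf · ln((N + 1)/df). The sketch below implements that textbook form; it is taken from the general literature rather than from this transcript, and the function and argument names are assumptions.

```python
import math


def pivoted_score(query_tf, doc_tf, dl, avdl, n_docs, df, s=0.2):
    """Pivoted normalization score of one document for one query.

    query_tf: term -> qtf, doc_tf: term -> tf, df: term -> document frequency.
    """
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # only terms present in both query and document count
        norm_tf = (1 + math.log(1 + math.log(tf))) / ((1 - s) + s * dl / avdl)
        score += norm_tf * qtf * math.log((n_docs + 1) / df[term])
    return score


# Hypothetical numbers: the same match in an average-length and a long document.
short = pivoted_score({"king": 1}, {"king": 1}, dl=10, avdl=10, n_docs=9, df={"king": 1})
long_ = pivoted_score({"king": 1}, {"king": 1}, dl=20, avdl=10, n_docs=9, df={"king": 1})
```

With s = 0.2 a document of average length is not penalized at all, while a document twice as long has its score divided by 1.2.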
IR-Style Ranking: MySQL Ranking
local weight = (ln(tf)+1)/sumtf * dl/(1+0.0115*dl)
global weight = ln((N-df)/df)
query weight = Σ_{t∈Q} local weight * global weight * qf

Symbol | Meaning
tf | How many times the term appears in the row
sumtf | The sum of "(ln(tf)+1)" for all terms in the same row
dl | How many unique terms are in the row
N | How many rows are in the table
df | How many rows contain the term
qf | How many times the term appears in the query
Precision and Recall
A: relevant docs; B: retrieved docs; C: docs that are both relevant and retrieved (A ∩ B)
Precision: fraction of retrieved docs that are relevant = C/B
Recall: fraction of relevant docs that are retrieved = C/A
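Both measures reduce to set operations. A minimal sketch follows; the document IDs are made up for illustration and `precision_recall` is an assumed helper name.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document IDs."""
    hits = len(relevant & retrieved)  # C: relevant and retrieved
    precision = hits / len(retrieved) if retrieved else 0.0  # C / B
    recall = hits / len(relevant) if relevant else 0.0  # C / A
    return precision, recall


# Hypothetical example: 4 relevant docs, 3 retrieved, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d3", "d4", "d5"})
```

In this example two of the three retrieved documents are relevant (precision 2/3) and two of the four relevant documents were found (recall 1/2).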
Towards Relational Data
Some Existing Systems
Graph-based Approach
• Proximity Search [1998]
• BANKS [2002 & 2005, IIT]
Relational Approach
• DBXplorer [2002, Microsoft] (SQL Server)
• DISCOVER [2002 & 2003, UCSD]
• SPARK [2007]
Data | How to define a result? | How to search efficiently? | How to rank results?
text | a document | use inverted index | IR-style ranking, popularity, etc.
text with structure | ? | ? | ?
BANKS
http://www.cse.iitb.ac.in/banks/
BANKS: Graph-based Approach
Data: the database is modelled as a graph
• Nodes are the tuples
• Edges are the references between tuples (foreign key to primary key joins)
Query: a set of keywords {k1, k2, .., kn}
• Each keyword ki matches a set of nodes Si
Answer: a rooted, directed tree connecting nodes, with one node from each Si
Example query: sudarshan roy
Ranking
Ranking is based on proximity + prestige
Proximity
• Forward edges correspond to foreign key → primary key references
• Weight of a forward edge is based on the schema (how the tables are correlated)
• Backward edges are added to account for “hubs”
• Weight of a backward edge u → v ∝ indegree of u
Prestige
• Calculated from the indegree of the node
Answer tree relevance
• Edge score E = 1 / Σ edge-weights
• Node score N = Σ root- and leaf-node-weights (ignore weights of internal nodes)
• Normalize and combine using a weighting factor λ
• Additive: (1 − λ) E + λ N; multiplicative: E · N^λ
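The two ways of combining the edge and node scores can be sketched as follows. The normalization step is not specified on the slide, so the inputs are assumed to be already normalized; the default λ and the function name `tree_relevance` are placeholders.

```python
def tree_relevance(edge_weights, root_and_leaf_weights, lam=0.2, multiplicative=False):
    """Combine the edge score E and node score N of an answer tree.

    edge_weights: weights of all tree edges (assumed normalized, non-empty).
    root_and_leaf_weights: prestige of root and leaf nodes only
    (internal nodes are ignored, as on the slide).
    """
    if not edge_weights:
        raise ValueError("this sketch expects a tree with at least one edge")
    e = 1.0 / sum(edge_weights)
    n = sum(root_and_leaf_weights)
    if multiplicative:
        return e * n ** lam
    return (1 - lam) * e + lam * n


# Hypothetical tree: two edges of weight 1, root/leaf prestige 0.4 and 0.6.
additive = tree_relevance([1.0, 1.0], [0.4, 0.6])
product = tree_relevance([1.0, 1.0], [0.4, 0.6], multiplicative=True)
```

A larger λ shifts the ranking towards prestige, a smaller one towards compact trees with light edges.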
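Proximity Example
Weight of forward edge: based on the schema
Weight of backward edge: indegree of the node that the forward edges point to
[Figure: Uni: UNSW connected to stu: A, stu: B and stu: C; each connection carries weight 1 in one direction and weight 3 in the other]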
Searching [BANKS2 2005]
Backward Expanding Search Algorithm:
• Start at the nodes that contain query keywords
• Run concurrent single-source shortest-path algorithms from these nodes:
  create an iterator for each node matching a keyword, and traverse the graph edges in the reverse direction
• Output a node whenever it is in the intersection of the sets of nodes reached from each keyword
• Answers may not be output in order of relevance
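The idea of the backward expanding search can be sketched in a simplified form: run one multi-source Dijkstra per keyword over the reversed edges, then report every node reached from all keywords as a candidate root. This is a simplification (BANKS runs one iterator per matching node and interleaves them); the graph, node names and weights below are made up.

```python
import heapq


def backward_expanding_search(edges, keyword_nodes):
    """edges: {u: [(v, weight), ...]} for directed edges u -> v.
    keyword_nodes: one set of matching nodes per keyword.
    Returns (total distance, root) pairs, cheapest first."""
    reverse = {}
    for u, outs in edges.items():
        for v, w in outs:
            reverse.setdefault(v, []).append((u, w))

    def dijkstra(sources):
        # Shortest distance from each node to the nearest keyword node,
        # found by walking edges backwards from the keyword nodes.
        dist = {s: 0.0 for s in sources}
        heap = [(0.0, s) for s in sources]
        heapq.heapify(heap)
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist[node]:
                continue  # stale heap entry
            for prev, w in reverse.get(node, []):
                nd = d + w
                if nd < dist.get(prev, float("inf")):
                    dist[prev] = nd
                    heapq.heappush(heap, (nd, prev))
        return dist

    per_keyword = [dijkstra(nodes) for nodes in keyword_nodes]
    roots = set(per_keyword[0]).intersection(*per_keyword[1:])
    return sorted((sum(d[r] for d in per_keyword), r) for r in roots)


# Hypothetical graph: "writes" tuples w1, w2 link paper p1 to authors a1, a2.
edges = {
    "w1": [("p1", 1.0), ("a1", 1.0)],
    "w2": [("p1", 1.0), ("a2", 1.0)],
    "p1": [("w1", 2.0), ("w2", 2.0)],  # backward edges from the paper
}
answers = backward_expanding_search(edges, [{"a1"}, {"a2"}])
```

Unlike the incremental algorithm on the slide, this sketch finishes all expansions before ranking, so it sidesteps the out-of-order output issue at the cost of exploring more of the graph.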
DISCOVER
Architecture
Query: [Peter king]
Tuple sets:
MovieQ = [(m1,title,5), (m2,title,3)]
DirectorQ = [(d3,name,3), (d1,name,2)]
ActorQ
……
Candidate networks:
MovieQ ← DirectorQ
MovieQ ← Act{} → ActorQ
……
Parameterized SQL for a candidate network:
... SELECT * FROM MovieQ m, DirectorQ d WHERE m.did = d.did AND m.mid = ? AND d.did = ?; ...
Results: d1 → m1, d1 → m2; a2
Score [DISCOVER2 2003]
Local score (for a single tuple):
Score of a joining tuple tree: average of the local scores
Candidate Networks
Promising
• MovieQ ← DirectorQ MovieQ is pruned
Minimal
• ActorQ Act MovieQ ← Director is pruned
Query Processing
1. Generate the tuple set graph from the schema graph and query keywords
2. Breadth-first enumeration of all Candidate Networks (CNs)
3. Rewrite the list of CNs into an execution schedule
4. Execute it
Different strategies of scheduling
Query Processing: Naïve
1. Retrieve the top-k results from each CN, e.g., for MovieQ ← DirectorQ:
   SELECT * FROM MovieQ m, DirectorQ d WHERE m.did = d.did;
2. ORDER BY + LIMIT
3. Merge them to obtain the top-k query results
Query Processing: Sparse
1. For each CN, compute an upper bound score of its results: MPS(CNi)
2. While the currently found top-k score < the largest MPS(CNi) among the CNs not yet executed:
   1. Execute the CN with maximal MPS
   2. Update the top-k result set
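Example (k = 2)

movieQ
tuple | score
m1 | 5
m2 | 3
… | …

directorQ
tuple | score
d3 | 3
d1 | 2
… | …

actorQ
tuple | score
a7 | 1
a9 | 0.5
… | …

CNs with their MPS: (MovieQ ← DirectorQ, 4), (MovieQ ← Act{} → ActorQ, 2)

top-k
result | score
m1-d3 | 4
m2-d3 | 3
… | …

current top-2 score > 2, stop!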
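The Sparse strategy can be sketched as a loop over candidate networks in decreasing MPS order that stops once the k-th best result found so far is at least as good as the next upper bound. Each CN is represented here by a hypothetical `execute` callable standing in for the SQL query; the names and the result of the second CN are made up.

```python
def sparse_top_k(candidate_networks, k):
    """candidate_networks: list of (name, mps, execute), where execute()
    returns [(result, score), ...]. Returns (top-k results, executed CNs)."""
    pending = sorted(candidate_networks, key=lambda cn: cn[1], reverse=True)
    top, executed = [], []
    for name, mps, execute in pending:
        # Stop when no remaining CN can beat the current k-th result.
        if len(top) >= k and top[k - 1][1] >= mps:
            break
        executed.append(name)
        top = sorted(top + execute(), key=lambda r: r[1], reverse=True)[:k]
    return top, executed


# Scores follow the example: a tree's score is the average of its local scores.
cns = [
    ("MovieQ<-DirectorQ", 4, lambda: [("m1-d3", 4), ("m2-d3", 3)]),
    ("MovieQ<-Act->ActorQ", 2, lambda: [("m1-a7", 2)]),  # hypothetical result
]
top, executed = sparse_top_k(cns, k=2)
```

With k = 2 only the first candidate network is executed, because its two results (4 and 3) already exceed the bound 2 of the second network.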
Single pipeline / global pipeline algorithm

MovieQ
tuple | score
m1 | 5
m2 | 3
m3 | 2
m4 | 1

DirectorQ
tuple | score
d1 | 3
d2 | 2
d3 | 1

k = 1
When can we stop? score(top-1 result) ≥ score(un-seen candidates)
[Figure: candidate grid of MovieQ scores (5, 3, 2, 1) against DirectorQ scores (1, 2, 3), with the top-1 result marked]
9 candidates checked
SPARK
Work from our group
• Searching, Probing & Ranking Top-k Results
• SPARK: Keyword Search on Relational Databases, Yi Luo, Xuemin Lin, Wei Wang, Xiaofang Zhou, SIGMOD 2007
Contributions
• Improved the effectiveness (scoring function)
• Efficient query processing
Effectiveness
Top-1 result for “Nikos Clique” on DBLP data

[Hristidis VLDB03]
  InProceeding: Clique-to-Clique Distance Computation Using a Specific Architecture
[Liu SIGMOD06]
  Proceeding; Person: Nikos Karatzas; Series; InProceeding: Maximum Clique Transversals
  Proceeding; InProceeding: On … Clique-Width and …
Ours
  Person: Nikos Mamoulis; RPI; InProceeding: Constraint-Based Algorithms for Computing Clique Intersection Joins
SPARK: Top-k Keyword Query in Relational Databases
Score Function
$$\mathrm{score}(T,Q) = \mathrm{score}_{keyw}(T,Q) \times \mathrm{score}_{compl}(T,Q) \times \mathrm{score}_{size}(T)$$

Consider all join results of CN(T). By concatenating text columns, we get a set of virtual documents.
$$\mathrm{score}_{keyw}(T,Q) = \sum_{q \in Q} \mathrm{score}_{keyw}(T,q)$$
Then use a tf-idf score!

Example
JTT: “the Lord of the Ring: the Return of the King Peter Jackson King Kong”
tf(Peter) = 1, tf(King) = 2
dl = 8
idf(Peter): #join results of “Movie Director Movie” / #results containing Peter
Score Function
$$\mathrm{score}(T,Q) = \mathrm{score}_{keyw}(T,Q) \times \mathrm{score}_{compl}(T,Q) \times \mathrm{score}_{size}(T)$$

score_compl: penalize incomplete results (AND / OR)
• In general: $\mathrm{score}_{compl}(T,Q) \le 1$
• Complete result: $\mathrm{score}_{compl}(T,Q) = 1$

score_size: penalize large results
• In general: $\mathrm{score}_{size}(T) \le 1$
• |T| = 1: $\mathrm{score}_{size}(T) = 1$
Score Function
$$\mathrm{score}(T,Q) = \mathrm{score}_{keyw}(T,Q) \times \mathrm{score}_{compl}(T,Q) \times \mathrm{score}_{size}(T)$$

CN*(T): virtual documents with the same structure as T
$$\mathrm{score}_{keyw}(T,Q) = \sum_{q \in T \cap Q} \frac{1 + \ln\left(1 + \ln\left(tf_q(T)\right)\right)}{(1 - s) + s \cdot \frac{dl_T}{avdl_{CN^*(T)}}} \cdot idf_q$$
avdl & idf are estimated

$$\mathrm{score}_{compl}(T,Q) = 1 - \left( \frac{\sum_{1 \le i \le m} (1 - T.i)^p}{m} \right)^{1/p}$$
T.i: weighted and normalized keyword appearance
p > 1: switch of AND/OR query

$$\mathrm{score}_{size}(T) = \left(1 + s_1 - s_1 \cdot size(CN)\right) \cdot \left(1 + s_2 - s_2 \cdot size(CN^{nf})\right)$$
size(CN^nf): # of non-free tuple sets
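The three factors of the score function can be sketched as three small functions. The formulas follow the slide as far as they can be read from the transcript; the default values for s, p, s1 and s2, the idf values and the function names are placeholders chosen for illustration.

```python
import math


def score_keyw(tf, idf, dl, avdl, s=0.2):
    """tf-idf style score over the query keywords found in the virtual document."""
    return sum(
        (1 + math.log(1 + math.log(tf[q]))) / ((1 - s) + s * dl / avdl) * idf[q]
        for q in tf
        if tf[q] > 0
    )


def score_compl(t_i, p=2.0):
    """t_i: normalized appearance in [0, 1] of each of the m query keywords."""
    m = len(t_i)
    return 1 - (sum((1 - x) ** p for x in t_i) / m) ** (1 / p)


def score_size(size_cn, size_cn_nf, s1=0.15, s2=0.15):
    """Penalty for large results; equals 1 when both sizes are 1."""
    return (1 + s1 - s1 * size_cn) * (1 + s2 - s2 * size_cn_nf)


def spark_score(tf, idf, dl, avdl, t_i, size_cn, size_cn_nf):
    return score_keyw(tf, idf, dl, avdl) * score_compl(t_i) * score_size(size_cn, size_cn_nf)


# tf and dl come from the earlier example; idf and avdl are hypothetical.
example = spark_score(
    tf={"peter": 1, "king": 2}, idf={"peter": 2.0, "king": 1.5},
    dl=8, avdl=8, t_i=[1.0, 1.0], size_cn=3, size_cn_nf=2,
)
```

A result that covers every keyword fully gets a completeness factor of exactly 1, and a result made of a single tuple gets a size factor of 1, matching the boundary cases stated on the slides.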
Skyline Sweeping Algorithm
CN: MovieQ, DirectorQ

MovieQ
m1 | 10.0
m2 | 1.0
m3 | 0.9

DirectorQ
d1 | 5.0
d2 | 1.1
d3 | 0.8

Challenge 1: minimize search cost
• 3 x 3 = 9 candidates, but we only need the top-1 result
• If m1 and d1 really join, we can stop now: score(m1 d1) > score(un-seen candidates)
Solution: J*-algorithm
• Prune unnecessary candidates
Skyline Sweeping Algorithm

MovieQ
m1 | 1.2
m2 | 1.0
m3 | 0.9

DirectorQ
d1 | 1.2
d2 | 1.1
d3 | 0.8

Challenge 2: non-monotonic function
• score(mi dj) = agg(score(mi), score(dj))
• agg is non-monotonic! e.g., score(m1 d1) < score(m2 d2)
Solution: find a tight monotonic upper bound
• Used for the stopping criteria
SS: Monotonic Upper Bound
Idea
• Local scores lose the information of the term distribution
• But we know the average term frequency
Example (tf normalized; dl & idf ignored)
JTT: m1 d1 (MovieQ, DirectorQ)

MovieQ
tuple | tf
m1 | 5

DirectorQ
tuple | tf
d1 | 3

On average, Peter / King appears (5+3)/2 = 4 times
The average case gives the upper bound score of m1 d1:
(1+ln(1+ln(4))) + (1+ln(1+ln(4))) = 6.8
(1+ln(1+ln(3))) + (1+ln(1+ln(5))) = 5.1
…
(1+ln(1+ln(8))) = 4.1
SS: Monotonic Upper Bound
Suppose
$$sumidf = \sum_{q \in Q} idf_q$$
$$watf(t) = \frac{\sum_{q \in Q \cap CN} tf_q(t) \cdot idf_q}{sumidf}$$
Then
$$\sum_{q \in Q} \left(1 + \ln\left(1 + \ln\left(tf_q\right)\right)\right) \cdot idf_q \;\le\; sumidf \cdot \begin{cases} 1 + \ln\left(1 + \ln\left(\sum_{t \in T} watf(t)\right)\right) \\ \sum_{t \in T} watf(t) \end{cases}$$
• Tuple sets should be sorted by watf(t)
• Upper bound score of all un-seen candidates → stopping criteria!
Skyline Sweeping Algorithm
For Challenge 1 (search cost): lazily perform the join
• Candidates first wait in a heap (to-do list); the join is performed only when a candidate reaches the heap head

MovieQ
tuple | watf
m1 | 5
m2 | 3
m3 | 2
m4 | 1

DirectorQ
tuple | watf
d1 | 3
d2 | 2
d3 | 1

k = 1
Heap (to-do list), successive states:
(5/3,8.0)
(5/2,7.0), (3/3,6.0)
(3/3,6.0)
(3/3,6.0), (2/3,5.0), (3/2,5.0) …
Stopping criteria: score(top-1 result) ≥ score(heap head)
5 candidates checked
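The lazy, heap-driven sweep can be sketched as follows. Both tuple sets are sorted by watf in descending order, the heap holds candidate pairs keyed by an upper bound, and a pair is joined only when it reaches the heap head. The upper bound (sum of the two watf values, as the heap entries in the example suggest), the `joins` predicate and the `score` function are assumptions standing in for the real join and scoring.

```python
import heapq


def skyline_sweep(left, right, joins, score, k=1):
    """left, right: lists of (tuple_id, watf), sorted by watf descending.
    joins(l_id, r_id) -> bool tells whether the two tuples really join.
    score(l_watf, r_watf) -> float must never exceed the upper bound.
    Returns (top-k results, number of candidates checked)."""

    def upper_bound(i, j):
        return left[i][1] + right[j][1]  # assumed monotonic bound

    heap = [(-upper_bound(0, 0), 0, 0)]  # max-heap via negated bounds
    seen = {(0, 0)}
    results, checked = [], 0
    while heap:
        neg_ub, i, j = heap[0]
        # Stop: the k-th result is at least as good as any unseen candidate.
        if len(results) >= k and results[k - 1][0] >= -neg_ub:
            break
        heapq.heappop(heap)
        checked += 1
        (l_id, l_w), (r_id, r_w) = left[i], right[j]
        if joins(l_id, r_id):
            results.append((score(l_w, r_w), l_id, r_id))
            results.sort(reverse=True)
        # Push the two neighbours that this candidate dominates.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-upper_bound(ni, nj), ni, nj))
    return results[:k], checked


movie_q = [("m1", 5), ("m2", 3), ("m3", 2), ("m4", 1)]
director_q = [("d1", 3), ("d2", 2), ("d3", 1)]
# Hypothetical join: only m2 and d1 are connected in the database.
top, checked = skyline_sweep(
    movie_q, director_q,
    joins=lambda m, d: (m, d) == ("m2", "d1"),
    score=lambda a, b: a + b,
)
```

In this toy run the top-1 result is confirmed after checking 4 of the 12 possible pairs, which is the saving the lazy join is meant to achieve.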
Block Pipeline Algorithm
Observation:
• Bound a non-monotonic function using a monotonic one? Not tight in many places!
  1,2,3,4,5,…,9,10
  1,2,8,7,9,…,4,5 ☺
Challenge: how to further reduce search cost?
Block Pipeline Algorithm
Idea: keep tf signatures → a tighter, yet non-monotonic upper bound (UB2)
• UB1 (loose, monotonic): for stopping criteria
• UB2 (tight, non-monotonic): for a better join sequence
Block Pipeline Algorithm: 3 levels of laziness
1. Compute UB1
2. Compute UB2
3. Perform join

Example (watf with tf signature)
MovieQ: 5 (4,1), 3 (3,0), 2 (1,1), 1 (1,0)
DirectorQ: 1 (1,0), 2 (2,0), 3 (1,2)
Heap, successive states:
(5/3,6.8,?)
(5/3,6.8,6.7), (5/2,6.5,?)
(3/3,6.2,?), (5/1,6.2,?), (5/2,6.5,5.8), (3/2,5.8,?) …
(5/3,6.8,6.7) (5/2,6.5,5.8), (3/3,6.2,?)
STOP!
Other Issues
Generalization to multiple CNs
• Initially, push the lower-left points of each CN into the heap, and sort them all together
• Progressively output results
Optimization: join on blocks
Why?
1. Low join selectivity (i.e., too many white points)
2. High database connection overhead
How?
• Evaluate a block of candidates in one SQL!
• Group tuples by: tf signature / local score (watf) / row id
Efficiency: DBLP
• ~ 0.9M tuples in total
• k = 10
• PC 1.8G, 512M
[Figure: query efficiency on DBLP for k = 10, with a 1 sec mark]
References
[Sigmod99] P. J. Haas and J. M. Hellerstein. Ripple joins for online aggregation.
[VLDB01] A. Natsev, Y.-C. Chang, J. R. Smith, C.-S. Li, and J. S. Vitter. Supporting incremental join queries on ranked inputs.
[VLDB02] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases.
[ICDE02-1] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases.
[ICDE02-2] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS.
[VLDB03] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-style keyword search over relational databases.
[VLDB05] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases.
[Sigmod06] F. Liu, C. T. Yu, W. Meng, and A. Chowdhury. Effective keyword search in relational databases.