Top-k Set Similarity Joins. Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang, Univ. of New South Wales, Australia. ICDE ’09. 9 Feb 2011, Taewhi Lee. Based on Chuan Xiao’s presentation slides in ICDE ’09.


Page 1: Top-k Set Similarity Joins

Top-k Set Similarity Joins

Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang
Univ. of New South Wales, Australia
ICDE ’09

9 Feb 2011
Taewhi Lee

Based on Chuan Xiao’s presentation slides in ICDE ’09

Page 2: Top-k Set Similarity Joins

Outline

Introduction
Problem Definition
Existing Approaches
Top-k Similarity Join Algorithms
Experiments

Page 3: Top-k Set Similarity Joins

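Motivation: Data Cleaning

University | City | State | Postal Code
University of New South Wales | Sydney | NSW | 2052
University of Sydney | Sydney | NSW | 2006
University of Melbourne | Melbourne | Victoria | 3010
University of Queensland | Brisbane | Queensland | 4072
University of New South Vales | Sydney | NSW | 2052

The last row is a misspelled near-duplicate of the first ("Vales" for "Wales"), which is the kind of record pair a similarity join should find.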

Page 4: Top-k Set Similarity Joins

More Applications: Near-duplicate Web page detection

Example headline: "Obama Has Busy Final Day Before Taking Office as Bush Says Farewells"
– New York Times, Jan 19th, 2009
– iht.com, Jan 20, 2009


Page 6: Top-k Set Similarity Joins

(Traditional) Set Similarity Join

Each record is tokenized into a set.

Given a collection of records, the set similarity join problem is to find all pairs of records, <x,y>, such that $sim(x,y) \ge t$.

Common similarity functions:
– jaccard: $J(x,y) = \frac{|x \cap y|}{|x \cup y|} \ge t$
– cosine: $C(x,y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}} \ge t$
– dice: $D(x,y) = \frac{2\,|x \cap y|}{|x| + |y|} \ge t$

Example: x = {A,B,C,D,E}, y = {B,C,D,E,F}
– jaccard: 4/6 = 0.67
– cosine: 4/5 = 0.8
– dice: 8/10 = 0.8
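The three similarity functions above can be sketched in a few lines of Python. This is a minimal illustration, assuming records are already tokenized into non-empty Python sets; the function names are chosen here for illustration and do not come from the slides.

```python
import math

def jaccard(x, y):
    # |x ∩ y| / |x ∪ y|
    return len(x & y) / len(x | y)

def cosine(x, y):
    # |x ∩ y| / sqrt(|x| * |y|)
    return len(x & y) / math.sqrt(len(x) * len(y))

def dice(x, y):
    # 2 * |x ∩ y| / (|x| + |y|)
    return 2 * len(x & y) / (len(x) + len(y))

# The example from the slide: the two sets share B, C, D, E.
x = set("ABCDE")
y = set("BCDEF")
print(round(jaccard(x, y), 2), cosine(x, y), dice(x, y))
```

All three functions divide by a set size, so empty records would need to be filtered out or special-cased first.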

What if t is unknown beforehand?

Page 7: Top-k Set Similarity Joins

What If t is Unknown Beforehand?

Example – using jaccard similarity function
– w = {A, B, C, D, E}
– x = {A, B, C, E, F}
– y = {B, C, D, E, F}
– z = {B, C, F, G, H}

– If t = 0.7 → no results
– If t = 0.4 → <w,x>, <w,y>, <x,y>, <x,z>, <y,z> (too many results and long running time)

Return the top-k results ranked by their similarity values
– if k = 1 → <w,x> (tied at 4/6 ≈ 0.67 with <w,y> and <x,y>, so any one of the three is a valid answer)
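The difference between a threshold join and a top-k join on this example can be checked with a brute-force sketch. It compares every pair, so it only illustrates the two problem definitions, not an efficient algorithm; `threshold_join` and `topk_pairs` are names made up for this sketch.

```python
from itertools import combinations
import heapq

def jaccard(x, y):
    return len(x & y) / len(x | y)

records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}

def threshold_join(records, t):
    """All pairs with jaccard >= t, by brute force."""
    return [(a, b) for a, b in combinations(sorted(records), 2)
            if jaccard(records[a], records[b]) >= t]

def topk_pairs(records, k):
    """The k highest-scoring pairs as (sim, a, b), by brute force."""
    scored = ((jaccard(records[a], records[b]), a, b)
              for a, b in combinations(sorted(records), 2))
    return heapq.nlargest(k, scored, key=lambda item: item[0])

print(threshold_join(records, 0.7))   # no pair reaches 0.7
print(threshold_join(records, 0.4))   # five pairs
print(topk_pairs(records, 1))         # one pair with similarity 4/6
```

Which of the tied pairs is returned for k = 1 depends on how ties are broken, which the slides leave open.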

Page 8: Top-k Set Similarity Joins

Top-k Set Similarity Join

Return top-k pairs of records, ranked by similarity scores

Advantages over traditional similarity join
– Without specifying a threshold
– Output results progressively → benefit interactive applications
– Produce most meaningful results under limited resources/time constraints
– Can be stopped at any time, but still guarantee sim(output results) ≥ sim(unseen pairs)


Page 10: Top-k Set Similarity Joins

Straightforward Solution

Start from a certain t, repeat the following steps:
– answer traditional sim-join with t as threshold
– if # of results ≥ k, stop and output k results with highest sim
– else, decrease t

Example (jaccard, k = 2)
– w = {A, B, C, E}
– x = {A, B, C, E, F}
– y = {B, C, D, E, F}
– z = {B, C, F, G, H}

– t = 0.9 → no result
– t = 0.8 → <w,x>
– t = 0.7 → <w,x> (results don’t change!)
– t = 0.6 → <w,x>, <x,y>

Which thresholds shall we enumerate? 0.8, 0.6
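The loop described above can be sketched as follows. The brute-force `threshold_join` is only a stand-in for a real sim-join algorithm, and the start value 0.9 and step 0.1 are assumptions taken from the example rather than a recommended setting.

```python
from itertools import combinations

def jaccard(x, y):
    return len(x & y) / len(x | y)

def threshold_join(records, t):
    """Stand-in for a real sim-join: brute force, returns (sim, a, b) tuples."""
    out = []
    for a, b in combinations(sorted(records), 2):
        s = jaccard(records[a], records[b])
        if s >= t:
            out.append((s, a, b))
    return out

def topk_by_lowering_threshold(records, k, start=0.9, step=0.1):
    tried = []
    t = start
    while True:
        tried.append(t)
        results = threshold_join(records, t)   # full join, repeated at every t
        if len(results) >= k or t <= 0:
            return sorted(results, reverse=True)[:k], tried
        t = round(t - step, 10)  # rounding avoids float drift such as 0.7000000000000001

records = {"w": set("ABCE"), "x": set("ABCEF"), "y": set("BCDEF"), "z": set("BCFGH")}
top, tried = topk_by_lowering_threshold(records, k=2)
print(tried)
print([(a, b) for _, a, b in top])
```

Each iteration reruns the whole join, and the run at t = 0.7 finds nothing new, which is the waste the slide points at.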

Page 11: Top-k Set Similarity Joins

Naïve and Index-Based Algorithms

Naïve Algorithm:
– Compare every pair of objects -> O(n^2) time complexity

Index-based Algorithm [Sarawagi et al. SIGMOD04]:
Record Set → Index Construction → Candidate Generation → Verification → Result Pairs

Inverted lists (token: record_id)
A: w x y
B: x z …
C: y z …

Pairs generated from the inverted lists: <w,x>, <w,y>, <x,y>, <x,z>
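The index-based pipeline (index construction, candidate generation, verification) can be sketched with a dictionary of inverted lists. This is a simplified illustration that indexes every token of every record and is not the algorithm of the cited paper; `index_join` is a hypothetical name, and the records are reused from the earlier jaccard example.

```python
from collections import defaultdict

def jaccard(x, y):
    return len(x & y) / len(x | y)

def index_join(records, t):
    index = defaultdict(list)   # token -> ids of records seen so far that contain it
    results = []
    for rid in sorted(records):
        tokens = records[rid]
        # Candidate generation: every earlier record sharing at least one token.
        candidates = set()
        for token in tokens:
            candidates.update(index[token])
        # Verification: compute the exact similarity of each candidate pair.
        for cid in sorted(candidates):
            if jaccard(records[cid], tokens) >= t:
                results.append((cid, rid))
        # Index construction: add this record to the list of each of its tokens.
        for token in tokens:
            index[token].append(rid)
    return results

records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
print(index_join(records, 0.4))
```

Pairs with no token in common are never compared, but frequent tokens still produce long inverted lists and many candidates.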

Page 12: Top-k Set Similarity Joins

Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]

Sort the tokens by a global ordering
– increasing order of document frequency

Only need to index the first few tokens (prefix) for each record

Example: jaccard, t = 0.8 → |x ∩ y| ≥ 4 if |x| = |y| = 5

x (sorted) = A B | E F G (prefix: A B)
y (sorted) = C D | E F G (prefix: C D)
upper bound O(x,y) = 3 < 4!

Must share at least one token in prefix to be a candidate pair
– For jaccard, prefix length = ⌊|x| * (1 – t)⌋ + 1
– each t is associated with a prefix length
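A minimal sketch of the prefix filter for jaccard follows. It assumes the prefix length is ⌊|x| * (1 – t)⌋ + 1, computed here as |x| – ⌈t * |x|⌉ + 1, and breaks ties in document frequency alphabetically, which is an arbitrary choice made for this sketch. All function names are hypothetical.

```python
import math
from collections import Counter

def prefix_length(size, t):
    # floor(size * (1 - t)) + 1, written with ceil; the epsilon guards float error
    return size - math.ceil(t * size - 1e-9) + 1

def sort_by_document_frequency(records):
    # Global ordering: rare tokens first, ties broken alphabetically.
    df = Counter(token for tokens in records.values() for token in tokens)
    return {rid: sorted(tokens, key=lambda tok: (df[tok], tok))
            for rid, tokens in records.items()}

def may_match(x_sorted, y_sorted, t):
    # A pair can reach threshold t only if the two prefixes share a token.
    px = set(x_sorted[:prefix_length(len(x_sorted), t)])
    py = set(y_sorted[:prefix_length(len(y_sorted), t)])
    return bool(px & py)

records = sort_by_document_frequency({"x": set("ABEFG"), "y": set("CDEFG")})
print(records["x"], records["y"])
print(prefix_length(5, 0.8))
print(may_match(records["x"], records["y"], 0.8))
```

With t = 0.8 the prefixes A B and C D are disjoint, so the pair is pruned without computing its similarity; with a lower threshold such as 0.5 the prefixes grow to three tokens and meet at E.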


Page 14: Top-k Set Similarity Joins

Necessary Thresholds

Each prefix is associated with a threshold
– the maximum possible similarity a record can achieve with other records

x = A B C
t = 1.0 0.8 0.6

Thresholds per prefix length, for records of different sizes:
x: 1.0 0.75 0.5 0.25
y: 1.0 0.8 0.6 0.4 0.2
z: 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
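For jaccard, the values listed above are consistent with the bound (|x| – p + 1) / |x| for a record x whose prefix has length p: a pair that first meets at position p can share at most the remaining |x| – p + 1 tokens, and the union has at least |x| tokens. The sketch below enumerates these bounds; the formula is inferred from the numbers on the slide, and `prefix_bounds` is a name invented here.

```python
def prefix_bounds(size):
    """Upper bound on jaccard for each prefix length p = 1..size of a record."""
    return [(size - p + 1) / size for p in range(1, size + 1)]

for size in (4, 5, 10):
    print(size, prefix_bounds(size))
```

Only these values can change the result of a threshold join for a given record, which is why thresholds such as 0.7 in the earlier example need not be tried.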

Page 15: Top-k Set Similarity Joins

Event-driven Model

Problem: repeated invocation of sim-join algorithm
– t is decreasing → run sim-join algorithm in an incremental way

Prefix Event <x, A, t>
– Initialize prefix length for each record as 1 → <x, A, 1.0>
– For each prefix event
  – Probe the inverted list of the token for candidate pairs, verify the candidate pairs, and insert them into temp results
  – Insert x into A’s inverted list
  – Extend prefix by one token
– Maintain prefix events with a max-heap on t
– Stop when t ≤ k-th temp result’s similarity
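The event-driven model can be sketched with two heaps: a max-heap of prefix events and a min-heap holding the current k best pairs. This is a simplified illustration for jaccard only, without the verification and indexing optimizations; it assumes each record is a token list already sorted by the global ordering (alphabetical here), and it remembers every verified pair in a set. The name `topk_join` and the data layout are choices made for this sketch.

```python
import heapq
from collections import defaultdict

def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def topk_join(records, k):
    """records: dict id -> token list sorted by the global ordering."""
    index = defaultdict(list)   # token -> ids whose prefix already contains it
    verified = set()            # pairs whose similarity was already computed
    top = []                    # min-heap of (sim, pair), at most k entries
    # One prefix event per record; heapq is a min-heap, so bounds are negated.
    events = [(-1.0, rid, 0) for rid in records if records[rid]]
    heapq.heapify(events)
    while events:
        neg_bound, rid, pos = heapq.heappop(events)
        # Stop when t <= k-th temp result's similarity.
        if len(top) == k and -neg_bound <= top[0][0]:
            break
        tokens = records[rid]
        token = tokens[pos]
        # Probe the inverted list of the token and verify new candidates.
        for other in index[token]:
            pair = tuple(sorted((other, rid)))
            if pair in verified:
                continue
            verified.add(pair)
            sim = jaccard(tokens, records[other])
            if len(top) < k:
                heapq.heappush(top, (sim, pair))
            elif sim > top[0][0]:
                heapq.heapreplace(top, (sim, pair))
        # Insert the record into the token's inverted list.
        index[token].append(rid)
        # Extend the prefix by one token; its bound is (n - p + 1) / n.
        if pos + 1 < len(tokens):
            n = len(tokens)
            heapq.heappush(events, (-(n - pos - 1) / n, rid, pos + 1))
    return sorted(top, reverse=True)

records = {"w": list("ABCE"), "x": list("ABCEF"),
           "y": list("BCDEF"), "z": list("BCFGH")}
print(topk_join(records, 2))
```

On these four records with k = 2 the loop stops when the largest remaining event bound is 0.6, which is below the second-best similarity found so far.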

Page 16: Top-k Set Similarity Joins

Topk-join - Example (jaccard, k=2)

Records
w = A B C E
x = A B C E F
y = B C D E F
z = B C F G H

Inverted list (token: record_id)
A: w x
B: y z x w
C: y z

Prefix events
<x, B, 0.8>
<y, C, 0.8>
<z, C, 0.8>
<w, B, 0.75>

Temporary results
(w,x) = 0.8
(y,z) = 0.43
(x,y) = 0.67

Some pairs are verified twice!

Stop: t = 0.6 ≤ 2nd temp result’s sim

Page 17: Top-k Set Similarity Joins

Optimizations - Verification

In the above example, (w,x) and (y,z) have been verified twice

How to avoid repeated verification?
– Memorize all verified pairs with a hash table → too much memory consumption
– Check if this pair will be identified again when it is verified for the first time
– Keep only those that will be identified again before the algorithm stops
– Guarantee no pair will be verified twice

Example
x = A B D E F
y = A C D E F
t = 1.0 0.8 0.6
if k-th temp result’s sim = 0.7 → won’t be identified again!
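One way to read the check above: after a pair is verified at a shared token, look up the next token both records share and compare its prefix bound with the current k-th similarity. The sketch below is a simplified version of that idea for jaccard, assuming both token lists follow the same global ordering; it is not claimed to be the exact rule used in the paper, and the function names are hypothetical.

```python
def prefix_bound(size, p):
    # Upper bound on jaccard when the prefix has length p (1-indexed).
    return (size - p + 1) / size

def will_be_identified_again(x, y, token, kth_sim):
    """x, y: sorted token lists that share `token`; kth_sim: current k-th similarity."""
    y_pos = {tok: i + 1 for i, tok in enumerate(y)}   # 1-indexed positions in y
    for i in range(x.index(token) + 1, len(x)):
        tok = x[i]
        if tok in y_pos:
            # The pair meets again only after both prefixes reach this token.
            bound = min(prefix_bound(len(x), i + 1),
                        prefix_bound(len(y), y_pos[tok]))
            return bound > kth_sim
    return False   # no further shared token

x = list("ABDEF")
y = list("ACDEF")
print(will_be_identified_again(x, y, "A", 0.7))
print(will_be_identified_again(x, y, "A", 0.5))
```

Because the k-th similarity never decreases, a pair for which this returns False does not need to be stored; only pairs that may come up again have to be memorized.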

Page 18: Top-k Set Similarity Joins

Optimizations - Indexing

How to reduce inverted list size to save memory?
– t is decreasing → calculate the upper bound of similarity for future probings into inverted lists
– Don’t insert into inverted list if upper bound ≤ k-th temp result’s similarity

Example
x = A C D E F
y = B C D E F
t = 0.8 → max. similarity = 4/6 = 0.67
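The 4/6 in this example can be reproduced with a simple bound. Let s be the prefix bound of the event being processed. Any record y that probes this entry later has a prefix bound of at most s, so if the indexed token is the first token the two records share, the overlap is at most s·|x| and at most s·|y|, and jaccard is at most s / (2 – s). The sketch below applies that formula; it is a derivation made for this illustration under the stated assumption, not necessarily the exact bound used in the paper.

```python
def prefix_bound(size, p):
    # Upper bound on jaccard when the prefix has length p (1-indexed).
    return (size - p + 1) / size

def index_upper_bound(size, p):
    """Bound on jaccard with any record that probes this index entry later."""
    s = prefix_bound(size, p)
    return s / (2 - s)

def should_index(size, p, kth_sim):
    # Don't insert if the upper bound <= k-th temp result's similarity.
    return index_upper_bound(size, p) > kth_sim

print(round(index_upper_bound(5, 2), 2))   # record of 5 tokens, second prefix token
print(should_index(5, 2, 0.7))
print(should_index(5, 2, 0.5))
```

Skipping the insertion is safe only because the k-th similarity can never drop later, so a pair bounded below it now can never enter the result.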


Page 20: Top-k Set Similarity Joins

Experiment Settings

Algorithms
– topk-join
– pptopk: modified ppjoin [Xiao et al. WWW08], a prefix-filter based approach, with t = 0.95, 0.90, 0.85...

Measure
– Compare topk-join and pptopk (candidate size, running time)
– Output results progressively

Dataset

dataset | # of records | avg. record size
DBLP (author, title) | 855k | 14.0
TREC (author, title, abstract) | 348k | 130.1
TREC-3GRAM | 348k | 868.5
UNIREF-3GRAM (protein seq.) | 500k | 372.9

Page 21: Top-k Set Similarity Joins

Experiment Results


Page 22: Top-k Set Similarity Joins

Experiment Results


Page 23: Top-k Set Similarity Joins

Experiment Results


Page 24: Top-k Set Similarity Joins

Thank You! Any questions or comments?

Page 25: Top-k Set Similarity Joins

Related Work

Index-based approaches
– S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004
– C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008

Prefix-based approaches
– S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006
– R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007
– C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008

PartEnum
– A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006