Improving Similarity Join Algorithms using Vertical Clustering Techniques Lisa Tan Department of Computer Science Computing & Information Technology Wayne State University Sept. 15, 2009


Improving Similarity Join Algorithms using Vertical Clustering Techniques

Lisa Tan
Department of Computer Science
Computing & Information Technology
Wayne State University

Sept. 15, 2009


Reasons for Using Similarity Join

Correlate data from different data sources (e.g., data integration)

Data is often dirty (e.g., typing mistakes)

Abbreviated, incomplete, or missing information

Differences in information “formatting” due to the lack of standard conventions (e.g., for addresses)


Example

Table R

Name            Addr           Phone
Jack Lemmon     Maple St.      430-871-8294
Harrison Ford   Culver Blvd    292-918-2913
Tom Hanks       Main St.       2340762-1234
……              ……             ……

Table S

Name            Addr           Phone
Ton Hanks       Main Street    234-162-1234
Kevin Spacey    Frost Blvd     928-184-2813
Jack Lemon      Maple Street   430-817-8294
……              ……             ……

Find records from different datasets that could be the same entity.


Experimental Results – Natural Join

[Figure: "Effect of Threshold". x-axis: Threshold (0, 0.033, 0.066, 0.1); y-axis: Precision (0 to 0.35); series: Single Equi Join, Single Join with Like.]


Experimental Results – Similarity Join

[Figure: "Effect of Threshold". x-axis: Threshold (0, 0.033, 0.066, 0.1); y-axis: Precision (0 to 1); series: Single Equi Join, Single Join with Like, Edit Distance on Clustering Fields.]


Problem Statement for Similarity Join

Given a string S, called the source, and another string T, called the target.

Allowing a defined number of errors in the join, the similarity join verifies whether or not the two strings represent the same real-world entity, according to a chosen similarity measure.


Sample Applications

1. Finding matching DNA subsequences even after mutations have occurred.

2. Signal recovery for transmissions over noisy lines.

3. Searching for spelling/typing errors and finding possible corrections.

4. Handwriting recognition, virus and intrusion detection.


General Approaches

Similarity join attracts different research communities: statistics, artificial intelligence, and databases.

Statistics refers to similarity join as probabilistic record linkage, aimed at minimizing the probability of misclassification.

Artificial intelligence uses supervised learning to learn the parameters of string edit distance metrics.

The database community uses a knowledge-intensive approach, with edit distance as a general record-matching scheme.


General Algorithms in the Database Area

All the algorithms focus on edit distance:

Dynamic Programming Algorithms

Automata Algorithms

Bit-Parallelism Algorithms

Filtering Algorithms


Comments on Existing Methods

All of the algorithms proposed above are based on the generic edit distance function.

Some improve the speed of the dynamic programming method.

Some apply filtering techniques that avoid expensive comparisons in large parts of the queried sequence.

Current similarity join algorithms assume that the join conditions are known and do not consider relevant fields in their join conditions.

Although there have been many efforts for efficient string similarity join, there is still room for improvement.


Outline

Motivation

Pre-experimental Results

Proposed Approach

Identify Clustered Join Attributes

Experimental Results

Conclusion


Research Goal

Identifying the same real-world entities from multiple heterogeneous databases


Motivation of Clustering Concept

Current similarity join algorithms do not consider relevant-field concepts.

The clustering concept fits relevant-field concepts well.


Pre-experimental Results

[Figure: "Effect of Threshold". x-axis: Threshold (1 to 6); y-axis: 0 to 1; series: ED on name; ED on name and address; ED on name, address and telephone; ED on name and telephone.]


Proposed Approach

Our proposed approach takes clustered related attributes into consideration.

Question: how do we identify clustered join attributes?


Clustering Algorithm

The rationale behind the clustering is to produce fragments, groups of attribute columns that are closely related.


Identify Clustered Related Attributes

Pre-knowledge of applications on data

• Attribute usage information

Calculate attribute affinities

Calculate clustered affinities

Use the Bond Energy Algorithm (BEA) approach to regroup affinity values

Apply the split approach to find clustered related attributes


Clustered Approach - Diagram

[Diagram: Logical Accesses → Computation of Affinities → Attribute Affinity Matrix → Clustering → Clustered attribute affinity matrix → Split Approach → Group of clustered related attributes]


Clustered Approach (cont'd)

Attribute Usage

$use(q_k, A_j) = 1$ if attribute $A_j$ is referenced by application $q_k$, and $0$ otherwise.

Attribute Affinity

$aff(A_i, A_j) = \sum_{k \,\mid\, use(q_k, A_i) = 1 \,\wedge\, use(q_k, A_j) = 1} acc(q_k)$

where $acc(q_k)$ is the access frequency of application $q_k$.

Cluster Affinity

Find a permutation of the attributes that maximizes the global affinity measure. This groups large affinity values with large affinity values and small affinity values with small affinity values.
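As a rough illustration of the attribute affinity computation, the sketch below builds the affinity matrix from a usage matrix and per-application access frequencies. The applications, the usage matrix, and the access counts are invented for illustration and are not taken from the slides.

```python
from typing import List

def attribute_affinity(use: List[List[int]], acc: List[int]) -> List[List[int]]:
    """aff[i][j] = sum of acc[k] over applications k that reference both A_i and A_j."""
    n_attrs = len(use[0])
    aff = [[0] * n_attrs for _ in range(n_attrs)]
    for k, row in enumerate(use):
        for i in range(n_attrs):
            for j in range(n_attrs):
                if row[i] == 1 and row[j] == 1:
                    aff[i][j] += acc[k]
    return aff

# Hypothetical workload: 3 applications over attributes (Name, Addr, Phone).
use = [
    [1, 1, 0],  # q1 references Name and Addr
    [1, 0, 1],  # q2 references Name and Phone
    [0, 1, 0],  # q3 references Addr only
]
acc = [10, 5, 20]  # assumed access frequencies of q1..q3

aff = attribute_affinity(use, acc)
for row in aff:
    print(row)
```

The resulting matrix is symmetric, and each diagonal entry is the total access frequency of the applications that touch that attribute. A clustering step such as BEA would then permute its rows and columns.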


Clustered Approach - Example


Split Approach

Split based on access model

$SQ = af(VF_1) \cdot af(VF_2) - af(VF_1, VF_2)^2$

where $af(VF_i)$ stands for the access frequency for vertical fragment $VF_i$, and $af(VF_i, VF_j)$ stands for the access frequency for queries having at least one attribute in each of the vertical fragments $VF_i$ and $VF_j$.
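A minimal sketch of how a split point might be chosen, assuming the split quality has the form SQ = af(VF1) * af(VF2) - af(VF1, VF2)^2, where af(VFi) sums the frequencies of queries that touch only fragment i and af(VF1, VF2) sums the frequencies of queries that touch both. That reading, the function names, and the workload below are assumptions for illustration. The numbers are not the ones from the slides.

```python
from typing import List, Sequence, Set, Tuple

Query = Tuple[Set[str], int]  # (attributes referenced, access frequency)

def split_quality(top: Set[str], bottom: Set[str], queries: Sequence[Query]) -> int:
    """SQ = af(top) * af(bottom) - af(top, bottom)**2 for one candidate split (assumed form)."""
    af_top = af_bottom = af_both = 0
    for attrs, freq in queries:
        in_top = bool(attrs & top)
        in_bottom = bool(attrs & bottom)
        if in_top and in_bottom:
            af_both += freq      # query needs attributes from both fragments
        elif in_top:
            af_top += freq       # query is served by the top fragment alone
        elif in_bottom:
            af_bottom += freq    # query is served by the bottom fragment alone
    return af_top * af_bottom - af_both ** 2

def best_split(ordered: List[str], queries: Sequence[Query]) -> Tuple[List[str], List[str], int]:
    """Try every cut along the clustered attribute order and keep the highest SQ."""
    candidates = []
    for cut in range(1, len(ordered)):
        top, bottom = ordered[:cut], ordered[cut:]
        candidates.append((top, bottom, split_quality(set(top), set(bottom), queries)))
    return max(candidates, key=lambda c: c[2])

# Hypothetical workload over an assumed clustered attribute order.
queries = [({"Address"}, 20), ({"Name", "Phone"}, 30), ({"Birthday", "Name"}, 5)]
ordered = ["Address", "Birthday", "Name", "Phone"]
top, bottom, sq = best_split(ordered, queries)
print(top, bottom, sq)
```

Under this form, a split scores highly when many queries are served by a single fragment and few queries span both.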


Split Approach (cont'd)

In Table 3, for the first possible split, {Address} and {Birthday, Name, Phone}, SQ = 25*35;

for the second possible split, {Address, Birthday} and {Name, Phone}, SQ = -(30+35);

for the third possible split, {Address, Birthday, Name} and {Phone}, SQ = -(35+35).


Existing Similarity Join Techniques

Edit Distance

Q-gram


Similarity Join – Edit Distance

A widely used metric to define string similarity

ED(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2

Example:

s1: surgery

s2: survey

ED(s1, s2) = 2


Dynamic Programming Algorithm

This is the oldest algorithm. It answers the question: how do we compute ed(x, y)?

Take a matrix $C_{0..|x|,\,0..|y|}$, where $C_{i,j}$ is the minimum number of operations needed to match the prefix $x_1 \ldots x_i$ to the prefix $y_1 \ldots y_j$.

This is calculated as follows:

• $C_{i,0} = i$

• $C_{0,j} = j$

• If $x_i = y_j$, then $C_{i,j} = C_{i-1,j-1}$

• Otherwise, $C_{i,j} = 1 + \min(C_{i-1,j},\ C_{i,j-1},\ C_{i-1,j-1})$

O(mn) complexity, where m = |x| and n = |y|.
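The recurrence can be sketched directly in Python. This is a plain full-matrix version written for readability. The function name is an assumption, and none of the speed-ups mentioned elsewhere in the talk are applied.

```python
def edit_distance(x: str, y: str) -> int:
    """Full-matrix dynamic programming edit distance (insert, delete, substitute)."""
    m, n = len(x), len(y)
    # C[i][j] = minimum operations to turn x[:i] into y[:j]
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i  # delete all i characters of x[:i]
    for j in range(n + 1):
        C[0][j] = j  # insert all j characters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j], C[i][j - 1], C[i - 1][j - 1])
    return C[m][n]

print(edit_distance("surgery", "survey"))  # 2
```

The two nested loops visit each of the (m + 1) * (n + 1) cells once, which is where the O(mn) cost comes from.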


Matrix Example

Edit Distance


Similarity Join - Qgram

Qgram Roadmap:

- break strings into substrings of length q

- perform an exact join on the q-grams

- find candidate string pairs based on the results

- check only candidate pairs with a UDF to obtain final answer


Similarity Join – Q-gram

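A positional q-gram is a pair (position, substring), generated as follows:

Add new characters # and % as padding: q - 1 copies of # before s and q - 1 copies of % after s

Slide a window of length q over the padded string

This generates |s| + q - 1 substrings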


Q-gram Technique (cont’d)

Rationale: when two strings s and t are within a small edit distance of each other, they share a large number of q-grams in common.

Advantage: it can be built on top of relational databases, with an augmented table created on the fly.


Similarity Join - Qgram

Issue with Qgram: it does not work well on large datasets

Resolution to the issue:

- clean the data first by using an exact join

- create a table to hold the mismatched data

- apply the Qgram on the new temp table
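One way to picture the resolution steps above is the sketch below: exact matches are removed first, and the expensive approximate comparison runs only on what is left. The function name, the sample values, and the stand-in `similar` predicate are assumptions for illustration. In practice the predicate would be the q-gram or edit distance test.

```python
from typing import Callable, List, Set, Tuple

def prefilter_then_match(
    r: List[str],
    s: List[str],
    similar: Callable[[str, str], bool],
) -> Tuple[Set[str], List[Tuple[str, str]]]:
    """Exact join first, then approximate matching on the mismatched remainder."""
    exact = set(r) & set(s)                     # step 1: exact join
    r_rest = [v for v in r if v not in exact]   # step 2: "temp tables" of mismatches
    s_rest = [v for v in s if v not in exact]
    approx = [(a, b) for a in r_rest for b in s_rest if similar(a, b)]  # step 3
    return exact, approx

# Hypothetical data and a deliberately crude stand-in predicate.
r = ["Harrison Ford", "Tom Hanks", "Jack Lemmon"]
s = ["Harrison Ford", "Ton Hanks", "Jack Lemon"]
crude = lambda a, b: a[0] == b[0] and abs(len(a) - len(b)) <= 1
exact, approx = prefilter_then_match(r, s, crude)
print(exact, approx)
```

The saving comes from shrinking both inputs before the quadratic comparison, so it depends on how many records match exactly.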


Similarity Join – Q-gram (continued)

For a string john smith: { (1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ), (6,n s), (7, sm), (8,smi), (9,mit), (10,ith), (11,th%), (12,h%%)} with q=3

For a string john a smith: {(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ), (6,n a), (7, a ), (8,a s), (9, sm), (10,smi), (11,mit), (12,ith), (13,th%), (14,h%%)} with q=3
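The two listings above can be reproduced with a short sketch. The function name is an assumption. The padding characters # and % and the 1-based positions follow the slides.

```python
from typing import List, Tuple

def positional_qgrams(s: str, q: int = 3) -> List[Tuple[int, str]]:
    """Return (position, q-gram) pairs for s, padded with q-1 '#' and q-1 '%'."""
    padded = "#" * (q - 1) + s + "%" * (q - 1)
    # A window of length q over the padded string yields len(s) + q - 1 grams.
    return [(i + 1, padded[i:i + q]) for i in range(len(s) + q - 1)]

grams = positional_qgrams("john smith", q=3)
print(len(grams))   # 12
print(grams[:3])    # [(1, '##j'), (2, '#jo'), (3, 'joh')]
```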


Sample SQL Expression

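-- R1AiQ and R2AjQ are the auxiliary positional q-gram tables for R1.Ai and R2.Aj
SELECT R1.A0, R2.A0, R1.Ai, R2.Aj
FROM R1, R1AiQ, R2, R2AjQ
WHERE R1.A0 = R1AiQ.A0 AND
      R2.A0 = R2AjQ.A0 AND
      R1AiQ.Qgram = R2AjQ.Qgram AND
      -- position filter: matching q-grams may be at most k positions apart
      ABS(R1AiQ.Pos - R2AjQ.Pos) <= k AND
      -- length filter: string lengths may differ by at most k
      ABS(strlen(R1.Ai) - strlen(R2.Aj)) <= k
GROUP BY R1.A0, R2.A0, R1.Ai, R2.Aj
-- count filter, then verification of the surviving pairs with the edit distance UDF
HAVING COUNT(*) >= strlen(R1.Ai) - 1 - (k-1)*q AND
       COUNT(*) >= strlen(R2.Aj) - 1 - (k-1)*q AND
       Edit_distance(R1.Ai, R2.Aj, k)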
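To make the three filters in the query concrete, here is a small in-memory sketch of the same candidate test for a single pair of strings. The function names are assumptions. A pair that passes is only a candidate and would still need verification with the edit distance check, as in the last HAVING condition.

```python
from typing import List, Tuple

def positional_qgrams(s: str, q: int = 3) -> List[Tuple[int, str]]:
    """(position, q-gram) pairs for s, padded with q-1 '#' and q-1 '%'."""
    padded = "#" * (q - 1) + s + "%" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(s) + q - 1)]

def is_candidate(s1: str, s2: str, k: int, q: int = 3) -> bool:
    """Apply the length, position, and count filters for edit distance threshold k."""
    # Length filter: lengths may differ by at most k.
    if abs(len(s1) - len(s2)) > k:
        return False
    g1, g2 = positional_qgrams(s1, q), positional_qgrams(s2, q)
    # Position filter: count equal q-grams whose positions differ by at most k.
    matches = sum(1 for p1, a in g1 for p2, b in g2 if a == b and abs(p1 - p2) <= k)
    # Count filter: both thresholds from the HAVING clause must hold.
    return (matches >= len(s1) - 1 - (k - 1) * q
            and matches >= len(s2) - 1 - (k - 1) * q)

print(is_candidate("john smith", "john a smith", k=2))   # True
print(is_candidate("john smith", "kevin spacey", k=2))   # False
```

The nested loop mirrors the join on `Qgram` in the query. A database would use an index or hash join there rather than comparing every pair of q-grams.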


Precision, Recall and F-measure

Precision is defined as the number of true positives divided by the sum of true positives and false positives: TP / (TP + FP)

Recall is defined as the number of true positives divided by the sum of true positives and false negatives: TP / (TP + FN)

F-measure is defined as the harmonic mean of precision and recall:

F = 2 * (precision * recall) / (precision + recall)
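A minimal sketch of how these three measures could be computed for a join result, assuming the result and the ground truth are both available as sets of matched record-id pairs. The pair identifiers below are invented for illustration.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]

def precision_recall_f(predicted: Set[Pair], truth: Set[Pair]) -> Tuple[float, float, float]:
    """Precision, recall, and F-measure of predicted matches against true matches."""
    tp = len(predicted & truth)   # pairs reported and correct
    fp = len(predicted - truth)   # pairs reported but wrong
    fn = len(truth - predicted)   # true pairs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f

# Hypothetical join output versus hypothetical ground truth.
predicted = {("r1", "s1"), ("r2", "s2"), ("r3", "s9")}
truth = {("r1", "s1"), ("r2", "s2"), ("r4", "s4"), ("r5", "s5")}
p, r, f = precision_recall_f(predicted, truth)
print(round(p, 3), round(r, 3), round(f, 3))
```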


Experimental Results

Known join attributes vs clustered join attributes on Precision

[Figure: "Effect of Threshold". x-axis: Threshold (2 to 12); y-axis: Precision (0 to 1); series: ED on name; ED on name and address; ED on name and birth; ED on name and telephone; ED on name, address and telephone; ED on name, birth, telephone; ED on name, birth, address and telephone.]


Experimental Results

Known join attributes vs clustered join attributes on Recall

[Figure: "Effect of Threshold". x-axis: Threshold (2 to 12); y-axis: 0 to 1; series: ED on name; ED on name and address; ED on name and birth; ED on name and telephone; ED on name, address and telephone; ED Recall on name, birth, telephone; ED on name, birth, address and telephone.]


Experimental Results

ED vs. Qgram

[Figure: "Effect of Affinity Fields". x-axis: Threshold (2 to 12); y-axis: Precision (0 to 1); series: ED on Affinity Fields, Q-gram on Affinity Clustered Fields.]


Experimental Results

ED vs Qgram on Recall

[Figure: "Effect of Affinity Fields". x-axis: Threshold (2 to 12); y-axis: Recall (0 to 1); series: ED on Affinity Clustered Fields, Q-gram on Affinity Clustered Fields.]


Experimental Results

ED vs Qgram on F-measure

[Figure: "Effect of Threshold". x-axis: Threshold (2 to 12); y-axis: 0 to 0.9; series: F-measure on ED, F-measure on Q-gram.]


Conclusion

Proposed a pre-processing approach to improve existing similarity join techniques

Experimental results showed improvement of ED by about 5% and Q-gram by about 15%


Future Work

Potential further work:

work on alternative clustering methods

increase the size of the datasets

add some pre- and post-filtering abilities

…


Publications

Lisa Tan, Farshad Fotouhi, and William Grosky, "Improving Similarity Join Algorithms using Vertical Clustering Techniques," ICADIWT 2009, pp. 491-496.

"Improving Similarity Join Algorithm Using Fuzzy Clustering Techniques" has been accepted by the ICDM-09 Workshop on Mining Multiple Information Sources (MMIS).


Thank You!

Lisa Tan

Co-Authors

Dr. Farshad Fotouhi

Dr. William Grosky

Acknowledgement

Dr. Farshad Fotouhi, Dr. William Grosky, and Computing & Information Technology


Question


Wayne State University - Facts

30th largest university in nation

Top 50 in NSF public rankings

Over 33,300 students

Over 350 undergraduate/graduate degree programs in 12 Schools and Colleges
