Improving Similarity Join Algorithms using Vertical Clustering Techniques
Lisa Tan
Department of Computer Science
Computing & Information Technology, Wayne State University
Sept. 15, 2009
Reasons for Using Similarity Join
Correlate data from different data sources (e.g., data integration)
Data is often dirty (e.g., typing mistakes)
Abbreviated, incomplete, or missing information
Differences in information "formatting" due to the lack of standard conventions (e.g., for addresses)
Example
Table R:
  Name            Addr          Phone
  Jack Lemmon     Maple St.     430-871-8294
  Harrison Ford   Culver Blvd   292-918-2913
  Tom Hanks       Main St.      2340762-1234
  ...             ...           ...

Table S:
  Name            Addr          Phone
  Ton Hanks       Main Street   234-162-1234
  Kevin Spacey    Frost Blvd    928-184-2813
  Jack Lemon      Maple Street  430-817-8294
  ...             ...           ...
Find records from different datasets that could be the same entity.
Experimental Results – Natural Join
Effect of Threshold
[Chart: precision (0–0.35) vs. threshold (0–0.1) for Single Equi Join and Single Join with LIKE]
Experimental Results – Similarity Join
Effect of Threshold
[Chart: precision (0–1) vs. threshold (0–0.1) for Single Equi Join, Single Join with LIKE, and Edit Distance on Clustering Fields]
Problem Statement for Similarity Join
Given a source string S and a target string T, and allowing a defined number of errors in the join, the similarity join verifies whether the two strings represent the same real-world entity.
Sample Applications
1. Finding matching DNA subsequences even after mutations have occurred.
2. Signal recovery for transmissions over noisy lines.
3. Searching for spelling/typing errors and finding possible corrections.
4. Handwriting recognition, virus and intrusion detection.
General Approaches
The problem attracts different research communities: statistics, artificial intelligence, and databases.
Statistics treats similarity join as probabilistic record linkage, aimed at minimizing the probability of misclassification.
Artificial intelligence uses supervised learning to learn the parameters of string edit distance metrics.
The database community uses a knowledge-intensive approach, with edit distance as a general record-matching scheme.
General Algorithms in the Database Area
All the algorithms focus on edit distance:
Dynamic programming algorithms
Automata algorithms
Bit-parallelism algorithms
Filtering algorithms
Comments on Existing Methods
All of the algorithms above are based on the generic edit distance function.
Some improve the speed of the dynamic programming method.
Some apply filtering techniques that avoid expensive comparisons over large parts of the queried sequence.
Current similarity join algorithms assume that the join conditions are known and do not consider relevant fields in their join conditions.
Although there have been many efforts toward efficient string similarity join, there is still room for improvement.
Outline
Motivation
Pre-experimental Results
Proposed Approach
Identify Clustered Join Attributes
Experimental Results
Conclusion
Research Goal
Identifying the same real-world entities from multiple heterogeneous databases
Motivation of Clustering Concept
Current similarity algorithms do not consider relevant field concepts.
The clustering concept fits well with relevant field concepts.
Pre-experimental Results
Effect of Threshold
[Chart: precision (0–1) vs. threshold (1–6) for ED on: name; name and address; name, address, and telephone; name and telephone]
Proposed Approach
Our proposed approach takes clustered related attributes into consideration.
Question: how to identify clustered join attributes?
Clustering Algorithm
The rationale behind clustering is to produce fragments: groups of attribute columns that are closely related.
Identify Clustered Related Attributes
Pre-knowledge of applications on the data: attribute usage information
Calculate attribute affinities
Calculate clustered affinities: use the Bond Energy Algorithm (BEA) to regroup affinity values
Apply the split approach to find clustered related attributes
Clustered Approach – Diagram
Logical accesses → computation of affinities → attribute affinity matrix → clustering → clustered attribute affinity matrix → split approach → group of clustered related attributes
Clustered Approach – Cont'd
Attribute usage:
  use(qk, Aj) = 1 if attribute Aj is referenced by application qk, 0 otherwise
Attribute affinity:
  aff(Ai, Aj) = sum of acc(qk) over all applications qk with use(qk, Ai) = 1 and use(qk, Aj) = 1,
  where acc(qk) is the access frequency of application qk
Cluster affinity: a permutation that maximizes the global affinity measure, grouping large affinity values with large-affinity attributes and small affinity values with small-affinity attributes.
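A minimal sketch of the attribute-affinity computation described above. The usage matrix and access frequencies here are hypothetical illustration data, not the paper's actual tables:

```python
# use[k][j] = 1 if application qk references attribute Aj, else 0
use = [
    [1, 0, 1, 0],   # q1 references A1 and A3
    [1, 1, 0, 0],   # q2 references A1 and A2
    [0, 0, 1, 1],   # q3 references A3 and A4
]
acc = [45, 5, 75]   # acc(qk): hypothetical access frequency of each application

n_attrs = len(use[0])
# aff(Ai, Aj) = sum of acc(qk) over all qk that reference both Ai and Aj
aff = [[sum(acc[k] for k in range(len(use)) if use[k][i] and use[k][j])
        for j in range(n_attrs)]
       for i in range(n_attrs)]

print(aff[0][0])  # 50: q1 and q2 both reference A1
print(aff[0][2])  # 45: only q1 references both A1 and A3
```

The resulting matrix is symmetric; the BEA step would then permute its rows and columns to cluster large affinity values together.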
Clustered Approach - Example
Split Approach
Split based on the access model:
  SQ = af(VF1) * af(VF2) - af(VF1, VF2)^2
where af(VFi) is the access frequency of queries that access only vertical fragment VFi, and af(VF1, VF2) is the access frequency of queries that access at least one attribute in each vertical fragment.
Split Approach – Cont'd
Based on Table 3:
For the first possible split, {Address} and {Birthday, Name, Phone}: SQ = 25 * 35
For the second possible split, {Address, Birthday} and {Name, Phone}: SQ = -(30 + 35)
For the third possible split, {Address, Birthday, Name} and {Phone}: SQ = -(35 + 35)
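The split-quality formula can be computed directly. The frequencies in the usage line below are hypothetical, since Table 3 is not reproduced in this transcript:

```python
def split_quality(af_vf1: int, af_vf2: int, af_both: int) -> int:
    """SQ = af(VF1) * af(VF2) - af(VF1, VF2)^2: large and positive when
    most queries touch only one fragment, negative when many span both."""
    return af_vf1 * af_vf2 - af_both ** 2

# Hypothetical frequencies: 25 accesses hit only VF1, 35 hit only VF2,
# and no query spans both fragments.
print(split_quality(25, 35, 0))   # 875
```

The split maximizing SQ is the one chosen by the split approach.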
Existing Similarity Join Techniques
Edit Distance
Q-gram
Similarity Join – Edit Distance
A widely used metric to define string similarity.
ED(s1, s2) = minimum number of operations (insertion, deletion, substitution) to change s1 into s2.
Example:
  s1: surgery
  s2: survey
  ED(s1, s2) = 2
Dynamic Programming Algorithm
This is the oldest algorithm. It answers the question: how do we compute ED(x, y)?
Take a matrix C[0..|x|, 0..|y|] where C[i,j] is the minimum number of operations to match x1..xi to y1..yj.
This is calculated as follows:
  C[i,0] = i
  C[0,j] = j
  If xi = yj, then C[i,j] = C[i-1,j-1]
  Otherwise, C[i,j] = 1 + min(C[i-1,j], C[i,j-1], C[i-1,j-1])
O(mn) complexity.
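The recurrence above translates directly into code; a minimal sketch:

```python
def edit_distance(x: str, y: str) -> int:
    """Dynamic-programming edit distance: C[i][j] is the minimum number
    of operations (insert, delete, substitute) to turn x[:i] into y[:j]."""
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i          # delete all of x[:i]
    for j in range(n + 1):
        C[0][j] = j          # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                C[i][j] = C[i - 1][j - 1]            # characters match
            else:
                C[i][j] = 1 + min(C[i - 1][j],       # deletion
                                  C[i][j - 1],       # insertion
                                  C[i - 1][j - 1])   # substitution
    return C[m][n]

print(edit_distance("surgery", "survey"))  # 2, as in the slide example
```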
Matrix Example – Edit Distance
[Matrix: example edit distance computation table]
Similarity Join - Qgram
Q-gram roadmap:
- Break strings into substrings of length q
- Perform an exact join on the q-grams
- Find candidate string pairs based on the results
- Check only candidate pairs with a UDF to obtain the final answer
Similarity Join – Q-gram
A positional q-gram is a (position, substring) pair generated as follows:
- Pad the string s with the new characters # (at the start) and % (at the end)
- Slide a window of length q over the padded string
- This generates |s| + q - 1 substrings
Q-gram Technique (cont’d)
Rationale: when two strings s and t are within a small edit distance of each other, they share a large number of q-grams in common.
Advantage: builds on top of relational databases, with an augmented table created on the fly.
Similarity Join – Q-gram
Issue with the q-gram technique: it does not scale well to large datasets.
Resolution:
- Clean the data using an exact join first
- Create a table to hold the non-matching data
- Apply the q-gram technique to the new temporary table
Similarity Join – Q-gram (continued)
For the string "john smith" with q = 3: {(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ), (6,n s), (7, sm), (8,smi), (9,mit), (10,ith), (11,th%), (12,h%%)}
For the string "john a smith" with q = 3: {(1,##j), (2,#jo), (3,joh), (4,ohn), (5,hn ), (6,n a), (7, a ), (8,a s), (9, sm), (10,smi), (11,mit), (12,ith), (13,th%), (14,h%%)}
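The padded q-gram construction above can be sketched as follows (the helper name is ours):

```python
def positional_qgrams(s: str, q: int = 3):
    """Pad s with q-1 '#' characters on the left and q-1 '%' characters
    on the right, then slide a window of length q over the result.
    Returns the |s| + q - 1 positional q-grams as (position, substring)."""
    padded = "#" * (q - 1) + s + "%" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(s) + q - 1)]

grams = positional_qgrams("john smith", 3)
print(grams[0], grams[-1])   # (1, '##j') (12, 'h%%')
print(len(grams))            # 12 = |s| + q - 1 for |s| = 10, q = 3
```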
Sample SQL Expression
SELECT R1.A0, R1.Ai, R2.A0, R2.Aj
FROM R1, R1AiQ, R2, R2AjQ
WHERE R1.A0 = R1AiQ.A0 AND
      R2.A0 = R2AjQ.A0 AND
      R1AiQ.Qgram = R2AjQ.Qgram AND
      |R1AiQ.Pos - R2AjQ.Pos| <= k AND
      |strlen(R1.Ai) - strlen(R2.Aj)| <= k
GROUP BY R1.A0, R1.Ai, R2.A0, R2.Aj
HAVING COUNT(*) >= strlen(R1.Ai) - 1 - (k - 1) * q AND
       COUNT(*) >= strlen(R2.Aj) - 1 - (k - 1) * q AND
       edit_distance(R1.Ai, R2.Aj, k)
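To make the expression concrete, here is a toy end-to-end sketch using SQLite. The sample rows, the `within_ed` UDF name, and the use of ABS/LENGTH in place of the slide's |...| and strlen are our assumptions, not the paper's schema:

```python
import sqlite3

def positional_qgrams(s, q=3):
    padded = "#" * (q - 1) + s + "%" * (q - 1)
    return [(i + 1, padded[i:i + q]) for i in range(len(s) + q - 1)]

def edit_distance(x, y):
    # Row-by-row Levenshtein distance
    prev = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        cur = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (x[i - 1] != y[j - 1]))
        prev = cur
    return prev[-1]

q, k = 3, 2
conn = sqlite3.connect(":memory:")
conn.create_function("within_ed", 3, lambda a, b, kk: edit_distance(a, b) <= kk)
cur = conn.cursor()
cur.execute("CREATE TABLE R1 (A0 INTEGER, Ai TEXT)")
cur.execute("CREATE TABLE R2 (A0 INTEGER, Aj TEXT)")
cur.execute("CREATE TABLE R1AiQ (A0 INTEGER, Pos INTEGER, Qgram TEXT)")
cur.execute("CREATE TABLE R2AjQ (A0 INTEGER, Pos INTEGER, Qgram TEXT)")

r1 = [(1, "jack lemmon"), (2, "tom hanks")]
r2 = [(1, "jack lemon"), (2, "kevin spacey")]
cur.executemany("INSERT INTO R1 VALUES (?, ?)", r1)
cur.executemany("INSERT INTO R2 VALUES (?, ?)", r2)
for a0, s in r1:   # build the augmented q-gram tables on the fly
    cur.executemany("INSERT INTO R1AiQ VALUES (?, ?, ?)",
                    [(a0, p, g) for p, g in positional_qgrams(s, q)])
for a0, s in r2:
    cur.executemany("INSERT INTO R2AjQ VALUES (?, ?, ?)",
                    [(a0, p, g) for p, g in positional_qgrams(s, q)])

rows = cur.execute("""
    SELECT R1.A0, R1.Ai, R2.A0, R2.Aj
    FROM R1, R1AiQ, R2, R2AjQ
    WHERE R1.A0 = R1AiQ.A0 AND R2.A0 = R2AjQ.A0
      AND R1AiQ.Qgram = R2AjQ.Qgram
      AND ABS(R1AiQ.Pos - R2AjQ.Pos) <= ?
      AND ABS(LENGTH(R1.Ai) - LENGTH(R2.Aj)) <= ?
    GROUP BY R1.A0, R1.Ai, R2.A0, R2.Aj
    HAVING COUNT(*) >= LENGTH(R1.Ai) - 1 - (? - 1) * ?
       AND COUNT(*) >= LENGTH(R2.Aj) - 1 - (? - 1) * ?
       AND within_ed(R1.Ai, R2.Aj, ?)
""", (k, k, k, q, k, q, k)).fetchall()
print(rows)   # only the jack lemmon / jack lemon pair survives the filters
```

The count filters prune cheaply; the `within_ed` UDF is the final, expensive check applied only to surviving candidate pairs.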
Precision, Recall and F-measure
Precision is defined as the number of true positives divided by the sum of true positives and false positives: TP / (TP + FP).
Recall is defined as the number of true positives divided by the sum of true positives and false negatives: TP / (TP + FN).
F-measure is defined as the weighted harmonic mean of precision and recall:
F = 2 * (precision * recall) / (precision + recall)
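The three metrics follow directly from the match counts; the counts in the usage line are hypothetical:

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F-measure from true positives,
    false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts: 8 correct matches, 2 spurious, 2 missed.
p, r, f = precision_recall_f(tp=8, fp=2, fn=2)   # p = r = 0.8, f ≈ 0.8
```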
Experimental Results
Known join attributes vs clustered join attributes on Precision
Effect of Threshold
[Chart: precision vs. threshold (2–12) for ED on: name; name and address; name and birth; name and telephone; name, address, and telephone; name, birth, and telephone; name, birth, address, and telephone]
Experimental Results
Known join attributes vs clustered join attributes on Recall
Effect of Threshold
[Chart: recall vs. threshold (2–12) for ED on: name; name and address; name and birth; name and telephone; name, address, and telephone; name, birth, and telephone; name, birth, address, and telephone]
Experimental Results
ED vs. Q-gram on Precision – Effect of Affinity Fields
[Chart: precision vs. threshold (2–12) for ED on affinity clustered fields and Q-gram on affinity clustered fields]
Experimental Results
ED vs. Q-gram on Recall – Effect of Affinity Fields
[Chart: recall vs. threshold (2–12) for ED on affinity clustered fields and Q-gram on affinity clustered fields]
Experimental Results
ED vs. Q-gram on F-measure – Effect of Threshold
[Chart: F-measure vs. threshold (2–12) for ED and Q-gram]
Conclusion
Proposed a pre-processing approach to improve existing similarity join techniques
Experimental results showed an improvement of about 5% for ED and about 15% for Q-gram
Future Work
Potential further work:
Explore alternative clustering methods
Increase the size of the datasets
Add pre- and post-filtering capabilities
…
Publications
Lisa Tan, Farshad Fotouhi, and William Grosky, "Improving Similarity Join Algorithms using Vertical Clustering Techniques", ICADIWT 2009, pp. 491–496.
"Improving Similarity Join Algorithm Using Fuzzy Clustering Techniques", accepted at the ICDM-09 Workshop on Mining Multiple Information Sources (MMIS).
Thank You!
Lisa Tan – [email protected]
Co-Authors
Dr. Farshad Fotouhi – [email protected]
Dr. William Grosky – [email protected]
Acknowledgement
Dr. Farshad Fotouhi, Dr. William Grosky, and Computing & Information Technology
Question
Wayne State University - Facts
30th largest university in the nation
Top 50 in NSF public rankings
Over 33,300 students
Over 350 undergraduate/graduate degree programs in 12 Schools and Colleges