Similarity Join. Wu Yang, 2009.4.9. Main work: MS, A Primitive Operator for Similarity Joins in Data Cleaning (ICDE 2006); Google, Scaling Up All Pairs Similarity Search (WWW 2007).


Post on 12-Jan-2016


TRANSCRIPT

<ul><li><p>Similarity Join</p><p>Wu Yang, 2009.4.9</p><p>Efficient Similarity Joins for Near Duplicate Detection</p></li>
<li><p>Main work</p><p>MS: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006.</p><p>Google: Scaling Up All Pairs Similarity Search. WWW 2007.</p><p>University of New South Wales &amp; NICTA, Australia (Chuan Xiao, Wei Wang, Xuemin Lin):</p><p>PPJoin: Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.</p><p>Ed-Join: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints. VLDB 2008.</p><p>Approximate Entity Extraction with Edit Distance Constraints. SIGMOD 2009.</p><p>Top-k Set Similarity Joins. ICDE 2009.</p></li>
<li><p>Outline</p><p>Motivation</p><p>Algorithms</p><p>Experiments</p><p>Thinking</p></li>
<li><p>Near Duplicate Data</p><p>On one end, a winded Pete Sampras tried to summon enough energy to give the New York fans another memorable win to talk about it on the subway ride home. On the other side, Roger Federer wore a sly grin like he knew age was about to catch up to the former world No. 1 - the man who owns the record of 14 Grand Slams he wants.</p><p>03/11/2008 | 11:28 AM</p><p>By JAY COHEN, AP Sports Writer, Mar 11, 4:23 am EDT</p></li>
<li><p>App: deduplication /2</p></li>
<li><p>App: data integration / record linkage</p></li>
<li><p>Applications</p><p>For Web search engines: perform focused crawling; increase the quality and diversity of query results; identify spam.</p><p>For Web mining: perform document clustering; find replicated Web collections; detect plagiarism.</p><p>SPAM TEMPLATE</p><p>Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. 
Your company or your personal e-mail address attached to ticket number 653-908-321-675 with serial main number drew lucky star winning numbers which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!!</p><p>Sincerely yours,</p><p>Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.</p><p>Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.</p></li>
<li><p>Algorithms</p><p>Data set</p><p>Similarity function</p><p>Algorithms</p></li>
<li><p>Data set</p><p>dblp.raw</p><p>texas.raw</p><p>trec.raw</p><p>uniref.500K.raw</p></li>
<li><p>Similarity Function</p><p>Common similarity functions (records treated as token sets):</p><p>Jaccard: J(x, y) = |x ∩ y| / |x ∪ y|</p><p>Cosine: C(x, y) = |x ∩ y| / √(|x| · |y|)</p><p>Overlap: O(x, y) = |x ∩ y|</p><p>Jaccard can be equivalently converted to Overlap.</p><p>Example: x = {A, B, C, D, E}, y = {B, C, D, E, F}. Jaccard = 4/6 = 0.67; Cosine = 4/5 = 0.8; Overlap = 4.</p></li>
<li><p>Similarity Function</p><p>Hamming distance = |(x − y) ∪ (y − x)|</p><p>Edit distance</p></li>
<li><p>Algorithms - classification</p></li>
<li><p>Algorithms object</p><p>Similarity between sets</p><p>Binary similarity functions: contains, intersects</p><p>Numerical similarity functions: overlap, Jaccard, Dice, cosine</p><p>Similarity between strings</p><p>Treat strings as sets: Jaccard (on q-grams), edit distance</p></li>
<li><p>Algorithms</p><p>SSJoin</p><p>All-Pairs</p><p>PPJoin, PPJoin+</p><p>Top-k Set Similarity 
Joins</p></li>
<li><p>SSJoin</p><p>Based on sets</p><p>Why string to set? (Cited from "Efficient Exact Set-Similarity Joins", MS)</p><p>Generalizes to many string similarity functions</p><p>Powerful primitive</p><p>Sets and relations: leverage relational data processing</p></li>
<li><p>SSJoin</p><p>Find {(r, s) | r ∈ R, s ∈ S, overlap(r, s) ≥ t}</p><p>A fundamental operator: it can handle other similarity functions (Jaccard, cosine, Hamming, Dice, edit distance, ...) via transformation</p></li>
<li><p>Prefix Filtering-based Similarity Joins</p><p>SSJoin [Chaudhuri et al, ICDE06]: formalizes the prefix-filtering principle</p><p>All-Pairs [Bayardo et al, WWW07]: uses prefix filtering in an asymmetric way</p><p>PPJoin+ [Xiao et al, WWW08]: employs prefix filtering, positional filtering, and suffix filtering</p></li>
<li><p>All-Pairs</p></li>
<li><p>Prefix + Positional Information</p><p>We use the prefix filter (All-Pairs [WWW07]) as the basic framework</p><p>Intuition: tokens are sorted, so each token has a rank, or position, within a record; positional information lets us estimate tighter upper bounds on the overlap between x and y</p><p>Contributions</p><p>Index construction: index not only tokens, but also their positions in the record</p><p>ppjoin algorithm: candidate generation; probe tokens in suffix, compare the positions in the record</p><p>ppjoin+ algorithm</p></li>
<li><p>Experiments</p></li>
<li><p>Thinking</p><p>Further optimization of performance</p><p>Index for similarity functions (e.g., cosine)</p><p>Better pruning techniques</p><p>Optimize for the specific similarity/distance function</p></li>
<li><p>Thinking</p><p>Token, inverted list, TF-IDF (IR): w(i, j) = tf × idf</p></li>
<li><p>Thinking (continued)</p><p>SET</p></li>
<li><p>Related Work</p><p>Approximate:</p><p>LSH: A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.</p><p>Shingling: A. Z. 
Broder. On the resemblance and containment of documents. In SEQS, 1997.</p><p>Exact:</p><p>Index-based: S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.</p><p>Prefix-based: S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.</p><p>All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.</p><p>PPJoin, PPJoin+: Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. In WWW, 2008.</p><p>Pigeon-hole principle based: PartEnum: A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.</p></li>
<li><p>References</p><p>[SEQS97] A. Z. Broder. On the resemblance and containment of documents. In SEQS, 1997.</p><p>[MIR] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1st edition, May 1999.</p><p>[VLDB99] LSH: A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.</p><p>[SIGMOD04] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.</p><p>[ICDE06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.</p><p>[VLDB06] PartEnum: A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.</p><p>[WWW07] All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.</p><p>[WWW08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. In WWW, 2008.</p><p>[VLDB08] Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. In VLDB, 2008.</p><p>[ICDE09] Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang. Top-k Set Similarity Joins. In ICDE, 2009.</p></li></ul>
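<p>The prefix-filtering idea that SSJoin, All-Pairs and PPJoin build on can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of any of the cited papers: the function names <code>jaccard</code> and <code>prefix_filter_join</code> are made up for this example, the token order (ascending document frequency) is an assumed choice, and the positional and suffix filters of PPJoin/PPJoin+ are left out. It relies on the fact that, for a Jaccard threshold t, two records that match must share at least one token within their first |x| − ⌈t·|x|⌉ + 1 tokens under a common global order.</p>

```python
import math
from collections import defaultdict


def jaccard(x, y):
    """Jaccard similarity of two non-empty token collections."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)


def prefix_filter_join(records, t):
    """Return sorted (i, j, sim) triples with i < j and Jaccard(sim) >= t.

    Hypothetical helper for illustration: candidates come from an
    inverted index built over record prefixes only, and every
    candidate is verified with the exact Jaccard similarity.
    """
    # Global token order (assumed): rare tokens first, ties broken by token.
    df = defaultdict(int)
    for r in records:
        for tok in set(r):
            df[tok] += 1
    canon = [sorted(set(r), key=lambda tok: (df[tok], tok)) for r in records]

    index = defaultdict(list)  # token -> ids of records with token in prefix
    results = []
    # Process records from shortest to longest.
    for i in sorted(range(len(canon)), key=lambda k: len(canon[k])):
        x = canon[i]
        if not x:
            continue  # empty records cannot reach a positive threshold
        # Prefix length |x| - ceil(t*|x|) + 1; the small epsilon guards
        # against float round-up, which would wrongly shorten the prefix.
        prefix_len = len(x) - math.ceil(t * len(x) - 1e-9) + 1
        prefix = x[:prefix_len]

        # Candidate generation: records sharing at least one prefix token.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification: compute the exact similarity for each candidate.
        for j in candidates:
            sim = jaccard(x, canon[j])
            if sim >= t:
                results.append((min(i, j), max(i, j), sim))

        for tok in prefix:
            index[tok].append(i)
    return sorted(results)


if __name__ == "__main__":
    # The x and y sets from the Similarity Function slide, plus an unrelated set.
    data = [list("ABCDE"), list("BCDEF"), list("GH")]
    for i, j, sim in prefix_filter_join(data, 0.6):
        print(i, j, round(sim, 2))
```

<p>Indexing and probing with the same prefix keeps the sketch short; All-Pairs and PPJoin tighten this further, and PPJoin additionally stores each token's position so that the overlap can be bounded before verification.</p>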
