Learning URL Patterns for Webpage De-duplication
Authors: Hema Swetha Koppula… WSDM 2010. Reporter: Jing Chiu. Email: [email protected]
![Page 1: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/1.jpg)
Learning URL Patterns for Webpage De-duplication
Authors: Hema Swetha Koppula…
WSDM 2010
Reporter: Jing Chiu
Email: [email protected]
112/04/21 Data Mining & Machine Learning Lab
![Page 2: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/2.jpg)
Outline
• Introduction
  ▫ Duplicate URLs
  ▫ Problem Definition
• Related Works
• Algorithms
  ▫ URL Preprocessing
  ▫ Rule Generation
• Evaluation
• Conclusions
![Page 3: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/3.jpg)
Introduction
• Duplicate URLs
• Problem Definition
![Page 4: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/4.jpg)
Duplicate URLs
• Making URLs search engine friendly
  ▫ http://en.wikipedia.org/wiki/Casino_Royale
  ▫ http://en.wikipedia.org/?title=Casino_Royale
• Session-id or cookie information present in URLs
  ▫ http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8
  ▫ http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8
• Irrelevant or superfluous components in URLs
  ▫ http://www.amazon.com/Lord-Rings/dp/B000634DCW
  ▫ http://www.amazon.com/dp/B000634DCW
• Webmasters construct URL representations with custom delimiters
  ▫ http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0QQ_fclsZ1QQ_pcatidZ1QQ_pidZ43973351QQ_tabZ2
  ▫ http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0?_fcls=1&_pcatid=1&_pid=43973351&_tab=2
![Page 5: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/5.jpg)
Problem Definition
• Given a set of duplicate clusters and their corresponding URLs
  ▫ Learn rules from the URL strings that can identify duplicates
  ▫ Use the learned rules to normalize unseen duplicate URLs into a unique normalized URL
• Applications such as crawlers can apply these generalized rules to a given URL to generate a normalized URL
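As a rough illustration of how a crawler might use such rules, the sketch below keeps a set of normalized URLs and skips any URL whose normalized form has already been seen. The single rule used here (dropping a `sid` query argument) is hand-written for the Stanford example, not a learned rule, and the names `drop_session_id` and `Frontier` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_session_id(url):
    """Hypothetical hand-written rule: remove the `sid` query argument."""
    parts = urlsplit(url)
    args = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != "sid"]
    return urlunsplit(parts._replace(query=urlencode(args)))


class Frontier:
    """Crawl frontier that admits each normalized URL only once."""

    def __init__(self, rules):
        self.rules = rules   # functions mapping a URL string to a URL string
        self.seen = set()    # normalized URLs admitted so far

    def normalize(self, url):
        # Apply every rule in order to obtain the normalized URL.
        for rule in self.rules:
            url = rule(url)
        return url

    def add(self, url):
        """Return True if the URL is new after normalization, else False."""
        key = self.normalize(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


frontier = Frontier([drop_session_id])
first = frontier.add("http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8")
second = frontier.add("http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8")
```

With this rule, both session-id URLs from the Duplicate URLs slide normalize to the same string, so only the first one is admitted.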
![Page 6: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/6.jpg)
Related Works
• Do not crawl in the DUST: different URLs with similar text
  ▫ Authors: Z. Bar-Yossef, I. Keidar, and U. Schonfeld
  ▫ Conference: International Conference on World Wide Web, 2007
  ▫ DUST algorithm
    ▫ Discovers substring substitution rules that transform URLs of similar content into one canonical URL
    ▫ Rules are learned, with a confidence measure, from URLs obtained from previous crawl logs or web server logs
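A substring substitution rule can be pictured as a pair (alpha, beta) with a confidence score, as in the minimal sketch below. The rule and its confidence value are invented for the Wikipedia example, and `canonicalize` is a hypothetical helper, not the DUST implementation.

```python
def canonicalize(url, rules, min_confidence=0.9):
    """Apply every substitution rule whose confidence meets the threshold.

    Each rule is a tuple (alpha, beta, confidence): the first occurrence of
    substring alpha is replaced by beta. URLs without alpha are unchanged.
    """
    for alpha, beta, confidence in rules:
        if confidence >= min_confidence:
            url = url.replace(alpha, beta, 1)
    return url


# Hypothetical rules; the confidence values are made up for illustration.
rules = [
    ("/?title=", "/wiki/", 0.95),
    ("/index.php", "/", 0.40),   # below the threshold, so never applied
]

canonical = canonicalize("http://en.wikipedia.org/?title=Casino_Royale", rules)
```

Here the first rule maps the `?title=` form of the Wikipedia URL onto the `/wiki/` form, while the low-confidence rule is ignored.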
![Page 7: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/7.jpg)
Related Works (cont.)
• De-duping URLs via rewrite rules
  ▫ Authors: A. Dasgupta, R. Kumar, and A. Sasturkar
  ▫ Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  ▫ Considers a broader set of rule types, which subsume the DUST rules
    ▫ DUST rules
    ▫ Session-id rules
    ▫ Irrelevant path components
    ▫ Complicated rewrites
  ▫ The algorithm learns rules from a cluster of URLs with similar page content
    ▫ Such a cluster is referred to as a duplicate cluster or a dup cluster
![Page 8: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/8.jpg)
Algorithms
• URL Preprocessing
  ▫ Basic Tokenization
  ▫ Deep Tokenization
• Rule Generation
  ▫ Pair-wise Rule Generation
  ▫ Rule Generalization
![Page 9: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/9.jpg)
URL Preprocessing
• Basic Tokenization
  ▫ Uses the standard delimiters specified in RFC 1738
  ▫ Extracted tokens:
    ▫ Protocol
    ▫ Hostname
    ▫ Path components
    ▫ Query-args
• Deep Tokenization
  ▫ Uses an unsupervised technique to learn custom URL encodings used by webmasters
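The two tokenization levels can be sketched as follows. `basic_tokens` splits on the standard delimiters using Python's `urllib.parse`. `deep_tokens` is only a hand-written heuristic (split on non-alphanumeric runs, then split a letters-then-digits token) standing in for the unsupervised technique, which the slides do not spell out. Both function names are assumptions.

```python
import re
from urllib.parse import urlsplit


def basic_tokens(url):
    """Split a URL into protocol, hostname, path components and query args."""
    parts = urlsplit(url)
    tokens = [parts.scheme, parts.netloc]
    tokens += [p for p in parts.path.split("/") if p]    # path components
    tokens += [q for q in parts.query.split("&") if q]   # query args
    return tokens


def deep_tokens(token):
    """Heuristic stand-in for learned deep tokenization of one basic token."""
    out = []
    # Split on runs of non-alphanumeric characters, keeping them as tokens.
    for piece in re.split(r"([^A-Za-z0-9]+)", token):
        if not piece:
            continue
        # Separate an all-letters prefix from an all-digits suffix.
        match = re.fullmatch(r"([A-Za-z]+)([0-9]+)", piece)
        out.extend(match.groups() if match else [piece])
    return out


basic = basic_tokens("http://www.imdb.com/title/tt0810900/photogallery")
deep = [deep_tokens(t) for t in basic[2:]]   # leave protocol and hostname whole
```

Protocol and hostname are left whole here because the rule examples in the slides treat `360.yahoo.com` and `www.imdb.com` as single tokens.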
![Page 10: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/10.jpg)
URL Preprocessing (cont.)
![Page 11: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/11.jpg)
Rule Generation
• Definitions
  ▫ URL
  ▫ Rule
• Example
  ▫ u1: http://360.yahoo.com/friends-lttU7d6kIuGq
    u1 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = lttU7d6kIuGq}
  ▫ u2: http://360.yahoo.com/friends-nMfcaJRPUSMQ
    u2 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = nMfcaJRPUSMQ}
  ▫ Rule
    Context (C): c(k(1,3)) = http, c(k(2,2)) = 360.yahoo.com, c(k(3.1,1.3)) = friends, c(k(3.2,1.2)) = -, c(k(3.3,1.1)) = nMfcaJRPUSMQ
    Transformation (T): t(k(3.3,1.1)) = lttU7d6kIuGq
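The example rule can be reproduced with a small sketch: the context copies every key of the source URL, and the transformation lists only the keys whose value differs in the target. This assumes a URL is stored as a plain dictionary from key strings to values. `pairwise_rule` is a hypothetical name, and the target and source selection steps of the algorithm are not modeled.

```python
def pairwise_rule(source, target):
    """Build a (context, transformation) pair that rewrites source into target."""
    context = dict(source)   # the context is the full source URL
    transformation = {}
    for key in sorted(set(source) | set(target)):
        if source.get(key) != target.get(key):
            # None stands for the missing-key symbol when target lacks the key.
            transformation[key] = target.get(key)
    return context, transformation


u1 = {
    "k(1,3)": "http", "k(2,2)": "360.yahoo.com",
    "k(3.1,1.3)": "friends", "k(3.2,1.2)": "-", "k(3.3,1.1)": "lttU7d6kIuGq",
}
u2 = {
    "k(1,3)": "http", "k(2,2)": "360.yahoo.com",
    "k(3.1,1.3)": "friends", "k(3.2,1.2)": "-", "k(3.3,1.1)": "nMfcaJRPUSMQ",
}

# u2 is the source and u1 the target, as in the example rule.
context, transformation = pairwise_rule(u2, u1)
```

The resulting transformation touches only `k(3.3,1.1)`, matching the rule shown on the slide.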
![Page 12: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/12.jpg)
Rule Generation (cont.)
• Pair-wise Rule Generation
  ▫ Target Selection
  ▫ Source Selection
• Rule Generalization
  ▫ Pair 1:
    http://www.imdb.com/title/tt0810900/photogallery
    http://www.imdb.com/title/tt0810900/mediaindex
  ▫ Pair 2:
    http://www.imdb.com/title/tt0053198/photogallery
    http://www.imdb.com/title/tt0053198/mediaindex
  ▫ Rule 1:
    c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0810900, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
  ▫ Rule 2:
    c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0053198, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
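Rules 1 and 2 differ only in the value of `k(4.2,2.1)`, so one simple way to generalize them is to replace that context value with the wildcard `*`. The sketch below does exactly that for two rules with identical transformations. It is an assumed simplification, not necessarily the generalization procedure of the paper, and `generalize` is a hypothetical name.

```python
WILDCARD = "*"


def generalize(rule_a, rule_b):
    """Merge two rules into one, or return None if they cannot be merged.

    Rules are (context, transformation) pairs of dictionaries. They are merged
    only when their transformations and context keys agree. Context values
    that differ between the two rules become wildcards.
    """
    (ctx_a, t_a), (ctx_b, t_b) = rule_a, rule_b
    if t_a != t_b or ctx_a.keys() != ctx_b.keys():
        return None
    context = {k: (v if ctx_b[k] == v else WILDCARD) for k, v in ctx_a.items()}
    return context, dict(t_a)


rule_1 = (
    {"k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
     "k(4.1,2.2)": "tt", "k(4.2,2.1)": "0810900", "k(5,1)": "photogallery"},
    {"k(5,1)": "mediaindex"},
)
rule_2 = (
    {"k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
     "k(4.1,2.2)": "tt", "k(4.2,2.1)": "0053198", "k(5,1)": "photogallery"},
    {"k(5,1)": "mediaindex"},
)

general_context, general_transformation = generalize(rule_1, rule_2)
```

The merged rule applies to any IMDb title id, not just the two ids seen in the training pairs.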
![Page 13: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/13.jpg)
Evaluation
• Dataset
• Number of rules after each step
![Page 14: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/14.jpg)
Evaluation (cont.)
• Small dataset
![Page 15: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/15.jpg)
Evaluation (cont.)
• Small dataset
![Page 16: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/16.jpg)
Evaluation (cont.)
• Large dataset
![Page 17: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/17.jpg)
Evaluation (cont.)
• Large dataset
![Page 18: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/18.jpg)
Conclusion
• Presented a set of scalable and robust techniques for de-duplication of URLs
  ▫ Basic and deep tokenization
  ▫ Rule generation and generalization
• Easy adaptability to the MapReduce paradigm
• Evaluated effectiveness on both small and large datasets
![Page 19: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/19.jpg)
Thanks for your attention
• Questions?
![Page 20: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/20.jpg)
Algorithm 1
![Page 21: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/21.jpg)
Algorithm 2
![Page 22: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/22.jpg)
Algorithm 3
![Page 23: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/23.jpg)
Algorithm 4
![Page 24: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/24.jpg)
Algorithm 5
![Page 25: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/25.jpg)
Definition of URL
• URL: A URL u is defined as a function
  ▫ u : K → V ∪ {⊥}
  ▫ K: keys
    ▫ k(x.i, y.j)
    ▫ x, y represent the position index from the start and end of the URL
    ▫ i, j represent the deep token index
  ▫ V: values
  ▫ A key not present in the URL is denoted by ⊥
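To make the key notation concrete, the sketch below builds the key-to-value map for a URL that has already been tokenized. It assumes, consistently with the rule examples in these slides, that i and j count deep tokens from the start and end of their basic token, and that a basic token with a single deep token is written without the `.i`/`.j` part. `key_map`, `lookup` and `BOTTOM` are hypothetical names.

```python
BOTTOM = None   # stands for the symbol ⊥ (key not present in the URL)


def key_map(nested_tokens):
    """Map keys k(x.i,y.j) to values.

    nested_tokens is a list of basic tokens, each given as the list of its
    deep tokens, e.g. [["http"], ["360.yahoo.com"], ["friends", "-", "id"]].
    """
    n = len(nested_tokens)
    keys = {}
    for x, deep in enumerate(nested_tokens, start=1):
        y = n - x + 1                       # position counted from the end
        m = len(deep)
        for i, value in enumerate(deep, start=1):
            j = m - i + 1                   # deep index counted from the end
            key = f"k({x},{y})" if m == 1 else f"k({x}.{i},{y}.{j})"
            keys[key] = value
    return keys


def lookup(url_keys, key):
    """Evaluate the URL as a function u(key), returning BOTTOM if absent."""
    return url_keys.get(key, BOTTOM)


u1 = key_map([["http"], ["360.yahoo.com"], ["friends", "-", "lttU7d6kIuGq"]])
```

Indexing from both ends lets a rule refer to the last path component as position 1 from the end, whatever the number of components before it.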
![Page 26: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/26.jpg)
Definition of Rule
• Rule: A rule r is defined as a function
  ▫ r : C → T
  ▫ C: context
    ▫ C : K → V ∪ {∗}
  ▫ T: transformation
    ▫ T : K → V ∪ {⊥, K’}
    ▫ K’ = K ∪ ValueConversions
    ▫ ValueConversions = {Lowercase(K), Uppercase(K), Encode(K), Decode(K), ...}
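A rule of this shape could be applied to a URL key map roughly as sketched below. The encoding of transformation values as tagged tuples (`("literal", v)`, `("key", k)`, `("Lowercase", k)`, ...) is an assumption made here to tell the cases of T apart, as is the choice that a wildcard only matches keys that are present. Percent-encoding via `urllib.parse.quote` is one possible reading of Encode and Decode.

```python
from urllib.parse import quote, unquote

WILDCARD = "*"   # context value that accepts any value
BOTTOM = None    # stands for ⊥: the key is removed from the URL

CONVERSIONS = {
    "Lowercase": str.lower,
    "Uppercase": str.upper,
    "Encode": quote,
    "Decode": unquote,
}


def matches(context, url):
    """True if every context key is present in url with a compatible value."""
    return all(k in url and (v == WILDCARD or url[k] == v)
               for k, v in context.items())


def apply_rule(context, transformation, url):
    """Return the transformed key map, or an unchanged copy if no match."""
    out = dict(url)
    if not matches(context, url):
        return out
    for key, action in transformation.items():
        if action is BOTTOM:
            out.pop(key, None)                       # drop the key
            continue
        kind, arg = action
        if kind == "literal":
            out[key] = arg                           # fixed value from V
        elif kind == "key":
            out[key] = url[arg]                      # copy another key's value
        else:
            out[key] = CONVERSIONS[kind](url[arg])   # value conversion
    return out


url = {"k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
       "k(4.1,2.2)": "tt", "k(4.2,2.1)": "0053198", "k(5,1)": "photogallery"}
context = {**url, "k(4.2,2.1)": WILDCARD}
transformation = {"k(5,1)": ("literal", "mediaindex")}
normalized = apply_rule(context, transformation, url)
```

In this example the wildcard on `k(4.2,2.1)` lets the IMDb rule fire for any title id, and only `k(5,1)` is rewritten.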
![Page 27: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/27.jpg)
Rule Coverage
![Page 28: Learning URL Patterns for Webpage De-duplication](https://reader036.vdocuments.mx/reader036/viewer/2022062518/5681472c550346895db467f0/html5/thumbnails/28.jpg)
MapReduce