幻灯片 1net.pku.edu.cn/~wbia/2010/slides/lecture1… · ppt file · web view ·...
TRANSCRIPT
![Page 1: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/1.jpg)
Web Noises: Detection and Elimination
Peng Bo
Dec 3, 2010
What are Web Noises?
Topic (main content)
Navigation guide
Advertisement
Call them Noises
Although this information is useful to people browsing the Web, it often has a negative effect on automated Web information processing, such as Web page clustering, classification, information retrieval and information extraction.
It hampers automated information gathering and Web data mining.
Source: “Template Detection via Data Mining and its Applications”
Non-Relevant Data on the Web
A fundamental problem on the Web: many pages contain lots of non-relevant data
“Non-relevant”: not directly related to the main topic / functionality of the page
Local (intra-page) noise: irrelevant items within a Web page, e.g., banner ads, navigational guides
Duplicate data on the Web
Another problem on the Web: there is much duplicate or near-duplicate data
Mirrors, news copies, etc.
Global noise: redundant objects larger than an individual page, e.g., mirror sites, duplicated Web pages
Why does it matter?
Hypertext IR Principles (the principles behind all link-based IR tools):
Relevant Linkage Principle: p links to q ⇒ q is relevant to p
Topical Unity Principle: q1 and q2 are co-cited in p ⇒ q1 and q2 are related to each other
Lexical Affinity Principle: the closer the links to q1 and q2 are, the stronger the relation between them.
Violations of Relevant Linkage Principle
Navigational links http://www.ibm.com/
Download links http://www.beethoven.com/
Advertisement links http://www.yahoo.com/
Endorsement links http://www.ebay.com/
Spam links
Violations of Topical Unity Principle
Violations of the Relevant Linkage Principle
Bookmark pages: http://bookmark.yinsha.com/ (online bookmarks)
General resource lists: http://sewm.pku.edu.cn/IR-Guide.txt (IR Guide)
Personal homepages: http://www.cse.iitb.ac.in/~soumen/ (Soumen’s Home Page)
Violations of Lexical Affinity Principle
Alphabetical index lists, e.g., Computer and Communication Companies ("M" entries)
HTML representation: adjacent cells in the same column are far from each other in the HTML text
IR Tool Problems
Generalization: search for “Frequency Division Multiplexing” and get back general Electrical Engineering sites
Topic drift: search for “Finite Model Theory” and get SF 49’ers fan web sites
Irrelevance: get “Yahoo” as a result regardless of the query
Bias: search for “computing companies” and get Microspy highly ranked
Hypertext Improvement Problem
Main goal: develop hypertext processing techniques that:
• automatically improve hypertext data (remove violations of the Hypertext IR principles)
• are efficient and scalable (quickly process millions of pages)
Hypertext Cleaning
Web → Crawler → Hypertext Cleaner → IR Tool
Template detection
DOM Tree
Template
Templates
Templates Detection
Semantic definition: a template is a master HTML shell page that is used as a basis for composing new pages
Content of new pages is plugged into the template shell
All pages share a common look & feel
Usually controlled by a central authority
Not necessarily confined to a single site
May include a variety of data: navigational bars, advertisements, company info and policies
Search pagelet
Navigation pagelet
Services pagelet
Company info pagelet
Ad pagelet
Pagelets
Semantic definition: a pagelet is a maximal region of a page that has a single topic or functionality
Not too large: has only one topic / functionality
Not too small: any larger region that contains it has other topics / functionalities
IR with Pagelets
Main Idea 1: use pagelets rather than pages as atomic units for information retrieval
Main Idea 2: eliminate pagelets belonging to templates
Pagelets: Syntactic Definition
A pagelet is a node in the HTML parse tree of a page satisfying the following:
Its HTML tag is one of the following: <TABLE>, <OL>, <UL>, <AREA>, <P>, <DL>, …
None of its children contains more than k hyperlinks
None of its ancestors is a pagelet
Templates: Syntactic Definition
A template is a collection T = (p1,…,pk) of pagelets satisfying:
Similarity: p1,…,pk are identical or almost identical
Connectivity: every two pages owning pagelets in T are reachable from each other (undirectedly) through other pages owning pagelets in T.
[Figure: linked pages owning pagelets p1, p2, p3, p4, p5]
Template Recognition Problem: given a set of pages S, find all the templates in S.
Template Recognition in Large Sets
Calculate shingle(p) for each pagelet p ∈ S
Cluster pagelets in S according to shingle
Discard clusters of size 1
For each remaining cluster C:
Construct graph G_C of pages that own pagelets in C
Find undirected connected components of G_C
Output components of size > 1
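The recognition steps above can be sketched in Python as follows. This is a minimal sketch under assumptions that are not in the slides: each pagelet is given as a `(page, text)` pair, `links` is a set of hyperlink pairs between pages, and an exact hash of the normalized text (the hypothetical `fingerprint` helper) stands in for the shingle, so only identical pagelets end up in the same cluster.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Stand-in for shingle(p): hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def find_templates(pagelets, links):
    """pagelets: list of (page, text); links: set of (src_page, dst_page).

    Returns a list of page sets, one per detected template.
    """
    # Cluster pagelets by fingerprint, remembering the owning pages.
    clusters = defaultdict(set)
    for page, text in pagelets:
        clusters[fingerprint(text)].add(page)

    templates = []
    for pages in clusters.values():
        if len(pages) < 2:              # discard clusters owned by a single page
            continue
        # Build the undirected graph G_C restricted to pages in this cluster.
        neighbours = defaultdict(set)
        for src, dst in links:
            if src in pages and dst in pages:
                neighbours[src].add(dst)
                neighbours[dst].add(src)
        # Find connected components with an iterative depth-first search.
        seen = set()
        for start in pages:
            if start in seen:
                continue
            component, stack = set(), [start]
            while stack:
                node = stack.pop()
                if node in component:
                    continue
                component.add(node)
                stack.extend(neighbours[node] - component)
            seen |= component
            if len(component) > 1:      # output components of size > 1
                templates.append(component)
    return templates
```

Replacing `fingerprint` with a near-duplicate fingerprint would let almost-identical pagelets cluster together as well.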
Evaluation Question:
How to evaluate the performance/effectiveness of this cleaning algorithm?
Benefits of template detection
Cleaning via feature weighting
Cleaning via feature weighting
In a given Web site:
Noisy blocks: share common contents or presentation styles
Meaningful (or main) blocks: diverse in contents and presentation style
Weighting features makes cleaning automatic (nothing is eliminated)
Source: “Eliminating noisy information in Web pages for data mining”
DOM trees
<BODY bgcolor=WHITE> <TABLE width=800 height=200> … </TABLE> <IMG src="image.gif" width=800> <TABLE bgcolor=RED> … </TABLE> </BODY>
[Figure: DOM tree: root → BODY (bc=white) → TABLE (width=800, height=200), IMG (width=800), TABLE (bc=red)]
Build Site style tree (SST)
common
SST
Style Node S = (ELEMENTs, n)
ELEMENTs: a sequence of element nodes
n: number of pages that have this style
Element Node E = (Tag, Attr, STYLEs)
Tag: tag name, e.g., TABLE, IMG
Attr: display attributes of Tag, e.g., bgcolor=RED
STYLEs: style nodes below E
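The two node types can be written down as plain data structures. The sketch below is illustrative only: the field names mirror the slide, while `style_key` and `add_style` are hypothetical helpers that show how a page's style could be merged into the tree, assuming two styles match when their tags and attributes agree in order (child styles are not merged recursively here).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ElementNode:
    tag: str                       # Tag, e.g. "TABLE"
    attr: Dict[str, str]           # Attr, display attributes such as bgcolor
    styles: List["StyleNode"] = field(default_factory=list)  # STYLEs below E


@dataclass
class StyleNode:
    elements: List[ElementNode]    # ELEMENTs, a sequence of element nodes
    n: int = 1                     # number of pages that have this style


def style_key(style):
    """Hypothetical helper: identify a style by its ordered tags and attributes."""
    return tuple((e.tag, tuple(sorted(e.attr.items()))) for e in style.elements)


def add_style(parent, style):
    """Merge one page's style into parent.styles, bumping n if it already exists."""
    for existing in parent.styles:
        if style_key(existing) == style_key(style):
            existing.n += 1
            return existing
    parent.styles.append(style)
    return style
```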
Quantify the importance
Inner Node
Leaf Node
Weighting policy
Inner Node Importance
$$\mathrm{NodeImp}(E) = \begin{cases} -\sum_{i=1}^{l} p_i \log_m p_i & \text{if } m > 1 \\ 1 & \text{if } m = 1 \end{cases} \quad (1)$$
l = |E.STYLEs|
m = number of pages containing E, |E.parent.n|
p_i: percentage of tag nodes (in E.parent.n) using the i-th presentation style
Inner NodeImp(E): diversity of presentation styles
Example (equation (1) with m = 100):
$$\mathrm{NodeImp}(\mathrm{Body}) = -1 \log_{100} 1 = 0$$
$$\mathrm{NodeImp}(\mathrm{Table}) = -(0.35 \log_{100} 0.35 + 2 \times 0.25 \log_{100} 0.25 + 0.15 \log_{100} 0.15) = 0.29 > 0$$
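The worked example can be checked numerically. The function below is a small sketch of the inner-node formula; the name `inner_node_importance` and the list-of-fractions input are assumptions made for illustration.

```python
import math


def inner_node_importance(style_fractions, m):
    """Entropy of the presentation styles below E, with logarithm base m."""
    if m == 1:
        return 1.0
    # Subtract from 0.0 so that a zero entropy is returned as 0.0, not -0.0.
    return 0.0 - sum(p * math.log(p, m) for p in style_fractions if p > 0)


body = inner_node_importance([1.0], 100)
table = inner_node_importance([0.35, 0.25, 0.25, 0.15], 100)
print(round(body, 2), round(table, 2))   # prints: 0.0 0.29
```

A node whose children always look the same (Body) gets importance 0, while a node with several presentation styles (Table) gets a positive score.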
Weighting policy
Features (terms) of Leaf Node
Importance of Leaf Node’s Features
$$H_E(a_i) = \begin{cases} -\sum_{j=1}^{m} p_{ij} \log_m p_{ij} & \text{if } m > 1 \\ 0 & \text{if } m = 1 \end{cases} \quad (3)$$
m = number of pages containing E, |E.parent.n|
p_ij: probability that a_i appears in E of page j
H_E(a_i): information entropy of a_i
The higher H_E(a_i), the less important a_i
Weighting policy
Leaf Node Importance
$$\mathrm{NodeImp}(E) = \frac{\sum_{i=1}^{N} \left(1 - H_E(a_i)\right)}{N} = 1 - \frac{\sum_{i=1}^{N} H_E(a_i)}{N} \quad (2)$$
N: number of features in E
a_i: a feature of content in E
(1 - H_E(a_i)): information contained in a_i
Leaf NodeImp(E): content diversity of E
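Example: leaf element node E (under TABLE in the SST) appears in 3 pages with the text features
t1: PCMag, samsung
t2: PCMag, epson
t3: PCMag, canon
[Figure: SST with root, IMG and TABLE nodes; E sits below TABLE]
m = 3
N = |{PCMag, samsung, epson, canon}| = 4
$$H_E(\mathrm{PCMag}) = -3 \times \left(\tfrac{1}{3} \log_3 \tfrac{1}{3}\right) = 1$$
$$H_E(\mathrm{samsung}) = H_E(\mathrm{epson}) = H_E(\mathrm{canon}) = -(0 + 0 + 1 \log_3 1) = 0$$
$$\mathrm{NodeImp}(E) = \frac{(1 - 1) + 3 \times (1 - 0)}{4} = 0.75$$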
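The PCMag example can be reproduced in a few lines. This sketch assumes p_ij is estimated as the share of a feature's occurrences that fall on page j, which matches the numbers on the slide; the function names are illustrative.

```python
import math
from collections import Counter


def feature_entropy(counts_per_page, m):
    """H_E(a_i), where counts_per_page[j] is the count of a_i on page j."""
    if m == 1:
        return 0.0
    total = sum(counts_per_page)
    probs = [c / total for c in counts_per_page if c > 0]
    return 0.0 - sum(p * math.log(p, m) for p in probs)


def leaf_node_importance(pages):
    """Average of 1 - H_E(a_i) over the N distinct features of E."""
    m = len(pages)
    counters = [Counter(page) for page in pages]
    features = sorted(set().union(*pages))
    entropies = {a: feature_entropy([c[a] for c in counters], m) for a in features}
    importance = sum(1 - h for h in entropies.values()) / len(features)
    return importance, entropies


pages = [["PCMag", "samsung"], ["PCMag", "epson"], ["PCMag", "canon"]]
importance, entropies = leaf_node_importance(pages)
print(round(importance, 2))   # prints: 0.75
```

The term that appears on every page (PCMag) carries no information, while the page-specific terms carry all of it.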
Transitive Weighting policy
Composite Importance
0
0.290
0.75
Page noise
Noisy element node: for an element node E in the SST, if all of its descendants and E itself have composite importance less than a specified threshold t, then we say element node E is noisy.
Maximal noisy element node
Meaningful element node: if an element node E in the SST does not contain any noisy descendant, we say that E is meaningful.
Maximal meaningful element node
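The noisy-node test is a simple recursive check. The sketch below assumes composite importance values are already computed and stored on each node, and it assumes that a maximal noisy node is a noisy node with no noisy ancestor (the slide names the term without defining it). The `Node` class is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    importance: float                  # composite importance, assumed precomputed
    children: List["Node"] = field(default_factory=list)


def is_noisy(node, t):
    """Noisy: the node and all of its descendants score below threshold t."""
    return node.importance < t and all(is_noisy(c, t) for c in node.children)


def maximal_noisy(node, t):
    """Collect the top-most noisy nodes, i.e. noisy nodes with no noisy ancestor."""
    if is_noisy(node, t):
        return [node.name]
    found = []
    for child in node.children:
        found.extend(maximal_noisy(child, t))
    return found
```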
Web page cleaning via block elimination
We can use the SST (site style tree) to identify and eliminate noisy content blocks in a page.
Build the SST from sample pages crawled from a site.
Compute an importance value for each block, using a specified threshold t to decide whether it is noisy or not.
Given a new page, match its blocks to the noisy and non-noisy blocks in the tree.
Noise Detection and Elimination
[Figure: DOM tree of an example page before cleaning: root → Body → Table, Img, Table, with Tr, Text, P, A and Img nodes below]
After simplification
[Figure: simplified DOM tree: root → Body → Table, Img, Table, with Tr and Text nodes remaining]
Summary of the technique
Evaluate commonality and diversity of content and styles
DOM trees → SST
Information-entropy-based evaluation: node importance, composite importance
Noise detection and automatic matching
Near duplicate detection
Syntactic clustering of the web contents (WWW6, 1997)
Document Representation
How to represent a document?
Represent document content by a feature set, in preparation for computing resemblance or similarity.
For document D, extract its feature set as S(D)
Defining similarity of documents
How to express the concept “roughly the same” precisely?
Quantitative definition: resemblance
The resemblance of two documents A and B is a number between 0 and 1.
Defining similarity of documents (cont’d)
Resemblance (Jaccard coefficient):
$$r(A,B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$$
Symmetric, reflexive, not transitive, not a metric
Note r(A,A) = 1, but r(A,B) = 1 does not mean A and B are identical!
Forgives any number of occurrences and any permutations of the terms.
Resemblance distance:
$$d(A,B) = 1 - r(A,B)$$
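A direct implementation of resemblance on feature sets makes the caveat about r(A,B) = 1 concrete. This is a minimal sketch; the function names are illustrative, and treating two empty feature sets as identical is an assumption.

```python
def resemblance(features_a, features_b):
    """Jaccard coefficient of two feature collections, computed on their sets."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 1.0   # assumption: two empty feature sets count as identical
    return len(a & b) / len(a | b)


def resemblance_distance(features_a, features_b):
    """Resemblance distance d(A, B) = 1 - r(A, B)."""
    return 1.0 - resemblance(features_a, features_b)


# Different token sequences, same word set: resemblance is still 1.
print(resemblance("a rose is a rose".split(), "rose is a".split()))   # prints: 1.0
```

With single words as features, repetition and word order are ignored, which is why two different documents can reach resemblance 1.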
Feature Selection
Assume we have converted the page into a sequence of tokens:
eliminate punctuation and HTML markup, convert to lower case, etc.
How to do feature selection, S(D) = ?
Document level
Character/word level
Shingle level
Shingling
A contiguous subsequence contained in D is called a shingle. Given a document D we define its w-shingling S(D, w) as the set of all unique shingles of size w contained in D.
D = (a,rose,is,a,rose,is,a,rose)
S(D,4) = {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}
“a rose is a rose is a rose” => a_rose_is_a, rose_is_a_rose, is_a_rose_is
Why shingling? S(D,4) vs. S(D,1)
What is a good value for w?
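The definition of S(D, w) translates almost literally into code. The sketch below assumes the document is already tokenized into a list of words; `shingles` is an illustrative name.

```python
def shingles(tokens, w):
    """Return the w-shingling S(D, w): the set of distinct length-w token windows."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


doc = "a rose is a rose is a rose".split()
print(len(shingles(doc, 4)))   # prints: 3
print(len(shingles(doc, 1)))   # prints: 3
```

S(D,1) is just the vocabulary {a, rose, is}, so any reordering of the words gives the same set, whereas S(D,4) keeps local word order and so separates documents that merely share vocabulary.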
Sketches
The set of all shingles is large: bigger than the original document
Can we create a document sketch by sampling only a few shingles?
Requirement: sketch resemblance should be a good estimate of document resemblance
Choosing a sketch
Random sampling
E.g., suppose we have identical documents A & B, each with n shingles
M(A) = set of s shingles from A, chosen uniformly at random; similarly M(B)
Does it work?
For s = 1: E[|M(A) ∩ M(B)|] = 1/n, but r(A,B) = 1
So the sketch overlap is an underestimate
Verify that this is true for any value of s
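One way to carry out the verification for general s, assuming M(A) and M(B) are drawn independently and without replacement from the same n shingles: each shingle lands in M(A) with probability s/n and, independently, in M(B) with probability s/n.

```latex
\begin{align*}
E\bigl[\,|M(A) \cap M(B)|\,\bigr]
  &= \sum_{x \in S(A)} \Pr[x \in M(A)] \cdot \Pr[x \in M(B)]
   = n \cdot \frac{s}{n} \cdot \frac{s}{n}
   = \frac{s^2}{n}, \\
\frac{E\bigl[\,|M(A) \cap M(B)|\,\bigr]}{s}
  &= \frac{s}{n} \;<\; 1 = r(A,B) \quad \text{whenever } s < n .
\end{align*}
```

Setting s = 1 recovers the 1/n on the slide, and the expected overlap only reaches the full sample size when s = n, that is, when nothing is sampled away.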
Choosing a sketch
Improvements:
Random sampling + compare “special” item
Random permutations + compare “smallest” shingle
Random permutation:
Let Ω be a set (e.g., 1..N)
Pick a permutation π: Ω → Ω uniformly at random
π = {3,7,1,4,6,2,5}, A = {2,3,6}, MIN(π(A)) = ?
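The small exercise can be worked in code. This sketch assumes the listed sequence gives the images of 1..7 in order, so that element 1 maps to 3, element 2 maps to 7, and so on; under a different reading of the notation the answer would differ.

```python
def min_under_permutation(perm, subset):
    """perm[i - 1] is the image of element i; return the smallest image over subset."""
    return min(perm[x - 1] for x in subset)


perm = [3, 7, 1, 4, 6, 2, 5]      # 1 -> 3, 2 -> 7, 3 -> 1, 4 -> 4, 5 -> 6, 6 -> 2, 7 -> 5
print(min_under_permutation(perm, {2, 3, 6}))   # prints: 1
```

The images of 2, 3 and 6 are 7, 1 and 2, so the minimum is 1, attained by element 3.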
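Estimating Jaccard Coefficient
Theorem: if permutations π are picked uniformly at random from the n! possible permutations, then
$$\Pr\bigl[\min \pi(S(A)) = \min \pi(S(B))\bigr] = r(A,B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$$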
Choosing a sketch
Create a “sketch vector” (e.g., of size 200) for each document
Documents which share more than t (say 80%) corresponding vector elements are similar
For doc d, sketch_d[i] is computed as follows:
Let f map all shingles in the universe to 0..2^m
Let π_i be a specific random permutation on 0..2^m
Pick MIN π_i(f(s)) over all shingles s in d
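The sketch-vector construction can be illustrated as follows. Two substitutions here are assumptions rather than part of the slides: a truncated SHA-1 digest plays the role of f, and random linear maps modulo a prime stand in for true random permutations, which is a common practical approximation.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1


def shingle_hash(shingle):
    """f: map a shingle (tuple of tokens) to a 64-bit integer."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_permutations(count, seed=0):
    """Assumption: maps x -> (a*x + b) mod p approximate random permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
            for _ in range(count)]


def sketch(shingle_set, perms):
    """Sketch vector: for each permutation, the minimum permuted shingle hash."""
    hashes = [shingle_hash(s) for s in shingle_set]
    return [min((a * h + b) % MERSENNE_PRIME for h in hashes) for a, b in perms]


def sketch_similarity(sk1, sk2):
    """Fraction of positions where the two sketch vectors agree."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)
```

With 200 permutations, two documents would be flagged as near-duplicates when `sketch_similarity` exceeds the chosen threshold t, for example 0.8.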
Computing Sketch[i] for Doc1
Start with 64-bit shingles
Permute on the number line (0 to 2^64) with π_i
Pick the min value
[Figure: Document 1's shingles as points on number lines from 0 to 2^64, before and after permutation]
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
Test for 200 random permutations: π_1, π_2, …, π_200
Are these equal?
[Figure: number lines from 0 to 2^64 for Document 1 (min value A) and Document 2 (min value B)]
However…
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
This happens with probability: Size_of_intersection / Size_of_union
[Figure: number lines from 0 to 2^64 for Document 1 (min value A) and Document 2 (min value B)]
Finding all near-duplicates
Naïve implementation makes O(N^2) sketch comparisons
Suppose N = 100 million
How can you do it faster?
Summary of this lecture
Web Noises: Hypertext IR Principles
Template Detection: semantic and syntactic definitions
Feature weighting: information entropy of features, SST
Near-duplicate detection: Jaccard similarity, shingling, sketches
References
[1] Z. Bar-Yossef and S. Rajagopalan, "Template detection via data mining and its applications," in Proceedings of the 11th international conference on World Wide Web. Honolulu, Hawaii, USA: ACM Press, 2002.
[2] L. Yi, B. Liu, and X. Li, "Eliminating noisy information in Web pages for data mining," in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. Washington, D.C.: ACM Press, 2003.
[3] D. Gibson, K. Punera, and A. Tomkins, "The volume and evolution of web page templates," in Special interest tracks and posters of the 14th international conference on World Wide Web. Chiba, Japan: ACM Press, 2005.
[4] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, "Syntactic clustering of the Web," in Selected papers from the sixth international conference on World Wide Web. Santa Clara, California, United States: Elsevier Science Publishers Ltd., 1997.
[5] N. Shivakumar and H. Garcia-Molina, "Finding near-replicas of documents on the web," presented at Proceedings of Workshop on Web Databases (WebDB'98), Mar, 1998.
Related Resources
Html-tidy code: http://code.google.com/p/html-tidy/
Shingle code: http://research.microsoft.com/research/downloads/Details/4e0d0535-ff4c-4259-99fa-ab34f3f57d67/Details.aspx?0sr=d
Thank You! Q&A
![Page 67: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/67.jpg)
Reading Material
[1] IIR Chapter 19.6
[2] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inf. Process. Manage., vol. 24, pp. 513-523, 1988.
![Page 68: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/68.jpg)
DOM Tree
The W3C Document Object Model allows programs and scripts to dynamically access and update the content, structure, and style of documents. The document can be further processed, and the results of that processing can be incorporated back into the presented page.
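As a small illustration of "access and update", the sketch below parses a tiny page into a DOM tree, reads a node, changes it, removes a navigation block, and serializes the result. It uses Python's standard `xml.dom.minidom`, which expects well-formed markup; the sample page and its `nav` block are made up for the example, and real HTML would usually need cleaning first.

```python
from xml.dom.minidom import parseString

# A tiny, well-formed sample page (hypothetical content for illustration).
page = (
    "<html><body>"
    "<div id='nav'>Home | News</div>"
    "<p>Main topic text.</p>"
    "</body></html>"
)

dom = parseString(page)

# Access: locate a node in the tree and read its text content.
paragraph = dom.getElementsByTagName("p")[0]
text = paragraph.firstChild.data

# Update: change the content, then change the structure by removing a node.
paragraph.firstChild.data = "Updated topic text."
nav = dom.getElementsByTagName("div")[0]
nav.parentNode.removeChild(nav)

# Serialize the processed tree back to markup.
result = dom.documentElement.toxml()
print(result)
```

Removing the `div` here mirrors what a noise-elimination step might do once a block has been judged to be navigation rather than topic content.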
![Page 69: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/69.jpg)
Information Entropy
In information theory, entropy is a measure of the uncertainty associated with a random variable. The term by itself in this context usually refers to the Shannon entropy, which quantifies, in the sense of an expected value, the information contained in a message, usually in units such as bits.
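For a discrete variable the Shannon entropy is H(X) = sum over x of p(x) * log2(1 / p(x)), measured in bits. The following minimal sketch estimates it from observed symbol frequencies; the function name `shannon_entropy` is chosen for this example only.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Entropy in bits of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # Sum p * log2(1/p) over every distinct symbol that was observed.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(shannon_entropy("HT"))    # two equally likely outcomes
print(shannon_entropy("AAAA"))  # a single certain outcome
print(shannon_entropy("ABCD"))  # four equally likely outcomes
```

A fair coin carries one bit per outcome, a constant source carries none, and a uniform choice among four symbols carries two bits.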
![Page 70: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/70.jpg)
Estimating algorithm
1. Generate a set of m random permutations $\pi$
2. for each $\pi$ do
3.   compute $\pi(S(d_1))$ and $\pi(S(d_2))$
4.   check if $\min(\pi(S(d_1))) = \min(\pi(S(d_2)))$
5. end for
6. if equality was observed in k cases, estimate $r'(d_1, d_2) = k/m$
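The estimating algorithm can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not the lecture's implementation: shingles are taken to be word w-grams, and each random permutation is drawn only over the union of the two shingle sets, which is enough because the two minima depend only on the relative order of those elements. The names `shingles` and `estimate_resemblance` are made up for the example.

```python
import random

def shingles(text, w=3):
    """Set of word-level w-shingles of a text (assumed shingle definition)."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def estimate_resemblance(s1, s2, m=200, seed=0):
    """Estimate r(d1, d2) as k/m, where k counts permutations with equal minima."""
    rng = random.Random(seed)
    universe = sorted(s1 | s2)
    k = 0
    for _ in range(m):
        # One random permutation, represented as a rank for every shingle.
        order = universe[:]
        rng.shuffle(order)
        rank = {sh: i for i, sh in enumerate(order)}
        # Equal minima mean the smallest-ranked shingle lies in both sets.
        if min(rank[s] for s in s1) == min(rank[s] for s in s2):
            k += 1
    return k / m

d1 = shingles("a rose is a rose is a rose", w=4)
d2 = shingles("a rose is a flower which is a rose", w=4)
print(estimate_resemblance(d1, d2))
```

The probability that the minima coincide under a random permutation equals the resemblance |S(d1) ∩ S(d2)| / |S(d1) ∪ S(d2)|, so k/m converges to it as m grows.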
![Page 71: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/71.jpg)
Some other approaches
For set W of shingles, let $MIN_s(W)$ = set of s smallest shingles in W
  Assume documents have at least s shingles
Define
  $M(A) = MIN_s(S(A))$
  $M(AB) = MIN_s(M(A) \cup M(B))$
  $r'(A,B) = |M(AB) \cap M(A) \cap M(B)| / s$
By increasing sample size (s) we can make it very unlikely that r'(A,B) is significantly different from r(A,B)
  100-200 shingles is sufficient in practice
Compute a fingerprint f for each shingle (e.g., Rabin fingerprint)
  40 bits is usually enough to keep estimates reasonably accurate
  Fingerprint also eliminates need for random permutation
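A minimal sketch of this bottom-s approach follows. One loud assumption: a truncated SHA-1 hash stands in for the Rabin fingerprint named on the slide, since the standard library has no Rabin implementation; the helper names are hypothetical.

```python
import hashlib

def fingerprint(shingle, bits=40):
    """40-bit fingerprint of a shingle (truncated SHA-1, standing in for Rabin)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)

def min_s(values, s):
    """MIN_s(W): the s smallest elements of W."""
    return set(sorted(values)[:s])

def sketch(shingle_set, s):
    """M(A) = MIN_s over the fingerprints of S(A)."""
    return min_s({fingerprint(x) for x in shingle_set}, s)

def estimate(m_a, m_b, s):
    """r'(A,B) = |M(AB) ∩ M(A) ∩ M(B)| / s with M(AB) = MIN_s(M(A) ∪ M(B))."""
    m_ab = min_s(m_a | m_b, s)
    return len(m_ab & m_a & m_b) / s

# Two synthetic shingle sets with true resemblance 500 / 1500.
A = {f"sh{i}" for i in range(1000)}
B = {f"sh{i}" for i in range(500, 1500)}
s = 200
print(estimate(sketch(A, s), sketch(B, s), s))
```

Because the fingerprint already scatters shingles pseudo-randomly, taking the s smallest fingerprints plays the role of a single random permutation, which is why no explicit permutation is needed.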
![Page 72: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/72.jpg)
Finding all near-duplicates
Naïve implementation makes O(N^2) sketch comparisons
  Suppose N = 100 million: roughly 5 × 10^15 pairwise comparisons
Divide-Compute-Merge (DCM)
  Divide data into batches that fit in memory
  Operate on each individual batch and write out partial results in sorted order
  Merge partial results
  Generalization of external sorting
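The divide, compute, and merge phases can be sketched as a plain external sort. This is an illustration under assumptions: records are newline-free strings, the "compute" step is just sorting, and `dcm_sort`, `read_run`, and the run file names are invented for the example.

```python
import heapq
import os
import tempfile

def read_run(path):
    """Stream one sorted run back from disk, one record per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

def dcm_sort(records, batch_size, workdir):
    """Sort records that may not fit in memory: divide, compute, merge."""
    run_paths = []
    batch = []

    def flush():
        # Compute: process one in-memory batch and write it out in sorted order.
        path = os.path.join(workdir, f"run{len(run_paths)}.txt")
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(rec + "\n" for rec in sorted(batch))
        run_paths.append(path)
        batch.clear()

    # Divide: cut the input into batches of at most batch_size records.
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            flush()
    if batch:
        flush()

    # Merge: lazily combine the sorted runs into one sorted stream.
    yield from heapq.merge(*(read_run(p) for p in run_paths))

with tempfile.TemporaryDirectory() as d:
    data = ["c,doc1", "a,doc2", "b,doc1", "a,doc1", "d,doc3"]
    print(list(dcm_sort(data, batch_size=2, workdir=d)))
```

Only one batch plus one line per run is held in memory at a time, which is the property that lets the same pattern scale to inputs far larger than RAM.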
![Page 73: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/73.jpg)
DCM Steps
Input sketches: doc1: s11,s12,…,s1k; doc2: s21,s22,…,s2k; …
Invert: s11,doc1; s12,doc1; …; s1k,doc1; s21,doc2; …
Sort on shingle_fp: t1,doc1; t1,docX; …; t2,doc1; t2,docY; …
Invert and pair: doc1,docX,1; doc1,docZ,1; …; doc1,docY,1; …
Sort on <docid1,docid2>: doc1,docX,1; doc1,docX,1; …; doc1,docY,1; …
Merge: doc1,docX,2; doc1,docY,10; …
![Page 74: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/74.jpg)
Finding all near-duplicates
1. Calculate a sketch for each document
2. For each document, write out the pairs <shingle_fp, docId>
3. Sort by shingle_fp (DCM)
4. In a sequential scan, generate triplets of the form <docId1,docId2,1> for pairs of docs that share a shingle (DCM)
5. Sort on <docId1,docId2> (DCM)
6. Merge the triplets with common docids to generate triplets of the form <docId1,docId2,count> (DCM)
7. Output document pairs whose resemblance exceeds the threshold
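The seven steps can be traced on a toy input with the in-memory sketch below. It is a simplification under assumptions: sketches are already computed (step 1) and given as sets of fingerprints, in-memory sorts stand in for the DCM sorts, and resemblance is approximated as shared fingerprints divided by the size of the union of the two sketches. The function name `near_duplicates` is hypothetical.

```python
from itertools import combinations, groupby

def near_duplicates(sketches, threshold):
    """sketches: dict docId -> set of shingle fingerprints (step 1 already done)."""
    # Steps 2-3: write out <shingle_fp, docId> pairs and sort by shingle_fp.
    pairs = sorted((fp, doc) for doc, sk in sketches.items() for fp in sk)

    # Step 4: sequential scan; docs sharing a shingle yield <docId1, docId2, 1>.
    triplets = []
    for _, run in groupby(pairs, key=lambda p: p[0]):
        docs = sorted(doc for _, doc in run)
        triplets.extend((d1, d2, 1) for d1, d2 in combinations(docs, 2))

    # Step 5: sort on <docId1, docId2>.
    triplets.sort()

    # Steps 6-7: merge equal doc pairs into a count, then apply the threshold.
    result = []
    for (d1, d2), group in groupby(triplets, key=lambda t: t[:2]):
        shared = sum(t[2] for t in group)
        union = len(sketches[d1]) + len(sketches[d2]) - shared
        resemblance = shared / union
        if resemblance > threshold:
            result.append((d1, d2, resemblance))
    return result

toy = {
    "doc1": {1, 2, 3, 4},
    "doc2": {1, 2, 3, 5},
    "doc3": {7, 8, 9, 10},
}
print(near_duplicates(toy, threshold=0.5))
```

Only document pairs that share at least one shingle ever produce a triplet, which is what avoids the O(N^2) comparison of all pairs.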
![Page 75: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/75.jpg)
DCM algorithm
1. for each random permutation $\pi$ do
2.   create a file $f_\pi$
3.   for each document d do
4.     write out $\langle s, d \rangle$ to $f_\pi$, where $s = \min(\pi(S(d)))$
5.   end for
6.   sort $f_\pi$ using key s -- this results in contiguous blocks with fixed s containing all associated d
7.   create a file $g_\pi$
8.   for each pair $(d_1, d_2)$ within a run of $f_\pi$ having a given s do
9.     write out a document-pair record $(d_1, d_2)$ to $g_\pi$
10.   end for
11.   sort $g_\pi$ on key $(d_1, d_2)$
12. end for
13. merge $g_\pi$ for all $\pi$ in $(d_1, d_2)$ order, counting the number of $(d_1, d_2)$ entries
![Page 76: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/76.jpg)
Some optimizations
"Invert and Pair" is the most expensive step
We can speed it up by eliminating very common shingles
  Common headers, footers, etc.
  Do it as a preprocessing step
Also, eliminate exact duplicates up front
Probabilistic Counting [5]
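Both preprocessing ideas are easy to sketch. The version below is illustrative only: the cut-off `max_doc_fraction` is an assumed parameter, not a value from the lecture, and exact duplicates are detected by hashing the full page text with SHA-1.

```python
import hashlib
from collections import Counter

def drop_common_shingles(sketches, max_doc_fraction=0.5):
    """Remove fingerprints that occur in more than max_doc_fraction of documents."""
    n_docs = len(sketches)
    doc_freq = Counter(fp for sk in sketches.values() for fp in sk)
    common = {fp for fp, n in doc_freq.items() if n / n_docs > max_doc_fraction}
    return {doc: sk - common for doc, sk in sketches.items()}

def drop_exact_duplicates(docs):
    """Keep the first docId for each distinct page content (dict docId -> text)."""
    seen = set()
    kept = []
    for doc_id, text in docs.items():
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc_id)
    return kept

sketches = {"a": {1, 2, 99}, "b": {3, 4, 99}, "c": {5, 6, 99}, "d": {1, 7}}
print(drop_common_shingles(sketches))
print(drop_exact_duplicates({"d1": "same text", "d2": "other", "d3": "same text"}))
```

A shingle shared by many documents generates a number of pairs that grows quadratically with its document frequency, so removing a handful of template shingles can shrink the "Invert and Pair" output considerably.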
![Page 77: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/77.jpg)
Detecting duplicate pages
![Page 78: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/78.jpg)
State of the art Technology
![Page 79: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/79.jpg)
Volume and Evolution of Page Templates
Our results show that 40–50% of the content on the web is template content.
Over the last eight years, the fraction of template content has doubled, and the growth shows no sign of abating.
Text, links, and total HTML bytes within templates are all growing as a fraction of total content at a rate of between 6 and 8% per year.
Question: how to design the experiment to reach these conclusions?
![Page 80: 幻灯片 1net.pku.edu.cn/~wbia/2010/Slides/Lecture1… · PPT file · Web view · 2011-06-08Invert t1,doc1 t1,docX … t2,doc1 t2,docY ... Document level Character/word level Shingle](https://reader034.vdocuments.mx/reader034/viewer/2022050918/5b07c1e87f8b9a520e8b975e/html5/thumbnails/80.jpg)