Mining di dati web, Lecture 6
Clustering of Web Documents
Content-Based Algorithms
Academic Year 2005/2006
Document Clustering
Classical clustering algorithms are not suitable for high dimensional data.
Dimensionality Reduction is a viable but expensive solution.
Different kinds of clustering exist: partitional (or top-down) and hierarchical (or bottom-up).
Partitional Clustering
Directly decomposes the data set into a set of disjoint clusters.
The most famous is the K-Means algorithm.
Partitional algorithms are usually linear in the number of elements to cluster.
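As an illustration, here is a minimal K-Means sketch in plain Python (the 1-D sample points, k = 2, and the fixed seed are illustrative assumptions, not from the slides):

```python
import random

def k_means(points, k, iters=20, seed=0):
    """Lloyd's K-Means on 1-D points: assign every point to its nearest
    centroid, recompute each centroid as its cluster mean, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest centroid index for every point.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated groups, around 0.5 and around 9.5.
data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
print(k_means(data, 2))
```

Each iteration touches every point once, which is where the linear behavior in the number of elements comes from.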
Hierarchical Clustering
Proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters.
The clustering methods differ in the rule used to decide which two small clusters are merged or which large cluster is split.
The end result of the algorithm is a tree of clusters called a dendrogram, which shows how the clusters are related.
By cutting the dendrogram at a desired level a clustering of the data items into disjoint groups is obtained.
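The bottom-up process can be sketched in a few lines of plain Python: repeatedly merge the two closest clusters, which is equivalent to cutting the dendrogram at the level that leaves the desired number of groups (the single-linkage rule and the 1-D sample points are illustrative assumptions):

```python
def single_linkage(points, num_clusters):
    """Bottom-up clustering of 1-D points: start from singleton clusters
    and repeatedly merge the two clusters at smallest single-link distance
    (the minimum distance between any two of their members) until only
    num_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

print(single_linkage([1.0, 1.2, 5.0, 5.1, 9.9], 3))  # [[1.0, 1.2], [5.0, 5.1], [9.9]]
```

Recording the sequence of merges (and the distance at each merge) would give exactly the dendrogram described above.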
Dendrogram Example
Clustering in Web Content Mining
Possible uses of clustering in Web Content Mining:
Automatic document classification.
Search engine results presentation.
Search engine optimization: collection reorganization and index reorganization.
Dimensionality reduction.
Advanced Document Clustering Techniques
Co-Clustering:
Dhillon, I. S., Mallela, S., and Modha, D. S. 2003. Information-theoretic co-clustering. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), Washington, D.C., August 24-27, 2003. ACM Press, New York, NY, 89-98.
Syntactic Clustering:
Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. 1997. Syntactic clustering of the Web. Computer Networks and ISDN Systems 29, 8-13 (Sep. 1997), 1157-1166.
Co-Clustering
Idea: represent a collection with its term-document matrix and then cluster both rows and columns.
It has a strong theoretical foundation.
It is based on the assumption that the best clustering is the one that leads to the largest mutual information between the clustered random variables.
Information Theory

Entropy of a random variable X with probability distribution p(x):

    H(p) = - Σ_x p(x) log p(x)

The Kullback-Leibler (KL) divergence, or relative entropy, between two probability distributions p and q:

    KL(p || q) = Σ_x p(x) log( p(x) / q(x) )

Mutual information between random variables X and Y:

    I(X;Y) = Σ_x Σ_y p(x,y) log( p(x,y) / (p(x) p(y)) )
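These three quantities can be checked numerically with a small plain-Python sketch (base-2 logarithms, so everything is in bits; the example distributions are illustrative):

```python
from math import log2

def entropy(p):
    # H(p) = - sum_x p(x) log p(x); terms with p(x) = 0 contribute 0.
    return -sum(px * log2(px) for px in p if px > 0)

def kl(p, q):
    # D(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0.
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

def mutual_information(joint):
    # I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(joint[i][j] * log2(joint[i][j] / (px[i] * py[j]))
               for i in range(len(px)) for j in range(len(py))
               if joint[i][j] > 0)

print(entropy([0.5, 0.5]))                           # 1 bit: a fair coin
print(kl([0.5, 0.5], [0.5, 0.5]))                    # 0: identical distributions
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))  # 1 bit: Y determines X
```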
Contingency Table

Let X and Y be discrete random variables that take values in the sets {x1, x2, ..., xm} and {y1, y2, ..., yn}, and let p(X,Y) denote the joint probability distribution between X and Y. For example:

    p(X,Y) =
    [ .05 .05 .05  0   0   0  ]
    [ .05 .05 .05  0   0   0  ]
    [  0   0   0  .05 .05 .05 ]
    [  0   0   0  .05 .05 .05 ]
    [ .04 .04  0  .04 .04 .04 ]
    [ .04 .04 .04  0  .04 .04 ]
Problem Formulation
Co-clustering is concerned with simultaneously clustering X into (at most) k disjoint clusters and Y into (at most) l disjoint clusters.
Let the k clusters of X be written as {x'1, x'2, ..., x'k} and the l clusters of Y as {y'1, y'2, ..., y'l}.
The pair (CX, CY) is called a co-clustering, where

    CX: {x1, x2, ..., xm} → {x'1, x'2, ..., x'k}
    CY: {y1, y2, ..., yn} → {y'1, y'2, ..., y'l}

An optimal co-clustering minimizes the loss in mutual information

    I(X;Y) - I(X';Y'),  where X' = CX(X) and Y' = CY(Y).
Lemma 2.1
For a fixed co-clustering (CX, CY), we can write the loss in mutual information as

    I(X;Y) - I(X';Y') = D( p(X,Y) || q(X,Y) ),

where D(·||·) denotes the Kullback-Leibler divergence and q(X,Y) is a distribution of the form

    q(x,y) = p(x',y') p(x|x') p(y|y'),  where x ∈ x' and y ∈ y'.
The Approximation Matrix q(X,Y)

    q(x,y) = p(x',y') p(x|x') p(y|y')

with

    p(x') = Σ_{x ∈ x'} p(x)
    p(y') = Σ_{y ∈ y'} p(y)
    p(x|x') = p(x) / p(x')
    p(y|y') = p(y) / p(y')
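A sketch of how q(X,Y) can be built from p and the two cluster maps (plain Python; the function name and the tiny 2x2 example are illustrative assumptions, and every cluster is assumed to have nonzero probability mass):

```python
def approximation_q(p, row_cluster, col_cluster):
    """q(x,y) = p(x',y') * p(x|x') * p(y|y') for a joint distribution p
    (list of rows) and cluster maps row_cluster[x] -> x', col_cluster[y] -> y'.
    Assumes every cluster has nonzero probability mass."""
    m, n = len(p), len(p[0])
    k, l = max(row_cluster) + 1, max(col_cluster) + 1
    px = [sum(row) for row in p]        # marginal p(x)
    py = [sum(col) for col in zip(*p)]  # marginal p(y)
    pxc, pyc = [0.0] * k, [0.0] * l     # cluster marginals p(x'), p(y')
    pj = [[0.0] * l for _ in range(k)]  # cluster joint p(x',y')
    for x in range(m):
        pxc[row_cluster[x]] += px[x]
    for y in range(n):
        pyc[col_cluster[y]] += py[y]
    for x in range(m):
        for y in range(n):
            pj[row_cluster[x]][col_cluster[y]] += p[x][y]
    return [[pj[row_cluster[x]][col_cluster[y]]
             * (px[x] / pxc[row_cluster[x]])
             * (py[y] / pyc[col_cluster[y]])
             for y in range(n)] for x in range(m)]

# With singleton clusters the approximation is exact: q == p.
p = [[0.5, 0.0], [0.0, 0.5]]
print(approximation_q(p, [0, 1], [0, 1]))
```

Note that q preserves the row and column marginals of p by construction, which is the property used later in the soundness argument.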
Proof of Lemma 2.1

Given that p(x',y') = Σ_{x ∈ x'} Σ_{y ∈ y'} p(x,y), we have:

    I(X;Y) - I(X';Y')
      = Σ_{x'} Σ_{y'} Σ_{x ∈ x'} Σ_{y ∈ y'} p(x,y) log( p(x,y) / (p(x) p(y)) )
        - Σ_{x'} Σ_{y'} ( Σ_{x ∈ x'} Σ_{y ∈ y'} p(x,y) ) log( p(x',y') / (p(x') p(y')) )
      = Σ_{x'} Σ_{y'} Σ_{x ∈ x'} Σ_{y ∈ y'} p(x,y) log( p(x,y) / ( p(x',y') (p(x)/p(x')) (p(y)/p(y')) ) )
      = Σ_{x,y} p(x,y) log( p(x,y) / q(x,y) )
      = D( p(X,Y) || q(X,Y) ),

where the last step uses the definition q(x,y) = p(x',y') p(x|x') p(y|y').
Some Useful Equalities
Co-Clustering Algorithm
Co-Clustering Soundness
Theorem: the co-clustering algorithm monotonically decreases the loss in mutual information (the objective function value).
The marginals p(x) and p(y) are preserved at every step (q(x) = p(x) and q(y) = p(y)).
Co-Clustering Complexity

The algorithm is computationally efficient, even for sparse data.
If nz is the number of nonzeros in the input joint distribution p(X,Y) and t is the number of iterations, the cost is O(nz * t * (k + l)).
Experimentally, t ≈ 20.
A Toy Example

The joint distribution:

    p(x,y) =
    [ .05 .05 .05  0   0   0  ]
    [ .05 .05 .05  0   0   0  ]
    [  0   0   0  .05 .05 .05 ]
    [  0   0   0  .05 .05 .05 ]
    [ .04 .04  0  .04 .04 .04 ]
    [ .04 .04 .04  0  .04 .04 ]

The co-cluster joint distribution (row clusters {x1,x2}, {x3,x4}, {x5,x6}; column clusters {y1,y2,y3}, {y4,y5,y6}):

    p(x̂,ŷ) =
    [ .3  0 ]
    [  0 .3 ]
    [ .2 .2 ]

The resulting approximation:

    q(x,y) =
    [ .054 .054 .042  0    0    0   ]
    [ .054 .054 .042  0    0    0   ]
    [  0    0    0   .042 .054 .054 ]
    [  0    0    0   .042 .054 .054 ]
    [ .036 .036 .028 .028 .036 .036 ]
    [ .036 .036 .028 .028 .036 .036 ]

The loss in mutual information is D(p || q) = 0.0957 bits.
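These numbers can be reproduced: the plain-Python sketch below (base-2 logarithms) rebuilds q from p and the row/column clusterings, then evaluates the loss in mutual information:

```python
from math import log2

# Joint distribution p(x,y) from the toy example.
p = [[.05, .05, .05, 0, 0, 0],
     [.05, .05, .05, 0, 0, 0],
     [0, 0, 0, .05, .05, .05],
     [0, 0, 0, .05, .05, .05],
     [.04, .04, 0, .04, .04, .04],
     [.04, .04, .04, 0, .04, .04]]

# Co-clustering: row clusters {1,2}, {3,4}, {5,6}; column clusters {1,2,3}, {4,5,6}.
rc = [0, 0, 1, 1, 2, 2]
cc = [0, 0, 0, 1, 1, 1]

px = [sum(row) for row in p]
py = [sum(col) for col in zip(*p)]
pxc = [sum(px[x] for x in range(6) if rc[x] == i) for i in range(3)]
pyc = [sum(py[y] for y in range(6) if cc[y] == j) for j in range(2)]
pj = [[sum(p[x][y] for x in range(6) for y in range(6)
           if rc[x] == i and cc[y] == j) for j in range(2)] for i in range(3)]

# q(x,y) = p(x',y') p(x|x') p(y|y')
q = [[pj[rc[x]][cc[y]] * (px[x] / pxc[rc[x]]) * (py[y] / pyc[cc[y]])
      for y in range(6)] for x in range(6)]

# Loss in mutual information: D(p || q), summing only where p(x,y) > 0.
loss = sum(p[x][y] * log2(p[x][y] / q[x][y])
           for x in range(6) for y in range(6) if p[x][y] > 0)
print(round(loss, 4))  # 0.0957
```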
A Real Example: Before
A Real Example: After
Application: Dimensionality Reduction

Each document is represented as a bag-of-words vector of length m. There are two ways to reduce it to k features:

Feature Selection: select the "best" words and throw away the rest (frequency-based pruning, information-criterion-based pruning), keeping word #1 through word #k.

Feature Clustering: do not throw away words; cluster words instead, and use cluster #1 through cluster #k as features.
Syntactic Clustering
Finding syntactically similar documents.
The approach is based on two different similarity measures: resemblance and containment.
A sketch of a few hundred bytes is kept for each document.
Document Model
We view each document as a sequence of words.
Start by lexically analyzing the document into a canonical sequence of tokens.
This canonical form ignores minor details such as formatting, HTML commands, and capitalization.
We then associate with every document D a set of subsequences of tokens S(D,w).
Shingling
A contiguous subsequence of tokens contained in D is called a shingle.
Given a document D, we define its w-shingling S(D,w) as the set of all unique shingles of size w contained in D.
For instance, the 4-shingling of (a, rose, is, a, rose, is, a, rose) is the set {(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is)}.
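The w-shingling can be computed directly from this definition; a plain-Python sketch:

```python
def w_shingling(tokens, w):
    """S(D, w): the set of all unique contiguous token subsequences
    (shingles) of length w in the document."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

doc = ("a", "rose", "is", "a", "rose", "is", "a", "rose")
print(w_shingling(doc, 4))
# {('a','rose','is','a'), ('rose','is','a','rose'), ('is','a','rose','is')}
```

The document yields five windows of length 4, but only three distinct shingles, matching the example above.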
Resemblance
For a given shingle size, the resemblance r of two documents A and B is defined as

    r(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

where |A| denotes the size of set A.
Containment
For a given shingle size, the containment c of two documents A and B is defined as

    c(A,B) = |S(A) ∩ S(B)| / |S(A)|

where |A| denotes the size of set A.
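Both measures are one-line set computations; a small plain-Python sketch (the example shingle sets are illustrative):

```python
def resemblance(sa, sb):
    # r(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
    return len(sa & sb) / len(sa | sb)

def containment(sa, sb):
    # c(A,B) = |S(A) ∩ S(B)| / |S(A)|
    return len(sa & sb) / len(sa)

A = {1, 2, 3, 4}
B = {1, 2, 3, 4, 5, 6}   # A is a subset of B
print(resemblance(A, B))  # 4/6
print(containment(A, B))  # 1.0: A is fully contained in B
```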
Properties of r and c
The resemblance is a number between 0 and 1.
r(A,A) = 1.
The containment is a number between 0 and 1; if A ⊆ B then c(A,B) = 1.
Experiments show that the definitions capture the informal notions of "roughly the same" and "roughly contained".
Resemblance Distance
Resemblance is not transitive: version 100 of a document is probably quite different from version 1.
The resemblance distance d(A,B) = 1 - r(A,B) is a metric; in particular, it obeys the triangle inequality.
Resemblance and Containment Estimates

Fix a shingle size w, and let U be the set of all shingles of size w. U is countable, so we can view its elements as numbers. Fix a parameter s. For a set W ⊆ U define

    MINs(W) = the set of the s smallest elements of W, if |W| ≥ s; W otherwise,

where "smallest" refers to the numerical order on U, and define

    MODm(W) = the set of elements of W that are 0 mod m.
Resemblance and Containment Estimates

Theorem. Let π: U → U be a permutation of U chosen uniformly at random. Let F(A) = MINs(π(S(A))) and V(A) = MODm(π(S(A))); define F(B) and V(B) analogously. Then

    | MINs(F(A) ∪ F(B)) ∩ F(A) ∩ F(B) |  /  | MINs(F(A) ∪ F(B)) |

is an unbiased estimate of the resemblance of A and B;

    | V(A) ∩ V(B) |  /  | V(A) ∪ V(B) |

is also an unbiased estimate of the resemblance of A and B; and

    | V(A) ∩ V(B) |  /  | V(A) |

is an unbiased estimate of the containment of A in B.
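A sketch of the MINs-based estimator in plain Python. The explicit shuffled-dictionary permutation, the size-100 universe, s = 10, and the 200 repetitions are illustrative assumptions: a single permutation yields one unbiased sample, so we average many samples to get close to the true resemblance.

```python
import random

def make_permutation(universe, seed):
    """A permutation pi of a finite, known universe, chosen at random."""
    items = sorted(universe)
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(items, shuffled))

def min_s(values, s):
    # MINs(W): the s smallest elements of W (all of W if |W| < s).
    return set(sorted(values)[:s])

def estimate_resemblance(SA, SB, pi, s):
    """One sample of |MINs(F(A) u F(B)) n F(A) n F(B)| / |MINs(F(A) u F(B))|."""
    FA = min_s({pi[x] for x in SA}, s)
    FB = min_s({pi[x] for x in SB}, s)
    union_mins = min_s(FA | FB, s)
    return len(union_mins & FA & FB) / len(union_mins)

universe = range(100)
SA = set(range(0, 60))   # S(A)
SB = set(range(30, 90))  # S(B); true resemblance = 30/90 = 1/3
estimates = [estimate_resemblance(SA, SB, make_permutation(universe, seed), s=10)
             for seed in range(200)]
print(sum(estimates) / len(estimates))  # close to 1/3
```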
The Sketch
Choose a random permutation π of U. The sketch of a document D consists of the set F(D) and/or V(D).
F(D) has fixed size but allows only the estimation of resemblance.
V(D) has variable size: it grows as D grows.
Practical Sketch Representation

Canonicalize documents by removing HTML formatting and converting all words to lowercase.
The shingle size w is 10.
Use a 40-bit fingerprint function, based on Rabin fingerprints, enhanced to behave as a random permutation; a shingle is then represented by its fingerprint value.
The modulus m is set to 25.
Rabin Fingerprints
Based on irreducible polynomials with coefficients in the Galois field GF(2).
Let A = (a1, ..., am) be a binary string with a1 = 1, and let

    A(t) = a1 t^(m-1) + a2 t^(m-2) + ... + am.

Let P(t) be an irreducible polynomial of degree k over Z2. Then

    f(A) = A(t) mod P(t)

is the Rabin fingerprint of A.
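A sketch of the fingerprint computation with polynomials over GF(2) encoded as Python ints (bit i is the coefficient of t^i). The degree-3 modulus P(t) = t^3 + t + 1 is an illustrative choice, far smaller than a practical 40-bit fingerprint:

```python
def poly_mod(a, p):
    """Remainder of polynomial a modulo p, coefficients in GF(2).
    Subtraction in GF(2) is XOR, so long division is repeated XOR
    of p shifted up to the current leading term of a."""
    dp = p.bit_length() - 1  # degree of P(t)
    while a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a

def rabin_fingerprint(bits, p):
    """f(A) = A(t) mod P(t) for a binary string A = (a1, ..., am),
    where A(t) = a1 t^(m-1) + ... + am."""
    a = 0
    for b in bits:
        a = (a << 1) | b
    return poly_mod(a, p)

# P(t) = t^3 + t + 1 (irreducible over GF(2)), encoded as 0b1011.
P = 0b1011
print(bin(rabin_fingerprint([1, 0, 0, 0, 1], P)))  # 0b111, a 3-bit fingerprint
```

Here t^4 + 1 mod (t^3 + t + 1) = t^2 + t + 1, i.e. 0b111; any string whose polynomial is a multiple of P(t) fingerprints to 0.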
Shingle Clustering
Retrieve every document on the Web.
Calculate the sketch for each document.
Compare the sketches for each pair of documents to see if they exceed a threshold of resemblance.
Combine the pairs of similar documents to make the clusters of similar documents.
Efficiency

With 30,000,000 HTML documents, a pairwise comparison would involve on the order of 10^15 comparisons.
Just one bit per document in a data structure requires about 4 MBytes; a sketch of 800 bytes per document requires 24 GBytes.
One millisecond of computation per document translates into 8 hours of computation.
Any algorithm involving random disk accesses or that causes paging activity is therefore completely infeasible.
Divide, Compute, Merge
Take the data and divide it into pieces of size m, so that each piece fits entirely in memory.
Compute on each piece separately, then merge the results.
The merging process is I/O bound: each merge pass is linear, and log(n/m) passes are required.
The overall cost is O(n log(n/m)).
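A toy version of divide, compute, merge in plain Python. Lists stand in for on-disk sorted runs; note that heapq.merge performs a single streaming k-way merge pass, whereas the log(n/m) bound above counts pairwise merge passes:

```python
import heapq
import itertools

def external_sort(items, chunk_size):
    """Divide-compute-merge: sort fixed-size in-memory chunks, then do a
    streaming merge of the sorted runs (on disk, each run would be a file
    read sequentially, making the merge I/O bound)."""
    it = iter(items)
    runs = []
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        runs.append(sorted(chunk))   # each piece fits in memory
    return list(heapq.merge(*runs))  # the merge pass over all runs

print(external_sort([5, 3, 8, 1, 9, 2, 7], chunk_size=3))  # [1, 2, 3, 5, 7, 8, 9]
```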
The “Real” Clustering Algorithm (Phase I)

Calculate a sketch for every document. This step is linear in the total length of the documents.
The “Real” Clustering Algorithm (Phase II)

Produce a list of all the shingles and the documents they appear in, sorted by shingle value. To do this, the sketch of each document is expanded into a list of <shingle value, document ID> pairs, and the list is sorted using the divide, sort, merge approach.
Remember: the shingle value is the Rabin fingerprint of the shingle.
The “Real” Clustering Algorithm (Phase III)

Generate a list of all the pairs of documents that share any shingles, along with the number of shingles they have in common. To do this, take the file of sorted <shingle value, document ID> pairs: for each shingle that appears in multiple documents, generate the complete set of <ID, ID, 1> triplets. Then apply the divide, sort, merge procedure (summing the counts for matching ID-ID pairs) to produce a single file of all <ID, ID, count> triplets, sorted by the first document ID.
This phase requires the greatest amount of disk space, because the initial expansion into triplets is quadratic in the number of documents sharing a shingle and initially produces many triplets with a count of 1.
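The triplet expansion can be sketched as follows (plain Python; an in-memory dict stands in for the sorted <shingle, ID> file, and the shingle values are illustrative):

```python
from collections import Counter
from itertools import combinations

def common_shingle_counts(shingle_to_docs):
    """From a shingle -> document IDs listing, produce <ID, ID, count>
    triplets: for every shingle shared by several documents, emit each
    document pair, then sum the counts per pair."""
    counts = Counter()
    for docs in shingle_to_docs.values():
        # Quadratic in the number of documents sharing the shingle!
        for a, b in combinations(sorted(docs), 2):
            counts[(a, b)] += 1
    return sorted((a, b, c) for (a, b), c in counts.items())

index = {0xA1: [1, 2], 0xB2: [1, 2, 3], 0xC3: [2, 3]}
print(common_shingle_counts(index))
# [(1, 2, 2), (1, 3, 1), (2, 3, 2)]
```

The Counter plays the role of the "summing up the counts" merge step; on disk this aggregation would itself be done with divide, sort, merge.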
The “Real” Clustering Algorithm (Phase IV)
Produce the complete clustering. Examine each <ID,ID,count> triplet and decide if the document pair exceeds our threshold for resemblance. If it does, we add a link between the two documents in a union-find algorithm. The connected components output by the union-find algorithm form the final clusters. This phase has the greatest memory requirements because we need to hold the entire union-find data structure in memory.
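A sketch of this final phase with a small union-find (plain Python). Thresholding directly on the shared-shingle count is a simplification of the resemblance threshold, and the triplets are illustrative:

```python
def cluster_by_resemblance(triplets, num_shared_threshold):
    """Union-find over <ID, ID, count> triplets: link every pair whose
    common-shingle count meets the threshold; the connected components
    are the final clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, count in triplets:
        find(a); find(b)  # register both documents, even if not linked
        if count >= num_shared_threshold:
            parent[find(a)] = find(b)  # union the two components

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())

triplets = [(1, 2, 5), (2, 3, 1), (4, 5, 7)]
print(cluster_by_resemblance(triplets, num_shared_threshold=3))
# [[1, 2], [3], [4, 5]]
```

In the real algorithm the count would first be converted into an estimated resemblance before comparing against the 50% threshold.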
Performance Issues

Common shingles: some shingles are shared by more than 1,000 documents, and the number of document ID pairs is quadratic in the number of documents sharing a shingle. Remedy: remove shingles that are more frequent than a given threshold.

Identical documents: identical documents do not need to be handled pairwise. Remedy: remove identical documents from the collection, i.e. documents having the same fingerprint.

Super shingles: compute a meta-sketch by shingling the shingles. Documents sharing shingles in the meta-sketch are very likely to have a high resemblance value. The super-shingle size must be chosen carefully.
Super-shingle-based Clustering

Compute the list of super shingles for each document.
Expand the list of super shingles into a sorted list of <super shingle, ID> pairs.
Any documents that share a super shingle resemble each other and are added to the same cluster.
Problems with Super-shingles

Super shingles are not as flexible or as accurate as computing resemblance with regular sketches.
They do not work well for short documents: short documents do not contain many shingles, so even regular shingles are not accurate for computing resemblance.
Super-shingles represent sequences of shingles, so shorter documents, with fewer super shingles, have a lower probability of producing a common super shingle.
Super-shingles cannot detect containment.
A Nice Application: Characterizing Page Changes

We can compare sketches over time to characterize the behavior of pages on the Web. For instance, we can observe a page at different times and see how similar each version is to the preceding one.
We can thus answer some basic questions: How often do pages change? How much do they change per time interval? How often do pages move, within a server or between servers? How long do pages live? How many are created? How many die?
Experiments

30,000,000 HTML pages, 150 GBytes (about 5 KB per document).
The file containing just the URLs of the documents took up 1.8 GBytes (an average of 60 bytes per URL).
10-word shingles, 5-byte fingerprints; 1 in 25 of the shingles found were kept.
600M shingles; the raw sketch files took up 3 GBytes.
Experiments

In the third phase, the creation of <ID, ID, count> triplets, the storage required was 20 GBytes; the final file took 6 GBytes.
The final clustering phase is the most memory intensive; the resulting file took up less than 100 MBytes.
Experiments

With the resemblance threshold set to 50%, 3.6 million clusters were found, containing a total of 12.3 million documents.
2.1 million clusters contained only identical documents (5.3 million documents).
The remaining 1.5 million clusters contained 7 million documents (a mixture of exact duplicates and similar documents).
Experiments

| Phase                 | Time (CPU-days) | Parallelizable |
|-----------------------|-----------------|----------------|
| Sketching             | 4.6             | yes            |
| Duplicate elimination | 0.3             |                |
| Shingle merging       | 1.7             | yes            |
| ID-ID pair formation  | 0.7             |                |
| ID-ID merging         | 2.6             | yes            |
| Cluster formation     | 0.5             |                |
| Total                 | 10.5            |                |