[ppt]
TRANSCRIPT
![Page 1: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/1.jpg)
Web Data Mining
Clustering of Web Documents:
Similarity Metrics
Academic Year 2006/2007
![Page 2: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/2.jpg)
Learning: What Does it Mean?
Definition from Wikipedia: Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn".
More specifically, machine learning is a method for creating computer programs by the analysis of data sets.
Machine learning overlaps heavily with statistics, since both fields study the analysis of data, but unlike statistics, machine learning is concerned with the algorithmic complexity of computational implementations.
![Page 3: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/3.jpg)
And… What about Clustering?
Again from Wikipedia: Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure.
Machine learning typically regards data clustering as a form of unsupervised learning.
![Page 4: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/4.jpg)
What is clustering?
(Figure: points grouped into clusters, with d_intra labeling the within-cluster distance and d_inter the between-cluster distance.)
![Page 5: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/5.jpg)
Clustering Example, where Euclidean distance is the distance metric
Hierarchical clustering dendrogram
![Page 6: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/6.jpg)
Clustering: K-Means
Randomly generate k clusters and determine the cluster centers, or directly generate k seed points as cluster centers.
Assign each point to the nearest cluster center.
Recompute the new cluster centers.
Repeat until some convergence criterion is met (usually that the assignments haven't changed).
![Page 7: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/7.jpg)
K-means Clustering
Each point is assigned to the cluster with the closest center point. The number K must be specified.
Basic algorithm:
1. Select K points as the initial center points.
Repeat
2. Form K clusters by assigning each point to the closest center point.
3. Recompute the center point of each cluster.
Until the center points don’t change
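The three steps above can be sketched in plain Python (a minimal illustration of Lloyd's algorithm under squared Euclidean distance; all function and variable names are mine, not from the slides):

```python
import random

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate assignment and center update."""
    centers = random.sample(points, k)                 # 1. pick K initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                               # 2. assign each point to closest center
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        new_centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        if new_centers == centers:                     # 3. stop when centers no longer move
            break
        centers = new_centers
    return centers, clusters
```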
![Page 8: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/8.jpg)
Example 1: data points
Imagine the k-means clustering result.
![Page 9: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/9.jpg)
K-means clustering result
![Page 10: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/10.jpg)
Importance of Choosing Initial Guesses (1)
![Page 11: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/11.jpg)
Importance of Choosing Initial Guesses (2)
![Page 12: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/12.jpg)
Local Optima of K-means
K-means minimizes the sum of point-to-centroid distances; this optimization is difficult in general.
After each iteration of K-means the MSE (mean squared error) decreases, but K-means may converge to a local optimum. K-means is therefore sensitive to the initial guesses.
Supplement to K-means
![Page 13: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/13.jpg)
Example 2: data points
![Page 14: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/14.jpg)
Imagine the clustering result.
![Page 15: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/15.jpg)
Example 2: K-means clustering result
![Page 16: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/16.jpg)
Limitations of K-means
K-means has problems when clusters have differing:
Sizes
Densities
Non-globular shapes
K-means also has problems when the data contains outliers.
One solution is to use many clusters: find parts of clusters, then put them together.
![Page 17: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/17.jpg)
Clustering: K-Means
The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result on each run, since the resulting clusters depend on the initial random assignments.
It minimizes intra-cluster variance (equivalently, maximizes inter-cluster variance), but does not guarantee that the result is a global minimum of the variance.
![Page 18: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/18.jpg)
d_intra or d_inter? That is the question!
d_intra is the distance among elements (points, objects, or whatever) of the same cluster.
d_inter is the distance among clusters.
Questions:
Should we use distance or similarity?
Should we care about inter-cluster distance?
Should we care about cluster shape?
Should we care about clustering?
![Page 19: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/19.jpg)
Distance Functions Informal definition:
The distance between two points is the length of a straight line segment between them.
A more formal definition: A distance between two points P and Q in a metric space is d(P,Q), where d is the distance function that defines the given metric space.
We can also define the distance between two sets A and B in a metric space as being the minimum (or infimum) of distances between any two points P in A and Q in B.
![Page 20: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/20.jpg)
Distance or Similarity?
In a very straightforward way we can define the similarity function sim: S×S → [0,1] as sim(o1,o2) = 1 − d(o1,o2), where o1 and o2 are elements of the space S and d is a distance normalized to [0,1].
d = 0.3
sim = 0.7
d = 0.1
sim = 0.9
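The definition only makes sense once the distance has been normalized to [0,1]; a tiny sketch of the conversion (hypothetical helper, not from the slides):

```python
def similarity(d):
    """Turn a normalized distance d in [0,1] into a similarity in [0,1]."""
    assert 0.0 <= d <= 1.0, "d must be normalized to [0,1] first"
    return 1.0 - d
```

This reproduces the figure's examples: a distance of 0.3 gives a similarity of 0.7, and a distance of 0.1 gives 0.9.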
![Page 21: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/21.jpg)
What does similar (or distant) really mean?
Learning (either supervised or unsupervised) is impossible without ASSUMPTIONS:
Watanabe’s Ugly Duckling theorem
Wolpert’s No Free Lunch theorem
Learning is impossible without some sort of bias.
![Page 22: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/22.jpg)
The Ugly Duckling theorems
The theorem gets its fanciful name from the following counter-intuitive statement: assuming similarity is based on the number of shared predicates, an ugly duckling is as similar to beautiful swan A as beautiful swan B is to A, given that A and B differ at all. It was proposed and proved by Satosi Watanabe in 1969.
![Page 23: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/23.jpg)
Satosi’s Theorem
Let n be the cardinality of the universal set S. We have to classify its objects without prior knowledge of the essence of the categories. The number of different classes, i.e. the different ways to group the objects, is given by the cardinality of the power set of S: |Pow(S)| = 2^n.
Without any prior information, the most natural way to measure the similarity between two distinct objects is to count the number of classes they share.
Oooops… Every pair of distinct objects shares exactly the same number of classes, namely 2^(n−2).
![Page 24: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/24.jpg)
The Ugly Duckling and 3 Beautiful Swans
S = {o1, o2, o3, o4}
Pow(S) = { {}, {o1}, {o2}, {o3}, {o4}, {o1,o2}, {o1,o3}, {o1,o4}, {o2,o3}, {o2,o4}, {o3,o4}, {o1,o2,o3}, {o1,o2,o4}, {o1,o3,o4}, {o2,o3,o4}, {o1,o2,o3,o4} }
How many classes do oi and oj (i ≠ j) have in common?
o1 and o3: 4
o1 and o4: 4
![Page 25: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/25.jpg)
The Ugly Duckling and 3 Beautiful Swans
In binary: 0000, 0001, 0010, 0100, 1000, …
Choose two objects. Reorder the bits so that the chosen objects are represented by the first two bits.
How many strings have the first two bits set to 1? 2^(n−2).
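The counting argument can be checked by brute force (illustrative sketch, names are mine): enumerate the power set of an n-element universe and count the classes that contain both chosen objects.

```python
from itertools import combinations

def shared_classes(n, i, j):
    """Count subsets of an n-element set that contain both objects i and j."""
    count = 0
    for r in range(n + 1):                       # subsets of every size r
        for subset in combinations(range(n), r):
            if i in subset and j in subset:
                count += 1
    return count
```

For every pair of distinct objects the count is the same, 2^(n−2), which is exactly the Ugly Duckling theorem's point.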
![Page 26: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/26.jpg)
Wolpert’s No Free Lunch Theorem
For any two algorithms, A and B, there exist datasets for which algorithm A outperforms algorithm B in prediction accuracy on unseen instances.
Proof: Take any Boolean concept. If A outperforms B on unseen instances, reverse the labels and B will outperform A.
![Page 27: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/27.jpg)
So Let’s Get Back to Distances
In a metric space a distance is a function d: S×S → R such that, for all elements a, b, c of S:
d(a,b) ≥ 0
d(a,b) = 0 iff a = b
d(a,b) = d(b,a)
d(a,c) ≤ d(a,b) + d(b,c)
The fourth property (the triangle inequality) holds only if we are in a metric space.
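A quick numerical spot-check of the four axioms for the Euclidean distance (illustrative only, not a proof; names are mine):

```python
import math
import random

def euclid(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Spot-check the axioms on random 3-D points.
rng = random.Random(1)
pts = [tuple(rng.random() for _ in range(3)) for _ in range(15)]
for a in pts:
    for b in pts:
        d = euclid(a, b)
        assert d >= 0                                   # non-negativity
        assert (d == 0) == (a == b)                     # d = 0 iff a = b
        assert abs(d - euclid(b, a)) < 1e-12            # symmetry
        for c in pts:
            assert euclid(a, c) <= d + euclid(b, c) + 1e-12  # triangle inequality
```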
![Page 28: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/28.jpg)
Minkowski Distance
Let’s consider two elements of a set S described by their feature vectors:
x = (x1, x2, …, xn)
y = (y1, y2, …, yn)
The Minkowski distance is parametric in p ≥ 1:

$$d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
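The formula transcribes directly (illustrative sketch; names are mine):

```python
def minkowski(x, y, p):
    """d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p), for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```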
![Page 29: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/29.jpg)
p = 1: Manhattan Distance
If p = 1 the distance is called the Manhattan distance.
It is also called the taxicab distance because it is the distance a car would drive in a city laid out in square blocks (if there are no one-way streets).

$$d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
![Page 30: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/30.jpg)
p = 2: Euclidean Distance
If p = 2 the distance is the well-known Euclidean distance.

$$d_2(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }$$
![Page 31: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/31.jpg)
p = ∞: Chebyshev Distance
If p = ∞ then we must take the limit.
It is also called the chessboard distance.

$$d_\infty(x, y) = \lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max \{ |x_1 - y_1|, |x_2 - y_2|, \ldots, |x_n - y_n| \}$$
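Numerically, the limit is visible already at moderate p: the Minkowski distance approaches the coordinate-wise maximum (illustrative sketch; names are mine).

```python
def minkowski(x, y, p):
    """General Minkowski distance for finite p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """The p -> infinity limit: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))
```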
![Page 32: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/32.jpg)
2D Cosine Similarity
It’s easy to explain in 2D. Let’s consider a = (x1, y1) and b = (x2, y2): the similarity is the cosine of the angle between the vectors a and b.
![Page 33: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/33.jpg)
Cosine Similarity
Let’s consider two points x, y in R^n.

$$\mathrm{sim}(x, y) = \cos\alpha = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
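A direct transcription of the formula (illustrative sketch; names are mine):

```python
import math

def cosine_sim(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)
```

Orthogonal vectors get similarity 0; parallel vectors get 1, regardless of their lengths.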
![Page 34: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/34.jpg)
Jaccard Distance
Another commonly used distance is the Jaccard distance. The ratio of coordinate-wise minima to coordinate-wise maxima is a similarity, so the distance is one minus that ratio:

$$d(x, y) = 1 - \frac{\sum_{i=1}^{n} \min(x_i, y_i)}{\sum_{i=1}^{n} \max(x_i, y_i)}$$
![Page 35: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/35.jpg)
Binary Jaccard Distance
In the case of binary feature vectors the Jaccard distance simplifies to one minus the ratio of intersection size to union size:

$$d(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$$
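Both forms can be sketched directly (illustrative; the 1 − convention turns the min/max and intersection/union similarity ratios into distances; names are mine):

```python
def jaccard_distance(x, y):
    """Weighted Jaccard distance on non-negative vectors: 1 - sum(min)/sum(max)."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return 1.0 - num / den

def binary_jaccard_distance(x, y):
    """Binary case, with vectors given as sets of the indices set to 1."""
    return 1.0 - len(x & y) / len(x | y)
```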
![Page 36: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/36.jpg)
Edit Distance
The Levenshtein distance or edit distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.
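The standard dynamic-programming solution, keeping only two rows of the table (illustrative sketch, not from the slides; names are mine):

```python
def levenshtein(s, t):
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))          # distance from "" to each prefix of t
    for i, cs in enumerate(s, 1):
        cur = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution (free on match)
        prev = cur
    return prev[-1]
```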
![Page 37: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/37.jpg)
Binary Edit Distance
The binary edit distance, d(x,y), from a binary vector x to a binary vector y is the minimum number of single-bit flips required to transform one vector into the other.
x = (0,1,0,0,1,1)
y = (1,1,0,1,0,1)
d(x,y) = 3

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

The binary edit distance is equivalent to the Manhattan distance (Minkowski with p = 1) for binary feature vectors.
![Page 38: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/38.jpg)
The Curse of High Dimensionality
Dimensionality is one of the main problems to face when clustering data.
Roughly speaking, the higher the dimensionality, the lower the power of recognizing similar objects.
![Page 39: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/39.jpg)
Volume of the Unit-Radius Sphere
![Page 40: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/40.jpg)
Sphere/Cube Volume Ratio
Unit-radius sphere inside a cube whose edge length is 2.
![Page 41: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/41.jpg)
Sphere/Sphere Volume Ratio
Two embedded spheres, with radii 1 and 0.9.
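Both ratios can be computed from the closed-form volume of the n-ball, V_n = π^(n/2) / Γ(n/2 + 1) (illustrative sketch; names are mine). The sphere/cube ratio vanishes rapidly, and almost all of a high-dimensional ball's volume lies in a thin outer shell.

```python
import math

def unit_ball_volume(n):
    """V_n = pi^(n/2) / Gamma(n/2 + 1): volume of the unit-radius n-ball."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def sphere_cube_ratio(n):
    """Unit ball vs. the enclosing cube of edge 2 (volume 2^n)."""
    return unit_ball_volume(n) / 2 ** n

def shell_fraction(n, r=0.9):
    """Fraction of the unit ball's volume OUTSIDE the inner ball of radius r."""
    return 1 - r ** n
```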
![Page 42: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/42.jpg)
Concentration of the Norm Phenomenon
Gaussian distributions of points (std. dev. = 1), in dimensions 1, 2, 3, 5, 10, and 20. The plots show the probability density function of finding a point, drawn according to a Gaussian distribution, at distance r from the center of that distribution.
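The phenomenon is easy to reproduce by sampling (illustrative sketch; names are mine): as the dimension grows, the norm's mean grows like √n while its spread stays roughly constant, so the relative spread collapses and all points end up at nearly the same distance from the center.

```python
import math
import random

def mean_norm(dim, samples=2000, seed=0):
    """Mean and std of the norm of standard Gaussian vectors in `dim` dimensions."""
    rng = random.Random(seed)
    norms = [math.sqrt(sum(rng.gauss(0, 1) ** 2 for _ in range(dim)))
             for _ in range(samples)]
    mean = sum(norms) / samples
    var = sum((r - mean) ** 2 for r in norms) / samples
    return mean, math.sqrt(var)
```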
![Page 43: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/43.jpg)
Web Document Representation
The Web can be characterized in three different ways: Content, Structure, Usage.
We are concerned with Web Content information.
![Page 44: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/44.jpg)
Bag-of-Words vs. Vector-Space
Let C be a collection of N documents d1, d2, …, dN.
Each document is composed of terms drawn from a term-set of dimension T.
Each document can be represented in two different ways: Bag-of-Words or Vector-Space.
![Page 45: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/45.jpg)
Bag-of-Words
In the bag-of-words model the document is represented as d = {Apple, Banana, Coffee, Peach}.
Each term is represented; there is no information on frequency. Binary encoding: a T-dimensional bit vector.
Document: Apple Peach Apple Banana Apple Banana Coffee Apple Coffee
d = [1,0,1,1,0,0,0,1]
![Page 46: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/46.jpg)
Vector-Space
In the vector-space model the document is represented as d = {<Apple,4>, <Banana,2>, <Coffee,2>, <Peach,1>}.
Information about frequency is recorded: T-dimensional vectors.
Document: Apple Peach Apple Banana Apple Banana Coffee Apple Coffee
d = [4,0,2,2,0,0,0,1]
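Both representations can be sketched directly (illustrative; for brevity the term-set here has only the four terms that actually occur, whereas the slides pad the vectors to 8 dimensions; names are mine):

```python
from collections import Counter

DOC = ["Apple", "Peach", "Apple", "Banana", "Apple", "Banana",
       "Coffee", "Apple", "Coffee"]
TERMS = ["Apple", "Banana", "Coffee", "Peach"]   # the term-set (dimension T = 4 here)

def bag_of_words(tokens, terms):
    """Binary vector: 1 if the term occurs at all, 0 otherwise."""
    present = set(tokens)
    return [1 if t in present else 0 for t in terms]

def vector_space(tokens, terms):
    """Frequency vector: how many times each term occurs."""
    counts = Counter(tokens)
    return [counts[t] for t in terms]
```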
![Page 47: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/47.jpg)
Typical Web Collection Dimensions
Number of documents: 20 billion. Number of terms: approximately 150,000,000!!!
Very high dimensionality… Each document contains from 100 to 1,000 terms.
Classical clustering algorithms cannot be used: due to the geometric properties just seen, each document is very similar to the others.
![Page 48: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/48.jpg)
We Need to Cope with High Dimensionality
Possible solutions: JUST ONE… Reduce dimensions!!!!
![Page 49: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/49.jpg)
Dimensionality Reduction
Dimensionality reduction approaches can be divided into two categories:
Feature selection approaches try to find a subset of the original features. Optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality.
Feature extraction applies a mapping of the multidimensional space into a space of fewer dimensions. This means that the original feature space is transformed, e.g. by applying a linear transformation via principal components analysis.
![Page 50: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/50.jpg)
PCA - Principal Component Analysis
In statistics, principal components analysis (PCA) is a technique that can be used to simplify a dataset.
More formally it is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on.
PCA can be used for reducing dimensionality in a dataset while retaining those characteristics of the dataset that contribute most to its variance by eliminating the later principal components (by a more or less heuristic decision).
![Page 51: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/51.jpg)
The Method
Suppose you have a random vector population x:

$$x = (x_1, x_2, \ldots, x_n)^T$$

with mean:

$$\mu_x = E\{x\}$$

and covariance matrix:

$$C_x = E\{(x - \mu_x)(x - \mu_x)^T\}$$
![Page 52: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/52.jpg)
The Method The components of Cx, denoted by cij, represent the covariances between the random variable components xi and xj. The component cii is the variance of the component xi. The variance of a component indicates the spread of the component values around its mean value. If two components xi and xj of the data are uncorrelated, their covariance is zero (cij = cji = 0).
The covariance matrix is, by definition, always symmetric.
Taking a sample of vectors x1, x2, …, xM, we can calculate the sample mean and the sample covariance matrix as estimates of the mean and the covariance matrix.
![Page 53: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/53.jpg)
The Method
From a symmetric matrix such as the covariance matrix, we can calculate an orthogonal basis by finding its eigenvalues and eigenvectors. The eigenvectors ei and the corresponding eigenvalues λi are the solutions of the equation:

$$C_x e_i = \lambda_i e_i \;\Rightarrow\; |C_x - \lambda I| = 0$$

By ordering the eigenvectors in order of descending eigenvalues (largest first), one can create an ordered orthogonal basis with the first eigenvector having the direction of largest variance of the data. In this way, we can find the directions in which the data set has the most significant amounts of energy.
This is a computationally expensive task.
![Page 54: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/54.jpg)
The Method
To reduce the dimensionality from n to k, keep only the first k eigenvectors (those with the largest eigenvalues).
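The whole pipeline (mean, covariance, eigen-decomposition, ordering) can be sketched for the 2-D case, where the 2×2 eigenproblem has a closed form (illustrative sketch; names are mine):

```python
import math

def pca_2d(points):
    """PCA for 2-D data: eigen-decompose the 2x2 covariance matrix analytically."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n        # covariance matrix entries
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # eigenvalues of [[cxx, cxy], [cxy, cyy]] via trace and determinant
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc                  # sorted: l1 >= l2
    # eigenvector for the largest eigenvalue = first principal component
    if cxy != 0:
        v = (l1 - cyy, cxy)
    else:
        v = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(*v)
    return (l1, l2), (v[0] / norm, v[1] / norm)
```

For points lying on the line y = x, all the variance ends up on the first component, whose direction is the diagonal.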
![Page 55: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/55.jpg)
A Graphical Example
![Page 56: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/56.jpg)
PCA Eigenvectors Projection
Project the data onto the selected eigenvectors:

$$x \approx \mu + \sum_{i=1}^{k} a_i e_i, \qquad a_i = (x - \mu)^T e_i$$
![Page 57: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/57.jpg)
Another Example
![Page 58: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/58.jpg)
Singular Value Decomposition
It is a technique for reducing dimensionality based on some properties of symmetric matrices.
It will be the subject of a talk given by one of you!!!!
![Page 59: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/59.jpg)
Locality-Sensitive Hashing
The key idea of this approach is to create a small signature for each document, ensuring that similar documents have similar signatures.
There exists a family H of hash functions such that for each pair of pages u, v we have Pr[mh(u) = mh(v)] = sim(u,v), where the hash function mh is chosen at random from the family H.
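The property above is exactly that of MinHash signatures. A small sketch (using simple linear hash functions over integer shingle IDs as a stand-in for random permutations, so the probability property holds only approximately; all names are mine):

```python
import random

def minhash_signature(items, num_hashes=200, seed=0):
    """One MinHash value per hash function: the minimum hash over the item set.

    `items` is a set of integer shingle IDs.
    """
    rng = random.Random(seed)
    p = 2_147_483_647                                   # a large prime modulus
    funcs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * x + b) % p for x in items) for (a, b) in funcs]

def estimated_similarity(sig_u, sig_v):
    """Fraction of agreeing positions estimates the Jaccard similarity sim(u, v)."""
    return sum(x == y for x, y in zip(sig_u, sig_v)) / len(sig_u)
```

Two sets with Jaccard similarity 1/3 agree on roughly a third of their signature positions, while a set compared with itself agrees everywhere.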
![Page 60: [PPT]](https://reader035.vdocuments.mx/reader035/viewer/2022062419/557cd158d8b42aa2298b4aa0/html5/thumbnails/60.jpg)
Locality-Sensitive Hashing
It will be the subject of a talk given by one of you!!!!!!!