Cluster Analysis
Potyó László
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis: grouping a set of data objects into clusters
The number of possible clusterings of n objects is the Bell number B_n
Clustering is unsupervised classification: there are no predefined classes
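The Bell number B_n counts the ways to partition n objects into non-empty clusters. A minimal sketch (the function name is mine) computing it via the Bell triangle:

```python
def bell(n):
    """Bell number B_n: the number of ways to partition a set of n objects."""
    # Bell triangle: each new row starts with the last entry of the previous
    # row; each subsequent entry adds the entry above-left.
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
    return row[0]

print(bell(3))   # 5
print(bell(10))  # 115975
```

B_10 is already 115975, which is why exhaustively searching over all clusterings is infeasible and iterative methods are used instead.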
General Applications of Clustering
Pattern Recognition
Spatial Data Analysis
Image Processing
Economic Science
WWW
Examples of Clustering ApplicationsExamples of Clustering Applications
Marketing:Marketing: Help marketers discover distinct groups in Help marketers discover distinct groups in their customer bases, and then use this knowledge to their customer bases, and then use this knowledge to develop targeted marketing programdevelop targeted marketing program
Insurance:Insurance: Identifying groups of motor insurance policy Identifying groups of motor insurance policy holders with a high average claim costholders with a high average claim cost
City-planning:City-planning: Identifying groups of houses according to Identifying groups of houses according to their house type, value, and geographical locationtheir house type, value, and geographical location
What Is Good Clustering?
The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Requirements of ClusteringRequirements of Clustering
ScalabilityScalability
Ability to deal with different types of attributesAbility to deal with different types of attributes
Discovery of clusters with arbitrary shapeDiscovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to Minimal requirements for domain knowledge to determine input parametersdetermine input parameters
Able to deal with noise and outliersAble to deal with noise and outliers
Insensitive to order of input recordsInsensitive to order of input records
High dimensionalityHigh dimensionality
Incorporation of user-specified constraintsIncorporation of user-specified constraints
Interpretability and usabilityInterpretability and usability
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects.
A popular choice is the Minkowski distance:
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer.
Similarity and Dissimilarity Between Objects
If q = 1, d is the Manhattan distance.
If q = 2, d is the Euclidean distance.
Properties of a metric:
d(i, j) ≥ 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) ≤ d(i, k) + d(k, j)
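The Minkowski distance and its q = 1 / q = 2 special cases can be sketched in a few lines (the function name is mine):

```python
def minkowski(xi, xj, q):
    """Minkowski distance between two p-dimensional points.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    """
    return sum(abs(a - b) ** q for a, b in zip(xi, xj)) ** (1.0 / q)

i, j = (0.0, 0.0), (3.0, 4.0)
print(minkowski(i, j, 1))  # 7.0  (Manhattan)
print(minkowski(i, j, 2))  # 5.0  (Euclidean)
```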
Categorization of Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
K-means
1. Ask the user how many clusters they’d like (e.g. k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it’s closest to (thus each center “owns” a set of datapoints).
4. Each center finds the centroid of the points it owns…
5. …and jumps there.
6. Repeat from step 3 until terminated!
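The steps above can be sketched in plain Python. This is a minimal sketch with names of my choosing; for determinism it seeds the centers with the first k points instead of a random guess (step 2), and it handles only numeric tuples:

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm for k-means on a list of numeric tuples."""
    # step 2 (simplified): deterministic initial centers instead of random ones
    centers = list(points[:k])
    for _ in range(iters):                                   # step 6: repeat
        # step 3: each datapoint finds the center it is closest to
        owned = {c: [] for c in range(k)}
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            owned[nearest].append(p)
        # steps 4-5: each center jumps to the centroid of the points it owns
        new_centers = []
        for c in range(k):
            pts_c = owned[c]
            if pts_c:
                new_centers.append(tuple(sum(v) / len(pts_c)
                                         for v in zip(*pts_c)))
            else:                      # a center that owns nothing stays put
                new_centers.append(centers[c])
        if new_centers == centers:     # no center moved: converged
            break
        centers = new_centers
    return centers

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))  # two centers, near (1/3, 1/3) and (31/3, 31/3)
```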
The GMM assumption
• There are k components; the i-th component is called ω_i.
• Component ω_i has an associated mean vector μ_i.
• Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(ω_i).
2. Draw the datapoint x ~ N(μ_i, σ²I).
The General GMM assumption
• There are k components; the i-th component is called ω_i.
• Component ω_i has an associated mean vector μ_i.
• Each component generates data from a Gaussian with mean μ_i and full covariance matrix Σ_i.
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random: choose component i with probability P(ω_i).
2. Draw the datapoint x ~ N(μ_i, Σ_i).
Expectation-Maximization (EM)
Solves estimation with incomplete data.
Obtain initial estimates for the parameters.
Iteratively use the estimates for the missing data and continue until convergence.
EM algorithm
An iterative algorithm that maximizes the log-likelihood function.
E-step: using the current parameter estimates, compute the posterior probability (responsibility) of each component for each datapoint.
M-step: re-estimate the priors, means, and covariances from the responsibility-weighted data.
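The E-step/M-step alternation can be sketched for a two-component 1-D Gaussian mixture. This is a minimal sketch, not a production implementation: the toy data, the crude min/max initialization, and all names are my own choices:

```python
import math

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    k = 2
    mu = [min(data), max(data)]   # crude initial estimates
    var = [1.0] * k
    priors = [1.0 / k] * k

    def pdf(x, m, v):
        # 1-D Gaussian density with mean m and variance v
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: responsibility of each component for each datapoint
        resp = []
        for x in data:
            w = [priors[i] * pdf(x, mu[i], var[i]) for i in range(k)]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: re-estimate priors, means, variances from weighted data
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            priors[i] = n_i / len(data)
            mu[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var[i] = sum(r[i] * (x - mu[i]) ** 2 for r, x in zip(resp, data)) / n_i
    return priors, mu, var

# two well-separated 1-D blobs
priors, mu, var = em_gmm_1d([0.9, 1.0, 1.1, 4.9, 5.0, 5.1])
print(mu)  # roughly [1.0, 5.0]
```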
Sample 1
Clustering data generated by a mixture of three Gaussians in 2 dimensions
Number of points: 500
Priors: 0.3, 0.5 and 0.2
Centers: (2, 3.5), (0, 0) and (0, 2)
Variances: 0.2, 0.5 and 1.0
Sample 1

| Raw data (generated) | After clustering (estimated) |
| --- | --- |
| 150 points, center (2, 3.5) | 149 points, center (1.9941, 3.4742) |
| 250 points, center (0, 0) | 265 points, center (0.0306, 0.0026) |
| 100 points, center (0, 2) | 86 points, center (0.1395, 1.9759) |
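Sample 1's data can be regenerated with the standard library. The priors, centers, and variances are those stated above; the seed and function name are my own:

```python
import random

def sample_mixture(n, priors, centers, variances, seed=42):
    """Draw n 2-D points from a mixture of spherical Gaussians."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        # pick a component with probability equal to its prior
        i = rng.choices(range(len(priors)), weights=priors)[0]
        sd = variances[i] ** 0.5
        points.append((rng.gauss(centers[i][0], sd),
                       rng.gauss(centers[i][1], sd)))
    return points

pts = sample_mixture(500,
                     priors=[0.3, 0.5, 0.2],
                     centers=[(2, 3.5), (0, 0), (0, 2)],
                     variances=[0.2, 0.5, 1.0])
print(len(pts))  # 500
```

Running EM on such a sample recovers component counts and centers close to, but not exactly equal to, the generating values, as the table shows.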
Sample 2
Clustering three-dimensional data
Number of points: 1000
Unknown source
Optimal number of components = ?
Estimated parameters = ?
Sample 2: raw data and after clustering
Assumed number of clusters: 5
Sample 2 – table of estimated parameters
References
[1] http://www.autonlab.org/tutorials/gmm14.pdf
[2] http://www.autonlab.org/tutorials/kmeans11.pdf
[3] http://info.ilab.sztaki.hu/~lukacs/AdatbanyaEA2005/klaszterezes.pdf
[4] http://www.stat.auckland.ac.nz/~balemi/Data%20Mining%20in%20Market%20Research.ppt
[5] http://www.ncrg.aston.ac.uk/netlab