

Clustering Algorithm

CIS 435 Francisco E. Figueroa

Overview

Clustering groups a particular set of objects based on their characteristics, aggregating them according to their similarities. Clustering is often considered the most important unsupervised learning problem. It is concerned with finding structure in a collection of unlabeled data. A cluster is a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. For example, we may be interested in finding representatives for homogeneous groups (data reduction), finding “natural clusters” and describing their unknown properties (“natural” data types), finding useful and suitable groupings (“useful” data classes), or finding unusual data objects (outlier detection). Possible applications of clustering algorithms include finding groups of customers with similar behavior, classifying plants based on given features, and document classification. Desirable properties of clustering algorithms include scalability, the ability to handle different types of attributes, the ability to discover clusters of arbitrary shape, robustness to noise and outliers, and the ability to handle high-dimensional data.

K-Means, DBSCAN and EM Clustering Algorithms

Clustering methods may be divided into the following categories: partitional, hierarchical, density-based, grid-based, model-based, high-dimensional, and constraint-based. K-Means is a partitional method, which obtains a single-level partition of the objects. Partitional methods are usually based on greedy heuristics applied iteratively to reach a locally optimal solution.

K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations produce different results. Euclidean distance is used to measure how far each object in the dataset is from each of the centroids. The K-means method does not consider the size of the clusters, and it implicitly assumes spherical probability distributions.
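The loop described above (assign each point to its nearest centroid by Euclidean distance, then recompute each centroid as the mean of its cluster) can be sketched in plain Python. This is an illustrative toy implementation, not the code used for the assignment; the evenly-spaced initialization and the sample points are assumptions for the example (real implementations typically use random restarts or k-means++ initialization).

```python
def kmeans(points, k, iters=10):
    """Toy Lloyd's algorithm: alternate assignment and update steps."""
    # Naive initialization: centroids spaced evenly through the input.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        # by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, clusters

# Two obvious groups in the plane (made-up sample data).
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
       (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(pts, k=2)
```

Because the two groups are well separated, the sketch converges after a single iteration here; on harder data the result depends on the initialization, which is exactly the "different locations produce different results" caveat above.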

DBSCAN is a density-based method: typically, for each data point in a cluster, at least a minimum number of points must exist within a given radius. The method was designed for spatial databases but can be used in other applications. It requires two input parameters: the size of the neighborhood (R) and the minimum number of points in the neighborhood (N). Essentially, these two parameters determine the density within clusters that the user is willing to accept, since they specify how many points must lie in a region. Because DBSCAN uses a density-based definition of a cluster, it is relatively resistant to noise and can handle clusters of arbitrary shapes and sizes. Thus, DBSCAN can find many clusters that could not be found using K-means. DBSCAN faces challenges when clusters have widely varying densities and with high-dimensional data, because density is more difficult to define there and computing the nearest neighbors can be expensive.
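To make the roles of R and N concrete, here is a minimal plain-Python DBSCAN sketch (an illustration of the idea, not the tool used in this assignment; the radius, minimum-points value, and sample points are assumptions): points with at least N neighbors within radius R become core points and grow clusters, and anything unreachable from a core point is labeled noise.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)  # None = not yet visited

    def neighbors(i):
        # All points within radius eps of point i (including i itself).
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:        # not dense enough: tentatively noise
            labels[i] = NOISE
            continue
        labels[i] = cluster            # i is a core point: start a cluster
        queue = [j for j in nbrs if j != i]
        while queue:                   # expand the cluster outward
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point rescued from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:     # j is also a core point: keep growing
                queue.extend(jn)
        cluster += 1
    return labels

# Two dense blobs and one isolated outlier (made-up data).
pts = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.4, 0.4),
       (5.0, 5.0), (5.0, 5.4), (5.4, 5.0), (5.3, 5.3),
       (10.0, 10.0)]
labels = dbscan(pts, eps=1.0, min_pts=3)
```

The outlier at (10, 10) has no neighbors within the radius, so it stays labeled -1; this is the "unclustered instances" behavior reported in the DBSCAN results below.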

Expectation-Maximization is a model-based method. A model is assumed, perhaps based on a probability distribution. Essentially, the algorithm tries to build clusters with a high level of similarity within them and a low level of similarity between them. Similarity measurement is based on the mean values, and the algorithm tries to minimize an error function. The Expectation-Maximization (EM) method is based on the assumption that the objects in the dataset have attributes whose values are distributed according to some (unknown) linear combination (or mixture) of simple probability distributions. The EM algorithm is a popular iterative refinement algorithm that can be used to find the parameter estimates. It assigns each object to a cluster according to a weight representing the probability of membership. In other words, there are no strict boundaries between clusters.
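As a concrete illustration of EM's soft membership weights, here is a toy fit of a two-component one-dimensional Gaussian mixture in plain Python (a sketch under assumed data and initialization, not the EM run reported below): the E-step computes each point's membership weight for each component, and the M-step re-estimates the means, variances, and mixing weights from those soft assignments.

```python
import math

def em_gmm_1d(xs, iters=30):
    """Toy EM for a two-component 1-D Gaussian mixture."""
    # Crude initialization (an assumption for this sketch):
    # start the component means at the data extremes.
    mu = [min(xs), max(xs)]
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: responsibility (membership weight) of each
        # component for each point, from the Gaussian densities.
        resp = []
        for x in xs:
            dens = [pi[k] / (sigma[k] * math.sqrt(2 * math.pi))
                    * math.exp(-((x - mu[k]) ** 2) / (2 * sigma[k] ** 2))
                    for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)  # floor avoids collapse
            pi[k] = nk / len(xs)
    return mu, sigma, pi, resp

# Two well-separated groups on the line (made-up data).
xs = [1.0, 1.1, 0.9, 1.2, 7.9, 8.0, 8.1, 8.2]
mu, sigma, pi, resp = em_gmm_1d(xs)
```

Each row of `resp` sums to 1, so every point belongs to every component with some probability; that is the "no strict boundaries between clusters" property in the paragraph above.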

K-Means Algorithm Results

Applying the K-Means algorithm with K = 6 to the training set produced the following results. The number of iterations was 5, and the within-cluster sum of squared errors was 75.13612068802034. The cluster centroids were:

Cluster 0: 38, Male, TownHouse, 12, Y, Y, Y, Tools
Cluster 1: 18, Male, Rent, 24, N, Y, N, None
Cluster 2: 40, Male, SingleFamily, 12, Y, N, Y, Electronics
Cluster 3: 20, Male, Rent, 20, N, Y, N, None
Cluster 4: 24, Female, TownHouse, 20, Y, Y, N, Apparel
Cluster 5: 56, Female, SingleFamily, 4, Y, Y, Y, Cosmetic

In the model evaluation on the training set, the clustered instances were distributed as follows: Cluster 0: 31 (31%), Cluster 1: 6 (6%), Cluster 2: 21 (21%), Cluster 3: 14 (14%), Cluster 4: 20 (20%), Cluster 5: 8 (8%).

Applying the K-Means algorithm with K = 6 in classes-to-clusters mode produced the following results. The number of iterations was 5, and the within-cluster sum of squared errors was 41.1979686974573. The cluster centroids were:

Cluster 0: 38, Male, TownHouse, 12, Y, Y, Y
Cluster 1: 18, Male, Rent, 24, N, Y, N
Cluster 2: 40, Male, SingleFamily, 12, Y, N, Y
Cluster 3: 20, Male, Rent, 20, N, Y, N
Cluster 4: 24, Female, TownHouse, 20, Y, Y, N
Cluster 5: 56, Female, SingleFamily, 4, Y, Y, Y

In the classes-to-clusters evaluation, the clustered instances were distributed as follows: Cluster 0: 31 (31%), Cluster 1: 6 (6%), Cluster 2: 17 (17%), Cluster 3: 14 (14%), Cluster 4: 20 (20%), Cluster 5: 12 (12%). There were 45 incorrectly clustered instances.


Class attribute: Product_Purchased. Classes to clusters:

   0   1   2   3   4   5   <-- assigned to cluster
   0   5  17  10   0   4 | None
   0   0   0   0  15   0 | Apparel
   0   1   0   4   0   0 | Game
   0   0   0   0   5   8 | Cosmetic
  10   0   0   0   0   0 | Electronics
  10   0   0   0   0   0 | Jewellary
  11   0   0   0   0   0 | Tools

Cluster 0 <-- Tools
Cluster 1 <-- No class
Cluster 2 <-- None
Cluster 3 <-- Game
Cluster 4 <-- Apparel
Cluster 5 <-- Cosmetic

Electronics and Jewellary were wrongly classified. Some items that were actually purchased were wrongly assigned to the "None" (no purchase) clusters, such as Game and Cosmetic instances.

DBSCAN Algorithm Results
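The incorrectly-clustered counts in these classes-to-clusters evaluations come from mapping each cluster to a class and counting the points that disagree with that mapping. A minimal sketch of the idea, using a plain majority vote per cluster (which approximates, but is not identical to, the assignment procedure used by tools such as Weka; the toy labels below are made up):

```python
from collections import Counter

def classes_to_clusters_error(classes, clusters):
    """Map each cluster to its most common class label and count
    the points in that cluster carrying a different label."""
    by_cluster = {}
    for cls, clu in zip(classes, clusters):
        by_cluster.setdefault(clu, []).append(cls)
    wrong = 0
    for clu, members in by_cluster.items():
        _, majority_count = Counter(members).most_common(1)[0]
        wrong += len(members) - majority_count
    return wrong

# Hypothetical toy labels, not the assignment's data.
classes  = ["None", "None", "Tools", "Tools", "Game"]
clusters = [0, 0, 1, 1, 1]
errors = classes_to_clusters_error(classes, clusters)
```

Here cluster 1 holds two Tools instances and one Game instance, so it maps to Tools and contributes one error; cluster 0 is pure and contributes none.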

When DBSCAN was applied to the training set, the clustered instances were distributed as follows: Cluster 0: 15 (27%), Cluster 1: 6 (11%), Cluster 2: 10 (18%), Cluster 3: 17 (30%), Cluster 4: 8 (14%). There were 44 unclustered instances.


Applying DBSCAN with a classes-to-clusters evaluation on Product_Purchased produced the following results. Clustered instances: Cluster 0: 15 (21%), Cluster 1: 11 (15%), Cluster 2: 10 (14%), Cluster 3: 17 (24%), Cluster 4: 8 (11%), Cluster 5: 10 (14%), with a total of 29 unclustered instances. For the class attribute Product_Purchased, the classes-to-clusters assignment was:

   0   1   2   3   4   5   <-- assigned to cluster
  15   0   0  17   0   0 | None
   0   0   0   0   0   5 | Apparel
   0   0   0   0   0   0 | Game
   0   0   0   0   8   5 | Cosmetic
   0   5   0   0   0   0 | Electronics
   0   0  10   0   0   0 | Jewellary
   0   6   0   0   0   0 | Tools

Cluster 0 <-- No class
Cluster 1 <-- Tools
Cluster 2 <-- Jewellary
Cluster 3 <-- None
Cluster 4 <-- Cosmetic
Cluster 5 <-- Apparel

There were 25 incorrectly clustered instances (25%). Cluster 0 contained 15 None instances but was not matched to any class, so those count as incorrect. In Cluster 1 (Tools), 6 instances were correctly classified while 5 were incorrect (Electronics grouped with Tools). In Cluster 2 (Jewellary), all 10 instances were classified correctly, and in Cluster 3 (None), all 17 instances were classified correctly. In Cluster 4 (Cosmetic), all 8 instances were classified correctly, while Cluster 5 (Apparel) mixed 5 Apparel with 5 Cosmetic instances. DBSCAN can find many clusters that could not be found using K-means, and in this case the number of incorrectly clustered instances was lower with DBSCAN (25) than with K-Means (45).


EM Algorithm Results

Running the EM algorithm on the dataset produced the following classes-to-clusters assignment for the class attribute Product_Purchased:

   0   1   2   3   4   5   6   7   <-- assigned to cluster
   4  15  13   0   0   0   4   0 | None
   0   5   0   0   0   5   0   5 | Apparel
   0   5   0   0   0   0   0   0 | Game
   0   0   0   0   0   0   8   5 | Cosmetic
   5   0   0   0   5   0   0   0 | Electronics
   0   0   0   0   0  10   0   0 | Jewellary
   0   0   0   5   6   0   0   0 | Tools

Cluster 0 <-- Electronics
Cluster 1 <-- Game
Cluster 2 <-- None
Cluster 3 <-- No class
Cluster 4 <-- Tools
Cluster 5 <-- Jewellary
Cluster 6 <-- Cosmetic
Cluster 7 <-- Apparel

To identify the characteristics of the customers likely to buy Tools, Games, and Cosmetics, we must examine the likelihood parameters within the clusters. Customers who buy tools belong to Cluster 4: average age 33.19, mostly male, living in a townhouse, shopping online, not chatting online, and with some privacy concerns. Customers who buy games belong to Cluster 1: average age 20.2, mostly male, renting, shopping online, visiting the store frequently, not chatting online, and with privacy concerns. Customers who buy cosmetics belong to Cluster 6: average age 54.04, mostly female, living in a single-family residence, not visiting the store frequently, possibly shopping online, possibly chatting online, and with privacy concerns. Based on this data, management should first run an awareness campaign across all their customers, because privacy is clearly a concern in every cluster. Second, they can create special offers or special events for female customers who buy games, since these are customers who are already buying. Third, for customers who buy tools, they can build marketing material around townhouses.


Real World Clustering Applications

Healthcare - Identify cost-change patterns of patients with end-stage renal disease (ESRD) who initiated hemodialysis (HD) by applying different clustering methods. A total of 18,380 patients were identified. Meaningful all-cause cost clusters were generated using K-means CA and hierarchical CA with either flexible beta or Ward's methods. (Liao, 2016)

Retail - Retailers use a variety of tools to create clusters for assortment planning and other uses. These tools include reports and spreadsheets, specialty statistical analysis software packages (such as SAS and Minitab), clustering solutions tailored specifically for retailers, and clustering capabilities integrated into broader assortment planning solutions. Retailers apply clustering for cases such as single assortments, channel-based clusters, sales-volume-based clusters, and store-capacity-based clusters, among others. (Pollack)

Finance - Cai et al. grouped mutual funds with different investment objectives, claiming that cluster analysis can explain non-linear structural relationships in a dataset of unknown structure. They found that over 40% of mutual funds do not belong to their stated categories and that, despite the very large number of stated categories, three groups are very important. Clustering helps simplify the financial data classification problem by relying on characteristics rather than on labels, such as nominal labels (customer gender, living area, income, the success of the last transaction, etc.). (Cai)

References:

A Tutorial on Clustering Algorithms. Retrieved from http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/

A Tutorial on Clustering Algorithms. K-Means Clustering. Retrieved from http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html

Firdaus, S., Uddin, A. A Survey on Clustering Algorithms and Complexity Analysis. International Journal of Computer Science, Volume 12, Issue 2, March 2015. Retrieved from http://www.ijcsi.org/papers/IJCSI-12-2-62-85.pdf

Liao, M., Li, Y., Kianifard, F., Obi, E., Arcona, S. Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis. BMC Nephrol. March 2016. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4776444/

Pollack, J. Retail Clustering Methods: Achieving Success with Assortment Planning. The Parker Avery Group. Retrieved from http://www.parkeravery.com/pov_Retail_Clustering_Methods.html

Cai, F., Le-Khac, N., Kechadi, M. Clustering Approaches for Financial Data Analysis: A Survey. Retrieved from http://weblidi.info.unlp.edu.ar/worldcomp2012-mirror/p2012/DMI9090.pdf