# cluster analysis part i

Post on 01-Jan-2016

25 views

Embed Size (px)

DESCRIPTION

Cluster Analysis Part I. Learning Objectives. What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods. General Applications of Clustering. Pattern Recognition Spatial Data Analysis - PowerPoint PPT PresentationTRANSCRIPT

Cluster AnalysisPart I

TCSS555A Isabelle Bichindaritz

Learning ObjectivesWhat is Cluster Analysis?

Types of Data in Cluster Analysis

A Categorization of Major Clustering Methods

Partitioning Methods

TCSS555A Isabelle Bichindaritz

What is Cluster Analysis?Cluster: a collection of data objectsSimilar to one another within the same clusterDissimilar to the objects in other clustersCluster analysisGrouping a set of data objects into clustersClustering is unsupervised classification: no predefined classesTypical applicationsAs a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms

TCSS555A Isabelle Bichindaritz

General Applications of Clustering Pattern RecognitionSpatial Data Analysis create thematic maps in GIS by clustering feature spacesdetect spatial clusters and explain them in spatial data miningImage ProcessingEconomic Science (especially market research)WWWDocument classificationCluster Weblog data to discover groups of similar access patterns

TCSS555A Isabelle Bichindaritz

Examples of Clustering ApplicationsMarketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programsLand use: Identification of areas of similar land use in an earth observation databaseInsurance: Identifying groups of motor insurance policy holders with a high average claim costCity-planning: Identifying groups of houses according to their house type, value, and geographical locationEarth-quake studies: Observed earth quake epicenters should be clustered along continent faults

TCSS555A Isabelle Bichindaritz

What Is Good Clustering?A good clustering method will produce high quality clusters withhigh intra-class similaritylow inter-class similarity The quality of a clustering result depends on both the similarity measure used by the method and its implementation.The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

TCSS555A Isabelle Bichindaritz

Requirements of Clustering in Data Mining ScalabilityAbility to deal with different types of attributesDiscovery of clusters with arbitrary shapeMinimal requirements for domain knowledge to determine input parametersAble to deal with noise and outliersInsensitive to order of input recordsHigh dimensionalityIncorporation of user-specified constraintsInterpretability and usability

TCSS555A Isabelle Bichindaritz

What is Cluster Analysis?Types of Data in Cluster AnalysisA Categorization of Major Clustering MethodsPartitioning MethodsHierarchical MethodsDensity-Based MethodsGrid-Based MethodsModel-Based Clustering MethodsOutlier AnalysisSummary

TCSS555A Isabelle Bichindaritz

Data StructuresData matrix(two modes)

Dissimilarity matrix(one mode)

TCSS555A Isabelle Bichindaritz

Measure the Quality of ClusteringDissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically metric:d(i, j)There is a separate quality function that measures the goodness of a cluster.The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.Weights should be associated with different variables based on applications and data semantics.It is hard to define similar enough or good enough the answer is typically highly subjective.

TCSS555A Isabelle Bichindaritz

Type of data in clustering analysisInterval-scaled variables:Binary variables:Nominal, ordinal, and ratio variables:Variables of mixed types:

TCSS555A Isabelle Bichindaritz

Interval-valued variablesStandardize dataCalculate the mean absolute deviation:

whereCalculate the standardized measurement (z-score)

Using mean absolute deviation is more robust than using standard deviation

TCSS555A Isabelle Bichindaritz

Similarity and Dissimilarity Between ObjectsDistances are normally used to measure the similarity or dissimilarity between two data objectsSome popular ones include: Minkowski distance:

where i = (xi1, xi2, , xip) and j = (xj1, xj2, , xjp) are two p-dimensional data objects, and q is a positive integerIf q = 1, d is Manhattan distance

TCSS555A Isabelle Bichindaritz

Similarity and Dissimilarity Between Objects (Cont.)If q = 2, d is Euclidean distance:

Propertiesd(i,j) 0d(i,i) = 0d(i,j) = d(j,i)d(i,j) d(i,k) + d(k,j)Also one can use weighted distance, parametric Pearson product moment correlation, or other disimilarity measures.

TCSS555A Isabelle Bichindaritz

Binary VariablesA contingency table for binary data

Simple matching coefficient (invariant, if the binary variable is symmetric):Jaccard coefficient (noninvariant if the binary variable is asymmetric): Object iObject j

TCSS555A Isabelle Bichindaritz

Dissimilarity between Binary VariablesExample

gender is a symmetric attributethe remaining attributes are asymmetric binarylet the values Y and P be set to 1, and the value N be set to 0

TCSS555A Isabelle Bichindaritz

Name

Gender

Fever

Cough

Test-1

Test-2

Test-3

Test-4

Jack

M

Y

N

P

N

N

N

Mary

F

Y

N

P

N

P

N

Jim

M

Y

P

N

N

N

N

Nominal VariablesA generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, greenMethod 1: Simple matchingm: # of matches, p: total # of variables

Method 2: use a large number of binary variablescreating a new binary variable for each of the M nominal states

TCSS555A Isabelle Bichindaritz

Ordinal VariablesAn ordinal variable can be discrete or continuousorder is important, e.g., rankCan be treated like interval-scaled replacing xif by their rank map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable by

compute the dissimilarity using methods for interval-scaled variables

TCSS555A Isabelle Bichindaritz

Ratio-Scaled VariablesRatio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as AeBt or Ae-Bt Methods:treat them like interval-scaled variables not a good choice! (why?)apply logarithmic transformationyif = log(xif)treat them as continuous ordinal data treat their rank as interval-scaled.

TCSS555A Isabelle Bichindaritz

Variables of Mixed TypesA database may contain all the six types of variablessymmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.One may use a weighted formula to combine their effects.

f is binary or nominal:dij(f) = 0 if xif = xjf , or dij(f) = 1 o.w.f is interval-based: use the normalized distancef is ordinal or ratio-scaledcompute ranks rif and and treat zif as interval-scaled

TCSS555A Isabelle Bichindaritz

What is Cluster Analysis?Types of Data in Cluster AnalysisA Categorization of Major Clustering MethodsPartitioning MethodsHierarchical MethodsDensity-Based MethodsGrid-Based MethodsModel-Based Clustering MethodsOutlier AnalysisSummary

TCSS555A Isabelle Bichindaritz

Major Clustering ApproachesPartitioning algorithms: Construct various partitions and then evaluate them by some criterionHierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterionDensity-based: based on connectivity and density functionsGrid-based: based on a multiple-level granularity structureModel-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model to each other

TCSS555A Isabelle Bichindaritz

What is Cluster Analysis?Types of Data in Cluster AnalysisA Categorization of Major Clustering MethodsPartitioning MethodsHierarchical MethodsDensity-Based MethodsGrid-Based MethodsModel-Based Clustering MethodsOutlier AnalysisSummary

TCSS555A Isabelle Bichindaritz

Partitioning Algorithms: Basic ConceptPartitioning method: Construct a partition of a database D of n objects into a set of k clustersGiven a k, find a partition of k clusters that optimizes the chosen partitioning criterionGlobal optimal: exhaustively enumerate all partitionsHeuristic methods: k-means and k-medoids algorithmsk-means (MacQueen67): Each cluster is represented by the center of the clusterk-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw87): Each cluster is represented by one of the objects in the cluster

TCSS555A Isabelle Bichindaritz

The K-Means Clustering Method Given k, the k-means algorithm is implemented in 4 steps:Partition objects into k nonempty subsetsCompute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.Assign each object to the cluster with the nearest seed point. Go back to Step 2, stop when no more new assignment.

TCSS555A Isabelle Bichindaritz

The K-Means Clustering Method Example

TCSS555A Isabelle Bichindaritz

- Comments on the K-Means MethodStrength Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t
Variations of the K-Means MethodA few variants of the k-means which differ inSelection of the initial k meansDissimilarity calculationsStrategies to calculate cluster meansHandling categorical data: k-modes (Huang98)Replacing means of cluster

Recommended