Cluster Analysis


  • Lecture-41: What is Cluster Analysis?

  • What is Cluster Analysis?

    Cluster: a collection of data objects
    - Similar to one another within the same cluster
    - Dissimilar to the objects in other clusters
    Cluster analysis: grouping a set of data objects into clusters
    Clustering is unsupervised classification: no predefined classes
    Applications:
    - As a stand-alone tool to get insight into data distribution
    - As a preprocessing step for other algorithms

  • General Applications of Clustering

    Pattern Recognition
    Spatial Data Analysis
    - create thematic maps in GIS by clustering feature spaces
    - detect spatial clusters and explain them in spatial data mining
    Image Processing
    Economic Science (market research)
    WWW
    - Document classification
    - Cluster Weblog data to discover groups of similar access patterns

  • Examples of Clustering Applications

    Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
    Land use: identification of areas of similar land use in an earth observation database
    Insurance: identifying groups of motor insurance policy holders with a high average claim cost
    City planning: identifying groups of houses according to their house type, value, and geographical location
    Earthquake studies: observed earthquake epicenters should be clustered along continent faults

  • What Is Good Clustering?

    A good clustering method will produce high-quality clusters with
    - high intra-class similarity
    - low inter-class similarity
    The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
    The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

  • Requirements of Clustering in Data Mining

    - Scalability
    - Ability to deal with different types of attributes
    - Discovery of clusters with arbitrary shape
    - Minimal requirements for domain knowledge to determine input parameters
    - Ability to deal with noise and outliers
    - Insensitivity to the order of input records
    - High dimensionality
    - Incorporation of user-specified constraints
    - Interpretability and usability

  • Lecture-42: Types of Data in Cluster Analysis

  • Data Structures

    Data matrix (two modes)
    Dissimilarity matrix (one mode)
    (The standard forms of both matrices are sketched below.)
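    The matrix figures themselves did not survive extraction. As a reminder, the conventional forms (restated here, not recovered from the slides) are an n x p data matrix of n objects described by p variables, and an n x n dissimilarity matrix of pairwise distances d(i, j):

```latex
% Data matrix (two modes): n objects x p variables
\begin{bmatrix}
  x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
  \vdots &        & \vdots &        & \vdots \\
  x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\qquad
% Dissimilarity matrix (one mode): lower triangle of pairwise distances
\begin{bmatrix}
  0      &        &        &        \\
  d(2,1) & 0      &        &        \\
  d(3,1) & d(3,2) & 0      &        \\
  \vdots & \vdots & \vdots & \ddots \\
  d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}
```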

  • Measure the Quality of Clustering

    Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j)
    There is a separate "quality" function that measures the goodness of a cluster.
    The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
    Weights should be associated with different variables based on applications and data semantics.
    It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

  • Types of data in cluster analysis

    - Interval-scaled variables
    - Binary variables
    - Categorical, ordinal, and ratio-scaled variables
    - Variables of mixed types

  • Interval-valued variables

    Standardize the data:
    - Calculate the mean absolute deviation s_f (formulas reconstructed below)
    - Calculate the standardized measurement (z-score) z_if
    Using the mean absolute deviation is more robust than using the standard deviation.
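    The formulas the slide refers to were images; their standard definitions are:

```latex
m_f    = \tfrac{1}{n}\,(x_{1f} + x_{2f} + \cdots + x_{nf})                          % mean of variable f
s_f    = \tfrac{1}{n}\,(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)  % mean absolute deviation
z_{if} = \frac{x_{if} - m_f}{s_f}                                                   % standardized measurement (z-score)
```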


  • Similarity and Dissimilarity Between Objects

    Distances are normally used to measure the similarity or dissimilarity between two data objects.
    Some popular ones include the Minkowski distance (written out below),
    where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects and q is a positive integer.
    If q = 1, d is the Manhattan distance.
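    The distance formulas were lost with the slide images; the standard Minkowski and Manhattan forms are:

```latex
d(i,j) = \left( |x_{i1}-x_{j1}|^{q} + |x_{i2}-x_{j2}|^{q} + \cdots + |x_{ip}-x_{jp}|^{q} \right)^{1/q}  % Minkowski
d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|                                   % Manhattan (q = 1)
```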


  • Similarity and Dissimilarity Between Objects (continued)

    If q = 2, d is the Euclidean distance (written out below).
    Properties:
    - d(i, j) >= 0
    - d(i, i) = 0
    - d(i, j) = d(j, i)
    - d(i, j) <= d(i, k) + d(k, j)
    One can also use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures.
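    The Euclidean form (standard definition, restated since the slide's formula was an image):

```latex
d(i,j) = \sqrt{\,|x_{i1}-x_{j1}|^{2} + |x_{i2}-x_{j2}|^{2} + \cdots + |x_{ip}-x_{jp}|^{2}\,}  % Euclidean (q = 2)
```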


  • Binary Variables

    A contingency table for binary data, comparing object i with object j (reconstructed below)
    Simple matching coefficient (invariant, if the binary variable is symmetric)
    Jaccard coefficient (noninvariant, if the binary variable is asymmetric)
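    The contingency table and both coefficients were images on the slide; the standard definitions, with q, r, s, t counting the (1,1), (1,0), (0,1) and (0,0) combinations over the p binary variables, are:

```latex
\begin{array}{c|cc|c}
            & j{=}1 & j{=}0 & \text{sum} \\ \hline
 i{=}1      & q     & r     & q+r        \\
 i{=}0      & s     & t     & s+t        \\ \hline
 \text{sum} & q+s   & r+t   & p
\end{array}

d_{\text{simple}}(i,j) = \frac{r+s}{q+r+s+t}
\qquad
d_{\text{Jaccard}}(i,j) = \frac{r+s}{q+r+s}
```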

  • Dissimilarity between Binary Variables

    Example (see the table and the worked distances below):
    - gender is a symmetric attribute
    - the remaining attributes are asymmetric binary
    - let the values Y and P be set to 1, and the value N be set to 0

    Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
    -----+--------+-------+-------+--------+--------+--------+-------
    Jack |   M    |   Y   |   N   |   P    |   N    |   N    |   N
    Mary |   F    |   Y   |   N   |   P    |   N    |   P    |   N
    Jim  |   M    |   Y   |   P   |   N    |   N    |   N    |   N
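    Computing the Jaccard coefficient on the asymmetric attributes (gender excluded, Y/P = 1, N = 0) gives the worked distances for this example:

```latex
d(\text{Jack},\text{Mary}) = \frac{0+1}{2+0+1} = 0.33 \qquad
d(\text{Jack},\text{Jim})  = \frac{1+1}{1+1+1} = 0.67 \qquad
d(\text{Jim},\text{Mary})  = \frac{1+2}{1+1+2} = 0.75
```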

  • Categorical Variables

    A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
    Method 1: simple matching (formula below)
    - m is the number of matches, p is the total number of variables
    Method 2: use a large number of binary variables
    - create a new binary variable for each of the M nominal states
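    The simple-matching formula referred to above, in its standard form (restated since the slide's version was an image):

```latex
d(i,j) = \frac{p - m}{p}   % m matches out of p categorical variables
```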

  • Ordinal Variables

    An ordinal variable can be discrete or continuous; order is important, e.g., rank
    Can be treated like interval-scaled variables:
    - replace x_if by its rank r_if
    - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by its normalized rank (formula below)
    - compute the dissimilarity using methods for interval-scaled variables
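    The rank normalization the slide points to is, in its standard form (M_f being the number of ordered states of variable f):

```latex
z_{if} = \frac{r_{if} - 1}{M_f - 1}, \qquad r_{if} \in \{1, \ldots, M_f\}
```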

  • Ratio-Scaled Variables

    Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt)
    Methods:
    - treat them like interval-scaled variables
    - apply a logarithmic transformation: y_if = log(x_if)
    - treat them as continuous ordinal data and treat their rank as interval-scaled

  • Variables of Mixed Types

    A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
    One may use a weighted formula to combine their effects (see below):
    - f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, otherwise d_ij^(f) = 1
    - f is interval-based: use the normalized distance
    - f is ordinal or ratio-scaled: compute the ranks r_if and treat z_if as interval-scaled
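    The weighted combination formula, in its standard form (the indicator delta_ij^(f) is conventionally 0 when the comparison for variable f cannot be made, e.g., a missing value or a 0-0 match on an asymmetric binary variable, and 1 otherwise):

```latex
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
\qquad\text{with}\qquad
z_{if} = \frac{r_{if} - 1}{M_f - 1}\ \text{ for ordinal or ratio-scaled } f
```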

  • Lecture-43: A Categorization of Major Clustering Methods

  • Major Clustering Approaches

    Partitioning algorithms: construct various partitions and then evaluate them by some criterion
    Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
    Density-based: based on connectivity and density functions
    Grid-based: based on a multiple-level granularity structure
    Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of that model to the data

  • Lecture-44: Partitioning Methods

  • Partitioning Algorithms: Basic Concept

    Partitioning method: construct a partition of a database D of n objects into a set of k clusters
    Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
    - Global optimum: exhaustively enumerate all partitions
    - Heuristic methods: the k-means and k-medoids algorithms
    - k-means: each cluster is represented by the center of the cluster
    - k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster

  • The K-Means Clustering Method

    Given k, the k-means algorithm is implemented in four steps (a sketch follows this list):
    1. Partition the objects into k nonempty subsets.
    2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster).
    3. Assign each object to the cluster with the nearest seed point.
    4. Go back to Step 2; stop when there are no more new assignments.
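    A minimal NumPy sketch of these four steps, assuming numeric data in an (n, p) array X; the function and variable names are illustrative, not from the lecture:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, p) array of n objects with p numeric attributes."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: partition objects into k nonempty subsets (random initial assignment).
    labels = rng.integers(0, k, size=n)
    for _ in range(max_iter):
        # Step 2: compute seed points as the centroids (mean points) of the current clusters.
        centroids = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                              else X[rng.integers(0, n)] for c in range(k)])
        # Step 3: assign each object to the cluster with the nearest seed point (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: go back to Step 2; stop when no assignment changes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Example usage on a tiny synthetic data set:
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = k_means(X, k=2)
print(labels, centroids)
```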

  • The K-Means Clustering Method: Example (figure)

  • The K-Means Method

    Strength
    - Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally, k, t << n

  • Variations of the K-Means Method

    A few variants of k-means differ in:
    - selection of the initial k means
    - dissimilarity calculations
    - strategies to calculate cluster means
    Handling categorical data: k-modes
    - replace the means of clusters with modes
    - use new dissimilarity measures to deal with categorical objects
    - use a frequency-based method to update the modes of clusters
    A mixture of categorical and numerical data: the k-prototype method

  • The K-Medoids Clustering Method

    Find representative objects, called medoids, in clusters (a sketch follows this list)
    PAM (Partitioning Around Medoids, 1987)
    - starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
    - works effectively for small data sets, but does not scale well to large data sets
    CLARA (Kaufmann & Rousseeuw, 1990)
    CLARANS (Ng & Han, 1994)
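    A rough Python sketch of the PAM swap idea, assuming a precomputed (n, n) matrix D of pairwise distances; the names and structure are illustrative, not from the lecture:

```python
import numpy as np

def pam(D, k, max_iter=100):
    """Very small PAM sketch: D is an (n, n) symmetric matrix of pairwise distances."""
    n = D.shape[0]
    medoids = list(range(k))                      # initial medoids: first k objects (arbitrary choice)

    def total_cost(meds):
        # each object contributes its distance to the nearest medoid
        return D[:, meds].min(axis=1).sum()

    for _ in range(max_iter):
        best_cost, best_swap = total_cost(medoids), None
        # try replacing each medoid with each non-medoid, keep the best improving swap
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(candidate)
                if cost < best_cost:
                    best_cost, best_swap = cost, candidate
        if best_swap is None:                     # no improving swap: stop
            break
        medoids = best_swap
    labels = D[:, medoids].argmin(axis=1)         # assign each object to its nearest medoid
    return medoids, labels
```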