Cluster Analysis: Marketing Research

Upload: calmchandan

Post on 07-Apr-2018


  • 8/4/2019 Cluster Analysis-Marketing Research

    1/46

    Business Research Methods

    Cluster Analysis

    2/46

    Cluster Analysis

    Cluster analysis is a multivariate interdependence technique that classifies individuals or objects into a small number of mutually exclusive and exhaustive groups, ensuring that there is as much likeness as possible within groups and as much difference as possible among groups.

    Cluster analysis differs from multiple discriminant analysis in that the groups are not predefined. The purpose of cluster analysis is to determine how many distinct groups exist and to define their composition. It does not predict relationships and is not a dependence technique.

    3/46

    Uses of Cluster Analysis

    It is a multivariate interdependence procedure ideally suited to segmentation applications in marketing.

    A cluster is, by definition, a group of similar objects.

    Segmentation involves identifying groups of target customers who are similar in buying habits, demographic characteristics or psychographics.

    Brands can be clustered too: brands within a cluster are similar to each other and different from those in other clusters (for example, soaps grouped on the basis of various characteristics).

    Cluster analysis is also used to select test market cities that are sufficiently similar (in population, retail sales, number of retail outlets, proportion of various income groups, etc.) so that no extraneous variation causes differences between the experimental and control markets.

    4/46

    Methods of Cluster Analysis

    There are two basic types of methods:

    1. Hierarchical or linkage methods

    2. Non-hierarchical or nodal methods (k-means clustering)

    In the first method we do not have to specify in advance how many clusters are to be extracted. The computer provides a range of solutions, from 1 to n-1 clusters, where n is the number of objects being studied.

    In the second method the number of clusters to be extracted has to be specified in advance.

    Both methods use a distance measure to calculate distances between objects in terms of their various characteristics or variables. The most commonly used measure is Euclidean distance: the square root of the sum of the squared differences in values for each variable.
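
    The Euclidean distance defined above can be sketched in a few lines of Python. This is only an illustration: the function name and the two six-variable respondents are hypothetical, not part of the original slides.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences across variables."""
    if len(a) != len(b):
        raise ValueError("both objects need the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical respondents rated on six variables (illustrative values).
respondent_a = [4, 6, 4, 6, 4, 7]
respondent_b = [3, 5, 4, 6, 4, 7]

distance = euclidean_distance(respondent_a, respondent_b)
print(round(distance, 3))  # 1.414
```

    The two respondents differ by one point on two variables, so the squared differences sum to 2 and the distance is the square root of 2.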

    5/46

    Methods of Cluster Analysis

    Interval-scaled variables are generally best suited for cluster analysis. Ratio-scaled variables can also be used.

    Standardisation of variables is necessary if their units of measurement differ widely.

    The two methods can be combined. First, an initial clustering solution is obtained using a hierarchical procedure, such as average linkage or Ward's. The number of clusters so obtained is then used as input to a non-hierarchical procedure, such as the optimizing partitioning method.
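
    One common way to standardise variables measured in different units is to convert each one to z-scores. The sketch below assumes that approach and uses the population standard deviation; the slides do not say which form of standardisation is intended, and the two variables are hypothetical.

```python
import statistics

def standardise(column):
    """Convert one variable to z-scores (mean 0, standard deviation 1)."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)  # population SD: an assumption of this sketch
    if sd == 0:
        raise ValueError("a constant variable cannot be standardised")
    return [(value - mean) / sd for value in column]

# Hypothetical variables measured in very different units.
income = [20000, 35000, 50000, 65000, 80000]
household_size = [1, 2, 3, 4, 5]

z_income = standardise(income)
z_size = standardise(household_size)
print([round(z, 3) for z in z_income])
print([round(z, 3) for z in z_size])
```

    After standardisation both variables are on the same scale, so neither dominates the distance calculation merely because of its units.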

    6/46

    Steps in Analysis

    Run the hierarchical clustering programme on the variables (after standardisation if necessary).

    Generate the output called the agglomeration schedule. It shows all possible solutions from 1 to n-1 clusters, where n is the number of respondents or objects.

    Going up from the bottom of the agglomeration schedule, look at the column called Coefficients to decide on the number of clusters.

    In this column, starting from the bottom, calculate the difference in the value of the coefficient between neighbouring rows.

    If the maximum value of this difference occurs, say, between the third and fourth rows from the bottom, it indicates that there might be 3 clusters (the lower row number).

    The dendrogram and icicle plot can also be requested; they give essentially the same information in graphical form.

    7/46

    Steps in Analysis

    Once the number of clusters has been identified, a k-means clustering approach can be run on the data.

    The number of clusters obtained in the first stage is used as input.

    The output gives the initial and final cluster centres for each variable.

    The final cluster centres represent the solution at convergence.

    They are used to interpret the average value of each variable for a cluster and thereby describe the clusters.
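
    The k-means step can be sketched as follows. This is a minimal plain-Python version of the usual assign-then-update loop, not the SPSS implementation; the function names, the two-variable data and the starting centres are all hypothetical.

```python
import math

def assign(points, centres):
    """Index of the nearest centre (Euclidean distance) for each point."""
    return [min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            for p in points]

def update(points, labels, centres):
    """Mean of the points in each cluster; an empty cluster keeps its centre."""
    new_centres = []
    for c, old in enumerate(centres):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            new_centres.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centres.append(tuple(old))
    return new_centres

def k_means(points, initial_centres, max_iter=100):
    centres = [tuple(c) for c in initial_centres]
    for _ in range(max_iter):
        labels = assign(points, centres)
        new_centres = update(points, labels, centres)
        if new_centres == centres:  # no centre moved: converged
            break
        centres = new_centres
    return labels, centres

# Hypothetical two-variable data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, final_centres = k_means(points, initial_centres=[(1, 1), (8, 8)])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(final_centres)
```

    The final centres are simply the variable means of the cases in each cluster, which is why they can be read as a profile of the cluster.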

    8/46

    Example

    Clustering of consumers based on attitude towards shopping.

    Six attitudinal variables were identified.

    Consumers were asked to express their degree of agreement with the following statements on a 7-point scale (1 = Disagree; 7 = Agree):

    V1 = Shopping is fun

    V2 = Shopping is bad for your budget

    V3 = I combine shopping with eating out

    V4 = I get best buys when shopping

    V5 = I do not care about shopping

    V6 = You can save a lot of money by comparing prices

    Data obtained from 20 respondents are shown in the next slide.

    In reality the sample size should be much larger.

    9/46

    Attitudinal Data For Clustering

    Case No.   V1   V2   V3   V4   V5   V6
    1           6    4    7    3    2    3
    2           2    3    1    4    5    4
    3           7    2    6    4    1    3
    4           4    6    4    5    3    6
    5           1    3    2    2    6    4
    6           6    4    6    3    3    4
    7           5    3    6    3    3    4
    8           7    3    7    4    1    4
    9           2    4    3    3    6    3
    10          3    5    3    6    4    6
    11          1    3    2    3    5    3
    12          5    4    5    4    2    4
    13          2    2    1    5    4    4
    14          4    6    4    6    4    7
    15          6    5    4    2    1    4
    16          3    5    4    6    4    7
    17          4    4    7    2    2    5
    18          3    7    2    6    4    3
    19          4    6    3    7    2    7
    20          2    3    2    4    7    2

    Table 1

    10/46

    Results of Hierarchical Clustering

    Agglomeration Schedule Using Ward's Procedure

             Clusters combined                    Stage cluster first appears
    Stage    Cluster 1   Cluster 2   Coefficient   Cluster 1   Cluster 2   Next stage
    1        14          16            1.000000    0           0           6
    2        6           7             2.000000    0           0           7
    3        2           13            3.500000    0           0           15
    4        5           11            5.000000    0           0           11
    5        3           8             6.500000    0           0           16
    6        10          14            8.160000    0           1           9
    7        6           12           10.166667    2           0           10
    8        9           20           13.000000    0           0           11
    9        4           10           15.583000    0           6           12
    10       1           6            18.500000    6           7           13
    11       5           9            23.000000    4           8           15
    12       4           19           27.750000    9           0           17
    13       1           17           33.100000    10          0           14
    14       1           15           41.333000    13          0           16
    15       2           5            51.833000    3           11          18
    16       1           3            64.500000    14          5           19
    17       4           18           79.667000    12          0           18
    18       2           4           172.662000    15          17          19
    19       1           2           328.600000    16          18          0


    12/46

    Results of Hierarchical Clustering

    Cluster Membership of Cases Using Ward's Procedure

                  Number of clusters
    Label case    4    3    2
    1             1    1    1
    2             2    2    2
    3             1    1    1
    4             3    3    2
    5             2    2    2
    6             1    1    1
    7             1    1    1
    8             1    1    1
    9             2    2    2
    10            3    3    2
    11            2    2    2
    12            1    1    1
    13            2    2    2
    14            3    3    2
    15            1    1    1
    16            3    3    2
    17            1    1    1
    18            4    3    2
    19            3    3    2
    20            2    2    2

    13/46

    Vertical Icicle Plot Using Ward's Method

    14/46

    Dendrogram Using Ward's Method

    15/46

    Interpretation

    Theoretical or practical considerations may suggest a certain number of clusters. If the purpose is market segmentation, management may want a particular number of clusters.

    The value in the Coefficients column suddenly more than doubles between stage 17 (3 clusters) and stage 18 (2 clusters).

    Likewise, at the last 2 stages of the dendrogram, clusters are being combined at large distances.

    It appears that a three-cluster solution is appropriate.
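
    The jump in the Coefficients column can be checked with a short script. The coefficients are copied from the agglomeration schedule; looking only at the last five stages and using the ratio between successive coefficients are assumptions made for this sketch, and the function name is hypothetical.

```python
# Coefficient column of the agglomeration schedule, stages 1 to 19.
coefficients = [
    1.0, 2.0, 3.5, 5.0, 6.5, 8.16, 10.166667, 13.0, 15.583, 18.5,
    23.0, 27.75, 33.1, 41.333, 51.833, 64.5, 79.667, 172.662, 328.6,
]
n_cases = len(coefficients) + 1  # 20 respondents give 19 merge stages

def largest_relative_jump(coeffs, last_stages=5):
    """Return (stage, ratio) where the next stage's coefficient grows most.

    Only the final `last_stages` merges are inspected (an assumption of this
    sketch), because early ratios are inflated by very small coefficients.
    """
    best_stage, best_ratio = None, 0.0
    first = max(1, len(coeffs) - last_stages)
    for stage in range(first, len(coeffs)):        # stage numbers are 1-based
        ratio = coeffs[stage] / coeffs[stage - 1]  # next stage vs this stage
        if ratio > best_ratio:
            best_stage, best_ratio = stage, ratio
    return best_stage, best_ratio

stage, ratio = largest_relative_jump(coefficients)
clusters_left = n_cases - stage  # clusters remaining once that stage is done
print(stage, round(ratio, 2), clusters_left)  # 17 2.17 3
```

    Stopping after stage 17, just before the costly merge at stage 18, leaves 20 - 17 = 3 clusters.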


    17/46

    Interpretation

    A 3-cluster solution results in clusters with 8, 6 and 6 respondents.

    A 4-cluster solution has 8, 6, 5 and 1 respondents.

    It is not meaningful to have a cluster with only one case, so a 3-cluster solution is preferable.

    Interpreting and profiling clusters involves examining the cluster centroids.

    18/46

    Cluster Centroids

    Means of Variables

    Cluster No.   V1      V2      V3      V4      V5      V6
    1             5.750   3.625   6.000   3.125   1.750   3.875
    2             1.667   3.000   1.833   3.500   5.500   3.333
    3             3.500   5.833   3.333   6.000   3.500   6.000
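
    A centroid is just the mean of each variable over the cases in a cluster. As an illustration, the sketch below recomputes the centroid of cluster 2 from the six cases that Ward's three-cluster solution assigns to it, with responses copied from Table 1; the helper name is hypothetical.

```python
# Responses (V1..V6) of the six cases in cluster 2, copied from Table 1.
cluster_2 = {
    2:  [2, 3, 1, 4, 5, 4],
    5:  [1, 3, 2, 2, 6, 4],
    9:  [2, 4, 3, 3, 6, 3],
    11: [1, 3, 2, 3, 5, 3],
    13: [2, 2, 1, 5, 4, 4],
    20: [2, 3, 2, 4, 7, 2],
}

def centroid(rows):
    """Mean of each variable over the cases in one cluster."""
    rows = list(rows)
    return [sum(column) / len(rows) for column in zip(*rows)]

centre = [round(value, 3) for value in centroid(cluster_2.values())]
print(centre)  # [1.667, 3.0, 1.833, 3.5, 5.5, 3.333]
```

    The result agrees with the cluster 2 row of the centroid table.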

    19/46

    Interpretation

    Cluster 1 is high on

    V1: Shopping is fun

    V3: Combine shopping with eating out

    and low on

    V5: Do not care about shopping

    FUN-LOVING & CONCERNED SHOPPERS (1, 3, 6, 7, 8, 12, 15, 17)

    Cluster 2 is high on

    V5: Do not care about shopping

    and low on

    V1: Shopping is fun

    V3: Combine shopping with eating out

    APATHETIC SHOPPERS (2, 5, 9, 11, 13, 20)

    Cluster 3 is high on

    V2: Shopping upsets budget

    V4: Try to get best buys

    V6: Can save a lot of money by comparing prices

    ECONOMICAL SHOPPERS (4, 10, 14, 16, 18 & 19)

    20/46

    Interpretation

    Further profiling can be done on the basis of variables not used for clustering.

    Demographic, psychographic, product usage and media usage variables can be used to target marketing efforts for each cluster.

    The variables that significantly differentiate between clusters can be identified through discriminant analysis.

    21/46

    Results of Nonhierarchical Clustering

    Initial Cluster Centers

             Cluster
             1    2    3
    V1       4    2    7
    V2       6    3    2
    V3       3    2    6
    V4       7    4    4
    V5       2    7    1
    V6       7    2    3

    Iteration History (a)

                 Change in Cluster Centers
    Iteration    1       2       3
    1            2.154   2.102   2.550
    2            0.000   0.000   0.000

    a. Convergence achieved due to no or small distance change. The maximum distance by which any center has changed is 0.000. The current iteration is 2. The minimum distance between initial centers is 7.746.

    22/46

    Results of Nonhierarchical Clustering

    Cluster Membership

    Case Number   Cluster   Distance
    1             3         1.414
    2             2         1.323
    3             3         2.550
    4             1         1.404
    5             2         1.848
    6             3         1.225
    7             3         1.500
    8             3         2.121
    9             2         1.756
    10            1         1.143
    11            2         1.041
    12            3         1.581
    13            2         2.598
    14            1         1.404
    15            3         2.828
    16            1         1.624
    17            3         2.598
    18            1         3.555
    19            1         2.154
    20            2         2.102

    23/46

    Results of Nonhierarchical Clustering

    Final Cluster Centers

             Cluster
             1    2    3
    V1       4    2    6
    V2       6    3    4
    V3       3    2    6
    V4       6    4    3
    V5       4    6    2
    V6       6    3    4

    Distances between Final Cluster Centers

    Cluster   1       2       3
    1                 5.568   5.698
    2         5.568           6.928
    3         5.698   6.928


    25/46

    Clustering Variables

    When variables rather than objects are clustered, the units used for analysis are the variables, and the distance measures are computed for all pairs of variables.

    Hierarchical clustering of variables can aid in the identification of unique variables, or variables that make a unique contribution to the data.

    Clustering can also be used to reduce the number of variables. Associated with each cluster is a linear combination of the variables in the cluster, called the cluster component. A large set of variables can often be replaced by the set of cluster components with little loss of information. However, a given number of cluster components does not generally explain as much variance as the same number of principal components.

    26/46

    SPSS Windows

    To select these procedures using SPSS for Windows, click:

    Analyze>Classify>Hierarchical Cluster

    Analyze>Classify>K-Means Cluster

    27/46

    Difference Between Cluster Analysis & Discriminant Analysis

    Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters. Cluster analysis is also called classification analysis, or numerical taxonomy.

    Both cluster analysis and discriminant analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership for each object or case included, to develop the classification rule. In contrast, in cluster analysis there is no a priori information about the group or cluster membership for any of the objects. Groups or clusters are suggested by the data, not defined a priori.

    28/46

    An Ideal Clustering Situation

    [Figure: scatter plot with axes Variable 1 and Variable 2]

    29/46

    A Practical Clustering Situation

    [Figure: scatter plot with axes Variable 1 and Variable 2]

    30/46

    Statistics Associated with Cluster Analysis

    Agglomeration schedule. An agglomeration schedule gives information on the objects or cases being combined at each stage of a hierarchical clustering process.

    Cluster centroid. The cluster centroid is the set of mean values of the variables for all the cases or objects in a particular cluster.

    Cluster centers. The cluster centers are the initial starting points in nonhierarchical clustering. Clusters are built around these centers, or seeds.

    Cluster membership. Cluster membership indicates the cluster to which each object or case belongs.

    31/46

    Statistics Associated with Cluster Analysis

    Dendrogram. A dendrogram, or tree graph, is a graphical device for displaying clustering results. Vertical lines represent clusters that are joined together. The position of the line on the scale indicates the distances at which clusters were joined. The dendrogram is read from left to right. Fig 2 is a dendrogram.

    Distances between cluster centers. These distances indicate how separated the individual pairs of clusters are. Clusters that are widely separated are distinct, and therefore desirable.

    32/46

    Statistics Associated with Cluster Analysis

    Icicle diagram. An icicle diagram is a graphical display of clustering results, so called because it resembles a row of icicles hanging from the eaves of a house. The columns correspond to the objects being clustered, and the rows correspond to the number of clusters. An icicle diagram is read from bottom to top. Figure 20.7 is an icicle diagram.

    Similarity/distance coefficient matrix. A similarity/distance coefficient matrix is a lower-triangle matrix containing pairwise distances between objects or cases.

    33/46

    Conducting Cluster Analysis

    Formulate the Problem

    Select a Distance Measure

    Select a Clustering Procedure

    Decide on the Number of Clusters

    Interpret and Profile Clusters

    Assess the Validity of Clustering

    34/46

    Conducting Cluster Analysis: Formulate the Problem

    Perhaps the most important part of formulating the clustering problem is selecting the variables on which the clustering is based.

    Inclusion of even one or two irrelevant variables may distort an otherwise useful clustering solution.

    Basically, the set of variables selected should describe the similarity between objects in terms that are relevant to the marketing research problem.

    The variables should be selected based on past research, theory, or a consideration of the hypotheses being tested. In exploratory research, the researcher should exercise judgment and intuition.


    36/46

    A Classification of Clustering Procedures

    Clustering Procedures
      Hierarchical
        Agglomerative
          Linkage Methods: Single, Complete, Average
          Variance Methods: Ward's Method
          Centroid Methods
        Divisive
      Nonhierarchical
        Sequential Threshold
        Parallel Threshold
        Optimizing Partitioning

    37/46

    Conducting Cluster Analysis: Select a Clustering Procedure (Hierarchical)

    Hierarchical clustering is characterized by the development of a hierarchy or tree-like structure. Hierarchical methods can be agglomerative or divisive.

    Agglomerative clustering starts with each object in a separate cluster. Clusters are formed by grouping objects into bigger and bigger clusters. This process is continued until all objects are members of a single cluster.

    Divisive clustering starts with all the objects grouped in a single cluster. Clusters are divided or split until each object is in a separate cluster.

    Agglomerative methods are commonly used in marketing research. They consist of linkage methods, error sums of squares or variance methods, and centroid methods.

    38/46

    Conducting Cluster Analysis: Select a Clustering Procedure (Linkage Method)

    The single linkage method is based on minimum distance, or the nearest neighbor rule. At every stage, the distance between two clusters is the distance between their two closest points.

    The complete linkage method is similar to single linkage, except that it is based on the maximum distance or the furthest neighbor approach. In complete linkage, the distance between two clusters is calculated as the distance between their two furthest points.

    The average linkage method works similarly. However, in this method, the distance between two clusters is defined as the average of the distances between all pairs of objects, where one member of the pair is from each of the clusters.
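
    The three linkage rules differ only in how they summarise the pairwise distances between two clusters. A minimal sketch, with hypothetical function names and made-up one-variable clusters chosen so the distances are easy to check by hand:

```python
import math
from itertools import product

def pair_distances(cluster_a, cluster_b):
    """Distances for every pair with one member taken from each cluster."""
    return [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    return min(pair_distances(cluster_a, cluster_b))  # two closest points

def complete_linkage(cluster_a, cluster_b):
    return max(pair_distances(cluster_a, cluster_b))  # two furthest points

def average_linkage(cluster_a, cluster_b):
    distances = pair_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)            # mean over all pairs

# Hypothetical clusters measured on a single variable.
cluster_1 = [(0,), (1,)]
cluster_2 = [(4,), (6,)]

print(single_linkage(cluster_1, cluster_2))    # 3.0
print(complete_linkage(cluster_1, cluster_2))  # 6.0
print(average_linkage(cluster_1, cluster_2))   # 4.5
```

    The four cross-cluster distances here are 4, 6, 3 and 5, which gives the minimum, maximum and mean shown in the comments.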

    39/46

    Linkage Methods of Clustering

    [Figure: three panels, each showing Cluster 1 and Cluster 2. Single Linkage: minimum distance. Complete Linkage: maximum distance. Average Linkage: average distance.]

    40/46

    Conducting Cluster Analysis: Select a Clustering Procedure (Variance Method)

    The variance methods attempt to generate clusters that minimize the within-cluster variance.

    A commonly used variance method is Ward's procedure. For each cluster, the means for all the variables are computed. Then, for each object, the squared Euclidean distance to the cluster means is calculated. These distances are summed for all the objects. At each stage, the two clusters whose merger gives the smallest increase in the overall sum of squared within-cluster distances are combined.

    In the centroid methods, the distance between two clusters is the distance between their centroids (means for all the variables). Every time objects are grouped, a new centroid is computed.

    Of the hierarchical methods, average linkage and Ward's method have been shown to perform better than the other procedures.
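
    The merge rule in Ward's procedure can be illustrated directly from the description above. The sketch below is a naive version for illustration only (statistical packages use faster update formulas); the function names are hypothetical, and the three single-case clusters are cases 14, 16 and 18 from Table 1.

```python
def within_ss(cluster):
    """Sum of squared Euclidean distances from each object to the cluster means."""
    means = [sum(column) / len(cluster) for column in zip(*cluster)]
    return sum((value - mean) ** 2
               for row in cluster
               for value, mean in zip(row, means))

def ward_increase(cluster_a, cluster_b):
    """Increase in the overall within-cluster sum of squares if the two merge."""
    merged = cluster_a + cluster_b
    return within_ss(merged) - within_ss(cluster_a) - within_ss(cluster_b)

def best_merge(clusters):
    """Pair of cluster indices whose merger adds the least sum of squares."""
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return min(pairs, key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]))

# Three cases from Table 1, each starting as its own cluster.
clusters = [
    [[4, 6, 4, 6, 4, 7]],  # case 14
    [[3, 5, 4, 6, 4, 7]],  # case 16
    [[3, 7, 2, 6, 4, 3]],  # case 18
]
print(best_merge(clusters))                     # (0, 1)
print(ward_increase(clusters[0], clusters[1]))  # 1.0
```

    Merging cases 14 and 16 adds a sum of squares of 1.0, which agrees with the coefficient of 1.000 reported for stage 1 of the agglomeration schedule.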

    41/46

    Other Agglomerative Clustering Methods

    [Figure: two panels, Ward's Procedure and Centroid Method]


    43/46

    Conducting Cluster Analysis: Select a Clustering Procedure

    It has been suggested that the hierarchical and nonhierarchical methods be used in tandem. First, an initial clustering solution is obtained using a hierarchical procedure, such as average linkage or Ward's. The number of clusters and cluster centroids so obtained are used as inputs to the optimizing partitioning method.

    Choice of a clustering method and choice of a distance measure are interrelated. For example, squared Euclidean distances should be used with Ward's and the centroid methods. Several nonhierarchical procedures also use squared Euclidean distances.

    44/46

    Conducting Cluster Analysis: Decide on the Number of Clusters

    Theoretical, conceptual, or practical considerations may suggest a certain number of clusters.

    In hierarchical clustering, the distances at which clusters are combined can be used as criteria. This information can be obtained from the agglomeration schedule or from the dendrogram.

    In nonhierarchical clustering, the ratio of total within-group variance to between-group variance can be plotted against the number of clusters. The point at which an elbow or a sharp bend occurs indicates an appropriate number of clusters.

    The relative sizes of the clusters should be meaningful.
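
    The elbow criterion can be illustrated with a small calculation. The sketch below uses sums of squares in place of variances (an assumption that keeps the arithmetic simple), made-up one-variable data, and hand-picked partitions rather than the output of an actual clustering run.

```python
def mean_vector(rows):
    """Mean of each variable over a list of observations."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def squared_distance(row, centre):
    return sum((value - c) ** 2 for value, c in zip(row, centre))

def within_between_ratio(clusters):
    """Within-group sum of squares divided by between-group sum of squares."""
    grand_mean = mean_vector([row for cluster in clusters for row in cluster])
    within = sum(squared_distance(row, mean_vector(cluster))
                 for cluster in clusters for row in cluster)
    between = sum(len(cluster) * squared_distance(mean_vector(cluster), grand_mean)
                  for cluster in clusters)
    return within / between

# Hypothetical one-variable data with three visible groups.
low = [(1,), (2,), (3,)]
mid = [(10,), (11,), (12,)]
high = [(20,), (21,), (22,)]
solutions = {
    2: [low + mid, high],
    3: [low, mid, high],
    4: [low, mid, [(20,), (21,)], [(22,)]],
}
ratios = {k: within_between_ratio(clusters) for k, clusters in solutions.items()}
for k in sorted(ratios):
    print(k, round(ratios[k], 4))
```

    The ratio falls sharply from two to three clusters and barely changes from three to four, so the bend at three clusters marks the elbow in this made-up data.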

    45/46

    Conducting Cluster Analysis: Interpreting and Profiling the Clusters

    Interpreting and profiling clusters involves examining the cluster centroids. The centroids enable us to describe each cluster by assigning it a name or label.

    It is often helpful to profile the clusters in terms of variables that were not used for clustering. These may include demographic, psychographic, product usage, media usage, or other variables.

    46/46

    Assess Reliability and Validity

    1. Perform cluster analysis on the same data using different distance measures. Compare the results across measures to determine the stability of the solutions.

    2. Use different methods of clustering and compare the results.

    3. Split the data randomly into halves. Perform clustering separately on each half. Compare cluster centroids across the two subsamples.

    4. Delete variables randomly. Perform clustering based on the reduced set of variables. Compare the results with those obtained by clustering based on the entire set of variables.

    5. In nonhierarchical clustering, the solution may depend on the order of cases in the data set. Make multiple runs using different orders of cases until the solution stabilizes.
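
    Comparing two clustering solutions requires a measure that ignores how the clusters happen to be numbered. The slides do not name one; the sketch below uses the Rand index as one possible choice and applies it to the Ward's three-cluster membership and the k-means membership reported earlier in the deck.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of case pairs on which two clusterings agree.

    A pair agrees when both solutions put the two cases together, or both
    put them apart. Cluster numbering does not matter.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

# Cluster membership of cases 1 to 20, copied from the two membership tables.
ward_3_clusters = [1, 2, 1, 3, 2, 1, 1, 1, 2, 3, 2, 1, 2, 3, 1, 3, 1, 3, 3, 2]
k_means_clusters = [3, 2, 3, 1, 2, 3, 3, 3, 2, 1, 2, 3, 2, 1, 3, 1, 3, 1, 1, 2]

agreement = rand_index(ward_3_clusters, k_means_clusters)
print(agreement)  # 1.0
```

    A value of 1.0 means the two methods group the 20 respondents identically; only the cluster numbers differ.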