cluster analysis techniques

Statistics for Research II (Multivariate Techniques) Reference Materials Available articles (websites) Lecture-notes/power-points and the book: Multivariate Analysis By Joseph Hair, William Black, Barry Babin, Rolph Anderson, Ronald Dixon


Post on 18-Aug-2015


DESCRIPTION

Cluster analysis

TRANSCRIPT

Multivariate Techniques

Dependence Techniques: Multiple Regression, Discriminant Analysis, Conjoint Analysis
Interdependence Techniques: Factor Analysis, Cluster Analysis, Multidimensional Scaling

Broad Learning Objectives

1. Explain cluster analysis and its role in multivariate analysis.
2. Understand how customer/product/firm etc. similarity is measured.
3. Understand the differences between clustering techniques.
4. Interpret the results of a cluster analysis.

Objective of Cluster Analysis

The objective of cluster analysis is to group population/sample observations into different groups/clusters so that observations within each group/cluster are similar to one another and the groups/clusters themselves stand apart from one another. In other words, the objective is to divide the items/objects of a population/sample (customers, products, firms etc.) into homogeneous and distinct groups with respect to some variables/attributes of interest. The conceptual framework of the research problem and its objectives suggests what variables/attributes should be used for clustering such items/objects.

Objective of Cluster Analysis (continued)

Cluster analysis may be used in business for segmenting markets by identifying customers of similar needs, tracking commodities in order of preference/quality, etc., and for knowing the dynamics involved in their changes with time. Cluster analysis is also occasionally used to group variables into homogeneous and distinct groups. This approach is used, for example, in revising a questionnaire on the basis of responses received to a drafted questionnaire. The grouping of the questions by means of cluster analysis helps to identify redundant questions and reduce their number.

A Simple Example of Clustering

Example 1: Assume the daily expenditures on food (X1) and clothing (X2) of five persons given in the table below.

Person   X1    X2
a        2     4
b        8     2
c        9     3
d        1     5
e        8.5   1

The data of this table are plotted in the figure on the next slide.

Explaining Methods of Clustering (contd.)

Figure: Grouping of observations

Inspection of the figure suggests that the five observations form two clusters: one consisting of persons a and d, and the other of b, c and e. The observations in each cluster are similar with respect to expenditures on food and clothing, and the two clusters are quite distinct from each other.

When observations are obtained for many variables/attributes and the number of observations is large, it is impractical to examine all possible clusterings of the available observations and to summarize each one according to the degree of proximity among the cluster elements and the separation of the clusters.

Popular Clustering Methods

There are many clustering methods. One method, for example, begins with as many groups as there are observations, and then systematically merges observations to reduce the number of groups by one, two, ..., until a single group containing all observations is formed. Another method begins with a given number of groups and an arbitrary assignment of the observations to the groups, and then reassigns the observations one by one so that ultimately each observation belongs to the nearest group. These methods employ a distance measure or a measure of association to indicate precisely the degree of similarity (proximity/nearness/closeness) among observations.

Measures of Distance

Definition (Measures of distance): The Euclidean distance between the two points i and j is the hypotenuse of the triangle ABC:

$$D(i, j) = \sqrt{A^2 + B^2} = \sqrt{(X_{1i} - X_{1j})^2 + (X_{2i} - X_{2j})^2}$$
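As a rough illustration of the Euclidean distance just defined, the sketch below applies it to the five persons of Example 1. The coordinates are the X1 (food) and X2 (clothing) values as read from the Example 1 table; the names `DATA`, `euclidean` and `distance_table` are made up for this sketch and do not come from the slides.

```python
import math
from itertools import combinations

# Daily expenditures (X1 = food, X2 = clothing) as read from the Example 1 table.
DATA = {
    "a": (2.0, 4.0),
    "b": (8.0, 2.0),
    "c": (9.0, 3.0),
    "d": (1.0, 5.0),
    "e": (8.5, 1.0),
}


def euclidean(p, q):
    """Hypotenuse of the right triangle with legs A = p[0] - q[0] and B = p[1] - q[1]."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def distance_table(data):
    """Euclidean distance for every unordered pair of observations."""
    return {
        (i, j): euclidean(data[i], data[j])
        for i, j in combinations(sorted(data), 2)
    }


if __name__ == "__main__":
    for (i, j), d in distance_table(DATA).items():
        print(f"D({i}, {j}) = {d:.3f}")
```

Under these assumptions the smallest distance is between b and e, and the pairs inside {a, d} and {b, c, e} are all much closer to each other than any pair that straddles the two groups, which matches the visual grouping described for the figure.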

Observation i is closer (more similar) to j than observation k if D(i, j) < D(i, k).

An alternative measure is the squared Euclidean distance:

$$D_2(i, j) = A^2 + B^2 = (X_{1i} - X_{1j})^2 + (X_{2i} - X_{2j})^2$$

Yet another measure is the city block distance, defined as

$$D_3(i, j) = |A| + |B| = |X_{1i} - X_{1j}| + |X_{2i} - X_{2j}|$$

Measures of Distance (contd.)

All three measures of distance depend on the units in which X1 and X2 are measured, and are influenced by whichever variable takes numerically larger values. For this reason, the variables are often standardized so that they have mean 0 and variance 1 before cluster analysis is applied. Alternatively, weights w1, w2, ..., wk reflecting the importance of the variables could be used and a weighted measure of distance calculated.

Nearest Neighbor Methods of Clustering

Begin with as many clusters as there are observations, that is, with each observation forming a separate cluster. Merge that pair of observations that are nearest one another, leaving n - 1 clusters for the next step. Next, merge into one cluster that pair of clusters that are nearest one another, leaving n - 2 clusters for the next step. Continue in this fashion, reducing the number of clusters by one at each step, until a single cluster consisting of all n observations is formed. At each step, keep track of the distance at which the clusters are formed. In order to determine the number of clusters, consider the step(s) at which the merging distance is relatively large.

A problem with this procedure is how to measure the distance between clusters consisting of two or more observations.

[...] We arbitrarily select (ad) as the new cluster and obtain three clusters: (ad), (be) and c, as shown in Figure 2(b). The distance between (be) and (ad) is

D(be, ad) = min {D(be, a), D(be, d)} = min {6.325, 7.616} = 6.325,

while that between c and (ad) is

D(c, ad) = min {D(c, a), D(c, d)} = min {7.071, 8.246} = 7.071.

The three clusters remaining at this step and the distances between these clusters are shown in Figure 3(a).

Figure 3

Nearest Neighbor Method: Example (continued)

Nearest neighbor method, Step 3: Among these three clusters, merge the two that are closest, i.e. merge (be) with c to form the cluster (bce) shown in Figure 3(b). The distance between the two remaining clusters is

D(ad, bce) = min {D(ad, be), D(ad, c)} = min {6.325, 7.071} = 6.325

The grouping of these two clusters, it will be noted, occurs at a distance of 6.325, a much greater distance than that at which the earlier groupings took place. Figure 4 shows the final grouping.

Figure 4

Nearest Neighbor Method: Example (continued)

Nearest neighbor method, Step 4: The groupings and the distances at which these took place are also shown in the tree diagram (dendrogram) of Figure 5.

Figure 5 (dendrogram)

Nearest Neighbor Method: Example (continued)

One usually searches the dendrogram for large jumps in the grouping distance as guidance in arriving at the number of groups. In this illustration, it is clear that the elements in each of the clusters (ad) and (bce) are close (they were merged at a small distance), but the clusters themselves are far apart (the distance at which they merge is large). Hence, this method of clustering shows (ad) and (bce) as the clusters we are looking for.

Farthest Neighbor Method of Clustering

The nearest neighbor method is not the only method for measuring the distance between clusters. Under the farthest neighbor (or complete linkage) method, the distance between two clusters is the distance between their two most distant members. See the adjoining figure.

Figure: Cluster distance, farthest neighbor method

Farthest Neighbor Method: the Simple Example

This method of clustering the observations in Example 1 can also be started from Figure 1. The farthest neighbor method also calls for grouping b and e at Step 1. However, the distances between (be), on the one hand, and the clusters (a), (c), and (d), on the other, are different from those obtained with the nearest neighbor method.

Farthest Neighbor Method: Example (continued)

D(be, a) = max {D(b, a), D(e, a)} = max {6.325, 7.159} = 7.159
D(be, c) = max {D(b, c), D(e, c)} = max {1.414, 2.062} = 2.062
D(be, d) = max {D(b, d), D(e, d)} = max {7.616, 8.500} = 8.500

The four clusters remaining at Step 1 and the distances between these clusters are shown in Figure 6(a).

Figure 6

In Step 2, the nearest clusters (a) and (d) are grouped into cluster (ad). The remaining steps are similarly executed.

Average Linkage Method of Clustering

The nearest and farthest neighbor methods produce the same results in Example 1. In other cases, however, the two methods may not agree. For instance, consider the following figure.

Figure: Two cluster [...]
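The nearest and farthest neighbor walkthroughs above can be sketched in a few lines of code. This is only a minimal illustration, assuming the Example 1 expenditures and Euclidean distance; the function name `agglomerate` and its `linkage` parameter are invented for the sketch. Passing `min` measures the distance between two clusters by their closest members (nearest neighbor), and passing `max` uses their most distant members (farthest neighbor).

```python
import math
from itertools import combinations

# Daily expenditures (X1 = food, X2 = clothing) as read from the Example 1 table.
DATA = {
    "a": (2.0, 4.0),
    "b": (8.0, 2.0),
    "c": (9.0, 3.0),
    "d": (1.0, 5.0),
    "e": (8.5, 1.0),
}


def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def agglomerate(data, linkage=min):
    """Repeatedly merge the two closest clusters and record each merge.

    linkage=min -> nearest neighbor method (closest members)
    linkage=max -> farthest neighbor method (most distant members)
    Returns a list of (left_cluster, right_cluster, merge_distance) tuples.
    """
    def cluster_distance(pair):
        left, right = pair
        return linkage(euclidean(data[i], data[j]) for i in left for j in right)

    # Start with every observation in its own cluster.
    clusters = [(name,) for name in sorted(data)]
    history = []
    while len(clusters) > 1:
        # Ties are broken by taking the first closest pair encountered.
        left, right = min(combinations(clusters, 2), key=cluster_distance)
        distance = cluster_distance((left, right))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(tuple(sorted(left + right)))
        history.append((left, right, round(distance, 3)))
    return history


if __name__ == "__main__":
    for label, linkage in (("nearest", min), ("farthest", max)):
        print(label)
        for left, right, distance in agglomerate(DATA, linkage):
            print(f"  merge {left} + {right} at distance {distance}")
```

With either linkage the last merge joins (ad) with (bce), and it happens at a distance far larger than any earlier merge, which is the jump one looks for when choosing the number of clusters.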
K-means method: Example 1 (continued)

We now calculate the distance of a and b from the two centroids:

Since a is closer to the centroid of Cluster 1, to which it belongs, a is not reassigned. Since b is closer to Cluster 2's centroid than to that of Cluster 1, it is reassigned to Cluster 2. The new cluster centroids are calculated as shown in Figure 8(a) on the next slide and plotted in Figure 8(b).

Variants of K-means method

Other variants of the K-means method require that the first cluster centroids (the "seeds", as they are sometimes called) be specified. These seeds could be observations. Observations within a specified distance from a centroid are then included in the cluster. In some variants, the first observation found to be nearer another cluster centroid is immediately reassigned and the new centroids recalculated. In others, reassignment and recalculation wait until all observations are examined and one observation is selected on the basis of certain criteria. The "quick" or "fast" clustering procedures used by computer programs such as SAS or S