cluster analysis

Download Cluster analysis

Post on 24-Jan-2015




1 download

Embed Size (px)




  • 1. Data Analysis CourseCluster Analysis and DecisionTrees(version-1)Venkat Reddy

2. Data Analysis Course Data analysis design document Introduction to statistical data analysis Descriptive statistics Data exploration, validation & sanitization Probability distributions examples and applicationsVenkat ReddyData Analysis Course Simple correlation and regression analysis Multiple liner regression analysis Logistic regression analysis Testing of hypothesis Clustering and Decision trees Time series analysis and forecasting Credit Risk Model building-1 2 Credit Risk Model building-2 3. Note This presentation is just class notes. The course notes for DataAnalysis Training is by written by me, as an aid for myself. The best way to treat this is as a high-level summary; theactual session went more in depth and contained other Venkat Reddy Data Analysis Courseinformation. Most of this material was written as informal notes, notintended for publication Please send questions/comments/corrections or Please check my website for latest version of this document -Venkat Reddy3 4. Contents What is the need of Segmentation Introduction to Segmentation & Cluster analysis Applications of Cluster Analysis Types of ClustersVenkat ReddyData Analysis Course Building Partitional Clusters K means Clustering & Analysis Building Decision Trees CHAID Segmentation & Analysis 4 5. What is the need of segmentation?Problem: 10,000 Customers - we know their age, city name, income,employment status, designation You have to sell 100 Blackberry phones(each costs $1000) toVenkat ReddyData Analysis Coursethe people in this group. You have maximum of 7 days If you start giving demos to each individual, 10,000 demos willtake more than one year. How will you sell maximum numberof phones by giving minimum number of demos? 5 6. What is the need of segmentation?Solution Divide the whole population into two groups employed / unemployed Further divide the employed population into two groups high/low salary Further divide that group into high /low designation 10000 customers Venkat Reddy Data Analysis CourseUnemployedEmployed 30007000Low salaryHigh Salary 5000 2000 LowHigh6DesignationDesignation 1800 200 7. Segmentation and Cluster Analysis Cluster is a group of similar objects (cases, points, observations, examples, members, customers, patients, locations, etc) Finding the groups of cases/observations/ objects in the population such that the objects are Homogeneous within the group (high intra-class similarity)Venkat ReddyData Analysis Course Heterogeneous between the groups(low inter-class similarity )Inter-clusterdistances areIntra-cluster distances are maximized minimized 7 8. Applications of Cluster Analysis Market Segmentation: Grouping people (with the willingness,purchasing power, and the authority to buy) according to theirsimilarity in several dimensions related to a product underconsideration. Sales Segmentation: Clustering can tell you what types of customersbuy what productsVenkat ReddyData Analysis Course Credit Risk: Segmentation of customers based on their credit history Operations: High performer segmentation & promotions based onpersons performance Insurance: Identifying groups of motor insurance policy holders witha high average claim cost. City-planning: Identifying groups of houses according to their housetype, value, and geographical location Geographical: Identification of areas of similar land use in an earth 8observation database. 9. Types of Clusters Partitional clustering or non-hierarchical : A divisionof objects into non-overlapping subsets (clusters) suchthat each object is in exactly one cluster The non-hierarchical methods divide a dataset of Nobjects into M clusters. K-means clustering, a non-hierarchical technique, isVenkat ReddyData Analysis Coursethe most commonly used one in business analytics Hierarchical clustering: A set of nested clustersorganized as a hierarchical tree The hierarchical methods produce a set of nestedclusters in which each pair of objects or clusters isprogressively nested in a larger cluster until only onecluster remains9 CHAID tree is most widely used in business analytics 10. Cluster Analysis -ExampleMaths ScienceGk Apt Maths Science Gk AptStudent-194 82 87 89Student-19482 87 89Student-246 67 33 72Student-24667 33 72Student-398 97 93 100 Student-39897 93 100Student-414 5 7 24Student-4145724Student-586 97 95 95Student-58697 95 95Student-634 32 75 66Student-63432 75 66Student-769 44 59 55Student-76944 59 55 Venkat Reddy Data Analysis CourseStudent-885 90 96 89Student-88590 96 89Student-924 26 15 22Student-92426 15 22Maths Science GkApt Student-4 145 7242,7 Student-9 242615 224,9,6 Student-6 343275 66 Student-2 466733 72 Student-7 694459 558,5,1,3 Student-8 859096 89 Student-5 869795 95 10 Student-1 948287 89 Student-3 989793 100 11. Building Clusters1.Select a distance measure2.Select a clustering algorithm3.Define the distance between two clusters4.Determine the number of clusters5.Validate the analysis Venkat Reddy Data Analysis Course The aim is to build clusters i.e divide the whole population into group of similarobjects What is similarity/dis-similarity? 11 How do you define distance between two clusters 12. Distance measures To measure similarity between two observations adistance measure is needed. With a single variable,similarity is straightforward Example: income two individuals are similar if their incomelevel is similar and the level of dissimilarity increases as the Venkat Reddy Data Analysis Courseincome gap increases Multiple variables require an aggregate distancemeasure Many characteristics (e.g. income, age, consumption habits,family composition, owning a car, education level, job), itbecomes more difficult to define similarity with a single value The most known measure of distance is the Euclideandistance, which is the concept we use in everyday life for 12spatial coordinates. 13. Examples of distances A x xkj n2Dij ki Euclidean distancek 1B n ADij xki xkjCity-block (Manhattan) distancek 1B Venkat Reddy Data Analysis CourseDij distance between cases i and jxkj - value of variable xk for case jOther distance measures: Chebychev, Minkowski, Mahalanobis,maximum distance, cosine similarity, simple correlation betweenobservations etc.,Data matrixDissimilarity matrix x11 ... x1f... x1p 0 d(2,1) ... ...... ...... 0 13x... xif... xip d(3,1) d ( 3,2) 0 i1 ... ...... ...... : : : x xnp d ( n,1) d ( n,2) ... ... xnf...... 0 n1 14. K -Means Clustering Algorithm1. The number k of clusters is fixed2. An initial set of k seeds (aggregation centres) is provided 1. First k elements 2. Other seeds (randomly selected or explicitly defined)3. Given a certain fixed threshold, all units are assigned to theVenkat ReddyData Analysis Course nearest cluster seed4. New seeds are computed5. Go back to step 3 until no reclassification is necessaryOr simply Initialize k cluster centers DoAssignment step: Assign each data point to its closest cluster centerRe-estimation step: Re-compute cluster centers14 While (there are still changes in the cluster centers) 15. K Means clustering in action Dividing the data into 10 clusters using K-MeansDistance metric willdecide cluster forthese pointsVenkat ReddyData Analysis Course15 16. SAS Code & Optionsproc fastclus data= mylib.super_market_dataradius=0 replace=full maxclusters =3 maxiter =20 list distance;id cust_id;var age income spend family_size visit_Other_shops;run;Options Venkat Reddy Data Analysis Course The RADIUS= option establishes the minimum distance criterion for selecting newseeds. No observation is considered as a new seed unless its minimum distance toprevious seeds exceeds the value given by the RADIUS= option. The default value is 0. The MAXCLUSTERS= option specifies the maximum number of clusters allowed. If youomit the MAXCLUSTERS= option, a value of 100 is assumed. The REPLACE= option specifies how seed replacement is performed. FULL :requests default seed replacement. PART :requests seed replacement only when the distance between the observation and the closest seed is greater than the minimum distance between seeds. NONE : suppresses seed replacement. 16 RANDOM :Selects a simple pseudo-random sample of complete observations as initial cluster seeds. 17. SAS Code & Options The MAXITER= option specifies the maximum number of iterations for recomputing cluster seeds. When the value of the MAXITER= option is greaterthan 0, each observation is assigned to the nearest seed, and the seeds arerecomputed as the means of the clusters. The LIST option lists all observations, giving the value of the ID variable (ifany), the number of the cluster to which the observation is assigned, andVenkat ReddyData Analysis Coursethe distance between the observation and the final cluster seed. The DISTANCE option computes distances between the cluster means. The ID variable, which can be character or numeric, identifies observationson the output when you specify the LIST option. The VAR statement lists the numeric variables to be used in the clusteranalysis. If you omit the VAR statement, all numeric variables not listed inother statements are used.17 18. Lab- Clustering Retail Analytics problem: Supermarket data. A super market want togive a special discount to few selected customers. The aim toincrease the sales by studying the buying capacity of the customers Find the customer segments using cluster analysis code(do notmention maxclusters)Venkat ReddyData Analysis Course How many clusters have been created? What are the properties of customers in each cluster? Analyze the output and identify the customer group with goodincome, high spend, large family size. How much time does it take to create the clusters? Now create 5 clusters using K means clustering techniques Find the target customer group Further clust