Cluster Analysis Using RapidMiner and SAS 9.3

Upload: madhumita-ghosh

Post on 18-Nov-2014

TRANSCRIPT

  • 1. Cluster Analysis Using RapidMiner and SAS 9.3
  • 2. Agenda: The Data; Some preliminary treatments; Checking for outliers; Manual outlier checking for a given confidence level; Filtering outliers; Data without outliers; Selecting attributes for clusters; Setting up clusters; Reading the clusters; Using SAS for clustering; Dendrogram; Depicting Tree using SAS; Conclusion
  • 3. The Data: Number of observations: 97. 3 numeric variables: birth rate, death rate, and infant mortality rate (each per thousand). 1 polynominal (multi-valued nominal) variable: Country. Data obtained from the UN Demographic Yearbook 1990.
  • 4. Some preliminary treatments Checking for outliers using RapidMiner
  • 5. Some preliminary treatments: Manual checking for outliers at a given confidence level. For Birth (95%, taken as mean ± 2 standard deviations): mu - 2(sigma) = 27.384 - 2(12.978) = 1.428; mu + 2(sigma) = 27.384 + 2(12.978) = 53.34. Hence, no outliers in Birth.
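The interval arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch in Python (the slides use RapidMiner and SAS, not Python); the function names and the sample values passed to `flag_outliers` are hypothetical and are not taken from the data set. Only the mean 27.384 and standard deviation 12.978 come from the slide.

```python
def two_sigma_bounds(mean, std, k=2.0):
    """Return (lower, upper) = (mean - k*std, mean + k*std)."""
    return mean - k * std, mean + k * std


def flag_outliers(values, mean, std, k=2.0):
    """Return the values that fall outside mean +/- k standard deviations."""
    lower, upper = two_sigma_bounds(mean, std, k)
    return [v for v in values if v < lower or v > upper]


# Mean and standard deviation of Birth as reported on the slide.
lower, upper = two_sigma_bounds(27.384, 12.978)
print(round(lower, 3), round(upper, 3))

# Hypothetical sample values, for illustration only.
inside = flag_outliers([10.0, 30.0, 52.0], 27.384, 12.978)
outside = flag_outliers([60.0], 27.384, 12.978)
```

Any value between the two bounds is treated as ordinary; the slide's "no outliers" conclusion corresponds to `flag_outliers` returning an empty list for the Birth column.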
  • 6. Filtering outliers o 10 outliers recorded
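  • 7. Data without outliers o Filter examples o Parameter string: outlier=true o Invert filter (keeps only the rows not flagged as outliers)
  • 8. Selecting attributes for clusters o Clustering on a polynominal variable is not meaningful here, since the distance calculations need numeric attributes o Remove Country from attribute list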
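The slides do not say which outlier operator produced the 10 flagged rows. As a rough illustration only, the sketch below assumes a distance-based rule: score each point by the distance to its k-th nearest neighbour, flag the top n, then keep the unflagged rows (the equivalent of filtering on outlier=true with the filter inverted). It is written in Python rather than RapidMiner, and the function names, the parameters `k` and `n_outliers`, and the four sample points are all hypothetical.

```python
import math


def knn_outlier_flags(points, k=2, n_outliers=1):
    """Flag the n_outliers points whose k-th nearest neighbour is farthest away."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])  # distance to the k-th nearest neighbour
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [i in flagged for i in range(len(points))]


def keep_inliers(points, flags):
    """Equivalent of 'outlier=true' with the filter inverted: drop flagged rows."""
    return [p for p, is_outlier in zip(points, flags) if not is_outlier]


# Hypothetical (birth, death, infant death) rows, not taken from the data set.
points = [(10.0, 8.0, 9.0), (11.0, 9.0, 10.0), (12.0, 8.0, 11.0), (45.0, 25.0, 180.0)]
flags = knn_outlier_flags(points, k=2, n_outliers=1)
clean = keep_inliers(points, flags)
```

Because the rule always flags exactly `n_outliers` rows, it ranks points by isolation rather than testing them against a fixed threshold, which is a different criterion from the mean ± 2 sigma check on slide 5.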
  • 9. Setting up clusters o K=3 o Join both nodes to get cluster model information
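To make the K=3 step concrete, here is a minimal k-means sketch in plain Python. It is not RapidMiner's implementation: the initial centroids are simply the first k points (an assumption made so the run is reproducible, whereas k-means implementations typically start from random points), and the six sample rows are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    # Assumption: deterministic start from the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        converged = new_labels == labels and new_centroids == centroids
        labels, centroids = new_labels, new_centroids
        if converged:
            break
    return labels, centroids


# Hypothetical (birth, death, infant death) rows, for illustration only.
points = [
    (12.0, 9.0, 8.0), (28.0, 8.0, 45.0), (46.0, 18.0, 120.0),
    (14.0, 10.0, 10.0), (30.0, 9.0, 50.0), (48.0, 20.0, 130.0),
]
labels, centroids = kmeans(points, k=3)
```

Comparing the rows of `centroids` is what the next slide does when it describes clusters as having low, moderate, or high values of each variable.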
  • 10. Reading the Clusters: Cluster 1: low values of each numeric variable; Cluster 2: high values of each numeric variable; Cluster 0: moderate values of each numeric variable
  • 11. Reading the Clusters: Scatter plot of Birth and Death against Infant Death Rate (point size: Infant Death Rate)
  • 12. Using SAS for clustering: Using canonical variables for standardization of variables to mean 0 and standard deviation 1. Spherical within-cluster covariance matrix.
        proc aceclus data=Poverty out=Ace p=.03 noprint;
           var Birth Death InfantDeath;
        run;
        proc cluster data=Ace outtree=Tree method=ward ccc pseudo print=15;
           var can1 can2 can3;
           id Country;
        run;
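The `method=ward` option asks for Ward's minimum-variance linkage: at each step, merge the two clusters whose union gives the smallest increase in within-cluster sum of squares, which for clusters of sizes n_a and n_b with centroids c_a and c_b equals n_a*n_b/(n_a+n_b) times the squared distance between c_a and c_b. The sketch below illustrates that rule in plain Python. It is a naive illustration, not what PROC CLUSTER does internally, and the function names and the six two-dimensional points (standing in for can1 and can2 values) are hypothetical.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    na, nb = len(a), len(b)
    ca = [sum(col) / na for col in zip(*a)]  # centroid of a
    cb = [sum(col) / nb for col in zip(*b)]  # centroid of b
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * sq_dist


def ward_linkage(points, n_clusters):
    """Merge the cheapest pair (by Ward cost) until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical points on two canonical variables, for illustration only.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.2)]
clusters = ward_linkage(points, n_clusters=3)
```

Recording the cost of each merge as the loop runs would give the heights of a dendrogram like the one on slide 14.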
  • 13. Using SAS for clustering First 2 canonical variables account for about 93% of the total variation
  • 14. Dendrogram
  • 15. Tree depiction: Plot can1 and can2 by cluster. The plot is similar to the RapidMiner output.
  • 16. Conclusion
      o Cluster 1: Mostly developed European nations, USA, UK, Singapore, USSR, etc.: efficient allocation of public goods; lower crime rates; abortion legalized
      o Cluster 2: Afghanistan, Pakistan, Iran, mostly underprivileged African nations: low GDP; abortion not legal; high crime rates, prevalent wars and terrorist activities; poor health standards, high poverty levels
      o Cluster 0: India, Mexico, South Africa, Saudi Arabia, etc.: emerging nations; increasing growth rates; controlled negative externalities; focus on literacy and employment