# Data Mining Using SAS


Group Assignment Data Mining MKTG 5963

Abhinav Garg (11761380)

Tanu Srivastav (11772446)

Tejbeer Chhabra (11756746)

Maunik Desai (11758140)

Maanasa Nagaraja (11678486)

Table of Contents

Executive Summary

Data Audit

Modeling

Model Comparison

Scoring

Segmentation

Conclusion

Appendix A: Data Exploration

Appendix B: Clustering

Appendix C: Data Modeling

Appendix D: Model Comparison

Appendix E: Scored Data

List of Tables

Table 1 Variable Worth in Clusters

Table 2 Sensitivity and Specificity for Forward Regression Model

Table 3 Sensitivity and Specificity for Stepwise Regression

Table 4 Sensitivity and Specificity for Neural Network

Table 5 Model Comparisons

Table 6 Scored Data Summary for Target Variable


Executive Summary

Diversity and SAT scores play an important role in creating a better learning environment and a good college experience for students. Diversity enriches the educational experience and promotes personal growth, and SAT score is a useful predictor of college academic performance.

In our analysis, we aim to identify the prospective students most likely to enroll as new freshmen in Fall 2005. We also focus on a marketing strategy the administration can use to increase diversity and SAT scores.

Data Audit

Before performing data modeling, it is critical to explore the data to uncover interesting insights.

1. DMDB Node

The DMDB tool gave us quick insights into the data in the form of summary statistics for the numerical variables, the number of categories for the class variables, and the extent of missing values. From the results, it is apparent that the categorical variables have no missing values while the interval variables do, and that DISTANCE, HSCRAT, INIT_SPAN, INT1RAT, and INT2RAT exhibit non-normal behavior, which can introduce bias.
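Outside of the DMDB node, the same audit can be sketched with base SAS procedures. This is an illustrative snippet, not the node's actual output: the dataset name INQ2005 and the variable subsets are placeholders standing in for the inquiry data.

```sas
/* Summary statistics and missing-value counts for the interval  */
/* variables flagged above (dataset name INQ2005 is a placeholder) */
proc means data=inq2005 n nmiss mean std min max skew kurt;
   var distance hscrat init_span int1rat int2rat satscore;
run;

/* Category counts, including missing levels, for class variables */
proc freq data=inq2005;
   tables enroll instate mailq / missing;
run;
```

The SKEW and KURT statistics surface the non-normal behavior noted above without leaving base SAS.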

2. Data Reduction

Variables ACADEMIC_INTEREST_1 and ACADEMIC_INTEREST_2 have their counterparts in INT1RAT and INT2RAT respectively. Similarly, IRSCHOOL was converted into HSCRAT. TELECQ had more than 50% missing values. TOTAL_CONTACTS is simply the sum of the other contact-count variables. CONTACT_CODE1 has hundreds of levels, and such a code provides little information. For these reasons, ACADEMIC_INTEREST_1, ACADEMIC_INTEREST_2, IRSCHOOL, TELECQ, TOTAL_CONTACTS, and CONTACT_CODE1 were all removed from the dataset.

3. Missing Value Imputation

Since our interval variables had many missing values, we used the PROC MI procedure to impute them rather than traditional single-imputation methods, which can introduce unknown biases. PROC MI both reveals the patterns of missing data and performs imputation: it simulates and generates multiple complete datasets from the original data by repeatedly replacing missing entries with imputed values.
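A minimal PROC MI call along these lines is sketched below; the dataset name, seed, and variable list are placeholders, and NIMPUTE=5 is SAS's default number of imputations rather than a value taken from the report.

```sas
/* Multiple imputation for the interval inputs. NIMPUTE=5        */
/* produces five completed datasets, stacked in the output and   */
/* distinguished by the _Imputation_ variable.                   */
proc mi data=inq2005 out=inq2005_mi nimpute=5 seed=20050415;
   var distance hscrat init_span int1rat int2rat satscore;
run;
```

Downstream procedures can then be run BY _Imputation_ and the results combined with PROC MIANALYZE, which is the part of the multiple-imputation workflow that single imputation skips.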

4. Data Filter Node

Extreme values are problematic because they may exert undue influence on the model. We handled them by excluding observations containing outliers or other extreme values that we did not want in our model. Filtering also reduces skewness and brings the variables closer to a normal distribution. The filtering methods used for the interval and class variables were Standard Deviations from the Mean and Rare Values (Percentage), respectively.
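The interval-variable method can be mimicked outside the Filter node with a two-step pattern like the one below. This is only a sketch: the 3-standard-deviation cutoff, the single variable DISTANCE, and the dataset names are illustrative assumptions, not the node's configured values.

```sas
/* Compute the mean and standard deviation of DISTANCE */
proc means data=inq2005 noprint;
   var distance;
   output out=stats mean=mu std=sigma;
run;

/* Keep only observations within 3 standard deviations of the mean */
data inq2005_filtered;
   if _n_ = 1 then set stats;   /* load mu and sigma once */
   set inq2005;
   if abs(distance - mu) <= 3*sigma;
run;
```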

5. Data Partitioning

Before building our models, we split the data into training (70%) and validation (30%) sets. We chose a 70/30 split because it is a sweet spot for honest assessment, and it preserved roughly the same proportion of our target variable as in the original dataset. A summary of the split is provided in the appendix.
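The Data Partition node's stratified split can be approximated in base SAS with PROC SURVEYSELECT; stratifying on the target keeps the ENROLL proportion similar in both parts, as the report requires. Dataset names and the seed below are placeholders.

```sas
/* Stratified 70/30 train/validation split (sketch of what the   */
/* Data Partition node does internally).                         */
proc sort data=inq2005;
   by enroll;
run;

proc surveyselect data=inq2005 out=inq2005_part samprate=0.7
                  seed=54321 outall;
   strata enroll;   /* preserve the target proportion in each part */
run;
/* With OUTALL, Selected=1 rows form the training set and        */
/* Selected=0 rows the validation set.                           */
```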


6. Data Transformation

Data transformation corrects for skewed distributions of the numerical input variables and for a large number of classes in the categorical variables. From the skewness and kurtosis values obtained after filtering, the independent variables exhibit approximately normal distributions. We applied the Maximum Normal transformation to the independent variables, one of the better power-transformation techniques from the Box-Cox family, to assess its effectiveness in reducing skewness and kurtosis.

Although the skewness values dropped, the decrease was not large enough to justify this methodology. For instance, HSCRAT shows a skewness of 2.64, and after a log transformation, as suggested by Maximum Normal, the skewness is 1.9. Moreover, transformations bring their own challenges: transformed variables are harder to interpret (log, square root), especially in a business scenario. Therefore, we chose not to perform any transformation.
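The HSCRAT comparison above can be reproduced with a short check like the following; the dataset name is a placeholder, and the +1 offset inside the log is an assumption added to guard against zero values rather than something stated in the report.

```sas
/* Compare skewness of HSCRAT before and after a log transform */
data check;
   set inq2005;
   log_hscrat = log(hscrat + 1);   /* +1 guards against log(0) */
run;

proc univariate data=check noprint;
   var hscrat log_hscrat;
   output out=skew skewness=skew_raw skew_log;
run;

proc print data=skew;   /* raw vs. transformed skewness side by side */
run;
```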

Modeling

We used Decision Tree, Forward and Stepwise Regression, Neural Network, and AutoNeural modeling techniques.

1. Decision Trees

Decision tree methodology is a commonly used data mining method for building classification systems based on multiple covariates and for developing prediction algorithms for a target variable. A split-search algorithm facilitates input selection, and model complexity is controlled by pruning. The settings used for the Decision Tree node are Maximum Branch = 2, Maximum Depth = 6, and Minimum Leaf Size = 5, with the assessment method and misclassification rate as the assessment measure.

Below is the variable importance report:

| Variable Name    | Importance |
|------------------|-----------:|
| SELF_INIT_CNTCTS | 1.0000 |
| HSCRAT           | 0.3798 |
| STUEMAIL         | 0.2767 |
| INIT_SPAN        | 0.1404 |
| MAILQ            | 0.0816 |
| INTEREST         | 0.0698 |
| INT1RAT          | 0.0638 |

Table 1 Variable Worth in Clusters

Model Assessment: Validation Misclassification Rate = 0.0512 and Training Misclassification Rate = 0.058.
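The tree was built in SAS Enterprise Miner's Decision Tree node; a rough stand-alone counterpart using PROC HPSPLIT is sketched below. The dataset name, the CLASS assignments, and the growth/pruning criteria are assumptions — the node's exact defaults may differ.

```sas
/* Approximate stand-alone equivalent of the Decision Tree node's */
/* settings: binary splits, depth 6, minimum leaf size 5.         */
proc hpsplit data=train maxbranch=2 maxdepth=6 minleafsize=5;
   class enroll stuemail mailq;
   model enroll = self_init_cntcts hscrat stuemail init_span
                  mailq interest int1rat;
   grow entropy;            /* split-search criterion (assumed)  */
   prune costcomplexity;    /* prune to control complexity       */
run;
```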

2. Regression

Since our dependent variable ENROLL is a binary categorical variable, the type of regression chosen is logistic.

2.1 Forward regression

Forward regression creates a sequence of models of increasing complexity. At each step, every variable not already in the model is tested for inclusion, and the most significant one is added, provided its p-value is below the SLENTRY threshold of 0.05. The variables selected are SELF_INIT_CNTCTS, STUEMAIL, HSCRAT, INIT_SPAN, DISTANCE, SATSCORE, MAILQ, and INT2RAT. SELF_INIT_CNTCTS was found to be the most important variable in determining the enrollment decision of prospective students, and the minimum misclassification rate was found at Number of Leaves = 12. We also studied interaction effects through our regression model; the significant interactions are:


CAMPUS_VISIT * PREMIERE

REFERAL_CNTCTS * INTEREST

INTEREST * PREMIERE

INSTATE * MAILQ

TRAVEL_INIT_CNTCTS * INTEREST

TERRITORY * STUEMAIL

Model Assessment: Validation Misclassification Rate = 0.0786 and Training Misclassification Rate = 0.073.

| Sensitivity | Specificity |
|-------------|-------------|
| 91%         | 93%         |

Table 2 Sensitivity and Specificity for Forward Regression Model
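The forward-selection model above can be expressed directly in PROC LOGISTIC; this sketch approximates the Enterprise Miner Regression node, with TRAIN as a placeholder dataset name and the CLASS assignments as assumptions.

```sas
/* Forward-selection logistic regression on the enrollment target. */
/* EVENT='1' models the probability of enrolling.                  */
proc logistic data=train;
   class stuemail mailq / param=ref;
   model enroll(event='1') = self_init_cntcts stuemail hscrat
         init_span distance satscore mailq int2rat
         / selection=forward slentry=0.05;   /* entry p-value 0.05 */
run;
```

Interaction terms such as CAMPUS_VISIT*PREMIERE can be added to the MODEL statement's effect list so that forward selection considers them alongside the main effects.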

2.2 Stepwise Regression

The stepwise regression combines elements of forward selection and backward elimination: variables already in the model are re-evaluated at each step and removed if they no longer meet the significance criterion for staying.
