
  • Group Assignment Data Mining MKTG 5963

    Abhinav Garg (11761380)

    Tanu Srivastav (11772446)

    Tejbeer Chhabra (11756746)

    Maunik Desai (11758140)

    Maanasa Nagaraja (11678486)

  • Table of Contents

    Executive Summary
    Data Audit
    Modeling
    Model Comparison
    Scoring
    Segmentation
    Conclusion
    Appendix A: Data Exploration
    Appendix B: Clustering
    Appendix C: Data Modeling
    Appendix D: Model Comparison
    Appendix E: Scored Data

    Table of Tables
    Table 1 Variable Worth in Clusters
    Table 2 Sensitivity and Specificity for Forward Regression Model
    Table 3 Sensitivity and Specificity for Stepwise Regression
    Table 4 Sensitivity and Specificity for Neural Network
    Table 5 Model Comparisons
    Table 6 Scored Data Summary for Target Variable

  • MKTG 5963 Data Mining Group Assignment


    Executive Summary

    Diversity and SAT scores play an important role in creating a better learning environment and a good college

    experience for students. Diversity enriches the educational experience and promotes personal growth, and SAT

    score is a useful predictor of college academic performance.

    In our analysis, we aim to identify the prospective students who are most likely to enroll as new freshmen in Fall 2005.

    We also propose a marketing strategy the administration can use to increase diversity and SAT scores.

    Data Audit

    Before performing data modeling, it is critical to explore the data for interesting insights.

    1. DMDB Node

    The DMDB tool gave us quick insights into the data in the form of summary statistics for the

    numerical variables, the number of categories for the class variables, and the extent of the missing values.

    The results show that the categorical variables have no missing values, that the interval variables do have missing

    values, and that DISTANCE, HSCRAT, INIT_SPAN, INT1RAT, and INT2RAT exhibit non-normal behavior, which can introduce

    bias.
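    The kind of audit the DMDB node performs can be sketched in plain Python. The records, variable values, and missing entries below are invented purely for illustration; only the variable names come from the report.

```python
import math

# Toy records mimicking the admissions data: interval variables may be
# missing (None). Values are made up for illustration.
records = [
    {"distance": 120.0, "hscrat": 0.4,  "satscore": None},
    {"distance": None,  "hscrat": 0.9,  "satscore": 1180.0},
    {"distance": 35.5,  "hscrat": None, "satscore": 1240.0},
    {"distance": 410.0, "hscrat": 0.1,  "satscore": 990.0},
]

def audit(records, var):
    """Summary statistics plus a missing count for one interval variable."""
    values = [r[var] for r in records if r[var] is not None]
    n, missing = len(values), len(records) - len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return {"n": n, "missing": missing, "mean": mean, "std": math.sqrt(variance)}

for v in ("distance", "hscrat", "satscore"):
    print(v, audit(records, v))
```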

    2. Data Reduction

    The variables ACADEMIC_INTEREST_1 and ACADEMIC_INTEREST_2 have counterparts in INT1RAT and INT2RAT

    respectively; similarly, IRSCHOOL was converted into HSCRAT. TELECQ had more than 50% missing values.

    TOTAL_CONTACTS is simply the sum of the other contact counts, and CONTACT_CODE1 has

    hundreds of levels, so such a code does not provide much information. For these reasons,

    ACADEMIC_INTEREST_1, ACADEMIC_INTEREST_2, IRSCHOOL, TELECQ, TOTAL_CONTACTS, and CONTACT_CODE1

    were all removed from the dataset.

    3. Missing Value Imputation

    Since our interval variables had many missing values, we used the PROC MI procedure to impute them

    rather than traditional single-imputation methods, which can introduce unknown biases into the data. The

    PROC MI procedure allows both finding the patterns of missing data and imputation: it simulates and

    generates multiple complete datasets from the original data by repeatedly replacing

    missing entries with imputed values.
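    The multiple-imputation idea can be sketched in miniature. This makes no assumptions about PROC MI's internals: missing entries are filled here by random draws from the observed values, a crude stand-in for PROC MI's model-based imputation, and the per-dataset estimates are then pooled. The SATSCORE values are invented.

```python
import random
import statistics

# SATSCORE values with missing entries (None); values are illustrative only.
satscore = [1100, None, 1250, 980, None, 1320, 1040]

def impute_once(values, rng):
    """Fill each missing entry with a random draw from the observed values
    (a crude stand-in for a model-based imputation step)."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

# Generate m complete datasets and pool the estimates, as multiple
# imputation does.
m = 5
rng = random.Random(42)
completed = [impute_once(satscore, rng) for _ in range(m)]
pooled_mean = statistics.mean(statistics.mean(d) for d in completed)
print(round(pooled_mean, 1))
```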

    4. Data Filter Node

    Extreme values are problematic because they may have undue influence on the model. We handled them

    by excluding observations containing outliers or other extreme values that we do not want to

    include in our model. This also reduces skewness and brings the variables closer to a normal

    distribution. The filtering methods used were Standard Deviations from

    the Mean for interval variables and Rare Values (Percentage) for class variables.
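    The Standard Deviations from the Mean method can be sketched as follows. The DISTANCE values are invented, and the cutoff k is a user choice, just as it is in the Filter node.

```python
import statistics

# Illustrative DISTANCE values (miles); the last observation is extreme.
distance = [12, 40, 55, 8, 30, 22, 65, 900]

def filter_by_std(values, k=3.0):
    """Drop observations more than k standard deviations from the mean,
    mirroring the 'Standard Deviations from the Mean' filter method."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= k * std]

print(filter_by_std(distance, k=2.0))
```

    Note that the extreme value itself inflates the standard deviation, so a loose cutoff (k = 3 here) can fail to remove it; the choice of k matters.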

    5. Data Partitioning

    Before building our models, we split the data into training (70%) and validation (30%) sets. We chose a 70:30 split

    because it is a good compromise for honest assessment, and because it preserved a proportion of our target

    similar to that in the original dataset. A summary of the split is provided in the appendix.
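    A split that keeps the target proportion similar in both partitions is roughly what stratified partitioning does; a minimal sketch, with invented data, is:

```python
import random

# Illustrative labeled data: (row id, enroll flag); 1-in-5 positives.
data = [(i, 1 if i % 5 == 0 else 0) for i in range(100)]

def stratified_split(rows, train_frac=0.7, seed=7):
    """70:30 split that keeps the target proportion similar in both parts
    by sampling within each class (stratification)."""
    rng = random.Random(seed)
    train, valid = [], []
    for label in {y for _, y in rows}:
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        cut = int(round(train_frac * len(group)))
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid

train, valid = stratified_split(data)
print(len(train), len(valid))
```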


    6. Data Transformation

    Data transformation corrects for skewed distributions of the numerical input variables and for large numbers of classes

    in the categorical variables. The skewness and kurtosis values obtained after filtering show that the independent

    variables are approximately normally distributed. We nevertheless applied the Maximum Normal transformation

    to the independent variables, one of the best power-transformation techniques in the Box-Cox

    family, to analyze its effectiveness in reducing skewness and kurtosis.

    Although the skewness values dropped, the decrease is not large enough to justify this methodology. For

    instance, HSCRAT shows a skewness of 2.64, and after the log transformation suggested by Maximum Normal the skewness

    is still 1.9. Moreover, transformed variables come at a cost: they are complicated to interpret

    (log, square root), especially in a business scenario. We therefore chose not to

    perform any transformation.
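    The before/after skewness comparison behind this decision can be reproduced in miniature. The values below merely stand in for a right-skewed variable like HSCRAT (the real values are not available here); the formula is the adjusted Fisher-Pearson sample skewness.

```python
import math

def skewness(values):
    """Adjusted Fisher-Pearson sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1

# Right-skewed toy data; the log transform pulls in the long right tail.
raw = [0.01, 0.02, 0.02, 0.03, 0.05, 0.08, 0.15, 0.60]
logged = [math.log(v) for v in raw]
print(round(skewness(raw), 2), round(skewness(logged), 2))
```

    As in the report, the transformation reduces skewness but does not eliminate it, and the logged scale is harder to interpret.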

    Modeling

    We used Decision Tree, Forward and Stepwise Regression, Neural Network, and AutoNeural

    modeling techniques.

    1. Decision Trees

    Decision tree methodology is a commonly used data mining method for building classification systems based

    on multiple covariates or for developing prediction algorithms for a target variable. A split-search algorithm

    facilitates input selection, and model complexity is controlled by pruning. The settings used for the Decision Tree node were

    Maximum Branch = 2, Maximum Depth = 6, and Minimum Leaf Size = 5, with

    misclassification as the assessment measure.

    Below is the variable importance report:

    Variable Name       Importance
    SELF_INIT_CNTCTS    1.0000
    HSCRAT              0.3798
    STUEMAIL            0.2767
    INIT_SPAN           0.1404
    MAILQ               0.0816
    INTEREST            0.0698
    INT1RAT             0.0638

    Table 1 Variable Worth in Clusters

    Model Assessment: Validation Misclassification Rate = 0.0512 and Training Misclassification Rate = 0.058.
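    The split search mentioned above can be illustrated with a minimal exhaustive search on a single interval input. Gini impurity is used as the splitting criterion here (one of several criteria SAS offers), and the data are invented: a toy version of self-initiated contact counts against the enrollment flag.

```python
def gini(labels):
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """Try every threshold on one interval input and keep the one that
    minimises the weighted child impurity (an exhaustive split search)."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: self-initiated contact counts vs. enrollment flag (illustrative).
contacts = [0, 1, 1, 2, 3, 4, 5, 6]
enroll   = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_split(contacts, enroll)
print(threshold, impurity)
```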

    2. Regression

    Since our dependent variable ENROLL is a binary categorical variable, the type of regression chosen is logistic.

    2.1 Forward regression

    Forward regression creates a sequence of models of increasing complexity. At each step, every variable not

    already in the model is tested for inclusion, and the most significant of these variables is added, as

    long as its p-value is below SLENTRY = 0.05. The variables selected are SELF_INIT_CNTCTS, STUEMAIL,

    HSCRAT, INIT_SPAN, DISTANCE, SATSCORE, MAILQ, and INT2RAT. We also studied interaction effects through

    our regression model. The significant interactions are:

    (Note on the decision tree: SELF_INIT_CNTCTS was found to be the most important variable in determining a prospective student's enrollment decision, and the minimum misclassification rate was found at 12 leaves.)


    CAMPUS_VISIT * PREMIERE

    REFERAL_CNTCTS * INTEREST

    INTEREST * PREMIERE

    INSTATE * MAILQ

    TRAVEL_INIT_CNTCTS * INTEREST

    TERRITORY * STUEMAIL

    Model Assessment: Validation Misclassification Rate = 0.0786 and Training Misclassification Rate = 0.073.

    Sensitivity: 91%    Specificity: 93%

    Table 2 Sensitivity and Specificity for Forward Regression Model
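    Sensitivity and specificity follow directly from the confusion matrix; a minimal sketch with invented outcomes:

```python
def sensitivity_specificity(actual, predicted):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy outcomes: 10 actual enrollees, 10 non-enrollees (illustrative only).
actual    = [1] * 10 + [0] * 10
predicted = [1] * 9 + [0] + [0] * 9 + [1]
sens, spec = sensitivity_specificity(actual, predicted)
print(sens, spec)
```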

    2.2 Stepwise Regression

    Stepwise regression combines elements of forward selection and backward elimination.
