
Using SAS Enterprise Miner™ for Forecasting Response and Risk

Kattamuri S. Sarma, Ph.D.

White Plains, N.Y.

Abstract

This paper shows how to organize and execute a data mining project for developing predictive models for direct marketing. The steps involved in developing the project are demonstrated using SAS Enterprise Miner™. A data mining process diagram is included to show the sequence of steps in the project. The data mining process diagram consists of a number of connected nodes (tools), each node performing a particular task and passing its output to the next node. The nodes included in the diagram are: Input Data Source node, Filter Outliers node, Data Partition node, Decision Tree node, Regression node, Assessment node, Score node, SAS Code node, and Insight node.

Introduction:

A company may try to boost the efficiency of its marketing campaign by promoting its products or services to those individuals who are most likely to respond. Response models are used to forecast an individual's probability of response and rank the individuals according to the predicted probability of response. However, certain individuals with a high probability of response may also have a high propensity to generate losses for the company. Therefore, it is also necessary to forecast the potential losses associated with each individual in the target population. Risk models are used to predict these potential losses.

In response models the variable being predicted (i.e., the target variable) is usually binary. It takes the value of 1 if there is a response, and 0 if there is no response. In risk models, the target variable can be binary, ordinal, or continuous. For example, banks offering loans will incur losses if an acquired customer fails to pay the borrowed amount. In auto insurance, losses arise whenever the acquired customer has an insurance claim. In risk models for auto insurance companies, the frequency of claims can be used as an indicator of risk. In this paper a risk model is developed to predict claims per car-year. The target variable, namely the claim frequency, is continuous (interval-scaled). By rounding, it can be changed into an ordinal variable.

Response models can be used to calculate a response score, while risk models can be used to calculate a risk score for each individual in the target population. Both of these scores should be used together to achieve optimum selection of individuals (current or prospective customers) for promoting products or services. Customer-level profitability can also be derived from these scores.

This paper shows the steps involved in developing a response model using SAS Enterprise Miner™. A risk model is also developed with claim frequency as the target variable. Since the steps involved in both types of models are the same, the diagrams are provided only for the response model. Before developing the models it is necessary to prepare the data. Data preparation involves finding and eliminating errors, filtering outliers, and imputing missing values. This paper shows how these tasks can be performed using various tools provided by SAS Enterprise Miner™. There are several modeling options in the Enterprise Miner™. Due to their intuitive appeal, decision tree models are demonstrated in this paper.

The data set used in this paper is hypothetical. It is generated for illustrative purposes only. The examples provided here are generated using SAS Version 8.2 for Microsoft Windows 98.

Setting up the Forecasting Project in SAS Enterprise Miner.

To start a new project in the Enterprise Miner we follow these steps:

(1) From the Menu bar at the top of the SAS window, click on Solutions.

(2) Select Analysis and Enterprise Miner.

(3) From the Menu bar select File -> New -> Project.

The Create New Project window opens as shown in Diagram 1. In this window you type the name of the project and select "Create." An Enterprise Miner window opens that contains two sub-windows (Diagram 2). The right-most sub-window is the Diagram Workspace. To its left is the Project Navigator. In the Project Navigator you see the project name followed by the names of the diagrams. Initially there is one diagram named "Untitled." This is changed to "Response 1" in this example. At the bottom of the Project Navigator window there are three tabs: Diagram, Tools, and Reports. After clicking on the Tools tab, a menu of tools opens up. One can click on any tool and drag it into the Diagram Workspace. With a simple point-and-click action, one can perform complex tasks using these tools. Some of these tools are also on the tool bar. A Data Mining Process Diagram, created for developing the response model, is shown in Diagram 3.

Diagram 1: Creating a new project

Diagram 2: Enterprise Miner Window: Project Navigator and Diagram Workspace


Diagram 3: Enterprise Miner window: Data mining process diagram

Input Data Source node:

In the data mining process diagram (Diagram 3) the first node is the Input Data Source node. In this node the source data is specified, and the roles of the variables are defined. SAS creates a data mining database from the input source data. Diagram 4 below shows the Input Data Source node.

One can open the Input Data Source node either by double-clicking on it or right-clicking and selecting "Open." The Input Source Data window opens, allowing one to specify Source Data, Description, and Role. In order to specify the source data, one first selects the library reference and then the data set name. In this example, the source data is mylib.book1. The role of the data set is "Raw." Other choices for the role are: "Train," "Validate," "Test," or "Score." At the top of the input source window there are five tabs labeled: Data, Variables, Interval Variables, Class Variables, and Notes. In this window one can also change the size of the metadata sample from its default size of 2000.
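The metadata sample mentioned above is used to infer each variable's measurement level from a limited number of records. As a rough illustration of the idea only (this is not SAS code; the function name, the 20-distinct-values cutoff, and the sample data are invented for this sketch), a Python version might look like:

```python
def infer_measurement_levels(rows, sample_size=2000, class_threshold=20):
    """Classify each variable as 'interval' or 'class' from a metadata sample.

    Loosely mimics how a metadata sample is used: only the first
    `sample_size` records are inspected, and a numeric variable with
    few distinct values is treated as a class variable.
    """
    sample = rows[:sample_size]
    levels = {}
    for var in sample[0]:
        values = [r[var] for r in sample if r[var] is not None]
        numeric = all(isinstance(v, (int, float)) for v in values)
        distinct = len(set(values))
        levels[var] = "interval" if numeric and distinct > class_threshold else "class"
    return levels

# Invented sample data: CREDIT is numeric with many values, GENDER is categorical.
rows = [{"CREDIT": float(i), "GENDER": "M" if i % 2 else "F"} for i in range(100)]
```

Enlarging the sample, as the window allows, simply makes this inference more reliable at the cost of reading more records.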

Diagram 4: Input Data Source node

Diagram 5 : Assigning model role to variables in the Input Data Source node


Assigning "Model Role" to the variables in the Input Data Source node:

You can select the Variables tab to view the variables list and assign model roles. The variables window is shown in Diagram 5. There are 11 variables in the response model dataset.

Table I: Variables for the Response Model

AGE: Age of the responder.
CREDIT: An index of credit rating.
MILEAGE: Annual miles driven.
GENDER: Gender.
EMP: Number of jobs held during the last 3 years.
RES: Number of addresses in the last 3 years.
NVEH: Number of vehicles owned.
RESTYPE: Type of residence - private house or other.
MFDU: Dummy variable indicating whether the responder lives in a multifamily dwelling.
RESP: Target variable in the response model. If an individual responds to direct mail then RESP takes the value 1, otherwise it is 0.
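The role assignment in Diagram 5 boils down to marking RESP as the target and the remaining variables as inputs. A hypothetical Python rendering of those roles (the dictionary below is an illustration built from Table I, not anything produced by the Enterprise Miner):

```python
# Hypothetical model-role assignment mirroring Table I:
# RESP is the target, all other variables are inputs.
model_roles = {
    "AGE": "input", "CREDIT": "input", "MILEAGE": "input",
    "GENDER": "input", "EMP": "input", "RES": "input",
    "NVEH": "input", "RESTYPE": "input", "MFDU": "input",
    "RESP": "target",
}

inputs = [v for v, role in model_roles.items() if role == "input"]
target = next(v for v, role in model_roles.items() if role == "target")
```

Downstream modeling nodes then read these roles to decide which columns to model and which column to predict.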

Diagram 6: Filter Outliers node


Filter Outliers Node:

The Filter Outliers node (Diagram 6) is used to clean up the data. One can examine each variable graphically and eliminate extreme values or outliers. When this node is opened the window shown in Diagram 6 appears. At the top of this window there are several tabs: Data, Settings, Class Variables, Interval Variables, Output, and Notes.

Suppose you wish to examine the interval variables for outliers. Click on the Interval Variables tab. In Diagram 6 the CREDIT variable is selected for examination. By right-clicking in the column titled "Range to include" and the row corresponding to the CREDIT variable, another window opens, as shown in Diagram 7. In this window there is a histogram of the variable CREDIT. The histogram has two vertical bars (labeled MIN and MAX) with handles, which can be moved horizontally. The position of the left handle defines the minimum value to include and the position of the right handle defines the maximum value allowed for the variable. Any observation with the variable taking a value outside this range is excluded. In this example, we excluded observations with credit index above 728.859.
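The MIN/MAX handles amount to a simple range filter. A Python sketch of the same logic (illustrative only; the function and the sample CREDIT values are invented), using the 728.859 cutoff from the example:

```python
def filter_outliers(rows, var, min_value=float("-inf"), max_value=float("inf")):
    """Keep only rows whose value for `var` lies inside [min_value, max_value],
    mirroring the MIN and MAX handles of the Filter Outliers node."""
    return [r for r in rows if min_value <= r[var] <= max_value]

# Invented sample: the two records above the cutoff are excluded.
rows = [{"CREDIT": c} for c in (120.0, 640.5, 729.1, 850.0)]
kept = filter_outliers(rows, "CREDIT", max_value=728.859)
```

The filtered rows are what gets passed on to the next node in the process diagram.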

Diagram 7: Filter Outliers node (Select Values window)

Diagram 8: Data Partition node

Data Partition node:

The filtered data set is then passed to the Data Partition node. Here the data set is partitioned into three data sets. The first one is for developing the models. This is called the "Training" data set. The next one is the "Validation" data set, for validating the model, and the third data set is the "Test" data set. In this example we use the test data set for scoring. These three data sets are passed to the successor nodes. In this example the data sets are first passed to the Decision Tree node where we develop the model.
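The partition step can be sketched as a random three-way split. The 40/30/30 proportions and the seed below are assumptions made for this illustration, not values read from the Data Partition node:

```python
import random

def partition(rows, train=0.4, validate=0.3, test=0.3, seed=12345):
    """Randomly split rows into training, validation, and test sets."""
    assert abs(train + validate + test - 1.0) < 1e-9
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # fixed seed for a repeatable split
    n = len(rows)
    n_train = int(n * train)
    n_valid = int(n * validate)
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]

train_set, valid_set, test_set = partition(list(range(1000)))
```

Every record lands in exactly one of the three sets, which is what lets the validation set give an honest error estimate.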

Decision Tree Methodology:

A decision tree partitions the observations (records, examples, or cases) of the data set into distinct groups (disjoint subsets), known as leaves, leaf nodes, or terminal nodes. Each leaf has a unique combination of ranges of the input variables. The root node is the first node of the tree and it contains all the observations of the data set. Starting at the root node the tree algorithm successively splits the data set into sub-regions or nodes. If a node cannot be partitioned further it becomes a terminal node (leaf, or leaf node). The process of partitioning proceeds in the following way:

Let X1, X2, ..., X100 be the variables in the data set. The tree algorithm examines all candidate splits of the form Xi ≤ C, where C is a real number between the minimum and maximum value of Xi. All records with Xi ≤ C go to the left node, and those with Xi > C go to the right node. The algorithm selects the best split on each variable and then selects the best of these. The process is repeated at each node. In order to determine which split is the best, one can use tests of impurity reduction or Pearson's chi-square test.

Impurity reduction:

If c is the split of node v into two child nodes a and b, and πa and πb are the proportions of records from node v going into nodes a and b, then the decrease in impurity is i(v) − πa·i(a) − πb·i(b), where i(v) is the impurity index of node v and i(a) and i(b) are the impurity indexes of nodes a and b. There are two measures of impurity: the Gini Index and Entropy.

Gini Impurity Index:

If p1 is the proportion of responders in a node, and p0 is the proportion of non-responders, the Gini Impurity Index of that node is defined as i(p) = 1 − p1² − p0².

If two records are chosen at random (with replacement) from a node, the probability that both are responders is p1², while the probability that both are non-responders is p0², and the probability that they are either both responders or both non-responders is p1² + p0². Hence 1 − p1² − p0² can be interpreted as the probability that any two elements chosen at random (with replacement) are different. A pure node has a Gini Index of zero.
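Putting the candidate-split search and the Gini criterion together, a minimal Python sketch (illustrative; the function names and the toy CREDIT data are invented) that examines midpoint cutoffs C and keeps the one with the largest impurity reduction:

```python
def gini(labels):
    """Gini impurity of a list of binary labels (1 = responder)."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    p0 = 1.0 - p1
    return 1.0 - p1 * p1 - p0 * p0

def best_split(records, labels, var):
    """Return (cutoff, impurity reduction) for the best split Xi <= C on `var`.

    Candidate cutoffs are midpoints between consecutive distinct values,
    i.e. real numbers between the variable's minimum and maximum.
    """
    values = sorted(set(r[var] for r in records))
    parent = gini(labels)
    n = len(records)
    best_c, best_gain = None, 0.0
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2.0
        left = [y for r, y in zip(records, labels) if r[var] <= c]
        right = [y for r, y in zip(records, labels) if r[var] > c]
        gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

# Toy data: low-credit records respond, high-credit records do not.
records = [{"CREDIT": c} for c in (100, 110, 120, 300, 310, 320)]
labels = [1, 1, 1, 0, 0, 0]
cutoff, gain = best_split(records, labels, "CREDIT")
```

Here the best cutoff separates the responders perfectly, so both child nodes are pure and the impurity reduction equals the parent's Gini Index.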


Entropy:

Entropy is another measure of the impurity of a node. For binary targets it is defined as i(p) = −Σi pi·log2(pi), where the sum runs over i = 0, 1.
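Both impurity measures are easy to compute for a binary target. In this sketch each function takes the responder proportion p1 directly (a simplification of the formulas above; the function names are our own):

```python
from math import log2

def entropy(p1):
    """Entropy impurity for a binary target with responder proportion p1."""
    impurity = 0.0
    for p in (p1, 1.0 - p1):
        if p > 0:               # by convention, 0 * log2(0) = 0
            impurity -= p * log2(p)
    return impurity

def gini(p1):
    """Gini impurity for the same node, for comparison."""
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2
```

Both measures are zero for a pure node and largest at p1 = 0.5 (entropy reaches 1, Gini reaches 0.5), so they rank candidate splits similarly in practice.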

Pearson's Chi-square Test:

To illustrate the chi-square test let us consider a simple example. Suppose node v is split into two child nodes a and b. Suppose there are 1000 individuals in node v. Of these, suppose there are 100 responders and 900 non-responders. Suppose in node a there are 400 individuals and 60 responders, and in node b there are 600 individuals and 40 responders. One can construct a 2 x 2 contingency table with rows representing the child nodes a and b, and columns representing the target classes (1 for response and 0 for no response).

The chi-square test statistic can be calculated as χ² = Σ (O − E)² / E, where O is the observed frequency of the cell, and E is the expected frequency under the null hypothesis that the class proportions are the same in each row. In the root node the proportion of responders is 10%. Under the null hypothesis we expect 40 responders and 360 non-responders in node a, and 60 responders and 540 non-responders in node b. The observed frequencies are 60, 340, 40, and 560 respectively. The χ² statistic is computed using these expected and observed frequencies.

Logworth can be calculated from the associated P-value as −log10(P-value). Logworth increases as P decreases. If there are 10 candidate splits on an input, the one with the highest logworth is selected.
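The worked example can be checked numerically. The sketch below uses the observed and expected counts from the text; the closed-form P-value erfc(sqrt(x/2)) holds for one degree of freedom, which is what a 2 x 2 table has:

```python
from math import erfc, log10, sqrt

def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# From the text: node a has 60 responders / 340 non-responders,
# node b has 40 responders / 560 non-responders; under the null
# hypothesis each row has the root node's 10% response rate.
observed = [60, 340, 40, 560]
expected = [40, 360, 60, 540]

chi2 = chi_square(observed, expected)

# For 1 degree of freedom, P-value = erfc(sqrt(chi2 / 2)),
# and logworth = -log10(P-value).
p_value = erfc(sqrt(chi2 / 2.0))
logworth = -log10(p_value)
```

The statistic comes out to about 18.5, a highly significant split, so its logworth is large.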

Stopping rules for limiting tree growth:

Starting at the root node, the algorithm splits each node further into child nodes or offspring nodes. Splitting a node involves examining all candidate splits on each input, selecting the best split on each, and picking the input that has the best of the selected splits. This process is repeated at each node. Any node which cannot be partitioned further becomes a leaf (terminal node, or leaf node).

One can stop tree growth (i.e., the process of partitioning the nodes into sub-nodes) by specifying a stopping rule. One stopping rule may be to specify that a node should not be split further if the chi-square statistics are not significant. The level of significance can be specified in the Decision Tree node in the Enterprise Miner™.

Tree growth can also be controlled by selecting an appropriate depth of the tree. The maximum depth can be specified in the Basic tab of the Decision Tree node.

Diagram 9: Decision Tree node: Basic Tab

Diagram 10: Decision Tree node: Advanced Tab


Response Model with Decision Tree

In this example the decision tree partitioned the input space into the following disjoint subsets or leaf nodes. These are also known as the terminal nodes or leaves of the tree.

Table II: The leaf nodes of the Response Model

1. EMP ≥ 3 and CREDIT < 152.5. Response score = 43.8
2. MS = U and EMP < 3 and CREDIT < 152.5. Response score = 17.2
3. MS = M and EMP < 3 and CREDIT < 152.5. Response score = 11.5
4. 152.5 ≤ CREDIT < 297.5 and EMP ≥ 3. Response score = 13.8
5. CREDIT ≥ 297.5 and EMP ≥ 3. Response score = 3.9
6. MILEAGE ≥ 45,900 and 152.5 ≤ CREDIT < 285.5 and EMP < 3. Response score = 3.5
7. MILEAGE ≥ 39,162 and CREDIT ≥ 285.5 and EMP < 3. Response score = 16.7
8. AGE < 27.5 and MILEAGE < 45,900 and 152.5 ≤ CREDIT < 285.5 and EMP < 3. Response score = 5.7
9. AGE ≥ 27.5 and MILEAGE < 45,900 and 152.5 ≤ CREDIT < 285.5 and EMP < 3. Response score = 2.7
10. 285.5 ≤ CREDIT < 347.5 and MILEAGE < 39,162 and EMP < 3. Response score = 1.4
11. CREDIT ≥ 347.5 and MILEAGE < 39,162 and EMP < 3. Response score = 0.7

Note: In each node the response score is the same as the proportion of responders. These proportions are also referred to as the posterior probabilities.

In the first leaf node all cases have a value greater than or equal to 3 for the variable EMP. The proportion of responders in this node is 43.8%. In this example we assign a response score of 43.8 for this group.

The second leaf node has all cases with Marital Status (MS) unknown (U) and Credit index less than 152.5. The response rate in this group is 17.2%. In this case the response score is 17.2.
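Scoring a record against Table II is just a walk through the leaf conditions. The Python function below is an illustrative re-encoding of the table, not output from the Enterprise Miner; it also assumes that any marital status other than U falls into the MS = M branch, which the table does not state explicitly:

```python
def response_score(age, credit, mileage, emp, ms):
    """Assign a record to one of the 11 leaf nodes of Table II and
    return that node's response score (posterior probability x 100)."""
    if emp >= 3:
        if credit < 152.5:
            return 43.8                          # leaf 1
        if credit < 297.5:
            return 13.8                          # leaf 4
        return 3.9                               # leaf 5
    # EMP < 3 from here on
    if credit < 152.5:
        return 17.2 if ms == "U" else 11.5       # leaves 2 and 3 (assumption: non-U -> M branch)
    if credit < 285.5:
        if mileage >= 45900:
            return 3.5                           # leaf 6
        return 5.7 if age < 27.5 else 2.7        # leaves 8 and 9
    # CREDIT >= 285.5
    if mileage >= 39162:
        return 16.7                              # leaf 7
    return 1.4 if credit < 347.5 else 0.7        # leaves 10 and 11
```

Because the leaves are disjoint and exhaustive, every record receives exactly one score, which is what the Score node does for each incoming case.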

Risk Model with Regression Tree

The decision tree partitioned the input space into six disjoint subsets. For each subset a risk score is computed from the predicted mean claim frequency.


Table III: The leaf nodes of the Risk Model

1. CREDIT < 66.5. Risk score = 0.259
2. AGE < 19.5 and 66.5 ≤ CREDIT < 293.5. Risk score = 0.273
3. AGE < 45.5 and CREDIT ≥ 293.5. Risk score = 0.089
4. AGE ≥ 45.5 and CREDIT ≥ 293.5. Risk score = 0.05
5. AGE ≥ 19.5 and 66.5 ≤ CREDIT < 197.5. Risk score = 0.155
6. AGE ≥ 19.5 and 197.5 ≤ CREDIT < 293.5. Risk score = 0.094

Note: Risk score is the same as the calculated claim frequency in each node.

The target variable in the risk model is the frequency of claims per car-year. Since this is a continuous variable the decision tree algorithm calculates the mean claim frequency for each leaf node. We call the expected claim frequency the risk score.
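For a continuous target, each leaf's risk score is simply the mean of the target over the training records that land in that leaf. A minimal sketch of that computation (the function name and the (leaf, claim frequency) pairs are invented for illustration):

```python
def leaf_risk_scores(records):
    """Group training records by leaf and use each leaf's mean claim
    frequency as its risk score, as a regression tree does."""
    totals = {}
    for leaf, claim_freq in records:
        s, n = totals.get(leaf, (0.0, 0))
        totals[leaf] = (s + claim_freq, n + 1)
    return {leaf: s / n for leaf, (s, n) in totals.items()}

# Invented training records: (leaf id, claims per car-year).
records = [(1, 0.2), (1, 0.3), (2, 0.0), (2, 0.1), (2, 0.2)]
scores = leaf_risk_scores(records)
```

At scoring time, every record routed to a leaf is assigned that leaf's mean as its predicted claim frequency.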

Score node: The models are developed in the Decision Tree node and passed to the Score node. The Score node applies the model to the data set that we want to score. In the case of the response model, the Score node sends each case (record) to one of the 11 leaf nodes, shown in Table II above, according to the ranges of the values of the input variables of the record. Accordingly each record is assigned an expected probability equal to the posterior probability of response of the node.

Similarly, in the risk model we connected the Decision Tree node to the Score node. Again, based on the input values, each record is assigned to one of the six terminal nodes shown in Table III and the estimated mean claim frequency of the node (calculated from the training data set) is assigned to the record. This mean is the predicted claim frequency for all the individuals in the node.

SAS Code node:

The SAS Code node enables you to further process the scored data set created in the Score node. Custom graphs and tables can be generated.

Insight node:

The Insight node is used to view the scored data set or any other data set imported from the predecessor node.

References:

(1) Kattamuri S. Sarma, 2001. Enterprise Miner™ for Forecasting. Paper 250-26, presented at SUGI 26, Long Beach, California, April 24, 2001.

(2) SAS Institute Inc., Getting Started with Enterprise Miner™ Software, Release 4.1, Cary, NC: SAS Institute Inc., 2000.

Kattamuri S. Sarma, Ph.D.
61 Hawthorne Street
White Plains, NY 10603
(914) 428-8733. Fax: (914) 428-4551.
Email: KSSarma@worldnet.att.net
