
Enterprise Miner: Applying Data Mining Techniques

Course Notes

Enterprise Miner™: Applying Data Mining Techniques Course Notes was written by Jim Georges, William Potts, and Doug Wielenga.

SAS INSTITUTE TRADEMARKS

The SAS System is an integrated system of software providing complete control over data access, management, analysis, and presentation. Base SAS software is the foundation of the SAS System. Products within the SAS System include SAS/ACCESS, SAS/AF, SAS/ASSIST, SAS/CALC, SAS/CONNECT, SAS/CPE, SAS/DMI, SAS/EIS, SAS/ENGLISH, SAS/ETS, SAS/FSP, SAS/GIS, SAS/GRAPH, SAS/IML, SAS/IMS-DL/I, SAS/INSIGHT, SAS/LAB, SAS/MDDB, SAS/NVISION, SAS/OR, SAS/PH-Clinical, SAS/QC, SAS/REPLAY-CICS, SAS/SESSION, SAS/SHARE, SAS/SPECTRAVIEW, SAS/STAT, SAS/TOOLKIT, SAS/TUTOR, SAS/DB2, SAS/GEO, SAS/IntrNet, SAS/PH-Kinetics, SAS/SECURE, SAS/SHARE*NET, SAS/SQL-DS, and SAS/Warehouse Administrator software. Other SAS Institute products are SYSTEM 2000 Data Management Software, with basic SYSTEM 2000, CREATE, Multi-User, QueX, Screen Writer, and CICS interface software; InfoTap software; JMP, JMP IN, JMP Serve, and StatView software; SAS/RTERM software; the SAS/C Compiler; Video Reality software; Warehouse Viewer software; Budget Vision, Campaign Vision, CFO Vision, Enterprise Miner, Enterprise Reporter, HR Vision, IT Charge Manager, and IT Service Vision software; Scalable Performance Data Server software; SAS OnlineTutor software; and Emulus software. MultiVendor Architecture, MVA, MultiEngine Architecture, MEA, Risk Dimension, and SAS inSchool are trademarks of SAS Institute Inc. SAS Institute also offers SAS Consulting and SAS Video Productions services. Authorline, Books by Users℠, The Encore Series, ExecSolutions, JMPer Cable, Observations, SAS Communications, SAS.COM, SAS OnlineDoc, SAS Professional Services, the SASware Ballot, SelecText, and Solutions@Work documentation are published by SAS Institute Inc. The SAS Video Productions logo, the Books By Users SAS Institute’s Author Service logo, the SAS Online Samples logo, and The Encore Series logo are registered service marks or registered trademarks of SAS Institute Inc. The Helplus logo, the SelecText logo, the Video Reality logo, the Quality Partner logo, the SAS Business Solutions logo, the SAS Rapid Warehousing Program logo, the SAS Publications logo, the Instructor-based Training logo, the Online Training logo, the Trainer’s Kit logo, and the Video-based Training logo are service marks or trademarks of SAS Institute Inc. All trademarks above are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

The Institute is a private company devoted to the support and further development of its software and related services.

Other brand and product names are registered trademarks or trademarks of their respective companies.

Enterprise Miner™: Applying Data Mining Techniques Course Notes

Copyright 1998 by SAS Institute Inc., Cary, NC 27513, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

Book code 56606, course code DMEM, prepared 21Oct98.

Table of Contents

Chapter 1 Getting Started with Enterprise Miner
1.1 Introduction
1.2 Opening the Enterprise Miner
1.3 Enterprise Miner Nodes

Chapter 2 Predictive Modeling
2.1 Problem Formulation
2.2 Project Preliminaries
2.3 Training and Assessment
2.4 Model Deployment
2.5 Missing Value Imputation
2.6 Model Interpretation
2.7 Decision Tree Models
2.8 Dimension Reduction
2.9 Polynomial Regression Models
2.10 Training a Neural Network
2.11 User-defined Models

Chapter 3 Neural Networks
3.1 Introduction
3.2 Low Dimensional Example
3.3 Comparing to Logistic Regression
3.4 Stopped Training
3.5 Two Hidden Layers
3.6 Neural Network Classification

Chapter 4 Decision Trees
4.1 Introduction
4.2 Problem Formulation
4.3 Decision Tree Comprehension
4.4 Cultivation of Tree Varieties
4.5 Tree Deployment

Chapter 5 Cluster Analysis
5.2 Pre-clustering Transformations
5.3 Assessing Clusters Visually
5.4 K-means Clustering
5.5 Visualizing Cluster Separation

Chapter 6 Missing Value Imputation
6.1 Introduction
6.2 Missing Indicators and Simple Data Replacement
6.3 Cluster Mean Imputation

Chapter 7 Associations
7.1 Introduction
7.3 Support, Confidence, and Lift
7.4 Dissociation

Appendices
A.1 Glossary
A.2 References

For Your Information

Course Description

This two-day, Level II course is designed for Enterprise Miner users at all levels. The course provides extensive hands-on experience with Enterprise Miner and covers the basic skills required to assemble analyses using the rich tool set of Enterprise Miner.

To learn more…

A full curriculum of general and statistical instructor-based training is available at any of the Institute’s training facilities. Institute instructors can also provide on-site training.

For information on other courses in the curriculum, contact the Professional Services Division at 1-919-677-8000, then press 1-7321, or send email to [email protected]. You can also find this information on the Web at www.sas.com/training/ as well as in the Training Course Catalog.

For a list of other SAS books that relate to the topics covered in this Course Notes, USA customers can contact our Book Sales Department at 1-800-727-3228 or send email to [email protected]. Customers outside the USA, please contact your local SAS Institute office.

Also, see the Publications Catalog on the Web at www.sas.com/pubs/ for a complete list of books and a convenient order form.


Prerequisites

Before selecting this course,

• you should be familiar with Microsoft Windows and Windows-based software

• it is recommended that you complete the Data Mining Primer: Overview of Applications and Methods course.


General Conventions

This section explains the various conventions used in presenting text, SAS language syntax, and examples in this book.

Typographical Conventions

You will see several type styles in this book. This list explains the meaning of each style:

UPPERCASE ROMAN is used for SAS statements, variable names, and other SAS language elements when they appear in the text.

italic identifies terms or concepts that are defined in text. Italic is also used for book titles when they are referenced in text, as well as for various syntax and mathematical elements.

bold is used for emphasis within text.

monospace is used for examples of SAS programming statements and for SAS character strings. Monospace is also used to refer to field names in windows, information in fields, and user-supplied information.

select indicates selectable items in windows and menus. This book also uses icons to represent selectable items.

Mouse Conventions

The number of buttons on mouse devices varies. On mouse devices with two or three buttons, one button makes selections and one displays pop-up menus. Because the locations of these buttons vary, this book refers to them as the mouse select button and the mouse menu button. If you use a mouse device, you can determine which button executes which action by trying them.

[Figure: Two-Button Mouse with Default Settings (labels: menu button, select button)]

Chapter 1 Getting Started with Enterprise Miner

1.1 Introduction
1.2 Opening the Enterprise Miner
1.3 Enterprise Miner Nodes


1.1 Introduction

SAS Institute defines data mining as the process of selecting, exploring, modifying, and modeling large amounts of data to uncover previously unknown patterns in data for a business advantage. The data mining process is applicable across a variety of industries and provides methodologies for such diverse business problems as fraud detection, customer retention and attrition, database marketing, market segmentation, risk analysis, affinity analysis, customer satisfaction, bankruptcy prediction, and portfolio analysis.

Enterprise Miner software provides tools to facilitate the data mining process. A graphical user interface groups these tools by common data mining tasks: sample, explore, modify, model, and assess (SEMMA). In the course of an analysis, these tools are assembled into a process flow diagram (PFD), a type of graphical computer program.


1.2 Opening the Enterprise Miner

To start Enterprise Miner software from your desktop, double-click on the Enterprise Miner icon. The Available Projects window opens.

In the PC version, you can resize the Available Projects window. Place the cursor on a corner or an edge, click and hold the left mouse button, drag to the desired size, and release the mouse button. If this window is closed during your SAS session, you can reopen it by entering the MINER command in the command dialog of the SAS Display Manager.

The Available Projects window has an explorer that expands to show projects and diagrams. When no projects are present, an explorer node labeled Projects is shown. This node is always displayed. When you create new projects, they are listed in this folder.

The remainder of this section discusses the following topics:

• using the mouse

• using pull-down menus

• using the toolbar

• using pop-up menus

• grayed items.


Using the Mouse

Within the Enterprise Miner, you use the mouse to perform many tasks. Different mouse buttons control different functions. Also, some tasks require single-clicking a mouse button and others require double-clicking.

You perform the following basic tasks with the mouse:

• item selection (for example, push buttons and check boxes).

Place the cursor over the item and single-click the left mouse button (click and release). This is the mouse selection technique used most often in the Enterprise Miner.

• list item selection from pull-down menus.

Place the cursor over the name of the pull-down menu, and then click and hold the left mouse button. The pull-down menu list appears. Drag the cursor to the desired list item, and then single-click the left mouse button again. For example, single-click the left mouse button on the Insert menu, drag the cursor over the Project… menu item, and then single-click the left mouse button again.

• pop-up menu selection.

Single-click and hold the right mouse button and a pop-up menu appears. Drag the cursor onto the menu list until you reach the desired menu item.

If an arrow appears next to the menu item, wait until the submenu appears, and move the cursor over the desired item in the submenu. Optionally, you can single-click with either the left or right mouse button to make the submenu appear instantly.

Finally, position the cursor over the desired item in the submenu and single-click with either the left or right mouse button to select the item.

For example, single-click and hold the right mouse button in the Available Projects window, drag the cursor to the Insert menu, wait until the submenu appears, drag the cursor over the Project… submenu item, and then single-click either the left or right mouse button.

• specialized tasks.

These tasks are described as needed.


Using the Pull-down Menus

The Enterprise Miner contains pull-down menus that enable you to quickly and efficiently perform common tasks. Different windows have different pull-down menus. The Available Projects window has the following pull-down menus:

File

• Open - opens the selected process flow diagram.

• Import - imports a copied project.

• Export - exports a copied project.

• Close - closes the Available Projects window.

• Send... - accesses email functions.

• Exit - closes the Enterprise Miner and ends the current SAS session.

Edit

• Delete - deletes the selected project or the selected diagram for a project. If you select Delete, a message appears asking you to verify the deletion.

If you select Yes in the dialog above, you will delete Project 1. If you select No, you return to the previous window and Project 1 is not deleted. Note that you can also perform this task by using the Delete icon on the toolbar.

• Rename - enables you to rename a selected project or diagram. Simply type in the appropriate name.

• Setup - opens the Enterprise Miner Administrator window. The Enterprise Miner Administrator enables you to define remote projects and customize default node options.


View

• Refresh - refreshes the image displayed in the windows of the Enterprise Miner.

• Properties - accesses the properties of the selected item.

For example, the properties of a diagram include the project name, the diagram name, the date the diagram was created, and the last date the diagram was modified. It also indicates whether the diagram is locked.

The properties for a project include the SAS libraries and pathnames where the project and data are to be stored and the remote server information.

Note that you can also display item properties by using the Properties icon, which is located on the toolbar.


Insert

• Project... - enables you to create a new project.

• Diagram - enables you to create a new process flow diagram (PFD).

Globals - accesses SAS software globals.

Options - accesses operating environment options.

Windows - enables you to display and reposition windows.

Help

• How to

• SAS companion

• Sample programs

• Open Enterprise Miner Nodes Help - delivers online help about the Enterprise Miner nodes.

• Getting Started with the Enterprise Miner - displays this document online.

• Tech Support

• What’s new

• Utility application

• About SAS


Using the Toolbar

When the Available Projects window is active, the toolbar appears as follows:

Moving from left to right, the icons represent, respectively,

• Up one level - enables you to move up one level in the hierarchy (project tree) of the Available Projects window.

• Delete - deletes the selected item. You can also perform this task by using the pull-down menus and selecting Edit → Delete.

• Properties - displays the administrative properties of the selected item. Here is an example of the properties of a diagram:

Note that you can also perform this task by using the pull-down menus and selecting View → Properties.


Using the Pop-Up Menus

Within the Available Projects window, the pop-up menu list items correspond to the pull-down menu list items, and you can use them to perform the same tasks. You can access the pop-up menus by single-clicking with the right mouse button.

For example, instead of using the pull-down menus to insert a project, follow these steps:

1. Place the cursor in the Available Projects window and click the right mouse button. The pop-up menu appears.

2. Place the cursor on the menu list item Insert. The submenu appears.

3. Place the cursor on the submenu list item Project..., and then select the item with either the left or right mouse button. The New Projects window appears.

Grayed Items

Certain functions of some windows of the Enterprise Miner are grayed out; that is, the labels are gray or the columns are gray, and you cannot access the underlying functionality. Some of the grayed functions require additional specifications. When you supply the needed information, the items are ungrayed. Other functions are grayed dynamically; that is, when a function is inappropriate for the current settings, it is grayed. When the settings are changed and these functions are appropriate, they are ungrayed.

For statistical functions, graying is used to prevent the selection of inappropriate combinations of options. In some cases, specifying certain options implies that additional options must also be selected to fully specify the method or technique. Be aware that failing to fully specify options may prevent the node from running successfully. If a node fails to run, the colored border surrounding the node while the node is running will turn from green to red.


1.3 Enterprise Miner Nodes

Enterprise Miner software’s data mining tools are represented by icons called nodes and are organized by analysis function: sample, explore, modify, model, and assess. In addition to analysis nodes, utility nodes are also provided. These enable you to submit SAS software programming statements, to define control points in the process flow diagram (PFD), and to create subdiagrams.

This section gives an overview of the node groups found in Enterprise Miner. More information about individual nodes may be found in the online documentation.


Sample Nodes

• The Input Data Source node enables you to access SAS data sets and other types of data sets. This node reads data sources and defines their attributes for processing by the Enterprise Miner. Meta information (the metadata sample) is automatically created for each variable when you import a data set with the Input Data Source node. Initial values are set for the measurement level and the model role for each variable. You can change these values if you are not satisfied with the automatic selections made by the node. Summary statistics are displayed for interval and class variables. Note that for the purposes of this document, data sets and data tables are equivalent terms.

• The Sampling node enables you to take random and stratified random samples of data sets. Sampling is recommended for extremely large databases because it can significantly decrease model training time. If the sample is sufficiently representative, the relationships found in the sample can be expected to generalize to the complete data set. The Sampling node writes the sampled observations to an output data set and saves the seed values that are used to generate the random numbers for the samples so that you may replicate the samples.

• The Data Partition node enables you to partition data sets into training, test, and validation data sets. The training data set is used for preliminary model fitting. The validation data set is used to monitor and tune the model weights during estimation and is also used for model assessment. The test data set is an additional hold-out data set that you can use to obtain an unbiased estimate of how well the model will perform in practice. You can use simple random sampling or stratified random sampling to create partitioned data sets.
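The partitioning idea described above can be sketched outside Enterprise Miner. The following Python fragment is only an illustration, not the Data Partition node's implementation: the function name, the 40/30/30 split, and the seed value are assumptions chosen for the example. It uses simple random sampling and a fixed seed so that the same partition can be replicated, in the same spirit as the seed values saved by the Sampling node.

```python
import random

def partition(rows, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Split rows into training, validation, and test lists.

    Simple random sampling only. The fractions and the seed are
    illustrative assumptions, not Enterprise Miner defaults.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains is held out
    return train, valid, test

train, valid, test = partition(range(1000))
print(len(train), len(valid), len(test))
```

Calling the function again with the same seed returns the same three lists, which is what makes a later comparison of models on the validation and test data repeatable.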


Explore Nodes

• The Bar Chart node is an advanced visualization tool that enables you to explore large volumes of data in multi-dimensional histograms. You can view the distribution of up to three variables at a time with the node.

• The Variable Selection node enables you to evaluate the importance of input variables in predicting or classifying the target variable. You can remove variables unrelated to the target variable(s), remove variables in hierarchies, remove variables that have large percentages of missing values, and remove class variables based on the number of unique values.

• The Association node enables you to identify association relationships within the data. For example, if a customer buys a loaf of bread, how likely is the customer to also buy a gallon of milk? The node also enables you to perform sequence discovery if a time stamp variable is present in the data set. Binary sequences are constructed automatically, but you can use the event chain handler to construct longer sequences based on the patterns discovered by the algorithm.

• The Insight node enables you to open a SAS/INSIGHT session. SAS/INSIGHT software is an interactive tool for data exploration and analysis. With it you explore data through graphs and analyses that are linked across multiple windows. You can analyze univariate distributions, investigate multivariate distributions, and fit explanatory models using generalized linear models.
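The bread-and-milk question posed for the Association node can be made concrete with a small calculation. The Python sketch below is an illustration under stated assumptions: the five baskets are made-up data, the function name is hypothetical, and the support, confidence, and lift formulas are the usual textbook definitions rather than a description of the node's algorithm.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent in t)
    n_c = sum(1 for t in transactions if consequent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = n_both / n                          # share of baskets with both items
    confidence = n_both / n_a if n_a else 0.0     # how often the rule holds when it applies
    lift = confidence / (n_c / n) if n_c else 0.0 # confidence relative to the base rate
    return support, confidence, lift

# Hypothetical market baskets, one set of items per customer visit.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
    {"eggs"},
]
support, confidence, lift = rule_stats(baskets, "bread", "milk")
print(support, confidence, lift)
```

In this toy data, two of the three baskets that contain bread also contain milk, so the confidence of the rule is two thirds.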


Modify Nodes

• The Data Set Attributes node enables you to modify data set attributes, such as data set names, description, and roles. You can also use this node to modify the metadata sample associated with a data set. An example of a useful Data Set Attributes application is to generate a data set in the SAS Code node and then modify its metadata sample with this node.

• The Transform Variables node enables you to transform variables; for example, you can transform variables by taking the square root of a variable or by taking the natural logarithm. Additionally, the node supports user-defined formulas for transformations and provides a visual interface for grouping interval-valued variables into buckets or quantiles. Transforming variables to similar scale and variability may improve the fit of models and subsequently, the classification and prediction precision of fitted models.

• The Filter Outliers node enables you to identify and remove outliers from data sets. Checking for outliers is recommended because outliers may greatly affect modeling results and subsequently, the classification and prediction precision of fitted models.

• The Clustering node enables you to segment your data; that is, it enables you to identify data observations that are similar in some way. Observations that are similar tend to be in the same cluster, and observations that are different tend to be in different clusters. The cluster identifier for each observation can be passed to other nodes for use as an input, ID, or target variable. It can also be passed as a group variable that enables you to automatically construct separate models for each group.

• The Data Replacement node enables you to impute values for observations that have missing values. You can replace missing values for interval variables with the mean, median, or midrange. You can replace missing values for class variables with the most frequently occurring value. The node also enables you to specify your own replacement values.
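The replacement rules listed for the Data Replacement node are simple enough to sketch directly. The Python fragment below is an illustration only, not the node's code: the function names are hypothetical, missing values are represented by None as an assumption of the example, and the sample values are made up. It shows mean, median, and midrange replacement for an interval variable and most-frequent-value replacement for a class variable.

```python
import statistics
from collections import Counter

def impute_interval(values, method="mean"):
    """Replace None in an interval variable with the mean, median, or midrange."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = statistics.mean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    elif method == "midrange":
        fill = (min(observed) + max(observed)) / 2   # halfway between the extremes
    else:
        raise ValueError("unknown method: " + method)
    return [fill if v is None else v for v in values]

def impute_class(values):
    """Replace None in a class variable with the most frequently occurring value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 70]            # hypothetical interval variable
sexes = ["F", "M", None, "F"]        # hypothetical class variable
print(impute_interval(ages, "mean"))
print(impute_interval(ages, "median"))
print(impute_interval(ages, "midrange"))
print(impute_class(sexes))
```

The three interval methods give different fill values for the same data (40, 30, and 45 here), which is why the choice matters when the distribution is skewed.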


Model Nodes

• The Regression node enables you to fit both linear and logistic regression models to your data. You can use continuous, ordinal, and binary target variables. You can use both continuous and discrete variables as inputs. The node supports the stepwise, forward, and backward selection methods. A point-and-click interaction builder enables you to create higher-order modeling terms.

• The User Defined Model node enables you to generate assessment statistics using predicted values from a model you built outside Enterprise Miner (for example, a logistic model using the LOGISTIC procedure in SAS/STAT software). The predicted values can also be saved to a SAS data set and then imported into the process flow with the Input Data Source node.

• The Decision Tree node enables you to fit decision tree models to your data. The implementation includes features found in a variety of popular decision tree algorithms (for example, CHAID, CART, and C4.5). The node supports both automatic and interactive training. When you run the Decision Tree node in automatic mode, it automatically ranks the input variables based on the strength of their contribution to the tree. This ranking may be used to select variables for use in subsequent modeling. You may override any automatic step with the option to define a splitting rule and prune explicit nodes or subtrees. Interactive training enables you to explore and evaluate a large set of trees as you develop them.

• The Neural Network node enables you to construct, train, and validate multilayer, feedforward neural networks. By default, the Neural Network node automatically constructs a multilayer perceptron network that has one hidden layer consisting of three neurons. In general, each input is fully connected to the first hidden layer, each hidden layer is fully connected to the next hidden layer, and the last hidden layer is fully connected to the output. The Neural Network node supports many variations of this general form.
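The default architecture described for the Neural Network node, one hidden layer of three fully connected neurons, can be illustrated with a forward pass. The Python sketch below is not the node's implementation: the function name, the tanh hidden activation, the logistic output, and all weight values are assumptions made for the example, and no training is shown.

```python
import math

def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a one-hidden-layer perceptron with a single output.

    Every input feeds every hidden neuron, and every hidden neuron feeds
    the output. The tanh and logistic activations are assumed here.
    """
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    z = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return 1.0 / (1.0 + math.exp(-z))   # logistic output, between 0 and 1

# Two inputs, three hidden neurons; the weights are arbitrary illustrative values.
hidden_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [1.0, -1.0, 0.5]
output_bias = 0.2
p = mlp_forward([1.0, 2.0], hidden_weights, hidden_biases, output_weights, output_bias)
print(round(p, 4))
```

With two inputs this network has 3 × 2 hidden weights, 3 hidden biases, 3 output weights, and 1 output bias, or 13 parameters in all, which gives a sense of how quickly the count grows as inputs are added.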


Assess Nodes

• The Assessment node provides a common framework for comparing models and predictions from any of the modeling nodes. The comparison is based on the expected and actual profits that result from implementing the model. In addition, the node produces charts that help to describe the usefulness of the model.

• The Score node enables you to generate and manage predicted values from a trained model. Scoring formulas are created for both assessment and prediction. The Enterprise Miner generates and manages scoring formulas in the form of SAS DATA step code.

Utility Nodes

• The Group Processing node enables you to perform group by-processing for class variables (for example, GENDER). You can also use this node to analyze multiple targets.

• The Data Mining Database node enables you to create a data mining database (DMDB) for batch processing. For non-batch processing, DMDBs are automatically created as they are needed.

• The SAS Code node enables you to incorporate new or existing SAS code into process flow diagrams. The ability to write SAS code enables you to include additional SAS software procedures into your data mining analysis.

• The Control Point node enables you to establish a control point in a process flow diagram (PFD). A Control Point node can be used to reduce the number of connections that are made.

• The Subdiagram node enables you to group a portion of a PFD into a subdiagram. This grouping simplifies the appearance of the PFD.


Usage Rules for Nodes

These are some general rules that govern the placement of nodes in a process flow diagram (PFD):

• The Input Data Source node cannot be preceded by any other nodes.

• The Sampling node must be preceded by a node that exports a data set.

• The Assessment node must be preceded by one or more modeling nodes.

• The Score node must be preceded by a node that produces score code. For example, the modeling nodes produce score code.

• The SAS Code node can be defined in any stage of the process flow diagram. It does not require an input data set defined in the Input Data Source node.
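Two of the placement rules above can be expressed as a small check on a diagram. The Python sketch below is a hypothetical illustration, not part of Enterprise Miner: the dictionary representation of a process flow (each node name mapped to the list of nodes feeding it) and the function name are assumptions, and only the Input Data Source and Assessment rules are checked. The set of modeling nodes is taken from the Model Nodes list in this section.

```python
# Modeling nodes as listed in this section.
MODELING_NODES = {"Regression", "User Defined Model", "Decision Tree", "Neural Network"}

def check_flow(predecessors):
    """Return a list of rule violations for a process flow.

    predecessors maps each node name to the names of the nodes that
    feed it (an assumed, simplified representation of a PFD).
    """
    problems = []
    for node, preds in predecessors.items():
        if node == "Input Data Source" and preds:
            problems.append("Input Data Source cannot be preceded by any other node")
        if node == "Assessment" and not any(p in MODELING_NODES for p in preds):
            problems.append("Assessment must be preceded by one or more modeling nodes")
    return problems

valid_flow = {
    "Input Data Source": [],
    "Data Partition": ["Input Data Source"],
    "Regression": ["Data Partition"],
    "Assessment": ["Regression"],
}
invalid_flow = {
    "Input Data Source": [],
    "Assessment": ["Input Data Source"],   # no modeling node in between
}
print(check_flow(valid_flow))
print(check_flow(invalid_flow))
```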

Chapter 2 Predictive Modeling

2.1 Problem Formulation
2.2 Project Preliminaries
2.3 Training and Assessment
2.4 Model Deployment
2.5 Missing Value Imputation
2.6 Model Interpretation
2.7 Decision Tree Models
2.8 Dimension Reduction
2.9 Polynomial Regression Models
2.10 Training a Neural Network
2.11 User-defined Models


2.1 Problem Formulation

The direct marketing department of a retail products company seeks to increase catalog salesrevenue. To do this, they want to target only current customers likely to make a catalogpurchase. The campaign consists of two phases. First, a test mailing to a random selection oftheir current customers will be used to create a database from which a purchase propensitymodel can be created. Second, the purchase propensity model will be used to select customersfrom the remainder of company’s customer roll whose likelihood of purchase exceeds ananalytically determined threshold.

The BUYTEST data set contains the results from phase one, the random test mailing. The set of approximately 10,000 current customers all had a 24-month purchase total in excess of $60. Each test-mailing catalog contained a special promotion code. The customer provided the promotion code while placing the order to identify himself as a participant in the test mailing. At the end of the test campaign, a purchase total for each customer was calculated (less returns). Any customer with a positive purchase total was identified as a responder to the test campaign. The average profit margin on purchases was approximately 20%.

BUYTEST contains a variety of information, extracted from the company’s customer database, about each test mailing customer. The individual fields can be grouped as follows:

Category            Variable   Description

Demographic         AGE        Age in years
                    INCOME     Yearly income in thousands
                    MARRIED    1 if married, 0 otherwise
                    SEX        F or M
                    COA6       1 if change of address in last 6 months, 0 otherwise
                    OWNHOME    1 if own home, 0 otherwise

Geographic          LOC        Location of residence, A-H
                    CLIMATE    Climate code for residence, 10, 20, 30

Recency/Frequency   BUY6       Number of purchases in last 6 months
                    BUY12      Number of purchases in last 12 months
                    BUY18      Number of purchases in last 18 months

Monetary            VALUE24    Total value of purchases in past 24 months

Credit status       FICO       Credit score

Purchase History    ORGSRC     Original customer source, C, D, I, O, P, R, U
                    DISCBUY    1 if discount buyer, 0 otherwise
                    RETURN24   1 if product return in past 24 months, 0 otherwise

Response            RESPOND    1 if responder to test mailing, 0 otherwise
                    PURCHTOT   Test mailing purchase total
                    C1-C7      Test mailing purchase total by product category

Tracking            ID         Unique customer identification number

The overall response rate to this test mailing was approximately 7.5%. In addition to recording the presence or absence of a customer’s response, the data set includes details on the amount of a customer’s purchase and the categories from which purchases are made. Some of this information may be used for follow-up analyses on customer purchase behavior.

The purchase propensity model will be created using supervised classification techniques including regression models, neural networks, and decision trees. The primary analysis target is the RESPOND variable. Other variables in the data set, with the exception of PURCHTOT and C1, C2,…, C7, will serve as model inputs. Forty percent of the data in BUYTEST will be employed to train the model. The remaining sixty percent will be split evenly; half will be used to tune and compare models and half will be used as a final assessment of model performance. The models will be judged primarily on their assessed profitability and accuracy and secondarily on their interpretability.

Upon selection of the most appropriate model, scoring code will be generated. This scoring code will be deployed on the remainder of the company’s high-valued customer roll, and a subset of customers with a high propensity to purchase will be selected. The purchase propensity threshold will be determined in the course of model assessment.


2.2 Project Preliminaries

Objectives

• Create an Enterprise Miner project.

• Build a process flow diagram.

• Modify node settings.


Creating an Enterprise Miner Project

A project contains data mining diagrams and information that pertains to them. Two SAS libraries are required to define a project: a SAS project library and a SAS data library.

The SAS project library stores information about a project. You can have no more than one project per SAS project library. Only files related to the Enterprise Miner project should be in this library. The SAS data library stores data for a project. The data library should not be used for anything but that one project. For example, do not use the same data library for another project, either as a project library or as a data library, and do not keep external files in that library. The data library is not the library where the raw data should be stored; rather, it is a place where the Enterprise Miner stores certain files associated with a project.

Create a new project from the Available Projects window.

1. Place the cursor on the Projects folder and single-click the left mouse button. This action selects the Projects folder.

2. Use the Insert pull-down menu (or the Insert pop-up menu, which is accessed by single-clicking the right mouse button) and select Project.... A dialog box appears, and you are prompted to provide a project name as well as the names and locations of the project and data libraries.

3. Name the new project by typing an appropriate name in the Name field (for example, My Project).

4. Specify a location where you want the project to be stored. For example, type the library nickname EMPROJ in the Library field and then type the pathname in the Path field. You can also use the Browse… button to identify the desired location.

Only one project can be stored in a library. Diagram entries, all node settings and other data entries, graphs and charts, and code entries are stored in the project library.


5. Specify a location where you want the information about the data to be stored. For example, type the library nickname EMDATA in the Library field and then type the pathname in the Path field. You can also use the Browse… button to identify the desired location.

The data for only one project can be stored in the data library. This is where the raw, training, validation, test, score, and assessment data sets are stored. If desired, you can specify the same library for the Data library and the Project library, although that is not done here.

6. Optionally, you can use a remote server to run the project remotely. After you select the check box beside Use Remote Server, you must specify the server profile and remote path in the fields provided. To obtain a list of available server profiles, click the down-arrow control with the left mouse button or click anywhere in the Server profile entry field.

7. After specifying the settings, select OK to accept and apply or Cancel to cancel the dialog. When you select OK, the project appears in the Available Projects window.

8. Open the diagram by double-clicking on the My Project folder and then the My Project Diagram icon.

! To obtain results consistent with this document, submit the following line of code in the SAS Program Editor after opening the SAS Enterprise Miner.

%let DM_SEED=12345;


Building a Process Flow Diagram

Adding Nodes

All process flows in the Enterprise Miner are built from nodes selected from the Sample, Explore, Modify, Model, Assess, and Utility tools palette. You can add nodes to the workspace in the following ways:

• Make sure that the Node Types window and the workspace are both visible. Position your cursor over the desired node in the Node Types window. Press and hold the left mouse button, drag the node onto the workspace, and release the mouse button.

• Double-click in an open area of the workspace. A dialog appears with a list of all available nodes. Single-click the left mouse button on the name of the desired node in the Add node window. You can also obtain this dialog by single-clicking the right mouse button in an open area of the workspace and selecting Add node….

Moving and Connecting Nodes

By default, you can use your mouse to move and connect nodes. If you want to use the mouse only to move or only to connect nodes, you can select the desired functionality from the View menu. The default setting allows both movement and connection of the nodes and is preferred by most users.

After dragging a node onto the workspace, you will see a dotted outline appear around the node. The outline indicates the node is selected, and you can use your cursor to move the node around the workspace by dragging it. When you have moved the node to the desired location, click in an open area, thereby deselecting the node.

When you want to connect one node (say, the initial node) to another node (say, the terminal node), first make sure the initial node is not selected. Recall that single-clicking the left mouse button in an open area will deselect any selected nodes. Position the cursor near the edge of the icon of the initial node, press the left mouse button, and immediately start to drag until the cursor is positioned over the terminal node. This will make a connection leading from the initial node to the terminal node.

! If you hesitate after clicking on the initial node, the initial node is selected and will start to move when you start to drag. If this occurs, simply reposition the initial node, click in an open area to deselect it, and try again.


Build a flow in the workspace.

1. Add the first node for your process flow diagram. Drag an Input Data Source node (Sample tool group) onto the workspace from the node palette.

2. Add a Data Partition node (Sample tool group) to the right of the Input Data Source node.

3. Now connect the nodes. Move the cursor to the right edge of the Input Data Source icon. The cursor changes from an arrow to a cross.

4. Click and hold the left mouse button, drag the cursor to the left edge of the Data Partition node icon, and release the mouse button. A successful connection results in a line segment that joins the two nodes. If the Input Data Source icon moves (instead of making a connection), move the cursor away from the icons, single-click the left mouse button, and attempt to make the connection again.

5. Use the same technique to connect, in order, a Regression node (Model tool group) and an Assessment node (Assess tool group).

Your complete diagram should appear as shown.


Modifying Node Settings

To specify the name of a data source and details about the variables in the data source to be used as input for your diagram, you must modify the settings of the Input Data Source node. Modify the node settings by either double-clicking on the node or selecting the node with the right mouse button and selecting Open from the pop-up menu.

Double-click the Input Data Source node. The node’s dialog window opens.

The data set used to generate the propensity models is called BUYTEST. It is stored in the CRSSAMP library.

Set the input data source to BUYTEST.

1. Select Select….

2. Select the arrow control to display a list of assigned SAS data libraries.

3. Select the CRSSAMP library from the available list.

4. Select the BUYTEST data set from within the Tables box.

5. Select OK to continue.


By default, the Enterprise Miner creates a metadata sample consisting of a random sample of 2000 records. The metadata sample is used in several Enterprise Miner nodes to determine metadata information about the analysis data set. In the Input Data Source node,

• it calculates summary statistics, determines the number of variable levels, and determines the frequencies of the variable levels for the active data set in the Input Data Source node

• it is used as the project data when a remote project is run locally and the data does not exist locally

• it is used as the data source to create a distribution plot of a variable whenever you select the View distribution item for a variable in a node.

! You can control the size of the metadata sample by selecting the Change button in the Input Data Source settings window.

Before creating a predictive model from a data set, you must specify the modeling role and measurement scale for every variable in the data set. Enterprise Miner attempts to assign the roles and measurement scales of input variables using the metadata sample. To view these default assignments, select the Variables tab in the Input Data Source settings window.

First consider the model roles for each variable. The variable ID has been correctly assigned the role id. All other variables have been assigned the role input.

Change the role of RESPOND to target.

1. Click in the Model Role column for RESPOND with the right mouse button.

2. Select Set Model Role → target. The model role for RESPOND should now show target.

The variables C1 through C7 and PURCHTOT should be rejected from this analysis because they are neither valid model inputs nor targets of current interest.

3. Scroll to the bottom of the variables.

4. Select C1 through C7 and PURCHTOT.

5. Click in the Model Role column for PURCHTOT with the right mouse button.

6. Select Set Model Role → rejected. The model role for these variables should now show rejected.


Determining Measurement Scale

Now consider the measurement scales. For numeric data without formats, Enterprise Miner counts the number of distinct values of the variable in the metadata sample and assigns the measurement scales as follows:

• interval for more than 10 distinct levels

• ordinal for 3 through 10 distinct levels

• binary for 2 levels

• unary for 1 level (the model role is set to rejected).

For character data without formats, Enterprise Miner counts the number of distinct levels of the variable in the metadata sample and assigns the measurement scales as follows:

• nominal for more than 2 levels

• binary for 2 levels

• unary for 1 level (the model role is set to rejected).

When variables are formatted, Enterprise Miner applies the same rules to the formatted values of the variable.
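The default assignment rules above can be restated as a small function. This is only an illustrative Python sketch of the rules as listed, not Enterprise Miner code; the function name default_measurement and its arguments are hypothetical.

```python
def default_measurement(distinct_levels: int, is_numeric: bool) -> str:
    """Return the default measurement scale described in the text.

    distinct_levels is the number of distinct (formatted) values found
    in the metadata sample; is_numeric distinguishes numeric from
    character variables.
    """
    if distinct_levels < 1:
        raise ValueError("a variable must have at least one level")
    if distinct_levels == 1:
        return "unary"   # the model role is also set to rejected
    if distinct_levels == 2:
        return "binary"
    if is_numeric:
        # numeric: 3 through 10 levels are ordinal, more than 10 interval
        return "interval" if distinct_levels > 10 else "ordinal"
    # character: more than 2 levels is nominal
    return "nominal"


print(default_measurement(4, is_numeric=True))    # ordinal
print(default_measurement(8, is_numeric=False))   # nominal
```

A numeric count variable with only a few distinct values falls into the ordinal branch, which is why such defaults sometimes need to be overridden by hand.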

For the BUYTEST data, the default measurement scales are correct except for the variables BUY6, BUY12, and BUY18. Because these numeric variables have only a small number of distinct levels, an ordinal measurement type has been chosen by default. An interval measurement type is more appropriate.

Change the measurement scale for BUY6, BUY12, and BUY18 to interval.

1. Select BUY6, BUY12, and BUY18.

2. Click in the Measurement column for BUY18 with the right mouse button.

3. Select Set Measurement → interval. The measurement for these variables should now show interval.

In addition to setting the model role and the measurement scale of variables, you can view distributions of individual variables.

4. Select RESPOND with the right mouse menu button.

5. Select View Distribution of RESPOND.


A bar chart showing the relative proportion of responders (RESPOND=1) and non-responders (RESPOND=0) appears.

! You can obtain specific values for the bar heights by selecting the view info tool and then clicking on the desired bar.

6. Select the close box or select OK to close the Variable Histogram window. You can repeat this process for other variables in the data set as desired.

Close and save changes to the Input Data Source node.

1. Select the close box.

2. Select Yes to save changes.


2.3 Training and Assessment

Objectives

• Partition input data into training, validation, and test sets.

• Train a regression prediction model.

• Assess performance of a trained prediction model.


Data Splitting

In data mining, one strategy to assess model generalization is data splitting. A portion of the data, the training data, is used for model fitting, and the rest is held out for empirical validation. The hold-out sample itself is often split into two parts. The validation data set is used to tune and compare prediction models. The test data set is used for a final assessment of the chosen model.

Define the training and validation data sets.

1. Open the Data Partition node.

2. Examine the Percentages fields.

The default settings for the Data Partition node use 40% of the data for training, 30% of the data for validation, and 30% of the data for model testing (final evaluation). To change the percentages, edit the numbers in the corresponding fields. The defaults will be used for this chapter.

3. Type 1310 in the random seed field.

Enterprise Miner randomly splits the raw data into training, validation, and test components. The chosen seed will generate results that are consistent with the Course Notes.

4. Close and save changes to the Data Partition node.
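The idea behind a seeded 40/30/30 partition can be sketched outside Enterprise Miner. The Python below is only an illustration under stated assumptions: the function name partition is hypothetical, and although it reuses the seed value 1310, Python's random generator is unrelated to the one in Enterprise Miner, so it will not reproduce the Data Partition node's actual assignment of cases.

```python
import random


def partition(ids, fractions=(0.4, 0.3, 0.3), seed=1310):
    """Randomly split ids into training, validation, and test lists."""
    rng = random.Random(seed)      # a fixed seed makes the split repeatable
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = round(fractions[0] * n)
    n_valid = round(fractions[1] * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains
    return train, valid, test


# 10,000 hypothetical customer IDs, roughly the size of BUYTEST
train, valid, test = partition(range(10000))
print(len(train), len(valid), len(test))
```

Running the function twice with the same seed returns the same three lists, which is the property the random seed field provides in the node.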


Training a Regression Model

Discovering a useful prediction model is a recurrent investigative process. The Enterprise Miner provides a complex set of modeling tools that help to streamline this process. The modeling tools chosen for a particular situation depend on both business and analytic knowledge.

Many companies use regression models to build purchase propensity models. For this reason, a regression model will be the first model trained.

Open the Regression node.

A list of variables to be used in the model is displayed. By default, only first-order (as opposed to polynomial) terms are considered. The default, first-order, or standard model assumes that each input affects the probability of response independently of the other inputs. (More complex models will be considered later.)

In regression modeling, dummy variables are needed for categorical inputs. Enterprise Miner automatically creates these dummy variables for both nominal and ordinal inputs.

Select the Model Options tab.


Because of the binary target variable, Enterprise Miner has determined that a logistic (as opposed to a linear) regression model is appropriate for this analysis.

Train the regression model and assess its performance.

1. Close the Regression node.

2. Select the Assessment node using the right mouse button.

3. Select Run.

A green frame surrounds each node as it is executed. The process ends when all nodes up to the point at which Run was selected have been executed. Upon completion, a message declaring a successful run completion appears.

Select Yes to view the assessment results.


Model Assessment

The Assessment node allows you to judge a prediction model’s generalization properties. Using the node, you can

• appraise model predictive power on the validation and test data

• compare lift, sensitivity, and profit for multiple models

• determine an appropriate threshold probability for response classification.

The settings window for the Assessment node displays a list of trained models. Change the model name from Untitled to First and view an assessment plot.

1. Select the regression model.

2. Type First in the Name column and press Enter.

3. Select Tools → Lift Chart.

This is a cumulative gains chart on the validation data. The logistic regression model built on the training data has been applied to the validation data, generating the predicted probability of response for each case. The cases were sorted from highest to lowest predicted probability.

Using the sorted predicted probabilities, the proportion of actual responders in progressively larger proportions of the data is displayed. The red line shows the performance for the regression model, and the blue line shows the performance for a model that randomly assigns predicted probabilities to each case.

For the regression model, the top 10% of predicted responders have an expected response rate of about 14%, the top 20% of the predicted responders have an expected response rate above 11%, and so on. For the random baseline model, the expected response rate varies around 7.5% for all percentiles.


Select the Non-Cumulative radio button.

This is a non-cumulative gains chart on the validation data. The data has again been sorted from highest predicted probability to lowest predicted probability. This time, however, the data has been grouped into deciles (each containing 10% of the data) and the proportion of responders within each decile is displayed.

For the regression model, the first decile again has an expected response rate of 14%, the second decile has an expected response rate below 9%, and so on. For the random baseline model, the expected response rate varies around 7.5% for all deciles. Note that after the fourth decile, the proportion of responders in the regression model falls below that of the random model. The sharp increase in expected response in the last decile is somewhat surprising. The reason for this will become apparent.


Select the Lift Value radio button.

This is a non-cumulative lift chart on the validation data. Lift value is simply a re-scaling of the vertical axis. The expected response rate in each decile is divided by the overall response rate. Thus, the lift in the first decile is 14% divided by 7.5%, or about 1.9. In other words, the customers that the regression model places in the first decile are 1.9 times more likely to respond than a random collection of 10% of the customers from the test campaign. Note that after the fifth decile, the lift is less than 1 (with the exception of the last decile).
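The non-cumulative response rates and lift values described above can be computed by hand from any scored data set. The following Python sketch is an illustration only, not the Assessment node's implementation; the function name and the ten toy cases are made up, and the toy data uses five bins rather than ten to stay small.

```python
def bin_rates_and_lift(scored, n_bins=10):
    """scored: list of (predicted_probability, response) pairs, response 0 or 1.

    Returns the response rate and the lift for each bin, with bins formed
    after sorting from highest to lowest predicted probability.
    """
    if len(scored) < n_bins:
        raise ValueError("need at least one case per bin")
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    overall = sum(y for _, y in ranked) / n
    if overall == 0:
        raise ValueError("lift is undefined when there are no responders")
    rates, lifts = [], []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        rate = sum(y for _, y in chunk) / len(chunk)
        rates.append(rate)
        lifts.append(rate / overall)   # lift = bin rate / overall rate
    return rates, lifts


# Ten hypothetical cases, grouped into five bins of two
toy = [(0.20, 1), (0.19, 1), (0.18, 1), (0.17, 0), (0.16, 1),
       (0.15, 0), (0.14, 0), (0.13, 0), (0.12, 1), (0.11, 0)]
rates, lifts = bin_rates_and_lift(toy, n_bins=5)
print(rates)   # [1.0, 0.5, 0.5, 0.0, 0.5]
print(lifts)   # [2.0, 1.0, 1.0, 0.0, 1.0]

# The first-decile lift quoted in the text: 14% / 7.5%
print(round(0.14 / 0.075, 2))   # 1.87
```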

Select the Cumulative radio button.

This is a cumulative lift chart on the validation data. As expected, the cumulative lift chart decreases down to 1. If the top 50% of the customers are targeted, the expected response rate will be 1.33 times that of the test campaign.


Select the %Captured Response radio button.

This plot goes by many names: a target concentration curve, a response chart, a lift chart, a Lorenz curve, and (closely related to) a ROC curve. The vertical axis represents the percent of actual responders in the entire validation data set captured in each (cumulative) decile. For example, for the regression model about 66.7% of all responders will be captured if the top 50% of the customers are targeted. Theoretically, if the random model is used instead, 50% of the responders will be captured if 50% of the customers are targeted.
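The captured-response curve and the cumulative lift chart carry the same information: cumulative lift at a given depth is the fraction of responders captured divided by the fraction of customers targeted. A brief Python sketch follows; the function names and the per-decile responder counts are hypothetical, and only the 66.7% and 50% figures come from the text.

```python
def cumulative_captured(responders_per_decile):
    """Fraction of all responders captured through each decile."""
    total = sum(responders_per_decile)
    if total == 0:
        raise ValueError("no responders to capture")
    captured, running = [], 0
    for count in responders_per_decile:
        running += count
        captured.append(running / total)
    return captured


def cumulative_lift(captured_fraction, depth_fraction):
    """Cumulative lift = share of responders captured / share targeted."""
    return captured_fraction / depth_fraction


# Made-up responder counts for ten deciles, best decile first
counts = [30, 20, 15, 10, 8, 6, 4, 3, 2, 2]
captured = cumulative_captured(counts)
print(captured[4])    # 0.83 of responders are in the top five deciles

# Figures from the text: 66.7% of responders captured at a depth of 50%
print(round(cumulative_lift(0.667, 0.50), 2))   # 1.33
```

The second result agrees with the cumulative lift of 1.33 read from the cumulative lift chart at the 50% depth.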


How Many Customers to Target?

Thus far you have seen several measures of model performance versus subsets of the customer base. The fundamental question of this target-marketing project is what fraction of the remaining customer base to target. The answer requires incorporating customer profit information into the model assessment process. The Profit Vector provides one way to do this.

1. Select Define… in the Profit Vector box.

2. Type 14 in the Amount column for Target Level 1.

3. Type 1 in the Amount column for Target Level COST.

4. Close the Profit Vector definition dialog.

The values were chosen to reflect a 20% profit margin on a median order size of $70. The median was chosen as a measure of centrality because of the skewness of the purchase distribution.

5. Select Apply.

6. Select the Profit radio button.

The Profit chart shows the expected profit per customer for the regression and the random model. For the regression model, the profitability per customer decreases as the number of customers targeted increases. This still does not answer what fraction of the customer base to target.
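The arithmetic behind the profit vector can be checked directly. The Python sketch below is illustrative and rests on an assumption: the COST entry of 1 is read here as a cost of $1 per customer solicited, so that expected profit per solicitation is 14 times the response rate minus 1. The names used are hypothetical, and the Assessment node may compute its profit chart differently.

```python
MARGIN = 0.20          # average profit margin on purchases
MEDIAN_ORDER = 70.0    # median order size in dollars
COST = 1.0             # assumed cost per solicitation (COST entry)

# Profit per responder: 20% of a $70 order, the 14 typed in the dialog
PROFIT_PER_RESPONSE = MARGIN * MEDIAN_ORDER


def expected_profit(response_rate):
    """Expected profit per customer solicited at a given response rate."""
    return PROFIT_PER_RESPONSE * response_rate - COST


# Response rate at which a solicitation just pays for itself
break_even = COST / PROFIT_PER_RESPONSE

print(round(PROFIT_PER_RESPONSE, 2))     # 14.0
print(round(expected_profit(0.14), 2))   # first-decile rate of about 14%
print(round(expected_profit(0.075), 3))  # overall rate of about 7.5%
print(round(break_even, 4))              # about 0.0714
```

Under this reading, any group of customers whose response rate is above roughly 7.1% has a positive expected profit.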


Select the Non-Cumulative radio button.

The non-cumulative profit chart shows that customers in the top 50% (first five deciles) have a positive expected profit. Customers in the remaining deciles, with the mysterious exception of the last decile, all have a negative expected profit.

You can increase the resolution of the assessment plot for a single model (at the cost of possibly greater variability).

1. From the menu bar, select Format → Set Horizontal Scale….

2. Type 5 in the percentage field.

3. Select OK.

The horizontal scale of the assessment plot is now shown in demi-deciles instead of deciles. The radical swings in profit around the 15th to 40th percentiles are the result of sampling variation. The decision to target the top 50% of customers seems reasonable at this scale as well.


Select the Cumulative radio button.

By targeting the top 50%, the expected profit per customer solicitation will be about $0.44.

Suppose you decide to target the top half of the customer roll. What is the cutoff predicted probability corresponding to the top half of the data?

Type vt tmplift on the command line and press Enter. The data table from which the assessment charts are produced is displayed. Scroll to decile 50 for the model First.

The column labeled cutoff shows a value of approximately 0.06. Based on the regression model, all customers whose predicted response probability exceeds approximately 0.06 should receive a catalog in phase 2 of the target marketing campaign.

Close the TMPLIFT data set and the Assessment node.


2.4 Model Deployment

Objectives

• Score new data within Enterprise Miner software.

• Score new data in base SAS software.


Scoring in Enterprise Miner

The assessment plots demonstrated that the regression model will select profitable customers for phase 2, but having a useful model is of no practical value if it cannot be deployed against the remaining customer roll. The purpose of predictive modeling is the application of the model to new data, that is, scoring.

The SAS data set BUYROLL contains information on customers similar to those participating in the test campaign. Because these customers did not receive a test mailing, the regression model must predict the probability of these customers making a purchase.

1. Add a Score node (Assess tool group) to the workspace.

2. Connect the Regression node to the Score node.

3. Add another Input Data Source node to the workspace.

4. Connect the new Input Data Source node to the Score node.

5. Add an Insight node to the workspace.

6. Connect the Score node to the Insight node.

The process flow diagram should appear as shown.


The first task is to access the BUYROLL data set and prepare it for scoring. You identify the data set to be scored by assigning it a role of SCORE in the Input Data Source node.

1. Open the new Input Data Source node.

2. Select the BUYROLL data in the CRSSAMP library.

3. Select or type SCORE in the Role field.

4. Select the Variables tab.

BUYROLL has most of the same variables that BUYTEST has. The exceptions are the variables associated with the test mailing response: RESPOND, PURCHTOT, and C1 through C7.

5. Close and save changes to the Input Data Source node.

The next task is to set up the Score node for processing a SCORE data set.

1. Open the Score node.

2. Select the Run action tab.

3. Select the Score and export a new predicted values data set radio button.

4. Close and save changes to the Score node.

Run the diagram from the Insight node.

1. Select the Insight node with the right mouse button.

2. Select Run.

3. Select Yes to view the results of the scoring in SAS/INSIGHT.

A SAS/INSIGHT table opens. The table contains a simple random sample of size 2000 from the SCORE data set BUYROLL. (You can control the size of the sample in the Insight node.) Scroll the INSIGHT table to the right. A column called P_RESPO1 gives a predicted response probability for almost every customer in the customer roll. Note that some of the values of P_RESPO1 are missing. The case of the missing response probabilities will be discussed in the next section.


Scoring in Base SAS

The data sets to which predictive models are deployed are often considerably larger than BUYROLL (3.1 MB). In these cases, it will be difficult to score the data sets interactively in the Enterprise Miner. Conveniently, the Score node can export the SAS DATA step code required to score new cases. Only base SAS is required to run this DATA step code.

Export the scoring code.


2. Double-click on Regression in the left window.

The DATA step code in the right window contains the full recipe of the regression model.

3. Click on Regression in the left window with the right mouse button and select Save. By default, the saved scoring code is named Saved source file. You may edit this name if you desire.

4. Select OK.

5. Click on Saved source file (or the name you specified) in the left window with the mouse menu button and select Export.

6. Export the scoring code as RECIPE.SAS.

7. Close the Score node.

8. Open the RECIPE.SAS program in the PROGRAM EDITOR.

9. At the beginning of the program, type

%let _score=crssamp.buyroll;
%let _predict=scored;

10. At the end of the program, type

proc print data=&_predict(where=(p_respo1 > 0.06));
var id p_respo1;
run;

11. Submit the program and view the results in the OUTPUT window.

The program produces a list of customers who should receive a catalog in the second phase of the target marketing campaign.


2.5 Missing Value Imputation

Objectives

• Accommodate missing values in predictive models.


Missing Value Imputation

The regression model created from the test campaign data is incomplete. The scoring recipe fails to calculate a response probability whenever there is a missing value in one of the model inputs. It is unacceptable to decline to predict new cases that have some missing data. A standard approach to handling missing values is to impute them prior to building a model.
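Why a missing input leads to a missing prediction can be seen in a toy scorer. The Python below is a sketch only: the function name score, the intercept, and the coefficients are invented for illustration and are not taken from the exported scoring code.

```python
import math


def score(row, intercept, coefficients):
    """Logistic-regression score; None if any model input is missing."""
    if any(row.get(name) is None for name in coefficients):
        return None   # the linear predictor cannot be evaluated
    logit = intercept + sum(b * row[name] for name, b in coefficients.items())
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical model with two inputs
coefs = {"AGE": -0.03, "BUY18": 0.4}

print(score({"AGE": 40, "BUY18": 2}, intercept=-2.0, coefficients=coefs))
print(score({"AGE": None, "BUY18": 2}, intercept=-2.0, coefficients=coefs))  # None
```

Every input enters the linear predictor, so a single missing value is enough to leave the whole probability undefined unless the value is imputed first.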

Enterprise Miner software offers several methods to impute missing values, discussed in detail in Chapter 5. The simplest is the Data Replacement node (in the Modify tool group).

Use the Data Replacement node in your process flow diagram.

1. Delete the connection between the Data Partition node and the Regression node.

2. Add a Data Replacement node to the workspace.

3. Connect the Data Partition node to the Data Replacement node.

4. Connect the Data Replacement node to the Regression node.

The modified diagram should appear as shown.

The process of missing value imputation is an extension of the prediction model. It must be trained just as the classifier is trained. Thus, you should always place your imputation method after the Data Partition node.

The default imputation method uses the mean of the complete cases in the training data set for interval inputs and the most frequent class (mode) for the categorical inputs.
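Mean and mode imputation, trained on the training data and then applied to any other data, can be sketched as follows. This Python illustration is not the Data Replacement node's code; the function names are hypothetical, and it assumes the replacement value for each input is computed from that input's non-missing training values.

```python
from collections import Counter
from statistics import mean


def fit_imputer(training_rows, interval_inputs, class_inputs):
    """Learn replacement values from the training data only."""
    fill = {}
    for name in interval_inputs:
        observed = [r[name] for r in training_rows if r[name] is not None]
        fill[name] = mean(observed)                          # mean for interval inputs
    for name in class_inputs:
        observed = [r[name] for r in training_rows if r[name] is not None]
        fill[name] = Counter(observed).most_common(1)[0][0]  # mode for class inputs
    return fill


def impute(row, fill):
    """Replace missing values in one row with the learned values."""
    return {k: (fill[k] if v is None and k in fill else v)
            for k, v in row.items()}


# Tiny made-up training set
training = [{"AGE": 30, "SEX": "F"},
            {"AGE": 50, "SEX": "F"},
            {"AGE": None, "SEX": "M"}]
fill = fit_imputer(training, interval_inputs=["AGE"], class_inputs=["SEX"])
print(fill)                                       # {'AGE': 40, 'SEX': 'F'}
print(impute({"AGE": None, "SEX": None}, fill))   # {'AGE': 40, 'SEX': 'F'}
```

The replacement values are computed once from the training rows and then reused unchanged for validation, test, and score data, which is what treating imputation as part of the model amounts to.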


Rerun the flow and evaluate the changes.

1. Run the diagram from the Assessment node.

2. View the results.

3. Select the regression model and then select Tools → Lift Chart.

4. Apply the profit vector used previously.


6. Click the Non-Cumulative radio button.

When the missing values have been accounted for, it appears that it is profitable to target the top 60% of the customer base. Note that the mysterious jump in expected profit in the last decile has disappeared. The jump was an artifact of the missing predicted values.


Click the Cumulative radio button.

By targeting the top 60% of the customers, the regression model now predicts an expected profit per solicitation of $0.42.

1. Close the Lift Chart and the Assessment node.


3. Select the down arrow → Current imports.

4. View the imputation code generated by double-clicking on Regression.

As stated above, for predictive modeling, data modifications such as imputation should be thought of as part of the overall model. The scoring code must contain the recipe for these modifications as well. Close the Score node.

1. Run the diagram from the Insight node.

2. View the results.

3. Scroll to the predicted probabilities (P_RESPO1).

The Data Replacement node has solved the case of the missing response probabilities. Close the Insight node.


2.6 Model Interpretation

Objectives

• Interpret output from a standard logistic regression model.


Model Interpretation

The primary goal in predictive modeling is to build a model that accurately predicts the value of the target variable. Interpreting the characteristics of cases yielding a particular response can sometimes be extremely difficult and potentially misleading.

For specific modeling types, like standard logistic regression, general statements of response propensity can be made by scrutinizing the modeling output.

1. Select the Regression node with the right mouse button.

2. Select Results….

The Estimates tab summarizes this information graphically. In this case, it appears that younger customers who have a high purchase frequency in the last 18 months, are married, have not purchased much in the last 12 months, and rent their home are most likely to respond.

The parameter estimates are colored to indicate sign and magnitude.

The plot shows T-scores (equal to the signed square root of each input’s Wald chi-square). Estimates colored bright yellow are relatively large negative coefficients and estimates colored bright red are relatively large positive coefficients.
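The signed square root mentioned above is a one-line calculation. The Python sketch below uses made-up estimates and chi-square values purely for illustration; the function name t_score is hypothetical.

```python
import math


def t_score(estimate, wald_chi_square):
    """Square root of the Wald chi-square, carrying the estimate's sign."""
    return math.copysign(math.sqrt(wald_chi_square), estimate)


# Hypothetical values, not taken from the Regression node output
print(t_score(-0.5, 9.0))   # -3.0
print(t_score(0.2, 4.0))    # 2.0
```

The magnitude reflects how strongly the data support a nonzero coefficient, and the sign shows the direction of the effect.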

Select the Output tab and scroll to the bottom of the window.

Odds ratios are often used to make summary statements about standard logistic regression models. By subtracting one from the odds ratio and multiplying by 100, you can state the percent change in odds of a response.

For continuous inputs, the value corresponds to the change in the odds for each unit change in amodel input. For example, the model predicts for

• each year increase in customer age, the purchasing odds change by (0.966−1) × 100 =−3.4% or a 3.4% decrease in the purchasing odds

• each dollar increase in total purchases in the last 24 months, the purchasing oddschange by (1.000−1) x 100 = 0% or no decrease in the purchasing odds.

For nominal and binary inputs, odds ratios are presented versus the last level of the input. Forexample, the model predicts the purchasing odds for

• an unmarried customer are (0.641−1) × 100 = −35.9% or 35.9% lower than for a marriedcustomer

• a renter are (1.482−1)x100 = 48.2% or 48.2% higher than for a homeowner.
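The odds-ratio arithmetic above can be reproduced with a short Python sketch. It is not Enterprise Miner output; the dictionary labels are descriptive placeholders. The ten-year example relies on the fact that, for a main effect in a logistic regression, odds ratios multiply across unit increases.

```python
def percent_change_in_odds(odds_ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0

# Odds ratios quoted in the text; the labels are descriptive only.
odds_ratios = {
    "age (per year)": 0.966,
    "24-month purchases (per dollar)": 1.000,
    "unmarried vs. married": 0.641,
    "renter vs. homeowner": 1.482,
}

for label, ratio in odds_ratios.items():
    print(f"{label}: {percent_change_in_odds(ratio):+.1f}%")

# Odds ratios compound multiplicatively: a ten-year age difference
# corresponds to an odds ratio of 0.966 ** 10, not 10 * (-3.4%).
ten_year_ratio = 0.966 ** 10
print(f"age (per ten years): {percent_change_in_odds(ten_year_ratio):+.1f}%")
```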

Close the Regression node.


2.7 Decision Tree Models

Objectives

• Cultivate a decision tree.

• Interpret a decision tree.


Decision Tree Model

The attraction of a standard regression model is its simplicity. This is also its greatest flaw. The main effects model may not be sufficiently flexible to accommodate the relationship between the inputs and the target. If more complex associations exist, both the predicted response probabilities and any interpretations of the modeling results will be inaccurate.

By adding polynomial and interaction terms to the standard model, additional flexibility can be incorporated into the logistic regression model. Unfortunately, adding these terms greatly complicates model interpretation. Moreover, the number of possible two-way interaction terms increases as the square of the number of inputs. This can pose computational difficulties for large numbers of inputs. If the input/target association is truly more complex and there is a desire to retain model interpretability, you may want to consider a modeling tool other than standard regression.

Decision Trees offer an attractive alternative to regression models. The models are very flexible, they offer easy interpretability, and they even handle missing values without imputation.

1. Add a Decision Tree node to the workspace.

2. Connect the Data Partition node to the Decision Tree node.

3. Connect the Decision Tree node to the Assessment node.

The modified diagram should appear as shown. Note the Assessment node has been lowered for clarity.


Run the diagram from the Assessment node. Because the Regression node and its predecessors have not changed, Enterprise Miner only runs the Decision Tree node.

1. Select Yes to view the assessment results.

2. Select the decision tree model, rename it Tree, and press Enter.

3. Select Tools → Lift Chart from the pull-down menu.

Shockingly, the performance of the tree model is similar to the baseline model. Examine the results of the tree modeling tools to determine why this has occurred.

Close the Lift Chart window and the Assessment window. Open the Decision Tree - Results window by clicking on the Decision Tree node with the right mouse button and selecting Results….


The Decision Tree - Results window features four main elements.

At the upper-left, a table summarizes the overall classification process. Below this, another table lists the assessment value per case for increasing tree complexity. To the right, a plot presents the same information graphically. Finally, at the upper-right, the Tree Ring Diagram indicates how the training data is split for the selected tree complexity. All tables and plots, except for the Tree Ring Diagram, present statistics for both the training and validation data.

The Decision Tree - Results window should be used in conjunction with the Tree Diagram window.

1. Move the Decision Tree - Results window to the right.

2. Select View → Tree from the pull-down menu.

The Tree Diagram window opens. As indicated in the Decision Tree - Results window’s complexity table and confirmed in the Tree Diagram window, the selected tree has a single leaf.


Tree Selection

To model the test mailing response, the decision tree algorithm grows a single, large tree. Branches from this large tree can be removed to create subtrees with a varying number of leaves. The decision tree algorithm searches the set of subtrees of the initial large tree for the best tree with a given number of leaves up to the number in the initial large tree. The results of this search are accessed via the table at the lower-left of the Decision Tree - Results window. Here, best is defined in terms of assessed profit, as discussed below.

Select row 2 of the complexity table.

The tree diagram shows the root node split into two leaves. The left leaf corresponds to customers whose age is less than 31.5 years. The right leaf contains the remainder of the data. The count and the proportion of responders and non-responders are displayed in each leaf. The left column is for the training data, and the right column is for the validation data. Notice that the response proportion for the younger customers is more than twice that of the older customers.


Select row 3 of the complexity table.

The older customers are split into two groups based on their 18-month purchase frequency, BUY18. Customers with more than one purchase in the 18 months prior to the campaign are more than twice as likely to purchase as customers with none or one purchase.

Both of these models seem substantially better than the selected single leaf model. Why was it selected instead?


Decision trees select the simplest model with the highest validation assessment value. The default assessment value is calculated from a user definable profit matrix that awards 1 point for each correct responder classification and 1 point for each correct non-responder classification. Each leaf in the tree is defined as a responder leaf or a non-responder leaf depending on which alternative results in the larger profit on the training data. The default profit matrix results in the most prevalent class in the training data determining the classification of each leaf.

The overall assessment values, found in the tree complexity table, are the weighted averages of the assessment values calculated in each leaf.

The single leaf tree was selected because it was the simplest tree with the highest validation assessment value. If the profits associated with correctly identifying responders and non-responders are equal, this will be the correct tree to deploy. However, these profits are not equal, so the single node tree is not the optimal choice. To choose the correct tree complexity, the correct profits (or costs) associated with each classification alternative must be incorporated into the tree growing process.
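A small Python sketch can show why equal profits favor the single leaf. The leaf counts are hypothetical, and the sketch simplifies the procedure described above by using one set of counts both to label each leaf and to assess it (the text labels leaves on training data and assesses on validation data). With equal profits, every leaf in which responders are the minority is labeled non-responder, so splitting adds nothing; with unequal profits (13 and 1 are used here purely for illustration) the split is rewarded.

```python
# Hypothetical counts per leaf: (responders, non_responders).
single_leaf = [(75, 925)]
two_leaves = [(30, 170), (45, 755)]

def assessment(leaves, profit_responder=1.0, profit_non_responder=1.0):
    """Average profit per case when each leaf takes its more profitable label."""
    total_cases = sum(r + n for r, n in leaves)
    total_profit = 0.0
    for responders, non_responders in leaves:
        # Profit if the whole leaf is labeled responder or non-responder.
        as_responder = responders * profit_responder
        as_non_responder = non_responders * profit_non_responder
        total_profit += max(as_responder, as_non_responder)
    # Dividing the summed profit by all cases is the same as taking the
    # leaf-size-weighted average of the per-leaf values.
    return total_profit / total_cases

print(assessment(single_leaf), assessment(two_leaves))
print(assessment(single_leaf, 13, 1), assessment(two_leaves, 13, 1))
```

In the first line of output the two trees tie, so the simpler one would be chosen; in the second the two-leaf tree scores higher.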


Profits and Costs

You will now regrow the tree, this time incorporating the costs associated with misclassifying responders and non-responders.

1. Close the Decision Tree - Results window. (Do not save the changes.)

2. Open the Decision Tree node.

3. Select the Assess tab.

4. Select the Minimize misclassification cost radio button.

5. Select the Cost Matrix sub-tab.

The settings window now shows a misclassification cost matrix. The levels of the target variable (RESPOND) appear in the first column. The two additional columns correspond to the predicted class and the resulting action: 1 corresponds to sending a catalog and 0 corresponds to not sending a catalog.

6. Type 13 in the first row of the 0 column.

The cost of not mailing a catalog to a potential responder equals the expected profit (assumed to be $14) less the cost of mailing (assumed to be $1). The cost of mailing a catalog to a non-responder is simply the mailing cost ($1).
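The effect of these two costs on the mailing decision can be sketched in Python. This is an illustration of the expected-cost comparison implied by the cost matrix, not the Decision Tree node’s internal code; the function and constant names are hypothetical.

```python
# Costs from the text: $14 expected profit less $1 mailing cost, and $1 mailing cost.
COST_MISSED_RESPONDER = 13.0
COST_WASTED_MAILING = 1.0

def expected_costs(p_respond):
    """Expected cost of mailing and of not mailing, given a response probability."""
    cost_if_mail = (1.0 - p_respond) * COST_WASTED_MAILING
    cost_if_skip = p_respond * COST_MISSED_RESPONDER
    return cost_if_mail, cost_if_skip

def should_mail(p_respond):
    """Mail when mailing has the lower expected cost."""
    cost_if_mail, cost_if_skip = expected_costs(p_respond)
    return cost_if_mail < cost_if_skip

# Probability at which the two expected costs are equal.
break_even = COST_WASTED_MAILING / (COST_WASTED_MAILING + COST_MISSED_RESPONDER)
print(round(break_even, 4), should_mail(0.10), should_mail(0.05))
```

Under these costs a leaf is worth mailing once its response proportion exceeds 1/14, far below the 50% that the default profit matrix effectively requires.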

1. Close and save the changes to the Decision Tree node.

2. Run the diagram from the Decision Tree node.

3. Select Yes to view the results.

The tree that minimizes the misclassification cost has six leaves.

4. Open the Tree Diagram window.

5. Select View → Tree.


The tree complexity table shows that the selected tree has six leaves, but only four leaves are visible. To remedy this situation, you must increase the tree viewing depth from the default of three to a number sufficient to view the entire tree structure, in this case six.

6. Select Tools → Tree Options….

7. Type 6 in the Tree depth down field.

The Tree Diagram should appear as follows.

The initial splits in the tree are identical to those seen in the three-leaf tree selected from the default cost structure. The additional leaves are obtained by partitioning the older individuals with at most one purchase in the last 18 months.


How does this relatively simple tree perform as a classification model?

1. Close and save changes to the Decision Tree node.

2. Open the Assessment node.

3. Select both the regression model and the decision tree model.

4. Select Tools → Lift Chart.

The decision tree model is as good as or better than the standard regression for most of the data. You can examine profit for the tree.

1. Deselect the regression model.

2. View the lift chart for the decision tree model.

3. Set and apply the profit matrix as before.

4. Examine both the cumulative and non-cumulative profit charts.


The non-cumulative plot suggests mailing to 50% of customers (instead of 60% for the regression model). Doing so will result in an expected profit per solicitation of about $0.60 (instead of $0.44 for the regression model).
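How a mailing depth can be read from cumulative and non-cumulative profit is sketched below in Python. The decile response rates are hypothetical and are not the BUYTEST results; the $14 revenue and $1 mailing cost are the assumptions stated earlier. Whether Enterprise Miner computes its profit charts in exactly this way is not confirmed by the text.

```python
REVENUE_PER_RESPONSE = 14.0
MAILING_COST = 1.0

# Hypothetical response rates for ten equal-sized groups, ranked best first.
response_rate_by_decile = [0.20, 0.15, 0.12, 0.10, 0.08,
                           0.06, 0.05, 0.04, 0.03, 0.02]

def profit_per_solicitation(response_rate):
    """Expected profit of one mailing at the given response rate."""
    return response_rate * REVENUE_PER_RESPONSE - MAILING_COST

# Non-cumulative: profit within each decile on its own.
non_cumulative = [profit_per_solicitation(r) for r in response_rate_by_decile]

# Cumulative: average profit over all deciles mailed so far.
cumulative = []
running_total = 0.0
for depth, profit in enumerate(non_cumulative, start=1):
    running_total += profit
    cumulative.append(running_total / depth)

# Mail to every decile whose own profit is still positive
# (valid here because the deciles are ranked from best to worst).
mailing_depth = sum(1 for profit in non_cumulative if profit > 0)
print(mailing_depth * 10, round(cumulative[mailing_depth - 1], 2))
```

The non-cumulative values locate the cutoff, and the cumulative value at that depth gives the average profit per solicitation for everyone mailed.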


Changing the Default Profit Vector

To examine profit for both models simultaneously you must set the profit matrix at the modeling node.

1. Close the Lift Chart window and the Assessment node.

2. Select the Regression node with the right mouse button.

3. Select Model manager….

4. Select the regression model.

5. Select the Profit Vector tab.

6. Type 14 on the first row and 1 on the third row.

7. Select Save as Default → OK.

8. Close and save changes to the Model Manager window.

9. Repeat steps 2 through 5 for the Decision Tree node.

10. Select Reset with Default.

11. Close and save changes to the Model Manager window.

12. Run the flow from the Assessment node and view the results. The profit plots are shown below.


Although both the Regression and Decision Tree nodes appear to run, no new model is being trained. The flow is run only to update the assessment data with the new profit information. All subsequent models will now use the updated profit vector for profit calculations.


2.8 Dimension Reduction

Objectives

• Use the Decision Tree node for dimension reduction.


Dimension Reduction

In addition to being a potentially useful modeling method on their own, decision trees are often used in conjunction with other modeling methods. They are especially useful as a filter for irrelevant and redundant inputs (dimension reduction) and as a method for detecting important variable interactions. These tasks are crucial to building more flexible classification models like polynomial regression models and neural networks.

Enterprise Miner software offers two tree methods for dimension reduction. The Variable Selection node is best for rapid rough-cuts through data sets with a large number of variables and a binary target. The Decision Tree node, used in the previous section, provides a more refined selection mechanism at a cost of slightly more processing time.

Both methods require some setup before use. The default settings for the Variable Selection node tend to select too many variables. The default settings for the Decision Tree node tend to select too few variables.


Dimension Reduction with the Decision Tree Node

Use the Decision Tree node for dimension reduction.

1. Add a Decision Tree node to the workspace.


The modified diagram should appear as shown.


4. Select the Advanced tab.

5. Select the Distinct distribution in leaves radio button.

The Distinct distributions in leaves option changes the assessment criterion used for pruning. The best tree with a given number of leaves will be the one giving the best approximation to the underlying distribution of the target variable, not necessarily the one likely to provide the most parsimonious classification tree.

6. Select the Score tab.

7. Select the Process or Score: Training, Validation, and Test check box.

8. Select the Variables sub-tab.

9. Remove the Leaf identification variable check mark.

10. Remove the Prediction variables check mark.


The node is now configured for eliminating redundant and irrelevant inputs.


12. Run the Decision Tree node and view the results.

13. Select the Score tab.

14. Select the Variables sub-tab.

The decision tree has reduced the number of inputs from sixteen to six: AGE, BUY18, CLIMATE, MARRIED, OWNHOME, and VALUE24. One interpretation of this process is that most of the important variability in response is captured using these dimensions. The next step is to build a flexible prediction model on these chosen variables.


Dimension Reduction with the Variable Selection Node (Optional)

The Variable Selection node also uses a tree-based technique to reduce the dimension of the input space. To use the Variable Selection node instead of the Decision Tree node, follow the instructions below. Due to differences in the tree algorithm, the Variable Selection node will choose different variables than the Decision Tree node. The remainder of the chapter assumes use of the Decision Tree node.

1. Add a Variable Selection node to the workspace.

2. Connect the Data Partition node to the Variable Selection node.

The completed diagram should appear as shown.

1. Open the Variable Selection node.

2. Select the Target Associations tab.

The node actually contains two selection methods, one based on regression methods and the other based on tree methods. The regression-based method gauges the importance of inputs and, optionally, their two-way interactions (for categorical inputs) based on an R-square statistic.

The tree-based method gauges the importance of inputs and their higher order interactions based on a Chi-square statistic. The BUYTEST data contains both interval and categorical inputs suggesting the use of the Chi-square selection method.

3. Select the Chi-square selection criterion.


4. Select Settings….

The Settings button opens a settings window that allows you to control how continuous variables should be binned, the minimum chi-square value required for a split, and the number of bins to create for interval variables prior to calculating the chi-square splitting value.

! Large data sets require a substantial increase in the default chi-square value to decrease the likelihood of including an excessive number of inputs.

5. Type 10 in the Chi-squared field. A lower chi-squared value will over-select inputs.

6. Select OK.

7. Remove the Score data sets check mark.

8. Close the Variable Selection node and save the changes.

9. Run the Variable Selection node and view the results.

The role of each variable after selection is displayed along with a reason for variablerejection.

10. Select the Output tab and scroll to the bottom of the window.

The output from the underlying tree selection procedure is presented. At the end of the output information is a table listing the selected inputs and how many times the data was split on each input. Notice that the inputs selected by the Variable Selection node and the Decision Tree node differ.
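The role of the minimum chi-square setting used in the steps above can be illustrated with a Python sketch. It assumes a Pearson chi-square on a two-by-two table (two sides of a candidate split by responder status); the exact statistic computed by the Variable Selection node is not given in these notes, and all counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

MINIMUM_CHI_SQUARE = 10.0  # value typed in the Chi-squared field in the steps above

# Hypothetical counts: each pair of arguments is (responders, non-responders)
# on one side of a candidate split.
strong_split = chi_square_2x2(30, 170, 45, 755)
weak_split = chi_square_2x2(16, 184, 59, 741)

for name, value in [("strong", strong_split), ("weak", weak_split)]:
    print(name, round(value, 2), value >= MINIMUM_CHI_SQUARE)
```

For a fixed difference in response proportions the statistic grows with the number of cases, which is consistent with the note that large data sets need a higher threshold.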


2.9 Polynomial Regression Models

Objectives

• Add polynomial terms to a standard regression model.

• Tune a polynomial model with a variable selection procedure.


Adding Higher Order Terms to Regression Models

Adding higher order terms to regression models can potentially improve model performance. This comes at a cost of more complicated model interpretation and training. For interpretation, the input’s association with the target may depend on the values of one or more other inputs. For training, the number of two-way interactions increases rapidly with increasing numbers of inputs.

You can alleviate some of the training difficulties by reducing the dimension of the input space. This has been accomplished using the Decision Tree node. As for the interpretability issue, the primary goal of the analysis is to identify customers with a high propensity to purchase. Statements concerning the importance of individual inputs toward response propensity are of secondary interest.

Add a Data Replacement node and a Regression node to the workspace and connect them to the process flow as shown.


1. Open the just added Regression node.

Only the variables selected by the decision tree are set to use.

2. Select Tools → Interactions Builder…. The Interaction Builder window opens.

3. Select AGE through VALUE24 from the list of input variables.

4. Select Expand. All possible two-way combinations for the selected variables have been added to the model.

5. Select AGE, BUY18, and VALUE24.

6. Select Polynomial. The model now also includes squared terms for the interval inputs.

7. Select OK to close the Interaction Builder window.

8. Scroll the variables list to confirm the variables added to the model.

The number of terms in the model has increased from 6 to 24. Probably not all these are strongly related to the target. A stepwise selection procedure is commonly used to tune model complexity.
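The count of 24 terms can be checked with a short Python sketch: six main effects, every two-way combination of the six inputs, and squared terms for the three inputs selected before choosing Polynomial. The last line shows how quickly the two-way count would grow if all sixteen original inputs were expanded instead.

```python
from math import comb

inputs = ["AGE", "BUY18", "CLIMATE", "MARRIED", "OWNHOME", "VALUE24"]
interval_inputs = ["AGE", "BUY18", "VALUE24"]  # inputs given squared terms

main_effects = len(inputs)
two_way_interactions = comb(len(inputs), 2)  # n * (n - 1) / 2 pairs
squared_terms = len(interval_inputs)

total_terms = main_effects + two_way_interactions + squared_terms
print(main_effects, two_way_interactions, squared_terms, total_terms)

# Two-way interactions for all sixteen original inputs.
print(comb(16, 2))
```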


1. Select the Selection Method tab.

2. Select Forward in the Method field.

3. Select Validation Error in the Criteria field.

4. Type 0.2 in the Entry field under Significance Levels.

The forward selection procedure creates a sequence of models by systematically adding model terms. The model with the smallest squared error on the validation data is selected as the final model.

5. Close and save changes to the Regression node.

6. Name the model Interact.

7. Run the diagram from the Assessment node and view the results.


Compare the assessment plots for both regression models.

8. Select Format → Model Name. The legend will use the model name instead of the model type.

The interaction model uniformly outperforms the standard model (First) and the tree model.


9. Examine the profit plot for the models.

Again, about 50% of the customer base should be targeted. Using the interaction model, the expected profit per solicitation will be about $0.64, about 10% greater than the tree model.


The results of the forward selection process are found in the Regression node’s results window.

1. Close the Assessment node.

2. Open the results window for the interaction regression model.

3. Select the Output tab.

4. Scroll to the bottom of the output information.

The stepwise procedure terminated after 9 steps. The variable VALUE24 did not enter into the model either by itself or as an interaction. Most of the included terms are interactions. Four of the six interaction terms involve CLIMATE.

5. Close the Regression Results window.


2.10 Training a Neural Network

Objectives

• Train a neural network model.


Training a Neural Network

Incorporating second order interaction terms into a regression model greatly improved prediction performance. Including even higher order terms may enhance performance further, but the curse of dimensionality places limits on this practice.

An alternative to enhancing standard regression models is to use a nonlinear regression model like a multilayer perceptron (MLP) neural network. MLPs automatically handle nonlinear associations between the target and the inputs. They are flexible classification methods that, when carefully tuned, often provide optimal performance in classification problems. Unfortunately, it is difficult to assess the importance of individual inputs on the classification. For this reason, MLPs have come to be known as black box predictive modeling tools.

Add a Neural Network node to the workspace as shown.

Open the Neural Network node. The neural network will use the same first order variables as the regression model with interaction terms.


Select the Diagram tab.

The diagram is a schematic representation of the MLP. Inputs are grouped by measurement scale and represented by a light blue pentagon. The inputs are connected to a hidden layer, represented by a dark blue square. By default, the hidden layer contains three hidden units. The hidden layer is connected to the output layer, a yellow pentagon, which corresponds to the target (RESPOND).

The fundamental problem for neural network modeling is choosing an appropriate neural architecture. Unfortunately, there is no foolproof way of doing this other than trial and error. It is important to realize that the default choice of three hidden units is by no means optimal for any particular problem. You may change the number of hidden units in the hidden layer as follows.

1. Double-click on the dark blue square. The Node Properties window opens.

2. Select the Hidden tab.

3. Type 4 in the Number of neurons field.

4. Select OK.

The network now has four hidden units in its hidden layer.

The parameters for a neural network model are called weights. The goal in training a neural network model is to select values for the weights that minimize some criterion such as the average validation error. To do this, Enterprise Miner searches the space of weights by starting at some random initial point. The Initialization tab allows you to specify a random seed for this search process.

1. Select the Initialization tab.

2. Type 359625 in the random seed field.


The MLP network weights are selected using an iterative method. It uses random starting values determined by the seed value. Fixing the seed gives identical results each time this example is run.

3. Select the Options tab.

4. Select the Train sub-tab.

5. Remove the Defaults check mark.

6. Type 49 in the Maximum Iterations field.

7. Close and save the changes to the Neural Network node.

8. Type Neural in the Model Name field.

9. Select OK.


Neural network modeling will be discussed in more detail in Chapter 3. At this point, it is only necessary to know that the Neural Network node will take the input and target data for each customer and build a model that gives the predicted probability that each customer will respond. The settings of this particular network were predetermined for speed and predictive power.

Run the diagram from the Assessment node.

The progress of the training can be seen in the Neural Net Monitor window. The plots show the value of the error function for training and validation data. You can stop the training at any time. The parameters of the model will be selected to minimize the validation error function at the time the training was stopped.

View the assessment results. Compare the Neural Network model to the other models already trained.

The neural net model appears to have a slight prediction advantage over all models for the initial 30% of the validation data. After that, it gives similar performance to the interaction regression model.

Close the Assessment node and select File → Save to save your work to this point.


2.11 User-defined Models

Objectives

• Locate and access model source code.

• Combine two prediction models into a single prediction model.

• Assign new modeling roles to modeling data.

• Create an assessment data set from scored modeling data.


Scoring Code Catalog

Both the polynomial logistic regression and neural network models appear to have good overall predictive properties. What happens if the predictions from the two models are combined into a single model?

You have already seen how to access model scoring code using the Score node. The score code is created and accumulated as the models are being built and is stored in a SAS catalog inside the project library. The Score node simply provides a front end to this catalog.

Click on the Assessment node with the right mouse button and select About… from the pop-up menu.

The Node Properties window opens.

Select the Imports tab.

This is a directory list of predecessor modeling nodes.

1. Select [+] for Neural Network.

2. Select SCORE_CODE_FILE.

The Value field contains the name and location of the score code file. The name consists of the characters SC_ followed by a random string of five characters (for example, SC_Q758J).

3. Write the name of the score code file on a piece of scratch paper.

4. Close the Node Properties window.


Note there are two regression entries. To determine which entry in the Node Properties window corresponds to the polynomial model, change the names of the nodes in the workspace diagram. You can change the name of a node by clicking on the text beneath the icon and editing it. Change the name of the standard regression model to Standard Regression.

1. Click to the left of the word Regression beneath the first Regression node you added.

2. Type Standard and press Enter.

3. Repeat steps 1 and 2 to change the name of the other Regression node to Polynomial Regression.

The process flow diagram should appear as shown.

With the names of the nodes changed, update the saved diagram in the project directory by selecting File → Save.


Return to the Node Properties window for the Assessment node and view the Imports tab. The names you entered in the process flow diagram are listed.

1. Select [+] for Polynomial Regression.

2. Select SCORE_CODE_FILE.

3. Write the name of the score code file on a piece of scratch paper.

4. Close the Node Properties window.


Creating a Custom Model

Now create a single model combining the polynomial regression and neural network models. The model will be created in a SAS Code node, Enterprise Miner’s window to the rest of the SAS System.

Using a Data Set Attributes node, the output of the combined models will be assigned a new model role called PREDICT. The PREDICT and TARGET variables will be combined in a User Defined Model node to create an assessment data set. The performance of the user defined model can then be compared in the Assessment node to the other modeling methods.

1. Add a SAS Code node (Utility node group) to the workspace.

2. Connect the Decision Tree node to the SAS Code node as shown.

3. Open the SAS Code node.

The SAS Code node Program tab contains a mini-editor where you can enter and edit SAS programs.

4. Select the Export tab.

5. Select Add → VALIDATE.

6. Remove the Pass imported data sets to successors check mark. A table entry for the macro variable name &_VAL is added.


The new model predicts purchase probability by averaging the response probabilities calculated by the interaction regression and neural network models. The new model is deployed against the validation data for comparison with other models.

7. Select the Program tab and type the following program:

   filename score LIBRARY "&_plib..SCORE";

   data &_val;
      set &_valid;
      %include score(SC_XXXXR);
      P_COMBO = P_RESPO1;
      %include score(SC_XXXXN);
      P_COMBO = mean(P_COMBO, P_RESPO1);
      keep RESPOND P_COMBO;
   run;

Important: Replace SC_XXXXR and SC_XXXXN with the names of the regression and neural network scoring code files you wrote down.

The program creates a new validation data set whose name is defined by the macro variable &_VAL. The new validation data set is created from the existing validation data set &_VALID.

The regression model scores the data using included score code. The predicted response probability is put into the variable P_COMBO. The neural network model scores the data using included score code. The neural network’s predicted response probability is averaged with the value obtained from the regression model. For assessment purposes, only the actual target value (RESPOND) and the predicted response probability are kept.
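The averaging logic of the SAS program can be mirrored in a language-neutral way with a short Python sketch. It is not part of the Enterprise Miner flow. The three scored cases and the field names p_regression and p_neural are hypothetical stand-ins for the values that the two included score code files would assign to P_RESPO1 in turn.

```python
from statistics import mean

def combine(p_regression, p_neural):
    """Average the two predicted response probabilities."""
    return mean([p_regression, p_neural])

# Hypothetical scored validation cases.
scored_cases = [
    {"RESPOND": 1, "p_regression": 0.30, "p_neural": 0.40},
    {"RESPOND": 0, "p_regression": 0.05, "p_neural": 0.03},
    {"RESPOND": 0, "p_regression": 0.12, "p_neural": 0.20},
]

# Keep only the actual target and the combined probability,
# as the KEEP statement does in the SAS program.
combined = [
    {"RESPOND": case["RESPOND"],
     "P_COMBO": combine(case["p_regression"], case["p_neural"])}
    for case in scored_cases
]

for row in combined:
    print(row["RESPOND"], round(row["P_COMBO"], 3))
```

Because the average of two probabilities is itself between 0 and 1, P_COMBO can be assessed in the same way as the output of any single model.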

8. Close the SAS Code node, save the changes, and then run the SAS Code node.

! If a red frame surrounds the SAS Code node, you have made a typing error. Double-check the program syntax (especially the names of the score code files).

9. Select No in the View Results dialog.

10. Add a User Defined Model node (Model node group) to the workspace and connect as shown.


11. Open the User Defined Model node.

12. Click with the right mouse button in the first row of the Predicted Variable column.

13. Select Select Predicted Variable… → P_COMBO.

14. Close and save changes to the User Defined Model node. Name the model User.

The User Defined Model node is now ready to run. When run, the node will create an assessment data set from the scored validation data. The values from the assessment data set will be used to generate assessment plots.


1. Run the User Defined Model node but do not view the results.

2. Open the Assessment node.

3. Select the user defined model, the polynomial regression model, and the neural network model.

4. Examine the assessment plots.

The best model of the collection is the user defined model. With the exception of the first decile, it gives better predictive performance than the models used to create it. (The whole is greater than the sum of its parts?) Examine the profit plots for the user defined model with a horizontal scale set to 5% increments.


Choosing to target the top 45% of the customer base as defined by the user defined model will result in an expected profit of approximately $0.75 per customer.

Chapter 3 Neural Networks

3.1 Introduction

3.2 Low Dimensional Example

3.3 Comparing to Logistic Regression

3.4 Stopped Training

3.5 Two Hidden Layers

3.6 Neural Network Classification


3.1 Introduction

A neural network has the reputation of being a new and mysterious (almost magical) tool for data mining. In truth, it is a straightforward extension of a well-known statistical technique called a generalized linear model. However, knowing this fact does not necessarily shed light on how a neural network predicts its target.

This chapter will explore topics in neural network training, generalization, and selection. The Enterprise Miner can fit a variety of network architectures. You will focus on a widely used type of neural network for supervised prediction problems: the multilayer perceptron (MLP).


3.2 Low Dimensional Example

Objectives

• Visualize output from a neural network.


Low Dimensional Example

To allow visualization of the output from a MLP, a network will be constructed with only two inputs. Two inputs permit direct viewing of the trained prediction model and speed up training.

Insert a new diagram in the My Projects folder and assemble the diagram shown below.

An Input Data Source node connects to a Data Partition node. The Data Partition node connects to a Data Replacement node. The Data Replacement node connects to a Neural Network node. The Neural Network node connects to an Insight node (Explore node group).

The input data and the data partitioning scheme for this example will be the same as in Chapter 2. Select the data for this example.

1. Open the Input Data Source node.

2. Select the BUYTEST data set.

3. Set the model role of RESPOND to target.

4. Set the model role of all other variables, except AGE and INCOME, to rejected.


Partition the data.


2. Set random seed to 1310.

3. Set Validation to 60 and Test to 0.

No test set will be needed for this example. For efficiency, the test data will be grouped with the validation data.



Now construct the MLP.

1. Open the Neural Network node.

2. Select the Diagram tab.

For this example, the default architecture with three hidden units will be used.

To synchronize the results with this text, a particular random seed will be deployed.

1. Select the Initialization tab.

2. Type 690604 in the Randomization field.

By default, the Neural Network node does not append predicted values to the training and validation data sets. You will need these predicted values in the subsequent Insight node.


Change this default setting.

1. Select the Output tab.

2. Select the Training, Validation, and Test check box.

3. Close and save changes to the Neural Network node. You do not need to change the defaultmodel name.

Run the diagram from the Neural Network node and view the results.

The Output tab shows the details of the training process. For a binary target, an MLP with one hidden layer models the logit of the probability of the target event as a linear combination of nonlinear functions (activation functions) of linear combinations of the inputs.
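The functional form just described can be sketched in a few lines of Python. This is only an illustration: the hyperbolic tangent activation and all weight values are assumptions made for the example, not estimates produced by the Neural Network node.

```python
import math

def mlp_probability(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Predicted event probability from a one-hidden-layer MLP (illustrative sketch)."""
    # Each hidden unit applies tanh (an assumed activation) to a linear
    # combination of the inputs.
    hidden = [
        math.tanh(bias + sum(w * x for w, x in zip(weights, inputs)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The logit is a linear combination of the hidden unit outputs.
    logit = output_bias + sum(w * h for w, h in zip(output_weights, hidden))
    # The logistic function maps the logit back to a probability.
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical weights for two standardized inputs (AGE, INCOME) and three hidden units.
p = mlp_probability(
    inputs=[0.5, -1.0],
    hidden_weights=[[0.8, -0.3], [-0.5, 0.9], [0.2, 0.4]],
    hidden_biases=[0.1, -0.2, 0.0],
    output_weights=[1.5, -2.0, 0.7],
    output_bias=-2.5,
)
```

Setting every weight and bias to zero gives a logit of 0 and therefore a probability of 0.5, which is a quick sanity check on the formula.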

In this MLP with four hidden units, 17 parameters (called weights and biases in the neurocomputing world) must be estimated. The Output tab first lists the initial parameter estimates and then gives details of the iterative optimization process. It then lists the final parameter estimates. The Label column explains the network connection to which the parameters correspond.
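The parameter count can be checked with a short calculation. The sketch below assumes a fully connected feed-forward network with one bias per hidden and output unit and no direct input-to-output connections; the function name is hypothetical.

```python
def count_parameters(n_inputs, hidden_layers, n_outputs=1):
    """Count weights and biases in a fully connected feed-forward MLP."""
    sizes = [n_inputs] + list(hidden_layers) + [n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        total += fan_in * fan_out  # connection weights between adjacent layers
        total += fan_out           # one bias per receiving unit
    return total

four_units = count_parameters(2, [4])   # 2*4 + 4 + 4*1 + 1 = 17
three_units = count_parameters(2, [3])  # 2*3 + 3 + 3*1 + 1 = 13
```

Under these assumptions, two inputs and four hidden units give 17 parameters, while two inputs and three hidden units would give 13.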

Unfortunately, these estimates do not have a simple interpretation like those in a standard linear model. A neural network is a black box with regard to interpretation: inputs go in and predicted values come out.

Select the Plot tab.

The plot shows the average value of the error function for both the training and the validation data sets. The blue and red lines represent the validation data and training data, respectively.

The training error function decreases gradually over the iteration history. The validation error function decreases and then increases. The white vertical line indicates the minimum value of the validation error function. The network parameter estimates are taken from this iteration.


Curiously, at this iteration, the training error function is nowhere near its minimum. Why not select the parameter estimates that minimize training error?

1. Select the minimum value of the training error from the plot. The vertical white line should move to the last iteration.

2. Click on the plot with the right mouse button and select Set network at… → Set network at selected iteration.

3. Select Tools → Score from the menu bar.

4. Select Yes to rescore at selected iteration.

5. Select OK to proceed.

6. Close the Neural Network node.

The parameter estimates corresponding to the selected iteration are recalled and the data is rescored using these new estimates.

Visualizing the response probability as a function of AGE and INCOME will show why the last iteration was not selected.

Run the Insight node and view the results.

A sample of the training data opens as a SAS/INSIGHT data table.

1. Select Analyze → Rotating Plot (Z Y X).

2. Select P_RESPO1 → Y.

3. Select AGE → Z.

4. Select INCOME → X.

5. Select Output → At Minima.

6. Select OK → OK.

7. Resize the display as desired.

8. Select the right arrow at the lower-left of the plot and select Marker Sizes → 3.

The scatter plot shows the fitted probability surface. An MLP is a flexible nonlinear function that can model any smooth surface. This surface features a large plateau and ridge. The highest probabilities correspond to cases with low ages and high incomes. The predicted probability is simply a step towards a decision rule. A cutoff probability must be chosen. The cutoff will determine a decision boundary. In Chapter 2, a cutoff of 0.065 was determined to be appropriate for an expected profit of $14 per order placed and a mailing cost of $1 per catalog.


Rotate each plot to view the fitted probability surface. You can rotate the plots in one of three ways.

First, you may use the directional icons located on the boundaries of the plot window. Second, if you move your cursor to one of the four corners of the plot window, the cursor changes to a hand. Dragging the hand allows you to spin the plot. The final method requires the Tools window.

1. Select Edit → Windows → Tools.

2. Select the hand icon from the Tools window and spin the plot by dragging.

Now identify the decision boundary.

1. Select a color button (red in this example) from the Tools window.

2. Select P_RESPO1 → > → 0.065 → OK.

All red points have a predicted response probability in excess of the 0.065 threshold. Note the two separate regions of high response probability. In general, the model implies a complex association between inputs and target.

Close the Insight node.


3.3 Comparing to Logistic Regression

Objectives

• Compare the performance of classification tools.

• Understand the implications of an overtrained model.

Comparing to Logistic Regression

A standard logistic regression model is an MLP with zero hidden layers and a logistic output activation function.

Visualize a fitted logistic regression surface.

Drag a Regression node and a Control Point node (Utility tool group) onto the workspace and connect them as shown below.

Note that the connection between the Neural Network node and the Insight node now passes through the Control Point node. The Control Point node is an aid to simplify potentially complex process flow diagrams.

1. Open the Regression node.

2. Select the Output tab and then the Scored Data Sets sub-tab.

3. Select the Training, Validation, and Test check box.

4. Close and save changes to the Regression node. By default, the Regression model is named Untitled. You may edit this name if desired.

5. Run the diagram from the Regression node but do not view the results.

6. Open the Insight node.

7. Select Select… and note the name of the scored validation data set from the Neural Network model. You will open this data set from within Insight.


8. Choose the Scored Validation data from the Regression predecessor as the Selected Data.

9. Select OK.

10. Select Tools → Run Insight.

11. Generate a rotating scatter plot as you did in the previous section.

12. View the rotating scatter plot for the neural network model at the same time by opening the scored data set from the Neural Network node. (Select File → Open.) The name was noted in step 7.

In contrast to the MLP (right), the logistic model (left) is a planar surface (except where it curves to accommodate probabilities that are close to zero or one). The decision boundary for the standard logistic regression model is a continuous, straight line.
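Why the boundary is a straight line can be seen by solving the logistic model for the cutoff. The sketch below assumes a model of the form logit(p) = b0 + b_age*AGE + b_income*INCOME; the coefficient values and the function names are hypothetical, chosen only to make the example run.

```python
import math

def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1.0 - p))

def logistic_probability(age, income, b0, b_age, b_income):
    """Predicted probability from a two-input logistic regression."""
    return 1.0 / (1.0 + math.exp(-(b0 + b_age * age + b_income * income)))

def boundary_income(age, cutoff, b0, b_age, b_income):
    """INCOME at which the predicted probability equals the cutoff for a given AGE."""
    # Solve b0 + b_age*age + b_income*income = logit(cutoff) for income.
    return (logit(cutoff) - b0 - b_age * age) / b_income

# Hypothetical coefficients; INCOME is measured in thousands of dollars here.
b0, b_age, b_income = -1.0, -0.05, 0.01
incomes = [boundary_income(age, 0.065, b0, b_age, b_income) for age in (25, 40, 60)]
```

Because the boundary income is a linear function of AGE, every point on it has the same slope, -b_age/b_income, whatever cutoff is chosen. Changing the cutoff only shifts the line.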

Close SAS/INSIGHT and the Insight node.

A more complex model is not always better. To compare the MLP to the logistic model, connect an Assessment node to the control point. The diagram should appear as shown.

Run the diagram from the Assessment node and view the results.

The assessment plots show that in this case the logistic regression model is superior to the neural network. The neural network overfit the data. The flexibility of neural networks makes them prone to overfitting. They can easily mistake the noise for the signal; that is, they can follow nuances in the sample data that are not features of the true model. This is precisely what has occurred here.


3.4 Stopped Training

Objectives

• Understand the benefits of stopped training.

Stopped Training

An effective remedy for overfitting is stopped training.

View the results of the Neural Network node and select the Plot tab.

If the model has a high degree of flexibility, the best model on the training set will usually not be the best model on the validation data set. The value of the error function on the validation data set (red line) can be monitored during training. Stopped training refers to choosing the fitted model that corresponds to an earlier iteration. Usually the iteration that minimizes the error function on the validation data set is chosen. Recall that this value is chosen by default.
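The selection rule amounts to taking the iteration with the smallest validation error. A minimal sketch follows, with made-up error histories standing in for the values the Plot tab displays.

```python
def stopped_training_iteration(validation_errors):
    """Return the index of the iteration with the smallest validation error."""
    return min(range(len(validation_errors)), key=lambda i: validation_errors[i])

# Hypothetical average error per iteration (not taken from the BUYTEST run).
training_errors = [0.30, 0.27, 0.25, 0.24, 0.235, 0.23, 0.228]
validation_errors = [0.31, 0.28, 0.265, 0.262, 0.266, 0.271, 0.280]

best = stopped_training_iteration(validation_errors)  # iteration chosen by stopped training
last = len(training_errors) - 1                       # iteration that minimizes training error
```

In this made-up history the training error is still falling at the last iteration, while the validation error reached its minimum at iteration 3 and then rose. This is the pattern the plot shows.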

1. Click in the plot using the right mouse button and then select Set network at… → Set network at minimum iteration.

2. Select Tools → Score from the menu bar.

3. Select Yes to rescore at selected iteration.

4. Select OK to proceed.

5. Close the Neural Network node.


Reexamine the predicted values using the rotating plot in the Insight node.

The neural network (left) and logistic regression (right) models have virtually identical fitted surfaces.

Reexamine the assessment plots in the Assessment node.

Not surprisingly, the assessment plots show that the two models have nearly identical predictive power.

3.5 Two Hidden Layers

Objectives

• Investigate MLPs with additional hidden layers.


Two Hidden Layers

An MLP can have more than one hidden layer. Additional hidden layers add complexity to the model. Theoretically, an MLP with one hidden layer has sufficient flexibility to model any surface. However, models that are more parsimonious can often be constructed using more than one hidden layer.

Add a hidden layer to a neural network.

1. Open the Neural Network node.

2. Select the Diagram tab.

3. Double-click on the hidden layer icon.


5. Type 3 in the Number of neurons field and select OK.

6. Delete the connection between the hidden layer and the output layer.

7. Click anywhere on the diagram using the right mouse button and select Add hidden layer.

8. Connect the new hidden layer between the first hidden layer and the output layer.

9. Double-click on the hidden layer icon.


11. Type 2 in the Number of neurons field and select OK.

12. Click in the diagram using the right mouse button and select Align nodes.

The network diagram should appear as shown.

13. Close and save changes to the Neural Network node. Rerun the diagram from the Assessment node and view the results.

The network shows superior performance to the regression model for the top half of the data. How has the fit changed?

14. Run the Insight node for the neural network and view the results. Construct a rotating plot as before.

The dip in the prediction surface for customers in their thirties seems to have improved the fit of the model.

15. Color observations with a predicted probability in excess of 0.065.

The new decision boundary now seems to exclude a greater number of older customers. This may also contribute to fit improvement.


3.6 Neural Network Classification

Objectives

• Use decision trees to understand neural network classification.

Neural Network Classification

In the previous section, you saw that specifying a classification threshold partitions the input space into regions of predicted response and nonresponse. By visualizing the prediction surface, it was easy to describe the characteristics of the predicted responders. If you were to write such a description in words, it might be stated in terms like these:

If AGE < 38 then RESPOND=1;
If (32 < AGE < 42) and INCOME < $40,000 then RESPOND=1;
If 42 < AGE < 52 then RESPOND=1;
Otherwise RESPOND=0;

Such a description is exactly what a decision tree produces.
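As an illustration, the rule description quoted above can be transcribed directly into a small function. The function name is hypothetical, and the thresholds are copied from the text as written rather than read from a fitted tree.

```python
def predicted_respond(age, income):
    """Transcription of the three rules quoted in the text; returns 1 or 0."""
    if age < 38:
        return 1
    if 32 < age < 42 and income < 40000:
        return 1
    if 42 < age < 52:
        return 1
    return 0  # Otherwise RESPOND=0

young = predicted_respond(30, 50000)        # first rule applies
middle_low = predicted_respond(40, 30000)   # second rule applies
middle_high = predicted_respond(40, 50000)  # no rule applies
older = predicted_respond(60, 10000)        # no rule applies
```

Each rule is an axis-parallel region of the AGE by INCOME plane, which is the kind of region a decision tree leaf describes.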

You can use a decision tree to describe the decision boundary for a neural network. To do this you will need to create a variable for the predicted action for the data (1 when the predicted probability is in excess of the cutoff and 0 otherwise). You can then build a tree to describe the regions in which the predicted response equals 1. This will give a reasonable approximation to the decision boundary for the neural network prediction model.

1. Add a Transform Variables node and connect it to the Neural Network node.

2. Add a Data Set Attributes node and connect it to the Transform Variables node.

3. Add a Decision Tree node and connect it to the Data Set Attributes node.

4. The completed diagram should appear as shown.


Open the Transform Variables node. You will use this node to create a new variable whose value is 1 when P_RESPO1 is greater than .065 and 0 otherwise. Such a variable is called an indicator variable because it indicates the presence of a particular condition.

1. Select Actions → Create Variable. The Computed Column window opens.

2. Type DECISION in the Name field.

3. Select Define….

The Customize window opens. The top part of the window has a selectable list of variable names. Selecting the name will result in the variable appearing in the function definition area at the bottom of the window.

4. Type (P_RESPO1 > 0.065) in the function definition area. DECISION equals 1 when P_RESPO1 > 0.065 and 0 otherwise.

5. Select OK to close the Customize window and OK again to close the Computed Column window.

After a brief pause, the variable DECISION will appear in the variable list along with statistical summary information. Note the summary information is calculated using the metadata sample and will vary from run to run.

6. Close the Transform Variables node.
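The comparison typed in step 4 evaluates to 1 when it is true and 0 when it is false, and that is what makes DECISION an indicator variable. A Python sketch of the same computation follows; the scores are hypothetical values used only to show the behavior at the cutoff.

```python
CUTOFF = 0.065

def decision_indicator(p_respo1, cutoff=CUTOFF):
    """Return 1 when the predicted probability exceeds the cutoff, 0 otherwise."""
    return int(p_respo1 > cutoff)

# Hypothetical predicted probabilities, including one exactly at the cutoff.
scores = [0.02, 0.065, 0.0651, 0.40]
decisions = [decision_indicator(p) for p in scores]
```

Note that a score exactly equal to the cutoff yields 0, because the comparison is strict.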

Now change the target to be predicted by the tree from RESPOND to DECISION.

1. Open the Data Set Attributes node.

2. Set the New Model Role of all variables except AGE and INCOME to rejected.

3. Set the New Model Role of DECISION to target.

4. Set the New Measurement of DECISION to binary.

5. Close and save changes to the Data Set Attributes node.

6. Run the diagram from the Decision Tree node and view the results.

7. Open the Tree Diagram.

8. Select Tools → Tree Options from the menu bar.

9. Type 6 in the Tree Depth Down field.

You can now see the complete tree approximation to the neural network.

Close the Decision Tree node.

This technique will work for any model, even those with many inputs. (As an exercise, try visualizing the combined model of Chapter 2.)

Chapter 4 Decision Trees

4.1 Introduction

4.2 Problem Formulation

4.3 Decision Tree Comprehension

4.4 Cultivation of Tree Varieties

4.5 Tree Deployment

4.1 Introduction

Decision trees are widely used for predictive modeling. As already noted in Chapter 2, they are easily interpretable, can model complex input/target associations, and can automatically handle missing values. For interval targets, they are usually referred to as regression trees. When the target is categorical, they are usually referred to as classification trees. This chapter covers the use of the Decision Tree node for growing and interpreting classification trees.



4.2 Problem Formulation

The consumer credit department of a bank wants to automate the decision making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections).

The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable indicating whether an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.

Category       Variable   Description

Application    REASON     Debt consolidation or home improvement
               JOB        Six occupational categories
               LOAN       Amount of loan request
               MORTDUE    Amount due on existing mortgage
               VALUE      Value of current property
               DEBTINC    Debt to income ratio
               YOJ        Years at present job

Credit bureau  DEROG      Number of major derogatory reports (bankruptcies, foreclosures, charge-offs, etc.)
               CLNO       Number of trade lines
               DELINQ     Number of delinquent trade lines
               CLAGE      Age of oldest trade line in months
               NINQ       Number of recent credit inquiries

The credit scoring model will give a probability of a given loan applicant defaulting on loan repayment. A threshold will be selected such that all applicants whose probability of default is in excess of the threshold will be recommended for rejection.

Note: To obtain results consistent with this document, submit the following line of code in the SAS PROGRAM EDITOR after opening the Enterprise Miner.

%let DM_SEED=12345;

4.3 Decision Tree Comprehension

Objectives

• Use tools in the Decision Tree node to explore and understand a decision tree.


Decision Tree Comprehension

Open a new workspace from the Available Projects window and assemble the diagram shown below.

An Input Data Source node connects to a Data Partition node. At the Data Partition node, the path splits. The top path has the Data Partition node followed by a Data Replacement node, a Regression node, and an Assessment node. The bottom path has the Data Partition node followed by a Decision Tree node and the top path’s Assessment node.

The choice of standard regression and decision tree models was not made arbitrarily. The Equal Credit Opportunity Act mandates interpretability for credit scoring models.

Set up the Input Data Source node.


2. Select the HMEQ data set from the DMEM library.

3. Set the Model Role for BAD to target.

4. Set the Measurement scale for DEROG to interval.


Examine the distribution plots for the individual variables as desired. For example, examine the distribution plot for DEBTINC.

Most of the debt to income ratios in the data set are less than about 45.

Select the Interval Variables tab.

Note the high missing rate for DEBTINC. More than 20% of the applicants have a missing value for this variable. Some method will be needed to handle the missing values in the data set.

Close and save changes to the Input Data Source node.

Now partition the HMEQ data for modeling. Once again, create training and validation data sets and omit the test data.


2. Set Train, Validation, and Test to 67, 33, and 0, respectively.



For this first analysis pass, use the default settings for the Regression and the Decision Tree nodes.


2. Select both models.

3. Select Tools → Lift Chart from the menu bar. A cumulative gains chart appears.

The chart shows the decision tree model dominating the regression model. The decision tree’s first decile contains more than 80% bads. By comparison, the regression’s first decile contains only about 66% bads.

Select the %Captured Response radio button.

The Target Concentration curve (%Captured Response) shows similarly dramatic results. By rejecting the worst 30% of all applications using the decision tree, you eliminate more than 80% of the bad loans from your portfolio. The same performance from the regression model requires rejecting almost half of the applications.
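The %Captured Response statistic can be sketched as follows: rank the cases by predicted probability, take the top fraction, and ask what share of all target events falls in that fraction. The data below are hypothetical and are not drawn from HMEQ; the function name is illustrative and is not how the Assessment node is implemented.

```python
def captured_response(scores, targets, fraction):
    """Share of all target events found in the top `fraction` of cases ranked by score."""
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    n_top = round(len(ranked) * fraction)
    events_in_top = sum(target for _, target in ranked[:n_top])
    total_events = sum(targets)
    return events_in_top / total_events

# Hypothetical scores and outcomes (1 = bad loan) for ten applicants.
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
targets = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

top_30 = captured_response(scores, targets, 0.3)  # 2 of the 4 bads are in the top 3 cases
top_50 = captured_response(scores, targets, 0.5)  # 3 of the 4 bads are in the top 5 cases
```

A model that ranks well captures most of the events in a small fraction of the cases. This is the sense in which one curve dominates another.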

Clearly the default decision tree model is a better choice. Several questions remain:

• Why does the tree outperform the standard regression model?

• Can you grow a better tree than the one grown using the default settings?

• What threshold should be used as a cutoff for rejecting loans?


One reason for the superiority of the tree may be its innate ability to handle nonlinear associations between the inputs and the target. To test this idea, try modeling the data with another flexible, nonlinear regression method, a neural network. Although you may not be able to use the neural network model for the purposes of credit scoring, it may give some insight into the differences between the standard regression and neural network models.

1. Add a Neural Network node to the workspace.

2. Connect the Data Replacement node to the Neural Network node.

3. Connect the Neural Network node to the Assessment node.

The completed diagram should appear as shown:

1. Set the neural network initialization seed to 611.


3. Select all three models.

4. Select Tools → Lift Chart from the menu bar.


Although the neural network model performs slightly better than the standard regression model, the decision tree model is still the best. Nonlinear association does not explain the difference between the decision tree and the standard regression model.

Open the Decision Tree node results window.

From the summary information, you see that it takes only seven leaves to beat the regression and neural network models. The assessment table gives a validation accuracy of 88.87%.

1. Select View → Tree from the menu bar.

Although the selected tree was supposed to have seven leaves, only six leaves are visible. By default, the decision tree viewer displays three levels deep.


2. Select Tools → Tree Options….

3. Type 6 in the Tree depth down field.

4. Select OK.

The complete tree is now visible.

The colors in the tree ring diagram and the decision tree itself are set to indicate node purity. If the node contains all ones or all zeros, the node is colored red. If the node contains an equal mix of ones and zeros, it is colored yellow.

You can change the coloring scheme to indicate target proportion instead of purity.

1. Select the Decision Tree-Results window.

2. Select Tools → Define Colors… from the menu bar.

The Data Splits – Color Palette window opens.

3. Select the Proportion of a target value radio button.

4. Select 0 in the Select a target value table.

5. Select OK.

The decision tree color palette is updated so that red indicates a high proportion of bads, yellow indicates a balanced mixture of bads and goods, and green indicates a high proportion of goods.


Using this new color palette, you can see how the tree is splitting the data simply by studying the tree ring diagram. The first split partitions the applicants into a set of goods (about ¾ of the applicants) and a mixed set of goods and bads (about ¼ of the applicants). The good segment has no more substantial splits. The mixed set is split in half into a set of bads and another mixed set. The bad segment has no further splitting. The mixed segment is split into a good set and yet another mixed set.

To see what is generating the splits described above, you can browse the tree or use the view info tool on the tool bar.

1. Select the view information tool.

2. Click in the desired segment to see the variable used to define the segment.

For example, select the mixed segment generated by the first split. The window shows that this segment contains applicants with debt to income ratios in excess of about 45 and those with missing values. Recall that DEBTINC had a missing value rate of more than 20%. This implies most of the applicants in this segment have missing values.

1. Select Tools → Probe Tree Rings Statistics from the menu bar.

2. Select the mixed segment generated by the first split again.

The view information tool shows that more than 64% of the applicants who have a high or missing debt to income ratio are bad. The majority of the applicants in this segment have a missing debt to income ratio. So if an applicant fails to report the information required to calculate a debt to income ratio, they are very likely a bad credit risk. In short, DEBTINC’s missing status is highly associated with the target.

This fact also explains why the decision tree model outperforms the other modeling methods. By using the Data Replacement node, the association between DEBTINC and the target was masked.
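The masking effect can be illustrated with a small sketch. It assumes missing values are replaced by the mean of the observed values (an assumption about the replacement method, made only for this example) and shows how a separate missing-value flag preserves the information that replacement hides. The data and the function name are hypothetical.

```python
def impute_with_indicator(values):
    """Replace missing values (None) with the observed mean and return a missing flag."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in values]
    missing_flag = [1 if v is None else 0 for v in values]
    return imputed, missing_flag

# Hypothetical debt to income ratios; None marks a missing value.
debtinc = [30.0, None, 40.0, None, 35.0]
imputed, m_debtinc = impute_with_indicator(debtinc)
# imputed   -> [30.0, 35.0, 40.0, 35.0, 35.0]
# m_debtinc -> [0, 1, 0, 1, 0]
```

After replacement, the two applicants with missing ratios are indistinguishable from the applicant who reported 35.0. Only the flag still separates them, and a regression or neural network fit to the replaced values alone does not have access to that distinction.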

Suppose you want to know the rules that define a particular segment in the tree ring. You can use the view information tool and work your way out from the center of the tree ring, or you can scroll the tree window and follow the decision down the tree.

The Decision Tree node’s view path feature provides a more convenient way to understand the decision rules associated with each tree ring segment.


2. Select the selection arrow tool.

3. Select View → Path from the menu bar.

4. Select the bad segment at the three o’clock position in the tree ring diagram.

The tree diagram is simplified to show only those decisions used to generate the selected segment.


The selected segment consists of applicants with a high or missing debt to income ratio and at least one delinquency. Of the 323 applicants in the training data who share these characteristics, almost 83% are bad credit risks. Similar results hold for the validation data.



Distressingly, the tree appears to have collapsed to one leaf corresponding to just the selected segment. To view the entire tree again, select the center of the tree ring diagram.


4.4 Cultivation of Tree Varieties

Objectives

• Create multiway splits in a decision tree.

• Change rules that limit the growth of a decision tree.

• Modify selected splits.

Multiway Splits

There are adjustments you can make to the default tree algorithm that will cause your tree to grow differently. These changes will not necessarily improve the classification performance of the tree, but they may improve its interpretability.

Decision trees that allow splits of each node into more than two branches can be specified in the Decision Tree node. In theory, trees using multiway splits are no more flexible or powerful than trees using binary splits. The primary goal is to increase interpretability of the final result.

1. Add another Decision Tree node to the workspace.


3. Connect the Decision Tree node to the Assessment node.


5. Select the Basic tab.

6. Type 4 in the Maximum number of branches from a node field. This option will allow binary, 3-way, and 4-way splits to be considered.


8. Run the Decision Tree node and view the results.

The number of leaves in the selected tree has increased from 7 to 12. It is a matter of personal taste as to whether this tree is more comprehensible than the binary split tree. The increased number of leaves suggests to some a lower degree of comprehensibility. Coincidentally, the validation accuracy of the tree has leaped from 88.87% to 88.97%.

If you inspect the tree diagram, there are many nodes containing only a few applicants. You can employ additional cultivation options to limit this phenomenon.


Limiting Tree Growth

Various stopping or stunting rules (also known as pre-pruning) can be used to limit the growth of a decision tree. For example, it may be deemed beneficial not to split a node with fewer than 100 cases and to require that each node have at least 25 cases.


2. Select the Basic tab.

3. Type 25 in the Minimum number of observations in a leaf field.

4. Type 100 in the Observations required for a split search field.

The Decision Tree node requires that (Observations required for a split search) ≥ 2∗(Minimum number of observations in a leaf). In this example, the observations required for a split search must be at least 2∗25=50. A node with fewer than 50 observations cannot be split into two nodes, each having at least 25 observations. The number specified (100) satisfies this condition.

5. Close and save your changes to the Decision Tree node.

6. Run the Decision Tree node and view the Decision Tree-Results window and Tree Diagram window as before.
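The relationship between the two settings entered in steps 3 and 4 can be expressed as a small check. This is a sketch of the stated rule only; the function and parameter names are hypothetical and do not correspond to options of the Decision Tree node.

```python
def can_search_for_split(n_obs, min_leaf_size=25, split_search_size=100):
    """True when a node has enough observations for a split search."""
    # The split search size must allow two leaves of the minimum size.
    if split_search_size < 2 * min_leaf_size:
        raise ValueError("split search size must be at least twice the minimum leaf size")
    return n_obs >= split_search_size

large_node = can_search_for_split(120)  # True: 120 >= 100
small_node = can_search_for_split(80)   # False: 80 < 100, so the node becomes a leaf
```

With the values used here, a node of 80 observations is not split even though it could in principle yield two leaves of 25 or more. This is how the second setting stunts growth beyond what the leaf-size rule alone would.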

The optimal tree again has 7 leaves (although the branches are different from the original tree). The validation accuracy has dropped slightly to 88.15%.

Note that the initial split on DEBTINC has produced four branches.


You may wonder which branch contains the missing values.

1. Select the Tree Diagram window.

2. Select the variable name DEBTINC directly below the root node.

The Input Selection window opens.

The table lists, by default, the top 5 inputs considered for splitting as ranked by a measure of the input’s split worth.

3. Select the variable DEBTINC.

4. Select Browse rule.

The Interval Variable Splitting Rule window opens.

The table presents the selected ranges for each of the four branches as well as the branch number that contains the missing values (in this case it is the second branch).

Close the Interval Variable Splitting Rule window, the Input Selection window, and the Decision Tree-Results window.


Modifying Selected Splits

Decision tree splits are selected on the basis of an analytic criterion. Sometimes it is necessary or desirable to select splits on the basis of a practical business criterion. For example, the best split for a particular node may be on an input that is difficult or expensive to obtain. If a competing split on an alternative input has a similar worth and is cheaper and easier to obtain, it makes sense to use the alternative input for the split at that node.

Likewise, splits may be selected that are statistically optimal but may be in conflict with an existing business practice. For example, the credit department may treat applications where debt to income ratios are not available differently from those where this information is available. You can incorporate this type of business rule into your decision tree using the Decision Tree node’s Interactive Training method.

1. Select the most recently edited Decision Tree node with the right mouse button.

2. Select Interactive…. The Decision Tree-Interactive Training window opens.



The most recently fit decision tree is displayed.

Your goal is to modify the initial split so that one branch contains all the applications with missing debt to income data and the other branch contains the rest of the applications. From this initial split you will use the decision tree’s analytic method to grow the remainder of the tree.

4. Select the Explore Rules button on the tool bar.

5. Select the root node of the tree. The Input Selection window opens listing a dozen potential splits and their statistical worths.

6. Select the DEBTINC split.

7. Select Modify rule. The Interval Variable Splitting Rule window opens, as before.

8. Select range 4.

9. Select Remove range.

10. Repeat for ranges 3 and 2.

The split is now defined to put all non-missing values of DEBTINC into node 1 and all missing values of DEBTINC into node 2.

11. Select OK to close the Interval Variable Splitting Rule window.

12. Select Apply Rule in the Input Selection window.


The input selection window closes and the tree diagram is updated as shown.

The left node contains any non-missing value of DEBTINC, and the right node contains only missing values for DEBTINC.

13. Close the Decision Tree-Interactive Training window.

14. Select Yes to save the tree as input for subsequent training.

15. Run the modified Decision Tree node and view the results.


The selected tree has seven nodes. Its validation accuracy is again 88.15%. The interpretation is extremely straightforward.

Open the Assessment node and compare the original and modified tree model.


The assessment plots are quite similar for both tree models.

The original model has a slightly higher lift and sensitivity in the second decile.

4.5 Tree Deployment

Objectives

• Select an appropriate threshold for rejecting loan applications.

• Create scoring code for the decision tree.


Tree Deployment

An appropriate threshold for rejecting loan applications can be obtained both theoretically and empirically. Both approaches require specification of misclassification costs. For the credit-scoring example, assume every two dollars loaned eventually returns three dollars. Rejecting a good loan for two dollars forgoes the expected one-dollar profit. Accepting a bad loan for two dollars forgoes the two-dollar loan itself (assuming that the default is early in the repayment period).

The theoretical approach uses the plug-in Bayes rule. It is easy to show, using simple decision theory, that the optimal threshold is given by

\[
\theta = \frac{1}{1 + \dfrac{\text{cost of false negative}}{\text{cost of false positive}}}
\]

Using the cost structure defined above, the optimal threshold is simply 1/(1+2) = 1/3. That is, reject all applications whose predicted probability of default exceeds 0.33.
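The arithmetic can be verified with exact fractions. In this sketch a false negative is an accepted bad loan (cost 2) and a false positive is a rejected good loan (cost 1), following the cost structure in the text; the function names are hypothetical.

```python
from fractions import Fraction

def optimal_threshold(cost_false_negative, cost_false_positive):
    """Threshold 1 / (1 + cost of false negative / cost of false positive); integer costs."""
    return 1 / (1 + Fraction(cost_false_negative, cost_false_positive))

def expected_profit_of_accepting(p_default, loss_if_bad=2, gain_if_good=1):
    """Expected profit from accepting a loan with the given default probability."""
    return -loss_if_bad * p_default + gain_if_good * (1 - p_default)

theta = optimal_threshold(2, 1)                    # Fraction(1, 3)
break_even = expected_profit_of_accepting(theta)   # exactly 0 at the threshold
```

At a default probability of exactly 1/3, accepting the loan has zero expected profit. Above that value the expected profit is negative, which is why applications above the threshold are rejected.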

You can obtain the same result using the Assessment node. As a bonus, you can estimate thefraction of loan applications you must reject when using the selected threshold.

1. Select the original binary split decision tree model.

2. Select Tools → Lift Chart from the menu bar.

3. Select Define… to define a profit vector.

4. Type –2 for Target Level 1.

5. Type 1 for Target Level 0.

6. Type 0 for Target Level COST.

The profit vectors for credit screening and direct marketing models are fundamentally different.

• In the direct marketing framework, a target value of 1 implies a purchase and thus a positive profit. A target value of 0 implies no purchase and hence no profit. Each solicitation has a fixed and substantial cost.

• For credit screening, a target value of 1 implies a default and hence a loss. A target value of 0 implies a repaid loan and hence a profit. The fixed cost of processing each loan application is insubstantial and taken to be zero.

7. Close the Profit Vector definition window.

8. Select Apply.


10. Select the Non-Cumulative radio button.


The plot shows the expected profit for each decile of loan applications as ranked by the decision tree model. Both the first and second deciles have a negative expected profit. Therefore, it makes sense to reject the top 20% of loan applications. In that way, you are only accepting loans whose predicted performance will result in a positive profit.
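The non-cumulative profit calculation can be sketched outside the Assessment node. The Python example below is illustrative only: the function name profit_by_decile and the scored applications are made up, and the profit values follow the profit vector defined earlier (-2 for a default, 1 for a repaid loan).

```python
def profit_by_decile(scored, profit_default=-2.0, profit_repaid=1.0):
    """Average profit per application within each decile.

    scored: list of (predicted_default_probability, defaulted) pairs,
            where defaulted is 1 for a default and 0 for a repaid loan.
    Decile 1 holds the applications with the highest predicted probability.
    Assumes at least 10 applications so that no decile is empty.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    averages = []
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        total = sum(profit_default if defaulted else profit_repaid
                    for _, defaulted in chunk)
        averages.append(total / len(chunk))
    return averages


# Hypothetical scored applications (made up for illustration only).
scored = [(0.90, 1), (0.85, 1), (0.70, 1), (0.65, 0), (0.40, 0), (0.35, 0),
          (0.30, 0), (0.28, 0), (0.25, 0), (0.22, 0), (0.20, 0), (0.18, 0),
          (0.15, 0), (0.12, 0), (0.10, 0), (0.08, 0), (0.06, 0), (0.05, 0),
          (0.03, 0), (0.02, 0)]
profits = profit_by_decile(scored)
reject = [d + 1 for d, p in enumerate(profits) if p < 0]
print(profits)
print(reject)
```

In this made-up data only the first two deciles have a negative average profit, which mirrors the reasoning above for rejecting the top 20% of applications.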

1. Set the horizontal scale so 2 percent of the observations are grouped as a data point.

2. Type vt tmplift in the command line and press return.

Using this finer scale, you can read the cutoff associated with the 20th percentile. The appropriate threshold for scoring is approximately 0.31. Because of the stepwise nature of the tree classifier, this cutoff will result in an actual rejection rate slightly higher than 20%.

3. Close the TMPLIFT table and the Assessment node.

4. Attach a Score node to the original binary split tree.


6. Double-click on the decision tree model.


The scoring recipe for the decision tree appears.

The scoring recipe for a decision tree is a nested sequence of if-then statements. You can couch this recipe in a DATA step and deploy it on any system with base SAS software.
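To show the shape of such a recipe, here is a hedged sketch written in Python rather than SAS. The split points and leaf probabilities are invented for illustration and are not taken from the tree in these notes; only the input names DEBTINC and DELINQ and the 0.31 cutoff come from the text.

```python
def score_default_probability(debtinc, delinq):
    """Hypothetical nested if-then scoring recipe for a small tree.

    None stands for a missing value. All split points and leaf
    probabilities below are made up for illustration.
    """
    if debtinc is None:
        return 0.62
    if debtinc >= 45.0:
        return 0.90
    if delinq is not None and delinq >= 2:
        return 0.45
    return 0.07


def decide(debtinc, delinq, threshold=0.31):
    """Reject when the predicted default probability exceeds the threshold."""
    p = score_default_probability(debtinc, delinq)
    return "reject" if p > threshold else "accept"


print(decide(debtinc=None, delinq=0))
print(decide(debtinc=30.0, delinq=0))
```

Because a tree assigns the same probability to every case in a leaf, the scoring function returns only a handful of distinct values, which is the stepwise behavior noted above.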

Chapter 5 Cluster Analysis


5.2 Pre-clustering Transformations

5.3 Assessing Clusters Visually

5.4 K-means Clustering

5.5 Visualizing Cluster Separation


5.1 Problem Formulation


A clothing manufacturer seeks a method for allocating product assortments to 689 retail stores. The analysis goal is to partition these stores into a limited number of clusters based on sales patterns across four jean styles: Original, Fashion, Leisure, and Stretch.

The DUNGAREE data set consists of sales totals in the four categories for each of 689 stores. The stores will be grouped using the Clustering node.


5.2 Pre-clustering Transformations

Objectives

• Create new analysis variables using the Transform Variables node.


Pre-clustering Transformations

Assemble the following diagram.

An Input Data Source node connects to a Transform Variables node. The Transform Variables node connects to both an Insight node and a Clustering node.


2. Select the DUNGAREE data set.

3. Set the Model Role for STOREID to id.


One approach to clustering sales data of this type is to consider product assortment and sales volume independently. First, variables containing the relative proportions of product sales for each store are defined. Next, a variable specifying total sales for each store is created. The former specify the product assortment, and the latter specifies the sales volume. Clusters are then created on the derived rather than the original variables.

The Clustering node of the Enterprise Miner uses the k-means method. Many cluster methods, including k-means, use the distance between cases (stores, in this example) in the space spanned by the input variables (sales values, in this example).

Distance-based methods are very sensitive to the scale of the inputs. Appropriate transformations of the data often give better results.
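The effect of scale on distance can be seen in a small sketch. The Python example below is illustrative only: the three stores and their values are made up, and standardizing with the sample standard deviation is an assumption about one reasonable transformation, not a description of what any particular node does internally.

```python
import math
import statistics


def euclidean(a, b):
    """Euclidean distance between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Subtract each column's mean and divide by its sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, sds))
            for row in rows]


# Hypothetical stores: (total sales, log ratio). Values are made up.
stores = [(1000.0, 0.1), (1010.0, 2.0), (1500.0, 0.1)]

raw_01 = euclidean(stores[0], stores[1])  # differs mainly in the ratio
raw_02 = euclidean(stores[0], stores[2])  # differs only in total sales

z = standardize(stores)
z_01 = euclidean(z[0], z[1])
z_02 = euclidean(z[0], z[2])

print(raw_01, raw_02)
print(z_01, z_02)
```

On the raw scale the total-sales difference dominates the distance; after standardization the two differences contribute on a comparable scale.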


Create the new variables for clustering.

1. Open the Transform Variables node.

2. Select Actions → Create Variable.

3. Type FA_RATIO in the Name field.

4. Select Define.

5. Type the formula LOG(FASHION/ORIGINAL) in the formula definition area.

6. Select OK → OK to return to the variables list.

For each store, FA_RATIO expresses sales of fashion jeans relative to sales of original jeans. In this example, the log ratio transformation was thought to be a more appropriate scale for analysis (Aitchison, J. 1986, The Statistical Analysis of Compositional Data, Chapman and Hall).

Repeat the steps above to create three more variables.

• LE_RATIO = LOG(LEISURE/ORIGINAL)

• ST_RATIO = LOG(STRETCH/ORIGINAL)

• SALESTOT = FASHION+LEISURE+STRETCH+ORIGINAL

Close and save changes to the Transform Variables node.
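The four derived variables can be written out as a short calculation. This is a minimal Python sketch for illustration (the function name derive_cluster_inputs and the sales figures are assumptions); math.log is the natural logarithm, as is the SAS LOG function.

```python
import math


def derive_cluster_inputs(original, fashion, leisure, stretch):
    """Return the four derived clustering variables for one store."""
    return {
        "FA_RATIO": math.log(fashion / original),
        "LE_RATIO": math.log(leisure / original),
        "ST_RATIO": math.log(stretch / original),
        "SALESTOT": fashion + leisure + stretch + original,
    }


# Hypothetical store sales (made-up values).
store = derive_cluster_inputs(original=2000, fashion=1000, leisure=2000, stretch=500)
print(store)
```

A log ratio of zero means the style sells as much as original jeans; negative values mean it sells less. Note that the log ratio is undefined for a store with zero sales in either category, so such stores would need separate handling.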


5.3 Assessing Clusters Visually

Objectives

• Visualize high-dimensional spaces using principal components analysis in SAS/INSIGHT software.

• Obtain an estimate of the number of clusters in the data.


Assessing Clusters Visually

Clustering in Enterprise Miner requires an initial estimate of the appropriate number of clusters. When the number of variables to be clustered is small, you can estimate this number using the three-dimensional data visualization tools found in SAS/INSIGHT.

Visualize the data generated by the Transform Variables node.

1. Run the diagram from the Insight node and view the results.

2. Select Analyze → Rotating Plot (Z Y X).

3. Select SALESTOT and ST_RATIO → Z.

4. Select LE_RATIO → Y.

5. Select FA_RATIO → X.

6. Select OK.

Plots of SALESTOT versus LE_RATIO and FA_RATIO as well as ST_RATIO versus LE_RATIO and FA_RATIO appear.

Use the Rotating Plot’s tool palette to examine the associations among the variables. The plot with SALESTOT shows three well-separated clusters. The plot with FA_RATIO shows one well-separated cluster and possibly two additional clusters. In all, there appear to be three or possibly four clusters.

Distinguishing clusters when there are three or four variables under consideration is a relatively simple task using the Rotating Plot. When more variables are present, meaningful visualization can be achieved by optimally projecting the high-dimensional space formed by the original variables into a single three-dimensional plot whose axes are defined by some combination of the original variables. The principal components analysis tool of SAS/INSIGHT provides one type of optimal projection. The axes used in a principal components plot are linear combinations of the original variables selected to minimize the amount of information lost due to projection of the data.


View the four variables in this example as a single three-dimensional plot using principal components analysis.

1. Select Analyze → Multivariate (Y’s).

2. Select SALESTOT, ST_RATIO, LE_RATIO, and FA_RATIO → Y.

3. Select Output → Principal Component Analyses.

4. Select Principal Components Options → First 3 Components.

5. Deselect First 2 Components.

6. Select OK → OK → OK.

A plot of the data using the first three principal components appears, showing three well-separated clusters.

The plot is extremely informative because (in a particular sense) it gives the best three-dimensional representation of the four dimensions found in the underlying data.


5.4 K-means Clustering

Objectives

• Cluster data using the Clustering node.


K-means Clustering

The method used by the Clustering node is based on the k-means clustering algorithm. The number of clusters must be specified. Initial seeds for each cluster are chosen (optimally, assuming the clusters are well separated). Cases are assigned to their closest seed, and the seed is then moved to the center (mean) of the cluster. The cases are reassigned and the process is repeated (usually once).
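The assign-and-update cycle described above can be sketched in a few lines. This Python example is a generic, textbook-style k-means loop, not the Clustering node's actual implementation; in particular, the seeds are simply supplied by the caller here, and the function names and data points are assumptions.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def k_means(points, seeds, max_iter=10):
    """Basic k-means: assign cases to the nearest seed, move seeds to cluster means."""
    centers = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(p, centers) for p in points]
        if new_assignment == assignment:
            break  # no case changed cluster, so the means will not move
        assignment = new_assignment
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old seed if a cluster ends up empty
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers


# Hypothetical two-dimensional cases and seeds (made up for illustration).
points = [(0, 0), (0, 1), (5, 5), (6, 5)]
assignment, centers = k_means(points, seeds=[(0, 0), (5, 5)])
print(assignment)
print(centers)
```

With well-separated groups and sensible seeds, the assignments settle after very few passes, which is consistent with the remark that the process is usually repeated only once.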

The preliminary analysis suggests three clusters are present in the data. To assign stores to each of the three clusters, use the Clustering tool.

1. Open the Clustering node.

2. Select the variables STOREID, FASHION, LEISURE, STRETCH, and ORIGINAL.

3. Click in the Status column for one of the selected variables using the right menu button.

4. Select Set Status → don’t use.

The SALESTOT variable is on a different scale from the ratio variables. You can correct for this by standardizing the variables and clustering on the standardized values.

5. Select the Std dev radio button.

6. Select the Clusters tab.

7. Type 3 in the Maximum number of clusters field.

8. Close and save changes to the Clustering node.

9. Run the diagram from the Clustering node and view the results.

The pie chart diagram shows three clusters were formed. Cluster 3, as indicated by the slice size, has the greatest variability. Cluster 1 has the most members as indicated by the slice height. Cluster 3 has the largest radius as indicated by the slice color.


5.5 Visualizing Cluster Separation

Objectives

• Interpret the contents of the created clusters.


Visualizing Cluster Separation

After clusters have been assigned, you can use another Insight node to visualize the actual assignment of observations to clusters.

Drag an Insight node onto the workspace and attach it to the Clustering node.

The diagram should appear as shown.

Run the diagram from the Insight node and view the results. Notice that two new variables, _SEGMNT_ and DISTANCE, have been appended to the DUNGAREE data. The first indicates the cluster number; the second indicates the distance from the cluster mean.

Use the value of _SEGMNT_ to color code stores by cluster.

1. Click on the word Int at the top of the _SEGMNT_ column with the right mouse button and select Nominal.

2. Select Edit → Windows → Tools.

3. Select the shaded bar in the colors palette → _SEGMNT_ → OK.

Use the principal components analysis tools of SAS/INSIGHT to visualize the data.

1. Select Analyze → Multivariate (Y’s).

2. Select SALESTOT, ST_RATIO, LE_RATIO, and FA_RATIO → Y.

3. Select Output → Principal Component Analyses.

4. Select Principal Components Options → First 3 Components.

5. Deselect First 2 Components.

6. Select OK → OK → OK.

The colors red, purple, and orange correspond to clusters 1, 2, and 3, respectively.


It is common practice to interpret the results of a cluster analysis. A comparison of box plots for each cluster/variable combination aids in this interpretation.

7. Double-click on _SEGMNT_ in the INSIGHT data table.

8. Select the Character radio button.

9. Select OK → OK.

10. Control-click the _SEGMNT_ column header to deselect the column.

11. Select Analyze → Box Plot/Mosaic Plot (Y).

12. Select SALESTOT, ST_RATIO, LE_RATIO, and FA_RATIO → Y.

13. Select _SEGMNT_ → X.

14. Select OK.

The box plots show that all clusters differ by SALESTOT. An exceptionally low ratio of stretch to original jeans further distinguishes cluster 2 from the remainder. Similarly, an exceptionally low ratio of fashion jeans to original jeans distinguishes cluster 3 from the remainder. Finally, there appears to be little variation between the clusters in the ratio of leisure jeans to original jeans.

Chapter 6 Missing Value Imputation

6.1 Introduction

6.2 Missing Indicators and Simple Data Replacement

6.3 Cluster Mean Imputation



6.1 Introduction

Data sets often contain observations that have missing values for one or more variables. Missing values occur for a wide variety of reasons and, as you have already seen, can severely affect analysis results. By default, if an observation contains a missing value for a variable, then that observation is not used for modeling by the Variable Selection, Neural Network, or Regression nodes. You may decide to exclude incomplete observations, but that may lead you to ignore useful information from the variables that have non-missing values. It may also bias the sample because inputs that have missing values may be strongly related to the target.

This chapter examines several approaches to the imputation of missing values using the Data Replacement and Clustering tools.


6.2 Missing Indicators and Simple Data Replacement

Objectives

• Create an indicator variable for missing values in model inputs.

• Impute missing values with means, medians, and modes.

• Code missing values with the Data Replacement node.


Missing Value Indicators

Return to the Decision Tree diagram you assembled in Chapter 4.

The decision tree models suggest a strong association between missingness in the DEBTINC input and default in loan repayment. This association was masked for the regression and neural models because of the simple missing value imputation. A slight modification to these models will result in a large improvement in predictive performance.

Modify the diagram by adding a Transform Variables node (Modify node group) between the Data Partition node and the Data Replacement node. Delete the connection between the Data Partition node and the Data Replacement node.



Open the Transform Variables node. You will use this node to create a new variable whose value is 1 when DEBTINC is missing and 0 otherwise. Such a variable is called an indicator variable because it indicates the presence of a particular condition.

1. Select Actions → Create Variable. The Computed Column window opens.

2. Type DIMISS in the Name field.

3. Select Define….

The Customize window opens. The top part of the window has a selectable list of variable names. Selecting the name will result in the variable appearing in the function definition area at the bottom of the window.

4. Type (DEBTINC=.) in the function definition area. DIMISS equals 1 when DEBTINC is missing, and 0 otherwise.

5. Select OK to close the Customize window and OK again to close the Computed Column window.

After a brief pause, the variable DIMISS will appear in the variable list along with statistical summary information. Note that the summary information is calculated using the metadata sample and will vary from run to run.

6. Close the Transform Variables node.
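The logic of the indicator variable is simple enough to restate in code. The sketch below uses Python for illustration, with None standing in for a SAS missing value; the function name and the sample DEBTINC values are assumptions.

```python
def missing_indicator(value):
    """Return 1 when value is missing (None here), 0 otherwise."""
    return 1 if value is None else 0


# Hypothetical DEBTINC values; None stands in for a SAS missing value (.).
debtinc = [34.8, None, 41.2, None]
dimiss = [missing_indicator(v) for v in debtinc]
print(dimiss)  # [0, 1, 0, 1]
```

Because the indicator is computed before any imputation, it preserves the information that a value was missing even after the missing value itself has been replaced.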



As anticipated, both the regression and the neural network models have improved. The cumulative gain of the standard regression model is about the same as the decision tree model and that of the neural network model is slightly higher.


Data Replacement

Open the Data Replacement node.

Both the regression and neural network models use the Data Replacement node to impute missing values. The default imputation method for interval variables is to fill in the missing values with the mean of the nonmissing values. The default imputation method for binary and nominal variables is to fill in missing values with the most frequent category, the mode. For example, cases with missing values of REASON are assigned to the debt consolidation class.
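Simple replacement of this kind can be sketched as follows. The Python example is illustrative only: the function name impute, the sample values, and the category labels are assumptions, and None again stands in for a missing value.

```python
import statistics


def impute(values, method):
    """Replace None entries with the mean, median, or mode of the nonmissing values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode}[method](observed)
    return [fill if v is None else v for v in values]


# Hypothetical interval input: missing value replaced by the mean.
loan = impute([10.0, None, 20.0, 30.0], "mean")

# Hypothetical class input: missing value replaced by the most frequent level.
reason = impute(["DebtCon", None, "HomeImp", "DebtCon"], "mode")

print(loan)
print(reason)
```

Every missing case receives the same fill value, which is why this approach can hide any relationship between missingness and the target.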

It is important to realize that, by default, the Data Replacement node uses a simple random sample of size 2000 to calculate the imputed values. This can lead to some small variation in your modeling results. To use the entire training data set, select the corresponding radio button in the Data Replacement setup window.


Interval Variables

Select the Interval Variables tab.

You can choose to replace missing values for interval variables with values besides the mean of the non-missing values. If a variable has a highly skewed distribution, it may make sense to use the median for replacement instead of the mean. For example, change the imputation method for the variable DELINQ to median.

1. Click in the Imputation Method column for DELINQ using the right mouse button.

2. Select Select Method → Median.

All missing values for DELINQ will be replaced by the median of the non-missing values.


Select the Class Variables tab.

You can also choose to replace missing values for categorical variables with values besides the mode of the non-missing values. You may want to treat missing values in class variables as separate levels. The coding of missing values causes methods like regressions and neural networks to behave like decision trees. For example, missing values for REASON may result from the loan application accommodating reasons other than debt consolidation or home improvement.

1. Click in the Imputation Method column for REASON using the right mouse button.

2. Select Select Method… → User specify….

3. Select the Other Value radio button.

4. Type Unknown in the Other Value field.

5. Select OK. The class Unknown will replace all missing values for REASON.

6. Close the Data Replacement node.

7. Run the diagram from the Data Replacement node and view the results.

Verify that the replacement has occurred as you specified by scrolling the data table or selecting the Interval and Class Variables tabs.


6.3 Cluster Mean Imputation

Objectives

• Impute missing values using the Clustering node.


Cluster Mean Imputation

Cluster-mean imputation is an alternative to mean imputation for interval variables. Instead of filling in all the missing values with the same value (the overall mean), the mean within homogeneous groups is used. Cluster analysis is used to partition the cases into groups (donor classes). A case with a missing value is assigned to its nearest cluster based on its nonmissing values. The missing value is then filled in with the mean of that cluster.
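The idea can be sketched for a single case as follows. This Python example is a simplified illustration and not the Clustering node's code: the cluster means are assumed to be already known, distances are computed over the nonmissing inputs only, and the function name and values are made up.

```python
import math


def impute_from_nearest_cluster(case, cluster_means):
    """Fill None entries of case with the means of its nearest cluster.

    Distance to each cluster mean is computed over the nonmissing inputs only.
    """
    observed = [i for i, v in enumerate(case) if v is not None]

    def distance(center):
        return math.sqrt(sum((case[i] - center[i]) ** 2 for i in observed))

    nearest = min(cluster_means, key=distance)
    return [nearest[i] if v is None else v for i, v in enumerate(case)]


# Hypothetical cluster means for three interval inputs (made-up values).
cluster_means = [(1.0, 10.0, 100.0), (5.0, 50.0, 500.0)]

# The second input is missing; the case is closest to the second cluster.
filled = impute_from_nearest_cluster([4.5, None, 480.0], cluster_means)
print(filled)  # [4.5, 50.0, 480.0]
```

In practice the inputs would usually be standardized first so that no single input dominates the distance.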

Add a Clustering node to the diagram between the Transform Variables node and the Data Replacement node.

1. Open the Clustering node.

2. Select the Std Dev radio button.

Distance-based clustering methods are sensitive to the scale of the data. Standardization (subtract the mean, divide by the standard deviation) is often used to put the inputs on a roughly equivalent scale.

3. Select the Clusters tab.

4. Type 4 in the Maximum number of clusters field.

5. Select the Missing Values tab.

6. Select the Imputation check box.

7. Change the Method field to Mean of Nearest Cluster.

8. Close and save changes to the Clustering node.


Cluster-mean imputation is done in the Clustering node, so imputation of the interval variables in the Data Replacement node must be disabled. The Data Replacement node is still needed to impute missing values for the nominal and binary variables.

1. Open the Data Replacement node.

2. Select the None radio button for Interval Variables on the Default Method tab.

3. Close and save changes to the Data Replacement node.

4. Run the diagram from the Data Replacement node and view the results.

5. Scroll the table to the extreme right.

Three new variables have been appended to the table: _SEGMNT_, _IMPUTE_, and DISTANCE. _SEGMNT_ indicates the cluster to which the case belongs. The variable _IMPUTE_ is the count of the number of missing values for each case. This is often a very good predictor of the target, as missingness is often related to the event of interest. Consequently, this new input is used in the regression analysis. The variable DISTANCE is the distance from the nearest cluster seed. Its use as a predictor of the target event is questionable.

1. Close the Data Replacement Results window.

2. Open the Regression node.

3. Scroll to the end of the variable list. By default the derived cluster variables have a model role of rejected.

4. Change the Status of _IMPUTE_ to use.

5. Close and save changes to the Regression node.

6. Repeat the previous two steps for the Neural Network node, if desired.


8. Browse the modeling results as desired.


Scoring Code

1. Attach a Score node to the Neural Network node.


3. Double-click the neural network scoring recipe.

The scoring code is a complete recipe for scoring data from a data set of similar format to HMEQ. First, the variable DIMISS is created. Next comes code to assign observations to a cluster and impute the missing values for interval variables. This is followed by code to impute missing values for class variables. Finally, all the code to score with the neural network is given. Only base SAS is required to deploy these results.

Chapter 7 Associations

7.1 Introduction


7.3 Support, Confidence, and Lift

7.4 Dissociation



7.1 Introduction

This chapter discusses the Association node for market-basket analysis. In addition, this chapter demonstrates the process of cloning nodes for customized data preparation tasks.



A bank seeks to examine its customer base and understand which of its products the same customer owns. It has chosen to conduct a market-basket analysis of a sample of its customer base.

The BNKSERV data set lists the banking products/services used by 7,991 customers. Thirteenpossible services are represented:

ATM automated teller machine debit card

AUTO automobile installment loan

CCRD credit card

CD certificate of deposit

CKCRD check/debit card

CKING checking account

HMEQLC home equity line of credit

IRA individual retirement account

MMDA money market deposit account

MTG mortgage

PLOAN personal/consumer installment loan

SVG savings account

TRUST personal trust account.

There are 24,375 rows in the data set. Each row of the data set represents a customer-service combination. The median number of services per customer is three.


7.3 Support, Confidence, and Lift

Objectives

• Create a market-basket analysis for the banking data.

• Understand the output from the Association node.


Support, Confidence, and Lift

Construct the following diagram.

An Input Data Source node is connected to an Association node.


2. Select the BNKSERV data set.

3. Set Model Role for ACCT to id and for SERVICE to target.


5. Run the diagram from the Association node and view the results.


The Association node evaluates association rules to quantify the affinity among items. The first 12 rows of the output contain the frequency with which each service is used by the 7,991 customers. For example, 86% of the customers have checking accounts.

Each row represents an association rule among services. By default, the listed rules are those that involve at most four items, have a confidence above 10%, and involve items that occur with more than 5% of the customers. By default, the association rules are sorted by support within number-of-item group. One-item rules are listed first, then two-item rules, three-item rules, and so on.

Click on the Support(%) column with the right mouse button and select Sort → Descending.

The support is the percentage of customers that have all the services involved in the rule. For example, 54% of the 7,991 customers have a checking and savings account and 25% have a checking account, savings account, and an ATM card.

Click on the Confidence(%) column with the right mouse button and select Sort → Descending.

The confidence represents the percentage of customers who have the right-hand-side (RHS) item among those who have the left-hand-side (LHS) item. For example, all customers who have a check card also have a checking account, and 21.88% of those with a checking account and a CD have a money market account.

Lift, in the context of association rules, is the ratio of the confidence of a rule to the confidence expected if the RHS were independent of the LHS. Consequently, lift is a measure of association between the LHS and RHS of the rule. Values greater than one represent positive correlation between the LHS and RHS. Values equal to one represent independence. Values less than one represent negative correlation between the LHS and RHS.
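The three statistics can be computed directly from a list of baskets. The Python sketch below is illustrative only: the function name rule_statistics and the five baskets are made up, and support and confidence are returned as fractions rather than percentages.

```python
def rule_statistics(baskets, lhs, rhs):
    """Support, confidence, and lift for the rule LHS ==> RHS."""
    lhs, rhs = set(lhs), set(rhs)
    n = len(baskets)
    n_lhs = sum(1 for b in baskets if lhs <= b)
    n_rhs = sum(1 for b in baskets if rhs <= b)
    n_both = sum(1 for b in baskets if (lhs | rhs) <= b)
    support = n_both / n
    confidence = n_both / n_lhs
    # Under independence, the confidence would equal the overall RHS frequency.
    lift = confidence / (n_rhs / n)
    return support, confidence, lift


# Hypothetical customers and the services each one holds (made-up data).
baskets = [{"CKING", "SVG"}, {"CKING", "SVG", "ATM"}, {"CKING"},
           {"SVG"}, {"CKING", "ATM"}]

support, confidence, lift = rule_statistics(baskets, lhs={"CKING"}, rhs={"SVG"})
print(support, confidence, lift)
```

In this made-up data the lift is below one, so having a checking account is associated with a slightly lower chance of having a savings account than in the customer base overall. The function assumes that at least one basket contains the LHS.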

Close and save changes to the Association node.


7.4 Dissociation

Objectives

• Extend the capabilities of the Association node to include dissociation rules.

• Create a new node on the tools palette.


Creating Dissociations

A dissociation rule is a rule involving the negation of some item. For example, the LHS may be not having a checking account and the RHS might be an auto loan. Dissociation rules may be particularly interesting when the items involved are highly prevalent. The Association node will include dissociation rules if the data is modified to include the negation of selected items. The SAS Code node can be used for such data modification.

Augment the data with services not present in each account.

1. Disconnect the Input Data Source and the Association node.

2. Drag a SAS Code node onto the workspace and connect it between the Input Data Source and the Association node.


3. Open the SAS Code node.

4. Select the Macros tab. Observe that the name of the training data set is (&_TRAIN).

5. Select the Export tab.

6. Select Add → TRAIN. Note that the name of the data set is (&_TRA).

7. Deselect Pass imported datasets to successors.

8. Select the Program tab.

9. Select File → Import file → Dissociations.sas.


10. Modify the first five lines of the imported program as shown.

%let id=ACCT;

%let target=SERVICE;

%let values='SVG','CKING','MMDA';

%let in=&_TRAIN;

%let out=&_TRA;

The first two lines identify the id and target variables, respectively. The third line identifies the values of the target for which negations are created. The values must be enclosed in quotes and separated by commas. The final two lines provide generic macro names for the training data and the augmented (exported) data.

This SAS program scans each id (ACCT) to see if the items (services) specified in the values are present. If not, the data is augmented with the negated items.
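The augmentation step can be sketched independently of the imported program. The Python example below is not a translation of Dissociations.sas; it only illustrates the idea described above. The function name add_negations, the NOT_ prefix used to label negated items, and the sample rows are all assumptions.

```python
def add_negations(rows, values, prefix="NOT_"):
    """Append a negated item for each listed value that an id does not have.

    rows: list of (id, item) pairs, one per customer-service combination.
    values: the items for which negations are created.
    """
    items_by_id = {}
    for acct, item in rows:
        items_by_id.setdefault(acct, set()).add(item)
    augmented = list(rows)
    for acct, items in items_by_id.items():
        for value in values:
            if value not in items:
                augmented.append((acct, prefix + value))
    return augmented


# Hypothetical customer-service rows (made-up data).
rows = [(1, "CKING"), (1, "SVG"), (2, "ATM")]
augmented = add_negations(rows, values=["SVG", "CKING", "MMDA"])
print(augmented)
```

Each negated item is then treated like any other item by a market-basket analysis, which is how rules involving the absence of a service can appear in the results.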

1. Close the SAS Code node.

2. Run the diagram from the SAS Code node but do not view the results.

3. Open the Association node.

4. Select the Data tab.

5. Select Properties… and then select the Table View tab. The listing is of the augmented data that was exported from the SAS Code node.

6. Close the Association node.

7. Run the Association node and view the results.

The results now list association and dissociation rules. For example, among customers with a checking account and an ATM card but without a money market account, 24% have a home-equity line of credit (rule 436).

Close the Association node and save the results.


Node Cloning

You can add custom nodes to the tools palette for tasks such as the data modifications needed for dissociation rules.

Clone the SAS Code node and add it to the Node types palette.

1. Select the SAS Code node.

2. Select Actions → Node type manager… → Create custom node type….

3. Type Create Dissociations in the Description field.

4. Select the right arrow next to the Image field.

5. Select an appropriate icon from the palette.

6. Select Close → OK → Close.

A new tool appears at the bottom of the Node types palette with the icon you selected.

The cloned tool can be used in the diagram in place of the SAS Code node.

Note that this cloned node has variable and level names that are specific to the BNKSERV data set. One may prefer to clone the node prior to modifying the program. The cloned tool is saved in the project library. Consequently, every diagram created within the project will have the Create Dissociations node available for use.

Appendices

A.1 Glossary

A.2 References


A.1 Glossary

activation function

in the language of neural networks, a mathematical transformation of the net input to yield the output of a neuron.

architecture

a statistical model, in the language of neural networks.

adaptation

the process of estimation and model-fitting, in the language of neural networks.

assessment

determining how well a model computes good outputs from input data not used during training. Assessment statistics are automatically computed when you train a model with a modeling node. By default, assessment statistics are calculated from the validation data set. You can choose to assess the adequacy of a trained model(s) with a test data set. You can compare models using either the Model Manager of a modeling node or the Assessment node.

Assessment Graph (Decision Tree)

a graph in the Tree Browser that plots the utility values from the Assessment Table. The red, or lighter, symbols represent the validation data; the blue, or darker, symbols represent the training data. The vertical reference line corresponds to the tree partition highlighted in the Assessment Table.

Assessment Table (Decision Tree)

a table in the Tree Browser that provides a measure of how well the tree describes the data. For a nominal target, the default measure is the proportion of observations correctly classified. For an interval target, the default measure is the average sum of squared differences of an observation from its predicted value. The table displays the assessment for several candidate partitions of the data. In the Assessment Table, one partition is highlighted and the summary statistics for this partition are displayed in the Summary Table in the Tree Browser.

back propagation

in the language of neural networks, the computation of derivatives for a multilayer perceptron.

binary variable

a variable that contains two discrete values (for example, PURCHASE: Yes and No). The binary measurement level is automatically assigned to binary variables in the Input Data Source node.


bonus variable

a variable that is used by the Decision Tree node for model assessment. It assigns a value to each observation for each nominal target value or each decision alternative. For a particular leaf, the assessment of the nominal value (or decision alternative) is typically the average of the bonus variable among observations in that leaf. You set bonus variables in the Input Data Source node.

branches

subtrees rooted in one of the initial divisions of a segment of a tree. For example, if a rule splits a segment into seven subsets, then seven branches grow from the segment.

case

a collection of information about one of numerous entities represented in a data set. In SAS System terminology, a case is an observation in the data set.

character variable

a variable whose values can consist of alphabetic and special characters as well as numeric characters.

clustering

the process of dividing a data set into mutually exclusive groups such that the observations for each group are as close as possible to one another, and different groups are as far as possible from one another.

cluster sampling

the process of selecting a sample of groups or clusters from a population. The sample contains all the members of each selected group or cluster. Cluster sampling is useful for transaction data or household data, for example.

combination function

in the language of neural networks, a function that is applied to both inputs and hidden layers that computes the net input to a hidden or output neuron.

database

See SAS data set.

data mining data base (DMDB)

a SAS data set that is designed to optimize the performance of the modeling nodes. TheDMDB enhances performance by reducing the number of passes that the analytical engineneeds to make through the data. It contains a meta catalog with summary statistics fornumeric variables and factor-level information for categorical variables. All nodes thatrequire a DMDB create one when you run the node. The DMDB node is available if youwant to create a DMDB on demand. You may want to create a DMDB with the DMDB nodeto establish a visual reference for the DMDB in the process flow diagram. The DMDB nodealso provides the only mechanism for browsing the statistics of a DMDB.

data library

a library where the data for a project are stored. The data library should be used only for that one project. For example, do not use the same data library for another project, either as a project library or as a data library. Do not keep external files in this library.

data library path

the directory path for SAS data sets and SAS data views stored in the data library on both the client and the server. When the project is run remotely, the data library is set up to use the server path, and when it is run locally, it is set up to use the client path.

deciles

a division of data into tenths after the data have been sorted by the values of one or more variables. Deciles are usually cumulative, such that the first decile contains the top 10% of the data, the second decile contains the top 20% of the data, and so on.
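
As a rough illustration of cumulative deciles, the following Python sketch sorts a set of hypothetical scores in descending order and returns the top 10%, top 20%, and so on. The function name and the data are assumptions made for this example; they are not part of Enterprise Miner.

```python
def cumulative_deciles(values):
    """Return ten lists; list k holds the top (k + 1) * 10% of the sorted values."""
    ordered = sorted(values, reverse=True)
    n = len(ordered)
    # Integer arithmetic avoids floating-point rounding at the cut points.
    return [ordered[: (k * n) // 10] for k in range(1, 11)]


scores = list(range(1, 21))        # 20 hypothetical scores: 1..20
deciles = cumulative_deciles(scores)
print(deciles[0])                  # the first decile (top 10% of the data)
print(len(deciles[1]))             # number of cases in the top 20%
```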

dependent variable

See target variable.

depth

the number of successive hierarchical partitions of the data in a tree. The initial, undivided segment is at depth 0. Specify a depth value to control how much of the tree to display in a Tree Diagram. See also levels.

diagram

an Enterprise Miner process flow that you interactively create in the Enterprise Miner Workspace window. A diagram is stored in a project. You can create multiple diagrams per project.

diagram nodes

graphical regions of the Tree diagram and the Neural Network diagram that contain information. For the Tree diagram, a diagram-node displays one of three types of information: segment statistics, the names of the variables used to split the segments, or the variable values. For the Neural Network diagram, diagram-nodes represent inputs, hidden layers, and targets.

Enterprise Miner Administrator

an administrative interface that enables you to configure a client/server session and define custom default node options.

error function

a function that measures how well a neural network or other model fits the training data. The error function is also known as a Lyapunov function or an estimation criterion.

estimation criterion

See error function.

format

a pattern that the SAS System uses to determine how a variable value should be displayed. The SAS System provides a set of standard formats and also enables you to define your own custom formats.

freq variable

a variable that represents frequency of occurrence for other values in each observation. You set the Freq variable role in the Input Data Source node.

generalization

to compute accurate outputs using input data that was not used during training.

hidden layer

a layer between input and output in a neural network where one or more activation functions are applied, typically to introduce nonlinearity.

id variable

an indicator variable. The Associations node requires an Id variable for association discovery. You set the Id variable role in the Input Data Source node.

imputation

computing replacement values for missing input values. Imputation can be done with the Data Replacement node.
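
One simple form of imputation replaces each missing value with the mean of the observed values. The Python sketch below assumes that missing values are represented by None; the function name and data are hypothetical, and the sketch is not a description of how the Data Replacement node works internally.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


income = [40.0, None, 60.0, 50.0, None]   # hypothetical input with missing values
imputed = impute_mean(income)
print(imputed)
```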

independent variable

See input.

informat

a pattern that the SAS System uses to determine how values entered in variable fields should be interpreted. The SAS System provides a set of standard informats and also enables you to define your own custom informats.

input

a variable that is used to predict the value of the target variable(s). You set the input variable role in the Input Data Source node.

internal nodes

segments of a tree that have been further segmented.

interval variable

a continuous variable that contains values across a range (for example, TEMP: 0, 32, 34, 36, 43.5, 44, 56, 80, 90, 99, 99.9, 100). You set the interval measurement level to a variable in the Input Data Source node.

leaves

segments of a tree that are not further segmented. The final leaves in a tree are known as terminal nodes.

levels

successive hierarchical partitions of data in a tree. The first level represents the entire unpartitioned data set. The second level represents the first partition of the data into segments, and so on. See also depth.

libref

the name that is temporarily associated with a SAS data library.

logistic regression

a form of regression analysis in which the target (response) variable represents a binary or ordinal-level response.

measurement

the process of assigning numbers to things such that the properties of the numbers reflect some attribute of the things.

measurement level

one of several different ways in which properties of numbers can reflect attributes of things. The most common measurement levels are nominal, ordinal, interval, log-interval, ratio, and absolute. Measurement level roles are automatically set to variables in the Input Data Source node using the metadata sample.

metadata sample

a sample of the input data source that is downloaded to the client and is used throughout Enterprise Miner to determine meta information about the data. By default, the metadata sample size is 2000 cases. You can set the metadata sample size in the Input Data Source node. The metadata sample is updated with new variables that are automatically created or that you create as you build and run the process flow diagram. The metadata sample is used for the following tasks:

• Calculates summary statistics, determines the number of variable levels, and determines the frequencies of the variable levels for the active data set in the Input Data Source node.

• Determines hierarchical relationships between variables when running the Variable Selection node.

• Used as the project data when a remote project is run locally and the data do not exist locally.

• Determines what cases are filtered in the Automatic Filter window of the Filter Outliers node.

• Used as the data source to create a bar chart of a variable whenever you select the "View distribution" item for a variable in a node.

model

a formula or algorithm that computes outputs from inputs. A statistical model also includes information about the conditional distribution of the targets given the inputs.

multilayer perceptron (MLP)

a neural network with one or more hidden layers, each of which has a linear combination function and executes some nonlinear activation function on the input to that layer.

net input

the result of the combination function of a neuron. The net input can be transformed by an activation function to yield an output.
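
The relationship between the combination function, the net input, and the activation function can be sketched in Python as follows. The example assumes a linear combination function (a bias plus a weighted sum of the inputs) and a hyperbolic tangent activation; the weights, bias, and function names are hypothetical values chosen for illustration.

```python
import math


def net_input(inputs, weights, bias):
    """Linear combination function: bias plus the weighted sum of the inputs."""
    return bias + sum(w * x for w, x in zip(weights, inputs))


def neuron_output(inputs, weights, bias, activation=math.tanh):
    """Transform the net input with an activation function to yield an output."""
    return activation(net_input(inputs, weights, bias))


x = [1.0, 2.0]        # hypothetical input values
w = [0.5, -0.25]      # hypothetical weights
b = 0.1               # hypothetical bias

print(net_input(x, w, b))       # net input to the neuron
print(neuron_output(x, w, b))   # output after the tanh activation
```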

neural networks

a class of flexible nonlinear regression and discriminant models, data reduction models, and nonlinear dynamic systems, that consist of an often large number of neurons usually interconnected in complex ways and often organized into layers.

neurons

linear or nonlinear computing elements in a neural network that accept one or more inputs, compute a function of the inputs, and may direct the result to one or more other neurons. Neurons are also known as nodes or units.

nodes (Decision Tree)

segments or diagram-nodes, depending on context. The terms leaves, nodes, and segments are closely related and sometimes refer to the same part of a tree. See also internal nodes.

nodes (Neural Network)

See neurons.

nominal variable

a variable that contains discrete values that do not have a logical ordering (for example, PARTY: Democrat, Republican, other). The Input Data Source node automatically assigns the nominal measurement level to variables based on the information in the metadata sample. You can also set the nominal measurement level to a variable in the Input Data Source node.

numeric variable

a variable that contains only numeric values and related symbols, such as decimal points, plus signs, and minus signs.

observation

See case.

ordinal variable

a variable that contains discrete values that have a logical ordering (for example, GRADE: A, B, C, D, F). The Input Data Source node automatically assigns the ordinal measurement level to variables based on the information in the metadata sample. You can also set the ordinal measurement level to a variable in the Input Data Source node.

output

a variable that is computed from the inputs as a prediction of the value of the target variable.

overfit

training the model to the random variation in the sample data. Overfit models contain too many parameters (weights), and they do not generalize well.

partition

dividing the available data into training, validation, and test data sets.
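
A partition can be sketched in Python as a seeded random shuffle followed by a split. The 40/30/30 proportions, the seed, and the function name below are assumptions made for this example; they are not Enterprise Miner defaults and do not describe the Data Partition node's implementation.

```python
import random


def partition(cases, train=0.4, validate=0.3, seed=12345):
    """Shuffle the cases with a fixed seed and split them into three data sets."""
    rng = random.Random(seed)          # a fixed seed makes the split repeatable
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validate)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # the remaining cases
    return training, validation, test


training, validation, test = partition(range(100))
print(len(training), len(validation), len(test))
```

Because the seed is fixed, running the sketch again produces the same three data sets.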

perceptron

a linear or nonlinear neural network with or without one or more hidden layers.

predicted value

See output.

predict variable

a variable that contains the predicted values (outputs) for the target. You set the Predict variable role in the Input Data Source node.

profit matrix

a table of expected revenues and expected costs for each decision alternative for each level of the target variable.
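
One common way to use a profit matrix is to weight each entry by the predicted probability of the corresponding target level and then choose the decision alternative with the largest expected profit. The Python sketch below illustrates that idea with a hypothetical two-level target and two decision alternatives; the profit values, probabilities, and names are assumptions, not figures from the course data.

```python
# Hypothetical profit matrix: rows are target levels, columns are decisions.
profit_matrix = {
    "respond":    {"mail": 10.0, "no_mail": 0.0},
    "no_respond": {"mail": -1.0, "no_mail": 0.0},
}


def expected_profit(posterior, decision):
    """Sum the profit of a decision over target levels, weighted by probability."""
    return sum(p * profit_matrix[level][decision] for level, p in posterior.items())


def best_decision(posterior):
    """Return the decision alternative with the largest expected profit."""
    decisions = ["mail", "no_mail"]
    return max(decisions, key=lambda d: expected_profit(posterior, d))


posterior = {"respond": 0.25, "no_respond": 0.75}   # hypothetical model output
print(expected_profit(posterior, "mail"))
print(best_decision(posterior))
```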

project

a collection of Enterprise Miner process flow diagrams. Each project contains a project library and a data library.

project library

a library where information about the project is stored. You can have one project per SAS project library. You should have only files related to the Enterprise Miner project in this library.

project library path

the directory or path on the client where SAS data sets and SAS catalogs for the project are stored.

response variable

See target variable.

root node

the initial tree segment. It represents the entire data set.

root segment

See root node.

rules

definitions of how to split segments of data into subsegments in a tree.

SAS data set

descriptor information and its related data values organized as a table of observations and variables that can be processed by the SAS System.

scoring

the process of applying a model to new data to compute outputs. Scoring represents the end result of data mining.

seed

an initial value from which a random number function calculates a random value.

sequence variable

a variable that represents the time span from observation to observation. The Associations node requires a sequence variable for sequence discovery. The sequence (time stamp) variable must be recorded on the same scale. You set the Sequence variable role in the Input Data Source node.

simple random sample

a sample for which each item in the population has an equal chance of selection.

standard deviation

a statistical measure of the variability of a group of data values. This measure, which is the most widely used measure of the dispersion of a frequency distribution, is equal to the positive square root of the variance.
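
The definition can be worked through in a few lines of Python. The sketch below assumes the sample form of the variance (divisor n - 1), which the glossary entry does not specify; the data values and function name are hypothetical.

```python
import math


def sample_std(values):
    """Positive square root of the sample variance (divisor n - 1, an assumption)."""
    n = len(values)
    m = sum(values) / n
    variance = sum((v - m) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical data values
print(sample_std(data))
```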

stratified random sample

a sample obtained by dividing a population into nonoverlapping parts, called strata, and randomly selecting items from each stratum.
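
A stratified random sample can be sketched in Python by grouping the cases into strata and drawing from each stratum separately. The example assumes proportional allocation (the same sampling fraction in every stratum), which is only one of several possible allocation schemes; the population, the seed, and the function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(cases, stratum_of, fraction, seed=12345):
    """Draw the same fraction of cases, at random, from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[stratum_of(case)].append(case)   # nonoverlapping parts
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))   # sampling without replacement
    return sample


# Hypothetical population: 80 cases in stratum "A" and 20 in stratum "B".
population = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
sample = stratified_sample(population, lambda case: case[0], 0.1)
print(len(sample))
```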

subdiagram

a collection of nodes in a process flow diagram that are compressed into a single node. The use of a subdiagram may improve your control of the information flow in the diagram.

tabbed dialog

a window in which you select labeled tabs to access different screens within the window.

target variable

a variable whose value is known in some currently available data (for example, the training data set) but is unknown in some future data sets (for example, the score data set). You typically want to predict the values of the target variable(s) from other known variables. You set target variables in the Input Data Source node. The ordering of the target values determines the event level for class targets.

test data

currently available data that contain input and target values that are not used during training, but instead are used for generalization and to compare models. A test data set can be created with the Data Partition node.

training

the process of computing good values for the weights in a model.

training data

currently available data that contain inputs and target values used for model training. A training data set can be created with the Data Partition node.

transformation

applying a function to a variable to adjust its range, variability, or both. Variables can be transformed in the Transform Variables node.

tree

the complete set of rules used to split the data into a hierarchy of successive segments. A tree consists of branches and leaves, in which each set of leaves represents an optimal segmentation of the branches above them according to a statistical measure.

Tree Diagram

a graphical representation of (at least) a selected portion of a tree, which may include segment statistics, the names of the variables used to split the segments, and the variable values. Click on the Tree-Ring Navigator to open the Tree Diagram.

Tree Browser

the main analysis window for data segmentation. The Tree Browser contains the Summary Table, Tree-Ring Navigator, Assessment Table, and Assessment Graph.

trial variable

a variable that contains count data for a binomial target, such as the number of responders who responded to a mailing. Some of the trials are classified as events, and the remainder are classified as nonevents.

unary variable

a variable that contains a single discrete value. The unary measurement level is automatically assigned to unary variables in the Input Data Source node.

underfit

training the model to only part of the actual patterns in the sample data. Underfit models contain too few parameters (weights), and they do not generalize well. See also overfit.

units

See neurons.

validation data

data that are used indirectly during training for model selection, early stopping, or for other methods intended to improve generalization. You can also use the validation data set as a selection criterion when running stepwise regression with the Regression node. By default, all assessment statistics are calculated using the validation data set. You can create a validation data set with the Data Partition node.

variable

one of the items of information that is represented in numeric or character form for each case in a data set.

weights

constants that are used in a model for which the constant values are unknown or unspecified prior to the analysis.

A.2 References

Berry, M. J. A. and Linoff, G. (1997), Data Mining Techniques for Marketing, Sales, and Customer Support, New York: John Wiley and Sons.

Bishop, C. M. (1995), Neural Networks for Pattern Recognition, New York: Oxford University Press.

Bigus, J. P. (1996), Data Mining with Neural Networks: Solving Business Problems - from Application Development to Decision Support, New York: McGraw-Hill.

Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984), Classification and Regression Trees, Belmont, CA: Wadsworth International Group.

Hand, D. J. (1997), Construction and Assessment of Classification Rules, New York: John Wiley and Sons.

Hoaglin, D. C., Mosteller, F. and Tukey, J. W. (1983), Understanding Robust and Exploratory Data Analysis, New York: John Wiley and Sons.

Little, R. J. A. and Rubin, D. B. (1987), Statistical Analysis with Missing Data, New York: John Wiley and Sons.

Little, R. J. A. (1992), "Regression with missing X's: A review," Journal of the American Statistical Association, 87, 1227-1237.

Michie, D., Spiegelhalter, D. J. and Taylor, C. C. (1994), Machine Learning, Neural and Statistical Classification, New York: Ellis Horwood.

Ripley, B. D. (1996), Pattern Recognition and Neural Networks, New York: Cambridge University Press.

SAS Institute Inc. (1995), Logistic Regression Examples Using the SAS System, Version 6, First Edition, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1990), SAS Language: Reference, Version 6, First Edition, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1990), SAS Procedures Guide, Version 6, Third Edition, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1995), SAS/INSIGHT User’s Guide, Version 6, Third Edition, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1990), SAS/STAT User’s Guide, Version 6, Fourth Edition, Volumes 1 and 2, Cary, NC: SAS Institute Inc.

Sarle, W. S. (1994a), "Neural Networks and Statistical Models," Proceedings of the Nineteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc., 1538-1550.

Sarle, W. S. (1994b), "Neural Network Implementation in SAS Software," Proceedings of the Nineteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc., 1550-1573.

Sarle, W. S. (1995), "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface.

Smith, M. (1993), Neural Networks for Statistical Modeling, New York: Van Nostrand Reinhold.

Weiss, S. M. and Kulikowski, C. A. (1991), Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems, San Mateo, CA: Morgan Kaufmann.
