
Zhangxi Lin

ISQS 7342-001

Texas Tech University

Note: Most slides in this file are sourced from SAS® Course Notes

Lecture Notes 8: Continuous and Multiple Target Prediction

Structure of the Chapter

Section 2.1 raises the problem that standard decision tree methods did not produce good results

Section 2.2 analyzes the problem

Section 2.3 develops basic two-stage models to improve the results

Section 2.4 further improves the two-stage models

Section 2.1

Introduction

Motivation

The results of the 1998 KDD-Cup produced a surprise. Almost half of the entries yielded a total profit on the validation data that was less than that obtained by soliciting everyone.

Part of the problem lies in the method used to select cases for solicitation. This chapter extends the notion of profit introduced in Chapter 1 to allow for better selection of cases for solicitation.

1998 KDD-Cup Results

Rank   Total Profit   Overall Avg. Profit
 1     $14,712        $0.153
 2     $14,662        $0.152
 3     $13,954        $0.145
 4     $13,825        $0.143
 5     $13,794        $0.143
 6     $13,598        $0.141
 7     $13,040        $0.135
 8     $12,298        $0.128
 9     $11,423        $0.119
10     $11,276        $0.117
11     $10,720        $0.111
12     $10,706        $0.111
13     $10,112        $0.105
14     $10,049        $0.104
15      $9,741        $0.101
16      $9,464        $0.098
17      $5,683        $0.059
18      $5,484        $0.057
19      $1,925        $0.020
20      $1,706        $0.018

Total profit and average profit for the “solicit everyone” model: $10,560 and $0.110

Section 2.2

Generalized Profit Matrices

Random Profit Consequences

[Figure: 2x2 matrix of random profits, with one column for the primary decision and one for the secondary decision. The legend marks negative profit.]

Outcome Conditioned Random Profits

In a more general context, the profit associated with a decision for an individual case can be thought of as a random variable. The goal of predictive modeling is to estimate the distribution of this profit random variable conditioned on case input measurements.

Because the decisions are usually associated with discrete outcomes, the random profits are conditioned on each of these outcomes. For a binary outcome and two decisions, the random profits form the elements of a 2x2 random matrix.

[Figure: 2x2 random profit matrix with rows for the primary and secondary outcomes. The legend marks negative profit.]

Expected Profit Matrix

[Figure: the random profit matrix with each entry replaced by its expected value, E(Profit). Rows are the primary and secondary outcomes. The legend marks negative profit.]

Expected/Reduced Profit Matrix

Because it is easier to work with concrete numbers than random variables, statistical summaries of the random profit matrices are used to quantify the consequence of a decision.

One way to do this is to calculate the expected value of the profit random variable for each outcome and decision combination. Arrayed as a matrix, this is called the expected profit-consequence matrix, or the expected profit matrix for a case.

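Often, generalized profit matrices have zeros in the secondary decision column. Without loss of generality (assuming the profit-consequence is measured by expected value), it is always possible to write the generalized profit matrix with a column of zero profits: for each outcome, subtract the secondary decision profit from both decisions, so that only the difference between the decisions remains.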
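The reduction to a zero column can be sketched in a few lines of Python. This is a minimal illustration, not the SAS implementation: the row and column layout, the function name `reduce_profit_matrix`, and the profit values are all assumptions made for the example.

```python
def reduce_profit_matrix(expected_profit):
    """Return the reduced matrix: each row minus its secondary-decision entry.

    Rows are outcomes (primary, secondary); columns are decisions
    (primary, secondary). This layout is an assumption for the sketch.
    """
    return [[entry - row[1] for entry in row] for row in expected_profit]

# Hypothetical expected profits, not values from the course data.
expected = [[15.00, 0.40],
            [-0.30, 0.40]]
reduced = reduce_profit_matrix(expected)
print(reduced)
```

Because the same amount is subtracted from both decisions within an outcome, the difference between the two decisions is unchanged for every case, which is why the reduction does not alter which decision is preferred.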

Reduced Profit Matrix

[Figure: the expected profit matrix is reduced by taking, for each outcome, the difference between the primary decision and secondary decision expected profits. The legend marks negative profit.]

Expected Profit-Consequence

With the secondary decision column set to zero, the expected profit-consequence (EPC) of the primary decision for a case weights the expected profit entry (EPF) for each outcome by the probability p of that outcome:

$$\mathrm{EPC} = \mathrm{EPF}_{1} \cdot p_{1} + \mathrm{EPF}_{0} \cdot p_{0}$$

Here the subscript 1 denotes the primary outcome and 0 the secondary outcome. The calculation is repeated for every case, so each case receives its own EPC.
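As a worked example of the expected profit-consequence, the sketch below weights the profit under each outcome by the outcome probability. The function name and the numbers (a 5% response probability, a profit of 15.00 for the primary outcome and a loss of 0.68 for the secondary outcome) are hypothetical and chosen only for illustration.

```python
def expected_profit_consequence(p_primary, epf_primary, epf_secondary):
    """EPC of the primary decision: outcome profits weighted by outcome probabilities."""
    p_secondary = 1.0 - p_primary  # binary outcome, so the probabilities sum to 1
    return epf_primary * p_primary + epf_secondary * p_secondary

# Hypothetical case: 5% chance of the primary outcome.
epc = expected_profit_consequence(p_primary=0.05, epf_primary=15.00, epf_secondary=-0.68)
print(round(epc, 3))
```

With these assumed values the EPC is positive, so the primary decision would be preferred over the zero-profit secondary decision for this case.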

Sort Expected Profit-Consequence

Sort cases by decreasing EPC.

Total Expected Profit

Sum EPCs in excess of threshold.

Observed Profit

[Figure: for a case assigned the primary decision, the observed profit (OP) is the profit matrix entry for the outcome that actually occurred, primary or secondary. The slides step through the cases sorted by EPC. The legend marks negative profit.]

Record observed profits.

Observed Total Profit

Sum OPs for cases with EPCs in excess of threshold.
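The selection and summation steps can be sketched as follows. This is an illustrative outline under stated assumptions, not the Generalized Profit Assessment tool itself: the function name, the default threshold of zero, and the EPC and OP values are hypothetical.

```python
def assess_at_threshold(epc, observed_profit, threshold=0.0):
    """Sum expected and observed profit over cases whose EPC exceeds threshold."""
    selected = [i for i, value in enumerate(epc) if value > threshold]
    total_expected = sum(epc[i] for i in selected)
    total_observed = sum(observed_profit[i] for i in selected)
    depth = len(selected) / len(epc)  # fraction of cases selected
    return total_expected, total_observed, depth

# Hypothetical cases: EPC from a model, OP from the observed outcomes.
epc = [0.50, -0.20, 0.10, 0.30, -0.40]
observed_profit = [-0.68, -0.68, 9.32, -0.68, -0.68]
total_expected, total_observed, depth = assess_at_threshold(epc, observed_profit)
print(total_expected, total_observed, depth)
```

The expected total uses only model output, while the observed total uses the recorded outcomes, so comparing the two shows how well the EPC estimates anticipate realized profit.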

Generalized Profit Assessment Data

Sum OPs for cases with EPCs in excess of threshold.

Total Profit Plot

[Figure: total profit plotted against depth, with cases ordered by decreasing EPC.]

Observed and Expected Profit Plot

[Figure: total observed profit (OP) and total expected profit (EPC) plotted against depth, with cases ordered by decreasing EPC.]
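The data behind a profit-versus-depth plot can be produced by accumulating profits over the cases sorted by decreasing EPC. The sketch below assumes that construction; the function name and the sample values are hypothetical.

```python
from itertools import accumulate

def profit_by_depth(epc, observed_profit):
    """Cumulative expected and observed profit as cases are added in EPC order."""
    order = sorted(range(len(epc)), key=lambda i: epc[i], reverse=True)
    cum_expected = list(accumulate(epc[i] for i in order))
    cum_observed = list(accumulate(observed_profit[i] for i in order))
    depths = [(k + 1) / len(order) for k in range(len(order))]
    return depths, cum_expected, cum_observed

# Hypothetical cases, for illustration only.
epc = [0.50, -0.20, 0.10, 0.30, -0.40]
observed_profit = [-0.68, -0.68, 9.32, -0.68, -0.68]
depths, cum_expected, cum_observed = profit_by_depth(epc, observed_profit)
for row in zip(depths, cum_expected, cum_observed):
    print(row)
```

In this toy data the cumulative expected profit peaks at a depth of 0.6, where the EPC values turn negative, which is the kind of optimum the plot is meant to reveal.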

Profit Confusion Matrix

                    Primary Decision                Secondary Decision                Total
Primary Outcome     true positive profit            false negative profit             total primary profit
Secondary Outcome   false positive profit           true negative profit              total secondary profit
Total               total primary decision profit   total secondary decision profit

Each entry is a sum of observed profits (OP) over the cases in that cell.

True Positive Profit Fraction

The true positive profit fraction is the true positive profit expressed as a fraction of the total primary profit.

False Positive Profit Fraction

The false positive profit fraction is the false positive profit expressed as a fraction of the total secondary profit.
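A small sketch can make the profit confusion matrix concrete. It assumes that each case carries the profit it would yield under the primary decision, that outcomes and decisions are coded 1 (primary) and 0 (secondary), and that the two fractions are row-wise ratios; these conventions, the function names, and the data are assumptions for illustration, not definitions confirmed by the slides.

```python
def profit_confusion(outcomes, decisions, profits):
    """Sum profits into cells keyed by (outcome, decision); 1 = primary, 0 = secondary."""
    cells = {(o, d): 0.0 for o in (1, 0) for d in (1, 0)}
    for outcome, decision, profit in zip(outcomes, decisions, profits):
        cells[(outcome, decision)] += profit
    return cells

def profit_fractions(cells):
    """True and false positive profit fractions, taken over the outcome rows."""
    total_primary = cells[(1, 1)] + cells[(1, 0)]    # primary outcome row
    total_secondary = cells[(0, 1)] + cells[(0, 0)]  # secondary outcome row
    return cells[(1, 1)] / total_primary, cells[(0, 1)] / total_secondary

# Hypothetical cases, for illustration only.
outcomes = [1, 1, 0, 0, 0]
decisions = [1, 0, 1, 0, 0]
profits = [10.0, 5.0, -1.0, -1.0, -2.0]
cells = profit_confusion(outcomes, decisions, profits)
tp_fraction, fp_fraction = profit_fractions(cells)
print(cells, tp_fraction, fp_fraction)
```

The sketch does not guard against a row total of zero, which would need handling in real use.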

Section 2.3

Basic Two-Stage Models

Defining Two-Stage Model Components

The two components are E(B|X) and E(D|X). They can be obtained from:

- Specified values (for example, 15.30)
- Separate predictive models
- Joint predictive models: E(B,D|X)

Two-Stage Modeling Methods

A better estimate of the primary decision profit can be obtained by modeling both outcome probability and expected profit, using two-stage modeling methods.

There are several ways to estimate the components used in two-stage models:

- The first is to simply specify values for certain components. This is simple to do, but it often produces poor results.
- In a more sophisticated approach, you can use the value in an input or a look-up table as a surrogate for expected donation amount.
- The most common approach is to estimate values for components with individual models.
- At the extreme end of the sophistication scale, you can use a single model to predict both components simultaneously, for example, with the NEURAL procedure in SAS Enterprise Miner.

Basic Two-Stage Models

A two-stage model combines two models:

- One to estimate the donation propensity
- Another one to estimate the donation amount
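The way the two component predictions combine can be sketched as a simple product. The function names, the 5% propensity, and the solicitation cost of 0.68 are hypothetical; the amount of 15.30 is borrowed from the specified value shown earlier, purely as an example.

```python
def expected_donation(p_donate, amount_if_donate):
    """Propensity prediction times amount prediction."""
    return p_donate * amount_if_donate

def expected_profit(p_donate, amount_if_donate, cost):
    """Expected donation minus a fixed solicitation cost (the cost is an assumption)."""
    return expected_donation(p_donate, amount_if_donate) - cost

donation = expected_donation(0.05, 15.30)
profit = expected_profit(0.05, 15.30, cost=0.68)
print(round(donation, 3), round(profit, 3))
```

Under these assumed values the expected profit is slightly positive, so the case would be worth soliciting.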

Two-Stage Model Tool

The Two Stage Model tool builds two models, one to predict TARGET_B and one to predict TARGET_D. Theoretically, you can use this node to combine predictions for the two target variables and get a prediction of expected donation amount.

The tool has two minor limitations:

- It does not recognize the prior vector. Thus, because responders are overrepresented in the training data, the probabilities in the TARGET_B model are biased.
- The node has no built-in diagnostic to assess overall average profit. Profit information passed to the Assessment node is incorrect.

Both of these limitations are easily overcome by the Generalized Profit Assessment tool.
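To see why oversampled responders bias the TARGET_B probabilities, the sketch below applies a standard prior correction to a probability estimated on an oversampled training set. It is not presented as what the Generalized Profit Assessment tool does internally; the function name and the rates (50% responders in the sample, 5% in the population) are hypothetical.

```python
def adjust_for_priors(p_sample, sample_rate, population_rate):
    """Rescale a probability from an oversampled training set to the population priors."""
    w_primary = population_rate / sample_rate
    w_secondary = (1.0 - population_rate) / (1.0 - sample_rate)
    numerator = p_sample * w_primary
    return numerator / (numerator + (1.0 - p_sample) * w_secondary)

# Hypothetical rates: responders are 50% of the sample but 5% of the population.
p_adjusted = adjust_for_priors(p_sample=0.50, sample_rate=0.50, population_rate=0.05)
print(round(p_adjusted, 4))
```

A prediction of 0.50 on the balanced sample corresponds to only 0.05 in the population here, which shows how uncorrected probabilities would inflate the expected profit of every case.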

The Model We Are Using

Basic model

Different from the book

Target Variables

Some Two-Stage Model Options

Model fitting approach: sequential or concurrent

- Sequential: couples the models by making the binary outcome model’s prediction an input for the expected profit model
- Concurrent: fits a neural network model with two targets

FILTER: removes cases from the training data when building the value model

MULTIPLY: multiplies the class and value model predictions

Results of the Two-Stage Node

Results of the GPA Node

Oddities in the assessment report.

1. The reported overall average profits from training data are extremely low.

2. The depth supposedly corresponding to optimum profit threshold is reported to be 100% (select all cases).

3. The total profit reported in the validation data is almost 40% higher than in the training data.

Stratification with BIN_TARGET_D

Improved Results of the GPA Node

The third problem has been solved.

But the performance of the model is still lower than that from “no model”.

Correct bias in GPA by setting the following parameter in the code:

%let EM_PROPERTY_adjustprobs = Y;

The model is no longer selecting all the data (it is now around 60%), but the overall average profit values remain low.

The average profit is 0.1105, slightly more than the 0.110 obtained by soliciting everyone without a model.

Results from an Improved Two-Stage Model

Parameters:

- Class Model: Regression
- Selection Model: Stepwise
- Selection Criteria: Validation Error

The Average Profit: 0.155

This result is good enough to win the KDD Cup!

Summary – Improving the Performance

Section 2.3

- Use two-stage models
- Stratification using the binned value target
- Correct bias in GPA: %let EM_PROPERTY_adjustprobs = Y;

Section 2.4

- Use regression settings in the Two Stage node
- Reduce MSE: interval target value transformation
- Construct the component models separately from the Two Stage node
- Use regression trees in a two-stage model (%let EM_PROPERTY_adjustprobs = N;)
- Use neural networks in a two-stage model (%let EM_PROPERTY_adjustprobs = N;)

Section 2.4

Constructing Component Models

Two-Stage Modeling Challenges

Model Assessment

Interval Model Specification: E(D) = g(x;w)

Constructing a two-stage model (or, more generally, any multiple-component model) requires attention to several challenges not previously encountered.

Earlier modeling assessment efforts evaluated models based on profitability measures, assuming a fixed profit structure. Because the profit structure itself is being modeled in a two-stage model, you need a different mechanism to assess model performance.

Correct specification requires appropriately chosen inputs, link functions, and target error distribution.

By incorporating the predictions of the binary model into the interval model, it may be possible to make a more parsimonious specification of the interval model.

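Estimating Mean Squared Error

The mean squared error of a prediction $\hat{D}$ for the target $D$ given inputs $X$ is

$$\mathrm{MSE} = E\big[(D - \hat{D})^2\big]$$

It is estimated from the $N$ cases of the training data by

$$\text{Estimated MSE} = \frac{1}{N} \sum_{i=1}^{N} (D_i - \hat{D}_i)^2$$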

MSE Decomposition: Variance

$$\underbrace{E\big[(D - \hat{D})^2\big]}_{\text{MSE}} = \underbrace{E\big[(D - ED)^2\big]}_{\text{Variance}} + \big[E(\hat{D} - ED)\big]^2$$

In theory, the MSE can be decomposed into two components, each involving a deviation from the true expected value of the target variable.

MSE Decomposition: Squared Bias

$$\underbrace{E\big[(D - \hat{D})^2\big]}_{\text{MSE}} = \underbrace{E\big[(D - ED)^2\big]}_{\text{Variance}} + \underbrace{\big[E(\hat{D} - ED)\big]^2}_{\text{Bias}^2}$$

Variance: independent of any fitted model.

Bias²: the squared difference between the predicted and actual expected values.

Honest MSE Estimation

$$\text{Estimated MSE} = \frac{1}{N} \sum_{i=1}^{N} (D_i - \hat{D}_i)^2 \quad \text{(computed on the validation data)}$$

Unbiased estimates can be obtained by correctly accounting for model degrees of freedom in the MSE estimate or by simply estimating MSE from an independent validation data set.
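The estimated MSE is straightforward to compute once predictions are available for a held-out data set. The sketch below uses a hypothetical helper and made-up validation values; in practice the predictions would come from the fitted interval model.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted target values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((d - d_hat) ** 2 for d, d_hat in zip(actual, predicted)) / len(actual)

# Hypothetical validation cases and model predictions.
validation_d = [10.0, 20.0, 30.0]
validation_d_hat = [12.0, 18.0, 33.0]
validation_mse = mean_squared_error(validation_d, validation_d_hat)
print(validation_mse)
```

Applying the same function to training and validation data and comparing the two values gives a quick check on how optimistic the training estimate is.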


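MSE and Binary Target Models

For a binary target $B$ with prediction $\hat{B}$, estimated on the validation data:

$$\text{Estimated MSE} = \frac{1}{N} \sum_{i=1}^{N} (B_i - \hat{B}_i)^2$$

$$\underbrace{E\big[(B - \hat{B})^2\big]}_{\text{MSE (inaccuracy)}} = \underbrace{E\big[(B - EB)^2\big]}_{\text{Variance (inseparability)}} + \underbrace{\big[E(\hat{B} - EB)\big]^2}_{\text{Bias}^2\text{ (imprecision)}}$$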

The Binary Target

The estimated MSE of the binary target can be thought of as measuring the overall inaccuracy of model prediction.

This inaccuracy estimate can be decomposed into a term related to the inseparability of the two target levels (corresponding to the variance component) plus a term related to the imprecision of the model estimate (corresponding to the bias-squared component).

Because the inseparability term does not depend on the model, the model with the smallest estimated MSE will also be the least imprecise.

Model Assessment


Use Validation MSE

To assess both the binary and the interval component models, it is reasonable to compare their validation data mean squared error. Models with the smallest MSE will have the smallest bias or imprecision.

Model Assessment


Use Validation MSE

A standard regression model may be ill suited for accurately modeling the relationship between the inputs and TARGET_D.

Matching the structure of the model to the specific modeling requirements is vital to obtaining good predictions.

Interval Model Requirements

Correct Error Distribution

Good Inputs: x1, x3, x10

Positive Predictions: E(D) > 0

Adequate Flexibility

Making Positive Predictions

Transform target: E(log(Y) | X)

Define appropriate link: log(E(Y | X))

Hints:

The interval component of a two-stage model is often used to predict a monetary response. Random variables that represent monetary amounts usually assume a skewed distribution with positive range and a variance related to expected value. When the target variable represents a monetary amount, this limited range and skewness must be considered in the model specification.

Proper specification of the target range and error distribution increases the chances of selecting good inputs for the interval target model. With good inputs, the correct degree of flexibility can be incorporated into the model and predictions can be optimized.
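The two ways of forcing positive predictions are not interchangeable: transforming the target models E(log(Y) | X), while a log link models log(E(Y | X)). The sketch below illustrates the gap for a lognormal target, using the known lognormal mean formula; the parameter values are hypothetical.

```python
import math

def lognormal_mean(mu, sigma):
    """Mean of Y when log(Y) is normal with mean mu and standard deviation sigma."""
    return math.exp(mu + 0.5 * sigma ** 2)

# Hypothetical parameters of log(Y) for one case.
mu, sigma = 2.5, 0.8
back_transformed = math.exp(mu)        # exp(E[log Y]), the naive back-transform
true_mean = lognormal_mean(mu, sigma)  # E[Y], what a log link targets
print(round(back_transformed, 2), round(true_mean, 2))
```

Exponentiating a prediction of the log target therefore understates the expected amount unless a correction for the residual variance is applied, which matters when the prediction feeds an expected profit calculation.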

Error Distribution Requirements

Possess correct skewness.

Have conforming support.

Account for heteroscedasticity.

Y

Specifying the Correct Error Distribution

Distribution          Variance
Normal (truncated)    constant*
Poisson               E(Y)
Gamma                 (E(Y))^2
Lognormal             (E(Y))^2

The normal distribution has a range from negative to positive infinity, whereas the target variable may have a more restricted range.

One disadvantage of the Poisson distribution relates to its skewness properties. Poisson error distributions are limited to the Neural Network node.

The gamma distribution is limited to the Neural Network node. The lognormal distribution can be used with any modeling tool.

A few extreme outliers may indicate a lognormal distribution, whereas the absence of such outliers may imply a gamma or less extreme distribution.

Model Assessment


log(Target) / Specify Link and Error

Use Validation MSE

Interval Target Model

The Parameters and Results

Compare the Distributions of Residuals

[Figure: two panels of residual distributions, one using log-transformed Target_D and one using original Target_D.]

Using Regression Trees

Using Neural Network Models
