Zhangxi Lin
ISQS 7342-001
Texas Tech University
Note: Most slides in this file are sourced from SAS Course Notes
Lecture Notes 8: Continuous and Multiple Target Prediction
Structure of the Chapter
Section 2.1 raises the problem that standard decision tree methods did not produce good results
Section 2.2 analyzes the problem
Section 2.3 develops basic two-stage models to improve the results
Section 2.4 further improves the two-stage models
Section 2.1
Introduction
Motivation
The results of the 1998 KDD-Cup produced a surprise. Almost half of the entries yielded a total profit on the validation data that was less than that obtained by soliciting everyone.
Part of the problem lies in the method used to select cases for solicitation. This chapter extends the notion of profit introduced in Chapter 1 to allow for better selection of cases for solicitation.
1998 KDD-Cup Results
Rank   Total Profit   Overall Avg. Profit
 1.    $14,712        $0.153
 2.    $14,662        $0.152
 3.    $13,954        $0.145
 4.    $13,825        $0.143
 5.    $13,794        $0.143
 6.    $13,598        $0.141
 7.    $13,040        $0.135
 8.    $12,298        $0.128
 9.    $11,423        $0.119
10.    $11,276        $0.117
11.    $10,720        $0.111
12.    $10,706        $0.111
13.    $10,112        $0.105
14.    $10,049        $0.104
15.    $9,741         $0.101
16.    $9,464         $0.098
17.    $5,683         $0.059
18.    $5,484         $0.057
19.    $1,925         $0.020
20.    $1,706         $0.018
"Solicit everyone" model: total profit $10,560, avg. profit $0.110
Section 2.2
Generalized Profit Matrices
Random Profit Consequences
[Figure: the profit from the primary decision and from the secondary decision is random and can be negative.]
Outcome Conditioned Random Profits
In a more general context, the profit associated with a decision for an individual case can be thought of as a random variable. The goal of predictive modeling is to estimate the distribution of this profit random variable conditioned on case input measurements.
Because the decisions are usually associated with discrete outcomes, the random profits are conditioned on each of these outcomes. For a binary outcome and two decisions, the random profits form the elements of a 2x2 random matrix.
[Figure: 2x2 random profit matrix. Rows: primary outcome, secondary outcome. Columns: primary decision, secondary decision. Entries can be negative.]
Expected Profit Matrix
[Figure: 2x2 expected profit matrix. Each cell holds E(Profit) for one combination of outcome (primary, secondary) and decision (primary, secondary). Entries can be negative.]
Expected/Reduced Profit Matrix
Because it is easier to work with concrete numbers than random variables, statistical summaries of the random profit matrices are used to quantify the consequence of a decision.
One way to do this is to calculate the expected value of the profit random variable for each outcome and decision combination. Arrayed as a matrix, this is called the expected profit-consequence matrix, or the expected profit matrix for a case.
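Often, generalized profit matrices have zeros in the secondary decision column. Without loss of generality (assuming the profit-consequence is measured by expected value), it is always possible to write the generalized profit matrix with a column of zero profits: subtract the secondary decision column from both columns, so that the primary decision column holds the difference in expected profit between the two decisions.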
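The reduction to a zero column can be sketched in a few lines of Python. This is only an illustration: the matrix layout (one row per outcome, one column per decision) and all numbers are hypothetical, not taken from the course data. It checks that subtracting the secondary decision column leaves the expected profit gap between the two decisions unchanged.

```python
def reduce_profit_matrix(matrix):
    """Subtract the secondary-decision column from both columns.

    matrix: list of rows, one per outcome; each row is
            (expected profit under primary decision,
             expected profit under secondary decision).
    """
    return [(primary - secondary, 0.0) for primary, secondary in matrix]


def expected_profit(matrix, p_primary_outcome, decision):
    """Expected profit of a decision (0 = primary, 1 = secondary)."""
    probs = (p_primary_outcome, 1.0 - p_primary_outcome)
    return sum(p * row[decision] for p, row in zip(probs, matrix))


# Hypothetical expected profit matrix (illustrative values only).
full = [(10.0, 2.0),   # primary outcome
        (-1.0, 0.5)]   # secondary outcome
reduced = reduce_profit_matrix(full)

p = 0.3  # assumed probability of the primary outcome
gap_full = expected_profit(full, p, 0) - expected_profit(full, p, 1)
gap_reduced = expected_profit(reduced, p, 0) - expected_profit(reduced, p, 1)
```

Because only the gap between the two decisions determines which one is chosen, the reduced matrix leads to the same decision for every case.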
Reduced Profit Matrix
[Figure: the expected profit matrix is reduced by taking the difference between the primary decision column and the secondary decision column. The result has the differences in the primary decision column and zeros in the secondary decision column. Entries can be negative.]
Expected Profit-Consequence
[Figure: reduced expected profit matrix with expected profits (EPF) in the primary decision column and zeros in the secondary decision column. Each outcome row has a probability p, and one expected profit-consequence (EPC) is computed per case. EPF values can be negative.]
\mathrm{EPC} = \mathrm{EPF}_{\text{primary}} \cdot p_{\text{primary}} + \mathrm{EPF}_{\text{secondary}} \cdot p_{\text{secondary}}
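A minimal sketch of the EPC calculation for a single case follows. The function name and the numeric values are hypothetical; the formula is the probability-weighted sum of the primary decision's expected profits under the two outcomes.

```python
def expected_profit_consequence(epf_primary, epf_secondary, p_primary):
    """EPC of the primary decision for one case.

    epf_primary / epf_secondary: expected profit of the primary decision
    under the primary / secondary outcome (reduced matrix, so the
    secondary decision contributes zero).
    p_primary: estimated probability of the primary outcome.
    """
    p_secondary = 1.0 - p_primary
    return epf_primary * p_primary + epf_secondary * p_secondary


# Hypothetical case: the primary outcome is worth 15.0, the secondary
# outcome costs 0.7, and the model estimates a 5% primary outcome rate.
epc = expected_profit_consequence(15.0, -0.7, 0.05)
```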
Sort Expected Profit-Consequence
Sort cases by decreasing EPC.
Total Expected Profit
Sum EPCs in excess of threshold.
[Figure: cases sorted by decreasing EPC; the EPCs at or above the threshold are summed to give the total expected profit.]
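The sort-and-sum step can be sketched as below. The function name, the EPC values, and the zero threshold are illustrative assumptions; the figure residue suggests that cases with EPC at or above the threshold are selected, which is what the sketch does.

```python
def total_expected_profit(epcs, threshold=0.0):
    """Sort EPCs in decreasing order and sum those at or above the threshold.

    Returns the ranked list, the total expected profit, and the depth
    (fraction of cases selected).
    """
    ranked = sorted(epcs, reverse=True)
    selected = [epc for epc in ranked if epc >= threshold]
    return ranked, sum(selected), len(selected) / len(ranked)


# Hypothetical EPCs for six cases.
epcs = [0.4, -0.2, 1.5, 0.0, -0.6, 0.9]
ranked, total, depth = total_expected_profit(epcs, threshold=0.0)
```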
Observed Profit
Record observed profits.
[Figure: for each case, in decreasing EPC order, the observed profit (OP) of the primary decision is read from the profit matrix row that matches the case's actual outcome (primary or secondary). OP values can be negative.]
Observed Total Profit
Sum OPs for cases with EPCs in excess of threshold.
Generalized Profit Assessment Data
Sum OPs for cases with EPCs in excess of threshold.
[Figure: cases in decreasing EPC order, with their EPC and OP values and the corresponding running totals.]
Total Profit Plot
[Figure: total profit plotted against depth.]
Observed and Expected Profit Plot
[Figure: observed and expected total profit plotted against depth.]
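The data behind such a plot can be sketched as follows: rank cases by decreasing EPC, then accumulate expected and observed profit at each depth. The function name and the four example cases are hypothetical.

```python
def profit_by_depth(cases):
    """cases: list of (epc, observed_profit) pairs, one per case.

    Returns a list of (depth, cumulative_expected, cumulative_observed)
    rows after sorting the cases by decreasing EPC.
    """
    ranked = sorted(cases, key=lambda case: case[0], reverse=True)
    rows, cum_epc, cum_op = [], 0.0, 0.0
    for i, (epc, op) in enumerate(ranked, start=1):
        cum_epc += epc
        cum_op += op
        rows.append((i / len(ranked), cum_epc, cum_op))
    return rows


# Hypothetical (EPC, OP) pairs for four cases.
cases = [(0.5, 2.0), (-0.3, -0.7), (1.2, -0.7), (0.1, 4.0)]
rows = profit_by_depth(cases)

# Depth at which the observed total profit peaks.
best_depth, _, best_observed = max(rows, key=lambda row: row[2])
```

Plotting the second and third columns against the first gives an expected and an observed total profit curve over depth.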
Profit Confusion Matrix
                     Primary Decision                 Secondary Decision                 Total
Primary Outcome      true positive profit             false negative profit              total primary profit
Secondary Outcome    false positive profit            true negative profit               total secondary profit
Total                total primary decision profit    total secondary decision profit
Each cell is a sum of observed profits (OP).
True Positive Profit Fraction
[Figure: true positive profit shown relative to total primary profit.]
False Positive Profit Fraction
[Figure: false positive profit shown relative to total secondary profit.]
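A rough sketch of a profit confusion matrix is given below. It assumes that each cell sums the observed profit of the cases falling in it, and that the two fractions divide a cell by its outcome (row) total, as the slide layout suggests. Both assumptions, the function name, and the data are illustrative only.

```python
from collections import defaultdict


def profit_confusion_matrix(cases):
    """Sum observed profit per (outcome, decision) cell.

    cases: iterable of (outcome, decision, observed_profit), where
    outcome and decision are each 'primary' or 'secondary'.
    """
    cells = defaultdict(float)
    for outcome, decision, observed_profit in cases:
        cells[(outcome, decision)] += observed_profit
    return cells


# Hypothetical cases (illustrative values only).
cases = [
    ("primary", "primary", 10.0),
    ("primary", "primary", 5.0),
    ("primary", "secondary", 5.0),
    ("secondary", "primary", -1.0),
    ("secondary", "secondary", -1.0),
    ("secondary", "secondary", -1.0),
]
cells = profit_confusion_matrix(cases)

tp_profit = cells[("primary", "primary")]
fn_profit = cells[("primary", "secondary")]
fp_profit = cells[("secondary", "primary")]
tn_profit = cells[("secondary", "secondary")]

total_primary = tp_profit + fn_profit      # row total, primary outcome
total_secondary = fp_profit + tn_profit    # row total, secondary outcome

# Assumed definitions: cell profit divided by its outcome total.
tp_fraction = tp_profit / total_primary
fp_fraction = fp_profit / total_secondary
```

The sketch does not guard against a zero row total, which a real assessment would need to handle.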
Section 2.3
Basic Two-Stage Models
Defining Two-Stage Model Components
Components: E(B|X), E(D|X)
Specified values: 15.30, X
Separate predictive models
Joint predictive models: E(B,D|X)
Two-Stage Modeling Methods
A better estimate of the primary decision profit can be obtained by modeling both outcome probability and expected profit, using two-stage modeling methods.
There are several ways to estimate the components used in two-stage models:
- Specify values for certain components. This is simple to do, but it often produces poor results. In a more sophisticated approach, you can use the value in an input or a look-up table as a surrogate for expected donation amount.
- Estimate values for components with individual models. This is the most common approach.
- Use a single model to predict both components simultaneously, for example, with the NEURAL procedure in SAS Enterprise Miner. This is the extreme end of the sophistication scale.
Basic Two-Stage Models
A two-stage model combines two models:
- One to estimate the donation propensity;
- Another to estimate the donation amount.
Two-Stage Model Tool
The Two Stage Model tool builds two models, one to predict TARGET_B and one to predict TARGET_D. Theoretically, you can use this node to combine predictions for the two target variables and get a prediction of expected donation amount.
The tool has two minor limitations:
- It does not recognize the prior vector. Thus, because responders are overrepresented in the training data, the probabilities in the TARGET_B model are biased.
- The node has no built-in diagnostic to assess overall average profit. Profit information passed to the Assessment node is incorrect.
Both of these limitations are easily overcome by the Generalized Profit Assessment tool.
The Model We Are Using
Basic model
Different from the book
Target Variables
Some Two-Stage Model Options
Model fitting approach: sequential or concurrent
- Sequential: couples the models by making the binary outcome model's prediction an input for the expected profit model
- Concurrent: fits a neural network model with two targets
FILTER: removes cases from the training data when building the value model
MULTIPLY: multiplies the class and value model predictions
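To make the FILTER and MULTIPLY ideas concrete, here is a toy two-stage model in Python. It is not the Two Stage node's algorithm: the "models" are just per-segment averages, the filtered-out cases are assumed to be the non-responders, and the segment names and amounts are invented for illustration.

```python
from statistics import mean


def fit_two_stage(rows):
    """rows: list of (segment, responded, amount) training cases.

    Returns {segment: (propensity, expected_amount, expected_donation)}.
    """
    by_segment = {}
    for segment, responded, amount in rows:
        by_segment.setdefault(segment, []).append((responded, amount))

    model = {}
    for segment, cases in by_segment.items():
        # Class model: response propensity estimated on all cases.
        propensity = mean(b for b, _ in cases)
        # FILTER (assumed meaning): the value model sees responders only.
        amounts = [d for b, d in cases if b == 1]
        expected_amount = mean(amounts) if amounts else 0.0
        # MULTIPLY: combine the class and value predictions.
        model[segment] = (propensity, expected_amount,
                          propensity * expected_amount)
    return model


# Hypothetical training data: (segment, TARGET_B-like flag, TARGET_D-like amount).
rows = [("A", 1, 20.0), ("A", 0, 0.0), ("A", 0, 0.0), ("A", 1, 10.0),
        ("B", 0, 0.0), ("B", 0, 0.0), ("B", 0, 0.0), ("B", 1, 8.0)]
model = fit_two_stage(rows)
```

Fitting the value model on responders only keeps the zero amounts of non-responders from pulling the expected amount toward zero; the propensity model already accounts for them.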
Results of the Two-Stage Node
Results of the GPA Node
Oddities in the assessment report:
1. The reported overall average profits from training data are extremely low.
2. The depth supposedly corresponding to optimum profit threshold is reported to be 100% (select all cases).
3. The total profit reported in the validation data is almost 40% higher than in the training data.
Stratification with BIN_TARGET_D
Improved Results of the GPA Node
The third problem has been solved, but the model still performs worse than the "no model" baseline.
Correct bias in GPA by setting the following parameter in the code:
%let EM_PROPERTY_adjustprobs = Y;
The model no longer selects all cases (the selected depth is now around 60%), but the overall average profit remains low.
The average profit is 0.1105, slightly more than the 0.110 obtained without a model.
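The bias being corrected comes from responders being overrepresented in the training data. A standard way to rescale a probability from an oversampled sample to the population prior is sketched below. Whether the GPA node performs exactly this computation when adjustprobs is set to Y is an assumption, and the rates used are hypothetical.

```python
def adjust_for_priors(p_sample, sample_rate, population_rate):
    """Rescale a probability estimated on oversampled data.

    p_sample:        model probability of response on the training sample
    sample_rate:     proportion of responders in the training sample
    population_rate: proportion of responders in the population (the prior)
    """
    responder_term = p_sample * population_rate / sample_rate
    nonresponder_term = ((1.0 - p_sample)
                         * (1.0 - population_rate) / (1.0 - sample_rate))
    return responder_term / (responder_term + nonresponder_term)


# Hypothetical rates: responders are 50% of the training data
# but only 5% of the population.
p_average = adjust_for_priors(0.5, 0.5, 0.05)
p_high = adjust_for_priors(0.8, 0.5, 0.05)
```

A case that looks average in the oversampled data maps back to the population rate, and higher scores stay higher but are pulled well below their unadjusted values.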
Results from an Improved Two-Stage Model
Parameters:
Class Model: Regression
Selection Model: Stepwise
Selection Criteria: Validation Error
The Average Profit: 0.155
This result is good enough to win the KDD Cup!
Summary – Improving the Performance
Section 2.3
- Use two-stage models
- Stratification using the binned value target
- Correct bias in GPA: %let EM_PROPERTY_adjustprobs = Y;
Section 2.4
- Use regression settings in the Two Stage node
- Reduce MSE: interval target value transformation
- Construct the component models separately from the Two Stage node
- Use regression trees in a two-stage model (%let EM_PROPERTY_adjustprobs = N;)
- Use neural networks in a two-stage model (%let EM_PROPERTY_adjustprobs = N;)
Section 2.4
Constructing Component Models
Two-Stage Modeling Challenges
Model Assessment
Interval Model Specification: E(D) = g(x;w)
Two-Stage Modeling Challenges
Constructing a two-stage model (or, more generally, any multiple-component model) requires attention to several challenges not previously encountered.
Earlier modeling assessment efforts evaluated models based on profitability measures, assuming a fixed profit structure. Because the profit structure itself is being modeled in a two-stage model, you need a different mechanism to assess model performance.
Correct specification requires appropriately chosen inputs, link functions, and target error distribution.
By incorporating the predictions of the binary model into the interval model, it can be possible to make a more parsimonious specification of the interval model.
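Estimating Mean Squared Error
[Figure: training data, target D plotted against input X, with the model prediction \hat{D}.]
\mathrm{MSE} = E[(D - \hat{D})^2]
\text{Estimated MSE} = \frac{1}{N} \sum_{i=1}^{N} (D_i - \hat{D}_i)^2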
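MSE Decomposition: Variance
[Figure: training data, target D plotted against input X, with the model prediction \hat{D}; the variance component is highlighted.]
\text{Estimated MSE} = \frac{1}{N} \sum_{i=1}^{N} (D_i - \hat{D}_i)^2
E[(D - \hat{D})^2] = E[(D - E(D))^2] + [E(\hat{D} - E(D))]^2
In theory, the MSE can be decomposed into two components, each involving a deviation from the true expected value of the target variable.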
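MSE Decomposition: Squared Bias
[Figure: training data, target D plotted against input X, with the model prediction \hat{D}; the squared bias component is highlighted.]
\text{Estimated MSE} = \frac{1}{N} \sum_{i=1}^{N} (D_i - \hat{D}_i)^2
E[(D - \hat{D})^2] = E[(D - E(D))^2] + [E(\hat{D} - E(D))]^2
\mathrm{MSE} = \text{Variance} + \text{Bias}^2
Variance: independent of any fitted model.
Bias^2: the squared difference between the predicted and actual expected value.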
Honest MSE Estimation
[Figure: validation data, target D plotted against input X, with the model prediction \hat{D}.]
\text{Estimated MSE} = \frac{1}{N} \sum_{i=1}^{N} (D_i - \hat{D}_i)^2 \quad \text{(computed on the validation data)}
E[(D - \hat{D})^2] = E[(D - E(D))^2] + [E(\hat{D} - E(D))]^2
\mathrm{MSE} = \text{Variance} + \text{Bias}^2
Unbiased estimates can be obtained by correctly accounting for model degrees of freedom in the MSE estimate or simply estimating MSE from an independent validation data set.
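Why the validation estimate is the honest one can be seen with a deliberately overfit toy model. The nearest-neighbor predictor and the data below are hypothetical and are not part of the course material; the point is only that a model can drive the training MSE to zero while the validation MSE stays well above it.

```python
def mse(actual, predicted):
    """Estimated MSE: average squared difference between D and its prediction."""
    return sum((d - d_hat) ** 2
               for d, d_hat in zip(actual, predicted)) / len(actual)


def nearest_neighbor_predictor(train_x, train_d):
    """A highly flexible model: predict the target of the closest training x."""
    pairs = list(zip(train_x, train_d))

    def predict(x):
        return min(pairs, key=lambda pair: abs(pair[0] - x))[1]

    return predict


# Hypothetical training and validation data.
train_x, train_d = [1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.4, 3.8]
valid_x, valid_d = [1.4, 2.6, 3.9], [1.5, 2.4, 4.1]

predict = nearest_neighbor_predictor(train_x, train_d)
train_mse = mse(train_d, [predict(x) for x in train_x])
valid_mse = mse(valid_d, [predict(x) for x in valid_x])
```

The training MSE is optimistic because the model has already seen those targets; the validation MSE is computed on cases that played no role in fitting.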
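MSE and Binary Target Models
[Figure: validation data, binary target B plotted against input X, with the model prediction \hat{B}.]
\text{Estimated MSE} = \frac{1}{N} \sum_{i=1}^{N} (B_i - \hat{B}_i)^2
E[(B - \hat{B})^2] = E[(B - E(B))^2] + [E(\hat{B} - E(B))]^2
\mathrm{MSE} = \text{Variance} + \text{Bias}^2
Inaccuracy = Inseparability + Imprecision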
The Binary Target
The estimated MSE of the binary target can be thought of as measuring the overall inaccuracy of model prediction.
This inaccuracy estimate can be decomposed into a term related to the inseparability of the two target levels (corresponding to the variance component) plus a term related to the imprecision of the model estimate (corresponding to the bias-squared component).
In this way, the model with the smallest estimated MSE will also be the least imprecise.
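For a single input value the decomposition can be checked exactly. The sketch below assumes a known true probability for the primary target level, which is never available in practice, and uses hypothetical numbers; it only illustrates that inaccuracy splits into inseparability plus imprecision.

```python
def binary_mse_terms(true_p, predicted_p):
    """Expected squared error for a binary target B with P(B=1) = true_p
    when the model predicts predicted_p, plus its two components."""
    # B = 1 with probability true_p, B = 0 otherwise.
    inaccuracy = (true_p * (1.0 - predicted_p) ** 2
                  + (1.0 - true_p) * predicted_p ** 2)
    inseparability = true_p * (1.0 - true_p)      # variance of B
    imprecision = (predicted_p - true_p) ** 2     # squared bias
    return inaccuracy, inseparability, imprecision


# Hypothetical values: true probability 0.3, model prediction 0.4.
inaccuracy, inseparability, imprecision = binary_mse_terms(0.3, 0.4)
```

Since the inseparability term does not depend on the prediction, ranking models by MSE at a given input is the same as ranking them by imprecision.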
Two-Stage Modeling Challenges
Model Assessment
Interval Model Specification: E(D) = g(x;w)
Use Validation MSE
To assess both the binary and the interval component models, it is reasonable to compare their validation data mean squared error. Models with the smallest MSE will have the smallest bias or imprecision.
Two-Stage Modeling Challenges
Model Assessment
Interval Model Specification: E(D) = g(x;w)
Use Validation MSE
A standard regression model may be ill suited for accurately modeling the relationship between the inputs and TARGET_D.
Matching the structure of the model to the specific modeling requirements is vital to obtaining good predictions.
Interval Model Requirements
- Correct error distribution
- Good inputs: x1, x3, x10
- Positive predictions: E(D) > 0
- Adequate flexibility
Making Positive Predictions
Transform target: model E(log(Y) | X).
Define appropriate link: model log(E(Y | X)).
Hints:
The interval component of a two-stage model is often used to predict a monetary response. Random variables that represent monetary amounts usually assume a skewed distribution with positive range and a variance related to expected value. When the target variable represents a monetary amount, this limited range and skewness must be considered in the model specification.
Proper specification of the target range and error distribution increases the chances of selecting good inputs for the interval target model. With good inputs, the correct degree of flexibility can be incorporated into the model and predictions can be optimized.
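The two routes to positive predictions are not interchangeable, because E(log(Y) | X) and log(E(Y | X)) are different quantities. The small check below uses hypothetical donation amounts and ignores inputs entirely: back-transforming the mean of the logs gives the geometric mean, which is positive but lies below the arithmetic mean for skewed data.

```python
import math
from statistics import mean

# Hypothetical donation amounts with one large value (right-skewed).
amounts = [5.0, 10.0, 10.0, 20.0, 100.0]

# Transform-target route: average on the log scale, then back-transform.
log_mean = mean(math.log(y) for y in amounts)
back_transformed = math.exp(log_mean)   # geometric mean, always positive

# What a log link targets instead: the log of the arithmetic mean.
arithmetic_mean = mean(amounts)
log_of_mean = math.log(arithmetic_mean)
```

If predictions from a log-transformed target are used as expected amounts, this gap means they tend to understate the mean unless a correction is applied.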
Error Distribution Requirements
Possess correct skewness.
Have conforming support.
Account for heteroscedasticity.
Specifying the Correct Error Distribution
Distribution          Variance
Normal (truncated)    constant*
Poisson               E(Y)
Gamma                 (E(Y))^2
Lognormal             (E(Y))^2
- The normal distribution has a range from negative to positive infinity, whereas the target variable may have a more restricted range.
- One disadvantage of the Poisson distribution relates to its skewness properties. Poisson error distributions are limited to the Neural Network node.
- The gamma distribution is limited to the Neural Network node. The lognormal distribution can be used with any modeling tool.
- A few extreme outliers may indicate a lognormal distribution, whereas the absence of such may imply a gamma or less extreme distribution.
Two-Stage Modeling Challenges
Model Assessment
Interval Model Specification: E(D) = g(x;w)
log(Target) / Specify Link and Error
Use Validation MSE
Interval Target Model
The Parameters and Results
Compare the Distributions of Residuals
[Figure: residual distributions in two panels: using log-transformed TARGET_D, and using original TARGET_D.]
Using Regression Trees
Using Neural Network Models