Chapter 4: Predictive Modeling


Page 1: Chapter 4: Predictive Modeling

1

Chapter 4: Predictive Modeling

4.1 Introduction to Predictive Modeling

4.2 Predictive Modeling Using Decision Trees

4.3 Predictive Modeling Using Logistic Regression

4.4 Churn Case Study

4.5 A Note about Model Management

4.6 Recommended Reading

Page 2: Chapter 4: Predictive Modeling

2

Chapter 4: Predictive Modeling

4.1 Introduction to Predictive Modeling

4.2 Predictive Modeling Using Decision Trees

4.3 Predictive Modeling Using Logistic Regression

4.4 Churn Case Study

4.5 A Note about Model Management

4.6 Recommended Reading

Page 3: Chapter 4: Predictive Modeling

3

Objectives

Explain the concepts of predictive modeling.
Illustrate the modeling essentials of a predictive model.
Explain the importance of data partitioning.

Page 4: Chapter 4: Predictive Modeling

4

Catalog Case Study

Analysis Goal:

A mail-order catalog retailer wants to save money on mailing and increase revenue by targeting mailed catalogs to customers most likely to purchase in the future.

Data set: CATALOG2010

Number of rows: 48,356

Number of columns: 98

Contents: sales figures summarized across departments and quarterly totals for 5.5 years of sales

Targets: RESPOND (binary)

ORDERSIZE (continuous)

Page 5: Chapter 4: Predictive Modeling

5

Where You’ve Been, Where You’re Going…

With basic descriptive modeling techniques (RFM), you identified customers who might be profitable.

Sophisticated predictive modeling techniques can produce risk scores for current customers, profitable prospects from outside the customer database, cross-sell and up-sell lists, and much more.

Scoring techniques based on predictive models can be implemented in real-time data collection systems, automating the process of fact-based decision making.

Page 6: Chapter 4: Predictive Modeling

6

Descriptive Modeling Tells You about Now

Descriptive statistics inform you about your sample. This information is important for reacting to things that have happened in the past.

[Diagram: Past Behavior → Fact-Based Reports → Current State of the Customer]

Page 7: Chapter 4: Predictive Modeling

7

From Descriptive to Predictive Modeling

Predictive modeling techniques, paired with scoring and good model management, enable you to use your data about the past and the present to make good decisions for the future.

[Diagram: Past Behavior → Fact-Based Predictions]

Page 8: Chapter 4: Predictive Modeling

8

Predictive Modeling Terminology

The observations in a training data set are known as training cases.

The variables are called inputs and targets.

[Training data set: input columns and a target column]

Page 9: Chapter 4: Predictive Modeling

9

Predictive Model

Predictive model: a concise representation of the input and target association

[Training data set: inputs and target]

Page 10: Chapter 4: Predictive Modeling

10

Predictive Model

Predictions: output of the predictive model given a set of input measurements

[Diagram: inputs → predictive model → predictions]

Page 11: Chapter 4: Predictive Modeling

11

Modeling Essentials

Determine type of prediction.

Select useful inputs.

Optimize complexity.

Page 12: Chapter 4: Predictive Modeling

12

Select useful inputs.

Optimize complexity.

Modeling Essentials

Determine type of prediction.

Page 13: Chapter 4: Predictive Modeling

13

Three Prediction Types

rankings, estimates, decisions

[Diagram: inputs → prediction]

Page 14: Chapter 4: Predictive Modeling

14

Decision Predictions

A predictive model uses input measurements to make the best decision for each case.

[Example decision predictions for five cases: primary, secondary, secondary, primary, tertiary]

Page 15: Chapter 4: Predictive Modeling

15

Ranking Predictions

A predictive model uses input measurements to optimally rank each case.

[Example ranking predictions for five cases: 720, 520, 630, 470, 580]

Page 16: Chapter 4: Predictive Modeling

16

Estimate Predictions

A predictive model uses input measurements to optimally estimate the target value.

[Example estimate predictions for five cases: 0.65, 0.33, 0.75, 0.28, 0.54]

Page 17: Chapter 4: Predictive Modeling

17

Idea Exchange

Think of two or three business problems that would require each of the three types of prediction.
What would require a decision? How would you obtain information to help you in making a decision based on a model score?
What would require a ranking? How would you use this ranking information?
What would require an estimate? Would you estimate a continuous quantity, a count, a proportion, or some other quantity?

Page 18: Chapter 4: Predictive Modeling

18

Select useful inputs.

Optimize complexity.

Modeling Essentials – Predict Review

Determine type of prediction. Decide, rank, and estimate.

Page 19: Chapter 4: Predictive Modeling

19

Select useful inputs.

Determine type of prediction.

Optimize complexity.

Modeling Essentials

Page 20: Chapter 4: Predictive Modeling

20

Input Reduction Strategies

[Diagram: Redundancy (inputs x1 and x2) and Irrelevancy (predictions 0.40–0.70 versus inputs x3 and x4)]

Page 21: Chapter 4: Predictive Modeling

21

Input Reduction – Redundancy

[Diagram: redundant inputs x1 and x2]

Input x2 has the same information as input x1.

Example: x1 is household income and x2 is home value.

Page 22: Chapter 4: Predictive Modeling

22

Input Reduction – Irrelevancy

[Diagram: predictions (0.40–0.70) change with input x4 but much less with input x3]

Predictions change with input x4 but much less with input x3.

Example: Target is response to direct mail solicitation, x3 is religious affiliation, and x4 is response to previous solicitations.

Page 23: Chapter 4: Predictive Modeling

23

Modeling Essentials – Select Review

Determine type of prediction. Decide, rank, and estimate.
Select useful inputs. Eradicate redundancies and irrelevancies.
Optimize complexity.

Page 24: Chapter 4: Predictive Modeling

24

Select useful inputs.

Modeling Essentials

Determine type of prediction.

Optimize complexity.

Page 25: Chapter 4: Predictive Modeling

25

Data Partitioning

Partition available data into training and validation sets.

The model is fit on the training data set, and model performance is evaluated on the validation data set.

[Diagram: training data (inputs, target) and validation data (inputs, target)]
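As an illustration outside SAS Enterprise Miner, the same partitioning step can be sketched in Python with scikit-learn; the file name, the column names (RESPOND as the target), and the 50/50 split fraction are assumptions for the example, not the course's settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the modeling table; file name and columns are assumed for illustration.
catalog = pd.read_csv("catalog2010.csv")

X = catalog.drop(columns=["RESPOND"])   # inputs
y = catalog["RESPOND"]                  # binary target

# Hold out validation data; stratify so both partitions keep the target rate.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=27513
)

print(f"training cases: {len(X_train)}, validation cases: {len(X_valid)}")
```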

Page 26: Chapter 4: Predictive Modeling

26

Predictive Model Sequence

Create a sequence of models with increasing complexity.

[Diagram: training and validation data; model complexity levels 1 through 5]

Page 27: Chapter 4: Predictive Modeling

27

Model Performance Assessment

Rate model performance using validation data.

[Diagram: models of complexity 1 through 5 rated by validation assessment]

Page 28: Chapter 4: Predictive Modeling

28

Model Selection

Select the simplest model with the highest validation assessment.

[Diagram: the model of complexity 3 is selected by validation assessment]

Page 29: Chapter 4: Predictive Modeling

29

4.01 Multiple Choice Poll

The best model is the

a. simplest model with the best performance on the training data.

b. simplest model with the best performance on the validation data.

c. most complex model with the best performance on the training data.

d. most complex model with the best performance on the validation data.

Page 30: Chapter 4: Predictive Modeling

30

4.01 Multiple Choice Poll – Correct Answer

The best model is the

a. simplest model with the best performance on the training data.

b. simplest model with the best performance on the validation data. (correct)

c. most complex model with the best performance on the training data.

d. most complex model with the best performance on the validation data.

Page 31: Chapter 4: Predictive Modeling

31

Modeling Essentials – Optimize Review

Determine type of prediction. Decide, rank, and estimate.
Select useful inputs. Eradicate redundancies and irrelevancies.
Optimize complexity. Tune models with validation data.

Page 32: Chapter 4: Predictive Modeling

32

Chapter 4: Predictive Modeling

4.1 Introduction to Predictive Modeling

4.2 Predictive Modeling Using Decision Trees

4.3 Predictive Modeling Using Logistic Regression

4.4 Churn Case Study

4.5 A Note about Model Management

4.6 Recommended Reading

Page 33: Chapter 4: Predictive Modeling

33

Objectives

Explain the concept of decision trees.
Illustrate the modeling essentials of decision trees.
Construct a decision tree predictive model in SAS Enterprise Miner.

Page 34: Chapter 4: Predictive Modeling

34

Modeling Essentials – Decision Trees

Determine type of prediction.

Select useful inputs.

Optimize complexity.

Page 35: Chapter 4: Predictive Modeling

35

Simple Prediction Illustration

[Scatterplot of training data: inputs x1 and x2, each ranging from 0.0 to 1.0]

Predict dot color for each x1 and x2.

Page 36: Chapter 4: Predictive Modeling

36

Decision Tree Prediction Rules

[Scatterplot of x1 and x2 divided into rectangular regions by a tree: the root node splits on x2 (<0.63 versus ≥0.63), interior nodes split on x1 (<0.52 versus ≥0.52, and <0.51 versus ≥0.51), and the leaf nodes carry predictions such as 55%, 60%, and 70%]

Page 37: Chapter 4: Predictive Modeling

37

Decision Tree Prediction Rules

[Tree diagram repeated: root split on x2 at 0.63, interior splits on x1 at 0.52 and 0.51, leaves at 55%, 60%, and 70%]

Predict:

Page 38: Chapter 4: Predictive Modeling

38

Decision Tree Prediction Rules

[Tree diagram: the case being scored reaches the leaf with a 70% prediction]

Predict: Decision = [dot color], Estimate = 0.70

Page 39: Chapter 4: Predictive Modeling

39

Modeling Essentials – Decision Trees

Determine type of prediction: prediction rules.
Select useful inputs: split search.
Optimize complexity: pruning.

Page 40: Chapter 4: Predictive Modeling

40

Decision Tree Split Search

[Scatterplot of x1 and x2 with a candidate vertical partition of x1 into left and right groups; each candidate split is summarized in a classification matrix]

Calculate the logworth of every partition on input x1.

Page 41: Chapter 4: Predictive Modeling

41

Decision Tree Split Search

[Best partition of x1 at 0.52, with maximum logworth(x1) = 0.95 (left versus right)]

Select the partition with the maximum logworth.

Page 42: Chapter 4: Predictive Modeling

42

Decision Tree Split Search

[Best x1 partition: max logworth(x1) = 0.95; the left and right groups contain 53%/47% and 42%/58% of the two outcomes]

Repeat for input x2.

Page 43: Chapter 4: Predictive Modeling

43

Decision Tree Split Search

[Best x2 partition at 0.63, with max logworth(x2) = 4.92 (bottom versus top), compared with max logworth(x1) = 0.95 (left versus right)]

Page 44: Chapter 4: Predictive Modeling

44

Decision Tree Split Search

[max logworth(x2) = 4.92 versus max logworth(x1) = 0.95]

Compare partition logworth ratings.

Page 45: Chapter 4: Predictive Modeling

45

Decision Tree Split Search

[Partition rule: x2 < 0.63 versus x2 ≥ 0.63]

Create a partition rule from the best partition across all inputs.
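A minimal sketch of how a single candidate split could be scored, assuming the common definition of logworth as the negative base-10 logarithm of a chi-square test p-value on the split-by-target table; the toy data, the helper name, and the loop over thresholds are all illustrative, not the node's exact algorithm.

```python
import numpy as np
from scipy.stats import chi2_contingency

def logworth(x, y, threshold):
    """Logworth of splitting numeric input x at `threshold` for binary target y.

    Assumes logworth = -log10(chi-square p-value) of the 2x2 table formed by
    the split (left/right) versus the target (0/1).
    """
    left = x < threshold
    table = np.array([
        [np.sum(left & (y == 0)), np.sum(left & (y == 1))],
        [np.sum(~left & (y == 0)), np.sum(~left & (y == 1))],
    ])
    _, p_value, _, _ = chi2_contingency(table, correction=False)
    return -np.log10(p_value)

# Evaluate every candidate threshold on a toy x1 and keep the best one.
rng = np.random.default_rng(0)
x1 = rng.uniform(size=200)
y = (rng.uniform(size=200) < 0.4 + 0.3 * (x1 > 0.5)).astype(int)  # toy target

candidates = np.unique(x1)[1:]          # split between observed values
worths = [logworth(x1, y, t) for t in candidates]
best = candidates[int(np.argmax(worths))]
print(f"best split on x1 at {best:.2f}, logworth = {max(worths):.2f}")
```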

Page 46: Chapter 4: Predictive Modeling

46

Decision Tree Split Search

[Scatterplot split into the x2 < 0.63 and x2 ≥ 0.63 subsets]

Repeat the process in each subset.

Page 47: Chapter 4: Predictive Modeling

47

Decision Tree Split Search

[Within one subset, the best x1 partition is at 0.52, with max logworth(x1) = 5.72 (left versus right)]

Page 48: Chapter 4: Predictive Modeling

48

Decision Tree Split Search

[max logworth(x1) = 5.72, with left and right outcome mixes of 61%/39% and 55%/45%; the best x2 partition in this subset has max logworth(x2) = −2.01 (bottom versus top)]

Page 49: Chapter 4: Predictive Modeling

49

Decision Tree Split Search

[Compare: max logworth(x1) = 5.72 versus max logworth(x2) = −2.01]

Page 50: Chapter 4: Predictive Modeling

50

Decision Tree Split Search

[Best partition in the subset: x1 at 0.52 (max logworth 5.72); the x2 candidate (max logworth −2.01, outcome mixes 38%/62% and 55%/45%) is not chosen]

Page 51: Chapter 4: Predictive Modeling

51

Decision Tree Split Search

[Tree so far: root split x2 < 0.63 versus ≥ 0.63, then x1 < 0.52 versus ≥ 0.52 within one branch]

Create a second partition rule.

Page 52: Chapter 4: Predictive Modeling

52

Decision Tree Split Search

[Scatterplot fully partitioned into many rectangles]

Repeat to form a maximal tree.

Page 53: Chapter 4: Predictive Modeling

53

4.02 Poll

The maximal tree is usually the tree that you use to score new data.

Yes

No

Page 54: Chapter 4: Predictive Modeling

54

4.02 Poll – Correct Answer

The maximal tree is usually the tree that you use to score new data.

Yes

No (correct)

Page 55: Chapter 4: Predictive Modeling

55

Modeling Essentials – Decision Trees

Determine type of prediction: prediction rules.
Select useful inputs: split search.
Optimize complexity.

Page 56: Chapter 4: Predictive Modeling

56

Predictive Model Sequence

Create a sequence of models with increasing complexity.

[Diagram: training and validation data; model complexity levels 1 through 6]

Page 57: Chapter 4: Predictive Modeling

57

The Maximal Tree

Create a sequence of models with increasing complexity.

A maximal tree is the most complex model in the sequence.

[Diagram: training and validation data; complexity levels 1 through 6, with the maximal tree at the top]

Page 58: Chapter 4: Predictive Modeling

58

The Maximal Tree

A maximal tree is the most complex model in the sequence.

[Diagram: training and validation data; complexity levels 1 through 5]

Page 59: Chapter 4: Predictive Modeling

60

Pruning One Split

Each subtree’s predictive performance is rated on validation data.

[Diagram: training and validation data; subtrees of complexity 1 through 4]

Page 60: Chapter 4: Predictive Modeling

61

Pruning One Split

The subtree with the highest validation assessment is selected.

[Diagram: training and validation data; subtrees of complexity 1 through 4]

Page 61: Chapter 4: Predictive Modeling

62

Pruning Two Splits

Similarly, this is done for subsequent models.

[Diagram: training and validation data; subtrees of complexity 1 through 4]

Page 62: Chapter 4: Predictive Modeling

63

Pruning Two Splits

Prune two splits from the maximal tree,…

[Diagram: subtrees of complexity 1 through 3]

continued...

Page 63: Chapter 4: Predictive Modeling

64

Pruning Two Splits

…rate each subtree using validation assessment, and…

[Diagram: subtrees of complexity 1 through 3]

continued...

Page 64: Chapter 4: Predictive Modeling

65

Pruning Two Splits

…select the subtree with the best assessment rating.

[Diagram: subtrees of complexity 1 through 3]

Page 65: Chapter 4: Predictive Modeling

66

Subsequent Pruning

Continue pruning until all subtrees are considered.

[Diagram: training and validation data; model complexity axis]

Page 66: Chapter 4: Predictive Modeling

67

Selecting the Best Tree

Compare validation assessment between tree complexities.

[Diagram: validation assessment plotted against model complexity]

Page 67: Chapter 4: Predictive Modeling

68

Validation Assessment

Choose the simplest model with highest validation assessment.

[Diagram: validation assessment plotted against model complexity]
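The same idea can be sketched with scikit-learn's cost-complexity pruning, which is one way (not necessarily the Decision Tree node's exact algorithm) to build a sequence of subtrees and then pick the simplest one with the best validation assessment; the toy data and names below are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy data standing in for the training/validation partitions.
rng = np.random.default_rng(1)
X = rng.uniform(size=(1000, 2))
y = (((X[:, 1] > 0.63) & (X[:, 0] > 0.52)) ^ (rng.uniform(size=1000) < 0.1)).astype(int)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=1)

# Grow a large (near-maximal) tree, then walk the cost-complexity pruning path.
maximal = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = maximal.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

results = []
for alpha in alphas:
    subtree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    results.append((subtree.get_n_leaves(),
                    accuracy_score(y_valid, subtree.predict(X_valid))))

# Simplest subtree (fewest leaves) among those tied for the best validation accuracy.
best_acc = max(acc for _, acc in results)
best_leaves = min(n for n, acc in results if acc == best_acc)
print(f"selected subtree: {best_leaves} leaves, validation accuracy {best_acc:.3f}")
```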

Page 68: Chapter 4: Predictive Modeling

69

Validation Assessment

What are appropriate validation assessment ratings?

Page 69: Chapter 4: Predictive Modeling

70

Assessment Statistics

Ratings depend on:
target measurement (binary, continuous, and so on)
prediction type (decisions, rankings, estimates)

Page 70: Chapter 4: Predictive Modeling

71

Binary Targets

[Validation data with a binary target: 1 = primary outcome, 0 = secondary outcome]

Page 71: Chapter 4: Predictive Modeling

72

Binary Target Predictions

[For a binary target, predictions can take three forms: decisions (primary or secondary), rankings (for example, 720 or 520), or estimates (for example, 0.249)]

Page 72: Chapter 4: Predictive Modeling

73

Decision Optimization

[Decision predictions (primary or secondary) compared against the binary target]

Page 73: Chapter 4: Predictive Modeling

74

Decision Optimization – Accuracy

[Cases where the decision agrees with the outcome are true positives and true negatives]

Maximize accuracy: agreement between outcome and prediction

Page 74: Chapter 4: Predictive Modeling

75

Decision Optimization – Misclassification

[Cases where the decision disagrees with the outcome are false positives and false negatives]

Minimize misclassification: disagreement between outcome and prediction

Page 75: Chapter 4: Predictive Modeling

76

Ranking Optimization

[Ranking predictions (for example, scores of 720 and 520) compared against the binary target]

Page 76: Chapter 4: Predictive Modeling

77

Ranking Optimization – Concordance

Maximize concordance: proper ordering of primary and secondary outcomes (target = 0 → low score, target = 1 → high score)

Page 77: Chapter 4: Predictive Modeling

78

Ranking Optimization – Discordance

Minimize discordance: improper ordering of primary and secondary outcomes (target = 0 → high score, target = 1 → low score)

Page 78: Chapter 4: Predictive Modeling

79

Estimate Optimization

[Estimate predictions (for example, 0.249) compared against the binary target]

Page 79: Chapter 4: Predictive Modeling

80

Estimate Optimization – Squared Error

Minimize squared error: the squared difference between target and prediction, (target − estimate)²

Page 80: Chapter 4: Predictive Modeling

81

Complexity Optimization – Summary

decisions → accuracy / misclassification
rankings → concordance / discordance
estimates → squared error
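A brief sketch of how the three kinds of assessment statistics could be computed on a validation set in Python; the numbers are made up, the 0.5 decision cutoff is an assumption, and the pairwise concordance loop is a naive illustration rather than an efficient implementation.

```python
import numpy as np

y_valid = np.array([1, 0, 1, 1, 0])                    # binary target (validation)
p_hat   = np.array([0.62, 0.31, 0.249, 0.58, 0.44])    # estimates from a model

# Decisions: accuracy and misclassification at a 0.5 cutoff.
decision = (p_hat >= 0.5).astype(int)
accuracy = np.mean(decision == y_valid)
misclassification = 1.0 - accuracy

# Rankings: concordant / discordant pairs of (primary, secondary) cases.
pos, neg = p_hat[y_valid == 1], p_hat[y_valid == 0]
pairs = [(p, n) for p in pos for n in neg]
concordance = np.mean([p > n for p, n in pairs])
discordance = np.mean([p < n for p, n in pairs])

# Estimates: average squared error between target and estimate.
ase = np.mean((y_valid - p_hat) ** 2)

print(accuracy, misclassification, concordance, discordance, ase)
```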

Page 81: Chapter 4: Predictive Modeling

82

4.03 Quiz

What are some target variables that you might encounter that would require optimizing on…
accuracy/misclassification?
concordance/discordance?
average squared error?

Page 82: Chapter 4: Predictive Modeling

83

Statistical Graphs

ROC Curves

Gains and Lift Charts

Page 83: Chapter 4: Predictive Modeling

84

Decision Matrix

                    Predicted Negative    Predicted Positive
Actual Negative     True Negative         False Positive
Actual Positive     False Negative        True Positive

Page 84: Chapter 4: Predictive Modeling

85

Sensitivity

Sensitivity = True Positives / Actual Positives

Page 85: Chapter 4: Predictive Modeling

86

Positive Predicted Value

Positive predicted value = True Positives / Predicted Positives

Page 86: Chapter 4: Predictive Modeling

87

Specificity

Specificity = True Negatives / Actual Negatives

Page 87: Chapter 4: Predictive Modeling

88

Negative Predicted Value

Negative predicted value = True Negatives / Predicted Negatives
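These four ratios can be read straight off a decision matrix; a short Python sketch with made-up counts:

```python
# Decision matrix cells (made-up counts): rows = actual 0/1, columns = predicted 0/1.
tn, fp = 820, 180
fn, tp = 150, 350

sensitivity = tp / (tp + fn)        # true positives / actual positives
specificity = tn / (tn + fp)        # true negatives / actual negatives
ppv = tp / (tp + fp)                # true positives / predicted positives
npv = tn / (tn + fn)                # true negatives / predicted negatives

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f}")
```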

Page 88: Chapter 4: Predictive Modeling

89

ROC Curve

Page 89: Chapter 4: Predictive Modeling

90

Gains Chart
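As a sketch, both graphs can be produced from validation-set scores; roc_curve and roc_auc_score are scikit-learn's API, while the cumulative gains calculation below is a simple hand-rolled illustration on toy data.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)
y_valid = rng.integers(0, 2, size=500)                              # toy binary target
scores = np.clip(y_valid * 0.3 + rng.normal(0.4, 0.2, 500), 0, 1)   # toy model scores

# ROC curve: true-positive rate versus false-positive rate across cutoffs.
fpr, tpr, _ = roc_curve(y_valid, scores)
print("area under ROC curve:", round(roc_auc_score(y_valid, scores), 3))

# Cumulative gains: fraction of all responders captured in the top deciles.
order = np.argsort(-scores)
cum_responders = np.cumsum(y_valid[order]) / y_valid.sum()
deciles = [cum_responders[int(len(order) * d / 10) - 1] for d in range(1, 11)]
print("gains by decile:", np.round(deciles, 2))
```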

Page 90: Chapter 4: Predictive Modeling

91

Catalog Case Study: Steps to Build a Decision Tree

1. Add the CATALOG2010 data source to the diagram.

2. Use the Data Partition node to split the data into training and validation data sets.

3. Use the Decision Tree node to select useful inputs.

4. Use the Model Comparison node to generate model assessment statistics and plots.

Page 91: Chapter 4: Predictive Modeling

92

Constructing a Decision Tree Predictive Model

Catalog Case Study

Task: Construct a decision tree model.

Page 92: Chapter 4: Predictive Modeling

93

Chapter 4: Predictive Modeling

4.1 Introduction to Predictive Modeling

4.2 Predictive Modeling Using Decision Trees

4.3 Predictive Modeling Using Logistic Regression

4.4 Churn Case Study

4.5 A Note about Model Management

4.6 Recommended Reading

Page 93: Chapter 4: Predictive Modeling

94

Objectives

Explain the concepts of logistic regression.
Discuss modeling strategies for building a predictive model.
Fit a predictive logistic regression model in SAS Enterprise Miner.

Page 94: Chapter 4: Predictive Modeling

95

Modeling Essentials – Regressions

Determine type of prediction.

Select useful inputs.

Optimize complexity.

Page 95: Chapter 4: Predictive Modeling

97

Simple Linear Regression Model

Regression Best Fit Line

Page 96: Chapter 4: Predictive Modeling

98

Linear Regression Prediction Formula

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$

where $\hat{y}$ is the prediction estimate, $\hat{\beta}_0$ is the intercept estimate, $\hat{\beta}_1$ and $\hat{\beta}_2$ are parameter estimates, and $x_1$ and $x_2$ are input measurements.

Choose intercept and parameter estimates to minimize the squared error function over the training data:

$\sum_{\text{training data}} (y_i - \hat{y}_i)^2$
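A minimal least-squares fit in Python, just to make the formula concrete; the data and coefficients are toy values.

```python
import numpy as np

rng = np.random.default_rng(3)
x1, x2 = rng.uniform(size=200), rng.uniform(size=200)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 0.1, 200)   # toy target

# Solve for the intercept and parameter estimates that minimize squared error.
X = np.column_stack([np.ones_like(x1), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept and parameter estimates:", np.round(beta_hat, 3))
```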

Page 97: Chapter 4: Predictive Modeling

99

Binary Target

Linear regression does not work, because whatever the form of the equation, the results are generally unbounded.

Instead, you work with the probability p that the event will occur rather than a direct classification.

Page 98: Chapter 4: Predictive Modeling

100

Odds Instead of Probability

Consider the probability p of an event (such as a horse losing a race) occurring.

The probability of the event not occurring is 1 − p.

The odds of the event happening are p:(1 − p), although you more commonly express this as integers, such as a 19-to-1 long shot at the race track.

The ratio 19:1 means that the horse has one chance of winning for 19 chances of losing, or the probability of winning is 1/(19+1) = 5%.

$\text{odds} = \frac{p_{\text{win}}}{p_{\text{loss}}} = \frac{p}{1 - p}$

Page 99: Chapter 4: Predictive Modeling

101

Properties of Odds and Log Odds

Odds is not symmetric, varying from 0 to infinity.

Odds is 1 when the probability is 50%.

Log Odds is symmetric, going from minus infinity to positive infinity, like a line.

Log Odds is 0 when the probability is 50%.

It is highly negative for low probabilities and highly positive for high probabilities.

Properties of Odds versus Log Odds

[Chart: odds and log odds plotted against probability from 0% to 100%; log odds ranges from about −5 to +5]

Page 100: Chapter 4: Predictive Modeling

102

Logistic Regression Prediction Formula

$\log\!\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$

The left side of the equation is the logit score.

Page 101: Chapter 4: Predictive Modeling

103

Logit Link Function

The logit link function transforms probabilities (between 0 and 1) to logit scores (between −∞ and +∞):

$\text{logit}(\hat{p}) = \log\!\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$

[Plot of the logit link: probabilities from 0 to 1 mapped to logit scores of roughly −5 to 5]

Page 102: Chapter 4: Predictive Modeling

104

Logit Link Function

To obtain prediction estimates, the logit equation is solved for $\hat{p}$:

$\text{logit}(\hat{p}) = \log\!\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$

$\hat{p} = \frac{1}{1 + e^{-\text{logit}(\hat{p})}}$
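A tiny numeric sketch of the logit and its inverse (plain NumPy; the probability value 0.249 is just an illustrative score from the earlier slides):

```python
import numpy as np

def logit(p):
    """Probability -> logit score."""
    return np.log(p / (1 - p))

def inv_logit(score):
    """Logit score -> probability estimate."""
    return 1 / (1 + np.exp(-score))

p = 0.249
score = logit(p)
print(round(score, 3), round(inv_logit(score), 3))   # -1.104, 0.249
```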

Page 103: Chapter 4: Predictive Modeling

105

4.04 Poll

Linear regression on a binary target is a problem because predictions can range outside of 0 and 1.

Yes

No

Page 104: Chapter 4: Predictive Modeling

106

4.04 Poll – Correct Answer

Linear regression on a binary target is a problem because predictions can range outside of 0 and 1.

Yes (correct)

No

Page 105: Chapter 4: Predictive Modeling

107

Simple Prediction Illustration – Regressions

[Scatterplot of training data: inputs x1 and x2, each ranging from 0.0 to 1.0]

Predict dot color for each x1 and x2. Need intercept and parameter estimates.

$\text{logit}(\hat{p}) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$

Page 106: Chapter 4: Predictive Modeling

108

Simple Prediction Illustration – Regressions

Find the parameter estimates by maximizing the log-likelihood function.

$\text{logit}(\hat{p}) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$

Page 107: Chapter 4: Predictive Modeling

109

Simple Prediction Illustration – Regressions

Using the maximum likelihood estimates, the prediction formula assigns a logit score to each x1 and x2.

[Scatterplot overlaid with parallel contour lines of equal estimated probability: 0.40, 0.50, 0.60, 0.70]
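A compact sketch of the same idea in Python: fit a logistic regression by maximum likelihood and read off logit scores and probability estimates; the data are toy values, and scikit-learn's solver stands in for the likelihood maximization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.uniform(size=(500, 2))                                   # inputs x1, x2
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-(-3 + 3 * X[:, 0] + 3 * X[:, 1])))).astype(int)

model = LogisticRegression().fit(X, y)           # maximum likelihood estimates
print("intercept, parameters:", model.intercept_, model.coef_)

new_case = np.array([[0.3, 0.7]])
logit_score = model.decision_function(new_case)  # beta0 + beta1*x1 + beta2*x2
p_hat = model.predict_proba(new_case)[:, 1]      # estimate of P(target = 1)
print(round(logit_score[0], 3), round(p_hat[0], 3))
```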

Page 108: Chapter 4: Predictive Modeling

110

Regressions: Beyond the Prediction Formula

Manage missing values.

Interpret the model.

Account for nonlinearities.

Handle extreme or unusual values.

Use nonnumeric inputs.

Page 109: Chapter 4: Predictive Modeling

111

Regressions: Beyond the Prediction Formula

Manage missing values.

Interpret the model.

Account for nonlinearities.

Handle extreme or unusual values.

Use nonnumeric inputs.

Page 110: Chapter 4: Predictive Modeling

112

Missing Values and Regression Modeling

[Training data: inputs and target]

Problem 1: Training data cases with missing values on inputs used by a regression model are ignored.

Page 111: Chapter 4: Predictive Modeling

113

Missing Values and Regression Modeling

Consequence: Missing values can significantly reduce your amount of training data for regression modeling!

[Training data: inputs and target]

Page 112: Chapter 4: Predictive Modeling

114

Missing Values and the Prediction Formula

Predict: (x1, x2) = (0.3, ? )

Problem 2: Prediction formulas cannot score cases with missing values.

Page 113: Chapter 4: Predictive Modeling

115

Missing Values and the Prediction Formula

Problem 2: Prediction formulas cannot score cases with missing values.

Page 114: Chapter 4: Predictive Modeling

116

Missing Value Issues

Manage missing values.

Problem 2: Prediction formulas cannot score cases with missing values.

Problem 1: Training data cases with missing values on inputs used by a regression model are ignored.

Page 115: Chapter 4: Predictive Modeling

117

Missing Value Causes

Manage missing values.

Non-applicable measurement

No match on merge

Non-disclosed measurement

Page 116: Chapter 4: Predictive Modeling

118

Missing Value Remedies

Manage missing values.

Impute: $x_i = f(x_1, \ldots, x_p)$ (replace a missing value with a function of the other inputs).

Non-applicable measurement

No match on merge

Non-disclosed measurement
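A hedged sketch of one common remedy, imputation plus missing-value indicators, using scikit-learn; the median strategy, column names, and values are illustrative and not the Impute node's exact behavior.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame({
    "income":     [52.0, np.nan, 61.5, 48.0],     # non-disclosed measurement
    "home_value": [210.0, 180.0, np.nan, 150.0],  # no match on merge
})

# Flag which values were missing, then fill with a simple statistic (median here).
indicators = data.isna().astype(int).add_suffix("_missing")
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

scoring_ready = pd.concat([imputed, indicators], axis=1)
print(scoring_ready)
```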

Page 117: Chapter 4: Predictive Modeling

119

4.05 Poll

Observations with missing values should always be deleted from scoring because a predicted value cannot be determined.

Yes

No

Page 118: Chapter 4: Predictive Modeling

120

4.05 Poll – Correct Answer

Observations with missing values should always be deleted from scoring because a predicted value cannot be determined.

Yes

No (correct)

Page 119: Chapter 4: Predictive Modeling

121

Modeling Essentials – Regressions

Determine type of prediction: prediction formula.
Select useful inputs: sequential selection.
Optimize complexity: best model from sequence.

Page 120: Chapter 4: Predictive Modeling

122

Variable Redundancy

Page 121: Chapter 4: Predictive Modeling

123

Variable Clustering

[Ten inputs X1–X10 are grouped into clusters; one input represents each cluster, for example X1, X3, X4, X6, X8, X9, X10]

Inputs are selected by:
cluster representation
expert opinion
target correlation

Page 122: Chapter 4: Predictive Modeling

124

Selection by 1 – R² Ratio

For input X2, with $R^2 = 0.90$ to its own cluster and $R^2 = 0.01$ to the next closest cluster:

$\frac{1 - R^2_{\text{own cluster}}}{1 - R^2_{\text{next closest}}} = \frac{1 - 0.90}{1 - 0.01} = 0.101$
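A minimal sketch of the ratio as a helper function; the reading that a small ratio marks a good representative of its own cluster is the standard interpretation of this statistic.

```python
def one_minus_r2_ratio(r2_own_cluster: float, r2_next_closest: float) -> float:
    """Smaller values indicate a better representative of the input's own cluster."""
    return (1 - r2_own_cluster) / (1 - r2_next_closest)

print(one_minus_r2_ratio(0.90, 0.01))   # 0.101... for input X2 from the slide
```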

Page 123: Chapter 4: Predictive Modeling

125

Modeling Essentials – Regressions

Determine type of prediction: prediction formula.
Select useful inputs: sequential selection.
Optimize complexity: best model from sequence.

Page 124: Chapter 4: Predictive Modeling

126

Sequential Selection – Forward

[Animation: input p-values compared with the entry cutoff]

Inputs enter the model one at a time; at each step, the candidate input with the smallest p-value below the entry cutoff is added.
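A simplified sketch of forward selection by p-value with statsmodels; the 0.05 entry cutoff, the toy data, and the loop are illustrative, and the Regression node's actual selection involves more details.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = pd.DataFrame(rng.uniform(size=(400, 4)), columns=["x1", "x2", "x3", "x4"])
y = (rng.uniform(size=400) < 1 / (1 + np.exp(-(-2 + 4 * X["x1"] + 2 * X["x3"])))).astype(int)

entry_cutoff, selected = 0.05, []
while True:
    remaining = [c for c in X.columns if c not in selected]
    if not remaining:
        break
    # p-value of each candidate input when added to the current model
    pvals = {}
    for c in remaining:
        design = sm.add_constant(X[selected + [c]])
        fit = sm.Logit(y, design).fit(disp=0)
        pvals[c] = fit.pvalues[c]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= entry_cutoff:
        break
    selected.append(best)

print("selected inputs:", selected)
```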


Page 130: Chapter 4: Predictive Modeling

132

Sequential Selection – Backward

[Animation: input p-values compared with the stay cutoff]

Inputs are removed from the model one at a time; at each step, the input with the largest p-value above the stay cutoff is removed.


Page 139: Chapter 4: Predictive Modeling

141

Sequential Selection – Stepwise

[Animation: input p-values compared with the entry cutoff and the stay cutoff]

Stepwise selection combines the two: at each step an input with a p-value below the entry cutoff can enter, and any input already in the model whose p-value rises above the stay cutoff is removed.


Page 145: Chapter 4: Predictive Modeling

147

4.06 Poll

Different model selection methods often result in different candidate models. No one method is uniformly the best.

Yes

No

Page 146: Chapter 4: Predictive Modeling

148

4.06 Poll – Correct Answer

Different model selection methods often result in different candidate models. No one method is uniformly the best.

Yes (correct)

No

Page 147: Chapter 4: Predictive Modeling

149

Modeling Essentials – Regressions

Determine type of prediction: prediction formula.
Select useful inputs: variable clustering and selection.
Optimize complexity.

Page 148: Chapter 4: Predictive Modeling

150

Model Fit versus Complexity

[Plot: model fit statistic for training and validation data across sequence steps 1 through 6]

Page 149: Chapter 4: Predictive Modeling

151

Select Model with Optimal Validation Fit

Evaluate each sequence step.

[Plot: validation fit statistic across sequence steps 1 through 6, with the optimal step highlighted]

Page 150: Chapter 4: Predictive Modeling

152

Beyond the Prediction Formula

Manage missing values.

Interpret the model.

Account for nonlinearities.

Handle extreme or unusual values.

Use nonnumeric inputs.

Page 151: Chapter 4: Predictive Modeling

153

Interpretation

A unit change in x2 produces a $\hat{\beta}_2$ change in $\text{logit}(p)$, which corresponds to a $100\,(\exp(\hat{\beta}_2) - 1)\%$ change in the odds.

Page 152: Chapter 4: Predictive Modeling

154

Odds Ratio from a Logistic Regression Model

Estimated logistic regression model:

$\text{logit}(p) = -0.7567 + 0.4373 \cdot \text{gender}$

Estimated odds ratio (Females to Males):

$\text{odds ratio} = \frac{e^{-0.7567 + 0.4373}}{e^{-0.7567}} = e^{0.4373} = 1.55$

An odds ratio of 1.55 means that females have 1.55 times the odds of having the outcome compared to males.

Page 153: Chapter 4: Predictive Modeling

155

Properties of the Odds Ratio

[Number line for the odds ratio: values below 1 mean the group in the denominator has higher odds of the event; a value of 1 means no association; values above 1 mean the group in the numerator has higher odds of the event]

Page 154: Chapter 4: Predictive Modeling

156

Beyond the Prediction Formula

Manage missing values.

Interpret the model.

Account for nonlinearities.

Handle extreme or unusual values.

Use nonnumeric inputs.

Page 155: Chapter 4: Predictive Modeling

157

Extreme Distributions and Regressions

[Original input scale: a skewed input distribution and high leverage points]

Page 156: Chapter 4: Predictive Modeling

158

Extreme Distributions and Regressions

[Original input scale: the true association between input and target, shown against the skewed distribution and high leverage points]

Page 157: Chapter 4: Predictive Modeling

159

Extreme Distributions and Regressions

[Original input scale: a standard regression fit pulled away from the true association by the skewed input distribution and high leverage points]

Page 158: Chapter 4: Predictive Modeling

160

Extreme Distributions and Regressions

high leverage pointsskewed inputdistribution

standard regression

true association

standard regression

true association

Original Input Scale

more symmetricdistribution

Regularized Scale

Page 159: Chapter 4: Predictive Modeling

161

Regularizing Input Transformations

[Standard regression fits shown on the original input scale (skewed distribution, high leverage points) and on the regularized scale (more symmetric distribution)]

Page 160: Chapter 4: Predictive Modeling

162

Regularizing Input Transformations

[On the regularized scale, the regularized estimate follows the true association much more closely than the standard regression on the original input scale]
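A simple sketch of a regularizing transformation, using a log transform as one common choice for a right-skewed, nonnegative input; the choice of transform and the toy data are assumptions, since the course leaves the specific transformation open.

```python
import numpy as np

rng = np.random.default_rng(6)
income = rng.lognormal(mean=10, sigma=1.0, size=1000)   # right-skewed input

income_log = np.log1p(income)   # regularized scale: log(1 + income)

# Skewness drops sharply after the transform (rough moment-based check).
def skew(x):
    return np.mean(((x - x.mean()) / x.std()) ** 3)

print(round(skew(income), 2), round(skew(income_log), 2))
```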

Page 161: Chapter 4: Predictive Modeling

163

Idea ExchangeWhat are examples of variables with unusual distributions that could produce problems in a regression model? Would you transform these variables? If so, what types of transformations would you entertain?

Page 162: Chapter 4: Predictive Modeling

164

Beyond the Prediction Formula

Manage missing values.

Interpret the model.

Account for nonlinearities.

Handle extreme or unusual values.

Use nonnumeric inputs.

Page 163: Chapter 4: Predictive Modeling

165

Nonnumeric Input Coding

Two-level variable:

Level  DA  DB
A       1   0
B       0   1

Coding redundancy: DB is always 1 − DA, so one dummy variable carries all of the information.

Page 164: Chapter 4: Predictive Modeling

166

Nonnumeric Input Coding: Many Levels

Level  DA DB DC DD DE DF DG DH DI
A       1  0  0  0  0  0  0  0  0
B       0  1  0  0  0  0  0  0  0
C       0  0  1  0  0  0  0  0  0
D       0  0  0  1  0  0  0  0  0
E       0  0  0  0  1  0  0  0  0
F       0  0  0  0  0  1  0  0  0
G       0  0  0  0  0  0  1  0  0
H       0  0  0  0  0  0  0  1  0
I       0  0  0  0  0  0  0  0  1

Page 165: Chapter 4: Predictive Modeling

167

Coding Redundancy: Many Levels

[The same table: DI is redundant because it equals 1 exactly when DA through DH are all 0]

Page 166: Chapter 4: Predictive Modeling

168

Coding Consolidation

[The full dummy coding for levels A–I, repeated as the starting point for consolidation]

Page 167: Chapter 4: Predictive Modeling

169

Coding Consolidation

Level  DABCD DB DC DD DEF DF DGH DH
A        1    0  0  0   0  0   0  0
B        1    1  0  0   0  0   0  0
C        1    0  1  0   0  0   0  0
D        1    0  0  1   0  0   0  0
E        0    0  0  0   1  0   0  0
F        0    0  0  0   1  1   0  0
G        0    0  0  0   0  0   1  0
H        0    0  0  0   0  0   1  1
I        0    0  0  0   0  0   0  0
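As an aside, this kind of dummy coding is easy to inspect in Python with pandas; get_dummies is the library call, and the consolidation step below is a hand-written illustration of grouping levels (the grouping itself is taken from the slide, not computed from data).

```python
import pandas as pd

levels = pd.Series(list("ABCDEFGHI"), name="level")

# One dummy variable per level (full coding).
full = pd.get_dummies(levels, prefix="D", prefix_sep="").astype(int)
print(full.head(3))

# Consolidate levels that behave similarly (grouping chosen for illustration).
groups = {"A": "ABCD", "B": "ABCD", "C": "ABCD", "D": "ABCD",
          "E": "EF", "F": "EF", "G": "GH", "H": "GH", "I": "I"}
consolidated = pd.get_dummies(levels.map(groups), prefix="D", prefix_sep="").astype(int)
print(consolidated.head(3))
```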

Page 168: Chapter 4: Predictive Modeling

170

Beyond the Prediction Formula

Manage missing values.

Interpret the model.

Account for nonlinearities.

Handle extreme or unusual values.

Use nonnumeric inputs.

Page 169: Chapter 4: Predictive Modeling

171

Standard Logistic Regression

$\log\!\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{w}_0 + \hat{w}_1 \cdot x_1 + \hat{w}_2 \cdot x_2$

[Scatterplot of x1 and x2, each ranging from 0.0 to 1.0]

Page 170: Chapter 4: Predictive Modeling

172

Polynomial Logistic Regression

$\log\!\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{w}_0 + \hat{w}_1 \cdot x_1 + \hat{w}_2 \cdot x_2 + \underbrace{\hat{w}_3 \cdot x_1^2 + \hat{w}_4 \cdot x_2^2 + \hat{w}_5 \cdot x_1 x_2}_{\text{quadratic terms}}$

[Scatterplot of x1 and x2 with curved probability contours from 0.30 to 0.80]
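A sketch of the same extension in Python, using PolynomialFeatures to add the quadratic and interaction terms before a logistic regression; the data are toy values with a curved class boundary, and degree 2 mirrors the formula above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
X = rng.uniform(size=(1000, 2))
y = (((X[:, 0] - 0.5) ** 2 + (X[:, 1] - 0.5) ** 2) < 0.1).astype(int)  # curved boundary

linear = LogisticRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                     LogisticRegression()).fit(X, y)

print("linear terms only:", round(linear.score(X, y), 3))
print("with quadratic terms:", round(poly.score(X, y), 3))
```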

Page 171: Chapter 4: Predictive Modeling

173

Idea ExchangeWhat are some predictors that you can think of that would have a nonlinear relationship with a target? What do you think the functional form of the relationship is (for example, quadratic, exponential, …)?

Page 172: Chapter 4: Predictive Modeling

174

Catalog Case Study

Analysis Goal:

A mail-order catalog retailer wants to save money on mailing and increase revenue by targeting mailed catalogs to customers most likely to purchase in the future.

Data set: CATALOG2010

Number of rows: 48,356

Number of columns: 98

Contents: sales figures summarized across departments and quarterly totals for 5.5 years of sales

Targets: RESPOND (binary)

ORDERSIZE (continuous)

Page 173: Chapter 4: Predictive Modeling

175

Fitting a Logistic Regression Model

Catalog Case Study

Task: Build a logistic regression model in SAS Enterprise Miner.

Page 174: Chapter 4: Predictive Modeling

176

Catalog Case Study: Steps to Build a Logistic Regression Model

1. Add the CATALOG2010 data source to the diagram.

2. Use the Data Partition node to split the data into training and validation data sets.

3. Use the Variable Clustering node to select relatively independent inputs.

4. Use the Regression node to select relevant inputs.

5. Use the Model Comparison node to generate model assessment statistics and plots.

In the previous example, you performed steps 1 and 2.

Page 175: Chapter 4: Predictive Modeling

177

Chapter 4: Predictive Modeling

4.1 Introduction to Predictive Modeling

4.2 Predictive Modeling Using Decision Trees

4.3 Predictive Modeling Using Logistic Regression

4.4 Churn Case Study

4.5 A Note about Model Management

4.6 Recommended Reading

Page 176: Chapter 4: Predictive Modeling

178

Objectives

Formulate an objective for predicting churn in a telecommunications example.
Generate predictive models in SAS Enterprise Miner to predict churn.
Score a customer database to target who is most likely to churn.

Page 177: Chapter 4: Predictive Modeling

179

Telecommunications Company

Mobile (prepaid and postpaid) and fixed service provider.
In recent years, a high percentage of high revenue subscribers have churned.
The company wants to target subscribers with a high churn probability for its customer retention program.

Page 178: Chapter 4: Predictive Modeling

180

Churn Score

A churn propensity score measures the propensity for an active customer to churn.
The score enables marketing managers to take proactive steps to retain targeted customers before churn occurs.
Churn scores are derived from analysis of the historical behavior of churned customers and existing customers who have not churned.

Page 179: Chapter 4: Predictive Modeling

181

Possible Predictor Variables

Outstanding bill value
Outstanding balance period
Number of calls
Call duration (international, local, national calls)
Period as customer
Total dropped calls
Total failed calls

Page 180: Chapter 4: Predictive Modeling

182

Model Implementation

[Diagram: inputs → model → predictions]

Predictions might be added to a data source inside or outside of SAS Enterprise Miner.
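Outside SAS Enterprise Miner, this implementation step might look like the following Python sketch; the file names, the joblib-serialized model, and the feature_names_in_ attribute (present when a scikit-learn model was fit on a DataFrame) are all assumptions for illustration.

```python
import pandas as pd
from joblib import load

# Score an active-customer table with a previously fitted model (names are illustrative).
model = load("churn_model.joblib")
customers = pd.read_csv("active_customers.csv")

# Assumes the model was fit on a DataFrame, so it carries its input column names.
customers["churn_score"] = model.predict_proba(customers[model.feature_names_in_])[:, 1]

# Hand the highest-risk customers to the retention program.
top_risk = customers.sort_values("churn_score", ascending=False).head(1000)
top_risk.to_csv("retention_target_list.csv", index=False)
```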

Page 181: Chapter 4: Predictive Modeling

183

Churn Case Study

1. Examine the CHURN_TELECOM data set and add it to a diagram.

2. Partition the data into training and validation data sets.

3. Perform missing value imputation.

4. Recode nominal variables to combine class levels.

5. Reduce redundancy with variable clustering.

6. Reduce irrelevant inputs with a decision tree and a logistic regression. Compare results and select the final model based on validation error.

7. Score a data set to generate the list of churn risk customers.

Page 182: Chapter 4: Predictive Modeling

184

Analyzing Churn Data

Churn Case Study

Task: Analyze churn data.

Page 183: Chapter 4: Predictive Modeling

185

Chapter 4: Predictive Modeling

4.1 Introduction to Predictive Modeling

4.2 Predictive Modeling Using Decision Trees

4.3 Predictive Modeling Using Logistic Regression

4.4 Churn Case Study

4.5 A Note about Model Management

4.6 Recommended Reading

Page 184: Chapter 4: Predictive Modeling

186

Objectives

Discuss the movement of analytics from the “back office” to the executive level and the reasons for these changes.
Describe the three-way pull for model management.
Explain why models must be maintained and reassessed over time.

Page 185: Chapter 4: Predictive Modeling

187

Model Management and Business Analytics

Model management is the assessment, deployment, and continued modification of models. This is a critical business process:
Demonstrate that the model is well developed.
Verify that the model is working well.
Perform outcomes analysis.

Model management requires a collaborative effort across the company: VP Decision Analysis and Support Group, Senior Modeling Analyst, Enterprise Architect, Internal Validation Compliance Analyst, Database Administrator.

Page 186: Chapter 4: Predictive Modeling

188

Analytical Model Management Challenges

Proliferation of Data and Models
Largely Manual Processes Moving to Production
Increased Regulation (Sarbanes-Oxley, Basel II)
Actionable Inferences
Integrating with Operational Systems

Page 187: Chapter 4: Predictive Modeling

189

Three-Way Pull for Model Management

Business Value
Governance Process
Production Process

Page 188: Chapter 4: Predictive Modeling

190

Three-Way Pull for Model Management

Business Value
Deployment of the “best” models
Consistent model development and validation
Understanding of model strategy and lifetime value

Production Process
Efficient deployment of models in a timely manner
Effective deployment to minimize operational risk

Governance Process
Audit trails for compliance purposes
Justification for management and shareholders

Page 189: Chapter 4: Predictive Modeling

191

Changes in the Analytical Landscape

[Diagram: stakeholders now include analytical modelers, management, IT ops, data integrators, business, and governance; operations such as customer service, retail, logistics, and promotions target customers, stockholders, suppliers, and employees]

Page 190: Chapter 4: Predictive Modeling

192

Model Management

As models proliferate, you need:

To be more diligent, but…
There is not an established process to handle model deployment into production.
Model deployment is inefficient.
More individuals and groups in the organization must be involved in the process.

To be more vigilant, but…
It is difficult to effectively manage existing models and track the model life cycle.
It is difficult to consistently provide appropriate internal and regulatory documentation.

Page 191: Chapter 4: Predictive Modeling

193

Idea Exchange

How can you implement model management in your organization? Do you already have systems in place for continuous improvement and monitoring of models? For audit trails and compliance checks? Describe briefly how they operate.

Page 192: Chapter 4: Predictive Modeling

194

Lessons Learned

Model management is a key part of good business analytics.
Models should be evaluated before, during, and after deployment.
New models replace old ones as dictated by the data over time.

Page 193: Chapter 4: Predictive Modeling

195

Chapter 4: Predictive Modeling

4.1 Introduction to Predictive Modeling

4.2 Predictive Modeling Using Decision Trees

4.3 Predictive Modeling Using Logistic Regression

4.4 Churn Case Study

4.5 A Note about Model Management

4.6 Recommended Reading

Page 194: Chapter 4: Predictive Modeling

196

Recommended Reading

Davenport, Thomas H., Jeanne G. Harris, and Robert Morison. 2010. Analytics at Work: Smarter Decisions, Better Results. Boston: Harvard Business Press. Chapters 7 and 8.

– Chapters 7 and 8 focus on making analytics an integral part of a business. Systems, processes, and organizational culture must work together to move toward analytical leadership. The remaining three chapters of the book (9-11) are optional, self-study material.

Page 195: Chapter 4: Predictive Modeling

197

Recommended Reading

May, Thornton. 2010. The New Know: Innovation Powered by Analytics. New York: Wiley. Chapter 1.

– May’s book provides a counterpoint to the Davenport, et al. book, from the perspective of the role of analysts in the organization, and how organizations can make the best use of their analytical talent.

Page 196: Chapter 4: Predictive Modeling

198

Recommended Reading

Morris, Michael. “Mining Student Data Could Save Lives.” The Chronicle of Higher Education. October 2, 2011. http://chronicle.com/article/Mining-Student-Data-Could-Save/129231/

This article discusses the mining of student data at colleges and universities to prevent large-scale acts of violence on campus. Mining of students’ data (including Internet usage and social networking data) would enhance the capacity of threat-assessment teams to protect the health and safety of the students.