
Page 1: Predictive Modeling Workshop

PREDICTIVE MODELING WORKSHOP

Max Kuhn

OPEN DATA SCIENCE CONFERENCE

BOSTON 2015

@opendatasci

Page 2: Predictive Modeling Workshop

Predictive Modeling Workshop

Max Kuhn, Ph.D.

Pfizer Global R&D, Groton, CT

[email protected]

Page 3: Predictive Modeling Workshop

Outline

Modeling conventions

Modeling capabilities

Data splitting

Pre–processing

Measuring performance

Over–fitting and resampling

Classification trees, boosting

Support vector machines

Extra topics as time allows


Page 4: Predictive Modeling Workshop

Terminology

Data:

numeric data: numbers of any type (e.g. counts, sales price)

categorical or nominal data: non–numeric data (e.g. color, gender)

Variables:

outcomes: the data to be predicted

predictors (aka independent variables, descriptors, inputs): data used to predict the outcome

Models:

classification: models to predict categorical outcomes

regression: models to predict numeric outcomes

(these last two are imperfect definitions)


Page 5: Predictive Modeling Workshop

Modeling Conventions in R

Page 6: Predictive Modeling Workshop

The Formula Interface

There are two main conventions for specifying models in R: the formula interface and the non–formula (or “matrix”) interface.

For the former, the predictors are explicitly listed in an R formula that looks like: outcome ∼ var1 + var2 + ....

For example, the formula

modelFunction(price ~ numBedrooms + numBaths + acres,
              data = housingData)

would predict the closing price of a house using three quantitative characteristics.


Page 7: Predictive Modeling Workshop

The Formula Interface

The shortcut y ∼ . can be used to indicate that all of the columns in the data set (except y) should be used as a predictor.

The formula interface has many conveniences. For example, transformations, such as log(acres) can be specified in–line.

Unfortunately, R does not efficiently store the information about the formula. Using this interface with data sets that contain a large number of predictors may unnecessarily slow the computations.
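As a minimal sketch of both conveniences (using the built-in mtcars data, since housingData above is only illustrative):

# all columns except mpg enter as predictors
fit_all <- lm(mpg ~ ., data = mtcars)

# an in-line transformation of a predictor
fit_log <- lm(mpg ~ log(disp) + wt, data = mtcars)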


Page 8: Predictive Modeling Workshop

The Matrix or Non–Formula Interface

The non–formula interface specifies the predictors for the model using a matrix or data frame (all the predictors in the object are used in the model).

The outcome data are usually passed into the model as a vector object. For example:

modelFunction(x = housePredictors, y = price)

In this case, transformations of data or dummy variables must be created prior to being passed to the function.

Note that not all R functions have both interfaces.


Page 9: Predictive Modeling Workshop

Building and Predicting Models

Almost all modeling functions in R follow the same workflow:

1. Create the model using the basic function:
   fit <- knn(trainingData, outcome, k = 5)

2. Assess the properties of the model using print, plot, summary or other methods

3. Predict outcomes for samples using the predict method:
   predict(fit, newSamples)

The model can be used for prediction without changing the original model object.


Page 10: Predictive Modeling Workshop

Modeling Capabilities

Page 11: Predictive Modeling Workshop

Predictive Modeling Methods in R

As previously mentioned, there is a machine learning Task View page on the R website that does a good job of describing the range of models available.

parametric regression models: ordinary/generalized/robust regression models; neural networks; partial least squares; projection pursuit regression; multivariate adaptive regression splines; principal component regression

sparse/penalized models: ridge regression; the lasso; the elastic net; generalized linear models; partial least squares; nearest shrunken centroids; logistic regression

kernel methods: support vector machines; relevance vector machines; least squares support vector machine; Gaussian processes

(more)


Page 12: Predictive Modeling Workshop

Predictive Modeling Methods in R

trees/rule–based models: CART; C4.5; conditional inference trees; node harvest; Cubist; C5.0

ensembles: random forest; boosting (trees, linear models, generalized additive models, generalized linear models, others); bagging (trees, multivariate adaptive regression splines), rotation forests

prototype methods: k nearest neighbors; learning vector quantization

discriminant analysis: linear; quadratic; penalized; stabilized; sparse; mixture; regularized; stepwise; flexible

others: naive Bayes; Bayesian multinomial probit models


Page 13: Predictive Modeling Workshop

Model Function Consistency

Since there are many modeling packages written by different people, there are some inconsistencies in how models are specified and predictions are made.

For example, many models have only one method of specifying the model (e.g. formula method only).


Page 15: Predictive Modeling Workshop

The caret Package

The caret package was developed to:

create a unified interface for modeling and prediction (interfaces to 183 models – up from 112 a year ago)

streamline model tuning using resampling

provide a variety of “helper” functions and classes for day–to–day model building tasks

increase computational efficiency using parallel processing

First commits within Pfizer: 6/2005, First version on CRAN: 10/2007

Website: http://topepo.github.io/caret/

JSS Paper: http://www.jstatsoft.org/v28/i05/paper

Model List: http://topepo.github.io/caret/bytag.html

Many computing sections in APM


Page 16: Predictive Modeling Workshop

Example Data

Page 17: Predictive Modeling Workshop

Credit Score Data Set

These data can be found in Multivariate Statistical Modelling Based on Generalized Linear Models by Fahrmeir et al and are in the Fahrmeir package.

Data were collected by a South German bank to predict whether loan recipients would be good payers.

The data are moderate size: 1000 samples and 7 predictors.

The outcome was related to whether or not a recipient repaid their loan. There is a small class imbalance: 70% repaid.

Predictors are related to demographics, credit information and data related to the loan and bank account.


Page 18: Predictive Modeling Workshop

Credit Score Data Set

## translations, formatting, and dummy variables
library(Fahrmeir)
data(credit)

credit$Male         <- ifelse(credit$Sexo == "hombre", 1, 0)
credit$Lives_Alone  <- ifelse(credit$Estc == "vive solo", 1, 0)
credit$Good_Payer   <- ifelse(credit$Ppag == "pre buen pagador", 1, 0)
credit$Private_Loan <- ifelse(credit$Uso == "privado", 1, 0)
credit$Class        <- ifelse(credit$Y == "buen", "Good", "Bad")
credit$Class        <- factor(credit$Class, levels = c("Good", "Bad"))

credit$Y    <- NULL
credit$Sexo <- NULL
credit$Uso  <- NULL
credit$Ppag <- NULL
credit$Estc <- NULL

names(credit)[names(credit) == "Mes"]    <- "Loan_Duration"
names(credit)[names(credit) == "DM"]     <- "Loan_Amount"
names(credit)[names(credit) == "Cuenta"] <- "Credit_Quality"


Page 19: Predictive Modeling Workshop

Credit Score Data Set

> library(plyr)
>
> ## to make valid R column names
> trans <- c("good running" = "good_running", "bad running" = "bad_running")
> credit$Credit_Quality <- revalue(credit$Credit_Quality, trans)
>
> str(credit)
'data.frame': 1000 obs. of 8 variables:
 $ Credit_Quality: Factor w/ 3 levels "no","good_running",..: 1 1 3 1 1 1 1 1 2 3 ...
 $ Loan_Duration : num 18 9 12 12 12 10 8 6 18 24 ...
 $ Loan_Amount   : num 1049 2799 841 2122 2171 ...
 $ Male          : num 0 1 0 1 1 1 1 1 0 0 ...
 $ Lives_Alone   : num 1 0 1 0 0 0 0 0 1 1 ...
 $ Good_Payer    : num 1 1 1 1 1 1 1 1 1 1 ...
 $ Private_Loan  : num 1 0 0 0 0 0 0 0 1 1 ...
 $ Class         : Factor w/ 2 levels "Good","Bad": 1 1 1 1 1 1 1 1 1 1 ...


Page 20: Predictive Modeling Workshop

General Strategies

APM Ch 1, 2 and 4.

Page 21: Predictive Modeling Workshop

Model Building Steps

Common steps during model building are:

estimating model parameters (i.e. training models)

determining the values of tuning parameters that cannot be directly calculated from the data

calculating the performance of the final model that will generalize to new data

How do we “spend” the data to find an optimal model? We typically split data into training and test data sets:

Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.

Test Set (aka validation set): these data can be used to get an independent assessment of model efficacy. They should not be used during model training.


Page 22: Predictive Modeling Workshop

Spending Our Data

The more data we spend, the better estimates we’ll get (provided the data is accurate). Given a fixed amount of data,

too much spent in training won’t allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (over–fitting)

too much spent in testing won’t allow us to get a good assessment of model parameters

Statistically, the best course of action would be to use all the data for model building and use statistical methods to get good estimates of error.

From a non–statistical perspective, many consumers of these models emphasize the need for an untouched set of samples to evaluate performance.


Page 23: Predictive Modeling Workshop

Spending Our Data

There are a few different ways to do the split: simple random sampling, stratified sampling based on the outcome, by date, and methods that focus on the distribution of the predictors.

The base R function sample can be used to create a completely random sample of the data. The caret package has a function createDataPartition that conducts data splits within groups of the data. For classification, this means sampling within the classes so as to preserve the distribution of the outcome in the training and test sets.

For regression, the function determines the quartiles of the data set and samples within those groups.


Page 24: Predictive Modeling Workshop

Credit Data Set

For these data, let’s take a stratified random sample of 750 loans for training.

> set.seed(8140)
> in_train <- createDataPartition(credit$Class, p = .75, list = FALSE)
> head(in_train)
     Resample1
[1,]         1
[2,]         2
[3,]         3
[4,]         4
[5,]         5
[6,]         8

> train_data <- credit[ in_train,]
> test_data  <- credit[-in_train,]


Page 25: Predictive Modeling Workshop

Estimating Performance

APM Ch. 5 and 11

Page 26: Predictive Modeling Workshop

Estimating Performance

Later, once you have a set of predictions, various metrics can be used to evaluate performance.

For regression models:

R2 is very popular. In many complex models, the notion of the model degrees of freedom is difficult. Unadjusted R2 can be used, but does not penalize complexity

the root mean square error is a common metric for understanding the performance

Spearman’s correlation may be applicable for models that are used to rank samples

Of course, honest estimates of these statistics cannot be obtained by predicting the same samples that were used to train the model.

A test set and/or resampling can provide good estimates.
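As a small, hypothetical sketch of the metrics above (obs and pred are made-up vectors of observed and predicted values):

obs  <- c(10.1, 12.3,  9.8, 15.2, 11.0)           # hypothetical observed values
pred <- c( 9.5, 12.9, 10.4, 14.1, 11.7)           # hypothetical predictions

rmse     <- sqrt(mean((obs - pred)^2))            # root mean squared error
r2       <- cor(obs, pred)^2                      # unadjusted R-squared
rank_cor <- cor(obs, pred, method = "spearman")   # Spearman's rank correlation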


Page 27: Predictive Modeling Workshop

Estimating Performance For Classification

For classification models:

overall accuracy can be used, but this may be problematic when the classes are not balanced.

the Kappa statistic takes into account the expected error rate:

    κ = (O − E) / (1 − E)

where O is the observed accuracy and E is the expected accuracy under chance agreement (a small computational sketch follows this list)

For 2–class models, Receiver Operating Characteristic (ROC) curves can be used to characterize model performance (more later)
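As a quick sketch, Kappa can be computed by hand from a confusion matrix (the counts below are purely illustrative):

tab <- matrix(c(163, 12, 49, 26), nrow = 2)     # illustrative 2 x 2 confusion matrix
n   <- sum(tab)
O   <- sum(diag(tab)) / n                       # observed accuracy
E   <- sum(rowSums(tab) * colSums(tab)) / n^2   # expected accuracy under chance
kappa <- (O - E) / (1 - E)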


Page 28: Predictive Modeling Workshop

Estimating Performance For Classification

A “confusion matrix” is a cross–tabulation of the observed and predicted classes.

R functions for confusion matrices are in the e1071 package (the classAgreement function), the caret package (confusionMatrix), the mda package (confusion) and others.

ROC curve functions are found in the pROC package (roc), the ROCR package (performance), the verification package (roc.area) and others.

We’ll use the confusionMatrix function and the pROC package later in this class.


Page 29: Predictive Modeling Workshop

Estimating Performance For Classification

For 2–class classification models we might also be interested in:

Sensitivity: given that a result is truly an event, what is the probability that the model will predict an event?

Specificity: given that a result is truly not an event, what is the probability that the model will predict a negative result?

(an “event” is really the event of interest)

These conditional probabilities are directly related to the false positive and false negative rate of a method.

Unconditional probabilities (the positive–predictive values and negative–predictive values) can be computed, but require an estimate of what the overall event rate is in the population of interest (aka the prevalence)


Page 30: Predictive Modeling Workshop

Estimating Performance For Classification

For our example, let’s choose the event to be the loan being repaid:

Sensitivity = (# truly repaid and predicted to be repaid) / (# truly repaid)

Specificity = (# truly not repaid and predicted to be not repaid) / (# truly not repaid)

The caret package has functions called sensitivity and specificity
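A small sketch of these helpers (pred and obs are hypothetical factor vectors with levels Good/Bad):

library(caret)
obs  <- factor(c("Good", "Good", "Bad", "Good", "Bad"),  levels = c("Good", "Bad"))
pred <- factor(c("Good", "Bad",  "Bad", "Good", "Good"), levels = c("Good", "Bad"))

sensitivity(pred, obs, positive = "Good")   # P(predict Good | truly Good)
specificity(pred, obs, negative = "Bad")    # P(predict Bad  | truly Bad)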


Page 31: Predictive Modeling Workshop

ROC Curve

With two classes the Receiver Operating Characteristic (ROC) curve can be used to estimate performance using a combination of sensitivity and specificity.

Given the probability of an event, many alternative cutoffs can be evaluated (instead of just a 50% cutoff). For each cutoff, we can calculate the sensitivity and specificity.

The ROC curve plots the sensitivity (i.e. the true positive rate) against one minus specificity (i.e. the false positive rate).

The area under the ROC curve is a common metric of performance.


Page 32: Predictive Modeling Workshop

ROC Curve

[Figure: example ROC curve plotting sensitivity against specificity, with probability cutoffs marked: 0, 0.25 (Sp = 0.6, Sn = 0.9), 0.50 (Sp = 0.7, Sn = 0.8), 0.75 (Sp = 0.9, Sn = 0.6) and 1.00 (Sp = 1.0, Sn = 0.0).]

Page 33: Predictive Modeling Workshop

Over–Fitting and Model Tuning

APM Ch. 4

Page 34: Predictive Modeling Workshop

Over–Fitting

Over–fitting occurs when a model inappropriately picks up on trends in the training set that do not generalize to new samples.

When this occurs, assessments of the model based on the training set can show good performance that does not reproduce in future samples.

Some models have specific “knobs” to control over-fitting

neighborhood size in nearest neighbor models is an example

the number of splits in a tree model

Often, poor choices for these parameters can result in over-fitting

For example, the next slide shows a data set with two predictors. We want to be able to produce a line (i.e. decision boundary) that differentiates two classes of data.

Two new points are to be predicted. A 5–nearest neighbor model is illustrated.


Page 35: Predictive Modeling Workshop

K –Nearest Neighbors Classification

[Figure: scatter plot of Predictor A vs. Predictor B for Class 1 and Class 2, with the 5–nearest neighbor prediction illustrated for two new points.]

Page 36: Predictive Modeling Workshop

Over–Fitting

On the next slide, two classification boundaries are shown for a different model type not yet discussed.

The difference in the two panels is solely due to different choices in tuning parameters.

One over–fits the training data.


Page 37: Predictive Modeling Workshop

Two Model Fits

[Figure: two panels (Model #1 and Model #2) showing classification boundaries over Predictor A vs. Predictor B; the fits differ only in their tuning parameters, and one over–fits the training data.]

Page 38: Predictive Modeling Workshop

Characterizing Over–Fitting Using the Training Set

One obvious way to detect over–fitting is to use a test set. However, repeated “looks” at the test set can also lead to over–fitting.

Resampling the training samples allows us to know when we are making poor choices for the values of these parameters (the test set is not used).

Resampling methods try to “inject variation” in the system to approximate the model’s performance on future samples.

We’ll walk through several types of resampling methods for training set samples.

See the two blog posts “Comparing Different Species of Cross-Validation” at http://bit.ly/1yE0Ss5 and http://bit.ly/1zfoFj2


Page 39: Predictive Modeling Workshop

K –Fold Cross–Validation

Here, we randomly split the data into K distinct blocks of roughly equal size.

1. We leave out the first block of data and fit a model.

2. This model is used to predict the held–out block.

3. We continue this process until we’ve predicted all K held–out blocks.

The final performance is based on the hold-out predictions

K is usually taken to be 5 or 10 and leave one out cross–validation has each sample as a block

Repeated K –fold CV creates multiple versions of the folds and aggregates the results (I prefer this method)


Page 40: Predictive Modeling Workshop

K –Fold Cross–Validation


Page 41: Predictive Modeling Workshop

In R

Many packages have cross–validation functions, but they are usually limited to 10–fold CV.

caret has a general purpose function called train that has many resampling methods for many models (more later).

caret also has functions to produce sample splits for K–fold CV (createFolds), multiple training/test splits (createDataPartition) and bootstrap sampling (createResample).

Also, the base R function sample can be used to create completely random splits or bootstrap samples.
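A minimal sketch of createFolds on the credit outcome (the seed is a made-up choice; by default the result is a list of held-out row indices):

library(caret)
set.seed(42)                                 # arbitrary, for reproducibility
folds <- createFolds(credit$Class, k = 10)   # list of 10 sets of held-out rows
str(folds[1:2])                              # inspect the first two folds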


Page 42: Predictive Modeling Workshop

The Big Picture

We think that resampling will give us honest estimates of future performance, but there is still the issue of which model to select.

One algorithm to select models:

Define sets of model parameter values to evaluate;
for each parameter set do
    for each resampling iteration do
        Hold–out specific samples;
        Fit the model on the remainder;
        Predict the hold–out samples;
    end
    Calculate the average performance across hold–out predictions
end
Determine the optimal parameter set;


Page 43: Predictive Modeling Workshop

K –Nearest Neighbors Classification

[Figure: the two-predictor scatter plot of Class 1 and Class 2 again, with the 5–nearest neighbor example shown earlier.]

Page 44: Predictive Modeling Workshop

The Big Picture – K NN Example

Using k–nearest neighbors as an example:

Randomly put samples into 10 distinct groups;
for k = 1, 3, 5, . . . , 21 do
    for i = 1 . . . 10 do
        Hold–out block i;
        Fit the model on the other 90%;
        Predict the ith block and save results;
    end
    Calculate the average accuracy across the 10 hold–out sets of predictions
end
Determine k based on the highest cross–validated accuracy;


Page 45: Predictive Modeling Workshop

The Big Picture – K NN Example

[Figure: cross–validated accuracy profile as a function of the number of neighbors k.]

Page 46: Predictive Modeling Workshop

With a Different Set of Resamples

[Figure: the same accuracy profile computed with a different set of resamples.]

Page 47: Predictive Modeling Workshop

Data Pre–Processing

APM Ch. 3

Page 48: Predictive Modeling Workshop

Pre–Processing the Data

There are a wide variety of models in R. Some models have different assumptions on the predictor data and may need to be pre–processed.

For example, methods that use the inverse of the predictor cross–product matrix (i.e. (XᵀX)⁻¹) may require the elimination of collinear predictors.

Others may need the predictors to be centered and/or scaled, etc.

If any data processing is required, it is a good idea to base these calculations on the training set, then apply them to any data set used for model building or prediction.


Page 49: Predictive Modeling Workshop

Pre–Processing the Data

Examples of pre-processing operations:

centering and scaling

imputation of missing data

transformations of individual predictors

transformations of groups of predictors, such as the “spatial-sign” transformation (i.e. x* = x / ||x||)

feature extraction via PCA


Page 50: Predictive Modeling Workshop

Dummy Variables

Before pre-processing the data, there are a few predictors that are categorical in nature.

For these, we would convert the values to binary dummy variables prior to using them in numerical computations.

The core R function model.matrix can be used to do this.

If a categorical predictor has C levels, it would make C − 1 variables with values 0 or 1 (one level would be omitted).


Page 51: Predictive Modeling Workshop

Dummy Variables

In the credit data, the Current Credit Quality predictor has C = 3 distinct values: "no", "good_running" and "bad_running".

For these data, the possible dummy variables are:

                    Dummy Variable Columns
  Data Value        no   good_running   bad_running
  "no"               1              0             0
  "good_running"     0              1             0
  "bad_running"      0              0             1

For ordered categorical predictors, the default encoding is more complex. See “The Basics of Encoding Categorical Data for Predictive Models” at http://bit.ly/1CtXg0x


Page 52: Predictive Modeling Workshop

Dummy Variables

> alt_data <- model.matrix(Class ~ ., data = train_data)
> alt_data[1:4, 1:3]
  (Intercept) Credit_Qualitygood_running Credit_Qualitybad_running
1           1                          0                         0
2           1                          0                         0
3           1                          0                         1
4           1                          0                         0

We can get rid of the intercept column and use this data.

However, we would want to apply this same transformation to new data sets.

The caret function dummyVars has more options and can be applied to any data

Note: using the formula method with models automatically handles this.


Page 53: Predictive Modeling Workshop

Dummy Variables

> dummy_info <- dummyVars(Class ~ ., data = train_data)
> dummy_info
Dummy Variable Object

Formula: Class ~ .
8 variables, 2 factors
Variables and levels will be separated by '.'
A less than full rank encoding is used

> train_dummies <- predict(dummy_info, newdata = train_data)
> train_dummies[1:4, 1:3]
  Credit_Quality.no Credit_Quality.good_running Credit_Quality.bad_running
1                 1                           0                          0
2                 1                           0                          0
3                 0                           0                          1
4                 1                           0                          0

> test_dummies <- predict(dummy_info, newdata = test_data)


Page 54: Predictive Modeling Workshop

Dummy Variables and Model Functions

Most models are parameterized so that the predictors are required to be numeric. Linear regression, for example, doesn’t know what to do with a raw value of "green."

The primary convention in R is to convert factors to dummy variables when a model uses the formula interface.

However, this is not always the case. Many models using trees or rules (e.g. rpart, C5.0, randomForest, etc):

do not require numeric representations of the predictors

do not create dummy variables

Other notable exceptions are naive Bayes models and support vector machines using string kernel functions.


Page 56: Predictive Modeling Workshop

Centering and Scaling

The input is a matrix or data frame of predictor data. Once the values are calculated, the predict method can be used to do the actual data transformations.

First, estimate the standardization parameters:

> pp_values <- preProcess(train_dummies, method = c("center", "scale"))
> pp_values

Call:
preProcess.default(x = train_dummies, method = c("center", "scale"))

Created from 750 samples and 9 variables
Pre-processing: centered, scaled

Apply them to the training and test sets:

> train_scaled <- predict(pp_values, newdata = train_dummies)
> test_scaled  <- predict(pp_values, newdata = test_dummies)


Page 57: Predictive Modeling Workshop

Signal Extraction via PCA

Principal component analysis (PCA) can be used to create a (hopefully) small subset of new predictors that capture most of the information in the whole set.

The principal components are linear combinations of the individual predictors and usually have no meaningful interpretation.

The components are created sequentially:

the first captures the largest component of variation in the predictors

the second does the same for the leftover information, and so on

We can track how much of the variance is explained by the components and select enough to have fidelity to the original data.


Page 58: Predictive Modeling Workshop

Signal Extraction via PCA

The advantages to this approach:

the components are all uncorrelated with one another

a small number of predictors in the model can sometimes help

However...

this is not feature selection; all the predictors are still required

the components may not be correlated to the outcome


Page 59: Predictive Modeling Workshop

Signal Extraction via PCA in R

The base R function prcomp can be used to create the components.

preProcess can make the transformation and automatically retain the number of components to account for some pre-specified amount of information.

The predictors should be centered and scaled prior to the PCA extraction. preProcess will automatically do this even if you forget.
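A minimal prcomp sketch on the two example predictors (example_train is the data set introduced on the next slides):

pca <- prcomp(example_train[, 1:2], center = TRUE, scale. = TRUE)
summary(pca)       # proportion of variance explained per component
scores <- pca$x    # the principal component scores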


Page 60: Predictive Modeling Workshop

An Example

Another data set shows a nice example of PCA. There are two predictors and two classes:

> dim(example_train)
[1] 1009    3
> dim(example_test)
[1] 1010    3
> head(example_train)
   PredictorA PredictorB Class
2    3278.726  154.89876   One
3    1727.410   84.56460   Two
4    1194.932  101.09107   One
12   1027.222   68.71062   Two
15   1035.608   73.40559   One
16   1433.918   79.47569   One

Page 61: Predictive Modeling Workshop

Correlated Predictors

[Figure: scatter plot of the highly correlated PredictorA and PredictorB, colored by class (One, Two).]

Page 62: Predictive Modeling Workshop

Mildly Predictive of the Classes

[Figure: density plots of log(PredictorA) and log(PredictorB) by class; each predictor is mildly predictive of the classes.]

Page 63: Predictive Modeling Workshop

An Example

> pca_pp <- preProcess(example_train[, 1:2],
+                      method = "pca") # also added "center" and "scale"
> pca_pp

Call:
preProcess.default(x = example_train[, 1:2], method = "pca")

Created from 1009 samples and 2 variables
Pre-processing: principal component signal extraction, scaled, centered

PCA needed 2 components to capture 95 percent of the variance

> train_pc <- predict(pca_pp, example_train[, 1:2])
> test_pc  <- predict(pca_pp, example_test[, 1:2])
> head(test_pc, 4)
        PC1         PC2
1 0.8420447  0.07284802
5 0.2189168  0.04568417
6 1.2074404 -0.21040558
7 1.1794578 -0.20980371


Page 64: Predictive Modeling Workshop

A simple Rotation in 2D

[Figure: the training data plotted in the PC1/PC2 space by class; the PCA transformation amounts to a simple rotation in 2D.]

Page 65: Predictive Modeling Workshop

The First PC is Least Important Here

[Figure: density plots of PC1 and PC2 by class; here the first component is the least important for separating the classes.]

Page 66: Predictive Modeling Workshop

The Spatial Sign

This is another group transformation that can be effective when there are significant outliers in the data.

For P predictors, the data are projected onto a unit sphere in P dimensions.

Recall that our principal component values had very long tails. Let’s apply the spatial sign after PCA extraction.

The predictors should be centered and scaled prior to this transformation too.
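A minimal sketch of the transformation itself, done by hand on a numeric matrix (using the PCA scores from the previous slide as an illustration):

X <- as.matrix(train_pc)                                 # e.g. the PCA scores
X_ss <- t(apply(X, 1, function(x) x / sqrt(sum(x^2))))   # each row now has unit length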


Page 67: Predictive Modeling Workshop

Adding Another Step

> pca_ss_pp <- preProcess(example_train[, 1:2],
+                         method = c("pca", "spatialSign"))
> pca_ss_pp

Call:
preProcess.default(x = example_train[, 1:2], method = c("pca", "spatialSign"))

Created from 1009 samples and 2 variables

Pre-processing: principal component signal extraction, spatial sign transformation, scaled, centered

PCA needed 2 components to capture 95 percent of the variance

> train_pc_ss <- predict(pca_ss_pp, example_train[, 1:2])
> test_pc_ss  <- predict(pca_ss_pp, example_test[, 1:2])
> head(test_pc_ss, 4)
        PC1         PC2
1 0.9962786  0.08619129
5 0.9789121  0.20428207
6 0.9851544 -0.17167058
7 0.9845449 -0.17513231


Page 68: Predictive Modeling Workshop

Projected onto a Unit Sphere

[Figure: after the spatial sign transformation, the data lie on the unit circle in the PC1/PC2 space.]

Page 69: Predictive Modeling Workshop

Pre–Processing and Resampling

To get honest estimates of performance, all data transformations should be included within the cross-validation loop.

This is especially true for feature selection as well as pre-processing techniques (e.g. imputation, PCA, etc).

The train function, discussed later, can apply preProcess within resampling loops.


Page 70: Predictive Modeling Workshop

Filtering Problematic Variables

Some models have computational issues if predictors have degenerate distributions. For example, models that use XᵀX or its inverse might have issues with

outliers

predictors with a single value (aka zero-variance predictors)

highly unbalanced distributions

Other models are insensitive to the characteristics of the predictordistributions.

The caret functions findCorrelation and nearZeroVar can be used for unsupervised filtering of predictors.
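A small sketch of these filters (the 0.9 cutoff is just an illustrative choice):

nzv <- nearZeroVar(train_data)           # indices of (near) zero-variance columns
num_cols <- sapply(train_data, is.numeric)
high_cor <- findCorrelation(cor(train_data[, num_cols]), cutoff = 0.9)
# drop any flagged columns before modeling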


Page 71: Predictive Modeling Workshop

Feature Engineering

One of the most critical parts of the modeling process is feature engineering: how should the predictors enter the model?

For example, two predictors might be more informative if they enter the model as a ratio instead of as two main effects.

This requires an in-depth knowledge of the problem at hand and the nuances of the data.

Also, like pre-processing, this is highly model dependent.
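For instance, with the credit data one could add a hypothetical ratio predictor (this is an illustration, not something done in this workshop):

# hypothetical: amount borrowed per month of the loan
credit$Amount_per_Month <- credit$Loan_Amount / credit$Loan_Duration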


Page 72: Predictive Modeling Workshop

Example: Encoding Time and Date Data

Some applications have date or time data as predictors. How should we encode this?

numerical day of the year along with the year?

categorical or ordinal factors for the day of the week, week, month, or season, etc?

number of days from some reference date?

The answer depends on the type of model and the nature of the data.


Page 73: Predictive Modeling Workshop

Example: Encoding Time and Date Data

I have found the lubridate package to be invaluable in these cases.

Let’s load some example dates from an existing RData file:

> day_values <- c("2015-05-10", "1970-11-04", "2002-03-04", "2006-01-13")
> class(day_values)
[1] "character"

> library(lubridate)
> days <- ymd(day_values)
> str(days)
POSIXct[1:4], format: "2015-05-10" "1970-11-04" "2002-03-04" "2006-01-13"


Page 74: Predictive Modeling Workshop

Example: Encoding Time and Date Data

> day_of_week <- wday(days, label = TRUE)
> day_of_week
[1] Sun Wed Mon Fri
Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

> year(days)
[1] 2015 1970 2002 2006

> week(days)
[1] 19 45 10  2

> month(days, label = TRUE)
[1] May Nov Mar Jan
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

> yday(days)
[1] 130 308  63  13


Page 75: Predictive Modeling Workshop

Classification Models

APM Ch. 11-17

Page 76: Predictive Modeling Workshop

Classification Trees

A classification tree searches through each predictor to find a value of a single variable that best splits the data into two groups.

typically, the best split minimizes impurity of the outcome in the resulting data subsets.

For the two resulting groups, the process is repeated until a hierarchical structure (a tree) is created.

in effect, trees partition the X space into rectangular sections that assign a single value to samples within the rectangle.


Page 77: Predictive Modeling Workshop

An Example

There are many tree-based packages in R. The main packages for fitting single trees are rpart, RWeka, C50 and party. rpart fits the classical “CART” models of Breiman et al (1984).

To obtain a shallow tree with rpart:

> library(rpart)
> rpart1 <- rpart(Class ~ .,
+                 data = train_data,
+                 control = rpart.control(maxdepth = 2))
> rpart1
n= 750

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 750 225 Good (0.7000000 0.3000000)
  2) Credit_Quality=good_running 293 39 Good (0.8668942 0.1331058) *
  3) Credit_Quality=no,bad_running 457 186 Good (0.5929978 0.4070022)
    6) Loan_Duration< 31.5 366 129 Good (0.6475410 0.3524590) *
    7) Loan_Duration>=31.5 91 34 Bad (0.3736264 0.6263736) *


Page 79: Predictive Modeling Workshop

A Shallow rpart Tree Using the party Package

[Figure: the shallow rpart tree plotted via the party package. The root splits on Credit_Quality (good_running vs. no/bad_running), followed by a split on Loan_Duration at 31.5; bar charts show the Good/Bad proportions in Node 2 (n = 293), Node 4 (n = 366) and Node 5 (n = 91).]

Page 80: Predictive Modeling Workshop

Tree Fitting Process

Splitting would continue until some criterion for stopping is met, such as the minimum number of observations in a node.

The largest possible tree may over-fit and "pruning" is the process of iteratively removing terminal nodes and watching the changes in resampling performance (usually 10-fold CV). There are many possible pruning paths: how many possible trees are there with 6 terminal nodes?

Trees can be indexed by their maximum depth, and the classical CART methodology uses a cost-complexity parameter (Cp) to determine the best tree depth.
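A quick sketch of how to inspect rpart's cost-complexity pruning path (rpart_full is the full tree fit on the next slides):

printcp(rpart_full)   # Cp table: tree size vs. cross-validated error
plotcp(rpart_full)    # visualize the one-SE choice of Cp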


Page 81: Predictive Modeling Workshop

The Final Tree

Previously, we told rpart to use a maximum of two splits.

By default, rpart will conduct as many splits as possible, then use 10-fold cross-validation to prune the tree.

Specifically, the "one SE" rule is used: estimate the standard error of performance for each tree size, then choose the simplest tree within one standard error of the absolute best tree size.

> rpart_full <- rpart(Class ~ ., data = train_data)


Page 82: Predictive Modeling Workshop

The Final Tree

> rpart_full
n= 750

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 750 225 Good (0.7000000 0.3000000)
    2) Credit_Quality=good_running 293 39 Good (0.8668942 0.1331058) *
    3) Credit_Quality=no,bad_running 457 186 Good (0.5929978 0.4070022)
      6) Loan_Duration< 31.5 366 129 Good (0.6475410 0.3524590)
       12) Loan_Amount< 10975.5 359 122 Good (0.6601671 0.3398329)
         24) Good_Payer>=0.5 323 100 Good (0.6904025 0.3095975)
           48) Loan_Duration< 11.5 66 9 Good (0.8636364 0.1363636) *
           49) Loan_Duration>=11.5 257 91 Good (0.6459144 0.3540856)
             98) Loan_Amount>=1381.5 187 54 Good (0.7112299 0.2887701) *
             99) Loan_Amount< 1381.5 70 33 Bad (0.4714286 0.5285714)
              198) Private_Loan>=0.5 38 15 Good (0.6052632 0.3947368) *
              199) Private_Loan< 0.5 32 10 Bad (0.3125000 0.6875000) *
         25) Good_Payer< 0.5 36 14 Bad (0.3888889 0.6111111)
           50) Loan_Duration>=16.5 23 11 Good (0.5217391 0.4782609)
            100) Private_Loan>=0.5 11 3 Good (0.7272727 0.2727273) *
            101) Private_Loan< 0.5 12 4 Bad (0.3333333 0.6666667) *
           51) Loan_Duration< 16.5 13 2 Bad (0.1538462 0.8461538) *
       13) Loan_Amount>=10975.5 7 0 Bad (0.0000000 1.0000000) *
      7) Loan_Duration>=31.5 91 34 Bad (0.3736264 0.6263736)
       14) Loan_Duration< 47.5 54 25 Bad (0.4629630 0.5370370)
         28) Credit_Quality=bad_running 27 11 Good (0.5925926 0.4074074)
           56) Loan_Amount< 7759 20 6 Good (0.7000000 0.3000000) *
           57) Loan_Amount>=7759 7 2 Bad (0.2857143 0.7142857) *
         29) Credit_Quality=no 27 9 Bad (0.3333333 0.6666667) *
       15) Loan_Duration>=47.5 37 9 Bad (0.2432432 0.7567568) *


Page 83: Predictive Modeling Workshop

The Final rpart Tree

[Figure: the full rpart tree. It splits on Credit_Quality, Loan_Duration (at 31.5, 47.5, 16.5 and 11.5), Loan_Amount (at 10975.5, 1381.5 and 7759), Good_Payer and Private_Loan; each terminal node's bar chart shows its sample size and the Good/Bad class proportions.]

Page 84: Predictive Modeling Workshop

Test Set Results

> rpart_pred <- predict(rpart_full, newdata = test_data, type = "class")
> confusionMatrix(data = rpart_pred,            # requires 2 factor vectors
+                 reference = test_data$Class)
Confusion Matrix and Statistics

          Reference
Prediction Good Bad
      Good  163  49
      Bad    12  26

               Accuracy : 0.756
                 95% CI : (0.6979, 0.8079)
    No Information Rate : 0.7
    P-Value [Acc > NIR] : 0.02945

                  Kappa : 0.3237
 Mcnemar's Test P-Value : 4.04e-06

            Sensitivity : 0.9314
            Specificity : 0.3467
         Pos Pred Value : 0.7689
         Neg Pred Value : 0.6842
             Prevalence : 0.7000
         Detection Rate : 0.6520
   Detection Prevalence : 0.8480
      Balanced Accuracy : 0.6390

       'Positive' Class : Good


Page 85: Predictive Modeling Workshop

Creating the ROC Curve

The pROC package can be used to create ROC curves.

The function roc is used to capture the data and compute the ROC curve. The functions plot.roc and auc.roc generate the plot and area under the curve, respectively.

> class_probs <- predict(rpart_full, newdata = test_data)
> head(class_probs, 3)
        Good       Bad
6  0.8636364 0.1363636
7  0.8636364 0.1363636
13 0.8636364 0.1363636

> library(pROC)
> ## The roc function assumes the *second* level is the one of
> ## interest, so we use the 'levels' argument to change the order.
> rpart_roc <- roc(response = test_data$Class, predictor = class_probs[, "Good"],
+                  levels = rev(levels(test_data$Class)))
> ## Get the area under the ROC curve
> auc(rpart_roc)
Area under the curve: 0.7975


Page 86: Predictive Modeling Workshop

Tuning the Model

There are a few functions that can be used for this purpose in R:

the errorest function in the ipred package can be used to resample a single model (e.g. a gbm model with a specific number of iterations and tree depth)

the e1071 package has a function (tune) for five models that will conduct resampling over a grid of tuning values

caret has a similar function called train for over 183 models. Different resampling methods are available, as are custom performance metrics and facilities for parallel processing.


Page 87: Predictive Modeling Workshop

The train Function

The basic syntax for the function is:

> train(formula, data, method)

Looking at ?train, method = "rpart" can be used to tune a tree over Cp, so we can use:

> train(Class ~ ., data = train_data, method = "rpart")

We’ll add a bit of customization too.


Page 88: Predictive Modeling Workshop

The train Function

By default, the function will tune over 3 values of the tuning parameter (Cp for this model).

For rpart, the train function determines the distinct number of values of Cp for the data.

The tuneLength argument can be used to evaluate a broader set of models:

> rpart_tune <- train(Class ~ ., data = train_data,
+                     method = "rpart", tuneLength = 9)


Page 89: Predictive Modeling Workshop

The train Function

The default resampling scheme is the bootstrap. Let’s use 10-fold cross-validation instead.

To do this, there is a control function that handles some of the optional arguments.

To use five repeats of 10-fold cross-validation, we would use:

> cv_ctrl <- trainControl(method = "repeatedcv", repeats = 5)
> rpart_tune <- train(Class ~ ., data = train_data,
+                     method = "rpart", tuneLength = 9,
+                     trControl = cv_ctrl)


Page 90: Predictive Modeling Workshop

The train Function

Also, the default CART algorithm uses overall accuracy and the one standard-error rule to prune the tree.

We might want to choose the tree complexity based on the largest absolute area under the ROC curve.

A custom performance function can be passed to train. The package has one that calculates the ROC curve, sensitivity and specificity, called twoClassSummary. For example:

> twoClassSummary(fakeData)
   ROC   Sens   Spec
0.5020 0.1145 0.8827


Page 91: Predictive Modeling Workshop

The train Function

We can pass the twoClassSummary function in through trainControl. However, to calculate the ROC curve, we need the model to predict the class probabilities. The classProbs option will do this.

Finally, we tell the function to optimize the area under the ROC curve using the metric argument:

> cv_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+                         summaryFunction = twoClassSummary,
+                         classProbs = TRUE)
> set.seed(1735)
> rpart_tune <- train(Class ~ ., data = train_data,
+                     method = "rpart", tuneLength = 9,
+                     metric = "ROC", trControl = cv_ctrl)


Page 92: Predictive Modeling Workshop

Digression – Parallel Processing

Since we are fitting a lot of independent models over different tuning parameters and sampled data sets, there is no reason to do these sequentially.

R has many facilities for splitting computations up onto multiple cores or machines. See Tierney et al (2009, Journal of Statistical Software) for a recent review of these methods.


Page 93: Predictive Modeling Workshop

foreach and caret

To loop through the models and data sets, caret uses the foreach package, which parallelizes for loops.

foreach has a number of parallel backends which allow various technologies to be used in conjunction with the package. On CRAN, these are the doSomething packages, such as doMC, doMPI, doSMP and others.

For example, doMC uses the multicore package, which forks processes to split computations (for unix and OS X). doParallel works well for Windows (I’m told).


Page 94: Predictive Modeling Workshop

foreach and caret

To use parallel processing in caret, no changes are needed when calling train.

The parallel technology must be registered with foreach prior to calling train.

For multicore:

> library(doMC)
> registerDoMC(cores = 2)


Page 95: Predictive Modeling Workshop

Training Time (min)

50 bootstraps of a SVM model with 1000 samples and 400 predictors and the multicore package

[Figure: training time (minutes) as a function of the number of cores (2–12).]

Page 96: Predictive Modeling Workshop

Speed–Up

[Figure: parallel speed–up factor as a function of the number of cores (2–12).]

Page 97: Predictive Modeling Workshop

train Results

> rpart_tune
CART

750 samples
  7 predictor
  2 classes: 'Good', 'Bad'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)

Summary of sample sizes: 675, 675, 674, 674, 676, 675, ...

Resampling results across tuning parameters:

  cp       ROC    Sens   Spec    ROC SD  Sens SD  Spec SD
  0.00000  0.697  0.835  0.4447  0.0750  0.0488   0.115
  0.00222  0.696  0.841  0.4367  0.0744  0.0480   0.110
  0.00444  0.695  0.848  0.4163  0.0708  0.0467   0.102
  0.00778  0.695  0.864  0.3772  0.0877  0.0433   0.114
  0.00889  0.695  0.863  0.3817  0.0898  0.0455   0.119
  0.01111  0.693  0.861  0.3801  0.0959  0.0442   0.127
  0.01778  0.646  0.877  0.3529  0.1487  0.0526   0.101
  0.03333  0.554  0.912  0.2591  0.1876  0.0505   0.104
  0.05111  0.507  0.954  0.0996  0.1149  0.0510   0.106

ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.


Page 98: Predictive Modeling Workshop

Working With the train Object

There are a few methods of interest:

plot.train or ggplot.train can be used to plot the resampling profiles across the different models

print.train shows a textual description of the results

predict.train can be used to predict new samples

there are a few others that we’ll mention shortly.

Additionally, the final model fit (i.e. the model with the best resampling results) is in a sub-object called finalModel.

So in our example, rpart_tune is of class train and the object rpart_tune$finalModel is of class rpart.

Let’s look at what the plot method does.


Page 99: Predictive Modeling Workshop

Resampled ROC Profile

ggplot(rpart_tune)

[Figure: resampled ROC (repeated cross–validation) versus the complexity parameter; the area under the curve stays near 0.70 for small Cp values and declines sharply for larger values.]

Page 100: Predictive Modeling Workshop

Predicting New Samples

There are at least two ways to get predictions from a train object:

predict(rpart_tune$finalModel, newdata, type = "class")
predict(rpart_tune, newdata)

The first method uses predict.rpart, but is not preferred if using train.

predict.train does the same thing, but automatically uses the model that was selected by the train function.


Page 101: Predictive Modeling Workshop

Test Set Results

> rpart_pred2 <- predict(rpart_tune, newdata = test_data)
> confusionMatrix(rpart_pred2, test_data$Class)
Confusion Matrix and Statistics

          Reference
Prediction Good Bad
      Good  161  47
      Bad    14  28

               Accuracy : 0.756
                 95% CI : (0.6979, 0.8079)
    No Information Rate : 0.7
    P-Value [Acc > NIR] : 0.02945

                  Kappa : 0.3355
 Mcnemar's Test P-Value : 4.182e-05

            Sensitivity : 0.9200
            Specificity : 0.3733
         Pos Pred Value : 0.7740
         Neg Pred Value : 0.6667
             Prevalence : 0.7000
         Detection Rate : 0.6440
   Detection Prevalence : 0.8320
      Balanced Accuracy : 0.6467

       'Positive' Class : Good


Page 102: Predictive Modeling Workshop

Predicting Class Probabilities

predict.train has an argument type that can be used to get predictedclass probabilities for different models:

> rpart_probs <- predict(rpart_tune, newdata = test_data, type = "prob")> head(rpart_probs, n = 4)

Good Bad6 0.8636364 0.13636367 0.8636364 0.136363613 0.8636364 0.136363616 0.8636364 0.1363636

> rpart_roc <- roc(response = test_data$Class, predictor = rpart_probs[, "Good"],+> auc(rpart_roc)

levels = rev(levels(test_data$Class)))

Area under the curve: 0.8154


Page 103: Predictive Modeling Workshop

Classification Tree ROC Curve

plot(rpart_roc, legacy.axes = FALSE)

[Figure: the test set ROC curve for the classification tree, plotted as sensitivity versus specificity.]

Page 104: Predictive Modeling Workshop

Pros and Cons of Single Trees

Trees can be computed very quickly and have simple interpretations.

Also, they have built-in feature selection; if a predictor was not used in any split, the model is completely independent of that data.

Unfortunately, trees do not usually have optimal performance when compared to other methods. Also, small changes in the data can drastically affect the structure of a tree.

This last point has been exploited to improve the performance of trees via ensemble methods, where many trees are fit and predictions are aggregated across the trees. Examples are bagging, boosting and random forests.


Page 105: Predictive Modeling Workshop

Boosted Trees

APM Sect 14.5

Page 106: Predictive Modeling Workshop

Boosted Trees (Original “adaBoost” Algorithm)

A method to “boost” weak learning algorithms (e.g. single trees) into strong learning algorithms.

Boosted trees try to improve the model fit over different trees by considering past fits (not unlike iteratively reweighted least squares).

The basic adaBoost algorithm:

Initialize equal weights per sample;
for j = 1 . . . M iterations do
    Fit a classification tree using sample weights (denote the model equation as fj(x));
    forall the misclassified samples do
        increase sample weight
    end
    Save a “stage-weight” (βj) based on the performance of the current model;
end


Page 107: Predictive Modeling Workshop

Boosted Trees (Original “adaBoost” Algorithm)

In this formulation, the categorical response yi is coded as either {−1, 1} and the model fj(x) produces values of {−1, 1}.

The final prediction is obtained by first predicting using all M trees, then weighting each prediction:

    f(x) = (1/M) Σ βj fj(x),   summing over j = 1, . . . , M

where fj is the jth tree fit and βj is the stage weight for that tree.

The final class is determined by the sign of the model prediction.

In English: the final prediction is a weighted average of each tree’s prediction. The weights are based on the quality of each tree.


Page 108: Predictive Modeling Workshop

Statistical Approaches to Boosting

The original algorithm was developed outside of statistical theory. Statisticians discovered that the basic boosting algorithm was very similar to gradient-based steepest descent methods, such as Newton’s method.

Based on their observations, a new class of boosted tree models was proposed; these are generally referred to as “gradient boosting” methods.


Page 109: Predictive Modeling Workshop

Boosted Trees Parameters

Most implementations of boosting have three main tuning parameters:

number of iterations (i.e. trees)

complexity of the tree (i.e. number of splits)

learning rate (aka “shrinkage”): how quickly the algorithm adapts

Tree implementations such as gbm also use the minimum number of samples in a terminal node.

Boosting functions for trees in R: gbm in the gbm package, ada in ada, blackboost in mboost and C5.0 in the C50 package.


Page 110: Predictive Modeling Workshop

Using the gbm Package

The gbm function in the gbm package can be used to fit the model, then predict.gbm and other functions are used to predict and evaluate the model.

> library(gbm)
> # The gbm function does not accept factor response values so we
> # will make a copy and modify the result variable
> for_gbm <- train_data
> for_gbm$Class <- ifelse(for_gbm$Class == "Good", 1, 0)
> set.seed(10)
> gbm_fit <- gbm(formula = Class ~ .,          # Try all predictors
+                distribution = "bernoulli",   # For classification
+                data = for_gbm,
+                n.trees = 100,                # 100 boosting iterations
+                interaction.depth = 1,        # How many splits in each tree
+                shrinkage = 0.1,              # learning rate
+                verbose = FALSE)              # Do not print the details


Page 111: Predictive Modeling Workshop

gbm Predictions

We need to tell the predict method how many trees to use (here, the 100 that were fit).

Also, it does not predict the actual class. We’ll get the class probability and do the conversion.

> gbm_pred <- predict(gbm_fit, newdata = test_data, n.trees = 100,
+                     ## This calculates the class probs
+                     type = "response")
> head(gbm_pred)
[1] 0.6803420 0.7628453 0.6882552 0.8262940 0.8861083 0.7280930

> gbm_pred <- factor(ifelse(gbm_pred > .5, "Good", "Bad"),
+                    levels = c("Good", "Bad"))
> head(gbm_pred)
[1] Good Good Good Good Good Good
Levels: Good Bad


Page 112: Predictive Modeling Workshop

Test Set Results

> confusionMatrix(gbm_pred, test_data$Class)
Confusion Matrix and Statistics

          Reference
Prediction Good Bad
      Good  166  51
      Bad     9  24

               Accuracy : 0.76
                 95% CI : (0.7021, 0.8116)
    No Information Rate : 0.7
    P-Value [Acc > NIR] : 0.02104

                  Kappa : 0.3197
 Mcnemar's Test P-Value : 1.203e-07

            Sensitivity : 0.9486
            Specificity : 0.3200
         Pos Pred Value : 0.7650
         Neg Pred Value : 0.7273
             Prevalence : 0.7000
         Detection Rate : 0.6640
   Detection Prevalence : 0.8680
      Balanced Accuracy : 0.6343

       'Positive' Class : Good


Page 113: Predictive Modeling Workshop

Model Tuning using train

We can fit a series of boosted trees with different specifications and use resampling to understand which one is most appropriate.

For example, we can define a grid of 152 tuning parameter combinations:

number of trees: 100 to 1000 by 50

tree depth: 1 to 7 by 2

learning rate: 0.0001 and 0.01

minimum sample size of 10 (the default)

In R:

> gbm_grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
+                         n.trees = seq(100, 1000, by = 50),
+                         shrinkage = c(0.0001, 0.01),
+                         n.minobsinnode = 10)

We can use this grid to define exactly what models are evaluated by train. A list of model parameter names can be found using ?train.


Page 114: Predictive Modeling Workshop

Another Digression – The Ellipses

R has a great feature known as the ellipses (aka “three dots”) where an arbitrary number of function arguments can be passed through nested function calls. For example:

> average <- function(dat, ...) mean(dat, ...)
> names(formals(mean.default))
[1] "x"     "trim"  "na.rm" "..."

> average(dat = c(1:10, 100))
[1] 14.09091

> average(dat = c(1:10, 100), trim = .1)
[1] 6

train is structured to take full advantage of this so that arguments can be passed to the underlying model function (e.g. rpart, gbm) when calling the train function.


Page 115: Predictive Modeling Workshop

Model Tuning using train

> set.seed(1735)
> gbm_tune <- train(Class ~ ., data = train_data,
+                   method = "gbm",
+                   metric = "ROC",
+                   # Use a custom grid of tuning parameters
+                   tuneGrid = gbm_grid,
+                   trControl = cv_ctrl,
+                   # Remember the 'three dots' discussed previously?
+                   # This option is directly passed to the gbm function.
+                   verbose = FALSE)


Page 116: Predictive Modeling Workshop

Boosted Tree Results

> gbm_tune
Stochastic Gradient Boosting

750 samples
  7 predictor
  2 classes: 'Good', 'Bad'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)

Summary of sample sizes: 675, 675, 674, 674, 676, 675, ...

Resampling results across tuning parameters:

  shrinkage  interaction.depth  n.trees  ROC    Sens   Spec
  1e-04      1                   100     0.608  1.000  0.00000
  1e-04      1                   150     0.661  1.000  0.00000
  1e-04      1                   200     0.661  1.000  0.00000
  1e-04      1                   250     0.681  1.000  0.00000
  :          :                     :     :      :      :
  1e-02      7                   900     0.733  0.880  0.42518
  1e-02      7                   950     0.731  0.876  0.42783
  1e-02      7                  1000     0.729  0.877  0.43047

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 450, interaction.depth = 3,
shrinkage = 0.01 and n.minobsinnode = 10.

Max Kuhn (Pfizer) Predictive Modeling 115 / 142

Page 117: Predictive Modeling Workshop

Boosted Tree ROC Profile

ggplot(gbm_tune)

[Figure: ROC (repeated cross-validation) vs. the number of boosting iterations, paneled by shrinkage (0.001 and 0.010), with one curve per maximum tree depth (1, 3, 5, 7).]

Max Kuhn (Pfizer) Predictive Modeling 116 / 142


Page 118: Predictive Modeling Workshop

Test Set Results

> gbm_pred <- predict(gbm_tune, newdata = test_data) # Magic!
> confusionMatrix(gbm_pred, test_data$Class)

Confusion Matrix and Statistics

          Reference
Prediction Good Bad
      Good  167  52
      Bad     8  23

               Accuracy : 0.76
                 95% CI : (0.7021, 0.8116)
    No Information Rate : 0.7
    P-Value [Acc > NIR] : 0.02104

                  Kappa : 0.3135
 Mcnemar's Test P-Value : 2.836e-08

            Sensitivity : 0.9543
            Specificity : 0.3067
         Pos Pred Value : 0.7626
         Neg Pred Value : 0.7419
             Prevalence : 0.7000
         Detection Rate : 0.6680
   Detection Prevalence : 0.8760
      Balanced Accuracy : 0.6305

       'Positive' Class : Good

Max Kuhn (Pfizer) Predictive Modeling 117 / 142

Page 119: Predictive Modeling Workshop

GBM Probabilities

> gbm_probs <- predict(gbm_tune, newdata = test_data, type = "prob")
> head(gbm_probs)

       Good       Bad
1 0.7559353 0.2440647
2 0.8336368 0.1663632
3 0.7846890 0.2153110
4 0.8455947 0.1544053
5 0.8454949 0.1545051
6 0.6093760 0.3906240

Max Kuhn (Pfizer) Predictive Modeling 118 / 142

Page 120: Predictive Modeling Workshop

Boosted Tree ROC Curve

> gbm_roc <- roc(response = test_data$Class,
+                predictor = gbm_probs[, "Good"],
+                levels = rev(levels(test_data$Class)))
> auc(rpart_roc)

Area under the curve: 0.8154

> auc(gbm_roc)

Area under the curve: 0.8422

> plot(rpart_roc, col = "#9E0142", legacy.axes = FALSE)
> plot(gbm_roc, col = "#3288BD", legacy.axes = FALSE, add = TRUE)
> legend(.6, .5, legend = c("rpart", "gbm"),
+        lty = c(1, 1),
+        col = c("#9E0142", "#3288BD"))

Max Kuhn (Pfizer) Predictive Modeling 119 / 142

Page 121: Predictive Modeling Workshop

Boosted Tree ROC Curve

[Figure: test-set ROC curves (sensitivity vs. specificity) for the rpart and gbm models.]

Max Kuhn (Pfizer) Predictive Modeling 120 / 142


Page 122: Predictive Modeling Workshop

Support Vector Machines for Classification

APM Sect 13.4

Page 123: Predictive Modeling Workshop

Support Vector Machines (SVM)

This is a class of powerful and very flexible models. SVMs for classification use a completely different objective function called the margin.

Suppose a hypothetical situation: a data set with two predictors and we are trying to predict the correct class (two possible classes).

Let's further suppose that these two predictors completely separate the classes.

Max Kuhn (Pfizer) Predictive Modeling 122 / 142

Page 124: Predictive Modeling Workshop

Support Vector Machines

[Figure: scatter plot of the two classes, completely separable on Predictor 1 vs. Predictor 2.]

Max Kuhn (Pfizer) Predictive Modeling 123 / 142


Page 125: Predictive Modeling Workshop

Support Vector Machines (SVM)

There are an infinite number of straight lines that we can use to separate these two groups. Some must be better than others...

The margin is defined by equally spaced boundaries on each side of the line.

To maximize the margin, we try to make it as large as possible without capturing any samples.

As the margin increases, the solution becomes more robust.

SVMs determine the slope and intercept of the line which maximizes the margin.

Max Kuhn (Pfizer) Predictive Modeling 124 / 142

Page 126: Predictive Modeling Workshop

Maximizing the Margin

[Figure: the same two-class data with the maximum-margin line and its equally spaced margin boundaries on Predictor 1 vs. Predictor 2.]

Max Kuhn (Pfizer) Predictive Modeling 125 / 142


Page 127: Predictive Modeling Workshop

SVM Prediction Function

Suppose our two classes are coded as -1 or 1.

The SVM model estimates n parameters (α1 . . . αn) for the model.

Regularization is used to avoid saturated models (more on that in a minute).

For a new point u, the model predicts:

    f(u) = β0 + Σ_{i=1}^{n} y_i α_i (x_i'u)

The decision rule would predict class #1 if f(u) > 0 and class #2 otherwise.

Data points that are not support vectors have α_i = 0, so the prediction equation is only affected by the support vectors.
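To make this concrete, here is a minimal sketch of the linear decision function (a hypothetical helper for illustration, not how ksvm implements it):

## X: n x p matrix of training points; y: classes coded as -1/+1
## alpha: the estimated coefficients (zero for non-support vectors)
svm_decision <- function(u, X, y, alpha, beta0) {
  f <- beta0 + sum(y * alpha * drop(X %*% u))  # beta0 + sum_i y_i alpha_i x_i'u
  ifelse(f > 0, 1, -1)                         # the decision rule above
}

Dropping the rows of X where alpha is zero would not change f(u), which is why only the support vectors matter.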

Max Kuhn (Pfizer) Predictive Modeling 126 / 142

Page 128: Predictive Modeling Workshop

Support Vector Machines (SVM)

When the classes overlap, points are allowed within the margin. The number of points is controlled by a cost parameter.

The points that are within the margin (or on its boundary) are the support vectors.

Consequences of the fact that the prediction function only uses the support vectors:

the prediction equation is more compact and efficient

the model may be more robust to outliers

Max Kuhn (Pfizer) Predictive Modeling 127 / 142

Page 129: Predictive Modeling Workshop

The Kernel Trick

You may have noticed that the prediction function is a function of an inner product between two sample vectors (x'u). It turns out that this opens up some new possibilities.

Nonlinear class boundaries can be computed using the "kernel trick". The predictor space can be expanded by adding nonlinear functions in x. These functions, which must satisfy specific mathematical criteria, include common functions:

    Polynomial : K(x, u) = (1 + x'u)^p

    Radial basis function : K(x, u) = exp(−‖x − u‖² / (2σ²))

We don't need to store the extra dimensions; these functions can be computed quickly.
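For illustration, both kernels are one-liners in R (a sketch; kernlab's rbfdot uses the equivalent parameterization exp(-sigma * sum((x - u)^2))):

poly_kernel <- function(x, u, p = 2) (1 + sum(x * u))^p  # polynomial
rbf_kernel  <- function(x, u, sigma = 1)                 # radial basis
  exp(-sum((x - u)^2) / (2 * sigma^2))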

Max Kuhn (Pfizer) Predictive Modeling 128 / 142

Page 130: Predictive Modeling Workshop

SVM Regularization

As previously discussed, SVMs also include a regularization parameter that controls how much the decision boundary can adapt to the data; smaller values result in more linear (i.e., flat) surfaces.

This parameter is generally referred to as "Cost".

If the cost parameter is large, there is a significant penalty for having samples within the margin ⇒ the boundary becomes very flexible.

Tuning the cost parameter, as well as any kernel parameters, becomes very important, as these models have the ability to greatly over-fit the training data.


Max Kuhn (Pfizer) Predictive Modeling 129 / 142

Page 131: Predictive Modeling Workshop

SVM Models in R

There are several packages that include SVM models:

e1071 has the function svm for classification (2+ classes) and regression with 4 kernel functions

klaR has svmlight, which is an interface to the C library of the same name. It can do classification and regression with 4 kernel functions (or user-defined functions)

svmpath has an efficient function for computing 2-class models (including 2 kernel functions)

kernlab contains ksvm for classification and regression with 9 built-in kernel functions. Additional kernel function classes can be written. Also, ksvm can be used for text mining with the tm package.

Personally, I prefer kernlab because it is the most general and contains other kernel method functions (e1071 is probably the most popular).
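For reference, a direct kernlab fit outside of caret might look like this (a sketch; the argument values are illustrative):

library(kernlab)
## kpar = "automatic" (the default) estimates sigma internally;
## prob.model = TRUE adds the model needed for class probabilities.
svm_fit <- ksvm(Class ~ ., data = train_data,
                kernel = "rbfdot", C = 1, prob.model = TRUE)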

Max Kuhn (Pfizer) Predictive Modeling 130 / 142

Page 132: Predictive Modeling Workshop

Tuning SVM Models

We need to come up with reasonable choices for the cost parameter and any other parameters associated with the kernel, such as

the polynomial degree for the polynomial kernel

σ, the scale parameter for radial basis functions (RBF)

We'll focus on RBF kernel models here, so we have two tuning parameters.

However, there is a potential shortcut for RBF kernels. Reasonable values of σ can be derived from elements of the kernel matrix of the training set.

The manual for the sigest function in kernlab says "The estimation [for σ] is based upon the 0.1 and 0.9 quantile of ‖x − x′‖²."

Anecdotally, we have found that the mid-point between these two numbers can provide a good estimate for this tuning parameter. This leaves only the cost parameter for tuning.
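A sketch of that shortcut (sigest returns the 0.1, 0.5, and 0.9 quantile estimates; the mid-point line is the heuristic above, not a kernlab feature):

> library(kernlab)
> sig_range <- sigest(Class ~ ., data = train_data)
> sig <- mean(sig_range[c(1, 3)])  # mid-point of the 0.1 and 0.9 estimates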

Max Kuhn (Pfizer) Predictive Modeling 131 / 142

Page 133: Predictive Modeling Workshop

SVM Example

We can tune the SVM model over the cost parameter.

> set.seed(1735)
> svm_tune <- train(Class ~ ., data = train_data, method = "svmRadial",
+                   # The default grid of cost parameters goes from 2^-2,
+                   # 0.5, 1, ...
+                   # We'll fit 10 values in that sequence via the tuneLength
+                   # argument.
+                   tuneLength = 10,
+                   preProc = c("center", "scale"),
+                   metric = "ROC",
+                   trControl = cv_ctrl)

Max Kuhn (Pfizer) Predictive Modeling 132 / 142

Page 134: Predictive Modeling Workshop

SVM Example

> svm_tune

Support Vector Machines with Radial Basis Function Kernel

750 samples
  7 predictor
  2 classes: 'Good', 'Bad'

Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 675, 675, 674, 674, 676, 675, ...
Resampling results across tuning parameters:

  C       ROC    Sens   Spec   ROC SD  Sens SD  Spec SD
    0.25  0.719  0.914  0.308  0.0726  0.0443   0.0915
    0.50  0.719  0.923  0.308  0.0719  0.0414   0.0934
    1.00  0.719  0.924  0.295  0.0732  0.0429   0.0926
    2.00  0.717  0.924  0.278  0.0726  0.0429   0.0982
    4.00  0.710  0.925  0.265  0.0757  0.0464   0.0967
    8.00  0.691  0.928  0.233  0.0768  0.0457   0.1028
   16.00  0.675  0.934  0.208  0.0780  0.0413   0.0883
   32.00  0.667  0.948  0.174  0.0771  0.0354   0.0934
   64.00  0.655  0.958  0.143  0.0783  0.0369   0.0823
  128.00  0.644  0.966  0.112  0.0757  0.0291   0.0681

Tuning parameter 'sigma' was held constant at a value of 0.11402
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.114 and C = 0.5.

Max Kuhn (Pfizer) Predictive Modeling 133 / 142

Page 135: Predictive Modeling Workshop

SVM Example

> svm_tune$finalModel

Support Vector Machine object of class "ksvm"

SV type: C-svc (classification)
 parameter : cost C = 0.5

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.114020038085962

Number of Support Vectors : 458

Objective Function Value : -202.0199
Training error : 0.230667
Probability model included.

458 training data points (out of 750 samples) were used as support vectors.
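If needed, the indices of those support vectors can be recovered with kernlab's alphaindex accessor (a sketch):

> sv_index <- unlist(alphaindex(svm_tune$finalModel))
> length(sv_index)  # should match the 458 reported above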

Max Kuhn (Pfizer) Predictive Modeling 134 / 142

Page 136: Predictive Modeling Workshop

SVM ROC Profile

ggplot(svm_tune) + scale_x_log10()

[Figure: ROC (repeated cross-validation) vs. Cost on a log10 scale.]

Max Kuhn (Pfizer) Predictive Modeling 135 / 142


Page 137: Predictive Modeling Workshop

Test Set Results

> svm_pred <- predict(svm_tune, newdata = test_data)
> confusionMatrix(svm_pred, test_data$Class)

Confusion Matrix and Statistics

          Reference
Prediction Good Bad
      Good  165  49
      Bad    10  26

               Accuracy : 0.764
                 95% CI : (0.7064, 0.8152)
    No Information Rate : 0.7
    P-Value [Acc > NIR] : 0.01474

                  Kappa : 0.34
 Mcnemar's Test P-Value : 7.53e-07

            Sensitivity : 0.9429
            Specificity : 0.3467
         Pos Pred Value : 0.7710
         Neg Pred Value : 0.7222
             Prevalence : 0.7000
         Detection Rate : 0.6600
   Detection Prevalence : 0.8560
      Balanced Accuracy : 0.6448

       'Positive' Class : Good

Max Kuhn (Pfizer) Predictive Modeling 136 / 142

Page 138: Predictive Modeling Workshop

SVM Probability Estimates

> svm_probs <- predict(svm_tune, newdata = test_data, type = "prob")
> head(svm_probs)

       Good       Bad
1 0.8008819 0.1991181
2 0.8064645 0.1935355
3 0.8103000 0.1897000
4 0.7944448 0.2055552
5 0.6722484 0.3277516
6 0.7202097 0.2797903

Max Kuhn (Pfizer) Predictive Modeling 137 / 142

Page 139: Predictive Modeling Workshop

SVM ROC Curve

> svm_roc <- roc(response = test_data$Class,
+                predictor = svm_probs[, "Good"],
+                levels = rev(levels(test_data$Class)))
> auc(rpart_roc)

Area under the curve: 0.8154

> auc(gbm_roc)

Area under the curve: 0.8422

> auc(svm_roc)

Area under the curve: 0.7838

> plot(rpart_roc, col = "#9E0142", legacy.axes = FALSE)
> plot(gbm_roc, col = "#3288BD", legacy.axes = FALSE, add = TRUE)
> plot(svm_roc, col = "#F46D43", legacy.axes = FALSE, add = TRUE)
> legend(.6, .5, legend = c("rpart", "gbm", "svm"),
+        lty = c(1, 1, 1),
+        col = c("#9E0142", "#3288BD", "#F46D43"))

Max Kuhn (Pfizer) Predictive Modeling 138 / 142

Page 140: Predictive Modeling Workshop

SVM ROC Curve

[Figure: test-set ROC curves (sensitivity vs. specificity) for the rpart, gbm, and svm models.]

Max Kuhn (Pfizer) Predictive Modeling 139 / 142


Page 141: Predictive Modeling Workshop

Other Functions and Classes

bag: a general bagging function

upSample, downSample: functions for class imbalances

predictors: class for determining which predictors are included in the prediction equations (e.g. rpart, earth, lars models)

varImp: classes for assessing the aggregate effect of a predictor on the model equations

lift: creating lift/gain charts
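For example, a quick sketch of the class-imbalance helpers (reusing the train_data object from the earlier slides):

## down-sample the majority class so the two classes are balanced;
## the outcome column in the result is named via yname
balanced <- downSample(x = train_data[, names(train_data) != "Class"],
                       y = train_data$Class, yname = "Class")
table(balanced$Class)  # equal counts for Good and Bad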

Max Kuhn (Pfizer) Predictive Modeling 140 / 142

Page 142: Predictive Modeling Workshop

Other Functions and Classes

knnreg: nearest-neighbor regression

plsda, splsda: PLS discriminant analysis

icr: independent component regression

pcaNNet: nnet:::nnet with automatic PCA pre-processing step

bagEarth, bagFDA: bagging with MARS and FDA models

maxDissim: a function for maximum dissimilarity sampling

rfe, sbf, gafs, safs: classes/frameworks for recursive feature elimination (RFE), univariate filters, genetic algorithms, and simulated annealing feature selection methods

featurePlot: a wrapper for several lattice functions

Max Kuhn (Pfizer) Predictive Modeling 141 / 142

Page 143: Predictive Modeling Workshop

Session Info

R version 3.1.3 (2015-03-09), x86_64-apple-darwin10.8.0

Base packages: base, datasets, graphics, grDevices, grid, methods, parallel, splines, stats, tcltk, utils

Other packages: C50 0.1.0-24, caret 6.0-47, class 7.3-12, ctv 0.8-1, doMC 1.3.3, Fahrmeir 2012.04-0, foreach 1.4.2, Formula 1.2-0, gbm 2.1.1, ggplot2 1.0.1, Hmisc 3.15-0, iterators 1.0.7, kernlab 0.9-20, knitr 1.9, lattice 0.20-30, lubridate 1.3.3, MASS 7.3-40, mlbench 2.1-1, odfWeave 0.8.4, partykit 1.0-0, plyr 1.8.1, pROC 1.8, reshape2 1.4.1, rpart 4.1-9, survival 2.38-1, XML 3.98-1.1

Loaded via a namespace (and not attached): acepack 1.3-3.3, BradleyTerry2 1.0-6, brglm 0.5-9, car 2.0-25, cluster 2.0.1, codetools 0.2-10, colorspace 1.2-6, compiler 3.1.3, digest 0.6.8, e1071 1.6-4, evaluate 0.5.5, foreign 0.8-63, formatR 1.0, gtable 0.1.2, gtools 3.4.2, highr 0.4, labeling 0.3, latticeExtra 0.6-26, lme4 1.1-7, Matrix 1.2-0, memoise 0.2.1, mgcv 1.8-6, minqa 1.2.4, munsell 0.4.2, nlme 3.1-120, nloptr 1.0.4, nnet 7.3-9, pbkrtest 0.4-2, proto 0.3-10, quantreg 5.11, RColorBrewer 1.1-2, Rcpp 0.11.5, scales 0.2.4, SparseM 1.6, stringr 0.6.2, tools 3.1.3

Max Kuhn (Pfizer) Predictive Modeling 142 / 142