Lecture 8: Grid Search and Model Validation Continued
Mat Kallada
STAT2450 - Introduction to Data Mining with R
Outline for Today
Model Validation ←
Grid Search
Some Preliminary Notes
Thank you for submitting Assignment 1!
Assignment 2 has been posted.
Validating Predictive models
We build predictive models using a supervised data mining method
Decision Trees, KNN, Support Vector Machines, Neural Networks..
It is important that we validate if they actually work
There are many supervised data mining techniques out there to build predictive models
Validating Predictive models
What were the two approaches we saw to validate predictive models?
That is - to get an estimate of how well it’ll work in the real-world.
Before I use this to predict stock prices, maybe I should test if this actually works
Validating Predictive models
Hold-out validation: Split data into test/train once; the test part is used to determine performance
K-Fold Cross-Validation: Split data into test/train K different times; Average the results.
These are the two main approaches to “test” whether a data mining method will actually work in the real-world
Hold-out validation: In a Nutshell
Cut the rows of the data set into two separate parts
Train predictive model on one part
Validate performance on other part
This method makes sense intuitively.
Most people use 25% of the rows for testing as a rule of thumb
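As a rough sketch of the split described above, the rows can be cut at random in base R. The built-in iris data, the seed value and the variable names are assumptions chosen for illustration; they are not taken from the course material.

```r
set.seed(2450)  # arbitrary seed so the random split is reproducible

n <- nrow(iris)  # 150 rows

# Randomly pick 25% of the row indices for the testing part
test_rows <- sample(n, size = floor(0.25 * n))

testing_set  <- iris[test_rows, ]   # held back for validation
training_set <- iris[-test_rows, ]  # used to build the predictive model

nrow(testing_set)   # 37
nrow(training_set)  # 113
```

Sampling the row indices at random, rather than taking the first 25% of rows, matters for data such as iris that is sorted by class.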
Hold-out validation
This model should work 99% of the time in the real-world.
Let’s use this and predict the stock!
Hold-out Validation
We can compute two statistics here...
Training score: How well does the model predict observations it has been trained on?
Testing score: How well does the model work on unseen observations in the real-world?
Hold-out Validation: Illustrated
This is our original data set.
Height Species
2.3 Cat
4.65 Dog
6.87 Cat
3.5 Cat
6.3 Dog
7.4 Cat
There is only one feature in this data set.
Hold-out Validation: Illustrated
Split the data into two separate parts.
Typically 25% for testing provides a good assessment.
Training Set
Height Species
6.87 Cat
3.5 Cat
6.3 Dog
7.4 Cat
Testing Set
Height Species
2.3 Cat
4.65 Dog
Hold-out Validation: Illustrated
Build a predictive model using the training rows only
Training Set (used to build the predictive model)
Height Species
6.87 Cat
3.5 Cat
6.3 Dog
7.4 Cat
Testing Set
Height Species
2.3 Cat
4.65 Dog
Hold-out Validation: Illustrated
Validate whether the predictive model works on the test set to get the testing score
Testing Set
Height Species
2.3 Cat
4.65 Dog
<2.3> Cat ✓
<4.65> Dog X
The predictive model got 1 out of 2 predictions correct.
The classification accuracy on the test set is 50%
Hold-out Validation: Illustrated
Validate whether the predictive model works on the training set to get the training score
Training Set
Height Species
6.87 Cat
3.5 Cat
6.3 Dog
7.4 Cat
<6.87> Cat ✓
<3.5> Cat ✓
<6.3> Dog ✓
<7.4> Dog X
The predictive model got 3 out of 4 predictions correct.
The classification accuracy on the training set is 75%
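The two accuracy figures in this illustration are just the fraction of rows marked correct. A minimal sketch in R, where the logical vectors simply transcribe the ✓ and X marks from the slides (the variable names are made up for this example):

```r
# TRUE = row marked correct on the slide, FALSE = row marked incorrect
test_correct  <- c(TRUE, FALSE)              # heights 2.3, 4.65
train_correct <- c(TRUE, TRUE, TRUE, FALSE)  # heights 6.87, 3.5, 6.3, 7.4

# Classification accuracy = proportion of correct predictions
testing_score  <- mean(test_correct)   # 0.5  (1 out of 2)
training_score <- mean(train_correct)  # 0.75 (3 out of 4)
```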
Hold-out Validation: Testing Score
What does a high testing score mean?
It will probably work well in the real-world.
Likely did not overfit or underfit the data.
Hold-out Validation: Training Score
What does a high training score mean?
It might have overfit the data.
It predicted all the observations it trained on correctly.
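To see why a high training score alone says little, here is a sketch with K-nearest neighbours and K=1 on the built-in iris data. The `knn_predict` helper is a hand-written illustration (not a library function), and the seed and split are arbitrary assumptions.

```r
# Hypothetical helper: plain K-nearest-neighbours classifier in base R
knn_predict <- function(train_x, train_y, test_x, k) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(point) {
    # Euclidean distance from this point to every training row
    distances <- sqrt(colSums((t(train_x) - point)^2))
    nearest   <- order(distances)[seq_len(k)]
    votes     <- table(train_y[nearest])
    names(votes)[which.max(votes)]  # majority class among the k nearest
  })
}

set.seed(2450)
test_rows <- sample(nrow(iris), size = floor(0.25 * nrow(iris)))
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

train_pred <- knn_predict(train[, 1:4], train$Species, train[, 1:4], k = 1)
test_pred  <- knn_predict(train[, 1:4], train$Species, test[, 1:4],  k = 1)

# With K=1 every training row is its own nearest neighbour,
# so the training score is perfect no matter how noisy the data is
training_score <- mean(train_pred == train$Species)  # 1
testing_score  <- mean(test_pred == test$Species)    # depends on the split
```

The training score here is 1 by construction, which is why the testing score is the one to look at when judging real-world performance.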
Issues with Hold-out Validation
What if the testing rows are easy?
What if testing rows don’t truly assess the predictive model?
A more robust measure of real-world performance is K-Fold Cross Validation
2-Fold Cross Validation
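A minimal sketch of 2-fold cross-validation in base R: every row is assigned to one of two folds, each fold takes a turn as the testing part, and the two testing scores are averaged. The `knn_predict` helper, the choice of K=3 neighbours, the seed and the iris data are all assumptions made for illustration.

```r
# Hypothetical helper: plain K-nearest-neighbours classifier in base R
knn_predict <- function(train_x, train_y, test_x, k) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(point) {
    distances <- sqrt(colSums((t(train_x) - point)^2))
    nearest   <- order(distances)[seq_len(k)]
    votes     <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

set.seed(2450)
n_folds <- 2

# Randomly assign each row to fold 1 or fold 2 (75 rows each for iris)
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(iris)))

# Each fold is used once for testing; the other rows are used for training
fold_scores <- sapply(seq_len(n_folds), function(f) {
  test_rows <- which(fold_id == f)
  pred <- knn_predict(iris[-test_rows, 1:4], iris[-test_rows, 5],
                      iris[test_rows, 1:4], k = 3)
  mean(pred == iris[test_rows, 5])
})

cv_score <- mean(fold_scores)  # average of the per-fold testing scores
```

Changing `n_folds` to another value gives K-fold cross-validation with that many folds.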
K-Fold Cross Validation
Wait - you have K predictive models now.
Which one do you use in the real-world?
In hold-out validation, we only had one predictive model.
K-Fold Cross Validation
K-Fold Cross Validation gives a “real-world estimate” over the entire data set.
It tells you that:
“This data mining method will perform X well on the entire data set”
K-Fold Cross Validation
After you compute K-Fold Cross-Validation score and are happy with the score...
You should then train a final model on the entire dataset using that data mining technique
Model Validation: A Summary
Two main ways to validate predictive models and answer the question:
“How likely will this predictive model work in the real-world?”
There are also two main statistics that you can compute when validating models
Hold-out Validation in R
We can use the “trControl” parameter (built with trainControl) to specify the evaluation technique we want.
train_control = trainControl(method="LGOCV", p=0.75, number=1)
…..
model = train(iris[,1:4], iris[,5], method = "knn", trControl = train_control, tuneGrid=knn_options)
Hold-out Validation in R
DataJoy Example: https://www.getdatajoy.com/examples/56aa2ce939dc02266e7b0322
K-Fold Cross-Validation in R
To do K-Fold Cross Validation, all we need to do is change the method and the number of folds!
train_control = trainControl(method="cv", number=10)
…..
model = train(iris[,1:4], iris[,5], method = "knn", trControl = train_control, tuneGrid=knn_options)
K-Fold Cross-Validation in R
DataJoy Example: https://www.getdatajoy.com/examples/56aa2faf1d2486f244a6943f
Outline for Today
Model Validation
Grid Search ←
Hyperparameters: What are they again?
Things you specify before running a data mining method
These define how complex your predictive model will be.
This includes things like “K” in KNN, “cp” and “max_depth” in Decision Trees
As well as “C” and the “Kernel” for Support Vector Machines
“Tuning” Hyperparameters properly
We need to specify them before running a data mining technique
Finding the right combination of hyperparameters is called “tuning”
In other words, finding the optimal combination of hyperparameters which leads to the best performance
Why do we need to tune hyperparameters?
We need to conform to the noise and nature of a dataset.
When does K-nearest Neighbours with K=1 give good predictive models?
When does K-nearest Neighbours with K=10 give good predictive models?
How do we pick these values (like K in KNN)?
We said that we will try all possible combinations and pick the one with the best score.
The “sweet” spot for hyperparameters
Some combination of hyperparameters will give you the best generalization score
Formally called the Grid Search approach
Define a “grid” of all possible combinations.
Try them all out.
Use the one with the best testing score
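When there is more than one hyperparameter, the “grid” is every combination of the candidate values. A small sketch using base R's `expand.grid`, where the candidate values for “C” and the kernel are made-up examples rather than recommendations:

```r
# Hypothetical candidate values for two SVM hyperparameters
svm_grid <- expand.grid(
  C      = c(0.1, 1, 10),
  kernel = c("linear", "radial"),
  stringsAsFactors = FALSE
)

svm_grid        # one row per combination to try
nrow(svm_grid)  # 6 combinations: 3 values of C x 2 kernels
```

The grid grows multiplicatively with each extra hyperparameter, which is why the candidate lists are usually kept short.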
Grid Search
Define all possible values in a grid
Try all combinations of those values
Pick the combination that has best real-world generalization
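The three steps above can be sketched by hand in base R for K in KNN, using a hold-out testing score to compare the candidates. The `knn_predict` helper, the grid of K values, the seed and the iris data are assumptions for illustration only.

```r
# Hypothetical helper: plain K-nearest-neighbours classifier in base R
knn_predict <- function(train_x, train_y, test_x, k) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(point) {
    distances <- sqrt(colSums((t(train_x) - point)^2))
    nearest   <- order(distances)[seq_len(k)]
    votes     <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

set.seed(2450)
test_rows <- sample(nrow(iris), size = floor(0.25 * nrow(iris)))
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

# 1. Define the grid of candidate values
k_grid <- 1:6

# 2. Try every value and record its testing score
testing_scores <- sapply(k_grid, function(k) {
  pred <- knn_predict(train[, 1:4], train$Species, test[, 1:4], k)
  mean(pred == test$Species)
})

# 3. Pick the value with the best testing score (first one wins on ties)
best_k <- k_grid[which.max(testing_scores)]

data.frame(k = k_grid, testing_score = testing_scores)
```

Swapping the hold-out score for a K-fold cross-validation score gives a less split-dependent comparison at the cost of more computation.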
Grid Search in R
tuneGrid specifies the grid that we are searching over.
When you run train, R will try every hyperparameter combination in the grid.
We’ve been using tuneGrid with a constant value.
# Search for the K value from 1 to 6
knn_options = data.frame(k=1:6)
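Putting the pieces from this lecture together, a complete run might look like the sketch below. It assumes the caret package is installed; the seed is arbitrary, and 10-fold cross-validation is used as the score for each K.

```r
library(caret)  # provides trainControl() and train()

set.seed(2450)  # arbitrary seed so the folds are reproducible

# Score each candidate with 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)

# Search for the K value from 1 to 6
knn_options <- data.frame(k = 1:6)

model <- train(iris[, 1:4], iris[, 5], method = "knn",
               trControl = train_control, tuneGrid = knn_options)

model$results[, c("k", "Accuracy")]  # one cross-validated score per K
model$bestTune                       # the K with the best score
```

Here `model$results` holds one row per grid entry and `model$bestTune` reports the winning value, so the search does not need to be written by hand.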
Grid Searching in R
DataJoy Example: https://www.getdatajoy.com/examples/56aa32b91d2486f244a69442
That’s all for today.
Start Assignment 2.
We will look into visualization techniques next week.
Have a good weekend!