Intro to Data Science - GitHub Pages
TRANSCRIPT
INTRO TO DATA SCIENCE
LECTURE 4: INTRO TO ML & KNN CLASSIFICATION
Francesco Mosconi
DAT10 SF // October 15, 2014
DATA SCIENCE IN THE NEWS
Source: http://f1metrics.wordpress.com/2014/10/03/building-a-race-simulator/
DATA SCIENCE IN THE NEWS
Source: http://www.pyimagesearch.com/2014/10/13/deep-learning-amazon-ec2-gpu-python-nolearn/
RECAP - LAST TIME
‣ Cleaning data
‣ Dealing with missing data
‣ Setting up GitHub for homework
QUESTIONS?
AGENDA
I. WHAT IS MACHINE LEARNING?
II. CLASSIFICATION PROBLEMS
III. BUILDING EFFECTIVE CLASSIFIERS
IV. THE KNN CLASSIFICATION MODEL

EXERCISES:
V. LAB: KNN CLASSIFICATION IN PYTHON
VI. BONUS LAB: VISUALIZATION WITH MATPLOTLIB (IF TIME ALLOWS)
I. WHAT IS MACHINE LEARNING?
WHAT IS MACHINE LEARNING?
from Wikipedia:
“Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data.”

“The core of machine learning deals with representation and generalization…”
‣ representation – extracting structure from data
‣ generalization – making predictions from data

source: http://en.wikipedia.org/wiki/Machine_learning
REMEMBER THIS? (the data science Venn diagram)
source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/

WE ARE NOW HERE

WE WANT TO GO HERE

QUESTION: What does it take to make this jump?
ANSWER: PROBLEM SOLVING!

NOTE: Implementing solutions to ML problems is the focus of this course!
THE STRUCTURE OF MACHINE LEARNING PROBLEMS
REMEMBER WHAT WE SAID BEFORE?
‣ Supervised learning – making predictions (generalization)
‣ Unsupervised learning – extracting structure (representation)
TYPES OF LEARNING PROBLEMS - SUPERVISED EXAMPLE

TYPES OF LEARNING PROBLEMS - UNSUPERVISED EXAMPLE
TYPES OF DATA
‣ Continuous (quantitative)
‣ Categorical (qualitative)

NOTE: The space where data live is called the feature space. Each point in this space is called a record.
TYPES OF ML SOLUTIONS
‣ Supervised + continuous → regression
‣ Supervised + categorical → classification
‣ Unsupervised + continuous → dimension reduction
‣ Unsupervised + categorical → clustering

NOTE: We will implement solutions using models and algorithms. Each will fall into one of these four buckets.
QUESTION: What is the goal of machine learning?
REMEMBER WHAT WE SAID BEFORE? Supervised → making predictions; unsupervised → extracting structure.

ANSWER: The goal is determined by the type of problem.
QUESTION: How do you determine the right approach?
ANSWER: The right approach is determined by the desired solution.

NOTE: All of this depends on your data!
QUESTION: What do you do with your results?
THE DATA SCIENCE WORKFLOW
source: http://benfry.com/phd/dissertation-110323c.pdf

ANSWER: Interpret them and react accordingly.

NOTE: This also relies on your problem solving skills!
II. CLASSIFICATION PROBLEMS
Classification is the supervised + categorical quadrant of the ML solutions grid (regression, classification, dimension reduction, clustering).
Here’s (part of) an example dataset:
‣ independent variables (the feature columns)
‣ class labels (qualitative)
Q: What does “supervised” mean?
A: We know the class labels (qualitative).
Q: How does a classification problem work?
A: Data in, predicted labels out.
source: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
Q: What steps does a classification problem require?
1) Split the dataset into a training set and a test set.
2) Train the model on the training set.
3) Test the model on the test set.
4) Make predictions on new data.

NOTE: This new data is called out of sample (OOS) data. We don’t know the labels for these OOS records!
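The four steps can be sketched in plain Python. This is a toy illustration only; the helper name `split_dataset` and the 70/30 split ratio are my own choices, not from the lecture:

```python
import random

def split_dataset(records, labels, test_fraction=0.3, seed=42):
    """Step 1: shuffle record indices and split into training and test sets."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test = [(records[i], labels[i]) for i in idx[:n_test]]
    train = [(records[i], labels[i]) for i in idx[n_test:]]
    return train, test

# Toy dataset: one numeric feature per record, two class labels.
records = [[1.0], [1.2], [0.9], [5.0], [5.2], [4.8], [1.1], [5.1]]
labels  = ['a',   'a',   'a',   'b',   'b',   'b',   'a',   'b']

train, test = split_dataset(records, labels)

# Steps 2-4 would then fit a model on `train`, measure its error on `test`,
# and finally predict labels for brand-new (out-of-sample) records.
print(len(train), len(test))  # prints: 6 2
```

Shuffling before splitting matters: if the records are ordered by class, a naive head/tail split would put one class entirely in the test set.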
III. BUILDING EFFECTIVE CLASSIFIERS
Q: What types of prediction error will we run into?
1) training error
2) generalization error
3) OOS error

NOTE: We want to estimate OOS prediction error so we know what to expect from our model.
TRAINING ERROR

Q: Why should we use training & test sets?

Thought experiment: Suppose instead we train our model using the entire dataset.
Q: How low can we push the training error?
- We can make the model arbitrarily complex (effectively “memorizing” the entire training set).
A: Down to zero!

NOTE: This phenomenon is called overfitting.
OVERFITTING
source: Data Analysis with Open Source Tools, by Philipp K. Janert. O’Reilly Media, 2011.

OVERFITTING - EXAMPLE
source: http://www.dtreg.com
A: Training error is not a good estimate of OOS accuracy.
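The “memorizing” intuition can be made concrete. This toy sketch (not from the slides) shows a “model” that simply stores the whole training set, driving training error to zero while learning nothing that generalizes:

```python
class MemorizingModel:
    """An 'arbitrarily complex' model: it just stores the training set."""
    def fit(self, records, labels):
        self.lookup = {tuple(r): y for r, y in zip(records, labels)}
        return self

    def predict(self, record):
        # Perfect on anything it has seen; a useless guess otherwise.
        return self.lookup.get(tuple(record), 'unknown')

records = [[1.0], [2.0], [3.0], [4.0]]
labels  = ['a',   'a',   'b',   'b']

model = MemorizingModel().fit(records, labels)
training_error = sum(model.predict(r) != y
                     for r, y in zip(records, labels)) / len(records)
print(training_error)        # prints: 0.0 -- zero training error...
print(model.predict([2.5]))  # prints: unknown -- no idea about OOS records
```

Zero training error here says nothing about OOS accuracy, which is exactly why training error alone is a misleading yardstick.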
GENERALIZATION ERROR

Suppose we do the train/test split.

Q: How well does generalization error predict OOS accuracy?
Thought experiment: Suppose we had done a different train/test split.
Q: Would the generalization error remain the same?
A: Of course not!
A: On its own, not very well.

NOTE: The generalization error gives a high-variance estimate of OOS accuracy.
Something is still missing!

Q: How can we do better?
Thought experiment: Different train/test splits will give us different generalization errors.
Q: What if we did a bunch of these and took the average?
A: Now you’re talking!
A: Cross-validation.
CROSS-VALIDATION

Steps for n-fold cross-validation:
1) Randomly split the dataset into n equal partitions.
2) Use partition 1 as the test set & the union of the other partitions as the training set.
3) Find the generalization error.
4) Repeat steps 2-3 using a different partition as the test set at each iteration.
5) Take the average generalization error as the estimate of OOS accuracy.
Features of n-fold cross-validation:
1) More accurate estimate of OOS prediction error.
2) More efficient use of data than a single train/test split.
- Each record in our dataset is used for both training and testing.
3) Presents a tradeoff between efficiency and computational expense.
- 10-fold CV is 10x more expensive than a single train/test split.
4) Can be used for model selection.
IV. KNN CLASSIFICATION
KNN CLASSIFICATION - BASICS

Suppose we want to predict the color of the grey dot (here, k = 3).
1) Pick a value for k.
2) Find the colors of the k nearest neighbors.
3) Assign the most common color to the grey dot.

OPTIONAL NOTE: Our definition of “nearest” implicitly uses the Euclidean distance function.
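These three steps translate almost directly into code. A minimal pure-Python sketch using Euclidean distance, as the note suggests (illustrative only, not the lab solution):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbors."""
    # Step 2: sort training records by Euclidean distance to the new point.
    neighbors = sorted(train, key=lambda rec: dist(rec[0], new_point))[:k]
    # Step 3: assign the most common label among the k nearest.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D data: a "blue" cluster near the origin, "orange" near (5, 5).
train = [((0, 0), 'blue'), ((1, 0), 'blue'), ((0, 1), 'blue'),
         ((5, 5), 'orange'), ((6, 5), 'orange'), ((5, 6), 'orange')]

print(knn_predict(train, (0.5, 0.5), k=3))  # prints: blue
print(knn_predict(train, (5.5, 5.5), k=3))  # prints: orange
```

Note there is no training step at all: k-NN defers all work to prediction time, which is why it is often called a lazy learner. An odd k avoids ties in two-class problems.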
Another example with k = 3: will our new example be blue or orange?
LABS