Lecture 27: Review

Reading: All chapters in ISLR.

STATS 202: Data mining and analysis

December 6, 2017

Announcements

Final exam:

- Tuesday, December 12, 8:30-11:30 am, in the following rooms:
  Last names A-L: Skilling Auditorium (here)
  Last names M-S: Gates B03
  Last names T-Z: Thornton 201.

- SCPD students: The exam will be sent out early Tuesday (12/12) morning, and your proctor must return it by Wednesday, December 13 at 2pm PST. If you wish to take the exam at Stanford with the rest of the class, you must tell SCPD (not me) ahead of time. Then pick one of the three classrooms above to take the exam.

- Closed book, closed notes. No calculators or computers.

- The final covers everything we did in class except non-linear dimensionality reduction, missing data, and learning from relational data. In other words, everything in the book.

- Practice problems and a practice final are on the website.

Announcements

The Kaggle project is due this Thursday: upload Homework 8 by 9am.

There is no class Friday. The instructor and TAs will have their usual office hours. The instructor is out of town Friday through Monday.

Unsupervised learning

- In unsupervised learning, all the variables are on an equal footing: there is no distinction between inputs and a response.

- Two sets of methods:

  1. PCA: find the main directions of variation in the data.
  2. Clustering: find meaningful groups of samples.
     - Hierarchical clustering (single, complete, or average linkage).
     - K-means clustering.

PCA

1. Find the linear combination of the variables,

   θ11 X1 + θ12 X2 + · · · + θ1p Xp,

   with Σi θ1i² = 1, which has the largest variance.

2. Find the linear combination of the variables,

   θ21 X1 + θ22 X2 + · · · + θ2p Xp,

   with Σi θ2i² = 1 and θ1 ⊥ θ2, which has the largest variance.

3. ...
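The procedure above can be sketched numerically: the successive variance-maximizing loading vectors are the eigenvectors of the sample covariance matrix. A minimal sketch with numpy (the data here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with one dominant direction of variation
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0, 0], [1, 1.0, 0], [0, 0, 0.3]])
Xc = X - X.mean(axis=0)                  # center the variables

C = np.cov(Xc, rowvar=False)             # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
theta1, theta2 = eigvecs[:, order[0]], eigvecs[:, order[1]]

# Loading vectors have unit length and are mutually orthogonal:
print(np.sum(theta1**2))                 # ~1.0
print(abs(theta1 @ theta2))              # ~0.0
# The first score has at least as much variance as the second:
print(np.var(Xc @ theta1) >= np.var(Xc @ theta2))  # True
```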

PCA

Some questions:

- What are the loadings?
- What are the score variables?
- What is a biplot, and how is it interpreted?
- What is the proportion of variance explained? A scree plot?
- What is the effect of rescaling the variables?
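For the proportion-of-variance-explained question, one way to make it concrete: the PVE of each component is its covariance eigenvalue divided by the total. An illustrative sketch (the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
# Four variables with decreasing scales, so the PVEs decay
X = rng.normal(size=(150, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]  # descending
pve = eigvals / eigvals.sum()

print(np.round(pve, 3))            # a scree plot graphs these values
print(np.isclose(pve.sum(), 1.0))  # the PVEs sum to one -> True
```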

K-means clustering

- The number of clusters is fixed at K.

- The goal is to minimize the average squared distance from each point to the mean of its cluster.

- The algorithm starts from some assignment and is guaranteed to decrease this average distance at each step.

- This finds a local minimum, not necessarily a global minimum, so we typically repeat the algorithm from many different random starting points.
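The points above can be sketched as a minimal Lloyd's-algorithm implementation with random restarts; the function name and toy data are my own, not from the course:

```python
import numpy as np

def kmeans(X, K, n_starts=10, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    best_cost, best_labels = np.inf, None
    for _ in range(n_starts):                      # random restarts
        centers = X[rng.choice(len(X), K, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(1)                   # assignment step
            centers = np.array([X[labels == k].mean(0) if np.any(labels == k)
                                else centers[k] for k in range(K)])  # update step
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        cost = d.min(1).sum()                      # within-cluster sum of squares
        if cost < best_cost:                       # keep the best local minimum
            best_cost, best_labels = cost, labels
    return best_labels, best_cost

# Two well-separated blobs are recovered cleanly:
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels, cost = kmeans(X, K=2)
print(len(set(labels[:30].tolist())) == 1)  # all of blob 1 in one cluster
```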

Hierarchical clustering

(Figure: example dendrograms under average, complete, and single linkage.)

- The agglomerative algorithm produces a dendrogram.

- At each step, we join the two clusters that are "closest":

  - Complete: the distance between clusters is the maximal distance between any pair of points.
  - Single: the distance between clusters is the minimal distance between any pair of points.
  - Average: the distance between clusters is the average distance over all pairs of points.

- The height of a branching point = the distance between the clusters joined.
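The three linkage rules are just different summaries of the same pairwise-distance matrix between two clusters. A small self-contained sketch (function name and points are my own):

```python
import numpy as np

def linkage_distance(A, B, method):
    # pairwise Euclidean distances between points of cluster A and cluster B
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    return {"complete": d.max(), "single": d.min(), "average": d.mean()}[method]

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

print(linkage_distance(A, B, "single"))    # 3.0 (closest pair: (1,0)-(4,0))
print(linkage_distance(A, B, "complete"))  # 6.0 (farthest pair: (0,0)-(6,0))
print(linkage_distance(A, B, "average"))   # 4.5 (mean of 4, 6, 3, 5)
```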

Supervised learning

Now, we have a response variable yi associated to each vector of predictors xi.

Two classes of problem:

- Regression: yi is numerical.

  MSE = (1/n) Σi=1..n (yi − ŷi)²

- Classification: yi is categorical.

  0-1 loss = Σi=1..n 1(yi ≠ ŷi)
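Both losses are one line of numpy. The numbers below are made up purely to illustrate the formulas:

```python
import numpy as np

# Regression: mean squared error
y     = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5,  0.0, 2.0, 8.0])
mse = np.mean((y - y_hat) ** 2)
print(mse)  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375

# Classification: 0-1 loss (count of misclassified points)
labels     = np.array(["a", "b", "a", "a"])
labels_hat = np.array(["a", "a", "a", "b"])
zero_one = np.sum(labels != labels_hat)
print(zero_one)  # 2
```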

Training vs. test error

Both the MSE for regression and the 0-1 loss for classification can be computed:

1. On the training data.
2. On an independent test set.

We want to minimize the error on a very large test set which is sampled from the same process as the training data. This is called the test error.

Bias-variance decomposition

Consider a regression method which, given some data (x1, y1), ..., (xn, yn), outputs a prediction f̂(x) for the regression function.

If we think of the training data as coming from some distribution, then the function f̂ can be considered a random variable as well.

The expected test MSE of f̂ has the following decomposition for any fixed x:

  E[(Y − f̂(x))²] = E[(f̂(x) − E f̂(x))²] + [E f̂(x) − f(x)]² + Var(ε),

where the first term is Var(f̂(x)) > 0 and the second is the squared bias of f̂(x), also > 0.

Variance: increases with the flexibility of the model.
Bias: decreases as the flexibility of the model increases.
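The decomposition can be checked by simulation: fit a (deliberately biased) linear model to data from a quadratic truth over many training sets, and compare the empirical test MSE at a fixed x0 to variance + squared bias + Var(ε). This simulation is my own construction, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: x ** 2          # true regression function
sigma, x0, n = 0.5, 0.8, 50

preds = []
for _ in range(4000):         # many independent training sets
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    b1, b0 = np.polyfit(x, y, 1)      # biased linear fit
    preds.append(b0 + b1 * x0)
preds = np.array(preds)

variance = preds.var()
sq_bias = (preds.mean() - f(x0)) ** 2
# Empirical expected test MSE at x0, with fresh noise on Y:
lhs = np.mean((f(x0) + rng.normal(0, sigma, preds.size) - preds) ** 2)

# expect True, up to simulation error:
print(np.isclose(lhs, variance + sq_bias + sigma**2, rtol=0.1))
```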

How do we estimate the test error?

- Our main technique is cross-validation.

- Different approaches:

  1. Validation set: Split the data in two parts, train the model on one subset, and compute the test error on the other.
  2. k-fold: Split the data into k subsets. Average the test errors computed using each subset as a validation set.
  3. LOOCV: k-fold cross-validation with k = n.

- No approach is superior to all others.

- What are the main differences? How do the bias and variance of the test error estimates compare? Which methods depend on the random seed?
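The k-fold procedure can be sketched in a few lines, here for a least-squares fit (the function name and toy data are illustrative, not course code; note how the random seed enters through the fold assignment):

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # random fold assignment
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # least-squares fit with an intercept on the training folds
        A = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(test)), X[test]]) @ beta
        errs.append(np.mean((y[test] - pred) ** 2))
    return np.mean(errs)                   # average held-out error

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (100, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.3, 100)
cv = kfold_mse(X, y, k=5)
print(cv > 0)  # an estimate of the test MSE, near Var(eps) = 0.09 here -> True
```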

The Bootstrap

- Main idea: If we have enough data, the empirical distribution is similar to the actual distribution of the data.

- Resampling with replacement allows us to obtain pseudo-independent datasets.

- They can be used to:

  1. Approximate the standard error of a parameter (say, β in linear regression), which is just the standard deviation of the estimate when we repeat the procedure with many independent training sets.
  2. Bagging: By averaging the predictions ŷ made with many independent data sets, we reduce the variance of the predictor.
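Use 1 above is easy to demonstrate for the sample mean, where a textbook formula for the standard error exists to compare against. A small sketch (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=200)

# Bootstrap: resample with replacement, recompute the estimate each time
boot_means = np.array([
    rng.choice(x, size=len(x), replace=True).mean()
    for _ in range(4000)
])
se_boot = boot_means.std()                    # bootstrap standard error
se_formula = x.std(ddof=1) / np.sqrt(len(x))  # classical formula s / sqrt(n)

print(np.isclose(se_boot, se_formula, rtol=0.15))  # the two agree -> True
```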

Regression methods

- Nearest neighbors regression
- Multiple linear regression
- Stepwise selection methods
- Ridge regression and the Lasso
- Principal Components Regression
- Partial Least Squares
- Non-linear methods:
  - Polynomial regression
  - Cubic splines
  - Smoothing splines
  - Local regression
  - GAMs: combining the above methods with multiple predictors
- Decision trees, Bagging, Random Forests, and Boosting
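As one concrete example from the list, ridge regression has a closed-form solution whose coefficients shrink toward zero as the penalty grows. An illustrative sketch on centered data (function name and toy data are my own):

```python
import numpy as np

def ridge(X, y, lam):
    Xc = X - X.mean(0)       # center so no penalty falls on the intercept
    yc = y - y.mean()
    p = X.shape[1]
    # closed form: (X'X + lam * I)^(-1) X'y
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 100)

b_small = ridge(X, y, lam=0.01)
b_large = ridge(X, y, lam=1000.0)
# A larger penalty shrinks the coefficient vector:
print(np.linalg.norm(b_large) < np.linalg.norm(b_small))  # True
```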

Classification methods

- Nearest neighbors classification
- Logistic regression
- LDA and QDA
- Stepwise selection methods
- Decision trees, Bagging, Random Forests, and Boosting
- Support vector classifier and support vector machines
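As one example from this list, logistic regression can be fit by gradient ascent on the log-likelihood. A minimal sketch on a toy two-class problem (the function and data are illustrative, not course code):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    A = np.column_stack([np.ones(len(y)), X])   # add an intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-A @ beta))         # predicted probabilities
        beta += lr * A.T @ (y - p) / len(y)     # gradient ascent step
    return beta

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]

beta = fit_logistic(X, y)
p = 1 / (1 + np.exp(-(np.column_stack([np.ones(100), X]) @ beta)))
acc = np.mean((p > 0.5) == y)
print(acc > 0.8)  # well above chance on these separated classes -> True
```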

Self-testing questions

For each of the regression and classification methods:

1. What are we trying to optimize?
2. What does the fitting algorithm consist of, roughly?
3. What are the tuning parameters, if any?
4. How is the method related to other methods, mathematically and in terms of bias and variance?
5. How does rescaling or transforming the variables affect the method?
6. In what situations does this method work well? What are its limitations?
