Machine learning - HT 2016
2. Linear Regression

Varun Kanade

University of Oxford
January 22, 2016

Outline

Supervised Learning Setting

- Data consists of input-output pairs
- Inputs (also covariates, independent variables, predictors, features)
- Outputs (also variates, dependent variables, targets, labels)

Goals

- Understand the supervised learning setting
- Understand linear regression (aka least squares)
- Derivation of the least squares estimate

1

Why study linear regression?

- "Least squares" is at least 200 years old (Legendre, Gauss)
- Francis Galton: Regression to mediocrity (1886)
- Often, real processes can be approximated well by linear models
- More complicated models require an understanding of linear regression
- Closed-form analytic solutions can be obtained
- Many key notions of machine learning can be introduced

2

A toy example

Want to predict commute time into city centre

What variables would be useful?
- Distance to city centre
- Day of the week

Data

dist (km)   day   commute time (min)
2.7         fri   25
4.1         mon   33
1.0         sun   15
5.2         tue   45
2.8         sat   22

3

Linear Models

Suppose the input is a vector x ∈ R^n and the output is y ∈ R.

We have data ⟨(x_i, y_i)⟩_{i=1}^m.

Notation: dimension n, size of dataset m, column vectors.

Linear model: y = w_0 + x_1 w_1 + · · · + x_n w_n + noise

4

Linear Models

Linear model: y = w_0 + x_1 w_1 + · · · + x_n w_n + noise

Input encoding: mon-sun has to be converted to a number
- monday: 0, tuesday: 1, . . . , sunday: 6
- 0 if weekend, 1 if weekday

Assume x_1 = 1 for all data points. Then the model can be written succinctly as

y = x · w + noise = ŷ + noise

5
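As a concrete illustration of the second encoding (a minimal sketch, not part of the original slides; the helper name and feature layout [1 (bias), distance in km, weekday?] are assumptions made here), the raw records of the toy example could be turned into numeric feature vectors like this:

```python
# Hypothetical sketch: encode the toy data as numeric features.
raw = [
    (2.7, "fri", 25),
    (4.1, "mon", 33),
    (1.0, "sun", 15),
    (5.2, "tue", 45),
    (2.8, "sat", 22),
]

WEEKDAYS = {"mon", "tue", "wed", "thu", "fri"}

def encode(dist_km, day):
    """Map (distance, day name) to the feature vector [1, dist, weekday?]."""
    return [1.0, dist_km, 1.0 if day in WEEKDAYS else 0.0]

X = [encode(d, day) for d, day, _ in raw]
y = [t for _, _, t in raw]
print(X[0], y[0])   # [1.0, 2.7, 1.0] 25
```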

Learning Linear Models

Data: ⟨(x_i, y_i)⟩_{i=1}^m

Model parameter w

Training phase: learning/estimation of w

Testing phase: predict ŷ_{m+1} = x_{m+1} · w

6
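A minimal NumPy sketch of these two phases on synthetic data (an illustration added here, not from the slides; the training step uses the least-squares estimator that is only derived later in the lecture, via np.linalg.lstsq):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: m examples, n features (first feature is the constant 1).
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])
w_true = np.array([6.0, 6.5, 2.0])
y = X @ w_true + rng.normal(scale=0.5, size=m)   # outputs = linear model + noise

# Training phase: estimate w from the data (least squares, derived later).
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Testing phase: predict the output for a new input x_{m+1}.
x_new = np.array([1.0, 4.8, 1.0])
print(w_hat, x_new @ w_hat)
```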

ŷ(x) = w_0 + x w_1

L(w) = L(w_0, w_1) = ∑_{i=1}^m (ŷ_i − y_i)^2 = ∑_{i=1}^m (w_0 + x_i w_1 − y_i)^2

7
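A small sketch of this one-dimensional squared-error loss (added for illustration; the (x_i, y_i) pairs are the distance/commute-time values from the toy example, and the particular (w_0, w_1) passed in is arbitrary):

```python
import numpy as np

x = np.array([2.7, 4.1, 1.0, 5.2, 2.8])   # distance (km)
y = np.array([25, 33, 15, 45, 22])        # commute time (min)

def squared_loss(w0, w1):
    """L(w0, w1) = sum_i (w0 + x_i * w1 - y_i)^2."""
    residuals = w0 + w1 * x - y
    return np.sum(residuals ** 2)

print(squared_loss(8.0, 6.0))   # loss for one arbitrary choice of (w0, w1)
```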

Linear Regression: General Case

Recall that the linear model is

y_i = ∑_{j=1}^n x_ij w_j

where we assume that x_i1 = 1 for all x_i, so w_1 is the bias or offset term.

Expressing everything in matrix notation

y = Xw

Here we have y ∈ R^{m×1}, X ∈ R^{m×n} and w ∈ R^{n×1}

$$
\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}}_{\mathbf{y}\;(m \times 1)}
=
\underbrace{\begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_m^T \end{bmatrix}}_{X\;(m \times n)}
\underbrace{\begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix}}_{\mathbf{w}\;(n \times 1)}
=
\begin{bmatrix} x_{11} & \cdots & x_{1n} \\ x_{21} & \cdots & x_{2n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix}
\begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix}
$$

9

Back to toy example

dist (km)   weekday?   commute time (min)
2.7         1 (fri)    25
4.1         1 (mon)    33
1.0         0 (sun)    15
5.2         1 (tue)    45
2.8         0 (sat)    22

We have m = 5, n = 3, and so we get

$$
\mathbf{y} = \begin{bmatrix} 25 \\ 33 \\ 15 \\ 45 \\ 22 \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & 2.7 & 1 \\ 1 & 4.1 & 1 \\ 1 & 1.0 & 0 \\ 1 & 5.2 & 1 \\ 1 & 2.8 & 0 \end{bmatrix}, \qquad
\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix}
$$

Suppose we get w = [6, 6.5, 2]^T. Then our predictions would be

$$
\hat{\mathbf{y}} = \begin{bmatrix} 25.55 \\ 34.65 \\ 12.5 \\ 41.8 \\ 24.2 \end{bmatrix}
$$

10
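A quick NumPy check of these numbers (a sketch added for illustration; X, y and the chosen w are exactly the values shown above):

```python
import numpy as np

X = np.array([[1, 2.7, 1],
              [1, 4.1, 1],
              [1, 1.0, 0],
              [1, 5.2, 1],
              [1, 2.8, 0]])
y = np.array([25, 33, 15, 45, 22])
w = np.array([6, 6.5, 2])

y_hat = X @ w                      # predictions for the training inputs
print(y_hat)                       # [25.55 34.65 12.5  41.8  24.2 ]
print(np.sum((y_hat - y) ** 2))    # squared-error loss of this particular w
```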

Linear Prediction

Suppose someone lives 4.8 km from the city centre; how long would they take to get in on a Wednesday?

We can write x_new = [1, 4.8, 1]^T and then compute

$$
\hat{y}_{\text{new}} = \begin{bmatrix} 1 & 4.8 & 1 \end{bmatrix} \begin{bmatrix} 6 \\ 6.5 \\ 2 \end{bmatrix} = 39.2 \text{ minutes}
$$

Interpreting regression coefficients: with w = [6, 6.5, 2]^T, the baseline commute is 6 minutes, each additional km adds about 6.5 minutes, and travelling on a weekday adds 2 minutes.

11

Minimizing the Squared Error

L(w) = ∑_{i=1}^m (x_i^T w − y_i)^2 = (Xw − y)^T (Xw − y)

[Figure: surface plot of the squared-error loss over the weights]

12

Finding Optimal Solutions using Calculus

L(w) = ∑_{i=1}^m (x_i^T w − y_i)^2 = (Xw − y)^T (Xw − y)

13

Differentiating Matrix Expressions

L(w) = ∑_{i=1}^m (x_i^T w − y_i)^2 = (Xw − y)^T (Xw − y)

Rules

(i) ∇_w (c^T w) = c
(ii) ∇_w (w^T A w) = Aw + A^T w (= 2Aw for symmetric A)

14
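Applying these two rules to L(w) gives the normal equations and the closed-form solution quoted on the next slide (a short derivation added here for completeness; it assumes X^T X is invertible):

$$
\begin{aligned}
L(\mathbf{w}) &= (X\mathbf{w}-\mathbf{y})^T(X\mathbf{w}-\mathbf{y})
              = \mathbf{w}^T X^T X\,\mathbf{w} - 2\,\mathbf{y}^T X\,\mathbf{w} + \mathbf{y}^T\mathbf{y} \\
\nabla_{\mathbf{w}} L(\mathbf{w}) &= 2\,X^T X\,\mathbf{w} - 2\,X^T\mathbf{y}
  \quad \text{(rule (ii) with } A = X^T X\text{, rule (i) with } \mathbf{c} = X^T\mathbf{y}\text{)} \\
\nabla_{\mathbf{w}} L(\mathbf{w}) = \mathbf{0}
  \;&\Longrightarrow\; X^T X\,\mathbf{w} = X^T\mathbf{y}
  \;\Longrightarrow\; \hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}
\end{aligned}
$$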

Solution to Linear Regression

L(w) = ∑_{i=1}^m (x_i^T w − y_i)^2 = (Xw − y)^T (Xw − y)

ŵ = (X^T X)^{-1} X^T y

What if we had one-hot encoding for days?
- 7 binary features, exactly one of them is 1
- x_1 = 1, x_2 = dist, x_3 = mon?, x_4 = tue?, . . . , x_9 = sun?

15
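A minimal NumPy sketch of this estimate on the toy data (added for illustration; it solves the normal equations with np.linalg.solve rather than forming the inverse explicitly, which is the usual numerically preferable route, and np.linalg.lstsq would give the same answer):

```python
import numpy as np

X = np.array([[1, 2.7, 1], [1, 4.1, 1], [1, 1.0, 0], [1, 5.2, 1], [1, 2.8, 0]], dtype=float)
y = np.array([25, 33, 15, 45, 22], dtype=float)

# Least squares estimate: solve (X^T X) w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)       # fitted coefficients [w1, w2, w3]
print(X @ w_hat)   # fitted values y_hat
```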

Linear Regression as Solving Noisy Linear Systems

Without noise: Xw = y

Linear regression finds ŷ that makes the system Xw = ŷ feasible

and ŷ is closest to y in Euclidean distance

16
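One way to make this precise (a standard linear-algebra fact, added here for completeness): substituting the least squares estimate, the fitted vector ŷ = Xŵ is the orthogonal projection of y onto the column space of X,

$$
\hat{\mathbf{y}} = X\hat{\mathbf{w}} = X (X^T X)^{-1} X^T \mathbf{y},
\qquad
\hat{\mathbf{y}} = \arg\min_{\mathbf{v} \in \operatorname{col}(X)} \lVert \mathbf{v} - \mathbf{y} \rVert_2 .
$$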

Other Loss Functions

L(w) = ∑_{i=1}^m |x_i^T w − y_i|

17
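Unlike the squared error, this absolute-error (L1) loss has no closed-form minimiser; minimising it requires an iterative method or a linear-programming formulation, which is not covered here. A sketch of simply evaluating it on the toy data (added for illustration):

```python
import numpy as np

X = np.array([[1, 2.7, 1], [1, 4.1, 1], [1, 1.0, 0], [1, 5.2, 1], [1, 2.8, 0]], dtype=float)
y = np.array([25, 33, 15, 45, 22], dtype=float)

def absolute_loss(w):
    """L(w) = sum_i |x_i^T w - y_i|."""
    return np.sum(np.abs(X @ w - y))

print(absolute_loss(np.array([6.0, 6.5, 2.0])))   # L1 loss of the earlier guess for w
```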

Loss Functions

- Consider ℓ : R⁺ → R⁺ satisfying
  1. ℓ(0) = 0
  2. ℓ is non-decreasing
- L(w) = ∑_{i=1}^m ℓ(|x_i^T w − y_i|)

[Figure: plots of ℓ(z) = |z|^2, |z|^4, |z|^10, |z|, √|z|, and |z|^0.01]

18

Optimization with different loss functions

[Figure: four panels comparing the solutions obtained with L(w) = ∑|ŷ_i − y_i|^2, ∑|ŷ_i − y_i|, ∑√|ŷ_i − y_i|, and ∑|ŷ_i − y_i|^0.1]

19

Probabilistic Modelling

- Linear model: y = x^T w* + noise (for some w*)
- E[y | x, w*] = x^T w*
- Data ⟨(x_i, y_i)⟩_{i=1}^m
- Unbiased estimator ŵ = (X^T X)^{-1} X^T y

20
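A short derivation of the unbiasedness claim, added for completeness and assuming the stated model y = Xw* + ε with E[ε | X] = 0:

$$
\hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}
               = (X^T X)^{-1} X^T (X\mathbf{w}^* + \boldsymbol{\varepsilon})
               = \mathbf{w}^* + (X^T X)^{-1} X^T \boldsymbol{\varepsilon},
\qquad
\mathbb{E}[\hat{\mathbf{w}} \mid X] = \mathbf{w}^* .
$$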

Next Time

- Maximum likelihood estimation
- Make sure you are familiar with the Gaussian distribution
- Non-linearity using basis expansion
- What to do when you have more features than data?

21