
Stochastic Models: Machine Learning

Walt Pohl

Universität Zürich, Department of Business Administration

March 19, 2015


What is Machine Learning?

Machine learning is aimed at prediction, not hypothesis testing.

Use a high-dimensional approximation to extract the maximum predictability.

No effort is made to interpret individual parameters.


The Prediction Problem

The basic framework is:

Predict Y, given some vector of predictors, X.

Find a function, f(X), to predict Y:

Y = f(X) + ε,

where ε is random.

The space of all possible f's is infinite-dimensional, but we choose a finite-dimensional approximation.


Example: Polynomial Regression

Use regression to fit a high-degree polynomial – degree 10 or 20.

f(X) = b_0 + b_1 X + … + b_N X^N

Coefficients are hard to interpret. What does the coefficient of X^8 mean?
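As a concrete sketch (numpy assumed; the data here are made up purely for illustration), the fit above is just least squares on the powers of X:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)   # toy data

# Degree-10 polynomial regression: least squares on 1, x, ..., x^10.
coeffs = np.polyfit(x, y, deg=10)   # returns b_10, ..., b_1, b_0 (highest degree first)
y_hat = np.polyval(coeffs, x)       # fitted values f(x)
```

The entries of coeffs are exactly the b_i above – and just as hard to interpret.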


In Versus Out of Sample

Use a high enough degree, and you can fit the data perfectly – in sample.

Out of sample, the fit will be terrible, much worse than a linear regression.

This is the problem of overfitting.

The solution? Penalize complexity.
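Before turning to penalties, a minimal numeric illustration of the gap (numpy assumed, toy data as before):

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-1, 1, 20))
x_test = np.sort(rng.uniform(-1, 1, 20))
y_train = np.sin(3 * x_train) + 0.2 * rng.standard_normal(20)
y_test = np.sin(3 * x_test) + 0.2 * rng.standard_normal(20)

for deg in (1, 15):   # straight line vs. near-interpolating polynomial
    c = np.polyfit(x_train, y_train, deg)
    mse_in = np.mean((y_train - np.polyval(c, x_train)) ** 2)
    mse_out = np.mean((y_test - np.polyval(c, x_test)) ** 2)
    # high degree: tiny in-sample error, much larger out-of-sample error
    print(f"degree {deg}: in-sample MSE {mse_in:.3f}, out-of-sample MSE {mse_out:.3f}")
```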


Loss Function

Choose a loss function, L(Y, f(X)).

Choose f to minimize

E( L(Y, f(X)) ).

L can be modified to penalize complexity.


Penalizing Complexity via Loss Function

One natural choice for L is squared-error loss:

(Y − f(X))²

This leads to regression.

But now, let’s introduce a term to penalize complexity.

Example:

(Y − f(X))² + λ Σ_i β_i²


Ridge Regression

Minimizing this penalized loss gives you ridge regression.

Note that λ – known as the ridge parameter – cannot be estimated from the data. It must be given.

λ = 0 is ordinary regression; λ → ∞ forces the coefficients towards zero.
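The ridge estimate has the closed form b = (XᵀX + λI)⁻¹ Xᵀy. A minimal numpy sketch (toy data; in practice the intercept is usually left unpenalized):

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 in closed form."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + rng.standard_normal(100)

print(ridge(X, y, lam=0.0))   # ordinary least squares
print(ridge(X, y, lam=1e6))   # coefficients pushed towards zero
```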


Variable Selection Methods

Ridge regression works by pushing all coefficient estimates towards zero.

A natural alternative is to set the coefficients of some variables exactly to zero – that is, to select variables.


Subset Selection

Subset selection works by choosing a subset of the variables and regressing on those alone.

Several standard techniques:

Best subset

Forward stepwise

Backward stepwise


Best subset

For a fixed k, choose the k variables that maximize the R².

Downside: can be computationally expensive.

Unspecified: k.
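A brute-force numpy sketch (the combinations loop is exactly what makes this expensive):

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Search all size-k subsets of columns; return the one with the best R^2."""
    n, p = X.shape
    best, best_r2 = None, -np.inf
    tss = np.sum((y - y.mean()) ** 2)
    for cols in itertools.combinations(range(p), k):
        Xs = np.column_stack([np.ones(n), X[:, cols]])   # intercept + chosen columns
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        r2 = 1 - np.sum((y - Xs @ beta) ** 2) / tss
        if r2 > best_r2:
            best, best_r2 = cols, r2
    return best, best_r2
```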


Forward selection

Start with only the intercept, and add one variable at a time. Choose the variable that increases the R² the most.

Downside: not an optimal fit.

Unspecified: when to stop.
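A greedy numpy sketch (backward selection, next, is the mirror image, starting from the full set):

```python
import numpy as np

def forward_select(X, y, k):
    """Greedily add the variable that raises R^2 most, until k are chosen."""
    n, p = X.shape
    chosen = []

    def r2(cols):
        Xs = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        return 1 - np.sum((y - Xs @ beta) ** 2) / np.sum((y - y.mean()) ** 2)

    while len(chosen) < k:
        best = max((c for c in range(p) if c not in chosen),
                   key=lambda c: r2(chosen + [c]))
        chosen.append(best)
    return chosen
```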


Backward selection

Start with all variables, and remove one variable at a time. Choose the variable that decreases the R² the least.

Downside: not an optimal fit.

Unspecified: when to stop.


The Lasso

The lasso superficially resembles ridge regression, but has some of the aspects of subset selection.

It's regression with a penalty term,

Σ_i ( y_i − α − Σ_j β_j x_ij )² + λ Σ_j |β_j| ,

but the objective is minimized by setting some of the coefficients exactly to zero.
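A sketch using scikit-learn's Lasso (assuming scikit-learn is available; its alpha plays the role of λ, up to sklearn's 1/(2n) scaling of the squared error):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(200)   # only 2 variables matter

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # most entries come out exactly 0
```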


Other Penalties

Other penalties appear in the literature:

p-norm (for 1 ≤ p ≤ 2): Σ_j |β_j|^p

elastic net: Σ_j ( α β_j² + (1 − α) |β_j| )

Both are somewhere between ridge and lasso in behavior.
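scikit-learn's ElasticNet (assumed available) implements the mixed penalty; its l1_ratio weights the |β_j| part, so it corresponds roughly to 1 − α in the slide's notation, up to scaling conventions:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(200)

# l1_ratio blends the penalties: 1.0 is pure lasso, 0.0 is pure ridge.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```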


Nonlinear Models

The previous techniques are for linear models with large numbers of variables. What about nonlinear models of a few variables?

In theory we can treat this as a special case of the previous problem: choose a big family of basis functions, à la polynomial regression.

We now consider some more intrinsically nonlinear methods:

Splines

Local regression

Generalized additive models


Piecewise Basis Functions

Polynomial regression has the downside that every observation affects every coefficient.

Alternative: divide the x-axis into intervals (the endpoints are known as knots). For each interval, choose a set of basis functions that are zero outside that interval.

Regression coefficients will then be unaffected by observations outside the interval.


Choosing Piecewise Basis Functions

General recipe:

Choose knots: ξ_1, …, ξ_n.

Choose an arbitrary set of basis functions: constants, linear functions, polynomials, etc.

Let I(ξ_i ≤ x < ξ_{i+1}) be the function that is 1 on the interval [ξ_i, ξ_{i+1}) and 0 otherwise. Then functions of the form

f_i(x − ξ_i) I(ξ_i ≤ x < ξ_{i+1})

do the job.
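A minimal numpy sketch of the recipe with piecewise-constant basis functions (the simplest case):

```python
import numpy as np

def piecewise_constant_fit(x, y, knots):
    """Regress y on the indicator basis I(knot_i <= x < knot_{i+1})."""
    edges = np.concatenate(([-np.inf], knots, [np.inf]))
    B = np.column_stack([(x >= lo) & (x < hi)
                         for lo, hi in zip(edges[:-1], edges[1:])]).astype(float)
    beta, *_ = np.linalg.lstsq(B, y, rcond=None)
    # One constant per interval; observations outside an interval
    # have no effect on that interval's coefficient.
    return beta
```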


Continuity

Downside: a regression fit will usually be discontinuous at the knots.

Fix: choose the coefficients so that the pieces match up on either side of a knot.

This imposes a linear constraint on the coefficients at each knot.


Derivatives

We can go further, and choose the coefficients so that derivatives agree on either side of a knot. This is again a linear constraint.

A spline is a piecewise polynomial of degree d whose derivatives up to order d − 1 agree on each side of the knots.

The usual case is cubic splines (d = 3).


Smoothing Splines

Splines can be fit by constrained linear regression – ordinary linear regression with linear constraints on the coefficients.

These fits can be pretty wiggly, especially as the number of knots increases.

An alternative technique is smoothing splines – add a penalty term for wiggliness.


Smoothness Penalty

A fairly general penalized objective is of the form

Σ_i ( y_i − f(x_i) )² + λ ∫ f″(t)² dt.

f is chosen from some family, such as cubic splines, to minimize this.

The f″ term is easy to compute for polynomials.
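In practice one rarely codes this directly. As a sketch, scipy's UnivariateSpline (assumed available) fits a cubic smoothing spline, though note it parameterizes smoothness by a bound s on the residual sum of squares rather than by λ directly:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 100))   # x must be increasing
y = np.sin(x) + 0.3 * rng.standard_normal(100)

# k=3 gives a cubic spline; s trades off fit against wiggliness.
spline = UnivariateSpline(x, y, k=3, s=5.0)
y_hat = spline(x)
```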


Local Regression

Piecewise polynomials are an extreme solution to the problem of making coefficients depend only on nearby data.

Local regression is a less extreme solution, where for each point we blend together nearby points.


Nearest-neighbor average

Simplest technique: for a point x, let f(x) be the average of the y_i for the k x_i's nearest to x.

f is a discontinuous step function, because each observation is either used or not.

Alternative: take a weighted average, where the weights die off smoothly.
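A sketch of the k-nearest-neighbor average in numpy (toy helper, one-dimensional x assumed):

```python
import numpy as np

def knn_average(x0, x, y, k):
    """f(x0): average of the y's for the k x's nearest to x0."""
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()
```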


Kernels

Let K(x, x′) be a function with a maximum at x = x′ that goes to zero as x − x′ → ±∞. K is called the kernel.

Let f(x) be the weighted average

f(x) = Σ_i K(x, x_i) y_i / Σ_i K(x, x_i).

The K(x, x_i) are the weights.
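A sketch of this weighted average with a Gaussian kernel (one common choice of K, not prescribed by the slides; numpy assumed):

```python
import numpy as np

def kernel_regression(x0, x, y, bandwidth=1.0):
    """Kernel-weighted average of y at the point x0."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)   # K(x0, x_i)
    return np.sum(w * y) / np.sum(w)                 # weighted average
```

The bandwidth plays the role that k plays for nearest neighbors: larger values blend in more distant points.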


Nonlinear multivariate models

Once we combine nonlinearity and many variables, things get considerably harder. Some techniques:

Multidimensional splines.

Multidimensional local regression.

The curse of dimensionality: these methods do not scale to many variables.


High-tech approaches

There are high-tech approaches such as

Neural nets

Genetic algorithms

Support vector machines


Generalized Additive Models

Generalized additive models are a low-tech approach:

Assume that the influence of each variable is well explained by splines.

Add them together to get the total influence:

y = f_1(x_1) + … + f_n(x_n).


Fitting Generalized Additive Models

These models can be fit in several ways. One natural way generalizes smoothing splines.

Minimize a penalized objective of the form

Σ_i ( y_i − Σ_j f_j(x_ij) )² + Σ_j λ_j ∫ f_j″(t)² dt.
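Another standard fitting strategy is backfitting: cycle through the variables, re-fitting each f_j to the partial residuals of the others. A minimal numpy sketch, using the Gaussian smoother from the kernel slide as the one-dimensional smoother (a smoothing spline would be the textbook choice):

```python
import numpy as np

def gauss_smooth(x0, x, y, h=0.5):
    """One-dimensional Gaussian-kernel smoother, as on the kernel slide."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

def backfit(X, y, n_iter=20):
    """Backfit y = alpha + f_1(x_1) + ... + f_p(x_p)."""
    n, p = X.shape
    alpha = y.mean()
    F = np.zeros((n, p))                      # F[:, j] holds the fitted f_j(x_ij)
    for _ in range(n_iter):
        for j in range(p):
            r = y - alpha - F.sum(axis=1) + F[:, j]   # partial residuals w.r.t. f_j
            F[:, j] = np.array([gauss_smooth(xi, X[:, j], r) for xi in X[:, j]])
            F[:, j] -= F[:, j].mean()         # center each f_j for identifiability
    return alpha, F
```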
