QBUS3820 Data Mining and Data Analysis
Lecture: Neural Networks
Dr. Minh-Ngoc Tran, University of Sydney Business School
Table of contents
Introduction
Fundamental concepts
Single layer perceptron
Introduction
What are neural networks?
They are a set of very flexible non-linear methods for regression and classification, suitable when your dataset is large.
What are neural networks?
- A neural network, or artificial neural network (ANN), is a computational model that tries to mimic a network of neurons in the human brain.
- Artificial neural networks (ANNs) are not biological neural networks, but mathematical models inspired by biological neural networks.
What are neural networks?
- A neural network is an interconnected assembly of simple processing units or neurons, which communicate by sending signals to each other over weighted connections
- A neural network is made of layers of similar neurons: an input layer, hidden layers, and an output layer.
- The input layer receives data from outside the network. The output layer sends data out of the network. Hidden layers receive/process/send data within the network.
What are neural networks used for?
- Neural networks are often used for statistical analysis and data modelling, as an alternative to standard nonlinear regression/classification
- They have been successfully used in speech recognition, textual character recognition, medical imaging diagnosis, robotics, financial market prediction, etc.
- But their applications to business are still somewhat limited
Fundamental concepts
Elements of an artificial neural network
An ANN includes
- a set of processing units/neurons/nodes
- an activation level Zi for each unit i, which is often the same as the output of the unit
- weights wik, which are connection strengths between units i and k
- a propagation rule that determines the total input Sk of a unit from its connected units
- an activation function hk that determines the activation level Zk based on the total input Sk: Zk = hk(Sk)
Elements of an artificial neural network
Often, the total input sent to unit k is

Sk = sum_i wik Zi + w0k

which is a weighted sum of the outputs from all units i that are connected to unit k, plus an offset term w0k.

Then, the output of unit k is

Zk = hk(Sk) = hk(sum_i wik Zi + w0k)
Elements of an artificial neural network
It’s useful to distinguish three types of units:
- input units (denoted by X): receive data from outside the network
- output units (denoted by Y): send data out of the network
- hidden units (denoted by Z): receive data from and send data to units within the network.

Given the signal from a set of inputs X, an ANN produces an output Y.
Elements of an artificial neural network
The function

hk(Sk) = 1 / (1 + e^(-Sk))

is commonly used as the activation function for hidden units.

For output units,

hk(Sk) = Sk (used in regression)

or

hk(Sk) = Sk / sum_l Sl (used in classification)
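As an illustration, these activation functions can be sketched in a few lines of Python. This is a minimal sketch, not part of the lecture; note that for classification outputs, practical implementations usually use the softmax (the exponentiated form of the normalisation above), since it works for negative inputs as well:

```python
import math

def sigmoid(s):
    # Logistic activation h(S) = 1 / (1 + e^{-S}), common for hidden units.
    return 1.0 / (1.0 + math.exp(-s))

def identity(s):
    # Output activation for regression: h(S) = S.
    return s

def softmax(scores):
    # Normalised output activations for classification. The slide's
    # S_k / sum_l S_l requires positive inputs; the softmax
    # e^{S_k} / sum_l e^{S_l} is the standard positive normalisation.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))         # 0.5
print(softmax([1.0, 1.0]))  # two equal scores share probability equally
```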
Training neural networks
- A neural network is a computational model that needs to be estimated
- The unknown quantities in an ANN are the weights wik.
- These parameters are often estimated from training data
Examples of neural networks
- Consider an ANN with no hidden layer
- Suppose that there are p input units X1, ..., Xp and one output unit Y
Examples of neural networks
- Let S = w0 + sum_i wi Xi be the total input sent to Y. For classification using logistic regression, we classify

  Y = 1 if and only if e^S / (1 + e^S) >= 0.5, i.e. S >= 0

- Equivalently,

  Y = h(S) = { 1, S >= 0
               0, otherwise }
Examples of neural networks
So this classification model is a special case of an ANN with no hidden units.
Examples of neural networks
Multiple linear regression is a special case of an ANN with no hidden units.
Single layer perceptron
Single layer perceptron
- We now focus on the most widely used neural networks in statistics: ANNs with a single hidden layer, often called a single layer perceptron
Single layer perceptron for regression
- Suppose we have p input predictors/features X = (X1, ..., Xp)' and a scalar target Y
- Create M hidden units Z1, ..., ZM
- Total input of unit Zm:

  S_Zm = a0m + a1m X1 + ... + apm Xp = a0m + am'X

  The weights aij are unknown and need to be estimated.
- Activation level of unit Zm:

  Zm = h(S_Zm) = h(a0m + am'X), m = 1, ..., M

- Compute the total input of the output unit Y:

  S = b0 + b1 Z1 + ... + bM ZM = b0 + b'Z

  with bi the weight from hidden unit Zi to the output unit Y
- The output Y = S is a prediction of E(Y|X).
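The forward pass described above can be sketched as follows. This is a minimal illustration with made-up layer sizes and random weights; in practice the weights are estimated from training data:

```python
import numpy as np

rng = np.random.default_rng(0)
p, M = 3, 4                        # 3 inputs, 4 hidden units (illustrative sizes)
alpha0 = rng.normal(size=M)        # hidden-unit offsets alpha_0m
alpha  = rng.normal(size=(M, p))   # hidden-unit weight vectors alpha_m (rows)
beta0  = rng.normal()              # output offset beta_0
beta   = rng.normal(size=M)        # output weights beta_m

def h(s):
    # logistic activation for the hidden units
    return 1.0 / (1.0 + np.exp(-s))

def f(x):
    # forward pass: Z_m = h(alpha_0m + alpha_m' x), then S = beta_0 + beta' Z
    z = h(alpha0 + alpha @ x)
    return beta0 + beta @ z

x = np.array([0.5, -1.0, 2.0])
print(f(x))   # a prediction of E(Y | X = x)
```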
Single layer perceptron for regression
- We can write

  f(X) = S = b0 + b1 h(a01 + sum_{i=1}^p ai1 Xi) + ... + bM h(a0M + sum_{i=1}^p aiM Xi)

- So the inputs Xi enter the prediction function f(X) in a nonlinear way.
- Here, for simplicity, we use the same activation h for all hidden units Zm
- If h(x) = x, it can be seen that f(X) is a linear combination of the Xi, so multiple linear regression is a special case of this single layer perceptron model
Single layer perceptron for regression
- The model parameters are theta = (a01, a11, ..., ap1; ...; a0M, a1M, ..., apM; b0, b1, ..., bM)
- Let {yi, xi = (xi1, ..., xip)}, i = 1, ..., n, be the training dataset. The sum of squared errors is

  R(theta) = sum_{i=1}^n (yi - f(xi))^2

- We estimate theta by minimising R(theta)
Single layer perceptron for 0-1 classification

- Suppose we have p input predictors/features X = (X1, ..., Xp)' and a scalar target Y
- Create M hidden units Z1, ..., ZM
- Total input of unit Zm:

  S_Zm = a0m + a1m X1 + ... + apm Xp = a0m + am'X

  The weights aij are unknown and need to be estimated.
- Activation level of unit Zm:

  Zm = h(S_Zm) = h(a0m + am'X), m = 1, ..., M

- Compute the total input of the output unit Y:

  S = b0 + b1 Z1 + ... + bM ZM = b0 + b'Z

  with bi the weight from hidden unit Zi to the output unit Y
- The output is

  Y = h(S) = { 1, S >= 0
               0, otherwise }
Issues in training neural networks
- Often, neural networks have too many weights and might overfit the data. To overcome the overfitting problem, we minimise

  R(theta) + lambda P(theta)

  where P(theta) is a model complexity penalty and lambda >= 0 controls the amount of penalisation.
- For example,

  P(theta) = sum_{k=1}^K sum_{m=1}^M bmk^2 + sum_{j=1}^p sum_{m=1}^M ajm^2

- lambda can be selected by cross-validation
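As a sketch, the penalised criterion can be computed as follows, with hypothetical weight arrays and lambda, for illustration only:

```python
import numpy as np

def ridge_penalty(alpha, beta):
    # P(theta): sum of squared weights (offsets excluded), as on the slide
    return np.sum(beta ** 2) + np.sum(alpha ** 2)

def penalised_loss(y, yhat, alpha, beta, lam):
    # R(theta) + lambda * P(theta)
    resid = y - yhat
    return np.sum(resid ** 2) + lam * ridge_penalty(alpha, beta)

y     = np.array([1.0, 2.0, 3.0])       # toy targets
yhat  = np.array([1.1, 1.9, 2.7])       # toy fitted values
alpha = np.array([[0.5, -0.2]])         # hypothetical hidden-layer weights
beta  = np.array([1.0, -1.0])           # hypothetical output weights
print(penalised_loss(y, yhat, alpha, beta, lam=0.1))
```

A larger lambda shrinks the weights harder during training, trading a little bias for lower variance.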
Issues in training neural networks
- With too few hidden units M, the model might not be flexible enough to capture the nonlinearities in the data; with too large an M, the model might overfit the data
- It is most common to choose a reasonably large M and use a penalty term to avoid overfitting
Multiple hidden layer neural networks

- One can use neural networks with multiple hidden layers
- The choice of the number of hidden layers is guided by background knowledge or by using a test dataset.
Example: handwriting recognition
- Handwriting recognition is an important task, especially in postal services
- We want to recognise handwritten digits, scanned from envelopes
Example: handwriting recognition
- Input is a vector of 16 x 16 = 256 pixel values, and output is one of the ten digits 0, ..., 9.
- Each training observation, i.e. an image, is a vector of its 256 pixel values together with the correct digit
- There are 320 images in the training set, and 160 in the test set.
Example: handwriting recognition
Five networks were used:
- Net-1: no hidden layer, equivalent to multinomial logistic regression.
- Net-2: one hidden layer with 12 hidden units, fully connected.
- Net-3: two hidden layers, locally connected.
- Net-4 and Net-5: two hidden layers, locally connected with different constraint levels on the weights.
Example: handwriting recognition
Summary
- ANNs provide a range of flexible nonlinear models for data modelling
- ANNs have been successfully applied in many fields: robotics, vision, image processing, etc.
- They are useful for prediction, not for inference, because it's difficult to interpret the coefficients/weights in an ANN model
QBUS3820 Data Mining and Data Analysis
Lecture: Piecewise polynomials and Spline regression
Dr. Minh-Ngoc Tran, University of Sydney Business School
Table of contents
Piecewise polynomials
Spline regression
Reading: Chapter 5, The Elements of Statistical Learning.
Introduction
- Let Y be the response variable and X be a predictor. We consider scalar/univariate X in this lecture.
- The conditional mean f(X) = E(Y|X) is all we need for both inference and prediction tasks. The regression model

  Y = f(X) + e,  where e is an error term with E(e) = 0,

  is the most general regression model, as we don't make any assumptions on the form of f(X)
- The linear regression model assumes f(X) = b0 + b1 X.
- In many cases, it's unlikely that the conditional mean E(Y|X) is truly linear in X!
Introduction
Obviously, f(X) = E(Y|X), approximated by the red curve, is not linear in X.
Introduction
- This lecture goes beyond the linearity assumption in regression
Piecewise polynomials
First, what is a polynomial? A p-degree (or p-order) polynomial is

f(x) = a0 + a1 x + a2 x^2 + ... + ap x^p

where a0, ..., ap are coefficients.
Global polynomial regression
Consider the general regression model

Y = f(X) + e, E(e) = 0    (1)

where f(X) = E(Y|X) is unknown.

Let's use the Taylor expansion of f(X) at 0: for some order p >= 1,

f(X) ~ f(0) + f'(0) X + (f''(0)/2!) X^2 + (f'''(0)/3!) X^3 + ... + (f^(p)(0)/p!) X^p
     = b0 + b1 X + b2 X^2 + ... + bp X^p

for all values in the range of X.

Here b0 = f(0), b1 = f'(0), ..., bp = f^(p)(0)/p! are unknown coefficients. f^(k)(x) denotes the kth derivative of the function f(x).
Global polynomial regression
- Model (1) becomes

  Y = b0 + b1 X + b2 X^2 + ... + bp X^p + e    (2)

  Model (2) is now a multiple linear regression model, so the coefficients bj can be easily estimated from data.
- This model is referred to as the global polynomial regression model
- "Global" means the coefficients bj are constant across the entire range (also called the domain) of X
- The global polynomial regression model offers a way to relax the linearity assumption.
- However, global polynomial regression has a main drawback: because of its global nature, it can provide a good fit (to the data) in one area but behave rather weirdly in another area
Global polynomial regression
Figure: Industrial production index, January 1990 to October 2005. Left panel: global polynomial regression. Right panel: cubic spline regression (see later)

Tuning the coefficients to achieve a functional form in one region can cause the function to flap about madly in other regions.
Piecewise polynomial regression
In this lecture we consider techniques that allow for local polynomial representations. The first technique is piecewise polynomial regression.
Piecewise polynomials
Basic idea: we divide the range/domain of X into contiguous intervals, then approximate f(X) in each interval by a separate polynomial.

Piecewise constant

Divide the range/domain of X into contiguous intervals, then approximate f(X) in each interval by a constant, i.e. a 0-degree polynomial.
Figure: We approximate the true curve by a step function
Piecewise constant
For example, we divide the domain of X into 3 intervals by 2 limit points xi1 and xi2 (called knots).

Define three functions, called basis functions:

h1(X) = I(X < xi1), h2(X) = I(xi1 <= X < xi2), h3(X) = I(X >= xi2)

We approximate the true curve f(X) = E(Y|X) by a step function

f(X) = sum_{m=1}^3 bm hm(X) = { b1, X < xi1
                                b2, xi1 <= X < xi2
                                b3, X >= xi2 }

f(X) is a constant in each of the regions X < xi1, xi1 <= X < xi2 and X >= xi2.
Piecewise constant
Let {Xi, Yi, i = 1, ..., n} be the training data. Let

Zi1 = h1(Xi), Zi2 = h2(Xi), Zi3 = h3(Xi)

We arrive at the following multiple linear regression model

Yi = b1 Zi1 + b2 Zi2 + b3 Zi3 + ei, i = 1, ..., n

So it's easy to estimate the coefficients bm using the least squares method or the maximum likelihood method.

Piecewise constant

It can be shown that

bm = Ave{Yi | Xi in region m}
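A quick numerical check of this fact, using simulated data and hypothetical knots: the least squares coefficients of the indicator basis equal the per-region averages of Y.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=200)
# A step-function truth plus a little noise
y = np.where(x < 1, 2.0, np.where(x < 2, 5.0, 1.0)) + rng.normal(0, 0.1, size=200)

xi1, xi2 = 1.0, 2.0                       # hypothetical knots
Z = np.column_stack([x < xi1,
                     (xi1 <= x) & (x < xi2),
                     x >= xi2]).astype(float)   # basis h1, h2, h3

beta, *_ = np.linalg.lstsq(Z, y, rcond=None)    # least squares fit

# The fitted coefficient for each region equals the average of y there
print(beta)
print([y[x < xi1].mean(), y[(xi1 <= x) & (x < xi2)].mean(), y[x >= xi2].mean()])
```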
Piecewise linear
Divide the domain of X into contiguous intervals, then approximate f(X) in each interval by a linear function, i.e. a 1-degree polynomial.

Define six basis functions:

h1(X) = I(X < xi1), h2(X) = I(xi1 <= X < xi2), h3(X) = I(X >= xi2)
h4(X) = X I(X < xi1), h5(X) = X I(xi1 <= X < xi2), h6(X) = X I(X >= xi2)

We approximate the true curve f(X) = E(Y|X) by a piecewise linear function

f(X) = sum_{m=1}^6 bm hm(X) = { b1 + b4 X, X < xi1
                                b2 + b5 X, xi1 <= X < xi2
                                b3 + b6 X, X >= xi2 }

f(X) is a linear function in each region.
Piecewise linear
Let {Xi, Yi, i = 1, ..., n} be the training data. Let

Zik = hk(Xi), k = 1, ..., 6, i = 1, ..., n

We arrive at the following multiple linear regression model

Yi = b1 Zi1 + b2 Zi2 + b3 Zi3 + b4 Zi4 + b5 Zi5 + b6 Zi6 + ei

It's easy to estimate the coefficients bk using the least squares method or the maximum likelihood method.
Piecewise linear
Higher order piecewise polynomials
Similarly, we can construct higher order piecewise polynomials. For example,

- piecewise quadratic polynomials: approximate f(X) = E(Y|X) by a 2-degree polynomial in each region
- piecewise cubic polynomials: approximate f(X) = E(Y|X) by a 3-degree polynomial in each region
Spline regression
Discontinuity issue
Consider a function f(x) and some x0 in its domain.

- f-(x0) = lim_{x -> x0-} f(x) is the left limit of f(x) at x0, i.e. the limit of f(x) as x goes to x0 from the left.
- f+(x0) = lim_{x -> x0+} f(x) is the right limit of f(x) at x0, i.e. the limit of f(x) as x goes to x0 from the right.
- If f-(x0) != f+(x0), we say that f(x) is discontinuous at x0.
Discontinuity issue
- Piecewise polynomials are in general not continuous at the knots xij: f-(xij) != f+(xij)
- Discontinuity causes inference/prediction problems. E.g., what is the prediction of f(xij) = E(Y|X = xij)?
- Typically, we prefer continuity in statistical modelling
Spline
- We need continuity constraints

  f-(xij) = f+(xij),

  i.e. the left limit meets the right limit at the knots.
- A technique to impose such constraints is to use a spline
- A spline is a piecewise polynomial that is continuous at the knots. Therefore, a spline is continuous everywhere in the entire range of X.
Linear splines
Suppose that we divide the range of X into three intervals with knots xi1 and xi2. Define basis functions

h0(X) = 1
h1(X) = X
h2(X) = (X - xi1)+ = (X - xi1) I(X >= xi1) = { 0, if X < xi1
                                               X - xi1, if X >= xi1 }
h3(X) = (X - xi2)+ = (X - xi2) I(X >= xi2) = { 0, if X < xi2
                                               X - xi2, if X >= xi2 }

Similarly, we can define the basis functions when there are K > 2 knots xi1, xi2, ..., xiK.
Linear splines
Let

f(X) = b0 h0(X) + b1 h1(X) + b2 h2(X) + b3 h3(X)
     = b0 + b1 X + b2 (X - xi1)+ + b3 (X - xi2)+
     = { b0 + b1 X, X < xi1
         b0 + b1 X + b2 (X - xi1), xi1 <= X < xi2
         b0 + b1 X + b2 (X - xi1) + b3 (X - xi2), X >= xi2 }

It's easy to check that
- f-(xij) = f+(xij), j = 1, 2. That is, f(X) is continuous at every X
- f(X) is linear in each region: X < xi1, xi1 <= X < xi2 and X >= xi2
- f(X) is called a linear spline: a piecewise linear polynomial that is continuous everywhere.
Estimating a linear spline
Let {Xi, Yi, i = 1, ..., n} be the training data. The linear spline regression model becomes

Yi = b0 h0(Xi) + b1 h1(Xi) + b2 h2(Xi) + b3 h3(Xi) + ei

Let

Zik = hk(Xi), k = 0, ..., 3, i = 1, ..., n

We arrive at the following multiple linear regression model

Yi = b0 Zi0 + b1 Zi1 + b2 Zi2 + b3 Zi3 + ei

So it's easy to estimate the coefficients bk using the least squares method or the maximum likelihood method.
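A minimal sketch of fitting a linear spline by least squares, using simulated data and hypothetical knots (the truth here is itself a linear spline with slopes 1, 2, -1, so the fit recovers the coefficients almost exactly):

```python
import numpy as np

def linear_spline_basis(x, knots):
    # Basis h0 = 1, h1 = X, and (X - xi_k)_+ for each knot
    cols = [np.ones_like(x), x]
    for xi in knots:
        cols.append(np.maximum(x - xi, 0.0))
    return np.column_stack(cols)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, size=300))
# A continuous piecewise linear truth with knots at 3 and 7, plus small noise
y = np.piecewise(x, [x < 3, (x >= 3) & (x < 7), x >= 7],
                 [lambda t: t,
                  lambda t: 3 + 2 * (t - 3),
                  lambda t: 11 - (t - 7)])
y = y + rng.normal(0, 0.05, size=x.size)

Z = linear_spline_basis(x, knots=[3.0, 7.0])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta)   # roughly [0, 1, 1, -3]: base slope 1, slope changes +1 and -3
```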
Linear splines
Cubic splines
- A linear spline is continuous but not smooth: it has a kink/sudden change at the knots xik, which is not an attractive feature in statistical modelling.
- We often prefer smoother functions. Typically, we prefer f(X) to be not only continuous, but to have continuous first and second derivatives
- This can be achieved by increasing the order of the local polynomials, i.e. by using cubic splines
Cubic splines
Cubic splines
Suppose that we divide the range of X into three intervals with knots xi1 and xi2.

Start with the 3-degree polynomial basis functions

h0(X) = 1, h1(X) = X, h2(X) = X^2, h3(X) = X^3

and add one basis function for each knot:

h4(X) = (X - xi1)^3+ = (X - xi1)^3 I(X >= xi1)
h5(X) = (X - xi2)^3+ = (X - xi2)^3 I(X >= xi2)

Similarly, we can define the basis functions when there are K > 2 knots xi1, xi2, ..., xiK.
Cubic splines
Let

f(X) = b0 + b1 X + b2 X^2 + b3 X^3 + b4 (X - xi1)^3+ + b5 (X - xi2)^3+

f'(X) = b1 + 2 b2 X + 3 b3 X^2 + 3 b4 (X - xi1)^2 I(X >= xi1) + 3 b5 (X - xi2)^2 I(X >= xi2)

f''(X) = 2 b2 + 6 b3 X + 6 b4 (X - xi1) I(X >= xi1) + 6 b5 (X - xi2) I(X >= xi2)

It can be checked that
- f(X), f'(X) and f''(X) are continuous at every X
- f(X) is a 3-degree polynomial in each region
- f(X) is called a cubic spline
Estimating a cubic spline
Again, it's easy to fit a cubic spline to data. Let {Xi, Yi, i = 1, ..., n} be the training data. Let

Zik = hk(Xi), k = 0, ..., 5, i = 1, ..., n

We arrive at the following multiple linear regression model

Yi = b0 Zi0 + b1 Zi1 + b2 Zi2 + b3 Zi3 + b4 Zi4 + b5 Zi5 + ei

It's easy to estimate the coefficients bk using the least squares method or the maximum likelihood method.
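A minimal sketch of fitting a cubic spline via the truncated power basis, using simulated data and hypothetical knots:

```python
import numpy as np

def cubic_spline_basis(x, knots):
    # h0..h3: global cubic terms; plus (X - xi_k)^3_+ for each knot
    cols = [np.ones_like(x), x, x**2, x**3]
    for xi in knots:
        cols.append(np.maximum(x - xi, 0.0) ** 3)
    return np.column_stack(cols)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-2, 2, size=400))
y = np.sin(2 * x) + rng.normal(0, 0.1, size=x.size)   # a smooth, nonlinear truth

Z = cubic_spline_basis(x, knots=[-1.0, 0.0, 1.0])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
fitted = Z @ beta
rmse = np.sqrt(np.mean((fitted - y) ** 2))
print(rmse)   # roughly the noise level, since the spline tracks sin(2x) well
```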
Cubic splines
Spline regression
- Basically, in spline regression, we fit a separate polynomial to the data in each region
- Unlike global polynomial regression, where the coefficients are constant across all regions, in spline regression the coefficients are adjusted locally
- Unlike piecewise polynomial regression, in spline regression a certain order of smoothness is imposed at the knots
Cubic splines with K knots
Given a set of K knots

xi1 < ... < xik < ... < xiK,

the cubic spline with K knots is

f(X) = b0 + b1 X + b2 X^2 + b3 X^3 + sum_{k=1}^K b_{3+k} (X - xik)^3+.
Knots selection
- Selecting the number of knots K and the knot positions xik is an art
- An option is to set xik to the 100k/(K + 1)-th percentile of the distribution of X.
- How many knots K should be used?
  - If K is too large, we have an overfitting problem
  - If K is too small, we have an underfitting problem
  - This is a model selection problem (see later)
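The percentile rule for knot placement can be sketched as follows (toy data; np.percentile handles the interpolation between order statistics):

```python
import numpy as np

def percentile_knots(x, K):
    # Place knot k at the 100k/(K+1)-th percentile of the X data
    qs = [100.0 * k / (K + 1) for k in range(1, K + 1)]
    return np.percentile(x, qs)

x = np.arange(1, 101)          # toy data: 1, 2, ..., 100
print(percentile_knots(x, 3))  # roughly the three quartiles of the data
```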
An example
Figure: Scatter plot of x vs. y
An example
Figure: 300 training data points (o) and 100 test data points (x)
An example
- Both the training and test datasets are available on Blackboard
- Clearly, simple linear regression is not appropriate for this dataset.
- Let's fit a cubic spline regression model to this training dataset and test its predictive power on the test data.
- Let's proceed as if there isn't any built-in function in your software that can help you with this task!
An example
- Let's select K = 3
- xi1, xi2 and xi3 are the 25%-, 50%- and 75%-percentiles of the X data
- The a%-percentile is a number such that a% of the data points are smaller than that number
- Check BUSS1020 or Google if you don't remember how to compute percentiles!
- Most statistical software packages have functions to compute this
- Based on my calculation, xi1 = 10, xi2 = 19.5 and xi3 = 58
- We need to fit the following cubic spline regression model

  Yi = b0 + b1 Xi + b2 Xi^2 + b3 Xi^3 + b4 (Xi - 10)^3+ + b5 (Xi - 19.5)^3+ + b6 (Xi - 58)^3+ + ei,

  i = 1, ..., n.
An example
- Let Zi1 = Xi, Zi2 = Xi^2, ..., Zi6 = (Xi - 58)^3+, i = 1, ..., n. We have the following multiple linear regression model

  Yi = b0 + sum_{k=1}^6 bk Zik + ei

- Let X be the n x 7 design matrix whose ith row is (1, Zi1, ..., Zi6), and let y = (Y1, Y2, ..., Yn)'. The estimate of the vector b is b = (X'X)^(-1) X'y
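The least squares formula can be checked numerically on a toy design matrix. Solving the normal equations with a linear solver is preferred to forming the inverse explicitly, but gives the same result:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # toy design matrix with intercept
true_beta = np.array([1.0, 2.0, -3.0])
y = Z @ true_beta                                            # noiseless toy response

# beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(beta_hat)   # recovers [1, 2, -3]
```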
An example
b = (498.9679, -0.3252, 0.1046, -0.0064, 0.0091, -0.0025, -0.0018)

- The estimate of f(X) = E(Y|X) is

  Y = b0 + b1 X + b2 X^2 + b3 X^3 + b4 (X - 10)^3+ + b5 (X - 19.5)^3+ + b6 (X - 58)^3+

- The plot of (X, Y) as X varies is the fitted (also called predicted) curve.
An example
Figure: 300 training data points and the fitted curve
An example
Figure: 100 test data points and the fitted curve
An example
Let’s see the effect of the number of knots K
An example

Figure: 100 test data points and the fitted curve, shown for several values of the number of knots K
Homework
Download the datasets from Blackboard and fit a cubic spline regression model by yourself. Have fun!
QBUS3820 Data Mining and Data Analysis
Lecture: Kernel methods
Dr. Minh-Ngoc Tran, University of Sydney Business School
Table of contents
Kernel density estimation
Kernel regression
Introduction
- We've so far discussed parametric methods: we first assume a parametric form for the underlying model that generated the data, then estimate the parameters.
- In parametric modelling, the underlying model that generated the data is described by a functional form that depends on a vector of unknown parameters theta.
- E.g., simple linear regression

  yi = b0 + b1 xi + ei, ei ~ N(0, s^2)

  is a parametric model, as we assume the model that generated the data yi, given xi, is a normal distribution with mean b0 + b1 xi and variance s^2. The set of unknown parameters is theta = (b0, b1, s^2).
Introduction
- This lecture is about nonparametric methods for estimating probability density functions and regression functions. They are also called kernel methods, as they're based on kernel functions.
- Kernel methods are considered modern data analysis techniques: their use grew rapidly after computing power became widely available
- This lecture covers
  - kernel methods for estimating density functions
  - kernel methods for estimating regression functions
Kernel density estimation
Density estimation
- Let X1, ..., Xn be i.i.d. (independent and identically distributed) samples from an unknown cumulative distribution function (cdf) F(x) with probability density function (pdf) f(x).
- We will use the generic notation X to denote a random variable with distribution F(x), i.e. the Xi's are identical copies of X.
- We want to estimate F(x) (or equivalently f(x)), as F(x) contains all the information about X we need: mean, variance, correlation between components of X (if X is a random vector), etc.
Density estimation
For example, in the spending dataset,
- What is the distribution of spending amounts?
- What is the mode? What is the shape of this distribution?
Parametric vs. nonparametric

- Parametric approaches assume a known functional form for f, such as normal, gamma, or Poisson, involving some unknown parameters:

  f(x) = f(x|theta).

  Then the task is to estimate theta by, e.g., maximum likelihood estimation or Markov chain Monte Carlo (not covered in this course)
- Parametric methods often achieve attractive properties (estimators with small variances, fast convergence to the true population values), provided the true underlying density that generated the data is well approximated by the postulated form f(x|theta)
- Parametric approaches might lead to misleading results if the assumed parametric model is far from the true density.
Parametric vs. nonparametric

- Nonparametric approaches don't assume a known functional form for f
- They only make basic assumptions like
  - a finite second moment E(X^2) < infinity, or
  - the true density is smooth enough: the derivatives f^(r)(x) exist up to some certain order r
- So nonparametric approaches are more robust and more flexible
- But they do have their own drawbacks (discussed later)
Histogram
- The histogram is the oldest and most widely used nonparametric density estimator.
- X1, ..., Xn are iid samples from an unknown cdf F with pdf f(x) = dF(x)/dx.
- The empirical distribution function, defined as

  Fn(x) = (1/n) sum_{i=1}^n I(Xi <= x),

  is an estimator of the cdf F(x) = P(X <= x)
- Note that sum_{i=1}^n I(Xi <= x) ~ Bi(n, F(x)). So

  E(Fn(x)) = F(x),  V(Fn(x)) = F(x)(1 - F(x))/n -> 0 as n -> infinity

  for every x.
- Fn(x) is a consistent and unbiased estimator of F(x).
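The empirical distribution function is essentially one line of code (toy sample, for illustration):

```python
import numpy as np

def ecdf(sample, x):
    # F_n(x) = (1/n) * number of X_i <= x
    sample = np.asarray(sample)
    return np.mean(sample <= x)

sample = [3.0, 1.0, 4.0, 1.0, 5.0]
print(ecdf(sample, 1.0))  # 2 of 5 values are <= 1.0, so 0.4
print(ecdf(sample, 5.0))  # all values are <= 5.0, so 1.0
```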
Histogram

Basic idea:

f(x) ~ [F(x + h) - F(x - h)] / (2h),  where h > 0 is small

Replace F(x) by its estimate Fn(x):

fn(x) = [Fn(x + h) - Fn(x - h)] / (2h)
      = (1/(2hn)) sum_i (I(Xi <= x + h) - I(Xi <= x - h))
      = (1/(2hn)) sum_i I(x - h < Xi <= x + h)
      = (1/(2hn)) x number of the Xi's in (x - h, x + h]

is an estimator of f(x). Note: the interval is open on the left and closed on the right.
Histogram
Constructing a histogram estimator of f(x):

(i) Given a bin width h > 0, form the bins of the histogram. E.g., given an origin x0, let the bins be

(x0 + mh, x0 + (m + 1)h], m = 0, +-1, +-2, ...

Or, divide the range (a, b) into bins of length h.

(ii) fn(x) = (1/(nh)) x number of the Xi's in the bin that contains x

Some statistical software packages show the frequency (the number of Xi's in each bin) on the y-axis rather than the density (frequency divided by nh). Technically, the density must be shown to make sure that the entire area under fn(x) sums to 1.
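A minimal sketch of the histogram density estimator at a single point, under an assumed origin x0 and bin width h (not tied to any particular software's histogram function):

```python
import numpy as np

def histogram_density(sample, x, x0, h):
    # f_n(x) = (1/(nh)) * number of X_i in the bin (x0 + m h, x0 + (m+1) h]
    # that contains x
    sample = np.asarray(sample, dtype=float)
    m = np.floor((x - x0) / h)            # index of the bin containing x
    lo, hi = x0 + m * h, x0 + (m + 1) * h
    count = np.sum((sample > lo) & (sample <= hi))
    return count / (sample.size * h)

rng = np.random.default_rng(5)
sample = rng.normal(size=10_000)
print(histogram_density(sample, 0.2, x0=-5.0, h=0.5))  # roughly the N(0,1) density near 0
```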
Histogram
Histograms of 1000 observations from the standard normal distribution. The y-axis of the left panel shows the frequency.
Asymptotic properties of histogram estimator*
The Mean Squared Error (MSE) of fn(x) is

MSE(fn(x)) = E[fn(x) - f(x)]^2
           = [E(fn(x)) - f(x)]^2 + E[fn(x) - E(fn(x))]^2
           = Bias^2 + Variance.

MSE is a widely used performance measure.
Asymptotic properties of histogram estimator*
Assume that f(x) is smooth enough, in the sense that there exists a constant g > 0 such that

|f(x) - f(y)| <= g |x - y| for all x, y.

Then there exists a number zx belonging to the bin containing x such that

MSE(fn(x)) <= g^2 h^2 + f(zx)/(nh)

The optimal bin width (that minimises this MSE bound) is

h* = ( f(zx) / (2 g^2 n) )^(1/3).
Asymptotic properties of histogram estimator*
Under the optimal h*, the resulting MSE is

MSE(fn(x)) = C / n^(2/3) -> 0 as n -> infinity

for some constant C < infinity.

- So fn(x) converges to the true value f(x) as n -> infinity at the rate n^(-2/3)
- Note: we don't make any assumptions on the form of the underlying density f(x)
- Therefore kernel density estimation is more robust than parametric approaches.
Asymptotic properties of histogram estimator*
It can be shown that, for parametric models f(x|theta) where the parameter theta is estimated by the MLE theta-hat,

MSE(f(x|theta-hat)) = M / n

for some constant M < infinity.

- So f(x|theta-hat) converges to the true value f(x) as n -> infinity at the rate n^(-1), provided that the postulated model f(x|theta) is the true underlying density
- So the rate of convergence of nonparametric estimators is slower than that of parametric estimators. Why?
How many bins to be used?
Sturges’ rule: the number of bins should be 1 + log2(n).
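A sketch of Sturges' rule; rounding 1 + log2(n) up to an integer is an assumption here, as conventions vary between software packages:

```python
import math

def sturges_bins(n):
    # Sturges' rule: number of bins = 1 + log2(n), rounded up to an integer
    return math.ceil(1 + math.log2(n))

print(sturges_bins(1000))  # 1 + log2(1000) ~ 10.97, so 11 bins
print(sturges_bins(64))    # 1 + 6 = 7 bins
```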
Spending Amount example
The number of bins is too small. Important features of this distribution, such as the mode, are not revealed.
Spending Amount example
The number of bins selected by Sturges’ rule.
Spending Amount example
The number of bins is too large. The distribution is overfitted.
Kernel density estimation

Recall the estimator fn(x) we had earlier:

fn(x) = (1/(2hn)) x number of the Xi's falling in (x - h, x + h]

which can be written as

fn(x) = (1/(hn)) sum_{i=1}^n K((x - Xi)/h)

where the kernel K(.) is

K(t) = { 1/2, if -1 < t <= 1
         0, otherwise }

- Apart from the multiplicative factor 1/(hn), this estimator assigns the same weight of 1/2 to every Xi falling in (x - h, x + h]
- Intuitively, more weight should be put on Xi's that are closer to x
Kernel density estimation
- Intuitively, more weight should be put on Xi's that are closer to x
- The kernel K(t) should be designed so that K((x - Xi)/h) gets larger when Xi is closer to x.
- This means K(t) should be larger when t is closer to 0.
- The uniform kernel K(t) = (1/2) I(|t| <= 1) doesn't have this desired property.
Kernel density estimation

Some commonly used kernels:

- Gaussian kernel: K(t) = (1/sqrt(2 pi)) e^(-t^2/2)
- Triangular kernel: K(t) = (1 - |t|) I(|t| <= 1)
- Epanechnikov kernel: K(t) = (3/4)(1 - t^2) I(|t| <= 1)
Kernel density estimation
The kernel density estimator of the unknown pdf f(x) based on iid samples X1, ..., Xn is defined as

fn(x) = (1/(hn)) sum_{i=1}^n K((x - Xi)/h)

- h is called the bandwidth and controls the amount of smoothing
- K is a kernel function
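The kernel density estimator can be sketched directly from this formula (Gaussian kernel, fixed bandwidth, simulated data):

```python
import numpy as np

def gaussian_kernel(t):
    # K(t) = (1/sqrt(2 pi)) exp(-t^2 / 2)
    return np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)

def kde(sample, x, h):
    # f_n(x) = (1/(hn)) * sum_i K((x - X_i) / h)
    sample = np.asarray(sample, dtype=float)
    return np.sum(gaussian_kernel((x - sample) / h)) / (h * sample.size)

rng = np.random.default_rng(6)
sample = rng.normal(size=5_000)
print(kde(sample, 0.0, h=0.3))   # close to the N(0,1) density at 0
```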
Kernel density estimation
Some remarks
- Kernels in order of efficiency: Epanechnikov, Triangular, Gaussian, Uniform
- But the kernel typically plays a less important role than the bandwidth h in determining the performance
- There are several methods for selecting the bandwidth h. We won't discuss this in the course.
- Built-in functions in statistical software often come with an optimal choice of h.
Spending Amount example
The bandwidth h is too large. Local features of this distribution are not revealed.
Spending Amount example
The bandwidth h is selected by a rule of thumb called the normal reference bandwidth.
Spending Amount example
The bandwidth h is too small. The distribution is overfitted.
Kernel regression
Kernel regression
- Consider n iid samples (x1, y1), ..., (xn, yn), copies of the pair (X, Y) with X in R and Y in R
- We want to predict Y based on X
- The simple linear regression model assumes

  Y = b0 + b1 X + e,

  i.e. we assume a linear form for the conditional mean m(x) = E(Y|X = x) = b0 + b1 x.
- Parametric approaches assume a known functional form for E(Y|X = x)
- In contrast, nonparametric approaches don't assume any known functional form for m(x) = E(Y|X = x)
k-nearest neighbour method
kNN is the simplest nonparametric regression method. It estimates m(x) = E(Y|X = x) by

mkNN(x) = Average(yi | xi in Nk(x))

where Nk(x) denotes the neighbourhood containing the k points among the xi's closest to x. Equivalently,

mkNN(x) = (1/k) sum_{i=1}^n yi I(xi in Nk(x)) = (1/k) sum_{i: xi in Nk(x)} yi
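A minimal sketch of the kNN regression estimator on toy one-dimensional data:

```python
import numpy as np

def knn_regress(x_train, y_train, x, k):
    # Average the y_i over the k training points whose x_i are closest to x
    idx = np.argsort(np.abs(x_train - x))[:k]
    return y_train[idx].mean()

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 2.0, 4.0, 6.0, 8.0])   # toy data with y = 2x
print(knn_regress(x_train, y_train, 2.0, k=3))  # neighbours at x = 1, 2, 3 give mean(2, 4, 6) = 4.0
```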
kNN for classification
- The output variable G has two values: BLUE (= 0) and ORANGE (= 1)
- There are two predictors X1 and X2
- 200 training data points are shown in the picture
kNN for classification
We estimate the probability of being ORANGE by

Y(x) = P(G = 1|x) = (1/k) sum_{i: xi in Nk(x)} yi

Classification rule:

G(x) = { ORANGE, if Y(x) > 0.5
         BLUE, if Y(x) <= 0.5 }
kNN for classification
Figure: kNN classification with k = 15. The black curve is the decision boundary {x : Y(x) = 0.5}
kNN for classification
Figure: kNN classification with k = 1. The model is overfitted: the classifier works perfectly on the training data, but not that well on test data
k-nearest neighbour method
k has a big influence on the performance of kNN. k is often selected by cross-validation.
k-nearest neighbour method
mkNN(x) = (1/k) sum_{i=1}^n yi I(xi in Nk(x)) = (1/k) sum_{i: xi in Nk(x)} yi

- kNN assigns the same weight 1/k to all yi that have xi close to x. This is similar to using a uniform kernel.
- Intuitively, yi should be given a bigger weight if xi is closer to x.
- So we can make kNN better. Let's move on...
Nadaraya-Watson estimator
Let

Kh(x, xi) = K((x - xi)/h)

be a kernel function that gets bigger when xi gets closer to x. The Nadaraya-Watson estimator of the conditional mean m(x) = E(Y|X = x) is

m(x) = sum_i wi(x) yi

where

wi(x) = Kh(x, xi) / sum_{j=1}^n Kh(x, xj)

That is, we put more weight on those yi whose corresponding xi is closer to x.
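The Nadaraya-Watson estimator is a weighted average, and can be sketched as follows (Gaussian kernel, toy noiseless data):

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x, h):
    # m_hat(x) = sum_i w_i(x) y_i with w_i proportional to K((x - x_i)/h)
    t = (x - x_train) / h
    k = np.exp(-t ** 2 / 2)     # Gaussian kernel; the constant factor cancels in the weights
    w = k / k.sum()             # weights sum to one
    return np.sum(w * y_train)

x_train = np.linspace(0, 2 * np.pi, 200)
y_train = np.sin(x_train)       # noiseless toy data
est = nadaraya_watson(x_train, y_train, np.pi / 2, h=0.2)
print(est)   # close to sin(pi/2) = 1, with a small smoothing bias
```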
Asymptotic properties of the Nadaraya-Watson estimator*
m(x) = sum Kh(x, xi) yi / sum Kh(x, xi),   m(x) = sum Kh(x, xi) m(x) / sum Kh(x, xi)

m(x) - m(x) = sum Kh(x, xi)(yi - m(x)) / sum Kh(x, xi)

Bias(m(x)) = E[m(x) - m(x)]
           = sum Kh(x, xi)(m(xi) - m(x)) / sum Kh(x, xi)
           = [ (nh)^(-1) sum Kh(x, xi)(m(xi) - m(x)) ] / fn(x)
           = psi(x) / fn(x)

Recall that fn(x) is the KDE of f(x), the pdf of X, and therefore fn(x) -> f(x).
Asymptotic properties of the Nadaraya-Watson estimator*
By the LLN,

psi(x) -> integral h^(-1) Kh(x, t)(m(t) - m(x)) f(t) dt
        = integral K(w)[m(x + hw) - m(x)] f(x + hw) dw
        = integral K(w)[hw m'(x) + (1/2) h^2 w^2 m''(x) + o(h^2)] x [f(x) + hw f'(x) + (1/2) h^2 w^2 f''(x) + o(h^2)] dw
        = C(x, K) h^2 + o(h^2)

where C(x, K) is a constant depending on m(x), f(x) and the kernel K.
Asymptotic properties of the Nadaraya-Watson estimator*
Similarly, we can show that

V(m(x)) = C2(x, K)/(nh) + o((nh)^(-1))

So

MSE(m(x)) = C1(x, K) h^4 + C2(x, K)/(nh) + o(h^4) + o((nh)^(-1))

The optimal bandwidth h is the one that minimises the right-hand side:

h*(x) = C3(x, K) n^(-1/5)

Under this optimal h*,

MSE(m(x)) = C4(x, K) n^(-4/5) -> 0

as n -> infinity.
Asymptotic properties of the Nadaraya-Watson estimator
So,

MSE(m(x)) = E(m(x) - m(x))^2 = constant x n^(-4/5) -> 0

as the sample size n -> infinity.

- So m(x) converges to the true value m(x) as n -> infinity.
- Note: we don't make any assumptions on the form of the underlying conditional mean m(x)
- Therefore the kernel regression method is more robust than parametric approaches.
UN data exampleConsider the UN data on the relationship between GDP per capita(X ) and Fertility rate Y . We want to estimate E(Y |X = x)
Figure: The red line shows the NW estimator m̂(x) with a too small bandwidth h

UN data example

Figure: The red line shows the NW estimator m̂(x) with an optimal bandwidth h

UN data example

Figure: The red line shows the NW estimator m̂(x) with a too large bandwidth h
QBUS3820 Data Mining and Data Analysis
Model Selection and Variable Selection
Dr. Minh-Ngoc Tran, University of Sydney Business School
Table of contents
Introduction and Basic Concepts
Popular model selection methods
LASSO
Reading: Chapter 7 of the textbook The Elements of Statistical Learning.

Introduction and Basic Concepts
Introduction
Model selection problem
I Model selection in general and variable selection in particular are important parts of data analysis. Variable selection can be considered a special case of model selection.
I Consider a dataset D. Let {Mi, i ∈ I} be a set of potential models that can be used to explain D.
I The model selection problem is to select the "best" model to interpret D and/or to make good predictions on future observations.
I "Best" depends on how we define it: not overfitting, producing accurate predictions, etc.
Introduction
Model selection problem

For example, given a dataset D = {(x1, y1), ..., (xn, yn)}, two models are proposed to explain D (one is yours, one is your boss's):

Model 1: yi = β0 + β1xi + εi, where εi is assumed to have a normal distribution N(0, σ²).

Model 2: yi = β0 + β1xi + εi, where εi is assumed to have a Student's t distribution tν(0, σ²).
Then, you need to answer the question: which model is better?
Introduction
Variable selection problem

I Consider a regression model with response Y and a set of p potential covariates X1, ..., Xp.
I At the beginning stage of modelling, p is often large in order to reduce possible bias.
  I A large p might cause heavy admin duties, be costly, etc.
  I More importantly, a large p typically leads to a high variance in prediction (see later).
I The variable selection problem is to select the "best" subset of these p covariates to explain/predict Y.
I "Best" depends on how we define it: not overfitting, producing accurate predictions, etc.
Basic concepts
I Suppose that we have a target variable Y to be predicted based on an input vector X.
I Given a model M, and based on data D, we predict Y by f̂M(X|D).
I The functional form of f̂M(X|D) is determined by the nature of model M, e.g. a linear regression model or a spline regression model.
I The estimated parameters in f̂M(X|D) are computed based on data D.
I Let L(Y, f̂M(X|D)) denote the loss when we predict Y by f̂M(X|D), e.g.
  I Squared error loss: L(Y, f̂M(X|D)) = (Y − f̂M(X|D))²
  I 0-1 loss: L(Y, f̂M(X|D)) = I(Y ≠ f̂M(X|D))
  I Log-likelihood loss: L(Y, f̂M(X|D)) = − log p(Y | f̂M(X|D))
Basic concepts
I The prediction error of model M, conditional on data D, is defined as

Err(M|D) = E_{(X,Y)}[L(Y, f̂M(X|D))] = Average{L(Yj, f̂M(Xj|D)) over all future (Xj, Yj)}

I The expectation E_{(X,Y)}[·] is with respect to the joint population distribution of Y and X.
I The prediction error Err(M|D) measures the performance of model M. The smaller this error, the better M is.
I Note that Err(M|D) depends on the data D, and is therefore a random quantity.
Basic concepts
I Another measure of prediction performance is the expected prediction error.
I The expected prediction error of model M is defined as

Err(M) = ED[Err(M|D)] = ED[E_{(X,Y)}[L(Y, f̂M(X|D))]]

I ED[·] is the expectation with respect to all datasets of the same size as D.
I So Err(M) averages out the effect of data D. There is no longer any uncertainty involved, as all randomness (in (X,Y) and D) has been averaged out.
I This expected prediction error is an ideal measure of performance for model selection.
Basic concepts
Bias and variance decomposition

Consider the general regression model

Y = f(X) + ε,  E(ε) = 0,  V(ε) = σ²

Let f̂(X) be an estimate of f(X). For notational simplicity, I've suppressed the dependence of f̂(X) on model M and data D.

I Suppose we want to predict the mean value of Y at X = x0: f(x0) = E(Y|X = x0).
I The prediction is f̂(x0).

Using squared-error loss, the prediction error is

Err(M|D) = Eε[(Y|X=x0 − f̂(x0))²] = Eε[(f(x0) − f̂(x0) + ε)²] = (f(x0) − f̂(x0))² + σ²
Basic concepts
Bias and variance decomposition

The expected prediction error (EPE) is

Err(M) = E(f̂(x0) − f(x0))² + σ²
       = E((f̂(x0) − E[f̂(x0)]) + (E[f̂(x0)] − f(x0)))² + σ²
       = V(f̂(x0)) + (E[f̂(x0)] − f(x0))² + σ²
       = Variance(f̂) + Bias²(f̂) + σ²

As σ² is a constant independent of M, the EPE of model M is essentially decomposed into two terms: Variance and Bias².
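The decomposition can be checked by simulation. As a sketch (my own example, not from the slides): estimate μ = f(x0) by the shrinkage estimator c·ȳ, for which theory gives Variance = c²σ²/n and Bias² = (c − 1)²μ². The Monte Carlo MSE of the estimator (the error excluding the irreducible σ²) should equal Variance + Bias².

```python
import random

def simulate_mse(c, mu=2.0, sigma=1.0, n=20, reps=20000, seed=1):
    """Monte Carlo check of MSE = Variance + Bias^2 for the
    shrinkage estimator f_hat = c * sample_mean, which estimates mu."""
    rng = random.Random(seed)
    ests = []
    for _ in range(reps):
        ys = [rng.gauss(mu, sigma) for _ in range(n)]
        ests.append(c * sum(ys) / n)
    mean_est = sum(ests) / reps
    var = sum((e - mean_est) ** 2 for e in ests) / reps   # Variance
    bias2 = (mean_est - mu) ** 2                          # Bias^2
    mse = sum((e - mu) ** 2 for e in ests) / reps         # MSE
    return mse, var, bias2
```

For c < 1 the estimator trades a little bias for a smaller variance, which is exactly the trade-off model selection exploits.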
Basic concepts
Bias and variance decomposition

For example, in the multiple linear regression model, the true value f(x0) = E(Y|X = x0) (usually non-linear in the p-vector x0) is first deterministically approximated by a linear combination x0′β, then stochastically estimated by f̂(x0) = x0′β̂.

Err(M) = Variance + Bias² + σ²
       = V(x0′β̂) + (E[x0′β̂] − f(x0))² + σ²
       = x0′cov(β̂)x0 + (x0′β − f(x0))² + σ²
       = σ² ∑_{i=1}^p x0i² + (x0′β − f(x0))² + σ²,

where we suppose the design matrix X is standardised so that cov(β̂) = σ²(X′X)^{−1} = σ²I.

So the larger the number of covariates p, the bigger the Variance and the smaller the Bias², and vice versa.
Basic concepts
Overfitting

Expected prediction error = Variance + Bias²

I Overfitting: a complex model is used (f̂(x) is a complicated function, involves a lot of parameters, etc.), so Bias² is small but Variance is large. Underfitting: the opposite.
I Model selection: pick a model that trades off Bias against Variance.
Hope you enjoy this song!
https://www.youtube.com/watch?v=DQWI1kvmwRg
Basic concepts
Overfitting

The light red curves show the prediction errors in a linear regression. The solid red curve shows the expected prediction error (averaged over the prediction errors). The x-axis is the model complexity, proportional to the number of predictors used in the model.
Basic concepts
Training error

Training error is the average loss over the training data points, i.e. the data used to estimate the model:

err = (1/n) ∑_{i=1}^n L(yi, f̂(xi))

I Given a model, training data is used to fit the model (i.e. estimate the parameters), e.g. by minimising the training error.
I Training error is not a good measure for model selection: it decreases as model complexity increases.
Basic concepts
Training error is not good for model selection

The light blue curves show the training errors in a linear regression. The solid blue curve shows the expected training error E(err) (averaged over the training errors). Training errors consistently decrease as model complexity increases.
Basic concepts
Training data and validation data

I In the ideal rich-data case, we can divide the data into two sets: a training set and a validation set.
I Use the training set to fit/estimate the model and the validation set to estimate the prediction error (NOT the expected prediction error!).
I Pick the model with the smallest prediction error.
I But...
  I data is precious
  I we can do better with cross-validation (see later)
Basic concepts
Penalised maximum likelihood principle

I Typically, the training error err is smaller than the prediction error, because the same data is used to fit the model and to assess its error (see the picture).
I As the training error decreases as model complexity increases, it is a good idea to penalise model complexity.
I Many popular model selection criteria have the form

model selection criterion = err + penalty for model complexity
Popular model selectionmethods
AIC
Akaike's information criterion (AIC): select the model with the smallest AIC

AIC = −2 × log-likelihood(θ̂mle) + 2d

I d is the number of parameters in θ (e.g. the number of covariates)
I log-likelihood(θ̂mle) is the log-likelihood evaluated at the MLE θ̂mle
I The factor 2 is not important, but is useful when comparing AIC to other model selection criteria
I AIC is an estimate of the expected prediction error under the log-likelihood loss L(y, f̂(x)) = − log p(y|f̂(x))
I AIC was proposed by Hirotugu Akaike in 1973
BIC
Bayesian information criterion (BIC): select the model with the smallest BIC

BIC = −2 × log-likelihood(θ̂mle) + (log n) × d

BIC, proposed by Gideon Schwarz in 1978, is motivated by the Bayesian approach.
BIC*

Consider a model M with a d-vector of parameters θ, and data D. The posterior of M is

p(M|D) ∝ p(M) p(D|M) = p(M) ∫ p(D|θ, M) p(θ|M) dθ.

Using a uniform prior for M and approximating the integral by the so-called Laplace approximation,

log p(D|M) = log p(D|θ̂mle, M) − (d/2) log(n) + O(1)

O(1), read as "big order one", denotes a term that depends on n but stays bounded as n grows.

We want to pick the model with the highest posterior p(M|D), which is (up to the O(1) term) equivalent to picking the model with the smallest BIC:

BIC = −2 × log-likelihood(θ̂mle) + (log n) × d
AIC or BIC?
AIC = −2 × log-likelihood(θ̂mle) + 2d

BIC = −2 × log-likelihood(θ̂mle) + (log n) × d

I They're both popular model selection methods. BIC puts a heavier penalty on model complexity.
I BIC is asymptotically consistent: it is able to identify the true model as n → ∞ (if such a true model exists! Some argue that a true model never exists).
I Practitioners seem to prefer AIC over BIC when n is small.

[M.-N. Tran (2011), The Loss Rank Criterion for Variable Selection in Linear Regression Analysis, Scandinavian Journal of Statistics] proposes another criterion which is a compromise between AIC and BIC.
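In code, the two criteria above are one-liners; here is a small sketch (my own, assuming the maximised log-likelihood and parameter count are already available):

```python
import math

def aic(loglik, d):
    """Akaike's criterion: -2*loglik + 2*d."""
    return -2.0 * loglik + 2.0 * d

def bic(loglik, d, n):
    """Schwarz's criterion: -2*loglik + log(n)*d."""
    return -2.0 * loglik + math.log(n) * d
```

Since log n > 2 once n > e² ≈ 7.4, BIC penalises each extra parameter more heavily than AIC for all but tiny samples.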
Cross-validation
I basic idea: like you validate your peers' work and they validate yours
I probably the simplest and most commonly used model selection method
I gives an estimate of the expected prediction error
Cross-validation
I divide the data into K parts, K ≥ 2. Often, this is done randomly
I for the kth part, fit the model to the other K − 1 parts. Denote the fitted model by f̂^{−k}(x)
I use the fitted model to predict the kth part. The prediction error is

∑_{(yi, xi) ∈ part k} L(yi, f̂^{−k}(xi))

I The K-fold cross-validated prediction error is

CV = (1/n) ∑_{k=1}^K ∑_{(yi, xi) ∈ part k} L(yi, f̂^{−k}(xi))

It's an estimate of the expected prediction error, as it is averaged over both test data and training data.
Cross-validation
I The selected model is the one with the smallest CV predictionerror.
I Typical choices of K are 5, 10 or n. The case K = n is knownas leave-one-out cross-validation.
Cross-validation is simple and widely used. However, CV can sometimes be very computationally expensive, because one has to fit the model many times.
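The procedure above can be sketched in a few lines of Python (my own illustration, using a closed-form simple linear regression as the model being validated; a deterministic interleaved split stands in for a random one):

```python
def fit_slr(data):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(data)
    xbar = sum(x for x, _ in data) / n
    ybar = sum(y for _, y in data) / n
    sxx = sum((x - xbar) ** 2 for x, _ in data)
    sxy = sum((x - xbar) * (y - ybar) for x, y in data)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def cv_error(data, K):
    """K-fold CV estimate of squared-error prediction risk."""
    n = len(data)
    folds = [data[k::K] for k in range(K)]  # simple deterministic split
    total = 0.0
    for k in range(K):
        # fit on the other K-1 folds, predict the held-out kth fold
        train = [p for j in range(K) if j != k for p in folds[j]]
        b0, b1 = fit_slr(train)
        total += sum((y - (b0 + b1 * x)) ** 2 for x, y in folds[k])
    return total / n
```

On noiseless data lying exactly on a line, every held-out fold is predicted perfectly and the CV error is (numerically) zero.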
Variable selection in linear regression
Consider a linear regression model or a logistic regression model with p potential covariates.

At the initial step of modelling, a large number p of covariates is often introduced in order to reduce potential bias. The task is then to select the best subset among these p variables.

Best subset selection: search over all 2^p possible subsets of the p covariates to find the best subset. The criterion can be AIC, BIC or any other model selection criterion.
Variable selection in linear regression
Searching over 2^p subsets is only feasible when p is small (< 30).

Forward-stepwise selection: start with the intercept, then sequentially add to the model the covariate that most improves the model selection criterion.

Backward-stepwise selection: start with the full model with p covariates, then sequentially remove the covariate that most improves the model selection criterion.

I Advantage: much more time-efficient than the best subset selection method.
I Disadvantage: does not necessarily end up at the best subset.
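Forward-stepwise selection is just a greedy loop. Here is a generic sketch (my own) that works with any model selection criterion supplied as a scoring function to be minimised, e.g. the AIC or BIC of the refitted model:

```python
def forward_stepwise(variables, criterion):
    """Greedy forward selection.

    variables: iterable of candidate variable labels.
    criterion: function mapping a subset (frozenset) to a score
               to be minimised (e.g. the model's AIC or BIC).
    Returns the selected subset.
    """
    selected = frozenset()
    best = criterion(selected)
    remaining = set(variables)
    while remaining:
        # try adding each remaining variable, keep the best improvement
        trials = [(criterion(selected | {v}), v) for v in remaining]
        score, v = min(trials)
        if score >= best:          # no variable improves the criterion
            break
        selected, best = selected | {v}, score
        remaining.discard(v)
    return selected
```

Backward-stepwise selection is the mirror image: start from the full set and greedily remove variables while the criterion improves.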
Variable selection in linear regression
Variable selection based on hypothesis testing

I Consider the test H0: βj = 0 vs. H1: βj ≠ 0.
I Let β̂j be an estimator of βj. If the sampling distribution of β̂j is known, a p-value can be computed.
I If the p-value is large (e.g. > 0.05 or 0.1), then the corresponding covariate Xj might be removed from the model.

Possible disadvantages

I It's not clear what prediction error is optimised.
I Not time-efficient when p is large (the model must be refitted many times).

This variable selection method is not popular in "modern" statistics and machine learning. For historical reasons, it's still widely used in many fields such as the social sciences.
Woman labor force example
The data set MROZ.xlsx, available in Blackboard, contains information on women's labour force participation.

We would like to build a logistic regression model to explain women's labour force participation using the potential predictors nwifeinc (income), educ (years of education), age, exper (years of experience), expersq (squared years of experience), kidslt6 (number of kids under six years old) and kidsge6 (number of kids aged six or older).
Let’s carry out the variable selection task.
Woman labor force example
LASSO
LASSO
I Consider a simple linear regression model

yi = β0 + β1xi + εi,  i = 1, ..., n

I Assume that the xi's have been standardised so that ∑i xi = 0 and ∑i xi² = 1
I The LS method estimates β = (β0, β1)′ by minimising the sum of squared errors

∑i (yi − β0 − β1xi)²

I It's easy to see that the solution is

β̂1^ls = ∑i xi yi,  β̂0^ls = (1/n) ∑i yi
LASSO

I The LASSO method estimates β = (β0, β1)′ by minimising half the sum of squared errors plus a penalty term on β1:

(1/2) ∑i (yi − β0 − β1xi)² + λ|β1|

I |β1| is the absolute value of β1; λ > 0 controls the penalty and is called the shrinkage parameter
I Note that there's no penalty term for β0. The reason is that we are in general not interested in determining whether or not β0 = 0
I It can be shown that the solution is

β̂1^lasso = 0,            if λ ≥ |β̂1^ls|
β̂1^lasso = β̂1^ls − λ,   if λ < |β̂1^ls| and β̂1^ls > 0
β̂1^lasso = β̂1^ls + λ,   if λ < |β̂1^ls| and β̂1^ls < 0

β̂0^lasso = β̂0^ls = (1/n) ∑i yi
LASSO
I So when the shrinkage parameter λ is large enough (i.e. λ ≥ |β̂1^ls|), the Lasso estimate β̂1^lasso will be 0
I This can also be interpreted as follows: when |β̂1^ls| is small enough to be regarded as insignificant, Lasso automatically shrinks it to zero
I Because of this attractive feature, Lasso is a method for variable selection
I In general, the Lasso method shrinks all the LS estimates towards 0
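This closed-form solution is the familiar soft-thresholding operator applied to β̂1^ls. A one-function sketch (my own, for the standardised simple regression case above):

```python
def soft_threshold(b_ls, lam):
    """Lasso solution for a single standardised predictor:
    shrink the LS estimate b_ls towards 0 by lam, or set it to 0
    when lam >= |b_ls|.  Equals sign(b_ls) * max(|b_ls| - lam, 0)."""
    if lam >= abs(b_ls):
        return 0.0
    return b_ls - lam if b_ls > 0 else b_ls + lam
```

Small LS estimates are snapped exactly to zero, while large ones are merely shrunk, which is why Lasso performs variable selection and shrinkage at the same time.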
LASSO
I Now, consider the general multiple linear regression model
yi = β0 + β1xi1 + ...+ βpxip + εi
I For a given λ, the Lasso method estimates β = (β0, β1, ..., βp)′ by minimising

(1/2) ∑i (yi − β0 − β1xi1 − ... − βpxip)² + λ ∑_{j=1}^p |βj|

I Note that, in general, we don't penalise β0, as we are not interested in whether or not β0 = 0
I There isn't a closed-form solution to this optimisation problem, but it can be solved by numerical optimisation techniques
I Many "modern" statistical software packages have built-in functions to implement Lasso
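One standard numerical approach is cyclic coordinate descent, which repeatedly applies the one-variable soft-threshold update to each coefficient in turn. A minimal sketch (my own, not the slides' code; it assumes y is centred, no intercept, each column of X scaled so that ∑i xij² = 1, and the objective (1/2)∑i(yi − ∑j βj xij)² + λ∑j|βj|):

```python
def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the Lasso.

    X: list of rows; y: centred responses; columns of X scaled to
    unit sum of squares, so each coordinate update is one
    soft-threshold of the partial-residual correlation.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # correlation of column j with the residual excluding j
            rho = sum(X[i][j] * (y[i] - sum(beta[k] * X[i][k]
                      for k in range(p) if k != j)) for i in range(n))
            if rho > lam:
                beta[j] = rho - lam
            elif rho < -lam:
                beta[j] = rho + lam
            else:
                beta[j] = 0.0
    return beta
```

With orthonormal columns the solution is exactly the soft-thresholded LS estimate, which gives an easy correctness check; for large λ every coefficient is shrunk exactly to zero.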
LASSO
I Let's apply the method to the prostate cancer dataset, available on the textbook's website and in Blackboard
I The goal is to predict the log of the prostate-specific antigen level (lpsa), using multiple linear regression with eight predictors: log cancer volume (lcavol), log prostate weight (lweight), age, etc.
I We want to estimate the coefficients and simultaneously remove insignificant predictors
Lasso
The figure shows the profile of the Lasso estimates as the shrinkage parameter λ varies from 0 to λmax, the value at which all coefficients are 0.
Lasso
I For the particular λ marked by the dotted red line, three predictors, lcavol, lweight and svi, are selected; the other five are removed.
I When λ ≥ λmax, ALL coefficients β1, ..., βp are 0. This is when the model is likely to be underfitted.
I When λ = 0, ALL coefficients β1, ..., βp are non-zero (the LS solution). This is when the model is likely to be overfitted.
Selecting λ
The shrinkage parameter λ can be selected using a BIC-type criterion.

Let X be the design matrix and y the vector of responses. Denote by β̂λ^lasso the Lasso estimate of β given λ. Define

BIC(λ) = log(‖y − X β̂λ^lasso‖² / n) + dfλ × log(n)/n

dfλ is called the degrees of freedom, which is approximately the number of non-zero coefficients in the model.

The best λ is the one that minimises BIC(λ).
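Given the fitted values and the number of non-zero coefficients for each candidate λ, BIC(λ) is cheap to evaluate on a grid. A small helper sketch (my own names):

```python
import math

def bic_lasso(y, fitted, df):
    """BIC-type criterion for the Lasso:
    log(RSS/n) + df * log(n)/n, where df is the number of
    non-zero coefficients and fitted holds X @ beta_lasso."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return math.log(rss / n) + df * math.log(n) / n
```

In practice one computes the Lasso path over a λ grid, evaluates bic_lasso at each grid point, and keeps the λ with the smallest value.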
Lasso
BIC(λ) is minimised at λ = 0.0623
Lasso
The final Lasso estimate is shown by the vertical line at λ = 0.0623.

Lasso

The final Lasso estimate with λ chosen by BIC is

β̂_{λ=0.0623}^lasso = (0.3715, 0.5151, 0.3421, 0, 0.0491, 0.5623, 0, 0, 0.0014)′

So three predictors, age, lcp and gleason, are removed. Note that the first element is the intercept.
Selecting λ by Loss rank principle method
Tran (2011), Scandinavian Journal of Statistics, Vol. 38, pp. 466-479, proposes a method called the loss rank principle for selecting λ:

LR(λ) = KL(dfλ/n, 1 − ρλ)

where

I KL(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))
I dfλ is the number of non-zero coefficients in the model
I ρλ = ‖y − X β̂λ^lasso‖² / ‖y‖²

The best λ is the one that maximises LR(λ).
Selecting λ by Loss rank principle method
LR(λ) is maximised at λ = 0.0623. In this example, BIC and LRgive the same result.
LASSO for logistic regression

I Response/output data yi are binary: 0 or 1, Yes or No.
I We want to explain/predict yi based on a vector of predictors xi = (xi1, ..., xip)′.
I We assume yi|xi ∼ B(1, p1(xi)), i.e. yi follows a Bernoulli distribution with probability of success p1(xi), where

p1(xi) = P(yi = 1|xi) = exp(β0 + β1xi1 + ... + βpxip) / (1 + exp(β0 + β1xi1 + ... + βpxip))

I If Y is a Bernoulli r.v. with probability π, then the density function of Y is

p(y|π) = π^y (1 − π)^{1−y}.

I The probability density function of yi is therefore

p(yi|xi, β) = p1(xi)^{yi} (1 − p1(xi))^{1−yi}

so the likelihood function is

p(y|X, β) = ∏_{i=1}^n p1(xi)^{yi} (1 − p1(xi))^{1−yi}
LASSO for logistic regression
I The log-likelihood is

ℓ(β) = log p(y|X, β) = ∑i [yi log p1(xi) + (1 − yi) log(1 − p1(xi))]

I Lasso estimates β by minimising the negative log-likelihood plus a penalty term:

−ℓ(β) + λ ∑_{j=1}^p |βj|,  λ > 0

I Insignificant coefficients will be automatically shrunk to 0
I Most "modern" statistical software packages have built-in functions to implement Lasso for logistic regression
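For illustration, here is one way to carry out this minimisation: proximal gradient descent (ISTA), which alternates a gradient step on −ℓ(β) with soft-thresholding of the slope coefficients. This is my own sketch, not the slides' method; in practice one would use a built-in routine (e.g. glmnet in R or scikit-learn in Python).

```python
import math

def lasso_logistic(X, y, lam, step=0.01, n_iter=2000):
    """Proximal gradient descent for L1-penalised logistic regression.

    Minimises -loglik(beta) + lam * sum_j |beta_j|, with the
    intercept b0 left unpenalised. Each iteration takes a gradient
    step on -loglik, then soft-thresholds the slope coefficients.
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # gradient of -loglik: sum_i (p1(x_i) - y_i) * (1, x_i)
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            eta = b0 + sum(bj * xij for bj, xij in zip(beta, xi))
            resid = 1.0 / (1.0 + math.exp(-eta)) - yi
            g0 += resid
            for j in range(p):
                g[j] += resid * xi[j]
        b0 -= step * g0
        for j in range(p):
            z = beta[j] - step * g[j]
            t = step * lam
            beta[j] = math.copysign(max(abs(z) - t, 0.0), z)
    return b0, beta
```

As in the linear case, a large enough λ shrinks every slope coefficient exactly to zero, leaving only the intercept.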
LASSO for logistic regression
λ can be selected using the BIC or AIC criterion:

AIC(λ) = −2 × log-likelihood(β̂λ^lasso) + 2 × dfλ

BIC(λ) = −2 × log-likelihood(β̂λ^lasso) + (log n) × dfλ

where dfλ is the number of non-zero coefficients in the model.

The selected λ is the one that minimises BIC(λ) or AIC(λ).
LASSO
I Lasso is very useful when there are many potential predictors, i.e. when p is large
I Lasso still works even when p ≫ n, a setting where classical methods such as least squares break down