
Page 1:

QBUS3820 Data Mining and Data Analysis

Lecture: Neural Networks

Dr. Minh-Ngoc Tran
University of Sydney Business School

Page 2:

Table of contents

Introduction

Fundamental concepts

Single layer perceptron

Page 3:

Introduction

Page 4:

What are neural networks?

They are a set of very flexible non-linear methods for regression/classification, suitable when your dataset is large.

Page 5:

What are neural networks?

Page 6:

What are neural networks?

- A neural network, or artificial neural network (ANN), is a computational model that tries to mimic a network of neurons in the human brain.

- Artificial neural networks (ANNs) are not biological neural networks, but mathematical models inspired by biological neural networks.

Page 7:

What are neural networks?

- A neural network is an interconnected assembly of simple processing units or neurons, which communicate by sending signals to each other over weighted connections.

- A neural network is made of layers of similar neurons: an input layer, hidden layers, and an output layer.

- The input layer receives data from outside the network. The output layer sends data out of the network. Hidden layers receive, process, and send data within the network.

Page 8:

What are neural networks used for?

- Neural networks are often used for statistical analysis and data modelling, as an alternative to standard nonlinear regression/classification.

- They have been successfully used in speech recognition, textual character recognition, medical imaging diagnosis, robotics, financial market prediction, etc.

- But their applications to business are still somewhat limited.

Page 9:

Fundamental concepts

Page 10:

Elements of an artificial neural network

An ANN includes

- a set of processing units/neurons/nodes

- an activation level Zi for each unit i, which is often the same as the output of the unit

- weights wik, which are the connection strengths between units i and k

- a propagation rule that determines the total input Sk of a unit from its connected units

- an activation function hk that determines the activation level Zk from the total input Sk: Zk = hk(Sk)

Page 11:

Elements of an artificial neural network

Page 12:

Elements of an artificial neural network

Often, the total input sent to unit k is

Sk = ∑_i wik Zi + w0k,

which is a weighted sum of the outputs from all units i that are connected to unit k, plus an offset term w0k.

Then, the output of unit k is

Zk = hk(Sk) = hk( ∑_i wik Zi + w0k )

Page 13:

Elements of an artificial neural network

It’s useful to distinguish three types of units:

- input units (denoted by X): receive data from outside the network

- output units (denoted by Y): send data out of the network

- hidden units (denoted by Z): receive data from and send data to units within the network.

Given the signal from a set of inputs X, an ANN produces an output Y.

Page 14:

Elements of an artificial neural network

The function

hk(Sk) = 1 / (1 + e^{−Sk})

is commonly used as the activation function for hidden units.

For output units,

hk(Sk) = Sk   (used in regression)

or

hk(Sk) = Sk / ∑_ℓ Sℓ   (used in classification)

Page 15:

Training neural networks

- A neural network is a computational model that needs to be estimated.

- The unknown quantities in an ANN are the weights wik.

- These parameters are often estimated based on training data.

Page 16:

Examples of neural networks

- Consider an ANN with no hidden layer.

- Suppose that there are p input units X1, ..., Xp and one output unit Y.

Page 17:

Examples of neural networks

- Let S = w0 + ∑_i wi Xi be the total input sent to Y. For classification using logistic regression, we classify

  Y = 1 if and only if e^S / (1 + e^S) ≥ 0.5, i.e. S ≥ 0

- Equivalently,

  Y = h(S) =
    1,  S ≥ 0
    0,  otherwise

Page 18:

Examples of neural networks

So this classification model is a special case of an ANN with no hidden units.

Page 19:

Examples of neural networks

Multiple linear regression is a special case of an ANN with no hidden units.

Page 20:

Single layer perceptron

Page 21:

Single layer perceptron

- We now focus on the most widely used neural networks in statistics: ANNs with a single hidden layer, often called a single layer perceptron.

Page 22:

Single layer perceptron for regression

- Suppose we have p input predictors/features X = (X1, ..., Xp)′ and a scalar target Y.

- Create M hidden units Z1, ..., ZM.

- Total input of unit Zm:

  SZm = α0m + α1m X1 + ... + αpm Xp = α0m + α′m X

  The weights αij are unknown and need to be estimated.

- Activation level of unit Zm:

  Zm = h(SZm) = h(α0m + α′m X),  m = 1, ..., M

- Compute the total input of the output unit Y:

  S = β0 + β1 Z1 + ... + βM ZM = β0 + β′Z

  with βi the weight from hidden unit Zi to the output unit Y.

- The output function Y = S is a prediction of E(Y|X).
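To make the forward pass concrete, here is a minimal NumPy sketch; the function name, the logistic activation for the hidden units and the random example values are illustrative choices, not part of the lecture.

```python
import numpy as np

def perceptron_forward(X, alpha0, alpha, beta0, beta):
    """Forward pass of a single layer perceptron for regression.

    X      : (n, p) matrix of inputs
    alpha0 : (M,)   offsets of the hidden units
    alpha  : (p, M) weights from inputs to hidden units
    beta0  : scalar offset of the output unit
    beta   : (M,)   weights from hidden units to the output unit
    """
    h = lambda s: 1.0 / (1.0 + np.exp(-s))   # logistic activation for the hidden units
    S_Z = alpha0 + X @ alpha                 # total inputs of the hidden units, shape (n, M)
    Z = h(S_Z)                               # activation levels Z_1, ..., Z_M
    S = beta0 + Z @ beta                     # total input of the output unit
    return S                                 # identity output: prediction of E(Y|X)

# tiny made-up example: n = 4 observations, p = 2 inputs, M = 3 hidden units
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
print(perceptron_forward(X, rng.normal(size=3), rng.normal(size=(2, 3)),
                         0.0, rng.normal(size=3)))
```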

Page 23:

Single layer perceptron for regression

Page 24:

Single layer perceptron for regression

- We can write

  f(X) = S = β0 + β1 h(α01 + ∑_{i=1}^p αi1 Xi) + ... + βM h(α0M + ∑_{i=1}^p αiM Xi)

- So the inputs Xi enter the prediction function f(X) in a nonlinear way.

- Here, for simplicity, we use the same activation h for all hidden units Zm.

- If h(x) = x, it can be seen that f(X) is a linear combination of the Xi, so multiple linear regression is a special case of this single layer perceptron model.

Page 25:

Single layer perceptron for regression

- The model parameters are θ = (α01, α11, ..., αp1; ...; α0M, α1M, ..., αpM; β0, β1, ..., βM).

- Let {yi, xi = (xi1, ..., xip)}, i = 1, ..., n, be the training dataset. The sum of squared errors is

  R(θ) = ∑_{i=1}^n (yi − f(xi))^2

- We estimate θ by minimising R(θ).
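A hedged sketch of how this minimisation could be carried out numerically, using scipy.optimize.minimize on the sum of squared errors; the flat packing of θ and the choice of BFGS are assumptions for illustration, not the method prescribed in the lecture.

```python
import numpy as np
from scipy.optimize import minimize

def fit_perceptron(X, y, M, seed=0):
    """Estimate theta = (alphas, betas) by minimising the sum of squared errors R(theta)."""
    n, p = X.shape
    h = lambda s: 1.0 / (1.0 + np.exp(-s))

    def unpack(theta):
        alpha = theta[:(p + 1) * M].reshape(p + 1, M)  # row 0 holds the offsets alpha_0m
        beta = theta[(p + 1) * M:]                     # beta_0, beta_1, ..., beta_M
        return alpha, beta

    def R(theta):
        alpha, beta = unpack(theta)
        Z = h(alpha[0] + X @ alpha[1:])                # hidden activations, shape (n, M)
        f = beta[0] + Z @ beta[1:]                     # network output f(x_i)
        return np.sum((y - f) ** 2)                    # sum of squared errors

    theta0 = np.random.default_rng(seed).normal(scale=0.1, size=(p + 1) * M + M + 1)
    result = minimize(R, theta0, method="BFGS")        # numerical-gradient minimisation
    return unpack(result.x)
```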

Page 26:

Single layer perceptron for 0-1 classification

- Suppose we have p input predictors/features X = (X1, ..., Xp)′ and a scalar target Y.

- Create M hidden units Z1, ..., ZM.

- Total input of unit Zm:

  SZm = α0m + α1m X1 + ... + αpm Xp = α0m + α′m X

  The weights αij are unknown and need to be estimated.

- Activation level of unit Zm:

  Zm = h(SZm) = h(α0m + α′m X),  m = 1, ..., M

- Compute the total input of the output unit Y:

  S = β0 + β1 Z1 + ... + βM ZM = β0 + β′Z

  with βi the weight from hidden unit Zi to the output unit Y.

- The output function is

  Y = h(S) =
    1,  S ≥ 0
    0,  otherwise

Page 27:

Issues in training neural networks

- Often, neural networks have too many weights and might overfit the data. To overcome the overfitting problem, we minimise

  R(θ) + λ P(θ)

  where P(θ) is a model complexity penalty and λ ≥ 0 controls the strength of the penalty.

- For example,

  P(θ) = ∑_{k=1}^K ∑_{m=1}^M βmk^2 + ∑_{j=1}^p ∑_{m=1}^M αjm^2

- λ can be selected by cross-validation.

Page 28:

Issues in training neural networks

- With too few hidden units M, the model might not be flexible enough to capture the nonlinearities in the data; with too large an M, the model might overfit the data.

- It is most common to choose a reasonably large M and use a penalty term to avoid overfitting.

Page 29:

Multiple hidden layer neural networks

- One can use neural networks with multiple hidden layers.

- The choice of the number of hidden layers is guided by background knowledge or by using a test dataset.

Page 30:

Example: handwriting recognition

- Handwriting recognition is an important task, especially in postal services.

- We want to recognise handwritten digits scanned from envelopes.

Page 31:

Example: handwriting recognition

- The input is a vector of 16 × 16 = 256 pixel values, and the output is one of the ten digits 0, ..., 9.

- Each training observation, i.e. an image, is a vector of its 256 pixel values together with the correct digit label.

- There are 320 images in the training set and 160 in the test set.

Page 32:

Example: handwriting recognition

Five networks were used:

- Net-1: no hidden layer, equivalent to multinomial logistic regression.

- Net-2: one hidden layer with 12 fully connected hidden units.

- Net-3: two hidden layers, locally connected.

- Net-4 and Net-5: two hidden layers, locally connected with different constraint levels on the weights.

Page 33:

Example: handwriting recognition

Page 34:

Summary

- ANNs provide a range of flexible nonlinear models for data modelling.

- ANNs have been successfully applied in many fields: robotics, vision, image processing, etc.

- They are useful for prediction, but not for inference, because it's difficult to interpret the coefficients/weights in an ANN model.

Page 35:

QBUS3820 Data Mining and Data Analysis

Lecture: Piecewise polynomials and Spline regression

Dr. Minh-Ngoc Tran
University of Sydney Business School

Page 36:

Table of contents

Piecewise polynomials

Spline regression

Reading: Chapter 5, The Elements of Statistical Learning.

Page 37:

Introduction

- Let Y be the response variable and X be a predictor. We consider scalar/univariate X in this lecture.

- The conditional mean f(X) = E(Y|X) is all we need for both inference and prediction tasks. The regression model

  Y = f(X) + ε,  where ε is an error term with E(ε) = 0,

  is the most general regression model, as we don't make any assumptions about the form of f(X).

- The linear regression model assumes f(X) = β0 + β1X.

- In many cases, it's unlikely that the conditional mean E(Y|X) is truly linear in X!

Page 38:

Introduction

Obviously, f(X) = E(Y|X), approximated by the red curve, is not linear in X.

Page 39:

Introduction

- This lecture goes beyond the linearity assumption in regression.

Page 40:

Piecewise polynomials

Page 41:

First, what is a polynomial?

A p-degree (or p-order) polynomial is

f(x) = a0 + a1 x + a2 x^2 + ... + ap x^p

where a0, ..., ap are coefficients.

Page 42:

Global polynomial regression

Consider the general regression model

Y = f(X) + ε,  E(ε) = 0    (1)

where f(X) = E(Y|X) is unknown.

Let's use the Taylor expansion of f(X) at 0: for some order p ≥ 1,

f(X) ≈ f(0) + f′(0)X + (f^(2)(0)/2!) X^2 + (f^(3)(0)/3!) X^3 + ... + (f^(p)(0)/p!) X^p
     = β0 + β1X + β2X^2 + ... + βpX^p

for all values in the range of X.

Here β0 = f(0), β1 = f′(0), ..., βp = f^(p)(0)/p! are unknown coefficients.

f^(k)(x) denotes the kth derivative of the function f(x).

Page 43:

Global polynomial regression

- Model (1) becomes

  Y = β0 + β1X + β2X^2 + ... + βpX^p + ε    (2)

  Model (2) is now a multiple linear regression model, so the coefficients βj can be easily estimated from data.

- This model is referred to as the global polynomial regression model.

- “Global” means the coefficients βj are constant across the entire range (also called the domain) of X.

- The global polynomial regression model offers a way to relax the linearity assumption.

- However, global polynomial regression has a main drawback: because of its global nature, it can provide a good fit (to the data) in one area but behave rather weirdly in another area.

Page 44:

Global polynomial regression

Figure: Industrial production index, January 1990 to October 2005. Left panel: global polynomial regression. Right panel: cubic spline regression (see later).

Tuning the coefficients to achieve a functional form in one region can cause the function to flap about madly in other regions.

Page 45:

Piecewise polynomial regression

In this lecture we consider techniques that allow for local polynomial representations.

The first technique is piecewise polynomial regression.

Page 46:

Piecewise polynomials

Basic idea: we divide the range/domain of X into continuous intervals, then approximate f(X) in each interval by a separate polynomial.

Page 47:

Piecewise constant

Divide the range/domain of X into continuous intervals, then approximate f(X) in each interval by a constant, i.e. a 0-degree polynomial.

Figure: We approximate the true curve by a step function

Page 48:

Piecewise constant

For example, we divide the domain of X into 3 intervals by 2 limit points ξ1 and ξ2 (called knots).

Define three functions, called basis functions:

h1(X) = I(X < ξ1),  h2(X) = I(ξ1 ≤ X < ξ2),  h3(X) = I(X ≥ ξ2)

We approximate the true curve f(X) = E(Y|X) by a step function

f(X) = ∑_{m=1}^3 βm hm(X) =
  β1,  X < ξ1
  β2,  ξ1 ≤ X < ξ2
  β3,  X ≥ ξ2

f(X) is constant in each of the regions X < ξ1, ξ1 ≤ X < ξ2 and X ≥ ξ2.

Page 49:

Piecewise constant

Let {Xi, Yi, i = 1, ..., n} be the training data.

Let Zi1 = h1(Xi), Zi2 = h2(Xi), Zi3 = h3(Xi).

We arrive at the following multiple linear regression model

Yi = β1 Zi1 + β2 Zi2 + β3 Zi3 + εi,  i = 1, ..., n

So it's easy to estimate the coefficients βi using the least squares method or the maximum likelihood method.
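A minimal NumPy sketch of this least squares fit, assuming 1-D arrays x and y and knot values xi1 < xi2:

```python
import numpy as np

def fit_piecewise_constant(x, y, xi1, xi2):
    """Least squares fit of the piecewise constant model with knots xi1 < xi2."""
    x = np.asarray(x, dtype=float)
    Z = np.column_stack([x < xi1,
                         (x >= xi1) & (x < xi2),
                         x >= xi2]).astype(float)      # basis functions h1, h2, h3
    beta, *_ = np.linalg.lstsq(Z, np.asarray(y, dtype=float), rcond=None)
    return beta                                         # beta_m is the fitted level in region m
```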

Page 50:

Piecewise constant

It can be shown that

βm = Ave{Yi | Xi ∈ region m}

Page 51:

Piecewise linear

Divide the domain of X into continuous intervals, then approximate f(X) in each interval by a linear function, i.e. a 1-degree polynomial.

Define six basis functions:

h1(X) = I(X < ξ1),  h2(X) = I(ξ1 ≤ X < ξ2),  h3(X) = I(X ≥ ξ2)
h4(X) = X·I(X < ξ1),  h5(X) = X·I(ξ1 ≤ X < ξ2),  h6(X) = X·I(X ≥ ξ2)

We approximate the true curve f(X) = E(Y|X) by a piecewise linear function

f(X) = ∑_{m=1}^6 βm hm(X) =
  β1 + β4X,  X < ξ1
  β2 + β5X,  ξ1 ≤ X < ξ2
  β3 + β6X,  X ≥ ξ2

f(X) is a linear function in each region.

Page 52:

Piecewise linear

Let {Xi, Yi, i = 1, ..., n} be the training data.

Let Zik = hk(Xi), k = 1, ..., 6, i = 1, ..., n.

We arrive at the following multiple linear regression model

Yi = β1 Zi1 + β2 Zi2 + β3 Zi3 + β4 Zi4 + β5 Zi5 + β6 Zi6 + εi

It's easy to estimate the coefficients βi using the least squares method or the maximum likelihood method.

Page 53:

Piecewise linear

Page 54:

Higher order piecewise polynomials

Similarly, we can construct higher order piecewise polynomials. For example,

- piecewise quadratic polynomials: approximate f(X) = E(Y|X) by a 2-degree polynomial in each region

- piecewise cubic polynomials: approximate f(X) = E(Y|X) by a 3-degree polynomial in each region

Page 55:

Spline regression

Page 56:

Discontinuity issue

Consider a function f(x) and some x0 in its domain.

- f−(x0) = lim_{x→x0−} f(x) is the left limit of f(x) at x0, i.e. the limit of f(x) as x goes to x0 from the left.

- f+(x0) = lim_{x→x0+} f(x) is the right limit of f(x) at x0, i.e. the limit of f(x) as x goes to x0 from the right.

- If f−(x0) ≠ f+(x0), we say that f(x) is discontinuous at x0.

Page 57:

Discontinuity issue

- Piecewise polynomials are in general not continuous at the knots ξj: f−(ξj) ≠ f+(ξj).

- Discontinuity causes inference/prediction problems. E.g., what is the prediction of f(ξj) = E(Y|X = ξj)?

- Typically, we prefer continuity in statistical modelling.

Page 58:

Spline

- We need continuity constraints

  f−(ξj) = f+(ξj),

  i.e. the left limit meets the right limit at each knot.

- A technique to impose such constraints is to use a spline.

- A spline is a piecewise polynomial that is continuous at the knots. Therefore, a spline is continuous everywhere in the entire range of X.

Page 59:

Linear splines

Suppose that we divide the range of X into three intervals with knots ξ1 and ξ2. Define the basis functions

h0(X) = 1

h1(X) = X

h2(X) = (X − ξ1)+ = (X − ξ1)·I(X ≥ ξ1) =
  0,  if X < ξ1
  X − ξ1,  if X ≥ ξ1

h3(X) = (X − ξ2)+ = (X − ξ2)·I(X ≥ ξ2) =
  0,  if X < ξ2
  X − ξ2,  if X ≥ ξ2

Similarly, we can define the basis functions when there are K > 2 knots ξ1, ξ2, ..., ξK.

Page 60:

Linear splines

Let

f(X) = β0 h0(X) + β1 h1(X) + β2 h2(X) + β3 h3(X)
     = β0 + β1X + β2(X − ξ1)+ + β3(X − ξ2)+
     =
  β0 + β1X,  X < ξ1
  β0 + β1X + β2(X − ξ1),  ξ1 ≤ X < ξ2
  β0 + β1X + β2(X − ξ1) + β3(X − ξ2),  X ≥ ξ2

It's easy to check that

- f−(ξj) = f+(ξj), j = 1, 2. That is, f(X) is continuous at every X.

- f(X) is linear in each region: X < ξ1, ξ1 ≤ X < ξ2 and X ≥ ξ2.

- f(X) is called a linear spline: a piecewise linear polynomial that is continuous everywhere.

Page 61:

Estimating a linear spline

Let {Xi, Yi, i = 1, ..., n} be the training data. The linear spline regression model becomes

Yi = β0 h0(Xi) + β1 h1(Xi) + β2 h2(Xi) + β3 h3(Xi) + εi

Let Zik = hk(Xi), k = 0, ..., 3, i = 1, ..., n.

We arrive at the following multiple linear regression model

Yi = β0 Zi0 + β1 Zi1 + β2 Zi2 + β3 Zi3 + εi

So it's easy to estimate the coefficients βi using the least squares method or the maximum likelihood method.

Page 62:

Linear splines

Page 63:

Cubic splines

- A linear spline is continuous but not smooth: it has a peak/sudden change of slope at the knots ξk. This is not an attractive feature in statistical modelling.

- We often prefer smoother functions. Typically, we prefer an f(X) that is not only continuous but also has continuous first and second derivatives.

- This can be achieved by increasing the order of the local polynomials, i.e. by using cubic splines.

Page 64:

Cubic splines

Page 65:

Cubic splines

Suppose that we divide the range of X into three intervals with knots ξ1 and ξ2.

Start with the 3-degree polynomial basis functions

h0(X) = 1,  h1(X) = X,  h2(X) = X^2,  h3(X) = X^3

and add one additional basis function for each knot:

h4(X) = (X − ξ1)^3_+ = (X − ξ1)^3·I(X ≥ ξ1)

h5(X) = (X − ξ2)^3_+ = (X − ξ2)^3·I(X ≥ ξ2)

Similarly, we can define the basis functions when there are K > 2 knots ξ1, ξ2, ..., ξK.

Page 66:

Cubic splines

Let

f(X) = β0 + β1X + β2X^2 + β3X^3 + β4(X − ξ1)^3_+ + β5(X − ξ2)^3_+

Then

f′(X) = β1 + 2β2X + 3β3X^2 + 3β4(X − ξ1)^2·I(X ≥ ξ1) + 3β5(X − ξ2)^2·I(X ≥ ξ2)

f′′(X) = 2β2 + 6β3X + 6β4(X − ξ1)·I(X ≥ ξ1) + 6β5(X − ξ2)·I(X ≥ ξ2)

It can be checked that

- f(X), f′(X) and f′′(X) are continuous at every X

- f(X) is a 3-degree polynomial in each region

- f(X) is called a cubic spline

Page 67:

Estimating a cubic spline

Again, it's easy to fit a cubic spline to data.

Let {Xi, Yi, i = 1, ..., n} be the training data.

Let Zik = hk(Xi), k = 0, ..., 5, i = 1, ..., n.

We arrive at the following multiple linear regression model

Yi = β0 Zi0 + β1 Zi1 + β2 Zi2 + β3 Zi3 + β4 Zi4 + β5 Zi5 + εi

It's easy to estimate the coefficients βi using the least squares method or the maximum likelihood method.

Page 68:

Cubic splines

Page 69:

Spline regression

- Basically, in spline regression we fit a separate polynomial to the data in each region.

- Unlike global polynomial regression, where the coefficients are constant across all regions, in spline regression the coefficients are adjusted locally.

- Unlike piecewise polynomial regression, in spline regression a certain order of smoothness is imposed at the knots.

Page 70:

Cubic splines with K knots

Given a set of K knots

ξ1 < ... < ξk < ... < ξK,

the cubic spline with K knots is

f(X) = β0 + β1X + β2X^2 + β3X^3 + ∑_{k=1}^K β_{3+k} (X − ξk)^3_+.
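A sketch of the corresponding truncated power basis matrix, assuming a 1-D array x and a list of knot values:

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Columns 1, X, X^2, X^3 and (X - xi_k)_+^3 for each knot xi_k."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - xi, 0.0) ** 3 for xi in knots]
    return np.column_stack(cols)                        # shape (n, 4 + K)
```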

Page 71:

Knots selection

- Selecting the number of knots K and the knot positions ξk is an art.

- One option is to set ξk to the 100k/(K + 1)-th percentile of the distribution of X.

- How many knots K should be used?
  - If K is too large, we have an overfitting problem.
  - If K is too small, we have an underfitting problem.
  - This is a model selection problem (see later).

Page 72:

An example

Figure: Scatter plot of x vs. y

Page 73:

An example

Figure: 300 training data points (o) and 100 test data points (x)

Page 74:

An example

- Both the training and test datasets are available on Blackboard.

- Clearly, simple linear regression is not appropriate for this dataset.

- Let's fit a cubic spline regression model to this training dataset and test its predictive power on the test data.

- Let's proceed as if there weren't any built-in functions in your software that could do this task!

Page 75:

An example

- Let's select K = 3.

- ξ1, ξ2 and ξ3 are the 25%-, 50%- and 75%-percentiles of the X data.
  - The α%-percentile is a number such that α% of the data points are smaller than that number.
  - Check BUSS1020 or Google if you don't remember how to compute percentiles!
  - Most statistical software has functions to compute this.

- Based on my calculation, ξ1 = 10, ξ2 = 19.5 and ξ3 = 58.

- We need to fit the following cubic spline regression model

  Yi = β0 + β1Xi + β2Xi^2 + β3Xi^3 + β4(Xi − 10)^3_+ + β5(Xi − 19.5)^3_+ + β6(Xi − 58)^3_+ + εi,  i = 1, ..., n.

Page 76:

An example

- Let Zi1 = Xi, Zi2 = Xi^2, ..., Zi6 = (Xi − 58)^3_+, i = 1, ..., n. We have the following multiple linear regression model

  Yi = β0 + ∑_{k=1}^6 βk Zik + εi

- Let

  X =
    1  Z11  ...  Z16
    1  Z21  ...  Z26
    ...
    1  Zn1  ...  Zn6

  y = (Y1, Y2, ..., Yn)′

  The least squares estimate of the vector β is β̂ = (X′X)^{−1} X′y.
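Putting the example together, a hedged sketch that assumes the training data have already been loaded into 1-D arrays x_train and y_train (these names, and the use of np.percentile for the knots, are illustrative):

```python
import numpy as np

# assume x_train, y_train are 1-D arrays holding the Blackboard training data
knots = np.percentile(x_train, [25, 50, 75])            # knot positions xi_1, xi_2, xi_3
X = np.column_stack([np.ones_like(x_train), x_train, x_train**2, x_train**3] +
                    [np.maximum(x_train - xi, 0.0) ** 3 for xi in knots])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y_train)      # beta_hat = (X'X)^{-1} X'y
y_fit = X @ beta_hat                                    # fitted curve at the training points
```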

Page 77:

An example

β̂ = (498.9679, −0.3252, 0.1046, −0.0064, 0.0091, −0.0025, −0.0018)′

- The estimate of f(X) = E(Y|X) is

  Ŷ = β̂0 + β̂1X + β̂2X^2 + β̂3X^3 + β̂4(X − 10)^3_+ + β̂5(X − 19.5)^3_+ + β̂6(X − 58)^3_+

- The plot of (X, Ŷ) as X varies is the fitted (also called predicted) curve.

Page 78:

An example

Figure: 300 training data points and the fitted curve

Page 79:

An example

Figure: 100 test data points and the fitted curve

Page 80:

An example

Let’s see the effect of the number of knots K

Page 81:

An example

Figure: 100 test data points and the fitted curve

Page 82:

An example

Figure: 100 test data points and the fitted curve

Page 83:

An example

Figure: 100 test data points and the fitted curve

Page 84:

An example

Figure: 100 test data points and the fitted curve

Page 85:

An example

Figure: 100 test data points and the fitted curve

Page 86:

An example

Figure: 100 test data points and the fitted curve

Page 87:

An example

Figure: 100 test data points and the fitted curve

Page 88:

Homework

Download the datasets from Blackboard and fit a cubic spline regression model by yourself. Have fun!

Page 89:

QBUS3820 Data Mining and Data Analysis

Lecture: Kernel methods

Dr. Minh-Ngoc Tran
University of Sydney Business School

Page 90:

Table of contents

Kernel density estimation

Kernel regression

Page 91:

Introduction

- We've so far discussed parametric methods: we first assume a parametric form for the underlying model that generated the data, then estimate the parameters.

- In parametric modelling, the underlying model that generated the data is described by a functional form that depends on a vector of unknown parameters θ.

- E.g., simple linear regression

  yi = β0 + β1xi + εi,  εi ∼ N(0, σ^2)

  is a parametric model: we assume the model that generated the data yi, given xi, is a normal distribution with mean β0 + β1xi and variance σ^2. The set of unknown parameters is θ = (β0, β1, σ^2).

Page 92:

Introduction

- This lecture is about nonparametric methods for estimating probability density functions and regression functions. They are also called kernel methods, as they are based on kernel functions.

- Kernel methods are considered modern data analysis techniques, as they have grown rapidly with the widespread availability of computing power.

- This lecture covers
  - kernel methods for estimating density functions
  - kernel methods for estimating regression functions

Page 93:

Kernel density estimation

Page 94:

Density estimation

- Let X1, ..., Xn be i.i.d. (independent and identically distributed) samples from an unknown cumulative distribution function (cdf) F(x) with probability density function (pdf) f(x).

- We will use the generic notation X to denote a random variable with distribution F(x), i.e. the Xi are identical copies of X.

- We want to estimate F(x) (or equivalently f(x)), as F(x) contains all the information about X we need to know: mean, variance, correlation between components of X (if X is a random vector), etc.

Page 95:

Density estimation

For example, in the spending dataset,

- What is the distribution of spending amounts?

- What is the mode? What is the shape of this distribution?

Page 96:

Parametric vs. Nonparametric

- Parametric approaches assume a known functional form for f, such as normal, gamma, or Poisson, which involves some unknown parameters:

  f(x) = f(x|θ).

  The task is then to estimate θ by, e.g., maximum likelihood estimation or Markov chain Monte Carlo (not covered in this course).

- Parametric methods often achieve attractive properties (estimators with small variances and fast convergence to the true population values), provided that the true underlying density function that generated the data is well approximated by the postulated form f(x|θ).

- Parametric approaches might lead to misleading results if the assumed parametric model is far away from the true density.

Page 97:

Parametric vs. Nonparametric

- Nonparametric approaches don't assume a known functional form for f.

- They only make basic assumptions, such as:
  - a finite second moment, E(X^2) < ∞, or
  - the true density being smooth enough: the derivatives f^(r)(x) exist up to some order r.

- So nonparametric approaches are more robust and more flexible.

- But they do have their own drawbacks (discussed later).

Page 98:

Histogram

- The histogram is the oldest and most widely used nonparametric density estimator.

- X1, ..., Xn are i.i.d. samples from an unknown cdf F with pdf f(x) = ∂F(x)/∂x.

- The empirical distribution function, defined as

  Fn(x) = (1/n) ∑_{i=1}^n I(Xi ≤ x),

  is an estimator of the cdf F(x) = P(X ≤ x).

- Note that ∑_{i=1}^n I(Xi ≤ x) ∼ Bi(n, F(x)). So

  E(Fn(x)) = F(x),   V(Fn(x)) = F(x)(1 − F(x))/n → 0 as n → ∞,

  for every x.

- Fn(x) is a consistent and unbiased estimator of F(x).
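A one-line sketch of the empirical distribution function for a sample held in a 1-D array:

```python
import numpy as np

def ecdf(samples, x):
    """Empirical distribution function F_n(x) = (1/n) * sum_i I(X_i <= x)."""
    return np.mean(np.asarray(samples) <= x)
```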

Page 99:

Histogram

Basic idea:

f(x) ≈ (F(x + h) − F(x − h)) / (2h),   h > 0 small.

Replace F(x) by its estimate Fn(x):

fn(x) = (Fn(x + h) − Fn(x − h)) / (2h)
      = (1/(2hn)) ∑_i (I(Xi ≤ x + h) − I(Xi ≤ x − h))
      = (1/(2hn)) ∑_i I(x − h < Xi ≤ x + h)
      = (1/(2hn)) × number of the Xi's in (x − h, x + h]

is an estimator of f(x). Note: the interval is open on the left and closed on the right.

Page 100:

Histogram

Constructing a histogram estimator of f(x):

(i) Given a bin width h > 0, form the bins of the histogram. E.g., given an origin x0, take the bins

  (x0 + mh, x0 + (m + 1)h],   m = 0, ±1, ±2, ...

  Or, divide the range (a, b) into bins of length h.

(ii) Set

  fn(x) = (1/(nh)) × number of the Xi's in the bin that contains x

Some statistical software might show the frequency (the number of the Xi's in each bin) on the y-axis rather than the density (the frequency divided by nh).

Technically, the density must be shown to make sure that the entire area under fn(x) sums to 1.
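A sketch of the density-scaled histogram estimator just described, assuming an origin x0 and bin width h are given:

```python
import numpy as np

def histogram_density(samples, x, x0, h):
    """Histogram estimate f_n(x): count of X_i in the bin containing x, divided by n*h."""
    samples = np.asarray(samples, dtype=float)
    m = np.floor((x - x0) / h)                       # index of the bin (x0 + m*h, x0 + (m+1)*h]
    lo, hi = x0 + m * h, x0 + (m + 1) * h
    count = np.sum((samples > lo) & (samples <= hi))
    return count / (len(samples) * h)
```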

Page 101:

Histogram

Histograms of 1000 observations from the standard normal distribution. The y-axis of the left panel shows the frequency.

Page 102:

Asymptotic properties of histogram estimator*

The mean squared error (MSE) of fn(x):

MSE(fn(x)) = E[fn(x) − f(x)]^2
           = [E(fn(x)) − f(x)]^2 + E[fn(x) − E(fn(x))]^2
           = Bias^2 + Variance.

MSE is a widely used performance measure.

Page 103:

Asymptotic properties of histogram estimator*

Assume that f(x) is smooth enough, in the sense that there exists a constant γ > 0 such that

|f(x) − f(y)| ≤ γ|x − y|   for all x, y.

Then there exists a number ξx belonging to the bin containing x such that

MSE(fn(x)) ≤ γ^2 h^2 + f(ξx)/(nh)

The optimal bin width (the one that minimises this bound on the MSE) is

h* = ( f(ξx) / (2γ^2 n) )^{1/3}.

Page 104:

Asymptotic properties of histogram estimator*

Under the optimal h*, the resulting MSE is

MSE(fn(x)) = C / n^{2/3} → 0 as n → ∞

for some constant C < ∞.

- So fn(x) converges to the true value f(x) as n → ∞ at the rate n^{−2/3}.

- Note: we don't make any assumptions on the form of the underlying density f(x).

- Therefore kernel density estimation is more robust than parametric approaches.

Page 105:

Asymptotic properties of histogram estimator*

It can be shown that, for parametric models f(x|θ) where the parameter θ is estimated by the MLE θ̂,

MSE(f(x|θ̂)) = M / n

for some constant M < ∞.

- So f(x|θ̂) converges to the true value f(x) as n → ∞ at the rate n^{−1}, provided that the postulated model f(x|θ) is the true underlying density.

- So the rate of convergence of nonparametric estimators is slower than that of parametric estimators. Why?

Page 106:

How many bins should be used?

Sturges' rule: the number of bins should be 1 + log2(n). For example, with n = 1000 observations this gives 1 + log2(1000) ≈ 11 bins.

Page 107:

Spending Amount example

The number of bins is too small. Important features of this distribution, such as the mode, are not revealed.

Page 108:

Spending Amount example

The number of bins selected by Sturges’ rule.

Page 109:

Spending Amount example

The number of bins is too large. The distribution is overfitted.

Page 110:

Kernel density estimation

Recall the estimator fn(x) we had earlier:

fn(x) = (1/(2hn)) × number of the Xi's falling in (x − h, x + h],

which can be written as

fn(x) = (1/(hn)) ∑_{i=1}^n K( (x − Xi)/h )

where the kernel K(·) is

K(t) =
  1/2,  if −1 < t ≤ 1
  0,    otherwise

- Apart from the multiplicative factor 1/(hn), this estimator assigns the same weight of 1/2 to every Xi falling in (x − h, x + h].

- Intuitively, more weight should be put on the Xi's that are closer to x.

Page 111:

Kernel density estimation

Page 112:

Kernel density estimation

Page 113:

Kernel density estimation

- Intuitively, more weight should be put on the Xi's that are closer to x.

- The kernel K(t) should be designed so that K((x − Xi)/h) gets larger when Xi is closer to x.

- This means K(t) should be larger when t is closer to 0.

- The uniform kernel K(t) = (1/2)·I(|t| ≤ 1) does not have this desired property.

Page 114:

Kernel density estimation

Some commonly used kernels:

Gaussian kernel: K(t) = (1/√(2π)) e^{−t^2/2}

Triangular kernel: K(t) = (1 − |t|)·I(|t| ≤ 1)

Epanechnikov kernel: K(t) = (3/4)(1 − t^2)·I(|t| ≤ 1)

Page 115:

Kernel density estimation

The kernel density estimator of the unknown pdf f(x), based on i.i.d. samples X1, ..., Xn, is defined as

fn(x) = (1/(hn)) ∑_{i=1}^n K( (x − Xi)/h )

- h is called the bandwidth; it controls the amount of smoothness.

- K is a kernel function.
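A minimal sketch of this estimator with a Gaussian kernel; the bandwidth is left to the caller and would in practice come from a rule such as the normal reference bandwidth:

```python
import numpy as np

def kde(samples, x_grid, h):
    """f_n(x) = (1/(h*n)) * sum_i K((x - X_i)/h) with a Gaussian kernel, on a grid of x values."""
    samples = np.asarray(samples, dtype=float)
    x_grid = np.asarray(x_grid, dtype=float)
    t = (x_grid[:, None] - samples[None, :]) / h          # scaled distances, shape (grid, n)
    K = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)            # Gaussian kernel
    return K.sum(axis=1) / (len(samples) * h)
```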

Page 116:

Kernel density estimation

Some remarks:

- Kernels in order of efficiency: Epanechnikov, Triangular, Gaussian, Uniform.

- But the kernel typically plays a less important role than the bandwidth h in determining the performance.

- There are several methods for selecting the bandwidth h. We won't discuss them in this course.

- Built-in functions in statistical software often already come with an optimal choice of h.

Page 117:

Spending Amount example

The bandwidth h is too large. Local features of this distribution are not revealed.

Page 118:

Spending Amount example

The bandwidth h is selected by a rule of thumb called the normal reference bandwidth.

Page 119:

Spending Amount example

The bandwidth h is too small. The distribution is overfitted.

Page 120:

Kernel regression

Page 121:

Kernel regression

- Consider n i.i.d. samples (x1, y1), ..., (xn, yn), copies of the pair (X, Y) with X ∈ R and Y ∈ R.

- We want to predict Y based on X.

- The simple linear regression model assumes

  Y = β0 + β1X + ε,

  i.e. we assume a linear function for the conditional mean m(x) = E(Y|X = x) = β0 + β1x.

- Parametric approaches assume a known functional form for E(Y|X = x).

- In contrast, nonparametric approaches don't assume any known functional form for m(x) = E(Y|X = x).

Page 122:

k-nearest neighbour method

kNN is the simplest nonparametric regression method.

It estimates m(x) = E(Y|X = x) by

m̂_kNN(x) = Average(yi | xi ∈ Nk(x))

where Nk(x) denotes the neighbourhood that contains the k elements among the xi's closest to x. Equivalently,

m̂_kNN(x) = (1/k) ∑_{i=1}^n yi·I(xi ∈ Nk(x)) = (1/k) ∑_{i: xi ∈ Nk(x)} yi
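A sketch of the kNN regression estimate at a single query point x, assuming 1-D training arrays x_train and y_train:

```python
import numpy as np

def knn_regression(x_train, y_train, x, k):
    """m_kNN(x): average of the y_i whose x_i are the k nearest neighbours of x."""
    dist = np.abs(np.asarray(x_train, dtype=float) - x)   # distances to the query point
    idx = np.argsort(dist)[:k]                            # indices of the k closest x_i
    return np.mean(np.asarray(y_train, dtype=float)[idx])
```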

Page 123:

kNN for classification

- The output variable G has two values: BLUE (= 0) and ORANGE (= 1).

- There are two predictors, X1 and X2.

- 200 training data points are shown in the picture.

Page 124:

kNN for classification

We estimate the probability of being ORANGE by

Ŷ(x) = P̂(G = 1|x) = (1/k) ∑_{i: xi ∈ Nk(x)} yi

Classification rule:

Ĝ(x) =
  ORANGE,  if Ŷ(x) > 0.5
  BLUE,    if Ŷ(x) ≤ 0.5

Page 125:

kNN for classification

Figure: kNN classification with k = 15. The black curve is the decision boundary {x : Ŷ(x) = 0.5}

Page 126:

kNN for classification

Figure: kNN classification with k = 1. The model is overfitted: the classifier works perfectly well on the training data, but not that well on test data

Page 127:

k-nearest neighbour method

k has a big influence on the performance of kNN. k is often selected by cross-validation.

Page 128:

k-nearest neighbour method

m̂_kNN(x) = (1/k) ∑_{i=1}^n yi·I(xi ∈ Nk(x)) = (1/k) ∑_{i: xi ∈ Nk(x)} yi

- kNN assigns the same weight 1/k to all yi that have xi close to x. This is similar to using a uniform kernel.

- Intuitively, yi should be given a bigger weight if xi is closer to x.

- So, we can make kNN better. Let's move on...

Page 129:

Nadaraya-Watson estimator

Let

Kh(x, xi) = K( (x − xi)/h )

be a kernel function that gets bigger when xi gets closer to x.

The Nadaraya-Watson estimator of the conditional mean m(x) = E(Y|X = x) is

m̂(x) = ∑_i wi(x) yi

where

wi(x) = Kh(x, xi) / ∑_{j=1}^n Kh(x, xj)

That is, we put more weight on those yi that have the corresponding xi closer to x.
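A sketch of the estimator with a Gaussian kernel, following the weight formula above; the array names are illustrative:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x, h):
    """m_hat(x) = sum_i w_i(x) y_i with w_i(x) = K_h(x, x_i) / sum_j K_h(x, x_j)."""
    x_train = np.asarray(x_train, dtype=float)
    K = np.exp(-((x - x_train) / h) ** 2 / 2)         # Gaussian kernel values K_h(x, x_i)
    w = K / K.sum()                                   # weights w_i(x) sum to one
    return np.sum(w * np.asarray(y_train, dtype=float))
```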

Page 130:

Asymptotic properties of the Nadaraya-Watson estimator*

m̂(x) = [∑ Kh(x, xi) yi] / [∑ Kh(x, xi)],   m(x) = [∑ Kh(x, xi) m(x)] / [∑ Kh(x, xi)]

m̂(x) − m(x) = [∑ Kh(x, xi)(yi − m(x))] / [∑ Kh(x, xi)]

Bias(m̂(x)) = E[m̂(x) − m(x)]
            = [∑ Kh(x, xi)(m(xi) − m(x))] / [∑ Kh(x, xi)]
            = [(nh)^{−1} ∑ Kh(x, xi)(m(xi) − m(x))] / fn(x)
            = ψ(x) / fn(x)

Recall that fn(x) is the KDE of f(x), the pdf of X, and therefore fn(x) → f(x).

Page 131:

Asymptotic properties of the Nadaraya-Watson estimator*

By the LLN,

ψ(x) → ∫ h^{−1} Kh(x, t)(m(t) − m(x)) f(t) dt
     = ∫ K(w)[m(x + hw) − m(x)] f(x + hw) dw
     = ∫ K(w)[hw·m′(x) + (1/2)h^2 w^2 m′′(x) + o(h^2)] × [f(x) + hw·f′(x) + (1/2)h^2 w^2 f′′(x) + o(h^2)] dw
     = C(x, K) h^2 + o(h^2)

where C(x, K) is a constant depending on m(x), f(x) and the kernel K.

Page 132:

Asymptotic properties of the Nadaraya-Watson estimator*

Similarly, we can show that

V(m̂(x)) = C2(x, K)/(nh) + o((nh)^{−1})

So

MSE(m̂(x)) = C1(x, K) h^4 + C2(x, K)/(nh) + o(h^4) + o((nh)^{−1})

The optimal bandwidth h is the one that minimises the right hand side:

h*(x) = C3(x, K) n^{−1/5}

Under this optimal h*,

MSE(m̂(x)) = C4(x, K) n^{−4/5} → 0

as n → ∞.


Asymptotic properties of the Nadaraya-Watson estimator

So,

MSE(m̂(x)) = E(m̂(x) − m(x))² = constant × n^{-4/5} → 0

as the sample size n → ∞.

I So, m̂(x) converges to the true value m(x) as n → ∞.

I Note: we don't make any assumptions on the form of the underlying conditional mean m(x).

I Therefore the kernel regression method is more robust than parametric approaches.


UN data example

Consider the UN data on the relationship between GDP per capita (X) and Fertility rate (Y). We want to estimate E(Y|X = x).

Figure: The red line shows the NW estimator m̂(x) with a too small bandwidth h


UN data example

Figure: The red line shows the NW estimator m̂(x) with an optimal bandwidth h


UN data example

Figure: The red line shows the NW estimator m̂(x) with a too large bandwidth h


QBUS3820 Data Mining and Data Analysis

Model Selection and Variable Selection

Dr. Minh-Ngoc Tran
University of Sydney Business School


Table of contents

Introduction and Basic Concepts

Popular model selection methods

LASSO

Reading: Chapter 7 of the textbook The Elements of Statistical Learning.


Introduction and Basic Concepts


Introduction: Model selection problem

I Model selection in general and variable selection in particular are important parts of data analysis. Variable selection can be considered as a special case of model selection.

I Consider a dataset D. Let {M_i, i ∈ I} be a set of potential models that can be used to explain D.

I The model selection problem is to select the "best" model to interpret D and/or to make good predictions on future observations.

I "best" depends on how we define it: not overfitting, producing accurate predictions, etc.


Introduction: Model selection problem

For example, given a dataset D = {(x₁, y₁), ..., (x_n, y_n)}, two models are proposed to explain D (one is yours, one is your boss'):

Model 1: y_i = β₀ + β₁x_i + ε_i, where ε_i is assumed to have a normal distribution N(0, σ²).

Model 2: y_i = β₀ + β₁x_i + ε_i, where ε_i is assumed to have a Student's t distribution t_ν(0, σ²).

Then, you need to answer the question: which model is better?


Introduction: Variable selection problem

I Consider a regression model with response Y and a set of p potential covariates X₁, ..., X_p.

I At the beginning stage of modelling, p is often large in order to reduce possible bias

I a large p might cause heavy admin duties, be costly, etc.
I more importantly, a large p typically leads to a high variance in prediction (see later)

I The Variable Selection problem is to select the "best" subset of these p covariates to explain/predict Y.

I "best" depends on how we define it: not overfitting, producing accurate predictions, etc.


Basic concepts

I Suppose that we have a target variable Y to be predicted based on an input vector X.

I Given a model M, and based on data D, we predict Y by f_M(X|D).

I The functional form of f_M(X|D) is determined by the nature of model M, e.g. a linear regression model or a spline regression model.

I The estimated parameters in f_M(X|D) are computed based on data D.

I Let L(Y, f_M(X|D)) denote the loss when we predict Y by f_M(X|D), e.g.

I Squared error loss: L(Y, f_M(X|D)) = (Y − f_M(X|D))²
I 0-1 loss: L(Y, f_M(X|D)) = I(Y ≠ f_M(X|D))
I Log-likelihood loss: L(Y, f_M(X|D)) = − log p(Y | f_M(X|D))


Basic concepts

I The prediction error of model M, conditional on data D, is defined as

Err(M|D) = E_(X,Y)[L(Y, f_M(X|D))] = Average{ L(Y_j, f_M(X_j|D)) : all future (X_j, Y_j) }

I The expectation E_(X,Y)[·] is with respect to the joint population distribution of Y and X.

I The prediction error Err(M|D) measures the performance of model M. The smaller this error, the better M is.

I Note that Err(M|D) is dependent on the data D, and is therefore a random quantity.


Basic concepts

I Another measure of prediction performance is the expected prediction error.

I The expected prediction error of model M is defined as

Err(M) = E_D[Err(M|D)] = E_D[ E_(X,Y)[L(Y, f_M(X|D))] ]

I E_D[·] is the expectation with respect to all datasets of the same size as D.

I So Err(M) averages out the effect of data D. There's no longer any uncertainty involved, as all randomness (in (X, Y) and D) has been averaged out.

I This expected prediction error is an ideal measure of performance for model selection.


Basic concepts: Bias and Variance Decomposition

Consider the general regression model

Y = f(X) + ε,   E(ε) = 0,   V(ε) = σ²

Let f̂(X) be an estimate of f(X). For notational simplicity, I've suppressed the dependence of f̂(X) on model M and data D.

I Suppose we want to predict the mean value of Y at X = x₀: f(x₀) = E(Y|X = x₀)

I The prediction is f̂(x₀)

Using squared-error loss, the prediction error is

Err(M|D) = E_ε[(Y|_{X=x₀} − f̂(x₀))²] = E_ε[(f(x₀) − f̂(x₀) + ε)²] = (f(x₀) − f̂(x₀))² + σ²


Basic concepts: Bias and Variance Decomposition

The expected prediction error (EPE) is

Err(M) = E(f(x₀) − f̂(x₀))² + σ²
       = E( (f̂(x₀) − E[f̂(x₀)]) + (E[f̂(x₀)] − f(x₀)) )² + σ²
       = V(f̂(x₀)) + (E[f̂(x₀)] − f(x₀))² + σ²
       = Variance(f̂) + Bias²(f̂) + σ²

As σ² is a constant independent of M, the EPE of model M is basically decomposed into two terms: Variance and Bias².


Basic concepts: Bias and Variance Decomposition

For example, in the multiple linear regression model, the true value f(x₀) = E(Y|X = x₀) (usually non-linear in the p-vector x₀) is first deterministically approximated by a linear combination x₀′β, then stochastically estimated by f̂(x₀) = x₀′β̂.

Err(M) = Variance + Bias² + σ²
       = V(x₀′β̂) + (E[x₀′β̂] − f(x₀))² + σ²
       = x₀′ cov(β̂) x₀ + (x₀′β − f(x₀))² + σ²
       = σ² ∑_{i=1}^p x_{0i}² + (x₀′β − f(x₀))² + σ²,

where we suppose that the design matrix X is standardized so that cov(β̂) = σ²(X′X)^{-1} = σ²I.

So the larger the number of covariates p, the bigger the variance and the smaller the bias², and vice versa.
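To make this trade-off concrete, here is a small simulation sketch: polynomial regressions of increasing degree are fitted to many simulated datasets and the bias and variance of f̂(x₀) are estimated empirically. The data-generating function, noise level and degrees are made-up choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)                    # made-up true regression function
x0, sigma, n = 0.7, 0.3, 50

for degree in [1, 3, 9]:
    preds = []
    for _ in range(500):                       # repeated datasets D
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        coef = np.polyfit(x, y, degree)        # fit a polynomial of the given degree
        preds.append(np.polyval(coef, x0))     # f_hat(x0) for this dataset
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2        # squared bias at x0
    var = preds.var()                          # variance at x0: grows with the degree
    print(f"degree {degree}: Bias^2={bias2:.4f}  Variance={var:.4f}")
```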


Basic concepts: Overfitting

Expected prediction error = Variance + Bias²

I Overfitting: a complex model is used (f̂(x) is a complicated function, involves a lot of parameters, etc.), so Bias² is small but Variance is large. Underfitting: the opposite.

I Model selection: pick a model that trades off between Bias and Variance.


Hope you enjoy this song!

https://www.youtube.com/watch?v=DQWI1kvmwRg


Basic concepts: Overfitting

The light red curves show the prediction errors in a linear regression. The solid red curve shows the expected prediction error (averaged over the prediction errors). The x-axis is the model complexity, proportional to the number of predictors used in the model.


Basic concepts: Training error

Training error is the average loss over the training data points, the data used to estimate the model,

err = (1/n) ∑_{i=1}^n L(y_i, f̂(x_i))

I Given a model, training data is used to fit the model (i.e. estimating the parameters), e.g. by minimising the training error.

I Training error is not a good measure for model selection; it decreases when model complexity increases.


Basic concepts: Training error is not good for model selection

The light blue curves show the training errors in a linear regression. The solid blue curve shows the expected training error E(err) (averaged over the training errors). Training errors consistently decrease when model complexity increases.


Basic concepts: Training data and validation data

I In the ideal case of rich data, we can divide the data into two sets: a training set and a validation set

I use the training set to fit/estimate the model and the validation set to estimate the prediction error (NOT the expected prediction error!)

I Pick the model with the smallest prediction error
I But...

I data is precious
I we can do better with cross-validation (see later)


Basic concepts: Penalised maximum likelihood principle

I Typically, the training error err is smaller than the prediction error, because the same data is used to fit the model and assess its error (see the picture).

I Since training error decreases as model complexity increases, it might be a good idea to penalise model complexity

I Many popular model selection criteria have the form

model selection criterion = err + penalty for model complexity


Popular model selection methods


AIC

Akaike's information criterion (AIC): Select the model with the smallest AIC

AIC = −2 × log-likelihood(θ̂_mle) + 2d

I d is the number of parameters in θ (i.e. the number of covariates)

I log-likelihood(θ̂_mle) is the log-likelihood evaluated at the MLE θ̂_mle

I The factor 2 is not important, but useful when comparing AIC to other model selection criteria

I AIC is an estimate of the expected prediction error, where the loss function is L(y_i, f(x_i)) = − log p(y_i | f(x_i))

I proposed by Hirotugu Akaike in 1973


BIC

Bayesian information criterion (BIC): Select the model with the smallest BIC

BIC = −2 × log-likelihood(θ̂_mle) + (log n) × d

BIC, proposed by Gideon Schwarz in 1978, is motivated by the Bayesian approach.
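As an illustration, AIC and BIC can be computed directly from the maximised log-likelihood. The sketch below does this for a Gaussian linear regression fitted by least squares; counting σ² as an extra parameter in d is one common convention, and the arrays X (including an intercept column) and y are assumed to be given.

```python
import numpy as np

def aic_bic(X, y):
    """AIC and BIC for a Gaussian linear regression y = X beta + eps fitted by least squares."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / n                         # MLE of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # maximised Gaussian log-likelihood
    d = p + 1                                          # regression coefficients + sigma^2
    return -2 * loglik + 2 * d, -2 * loglik + np.log(n) * d

# usage: aic, bic = aic_bic(X_with_intercept_column, y)
```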


BIC*

Consider a model M with a d-vector of parameters θ, and data D. The posterior of M is

p(M|D) ∝ p(M) p(D|M) = p(M) ∫ p(D|θ, M) p(θ|M) dθ.

Using a uniform prior for M and approximating the integral by the so-called Laplace approximation,

log p(D|M) = log p(D|θ̂_mle, M) − (d/2) log(n) + O(1)

O(1), read as "big order one", is a term that may depend on n but stays bounded as n grows.

We want to pick the model with the highest posterior p(M|D), which is equivalent to picking the model with the smallest BIC

BIC = −2 × log-likelihood(θ̂_mle) + (log n) × d


AIC or BIC?

AIC = −2 × log-likelihood(θ̂_mle) + 2d

BIC = −2 × log-likelihood(θ̂_mle) + (log n) × d

I They're both popular model selection methods. BIC puts a heavier penalty on model complexity.

I BIC is shown to be asymptotically consistent: it is able to identify the true model when n → ∞ (if there exists such a true model! Some people argue that a true model doesn't exist)

I Practitioners seem to prefer AIC over BIC when n is small

[M.-N. Tran (2011), The Loss Rank Criterion for Variable Selection in Linear Regression Analysis, Scandinavian J of Statistics] proposes another criterion which is somehow a compromise between AIC and BIC.


Cross-validation

I basic idea: like you validate your peers' work and they validate yours

I probably the simplest but most commonly used model selection method

I gives an estimate of the expected prediction error


Cross-validation

I divide the data into K sets, K ≥ 2. Often, this is done randomly

I for the kth part, fit the model to the other K − 1 parts. Denote the fitted model as f̂^{−k}(x)

I use the fitted model to predict the kth part. The prediction error is

∑_{(y_i, x_i) ∈ part k} L(y_i, f̂^{−k}(x_i))

I The K-fold cross-validated prediction error is

CV = (1/n) ∑_{k=1}^K ∑_{(y_i, x_i) ∈ part k} L(y_i, f̂^{−k}(x_i))

It's an estimate of the expected prediction error as it's averaged over both test data and training data


Cross-validation

I The selected model is the one with the smallest CV prediction error.

I Typical choices of K are 5, 10 or n. The case K = n is known as leave-one-out cross-validation.

Cross-validation is simple and widely used. However, CV can sometimes be very computationally expensive because one has to fit the model many times.
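A minimal sketch of K-fold cross-validation under squared-error loss, using least squares as the (illustrative) fitting routine; X is assumed to already contain an intercept column if one is wanted.

```python
import numpy as np

def kfold_cv(X, y, K=5, seed=0):
    """K-fold cross-validated prediction error (squared-error loss) for a least-squares fit."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)   # random split into K parts
    folds = np.array_split(idx, K)
    total = 0.0
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]   # fit on the other K-1 parts
        total += np.sum((y[test] - X[test] @ beta) ** 2)            # prediction error on part k
    return total / n

# usage: pick the model (set of columns of X) with the smallest kfold_cv(X, y)
```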


Variable selection in linear regression

Consider a linear regression model or a logistic regression model with p potential covariates.

At the initial step of modelling, a large number p of covariates is often introduced in order to reduce potential bias. The task is then to select the best subset among these p variables.

Best subset selection: Search over all 2^p possible subsets of the p covariates to find the best subset. The criterion can be AIC, BIC or any other model selection criterion.


Variable selection in linear regression

Searching over 2^p subsets is only feasible when p is small (< 30).

Forward-stepwise selection: Start with the intercept, then sequentially add into the model the covariate that most improves the model selection criterion (a sketch of this procedure is given below).

Backward-stepwise selection: Start with the full model with p covariates, then sequentially remove the covariate that most improves the model selection criterion.

I Advantage: much more time-efficient than the best subset selection method

I Disadvantage: does not necessarily end up at the best subset.
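A sketch of forward-stepwise selection with BIC as the criterion; the BIC helper below assumes a Gaussian linear regression and is only one way this could be coded, not a definitive implementation.

```python
import numpy as np

def bic_ls(X, y):
    """BIC of a Gaussian linear regression fitted by least squares."""
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.sum((y - X @ beta) ** 2) / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + np.log(n) * (X.shape[1] + 1)

def forward_stepwise(X, y):
    """Start from the intercept; add the covariate that most improves BIC, stop when none helps."""
    n, p = X.shape
    selected = []
    best = bic_ls(np.ones((n, 1)), y)          # BIC of the intercept-only model
    improved = True
    while improved:
        improved = False
        remaining = [j for j in range(p) if j not in selected]
        scores = {j: bic_ls(np.column_stack([np.ones(n)] + [X[:, s] for s in selected + [j]]), y)
                  for j in remaining}
        if scores:
            j_best = min(scores, key=scores.get)
            if scores[j_best] < best:          # keep only additions that lower the BIC
                selected.append(j_best)
                best = scores[j_best]
                improved = True
    return selected                            # indices of the selected covariates
```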


Variable selection in linear regression

Variable selection based on hypothesis testing

I Consider the test H₀: β_j = 0 vs. H₁: β_j ≠ 0.

I Let β̂_j be an estimator of β_j. If the sampling distribution of β̂_j is known, the p-value can be computed

I If the p-value is large (e.g. > 0.05, 0.1) then the corresponding covariate X_j might be removed from the model

Possible disadvantages

I It's not clear what prediction error is optimised

I not time-efficient when p is large (need to refit the model many times)

This variable selection method is not popular in "modern" statistics and machine learning. For historical reasons, it is still widely used in many fields such as the social sciences.


Woman labor force example

The data set MROZ.xlsx, available in Blackboard, contains information on women's labour force participation.

We would like to build a logistic regression model to explain women's labour force participation using the potential predictors nwifeinc (income), educ (years of education), age, exper (years of experience), expersq (squared years of experience), kidslt6 (number of kids younger than 6) and kidsge6 (number of kids older than 6).

Let’s carry out the variable selection task.
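One possible starting point in Python with statsmodels, which reports the p-values used by the testing-based approach as well as AIC and BIC; the response column name inlf is an assumption about how the spreadsheet is laid out.

```python
import pandas as pd
import statsmodels.api as sm

data = pd.read_excel("MROZ.xlsx")                       # dataset from Blackboard
predictors = ["nwifeinc", "educ", "age", "exper", "expersq", "kidslt6", "kidsge6"]
X = sm.add_constant(data[predictors])                   # add an intercept column
fit = sm.Logit(data["inlf"], X).fit()                   # 'inlf' assumed to be the 0/1 response
print(fit.summary())                                    # coefficient estimates and p-values
print("AIC:", fit.aic, "BIC:", fit.bic)                 # criteria for comparing candidate models
```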


LASSO


I Consider a simple linear regression model

yi = β0 + β1xi + εi , i = 1, ..., n

I Assume that the x_i's have been standardised so that ∑_i x_i = 0 and ∑_i x_i² = 1

I The LS method estimates β = (β₀, β₁)′ by minimising the sum of squared errors

∑_i (y_i − β₀ − β₁x_i)²

I It's easy to see that the solution is

β̂₁^ls = ∑_i x_i y_i,   β̂₀^ls = (1/n) ∑_i y_i


LASSO

I The LASSO method estimates β = (β₀, β₁)′ by minimising the sum of squared errors plus a penalty term on β₁

∑_i (y_i − β₀ − β₁x_i)² + λ|β₁|

I |β₁| is the absolute value of β₁; λ > 0 controls the penalty and is called the shrinkage parameter

I Note that there's no penalty term for β₀. The reason is that we are in general not interested in determining whether or not β₀ = 0

I It can be shown that the solution is (a direct translation into code follows below)

β̂₁^lasso = 0,           if λ ≥ |β̂₁^ls|
β̂₁^lasso = β̂₁^ls − λ,   if λ < |β̂₁^ls| and β̂₁^ls > 0
β̂₁^lasso = β̂₁^ls + λ,   if λ < |β̂₁^ls| and β̂₁^ls < 0

β̂₀^lasso = β̂₀^ls = (1/n) ∑_i y_i
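The closed-form solution above is a soft-thresholding of the LS slope; here is a direct translation, assuming x has been standardised as on the slide.

```python
import numpy as np

def lasso_simple(x, y, lam):
    """Lasso for simple linear regression with standardised x (sum x = 0, sum x^2 = 1)."""
    b0 = y.mean()                                       # intercept is not penalised
    b1_ls = np.sum(x * y)                               # LS slope under the standardisation
    b1 = np.sign(b1_ls) * max(abs(b1_ls) - lam, 0.0)    # soft-threshold: 0 when lam >= |b1_ls|
    return b0, b1
```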


LASSO

I So when the shrinkage parameter λ is large enough (i.e. λ ≥ |β̂₁^ls|), the Lasso estimate β̂₁^lasso will be 0

I This can also be interpreted as follows: when |β̂₁^ls| is small enough to be regarded as insignificant, Lasso will automatically shrink it to zero.

I Because of this attractive feature, Lasso is a method for variable selection.

I In general, the Lasso method shrinks all the LS estimates towards 0


LASSO

I Now, consider the general multiple linear regression model

yi = β0 + β1xi1 + ...+ βpxip + εi

I For a given λ, the Lasso method estimates β = (β₀, β₁, ..., β_p)′ by minimising

∑_i (y_i − β₀ − β₁x_{i1} − ... − β_p x_{ip})² + λ ∑_{j=1}^p |β_j|

I Note that, in general, we don't penalize β₀ as we are not interested in whether or not β₀ = 0.

I There isn't a closed-form solution to this optimisation problem, but it can be solved numerically by optimisation techniques

I Many "modern" statistical software packages have built-in functions to implement the Lasso


LASSO

I Let's apply the method to the prostate cancer dataset, available on the textbook's website and in Blackboard

I The goal is to predict the log of prostate specific antigen level (lpsa), using multiple linear regression with eight predictors: log cancer volume (lcavol), log prostate weight (lweight), age, etc.

I We want to estimate the coefficients and simultaneously remove insignificant predictors (a sketch in Python follows below)
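A hedged sketch of how the coefficient profile could be computed in Python with scikit-learn's lasso_path; the file name prostate.csv and the column layout are assumptions about how the data are stored, and scikit-learn's alpha is a rescaled version of the λ used on these slides.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import lasso_path

data = pd.read_csv("prostate.csv")               # assumed file name
y = data["lpsa"].to_numpy()
y = y - y.mean()                                 # centre the response (no intercept in lasso_path)
X = data.drop(columns="lpsa").to_numpy()
X = (X - X.mean(axis=0)) / X.std(axis=0)         # standardise the predictors

alphas, coefs, _ = lasso_path(X, y)              # Lasso estimates over a grid of penalty values
print(alphas.shape, coefs.shape)                 # coefs[j, k]: coefficient j at the k-th alpha
```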


Lasso

The figure shows the profile of the Lasso estimates as the shrinkage parameter λ varies from 0 to λ_max, at which all coefficients are 0.


Lasso

I For the particular λ corresponding to the dotted red line, three predictors lcavol, lweight and svi are selected; the other five are removed.

I When λ = +∞, ALL coefficients β₁, ..., β_p are 0. This is when the model is likely to be underfitted

I When λ = 0, ALL coefficients β₁, ..., β_p are non-zero (the LS solution). This is when the model is likely to be overfitted


Selecting λ

The shrinkage parameter λ can be selected using a BIC-type criterion.

Let X be the design matrix and y be the vector of responses. Denote by β̂_λ^lasso the Lasso estimate of β given λ. Define

BIC(λ) = log( ‖y − X β̂_λ^lasso‖² / n ) + df_λ × log(n)/n

df_λ is called the degrees of freedom, which is approximately the number of non-zero coefficients in the model.

The best λ is the one that minimises BIC
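A sketch of the BIC(λ) computation over a grid of shrinkage values, using scikit-learn's Lasso for the fits; note that scikit-learn's alpha is a rescaled version of λ, and the grid of values is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bic_lambda(X, y, lam):
    """BIC(lambda) = log(||y - X b||^2 / n) + df * log(n) / n for the Lasso fit at lambda."""
    n = len(y)
    fit = Lasso(alpha=lam, fit_intercept=True).fit(X, y)
    rss = np.sum((y - fit.predict(X)) ** 2)
    df = np.sum(fit.coef_ != 0)                   # approximate degrees of freedom
    return np.log(rss / n) + df * np.log(n) / n

# pick the lambda with the smallest BIC over an (arbitrary) grid
grid = np.logspace(-3, 0, 50)
# best_lam = min(grid, key=lambda l: bic_lambda(X, y, l))
```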


Lasso

BIC(λ) is minimised at λ = 0.0623


Lasso

The final Lasso estimate is shown by the vertical line at λ = 0.0623.


Lasso

The final Lasso estimate with λ chosen by BIC is

β̂^lasso_{λ=0.0623} = (0.3715, 0.5151, 0.3421, 0, 0.0491, 0.5623, 0, 0, 0.0014)′

So three predictors age, lcp and gleason are removed. Note that the first element is the intercept.


Selecting λ by Loss rank principle method

Tran (2011), Scandinavian Journal of Statistics, Vol. 38, pp. 466-479, proposes a method called the loss rank principle for selecting λ:

LR(λ) = KL( df_λ/n, 1 − ρ_λ )

where

I KL(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

I df_λ is the number of non-zero coefficients in the model

I ρ_λ = ‖y − X β̂_λ^lasso‖² / ‖y‖²

The best λ is the one that maximises LR
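A sketch of the loss rank criterion as written above, again using scikit-learn's Lasso for the fits; the small epsilon guard for the df = 0 case is an implementation choice of mine, not part of the original criterion, and the alpha/λ scaling caveat from before applies.

```python
import numpy as np
from sklearn.linear_model import Lasso

def loss_rank(X, y, lam):
    """LR(lambda) = KL(df/n, 1 - rho) for the Lasso fit at lambda (sketch of Tran, 2011)."""
    n = len(y)
    fit = Lasso(alpha=lam).fit(X, y)
    rho = np.sum((y - fit.predict(X)) ** 2) / np.sum(y ** 2)   # rho_lambda
    eps = 1e-12                                                # guard against log(0) when df = 0
    p = max(np.sum(fit.coef_ != 0) / n, eps)                   # df_lambda / n
    q = 1 - rho                                                # assumed to lie strictly in (0, 1)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# the best lambda maximises loss_rank(X, y, lam) over a grid of lam values
```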


Selecting λ by Loss rank principle method

LR(λ) is maximised at λ = 0.0623. In this example, BIC and LR give the same result.


LASSO for logistic regression

I Response/output data y_i are binary: 0/1, Yes or No

I Want to explain/predict y_i based on a vector of predictors x_i = (x_{i1}, ..., x_{ip})′.

I We assume y_i | x_i ∼ B(1, p₁(x_i)), i.e. the distribution of y_i is a Bernoulli distribution with probability of success p₁(x_i), where

p₁(x_i) = P(y_i = 1 | x_i) = exp(β₀ + β₁x_{i1} + ... + β_p x_{ip}) / (1 + exp(β₀ + β₁x_{i1} + ... + β_p x_{ip}))

I If Y is a Bernoulli r.v. with probability π, then the density function of Y is

p(y | π) = π^y (1 − π)^{1−y}.

I The probability density function of y_i is therefore

p(y_i | x_i, β) = p₁(x_i)^{y_i} (1 − p₁(x_i))^{1−y_i}

So the likelihood function is

p(y | X, β) = ∏_{i=1}^n p₁(x_i)^{y_i} (1 − p₁(x_i))^{1−y_i}


LASSO for logistic regression

I The log-likelihood is

ℓ(β) = log p(y | X, β) = ∑_i ( y_i log p₁(x_i) + (1 − y_i) log(1 − p₁(x_i)) )

I Lasso estimates β by minimising the minus log-likelihood plus a penalty term

−ℓ(β) + λ ∑_{j=1}^p |β_j|,   λ > 0

I Insignificant coefficients will be automatically shrunk to 0

I Most "modern" statistical software packages have built-in functions to implement Lasso for logistic regression (a sketch using scikit-learn follows below)
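For example, scikit-learn's LogisticRegression with an L1 penalty minimises this kind of penalised negative log-likelihood; its parameter C is an inverse penalty strength, so a small C plays the role of a large λ. The data below are made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))                                      # made-up predictors
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)     # made-up 0/1 response

# L1-penalised logistic regression; smaller C means a heavier penalty (larger lambda)
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print(model.coef_)                                                 # insignificant coefficients shrunk to 0
```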


LASSO for logistic regression

λ can be selected using the AIC or BIC criterion

AIC(λ) = −2 × log-likelihood(β̂_λ^lasso) + 2 × df_λ

BIC(λ) = −2 × log-likelihood(β̂_λ^lasso) + (log n) × df_λ

where df_λ is the number of non-zero coefficients in the model.

The selected λ is the one that minimises BIC(λ) or AIC(λ)


LASSO

I Lasso is very useful when there are a lot of potential predictors, i.e. p is large

I Lasso still works even when p ≫ n, a setting where classical methods such as least squares break down