QBUS3820 Data Mining and Data Analysis
Lecture: Neural Networks
Dr. Minh-Ngoc Tran, University of Sydney Business School
Table of contents
Introduction
Fundamental concepts
Single layer perceptron
Introduction
What are neural networks?
They are a set of very flexible non-linear methods for regression and classification, suitable when your dataset is large.
What are neural networks?
- A neural network, or artificial neural network (ANN), is a computational model that tries to mimic a network of neurons in the human brain.
- Artificial neural networks (ANNs) are not biological neural networks, but mathematical models inspired by biological neural networks.
What are neural networks?
- A neural network is an interconnected assembly of simple processing units or neurons, which communicate by sending signals to each other over weighted connections
- A neural network is made of layers of similar neurons: an input layer, hidden layers, and an output layer.
- The input layer receives data from outside the network. The output layer sends data out of the network. Hidden layers receive/process/send data within the network.
What are neural networks used for?
- Neural networks are often used for statistical analysis and data modelling, as an alternative to standard nonlinear regression/classification
- They have been successfully used in speech recognition, textual character recognition, medical imaging diagnosis, robotics, financial market prediction, etc.
- But their applications to business are still somewhat limited
Fundamental concepts
Elements of an artificial neural network
An ANN includes
- a set of processing units/neurons/nodes
- an activation level Zi for each unit i, which is often the same as the output of the unit
- weights wik, which are connection strengths between units i and k
- a propagation rule that determines the total input Sk of a unit from its connected units
- an activation function hk that determines the activation level Zk based on the total input Sk: Zk = hk(Sk)
Elements of an artificial neural network
Often, the total input sent to unit k is

Sk = sum_i wik Zi + w0k

which is a weighted sum of the outputs from all units i that are connected to unit k, plus an offset term w0k.

Then, the output of unit k is

Zk = hk(Sk) = hk(sum_i wik Zi + w0k)
Elements of an artificial neural network
It’s useful to distinguish three types of units:
- input units (denoted by X): receive data from outside the network
- output units (denoted by Y): send data out of the network
- hidden units (denoted by Z): receive data from and send data to units within the network.

Given the signal from a set of inputs X, an ANN produces an output Y.
Elements of an artificial neural network
The function

hk(Sk) = 1 / (1 + e^(-Sk))

is commonly used as the activation function for hidden units.

For output units,

hk(Sk) = Sk (used in regression)

or

hk(Sk) = Sk / sum_l Sl (used in classification)
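As an illustration, these activation functions can be sketched in a few lines of Python. This is a minimal sketch, not part of the lecture; note that for classification outputs, practical implementations usually use the softmax (the exponentiated form of the normalisation above), since it works for negative inputs as well:

```python
import math

def sigmoid(s):
    # Logistic activation h(S) = 1 / (1 + e^{-S}), common for hidden units.
    return 1.0 / (1.0 + math.exp(-s))

def identity(s):
    # Output activation for regression: h(S) = S.
    return s

def softmax(scores):
    # Normalised output activations for classification. The slide's
    # S_k / sum_l S_l requires positive inputs; the softmax
    # e^{S_k} / sum_l e^{S_l} is the standard positive normalisation.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))         # 0.5
print(softmax([1.0, 1.0]))  # two equal scores share probability equally
```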
Training neural networks
- A neural network is a computational model that needs to be estimated
- The unknown quantities in an ANN are the weights wik.
- These parameters are often estimated from training data
Examples of neural networks
- Consider an ANN with no hidden layer
- Suppose that there are p input units X1, ..., Xp and one output unit Y
Examples of neural networks
- Let S = w0 + sum_i wi Xi be the total input sent to Y. For classification using logistic regression, we classify

  Y = 1 if and only if e^S / (1 + e^S) >= 0.5, i.e. S >= 0

- Equivalently,

  Y = h(S) = { 1, S >= 0
               0, otherwise }
Examples of neural networks
So this classification model is a special case of an ANN with no hidden units.
Examples of neural networks
Multiple linear regression is a special case of an ANN with no hidden units.
Single layer perceptron
Single layer perceptron
- We now focus on the most widely used neural networks in statistics: ANNs with a single hidden layer, often called a single layer perceptron
Single layer perceptron for regression
- Suppose we have p input predictors/features X = (X1, ..., Xp)' and a scalar target Y
- Create M hidden units Z1, ..., ZM
- Total input of unit Zm:

  S_Zm = a0m + a1m X1 + ... + apm Xp = a0m + am'X

  The weights aij are unknown and need to be estimated.
- Activation level of unit Zm:

  Zm = h(S_Zm) = h(a0m + am'X), m = 1, ..., M

- Compute the total input of the output unit Y:

  S = b0 + b1 Z1 + ... + bM ZM = b0 + b'Z

  with bi the weight from hidden unit Zi to the output unit Y
- The output Y = S is a prediction of E(Y|X).
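The forward pass described above can be sketched as follows. This is a minimal illustration with made-up layer sizes and random weights; in practice the weights are estimated from training data:

```python
import numpy as np

rng = np.random.default_rng(0)
p, M = 3, 4                        # 3 inputs, 4 hidden units (illustrative sizes)
alpha0 = rng.normal(size=M)        # hidden-unit offsets alpha_0m
alpha  = rng.normal(size=(M, p))   # hidden-unit weight vectors alpha_m (rows)
beta0  = rng.normal()              # output offset beta_0
beta   = rng.normal(size=M)        # output weights beta_m

def h(s):
    # logistic activation for the hidden units
    return 1.0 / (1.0 + np.exp(-s))

def f(x):
    # forward pass: Z_m = h(alpha_0m + alpha_m' x), then S = beta_0 + beta' Z
    z = h(alpha0 + alpha @ x)
    return beta0 + beta @ z

x = np.array([0.5, -1.0, 2.0])
print(f(x))   # a prediction of E(Y | X = x)
```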
Single layer perceptron for regression
- We can write

  f(X) = S = b0 + b1 h(a01 + sum_{i=1}^p ai1 Xi) + ... + bM h(a0M + sum_{i=1}^p aiM Xi)

- So the inputs Xi enter the prediction function f(X) in a nonlinear way.
- Here, for simplicity, we use the same activation h for all hidden units Zm
- If h(x) = x, it can be seen that f(X) is a linear combination of the Xi, so multiple linear regression is a special case of this single layer perceptron model
Single layer perceptron for regression
- The model parameters are theta = (a01, a11, ..., ap1; ...; a0M, a1M, ..., apM; b0, b1, ..., bM)
- Let {yi, xi = (xi1, ..., xip)}, i = 1, ..., n, be the training dataset. The sum of squared errors is

  R(theta) = sum_{i=1}^n (yi - f(xi))^2

- We estimate theta by minimising R(theta)
Single layer perceptron for 0-1 classification

- Suppose we have p input predictors/features X = (X1, ..., Xp)' and a scalar target Y
- Create M hidden units Z1, ..., ZM
- Total input of unit Zm:

  S_Zm = a0m + a1m X1 + ... + apm Xp = a0m + am'X

  The weights aij are unknown and need to be estimated.
- Activation level of unit Zm:

  Zm = h(S_Zm) = h(a0m + am'X), m = 1, ..., M

- Compute the total input of the output unit Y:

  S = b0 + b1 Z1 + ... + bM ZM = b0 + b'Z

  with bi the weight from hidden unit Zi to the output unit Y
- The output is

  Y = h(S) = { 1, S >= 0
               0, otherwise }
Issues in training neural networks
- Often, neural networks have too many weights and might overfit the data. To overcome the overfitting problem, we minimise

  R(theta) + lambda P(theta)

  where P(theta) is a model complexity penalty and lambda >= 0 controls the amount of penalisation.
- For example,

  P(theta) = sum_{k=1}^K sum_{m=1}^M bmk^2 + sum_{j=1}^p sum_{m=1}^M ajm^2

- lambda can be selected by cross-validation
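As a sketch, the penalised criterion can be computed as follows, with hypothetical weight arrays and lambda, for illustration only:

```python
import numpy as np

def ridge_penalty(alpha, beta):
    # P(theta): sum of squared weights (offsets excluded), as on the slide
    return np.sum(beta ** 2) + np.sum(alpha ** 2)

def penalised_loss(y, yhat, alpha, beta, lam):
    # R(theta) + lambda * P(theta)
    resid = y - yhat
    return np.sum(resid ** 2) + lam * ridge_penalty(alpha, beta)

y     = np.array([1.0, 2.0, 3.0])       # toy targets
yhat  = np.array([1.1, 1.9, 2.7])       # toy fitted values
alpha = np.array([[0.5, -0.2]])         # hypothetical hidden-layer weights
beta  = np.array([1.0, -1.0])           # hypothetical output weights
print(penalised_loss(y, yhat, alpha, beta, lam=0.1))
```

A larger lambda shrinks the weights harder during training, trading a little bias for lower variance.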
Issues in training neural networks
- With too few hidden units M, the model might not be flexible enough to capture the nonlinearities in the data; with too large an M, the model might overfit the data
- It is most common to choose a reasonably large M and use a penalty term to avoid overfitting
Multiple hidden layer neural networks

- One can use neural networks with multiple hidden layers
- The choice of the number of hidden layers is guided by background knowledge or by using a test dataset.
Example: handwriting recognition
- Handwriting recognition is an important task, especially in postal services
- We want to recognise handwritten digits, scanned from envelopes
Example: handwriting recognition
- Input is a vector of 16 x 16 = 256 pixel values, and output is one of the ten digits 0, ..., 9.
- Each training observation, i.e. an image, is a vector of its 256 pixel values together with the correct digit
- There are 320 images in the training set, and 160 in the test set.
Example: handwriting recognition
Five networks were used:
- Net-1: no hidden layer, equivalent to multinomial logistic regression.
- Net-2: one hidden layer with 12 hidden units, fully connected.
- Net-3: two hidden layers, locally connected.
- Net-4 and Net-5: two hidden layers, locally connected with different constraint levels on the weights.
Example: handwriting recognition
Summary
- ANNs provide a range of flexible nonlinear models for data modelling
- ANNs have been successfully applied in many fields: robotics, vision, image processing, etc.
- They are useful for prediction, not for inference, because it's difficult to interpret the coefficients/weights in an ANN model
QBUS3820 Data Mining and Data Analysis
Lecture: Piecewise polynomials and Spline regression
Dr. Minh-Ngoc Tran, University of Sydney Business School
Table of contents
Piecewise polynomials
Spline regression
Reading: Chapter 5, The Elements of Statistical Learning.
Introduction
- Let Y be the response variable and X be a predictor. We consider scalar/univariate X in this lecture.
- The conditional mean f(X) = E(Y|X) is all we need for both inference and prediction tasks. The regression model

  Y = f(X) + e,  where e is an error term with E(e) = 0,

  is the most general regression model, as we don't make any assumptions on the form of f(X)
- The linear regression model assumes f(X) = b0 + b1 X.
- In many cases, it's unlikely that the conditional mean E(Y|X) is truly linear in X!
Introduction
Obviously, f(X) = E(Y|X), approximated by the red curve, is not linear in X.
Introduction
- This lecture goes beyond the linearity assumption in regression
Piecewise polynomials
First, what is a polynomial? A p-degree (or p-order) polynomial is

f(x) = a0 + a1 x + a2 x^2 + ... + ap x^p

where a0, ..., ap are coefficients.
Global polynomial regression
Consider the general regression model

Y = f(X) + e, E(e) = 0    (1)

where f(X) = E(Y|X) is unknown.

Let's use the Taylor expansion of f(X) at 0: for some order p >= 1,

f(X) ~ f(0) + f'(0) X + (f''(0)/2!) X^2 + (f'''(0)/3!) X^3 + ... + (f^(p)(0)/p!) X^p
     = b0 + b1 X + b2 X^2 + ... + bp X^p

for all values in the range of X.

Here b0 = f(0), b1 = f'(0), ..., bp = f^(p)(0)/p! are unknown coefficients. f^(k)(x) denotes the kth derivative of the function f(x).
Global polynomial regression
- Model (1) becomes

  Y = b0 + b1 X + b2 X^2 + ... + bp X^p + e    (2)

  Model (2) is now a multiple linear regression model, so the coefficients bj can be easily estimated from data.
- This model is referred to as the global polynomial regression model
- "Global" means the coefficients bj are constant across the entire range (also called the domain) of X
- The global polynomial regression model offers a way to relax the linearity assumption.
- However, global polynomial regression has a main drawback: because of its global nature, it can provide a good fit (to the data) in one area but behave rather weirdly in another area
Global polynomial regression
Figure: Industrial production index, January 1990 to October 2005. Left panel: global polynomial regression. Right panel: cubic spline regression (see later)

Tuning the coefficients to achieve a functional form in one region can cause the function to flap about madly in other regions.
Piecewise polynomial regression
In this lecture we consider techniques that allow for local polynomial representations. The first technique is piecewise polynomial regression.
Piecewise polynomials
Basic idea: we divide the range/domain of X into contiguous intervals, then approximate f(X) in each interval by a separate polynomial.

Piecewise constant

Divide the range/domain of X into contiguous intervals, then approximate f(X) in each interval by a constant, i.e. a 0-degree polynomial.
Figure: We approximate the true curve by a step function
Piecewise constant
For example, we divide the domain of X into 3 intervals by 2 limit points xi1 and xi2 (called knots).

Define three functions, called basis functions:

h1(X) = I(X < xi1), h2(X) = I(xi1 <= X < xi2), h3(X) = I(X >= xi2)

We approximate the true curve f(X) = E(Y|X) by a step function

f(X) = sum_{m=1}^3 bm hm(X) = { b1, X < xi1
                                b2, xi1 <= X < xi2
                                b3, X >= xi2 }

f(X) is a constant in each of the regions X < xi1, xi1 <= X < xi2 and X >= xi2.
Piecewise constant
Let {Xi, Yi, i = 1, ..., n} be the training data. Let

Zi1 = h1(Xi), Zi2 = h2(Xi), Zi3 = h3(Xi)

We arrive at the following multiple linear regression model

Yi = b1 Zi1 + b2 Zi2 + b3 Zi3 + ei, i = 1, ..., n

So it's easy to estimate the coefficients bm using the least squares method or the maximum likelihood method.

Piecewise constant

It can be shown that

bm = Ave{Yi | Xi in region m}
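A quick numerical check of this fact, using simulated data and hypothetical knots: the least squares coefficients of the indicator basis equal the per-region averages of Y.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, size=200)
# A step-function truth plus a little noise
y = np.where(x < 1, 2.0, np.where(x < 2, 5.0, 1.0)) + rng.normal(0, 0.1, size=200)

xi1, xi2 = 1.0, 2.0                       # hypothetical knots
Z = np.column_stack([x < xi1,
                     (xi1 <= x) & (x < xi2),
                     x >= xi2]).astype(float)   # basis h1, h2, h3

beta, *_ = np.linalg.lstsq(Z, y, rcond=None)    # least squares fit

# The fitted coefficient for each region equals the average of y there
print(beta)
print([y[x < xi1].mean(), y[(xi1 <= x) & (x < xi2)].mean(), y[x >= xi2].mean()])
```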
Piecewise linear
Divide the domain of X into contiguous intervals, then approximate f(X) in each interval by a linear function, i.e. a 1-degree polynomial.

Define six basis functions:

h1(X) = I(X < xi1), h2(X) = I(xi1 <= X < xi2), h3(X) = I(X >= xi2)
h4(X) = X I(X < xi1), h5(X) = X I(xi1 <= X < xi2), h6(X) = X I(X >= xi2)

We approximate the true curve f(X) = E(Y|X) by a piecewise linear function

f(X) = sum_{m=1}^6 bm hm(X) = { b1 + b4 X, X < xi1
                                b2 + b5 X, xi1 <= X < xi2
                                b3 + b6 X, X >= xi2 }

f(X) is a linear function in each region.
Piecewise linear
Let {Xi, Yi, i = 1, ..., n} be the training data. Let

Zik = hk(Xi), k = 1, ..., 6, i = 1, ..., n

We arrive at the following multiple linear regression model

Yi = b1 Zi1 + b2 Zi2 + b3 Zi3 + b4 Zi4 + b5 Zi5 + b6 Zi6 + ei

It's easy to estimate the coefficients bk using the least squares method or the maximum likelihood method.
Piecewise linear
Higher order piecewise polynomials
Similarly, we can construct higher order piecewise polynomials. For example,

- piecewise quadratic polynomials: approximate f(X) = E(Y|X) by a 2-degree polynomial in each region
- piecewise cubic polynomials: approximate f(X) = E(Y|X) by a 3-degree polynomial in each region
Spline regression
Discontinuity issue
Consider a function f(x) and some x0 in its domain.

- f-(x0) = lim_{x -> x0-} f(x) is the left limit of f(x) at x0, i.e. the limit of f(x) as x goes to x0 from the left.
- f+(x0) = lim_{x -> x0+} f(x) is the right limit of f(x) at x0, i.e. the limit of f(x) as x goes to x0 from the right.
- If f-(x0) != f+(x0), we say that f(x) is discontinuous at x0.
Discontinuity issue
- Piecewise polynomials are in general not continuous at the knots xij: f-(xij) != f+(xij)
- Discontinuity causes inference/prediction problems. E.g., what is the prediction of f(xij) = E(Y|X = xij)?
- Typically, we prefer continuity in statistical modelling
Spline
- We need continuity constraints

  f-(xij) = f+(xij),

  i.e. the left limit meets the right limit at the knots.
- A technique to impose such constraints is to use a spline
- A spline is a piecewise polynomial that is continuous at the knots. Therefore, a spline is continuous everywhere in the entire range of X.
Linear splines
Suppose that we divide the range of X into three intervals with knots xi1 and xi2. Define basis functions

h0(X) = 1
h1(X) = X
h2(X) = (X - xi1)+ = (X - xi1) I(X >= xi1) = { 0, if X < xi1
                                               X - xi1, if X >= xi1 }
h3(X) = (X - xi2)+ = (X - xi2) I(X >= xi2) = { 0, if X < xi2
                                               X - xi2, if X >= xi2 }

Similarly, we can define the basis functions when there are K > 2 knots xi1, xi2, ..., xiK.
Linear splines
Let

f(X) = b0 h0(X) + b1 h1(X) + b2 h2(X) + b3 h3(X)
     = b0 + b1 X + b2 (X - xi1)+ + b3 (X - xi2)+
     = { b0 + b1 X, X < xi1
         b0 + b1 X + b2 (X - xi1), xi1 <= X < xi2
         b0 + b1 X + b2 (X - xi1) + b3 (X - xi2), X >= xi2 }

It's easy to check that
- f-(xij) = f+(xij), j = 1, 2. That is, f(X) is continuous at every X
- f(X) is linear in each region: X < xi1, xi1 <= X < xi2 and X >= xi2
- f(X) is called a linear spline: a piecewise linear polynomial that is continuous everywhere.
Estimating a linear spline
Let {Xi, Yi, i = 1, ..., n} be the training data. The linear spline regression model becomes

Yi = b0 h0(Xi) + b1 h1(Xi) + b2 h2(Xi) + b3 h3(Xi) + ei

Let

Zik = hk(Xi), k = 0, ..., 3, i = 1, ..., n

We arrive at the following multiple linear regression model

Yi = b0 Zi0 + b1 Zi1 + b2 Zi2 + b3 Zi3 + ei

So it's easy to estimate the coefficients bk using the least squares method or the maximum likelihood method.
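A minimal sketch of fitting a linear spline by least squares, using simulated data and hypothetical knots (the truth here is itself a linear spline with slopes 1, 2, -1, so the fit recovers the coefficients almost exactly):

```python
import numpy as np

def linear_spline_basis(x, knots):
    # Basis h0 = 1, h1 = X, and (X - xi_k)_+ for each knot
    cols = [np.ones_like(x), x]
    for xi in knots:
        cols.append(np.maximum(x - xi, 0.0))
    return np.column_stack(cols)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, size=300))
# A continuous piecewise linear truth with knots at 3 and 7, plus small noise
y = np.piecewise(x, [x < 3, (x >= 3) & (x < 7), x >= 7],
                 [lambda t: t,
                  lambda t: 3 + 2 * (t - 3),
                  lambda t: 11 - (t - 7)])
y = y + rng.normal(0, 0.05, size=x.size)

Z = linear_spline_basis(x, knots=[3.0, 7.0])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta)   # roughly [0, 1, 1, -3]: base slope 1, slope changes +1 and -3
```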
Linear splines
Cubic splines
- A linear spline is continuous but not smooth: it has a kink/sudden change at the knots xik, which is not an attractive feature in statistical modelling.
- We often prefer smoother functions. Typically, we prefer f(X) to be not only continuous, but to have continuous first and second derivatives
- This can be achieved by increasing the order of the local polynomials, i.e. by using cubic splines
Cubic splines
Cubic splines
Suppose that we divide the range of X into three intervals with knots xi1 and xi2.

Start with the 3-degree polynomial basis functions

h0(X) = 1, h1(X) = X, h2(X) = X^2, h3(X) = X^3

and add one basis function for each knot:

h4(X) = (X - xi1)^3+ = (X - xi1)^3 I(X >= xi1)
h5(X) = (X - xi2)^3+ = (X - xi2)^3 I(X >= xi2)

Similarly, we can define the basis functions when there are K > 2 knots xi1, xi2, ..., xiK.
Cubic splines
Let

f(X) = b0 + b1 X + b2 X^2 + b3 X^3 + b4 (X - xi1)^3+ + b5 (X - xi2)^3+

f'(X) = b1 + 2 b2 X + 3 b3 X^2 + 3 b4 (X - xi1)^2 I(X >= xi1) + 3 b5 (X - xi2)^2 I(X >= xi2)

f''(X) = 2 b2 + 6 b3 X + 6 b4 (X - xi1) I(X >= xi1) + 6 b5 (X - xi2) I(X >= xi2)

It can be checked that
- f(X), f'(X) and f''(X) are continuous at every X
- f(X) is a 3-degree polynomial in each region
- f(X) is called a cubic spline
Estimating a cubic spline
Again, it's easy to fit a cubic spline to data. Let {Xi, Yi, i = 1, ..., n} be the training data. Let

Zik = hk(Xi), k = 0, ..., 5, i = 1, ..., n

We arrive at the following multiple linear regression model

Yi = b0 Zi0 + b1 Zi1 + b2 Zi2 + b3 Zi3 + b4 Zi4 + b5 Zi5 + ei

It's easy to estimate the coefficients bk using the least squares method or the maximum likelihood method.
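A minimal sketch of fitting a cubic spline via the truncated power basis, using simulated data and hypothetical knots:

```python
import numpy as np

def cubic_spline_basis(x, knots):
    # h0..h3: global cubic terms; plus (X - xi_k)^3_+ for each knot
    cols = [np.ones_like(x), x, x**2, x**3]
    for xi in knots:
        cols.append(np.maximum(x - xi, 0.0) ** 3)
    return np.column_stack(cols)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-2, 2, size=400))
y = np.sin(2 * x) + rng.normal(0, 0.1, size=x.size)   # a smooth, nonlinear truth

Z = cubic_spline_basis(x, knots=[-1.0, 0.0, 1.0])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
fitted = Z @ beta
rmse = np.sqrt(np.mean((fitted - y) ** 2))
print(rmse)   # roughly the noise level, since the spline tracks sin(2x) well
```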
Cubic splines
Spline regression
- Basically, in spline regression, we fit a separate polynomial to the data in each region
- Unlike global polynomial regression, where the coefficients are constant across all regions, in spline regression the coefficients are adjusted locally
- Unlike piecewise polynomial regression, in spline regression a certain order of smoothness is imposed at the knots
Cubic splines with K knots
Given a set of K knots

xi1 < ... < xik < ... < xiK,

the cubic spline with K knots is

f(X) = b0 + b1 X + b2 X^2 + b3 X^3 + sum_{k=1}^K b_{3+k} (X - xik)^3+.
Knots selection
- Selecting the number of knots K and the knot positions xik is an art
- An option is to set xik to the 100k/(K + 1)-th percentile of the distribution of X.
- How many knots K should be used?
  - If K is too large, we have an overfitting problem
  - If K is too small, we have an underfitting problem
  - This is a model selection problem (see later)
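The percentile rule for knot placement can be sketched as follows (toy data; np.percentile handles the interpolation between order statistics):

```python
import numpy as np

def percentile_knots(x, K):
    # Place knot k at the 100k/(K+1)-th percentile of the X data
    qs = [100.0 * k / (K + 1) for k in range(1, K + 1)]
    return np.percentile(x, qs)

x = np.arange(1, 101)          # toy data: 1, 2, ..., 100
print(percentile_knots(x, 3))  # roughly the three quartiles of the data
```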
An example
Figure: Scatter plot of x vs. y
An example
Figure: 300 training data points (o) and 100 test data points (x)
An example
- Both the training and test datasets are available on Blackboard
- Clearly, simple linear regression is not appropriate for this dataset.
- Let's fit a cubic spline regression model to this training dataset and test its predictive power on the test data.
- Let's proceed as if there isn't any built-in function in your software that can help you with this task!
An example
- Let's select K = 3
- xi1, xi2 and xi3 are the 25%-, 50%- and 75%-percentiles of the X data
- The a%-percentile is a number such that a% of the data points are smaller than that number
- Check BUSS1020 or Google if you don't remember how to compute percentiles!
- Most statistical software packages have functions to compute this
- Based on my calculation, xi1 = 10, xi2 = 19.5 and xi3 = 58
- We need to fit the following cubic spline regression model

  Yi = b0 + b1 Xi + b2 Xi^2 + b3 Xi^3 + b4 (Xi - 10)^3+ + b5 (Xi - 19.5)^3+ + b6 (Xi - 58)^3+ + ei,

  i = 1, ..., n.
An example
- Let Zi1 = Xi, Zi2 = Xi^2, ..., Zi6 = (Xi - 58)^3+, i = 1, ..., n. We have the following multiple linear regression model

  Yi = b0 + sum_{k=1}^6 bk Zik + ei

- Let X be the n x 7 design matrix whose ith row is (1, Zi1, ..., Zi6), and let y = (Y1, Y2, ..., Yn)'. The estimate of the vector b is b = (X'X)^(-1) X'y
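The least squares formula can be checked numerically on a toy design matrix. Solving the normal equations with a linear solver is preferred to forming the inverse explicitly, but gives the same result:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # toy design matrix with intercept
true_beta = np.array([1.0, 2.0, -3.0])
y = Z @ true_beta                                            # noiseless toy response

# beta_hat = (X'X)^{-1} X'y, computed by solving the normal equations
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(beta_hat)   # recovers [1, 2, -3]
```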
An example
b = (498.9679, -0.3252, 0.1046, -0.0064, 0.0091, -0.0025, -0.0018)

- The estimate of f(X) = E(Y|X) is

  Y = b0 + b1 X + b2 X^2 + b3 X^3 + b4 (X - 10)^3+ + b5 (X - 19.5)^3+ + b6 (X - 58)^3+

- The plot of (X, Y) as X varies is the fitted (also called predicted) curve.
An example
Figure: 300 training data points and the fitted curve
An example
Figure: 100 test data points and the fitted curve
An example
Let’s see the effect of the number of knots K
An example

Figure: 100 test data points and the fitted curve, shown for several values of the number of knots K
Homework
Download the datasets from Blackboard and fit a cubic spline regression model by yourself. Have fun!
QBUS3820 Data Mining and Data Analysis
Lecture: Kernel methods
Dr. Minh-Ngoc Tran, University of Sydney Business School
Table of contents
Kernel density estimation
Kernel regression
Introduction
- We've so far discussed parametric methods: we first assume a parametric form for the underlying model that generated the data, then estimate the parameters.
- In parametric modelling, the underlying model that generated the data is described by a functional form that depends on a vector of unknown parameters theta.
- E.g., simple linear regression

  yi = b0 + b1 xi + ei, ei ~ N(0, s^2)

  is a parametric model, as we assume the model that generated the data yi, given xi, is a normal distribution with mean b0 + b1 xi and variance s^2. The set of unknown parameters is theta = (b0, b1, s^2).
Introduction
- This lecture is about nonparametric methods for estimating probability density functions and regression functions. They are also called kernel methods, as they're based on kernel functions.
- Kernel methods are considered modern data analysis techniques: their use grew rapidly after computing power became widely available
- This lecture covers
  - kernel methods for estimating density functions
  - kernel methods for estimating regression functions
Kernel density estimation
Density estimation
- Let X1, ..., Xn be i.i.d. (independent and identically distributed) samples from an unknown cumulative distribution function (cdf) F(x) with probability density function (pdf) f(x).
- We will use the generic notation X to denote a random variable with distribution F(x), i.e. the Xi's are identical copies of X.
- We want to estimate F(x) (or equivalently f(x)), as F(x) contains all the information about X we need: mean, variance, correlation between components of X (if X is a random vector), etc.
Density estimation
For example, in the spending dataset,
- What is the distribution of spending amounts?
- What is the mode? What is the shape of this distribution?
Parametric vs. nonparametric

- Parametric approaches assume a known functional form for f, such as normal, gamma, or Poisson, involving some unknown parameters:

  f(x) = f(x|theta).

  Then the task is to estimate theta by, e.g., maximum likelihood estimation or Markov chain Monte Carlo (not covered in this course)
- Parametric methods often achieve attractive properties (estimators with small variances, fast convergence to the true population values), provided the true underlying density that generated the data is well approximated by the postulated form f(x|theta)
- Parametric approaches might lead to misleading results if the assumed parametric model is far from the true density.
Parametric vs. nonparametric

- Nonparametric approaches don't assume a known functional form for f
- They only make basic assumptions like
  - a finite second moment E(X^2) < infinity, or
  - the true density is smooth enough: the derivatives f^(r)(x) exist up to some certain order r
- So nonparametric approaches are more robust and more flexible
- But they do have their own drawbacks (discussed later)
Histogram
- The histogram is the oldest and most widely used nonparametric density estimator.
- X1, ..., Xn are iid samples from an unknown cdf F with pdf f(x) = dF(x)/dx.
- The empirical distribution function, defined as

  Fn(x) = (1/n) sum_{i=1}^n I(Xi <= x),

  is an estimator of the cdf F(x) = P(X <= x)
- Note that sum_{i=1}^n I(Xi <= x) ~ Bi(n, F(x)). So

  E(Fn(x)) = F(x),  V(Fn(x)) = F(x)(1 - F(x))/n -> 0 as n -> infinity

  for every x.
- Fn(x) is a consistent and unbiased estimator of F(x).
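The empirical distribution function is essentially one line of code (toy sample, for illustration):

```python
import numpy as np

def ecdf(sample, x):
    # F_n(x) = (1/n) * number of X_i <= x
    sample = np.asarray(sample)
    return np.mean(sample <= x)

sample = [3.0, 1.0, 4.0, 1.0, 5.0]
print(ecdf(sample, 1.0))  # 2 of 5 values are <= 1.0, so 0.4
print(ecdf(sample, 5.0))  # all values are <= 5.0, so 1.0
```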
Histogram

Basic idea:

f(x) ~ [F(x + h) - F(x - h)] / (2h),  where h > 0 is small

Replace F(x) by its estimate Fn(x):

fn(x) = [Fn(x + h) - Fn(x - h)] / (2h)
      = (1/(2hn)) sum_i (I(Xi <= x + h) - I(Xi <= x - h))
      = (1/(2hn)) sum_i I(x - h < Xi <= x + h)
      = (1/(2hn)) x number of the Xi's in (x - h, x + h]

is an estimator of f(x). Note: the interval is open on the left and closed on the right.
Histogram
Constructing a histogram estimator of f(x):

(i) Given a bin width h > 0, form the bins of the histogram. E.g., given an origin x0, let the bins be

(x0 + mh, x0 + (m + 1)h], m = 0, +-1, +-2, ...

Or, divide the range (a, b) into bins of length h.

(ii) fn(x) = (1/(nh)) x number of the Xi's in the bin that contains x

Some statistical software packages show the frequency (the number of Xi's in each bin) on the y-axis rather than the density (frequency divided by nh). Technically, the density must be shown to make sure that the entire area under fn(x) sums to 1.
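A minimal sketch of the histogram density estimator at a single point, under an assumed origin x0 and bin width h (not tied to any particular software's histogram function):

```python
import numpy as np

def histogram_density(sample, x, x0, h):
    # f_n(x) = (1/(nh)) * number of X_i in the bin (x0 + m h, x0 + (m+1) h]
    # that contains x
    sample = np.asarray(sample, dtype=float)
    m = np.floor((x - x0) / h)            # index of the bin containing x
    lo, hi = x0 + m * h, x0 + (m + 1) * h
    count = np.sum((sample > lo) & (sample <= hi))
    return count / (sample.size * h)

rng = np.random.default_rng(5)
sample = rng.normal(size=10_000)
print(histogram_density(sample, 0.2, x0=-5.0, h=0.5))  # roughly the N(0,1) density near 0
```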
Histogram
Histograms of 1000 observations from the standard normal distribution. The y-axis of the left panel shows the frequency.
Asymptotic properties of histogram estimator*
The Mean Squared Error (MSE) of fn(x) is

MSE(fn(x)) = E[fn(x) - f(x)]^2
           = [E(fn(x)) - f(x)]^2 + E[fn(x) - E(fn(x))]^2
           = Bias^2 + Variance.

MSE is a widely used performance measure.
Asymptotic properties of histogram estimator*
Assume that f(x) is smooth enough, in the sense that there exists a constant g > 0 such that

|f(x) - f(y)| <= g |x - y| for all x, y.

Then there exists a number zx belonging to the bin containing x such that

MSE(fn(x)) <= g^2 h^2 + f(zx)/(nh)

The optimal bin width (that minimises this MSE bound) is

h* = ( f(zx) / (2 g^2 n) )^(1/3).
Asymptotic properties of histogram estimator*
Under the optimal h*, the resulting MSE is

MSE(fn(x)) = C / n^(2/3) -> 0 as n -> infinity

for some constant C < infinity.

- So fn(x) converges to the true value f(x) as n -> infinity at the rate n^(-2/3)
- Note: we don't make any assumptions on the form of the underlying density f(x)
- Therefore kernel density estimation is more robust than parametric approaches.
Asymptotic properties of histogram estimator*
It can be shown that, for parametric models f(x|theta) where the parameter theta is estimated by the MLE theta-hat,

MSE(f(x|theta-hat)) = M / n

for some constant M < infinity.

- So f(x|theta-hat) converges to the true value f(x) as n -> infinity at the rate n^(-1), provided that the postulated model f(x|theta) is the true underlying density
- So the rate of convergence of nonparametric estimators is slower than that of parametric estimators. Why?
How many bins to be used?
Sturges’ rule: the number of bins should be 1 + log2(n).
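A sketch of Sturges' rule; rounding 1 + log2(n) up to an integer is an assumption here, as conventions vary between software packages:

```python
import math

def sturges_bins(n):
    # Sturges' rule: number of bins = 1 + log2(n), rounded up to an integer
    return math.ceil(1 + math.log2(n))

print(sturges_bins(1000))  # 1 + log2(1000) ~ 10.97, so 11 bins
print(sturges_bins(64))    # 1 + 6 = 7 bins
```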
Spending Amount example
The number of bins is too small. Important features of this distribution, such as the mode, are not revealed.
Spending Amount example
The number of bins selected by Sturges’ rule.
Spending Amount example
The number of bins is too large. The distribution is overfitted.
Kernel density estimation

Recall the estimator fn(x) we had earlier:

fn(x) = (1/(2hn)) x number of the Xi's falling in (x - h, x + h]

which can be written as

fn(x) = (1/(hn)) sum_{i=1}^n K((x - Xi)/h)

where the kernel K(.) is

K(t) = { 1/2, if -1 < t <= 1
         0, otherwise }

- Apart from the multiplicative factor 1/(hn), this estimator assigns the same weight of 1/2 to every Xi falling in (x - h, x + h]
- Intuitively, more weight should be put on Xi's that are closer to x
Kernel density estimation
- Intuitively, more weight should be put on Xi's that are closer to x
- The kernel K(t) should be designed so that K((x - Xi)/h) gets larger when Xi is closer to x.
- This means K(t) should be larger when t is closer to 0.
- The uniform kernel K(t) = (1/2) I(|t| <= 1) doesn't have this desired property.
Kernel density estimation

Some commonly used kernels:

- Gaussian kernel: K(t) = (1/sqrt(2 pi)) e^(-t^2/2)
- Triangular kernel: K(t) = (1 - |t|) I(|t| <= 1)
- Epanechnikov kernel: K(t) = (3/4)(1 - t^2) I(|t| <= 1)
Kernel density estimation
The kernel density estimator of the unknown pdf f(x) based on iid samples X1, ..., Xn is defined as

fn(x) = (1/(hn)) sum_{i=1}^n K((x - Xi)/h)

- h is called the bandwidth and controls the amount of smoothing
- K is a kernel function
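The kernel density estimator can be sketched directly from this formula (Gaussian kernel, fixed bandwidth, simulated data):

```python
import numpy as np

def gaussian_kernel(t):
    # K(t) = (1/sqrt(2 pi)) exp(-t^2 / 2)
    return np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)

def kde(sample, x, h):
    # f_n(x) = (1/(hn)) * sum_i K((x - X_i) / h)
    sample = np.asarray(sample, dtype=float)
    return np.sum(gaussian_kernel((x - sample) / h)) / (h * sample.size)

rng = np.random.default_rng(6)
sample = rng.normal(size=5_000)
print(kde(sample, 0.0, h=0.3))   # close to the N(0,1) density at 0
```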
Kernel density estimation
Some remarks
- Kernels in order of efficiency: Epanechnikov, Triangular, Gaussian, Uniform
- But the kernel typically plays a less important role than the bandwidth h in determining the performance
- There are several methods for selecting the bandwidth h. We won't discuss this in the course.
- Built-in functions in statistical software often come with an optimal choice of h.
Spending Amount example
The bandwidth h is too large. Local features of this distribution are not revealed.
Spending Amount example
The bandwidth h is selected by a rule of thumb called the normal reference bandwidth.
Spending Amount example
The bandwidth h is too small. The distribution is overfitted.
Kernel regression
Kernel regression
- Consider n iid samples (x1, y1), ..., (xn, yn), copies of the pair (X, Y) with X in R and Y in R
- We want to predict Y based on X
- The simple linear regression model assumes

  Y = b0 + b1 X + e,

  i.e. we assume a linear form for the conditional mean m(x) = E(Y|X = x) = b0 + b1 x.
- Parametric approaches assume a known functional form for E(Y|X = x)
- In contrast, nonparametric approaches don't assume any known functional form for m(x) = E(Y|X = x)
k-nearest neighbour method
kNN is the simplest nonparametric regression method. It estimates m(x) = E(Y|X = x) by

mkNN(x) = Average(yi | xi in Nk(x))

where Nk(x) denotes the neighbourhood containing the k points among the xi's closest to x. Equivalently,

mkNN(x) = (1/k) sum_{i=1}^n yi I(xi in Nk(x)) = (1/k) sum_{i: xi in Nk(x)} yi
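A minimal sketch of the kNN regression estimator on toy one-dimensional data:

```python
import numpy as np

def knn_regress(x_train, y_train, x, k):
    # Average the y_i over the k training points whose x_i are closest to x
    idx = np.argsort(np.abs(x_train - x))[:k]
    return y_train[idx].mean()

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 2.0, 4.0, 6.0, 8.0])   # toy data with y = 2x
print(knn_regress(x_train, y_train, 2.0, k=3))  # neighbours at x = 1, 2, 3 give mean(2, 4, 6) = 4.0
```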
kNN for classification
- The output variable G has two values: BLUE (= 0) and ORANGE (= 1)
- There are two predictors X1 and X2
- 200 training data points are shown in the picture
kNN for classification
We estimate the probability of being ORANGE by

Y(x) = P(G = 1|x) = (1/k) sum_{i: xi in Nk(x)} yi

Classification rule:

G(x) = { ORANGE, if Y(x) > 0.5
         BLUE, if Y(x) <= 0.5 }
kNN for classification
Figure: kNN classification with k = 15. The black curve is the decision boundary {x : Y(x) = 0.5}
kNN for classification
Figure: kNN classification with k = 1. The model is overfitted: the classifier works perfectly on the training data, but not that well on test data
k-nearest neighbour method
k has a big influence on the performance of kNN. k is often selected by cross-validation.
k-nearest neighbour method
mkNN(x) = (1/k) sum_{i=1}^n yi I(xi in Nk(x)) = (1/k) sum_{i: xi in Nk(x)} yi

- kNN assigns the same weight 1/k to all yi that have xi close to x. This is similar to using a uniform kernel.
- Intuitively, yi should be given a bigger weight if xi is closer to x.
- So we can make kNN better. Let's move on...
Nadaraya-Watson estimator
Let

Kh(x, xi) = K((x - xi)/h)

be a kernel function that gets bigger when xi gets closer to x. The Nadaraya-Watson estimator of the conditional mean m(x) = E(Y|X = x) is

m(x) = sum_i wi(x) yi

where

wi(x) = Kh(x, xi) / sum_{j=1}^n Kh(x, xj)

That is, we put more weight on those yi whose corresponding xi is closer to x.
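The Nadaraya-Watson estimator is a weighted average, and can be sketched as follows (Gaussian kernel, toy noiseless data):

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x, h):
    # m_hat(x) = sum_i w_i(x) y_i with w_i proportional to K((x - x_i)/h)
    t = (x - x_train) / h
    k = np.exp(-t ** 2 / 2)     # Gaussian kernel; the constant factor cancels in the weights
    w = k / k.sum()             # weights sum to one
    return np.sum(w * y_train)

x_train = np.linspace(0, 2 * np.pi, 200)
y_train = np.sin(x_train)       # noiseless toy data
est = nadaraya_watson(x_train, y_train, np.pi / 2, h=0.2)
print(est)   # close to sin(pi/2) = 1, with a small smoothing bias
```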
Asymptotic properties of the Nadaraya-Watson estimator*
m(x) = sum Kh(x, xi) yi / sum Kh(x, xi),   m(x) = sum Kh(x, xi) m(x) / sum Kh(x, xi)

m(x) - m(x) = sum Kh(x, xi)(yi - m(x)) / sum Kh(x, xi)

Bias(m(x)) = E[m(x) - m(x)]
           = sum Kh(x, xi)(m(xi) - m(x)) / sum Kh(x, xi)
           = [ (nh)^(-1) sum Kh(x, xi)(m(xi) - m(x)) ] / fn(x)
           = psi(x) / fn(x)

Recall that fn(x) is the KDE of f(x), the pdf of X, and therefore fn(x) -> f(x).
Asymptotic properties of the Nadaraya-Watson estimator*
By the LLN,

psi(x) -> integral h^(-1) Kh(x, t)(m(t) - m(x)) f(t) dt
        = integral K(w)[m(x + hw) - m(x)] f(x + hw) dw
        = integral K(w)[hw m'(x) + (1/2) h^2 w^2 m''(x) + o(h^2)] x [f(x) + hw f'(x) + (1/2) h^2 w^2 f''(x) + o(h^2)] dw
        = C(x, K) h^2 + o(h^2)

where C(x, K) is a constant depending on m(x), f(x) and the kernel K.
Asymptotic properties of the Nadaraya-Watson estimator*
Similarly, we can show that

V(m(x)) = C2(x, K)/(nh) + o((nh)^(-1))

So

MSE(m(x)) = C1(x, K) h^4 + C2(x, K)/(nh) + o(h^4) + o((nh)^(-1))

The optimal bandwidth h is the one that minimises the right-hand side:

h*(x) = C3(x, K) n^(-1/5)

Under this optimal h*,

MSE(m(x)) = C4(x, K) n^(-4/5) -> 0

as n -> infinity.
Asymptotic properties of the Nadaraya-Watson estimator
So,

MSE(m(x)) = E(m(x) - m(x))^2 = constant x n^(-4/5) -> 0

as the sample size n -> infinity.

- So m(x) converges to the true value m(x) as n -> infinity.
- Note: we don't make any assumptions on the form of the underlying conditional mean m(x)
- Therefore the kernel regression method is more robust than parametric approaches.
UN data exampleConsider the UN data on the relationship between GDP per capita(X ) and Fertility rate Y . We want to estimate E(Y |X = x)
Figure: The red line shows the NW estimator m̂(x) with a too small bandwidth h

UN data example

Figure: The red line shows the NW estimator m̂(x) with an optimal bandwidth h

UN data example

Figure: The red line shows the NW estimator m̂(x) with a too large bandwidth h
QBUS3820 Data Mining and Data Analysis
Model Selection and Variable Selection
Dr. Minh-Ngoc Tran, University of Sydney Business School
Table of contents
Introduction and Basic Concepts
Popular model selection methods
LASSO
Reading: Chapter 7 of the textbook The Elements of Statistical Learning.

Introduction and Basic Concepts
Introduction
Model selection problem
I Model selection in general and variable selection in particular are important parts of data analysis. Variable selection can be considered a special case of model selection.
I Consider a dataset D. Let {Mi, i ∈ I} be a set of potential models that can be used to explain D.
I The model selection problem is to select the "best" model to interpret D and/or to make good predictions on future observations.
I "Best" depends on how we define it: not overfitting, producing accurate predictions, etc.
Introduction
Model selection problem

For example, given a dataset D = {(x1, y1), ..., (xn, yn)}, two models are proposed to explain D (one is yours, one is your boss's):

Model 1: yi = β0 + β1xi + εi, where εi is assumed to have a normal distribution N(0, σ²).

Model 2: yi = β0 + β1xi + εi, where εi is assumed to have a Student's t distribution tν(0, σ²).
Then, you need to answer the question: which model is better?
Introduction
Variable selection problem

I Consider a regression model with response Y and a set of p potential covariates X1, ..., Xp.
I At the beginning stage of modelling, p is often large in order to reduce possible bias.
  I A large p might cause heavy admin duties, be costly, etc.
  I More importantly, a large p typically leads to a high variance in prediction (see later).
I The variable selection problem is to select the "best" subset of these p covariates to explain/predict Y.
I "Best" depends on how we define it: not overfitting, producing accurate predictions, etc.
Basic concepts
I Suppose that we have a target variable Y to be predicted based on an input vector X.
I Given a model M, and based on data D, we predict Y by f̂M(X|D).
I The functional form of f̂M(X|D) is determined by the nature of model M, e.g. a linear regression model or a spline regression model.
I The estimated parameters in f̂M(X|D) are computed based on data D.
I Let L(Y, f̂M(X|D)) denote the loss when we predict Y by f̂M(X|D), e.g.
  I Squared error loss: L(Y, f̂M(X|D)) = (Y − f̂M(X|D))²
  I 0-1 loss: L(Y, f̂M(X|D)) = I(Y ≠ f̂M(X|D))
  I Log-likelihood loss: L(Y, f̂M(X|D)) = − log p(Y | f̂M(X|D))
Basic concepts
I The prediction error of model M, conditional on data D, is defined as

Err(M|D) = E_{(X,Y)}[L(Y, f̂M(X|D))] = Average{L(Yj, f̂M(Xj|D)) over all future (Xj, Yj)}

I The expectation E_{(X,Y)}[·] is with respect to the joint population distribution of Y and X.
I The prediction error Err(M|D) measures the performance of model M. The smaller this error, the better M is.
I Note that Err(M|D) depends on the data D, and is therefore a random quantity.
Basic concepts
I Another measure of prediction performance is the expected prediction error.
I The expected prediction error of model M is defined as

Err(M) = ED[Err(M|D)] = ED[E_{(X,Y)}[L(Y, f̂M(X|D))]]

I ED[·] is the expectation with respect to all datasets of the same size as D.
I So Err(M) averages out the effect of data D. There is no longer any uncertainty involved, as all randomness (in (X,Y) and D) has been averaged out.
I This expected prediction error is an ideal measure of performance for model selection.
Basic concepts
Bias and variance decomposition

Consider the general regression model

Y = f(X) + ε,  E(ε) = 0,  V(ε) = σ²

Let f̂(X) be an estimate of f(X). For notational simplicity, I've suppressed the dependence of f̂(X) on model M and data D.

I Suppose we want to predict the mean value of Y at X = x0: f(x0) = E(Y|X = x0).
I The prediction is f̂(x0).

Using squared-error loss, the prediction error is

Err(M|D) = Eε[(Y|X=x0 − f̂(x0))²] = Eε[(f(x0) − f̂(x0) + ε)²] = (f(x0) − f̂(x0))² + σ²
Basic concepts
Bias and variance decomposition

The expected prediction error (EPE) is

Err(M) = E(f̂(x0) − f(x0))² + σ²
       = E((f̂(x0) − E[f̂(x0)]) + (E[f̂(x0)] − f(x0)))² + σ²
       = V(f̂(x0)) + (E[f̂(x0)] − f(x0))² + σ²
       = Variance(f̂) + Bias²(f̂) + σ²

As σ² is a constant independent of M, the EPE of model M is essentially decomposed into two terms: Variance and Bias².
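The decomposition can be checked by simulation. As a sketch (my own example, not from the slides): estimate μ = f(x0) by the shrinkage estimator c·ȳ, for which theory gives Variance = c²σ²/n and Bias² = (c − 1)²μ². The Monte Carlo MSE of the estimator (the error excluding the irreducible σ²) should equal Variance + Bias².

```python
import random

def simulate_mse(c, mu=2.0, sigma=1.0, n=20, reps=20000, seed=1):
    """Monte Carlo check of MSE = Variance + Bias^2 for the
    shrinkage estimator f_hat = c * sample_mean, which estimates mu."""
    rng = random.Random(seed)
    ests = []
    for _ in range(reps):
        ys = [rng.gauss(mu, sigma) for _ in range(n)]
        ests.append(c * sum(ys) / n)
    mean_est = sum(ests) / reps
    var = sum((e - mean_est) ** 2 for e in ests) / reps   # Variance
    bias2 = (mean_est - mu) ** 2                          # Bias^2
    mse = sum((e - mu) ** 2 for e in ests) / reps         # MSE
    return mse, var, bias2
```

For c < 1 the estimator trades a little bias for a smaller variance, which is exactly the trade-off model selection exploits.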
Basic concepts
Bias and variance decomposition

For example, in the multiple linear regression model, the true value f(x0) = E(Y|X = x0) (usually non-linear in the p-vector x0) is first deterministically approximated by a linear combination x0′β, then stochastically estimated by f̂(x0) = x0′β̂.

Err(M) = Variance + Bias² + σ²
       = V(x0′β̂) + (E[x0′β̂] − f(x0))² + σ²
       = x0′cov(β̂)x0 + (x0′β − f(x0))² + σ²
       = σ² ∑_{i=1}^p x0i² + (x0′β − f(x0))² + σ²,

where we suppose the design matrix X is standardised so that cov(β̂) = σ²(X′X)^{−1} = σ²I.

So the larger the number of covariates p, the bigger the Variance and the smaller the Bias², and vice versa.
Basic concepts
Overfitting

Expected prediction error = Variance + Bias²

I Overfitting: a complex model is used (f̂(x) is a complicated function, involves a lot of parameters, etc.), so Bias² is small but Variance is large. Underfitting: the opposite.
I Model selection: pick a model that trades off Bias against Variance.
Hope you enjoy this song!
https://www.youtube.com/watch?v=DQWI1kvmwRg
Basic concepts
Overfitting

The light red curves show the prediction errors in a linear regression. The solid red curve shows the expected prediction error (averaged over the prediction errors). The x-axis is the model complexity, proportional to the number of predictors used in the model.
Basic concepts
Training error

Training error is the average loss over the training data points, i.e. the data used to estimate the model:

err = (1/n) ∑_{i=1}^n L(yi, f̂(xi))

I Given a model, training data is used to fit the model (i.e. estimate the parameters), e.g. by minimising the training error.
I Training error is not a good measure for model selection: it decreases as model complexity increases.
Basic concepts
Training error is not good for model selection

The light blue curves show the training errors in a linear regression. The solid blue curve shows the expected training error E(err) (averaged over the training errors). Training errors consistently decrease as model complexity increases.
Basic concepts
Training data and validation data

I In the ideal rich-data case, we can divide the data into two sets: a training set and a validation set.
I Use the training set to fit/estimate the model and the validation set to estimate the prediction error (NOT the expected prediction error!).
I Pick the model with the smallest prediction error.
I But...
  I data is precious
  I we can do better with cross-validation (see later)
Basic concepts
Penalised maximum likelihood principle

I Typically, the training error err is smaller than the prediction error, because the same data is used to fit the model and to assess its error (see the picture).
I As the training error decreases as model complexity increases, it is a good idea to penalise model complexity.
I Many popular model selection criteria have the form

model selection criterion = err + penalty for model complexity
Popular model selectionmethods
AIC
Akaike's information criterion (AIC): select the model with the smallest AIC

AIC = −2 × log-likelihood(θ̂mle) + 2d

I d is the number of parameters in θ (e.g. the number of covariates)
I log-likelihood(θ̂mle) is the log-likelihood evaluated at the MLE θ̂mle
I The factor 2 is not important, but is useful when comparing AIC to other model selection criteria
I AIC is an estimate of the expected prediction error under the log-likelihood loss L(y, f̂(x)) = − log p(y|f̂(x))
I AIC was proposed by Hirotugu Akaike in 1973
BIC
Bayesian information criterion (BIC): select the model with the smallest BIC

BIC = −2 × log-likelihood(θ̂mle) + (log n) × d

BIC, proposed by Gideon Schwarz in 1978, is motivated by the Bayesian approach.
BIC*

Consider a model M with a d-vector of parameters θ, and data D. The posterior of M is

p(M|D) ∝ p(M) p(D|M) = p(M) ∫ p(D|θ, M) p(θ|M) dθ.

Using a uniform prior for M and approximating the integral by the so-called Laplace approximation,

log p(D|M) = log p(D|θ̂mle, M) − (d/2) log(n) + O(1)

O(1), read as "big order one", denotes a term that depends on n but stays bounded as n grows.

We want to pick the model with the highest posterior p(M|D), which is (up to the O(1) term) equivalent to picking the model with the smallest BIC:

BIC = −2 × log-likelihood(θ̂mle) + (log n) × d
AIC or BIC?
AIC = −2 × log-likelihood(θ̂mle) + 2d

BIC = −2 × log-likelihood(θ̂mle) + (log n) × d

I They're both popular model selection methods. BIC puts a heavier penalty on model complexity.
I BIC is asymptotically consistent: it is able to identify the true model as n → ∞ (if such a true model exists! Some argue that a true model never exists).
I Practitioners seem to prefer AIC over BIC when n is small.

[M.-N. Tran (2011), The Loss Rank Criterion for Variable Selection in Linear Regression Analysis, Scandinavian Journal of Statistics] proposes another criterion which is a compromise between AIC and BIC.
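In code, the two criteria above are one-liners; here is a small sketch (my own, assuming the maximised log-likelihood and parameter count are already available):

```python
import math

def aic(loglik, d):
    """Akaike's criterion: -2*loglik + 2*d."""
    return -2.0 * loglik + 2.0 * d

def bic(loglik, d, n):
    """Schwarz's criterion: -2*loglik + log(n)*d."""
    return -2.0 * loglik + math.log(n) * d
```

Since log n > 2 once n > e² ≈ 7.4, BIC penalises each extra parameter more heavily than AIC for all but tiny samples.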
Cross-validation
I basic idea: like you validate your peers' work and they validate yours
I probably the simplest and most commonly used model selection method
I gives an estimate of the expected prediction error
Cross-validation
I divide the data into K parts, K ≥ 2. Often, this is done randomly
I for the kth part, fit the model to the other K − 1 parts. Denote the fitted model by f̂^{−k}(x)
I use the fitted model to predict the kth part. The prediction error is

∑_{(yi, xi) ∈ part k} L(yi, f̂^{−k}(xi))

I The K-fold cross-validated prediction error is

CV = (1/n) ∑_{k=1}^K ∑_{(yi, xi) ∈ part k} L(yi, f̂^{−k}(xi))

It's an estimate of the expected prediction error, as it is averaged over both test data and training data.
Cross-validation
I The selected model is the one with the smallest CV predictionerror.
I Typical choices of K are 5, 10 or n. The case K = n is knownas leave-one-out cross-validation.
Cross-validation is simple and widely used. However, CV can sometimes be very computationally expensive, because one has to fit the model many times.
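The procedure above can be sketched in a few lines of Python (my own illustration, using a closed-form simple linear regression as the model being validated; a deterministic interleaved split stands in for a random one):

```python
def fit_slr(data):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(data)
    xbar = sum(x for x, _ in data) / n
    ybar = sum(y for _, y in data) / n
    sxx = sum((x - xbar) ** 2 for x, _ in data)
    sxy = sum((x - xbar) * (y - ybar) for x, y in data)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def cv_error(data, K):
    """K-fold CV estimate of squared-error prediction risk."""
    n = len(data)
    folds = [data[k::K] for k in range(K)]  # simple deterministic split
    total = 0.0
    for k in range(K):
        # fit on the other K-1 folds, predict the held-out kth fold
        train = [p for j in range(K) if j != k for p in folds[j]]
        b0, b1 = fit_slr(train)
        total += sum((y - (b0 + b1 * x)) ** 2 for x, y in folds[k])
    return total / n
```

On noiseless data lying exactly on a line, every held-out fold is predicted perfectly and the CV error is (numerically) zero.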
Variable selection in linear regression
Consider a linear regression model or a logistic regression model with p potential covariates.

At the initial step of modelling, a large number p of covariates is often introduced in order to reduce potential bias. The task is then to select the best subset among these p variables.

Best subset selection: search over all 2^p possible subsets of the p covariates to find the best subset. The criterion can be AIC, BIC or any other model selection criterion.
Variable selection in linear regression
Searching over 2^p subsets is only feasible when p is small (< 30).

Forward-stepwise selection: start with the intercept, then sequentially add to the model the covariate that most improves the model selection criterion.

Backward-stepwise selection: start with the full model with p covariates, then sequentially remove the covariate that most improves the model selection criterion.

I Advantage: much more time-efficient than the best subset selection method.
I Disadvantage: does not necessarily end up at the best subset.
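Forward-stepwise selection is just a greedy loop. Here is a generic sketch (my own) that works with any model selection criterion supplied as a scoring function to be minimised, e.g. the AIC or BIC of the refitted model:

```python
def forward_stepwise(variables, criterion):
    """Greedy forward selection.

    variables: iterable of candidate variable labels.
    criterion: function mapping a subset (frozenset) to a score
               to be minimised (e.g. the model's AIC or BIC).
    Returns the selected subset.
    """
    selected = frozenset()
    best = criterion(selected)
    remaining = set(variables)
    while remaining:
        # try adding each remaining variable, keep the best improvement
        trials = [(criterion(selected | {v}), v) for v in remaining]
        score, v = min(trials)
        if score >= best:          # no variable improves the criterion
            break
        selected, best = selected | {v}, score
        remaining.discard(v)
    return selected
```

Backward-stepwise selection is the mirror image: start from the full set and greedily remove variables while the criterion improves.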
Variable selection in linear regression
Variable selection based on hypothesis testing

I Consider the test H0: βj = 0 vs. H1: βj ≠ 0.
I Let β̂j be an estimator of βj. If the sampling distribution of β̂j is known, a p-value can be computed.
I If the p-value is large (e.g. > 0.05 or 0.1), then the corresponding covariate Xj might be removed from the model.

Possible disadvantages

I It's not clear what prediction error is optimised.
I Not time-efficient when p is large (the model must be refitted many times).

This variable selection method is not popular in "modern" statistics and machine learning. For historical reasons, it's still widely used in many fields such as the social sciences.
Woman labor force example
The data set MROZ.xlsx, available in Blackboard, contains information on women's labour force participation.

We would like to build a logistic regression model to explain women's labour force participation using the potential predictors nwifeinc (income), educ (years of education), age, exper (years of experience), expersq (squared years of experience), kidslt6 (number of kids under six years old) and kidsge6 (number of kids aged six or older).
Let’s carry out the variable selection task.
Woman labor force example
LASSO
LASSO
I Consider a simple linear regression model

yi = β0 + β1xi + εi,  i = 1, ..., n

I Assume that the xi's have been standardised so that ∑i xi = 0 and ∑i xi² = 1
I The LS method estimates β = (β0, β1)′ by minimising the sum of squared errors

∑i (yi − β0 − β1xi)²

I It's easy to see that the solution is

β̂1^ls = ∑i xi yi,  β̂0^ls = (1/n) ∑i yi
LASSO

I The LASSO method estimates β = (β0, β1)′ by minimising half the sum of squared errors plus a penalty term on β1:

(1/2) ∑i (yi − β0 − β1xi)² + λ|β1|

I |β1| is the absolute value of β1; λ > 0 controls the penalty and is called the shrinkage parameter
I Note that there's no penalty term for β0. The reason is that we are in general not interested in determining whether or not β0 = 0
I It can be shown that the solution is

β̂1^lasso = 0,            if λ ≥ |β̂1^ls|
β̂1^lasso = β̂1^ls − λ,   if λ < |β̂1^ls| and β̂1^ls > 0
β̂1^lasso = β̂1^ls + λ,   if λ < |β̂1^ls| and β̂1^ls < 0

β̂0^lasso = β̂0^ls = (1/n) ∑i yi
LASSO
I So when the shrinkage parameter λ is large enough (i.e. λ ≥ |β̂1^ls|), the Lasso estimate β̂1^lasso will be 0
I This can also be interpreted as follows: when |β̂1^ls| is small enough to be regarded as insignificant, Lasso automatically shrinks it to zero
I Because of this attractive feature, Lasso is a method for variable selection
I In general, the Lasso method shrinks all the LS estimates towards 0
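This closed-form solution is the familiar soft-thresholding operator applied to β̂1^ls. A one-function sketch (my own, for the standardised simple regression case above):

```python
def soft_threshold(b_ls, lam):
    """Lasso solution for a single standardised predictor:
    shrink the LS estimate b_ls towards 0 by lam, or set it to 0
    when lam >= |b_ls|.  Equals sign(b_ls) * max(|b_ls| - lam, 0)."""
    if lam >= abs(b_ls):
        return 0.0
    return b_ls - lam if b_ls > 0 else b_ls + lam
```

Small LS estimates are snapped exactly to zero, while large ones are merely shrunk, which is why Lasso performs variable selection and shrinkage at the same time.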
LASSO
I Now, consider the general multiple linear regression model
yi = β0 + β1xi1 + ...+ βpxip + εi
I For a given λ, the Lasso method estimates β = (β0, β1, ..., βp)′ by minimising

(1/2) ∑i (yi − β0 − β1xi1 − ... − βpxip)² + λ ∑_{j=1}^p |βj|

I Note that, in general, we don't penalise β0, as we are not interested in whether or not β0 = 0
I There isn't a closed-form solution to this optimisation problem, but it can be solved by numerical optimisation techniques
I Many "modern" statistical software packages have built-in functions to implement Lasso
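One standard numerical approach is cyclic coordinate descent, which repeatedly applies the one-variable soft-threshold update to each coefficient in turn. A minimal sketch (my own, not the slides' code; it assumes y is centred, no intercept, each column of X scaled so that ∑i xij² = 1, and the objective (1/2)∑i(yi − ∑j βj xij)² + λ∑j|βj|):

```python
def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the Lasso.

    X: list of rows; y: centred responses; columns of X scaled to
    unit sum of squares, so each coordinate update is one
    soft-threshold of the partial-residual correlation.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # correlation of column j with the residual excluding j
            rho = sum(X[i][j] * (y[i] - sum(beta[k] * X[i][k]
                      for k in range(p) if k != j)) for i in range(n))
            if rho > lam:
                beta[j] = rho - lam
            elif rho < -lam:
                beta[j] = rho + lam
            else:
                beta[j] = 0.0
    return beta
```

With orthonormal columns the solution is exactly the soft-thresholded LS estimate, which gives an easy correctness check; for large λ every coefficient is shrunk exactly to zero.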
LASSO
I Let's apply the method to the prostate cancer dataset, available on the textbook's website and in Blackboard
I The goal is to predict the log of the prostate-specific antigen level (lpsa), using multiple linear regression with eight predictors: log cancer volume (lcavol), log prostate weight (lweight), age, etc.
I We want to estimate the coefficients and simultaneously remove insignificant predictors
Lasso
The figure shows the profile of the Lasso estimates as the shrinkage parameter λ varies from 0 to λmax, the value at which all coefficients are 0.
Lasso
I For the particular λ marked by the dotted red line, three predictors, lcavol, lweight and svi, are selected; the other five are removed.
I When λ ≥ λmax, ALL coefficients β1, ..., βp are 0. This is when the model is likely to be underfitted.
I When λ = 0, ALL coefficients β1, ..., βp are non-zero (the LS solution). This is when the model is likely to be overfitted.
Selecting λ
The shrinkage parameter λ can be selected using a BIC-type criterion.

Let X be the design matrix and y the vector of responses. Denote by β̂λ^lasso the Lasso estimate of β given λ. Define

BIC(λ) = log(‖y − X β̂λ^lasso‖² / n) + dfλ × log(n)/n

dfλ is called the degrees of freedom, which is approximately the number of non-zero coefficients in the model.

The best λ is the one that minimises BIC(λ).
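Given the fitted values and the number of non-zero coefficients for each candidate λ, BIC(λ) is cheap to evaluate on a grid. A small helper sketch (my own names):

```python
import math

def bic_lasso(y, fitted, df):
    """BIC-type criterion for the Lasso:
    log(RSS/n) + df * log(n)/n, where df is the number of
    non-zero coefficients and fitted holds X @ beta_lasso."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return math.log(rss / n) + df * math.log(n) / n
```

In practice one computes the Lasso path over a λ grid, evaluates bic_lasso at each grid point, and keeps the λ with the smallest value.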
Lasso
BIC(λ) is minimised at λ = 0.0623
Lasso
The final Lasso estimate is shown by the vertical line at λ = 0.0623.

Lasso

The final Lasso estimate with λ chosen by BIC is

β̂_{λ=0.0623}^lasso = (0.3715, 0.5151, 0.3421, 0, 0.0491, 0.5623, 0, 0, 0.0014)′

So three predictors, age, lcp and gleason, are removed. Note that the first element is the intercept.
Selecting λ by Loss rank principle method
Tran (2011), Scandinavian Journal of Statistics, Vol. 38, pp. 466-479, proposes a method called the loss rank principle for selecting λ:

LR(λ) = KL(dfλ/n, 1 − ρλ)

where

I KL(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))
I dfλ is the number of non-zero coefficients in the model
I ρλ = ‖y − X β̂λ^lasso‖² / ‖y‖²

The best λ is the one that maximises LR(λ).
Selecting λ by Loss rank principle method
LR(λ) is maximised at λ = 0.0623. In this example, BIC and LRgive the same result.
LASSO for logistic regression

I Response/output data yi are binary: 0 or 1, Yes or No.
I We want to explain/predict yi based on a vector of predictors xi = (xi1, ..., xip)′.
I We assume yi|xi ∼ B(1, p1(xi)), i.e. yi follows a Bernoulli distribution with probability of success p1(xi), where

p1(xi) = P(yi = 1|xi) = exp(β0 + β1xi1 + ... + βpxip) / (1 + exp(β0 + β1xi1 + ... + βpxip))

I If Y is a Bernoulli r.v. with probability π, then the density function of Y is

p(y|π) = π^y (1 − π)^{1−y}.

I The probability density function of yi is therefore

p(yi|xi, β) = p1(xi)^{yi} (1 − p1(xi))^{1−yi}

so the likelihood function is

p(y|X, β) = ∏_{i=1}^n p1(xi)^{yi} (1 − p1(xi))^{1−yi}
LASSO for logistic regression
I The log-likelihood is

ℓ(β) = log p(y|X, β) = ∑i [yi log p1(xi) + (1 − yi) log(1 − p1(xi))]

I Lasso estimates β by minimising the negative log-likelihood plus a penalty term:

−ℓ(β) + λ ∑_{j=1}^p |βj|,  λ > 0

I Insignificant coefficients will be automatically shrunk to 0
I Most "modern" statistical software packages have built-in functions to implement Lasso for logistic regression
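For illustration, here is one way to carry out this minimisation: proximal gradient descent (ISTA), which alternates a gradient step on −ℓ(β) with soft-thresholding of the slope coefficients. This is my own sketch, not the slides' method; in practice one would use a built-in routine (e.g. glmnet in R or scikit-learn in Python).

```python
import math

def lasso_logistic(X, y, lam, step=0.01, n_iter=2000):
    """Proximal gradient descent for L1-penalised logistic regression.

    Minimises -loglik(beta) + lam * sum_j |beta_j|, with the
    intercept b0 left unpenalised. Each iteration takes a gradient
    step on -loglik, then soft-thresholds the slope coefficients.
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # gradient of -loglik: sum_i (p1(x_i) - y_i) * (1, x_i)
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            eta = b0 + sum(bj * xij for bj, xij in zip(beta, xi))
            resid = 1.0 / (1.0 + math.exp(-eta)) - yi
            g0 += resid
            for j in range(p):
                g[j] += resid * xi[j]
        b0 -= step * g0
        for j in range(p):
            z = beta[j] - step * g[j]
            t = step * lam
            beta[j] = math.copysign(max(abs(z) - t, 0.0), z)
    return b0, beta
```

As in the linear case, a large enough λ shrinks every slope coefficient exactly to zero, leaving only the intercept.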
LASSO for logistic regression
λ can be selected using the BIC or AIC criterion:

AIC(λ) = −2 × log-likelihood(β̂λ^lasso) + 2 × dfλ

BIC(λ) = −2 × log-likelihood(β̂λ^lasso) + (log n) × dfλ

where dfλ is the number of non-zero coefficients in the model.

The selected λ is the one that minimises BIC(λ) or AIC(λ).
LASSO
I Lasso is very useful when there are many potential predictors, i.e. when p is large
I Lasso still works even when p ≫ n, a setting where classical methods such as least squares break down