Linear Models for Regression - University at Buffalo
TRANSCRIPT
Topics in Linear Regression
• What is regression?
• Polynomial Curve Fitting with Scalar Input
• Linear Basis Function Models
• Maximum Likelihood and Least Squares
• Stochastic Gradient Descent
• Regularized Least Squares
The Regression Task
• It is a supervised learning task
• Goal of regression:
  – Predict the value of one or more target variables t
  – Given a d-dimensional vector x of input variables
  – Using a dataset of known inputs and outputs (x1,t1),..,(xN,tN)
    • where xi is an input (possibly a vector) known as the predictor
    • and ti is the target output (or response) for case i, which is real-valued
  – The goal is to predict t from x for some future test case
    • We are not trying to model the distribution of x
  – We don't expect the predictor to be a linear function of x
    • So ordinary linear regression on the inputs will not work
    • We need to allow for a nonlinear function of x
    • We don't have a theory of what form this function should take
An Example Problem
• Fifty points generated (one-dimensional problem)
  – With x uniform from (0,1)
  – y generated from the formula y = sin(1 + x^2) + noise
    • where the noise has a N(0, 0.03^2) distribution
  – The noise-free true function and the data points are as shown
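A minimal numpy sketch of how such a data set could be generated (not from the slides; the seed and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, for reproducibility
N = 50
x = rng.uniform(0.0, 1.0, size=N)       # inputs drawn uniformly from (0,1)
t = np.sin(1.0 + x**2) + rng.normal(0.0, 0.03, size=N)   # targets with N(0, 0.03^2) noise
```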
Applications of Regression
1. Expected claim amount an insured person will make (used to set insurance premiums) or prediction of future prices of securities
2. Also used for algorithmic trading
ML Terminology
• Regression
  – Predict a numerical value t given some input
  – The learning algorithm has to output a function f : R^n → R
    • where n = number of input variables
• Classification
  – If the t value is a label (categories): f : R^n → {1,..,k}
• Ordinal Regression
  – Discrete values, ordered categories
Polynomial Curve Fitting with a Scalar Input
• With a single input variable x:
  y(x, w) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M = \sum_{j=0}^{M} w_j x^j
  – M is the order of the polynomial; x^j denotes x raised to the power j
  – The coefficients w_0,…,w_M are collectively denoted by the vector w
• Task: learn w from training data D = {(x_i, t_i)}, i = 1,..,N
  – e.g., a training set with N = 10, input x, target t
  – This can be done by minimizing an error function that measures the misfit between y(x, w), for any given w, and the training data
• One simple choice of error function is the sum of squares of the errors between the predictions y(x_n, w) for each data point x_n and the corresponding target values t_n, so that we minimize
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
  – It is zero when the function y(x, w) passes exactly through each training data point
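A hedged sketch of fitting such a polynomial by minimizing E(w) with numpy's least-squares solver (the design-matrix construction and M = 3 are assumptions; x and t are the arrays from the data-generation sketch above):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Fit y(x,w) = sum_j w_j x^j by minimizing the sum-of-squares error E(w)."""
    Phi = np.vander(x, M + 1, increasing=True)     # columns x^0, x^1, .., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # minimizes ||Phi w - t||^2
    E = 0.5 * np.sum((Phi @ w - t) ** 2)           # E(w) at the minimizer
    return w, E

w, E = fit_polynomial(x, t, M=3)   # x, t from the data-generation sketch
```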
Results with polynomial basis
Regression with Multiple Inputs
• Generalization
  – Predict the value of a continuous target variable t given the value of D input variables x = [x_1,..,x_D]
  – t can also be a set of variables (multiple regression)
  – Use linear functions of adjustable parameters
    • Specifically, linear combinations of nonlinear functions of the input variables
• Polynomial curve fitting is good only for a single input variable (scalar x)
  – It cannot be easily generalized to several variables, as we will see
Simplest Linear Model with D Inputs
• Regression with D input variables:
  y(x, w) = w_0 + w_1 x_1 + .. + w_D x_D = w^T x
  – where x = (x_1,..,x_D)^T are the input variables (with a dummy input x_0 = 1 absorbing the bias)
• Called linear regression since it is a linear function of
  – the parameters w_0,..,w_D
  – the input variables x_1,..,x_D
• Being a linear function of the input variables is a significant limitation
  – In the one-dimensional case this amounts to a straight-line fit (a degree-one polynomial): y(x, w) = w_0 + w_1 x
• This differs from linear regression with one variable and from polynomial regression with one variable
Fitting a Regression Plane
• Assume t is a function of the inputs x_1, x_2,..,x_D
• Goal: find the best linear regressor of t on all inputs
  – This fits a hyperplane through the N input samples
  – For D = 2: a plane fit to data with columns x_1, x_2, t
• Being a linear function of the input variables imposes limitations on the model
  – We can extend the class of models by considering fixed nonlinear functions of the input variables
Basis Functions
• In many applications, we apply some form of fixed preprocessing, or feature extraction, to the original data variables
• If the original variables comprise the vector x, then the features can be expressed in terms of basis functions {φ_j(x)}
  – By using nonlinear basis functions we allow the function y(x, w) to be a nonlinear function of the input vector x
  – Such models are linear functions of the parameters (which gives them simple analytical properties), yet nonlinear with respect to the input variables
Linear Regression with M Basis Functions
• Extend the model by considering nonlinear functions of the input variables:
  y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)
  – where the φ_j(x) are called basis functions
  – We now need M weights for the basis functions instead of D weights for the features
• With a dummy basis function φ_0(x) = 1 corresponding to the bias parameter w_0, we can write
  y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
  – where w = (w_0, w_1,.., w_{M-1})^T and φ = (φ_0, φ_1,.., φ_{M-1})^T
• Basis functions allow nonlinearity with D input variables
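A minimal sketch of evaluating y(x, w) = w^T φ(x) (illustrative names, not from the slides; phi is any callable returning the M basis values, with φ_0 = 1):

```python
import numpy as np

def predict(w, phi, x):
    """y(x,w) = w^T phi(x), for any basis function vector phi."""
    return w @ phi(x)

# Identity features phi(x) = [1, x_1, .., x_D] recover the simple linear model
phi_linear = lambda x: np.concatenate(([1.0], x))
w = np.array([0.1, 2.0, -0.5])                    # hypothetical weights (M = 3)
y = predict(w, phi_linear, np.array([1.0, 3.0]))  # = 0.1 + 2*1.0 - 0.5*3.0
```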
Choice of Basis Functions
• There are many possible choices for the basis functions:
  1. Polynomial regression
     • Good only if there is a single input variable
  2. Gaussian basis functions
  3. Sigmoidal basis functions
  4. Fourier basis functions
  5. Wavelets
1. Polynomial Basis for One Variable
• Linear basis function model:
  y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
• Polynomial basis (for a single variable x): φ_j(x) = x^j, giving a degree M-1 polynomial
• Disadvantages
  – Global: changes in one region of input space affect others
  – Difficult to formulate: the number of polynomial terms increases exponentially with M
• Can divide the input space into regions and use a different polynomial in each region
  – equivalent to spline functions
Can We Use a Polynomial Basis with D Variables? (Not Practical!)
• Consider (for a vector x) the basis
  \phi_j(x) = ||x||^j = (x_1^2 + x_2^2 + .. + x_D^2)^{j/2}
  – x = (2,1) and x = (1,2) have the same squared sum, so this basis is unsatisfactory
  – The vector is converted into a scalar value, thereby losing information
• A better polynomial approach:
  – A polynomial of degree M-1 has terms with variables taken none, one, two,.., M-1 at a time
  – Use a multi-index j = (j_1, j_2,.., j_D) such that j_1 + j_2 + .. + j_D ≤ M-1
  – For a quadratic (M = 3) with three variables (D = 3):
    y(x, w) = \sum_{(j_1, j_2, j_3)} w_j \phi_j(x)
            = w_0 + w_{1,0,0} x_1 + w_{0,1,0} x_2 + w_{0,0,1} x_3 + w_{1,1,0} x_1 x_2 + w_{1,0,1} x_1 x_3 + w_{0,1,1} x_2 x_3 + w_{2,0,0} x_1^2 + w_{0,2,0} x_2^2 + w_{0,0,2} x_3^2
  – The number of quadratic terms is 1 + D + D(D-1)/2 + D
  – For D = 46, it is 1128
• Better to use a Gaussian kernel, discussed next
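A quick check of the term-count formula (illustrative snippet, not from the slides):

```python
def num_quadratic_terms(D):
    """1 constant + D linear + D(D-1)/2 cross + D squared terms."""
    return 1 + D + D * (D - 1) // 2 + D

print(num_quadratic_terms(3))    # 10, matching the D = 3 expansion above
print(num_quadratic_terms(46))   # 1128
```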
Disadvantage of Polynomials
• Polynomials are global basis functions
  – Each affects the prediction over the whole input space
• Often, local basis functions are more appropriate
2. Gaussian Radial Basis Functions
• Gaussian basis function:
  \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2\sigma^2} \right)
  – It does not necessarily have a probabilistic interpretation
  – The usual normalization term is unimportant, since each basis function is multiplied by a weight w_j
• Choice of parameters
  – The μ_j govern the locations of the basis functions
    • They can be an arbitrary set of points within the range of the data
    • One can choose some representative data points
  – σ governs the spatial scale
    • It could be chosen from the data set, e.g., from the average variance
• Several variables
  – A Gaussian kernel would be chosen for each dimension:
    \phi_j(x) = \exp\left( -\frac{1}{2} (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) \right)
  – For each j a different set of means would be needed, perhaps chosen from the data
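A hedged numpy sketch of a Gaussian RBF feature map for scalar inputs (the grid of centers and the scale are assumptions):

```python
import numpy as np

def gaussian_basis(x, mu, sigma):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 sigma^2)), vectorized over centers mu."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

mu = np.linspace(0.0, 1.0, 9)             # centers mu_j on a grid in the data range
sigma = 0.1                               # spatial scale, set by hand here
phi = gaussian_basis(0.35, mu, sigma)     # M-vector of basis activations for x = 0.35
```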
Result with Gaussian Basis Functions
[Figure: fits using Gaussian basis functions for s = 0.1, with the μ_j on a grid with spacing s. Weights w_j for the middle model: 6856.5, -3544.1, -2473.7, -2859.8, -2637.7, -2861.5, -2468.0, -3558.4]
Biological Inspiration for RBFs
• The nervous system contains many examples
  – Local receptive fields in the visual cortex
• RBF Network
  – Receptive fields overlap, so there is usually more than one unit active
  – But for a given input, the total number of active units is small
Tiling the Input Space
• Determining the centers: k-means clustering
  – Choose k cluster centers
  – Mark each training point as captured by the cluster to which it is closest
  – Move each cluster center to the mean of the points it captured
• Determining the variance σ^2
  – Global: set σ to the mean distance between each unit j and its closest neighbor
  – P-nearest neighbor: set each σ_j so that there is a certain overlap with the P closest neighbors of unit j
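A minimal sketch of the k-means loop described above (assumes numpy; the initialization and iteration count are arbitrary choices):

```python
import numpy as np

def kmeans_centers(X, k, iters=20, seed=0):
    """Pick k RBF centers from data X (N x d) by Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # mark each point as captured by its closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # move each center to the mean of the points it captured
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers
```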
3. Sigmoidal Basis Functions
• Sigmoid:
  \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right), where \sigma(a) = \frac{1}{1 + \exp(-a)}
• Equivalently tanh, since it is related to the logistic sigmoid by tanh(a) = 2σ(2a) - 1
[Figure: logistic sigmoid basis functions for different μ_j]
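A small illustrative snippet (not from the slides) implementing the sigmoidal basis and checking the tanh identity numerically:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_basis(x, mu, s):
    """phi_j(x) = sigmoid((x - mu_j) / s); mu and s are chosen by hand here."""
    return sigmoid((x - mu) / s)

phi = sigmoid_basis(0.35, mu=np.linspace(0, 1, 9), s=0.1)
assert np.isclose(np.tanh(0.3), 2 * sigmoid(2 * 0.3) - 1)   # tanh(a) = 2*sigma(2a) - 1
```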
4. Other Basis Functions
• Fourier
  – Expansion in sinusoidal functions
  – Infinite spatial extent
• Wavelets (from signal processing)
  – Functions localized in both time and frequency
  – Useful for lattices such as images and time series
• The further discussion is independent of the choice of basis, including φ(x) = x
Relationship between Maximum Likelihood and Least Squares
• We will show that minimizing the sum-of-squared errors is the same as finding the maximum likelihood solution under a Gaussian noise model
• The target variable is a scalar t given by a deterministic function y(x, w) with additive Gaussian noise ε:
  t = y(x, w) + \varepsilon
  – where ε is zero-mean Gaussian with precision β
• Thus the distribution of t is univariate normal:
  p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})
  – with mean y(x, w) and variance β^{-1}
Likelihood Function
• Data set: inputs X = {x_1,..,x_N} with targets t = {t_1,..,t_N}
  – The target variables t_n are scalars forming a vector of size N
• Likelihood of the target data: the probability of observing the data, assuming the points are independent
  – Since p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1}) and y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x):
  p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
Log-Likelihood Function
• Likelihood:
  p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
• Log-likelihood, using the standard univariate Gaussian
  \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{1}{2\sigma^2}(x - \mu)^2 \right):
  \ln p(t \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(w)
• where E_D(w) is the sum-of-squares error function:
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
• With Gaussian basis functions,
  \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^T (x - \mu_j)}{2 s^2} \right)
Maximizing the Log-Likelihood Function
• Log-likelihood:
  \ln p(t \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(w)
  – where E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
• Therefore, maximizing the likelihood with respect to w is equivalent to minimizing E_D(w)
Determining the Maximum Likelihood Solution
• The log-likelihood has the form
  \ln p(t \mid w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(w)
• where
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
• We will show that the maximum likelihood solution has a closed form
  – Take the derivative of \ln p(t \mid w, \beta) with respect to w, set it equal to zero, and solve for w
  – or, equivalently, use just the derivative of E_D(w)
Gradient of the Log-Likelihood with respect to w
• The gradient is
  \nabla \ln p(t \mid w, \beta) = \beta \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \} \phi(x_n)^T
  – obtained from the log-likelihood expression using the calculus result \nabla_w \left[ -\frac{1}{2}(a - wb)^2 \right] = (a - wb)\, b
• The gradient is set to zero and we solve for w:
  0 = \sum_{n=1}^{N} t_n \phi(x_n)^T - w^T \left( \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T \right)
  – as shown on the next slide
• The second derivative is negative, making this a maximum
Maximum Likelihood Solution for w
• Solving for w we obtain
  w_{ML} = \Phi^{+} t, where \Phi^{+} = (\Phi^T \Phi)^{-1} \Phi^T
  – known as the normal equations for the least squares problem
• \Phi^{+} is the Moore-Penrose pseudo-inverse of the N × M design matrix Φ, whose elements are given by \Phi_{nj} = \phi_j(x_n):
  \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}
• Pseudo-inverse: a generalization of the notion of matrix inverse to non-square matrices
  – If the design matrix is square and invertible, the pseudo-inverse is the same as the inverse
• Design matrix: rows correspond to the N samples, columns to the M basis functions
  – X = {x_1,..,x_N} are the samples (vectors of d variables)
  – t = {t_1,..,t_N} are the targets (scalars)
  – The \phi_j(x_n) are M basis functions, e.g., Gaussians \phi_j(x) = \exp\left( -\frac{1}{2} (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) \right) centered on M data points
Design Matrix Φ
• Φ is an N × M matrix: its rows correspond to the N data points and its columns to the M basis functions φ_0, φ_1,.., φ_{M-1}, each represented as an N-dimensional vector
  \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}
• Thus \Phi^T is an M × N matrix, so \Phi^T \Phi is M × M, and so is (\Phi^T \Phi)^{-1}
• Hence (\Phi^T \Phi)^{-1} \Phi^T is M × N; since t is N × 1,
  w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t is M × 1
  – which consists of the M weights (including the bias)
• Note that φ_0 corresponds to the bias, and is set to 1
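A hedged numpy sketch of the closed-form solution (the Gaussian design matrix and its parameters are illustrative choices; x and t are the arrays from the earlier data-generation sketch):

```python
import numpy as np

def design_matrix(x, mu, sigma):
    """N x M matrix with Phi[n, j] = phi_j(x_n); column 0 is the bias phi_0 = 1."""
    rbf = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2.0 * sigma ** 2))
    return np.column_stack([np.ones(len(x)), rbf])

Phi = design_matrix(x, mu=np.linspace(0, 1, 9), sigma=0.1)  # N x M with M = 10
w_ml = np.linalg.pinv(Phi) @ t    # w_ML = (Phi^T Phi)^{-1} Phi^T t, an M-vector
```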
What is the Role of the Bias Parameter w_0?
• The sum-of-squares error function is
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
• Substituting y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x), we get
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(x_n) \right\}^2
• Setting the derivative with respect to w_0 equal to zero and solving for w_0:
  w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j
  – where \bar{t} = \frac{1}{N} \sum_{n=1}^{N} t_n and \bar{\phi}_j = \frac{1}{N} \sum_{n=1}^{N} \phi_j(x_n)
• The first term is the average of the N values of t; the second term is a weighted sum of the average basis function values over the N samples
• Thus the bias w_0 compensates for the difference between the average target value and the weighted sum of the averages of the basis function values
Maximum Likelihood for the Precision β
• We have determined the maximum likelihood solution for w using a probabilistic formulation
  – p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1}), with solution w_{ML} = \Phi^{+} t
• With log-likelihood
  \ln p(t \mid w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(w)
• Taking the gradient with respect to β and setting it to zero gives
  \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ t_n - w_{ML}^T \phi(x_n) \}^2
  – Thus the inverse of the noise precision is the residual variance of the target values around the regression function
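Continuing the numpy sketch above, the noise level can be estimated from the residuals (illustrative; Phi, w_ml, and t are as in the earlier sketches):

```python
import numpy as np

inv_beta_ml = np.mean((t - Phi @ w_ml) ** 2)   # 1/beta_ML: residual variance
noise_sd = np.sqrt(inv_beta_ml)                # should be close to the true 0.03
```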
Geometry of Least Squares
• A geometrical interpretation of the least squares solution is instructive
• Consider an N-dimensional space with axes t_n, so that t = (t_1,…,t_N)^T is a vector in this space
• Each basis function φ_j, evaluated at the N points, can also be represented as a vector in the same space
  – φ_j corresponds to the jth column of Φ, whereas φ(x_n) corresponds to the nth row of Φ
• If the number of basis functions is smaller than the number of data points, i.e., M < N, then the M vectors φ_j span a linear subspace S of dimension M
• Define y to be the N-dimensional vector whose nth element is y(x_n, w)
• The sum-of-squares error is then the squared Euclidean distance between y and t
• The solution w corresponds to the y that lies in the subspace S closest to t
  – This corresponds to the orthogonal projection of t onto S
Difficulty of the Direct Solution
• Direct solution of the normal equations
  w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t
  can lead to numerical difficulties
  – when \Phi^T \Phi is close to singular (determinant ≈ 0)
  – when two basis functions are collinear, parameters can have large magnitudes
• This is not uncommon with real data sets
• It can be addressed using
  – singular value decomposition
  – addition of a regularization term, which ensures the matrix is non-singular
Method of Gradient Descent
• A criterion f(x) is minimized by moving from the current solution in the direction of the negative of the gradient f'(x)
• Steepest descent proposes a new point
  x' = x - \eta f'(x)
  – where η is the learning rate, a positive scalar, set to a small constant
Gradient with Multiple Inputs
• For multiple inputs we need partial derivatives: \frac{\partial}{\partial x_i} f(x) is how f changes as only x_i increases
• The gradient of f, \nabla_x f(x), is the vector of partial derivatives
• Gradient descent proposes a new point
  x' = x - \eta \nabla_x f(x)
  – where η is the learning rate, a positive scalar, set to a small constant
[Figure: the direction in the w_0-w_1 plane producing steepest descent]
Stochastic Gradient Descent
• The error function is a sum over the data points:
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 = \sum_n E_n
• SGD updates w using
  w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n
  – where τ is the iteration number, η is a learning rate parameter, and we update after presenting pattern n
• Substituting the derivative \nabla E_n = -\{ t_n - w^{(\tau)T} \phi_n \} \phi_n, where \phi_n = \phi(x_n):
  w^{(\tau+1)} = w^{(\tau)} + \eta ( t_n - w^{(\tau)T} \phi_n ) \phi_n
• w is initialized to some starting vector w^{(0)}
• η must be chosen with care to ensure convergence
• Known as the Least Mean Squares (LMS) algorithm
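A minimal sketch of the LMS update loop (assumes numpy; the number of epochs, η, and the random presentation order are illustrative choices):

```python
import numpy as np

def lms(Phi, t, eta=0.1, epochs=50, seed=0):
    """Sequential least-mean-squares: w += eta * (t_n - w^T phi_n) * phi_n."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)                           # starting vector w^(0)
    for _ in range(epochs):
        for n in rng.permutation(N):          # present one pattern at a time
            phi_n = Phi[n]
            w += eta * (t[n] - w @ phi_n) * phi_n
    return w
```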
Choosing the Learning Rate
• It is useful to reduce η as training progresses
• A constant learning rate is the default in Keras; momentum and decay are set to 0 by default:
  keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
• Time-based decay, with decay_rate = learning_rate / epochs:
  SGD(lr=0.1, momentum=0.8, decay=decay_rate, nesterov=False)
Sequential (On-line) Learning
• Disadvantage of the ML solution w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t:
  – It is a batch technique: it processes the entire training set in one go
  – It is computationally expensive for large data sets, due to the huge N × M design matrix Φ
• The solution is to use a sequential algorithm, in which samples are presented one at a time (or a minibatch at a time)
  – This is called stochastic gradient descent
Computational Bottleneck
• A recurring problem in machine learning:
  – Large training sets are necessary for good generalization
  – But large training sets are also computationally expensive
• SGD is an extension of gradient descent that offers a solution
  – Moreover, it is a method of generalization beyond the training set
Insight of SGD
• The gradient is an expectation
  – The expectation may be approximated using a small set of samples
• In each step of SGD we sample a minibatch of examples B = {x^{(1)},..,x^{(m')}} drawn uniformly from the training set
  – The minibatch size m' is typically chosen to be small: 1 to a hundred
  – Crucially, m' is held fixed even if the sample set is in the billions
  – We may fit a training set with billions of examples using updates computed on only a hundred examples
• For linear regression the gradient estimate has the form
  \nabla \ln p(y \mid X, \theta, \beta) = \beta \sum_{i=1}^{m'} \{ y^{(i)} - \theta^T x^{(i)} \} x^{(i)T}
Regularized Least Squares
• As model complexity increases (e.g., the degree of the polynomial or the number of basis functions), it becomes likely that we overfit
• One way to control overfitting is not to limit complexity, but to add a regularization term to the error function
• The error function to minimize then takes the form
  E(w) = E_D(w) + \lambda E_W(w)
  – where λ is the regularization coefficient, which controls the relative importance of the data-dependent error E_D(w) and the regularization term E_W(w)
Simplest Regularizer: Weight Decay
• Regularized least squares:
  E(w) = E_D(w) + \lambda E_W(w)
• The simplest form of regularization term is
  E_W(w) = \frac{1}{2} w^T w
• Thus the total error function becomes
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w
• This regularizer is called weight decay
  – because in sequential learning, weight values decay towards zero unless supported by the data
• Also, the error function remains a quadratic function of w, so the exact minimizer can be found in closed form
Closed-Form Solution with Regularizer
• The error function with a quadratic regularizer is
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w
• Its exact minimizer can be found in closed form, by setting the gradient with respect to w to zero and solving for w:
  w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t
• This is a simple extension of the least-squares solution w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t
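A hedged sketch of the regularized normal equations with numpy (the value of λ is illustrative; λ = 0 recovers the unregularized solution):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """w = (lambda I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

w_reg = ridge_fit(Phi, t, lam=1e-3)   # Phi, t from the earlier sketches
```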
Geometric Interpretation of the Regularizer
• In the unregularized case, we try to find the w that minimizes
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
• In the regularized case, we minimize
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w
  which amounts to choosing the value of w subject to the constraint
  \sum_{j=1}^{M} |w_j|^2 \le \eta
  – We don't want the weights to become too large
  – The two approaches are related by Lagrange multipliers
[Figure: contours of the unregularized error function in the w_1-w_2 plane; any value of w on a contour gives the same error E(w)]
Minimization of the Unregularized Error Subject to the Constraint
[Figure: blue contours of the unregularized error function E(w), the constraint region, and the optimum value w*, for minimization with the quadratic regularizer (q = 2)]
A More General Regularizer
• Regularized error:
  \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q
• q = 2 corresponds to the quadratic regularizer; q = 1 is known as the lasso
• The lasso has the property that if λ is sufficiently large, some of the coefficients w_j are driven to zero, leading to a sparse model in which the corresponding basis functions play no role
Contours of the Regularization Term
• Contours of the regularization term \sum_j |w_j|^q in the space of (w_1, w_2), for several values of q
  – Any choice of w along a contour has the same value of the regularization term
  – Lasso (q = 1): |w_1| + |w_2| = const
  – Quadratic (q = 2): w_1^2 + w_2^2 = const
  – q = 4: w_1^4 + w_2^4 = const
Sparsity with the Lasso Constraint
• With q = 1 and λ sufficiently large, some of the coefficients w_j are driven to zero
• This leads to a sparse model
  – where the corresponding basis functions play no role
• The origin of sparsity is illustrated by comparing the two constraint regions:
  – Minimization with the lasso regularizer gives a sparse solution with w_1* = 0
  – The quadratic regularizer gives a solution where both components of w* are nonzero
[Figure: contours of the unregularized error function and the constraint regions for the two regularizers]
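An illustrative comparison (assuming scikit-learn, which implements both penalties; the synthetic data and the alpha value are arbitrary): with only two relevant inputs, the lasso zeros out most coefficients, while the quadratic (ridge) penalty merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.05 * rng.normal(size=100)  # 2 relevant inputs

lasso = Lasso(alpha=0.1).fit(X, t)
ridge = Ridge(alpha=0.1).fit(X, t)
print((lasso.coef_ == 0.0).sum())   # several exact zeros: a sparse model
print((ridge.coef_ == 0.0).sum())   # typically no exact zeros
```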
Regularization: Conclusion
• Regularization allows complex models to be trained on small data sets without severe over-fitting
• It limits model complexity
  – i.e., how many basis functions to use
• The problem of limiting complexity is shifted to one of determining a suitable value of the regularization coefficient λ
Linear Regression Summary
• Linear regression with M basis functions:
  y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
  – e.g., Gaussian basis functions \phi_j(x) = \exp\left( -\frac{1}{2} (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) \right)
  – with design matrix \Phi_{nj} = \phi_j(x_n)
• Objective function, without and with regularization:
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w
• Closed-form ML solution:
  w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t   (without regularization)
  w_{ML} = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t   (with regularization)
• Gradient descent: w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n, with
  \nabla E(w) = -\sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \} \phi(x_n)^T   (without regularization)
  \nabla E(w) = -\sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \} \phi(x_n)^T + \lambda w   (with regularization)
Returning to the LeToR Problem
• Try:
  – several basis functions
  – quadratic regularization
• Express results as
  E_{RMS} = \sqrt{ 2 E(w^*) / N }
  – rather than as the squared error E(w^*) or as an error rate with thresholded results
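A small helper for the RMS error (illustrative; E(w*) is the sum-of-squares error at the fitted weights):

```python
import numpy as np

def e_rms(Phi, t, w):
    """E_RMS = sqrt(2 E(w*) / N); comparable across data sets of different size."""
    E = 0.5 * np.sum((t - Phi @ w) ** 2)
    return np.sqrt(2.0 * E / len(t))
```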
Multiple Outputs
• Several target variables t = (t_1,..,t_K), with K > 1
• Can be treated as multiple (K) independent regression problems
  – using different basis functions for each component of t
• The more common solution: use the same set of basis functions to model all components of the target vector:
  y(x, w) = W^T \phi(x)
  – where y is a K-dimensional column vector, W is an M × K matrix of weights, and φ(x) is an M-dimensional column vector with elements φ_j(x)
Solution for Multiple Outputs
• The set of observations t_1,..,t_N is combined into a matrix T of size N × K, such that the nth row is given by t_n^T
• Combine the input vectors x_1,..,x_N into a matrix X
• The log-likelihood function is maximized as before
• The solution has a similar form: W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T
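A minimal sketch (stand-in data; names are illustrative): the same pseudo-inverse solves all K regressions at once.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 100, 10, 3
Phi = rng.normal(size=(N, M))     # stand-in design matrix
T = rng.normal(size=(N, K))       # stand-in target matrix; row n is t_n^T
W_ml = np.linalg.pinv(Phi) @ T    # W_ML = (Phi^T Phi)^{-1} Phi^T T, an M x K matrix
```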