Linear Models for Regression - University at Buffalo
TRANSCRIPT
Topics in Linear Regression
• What is regression?
• Polynomial Curve Fitting with Scalar Input
• Linear Basis Function Models
• Maximum Likelihood and Least Squares
• Stochastic Gradient Descent
• Regularized Least Squares
The Regression Task
• It is a supervised learning task
• Goal of regression:
  – Predict the value of one or more target variables t
  – Given a d-dimensional vector x of input variables
  – Using a dataset of known inputs and outputs (x1,t1),..,(xN,tN)
    • where xi is an input (possibly a vector) known as the predictor
    • and ti is the target output (or response) for case i, which is real-valued
  – The goal is to predict t from x for some future test case
    • We are not trying to model the distribution of x
  – We don't expect the predictor to be a linear function of x
    • So ordinary linear regression on the inputs will not work
    • We need to allow for a nonlinear function of x
    • We don't have a theory of what form this function should take
An Example Problem
• Fifty points generated (one-dimensional problem)
  – With x uniform from (0,1)
  – y generated from the formula y = sin(1 + x^2) + noise
    • where the noise has a N(0, 0.03^2) distribution
  – The noise-free true function and the data points are as shown
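A minimal numpy sketch of how such a data set could be generated (not from the slides; the seed and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, for reproducibility
N = 50
x = rng.uniform(0.0, 1.0, size=N)       # inputs drawn uniformly from (0,1)
t = np.sin(1.0 + x**2) + rng.normal(0.0, 0.03, size=N)   # targets with N(0, 0.03^2) noise
```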
Applications of Regression
1. Expected claim amount an insured person will make (used to set insurance premiums) or prediction of future prices of securities
2. Also used for algorithmic trading
ML Terminology
• Regression
  – Predict a numerical value t given some input
  – The learning algorithm has to output a function f : R^n → R
    • where n = number of input variables
• Classification
  – If the t value is a label (categories): f : R^n → {1,..,k}
• Ordinal Regression
  – Discrete values, ordered categories
Polynomial Curve Fitting with a Scalar Input
• With a single input variable x:
  y(x, w) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M = \sum_{j=0}^{M} w_j x^j
  – M is the order of the polynomial; x^j denotes x raised to the power j
  – The coefficients w_0,…,w_M are collectively denoted by the vector w
• Task: learn w from training data D = {(x_i, t_i)}, i = 1,..,N
  – e.g., a training set with N = 10, input x, target t
  – This can be done by minimizing an error function that measures the misfit between y(x, w), for any given w, and the training data
• One simple choice of error function is the sum of squares of the errors between the predictions y(x_n, w) for each data point x_n and the corresponding target values t_n, so that we minimize
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
  – It is zero when the function y(x, w) passes exactly through each training data point
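A hedged sketch of fitting such a polynomial by minimizing E(w) with numpy's least-squares solver (the design-matrix construction and M = 3 are assumptions; x and t are the arrays from the data-generation sketch above):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Fit y(x,w) = sum_j w_j x^j by minimizing the sum-of-squares error E(w)."""
    Phi = np.vander(x, M + 1, increasing=True)     # columns x^0, x^1, .., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # minimizes ||Phi w - t||^2
    E = 0.5 * np.sum((Phi @ w - t) ** 2)           # E(w) at the minimizer
    return w, E

w, E = fit_polynomial(x, t, M=3)   # x, t from the data-generation sketch
```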
Results with polynomial basis
Regression with Multiple Inputs
• Generalization
  – Predict the value of a continuous target variable t given the value of D input variables x = [x_1,..,x_D]
  – t can also be a set of variables (multiple regression)
  – Use linear functions of adjustable parameters
    • Specifically, linear combinations of nonlinear functions of the input variables
• Polynomial curve fitting is good only for a single input variable (scalar x)
  – It cannot be easily generalized to several variables, as we will see
Simplest Linear Model with D Inputs
• Regression with D input variables:
  y(x, w) = w_0 + w_1 x_1 + .. + w_D x_D = w^T x
  – where x = (x_1,..,x_D)^T are the input variables (with a dummy input x_0 = 1 absorbing the bias)
• Called linear regression since it is a linear function of
  – the parameters w_0,..,w_D
  – the input variables x_1,..,x_D
• Being a linear function of the input variables is a significant limitation
  – In the one-dimensional case this amounts to a straight-line fit (a degree-one polynomial): y(x, w) = w_0 + w_1 x
• This differs from linear regression with one variable and from polynomial regression with one variable
Fitting a Regression Plane
• Assume t is a function of the inputs x_1, x_2,..,x_D
• Goal: find the best linear regressor of t on all inputs
  – This fits a hyperplane through the N input samples
  – For D = 2: a plane fit to data with columns x_1, x_2, t
• Being a linear function of the input variables imposes limitations on the model
  – We can extend the class of models by considering fixed nonlinear functions of the input variables
Basis Functions
• In many applications, we apply some form of fixed preprocessing, or feature extraction, to the original data variables
• If the original variables comprise the vector x, then the features can be expressed in terms of basis functions {φ_j(x)}
  – By using nonlinear basis functions we allow the function y(x, w) to be a nonlinear function of the input vector x
  – Such models are linear functions of the parameters (which gives them simple analytical properties), yet nonlinear with respect to the input variables
Linear Regression with M Basis Functions
• Extend the model by considering nonlinear functions of the input variables:
  y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)
  – where the φ_j(x) are called basis functions
  – We now need M weights for the basis functions instead of D weights for the features
• With a dummy basis function φ_0(x) = 1 corresponding to the bias parameter w_0, we can write
  y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
  – where w = (w_0, w_1,.., w_{M-1})^T and φ = (φ_0, φ_1,.., φ_{M-1})^T
• Basis functions allow nonlinearity with D input variables
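A minimal sketch of evaluating y(x, w) = w^T φ(x) (illustrative names, not from the slides; phi is any callable returning the M basis values, with φ_0 = 1):

```python
import numpy as np

def predict(w, phi, x):
    """y(x,w) = w^T phi(x), for any basis function vector phi."""
    return w @ phi(x)

# Identity features phi(x) = [1, x_1, .., x_D] recover the simple linear model
phi_linear = lambda x: np.concatenate(([1.0], x))
w = np.array([0.1, 2.0, -0.5])                    # hypothetical weights (M = 3)
y = predict(w, phi_linear, np.array([1.0, 3.0]))  # = 0.1 + 2*1.0 - 0.5*3.0
```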
Choice of Basis Functions
• There are many possible choices for the basis functions:
  1. Polynomial regression
     • Good only if there is a single input variable
  2. Gaussian basis functions
  3. Sigmoidal basis functions
  4. Fourier basis functions
  5. Wavelets
1. Polynomial Basis for One Variable
• Linear basis function model:
  y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
• Polynomial basis (for a single variable x): φ_j(x) = x^j, giving a degree M-1 polynomial
• Disadvantages
  – Global: changes in one region of input space affect others
  – Difficult to formulate: the number of polynomial terms increases exponentially with M
• Can divide the input space into regions and use a different polynomial in each region
  – equivalent to spline functions
Can We Use a Polynomial Basis with D Variables? (Not Practical!)
• Consider (for a vector x) the basis
  \phi_j(x) = ||x||^j = (x_1^2 + x_2^2 + .. + x_D^2)^{j/2}
  – x = (2,1) and x = (1,2) have the same squared sum, so this basis is unsatisfactory
  – The vector is converted into a scalar value, thereby losing information
• A better polynomial approach:
  – A polynomial of degree M-1 has terms with variables taken none, one, two,.., M-1 at a time
  – Use a multi-index j = (j_1, j_2,.., j_D) such that j_1 + j_2 + .. + j_D ≤ M-1
  – For a quadratic (M = 3) with three variables (D = 3):
    y(x, w) = \sum_{(j_1, j_2, j_3)} w_j \phi_j(x)
            = w_0 + w_{1,0,0} x_1 + w_{0,1,0} x_2 + w_{0,0,1} x_3 + w_{1,1,0} x_1 x_2 + w_{1,0,1} x_1 x_3 + w_{0,1,1} x_2 x_3 + w_{2,0,0} x_1^2 + w_{0,2,0} x_2^2 + w_{0,0,2} x_3^2
  – The number of quadratic terms is 1 + D + D(D-1)/2 + D
  – For D = 46, it is 1128
• Better to use a Gaussian kernel, discussed next
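A quick check of the term-count formula (illustrative snippet, not from the slides):

```python
def num_quadratic_terms(D):
    """1 constant + D linear + D(D-1)/2 cross + D squared terms."""
    return 1 + D + D * (D - 1) // 2 + D

print(num_quadratic_terms(3))    # 10, matching the D = 3 expansion above
print(num_quadratic_terms(46))   # 1128
```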
Disadvantage of Polynomials
• Polynomials are global basis functions
  – Each affects the prediction over the whole input space
• Often, local basis functions are more appropriate
2. Gaussian Radial Basis Functions
• Gaussian basis function:
  \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2\sigma^2} \right)
  – It does not necessarily have a probabilistic interpretation
  – The usual normalization term is unimportant, since each basis function is multiplied by a weight w_j
• Choice of parameters
  – The μ_j govern the locations of the basis functions
    • They can be an arbitrary set of points within the range of the data
    • One can choose some representative data points
  – σ governs the spatial scale
    • It could be chosen from the data set, e.g., from the average variance
• Several variables
  – A Gaussian kernel would be chosen for each dimension:
    \phi_j(x) = \exp\left( -\frac{1}{2} (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) \right)
  – For each j a different set of means would be needed, perhaps chosen from the data
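A hedged numpy sketch of a Gaussian RBF feature map for scalar inputs (the grid of centers and the scale are assumptions):

```python
import numpy as np

def gaussian_basis(x, mu, sigma):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 sigma^2)), vectorized over centers mu."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))

mu = np.linspace(0.0, 1.0, 9)             # centers mu_j on a grid in the data range
sigma = 0.1                               # spatial scale, set by hand here
phi = gaussian_basis(0.35, mu, sigma)     # M-vector of basis activations for x = 0.35
```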
Result with Gaussian Basis Functions
[Figure: fits using Gaussian basis functions for s = 0.1, with the μ_j on a grid with spacing s. Weights w_j for the middle model: 6856.5, -3544.1, -2473.7, -2859.8, -2637.7, -2861.5, -2468.0, -3558.4]
Biological Inspiration for RBFs
• The nervous system contains many examples
  – Local receptive fields in the visual cortex
• RBF Network
  – Receptive fields overlap, so there is usually more than one unit active
  – But for a given input, the total number of active units is small
Tiling the Input Space
• Determining the centers: k-means clustering
  – Choose k cluster centers
  – Mark each training point as captured by the cluster to which it is closest
  – Move each cluster center to the mean of the points it captured
• Determining the variance σ^2
  – Global: set σ to the mean distance between each unit j and its closest neighbor
  – P-nearest neighbor: set each σ_j so that there is a certain overlap with the P closest neighbors of unit j
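A minimal sketch of the k-means loop described above (assumes numpy; the initialization and iteration count are arbitrary choices):

```python
import numpy as np

def kmeans_centers(X, k, iters=20, seed=0):
    """Pick k RBF centers from data X (N x d) by Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # mark each point as captured by its closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # move each center to the mean of the points it captured
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers
```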
3. Sigmoidal Basis Functions
• Sigmoid:
  \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right), where \sigma(a) = \frac{1}{1 + \exp(-a)}
• Equivalently tanh, since it is related to the logistic sigmoid by tanh(a) = 2σ(2a) - 1
[Figure: logistic sigmoid basis functions for different μ_j]
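A small illustrative snippet (not from the slides) implementing the sigmoidal basis and checking the tanh identity numerically:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_basis(x, mu, s):
    """phi_j(x) = sigmoid((x - mu_j) / s); mu and s are chosen by hand here."""
    return sigmoid((x - mu) / s)

phi = sigmoid_basis(0.35, mu=np.linspace(0, 1, 9), s=0.1)
assert np.isclose(np.tanh(0.3), 2 * sigmoid(2 * 0.3) - 1)   # tanh(a) = 2*sigma(2a) - 1
```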
4. Other Basis Functions
• Fourier
  – Expansion in sinusoidal functions
  – Infinite spatial extent
• Wavelets (from signal processing)
  – Functions localized in both time and frequency
  – Useful for lattices such as images and time series
• The further discussion is independent of the choice of basis, including φ(x) = x
Relationship between Maximum Likelihood and Least Squares
• We will show that minimizing the sum-of-squared errors is the same as finding the maximum likelihood solution under a Gaussian noise model
• The target variable is a scalar t given by a deterministic function y(x, w) with additive Gaussian noise ε:
  t = y(x, w) + \varepsilon
  – where ε is zero-mean Gaussian with precision β
• Thus the distribution of t is univariate normal:
  p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})
  – with mean y(x, w) and variance β^{-1}
Likelihood Function
• Data set: inputs X = {x_1,..,x_N} with targets t = {t_1,..,t_N}
  – The target variables t_n are scalars forming a vector of size N
• Likelihood of the target data: the probability of observing the data, assuming the points are independent
  – Since p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1}) and y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x):
  p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
Log-Likelihood Function
• Likelihood:
  p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
• Log-likelihood, using the standard univariate Gaussian
  \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{1}{2\sigma^2}(x - \mu)^2 \right):
  \ln p(t \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(w)
• where E_D(w) is the sum-of-squares error function:
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
• With Gaussian basis functions,
  \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^T (x - \mu_j)}{2 s^2} \right)
Maximizing the Log-Likelihood Function
• Log-likelihood:
  \ln p(t \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(w)
  – where E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
• Therefore, maximizing the likelihood with respect to w is equivalent to minimizing E_D(w)
Determining the Maximum Likelihood Solution
• The log-likelihood has the form
  \ln p(t \mid w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(w)
• where
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
• We will show that the maximum likelihood solution has a closed form
  – Take the derivative of \ln p(t \mid w, \beta) with respect to w, set it equal to zero, and solve for w
  – or, equivalently, use just the derivative of E_D(w)
Gradient of the Log-Likelihood with respect to w
• The gradient is
  \nabla \ln p(t \mid w, \beta) = \beta \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \} \phi(x_n)^T
  – obtained from the log-likelihood expression using the calculus result \nabla_w \left[ -\frac{1}{2}(a - wb)^2 \right] = (a - wb)\, b
• The gradient is set to zero and we solve for w:
  0 = \sum_{n=1}^{N} t_n \phi(x_n)^T - w^T \left( \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T \right)
  – as shown on the next slide
• The second derivative is negative, making this a maximum
Maximum Likelihood Solution for w
• Solving for w we obtain
  w_{ML} = \Phi^{+} t, where \Phi^{+} = (\Phi^T \Phi)^{-1} \Phi^T
  – known as the normal equations for the least squares problem
• \Phi^{+} is the Moore-Penrose pseudo-inverse of the N × M design matrix Φ, whose elements are given by \Phi_{nj} = \phi_j(x_n):
  \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}
• Pseudo-inverse: a generalization of the notion of matrix inverse to non-square matrices
  – If the design matrix is square and invertible, the pseudo-inverse is the same as the inverse
• Design matrix: rows correspond to the N samples, columns to the M basis functions
  – X = {x_1,..,x_N} are the samples (vectors of d variables)
  – t = {t_1,..,t_N} are the targets (scalars)
  – The \phi_j(x_n) are M basis functions, e.g., Gaussians \phi_j(x) = \exp\left( -\frac{1}{2} (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) \right) centered on M data points
Design Matrix Φ
• Φ is an N × M matrix: its rows correspond to the N data points and its columns to the M basis functions φ_0, φ_1,.., φ_{M-1}, each represented as an N-dimensional vector
  \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}
• Thus \Phi^T is an M × N matrix, so \Phi^T \Phi is M × M, and so is (\Phi^T \Phi)^{-1}
• Hence (\Phi^T \Phi)^{-1} \Phi^T is M × N; since t is N × 1,
  w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t is M × 1
  – which consists of the M weights (including the bias)
• Note that φ_0 corresponds to the bias, and is set to 1
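A hedged numpy sketch of the closed-form solution (the Gaussian design matrix and its parameters are illustrative choices; x and t are the arrays from the earlier data-generation sketch):

```python
import numpy as np

def design_matrix(x, mu, sigma):
    """N x M matrix with Phi[n, j] = phi_j(x_n); column 0 is the bias phi_0 = 1."""
    rbf = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2.0 * sigma ** 2))
    return np.column_stack([np.ones(len(x)), rbf])

Phi = design_matrix(x, mu=np.linspace(0, 1, 9), sigma=0.1)  # N x M with M = 10
w_ml = np.linalg.pinv(Phi) @ t    # w_ML = (Phi^T Phi)^{-1} Phi^T t, an M-vector
```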
What is the Role of the Bias Parameter w_0?
• The sum-of-squares error function is
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
• Substituting y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x), we get
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(x_n) \right\}^2
• Setting the derivative with respect to w_0 equal to zero and solving for w_0:
  w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j
  – where \bar{t} = \frac{1}{N} \sum_{n=1}^{N} t_n and \bar{\phi}_j = \frac{1}{N} \sum_{n=1}^{N} \phi_j(x_n)
• The first term is the average of the N values of t; the second term is a weighted sum of the average basis function values over the N samples
• Thus the bias w_0 compensates for the difference between the average target value and the weighted sum of the averages of the basis function values
Maximum Likelihood for the Precision β
• We have determined the maximum likelihood solution for w using a probabilistic formulation
  – p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1}), with solution w_{ML} = \Phi^{+} t
• With log-likelihood
  \ln p(t \mid w, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi - \beta E_D(w)
• Taking the gradient with respect to β and setting it to zero gives
  \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ t_n - w_{ML}^T \phi(x_n) \}^2
  – Thus the inverse of the noise precision is the residual variance of the target values around the regression function
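Continuing the numpy sketch above, the noise level can be estimated from the residuals (illustrative; Phi, w_ml, and t are as in the earlier sketches):

```python
import numpy as np

inv_beta_ml = np.mean((t - Phi @ w_ml) ** 2)   # 1/beta_ML: residual variance
noise_sd = np.sqrt(inv_beta_ml)                # should be close to the true 0.03
```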
Geometry of Least Squares
• A geometrical interpretation of the least squares solution is instructive
• Consider an N-dimensional space with axes t_n, so that t = (t_1,…,t_N)^T is a vector in this space
• Each basis function φ_j, evaluated at the N points, can also be represented as a vector in the same space
  – φ_j corresponds to the jth column of Φ, whereas φ(x_n) corresponds to the nth row of Φ
• If the number of basis functions is smaller than the number of data points, i.e., M < N, then the M vectors φ_j span a linear subspace S of dimension M
• Define y to be the N-dimensional vector whose nth element is y(x_n, w)
• The sum-of-squares error is then the squared Euclidean distance between y and t
• The solution w corresponds to the y that lies in the subspace S closest to t
  – This corresponds to the orthogonal projection of t onto S
Difficulty of the Direct Solution
• Direct solution of the normal equations
  w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t
  can lead to numerical difficulties
  – when \Phi^T \Phi is close to singular (determinant ≈ 0)
  – when two basis functions are collinear, parameters can have large magnitudes
• This is not uncommon with real data sets
• It can be addressed using
  – singular value decomposition
  – addition of a regularization term, which ensures the matrix is non-singular
Method of Gradient Descent
• A criterion f(x) is minimized by moving from the current solution in the direction of the negative of the gradient f'(x)
• Steepest descent proposes a new point
  x' = x - \eta f'(x)
  – where η is the learning rate, a positive scalar, set to a small constant
Gradient with Multiple Inputs
• For multiple inputs we need partial derivatives: \frac{\partial}{\partial x_i} f(x) is how f changes as only x_i increases
• The gradient of f, \nabla_x f(x), is the vector of partial derivatives
• Gradient descent proposes a new point
  x' = x - \eta \nabla_x f(x)
  – where η is the learning rate, a positive scalar, set to a small constant
[Figure: the direction in the w_0-w_1 plane producing steepest descent]
Stochastic Gradient Descent
• The error function is a sum over the data points:
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 = \sum_n E_n
• SGD updates w using
  w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n
  – where τ is the iteration number, η is a learning rate parameter, and we update after presenting pattern n
• Substituting the derivative \nabla E_n = -\{ t_n - w^{(\tau)T} \phi_n \} \phi_n, where \phi_n = \phi(x_n):
  w^{(\tau+1)} = w^{(\tau)} + \eta ( t_n - w^{(\tau)T} \phi_n ) \phi_n
• w is initialized to some starting vector w^{(0)}
• η must be chosen with care to ensure convergence
• Known as the Least Mean Squares (LMS) algorithm
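A minimal sketch of the LMS update loop (assumes numpy; the number of epochs, η, and the random presentation order are illustrative choices):

```python
import numpy as np

def lms(Phi, t, eta=0.1, epochs=50, seed=0):
    """Sequential least-mean-squares: w += eta * (t_n - w^T phi_n) * phi_n."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)                           # starting vector w^(0)
    for _ in range(epochs):
        for n in rng.permutation(N):          # present one pattern at a time
            phi_n = Phi[n]
            w += eta * (t[n] - w @ phi_n) * phi_n
    return w
```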
Choosing the Learning Rate
• It is useful to reduce η as training progresses
• A constant learning rate is the default in Keras; momentum and decay are set to 0 by default:
  keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
• Time-based decay, with decay_rate = learning_rate / epochs:
  SGD(lr=0.1, momentum=0.8, decay=decay_rate, nesterov=False)
Sequential (On-line) Learning
• Disadvantage of the ML solution w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t:
  – It is a batch technique: it processes the entire training set in one go
  – It is computationally expensive for large data sets, due to the huge N × M design matrix Φ
• The solution is to use a sequential algorithm, in which samples are presented one at a time (or a minibatch at a time)
  – This is called stochastic gradient descent
Computational Bottleneck
• A recurring problem in machine learning:
  – Large training sets are necessary for good generalization
  – But large training sets are also computationally expensive
• SGD is an extension of gradient descent that offers a solution
  – Moreover, it is a method of generalization beyond the training set
Insight of SGD
• The gradient is an expectation
  – The expectation may be approximated using a small set of samples
• In each step of SGD we sample a minibatch of examples B = {x^{(1)},..,x^{(m')}} drawn uniformly from the training set
  – The minibatch size m' is typically chosen to be small: 1 to a hundred
  – Crucially, m' is held fixed even if the sample set is in the billions
  – We may fit a training set with billions of examples using updates computed on only a hundred examples
• For linear regression the gradient estimate has the form
  \nabla \ln p(y \mid X, \theta, \beta) = \beta \sum_{i=1}^{m'} \{ y^{(i)} - \theta^T x^{(i)} \} x^{(i)T}
Regularized Least Squares
• As model complexity increases (e.g., the degree of the polynomial or the number of basis functions), it becomes likely that we overfit
• One way to control overfitting is not to limit complexity, but to add a regularization term to the error function
• The error function to minimize then takes the form
  E(w) = E_D(w) + \lambda E_W(w)
  – where λ is the regularization coefficient, which controls the relative importance of the data-dependent error E_D(w) and the regularization term E_W(w)
Simplest Regularizer: Weight Decay
• Regularized least squares:
  E(w) = E_D(w) + \lambda E_W(w)
• The simplest form of regularization term is
  E_W(w) = \frac{1}{2} w^T w
• Thus the total error function becomes
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w
• This regularizer is called weight decay
  – because in sequential learning, weight values decay towards zero unless supported by the data
• Also, the error function remains a quadratic function of w, so the exact minimizer can be found in closed form
Closed-Form Solution with Regularizer
• The error function with a quadratic regularizer is
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w
• Its exact minimizer can be found in closed form, by setting the gradient with respect to w to zero and solving for w:
  w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t
• This is a simple extension of the least-squares solution w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t
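A hedged sketch of the regularized normal equations with numpy (the value of λ is illustrative; λ = 0 recovers the unregularized solution):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """w = (lambda I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

w_reg = ridge_fit(Phi, t, lam=1e-3)   # Phi, t from the earlier sketches
```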
Geometric Interpretation of the Regularizer
• In the unregularized case, we try to find the w that minimizes
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
• In the regularized case, we minimize
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w
  which amounts to choosing the value of w subject to the constraint
  \sum_{j=1}^{M} |w_j|^2 \le \eta
  – We don't want the weights to become too large
  – The two approaches are related by Lagrange multipliers
[Figure: contours of the unregularized error function in the w_1-w_2 plane; any value of w on a contour gives the same error E(w)]
Minimization of the Unregularized Error Subject to the Constraint
[Figure: blue contours of the unregularized error function E(w), the constraint region, and the optimum value w*, for minimization with the quadratic regularizer (q = 2)]
A More General Regularizer
• Regularized error:
  \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q
• q = 2 corresponds to the quadratic regularizer; q = 1 is known as the lasso
• The lasso has the property that if λ is sufficiently large, some of the coefficients w_j are driven to zero, leading to a sparse model in which the corresponding basis functions play no role
Contours of the Regularization Term
• Contours of the regularization term \sum_j |w_j|^q in the space of (w_1, w_2), for several values of q
  – Any choice of w along a contour has the same value of the regularization term
  – Lasso (q = 1): |w_1| + |w_2| = const
  – Quadratic (q = 2): w_1^2 + w_2^2 = const
  – q = 4: w_1^4 + w_2^4 = const
Sparsity with the Lasso Constraint
• With q = 1 and λ sufficiently large, some of the coefficients w_j are driven to zero
• This leads to a sparse model
  – where the corresponding basis functions play no role
• The origin of sparsity is illustrated by comparing the two constraint regions:
  – Minimization with the lasso regularizer gives a sparse solution with w_1* = 0
  – The quadratic regularizer gives a solution where both components of w* are nonzero
[Figure: contours of the unregularized error function and the constraint regions for the two regularizers]
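An illustrative comparison (assuming scikit-learn, which implements both penalties; the synthetic data and the alpha value are arbitrary): with only two relevant inputs, the lasso zeros out most coefficients, while the quadratic (ridge) penalty merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.05 * rng.normal(size=100)  # 2 relevant inputs

lasso = Lasso(alpha=0.1).fit(X, t)
ridge = Ridge(alpha=0.1).fit(X, t)
print((lasso.coef_ == 0.0).sum())   # several exact zeros: a sparse model
print((ridge.coef_ == 0.0).sum())   # typically no exact zeros
```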
Regularization: Conclusion
• Regularization allows complex models to be trained on small data sets without severe over-fitting
• It limits model complexity
  – i.e., how many basis functions to use
• The problem of limiting complexity is shifted to one of determining a suitable value of the regularization coefficient λ
Linear Regression Summary
• Linear regression with M basis functions:
  y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)
  – e.g., Gaussian basis functions \phi_j(x) = \exp\left( -\frac{1}{2} (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) \right)
  – with design matrix \Phi_{nj} = \phi_j(x_n)
• Objective function, without and with regularization:
  E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2
  E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w
• Closed-form ML solution:
  w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t   (without regularization)
  w_{ML} = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t   (with regularization)
• Gradient descent: w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n, with
  \nabla E(w) = -\sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \} \phi(x_n)^T   (without regularization)
  \nabla E(w) = -\sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \} \phi(x_n)^T + \lambda w   (with regularization)
Returning to the LeToR Problem
• Try:
  – several basis functions
  – quadratic regularization
• Express results as
  E_{RMS} = \sqrt{ 2 E(w^*) / N }
  – rather than as the squared error E(w^*) or as an error rate with thresholded results
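A small helper for the RMS error (illustrative; E(w*) is the sum-of-squares error at the fitted weights):

```python
import numpy as np

def e_rms(Phi, t, w):
    """E_RMS = sqrt(2 E(w*) / N); comparable across data sets of different size."""
    E = 0.5 * np.sum((t - Phi @ w) ** 2)
    return np.sqrt(2.0 * E / len(t))
```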
Multiple Outputs
• Several target variables t = (t_1,..,t_K), with K > 1
• Can be treated as multiple (K) independent regression problems
  – using different basis functions for each component of t
• The more common solution: use the same set of basis functions to model all components of the target vector:
  y(x, w) = W^T \phi(x)
  – where y is a K-dimensional column vector, W is an M × K matrix of weights, and φ(x) is an M-dimensional column vector with elements φ_j(x)
Solution for Multiple Outputs
• The set of observations t_1,..,t_N is combined into a matrix T of size N × K, such that the nth row is given by t_n^T
• Combine the input vectors x_1,..,x_N into a matrix X
• The log-likelihood function is maximized as before
• The solution has a similar form: W_{ML} = (\Phi^T \Phi)^{-1} \Phi^T T
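A minimal sketch (stand-in data; names are illustrative): the same pseudo-inverse solves all K regressions at once.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 100, 10, 3
Phi = rng.normal(size=(N, M))     # stand-in design matrix
T = rng.normal(size=(N, K))       # stand-in target matrix; row n is t_n^T
W_ml = np.linalg.pinv(Phi) @ T    # W_ML = (Phi^T Phi)^{-1} Phi^T T, an M x K matrix
```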