TRANSCRIPT
Regularized, Polynomial, Logistic Regression
Pradeep Ravikumar
Co-instructor: Ziv Bar-Joseph
Machine Learning 10-701
Regression algorithms
Learning algorithm that predicts/estimates output Y given input X:
- Linear Regression
- Regularized Linear Regression (Ridge regression, Lasso)
- Polynomial Regression
- Gaussian Process Regression
- …
Recap: Linear Regression
- Class of linear functions
Uni-variate case:
f(X) = β1 + β2 X
where β1 is the intercept and β2 the slope.
Multi-variate case:
f(X) = β1 X^(1) + β2 X^(2) + … + βp X^(p) = Xβ
where X = [X^(1), …, X^(p)] is the input row vector and β = [β1, …, βp]^T the coefficient vector.
Recap: Least Squares Estimator
β̂ = argmin_β Σ_{i=1}^n (Y_i − f(X_i))²,  where f(X_i) = X_i β
Recap: Least Squares solution satisfies Normal Equations
(A^T A) β = A^T Y
 (p×p) (p×1)  (p×1)
If A^T A is invertible,
β̂ = (A^T A)^{-1} A^T Y
When is A^T A invertible? Recall: full-rank matrices are invertible. What is the rank of A^T A?
Rank of A^T A = number of non-zero eigenvalues of A^T A ≤ min(n, p), since A is n×p.
So, with rank(A^T A) =: r, the matrix is not invertible if r < p (e.g. when n < p, i.e. the high-dimensional setting).
Regularized Least Squares
What if A^T A is not invertible?
r equations, p unknowns – an underdetermined system of linear equations with many feasible solutions.
Need to constrain the solution further, e.g. bias the solution to "small" values of β (so that small changes in input don't translate to large changes in output).
Ridge Regression (ℓ2 penalty):
β̂_MAP = argmin_β Σ_{i=1}^n (Y_i − X_i β)² + λ‖β‖₂²,  λ ≥ 0
with closed-form solution
β̂_MAP = (A^T A + λI)^{-1} A^T Y
Is (A^T A + λI) invertible? Yes, for λ > 0: adding λI raises every eigenvalue of A^T A by λ, making them all strictly positive.
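As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of the ridge closed form; the data dimensions and λ value are hypothetical:

```python
import numpy as np

def ridge_fit(A, Y, lam):
    """Closed-form ridge estimate: (A^T A + lam I)^{-1} A^T Y."""
    p = A.shape[1]
    # For lam > 0 the matrix A^T A + lam*I is always invertible, even if n < p.
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)

# Hypothetical high-dimensional setting: n = 20 samples, p = 50 features.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 50))
beta_true = np.zeros(50)
beta_true[:3] = [2.0, -1.0, 0.5]
Y = A @ beta_true + 0.1 * rng.normal(size=20)

beta_hat = ridge_fit(A, Y, lam=1.0)  # plain least squares is underdetermined here
```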
Understanding regularized Least Squares
Ridge Regression:
[Figure: in the (β1, β2) plane, the βs with constant J(β) (level sets of J(β)) and the βs with constant ℓ2 norm (level sets of pen(β)); the unregularized Least Squares solution sits at the center of the J(β) level sets.]
Regularized Least Squares
What if A^T A is not invertible? As before: r equations, p unknowns – an underdetermined system with many feasible solutions – so we need to constrain the solution further, e.g. bias it to "small" values of β.
Lasso (ℓ1 penalty):
β̂_MAP = argmin_β Σ_{i=1}^n (Y_i − X_i β)² + λ‖β‖₁,  λ ≥ 0
Many parameter values can be zero – many inputs are irrelevant to prediction in high-dimensional settings.
No closed-form solution, but we can optimize using sub-gradient descent (packages available).
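To make the "no closed form, optimize iteratively" point concrete, here is a minimal sketch using proximal gradient (ISTA) – a close relative of the sub-gradient approach the slide mentions, not the slides' own algorithm; all names and data are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrinks each coordinate toward 0.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, Y, lam, n_iters=500):
    """Minimize ||Y - A b||^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    L = 2 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the smooth part
    eta = 1.0 / L                       # safe step size
    b = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = 2 * A.T @ (A @ b - Y)    # gradient of the squared-error term
        b = soft_threshold(b - eta * grad, eta * lam)
    return b
```

Run on the high-dimensional example above, `lasso_ista(A, Y, lam=1.0)` returns a vector with most coordinates exactly zero, unlike the ridge estimate.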
Ridge Regression vs Lasso
Lasso (ℓ1 penalty) results in sparse solutions – a vector with more zero coordinates. Good for high-dimensional problems – we don't have to store all coordinates, and the solution is interpretable!
Ideally we would use an ℓ0 penalty, but the optimization then becomes non-convex.
[Figure: level sets of J(β) in the (β1, β2) plane against the βs with constant ℓ2 norm (Ridge), constant ℓ1 norm (Lasso), and constant ℓ0 norm; the corners of the ℓ1 ball sit on the axes, which is why Lasso solutions tend to be sparse.]
Lasso vs Ridge
[Figure: Lasso coefficient paths (left) and Ridge coefficient paths (right) as the regularization level varies; Lasso drives coefficients exactly to zero, while Ridge only shrinks them.]
Regularized Least Squares – connection to MLE and MAP (model-based approaches)
Least Squares and M(C)LE
Intuition: Signal plus (zero-mean) Noise model:
Y = Xβ* + ε,  ε ~ N(0, σ²)
Conditional likelihood:
p({Y_i}_{i=1}^n | β, σ², {X_i}_{i=1}^n) = Π_{i=1}^n N(Y_i; X_i β, σ²)
Conditional log likelihood:
log p({Y_i} | β, σ², {X_i}) = const − (1/2σ²) Σ_{i=1}^n (Y_i − X_i β)²
so maximizing it over β is exactly minimizing the squared error: the Least Squares Estimate is the same as the Maximum Conditional Likelihood Estimate under a Gaussian model!
Regularized Least Squares and M(C)AP
What if A^T A is not invertible?
β̂_MAP = argmax_β [ conditional log likelihood + log prior ]
       = argmax_β [ log p({Y_i}_{i=1}^n | β, σ², {X_i}_{i=1}^n) + log p(β) ]
I) Gaussian Prior: β ~ N(0, τ²I). The prior belief that β is Gaussian with zero mean biases the solution to "small" β.
⇒ Ridge Regression:
β̂_MAP = (A^T A + λI)^{-1} A^T Y,  with λ = σ²/τ²
Regularized Least Squares and M(C)AP
What if A^T A is not invertible?
II) Laplace Prior: p(β_i) ∝ exp(−|β_i| / t). The prior belief that β is Laplace with zero mean biases the solution to "sparse" β.
⇒ Lasso:
β̂_MAP = argmax_β [ log p({Y_i}_{i=1}^n | β, σ², {X_i}_{i=1}^n) + log p(β) ] = argmin_β Σ_i (Y_i − X_i β)² + λ‖β‖₁
Beyond Linear Regression
- Polynomial regression
- Regression with nonlinear features
Polynomial Regression
Univariate (1-dim) case:
f(X) = β0 + β1 X + β2 X² + … + βm X^m   (degree m)
Multivariate (p-dim) case:
f(X) = β0 + β1 X^(1) + β2 X^(2) + … + βp X^(p)
     + Σ_{i=1}^p Σ_{j=1}^p β_ij X^(i) X^(j)
     + Σ_{i=1}^p Σ_{j=1}^p Σ_{k=1}^p β_ijk X^(i) X^(j) X^(k)
     + … terms up to degree m
As before,
β̂ = (A^T A)^{-1} A^T Y  or  β̂_MAP = (A^T A + λI)^{-1} A^T Y
where A is the matrix whose rows are the polynomial features of each training input.
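A minimal sketch of univariate polynomial regression under these formulas; the data, degree, and λ are hypothetical, not from the slides:

```python
import numpy as np

def poly_features(x, m):
    """Map 1-d inputs to polynomial features [1, x, x^2, ..., x^m]."""
    return np.vander(x, N=m + 1, increasing=True)

def poly_ridge_fit(x, y, m, lam=0.0):
    """Regularized least squares on polynomial features (lam=0 gives plain LS)."""
    A = poly_features(x, m)
    return np.linalg.solve(A.T @ A + lam * np.eye(m + 1), A.T @ y)

# Hypothetical data: noisy sine on [0, 1], fit with a degree-3 polynomial.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)
beta = poly_ridge_fit(x, y, m=3, lam=1e-3)
y_hat = poly_features(x, 3) @ beta
```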
Polynomial Regression
[Figure: fits of polynomials of order k = 1, 2, 3, 7 to the same data on [0, 1].]
A polynomial of order k is, equivalently, of degree up to k−1.
What is the right order?
Bias – Variance Tradeoff
Large bias, small variance – poor approximation but robust/stable.
Small bias, large variance – good approximation but unstable.
[Figure: model fits on 3 independent training datasets; a low-complexity fit barely changes across datasets, while a high-complexity fit varies substantially.]
Bias – Variance Decomposition
Later in the course, we will show that
E[(f(X) − f*(X))²] = Bias² + Variance
Bias = E[f(X)] − f*(X)   … how far the model is from the "true function"
Variance = E[(f(X) − E[f(X)])²]   … how variable/stable the model is
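A small Monte-Carlo sketch of this decomposition at a single test point; the true function, noise level, and polynomial degree are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    return np.sin(2 * np.pi * x)           # assumed "true function"

x0, degree, n, trials = 0.3, 3, 30, 2000   # test point, model order, data size

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0, 1, size=n)          # fresh training set each trial
    y = f_star(x) + 0.1 * rng.normal(size=n)
    coeffs = np.polyfit(x, y, degree)      # least-squares polynomial fit
    preds[t] = np.polyval(coeffs, x0)      # f(x0) learned from this training set

bias = preds.mean() - f_star(x0)           # E[f(x0)] - f*(x0)
variance = preds.var()                     # E[(f(x0) - E[f(x0)])^2]
print(f"bias^2 = {bias**2:.5f}, variance = {variance:.5f}")
```

Rerunning with a higher degree shows the tradeoff: bias² shrinks while variance grows.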
Effect of Model Complexity
[Figure: test error, training error, bias, and variance as a function of model complexity.]
Regression with basis functions
f(X) = Σ_j β_j φ_j(X): basis functions φ_j (whose linear combinations yield meaningful spaces of functions) with basis coefficients β_j.
- Polynomial Basis: 1, X, X², …
- Fourier Basis – good representation for periodic functions
- Wavelet Basis – good representation for local functions
Regression with nonlinear features
In general, we can use any nonlinear features, e.g. e^X, log X, 1/X, sin(X), …
f(X) = Σ_{j=0}^m β_j φ_j(X): nonlinear features φ_j(X), with the weight β_j of each feature.
Writing each input as a feature row X = [φ0(X) φ1(X) … φm(X)], stack the training points into
A = [ φ0(X1) φ1(X1) … φm(X1)
        ⋮
      φ0(Xn) φ1(Xn) … φm(Xn) ]
Then, as before,
β̂ = (A^T A)^{-1} A^T Y  or  β̂_MAP = (A^T A + λI)^{-1} A^T Y
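A sketch of this recipe using the nonlinear features named above; the target function, data range, and λ are hypothetical:

```python
import numpy as np

def phi(x):
    """Hypothetical feature map: [1, e^x, log x, 1/x, sin x] for inputs x > 0."""
    return np.stack([np.ones_like(x), np.exp(x), np.log(x), 1.0 / x, np.sin(x)],
                    axis=1)

rng = np.random.default_rng(1)
x = rng.uniform(0.1, 3.0, size=40)
y = 2.0 * np.sin(x) - 0.5 * np.log(x) + 0.05 * rng.normal(size=40)

A = phi(x)                 # n x (m+1) feature matrix, one row per training point
lam = 1e-2
beta = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
y_hat = A @ beta           # regularized least-squares fit in feature space
```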
Regression to Classification
Regression: X = Brain Scan, Y = Age of a subject.
Classification: X = Cell Image, Y = Diagnosis (Anemic cell vs. Healthy cell).
Can we predict the "probability" of the class label being Anemic or Healthy – a real number – using regression methods?
But the output (a probability) needs to be in [0, 1].
Logistic Regression
Assumes the following functional form for P(Y|X):
P(Y=1|X) = 1 / (1 + exp(−(w0 + Σ_i w_i X_i)))
i.e. the logistic function applied to a linear function of the data.
Logistic function (or Sigmoid): logistic(z) = 1 / (1 + e^(−z))
[Figure: the sigmoid curve logistic(z) against z.]
Features can be discrete or continuous!
Not really "regression": the output is a class probability.
Logistic Regression is a Linear Classifier!
Assumes the following functional form for P(Y|X):
P(Y=1|X) = 1 / (1 + exp(−(w0 + Σ_i w_i X_i)))   (Note: labels are 0, 1)
Decision boundary: predict Y=1 when P(Y=1|X) ≥ P(Y=0|X), which holds exactly when w0 + Σ_i w_i X_i ≥ 0.
(Linear Decision Boundary)
[Figure: the plane split by the line w0 + Σ_i w_i X_i = 0 into a region labeled 0 and a region labeled 1.]
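A quick check of the linearity claim – thresholding the probability at 1/2 is the same as thresholding the linear score at 0 (the weights and points are made up):

```python
import numpy as np

w0, w = -1.0, np.array([2.0, -3.0])            # hypothetical learned weights
x = np.array([[0.2, -0.4], [1.5, 0.9], [-0.3, 0.1]])

score = w0 + x @ w                             # linear function of the data
p = 1.0 / (1.0 + np.exp(-score))               # P(Y=1 | x)
assert np.array_equal(p >= 0.5, score >= 0)    # sigmoid is monotone, sigmoid(0)=0.5
```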
Training Logistic Regression
How do we learn the parameters w0, w1, …, wd? (d features)
Training Data: {(x^j, y^j)}_{j=1}^n
Maximum Likelihood Estimates? But there is a problem… we don't have a model for P(X) or P(X|Y) – only for P(Y|X).
Maximum (Conditional) Likelihood Estimates instead:
ŵ = argmax_w Π_j P(y^j | x^j, w)
Discriminative philosophy – don't waste effort learning P(X); focus on P(Y|X) – that's all that matters for classification!
Expressing Conditional log Likelihood
l(w) = ln Π_j P(y^j | x^j, w) = Σ_j [ y^j (w0 + Σ_i w_i x_i^j) − ln(1 + exp(w0 + Σ_i w_i x_i^j)) ]
Bad news: no closed-form solution to maximize l(w).
Good news: l(w) is a concave function of w, and concave functions are easy to maximize.
Concave function l(w)
A function l(w) is called concave if the line joining two points l(w1), l(w2) on the function does not go above the function on the interval [w1, w2].
(Strictly) concave functions have a unique maximum!
[Figure: a concave l(w) with the chord between l(w1) and l(w2) below the curve; additional panels labeled Convex, Both Concave & Convex, and Neither.]
Optimizing concave function
• Conditional likelihood for Logistic Regression is concave.
• The maximum of a concave function can be reached by the Gradient Ascent Algorithm:
Initialize: pick w at random.
Gradient: ∇_w l(w) = [∂l(w)/∂w0, …, ∂l(w)/∂wd]^T
Update rule (learning rate η > 0): w^(t+1) ← w^(t) + η ∇_w l(w)|_{w^(t)}
[Figure: ascent steps climbing a concave l(w).]
Gradient Ascent for Logistic Regression
Gradient ascent rule for w0:
∂l(w)/∂w0 = Σ_j [ y^j − exp(w0 + Σ_{i=1}^d w_i x_i^j) / (1 + exp(w0 + Σ_{i=1}^d w_i x_i^j)) ]
          = Σ_j [ y^j − P̂(Y=1 | x^j, w) ]
Gradient Ascent for Logistic Regression
Gradient ascent algorithm: iterate until change < ε
repeat:
  w0 ← w0 + η Σ_j [ y^j − P̂(Y=1 | x^j, w) ]
  For i = 1, …, d:
    w_i ← w_i + η Σ_j x_i^j [ y^j − P̂(Y=1 | x^j, w) ]
Here P̂(Y=1 | x^j, w) predicts what the current weights think label Y should be.
• Gradient ascent is the simplest of optimization approaches – alternatives include Newton's method, conjugate gradient ascent, and IRLS (see Bishop 4.3.3).
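Putting the update rule together, a minimal numpy sketch of this training loop (the data, step size, and per-example averaging are assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_fit(X, y, eta=0.1, eps=1e-6, max_iters=10000):
    """Gradient ascent on the conditional log likelihood.

    X: n x d feature matrix, y: n-vector of 0/1 labels. Returns (w0, w).
    """
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(max_iters):
        p = sigmoid(w0 + X @ w)        # P_hat(Y=1 | x^j, w) for every j
        resid = y - p                  # y^j - P_hat(Y=1 | x^j, w)
        dw0 = resid.sum()
        dw = X.T @ resid
        w0 += eta * dw0 / n            # averaging over j keeps eta scale-free
        w += eta * dw / n              # (a choice; the slides sum over j)
        if max(abs(dw0), np.abs(dw).max()) / n < eps:
            break                      # iterate until change < eps
    return w0, w

# Hypothetical data with a linear ground-truth boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)
w0, w = logistic_fit(X, y)
preds = (sigmoid(w0 + X @ w) >= 0.5).astype(float)
```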
That's all M(C)LE. How about M(C)AP?
• Define priors on w – common assumption: Normal distribution, zero mean, identity covariance – this "pushes" the parameters towards zero.
Zero-mean Gaussian prior: ln p(w) = −(λ/2)‖w‖² + const – penalizes large weights.
• M(C)AP estimate:
w* = argmax_w [ ln p(w) + Σ_j ln P(y^j | x^j, w) ]
Still a concave objective!
M(C)AP – Gradient
• With the zero-mean Gaussian prior, the gradient becomes
∂/∂w_i [ ln p(w) + l(w) ] = −λ w_i + Σ_j x_i^j [ y^j − P̂(Y=1 | x^j, w) ]
The sum over j is the same as before; the extra term −λ w_i penalizes large weights.
M(C)LE vs. M(C)AP
• Maximum conditional likelihood estimate:
w*_MCLE = argmax_w Σ_j ln P(y^j | x^j, w)
• Maximum conditional a posteriori estimate:
w*_MCAP = argmax_w [ ln p(w) + Σ_j ln P(y^j | x^j, w) ]
Logistic Regression for more than 2 classes
• Logistic regression in the more general case, where Y ∈ {y1, …, yK}:
for k < K:
P(Y = y_k | X) = exp(w_k0 + Σ_{i=1}^d w_ki X_i) / (1 + Σ_{j=1}^{K−1} exp(w_j0 + Σ_{i=1}^d w_ji X_i))
for k = K (normalization, so no weights for this class):
P(Y = y_K | X) = 1 / (1 + Σ_{j=1}^{K−1} exp(w_j0 + Σ_{i=1}^d w_ji X_i))
Predict: Y* = argmax_k P(Y = y_k | X)
Is the decision boundary still linear?
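A small sketch of these class probabilities with the K-th class as the reference; the weight values and the helper name `multiclass_probs` are hypothetical:

```python
import numpy as np

def multiclass_probs(x, W, b):
    """P(Y = y_k | x) for k = 1..K, with class K as the reference.

    W: (K-1) x d weight matrix, b: (K-1)-vector of intercepts.
    """
    z = np.exp(b + W @ x)               # exp(w_k0 + sum_i w_ki x_i) for k < K
    denom = 1.0 + z.sum()               # shared normalizer
    return np.append(z, 1.0) / denom    # classes 1..K-1, then reference class K

# Hypothetical 3-class example with d = 2 features.
W = np.array([[1.0, -0.5],
              [0.2, 0.8]])
b = np.array([0.1, -0.3])
p = multiclass_probs(np.array([0.5, 1.0]), W, b)   # sums to 1
y_star = int(np.argmax(p))                         # predict the most probable class
```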