TRANSCRIPT
CPSC 340: Machine Learning and Data Mining
Nonlinear Regression (Fall 2019)
Last Time: Linear Regression
• We discussed linear models:
  ŷi = w1 xi1 + w2 xi2 + … + wd xid = wᵀxi
• "Multiply feature xij by weight wj, add them to get ŷi."
• We discussed the squared error function:
  f(w) = (1/2) Σi (wᵀxi − yi)²
• Interactive demo:
  – http://setosa.io/ev/ordinary-least-squares-regression
Matrix/Norm Notation (MEMORIZE/STUDY THIS)
• To solve the d-dimensional least squares, we use matrix notation:
  – We use 'w' as a d × 1 vector containing weight wj in position 'j'.
  – We use 'y' as an n × 1 vector containing target yi in position 'i'.
  – We use 'xi' as a d × 1 vector containing the features of example 'i'.
• We're now going to be careful to make sure these are column vectors.
  – So 'X' is the n × d matrix with xiᵀ in row 'i'.
• In this notation:
  – Our prediction for example 'i' is given by the scalar wᵀxi.
  – Our predictions for all 'i' (an n × 1 vector) are the matrix-vector product Xw.
  – The residual vector r = Xw − y gives the difference between the predictions and the yi (n × 1).
  – Least squares can be written as the squared L2-norm of the residual:
    f(w) = (1/2)||r||² = (1/2)||Xw − y||²
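A minimal numpy sketch of this notation (my own illustration, with made-up data; the names X, w, y, r mirror the slide):

    import numpy as np

    n, d = 4, 3                        # 4 examples, 3 features
    X = np.random.randn(n, d)          # row i of X is xi^T
    w = np.random.randn(d)             # d x 1 weight vector
    y = np.random.randn(n)             # n x 1 target vector

    y_hat = X @ w                      # predictions for all i: the product Xw
    r = y_hat - y                      # residual vector (n x 1)
    f = 0.5 * np.dot(r, r)             # squared L2-norm of the residual
    print(f, 0.5 * np.linalg.norm(r)**2)   # two ways to compute the same value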
Back to Deriving Least Squares for d > 2…
• We can write the vector of predictions ŷ as a matrix-vector product:
  ŷ = Xw
• And we can write linear least squares in matrix notation as:
  f(w) = (1/2)||Xw − y||²
• We'll use this notation to derive the d-dimensional least squares 'w'.
  – By setting the gradient ∇f(w) equal to the zero vector and solving for 'w'.
Digression: Matrix Algebra Review
• Quick review of the linear algebra operations we'll use:
  – If 'a' and 'b' are vectors, and 'A' and 'B' are matrices, then:
    aᵀb = bᵀa,  ||a||² = aᵀa,  (A + B)ᵀ = Aᵀ + Bᵀ,  (AB)ᵀ = BᵀAᵀ,  A(B + C) = AB + AC.
Linear and Quadratic Gradients
• From these rules we have (see the post-lecture slide for the steps):
  f(w) = (1/2)||Xw − y||² = (1/2)wᵀXᵀXw − wᵀXᵀy + (1/2)yᵀy
• How do we compute the gradient?
• We've written 'f' as a d-dimensional quadratic:
  f(w) = (1/2)wᵀAw − wᵀb + c,  with A = XᵀX, b = Xᵀy, c = (1/2)yᵀy
• The gradient is given by:
  ∇f(w) = Aw − b
• Using the definitions of 'A' and 'b':
  ∇f(w) = XᵀXw − Xᵀy
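A quick sanity check of this formula (my own sketch with made-up data, not from the slides): compare ∇f(w) = XᵀXw − Xᵀy against a finite-difference approximation.

    import numpy as np

    np.random.seed(0)
    n, d = 10, 3
    X, y = np.random.randn(n, d), np.random.randn(n)
    w = np.random.randn(d)

    f = lambda w: 0.5 * np.sum((X @ w - y) ** 2)   # least squares objective
    grad = X.T @ X @ w - X.T @ y                   # the claimed gradient

    eps = 1e-6
    for j in range(d):                             # perturb each coordinate of w
        e = np.zeros(d); e[j] = eps
        print(grad[j], (f(w + e) - f(w - e)) / (2 * eps))   # should agree closely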
Normal Equations
• Set the gradient equal to zero to find the "critical" points:
  XᵀXw − Xᵀy = 0
• We now move the terms not involving 'w' to the other side:
  XᵀXw = Xᵀy
• This is a set of 'd' linear equations called the "normal equations".
  – This is a linear system like "Ax = b" from Math 152.
• You can use Gaussian elimination to solve for 'w'.
  – In Julia, the "\" command can be used to solve linear systems.
  – In Python, you can solve linear systems in one line using numpy.linalg.solve.
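For example, a minimal Python sketch (my own, with made-up data; numpy.linalg.solve performs the Gaussian-elimination-style solve):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(100, 5)            # n = 100 examples, d = 5 features
    y = np.random.randn(100)

    # Solve the normal equations (X^T X) w = (X^T y).
    w = np.linalg.solve(X.T @ X, X.T @ y)

    # numpy's built-in least squares routine agrees here.
    w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
    print(np.allclose(w, w_lstsq))         # True

In Julia, the equivalent one-liner would be w = (X'X) \ (X'y).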
Incorrect Solutions to Least Squares Problem
Least Squares Cost
• Cost of solving the "normal equations" XᵀXw = Xᵀy?
• Forming the Xᵀy vector costs O(nd).
  – It has 'd' elements, and each is an inner product between 'n' numbers.
• Forming the matrix XᵀX costs O(nd²).
  – It has d² elements, and each is an inner product between 'n' numbers.
• Solving a d × d system of equations costs O(d³).
  – Cost of Gaussian elimination on a d-variable linear system.
  – Other standard methods have the same cost.
• The overall cost is O(nd² + d³).
  – Which term dominates depends on 'n' and 'd'.
Least Squares Issues
• Issues with the least squares model:
  – The solution might not be unique.
  – It is sensitive to outliers.
  – It always uses all features.
  – The data might be so big that we can't store XᵀX.
    • Or you can't afford the O(nd² + d³) cost.
  – It might predict outside the range of the yi values.
  – It assumes a linear relationship between xi and yi.
Non-Uniqueness of Least Squares Solution
• Why isn't the solution unique?
  – Imagine having two features that are identical for all examples.
  – I can increase the weight on one feature and decrease it on the other without changing the predictions.
  – Thus, if (w1, w2) is a solution then (w1 + w2, 0) is another solution.
  – This is a special case of features being "collinear":
    • One feature is a linear function of the others.
• But any 'w' where ∇f(w) = 0 is a global minimizer of 'f'.
  – This is due to the convexity of 'f', which we'll discuss later.
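A quick numerical illustration (my own sketch): with a duplicated feature, different weight vectors give identical predictions, and XᵀX is singular.

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(5)
    X = np.column_stack([x, x])            # two identical features
    w_a = np.array([1.0, 2.0])             # (w1, w2)
    w_b = np.array([3.0, 0.0])             # (w1 + w2, 0)

    print(np.allclose(X @ w_a, X @ w_b))   # True: same predictions
    print(np.linalg.matrix_rank(X.T @ X))  # 1, not 2: X^T X is singular,
                                           # so np.linalg.solve would raise an error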
(pause)
Motivation: Non-Linear Progressions in Athletics
• Are top athletes going faster, higher, and farther?
  http://www.at-a-lanta.nl/weia/Progressie.html
  https://en.wikipedia.org/wiki/Usain_Bolt
  http://www.britannica.com/biography/Florence-Griffith-Joyner
Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression:
  – Regression tree: a tree with a mean value or a linear regression at the leaves.
    http://www.at-a-lanta.nl/weia/Progressie.html
  – Probabilistic models: fit p(xi | yi) and p(yi) with a Gaussian or other model.
    • CPSC 540.
    https://en.wikipedia.org/wiki/Multivariate_normal_distribution
  – Non-parametric models:
    • KNN regression (see the numpy sketch after this slide):
      – Find the 'k' nearest neighbours of xi.
      – Return the mean of the corresponding yi.
      – Could be weighted by distance: close points 'j' get more "weight" wij.
    • 'Nadaraya-Watson': weight all yi by distance to xi.
      http://www.mathworks.com/matlabcentral/fileexchange/35316-kernel-regression-with-variable-window-width/content/ksr_vw.m
    • 'Locally linear regression': for each xi, fit a linear model weighted by distance.
      (Better than KNN and NW at the boundaries.)
      http://www.itl.nist.gov/div898/handbook/pmd/section4/pmd423.htm
    http://scikit-learn.org/stable/modules/neighbors.html
  – Ensemble methods:
    • Can improve performance by averaging across regression models.
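Here is the KNN regression sketch promised above (my own minimal illustration with made-up 1D data; scikit-learn's KNeighborsRegressor implements the same idea):

    import numpy as np

    def knn_regress(X_train, y_train, x_query, k=3):
        dist = np.linalg.norm(X_train - x_query, axis=1)  # distance to each example
        nearest = np.argsort(dist)[:k]                    # indices of k nearest neighbours
        return np.mean(y_train[nearest])                  # mean of the corresponding yi

    np.random.seed(0)
    X_train = np.random.rand(50, 1)
    y_train = np.sin(2 * np.pi * X_train[:, 0])           # a non-linear target
    print(knn_regress(X_train, y_train, np.array([0.25]), k=5))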
Adapting Counting/Distance-Based Methods
• We can adapt our classification methods to perform regression.
• Applications:
  – Regression forests for fluid simulation:
    • https://www.youtube.com/watch?v=kGB7Wd9CudA
  – KNN for image completion:
    • http://graphics.cs.cmu.edu/projects/scene-completion
    • Combined with "graph cuts" and "Poisson blending".
  – KNN regression for "voice photoshop":
    • https://www.youtube.com/watch?v=I3l4XLZ59iw
    • Combined with "dynamic time warping" and "Poisson blending".
• But we'll focus on linear models with non-linear transforms.
  – These are the building blocks for more advanced methods.
Whydon’twehaveay-intercept?– Linearmodelis𝑦"i =wxi insteadof𝑦"i =wxi +w0 withy-interceptw0.– Withoutanintercept,ifxi =0thenwemustpredict𝑦"i =0.
Whydon’twehaveay-intercept?– Linearmodelis𝑦"i =wxi insteadof𝑦"i =wxi +w0 withy-interceptw0.– Withoutanintercept,ifxi =0thenwemustpredict𝑦"i =0.
Adding a Bias Variable
• Simple trick to add a y-intercept ("bias") variable:
  – Make a new matrix 'Z' with an extra feature that is always "1".
• Now use 'Z' as your features in linear regression.
  – We'll use 'v' instead of 'w' as the regression weights when we use features 'Z'.
• So we can have a non-zero y-intercept by changing the features.
  – This means we can ignore the y-intercept in our derivations, which is cleaner.
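A sketch of the trick in numpy (my own illustration; the intercept value 4.2 and the weights are made up):

    import numpy as np

    np.random.seed(0)
    n, d = 100, 3
    X = np.random.randn(n, d)
    y = 4.2 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(n)

    Z = np.column_stack([np.ones(n), X])   # extra feature that is always 1

    # Same normal equations as before, with Z in place of X and v in place of w.
    v = np.linalg.solve(Z.T @ Z, Z.T @ y)
    print(v[0])                            # the learned y-intercept (about 4.2)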
Motivation: Limitations of Linear Models
• On many datasets, yi is not a linear function of xi.
• Can we use least squares to fit non-linear models?
Non-Linear Feature Transforms
• Can we use linear least squares to fit a quadratic model?
  ŷi = w0 + w1 xi + w2 xi²
• You can do this by changing the features (a change of basis):
  zi = (1, xi, xi²)ᵀ
• Fit the new parameters 'v' under the "change of basis": solve ZᵀZv = Zᵀy.
• The prediction is a linear function of 'v', but a quadratic function of xi.
General Polynomial Features (d = 1)
• We can have a polynomial of degree 'p' by using these features:
  zi = (1, xi, xi², …, xiᵖ)ᵀ
• There are polynomial basis functions that are numerically nicer:
  – E.g., Lagrange polynomials (see CPSC 303).
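A sketch of fitting a degree-p polynomial with linear least squares (my own illustration; np.vander builds the monomial features zi = (1, xi, …, xiᵖ)):

    import numpy as np

    np.random.seed(0)
    x = np.random.rand(50)                             # 1D inputs
    y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(50)

    p = 3
    Z = np.vander(x, p + 1, increasing=True)           # row i is (1, xi, xi^2, xi^3)

    v = np.linalg.solve(Z.T @ Z, Z.T @ y)              # normal equations Z^T Z v = Z^T y

    x_test = np.linspace(0, 1, 5)                      # predict with the same basis
    print(np.vander(x_test, p + 1, increasing=True) @ v)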
Summary
• Matrix notation for expressing the least squares problem.
• Normal equations: solution of least squares as a linear system.
  – Solve (XᵀX)w = (Xᵀy).
• The solution might not be unique because of collinearity.
  – But any solution is optimal because of "convexity".
• Tree/probabilistic/non-parametric/ensemble regression methods.
• Non-linear transforms:
  – Allow us to model non-linear relationships with linear models.
• Next time: how to do least squares with a million features.
Linear Least Squares: Expansion Step
• Expanding the matrix-notation objective (the steps referenced earlier):
  f(w) = (1/2)||Xw − y||²
       = (1/2)(Xw − y)ᵀ(Xw − y)
       = (1/2)(wᵀXᵀ − yᵀ)(Xw − y)
       = (1/2)(wᵀXᵀXw − wᵀXᵀy − yᵀXw + yᵀy)
       = (1/2)wᵀXᵀXw − wᵀXᵀy + (1/2)yᵀy
  (using that yᵀXw = wᵀXᵀy, since a scalar equals its own transpose).
Vector View of Least Squares
• We showed that least squares minimizes:
  f(w) = (1/2)||Xw − y||²
• The 1/2 and the squaring don't change the solution, so this is equivalent to minimizing:
  ||Xw − y||
• From this viewpoint, least squares minimizes the Euclidean distance between the vector of labels 'y' and the vector of predictions Xw.
Bonus Slide: Householder(-ish) Notation
• Householder notation: a set of (fairly logical) conventions for math:
  – Uppercase letters (A, B, X) for matrices, lowercase letters (a, b, w, x) for vectors, and lowercase Greek letters (α, β) for scalars.
When does least squares have a unique solution?
• We said that the least squares solution is not unique if we have repeated columns.
• But there are other ways it could be non-unique:
  – One column is a scaled version of another column.
  – One column could be the sum of 2 other columns.
  – One column could be three times one column minus four times another.
• The least squares solution is unique if and only if all columns of X are "linearly independent".
  – No column can be written as a "linear combination" of the others.
  – Many equivalent conditions (see Strang's linear algebra book):
    • X has "full column rank", XᵀX is invertible, XᵀX has non-zero eigenvalues, det(XᵀX) > 0.
  – Note that we cannot have independent columns if d > n.