
Page 1: CIS 519/419 Applied Machine Learning

CIS 419/519 Spring '18

CIS 519/419 Applied Machine Learning
www.seas.upenn.edu/~cis519

Dan Roth
[email protected]
http://www.cis.upenn.edu/~danroth/
461C, 3401 Walnut

Lecture given by Daniel Khashabi

Slides were created by Dan Roth (for CIS 519/419 at Penn or CS 446 at UIUC), Eric Eaton (for CIS 519/419 at Penn), or from other authors who have made their ML slides available.

Page 2: CIS 519/419 Applied Machine Learning

Functions Can Be Made Linear
§ Data are not linearly separable in one dimension
§ Not separable if you insist on using a specific class of functions

[Figure: one-dimensional data along $x$ that is not linearly separable]

Page 3: CIS 519/419 Applied Machine Learning

Blown-Up Feature Space
§ Data are separable in $\langle x, x^2 \rangle$ space

[Figure: the same data plotted in the $\langle x, x^2 \rangle$ plane, now linearly separable]

Page 4: CIS 519/419 Applied Machine Learning

Multi-Layer Neural Network
§ Multi-layer networks were designed to overcome the computational (expressivity) limitation of a single threshold element.
§ The idea is to stack several layers of threshold elements, each layer using the output of the previous layer as input.
§ Multi-layer networks can represent arbitrary functions, but building effective learning methods for such networks was [thought to be] difficult.

[Figure: feed-forward network with Input, Hidden, and Output layers; activation flows forward]

Page 5: CIS 519/419 Applied Machine Learning

Basic Units
§ Linear Unit: multiple layers of linear functions $o_j = w \cdot x$ produce linear functions. We want to represent nonlinear functions.
§ Need to do it in a way that facilitates learning.
§ Threshold units: $o_j = \mathrm{sgn}(w \cdot x)$ are not differentiable, hence unsuitable for gradient descent.
§ The key idea was to notice that the discontinuity of the threshold element can be represented by a smooth nonlinear approximation: $o_j = [1 + \exp\{-w \cdot x\}]^{-1}$
§ (Rumelhart, Hinton, Williams, 1986), (Linnainmaa, 1970); see: http://people.idsia.ch/~juergen/who-invented-backpropagation.html

[Figure: Input, Hidden, and Output layers with weights $w^1_{ij}$ and $w^2_{ij}$]

Page 6: CIS 519/419 Applied Machine Learning

Model Neuron (Logistic)
§ Use a non-linear, differentiable output function such as the sigmoid or logistic function
§ Net input to a unit is defined as: $\mathrm{net}_j = \sum_i w_{ij} \cdot x_i$
§ Output of a unit is defined as: $O_j = \frac{1}{1 + e^{-(\mathrm{net}_j - T_j)}}$

[Figure: a logistic unit $j$ with inputs $x_i$, weights $w_{ij}$, threshold $T_j$, and output $O_j$]

Page 7: CIS 519/419 Applied Machine Learning

Neural Networks
§ Neural Networks are functions: $NN: X \to Y$
§ where $X = [0,1]^n$ or $\{0,1\}^n$ and $Y = [0,1]$ or $\{0,1\}$
§ Robust approach to approximating real-valued, discrete-valued and vector-valued target functions.
§ Among the most effective general-purpose supervised learning methods currently known.
§ Effective especially for complex and hard-to-interpret input data, such as real-world sensory data, where a lot of supervision is available.
§ The Backpropagation algorithm for neural networks has been shown successful in many practical problems
§ handwritten character recognition, speech recognition, object recognition, some NLP problems

Page 8: CIS 519/419 Applied Machine Learning

Neural Networks
§ Neural Networks are functions: $NN: X \to Y$
§ where $X = [0,1]^n$ or $\{0,1\}^n$ and $Y = [0,1]$ or $\{0,1\}$
§ NN can be used as an approximation of a target classifier
§ In their general form, even with a single hidden layer, NN can approximate any function
§ Algorithms exist that can learn a NN representation from labeled training data (e.g., Backpropagation).

Page 9: CIS 519/419 Applied Machine Learning

Multi-Layer Neural Networks
§ Multi-layer networks were designed to overcome the computational (expressivity) limitation of a single threshold element.
§ The idea is to stack several layers of threshold elements, each layer using the output of the previous layer as input.

[Figure: feed-forward network with Input, Hidden, and Output layers]

Page 10: CIS 519/419 Applied Machine Learning

Motivation for Neural Networks
§ Inspired by biological systems
§ But don't take this (as well as any other words in the news on "emergence" of intelligent behavior) seriously;
§ We are currently on the rising part of a wave of interest in NN architectures, after a long downtime since the mid-90s.
§ Better computer architecture (GPUs, parallelism)
§ A lot more data than before; in many domains, supervision is available.
§ The current surge of interest has seen very minimal algorithmic changes

Page 11: CIS 519/419 Applied Machine Learning

Motivation for Neural Networks
§ Minimal to no algorithmic changes
§ One potentially interesting perspective:
§ Before, we looked at NN only as function approximators.
§ Now, we look at the intermediate representations generated while learning as meaningful
§ Ideas are being developed on the value of these intermediate representations for transfer learning, etc.
§ We will present in the next two lectures a few of the basic architectures and learning algorithms, and provide some examples for applications

Page 12: CIS 519/419 Applied Machine Learning

Neural Speed Constraints
§ Neuron "switching time" is O(milliseconds), compared to nanoseconds for transistors.
§ However, biological systems can perform significant cognitive tasks (vision, language understanding) in fractions of a second.
§ Even for limited abilities, current AI systems require orders of magnitude more steps.
§ The human brain has approximately 10^10 neurons, each connected to 10^4 others; must explore massive parallelism (but there's more…)

Page 13: CIS 519/419 Applied Machine Learning

Basic Unit in Multi-Layer Neural Network
§ Linear Unit: $o_j = w \cdot x$; multiple layers of linear functions produce linear functions. We want to represent nonlinear functions.
§ Threshold units: $o_j = \mathrm{sgn}(w \cdot x - T)$ are not differentiable, hence unsuitable for gradient descent

[Figure: Input, Hidden, and Output layers]

Page 14: CIS 519/419 Applied Machine Learning

Model Neuron (Logistic)
§ Neuron is modeled by a unit $j$ connected by weighted links $w_{ij}$ to other units $i$.
§ Use a non-linear, differentiable output function such as the sigmoid or logistic function
§ Net input to a unit is defined as: $\mathrm{net}_j = \sum_i w_{ij} \cdot x_i$
§ Output of a unit is defined as: $o_j = \frac{1}{1 + \exp\{-(\mathrm{net}_j - T_j)\}}$ (a code sketch of this unit follows below)

[Figure: unit $j$ with inputs $x_1, \ldots, x_6$, weights $w_{1j}, \ldots, w_{6j}$, and output $o_j$]
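
A minimal sketch of this unit in Python with NumPy; the function name and the example values are illustrative, not from the course materials:

```python
import numpy as np

def logistic_unit(x, w, T):
    """Output of a single sigmoid unit with threshold T."""
    net = np.dot(w, x)                     # net_j = sum_i w_ij * x_i
    return 1.0 / (1.0 + np.exp(-(net - T)))

x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.3, 0.8])
print(logistic_unit(x, w, T=0.2))          # a value in (0, 1)
```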

Page 15: CIS 519/419 Applied Machine Learning

History: Neural Computation
§ McCulloch and Pitts (1943) showed how linear threshold units can be used to compute logical functions
§ Can build basic logic gates
§ AND: $w_{ij} = T_j / n$
§ OR: $w_{ij} = T_j$
§ NOT: use negative weight
§ Can build arbitrary logic circuits, finite-state machines and computers given these basic gates.
§ Can specify any Boolean function using a two-layer network (with negation)
§ DNF and CNF are universal representations

$\mathrm{net}_j = \sum_i w_{ij} \cdot x_i \qquad o_j = \frac{1}{1 + \exp\{-(\mathrm{net}_j - T_j)\}}$

Page 16: CIS 519/419 Applied Machine Learning

Representational Power
§ Any Boolean function can be represented by a two-layer network (simulate a two-layer AND-OR network)
§ Any bounded continuous function can be approximated with arbitrarily small error by a two-layer network.
§ Sigmoid functions provide a set of basis functions from which arbitrary functions can be composed.
§ Any function can be approximated to arbitrary accuracy by a three-layer network.

Page 17: CIS 519/419 Applied Machine Learning

Quiz Time!
§ Given a neural network, how can we make predictions?
§ Given the input, calculate the output of each layer (starting from the first layer), until you get to the output.
§ What is required to fully specify a neural network?
§ The weights.
§ Why can NN predictions be quick?
§ Because many of the computations can be parallelized.
§ What makes a neural network a non-linear approximator?
§ The non-linear units.

Page 18: CIS 519/419 Applied Machine Learning

Training a Neural Net

Page 19: CIS 519/419 Applied Machine Learning

Widrow-Hoff Rule
§ This incremental update rule provides an approximation to the goal:
§ Find the best linear approximation of the data

$Err(\vec{w}_j) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

§ where:
§ $o_d = \sum_i w_{ij} \cdot x_i = \vec{w}_j \cdot \vec{x}$ is the output of the linear unit on example $d$
§ $t_d$ is the target output for example $d$

Page 20: CIS 519/419 Applied Machine Learning

History: Learning Rules
§ Hebb (1949) suggested that if two units are both active (firing) then the weights between them should increase:
$w_{ij} = w_{ij} + R \, o_i o_j$
§ $R$ is a constant called the learning rate
§ Supported by physiological evidence
§ Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule.
§ assumes binary output units; single linear threshold unit
§ Led to the Perceptron Algorithm
§ See: http://people.idsia.ch/~juergen/who-invented-backpropagation.html

Page 21: CIS 519/419 Applied Machine Learning

Perceptron Learning Rule
§ Given:
§ the target output for the output unit is $t_j$
§ the input the neuron sees is $x_i$
§ the output it produces is $o_j$
§ Update weights according to $w_{ij} \leftarrow w_{ij} + R (t_j - o_j) x_i$ (see the sketch below)
§ If the output is correct, don't change the weights
§ If the output is wrong, change weights for all inputs which are 1
§ If the output is low (0, needs to be 1), increment weights
§ If the output is high (1, needs to be 0), decrement weights

[Figure: threshold unit with inputs $x_1, \ldots, x_6$, weights $w_{1j}, \ldots, w_{6j}$, threshold $T_j$, and output $o_j$]
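
A minimal sketch of this rule in Python with NumPy, assuming a single threshold unit with binary output; the function name, learning rate, and AND example are illustrative:

```python
import numpy as np

def perceptron_train(X, t, R=0.1, epochs=10):
    """X: (m, n) inputs; t: (m,) binary targets in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else 0   # threshold unit output
            w += R * (target - o) * x          # no change when o == target
    return w

# Learns AND; the last column is a constant bias feature.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
print(perceptron_train(X, np.array([0, 0, 0, 1])))
```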

Page 22: CIS 519/419 Applied Machine Learning

Gradient Descent
§ We use gradient descent to determine the weight vector that minimizes $Err(\vec{w}_j)$;
§ Fixing the set $D$ of examples, $E$ is a function of $\vec{w}_j$
§ At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface.

[Figure: error curve $Err(w)$ over $w$, with iterates $w_0, w_1, w_2, w_3$ descending toward the minimum]

Page 23: CIS 519/419 Applied Machine Learning

Summary: Single Layer Network
§ Variety of update rules
§ Multiplicative
§ Additive
§ Batch and incremental algorithms
§ Various convergence and efficiency conditions
§ There are other ways to learn linear functions
§ Linear Programming (general purpose)
§ Probabilistic Classifiers (some assumptions)
§ Key algorithms are driven by gradient descent

Page 24: CIS 519/419 Applied Machine Learning

General Stochastic Gradient Algorithms

$w_{t+1} = w_t - r_t \, \nabla_w Q(x_t, y_t, w_t) = w_t - r_t \, g_t$

The loss $Q$ is a function of $x$, $w$ and $y$; $r_t$ is the learning rate and $g_t$ the gradient.

LMS: $Q((x, y), w) = \frac{1}{2}(y - w^T x)^2$ leads to the update rule (also called Widrow's Adaline):
$w_{t+1} = w_t + r (y_t - w_t^T x_t) x_t$
Here, even though we make binary predictions based on $\mathrm{sgn}(w^T x)$, we do not take the sign of the dot product into account in the loss.

Another common loss function is the hinge loss: $Q((x, y), w) = \max(0, 1 - y w^T x)$. This leads to the perceptron update rule (here $g = -yx$):
If $y_i w_t^T x_i > 1$ (no mistake, by a margin): no update.
Otherwise (mistake, relative to margin): $w_{t+1} = w_t + r y_t x_t$.

Good to think about the case of Boolean examples. (Both updates are sketched in code below.)
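
A minimal sketch of the two stochastic gradient updates above in Python with NumPy; the function names and learning rate are illustrative:

```python
import numpy as np

def sgd_lms_step(w, x, y, r=0.01):
    # Q = 1/2 (y - w.x)^2  ->  w <- w + r (y - w.x) x
    return w + r * (y - np.dot(w, x)) * x

def sgd_hinge_step(w, x, y, r=0.01):
    # Q = max(0, 1 - y w.x), with y in {-1, +1}
    if y * np.dot(w, x) > 1:       # no mistake, by a margin: no update
        return w
    return w + r * y * x           # otherwise move along -g = y x

w = np.zeros(3)
w = sgd_hinge_step(w, np.array([1.0, 0.0, 1.0]), +1)
```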

Page 25: CIS 519/419 Applied Machine Learning

Summary: Single Layer Network
§ Variety of update rules
§ Multiplicative
§ Additive
§ Batch and incremental algorithms
§ Various convergence and efficiency conditions
§ There are other ways to learn linear functions
§ Linear Programming (general purpose)
§ Probabilistic Classifiers (some assumptions)
§ Key algorithms are driven by gradient descent
§ However, the representational restriction is limiting in many applications

Page 26: CIS 519/419 Applied Machine Learning

Backpropagation Learning Rule
§ Since there could be multiple output units, we define the error as the sum over all the network output units.

$Err(w) = \frac{1}{2} \sum_{d \in D} \sum_{k \in K} (t_{kd} - o_{kd})^2$

§ where $D$ is the set of training examples,
§ $K$ is the set of output units
§ This is used to derive the (global) learning rule which performs gradient descent in the weight space in an attempt to minimize the error function.

$\Delta w_{ij} = -R \frac{\partial E}{\partial w_{ij}}$

[Figure: network with output units $o_1, \ldots, o_k$ and a target vector such as (1, 0, 1, 0, 0)]

Page 27: CIS 519/419 Applied Machine Learning

Learning with a Multi-Layer Perceptron
§ It's easy to learn the top layer – it's just a linear unit.
§ Given feedback (truth) at the top layer, and the activation at the layer below it, you can use the Perceptron update rule (more generally, gradient descent) to update these weights.
§ The problem is what to do with the other set of weights – we do not get feedback in the intermediate layer(s).

[Figure: Input, Hidden, and Output layers with weights $w^1_{ij}$ and $w^2_{ij}$]

Page 28: CIS 519/419 Applied Machine Learning

Learning with a Multi-Layer Perceptron
§ The problem is what to do with the other set of weights – we do not get feedback in the intermediate layer(s).
§ Solution: If all the activation functions are differentiable, then the output of the network is also a differentiable function of the input and weights in the network.
§ Define an error function (e.g., sum of squares) that is a differentiable function of the output, i.e., this error function is also a differentiable function of the weights.
§ We can then evaluate the derivatives of the error with respect to the weights, and use these derivatives to find weight values that minimize this error function, using gradient descent (or other optimization methods).
§ This results in an algorithm called back-propagation.

[Figure: Input, Hidden, and Output layers with weights $w^1_{ij}$ and $w^2_{ij}$]

Page 29: CIS 519/419 Applied Machine Learning

Some facts from real analysis
§ First let's get the notation right:
§ An arrow shows functional dependence of $z$ on $y$, i.e., given $y$, we can calculate $z$. For example: $z(y) = 2y^2$.
§ $\frac{dz}{dy}$ denotes the derivative of $z$ with respect to $y$.

Page 30: CIS 519/419 Applied Machine Learning

Some facts from real analysis
§ Simple chain rule
§ If $z$ is a function of $y$, and $y$ is a function of $x$
§ Then $z$ is a function of $x$ as well.
§ Question: how do we find $\frac{dz}{dx}$? By the chain rule: $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$

We will use these facts to derive the details of the Backpropagation algorithm:
- $z$ will be the error (loss) function; we need to know how to differentiate $z$.
- Intermediate nodes use a logistic function (or another differentiable step function); we need to know how to differentiate it.

Page 31: CIS 519/419 Applied Machine Learning

Some facts from real analysis
§ Multiple path chain rule
§ If $z$ depends on $x$ through two intermediate variables $y_1$ and $y_2$, then $\frac{dz}{dx} = \frac{dz}{dy_1}\frac{dy_1}{dx} + \frac{dz}{dy_2}\frac{dy_2}{dx}$

Slide Credit: Richard Socher

Page 32: CIS 519/419 Applied Machine Learning

Some facts from real analysis
§ Multiple path chain rule: general

$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x}$

Slide Credit: Richard Socher

Page 33: CIS 519/419 Applied Machine Learning

Key Intuitions Required for BP
§ Gradient Descent
§ Change the weights in the direction of the gradient to minimize the error function.
§ Chain Rule
§ Use the chain rule to calculate the gradients $\frac{\partial E}{\partial w_{ij}}$ of the intermediate weights.
§ Dynamic Programming (Memoization)
§ Memoize the weight updates to make the updates faster.
§ The "back" part of "backpropagation"

Page 34: CIS 519/419 Applied Machine Learning

Backpropagation: the big picture
§ Loop over instances:
1. The forward step
§ Given the input, make predictions layer-by-layer, starting from the first layer
2. The backward step
§ Calculate the error in the output
§ Update the weights layer-by-layer, starting from the final layer

Page 35: CIS 519/419 Applied Machine Learning

Quiz time!
§ What is the purpose of the forward step?
§ To make predictions, given an input.
§ What is the purpose of the backward step?
§ To update the weights, given an output error.
§ Why do we use the chain rule?
§ To calculate the gradient in the intermediate layers.
§ Why can backpropagation be efficient?
§ Because it can be parallelized.

Page 36: CIS 519/419 Applied Machine Learning

Deriving the update rules

Page 37: CIS 519/419 Applied Machine Learning

Reminder: Model Neuron (Logistic)
§ Neuron is modeled by a unit $j$ connected by weighted links $w_{ij}$ to other units $i$.
§ Use a non-linear, differentiable output function such as the sigmoid or logistic function
§ Net input to a unit is defined as: $\mathrm{net}_j = \sum_i w_{ij} \cdot x_i$
§ Output of a unit is defined as: $o_j = \frac{1}{1 + \exp\{-(\mathrm{net}_j - T_j)\}}$

The parameters so far? The set of connection weights $w_{ij}$ and the threshold values $T_j$.

[Figure: unit $j$ with inputs $x_1, \ldots, x_6$, weights $w_{1j}, \ldots, w_{6j}$, and output $o_j$]

Page 38: CIS 519/419 Applied Machine Learning

Derivatives
§ Function 1 (error):
§ $E = \frac{1}{2} \sum_{k \in K} (t_k - o_k)^2$
§ $\frac{\partial E}{\partial o_i} = -(t_i - o_i)$
§ Function 2 (linear gate):
§ $\mathrm{net}_j = \sum_i w_{ij} \cdot x_i$
§ $\frac{\partial \mathrm{net}_j}{\partial w_{ij}} = x_i$
§ Function 3 (differentiable step function):
§ $o_i = \frac{1}{1 + \exp\{-(\mathrm{net}_i - T)\}}$
§ $\frac{\partial o_i}{\partial \mathrm{net}_i} = \frac{\exp\{-(\mathrm{net}_i - T)\}}{(1 + \exp\{-(\mathrm{net}_i - T)\})^2} = o_i (1 - o_i)$

[Figure: output units $o_1, \ldots, o_k$; hidden unit $j$; input unit $i$; weight $w_{ij}$]

Page 39: CIS 519/419 Applied Machine Learning

Derivation of Learning Rule
§ The weights are updated incrementally; the error is computed for each example and the weight update is then derived.

$E_d(w) = \frac{1}{2} \sum_{k \in K} (t_k - o_k)^2$

§ $w_{ij}$ influences the output only through $\mathrm{net}_j$
§ Therefore:

$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial \mathrm{net}_j} \frac{\partial \mathrm{net}_j}{\partial w_{ij}}$

with $o_j = \frac{1}{1 + \exp\{-(\mathrm{net}_j - T_j)\}}$ and $\mathrm{net}_j = \sum_i w_{ij} \cdot x_i$

[Figure: output units $o_1, \ldots, o_k$; hidden unit $j$; input unit $i$; weight $w_{ij}$]

Page 40: CIS 519/419 Applied Machine Learning

Derivation of Learning Rule (2)
§ Weight updates of output units:
§ $w_{ij}$ influences the output only through $\mathrm{net}_j$
§ Therefore:

$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial \mathrm{net}_j} \frac{\partial \mathrm{net}_j}{\partial w_{ij}} = -(t_j - o_j) \, o_j (1 - o_j) \, x_i$

using $E_d(w) = \frac{1}{2} \sum_{k \in K} (t_k - o_k)^2$, $\mathrm{net}_j = \sum_i w_{ij} \cdot x_i$, $o_j = \frac{1}{1 + \exp\{-(\mathrm{net}_j - T_j)\}}$, and $\frac{\partial o_j}{\partial \mathrm{net}_j} = o_j (1 - o_j)$.

Page 41: CIS 519/419 Applied Machine Learning

Derivation of Learning Rule (3)
§ Weights of output units:
§ $w_{ij}$ is changed by:

$\Delta w_{ij} = R (t_j - o_j) \, o_j (1 - o_j) \, x_i = R \, \delta_j x_i$

where we defined:

$\delta_j = -\frac{\partial E_d}{\partial \mathrm{net}_j} = (t_j - o_j) \, o_j (1 - o_j)$

Page 42: CIS 519/419 Applied Machine Learning

Derivation of Learning Rule (4)
§ Weights of hidden units:
§ $w_{ij}$ influences the output only through all the units whose direct input includes $j$

$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial E_d}{\partial \mathrm{net}_j} \frac{\partial \mathrm{net}_j}{\partial w_{ij}} = \frac{\partial E_d}{\partial \mathrm{net}_j} \, x_i = \sum_{k \in \mathrm{downstream}(j)} \frac{\partial E_d}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial \mathrm{net}_j} \, x_i = \sum_{k \in \mathrm{downstream}(j)} -\delta_k \frac{\partial \mathrm{net}_k}{\partial \mathrm{net}_j} \, x_i$

with $\mathrm{net}_j = \sum_i w_{ij} \cdot x_i$.

[Figure: hidden unit $j$ feeding downstream units $k$ with outputs $o_k$; input unit $i$; weight $w_{ij}$]

Page 43: CIS 519/419 Applied Machine Learning

Derivation of Learning Rule (5)
§ Weights of hidden units:
§ $w_{ij}$ influences the output only through all the units whose direct input includes $j$

$\frac{\partial E_d}{\partial w_{ij}} = \sum_{k \in \mathrm{downstream}(j)} -\delta_k \frac{\partial \mathrm{net}_k}{\partial \mathrm{net}_j} \, x_i = \sum_{k \in \mathrm{downstream}(j)} -\delta_k \frac{\partial \mathrm{net}_k}{\partial o_j} \frac{\partial o_j}{\partial \mathrm{net}_j} \, x_i = \sum_{k \in \mathrm{downstream}(j)} -\delta_k \, w_{jk} \, o_j (1 - o_j) \, x_i$

[Figure: hidden unit $j$ feeding downstream units $k$ with outputs $o_k$; input unit $i$; weight $w_{ij}$]

Page 44: CIS 519/419 Applied Machine Learning

Derivation of Learning Rule (6)
§ Weights of hidden units:
§ $w_{ij}$ is changed by:

$\Delta w_{ij} = R \, o_j (1 - o_j) \left( \sum_{k \in \mathrm{downstream}(j)} \delta_k w_{jk} \right) x_i = R \, \delta_j x_i$

§ where $\delta_j = o_j (1 - o_j) \sum_{k \in \mathrm{downstream}(j)} \delta_k w_{jk}$
§ First determine the error for the output units.
§ Then, backpropagate this error layer by layer through the network, changing weights appropriately in each layer.

Page 45: CIS 519/419 Applied Machine Learning

The Backpropagation Algorithm
§ Create a fully connected three-layer network. Initialize weights.
§ Until all examples produce the correct output within $\epsilon$ (or other criteria), for each example in the training set do:
1. Compute the network output for this example.
2. For each output unit $k$, compute its error term: $\delta_k = (t_k - o_k) \, o_k (1 - o_k)$
3. For each hidden unit $j$, compute its error term: $\delta_j = o_j (1 - o_j) \sum_{k \in \mathrm{downstream}(j)} \delta_k w_{jk}$
4. Update the network weights with $\Delta w_{ij} = R \, \delta_j x_i$
End epoch. (One step of this loop is sketched in code below.)
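
A minimal sketch of one training step of this algorithm for a single hidden layer, in Python with NumPy; the function names and learning rate are illustrative, and the thresholds $T_j$ are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W1, W2, x, t, R=0.5):
    """W1: (hidden, n_in), W2: (n_out, hidden). Returns updated weights."""
    # Forward step: layer-by-layer predictions.
    h = sigmoid(W1 @ x)                            # hidden activations o_j
    o = sigmoid(W2 @ h)                            # output activations o_k
    # Backward step: error terms, output layer first.
    delta_out = (t - o) * o * (1 - o)              # delta_k
    delta_hid = h * (1 - h) * (W2.T @ delta_out)   # delta_j
    # Weight updates Delta w_ij = R * delta_j * x_i, as outer products.
    W2 += R * np.outer(delta_out, h)
    W1 += R * np.outer(delta_hid, x)
    return W1, W2
```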

Page 46: CIS 519/419 Applied Machine Learning

More Hidden Layers
§ The same algorithm holds for more hidden layers.

[Figure: deeper feed-forward network from input to output]

Page 47: CIS 519/419 Applied Machine Learning

Demo time!
§ Link: https://playground.tensorflow.org/

Page 48: CIS 519/419 Applied Machine Learning

Comments on Training
§ No guarantee of convergence; may oscillate or reach a local minimum.
§ In practice, many large networks can be trained on large amounts of data for realistic problems.
§ Many epochs (tens of thousands) may be needed for adequate training. Large data sets may require many hours of CPU time.
§ Termination criteria: number of epochs; threshold on training set error; no decrease in error; increased error on a validation set.
§ To avoid local minima: several trials with different random initial weights, with majority or voting techniques.

Page 49: CIS 519/419 Applied Machine Learning

Over-training Prevention
§ Running too many epochs may over-train the network and result in over-fitting (improved result on training, decrease in performance on the test set).
§ Keep a hold-out validation set and test accuracy after every epoch.
§ Maintain weights for the best-performing network on the validation set and return it when performance decreases significantly beyond that.
§ To avoid losing training data to validation:
§ Use 10-fold cross-validation to determine the average number of epochs that optimizes validation performance
§ Train on the full data set using this many epochs to produce the final results

Page 50: CIS 519/419 Applied Machine Learning

Over-fitting Prevention
§ Too few hidden units prevent the system from adequately fitting the data and learning the concept.
§ Using too many hidden units leads to over-fitting.
§ A similar cross-validation method can be used to determine an appropriate number of hidden units. (general)
§ Another approach to preventing over-fitting is weight decay: all weights are multiplied by some fraction in (0, 1) after every epoch. (See the sketch below.)
§ Encourages smaller weights and less complex hypotheses
§ Equivalently: change the error function to include a term for the sum of the squares of the weights in the network. (general)
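
A minimal sketch of the two equivalent views of weight decay above; the decay fraction, `lam`, `R`, and the stand-in gradient are illustrative:

```python
import numpy as np

R, lam = 0.1, 1e-4
W = np.random.randn(3, 3)
grad = np.random.randn(3, 3)    # stand-in for dE/dW from backprop

# View 1: multiply all weights by a fraction in (0, 1) after every epoch.
W_decayed = 0.999 * W

# View 2: add lam * sum(W**2) to the error; its gradient adds 2*lam*W.
W_updated = W - R * (grad + 2 * lam * W)
```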

Page 51: CIS 519/419 Applied Machine Learning

Dropout Training
§ Proposed by (Hinton et al., 2012)
§ Each time, decide whether to delete one hidden unit with some probability $p$

Page 52: CIS 519/419 Applied Machine Learning

Dropout Training
§ Dropout of 50% of the hidden units and 20% of the input units (Hinton et al., 2012)

Page 53: CIS 519/419 Applied Machine Learning

Dropout Training
§ Model averaging effect
§ Among $2^H$ models, with shared parameters
§ $H$: number of units in the network
§ Only a few get trained
§ Much stronger than the known regularizers
§ What about the input space?
§ Do the same thing! (A sketch of dropout masks follows below.)
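
A minimal sketch of dropout masks in Python with NumPy, using the illustrative probabilities mentioned above (drop 20% of input units, 50% of hidden units):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p_drop):
    """Zero each unit with probability p_drop (at training time)."""
    mask = rng.random(a.shape) >= p_drop
    return a * mask

x = np.ones(10)
h = np.ones(20)
x_dropped = dropout(x, p_drop=0.2)   # input units
h_dropped = dropout(h, p_drop=0.5)   # hidden units
```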

Page 54: CIS 519/419 Applied Machine Learning

Input-Output Coding
§ Appropriate coding of inputs and outputs can make the learning problem easier and improve generalization.
§ Encode each binary feature as a separate input unit.
§ For multi-valued features, include one binary unit per value rather than trying to encode input information in fewer units.
§ Very common today to use a distributed representation of the input – real-valued, dense representation.
§ For a disjoint categorization problem, it is best to have one output unit for each category rather than encoding N categories into log N bits.

One way to do it, if you start with a collection of sparsely represented examples, is to use dimensionality reduction methods (sketched in code below):
- Your m examples are represented as an m × 10^6 matrix
- Multiply it by a random matrix of size 10^6 × 300, say.
- Random matrix: Normal(0, 1)
- New representation: m × 300 dense rows
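
A minimal sketch of this random-projection recipe in Python with NumPy, at a smaller, illustrative scale (10,000 features projected to 300 rather than 10^6):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, k = 50, 10_000, 300
X_sparse = (rng.random((m, d)) < 0.001).astype(float)  # sparse examples
Rmat = rng.normal(0.0, 1.0, size=(d, k))               # Normal(0, 1) matrix
X_dense = X_sparse @ Rmat                              # m x 300 dense rows
```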

Page 55: CIS 519/419 Applied Machine Learning

Hidden Layer Representation
§ The weight-tuning procedure sets weights that define whatever hidden unit representation is most effective at minimizing the error.
§ Sometimes Backpropagation will define new hidden layer features that are not explicit in the input representation, but which capture properties of the input instances that are most relevant to learning the target function.
§ Trained hidden units can be seen as newly constructed features that re-represent the examples so that they are linearly separable.

Page 56: CIS 519/419 Applied Machine Learning

Gradient Checks are useful!
§ They allow you to know that there are no bugs in your neural network implementation!
§ Implement your gradient
§ Implement a finite-difference computation by looping through the parameters of your network, adding and subtracting a small epsilon (∼10^-4) and estimating derivatives:

$f'(\theta) \approx \frac{J(\theta^+) - J(\theta^-)}{2\epsilon}, \qquad \theta^{\pm} = \theta \pm \epsilon$

§ Compare the two and make sure they are almost the same (see the sketch below)
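
A minimal sketch of the centered finite-difference check in Python with NumPy; the loss function and epsilon are illustrative:

```python
import numpy as np

def numerical_grad(J, theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        grad[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    return grad

J = lambda th: np.sum(th ** 2)          # toy loss with known gradient 2*theta
theta = np.array([1.0, -2.0, 3.0])
print(np.allclose(numerical_grad(J, theta), 2 * theta, atol=1e-6))  # True
```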

Page 57: CIS 519/419 Applied Machine Learning

Auto-associative Network
§ An auto-associative network trained with 8 inputs, 3 hidden units and 8 output nodes, where the output must reproduce the input.
§ When trained with vectors with only one bit on:

INPUT → HIDDEN
10000000 → .89 .40 0.8
01000000 → .97 .99 .71
…
00000001 → .01 .11 .88

§ Learned the standard 3-bit encoding for the 8-bit vectors.
§ Also illustrates the data compression aspects of learning.

[Figure: the 8-3-8 network reproducing its input (e.g., 10001000) at the output]

Page 58: CIS 519/419 Applied Machine Learning

Sparse Auto-encoder
§ Encoding: $\boldsymbol{y} = f(W\boldsymbol{x} + \boldsymbol{b})$
§ Decoding: $\hat{\boldsymbol{x}} = g(W'\boldsymbol{y} + \boldsymbol{b}')$
§ Goal: perfect reconstruction of the input vector $\boldsymbol{x}$ by the output $\hat{\boldsymbol{x}}$, where $\boldsymbol{\theta} = \{W, W'\}$
§ Minimize an error function $l(\hat{\boldsymbol{x}}, \boldsymbol{x})$, for example: $l(\hat{\boldsymbol{x}}, \boldsymbol{x}) = \|\hat{\boldsymbol{x}} - \boldsymbol{x}\|^2$
§ And regularize it: $\min_{\theta} \sum_{\boldsymbol{x}} l(\hat{\boldsymbol{x}}, \boldsymbol{x}) + \sum_i |w_i|$ (this objective is sketched in code below)
§ After optimization, drop the reconstruction layer and add a new layer
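
A minimal sketch of this objective in Python with NumPy, assuming sigmoid choices for $f$ and $g$ (an assumption, not specified on the slide); all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_loss(W, Wp, b, bp, X, lam=1e-3):
    Y = sigmoid(X @ W.T + b)                 # encoding: y = f(Wx + b)
    X_hat = sigmoid(Y @ Wp.T + bp)           # decoding: x_hat = g(W'y + b')
    recon = np.sum((X_hat - X) ** 2)         # sum_x l(x_hat, x)
    return recon + lam * np.sum(np.abs(W))   # plus the L1 regularizer

rng = np.random.default_rng(0)
X = rng.random((4, 6))
W, Wp = rng.normal(size=(3, 6)), rng.normal(size=(6, 3))
print(autoencoder_loss(W, Wp, np.zeros(3), np.zeros(6), X))
```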

Page 59: CIS 519/419 Applied Machine Learning

Stacking Auto-encoders
§ Add a new layer, and a reconstruction layer for it.
§ And try to tune its parameters such that its input is reconstructed well.
§ And continue this for each layer.

Page 60: CIS 519/419 Applied Machine Learning

Beyond Supervised Learning
§ So far what we had was purely supervised.
§ Initialize parameters randomly
§ Train in supervised mode, typically using backprop
§ Used in most practical systems (e.g., speech and image recognition)
§ Unsupervised, layer-wise pre-training + supervised classifier on top
§ Train each layer unsupervised, one after the other
§ Train a supervised classifier on top, keeping the other layers fixed
§ Good when very few labeled samples are available
§ Unsupervised, layer-wise pre-training + supervised fine-tuning
§ Train each layer unsupervised, one after the other
§ Add a classifier layer, and retrain the whole thing supervised
§ Good when the label set is poor (e.g., pedestrian detection)

We won't talk about unsupervised pre-training here. But it's good to have this in mind, since it is an active topic of research.

Page 61: CIS 519/419 Applied Machine Learning

NN-2

Page 62: CIS 519/419 Applied Machine Learning

Recap: Multi-Layer Perceptrons
§ Multi-layer network
§ A global approximator
§ Different rules for training it
§ The Back-propagation
§ Forward step
§ Backpropagation of errors
§ Congrats! Now you know one of the important algorithms in neural networks!
§ Today:
§ Convolutional Neural Networks
§ Recurrent Neural Networks

[Figure: feed-forward network with Input, Hidden, and Output layers]

Page 63: CIS 519/419 Applied Machine Learning

Receptive Fields
§ The receptive field of an individual sensory neuron is the particular region of the sensory space (e.g., the body surface, or the retina) in which a stimulus will trigger the firing of that neuron.
§ In the auditory system, receptive fields can correspond to volumes in auditory space
§ Designing "proper" receptive fields for the input neurons is a significant challenge.
§ Consider a task with image inputs
§ Receptive fields should give expressive features from the raw input to the system
§ How would you design the receptive fields for this problem?

Page 64: CIS 519/419 Applied Machine Learning

§ A fully connected layer:
§ Example:
§ 100x100 images
§ 1000 units in the input
§ Problems:
§ 10^7 edges!
§ Spatial correlations lost!
§ Variable-sized inputs.

[Figure: fully connected layer over the input layer]

Slide Credit: Marc'Aurelio Ranzato

Page 65: CIS 519/419 Applied Machine Learning

§ Consider a task with image inputs:
§ A locally connected layer:
§ Example:
§ 100x100 images
§ 1000 units in the input
§ Filter size: 10x10
§ Local correlations preserved!
§ Problems:
§ 10^5 edges
§ This parameterization is good when the input image is registered (e.g., face recognition).
§ Variable-sized inputs, again.

[Figure: locally connected layer over the input layer]

Slide Credit: Marc'Aurelio Ranzato

Page 66: CIS 519/419 Applied Machine Learning

Convolutional Layer
§ A solution:
§ Filters to capture different patterns in the input space.
§ Share parameters across different locations (assuming the input is stationary)
§ Convolutions with learned filters
§ Filters will be learned during training.
§ The issue of variable-sized inputs will be resolved with a pooling layer.

So what is a convolution?

Slide Credit: Marc'Aurelio Ranzato

Page 67: CIS 519/419 Applied Machine Learning

Convolution Operator
§ Convolution operator: $*$
§ takes two functions and gives another function
§ One dimension:

$(x * h)(t) = \int x(\tau) \, h(t - \tau) \, d\tau$

$(x * h)[n] = \sum_{m} x[m] \, h[n - m]$

"Convolution" is very similar to "cross-correlation", except that in convolution one of the functions is flipped. (The discrete case is sketched in code below.)
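
A minimal sketch of the discrete 1-D convolution $(x * h)[n] = \sum_m x[m] h[n-m]$ in Python with NumPy; the function name is illustrative:

```python
import numpy as np

def conv1d(x, h):
    n_out = len(x) + len(h) - 1
    y = np.zeros(n_out)
    for n in range(n_out):
        for m in range(len(x)):
            if 0 <= n - m < len(h):
                y[n] += x[m] * h[n - m]   # h is "flipped" relative to x
    return y

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.0, 1.0, 0.5])
print(conv1d(x, h))                        # matches np.convolve(x, h)
```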

Page 68: CIS 519/419 Applied Machine Learning

Convolution Operator (2)
§ Convolution in two dimensions:
§ The same idea: flip one matrix and slide it on the other matrix
§ Example: sharpen kernel

Try other kernels: http://setosa.io/ev/image-kernels/

Page 69: CIS 519/419 Applied Machine Learning

Convolution Operator (3)
§ Convolution in two dimensions:
§ The same idea: flip one matrix and slide it on the other matrix

Page 70: CIS 519/419 Applied Machine Learning

Complexity of Convolution
§ Complexity of the convolution operator is $O(n \log n)$, for $n$ inputs.
§ Uses the Fast Fourier Transform (FFT)
§ For two dimensions, each convolution takes $O(MN \log MN)$ time, where the size of the input is $MN$.

Slide Credit: Marc'Aurelio Ranzato

Page 71: CIS 519/419 Applied Machine Learning

Convolutional Layer
§ The convolution of the input (vector/matrix) with weights (vector/matrix) results in a response vector/matrix.
§ We can have multiple filters in each convolutional layer, each producing an output.
§ If it is an intermediate layer, it can have multiple inputs!
§ One can add a nonlinearity at the output of the convolutional layer.

[Figure: a convolutional layer applying several filters to its inputs]

Page 72: CIS 519/419 Applied Machine Learning

Pooling Layer
§ How to handle variable-sized inputs?
§ A layer which reduces inputs of different size to a fixed size.
§ Pooling

Slide Credit: Marc'Aurelio Ranzato

Page 73: CIS 519/419 Applied Machine Learning

Pooling Layer
§ How to handle variable-sized inputs?
§ A layer which reduces inputs of different size to a fixed size.
§ Pooling
§ Different variations (max and average pooling are sketched below):
§ Max pooling: $h[n] = \max_{i \in N(n)} \tilde{h}[i]$
§ Average pooling: $h[n] = \frac{1}{n} \sum_{i \in N(n)} \tilde{h}[i]$
§ L2 pooling: $h[n] = \sqrt{\frac{1}{n} \sum_{i \in N(n)} \tilde{h}^2[i]}$
§ etc.
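
A minimal sketch of max and average pooling over fixed windows in Python with NumPy; the window size is illustrative, and pooling over the whole response is one way to get a fixed-size output:

```python
import numpy as np

def pool1d(h, window, mode="max"):
    n_out = len(h) // window
    out = np.zeros(n_out)
    for n in range(n_out):
        region = h[n * window:(n + 1) * window]   # the neighborhood N(n)
        out[n] = region.max() if mode == "max" else region.mean()
    return out

h = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
print(pool1d(h, window=2, mode="max"))    # [3. 5. 4.]
print(pool1d(h, window=2, mode="avg"))    # [2. 3.5 2.]
```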

Page 74: CIS 519/419 Applied Machine Learning

Convolutional Nets
§ One stage structure: Convolution → Pooling
§ Whole system: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label

Page 75: CIS 519/419 Applied Machine Learning

Training a ConvNet
§ The same procedure from Back-propagation applies here.
§ Remember, in backprop we started from the error terms in the last stage, and passed them back to the previous layers, one by one.
§ Back-prop for the pooling layer:
§ Consider, for example, the case of "max" pooling.
§ This layer only routes the gradient to the input that has the highest value in the forward pass.
§ Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
§ Therefore we have: $\delta = \frac{\partial E_d}{\partial y_i}$

[Figure: the staged ConvNet (Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label), with $\delta^{\text{this layer}} = \partial E_d / \partial y^{\text{this layer}}$ computed from $\delta^{\text{next layer}} = \partial E_d / \partial y^{\text{next layer}}$]

Page 76: CIS 519/419 Applied Machine Learning

Training a ConvNet
§ Back-prop for the convolutional layer:
§ We derive the update rules for a 1D convolution, but the idea is the same for bigger dimensions.

The convolution: $\tilde{y} = w * x \iff \tilde{y}_i = \sum_{a=0}^{m-1} w_a \, x_{i-a} = \sum_{a} w_{i-a} \, x_a \quad \forall i$

A differentiable nonlinearity: $y = f(\tilde{y}) \iff y_i = f(\tilde{y}_i) \quad \forall i$

Gradient with respect to the filter (now we have everything in this layer to update the filter):
$\frac{\partial E_d}{\partial w_a} = \sum_i \frac{\partial E_d}{\partial \tilde{y}_i} \frac{\partial \tilde{y}_i}{\partial w_a} = \sum_i \frac{\partial E_d}{\partial \tilde{y}_i} \, x_{i-a}$

Through the nonlinearity:
$\frac{\partial E_d}{\partial \tilde{y}_i} = \frac{\partial E_d}{\partial y_i} \frac{\partial y_i}{\partial \tilde{y}_i} = \frac{\partial E_d}{\partial y_i} \, f'(\tilde{y}_i)$

Gradient passed to the previous layer:
$\delta = \frac{\partial E_d}{\partial x_a} = \sum_i \frac{\partial E_d}{\partial \tilde{y}_i} \frac{\partial \tilde{y}_i}{\partial x_a} = \sum_i \frac{\partial E_d}{\partial \tilde{y}_i} \, w_{i-a}$

Now we can repeat this for each stage of the ConvNet.

[Figure: the staged ConvNet (Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label) with error terms passed back stage by stage]

Page 77: CIS 519/419 Applied Machine Learning

Convolutional Nets
§ Feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013]

[Figure: Input Image → Stage 1 → Stage 2 → Stage 3 → Fully Connected Layer → Class Label, with visualized features per stage]

Page 78: CIS 519/419 Applied Machine Learning

Demo (Teachable Machines)
https://teachablemachine.withgoogle.com/

Page 79: CIS 519/419 Applied Machine Learning

ConvNet Roots
§ Fukushima, 1980s: designed a network with the same basic structure but did not train it by backpropagation.
§ The first successful applications of Convolutional Networks were by Yann LeCun in the 1990's (LeNet)
§ Was used to read zip codes, digits, etc.
§ Many variants nowadays, but the core idea is the same
§ Example: a system developed at Google (GoogLeNet)
§ Compute different filters
§ Compose one big vector from all of them
§ Layer this iteratively

See more: http://arxiv.org/pdf/1409.4842v1.pdf

Page 80: CIS 519/419 Applied Machine Learning

Depth Matters

[Figure: slide from Kaiming He, 2015]

Page 81: CIS 519/419 Applied Machine Learning

Vanishing/Exploding Gradients
§ Vanishing gradients are quite prevalent and a serious issue.
§ A real example:
§ Training a feed-forward network
§ y-axis: sum of the gradient norms
§ Earlier layers have an exponentially smaller sum of gradient norms
§ This will make training earlier layers much slower.

The gradient can become very small or very large quickly, and the locality assumption of gradient descent breaks down (vanishing gradient) [Bengio et al., 1994].

Page 82: CIS 519/419 Applied Machine Learning

Vanishing/Exploding Gradients
§ In architectures with many layers (e.g., > 10) the gradients can easily explode or vanish.
§ Many methods have been proposed to reduce the effect of vanishing gradients, although it is still a problem
§ Introduce shorter paths between long connections
§ Abandon stochastic gradient descent in favor of a much more sophisticated Hessian-Free (HF) optimization
§ Clip gradients with bigger sizes (see the sketch below):

Define $g = \frac{\partial E}{\partial W}$. If $\|g\| \geq \mathit{threshold}$, then $g \leftarrow \frac{\mathit{threshold}}{\|g\|} \, g$
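
A minimal sketch of this gradient-norm clipping rule in Python with NumPy; the function name is illustrative:

```python
import numpy as np

def clip_gradient(g, threshold):
    norm = np.linalg.norm(g)
    if norm >= threshold:
        g = (threshold / norm) * g   # rescale so the norm equals threshold
    return g

g = np.array([3.0, 4.0])                 # norm 5
print(clip_gradient(g, threshold=1.0))   # [0.6 0.8]
```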

Page 83: CIS 519/419 Applied Machine Learning

Practical Tips
§ Before large-scale experiments, test on a small subset of the data and check that the error goes to zero.
§ Overfitting on a small training set
§ Visualize features (feature maps need to be uncorrelated) and have high variance.
§ Bad training: many hidden units ignore the input and/or exhibit strong correlations.

Figure Credit: Marc'Aurelio Ranzato

Page 84: CIS 519/419 Applied Machine Learning

Debugging
§ Training diverges:
§ Learning rate may be too large → decrease the learning rate
§ BackProp is buggy → numerical gradient checking
§ Loss is minimized but accuracy is low
§ Check the loss function: Is it appropriate for the task you want to solve? Does it have degenerate solutions?
§ NN is underperforming / under-fitting
§ Compute the number of parameters → if too small, make the network larger
§ NN is too slow
§ Compute the number of parameters → use a distributed framework, use a GPU, make the network smaller

Many of these points apply to many machine learning models, not just neural networks.

Page 85: CIS 519/419 Applied Machine Learning

CNN for Vector Inputs
§ Let's study another variant of CNN for language
§ Example: sentence classification (say, spam or not spam)
§ First step: represent each word with a vector in $\mathbb{R}^d$

This is not a spam → concatenate the word vectors

§ Now we can assume that the input to the system is a vector in $\mathbb{R}^{dl}$
§ where the input sentence has length $l$ ($l = 5$ in our example)
§ and each word vector has length $d$ ($d = 7$ in our example)

Page 86: CIS 519/419 Applied Machine Learning

Convolutional Layer on Vectors
§ Think about a single convolutional layer
§ A bunch of vector filters
§ Each defined in $\mathbb{R}^{dh}$
§ where $h$ is the number of words the filter covers
§ and $d$ is the size of the word vector
§ Find its (modified) convolution with the input vector
§ Result of the convolution with the filter:

$c_1 = f(w \cdot x_{1:h}), \quad c_2 = f(w \cdot x_{h+1:2h}), \quad c_3 = f(w \cdot x_{2h+1:3h}), \quad c_4 = f(w \cdot x_{3h+1:4h})$

$c = [c_1, \ldots, c_{n-h+1}]$

§ Convolution with a filter that spans 2 words is operating on all of the bigrams (vectors of two consecutive words, concatenated): "this is", "is not", "not a", "a spam".
§ Regardless of whether it is grammatical (not appealing linguistically)

Page 87: CIS 519/419 Applied Machine Learning

Convolutional Layer on Vectors

This is not a spam
1. Get word vectors for each word
2. Concatenate the vectors
3. Perform the convolution with each filter in the filter bank
4. This gives a set of response vectors: one per filter, each of length (# words − length of filter + 1)

How are we going to handle the variable-sized response vectors? Pooling!

Page 88: CIS 519/419 Applied Machine Learning

Convolutional Layer on Vectors
§ Pooling on the filter responses (some choices for pooling: k-max, mean, etc.) turns the variable-sized response vectors into a fixed-sized vector.
§ Now we can pass the fixed-sized vector to a logistic unit (softmax), or give it to a multi-layer network (last session)

Page 89: CIS 519/419 Applied Machine Learning

Recurrent Neural Networks
§ Prediction on chain-like input:
§ Example: POS tagging words of a sentence

This is a sample sentence
DT   VBZ  DT  NN     NN

§ Issues:
§ Structure in the output: there are connections between labels
§ Interdependence between elements of the inputs: the final decision is based on an intricate interdependence of the words on each other.
§ Variable-size inputs: e.g., sentences differ in size
§ How would you go about solving this task?

Page 90: CIS 519/419 Applied Machine Learning

Recurrent Neural Networks
§ Infinite uses of finite structure

[Figure: unrolled chain of input → hidden state representation → output: $X_0 \to H_0 \to Y_0$, $X_1 \to H_1 \to Y_1$, $X_2 \to H_2 \to Y_2$, $X_3 \to H_3 \to Y_3$, with the hidden state passed along the chain]

Page 91: CIS 519/419 Applied Machine Learning

Recurrent Neural Networks
§ A chain RNN:
§ Has a chain-like structure
§ Each input is replaced with its vector representation $x_t$
§ The hidden (memory) unit $h_t$ contains information about previous inputs and previous hidden units $h_{t-1}, h_{t-2}$, etc.
§ It is computed from the past memory and the current word. It summarizes the sentence up to that time.

[Figure: input layer $x_{t-1}, x_t, x_{t+1}$ feeding memory layer $h_{t-1}, h_t, h_{t+1}$]

Page 92: CIS 519/419 Applied Machine Learning

Recurrent Neural Networks
§ A popular way of formalizing it (see the code sketch below):

$h_t = f(W_h h_{t-1} + W_i x_t)$

§ where $f$ is a nonlinear, differentiable (why?) function.
§ Outputs?
§ Many options; depending on the problem and computational resources

[Figure: input layer $x_{t-1}, x_t, x_{t+1}$ feeding memory layer $h_{t-1}, h_t, h_{t+1}$]
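
A minimal sketch of this recurrence in Python with NumPy, assuming tanh as the nonlinearity $f$ (an illustrative choice); dimensions and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 5, 3
W_i = rng.normal(size=(d_hid, d_in))
W_h = rng.normal(size=(d_hid, d_hid))

def rnn_forward(xs):
    """xs: sequence of input vectors; returns all hidden states."""
    h = np.zeros(d_hid)
    hs = []
    for x in xs:
        h = np.tanh(W_h @ h + W_i @ x)   # same weights reused at every step
        hs.append(h)
    return hs

xs = [rng.normal(size=d_in) for _ in range(4)]   # a length-4 "sentence"
print(rnn_forward(xs)[-1])                        # final summary state
```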

Page 93: CIS 519/419 Applied Machine Learning

Recurrent Neural Networks
§ Prediction for $x_t$, with $h_t$: $y_t = \mathrm{softmax}(W_o h_t)$
§ Some inherent issues with RNNs:
§ Recurrent neural nets cannot capture phrases without prefix context
§ They often capture too much of the last words in the final vector
§ A slightly more sophisticated solution: Long Short-Term Memory (LSTM) units

[Figure: input layer $x_{t-1}, x_t, x_{t+1}$; memory layer $h_{t-1}, h_t, h_{t+1}$; output layer $y_{t-1}, y_t, y_{t+1}$]

Page 94: CIS 519/419 Applied Machine Learning

Recurrent Neural Networks
§ Multi-layer feed-forward NN: DAG
§ Just computes a fixed sequence of non-linear learned transformations to convert an input pattern into an output pattern
§ Recurrent Neural Network: digraph
§ Has cycles.
§ A cycle can act as a memory;
§ The hidden state of a recurrent net can carry along information about a "potentially" unbounded number of previous inputs.
§ They can model sequential data in a much more natural way.

Page 95: CIS 519/419 Applied Machine Learning

Equivalence between RNN and Feed-forward NN
§ Assume that there is a time delay of 1 in using each connection.
§ The recurrent net is just a layered net that keeps reusing the same weights.

[Figure: a 3-unit recurrent net with weights $w_1, \ldots, w_4$ unrolled over time = 0, 1, 2, 3, reusing the weights W1, W2, W3, W4 at each step]

Slide Credit: Geoff Hinton

Page 96: CIS 519/419 Applied Machine Learning

Bi-directional RNN
§ One of the issues with RNNs:
§ Hidden variables capture only one-sided context
§ A bi-directional structure

[Figure: RNN vs. bi-directional RNN]

Page 97: CIS 519/419 Applied Machine Learning

Stack of Bi-directional Networks
§ Use the same idea and make your model more complex:

[Figure: stacked bi-directional RNN layers]

Page 98: CIS 519/419 Applied Machine Learning

Training RNNs
§ How to train such a model?
§ Generalize the same ideas from back-propagation
§ Total output error: $E(y, t) = \sum_{t=1}^{T} E_t(y_t, t_t)$

$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W} \qquad \frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_{t-k}} \frac{\partial h_{t-k}}{\partial W}$

Reminder: $y_t = \mathrm{softmax}(W_o h_t)$ and $h_t = f(W_h h_{t-1} + W_i x_t)$

§ Parameters? $W_o, W_i, W_h$ + the vectors for the input

This is sometimes called "Backpropagation Through Time", since the gradients are propagated back through time.

Page 99: CIS 519/419 Applied Machine Learning

Recurrent Neural Network

$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_{t-k}} \frac{\partial h_{t-k}}{\partial W}$

$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=t-k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=t-k+1}^{t} W_h \, \mathrm{diag}\!\left[ f'(W_h h_{j-1} + W_i x_j) \right]$

$\frac{\partial h_t}{\partial h_{t-1}} = W_h \, \mathrm{diag}\!\left[ f'(W_h h_{t-1} + W_i x_t) \right], \qquad \mathrm{diag}(a_1, \ldots, a_n) = \begin{pmatrix} a_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & a_n \end{pmatrix}$

Reminder: $y_t = \mathrm{softmax}(W_o h_t)$ and $h_t = f(W_h h_{t-1} + W_i x_t)$

Page 100: CIS 519/419 Applied Machine Learning

Unsupervised RNNs
§ What to put here?
§ He was locked up after he ______.
§ Note that:
§ This is unsupervised; you can use tons of data to train this.
§ While training the model, we train the word representations too.

[Figure: context words $x_{t-2}, x_{t-1}, x_t$ feeding the memory layer $h_{t-1}, h_t, h_{t+1}$; the output $y$ predicts the next word]

Page 101: CIS 519/419 Applied Machine Learning

Unsupervised RNNs
§ This would result in word representations
§ that convey information about their co-occurrence
§ or some form of weak "semantic" similarity
§ A big part of progress (past 5-10 years) is partly due to discovering better ways to create unsupervised context-sensitive representations