Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein


Page 1: Logistic Regression & Neural Networks (title slide)

Page 2: Logistic Regression

Page 3: Perceptron & Probabilities

• What if we want a probability p(y|x)?
• The perceptron gives us a prediction y
• Let's illustrate this with binary classification

Illustrations: Graham Neubig

Page 4: The logistic function

• "Softer" function than in the perceptron
• Can account for uncertainty
• Differentiable
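The logistic function described above can be sketched in a few lines of Python (the function name is mine):

```python
import math

def logistic(z):
    """The logistic function: squashes any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Unlike the perceptron's hard sign, the output expresses uncertainty:
print(logistic(0.0))   # 0.5: maximally uncertain
print(logistic(4.0))   # close to 1: confident positive
print(logistic(-4.0))  # close to 0: confident negative
```

Its derivative exists everywhere, which is what makes the gradient-based training on the following slides possible.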

Page 5: Logistic regression: how to train?

• Train based on conditional likelihood
• Find parameters w that maximize the conditional likelihood of all answers y_i given examples x_i
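As a sketch of this objective (my own helper, not from the slides): for labels y in {−1, +1} under the logistic model, the conditional log-likelihood of the data can be computed as

```python
import math

def log_likelihood(w, data):
    """Sum of log P(y_i | x_i) under the logistic model P(y=1|x) = 1/(1 + e^{-w.x}).

    For y in {-1, +1}, log P(y|x) = -log(1 + e^{-y * (w.x)})."""
    total = 0.0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        total += -math.log(1.0 + math.exp(-y * score))
    return total
```

Training searches for the w that makes this quantity as large as possible.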

Page 6: Stochastic gradient ascent (or descent)

• Online training algorithm for logistic regression
• and other probabilistic models
• Update weights for every training example
• Move in direction given by gradient
• Size of update step scaled by learning rate

Page 7: Gradient of the logistic function

Page 8: Example: Person/not-person classification problem
Given an introductory sentence in Wikipedia, predict whether the article is about a person

Page 9: Example: initial update

Page 10: Example: second update

Page 11: How to set the learning rate?

• Various strategies
• decay over time:

α = 1 / (C + t)

where C is a parameter and t is the number of samples seen

• Use held-out test set, increase learning rate when likelihood increases
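The decay schedule above, as a tiny helper (the default value of C is illustrative):

```python
def decayed_rate(t, C=10.0):
    """Decay-over-time schedule from the slide: alpha = 1 / (C + t),
    where C is a tunable parameter and t counts the samples seen so far."""
    return 1.0 / (C + t)

# The step size shrinks as training progresses:
for t in [0, 10, 100, 1000]:
    print(t, decayed_rate(t))
```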

Page 12: Multiclass version

Page 13: Some models are better than others…

• Consider these 2 examples
• Which of the 2 models below is better?

Classifier 2 will probably generalize better! It does not include irrelevant information => the smaller model is better

Page 14: Regularization

• A penalty on adding extra weights

• L2 regularization: ‖w‖₂²
  • big penalty on large weights
  • small penalty on small weights

• L1 regularization: ‖w‖₁
  • Uniform increase when large or small
  • Will cause many weights to become zero
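A minimal sketch of the two penalties (function names are mine); each would be added to the training objective with a strength lam:

```python
def l2_penalty(w, lam):
    """L2: lam * sum of squared weights -- big penalty on large weights,
    small penalty on small ones."""
    return lam * sum(wi * wi for wi in w)

def l1_penalty(w, lam):
    """L1: lam * sum of absolute weights -- uniform pressure whether a weight
    is large or small, which drives many weights to exactly zero."""
    return lam * sum(abs(wi) for wi in w)

print(l2_penalty([3.0, -4.0], 0.1))  # 2.5
print(l1_penalty([3.0, -4.0], 0.1))  # ~0.7
```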

Page 15: L1 regularization in online learning

Page 16: What you should know

• Standard supervised learning set-up for text classification
  • Difference between train vs. test data
  • How to evaluate
• 3 examples of supervised linear classifiers
  • Naïve Bayes, Perceptron, Logistic Regression
  • Learning as optimization: what is the objective function optimized?
  • Difference between generative vs. discriminative classifiers
  • Smoothing, regularization
  • Overfitting, underfitting

Page 17: Neural networks

Page 18: Person/not-person classification problem
Given an introductory sentence in Wikipedia, predict whether the article is about a person

Page 19: Formalizing binary prediction

Page 20: The Perceptron: a "machine" to calculate a weighted sum

sign( Σᵢ wᵢ · φᵢ(x) )

[Figure: feature values φ("A") = 1, φ("site") = 1, φ(",") = 2, φ("located") = 1, φ("in") = 1, φ("Maizuru") = 1, φ("Kyoto") = 1, φ("priest") = 0, φ("black") = 0, multiplied by a weight column and summed to give output −1]
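The weighted sum on this slide can be sketched as follows. The feature values are from the slide; the weight values are my best reading of the garbled weight column, so treat them as illustrative:

```python
def predict(weights, features):
    """The perceptron: a weighted sum of feature values, passed through sign."""
    score = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return 1 if score >= 0 else -1  # convention: sign(0) taken as +1

# Feature values from the slide's Wikipedia sentence:
features = {"A": 1, "site": 1, ",": 2, "located": 1, "in": 1,
            "Maizuru": 1, "Kyoto": 1, "priest": 0, "black": 0}
# Weight column as best recoverable from the slide (an assumption):
weights = {"A": 0, "site": -3, ",": 0, "located": 0, "in": 0,
           "Maizuru": 0, "Kyoto": 0, "priest": 2, "black": 0}
print(predict(weights, features))  # -1: not a person
```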

Page 21: The Perceptron: Geometric interpretation

[Figure: X and O points in feature space, separated by a linear decision boundary]

Page 22: The Perceptron: Geometric interpretation

[Figure: the same X and O points, shown with the separating line]

Page 23: Limitation of perceptron
● can only find linear separations between positive and negative examples

[Figure: X and O points arranged so that no single line separates them (XOR pattern)]

Page 24: Neural Networks
● Connect together multiple perceptrons

[Figure: the same feature values as before feeding into several connected perceptron units]

● Motivation: Can represent non-linear functions!

Page 25: Neural Networks: key terms

[Figure: multi-layer network over the same features]

• Input (aka features)
• Output
• Nodes
• Layers
• Hidden layers
• Activation function (non-linear)
• Multi-layer perceptron

Page 26: Example
● Create two classifiers

φ0(x1) = {-1, 1}   φ0(x2) = {1, 1}
φ0(x3) = {-1, -1}  φ0(x4) = {1, -1}

[Figure: two sign units over inputs φ0[0], φ0[1] and a bias input 1, with weights (w0,0, b0,0) = (1, 1, −1) and (w0,1, b0,1) = (−1, −1, −1):
φ1[0] = sign(φ0[0] + φ0[1] − 1)
φ1[1] = sign(−φ0[0] − φ0[1] − 1)]

Page 27: Example
● These classifiers map to a new space

φ0(x1) = {-1, 1}   φ0(x2) = {1, 1}
φ0(x3) = {-1, -1}  φ0(x4) = {1, -1}

[Figure: the two units map the φ0 points into φ1 space:]
φ1(x1) = {-1, -1}
φ1(x2) = {1, -1}
φ1(x3) = {-1, 1}
φ1(x4) = {-1, -1}

Page 28: Example
● In the new space, the examples are linearly separable!

φ1(x1) = {-1, -1}  φ1(x2) = {1, -1}
φ1(x3) = {-1, 1}   φ1(x4) = {-1, -1}

[Figure: a final unit with weights (1, 1) and bias 1 separates them:
φ2[0] = y = sign(φ1[0] + φ1[1] + 1)]

Page 29: Example wrap-up: Forward propagation
● The final net

[Figure: the complete network, with tanh in place of sign:
φ1[0] = tanh(φ0[0] + φ0[1] − 1)
φ1[1] = tanh(−φ0[0] − φ0[1] − 1)
φ2[0] = tanh(φ1[0] + φ1[1] + 1)]
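A sketch of forward propagation through this final net, with weights and biases as best recoverable from the slide ((1, 1, −1), (−1, −1, −1), and (1, 1, 1)); the sign of the output solves the XOR-style problem from the earlier slides:

```python
import math

def forward(phi0):
    """Forward propagation: two tanh hidden units, then one tanh output unit."""
    h1 = math.tanh( phi0[0] + phi0[1] - 1)  # weights (1, 1), bias -1
    h2 = math.tanh(-phi0[0] - phi0[1] - 1)  # weights (-1, -1), bias -1
    return math.tanh(h1 + h2 + 1)           # weights (1, 1), bias +1

for x in [(-1, 1), (1, 1), (-1, -1), (1, -1)]:
    print(x, forward(x))  # positive for the O points, negative for the X points
```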

Page 30: Softmax function for multiclass classification

● Sigmoid function for multiple classes:

P(y | x) = e^{w·φ(x,y)} / Σ_{y'} e^{w·φ(x,y')}

(numerator: the current class; denominator: sum over the classes)

● Can be expressed using matrix/vector ops:

r = exp(W · φ(x, y))
p = r / Σ_{r̃∈r} r̃
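The softmax above, as a short sketch:

```python
import math

def softmax(scores):
    """Exponentiate each class score (r = exp(W . phi(x, y))), then
    normalize by the sum over classes (p = r / sum(r))."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.1])
print(p)  # three probabilities summing to 1, largest for the largest score
```

In practice one subtracts max(scores) before exponentiating so that large scores do not overflow; the result is unchanged.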

Page 31: Stochastic Gradient Descent
Online training algorithm for probabilistic models

w = 0
for I iterations:
    for each labeled pair (x, y) in the data:
        w += α * dP(y|x)/dw

In other words:
• For every training example, calculate the gradient (the direction that will increase the probability of y)
• Move in that direction, multiplied by learning rate α
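The pseudocode above, fleshed out into a runnable sketch for binary logistic regression with y in {−1, +1}. The toy data and defaults are mine; the gradient is the one derived on the next slide:

```python
import math

def grad_p(w, x, y):
    """dP(y|x)/dw for the sigmoid model:
    +/- phi(x) * e^{w.x} / (1 + e^{w.x})^2 for y = +1 / -1."""
    s = math.exp(sum(wi * xi for wi, xi in zip(w, x)))
    coeff = y * s / (1.0 + s) ** 2
    return [coeff * xi for xi in x]

def sgd_train(data, dim, iterations=100, alpha=1.0):
    """w = 0; for I iterations, for each labeled pair (x, y): w += alpha * dP(y|x)/dw."""
    w = [0.0] * dim
    for _ in range(iterations):
        for x, y in data:
            w = [wi + alpha * gi for wi, gi in zip(w, grad_p(w, x, y))]
    return w

# Toy problem: the label is the sign of the first feature (second feature is a bias).
data = [([1.0, 1.0], 1), ([-1.0, 1.0], -1), ([2.0, 1.0], 1), ([-2.0, 1.0], -1)]
w = sgd_train(data, dim=2)
```

Note this ascends the probability itself, exactly as in the slide's update rule; ascending the log-probability is the more common variant.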

Page 32: Gradient of the Sigmoid Function
Take the derivative of the probability:

d/dw P(y = 1 | x) = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                  = φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²

d/dw P(y = −1 | x) = d/dw [ 1 − e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                   = −φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²
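These derivatives can be checked numerically; a sketch with a single scalar feature (function names are mine):

```python
import math

def p_pos(w, x):
    """P(y = 1 | x) = e^{w*x} / (1 + e^{w*x}), with a single scalar feature."""
    s = math.exp(w * x)
    return s / (1.0 + s)

def grad_pos(w, x):
    """Closed form from the slide: x * e^{w*x} / (1 + e^{w*x})^2."""
    s = math.exp(w * x)
    return x * s / (1.0 + s) ** 2

# Finite-difference check that the closed form matches the slope of P:
w, x, eps = 0.5, 2.0, 1e-6
numeric = (p_pos(w + eps, x) - p_pos(w - eps, x)) / (2 * eps)
print(abs(numeric - grad_pos(w, x)))  # tiny: the slide's derivative checks out
```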

Page 33: Learning: We Don't Know the Derivative for Hidden Units!
For NNs, we only know the correct tag for the last layer (here y = 1).

[Figure: input φ(x) feeds hidden units h(x) through weights w1, w2, w3; the output unit has weights w4]

dP(y = 1 | x)/dw4 = h(x) · e^{w4·h(x)} / (1 + e^{w4·h(x)})²

dP(y = 1 | x)/dw1 = ?
dP(y = 1 | x)/dw2 = ?
dP(y = 1 | x)/dw3 = ?

Page 34: Answer: Back-Propagation
Calculate the derivative with the chain rule:

dP(y = 1 | x)/dw1 = [dP(y = 1 | x)/d(w4·h(x))] · [d(w4·h(x))/dh1(x)] · [dh1(x)/dw1]

where dP(y = 1 | x)/d(w4·h(x)) = e^{w4·h(x)} / (1 + e^{w4·h(x)})² is the error of the next unit (δ4), d(w4·h(x))/dh1(x) = w4,1 is the weight, and dh1(x)/dw1 is the gradient of this unit.

In general, calculate δi based on the next units j:

dP(y = 1 | x)/dwi = [dhi(x)/dwi] · Σj δj · wi,j
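A sketch of this chain rule on a tiny one-input network with three tanh hidden units and a sigmoid output. The architecture follows the previous slide; all names and numbers are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Hidden tanh units h_i = tanh(w_i * x), then output P(y=1|x) = sigmoid(w4 . h)."""
    h = [math.tanh(wi * x) for wi in w_hidden]
    p = sigmoid(sum(wo * hi for wo, hi in zip(w_out, h)))
    return h, p

def hidden_gradients(x, w_hidden, w_out):
    """dP(y=1|x)/dw_i via the chain rule on this slide:
    (error of the next unit, delta4) * (weight w4,i) * (gradient of this unit)."""
    h, p = forward(x, w_hidden, w_out)
    delta4 = p * (1.0 - p)  # equals e^{w4.h} / (1 + e^{w4.h})^2
    return [delta4 * wo * (1.0 - hi * hi) * x  # d tanh(w_i x)/dw_i = (1 - h_i^2) x
            for wo, hi in zip(w_out, h)]

print(hidden_gradients(0.7, [0.3, -0.2, 0.5], [1.0, -1.0, 0.5]))
```

Each hidden gradient reuses δ4 computed once at the output, which is exactly what makes back-propagation efficient in deeper networks.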

Page 35: Backpropagation = Gradient descent + Chain rule

Page 36: Feed-Forward Neural Nets
All connections point forward: it is a directed acyclic graph (DAG).

[Figure: network from input φ(x) to output y with all edges pointing forward]

Page 37: Neural Networks

• Non-linear classification
• Prediction: forward propagation
  • Vector/matrix operations + non-linearities
• Training: backpropagation + stochastic gradient descent

For more details, see CIML Chap 7