Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein


Page 1: Logistic Regression & Neural Networks (title slide)

Page 2: Logistic Regression

Page 3: Perceptron & Probabilities

• What if we want a probability p(y|x)?
• The perceptron gives us a prediction y
• Let's illustrate this with binary classification

Illustrations: Graham Neubig

Page 4: The logistic function

• "Softer" function than in the perceptron
• Can account for uncertainty
• Differentiable
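The logistic function described above can be sketched in a few lines of Python (the function name is mine):

```python
import math

def logistic(z):
    """The logistic function: squashes any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Unlike the perceptron's hard sign, the output expresses uncertainty:
print(logistic(0.0))   # 0.5: maximally uncertain
print(logistic(4.0))   # close to 1: confident positive
print(logistic(-4.0))  # close to 0: confident negative
```

Its derivative exists everywhere, which is what makes the gradient-based training on the following slides possible.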

Page 5: Logistic regression: how to train?

• Train based on conditional likelihood
• Find parameters w that maximize the conditional likelihood of all answers y_i given examples x_i
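As a sketch of this objective (my own helper, not from the slides): for labels y in {−1, +1} under the logistic model, the conditional log-likelihood of the data can be computed as

```python
import math

def log_likelihood(w, data):
    """Sum of log P(y_i | x_i) under the logistic model P(y=1|x) = 1/(1 + e^{-w.x}).

    For y in {-1, +1}, log P(y|x) = -log(1 + e^{-y * (w.x)})."""
    total = 0.0
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        total += -math.log(1.0 + math.exp(-y * score))
    return total
```

Training searches for the w that makes this quantity as large as possible.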

Page 6: Stochastic gradient ascent (or descent)

• Online training algorithm for logistic regression
• and other probabilistic models
• Update weights for every training example
• Move in direction given by gradient
• Size of update step scaled by learning rate

Page 7: Gradient of the logistic function

Page 8: Example: Person/not-person classification problem
Given an introductory sentence in Wikipedia, predict whether the article is about a person

Page 9: Example: initial update

Page 10: Example: second update

Page 11: How to set the learning rate?

• Various strategies
• decay over time:

α = 1 / (C + t)

where C is a parameter and t is the number of samples seen

• Use held-out test set, increase learning rate when likelihood increases
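The decay schedule above, as a tiny helper (the default value of C is illustrative):

```python
def decayed_rate(t, C=10.0):
    """Decay-over-time schedule from the slide: alpha = 1 / (C + t),
    where C is a tunable parameter and t counts the samples seen so far."""
    return 1.0 / (C + t)

# The step size shrinks as training progresses:
for t in [0, 10, 100, 1000]:
    print(t, decayed_rate(t))
```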

Page 12: Multiclass version

Page 13: Some models are better than others…

• Consider these 2 examples
• Which of the 2 models below is better?

Classifier 2 will probably generalize better! It does not include irrelevant information => the smaller model is better

Page 14: Regularization

• A penalty on adding extra weights

• L2 regularization: ‖w‖₂²
  • big penalty on large weights
  • small penalty on small weights

• L1 regularization: ‖w‖₁
  • Uniform increase when large or small
  • Will cause many weights to become zero
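A minimal sketch of the two penalties (function names are mine); each would be added to the training objective with a strength lam:

```python
def l2_penalty(w, lam):
    """L2: lam * sum of squared weights -- big penalty on large weights,
    small penalty on small ones."""
    return lam * sum(wi * wi for wi in w)

def l1_penalty(w, lam):
    """L1: lam * sum of absolute weights -- uniform pressure whether a weight
    is large or small, which drives many weights to exactly zero."""
    return lam * sum(abs(wi) for wi in w)

print(l2_penalty([3.0, -4.0], 0.1))  # 2.5
print(l1_penalty([3.0, -4.0], 0.1))  # ~0.7
```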

Page 15: L1 regularization in online learning

Page 16: What you should know

• Standard supervised learning set-up for text classification
  • Difference between train vs. test data
  • How to evaluate
• 3 examples of supervised linear classifiers
  • Naïve Bayes, Perceptron, Logistic Regression
  • Learning as optimization: what is the objective function optimized?
  • Difference between generative vs. discriminative classifiers
  • Smoothing, regularization
  • Overfitting, underfitting

Page 17: Neural networks

Page 18: Person/not-person classification problem
Given an introductory sentence in Wikipedia, predict whether the article is about a person

Page 19: Formalizing binary prediction

Page 20: The Perceptron: a "machine" to calculate a weighted sum

sign( Σᵢ wᵢ · φᵢ(x) )

[Figure: feature values φ("A") = 1, φ("site") = 1, φ(",") = 2, φ("located") = 1, φ("in") = 1, φ("Maizuru") = 1, φ("Kyoto") = 1, φ("priest") = 0, φ("black") = 0, multiplied by a weight column and summed to give output −1]
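The weighted sum on this slide can be sketched as follows. The feature values are from the slide; the weight values are my best reading of the garbled weight column, so treat them as illustrative:

```python
def predict(weights, features):
    """The perceptron: a weighted sum of feature values, passed through sign."""
    score = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return 1 if score >= 0 else -1  # convention: sign(0) taken as +1

# Feature values from the slide's Wikipedia sentence:
features = {"A": 1, "site": 1, ",": 2, "located": 1, "in": 1,
            "Maizuru": 1, "Kyoto": 1, "priest": 0, "black": 0}
# Weight column as best recoverable from the slide (an assumption):
weights = {"A": 0, "site": -3, ",": 0, "located": 0, "in": 0,
           "Maizuru": 0, "Kyoto": 0, "priest": 2, "black": 0}
print(predict(weights, features))  # -1: not a person
```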

Page 21: The Perceptron: Geometric interpretation

[Figure: X and O points in feature space, separated by a linear decision boundary]

Page 22: The Perceptron: Geometric interpretation

[Figure: the same X and O points, shown with the separating line]

Page 23: Limitation of perceptron
● can only find linear separations between positive and negative examples

[Figure: X and O points arranged so that no single line separates them (XOR pattern)]

Page 24: Neural Networks
● Connect together multiple perceptrons

[Figure: the same feature values as before feeding into several connected perceptron units]

● Motivation: Can represent non-linear functions!

Page 25: Neural Networks: key terms

[Figure: multi-layer network over the same features]

• Input (aka features)
• Output
• Nodes
• Layers
• Hidden layers
• Activation function (non-linear)
• Multi-layer perceptron

Page 26: Example
● Create two classifiers

φ0(x1) = {-1, 1}   φ0(x2) = {1, 1}
φ0(x3) = {-1, -1}  φ0(x4) = {1, -1}

[Figure: two sign units over inputs φ0[0], φ0[1] and a bias input 1, with weights (w0,0, b0,0) = (1, 1, −1) and (w0,1, b0,1) = (−1, −1, −1):
φ1[0] = sign(φ0[0] + φ0[1] − 1)
φ1[1] = sign(−φ0[0] − φ0[1] − 1)]

Page 27: Example
● These classifiers map to a new space

φ0(x1) = {-1, 1}   φ0(x2) = {1, 1}
φ0(x3) = {-1, -1}  φ0(x4) = {1, -1}

[Figure: the two units map the φ0 points into φ1 space:]
φ1(x1) = {-1, -1}
φ1(x2) = {1, -1}
φ1(x3) = {-1, 1}
φ1(x4) = {-1, -1}

Page 28: Example
● In the new space, the examples are linearly separable!

φ1(x1) = {-1, -1}  φ1(x2) = {1, -1}
φ1(x3) = {-1, 1}   φ1(x4) = {-1, -1}

[Figure: a final unit with weights (1, 1) and bias 1 separates them:
φ2[0] = y = sign(φ1[0] + φ1[1] + 1)]

Page 29: Example wrap-up: Forward propagation
● The final net

[Figure: the complete network, with tanh in place of sign:
φ1[0] = tanh(φ0[0] + φ0[1] − 1)
φ1[1] = tanh(−φ0[0] − φ0[1] − 1)
φ2[0] = tanh(φ1[0] + φ1[1] + 1)]
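A sketch of forward propagation through this final net, with weights and biases as best recoverable from the slide ((1, 1, −1), (−1, −1, −1), and (1, 1, 1)); the sign of the output solves the XOR-style problem from the earlier slides:

```python
import math

def forward(phi0):
    """Forward propagation: two tanh hidden units, then one tanh output unit."""
    h1 = math.tanh( phi0[0] + phi0[1] - 1)  # weights (1, 1), bias -1
    h2 = math.tanh(-phi0[0] - phi0[1] - 1)  # weights (-1, -1), bias -1
    return math.tanh(h1 + h2 + 1)           # weights (1, 1), bias +1

for x in [(-1, 1), (1, 1), (-1, -1), (1, -1)]:
    print(x, forward(x))  # positive for the O points, negative for the X points
```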

Page 30: Softmax function for multiclass classification

● Sigmoid function for multiple classes:

P(y | x) = e^{w·φ(x,y)} / Σ_{y'} e^{w·φ(x,y')}

(numerator: the current class; denominator: sum over the classes)

● Can be expressed using matrix/vector ops:

r = exp(W · φ(x, y))
p = r / Σ_{r̃∈r} r̃
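The softmax above, as a short sketch:

```python
import math

def softmax(scores):
    """Exponentiate each class score (r = exp(W . phi(x, y))), then
    normalize by the sum over classes (p = r / sum(r))."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.1])
print(p)  # three probabilities summing to 1, largest for the largest score
```

In practice one subtracts max(scores) before exponentiating so that large scores do not overflow; the result is unchanged.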

Page 31: Stochastic Gradient Descent
Online training algorithm for probabilistic models

w = 0
for I iterations:
    for each labeled pair (x, y) in the data:
        w += α * dP(y|x)/dw

In other words:
• For every training example, calculate the gradient (the direction that will increase the probability of y)
• Move in that direction, multiplied by learning rate α
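The pseudocode above, fleshed out into a runnable sketch for binary logistic regression with y in {−1, +1}. The toy data and defaults are mine; the gradient is the one derived on the next slide:

```python
import math

def grad_p(w, x, y):
    """dP(y|x)/dw for the sigmoid model:
    +/- phi(x) * e^{w.x} / (1 + e^{w.x})^2 for y = +1 / -1."""
    s = math.exp(sum(wi * xi for wi, xi in zip(w, x)))
    coeff = y * s / (1.0 + s) ** 2
    return [coeff * xi for xi in x]

def sgd_train(data, dim, iterations=100, alpha=1.0):
    """w = 0; for I iterations, for each labeled pair (x, y): w += alpha * dP(y|x)/dw."""
    w = [0.0] * dim
    for _ in range(iterations):
        for x, y in data:
            w = [wi + alpha * gi for wi, gi in zip(w, grad_p(w, x, y))]
    return w

# Toy problem: the label is the sign of the first feature (second feature is a bias).
data = [([1.0, 1.0], 1), ([-1.0, 1.0], -1), ([2.0, 1.0], 1), ([-2.0, 1.0], -1)]
w = sgd_train(data, dim=2)
```

Note this ascends the probability itself, exactly as in the slide's update rule; ascending the log-probability is the more common variant.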

Page 32: Gradient of the Sigmoid Function
Take the derivative of the probability:

d/dw P(y = 1 | x) = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                  = φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²

d/dw P(y = −1 | x) = d/dw [ 1 − e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                   = −φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²
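These derivatives can be checked numerically; a sketch with a single scalar feature (function names are mine):

```python
import math

def p_pos(w, x):
    """P(y = 1 | x) = e^{w*x} / (1 + e^{w*x}), with a single scalar feature."""
    s = math.exp(w * x)
    return s / (1.0 + s)

def grad_pos(w, x):
    """Closed form from the slide: x * e^{w*x} / (1 + e^{w*x})^2."""
    s = math.exp(w * x)
    return x * s / (1.0 + s) ** 2

# Finite-difference check that the closed form matches the slope of P:
w, x, eps = 0.5, 2.0, 1e-6
numeric = (p_pos(w + eps, x) - p_pos(w - eps, x)) / (2 * eps)
print(abs(numeric - grad_pos(w, x)))  # tiny: the slide's derivative checks out
```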

Page 33: Learning: We Don't Know the Derivative for Hidden Units!
For NNs, we only know the correct tag for the last layer (here y = 1).

[Figure: input φ(x) feeds hidden units h(x) through weights w1, w2, w3; the output unit has weights w4]

dP(y = 1 | x)/dw4 = h(x) · e^{w4·h(x)} / (1 + e^{w4·h(x)})²

dP(y = 1 | x)/dw1 = ?
dP(y = 1 | x)/dw2 = ?
dP(y = 1 | x)/dw3 = ?

Page 34: Answer: Back-Propagation
Calculate the derivative with the chain rule:

dP(y = 1 | x)/dw1 = [dP(y = 1 | x)/d(w4·h(x))] · [d(w4·h(x))/dh1(x)] · [dh1(x)/dw1]

where dP(y = 1 | x)/d(w4·h(x)) = e^{w4·h(x)} / (1 + e^{w4·h(x)})² is the error of the next unit (δ4), d(w4·h(x))/dh1(x) = w4,1 is the weight, and dh1(x)/dw1 is the gradient of this unit.

In general, calculate δi based on the next units j:

dP(y = 1 | x)/dwi = [dhi(x)/dwi] · Σj δj · wi,j
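A sketch of this chain rule on a tiny one-input network with three tanh hidden units and a sigmoid output. The architecture follows the previous slide; all names and numbers are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Hidden tanh units h_i = tanh(w_i * x), then output P(y=1|x) = sigmoid(w4 . h)."""
    h = [math.tanh(wi * x) for wi in w_hidden]
    p = sigmoid(sum(wo * hi for wo, hi in zip(w_out, h)))
    return h, p

def hidden_gradients(x, w_hidden, w_out):
    """dP(y=1|x)/dw_i via the chain rule on this slide:
    (error of the next unit, delta4) * (weight w4,i) * (gradient of this unit)."""
    h, p = forward(x, w_hidden, w_out)
    delta4 = p * (1.0 - p)  # equals e^{w4.h} / (1 + e^{w4.h})^2
    return [delta4 * wo * (1.0 - hi * hi) * x  # d tanh(w_i x)/dw_i = (1 - h_i^2) x
            for wo, hi in zip(w_out, h)]

print(hidden_gradients(0.7, [0.3, -0.2, 0.5], [1.0, -1.0, 0.5]))
```

Each hidden gradient reuses δ4 computed once at the output, which is exactly what makes back-propagation efficient in deeper networks.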

Page 35: Backpropagation = Gradient descent + Chain rule

Page 36: Feed-Forward Neural Nets
All connections point forward: it is a directed acyclic graph (DAG).

[Figure: network from input φ(x) to output y with all edges pointing forward]

Page 37: Neural Networks

• Non-linear classification
• Prediction: forward propagation
  • Vector/matrix operations + non-linearities
• Training: backpropagation + stochastic gradient descent

For more details, see CIML Chap 7