(DL Hacks reading group) Difference Target Propagation

Difference Target Propagation. Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, Antoine Biard & Yoshua Bengio. Presenter: Masahiro Suzuki (DL Hacks reading group).


  • Difference Target Propagation. Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, Antoine Biard & Yoshua Bengio

    DL Hacks

  • About the paper: accepted at the ICLR 2015 workshop, from Yoshua Bengio's group.

    It proposes target propagation and a practical variant, difference target propagation.

    The idea builds on an earlier technical report by Bengio (arXiv:1407.7906); target propagation is also discussed in the companion paper "Toward Biologically Plausible Deep Learning".

  • Motivation: the credit assignment problem.

    Back-propagation is the standard solution, but in deep networks with strong non-linearities the back-propagated signal can explode or vanish, and with discrete activations the gradient becomes exactly 0.

  • Back-propagation is also considered biologically implausible, for six reasons:

    1. it is a purely linear computation, whereas biological neurons interleave linear and non-linear operations;

    2. the feedback path needs precise knowledge of the derivatives of the fprop non-linearities;

    3. bprop has to use the exact symmetric (transposed) weights of fprop;

    4. real neurons communicate by (possibly stochastic) binary values (spikes);

    5. computation would have to be precisely clocked to alternate between fprop and bprop phases;

    6. it is not clear where the output targets would come from.

    Target propagation addresses (1)-(4).

  • Notation: h_i is the state of the i-th hidden layer (h_0 = x is the input), W_i the weight matrix, and s_i the activation function of layer i.

    For i < j, h_j can be written as a function of h_i.

    The following excerpts are from the paper:

    The main idea of target propagation is to associate with each feedforward unit's activation value a target value rather than a loss gradient. The target value is meant to be close to the activation value while being likely to have provided a smaller loss (if that value had been obtained in the feedforward phase). In the limit where the target is very close to the feedforward value, target propagation should behave like back-propagation. This link was nicely made in (LeCun, 1986; 1987), which introduced the idea of target propagation and connected it to back-propagation via a Lagrange multipliers formulation (where the constraints require the output of one layer to equal the input of the next layer). A similar idea was recently proposed where the constraints are relaxed into penalties, yielding a different (iterative) way to optimize deep networks (Carreira-Perpinan and Wang, 2014). Once a good target is computed, a layer-local training criterion can be defined to update each layer separately, e.g., via the delta-rule (gradient descent update with respect to the cross-entropy loss).

    By its nature, target propagation can in principle handle stronger (and even discrete) non-linearities, and it deals with biological plausibility issues (1), (2), (3) and (4) described above. Extensions of the precise scheme proposed here could handle (5) and (6) as well, but this is left for future work.

    In this paper, we describe how the general idea of target propagation by using auto-encoders to assign targets to each layer (as introduced in an earlier technical report (Bengio, 2014)) can be employed for supervised training of deep neural networks (sections 2.1 and 2.2). We continue by introducing a linear correction for the imperfectness of the auto-encoders (2.3), leading to robust training in practice. Furthermore, we show how the same principles can be applied to replace back-propagation in the training of auto-encoders (section 2.4). In section 3 we provide several experimental results on rather deep neural networks as well as discrete and stochastic networks and auto-encoders. The results show that the proposed form of target propagation is comparable to back-propagation with RMSprop (Tieleman and Hinton, 2012) - a very popular setting to train deep networks nowadays - and achieves state of the art for training stochastic neural nets on MNIST.

    2 TARGET PROPAGATION

    Although many variants of the general principle of target propagation can be devised, this paper focuses on a specific approach, which is based on the ideas presented in an earlier technical report (Bengio, 2014) and is described in the following.

    2.1 FORMULATING TARGETS

    Let us consider an ordinary (supervised) deep network learning process, where the training data is drawn from an unknown data distribution p(x, y). The network structure is defined by

      h_i = f_i(h_{i-1}) = s_i(W_i h_{i-1}), \quad i = 1, \dots, M \qquad (1)

    where h_i is the state of the i-th hidden layer (where h_M corresponds to the output of the network and h_0 = x) and f_i is the i-th layer feed-forward mapping, defined by a non-linear activation function s_i (e.g. the hyperbolic tangent or the sigmoid function) and the weights W_i of the i-th layer. Here, for simplicity of notation, the bias term of the i-th layer is included in W_i. We refer to the subset of network parameters defining the mapping between the i-th and the j-th layer (0 \le i < j \le M) as \theta_W^{i,j} = \{W_k, k = i+1, \dots, j\}. Using this notation, we can write h_j as a function of h_i depending on parameters \theta_W^{i,j}, that is we can write h_j = h_j(h_i; \theta_W^{i,j}).

    Given a sample (x, y), let L(h_M(x; \theta_W^{0,M}), y) be an arbitrary global loss function measuring the appropriateness of the network output h_M(x; \theta_W^{0,M}) for the target y, e.g. the MSE or cross-entropy for binomial random variables. Then, the training objective corresponds to adapting the network parameters \theta_W^{0,M} so as to minimize the expected global loss E_p\{L(h_M(x; \theta_W^{0,M}), y)\} under the data distribution p(x, y). For i = 1, \dots, M-1 we can write

      L(h_M(x; \theta_W^{0,M}), y) = L(h_M(h_i(x; \theta_W^{0,i}); \theta_W^{i,M}), y) \qquad (2)

    to emphasize the dependency of the loss on the state of the i-th layer.

    Training a network with back-propagation corresponds to propagating error signals through the network to calculate the derivatives of the global loss with respect to the parameters of each layer.
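    To make the setup concrete, here is a minimal NumPy sketch of the network of equation (1) and the global loss of equation (2). The layer sizes, the tanh activations, and the MSE loss are illustrative assumptions, not the paper's exact experimental setup.

        import numpy as np

        rng = np.random.default_rng(0)

        # Illustrative layer sizes: h_0 = x has 784 units, two hidden layers, 10 outputs.
        sizes = [784, 240, 240, 10]
        # W[i] holds the weights of layer i+1 (biases omitted for brevity).
        W = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

        def forward(x):
            # Compute h_i = s_i(W_i h_{i-1}) for all layers (eq. 1); return all states.
            h = [x]
            for W_i in W:
                h.append(np.tanh(W_i @ h[-1]))
            return h

        def global_loss(h_M, y):
            # Global loss L(h_M, y); here the MSE, one choice allowed by the paper.
            return np.sum((h_M - y) ** 2)

        x, y = rng.normal(size=784), np.eye(10)[3]   # a dummy sample (x, y)
        h = forward(x)
        print(global_loss(h[-1], y))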

  • Key idea: instead of a loss gradient, assign each layer i a target value (a nearby activation that would have given a lower loss).

    Thus, the error signals indicate how the parameters of the network should be updated to decrease the expected loss. However, in very deep networks with strong non-linearities, error propagation could become useless in lower layers due to exploding or vanishing gradients, as explained above.

    To avoid these problems, the basic idea of target propagation is to assign to each h_i(x; \theta_W^{0,i}) a nearby value \hat{h}_i which (hopefully) leads to a lower global loss, that is, which has the objective to fulfill

      L(h_M(\hat{h}_i; \theta_W^{i,M}), y) < L(h_M(h_i(x; \theta_W^{0,i}); \theta_W^{i,M}), y). \qquad (3)

    Such a \hat{h}_i is called a target for the i-th layer.

    Given a target \hat{h}_i we now would like to change the network parameters to make h_i move a small step towards \hat{h}_i, since if the path leading from h_i to \hat{h}_i is smooth enough we would expect to yield a decrease of the global loss. To obtain an update direction for W_i based on \hat{h}_i we can define a layer-local target loss L_i, for example by using the MSE

      L_i(\hat{h}_i, h_i) = \|\hat{h}_i - h_i(x; \theta_W^{0,i})\|_2^2. \qquad (4)

    Then, W_i can be updated locally within its layer via stochastic gradient descent, where \hat{h}_i is considered as a constant with respect to W_i. That is,

      W_i^{(t+1)} = W_i^{(t)} - \eta_{f_i} \frac{\partial L_i(\hat{h}_i, h_i)}{\partial W_i} = W_i^{(t)} - \eta_{f_i} \frac{\partial L_i(\hat{h}_i, h_i)}{\partial h_i} \frac{\partial h_i(x; \theta_W^{0,i})}{\partial W_i}, \qquad (5)

    where \eta_{f_i} is a layer-specific learning rate.

    Note that in this context derivatives can be used without difficulty, because they correspond to computations performed inside a single layer, whereas the problems with the severe non-linearities observed for back-propagation arise when the chain rule is applied through many layers. This motivates target propagation methods to serve as alternative credit assignment in the context of a composition of many non-linearities.

    However, it is not directly clear how to compute a target that guarantees a decrease of the global loss (that is, how to compute a \hat{h}_i for which equation (3) holds) or that at least leads to a decrease of the local loss L_{i+1} of the next layer, that is

      L_{i+1}(\hat{h}_{i+1}, f_{i+1}(\hat{h}_i)) < L_{i+1}(\hat{h}_{i+1}, f_{i+1}(h_i)). \qquad (6)

    Proposing and validating answers to this question is the subject of the rest of this paper.

    2.2 HOW TO ASSIGN A PROPER TARGET TO EACH LAYER

    Clearly, in a supervised learning setting, the top layer target should be directly driven from the gradient of the global loss

      \hat{h}_M = h_M - \hat{\eta} \frac{\partial L(h_M, y)}{\partial h_M}, \qquad (7)

    where \hat{\eta} is usually a small step size. Note that if we use the MSE as global loss and \hat{\eta} = 0.5 we get \hat{h}_M = y.

    But how can we define targets for the intermediate layers? In the previous technical report (Bengio, 2014), it was suggested to take advantage of an approximate inverse. To formalize this idea, suppose that for each f_i we have a function g_i such that

      f_i(g_i(h_i)) \approx h_i \quad \text{or} \quad g_i(f_i(h_{i-1})) \approx h_{i-1}. \qquad (8)

    Then, choosing

      \hat{h}_{i-1} = g_i(\hat{h}_i) \qquad (9)

    would have the consequence that (under some smoothness assumptions on f and g) minimizing the distance between h_{i-1} and \hat{h}_{i-1} should also minimize the loss L_i of the i-th layer. This idea is illustrated in the left of Figure 1. Indeed, if the feed-back mappings were the perfect inverses of the feed-forward mappings (g_i = f_i^{-1}), one gets

      L_i(\hat{h}_i, f_i(\hat{h}_{i-1})) = L_i(\hat{h}_i, f_i(g_i(\hat{h}_i))) = L_i(\hat{h}_i, \hat{h}_i) = 0. \qquad (10)
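    Continuing the NumPy sketch above, here is a hedged rendering of the pieces just defined: the top-layer target of equation (7), pure target propagation through an approximate inverse (eqs. 9 and 11), and the layer-local SGD update of equations (4)-(5). The feedback matrices V[i] and the learning rates are hypothetical placeholders.

        import numpy as np

        def top_target(h_M, y, eta_hat=0.5):
            # Eq. (7) with the MSE global loss L = ||h_M - y||^2:
            # h^_M = h_M - eta^ * dL/dh_M; with eta^ = 0.5 this equals y.
            return h_M - eta_hat * 2.0 * (h_M - y)

        def propagate_targets(h, h_hat_M, V):
            # Pure target propagation, eq. (9): h^_{i-1} = g_i(h^_i), with
            # g_i(h_i) = tanh(V_i h_i) as in eq. (11). V[i] is a hypothetical
            # feedback matrix of shape (len(h[i-1]), len(h[i])).
            h_hat = [None] * len(h)
            h_hat[-1] = h_hat_M
            for i in range(len(h) - 1, 1, -1):   # no target for h_0 = x
                h_hat[i - 1] = np.tanh(V[i] @ h_hat[i])
            return h_hat

        def local_update(W_i, h_prev, h_hat_i, lr=0.01):
            # Eqs. (4)-(5): one SGD step on L_i = ||h^_i - h_i||_2^2, treating
            # the target h^_i as a constant w.r.t. W_i (all inside one layer).
            h_i = np.tanh(W_i @ h_prev)                        # feedforward value
            delta = 2.0 * (h_i - h_hat_i) * (1.0 - h_i ** 2)   # tanh' chain, one layer
            return W_i - lr * np.outer(delta, h_prev)          # dL_i/dW_i = delta h_{i-1}^T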

  • Each W_i is updated by SGD on a layer-local target loss, with a layer-specific learning rate.


  • Targets: with the MSE as global loss and step size 0.5, the top-layer target is simply the label y (see the worked line below). For intermediate layers, targets are obtained via an approximate inverse [Bengio 2014].
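    As a quick check of the step-size claim on this slide, with the squared-error global loss:

      L(h_M, y) = \|h_M - y\|_2^2, \qquad \frac{\partial L}{\partial h_M} = 2\,(h_M - y), \qquad \hat{h}_M = h_M - 0.5 \cdot 2\,(h_M - y) = y.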

  • Approximate Inverse

    For each feed-forward mapping f_i, introduce a feedback mapping g_i that approximately inverts it, mapping the state of layer i back to layer i-1.

    (Quoted from Bengio's talk slide:) "Target-Prop as an alternative to Back-Prop: weights at each layer are trained to match targets associated with the layer output. How to propagate targets? If g_i is the inverse of f_i, then it guarantees an improvement in the cost."

  • Making g_i the exact inverse of f_i is impractical; instead, g_i is learned as an approximate inverse of f_i, turning each f_i / g_i pair into an auto-encoder, as in the excerpt below.

    But choosing g to be the perfect inverse of f may need heavy computation and cause instability, since there is no guarantee that f_i^{-1} applied to a target would yield a value that is in the domain of f_{i-1}. An alternative approach is to learn an approximate inverse g_i, making the f_i / g_i pair look like an auto-encoder. This suggests parametrizing g_i as follows:

      g_i(h_i) = \bar{s}_i(V_i h_i), \quad i = 1, \dots, M, \qquad (11)

    where \bar{s}_i is a non-linearity associated with the decoder and V_i the matrix of feed-back weights of the i-th layer. With such a parametrization, it is unlikely that the auto-encoder will achieve zero reconstruction error. The decoder could be trained via an additional auto-encoder-like loss at each layer

      L_i^{inv} = \|g_i(f_i(h_{i-1})) - h_{i-1}\|_2^2. \qquad (12)

    Changing V_i based on this loss makes g_i closer to f_i^{-1}. By doing so, it also makes f_i(\hat{h}_{i-1}) = f_i(g_i(\hat{h}_i)) closer to \hat{h}_i, and is thus also contributing to the decrease of L_i(\hat{h}_i, f_i(\hat{h}_{i-1})). But we do not want to estimate an inverse mapping only for the concrete values we see in training but for a region around these values, to facilitate the computation of g_i(\hat{h}_i) for targets \hat{h}_i which have never been seen before. For this reason, the loss is modified by noise injection

      L_i^{inv} = \|g_i(f_i(h_{i-1} + \epsilon)) - (h_{i-1} + \epsilon)\|_2^2, \quad \epsilon \sim N(0, \sigma), \qquad (13)

    which makes f_i and g_i approximate inverses not just at h_{i-1} but also in its neighborhood.

    Figure 1: (left) How to compute a target in the lower layer via difference target propagation. f_i(\hat{h}_{i-1}) should be closer to \hat{h}_i than f_i(h_{i-1}). (right) Diagram of the back-propagation-free auto-encoder via difference target propagation.

    As mentioned above, a required property of target propagation is that the layer-wise parameter updates, each improving a layer-wise loss, also lead to an improvement of the global loss. The following theorem shows that, for the case that g_i is a perfect inverse of f_i and f_i has a certain structure, the update direction of target propagation does not deviate more than 90 degrees from the gradient direction (estimated by back-propagation), which always leads to a decrease of the global loss.

    Theorem 1. [1] Assume that g_i = f_i^{-1}, i = 1, \dots, M, and f_i satisfies h_i = f_i(h_{i-1}) = W_i s_i(h_{i-1}) [2], where s_i can be any differentiable monotonically increasing element-wise function. Let \Delta W_i^{tp} and ...

    [1] See the proof in the Appendix.
    [2] This is another way to obtain a non-linear deep network structure.
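    A minimal sketch of one SGD step on the noise-injected inverse loss of equation (13), continuing the NumPy sketches above; the learning rate and noise scale sigma are illustrative choices, not the paper's hyper-parameters.

        import numpy as np
        rng = np.random.default_rng(0)

        def update_inverse(W_i, V_i, h_prev, lr=0.01, sigma=0.1):
            # One SGD step on L_i^inv = ||g_i(f_i(h_{i-1} + eps)) - (h_{i-1} + eps)||^2
            # (eq. 13), so that g_i inverts f_i in a whole neighborhood of h_{i-1}.
            eps = sigma * rng.normal(size=h_prev.shape)
            corrupted = h_prev + eps                   # noisy input to reconstruct
            h_i = np.tanh(W_i @ corrupted)             # f_i(h_{i-1} + eps)
            rec = np.tanh(V_i @ h_i)                   # g_i(f_i(h_{i-1} + eps)), eq. (11)
            delta = 2.0 * (rec - corrupted) * (1.0 - rec ** 2)
            return V_i - lr * np.outer(delta, h_i)     # gradient w.r.t. V_i only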

  • Theorem 1: if each g_i is the perfect inverse of f_i and f_i has the structure assumed above, the target propagation update deviates by less than 90 degrees from the back-propagation gradient direction, so it decreases the global loss. The proof is in the Appendix.

  • Difference Target Propagation

    In practice g_i is only an approximate inverse of f_i, so the plain target \hat{h}_{i-1} = g_i(\hat{h}_i) is imperfect and target propagation can fail to decrease the loss; difference target propagation corrects for this imperfection.

  • Difference Target Propagation

    Difference target propagation replaces the plain target g_i(\hat{h}_i) with

      \hat{h}_{i-1} = h_{i-1} + g_i(\hat{h}_i) - g_i(h_i),

    that is, target propagation plus a correction term (h_{i-1} - g_i(h_i)) that vanishes when g_i is a perfect inverse; see the sketch below.
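    The same rule as code, a short sketch assuming the tanh-parametrized inverse of equation (11):

        import numpy as np

        def dtp_target(h_prev, h_i, h_hat_i, V_i):
            # Difference target propagation: plain target prop g_i(h^_i) plus the
            # correction (h_{i-1} - g_i(h_i)); the correction is 0 when g_i is
            # a perfect inverse, recovering plain target propagation.
            g = lambda h: np.tanh(V_i @ h)   # approximate inverse, eq. (11)
            return h_prev + g(h_hat_i) - g(h_i)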

  • Difference Target Propagation

    With this correction, if h_{i-1} reaches its target \hat{h}_{i-1}, then f_i(\hat{h}_{i-1}) lands close to the target \hat{h}_i even though g_i is imperfect.

  • Theorem 2 of the paper formalizes this under conditions on f and g.

  • Difference Target Propagation

    Training alternates between updating the feedback mappings g_i (trained as noisy approximate inverses) and the feedforward mappings f_i (trained toward their targets).

  • Summary of the procedure: 1. compute the top-layer target from the global loss (with the MSE and step size 0.5 it equals the label); 2. propagate targets downward through the f / g approximate-inverse pairs with the difference correction; see the full training-step sketch below.
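    Putting the pieces together, a hedged sketch of one full difference-target-propagation training step, reusing the helpers sketched earlier (forward, top_target, update_inverse, dtp_target, local_update); the exact ordering of inverse and forward updates in the paper's actual algorithm may differ.

        def dtp_step(x, y, W, V, lr=0.01):
            # W[i-1] are the weights of layer i; V[i] parametrizes g_i (hypothetical).
            h = forward(x)                                # feedforward pass, eq. (1)
            # 1. Train each approximate inverse g_i with injected noise, eq. (13).
            for i in range(2, len(W) + 1):
                V[i] = update_inverse(W[i - 1], V[i], h[i - 1])
            # 2. Top-layer target from the global loss, eq. (7).
            h_hat = [None] * len(h)
            h_hat[-1] = top_target(h[-1], y)
            # 3. Propagate targets downward with the difference correction.
            for i in range(len(h) - 1, 1, -1):
                h_hat[i - 1] = dtp_target(h[i - 1], h[i], h_hat[i], V[i])
            # 4. Layer-local updates toward the targets, eqs. (4)-(5).
            for i in range(1, len(W) + 1):
                W[i - 1] = local_update(W[i - 1], h[i - 1], h_hat[i], lr)
            return W, V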

  • Experiments (all with difference target propagation): 1. standard deep networks, 2. networks with discrete transmission between layers (where back-propagated gradients are exactly 0), 3. stochastic networks, 4. auto-encoders.

  • Experiment 1: MNIST with a deep network of 7 hidden layers with 240 tanh units each.

    With ReLU units, target propagation gives 3.15% test error vs. 1.62% for back-propagation (ReLU favors back-propagation).

  • Experiment 1 (continued): CIFAR-10, same setting as MNIST but with a 3072-1000-1000-1000-10 network and inputs scaled to [0, 1].

    Accuracy: target propagation 50.37% vs. back-propagation 53.72%. For reference, [Krizhevsky and Hinton 2009] report 49.78% with 1 hidden layer of 1000 units and 51.53% with 10000 units; the state-of-the-art without convolution is 64.1% [Konda+ 2015].

  • Experiment 2: networks with discrete transmission.

    MNIST with a 784-500-500-10 network where the signal passed between the first and second hidden layers is discretized with the sign function.

  • Experiment 2 (continued): back-propagated gradients are 0 across the discretization step, so two baselines are used:

    Baseline 1: back-propagation with the straight-through estimator [Bengio+ 2013], which simply ignores the step function's derivative.

    Baseline 2: back-propagation for the upper layers only, with W_1 fixed. A sketch of the straight-through baseline follows below.
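    For context, a minimal sketch of baseline 1, a straight-through estimator in the spirit of [Bengio+ 2013]; the tanh pre-activation and the layer shapes are assumptions for illustration.

        import numpy as np

        def discrete_layer_forward(W1, x):
            # Forward: the signal sent upward is discretized, h1 = step(tanh(W1 x)),
            # with step(z) = 1 if z > 0 else 0. Its derivative is 0 almost everywhere.
            a = np.tanh(W1 @ x)
            return a, (a > 0).astype(a.dtype)    # keep a for the backward pass

        def discrete_layer_backward(W1, x, a, grad_out):
            # Straight-through estimator: pretend the step function is the identity
            # and pass the gradient straight through it (biased but workable), then
            # back-propagate through tanh and W1 as usual.
            grad_a = grad_out * (1.0 - a ** 2)   # tanh'
            return np.outer(grad_a, x)           # gradient w.r.t. W1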

  • Experiment 2 (results):

    Baseline 1 does not reach 0 training error (biased gradient) but generalizes fairly well.

    Baseline 2 reaches 0 training error but generalizes poorly (the first layer learns nothing).

    Target propagation trains the discrete network directly: training error approaches 0 and test performance is good.

  • Experiment 3: stochastic networks.

    Stochastic networks [Bengio 2013][Tang+ 2013][Bengio+ 2013] can model a multi-modal conditional distribution P(Y|X), which vanilla back-propagation cannot handle.

  • Experiment 3 (setup): MNIST with a 784-200-200-10 network of stochastic binary units, following [Raiko+ 2014].

    Baseline: the straight-through biased gradient estimator [Bengio+ 2013].

  • Experiment 3 (method): with difference target propagation the stochastic network is trained directly, setting targets for layers 2 and 1 and minimizing layer-local losses, as in the excerpt below.

    where sign(x) = 1 if x > 0, and sign(x) = 0 if x \le 0. The network output is given by

      p(y|x) = f_3(h_2) = \mathrm{softmax}(W_3 h_2). \qquad (25)

    The inverse mapping of the second layer and the associated loss are given by

      g_2(h_2) = \tanh(V_2 \, \mathrm{sign}(h_2)), \qquad (26)

      L_2^{inv} = \|g_2(f_2(h_1 + \epsilon)) - (h_1 + \epsilon)\|_2^2, \quad \epsilon \sim N(0, \sigma). \qquad (27)

    If the feed-forward mapping is discrete, back-propagated gradients become 0 and useless when they cross the discretization step. So we compare target propagation to two baselines. As a first baseline, we train the network with back-propagation and the straight-through estimator (Bengio et al., 2013), which is biased but was found to work well, and simply ignores the derivative of the step function (which is 0 or infinite) in the back-propagation phase. As a second baseline, we train only the upper layers by back-propagation, while not changing the weights W_1 which are affected by the discretization, i.e., the lower layers do not learn.

    The results on the training and test sets are shown in Figure 3. The training error for the first baseline (straight-through estimator) does not converge to zero (which can be explained by the biased gradient) but generalization performance is fairly good. The second baseline (fixed lower layer) surprisingly reached zero training error, but did not perform well on the test set. This can be explained by the fact that it cannot learn any meaningful representation at the first layer. Target propagation however did not suffer from this drawback and can be used to train discrete networks directly (training signals can pass the discrete region successfully). Though the training convergence was slower, the training error did approach zero. In addition, difference target propagation also achieved good results on the test set.

    3.3 STOCHASTIC NETWORKS

    Another interesting model class which vanilla back-propagation cannot deal with are stochastic networks with discrete units. Recently, stochastic networks have attracted attention (Bengio, 2013; Tang and Salakhutdinov, 2013; Bengio et al., 2013) because they are able to learn a multi-modal conditional distribution P(Y|X), which is important for structured output predictions. Training networks of stochastic binary units is also biologically motivated, since they resemble networks of spiking neurons. Here, we investigate whether one can train networks of stochastic binary units on MNIST for classification using target propagation. Following Raiko et al. (2014), the network architecture was 784-200-200-10 and the hidden units were stochastic binary units with the probability of turning on given by a sigmoid activation:

      h_i^p = P(H_i = 1 \mid h_{i-1}) = \sigma(W_i h_{i-1}), \quad h_i \sim P(H_i \mid h_{i-1}), \qquad (28)

    that is, h_i is one with probability h_i^p.

    As a baseline, we considered training based on the straight-through biased gradient estimator (Bengio et al., 2013), in which the derivative through the discrete sampling step is ignored (this method showed the best performance in Raiko et al. (2014)). That is

      \delta h_{i-1}^p = \delta h_i^p \frac{\partial h_i^p}{\partial h_{i-1}^p} \approx \sigma'(W_i h_{i-1}) W_i^T \delta h_i^p. \qquad (29)

    With difference target propagation the stochastic network can be trained directly, setting the targets to

      \hat{h}_2^p = h_2^p - \hat{\eta} \frac{\partial L}{\partial h_2^p} \quad \text{and} \quad \hat{h}_1^p = h_1^p + g_2(\hat{h}_2^p) - g_2(h_2^p), \qquad (30)

    where g_i(h_i^p) = \tanh(V_i h_i^p) is trained by the loss

      L_i^{inv} = \|g_i(f_i(h_{i-1} + \epsilon)) - (h_{i-1} + \epsilon)\|_2^2, \quad \epsilon \sim N(0, \sigma), \qquad (31)

    and layer-local target losses are defined as L_i = \|\hat{h}_i^p - h_i^p\|_2^2.

    For evaluation, we averaged the output probabilities for a given input over 100 samples, and classified the example accordingly, following Raiko et al. (2014). Results are given in Table 1. We obtained a test error of 1.71% using the baseline method and 1.54% using target propagation, which is to our knowledge the best result for stochastic nets on MNIST reported so far. This suggests that target propagation is highly promising for training networks of binary stochastic units.
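    A sketch of the stochastic binary units of equation (28) and the 100-sample evaluation described in the excerpt; the function names and sampling details are illustrative assumptions.

        import numpy as np
        rng = np.random.default_rng(0)

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        def stochastic_layer(W_i, h_prev):
            # Eq. (28): h_i^p = sigmoid(W_i h_{i-1}); h_i ~ Bernoulli(h_i^p).
            h_p = sigmoid(W_i @ h_prev)
            return h_p, (rng.random(h_p.shape) < h_p).astype(float)

        def predict(W, x, n_samples=100):
            # Evaluation as in Raiko et al. (2014): average the output
            # probabilities over 100 samples, then classify.
            probs = 0.0
            for _ in range(n_samples):
                h = x
                for W_i in W[:-1]:                    # stochastic hidden layers
                    _, h = stochastic_layer(W_i, h)
                probs = probs + softmax(W[-1] @ h)    # p(y|x), eq. (25)
            return int(np.argmax(probs / n_samples))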

  • Experiment 3 (results): target propagation reaches 1.54% test error vs. 1.71% for the baseline, the best reported result for stochastic networks on MNIST.

  • Experiment 4: an auto-encoder with 1000 hidden units trained on MNIST without back-propagation, via difference target propagation; its representation yields 1.35% test error.

  • Conclusion: target propagation assigns targets rather than gradients for credit assignment.

    Difference target propagation corrects for imperfect inverses; it is comparable to back-propagation with RMSprop and achieves state-of-the-art results for stochastic networks on MNIST.

  • (Companion paper) Toward Biologically Plausible Deep Learning:

    Starting from an idea in Hinton's 2007 talk, it interprets spike-timing-dependent plasticity (STDP) as stochastic gradient descent on an objective J, derives an EM / variational-bound view of learning, and connects this to target propagation.

  • Related talks by Bengio:

    New York University invited talk: Towards Biologically Plausible Deep Learning. http://www.iro.umontreal.ca/~bengioy/talks/NYU_12March2015.pdf

    Montreal Institute of Learning Algorithms (MILA) tea-talk: Towards Biologically Plausible Deep Learning. http://www.iro.umontreal.ca/~bengioy/talks/MILA_20Feb2015.pdf

    NIPS 2014 MLINI workshop: Towards a biologically plausible replacement for back-prop: target-prop. http://www.iro.umontreal.ca/~bengioy/talks/NIPS-MLINI-workshop-targetprop-13dec2014.pdf