
Page 1:

Opening the black box of Deep Neural Networks via Information

(Ravid Shwartz-Ziv and Naftali Tishby)

An overview by Philip Amortila and Nicolas Gagné


Page 2:

Problem: the usual "black box" story.

[Figure: a photo labelled "bull" goes into a network drawn as a black box, out of which come only question marks.]

"Despite their great success, there is still no comprehensive understanding of the internal organization of Deep Neural Networks."

Page 3:

Solution: opening the black box using mutual information!

[Figure: the same black-box picture, now annotated with "mutual information" probing the hidden layers.]

Page 7:

Spoiler alert! As we train a deep neural network, its layers will:

1. Gain "information" about the true label and the input.
2. Then, gain "information" about the true label but lose "information" about the input!

During that second stage, the network forgets the details of the input that are irrelevant to the task. (Not unlike Picasso's deconstruction of the bull.)

Page 12:

Before delving in, we first need to define what we mean by "information".

By information, we mean mutual information.

Page 14:

What is mutual information? For discrete random variables X and Y:

- Entropy of X given nothing: $H(X) := -\sum_{i=1}^{n} p(x_i) \log p(x_i)$
- Entropy of X given y: $H(X|y) := -\sum_{i=1}^{n} p(x_i|y) \log p(x_i|y)$
- Entropy of X given Y: $H(X|Y) := \sum_{j=1}^{m} p(y_j) H(X|y_j)$
- Mutual information between X and Y: $I(X;Y) := H(X) - H(X|Y)$
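These definitions translate directly into code. Below is a minimal sketch (ours, not from the slides) in Python/NumPy that computes $H(X)$, $H(X|Y)$, and $I(X;Y)$ from a joint distribution table:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) in bits of a discrete distribution p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    """I(X;Y) = H(X) - H(X|Y), from a joint table p_xy[i, j] = p(x_i, y_j)."""
    p_x = p_xy.sum(axis=1)   # marginal p(x)
    p_y = p_xy.sum(axis=0)   # marginal p(y)
    # H(X|Y) = sum_j p(y_j) * H(X | y_j), the average of conditional entropies
    h_x_given_y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j])
                      for j in range(p_xy.shape[1]) if p_y[j] > 0)
    return entropy(p_x) - h_x_given_y

# X fully determined by Y: one full bit of mutual information.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))     # -> 1.0
# X independent of Y: zero bits.
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # -> 0.0
```

Both checks agree with the definitions: a variable that determines X carries a full bit about it, an independent one carries none.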

Page 18:

Setup: given a deep neural network where X and Y are discrete random variables, we identify the propagated values at layer i with the vector T_i, doing so for every layer.

[Figure: a feed-forward network carrying the input X (drawn from p(x|y)) through hidden layers T_1, T_2, ..., T_k.]

Page 22:

And doing so, we get the following Markov chain:

Y -> X -> T_1 -> T_2 -> ... -> T_k

(Hint: "Markov chain" rhymes with "data processing inequality".) Hence $I(Y;T_i) \ge I(Y;T_{i+n})$ and $I(X;T_i) \ge I(X;T_{i+n})$.
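The data processing inequality is easy to see numerically. In the toy sketch below (our own construction, not the paper's data), each "layer" is a deterministic coarsening of X: the information about the label Y survives the coarsening, while the information about X itself can only shrink, exactly the compression the spoiler promised:

```python
import numpy as np

def mi_bits(a, b):
    """Plug-in estimate of I(A;B) in bits from paired discrete samples."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)        # count co-occurrences
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / np.outer(pa, pb)[nz])))

rng = np.random.default_rng(0)
n = 200_000
y = rng.integers(0, 2, size=n)           # binary label
x = 4 * y + rng.integers(0, 4, size=n)   # input: the label plus irrelevant detail
t1 = x // 2                              # "layer 1": a coarsening of X
t2 = t1 // 2                             # "layer 2": coarsens further (equals y here)

print(mi_bits(y, x), mi_bits(y, t1), mi_bits(y, t2))  # label info: ~1.0, 1.0, 1.0
print(mi_bits(x, x), mi_bits(x, t1), mi_bits(x, t2))  # input info: ~3.0, 2.0, 1.0
```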

Page 24:

Next, pick your favourite layer, say T_i.

Page 26:

We will plot T's current location on the "information plane": the horizontal axis is I(X;T) ("how much it knows about X") and the vertical axis is I(Y;T) ("how much it knows about Y").
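How are these coordinates actually computed? With the paper's small synthetic task the joint distribution is tractable, and the mutual information is estimated by discretizing each neuron's tanh output into equal-width bins (30 in the paper) and treating each binned activation vector as a single discrete symbol. A rough sketch of that estimator (function and argument names are ours):

```python
import numpy as np

def layer_information(x_ids, y, t_act, n_bins=30):
    """Estimate (I(X;T), I(Y;T)) for one layer from a full pass over the data.
    x_ids: (N,) discrete input ids; y: (N,) labels;
    t_act: (N, width) tanh activations, each in [-1, 1]."""
    # Discretize each activation, then hash every row to one discrete symbol.
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    binned = np.digitize(t_act, edges)
    _, t_ids = np.unique(binned, axis=0, return_inverse=True)

    def mi_bits(a, b):
        joint = np.zeros((a.max() + 1, b.max() + 1))
        np.add.at(joint, (a, b), 1.0)
        joint /= joint.sum()
        pa, pb = joint.sum(axis=1), joint.sum(axis=0)
        nz = joint > 0
        return float(np.sum(joint[nz] * np.log2(joint[nz] / np.outer(pa, pb)[nz])))

    return mi_bits(x_ids, t_ids), mi_bits(y, t_ids)
```

Calling this for every layer after each epoch, and scattering the resulting pairs, is what produces the information-plane plots in the rest of the deck.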

Page 27:

Then, we train our deep neural network for a bit, and plot the new location corresponding to T's updated distribution.

Page 28:

We train a bit more...

Page 29:

And as we train, we trace the layer's trajectory in the information plane.

Page 30:

Let's see what it looks like for a fixed data point, say an image labelled "bull", alongside other points labelled "dog" and "goat".

[Figure: an animated sequence plotting these points in the information plane as training proceeds.]

Page 39:

Numerical Experiments and Results

Examining the dynamics of SGD in the mutual information plane

Page 40:

Experimental Setup

- Explored fully connected feed-forward neural nets, with no other architectural constraints: 7 fully connected hidden layers with widths 12-10-7-5-4-3-2
- Sigmoid activation on the final layer, tanh activation on all other layers
- Binary decision task on synthetic data, $D \sim P(X,Y)$
- Experiments with 50 different randomized weight initializations and 50 different datasets generated from the same distribution
- Trained with SGD to minimize the cross-entropy loss, with no regularization
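For concreteness, here is a minimal sketch of that setup in PyTorch (our choice of framework; the slides don't prescribe one). The 12-dimensional binary input is our assumption, taken from the paper's synthetic task:

```python
import torch
import torch.nn as nn

input_dim = 12                 # assumed: the paper's task has 12 binary inputs
hidden = [12, 10, 7, 5, 4, 3, 2]  # the 7 hidden-layer widths from the slide

layers = []
w_in = input_dim
for w_out in hidden:
    layers += [nn.Linear(w_in, w_out), nn.Tanh()]   # tanh on all hidden layers
    w_in = w_out
layers += [nn.Linear(w_in, 1), nn.Sigmoid()]        # sigmoid on the final layer
net = nn.Sequential(*layers)

optimizer = torch.optim.SGD(net.parameters(), lr=0.1)  # plain SGD, no regularization
loss_fn = nn.BCELoss()                                 # cross-entropy for binary labels

def train_step(x, y):
    """One SGD step on a batch x: (N, 12) floats, y: (N, 1) in {0., 1.}."""
    optimizer.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```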

Page 41:

Dynamics in the Information Plane

Page 42:

Dynamics in the Information Plane

- All 50 test runs follow similar paths in the information plane
- Two distinct phases of training: a fast 'ERM reduction' phase and a slower 'representation compression' phase
- In the first phase (~400 epochs), the test error is rapidly reduced (increase in $I(T;Y)$)
- In the second phase (from 400 to 9000 epochs), the error is relatively unchanged but the layers lose input information (decrease in $I(X;T)$)

Page 46:

Representation Compression

- The loss of input information occurs without any form of regularization
- This prevents overfitting, since the layers lose irrelevant information ("generalizing by forgetting")
- However, overfitting can still occur with less data

[Figure: information-plane trajectories when training on 5%, 45%, and 85% of the data respectively.]

Page 49:

Phase transitions

- The 'ERM reduction' phase is called a drift phase, where the gradients are large and the weights change rapidly (high signal-to-noise)
- The 'representation compression' phase is called a diffusion phase, where the gradients are small compared to their variance (low signal-to-noise)

[Figure: per-layer gradient statistics over training; the phase transition occurs at the dotted line.]
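This drift/diffusion distinction can be monitored directly from the gradients. A rough diagnostic sketch (ours): compare the norm of the mean minibatch gradient (the "signal") against the spread of gradients across minibatches (the "noise"):

```python
import torch

def gradient_snr(net, loss_fn, batches):
    """Signal-to-noise proxy for one epoch. High SNR suggests the drift
    phase; low SNR, the diffusion phase. `batches` yields (x, y) minibatches."""
    grads = []   # one flat gradient vector per minibatch
    for x, y in batches:
        net.zero_grad()
        loss_fn(net(x), y).backward()
        grads.append(torch.cat([p.grad.detach().flatten()
                                for p in net.parameters()]))
    g = torch.stack(grads)           # (num_batches, num_params)
    signal = g.mean(dim=0).norm()    # norm of the mean gradient
    noise = g.std(dim=0).norm()      # norm of the per-coordinate std
    return (signal / noise).item()
```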

Page 51:

Effectiveness of Deep Nets

- Because of the low signal-to-noise ratio in the diffusion phase, the final weights obtained by the DNN are effectively random
- Across different experiments, the correlations between the weights of different neurons in the same layer were very small
- "This indicates that there is a huge number of different networks with essentially optimal performance, and attempts to interpret single weights or single neurons in such networks can be meaningless"

Page 54:

Why go deep?

- Faster ERM minimization (in epochs)
- Faster representation compression (in epochs)

Page 56:

Discussions/Disclaimers

- The claims being made are certainly very interesting, although the scope of the experiments is limited: only one specific distribution and one specific network are examined.
- The paper acknowledges that different setups need to be tested: do the results hold with different decision rules and network architectures? And is this observed in "real world" problems?

Page 58:

Discussions/Disclaimers

- As of now there is some controversy about whether or not these claims hold up: see "On the Information Bottleneck Theory of Deep Learning" (Saxe et al., 2018).
- "Here we show that none of these claims hold true in the general case. […] we demonstrate that the information plane trajectory is predominantly a function of the neural nonlinearity employed […] Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent."

Page 59:

The End