Representation Learning and the Information Bottleneck
TRANSCRIPT
Desiderata for representations
An optimal representation z of the data x for the task y is a stochastic function z ∼ p(z|x) that is:
- Sufficient: I(z; y) = I(x; y)
- Minimal: I(x; z) is minimal among sufficient z
- Invariant to nuisances: if n ⫫ y, then I(n; z) = 0
- Maximally disentangled: the total correlation TC(z) = KL(p(z) ‖ ∏ᵢ p(zᵢ)) is minimized
Minimal sufficient representations for deep learning
Information Bottleneck Lagrangian
A minimal sufficient representation is the solution to:

minimize over p(z|x):  I(x; z)
subject to:  H(y|z) = H(y|x)

Trade-off between sufficiency and minimality, regulated by the parameter β.

Information Bottleneck Lagrangian:

L = H_{p,q}(y|z) + β I(z; x)

where H_{p,q}(y|z) is the cross-entropy term and β I(z; x) is the regularizer.
We only need to enforce minimality (easy) to gain invariance (difficult)
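As a concrete sketch (not from the talk), the Lagrangian can be implemented with a standard variational upper bound on I(z; x): for a Gaussian encoder z ∼ N(μ(x), σ²(x)) and a unit-Gaussian prior, the regularizer is the familiar VAE-style KL term. The function name and shapes below are our own illustrative choices.

```python
import numpy as np

def ib_lagrangian(logits, y, mu, log_var, beta):
    """Information Bottleneck Lagrangian L = H_{p,q}(y|z) + beta * I(z; x).

    Cross-entropy of the class logits (computed from a sampled z) plus a
    variational upper bound on I(z; x): the KL divergence of the Gaussian
    encoder N(mu(x), sigma^2(x)) from a unit-Gaussian prior.
    """
    # Cross-entropy H_{p,q}(y|z), via a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    cross_entropy = -log_probs[np.arange(len(y)), y].mean()

    # KL( N(mu, sigma^2) || N(0, 1) ), averaged over the batch:
    # an upper bound on the mutual information I(z; x).
    kl = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var).sum(axis=1).mean()
    return cross_entropy + beta * kl
```

Setting β = 0 recovers plain cross-entropy training; increasing β tightens the bottleneck at the cost of sufficiency.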
Invariant if and only if minimal
Proposition (Achille and Soatto, 2017). Let z be a sufficient representation and n a nuisance. Then,

I(z; n) ≤ I(z; x) − I(x; y)
(invariance) (minimality) (constant)

Moreover, there exists a nuisance n for which equality holds.
A representation is maximally insensitive to all nuisances iff it is minimal
The standard architecture alone already promotes invariant representations
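As an illustrative numerical check (not from the talk), the bound can be verified on a toy discrete source where x packs a label bit y and an independent nuisance bit n, and z = x is a sufficient but non-minimal representation. All names below are ours.

```python
import numpy as np

def mutual_info(joint):
    """Mutual information (in bits) of a discrete joint distribution p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa * pb)[mask])).sum())

# Toy task: x = (y, n) packs two independent fair bits; y is the label,
# n a nuisance. The representation z = x is sufficient but not minimal.
p_zx = np.eye(4) / 4.0          # z equals x exactly
p_xy = np.zeros((4, 2))         # y is the high bit of x
p_zn = np.zeros((4, 2))         # n is the low bit of z
for x in range(4):
    p_xy[x, x // 2] = 0.25
    p_zn[x, x % 2] = 0.25

invariance = mutual_info(p_zn)  # I(z; n) = 1 bit
minimality = mutual_info(p_zx)  # I(z; x) = 2 bits
constant = mutual_info(p_xy)    # I(x; y) = 1 bit
# I(z; n) <= I(z; x) - I(x; y), with equality for this nuisance.
```

Because z keeps all of x, the slack in the bound is zero: every bit of excess information I(z; x) − I(x; y) is nuisance information.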
Corollary: Ways of enforcing invariance
Regularization by architecture: reducing dimension (max-pooling) or adding noise (dropout) increases minimality and invariance.
Stacking layers: stacking multiple layers makes the representation increasingly minimal.
[Figure: task information I(x; y) vs. nuisance information I(x; n) through the bottleneck.]
1. Only nuisance information is dropped in a bottleneck (sufficiency).
2. Increasingly more minimal implies increasingly more invariant to nuisances.
3. The classifier cannot overfit to nuisances.
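The two architectural mechanisms, dimension reduction and noise, can be sketched in a few lines. This is an illustrative NumPy reduction, not code from the talk.

```python
import numpy as np

def max_pool_1d(x, k):
    """Regularization by dimension reduction: keep the max of each window of k."""
    n = (x.shape[-1] // k) * k
    return x[..., :n].reshape(*x.shape[:-1], -1, k).max(axis=-1)

def dropout(x, rate, rng):
    """Regularization by noise: zero activations at random (inverted scaling)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# Both operations discard information about the input x, tightening the
# bottleneck I(x; z) and thereby promoting invariance to nuisances.
```

Stacking such layers compounds the effect: each layer can only remove information, so deeper representations are increasingly minimal.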
Creating a soft bottleneck with controlled noise
Information Dropout: a Variational Bottleneck
L = H_{p,q}(y|z) + β I(z; x) = H_{p,q}(y|z) − β log α(x)

where β I(z; x) is the bottleneck term.
[Figure: task information I(x; y) is preserved through the bottleneck while nuisance information I(x; n) is filtered out.]
Multiplicative log-normal noise ε, with log ε ∼ N(0, α²(x))
Achille and Soatto, "Information Dropout: Learning Optimal Representations Through Noisy Computation", PAMI 2018 (arXiv 2016)
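A minimal sketch of an Information Dropout layer, assuming log-normal multiplicative noise with log ε ∼ N(0, α²(x)) and the per-sample penalty −β log α(x), which approximates β I(z; x) up to a constant. The learned head producing log α(x) and details such as clipping α are omitted here.

```python
import numpy as np

def information_dropout(f, log_alpha, beta, rng):
    """Soft bottleneck via multiplicative log-normal noise.

    f         : layer activations f(x)
    log_alpha : input-dependent log noise scale log(alpha(x)), normally
                produced by a learned head of the network (omitted here)
    Returns the noisy representation z = f * eps and the regularizer
    -beta * mean(log_alpha), approximating beta * I(z; x) up to a constant.
    """
    alpha = np.exp(log_alpha)
    eps = np.exp(alpha * rng.standard_normal(f.shape))  # log eps ~ N(0, alpha^2)
    z = f * eps
    penalty = -beta * float(log_alpha.mean())
    return z, penalty
```

Larger α(x) injects more noise (more filtering) while lowering the penalty, so the optimizer raises α wherever the activations carry only nuisance information.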
Learning invariant representations
Deeper layers filter increasingly more nuisances
Stronger bottleneck = more filtering
Only the informative part of the image is retained; other information is discarded.
(Achille and Soatto, 2017)