Representation Learning and the Information Bottleneck
TRANSCRIPT
Desiderata for representations
An optimal representation z of the data x for the task y is a stochastic function z ∼ p(z|x) that is:
- Sufficient: I(z; y) = I(x; y)
- Minimal: I(x; z) is minimal among sufficient z
- Invariant to nuisances: if n ⫫ y, then I(n; z) = 0
- Maximally disentangled: the total correlation TC(z) = KL(p(z) ‖ ∏ᵢ p(zᵢ)) is minimized
Minimal sufficient representations for deep learning
Information Bottleneck Lagrangian
A minimal sufficient representation is the solution to:

minimize over p(z|x):  I(x; z)
subject to:  H(y|z) = H(y|x)

Trade-off between sufficiency and minimality, regulated by the parameter β.

Information Bottleneck Lagrangian:

L = H_{p,q}(y|z) + β I(z; x)

where H_{p,q}(y|z) is the cross-entropy term and β I(z; x) is the regularizer.
We only need to enforce minimality (easy) to gain invariance (difficult)
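As a concrete sketch (not from the talk), the Lagrangian can be implemented with a standard variational upper bound on I(z; x): for a Gaussian encoder z ∼ N(μ(x), σ²(x)) and a unit-Gaussian prior, the regularizer is the familiar VAE-style KL term. The function name and shapes below are our own illustrative choices.

```python
import numpy as np

def ib_lagrangian(logits, y, mu, log_var, beta):
    """Information Bottleneck Lagrangian L = H_{p,q}(y|z) + beta * I(z; x).

    Cross-entropy of the class logits (computed from a sampled z) plus a
    variational upper bound on I(z; x): the KL divergence of the Gaussian
    encoder N(mu(x), sigma^2(x)) from a unit-Gaussian prior.
    """
    # Cross-entropy H_{p,q}(y|z), via a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    cross_entropy = -log_probs[np.arange(len(y)), y].mean()

    # KL( N(mu, sigma^2) || N(0, 1) ), averaged over the batch:
    # an upper bound on the mutual information I(z; x).
    kl = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var).sum(axis=1).mean()
    return cross_entropy + beta * kl
```

Setting β = 0 recovers plain cross-entropy training; increasing β tightens the bottleneck at the cost of sufficiency.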
Invariant if and only if minimal
Proposition (Achille and Soatto, 2017). Let z be a sufficient representation and n a nuisance. Then,

I(z; n) ≤ I(z; x) − I(x; y)
(invariance) (minimality) (constant)

Moreover, there exists a nuisance n for which equality holds.
A representation is maximally insensitive to all nuisances iff it is minimal
The standard architecture alone already promotes invariant representations
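As an illustrative numerical check (not from the talk), the bound can be verified on a toy discrete source where x packs a label bit y and an independent nuisance bit n, and z = x is a sufficient but non-minimal representation. All names below are ours.

```python
import numpy as np

def mutual_info(joint):
    """Mutual information (in bits) of a discrete joint distribution p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa * pb)[mask])).sum())

# Toy task: x = (y, n) packs two independent fair bits; y is the label,
# n a nuisance. The representation z = x is sufficient but not minimal.
p_zx = np.eye(4) / 4.0          # z equals x exactly
p_xy = np.zeros((4, 2))         # y is the high bit of x
p_zn = np.zeros((4, 2))         # n is the low bit of z
for x in range(4):
    p_xy[x, x // 2] = 0.25
    p_zn[x, x % 2] = 0.25

invariance = mutual_info(p_zn)  # I(z; n) = 1 bit
minimality = mutual_info(p_zx)  # I(z; x) = 2 bits
constant = mutual_info(p_xy)    # I(x; y) = 1 bit
# I(z; n) <= I(z; x) - I(x; y), with equality for this nuisance.
```

Because z keeps all of x, the slack in the bound is zero: every bit of excess information I(z; x) − I(x; y) is nuisance information.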
Corollary: Ways of enforcing invariance
Regularization by architecture: reducing dimension (max-pooling) or adding noise (dropout) increases minimality and invariance.
Stacking layers: stacking multiple layers makes the representation increasingly minimal.
[Figure: task information I(x; y) vs. nuisance information I(x; n) through the bottleneck.]
1. Only nuisance information is dropped in a bottleneck (sufficiency).
2. Increasingly more minimal implies increasingly more invariant to nuisances.
3. The classifier cannot overfit to nuisances.
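The two architectural mechanisms, dimension reduction and noise, can be sketched in a few lines. This is an illustrative NumPy reduction, not code from the talk.

```python
import numpy as np

def max_pool_1d(x, k):
    """Regularization by dimension reduction: keep the max of each window of k."""
    n = (x.shape[-1] // k) * k
    return x[..., :n].reshape(*x.shape[:-1], -1, k).max(axis=-1)

def dropout(x, rate, rng):
    """Regularization by noise: zero activations at random (inverted scaling)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# Both operations discard information about the input x, tightening the
# bottleneck I(x; z) and thereby promoting invariance to nuisances.
```

Stacking such layers compounds the effect: each layer can only remove information, so deeper representations are increasingly minimal.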
Creating a soft bottleneck with controlled noise
Information Dropout: a Variational Bottleneck
L = H_{p,q}(y|z) + β I(z; x) = H_{p,q}(y|z) − β log α(x)

where β I(z; x) is the bottleneck term.
[Figure: task information I(x; y) is preserved through the bottleneck while nuisance information I(x; n) is filtered out.]
Multiplicative log-normal noise ε, with log ε ∼ N(0, α²(x))
Achille and Soatto, "Information Dropout: Learning Optimal Representations Through Noisy Computation", PAMI 2018 (arXiv 2016)
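A minimal sketch of an Information Dropout layer, assuming log-normal multiplicative noise with log ε ∼ N(0, α²(x)) and the per-sample penalty −β log α(x), which approximates β I(z; x) up to a constant. The learned head producing log α(x) and details such as clipping α are omitted here.

```python
import numpy as np

def information_dropout(f, log_alpha, beta, rng):
    """Soft bottleneck via multiplicative log-normal noise.

    f         : layer activations f(x)
    log_alpha : input-dependent log noise scale log(alpha(x)), normally
                produced by a learned head of the network (omitted here)
    Returns the noisy representation z = f * eps and the regularizer
    -beta * mean(log_alpha), approximating beta * I(z; x) up to a constant.
    """
    alpha = np.exp(log_alpha)
    eps = np.exp(alpha * rng.standard_normal(f.shape))  # log eps ~ N(0, alpha^2)
    z = f * eps
    penalty = -beta * float(log_alpha.mean())
    return z, penalty
```

Larger α(x) injects more noise (more filtering) while lowering the penalty, so the optimizer raises α wherever the activations carry only nuisance information.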
Learning invariant representations
Deeper layers filter increasingly more nuisances
Stronger bottleneck = more filtering
Only the informative part of the image is retained; other information is discarded.
(Achille and Soatto, 2017)