
Page 1:

Explainable Neural Computation via Stack Neural Module Networks (July, 2018)

Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko, UC Berkeley

Page 2:

Outline

● The Problem

● Motivation and Importance

● The Approach

○ Module layout controller

○ Neural modules with a memory stack

○ Soft program execution

● Dataset

● Results

● Critique

Page 3:

Motivation

Approaches to explaining neural networks:

❏ Building in attention layers
❏ Post-hoc extraction of implicit model attention (e.g., gradient propagation)
❏ Network dissection

Can we go beyond a single heatmap?

❏ Explainable models for more complex tasks: question answering, referential expression grounding
❏ [These tasks require several reasoning steps to solve]

Page 4:

The Problem and Importance

❏ A single heat map highlighting important spatial regions may not tell the full story
❏ Existing modular nets: analyze the question → predict a sequence of predefined modules → predict the answer
❏ But they need supervised module layouts (expert layouts) to train the layout policy
❏ Goal: an explicit modular reasoning process with low supervision

Question: There is a small gray block. Are there any spheres to the left of it?

Page 5:

The Approach

❏ Replace the layout graph with a stack-based data structure
❏ [Instead of making discrete layout choices, this makes the layout soft and continuous → the model can be optimized in a fully differentiable way with SGD]
❏ Steps:

❖ Module layout controller

❖ Neural modules with a memory stack

❖ Soft program execution

Page 6:

Model

❖ Module layout controller
❖ Neural modules with a memory stack
❖ Soft program execution

Page 7:

Layout controller

Dimensions: c_t ∈ R^d, w^{(t)} ∈ R^{|M|}, W_1^{(t)} ∈ R^{d×d}, W_2 ∈ R^{d×2d}, W_3 ∈ R^{1×d}

Input: question Q with S words, encoded into a sequence [h_1, ..., h_S] (length S, dim d) with a BiLSTM; h_s is the concatenation of the forward and backward LSTM outputs at the s-th word.

The controller runs recurrently from t = 0 to T−1. At each step t it:

❏ applies a time-dependent linear transform to the question embedding q and linearly combines it with the previous textual parameter c_{t−1}:
u = W_2 [W_1^{(t)} q + b_1 ; c_{t−1}] + b_2

❏ applies a small MLP to u to predict the module weights w^{(t)}:
w^{(t)} = softmax(MLP(u; θ_MLP)), with Σ_m w_m^{(t)} = 1

❏ predicts the next textual parameter c_t via attention over the question words:
cv_{t,s} = softmax_s(W_3 (u ⊙ h_s)), c_t = Σ_s cv_{t,s} · h_s
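A minimal numpy sketch of one controller step under the shapes above (the MLP is passed in as any callable mapping R^d → R^|M|; the variable names are mine, not the paper's):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def controller_step(q, c_prev, h, W1_t, b1, W2, b2, W3, mlp):
    """One step of the soft layout controller.

    q: (d,) question embedding        c_prev: (d,) previous textual parameter
    h: (S, d) BiLSTM word outputs     W1_t: (d, d) time-step-dependent transform
    W2: (d, 2d), W3: (1, d)           mlp: callable mapping (d,) -> (|M|,)
    """
    # u = W2 [W1^(t) q + b1 ; c_{t-1}] + b2
    u = W2 @ np.concatenate([W1_t @ q + b1, c_prev]) + b2
    # soft module weights w^(t) over the |M| modules (they sum to 1)
    w_t = softmax(mlp(u))
    # attention over the S question words, then the textual parameter c_t
    cv = softmax((W3 @ (u * h).T).ravel())  # (S,)
    c_t = cv @ h                            # (d,)
    return w_t, c_t
```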

Page 8:

Neural modules with a memory stack

"How many objects are right of the blue object?" → Answer[how many](Transform[right](Find[blue]))

Page 9:

Differentiable memory stack

❏ Modules may take different numbers of inputs, and may need to compare what they see at step t with what was seen previously.
❏ Typical tree-structured layout: Compare(Find(), Transform(Find()))
❏ Therefore, give the modules a memory.
❏ But restrict it to a last-in-first-out (LIFO) stack.
❏ Thus modules behave like functions in a program, allowing only arguments and return values to be passed between modules.

Page 10:

Differentiable memory stack

The stack stores fixed-dimension values in a length-L memory array A = {A_i}, i = 1, ..., L, together with a stack-top pointer p (an L-dim one-hot vector).

Push (pointer increment + value writing):
p := 1d_conv(p, [0, 0, 1])
A_i := A_i · (1 − p_i) + z · p_i, i = 1, ..., L

Pop (value reading + pointer decrement):
z := Σ_{i=1}^{L} A_i · p_i
p := 1d_conv(p, [1, 0, 0])

❏ The stack stores H×W image attention maps
❏ Each module first pops its input attention map(s) from the stack → then pushes its result back
❏ E.g., in Compare(Find(), Transform(Find())): Find pushes its localization result onto the stack; then Transform pops one attention map and pushes the transformed attention; then the Compare module pops two image attention maps and uses them to predict the answer
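A minimal numpy sketch of the differentiable push and pop. The 1-d convolutions with kernels [0,0,1] and [1,0,0] reduce to shifting the pointer mass up or down one slot; hard one-hot pointers are used here for readability, whereas in the model the pointer stays soft and everything remains differentiable:

```python
import numpy as np

def shift_up(p):
    # 1d_conv(p, [0, 0, 1]): pointer mass moves from slot i to i+1
    q = np.zeros_like(p)
    q[1:] = p[:-1]
    return q

def shift_down(p):
    # 1d_conv(p, [1, 0, 0]): pointer mass moves from slot i to i-1
    q = np.zeros_like(p)
    q[:-1] = p[1:]
    return q

def push(A, p, z):
    """Pointer increment, then write z at the pointer: A_i := A_i(1-p_i) + z*p_i."""
    p = shift_up(p)
    A = A * (1 - p)[:, None] + z[None, :] * p[:, None]
    return A, p

def pop(A, p):
    """Read the value under the pointer, then decrement the pointer."""
    z = (A * p[:, None]).sum(axis=0)
    return z, shift_down(p)

# LIFO check: a stack with L=4 slots holding 3-dim values (flattened attention maps)
A, p = np.zeros((4, 3)), np.array([1.0, 0.0, 0.0, 0.0])
A, p = push(A, p, np.ones(3))
A, p = push(A, p, 2 * np.ones(3))
z, p = pop(A, p)  # z == [2, 2, 2], the last value pushed
```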

Page 11:

Soft program execution

The model performs a continuous selection of the layout through the module weights w_m^{(t)}.

At t = 0: initialize the stack (A, p) with uniform image attention and a one-hot pointer p at the bottom of the stack.

At every step t: execute every module m on the current stack (A^{(t)}, p^{(t)}); during execution each module may pop from and push to the stack, producing its own result (A_m^{(t)}, p_m^{(t)}).

Then use w_m^{(t)} to average the results, and sharpen the averaged stack pointer with a softmax:
A^{(t+1)} = Σ_m A_m^{(t)} · w_m^{(t)}
p^{(t+1)} = softmax(Σ_m p_m^{(t)} · w_m^{(t)})
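A sketch of one soft execution step under these definitions (module internals are omitted; `A_list` and `p_list` are assumed to hold each module's resulting stack and pointer):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_execution_step(A_list, p_list, w):
    """Average every module's stack result by the controller weights w^(t).

    A_list: |M| stacks, each (L, D); p_list: |M| pointers, each (L,); w: (|M|,)
    """
    A_next = sum(w_m * A_m for w_m, A_m in zip(w, A_list))
    # averaging blurs the pointer; the softmax sharpens it back toward one-hot
    p_next = softmax(sum(w_m * p_m for w_m, p_m in zip(w, p_list)))
    return A_next, p_next
```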

Page 12:

Final Output

VQA: collect outputs from all the modules that produce answer outputs, across all timesteps:
y = Σ_{t=0}^{T−1} Σ_{m ∈ M(ans)} y_m^{(t)} · w_m^{(t)}, where M(ans) is the set of answer and compare modules

REF: take the image attention map at the top of the final stack at t = T and extract attended image features from it. A linear layer is then applied to the attended image feature to predict the bounding-box offsets from the feature-grid location.
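The VQA head's double sum is just a tensor contraction; a small sketch (the shapes are my assumption for illustration):

```python
import numpy as np

def vqa_output(y, w):
    """y: (T, |M_ans|, num_answers) per-module answer logits over all timesteps;
    w: (T, |M_ans|) controller weights restricted to the answer/compare modules."""
    return (y * w[:, :, None]).sum(axis=(0, 1))  # (num_answers,)
```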

Page 13:

Experiments

What does the soft layout's performance depend on?

❏ How does the choice of training task affect it?

Does the soft layout hurt performance?

❏ Comparison with models that use discrete layouts.

Does the explicit modular structure make the model more interpretable?

❏ Human evaluation
❏ Comparison with a non-modular model

Page 14:

Dataset

CLEVR VQA: images generated with a graphics engine; focuses on compositional reasoning.

70,000 train / 15,000 val / 15,000 test images | 10 questions/image

CLEVR-Ref: collected by the authors using the same graphics engine.

70,000 train / 15,000 val / 15,000 test images | 10 referential expressions/image

Page 15:

Results: What does the soft layout's performance depend on?

Joint training can lead to higher performance on both tasks (especially when not using the expert layout).

Page 16:

Results: Does the soft layout hurt performance?

The best-performing models do better with expert-layout supervision but fail to converge without it.

Page 17:

Results: Does the soft layout hurt performance?

Real VQA datasets focus more on visual recognition than on compositional reasoning.

Stack-NMN still outperforms N2NMN.

Page 18:

Results: Does the explicit modular structure make models more interpretable?

MAC: also performs multi-step sequential reasoning, with image and textual attention at each step.

Subjective understanding: can you understand what a step is doing from its attention?

Forward prediction: can you tell what the model will predict? [This tells us whether a person can anticipate where the model will go wrong.]

(Figure: percentage of each choice)

Page 19:

Critique - The Good

● Motivation:
○ Novel idea to increase the applicability of modular neural networks, which are more interpretable.

● Stack-NMN Model:
○ Novel end-to-end differentiable training approach for modular networks.
○ Additional advantage of a reduction in model parameters [PG+EE: 40.4M, TbD-net: 115M, Stack-NMN: 7.32M]

● Ablation study:
○ Ablated all the important model components, giving the reasoning behind the model design decisions.

Page 20:

Critique - The Not So Good

● Dataset:
○ Synthetic datasets are known to suffer from biases. An analysis of the newly created CLEVR-Ref would have been good.

● Stack-NMN Model:
○ How many modules are sufficient? [PG+EE, TbD-net: 39 modules | Stack-NMN: 9 modules]
○ Can the modules themselves be made reusable to decrease parameters?
○ Perhaps learnable generic modules?

● Evaluation Methodology:
○ Could have given a breakdown of accuracy over Count, Compare Numbers, Exist, Query Attribute, and Compare Attribute.
○ Performance on the CLEVR-CoGenT dataset would have provided an excellent test of generalization.

● Output Analysis:
○ Could have shown instances where the model goes wrong.

Page 21:

Development Since then

https://arxiv.org/pdf/1905.11532.pdf

- Learnable modules: a cell denotes a generic module, which can span all the required modules for a visual reasoning task.
- Each cell contains a certain number of nodes.
- The function of a node (denoted by O) is to perform a weighted sum of the outputs of different arithmetic operations applied to the input feature maps x1 and x2.
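A toy sketch of such a node: a weighted sum over a few candidate operations on two feature maps. The actual candidate-operation set in arXiv:1905.11532 differs; the one below is an illustrative assumption:

```python
import numpy as np

def node(x1, x2, alpha):
    """Weighted sum of candidate arithmetic ops applied to feature maps x1, x2.

    alpha: softmax weights over the candidate ops (learned during search).
    NOTE: this candidate-op set is illustrative, not the paper's exact set.
    """
    candidates = [x1 + x2, x1 * x2, np.maximum(x1, x2), x1]
    return sum(a * op for a, op in zip(alpha, candidates))
```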

Page 22:

References

- Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko. Explainable Neural Computation via Stack Neural Module Networks. ECCV, 2018.