Neural Module Networks Berthy Feng

Post on 08-Mar-2020

TRANSCRIPT

Page 1: Neural Module Networks · DESCRIBE MEASURE Modules. Convolve every position in input image w/ weight vector (distinct for each instance) to produce ... classification 2. VGG fine-tuned

Neural Module Networks
Berthy Feng

Page 2:

Outline: Presentation Overview

1. Neural Module Networks
2. Learning to Reason: End-to-End Neural Module Networks

Page 3:

Neural Module Networks
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. CVPR, 2016.

Page 4:

Motivations: Neural Module Networks

1. VQA Overview
2. Compositional Visual Intelligence

Page 5:

VQA Overview

(I, Q) → A

Previous approaches:

1. Classifier on encoded Q + I
2. Semantic parsers → logical expressions

Page 6:

VQA: Problems with Previous Approaches

● Classifier on encoded question + image
  ■ Monolithic: difficult to understand/explain models
  ■ Limited visual reasoning: probably just memorizing statistics
  ■ Limited linguistic reasoning: bag-of-words and RNN encodings can’t capture linguistic complexity
● Logical expressions
  ■ Purely logical representations of the world might ignore visual features and attentions
  ■ Semantic parsers not optimized for VQA

Page 7:

Insight: Compositional Nature of VQA

“Is this a truck?”
1. Classify (convolutional)

“How many objects are to the left of the toaster?”
1. Find toaster
2. Look left
3. Find objects
4. Count
(convolutional + recurrent)

Each question requires a different number of reasoning steps and different kinds of reasoning.

Page 8:

Insight: Compositional Nature of VQA

Questions are composed of “reasoning modules” and might share substructures:

Where is the dog? What color is the dog? Where is the cat?

Page 9:

Insight: Combine Both Approaches

Representational approaches: neural network structures are not universal, but they are at least modular across applications.

Compositional approaches: logical expressions and functional programs provide computational structure.

Proposed approach:
1. Predict computational structure from the question
2. Construct a modular neural network from this structure

Page 10:

Neural Modules

Neural Module Networks

1. Neural Module Networks

2. Module Types

Page 11:

Neural Module Networks

Model = collection of modules + network layout predictor

Output = distribution over answers

Page 12:

Modules

Data types:
1. Images
2. Attentions
3. Labels

Module types:
● FIND
● TRANSFORM
● COMBINE
● DESCRIBE
● MEASURE

Page 13:

Find Module
Image → Attention

Convolve every position in the input image with a weight vector (distinct for each instance) to produce an attention heatmap.

find[dog] = matrix with high values in regions containing dogs
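The find computation can be sketched as a per-position dot product (effectively a 1×1 convolution) followed by a softmax. This is a toy NumPy illustration, not the paper’s implementation; the weight vector `w_dog` stands in for the learned, instance-specific weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def find(features, w):
    """find[c]: dot each spatial position's feature vector with a
    concept-specific weight vector, then normalize into an attention map."""
    scores = features @ w                          # (H, W) position scores
    return softmax(scores.ravel()).reshape(scores.shape)

feats = np.random.rand(4, 4, 8)                    # toy conv feature grid
w_dog = np.random.rand(8)                          # hypothetical find[dog] weights
att = find(feats, w_dog)                           # (4, 4) attention heatmap
```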

Page 14:

Transform Module
Attention → Attention

Fully connected mapping from one attention to another (an MLP with ReLUs).

transform[not]: move attention away from active regions
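A minimal sketch of transform as a one-hidden-layer MLP over the flattened attention (the shapes and the softmax renormalization at the output are assumptions, not the paper’s exact details):

```python
import numpy as np

def transform(att, W1, b1, W2, b2):
    """transform: fully connected mapping from one attention map to another."""
    h = np.maximum(0.0, att.ravel() @ W1 + b1)     # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())
    return (e / e.sum()).reshape(att.shape)        # renormalized attention map

rng = np.random.default_rng(0)
att = np.full((4, 4), 1 / 16)                      # uniform input attention
W1, b1 = rng.normal(size=(16, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 16)), np.zeros(16)
out = transform(att, W1, b1, W2, b2)
```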

Page 15:

Combine Module
Attention × Attention → Attention

Merge two attentions into one.

combine[and]: activate regions active in both inputs
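One simple reading of combine[and] is an elementwise product, which is high only where both inputs are active; the actual module learns its merge, so treat this as a toy sketch:

```python
import numpy as np

def combine_and(a1, a2, eps=1e-12):
    """combine[and]: activate regions active in both input attentions."""
    out = a1 * a2                     # high only where both maps are high
    return out / (out.sum() + eps)    # renormalize to a valid attention

left  = np.array([[0.8, 0.2], [0.0, 0.0]])
right = np.array([[0.5, 0.0], [0.5, 0.0]])
both = combine_and(left, right)       # mass concentrates at position (0, 0)
```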

Page 16:

Describe Module
Image × Attention → Label

Average the image features weighted by attention, then pass the averaged feature vector through a fully connected layer.

describe[color] = representation of colors in the attended region
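The describe computation, attention-weighted average pooling followed by a fully connected layer, can be sketched as follows (the label set and shapes are illustrative):

```python
import numpy as np

def describe(features, att, W, b):
    """describe: average image features weighted by attention, then
    map the pooled vector to a distribution over labels."""
    pooled = (features * att[..., None]).sum(axis=(0, 1))  # (D,) weighted mean
    logits = pooled @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()                                     # label distribution

rng = np.random.default_rng(0)
feats = rng.random((4, 4, 8))
att = np.full((4, 4), 1 / 16)
W, b = rng.normal(size=(8, 5)), np.zeros(5)                # 5 toy color labels
dist = describe(feats, att, W, b)
```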

Page 17:

Measure Module
Attention → Label

Map an attention directly to a distribution over labels. Can evaluate the existence of a detected object or count sets of objects.
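Since measure reads only the attention (no image features), it can be sketched as a fully connected layer on the flattened attention map, e.g. for yes/no existence:

```python
import numpy as np

def measure(att, W, b):
    """measure: map an attention map directly to a label distribution,
    e.g. {yes, no} for existence questions."""
    logits = att.ravel() @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
att = np.full((4, 4), 1 / 16)
W, b = rng.normal(size=(16, 2)), np.zeros(2)   # 2 labels: yes / no
dist = measure(att, W, b)
```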

Page 18:

Question Parsing

Neural Module Networks

1. Question Parsing
2. Question Tree
3. Question Encoding
4. Predicting an Answer
5. Note on Training

Page 19:

Question Parsing

Question → dependency representation. Filter the dependencies to those connected by a wh-word or connecting verb.

“What is standing in the field?” → what(stand)
“What color is the truck?” → color(truck)
“Is there a circle next to a square?” → is(circle, next-to(square))

Page 20:

Question Tree

Leaves: find
Internal nodes: transform, combine
Root: describe or measure
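Executing such a tree is a simple recursion: leaves call find, internal nodes transform/combine, and the root describe or measure. A toy sketch with stub modules (the function names and tuple encoding are assumptions, and the stubs stand in for real neural modules):

```python
def run_layout(node, image, modules):
    """Recursively evaluate a layout tree such as
    ('measure', ('combine', ('find', 'red'), ('find', 'circle')))."""
    op = node[0]
    args = [run_layout(c, image, modules) if isinstance(c, tuple) else c
            for c in node[1:]]
    return modules[op](image, *args)

# stub modules operating on a fake 'image' (real ones are neural nets)
modules = {
    "find":    lambda img, label: {label},            # pretend attention
    "combine": lambda img, a, b: a | b,               # pretend combine[and]
    "measure": lambda img, a: "yes" if a <= img else "no",
}
image = {"red", "circle", "square"}                   # toy scene as a set
layout = ("measure", ("combine", ("find", "red"), ("find", "circle")))
answer = run_layout(layout, image, modules)           # → "yes"
```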

Page 21:

Question Encoding

Full model = NMN + LSTM question encoder

● The LSTM models attributes lost in parsing: “underlying syntactic, semantic regularities in the data”
● Single-layer LSTM with 1000 hidden units

Page 22:

Predicting an Answer

1. Pass the final hidden state of the LSTM through a fully connected (FC) layer
2. Add the output elementwise to the representation produced by the root module of the NMN
3. Apply a ReLU and another FC layer to get a distribution over possible answers
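The three steps above can be sketched directly (all shapes are illustrative):

```python
import numpy as np

def predict_answer(h_final, root_repr, W1, b1, W2, b2):
    """Fuse the LSTM question encoding with the NMN root-module output."""
    q = h_final @ W1 + b1                    # 1. FC on final LSTM hidden state
    fused = np.maximum(0.0, q + root_repr)   # 2. elementwise add, 3. ReLU...
    logits = fused @ W2 + b2                 # ...then another FC layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                       # distribution over answers

rng = np.random.default_rng(0)
h = rng.normal(size=100)                     # toy final LSTM state
root = rng.normal(size=64)                   # toy root-module representation
W1, b1 = rng.normal(size=(100, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 10)), np.zeros(10)   # 10 toy answers
p = predict_answer(h, root, W1, b1, W2, b2)
```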

Page 23:

Training

Modules are jointly trained. Because some weights are updated more frequently than others, adaptive per-weight learning rates (ADADELTA) are used.
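ADADELTA’s per-weight update keeps running averages of squared gradients and squared updates, so each weight’s step size adapts to its own update history; a standalone sketch of the rule:

```python
import numpy as np

def adadelta_step(w, g, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One ADADELTA update: the step size adapts per weight to that
    weight's gradient history (no global learning rate)."""
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2             # running avg of g^2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2          # running avg of dx^2
    return w + dx, Eg2, Edx2

w = np.zeros(3)
Eg2 = Edx2 = np.zeros(3)
for _ in range(10):                                  # descend on f(w) = |w - 1|^2
    g = 2 * (w - 1.0)                                # gradient points away from w = 1
    w, Eg2, Edx2 = adadelta_step(w, g, Eg2, Edx2)
```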

Page 24:

Experiments
Neural Module Networks

1. Evaluation on SHAPES
2. Evaluation on VQA
3. Future Work

Page 25:

SHAPES Dataset

● Synthetic dataset
● Complex questions about simple arrangements of colored shapes
● To prevent mode-guessing, all answers are yes/no

Page 26:

Is there a red shape above a circle?

Page 27:

Performance on SHAPES

(Chart: accuracy vs. number of modules in the neural module network.)

Page 28:

VQA Dataset

● Contains over 200,000 images from MS COCO
● Each image annotated with 3 questions and 10 answers per question

Page 29:

Visual input = conv5 layer of VGG-16 (after max-pooling)

1. VGG pre-trained on ImageNet classification
2. VGG fine-tuned on COCO for captioning

“What color is his tie?”

Page 30:

Results on VQA

Does especially well on questions answered by an object, attribute, or number.

Page 31:

Good Examples (1 of 2)

Page 32:

Good Examples (2 of 2)

Page 33:

Bad Examples (1 of 2)

Page 34:

Bad Examples (2 of 2)

Page 35:

Future Work

● Better parser
  “Are these people most likely experiencing a work day?”
  ■ Parser: be(people, likely)
  ■ Correct: be(people, work)
● Combine predicting network structures with learning network parameters

Page 36:

Learning to Reason: End-to-End Neural Module Networks for Visual Question Answering

Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. ICCV, 2017.

Page 37:

End-to-End Module Networks
Learning to Reason

1. Motivations
2. Proposed Approach
3. Attentional Neural Modules

Page 38:

Motivations

● Standard neural module networks rely on an external parser
  “Are these people most likely experiencing a work day?”
  Parser: be(people, likely)
● New datasets
  ■ CLEVR

Page 39:

Proposed Approach

Predict modular network architectures directly from textual input:

1) A set of co-attentive neural modules
2) A layout policy that predicts a question-specific network layout

Page 40:

Attentional Neural Modules

A neural module m maps input tensors to an output tensor. Each input tensor a_i is an image attention map over the convolutional feature grid. The output tensor y is either an image attention map or a probability distribution over answers.

Page 41:

Attentional Neural Modules

Page 42:

Attentional Neural Modules: Implementations

Page 43:

Architecture
Learning to Reason

1. Model Overview
2. Layout Policy

Page 44:

Model Overview

Page 45:

Model Overview

Page 46:

Layout Policy

Input: question
Output: probability distribution over layouts, p(l | q)

Page 47:

Layout Policy: Seq-to-Seq RNN (1 of 3)

1. Encode the question
   a. Embed every word i → vector w_i (using GloVe)
   b. A multi-layer LSTM encodes the input question → [h_1, h_2, …, h_T], where T is the number of words in the question
2. Decoder LSTM
3. Predict the next module at time t

Page 48:

Layout Policy: Seq-to-Seq RNN (2 of 3)

1. Encode the question → [h_1, h_2, …, h_T]
2. Decoder LSTM
   a. Same structure as the encoder, but different parameters
   b. At each time step t, predict attention weights a_{t,i} over every input word, using h_i (encoder outputs) and h_t (decoder output)
3. Predict the next module at time t

Page 49:

Layout Policy: Seq-to-Seq RNN (3 of 3)

1. Encode the question → [h_1, h_2, …, h_T]
2. Decoder LSTM → a_{t,i} = attention at time t for word i
3. Predict the next module at time t
   a. Context vector c_t = Σ_i a_{t,i} h_i (the attention heatmap across question words at time t)
   b. The probability of the next module token (the encoded form of a module) m(t) is predicted from h_t and c_t
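The per-step attention and context vector can be sketched as follows (the bilinear scoring form is an assumption; the paper’s exact scoring function may differ):

```python
import numpy as np

def decoder_attention(h_dec, H_enc, W):
    """Attention weights a_{t,i} over encoder states and the
    context vector c_t = sum_i a_{t,i} h_i."""
    scores = H_enc @ (W @ h_dec)        # one score per input word
    e = np.exp(scores - scores.max())
    a = e / e.sum()                     # a_{t,i}: sums to 1 over words
    c = a @ H_enc                       # context vector c_t
    return a, c

rng = np.random.default_rng(0)
T, d = 6, 16                            # 6 words, 16-dim hidden states
H = rng.normal(size=(T, d))             # encoder outputs h_1..h_T
W = rng.normal(size=(d, d))
a, c = decoder_attention(rng.normal(size=d), H, W)
```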

Page 50:

Layout Policy: Choosing a Layout

● Each time step t corresponds to a word in the input question
● For each t, sample from the distribution over all modules to get the next module token m(t)
● The probability of layout l is the product of its sampled token probabilities: p(l | q) = Π_t p(m(t) | q)

Page 51:

Layout Policy: Textual Input (1 of 3)

Previously: textual features hard-coded into each module instance
● describe[‘shape’] and describe[‘where’] are different instantiations

Idea: pass attention-based textual input to each module in the network
● “shape” and “where” would be inputs to a general describe module

Page 52:

Layout Policy: Textual Input (2 of 3)

Construct a textual input x_txt(t) for each time step t, using a_{t,i} from each word i.

(Slide shows a simplified textual attention heatmap.)

Page 53:

Layout Policy: Textual Input (3 of 3)

Construct a textual input x_txt(t) for each time step t, using a_{t,i} from each word i.

(Slide shows the textual attention heatmap.)
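One natural reading of x_txt(t) is the attention-weighted sum of the question’s word embeddings; a toy sketch (the one-hot "embeddings" are purely illustrative):

```python
import numpy as np

def textual_input(a_t, word_embs):
    """x_txt(t) = sum_i a_{t,i} * w_i: attention-weighted sum of word
    embeddings, fed to the module predicted at time step t."""
    return a_t @ word_embs

embs = np.eye(3)                         # 3 toy one-hot "word embeddings"
a_t = np.array([0.1, 0.8, 0.1])          # attention peaked on word 2
x = textual_input(a_t, embs)             # → [0.1, 0.8, 0.1]
```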

Page 54:

Linearized Layout

Reverse Polish notation lets us linearize functional programs.
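A Reverse-Polish token sequence can be folded back into a layout tree with a stack (the arity table and token names here are illustrative, not the paper’s token set):

```python
def rpn_to_tree(tokens, arity):
    """Rebuild a layout tree from its Reverse Polish linearization:
    each token pops `arity[token]` subtrees and pushes a new node."""
    stack = []
    for tok in tokens:
        args = [stack.pop() for _ in range(arity[tok])][::-1]
        stack.append((tok, *args))
    assert len(stack) == 1, "sequence must reduce to a single tree"
    return stack[0]

arity = {"find[red]": 0, "find[circle]": 0, "combine[and]": 2, "measure[is]": 1}
tree = rpn_to_tree(
    ["find[red]", "find[circle]", "combine[and]", "measure[is]"], arity)
# → ('measure[is]', ('combine[and]', ('find[red]',), ('find[circle]',)))
```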

Page 55:

Model Overview (Revisited)

Page 56:

Training
Learning to Reason

1. Loss Function
2. Behavioral Cloning

Page 57:

Answer-Based Loss: Function

Training loss function: softmax loss over the output answer scores.

Page 58:

Answer-Based Loss: Backpropagation

The loss function is not fully differentiable, since the layout l is discrete.
● Differentiable parts: backpropagation
● Non-differentiable parts: a policy gradient method from reinforcement learning

Page 59:

Behavioral Cloning (1 of 2)

● Difficult to train: 3 sets of parameters to learn
  ■ Seq-to-seq RNN parameters
  ■ Textual attention weights
  ■ Neural module parameters

Page 60:

Behavioral Cloning (2 of 2)

● Pre-train by behavioral cloning from an expert policy p_e
  ■ Minimize the KL divergence between p_e and p (our layout policy)
  ■ Simultaneously minimize the question-answering loss using the layout l obtained from p_e
● After learning a good initialization, further train the model end-to-end using the previously estimated gradient
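The cloning objective’s KL term between the expert policy p_e and the layout policy p can be sketched over a discrete set of layouts:

```python
import numpy as np

def kl_divergence(p_e, p, eps=1e-12):
    """KL(p_e || p): what behavioral cloning minimizes so the layout
    policy imitates the expert's layout distribution."""
    p_e, p = np.asarray(p_e), np.asarray(p)
    return float(np.sum(p_e * (np.log(p_e + eps) - np.log(p + eps))))

expert = np.array([0.9, 0.1])          # expert strongly prefers layout 0
policy = np.array([0.5, 0.5])          # untrained layout policy
gap = kl_divergence(expert, policy)    # positive: distributions differ
zero = kl_divergence(expert, expert)   # 0 once the policy matches
```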

Page 61:

Experiments
Learning to Reason

1. Evaluation on SHAPES
2. Evaluation on CLEVR
3. Evaluation on VQA

Page 62:

Evaluation on SHAPES

Page 63:

SHAPES Example

Page 64:

CLEVR Dataset

Johnson et al., 2016
● 100k images + 850k questions
● Questions synthesized with functional programs

Page 65:

Testing on CLEVR

● Visual inputs
  ■ For each image, extract a 15×10×512 tensor from the pool5 output of VGG-16 trained on ImageNet classification
  ■ Add two extra x, y dimensions to each location on the feature map → 15×10×514 tensor
● Textual inputs
  ■ Each question word embedded into a 300-dimensional vector
● Expert layout policy
  ■ Convert the annotated functional programs in CLEVR into module layouts

Page 66:

CLEVR Results

Page 67:

CLEVR Results

Page 68:

CLEVR Results

Page 69:

CLEVR Results

Page 70:

CLEVR Examples (1 of 2)

Page 71:

CLEVR Examples (2 of 2)

Page 72:

Testing on VQA

● Same input preparation as for CLEVR
● Expert layout policy same as in the NMN paper (using an external parser)

Page 73:

VQA Results

Page 74:

MCB

Page 75:

VQA Examples (1 of 2)

Page 76:

VQA Examples (2 of 2)

Page 77:

Summary
Neural Module Networks + Learning to Reason

1. Neural Module Networks
   a. Compositional reasoning
   b. Module toolbox
   c. Question parsing for layout prediction
2. Learning to Reason
   a. End-to-end module networks
   b. Layout policy
   c. Baseline for CLEVR

Page 78:

Extensions
Neural Module Networks + Learning to Reason

1. Compositional Reasoning for Other Vision Tasks
2. More Versions of Compositional VQA
   a. “Inferring and executing programs for visual reasoning” (Johnson et al., 2017): https://arxiv.org/pdf/1705.03633.pdf

Page 79: