multimodal compact bilinear pooling for vqa - vqa: …multimodal compact bilinear pooling for vqa...

Multimodal Compact Bilinear Pooling for VQA

AkiraFukui1,2,DongHuk Park1,DaylenYang1,AnnaRohrbach1,3,TrevorDarrell1,MarcusRohrbach1

1UCBerkeleyEECS,CA,UnitedStates2SonyCorp.,Tokyo,Japan3MaxPlanckInstituteforInformatics,Saarbrücken,Germany

Multimodallanguageandvisualunderstanding

Atablefulloffoodforafeast

Description


Thebowlwiththebrownsouce

Grounding


Gravy

VisualQuestionAnswering

Whatisthebrownsouce?

spoonplatebowltablefoodcorn…person

HowtoCombineImageRepresentationandQuestionRepresentation?

Isthisgoingtobeafeast?

CNNLSTM

Yes

Is?feastgoingtobe…

¨ Allelementscaninteract¨ Multiplicativeinteraction




CNNLSTM

Yes


þ Allelementscaninteract¨ Multiplicativeinteraction- Difficulttolearnoutputclassification

Concatenate

FC FC

RELU




CNNLSTM

Yes


¨ Allelementscaninteractþ Multiplicativeinteraction- Difficulttolearninputembedding

ElementwiseMultiplication

FC FC

RELU⨀




CNNLSTM

Yes


þ Allelementscaninteractþ Multiplicativeinteraction

OuterProduct/BilinearPooling

FC[Lin ICCV 2015]

[LinICCV2015]Tsung-YuLin,Aruni RoyChowdhury, andSubhransu Maji.Bilinear CNNmodelsforfine-grainedvisualrecognition. ICCV2015




CNNLSTM

Yes


þ Allelementscaninteractþ Multiplicativeinteraction¨ High#activations&computation¨ High#parameters

OuterProduct/BilinearPooling

FC

[Lin ICCV 2015]

2048

2048 4 million

4 million x 1000

[LinICCV2015]Tsung-YuLin,Aruni RoyChowdhury, andSubhransu Maji.Bilinear CNNmodelsforfine-grainedvisualrecognition. ICCV2015


MultimodalCompactBilinearPooling


CNNLSTM

Yes


þ Allelementscaninteractþ Multiplicativeinteractionþ Low#activations&computationþ Low#parameters

CompactBilinearPooling

FC2048

2048

16k x 1000

MCB

[Gao CVPR 16]

[GaoCVPR16]YangGao,OscarBeijbom,NingZhang,andTrevorDarrell.Compactbilinearpooling.CVPR2016

[ICLRWorkshops2016] Fine-grainedposeprediction,normalization,andrecognitionNZhang,EShelhamer,YGao,TDarrell



CNNLSTM

Yes

þ Allelementscaninteractþ Multiplicativeinteraction¨ Low#activations&computationþ Low#parameters

FC

16k x 1000

RandomProjection:Countsketch Ψ

[Countsketch]M.Charikar,K.Chen, M.Farach-Colton.Findingfrequentitemsindatastreams.Automata,languagesandprogramming2002.[Pham&Pagh 13]N.Pham,R.Pagh.Fastandscalablepolynomialkernelsviaexplicit featuremaps.KDD2013

Pham&Pagh (2013):Ψ x⨂𝑦 = Ψ 𝑥 ∗ Ψ(𝑦)



CNNLSTM

Yes


FC

16k x 1000

Ψ


Ψ

Convolution

∗




CNNLSTM

Yes


FC

16k x 1000

Ψ


Ψ

FFT

FFTFFT

-1

Convolution

⨀


Relatedwork

• Alternativeapproachtomultiplicativeinteractions- DPPNet:Hyeonwoo Noh,PaulHongsuck Seo,andBohyung Han.Imagequestionansweringusingconvolutionalneuralnetworkwithdynamicparameterprediction.CVPR2016

Figure 2. Overall architecture of the proposed Dynamic Parameter Prediction network (DPPnet), which is composed of the classificationnetwork and the parameter prediction network. The weights in the dynamic parameter layer are mapped by a hashing trick from thecandidate weights obtained from the parameter prediction network.

by constructing multiple branches connected to a commonCNN architecture. In this work, we hope to solve the het-erogeneous recognition tasks using a single CNN by adapt-ing the weights in the dynamic parameter layer. Since thetask is defined by the question in ImageQA, the weightsin the layer are determined depending on the question sen-tence. In addition, a hashing trick is employed to predicta large number of weights in the dynamic parameter layerand avoid parameter explosion.

3.2. Problem FormulationImageQA systems predict the best answer a given an im-

age I and a question q. Conventional approaches [16, 23]typically construct a joint feature vector based on two inputsI and q and solve a classification problem for ImageQA us-ing the following equation:

a = argmaxa∈Ω

p(a|I, q;θ) (1)

where Ω is a set of all possible answers and θ is a vectorfor the parameters in the network. On the contrary, we usethe question to predict weights in the classifier and solve theproblem. We find the solution by

a = argmaxa∈Ω

p(a|I;θs,θd(q)) (2)

where θs and θd(q) denote static and dynamic parameters,respectively. Note that the values of θd(q) are determinedby the question q.

4. Network ArchitectureFigure 2 illustrates the overall architecture of the pro-

posed algorithm. The network is composed of two sub-networks: classification network and parameter predictionnetwork. The classification network is a CNN. One of the

fully-connected layers in the CNN is the dynamic parame-ter layer, and the weights in the layer are determined adap-tively by the parameter prediction network. The parame-ter prediction network has GRU cells and a fully-connectedlayer. It takes a question as its input, and generates a real-valued vector, which corresponds to candidate weights forthe dynamic parameter layer in the classification network.Given an image and a question, our algorithm estimatesthe weights in the dynamic parameter layer through hash-ing with the candidate weights obtained from the parameterprediction network. Then, it feeds the input image to theclassification network to obtain the final answer. More de-tails of the proposed network are discussed in the followingsubsections.

4.1. Classification Network

The classification network is constructed based on VGG16-layer net [24], which is pre-trained on ImageNet [6]. Weremove the last layer in the network and attach three fully-connected layers. The second last fully-connected layer isthe dynamic parameter layer whose weights are determinedby the parameter prediction network, and the last fully-connected layer is the classification layer whose output di-mensionality is equal to the number of possible answers.The probability for each answer is computed by applying asoftmax function to the output vector of the final layer.

We put the dynamic parameter layer in the second lastfully-connected layer instead of the classification layer be-cause it involves the smallest number of parameters. As thenumber of parameters in the classification layer increases inproportion to the number of possible answers, predicting theweights for the classification layer may not be a good op-tion to general ImageQA problems in terms of scalability.Our choice for the dynamic parameter layer can be inter-preted as follows. By fixing the classification layer while

32

Experimentalsetup(withoutAttention)

• Solver- Cross-entropy-loss,Adam,learningrate0.0007

• FeatureExtraction- ResNet 152,image:448x448

• Answers- 3000mostfrequentontrain- Samplingwithprobabilityofanswers

• Trainedontrain/validatedonval /testedontest-dev


CNN(ResNet152)

LSTM,drop

LSTM,drop

FullConnected

Yes

2048

MCB16k

L2Norm

alization

Embed,Tanh

13k ~ 20k

300

1024

1024

2048

Softmax

SignedSqrt

L2Norm

alization

16k

16k

3000

58.7

58.5

59.8

57.8

56.4

58.6

57.1

58.4

57.5

56.5

54.0 55.0 56.0 57.0 58.0 59.0 60.0 61.0

MCB(128x128->4k)

FullBilinear(128x128->16k)

MCB(2048x2048->16k)

EltwiseProduct+FC+FC

EltwiseProduct+FC

EltwiseProduct

Concat+FC+FC

Concat+FC

Concat

EltwiseSum

Trainedontrain,test-devAcc.[%]

AblationComparisontoothermultimodalmethods


• MCBachieveshighestaccuracy

58.7

58.5

59.8

57.8

56.4

58.6

57.1

58.4

57.5

56.5

54.0 55.0 56.0 57.0 58.0 59.0 60.0 61.0

MCB(128x128->4k)

FullBilinear(128x128->16k)

MCB(2048x2048->16k)

EltwiseProduct+FC+FC

EltwiseProduct+FC

EltwiseProduct

Concat+FC+FC

Concat+FC

Concat

EltwiseSum

Trainedontrain,test-devAcc.[%]



• MCBcomparabletoFullBilinear

DimensionalityofMCB

• DimensionalityofMCBdecidestheperformanceofouterproductapproximation

2048

MCB

Dim size

2048

58.4

58.8

59.459.7

59.859.7

57.5

58.0

58.5

59.0

59.5

60.0

1024 2048 4096 8192 16000 32000

test-devAcc.[%]

VQAOpen-Endedtest-devaccuracy


Gravy

VisualQuestionAnswering

Whatisthebrownsouce?

MCBwithAttention

• PredictspatialattentionswithMCB

CNN(ResNet152)

Tile

MCB

WE,LSTM

16k x14x14SignedSqrt

L2Norm

alization2048x14x14

2048x14x14

Conv,Relu512 x 14 x 14

Softmax1 x 14 x 14

WeightedSum

Conv

FullConnected

CornMCB16k

Softmax

SignedSqrt

L2Norm

alization

16k

16k

3000

2048

2048

Whatistheyellowfood?

Attentionforcaptioning:- K.Xu,Show,AttendandTell:NeuralImageCaptionGenerationwithVisualAttentionAttentionforVQA:- H.Xu,K.Saenko Ask,AttendandAnswer:ExploringQuestion-GuidedSpatialAttentionforVisualQuestionAnswering- J.Lu HierarchalQuestion-ImageCo-AttentionforVisualQuestionAnswering

AttentionVisualizations

Isthispersonwearingahat?Yes[Groundtruth:Yes]

ResultsonMCBwithAttention

• MCBperformswellwithAttention

58.4

59.8

58.4

62.5

56.0

57.0

58.0

59.0

60.0

61.0

62.0

63.0

Concat+FC MCB Concat+FC+Attention

MCB+Attention

test-devAcc.[%]

PerformanceofAttentionModels

62.564.2 65.1

66.7

70.2

58

60

62

64

66

68

70

72

MCB+Attention(train)

MCB+Attention(train+val)

MCB+Attention

(train+val+genome)

MCB+Attention+Ensemble

(train+val+genome)

MCB+Attention+Ensemble

(train+val+genome)Multiple Choice

test-devAcc.[%]

VQAOpen-Ended accuracyforgenomeandensemble

• DataAugmentation- VQAdatafromVisualGenomeDataset- Additional1MQuestionandanswerpairs- Removedarticles,Singlewordanswer

• Ensembles- Averagetheoutput ofSoftmax overmodels

Techniquestoimproveperformance

Visual genome: Connecting language and vision using crowdsourced dense image annotations.

MCBonotherDatasetsandTasks

• Visual7w(MultipleChoice) • VisualGrounding

Our architecture for Visual 7w : MCB with Attention and Answer Encoding.

Visual7W: Grounded Question Answering in Images Grounding of textual phrases in images by reconstruction.

43.843.9

47.746.5

47.447.9

48.7

40 42 44 46 48 50

Plummer et al.Wang et al.

Rohrbach et al.Concat

Eltwise ProdEltwise Prod + Conv

MCB

Accuracy on Flickr30k Entities

54.3

52.8

62.2

45 50 55 60 65

Zhu et al.

Concat + Attention

MCB + Attention

Accuracy on Visual7W

Examples for VQA

What is the woman feeding the giraffe?Carrot[Groundtruth: Carrot]

Attention Visualizations

What color is her shirt?Purple[Groundtruth: Purple]


What is her hairstyle for the picture?Ponytail[Groundtruth: Ponytail]


What color is the chain on the red dress?Pink[Groundtruth: Gold]


• Correct Attention, Incorrect Fine-grained Recognition

Is the man going to fall down?No[Groundtruth: No]


What is the surface of the court made of?Clay[Groundtruth: Clay]


What sport is being played?Tennis[Groundtruth: Tennis]


What does the shop sell?Clocks[Groundtruth: Hot Dogs]


• Incorrect Attention

What credit card company is on the banner in the background?Budweiser[Groundtruth: Mastercard]


• Correct Attention, Incorrect Concept Association

Conclusions

• MultimodalCompactBilinearPooling- AllelementsinteractMultiplicatively- Compact andEfficient

• MCBwithAttention- Successfullypredictspatialattention

• GeneralizationCapability- Performanceimprovementinothervisionandlanguagetasks- Visual7W,VisualGrounding

- Compatiblewithothermodels- Applicabletogeneralmultimodaltasks,notonlyonvisionandlanguage

Thankyouforyourattention!

Demo:demo.berkeleyvision.orgCode:https://github.com/akirafukui/vqa-mcb/

MultimodalCompactBilinearPoolingforVisualQuestionAnsweringandGrounding

AkiraFukui,DongHuk Park,DaylenYang,AnnaRohrbach,TrevorDarrell,MarcusRohrbach

Arxiv 2016

multimodal compact bilinear pooling for vqa - vqa: …multimodal compact bilinear pooling for vqa...

Documents