multimodal compact bilinear pooling for vqa - vqa: …multimodal compact bilinear pooling for vqa...
TRANSCRIPT
Multimodal Compact Bilinear Pooling for VQA
AkiraFukui1,2,DongHuk Park1,DaylenYang1,AnnaRohrbach1,3,TrevorDarrell1,MarcusRohrbach1
1UCBerkeleyEECS,CA,UnitedStates2SonyCorp.,Tokyo,Japan3MaxPlanckInstituteforInformatics,Saarbrücken,Germany
Multimodallanguageandvisualunderstanding
Atablefulloffoodforafeast
Description
Multimodallanguageandvisualunderstanding
Thebowlwiththebrownsouce
Grounding
Multimodallanguageandvisualunderstanding
Gravy
VisualQuestionAnswering
Whatisthebrownsouce?
spoonplatebowltablefoodcorn…person
HowtoCombineImageRepresentationandQuestionRepresentation?
Isthisgoingtobeafeast?
CNNLSTM
Yes
Is?feastgoingtobe…
¨ Allelementscaninteract¨ Multiplicativeinteraction
spoonplatebowltablefoodcorn…person
HowtoCombineImageRepresentationandQuestionRepresentation?
Isthisgoingtobeafeast?
CNNLSTM
Yes
Is?feastgoingtobe…
þ Allelementscaninteract¨ Multiplicativeinteraction- Difficulttolearnoutputclassification
Concatenate
FC FC
RELU
spoonplatebowltablefoodcorn…person
HowtoCombineImageRepresentationandQuestionRepresentation?
Isthisgoingtobeafeast?
CNNLSTM
Yes
Is?feastgoingtobe…
¨ Allelementscaninteractþ Multiplicativeinteraction- Difficulttolearninputembedding
ElementwiseMultiplication
FC FC
RELU⨀
spoonplatebowltablefoodcorn…person
HowtoCombineImageRepresentationandQuestionRepresentation?
Isthisgoingtobeafeast?
CNNLSTM
Yes
Is?feastgoingtobe…
þ Allelementscaninteractþ Multiplicativeinteraction
OuterProduct/BilinearPooling
FC[Lin ICCV 2015]
[LinICCV2015]Tsung-YuLin,Aruni RoyChowdhury, andSubhransu Maji.Bilinear CNNmodelsforfine-grainedvisualrecognition. ICCV2015
spoonplatebowltablefoodcorn…person
HowtoCombineImageRepresentationandQuestionRepresentation?
Isthisgoingtobeafeast?
CNNLSTM
Yes
Is?feastgoingtobe…
þ Allelementscaninteractþ Multiplicativeinteraction¨ High#activations&computation¨ High#parameters
OuterProduct/BilinearPooling
FC
[Lin ICCV 2015]
2048
2048 4 million
4 million x 1000
[LinICCV2015]Tsung-YuLin,Aruni RoyChowdhury, andSubhransu Maji.Bilinear CNNmodelsforfine-grainedvisualrecognition. ICCV2015
spoonplatebowltablefoodcorn…person
MultimodalCompactBilinearPooling
Isthisgoingtobeafeast?
CNNLSTM
Yes
Is?feastgoingtobe…
þ Allelementscaninteractþ Multiplicativeinteractionþ Low#activations&computationþ Low#parameters
CompactBilinearPooling
FC2048
2048
16k x 1000
MCB
[Gao CVPR 16]
[GaoCVPR16]YangGao,OscarBeijbom,NingZhang,andTrevorDarrell.Compactbilinearpooling.CVPR2016
[ICLRWorkshops2016] Fine-grainedposeprediction,normalization,andrecognitionNZhang,EShelhamer,YGao,TDarrell
MultimodalCompactBilinearPooling
Isthisgoingtobeafeast?
CNNLSTM
Yes
þ Allelementscaninteractþ Multiplicativeinteraction¨ Low#activations&computationþ Low#parameters
FC
16k x 1000
RandomProjection:Countsketch Ψ
[Countsketch]M.Charikar,K.Chen, M.Farach-Colton.Findingfrequentitemsindatastreams.Automata,languagesandprogramming2002.[Pham&Pagh 13]N.Pham,R.Pagh.Fastandscalablepolynomialkernelsviaexplicit featuremaps.KDD2013
Pham&Pagh (2013):Ψ x⨂𝑦 = Ψ 𝑥 ∗ Ψ(𝑦)
MultimodalCompactBilinearPooling
Isthisgoingtobeafeast?
CNNLSTM
Yes
þ Allelementscaninteractþ Multiplicativeinteractionþ Low#activations&computationþ Low#parameters
FC
16k x 1000
Ψ
Pham&Pagh (2013):Ψ x⨂𝑦 = Ψ 𝑥 ∗ Ψ(𝑦)
Ψ
Convolution
∗
[Countsketch]M.Charikar,K.Chen, M.Farach-Colton.Findingfrequentitemsindatastreams.Automata,languagesandprogramming2002.[Pham&Pagh 13]N.Pham,R.Pagh.Fastandscalablepolynomialkernelsviaexplicit featuremaps.KDD2013
MultimodalCompactBilinearPooling
Isthisgoingtobeafeast?
CNNLSTM
Yes
þ Allelementscaninteractþ Multiplicativeinteractionþ Low#activations&computationþ Low#parameters
FC
16k x 1000
Ψ
Pham&Pagh (2013):Ψ x⨂𝑦 = Ψ 𝑥 ∗ Ψ(𝑦)
Ψ
FFT
FFTFFT
-1
Convolution
⨀
[Countsketch]M.Charikar,K.Chen, M.Farach-Colton.Findingfrequentitemsindatastreams.Automata,languagesandprogramming2002.[Pham&Pagh 13]N.Pham,R.Pagh.Fastandscalablepolynomialkernelsviaexplicit featuremaps.KDD2013
Relatedwork
• Alternativeapproachtomultiplicativeinteractions- DPPNet:Hyeonwoo Noh,PaulHongsuck Seo,andBohyung Han.Imagequestionansweringusingconvolutionalneuralnetworkwithdynamicparameterprediction.CVPR2016
Figure 2. Overall architecture of the proposed Dynamic Parameter Prediction network (DPPnet), which is composed of the classificationnetwork and the parameter prediction network. The weights in the dynamic parameter layer are mapped by a hashing trick from thecandidate weights obtained from the parameter prediction network.
by constructing multiple branches connected to a commonCNN architecture. In this work, we hope to solve the het-erogeneous recognition tasks using a single CNN by adapt-ing the weights in the dynamic parameter layer. Since thetask is defined by the question in ImageQA, the weightsin the layer are determined depending on the question sen-tence. In addition, a hashing trick is employed to predicta large number of weights in the dynamic parameter layerand avoid parameter explosion.
3.2. Problem FormulationImageQA systems predict the best answer a given an im-
age I and a question q. Conventional approaches [16, 23]typically construct a joint feature vector based on two inputsI and q and solve a classification problem for ImageQA us-ing the following equation:
a = argmaxa∈Ω
p(a|I, q;θ) (1)
where Ω is a set of all possible answers and θ is a vectorfor the parameters in the network. On the contrary, we usethe question to predict weights in the classifier and solve theproblem. We find the solution by
a = argmaxa∈Ω
p(a|I;θs,θd(q)) (2)
where θs and θd(q) denote static and dynamic parameters,respectively. Note that the values of θd(q) are determinedby the question q.
4. Network ArchitectureFigure 2 illustrates the overall architecture of the pro-
posed algorithm. The network is composed of two sub-networks: classification network and parameter predictionnetwork. The classification network is a CNN. One of the
fully-connected layers in the CNN is the dynamic parame-ter layer, and the weights in the layer are determined adap-tively by the parameter prediction network. The parame-ter prediction network has GRU cells and a fully-connectedlayer. It takes a question as its input, and generates a real-valued vector, which corresponds to candidate weights forthe dynamic parameter layer in the classification network.Given an image and a question, our algorithm estimatesthe weights in the dynamic parameter layer through hash-ing with the candidate weights obtained from the parameterprediction network. Then, it feeds the input image to theclassification network to obtain the final answer. More de-tails of the proposed network are discussed in the followingsubsections.
4.1. Classification Network
The classification network is constructed based on VGG16-layer net [24], which is pre-trained on ImageNet [6]. Weremove the last layer in the network and attach three fully-connected layers. The second last fully-connected layer isthe dynamic parameter layer whose weights are determinedby the parameter prediction network, and the last fully-connected layer is the classification layer whose output di-mensionality is equal to the number of possible answers.The probability for each answer is computed by applying asoftmax function to the output vector of the final layer.
We put the dynamic parameter layer in the second lastfully-connected layer instead of the classification layer be-cause it involves the smallest number of parameters. As thenumber of parameters in the classification layer increases inproportion to the number of possible answers, predicting theweights for the classification layer may not be a good op-tion to general ImageQA problems in terms of scalability.Our choice for the dynamic parameter layer can be inter-preted as follows. By fixing the classification layer while
32
Experimentalsetup(withoutAttention)
• Solver- Cross-entropy-loss,Adam,learningrate0.0007
• FeatureExtraction- ResNet 152,image:448x448
• Answers- 3000mostfrequentontrain- Samplingwithprobabilityofanswers
• Trainedontrain/validatedonval /testedontest-dev
Isthisgoingtobeafeast?
CNN(ResNet152)
LSTM,drop
LSTM,drop
FullConnected
Yes
2048
MCB16k
L2Norm
alization
Embed,Tanh
13k ~ 20k
300
1024
1024
2048
Softmax
SignedSqrt
L2Norm
alization
16k
16k
3000
58.7
58.5
59.8
57.8
56.4
58.6
57.1
58.4
57.5
56.5
54.0 55.0 56.0 57.0 58.0 59.0 60.0 61.0
MCB(128x128->4k)
FullBilinear(128x128->16k)
MCB(2048x2048->16k)
EltwiseProduct+FC+FC
EltwiseProduct+FC
EltwiseProduct
Concat+FC+FC
Concat+FC
Concat
EltwiseSum
Trainedontrain,test-devAcc.[%]
AblationComparisontoothermultimodalmethods
AblationComparisontoothermultimodalmethods
• MCBachieveshighestaccuracy
58.7
58.5
59.8
57.8
56.4
58.6
57.1
58.4
57.5
56.5
54.0 55.0 56.0 57.0 58.0 59.0 60.0 61.0
MCB(128x128->4k)
FullBilinear(128x128->16k)
MCB(2048x2048->16k)
EltwiseProduct+FC+FC
EltwiseProduct+FC
EltwiseProduct
Concat+FC+FC
Concat+FC
Concat
EltwiseSum
Trainedontrain,test-devAcc.[%]
AblationComparisontoothermultimodalmethods
AblationComparisontoothermultimodalmethods
• MCBcomparabletoFullBilinear
DimensionalityofMCB
• DimensionalityofMCBdecidestheperformanceofouterproductapproximation
2048
MCB
Dim size
2048
58.4
58.8
59.459.7
59.859.7
57.5
58.0
58.5
59.0
59.5
60.0
1024 2048 4096 8192 16000 32000
test-devAcc.[%]
VQAOpen-Endedtest-devaccuracy
Multimodallanguageandvisualunderstanding
Gravy
VisualQuestionAnswering
Whatisthebrownsouce?
MCBwithAttention
• PredictspatialattentionswithMCB
CNN(ResNet152)
Tile
MCB
WE,LSTM
16k x14x14SignedSqrt
L2Norm
alization2048x14x14
2048x14x14
Conv,Relu512 x 14 x 14
Softmax1 x 14 x 14
WeightedSum
Conv
FullConnected
CornMCB16k
Softmax
SignedSqrt
L2Norm
alization
16k
16k
3000
2048
2048
Whatistheyellowfood?
Attentionforcaptioning:- K.Xu,Show,AttendandTell:NeuralImageCaptionGenerationwithVisualAttentionAttentionforVQA:- H.Xu,K.Saenko Ask,AttendandAnswer:ExploringQuestion-GuidedSpatialAttentionforVisualQuestionAnswering- J.Lu HierarchalQuestion-ImageCo-AttentionforVisualQuestionAnswering
AttentionVisualizations
Isthispersonwearingahat?Yes[Groundtruth:Yes]
ResultsonMCBwithAttention
• MCBperformswellwithAttention
58.4
59.8
58.4
62.5
56.0
57.0
58.0
59.0
60.0
61.0
62.0
63.0
Concat+FC MCB Concat+FC+Attention
MCB+Attention
test-devAcc.[%]
PerformanceofAttentionModels
62.564.2 65.1
66.7
70.2
58
60
62
64
66
68
70
72
MCB+Attention(train)
MCB+Attention(train+val)
MCB+Attention
(train+val+genome)
MCB+Attention+Ensemble
(train+val+genome)
MCB+Attention+Ensemble
(train+val+genome)Multiple Choice
test-devAcc.[%]
VQAOpen-Ended accuracyforgenomeandensemble
• DataAugmentation- VQAdatafromVisualGenomeDataset- Additional1MQuestionandanswerpairs- Removedarticles,Singlewordanswer
• Ensembles- Averagetheoutput ofSoftmax overmodels
Techniquestoimproveperformance
Visual genome: Connecting language and vision using crowdsourced dense image annotations.
MCBonotherDatasetsandTasks
• Visual7w(MultipleChoice) • VisualGrounding
Our architecture for Visual 7w : MCB with Attention and Answer Encoding.
Visual7W: Grounded Question Answering in Images Grounding of textual phrases in images by reconstruction.
43.843.9
47.746.5
47.447.9
48.7
40 42 44 46 48 50
Plummer et al.Wang et al.
Rohrbach et al.Concat
Eltwise ProdEltwise Prod + Conv
MCB
Accuracy on Flickr30k Entities
54.3
52.8
62.2
45 50 55 60 65
Zhu et al.
Concat + Attention
MCB + Attention
Accuracy on Visual7W
Examples for VQA
What is the woman feeding the giraffe?Carrot[Groundtruth: Carrot]
Attention Visualizations
What color is her shirt?Purple[Groundtruth: Purple]
Attention Visualizations
What is her hairstyle for the picture?Ponytail[Groundtruth: Ponytail]
Attention Visualizations
What color is the chain on the red dress?Pink[Groundtruth: Gold]
Attention Visualizations
• Correct Attention, Incorrect Fine-grained Recognition
Is the man going to fall down?No[Groundtruth: No]
Attention Visualizations
What is the surface of the court made of?Clay[Groundtruth: Clay]
Attention Visualizations
What sport is being played?Tennis[Groundtruth: Tennis]
Attention Visualizations
What does the shop sell?Clocks[Groundtruth: Hot Dogs]
Attention Visualizations
• Incorrect Attention
What credit card company is on the banner in the background?Budweiser[Groundtruth: Mastercard]
Attention Visualizations
• Correct Attention, Incorrect Concept Association
Conclusions
• MultimodalCompactBilinearPooling- AllelementsinteractMultiplicatively- Compact andEfficient
• MCBwithAttention- Successfullypredictspatialattention
• GeneralizationCapability- Performanceimprovementinothervisionandlanguagetasks- Visual7W,VisualGrounding
- Compatiblewithothermodels- Applicabletogeneralmultimodaltasks,notonlyonvisionandlanguage
Thankyouforyourattention!
Demo:demo.berkeleyvision.orgCode:https://github.com/akirafukui/vqa-mcb/
MultimodalCompactBilinearPoolingforVisualQuestionAnsweringandGrounding
AkiraFukui,DongHuk Park,DaylenYang,AnnaRohrbach,TrevorDarrell,MarcusRohrbach
Arxiv 2016