Visual7W: Grounded Question Answering in Images
TRANSCRIPT
Visual7W: Grounded Question Answering in Images
Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei
Slides by Issey Masuda Mora
Computer Vision Reading Group (09/05/2016)
[arXiv] [web] [GitHub]
Context
Visual Question Answering
Goal: predict the answer to a given question about an image
Motivation
New Turing test? How to evaluate AI’s image understanding?
Visual7W
The 7W
WHAT
WHERE
WHEN
WHO
WHY
HOW
WHICH
Questions: multiple choice, 4 candidates, only one correct
Grounding: image-text correspondences. Exploit the relation between image regions and the nouns in the questions
The new answer is...
Question-answer types:
● Telling questions: the answer is text
● Pointing questions: a new QA type introduced by this work, where the answer is an image region (a bounding box); an illustrative record is sketched below
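To make this setup concrete, here is a minimal sketch of what a grounded, multiple-choice QA record could look like. The field names and layout are invented for illustration; they are not the actual Visual7W JSON schema.

```python
# Hypothetical record illustrating a grounded multiple-choice QA pair.
# Field names are invented for illustration; the real Visual7W data
# format may differ.
qa_pair = {
    "image_id": 12345,                       # COCO-style image identifier
    "type": "telling",                       # "telling" or "pointing"
    "question": "Who is under the umbrella?",
    "candidates": [                          # 4 candidates, only one correct
        "Two women",
        "A dog",
        "A street vendor",
        "Nobody",
    ],
    "answer_idx": 0,                         # index of the correct candidate
    "groundings": [                          # noun -> image-region links
        {"noun": "umbrella", "box": {"x": 120, "y": 40, "w": 180, "h": 90}},
        {"noun": "women", "box": {"x": 150, "y": 130, "w": 160, "h": 210}},
    ],
}
```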
Related work
Common approach
Example: "Who is under the umbrella?" → extract visual features from the image, embed the question, merge both representations, and predict the answer ("Two women"). A sketch of this pipeline follows.
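The snippet below is a rough sketch of that pipeline, assuming PyTorch; the module names, sizes, and the element-wise merge are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Minimal sketch of the common VQA pipeline (not the paper's model)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 img_feat_dim=4096, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)   # project CNN features
        self.classifier = nn.Linear(hidden_dim, num_answers)  # scores per answer

    def forward(self, img_feats, question_tokens):
        # img_feats: (batch, img_feat_dim) pre-extracted CNN features
        # question_tokens: (batch, seq_len) word indices
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                  # final LSTM state: question embedding
        v = torch.tanh(self.img_proj(img_feats))   # embedded image
        merged = q * v                             # element-wise merge
        return self.classifier(merged)             # logits over candidate answers

model = SimpleVQA(vocab_size=10000)
logits = model(torch.randn(2, 4096), torch.randint(0, 10000, (2, 12)))
```

Concatenating the two vectors followed by a linear layer is an equally common merge; the element-wise product is used here only to keep the sketch short.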
The Dataset
Visual7W Dataset
Characteristics:
● 47,300 images from the COCO dataset
● 327,939 QA pairs
● 561,459 object bounding boxes spread across 36,579 categories
Creating the Dataset
Procedure:
● Write QA pairs
● 3 AMT workers rate each pair as good or bad
● Only pairs with at least 2 good ratings are kept (see the sketch after this list)
● Write the 3 wrong answers (given the right one)
● Extract object names and draw a bounding box for each one
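A minimal sketch of the 2-of-3 filtering step, with a hypothetical data layout (one boolean rating per worker) invented for illustration:

```python
def filter_qa_pairs(qa_pairs):
    """Keep QA pairs rated 'good' by at least 2 of the 3 AMT workers."""
    # `ratings` is a hypothetical field: one boolean per worker.
    return [qa for qa in qa_pairs if sum(qa["ratings"]) >= 2]

# Example: only the first pair passes the 2-of-3 threshold.
pairs = [
    {"question": "Who is under the umbrella?", "ratings": [True, True, False]},
    {"question": "What is behind the tree?", "ratings": [True, False, False]},
]
print(filter_qa_pairs(pairs))  # -> only the umbrella question survives
```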
The Model
Attention-based model
Pointing questions model
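A minimal sketch of such spatial attention, assuming PyTorch and illustrative dimensions (not the authors' Torch implementation): the question state scores every cell of a convolutional feature map, and a softmax-weighted sum gives a question-conditioned image summary; the weights can also be visualized to localize relevant regions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Sketch of attention over CNN feature-map regions, conditioned on the question."""

    def __init__(self, img_dim=512, q_dim=512, att_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, att_dim)  # per-region projection
        self.q_proj = nn.Linear(q_dim, att_dim)      # question projection
        self.score = nn.Linear(att_dim, 1)           # scalar score per region

    def forward(self, feat_map, q_state):
        # feat_map: (batch, regions, img_dim), e.g. a 14x14 grid flattened to 196
        # q_state:  (batch, q_dim) final LSTM state of the question
        joint = torch.tanh(self.img_proj(feat_map)
                           + self.q_proj(q_state).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (batch, regions)
        attended = (weights.unsqueeze(-1) * feat_map).sum(dim=1)   # weighted sum
        return attended, weights

att = SpatialAttention()
summary, w = att(torch.randn(2, 196, 512), torch.randn(2, 512))
```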
Experiments & Results
Experiments
Different experiments were conducted depending on the information given to the subject:
● Only the question
● Question + image
Subjects/models:
● Human
● Logistic regression
● LSTM
● LSTM + attention model
All of them answer the same multiple-choice questions, as sketched below.
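Since every question comes with 4 candidates, one natural evaluation is accuracy under a shared scoring interface: score each question-candidate pair and pick the best. The `score` callable below is a hypothetical stand-in for any of the models above.

```python
def multiple_choice_accuracy(questions, score):
    """Fraction of questions whose top-scoring candidate is the correct one.

    `score(question, candidate)` is a hypothetical callable standing in
    for any model (logistic regression, LSTM, LSTM + attention, ...).
    """
    correct = 0
    for q in questions:
        scores = [score(q["question"], c) for c in q["candidates"]]
        correct += scores.index(max(scores)) == q["answer_idx"]
    return correct / len(questions)

# Example with a trivial scorer that prefers shorter answers.
qs = [{"question": "Who is under the umbrella?",
       "candidates": ["Two women", "A dog", "A street vendor", "Nobody"],
       "answer_idx": 0}]
print(multiple_choice_accuracy(qs, lambda q, c: -len(c)))  # 0.0 for this scorer
```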
Results
Conclusions
● A visual QA model has been presented
● An attention model focuses on local regions of the image
● A dataset with groundings has been created