Aishwarya Agrawal
Ph.D. Student
Machine Learning and Perception Lab
2
Identify objects in scene
3
sky
bus
car
stop light
person
building
sidewalk
Identify attributes of objects
4
blue
sky
red
bus
many
cars
green
stop light
one
bicycle
tall
building
Identify activities in scene
5
person wearing a
helmet riding bicycle
man walking
on sidewalk
Identify the scene
6
street scene
Describe the scene
8
A person on bike going through
green light with bus nearby
A giraffe standing in the
grass next to a tree.
11
• Answer questions about the scene
– Q: How many buses are there?
– Q: What is the name of the street?
– Q: Is the man on bicycle wearing a
helmet?
13
14
Visual Question Answering (VQA)
Task: Given an image and a natural language open-ended question, generate a natural language answer.
15
VQA Task
16
VQA CloudCV Demo
cloudcv.org/vqa/?useVoice=1&listenAnswer=1
17
Applications of VQA
• An aid to visually-impaired
Is it safe to cross the street now?
18
Applications of VQA
• Surveillance
What kind of car did the man in red shirt leave in?
19
Applications of VQA
• Interacting with robot
Is my laptop in my bedroom upstairs?
20
VQA Dataset
21
Real images (from MSCOCO)
Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in COntext.” ECCV 2014.
http://mscoco.org/
22
Questions
Stump a smart robot!
Ask a question that a human can answer,
but a smart robot probably can’t!
23
Two modalities of answering
• Open Ended
• Multiple Choice
24
Open Ended Task
What is the girl holding in her hand?
How many mirrors?
Why is the girl holding an umbrella?
25
Multiple Choice Task
What is the bus number?
a) 3 b) 1 c) green d) 4 e) window trim f) blue
g) m5 h) corn, carrots, onions, rice i) red j) 125 k) san antonio l) sign pen
m) 478 n) no o) 25 p) 2 q) yes r) white
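One simple way to answer a multiple-choice question is to reuse the scores of an open-ended model and keep only the listed candidates. The sketch below assumes that approach; the function name `pick_choice` and the score values are hypothetical, and only the candidate list comes from the slide.

```python
def pick_choice(answer_scores, choices):
    """Return the listed choice with the highest model score.

    Choices the model has no score for are treated as -infinity.
    """
    return max(choices, key=lambda c: answer_scores.get(c, float("-inf")))


# The 18 candidates shown on the slide for "What is the bus number?"
choices = [
    "3", "1", "green", "4", "window trim", "blue",
    "m5", "corn, carrots, onions, rice", "red", "125", "san antonio", "sign pen",
    "478", "no", "25", "2", "yes", "white",
]

# Hypothetical scores from an open-ended model. "7" scores highest overall
# but is not among the candidates, so it can never be picked.
answer_scores = {"7": 0.30, "125": 0.25, "m5": 0.20, "yes": 0.05}

picked = pick_choice(answer_scores, choices)
print(picked)
```

Restricting the choice to the candidate list is one reason multiple-choice accuracy can exceed open-ended accuracy for the same model.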
26
Dataset Stats
• >250K images (MSCOCO + 50K Abstract Scenes)
• >750K questions (3 per image)
• ~10M answers (10 w/ image + 3 w/o image)
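A quick check of how these figures fit together, treating the slide's lower bounds as round numbers (an assumption; the exact counts are on www.visualqa.org):

```python
# Round figures taken from the slide; the real counts are somewhat larger.
images = 250_000
questions_per_image = 3
answers_with_image = 10     # answers given while looking at the image
answers_without_image = 3   # answers given without seeing the image

questions = images * questions_per_image
answers = questions * (answers_with_image + answers_without_image)

print(f"questions: about {questions:,}")
print(f"answers:   about {answers:,}")
```

With these round inputs the product is 750,000 questions and 9,750,000 answers, which matches the ">750K" and "~10M" on the slide.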
27
Please visit www.visualqa.org for more details.
28
Browse the Dataset
http://visualqa.org/browser/
29
Questions
30
Dataset Visualization
http://visualqa.org/visualize/
32
Answers
• 38.4% of questions are binary yes/no
• 98.97% of questions have answers of 3 words or fewer
– 23k unique 1 word answers
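Statistics of this kind can be computed with a few lines of code. The sketch below runs on a tiny hand-made answer list, not on the VQA data, and the function name `answer_stats` and the lowercase-and-trim normalization are assumptions.

```python
def answer_stats(answers):
    """Return (yes/no share, share with <= 3 words, number of unique 1-word answers)."""
    normalized = [a.strip().lower() for a in answers]
    total = len(normalized)
    yes_no = sum(a in {"yes", "no"} for a in normalized)
    short = sum(len(a.split()) <= 3 for a in normalized)
    one_word = {a for a in normalized if len(a.split()) == 1}
    return yes_no / total, short / total, len(one_word)


# Toy answers for illustration only.
toy_answers = [
    "yes", "no", "2", "red", "m5",
    "san antonio", "corn, carrots, onions, rice", "Yes",
]

binary_share, short_share, unique_one_word = answer_stats(toy_answers)
print(binary_share, short_share, unique_one_word)
```

Because most answers are this short, the answer space can be treated as a fixed list of frequent answers, which is what "top K answers" refers to in the model slides.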
33
Answers
34
2-Channel VQA Model
[Diagram]
Image channel: Image → Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP → Embedding (4096-dim)
Question channel: “How many horses are in this image?” → Embedding
Both embeddings → Neural Network → Softmax over top K answers
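The two-channel idea can be sketched in plain Python: embed the image and the question separately, fuse the two embeddings, and apply a softmax over a fixed list of top K answers. This is a toy under stated assumptions, not the authors' released code: the slide does not say how the embeddings are fused (concatenation is assumed here), the weights are random and untrained, the dimensions are tiny stand-ins for the 4096-dim image embedding, and all names are hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_out, n_in):
    """Random (untrained) weights and zero biases for a dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def answer_distribution(image_emb, question_emb, hidden_layer, output_layer):
    fused = image_emb + question_emb  # list concatenation (assumed fusion)
    hidden = [math.tanh(h) for h in linear(fused, *hidden_layer)]
    return softmax(linear(hidden, *output_layer))


rng = random.Random(0)
IMAGE_DIM, QUESTION_DIM, HIDDEN_DIM = 8, 4, 6  # toy sizes
top_k_answers = ["yes", "no", "2", "white"]    # stand-in for the top K answers

image_emb = [rng.random() for _ in range(IMAGE_DIM)]        # would come from the CNN
question_emb = [rng.random() for _ in range(QUESTION_DIM)]  # would come from the question encoder
hidden_layer = random_layer(rng, HIDDEN_DIM, IMAGE_DIM + QUESTION_DIM)
output_layer = random_layer(rng, len(top_k_answers), HIDDEN_DIM)

probs = answer_distribution(image_emb, question_emb, hidden_layer, output_layer)
predicted = top_k_answers[probs.index(max(probs))]
print(predicted, probs)
```

Passing an empty list for `image_emb` or `question_emb` (and sizing the hidden layer accordingly) gives the shape of the language-alone and vision-alone ablations.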
36
Ablation #1: Language-alone
[Diagram]
Question channel only: “How many horses are in this image?” → Question Embedding (1024-dim) → Neural Network → Softmax over top K answers
Image channel (unused): Image → Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP (1k output units)
37
Ablation #2: Vision-alone
[Diagram]
Image channel only: Image → Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP → Embedding (4096-dim) → Neural Network → Softmax over top K answers
Question channel (unused): “How many horses are in this image?” → Question Embedding (1024-dim)
38
Accuracy Metric
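For reference, the metric as published in the VQA paper (Antol et al., ICCV 2015), rather than as shown on this slide, counts an answer as fully correct when at least 3 of the 10 human annotators gave it: accuracy = min(number of humans who gave that answer / 3, 1). A minimal sketch follows; the function name is hypothetical, and the lowercase-and-trim normalization is a simplifying assumption, since the official evaluation normalizes answers more thoroughly.

```python
def vqa_accuracy(predicted, human_answers):
    """min(#humans who gave the predicted answer / 3, 1)."""
    target = predicted.strip().lower()
    matches = sum(target == h.strip().lower() for h in human_answers)
    return min(matches / 3, 1.0)


# Ten hypothetical human answers to one question.
humans = ["2"] * 2 + ["3"] * 8

print(vqa_accuracy("3", humans))  # 8 of 10 humans agree, capped at 1.0
print(vqa_accuracy("2", humans))  # 2 of 10 humans agree, partial credit
print(vqa_accuracy("4", humans))  # nobody gave this answer
```

The cap at 1 means a prediction does not need to match the majority answer to get full credit; agreeing with any 3 annotators is enough.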
39
Open-Ended Task Accuracies
40
Results
Human vs. Machine performance
[Chart: Human performance vs. Machine performance; the gap is labeled "25.14, room for improvement"]
41
Code available!
• Multiple-Choice > Open-Ended
• Question alone does quite well
• Image helps
Commonsense
• Does this person have 20/20 vision?
42
Does this question need commonsense?
43
Q: How many calories are in this pizza?
How old does a person need to be?
44
Q: How many calories are in this pizza?
Most “commonsense” questions
45
Least “commonsense” questions
46
Spectrum
• 3-4 (15.3%)
– Is that a bird in the sky?
– What color is the shoe?
– How many zebras are there?
– Is there food on the table?
– Is this man wearing shoes?
• 5-8 (39.7%)
– How many pizzas are shown?
– What are the sheep eating?
– What color is his hair?
– What sport is being played?
– Name one ingredient in the skillet.
• 9-12 (28.4%)
– Where was this picture taken?
– What ceremony does the cake commemorate?
– Are these boats too tall to fit under the bridge?
– What is the name of the white shape under the batter?
– Is this at the stadium?
• 13-17 (11.2%)
– Is he likely to get mugged if he walked down a dark alleyway like this?
– Is this a vegetarian meal?
– What type of beverage is in the glass?
– Can you name the performer in the purple costume?
– Besides these humans, what other animals eat here?
• 18+ (5.5%)
– What type of architecture is this?
– Is this a Flemish bricklaying pattern?
– How many calories are in this pizza?
– What government document is needed to partake in this activity?
– What is the make and model of this vehicle?
47
Question Average Age
what brand 12.5
why 11.18
what type 11.04
what kind 10.55
is this 10.13
what does 10.06
what time 9.81
who 9.58
where 9.54
which 9.32
does 9.29
do 9.23
what is 9.11
what are 9.04
are 8.65
is the 8.52
is there 8.24
what sport 8.06
how many 7.67
what animal 6.74
what color 6.6
48
VQA Age
• Average “age of questions” = 8.98 years.
• Our model =* 4.74 years old!
* age as estimated by untrained crowd-sourced workers
49
VQA Common sense
• Average common sense required = 31%.
• Our best algorithm has* 17% common sense!
* as estimated by untrained crowd-sourced workers
50
VQA Challenges on www.codalab.org
51
VQA Challenge @ CVPR16
52
VQA Challenge @ CVPR16
53
code available!
VQA Workshop @ CVPR16
54
Papers using VQA
… and many more
55
Dataset: >1k downloads
Code: >1.5k views
Academia, industry, start ups
56
Conclusions
• VQA: Visual Question Answering
– The next “grand challenge” in vision, language, AI
• Spectrum: Easy to Difficult
– “What room is this?” Scene Recognition
– “How many …” Object Recognition
– …
– “Does this person have 20/20 vision” Common sense
• Exciting times ahead!
57
VQA Team
Aishwarya Agrawal
Virginia Tech
Meg Mitchell
Microsoft Research
Dhruv Batra
Virginia Tech
Larry Zitnick
Facebook AI
Research
Jiasen Lu
Virginia Tech
Devi Parikh
Virginia Tech
Stanislaw Antol
Virginia Tech
Akrit Mohapatra
Virginia Tech
Webmaster
58
Closing Remarks
• CloudCV VQA Exhibition: Booth 101
• Contact email: [email protected]
• Please complete the Presenter Evaluation sent to
you by email or through the GTC Mobile App. Your
feedback is important!
59
Thanks!
Questions?
60
Visual Question Answering (VQA)
61