Aishwarya Agrawal

Ph.D. Student

Machine Learning and Perception Lab

Identify objects in scene

3

sky

bus

car

stop light

person

building

sidewalk

Identify attributes of objects

4

blue

sky

red

bus

many

cars

green

stop light

one

bicycle

tall

building

Identify activities in scene

5

person wearing a helmet riding bicycle

man walking on sidewalk

Identify the scene

6

street scene

Describe the scene

8

A person on bike going through green light with bus nearby

A giraffe standing in the grass next to a tree.

11

• Answer questions about the scene

– Q: How many buses are there?

– Q: What is the name of the street?

– Q: Is the man on bicycle wearing a helmet?

13

Visual Question Answering (VQA)

Task: Given an image and a natural language open-ended question, generate a natural language answer.

15

VQA Task

16

VQA CloudCV Demo

cloudcv.org/vqa/?useVoice=1&listenAnswer=1

17

Applications of VQA

• An aid to the visually impaired

Is it safe to cross the street now?

18

Applications of VQA

• Surveillance

What kind of car did the man in the red shirt leave in?

19

Applications of VQA

• Interacting with a robot

Is my laptop in my bedroom upstairs?

20

VQA Dataset

21

Real images (from MSCOCO)

Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in COntext.” ECCV 2014.

http://mscoco.org/

22

Questions

Stump a smart robot!

Ask a question that a human can answer, but a smart robot probably can’t!

23

Two modalities of answering

• Open Ended

• Multiple Choice

24

Open Ended Task

What is the girl holding in her hand?

How many mirrors?

Why is the girl holding an umbrella?

25

Multiple Choice Task

What is the bus number?

a) 3 b) 1 c) green d) 4 e) window trim f) blue

g) m5 h) corn, carrots, onions, rice i) red j) 125 k) san antonio l) sign pen

m) 478 n) no o) 25 p) 2 q) yes r) white
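One simple way to handle the multiple-choice setting with a model that scores a fixed answer vocabulary is to pick the listed option with the highest score. The sketch below assumes that approach. The function name and the scores are hypothetical, chosen only for illustration, and do not reflect real model output for this image.

```python
def pick_multiple_choice(scores, candidates, default_score=float("-inf")):
    """Return the candidate option with the highest model score.

    scores: dict mapping answer string -> score over the model's answer vocabulary
    candidates: the answer options listed for this question
    Options missing from the vocabulary fall back to default_score.
    """
    return max(candidates, key=lambda c: scores.get(c, default_score))


# Hypothetical scores for "What is the bus number?" (illustration only).
scores = {"yes": 0.05, "no": 0.04, "2": 0.10, "25": 0.30, "478": 0.22, "red": 0.02}
candidates = [
    "3", "1", "green", "4", "window trim", "blue",
    "m5", "corn, carrots, onions, rice", "red", "125", "san antonio", "sign pen",
    "478", "no", "25", "2", "yes", "white",
]
choice = pick_multiple_choice(scores, candidates)
print(choice)
```

Restricting the choice to the listed options means the model can never answer outside the candidate set.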

26

Dataset Stats

• >250K images (MSCOCO + 50K Abstract Scenes)

• >750K questions (3 per image)

• ~10M answers (10 w/ image + 3 w/o image)
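The three figures on this slide are consistent with each other, as a quick calculation shows. The sketch uses the rounded lower bounds from the slide, so the true totals are somewhat higher. The variable names are illustrative.

```python
# Rounded lower bounds taken from the slide; actual counts are slightly larger.
num_images = 250_000
questions_per_image = 3
answers_with_image = 10     # "10 w/ image" on the slide
answers_without_image = 3   # "3 w/o image" on the slide

num_questions = num_images * questions_per_image
num_answers = num_questions * (answers_with_image + answers_without_image)

print(f"questions: {num_questions:,}")  # 750,000
print(f"answers:   {num_answers:,}")    # 9,750,000, i.e. roughly 10M
```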

27

Please visit www.visualqa.org for more details.

28

Browse the Dataset

http://visualqa.org/browser/

29

Questions

30

Dataset Visualization

http://visualqa.org/visualize/

32

Answers

• 38.4% of questions are binary yes/no

• 98.97% of questions have answers of <= 3 words

– 23k unique 1 word answers
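Statistics of this kind can be computed directly from a list of answer strings. The sketch below runs on a small made-up list, not on the VQA dataset, and the function name is hypothetical.

```python
def answer_stats(answers):
    """Summarize a list of answer strings."""
    n = len(answers)
    binary = sum(1 for a in answers if a in ("yes", "no"))
    short = sum(1 for a in answers if len(a.split()) <= 3)
    unique_one_word = {a for a in answers if len(a.split()) == 1}
    return {
        "binary_share": binary / n,           # fraction answered yes/no
        "short_share": short / n,             # fraction with at most 3 words
        "unique_one_word": len(unique_one_word),
    }


# Toy answers for illustration only.
toy_answers = ["yes", "no", "2", "white", "yes", "playing frisbee", "on the table", "no"]
stats = answer_stats(toy_answers)
print(stats)
```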

33

Answers

34

2-Channel VQA Model

[Diagram: two input channels feed one classifier.
Image → Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP → 4096-dim Embedding
Question (“How many horses are in this image?”) → 1024-dim Embedding
Both embeddings → Neural Network → Softmax over top K answers]

36
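The 2-channel idea (an image embedding and a question embedding combined by a neural network that ends in a softmax over the top K answers) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the presented model. The weights are random, the dimensions are tiny stand-ins for the 4096-dim and 1024-dim embeddings, and fusing by element-wise product is an assumption, since the slide only says "Neural Network".

```python
import math
import random


def linear(x, weights):
    """Multiply vector x by a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def vqa_forward(image_emb, question_emb, params):
    # Project each channel to a common size with a tanh non-linearity.
    img = [math.tanh(v) for v in linear(image_emb, params["w_img"])]
    qst = [math.tanh(v) for v in linear(question_emb, params["w_qst"])]
    # Assumed fusion: element-wise product of the two projections.
    fused = [a * b for a, b in zip(img, qst)]
    # Probability distribution over the K candidate answers.
    return softmax(linear(fused, params["w_out"]))


rng = random.Random(0)
IMG_DIM, QST_DIM, HIDDEN, K = 8, 4, 6, 5  # toy sizes standing in for 4096 and 1024
params = {
    "w_img": random_matrix(HIDDEN, IMG_DIM, rng),
    "w_qst": random_matrix(HIDDEN, QST_DIM, rng),
    "w_out": random_matrix(K, HIDDEN, rng),
}
image_emb = [rng.random() for _ in range(IMG_DIM)]
question_emb = [rng.random() for _ in range(QST_DIM)]
probs = vqa_forward(image_emb, question_emb, params)
print(len(probs), sum(probs))
```

Dropping one channel from `vqa_forward` and classifying from the remaining projection gives a one-channel variant in the spirit of the language-alone and vision-alone ablations.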

Ablation #1: Language-alone

[Diagram: only the question channel is used. Question → 1024-dim Embedding → Neural Network → Softmax over top K answers. The image channel (Convolution Layer + Non-Linearity stages, 1k output units) is drawn but not used.]

37

Ablation #2: Vision-alone

[Diagram: only the image channel is used. Image → Convolution Layer + Non-Linearity stages → 4096-dim Embedding → Neural Network → Softmax over top K answers. The question embedding is drawn but not used.]

38

Accuracy Metric

39
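The formula on this slide did not survive extraction. The metric published with the VQA dataset treats an answer as fully correct when at least 3 of the 10 human annotators gave it: accuracy = min(number of humans who gave that answer / 3, 1). A minimal sketch of that formula follows. The function name and sample answers are illustrative, and answers are assumed to be normalized (case, whitespace) before comparison.

```python
def vqa_accuracy(predicted, human_answers):
    """min(number of humans who gave `predicted` / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3, 1.0)


# Ten made-up human answers to one question.
human_answers = ["2", "2", "2", "3", "2", "2", "two", "2", "2", "2"]
print(vqa_accuracy("2", human_answers))  # 8 matches, capped at 1.0
print(vqa_accuracy("3", human_answers))  # 1 match, partial credit of 1/3
print(vqa_accuracy("4", human_answers))  # no matches, 0.0
```

The cap gives full credit to any answer on which several annotators agree, and partial credit to answers given by only one or two.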

Open-Ended Task Accuracies

40

Results

Human vs. Machine performance

[Chart: Human and Machine performance, annotated “25.14 room for improvement”]

41

Code available!

• Multiple-Choice > Open-Ended

• Question alone does quite well

• Image helps

Commonsense

• Does this person have 20/20 vision?

42

Does this question need commonsense?

43

Q: How many calories are in this pizza?

How old does a person need to be?

44

Q: How many calories are in this pizza?

Most “commonsense” questions

45

Least “commonsense” questions

46

Spectrum

• 3-4 (15.3%)

– Is that a bird in the sky?

– What color is the shoe?

– How many zebras are there?

– Is there food on the table?

– Is this man wearing shoes?

• 5-8 (39.7%)

– How many pizzas are shown?

– What are the sheep eating?

– What color is his hair?

– What sport is being played?

– Name one ingredient in the skillet.

• 9-12 (28.4%)

– Where was this picture taken?

– What ceremony does the cake commemorate?

– Are these boats too tall to fit under the bridge?

– What is the name of the white shape under the batter?

– Is this at the stadium?

• 13-17 (11.2%)

– Is he likely to get mugged if he walked down a dark alleyway like this?

– Is this a vegetarian meal?

– What type of beverage is in the glass?

– Can you name the performer in the purple costume?

– Besides these humans, what other animals eat here?

• 18+ (5.5%)

– What type of architecture is this?

– Is this a Flemish bricklaying pattern?

– How many calories are in this pizza?

– What government document is needed to partake in this activity?

– What is the make and model of this vehicle?

47

Question Average Age

what brand 12.5

why 11.18

what type 11.04

what kind 10.55

is this 10.13

what does 10.06

what time 9.81

who 9.58

where 9.54

which 9.32

does 9.29

do 9.23

what is 9.11

what are 9.04

are 8.65

is the 8.52

is there 8.24

what sport 8.06

how many 7.67

what animal 6.74

what color 6.6

48

VQA Age

• Average “age of questions” = 8.98 years.

• Our model* = 4.74 years old!

* age as estimated by untrained crowd-sourced workers

49

VQA Common sense

• Average common sense required = 31%.

• Our best algorithm has* 17% common sense!

* as estimated by untrained crowd-sourced workers

50

VQA Challenges on www.codalab.org

51

VQA Challenge @ CVPR16

52

VQA Challenge @ CVPR16

53

code available!

VQA Workshop @ CVPR16

54

Papers using VQA

… and many more

55

Dataset: >1k downloads

Code: >1.5k views

Academia, industry, start ups

56

Conclusions

• VQA: Visual Question Answering

– The next “grand challenge” in vision, language, AI

• Spectrum: Easy to Difficult

– “What room is this?” Scene Recognition

– “How many …” Object Recognition

– …

– “Does this person have 20/20 vision?” Common sense

• Exciting times ahead!

57

VQA Team

Aishwarya Agrawal

Virginia Tech

Meg Mitchell

Microsoft Research

Dhruv Batra

Virginia Tech

Larry Zitnick

Facebook AI Research

Jiasen Lu

Virginia Tech

Devi Parikh

Virginia Tech

Stanislaw Antol

Virginia Tech

Akrit Mohapatra

Virginia Tech

Webmaster

58

Closing Remarks

• CloudCV VQA Exhibition: Booth 101

• Contact email: [email protected]

• Please complete the Presenter Evaluation sent to you by email or through the GTC Mobile App. Your feedback is important!

59

Thanks!

Questions?

60

Visual Question Answering (VQA)

61
