cvpr talk v2

Situation Recognition: Visual Semantic Role Labeling for Image Understanding. Mark Yatskar, in collaboration with Luke Zettlemoyer and Ali Farhadi


Post on 19-May-2022


Page 1: CVPR Talk v2

Situation Recognition: Visual Semantic Role Labeling for Image Understanding

Mark Yatskar

in collaboration with Luke Zettlemoyer and Ali Farhadi

Page 2: CVPR Talk v2

How can we summarize what is happening in an image?

LOADING
AGENT: WOMAN
ITEM: HORSE
DESTINATION: TRAILER
TOOL: ROPE
PLACE: OUTDOORS

Page 4: CVPR Talk v2

Is the same thing happening in two images?

[Bar chart: turkers say… yes / somewhat / no (axis 0 to 100)]

[Bar chart: why no? activity / other (axis 0 to 100)]

Activity

Page 6: CVPR Talk v2

Is the same thing happening in two images?

[Bar chart: turkers say… yes / somewhat / no (axis 0 to 100)]

[Bar chart: why yes? throwing / playing (axis 0 to 100)]

[Bar chart: why no? ball / sport (axis 0 to 100)]

Activity

Object

Page 8: CVPR Talk v2

Is the same thing happening in two images?

[Bar chart: turkers say… yes / somewhat / no (axis 0 to 100)]

[Bar chart: why yes? beer / pouring / glass / man (axis 0 to 100)]

[Bar chart: why no? destination / source / other (axis 0 to 100)]

Activity

Object

Role

Page 9: CVPR Talk v2

Systematically describe how objects participate in activities through roles

LOADING
AGENT: WOMAN
ITEM: HORSE
DESTINATION: TRAILER
TOOL: ROPE
PLACE: OUTDOORS

Page 10: CVPR Talk v2

Situation Recognition

FIXING
AGENT: BOY
OBJECT: CAR
PART: TIRE
TOOL: TIRE IRON
PLACE: OUTDOORS

Page 12: CVPR Talk v2

Situation Recognition

POURING
AGENT: MAN
SUBSTANCE: BEER
SOURCE: TAP
DESTINATION: GLASS
PLACE: BARROOM

POURING
AGENT: MAN
SUBSTANCE: BEER
SOURCE: GLASS
DESTINATION: MOUTH
PLACE: BACKYARD

Same: AGENT, SUBSTANCE

Different: SOURCE, DESTINATION, PLACE

What is the space of possible situations?
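
The same/different comparison of the two POURING frames can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the helper name `compare_situations` are assumptions made here, not part of the imSitu release.

```python
def compare_situations(a, b):
    """Split the roles of two same-verb situations into matching and differing roles."""
    if a["verb"] != b["verb"]:
        raise ValueError("situations with different verbs do not share a role set")
    same = [role for role, noun in a["roles"].items() if b["roles"].get(role) == noun]
    different = [role for role in a["roles"] if role not in same]
    return same, different


# The two POURING frames from the slide (dictionary layout is an assumption).
pour_from_tap = {
    "verb": "pouring",
    "roles": {"agent": "man", "substance": "beer", "source": "tap",
              "destination": "glass", "place": "barroom"},
}
pour_into_mouth = {
    "verb": "pouring",
    "roles": {"agent": "man", "substance": "beer", "source": "glass",
              "destination": "mouth", "place": "backyard"},
}

same, different = compare_situations(pour_from_tap, pour_into_mouth)
print("same:", same)            # same: ['agent', 'substance']
print("different:", different)  # different: ['source', 'destination', 'place']
```

An activity label or an object list alone would call these two images identical; only the role assignments separate them.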

Page 13: CVPR Talk v2

imSitu: A Large Scale Situation Dataset

120k+ images, 500+ verbs, 100k+ situations

Page 17: CVPR Talk v2

Natural Language Processing: Semantic Role Labeling

A boy is fixing a car tire with a tire iron outdoors.

Activity

Object

Role

FIXING
AGENT: BOY
OBJECT: CAR
PART: TIRE
TOOL: TIRE IRON
PLACE: OUTDOORS

Page 18: CVPR Talk v2

Natural Language Processing: Semantic Role Labeling

A jockey falling from a horse onto the ground at a racetrack.

FALLING
AGENT: JOCKEY
SOURCE: HORSE
DESTINATION: GROUND
PLACE: RACETRACK

Activity

Object

Role

Page 19: CVPR Talk v2

FrameNet for Verb and Role Inventory

FIXING: AGENT, OBJECT, PART, TOOL, PLACE

semantic role labeling ontology: FrameNet (8000 verbs)

creating imSitu

Page 20: CVPR Talk v2

Visualness

FIXING: AGENT, OBJECT, PART, TOOL, PLACE

semantic role labeling ontology: FrameNet (8000 verbs)

filter verbs, semantic roles

~1000 visual verbs, ~3.5 roles/verb

creating imSitu: FrameNet

Page 21: CVPR Talk v2

WordNet for Noun Inventory

FIXING: AGENT, OBJECT, PART, TOOL, PLACE

values from noun ontology: WordNet (80,000 nouns)

semantic role labeling ontology: FrameNet (8000 verbs)

creating imSitu: FrameNet -> Visualness

Page 22: CVPR Talk v2

Filter Images

FIXING: AGENT, OBJECT, PART, TOOL, PLACE

values from noun ontology: WordNet (80,000 nouns)

semantic role labeling ontology: FrameNet (8000 verbs)

Google Images Search, Web N-grams

creating imSitu: FrameNet -> Visualness -> WordNet

Page 23: CVPR Talk v2

Fill Values

FIXING
AGENT: BOY
OBJECT: CAR
PART: TIRE
TOOL: TIRE IRON
PLACE: OUTDOORS

values from noun ontology: WordNet (80,000 nouns)

semantic role labeling ontology: FrameNet (8000 verbs)

creating imSitu: FrameNet -> Visualness -> WordNet -> Filter Images

Page 24: CVPR Talk v2

imSitu: Dataset Statistics

Verbs: 504
Images: 126,102
Situations/Image: 3
Roles (types): 1,788 (190)
Nouns (>= 3): 11,538 (6,794)
Annotations: 1,481,851
Images/Verb: 200-400
Unique situations (>= 3): 205,095 (21,505)

Despite 80,000 possible values, 2/3 annotators agree on 76.8% of role-value pairs

creating imSitu: FrameNet -> Visualness -> WordNet -> Filter Images -> Fill Values
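
One plausible reading of the 2/3 figure is that, for a given role of a given image, at least two of the three annotators chose the same noun. The sketch below computes that rate on a toy example. The data layout (a list of three role-to-noun dictionaries per image) and the function name `agreement_rate` are assumptions for illustration, not the dataset's actual file format.

```python
from collections import Counter


def agreement_rate(annotations, min_agree=2):
    """Fraction of role slots where at least `min_agree` annotators picked the same noun.

    `annotations` is a list of images; each image is a list of {role: noun}
    dicts, one per annotator (hypothetical layout).
    """
    agreed = total = 0
    for frames in annotations:
        for role in frames[0]:
            counts = Counter(frame[role] for frame in frames)
            total += 1
            if counts.most_common(1)[0][1] >= min_agree:
                agreed += 1
    return agreed / total if total else 0.0


# Toy data: one FIXING image labeled by three annotators.
toy = [[
    {"agent": "boy", "tool": "tire iron", "place": "outdoors"},
    {"agent": "boy", "tool": "wrench", "place": "outdoors"},
    {"agent": "man", "tool": "lug wrench", "place": "outdoors"},
]]
rate = agreement_rate(toy)
print(round(rate, 3))  # 0.667: agent and place agree, tool does not
```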

Page 25: CVPR Talk v2

Skew - not all verbs are equal

[Bar chart: VERBS vs. # NOUNS (axis 0 to 700). Verbs shown: scooping, putting, feeding, fetching, climbing, stumbling, laughing, bowing, inflating, shelling, snuggling, twisting, dancing, flossing. Example role values called out: food: milk, food: carrot; receiver: piglet, receiver: dolphin; tool: pot, tool: scooper; item: poop, item: seed.]

Page 28: CVPR Talk v2

Skew - not all nouns are equal

[Bar chart: NOUNS vs. # OF SEMANTIC ROLES (axis ticks 1, 10, 100, 1000). Nouns shown: car, man, elephant, zebra, fireplace, octopus, priest, flower, nostril, ice, bacon, vacuum, bow, cherry. Example roles called out: splashing.agent, swimming.agent, riding.vehicle, attacking.victim, colliding.agent, driving.vehicle, pumping.dest, buckling.place.]

Page 30: CVPR Talk v2

Situation Recognition: Models, Evaluation and Basic Results

structure matters

situation recognition improves object and activity recognition

Page 31: CVPR Talk v2

Neural Conditional Random Field

CLEAN
AGENT: man
SOURCE: chimney
DIRT: soot
TOOL: brush
PLACE: roof

VGG Convolutional Layers -> Conditional Random Field (Verb Potential, Verb-Role-Noun Potential)

Backpropagate CRF loss through VGG

$$p(S \mid i; \theta) \propto \psi_v(v, i; \theta) \prod_{(r, n_r) \in F} \psi_e(v, r, n_r, i; \theta)$$
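
Reading the slide's formula as a verb potential multiplied by one verb-role-noun potential per role, the probability of a situation can be computed exactly, because the roles are independent once the verb is fixed. The sketch below uses hand-written toy scores in place of VGG outputs and treats each potential as the exponential of a score. Both choices, and all function names, are assumptions for illustration rather than the authors' implementation.

```python
import math
from itertools import product


def log_partition(verb_scores, role_noun_scores):
    """log Z: for each verb, verb potential times the per-role sums over nouns."""
    per_verb = []
    for verb, roles in role_noun_scores.items():
        log_term = verb_scores[verb]
        for noun_scores in roles.values():
            log_term += math.log(sum(math.exp(s) for s in noun_scores.values()))
        per_verb.append(log_term)
    return math.log(sum(math.exp(t) for t in per_verb))


def situation_probability(verb_scores, role_noun_scores, verb, assignment):
    """p(S | i) for S = (verb, {role: noun}), normalized over all situations."""
    log_score = verb_scores[verb] + sum(
        role_noun_scores[verb][role][noun] for role, noun in assignment.items())
    return math.exp(log_score - log_partition(verb_scores, role_noun_scores))


def all_situations(role_noun_scores):
    """Enumerate every (verb, {role: noun}) combination (only feasible for toy sizes)."""
    for verb, roles in role_noun_scores.items():
        names = list(roles)
        for nouns in product(*(roles[r] for r in names)):
            yield verb, dict(zip(names, nouns))


# Toy log-potentials standing in for network outputs on one image.
verb_scores = {"cleaning": 1.0, "painting": 0.0}
role_noun_scores = {
    "cleaning": {"agent": {"man": 2.0, "woman": 0.5},
                 "tool": {"brush": 1.0, "rag": 0.0}},
    "painting": {"agent": {"man": 1.0, "woman": 1.0},
                 "place": {"roof": 0.5, "room": 0.0}},
}

total = sum(situation_probability(verb_scores, role_noun_scores, v, a)
            for v, a in all_situations(role_noun_scores))
best = max(all_situations(role_noun_scores),
           key=lambda s: situation_probability(verb_scores, role_noun_scores, *s))
print(best)
```

The partition function never enumerates full situations: it sums over nouns separately for each role, which is what keeps normalization cheap even with a large noun vocabulary.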

Page 32: CVPR Talk v2

Qualitative Examples

SPEARING
AGENT PERSON PERSON
VICTIM FISH FISH
PLACE OCEAN OCEAN

FALLING
AGENT PERSON PERSON
SOURCE HORSE HORSE
DEST. GRND. GRND.
PLACE FIELD FIELD

SWIMMING
AGENT SNAKE SNAKE
PLACE OCEAN OCEAN

Legend: Gold, Correct, Incorrect

Page 33: CVPR Talk v2

Qualitative Examples

SHAVING
AGENT MAN PERSON
CO-AGENT MAN MAN
BODYPART HEAD HEAD
SUBSTANCE S. CREAM
TOOL RAZOR RAZOR
PLACE INSIDE INSIDE

GIVING
AGENT SOLDIER
RECIPIENT GIRL
ITEM BAG
PLACE OUTSIDE

DETAINING
AGENT SOLDIER
VICTIM MAN
PLACE OUTSIDE

Legend: Gold, Correct, Incorrect

Page 36: CVPR Talk v2

Quantitative: Structured Prediction Crucial

FIXING
AGENT: BOY
OBJECT: CAR
PART: TIRE
TOOL: TIRE IRON
PLACE: OUTDOORS

Baseline: 5040-way CNN Classifier (10 most frequent situations/verb)
Situation CRF

[Bar charts: Verb, Verb-Role-Noun, Full Structure; each reports top-1 and top-5 for the baseline and the Situation CRF (axis 0 to 70)]
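
The three chart titles (Verb, Verb-Role-Noun, Full Structure) suggest three levels of strictness when scoring a predicted frame. The sketch below is one possible reading and not the official evaluation script. It assumes a single gold frame per image, gives role credit only when the verb is correct, and does not cover the top-5 setting. The function name `score_prediction` is hypothetical.

```python
def score_prediction(pred, gold):
    """Score one predicted frame against one gold frame (assumed definitions)."""
    verb_ok = pred["verb"] == gold["verb"]
    gold_roles = gold["roles"]
    # Role-noun pairs only count when the verb itself is right (assumption).
    hits = sum(pred["roles"].get(r) == n for r, n in gold_roles.items()) if verb_ok else 0
    return {
        "verb": float(verb_ok),
        "verb_role_noun": hits / len(gold_roles),
        "full_structure": float(verb_ok and hits == len(gold_roles)),
    }


gold = {"verb": "fixing",
        "roles": {"agent": "boy", "object": "car", "part": "tire",
                  "tool": "tire iron", "place": "outdoors"}}
pred = {"verb": "fixing",
        "roles": {"agent": "man", "object": "car", "part": "tire",
                  "tool": "tire iron", "place": "outdoors"}}

result = score_prediction(pred, gold)
print(result)  # verb correct, 4 of 5 role values correct, full structure wrong
```

Under this reading a single wrong noun zeroes the full-structure score, so a model has to get every role right at once to earn it.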

Page 37: CVPR Talk v2

Generalize to Unseen Combinations

FEEDING
AGENT: MAN
EATER: BABY
FOOD: MILK
SOURCE: BOTTLE
PLACE: ROOM
Instances in train: 35

FEEDING
AGENT: GIRL
EATER: HORSE
FOOD: CARROT
SOURCE: HAND
PLACE: PEN
Instances in train: 7

FEEDING
AGENT: WOMAN
EATER: HORSE
FOOD: MILK
SOURCE: BOTTLE
PLACE: BARN
Instances in train: 0

Test

Train
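
The "instances in train" counts can be reproduced by treating each situation as a hashable key and counting exact matches in the training set. The sketch below builds a toy training list that mirrors the counts on the slide. The tuple layout and helper names are assumptions for illustration.

```python
from collections import Counter


def situation_key(verb, roles):
    """Order-independent, hashable key for a (verb, {role: noun}) situation."""
    return (verb, tuple(sorted(roles.items())))


def train_counts(train, test):
    """For each test situation, count exact matches among training situations."""
    seen = Counter(situation_key(v, r) for v, r in train)
    return [seen[situation_key(v, r)] for v, r in test]


man_baby = ("feeding", {"agent": "man", "eater": "baby", "food": "milk",
                        "source": "bottle", "place": "room"})
girl_horse = ("feeding", {"agent": "girl", "eater": "horse", "food": "carrot",
                          "source": "hand", "place": "pen"})
woman_horse = ("feeding", {"agent": "woman", "eater": "horse", "food": "milk",
                           "source": "bottle", "place": "barn"})

# Toy training list constructed to mirror the slide's counts.
train = [man_baby] * 35 + [girl_horse] * 7
counts = train_counts(train, [man_baby, girl_horse, woman_horse])
print(counts)  # [35, 7, 0]
```

A count of zero marks a combination the model never saw whole, even though every individual role value may appear elsewhere in training.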

Page 39: CVPR Talk v2

Situations Improve Object and Activity Recognition

FIXING
AGENT: BOY
OBJECT: CAR
PART: TIRE
TOOL: TIRE IRON
PLACE: OUTDOORS

[Bar chart: Activity, top-1 and top-5 (axis 20 to 70)]

[Bar chart: Object, top-1 and top-5 (axis 60 to 100)]

Legend: object, activity, situation

Page 40: CVPR Talk v2

Errors

PRYING
AGENT: PERSON
ITEM: WOOD
SOURCE: FLOOR
TOOL: CROWBAR
PLACE: ROOM

PAINTING SPRAYING

PUMPING
AGENT: PERSON
ITEM: AIR
SOURCE: AIR
DESTINATION: WHEEL
TOOL: PUMP
PLACE: OUTSIDE

Page 41: CVPR Talk v2

imsitu.org

data/browsing/demo/code

Page 42: CVPR Talk v2

Conclusion

Introduced situation recognition
  role-centric structured representation of what's happening

Collected imSitu
  120k+ images, 500+ verbs, 100k+ situations

Introduced simple model
  neural CRF for situation
  structure matters
  provides strong context for activity and object recognition

data/browsing/demo/code

imsitu.org