cvpr talk v2

Situation Recognition Visual Semantic Role Labeling for Image Understanding

Mark Yatskar

in collaboration w/ Luke Zettlemoyer, Ali Farhadi

How can we summarize what is happening in an image?

LOADINGAGENT ITEM DESTINATION TOOL PLACE

WOMAN HORSE TRAILER ROPE OUTDOORS

Is the same thing happening in two images?

turkers say…

0255075

yes somewhat no

turkers say…

0255075

yes somewhat no

why no?

0255075

activty other

Activity

turkers say…

0255075

yes somewhat no

why yes?

0255075

throwing playing

Activity

turkers say…

0255075

yes somewhat no

why yes?

0255075

throwing playing

why no?

0255075

ball sport

Activity

Object

Activity

Object

turkers say…

0255075

yes somewhat no

why yes?

0255075

beer pouring glass man

Activity

Object

turkers say…

0255075

yes somewhat no

why yes?

0255075

beer pouring glass man

why no?

0255075

destination source other

Systematically describe how objects participate

LOADINGAGENT ITEM DESTINATION TOOL PLACE

WOMAN HORSE TRAILER ROPE OUTDOORS

in activities through roles

Situation Recognition

FIXINGAGENT OBJECT PART TOOL PLACE

BOY CAR TIRE TIRE IRON OUTDOORS

POURING

AGENT MANSUBSTANCE BEER

SOURCE TAP

DESTINATION GLASS

PLACE BARROOM

POURING

SOURCE GLASS

DESTINATION MOUTH

PLACE BACKYARD

Different

POURING

SOURCE TAP

DESTINATION GLASS

PLACE BARROOM

POURING

SOURCE GLASS

DESTINATION MOUTH

PLACE BACKYARD

Different

What is the space of possible situations?

imSituA Large Scale Situation Dataset

120k+ images, 500+ verbs, 100k+ situations

Natural Language Processing: Semantic Role Labeling

A boy is fixing a car tire with a tire iron outdoors.

Activity

Object

Activity

Object

Activity

Object

Activity

Object

A jockey falling from a horse onto the ground at a racetrack.

FALLINGAGENT SOURCE DESTINATION PLACEJOCKEY HORSE GROUND RACETRACK

Activity

Object

FrameNet for Verb and Role Inventory

semantic role labeling ontology:

FrameNet (8000 verbs)

creating imSitu

Visualness

~1000 visual verbs~3.5 roles/verb

filter verbs, semantic roles

creating imSituFrameNet

WordNet for Noun Inventory

values from noun ontology: WordNet (80, 000 nouns)

creating imSituFrameNetVisualness

Google Images SearchWeb N-grams

Filter Images creating imSitu

WordNet

FrameNetVisualness

Fill Valuescreating imSitu

WordNet

FrameNetVisualness

Filter Images

imSitu: Dataset Statistics

Verbs 504Images 126,102

Situation/Image 3Roles (types) 1,788 (190)Nouns ( >=3) 11,538 (6,794)Annotations 1,481,851Images/Verb 200-400

Uniq. situations (>= 3) 205,095 (21,505)

Despite 80,000 possible values, 2/3 annotators on 76.8% of role-value

WordNet

creating imSituFrameNetVisualness

Filter ImagesFill Values

Skew - not all verbs are equal

0 175 350 525 700

scoopingputting

feeding

fetching

climbing

stumbling

laughingbowing

inflating

shellingsnuggling

twisting

dancing

flossing

VERBS # NOUNS

food: milk

receiver: piglet receiver: dolpin

food: carrot

tool:pot tool: scooper

item:poop item: seed

0 175 350 525 700

scoopingputting

feeding

fetching

climbing

stumbling

laughingbowing

inflating

shellingsnuggling

twisting

dancing

flossing

VERBS # NOUNS

food: milk

food: carrot

0 175 350 525 700

scoopingputting

feeding

fetching

climbing

stumbling

laughingbowing

inflating

shellingsnuggling

twisting

dancing

flossing

VERBS # NOUNS

food: milk

food: carrot

1 10 100 1000

elephant

fireplace

octopus

priestflower

nostril

baconvacuum

cherry

NOUNS # OF SEMANTIC ROLES

splashing.agent swimming.agent

riding.vehicle attacking.victim

colliding.agentdriving.vehicle

pumping.destbuckling.place

Skew - not all nouns are equal

1 10 100 1000

elephant

fireplace

octopus

priestflower

nostril

baconvacuum

cherry

NOUNS # OF SEMANTIC ROLES

splashing.agent swimming.agent

riding.vehicle attacking.victim

colliding.agentdriving.vehicle

pumping.destbuckling.place

Skew - not all nouns are equal

Situation RecognitionModels, Evaluation and Basic Results

structure matterssituation recognition improves object and activity recognition

CLEANAGENT SOURCE DIRT TOOL PLACE

man chimney soot brush roof

Conditional Random FieldVGG

Convolutional Layers

Verb-Role-Noun Potential

Verb Potential

Neural Conditional Random Field

Backpropogate CRF loss through VGG

p(S|i; ✓) / v(v, i; ✓)Y

(r,nr)2F

e(v, r, nr, i; ✓)

Qualitative Examples

SPEARING

AGENT PERSON PERSON

VICTIM FISH FISHPLACE OCEAN OCEAN

FALLING

AGENT PERSON PERSONSOURCE HORSE HORSE

DEST. GRND. GRND.PLACE FIELD FIELD

SWIMMING

AGENT SNAKE SNAKEPLACE OCEAN OCEAN

Gold Correct Incorrect

Qualitative Examples

SHAVING

AGENT MAN PERSONCO-AGENT MAN MANBODYPART HEAD HEADSUBSTANCE S. CREAM

TOOL RAZOR RAZORPLACE INSIDE INSIDE

GIVING

AGENT SOLDIERRECIPIENT GIRL

ITEM BAGPLACE OUTSIDE

DETAINING

AGENT SOLDIERVICTIM MANPLACE OUTSIDE

Gold Correct Incorrect

010203040506070

top-1 top-5

Baseline: 5040-way CNN Predictor (10 most frequent situation/verb)Situation CRF

Quantitive : Structured Prediction Crucial

010203040506070

top-1 top-5

Baseline: 5040-way CNN Predictor (10 most frequent situation/verb)Situation CRF

Verb-Role-Noun

top-1 top-5

010203040506070

top-1 top-5

Baseline: 5040-way CNN Classifer (10 most frequent situation/verb)Situation CRF

Full Structure

top-1 top-5

Verb-Role-Noun

top-1 top-5

FEEDING

AGENT MANEATER BABYFOOD MILK

SOURCE BOTTLEPLACE ROOM

FEEDING

AGENT GIRLEATER HORSEFOOD CARROT

SOURCE HANDPLACE PEN

FEEDING

AGENT WOMANEATER HORSEFOOD MILK

SOURCE BOTTLEPLACE BARN

Instances in train : 35 Instances in train : 7 Instances in train : 0

Generalize to Unseen Combinations

Activity

top-1 top-5

activitysituation

Situations Improves Object and Activity Recognition

Activity

top-1 top-5

objectactivitysituation

Object

top-1 top-5

Situations Improves Object and Activity Recognition

Errors

PRYING

AGENT PERSONITEM WOOD

SOURCE FLOORTOOL CROWBARPLACE ROOM

PAINTING SPRAYING

PUMPINGAGENT PERSON

ITEM AIRSOURCE AIR

DESTINATION WHEELTOOL PUMPPLACE OUTSIDE

imsitu.orgdata/browsing/demo/code

Conclusion

Introduced situation recognitionrole-centric structured representation of whats happening

Collected imSitu120k+ images, 500+ verbs, 100k+ situations

Introduced simple model neural CRF for situationstructure mattersprovides strong context for activity and object recognition

data/browsing/demo/code

imsitu.org

cvpr talk v2

Documents

cvpr 2019 tracking and detection challenge cvpr 2019 ·...

the power of names smithsonian...

weakly supervised object detection (wsod) · diba cvpr 2017...

on representing spherical videos (cvpr 2001)

cvpr2011: game theory in cvpr part 2

ieee cvpr biometrics 2009

lxcv @ cvpr 2021 reviewer mentoring program

economist talk v2

cvpr prototype drugs table

3d scene understanding from rgb-d...

cvim saisentan-cvpr-hyper depth

ieee cvpr 04

talk different roe-20150615-v2-0 (3)

cvpr presentation

rsc final cvpr submit

cvpr 2012 manifold based fingerprinting

the connected economy mark skilton july 15 bright talk v2

bright talk voip vofi webinar jan2015-v2

tuto cvpr part4

ken intern-talk-v2