cvpr talk v2

Post on 19-May-2022


Situation Recognition: Visual Semantic Role Labeling for Image Understanding


Mark Yatskar

in collaboration w/ Luke Zettlemoyer, Ali Farhadi

How can we summarize what is happening in an image?

LOADING
AGENT WOMAN
ITEM HORSE
DESTINATION TRAILER
TOOL ROPE
PLACE OUTDOORS

Is the same thing happening in two images?

turkers say…

[Bar chart: yes / somewhat / no, axis 0 to 100]


why no?

[Bar chart: activity / other, axis 0 to 100]

Activity

Is the same thing happening in two images?

turkers say…

[Bar chart: yes / somewhat / no, axis 0 to 100]

why yes?

[Bar chart: throwing / playing, axis 0 to 100]

Activity


why no?

[Bar chart: ball / sport, axis 0 to 100]

Object

Is the same thing happening in two images?

turkers say…

[Bar chart: yes / somewhat / no, axis 0 to 100]

why yes?

[Bar chart: beer / pouring / glass / man, axis 0 to 100]

Activity

Object


why no?

[Bar chart: destination / source / other, axis 0 to 100]

Role

Systematically describe how objects participate in activities through roles

LOADING
AGENT WOMAN
ITEM HORSE
DESTINATION TRAILER
TOOL ROPE
PLACE OUTDOORS

Situation Recognition

FIXING
AGENT BOY
OBJECT CAR
PART TIRE
TOOL TIRE IRON
PLACE OUTDOORS

Situation Recognition

POURING
AGENT MAN
SUBSTANCE BEER
SOURCE TAP
DESTINATION GLASS
PLACE BARROOM

POURING
AGENT MAN
SUBSTANCE BEER
SOURCE GLASS
DESTINATION MOUTH
PLACE BACKYARD

Same

Different
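
The Same / Different comparison of the two POURING situations can be sketched in a few lines of Python. This is only an illustration: the dict layout and the helper name `compare_situations` are assumptions made here, not part of the talk or of any released imSitu code.

```python
# Hypothetical representation: a situation is a verb plus a role -> noun mapping.
tap_pour = {
    "verb": "pouring",
    "roles": {"agent": "man", "substance": "beer", "source": "tap",
              "destination": "glass", "place": "barroom"},
}
mouth_pour = {
    "verb": "pouring",
    "roles": {"agent": "man", "substance": "beer", "source": "glass",
              "destination": "mouth", "place": "backyard"},
}

def compare_situations(a, b):
    """Split the roles of two situations into those with equal and unequal values."""
    roles = sorted(set(a["roles"]) | set(b["roles"]))
    same = [r for r in roles if a["roles"].get(r) == b["roles"].get(r)]
    different = [r for r in roles if a["roles"].get(r) != b["roles"].get(r)]
    return {"same_verb": a["verb"] == b["verb"], "same": same, "different": different}

result = compare_situations(tap_pour, mouth_pour)
print(result["same"])       # ['agent', 'substance']
print(result["different"])  # ['destination', 'place', 'source']
```

Comparing role by role is what lets the two images share an activity and two participants while still differing in where the beer comes from and where it goes.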


What is the space of possible situations?

imSitu: A Large Scale Situation Dataset

120k+ images, 500+ verbs, 100k+ situations

Natural Language Processing: Semantic Role Labeling

A boy is fixing a car tire with a tire iron outdoors.

Activity

Object

Role


FIXING
AGENT BOY
OBJECT CAR
PART TIRE
TOOL TIRE IRON
PLACE OUTDOORS

Natural Language Processing: Semantic Role Labeling

A jockey falling from a horse onto the ground at a racetrack.

FALLING
AGENT JOCKEY
SOURCE HORSE
DESTINATION GROUND
PLACE RACETRACK

Activity

Object

Role

FrameNet for Verb and Role Inventory

FIXING
AGENT OBJECT PART TOOL PLACE

semantic role labeling ontology: FrameNet (8000 verbs)

creating imSitu: FrameNet

Visualness

filter verbs, semantic roles

~1000 visual verbs, ~3.5 roles/verb

creating imSitu: FrameNet, Visualness

WordNet for Noun Inventory

FIXING
AGENT OBJECT PART TOOL PLACE

values from noun ontology: WordNet (80,000 nouns)

semantic role labeling ontology: FrameNet (8000 verbs)

creating imSitu: FrameNet, Visualness, WordNet

Filter Images

Google Images Search, Web N-grams

creating imSitu: FrameNet, Visualness, WordNet, Filter Images

Fill Values

FIXING
AGENT BOY
OBJECT CAR
PART TIRE
TOOL TIRE IRON
PLACE OUTDOORS

creating imSitu: FrameNet, Visualness, WordNet, Filter Images, Fill Values
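
The inventory-then-fill idea (roles fixed per verb, values drawn from a noun vocabulary) can be sketched as a small validation routine. The two inventories below are tiny hypothetical stand-ins for the FrameNet-derived role lists and the WordNet noun list, and `validate_frame` is a name invented for this sketch. It also assumes every role must be filled, which the slides do not state.

```python
# Hypothetical miniature inventories; the real ones come from FrameNet and WordNet.
ROLE_INVENTORY = {
    "fixing": ("agent", "object", "part", "tool", "place"),
    "falling": ("agent", "source", "destination", "place"),
}
NOUN_VOCABULARY = {"boy", "car", "tire", "tire iron", "outdoors",
                   "jockey", "horse", "ground", "racetrack"}

def validate_frame(verb, values):
    """Return a list of problems with a filled frame (an empty list means valid)."""
    if verb not in ROLE_INVENTORY:
        return [f"unknown verb: {verb}"]
    problems = []
    expected = ROLE_INVENTORY[verb]
    for role in expected:
        if role not in values:
            # Assumption of this sketch: every role of the verb must be filled.
            problems.append(f"missing role: {role}")
        elif values[role] not in NOUN_VOCABULARY:
            problems.append(f"unknown noun for {role}: {values[role]}")
    for role in values:
        if role not in expected:
            problems.append(f"unexpected role: {role}")
    return problems

fixing = {"agent": "boy", "object": "car", "part": "tire",
          "tool": "tire iron", "place": "outdoors"}
print(validate_frame("fixing", fixing))  # []
print(validate_frame("falling", {"agent": "jockey", "tool": "rope"}))
```

The second call reports the three unfilled roles of `falling` plus the role `tool`, which is not in that verb's inventory.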

imSitu: Dataset Statistics

Verbs 504
Images 126,102
Situation/Image 3
Roles (types) 1,788 (190)
Nouns (>= 3) 11,538 (6,794)
Annotations 1,481,851
Images/Verb 200-400
Uniq. situations (>= 3) 205,095 (21,505)

Despite 80,000 possible values, 2/3 annotators agree on 76.8% of role-value pairs
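
One plausible way to compute a 2-of-3 agreement figure like the one above is sketched below. The slide does not say how the number was obtained, so the counting rule (one slot per image and role, agreement when at least two annotators give the same noun) and the function name `agreement_rate` are assumptions.

```python
from collections import Counter

def agreement_rate(annotations, min_agree=2):
    """Fraction of role slots where at least `min_agree` annotators chose the same value.

    `annotations` maps an image id to a list of role -> noun dicts, one per annotator.
    """
    agreed = total = 0
    for frames in annotations.values():
        roles = set().union(*(frame.keys() for frame in frames))
        for role in roles:
            counts = Counter(frame.get(role) for frame in frames)
            total += 1
            if counts.most_common(1)[0][1] >= min_agree:
                agreed += 1
    return agreed / total if total else 0.0

# Toy data, invented for illustration only.
toy = {
    "img_1": [
        {"agent": "boy", "tool": "tire iron"},
        {"agent": "boy", "tool": "wrench"},
        {"agent": "man", "tool": "crowbar"},
    ],
}
print(agreement_rate(toy))  # 0.5
```

In the toy image two annotators agree on the agent but all three differ on the tool, so one of two slots counts as agreed.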


Skew - not all verbs are equal

[Bar chart: # nouns per verb, axis 0 to 700. Verbs shown: scooping, putting, feeding, fetching, climbing, stumbling, laughing, bowing, inflating, shelling, snuggling, twisting, dancing, flossing. Example role values: food: milk, food: carrot, receiver: piglet, receiver: dolphin, tool: pot, tool: scooper, item: poop, item: seed]


Skew - not all nouns are equal

[Bar chart: # of semantic roles per noun, axis ticks 1, 10, 100, 1000. Nouns shown: car, man, elephant, zebra, fireplace, octopus, priest, flower, nostril, ice, bacon, vacuum, bow, cherry. Example roles: splashing.agent, swimming.agent, riding.vehicle, attacking.victim, colliding.agent, driving.vehicle, pumping.dest, buckling.place]
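
The two skew statistics (nouns per verb, semantic roles per noun) are simple counts over annotation triples. A minimal sketch follows, assuming annotations are available as (verb, role, noun) tuples; the triples are toy data invented here, and `skew_counts` is a hypothetical helper name.

```python
from collections import defaultdict

def skew_counts(annotations):
    """Count distinct nouns per verb and distinct verb.role slots per noun."""
    nouns_per_verb = defaultdict(set)
    roles_per_noun = defaultdict(set)
    for verb, role, noun in annotations:
        nouns_per_verb[verb].add(noun)
        roles_per_noun[noun].add(f"{verb}.{role}")
    return ({v: len(n) for v, n in nouns_per_verb.items()},
            {n: len(r) for n, r in roles_per_noun.items()})

# Toy (verb, role, noun) triples, invented for illustration only.
toy = [
    ("feeding", "food", "milk"),
    ("feeding", "food", "carrot"),
    ("scooping", "tool", "pot"),
    ("riding", "vehicle", "car"),
    ("driving", "vehicle", "car"),
    ("colliding", "agent", "car"),
]
per_verb, per_noun = skew_counts(toy)
print(per_verb["feeding"])  # 2
print(per_noun["car"])      # 3
```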


Situation Recognition: Models, Evaluation and Basic Results

structure matters

situation recognition improves object and activity recognition

CLEAN
AGENT man
SOURCE chimney
DIRT soot
TOOL brush
PLACE roof

Conditional Random Field

VGG

Convolutional Layers

Verb-Role-Noun Potential

Verb Potential

Neural Conditional Random Field

Backpropagate CRF loss through VGG

$$p(S \mid i; \theta) \propto \psi_v(v, i; \theta) \prod_{(r, n_r) \in F} \psi_e(v, r, n_r, i; \theta)$$
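
The factorization into one verb potential and one verb-role-noun potential per role can be turned into a normalized probability as sketched below. The scores are toy log-potentials invented here (in the talk they would come from the VGG network), and the sketch assumes each role of a verb takes exactly one noun from a fixed candidate list. Under that assumption roles are independent given the verb, so the normalizer is a sum over verbs of a product of per-role sums.

```python
import math
from itertools import product

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def situation_log_prob(verb, frame, verb_scores, role_noun_scores):
    """Log-probability of (verb, frame) from log verb and verb-role-noun potentials."""
    score = verb_scores[verb] + sum(
        role_noun_scores[verb][r][n] for r, n in frame.items())
    # Partition function: sum over verbs of the verb potential times, for each
    # role of that verb, the sum of its verb-role-noun potentials.
    log_z = logsumexp([
        verb_scores[v] + sum(logsumexp(list(nouns.values()))
                             for nouns in role_noun_scores[v].values())
        for v in verb_scores
    ])
    return score - log_z

# Toy log-potentials, invented for illustration only.
verb_scores = {"swimming": 1.0, "falling": 0.0}
role_noun_scores = {
    "swimming": {"agent": {"snake": 2.0, "person": 1.0},
                 "place": {"ocean": 1.5, "field": 0.0}},
    "falling": {"agent": {"person": 1.0, "snake": 0.0},
                "source": {"horse": 0.5, "ground": 0.0}},
}

# Sanity check: the probabilities of all possible situations sum to one.
total = 0.0
for v in verb_scores:
    roles = list(role_noun_scores[v])
    for nouns in product(*(role_noun_scores[v][r] for r in roles)):
        frame = dict(zip(roles, nouns))
        total += math.exp(situation_log_prob(v, frame, verb_scores, role_noun_scores))
print(round(total, 6))  # 1.0
```

Training would then minimize the negative of `situation_log_prob` for the annotated situations, with gradients flowing back into the network that produces the scores.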

Qualitative Examples

SPEARING
AGENT PERSON PERSON
VICTIM FISH FISH
PLACE OCEAN OCEAN

FALLING
AGENT PERSON PERSON
SOURCE HORSE HORSE
DEST. GRND. GRND.
PLACE FIELD FIELD

SWIMMING
AGENT SNAKE SNAKE
PLACE OCEAN OCEAN

Gold Correct Incorrect

Qualitative Examples

SHAVING
AGENT MAN PERSON
CO-AGENT MAN MAN
BODYPART HEAD HEAD
SUBSTANCE S. CREAM
TOOL RAZOR RAZOR
PLACE INSIDE INSIDE

GIVING
AGENT SOLDIER
RECIPIENT GIRL
ITEM BAG
PLACE OUTSIDE

DETAINING
AGENT SOLDIER
VICTIM MAN
PLACE OUTSIDE

Gold Correct Incorrect


Quantitative: Structured Prediction Crucial

[Bar charts: top-1 and top-5 results in three panels (Verb, Verb-Role-Noun, Full Structure); the Verb panel axis runs 0 to 70. Systems compared: Baseline: 5040-way CNN Classifier (10 most frequent situations/verb) and Situation CRF]

FIXING
AGENT BOY
OBJECT CAR
PART TIRE
TOOL TIRE IRON
PLACE OUTDOORS
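
The baseline's label space (the 10 most frequent situations for each verb, hence 504 x 10 = 5040 classes) could be built roughly as follows. This is a sketch under assumptions: training data is taken to be (verb, frame) pairs with hashable frames, and `build_baseline_labels` is a name made up here rather than the authors' code.

```python
from collections import Counter, defaultdict

def build_baseline_labels(train_situations, per_verb=10):
    """Keep the `per_verb` most frequent situations of each verb as class labels.

    `train_situations` is an iterable of (verb, frame) pairs, where frame is a
    hashable tuple of (role, noun) pairs.
    """
    counts = defaultdict(Counter)
    for verb, frame in train_situations:
        counts[verb][frame] += 1
    labels = []
    for verb in sorted(counts):
        labels.extend((verb, frame)
                      for frame, _ in counts[verb].most_common(per_verb))
    return labels

# Toy training set, invented for illustration; per_verb=1 keeps it small.
man_baby = (("agent", "man"), ("eater", "baby"))
girl_horse = (("agent", "girl"), ("eater", "horse"))
toy = [
    ("feeding", man_baby),
    ("feeding", man_baby),
    ("feeding", girl_horse),
    ("pouring", (("agent", "man"),)),
]
labels = build_baseline_labels(toy, per_verb=1)
print(len(labels))  # 2
```

A classifier over such labels can only ever output a situation it has seen often in training, which is the limitation the structured model avoids.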

Generalize to Unseen Combinations

FEEDING
AGENT MAN
EATER BABY
FOOD MILK
SOURCE BOTTLE
PLACE ROOM
Instances in train: 35

FEEDING
AGENT GIRL
EATER HORSE
FOOD CARROT
SOURCE HAND
PLACE PEN
Instances in train: 7

FEEDING
AGENT WOMAN
EATER HORSE
FOOD MILK
SOURCE BOTTLE
PLACE BARN
Instances in train: 0

Train

Test
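
An "instances in train" count for a given situation can be obtained by hashing whole situations and counting exact matches, as in the sketch below. Whether the slide counts exact matches in this way is an assumption; `freeze` and `train_support` are hypothetical helper names, and the training list is toy data.

```python
from collections import Counter

def freeze(verb, frame):
    """Turn a verb and a role -> noun dict into a hashable key."""
    return (verb, tuple(sorted(frame.items())))

def train_support(train, situation):
    """Number of training situations identical to `situation`."""
    seen = Counter(freeze(v, f) for v, f in train)
    return seen[freeze(*situation)]

# Toy training set, invented for illustration only.
man_baby = ("feeding", {"agent": "man", "eater": "baby", "food": "milk",
                        "source": "bottle", "place": "room"})
girl_horse = ("feeding", {"agent": "girl", "eater": "horse", "food": "carrot",
                          "source": "hand", "place": "pen"})
train = [man_baby, man_baby, girl_horse]

unseen = ("feeding", {"agent": "woman", "eater": "horse", "food": "milk",
                      "source": "bottle", "place": "barn"})
print(train_support(train, man_baby))  # 2
print(train_support(train, unseen))    # 0
```

The unseen situation reuses only values that occur somewhere in training (horse, milk, bottle), which is why a model that scores roles separately can still predict it.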


Situations Improve Object and Activity Recognition

[Bar charts: top-1 and top-5 results in two panels, Activity (axis 20 to 70) and Object (axis 60 to 100). Legend: object, activity, situation]

FIXING
AGENT BOY
OBJECT CAR
PART TIRE
TOOL TIRE IRON
PLACE OUTDOORS
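
The top-1 and top-5 numbers in these charts are top-k accuracies. A generic sketch of that metric is given below; it is the standard definition, not code from the talk, and the ranked predictions are toy data.

```python
def top_k_accuracy(ranked_predictions, gold, k):
    """Share of examples whose gold label is among the first k ranked predictions."""
    hits = sum(g in ranked[:k] for ranked, g in zip(ranked_predictions, gold))
    return hits / len(gold)

# Toy ranked verb predictions for two images, invented for illustration only.
ranked = [
    ["fixing", "pumping", "prying"],
    ["painting", "spraying", "pumping"],
]
gold = ["fixing", "spraying"]
print(top_k_accuracy(ranked, gold, k=1))  # 0.5
print(top_k_accuracy(ranked, gold, k=2))  # 1.0
```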

Errors

PRYING
AGENT PERSON
ITEM WOOD
SOURCE FLOOR
TOOL CROWBAR
PLACE ROOM

PAINTING SPRAYING

PUMPING
AGENT PERSON
ITEM AIR
SOURCE AIR
DESTINATION WHEEL
TOOL PUMP
PLACE OUTSIDE

imsitu.org

data/browsing/demo/code

Conclusion

Introduced situation recognition: role-centric structured representation of what's happening

Collected imSitu: 120k+ images, 500+ verbs, 100k+ situations

Introduced simple model: neural CRF for situation; structure matters; provides strong context for activity and object recognition

