1 integrating vision models for holistic scene understanding geremy heitz cs223b march 4 th, 2009

Integrating Vision Models for Holistic Scene

Understanding

Geremy Heitz

CS223BMarch 4th, 2009

Scene/Image Understanding

What’s happening in these pictures?

Human View of a “Scene”

“A car passes a bus on the road,

while people walk past a building.”

BUILDING

BUSPEOPLEWALKING

Computer View of a “Scene”

BUILDING

STREET

Can we integrateall of these subtasks,

so thatwhole > sum of parts ?

Outline

Overview Integrating Vision Models

CCM: Cascaded Classification Models Learning Spatial Context

TAS: Things and Stuff Future Directions

[Heitz et al. NIPS 2008a]

[Heitz & Koller ECCV 2008]

Image/Scene Understanding

“a man and a dogare walking

on a sidewalkin front of a building”

Backpack

Cigarette

Primitives Objects Parts Surfaces Regions

Interactions Context Actions Scene

Descriptions

Established techniques address these in isolation.

Reasoning over image statistics

Complex web of relations well represented by graphical models.

Reasoning over more abstract entities.

Building

Sidewalk

Why will integration help?

What is this object?

More Context

Context is key!

Outline

[Heitz et al. NIPS 2008a]

Human View of a “Scene”

BUILDING

BUSPEOPLEWALKING

Scene Categorization

Object Detection Region Labelling Depth

Reconstruction Surface Orientations Boundary/Edge

Detection Outlining/Refined

Localization Occlusion

Reasoning ...

Intrinsic Images [Barrow and Tenenbaum, 1978], [Tappen et al., 2005]

Hoiem et al., “Closing the Loop in Scene Interpretation” , 2008

We want to focus more on “semantic” classes We want to be flexible to using outside models We want an extendable framework, not one engineered for a particular set of

Related Work

How Should we Integrate? Single joint model over all variables

Pros: Tighter interactions, more designer control Cons: Need expertise in each of the subtasks

Simple, flexible combination of existing models

Pros: State-of-the-art models, easier to extendLimited “black-box” interface to components

Cons: Missing some of the modeling power

DETECTIONDalal & Triggs, 2006

REGION LABELINGGould et al., 2007

DEPTH RECONSTRUCTIONSaxena et al., 2007

DET1 REG1 REC1

Cascaded Classification Models

Features fDET

Object Detection

RegionLabeling

DET0IndependentModels

3DReconstructio

Context-awareModels

Integrated Model for Scene Understanding

Object Detection Multi-class

Segmentation Depth Reconstruction Scene Categorization

I’ll show you

Basic Object Detection

= Person

= Motorcycle= Boat

= Sheep= Cow

Detection Window W

Score(W) > 0.5

Base Detector - HOG

[ Dalal & Triggs, CVPR, 2006 ] HOG Detector:

Feature Vector X SVM Classifier

Context-Aware Object Detection

From Base Detector Log Score D(W)

From Scene Category MAP category, marginals

From Region Labels How much of each label is in

a window adjacent to W From Depths

Mean, variance of depths,estimate of “true” object size

Final Classifier

P(Y) = Logistic(Φ(W))

Scene Type: Urban scene

% of “road” below W

Variance of depths in W

Multi-class Segmentation CRF Model

Label each pixel as one of:{‘grass’, ‘road’, ‘sky’, etc }

Conditional Markov random field (CRF) over superpixels:

Singleton potentials: log-linear function of boosted detectors scores for each class

Pairwise potentials: affinity of classes appearing together conditioned on (x,y) location within the image

[Gould et al., IJCV 2007]

Context-Aware Multi-class Seg.

Additional Feature:Relative Location Map

Where is

the grass?

Depth Reconstruction CRF

[Saxena et al., PAMI 2008]

Label each pixel with it’s distance from the camera

Conditional Markov random field (CRF) over superpixels

Continuous variables Models depth as linear

function of features with pairwise smoothness constraints

http://make3d.stanford.edu

Depth Reconstruction with Context

BLACK BOX

Grass is horizontal

Sky is far away

Find d* Reoptimize depths

with new constraints:

dCCM = argmin α||d - d*||

+ β||d - dCONTEXT||

Training

I: Image f: Image Features Ŷ: Output labels Training Regimes

Independent

Ground: Groundtruth Input

fD fS fZ

ŶD0 ŶS

0 ŶZ0

fD fS fZ

)(maxarg 0 fYPINDEP

),(maxarg *1 fYYP otherGROUND

Training

CCM Training Regime Later models can

ignore the mistakes of previous models

Training realistically emulates testing setup

Allows disjoint datasets

K-CCM: A CCM with K levels of classifiers

fD fS fZ

),ˆ(maxarg 01 fYYP otherCCM

Experiments

DS1 422 Images, fully labeled Categorization, Detection, Multi-class

Segmentation 5-fold cross validation

DS2 1745 Images, disjoint labels Detection, Multi-class Segmentation, 3D

Reconstruction 997 Train, 748 Test

CCM Results – DS1

CAR PEDESTRIAN

MOTORBIKE BOAT

1 integrating vision models for holistic scene understanding geremy heitz cs223b march 4 th, 2009

sidewalk slide

context context

stateoftheart models

graphical models

contextaware object

holistic scene

scene interpretation

vision models ccm

Documents

towards helicopter tracking cs223b project #20 ritchie lee...

cs223b, jana kosecka rigid body motion and image formation

yin yang, cordelia robinson, dominique heitz, etienne mémin

heitz - oneiros - tödlicher fluch

cv fabrice heitz 2015-1 page -...

cs223b homework 1 results

yin yang, cordelia robinson, dominique heitz, etienne...

j med genet-1992-heitz-794-801

graphische datenverarbeitung iv dr. markus heitz

heitz+koller eccv08

motion estimation introduction to computer vision cs223b,...

convex point estimation using undirected bayesian transfer...

discriminative learning of markov random fields for...

115.01 - gestion clientèle - heitz system · 115.01 -...

sebastian thrun and jana kosecha cs223b computer vision,...

theologie des bildes. s. heitz, p

stanford cs223b computer vision, winter 2007 lecture 2b ...

franÇoise heitz

uncalibrated epipolar - calibration jana kosecka cs223b

geremy lashon white