Discriminative Models for Multi-Class Object Layout

Chaitanya Desai, Deva Ramanan, Charles Fowlkes

Presented by:

Vignesh Ramanathan, Vivardhan Kanoria, Kevin Truong

Introduction: Why Another Object Detector?

Issues with other Detectors:

A binary 0-1 classification model is applied to each image window and object class, independently of the rest of the image and the other objects present in it

Heuristic post-processing, e.g. Non Maximal Suppression, is needed to improve the performance of detectors on datasets

Interactions between Objects: 1. Activation

Intra Class – Textures of Objects

[17] Y. Liu, W. Lin, and J. Hays. Near-regular texture analysis and manipulation. ACM Transactions on Graphics, 23(3):368–376, 2004

Between Class – Spatial Cueing

Interactions between Objects: 2. Inhibition

Intra Class – Non Maximal Suppression

Between Class – Mutual Exclusion

Interactions between Objects: 3. Global Properties

Between Class – Co-occurrence

At most 1 biker per bike

Intra Class – Total Counts

At most 1 Sydney Opera House

Summary of Spatial Interactions Modeled

             Within Class               Between Class
Activation   Textures of Objects        Spatial Cueing
Inhibition   Non Maximal Suppression    Mutual Exclusion
Global       Expected Counts            Co-occurrence

Contributions of Multi-Class Object Layout

The object layout framework formulates detection as a structured prediction task for an entire image rather than a binary classification task on sub-windows

The model learns all of the listed spatial interactions, in addition to learning local appearance statistics

Problem Formulation

The objective is to train a model that detects multiple classes of objects in test images, given training images annotated with bounding boxes for each class

[Figure: learning produces the model parameters, which are then used for inference on a test image]

Model Formulation

Suppose we wish to model $K$ different object classes. The vector of object labels is

$$Y = \{y_i : i = 1, \dots, M\}, \quad y_i \in \{0, \dots, K\}, \quad 0 = \text{background}$$

Construct the image pyramid and let $M$ be the total number of sub-windows. An image $X$ is represented by a set of features $x_i$:

$$X = \{x_i : i = 1, \dots, M\}$$

Example window: $x_i$ = HOG features, $y_i = 3$ (human)

Task: Model should predict all labels Y, given an image X

Spatial Interaction Model

The spatial configuration of a window 𝑗 with respect to a window 𝑖 is encoded as follows:

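$$d_{ij} = [\text{Near?},\ \text{Far?},\ \text{Above?},\ \text{Ontop?},\ \text{Below?},\ \text{Next-to?},\ \text{50\% Overlap?}]^T$$

Example for two windows $j = 1$ and $j = 2$:

$$d_{i1} = [1, 0, 0, 0, 0, 0, 0]^T; \quad d_{i2} = [1, 0, 0, 0, 0, 1, 0]^T$$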

𝑑𝑖𝑗 is a 7-dimensional sparse binary vector.

The first 6 components depend only on the relative location of the center of window j with respect to window i.
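The encoding above can be sketched in a few lines of Python. This is only an illustration: the slides do not give the geometric definitions, so the window format `(x1, y1, x2, y2)`, the function name `spatial_descriptor`, the `near_radius` threshold, the half-window cut-offs, and the use of intersection-over-union for the 50% overlap test are all assumptions. They were chosen so that the two example windows reproduce the bit patterns of the examples for j=1 and j=2.

```python
import math

def spatial_descriptor(win_i, win_j, near_radius=1.5):
    """Return [near, far, above, ontop, below, next_to, overlap] for window j w.r.t. i.

    Windows are (x1, y1, x2, y2) with y growing downward. All thresholds are
    hypothetical; the slides do not specify them.
    """
    xi1, yi1, xi2, yi2 = win_i
    xj1, yj1, xj2, yj2 = win_j
    wi, hi = xi2 - xi1, yi2 - yi1
    # Offset of j's center from i's center, in units of window i's size.
    dx = ((xj1 + xj2) / 2 - (xi1 + xi2) / 2) / wi
    dy = ((yj1 + yj2) / 2 - (yi1 + yi2) / 2) / hi
    near = math.hypot(dx, dy) <= near_radius
    inside_x, inside_y = abs(dx) <= 0.5, abs(dy) <= 0.5
    ontop = inside_x and inside_y
    above = inside_x and dy < -0.5
    below = inside_x and dy > 0.5
    next_to = inside_y and not inside_x
    # Seventh component: intersection-over-union of at least 50% (assumed definition).
    iw = max(0.0, min(xi2, xj2) - max(xi1, xj1))
    ih = max(0.0, min(yi2, yj2) - max(yi1, yj1))
    inter = iw * ih
    union = wi * hi + (xj2 - xj1) * (yj2 - yj1) - inter
    overlap = inter / union >= 0.5
    return [int(b) for b in (near, not near, above, ontop, below, next_to, overlap)]

window_i = (0, 0, 10, 10)
d_i1 = spatial_descriptor(window_i, (8, 8, 18, 18))   # diagonal neighbour
d_i2 = spatial_descriptor(window_i, (10, 0, 20, 10))  # side-by-side neighbour
```

Only the last component looks at the full extent of both windows; the first six use the center of window j relative to window i, as stated above.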

Model Parameters

The score of labeling an image $X$ with labels $Y$ is:

$$S(X, Y) = \sum_{i,j} \omega_{y_i, y_j}^T d_{ij} + \sum_i \omega_{y_i}^T x_i$$

where $\omega_{a,b}$ and $\omega_c$ are model parameters. The first sum runs over all pairs of windows and the second over all windows.

𝜔𝑎,𝑏 captures spatial interactions between object classes a and b

𝜔𝑐 captures local appearance characteristics of object class c

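• $\omega_{a,b}$ is a 7 × 1 vector; $(a, b) \in \{0, \dots, K\} \times \{0, \dots, K\}$

• $\omega_c$ has the same size as the feature vector $x_i$ (HOG, etc.); $c \in \{0, \dots, K\}$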

Append a 1 to each 𝑥𝑖 to learn biases between classes

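Fix the background parameters 𝜔0, 𝜔0,c and 𝜔c,0 (for every class c) to 0, so that a labeling with every window set to background has score 0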
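As a rough illustration of how the score is assembled, the sketch below evaluates S(X, Y) on a toy problem with two windows and one object class. The function names `dot` and `score`, the nested-list layout of the weights, and all numeric values are hypothetical. It also assumes that the pairwise sum runs over ordered pairs with i ≠ j, which the slides do not state explicitly.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def score(labels, feats, d, w_unary, w_pair):
    """S(X, Y): pairwise spatial terms plus per-window appearance terms."""
    M = len(labels)
    total = 0.0
    for i in range(M):
        total += dot(w_unary[labels[i]], feats[i])  # appearance term for window i
        for j in range(M):
            if i != j:  # assumption: ordered pairs of distinct windows
                total += dot(w_pair[labels[i]][labels[j]], d[i][j])
    return total

# Toy data: one feature per window plus the appended constant 1 for the bias.
feats = [[2.0, 1.0], [1.0, 1.0]]
side_by_side = [1, 0, 0, 0, 0, 1, 0]           # near and next-to
d = [[None, side_by_side], [side_by_side, None]]
zero7 = [0.0] * 7
w_unary = [[0.0, 0.0], [1.0, -0.5]]             # class 0 (background) is all zeros
w_pair = [[zero7, zero7],
          [zero7, [-0.25, 0, 0, 0, 0, 0.5, 0]]]  # only class 1 vs class 1 interacts

s_both = score([1, 1], feats, d, w_unary, w_pair)
s_one = score([1, 0], feats, d, w_unary, w_pair)
s_none = score([0, 0], feats, d, w_unary, w_pair)
```

With these made-up weights, labeling both windows as class 1 earns the appearance terms (1.5 and 0.5) plus a spatial bonus of 0.25 for each ordered pair.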

Inference: NP Hard

To get the desired detection, we need to compute:

$$\arg\max_Y S(X, Y) = \arg\max_Y \left[ \sum_{i,j} \omega_{y_i, y_j}^T d_{ij} + \sum_i \omega_{y_i}^T x_i \right]$$

i.e. Find the labeling 𝑌 that maximizes the score S for image 𝑋, given learnt model parameters 𝜔

There are $(K + 1)^M$ possible values for $Y$

This is NP hard.
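To make the size of the search space concrete, the sketch below enumerates every labeling for a tiny problem and picks the best one by brute force. The helper `brute_force_argmax` and the scoring function `toy_score` are hypothetical stand-ins and do not implement the model's score. Exhaustive enumeration like this is only feasible for very small K and M.

```python
from itertools import product

def brute_force_argmax(score_fn, K, M):
    """Try all (K + 1) ** M labelings and return the best one with its score."""
    best = max(product(range(K + 1), repeat=M), key=score_fn)
    return list(best), score_fn(best)

# Hypothetical per-window, per-class appearance scores (column 0 = background).
UNARY = [[0.0, 1.0, 0.25],
         [0.0, 0.75, 1.0],
         [0.0, -0.5, -0.25]]

def toy_score(Y):
    """Appearance terms minus a penalty for two windows sharing an object class."""
    s = sum(UNARY[i][y] for i, y in enumerate(Y))
    for i in range(len(Y)):
        for j in range(i + 1, len(Y)):
            if Y[i] != 0 and Y[i] == Y[j]:
                s -= 1.0
    return s

K, M = 2, 3
num_labelings = sum(1 for _ in product(range(K + 1), repeat=M))
best_labels, best_score = brute_force_argmax(toy_score, K, M)
```

Here `num_labelings` is 3 ** 3 = 27. The count grows exponentially in M, so the same enumeration over the thousands of sub-windows in an image pyramid is out of reach.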

Inference: Greedy Forward Search Algorithm

1. Initialize all labels to 0 (i.e. background)

2. Repeatedly change the label of window 𝑖 to class 𝑐, where:

𝑖, 𝑐 is the window-class pair that maximizes the increase in score S(X,Y)

3. Stop when all windows have been instanced or step 2 causes a decrease in score

Effectiveness was tested on small-scale problems where the brute-force solution could be computed easily

The score found by the greedy forward search algorithm was quite close to that of the brute-force solution

The two solutions typically differed in the labels of 1-3 windows

Greedy Forward Search: Details

Initialize:

$I = \{\}$ (set of instanced windows); $S = 0$; $\Delta(i, c) = \omega_c^T x_i$ (change in score)

Repeat:

1. $(i^*, c^*) = \arg\max_{(i,c) \notin I} \Delta(i, c)$

2. $I = I \cup \{(i^*, c^*)\}$

3. $S = S + \Delta(i^*, c^*)$

4. $\Delta(i, c) = \Delta(i, c) + \omega_{c^*, c}^T d_{i^*, i} + \omega_{c, c^*}^T d_{i, i^*}$

Stop when: $\Delta(i^*, c^*) < 0$ or all windows have been instanced
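The steps above can be sketched in Python as follows. This is a minimal illustration, not the presenters' implementation: the function name `greedy_forward_search`, the data layout, and the toy weights are hypothetical. It assumes each window receives at most one label, so all candidates for a window are dropped once it is instanced, and that background weights are zero.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def greedy_forward_search(feats, d, w_unary, w_pair, K):
    """Greedily instance the (window, class) pair with the largest score gain."""
    M = len(feats)
    labels = [0] * M   # every window starts as background
    S = 0.0
    # delta[(i, c)]: gain in S from switching window i from background to class c.
    delta = {(i, c): dot(w_unary[c], feats[i])
             for i in range(M) for c in range(1, K + 1)}
    while delta:       # stops when all windows have been instanced
        (i_star, c_star), gain = max(delta.items(), key=lambda kv: kv[1])
        if gain < 0:   # the best remaining move would lower the score
            break
        labels[i_star] = c_star
        S += gain
        for c in range(1, K + 1):   # window i_star is now instanced
            del delta[(i_star, c)]
        for (i, c) in delta:        # add interactions with the new detection
            delta[(i, c)] += (dot(w_pair[c_star][c], d[i_star][i])
                              + dot(w_pair[c][c_star], d[i][i_star]))
    return labels, S

# Toy problem: windows 0 and 1 overlap heavily, window 2 is far from both.
feats = [[2.0, 1.0], [1.5, 1.0], [1.0, 1.0]]
ov, far = [1, 0, 0, 1, 0, 0, 1], [0, 1, 0, 0, 0, 0, 0]
d = [[None, ov, far], [ov, None, far], [far, far, None]]
zero7 = [0.0] * 7
w_unary = [[0.0, 0.0], [1.0, -0.5]]
w_pair = [[zero7, zero7], [zero7, [0, 0, 0, 0, 0, 0, -2.0]]]  # penalize overlap

labels, S = greedy_forward_search(feats, d, w_unary, w_pair, K=1)
```

In this toy run window 0 is instanced first. The overlap penalty then pushes the gain for window 1 below zero, so it stays background, which is the same effect as the Non Maximal Suppression heuristic but driven by a learned-style weight.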

CRF Formulation - Scoring

Model $P(Y \mid X)$ as a CRF with pairwise potentials, with the probability being exponential in $S(X, Y)$, i.e.

$$P(Y \mid X) = \frac{1}{Z(X)} e^{S(X, Y)}$$

A natural choice for scoring each detection is the log odds ratio between probability of detecting a class c versus detecting any other class:

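$$m(y_i = c) = \log \frac{P(y_i = c \mid X)}{P(y_i \neq c \mid X)} = \log \frac{\sum_{\mathbf{y}_r} P(y_i = c, \mathbf{y}_r \mid X)}{\sum_{\mathbf{y}_s,\, c' \neq c} P(y_i = c', \mathbf{y}_s \mid X)}$$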

Assume that both marginals are dominated by their largest terms. These are given by:

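$$r^* = \arg\max_r S(X, y_i = c, \mathbf{y}_r), \qquad (s^*, c^*) = \arg\max_{s,\, c' \neq c} S(X, y_i = c', \mathbf{y}_s)$$

Then the log odds ratio is given by:

$$m(y_i = c) \approx \log \frac{P(y_i = c, \mathbf{y}_{r^*} \mid X)}{P(y_i = c^*, \mathbf{y}_{s^*} \mid X)} = S(X, y_i = c, \mathbf{y}_{r^*}) - S(X, y_i = c^*, \mathbf{y}_{s^*})$$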
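The last equality follows because the partition function cancels in the ratio. As a short worked step, write $Y_1$ and $Y_2$ for the two dominant labelings (shorthand introduced here only for this derivation):

```latex
\begin{align*}
\log \frac{P(Y_1 \mid X)}{P(Y_2 \mid X)}
  &= \log \frac{e^{S(X, Y_1)} / Z(X)}{e^{S(X, Y_2)} / Z(X)} \\
  &= \log e^{S(X, Y_1)} - \log e^{S(X, Y_2)} \\
  &= S(X, Y_1) - S(X, Y_2)
\end{align*}
```

So the approximate log odds needs only two score evaluations and never requires computing $Z(X)$.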
