A System for Observing and Recognizing Objects in the Real World

J.O. Eklundh, M. Björkman, E. Hayman

Computational Vision and Active Perception Lab

Royal Institute of Technology (KTH)

Vision in the Real World: Attending, Foveating and Recognizing Objects

Motivation

An autonomous agent moving about in a dynamic indoor environment, performing tasks such as finding, picking up and delivering known objects or classes of objects.

What should the vision system of such an agent be capable of?


Capabilities

[Diagram: “where” (attention, segmentation) and “what” (recognition)]

Should be dealt with jointly. Bootstrapping?


Recognition and Categorization

Fei-Fei et al. ’03


A robot looking at a table at 1.5 m

Objects subtend only a fraction of the scene and are not centered (no attentional step)


Approach - themes

• 3D cues relevant in the scene
• Motion and stereo used for bootstrapping
• Integration of multiple cues
• A system interacting with the environment
• Fast processes and anytime algorithms desirable


The figure-ground segmentation (f-g-s) problem

• Segmenting 3D objects from the background
• Computing motion, depth and ego-motion
• Acquiring appearance models

Issues:
• Combining cues
• Demonstrate simple algorithms that suffice

Monocular as well as binocular cases


Cue integration for f-g-s

First, the dynamic monocular case

• Problem: classifying pixels as being foreground or background (or into layers)

• Cues: motion, colour, contrast + prediction (temporal continuity)

Inference problem: observations from different spaces to be combined


Integration
• Two approaches:
  - probabilistic: likelihood of observing data, given a model of each layer
  - voting: each cue decides independently, form weighted combination

Algorithm
• Online initialization of colour + texture models

– Use segmentation from motion to train distributions

• Suppress unreliable cues

• Sequential algorithm


Related work
• Khan & Shah CVPR’01
• Triesch & von der Malsburg ICAFGR’00
• Spengler & Schiele ICVS’01
• Toyama & Horvitz ACCV’00
• Kragic & Christensen ICRA’99
• Belongie et al ICCV’98
• …

Similarity to tracking


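Voting

The likelihood of observation $Z_{i,k}$ from cue $k$ at pixel $i$, given model $M_{j,k}$ of layer $j$:

$$f_{i,k}(Z_{i,k} \mid M_{j,k})$$

Posterior probability of layer $j$:

$$p(j \mid Z_{i,k}) = \frac{f_{i,k}(Z_{i,k} \mid M_{j,k})\, p(j)}{\sum_{l} f_{i,k}(Z_{i,k} \mid M_{l,k})\, p(l)}$$

We set

$$\mathrm{score}_i(j) = \sum_{k} w_{i,k}\, p(j \mid Z_{i,k})$$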
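The voting rule can be sketched for a single pixel as follows. This is a minimal illustration, not the authors' implementation: the function names, the two-layer setup and the likelihood values are made up, and the weights are taken per cue for the one pixel considered.

```python
def layer_posteriors(likelihoods, priors):
    """Per-cue posterior over layers from likelihoods f(Z | M_j) and priors p(j)."""
    joint = [f * p for f, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [v / total for v in joint]


def voting_scores(cue_likelihoods, priors, weights):
    """Weighted sum over cues of the per-cue posteriors, for one pixel."""
    scores = [0.0] * len(priors)
    for likelihoods, w in zip(cue_likelihoods, weights):
        posterior = layer_posteriors(likelihoods, priors)
        for j, p in enumerate(posterior):
            scores[j] += w * p
    return scores


# Two layers (foreground, background), two cues, hypothetical likelihoods.
scores = voting_scores(
    cue_likelihoods=[[2.0, 1.0], [0.5, 1.5]],
    priors=[0.5, 0.5],
    weights=[0.5, 0.5],
)
```

Because each cue is normalised to a posterior before the weighted sum, cues whose likelihoods live in very different spaces contribute on the same 0 to 1 scale.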


Probabilistic fusion
• Assume independent observations
• The total likelihood of the observations given the combined model $M_j = \{M_{j,1}, \ldots, M_{j,K}\}$:

$$p(Z_i \mid M_j) = \prod_{k} f_{i,k}(Z_{i,k} \mid M_{j,k})$$

• The posterior estimate of layer membership:

$$p(j \mid Z_i) = \frac{\prod_{k} f_{i,k}(Z_{i,k} \mid M_{j,k})\, p(j)}{\sum_{l} \prod_{k} f_{i,k}(Z_{i,k} \mid M_{l,k})\, p(l)}$$

An independent opinion pool
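A minimal sketch of the independent opinion pool for one pixel, assuming the per-cue likelihoods have already been evaluated. The function name and the example numbers are hypothetical.

```python
from math import prod


def fused_posteriors(cue_likelihoods, priors):
    """Posterior over layers when cue likelihoods are multiplied (independence assumed)."""
    joint = [
        priors[j] * prod(cue[j] for cue in cue_likelihoods)
        for j in range(len(priors))
    ]
    total = sum(joint)
    return [v / total for v in joint]


# Same hypothetical two-layer, two-cue pixel: rows are cues, columns are layers.
posterior = fused_posteriors([[2.0, 1.0], [0.5, 1.5]], priors=[0.5, 0.5])
```

Since the likelihoods are multiplied rather than averaged, a single cue with a very small likelihood for one layer can pull the fused posterior strongly away from that layer.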


Illustrations

• Assume a distribution for each layer for each cue

[Figure: foreground (f.g.) and background (b.g.) model densities evaluated at an observation Z, giving f(Z | f.g.) = 8 and f(Z | b.g.) = 0.4]


Cue combination: Weighted voting

• Each cue makes independent decision
• Combine using weighted sum
• Assuming equal weights:

          f.g.   b.g.
Colour    8.0    0.4
Motion    0.2    0.3
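As a worked example, the numbers on this slide can be pushed through the voting rule. This assumes the table entries are likelihoods f(Z | layer), that the layer priors are uniform, and that both cues have weight 0.5; the uniform priors are an assumption added here.

```latex
\begin{aligned}
p(\mathrm{f.g.} \mid Z_{\mathrm{colour}}) &= \frac{8.0}{8.0 + 0.4} \approx 0.952, &
p(\mathrm{f.g.} \mid Z_{\mathrm{motion}}) &= \frac{0.2}{0.2 + 0.3} = 0.4 \\
\mathrm{score}(\mathrm{f.g.}) &= 0.5 \cdot 0.952 + 0.5 \cdot 0.4 \approx 0.676, &
\mathrm{score}(\mathrm{b.g.}) &= 0.5 \cdot 0.048 + 0.5 \cdot 0.6 \approx 0.324
\end{aligned}
```

Under these assumptions the pixel is assigned to the foreground, with a graded score of about 0.68.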


Cue combination: Probabilistic

• Compute total likelihood of observations for each layer
• Classify using Bayes’ Rule
• Assuming uniform priors:

          f.g.   b.g.
Colour    8.0    0.4
Motion    0.2    0.3
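As a worked example with the numbers on this slide, again reading the table entries as likelihoods f(Z | layer) and using uniform priors as stated:

```latex
\begin{aligned}
p(Z \mid \mathrm{f.g.}) &= 8.0 \times 0.2 = 1.6, &
p(Z \mid \mathrm{b.g.}) &= 0.4 \times 0.3 = 0.12 \\
p(\mathrm{f.g.} \mid Z) &= \frac{1.6}{1.6 + 0.12} \approx 0.930, &
p(\mathrm{b.g.} \mid Z) &= \frac{0.12}{1.6 + 0.12} \approx 0.070
\end{aligned}
```

The colour cue dominates the product, so the result is close to a binary foreground decision even though the motion cue slightly favours the background.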


Pros and cons

Voting:
• Simple!
• Easy to combine observations from very different spaces
• Not obvious how cues interact
• Graded output

Probabilistic:
• Mathematically well-founded
• One cue can easily dominate over others
• Almost binary output


Training and adaptation

• Start with segmentation just from motion

• Use this to train colour and contrast distributions
  – Use EM algorithm to train Gaussian mixture models

• Subsequently adapted online
  – Recompute models from current data
  – Update model as weighted sum over time window

(Raja et al ECCV’98)
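The "weighted sum over time window" update can be sketched as below. To keep the example short it swaps the Gaussian mixture models for plain normalised histograms; the class name, the window length and the geometric decay of the weights are all assumptions, not details from the slides.

```python
from collections import deque


class WindowedHistogramModel:
    """Colour model kept as a weighted sum of per-frame histograms in a time window."""

    def __init__(self, n_bins, window=5, decay=0.5):
        self.n_bins = n_bins
        self.decay = decay
        self.frames = deque(maxlen=window)  # oldest first, newest last

    def update(self, bin_indices):
        """Recompute a histogram from the current frame's pixels and store it."""
        hist = [0.0] * self.n_bins
        for b in bin_indices:
            hist[b] += 1.0
        n = sum(hist)
        if n > 0:
            self.frames.append([h / n for h in hist])

    def model(self):
        """Weighted sum over the window; newer frames get larger weights."""
        if not self.frames:
            return [1.0 / self.n_bins] * self.n_bins
        ages = range(len(self.frames) - 1, -1, -1)  # newest frame has age 0
        weights = [self.decay ** age for age in ages]
        total = sum(weights)
        return [
            sum(w * h[b] for w, h in zip(weights, self.frames)) / total
            for b in range(self.n_bins)
        ]


colour_model = WindowedHistogramModel(n_bins=2, window=2, decay=0.5)
colour_model.update([0, 0])  # older frame: all pixels in bin 0
colour_model.update([1, 1])  # newer frame: all pixels in bin 1
current = colour_model.model()
```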


Suppressing unreliable cues

• Cues unreliable during training

• No independent motion → poor motion segmentation

• Unreliable in the past → probably unreliable now! (Triesch and von der Malsburg ICAFGR’00)

• Mechanisms:
  – Voting: weights
  – Probabilistic: hyper-priors
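For the voting case, one way to let past reliability set the weights is sketched below. It is only loosely in the spirit of the cited Triesch and von der Malsburg work: the quality measure, the relaxation rate and the function name are assumptions made for illustration.

```python
def update_weights(weights, qualities, rate=0.2):
    """Relax each cue weight toward its share of the total cue quality.

    `qualities` is a hypothetical per-cue reliability measure, for example how
    well each cue agreed with the combined segmentation in the previous frame.
    """
    total_quality = sum(qualities)
    if total_quality <= 0:
        return list(weights)  # no evidence either way: keep the old weights
    targets = [q / total_quality for q in qualities]
    relaxed = [w + rate * (t - w) for w, t in zip(weights, targets)]
    norm = sum(relaxed)
    return [w / norm for w in relaxed]


# A cue that was reliable (0.9) gains weight; an unreliable one (0.1) loses it.
new_weights = update_weights([0.5, 0.5], qualities=[0.9, 0.1])
```

A small rate makes the weights change slowly, so a cue that was unreliable in the past stays suppressed for a while.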


The effect of hyper-priors

[Figure panels: Original pdfs; After marginalization]


Results

Original Probabilistic Voting

Degree of membership to foreground



Cue combination

[Figure panels: Motion; Colour; Prediction; Texture (contrast); Combined, probabilistic; Combined, voting]


Results

Original Foreground




Results: Probabilistic cue integration

Original Foreground mask




The cues

[Figure panels: Motion; Colour; Prediction; Texture (contrast); Combined]


Process overview

[Diagram: Image features; Epipolar geometry; Disparity map; Ego-motion; Independent motion; Regions of interest; Fixation point; Top-down control; Non-retinal info.; Additional retinal info.]


Calibration

Relative orientations have to be known to
• relate disparities to depths
• simplify estimation of disparities

[Equation: optical flow model giving the image displacement at point (x, y) in terms of rotation components r, translation components t and inverse depth 1/z]

Using corner features and optical flow model

Unstable process => be careful
We first assume $r_z$ and $r_y$ to be zero.


More examples


Final output


3D Cues: stereo and motion

• Many objects of interest static. Harder!
• Motion included in full system; now only stereo
• The cues in these examples
  – Stereo data - exist along contours
  – Color data/appearance between contours


Proposed system structure

Technically by a combination of wide field and foveal cameras

A wide field for attention, recognition in foveated view

Problem: transfer from wide field to foveal view

Steps:
• Divide scene into 3D objects
• Select objects through attention (e.g. hue and expected size)
• Fixate (and track) object of interest
• Recognize objects in foveal view


Processes

[Diagram: Where and What pathways; boxes and labels: Attention, Segmentation, Recognition, Hypotheses, Region of Interest, Shape and size, Knowledge, Adaptation]


Flow of information

[Diagram: Wide field (Left, Right) and Foveal (Left, Right) views; boxes: Calibration, Fixation, Registration, Segmentation, Global hue, Attention, Gaze direction, SIFT features, Local hue, Recognition]


Figure-ground segmentation

Disparity map is sliced into layers.
Widths are set to that of the requested object.
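One way to turn an object size into a disparity slice is sketched below. It assumes a parallel stereo pair where depth = focal length × baseline / disparity, and it interprets the slice width as the object's extent in depth; the function names and the camera numbers are hypothetical.

```python
def slice_bounds(d_center, object_size, focal_px, baseline_m):
    """Disparity interval covering an object of given depth extent (metres)."""
    fb = focal_px * baseline_m
    depth = fb / d_center
    near = depth - object_size / 2.0
    far = depth + object_size / 2.0
    if near <= 0:
        raise ValueError("object extends behind the camera")
    return fb / far, fb / near  # (smallest disparity, largest disparity)


def slice_mask(disparity_map, d_min, d_max):
    """Boolean mask of the pixels whose disparity falls inside the slice."""
    return [[d_min <= d <= d_max for d in row] for row in disparity_map]


# Hypothetical camera: 500 px focal length, 10 cm baseline, 20 cm wide object.
d_min, d_max = slice_bounds(25.0, object_size=0.2, focal_px=500.0, baseline_m=0.1)
mask = slice_mask([[20.0, 25.0], [26.0, 30.0]], d_min, d_max)
```

Note that a fixed metric width maps to a wider disparity interval for near objects than for far ones.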


Figure-ground segmentation

• Disparities using SAD correlations.
• Segmentation based on slicing the 3D world.
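A minimal sketch of SAD (sum of absolute differences) matching along one rectified scanline. The window size, the search range and the convention that a point at x in the left image appears at x - d in the right image are assumptions for illustration.

```python
def sad_disparity(left_row, right_row, x, half_window, max_disparity):
    """Disparity at column x of the left row that minimises the SAD cost."""
    best_d, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        cost = 0
        valid = True
        for dx in range(-half_window, half_window + 1):
            xl, xr = x + dx, x + dx - d
            if not (0 <= xl < len(left_row) and 0 <= xr < len(right_row)):
                valid = False  # window falls outside one of the images
                break
            cost += abs(left_row[xl] - right_row[xr])
        if valid and cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# The pattern 9, 5, 7 sits two pixels further right in the left image.
right_row = [0, 0, 9, 5, 7, 0, 0, 0]
left_row = [0, 0, 0, 0, 9, 5, 7, 0]
disparity = sad_disparity(left_row, right_row, x=5, half_window=1, max_disparity=3)
```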



Hue based attention

Local hue histograms correlated with that of requested object.
Fast implementation using rotating sums.
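The idea of correlating local hue histograms with a target histogram can be sketched in one dimension. Here "rotating sums" is interpreted as a running histogram that is updated by removing the pixel leaving the window and adding the pixel entering it; that interpretation, the normalised correlation score and all names are assumptions.

```python
from math import sqrt


def hue_match_row(hue_bins, target_hist, window):
    """Normalised correlation between sliding-window hue histograms and a target."""
    target_norm = sqrt(sum(t * t for t in target_hist))
    hist = [0] * len(target_hist)
    for b in hue_bins[:window]:
        hist[b] += 1
    scores = []
    for start in range(len(hue_bins) - window + 1):
        if start > 0:
            hist[hue_bins[start - 1]] -= 1           # pixel leaving the window
            hist[hue_bins[start + window - 1]] += 1  # pixel entering the window
        dot = sum(h * t for h, t in zip(hist, target_hist))
        norm = sqrt(sum(h * h for h in hist)) * target_norm
        scores.append(dot / norm if norm > 0 else 0.0)
    return scores


# Quantised hues along a row; the requested object is pure hue bin 1.
row_scores = hue_match_row([0, 0, 1, 1, 1, 2, 2], target_hist=[0, 3, 0], window=3)
best_start = max(range(len(row_scores)), key=row_scores.__getitem__)
```

Each step costs two histogram updates regardless of the window size, which is what makes the running-sum formulation fast.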


Saliency peaks

Peaks from blob detection of depth slices.
Based on Differences of Gaussians.
Hue saliency map used for weighting.
Random value added before selection.
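The peak selection can be sketched in one dimension as below: a Difference of Gaussians response on a slice profile, weighted by hue saliency, with a small random value added before taking the maximum. The sigmas, the noise level, the border handling and the function names are assumptions; the random term presumably keeps the system from always choosing the same peak.

```python
import math
import random


def gaussian_blur_1d(signal, sigma):
    """Gaussian smoothing with the signal clamped at its borders."""
    radius = int(3 * sigma) + 1
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    norm = sum(kernel)
    n = len(signal)
    return [
        sum(w * signal[min(max(i + k, 0), n - 1)] for k, w in zip(offsets, kernel)) / norm
        for i in range(n)
    ]


def select_peak(slice_profile, hue_saliency, noise=0.01, rng=None):
    """Index of the strongest hue-weighted DoG response after adding noise."""
    rng = rng or random.Random()
    fine = gaussian_blur_1d(slice_profile, 1.0)
    coarse = gaussian_blur_1d(slice_profile, 2.0)
    scored = [
        max(f - c, 0.0) * h + rng.uniform(0.0, noise)
        for f, c, h in zip(fine, coarse, hue_saliency)
    ]
    return max(range(len(scored)), key=scored.__getitem__)


# Two equal blobs in a depth slice; only the second matches the requested hue.
profile = [0] * 3 + [1] * 3 + [0] * 6 + [1] * 3 + [0] * 3
saliency = [0.1] * 9 + [1.0] * 9
peak = select_peak(profile, saliency, rng=random.Random(0))
```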


Fixation

The foveal system continuously tries to fixate
• done using corner features
• and affine essential matrix

$$F = \begin{pmatrix} 0 & 0 & c \\ 0 & 0 & d \\ a & b & e \end{pmatrix}$$

Zero disparity filters won’t work
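Written out for a matrix of this affine form, F = [0 0 c; 0 0 d; a b e], the epipolar constraint becomes linear in the image coordinates. The sketch assumes homogeneous points (x, y, 1) in one image and (x', y', 1) in the other; which image carries the prime is a convention chosen here.

```latex
\[
\begin{aligned}
\mathbf{x}'^{\top} F\, \mathbf{x}
&= \begin{pmatrix} x' & y' & 1 \end{pmatrix}
   \begin{pmatrix} 0 & 0 & c \\ 0 & 0 & d \\ a & b & e \end{pmatrix}
   \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \\
&= a x + b y + c x' + d y' + e = 0
\end{aligned}
\]
```

Each corner correspondence therefore gives one linear equation in (a, b, c, d, e), so the parameters can be estimated up to scale by linear least squares from four or more matches.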


Foveated segmentation

To boost recognition

• Foveal segmentation based on disparities
• Rectification using affine fundamental matrix
• Only search for disparities around zero => Large number of false positives
• Points clustered in 3D using mean shift
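A minimal mean shift sketch for clustering 3D points, using a flat kernel. The bandwidth, the rule for merging nearby modes and the function names are assumptions made for illustration.

```python
def _dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_shift_mode(points, start, bandwidth, iterations=50, tol=1e-9):
    """Move a point uphill to the local density mode (flat kernel)."""
    mode = list(start)
    for _ in range(iterations):
        near = [p for p in points if _dist2(p, mode) <= bandwidth ** 2]
        if not near:
            break
        shifted = [sum(c) / len(near) for c in zip(*near)]
        if _dist2(shifted, mode) < tol:
            break
        mode = shifted
    return mode


def mean_shift_clusters(points, bandwidth):
    """Label each point by the mode it converges to; nearby modes are merged."""
    modes, labels = [], []
    for p in points:
        m = mean_shift_mode(points, p, bandwidth)
        for idx, existing in enumerate(modes):
            if _dist2(m, existing) <= (bandwidth / 2.0) ** 2:
                labels.append(idx)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels


# Two well separated groups of hypothetical 3D points (x, y, depth).
points_3d = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.0, 0.1, 1.1),
             (2.0, 2.0, 3.0), (2.1, 2.0, 3.0)]
modes, labels = mean_shift_clusters(points_3d, bandwidth=0.5)
```

Small clusters produced by false positive matches can then be discarded by size, although that step is not shown here.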


Foveated segmentation


Small object database in real-time experiments

Models of SIFT features and hue histograms


Visual scene search




Segmentation robustness


Effect of occlusions


Effect of rotations


Recognition in wide field of view (w-f-o-v)


Recognition after foveation


Conclusions

We have a running system.
Objects normally found within three saccades
Concern: dependency on corner features

Current work:
• Focus on recognition and categorization
• More robust foveal segmentation
• Additional cues e.g. texture
• Learning and adaptation on all levels
