
Page 1: Attention in Computer Vision
Page 2: Attention in Computer Vision

Attention in Computer Vision

Mica Arie-Nachimson and Michal Kiwkowitz

May 22, 2005
Advanced Topics in Computer Vision

Weizmann Institute of Science

Page 3: Attention in Computer Vision

Problem Definition – Search Order

Object recognition

NO

• Vision applications apply “expensive” algorithms (e.g. recognition) to image patches

• Mostly naïve selection of patches

• The selection of patches determines the number of calls to the “expensive” algorithm

Page 4: Attention in Computer Vision

Problem Definition - Search Order

Object recognition

NO / YES

• A more sophisticated selection of patches would mean fewer calls to the “expensive” algorithm

• Attention is used to focus efficiently on incoming data (better use of limited processing capacity)

Page 5: Attention in Computer Vision

Problem Definition - Search Order

Object recognition

(Patches examined in order 1, 2, 3, 4, 5, 6)

Page 6: Attention in Computer Vision

Outline
• What is Attention
• Attention in Object Recognition
• Saliency Model
  – Feature Integration Theory
  – Saliency Algorithm
  – Saliency & Object Recognition
  – Comparison
• Inner Scene Similarity Model
  – Biological motivation
  – Difficulty of Search Tasks
  – Algorithms
    • FLNN
    • VSLE

Page 7: Attention in Computer Vision

Outline
• What is Attention
• Attention in Object Recognition
• Saliency Model
  – Feature Integration Theory
  – Saliency Algorithm
  – Saliency & Object Recognition
  – Comparison
• Inner Scene Similarity Model
  – Biological motivation
  – Difficulty of Search Tasks
  – Algorithms
    • FLNN
    • VSLE

Page 8: Attention in Computer Vision

Attention

• Attention implies allocating resources, perceptual or cognitive, to some things at the expense of others.

Page 9: Attention in Computer Vision

What is Attention

• You are sitting in class listening to a lecture.

• Two people behind you are talking. – Can you hear the lecture?

• One of them mentions the name of a friend of yours. – How did you know?

Page 10: Attention in Computer Vision

Attention in Other Applications

• Face Detection (feature selection)

• Video Analysis (temporal block selection)

• Robot Navigation (select locations)

• …

Page 11: Attention in Computer Vision

Attention is Directed by:

Bottom-up:
• From small to large units of meaning
• Rapid
• Task-independent

Page 12: Attention in Computer Vision

Attention is Directed by:

Top-down:
• Use higher levels (context, expectation) to process incoming information (guess)
• Slower
• Task-dependent

http://www.rybak-et-al.net/nisms.html

Page 13: Attention in Computer Vision

Outline
• What is Attention
• Attention in Object Recognition
• Saliency Model
  – Feature Integration Theory
  – Saliency Algorithm
  – Saliency & Object Recognition
  – Comparison
• Inner Scene Similarity Model
  – Biological motivation
  – Difficulty of Search Tasks
  – Algorithms
    • FLNN
    • VSLE

Page 14: Attention in Computer Vision

When is information selected (filtered)?
– Early selection (Broadbent, 1958)
– Cocktail party phenomenon (Moray, 1959)
– Late selection (Treisman, 1960) – attenuation

• All information is sent to perceptual systems for processing
• Some is selected for complete processing
• Some is more likely to be selected

Attention – which?

Page 15: Attention in Computer Vision

Parallel Search

Is there a green O?

+

A. Treisman, G. Gelade, 1980

Page 16: Attention in Computer Vision

Conjunction Search

Is there a green N ?

+

A. Treisman, G. Gelade, 1980

Page 17: Attention in Computer Vision

Results

A. Treisman, G. Gelade, 1980

Page 18: Attention in Computer Vision

Conjunction Search

+

A. Treisman, G. Gelade, 1980

Page 19: Attention in Computer Vision

Color map Orientation map

A. Treisman, G. Gelade, 1980

Page 20: Attention in Computer Vision

Color map Orientation map

A. Treisman, G. Gelade, 1980

Page 21: Attention in Computer Vision

Conjunction Search

+

A. Treisman, G. Gelade, 1980

Page 22: Attention in Computer Vision

Primitives

• Intensity
• Orientation
• Color
• Curvature
• Line End
• Movement

[Slide shows example stimulus arrays for each feature dimension]

Page 23: Attention in Computer Vision

Feature Integration Theory

Attention – two stages:

Pre-attention:
• Parallel processing
• Low-level features
• Fast
• Parallel search

Attention:
• Serial processing
• Localized focus
• Slower
• Conjunctive search

How is the focus found & shifted?

A. Treisman, G. Gelade, 1980

Page 24: Attention in Computer Vision

Outline
• What is Attention
• Attention in Object Recognition
• Saliency Model
  – Feature Integration Theory
  – Saliency Algorithm
  – Saliency & Object Recognition
  – Comparison
• Inner Scene Similarity Model
  – Biological motivation
  – Difficulty of Search Tasks
  – Algorithms
    • FLNN
    • VSLE

Page 25: Attention in Computer Vision

Shifts in Attention

“Shifts in selective visual attention: towards the underlying neural circuitry”,

Christof Koch, and Shimon Ullman, 1985

C. Koch, and S. Ullman, 1985

[Diagram: several feature maps (orientation, color, curvature, line end, movement) feed into a central representation – a saliency map – on which attention operates]

Page 26: Attention in Computer Vision

Saliency

“A model of saliency-based visual attention for rapid scene analysis”

Laurent Itti, Christof Koch, and Ernst Niebur, 1998

L. Itti, C. Koch, and E. Niebur, 1998

• Salient - stands out

• Example – telephone & road sign have high saliency

Page 27: Attention in Computer Vision

Figure from C. Koch; L. Itti, C. Koch, and E. Niebur, 1998

Page 28: Attention in Computer Vision

Intensity

L. Itti, C. Koch, and E. Niebur, 1998

Cells in the retina

Page 29: Attention in Computer Vision

Intensity

Create 8 spatial scales using Gaussian pyramids

L. Itti, C. Koch, and E. Niebur, 1998

Page 30: Attention in Computer Vision

Intensity

Center-Surround difference operator:
– Sensitive to local spatial discontinuities
– Principal computation in the retina & primary visual cortex
– Subtracts the coarse scale from the fine scale (center “+” taken at a fine scale, surround “−” at a coarse scale)

L. Itti, C. Koch, and E. Niebur, 1998
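As a rough illustration (not the authors' code), the operator can be sketched in a few lines of Python with OpenCV: build a Gaussian pyramid, upsample the coarse (surround) level back to the fine (center) resolution, and take the point-by-point absolute difference. The file name and parameter choices are purely illustrative.

```python
import cv2
import numpy as np

def gaussian_pyramid(image, levels=9):
    """Pyramid levels 0..levels-1; level 0 is the original resolution."""
    pyr = [image.astype(np.float32)]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def center_surround(pyr, c, s):
    """|I(c) - interp(I(s))|: fine level c minus coarse level s, upsampled to level c."""
    fine = pyr[c]
    coarse = cv2.resize(pyr[s], (fine.shape[1], fine.shape[0]),
                        interpolation=cv2.INTER_LINEAR)
    return np.abs(fine - coarse)

# e.g. one intensity map with center scale 2 and surround scale 5 (illustrative image path)
img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
intensity_2_5 = center_surround(gaussian_pyramid(img), c=2, s=5)
```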

Page 31: Attention in Computer Vision

Toy Example

Fine level:
0    0    0
0  255    0
0    0    0

Coarse level (after the Gaussian pyramid and interpolation back to the fine resolution):
0    0    0
0    0    0
0    0    0

Point-by-point subtraction:
0    0    0
0  255    0
0    0    0

An isolated bright pixel gives a strong center-surround response.

Page 32: Attention in Computer Vision

Toy Example

Fine level:
255  255  255
255  255  255
255  255  255

Coarse level (after the Gaussian pyramid and interpolation):
255  255  255
255  255  255
255  255  255

Point-by-point subtraction:
0    0    0
0    0    0
0    0    0

A uniform region gives no center-surround response.

Page 33: Attention in Computer Vision

Intensity

Compute:
I(c, s) = | I(c) ⊖ I(s) |,  with c ∈ {2, 3, 4},  s = c + δ,  δ ∈ {3, 4}

→ 6 intensity maps

Different center/surround ratios give multiscale feature extraction, e.g.:
I(2, 5) = | I(2) ⊖ I(5) |
I(2, 6) = | I(2) ⊖ I(6) |
I(3, 6) = | I(3) ⊖ I(6) |

L. Itti, C. Koch, and E. Niebur, 1998

Page 34: Attention in Computer Vision

Color

Same c and s as with intensity → 12 color maps

Kandel et al. (2000). Principles of Neural Science. McGraw-Hill/Appleton & Lange

L. Itti, C. Koch, and E. Niebur, 1998
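For reference, the 12 maps come from two color-opponency channels (red/green and blue/yellow), each computed for the same six (c, s) pairs. In Itti, Koch & Niebur (1998) the two channels are, up to the exact definition of the broadly tuned R, G, B, Y components:

```latex
\mathcal{RG}(c,s) = \bigl|\,(R(c) - G(c)) \ominus (G(s) - R(s))\,\bigr|, \qquad
\mathcal{BY}(c,s) = \bigl|\,(B(c) - Y(c)) \ominus (Y(s) - B(s))\,\bigr|
```

where ⊖ is the same across-scale (interpolate-and-subtract) difference used for intensity; 2 channels × 6 (c, s) pairs = 12 maps.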

Page 35: Attention in Computer Vision

Orientation

Same c and s as with intensity → 24 orientation maps (4 orientations × 6 center-surround pairs)

O(c, s, θ) = | O(c, θ) ⊖ O(s, θ) |,  θ ∈ {0°, 45°, 90°, 135°}

From Visual system presentation by S. Ullman

L. Itti, C. Koch, and E. Niebur, 1998

Page 36: Attention in Computer Vision

Figure from C. Koch; L. Itti, C. Koch, and E. Niebur, 1998

Page 37: Attention in Computer Vision

Normalization Operator N(·)

L. Itti, C. Koch, and E. Niebur, 1998
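The paper's N(·) promotes maps that contain a few strong, isolated peaks and suppresses maps with many comparable peaks. A rough Python sketch follows; the fixed range M and the local-maximum neighborhood size are illustrative choices, not the paper's exact values.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def normalize_map(fmap, M=1.0, neighborhood=7):
    """Sketch of the N(.) operator of Itti, Koch & Niebur (1998)."""
    fmap = fmap.astype(np.float64)
    rng = fmap.max() - fmap.min()
    if rng == 0:
        return np.zeros_like(fmap)
    fmap = (fmap - fmap.min()) / rng * M                       # 1. scale to a fixed range [0, M]
    local_max = (maximum_filter(fmap, size=neighborhood) == fmap) & (fmap > 0)
    maxima = fmap[local_max]
    global_max = maxima.max()
    others = maxima[maxima < global_max]
    m_bar = others.mean() if others.size else 0.0              # 2. mean of the other local maxima
    return fmap * (global_max - m_bar) ** 2                    # 3. promote maps with "lonely" peaks
```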

Page 38: Attention in Computer Vision

Saliency Map

S = ( N(I) + N(C) + N(O) ) / 3

where I, C, O are the intensity, color and orientation conspicuity maps and N(·) is the normalization operator

L. Itti, C. Koch, and E. Niebur, 1998

Page 39: Attention in Computer Vision

Algorithm – up to now

1. Extract feature maps
2. Compute center-surround maps (42 in total)
   • Intensity – I (6)
   • Color – C (12)
   • Orientation – O (24)
3. Combine each channel into a conspicuity map
4. Compute saliency by summing and normalizing the maps
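Put together, steps 1–4 amount to the following data flow (a compressed sketch that reuses normalize_map from the N(·) sketch above, assumes all maps have been resized to a common scale, and glosses over details such as the exact across-scale addition; the helper names are illustrative):

```python
import numpy as np

def conspicuity(maps):
    """Step 3: normalize each center-surround map, sum them, normalize the result again.
    normalize_map is the N(.) sketch above; maps share one resolution."""
    return normalize_map(sum(normalize_map(m) for m in maps))

def saliency(intensity_maps, color_maps, orientation_maps):
    """Step 4: average the three conspicuity maps into the saliency map S."""
    I_bar = conspicuity(intensity_maps)      # 6 maps
    C_bar = conspicuity(color_maps)          # 12 maps
    O_bar = conspicuity(orientation_maps)    # 24 maps
    return (I_bar + C_bar + O_bar) / 3.0
```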

Page 40: Attention in Computer Vision

Laurent Itti, Christof Koch, and Ernst Niebur, 1998

Page 41: Attention in Computer Vision

Winner-Take-All network of leaky integrate-and-fire neurons

“Inhibition of return”

Selection of the FOA (Focus Of Attention)

L. Itti, C. Koch, and E. Niebur, 1998
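The paper implements the winner-take-all stage with a 2-D layer of leaky integrate-and-fire neurons. For a static saliency map the resulting sequence of fixations can be mimicked by a much simpler argmax-and-suppress loop, sketched below; the inhibition radius is an illustrative parameter, not the paper's value.

```python
import numpy as np

def fixations(saliency_map, n_shifts=5, inhibition_radius=20):
    """Return the first n_shifts attended locations on a static saliency map."""
    sal = saliency_map.astype(np.float64).copy()
    h, w = sal.shape
    ys, xs = np.mgrid[0:h, 0:w]
    focus = []
    for _ in range(n_shifts):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)     # winner takes all
        focus.append((y, x))
        mask = (ys - y) ** 2 + (xs - x) ** 2 <= inhibition_radius ** 2
        sal[mask] = 0.0                                        # inhibition of return
    return focus
```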

Page 42: Attention in Computer Vision

Results

• FOA shifts: 30–70 ms
• Inhibition: 500–900 ms

Inhibition of return ends

L. Itti, C. Koch, and E. Niebur, 1998

Page 43: Attention in Computer Vision

Results

Spatial Frequency Content (SFC), Reinagel & Zador, 1997

[Figure panels: Image, SFC, Saliency, Output]

L. Itti, C. Koch, and E. Niebur, 1998

Page 44: Attention in Computer Vision

Results

[Figure panels (a)–(d): Image, SFC, Saliency, Output]

L. Itti, C. Koch, and E. Niebur, 1998; Spatial Frequency Content, Reinagel & Zador, 1997

Page 45: Attention in Computer Vision

Outline
• What is Attention
• Attention in Object Recognition
• Saliency Model
  – Feature Integration Theory
  – Saliency Algorithm
  – Saliency & Object Recognition
  – Comparison
• Inner Scene Similarity Model
  – Biological motivation
  – Difficulty of Search Tasks
  – Algorithms
    • FLNN
    • VSLE

Page 46: Attention in Computer Vision

Attention & Object Recognition

• “Is bottom-up attention useful for object recognition?” – U. Rutishauser, D. Walther, C. Koch and P. Perona, 2004

Computer recognition: segmented, labeled images
Human recognition: cluttered scenes, non-labeled objects

Attention

Page 47: Attention in Computer Vision

Object Recognition

Saliency model → grow a region in the strongest feature map → pass it to object recognition (Lowe)

U. Rutishauser, D. Walther, C. Koch and P. Perona, 2004
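One possible reading of the "grow a region in the strongest map" step, sketched here with illustrative parameters (4-connectivity, a relative threshold of 0.1): flood-fill outward from the attended location over pixels of the winning feature map that stay above a fraction of the seed value, and use the resulting mask to restrict which keypoints are passed to the recognizer.

```python
from collections import deque
import numpy as np

def grow_attended_region(winning_map, seed_yx, rel_threshold=0.1):
    """Flood-fill from the attended location; keep pixels above rel_threshold * seed value."""
    h, w = winning_map.shape
    thresh = rel_threshold * winning_map[seed_yx]
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed_yx])
    mask[seed_yx] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx] \
                    and winning_map[ny, nx] >= thresh:
                mask[ny, nx] = True
                queue.append((ny, nx))
    return mask  # pass only keypoints inside this mask to the recognizer
```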

Page 48: Attention in Computer Vision

Attention & Object Recognition

Learning inventories – “grocery cart problem”

U. Rutishauser, D. Walther, C. Koch and P. Perona, 2004

Real-world scenes
1 image for training (15 fixations)
2–5 images for testing (20 fixations)

Page 49: Attention in Computer Vision

[Diagram: training image and testing images → object recognition → match]

Page 50: Attention in Computer Vision

“Grocery Cart” Problem

U. Rutishauser, D. Walther, C. Koch and P. Perona, 2004

training testing1

testing2

Page 51: Attention in Computer Vision

“Grocery Cart” Problem

Downsides:

• Bias of human photography

• Small image set

U. Rutishauser, D. Walther, C. Koch and P. Perona, 2004

Solution:
• Robot as acquisition tool

Page 52: Attention in Computer Vision

Robot - Landmark Learning

Objective – how many objects are found and classified correctly?

Navigation – simple obstacle avoiding algorithm using infrared sensors

U. Rutishauser, D. Walther, C. Koch and P. Perona, 2004

Page 53: Attention in Computer Vision

Object recognition

< 3 key points

Page 54: Attention in Computer Vision

Landmark Learning

With Attention

U. Rutishauser, D. Walther, C. Koch and P. Perona, 2004

Page 55: Attention in Computer Vision

Landmark Learning

With Random Selection

U. Rutishauser, D. Walther, C. Koch and P. Perona, 2004

Page 56: Attention in Computer Vision

Landmark Learning - Results

U. Rutishauser, D. Walther, C. Koch and P. Perona, 2004

Page 57: Attention in Computer Vision

Saliency Based Object Recognition

• Biologically motivated
• Uses bottom-up, allows combining top-down information
• Segmentation
  – Cluttered scenes
  – Unlabeled objects
  – Multiple objects in a single image
• Static priority map

U. Rutishauser, D. Walther, C. Koch and P. Perona, 2004

Page 58: Attention in Computer Vision

Outline
• What is Attention
• Attention in Object Recognition
• Saliency Model
  – Feature Integration Theory
  – Saliency Algorithm
  – Saliency & Object Recognition
  – Comparison
• Inner Scene Similarity Model
  – Biological motivation
  – Difficulty of Search Tasks
  – Algorithms
    • FLNN
    • VSLE

Page 59: Attention in Computer Vision

Comparison

“Comparing attention operators for learning landmarks”, R. Sim, S. Polifroni, G. Dudek , June 2003

Other attention operators for low level features

R. Sim, S. Polifroni, G. Dudek , June 2003

Page 60: Attention in Computer Vision

Comparison

R. Sim, S. Polifroni, G. Dudek , June 2003

Operators compared: edge density, radial symmetry, smallest eigenvalue, Caltech saliency

Page 61: Attention in Computer Vision

Comparison

• Landmark learning

• Training – learn landmarks knowing camera pose

• Testing - determine pose of camera according to landmarks (pose estimation)

R. Sim, S. Polifroni, G. Dudek , June 2003

Page 62: Attention in Computer Vision

Comparison - Results

• All operators perform better than random

• Radial symmetry gives the worst results

• The Caltech (saliency) operator performs similarly to the edge-density and eigenvalue operators

• BUT – more complex to implement, more computing time

• → less preferred in practice

R. Sim, S. Polifroni, G. Dudek , June 2003

Page 63: Attention in Computer Vision

Outline
• What is Attention
• Attention in Object Recognition
• Saliency Model
  – Feature Integration Theory
  – Saliency Algorithm
  – Saliency & Object Recognition
  – Comparison
• Inner Scene Similarity Model
  – Biological motivation
  – Difficulty of Search Tasks
  – Algorithms
    • FLNN
    • VSLE

Page 64: Attention in Computer Vision

The Problem

Object recognition

(Patches examined in order 1, 2, 3, 4, 5, 6)

Page 65: Attention in Computer Vision

Outline
• What is Attention
• Attention in Object Recognition
• Saliency Model
  – Feature Integration Theory
  – Saliency Algorithm
  – Saliency & Object Recognition
  – Comparison
• Inner Scene Similarity Model
  – Biological motivation
  – Difficulty of Search Tasks
  – Algorithms
    • FLNN
    • VSLE

Page 66: Attention in Computer Vision

Biological Motivation

• An alternative approach: continuous search difficulty

• Based on similarity:– Between Targets and Non-Targets in the scene– Between Non-Targets and Non-Targets in the scene

• Similar structural units do not need separate treatment

• Structural units similar to a possible target get high priority

Duncan & Humphreys [89]

Page 67: Attention in Computer Vision

Biological Motivation

[Diagram: search difficulty as a function of target–nontarget similarity and nontarget–nontarget similarity; search is hardest when targets are similar to non-targets and non-targets are dissimilar to each other]

Duncan & Humphreys [89]

Page 68: Attention in Computer Vision

Biological Motivation

• Explains pop-out vs. serial search phenomenon

Non-targets:

Target:

Duncan & Humphreys [89]

Page 69: Attention in Computer Vision

Biological Motivation

• Explains pop-out vs. serial search phenomenon

Non-targets:

Target:

Duncan & Humphreys [89]

Page 70: Attention in Computer Vision

Biological Motivation

• Explains the pop-out vs. serial search phenomenon

[Diagram: the pop-out and serial-search examples placed on the search-difficulty plot of target–nontarget vs. nontarget–nontarget similarity]

Duncan & Humphreys [89]

Page 71: Attention in Computer Vision

Using Inner-scene Similarities

• Every candidate is characterized by a vector of n attributes

• n-dimensional metric space
  – A candidate is a point in the space
  – Some distance function d is associated with the space

Avraham & Lindenbaum [04] Avraham & Lindenbaum [05]

Page 72: Attention in Computer Vision

Using Inner-scene Similarities Example

• One feature only: object area

• d: regular Euclidean distance

Feature space

Page 73: Attention in Computer Vision

Outline
• What is Attention
• Attention in Object Recognition
• Saliency Model
  – Feature Integration Theory
  – Saliency Algorithm
  – Saliency & Object Recognition
  – Comparison
• Inner Scene Similarity Model
  – Biological motivation
  – Difficulty of Search Tasks
  – Algorithms
    • FLNN
    • VSLE

Page 74: Attention in Computer Vision

Difficulty of Search

• The difficulty measure is the number of queries until the first target is found

• Two main factors:
  – Distance between Targets and Non-Targets
  – Distance between Non-Targets and Non-Targets

Feature space

Page 75: Attention in Computer Vision

Difficulty of Search – Cover

Feature space

c: the number of circles in the cover

Page 76: Attention in Computer Vision

Difficulty of Search

c will be our measure of the search difficulty

We need some constraint on the circles’ size!

c: the number of circles

Page 77: Attention in Computer Vision

Difficulty of Search

dt: max–min target distance

Page 78: Attention in Computer Vision

Difficulty of Search

dt-cover: a cover by circles of diameter dt

Page 79: Attention in Computer Vision

Difficulty of Search

Minimum dt-cover

c: the number of circles in the minimal dt-cover (circle diameter = dt)

Page 80: Attention in Computer Vision

Difficulty of Search

c: the number of circles

[Figure: example cover with c = 7 circles of diameter dt]

Page 81: Attention in Computer Vision

Difficulty of Search

c: insects example

[Feature space figure: c = 3]

Page 82: Attention in Computer Vision

Difficulty of Search

Example: easy search (c = 2)

Page 83: Attention in Computer Vision

Difficulty of Search

Example: hard search (c = number of candidates)

Page 84: Attention in Computer Vision

Define the Difficulty using c

• Lower bound: Every search algorithm needs c calls to the oracle before finding the first target in the worst case

• Upper bound: There is an algorithm that needs at most c calls to the oracle to find the first target, for all search tasks

Difficulty of Search

Page 85: Attention in Computer Vision

Lower bound

Every search algorithm needs c calls to the oracle before finding the first target in the worst case

Difficulty of Search

[Figure: candidates grouped into 5 clusters, each of diameter dt; in the worst case the first target lies in the last cluster examined, so c queries are needed]

Page 86: Attention in Computer Vision

Upper bound

There is an algorithm that needs at most c calls to the oracle to find the first target, for all search tasks:

FLNN – Farthest Labeled Nearest Neighbor

Difficulty of Search

Page 87: Attention in Computer Vision

Outline
• What is Attention
• Attention in Object Recognition
• Saliency Model
  – Feature Integration Theory
  – Saliency Algorithm
  – Saliency & Object Recognition
  – Comparison
• Inner Scene Similarity Model
  – Biological motivation
  – Difficulty of Search Tasks
  – Algorithms
    • FLNN
    • VSLE

Page 88: Attention in Computer Vision

FLNN – Farthest Labeled Nearest Neighbor

Efficient Algorithms

[Figure: FLNN query order 1–5 over the candidate clusters]

c is a tight bound!
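A sketch of FLNN as its name suggests (an illustration of the idea, not the authors' code): repeatedly query the unlabeled candidate that is farthest from its nearest already-labeled neighbor, until the oracle, i.e. the expensive recognizer, reports a target.

```python
import numpy as np

def flnn_search(features, oracle):
    """features: (n, d) candidate descriptors; oracle(i) -> True iff candidate i is a target."""
    n = features.shape[0]
    unlabeled = set(range(n))
    nearest_labeled = np.full(n, np.inf)          # distance to the nearest labeled candidate
    order = []
    current = 0                                   # arbitrary first query
    while unlabeled:
        unlabeled.discard(current)
        order.append(current)
        if oracle(current):
            return current, order                 # first target found
        d = np.linalg.norm(features - features[current], axis=1)
        nearest_labeled = np.minimum(nearest_labeled, d)
        nearest_labeled[order] = -np.inf          # never re-query labeled candidates
        current = int(np.argmax(nearest_labeled)) # farthest from the labeled set goes next
    return None, order
```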

Page 89: Attention in Computer Vision

Difficulty of Search

How do we compute c?
– Need to know dt
– Compute the minimal dt-cover
– Count the number of circles (c = 7 in the example)

Page 90: Attention in Computer Vision

Difficulty of Search

How do we compute c?
– Need to know dt → but to know the exact dt we need to know all the targets and non-targets, and that’s what we’re looking for…
– Compute the minimal dt-cover → computing the minimal dt-cover is NP-complete!
– Count the number of circles = c → OK, that’s easy…
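Since the minimal dt-cover is NP-complete to find, one practical illustration is a greedy construction: repeatedly take an uncovered candidate and cover everything within dt/2 of it. Each such ball has diameter at most dt, so this produces some valid dt-cover, and its size is an upper bound on c. This is a sketch under those assumptions, not the authors' method.

```python
import numpy as np

def greedy_dt_cover_size(features, d_t):
    """features: (n, d) candidate descriptors; returns the size of a greedy dt-cover."""
    n = features.shape[0]
    uncovered = np.ones(n, dtype=bool)
    circles = 0
    while uncovered.any():
        center = int(np.argmax(uncovered))        # first uncovered candidate
        d = np.linalg.norm(features - features[center], axis=1)
        uncovered &= d > d_t / 2.0                # everything within d_t/2 is now covered
        circles += 1
    return circles                                # >= minimal c, since this is a valid cover
```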

Page 91: Attention in Computer Vision

Upper & Lower Bounds on c

• Upper bounds:
  – The number of candidates
  – If we know that dt is larger than some d0: can approximate the cover size
• Lower bounds:
  – FLNN worst case
  – If we know that dt is larger than some d0: can approximate the cover size

Difficulty of Search

Page 92: Attention in Computer Vision

Outline
• What is Attention
• Attention in Object Recognition
• Saliency Model
  – Feature Integration Theory
  – Saliency Algorithm
  – Saliency & Object Recognition
  – Comparison
• Inner Scene Similarity Model
  – Biological motivation
  – Difficulty of Search Tasks
  – Algorithms
    • FLNN
    • VSLE

Page 93: Attention in Computer Vision

Improving FLNN

• What’s wrong with FLNN?
  – Relates only to the nearest known neighbor
  – Finds only the first target efficiently
  – Cannot be easily extended to include top-down information

Efficient Algorithms

Page 94: Attention in Computer Vision

VSLE – Visual Search using Linear Estimation

• Each candidate has a probability of being a target
• Query the candidate with the highest probability
• Update the other candidates’ probabilities according to the known results
  – Every known target/non-target affects the other candidates in inverse relation to its distance
  – If we know the results for candidates 1, …, m, the remaining labels are estimated linearly (see the appendix slides)
• Dynamic priority map

Efficient Algorithms
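A sketch of the query/update loop. VSLE proper estimates the unknown labels with a linear estimator whose coefficients are solved from the pairwise distances (see the appendix slides); the kernel-weighted average below is a simplified stand-in for that estimator, and the prior and bandwidth sigma are assumptions made for illustration.

```python
import numpy as np

def vsle_like_search(features, oracle, prior=0.5, sigma=1.0):
    """features: (n, d) candidate descriptors; oracle(i) -> True iff candidate i is a target."""
    n = features.shape[0]
    prob = np.full(n, prior)                  # dynamic priority map
    labels = np.full(n, np.nan)               # 1 = target, 0 = non-target, nan = not queried yet
    for _ in range(n):
        # query the not-yet-queried candidate with the highest probability
        candidate = int(np.argmax(np.where(np.isnan(labels), prob, -np.inf)))
        labels[candidate] = 1.0 if oracle(candidate) else 0.0
        if labels[candidate] == 1.0:
            return candidate, prob            # first target found
        known = ~np.isnan(labels)
        # distance of every candidate to every known (queried) candidate
        d = np.linalg.norm(features[:, None, :] - features[known][None, :, :], axis=2)
        w = np.exp(-(d / sigma) ** 2)         # closer known results weigh more
        prob = (w @ labels[known] + prior) / (w.sum(axis=1) + 1.0)
    return None, prob
```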

Page 95: Attention in Computer Vision

Efficient Algorithms

VSLE – Visual Search using Linear Estimation

[Figure: the dynamic priority map – each candidate’s target probability, updated as queries are answered]

Page 96: Attention in Computer Vision

Efficient Algorithms

VSLE – Visual Search using Linear Estimation

[Figure: the priority map after further queries]

Page 97: Attention in Computer Vision

Combining Top-Down Information

• Simply set the initial probabilities to match previously known data

• Add known target objects to the space; this alters the probabilities accordingly and speeds up the search

Efficient Algorithms

Page 98: Attention in Computer Vision

Experiment 1: COIL-100Efficient Algorithms

Columbia Object Image Library [96]

Page 99: Attention in Computer Vision

Experiment 1: COIL-100

• Features:
  – 1st, 2nd, 3rd Gaussian derivatives → 9 basis filters
  – 5 scales → 9 × 5 = 45 features

• Euclidean distance

Efficient Algorithms

Rao & Ballard [95]
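A sketch of such a 45-dimensional descriptor: the 9 basis filters are the 2 + 3 + 4 Gaussian derivatives of orders 1, 2 and 3 in two dimensions, applied at 5 scales. Pooling the response at the patch center and the particular sigma values are illustrative assumptions, not necessarily the authors' choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

DERIV_ORDERS = [(oy, ox) for total in (1, 2, 3)
                for oy in range(total + 1) for ox in [total - oy]]   # 2 + 3 + 4 = 9 filters
SCALES = (1.0, 2.0, 4.0, 8.0, 16.0)                                  # 5 scales (assumed values)

def patch_descriptor(patch):
    """patch: 2-D grayscale array for one candidate; returns a 9 x 5 = 45-vector."""
    cy, cx = patch.shape[0] // 2, patch.shape[1] // 2
    feats = [gaussian_filter(patch.astype(np.float64), sigma=s, order=orders)[cy, cx]
             for s in SCALES for orders in DERIV_ORDERS]
    return np.array(feats)

def distance(a, b):
    return np.linalg.norm(patch_descriptor(a) - patch_descriptor(b))  # Euclidean distance
```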

Page 100: Attention in Computer Vision

Experiment 1: COIL-100 – Efficient Algorithms

[Result plots for 10 cars and 10 cups, as a function of the number of queries]

Page 101: Attention in Computer Vision

Experiment 2: hand-segmented – Efficient Algorithms

• Every large segment is a candidate
• 24 candidates
• 4 targets

Berkeley hand-segmented DB
Martin, Fowlkes, Tal & Malik [01]

Page 102: Attention in Computer Vision

Experiment 2: hand segmented

• Features: color histograms … separated into 8 bins each → 64 features

• Euclidean distance

Efficient Algorithms

Page 103: Attention in Computer Vision

Experiment 3: automatic color segmentation

• Automatically color-segmented image, used for face detection

Efficient Algorithms

Page 104: Attention in Computer Vision

Experiment 3: color segmentation

• 146 candidates

• 4 features: segment size, mean value of red, green and blue

• Euclidean distance

Efficient Algorithms

# queries

Page 105: Attention in Computer Vision

Combining top-down information

• Add known targets to the space

Efficient Algorithms

[Result plots, as a function of the number of queries: without additional targets vs. with additional targets]

Page 106: Attention in Computer Vision

Summary: saliency model vs. similarity model

Saliency model:
• Biologically motivated
• Uses bottom-up, allows combining top-down information
• Segmentation
• Static priority map

Similarity model:
• Biologically motivated
• Uses bottom-up, allows combining top-down information
• No segmentation
• Dynamic priority map
• Measures the search difficulty

Page 107: Attention in Computer Vision

Summary

• What is attention

• Aid object recognition tasks by choosing the area of interest

• Two approaches: saliency model and similarity model– Biological motivation– Algorithms

Page 108: Attention in Computer Vision

Thank You!

Page 109: Attention in Computer Vision

Linearly Estimating l(xk)

A linear estimation for l(xk):

Which, of course, minimizes the error

Solving a set of equations gives an estimation:

Page 110: Attention in Computer Vision

Linearly Estimating l(xk)

Estimation:

where l is the vector of known labels, and R is computed as follows (i, j = 1, …, m):

R and r depend only on the distances, so they can be computed once, in advance.
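The formulas themselves did not survive in this transcript. Presumably the estimator has the standard linear least-squares form; the following reconstruction is an assumption based on the description above, not a verbatim copy of the slide:

```latex
\hat{l}(x_k) = \mathbf{r}^{\top} R^{-1} \mathbf{l},
\qquad
R_{ij} = \rho\bigl(d(x_i, x_j)\bigr), \quad
r_i = \rho\bigl(d(x_k, x_i)\bigr), \quad i, j = 1, \dots, m,
```

with ρ(·) a correlation model that decreases with distance, so that R and r indeed depend only on the pairwise distances.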