Andrew Zisserman talk - Part 1a
Visual search and recognition
Part I – large scale instance search
Andrew Zisserman, Visual Geometry Group
University of Oxford, http://www.robots.ox.ac.uk/~vgg
Slides with Josef Sivic
Overview
Part I: Instance level recognition
• e.g. find this car in a dataset of images
and the dataset may contain 1M images …
Overview
Part II: Category level recognition
• e.g. find images that contain cars
• or cows …
and localize the occurrences in each image …
“Groundhog Day” [Ramis, 1993]
Visually defined query
“Find this clock”
Example: visual search in feature films
“Find this place”
Problem specification: particular object retrieval
Particular Object Search
Find these objects ...in these images and 1M more
Search the web with a visual query …
The need for visual search
Flickr: more than 5 billion photographs, with more than 1 million added daily
Company collections
Personal collections: 10000s of digital camera photos and video clips
Vast majority will have minimal, if any, textual annotation.
Why is it difficult?
Problem: find particular occurrences of an object in a very
large dataset of images
Want to find the object despite possibly large changes in scale, viewpoint, lighting and partial occlusion
[Example images illustrating changes in viewpoint, scale, lighting and occlusion]
Outline
1. Object recognition cast as nearest neighbour matching
• Covariant feature detection
• Feature descriptors (SIFT) and matching
2. Object recognition cast as text retrieval
3. Large scale search and improving performance
4. Applications
5. The future and challenges
Approach
Determine regions (detection) and vector descriptors in each frame which are invariant to camera viewpoint changes
Match descriptors between frames using invariant vectors
Visual problem
query?
• Retrieve image/key frames containing the same object
Example of visual fragments
Image content is transformed into local fragments that are invariant to translation, rotation, scale, and other imaging parameters
Lowe ICCV 1999
• Fragments generalize over viewpoint and lighting
Detection Requirements
• detection must commute with viewpoint transformation
• i.e. detection is viewpoint covariant
• NB detection computed in each image independently
Detected image regions must cover the same scene region in different views
[Commutative diagram: detection in each view, followed by a viewpoint transformation, yields the same regions]
Scale-invariant feature detection
Goal: independently detect corresponding regions in scaled versions of the same image
Need scale selection mechanism for finding characteristic region size that is covariant with the image transformation
Laplacian
Scale-invariant features: Blobs
Slides from Svetlana Lazebnik
Recall: Edge detection
Convolve the signal f with the derivative of a Gaussian: f ∗ (d/dx) g
Edge = maximum of the derivative response
Source: S. Seitz
Edge detection, take 2
Convolve the signal f with the second derivative of a Gaussian (the Laplacian): f ∗ (d²/dx²) g
Edge = zero crossing of the second derivative response
Source: S. Seitz
From edges to `top hat’ (blobs)
Blob = top-hat = superposition of two step edges
Spatial selection: the magnitude of the Laplacian response will achieve a maximum at the center of the blob, provided the scale of the Laplacian is “matched” to the scale of the blob
Scale selection
We want to find the characteristic scale of the blob by convolving it with Laplacians at several scales and looking for the maximum response
However, Laplacian response decays as scale increases:
Why does this happen?
[Original signal (blob of radius 8) and Laplacian responses for increasing σ]
Scale normalization
The response of a derivative of Gaussian filter to a perfect step edge decreases as σ increases
Response to a unit step edge: 1 / (√(2π) σ)
To keep response the same (scale-invariant), must multiply Gaussian derivative by σ
Laplacian is the second Gaussian derivative, so it must be multiplied by σ2
Effect of scale normalization
[Plots: original signal; unnormalized Laplacian response (maximum decays as scale σ increases); scale-normalized Laplacian response (maximum at the blob's scale)]
Blob detection in 2D
Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D
∇²g = ∂²g/∂x² + ∂²g/∂y²
Blob detection in 2D
Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D
Scale-normalized: ∇²_norm g = σ² (∂²g/∂x² + ∂²g/∂y²)
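As a concrete illustration of the scale-normalized Laplacian above, a minimal sketch in Python (assuming a grayscale float image `img`; the scale range is illustrative, not from the slides):

```python
# Sketch: scale-normalized Laplacian-of-Gaussian blob responses.
import numpy as np
from scipy.ndimage import gaussian_laplace

def scale_normalized_log(img, sigmas):
    """Return a stack of sigma^2 * LoG responses, one per scale."""
    responses = []
    for sigma in sigmas:
        # gaussian_laplace computes the (unnormalized) Laplacian of Gaussian;
        # multiplying by sigma^2 makes responses comparable across scales.
        responses.append(sigma ** 2 * gaussian_laplace(img, sigma))
    return np.stack(responses)          # shape: (num_scales, H, W)

# Characteristic scale at a pixel = scale of the extremal normalized response,
# e.g. scale_idx = np.argmax(np.abs(responses), axis=0)
```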
Characteristic scale
We define the characteristic scale as the scale that produces the peak of the Laplacian response
characteristic scale
T. Lindeberg (1998). "Feature detection with automatic scale selection." International Journal of Computer Vision 30 (2): pp 77–116.
Scale selection
Scale invariance of the characteristic scale
[Plots of the normalized Laplacian response against scale for two images related by a scale factor s]
• Relation between characteristic scales s₁* and s₂*:  s · s₁* = s₂*
Scale invariance
• Multi-scale extraction of Harris interest points
• Selection of characteristic scale in Laplacian scale space
Laplacian
Characteristic scale:
- maximum in scale space
- scale invariant
Mikolajczyk and Schmid ICCV 2001
What class of transformations are required?
Similarity(translation, scale, rotation)
Affine
Projective(homography)
2D transformation models
Motivation for affine transformations
[Figure: a region detected in View 1 does not cover the same scene region when detected in View 2 – not the same region]
Local invariance requirements
Geometric: 2D affine transformation, x′ = A x + t,
where A is a 2x2 non-singular matrix and t a translation
• Objective: compute image descriptors invariant to this class of transformations
Viewpoint covariant detection
• Characteristic scales (size of region)
  • Lindeberg and Garding ECCV 1994
  • Lowe ICCV 1999
  • Mikolajczyk and Schmid ICCV 2001
• Affine covariance (shape of region)
  • Baumberg CVPR 2000
  • Matas et al BMVC 2002 – maximally stable regions
  • Mikolajczyk and Schmid ECCV 2002
  • Schaffalitzky and Zisserman ECCV 2002
  • Tuytelaars and Van Gool BMVC 2000
  • Mikolajczyk et al., IJCV 2005
Shape adapted regions – “Harris affine”
Example affine covariant region: Maximally Stable regions (MSR)
1. Segment using the watershed algorithm, and track connected components as the threshold value varies.
2. An MSR is detected when the area of the component is stationary.
See Matas et al BMVC 2002
Example: Maximally stable regions
[Figure: a sub-image from the first and second image thresholded at varying levels, with plots of area vs threshold and change of area vs threshold; the MSR is detected where the area is stationary]
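A rough sketch of detecting such regions using OpenCV's MSER implementation (the file name and default parameters are placeholders, not from the slides):

```python
# Sketch: maximally stable (extremal) region detection with OpenCV.
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
mser = cv2.MSER_create()                     # maximally stable region detector
regions, bboxes = mser.detectRegions(gray)   # each region is an array of pixel coordinates
print(len(regions), "maximally stable regions detected")
```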
Example of affine covariant regions
1000+ regions per image
Shape adapted regions
Maximally stable regions
• a region's size and shape are not fixed, but
• automatically adapt to the image intensity to cover the same physical surface
• i.e. pre-image is the same surface region
Viewpoint invariant description
• Elliptical viewpoint covariant regions
  • Shape Adapted regions
  • Maximally Stable Regions
• Map ellipse to circle and orientate by dominant direction
• Represent each region by a SIFT descriptor (128-vector) [Lowe 1999]
  • see Mikolajczyk and Schmid CVPR 2003 for a comparison of descriptors
Local descriptors - rotation invariance
Estimation of the dominant orientation:
• extract gradient orientation
• histogram over gradient orientation (0 to 2π)
• peak in this histogram
Rotate patch in dominant direction
Descriptors – SIFT [Lowe’99]
Distribution of the gradient over an image patch: a 3D histogram over a 4x4 location grid and 8 gradient orientations (128 dimensions)
Very good performance in image matching [Mikolajczyk and Schmid '03]
Summary – detection and description
Extract affine regions → Normalize regions → Eliminate rotation → Compute appearance descriptors
SIFT (Lowe ’04)
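For reference, a minimal sketch of this detection-plus-description step using OpenCV's SIFT (note OpenCV's detector is the difference-of-Gaussian one from Lowe '04, i.e. scale covariant rather than fully affine covariant; the file name is illustrative):

```python
# Sketch: region detection and SIFT description with OpenCV.
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# descriptors is an N x 128 array: one SIFT vector per detected region
print(descriptors.shape)
```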
Approach
Determine regions (detection) and vector descriptors in each frame which are invariant to camera viewpoint changes
Match descriptors between frames using invariant vectors
Visual problem
query?
• Retrieve image/key frames containing the same object
Outline of an object retrieval strategy
1. Compute regions in each image independently
2. “Label” each region by a descriptor vector from its local intensity neighbourhood
3. Find corresponding regions by matching to the closest descriptor vector
4. Score each frame in the database by the number of matches
Finding corresponding regions transformed to finding nearest neighbour vectors
[Diagram: frames → regions → invariant descriptor vectors, computed independently for each frame]
Example
In each frame independently:
• determine elliptical regions (detection covariant with camera viewpoint)
• compute a SIFT descriptor for each region [Lowe ‘99]
Harris-affine
Maximally stable regions
1000+ descriptors per frame
Object recognition
Establish correspondences between object model image and target image by nearest neighbour matching on SIFT vectors
128D descriptor space
Model image Target image
Match regions between frames using SIFT descriptors
Harris-affine
Maximally stable regions
• Multiple fragments overcome the problem of partial occlusion
• Transfer query box to localize object
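A minimal matching sketch, assuming two grayscale images `model_img` and `target_img` are already loaded; the 0.8 ratio-test threshold is a common heuristic (Lowe '04), not a value from the slides:

```python
# Sketch: nearest-neighbour matching of SIFT descriptors with a ratio test.
import cv2

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(model_img, None)     # model (query) image
kp2, des2 = sift.detectAndCompute(target_img, None)    # target image

matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)                # two nearest neighbours per descriptor

# keep matches whose nearest neighbour is clearly better than the second nearest
good = [m for m, n in knn if m.distance < 0.8 * n.distance]
print(len(good), "putative correspondences")
```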
Now, convert this approach to a text retrieval representation
Outline
1. Object recognition cast as nearest neighbour matching
2. Object recognition cast as text retrieval
• bag of words model
• visual words
3. Large scale search and improving performance
4. Applications
5. The future and challenges
Success of text retrieval
• efficient
• high precision
• scalable
Can we use retrieval mechanisms from text for visual retrieval?
• ‘Visual Google’ project, 2003+
For a million+ images:
• scalability
• high precision
• high recall: can we retrieve all occurrences in the corpus?
Text retrieval lightning tour
Stop-list: reject the very common words, e.g. “the”, “a”, “of”
Stemming: represent words by stems, e.g. “walking”, “walks” → “walk”
Inverted file – ideal book index: Term → List of hits (occurrences in documents)
People: [d1: hit hit hit], [d4: hit hit] …
Common: [d1: hit hit], [d3: hit], [d4: hit hit hit] …
Sculpture: [d2: hit], [d3: hit hit hit] …
• word matches are pre-computed
Ranking:
• frequency of words in document (tf-idf)
• proximity weighting (google)
• PageRank (google)
Need to map feature descriptors to “visual words”.
Build a visual vocabulary for a movie
Vector quantize descriptors
• k-means clustering
Implementation:
• compute SIFT features (128D) on frames from 48 shots of the film
• 6K clusters for Shape Adapted regions
• 10K clusters for Maximally Stable regions
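A sketch of the vocabulary-building step with mini-batch k-means (descriptor file, vocabulary size and batch size are illustrative; as listed above, the slides build separate 6K/10K vocabularies per detector type):

```python
# Sketch: building a visual vocabulary by k-means clustering of SIFT descriptors.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# (N, 128) array of SIFT descriptors pooled over the training frames
all_descriptors = np.load("descriptors.npy")

K = 10_000                                   # vocabulary size
kmeans = MiniBatchKMeans(n_clusters=K, batch_size=10_000, random_state=0)
kmeans.fit(all_descriptors)

visual_words = kmeans.cluster_centers_       # (K, 128): one centre per visual word
```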
Samples of visual words (clusters on SIFT descriptors):
Maximally stable regions
Shape adapted regions
• generic examples – cf textons
• more specific examples
Visual words: quantize descriptor space
Sivic and Zisserman, ICCV 2003
Nearest neighbour matching of descriptors between Image 1 and Image 2 in the 128D descriptor space is expensive to do for all frames.
Instead, vector quantize the descriptors: each descriptor is assigned to its nearest cluster centre, and the cluster index becomes its visual word (e.g. word 5 or word 42).
A new image is then matched by assigning each of its descriptors to the same visual words.
The same visual word
Vector quantize the descriptor space (SIFT)
Visual words are ‘iconic’ image patches or fragments
• represent the frequency of word occurrence
• but not their position
Representation: bag of (visual) words
Image → Collection of visual words:
detect patches → normalize patch → compute SIFT descriptor → find nearest cluster centre
Offline: assign visual words and compute histograms for each key frame in the video
Represent each frame by a sparse histogram of visual word occurrences, e.g. (2, 0, 0, 1, 0, 1, …)
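A sketch of the histogram computation for one key frame (assuming the `kmeans` vocabulary from the earlier sketch and an (M, 128) descriptor matrix `frame_descriptors`):

```python
# Sketch: sparse bag-of-visual-words histogram for one frame.
import numpy as np

word_ids = kmeans.predict(frame_descriptors)      # nearest visual word for each descriptor
histogram = np.bincount(word_ids, minlength=K)    # K-vector of word counts
# most entries are zero, so only (word_id, count) pairs need to be stored
```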
Offline: create an index
For fast search, store a “posting list” for the dataset
This maps word occurrences to the documents they occur in
Visual word → Posting list (frames)
1 → 5, 10, ...
2 → 10, ...
(e.g. visual word #1 occurs in frames #5 and #10; visual word #2 occurs in frame #10)
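A sketch of building such posting lists, assuming `frame_histograms` is a dict mapping frame id to its K-vector of word counts (names are illustrative):

```python
# Sketch: inverted file mapping each visual word to the frames containing it.
from collections import defaultdict

inverted_file = defaultdict(list)            # visual word -> list of frame ids
for frame_id, hist in frame_histograms.items():
    for word_id in hist.nonzero()[0]:
        inverted_file[int(word_id)].append(frame_id)
```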
At run time …
• User specifies a query region
• Generate a short list of frames using the visual words in the region:
1. Accumulate all visual words within the query region
2. Use the “book index” (inverted file) to find other frames with these words
3. Compute similarity for frames which share at least one word
Generates a tf-idf ranked list of all the frames in the dataset.
Image ranking using the bag-of-words model
For a vocabulary of size K, each image is represented by a K-vector v_d = (t_1, …, t_K)^T, where t_i is the (weighted) number of occurrences of visual word i.
Images are ranked by the normalized scalar product between the query vector v_q and all vectors in the database v_d:
sim(v_q, v_d) = (v_q · v_d) / (||v_q|| ||v_d||)
The scalar product can be computed efficiently using the inverted file.
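A sketch of the ranking step, scoring only frames that share at least one word with the query via the inverted file (all names come from the earlier sketches; `idf[i]` is assumed to be log(N / number of frames containing word i)):

```python
# Sketch: tf-idf weighted ranking by normalized scalar product.
import numpy as np

def rank_frames(query_hist, frame_histograms, inverted_file, idf):
    vq = query_hist * idf                            # tf-idf weight the query
    vq = vq / np.linalg.norm(vq)
    # candidate frames: those sharing at least one visual word with the query
    candidates = {f for w in query_hist.nonzero()[0] for f in inverted_file[int(w)]}
    scores = {}
    for frame_id in candidates:
        vd = frame_histograms[frame_id] * idf
        scores[frame_id] = float(vq @ (vd / np.linalg.norm(vd)))  # normalized scalar product
    return sorted(scores.items(), key=lambda kv: -kv[1])
```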
Summary: Match histograms of visual words
1. Compute affine covariant regions in each frame independently (offline)
2. “Label” each region by a vector of descriptors based on its intensity (offline)
3. Build histograms of visual words by descriptor quantization (offline)
4. Rank retrieved frames by matching visual word histograms using inverted files
[Pipeline: frames → regions → invariant descriptor vectors → quantize → single vector (histogram)]
Films = common dataset
“Pretty Woman”
“Groundhog Day”
“Casablanca”
“Charade”
Video Google Demo
Example: select a region
Search in the film “Groundhog Day”
Retrieved shots
Visual words - advantages
Design of descriptors makes these words invariant to:
• illumination
• affine transformations (viewpoint)
Multiple local regions give immunity to partial occlusion
Overlap encodes some structural information
NB: no attempt to carry out a ‘semantic’ segmentation
Sony logo from Google image search on `Sony’
Retrieve shots from Groundhog Day
Example application – product placement
Retrieved shots in Groundhog Day for search on Sony logo
Outline
1. Object recognition cast as nearest neighbour matching
2. Object recognition cast as text retrieval
3. Large scale search and improving performance
• large vocabularies and approximate k-means
• query expansion
• soft assignment
4. Applications
5. The future and challenges
Particular object search
Find these landmarks ...in these images
Investigate …
Vocabulary size: number of visual words in range 10K to 1M
Use of spatial information to re-rank
Oxford buildings dataset
Automatically crawled from Flickr
Dataset (i) consists of 5062 images, crawled by searching for Oxford landmarks, e.g.
“Oxford Christ Church”, “Oxford Radcliffe camera”, “Oxford”
“Medium” resolution images (1024 x 768)
Oxford buildings dataset
Automatically crawled from Flickr
Consists of:
Dataset (i) crawled by searching for Oxford landmarks
Datasets (ii) and (iii) crawled from other popular Flickr tags; these act as additional distractors
Oxford buildings datasetLandmarks plus queries used for evaluation
All Souls
Ashmolean
Balliol
Bodleian
Thom Tower
Cornmarket
Bridge of Sighs
Keble
Magdalen
University Museum
Radcliffe Camera
Ground truth obtained for 11 landmarks over 5062 images
Evaluate performance by mean Average Precision
Precision - Recall
[Precision-recall curve: precision plotted against recall; diagram of all images, returned images, and relevant images]
• Precision: % of returned images that are relevant
• Recall: % of relevant images that are returned
Average Precision
[Precision-recall curve; AP is the area under the curve]
• A good AP score requires both high recall and high precision
• Application-independent
Performance measured by mean Average Precision (mAP) over 55 queries on 100K or 1.1M image datasets
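A sketch of the evaluation measure (average precision of one ranked list; the benchmark's exact protocol, e.g. its handling of "junk" images, may differ):

```python
# Sketch: average precision of a ranked retrieval list.
import numpy as np

def average_precision(ranked_ids, relevant_ids):
    hits, precisions = 0, []
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)     # precision at each correct retrieval
    return float(np.mean(precisions)) if precisions else 0.0

# mAP = mean of average_precision over all 55 queries
```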
Quantization / Clustering
K-means usually seen as a quick + cheap method
But far too slow for our needs – D~128, N~20M+, K~1M
Use approximate k-means: nearest neighbour search by multiple, randomized k-d trees
K-means overview:
• Initialize cluster centres
• Find the nearest cluster centre for each datapoint (slow: O(N K))
• Re-compute each cluster centre as the centroid of its assigned points
• Iterate
Idea: nearest neighbour search is the bottleneck – use approximate nearest neighbour search
K-means provably locally minimizes the sum of squared errors (SSE) between a cluster centre and its points
Approximate K-means
Use multiple, randomized k-d trees for search
A k-d tree hierarchically decomposes the descriptor space
Points nearby in the space can be found (hopefully) by backtracking around the tree some small number of steps
Single tree works OK in low dimensions – not so well in high dimensions
Approximate K-means
Multiple randomized trees increase the chances of finding nearby points
[Figure: query point and true nearest neighbour in three randomized k-d trees; the true nearest neighbour is missed in the first two trees but found in the third]
Approximate K-means
Use the best-bin first strategy to determine which branch of the tree to examine next
Share this priority queue between the multiple trees – searching multiple trees is only slightly more expensive than searching one
Original K-means complexity = O(N K)
Approximate K-means complexity = O(N log K)
This means we can scale to very large K
Experimental evaluation for SIFT matching: http://www.cs.ubc.ca/~lowe/papers/09muja.pdf
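A toy sketch of one approximate k-means iteration in which the assignment step goes through a k-d tree built over the current centres. For simplicity this uses a single exact scipy tree; the approach described above uses multiple randomized k-d trees with a shared best-bin-first priority queue:

```python
# Sketch: one approximate-k-means-style iteration with a k-d tree assignment step.
import numpy as np
from scipy.spatial import cKDTree

def akm_iteration(points, centres):
    tree = cKDTree(centres)            # index the current K centres
    _, assign = tree.query(points)     # nearest centre for each of the N points
    new_centres = centres.copy()
    for k in range(len(centres)):
        members = points[assign == k]
        if len(members) > 0:           # keep the old centre if the cluster is empty
            new_centres[k] = members.mean(axis=0)
    return new_centres, assign
```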
Oxford buildings datasetLandmarks plus queries used for evaluation
All Souls
Ashmolean
Balliol
Bodleian
Thom Tower
Cornmarket
Bridge of Sighs
Keble
Magdalen
University Museum
Radcliffe Camera
Ground truth obtained for 11 landmarks over 5062 images
Evaluate performance by mean Average Precision over 55 queries
Approximate K-means
How accurate is the approximate search?
Performance on 5K image dataset for a random forest of 8 trees
Allows much larger clusterings than would be feasible with standard K-means: N~17M points, K~1M
AKM – 8.3 cpu hours per iteration
Standard K-means – estimated 2650 cpu hours per iteration
Performance against vocabulary size
Using large vocabularies gives a big boost in performance (peak @ 1M words)
More discriminative vocabularies give:
Better retrieval quality
Increased search speed – documents share fewer words, so fewer documents need to be scored
Beyond Bag of Words
Use the position and shape of the underlying features to improve retrieval quality
Both images have many matches – which is correct?
Beyond Bag of Words
We can measure spatial consistency between the query and each result to improve retrieval quality
Many spatially consistent matches – correct result
Few spatially consistent matches – incorrect result
Compute a 2D affine transformation between the query region and the target image: x′ = A x + t, where A is a 2x2 non-singular matrix and t a translation.
Estimating spatial correspondences:
1. Test each correspondence
2. Compute a (restricted) affine transformation (5 dof)
3. Score by the number of consistent matches
Use RANSAC on the full affine transformation (6 dof)
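A sketch of this verification step using OpenCV's RANSAC affine estimator, re-using the ratio-test matches (`good`, `kp1`, `kp2`) from the earlier matching sketch; the reprojection threshold is illustrative:

```python
# Sketch: spatial verification by RANSAC fitting of a 6-dof affine transformation.
import numpy as np
import cv2

src = np.float32([kp1[m.queryIdx].pt for m in good])   # points in the query image
dst = np.float32([kp2[m.trainIdx].pt for m in good])   # corresponding points in the target

A, inlier_mask = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                      ransacReprojThreshold=5.0)
num_inliers = int(inlier_mask.sum()) if inlier_mask is not None else 0
# re-rank the short list by num_inliers; A (a 2x3 affine matrix) also localizes
# the object by mapping the query box into the target image
```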
Beyond Bag of Words
Extra bonus – gives localization of the object
Example results: query and retrieved images
Rank the short list of retrieved images by the number of correspondences
Mean Average Precision variation with vocabulary size

vocab size   bag of words   spatial
50K          0.473          0.599
100K         0.535          0.597
250K         0.598          0.633
500K         0.606          0.642
750K         0.609          0.630
1M           0.618          0.645
1.25M        0.602          0.625
Bag of visual words particular object retrieval – pipeline:
query image → set of SIFT descriptors (Hessian-Affine regions + SIFT descriptors)
→ sparse frequency vector (quantization against the centroids, i.e. visual words, with tf-idf weighting)
→ querying the inverted file → ranked image short-list
→ geometric verification
[Lowe 04, Chum et al 2007]