Andrew Zisserman Talk - Part 1a


Page 1: Andrew Zisserman Talk - Part 1a

Visual search and recognition
Part I – large scale instance search

Andrew Zisserman, Visual Geometry Group

University of Oxford, http://www.robots.ox.ac.uk/~vgg

Slides with Josef Sivic

Page 2: Andrew Zisserman Talk - Part 1a

Overview

Part I: Instance level recognition
• e.g. find this car in a dataset of images

and the dataset may contain 1M images …

Page 3: Andrew Zisserman Talk - Part 1a

Overview

Part II: Category level recognition
• e.g. find images that contain cars

• or cows …

and localize the occurrences in each image …

Page 4: Andrew Zisserman Talk - Part 1a

“Groundhog Day” [Ramis, 1993]
Visually defined query

“Find this clock”

Example: visual search in feature films

“Find this place”

Problem specification: particular object retrieval

Page 5: Andrew Zisserman Talk - Part 1a

Particular Object Search

Find these objects ... in these images and 1M more

Search the web with a visual query …

Page 6: Andrew Zisserman Talk - Part 1a

The need for visual search

Flickr has more than 5 billion photographs, with more than 1 million added daily

Company collections

Personal collections: 10000s of digital camera photos and video clips

The vast majority will have minimal, if any, textual annotation.

Page 7: Andrew Zisserman Talk - Part 1a

Why is it difficult?
Problem: find particular occurrences of an object in a very large dataset of images

Want to find the object despite possibly large changes in scale, viewpoint, lighting and partial occlusion

(figure: examples of viewpoint, scale, lighting and occlusion changes)

Page 8: Andrew Zisserman Talk - Part 1a

Outline

1. Object recognition cast as nearest neighbour matching

• Covariant feature detection

• Feature descriptors (SIFT) and matching

2. Object recognition cast as text retrieval

3. Large scale search and improving performance

4. Applications

5. The future and challenges

Page 9: Andrew Zisserman Talk - Part 1a

Approach

Determine regions (detection) and vector descriptors in each frame which are invariant to camera viewpoint changes

Match descriptors between frames using invariant vectors

Visual problem

query?

• Retrieve image/key frames containing the same object

Page 10: Andrew Zisserman Talk - Part 1a

Example of visual fragments

Image content is transformed into local fragments that are invariant to translation, rotation, scale, and other imaging parameters

• Fragments generalize over viewpoint and lighting
Lowe ICCV 1999

Page 11: Andrew Zisserman Talk - Part 1a

Detection Requirements

• detection must commute with viewpoint transformation

• i.e. detection is viewpoint covariant

• NB detection computed in each image independently

Detected image regions must cover the same scene region in different views

(figure: detection commutes with the viewpoint transformation; detecting then transforming gives the same regions as transforming then detecting)

Page 12: Andrew Zisserman Talk - Part 1a

Scale-invariant feature detection

Goal: independently detect corresponding regions in scaled versions of the same image

Need scale selection mechanism for finding characteristic region size that is covariant with the image transformation

Laplacian

Page 13: Andrew Zisserman Talk - Part 1a

Scale-invariant features: Blobs

Slides from Svetlana Lazebnik

Page 14: Andrew Zisserman Talk - Part 1a

Recall: Edge detection

f (signal containing an edge)
d/dx g (derivative of Gaussian)
f ∗ (d/dx g)
Edge = maximum of the derivative response
Source: S. Seitz

Page 15: Andrew Zisserman Talk - Part 1a

Edge detection, take 2

f (signal containing an edge)
d²/dx² g (second derivative of Gaussian, the Laplacian)
f ∗ (d²/dx² g)
Edge = zero crossing of the second derivative response

Source: S. Seitz

Page 16: Andrew Zisserman Talk - Part 1a

From edges to `top hat’ (blobs)

Blob = top-hat = superposition of two step edges

Spatial selection: the magnitude of the Laplacian response will achieve a maximum at the center of the blob, provided the scale of the Laplacian is “matched” to the scale of the blob

maximum

Page 17: Andrew Zisserman Talk - Part 1a

Scale selection
We want to find the characteristic scale of the blob by convolving it with Laplacians at several scales and looking for the maximum response

However, Laplacian response decays as scale increases:

Why does this happen?

(figure: original signal (radius = 8) and Laplacian responses for increasing σ)

Page 18: Andrew Zisserman Talk - Part 1a

Scale normalization

The response of a derivative of Gaussian filter to a perfect step edge decreases as σ increases

Peak response of the derivative of Gaussian to a step edge: 1/(√(2π) σ)

Page 19: Andrew Zisserman Talk - Part 1a

Scale normalization

The response of a derivative of Gaussian filter to a perfect step edge decreases as σ increases

To keep response the same (scale-invariant), must multiply Gaussian derivative by σ

Laplacian is the second Gaussian derivative, so it must be multiplied by σ2

Page 20: Andrew Zisserman Talk - Part 1a

Effect of scale normalization

(figure: original signal, unnormalized Laplacian response, and scale-normalized Laplacian response; only the normalized response attains a clear maximum as σ increases)

Page 21: Andrew Zisserman Talk - Part 1a

Blob detection in 2D

Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D

∇²g = ∂²g/∂x² + ∂²g/∂y²

Page 22: Andrew Zisserman Talk - Part 1a

Blob detection in 2D

Scale-normalized: ∇²_norm g = σ² (∂²g/∂x² + ∂²g/∂y²)

Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D

Page 23: Andrew Zisserman Talk - Part 1a

Characteristic scale

We define the characteristic scale as the scale that produces the peak of the Laplacian response

characteristic scale
T. Lindeberg (1998). "Feature detection with automatic scale selection." International Journal of Computer Vision 30(2): 77–116.
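To make the scale-selection idea concrete, here is a minimal Python sketch (not from the slides) that evaluates the scale-normalized Laplacian at one pixel over a range of σ values and returns the σ giving the peak response; it assumes SciPy is available and that the image is a grayscale numpy array.

    import numpy as np
    from scipy import ndimage

    def characteristic_scale(image, y, x, sigmas):
        # scale-normalized Laplacian of Gaussian response at pixel (y, x) for each sigma
        responses = [abs((s ** 2) * ndimage.gaussian_laplace(image.astype(float), sigma=s)[y, x])
                     for s in sigmas]
        # the characteristic scale is the sigma giving the peak normalized response
        return sigmas[int(np.argmax(responses))]

    # hypothetical usage: s_star = characteristic_scale(img, 100, 120, np.linspace(1, 20, 40))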

Page 24: Andrew Zisserman Talk - Part 1a

Scale selection

Scale invariance of the characteristic scale

• Relation between characteristic scales: if the image is rescaled by a factor s, then s · s₁* = s₂*

(figure: normalized Laplacian responses of the original and rescaled signal, with peaks at the characteristic scales s₁* and s₂*)

Page 25: Andrew Zisserman Talk - Part 1a

Scale invariance

• Multi-scale extraction of Harris interest points

• Selection of characteristic scale in Laplacian scale space

Laplacian

Characteristic scale:
- maximum in scale space
- scale invariant

Mikolajczyk and Schmid ICCV 2001

Page 26: Andrew Zisserman Talk - Part 1a

What class of transformations are required?

2D transformation models:
• Similarity (translation, scale, rotation)
• Affine
• Projective (homography)

Page 27: Andrew Zisserman Talk - Part 1a

Motivation for affine transformations
(figure: View 1 and View 2; the scale-selected circular regions do not cover the same scene region under viewpoint change)

Page 28: Andrew Zisserman Talk - Part 1a

Local invariance requirements

Geometric: 2D affine transformation, x' = A x + t, where A is a 2x2 non-singular matrix

• Objective: compute image descriptors invariant to this class of transformations

Page 29: Andrew Zisserman Talk - Part 1a

Viewpoint covariant detection

• Characteristic scales (size of region)
• Lindeberg and Garding ECCV 1994

• Lowe ICCV 1999

• Mikolajczyk and Schmid ICCV 2001

• Affine covariance (shape of region)
• Baumberg CVPR 2000

• Matas et al BMVC 2002 Maximally stable regions

• Mikolajczyk and Schmid ECCV 2002

• Schaffalitzky and Zisserman ECCV 2002

• Tuytelaars and Van Gool BMVC 2000

• Mikolajczyk et al., IJCV 2005

Shape adapted regions “Harris affine”

Page 30: Andrew Zisserman Talk - Part 1a

Example affine covariant region

1. Segment using watershed algorithm, and track connected components as threshold value varies.

2. An MSR is detected when the area of the component is stationary

See Matas et al BMVC 2002

Maximally Stable regions (MSR)

(figure: MSRs detected independently in the first and second image)
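As an illustration only, OpenCV ships an MSER implementation following the same threshold-sweeping idea; a small sketch (the filename is hypothetical):

    import cv2

    gray = cv2.imread('frame.jpg', cv2.IMREAD_GRAYSCALE)    # hypothetical input frame
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)              # one point list per stable region
    print(len(regions), 'maximally stable regions detected')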

Page 31: Andrew Zisserman Talk - Part 1a

Maximally stable regions

(figure: sub-image under varying threshold; area vs threshold and change of area vs threshold, for the first and second image)

Page 32: Andrew Zisserman Talk - Part 1a

Example: Maximally stable regions

Page 33: Andrew Zisserman Talk - Part 1a

Example of affine covariant regions

1000+ regions per image
Shape adapted regions

Maximally stable regions

• a region’s size and shape are not fixed, but

• automatically adapt to the image intensity to cover the same physical surface

• i.e. the pre-image is the same surface region

Page 34: Andrew Zisserman Talk - Part 1a

Viewpoint invariant description

• Elliptical viewpoint covariant regions
• Shape Adapted regions

• Maximally Stable Regions

• Map ellipse to circle and orientate by dominant direction

• Represent each region by a SIFT descriptor (128-vector) [Lowe 1999]
• see Mikolajczyk and Schmid CVPR 2003 for a comparison of descriptors

Page 35: Andrew Zisserman Talk - Part 1a

Local descriptors – rotation invariance
Estimation of the dominant orientation:
• extract gradient orientation
• histogram over gradient orientation
• peak in this histogram

Rotate patch in dominant direction

(figure: orientation histogram over 0 to 2π)
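A rough sketch of the dominant-orientation step (not the exact SIFT implementation, which also applies Gaussian weighting and bin interpolation); the patch is assumed to be a grayscale numpy array:

    import numpy as np

    def dominant_orientation(patch, n_bins=36):
        gy, gx = np.gradient(patch.astype(float))
        magnitude = np.hypot(gx, gy)
        orientation = np.arctan2(gy, gx) % (2 * np.pi)           # angles in [0, 2*pi)
        hist, edges = np.histogram(orientation, bins=n_bins,
                                   range=(0, 2 * np.pi), weights=magnitude)
        peak = int(np.argmax(hist))
        return 0.5 * (edges[peak] + edges[peak + 1])             # centre of the peak bin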

Page 36: Andrew Zisserman Talk - Part 1a

Descriptors – SIFT [Lowe’99]

SIFT: the distribution of the gradient over an image patch, collected into a 3D histogram over image location (x, y) and gradient orientation

very good performance in image matching [Mikolajczyk and Schmid ’03]

4x4 location grid and 8 orientations (128 dimensions)
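For experimentation, SIFT descriptors can be computed with OpenCV (version 4.4+ includes SIFT in the main module); note this uses OpenCV's DoG keypoint detector rather than the affine-covariant detectors discussed above, so it is only a stand-in, and the filename is hypothetical:

    import cv2

    img = cv2.imread('frame.jpg', cv2.IMREAD_GRAYSCALE)      # hypothetical input frame
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    # descriptors is an N x 128 array: one 128-D SIFT vector per detected region
    print(descriptors.shape)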

Page 37: Andrew Zisserman Talk - Part 1a

Summary – detection and description

Extract affine regions → Normalize regions → Eliminate rotation → Compute appearance descriptors

SIFT (Lowe ’04)

Page 38: Andrew Zisserman Talk - Part 1a

Approach

Determine regions (detection) and vector descriptors in each frame which are invariant to camera viewpoint changes

Match descriptors between frames using invariant vectors

Visual problem

query?

• Retrieve image/key frames containing the same object

Page 39: Andrew Zisserman Talk - Part 1a

1. Compute regions in each image independently
2. “Label” each region by a descriptor vector from its local intensity neighbourhood
3. Find corresponding regions by matching to the closest descriptor vector
4. Score each frame in the database by the number of matches

Outline of an object retrieval strategy

Finding corresponding regions is transformed into finding nearest neighbour vectors

(figure: frames → regions → invariant descriptor vectors)

Page 40: Andrew Zisserman Talk - Part 1a

Example
In each frame independently:
• determine elliptical regions (detection covariant with camera viewpoint)
• compute a SIFT descriptor for each region [Lowe ‘99]

Harris-affine

Maximally stable regions

1000+ descriptors per frame

Page 41: Andrew Zisserman Talk - Part 1a

Object recognition

Establish correspondences between object model image and target image by nearest neighbour matching on SIFT vectors

128D descriptor space

Model image Target image
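A minimal sketch of nearest-neighbour matching between the model and target descriptors; it uses OpenCV's brute-force matcher with Lowe's ratio test, which is one common way to reject ambiguous matches (the descriptor arrays are assumed to be float32, N x 128):

    import cv2

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(model_desc, target_desc, k=2)     # two nearest neighbours per query
    good = []
    for pair in knn:
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])                             # keep unambiguous matches
    score = len(good)                                        # score the frame by match count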

Page 42: Andrew Zisserman Talk - Part 1a

Match regions between frames using SIFT descriptors

Harris-affine

Maximally stable regions

• Multiple fragments overcome the problem of partial occlusion

• Transfer query box to localize object

Now, convert this approach to a text retrieval representation

Page 43: Andrew Zisserman Talk - Part 1a

Outline

1. Object recognition cast as nearest neighbour matching

2. Object recognition cast as text retrieval

• bag of words model

• visual words

3. Large scale search and improving performance

4. Applications

5. The future and challenges

Page 44: Andrew Zisserman Talk - Part 1a

Success of text retrieval

• efficient
• high precision
• scalable

Can we use retrieval mechanisms from text for visual retrieval?
• ‘Visual Google’ project, 2003+

For a million+ images:
• scalability
• high precision
• high recall: can we retrieve all occurrences in the corpus?

Page 45: Andrew Zisserman Talk - Part 1a

Text retrieval lightning tour

• Stop-list: reject the very common words, e.g. “the”, “a”, “of”
• Stemming: represent words by their stems, e.g. “walking”, “walks” → “walk”
• Inverted file: ideal book index; for each term, a list of hits (occurrences in documents)

People [d1:hit hit hit], [d4:hit hit] …

Common [d1:hit hit], [d3: hit], [d4: hit hit hit] …

Sculpture [d2:hit], [d3: hit hit hit] …

• word matches are pre-computed
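The same idea in a few lines of Python, on toy documents (all names are illustrative):

    from collections import defaultdict

    docs = {'d1': ['people', 'people', 'common'],
            'd2': ['sculpture'],
            'd3': ['common', 'sculpture', 'sculpture']}

    index = defaultdict(list)                    # term -> list of (document, hit count)
    for doc_id, words in docs.items():
        counts = {}
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        for w, c in counts.items():
            index[w].append((doc_id, c))

    print(index['common'])                       # [('d1', 1), ('d3', 1)]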

Page 46: Andrew Zisserman Talk - Part 1a

Ranking
• frequency of words in document (tf-idf)

• proximity weighting (google)

• PageRank (google)

Need to map feature descriptors to “visual words”.

Page 47: Andrew Zisserman Talk - Part 1a

Build a visual vocabulary for a movie

Vector quantize descriptors
• k-means clustering

Implementation
• compute SIFT features (128D) on frames from 48 shots of the film
• 6K clusters for Shape Adapted regions
• 10K clusters for Maximally Stable regions
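A sketch of vocabulary building with scikit-learn's MiniBatchKMeans, as a stand-in for the clustering actually used; all_descriptors is assumed to be the pooled N x 128 SIFT matrix from the training frames:

    from sklearn.cluster import MiniBatchKMeans

    kmeans = MiniBatchKMeans(n_clusters=10000, batch_size=10000, random_state=0)
    kmeans.fit(all_descriptors)                       # all_descriptors: N x 128 float array

    vocabulary = kmeans.cluster_centers_              # 10000 visual words (cluster centres)
    word_ids = kmeans.predict(all_descriptors)        # visual word id for every descriptor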

Page 48: Andrew Zisserman Talk - Part 1a

Samples of visual words (clusters on SIFT descriptors):

Maximally stable regions
Shape adapted regions

generic examples – cf textons

Page 49: Andrew Zisserman Talk - Part 1a

More specific example

Samples of visual words (clusters on SIFT descriptors):

Page 50: Andrew Zisserman Talk - Part 1a

Sivic and Zisserman, ICCV 2003
Visual words: quantize descriptor space

Nearest neighbour matching

128D descriptor space

Image 1 Image 2

• expensive to do for all frames

Page 51: Andrew Zisserman Talk - Part 1a

Sivic and Zisserman, ICCV 2003

Nearest neighbour matching

128D descriptor space

Image 1 Image 2

Vector quantize descriptors

128D descriptor space

Image 1 Image 2

(figure: descriptors in both images assigned to cluster centres, e.g. visual words 5 and 42)

• expensive to do for all frames

Visual words: quantize descriptor space

Page 52: Andrew Zisserman Talk - Part 1a

Sivic and Zisserman, ICCV 2003

Nearest neighbour matching

128D descriptor space

Image 1 Image 2

Vector quantize descriptors

128D descriptor space

Image 1 Image 2

New image

(figure: descriptors assigned to cluster centres, e.g. visual words 5 and 42)

• expensive to do for all frames

Visual words: quantize descriptor space

Page 53: Andrew Zisserman Talk - Part 1a

Sivic and Zisserman, ICCV 2003

Nearest neighbour matching

128D descriptor space

Image 1 Image 2

Vector quantize descriptors

128D descriptor space

Image 1 Image 2

New image

(figure: descriptors assigned to cluster centres, e.g. visual words 5 and 42; a descriptor from the new image is assigned to visual word 42)

• expensive to do for all frames

Visual words: quantize descriptor space

Page 54: Andrew Zisserman Talk - Part 1a

The same visual word

Vector quantize the descriptor space (SIFT)

Page 55: Andrew Zisserman Talk - Part 1a

Visual words are ‘iconic’ image patches or fragments
• represent the frequency of word occurrence
• but not their position

Representation: bag of (visual) words

Image → Collection of visual words

Page 56: Andrew Zisserman Talk - Part 1a

Offline: assign visual words and compute histograms for each key frame in the video

Detect patches → Normalize patch → Compute SIFT descriptor → Find nearest cluster centre

Represent each frame by a sparse histogram of visual word occurrences, e.g. (2, 0, 0, 1, 0, 1, …)
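A minimal sketch of the per-frame histogram step, reusing the fitted vocabulary from the k-means sketch above (the kmeans object and the vocabulary size are assumptions carried over):

    import numpy as np

    def bow_histogram(frame_descriptors, kmeans, vocab_size):
        words = kmeans.predict(frame_descriptors)            # visual word id per descriptor
        return np.bincount(words, minlength=vocab_size)      # sparse count histogram

    # hypothetical usage: hist = bow_histogram(frame_desc, kmeans, 10000)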

Page 57: Andrew Zisserman Talk - Part 1a

Offline: create an index

For fast search, store a “posting list” for the dataset

This maps word occurrences to the documents they occur in

Posting list (example):
visual word 1 → frames 5, 10, ...
visual word 2 → frames 10, ...
(figure: visual words #1 and #2 detected in frame #5 and frame #10)
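A posting list is just a map from visual word id to the frames containing it; a minimal sketch:

    from collections import defaultdict

    posting_list = defaultdict(list)             # visual word id -> frame ids containing it

    def index_frame(frame_id, word_ids):
        for w in set(word_ids):                  # record each word once per frame
            posting_list[w].append(frame_id)

    # hypothetical usage: index_frame(5, words_in_frame_5); index_frame(10, words_in_frame_10)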

Page 58: Andrew Zisserman Talk - Part 1a

At run time …

Generates a tf-idf ranked list of all the frames in dataset


• User specifies a query region
• Generate a short list of frames using visual words in the region

1. Accumulate all visual words within the query region
2. Use the “book index” to find other frames with these words
3. Compute similarity for frames which share at least one word

Page 59: Andrew Zisserman Talk - Part 1a

For a vocabulary of size K, each image is represented by a K-vector vd = (t1, ..., ti, ..., tK)^T, where ti is the (weighted) number of occurrences of visual word i.

Images are ranked by the normalized scalar product between the query vector vq and all vectors in the database vd:

sim(vq, vd) = (vq · vd) / (||vq|| ||vd||)

Image ranking using the bag-of-words model

Scalar product can be computed efficiently using inverted file.
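Putting the pieces together, a rough sketch of query-time ranking; posting_list, per-frame histograms and an idf vector (idf_i = log(N / n_i)) are assumed from the earlier sketches, and only frames sharing at least one word with the query are scored:

    import numpy as np

    def rank_frames(query_word_ids, posting_list, frame_histograms, idf):
        q = np.bincount(query_word_ids, minlength=len(idf)) * idf
        q /= (np.linalg.norm(q) + 1e-12)                      # normalized query tf-idf vector

        candidates = set()
        for w in set(query_word_ids):
            candidates.update(posting_list.get(w, []))        # frames sharing a word

        scores = {}
        for fid in candidates:
            d = frame_histograms[fid] * idf
            scores[fid] = float(q @ d) / (np.linalg.norm(d) + 1e-12)
        return sorted(scores.items(), key=lambda kv: -kv[1])  # tf-idf ranked list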

Page 60: Andrew Zisserman Talk - Part 1a

Summary: Match histograms of visual words

(figure: frames → regions → invariant descriptor vectors → quantize → single vector (histogram))

1. Compute affine covariant regions in each frame independently (offline)
2. “Label” each region by a vector of descriptors based on its intensity (offline)
3. Build histograms of visual words by descriptor quantization (offline)
4. Rank retrieved frames by matching visual word histograms using inverted files

Page 61: Andrew Zisserman Talk - Part 1a

Films = common dataset

“Pretty Woman”

“Groundhog Day”

“Casablanca”

“Charade”

Page 62: Andrew Zisserman Talk - Part 1a

Video Google Demo

Page 63: Andrew Zisserman Talk - Part 1a

retrieved shots

Example: Select a region

Search in film “Groundhog Day”

Page 64: Andrew Zisserman Talk - Part 1a

Visual words - advantages

Design of the descriptors makes these words invariant to:
• illumination
• affine transformations (viewpoint)

Multiple local regions give immunity to partial occlusion

Overlap encodes some structural information

NB: no attempt to carry out a ‘semantic’ segmentation

Page 65: Andrew Zisserman Talk - Part 1a

Sony logo from Google image search on `Sony’

Retrieve shots from Groundhog Day

Example application – product placement

Page 66: Andrew Zisserman Talk - Part 1a

Retrieved shots in Groundhog Day for search on Sony logo

Page 67: Andrew Zisserman Talk - Part 1a

Outline

1. Object recognition cast as nearest neighbour matching

2. Object recognition cast as text retrieval

3. Large scale search and improving performance

• large vocabularies and approximate k-means

• query expansion

• soft assignment

4. Applications

5. The future and challenges

Page 68: Andrew Zisserman Talk - Part 1a

Particular object search

Find these landmarks ... in these images

Page 69: Andrew Zisserman Talk - Part 1a

Investigate …

Vocabulary size: number of visual words in range 10K to 1M

Use of spatial information to re-rank

Page 70: Andrew Zisserman Talk - Part 1a

Oxford buildings dataset

Automatically crawled from Flickr

Dataset (i) consists of 5062 images, crawled by searching for Oxford landmarks, e.g.

“Oxford Christ Church”, “Oxford Radcliffe Camera”, “Oxford”

“Medium” resolution images (1024 x 768)

Page 71: Andrew Zisserman Talk - Part 1a

Oxford buildings dataset

Automatically crawled from Flickr

Consists of:

Dataset (i) crawled by searching for Oxford landmarks

Datasets (ii) and (iii), from other popular Flickr tags, act as additional distractors

Page 72: Andrew Zisserman Talk - Part 1a

Oxford buildings dataset
Landmarks plus queries used for evaluation

All Souls

Ashmolean

Balliol

Bodleian

Thom Tower

Cornmarket

Bridge of Sighs

Keble

Magdalen

University Museum

Radcliffe Camera

Ground truth obtained for 11 landmarks over 5062 images

Evaluate performance by mean Average Precision

Page 73: Andrew Zisserman Talk - Part 1a

Precision - Recall

(figure: precision–recall curve; Venn diagram of all images, returned images and relevant images)

• Precision: % of returned images that are relevant

• Recall: % of relevant images that are returned

Page 74: Andrew Zisserman Talk - Part 1a

Average Precision

(figure: precision–recall curve; AP is the area under the curve)

• A good AP score requires both high recall and high precision
• Application-independent

Performance measured by mean Average Precision (mAP) over 55 queries on 100K or 1.1M image datasets
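A minimal sketch of the Average Precision computation for one query, given the ranked results marked 1 (relevant) or 0 and the total number of relevant images in the ground truth:

    import numpy as np

    def average_precision(ranked_relevance, n_relevant=None):
        rel = np.asarray(ranked_relevance, dtype=float)
        if n_relevant is None:
            n_relevant = rel.sum()                 # assumes all relevant images appear in the ranking
        if n_relevant == 0:
            return 0.0
        precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        return float((precision_at_k * rel).sum() / n_relevant)

    # mAP = np.mean([average_precision(r, n) for r, n in per_query_results])   # hypothetical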

Page 75: Andrew Zisserman Talk - Part 1a

Quantization / Clustering

K-means usually seen as a quick + cheap method

But far too slow for our needs – D~128, N~20M+, K~1M

Use approximate k-means: nearest neighbour search by multiple, randomized k-d trees

Page 76: Andrew Zisserman Talk - Part 1a

K-means overview:
1. Initialize cluster centres
2. Find the nearest cluster centre for each datapoint (slow: O(N K))
3. Re-compute cluster centres as centroids
4. Iterate

Idea: nearest neighbour search is the bottleneck – use approximate nearest neighbour search

K-means provably locally minimizes the sum of squared errors (SSE) between a cluster centre and its points

Page 77: Andrew Zisserman Talk - Part 1a

Approximate K-means

Use multiple, randomized k-d trees for search

A k-d tree hierarchically decomposes the descriptor space

Points nearby in the space can be found (hopefully) by backtracking around the tree some small number of steps

Single tree works OK in low dimensions – not so well in high dimensions

Page 78: Andrew Zisserman Talk - Part 1a

Approximate K-means

Multiple randomized trees increase the chances of finding nearby points

(figure: the same query point searched in three randomized k-d trees; the true nearest neighbour is found only in the third tree)

Page 79: Andrew Zisserman Talk - Part 1a

Approximate K-means

Use the best-bin first strategy to determine which branch of the tree to examine next

Share a single priority queue between the multiple trees – searching multiple trees is then only slightly more expensive than searching one

Original K-means complexity = O(N K)

Approximate K-means complexity = O(N log K)

This means we can scale to very large K
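A simplified sketch of one approximate k-means iteration: the assignment step uses a k-d tree instead of exhaustive comparison, giving roughly O(N log K) instead of O(N K). SciPy's cKDTree is exact and a single tree, so this is only a stand-in for the multiple randomized trees with best-bin-first search described above.

    import numpy as np
    from scipy.spatial import cKDTree

    def akm_iteration(points, centres):
        tree = cKDTree(centres)                        # build a k-d tree over the K centres
        _, assign = tree.query(points, k=1)            # nearest centre for each of the N points
        new_centres = centres.copy()
        for k in range(len(centres)):
            members = points[assign == k]
            if len(members):
                new_centres[k] = members.mean(axis=0)  # recompute centroid
        return new_centres, assign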

Page 80: Andrew Zisserman Talk - Part 1a

Experimental evaluation for SIFT matching: http://www.cs.ubc.ca/~lowe/papers/09muja.pdf

Page 81: Andrew Zisserman Talk - Part 1a

Oxford buildings dataset
Landmarks plus queries used for evaluation

All Souls

Ashmolean

Balliol

Bodleian

Thom Tower

Cornmarket

Bridge of Sighs

Keble

Magdalen

University Museum

Radcliffe Camera

Ground truth obtained for 11 landmarks over 5062 images

Evaluate performance by mean Average Precision over 55 queries

Page 82: Andrew Zisserman Talk - Part 1a

Approximate K-means

How accurate is the approximate search?

Performance on the 5K image dataset for a forest of 8 randomized trees

Allows much larger clusterings than would be feasible with standard K-means: N~17M points, K~1M

AKM – 8.3 cpu hours per iteration
Standard K-means – estimated 2650 cpu hours per iteration

Page 83: Andrew Zisserman Talk - Part 1a

Performance against vocabulary size

Using large vocabularies gives a big boost in performance (peak @ 1M words)

More discriminative vocabularies give:

Better retrieval quality

Increased search speed – documents share fewer words, so fewer documents need to be scored

Page 84: Andrew Zisserman Talk - Part 1a

Beyond Bag of Words

Use the position and shape of the underlying features to improve retrieval quality

Both images have many matches – which is correct?

Page 85: Andrew Zisserman Talk - Part 1a

Beyond Bag of Words

We can measure spatial consistency between the query and each result to improve retrieval quality

Many spatially consistent matches – correct result

Few spatially consistent matches – incorrect result

Page 86: Andrew Zisserman Talk - Part 1a

Compute 2D affine transformation
• between query region and target image: x' = A x + t, where A is a 2x2 non-singular matrix

Page 87: Andrew Zisserman Talk - Part 1a

Estimating spatial correspondences
1. Test each correspondence

Page 88: Andrew Zisserman Talk - Part 1a

Estimating spatial correspondences
2. Compute a (restricted) affine transformation (5 dof)

Page 89: Andrew Zisserman Talk - Part 1a

Estimating spatial correspondences
3. Score by the number of consistent matches

Use RANSAC on full affine transformation (6 dof)
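For illustration, OpenCV's RANSAC affine estimator can play the role of the verification step; query_pts and target_pts are the matched region centres (N x 2 arrays). This does not reproduce the restricted 5-dof hypotheses mentioned above; it is only a sketch:

    import numpy as np
    import cv2

    def spatial_verification(query_pts, target_pts):
        src = np.asarray(query_pts, dtype=np.float32).reshape(-1, 1, 2)
        dst = np.asarray(target_pts, dtype=np.float32).reshape(-1, 1, 2)
        A, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                          ransacReprojThreshold=5.0)
        n_inliers = int(inliers.sum()) if inliers is not None else 0
        return A, n_inliers        # re-rank the short list by the number of inliers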

Page 90: Andrew Zisserman Talk - Part 1a

Beyond Bag of Words

Extra bonus – gives localization of the object

Page 91: Andrew Zisserman Talk - Part 1a

Example Results

Query Example Results

Rank short list of retrieved images on number of correspondences

Page 92: Andrew Zisserman Talk - Part 1a

Mean Average Precision variation with vocabulary size

vocab size   bag of words   spatial
50K          0.473          0.599
100K         0.535          0.597
250K         0.598          0.633
500K         0.606          0.642
750K         0.609          0.630
1M           0.618          0.645
1.25M        0.602          0.625

Page 93: Andrew Zisserman Talk - Part 1a

Bag of visual words particular object retrieval

query image → set of SIFT descriptors (Hessian-Affine regions + SIFT descriptors) [Lowe 04, Chum & al 2007]
→ sparse frequency vector (visual words + tf-idf weighting, quantized against the centroids / visual words)
→ querying via the inverted file → ranked image short-list
→ geometric verification [Chum & al 2007]