Andrew Zisserman talk - Part 1a
Visual search and recognition
Part I – large scale instance search
Andrew Zisserman, Visual Geometry Group
University of Oxford, http://www.robots.ox.ac.uk/~vgg
Slides with Josef Sivic
Overview
Part I: Instance level recognition
• e.g. find this car in a dataset of images
and the dataset may contain 1M images …
Overview
Part II: Category level recognition
• e.g. find images that contain cars
• or cows …
and localize the occurrences in each image …
“Groundhog Day” [Ramis, 1993]
Visually defined query
“Find this clock”
Example: visual search in feature films
“Find this place”
Problem specification: particular object retrieval
Particular Object Search
Find these objects ...in these images and 1M more
Search the web with a visual query …
The need for visual search
Flickr: more than 5 billion photographs, with more than 1 million added daily
Company collections
Personal collections: 10000s of digital camera photos and video clips
Vast majority will have minimal, if any, textual annotation.
Why is it difficult?
Problem: find particular occurrences of an object in a very
large dataset of images
Want to find the object despite possibly large changes in scale, viewpoint, lighting and partial occlusion
[Example images illustrating changes in viewpoint, scale, lighting and occlusion]
Outline
1. Object recognition cast as nearest neighbour matching
• Covariant feature detection
• Feature descriptors (SIFT) and matching
2. Object recognition cast as text retrieval
3. Large scale search and improving performance
4. Applications
5. The future and challenges
Approach
Determine regions (detection) and vector descriptors in each frame which are invariant to camera viewpoint changes
Match descriptors between frames using invariant vectors
Visual problem
query?
• Retrieve image/key frames containing the same object
Example of visual fragments
Image content is transformed into local fragments that are invariant to translation, rotation, scale, and other imaging parameters
Lowe ICCV 1999
• Fragments generalize over viewpoint and lighting
Detection Requirements
• detection must commute with viewpoint transformation
• i.e. detection is viewpoint covariant
• NB detection computed in each image independently
Detected image regions must cover the same scene region in different views
[Commutative diagram: detection in each view, followed by a viewpoint transformation, yields the same regions]
Scale-invariant feature detection
Goal: independently detect corresponding regions in scaled versions of the same image
Need scale selection mechanism for finding characteristic region size that is covariant with the image transformation
Laplacian
Scale-invariant features: Blobs
Slides from Svetlana Lazebnik
Recall: Edge detection
Convolve the signal f with the derivative of a Gaussian: f ∗ (d/dx) g
Edge = maximum of the derivative response
Source: S. Seitz
Edge detection, take 2
Convolve the signal f with the second derivative of a Gaussian (the Laplacian): f ∗ (d²/dx²) g
Edge = zero crossing of the second derivative response
Source: S. Seitz
From edges to `top hat’ (blobs)
Blob = top-hat = superposition of two step edges
Spatial selection: the magnitude of the Laplacian response will achieve a maximum at the center of the blob, provided the scale of the Laplacian is “matched” to the scale of the blob
Scale selection
We want to find the characteristic scale of the blob by convolving it with Laplacians at several scales and looking for the maximum response
However, Laplacian response decays as scale increases:
Why does this happen?
[Original signal (blob of radius 8) and Laplacian responses for increasing σ]
Scale normalization
The response of a derivative of Gaussian filter to a perfect step edge decreases as σ increases
Response to a unit step edge: 1 / (√(2π) σ)
To keep response the same (scale-invariant), must multiply Gaussian derivative by σ
Laplacian is the second Gaussian derivative, so it must be multiplied by σ2
Effect of scale normalization
[Plots: original signal; unnormalized Laplacian response (maximum decays as scale σ increases); scale-normalized Laplacian response (maximum at the blob's scale)]
Blob detection in 2D
Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D
∇²g = ∂²g/∂x² + ∂²g/∂y²
Blob detection in 2D
Laplacian of Gaussian: circularly symmetric operator for blob detection in 2D
Scale-normalized: ∇²_norm g = σ² (∂²g/∂x² + ∂²g/∂y²)
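As a concrete illustration of the scale-normalized Laplacian above, a minimal sketch in Python (assuming a grayscale float image `img`; the scale range is illustrative, not from the slides):

```python
# Sketch: scale-normalized Laplacian-of-Gaussian blob responses.
import numpy as np
from scipy.ndimage import gaussian_laplace

def scale_normalized_log(img, sigmas):
    """Return a stack of sigma^2 * LoG responses, one per scale."""
    responses = []
    for sigma in sigmas:
        # gaussian_laplace computes the (unnormalized) Laplacian of Gaussian;
        # multiplying by sigma^2 makes responses comparable across scales.
        responses.append(sigma ** 2 * gaussian_laplace(img, sigma))
    return np.stack(responses)          # shape: (num_scales, H, W)

# Characteristic scale at a pixel = scale of the extremal normalized response,
# e.g. scale_idx = np.argmax(np.abs(responses), axis=0)
```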
Characteristic scale
We define the characteristic scale as the scale that produces the peak of the Laplacian response
characteristic scale
T. Lindeberg (1998). "Feature detection with automatic scale selection." International Journal of Computer Vision 30 (2): pp 77–116.
Scale selection
Scale invariance of the characteristic scale
[Plots of the normalized Laplacian response against scale for two images related by a scale factor s]
• Relation between characteristic scales s₁* and s₂*:  s · s₁* = s₂*
Scale invariance
• Multi-scale extraction of Harris interest points
• Selection of characteristic scale in Laplacian scale space
Laplacian
Characteristic scale:
- maximum in scale space
- scale invariant
Mikolajczyk and Schmid ICCV 2001
What class of transformations are required?
Similarity(translation, scale, rotation)
Affine
Projective(homography)
2D transformation models
Motivation for affine transformations
[Figure: a region detected in View 1 does not cover the same scene region when detected in View 2 – not the same region]
Local invariance requirements
Geometric: 2D affine transformation, x′ = A x + t,
where A is a 2x2 non-singular matrix and t a translation
• Objective: compute image descriptors invariant to this class of transformations
Viewpoint covariant detection
• Characteristic scales (size of region)
  • Lindeberg and Garding ECCV 1994
  • Lowe ICCV 1999
  • Mikolajczyk and Schmid ICCV 2001
• Affine covariance (shape of region)
  • Baumberg CVPR 2000
  • Matas et al BMVC 2002 – maximally stable regions
  • Mikolajczyk and Schmid ECCV 2002
  • Schaffalitzky and Zisserman ECCV 2002
  • Tuytelaars and Van Gool BMVC 2000
  • Mikolajczyk et al., IJCV 2005
Shape adapted regions – “Harris affine”
Example affine covariant region: Maximally Stable regions (MSR)
1. Segment using the watershed algorithm, and track connected components as the threshold value varies.
2. An MSR is detected when the area of the component is stationary.
See Matas et al BMVC 2002
Example: Maximally stable regions
[Figure: a sub-image from the first and second image thresholded at varying levels, with plots of area vs threshold and change of area vs threshold; the MSR is detected where the area is stationary]
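A rough sketch of detecting such regions using OpenCV's MSER implementation (the file name and default parameters are placeholders, not from the slides):

```python
# Sketch: maximally stable (extremal) region detection with OpenCV.
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
mser = cv2.MSER_create()                     # maximally stable region detector
regions, bboxes = mser.detectRegions(gray)   # each region is an array of pixel coordinates
print(len(regions), "maximally stable regions detected")
```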
Example of affine covariant regions
1000+ regions per image
Shape adapted regions
Maximally stable regions
• a region's size and shape are not fixed, but
• automatically adapt to the image intensity to cover the same physical surface
• i.e. pre-image is the same surface region
Viewpoint invariant description
• Elliptical viewpoint covariant regions
  • Shape Adapted regions
  • Maximally Stable Regions
• Map ellipse to circle and orientate by dominant direction
• Represent each region by a SIFT descriptor (128-vector) [Lowe 1999]
  • see Mikolajczyk and Schmid CVPR 2003 for a comparison of descriptors
Local descriptors - rotation invariance
Estimation of the dominant orientation:
• extract gradient orientation
• histogram over gradient orientation (0 to 2π)
• peak in this histogram
Rotate patch in dominant direction
Descriptors – SIFT [Lowe’99]
Distribution of the gradient over an image patch: a 3D histogram over a 4x4 location grid and 8 gradient orientations (128 dimensions)
Very good performance in image matching [Mikolajczyk and Schmid '03]
Summary – detection and description
Extract affine regions → Normalize regions → Eliminate rotation → Compute appearance descriptors
SIFT (Lowe ’04)
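For reference, a minimal sketch of this detection-plus-description step using OpenCV's SIFT (note OpenCV's detector is the difference-of-Gaussian one from Lowe '04, i.e. scale covariant rather than fully affine covariant; the file name is illustrative):

```python
# Sketch: region detection and SIFT description with OpenCV.
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# descriptors is an N x 128 array: one SIFT vector per detected region
print(descriptors.shape)
```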
Approach
Determine regions (detection) and vector descriptors in each frame which are invariant to camera viewpoint changes
Match descriptors between frames using invariant vectors
Visual problem
query?
• Retrieve image/key frames containing the same object
Outline of an object retrieval strategy
1. Compute regions in each image independently
2. “Label” each region by a descriptor vector from its local intensity neighbourhood
3. Find corresponding regions by matching to the closest descriptor vector
4. Score each frame in the database by the number of matches
Finding corresponding regions transformed to finding nearest neighbour vectors
[Diagram: frames → regions → invariant descriptor vectors, computed independently for each frame]
Example
In each frame independently:
• determine elliptical regions (detection covariant with camera viewpoint)
• compute a SIFT descriptor for each region [Lowe ‘99]
Harris-affine
Maximally stable regions
1000+ descriptors per frame
Object recognition
Establish correspondences between object model image and target image by nearest neighbour matching on SIFT vectors
128D descriptor space
Model image Target image
Match regions between frames using SIFT descriptors
Harris-affine
Maximally stable regions
• Multiple fragments overcome the problem of partial occlusion
• Transfer query box to localize object
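A minimal matching sketch, assuming two grayscale images `model_img` and `target_img` are already loaded; the 0.8 ratio-test threshold is a common heuristic (Lowe '04), not a value from the slides:

```python
# Sketch: nearest-neighbour matching of SIFT descriptors with a ratio test.
import cv2

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(model_img, None)     # model (query) image
kp2, des2 = sift.detectAndCompute(target_img, None)    # target image

matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)                # two nearest neighbours per descriptor

# keep matches whose nearest neighbour is clearly better than the second nearest
good = [m for m, n in knn if m.distance < 0.8 * n.distance]
print(len(good), "putative correspondences")
```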
Now, convert this approach to a text retrieval representation
Outline
1. Object recognition cast as nearest neighbour matching
2. Object recognition cast as text retrieval
• bag of words model
• visual words
3. Large scale search and improving performance
4. Applications
5. The future and challenges
Success of text retrieval
• efficient
• high precision
• scalable
Can we use retrieval mechanisms from text for visual retrieval?
• ‘Visual Google’ project, 2003+
For a million+ images:
• scalability
• high precision
• high recall: can we retrieve all occurrences in the corpus?
Text retrieval lightning tour
Stop-list: reject the very common words, e.g. “the”, “a”, “of”
Stemming: represent words by stems, e.g. “walking”, “walks” → “walk”
Inverted file – ideal book index: Term → List of hits (occurrences in documents)
People: [d1: hit hit hit], [d4: hit hit] …
Common: [d1: hit hit], [d3: hit], [d4: hit hit hit] …
Sculpture: [d2: hit], [d3: hit hit hit] …
• word matches are pre-computed
Ranking:
• frequency of words in document (tf-idf)
• proximity weighting (google)
• PageRank (google)
Need to map feature descriptors to “visual words”.
Build a visual vocabulary for a movie
Vector quantize descriptors
• k-means clustering
Implementation:
• compute SIFT features (128D) on frames from 48 shots of the film
• 6K clusters for Shape Adapted regions
• 10K clusters for Maximally Stable regions
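A sketch of the vocabulary-building step with mini-batch k-means (descriptor file, vocabulary size and batch size are illustrative; as listed above, the slides build separate 6K/10K vocabularies per detector type):

```python
# Sketch: building a visual vocabulary by k-means clustering of SIFT descriptors.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# (N, 128) array of SIFT descriptors pooled over the training frames
all_descriptors = np.load("descriptors.npy")

K = 10_000                                   # vocabulary size
kmeans = MiniBatchKMeans(n_clusters=K, batch_size=10_000, random_state=0)
kmeans.fit(all_descriptors)

visual_words = kmeans.cluster_centers_       # (K, 128): one centre per visual word
```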
Samples of visual words (clusters on SIFT descriptors):
Maximally stable regions
Shape adapted regions
• generic examples – cf textons
• more specific examples
Visual words: quantize descriptor space
Sivic and Zisserman, ICCV 2003
Nearest neighbour matching of descriptors between Image 1 and Image 2 in the 128D descriptor space is expensive to do for all frames.
Instead, vector quantize the descriptors: each descriptor is assigned to its nearest cluster centre, and the cluster index becomes its visual word (e.g. word 5 or word 42).
A new image is then matched by assigning each of its descriptors to the same visual words.
The same visual word
Vector quantize the descriptor space (SIFT)
Visual words are ‘iconic’ image patches or fragments
• represent the frequency of word occurrence
• but not their position
Representation: bag of (visual) words
Image → Collection of visual words:
detect patches → normalize patch → compute SIFT descriptor → find nearest cluster centre
Offline: assign visual words and compute histograms for each key frame in the video
Represent each frame by a sparse histogram of visual word occurrences, e.g. (2, 0, 0, 1, 0, 1, …)
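A sketch of the histogram computation for one key frame (assuming the `kmeans` vocabulary from the earlier sketch and an (M, 128) descriptor matrix `frame_descriptors`):

```python
# Sketch: sparse bag-of-visual-words histogram for one frame.
import numpy as np

word_ids = kmeans.predict(frame_descriptors)      # nearest visual word for each descriptor
histogram = np.bincount(word_ids, minlength=K)    # K-vector of word counts
# most entries are zero, so only (word_id, count) pairs need to be stored
```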
Offline: create an index
For fast search, store a “posting list” for the dataset
This maps word occurrences to the documents they occur in
Visual word → Posting list (frames)
1 → 5, 10, ...
2 → 10, ...
(e.g. visual word #1 occurs in frames #5 and #10; visual word #2 occurs in frame #10)
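A sketch of building such posting lists, assuming `frame_histograms` is a dict mapping frame id to its K-vector of word counts (names are illustrative):

```python
# Sketch: inverted file mapping each visual word to the frames containing it.
from collections import defaultdict

inverted_file = defaultdict(list)            # visual word -> list of frame ids
for frame_id, hist in frame_histograms.items():
    for word_id in hist.nonzero()[0]:
        inverted_file[int(word_id)].append(frame_id)
```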
At run time …
• User specifies a query region
• Generate a short list of frames using the visual words in the region:
1. Accumulate all visual words within the query region
2. Use the “book index” (inverted file) to find other frames with these words
3. Compute similarity for frames which share at least one word
Generates a tf-idf ranked list of all the frames in the dataset.
Image ranking using the bag-of-words model
For a vocabulary of size K, each image is represented by a K-vector v_d = (t_1, …, t_K)^T, where t_i is the (weighted) number of occurrences of visual word i.
Images are ranked by the normalized scalar product between the query vector v_q and all vectors in the database v_d:
sim(v_q, v_d) = (v_q · v_d) / (||v_q|| ||v_d||)
The scalar product can be computed efficiently using the inverted file.
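A sketch of the ranking step, scoring only frames that share at least one word with the query via the inverted file (all names come from the earlier sketches; `idf[i]` is assumed to be log(N / number of frames containing word i)):

```python
# Sketch: tf-idf weighted ranking by normalized scalar product.
import numpy as np

def rank_frames(query_hist, frame_histograms, inverted_file, idf):
    vq = query_hist * idf                            # tf-idf weight the query
    vq = vq / np.linalg.norm(vq)
    # candidate frames: those sharing at least one visual word with the query
    candidates = {f for w in query_hist.nonzero()[0] for f in inverted_file[int(w)]}
    scores = {}
    for frame_id in candidates:
        vd = frame_histograms[frame_id] * idf
        scores[frame_id] = float(vq @ (vd / np.linalg.norm(vd)))  # normalized scalar product
    return sorted(scores.items(), key=lambda kv: -kv[1])
```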
Summary: Match histograms of visual words
1. Compute affine covariant regions in each frame independently (offline)
2. “Label” each region by a vector of descriptors based on its intensity (offline)
3. Build histograms of visual words by descriptor quantization (offline)
4. Rank retrieved frames by matching visual word histograms using inverted files
[Pipeline: frames → regions → invariant descriptor vectors → quantize → single vector (histogram)]
Films = common dataset
“Pretty Woman”
“Groundhog Day”
“Casablanca”
“Charade”
Video Google Demo
Example: select a region
Search in the film “Groundhog Day”
Retrieved shots
Visual words - advantages
Design of descriptors makes these words invariant to:
• illumination
• affine transformations (viewpoint)
Multiple local regions give immunity to partial occlusion
Overlap encodes some structural information
NB: no attempt to carry out a ‘semantic’ segmentation
Sony logo from Google image search on `Sony’
Retrieve shots from Groundhog Day
Example application – product placement
Retrieved shots in Groundhog Day for search on Sony logo
Outline
1. Object recognition cast as nearest neighbour matching
2. Object recognition cast as text retrieval
3. Large scale search and improving performance
• large vocabularies and approximate k-means
• query expansion
• soft assignment
4. Applications
5. The future and challenges
Particular object search
Find these landmarks ...in these images
Investigate …
Vocabulary size: number of visual words in range 10K to 1M
Use of spatial information to re-rank
Oxford buildings dataset
Automatically crawled from Flickr
Dataset (i) consists of 5062 images, crawled by searching for Oxford landmarks, e.g.
“Oxford Christ Church”, “Oxford Radcliffe camera”, “Oxford”
“Medium” resolution images (1024 x 768)
Oxford buildings dataset
Automatically crawled from Flickr
Consists of:
Dataset (i) crawled by searching for Oxford landmarks
Datasets (ii) and (iii) crawled from other popular Flickr tags; these act as additional distractors
Oxford buildings datasetLandmarks plus queries used for evaluation
All Souls
Ashmolean
Balliol
Bodleian
Thom Tower
Cornmarket
Bridge of Sighs
Keble
Magdalen
University Museum
Radcliffe Camera
Ground truth obtained for 11 landmarks over 5062 images
Evaluate performance by mean Average Precision
Precision - Recall
[Precision-recall curve: precision plotted against recall; diagram of all images, returned images, and relevant images]
• Precision: % of returned images that are relevant
• Recall: % of relevant images that are returned
Average Precision
[Precision-recall curve; AP is the area under the curve]
• A good AP score requires both high recall and high precision
• Application-independent
Performance measured by mean Average Precision (mAP) over 55 queries on 100K or 1.1M image datasets
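A sketch of the evaluation measure (average precision of one ranked list; the benchmark's exact protocol, e.g. its handling of "junk" images, may differ):

```python
# Sketch: average precision of a ranked retrieval list.
import numpy as np

def average_precision(ranked_ids, relevant_ids):
    hits, precisions = 0, []
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)     # precision at each correct retrieval
    return float(np.mean(precisions)) if precisions else 0.0

# mAP = mean of average_precision over all 55 queries
```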
Quantization / Clustering
K-means usually seen as a quick + cheap method
But far too slow for our needs – D~128, N~20M+, K~1M
Use approximate k-means: nearest neighbour search by multiple, randomized k-d trees
K-means overview:
• Initialize cluster centres
• Find the nearest cluster centre for each datapoint (slow: O(N K))
• Re-compute each cluster centre as the centroid of its assigned points
• Iterate
Idea: nearest neighbour search is the bottleneck – use approximate nearest neighbour search
K-means provably locally minimizes the sum of squared errors (SSE) between a cluster centre and its points
Approximate K-means
Use multiple, randomized k-d trees for search
A k-d tree hierarchically decomposes the descriptor space
Points nearby in the space can be found (hopefully) by backtracking around the tree some small number of steps
Single tree works OK in low dimensions – not so well in high dimensions
Approximate K-means
Multiple randomized trees increase the chances of finding nearby points
[Figure: query point and true nearest neighbour in three randomized k-d trees; the true nearest neighbour is missed in the first two trees but found in the third]
Approximate K-means
Use the best-bin first strategy to determine which branch of the tree to examine next
Share this priority queue between the multiple trees – searching multiple trees is only slightly more expensive than searching one
Original K-means complexity = O(N K)
Approximate K-means complexity = O(N log K)
This means we can scale to very large K
Experimental evaluation for SIFT matching: http://www.cs.ubc.ca/~lowe/papers/09muja.pdf
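A toy sketch of one approximate k-means iteration in which the assignment step goes through a k-d tree built over the current centres. For simplicity this uses a single exact scipy tree; the approach described above uses multiple randomized k-d trees with a shared best-bin-first priority queue:

```python
# Sketch: one approximate-k-means-style iteration with a k-d tree assignment step.
import numpy as np
from scipy.spatial import cKDTree

def akm_iteration(points, centres):
    tree = cKDTree(centres)            # index the current K centres
    _, assign = tree.query(points)     # nearest centre for each of the N points
    new_centres = centres.copy()
    for k in range(len(centres)):
        members = points[assign == k]
        if len(members) > 0:           # keep the old centre if the cluster is empty
            new_centres[k] = members.mean(axis=0)
    return new_centres, assign
```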
Oxford buildings datasetLandmarks plus queries used for evaluation
All Souls
Ashmolean
Balliol
Bodleian
Thom Tower
Cornmarket
Bridge of Sighs
Keble
Magdalen
University Museum
Radcliffe Camera
Ground truth obtained for 11 landmarks over 5062 images
Evaluate performance by mean Average Precision over 55 queries
Approximate K-means
How accurate is the approximate search?
Performance on 5K image dataset for a random forest of 8 trees
Allows much larger clusterings than would be feasible with standard K-means: N~17M points, K~1M
AKM – 8.3 cpu hours per iteration
Standard K-means – estimated 2650 cpu hours per iteration
Performance against vocabulary size
Using large vocabularies gives a big boost in performance (peak @ 1M words)
More discriminative vocabularies give:
Better retrieval quality
Increased search speed – documents share fewer words, so fewer documents need to be scored
Beyond Bag of Words
Use the position and shape of the underlying features to improve retrieval quality
Both images have many matches – which is correct?
Beyond Bag of Words
We can measure spatial consistency between the query and each result to improve retrieval quality
Many spatially consistent matches – correct result
Few spatially consistent matches – incorrect result
Compute a 2D affine transformation between the query region and the target image: x′ = A x + t, where A is a 2x2 non-singular matrix and t a translation.
Estimating spatial correspondences:
1. Test each correspondence
2. Compute a (restricted) affine transformation (5 dof)
3. Score by the number of consistent matches
Use RANSAC on the full affine transformation (6 dof)
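A sketch of this verification step using OpenCV's RANSAC affine estimator, re-using the ratio-test matches (`good`, `kp1`, `kp2`) from the earlier matching sketch; the reprojection threshold is illustrative:

```python
# Sketch: spatial verification by RANSAC fitting of a 6-dof affine transformation.
import numpy as np
import cv2

src = np.float32([kp1[m.queryIdx].pt for m in good])   # points in the query image
dst = np.float32([kp2[m.trainIdx].pt for m in good])   # corresponding points in the target

A, inlier_mask = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                      ransacReprojThreshold=5.0)
num_inliers = int(inlier_mask.sum()) if inlier_mask is not None else 0
# re-rank the short list by num_inliers; A (a 2x3 affine matrix) also localizes
# the object by mapping the query box into the target image
```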
Beyond Bag of Words
Extra bonus – gives localization of the object
Example results: query and retrieved images
Rank the short list of retrieved images by the number of correspondences
Mean Average Precision variation with vocabulary size

vocab size   bag of words   spatial
50K          0.473          0.599
100K         0.535          0.597
250K         0.598          0.633
500K         0.606          0.642
750K         0.609          0.630
1M           0.618          0.645
1.25M        0.602          0.625
Bag of visual words particular object retrieval – pipeline:
query image → set of SIFT descriptors (Hessian-Affine regions + SIFT descriptors)
→ sparse frequency vector (quantization against the centroids, i.e. visual words, with tf-idf weighting)
→ querying the inverted file → ranked image short-list
→ geometric verification
[Lowe 04, Chum et al 2007]