
Page 1: Lec07 aggregation-and-retrieval-system

Image Analysis & Retrieval

CS/EE 5590 Special Topics (Class Ids: 44873, 44874)

Fall 2016, M/W 4-5:15pm@Bloch 0012

Lec 07

Feature Aggregation and Image Retrieval System

Zhu Li

Dept of CSEE, UMKC

Office: FH560E, Email: [email protected], Ph: x 2346.

http://l.web.umkc.edu/lizhu

p.1Image Analysis & Retrieval, 2016

Page 2: Lec07 aggregation-and-retrieval-system

Outline

Recap of Lecture 06: SIFT

Box Filter

Image Retrieval System

Why Aggregation ?

Aggregation Schemes

Summary

Image Analysis & Retrieval, 2016 p.2

Page 3: Lec07 aggregation-and-retrieval-system

Scale Space Theory - Lindeberg

Scale space response via the Laplacian of Gaussian (LoG); the scale is controlled by σ

Characteristic Scale:

Image Analysis & Retrieval, 2016 p.3

LoG: $\nabla^2 g = \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2}$, with Gaussian kernel $g = e^{-\frac{x^2+y^2}{2\sigma^2}}$

[Figure: LoG responses for a blob of radius r at scales σ = 0.8r, σ = 1.2r, σ = 2r; the characteristic scale is the σ at which the response peaks.]

Page 4: Lec07 aggregation-and-retrieval-system

SIFT

Use DoG to approximate LoG; separable Gaussian filter

Difference of images instead of difference of Gaussian kernels

Image Analysis & Retrieval, 2016 p.4

LoG

Scale space construction by Gaussian filtering and image differencing

Page 5: Lec07 aggregation-and-retrieval-system

Peak Strength & Edge Removal

Peak Strength: interpolate the true DoG response and pixel location by Taylor expansion

Edge Removal:

Re-do Harris-type detection to remove edge responses on a much reduced pixel set

Image Analysis & Retrieval, 2016 p.5

Page 6: Lec07 aggregation-and-retrieval-system

Rotation Invariance through Dominant Orientation Coding

Voting for the dominant orientation, weighted by a Gaussian window to give more emphasis to the gradients closer to the center

Image Analysis & Retrieval, 2016 p.6

Page 7: Lec07 aggregation-and-retrieval-system

SIFT Matching and Repeatability Prediction

SIFT Distance

Not all SIFT are created equal…

Peak strength (DoG response at interpolated position)

Image Analysis & Retrieval, 2016 p.7

Combined scale/peak strength pmf

$\frac{d(s_1^1, s_{k^*}^2)}{d(s_1^1, s_k^2)} \le \theta$

Page 8: Lec07 aggregation-and-retrieval-system

Box Filter – CABOX work

Basic Idea: Approximate DoG with linear combination of box filters

$\min_{\mathbf{h}}\ \|\mathbf{g} - B\,\mathbf{h}\|_{L_2}^2 + \lambda\,\|\mathbf{h}\|_{L_1}$

Solution by LASSO

Image Analysis & Retrieval, 2016 p.8

[Figure: the DoG kernel approximated as h1·(box 1) + h2·(box 2) + …]

Page 9: Lec07 aggregation-and-retrieval-system

Outline

Recap of Lecture 06: SIFT

Box Filter

Image Retrieval System

Why Aggregation ?

Aggregation Schemes

Summary

Image Analysis & Retrieval, 2016 p.9

Page 10: Lec07 aggregation-and-retrieval-system

Image Matching/Retrieval System

SIFT is a sub-image-level feature; we actually care more about how SIFT matches translate into image-level matching/retrieval accuracy

Suppose we can compute a single distance from a collection of features:

Then for a database of n images, we can compute an n x n distance matrix. This gives us full information about the performance of this feature/distance system

How to characterize the performance of such image matching and retrieval system ?

Image Analysis & Retrieval, 2016 p.10

$d(I_1, I_2) = \sum_k \alpha_k\, d(F_k^1, F_k^2)$

$D_{j,k} = d(I_j, I_k)$
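As a minimal sketch (assuming each image has already been reduced to a single global feature vector, e.g., by one of the aggregation schemes later in this lecture, and using plain Euclidean distance), the n x n distance matrix can be computed in Matlab as:

% minimal sketch: n images, each summarized by one d-dimensional aggregated feature
n = 100; d = 128;
F = rand(n, d);        % stand-in for the real aggregated features
D = pdist2(F, F);      % D(j,k) = d(I_j, I_k), Euclidean distance by default
% D is symmetric with a zero diagonal; thresholding D gives match decisions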

Page 11: Lec07 aggregation-and-retrieval-system

Thresholding for Matching

Basically, for any pair of images (documents, in IR jargon), we declare a match by thresholding the distance.

Then for each possible image pair (or the pairs we care about), for a given threshold t, there are 4 possible outcomes:
TP pair: {Ij, Ik} is a true matching pair, declared matching, d(Ij, Ik) < t;
FP pair: {Ij, Ik} is a true non-matching pair, but declared matching, d(Ij, Ik) < t;
TN pair: {Ij, Ik} is a true non-matching pair, declared non-matching, d(Ij, Ik) >= t;
FN pair: {Ij, Ik} is a true matching pair, but declared non-matching, d(Ij, Ik) >= t;

Image Analysis & Retrieval, 2016 p.11

$I_j, I_k$ are a match if $d(I_j, I_k) < t$; they are not a match otherwise

Page 12: Lec07 aggregation-and-retrieval-system

Matching System Performance

True Positive Rate (Recall): out of all true matching pairs, how many are retrieved, i.e., declared matching with distance < t

False Positive Rate: out of all true non-matching pairs, how many are falsely declared matching

Image Analysis & Retrieval, 2016 p.12

$TPR = \frac{tp}{tp + fn}$

$FPR = \frac{fp}{fp + tn}$

Page 13: Lec07 aggregation-and-retrieval-system

TPR-FPR

Definition:

TP rate = TP/(TP+FN)

FP rate = FP/(FP+TN)

From the actual value point of view

Image Analysis & Retrieval, 2016 p.13

Page 14: Lec07 aggregation-and-retrieval-system

ROC curve(1)

ROC = receiver operating characteristic

Y:TP rate

X:FP rate

Image Analysis & Retrieval, 2016 p.14

Page 15: Lec07 aggregation-and-retrieval-system

ROC curve(2)

Which method (A or B) is better? Compute the ROC area: the area under the ROC curve

Image Analysis & Retrieval, 2016 p.15

Page 16: Lec07 aggregation-and-retrieval-system

Precision, Recall, F-measure

Precision = TP/(TP + FP),

Recall = TP/(TP + FN)

F-measure = 2*(precision*recall)/(precision + recall)

Precision: the probability that a retrieved document is relevant.

Recall: the probability that a relevant document is retrieved in a search.
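A minimal sketch of these three measures from tp/fp/fn counts (the counts here are made up for illustration):

% hypothetical example counts from a thresholded distance matrix
tp = 80; fp = 20; fn = 10;
precision = tp / (tp + fp);                                   % 0.80
recall    = tp / (tp + fn);                                   % about 0.89
f_measure = 2 * (precision * recall) / (precision + recall);  % about 0.84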

Image Analysis & Retrieval, 2016 p.16

Page 17: Lec07 aggregation-and-retrieval-system

Matlab Implementation

We will compute all image pair distances D(j,k)

How do we compute the TPR-FPR plot? Understand that TPR and FPR are actually functions of the threshold t.

We just need to parameterize TPR(t) and FPR(t), and obtain operating points at meaningful thresholds, to generate the plot.

Matlab implementation: [tp, fp, tn, fn] = getPrecisionRecall(d0, d1, npt, dbg), where d0 holds the distances of the true matching pairs, d1 the distances of the non-matching pairs, and npt the number of threshold points

Image Analysis & Retrieval, 2016 p.17

% d0: distances of true matching pairs, d1: distances of non-matching pairs
% sweep npt thresholds between the smallest and largest observed distance
d_min = min(min(d0), min(d1));
d_max = max(max(d0), max(d1));
delta = (d_max - d_min) / npt;
for k = 1:npt
    thres = d_min + (k-1)*delta;
    tp(k) = length(find(d0 <= thres));   % matching pairs declared matching
    fp(k) = length(find(d1 <= thres));   % non-matching pairs declared matching
    tn(k) = length(find(d1 > thres));    % non-matching pairs declared non-matching
    fn(k) = length(find(d0 > thres));    % matching pairs declared non-matching
end
if dbg
    figure(22); grid on; hold on;
    plot(fp./(tn+fp), tp./(tp+fn), '.-r', 'DisplayName', 'tpr-fpr');
    legend();
end

Page 18: Lec07 aggregation-and-retrieval-system

TPR-FPR

Image matching performance is characterized by the function TPR(FPR)

Retrieval set: we want high precision; short list: high recall.

Image Analysis & Retrieval, 2016 p.18

Page 19: Lec07 aggregation-and-retrieval-system

Outline

Recap of Lecture 06: SIFT

Box Filter

Image Retrieval System

Why Aggregation ?

Aggregation Schemes

Summary

Image Analysis & Retrieval, 2016 p.19

Page 20: Lec07 aggregation-and-retrieval-system

Why Aggregation ?

What do (local) interest point features bring us? Scale and rotation invariance, in the form of an nk x d matrix:

Uncertainty in the number of detected features nk at query time

Permutations of the rows of the feature matrix represent the same image.

Problems: the representation is a variable-size set, so we are not able to draw decision boundaries;

Not directly indexable/hashable

Typically very high dimensionality

Image Analysis & Retrieval, 2016 p.20

$S_k = [x_k, y_k, \theta_k, \sigma_k, h_1, h_2, \ldots, h_{128}], \quad k = 1 \ldots n$

Page 21: Lec07 aggregation-and-retrieval-system

Decision Boundary in Matching

Can we have a decision boundary function for an interest-point-based representation?

Image Analysis & Retrieval, 2016 p.21


Page 22: Lec07 aggregation-and-retrieval-system

Curse of Dimensionality in Retrieval

What will feature dimensionality do to retrieval efficiency? Look at 99% locality per dimension, and the plot of the total volume covered: keeping 99% of the range in each dimension covers only 0.99^d of the total volume, which shrinks rapidly as d grows.

Matlab: showDimensionCurse.m
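A minimal sketch of this kind of plot (a guess at what showDimensionCurse.m computes, not the actual course script):

% hypothetical stand-in for showDimensionCurse.m
d = 1:128;                          % feature dimensionality
vol = 0.99.^d;                      % total volume covered at 99% locality per dimension
figure; plot(d, vol, '.-'); grid on;
xlabel('dimension d'); ylabel('fraction of volume covered');
title('curse of dimensionality: 0.99^d');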

Image Analysis & Retrieval, 2016 p.22


Page 23: Lec07 aggregation-and-retrieval-system

Aggregation – 30,000ft view

Bag of Words: compute k centroids in feature space, called visual words; compute a histogram over them
k x 1 feature, hard assignment

VLAD: compute centroids in feature space; compute the aggregated differences w.r.t. the centroids
k x d feature, soft assignment

Fisher Vector: compute a Gaussian Mixture Model (GMM) with 2nd-order info; compute the aggregated feature w.r.t. the mean and covariance of the GMM
2 x k x d feature

AKULA: adaptive centroids and feature count; improved with covariance?

Image Analysis & Retrieval, 2016 p.23


Page 24: Lec07 aggregation-and-retrieval-system

Visual Key Words: main idea

Extract some local features from a number of images …

Image Analysis & Retrieval, 2016 24

e.g., SIFT descriptor space: each point is 128-dimensional

Slide credit: D. Nister

Page 25: Lec07 aggregation-and-retrieval-system

Visual Key Words: main idea

Image Analysis & Retrieval, 2016 25Slide credit: D. Nister

Page 26: Lec07 aggregation-and-retrieval-system

Visual words: main idea

Image Analysis & Retrieval, 2016 26

Slide credit: D. Nister

Page 27: Lec07 aggregation-and-retrieval-system

Visual words: main idea

Image Analysis & Retrieval, 2016 27

Slide credit: D. Nister

Page 28: Lec07 aggregation-and-retrieval-system

Slide credit: D. Nister

Visual Key Words

Image Analysis & Retrieval, 2016 28

Each point is a local descriptor, e.g. a SIFT vector.

Page 29: Lec07 aggregation-and-retrieval-system

Slide credit: D. Nister

Image Analysis & Retrieval, 2016 29

Page 30: Lec07 aggregation-and-retrieval-system

Visual words

Example: each group of patches belongs to the same visual word

Image Analysis & Retrieval, 2016 30

Figure from Sivic & Zisserman, ICCV 2003

Page 31: Lec07 aggregation-and-retrieval-system

Visual words

Image Analysis & Retrieval, 2016 31

Source credit: K. Grauman, B. Leibe

• More recently used for describing scenes and objects for the sake of indexing or classification. Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others.

Page 32: Lec07 aggregation-and-retrieval-system

Object Bag of ‘words’

ICCV 2005 short course, L. Fei-Fei

Bag of Words

Image Analysis & Retrieval, 2016 32

Page 33: Lec07 aggregation-and-retrieval-system

BoW Examples

Illustration

Image Analysis & Retrieval, 2016 33

Page 34: Lec07 aggregation-and-retrieval-system

Bags of visual words

Summarize entire image based on its distribution (histogram) of word occurrences.

Analogous to bag of words representation commonly used for documents.
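As a minimal sketch (random data stands in for real SIFT descriptors and a learned codebook), a hard-assignment BoW histogram for one image can be computed as:

% minimal sketch: hard-assignment Bag-of-Words encoding
sift = rand(500, 128);                    % n x 128 local descriptors of one image
codebook = rand(64, 128);                 % k x 128 visual-word centroids (e.g., from kmeans)
dist = pdist2(sift, codebook);            % n x k distances to each visual word
[~, word] = min(dist, [], 2);             % nearest visual word for each descriptor
k = size(codebook, 1);
bow = accumarray(word, 1, [k, 1]);        % k x 1 histogram of word occurrences
bow = bow / sum(bow);                     % normalize to a distribution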

Image Analysis & Retrieval, 2016 34

Image credit: Fei-Fei Li

Page 35: Lec07 aggregation-and-retrieval-system

Texture Retrieval

Textons…

Image Analysis & Retrieval, 2016 35

Universal texton dictionary

histogram

Source: Lana Lazebnik

Page 36: Lec07 aggregation-and-retrieval-system

BoW Distance Metrics

Rank images by the normalized scalar product between their (possibly weighted) occurrence counts; this is a nearest-neighbor search for similar images.

Image Analysis & Retrieval, 2016 p.36

$\text{sim}(d_j, q) = \frac{d_j \cdot q}{\|d_j\|\,\|q\|}$, e.g., for the word-count vectors $d_j = [5\ 1\ 1\ 0]$ and $q = [1\ 8\ 1\ 4]$
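A minimal sketch of this normalized scalar product on the two example count vectors above:

dj = [5 1 1 0];  q = [1 8 1 4];
sim = dot(dj, q) / (norm(dj) * norm(q));   % cosine similarity, about 0.30 here
% identical word distributions would give a similarity of 1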

Page 37: Lec07 aggregation-and-retrieval-system

Inverted List

Image Retrieval via Inverted List

Image Analysis & Retrieval, 2016 37

Image credit: A. Zisserman

[Figure: inverted file – for each visual word number, the list of image numbers in which it occurs]

When will this give us a significant gain in efficiency?

Page 38: Lec07 aggregation-and-retrieval-system

Indexing local features: inverted file index

For text documents, an efficient way to find all pages on which a word occurs is to use an index…

We want to find all images in which a feature occurs.

We need to index each feature by the images in which it appears, and also keep the number of occurrences.
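A minimal sketch of such an inverted file (a toy index over hypothetical word ids; containers.Map is just one convenient choice):

% toy database: each cell holds one image's visual-word ids (hypothetical data)
imageWords = {[3 7 7 12], [7 9], [1 3 9 9 9]};
invIndex = containers.Map('KeyType', 'double', 'ValueType', 'any');
for img = 1:numel(imageWords)
    for w = unique(imageWords{img})
        occ = sum(imageWords{img} == w);     % # of occurrences of word w in image img
        entry = [img, occ];
        if isKey(invIndex, w)
            invIndex(w) = [invIndex(w); entry];
        else
            invIndex(w) = entry;
        end
    end
end
postings = invIndex(7);   % images containing visual word 7: one [image id, count] row each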

Image Analysis & Retrieval, 2016 38

Source credit : K. Grauman, B. Leibe

Page 39: Lec07 aggregation-and-retrieval-system

TF-IDF Weighting

Term Frequency – Inverse Document Frequency: describe an image by the frequency of each visual word within it, and down-weight words that appear often in the database (the standard weighting for text retrieval)

Image Analysis & Retrieval, 2016 p.39

$t_{id} = \frac{n_{id}}{n_d} \log \frac{N}{n_i}$

$n_{id}$: number of occurrences of word i in document d
$n_d$: number of words in document d
$n_i$: number of occurrences of word i in the whole database
$N$: total number of words in the database
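A minimal sketch of this weighting applied to a small database of BoW count histograms (the counts are made-up toy data):

% H: one row of visual-word counts per image (toy data)
H = [5 1 1 0; 1 8 1 4; 0 2 6 1];
n_d = sum(H, 2);                        % words in each document
n_i = sum(H, 1);                        % occurrences of each word in the whole database
N   = sum(H(:));                        % total number of words in the database
tf    = bsxfun(@rdivide, H, n_d);       % term frequency n_id / n_d
idf   = log(N ./ n_i);                  % inverse document frequency
tfidf = bsxfun(@times, tf, idf);        % tf-idf weighted descriptor, one row per image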

Page 40: Lec07 aggregation-and-retrieval-system

BoW Use Case with Spatial Localization

Collecting words within a query region

Image Analysis & Retrieval, 2016 40

Query region: pull out only the SIFT descriptors whose positions are within the polygon

Page 41: Lec07 aggregation-and-retrieval-system

Image Analysis & Retrieval, 2016 41

Page 42: Lec07 aggregation-and-retrieval-system

BoW Patch Search

Localizing the BoW representation

Image Analysis & Retrieval, 2016 42

Page 43: Lec07 aggregation-and-retrieval-system

Localization with BoW

Image Analysis & Retrieval, 2016 43

Page 44: Lec07 aggregation-and-retrieval-system

Hierarchical Assignment of Histogram

Tree construction:

Image Analysis & Retrieval, 2016 44

[Nister & Stewenius, CVPR’06]

Page 45: Lec07 aggregation-and-retrieval-system

Vocabulary Tree

Training: Filling the tree

Image Analysis & Retrieval, 2016 45

[Nister & Stewenius, CVPR’06]

Page 46: Lec07 aggregation-and-retrieval-system


Vocabulary Tree

Training: Filling the tree

Image Analysis & Retrieval, 2016 46Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 47: Lec07 aggregation-and-retrieval-system


Vocabulary Tree

Training: Filling the tree

Image Analysis & Retrieval, 2016 47Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 48: Lec07 aggregation-and-retrieval-system

Vocabulary Tree

Training: Filling the tree

Image Analysis & Retrieval, 2016 48

[Nister & Stewenius, CVPR’06]

Page 49: Lec07 aggregation-and-retrieval-system

Vocabulary Tree

Training: Filling the tree

Image Analysis & Retrieval, 2016 49

[Nister & Stewenius, CVPR’06]

Page 50: Lec07 aggregation-and-retrieval-system


Vocabulary Tree

Recognition

Image Analysis & Retrieval, 2016 50Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

RANSAC verification

Page 51: Lec07 aggregation-and-retrieval-system

Vocabulary Tree: Performance

Evaluated on large databases: indexing with up to 1M images

Online recognition for a database of 50,000 CD covers; retrieval in ~1s

Found experimentally that large vocabularies can be beneficial for recognition
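A minimal sketch of building such a vocabulary tree by recursive k-means (a simplified illustration under assumed branching factor and depth, not the authors' code):

function tree = buildVocabTree(desc, branch, depth)
% (save as buildVocabTree.m)
% desc: n x d local descriptors; branch: children per node; depth: remaining levels
tree = struct('centers', [], 'children', []);
if depth == 0 || size(desc, 1) < branch
    return;   % leaf node
end
[labels, centers] = kmeans(desc, branch, 'MaxIter', 50);
tree.centers = centers;              % branch x d centroids at this node
tree.children = cell(branch, 1);
for b = 1:branch
    tree.children{b} = buildVocabTree(desc(labels == b, :), branch, depth - 1);
end
end

Quantizing a new descriptor then just walks the tree, picking the nearest child centroid at each level, so lookup cost grows with branch x depth rather than with the full vocabulary size.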

Image Analysis & Retrieval, 2016 51

[Nister & Stewenius, CVPR’06]

Page 52: Lec07 aggregation-and-retrieval-system

Visual Word Vocabulary Size

Performance w.r.t. vocabulary size: larger vocabularies can be advantageous… but what happens if the vocabulary is too large?

Image Analysis & Retrieval, 2016 52

Page 53: Lec07 aggregation-and-retrieval-system

Bags of words: pros and cons

Good:
+ flexible to geometry / deformations / viewpoint
+ compact summary of image content
+ provides a vector representation for sets
+ the Inverted List implementation offers a practical solution against large repositories

Bad:
- loss of information at quantization and histogram generation
- the basic model ignores geometry – must verify afterwards, or encode it via features
- background and foreground are mixed when the bag covers the whole image
- interest points or sampling: no guarantee to capture object-level parts

Image Analysis & Retrieval, 2016 53Source credit : K. Grauman, B. Leibe

Page 54: Lec07 aggregation-and-retrieval-system

Can we improve BoW ?

• E.g. Why isn’t our Bag of Words classifier at 90% instead of 70%?

• Training Data

– Huge issue, but not necessarily a variable you can manipulate.

• Learning method

– BoW is on top of any feature scheme

• Representation

– Are we losing too much info in the process ?

Image Analysis & Retrieval, 2016 p.54

Page 55: Lec07 aggregation-and-retrieval-system

Standard Kmeans Bag of Words

BoW revisited

Image Analysis & Retrieval, 2016 p.55

http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

Page 56: Lec07 aggregation-and-retrieval-system

Motivation

Bag of Visual Words is only about counting the number of local descriptors assigned to each Voronoi region

Why not include other statistics/information?

Image Analysis & Retrieval, 2016 p.56

http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

Page 57: Lec07 aggregation-and-retrieval-system

Spatial Pooling

We already looked at the Spatial Pyramid / Spatial Pooling

Image Analysis & Retrieval, 2016 p.57

[Figure: spatial pyramid levels – level 0: 1x1, level 1: 2x2, level 2: 4x4]

Key takeaway: multiple assignment? soft assignment?

Page 58: Lec07 aggregation-and-retrieval-system

Motivation

Bag of Visual Words is only about counting the number of local descriptors assigned to each Voronoi region

Why not include other statistics? For instance:
• mean of local descriptors

Image Analysis & Retrieval, 2016 p.58

http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

Page 59: Lec07 aggregation-and-retrieval-system

Motivation

Bag of Visual Words is only about counting the number of local descriptors assigned to each Voronoi region

Why not include other statistics? For instance:
• mean of local descriptors

• (co)variance of local descriptors

Image Analysis & Retrieval, 2016 p.59

http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

Page 60: Lec07 aggregation-and-retrieval-system

Simple case: Soft Assignment

Called “Kernel codebook encoding” by Chatfield et al. 2011. Cast a weighted vote into the most similar clusters.

Image Analysis & Retrieval, 2016 p.60

Page 61: Lec07 aggregation-and-retrieval-system

Simple case: Soft Assignment

Called “Kernel codebook encoding” by Chatfield et al. 2011. Cast a weighted vote into the most similar clusters.

This is fast and easy to implement (try it for Project 3!) but it does have some downsides for image retrieval – the inverted file index becomes less sparse.
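A minimal sketch of one such soft assignment (a Gaussian/heat-kernel weighting; the bandwidth heuristic and toy data are assumptions):

% sift: n x d descriptors, codebook: k x d centroids (random toy data)
sift = rand(200, 128); codebook = rand(16, 128);
dist = pdist2(sift, codebook);             % n x k distances to the clusters
sigma = mean(dist(:));                     % bandwidth: a simple heuristic choice
w = exp(-dist.^2 / (2 * sigma^2));         % soft weight of each descriptor to each cluster
w = bsxfun(@rdivide, w, sum(w, 2));        % each descriptor's votes sum to 1
softBow = sum(w, 1)';                      % k x 1 soft histogram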

Image Analysis & Retrieval, 2016 p.61

Page 62: Lec07 aggregation-and-retrieval-system

A first example: the VLAD

Given a codebook $\{\mu_1, \ldots, \mu_K\}$, e.g. learned with K-means, and a set of local descriptors $\{x_t\}$:

• assign: $NN(x_t) = \arg\min_i \|x_t - \mu_i\|$

• compute: $v_i = \sum_{x_t:\,NN(x_t)=i} (x_t - \mu_i)$

• concatenate the $v_i$'s and normalize

Image Analysis & Retrieval, 2016 p.62

Jégou, Douze, Schmid and Pérez, “Aggregating local descriptors into a compact image representation”, CVPR’10.

[Figure: descriptors x in the Voronoi cells of centroids $\mu_1 \ldots \mu_5$]

① assign each descriptor to its nearest centroid
② compute the residual $x - \mu_i$
③ $v_i$ = sum of the residuals $x - \mu_i$ over cell i
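A minimal plain-Matlab sketch of these three steps with hard assignment (toy random data; the VL_FEAT code two slides ahead uses a soft assignment instead):

% sift: n x d descriptors, codebook: k x d centroids (toy data)
sift = rand(500, 128); codebook = rand(16, 128);
[k, d] = size(codebook);
[~, nn] = min(pdist2(sift, codebook), [], 2);    % step 1: nearest centroid per descriptor
v = zeros(k, d);
for i = 1:k
    xi = sift(nn == i, :);                       % descriptors falling in cell i
    if ~isempty(xi)
        v(i, :) = sum(bsxfun(@minus, xi, codebook(i, :)), 1);   % steps 2-3: sum of residuals
    end
end
vlad = v(:) / norm(v(:));                        % concatenate the v_i's and L2-normalize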

Page 63: Lec07 aggregation-and-retrieval-system

A first example: the VLAD

A graphical representation of the resulting VLAD vectors:

Image Analysis & Retrieval, 2016 p.63

Jégou, Douze, Schmid and Pérez, “Aggregating local descriptors into a compact image representation”, CVPR’10.

Page 64: Lec07 aggregation-and-retrieval-system

VL_FEAT Implementation

Matlab:

Image Analysis & Retrieval, 2016 p.64

function [vc] = vladSiftEncoding(sift, codebook)
dbg = 1;
if dbg
    if (0)  % init VL_FEAT, only need to do once
        run('../../tools/vlfeat-0.9.20/toolbox/vl_setup.m');
    end
    % debug mode: build test descriptors and a codebook from a sample image
    im = imread('../pics/flarsheim-2.jpg');
    [f, sift] = vl_sift(single(rgb2gray(im)));
    sift = single(sift');
    [indx, codebook] = kmeans(sift, 16);
    % make sift # smaller
    sift = sift(1:800, :);
end
[n, kd] = size(sift);
[m, kd] = size(codebook);
% compute assignment
dist = pdist2(codebook, sift);
mdist = mean(mean(dist));
% normalize the heat kernel s.t. mean dist is mapped to 0.5
a = -log(0.5) / mdist;
indx = exp(-a * dist);                 % m x n soft assignment weights
vc = vl_vlad(sift', codebook', indx);
if dbg
    figure(41); colormap(gray);
    subplot(2,2,1); imshow(im); title('image');
    subplot(2,2,2); imagesc(dist); title('m x n distance');
    subplot(2,2,3); imagesc(indx); title('m x n assignment');
    subplot(2,2,4); imagesc(reshape(vc, [m, kd])); title('vlad code');
end

Page 65: Lec07 aggregation-and-retrieval-system

VLAD Code

What are the tweaks? Codebook design

Soft assignment options

Image Analysis & Retrieval, 2016 p.65

Page 66: Lec07 aggregation-and-retrieval-system

References

Vocabulary Tree: David Nistér, Henrik Stewénius: Scalable Recognition with a Vocabulary Tree. CVPR (2) 2006: 2161-2168

VLAD: Hervé Jégou, Matthijs Douze, Cordelia Schmid: Improving Bag-of-Features for Large Scale Image Search. International Journal of Computer Vision 87(3): 316-336 (2010)

Fisher Vector: Florent Perronnin, Jorge Sánchez, Thomas Mensink: Improving the Fisher Kernel for Large-Scale Image Classification. ECCV (4) 2010: 143-156

AKULA: Abhishek Nagar, Zhu Li, Gaurav Srivastava, Kyungmo Park: AKULA – Adaptive Cluster Aggregation for Visual Search. DCC 2014: 13-22

Image Analysis & Retrieval, 2016 p.66

Page 67: Lec07 aggregation-and-retrieval-system

Lec 07 Summary

Image Retrieval System Metric
What are true positive, false positive, true negative, and false negative pairs?
What are precision, recall, and the F-score?

Why Aggregation?
Decision boundary
Indexing/Hashing

Bag of Words
A histogram whose bins are visual words
Variations: hierarchical assignment with a vocabulary tree
Implementation: Inverted List

VLAD
Richer encoding of the aggregated info
Soft assignment of features to codebook bins
Vectorized representation – no need for an inverted list

Image Analysis & Retrieval, 2016 p.67