Approximate Nearest Neighbor - Applications to Vision & Matching
Lior Shoval, Rafi Haddad

Approximate Nearest Neighbor: Applications to Vision & Matching
1. Object matching in 3D: Recognizing cars in cluttered scanned images (A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik)
2. Video Google: A Text Retrieval Approach to Object Matching in Videos (Sivic, J. and Zisserman, A.)
Object Matching
Input: an object and a dataset of models
Output: the most “similar” model
Two methods will be presented:
1. Voting-based method
2. Cost-based method
[Diagram: query object Sq compared against models S1, S2, …, Sn]
Descriptor-based object matching – Voting
Every descriptor votes for the model that gave the closest descriptor
Choose the model with the most votes
Problem: the hard vote discards the relative distances between descriptors
[Diagram: query object Sq compared against models S1, S2, …, Sn]
Descriptor-based object matching – Cost
Compare all object descriptors to all target model descriptors
[Diagram: query object Sq compared against models S1, S2, …, Sn]
cost(Sq, Si) = Σ_{k ∈ {1,…,K}} min_{m ∈ {1,…,M}} dist(q_k, p_m)

where q_1, …, q_K are the descriptors of the query object Sq and p_1, …, p_M are the descriptors of model Si
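To make the two schemes concrete, here is a minimal Python sketch (not the authors' code) of voting-based and cost-based matching with brute-force nearest neighbors; it assumes descriptors are rows of NumPy arrays and dist is the Euclidean distance.

```python
# Illustrative sketch: voting-based vs. cost-based object matching
# with brute-force nearest-neighbor search (Euclidean distance).
import numpy as np

def nearest_dist(q, descs):
    """Distance from one query descriptor q to its closest descriptor in descs."""
    return np.min(np.linalg.norm(descs - q, axis=1))

def match_by_voting(query_descs, models):
    """Each query descriptor votes for the model holding its overall nearest descriptor."""
    votes = np.zeros(len(models), dtype=int)
    for q in query_descs:
        best_model = np.argmin([nearest_dist(q, m) for m in models])
        votes[best_model] += 1
    return int(np.argmax(votes))          # model with the most votes

def match_by_cost(query_descs, models):
    """cost(Sq, Si) = sum over query descriptors of the min distance to model Si."""
    costs = [sum(nearest_dist(q, m) for q in query_descs) for m in models]
    return int(np.argmin(costs))          # model with the lowest cost
```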
Application to car matching
Matching – Nearest Neighbor
To match the object to the right model, a NN algorithm is used
Every descriptor in the object is compared to all descriptors in the model
The computational cost is very high
Experiment 1 – Model matching
Experiment 2 – Cluttered scenes
Matching – Nearest Neighbor
E.g.: Q – 160 descriptors in the object; N – 83,640 reference descriptors × 12 rotations ≈ 1E6 descriptors in the models
Exact NN takes 7.4 sec on a 2.2 GHz processor per object descriptor
Speeding up the search with LSH
Fast search techniques such as LSH (locality-sensitive hashing) can reduce the search space by orders of magnitude
Tradeoff between speed and accuracy
LSH – dividing the high-dimensional feature space into hypercubes, using a set of k randomly chosen axis-parallel hyperplanes and l different sets of hypercubes (see the sketch after the k=4 illustrations below)
LSH – k=4; l=1
LSH – k=4; l=2
LSH – k=4; l=3
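Here is a small, illustrative Python sketch of this bucketing scheme (an assumed implementation, not the one used in the paper): each of the l tables hashes a point by which side of its k axis-parallel cuts the point falls on, and a query then checks exact distances only against points that share a bucket in at least one table.

```python
# Illustrative axis-parallel LSH: k random cuts per table -> 2^k hypercube buckets,
# l independent tables reduce the chance that near neighbors get separated.
import numpy as np
from collections import defaultdict

class AxisParallelLSH:
    def __init__(self, data, k=4, l=3, seed=0):
        rng = np.random.default_rng(seed)
        self.data = np.asarray(data, dtype=float)
        self.tables = []
        for _ in range(l):
            dims = rng.integers(0, self.data.shape[1], size=k)   # axis of each hyperplane
            lo = self.data.min(axis=0)[dims]
            hi = self.data.max(axis=0)[dims]
            cuts = rng.uniform(lo, hi)                           # position of each hyperplane
            buckets = defaultdict(list)
            for idx, p in enumerate(self.data):
                buckets[self._key(p, dims, cuts)].append(idx)
            self.tables.append((dims, cuts, buckets))

    @staticmethod
    def _key(p, dims, cuts):
        return tuple(p[dims] > cuts)                             # k-bit hypercube label

    def query(self, q):
        # Union of candidates from all l tables, then an exact check on that small set.
        q = np.asarray(q, dtype=float)
        cand = set()
        for dims, cuts, buckets in self.tables:
            cand.update(buckets.get(self._key(q, dims, cuts), []))
        if not cand:
            return None
        cand = list(cand)
        d = np.linalg.norm(self.data[cand] - q, axis=1)
        return cand[int(np.argmin(d))]
```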
LSH – Results
Taking the best 80 of the 160 descriptors
Achieving close results with fewer descriptors
Descriptor-based object matching – Reducing complexity
Approximate nearest neighbor: dividing the problem into two stages
1. Preprocessing
2. Querying
Locality-Sensitive Hashing (LSH)
Or...
Video Google
A Text Retrieval Approach to Object Matching in Videos
Query
Results
Interesting facts on Google
The most used search engine on the web
Who Wants to Be a Millionaire?
How many pages does Google search?
a. Around half a billion   b. Around 4 billion   c. Around 10 billion   d. Around 50 billion
How many machines does Google use?
a. 10   b. A few hundred   c. A few thousand   d. Around a million
Video Google: On-line Demo
Samples – Run Lola Run:
- Supermarket logo (Bolle): frame/shot 72325 / 824
- Red cube logo: entry frame/shot 15626 / 174
- Roulette #20: frame/shot 94951 / 988
Groundhog Day:
- Bill Murray's ties: frame/shot 53001 / 294 and frame/shot 40576 / 208
- Phil's home: entry frame/shot 34726 / 172
Query
Occluded !!!
Video Google
- Text Google
- Analogy from text to video
- Video Google processes
- Experimental results
- Summary and analysis
Text retrieval overview
- Words & documents
- Vocabulary
- Weighting
- Inverted file
- Ranking
Words & Documents
Documents are parsed into words
Common words (the, an, etc.) are ignored; this is called a ‘stop list’
Words are represented by their stems: ‘walk’, ‘walking’, ‘walks’ → ‘walk’
Each word is assigned a unique identifier
A document is represented by a vector whose components are the frequencies of occurrence of the words it contains
Vocabulary
The vocabulary contains K words
Each document is represented by a K-component vector of word frequencies
(0, 0, …, 3, …, 4, …, 5, 0, 0)
Example:
“…… Representation, detection and learning are the main issues that need to be tackled in designing a visual system for recognizing object categories ……”
Parse and clean:
represent detect learn main issue tackle design visual system recognize category
Creating the document vector
Assign a unique ID to each word
Create a document vector of size K with the word frequencies: (3, 7, 2, ………)/789
Or, compactly, with the original order and position:

Word       Position         ID
represent  1, 12, 55        1
detect     2, 32, 44, …     2
learn      3, 11            3
…          …                …
Total      789
Weighting
The vector components are weighted in various ways:
- Naive – frequency of each word
- Binary – 1 if the word appears, 0 if not
- tf-idf – ‘Term Frequency – Inverse Document Frequency’
t_i = (n_id / n_d) · log(N / n_i)

tf-idf Weighting
- n_id – number of occurrences of word i in document d
- n_d – total number of words in document d
- N – number of documents in the whole database
- n_i – number of occurrences of term i in the whole database
=> “word frequency” × “inverse document frequency”
=> all documents are weighted equally, regardless of their length

t_i = (n_id / n_d) · log(N / n_i)

The document is then represented by the vector V_d = (t_1, …, t_i, …, t_K)ᵀ
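As a concrete illustration of this weighting, here is a short sketch (assuming word ids 0…K−1 and n_i defined, as above, as the total number of occurrences of term i in the whole database):

```python
# tf-idf sketch following the formula above: t_i = (n_id / n_d) * log(N / n_i)
import math
from collections import Counter

def tfidf_vectors(docs, K):
    """docs: list of documents, each a list of word ids in [0, K)."""
    N = len(docs)
    n_i = Counter()                      # occurrences of each term in the whole database
    for d in docs:
        n_i.update(d)
    vectors = []
    for d in docs:
        n_d = len(d)                     # total number of words in this document
        v = [0.0] * K
        for i, n_id in Counter(d).items():
            v[i] = (n_id / n_d) * math.log(N / n_i[i])
        vectors.append(v)
    return vectors
```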
Inverted File – Index
Crawling stage: parse all documents to create the document-representing vectors
Creating word indices: an entry for each word in the corpus, followed by the list of all documents (and the positions within them) in which it appears
[Diagram: word IDs 1…K, each pointing to a list of document IDs 1…N]
Querying
1. Parse the query to create the query vector. Query: “Representation learning” → query doc vector = (1, 0, 1, 0, 0, …)
2. Retrieve all document IDs containing at least one of the query word IDs (using the inverted file index)
3. Calculate the distance between the query and document vectors (the angle between the vectors)
4. Rank the results
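A small Python sketch of steps 2–4 (illustrative only; doc_vectors are tf-idf vectors such as those built in the sketch above, and the ranking score is the cosine of the angle between vectors):

```python
# Inverted-file querying sketch: candidate documents from the index, cosine ranking.
import numpy as np
from collections import defaultdict

def build_inverted_index(docs):
    """Map word id -> set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for w in words:
            index[w].add(doc_id)
    return index

def query(index, doc_vectors, query_vector):
    q = np.asarray(query_vector, dtype=float)
    candidates = set()
    for w in np.nonzero(q)[0]:                       # only words present in the query
        candidates |= index.get(int(w), set())
    ranked = []
    for doc_id in candidates:
        d = np.asarray(doc_vectors[doc_id], dtype=float)
        cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-12)
        ranked.append((cos, doc_id))
    return [doc_id for cos, doc_id in sorted(ranked, reverse=True)]
```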
Ranking the query results
1. PageRank (PR): assume pages T1, T2, …, Tn link to page A. Define C(X) as the number of links on page X, and let d be a weighting factor (0 ≤ d ≤ 1). Then

PR(A) = (1 − d) + d · Σ_{i=1}^{n} PR(T_i) / C(T_i)

2. Word order
3. Font size, font type and more
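A minimal iterative evaluation of this formula (an illustrative sketch; it uses the un-normalized (1 − d) term exactly as written above):

```python
# Iterate PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)) until (roughly) stable.
def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) for p in pages}
        for page, outs in links.items():
            if not outs:
                continue
            share = d * pr[page] / len(outs)     # d * PR(T) / C(T)
            for target in outs:
                new[target] += share
        pr = new
    return pr
```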
The Visual Analogy

Text       Visual
Corpus     Film
Document   Frame
Word       ???
Stem       ???
Detecting “Visual Words”
“Visual word” = descriptor
What is a good descriptor? Invariant to different viewpoints, scale, illumination, shift and transformation; local versus global
How to build such a descriptor?
1. Find invariant regions in the frame
2. Represent each region by a descriptor
Finding invariant regions
Two types of ‘viewpoint covariant regions’ are computed for each frame:
1. SA – Shape Adapted
2. MS – Maximally Stable
1. SA – Shape Adapted
• Finding interest points using the Harris corner detector
• Iteratively determining the ellipse center, scale and shape around the interest point
• Reference - Baumberg
2. MS – Maximally Stable
Intensity watershed image segmentation
Iteratively determining the ellipse center, scale and shape
Reference – Matas
Why two types of detectors?
They are complementary representations of a frame
SA regions tend to be centered at corner-like features
MS regions correspond to blobs of high contrast (such as a dark window on a gray wall)
Each detector describes a different “vocabulary” (e.g. the building design and the building specification)
MS – SA example
MS – yellow; SA – cyan (zoom)
Building the Descriptors
SIFT – Scale Invariant Feature Transform
Each elliptical region is represented by a 128-dimensional vector [Lowe]
SIFT is invariant to a shift of a few pixels (which often occurs)
Building the Descriptors
Removing noise – tracking & averaging
Regions are tracked across a sequence of frames using a ‘constant velocity dynamical model’
Any region that does not survive for more than three frames is rejected
Descriptors throughout a track are averaged to improve SNR
Descriptors with large covariance are rejected
The Visual Analogy

Text       Visual
Corpus     Film
Document   Frame
Word       Descriptor
Stem       ???
Building the “Visual Stems”
Cluster the descriptors into K groups using the K-means clustering algorithm
Each cluster represents a “visual word” in the “visual vocabulary”
Result: 10K SA clusters, 16K MS clusters
K-Means Clustering
Input:
- A set of n unlabeled examples D = {x1, x2, …, xn} in a d-dimensional feature space
- The number of clusters, K
Objective: find a partition of D into K non-empty disjoint subsets so that the points in each subset are coherent according to a certain criterion
D = D_1 ∪ D_2 ∪ … ∪ D_K,  with D_i ∩ D_j = ∅ for i ≠ j

E.g., minimize the squared distance of the vectors to their cluster centroids:

Σ_{j=1}^{K} Σ_{x ∈ D_j} ‖x − m_j‖²,  where m_j = (1/|D_j|) Σ_{x ∈ D_j} x
K-means clustering – algorithm
Step 1: Initialize a partition of D
a. Randomly choose K equal-size sets and calculate their centers
Example: D = {a, b, …, k, l}; n = 12; K = 4; d = 2
[Illustration: the 12 points with initial centers m1–m4]
K-means clustering – algorithm
Step 1: Initialize a partition of D
b. Every other point y is put into subset Dj if mj is the closest of the K centers to y
[Illustration: initial assignment]
D1 = {a, c, l}; D2 = {e, g}; D3 = {d, h, i}; D4 = {b, f, k}
K-means clustering – algorithm
Step 2: Repeat until there is no update
a. Compute the mean (center of mass) mj of each cluster Dj
b. For each xi: assign xi to the cluster with the closest center
[Illustration: updated assignment]
D1 = {a, c, l}; D2 = {e, g}; D3 = {d, h, i}; D4 = {b, f, k}
K-means algorithm
Final result
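A compact Python sketch of the two-step loop above (an assumed implementation that uses random initial centers rather than the equal-size initialization of step 1a):

```python
# K-means sketch: recompute cluster means, reassign points, repeat until stable.
import numpy as np

def kmeans(X, K, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: pick K initial centers and build an initial assignment
    centers = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
    for _ in range(max_iter):
        # Step 2a: compute the mean (center of mass) of each cluster
        for j in range(K):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        # Step 2b: assign each point to the cluster with the closest center
        new_labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # no update -> converged
        labels = new_labels
    return labels, centers
```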
K-means clustering – Cons
- Sensitive to the selection of the initial grouping and of the metric
- Sensitive to the order of the input vectors
- The number of clusters, K, must be determined beforehand
- Each attribute has the same weight
K-means clustering – Resolution
Run with different groupings and orderings
Run for different K values
Problem? Complexity!
MS and SA “Visual Words”
[Examples of SA and MS clusters]
The Visual Analogy

Text       Visual
Corpus     Film
Document   Frame
Word       Descriptor
Stem       Centroid
Visual “Stop List”
The most frequent visual words, which occur in almost all images, are suppressed
[Illustration: matches before and after applying the stop list]
Ranking Frames
1. Distance between vectors (as with words/documents)
2. Spatial consistency (analogous to word order in text)
Visual Google process
Preprocessing: vocabulary building, crawling frames, creating the stop list
Querying: building the query vector, ranking the results
Vocabulary building
1. A subset of 48 shots is selected; 10k frames = 10% of the movie
2. Regions construction (SA + MS): 10k frames × 1600 ≈ 1.6E6 regions
3. SIFT descriptor representation
4. Frame tracking and rejection of unstable regions: 1.6E6 → ~200k regions
5. Clustering the descriptors using the K-means algorithm (parameter tuning is done with the ground-truth set)
Crawling Implementation
To reduce complexity, one keyframe per second is selected (100–150k frames → 5k frames)
Descriptors are computed for the stable regions in each keyframe
Mean values are computed using two frames on each side of the keyframe
Vocabulary: vector quantization, using the nearest neighbor algorithm (the vocabulary found from the ground-truth set)
This tests the expressiveness of the visual vocabulary: frames outside the ground-truth set contain new objects and scenes, and their detected regions were not included in forming the clusters
Crawling movies summary
1. Key frame selection → 5k frames
2. Regions construction (SA + MS)
3. SIFT descriptor representation
4. Frame tracking and rejection of unstable regions
5. Nearest neighbor for vector quantization
6. Stop list, tf-idf weighting, indexing
“Google-like” Query of an Object
1. Generate the query descriptors
2. Use the nearest neighbor algorithm to build the query vector
3. Use the inverted index to find the relevant frames (document vectors are sparse → a small set of frames)
4. Calculate the distance to the relevant frames
5. Rank the results
Takes ~0.1 seconds with a Matlab implementation
Experimental results
The experiment was conducted in two stages: scene location matching and object retrieval
Scene Location Matching
Goal: evaluate the method by matching scene locations within a closed world of shots (the ‘ground truth set’), and tune the system parameters
Ground truth set
164 frames, from 48 shots, were taken at 19 3D locations in the movie ‘Run Lola Run’ (4–9 frames from each location)
There are significant viewpoint changes among the frames of the same location
Ground Truth Set
Location matching
The entire frame is used as a query region
The performance is measured over all 164 frames
The correct results were determined by hand
Rank calculation
Location matching

Rank = (1 / (N · N_rel)) · ( Σ_{i=1}^{N_rel} R_i − N_rel(N_rel + 1)/2 )

- Rank – ordering quality (0 ≤ Rank ≤ 1); 0 is best
- N_rel – number of relevant images
- N – the size of the image set (164)
- R_i – the position of the i-th relevant image in the result (1 ≤ R_i ≤ N)

Rank = 0 if all the relevant images are returned first, since then Σ_{i=1}^{N_rel} R_i = N_rel(N_rel + 1)/2
Location matching – Example
- Frame 6 is the current query frame
- Frames 13, 17, 29, 135 contain the same scene location, so N_rel = 5
- The result was: {17, 29, 6, 142, 19, 135, 13, …

Frame number   6   13   17   29   135   Total
R_i            3   7    1    2    6     19
Best rank      1   2    3    4    5     15
Location matching

Σ_{i=1}^{N_rel} R_i = 3 + 7 + 1 + 2 + 6 = 19
N_rel(N_rel + 1)/2 = 5 · 6 / 2 = 15

Query Rank = (19 − 15) / (5 · 164) ≈ 0.00487
Best Rank = 0 (all relevant frames returned first)
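A small sketch of this Rank computation (it assumes every relevant frame appears somewhere in the ranked result list):

```python
# Normalized Rank measure defined above, reproducing the frame-6 example.
def normalized_rank(result_list, relevant, N):
    """result_list: ranked frame ids; relevant: set of relevant frame ids; N: image-set size."""
    positions = [result_list.index(f) + 1 for f in relevant]   # R_i, 1-based positions
    n_rel = len(relevant)
    return (sum(positions) - n_rel * (n_rel + 1) / 2) / (n_rel * N)

# Example from the slides: sum(R_i) = 19, N_rel(N_rel + 1)/2 = 15,
# so Rank = (19 - 15) / (5 * 164) ≈ 0.00487.
```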
Rank of relevant frames
Frames 61 - 64
Object retrieval
Goal: searching for objects throughout the entire movie
The object of interest is specified by the user as a sub-part of any frame
Object query results (1)
Run Lola Run results
Groundhog Day results
Object query results (2)
The expressive power of the visual vocabulary: the visual words learnt for ‘Run Lola Run’ are used unchanged for the ‘Groundhog Day’ retrieval!
Object query results (2)
Analysis: both the actual frames returned and the ranking are excellent
No frames containing the object are missed (no false negatives)
The highly ranked frames all contain the object (good precision)
Google Performance Analysis vs. Object Matching
- Q – number of query descriptors (~10^2)
- M – number of descriptors per frame (~10^3)
- N – number of key frames per movie (~10^4)
- D – descriptor dimension (128 ~ 10^2)
- K – number of “words” in the vocabulary (16×10^3 ~ 10^3)
- α – ratio of documents that do not contain any of the Q “words” (~0.1)

Brute-force NN: cost = Q·M·N·D ~ 10^11
Google: query vector quantization + distances = Q·K·D + K·N → Q·K·D + Q·(αN) ~ 10^7 + 10^5, using the sparseness of the document vectors
Improvement factor ~ 10^4 to 10^6
Video Google Summary
- Immediate run-time object retrieval
- Visual word and vocabulary analogy
- Modular framework
- Demonstration of the expressive power of the visual vocabulary
Open issues
- Automatic ways of building the vocabulary are needed
- Ranking of the retrieval results, as Google does
- Extension to non-rigid objects, like faces
Future thoughts
Using this method for higher-level analysis of movies:
- Finding the content of a movie by the “words” it contains
- Finding the important objects (e.g. a star) in a movie
- Finding the location of unrecognized video frames
- More?
What is the meaning of the word Google?
a. The number 1E10   b. Very big data   c. The number 1E100   d. A simple, clean search
$1 Million!!!
References
1. Sivic, J. and Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the International Conference on Computer Vision, 2003.
2. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In 7th Int. WWW Conference, 1998.
3. K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. ECCV. Springer-Verlag, 2002.
4. A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik. Recognizing Objects in Range Data Using Regional Point Descriptors. In European Conference on Computer Vision, Prague, Czech Republic, 2004.
5. D. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, pages 1150–1157, 1999.
6. F. Schaffalitzky and A. Zisserman. Automated Location Matching in Movies.
7. J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference, pages 384–393, 2002.
Parameter tuning
- K – the number of clusters for each region type
- The initial cluster center values
- The minimum tracking length for stable features
- The proportion of unstable descriptors to reject, based on their covariance
Locality-Sensitive Hashing (LSH)
Divide the high-dimensional feature space into hypercubes, by k randomly chosen axis-parallel hyperplanes
Each hypercube is a hash bucket
The probability that two nearby points are separated is reduced by independently choosing l different sets of hyperplanes
2 hyperplanes
ε-nearest-neighbor
ε-Nearest Neighbor Search
• d(q, p) ≤ (1 + ε) · d(q, P)
• d(q, p) is the (normalized) Euclidean distance between q and p: d(q, p) = (Σ_i (q_i − p_i)²)^(1/2)
• ε is the maximum allowed ‘error’
• d(q, P) is the distance from q to the closest point in P
• Point p is the member of P that is retrieved (or not)
ε-Nearest Neighbor Search
Also called approximate nearest neighbor searching
Reports neighbors of the query point q at distances possibly greater than the true nearest-neighbor distance: d(q, p) ≤ (1 + ε) · d(q, P)
Don't worry, the math is on the next slide
ε-Nearest Neighbor Search – Goal
• The goal is not to get the exact answer, but a good approximate answer
• In many applications of nearest neighbor search an approximate answer is good enough
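A tiny sketch of the acceptance criterion itself, using brute force to obtain the exact d(q, P) for comparison:

```python
# Check the epsilon-NN guarantee: d(q, p) <= (1 + eps) * d(q, P).
import numpy as np

def is_acceptable_answer(q, p, P, eps):
    """True if p is within (1 + eps) of the true nearest-neighbor distance d(q, P)."""
    q, p, P = np.asarray(q, float), np.asarray(p, float), np.asarray(P, float)
    d_qp = np.linalg.norm(q - p)
    d_qP = np.min(np.linalg.norm(P - q, axis=1))   # exact nearest-neighbor distance
    return d_qp <= (1.0 + eps) * d_qP
```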
ε-Nearest Neighbor Search
• What is currently out there?
• Arya and Mount presented an algorithm
  • Query time: O(exp(d) · ε^(−d) · log n)
  • Preprocessing: O(n log n)
• Clarkson improved the dependence on ε: exp(d) · ε^(−(d−1)/2)
• Both grow exponentially with d
ε-Nearest Neighbor Search – Striking observation
• The “brute force” algorithm provides a faster query time
• It simply computes the distance from the query to every point in P
• Analysis: O(dn)
• Arya and Mount: “… if the dimension is significantly larger than log n (as it is for a number of practical instances), there are no approaches we know of that are significantly faster than brute-force search”
High Dimensions
• What is the problem? Many applications of nearest neighbor (NN) search have a high number of dimensions
• Current algorithms do not perform much better than brute-force linear search
• Much work has been done on dimension reduction
Dimension Reduction
• Principal Component Analysis: transforms a number of correlated variables into a smaller number of uncorrelated variables (can anyone explain this further?)
• Latent Semantic Indexing: used in the document indexing process; looks at the entire document to see which other documents contain some of the same words
Descriptor-based object matching – Complexity
Finding, for each object descriptor, the nearest descriptor in the model can be a costly operation
Descriptor dimension ~1E2; 1000 object descriptors; 1E6 descriptors per model; 56 models
Brute-force nearest neighbor: ~1E12 operations
min_{m ∈ {1,…,M}} dist(q_k, p_m)