A Higher-Level Visual Representation for Semantic Learning in Image Databases
Ismail EL SAYAD
18/07/2011
President: Sophie Tison, Université Lille 1
Reviewers: Philippe Mulhem, Laboratoire d'Informatique de Grenoble; Zhongfei Zhang, State University of New York
Examiner: Bernard Merialdo, Eurecom Sophia-Antipolis
Advisor: Chabane Djeraba, Université Lille 1
Co-advisor: Jean Martinet, Université Lille 1
Overview
Introduction
Related works
Our approach
▪ Enhanced Bag of Visual Words (E-BOW)
▪ Multilayer Semantically Significant Analysis Model (MSSA)
▪ Semantically Significant Invariant Visual Glossary (SSIVG)
Experiments
▪ Image retrieval
▪ Image classification
▪ Object recognition
Conclusion and perspectives
Introduction Related works Our approach Experiments Conclusion and perspectives
Motivation
Digital content grows rapidly:
▪ Personal acquisition devices
▪ Broadcast TV
▪ Surveillance
Digital content is relatively easy to store, but useless without automatic processing, classification, and retrieval.
The usual way to solve this problem is to describe images by keywords.
This method suffers from subjectivity, text ambiguity, and the lack of automatic annotation.
Visual representations
▪ Image-based representations: based on global visual features extracted over the whole image, such as color, color moments, shape, or texture
▪ Part-based representations
Visual representations
The main drawbacks of image-based representations:
▪ High sensitivity to scale, pose, lighting condition changes, and occlusions
▪ Cannot capture the local information of an image
Part-based representations:
▪ Based on the statistics of features extracted from segmented image regions
Visual representations: Part-based representations (Bag of visual words)
[Figure: bag-of-visual-words pipeline. Local descriptors are computed, the features are clustered in the feature space to build the visual word vocabulary (VW1, VW2, VW3, VW4, ...), and each image is represented by the frequency of each visual word.]
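The bag-of-visual-words pipeline pictured above can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: it assumes toy 2-D descriptors, a plain k-means with a fixed number of iterations, and hypothetical helper names (`nearest`, `kmeans`, `bow_histogram`).

```python
import math
import random


def nearest(vec, centers):
    """Index of the center closest to vec (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(vec, centers[i]))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means; the returned centers play the role of visual words."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centers)].append(v)
        # Recompute each center as the mean of its group (keep it if empty).
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def bow_histogram(descriptors, vocabulary):
    """Count how often each visual word occurs among an image's descriptors."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest(d, vocabulary)] += 1
    return hist


# Toy local descriptors collected from a whole image set (two obvious groups).
all_descriptors = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
vocabulary = kmeans(all_descriptors, k=2)

# Descriptors of one image, quantized against the vocabulary.
image_descriptors = [(0.1, 0.1), (0.0, 0.2), (5.0, 5.0)]
histogram = bow_histogram(image_descriptors, vocabulary)
```

The histogram keeps only occurrence counts, so the position of each descriptor in the image is discarded at this step.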
Visual representations: Bag of visual words (BOW) drawbacks
▪ Spatial information loss: only the number of occurrences is recorded; the positions are ignored
▪ Only keypoint-based intensity descriptors are used: neither shape nor color information is used
▪ Feature quantization noisiness: unnecessary and insignificant visual words are generated
Visual representations: Bag of visual words (BOW) drawbacks
▪ Low discrimination power: different image semantics are represented by the same visual words
▪ Low invariance to visual diversity: one image semantic is represented by different visual words
[Figure: example images annotated with the visual words VW330, VW480, VW148, VW263, and VW1364 (shown twice)]
Objectives
Enhanced BOW representation:
▪ Different local information (intensity, color, shape…)
▪ Spatial constitution of the image
▪ Efficient visual word vocabulary structure
Higher-level visual representation:
▪ Less noisy
▪ More discriminative
▪ More invariant to visual diversity
Overview of the proposed higher-level visual representation
[Figure: pipeline starting from a set of images and going through three stages: E-BOW (visual word vocabulary building, E-BOW representation), MSSA model (learning the MSSA model), and SSIVG (SSIVWs & SSIVPs generation, SSIVG representation)]
Introduction
Related works
▪ Spatial Pyramid Matching Kernel (SPM) & sparse coding
▪ Visual phrase & descriptive visual phrase
▪ Visual phrase pattern & visual synset
Our approach
Experiments
Conclusion and perspectives
Spatial Pyramid Matching Kernel (SPM) & sparse coding
▪ Lazebnik et al. [CVPR06], Spatial Pyramid Matching Kernel (SPM): exploits the spatial information of local regions
▪ Yang et al. [CVPR09], SPM + sparse coding: replaces k-means in the SPM with sparse coding
Visual phrase & descriptive visual phrase
▪ Zheng and Gao [TOMCCAP08], visual phrase: a pair of spatially adjacent local image patches
▪ Zhang et al. [ACM MM09], descriptive visual phrase: selected according to the frequencies of its constituent visual word pairs
Visual phrase pattern & visual synset
▪ Yuan et al. [CVPR07], visual phrase pattern: a spatially co-occurring group of visual words
▪ Zheng et al. [CVPR08], visual synset: a relevance-consistent group of visual words or phrases, in the spirit of the text synset
Comparison of the different enhancements of the BOW
Columns: (1) SPM, (2) SPM + sparse coding, (3) Visual phrase, (4) Descriptive visual phrase, (5) Visual phrase pattern, (6) Visual synset, (7) Our approach

Criterion                                            (1) (2) (3) (4) (5) (6) (7)
Considering the spatial location                      +   +   -   -   -   -   +
Describing different local information                -   -   -   -   -   -   +
Eliminating ambiguous visual words semantically       -   -   -   -   -   -   +
Efficient structure for storing visual vocabulary     -   -   -   +   +   -   +
Enhancing low discrimination power                    -   -   +   +   +   +   +
Tackling low invariance for visual diversity          -   -   -   -   +   +   +
Introduction
Related works
Our approach
▪ Enhanced Bag of Visual Words (E-BOW)
▪ Multilayer Semantically Significant Analysis Model (MSSA)
▪ Semantically Significant Invariant Visual Glossary (SSIVG)
Experiments
Conclusion and perspectives
Enhanced Bag of Visual Words (E-BOW)
[Figure: the E-BOW stage of the pipeline, ahead of the MSSA model and SSIVG stages: set of images, SURF & Edge Context extraction, features fusion, hierarchical features quantization, E-BOW representation]
Enhanced Bag of Visual Words (E-BOW): Feature extraction
[Figure: feature extraction flowchart with the following boxes]
▪ Interest points detection
▪ Edge points detection
▪ Color feature extraction at each interest and edge point
▪ Color filtering using vector median filter (VMF)
▪ Color and position vector clustering using Gaussian mixture model (components with parameters ∑1 µ1 Pi1, ∑2 µ2 Pi2, ∑3 µ3 Pi3)
▪ SURF feature vector extraction at each interest point
▪ Edge Context feature vector extraction at each interest point
▪ Fusion of the SURF and Edge Context feature vectors
▪ Collection of all vectors for the whole image set
▪ HAC and Divisive Hierarchical K-Means clustering
▪ VW vocabulary
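The color filtering step named in the flowchart, a vector median filter, can be sketched as below. This is a generic textbook version under stated assumptions: the image is a small grid of RGB tuples, the window is a square of hypothetical `radius` 1 clipped at the borders, and the distance is Euclidean. The slides do not give the window size or distance actually used.

```python
import math


def vector_median(window):
    """Return the color vector minimizing the summed distance to all others."""
    return min(window, key=lambda c: sum(math.dist(c, other) for other in window))


def vmf(image, radius=1):
    """Apply the vector median filter to a 2-D grid of color tuples."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighborhood, clipped at the image borders.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(vector_median(window))
        out.append(row)
    return out


# A uniform red patch with one green outlier in the middle.
red, green = (200, 10, 10), (10, 250, 10)
image = [[red] * 3 for _ in range(3)]
image[1][1] = green
filtered = vmf(image)
```

Unlike a per-channel median, the output of each window is always one of the input colors, so no new colors are introduced.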
Enhanced Bag of Visual Words (E-BOW): Feature extraction (SURF)
SURF is a low-level feature descriptor:
▪ Describes how the pixel intensities are distributed within a scale-dependent neighborhood of each interest point
▪ Good at handling serious blurring and image rotation
▪ Poor at handling illumination change
▪ Efficient
Enhanced Bag of Visual Words (E-BOW): Feature extraction (Edge Context descriptor)
The Edge Context descriptor is represented at each interest point as a histogram:
▪ 6 bins for the magnitude of the vectors drawn to the edge points
▪ 4 bins for the orientation angle
Enhanced Bag of Visual Words (E-BOW): Feature extraction (Edge Context descriptor)
This descriptor is invariant to:
▪ Translation: the distribution of the edge points is measured with respect to fixed points
▪ Scale: the radial distance is normalized by the mean distance between all the points within the same Gaussian
▪ Rotation: all angles are measured relative to the tangent angle of each interest point
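A rough sketch of how such a histogram could be computed is given below. It reads the slide as a joint 6 x 4 histogram (24 values), which is an assumption, as are the uniform bin edges, the clipping radius `r_max`, and the function name `edge_context`. For simplicity the distances are normalized by the mean distance from the interest point to the supplied edge points, rather than by the mean distance within a Gaussian component as on the slide.

```python
import math


def edge_context(point, orientation, edge_points, r_bins=6, a_bins=4, r_max=2.0):
    """Joint histogram of normalized distance and relative angle to edge points.

    r_max and the uniform bin edges are assumptions made for this sketch.
    """
    vectors = [(ex - point[0], ey - point[1]) for ex, ey in edge_points]
    dists = [math.hypot(dx, dy) for dx, dy in vectors]
    mean_dist = sum(dists) / len(dists)
    hist = [[0] * a_bins for _ in range(r_bins)]
    for (dx, dy), dist in zip(vectors, dists):
        # Scale invariance: divide by the mean distance, then clip to r_max.
        r = min(dist / mean_dist, r_max - 1e-9)
        # Rotation invariance: measure the angle relative to the point's orientation.
        angle = (math.atan2(dy, dx) - orientation) % (2 * math.pi)
        r_idx = int(r / r_max * r_bins)
        a_idx = min(int(angle / (2 * math.pi) * a_bins), a_bins - 1)
        hist[r_idx][a_idx] += 1
    total = len(edge_points)
    # Flatten to a single vector whose entries sum to 1.
    return [count / total for row in hist for count in row]


edges = [(1.0, 0.5), (-2.0, 1.0), (0.5, -3.0), (2.0, 2.0)]
descriptor = edge_context((0.0, 0.0), 0.0, edges)

# The same configuration scaled by 3, and shifted by (10, -4).
scaled = edge_context((0.0, 0.0), 0.0, [(3 * x, 3 * y) for x, y in edges])
shifted = edge_context((10.0, -4.0), 0.0, [(x + 10.0, y - 4.0) for x, y in edges])
```

Translation invariance comes from using only differences between the edge points and the interest point; scale invariance comes from the division by the mean distance.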
Enhanced Bag of Visual Words (E-BOW): Hierarchical feature quantization
The visual word vocabulary is created by clustering the observed merged features (SURF + Edge Context, 88-D) in 2 clustering steps:
▪ Hierarchical Agglomerative Clustering (HAC): the merged features in the feature space are clustered, and clustering stops at the desired level k
▪ Divisive Hierarchical K-Means Clustering: starting from the k clusters from HAC, the tree is determined level by level, down to some maximum number of levels L, with each division into k parts
[Figure: HAC over the merged features in the feature space, showing a cluster at k = 4]
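The two clustering steps can be sketched as below. This is only an illustration under assumptions that the slides do not state: centroid linkage for the agglomerative step, a fixed number of k-means iterations, toy 2-D features instead of 88-D vectors, and hypothetical function names. The brute-force agglomerative loop is far too slow for a real image set.

```python
import math
import random


def centroid(points):
    """Mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def hac(points, k):
    """Agglomerative step: merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(
            pairs,
            key=lambda ij: math.dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters


def split(points, k, rng, iters=10):
    """One k-means division of a cluster into at most k non-empty parts."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        parts = [[] for _ in range(k)]
        for p in points:
            parts[min(range(k), key=lambda c: math.dist(p, centers[c]))].append(p)
        centers = [centroid(part) if part else centers[c] for c, part in enumerate(parts)]
    return [part for part in parts if part]


def build_tree(points, k, levels, rng):
    """Divisive step: split level by level, down to `levels` levels."""
    node = {"center": centroid(points), "children": []}
    if levels > 0 and len(points) >= k:
        parts = split(points, k, rng)
        if len(parts) > 1:
            node["children"] = [build_tree(part, k, levels - 1, rng) for part in parts]
    return node


def leaves(node):
    """Centers of the leaf nodes, used here as the visual words."""
    if not node["children"]:
        return [node["center"]]
    return [c for child in node["children"] for c in leaves(child)]


def build_vocabulary(features, k, levels, seed=0):
    rng = random.Random(seed)
    roots = [build_tree(cluster, k, levels, rng) for cluster in hac(features, k)]
    return [word for root in roots for word in leaves(root)]


# Four small blobs of toy 2-D features.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
features = [(bx + dx, by + dy) for bx, by in [(0, 0), (0, 10), (20, 0), (20, 10)] for dx, dy in offsets]
vocabulary = build_vocabulary(features, k=2, levels=1)
```

Keeping the tree rather than a flat list of centers is what makes lookup cheap: a new feature is compared with k centers per level instead of with every visual word.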
Multilayer Semantically Significant Analysis (MSSA) model
[Figure: the MSSA model stage of the pipeline, between the E-BOW and SSIVG stages: generative process, parameters estimation, number of latent topics estimation, VWs semantic inference estimation]
Multilayer Semantically Significant Analysis (MSSA) model: Generative process
[Figure: example images showing different visual aspects of one higher-level aspect: People]
A topic model that considers this hierarchical structure is needed.
Multilayer Semantically Significant Analysis (MSSA) model: Generative process
[Figure: plate diagram of the MSSA model with nodes im, h, v, and VW, parameters φ, Θ, and Ψ, and plates M and N]
In the MSSA model, there are two different types of latent (hidden) topics:
▪ High latent topic: represents the high-level aspects
▪ Visual latent topic: represents the visual aspects
Multilayer Semantically Significant Analysis (MSSA) model: Parameter estimation
▪ Probability distribution function: [equation]
▪ Log-likelihood function: [equation]
▪ Gaussier et al. [ACM SIGIR05]: maximizing the likelihood can be seen as a Nonnegative Matrix Factorization (NMF) problem under the generalized KL divergence
▪ Objective function: [equation]
Multilayer Semantically Significant Analysis (MSSA) model: Parameter estimation
▪ KKT conditions are used to derive the multiplicative update rules for minimizing the objective function
▪ This leads to the following multiplicative update rules: [equations]
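To illustrate what multiplicative update rules under the generalized KL divergence look like, here is the standard two-factor NMF case (the Lee and Seung updates for V ≈ WH). It is not the MSSA update, whose equations are not preserved in this transcript and which involves two layers of latent topics. The toy count matrix and all names are assumptions.

```python
import math
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def kl_divergence(v, wh, eps=1e-12):
    """Generalized KL divergence D(V || WH)."""
    total = 0.0
    for v_row, wh_row in zip(v, wh):
        for x, y in zip(v_row, wh_row):
            if x > 0:
                total += x * math.log(x / (y + eps))
            total += y - x
    return total


def nmf_kl(v, rank, iters=100, seed=0, eps=1e-12):
    """Factorize V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    history = []
    for _ in range(iters):
        # H_kj <- H_kj * sum_i W_ik V_ij / (WH)_ij / sum_i W_ik
        wh = matmul(w, h)
        ratio = [[v[i][j] / (wh[i][j] + eps) for j in range(m)] for i in range(n)]
        h = [
            [h[k][j] * sum(w[i][k] * ratio[i][j] for i in range(n))
             / (sum(w[i][k] for i in range(n)) + eps) for j in range(m)]
            for k in range(rank)
        ]
        # W_ik <- W_ik * sum_j H_kj V_ij / (WH)_ij / sum_j H_kj
        wh = matmul(w, h)
        ratio = [[v[i][j] / (wh[i][j] + eps) for j in range(m)] for i in range(n)]
        w = [
            [w[i][k] * sum(h[k][j] * ratio[i][j] for j in range(m))
             / (sum(h[k][j] for j in range(m)) + eps) for k in range(rank)]
            for i in range(n)
        ]
        history.append(kl_divergence(v, matmul(w, h)))
    return w, h, history


# Toy image-by-visual-word count matrix with two obvious groups of images.
counts = [[4, 3, 0, 0], [5, 2, 0, 1], [0, 0, 3, 4], [0, 1, 2, 5]]
w, h, history = nmf_kl(counts, rank=2)
```

Because every update multiplies a nonnegative entry by a nonnegative ratio, the factors stay nonnegative without any explicit projection step.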
Multilayer Semantically Significant Analysis (MSSA) model: Number of latent topics estimation
Minimum Description Length (MDL) is used as a model selection criterion for:
▪ The number of high latent topics (L)
▪ The number of visual latent topics (K)
[Equation: MDL criterion, defined in terms of the log-likelihood and the number of free parameters]
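A sketch of MDL-based selection is shown below, assuming the common two-part form MDL = -log-likelihood + (free parameters / 2) · log(observations). The exact criterion and the free-parameter count used for MSSA are not preserved in this transcript, so both are passed in as inputs, and the candidate values are made-up numbers for illustration only.

```python
import math


def mdl_score(log_likelihood, n_free_params, n_observations):
    """Two-part MDL: data fit plus a penalty that grows with model size."""
    return -log_likelihood + 0.5 * n_free_params * math.log(n_observations)


def select_topic_numbers(candidates, n_observations):
    """Pick the (L, K) pair with the smallest MDL score.

    candidates maps (L, K) to (log_likelihood, n_free_params).
    """
    return min(candidates, key=lambda lk: mdl_score(*candidates[lk], n_observations))


# Hypothetical values: larger models fit better but pay a larger penalty.
candidates = {
    (2, 4): (-1500.0, 60),
    (3, 6): (-1100.0, 110),
    (4, 8): (-1090.0, 170),
}
best = select_topic_numbers(candidates, n_observations=1000)
```

In this toy setting the middle model wins: the largest model improves the log-likelihood only slightly, which does not offset its extra parameters.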
Semantically Significant Invariant Visual Glossary (SSIVG) representation
[Figure: the SSIVG stage of the pipeline, after the E-BOW and MSSA model stages: SSVWs selection, SSVW representation, SSVPs generation, SSVP representation, divisive theoretic clustering, SSIVW representation, SSIVP representation, SSIVG representation]
Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Word (SSVW)
[Figure: diagram with boxes for the set of VWs, the set of relevant visual topics, and the set of SSVWs, linked by two "Estimating using MSSA" steps]
Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Phrase (SSVP)
SSVP: a higher-level and more discriminative representation (SSVWs + their inter-relationships)
SSVPs are formed from SSVW sets that satisfy all the following conditions:
▪ Occur in the same spatial context
▪ Are involved in strong association rules (high support and confidence)
▪ Have the same semantic meaning (high probability related to at least one common visual latent topic)
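The association-rule condition can be illustrated with the usual definitions of support and confidence. In this sketch each "transaction" is the set of SSVWs found in one spatial context; the restriction to pairs, the requirement that the rule be strong in both directions, the thresholds, and the word labels are all assumptions made for illustration. The semantic condition (a common visual latent topic) is not modeled here.

```python
from itertools import combinations


def strong_pairs(transactions, min_support, min_confidence):
    """Find word pairs {a, b} whose rules a -> b and b -> a are both strong."""
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for t in transactions:
        items = sorted(set(t))
        for a in items:
            item_count[a] = item_count.get(a, 0) + 1
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    result = {}
    for (a, b), count in pair_count.items():
        # support: fraction of contexts containing both words
        support = count / n
        # confidence: the weaker of P(b | a) and P(a | b)
        confidence = min(count / item_count[a], count / item_count[b])
        if support >= min_support and confidence >= min_confidence:
            result[(a, b)] = (support, confidence)
    return result


# Hypothetical spatial contexts, each listing the SSVWs that occur together.
contexts = [
    {"w1", "w2", "w3"},
    {"w1", "w2"},
    {"w1", "w2", "w4"},
    {"w3", "w4"},
    {"w1", "w4"},
]
phrases = strong_pairs(contexts, min_support=0.5, min_confidence=0.7)
```

Here only the pair (w1, w2) survives: it appears in 3 of the 5 contexts, and w2 never appears without w1.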
Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Phrase (SSVP)
[Figure: two example images, each annotated with SSIVP126, SSIVP326, and SSIVP304]
Semantically Significant Invariant Visual Glossary (SSIVG) representation: Invariance problem
▪ Studying the co-occurrence and spatial scatter information makes the image representation more discriminative
▪ The invariance power of SSVWs and SSVPs is still low
▪ In text documents, synonymous words can be clustered into one synonymy set to improve the document categorization performance
Semantically Significant Invariant Visual Glossary (SSIVG) representation
SSIVG: a higher-level visual representation composed of two different layers of representation:
▪ Semantically Significant Invariant Visual Word (SSIVW): re-indexed SSVWs after a distributional clustering
▪ Semantically Significant Invariant Visual Phrase (SSIVP): re-indexed SSVPs after a distributional clustering
[Figure: diagram with boxes for the set of SSVWs and SSVPs, the set of relevant visual topics, and the set of SSIVGs (set of SSIVWs and set of SSIVPs), linked by "Estimating using MSSA" steps and a divisive theoretic clustering step]
Introduction
Related works
Our approach
Experiments
▪ Image retrieval
▪ Image classification
▪ Object recognition
Conclusion and perspectives
Assessment of the SSIVG representation performance in image retrieval

Dataset     Total images   Training images   Test images   Image categories
NUS-WIDE    269,648        161,789           107,859       81

Evaluation criterion: Mean Average Precision (MAP)
The traditional Vector Space Model of Information Retrieval is adapted:
▪ The weighting for the SSIVP
▪ Spatial weighting for the SSIVW
▪ The inverted file structure
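For reference, Mean Average Precision can be computed as in the sketch below. It uses the standard definition, in which the average precision of a query is normalized by its total number of relevant items; the identifiers and the two toy queries are illustrative assumptions, not data from the experiments.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the rank of each relevant hit."""
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(ranking, rel) for ranking, rel in runs) / len(runs)


# Two toy queries: relevant items are hit at ranks 1 and 3, then at rank 2.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y", "z"], {"y"}),
]
map_score = mean_average_precision(runs)
```

The first query scores (1/1 + 2/3) / 2 and the second 1/2, so the mean is 2/3.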
Evaluation of the SSIVG representation in image classification

Dataset      Total images   Training images   Test images   Image categories
MIRFLICKR    25,000         15,000            10,000        11

Evaluation criterion: classification Average Precision over each class
Classifiers:
▪ SVM with linear kernel
▪ Multiclass Vote-Based Classifier (MVBC)
Evaluation of the SSIVG representation in image classification: Multiclass Vote-Based Classifier (MVBC)
▪ For each [symbol], we detect the high latent topic that maximizes: [equation]
▪ The final voting score for a high latent topic is: [equation]
▪ Each image is categorized according to the dominant high latent topic
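The voting idea can be sketched as follows. Since the scoring equations are not preserved in this transcript, the sketch makes its own assumptions: each glossary item found in an image votes for its single most probable high latent topic, the vote is weighted by that probability, and the image takes the topic with the largest total. The probability table, item names, and topic names are hypothetical.

```python
def mvbc_classify(image_items, topic_given_item):
    """Return the dominant high latent topic and the per-topic voting scores."""
    scores = {}
    for item in image_items:
        probs = topic_given_item[item]
        # Each item votes for the topic it is most strongly associated with.
        best = max(probs, key=probs.get)
        scores[best] = scores.get(best, 0.0) + probs[best]
    label = max(scores, key=scores.get)
    return label, scores


# Hypothetical probabilities of high latent topics given each glossary item.
topic_given_item = {
    "ssivw_1": {"beach": 0.7, "city": 0.3},
    "ssivw_2": {"beach": 0.4, "city": 0.6},
    "ssivp_1": {"beach": 0.9, "city": 0.1},
}
label, scores = mvbc_classify(["ssivw_1", "ssivw_2", "ssivp_1"], topic_given_item)
```

In this toy case two items vote for "beach" and one for "city", so the image is assigned to "beach".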
Evaluation of the SSIVG representation performance in classification
Assessment of the SSIVG representation performance in object recognition

Dataset       Total images   Training images   Test images   Image categories
Caltech101    8,707          7,697             1,010         101

Each test image is recognized by predicting the object class using the SSIVG representation and the MVBC.
Evaluation criterion: classification Average Precision (AP) over each object class
Conclusion
Enhanced BOW (E-BOW) representation:
▪ Modeling the spatial-color image constitution using GMM
▪ New local feature descriptor (Edge Context)
▪ Efficient visual word vocabulary structure
New Multilayer Semantically Significant Analysis (MSSA) model:
▪ Semantic inferences of different layers of representation
Semantically Significant Invariant Visual Glossary (SSIVG):
▪ More discriminative
▪ More invariant to visual diversity
Experimental validation:
▪ Outperforms other state-of-the-art works
Perspectives
MSSA parameters update:
▪ On-line algorithms to continuously (re-)learn the parameters
Invariance issue:
▪ Context of large-scale databases where large intra-class variations can occur
Cross-modality extension to video content:
▪ Cross-modal data (visual content and textual closed captions)
New generic framework of video summarization:
▪ Study the semantic coherence between visual contents and textual captions
Questions?
Thank you for your attention
Parameter Settings