A Higher-Level Visual Representation for Semantic Learning in Image Databases

Ismail EL SAYAD

18/07/2011

President: Sophie Tison, Université Lille 1

Reviewers:
Philippe Mulhem, Laboratoire d'Informatique de Grenoble
Zhongfei Zhang, State University of New York

Examiner: Bernard Merialdo, Eurecom Sophia-Antipolis

Advisor: Chabane Djeraba, Université Lille 1

Co-advisor: Jean Martinet, Université Lille 1

Overview

Introduction

Related works

Our approach
▪ Enhanced Bag of Visual Words (E-BOW)
▪ Multilayer Semantically Significant Analysis Model (MSSA)
▪ Semantically Significant Invariant Visual Glossary (SSIVG)

Experiments
▪ Image retrieval
▪ Image classification
▪ Object recognition

Conclusion and perspectives


Motivation

Digital content grows rapidly:
▪ Personal acquisition devices
▪ Broadcast TV
▪ Surveillance

Such content is relatively easy to store, but useless without automatic processing, classification, and retrieval.

The usual way to solve this problem is to describe images by keywords. This method suffers from subjectivity, text ambiguity, and the lack of automatic annotation.


Visual representations

▪ Image-based representations
▪ Part-based representations

Image-based representations rely on global visual features extracted over the whole image, such as color, color moments, shape, or texture.

The main drawbacks of image-based representations:

High sensitivity to:
▪ Scale
▪ Pose
▪ Lighting condition changes
▪ Occlusions

Cannot capture the local information of an image

Part-based representations are based on the statistics of features extracted from segmented image regions.


Visual representations: part-based representations (bag of visual words)

[Figure: bag-of-visual-words pipeline. Compute local descriptors, cluster the features in the feature space, build the visual word vocabulary (VW1, VW2, VW3, VW4, ...), then represent each image by the frequency of its visual words (e.g. 2, 1, 1, 1, ...)]
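The pipeline in the figure can be illustrated with a minimal sketch, assuming the vocabulary has already been built by clustering. The function names, the 2-D toy descriptors, and the four cluster centers are hypothetical and only serve to show how an image becomes a frequency vector.

```python
import math
from collections import Counter

def nearest_word(descriptor, vocabulary):
    """Return the index of the visual word (cluster center) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))

def bow_histogram(descriptors, vocabulary):
    """Count how often each visual word occurs; descriptor positions are ignored."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    return [counts.get(i, 0) for i in range(len(vocabulary))]

# Hypothetical 2-D toy data; real local descriptors have many more dimensions.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # VW1..VW4
descriptors = [(0.1, 0.0), (0.0, 0.2), (0.9, 0.1), (0.1, 0.9), (1.0, 0.8)]
histogram = bow_histogram(descriptors, vocabulary)
print(histogram)  # [2, 1, 1, 1]
```

Because only counts are kept, two images with the same visual words in different positions get the same histogram.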


Visual representations: bag of visual words (BOW) drawbacks

▪ Spatial information loss: only the number of occurrences is recorded; the position is ignored
▪ Only keypoint-based intensity descriptors are used: neither shape nor color information is exploited
▪ Feature quantization noise: unnecessary and insignificant visual words are generated


Visual representations: bag of visual words (BOW) drawbacks

▪ Low discrimination power: different image semantics are represented by the same visual words
▪ Low invariance for visual diversity: one image semantic is represented by different visual words

[Figure: example image regions labeled with visual words VW330, VW480, VW148, VW263, and VW1364 (shown twice)]

Objectives

Enhanced BOW representation:
▪ Different local information (intensity, color, shape…)
▪ Spatial constitution of the image
▪ Efficient visual word vocabulary structure

Higher-level visual representation:
▪ Less noisy
▪ More discriminative
▪ More invariant to visual diversity

Overview of the proposed higher-level visual representation

[Figure: processing pipeline starting from the set of images. E-BOW stage: visual word vocabulary building, E-BOW representation. MSSA model stage: learning the MSSA model. SSIVG stage: SSIVWs & SSIVPs generation, SSIVG representation]

Related works
▪ Spatial Pyramid Matching Kernel (SPM) & sparse coding
▪ Visual phrase & descriptive visual phrase
▪ Visual phrase pattern & visual synset

Spatial Pyramid Matching Kernel (SPM) & sparse coding

Lazebnik et al. [CVPR06], Spatial Pyramid Matching Kernel (SPM): exploits the spatial information of local regions.

Yang et al. [CVPR09], SPM + sparse coding: replaces the k-means vector quantization in the SPM with sparse coding.


Visual phrase & descriptive visual phrase

Zheng and Gao [TOMCCAP08], visual phrase: a pair of spatially adjacent local image patches.

Zhang et al. [ACM MM09], descriptive visual phrase: selected according to the frequencies of its constituent visual word pairs.

Visual phrase pattern & visual synset

Yuan et al. [CVPR07], visual phrase pattern: a spatially co-occurring group of visual words.

Zheng et al. [CVPR08], visual synset: a relevance-consistent group of visual words or phrases, in the spirit of the text synset.

Comparison of the different enhancements of the BOW

Criterion                                           SPM  SPM+SC  VP  DVP  VPP  VS  Ours
Considering the spatial location                     +     +     -    -    -   -    +
Describing different local information               -     -     -    -    -   -    +
Eliminating ambiguous visual words semantically      -     -     -    -    -   -    +
Efficient structure for storing visual vocabulary    -     -     -    +    +   -    +
Enhancing low discrimination power                   -     -     +    +    +   +    +
Tackling low invariance for visual diversity         -     -     -    -    +   +    +

SPM+SC: SPM + sparse coding; VP: visual phrase; DVP: descriptive visual phrase; VPP: visual phrase pattern; VS: visual synset; Ours: our approach

Our approach
▪ Enhanced Bag of Visual Words (E-BOW)
▪ Multilayer Semantically Significant Analysis Model (MSSA)
▪ Semantically Significant Invariant Visual Glossary (SSIVG)

Enhanced Bag of Visual Words (E-BOW)

[Figure: overview pipeline with the E-BOW stage detailed. Set of images, SURF & edge context extraction, features fusion, hierarchical features quantization, E-BOW representation; followed by the MSSA model and SSIVG stages]

Enhanced Bag of Visual Words (E-BOW): feature extraction

[Figure: feature extraction flowchart with the following stages]
▪ Interest points detection and edge points detection
▪ Color feature extraction at each interest and edge point
▪ Color filtering using the vector median filter (VMF)
▪ Color and position vector clustering using a Gaussian mixture model (components with parameters µ1, ∑1, Pi1; µ2, ∑2, Pi2; µ3, ∑3, Pi3)
▪ SURF feature vector extraction at each interest point
▪ Edge context feature vector extraction at each interest point
▪ Fusion of the SURF and edge context feature vectors
▪ Collection of all vectors for the whole image set
▪ HAC and divisive hierarchical k-means clustering
▪ VW vocabulary


Enhanced Bag of Visual Words (E-BOW): feature extraction (SURF)

SURF is a low-level feature descriptor that describes how the pixel intensities are distributed within a scale-dependent neighborhood of each interest point.

Good at:
▪ Handling serious blurring
▪ Handling image rotation

Poor at:
▪ Handling illumination change

Computationally efficient


Enhanced Bag of Visual Words (E-BOW): feature extraction (edge context descriptor)

The edge context descriptor is represented at each interest point as a histogram:
▪ 6 bins for the magnitude of the vectors drawn to the edge points
▪ 4 bins for the orientation angle


Enhanced Bag of Visual Words (E-BOW): feature extraction (edge context descriptor)

This descriptor is invariant to:
▪ Translation: the distribution of the edge points is measured with respect to fixed points
▪ Scale: the radial distance is normalized by the mean distance between the whole set of points within the same Gaussian
▪ Rotation: all angles are measured relative to the tangent angle of each interest point
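A minimal sketch of such a histogram, using 6 distance bins and 4 angle bins (24 values per interest point). Several details are assumptions made for illustration and are not taken from the slides: the distances are normalized by the mean distance from the interest point to the given edge points, the bins are linear, the cap `max_ratio` is invented, and the function name `edge_context` is hypothetical.

```python
import math

def edge_context(interest_point, tangent_angle, edge_points,
                 n_dist_bins=6, n_angle_bins=4, max_ratio=2.0):
    """Histogram of the vectors drawn from an interest point to edge points.

    Distances are divided by their mean (scale normalization) and angles are
    measured relative to tangent_angle (rotation normalization).
    max_ratio is an assumed cap on the normalized distance range.
    """
    px, py = interest_point
    vectors = [(ex - px, ey - py) for ex, ey in edge_points
               if (ex, ey) != (px, py)]
    hist = [0.0] * (n_dist_bins * n_angle_bins)
    if not vectors:
        return hist
    dists = [math.hypot(dx, dy) for dx, dy in vectors]
    mean_dist = sum(dists) / len(dists)
    for (dx, dy), d in zip(vectors, dists):
        ratio = min(d / mean_dist, max_ratio - 1e-9)
        d_bin = int(ratio / max_ratio * n_dist_bins)
        angle = (math.atan2(dy, dx) - tangent_angle) % (2 * math.pi)
        a_bin = min(int(angle / (2 * math.pi) * n_angle_bins), n_angle_bins - 1)
        hist[d_bin * n_angle_bins + a_bin] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

# Toy check: rotate, scale, and translate the same configuration.
points = [(1.0, 0.5), (-2.0, 1.0), (0.5, -3.0), (-1.0, -1.5)]
base = edge_context((0.0, 0.0), 0.0, points)

theta, scale, shift = 0.7, 3.0, (10.0, -4.0)

def transform(p):
    x, y = p
    return (scale * (x * math.cos(theta) - y * math.sin(theta)) + shift[0],
            scale * (x * math.sin(theta) + y * math.cos(theta)) + shift[1])

moved = edge_context(transform((0.0, 0.0)), theta, [transform(p) for p in points])
print(len(base), max(abs(a - b) for a, b in zip(base, moved)))
```

Offsets relative to the interest point give translation invariance, the division by the mean distance gives scale invariance, and subtracting the tangent angle gives rotation invariance.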


Enhanced Bag of Visual Words (E-BOW): hierarchical feature quantization

The visual word vocabulary is created by clustering the observed merged features (SURF + edge context, 88 dimensions) in two clustering steps:

▪ Hierarchical Agglomerative Clustering (HAC): the merged features in the feature space are clustered, and clustering stops at the desired level k (the figure shows a cluster at k = 4).
▪ Divisive hierarchical k-means clustering, starting from the k clusters from HAC: the tree is determined level by level, down to some maximum number of levels L, with each division into k parts.
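The two clustering steps can be sketched as follows. This is an illustrative toy version under stated assumptions: centroid linkage for the agglomerative step, plain Lloyd iterations seeded with the first k points for the divisive step, and 2-D features instead of the 88-dimensional merged features. All function names are hypothetical.

```python
import math

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def hac(points, k):
    """Agglomerative step: merge the two closest clusters (centroid linkage,
    an assumed choice) until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def kmeans(points, k, iterations=10):
    """Plain Lloyd iterations, seeded with the first k points (toy seeding)."""
    centers = list(points[:k])
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        centers = [centroid(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return [g for g in groups if g]

def divisive_tree(points, k, levels):
    """Divisive step: split each node into up to k parts, level by level."""
    node = {"center": centroid(points), "children": []}
    if levels > 0 and len(points) > k:
        node["children"] = [divisive_tree(g, k, levels - 1)
                            for g in kmeans(points, k)]
    return node

# Hypothetical 2-D features forming three well-separated groups.
features = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
            (5.0, 5.0), (5.2, 5.1), (5.1, 4.8),
            (9.0, 0.0), (9.1, 0.2), (8.8, 0.1)]
top = hac(features, k=3)
vocabulary_trees = [divisive_tree(c, k=2, levels=2) for c in top]
print(sorted(len(c) for c in top))  # [3, 3, 3]
```

Storing the vocabulary as a tree means a new feature can be assigned to a visual word by descending the tree instead of comparing it with every leaf.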


Multilayer Semantically Significant Analysis (MSSA) model

[Figure: overview pipeline with the MSSA model stage detailed. Generative process, parameters estimation, number of latent topics estimation, VWs semantic inference estimation; preceded by the E-BOW stage and followed by the SSIVG stage]

Multilayer Semantically Significant Analysis (MSSA) model: generative process

[Figure: example images showing different visual aspects that share one higher-level aspect: People]

A topic model that considers this hierarchical structure is needed.


Multilayer Semantically Significant Analysis (MSSA) model: generative process

[Figure: graphical model with node labels VW, v, h, im, parameters φ, Θ, Ψ, and plates M and N]

In the MSSA, there are two different latent (hidden) topics:
▪ High latent topic, which represents the high aspects
▪ Visual latent topic, which represents the visual aspects


Multilayer Semantically Significant Analysis (MSSA) model: parameter estimation

Probability distribution function:

Log-likelihood function:

Gaussier et al. [ACM SIGIR05]: maximizing the likelihood can be seen as a Nonnegative Matrix Factorization (NMF) problem under the generalized KL divergence.

Objective function:


Multilayer Semantically Significant Analysis (MSSA) model: parameter estimation

KKT conditions are used to derive the multiplicative update rules that minimize the objective function.
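To illustrate what multiplicative updates under the generalized KL divergence look like, here is a minimal sketch of the standard two-factor NMF updates of Lee and Seung (V ≈ WH). This is not the MSSA update scheme itself, which involves more factors than this toy case. The count matrix, the initial factors, and the function names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kl_divergence(v, wh, eps=1e-12):
    """Generalized KL divergence D(V || WH)."""
    return sum(x * math.log((x + eps) / (y + eps)) - x + y
               for row_v, row_wh in zip(v, wh)
               for x, y in zip(row_v, row_wh))

def nmf_kl(v, w, h, iterations=50, eps=1e-12):
    """Lee-Seung multiplicative updates for V ~ WH under generalized KL."""
    n, m, k = len(v), len(v[0]), len(h)
    history = []
    for _ in range(iterations):
        wh = matmul(w, h)
        # H[a][j] <- H[a][j] * sum_i W[i][a] V[i][j] / (WH)[i][j] / sum_i W[i][a]
        h = [[h[a][j]
              * sum(w[i][a] * v[i][j] / (wh[i][j] + eps) for i in range(n))
              / (sum(w[i][a] for i in range(n)) + eps)
              for j in range(m)] for a in range(k)]
        wh = matmul(w, h)
        # W[i][a] <- W[i][a] * sum_j H[a][j] V[i][j] / (WH)[i][j] / sum_j H[a][j]
        w = [[w[i][a]
              * sum(h[a][j] * v[i][j] / (wh[i][j] + eps) for j in range(m))
              / (sum(h[a][j] for j in range(m)) + eps)
              for a in range(k)] for i in range(n)]
        history.append(kl_divergence(v, matmul(w, h)))
    return w, h, history

# Hypothetical image-by-visual-word count matrix (4 images, 3 visual words).
v = [[3.0, 0.0, 1.0], [2.0, 0.0, 1.0], [0.0, 4.0, 1.0], [0.0, 3.0, 2.0]]
w0 = [[1.0, 0.5], [0.8, 0.4], [0.3, 1.0], [0.2, 0.9]]
h0 = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5]]
w, h, history = nmf_kl(v, w0, h0)
print(history[0], history[-1])
```

The updates keep every entry non-negative because each step only multiplies by non-negative ratios, which is why no explicit projection is needed.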

Multilayer Semantically Significant Analysis (MSSA) model: number of latent topics estimation

Minimum Description Length (MDL) is used as the model selection criterion for:
▪ The number of high latent topics (L)
▪ The number of visual latent topics (K)

The criterion is computed from the log-likelihood and the number of free parameters.
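A minimal sketch of MDL-based selection, assuming the common form MDL = -log-likelihood + (free parameters / 2) · log(number of observations). The exact expression used on the slide, including how the free parameters are counted, is not recoverable here, so the parameter counts and log-likelihoods below are invented placeholders.

```python
import math

def mdl_score(log_likelihood, num_free_params, num_observations):
    """Common MDL form: data cost plus a penalty that grows with model size."""
    return -log_likelihood + 0.5 * num_free_params * math.log(num_observations)

def select_topic_numbers(candidates, num_observations):
    """candidates maps (L, K) -> (log_likelihood, num_free_params)."""
    return min(candidates,
               key=lambda lk: mdl_score(*candidates[lk], num_observations))

# Hypothetical fitted models; the numbers are made up for illustration only.
candidates = {
    (5, 20): (-14000.0, 400),
    (10, 40): (-11800.0, 600),
    (20, 80): (-11600.0, 1500),
}
best = select_topic_numbers(candidates, num_observations=10000)
print(best)  # (10, 40)
```

The largest model fits slightly better but pays a larger penalty, so the middle setting gets the lowest score in this toy case.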


Semantically Significant Invariant Visual Glossary (SSIVG) representation

[Figure: overview pipeline with the SSIVG stage detailed. SSVWs selection, SSVW representation, SSVPs generation, SSVP representation, divisive theoretic clustering, SSIVW representation, SSIVP representation, SSIVG representation; preceded by the E-BOW and MSSA model stages]

Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Word (SSVW)

[Figure: the set of SSVWs is obtained from the set of VWs by an estimation using MSSA and the set of relevant visual topics]


Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Phrase (SSVP)

SSVP: higher-level and more discriminative representation
▪ SSVWs + their inter-relationships

SSVPs are formed from SSVW sets that satisfy all the following conditions:
▪ Occur in the same spatial context
▪ Are involved in strong association rules (high support and confidence)
▪ Have the same semantic meaning (high probability related to at least one common visual latent topic)
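The association-rule condition can be illustrated with a minimal sketch that computes support and confidence for pairs of visual words. The sketch covers only that condition (not the semantic one), requires the confidence threshold in both directions as an assumed choice, and uses invented thresholds, identifiers, and function names.

```python
from itertools import combinations

def pair_rules(contexts, min_support, min_confidence):
    """contexts: list of sets of visual-word ids that share one spatial context.

    Returns pairs whose support and both directional confidences pass the
    thresholds (requiring both directions is an assumption of this sketch).
    """
    n = len(contexts)
    item_count, pair_count = {}, {}
    for ctx in contexts:
        for w in ctx:
            item_count[w] = item_count.get(w, 0) + 1
        for pair in combinations(sorted(ctx), 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    strong = []
    for (a, b), c in pair_count.items():
        support = c / n                 # fraction of contexts containing both
        conf_ab = c / item_count[a]     # confidence of rule a -> b
        conf_ba = c / item_count[b]     # confidence of rule b -> a
        if support >= min_support and min(conf_ab, conf_ba) >= min_confidence:
            strong.append(((a, b), support, conf_ab, conf_ba))
    return sorted(strong)

# Hypothetical spatial contexts (e.g. neighborhoods within images).
contexts = [{"w1", "w2", "w3"}, {"w1", "w2"}, {"w1", "w2", "w4"},
            {"w3", "w4"}, {"w1", "w2"}]
strong = pair_rules(contexts, min_support=0.5, min_confidence=0.8)
print(strong)  # [(('w1', 'w2'), 0.8, 1.0, 1.0)]
```

Here w1 and w2 co-occur in 4 of 5 contexts and never appear apart, so they form the only candidate phrase.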


Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Phrase (SSVP)

[Figure: two example images annotated with the phrases SSIVP126, SSIVP326, and SSIVP304]

Semantically Significant Invariant Visual Glossary (SSIVG) representation: invariance problem

Studying the co-occurrence and spatial scatter information makes the image representation more discriminative.

The invariance power of SSVWs and SSVPs is still low.

In text documents, synonymous words can be clustered into one synonymy set to improve document categorization performance.


Semantically Significant Invariant Visual Glossary (SSIVG) representation

SSIVG: a higher-level visual representation composed of two different layers of representation:
▪ Semantically Significant Invariant Visual Word (SSIVW): re-indexed SSVWs after a distributional clustering
▪ Semantically Significant Invariant Visual Phrase (SSIVP): re-indexed SSVPs after a distributional clustering

[Figure: the set of SSVWs and SSVPs and the set of relevant visual topics are passed to divisive theoretic clustering, which produces the set of SSIVGs (set of SSIVWs and set of SSIVPs)]

Experiments
▪ Image retrieval
▪ Image classification
▪ Object recognition

Assessment of the SSIVG representation performance in image retrieval

Dataset     Total images   Training images   Test images   Image categories
NUS-WIDE    269,648        161,789           107,859       81

Evaluation criterion: Mean Average Precision (MAP)
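For reference, MAP can be computed as in the minimal sketch below, assuming each query yields a ranked list in which every item is marked relevant or not, and that all relevant items appear in the list. The function names and the two toy rankings are hypothetical.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: list of booleans, True where the ranked item is relevant."""
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    hits, score = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / total_relevant

def mean_average_precision(runs):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Two hypothetical queries with their ranked relevance judgments.
runs = [[True, False, True, False], [False, True, True]]
map_score = mean_average_precision(runs)
print(round(map_score, 4))  # 0.7083
```

The first query scores (1/1 + 2/3) / 2 and the second (1/2 + 2/3) / 2, giving a mean of 17/24.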

The traditional Vector Space Model of Information Retrieval is adapted:
▪ The weighting for the SSIVP
▪ Spatial weighting for the SSIVW

The inverted file structure
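A minimal sketch of an inverted file used with a vector space model, assuming plain tf-idf weights and an unnormalized dot-product score. The specific SSIVP weighting and the spatial weighting mentioned on the slide are not reproduced here; item identifiers and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_inverted_file(images):
    """images: dict image_id -> list of glossary item ids.

    Returns a mapping item -> {image_id: term frequency}.
    """
    index = defaultdict(dict)
    for image_id, items in images.items():
        for item, tf in Counter(items).items():
            index[item][image_id] = tf
    return index

def search(index, num_images, query_items):
    """Score only the images that share at least one item with the query."""
    scores = defaultdict(float)
    for item, q_tf in Counter(query_items).items():
        postings = index.get(item, {})
        if not postings:
            continue
        idf = math.log(num_images / len(postings))
        for image_id, tf in postings.items():
            scores[image_id] += (q_tf * idf) * (tf * idf)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical collection of three images described by glossary items.
images = {"img1": ["a", "a", "b"], "img2": ["b", "c"], "img3": ["c", "c", "d"]}
index = build_inverted_file(images)
results = search(index, len(images), ["a", "b"])
print([image_id for image_id, _ in results])  # ['img1', 'img2']
```

The inverted file avoids touching images that share nothing with the query, which matters for a collection of this size.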

Assessment of the SSIVG representation performance in image retrieval

[Figures: image retrieval results]


Evaluation of the SSIVG representation in image classification

Dataset      Total images   Training images   Test images   Image categories
MIRFLICKR    25,000         15,000            10,000        11

Evaluation criterion: classification Average Precision over each class

Classifiers:
▪ SVM with linear kernel
▪ Multiclass Vote-Based Classifier (MVBC)

Evaluation of the SSIVG representation in image classification: Multiclass Vote-Based Classifier (MVBC)

For each element of the representation, we detect the high latent topic that maximizes:

The final voting score for a high latent topic:

Each image is categorized according to the dominant high latent topic.
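The voting idea can be sketched as follows. The scoring formulas of the MVBC are not recoverable from the slide, so this sketch makes its own assumptions: each glossary item votes for its most probable high latent topic, the vote is weighted by that probability, and the probability table `topic_given_item` is a hypothetical input.

```python
from collections import defaultdict

def mvbc_classify(image_items, topic_given_item):
    """image_items: glossary items found in the image (with repetitions).
    topic_given_item: item -> {high_latent_topic: probability} (hypothetical).

    Each item votes for its most probable topic; votes are weighted by that
    probability (an assumption of this sketch).
    """
    votes = defaultdict(float)
    for item in image_items:
        probs = topic_given_item.get(item)
        if not probs:
            continue  # items without topic information cast no vote
        best_topic = max(probs, key=probs.get)
        votes[best_topic] += probs[best_topic]
    if not votes:
        return None, {}
    winner = max(votes, key=votes.get)
    return winner, dict(votes)

# Hypothetical probabilities for three glossary items and two topics.
topic_given_item = {
    "ssivw_1": {"sky": 0.7, "people": 0.3},
    "ssivw_2": {"sky": 0.4, "people": 0.6},
    "ssivp_1": {"sky": 0.2, "people": 0.8},
}
image_items = ["ssivw_1", "ssivw_2", "ssivp_1", "ssivp_1"]
label, votes = mvbc_classify(image_items, topic_given_item)
print(label)  # people
```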

Evaluation of the SSIVG representation performance in classification



Assessment of the SSIVG representation performance in object recognition

Dataset       Total images   Training images   Test images   Image categories
Caltech101    8,707          7,697             1,010         101

Each test image is recognized by predicting the object class using the SSIVG representation and the MVBC.

Evaluation criterion: classification Average Precision (AP) over each object class

Assessment of the SSIVG Representation Performance in Object Recognition



Conclusion

Enhanced BOW (E-BOW) representation:
▪ Modeling the spatial-color image constitution using GMM
▪ New local feature descriptor (edge context)
▪ Efficient visual word vocabulary structure

New Multilayer Semantically Significant Analysis (MSSA) model:
▪ Semantic inferences of different layers of representation

Semantically Significant Invariant Visual Glossary (SSIVG):
▪ More discriminative
▪ More invariant to visual diversity

Experimental validation:
▪ Outperforms other state-of-the-art works

Perspectives

MSSA parameters update:
▪ On-line algorithms to continuously (re-)learn the parameters

Invariance issue:
▪ Context of large-scale databases where large intra-class variations can occur

Cross-modality extension to video content:
▪ Cross-modal data (visual content and textual closed captions)

New generic framework of video summarization:
▪ Study the semantic coherence between visual contents and textual captions

Questions?

Thank you for your attention

Parameter Settings

