
Feature-Independent Context Estimation for Automatic Image Annotation

Amara Tariq, Hassan Foroosh
The Computational Imaging Lab., Computer Science, University of Central Florida, Orlando, FL, USA.

Automatic image annotation is an important tool for image search, retrieval, and archival systems. The automatic annotation process usually suffers from the semantic gap, i.e., the lack of any correlation between low-level visual features and textual annotations. We propose to tackle this problem by incorporating the context of an image, as well as its contents, into the annotation process. We present a unique strategy for context estimation, based on tensor analysis of raw images. Tensors have been employed as a natural representation scheme for videos; the novelty of our approach is to use tensor analysis to extract useful context information from raw images. The proposed context estimation process is feature-independent, and thus avoids the semantic gap problem associated with any form of low-level visual features.

We propose a three-step process for context estimation.

• Images are clustered into groups such that

– Each cluster is a representative of some context category, defined by distinctive words used in the descriptions of its member images.

– All member images of every cluster have enough visual similarity to each other to be able to define some visual signature for the corresponding context category.

We employ a hierarchical clustering technique, based on cosine similarity between tf-idf vectors of the descriptions of training images, to construct these context categories (a minimal code sketch follows Figure 1). The nature of the tf-idf representation ensures that the clustering process emphasizes grouping together images with similar distinctive words in their descriptions. If the word 'sky' is common across image descriptions, it is not a distinctive word; if the word 'snow' appears in the descriptions of only a few images, it is a distinctive word for those images. High similarity between image descriptions indicates a reasonable level of visual similarity between the images.

Figure 1: An example of a context group formed on the basis of similarity in textual descriptions.
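The clustering step can be sketched in a few lines. The snippet below is a minimal illustration assuming scikit-learn; the toy `descriptions` list, the number of clusters, and the `average` linkage are illustrative assumptions, not choices taken from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# Toy textual descriptions of training images (illustrative only).
descriptions = [
    "snow covered mountain under a clear sky",
    "skiers on a snowy mountain slope",
    "a sandy beach with palm trees and sky",
]

# tf-idf down-weights common words ("sky") and emphasizes distinctive
# ones ("snow", "beach"), so clusters form around distinctive vocabulary.
X = TfidfVectorizer(stop_words="english").fit_transform(descriptions).toarray()

# Hierarchical (agglomerative) clustering with cosine similarity between
# tf-idf vectors. (scikit-learn >= 1.2 uses `metric`; older versions `affinity`.)
clustering = AgglomerativeClustering(n_clusters=2, metric="cosine",
                                     linkage="average")
labels = clustering.fit_predict(X)
print(labels)  # e.g., [0, 0, 1]: the two snow images share a context group
```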

• Images in one cluster/context category are resized to a fixed height and width, converted to gray-scale, blurred by a Gaussian filter, and concatenated together to form a context tensor, denoted by T_c. Rank-1 Tucker decomposition generates three vectors and a scalar value for this tensor. The vector R, corresponding to the tensor dimension that indexes the images, contains information on how each image relates to the other images of the same context category. This vector forms a compact signature of the corresponding context category (a code sketch of this step follows Figure 2).

This is an extended abstract. The full paper is available at the Computer Vision Foundation webpage.

Figure 2: Rank-1 Tucker decomposition: S is a scalar; P, Q, and R are vectors; R has dimensions Image-indices × 1, where 1 represents the single context group represented by the tensor. (The figure labels the three tensor modes Height, Width, and image index.)
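As a concrete illustration of this step, the sketch below builds a context tensor from member images and extracts the signature vector R via rank-1 Tucker decomposition. It assumes NumPy, SciPy, scikit-image, and TensorLy; the image size and blur sigma are assumed values, not parameters reported in the paper.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker
from scipy.ndimage import gaussian_filter
from skimage.color import rgb2gray
from skimage.transform import resize

def build_context_tensor(images, height=64, width=64, sigma=2.0):
    """Resize, gray-scale, and Gaussian-blur member images, then stack
    them along a third axis to form the context tensor T_c."""
    slices = []
    for img in images:
        g = rgb2gray(img) if img.ndim == 3 else img
        g = resize(g, (height, width), anti_aliasing=True)
        slices.append(gaussian_filter(g, sigma=sigma))
    return np.stack(slices, axis=2)  # shape: (height, width, n_images)

# Rank-1 Tucker decomposition: a 1x1x1 core (the scalar S) and three factor
# vectors: P (height mode), Q (width mode), R (image-index mode).
images = [np.random.rand(80, 120, 3) for _ in range(10)]  # stand-in images
T_c = build_context_tensor(images)
core, (P, Q, R) = tucker(tl.tensor(T_c), rank=[1, 1, 1])
R = np.asarray(R).ravel()  # compact signature of the context category
```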

• After forming the signatures of the context categories defined over the training data, the next step is to calculate the association of a test image I_o to all context categories as a probability distribution P(T_c | I_o). Each tensor T_c is updated by swapping every L-th image in the tensor with the test image I_o. Tucker decomposition is applied to calculate the updated signature vector R′. Insertion of a foreign entity, such as a test image, disturbs entries at and around every L-th location in vector R. The amount of disturbance is proportional to the dissimilarity between the foreign entity and the images in the neighborhood of every L-th index (a code sketch of this step follows Eq. (1)). P(T_c | I_o) is given by

\[
P(T_c \mid I_o) = \frac{\exp\!\big(-(R' - R)^{\top}\, \gamma^{-1}\, (R' - R)\big)}{\sqrt{2\pi\,|\gamma|}} \tag{1}
\]

where γ is assumed to be a uniform diagonal matrix whose diagonal entries are equal to an empirically selected constant.
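A minimal sketch of this association step, under the same assumptions as above: every L-th slice of the context tensor is replaced by the (preprocessed) test image, the rank-1 Tucker signature is recomputed, and Eq. (1) is evaluated with a uniform diagonal γ. The values of `L` and `gamma_const` are illustrative; the paper selects γ empirically.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

def signature(T):
    """Rank-1 Tucker signature R of a context tensor (image-index factor)."""
    _, (_, _, R) = tucker(tl.tensor(T), rank=[1, 1, 1])
    return np.asarray(R).ravel()

def context_score(T_c, test_image, L=5, gamma_const=0.01):
    """Association score P(T_c | I_o) per Eq. (1).
    `test_image` is assumed already resized, gray-scaled, and blurred."""
    R = signature(T_c)
    T = T_c.copy()
    T[:, :, ::L] = test_image[:, :, None]  # swap every L-th image for I_o
    R_prime = signature(T)
    # Tucker factor signs are arbitrary; aligning R' with R before comparing
    # is a practical detail assumed here, not discussed in the abstract.
    if R @ R_prime < 0:
        R_prime = -R_prime
    diff = R_prime - R
    det_gamma = gamma_const ** diff.size  # |gamma| for a uniform diagonal gamma
    return np.exp(-(diff @ diff) / gamma_const) / np.sqrt(2 * np.pi * det_gamma)
```

Scores computed this way for all context tensors can then be normalized across categories to yield the distribution P(T_c | I_o).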


Figure 3: Comparison of the rank-1 Tucker decomposition with a visually similar and a visually dissimilar image inserted into a tensor. Blue curve: original Tucker decomposition vector R; green curve: new Tucker decomposition vector R′ with an image visually similar to the images of the context group inserted into the context tensor; red curve: new decomposition vector R′ with a visually dissimilar image inserted into the tensor.

We employ an expectation process weighted by the estimated context, inspired by the relevance model from machine translation, that maximizes the following equation

\[
P(w, r \mid I_o) = \sum_{T_c \in T} P(T_c \mid I_o) \sum_{I \in \mathcal{I}} P(I \mid T_c) \prod_{b \in B} P_{V_{T_c}}(w_b \mid I) \prod_{a \in A} P_R(r_a \mid I) \tag{2}
\]

Each image I ∈ 𝓘 is made up of a few visual units r_a and words w_b, where 𝓘 denotes the training set. V_{T_c} denotes the vocabulary set of context category T_c. A 'general' context category is added to service words that are not distinctive words of any category. P(I | T_c) is a step distribution. P_{V_{T_c}}(w_b | I) is the w_b-th component of a multiple-Bernoulli distribution over V_{T_c}. P_R(r_a | I) is computed by placing a Gaussian kernel over the distance between r_a and the corresponding visual unit of I.
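The expectation in Eq. (2) can be written almost verbatim as nested sums and products. The sketch below is a direct transcription under assumed interfaces: the callables `p_context`, `p_image`, `p_word`, and `p_visual` stand for the quantities defined above, and their signatures are illustrative, not prescribed by the paper.

```python
def annotation_score(words, visual_units, contexts, train_images,
                     p_context,   # P(T_c | I_o), from Eq. (1), per context
                     p_image,     # P(I | T_c), step distribution
                     p_word,      # P_{V_Tc}(w_b | I), multiple Bernoulli
                     p_visual):   # P_R(r_a | I), Gaussian kernel on distance
    """P(w, r | I_o) per Eq. (2): a context-weighted sum over training
    images of the product of word and visual-unit likelihoods."""
    score = 0.0
    for c in contexts:                       # sum over context categories T_c
        inner = 0.0
        for img in train_images:             # sum over training images I
            term = p_image(img, c)
            for w in words:                  # product over annotation words w_b
                term *= p_word(w, img, c)
            for r in visual_units:           # product over visual units r_a
                term *= p_visual(r, img)
            inner += term
        score += p_context[c] * inner
    return score
```

In relevance-model annotation systems of this family, tagging a test image then typically reduces to ranking candidate words by this score.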

We evaluated the proposed context-sensitive annotation process on the IAPR-TC 12 and ESP Game datasets. Our experiments indicate that the proposed scheme achieves far better results than other relevance-model-inspired methods and greedy-algorithm-based systems, e.g., CRM, MBRM, JEC, Lasso, etc. The proposed scheme also outperforms various time-intensive iterative-optimization-based methods such as TagProp and FastTag.

http://www.imageclef.org/photodata
http://hunch.net/~jl/