
Image Auto-annotation With Graph Learning

Yu Tang Guo
Department of Computer Science and Technology
Hefei Normal University, Hefei, China

[email protected]

Bin Luo
School of Computer Science and Technology
Anhui University, Hefei, China

[email protected]

Abstract—It is important to integrate contextual information in order to improve the performance of automatic image annotation. Graph-based representations allow the incorporation of such information. In this paper, we propose a graph-based approach to automatic image annotation which models both feature similarities and semantic relations in a single graph. The annotation quality is enhanced by introducing graph link weighting techniques based on inverse document frequency and on word similarity derived from co-occurrence relations in the training set. Exploiting the linear correlation and the block-wise, community-like structure of the modeled graph, we divide the graph into several subgraphs and approximate the high-rank adjacency matrix of the graph by low-rank matrices, so that images can be annotated quickly. Experimental results on the Corel image database show the effectiveness of the proposed approach.

Index Terms—image annotation, graph learning, random walk with restart, fast solution

I. INTRODUCTION

With the advent of digital imagery, the number of digital images has been growing rapidly, and an efficient image retrieval system is desirable: given a large database, we may need, for example, to find the images that contain tigers, or, given an unseen image, to find the keywords that best describe its content.

Early Content-Based Image Retrieval (CBIR) systems were based solely on indexing low-level visual features such as color histograms, textures, shapes and spatial layout. However, visual similarity is not semantic similarity; there is a gap between low-level visual features and semantic meaning. This so-called semantic gap is the major problem that needs to be solved for most CBIR approaches.

A solution towards bridging the semantic gap is to index images using semantic features, such as keywords, that describe the content of the image. The majority of automatic image annotation systems incorporate machine learning approaches for finding correlations between image visual features and the words used to annotate images in a training set. The learnt correlations can then be used to annotate new images. Automatic image annotation is therefore an important and challenging task: it can shorten the semantic gap in content-based image retrieval, and with automatic annotation, image retrieval can build on current powerful pure-text retrieval techniques.

II. RELATED WORKS

There has been much research on this subject. Mori et al. proposed a Co-occurrence Model [1], in which they looked at the co-occurrence of words with image regions created using a regular grid. Duygulu et al. proposed a translation model [2], in which images are described using a vocabulary of blobs. First, regions are created using a segmentation algorithm such as normalized cuts. For each region, features are computed, and blobs are generated by clustering the image features of these regions across images. Each image is generated from a certain number of these blobs, and the semantic concepts are translated into the image blobs. Another approach, the Cross-Media Relevance Model (CMRM), was introduced by Jeon et al. [3]. Here the joint distribution of blobs and words is learned from a training set of annotated images; a set of keywords is assumed to be related to the set of blobs in an image, rather than assuming a one-to-one correspondence between blob tokens and keywords. Carneiro and Vasconcelos presented Supervised Multiclass Labeling (SML), with optimization and statistical classification based on the minimum error rate criterion [4-5]. Domestic scholars have also carried out relevant studies on automatic image annotation. In order to achieve concept-based indexing for image annotation, the paper [6] mapped low-level image features to high-level semantic features by using a multi-class classifier based on support vector machines (SVM).

In recent years, graph-based automatic image annotation methods have been proposed. In [7], a graph-based approach is presented in which image visual content and keywords are incorporated by manifold ranking. A nearest spanning chain procedure is used to derive the similarity between images and keywords in an adaptive graph-like structure. After performing a random walk with restart on the graph, candidate keywords are output and inconsistent ones are filtered out by a word-correlation procedure. In [8], Pan et al. proposed the GCap algorithm, which models relationships between images and words by connecting them in an undirected graph. Image nodes are linked to each other based on their similarity measured in the image feature space, while image and word nodes are linked based on the prior knowledge provided by the human-annotated images of the training set. To annotate a new image, one appends it to the most similar images of the trained graph and performs a random


walk with restart algorithm to estimate the steady-state probability of each annotation word for the image to be annotated.

GCap is thus a graph-based learning method for automatic image annotation, but it has some limitations: in GCap, only the similarity of the image regions is used.

Image semantics are vague, complex and abstract; low-level features alone are therefore not enough to describe image semantics, and related content in an image must be combined in order to improve the accuracy of image annotation. In this paper, a graph-based image annotation algorithm is proposed. The proposed approach models the relationships between images and words with an undirected graph. Semantic information is extracted from paired nodes, and inverse document frequency (IDF) [9] is used to amend the edge weights between an image node and its keywords, which overcomes the bias caused by high-frequency words in traditional methods and effectively improves the annotation performance. Based on an analysis of the structure of the graph, a fast solution algorithm is also proposed.

III. PROPOSED METHOD OVERVIEW

A. Modeling relationships among images and words with a graph

Let $T = \{i_1, i_2, \ldots, i_n\}$ be the training set of images, where each training image $i \in T$ is represented by a visual feature $f_i$, and let $w = \{w_1, w_2, \ldots, w_l\}$ be the list of keywords, where $w_i$ ($i = 1, 2, \ldots, l$) is the $i$-th word in the list. We use an undirected graph $G = \langle V, E \rangle$ to represent the relationships among images and words. Image nodes are linked to their $k$ nearest neighboring image nodes based on their similarity measured in the image feature space; the edge weight is denoted $sim(f_i, f_j)$. Image and word nodes are linked according to the prior knowledge provided by the human-annotated images of the training set; the edge weight is denoted $sim(i, w_i)$. Similarly, each word node is linked to its $k$ nearest neighboring word nodes based on their similarity measure; the edge weight is denoted $sim(w_i, w_j)$. The resulting structure is shown in Figure 1.

Fig. 1. Diagram of the relationships among images and words in the graph.

We adopt the uniform Local Binary Pattern (LBP) feature to represent each image. The LBP feature is a compact texture descriptor in which each comparison result between a center pixel and one of its surrounding neighbors is encoded as a bit in an LBP code. LBP codes are computed for every pixel and accumulated into a histogram to represent the whole image.
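For illustration, a minimal sketch of such an LBP histogram is given below, assuming a 2-D grayscale numpy array as input; plain 8-neighbor LBP is used for brevity, whereas the uniform-pattern variant used in the paper additionally merges non-uniform codes into a single bin.

```python
import numpy as np

def lbp_histogram(gray, n_bins=256):
    """Sketch of the LBP descriptor described above: each pixel's 8
    neighbors are compared with the center pixel, the comparison bits
    form an 8-bit code, and the codes are pooled into a histogram
    representing the whole image.  (Plain LBP; the paper uses the
    uniform-pattern variant.)"""
    g = gray.astype(np.float32)
    h, w = g.shape
    c = g[1:-1, 1:-1]                                   # center pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]        # 8 neighbors, clockwise
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neigh >= c).astype(np.uint8) << bit
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist / hist.sum()                            # normalized histogram
```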

The weight between two images is computed by equation (1):

$$sim(f_i, f_j) = \begin{cases} \exp\left(-\|f_i - f_j\|^2/\sigma\right), & \text{if } f_j \text{ is a KNN of } f_i \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
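A minimal sketch of this weighting, assuming the images' LBP histograms are stacked row-wise in a numpy array `feats`; the function name and the values of `k` and `sigma` are illustrative, not taken from the paper:

```python
import numpy as np

def image_edge_weights(feats, k=5, sigma=1.0):
    """Sketch of equation (1): connect each image to its k nearest
    neighbors in feature space with a Gaussian-kernel weight."""
    n = feats.shape[0]
    # pairwise squared Euclidean distances between feature vectors
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        knn = np.argsort(d2[i])[1:k + 1]      # skip the image itself
        W[i, knn] = np.exp(-d2[i, knn] / sigma)
    return np.maximum(W, W.T)                 # symmetrize to keep the graph undirected
```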

The similarity between keywords can be obtained from their co-occurrence, based on probability statistics over the data set [10]. The co-occurrence similarity is expressed by formula (2):

$$sim(w_i, w_j) = \frac{N(w_i, w_j)}{\min\left(N(w_i), N(w_j)\right)} \qquad (2)$$

where $N(w_i, w_j)$ is the co-occurrence frequency of word $w_i$ and word $w_j$, and $N(w_i)$ and $N(w_j)$ are the occurrence frequencies of $w_i$ and $w_j$ in the training set, respectively.

According to the similarity $sim(w_i, w_j)$, two word nodes are connected by an edge if and only if they are among each other's $k$ nearest neighbors.
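Assuming the training annotations are available as a list of keyword lists (one per image), a minimal sketch of this co-occurrence similarity is:

```python
from collections import Counter
from itertools import combinations

def word_similarity(annotations):
    """Sketch of equation (2): sim(wi, wj) = N(wi, wj) / min(N(wi), N(wj)),
    with all counts taken over the training annotation sets."""
    N = Counter()      # occurrence frequency N(wi) of each word
    N2 = Counter()     # co-occurrence frequency N(wi, wj) of word pairs
    for words in annotations:
        words = set(words)
        N.update(words)
        N2.update(frozenset(p) for p in combinations(sorted(words), 2))
    sim = {}
    for pair, n_ij in N2.items():
        wi, wj = tuple(pair)
        sim[(wi, wj)] = sim[(wj, wi)] = n_ij / min(N[wi], N[wj])
    return sim
```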

In the training set, word frequencies vary widely; for example, in the Corel image database the words Water, Sky and Tree appear far more often than Race or Canoe. Such high-frequency words occupy a dominant position in the similarity matrix, and regardless of the image to be annotated, they will bias the final annotation results.

In order to overcome this bias toward high-frequency words, we use inverse document frequency (IDF) to amend the edge weights between an image node and its annotation words.

In image annotation, an image is analogous to a document and its annotation words are analogous to the keywords of the document. $df(w_i)$ denotes the number of occurrences of the word $w_i$ in the training set. We use formula (3) to calculate the edge weight between an image node and its annotation words:

$$sim(i, w_j) = \left((1-\lambda) + \lambda \log\frac{1 + |w|}{df(w_j)}\right)\delta(i, w_j) \qquad (3)$$

where $|w|$ is the size of the annotation vocabulary, $\delta(i, w_j)$ equals 1 when $w_j$ is an annotation word of image $i$ and 0 otherwise, and $\lambda$ is a smoothing factor used to adjust the relative weight of high-frequency and low-frequency words in the training set.
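A small sketch of this weighting follows; the helper name and the `df` mapping are illustrative, and the default $\lambda = 0.3$ follows the experimental setting in Section IV:

```python
import math

def image_word_weight(image_words, w_j, df, vocab_size, lam=0.3):
    """Sketch of equation (3): an IDF-style factor damps the influence
    of high-frequency annotation words.  image_words is the keyword set
    of image i, df maps each word to df(w), vocab_size is |w|."""
    if w_j not in image_words:
        return 0.0                                  # delta(i, w_j) = 0
    return (1 - lam) + lam * math.log((1 + vocab_size) / df[w_j])
```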

B. Annotating a new image

To annotate a new image, one appends it to its $K$ nearest neighbor images in the trained graph and performs a random walk with restart (RWR) [11]. RWR is a Markov process started from an initial node and iterated as equation (4) shows:

$$R_{n+1} = cAR_n + (1-c)Y \qquad (4)$$


where $R$ is an $N$-dimensional vector of the walk probabilities over all nodes, $N$ is the number of nodes, and $A$ is the adjacency matrix of graph $G$. $(1-c)$ is the probability of jumping back to the initial node at each step of the random walk. $Y$ is an $N$-dimensional vector with all elements zero except for a "1" at the position corresponding to the initial node. To ensure the convergence of equation (4), $A$ is normalized along its columns and the elements of $R$ are kept summing to one. When equation (4) has converged, the word nodes are sorted in descending order of their corresponding $R$ components, and the first $n$ (e.g. $n = 5$) words are taken as the annotation of the image.
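The iteration of equation (4) can be sketched directly as follows; the function name, tolerance and iteration cap are illustrative, while $c = 0.35$ follows the setting used in the experiments, and `A` is assumed to be column-normalized as described above:

```python
import numpy as np

def rwr(A, start_idx, c=0.35, tol=1e-8, max_iter=1000):
    """Iterate equation (4), R_{n+1} = c*A*R_n + (1-c)*Y, to convergence.
    start_idx is the node of the image to be annotated."""
    n = A.shape[0]
    Y = np.zeros(n)
    Y[start_idx] = 1.0                      # restart distribution
    R = Y.copy()
    for _ in range(max_iter):
        R_next = c * (A @ R) + (1 - c) * Y
        if np.abs(R_next - R).sum() < tol:  # L1 change below tolerance
            return R_next
        R = R_next
    return R
```

The word-node components of the returned vector are then sorted to pick the top $n$ annotation words.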

C. Preprocessing

In the annotation process, equation (4) is iterated until it converges, which usually takes a long time. Especially in an image annotation system with a large training dataset, much time is spent running RWR, because the time complexity of the RWR iteration is proportional to the number of iterations and the size of the graph. A fast solution is therefore urgently needed. Based on an analysis of the structure of the graph, a fast algorithm is proposed in this paper. While keeping comparable annotation accuracy, our algorithm avoids the iterative computation and obtains the result efficiently.

Equation (4) can be reformulated as:
$$R_{n+1} = (1-c)(I - cA)^{-1}Y \qquad (5)$$

By setting $Q = I - cA$, equation (5) can be rewritten as:
$$R_{n+1} = (1-c)Q^{-1}Y \qquad (6)$$

Equation (6) is a closed form of RWR. In practice, however, the direct solution cannot be obtained, because the space and time complexity of computing $Q^{-1}$ are $O(n^2)$ and $O(n^3)$, respectively. For an image annotation system processing large amounts of data, direct computation of $Q^{-1}$ is infeasible.

We notice that the matrix $A$ has two prominent properties. Firstly, its rows and columns are linearly correlated. Secondly, it consists of sparse and dense block regions. Based on these two properties of the modeled graph, and balancing annotation accuracy against speed, we propose a fast algorithm to approximate the RWR solution.

First of all, according to the first property, dimension reduction is used to obtain a low-rank matrix that approximates $Q^{-1}$. Based on the second property, the graph $G$ can be subdivided into $k$ subgraphs; the adjacency matrix of each subgraph is denoted $A_{1,i}$ ($i = 1, 2, \ldots, k$), and the adjacency matrix among subgraphs is denoted $A_2$. In this way, inverting the high-rank matrix $Q$ is transformed into inverting several low-rank matrices. The detailed algorithm is described as follows.

Suppose $V$ is the node set of graph $G$, with size $n$. Normalize $A$ with the Laplacian normalization:

$$L = D^{-1/2} A D^{-1/2} \qquad (7)$$

where $D_{ii} = \sum_{k=1}^{n} A_{ik}$ and $D_{ij} = 0$ for $i \neq j$. According to the k-way normalized segmentation algorithm [12], the graph $G$ is subdivided into $k$ subgraphs, and $L$ is decomposed into the sum of two matrices as equation (8) shows:

$$L = L_1 + L_2 \qquad (8)$$

where $L_2$ is the adjacency matrix among all subgraphs and $L_1$ is represented as:

$$L_1 = \begin{bmatrix} L_{1,1} & 0 & \cdots & 0 \\ 0 & L_{1,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & L_{1,k} \end{bmatrix} \qquad (9)$$

where $L_{1,i}$ ($i = 1, 2, \ldots, k$) is the adjacency matrix of each subgraph. $L_2$ is factored by eigenvalue decomposition, $L_2 = USV$. Then:

$$L = L_1 + USV \qquad (10)$$

Hence

$$(I - cL)^{-1} = (I - cL_1 - cUSV)^{-1} \qquad (11)$$

Based on the Sherman-Morrison lemma [13], in combination with equation (10), we obtain:

$$(I - cL)^{-1} = (I - cL_1 - cUSV)^{-1} = Q_1^{-1} + cQ_1^{-1}U\left(S^{-1} - cVQ_1^{-1}U\right)^{-1}VQ_1^{-1} = Q^{-1}$$

where $Q_1 = I - cL_1$. According to equation (6),

$$R_{n+1} = (1-c)Q^{-1}Y = (1-c)\left(Q_1^{-1} + cQ_1^{-1}U\left(S^{-1} - cVQ_1^{-1}U\right)^{-1}VQ_1^{-1}\right)Y \qquad (12)$$

Since $L_1$ is a block-diagonal matrix, $Q_1^{-1}$ can be rewritten as:

$$Q_1^{-1} = (I - cL_1)^{-1} = \begin{bmatrix} Q_{1,1}^{-1} & 0 & \cdots & 0 \\ 0 & Q_{1,2}^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Q_{1,k}^{-1} \end{bmatrix} \qquad (13)$$

where $Q_{1,i}^{-1} = (I - cL_{1,i})^{-1}$.

According to equations (12) and (13), computing the inverse $Q^{-1}$ of the high-rank matrix $Q$ is transformed into solving the low-rank matrices $Q_{1,i}^{-1}$ ($i = 1, 2, \ldots, k$) and $(S^{-1} - cVQ_1^{-1}U)^{-1}$. If these results are computed and stored in advance, $Q^{-1}$, and hence $R$, can be obtained almost instantly.
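A sketch of this offline/online split is given below, under the following assumptions: the k-way partition of the nodes is supplied as a list of index arrays (`blocks`), $L_2$ is approximated by a rank-`t` truncated eigendecomposition (`t` is an illustrative parameter), and dense numpy arrays are used throughout. The function names are hypothetical; the structure mirrors the fast RWR decomposition of [13].

```python
import numpy as np

def preprocess(L, blocks, c=0.35, t=50):
    """Offline stage (equations (8)-(13)): split L into the within-block
    part L1 and the cross-block part L2, invert each block of
    Q1 = I - c*L1, eigendecompose L2 (approximately) as U S V, and cache
    Lambda = (S^{-1} - c V Q1^{-1} U)^{-1}."""
    L1 = np.zeros_like(L)
    for idx in blocks:
        L1[np.ix_(idx, idx)] = L[np.ix_(idx, idx)]
    L2 = L - L1
    # block-wise inverse of Q1 = I - c*L1  (equation (13))
    Q1_inv = np.zeros_like(L)
    for idx in blocks:
        Q1_inv[np.ix_(idx, idx)] = np.linalg.inv(
            np.eye(len(idx)) - c * L[np.ix_(idx, idx)])
    # rank-t eigendecomposition of the symmetric cross-block part: L2 ~ U S V
    vals, vecs = np.linalg.eigh(L2)
    keep = np.argsort(-np.abs(vals))[:t]
    U, S, V = vecs[:, keep], np.diag(vals[keep]), vecs[:, keep].T
    # Lambda = (S^{-1} - c V Q1^{-1} U)^{-1}, the inner term of equation (12)
    Lam = np.linalg.inv(np.linalg.inv(S) - c * V @ Q1_inv @ U)
    return Q1_inv, U, Lam, V

def rwr_fast(Q1_inv, U, Lam, V, start_idx, c=0.35):
    """Online stage (equation (12)):
    R = (1-c) * (Q1^{-1} + c Q1^{-1} U Lambda V Q1^{-1}) Y."""
    Y = np.zeros(Q1_inv.shape[0])
    Y[start_idx] = 1.0
    q1y = Q1_inv @ Y
    return (1 - c) * (q1y + c * (Q1_inv @ (U @ (Lam @ (V @ q1y)))))
```

In the online stage only a handful of matrix-vector products remain, which is what makes annotating a new image fast.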


D. Complexity analysis

The complexity of the proposed algorithm consists of the offline preprocessing and the online computation of $R$. We discuss both the time complexity and the space complexity as follows.

1) Time complexity of the online computation of $R$: according to equation (12), the computation of $R$ decomposes into six matrix multiplications. The online computation of $R$ meets the requirement of real-time output, since $Q_1^{-1}$, $(S^{-1} - cVQ_1^{-1}U)^{-1}$, $U$ and $V$ have already been calculated and stored during the offline preprocessing stage.

2) Complexity of the offline preprocessing, which includes the following steps:
a) subdivision of the graph $G$;
b) inversion of the $k$ matrices $Q_{1,i} = I - cL_{1,i}$, $i = 1, 2, \ldots, k$;
c) eigenvalue decomposition of $L_2$;
d) computation of $(S^{-1} - cVQ_1^{-1}U)^{-1}$.
The computation of $(S^{-1} - cVQ_1^{-1}U)^{-1}$ has the same order of time complexity as the computation of the $Q_{1,i}^{-1}$, so the time complexity of the whole offline preprocessing depends on the $k+1$ small-scale matrix inversions, the eigenvalue decomposition of $L_2$ and the subdivision of the graph $G$.

As for space complexity, only the $k+1$ small-scale matrices, the matrix $U$ and the matrix $V$ need to be stored during the offline preprocessing stage. In addition, many elements of $Q_{1,i}^{-1}$, $U$ and $V$ are approximately 0; setting these elements to 0 does not affect the output precision, so the storage space can be significantly reduced by expressing $Q_{1,i}^{-1}$, $U$ and $V$ as sparse matrices.

On the other hand, because of the symmetry of the normalized weight matrix $L$, the $k$ low-order matrices are symmetric and $U = V^T$; the storage space is reduced by a further 50% thanks to this symmetry.
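For illustration, a hedged sketch of this sparsification step; the truncation threshold is an assumed value, not given in the paper:

```python
import numpy as np
from scipy import sparse

def sparsify(M, tol=1e-4):
    """Zero out near-zero entries of a precomputed matrix (a Q_{1,i}^{-1}
    block, U or V) and store it in compressed sparse row format."""
    M = np.where(np.abs(M) < tol, 0.0, M)
    return sparse.csr_matrix(M)
```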

IV. EXPERIMENTAL RESULTS AND ANALYSIS

1) Experimental data set: In order to validate our algorithm and compare it with the algorithm referred to in paper [8], we test it on a Corel image database of 500 images, of which 450 are used as training images and 50 as test images.
2) Parameter setup: The parameters have the same settings in both algorithms, i.e., $c = 0.35$, $\mu = 0.4$ and $\lambda = 0.3$.

Experiment 1: effect of the number of subgraphs $k$. We investigate the effect of the number of subgraphs on performance, including the storage space of preprocessing, the preprocessing time, the online computation time of $R$ and the output accuracy. The results are reported in Fig. 2.

Figure 2. Algorithm performance under different numbers of subgraphs $K$ ($K$ = 1, 50, 100, 700): storage space of preprocessing, preprocessing time (s) and online computation time of $R$ (ms).

When $K = 1$, we have $L_1 = L$ and $L_2 = 0$; this case is equivalent to computing $R$ by formula (6) directly, so a longer preprocessing time (840 s) is required. On the other hand, the output accuracy is not affected, since no approximation is made. When $K = 700$, the preprocessing time grows owing to the increased number of subgraphs, and the output accuracy decreases because of the increased approximation.

Experiment 2: comparison with the related algorithm. Figure 3 shows the annotation results.

Figure 3. Image annotation results (four example images).

Image 1. Ground truth: clouds, sky, sun. GCap: sun, sky, water, black, sun. Proposed algorithm: sunset, clouds, sea, waves, sky.
Image 2. Ground truth: bears, black, grass. GCap: bears, sky, water, black, sun. Proposed algorithm: black, bears, grass, water, ground.
Image 3. Ground truth: sky, tree. GCap: sky, water, plane, tree, jet. Proposed algorithm: sky, village, tree, sand, house.
Image 4. Ground truth: formation, mountain, sky. GCap: water, sky, bears, beach, tree. Proposed algorithm: mountain, hills, sky, grass, river.

Table 1 shows the experimental results, where the recall and precision are averaged over 260 words for comparison with GCap. From Table 1 we can see that the proposed algorithm significantly improves the running speed without an apparent effect on recall and accuracy, although both show a slight decrease.


Table 1. Comparison with other algorithms

Method               Avg. precision   Avg. recall   Accuracy
GCap                 0.1308           0.1607        0.3431
Proposed algorithm   0.1501           0.1892        0.3687

V. CONCLUSIONS

An automatic image annotation algorithm based on a graph has been proposed. Firstly, an undirected graph is employed to integrate the correlations among low-level features and words. Then, image annotation is carried out by Random Walk with Restart (RWR). The performance benefits from the combined use of the graph model and inverse document frequency (IDF). The simulation experiments on the Corel image database demonstrate the effectiveness of the proposed algorithm: the recall and the accuracy are improved compared with the GCap method proposed by Pan et al. In addition, based on an analysis of the structure of the graph, we have proposed a fast algorithm which avoids the iterative calculation and achieves a fast solution without apparently affecting the annotation accuracy. The experiments show satisfactory results for the proposed algorithm.

ACKNOWLEDGEMENTS

The authors would like to thank Kobus Barnard and Pinar Duygulu for making their dataset available. This research is supported by the National Natural Science Foundation of China (Grant 60374044) and the Natural Science Foundation of the Anhui Education Office under grant KJ2009A150.

REFERENCES

[1] Mori Y, Takahashi H, Oka R. Image-to-word transformation based on dividing and vector quantizing images with words. In: MISRM'99 First International Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.
[2] Duygulu P, Barnard K, de Freitas J F G, et al. Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. Lecture Notes in Computer Science. Heidelberg: Springer, 2002, 2353: 97-112.
[3] Jeon J, Lavrenko V, Manmatha R. Automatic image annotation and retrieval using cross-media relevance models. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, 2003: 119-126.
[4] Carneiro G, Chan A B, Moreno P J, et al. Supervised learning of semantic classes for image annotation and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(3): 394-410.
[5] Vasconcelos N. Minimum probability of error image retrieval. IEEE Transactions on Signal Processing, 2004, 52(8): 2322-2336.
[6] Cusano C, Ciocca G, Schettini R. Image annotation using SVM. Proceedings of SPIE, San Jose, 2004, 5304: 330-338.
[7] Liu J, Li M, Ma W Y, et al. Adaptive graph model for automatic image annotation. Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, 2006: 61-67.
[8] Pan J Y, Yang H J, Faloutsos C, et al. GCap: Graph-based automatic image captioning. Proceedings of the 4th International Workshop on Multimedia Data and Document Engineering (MDDE 04), in conjunction with the Computer Vision and Pattern Recognition Conference (CVPR 04), 2004: 146-156.
[9] Witten I H, Moffat A, Bell T. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, 1999.
[10] Miller G, Beckwith R, Fellbaum C, et al. WordNet: An on-line lexical database. International Journal of Lexicography, 1990, 3(4): 235-244.
[11] Bailloeul T, Zhu C Z, Xu Y. Automatic image tagging as a random walk with priors on the canonical correlation subspace. Proceedings of ACM Multimedia Information Retrieval, 2008: 75-82.
[12] Meila M, Xu L. Multiway cuts and spectral clustering. University of Washington Technical Report, 2003.
[13] Tong H, Faloutsos C, Pan J Y. Random walk with restart: fast solutions and applications. Knowledge and Information Systems, 2008, 14: 327-346.
[14] Piegorsch W, Casella G E. Inverting a sum of matrices. SIAM Review, 1990.



