
Shape Description for Content-based Image Retrieval

E. Ardizzone 1,2, A. Chella 1,2, and R. Pirrone 1,2

1 DIAI - University of Palermo, Viale delle Scienze, 90128 Palermo, Italy
2 CERE - National Research Council, Viale delle Scienze, 90128 Palermo, Italy

{ardizzon, chella, pirrone}@unipa.it

Abstract. The present work focuses on a global image characterization based on a description of the 2D displacements of the different shapes present in the image, which can be employed for CBIR applications. To this aim, a recognition system has been developed that automatically detects image ROIs containing single objects and classifies them as belonging to a particular class of shapes. In our approach we make use of the eigenvalues of the covariance matrix computed from the pixel rows of a single ROI. These quantities are arranged in vector form and are classified using Support Vector Machines (SVMs). The selected feature allows us to recognize shapes robustly, despite rotations or scaling, and, to some extent, independently of lighting conditions. The theoretical foundations of the approach are presented in the paper, together with an outline of the system and some preliminary experimental results.

1 Introduction

Image indexing and retrieval by content has gained increasing importance in recent years. Almost all kinds of image analysis techniques have been investigated in order to derive sets of meaningful features useful for the description of pictorial information, and considerable effort has been spent on the development of powerful but easy-to-use commercial database engines.

The most popular CBIR (content-based image retrieval) systems developed so far, like QBIC [5, 7], Photobook [9], and Virage [8], model the image content as a set of uncorrelated shape, texture and color features. Queries are obtained either by manual specification of the weights for each feature or by presenting an example to the system, and they are refined by means of various relevance feedback strategies.

A more effective way to describe image content is to derive global descriptions of the objects, as in the approach followed by Malik et al. [2, 3]. In this way it is possible to obtain image indexing structures that are closer to the intuitive descriptions provided by the end user when he or she submits a query to the system.

Following this idea, we developed the recognition system presented in this work, which aims to extract global information about the shape of the objects in a scene and to provide a simple description of their 2D displacement inside the image under investigation. This, in turn, can be useful for content-based retrieval.

The proposed architecture is arranged as follows. The image is automatically searched in order to detect ROIs containing single objects. Each object's shape is described in terms of the eigenvalues of the covariance matrix computed from the pixel rows of the ROI: the eigenvalues are arranged as a single vector. We use a pool of suitably designed Support Vector Machines [13, 11] in order to classify different shape classes such as cubes, spheres, cones, pyramids, cylinders and so on.

The use of this particular feature is justified by theoretical analysis. First, it can be proven that, under the assumption of a Lambertian surface, the eigenvalues vector is directly related to the change of surface normals of the objects under investigation. Moreover, the eigenvalues vector is a very compact and efficient way to describe the statistical variance pattern of the shape profile, due to the complete de-correlation performed on the input patterns by the KL transform [6]. Finally, the KLT allows comparisons between separate pixel-row populations, and therefore between different ROIs.

The rest of the paper is arranged as follows. In section 2 theoretical issues about the eigenvalues vector are addressed. Section 3 explains the entire system in detail, while experimental results are reported in section 4. Finally, conclusions are drawn in section 5.

2 Theoretical Remarks

The use of KLT features for pattern recognition is a well-known technique in the computer vision community [10, 12] but, in general, this transformation is applied to the image as a whole, and the transformed vectors are used directly for the classification task.

In our approach, the KLT is implicitly applied to the scan-lines of a generic sub-image, and only the eigenvalues of their covariance matrix are taken into account. In other words, given an N × M rectangular region of an image, we compute the matrix:

Cr = E{(rk − rm)(rk − rm)T} = (1/N) Σk rk rkT − rm rmT,  where rm = (1/N) Σk rk

and the sums run over k = 1, …, N.
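As a sketch of this computation (a minimal NumPy illustration under the definitions above, not the authors' code), the biased covariance matrix of the pixel rows of an N × M region can be obtained as follows; the array `roi` is a hypothetical grayscale ROI:

```python
import numpy as np

def pixel_row_covariance(roi: np.ndarray) -> np.ndarray:
    """Biased covariance matrix Cr of the pixel rows of an N x M ROI.

    Each row of `roi` is one observation rk (an M-vector), following
    Cr = (1/N) sum_k rk rk^T - rm rm^T.
    """
    roi = roi.astype(float)
    n = roi.shape[0]
    r_m = roi.mean(axis=0)                      # mean row vector rm
    c_r = roi.T @ roi / n - np.outer(r_m, r_m)  # second moment minus outer product of the mean
    return c_r

# A small check against NumPy's own biased covariance estimator:
roi = np.random.default_rng(0).random((100, 256))
assert np.allclose(pixel_row_covariance(roi),
                   np.cov(roi, rowvar=False, bias=True))
```

Note that treating rows as observations yields an M × M matrix, so only M eigenvalues need to be computed, as discussed later in the section.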

In the previous equation rk is the generic pixel-row vector of the ROI under investigation, considered in column form. Then, we compute the vector:

λ = diag(Cq)   (1)

Here Cq = A Cr AT is the covariance matrix of the KL-transformed vectors qk, while A is the transformation matrix whose rows are the eigenvectors of Cr.

The matrix Cq is diagonal, so the KLT performs total decorrelation of the input vectors. Moreover, the mean value of the transformed vectors is always zero. These properties will be used in the rest of the section to reinforce our conclusions about the choice of λ as a global shape descriptor.
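These two properties can be verified numerically. The sketch below (an illustration under the same assumptions as before, not the paper's implementation) builds the KLT matrix A from the eigenvectors of Cr and checks that the transformed, centered rows have zero mean and a diagonal covariance whose diagonal is exactly λ:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.random((100, 256))                 # rows are the pixel-row vectors rk

n = R.shape[0]
r_m = R.mean(axis=0)
C_r = R.T @ R / n - np.outer(r_m, r_m)     # pixel-row covariance matrix Cr

eigvals, eigvecs = np.linalg.eigh(C_r)     # Cr is symmetric: use eigh
A = eigvecs.T                              # rows of A are eigenvectors of Cr
lam = eigvals                              # the feature vector lambda

Q = (R - r_m) @ A.T                        # KL-transformed, centered rows qk
C_q = Q.T @ Q / n                          # covariance Cq of the transformed rows

assert np.allclose(Q.mean(axis=0), 0.0)              # zero-mean property
assert np.allclose(C_q, np.diag(lam), atol=1e-9)     # total decorrelation
```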

The first step towards justifying the usability of λ is the proof of the relation between λ and the actual shape of the object depicted in the selected ROI. In what follows we consider a weak perspective camera model and Lambert's law to model the light reflection process upon the object surface. These constraints are not very restrictive, and are widely used throughout computer vision to model perceptive processes. In particular, weak perspective is introduced only for simplicity in the mathematical passages, while the Lambertian surface constraint holds for most real objects.

If an object is imaged by a camera under the weak perspective assumption, each point po = (x, y, z) of the object, expressed in the camera coordinate system, is mapped onto an image point p = (u, v), where p = WP po is the perspective transformation. According to Lambert's law the image irradiance E at each point is equal to the image intensity value at the same point, and is expressed by:

I(i, j) = E(p) = Hρ lT n(po)

In the previous equation H is a constant value related to the lens model, ρ is the albedo of the object surface, l is the (constant) illuminant vector and n(po) is the surface normal at the point po. The first equality takes into account the coordinate change from the image center to the upper-left corner, which is a linear transformation.

If we consider a vertical slice of the image, then each pixel-row vector can be defined as:

rk = {I(k, j) : j = 0, …, M − 1}T,   k = 0, …, N − 1   (2)

Here the transpose symbol is used in order to define rk as a column vector. If we substitute the expression of the generic pixel value I(i, j) into equation 2 we obtain:

rk = Hρ {lT nkj : j = 0, …, M − 1}T,   k = 0, …, N − 1   (3)

In equation 3, nkj refers to the surface normal vector projected onto position (k, j) in the image plane.

Now, we want to derive an expression for the generic element Cr(i, j) of the pixel-rows covariance matrix, using the equations stated above:

Cr(i, j) = (1/N) Σk rki rkj − (1/N²) (Σk rki)(Σk rkj)   (4)

Substituting equation 3 into equation 4 we obtain:

Cr(i, j) = (H²ρ²/N) Σk (lT nki)(lT nkj) − (H²ρ²/N²) (Σk lT nki)(Σk lT nkj)   (5)

We can then rewrite equation 5, after some rearrangement:

Cr(i, j) = H²ρ² lT [ (1/N) Σk nki nkjT − (1/N²) (Σk nki)(Σk nkj)T ] l   (6)

Finally, equation 6 can be rewritten in two different forms for the diagonal and off-diagonal terms:

Cr(i, j) = H²ρ² lT Cn(i) l,   i = j
Cr(i, j) = H²ρ² lT (Kn(ij) − nm(i) nm(j)T) l,   i ≠ j   (7)

The last equation states that the diagonal terms of the pixel-rows covariance matrix can be computed directly from the covariance matrices Cn(i) of the object surface normals projecting onto a single slice column. The off-diagonal terms of the same matrix can be computed as the difference between the correlation matrix Kn(ij) of the normals related to two different columns and the term obtained from the product of their mean vectors.

From the previous result we can argue that the matrix Cr is well suited to express the statistical variance pattern of the object surface shape along both rows (off-diagonal terms) and columns (diagonal terms), even though it is not referred to the entire slice but is computed starting from its rows. In this way we achieve a considerable reduction in computational time without losing the expressiveness of the selected feature, because we have to compute only M eigenvalues, while the application of the KLT to the entire region involves the computation of N × M coefficients.

The use of the eigenvalues allows us to express our feature in a very compact way. The λ vector still expresses the row variance pattern because it results from the covariance matrix (equation 1) of the KL-transformed pixel rows, which are completely uncorrelated.

Moreover, the λ vector allows comparisons between different regions in the same image or from different ones, in order to search for similar shapes. In general, two different sets of rows cannot be compared directly, due to the presence of bias effects in the pixel values deriving from noise and/or local lighting conditions. The implicit application of the KLT deriving from the use of λ implies that when we compare two different regions we refer to their transformed rows, which have zero mean value: these can be correctly compared because they have the same mean value and no bias effect is present.
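As an illustration of such a comparison (the paper does not fix a particular similarity measure here; normalized correlation between λ vectors is one plausible choice, used as a hypothetical example), two regions can be compared through their eigenvalue vectors:

```python
import numpy as np

def lambda_vector(roi: np.ndarray) -> np.ndarray:
    """Eigenvalues of the pixel-row covariance matrix, sorted descending."""
    roi = roi.astype(float)
    n = roi.shape[0]
    r_m = roi.mean(axis=0)
    c_r = roi.T @ roi / n - np.outer(r_m, r_m)
    return np.sort(np.linalg.eigvalsh(c_r))[::-1]

def lambda_similarity(roi_a: np.ndarray, roi_b: np.ndarray) -> float:
    """Normalized correlation (cosine) between the lambda vectors of two ROIs."""
    la, lb = lambda_vector(roi_a), lambda_vector(roi_b)
    return float(la @ lb / (np.linalg.norm(la) * np.linalg.norm(lb)))

rng = np.random.default_rng(0)
roi = rng.random((100, 256))
scaled = 0.5 * roi                     # a globally rescaled copy of the same region
print(lambda_similarity(roi, scaled))  # cosine similarity is 1 for a pure rescaling
```

A global intensity rescaling multiplies every eigenvalue by the same factor, so the normalized correlation is unaffected, which is consistent with the bias-invariance argument above.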

3 Description of the System

In this section, the complete structure of the presented system is reported, starting from the considerations of the preceding section.

We have analyzed the histogram of the components of λ computed from several images, both synthetic and real, depicting single shapes under varying attitudes and lighting (see figure 1). We have noticed that this histogram exhibits some dominant modes, whose relative position and amplitude depend on the observed shape. The amplitude and position of these modes remain almost unchanged under rotation, translation, and scaling of the object. It can be noted from equation 7 that the light direction acts as a scaling factor for all the terms of Cr, thus affecting all the components of λ in a uniform manner. From experimental observation, we have noticed that varying l does not affect the histogram too much. Vectors for similar shapes tend to be similar, so we have set up a classifier based on a pool of suitably tuned SVMs operating in the eigenvalues space. Moreover, a search algorithm for the automatic analysis of the test images has been derived, based on the maximization of the correlation between the actual λ vector and some sample vectors from the different shape classes.

Fig. 1. Some shape examples together with the relative λ histograms. Selected ROIs are 256 × 100 pixels wide. Comparing the pairs along each row, it can be noted that changes in attitude and lighting do not affect the histogram too much.

The complete system acts in the following way: first, the image is scanned from left to right and from top to bottom by moving windows of fixed size, in order to locate possible regions of interest. Then the height of each window is resized so as to enclose at most a single complete object. Finally, all the selected regions are classified by the SVM pool.

The rest of this section is arranged in two parts: the first is devoted to the automatic search algorithm, while the second provides some remarks on SVMs and classification strategies.

3.1 Automatic Search Algorithm

The search algorithm we implemented is based on a two-pass strategy. The first step performs a rough location of the ROIs for both the horizontal and vertical displacement. The second step defines the window dimensions for all the selected positions.

The search criterion is the maximization of the correlation between the λ vector of a fixed-size slice and a sample of each shape class, computed as the mean vector of those used as the training set for the various SVMs. Scanning the image from left to right with a 256 × 100 fixed slice, all the correlation values are computed, one for each class sample, and the maximum is taken. This information is used only to detect whether something is present, without looking for a particular shape. Positive peaks of the correlation defined above, plotted against the position of the left side of the slice, indicate a region of interest.

Relevant positions are selected as follows. The cumulative function of the correlation is computed, and the selected points are the zero crossings of its second-order derivative: these are the slope inversion points of the cumulative function, which in turn correspond approximately to the correlation maxima (see figure 2). We found it more convenient to use the cumulative function in order to avoid the noisy spikes that can appear near a peak when detecting maxima directly from the correlation plot.

For each selected ROI, single objects are detected using a 20 × 100 fixed slice that moves from top to bottom. Again, the correlation maxima are computed with the previous strategy.

In the second step of the algorithm, we use variance maximization as the guiding criterion to resize the window height so as to enclose a single complete object. Here the variance is to be understood as the maximum eigenvalue in the λ vector of the current slice. Starting from the position of each correlation peak, windows are enlarged along their height by a single row at a time, and the λ vector is computed, taking into account its maximum component. Positive peaks correspond approximately to the upper and lower edges of the object. Again we use the second-order derivative of the cumulative function in order to avoid mismatches due to the presence of variance spikes near the actual maximum. Search results are depicted in figure 3.
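The peak-selection criterion above can be sketched compactly (an illustration of the stated idea, not the authors' code): the second difference of the cumulative sum of a signal equals the first difference of the signal itself, so its positive-to-negative zero crossings mark the local maxima.

```python
import numpy as np

def peaks_via_cumulative(signal):
    """Indices of local maxima of `signal`, found as positive-to-negative
    zero crossings of the second difference of its cumulative function."""
    signal = np.asarray(signal, dtype=float)
    d2 = np.diff(np.cumsum(signal), 2)               # second difference of the cumulative sum
    cross = np.where((d2[:-1] > 0) & (d2[1:] <= 0))[0]
    return cross + 2                                 # map back to indices of the original signal

corr = [0.1, 0.4, 0.9, 0.5, 0.2, 0.3, 0.8, 1.0, 0.6, 0.1]
print(peaks_via_cumulative(corr))   # local maxima at indices 2 and 7
```

In practice the cumulative function would be smoothed or thresholded as the paper suggests, since the whole point of this detour is robustness to noisy spikes near a peak.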

3.2 Shape Classification Using SVMs

SVMs were introduced by Vapnik [13]. Here we focus on the most relevant theoretical topics on SVMs for pattern recognition; more detailed information can be found in [11].

In a typical binary classification problem we are given a set S of points xi ∈ RN and a set of labels yi ∈ {±1}, i = 1, …, l, and we want to find a function f : RN → {±1} that correctly associates each point with its respective label. Even if we find a function f that does well on all the training data, we are not assured that it generalizes correctly. Vapnik and Chervonenkis defined a measure of the generalization ability (the capacity) of a function class: the VC dimension, that is, the largest number h of points that can be separated in all possible ways using functions of the selected class.

Fig. 2. An example of correlation maximization search. In the topmost row there is a sample with the slices corresponding to the correlation maxima, the cumulative function plot, and its second-order derivative. Maxima have been found at positions 21, 39, 105 and 128. In the lower row there are the vertical sub-slices of the ROI at position 105, along with the cumulative function and its second-order derivative.

In order to obtain a correct generalization from the training data, a learning machine must use a class of functions with an adequate capacity. Vapnik and Chervonenkis considered the class of hyperplanes, and developed a learning algorithm for separable problems, finding the unique Optimal Hyperplane (OH) that separates the data. This approach can be easily extended to non-linearly separable problems.

Fig. 3. An example of variance maximization search. On the left, the final slices of the picture in figure 2 are depicted, along with the plot of the variance and its second-order derivative for the slice at position 105.

The SVM in its original formulation is designed for two-class discrimination, so we used a particular training strategy in order to cope with our multi-class task. Two different kinds of SVMs have been trained on six shape classes: cube, cylinder, pyramid, cone, ellipsoid, and box. First, six SVMs have been trained in a one-versus-others fashion, each of them being able to discriminate between a particular class and all the other objects. In addition, a second pool of 15 SVMs has been trained using a pair-wise strategy: each SVM is trained to discriminate between a single pair of the desired classes, so for K classes we need K(K − 1)/2 different machines.

The use of two learning strategies is related to the need to avoid mismatches in classification. Kreßel has demonstrated in [11] that one-versus-others training leaves some uncertainty regions in the feature space where it is not possible to decide correctly to which class the actual sample belongs. The pair-wise strategy provides a refinement of the boundaries between multiple classes.

In our experiments we have noticed that the use of the one-versus-others or pair-wise strategy alone is not sufficient to obtain a correct classification. So, in the test phase, we use the first set of machines to provide a rough discrimination, which is then refined by the second ones. The one-versus-others machines provide their own estimate in a winner-takes-all fashion: the distances between the object's λ vector and the optimal hyperplanes defining each shape class are computed, and the class with the highest positive distance is taken as the winner. In this way the class in which the actual sample vector lies furthest "inside" is selected.

In some cases this approach does not allow a very sharp classification, and the sample vector falls inside two or more classes. The pair-wise machines are then used to provide a disambiguation. In this case the result of testing the vector with each machine is accumulated for each class in a sort of round-robin challenge. The class with the highest score wins the tournament, and the sample vector is classified according to this outcome. In this way, each object belongs to a single class.
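The two-stage decision can be sketched in plain Python as follows (an illustration of the scheme described above; the decision values would come from the trained SVMs, so the `margin` threshold and the score dictionaries here are hypothetical):

```python
from itertools import combinations

CLASSES = ["cube", "cylinder", "pyramid", "cone", "ellipsoid", "box"]

def classify(ovo_scores, pairwise_votes, margin=10.0):
    """Winner-takes-all over one-versus-others distances, refined by a
    pair-wise round-robin tournament when the WTA outcome is not sharp.

    ovo_scores: dict class -> signed distance from that class's hyperplane.
    pairwise_votes: dict (class_a, class_b) -> winning class of that pair.
    margin: hypothetical threshold below which the WTA result is ambiguous.
    """
    ranked = sorted(ovo_scores, key=ovo_scores.get, reverse=True)
    best, second = ranked[0], ranked[1]
    if ovo_scores[best] - ovo_scores[second] >= margin:
        return best                        # sharp WTA decision, PW stage skipped
    # Round-robin tournament: each pair-wise machine casts one vote.
    tally = {c: 0 for c in CLASSES}
    for pair in combinations(CLASSES, 2):
        tally[pairwise_votes[pair]] += 1
    return max(tally, key=tally.get)

# K(K-1)/2 = 15 pair-wise machines for the K = 6 shape classes:
assert len(list(combinations(CLASSES, 2))) == 15
```

The `combinations` call also makes the K(K − 1)/2 count explicit: each unordered pair of classes gets exactly one machine.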

4 Experimental Setup

In order to set up the classifier, a training set has been used, consisting of 118 images representing single objects belonging to all six classes. These images have been taken under varying lighting conditions, and they represent both real and synthetic shapes with different orientations and scalings.

The same training set has been used to train both the one-versus-others and the pair-wise SVMs, in order to allow the latter to act as a refinement of the boundaries between the various classes with respect to the first set of machines.

A 3 × 3 median filter is used to reduce noise and undesired mismatches due to artifacts in the background. Moreover, all the input images are normalized with respect to the quantity Σi,j I(i, j)², which is a measure of the global energy content of the image. In this way the λ vector components range almost in the same interval for all images.
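The normalization step can be sketched as follows (the paper only names the normalizing quantity Σ I²; dividing each pixel by the square root of this energy, so that the normalized image has unit energy, is one plausible reading, used here as an assumption):

```python
import numpy as np

def energy_normalize(image: np.ndarray) -> np.ndarray:
    """Scale `image` so that the sum of its squared pixel values is 1."""
    image = image.astype(float)
    energy = np.sum(image ** 2)        # global energy content of the image
    return image / np.sqrt(energy)

img = np.random.default_rng(0).random((100, 256))
normalized = energy_normalize(img)
assert np.isclose(np.sum(normalized ** 2), 1.0)
```

Any other monotone function of the same energy would serve the stated purpose of bringing the λ components of different images into comparable ranges.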

Experiments have been carried out both on images depicting single objects and on complex scenes with many objects, even partially occluded. Some trials have been performed on well-known images from computer vision handbooks. Tables 1, 2, and 3, and figures 4 and 5 illustrate the results of some experiments.

Table 1. The performance of the system on the scene depicted in figure 3. The position values refer to the upper-left corner of each slice. In slice two the PW classification is incorrect, but it is not used, due to the high score obtained in WTA mode. PW refines the outcome of the WTA only in the last case.

Slice n.  Pos.        WTA (%)           PW
0         (21, 91)    box (87.03)       cube
1         (39, 74)    box (100)         box
2         (105, 70)   cylinder (89.52)  box
3         (105, 152)  box (68.67)       cube

Fig. 4. An example of the performance of the system on a multiple-objects image.

5 Conclusions and Future Work

The presented work is a first step in the direction of a more robust and general object recognition system, which could be a suitable extension of our image and video database system Jacob [4, 1]. Early results are satisfactory and provide us with many cues about future developments.

The use of a statistical approach makes the system quite robust with respect to noise, but the system fails in the presence of textures. On the other hand, one

Table 2. The output of the system for the slices in figure 4. It can be noted that for the first slice we obtain a weak but correct response from the WTA machines, while the PW classification is almost wrong due to the closeness of the two shape classes. Many other slices, detected by the search algorithm, have been discarded by the classifier.

Slice n.  Pos.        WTA (%)           PW
0         (52, 40)    box (48.76)       cube
1         (71, 132)   box (69.87)       cube
2         (141, 46)   cylinder (100)    cylinder

Fig. 5. An example of the performance of the system on a real image, depicting the city of Venice.

Table 3. The output of the system for the slices in figure 5. Here, the slices have been selected interactively. Slice 1 is misclassified as a cone due to the strong similarity between one side of the actual pyramid and the background. Slice 3 is correctly classified by the WTA machine, and the PW response is not taken into account.

Slice n.  Pos.        WTA (%)           PW
0         (1, 40)     box (100)         box
1         (15, 1)     cone (57.10)      cube
2         (118, 190)  box (57.49)       box
3         (118, 230)  box (89.48)       box/cylinder

might think of specializing the system to the recognition of textures as a global feature, while shape could be inferred using some other approach.

The influence of the illuminant direction has not yet been explored in detail, but our approach has proven not to be much influenced by this parameter, due to the fact that l affects all the elements of the covariance matrix in the same way. We are now studying the use of homomorphic filtering in order to strongly reduce the influence of the lighting conditions on the perceived scene.

Another possible development is the use of the approach in a 3D vision system, rather than as a preprocessing stage for content-based image indexing and retrieval. In this way the system would perform model recognition, thus providing a reconstruction layer with its information.

6 Acknowledgements

This work has been partially supported by the Italian MURST Project "Galileo 2000" and the MURST-CNR Biotechnology Program l. 95/95.

References

1. E. Ardizzone and M. La Cascia. Automatic Video Database Indexing and Retrieval. Multimedia Tools and Applications, 4(1):29–56, January 1997.

2. S. Belongie, C. Carson, H. Greenspan, and J. Malik. Color- and Texture-based Image Segmentation using EM and its Application to Content-based Image Retrieval. In Proc. of International Conference on Computer Vision, 1998.

3. C. Carson, M. Thomas, S. Belongie, J.M. Hellerstein, and J. Malik. Blobworld: A System for Region-Based Image Indexing and Retrieval. In Proc. of Third International Conference on Visual Information Systems VISUAL'99, pages 509–516, Amsterdam, The Netherlands, June 1999. Springer.

4. M. La Cascia and E. Ardizzone. Jacob: Just a Content-based Query System for Video Databases. In Proc. of IEEE Int. Conference on Acoustics, Speech and Signal Processing, ICASSP-96, pages 7–10, Atlanta, May 1996.

5. M. Flickner, H. Sawhney, W. Niblack, J. Ashley, et al. Query by Image and Video Content: The QBIC System. IEEE Computer, 28(9):23–32, September 1995.

6. R.C. Gonzalez and P. Wintz. Digital Image Processing. Addison-Wesley, 2nd edition, 1987.

7. J. Hafner, H. Sawhney, W. Equitz, M. Flickner, and W. Niblack. Efficient Color Histogram Indexing for Quadratic Form Distance Functions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(7):729–736, July 1995.

8. Hampapur et al. Virage Video Engine. Proc. of SPIE, Storage and Retrieval for Image and Video Databases V, 3022:188–200, 1997.

9. A. Pentland, R. Picard, and S. Sclaroff. Photobook: Content-based Manipulation of Image Databases. International Journal of Computer Vision, 18:233–254, 1996.

10. A. Pentland and M. Turk. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

11. B. Schölkopf, C. Burges, and A.J. Smola, editors. Advances in Kernel Methods: Support Vector Learning. The MIT Press, Cambridge, MA, 1999.

12. A. Talukder and D. Casasent. General Methodology for Simultaneous Representation and Discrimination of Multiple Object Classes. Optical Engineering, 37(3):904–913, March 1998.

13. V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.