hierarchical clustering for image databasessanjiv/publications/hier_wp.pdf · 2005-11-29 · tail...

6
Hierarchical Clustering for Image Databases Sanjiv K. Bhatia Department of Mathematics & Computer Science University of Missouri – St. Louis St. Louis, MO 63121 [email protected] Abstract The organization of an image database is one of the important issues in efficient storage and retrieval of images. Most of the existing image databases are based on flat structures, with the possibility of an index into the database that can help in narrow- ing down the images to be searched. In this pa- per, I’ll present a technique to create a hierarchi- cal data structure based on the clustering approach such that a user can select or discard a number of images for subsequent operations. The presented technique is based on application of wavelet anal- ysis to scale the images in hierarchy, and can take advantage of the structure of compressed images in the JPEG 2000 standard. 1. Introduction An image can be abstracted as data in multiple di- mensions, defined by I fxy , where fxy in- dicates the value of a discrete pixel at spatial coor- dinates xy . This abstraction implies that there is no easy way to impose an order on a set of im- ages based on content. However, an image can be manipulated in a number of ways to make the data contained in the image provide more informa- tion that can be used to create an index that can be searched readily. In a flat environment, an image needs to be com- pared to every other image in the database to de- termine the closest match for retrieval or other op- erations. This leads us to a linear search algorithm but with the caveat that each comparison involves comparing a number of pixels in each image – data in each dimension of our multidimensional abstrac- tion – to compute a coefficient of match. This structure was used in the celebrated QBIC project [1, 7, 8]. The indexing algorithms reduce the com- putation time of each comparison but still, require a comparison with a representation of every im- age, yielding a linear algorithm with improved effi- ciency due to working with a computationally effi- cient structure rather than raw images [3, 6]. There is a need for a technique that can reduce the number of comparisons while still performing the search in an efficient manner. One of the techniques success- fully employed to achieve such an objective in mul- tidimensional data is provided by clustering. Clustering is defined as an organization technique where the objects of interest, that have some simi- larity along a dimension of interest, are kept close together while the objects that differ from each other, are kept further apart. The dimension of in- terest is defined based on the application. In case of images, we abstract the dimension of interest as the images that have similar color distributions in corresponding areas. This may make the images perceived as similar to each other in appearance in perceptual terms but the images could be farther apart based on actual color distribution. As an ex- ample, consider the images presented in Figure 1 which are similar in form but are very different if we look at color distribution. Comparing these im- ages poses yet another challenge in clustering. Clustering results in improved performance for

Upload: others

Post on 18-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hierarchical Clustering for Image Databasessanjiv/publications/hier_wp.pdf · 2005-11-29 · tail into a single cluster using a multi-branchtree structure. Such a system allows for

Hierarchical Clustering for Image Databases

Sanjiv K. BhatiaDepartmentof Mathematics& ComputerScience

Universityof Missouri– St.LouisSt.Louis,MO [email protected]

Abstract

Theorganizationof animagedatabaseis oneof theimportantissuesin efficientstorageandretrievalofimages. Most of the existing image databasesarebasedon flat structures,with the possibilityof anindex into the databasethat can help in narrow-ing down the images to be searched. In this pa-per, I’ll presenta techniqueto createa hierarchi-cal datastructurebasedontheclusteringapproachsuch that a usercanselector discard a numberofimages for subsequentoperations. The presentedtechniqueis basedon applicationof waveletanal-ysisto scalethe imagesin hierarchy, andcantakeadvantageof thestructureof compressedimagesintheJPEG 2000standard.

1. Introduction

An imagecanbeabstractedasdatain multiple di-mensions,definedby I � f

�x � y� , where f

�x � y� in-

dicatesthevalueof a discretepixel at spatialcoor-dinates

�x � y� . This abstractionimpliesthatthereis

no easyway to imposean order on a set of im-agesbasedon content. However, an image canbe manipulatedin a numberof waysto make thedatacontainedin theimageprovide moreinforma-tion thatcanbeusedto createanindex thatcanbesearchedreadily.

In a flat environment,an imageneedsto be com-paredto every other imagein the databaseto de-terminetheclosestmatchfor retrieval or otherop-

erations.This leadsusto a linearsearchalgorithmbut with the caveat that eachcomparisoninvolvescomparinganumberof pixelsin eachimage– datain eachdimensionof ourmultidimensionalabstrac-tion – to computea coefficient of match. Thisstructurewasusedin the celebratedQBIC project[1, 7, 8]. Theindexing algorithmsreducethecom-putationtime of eachcomparisonbut still, requirea comparisonwith a representationof every im-age,yieldingalinearalgorithmwith improvedeffi-ciency dueto working with a computationallyeffi-cientstructureratherthanraw images[3, 6]. Thereis aneedfor atechniquethatcanreducethenumberof comparisonswhile still performingthesearchinanefficientmanner. Oneof thetechniquessuccess-fully employedto achievesuchanobjectivein mul-tidimensionaldatais providedby clustering.

Clusteringis definedasan organizationtechniquewheretheobjectsof interest,thathave somesimi-larity alonga dimensionof interest,arekept closetogetherwhile the objects that differ from eachother, arekept furtherapart.Thedimensionof in-terestis definedbasedon the application. In caseof images,we abstractthedimensionof interestasthe imagesthat have similar color distributionsincorrespondingareas. This may make the imagesperceivedassimilar to eachotherin appearanceinperceptualtermsbut the imagescould be fartherapartbasedon actualcolor distribution. As anex-ample,considerthe imagespresentedin Figure1which aresimilar in form but arevery differentifwe look atcolor distribution. Comparingtheseim-agesposesyet anotherchallengein clustering.

Clustering results in improved performancefor

Page 2: Hierarchical Clustering for Image Databasessanjiv/publications/hier_wp.pdf · 2005-11-29 · tail into a single cluster using a multi-branchtree structure. Such a system allows for

Figure 1. A picture of Taj Mahal in red andgreen bands

searchandretrieval asthesystemcanselector dis-carda numberof objectsbasedon a limited num-ber of comparisonswith the itemsbeingsearchedfor. A hierarchicalclustergoesonestepfurtherbycollectingsimilar clustersat differentlevelsof de-tail into a singleclusterusinga multi-branchtreestructure.Sucha systemallows for selectionof asetof clustersfor furtherexploration.

In this paper, I presentthe use of a hierarchicalclusteringdatastructurefor imagedatabaseorga-nization. This datastructurehasevolved from anadaptive clusteringschemefor multidimensionaldatathat hasresultedin a searchtime of O

���n�

in an averagecaseover n multidimensionalob-jects[4]. I havesuccessfullyusedthis techniquetoachieve performancegainsin assignmentof ther-mal propertiesto compositematerialdatafor FLIR

simulationas well as to generatea color table inwhichthecolorshavebeenuniformly distributedin

aperceptualsense.In thispaper, I’ ll show thatclus-teringcanbeappliedto imagedatabasesto achievesimilargainsin efficiency.

In the next section,I’ ll presenta review of theex-isting literatureon imagedatabaseorganization.InSection3, I’ ll presenttheuseof hierarchicalclusterdatastructureto organizethe images.This will befollowedby someexamplesandconclusionin thelastsection.

2. Organization of Images in Databases

Most of the imagedatabasesystemscanbe char-acterizedascontent-basedimage retrieval (CBIR)methods. Thesesystemsextract imagecharacter-istics,known assignatures, andperformreasoningon imagesbasedon somerulesdefinedon imagesignatures.

Mostof theCBIR systemscanbeclassifiedinto oneof threecategoriesbroadly definedas histogram-basedsystems,color distribution-basedsystems,or region-basedsystems[9]. Somesystems,suchas QBIC [7], Walrus [9], and JPEG-coefficientindexing-basedsystem[6] have characteristicsofmorethanonecategory.

The histogram-basedsystemscreatea histogramof eachof thethreeprimarycolorsindependentofeachother. The numberof bins in eachcolor atdifferentquantizationlevel aretallied to createthesignatureof the image.Thesignatureof an imagecan be comparedagainstthe collectedsignaturesof imagesin the databaseto determinethe closestmatch. The main drawbackof this schemeis thatthereis no regardfor theshape,texture,andobjectlocationin theimage.It worksonly onthedistribu-tion of coloranywhere in theimageandgiveahighdegreeof matchin semanticallyunrelatedimages.

Thesystemsbasedon color layout performsome-whatbetterby dividing theimageinto anumberofsmall blocks andusing the averagecolor of eachblock to createa signaturefor the image. Somesystems,suchasWALRUS [9] andWBIIS [12], usesignificantwavelet coefficients insteadof averagevaluesto capturesharpcolor variationswithin ablock. Thesesystemsshow low tolerancefor ob-

Page 3: Hierarchical Clustering for Image Databasessanjiv/publications/hier_wp.pdf · 2005-11-29 · tail into a single cluster using a multi-branchtree structure. Such a system allows for

ject translationandscalingin animage.

Region-basedsystems,including QBIC [7], Blob-world [2], and SIMPLIcity [11], uselocal proper-tiesof regionswith theassumptionthateachregionidentifiesanobject. However, a problemcanariseif anobjectis dividedinto multiple regions.

Thesystempresentedhereconsidersa moreholis-tic approachby using a query imageat differentresolutionsto navigateto theimagesthatmatchthequery. Themultiresolutionnatureof thedatastruc-tureuseswaveletsto getamipmap-likeschemeforeachimage. In thecurrentimplementation,I haveusedthe Haarwavelet for its simplicity to derivemultiple resolutionsof the imageandby throwingawaythedetailcoefficients,retainingonly thescalecoefficients.

In the next section, I describeclustering phe-nomenonin relation to imagesfollowed by theuseof scalingcoefficientsto createthehierarchicalclusterdatastructureaswell asthe overall imple-mentationof thisstructure.

3. Cluster Scheme for Images

The hierarchicalclusterdatastructureis basedonthe developmentof a representationof eachclus-ter thatadequatelycapturesthecontentsof objectscontainedin thecluster. This representationformsthe centroid of the cluster. The centroidof clus-ter, in caseof images,is simply theaverageimagefor the clusterwherethe averageimageis formedby taking the averagevalueof correspondingpix-els in theimagesthat form thecluster. This condi-tion requiresthatall imagesin theclusterbeof thesamesize,or scaledto bethesamesizefor internalmatching.

To develop the clusters, considerall images tobe randomlydistributedalong a two-dimensionalplane.We developclustersby looking at eachim-ageand addingit to the cluster that is closestinmatchto this image. Anotherpoint to be consid-ered is that the quantitative measureof match iswithin a specifiedthresholdfrom the centroidofthecluster. If wecannotfind suchacluster, wecre-atea new clusterthat is madeup of just this one

image. Whenever an imageis addedto a cluster,its centroidmayshift becauseof thechangein av-eragevalueof correspondingpixelsdueto thenewimage.An exampleof imagesdividedinto clustersis shown in Figure2.

Figure 2. Clustered images in a 2D plane

Thehierarchicalclusterdatastructureneedsamea-sureof distancebetweentwo givenimages.A sim-pledefinitionof thisdistanceis providedby theav-eragedifferencein eachof thered,green,andbluebandsof thetwo images.If thetwo imagesof sizer � c pixelsaredenotedby I andI � , thedistancedbetweenthemis givenby

d � 13rc

c � 1

∑x 0

r � 1

∑y 0

∑z RGB

Ix� y� z � I �x� y� z

Thedistancecomputedin this fashionis extremelysimplisticandsuffers from theproblemillustratedin Figure 1. The problemis fixed by discardingthe RGB color spacein favor of a perceptualcolorspacesuchasHSV or L*a*b*.

Wheneveranew imageis addedto acluster, theav-eragingfor new clusterrepresentationis performedbasedon thenumberof imagesalreadyin theclus-ter. If aclusterrepresentedby containsn images,and I is a new imagebeingaddedto , the newclusterrepresentation � is givenby

� � 1n � 1

�n �� �� I �

Page 4: Hierarchical Clustering for Image Databasessanjiv/publications/hier_wp.pdf · 2005-11-29 · tail into a single cluster using a multi-branchtree structure. Such a system allows for

The first clusterrepresentationis createdfrom thefirst imageitself. Thustheimageis at thecenterofthecluster. As animageis addedto thiscluster, thenew clusterrepresentationis createdby taking theaverageof two imagesin thecluster. This changesthe locationof centroidto the point in the middleof the two images.From this point on, the repre-sentationkeepson shifting but never goesfurtherthanthetolerancespecifiedby thethresholdduetotheweightedaveraging.Thus,thealgorithmguar-anteesthat an imagecannotbe classifiedto morethanonecluster. Therefore,we canassignthe im-ageto aclusterassoonasasuitablematchis found.Moreover, acorollaryto this impliesthattwo clus-terscannothaveanoverlappingboundary.

3.1. Hierarchical Cluster

Theclusteringdescribedabovecandegenerateintoworst casescenarioif the thresholdis selectedtobe small. In sucha case,eachimageis a clusterby itself. Theabove clusteringalgorithmperformsa linear scanof the imageagainstthe clustersal-readyproduced.The only way to improve on thealgorithm will be if we cancomparea query im-ageagainstclustersin order suchthat the clusterwith maximumnumberof imagesis comparedfirstand the onewith leastnumberof imagesis com-paredlast. A betterway to improve thealgorithmperformanceis by usinga waveletanalysisto cre-ateahierarchyof clusters.Thehierarchyis createdin theform of clusterof clustersat multiple levels,generatedby full averagingof images.This clusterof clustersorganizestheimagesatdifferentresolu-tion levelsin theform of a multi-branchhierarchi-cal tree.

Thehierarchicalclusteringalgorithmperformsthefull averagingusingHaarwaveletontheimagethatis to be clustered.It performsthis analysisrecur-sively till the entire imageis reducedto a singlepixel, saving the intermediateresultsin a stack. Itshouldbe recalledthat at eachstep,the imageisreducedby half in bothheightandwidth. Thus,a256 � 256(28 � 28) pixel imagewill yield 9 levelsof averaging,from 20 � 20 pixels to 28 � 28 pix-els.Eachhierarchyin thenew datastructureis thus

1 � lgn levelsdeepwheren is thelargerof thenum-ber of rows andcolumnsin the texture. As in thecaseof flat clusters,we will requireeachimageintheclusterto beof thesamesize,or scaledto beofthesamesize.

The hierarchicalcluster data structureis imple-mentedby a recursively definedvectorof clusterswith eachcluster representedby a single image,andcontaininga hookfor anotherhierarchybelowit. Theimagerepresentingtheclusteris simply anaverageof all the imagesthatarecontainedin theclusterat that resolutionlevel. The hookprovidesfor thehierarchystartingat thenext higherresolu-tion level.

As an image I is addedto the cluster, we cre-atemultiple resolutionsof I by applyingthe Haarwavelet anddiscardingthe wavelet or detail coef-ficients until we have an imageof just onepixel,denoting it by I0. This pixel capturesthe aver-agevalueof all pixels in the image,in respectivebands.This imageis thencomparedwith thevec-tor of clustersat therootof hierarchicalclusterdatastructureto determinetheclosestmatch.

Oncewe have found the matchingcluster, and itis within the cluster radius,we selectthe clustervectorat thenext higherresolutionlevel that is as-sociatedwith this cluster. This will leadus to theclustervectorof 2 � 2 pixel images.We comparethe2 � 2 versionof currentimageI , denotedby I1to eachmemberof this clustervector. Theclosestmatchis thenselectedandwecontinuewith these-lectionat thenext level. Therecursive applicationstopswhenwefind theimagethatmatcheswithin athresholdlimit, or whenwediscover thattheimageis outsidetheradiusof any clusterat thatresolutionlevel within the selectedcluster. At sucha point,we stopandcreatea new clusterto addthe imageinto thenew cluster.

3.2. Results

We’ll illustratetheresultsusingtheimagescatego-rizedasart in theSmithsoniandatabase.Thereare31 imagesin this category andtheir representationis givenin Figure3.

Page 5: Hierarchical Clustering for Image Databasessanjiv/publications/hier_wp.pdf · 2005-11-29 · tail into a single cluster using a multi-branchtree structure. Such a system allows for

I testedthis setof imagesseparatelyusing256 �256pixel imagesandcreatedthelusteringby com-paringin L*a*b* colorspace.Thehierarchycon-tainsanumberof clustersasshown in Table1. The

Table 1. Number of clusters at different lev-els in hierarchy

Level Picture Numberofsize clusters

0 1 � 1 41 2 � 2 72 4 � 4 133 8 � 8 184 16 � 16 225 32 � 32 266 64 � 64 287 128 � 128 318 256 � 256 31

tableshowsthemultiresolutionnatureof hierarchi-cal clusterstructure.At thelowestresolutionlevelor root,eachimageis reducedto asinglepixel thatcapturesthe averageof all the color bandssepa-rately. At this level, the imagesweredivided intofour differentclusters.As wegodown thestructureto higher resolutions,the numberof clustersstartto increaseandby thetime we arriveat level 7, allimagesare separatedfrom eachother and form aclusterby themselves.

The system is implementedin C++ on a SunSPARCstation running Solaris. The query for amatchingimage in this structureis completedinlessthanonesecondon an averageoncethe clus-ter hasbeenloadedinto memoryandthe stackofmultiresolutionimagescreated.In casethesystemdoesnot find theexactmatchof thequeriedimage,it returnstheclosestmatch.

4. Conclusion and Future Work

The hierarchicalcluster data structureis imple-mentedusingHaarwavelet for scalingthe imagesat differentresolutionlevels. The systemincludesseparateprogramsto createtheclustersaswell as

to addindividual imagesto thedatastructure.ThedistancemeasureusesRGB color spacethoughtheusercanspecifyL*a*b* or HSV color spacesascommandline options.Thesystemhasbeentestedwith two separatedatabases.

The first test was performedwith a databasethatcontainsabout4000texturesof size256 � 256pix-els. In eachquery, I wasable to satisfactorily re-trieve an exact or approximatematchto the queryimage. The currentimplementationusesthe sim-plestform of wavelets– theHaarwavelets– to per-form scalingatdifferentresolutions.

Thesystemhasalsobeentestedwith Smithsoniandatabasecontaining747images,dividedinto sevendifferentcategories.Eachof thesecategoriescon-tainsbetween31 and216 images.The imagesarescaledto 512 � 512pixel sizefor internalusewithour datastructure,giving a depthof 9 levelsin thehierarchy.

Currently, thesystemgivesonly oneimageasa re-sponseto thequery. This imagecouldbetheexactmatch,if available in a cluster, or it could be theonethat is at the smallestdistancefrom the queryimage.Oneof theenhancementsin thesystemwillbe to addthe capabilityto outputmultiple imagesandprovidearankingto thoseimagesin theoutput.

The Haarwaveletsperforma simpleaveragingofcolorvaluesin differentpictureelements.They aresimpleto program,comparedwith otherwavelets.I usedthemto prove a conceptthat they canhelpwith hierarchicalclustering. However, to createarobust environment,I believe I needto usea dif-ferent type of wavelet. In future, I plan to re-placetheuseof Haarwaveletswith Debauchies5/7and7/9 wavelets[5] to scalethe imagesto createthehierarchicaldatastructurethatwill further im-prove theretrieval process.I chosetheDebauchieswaveletsbecausethey form the basisfor the JPEG

2000standard[10]. This selectionleadsmeto be-lieve that I may be able to work on imagescom-pressedwith JPEG 2000algorithmdirectly by ex-tracting the scalingcoefficientsat different levelsfor usein thehierarchicalclusterdatastructure.

Page 6: Hierarchical Clustering for Image Databasessanjiv/publications/hier_wp.pdf · 2005-11-29 · tail into a single cluster using a multi-branchtree structure. Such a system allows for

References

[1] R. Barber, W. Equitz, M. Flickner, W. Niblack,D. Petkovic, and P. Yanker. Efficient query byimagecontentfor very large imagedatabases.InCOMPCON, pages17–19.IEEEComputerSocietyPress,Los Alamitos,CA, 1993.

[2] S. Belongie,C. Carson,H. Greenspan,andJ. Ma-lik. Color- andtexture-basedimagesegmentationusingtheexpectation-maximization algorithmandits applicationto content-basedimageretrieval. InICCV98: Proceedingsof theInternationalConfer-enceonComputerVision, pages675–682,1998.

[3] S.K. Bhatia. Imagedatabaseindexing usingJPEGcoefficients. In Proceedingsof the 10th FLAIRSConference, DaytonaBeach,FL, May 1997.

[4] S.K. Bhatia.Adaptivek-meansclustering.In Pro-ceedingsof the FLAIRS2004 Conference, SouthBeach,FL, May 2004.

[5] C. K. Chui. Wavelets: A MathematicalTool forSignalAnalysis. Societyfor IndustrialandAppliedMathematics,Philadelphia,PA, 1997.

[6] S.Climer andS.K. Bhatia. Imagedatabaseindex-ing usingJPEGcoefficients. PatternRecognition,35(11):2479–2488,November2002.

[7] M. Flickner, et. al. Query by image and videocontent: The QBIC system. IEEE Computer,28(9):23–32,September1995.

[8] B. Holt andL. Hartwick. Retrieving art imagesbyimagecontent:theUC Davis QBIC project.ASLIBProceedings, 46(10):243–248,October1994.

[9] A. Netsev, R. Rastogi,andK. Shim. WALRUS:Asimilarity retrieval algorithmfor imagedatabases.In A. Delis,C. Faloutsos,andS.Ghandeharizadeh,editors,SIGMOD1999: Proceedingsof the ACMSIGMOD International Conference on Manage-ment of Data, pages395–406,Philadelphia,PA,June1999.ACM Press.

[10] www.jpeg.org. Codingof Still Pictures: JPEG2000Part I Final CommitteeDraft, version1.0edi-tion, April 2000.

[11] J.Z. Wang,J.Li, andG. Wiederhold.SIMPLIcity:Semantics-sensitive integratedmatchingfor picturelibraries. IEEE Transactionson Pattern AnalysisandMachineIntelligence, 23,2001.

[12] J.Z. Wang,G. Wiederhold,O. Firschein,andS.X.Wei. Content-basedimageindexing andsearchingusingDaubechies’wavelets.InternationalJournalon Digital Libraries, 1(4):311–328,1997. Figure 3. Representative images from art

category in Smithsonian database