
Research Article

Accelerating Content-Based Image Retrieval via GPU-Adaptive Index Structure

Lei Zhu

School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

Correspondence should be addressed to Lei Zhu; leizhu0608@gmail.com

Received 31 August 2013; Accepted 17 February 2014; Published 20 March 2014

Academic Editors: D. Talia and C.-W. Tsai

Copyright © 2014 Lei Zhu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A tremendous amount of work has been conducted in content-based image retrieval (CBIR) on designing effective index structures to accelerate the retrieval process. Most of these approaches improve retrieval efficiency via complex index structures, and few take into account their parallel implementation on the underlying hardware, so the existing index structures suffer from a low degree of parallelism. In this paper, a novel graphics processing unit (GPU) adaptive index structure, termed plane semantic ball (PSB), is proposed to simultaneously reduce the work of the retrieval process and exploit the parallel acceleration of the underlying hardware. In PSB, semantics are embedded into the generation of representative pivots, and multiple balls are selected to cover more informative reference features. With PSB, the online retrieval of CBIR is factorized into independent components that are implemented on GPU efficiently. Comparative experiments with a GPU-based brute force approach demonstrate that the proposed approach can achieve high speedup with little information loss. Furthermore, PSB is compared with the state-of-the-art approach, random ball cover (RBC), on two standard image datasets, Corel 10K and GIST 1M. Experimental results show that our approach achieves higher speedup than RBC at the same accuracy level.

1. Introduction

With the popularity of high-end digital imaging devices and convenient image storage systems, the amount of high-quality images is growing exponentially. Finding visually similar images in massive image collections is therefore a great challenge. Content-based image retrieval (CBIR) is one of the desirable solutions: it automatically returns similar images for a given query instance, based purely on visual analysis and similarity comparison [1, 2].

CBIR represents the visual content of an image with low-level descriptors obtained via feature extraction. With the extracted visual features, similarities between the query image and the images in the database are calculated and sorted, and the most similar images are returned. For a given query image, the response time of retrieval depends on the complexity of similarity comparison, which is in turn determined by the dimension of the image feature and the size of the image database. For accurate description, high-dimensional features are needed to represent the complex image content generated by high-quality imaging devices. At the same time, large quantities of images are produced by portable image capturing devices.

High-dimensional features and large quantities of images make similarity calculation and ranking time-consuming, so retrieving similar images becomes a computationally expensive task. For a practical retrieval system, the efficiency of CBIR is therefore a crucial design issue.

To ease the computational pressure, researchers have proposed a wide variety of strategies to reduce the retrieval time of CBIR. On the one hand, software developers employ the graphics processing unit (GPU) as one of the important platforms to accelerate feature extraction and similarity comparison. Many researchers have reported promising results on parallelizing sequential algorithms on GPU. Their successful practices demonstrate that hardware acceleration has become one of the important options for improving retrieval efficiency. On the other hand, algorithm designers propose many effective index structures at the algorithm design level to accelerate similarity comparison by pruning unnecessary computations. These index structures generally have elegant mathematical formulations but pay little attention to parallel implementation. Oversight of parallel implementation during algorithm design makes the conventional index structures interleave series of similarity calculations and comparisons. In fact, the inherent dependence between calculation elements in traditional index structures does not fit the characteristics of GPU, which launches a bundle of threads to run independent tasks. Therefore, direct parallelization of a traditional index structure cannot make full use of GPU resources. To the best of our knowledge, few papers concentrate on designing a GPU-adaptive index structure that integrates efforts from both algorithm design and hardware parallelization.

Hindawi Publishing Corporation, The Scientific World Journal, Volume 2014, Article ID 829059, 11 pages, http://dx.doi.org/10.1155/2014/829059

In this paper, we are motivated to combine the advantages of both algorithm design and hardware parallelization. This is our starting point for developing a GPU-based parallel CBIR system. A novel GPU-adaptive index structure, termed plane semantic ball (PSB), is proposed to leverage multiple semantic balls to cover the retrieval space. It should be noted that this technique differs from traditional index structures, which exploit hierarchical space decomposition to prune redundant computations. To improve efficiency further, PSB is designed to represent the involved data structures in the form of matrices so that they can be processed in parallel by different calculation units in GPU. In this case, a large number of threads can be launched on the GPU platform to conduct the same operations concurrently, and thus the online retrieval process of CBIR can be greatly accelerated.

The contributions of the paper are concentrated on the following three facets:

(i) A novel GPU-adaptive index structure, PSB, is proposed to index the features in the database. Different from the existing approaches, PSB is designed to capture the hardware characteristics of GPU.

(ii) An effective parallel strategy is designed to leverage both the space decomposition ability of PSB and the hardware resources of GPU.

(iii) Comprehensive experiments on two image datasets are conducted to demonstrate the effectiveness of the proposed approach.

The remainder of the paper is organized as follows. Related work is reviewed in Section 2. Section 3 provides the background of CBIR systems. The data structure and parallel retrieval algorithm are introduced in Section 4. Section 5 introduces the characteristics of GPU and the programming environment CUDA. Section 6 presents the details of parallel acceleration. Experimental results and discussions are given in Section 7. We finally conclude the paper in Section 8.

2. Related Work

This section presents related work on algorithm acceleration and hardware parallelization of CBIR. Related work on algorithm acceleration mainly concentrates on dimension reduction and index structures. Hardware parallelization mainly focuses on parallel acceleration of CBIR on multicore or many-core platforms.

Dimension reduction reduces the dimension of the feature vector prior to any application and thereby speeds up similarity computation. In this research field, a wide variety of effective algorithms, such as principal component analysis (PCA) [3], locally linear embedding (LLE) [4], and locality preserving projection (LPP) [5], have been proposed. The basic idea of these techniques is to eliminate redundant information in the feature dimensions and compress the feature into low dimensions. Although these techniques can accelerate the retrieval process to some degree, they incur some information loss, which negatively impacts the descriptive ability of the image feature and brings down the retrieval accuracy.

Index structures are another effective approach, which prunes the retrieval space and avoids unnecessary computations. In general, these approaches can be divided into two main categories: tree-based index structures and hashing-based index structures. Tree-based index structures perform well for low-dimensional features (around 20 dimensions) but poorly for high-dimensional features. Their performance even degrades to that of brute force retrieval in the extreme case when the feature dimension becomes very high. In CBIR, hundreds of feature dimensions are needed to describe the complex visual content of an image. Therefore, retrieval performance will be degraded if a tree-based index structure is used to index features. Furthermore, a tree-based index structure involves an interleaved series of similarity computations and comparison operations, which makes parallelization ineffective [6]. Although hashing-based index structures such as locality-sensitive hashing (LSH) [7] and spectral hashing (SH) [8] avoid the drawbacks described above, they are not designed for generic distance types, and thus their application is largely limited. In addition, building a hashing-based index structure involves many parameters, which cannot be easily adjusted by inexperienced users.

There is also a large body of work focusing on hardware parallelization of the retrieval process in CBIR. The authors of [9, 10] propose an effective parallelization of brute force retrieval using compute unified device architecture (CUDA) and CUDA basic linear algebra subroutines (CUBLAS) [11], respectively. An effective algorithm is given in [12] to extend nearest neighbor search to multiple GPUs. A parallelization of a CBIR system is reported in [13] to accelerate the retrieval process on an 8-core system and a 16-core system; the experimental results report 5.6x and 9.7x speedups. The work in [14, 15] exploits a computer cluster to parallelize CBIR with several computers connected by a high-speed network. The authors of [16] develop a three-level flexible image retrieval system (FIRE) using open multi-processing (OpenMP) and shared memory multiprocessors. Challenges such as load balancing, parallelization overhead, and parallelism exploitation are considered in their work. All methods described above can only be considered brute force parallelization, which is inherently parallelizable. Although they have reported promising experimental results, their performance can be further improved if the principles of index structures are taken into account.


The approaches most similar to this work are [17, 18]. They perform nearest neighbor search on GPU by exploiting a simple index structure termed random ball cover (RBC). To the best of our knowledge, RBC may be the most effective GPU-adaptive index structure. However, RBC only uses a simple random subset as representative pivots for space decomposition, which lacks semantic coverage of the feature database. In addition, RBC only selects the closest ball as the retrieval scope for each query feature, which causes considerable loss in retrieval performance. Therefore, to reduce accuracy loss, more representative pivots have to be selected to cover more information. Moreover, their approach only focuses on algorithm design and pays little attention to efficient parallel implementation. To mitigate these drawbacks of RBC, we develop a new GPU-adaptive index structure, PSB, in this paper. In PSB, semantics are embedded to generate representative pivots and make the feature space more compact. An appropriate number of representative pivots is selected according to the similarity measurement to cover more semantic balls and involve more possible reference features. In addition, to maximize the performance of the proposed index structure, we also propose a highly efficient parallel retrieval scheme that makes full use of GPU resources to accelerate the retrieval process.

3. Overview of CBIR System

This section reviews the background of CBIR systems. For a more detailed analysis of the relevant techniques, one can refer to [1, 2].

Although many CBIR systems such as VisualSeek [19], MARS [20], and SIMPLIcity [21] have been proposed in the multimedia field, the majority of them are designed with the same architecture. These systems generally comprise an offline part for feature indexing and an online part for the query interface. Figure 1 shows the typical framework of a CBIR system.

In the offline part, visual features of the images in the database are automatically extracted and stored in the feature database. This process is called feature extraction. In the computer vision field, many effective approaches have been proposed to extract feature descriptors that represent the visual content of an image. In general, these features are classified into global features and local features, which represent images by vectors and vector sets, respectively. Since the main aim of this paper is to accelerate the retrieval process, we simply adopt GIST [22] as the main image feature. The effectiveness of our approach on other features will be explored in future work. In the literature, GIST uses a set of perceptual dimensions (naturalness, openness, roughness, expansion, and ruggedness) to describe the spatial structure of the image. A Gabor filter with fixed parameters is built. The image is filtered and segmented into grid cells, where orientation histograms are extracted. Finally, the responses of the Gabor filters on each grid cell are concatenated into the GIST feature. With the extracted GIST features, distances between images are calculated. In a practical CBIR system, to speed up the retrieval process, features are usually indexed by data structures that prune unnecessary computations. An effective index structure should keep a balance between retrieval accuracy and efficiency. The conventional tree-based and hashing-based index structures are proposed to meet this requirement.

Figure 1: Typical framework of CBIR system. Online part: query interface (query image, results visualization, relevance feedback) and retrieval process (feature extraction, similarity calculation, similarity comparison, similarity ranking). Offline part: image database, feature database, and indexed database.

In the online part, given a particular query image, its visual feature is extracted in the same way as for the images in the database. Dissimilarities between the query feature and the features in the database are then calculated and ranked. Based on the ranked results, the corresponding images are finally forwarded to the user.

In a typical CBIR system, because of the limited discriminating ability of low-level features, several rounds of relevance feedback need to be iterated to bridge the semantic gap between low-level features and high-level semantics. In each round of relevance feedback, the descriptive information of the query image is revised by the feedback information, so the similarities between the query feature and all features in the database must be recalculated. To make things worse, in a practical system several rounds of feedback are generally needed to capture the query intent, which puts more computational pressure on the underlying hardware.

In general, similarity calculation and ranking consume most of the online retrieval time. Hence, it is of great importance to develop an efficient index structure that reduces the query response time and improves the user experience of a CBIR system.

4. Plane Semantic Ball

In this section, details of the proposed GPU-adaptive index structure PSB and its associated parallel retrieval algorithm are presented. The main notations in the paper are listed in the Notations section.

In the offline part of CBIR, we build the structure of PSB. Representative pivots are generated by a clustering process, and balls are built for all pivots to cover the whole feature space. With the built balls, features in the database are assigned to their proper balls according to the nearest distance principle. We call the built index structure semantic since the clustering process embeds semantics into the representative pivots and balls.

With the built PSB, the online retrieval process is divided into two consecutive stages. In the first stage, several nearest representative pivots are selected for the query. Features in the balls of the selected pivots are combined to constitute the retrieval scope of the second stage. The nearest features are found in the second stage and considered the final results of the whole retrieval system.

4.1. Basic Computational Unit. In this subsection, we introduce the basic computational unit (BCU), the module on which both PSB and the online retrieval algorithm are built. The basic aim of BCU is to find the $K$ nearest features for a query. We denote the operation of BCU as $\mathrm{NN}(\mathbf{X}, \mathbf{Y}, K)$, where $\mathbf{X}$ is the set of query features, $\mathbf{Y}$ is the set of reference features, and $K$ is the number of nearest features returned by BCU.

We further decompose the procedure of BCU into two subprocedures: distance computation and distance selection. As the query image $q$ is described by a feature vector and the feature space is actually a vector space, distance computation can be easily transformed into vector-matrix multiplication. This operation can be further transformed into matrix-matrix multiplication when streams of queries are launched. Therefore, many existing GPU implementations of vector-matrix and matrix-matrix multiplication can be applied directly. As image features are generally sparse, optimization tricks for sparse matrix-vector multiplication can be further exploited to improve performance. With the BCU presented above, we introduce the construction of our proposed index structure PSB and its associated retrieval algorithm in the following subsections.
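The two subprocedures of BCU can be illustrated with a small sequential sketch. It is only a CPU stand-in for $\mathrm{NN}(\mathbf{X}, \mathbf{Y}, K)$ under the assumption of Euclidean distance; the function names `euclidean` and `nn` and the toy data are hypothetical and not taken from the paper's GPU implementation.

```python
import heapq
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nn(queries, references, k):
    """Sequential stand-in for NN(X, Y, K).

    For every query, return the indexes of its k nearest reference
    features, ordered from nearest to farthest.
    """
    results = []
    for q in queries:
        # Subprocedure 1: distance computation (distances for one query).
        row = [(euclidean(q, r), j) for j, r in enumerate(references)]
        # Subprocedure 2: distance selection (keep the k smallest entries).
        results.append([j for _, j in heapq.nsmallest(k, row)])
    return results


# Toy data, chosen only for illustration.
references = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]]
queries = [[0.1, 0.0], [4.0, 4.0]]
nearest = nn(queries, references, k=2)
```

Each query is handled independently of the others, which is the property the parallel version relies on.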

4.2. PSB Construction. In this subsection, details about the construction of PSB are described. To build PSB, we first choose representative pivots from the feature database $\mathbf{F}$ to cover the whole feature space. Each representative pivot maintains a ball, which holds its $T$ nearest features from the database. Representative pivots are generated by clustering the feature database so that more semantics are embedded in the cluster centers, which are taken as the representative pivots in this study. Figure 2 shows the basic structure of our proposed GPU-adaptive index structure PSB.

In PSB, representative pivots are determined by $k$-means. The ball of a particular representative pivot $R_f$ is built from $\mathrm{NN}(R_f, \mathbf{F}, T)$, which finds the $T$ nearest neighbors of $R_f$ among $\mathbf{F}$. Its formula can be represented as follows ($|\cdot|$ is the cardinality of a set):

$$
\begin{aligned}
\mathrm{Ball}(R_f) &= \{ f \mid f \in \mathbf{F},\ D(R_f, f) < \Phi_{R_f} \}, \\
|\mathrm{Ball}(R_f)| &= T, \\
\Phi_{R_f} &= \max_{f \in \mathrm{Ball}(R_f)} D(R_f, f).
\end{aligned}
\tag{1}
$$
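As a rough illustration of ball construction, the following sketch assumes that the pivots have already been produced by some clustering step (the $k$-means stage itself is not shown) and that the distance is Euclidean. The helper name `build_balls` and the toy data are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def build_balls(pivots, features, t):
    """For each pivot, keep the indexes of its t nearest features.

    Returns a list of (member_indexes, radius) pairs, where radius is the
    largest pivot-to-member distance inside the ball.
    """
    balls = []
    for pivot in pivots:
        ranked = sorted(range(len(features)),
                        key=lambda j: euclidean(pivot, features[j]))
        members = ranked[:t]
        radius = euclidean(pivot, features[members[-1]])
        balls.append((members, radius))
    return balls


# Toy database with two obvious groups; pivots stand in for cluster centers.
features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [9.0, 9.0], [9.0, 8.0], [8.0, 9.0]]
pivots = [[0.0, 0.0], [9.0, 9.0]]
balls = build_balls(pivots, features, t=3)
# Indexes that fall into at least one ball.
covered = set().union(*(members for members, _ in balls))
```

In this toy case every feature ends up in one of the two balls, which is the coverage property that the choice of $T$ is meant to guarantee.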

By properly setting the value of $T$, every feature point in the database will belong to at least one semantic ball. The combined balls of the $H$ nearest representative pivots are defined as $\mathrm{Union}(R_1, R_2, \ldots, R_H)$, which is calculated by the union operation on the corresponding balls. Formally, it is calculated as $\mathrm{Union}(\mathrm{Ball}(R_1), \mathrm{Ball}(R_2), \ldots, \mathrm{Ball}(R_H))$.

4.3. PSB-Based Online Retrieval. Once the user uploads a query image, its feature is first extracted, and the image is represented by a feature vector. With the PSB in place, the online retrieval process is divided into two consecutive retrieval rounds. The aim of the first round is to reduce the retrieval scope, while the second round returns the final nearest images. Formally, the first retrieval round can be represented as $\mathrm{NN}(\mathbf{Q}, \mathbf{R}, H)$, which finds the $H$ nearest representative pivots for each query among the $S$ representative pivots. The search scope of the second round is the union ball of these $H$ representative pivots, which can be represented as $\mathrm{Union}(\mathrm{NN}(\mathbf{Q}, \mathbf{R}, H))$. Thus, the whole retrieval process can be represented as $\mathrm{NN}(\mathbf{Q}, \mathrm{Union}(\mathrm{NN}(\mathbf{Q}, \mathbf{R}, H)), E)$, which is built simply on the BCU and the union operation on the balls of the nearest representative pivots. Based on the returned results, the corresponding images are returned to the user.

Let us analyse the time complexity of online retrieval. We assume that the feature database comprises $M$ images and that there are $N$ query images, each represented by a feature vector with $d$ dimensions. If brute force retrieval is adopted, the time complexity is $O(MNd)$. In our approach, as there are $S$ representative pivots and each of them points to $T$ image features, the time complexity of online retrieval is $O(N(S + HT)d)$, with $(S + HT) < M$. As semantics are embedded in PSB by clustering, small $S$ and $H$ suffice to obtain the same retrieval accuracy as brute force retrieval. This inference is validated in our experiments, which demonstrate that PSB can achieve high speedup with almost no accuracy loss.
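To make the two complexity expressions concrete, the sketch below counts distance evaluations (each costing $O(d)$) for both schemes. The parameter values are hypothetical and are not the settings used in the experiments; the resulting ratio is a reduction in work, not a measured speedup.

```python
def distance_evaluations(m, n, s, h, t):
    """Count distance evaluations for brute force and for two-round search.

    m: database size, n: number of queries, s: number of pivots,
    h: pivots selected per query, t: features per ball.
    """
    brute_force = n * m          # every query against every feature
    psb = n * (s + h * t)        # s pivots, then h balls of t features
    return brute_force, psb


# Hypothetical setting chosen only to illustrate the arithmetic.
brute, psb = distance_evaluations(m=1_000_000, n=100, s=1000, h=4, t=2000)
ratio = brute / psb
```

Here brute force needs 100,000,000 evaluations and the two-round scheme 900,000, a work ratio of roughly 111.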

5. GPU and CUDA

In this section, GPU and its programming environment CUDA are briefly introduced, together with the points that should be considered during parallel implementation. Figure 3 shows a typical GPU architecture. GPU came about as a powerful many-core processor for graphics rendering. It usually contains several streaming multiprocessors (SMs), which are capable of running threads concurrently. A typical SM is equipped with multiple streaming processors (SPs) and a low-capacity shared memory, which has low access latency and high bandwidth. Global memory is the only memory type in GPU that can be shared by both CPU and GPU. Therefore, data exchange between GPU and CPU is mainly through global memory. The main drawback of global memory is the data access latency caused by PCI transmission. In addition, GPU is equipped with constant and texture memory to speed up read/write operations for frequently accessed data.

In order to enable general applications to run on GPU, we develop a C-like CUDA program that parallelizes the sequential algorithm. To program with CUDA, sequential algorithms should first be analyzed and decomposed into sequential and parallel parts, which are dispatched to run on CPU and GPU, respectively. The parallel parts are organized into kernels, which launch threads to run the same code on segmented data concurrently.

The architectural characteristics of GPU and the special programming mode of CUDA present significant challenges for parallel algorithm design. To maximize efficiency, sequential algorithms should be carefully analyzed and decomposed into components with low dependencies. In addition, the characteristics of parallel programming and memory optimization tricks should be carefully taken into account.

6. Parallel Acceleration

6.1. Data Layouts. A GPU-based parallel algorithm that is memory bound, as is the case in CBIR, may yield poor performance if the programmer ignores the specific characteristics of GPU. Therefore, to obtain better memory throughput, the data structures included in PSB should be placed in the appropriate memory type by taking into account both the intrinsic characteristics of GPU memory and the inherent access patterns of each data structure.

Since representative pivots and reference features are almost constant during the whole retrieval process, they are placed in constant memory and texture memory, respectively, to improve memory throughput. For large-scale image datasets, reference features should be placed in global memory, as the capacity of texture memory is limited. Because query features vary, they are transferred directly to global memory in a layout that permits memory coalescing. Note that all features described above are stored in columns so that each thread can access one component of the current query data.
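One possible reading of the column layout is sketched below: features are flattened so that, for a fixed component, the values belonging to consecutive queries sit next to each other. This is an assumption about the layout rather than the paper's actual memory code, and `to_component_major` is a hypothetical helper.

```python
def to_component_major(features):
    """Flatten row-wise feature vectors into a component-major array.

    Component l of feature q is stored at position l * n + q, so threads
    that handle consecutive features read neighbouring addresses.
    """
    n = len(features)        # number of features
    d = len(features[0])     # feature dimension
    flat = [0.0] * (n * d)
    for q, vec in enumerate(features):
        for l, value in enumerate(vec):
            flat[l * n + q] = value
    return flat


queries = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
flat = to_component_major(queries)
# Component 1 of both queries occupies one contiguous slice.
second_components = flat[1 * 2:2 * 2]
```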

6.2. Parallelization of BCU. Since the retrieval processes of different queries are independent, we parallelize the computation of BCU at the query level; that is, the retrieval processes of the queries run in parallel.

6.2.1. Distance Matrix Computation. The aim of this step is to calculate the pairwise distances between queries and reference features, generating the distance matrix $D$. The element in the $i$th row and $j$th column of the matrix is denoted $D_{ij}$, which represents the distance between reference feature $i$ and query feature $j$. For convenience of presentation, the distance metric between two image features is restricted to the Euclidean distance. This metric can be naturally substituted with other metrics for other features. Formally, $D_{ij}$ is given by

$$
D_{ij} = \sqrt{\sum_{l=1}^{d} (X_{lj} - Y_{li})^2}. \tag{2}
$$

Figure 2: Basic structure of PSB ($R_1$, $R_2$, $R_3$, $R_4$, $R_5$, and $R_6$ are representative pivots, i.e., cluster centers; each pivot has its own ball, labeled $\psi_{R_i}$).

Matrix multiplication can be implemented on GPU efficiently since it can be decomposed into many independent calculation elements. Therefore, in order to make full use of GPU resources, the computation of the distance matrix is transformed into a matrix-based calculation. Formally, the value of $D_{ij}$ can be calculated as

$$
\begin{aligned}
D_{ij} &= \sqrt{(X_j - Y_i)^T (X_j - Y_i)}, \\
X_j &= (X_{1j}, X_{2j}, \ldots, X_{dj})^T, \\
Y_i &= (Y_{1i}, Y_{2i}, \ldots, Y_{di})^T.
\end{aligned}
\tag{3}
$$

After mathematical transformation, $D$ can be calculated as

$$
D = \sqrt{A + B + C}, \tag{4}
$$

where every entry in the $i$th row of $A$ is the squared modulus of $Y_i$, denoted $\|Y_i\|^2$, and every entry in the $j$th column of $B$ is the squared modulus of $X_j$, denoted $\|X_j\|^2$ ($\|\cdot\|$ is the Euclidean norm). The matrix $C$ is computed in the form of a matrix-matrix multiplication between query and reference features:

$$
\begin{aligned}
A_{ij} &= \|Y_i\|^2 = \sum_{l=1}^{d} Y_{li}^2, \\
B_{ij} &= \|X_j\|^2 = \sum_{l=1}^{d} X_{lj}^2, \\
C_{ij} &= -2 \times \sum_{l=1}^{d} X_{lj} \times Y_{li}, \\
D_{ij} &= \sqrt{A_{ij} + B_{ij} + C_{ij}}.
\end{aligned}
\tag{5}
$$

Since in the context of CBIR we only need the rank order of the reference features in the two retrieval rounds, the exact distance values are not needed. The term $B$ is the same for all reference features of a given query and therefore does not affect this order, so its calculation is omitted for simplicity:

$$
D = \sqrt{A + C}. \tag{6}
$$

The computation of matrix $C$ is a matrix-matrix multiplication, for which many implementations exist. Therefore, we do not implement our own GPU code for matrix-matrix multiplication but use the optimized implementation in CUBLAS. To compute the matrix $D$, we then only need to calculate the term $A$ for each reference feature and add it to the proper positions in $C$. In our approach, this process is completed with coarse inter-reference parallelism: as many threads are launched as there are reference features, and each thread calculates the term for one reference feature and adds the value to the corresponding entries of $C$.
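The claim that the query term can be dropped without changing the rank order can be checked with a small sequential sketch. It assumes Euclidean distance and uses the squared norm of each reference feature for the $A$ term; because $A + C$ can be negative once $B$ is removed, the sketch ranks on $A + C$ directly and takes no square root. The function names and toy data are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def rank_by_decomposition(queries, references):
    """Rank references for each query using the A + C terms only.

    A[i] is the squared norm of reference i and the C term is
    -2 * <reference i, query j>; the query term B is dropped because it
    is constant for a given query.
    """
    a = [sum(v * v for v in r) for r in references]
    rankings = []
    for q in queries:
        scores = [a[i] - 2.0 * sum(x * y for x, y in zip(q, r))
                  for i, r in enumerate(references)]
        rankings.append(sorted(range(len(references)),
                               key=scores.__getitem__))
    return rankings


def rank_by_distance(queries, references):
    """Reference ranking from the full Euclidean distance, for comparison."""
    return [sorted(range(len(references)),
                   key=lambda i: math.dist(q, references[i]))
            for q in queries]


# Toy data, chosen only for illustration.
references = [[1.0, 2.0], [3.0, 1.0], [0.0, 0.5], [4.0, 4.0]]
queries = [[0.0, 0.0], [3.0, 3.0]]
fast = rank_by_decomposition(queries, references)
exact = rank_by_distance(queries, references)
```

Both functions return the same ordering for each query in this example, which is all that the two retrieval rounds need.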


Figure 3: Typical architecture of GPU. CPU side: cores 0 to M with registers, cache, and host memory. GPU side: streaming multiprocessors SM 0 to SM P, each with shared memory and streaming processors SP 0 to SP N with registers and local memory, together with texture memory, constant memory, and device memory.

Figure 4: Two types of heap-array built for features of query images (the ordered heap keeps the stored minimum elements ordered, whereas the unordered heap does not have this requirement). Each thread maintains one max-heap that keeps the current $k$ smallest elements; an incoming element is compared with the root element and is either inserted or dropped.

6.2.2. Distance Selection. At this point, the distances between query features and reference features have been computed. This step selects the nearest distances for the queries. Two strategies are available. The first is to select the minimum values directly from the full set of distances. Its main drawbacks are the memory wasted on storing the entire distance vector in global memory and the delay caused by continuous memory access.

In our implementation, a heap-array is built for the features of the query images; that is, a max-heap (whose root element is the largest one in the heap) is built for each query. Two types of max-heap are built in the retrieval process. In the first retrieval round, an unordered max-heap is built to store the distances of the unsorted representative pivots. In the second retrieval round, an ordered max-heap is built to store the distances of the final nearest reference features. Figure 4 shows the basic structure of the ordered and unordered heap-arrays built in our approach. The only difference between the two types of max-heap is whether the stored indexes of the nearest features are kept ordered.

The max-heap keeps track of the elements with minimum distance found so far and excludes unnecessary distance comparisons. Assume that we need to find the $K$ smallest distances ($K = H$ in the first retrieval round and $K = E$ in the second). In the beginning, the first $K$ distances are inserted into the heap directly without comparison. Once the heap is full, each newly encountered element is compared with the root of the max-heap. If its value is larger than the root element, it is excluded directly. In contrast, if its value is smaller than the root element, it takes the place of the root and is moved to its corresponding position in the distance heap by adjustment. For the ordered max-heap, the indexes of the stored features are kept ordered after each adjustment.

After all elements have been processed, the $K$ smallest distances are obtained. However, the main aim of distance selection is to obtain the indexes of the corresponding reference features rather than the distance values. To accomplish this goal, an index heap is allocated as a counterpart to store the indexes of the nearest reference features. Every insertion performed on the distance heap is mirrored on the index heap. After several rounds of comparison and insertion, the indexes of the nearest reference features for the query features are obtained and finally stored in global memory.
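The selection procedure for one query can be sketched with a bounded max-heap. Python's `heapq` is a min-heap, so the sketch stores negated distances to emulate a max-heap, and it keeps each index in the same tuple as its distance instead of maintaining a separate index heap. The function name and the distance values are made up for illustration.

```python
import heapq


def select_k_smallest(distances, k):
    """Return the indexes of the k smallest distances, nearest first.

    The heap holds (-distance, index) pairs, so its root corresponds to
    the largest distance currently kept.
    """
    heap = []
    for index, dist in enumerate(distances):
        if len(heap) < k:
            # Fill phase: the first k elements go in without comparison.
            heapq.heappush(heap, (-dist, index))
        elif dist < -heap[0][0]:
            # Smaller than the root: evict the root and insert the element.
            heapq.heapreplace(heap, (-dist, index))
        # Otherwise the element is dropped without touching the heap.
    # Sorting the k survivors gives the ordered variant of the result.
    return [index for _, index in sorted(heap, reverse=True)]


distances = [22.25, 21.15, 19.43, 16.23, 24.86, 14.32, 20.34]
nearest = select_k_smallest(distances, 3)
```

Only elements that beat the current root trigger a heap adjustment, so most candidates cost a single comparison once the heap is full.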


Figure 5: Parallel online retrieval in CBIR. Queries $Q = \{Q_1, Q_2, \ldots, Q_N\}$ and representative pivots $R = \{R_1, R_2, \ldots, R_S\}$ enter the parallel BCU (distance matrix computation and unordered heap array); a map step assigns queries to Ball($R_1$), Ball($R_2$), Ball($R_3$), and so on; a reduce step produces the $(H \times T) \times N$ distance matrix, which passes through the parallel BCU with the ordered heap array to give the sorted results.

To take advantage of both the max-heap and the inherent characteristics of the GPU memory types, coarse inter-query parallelism is exploited for this procedure. A kernel function is designed, and as many threads are launched as there are queries. Each thread computes the distances between its query and the reference features and selects the indexes of the $K$ minimum distances from the distance vector.

6.3. Parallel Retrieval. Based on the parallel BCU, a MapReduce-like parallel retrieval scheme is designed for CBIR. Figure 5 describes the basic framework of our parallel retrieval process. In this framework, the features of the query images and of the representative pivots are first imported into the parallel BCU module, where the unordered heap array has already been built. After that, the $H$ nearest representative pivots are found for each query feature. Queries are then mapped to their corresponding semantic balls, where the distance matrix between queries and reference features is computed. A reduce operation collects the distances calculated for each query feature from its $H$ balls. This procedure generates an $(HT) \times N$ distance matrix, which is further imported into the ordered heap array. After these procedures, the indexes of the nearest reference features are obtained, and their corresponding images are returned to the user.

7 Experiments and Results

In this section, a set of experiments is performed to demonstrate the performance of the proposed methodology. First, details about the test datasets and experimental setup are introduced. Second, performance variation with the parameters of our approach is investigated. Third, our approach is compared with brute force retrieval and with the state-of-the-art approach on Corel 10 K [23]. Finally, a comparative experiment is performed on GIST 1M [24] to evaluate the performance on a large-scale image dataset.

7.1. Dataset and Experimental Setup. The proposed approach is evaluated on two standard image datasets, Corel 10 K and GIST 1M, which are widely used in the literature for CBIR performance evaluation. Corel 10 K contains 10000 images in 100 categories. In the experiments, 10 images from each category are randomly selected as query images, and the remaining images are used as reference images to be retrieved. For Corel 10 K, GIST features are extracted with 4 blocks and 8 orientations, generating 512-dimensional features. The GIST 1M dataset consists of two feature sets, a query set and a reference set. The reference set contains features of the Holidays image set and of images from Flickr 1M [25]. The query set contains only features from the Holidays images. On GIST 1M, all images are represented by 960-dimensional features. Statistics of the datasets are given in Table 1.

In the experiments, the GPU code is implemented using CUDA. All experiments are performed on a platform equipped with an NVIDIA GeForce GTX 460 GPU, whose CUDA compute capability is 2.1, driver version is 4.0, and GPU clock rate is 1.55 GHz. The platform is also equipped with an Intel Core i7 950 CPU running at 3.07 GHz. The operating system is 64-bit RHEL AS 6 with Linux kernel 2.6.32.


Figure 6: Performance evaluation on Corel 10 K for the number of representative pivots and the radius of ball. (a) MAP variations with the number of representative pivots. (b) MAP variations with the radius of ball. Both panels plot mean average precision and speedup.

Table 1: Statistics of test dataset.

| Image database | Corel 10 K | GIST 1M |
|---|---|---|
| Feature dimensionality | 512 | 960 |
| Image database size | 10000 | 1001000 |
| Query dataset size | 1000 | 1000 |
| Reference dataset size | 9000 | 1000000 |

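Table 2: MAP and speedup variations with $S$ on Corel 10 K ($H = 3$, $T = 500$).

| $S$ | 400 | 600 | 800 | 1000 | 1200 | 1400 | 1600 | 1800 |
|---|---|---|---|---|---|---|---|---|
| MAP | 0.1607 | 0.1612 | 0.1618 | 0.1621 | 0.1623 | 0.1625 | 0.1626 | 0.1624 |
| Speedup | 4.36x | 4.12x | 3.97x | 3.72x | 3.52x | 3.28x | 3.12x | 2.98x |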

Table 3: MAP and speedup variations with $T$ on Corel 10 K ($H = 3$, $S = 300$).


Performance is evaluated in terms of mean average precision (MAP) and speedup. For Corel 10 K, MAP is calculated as the mean of the precisions achieved at all recall levels. For the GIST 1M dataset, MAP is calculated as the mean of the precisions when 100 images are returned. The speedup of an approach is calculated as the ratio of the run time of the GPU-based brute force implementation to the run time of the approach.
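The two measures can be sketched as follows. The sketch assumes the common definition of average precision (precision averaged over the ranks at which relevant images appear) and assumes that speedup is the brute force run time divided by the run time of the evaluated approach, so that values above 1 mean faster retrieval. The paper's exact evaluation code is not given, and the function names are hypothetical.

```python
def average_precision(relevance):
    """Average precision for one query.

    relevance -- ranked list of booleans, True where the returned image
                 belongs to the same category as the query.
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall level
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(relevance_lists):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


def precision_at(relevance, cutoff):
    """Precision when only the first `cutoff` images are returned."""
    return sum(1 for r in relevance[:cutoff] if r) / cutoff


def speedup(brute_force_seconds, approach_seconds):
    """Run-time ratio; values above 1 mean the approach is faster."""
    return brute_force_seconds / approach_seconds


if __name__ == "__main__":
    ranked = [[True, False, True, False], [False, True]]
    print(mean_average_precision(ranked))
    print(speedup(1.73, 1.0))
```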

7.2. Evaluation for Number of Representative Pivots and Radius of Ball. Adjusting the parameters of PSB balances speedup against accuracy loss. In this subsection, experiments are performed to observe how MAP varies with the parameters of PSB. Figure 6 reports the MAP and speedup variations with the number of representative pivots ($S$) and the radius of ball ($T$). Detailed data are given in Tables 2 and 3. The figure shows that when the parameters are above a certain value ($S = 800$ or $T = 700$), the MAP curve becomes steady. This suggests that suitable parameters for PSB can be chosen at the point where the accuracy curve levels off. In addition, Figure 6 shows that the proposed PSB can achieve more than 1x speedup with little accuracy loss. For instance, when $S$ is 300 and $T$ is 1800, the MAP obtained is 0.164 (0.0002 smaller than the exact MAP). At the same time, the speedup is 1.96, which is almost 2x faster than GPU-based brute force retrieval.

Table 4: MAP and speedup variations with $H$ on Corel 10 K ($S = 800$, $T = 700$).

| $H$ | 4 | 5 | 8 | 10 | 12 | 14 | 16 |
|---|---|---|---|---|---|---|---|
| RBC-MAP | 0.161 | 0.162 | 0.1626 | 0.163 | 0.1628 | 0.1632 | 0.1634 |
| RBC-speedup | 3.14x | 2.53x | 2.50x | 2.29x | 2.16x | 2.07x | 1.96x |
| PSB-MAP | 0.1641 | 0.1642 | 0.1642 | 0.1642 | 0.1642 | 0.1642 | 0.1642 |
| PSB-speedup | 1.53x | 1.65x | 1.27x | 1.20x | 1.15x | 1.08x | 1.01x |

7.3. Evaluation for Semantic Embedding and Number of Searched Representative Pivot Selection. In this subsection, we evaluate the effectiveness of semantic embedding and multiple representative pivot selection. In this experiment, $S$ and $T$ are fixed to 800 and 700, respectively, to observe performance variations. Figure 7 shows the performance improvement; concrete data are given in Table 4. Figure 7(a) shows that the proposed PSB achieves higher MAP than RBC for each $H$. When $H$ is 5, we obtain the same MAP as brute force retrieval with a 1.65x speedup. This demonstrates that semantic embedding brings more semantics into the generation of representative pivots and condenses the retrieval space. With the condensed search scope, higher MAP is achieved for the same $H$. Figure 7(a) also reports the MAP and speedup variations with $H$. It shows that the MAP of both RBC and PSB increases with $H$. This result shows that multiple representative pivot selection covers more informative reference features in the database. The negative effect of multiple representative pivot selection is that the retrieval process is slightly extended, which makes the speedup decrease steadily with $H$.

Figure 7: Performance evaluation on Corel 10 K for semantic embedding and multiple representative pivots selection. (a) MAP and speedup variations with the number of searched representative pivots (curves: PSB-MAP, RBC-MAP, PSB-speedup, RBC-speedup). (b) Speedup variations with MAP (curves: RBC, PSB).

Figure 8: Performance comparison with GPU-based brute force retrieval (speedup versus mean average precision loss).

Figure 7(a) shows that both MAP and speedup vary with $H$. Therefore, for convenient comparison and clearer presentation, Figure 7(b) plots the relationship between speedup and MAP. This figure illustrates that our approach achieves lower speedup than RBC for a given $H$, but at the same MAP level it achieves higher speedup. In addition, PSB reaches MAP values that RBC cannot achieve (the best MAP obtained by RBC is 0.1637, while that obtained by PSB is 0.1642).

7.4. Comparison with Brute Force Retrieval. In this subsection, we compare the proposed PSB with brute force retrieval. We download the source code of the GPU-based brute force retrieval from [10] and use it to find the nearest reference features for each query. A set of experiments is performed to observe how the speedup varies with accuracy loss.

Figure 8 shows that the MAP loss of PSB generally ranges from 0.0001 to 0.035. The speedup achieved in this range is significant. Even when the accuracy loss is 0.0001 (very close to exact retrieval), we obtain a 1.73x speedup, a 73% improvement over brute force retrieval. At higher accuracy loss levels, higher speedup is obtained. For instance, we achieve about 2.8x speedup when the accuracy loss is 0.001 and 3.7x speedup when the accuracy loss is 0.002. This experiment demonstrates that high speedup can be obtained with little accuracy loss. It also indicates that the proposed approach is effective for approximate CBIR.

7.5. Comparison with the State-of-the-Art Approach RBC. In this subsection, the performance of PSB is compared with that of the state-of-the-art approach RBC. The source code of the RBC implementation is downloaded from the author's website [26]. In this experiment, the parameters of the two approaches are adjusted manually, and a figure is drawn to reflect the relationship between speedup and accuracy.

Figure 9(a) describes the speedup variations of the two approaches with MAP on Corel 10 K. The figure shows that PSB performs better than RBC and obtains MAP values that RBC cannot reach. The MAP of PSB mainly ranges from 0.1542 to 0.1641, while that of RBC ranges from 0.1313 to 0.1615. The largest MAP obtained by PSB is 0.1641, while that obtained by RBC is only 0.1615. More importantly, PSB obtains higher speedup than RBC at the same accuracy level. For instance, when MAP is 0.1613, PSB achieves a speedup of up to 4.55x, which is larger than the 2.51x speedup achieved by RBC. When MAP is 0.1588, the speedup of PSB is up to 4.58x, which is also larger than the 2.82x achieved by RBC. It should be noted that the speedup of PSB is almost 2 times that of RBC when the accuracy is about 0.154. The main reason that PSB obtains better performance is that PSB covers the whole set of reference features with semantics, so that the retrieval scope can be reduced.

Figure 9: Comparisons with the state-of-the-art approach RBC (speedup of RBC and PSB). (a) Comparison on Corel 10 K. (b) Comparison on GIST 1M.

Figure 9(b) reports the speedup variations of the two approaches with MAP loss on GIST 1M. The figure shows that even with high-dimensional features and a large-scale image dataset, PSB still achieves better performance than RBC. On this dataset, PSB reaches MAP loss levels that RBC cannot reach. At the same accuracy loss level, our approach also achieves higher speedup.

8 Conclusions

CBIR is one of the available solutions for automatically finding images similar to a query in large image collections. To overcome the computational complexity of CBIR, a novel GPU-adaptive index structure is proposed in this paper to simultaneously prune unnecessary computations and leverage the parallel processing capability of the GPU. With the GPU-adaptive index structure, the retrieval process of CBIR can be decomposed into independent components that can be parallelized with high efficiency. Effective parallel techniques are designed to make full use of the computational resources of the underlying hardware. Experiments on standard image retrieval datasets demonstrate that the proposed approach achieves substantial speedup with little accuracy loss and also achieves higher speedup than the state-of-the-art approach. These experiments indicate that the proposed approach can be an effective and robust tool for addressing the computational challenges of CBIR. In the future, we will deploy the proposed algorithm in other applications and evaluate its performance.

Notations

$M$: Number of images in database
$N$: Number of query images
$S$: Number of representative pivots
$T$: Radius of ball
$E$: Number of results returned by CBIR
$H$: Number of searched representative pivots
$d$: Dimension of feature vector
$D$: Distance matrix between features
$\mathcal{D}$: Dissimilarity measure
$R_f$: Feature of a particular representative pivot
$\mathbf{R} = \{e_1, e_2, \ldots, e_S\}$: Features of the representative pivots
$\mathbf{F} = \{f_1, f_2, \ldots, f_M\}$: Feature database
$\mathbf{Q} = \{q_1, q_2, \ldots, q_N\}$: Query features

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The author thanks the anonymous reviewers for their helpfulsuggestions


References

[1] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-based image retrieval at the end of the early years," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349–1380, 2000.

[2] Y. Rui, T. S. Huang, and S.-F. Chang, "Image retrieval: current techniques, promising directions, and open issues," Journal of Visual Communication and Image Representation, vol. 10, no. 1, pp. 39–62, 1999.

[3] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, "Uncorrelated multilinear principal component analysis through successive variance maximization," in Proceedings of the 25th International Conference on Machine Learning, pp. 616–623, July 2008.

[4] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.

[5] W. Yu, X. Teng, and C. Liu, "Face recognition using discriminant locality preserving projections," Image and Vision Computing, vol. 24, no. 3, pp. 239–248, 2006.

[6] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.

[7] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," Communications of the ACM, vol. 51, no. 1, pp. 117–122, 2008.

[8] Y. Weiss, R. Fergus, and A. Torralba, "Multidimensional spectral hashing," in Proceedings of the 12th European Conference on Computer Vision (ECCV '12), pp. 340–353, Springer, Berlin, Germany.

[9] V. Garcia, E. Debreuve, and M. Barlaud, "Fast k nearest neighbor search using GPU," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR '08), Los Alamitos, CA, USA, June 2008.

[10] V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud, "K-nearest neighbor search: fast GPU-based implementations and application to high-dimensional feature matching," in Proceedings of the 17th IEEE International Conference on Image Processing (ICIP '10), pp. 3757–3760, Los Alamitos, CA, USA, September 2010.

[11] NVIDIA, CUDA CUBLAS Library, 2007.

[12] K. Kato and T. Hosino, "Multi-GPU algorithm for k-nearest neighbor problem," Concurrency Computation Practice and Experience, vol. 24, no. 1, pp. 45–53, 2012.

[13] Q. Miao, Y. Chen, J. Li, Q. Zhang, Y. Zhang, and G. Chen, "Parallelization and optimization of a CBVIR system on multi-core architectures," in Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS '09), Los Alamitos, CA, USA, May 2009.

[14] J. L. Bosque, O. D. Robles, L. Pastor, and A. Rodríguez, "Parallel CBIR implementations with load balancing algorithms," Journal of Parallel and Distributed Computing, vol. 66, no. 8, pp. 1062–1075, 2006.

[15] T. M. Lehmann, M. O. Güld, C. Thies, et al., "Content-based image retrieval in medical applications," Methods of Information in Medicine, vol. 43, no. 4, pp. 354–361, 2004.

[16] C. Terboven, T. Deselaers, C. Bischof, and H. Ney, "Shared-memory parallelization for content-based image retrieval," in Proceedings of Workshop on Computation Intensive Methods for Computer Vision (CIMCV '10), Springer, Berlin, Germany, 2010.

[17] L. Cayton, "A nearest neighbor data structure for graphics hardware," in Proceedings of the 1st International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS '10).

[18] L. Cayton, "Accelerating nearest neighbor search on manycore systems," in Proceedings of the 26th International Conference on Parallel & Distributed Processing Symposium (IPDPS '12), pp. 402–413, Los Alamitos, CA, USA, 2012.

[19] J. R. Smith and S.-F. Chang, "VisualSEEk: a fully automated content-based image query system," in Proceedings of the 4th ACM International Multimedia Conference, pp. 87–98, New York, NY, USA, November 1996.

[20] Y. Rui, T. S. Huang, and S. Mehrotra, "Content-based image retrieval with relevance feedback in MARS," in Proceedings of the International Conference on Image Processing, pp. 815–818, Los Alamitos, CA, USA, October 1997.

[21] J. Z. Wang, J. Li, and G. Wiederhold, "SIMPLIcity: semantics-sensitive integrated matching for picture libraries," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, pp. 947–963, 2001.

[22] A. Oliva and A. Torralba, "Modeling the shape of the scene: a holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.

[23] G. Liu and J. Yang, "Content-based image retrieval using color difference histogram," Pattern Recognition, vol. 46, no. 1, pp. 188–198, 2012.

[24] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011.

[25] H. Jégou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," Lecture Notes in Computer Science, vol. 5302, no. 1, pp. 304–317, 2008.

[26] http://www.lcayton.com/code.html


ing algorithm design makes the conventional index structures interleave series of similarity calculations and comparisons. In fact, the inherent dependence between calculation elements in traditional index structures does not suit the characteristics of the GPU, which launches a bundle of threads to run independent tasks. Therefore, direct parallelization of a traditional index structure cannot make full use of GPU resources. To the best of our knowledge, few papers concentrate on designing a GPU-adaptive index structure that integrates the efforts of both algorithm design and hardware parallelization.

In this paper, we are motivated to design our approach by combining the advantages of both algorithm design and hardware parallelization. This is our starting point for developing the GPU-based parallel CBIR system. A novel GPU-adaptive index structure, termed plane semantic ball (PSB), is proposed to leverage multiple semantic balls to cover the retrieval space. It should be noted that this technique differs from traditional index structures, which exploit hierarchical space decomposition to prune redundant computations. To improve efficiency further, PSB represents the involved data structures in the form of matrices, so that they can be processed in parallel by different calculation units in the GPU. In this way, a large number of threads can be launched on the GPU to conduct the same operations concurrently, and thus the online retrieval process of CBIR can be greatly accelerated.

The main contributions of the paper are threefold:

(i) A novel GPU-adaptive index structure, PSB, is proposed to index the features in the database. Unlike existing approaches, PSB is designed to capture the hardware characteristics of the GPU.

(ii) An effective parallel strategy is designed to leverage both the space decomposition ability of PSB and the hardware resources of the GPU.

(iii) Comprehensive experiments on two image datasets are conducted to demonstrate the effectiveness of the proposed approach.

The remainder of the paper is organized as follows. Related work is reviewed in Section 2. Section 3 provides the background of the CBIR system. The data structure and parallel retrieval algorithm are introduced in Section 4. Section 5 introduces the characteristics of the GPU and the CUDA programming environment. Section 6 presents the details of parallel acceleration. Experimental results and discussions are given in Section 7. We finally conclude the paper in Section 8.

2 Related Work

This section presents related work on algorithm acceleration and hardware parallelization of CBIR. Related work on algorithm acceleration mainly concentrates on dimension reduction and index structures. Hardware parallelization mainly focuses on parallel acceleration of CBIR on multicore or many-core platforms.

Dimension reduction reduces the dimension of the feature vector prior to any application and thereby speeds up similarity computation. In this research field, a wide variety of effective algorithms have been proposed, such as principal component analysis (PCA) [3], locally linear embedding (LLE) [4], and locality preserving projection (LPP) [5]. The basic idea of these techniques is to eliminate the redundant information in the feature dimensions and compress the feature into fewer dimensions. Although these techniques can accelerate the retrieval process to some degree, they incur some information loss, which weakens the descriptive ability of the image feature and lowers the retrieval accuracy.

Index structure is another effective approach, which prunes the retrieval space and avoids unnecessary computations. In general, these approaches can be divided into two main categories: tree-based index structures and hashing-based index structures. Tree-based index structures perform well for low-dimensional features (around 20 dimensions) but poorly for high-dimensional features. Their performance even degrades to that of brute force retrieval in the extreme case when the feature dimension becomes very high. In CBIR, hundreds of feature dimensions are needed to describe the complex visual content of an image. Therefore, the retrieval performance degrades if a tree-based index structure is used to index the features. Furthermore, a tree-based index structure involves an interleaved series of similarity computations and comparison operations, which makes it ineffective for parallelization [6]. Although hashing-based index structures such as locality-sensitive hashing (LSH) [7] and spectral hashing (SH) [8] avoid the drawbacks described above, they are not designed for generic distance types, and thus their application is largely limited. Another weakness of hashing-based index structures is that building them involves many parameters, which cannot be easily adjusted by inexperienced users.

There is also a large body of work focusing on hardware parallelization of the retrieval process in CBIR. The authors of [9, 10] propose effective parallelizations of brute force retrieval using compute unified device architecture (CUDA) and CUDA basic linear algebra subroutines (CUBLAS) [11], respectively. An effective algorithm is given in [12] to extend the nearest neighbor search to multiple GPUs. A parallelization of a CBIR system is reported in [13] to accelerate the retrieval process on an 8-core system and a 16-core system; the experimental results report 5.6x and 9.7x speedups. The work in [14, 15] exploits a computer cluster to parallelize CBIR, with several computers connected by a high-speed network. The authors of [16] develop a three-level flexible image retrieval system (FIRE) using open multiprocessing (OpenMP) on shared-memory multiprocessors. Challenges such as load balancing, parallelization overhead, and parallelism exploitation are considered in their work. All the methods described above can only be considered as parallelizations of brute force retrieval, which is inherently parallelizable. Although they report promising experimental results, their performance can be further improved if the principles of index structures are taken into account.







[Figure: CBIR system diagram with blocks labeled query interface, online, retrieval process, feature extraction, similarity ranking, offline, image database, feature database, and indexed database.]
















$$
\begin{aligned}
\mathrm{Ball}(R_f) &= \{\, f \mid f \in \mathbf{F},\ \mathcal{D}(R_f, f) < \Phi_{R_f} \,\}, \\
|\mathrm{Ball}(R_f)| &= T, \\
\Phi_{R_f} &= \max_{f \in \mathrm{Ball}(R_f)} \mathcal{D}(R_f, f).
\end{aligned}
\tag{1}
$$
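Equation (1) describes a ball as the $T$ reference features closest to a representative pivot, with the ball radius given by the largest distance inside the ball. A minimal sketch of that construction follows, assuming Euclidean distance as the dissimilarity measure and counting the boundary feature as a member; `build_ball` is a hypothetical name, and the offline construction used in the paper may differ.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ball(pivot, features, t):
    """Return (member_indexes, radius) for the ball around one pivot.

    The ball holds the t features nearest to the pivot, so its size is t,
    and the radius is the largest pivot-to-member distance.
    """
    distances = [(euclidean(pivot, f), idx) for idx, f in enumerate(features)]
    members = heapq.nsmallest(t, distances)  # sorted by ascending distance
    radius = members[-1][0] if members else 0.0
    return [idx for _, idx in members], radius


if __name__ == "__main__":
    feats = [(3, 4), (1, 0), (0, 2), (6, 8)]
    print(build_ball((0, 0), feats, 2))
```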



5 GPU and CUDA











$$
D_{ij} = \sqrt{\sum_{l=1}^{d} \left( X_{lj} - Y_{li} \right)^{2}} \tag{2}
$$



[Figure: representative pivots R1 to R6 (cluster centers) with $\psi_{R_i}$ marked.]



$$
D_{ij} = \sqrt{(X_j - Y_i)^{T} (X_j - Y_i)}, \qquad
X_j = (X_{1j}, X_{2j}, \ldots, X_{dj})^{T}, \qquad
Y_i = (Y_{1i}, Y_{2i}, \ldots, Y_{di})^{T}
\tag{3}
$$

$$
D = \sqrt{A + B + C} \tag{4}
$$

$$
\begin{aligned}
A_{ij} &= \| Y_i \|^{2} = \sum_{l=1}^{d} Y_{li}^{2}, \\
B_{ij} &= \| X_j \|^{2} = \sum_{l=1}^{d} X_{lj}^{2}, \\
C_{ij} &= -2 \times \sum_{l=1}^{d} X_{lj} \times Y_{li}, \\
D_{ij} &= \sqrt{A_{ij} + B_{ij} + C_{ij}}
\end{aligned}
\tag{5}
$$

$$
D = \sqrt{A + C} \tag{6}
$$
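The decomposition in Equations (4) to (6) can be checked numerically. The sketch below assumes that $X_j$ is the query feature and $Y_i$ a reference feature, so that the term $B$ is the same for every reference and can be dropped without changing the ranking for that query. It ranks on $A + C$ directly instead of taking a square root, since $A + C$ can be negative. All function names are hypothetical.

```python
import math


def squared_norm(v):
    """Sum of squared components of a vector."""
    return sum(x * x for x in v)


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def distance_terms(x, y):
    """Return (A, B, C) with A = ||y||^2, B = ||x||^2, C = -2 * <x, y>."""
    return squared_norm(y), squared_norm(x), -2.0 * dot(x, y)


def rank_references(query, references, use_b=True):
    """Rank reference indexes by A + B + C, or by A + C when use_b is False."""
    scores = []
    for i, y in enumerate(references):
        a, b, c = distance_terms(query, y)
        scores.append(((a + b + c) if use_b else (a + c), i))
    return [i for _, i in sorted(scores)]


if __name__ == "__main__":
    q = (1, 2)
    refs = [(1, 2), (4, 6), (0, 0)]
    print(math.sqrt(sum(distance_terms(q, refs[1]))))  # full distance via A + B + C
    print(rank_references(q, refs, use_b=True))
    print(rank_references(q, refs, use_b=False))
```

Because $C$ is a matrix product of the query and reference feature matrices, this form lets the bulk of the work be expressed as one matrix multiplication, which is the kind of operation CUBLAS provides.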



[Figure: CPU and GPU memory organization. Labels: CPU, registers, cache, host memory, GPU, SM 0 to SM P, SP 0 to SP N, local memory, texture memory, constant memory, device memory.]


[Figure: heap insertion example. A candidate value is compared with the root element and is either inserted or dropped.]








Q = Q1 Q2 QN

Q1 Q4 Q100

Q2 Q8 Q50

Q12 Q

31 Q985

R10 R223 R123R23 R853 R778

R97 R246 R1034

Parallel BCU



Map





Parallel BCUReduce

Ordered heap array

Sorted results


R = R1 R2 RS











0164

0162

016

0158

0156

01540

3

4

5

Spee

dup


Mea

n av

erag

e pre

cisio

n

200 400 600 800 1000 1200 1400 1600 1800 2000

(a)

Mea

n av

erag

e pre

cisio

n

017

016

015

014

Radius of ball

Spee

dup

15

13

11

9

7

5

3

1

0 200 400 600 800 1000 1200 1400 1600 1800 2000

(b)





3 119879 = 500)



3 119878 = 300)









Mea

n av

erag

e pre

cisio

n

017

016

015


Spee

dup

9

7

5

3

PSB-MAPRBC-MAP


1

(a)

0154 0156 0158 016 0162 0164 0166 01681

15

2

25

3

35

4

45

5

55


Spee

dup

RBCPSB

(b)


0001 0005 001 0015 002 0025 003 00350

2

4

6

8

10

12

14

16


Spee

dup









013

013

5

014

014

5

015

015

88

015

5

016

13

016

42

012

4

6

8

10

12

14


Spee

dup

RBCPSB

(a)

0 005 01 015 02 025 03 035 04 045 051

2

3

4

5

6

7

8

Spee

dup

RBCPSB


(b)




8 Conclusions



Notations






Acknowledgment



References



































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in









Query interface


Yes

No

Online


Retrieval process

Featureextraction


Similarityranking


Offline

Imagedatabase

Featuredatabase

Indexeddatabase
















Ball (119877119891) = 119891 | 119891 isin FD (119877119891 119891) lt Φ119877119891

10038161003816100381610038161003816Ball (119877119891)

10038161003816100381610038161003816= 119879

Φ119877119891= MAX119891isinBall(119877119891)

D (119877119891 119891)

(1)



5 GPU and CUDA











119863119894119895 =radic

119889

sum

119897=1

(119883119897119895 minus 119884119897119894)2 (2)



(cluster center)


R1

R2

R3R4

R5R6

120595R119894



119863119894119895 =radic(119883119895 minus 119884119894)

119879(119883119895 minus 119884119894)

119883119895 = (1198831119895 1198832119895 119883119889119895)119879

119884119894 = (1198841119894 1198842119894 119884119889119894)119879

(3)


119863 = radic119860 + 119861 + 119862 (4)


119860 119894119895 =10038171003817100381710038171198841198941003817100381710038171003817 =

radic

119889

sum

119897=1

1198842119897119894119861119894119895 =

10038171003817100381710038171003817119883119895

10038171003817100381710038171003817= radic

119889

sum

119897=1

1198832119897119895

119862119894119895 = minus2 times

119889

sum

119897=1

119883119897119895 times 119884119897119894119863119894119895 = radic119860 119894119895 + 119861119894119895 + 119862119894119895

(5)


119863 = radic119860 + 119862 (6)



CPU


Registers Registers


Cache

Host memory

GPU

SM 0 SM P


SP 0 SP N SP 0 SP N

Localmemory

Localmemory

Localmemory

Localmemory

Texturememory

Constantmemory

Devicememory


2034Yes

Drop

Insertion

Rootelement

No


2034Yes

Drop

Insertion

Rootelement

No



2225

2115

1943

1623

2225

2115

1943

1623

2486

2486

2158

2158

1432

1432

2332

2332

2599

2599

2314

2314

1621

1621

2101

2101

1669

1669

1325

1325

1185

1185

1549

1549

2635

2635

2541

2541

2364

2364

1435

1435








Q = Q1 Q2 QN

Q1 Q4 Q100

Q2 Q8 Q50

Q12 Q

31 Q985

R10 R223 R123R23 R853 R778

R97 R246 R1034

Parallel BCU



Map





Parallel BCUReduce

Ordered heap array

Sorted results


R = R1 R2 RS











0164

0162

016

0158

0156

01540

3

4

5

Spee

dup


Mea

n av

erag

e pre

cisio

n

200 400 600 800 1000 1200 1400 1600 1800 2000

(a)

Mea

n av

erag

e pre

cisio

n

017

016

015

014

Radius of ball

Spee

dup

15

13

11

9

7

5

3

1

0 200 400 600 800 1000 1200 1400 1600 1800 2000

(b)





3 119879 = 500)



3 119878 = 300)









Mea

n av

erag

e pre

cisio

n

017

016

015


Spee

dup

9

7

5

3

PSB-MAPRBC-MAP


1

(a)

0154 0156 0158 016 0162 0164 0166 01681

15

2

25

3

35

4

45

5

55


Spee

dup

RBCPSB

(b)


0001 0005 001 0015 002 0025 003 00350

2

4

6

8

10

12

14

16


Spee

dup









013

013

5

014

014

5

015

015

88

015

5

016

13

016

42

012

4

6

8

10

12

14


Spee

dup

RBCPSB

(a)

0 005 01 015 02 025 03 035 04 045 051

2

3

4

5

6

7

8

Spee

dup

RBCPSB


(b)




8 Conclusions



Notations






Acknowledgment



References



































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in










Ball (119877119891) = 119891 | 119891 isin FD (119877119891 119891) lt Φ119877119891

10038161003816100381610038161003816Ball (119877119891)

10038161003816100381610038161003816= 119879

Φ119877119891= MAX119891isinBall(119877119891)

D (119877119891 119891)

(1)



5 GPU and CUDA











119863119894119895 =radic

119889

sum

119897=1

(119883119897119895 minus 119884119897119894)2 (2)



(cluster center)


R1

R2

R3R4

R5R6

120595R119894



119863119894119895 =radic(119883119895 minus 119884119894)

119879(119883119895 minus 119884119894)

119883119895 = (1198831119895 1198832119895 119883119889119895)119879

119884119894 = (1198841119894 1198842119894 119884119889119894)119879

(3)


119863 = radic119860 + 119861 + 119862 (4)


$$A_{ij} = \lVert Y_i \rVert^2 = \sum_{l=1}^{d} Y_{li}^2, \qquad B_{ij} = \lVert X_j \rVert^2 = \sum_{l=1}^{d} X_{lj}^2,$$

$$C_{ij} = -2 \times \sum_{l=1}^{d} X_{lj} \times Y_{li}, \qquad D_{ij} = \sqrt{A_{ij} + B_{ij} + C_{ij}} \tag{5}$$
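The step from (3) to the three-term form can be made explicit by expanding the quadratic form. The short derivation below takes A and B to be the squared norms of Y_i and X_j, which is what the expansion requires for the sum under the square root to equal the squared distance.

```latex
\begin{aligned}
D_{ij}^2 &= (X_j - Y_i)^T (X_j - Y_i) \\
         &= Y_i^T Y_i + X_j^T X_j - 2\, X_j^T Y_i \\
         &= \underbrace{\sum_{l=1}^{d} Y_{li}^2}_{A_{ij}}
          + \underbrace{\sum_{l=1}^{d} X_{lj}^2}_{B_{ij}}
          \underbrace{{} - 2 \sum_{l=1}^{d} X_{lj} Y_{li}}_{C_{ij}}
\end{aligned}
```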


$$D = \sqrt{A + C} \tag{6}$$
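The decomposition can be checked numerically with a small sketch. It assumes A and B are the squared norms of Y_i and X_j, and it reads (6) as a ranking score: for a fixed j the term B is the same for every i, so ordering by A + C gives the same order as ordering by the full distance. The sketch compares A + C directly rather than its square root, because A + C can be negative. All names and the toy vectors are hypothetical.

```python
import math


def squared_norm(v):
    """Sum of squared components of a vector."""
    return sum(x * x for x in v)


def decomposed_terms(X_j, Y_i):
    """Return (A, B, C) with A = |Y_i|^2, B = |X_j|^2, C = -2 * <X_j, Y_i>."""
    A = squared_norm(Y_i)
    B = squared_norm(X_j)
    C = -2.0 * sum(x * y for x, y in zip(X_j, Y_i))
    return A, B, C


def direct_distance(X_j, Y_i):
    """Euclidean distance computed directly from the component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(X_j, Y_i)))


# Toy data (assumed): one fixed vector X_j and three vectors Y_i.
X_j = (1.0, 2.0, 2.0)
Ys = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (4.0, 6.0, 2.0)]

full, reduced = [], []
for Y_i in Ys:
    A, B, C = decomposed_terms(X_j, Y_i)
    # Clamp at zero to guard against tiny negative rounding errors.
    full.append(math.sqrt(max(A + B + C, 0.0)))
    reduced.append(A + C)  # B dropped: constant for this X_j

rank_full = sorted(range(len(Ys)), key=lambda i: full[i])
rank_reduced = sorted(range(len(Ys)), key=lambda i: reduced[i])
print(full, reduced, rank_full, rank_reduced)
```

In this toy case the reduced score of the second vector is negative, which is why the sketch ranks on A + C itself instead of taking a square root.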



[Figure: CPU and GPU memory organization. CPU side: registers, cache, host memory. GPU side: SM 0 to SM P, each containing SP 0 to SP N with registers and local memory, plus texture memory, constant memory, and device memory.]


[Figure: heap update example. A candidate element is compared with the root element of the heap; depending on the yes/no outcome it is either dropped or inserted. Example node values omitted.]
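The labels of the heap figure (insertion, root element, drop) suggest a bounded heap that keeps the best results seen so far. A minimal sketch of that idea follows, assuming a max-heap over the k smallest distances, so that a candidate is inserted only when it beats the root and is dropped otherwise. The function name `update_top_k`, the value of k, and the sample distances are hypothetical; the direction of the comparison is an assumption rather than something the surviving text confirms.

```python
import heapq


def update_top_k(heap, distance, k):
    """Offer one distance to a bounded heap of the k smallest distances.

    Python's heapq is a min-heap, so distances are stored negated:
    -heap[0] is then the largest distance currently kept (the root).
    Returns True if the candidate was inserted, False if it was dropped.
    """
    if len(heap) < k:
        heapq.heappush(heap, -distance)
        return True
    if distance < -heap[0]:
        # Candidate beats the root: replace the root with the candidate.
        heapq.heapreplace(heap, -distance)
        return True
    return False  # Candidate is no better than the root: drop it.


# Toy stream of distances (assumed values), keeping the k = 3 smallest.
heap = []
for d in [5.0, 2.0, 9.0, 1.0, 7.0]:
    update_top_k(heap, d, k=3)

kept = sorted(-x for x in heap)
print(kept)
```

Each update costs O(log k), and the kept distances only need a final sort once all candidates have been offered.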








[Figure: map-reduce retrieval pipeline. Queries Q = {Q1, Q2, ..., QN} and pivots R = {R1, R2, ..., RS} are grouped (for example Q1, Q4, Q100 with R10, R223, R123). Stage labels: Map, Parallel BCU, Reduce, Ordered heap array, Sorted results.]











[Figure: mean average precision and speedup, panels (a) and (b); the horizontal axis runs from 200 to 2000 and is labeled "Radius of ball" in panel (b).]





3 T = 500)

3 S = 300)









[Figures: comparison of PSB and RBC, panels (a) and (b). Axes: mean average precision and speedup. Legends: PSB-MAP, RBC-MAP, PSB, RBC.]




8 Conclusions



Notations






Acknowledgment



References


































