
SaveRF: Towards Efficient Relevance Feedback Search

Heng Tao Shen¹, Beng Chin Ooi², Kian-Lee Tan²

¹ School of Information Technology and Electrical Engineering, The University of Queensland, Australia

² Department of Computer Science, National University of Singapore, Singapore

Abstract

In multimedia retrieval, a query is typically interactively refined towards the 'optimal' answers by exploiting user feedback. However, in existing work, the refined query is re-evaluated from scratch in each iteration. This is not only inefficient but also fails to exploit the answers that may be common between iterations. In this paper, we introduce a new approach called SaveRF (Save random accesses in Relevance Feedback) for iterative relevance feedback search. SaveRF predicts the potential candidates for the next iteration and maintains this small set for efficient sequential scan. By doing so, repeated candidate accesses can be saved, reducing the number of random accesses. In addition, an efficient scan on the overlap before the search starts tightens the search space with a smaller pruning radius. We implemented SaveRF, and our experimental study on real-life data sets shows that it can reduce the I/O cost significantly.

1. Introduction

Existing content-based systems that exploit low-level features (such as color and texture) do not necessarily return semantically relevant (based on human perception) answers. One promising direction towards semantic retrieval is the adoption of a relevance feedback mechanism [2, 3]. A relevance feedback process is interactive and iterative in nature. From the current results returned by the system, the user provides feedback to the system; based on this feedback, the system refines the query to get better results that are closer to the user's expectations. The feedback query is usually refined by moving the query to a new position, by modifying the similarity metric (i.e., the weights of the feature vectors), or both, based on the selected objects. Unfortunately, the iterative nature of the relevance feedback loop further lengthens the search time. Since the feedback query moves away from the previous one with an updated similarity metric, a complete KNN search has to be re-performed in the next iteration. If an indexing structure is deployed, then each iteration corresponds to one new KNN search inside the structure. Random accesses are generally the major cost concern in most research work.

However, the search spaces of two consecutive queries may overlap substantially, given that the query in the current iteration is refined based on the "good" results from the last iteration. In this paper, we propose SaveRF, a new method to speed up relevance feedback search by discovering the overlap between two consecutive iterations. SaveRF investigates three methods, namely linear regression, exponential smoothing, and linear exponential smoothing, to predict the new query to be searched in the next iteration. Taking the characteristics of relevance feedback search into consideration, SaveRF further introduces adaptive linear exponential smoothing to achieve better prediction quality. By forecasting the search space of the new query, the overlap between the search spaces of two consecutive queries can be estimated. By performing a sequential scan on the overlap, expensive random accesses on the candidates lying in the overlap can be avoided in the next iteration; hence the total number of random accesses is reduced. SaveRF integrates easily with existing feedback mechanisms and indexing structures. Our experimental study demonstrates both the effectiveness and the efficiency of SaveRF, and also uncovers some interesting characteristics of relevance feedback search.

2. The SaveRF

In this paper, our goal is to achieve efficient KNN search during the feedback loop by reducing the search time of each iteration. Our approach is to achieve faster retrieval in the subsequent iterations as the relevance feedback loop goes on. Based on the information obtained from the early iterations, the number of random accesses can be further reduced in the next iteration under the corresponding index structure. Hence the key is to maintain information across iterations and exploit their correlations.

Our inspiration comes from the following observations on relevance feedback search. First, the search space of the refined query Q_{t+1} in the (t+1)-th iteration is highly likely to overlap with that of Q_t. Second, relevance feedback mechanisms assume that the feedback query moves closer to the "optimal" query as more iterations are processed.

SaveRF forecasts Q_{t+1} using the query information from the first iteration to the t-th iteration. The prediction is made for every dimension of Q_{t+1}, including both the coordinate value and the weight. Denote the prediction of Q_{t+1} by Q′_{t+1}. We first look at how linear regression can be adapted to relevance feedback, followed by exponential smoothing [1].

2.1. Overlap Prediction

2.1.1. Linear Regression The feedback queries are assumed to move along a direction from the initial query towards its "optimal" position. This matches the assumption of linear regression (LR) that the data change monotonically (increasing or decreasing) over time.

First, we look at how the i-th dimensional value can be predicted. We denote the prediction of q_{t+1} by q′_{t+1}. The forecast of the i-th coordinate value using linear regression is computed as follows:

q′_{t+1}[i] = α + β · t

where t represents the t-th iteration, and α and β are the parameters determined by regression; β indicates the amount of change per iteration. The estimates of α and β are chosen to minimize the forecasting SSE (Sum of Squared Errors).
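As a concrete illustration, the following is a minimal sketch of this per-dimension regression forecast in Python, assuming query histories are stored as plain arrays; the function and variable names (lr_forecast, history) are ours, not the paper's.

```python
import numpy as np

def lr_forecast(history):
    """Per-dimension linear-regression forecast of the next feedback query.

    history[t][i] holds dimension i of the query at iteration t+1;
    alpha and beta are fit by least squares (minimal SSE), and the
    forecast is evaluated at the next iteration index.
    """
    H = np.asarray(history, dtype=float)       # shape: (t, d)
    t_idx = np.arange(1, H.shape[0] + 1)       # iterations 1 .. t
    preds = np.empty(H.shape[1])
    for i in range(H.shape[1]):
        beta, alpha = np.polyfit(t_idx, H[:, i], 1)   # slope, intercept
        preds[i] = alpha + beta * (t_idx[-1] + 1)     # q'_{t+1}[i]
    return preds
```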

2.1.2. Exponential Smoothing Exponential smoothing (ES) assigns unequal weights to the data: the largest weight to the most recent observation and the smallest to the earliest. It provides better predictions when the forecast horizon is short. When applied to relevance feedback, exponential smoothing gives the greatest weight to the most recent query and the least weight to the initial query, i.e., it is more "responsive" to changes occurring in the recent iterations. In exponential smoothing, q′_{t+1}[i] is computed as:

q′_{t+1}[i] = α · q_t[i] + (1 − α) · q′_t[i]

where α is the smoothing parameter and 0 < α < 1. Exponential smoothing is intuitively more appealing.

One major drawback of exponential smoothing is that there is no intrinsic best value for α. To determine α, a set of candidate values is generally tested, and the value that best fits the queries is selected. We choose α from the set {0.05, 0.1, ..., 0.9, 0.95}; the value that gives the minimal SSE is then chosen.
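A minimal per-dimension sketch of this grid search, under the same assumptions as the earlier snippet (again, all names are ours):

```python
import numpy as np

def es_forecast(series, alphas=np.arange(0.05, 1.0, 0.05)):
    """Exponential smoothing of one dimension of the query history,
    with alpha picked from the grid by minimal one-step-ahead SSE."""
    best = (float("inf"), None, None)          # (sse, alpha, forecast)
    for a in alphas:
        smoothed = series[0]                   # initialise q'_1 = q_1
        sse = 0.0
        for q in series[1:]:
            sse += (q - smoothed) ** 2         # one-step forecast error
            smoothed = a * q + (1 - a) * smoothed
        if sse < best[0]:
            best = (sse, a, smoothed)
    return best[2], best[1]                    # forecast q'_{t+1}[i], alpha
```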

2.1.3. Linear Exponential Smoothing In relevance feedback search, although the queries tend to get closer to the "optimal" query, the trend does not necessarily remain constant, i.e., the trend may vary slowly over time. To capture the time-varying/local trends of feedback queries, one method is to use Linear (i.e., double) Exponential Smoothing (LES).

LES modifies exponential smoothing to follow a linear trend, i.e., it smooths the smoothed values obtained from a double application of exponential smoothing. Denote the singly-smoothed and doubly-smoothed predictions as

q′_{t+1}[i] = α · q_t[i] + (1 − α) · q′_t[i]

and

q′′_{t+1}[i] = α · q′_t[i] + (1 − α) · q′′_t[i]

respectively. From these, a(t) denotes the estimated value and b(t) the estimated trend at the t-th iteration.
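This short paper does not spell out the a(t) and b(t) updates; the sketch below uses the standard Brown formulation of linear exponential smoothing [1], which is consistent with the two recurrences above (treat it as an assumption, not the paper's exact method):

```python
def les_forecast(series, alpha):
    """Brown's linear (double) exponential smoothing for one dimension.
    The level/trend estimates a(t), b(t) follow the standard textbook
    form; the paper states only the two smoothing recurrences."""
    s1 = s2 = series[0]                        # singly/doubly smoothed
    for q in series[1:]:
        s1 = alpha * q + (1 - alpha) * s1      # q'_{t+1}[i]
        s2 = alpha * s1 + (1 - alpha) * s2     # q''_{t+1}[i]
    a_t = 2 * s1 - s2                          # estimated value a(t)
    b_t = alpha / (1 - alpha) * (s1 - s2)      # estimated trend b(t)
    return a_t + b_t                           # one-step-ahead forecast
```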

2.1.4. Adaptive Linear Exponential Smoothing A noisy query (i.e., one with a sudden abnormal change) may affect the quality of prediction. For example, in ES or LES, the selection of α is decided by the SSE, and a noisy query may dominate the overall SSE. As a result, an α value far from the best may be selected. Considering these factors, we introduce adaptive prediction for relevance feedback search. In this paper, we apply the adaptive strategy particularly to LES, for its effectiveness as shown in the experiments, and name it Adaptive Linear Exponential Smoothing (ALES).

ALES monitors the prediction error and judges its changing trend. From the changing trend, ALES identifies and smooths the noisy queries. A query is identified as noise if its prediction error exhibits a sharp spike up or down relative to the changing trend. In our experiments, we use the following heuristic: a query q_t is identified as noisy if the following inequality holds:

|q_t − (q_{t−1} + q_{t+1}) / 2| ≥ (q_{t−1} + q_{t+1}) / 2

ALES accounts for such noise and modifies α (its value has to stay in the range (0, 1)) to adapt to the changing prediction error. Furthermore, in relevance feedback search, ALES also has the side benefit of smoothing noisy queries.
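A minimal sketch of the noise test above, written per dimension (the exact α-adaptation rule is not given in this short paper, so only the detection heuristic is shown; is_noisy is our name):

```python
def is_noisy(q_prev, q_t, q_next):
    """Flag q_t as a noisy observation when its deviation from the
    midpoint of its neighbours is at least that midpoint (the
    heuristic inequality from Section 2.1.4, shown for scalars)."""
    mid = (q_prev + q_next) / 2
    return abs(q_t - mid) >= mid
```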

2.2. The KNN Search in Relevance Feedback

As mentioned, SaveRF can be integrated smoothly with existing feedback mechanisms and search methodologies. Figure 1 depicts a general system architecture enriched with SaveRF. SaveRF receives the query information and the candidate set, and returns the overlap to the search engine, which deploys KNN search algorithms on the corresponding indexing structures.

Figure 1. A general relevance feedback architecture integrated with SaveRF. (Diagram: a graphic user interface, the relevance feedback mechanism, the search engine, and SaveRF; the labeled flows include the initial query, feedback query, feedback information, results, candidates and pruning radius, and the overlap and upper bounds.)

While SaveRF can be easily deployed in existing KNN search methods, here we choose the VA-file's two-phase algorithm as our example for its effectiveness and simplicity. Compared with the approach in [4], SaveRF has two distinctive features. First, and most interestingly, repeated random accesses on the same candidates in two consecutive iterations are avoided. Instead, an efficient scan on the overlap leads to a much faster response. Generally, the overlap size is much smaller than the candidate size [4]. Manipulating the overlap instead of the whole candidate set from the last iteration saves both storage and scan overhead. Second, an initial set of results is computed when the sequential scan is performed on the predicted overlap, and this initial set potentially contains some real results. Note that SaveRF is more effective when there is indeed considerable overlap between two iterations.

Figure 2. Precision of Prediction: PP over iterations 3-8 for ALES, LES, ES, and LR on (a) the WWW image dataset and (b) the Corel image dataset.
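To make the integration concrete, here is a hedged sketch of how such a two-phase VA-file search might consume the predicted overlap: the overlap is scanned sequentially first to seed the top-k and tighten the pruning radius, and only unseen candidates that survive the lower-bound filter are fetched by random access. The function and parameter names (saverf_knn, fetch_vector, and so on) are ours, not an API from the paper.

```python
import heapq

def saverf_knn(query, overlap_vectors, va_candidates, fetch_vector, k, dist):
    """Sketch of SaveRF-assisted kNN over a VA-file.

    overlap_vectors: {id: feature vector} cached from the last iteration.
    va_candidates:   iterable of (id, lower_bound) from the VA filter phase.
    fetch_vector:    random access to the full vector of a candidate id.
    """
    heap = []  # max-heap via negated distances: (-dist, id)

    def offer(oid, d):
        if len(heap) < k:
            heapq.heappush(heap, (-d, oid))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, oid))

    # Seed phase: cheap sequential scan of the predicted overlap.
    for oid, vec in overlap_vectors.items():
        offer(oid, dist(query, vec))
    radius = -heap[0][0] if len(heap) == k else float("inf")

    # Refinement phase: random-access only unseen surviving candidates.
    for cid, lb in va_candidates:
        if cid in overlap_vectors or lb >= radius:
            continue                           # random access saved
        offer(cid, dist(query, fetch_vector(cid)))
        if len(heap) == k:
            radius = -heap[0][0]               # tighten pruning radius

    return sorted((-nd, oid) for nd, oid in heap)
```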

3. Experiments

3.1. Experimental Setup

All the experiments were performed on a Sun UltraSparc II 450MHz machine (2 CPUs). We use two real datasets in our experiments: 159-dimensional color histograms extracted from 62,400 WWW images, and 32-dimensional HSV color histograms extracted from the 68,040-image Corel collection. A VA-file is constructed for each image feature space. To avoid the subjectivity involved in selecting "good" images to refine the query, by default we choose the top 5 most relevant images as "good" images. To study the effectiveness of our proposal, we consider two measures:

• (a) Precision of Prediction (PP): PP indicates how accurately SaveRF can predict the overlap. It is formally defined as:

PP = |Overlap′_{t,t+1} ∩ Overlap_{t,t+1}| / |Overlap′_{t,t+1} ∪ Overlap_{t,t+1}|

where Overlap_{t,t+1} = C_t ∩ C_{t+1} is the actual overlap and Overlap′_{t,t+1} is the predicted one.

• (b) Ratio of Random Access Saved (RRAS): In the t-th iteration, denote the numbers of randomly accessed candidates of the standard VA-file search and of SaveRF as C^{VA}_t and C^{SaveRF}_t respectively. Assume that a random access is 10 times more expensive than a sequential scan. RRAS is then defined as:

RRAS = C^{VA}_t / (C^{SaveRF}_t + |Overlap′| / 10)
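For instance, both measures reduce to a few lines over candidate-id sets (a sketch; the set representation is our assumption):

```python
def precision_of_prediction(pred_overlap, c_t, c_t1):
    """PP: Jaccard-style agreement between the predicted overlap and
    the actual overlap C_t ∩ C_{t+1}."""
    actual = c_t & c_t1
    union = pred_overlap | actual
    return len(pred_overlap & actual) / len(union) if union else 0.0

def rras(va_accesses, saverf_accesses, overlap_size):
    """RRAS: random accesses of plain VA-file search divided by SaveRF's
    cost, charging the sequential overlap scan at a tenth of a random
    access (the paper's 10x cost assumption)."""
    return va_accesses / (saverf_accesses + overlap_size / 10)
```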

3.2. Precision of Prediction

Figure 2 depicts the precision of prediction for the four methods. For both datasets, we make the following observations. First, ALES achieves the best performance, increasing roughly linearly to more than 0.9 after the 6th iteration. The next best performer is LES, followed by ES, which achieves stable performance of about 0.8. This confirms that some noisy queries and local trends exist in relevance feedback search. Second, in the first few iterations, there is no clear difference among the models. We believe the main reason is the extremely small number of queries available for prediction. As more iterations are executed, the improvements become obvious (in the last few iterations). This is mainly because, after several iterations, recent queries start converging steadily and forming a local trend towards the "optimal" query. Adaptivity to prediction errors and noise further distinguishes ALES from LES. Consequently, we use only ALES as the prediction model for SaveRF in the following experiments.

Figure 3. Ratio of Random Access Saved: RRAS over iterations 3-8, with and without overlap prediction, on (a) the WWW image dataset and (b) the Corel image dataset.

3.3. Ratio of Random Access Saved

Here we tested SaveRF with and without the overlap prediction function by comparing against the number of random accesses of the standard VA-file search algorithm. Figure 3a shows the results for the WWW image dataset. As the iterations go on, SaveRF without overlap prediction improves on the standard VA-file algorithm by more than an order of magnitude, while SaveRF with overlap prediction outperforms the standard VA-file algorithm by more than two orders of magnitude. The results for SaveRF without overlap prediction suggest that the number of candidates drops dramatically in later iterations of relevance feedback. The results for SaveRF with overlap prediction reconfirm that a large portion of the candidates in the current iteration have already been accessed in the previous iteration, and that scanning the predicted overlap further reduces the number of random accesses greatly. Figure 3b shows similar results for the Corel image dataset. This experiment confirms the effectiveness of SaveRF.

References

[1] S. Benninga. Financial Modeling. MIT Press, 2000.

[2] A. L. Ratan, O. Maron, W. E. L. Grimson, and T. Lozano-Perez. A framework for learning query concepts in image classification. In CVPR, pages 423–431, 1999.

[3] Y. Rui and T. Huang. Optimizing learning in image retrieval. In ICCV, pages 236–243, 2000.

[4] P. Wu and B. S. Manjunath. Adaptive nearest neighbor search for relevance feedback in large image datasets. In ACM Multimedia, pages 87–98, 2001.
