IEEE 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA
SaveRF: Towards Efficient Relevance Feedback Search
Heng Tao Shen1 Beng Chin Ooi2 Kian-Lee Tan2
1 School of Information Technology and Electrical Engineering, The University of Queensland, Australia
2 Department of Computer Science, National University of Singapore, Singapore
Abstract
In multimedia retrieval, a query is typically interactively refined towards the ‘optimal’ answers by exploiting user feedback. However, in existing work, the refined query is re-evaluated from scratch in each iteration. This is not only inefficient but also fails to exploit the answers that may be common between iterations. In this paper, we introduce a new approach called SaveRF (Save random accesses in Relevance Feedback) for iterative relevance feedback search. SaveRF predicts the potential candidates for the next iteration and maintains this small set for efficient sequential scan. By doing so, repeated candidate accesses can be saved, reducing the number of random accesses. In addition, an efficient scan of the overlap before the search starts tightens the search space with a smaller pruning radius. We implemented SaveRF, and our experimental study on real-life data sets shows that it can reduce the I/O cost significantly.
1. Introduction
Existing content-based systems that exploit low-level features (such as color and texture) do not necessarily return semantically relevant (based on human perception) answers. One promising direction towards semantic retrieval is the adoption of a relevance feedback mechanism [2, 3]. A relevance feedback process is interactive and iterative in nature. From the current results returned by the system, the user provides feedback to the system; based on this feedback, the system refines the query to get better results that are closer to the user's expectations. The feedback query is usually refined by moving the query to a new position, by modifying the similarity metric (i.e., the weights of the feature vectors), or both, based on the selected objects. Unfortunately, the iterative nature of the relevance feedback loop further lengthens the search time. Since the feedback query moves away from the previous one with an updated similarity metric, a complete KNN search has to be re-performed in the next iteration. If an indexing structure is deployed, then each iteration corresponds to one new KNN search within the structure. Random access cost is generally the major concern in most research work.
However, the search spaces of two consecutive queries may overlap largely, given that the query in the current iteration is refined based on the "good" results from the last iteration. In this paper, we propose SaveRF, a new method to speed up relevance feedback search by discovering the overlap between two consecutive iterations. SaveRF investigates three methods, linear regression, exponential smoothing, and linear exponential smoothing, to predict the new query to be searched in the next iteration. Taking the characteristics of relevance feedback search into consideration, SaveRF further introduces adaptive linear exponential smoothing to achieve better prediction quality. By forecasting the search space of the new query, the overlap between two consecutive queries' search spaces can be estimated. By performing a sequential scan on the overlap, expensive random accesses on candidates lying in the overlap can be avoided in the next iteration, so the total number of random accesses is reduced. SaveRF can be easily integrated with existing feedback mechanisms and indexing structures. Our experimental study demonstrates both the effectiveness and efficiency of SaveRF, and also reveals some interesting characteristics of relevance feedback search.
2. The SaveRF Approach
Our goal in this paper is to achieve efficient KNN search during the feedback loop by reducing the search time of each iteration. Our approach is to achieve faster retrieval in the subsequent iterations as the relevance feedback loop goes on. Based on the information obtained from the early iterations, the number of random accesses in the underlying index structure can be further reduced for the next iteration. Hence the key is to maintain the information across iterations and exploit their correlations.
Our inspiration comes from the following observations on relevance feedback search. First, the search space of the refined query Qt+1 in the (t+1)-th iteration is highly likely to overlap with that of Qt. Second, relevance feedback mechanisms assume that the feedback query moves a step closer towards the "optimal" query as more iterations are processed.
SaveRF forecasts Qt+1 by using the query information from the first iteration to the t-th iteration. The prediction is made for every dimension of Qt+1, including the coordinate value and the weight. Denote the prediction of Qt+1 as Q′t+1. We first look at how linear regression can be adapted to relevance feedback, followed by exponential smoothing [1].

Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 0-7695-2570-9/06 $20.00 © 2006 IEEE
2.1. Overlap Prediction
2.1.1. Linear Regression The feedback queries are assumed to move along a direction from the initial query towards its "optimal". This satisfies the assumption of linear regression (LR) that the data increase or decrease steadily over time.
First, we look at how the i-th dimensional value can be predicted. We denote the prediction of qt+1 as q′t+1. Hence the forecast of the i-th coordinate value using linear regression is computed as follows:
q′t+1[i] = α + β ∗ t
where t represents the t-th iteration, and α and β are the parameters to be determined by regression; β indicates the amount changed over each iteration. The estimates of α and β are chosen to minimize the forecasting SSE (Sum of Squared Errors).
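As a sketch, the per-dimension LR forecast can be implemented with the closed-form least-squares fit; the function below is illustrative (the `history` list stands in for one coordinate of the queries from iterations 1 to t):

```python
def lr_forecast(history):
    """Fit q[i] = alpha + beta * t by least squares over the past
    iterations t = 1..n, then forecast the value at t = n + 1."""
    n = len(history)
    ts = list(range(1, n + 1))
    t_mean = sum(ts) / n
    q_mean = sum(history) / n
    # beta (slope) minimizes the SSE of the linear fit
    num = sum((t - t_mean) * (q - q_mean) for t, q in zip(ts, history))
    den = sum((t - t_mean) ** 2 for t in ts)
    beta = num / den
    alpha = q_mean - beta * t_mean
    return alpha + beta * (n + 1)

# One coordinate of the query over 4 iterations, drifting upward:
print(lr_forecast([0.10, 0.15, 0.20, 0.25]))  # ~0.30, continuing the trend
```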
2.1.2. Exponential Smoothing Exponential smoothing (ES) assigns unequal weights to the data: the largest weight to the most recent data and the smallest weight to the earliest data. It provides better predictions when the forecast horizon is short. When applied to relevance feedback, exponential smoothing gives the greatest weight to the most recent query and the least weight to the initial query, i.e., it is more "responsive" to changes occurring in the recent iterations. In exponential smoothing, q′t+1[i] is computed as:
q′t+1[i] = α ∗ qt[i] + (1 − α) ∗ q′t[i]
where α is the smoothing parameter and 0 < α < 1. Exponential smoothing is intuitively more appealing than linear regression.
One major drawback of exponential smoothing is that there is no intrinsically best value for α. To determine α, a set of values is generally tested, and the value that best fits the queries is selected. We choose α from the set [0.05, 0.1, ... , 0.9, 0.95]; the value giving the minimal SSE is chosen.
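A minimal sketch of this grid search, assuming each `history` entry is one coordinate of the feedback query and using one-step-ahead forecast errors for the SSE:

```python
def es_forecast(history, alphas=None):
    """Exponential smoothing: pick alpha from a grid by minimal SSE
    of one-step-ahead forecasts, then return the next forecast."""
    if alphas is None:
        alphas = [0.05 * k for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best = None  # (sse, final smoothed value)
    for a in alphas:
        s = history[0]           # initialize with the first observation
        sse = 0.0
        for q in history[1:]:
            sse += (q - s) ** 2  # error of forecasting q with s
            s = a * q + (1 - a) * s
        if best is None or sse < best[0]:
            best = (sse, s)
    return best[1]
```

A larger α tracks a drifting query more closely; the SSE criterion picks it automatically when the history has a clear trend.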
2.1.3. Linear Exponential Smoothing In relevance feedback search, although the queries tend to get closer to the "optimal" query, the trend does not necessarily remain constant, i.e., it may vary slowly over time. To capture the time-varying (local) trends of feedback queries, one method is to use Linear (i.e., double) Exponential Smoothing (LES).
LES modifies exponential smoothing to follow a linear trend, i.e., it smooths the smoothed values obtained from a double application of exponential smoothing. Denote the singly-smoothed and doubly-smoothed values as

q′t+1[i] = α ∗ qt[i] + (1 − α) ∗ q′t[i]

and

q′′t+1[i] = α ∗ q′t[i] + (1 − α) ∗ q′′t[i]

respectively. Here a(t) is the estimated value and b(t) is the estimated trend at the t-th iteration; in the standard formulation, a(t) = 2q′t[i] − q′′t[i], b(t) = (α/(1 − α)) ∗ (q′t[i] − q′′t[i]), and the forecast is a(t) + b(t).
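Under the standard (Brown's) double-smoothing formulation assumed above, the LES forecast for one coordinate can be sketched as:

```python
def les_forecast(history, alpha=0.5):
    """Brown's linear (double) exponential smoothing for one coordinate.
    Returns the one-step-ahead forecast a(t) + b(t)."""
    s1 = s2 = history[0]   # singly and doubly smoothed values
    for q in history[1:]:
        s1 = alpha * q + (1 - alpha) * s1   # single smoothing
        s2 = alpha * s1 + (1 - alpha) * s2  # smooth the smoothed values
    a = 2 * s1 - s2                         # estimated level a(t)
    b = alpha / (1 - alpha) * (s1 - s2)     # estimated trend b(t)
    return a + b

print(les_forecast([1.0, 2.0, 3.0, 4.0], alpha=0.5))  # 4.5: level plus trend
```

Unlike plain ES, the trend term b(t) lets the forecast extrapolate beyond the last observed value when the queries are drifting steadily.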
2.1.4. Adaptive Linear Exponential Smoothing A noise (i.e., a query with a sudden abnormal change) may affect the quality of prediction. For example, in ES or LES, the selection of α is decided by the SSE, and a noisy query may dominate the overall SSE. As a result, an α value far from the best may be selected. Considering these factors, we introduce adaptive prediction for relevance feedback search. In this paper, we apply the adaptive strategy particularly to LES, for its effectiveness as shown in the experiments, and name it Adaptive Linear Exponential Smoothing (ALES).
ALES monitors the prediction error and judges its changing trend. From the changing trend, ALES identifies and smooths the noisy queries. A query is identified as noise if its prediction error exhibits a sharp rise or fall along the changing trend. In our experiments, we use the following heuristic: a query qt is identified as noisy if the following inequality holds:

|qt − (qt−1 + qt+1)/2| ≥ (qt−1 + qt+1)/2
ALES accounts for such noise and adapts α (its value must remain in the range (0, 1)) to the changing prediction error. Furthermore, in relevance feedback search, ALES also has the functionality of smoothing noisy queries.
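The paper does not fully specify the adaptation rule, so the sketch below implements only the noise test above plus one plausible smoothing of flagged queries (replacing them by the neighbor midpoint); both function names and the smoothing choice are illustrative, not the authors' exact scheme:

```python
def is_noisy(q_prev, q, q_next):
    """Noise test from the paper: q is noisy if it deviates from the
    midpoint of its neighbors by at least that midpoint."""
    mid = (q_prev + q_next) / 2
    return abs(q - mid) >= mid

def smooth_noise(history):
    """Replace flagged interior queries by their neighbor midpoint
    (an illustrative smoothing, applied per coordinate)."""
    out = list(history)
    for i in range(1, len(history) - 1):
        if is_noisy(history[i - 1], history[i], history[i + 1]):
            out[i] = (history[i - 1] + history[i + 1]) / 2
    return out

# A spike at 0.9 between two 0.2 values is flagged and smoothed away:
print(smooth_noise([0.2, 0.9, 0.2]))  # [0.2, 0.2, 0.2]
```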
2.2. The KNN Search in Relevance Feedback
As mentioned, SaveRF can be integrated with existing feedback mechanisms and search methodologies. Figure 1 depicts a general system architecture enriched with SaveRF. As shown, SaveRF receives the query information and the candidate set, and returns the overlap to the search engine, which deploys KNN search algorithms for the corresponding indexing structures.
Figure 1. A general relevance feedback architecture integrated with SaveRF. (Components: user interface, relevance feedback mechanism, search engine, and SaveRF; SaveRF receives the feedback query, candidates, and pruning radius, and returns the overlap and upper bounds.)
While SaveRF can be easily deployed in existing KNN search methods, here we choose the VA-file's two-phase algorithm as our example for its effectiveness and simplicity. Compared with the approach in [4], SaveRF has two distinctive features. First, and most interestingly, repeated random accesses on the same candidates in two consecutive iterations are avoided. Instead, an efficient scan on the overlap leads to a much faster response. Generally, the overlap size is much smaller than the candidate size [4]. Manipulating the overlap instead of the whole
Figure 2. Precision of Prediction (PP) over iterations 3–8 for ALES, LES, ES, and LR on (a) the WWW image dataset and (b) the Corel image dataset.
candidate set from the last iteration saves both storage and scan overhead. Second, an initial set of results is computed when the sequential scan is performed on the predicted overlap; this initial set is likely to contain some of the real results. Note that SaveRF is most effective when there is indeed considerable overlap between two iterations.
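The integration with the VA-file's second phase is described only at a high level, so the following sketch shows one plausible "phase 0": sequentially scan the predicted overlap to build an initial top-k and a tightened pruning radius before any random accesses are issued (the function name, data layout, and distance metric are assumptions):

```python
import heapq

def knn_with_overlap(query, overlap_vectors, k):
    """Sequentially scan the predicted overlap (id -> feature vector)
    to build an initial top-k and a tightened pruning radius. Objects
    in the overlap need no random access later; the returned radius
    prunes candidates in the subsequent VA-file phase."""
    heap = []  # max-heap via negated distances: current top-k
    for obj_id, vec in overlap_vectors.items():
        d = sum((a - b) ** 2 for a, b in zip(query, vec)) ** 0.5
        if len(heap) < k:
            heapq.heappush(heap, (-d, obj_id))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, obj_id))
    # Pruning radius = distance of the current k-th nearest neighbor
    radius = -heap[0][0] if len(heap) == k else float("inf")
    return heap, radius
```

Starting the VA-file phase with this radius (instead of infinity) is what tightens the search space, as described in the abstract.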
3. Experiments
3.1. Experimental Setup
All experiments were performed on a Sun UltraSparc II 450MHz (2 CPUs). We use two real datasets: 159-dimensional color histograms extracted from 62,400 WWW images, and 32-dimensional HSV color histograms extracted from the 68,040-image Corel collection. A VA-file is constructed for each image feature space. To avoid the subjectivity involved in selecting "good" images to refine the query, by default we choose the top 5 most relevant images as "good" images. To study the effectiveness of our proposal, we consider two measures:
• (a) Precision of Prediction (PP): PP indicates how accurately SaveRF predicts the overlap, and is formally defined as PP = |Overlap′t,t+1 ∩ Overlapt,t+1| / |Overlap′t,t+1 ∪ Overlapt,t+1|, where Overlapt,t+1 = Ct ∩ Ct+1 is the actual overlap and Overlap′t,t+1 is the predicted overlap.
• (b) Ratio of Random Access Saved (RRAS): In the t-th iteration, denote the number of randomly accessed candidates of the standard VA-file search and of SaveRF as CVAt and CSaveRFt respectively. Assuming that a random access is 10 times more expensive than a sequential scan, RRAS is defined as RRAS = CVAt / (CSaveRFt + |Overlap′t,t+1| / 10).
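Both measures can be computed directly from the candidate sets; a sketch, treating candidate sets as Python sets of object ids (the handling of an empty union is our own choice, not specified in the paper):

```python
def precision_of_prediction(pred_overlap, c_t, c_t1):
    """PP = |predicted ∩ actual| / |predicted ∪ actual| (a Jaccard
    index), where the actual overlap is C_t ∩ C_{t+1}."""
    actual = c_t & c_t1
    union = pred_overlap | actual
    return len(pred_overlap & actual) / len(union) if union else 1.0

def rras(c_va, c_saverf, pred_overlap_size):
    """RRAS = C^VA_t / (C^SaveRF_t + |Overlap'| / 10), under the
    assumption that a random access costs 10x a sequential scan."""
    return c_va / (c_saverf + pred_overlap_size / 10)

# Predicted overlap {1,2,3}; actual overlap {2,3}: PP = 2/3.
print(precision_of_prediction({1, 2, 3}, {1, 2, 3, 4}, {2, 3, 5}))
```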
3.2. Precision of Prediction
Figure 2 depicts the precision of prediction for the four methods. For both datasets, we have the following observations. First, ALES improves steadily and achieves the best performance, exceeding 0.9 after the 6th iteration. The next best performer is LES, followed by ES, which achieves a stable performance of about 0.8. This confirms
Figure 3. Ratio of Random Access Saved (RRAS) over iterations 3–8, with and without overlap prediction, on (a) the WWW image dataset and (b) the Corel image dataset.
that some noisy queries and local trends do exist in relevance feedback search. Second, in the first few iterations, there is no clear difference among the models. The main reason, we believe, is the extremely small number of queries available for prediction. As more iterations are executed, the improvements become obvious (in the last few iterations). This is mainly because, after several iterations, recent queries start converging steadily and form a local trend towards the "optimal" query. Adaptivity to prediction errors and noise further distinguishes ALES from LES. Consequently, we use only ALES as the prediction model for SaveRF in the following experiments.
3.3. Ratio of Random Access Saved
Here we tested SaveRF with and without the overlap prediction function by comparing the number of random accesses against the standard VA-file search algorithm. Figure 3a shows the results for the WWW image dataset. As iterations go on, SaveRF without overlap prediction improves on the standard VA-file algorithm by more than an order of magnitude, while SaveRF with overlap prediction outperforms the standard VA-file algorithm by more than two orders of magnitude. The results for SaveRF without overlap prediction suggest that the number of candidates drops dramatically in later iterations of relevance feedback. The results for SaveRF with overlap prediction reconfirm that a large portion of the candidates in the current iteration were already accessed in the previous iteration, and that scanning the predicted overlap further reduces the number of random accesses greatly. Figure 3b shows similar results for the Corel image dataset. This experiment confirms the effectiveness of SaveRF.
References
[1] S. Benninga. Financial Modeling. MIT Press, 2000.
[2] A. L. Ratan, O. Maron, W. E. L. Grimson, and T. Lozano-Perez. A framework for learning query concepts in image classification. In CVPR, pages 423–431, 1999.
[3] Y. Rui and T. Huang. Optimizing learning in image retrieval. In ICCV, pages 236–243, 2000.
[4] P. Wu and B. S. Manjunath. Adaptive nearest neighbor search for relevance feedback in large image datasets. In ACM Multimedia, pages 87–98, 2001.