

arXiv:1607.07504v1 [cs.IR] 25 Jul 2016

Versõ foliõ: Diversified Ranking for Large Graphs with Context-Aware Considerations

George Tsatsanifos
HomeGrown Research Labs
Omirou 39, Agios Dimitrios, 172 36, Athens, Greece

ABSTRACT
This work pertains to the diversified ranking of web resources and interconnected documents that rely on a network-like structure, e.g. web-pages. A practical example of this would be a query for the k most relevant web-pages that are at the same time as dissimilar to each other as possible. Relevance and dissimilarity are quantified using an aggregation of network distance and context similarity. For example, for a specific configuration of the problem, we might be interested in web-pages that are similar to the query in terms of their textual description but distant from each other in terms of the web-graph, e.g. many clicks away. In retrospect, little work can be found in the literature that addresses this problem while taking into consideration the network structure formed by the document links. In addition, the vast majority of the approaches that have been proposed in the literature leverage greedy heuristics relying on measures that capture only some aspect of the problem. Arguably, this is not enough.

In this work, we propose a hill-climbing approach that is seeded with a document collection generated using greedy heuristics to diversify initially. More importantly, we tackle the problem in the context of web-pages, where there is an underlying network structure connecting the available documents and resources. This is a significant difference from the majority of works, which tackle the problem in terms of either content definitions or the graph structure of the data, but never address both aspects simultaneously. To the best of our knowledge, this is the very first effort to combine both aspects of this important problem in an elegant fashion, while also allowing a great degree of flexibility in configuring the trade-offs of (i) document relevance over result items' dissimilarity, and (ii) network distance over content relevance or dissimilarity. Last but not least, we present an extensive evaluation of our methods that demonstrates their effectiveness and efficiency.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search Process

1. INTRODUCTION

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 4.0 Unported licence, available at http://creativecommons.org/licenses/by-nc-sa/4.0/. You may contact the author for commercial licensing of source code or new implementations.

Result diversification has attracted great scientific interest in recent years, for it addresses a variety of problems simultaneously. First, modern information retrieval research focuses on result diversification as a means of counteracting intrinsic IR problems, like the over-specialization problem of retrieving too homogeneous results. For instance, for informational queries, users reading through a list of relevant but redundant pages quickly stop, as they do not expect to learn more. Therefore, new search techniques had to be devised in order to improve user satisfaction and decrease query abandonment. Second, diversification reduces the risk that none of the returned results satisfies a user's query intent. For example, the query term "Apache" could pertain to the well-known software foundation, or to the Native American tribe just as well. Similarly, among the results for "Python" we could find information about a reptile species, a programming language, or even Monty Python, a comedy group. Therefore, by focusing on just one genre of information, we could easily fail to answer the query properly, or even risk being completely irrelevant. Also, among the most frequent examples found in the literature for ambiguous and exploratory queries, "flash" and "jaguar" are the most typical, for they can represent completely different things, e.g. cars, felines, etc. Ambiguity is a common problem, and result diversification can be a solution to it. Third, enabling diversity tacitly facilitates the elimination of (near) duplicates.

Besides, conventional information retrieval models presume that, given a similarity measure with different degrees of relevancy, the relevance of a document is independent of the relevance of other documents. In reality, however, this independence relevance assumption rarely holds; the utility of retrieving one document may depend on which documents the user has already seen. Regarding this complementary aspect of diversification, it is in general insufficient to simply return a set of relevant results, since correlations among them are also important. More specifically, documents should be selected progressively according to their relevance and also with respect to the documents that precede them. For example, a system that delivers documents that are relevant and novel must be able to identify relevant documents that are dissimilar to the previously delivered documents, in the sense of comprising new information. The main challenge in diversification lies in that these two important goals contradict each other.

In essence, the goal of diversification is to identify a list of items that are dissimilar to each other, but nonetheless relevant to the user's search criteria. In other words, diversity is defined by how much each item of the result-set differs from the others in terms of their attribute values. In effect, diversification is a multifaceted problem with many practical and theoretical extensions. A profound and rich link with combinatorics and approximation is exploited in research works where diversification is performed mostly through heuristics and greedy algorithms. This work


Figure 1: Diversified result for query "Apache" over the Greek collection of Wikipedia articles. The "Apache HTTP server" article was chosen as the center for this search because of its title. Relevant articles representing different fields are returned in the result.

focuses on the diversified ranking of web resources and documents that rely on a network-like structure, e.g. web-pages. A typical example of this problem would be a query for the k most relevant web-pages that are at the same time as dissimilar to each other as possible. This is an important problem with deep-rooted theoretical extensions and even more practical applications. It is an NP-hard problem, and the fact that we address it in the context of data representing interconnected information adds not only to the complexity of the problem, but also to the value of an effective and efficient solution that is capable of returning a near-optimal solution within a few seconds.

More specifically, there are two approaches proposed in this paper, used in combination, one after the other, in an effort to maximize their effect. The first relies on greedy heuristics, whereas the second leverages a hill-climbing approach. According to the former, we compute kg diversified sets starting from the object in the network that is most similar to the query, given a metric that smooths network distance with context similarity, and we incrementally append to the set the object which, when added to the result constructed so far, yields a better rank for the outcome than any other object would. This process is repeated as many times as the desired number of elements. On the other hand, the latter approach takes as input the kg greedily built diversified sets, and iteratively refines them by removing certain elements and replacing them with more preferable choices in order to produce better diversified sets. Each time, we keep only the best kc derived sets we have retrieved so far, and we work on that collection of sets as a means to explore the overall search space.

What is more, we prove our concept through a prototype that we launched on-line. It operates over multi-lingual collections of the Wikipedia dataset, but other collections of web-pages can also be used as soon as the appropriate models are extracted, e.g. a web-graph, a vector space model (based on tf-idfs in our case), etc. In Figure 1, we depict the results for "Apache". Among those, one can find links to articles up to 4 clicks away that are still relevant to "Apache HTTP server", the article chosen to serve as the query center for our search because its title matches the query better than other articles. Among the top results, "Server" and "Unix" are the closest articles that are at the same time also very similar to the query center. Of course, other close neighbors could have been selected, but we put special emphasis on the fact that those are quite diverse from each other, as they belong

to different categories, tacitly addressed through a model that relies on content-based definitions. When we examine the remaining results (not directly connected to the query center) we see links to other articles like "Computer Networks", "Information Systems", "Internet", "Operating Systems", and "Computer Science", each contributing to the result by representing a different field or thematic area, thereby directly tackling the over-specialization problem, the main challenge addressed in this paper. Other notions, like the PageRank of each web-page, could easily be incorporated in our ranking functions as a way to promote more important pages, exactly like we do with the relevant ones. All different criteria are smoothed according to the user preferences via an elegant interface.

In summary, the contributions of this paper are the following:

1. We present verso, an algorithm which, given a query object q, a set of interconnected objects S, and a ranking function f, retrieves the optimal single object which, when added to S, yields a better rank in terms of both the network characteristics and the textual descriptions of its elements than the insertion of any other point would.

2. We propose a paradigm that combines a method based upon greedy heuristics with an approach leveraging a hill-climbing methodology to exploit verso, so as to construct diversified sets of arbitrary cardinality.

3. We show through an extensive experimental evaluation that our paradigm clearly outperforms the BestCoverage method proposed in [12] in terms of response times, result quality, and memory requirements as well.

4. We developed a prototype implementing our method in Java and made it available on-line.

As a final note, the name of our paradigm, "versõ foliõ", translates from Latin to English as "the turned side of the page". The remainder of this paper is organized as follows: Section 2 discusses contemporary diversification techniques over different types of data. A formal definition of the problem we study here is presented in Section 3. Our paradigm is presented in Section 4. Section 5 presents an extensive evaluation of our scheme. Finally, we conclude and summarize our contributions in Section 6.

2. RELATED WORK
In this section, we present an overview of the literature on result diversification over two different fields.


2.1 Result Diversification for Structured Data
The first part of this section describes the various definitions of the result diversification problem proposed in the research literature and classifies the algorithms proposed for locating diverse items. A complete survey of the state of the art can be found in [5].

In the absence of any explicit knowledge of user intent, three main approaches are employed for result diversification: (i) content-based, for differentiating result items in terms of their attribute values, (ii) novelty-based, for promoting items that contain new information compared to those ranked higher, and (iii) coverage-based, for including items so as to cover as many categories as possible.

First and foremost, content-based definitions aim at maximizing the aggregated distance between any pair of objects in the result-set. Arguably, the most influential work on diversification is that of Maximal Marginal Relevance (MMR) [3]. In their seminal work, the authors introduced a trade-off between diversity and the relevance of search results through the combination of two similarity functions, shown in Equation 1: one measuring the similarity among a document collection S, and the other the similarity between query q and each document. In addition, parameter λ controls the degree of trade-off.

$$\max_{x \in S}\ \lambda\, d(q,x) - (1-\lambda) \min_{x,y \in S} d(x,y) \tag{1}$$
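For illustration, a minimal Java sketch of a greedy, incremental reading of this trade-off follows; the class and method names are ours (not the code of [3]), the distance callback is assumed to be supplied by the caller, and lower scores are taken to be better, in line with the distance-based formulation above.

import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Incremental MMR-style selection over distances: relevance pulls candidates
// close to the query, the diversity term pushes them away from chosen items.
public final class MmrGreedy {
    public static <T> List<T> select(T query, List<T> pool, int k, double lambda,
                                     BiFunction<T, T, Double> d) {
        List<T> result = new ArrayList<>();
        List<T> rest = new ArrayList<>(pool);
        while (result.size() < k && !rest.isEmpty()) {
            T best = null;
            double bestScore = Double.POSITIVE_INFINITY;
            for (T x : rest) {
                double rel = d.apply(query, x);        // query-candidate distance
                double div = Double.POSITIVE_INFINITY; // distance to closest chosen item
                for (T y : result) div = Math.min(div, d.apply(x, y));
                if (result.isEmpty()) div = 0.0;
                double score = lambda * rel - (1 - lambda) * div;
                if (score < bestScore) { bestScore = score; best = x; }
            }
            result.add(best);
            rest.remove(best);
        }
        return result;
    }
}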

Moreover, [9] is another conspicuous work that follows an axiomatic approach to develop a set of natural axioms that a diversification system is expected to follow. A variety of methods are discussed, such as semantic distance-based schemes and taxonomy-based categorical distance schemes, along with the advantages and disadvantages thereof. On the other hand, [21] focuses on diversification over structured data, proposing an appropriate ranking-similarity function for a given ordering of the data attributes. An interesting approach is presented in [10] that extends the k nearest neighbors problem by providing a minimum distance threshold among the points of the result-set in order to guarantee diversity. In a more recent work [8], the authors adopt an approach based on Voronoi cells in order to retrieve the object maximizing a novelty function. More specifically, items that already constitute the answer are appointed as cell centers. Then, all objects that happen to be placed on the vertices and the edges of the formed Voronoi diagram are examined and the best of them is inserted into the answer-set. Another similar approach is proposed in [22], which however requires knowing all the objects in advance to compute the objective function.

Novelty can be seen as a means to avoid redundancy and (near) duplicates, and also as a means to impart additional information to the result-set, whereas diversity is viewed as the need to resolve ambiguity. Adhering to this principle, queries and objects are treated as sets of "information nuggets" in [4], while relevance is a function of the nuggets contained in possible interpretations of the query and the result-set. In another line of work, [27] empowers adaptive filtering systems to distinguish novel and redundant items. Such systems identify items that are similar to previously delivered ones in having the same topic, but also dissimilar to them, in the sense of containing novel information.

Conversely, [23] suggests from the user point of view that "novelty seeking" is not equivalent to "diversity seeking", and that the novelty preferences of individual users are directed towards finding more information on specific subtopics of interest, rather than an undirected quest for any new information (as in the coverage-based definitions). In [1] this approach is augmented by taking into account the relative importance of different nuggets (as distribution categories), and the fact that different documents containing

the same nugget may satisfy the user to a different extent. On the other hand, [25] relies on models used in traditional IR but seeks to modify the ranking so as to include documents relevant to several subtopics in terms of a taxonomy.

Furthermore, the diversification problem has been shown to be NP-hard, since it is directly related to the p-dispersion problem. To elaborate, the objective of the p-dispersion problem is to choose p out of n given points, so that the minimum distance between any pair of chosen points is maximized. Quite often, however, the objective function is the average distance of any two points. The authors in [9] consider different formulations of the problem, such as min-max diversification, min-sum, and a mono-objective formulation of the problem. Additionally, the combination of these criteria has been studied in [26] as an optimization problem. Besides, due to the increased complexity of the task, most efforts rely on approximation. Many heuristic algorithms have been proposed in the diversity discourse and have been employed for solving variations of the problem. These heuristics can be classified into two main categories: (i) greedy, and (ii) interchange (or hill-climbing).

The majority of the proposed algorithms adopt a greedy approach for being intuitive and relatively fast. A straightforward methodology is proposed in [1], according to which, given the set of the top-k most relevant documents to a query, we re-order them in such a way that the objective function of the model is optimized. Other simplistic solutions rely on post-processing the returned items to identify those that differ in their attribute values. In the context of recommender systems, [29] presents a conspicuous greedy paradigm, according to which the result-set is initialized with the most relevant recommendation. Then, recommendations that are not part of the answer are sorted in terms of their relevance to the query and their minimum distance from any item of the answer set. Now, the rank of each recommendation is a linear combination of its positions in the two lists. At each iteration, we append to the result-set the recommendation with the minimum rank, until the answer consists of k recommendations. Another similar approach is proposed in [24]. Starting with the most relevant item, it adds the next most relevant item if and only if that item is far away (compared to a distance bound) from the existing items in the set, until we have included k items. A similar algorithm is proposed in [10] in the context of spatial databases.

Interchange heuristics start with a random result-set and then iteratively attempt to improve it by swapping an item in the solution with another item that is not part of the answer. At each iteration, possible swap candidates are the ones that improve diversity the most. [14] constitutes such an effort in the context of (semi-)structured data. An additional interchange heuristic is presented in [24]: starting with the k most relevant items, at each iteration the item that contributes the least to the diversity is substituted by the most promising item that is not part of the answer set. This sequential procedure terminates when there are no more items with better diversity than a given threshold. In [20] an approach that leverages both methodologies is proposed in the context of structured peer-to-peer networks. In [19], we propose similar diversification principles for spatial data indices. In [6, 7], DisC is proposed more recently for computing diversified results of variable approximation by processing a hierarchical tree structure accordingly using a top-down method.
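For illustration, the generic interchange scheme sketched below applies, at each round, the single swap that improves the objective the most, until no swap helps; the objective callback and all names are illustrative placeholders, not code from any of the cited works.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Generic interchange (hill-climbing) heuristic over an externally supplied
// objective; lower scores are taken to be better.
public final class Interchange {
    public static <T> List<T> improve(List<T> result, List<T> pool,
                                      Function<List<T>, Double> score) {
        boolean improved = true;
        while (improved) {
            improved = false;
            double best = score.apply(result);
            int bestOut = -1;
            T bestIn = null;
            for (int i = 0; i < result.size(); i++) {
                for (T candidate : pool) {
                    if (result.contains(candidate)) continue;
                    List<T> trial = new ArrayList<>(result);
                    trial.set(i, candidate);   // tentatively swap item i out
                    double s = score.apply(trial);
                    if (s < best) { best = s; bestOut = i; bestIn = candidate; }
                }
            }
            if (bestIn != null) { result.set(bestOut, bestIn); improved = true; }
        }
        return result;
    }
}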

2.2 Diversified Search over Graph Data
Most of the works that can be found in the literature on the retrieval of diversified sets of vertices use non-negative monotone submodular set functions for ranking. The main advantage that comes with this is that near-optimality is easily achieved within a


(1 − 1/e)-approximation of the optimal solution using greedy heuristics. A theoretical analysis of the approximation guarantees can be found in [16, 2]. However, our work considers bicriteria ranking functions that are not necessarily submodular, but instead emphasizes a more general family of Maximal Marginal Relevance (MMR) ranking functions, and therewith near-optimality is much harder to achieve. Besides, the problem we tackle in this paper is NP-hard, and the network structure of the data we deal with sets even more challenges, practical and theoretical as well.

In [13] the authors propose a diversified ranking measure on large graphs that captures the relevance to the query and the dissimilarity of the elements in the result. They formulate the problem as a submodular set function optimization problem. Their approach relies on a notion of expansion, according to which a set of nodes with a large expansion ratio implies that the nodes are dissimilar to each other, in the sense that they do not share common neighbors in the graph. Subsequently, the larger the expansion ratio of the set of nodes, the better the diversity we achieve. Their ranking function is designed accordingly, in a way that compromises the personalized PageRank scores of the ranking results with their expansion ratio. Nevertheless, the ranking function does not consider the ordering of the list items, in the sense that a permutation of the elements of the result would yield exactly the same score. On the other hand, since our scheme also relies on greedy heuristics, each new result item is selected because it produces an augmented set that achieves a better score than any other set derived by adding an alternative node. Hence, previous selections play a dominant role in which node will be included next.

The authors in [12] propose a measure that also corresponds to a submodular set function, which they call expanded relevance, and is used to compute the coverage of the "interesting" part of the graph. However, they argue that treating the diversification problem as a bicriteria optimization problem, with a relevance measure that ignores diversity and a diversity measure that ignores relevancy, can turn out problematic. Towards this end, they propose a greedy algorithm which at each iteration considers including in the result-set a vertex among the kδℓ top-ranked vertices, where δ is the average fan degree and ℓ a system parameter dictating from how many hops away we should consider including nodes in the result. The best among those nodes is selected to augment the result, and the impact of its distance-ℓ neighbors can be adjusted. On the downside, we reckon that this is somehow limiting, as each time only vertices that are up to ℓ hops away from the tentative result are considered, and no object at distance greater than kℓ from the query center is ever examined, even though it could constitute a much better option.

The method proposed in [11] assumes that all pairwise distances are pre-computed and made available in a square matrix, and given a query the relevance of each vertex is computed. Given these parameters, the proposed algorithm greedily appends to the result-set the element that maximizes a relevance factor that also captures clustering properties (e.g., a relevant object that is close to other relevant items, thus forming a cluster, is ranked higher) and penalizes the items that are close to the already selected ones.

GrassHopper [28] incrementally selects the best ranked vertex each time and turns it into a sink node. Leveraging random walks, the algorithm converges when all sinks are traced, and estimates the ranks by the number of visits to each node before convergence. It uses a matrix inversion to find the expected number of visits; nevertheless, inverting a sparse matrix makes it dense, which is not practical for the large and sparse graphs that the research community is interested in, e.g. web-graphs, social networks, etc.

In another line of work, Dragon [18] relies on greedy heuristics, and its authors propose a submodular set function for ranking that combines relevance with the "Google matrix" [17] in an effort to capture the strength of the connection between any two elements. Similarly, DivRank [15] adjusts a transition matrix based on the number of visits to the vertices using a random walk variant called vertex-reinforced random walks, which assumes that for every vertex there is a link returning back to the node itself, followed with some probability. The method ensures that the highly ranked nodes collect more value over the iterations, resulting in a rich-gets-richer phenomenon.

3. PROBLEM SPECIFICATION
The problem we study in this paper can be described as follows: given a directed graph G(V,E), where V and E are its sets of vertices and edges respectively, we want to find the best set S of k vertices that minimizes either of the Maximal Marginal Relevance (MMR) ranking functions from Equations 2 or 3 for a given query object, say q. Henceforth, we assume that each vertex u in the network is associated with a textual description tu, to which we may also refer as the context of vertex u. By definition, the combination of the two, a vertex from V along with its textual description, constitutes an object.

The association now becomes clear. According to the adopted model, each vertex represents a web-page, its directed edges correspond to its links, and its textual description to its content. Our model is versatile and flexible enough to accommodate any measure of similarity between two textual descriptions. For our purposes, we employ the cosine similarity, and thus, we define the distance of two vertices u and v, given their contexts tu and tv respectively, as

$$d_{text}(u,v) = 1 - \frac{\langle t_u \cdot t_v \rangle}{|t_u|\,|t_v|},$$

which is smoothed accordingly by the network distance in an affine combination using special variables dedicated to that purpose. To elaborate, parameter λ takes values in [0,1] in Equations 2 and 3, and denotes the desired trade-off between relevance and result dissimilarity. Likewise, α and β also take values within the same interval and stand for the compromise between the network distance and context similarity of any pair of objects. Moreover, each of the two ranking functions adopts different criteria on how to quantify collectively the different characteristics and properties of an input set and compromise the relevance of its elements with their degree of dissimilarity. More importantly, the formulation of the problem we study here is novel, in the sense that this is the first work to compromise the structure of the network with the textual descriptions of its elements, while allowing at the same time a great degree of freedom to define exactly how much each of those features should contribute to the rank of the result, by configuring either the α parameter for relevance, or β for the desired pairwise dissimilarity of the result items. For example, for a specific configuration of the problem, we might be interested in web-pages that are similar to the query in terms of their textual description only, and hence we set α = 0, but as distant from each other as possible in terms of their network properties, and we therefore set β = 1. In addition, we can specify how much we want relevance and result dissimilarity to contribute to the ranking process using the λ parameter. For example, we might want to promote sets that congregate better representatives of the domain rather than relevant ones for low λ-values, and vice versa for high ones. In Equations 4 and 5 we show the form that Eqs. 2 and 3 take when we are interested in relevance according to web-pages' content and also want to put the items of the result as far as possible from each other.
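For illustration, a minimal Java sketch of this context distance over sparse tf-idf vectors follows; the Map-based vector representation is an assumption of ours, not necessarily that of the prototype.

import java.util.Map;

// d_text(u, v) = 1 - cos(t_u, t_v) over sparse tf-idf vectors.
public final class TextDistance {
    public static double dText(Map<String, Double> tu, Map<String, Double> tv) {
        double dot = 0.0, nu = 0.0, nv = 0.0;
        for (Map.Entry<String, Double> e : tu.entrySet()) {
            Double w = tv.get(e.getKey());
            if (w != null) dot += e.getValue() * w; // inner product <t_u, t_v>
            nu += e.getValue() * e.getValue();
        }
        for (double w : tv.values()) nv += w * w;
        if (nu == 0 || nv == 0) return 1.0;         // maximally distant if a vector is empty
        return 1.0 - dot / (Math.sqrt(nu) * Math.sqrt(nv));
    }
}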

Let us now have a closer look at how the distance functions are used, and it will become clear how objects that have been added previously affect the selection of the succeeding elements. We note that, just like the web-graph, the conceptual graph considered in our


$$\sigma_q(S) = \frac{\lambda}{N}\sum_{u\in S}\{\alpha\, d_{graph}(q,u) + (1-\alpha)\, d_{text}(q,u)\} + \frac{1-\lambda}{N(N-1)}\sum_{v\in S}\sum_{\substack{w\in S\\ v\prec w}}\{\beta\, d_{graph}(v,w) + (1-\beta)\, d_{text}(v,w)\} \tag{2}$$

$$\sigma_q(S) = \lambda\,\max_{u\in S}\{\underbrace{\alpha\, d_{graph}(q,u) + (1-\alpha)\, d_{text}(q,u)}_{d_{rel}(q,u)}\} + (1-\lambda)\min_{\substack{v,w\in S\\ v\prec w}}\{\underbrace{\beta\, d_{graph}(v,w) + (1-\beta)\, d_{text}(v,w)}_{d_{diss}(v,w)}\} \tag{3}$$

paradigm is directed. The first factor of our ranking functions, which we denote by drel(·,·), aggregates the relevance of the elements in the result. Regarding the nodes' textual descriptions, the cosine similarity is a symmetric function. Nonetheless, in terms of network distance, we aggregate the distances from the query center q to each of the result items, since we are interested in the distance in clicks from the query center to each of the result items, and not the opposite. Occasionally, we could aggregate for each result item the distances to the query center as well as the distances from it, so as to produce a more symmetric effect. Of course, one could argue that web-pages that are close to each other from both directions, rather than just one, could be more related to each other, but this is just a convention regarding the rank function. Likewise, for the second factor of our ranking functions, which we denote by ddiss(·,·), we use the distance from the preceding result items in order to quantify their diversity. This is important, as the user skims through the result items and expects to find more diverse elements, starting from the most relevant ones. In principle, web-pages possibly many hops away from the query center are examined each time, but the best ranked are processed first in a best-first manner, according to their potential to contribute to the final result. Specific variable thresholds can guarantee how far we should expand our search from the query center q and the other elements of the tentative result, so as to achieve the desired level of diversification, always considering their context similarities as well.
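For illustration, the following sketch (ours, not the prototype's code) computes the dissimilarity factor of Eq. 2 over an ordered result, comparing each item only with the items that precede it; dGraph and dText are assumed pre-computed pairwise distance matrices indexed by vertex id.

import java.util.List;

// Dissimilarity factor of Eq. 2: each item against its preceding items (v ≺ w).
public final class DissFactor {
    public static double compute(List<Integer> result, double beta,
                                 double[][] dGraph, double[][] dText) {
        int n = result.size();
        if (n < 2) return 0.0;
        double sum = 0.0;
        for (int j = 1; j < n; j++)          // item at position j ...
            for (int i = 0; i < j; i++) {    // ... against every preceding item
                int v = result.get(i), w = result.get(j);
                sum += beta * dGraph[v][w] + (1 - beta) * dText[v][w];
            }
        return sum / (n * (n - 1));          // the caller scales by (1 - λ), per Eq. 2
    }
}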

$$\sigma_q(S) = \frac{\lambda}{N}\sum_{u\in S} d_{text}(u,q) + \frac{1-\lambda}{N(N-1)}\sum_{v\in S}\sum_{\substack{w\in S\\ v\prec w}} d_{graph}(v,w) \tag{4}$$

$$\sigma_q(S) = \lambda\,\max_{u\in S}\, d_{text}(u,q) + (1-\lambda)\min_{\substack{v,w\in S\\ v\prec w}} d_{graph}(v,w) \tag{5}$$

Furthermore, the returned solutions differ depending on which of the criteria we put emphasis on. For example, when processing structured data, by averaging the properties of the items in the result, their distribution seems to fit the data distribution, with many of those tuples appearing mostly within different clusters of the data. On the other hand, for the latter type of aggregation, the points would ideally form a constellation around the query center for a given relevance (max distance from the query), lying almost equidistant from each other on the fringes of a circle with a radius analogous to the least allowed relevance. Depending on the expected size of the result and the trade-off between relevance and result dissimilarity, we should retrieve the solutions that somehow resemble that formation. Last but not least, the very essence of the problem becomes completely different when the network structure is involved. To the best of our knowledge, this is the first work to combine both aspects of this important problem in an elegant fashion that allows a great degree of flexibility on how to configure the trade-offs of relevance over result items' dissimilarity and network distance over context (dis)similarity.
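To make the ranking functions concrete, the sketch below evaluates the max-min function of Eq. 3 for a query q and an ordered result S, under the same assumption of pre-computed distance matrices as in the previous sketch; lower scores rank better.

import java.util.List;

// Evaluation of the max-min ranking function of Eq. 3.
public final class MaxMinRank {
    public static double sigma(int q, List<Integer> s, double lambda,
                               double alpha, double beta,
                               double[][] dGraph, double[][] dText) {
        double rel = 0.0;                   // max over the relevance terms d_rel(q,u)
        for (int u : s)
            rel = Math.max(rel, alpha * dGraph[q][u] + (1 - alpha) * dText[q][u]);
        double diss = 0.0;                  // min over preceding pairs d_diss(v,w)
        if (s.size() > 1) {
            diss = Double.POSITIVE_INFINITY;
            for (int j = 1; j < s.size(); j++)
                for (int i = 0; i < j; i++) {
                    int v = s.get(i), w = s.get(j);
                    diss = Math.min(diss, beta * dGraph[v][w] + (1 - beta) * dText[v][w]);
                }
        }
        return lambda * rel + (1 - lambda) * diss;
    }
}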

4. COMPUTATION OF DIVERSIFIED SETS
We provide in this section a top-down presentation of our paradigm. At the top level we have Alg. 1, which combines a method leveraging greedy heuristics with a hill-climbing based approach. The latter is seeded with the solution of the greedy algorithm. In particular, we seed the interchange algorithm with kg instances of n elements each, so as to produce kc even more diversified sets, and eventually opt for the best one to return, approaching this way as much as possible the true optimal solution. In effect, it is similar to performing many searches from multiple points towards the optimal destination.

In Algorithm 2, we present our hill-climbing method, which, starting from a collection of sets, iteratively refines them one by one with respect to their rank. First, we create all possible subsets of cardinality n−1 each in lines 4–15, and then insert them into a min-heap that is dedicated to storing them in such a way that we can process them according to how promising they are to be part of the final solution. This is what happens in lines 16–37, where in each iteration we take the subset that is ranked highest (line 17), and from then on, we iterate through its possible i-th addendum objects (lines 19–37) in terms of their eligibility to complement the examined subset, starting from the optimal¹ such element. Notably, even though the outcome is not the globally optimal solution, and neither is the examined subset necessarily part of it, the first encountered object is the best element overall to insert into the subset (line 22) so as to achieve the best possible score while keeping that particular subset composition intact. In turn, the one encountered in the next iteration of the loop is the second best, and so on. Then, in lines 23–29, we create n−1 new possible subsets given the new composition congregating that particular annex element. However, not all of them will be considered in future processing steps. With the checks in lines 26 and 27, only the ones that are better ranked than the subsets they were derived from will be inserted into the heap where we maintain the subsets to be processed later on. Next, in lines 30–37, we check whether the newly formed candidate set should be inserted into the result-heap. This happens if it outranks any of the candidates retrieved so far. If this is not the case, we can proceed with examining the next subset, since the succeeding possible replacement would make an even worse set, and therefore there is no point in forming those sets only to discard them eventually. Finally, this process terminates when there are no more subsets to process.

Algorithm 1: diversify(q, n, R, kg, kc): Combining both schemes, greedy and interchange, using the solutions of the former to feed the latter.

1 refined = interverso(q, R, greeverso(q, n, R, kg), kc);
2 while refined.size() > 1 do
3   refined.pop();
4 return refined.pop();

In Algorithm 3, we present our method that relies on greedy heuristics to generate kg diversified sets. Each of them is initiated with a vertex which is similar to the query center according to an

¹We describe towards the end of this section how the single optimal such object is retrieved using our verso method for complementing the rest of the set.


aggregation of their network distance and context relevance dictated by the selected ranking function. The selection of those vertices that are similar to the query takes place in lines 5–17, and these initial sets, of just one element at first, will be processed in terms of their potential in lines 26–47, and will be incrementally augmented with well diversified objects that are greedily selected while taking into consideration the preceding elements of the set. In particular, we examine in lines 29–47 each set that is not completely filled yet, and we create new ones with the same first elements but one additional element, which is retrieved in line 30 according to its eligibility to complement the partial result, and is therefore placed at the next available position. Then, we choose which of the fully

Algorithm 2: interverso(q, R, seeds, kc): Hill-climbing diversification algorithm.

1 candidates = new MaxHeap();
2 subsets = new MinHeap();
3 iterators = new Map();
4 foreach seed in seeds do
5   foreach s in seed do
6     subset = new Set(seed);
7     subset.remove(s);
8     subsets.push(subset);
9     iterators.put(subset, new GraphDivIterable(q, subset, R).iterator());
10  if candidates.size() < kc then
11    candidates.push(seed);
12  else
13    if σq(seed) < σq(candidates.peek()) then
14      candidates.pop();
15      candidates.push(seed);
16 while not subsets.isEmpty() do
17   subset = subsets.pop();
18   replacements = iterators.get(subset);
19   while replacements.hasNext() do
20     replacement = replacements.next();
21     augmented = new Set(subset);
22     augmented.insert(replacement);
23     foreach s in subset do
24       newsubset = new Set(augmented);
25       newsubset.remove(s);
26       if σq(newsubset) < σq(subset) then
27         if candidates.size() < kc or σq(newsubset) < σq(candidates.peek()) then
28           subsets.push(newsubset);
29           iterators.put(newsubset,
30             replacements.replace(s, replacement));
31     if candidates.size() < kc then
32       candidates.insert(augmented);
33     else
34       if σq(augmented) < σq(candidates.peek()) then
35         candidates.pop();
36         candidates.insert(augmented);
37       else
38         break;
39 return candidates;

formed derived sets of n elements to keep in lines 34–47, if they outrank any of the top-kg tentative results we have retrieved so far. Otherwise, we keep in lines 44–47 only the best kg such incomplete sets, with an addendum at the end, for future processing in the following runs of the iteration. To elaborate on the usage of the iterator in lines 24, 28, 30, 31, and 34, it allows us to browse the vertices in order of their suitability to complete the partial set in the best possible way. Their retrieval is based on Alg. 4 and is innovative in the sense that it combines the context of the vertices along with their network distance from the items of the examined set. The iterator's expand(s) invocation in line 34 allows us to take advantage of the graph search we have conducted so far, instead of creating a newly initialized iterator instance and performing again a significant portion of the search we have already done at an earlier stage. Thereby, we exploit the state of the previous iterator by copying its reusable elements into a new instance, and also add the necessary information, e.g. distance-heap, score-heap, etc., so as to accommodate the search around the new result items.

At the heart of our paradigm is Algorithm 4, which, given a query center q and a set of interconnected objects S, returns the most diversified object u in the network that also belongs to set R. We note that the inclusion requirement for set R aims at preventing certain web-pages from appearing in the result-set for reasons that do not concern this work, e.g., inappropriate or offensive content, reported web-pages, etc. Alternatively, one could remove those elements from the graph, but this is a more subtle and elegant intervention that can be upheld if necessary. Besides, all documents that were reached from those nodes and were not accessible via any other path would not be accessible anymore, regardless of whether they belong to R or not. Thereby, R can play the role of a white-list or a black-list, depending on the requirements of the problem.

We first initialize the min-heaps used for searching around q and each element of S in lines 1–11 of Alg. 4. More importantly, the vertices are ranked in terms of an aggregation of network distance and context similarity according to either Eq. 2 or 3. For each source vertex, we have a heap where the vertices are inserted according to their distance from that particular source, and another in terms of their partial score regarding that source only. By convention, we let the complete score of an individual vertex u be given by σq(S ∪ {u}) − σq(S), and it reflects the change in the score of the augmented set. For instance, it follows for Eq. 2 that the change incurred in the score of the diversified set is equal to

$$\frac{1}{|S|+1}\Big(\lambda\alpha\, d_{graph}(q,u) - \beta\,\frac{1-\lambda}{|S|}\sum_{s\in S} d_{graph}(s,u)\Big) + \frac{|S|}{|S|+1}\,\sigma_q(S)$$

in terms of its network properties, and in terms of its context is

$$\frac{1}{|S|+1}\Big(\lambda(1-\alpha)\, d_{text}(q,u) - (1-\lambda)\,\frac{1-\beta}{|S|}\sum_{s\in S} d_{text}(s,u)\Big) + \frac{|S|}{|S|+1}\,\sigma_q(S)$$

Of course, since we want to retrieve the object with the smallest such value by comparing a carefully selected subset of the available web-documents with each other, we can just use the part of the formula within the parentheses and disregard the rest, for it makes no difference when comparing any two web-pages.
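For illustration, the following sketch computes exactly that candidate key for Eq. 2: the 1/(|S|+1) scale and the trailing (|S|/(|S|+1))·σq(S) term are identical for every candidate, so the parenthesized parts alone decide the comparison. The matrix-based distance representation is again an assumption of ours, and S is assumed to be non-empty.

import java.util.List;

// Candidate key under Eq. 2: only the terms that differ per candidate u.
public final class InsertionKey {
    public static double key(int q, int u, List<Integer> s, double lambda,
                             double alpha, double beta,
                             double[][] dGraph, double[][] dText) {
        double sumGraph = 0.0, sumText = 0.0;
        for (int v : s) { sumGraph += dGraph[v][u]; sumText += dText[v][u]; }
        double graphPart = lambda * alpha * dGraph[q][u]
                         - beta * (1 - lambda) * sumGraph / s.size();
        double textPart  = lambda * (1 - alpha) * dText[q][u]
                         - (1 - lambda) * (1 - beta) * sumText / s.size();
        return graphPart + textPart; // the smaller the key, the better the candidate
    }
}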

Moreover, this marginal gain can be further divided accordingly into the partial costs arising from each of the elements of S or the query center q. Likewise, we obtain the cost function associated with Eq. 3, which has a branching form. In particular, we discern the following cases: (i) drel(p,q) ≤ max_{x∈S} drel(x,q) and min_{x∈S} ddiss(p,x) ≥ min_{y,z∈S} ddiss(y,z), and thus the marginal gain σq(p) caused by the insertion of p is equal to 0; (ii) drel(p,q) > max_{x∈S} drel(x,q) and min_{x∈S} ddiss(p,x) ≥ min_{y,z∈S} ddiss(y,z), and hence we have σq(p) =


Algorithm 3: greeverso(q, n, R, kg): Greedy diversification algorithm.

1 similar = new MaxHeap();
2 paths = new MinHeap();
3 paths.push((q,q), 0);
4 marked.insert(q);
5 while not paths.isEmpty() do
6   (q,u) = paths.pop();
7   if u ∈ R then
8     if similar.size() < kg then
9       similar.insert(u, σq(u));
10    else
11      if σq(u) < σq(similar.peek()) then
12        similar.pop();
13        similar.push(u, σq(u));
14  foreach adjacent node v to u do
15    if not marked.contains(v) and (similar.size() < kg or (q,u).weight + (u,v).weight < σq(similar.peek())) then
16      paths.push((q,v), (q,u).weight + (u,v).weight);
17      marked.insert(v);
18 foreach s in similar do
19   S = new Set();
20   S.insert(s);
21   candidates.add(S);
22 iterators = new Map();
23 foreach candidate in candidates do
24   iterators.put(candidate, new GraphDivIterable(q, candidate, R));
25 result = new MaxHeap();
26 while not candidates.isEmpty() and candidates.first().size() < n do
27   S = candidates.removeFirst();
28   iterator = iterators.get(S);
29   counter = 0;
30   while iterator.hasNext() do
31     s = iterator.next();
32     newcandidate = new Set(S);
33     newcandidate.insert(s);
34     iterators.put(newcandidate, iterator.expand(s));
35     if newcandidate.size() == n then
36       if result.size() < kg then
37         result.insert(newcandidate);
38       else
39         if σq(newcandidate) < σq(result.peek()) then
40           result.pop();
41           result.insert(newcandidate);
42         else
43           break;
44     else
45       counter = counter + 1;
46       if counter ≥ kg then
47         break;

λ (drel(p,q) − max_{x∈S} drel(x,q)); (iii) drel(p,q) ≤ max_{x∈S} drel(x,q) and min_{x∈S} ddiss(p,x) < min_{y,z∈S} ddiss(y,z), and then it follows that σq(p) = (1−λ)(min_{x,y∈S} ddiss(x,y) − min_{z∈S} ddiss(p,z)); (iv) otherwise, we take σq(p) = λ (drel(p,q) − max_{x∈S} drel(x,q)) + (1−λ)(min_{y,z∈S} ddiss(y,z) − min_{x∈S} ddiss(p,x)); where drel(·,·) denotes the affine combination of network distance and content similarity for the relevance factor (smoothed using the α parameter), and ddiss(·,·) stands for the affine combination that corresponds to the result items' dissimilarity (smoothed with the β parameter) from Eq. 3. Hence, between any two objects we are inclined to prefer the one whose insertion leads to the set with the better rank.
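For illustration, the four cases translate into the following sketch; maxRel and minDiss denote the current extrema over the set S, while relP and minDissP are the candidate's own values, all assumed to be computed elsewhere.

// Four-case marginal gain σ_q(p) under the max-min function of Eq. 3.
public final class MaxMinGain {
    public static double marginalGain(double relP, double minDissP,
                                      double maxRel, double minDiss, double lambda) {
        boolean relGrows  = relP > maxRel;      // candidate raises the max relevance term
        boolean dissDrops = minDissP < minDiss; // candidate lowers the min dissimilarity term
        if (!relGrows && !dissDrops) return 0.0;                                // case (i)
        if (relGrows && !dissDrops) return lambda * (relP - maxRel);            // case (ii)
        if (!relGrows) return (1 - lambda) * (minDiss - minDissP);              // case (iii)
        return lambda * (relP - maxRel) + (1 - lambda) * (minDiss - minDissP);  // case (iv)
    }
}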

Algorithm 4: verso(q, S, R): Graph-context diversification given a query vertex q, its context, and a set of vertices S with their contexts.

1 initialize score min-heap for q;
2 insert q with 0 score value;
3 initialize distance min-heap for q;
4 foreach adjacent node of q do
5   insert it in terms of network distance from q;
6 foreach s in S do
7   initialize score min-heap for s;
8   insert s with 0 score value;
9   initialize distance min-heap for s;
10  foreach adjacent node of s do
11    insert it in terms of network distance from s;
12 create candidates min-heap congregating the tentative results;
13 while the best element of the candidates is no better than the score of the best undiscovered node do
14   let hσ be the score heap offering the min cost expansion;
15   let hδ be the respective distance heap for the same source, say u;
16   w = hσ.pop();
17   if w discovered from all sources then
18     if w in R then
19       push w to the candidates;
20   else
21     update partial score of w, since now accessed from u;
22   while σq(hσ.peek()) ≥ γ × dgraph(u, hδ.peek()) do
23     v = hδ.pop();
24     hσ.push(v);
25     foreach adjacent node y of v do
26       if y undiscovered from u then
27         hδ.push(y, v.weight + (v,y).weight);
28 return candidates.pop();

Next, in lines 13–27, we search for the most diversified vertex by expanding, with each iteration, our search from the source node that contributes the least partial score among q and all elements of S. If the newly accessed node u is now visited from all source nodes, then it is most certainly qualified to be included in the set of candidate solutions, provided that it also belongs to set R. Otherwise, we need to update the partial score of u accordingly. Furthermore, since the sources from which we initiate our graph search may lie distant from each other, the structure where we keep the vertices by their partial scores may grow disproportionately large until we retrieve the very first vertex that is accessed from all sources. This situation can turn out extremely tricky when the sources from which we initiate the search are far away from each other in terms of their network distance, and this case is not necessarily uncommon in practice. Therefore, it is of major importance to store this


information regarding vertices' partial scores in such a way that it can be accessed and updated effectively and efficiently. In particular, we need to check in line 13 whether the best score of any vertex that is not yet accessed from all sources is outranked by the best candidate solution (and thus encountered from all sources) retrieved so far, in which case we are in a position to return it to the user. If this is not the case, the search needs to continue with more iterations until the constraint is satisfied. Nevertheless, in order to compute a meaningful lower bound for the best possible complete score of any partially accessed vertex, we have to put in the place of the missing information (after all, we are talking about partial scores) the distance value that corresponds to the reach of the search from the source node from which the examined node has not been visited yet. Hence, even if we have those nodes sorted according to their partial scores, we still have to enrich that information with the respective distances, so as to safely compute the termination criteria without any loss. Most importantly, this information changes constantly with every iteration of the while-loop in lines 13–27. Thereby, the most appropriate structure to accommodate this operation, along with fast and efficient updates, is a sorted list, where each time we find the position of each element to be inserted with binary search. Hence, by selecting the first C elements with the minimum partial scores at the head of the list, where C takes a small value like 3 or 5, and replacing the respective distances when appropriate, we find the best possible score value that any partially accessed vertex could achieve. But now there might be partial scores that fall below that value. Hence, we need to examine all the elements of the list that have a partial score below that value, and among those we shall find the best possible score an unvisited vertex can achieve.
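For illustration, the following sketch (ours, and merely indicative of the bookkeeping described above) keeps partial scores in a sorted list with binary-search insertion and derives a lower bound by completing the C smallest partial scores with the current reach of the search.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sorted-list bookkeeping for partial scores with binary-search insertion.
public final class PartialScores {
    private final List<Double> scores = new ArrayList<>();

    public void insert(double partialScore) {
        int pos = Collections.binarySearch(scores, partialScore);
        if (pos < 0) pos = -pos - 1;   // binarySearch returns -(insertion point) - 1
        scores.add(pos, partialScore);
    }

    // Lower bound over the C smallest partial scores, each completed with the
    // current reach of the search from the sources that have not seen it yet.
    public double lowerBound(int c, double reach) {
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < Math.min(c, scores.size()); i++)
            best = Math.min(best, scores.get(i) + reach);
        return best;
    }
}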

Similarly, with each iteration of the while-loop, we need to update in line 21 the partial score of the newly accessed node, since it is now accessed from a new source. Next, the new position of the accessed node is found using binary search. Using a different data-structure for performing these operations, for example a heap, would involve extremely expensive updates. Similarly, examining all the vertices linearly would also be computationally expensive, since the size of the structure could grow very large, and these operations are performed again and again, as with each iteration we have a new distance that would complement a portion of the scores of the partly seen nodes differently than before. In line 16, we pop the element that contributes the minimum score increase (low score values correspond to a better rank) among the heads of all score-heaps, and if that node is now accessed from all sources, it is inserted into the heap where we keep all candidate results ordered by their scores; otherwise we update its partial score accordingly in line 21. Then, in lines 22–27, we pop from the distance-heap that is associated with the same source the next closest vertex and insert it into the respective score-heap, for this is the very next node, among the ones not encountered from all sources, to cause the least score overhead given the Maximal Marginal Relevance (MMR) ranking functions that we study in this paper. Then, in lines 25–27, we augment the respective distance-heaps with all the adjacent nodes of the popped element that are unvisited from that specific source, so as to be able to access the next closest vertex in the next iteration of the while-loop. The purpose of this nested inner loop is to ensure that each time we pop an element from a score heap that is associated with a specific source, there is no other vertex in the network that can cause a smaller increase to the score function given that source. This is the case when the constraint of the while-loop is satisfied in line 22, as it is ensured that the next shortest distance, multiplied by either factor α or β, depending on whether the source that we examine at the time is the query center or any of the

elements of S, is greater than the least score value in the respective heap. Hence no textual description considered here outranks the best element at the head of the associated score heap. The combination of the two loops ensures that each time we return the vertex that achieves the optimal score overall, given the specific query center q and the set of vertices S that were given as arguments.

Alg. 4 is used to implement the iterator of our aforementioned methods, Algs. 2 and 3. Recall the GraphDivIterable object that we create in line 9 of Alg. 2 and line 24 of Alg. 3. In particular, lines 1–12 would be part of the constructor of the iterator, while lines 13–28 would constitute the method returning the next best object in the graph to complement the set formed so far. A function testing whether there are still objects to be retrieved would simply check whether there are still objects in the candidates' heap or in any of the score-heaps and distance-heaps with items waiting to be processed. Moreover, in line 30 of Alg. 2 and line 34 of Alg. 3 we use the previous state of the existing iterators to enhance the newly created ones and avoid repeating a significant portion of the graph search. This is an important optimization, since graph traversal might consume considerable time. In addition, when we create a new iterator, we insert into its list of partially seen objects all candidate objects from the iterator whose state we copy to initialize the new one. Those items are now seen from all sources but the new source of the new iterator. In addition, we copy the elements from the old list of partially seen objects, and also the objects that have already been returned as results (we keep these in a list).
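For illustration, the iterator contract implied by this mapping could look as follows in Java; the interface and its signatures are ours, not the prototype's actual API.

import java.util.Iterator;

// Iterator view of Alg. 4: the constructor corresponds to lines 1-12,
// next() to lines 13-28, and expand(s) clones the reusable search state.
public interface GraphDivIterator extends Iterator<Integer> {
    @Override boolean hasNext();    // candidates or heap entries still pending?
    @Override Integer next();       // the next best vertex to complement the set
    GraphDivIterator expand(int s); // copy state, adding s as an extra source
}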

Further, due to the nature of the problem, when real-time response is required for processing very large graphs, we can set a time-out parameter for each of the two modules of our implementation, i.e. the greedy and the hill-climbing methods. Equivalently, we can limit the maximum number of iterations by adding such an additional constraint in line 16 of Alg. 2 and line 12 of Alg. 4. This parameter has no meaning for small sets, but might be required for large diversified sets in order to achieve faster response times, at the cost of a result of lesser quality. First, the most diversified object in the network is returned within td time, the time-out parameter we could add in line 12 of Alg. 4. When the execution of the loop is stopped due to the time-out, the algorithm has most probably not returned the next most diversified object but, instead, the best candidate retrieved so far. This indirect mechanism is necessary, since setting a cut-off time for our greedy method in Alg. 3 as a whole is prohibited: sets with fewer than n elements could be returned otherwise. On the other hand, for our hill-climbing based algorithm, we set the cut-off time tc to a multiple of |S| × kg × td, so as to allow for the refinement of all the seeds by at least a few elements. Nevertheless, when a fast response is required and tc takes a small value, parameter kc should also be quite small, and therewith, just one or two of the best seeds will be processed further. Whether all of the selected seeds will be refined further also depends on the time-out parameter; thus, for low tc values, setting kc appropriately small allows those few seeds to reach their respective optima within the required time-frame. Under these circumstances, this is a wiser policy than expanding more seeds (for large kc values) inadequately.
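As a rough illustration, a time-bounded retrieval under these assumptions might look as follows; the names and the budget multiple are ours, not the paper's:

    import java.util.Iterator;

    // Sketch of a time-bounded next-best retrieval: if the budget td (ms)
    // expires before the loop proves the exact next most diversified
    // vertex, the best candidate found so far is returned instead.
    class BoundedSearch {
        static Integer nextBestWithin(Iterator<Integer> expansionSteps, long td) {
            long start = System.currentTimeMillis();
            Integer bestSoFar = null;
            while (expansionSteps.hasNext()
                    && System.currentTimeMillis() - start < td) {
                bestSoFar = expansionSteps.next(); // one iteration of line 12, Alg. 4
            }
            return bestSoFar;
        }

        // The hill-climbing phase gets tc, a multiple of |S| * kg * td,
        // so that every seed can be refined by at least a few elements.
        static long hillClimbBudget(int setSize, int numSeeds, long td, int multiple) {
            return (long) multiple * setSize * numSeeds * td;
        }
    }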

5. EXPERIMENTAL EVALUATION

In this section we assess the performance of our methods for various configurations and present our results.

5.1 Setting

Metrics. We adopt three metrics: (i) the total execution time required until each method returns its result; (ii) the memory allocated by all data structures (lists, queues, heaps, etc.), without considering the memory required for storing the respective graphs and the web-documents' tf-idf vectors of the employed vector space model; and (iii) the quality of the result returned by each method, for which we use the scores of the respective cost functions.

Parameters. A variety of parameters have been employed in the construction of our synthetic workload so as to ascertain the effectiveness and efficiency of our paradigm: (1) the number of web-documents, (2) the number of links per page, (3) the number of lemmas per web-document, whose frequency follows a Zipfian distribution, (4) the number of web-pages in the diversified set returned to the user, (5) the number of seeds kg, which is also equal to kc in our configuration, (6) the compromise λ between relevance and result elements' dissimilarity, and (7) the network/context trade-off β upon which result items' dissimilarity is computed. For (6), high values indicate a bias towards relevance, whereas in (7) high values indicate a bias towards graph characteristics over content. On the other hand, the relevance network/context trade-off is constant in our setting and considers documents' content exclusively, as we fix α to 0. Clearly, we are interested in a diversified result of web-documents that are highly relevant in terms of their content and at the same time diversified from each other in terms of both their content and network properties, a mixture smoothed by parameter β. In Tables 1 and 2, we present all parameters along with the range of values they take and their default values.
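Structurally, and only as a hedged reconstruction (the exact objective is defined earlier in the paper; d_G, d_C and agg are our notation for network distance, content distance, and the per-criterion aggregator), the trade-offs listed above compose as:

\[
\mathrm{rel}(q,v) \;=\; \alpha\, d_G(q,v) \;+\; (1-\alpha)\, d_C(q,v),
\]
\[
\mathrm{diss}(v,S) \;=\; \operatorname{agg}_{s \in S}\big[\beta\, d_G(s,v) \;+\; (1-\beta)\, d_C(s,v)\big],
\]

with agg being the average under min-avg criteria and the maximum under min-max criteria; the two quantities are then traded against each other through λ in the MMR objective. Since α = 0 in all our experiments, rel(q,v) reduces to the content distance d_C(q,v).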

Parameter          Range               Default
#web-pages         5k, 10k, 20k, 40k   10k
#links per page    5, 10, 20, 40       10
#lemmas per page   50, 100, 200, 400   100

Table 1: Synthetic dataset parameters.

Datasets. We make use of Wikipedia snapshots as real datasets, obtained from http://dumps.wikimedia.org/. The snapshot that we use here corresponds to the Greek collection of articles. It consists of 104,364 documents with 2,733,279 links in total. On average, each article links to 17.47 other articles, while 16.7 other links point to each article. From this collection we consider in the directed graph only the documents that correspond to actual articles, and disregard user profiles, discussions, etc. For the synthetic graphs, we connect the vertices at random until we have the desired number of links, as sketched below. For each configuration we execute 100 queries with query centers selected uniformly at random.
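A minimal sketch of that synthetic construction, under our own naming (the paper does not spell out its generator beyond the description above):

    import java.util.*;

    // Sketch: link vertices uniformly at random until the target number of
    // directed links is reached; self-loops and duplicate links are rejected.
    class RandomGraphBuilder {
        static List<Set<Integer>> build(int numPages, int linksPerPage, long seed) {
            Random rnd = new Random(seed);
            List<Set<Integer>> adjacency = new ArrayList<>(numPages);
            for (int i = 0; i < numPages; i++) adjacency.add(new HashSet<>());
            long target = (long) numPages * linksPerPage;
            for (long placed = 0; placed < target; ) {
                int u = rnd.nextInt(numPages), v = rnd.nextInt(numPages);
                if (u != v && adjacency.get(u).add(v)) placed++; // add() is false on duplicates
            }
            return adjacency;
        }
    }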

Parameter         Range            Default
#results          5, 10, 20, 40    10
#seeds            1, 2, 3, 4       2
rel/diss λ        .2, .4, .6, .8   .8
diss net/text β   .2, .4, .6, .8   .8

Table 2: Queryset configuration.

Baseline. We compare our methods against an adapted version of BestCoverage from [12] for diversified ranking over graph data, extended so as to consider documents' content as well. As we described in §2.2, their effort constitutes a heuristic scheme. All methods were implemented in Java and run on an AMD Opteron processor clocked at 2.3 GHz.

5.2 Results

In Figures 2, 3 and 4, we present our results for the synthetic workload, whereas in Figs. 6, 7 and 8 we show our results for the Wikipedia dataset. First and foremost, a prevalent observation throughout our evaluation is that our methods clearly outperform the BestCoverage method from [12] in terms of all metrics, namely processing time, allocated memory, and result quality, for all configurations of our parameters. In particular, our paradigm is faster by several orders of magnitude and consumes significantly less memory. More importantly, unlike our competitor, our approach is not limited to examining web-pages up to a particular distance, and it therefore returns a result of better quality. The main reason their performance deteriorates lies with the fact that they do not traverse the graph carefully, as they do not exploit the information originating from previous traversals. For instance, they consider web-pages up to a certain number of hops away from the elements of the current result; still, the cost of computing the score of a candidate document remains high, as our evaluation confirms, since many other documents can intervene when computing the network distance between the candidate document and every other result item. On the other hand, our scheme performs a coordinated search around each result item and the query center until certain constraints are satisfied. What is more, our scheme never has to traverse any subpath twice, not even for score computation. Remember how we expand each distance-heap and insert each new node into the respective score-heap until it has been encountered from all sources; that is the moment it becomes a candidate result, and never again is any part of the path to that node from any source visited. In general, we reckon that graph data have specific characteristics that must be carefully considered in order to design effective and efficient processing methods. We display the results for our competitor for the synthetic workload only.

Effect of collection size |V|. We can study the impact of the number of documents in the collection in the first column of Figs. 2, 3 and 4. Regarding execution time, we observe in Fig. 2(a) that the first phase of our paradigm, consisting of our heuristic method, requires less time than our hill-climbing approach. The first phase requires almost the same time for both min-avg and min-max ranking criteria. However, more time is spent during the latter phase when min-avg ranking criteria are used. We observe the same pattern in Fig. 3(a), where the amount of memory allocated by each method is shown. In Fig. 4(a), the result quality seems to be only slightly affected by the size of the collection, and only for the heuristic method under min-max ranking criteria. Notably, for min-avg ranking the number of web-documents has no effect at all.

Figure 5: Legend for compared methods and ranking criteria.

Effect of number of hyper-links |E|. The second column illustrates performance in terms of the number of hyper-links to other documents of the collection.

[Plots omitted: processing time (sec) versus #webpages, #links/page, result-size, #seeds and #results, #lemmas/page, λ, and β.]
Figure 2: Processing time for synthetic workloads over various configurations.

[Plots omitted: allocated memory (MBs) versus #webpages, #links/page, result-size, #seeds and #results, #lemmas/page, λ, and β.]
Figure 3: Allocated memory for synthetic workloads over various configurations.

[Plots omitted: rank function score versus #webpages, #links/page, result-size, #seeds and #results, #lemmas/page, λ, and β.]
Figure 4: Rank function scores for synthetic workloads over various configurations.

Evidently, processing time grows with the number of links per document in Fig. 2(b). The reason behind this behavior lies with the fact that, as the size of the collection remains fixed at its default value, increasing the fan-out parameter brings the documents closer to the query center. Subsequently, more documents must be examined in order to ascertain whether they qualify to be added to the result, and more documents become candidate result items to be compared with other eligible ones. This imposes a computational overhead that is reflected in Fig. 2(b) accordingly. In Fig. 3(b), the memory requirements increase for min-max criteria as expected, but for min-avg criteria the allocated memory seems somewhat irregular with the fan-out degree. What exactly could cause an increase in the allocated memory? First, for our hill-climbing approach that would be the number of qualified subsets waiting to be processed further. Keep in mind that with each such subset we construct an iterator that allows us to traverse the best items to complement that particular subset. Evidently, the number of subsets kept simultaneously increases with the fan-out degree for min-max. However, for min-avg their number increases for the first two configurations, only to gradually diminish later on. In practice, the min-avg ranking function allows for more effective pruning, as it enables ranking of greater granularity. More specifically, as the cardinality of the result-set increases and the result has already been filled with well-diversified objects that constitute good representatives around the query center, the remaining vertices achieve scores that are almost identical regardless of where they lie in the graph, for they will always have an item of the result-set somewhere relatively close, at least no further than from any other vertex in the graph. We reckon that this is the critical value of the cardinality of the result-set for min-max ranking criteria, and this situation can serve as a termination criterion for the method when encountered: beyond this point there is no practical distinction, for that particular query center, among the vertices excluded from the result-set. This is also the main reason that more candidate subsets accumulate, which is reflected in the allocated memory. This intricate detail explains most of the discrepancies in the behavior of the two ranking functions. Second, we run a configuration for our heuristic scheme where we limit the number of candidates by kg, with a separate seed starting from each of the most similar documents to the query center, and therefore the memory requirements are significantly lower.

Effect of result size n. The third column illustrates performance in terms of the result size. Execution time grows very fast with the cardinality of the result-set in Fig. 2(c) for both phases, even though the latter phase is more expensive under min-avg ranking criteria. In Fig. 3(c), the memory allocated during the latter phase is increased compared to the amount allocated during the first phase; again, the gap widens for min-avg ranking criteria. In Fig. 4(c), we see the scores deteriorate with n, but this is solely attributed to augmenting the result with more elements: since those are selected incrementally, less relevant elements start to appear, whose dissimilarity from the rest of the result items evidently fails to compensate for that loss of relevance.

Effect of number of seeds kg and results kc. The fourth column depicts performance in terms of the number of seeds produced during the first phase and the number of results returned by the second phase, from which we select the best to return to the user. This is an important system parameter of our scheme that does not affect our competitor, and therefore no results are shown for the BestCoverage method here. We also remind the reader that each of those results has cardinality defined by the aforementioned parameter n. Time grows linearly in Fig. 2(d) for min-avg ranking criteria and sub-linearly for min-max ranking criteria. We make identical remarks for Fig. 3(d), where the memory allocated by our heuristic scheme increases linearly, whereas it seems to grow faster for our hill-climbing approach. In Fig. 4(d), we see the beneficial effect of this parameter, as the quality of the score ameliorates. This is expected due to the wider variety of sets from which we select the final result. However, the improvement is modest, which indicates that even for small values of this parameter our paradigm is capable of constructing a well-diversified set. Subsequently, we claim that the time overhead stemming from increasing this parameter to large values does not compensate for the improvement in the quality of the result for our synthetic dataset; setting this parameter to a small value (say two or three, as we do for the default value) is sufficient.

Effect of number of lemmas |W|. The fifth column presents our results with regard to the number of lemmas observed per web-document.

[Plots omitted: processing time (sec) versus result-size, #seeds and #results, λ, and β.]
Figure 6: Processing time for the Wikipedia dataset over various configurations.

[Plots omitted: allocated memory (MBs) versus result-size, #seeds and #results, λ, and β.]
Figure 7: Allocated memory for the Wikipedia dataset over various configurations.

[Plots omitted: rank function score versus result-size, #seeds and #results, λ, and β.]
Figure 8: Rank function scores for the Wikipedia dataset over various configurations.

For our configuration, the lemma frequencies follow a Zipfian distribution with skewness factor 0.1. Processing time and the amount of allocated memory grow linearly with the number of lemmas in Figs. 2(e) and 3(e), as expected. In particular, we see in Fig. 3(e) that our heuristic method requires the exact same amount of memory regardless of the increase in the number of lemmas. This can be attributed to the fact that this measurement does not account for storing the textual descriptions themselves, which would of course now be larger and more cumbersome (for computing inner products for similarities, etc.), as the response times indicate, but only for the references to those documents that each method creates. Nevertheless, we discern an irregular pattern for our second phase; these fluctuations attest to the sensitivity of the latter method to this parameter. In Fig. 4(e), the scores for min-max ranking criteria tend to converge with the increase in the number of lemmas, but there is a steady advantage for min-avg ranking. The reasons for this behavior lie again with the enhanced granularity that the min-avg ranking criteria offer.
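For reference, lemma ranks with these frequencies can be drawn as in the following sketch (our own naming; the paper does not specify its sampling routine):

    import java.util.Random;

    // Sketch: sample 1-based lemma ranks whose frequencies follow a
    // Zipfian distribution with the given skew (0.1 in our workload).
    class ZipfSampler {
        private final double[] cdf;
        private final Random rnd = new Random();

        ZipfSampler(int vocabularySize, double skew) {
            cdf = new double[vocabularySize];
            double norm = 0;
            for (int k = 1; k <= vocabularySize; k++) norm += 1.0 / Math.pow(k, skew);
            double acc = 0;
            for (int k = 1; k <= vocabularySize; k++) {
                acc += 1.0 / (Math.pow(k, skew) * norm);
                cdf[k - 1] = acc;
            }
        }

        int nextRank() {
            double u = rnd.nextDouble();
            int lo = 0, hi = cdf.length - 1;
            while (lo < hi) {               // first rank whose cdf covers u
                int mid = (lo + hi) >>> 1;
                if (cdf[mid] < u) lo = mid + 1; else hi = mid;
            }
            return lo + 1;
        }
    }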

Effect of trade-off parameters λ and β. Regarding the smoothing parameters λ and β of the ranking function, these have little impact on either execution time or allocated memory in Figs. 2(f), 2(g), 3(f) and 3(g), for either min-avg or min-max criteria. However, this is not true for the scores achieved in Figs. 4(f) and 4(g). In essence, applying the hill-climbing approach right after the heuristic method improves the greedy result in a way the heuristic method could not. To elaborate, unlike our hill-climbing approach, the greedy method selects the next element in accordance with the previously selected items, without being able to undo any previous selection and replace it with a better object given the selection of a subsequent element; the refinement sketched after this paragraph is what lifts that restriction. The last row of our results signifies exactly the contribution of the latter phase to the refinement of the result. Arguably, for some configurations, when the result of the first phase is good enough, the latter phase is not necessary, as the sets used to seed it do not lead to a subsequently better result, for they are already well-diversified. Another way to interpret such behavior is that, for those configurations, the form of the search space impedes the retrieval of better replacements, and therefore search has to be initiated from more seeds so as to overcome this difficulty. We might also want to put in the same category the case where the improvement does not compensate for the time overhead. But when can we claim that the second phase is redundant, or has nothing to contribute? To answer this question, a closer look is required so as to detect those cases. Apparently, the only cases where the contribution of the latter phase is marginal are when either λ or β takes low values. For every other configuration, we see that to some degree our hill-climbing approach succeeds in ameliorating the quality of the result of the heuristic method. Keep in mind that at all times α is set to zero, and thus relevance is computed by textual similarity only. In effect, when β also takes low values, diversified ranking relies substantially on the content of the web-documents, and their network distances play a secondary (but still important) role in the computation of the result. Thereby, when both factors of the ranking functions (for relevance and for result items' dissimilarity) agree on which aspect of the data to leverage, the heuristic method leaves no room for the second phase to improve the result. We observe the same phenomenon for low λ values. This is when our search is oriented towards result items' dissimilarity instead of their relevance; and since the default value of β, aggregating result items' dissimilarity, is directed towards elements' network distances by convention, we tacitly compare against the case where our main concern is biased towards network characteristics for both relevance and result items' dissimilarity. Consequently, we can rely on the heuristic result exclusively when both factors agree on which aspect of the data emphasis should be put.
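A minimal sketch of that refinement loop, under our own naming and with the scoring and next-best retrieval abstracted away (the paper's actual procedure is given in Alg. 2):

    import java.util.*;
    import java.util.function.Function;

    // Sketch: refine a seed set by swaps. For each member we tentatively
    // remove it, ask for the best vertex complementing the remainder, and
    // keep the swap only if the aggregate score improves (lower is better).
    class HillClimbing {
        static Set<Integer> refine(Set<Integer> seed,
                                   Function<Set<Integer>, Double> score,
                                   Function<Set<Integer>, Integer> nextBest) {
            Set<Integer> current = new HashSet<>(seed);
            boolean improved = true;
            while (improved) {
                improved = false;
                for (Integer member : new ArrayList<>(current)) {
                    if (!current.contains(member)) continue;   // stale after a swap
                    Set<Integer> trial = new HashSet<>(current);
                    trial.remove(member);
                    Integer candidate = nextBest.apply(trial); // e.g. via GraphDivIterable
                    if (candidate == null || candidate.equals(member)
                            || current.contains(candidate)) continue;
                    trial.add(candidate);
                    if (score.apply(trial) < score.apply(current)) {
                        current = trial;   // accept the improving replacement
                        improved = true;
                    }
                }
            }
            return current;
        }
    }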

Additional remarks on the Wikipedia dataset. We show in Figs. 6, 7 and 8 our results with regard to the result-size (cardinality n), the number of seeds kg and results kc, and the trade-off parameters λ and β of the employed MMR ranking functions, respectively. Clearly, there is a one-to-one correspondence between our observations for the synthetic workload and those for the Wikipedia dataset throughout; the results from the two workloads do not contradict each other for any configuration. Nevertheless, in the two last columns one notices that the hill-climbing method, corresponding to the second processing phase, cannot improve the intermediate result that the heuristic method produces during the first processing step. More specifically, we conclude that the default number of seeds (two in our case) is not sufficient for the second phase to find good replacements for the heuristic set collection. Notably, even when it does improve the heuristic intermediate result for larger kg and kc values, the computational overhead becomes prohibitive, mainly due to the large size of the article collection, and as a result does not seem to be compensated by the improvement.

Moreover, it is hard to miss the sudden increase in memory requirements for low λ values in Fig. 7(c) during the latter processing phase (hill-climbing) for min-avg ranking. This configuration denotes that we are mainly interested in dissimilar result-items rather than relevant ones. It appears that the competition over which subset will prevail is fierce, as pruning becomes less helpful, and this is why we have to keep the structures associated with so many subsets waiting to be refined and augmented with even better replacements. Remarkably, this spiking behavior is not present in Fig. 6(c), which can only mean one thing: even though there is an increased number of subsets to be processed, each of them eventually has no more than a few replacements that can complement it, and thus refinement is fast for each of them. In effect, since the default aggregation of content dissimilarity and network distance gives priority to the latter, graph search becomes quite limited and indicates a few good replacements within a well-bounded search space. Most importantly, owing to our versõ method, this bounding of the search space is lossless, whereas our competitor uses an artificial threshold of a fixed maximum number of hops that graph search can expand, even when all the interesting documents that achieve high scores lie just a little further from the query center. This is the most prominent advantage of our paradigm.

6. CONCLUSIONS

To recapitulate, in this paper we studied the problem of result diversification over interconnected document collections, e.g. web-pages. To the best of our knowledge, our paradigm constitutes the very first approach to consider the network structure formed by the documents' links in addition to their content. Our solution relies on greedy heuristics to obtain an initial set of candidate solutions, which are then refined through a hill-climbing approach. The criteria we adopt are based on Maximal Marginal Relevance (MMR) ranking functions, which we extended for our purposes so as to reconcile objects' network distances with the content similarities of their textual descriptions. Special parameters can be used to configure the trade-off of one feature over the other, in such a way that the users' preferences are accommodated to the utmost degree. Several optimizations have been employed in our paradigm so as to make the processing of large graphs feasible. Moreover, our design allows for tweaking the response times when required, sometimes at a cost in result quality. Finally, a thorough experimental evaluation validates the effectiveness and efficiency of our methods. Last but not least, we have launched a prototype implementing our methods; it runs over the Wikipedia dataset but can be extended with other document collections as well.

7. REFERENCES

[1] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In WSDM, pages 5–14, 2009.
[2] G. Calinescu, C. Chekuri, M. Pál, and J. Vondrák. Maximizing a monotone submodular function subject to a matroid constraint. SIAM J. Comput., 40(6):1740–1766, 2011.
[3] J. G. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR, pages 335–336, 1998.
[4] C. L. A. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon. Novelty and diversity in information retrieval evaluation. In SIGIR, pages 659–666, 2008.
[5] M. Drosou and E. Pitoura. Search result diversification. SIGMOD Record, 39(1):41–47, 2010.
[6] M. Drosou and E. Pitoura. DisC diversity: result diversification based on dissimilarity and coverage. PVLDB, 6(1):13–24, 2012.
[7] M. Drosou and E. Pitoura. Multiple radii DisC diversity: Result diversification based on dissimilarity and coverage. ACM Trans. Database Syst., 40(1):4, 2015.
[8] P. Fraternali, D. Martinenghi, and M. Tagliasacchi. Top-k bounded diversification. In SIGMOD, pages 421–432, 2012.
[9] S. Gollapudi and A. Sharma. An axiomatic framework for result diversification. IEEE Data Eng. Bull., 32(4):7–14, 2009.
[10] J. R. Haritsa. The KNDN problem: A quest for unity in diversity. IEEE Data Eng. Bull., 32(4):15–22, 2009.
[11] J. He, H. Tong, Q. Mei, and B. K. Szymanski. GenDeR: A generic diversified ranking algorithm. In NIPS, pages 1151–1159, 2012.
[12] O. Küçüktunç, E. Saule, K. Kaya, and Ü. V. Çatalyürek. Diversified recommendation on graphs: pitfalls, measures, and algorithms. In WWW, pages 715–726, 2013.
[13] R. Li and J. X. Yu. Scalable diversified ranking on large graphs. IEEE Trans. Knowl. Data Eng., 25(9):2133–2146, 2013.
[14] Z. Liu, P. Sun, and Y. Chen. Structured search result differentiation. PVLDB, 2(1):313–324, 2009.
[15] Q. Mei, J. Guo, and D. R. Radev. DivRank: The interplay of prestige and diversity in information networks. In SIGKDD, pages 1009–1018, 2010.
[16] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions – I. Math. Program., 14(1):265–294, 1978.
[17] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999.
[18] H. Tong, J. He, Z. Wen, R. Konuru, and C. Lin. Diversified ranking on large graphs: An optimization viewpoint. In SIGKDD, pages 1028–1036, 2011.
[19] G. Tsatsanifos. The tantalizing new prospect of index-based diversified retrieval. In SIGMOD/PODS Ph.D. Symposium, pages 49–54, 2013.
[20] G. Tsatsanifos, D. Sacharidis, and T. Sellis. RIPPLE: A scalable framework for distributed processing of rank queries. In EDBT, pages 259–270, 2014.
[21] E. Vee, J. Shanmugasundaram, and S. Amer-Yahia. Efficient computation of diverse query results. IEEE Data Eng. Bull., 32(4):57–64, 2009.
[22] M. Vieira, H. Razente, M. Barioni, M. Hadjieleftheriou, D. Srivastava, C. Traina Jr., and V. Tsotras. On query result diversification. In ICDE, pages 1163–1174, 2011.
[23] Y. Xu and H. Yin. Novelty and topicality in interactive information retrieval. JASIST, 59(2):201–215, 2008.
[24] C. Yu, L. V. S. Lakshmanan, and S. Amer-Yahia. It takes variety to make a world: Diversification in recommender systems. In EDBT, pages 368–378, 2009.
[25] C. Zhai, W. W. Cohen, and J. D. Lafferty. Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval. In SIGIR, pages 10–17, 2003.
[26] M. Zhang and N. Hurley. Avoiding monotony: Improving the diversity of recommendation lists. In RecSys.
[27] Y. Zhang, J. P. Callan, and T. P. Minka. Novelty and redundancy detection in adaptive filtering. In SIGIR, pages 81–88, 2002.
[28] X. Zhu, A. B. Goldberg, J. V. Gael, and D. Andrzejewski. Improving diversity in ranking using absorbing random walks. In HLT, pages 97–104, 2007.
[29] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In WWW, pages 22–32, 2005.