0809.4668v1

Upload: therm000

Post on 02-Jun-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/11/2019 0809.4668v1

    1/16

    a r X i v : 0 8 0 9 . 4 6 6 8 v 1 [ c s . I R ] 2 6 S e p 2 0 0 8

    Faceted Ranking of Egos in Collaborative Tagging Systems

    Jose I. Orlicki (1 , 2) , Pablo I. Fierens (2) and J. Ignacio Alvarez-Hamelin (2 , 3)

    (1) Core Security Technologies, Humboldt 1967 1 p, C1414CTU Buenos Aires, Argentina(2) ITBA, Av. Madero 399, C1106ACD Buenos Aires, Argentina

    (3) CONICET (Argentinian Council of Scientic and Technological Research)

    AbstractMultimedia uploaded content is tagged and recom-mended by users of collaborative systems, resultingin informal classications also known as folksonomies.Faceted web ranking has been proved a reasonablealternative to a single ranking which does not takeinto account a personalized context. In this paper weanalyze the online computation of rankings of usersassociated to facets made up of multiple tags. Possi-ble applications are user reputation evaluation (ego-ranking) and improvement of content quality in caseof retrieval. We propose a solution based on PageR-ank as centrality measure: (i) a ranking for each tagis computed offline on the basis of the correspond-ing tag-dependent subgraph; (ii) a faceted order isgenerated by merging rankings corresponding to allthe tags in the facet. The fundamental assumption,validated by empirical observations, is that step (i) isscalable. We also present algorithms for part (ii) hav-ing time complexity O(k), where k is the number of tags in the facet, well suited to online computation.

    1 IntroductionIn collaborative tagging systems, users assign key-words or tags to their uploaded content, or book-

    marks, in order to improve future navigation, lter-ing or searching (see, e.g., Marlow et al. [MNBD06]).These systems generate a categorization of contentcommonly known as a folksonomy .

    An example is the collaborative URL tagging sys-tem Delicious [Del], which was analyzed in depth

    by Golger and Huberman [GH06], discovering tem-poral stability in the relative proportions of tagswithin a given tagging subject. In this system In-ternet resources (URLs) are bookmarked and classi-ed with tags by users. Other two well-known col-laborative tagging systems for multimedia contentare YouTube [You] (videos) and Flickr [Fli] (photos),which are the focus of this paper.

    YouTube and Flickr differ from Delicious in thatthe resources are uploaded by users, so that all con-tents bookmarked as favorites are inside the system.That is, YouTube and Flickr can be considered closedsystems.

    Users can be ranked in relation to a tag or set of tags which we call a facet . Some applications of these

    faceted (i.e., tag-associated) rankings are: (i) search-ing for content through navigation of the best usersinside a tag-facet; (ii) measuring reputation of usersby listing their best rankings for different tags or tagsets.

    The order or ranking can be determined by acentrality measure, such as PageRank [ PBMW98,LM03], in a recommendation or subscription graph.Given a facet, a straightforward solution is to com-pute the centrality measure based on an appropri-ate facet-dependent subgraph of the recommenda-tion network. However, the online computation of the centrality measure is unfeasible because its high

    time complexity, even for small facets with two orthree tags. Moreover, the offline computation of thecentrality measure for each facet is also unfeasiblebecause the large number of possible facets. There-fore, alternative solutions must be looked for. A sim-ple solution is to use a general ranking computed

    1

    http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1http://arxiv.org/abs/0809.4668v1
  • 8/11/2019 0809.4668v1

    2/16

    offline, which is then ltered online for each facetquery. Using a single ranking of web pages or userswithin folksonomies has the disadvantage that thebest ranked ones are those having the highest central-ity in a global ranking, which is facet-independent. Inthe information retrieval case, this implies that thereturned results are ordered in a way that does nottake into account the focus on the searched topic.This problem is called topic drift [RD02].

    In this paper we propose a solution to the problemof topic drift in faceted rankings which is based onPageRank as centrality measure. Our approach fol-lows a two-step procedure: (i) a ranking for each tagis computed offline on the basis of the correspondingtag-dependent subgraph; (ii) a faceted order is gen-erated by merging rankings corresponding to all thetags in the facet.

    The fundamental assumption is that step (i) inthis procedure can be computed with an acceptableoverhead which depends on the size of the dataset.This hypothesis is validated by two empirical obser-vations. On one hand, in the studied recommenda-tion (tagged) graphs most of the tags are associatedto very small subgraphs, while only a small number of tags have large associated subgraphs (see Section 3).On the other hand, the mean number of tags per edgeis nite and small as explained in Section 4.8.

    The problem then becomes to nd a good and effi-cient algorithm to merge several rankings in step (ii).In Section 4, we present several alternatives. We con-centrate our effort on facets that correspond to thelogical conjunction of tags (match-all-tags -queries)because this is the most used logical combination ininformation retrieval (Christopher [Chr08], Chapter1).

    The rest of the paper is organized as follows. Wediscuss prior works and their limitations in Section 2.In Section 3 we explore two real examples of taggedgraphs. In particular, we analyze several importantcharacteristics of these graphs, such as the scale-free

    behavior of the vertex indegree and assortativenessof the embedded recommendation network (see Sec-tion 3.3). The proposed algorithms are introduced inSection 4, including an analysis of related scalabilityissues in Section 4.8. We discuss experimental resultsin Section 5 and we conclude with some nal remarks

    and possible directions of future work in Section 6.

    2 Related workTheory and implementation concepts used in thiswork for PageRank centrality are based on the com-prehensive survey of Langville and Meyer [LM03].This centrality measure for directed graphs is a vari-ation of eigenvector centrality which includes the no-tion of a random surfer, i.e., an imaginary surfer that,in arriving to a vertex with no out-links, jumps to arandomly chosen vertex. The PageRank algorithm isbased on the iterated multiplication of the adjacencymatrix of the directed graph (modied to add therandom surfer), and a vector representing the prob-ability that a surfer is in a particular vertex. Theiteration stops when each vector component does notchange more than a given error . Only a hundred of matrix multiplications are needed for = 10 6 andstandard parameters (see [LM03] for details).

    Basic topic-sensitive PageRank analysis was at-tempted biasing the general PageRank equation tospecial subsets of web pages by Al-Saffar and Heile-man [ASH07], and using a predened set of categoriesby Haveliwala [ Hav02] extracted from the Open Di-rectory Project [ODp]. Although encouraging results

    were obtained in both works, they suffer from thelimitation of a xed number of topics biasing therankings. Another variations of personalized PageR-ank were augmented with weights based on usageby Eirinaki and Vazirgiannis [ EV05] and on accesstime-length and frequency by Guo et al. [GRP07 ] byprevious users, they built a unique PageRank vectoradapted to usage but the result is not user dependentnor query dependent as we prefer.

    Hotho et al. [HJSS06] adapted PageRank to workon a tripartite graph of users, tags and resources cor-responding to a folksonomy. They also developeda form of topic-biasing on the modied PageRank,

    but the generation of a faceted ranking implies a newcomputation of the adapted PageRank algorithm onthe network for each new facet.

    There has also been some work done on facetedranking of web pages. For example, the approachof DeLong, Mane and Srivastava [DMS06] involves

    2

  • 8/11/2019 0809.4668v1

    3/16

    the construction of a larger multigraph using the hy-perlink graph with each vertex corresponding to apair webpage-concept and each edge to a hyperlinkassociated with a concept. Subgraph ideas are sug-gested by them: It might be faster to simply runPageRank on sub-graphs pertaining to each individ-ual concept (assuming there are a small number of concepts). Although DeLong et al. [DMS06] obtaingood ranking results for single-keyword facets, theydo not support multi-keyword queries.

    Query-dependent PageRank calculation was intro-duced in Richarson and Domingos [RD02] to extracta weighted probability per keyword for each web-

    page. These probabilities are summed up to gener-ate a query-dependent result. They also show thatthis faceted ranking has, for thousands of keywords,computation and storage requirements that are onlyapproximately 100-200 times greater than that of asingle query-independent PageRank. As we show inSection 4.8, our facet-dependent ranking algorithmshave similar time complexity.

    Scalability issues were also tackled by Jeh andWidom [ JW02] criticizing offline computation of mul-tiple PageRank vectors for each possible query andpreferring another more efficient dynamic program-ming algorithm for online calculation of the facetedrankings based on offline computation of basis vec-tors. They found that their algorithm scales wellwith the size of set H , the biasing page set, and theycriticize previous ideas in [ RD02]: [Richarson andDomingos] suggested that importance scores be pre-computed offline for every possible text query, butthe enormous number of possibilities makes this ap-proach difficult to scale.

    In this paper, we propose a different alternative tothe problem of faceted ranking. Instead of comput-

    ing offline the rankings corresponding to all possiblefacets, our solution requires only the offline compu-tation of a ranking per tag. A faceted ranking isgenerated by adequately merging the rankings of thecorresponding tags. Section 4 deals with different ap-proaches to the merging step.

    3 Construction of a tagged

    graphIn this section we introduce the basic denitions re-lated to tagged graphs and we present the networkanalysis of two real cases.

    3.1 Basic denitionsLet G = ( N,E,T ) be a simple 1 directed graph withtags on the edges, a tagged graph . N is the set of vertices {u 1 , . . . , u n }, E is the set of edges and T (e)is the set of tags {t 1 , . . . , t k e } associated with edge ein E . If e / E then T (e) := . We shall call a certain

    set of tags F e E T (e) a facet .Let M = {(u1 , m 1 , T 1 ), . . . , (u r , m r , T r )} be a setof tagged contents, where ui is the user, mi is thecontent and T i is the preferred set of tags includedby the user 2 , and let V = {(c1 , m 1 ), . . . , (c p , m

    p)} the

    set of favorite recommendations, where ci is a recom-mender user and m i is the recommended content

    3 ,then a tagged graph G = ( N,E,T ) is build, where

    N := {u i : i(u i , m i , T i ) M }{cj : j (c

    j , m

    j ) V },

    E := {(cj , u k ) : (cj , m

    j ) V (uk , m

    j , T k ) M },

    and

    T ((cj , u k )) := {T k : (cj , m

    j ) V

    (uk , m j , T k ) M (cj , u k ) E } .

    We show an example of the application of these def-initions in Figure 1.

    Given G = ( N,E,T ) and a tag t then G(t) :=(N , E , T ) is a tagged subgraph, where E = {e :e E t T (e)}, N = {a, b : (a, b) E } andT = {t T (e ) : e E }.

    1 Not a multigraph .2 In the rest of the paper user or vertex will be used indis-

    tinctly to mean a vertex in a tagged graph as an abstractionof the webpage where the user publishes his/her content, andedges or links will be used when referring to edges in a taggedgraph build using favorite recommendations.

    3 Each content is considered unique, i.e., different users donot upload the same content.

    3

  • 8/11/2019 0809.4668v1

    4/16

    M 0 = {(A, song1, {blues }) V 0 = {(A, song2)(B, song2, {blues,jazz }) (B, song4)

    (C, song3, {blues }) (B, song5)(C, song4, { jazz }) (A, song3)(D, song5, {blues }) (A, song4)(D, song6, {rock }) } (C, song6) }

    Ablues,jazz

    blues,jazz

    B

    jazz

    blues D

    C

    rock

    Figure 1: Example of construction of a tagged graphfrom a set of contents M 0 and a set of recommenda-tions V 0 .

    If G1 = ( N 1 , E 1 , T 1 ) and G2 = ( N 2 , E 2 , T 2 ) aregraphs then G1 G2 := ( N , E , T ) where E = E 1 E 2 , N = {a, b : (a, b) E } and T (e) := T 1 (e) T 2 (e). Also G1 G2 := ( N , E , T ) where E = E 1 E 2 , N = {a, b : (a, b) E } and T (e) := T 1 (e) T 2 (e).

    The conjunction and disjunction graphs can be de-ned as G(t 1 . . . tk 1 tk ) := G(t 1 . . . tk 1 ) G(tk ) (see Figure 6(c)) and G(t 1 . . . tk 1 tk ) :=G(t 1 . . . tk 1 ) G(tk ) (see Figure 6(d) ) .

    The number of edges of a graph G is denoted|E (G)| and the number of vertices in a graph is de-noted by |N (G)|.

    3.2 Two real cases: YouTube andFlickr

    In this section, we present two examples of collabo-rative tagging systems where content is tagged andrecommendations are made. These systems actually

    rank content according to the number of visits, rec-ommendations or relevance of the text accompanyingthe content. However, to our knowledge, no use of graph-based faceted ranking is made.

    The taxonomy of tagging systems in Marlow et al. [MNBD06] allows us to classify YouTube [You]

    and Flickr [Fli] in the following ways:

    regarding the tagging rights, both are self-tagging systems;

    regarding the aggregation model, they are set systems;

    regarding the object-type, they are called non-textual systems;

    regarding source of material, they are classiedas user-contributed ;

    nally, regarding tagging support, whileYouTube can be classied as a suggested tagging

    system, Flickr must be considered a blind tagging system.

    In our rst example the content is multimediain the form of favorite videos recommended byusers. The information was collected from the serviceYouTube [You] using the public API crawling 185852edges and 51490 vertices in Breadth-First Search(BFS) order starting from the popular user jcl5m that had videos included in the top twenty top ratedvideos during April 2008. From this information andfollowing the the denitions in Section 3.1, we con-structed a complete tagged graph G and several sam-

    ple subgraphs such as G(music funny ), G(music ),G(funny ) and G(music funny ) (other subgraphspresent a similar behavior). Table 1 presents thenumber of vertices and edges of each of these net-works. We must note that mandatory categoricaltags such as Entertainment , Sports or Music , al-ways capitalized, were removed in order to includeonly tags inserted by users.

    Graph vertices edgesG 51,490 185,852G(music funny ) 18,368 26,388G(music ) 12,849 10,273G(funny ) 8,734 13,392G(music funny ) 1,406 1,147

    Table 1: Sizes of the video tagged graph and some of its subgraphs.

    4

  • 8/11/2019 0809.4668v1

    5/16

    In our second example the content are photosand the recommendations are in the form of fa-vorite photos 4 . The information was collected fromthe service Flickr [Fli] using the public API crawl-ing 229709 edges and 35210 vertices in BFS or-der starting from the popular user junku-newcleus .The complete tagged graph G and the sample sub-graphs G(blue flower ), G(blue), G(flower ) andG(blue flower ) were constructed. The number of vertices and edges of these graphs are shown in Ta-ble 2.

    Graph vertices edgesG 35,210 229,709G(blue flower ) 12,921 20,105G(blue) 10,241 11,703G(flower ) 7,032 9,566G(blue flower ) 1,551 1,164

    Table 2: Sizes of the photo tagged graph and someof its subgraphs.

    3.3 Network analysisGraph analysis was made using the tool NetworkWorkbench [ N06], except for the calculation of PageRank. Figures 2, 3 and 4 show vertex indegreedistribution, vertex outdegree distribution and cor-relation of indegree of in-neighbors with indegree of vertices for the YouTube and Flickr networks. Allgraph-analytical parameters, except those for smallsubgraphs like G(music funny ) were binned andplotted in log-log curves. This is the reason why somedegree points appear below zero and one ( x-axis), be-cause there exist vertices with either indegree or out-degree equal to zero.

    Vertex indegree, in both video and photo net-works, is characterized by a power-law distribution:

    P (k) k

    , where 2 < < 3 (see Figure 2). Ran-dom variables modelled by this type of heavy-taileddistributions have a nite mean, but innite secondand higher non-central moments. Furthermore, there

    4 Only the rst fty favorites photos of each user were re-trieved.

    1e-08

    1e-07

    1e-06

    1e-05

    1e-04

    0.001

    0.01

    0.1

    1

    0.1 1 10 100 1000 10000

    P ( v e r

    t e x i n d e g r e e )

    vertex indegree

    YouTubemusicfunny

    music AND funnymusic OR funny

    a * x -2.41

    1e-08

    1e-07

    1e-06

    1e-05

    1e-04

    0.001

    0.01

    0.1

    1

    0.1 1 10 100 1000

    P ( v e r

    t e x

    i n d e g r e e )

    vertex indegree

    Flickrflower

    blueblue AND flower

    blue OR flowera*x -2.2

    Figure 2: Binned indegree distribution

    5

  • 8/11/2019 0809.4668v1

    6/16

    1e-05

    1e-04

    0.001

    0.01

    0.1

    1

    0.01 0.1 1 10 100

    P ( v e r

    t e x o u

    t d e g r e e )

    vertex outdegree

    YouTubemusicfunny

    music AND funnymusic OR funny

    1e-05

    1e-04

    0.001

    0.01

    0.1

    1

    0.01 0.1 1 10 100

    P ( v e r

    t e x o u t d e g r e e )

    vertex outdegree

    Flickrflower

    blueblue AND flower

    blue OR flower

    Figure 3: Binned outdegree distribution

    0.01

    0.1

    1

    10

    0.1 1 10 100 1000 10000

    i n d e g r e e o

    f i n - n e i g

    h b o r s

    vertex indegree

    YouTubemusicfunny

    music AND funnymusic OR funny

    0.1

    1

    10

    0.1 1 10 100 1000

    i n d e g r e e o

    f i n - n e i g

    h b o r s

    vertex indegree

    Flickrflower

    blueblue AND flower

    blue OR flower

    Figure 4: Binned correlation of indegree of in-neighbors with indegree

    6

  • 8/11/2019 0809.4668v1

    7/16

    1e-05

    1e-04

    0.001

    0.01

    0.1

    1

    1e-05 1e-04 0.001 0.01 0.1

    P ( v e r

    t e x

    P a g e R a n

    k )

    vertex PageRank

    (complete graph)musicfunny

    music AND funnymusic OR funny

    1e-05

    1e-04

    0.001

    0.01

    0.1

    1

    1e-05 1e-04 0.001 0.01 0.1

    P ( v e r

    t e x

    P a g e R a n

    k )

    vertex PageRank

    (complete graph)flower

    blueblue AND flower

    blue OR flower

    Figure 5: Binned Vertex PageRank distribution forYouTube (top) and Flickr (bottom)

    is a non-vanishing probability of nding a vertex withan arbitrary high indegree. Clearly, in any real-world network, the total number of vertices is a nat-ural upper-bound to the greatest possible indegree.However, experience with Internet related networksshows that the power-law distribution of the indegreedoes not change signicantly as the network growsand, hence, the probability of nding a vertex withan arbitrary degree eventually becomes non-zero (formore details see, e.g., Pastor-Satorras and Vespig-nani [PSV04]).

    Since recommendation lists are made by individualusers, vertex outdegree does not show the same kindof scale-free behavior than vertex indegree. On thecontrary, each user recommends only 20 to 30 otherusers on average (see Figure 3). Moreover, since ver-tex outdegree is mostly controlled by human users,we do not expect its average to change signicantlyas the network grows.

    The correlation of indegree of in-neighbors withvertex indegree (see Figure 4) indicates the exis-tence of assortative (positive slope) or disassorta-tive behavior (negative slope). Assortativeness iscommonly observed in social networks, where peo-ple with many connections relates to people which isalso well-connected. Disassortativeness is more com-

    mon in other kinds of networks, such as information,technological and biological networks (see, e.g., New-man [New02]). In the favorite videos network there isno clear correlation (small or no slope), but the photonetwork there is a slight assortativeness indicating abiased preference of vertices with high indegree forvertices with high indegree (see Figure 4).

    We also computed the PageRank of the samplegraphs, removing dangling vertices with indegree 1and out degree 0, because most of them correspondto vertices which have not been expanded by thecrawler (BFS), having the lowest PageRank (a simi-

    lar approach is taken in [ PBMW98] ). Figure 5 showsthat PageRank distributions are also scale-free, i.e.,they can be approximated by power law distributions.Note that the power law exponents are very similarfor the complete tagged graph and subgraphs, on eachnetwork.

    7

  • 8/11/2019 0809.4668v1

    8/16

    4 Faceted Ranking on Tagged

    GraphsGiven a set M of tagged content, a set V of favoriterecommendations and a tag set or facet F , the faceted ranking problem consists in nding the ranking of users according to facet F .

    In this section we present six different approachesto the faceted ranking problem using tagged graphs.The rst two algorithms ( E -intersection and E -union/ N -intersection in Sections 4.2 and 4.3 respec-tively) are not scalable for online queries becausetheir computation requires the extraction of a sub-graph which might be very large in a large network 5

    and the calculation of the corresponding PageRankvector. Moreover, the offline computation of thoserankings for each possible facet F e E T (e) is alsounfeasible because the large number of such facets.However, they serve as a basis of comparison for theother four online algorithms because they are a goodapproximation to the desired result.

    We should note that the focus of this paper is onconjunction-based queries in which all words must bematched, as opposed to disjunction-based ones wherethe match of any word is sufficient. Conjunction-based queries are the most common type of booleanqueries [Chr08 ].

    Before presenting faceted ranking algorithms, weneed some preliminary denitions related to vertexcentrality which are given in the following section.

    4.1 Vertex CentralityGiven a graph G = ( N, E ), C (G) : N R is avertex centrality function and R(C (G)) : N N is avertex ranking function associates a complete ordersuch that to the highest centrality vertex of C (G)corresponds the number one, the second highest hasnumber two and so on. PageRank C (G) is a vertex

    centrality function associating probabilities accordingto a random surfer traversing the graph G [LM03].The vertex ranking function R(G) := R(C (G)) will

    5 We have observed that as the network grows the relativefrequency of tags usage converges. Similar behavior was ob-served for particular resources by [GH06 ].

    be our default ranking for graphs.

    Ablues,jazz

    blues,jazz

    B

    blues D

    C (a)

    Ablues,jazz

    blues,jazz

    B

    jazz

    C

    (b)

    Ablues,jazz

    blues,jazz

    B

    C (c)

    Ablues,jazz

    blues,jazz

    B

    jazz

    blues D

    C (d)

    Figure 6: Example subgraphs of a tagged graph: (a)G(blues); (b) G( jazz ); (c) G(blues jazz ); (d)G(blues jazz ).

    4.2 E -intersection

    Given a set of tags, a ranking may be calculatedby computing the centrality measure of the sub-graph corresponding to the recommendation edges

    which include all the tags. This approach, calledE -intersection, cannot be implemented for onlinequeries, as explained above, but serves as a reason-able standard of comparison because we use the exactinformation available for the PageRank in a conjunc-tive query.

    8

  • 8/11/2019 0809.4668v1

    9/16

    The E -intersection ranking for tagged graph G ac-cording to facet F = {t 1 , . . . , t k } is

    R (G(t 1 . . . tk )) .

    As an example see Figure 6(c). Assuming a pre-viously built inverted index (Christopher [Chr08],Chapter 1) for the tagged graph mapping tagsinto sets of edges, the complexity can be decom-posed on the retrieval time for each subgraph,which takes proportional to

    ki =1 |E (G(t i )) | , and

    the time of PageRank and sort algorithms, tak-ing O(m log m), where m = |E (G(t 1 . . . tk )) | . Then, the total time complexity for this al-gorithm is O (k | E (G(t i )) |max + m log m), where|E (G(t i )) |max is computed for the largest subgraphG(t i ).

    4.3 E -union/ N -intersectionConsider the example given in Figure 1 under thequery blues rock . According to the E -intersectionalgorithm, there is no node in the network satisfy-ing the query. However, it may seem reasonable toreturn node D as a response to such search. In or-der to take into account this case, we devised an-other algorithm called E -union/ N -intersection. Inthis case, the union of all edge recommendations pertag is used when computing the PageRank, but onlythose vertices involved in recommendations for alltags are kept. The latter ltering is included becausewe want vertices recommended for each of the tagsin the facet.

    The E -union/ N -intersection ranking for vertexn in a tagged graph G according to facet F ={t 1 , . . . , t k } is

    R(C (G(t 1 . . . tk )))( n),

    where C is restricted to vertices in vertex intersec-tion N (G(t 1 )) . . . N (G(tk )), the other vertices

    having centrality 0. Note that, in general, there aremore vertices in N (G(t 1 )) . . . N (G(tk )) than inN (G(t 1 ) . . . G(tk )).

    The time complexity of this algorithm is propor-tional to ki =1 |E (G(t i )) | + m log m, where m =|E (G(t 1 . . .tk )) | . Then, the total time complexity

    for this algorithm is O (k | E (G(t i )) |max + m log m),where |E (G(t i )) |max is computed for the largest sub-graph G(t i ).

    4.4 Single rankingA simple online faceted ranking consists of a mono-lithic ranking, without considering the facet, which isthen ltered to exclude those vertices that are not re-lated to all tags in the facet. That is, one ranks by themonolithic global rank of the complete tagged graphand the only vertices remaining for facet {t 1 , . . . , t k }are the ones in

    N G(t 1 ) . . . N G(tk ) .

    Assuming a precomputed inverted index, mappingtags into nodes, the complexity of this algorithm isO(k| N (G(t i )) |max ), where k is the number of differ-ent tags in the facet, and |N (G(t i )) |max is computedfor the biggest subgraph G(t i ). It is also possible toretrieve a (small) constant number of top elements tointersect, yielding a time complexity of O(k).

    4.5 P R -productIn order to approximate efficiently the edge intersec-tion we can precompute individual rankings for eachtag and then combine them by element-wise multipli-cation. This approximation is inspired on the proba-bility product of independent events.

    If G1 and G2 are subgraphs of graph G we denePageRank ranking product as

    R (G1 ) R (G2 ) := R(C (G1 ) C (G2 )) ,

    where (C (G1 ) C (G2 ))( n) := C (G1 )(n) C (G2 )(n) (realproduct). The P R -product ranking for tagged graphG according to facet F = {t 1 , . . . , t k } is

    k

    i =0R (G(t i )) .

    Assuming the individual rankings for all tags havebeen computed, the complexity of this algorithm isO(k | N (G(t i )) |max ). Here, it is also possible to re-duce the time complexity to O(k) taking a (small)constant number of top elements to make the prod-uct.

    9

  • 8/11/2019 0809.4668v1

    10/16

    4.6 R -sum

    Consider a recommendation graph G larger than thatin Figure 1 and the query blues jazz . Assume thatthe PageRank of the top three nodes in the rankingscorresponding the subgraphs G(blues) and G( jazz )are as given in Table 3. Ignoring other nodes, theranking given by the P R-product rule is a, b andc. However, it may be argued that node b shows abetter equilibrium of PageRank values than node a .Intuitively, one may feel inclined to rank b over agiven the values in the table. In order to follow thisintuition, we devised the R-sum algorithm which isalso intended to avoid topic drift inside the queriedfacet, that is, any tag prevailing over the others.

    The R-sum ranking for a tagged graph G accordingto facet F = {t 1 , . . . , t k } is

    k

    i =0R (G(t i )) ,

    where we dene PageRank ranking sum as

    R (G1 ) + R (G2 ) := R( (R (G1 ) + R (G2 ))) .

    Notice that in this sum we are using as centrality thesum of ranking positions in a reverse order, and ac-

    cording to the R-sum algorithm, the ranking of nodesin the example of Table 3 is b, a and c.The complexity of this algorithm is similar to that

    of P R-product.

    Node C(G (blues )) C(G ( jazz )) P R -pr. R -suma 0.75 0.04 0.03 4b 0.1 0.1 0.01 3c 0.01 0.05 0.005 6

    Table 3: Comparison of P R -product and R-sum inan example.

    4.7 -N -intersection

    In this case, edge intersection is computed involvingonly vertices (and associated edges) that are on the

    top w positions of the individual rankings. The -N -intersection ranking for tagged graph G according tofacet F = {t 1 , . . . , t k } is

    R k

    i =0G( (R (G(t i )) , w))( t i ) ,

    where (R (G), w) is the set of vertices includingthe top w vertices ranked using PageRank andG({a ,b , . . . }) is the maximal subgraph of G includ-ing vertices {a ,b , . . . } and edges connecting them. Inother words, this algorithm has the following steps:(i) for each t i , the subgraph G(t i ) is constructed; (ii)a ranking of users is computed on the basis of the

    PageRank of G(t i ); (iii) the w winners of each ti -associated ranking are extracted; (iv) given a facetF = {t 1 , , tk }, a new subgraph including onlythe winners for tag ti is constructed; (v) a facet-associated ranking is constructed based on the newgraph. Steps (i)-(iii) are computed offline. In thispresentation, we have xed the number of top itemsselected at ve hundred ( w = 500).

    Assuming the individual rankings for k tags hasbeen computed, the complexity of this algorithm isO(k).

    4.8 Scalability AnalysisAs noticed by Langville and Meyer [LM03], the num-ber of iterations of PageRank is xed when both thetolerated error and other parameters are xed, yield-ing one hundred for = 10 6 (see Section 2). Aseach iteration consists of the sparse adjacency ma-trix multiplication, the time complexity of PageRankis linear on the number of edges of the graph. In ourcase, given a tagged graph G0 = ( N 0 , E 0 , T 0 ), for eachtag there is a corresponding subgraph with a knownsize. Then the total temporal and spatial complexityof the faceted PageRank for all individual tags is

    t T 0

    |E (G0 (t)) | =e E 0

    |T 0 (e)| ,

    where T 0 := e E 0 T 0 (e), the complete set of tags.Therefore, if the average number of tags per edge

    is constant or grows very slowly as the graph grows,

    10

  • 8/11/2019 0809.4668v1

    11/16

    then the algorithms in Sections 4.5, 4.6 and 4.7 arescalable, linear on the number of edges of the com-plete tagged graph. This can be veried empiricallyon Figure 7, showing that distribution of tags peredges falls quickly, having a mean of 9 .26 tags peredge for the YouTube tagged graph and 13 .37 forthe Flickr tagged graph. These are not heavy-taileddistributions and, since tags are manually added toeach uploaded content, we do not expect the aver-age number of tags per recommendation to increasesignicantly with network growth.

    0.1

    1

    10

    100

    1000

    10000

    100000

    1 10 100 1000

    # e

    d g e s

    # tags

    YouTubeFlickr

    Figure 7: The distribution of number of tags peredge.

    In our experiments the computation of all thefaceted singleton tag rankings (104 , 927 tags) forthe video network sample took 211 .4 times moretime than the single ranking for the completetagged graph. Meanwhile the photo network sample(283, 093 tags) took 1744 .9 times more time.

    Our merging algorithms work in real-time becausethey use only the top w results, where w is a smallxed number like 500 or 1000. Choosing an appropri-ate w for an application 6 will enable it to store onlythe w top elements of each single-tag facet.

    5 Experimental resultsIn this section, we compare the behavior of thealgorithms presented in Section 4. As a basis of

    6 How to choose a good w is beyond the scope of this paper.

    comparison we use two algorithms whose onlinecomputation is unfeasible, but which are intu-itively reasonable: E -intersection (Section 4.2)and E -union/ N -intersection (Section 4.3). Inorder to quantify the distance between the re-sults given by two different algorithms, we usetwo ranking similarity measures, OSim [Hav02]and KSim [Ken38, Hav02]. The rst measure,OSim (r 1 , r 2 ) indicates the degree of overlap betweenthe top n elements of rankings r 1 and r 2 . We denethe overlap of two sets A and B (each of size n) tobe |A B |/n . The second measure, KSim (r 1 , r 2 ) =|(u, v ) : r 1 , r 2 same order (u, v ), u = v| |U |(|U | 1)where U in the union of all elements in rankingsr 1 and r2 , r 1 is r1 extended with U r 2 and r2 isextended analogously to obtain r 2 . This measure isa variant of Kendalls distance that considers therelative orderings, i.e., counts how many inversionsare in a determined top set. In both cases, valuescloser to 0 mean that the results are not similar andcloser to 1 mean the opposite.

    5.1 Favorite videos network

    Samples include all facets of tag pairs {t j , t k } ex-tracted from the 99 most used tags of the network 7 .That is, 4851 tag pairs compared with their simi-larities averaged. For each tag pair the proposedmerging algorithms ( P R -product, R-sum and -N -intersection) were compared with the reference algo-rithms ( E -intersection and E -union/ N -intersection)using OSim and KSim to measure the rankings sim-ilarity. Some of the tags are: music, funny, comedy,live, guitar, rock, super, dance, animation, parody,song, mario, game, new, tv, pop, john, love, world.

    Table 4 presents a summary of the comparisons forthe favorite videos network, where we display aver-aged similarities for different top sizes of ranked users.Figures 8 and 9 also show a more detailed summary of results for the OSim metric (because it discriminatesdifferent situations better than KSim ). The x-axiscorresponds to the number of vertices resulting fromthe basis of comparison algorithm ( E -intersection or

    7 Some tags like you , video or youtube which give no infor-mation were removed from the experiment.

    11

  • 8/11/2019 0809.4668v1

    12/16

  • 8/11/2019 0809.4668v1

    13/16

    6 Summary

    We have proposed different algorithms for mergingfaceted-rankings of users in collaborative tagging sys-tems which gave results comparable to those of tworeasonable standards. We have also analyzed thescalability of this approach.

    A prototypic application called Egg-O-Matic isavailable online [ EOM ] including ranking merging R-sum to approximate the E -intersection ranking, in amode called all tags, same content , and includingthe ranking merging we called P R -product to approx-imate the E -union/ N -intersection ranking, in a modecalled all tags, any content .

    Another step that can be taken to reduce tag-dimensionality is clustering to agglomerate them.This work also opens the path for a more complexcomparison of reputations, for example by integrat-ing the best positions of a user even if the tags in-volved are not related ( disjunctive queries) in orderto summarize the relevance of a user generating con-tent on the web. It is also possible to extend thealgorithms in Section 4 to merge of rankings gen-erated from different systems ( cross-system ranking )looking to obtain a ranking of users using multiplecollaborative tagging systems.

    References[ASH07] Sinan Al-Saffar and Gregory Heileman.

    Experimental Bounds on the Useful-ness of Personalized and Topic-SensitivePageRank. In WI 07: Proceedings of the IEEE/WIC/ACM International Confer-ence on Web Intelligence , pages 671675, Washington, DC, USA, 2007. IEEEComputer Society.

    [Chr08] Christopher. Introduction to Informa-tion Retrieval . Cambridge UniversityPress, 2008.

    [Del] Delicious. http://del.icio.us/.

    [DMS06] Colin DeLong, Sandeep Mane, andJaideep Srivastava. Concept-Aware

    Ranking: Teaching an Old Graph NewMoves. icdmw , 0:8088, 2006.

    [EOM] Egg-O-Matic. http://egg-o-matic.itba.edu.ar/.

    [EV05] Magdalini Eirinaki and Michalis Vazir-giannis. Usage-Based PageRank for WebPersonalization. In ICDM 05: Proc. of the Fifth IEEE International Conference on Data Mining , pages 130137, Wash-ington, DC, USA, 2005. IEEE ComputerSociety.

    [Fli] Flickr. http://www.ickr.com/.

    [GH06] Scott Golder and Bernardo A. Huber-man. Usage patterns of collaborativetagging systems. Journal of Information Science , 32(2):198208, April 2006.

    [GRP07] Yong Zhen Guo, Kotagiri Ramamoha-narao, and Laurence A. F. Park. Per-sonalized PageRank for Web Page Pre-diction Based on Access Time-Lengthand Frequency. In WI 07: Proc. of the IEEE/WIC/ACM International Confer-ence on Web Intelligence , pages 687690, Washington, DC, USA, 2007. IEEEComputer Society.

    [Hav02] Taher H. Haveliwala. Topic-sensitivePageRank. In Proc. of the Eleventh International World Wide Web Confer-ence , Honolulu, Hawaii, May 2002.

    [HJSS06] Andreas Hotho, Robert J aschke,Christoph Schmitz, and Gerd Stumme.Information Retrieval in Folksonomies:Search and Ranking. In Proc. of the 3rd European Semantic Web Conference ,volume 4011 of LNCS , pages 411426, Budva, Montenegro, June 2006.Springer.

    [JW02] Glen Jeh and Jennifer Widom. Scalingpersonalized web search. Technical re-port, Stanford University, 2002.

    13

  • 8/11/2019 0809.4668v1

    14/16

    [Ken38] M. G. Kendall. A New Measure of RankCorrelation. Biometrika , 30(1/2):8193,June 1938.

    [LM03] Amy Nicole Langville and Carl DeanMeyer. Survey: Deeper Inside PageR-ank. Internet Mathematics , 1(3), 2003.

    [MNBD06] Cameron Marlow, Mor Naaman, DanahBoyd, and Marc Davis. HT06, taggingpaper, taxonomy, Flickr, academic arti-cle, to read. In HYPERTEXT 06: Proc.of the seventeenth conference on Hyper-text and hypermedia , pages 3140, NewYork, NY, USA, 2006. ACM Press.

    [N06] N. W. B. Team, Network WorkbenchTool, 2006.

    [New02] M. E. J. Newman. Assortative Mix-ing in Networks. Phys. Rev. Lett. ,89(20):208701, Oct 2002.

    [ODp] Open-Directory-project. NetscapeCommunication Corporationhttp://www.dmoz.org/.

    [PBMW98] Lawrence Page, Sergey Brin, RajeevMotwani, and Terry Winograd. The

    PageRank Citation Ranking: BringingOrder to the Web. Technical report,Stanford Digital Library TechnologiesProject, 1998.

    [PSV04] R Pastor-Satorras and A Vespignani.Evolution and structure of the Internet:A statistical physics approach . Cam-bridge University Press, 2004.

    [RD02] Mathew Richardson and Pedro Domin-gos. The Intelligent Surfer: Probabilis-tic Combination of Link and Content In-formation in PageRank. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.

    [You] YouTube. http://www.youtube.com/.

    14

  • 8/11/2019 0809.4668v1

    15/16

    5 13 46 175 674 2618

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eintersection vs. Single

    Intersection Size

    T o p

    #

    5 13 46 175 674 2618

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eintersection vs. PRproduct

    Intersection Size

    T o p

    #

    5 13 46 175 674 2618

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eintersection vs. Rsum

    Intersection Size

    T o p

    #

    5 13 46 175 674 2618

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eintersection vs. tauNintersection

    Intersection Size

    T o p

    #

    Figure 8: Videos network: Average similarity ( OSim ) to E -intersection

    21 30 63 196 718 2763

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eunion/Nintersection vs. Single

    Intersection Size

    T o p

    #

    21 30 63 196 718 2763

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eunion/Nintersection vs. PRproduct

    Intersection Size

    T o p

    #

    21 30 63 196 718 2763

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eunion/Nintersection vs. Rsum

    Intersection Size

    T o p

    #

    21 30 63 196 718 2763

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eunion/Nintersection vs. tauNintersection

    Intersection Size

    T o p

    #

    Figure 9: Videos network: Average similarity ( OSim ) to E -union/ N -intersection

    15

  • 8/11/2019 0809.4668v1

    16/16

    5 13 42 153 569 2130 7996

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eintersection vs. Single

    Intersection Size

    T o p

    #

    5 13 42 153 569 2130 7996

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eintersection vs. PRproduct

    Intersection Size

    T o p

    #

    5 13 42 153 569 2130 7996

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eintersection vs. Rsum

    Intersection Size

    T o p

    #

    5 13 42 153 569 2130 7996

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eintersection vs. tauNintersection

    Intersection Size

    T o p

    #

    Figure 10: Photos network: Average similarity to E -intersection

    92 100 132 252 713 2482 9261

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eunion/Nintersection vs. Single

    Intersection Size

    T o p

    #

    92 100 132 252 713 2482 9261

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eunion/Nintersection vs. PRproduct

    Intersection Size

    T o p

    #

    92 100 132 252 713 2482 9261

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eunion/Nintersection vs. Rsum

    Intersection Size

    T o p

    #

    92 100 132 252 713 2482 9261

    2

    4

    8

    16

    32

    64

    128

    256

    512

    Eunion/Nintersection vs. tauNintersection

    Intersection Size

    T o p

    #

    Figure 11: Photos network: Average similarity to E -union/ N -intersection

    16