exploring hierarchies in online social networks1502.04220v1 [cs.si] 14 feb 2015 exploring...

15
arXiv:1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li , Hao Wei The Chinese University of Hong Kong, Hong Kong, China * Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, China {lucan,yu,rhli,hwei}@se.cuhk.edu.hk ABSTRACT Social hierarchy (i.e., pyramid structure of societies) is a funda- mental concept in sociology and social network analysis. The im- portance of social hierarchy in a social network is that the topo- logical structure of the social hierarchy is essential in both shaping the nature of social interactions between individuals and unfolding the structure of the social networks. The social hierarchy found in a social network can be utilized to improve the accuracy of link prediction, provide better query results, rank web pages, and study information flow and spread in complex networks. In this paper, we model a social network as a directed graph G, and consider the social hierarchy as DAG (directed acyclic graph) of G, denoted as GD. By DAG, all the vertices in G can be partitioned into differ- ent levels, the vertices at the same level represent a disjoint group in the social hierarchy, and all the edges in DAG follow one direc- tion. The main issue we study in this paper is how to find DAG GD in G. The approach we take is to find GD by removing all possible cycles from G such that G = U (G) GD where U (G) is a maximum Eulerian subgraph which contains all possible cy- cles. We give the reasons for doing so, investigate the properties of GD found, and discuss the applications. In addition, we develop a novel two-phase algorithm, called Greedy-&-Refine, which greed- ily computes an Eulerian subgraph and then refines this greedy so- lution to find the maximum Eulerian subgraph. We give a bound between the greedy solution and the optimal. The quality of our greedy approach is high. We conduct comprehensive experimental studies over 14 real-world datasets. The results show that our algo- rithms are at least two orders of magnitude faster than the baseline algorithm. 1. INTRODUCTION Social hierarchy refers to the pyramid structure of societies, with minority on the top and majority at the bottom, which is a prevalent and universal feature in organizations. Social hierarchy is also rec- ognized as a fundamental characteristic of social interactions, being well studied in both sociology and psychology [11]. In recent years, social hierarchy has attracted considerable attention and generates profound and lasting influence in various fields, especially social Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. networks. This is because the hierarchical structure of a popula- tion is essential in shaping the nature of social interactions between individuals and unfolding the structure of underlying social net- works. Gould in [11] develops a formal theoretical model to model the emergence of social hierarchy, which can accurately predict the network structure. By the social status theory in [11], individu- als with low status typically follow individuals with high status. Clauset et al. in [8] develop a technique to infer hierarchical struc- ture of a social network based on the degree of relatedness between individuals. They show that the hierarchical structure can explain and reproduce some commonly observed topological properties of networks and can also be utilized to predict missing links in net- works. Assuming that underlying hierarchy is the primary factor guiding social interactions, Maiya and Berger-Wolf in [22] infer social hierarchy from undirected weighted social networks based on maximum likelihood. All these studies imply that social hier- archy is a primary organizing principle of social networks, capable of shedding light on many phenomena. In addition, social hierar- chy is also used in many aspects of social network analysis and data mining. For instance, social hierarchy can be utilized to im- prove the accuracy of link prediction [21], provide better query re- sults [15], rank web pages [12], and study information flow and spread in complex networks [1, 2]. In this paper, we focus on social networks that can be modeled by directed graphs, because in many social networks (e.g., Google+, Weibo, Twitter) information flow and influence propagate follow certain directions from vertices to vertices. Given a social network as a directed graph G, its social hierarchy can be represented as a directed acyclic graph (DAG). By DAG, all the vertices in G are partitioned into different levels (disjoint groups), and all the edges in the cycle-free DAG follow one direction, as observed in social networks that prestige users at high levels are followed by users at low levels and the prestige users typically do not follow their followers. Here, a level in DAG represents the status of a vertex in the hierarchy the DAG represents. The issue we study in this paper is how to find hierarchy as a DAG in a general directed graph G which represents a social net- work. Given a graph G, there are many possible ways to obtain a DAG. First, converting graph G into a DAG, by contracting all ver- tices in a strongly connected component in G as a vertex in DAG, does not serve the purpose, because all vertices in a strongly con- nected component do not necessarily belong to the same level in a hierarchy. Second, a random DAG does not serve the purpose, be- cause it heavily relies on the way to select the vertices as the start to traverse and the way to traverse. Therefore, two random DAGs can be significantly different topologically. Third, finding the maxi- mum DAG of G is not only NP-hard but also NP-approximate [14]. The way we do is to find the DAG by removing all possible cycles 1

Upload: nguyenhuong

Post on 31-Jan-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

arX

iv:1

502.

0422

0v1

[cs.

SI]

14

Feb

201

5

Exploring Hierarchies in Online Social Networks

Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao WeiThe Chinese University of Hong Kong, Hong Kong, China

∗Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, China

{lucan,yu,rhli,hwei}@se.cuhk.edu.hk

ABSTRACTSocial hierarchy (i.e., pyramid structure of societies) isa funda-mental concept in sociology and social network analysis. The im-portance of social hierarchy in a social network is that the topo-logical structure of the social hierarchy is essential in both shapingthe nature of social interactions between individuals and unfoldingthe structure of the social networks. The social hierarchy found ina social network can be utilized to improve the accuracy of linkprediction, provide better query results, rank web pages, and studyinformation flow and spread in complex networks. In this paper,we model a social network as a directed graphG, and consider thesocial hierarchy as DAG (directed acyclic graph) ofG, denoted asGD . By DAG, all the vertices inG can be partitioned into differ-ent levels, the vertices at the same level represent a disjoint groupin the social hierarchy, and all the edges in DAG follow one direc-tion. The main issue we study in this paper is how to find DAGGD in G. The approach we take is to findGD by removing allpossible cycles fromG such thatG = U(G) ∪ GD whereU(G)is a maximum Eulerian subgraph which contains all possible cy-cles. We give the reasons for doing so, investigate the properties ofGD found, and discuss the applications. In addition, we develop anovel two-phase algorithm, called Greedy-&-Refine, which greed-ily computes an Eulerian subgraph and then refines this greedy so-lution to find the maximum Eulerian subgraph. We give a boundbetween the greedy solution and the optimal. The quality of ourgreedy approach is high. We conduct comprehensive experimentalstudies over 14 real-world datasets. The results show that our algo-rithms are at least two orders of magnitude faster than the baselinealgorithm.

1. INTRODUCTIONSocial hierarchy refers to the pyramid structure of societies, with

minority on the top and majority at the bottom, which is a prevalentand universal feature in organizations. Social hierarchy is also rec-ognized as a fundamental characteristic of social interactions, beingwell studied in both sociology and psychology [11]. In recent years,social hierarchy has attracted considerable attention andgeneratesprofound and lasting influence in various fields, especiallysocial

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

networks. This is because the hierarchical structure of a popula-tion is essential in shaping the nature of social interactions betweenindividuals and unfolding the structure of underlying social net-works. Gould in [11] develops a formal theoretical model to modelthe emergence of social hierarchy, which can accurately predict thenetwork structure. By the social status theory in [11], individu-als with low status typically follow individuals with high status.Clauset et al. in [8] develop a technique to infer hierarchical struc-ture of a social network based on the degree of relatedness betweenindividuals. They show that the hierarchical structure canexplainand reproduce some commonly observed topological properties ofnetworks and can also be utilized to predict missing links innet-works. Assuming that underlying hierarchy is the primary factorguiding social interactions, Maiya and Berger-Wolf in [22]infersocial hierarchy from undirected weighted social networksbasedon maximum likelihood. All these studies imply that social hier-archy is a primary organizing principle of social networks,capableof shedding light on many phenomena. In addition, social hierar-chy is also used in many aspects of social network analysis anddata mining. For instance, social hierarchy can be utilizedto im-prove the accuracy of link prediction [21], provide better query re-sults [15], rank web pages [12], and study information flow andspread in complex networks [1, 2].

In this paper, we focus on social networks that can be modeledbydirected graphs, because in many social networks (e.g., Google+,Weibo, Twitter) information flow and influence propagate followcertain directions from vertices to vertices. Given a social networkas a directed graphG, its social hierarchy can be represented as adirected acyclic graph (DAG). By DAG, all the vertices inG arepartitioned into different levels (disjoint groups), and all the edgesin the cycle-free DAG follow one direction, as observed in socialnetworks that prestige users at high levels are followed by usersat low levels and the prestige users typically do not follow theirfollowers. Here, a level in DAG represents the status of a vertex inthe hierarchy the DAG represents.

The issue we study in this paper is how to find hierarchy as aDAG in a general directed graphG which represents a social net-work. Given a graphG, there are many possible ways to obtain aDAG. First, converting graphG into a DAG, by contracting all ver-tices in a strongly connected component inG as a vertex in DAG,does not serve the purpose, because all vertices in a strongly con-nected component do not necessarily belong to the same levelin ahierarchy. Second, a random DAG does not serve the purpose, be-cause it heavily relies on the way to select the vertices as the startto traverse and the way to traverse. Therefore, two random DAGscan be significantly different topologically. Third, finding the maxi-mum DAG ofG is not only NP-hard but also NP-approximate [14].The way we do is to find the DAG by removing all possible cycles

1

Page 2: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

from G following [13]. In [13] Gupte et al. propose a way to de-compose a directed graphG into a maximum Eulerian subgraphU(G) and DAGGD, such thatG = U(G) ∪ GD. Here, all possi-ble cycles inG are inU(G), and all edges inGD do not appear inU(G). We take the same approach to find DAGGD for a graphGby finding the maximum Eulerian subgraphU(G) of G such thatG = U(G) ∪GD , as given in [13].

Main contributions : We summarize the main contributions of ourwork as follows. First, unlike [13] which studies a measure be-tween 0 and 1 to indicate how close a given directed graph is toa perfect hierarchy, we focus on the hierarchy (DAG). In additionto the properties investigated in [13], we show thatGD found isrepresentative, exhibits the pyramid rank distribution. In addition,GD found can be used to study social mobility and recover hiddendirections of social relationships. Here, social mobilityis a fun-damental concept in sociology, economics and politics, andrefersto the movement of individuals from one status to another. Second,we significantly improve the efficiency of computing the maximumEulerian subgraphU(G). Note that the time complexity of theBF-U algorithm [13] isO(nm2), wheren andm are the numbers ofvertices and edges, respectively. Such an algorithm is impractical,because it can only work on small graphs. We propose a new algo-rithm with time complexityO(m2), and propose a novel two-phasealgorithm, called Greedy-&-Refine, which greedily computes anEulerian subgraph inO(n +m) and then refines this greedy solu-tion to find the maximum Eulerian subgraph inO(cm2) wherecis a very small constant less than 1. The quality of our greedyap-proach is high. Finally, we conduct extensive performance studiesusing 14 real-world datasets to evaluate our algorithms, and con-firm our findings.

Further related work : Ball and Newman [3] analyze directednetworks between students with both reciprocated and unrecipro-cated friendships and develop a maximum-likelihood methodtoinfer ranks between students such that most unreciprocatedfriend-ships are from lower-ranked individuals to higher-ranked ones, cor-responding to status theory [11]. Leskovec et al. in [19, 18]inves-tigate signed networks and develop an alternate theory of statusin replace of the balance theory frequently used in undirected andunsigned networks to both explain edge signs observed and predictedge signs unknown. Influence has been widely studied [6], findingsocial hierarchy provides a new perspective to explore the influencegiven the existence of a social hierarchy.

Eulerian graphs have been well studied in the theory community[9, 10, 5, 7, 20]. For example, in [9], Fleischner gives a comprehen-sive survey on this topic. In [10], the same author surveys severalapplications of Eulerian graphs in graph theory. Another closelyrelated concept is super-Eulerian graph, which contains a spanningEulerian subgraph [5, 7, 20], here a spanning Eulerian subgraphmeans an Eulerian subgraph that includes all vertices. The prob-lem of determining whether or not a graph is super-Eulerian is NP-complete [7]. Most of these work mainly focus on the properties ofEulerian subgraphs. There are no much related work on comput-ing the maximum Eulerian subgraphs for large graphs. To the bestof our knowledge, the only one in the literature is done by Gupte,et al. in [13]. However, the time complexity of their algorithm isO(nm2), which is clearly impractical for large graphs.

Organization: In Section 2, we focus on the properties of the so-cial hierarchy found after giving some useful concepts on maxi-mum Eulerian subgraph, and discuss the applications. In Section 3,we discuss an existing algorithmBF-U [13]. In Section 4, we pro-pose a new algorithmDS-U of time complexityO(m2), and treatit as the baseline algorithm. We present a new two-phase algorithm

v1

v11

v2

v3 v4

v5

v6 v7v8

v9v10

v12 v13 v14

Figure 1: Illustration of the maximum Eulerian subgraph

GR-U for finding the maximum Eulerian subgraph, as well as itsanalysis in Section 5. Extensive experimental studies are reportedin Section 6. Finally, we conclude this work in Section 7.

2. THE HIERARCHYConsider an unweighted directed graphG = (V,E), where

V (G) andE(G) denote the sets of vertices and directed edges ofG, respectively. We usen = |V (G)| andm = |E(G)| to de-note the number of vertices and edges of graphG, respectively.In G, a pathp = (v1, v2, · · · , vk) represents a sequence of edgessuch that(vi, vi+1) ∈ E(G), for eachvi (1 ≤ i < k). Thelength of pathp, denoted aslen(p), is the number of edges inp. A simple path is a path(v1, v2, · · · , vk) with k distinct ver-tices. A cycle is a path where a same vertex appears more thanonce, and a simple cycle is a path(v1, v2, · · · , vk−1, vk) wherethe firstk − 1 vertices are distinct whilevk = v1. For simplic-ity, below, we useV and E to denoteV (G) and E(G) of G,respectively, when they are obvious. For a vertexvi ∈ V (G),the in-neighbors ofvi, denoted asNI(vi), are the vertices thatlink to vi, i.e., NI(vi) = {vj | (vj , vi) ∈ E(G)}, and the out-neighbors ofvi, denoted asNO(vi), are the vertices thatvi linksto, i.e.,NO(vi) = {vj | (vi, vj) ∈ E(G)}. The in-degreedI(vi)and out-degreedO(vi) of vertexvi are the numbers of edges thatdirect to and fromvi, respectively, i.e.,dI(vi) = |NI(vi)| anddO(vi) = |NO(vi)|.

A strongly connected component (SCC) is a maximal subgraphof a directed graph in which every pair of verticesvi andvj arereachable from each other.

A directed graphG is an Eulerian graph (or simply Eulerian) iffor every vertexvi ∈ V (G), dI(vi) = dO(vi). An Eulerian graphcan be either connected or disconnected. An Eulerian subgraphof a graphG is a subgraph ofG, which is Eulerian, denoted asGU . The maximum Eulerian subgraph of a graphG is an Euleriansubgraph with the maximum number of edges, denoted asU(G).Given a directed graphG, we focus on the problem of finding itsmaximum Eulerian subgraph,U(G), which does not need to beconnected. Note that the problem of finding the maximum Euleriansubgraph (U(G)) in a directed graph can be solved in polynomialtime, whereas the problem of finding the maximum connected Eu-lerian subgraph is NP-hard [4]. The following example illustratesthe concept of maximum Eulerian subgraph.

Example 2.1: Fig. 1 shows a graphG = (V,E) with 14 ver-tices and 22 edges. Its maximum Eulerian subgraphU(G) is asubgraph ofG, where its edges are in solid lines:E(U(G)) ={(v1, v2), (v2, v4), (v4, v3), (v3, v5), (v5, v1), (v4, v6), (v6, v4),(v3, v6), (v6, v3), (v6, v8), (v8, v11), (v11, v12), (v12, v13), (v13,v14), (v14, v7), (v7, v6)}, andV (U(G)) is the set of vertices thatappear inE(U(G)).

The main issue here is to find a hierarchy of a directed graphGas DAGGD by finding the maximum Eulerian subgraphU(G) fora directed graphG. With U(G) found,GD can be efficiently founddue toG = U(G)∪GD, andE(U(G))∩E(GD) = ∅. We discuss

2

Page 3: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

the properties of the hierarchyGD and the applications.

The representativeness: The maximum Eulerian subgraphU(G)for a general graphG is not unique. A natural question is howrepresentativeGD is as the hierarchy. Note thatGD is only uniquew.r.t U(G) found. Below, we showGD identified by an arbitraryU(G) is representative based on a notion ofstrictly-higherdefinedbetween two vertices inGD , over a rankingr(·) wherer(u) <r(v) for each edge(u, v) ∈ GD . Here, for two verticesu andv, a larger rank implies a vertex is in a higher status in a followerrelationship, andu is strictly-higherthanv if r(u) > r(v) andu isreachable fromv, i.e. there is a directed path fromv to u in GD .

Theorem 2.1: Let GD1andGD2

be two DAGs forG such thatG = U1(G) ∪ GD1

= U2(G) ∪ GD2. There are no verticesu

and v such thatu is strictly-higherthan v in GD1whereasv is

strictly-higherthanu in GD2.

Proof Sketch: Assume the opposite. We can construct an auxiliarygraphG′ = G ∪ {(u, v)}. Then finding the maximum Euleriansubgraph forG′ can be done in two steps. In the first step, find themaximum Eulerian subgraphU(G), and in the second step, findthe maximum Eulerian subgraph forG plus the additional edge(u, v). SinceG = U1(G) ∪ GD1

= U2(G) ∪ GD2, there are

supposed to be at least two corresponding relaxing orders, whenthe first phase terminates, namely, identifyingU1(G) andU2(G).For one relaxing order, we can show that the added edge(u, v) canbe relaxed, which results in findingU(G′) such that|E(U(G′))| >|E(U(G))|. For the other relaxing order, we can also show thatthe added edge(u, v) cannot be relaxed andU(G′) = U(G). Itleads to a contradiction, because it can find two different maximumEulerian subgraphs forG′ with different sizes.

Alternatively, let the ranking inGD1and GD2

be r1(·) andr2(·). Assume there are two verticesu andv such thatu is strictly-higher thanv by r1 whereasv is strictly-higherthanu by r2. Weprove this cannot achieve based on the finding in [13]. In [13], itgives a total score onG which measures howG is different fromDAG GD based on a rankingr(·). The total score, denoted asA(G, r), is obtained by summing up the weights assigned to edges,max{r(u)−r(v)+1,0} for edge(u, v). The finding in [13] is thatthe minimum total score equals to the number of edges in the max-imum Eulerian subgraph,minr{A(G, r)} = |E(U(G))|. Chooser1 andr2 satisfying thatA(G, r1) = |E(U1(G))| andA(G, r2) =|E(U2(G))|. Sinceu is strictly-higherthanv in GD1

, there is adirected path fromv to u in GD1

. We can construct an auxiliarygraphG′ = G ∪ {(u, v)}, then|E(U(G′))| > |E(U1(G))|. Onthe other hand, over the sameG′, sincev is strictly-higherthanu inGD2

, we can show|E(U(G′))| ≤ A(G′, r2) = A(G, r2), whichleads to a contradiction. ✷

A case study: With the hierarchy (DAGGD) found, suppose weassign every vertexu a minimum non-negative rankr(u) such thatr(u) < r(v) for any edge(u, v) ∈ GD , wherer(·) is a strictly-higher rank. To show whether such ranking reflects the groundtruth, as a case study, we conduct testing using Twitter, wherethe celebrities are known, for instance, refer to Twitter Top 100(http://twittercounter.com/pages/100). We samplea subgraph among 41.7 million users (vertices) and 1.47 billion re-lationships (edges) from Twitter social graphG crawled in 2009 [16].In brief, we randomly sample 5 vertices in the celebrity set givenin Twitter, and then sample 1,000,000 vertices starting from the 5vertices as seeds using random walk sampling [17]. We constructan induced subgraphG′ of the 1,000,000 vertices sampled fromG, and we uniformly sample about 10,000,000 edges fromG′ toobtain the sample graphG, which contains 759,105 vertices and

0

0.2

0.4

0.6

0.8

1

0 2 4 6 8 >9

Perc

en

t

Rank

wiki-VoteEpinions

Slashdot0902Pokec

Gplus2Weibo0

(a) The hierarchy

0

0.2

0.4

0.6

0.8

1

0 2 4 6 8 >9

Perc

en

t

Rank

wiki-VoteEpinions

Slashdot0902Pokec

Gplus2Weibo0

(b) Random DAG

0

0.2

0.4

0.6

0.8

1

0 2 4 6 8 >9

Perc

en

t

Rank

wiki-VoteEpinions

Slashdot0902Pokec

Gplus2Weibo0

(c) SCC

Figure 2: Rank DistributionGraph |V | |E| |V (U(G))| |E(U(G))|

Gplus0 100,000 115,090 2,833 6,271Gplus1 100,000 512,281 14,797 70,537Gplus2 100,000 2,867,781 51,605 770,854Gplus3 100,000 8,289,203 87,941 3,644,147

Weibo0 100,000 2,431,525 96,765 850,136Weibo1 100,000 2,446,002 96,833 855,131Weibo2 100,000 2,463,050 96,902 861,729Weibo3 100,000 2,479,140 96,969 868,044

Table 1: To study social mobility

11,331,061 edges. InG, we label a vertexu as a celebrity, ifu isa celebrity and has at least 100,000 followers inG. There are 430celebrities inG including Britney Spears, Oprah Winfrey, BarackObama, etc. We compute the hierarchy (GD) of G using our ap-proach and rank vertices inGD. The hierarchy reflects the truth:88% celebrities are in the top 1% vertices and 95% celebrities inthe top 2% vertices. In consideration of efficiency, we can approxi-mate the exact hierarchy with a greedy solution obtained byGreedyin Section 5. In the approximate hierarchy, 85% celebritiesare inthe top 1% vertices and 93% celebrities in the top 2% vertices.

The pyramid structure of rank distribution is one of the mostfundamental characteristics of social hierarchy. We test the socialnetworks: wiki-Vote, Epinions, Slashdot0902, Pokec, Google+,Weibo. The details about the datasets are in Table 1 and Table3.The rank distribution derived from hierarchyGD , shown in Fig. 2(a),indicates the existence of pyramid structure, while the rank distri-butions derived from a random DAG (Fig. 2(b)) and by contract-ing SCCs (Fig. 2(c)) are rather random. Here, the x-axis is therank where a high rank means a high status, and the y-axis is thepercentage in a rank over all vertices. By analyzing the vertices,u, in G over the difference between in-degree and out-degree, i.e.dI(u) − dO(u), it reflects the fact that those verticesu with neg-ative dI(u) − dO(u) are always at the bottom ofGD , whereasthose vertices in the higher rank are typically with large positivedI(u)− dO(u) values.

The social mobility: With the DAG GD found, we can furtherstudy social mobility over the social hierarchyGD represents. Here,social mobility is a fundamental concept in sociology, economicsand politics, and refers to the movement of individuals fromonestatus to another. It is important to identify individuals who jumpfrom a low status (a level inGD) to a high status (a level inGD).We conduct experimental studies using the social network Google+(http://plus.google.com) crawled from Jul. 2011 to Oct.2011 [27, 26], and Sina Weibo (http://weibo.com) crawledfrom 28 Sep. 2012 to 29 Oct. 2012 [24]. For Google+ and Weibo,we randomly extract 100,000 vertices respectively, and then extractall edges among these vertices in 4 time intervals during theperiodthe datasets are crawled, as shown in Table 1.

We show social mobility in Fig. 3. We compare two snapshots,G1 andG2, and investigate the social mobility fromG1 toG2. ForGoogle+,G1 andG2 are Gplus0 and Gplus1, and for Weibo,G1

andG2 are Weibo0 and Weibo1. ForG1, we divide all vertices into5 equal groups. The top 20% go into group 5, and the second 20%go to group 4, for example. In Fig. 3, the x-axis shows the 5 groups

3

Page 4: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5

per

cent

Status in snapshot 1

5 4 3 2 1

(a) Google+ (exact)

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5

per

cent

Status in snapshot 1

5 4 3 2 1

(b) Weibo (exact)

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5

per

cent

Status in snapshot 1

5 4 3 2 1

(c) Google+ (approx)

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5

per

cent

Status in snapshot 1

5 4 3 2 1

(d) Weibo (approx)

Figure 3: Social mobility result from hierarchyAlgorithm 1 BF-U (G)Input : A graphG = (V,E)Output : Two subgraphs ofG, U(G) and GD (G = U(G) ∪ GD)

1: w(vi, vj)← −1 for each edge(vi, vj) ∈ E;2: while there is a negative cyclepc in G do3: for every edge(vi, vj) in the negative cyclepc do4: w(vi, vj)← −w(vi, vj);5: Reverse the direction of the edge(vi, vj) to be(vj , vi);6: end for7: end while8: GD is a subgraph that contains all edges with weight -1;9: U(G) is a subgraph containing the reversed edges with weight +1;

for G1. Consider the number of vertices in a group as 100%. InFig. 3, we show the percentage of vertices in one group moves toanother group inG2. Fig. 3(a) and Fig. 3(b) show the results forGoogle+ and Weibo. Some observations can be made. Google+is a new social network when crawled since it starts from Jun.29,2011, and Weibo is a rather mature social network since it startsfrom Aug. 14, 2009. From Fig. 3(a), many vertices move from onestatus to another, whereas from Fig. 3(b), only a very small numberof vertices move from one status to another. Similar resultscanbe observed from approximate hierarchies, by our greedy solutionGreedygiven in Section 5, as shown in Fig. 3(c) and Fig. 3(d).Those moved to/from the highest level deserve to be investigated.

Recovering the hidden directionsis to identify the direction ofan edge if the direction of the edge is unknown [25]. The direc-tionality of edges in social networks being recovered is importantin many social analysis tasks. We show that our approach has ad-vantage over the semi-supervised approach (SM-ReDirect) in [25].Here, the task is using the given 20% directed edges as trainingdata to recover the directions for the remaining edges. In our ap-proach, we construct a graphG from the 20% training data, andidentify GD by G = U(G) ∪ GD. With the rankingr(·) overthe vertices, we predict the direction of an edge(u, v) is from uto v if r(v) > r(u). Take Slashdot and Epinion datasets used[25], our approach outperforms the matrix-factorization based SM-ReDirect both in terms of accuracy and efficiency. For Slashdot,our prediction accuracy is 0.7759 whereas SM-ReDirect is 0.6529.For Epinion, ours is 0.8285 whereas SM-Redirect is 0.7118. Us-ing approximate hierarchy, our accuracy is 0.7682 for Slashdot and0.8277 for Epinion, respectively.

3. THE EXISTING ALGORITHM

Algorithm 2 DS-U (G)Input : A graphG = (V, E)Output : Two subgraphs ofG, U(G) and GD (G = U(G) ∪ GD)

1: for each edge(vi, vj) in E(G) dow(vi, vj)← −1;2: for each vertexu in V (G) do dst(u) ← 0, relax(u) ← true,

pos(u)← 0;3: while there is a vertexu ∈ V (G) such thatrelax(u) = true do4: SV ← ∅, SE ← ∅, NV ← ∅;5: if FindNC(G, u) then6: while SV .top() 6= NV do7: SV .pop();(vi, vj)← SE .pop();8: w(vi, vj)← −w(vi, vj);9: Reverse the direction of the edge(vi, vj) to be(vj , vi);

10: end while11: SV .pop();(vi, vj)← SE .pop();12: w(vi, vj)← −w(vi, vj);13: Reverse the direction of the edge(vi, vj) to be(vj , vi);14: end if15: end while16: GD is a subgraph that contains all edges with weight -1;17: U(G) is a subgraph containing the reversed edges with weight +1;

Algorithm 3 FindNC(G, u)1: SV .push(u);2: for each edge(u, v) starting atpos(u) in E(G) do3: pos(u)← pos(u) + 1;4: if dst(u) + w(u, v) < dst(v) then5: dst(v) ← dst(u) + w(u, v);6: relax(v)← true, pos(v)← 0;7: if v is not inSV then8: SE .push((u, v));9: if FindNC(G, v) then return true; endif

10: else11: SE .push((u, v)); NV ← v; return true;12: end if13: end if14: end for15: relax(u)← false;16: SV .pop();SE .pop() ifSE is not empty;return false;

To find the maximum Eulerian subgraph, Gupte, et al. in [13]propose an iterative algorithm based on the Bellman-Ford algo-rithm, which we callBF-U (Algorithm 1). Let w(vi, vj) be aweight assigned to an edge(vi, vj) in G. Initially, BF-U assigns anedge-weight with a value of -1 to every edge in graphG (Line 1).Let a negative cycle be a cycle with a negative sum of edge weights.In every iteration (Lines 2-7),BF-U finds a negative cyclepc re-peatedly until there are no negative cycles. For every edge(vi, vj)in the negative cyclepc found, it changes the weight of(vi, vj)to be−w(vi, vj) and reverses the direction of the edge (Lines 4-5). As a result, it finds a maximum Eulerian subgraphU(G) anda directed acyclic graph (DAG) ofG, denoted asGD, such thatG = U(G) ∪ GD. Since the number of edges with weight +1increases by at least one during each iteration, there are atmostO(m) iterations (Line 2-7). In every iteration it has to invoke theBellman-Ford algorithm to find a negative cycle (Line 2), or to de-termine whether there is a negative cycle. The time complexity ofBellman-Ford algorithm isO(nm). Therefore, in the worst case,the total time complexity ofBF-U isO(nm2), which is too expen-sive for real-world graphs.

4. A NEW ALGORITHMTo address the scalability problem ofBF-U, we propose a new al-

gorithm, calledDS-U. Different fromBF-U which starts by finding

4

Page 5: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

a negative cycle using the Bellman-Ford algorithm in every itera-tion, DS-U finds a negative cycle only when necessary with condi-tion. In brief, in every iteration, when necessary,DS-U invokes analgorithmFindNC (short for find a negative cycle) to find a nega-tive cycle while relaxing vertices followingDFS order. Applyingamortized analysis [23], we prove the time complexity ofDS-U, isO(m2) to find the maximum Eulerian subgraphU(G).

TheDS-U algorithm is outlined in Algorithm 2, which invokesFindNC (Algorithm 3) to find a negative cycle. Here,FindNCis designed based on the same idea of relaxing edges as used inthe Bellman-Ford algorithm. In addition to edge weightw(vi, vj),we use three variables for every vertexu, relax(u), pos(u), anddst(u). Here,relax(u) is a Boolean variable indicating whetherthere are out-going edges fromu that may need to relax to find anegative cycle. It will try to relax an edge fromu further whenrelax(u) = true. When relaxing fromu, pos(u) records the nextvertex v in NO(u) (maintained as an adjacent list) for the edge(u, v) to be relaxed next. It means that all edges fromu to anyvertex beforepos(u) has already been relaxed.dst(u) is an es-timation on the vertexu which decreases when relaxing. Whendst(u) decreases,relax(u) is reset to betrue andpos(u) is re-set to be 0, since all its out-going edges can be possibly relaxedagain. Initially, inDS-U, every edge weightw(vi, vj) is initializedto -1, and the three variables,relax(u), pos(u), anddst(u), onevery vertexu are initialized totrue, 0, and0, respectively. Allw(vi, vj), relax(u), pos(u), anddst(u) are used inFindNC tofind a negative cycle following the main idea of Bellman-Fordal-gorithm in DFS order. A negative cycle, found byFindNC whilerelaxing edges, is maintained using a vertex stackSV and an edgestackSE together with a variableNV , whereNV maintains thefirst vertex of a negative cycle. InDS-U, by popping vertex/edgesfrom Sv/SE until encountering the vertex inNV , a negative cyclecan be recovered. As shown in Algorithm 2, in the while state-ment (Lines 3-15), for every vertexu in V (G), only when there isa possible relax (relax(u) = true) and there is a negative cyclefound by the algorithmFindNC, it will reverse the edge directionand update the edge weight,w(vi, vj), for each edge(vi, vj) in thenegative cycle (Lines 6-13).

v6 v3 v1 v2

v8

Figure 4: A subgraphG of Fig. 1

Example 4.1: We explainDS-U (Algorithm 2) using an examplegraphG in Fig. 4. For ever vertexu, there is an adjacent list tomaintain its out-neighbors. Initially, for every vertexu, dst(u) =0, relax(u) = true, pos(u) = 0; and for every edge(vi, vj),w(vi, vj) = −1. Suppose we processv6, v3, v1, v2, v8 in such anorder. In the first iteration,relax(v6) = true andFindNC(G, v6)returns true, which implies a negative cycle is found. Here,for allvertices inG, we havedst(v6) = 0, relax(v6) = true, pos(v6) =1; dst(v3) = −4, relax(v3) = true, pos(v3) = 0; dst(v1) =−2, relax(v1) = true, pos(v1) = 0; dst(v2) = −3, relax(v2) =true, pos(v2) = 0; dst(v8) = 0, relax(v8) = true, pos(v8) =0. In addition,NV = v3, SV = {v6, v3, v1, v2} , SE = {(v6, v3),(v3, v1), (v1, v2), (v2, v3)}. Following Lines 6–13 in Algorithm 2,we find and reverse negative cycle(v3, v1, v2, v3) and makew(v3,v2) = w(v2, v1) = w(v1, v3) = 1. In the second iteration, the out-neighbors ofv6 are relaxed frompos(v6) = 1 in v6’s adjacent list,i.e. from edge(v6, v8). FindNC (G, v6) returns false. We have,dst(v6) = 0, relax(v6) = false, pos(v6) = 2, anddst(v8) =

−1, relax(v8) = false, pos(v8) = 1. In the following iterations(FindNC(G, v3), FindNC(G, v1), andFindNC(G, v2)), all returnfalse. Finally, for vertexv8, sincerelax(v8) = false, FindNC(G, v8) is unnecessary, andDS-U (G) terminates. It finds the max-imum Eulerian subgraphU(G) = {(v3, v1), (v1, v2), (v2, v3)}.

Lemma 4.1: In Algorithm 2, if there is a negative cycleC, relax(u)= true holds for at least one vertexu ∈ V (C).

Proof Sketch: Assume the opposite, i.e., there exists a negativecycleC such that for every vertexu ∈ V (C), relax(u) = false.Let C = (v0, v1, . . . , vk−1, vk = v0), then

∑k−1i=0 w(vi, vi+1) <

0. Sincerelax(vi) = false holds fori = 0, 1, . . . , k − 1, thendst(vi+1) ≤ dst(vi) + w(vi, vi+1). It leads to a contradiction,if summing both sides fromi = 0 to i = k − 1, thendst(v0) =

dst(vk) ≤ dst(v0) +∑k−1

i=0 w(vi, vi+1) < dst(v0). ✷

Theorem 4.1: Algorithm 2 correctly finds the maximum EuleriansubgraphU(G) when it terminates.

Proof Sketch: It can be proved by Lemma 4.1. ✷

Lemma 4.2: Given an Eulerian graphG, whenDS-U (G) termi-nates, for each vertexu, dst(u) ∈ [−2m, 0], wherem = |E(G)|.

Proof Sketch: We do mathematical induction on the maximumnumber of cycles the Eulerian graphG contains.

1. If G contains only one cycle, i.e.G is a simple cycle itself,it is easy to see that for each vertexu, dst(u) ≥ −m ∈[−2m, 0].

2. Assume Lemma 4.2 holds whenG contains no more thank cycles, we prove it also holds whenG contains at mostk + 1 cycles. We first decomposeG into a simple cycleCwhich is the last negative cycle found duringDS-U (G) andthe remaining is an Eulerian graphG′ containing at mostkcycles. We explain the validation of this decomposition asfollows. If the last negative cycle found contains some pos-itive edges, then the resulting maximum Eulerian subgraphU(G) will contain some negative edges, it is against the factthatG itself is given as Eulerian. Next, we decomposeDS-U(G) into two phases, it findsG′ as an Eulerian subgraph inthe first phase while cycleC is identified in the second phase.According to the assumption, when the first phase completes,for each vertexu ∈ G′, dst(u) ∈ [−2|E(G′)| − |E(C)|, 0],where−2|E(G′)| is byDS-U(G′) and−|E(C)| is the resultby relaxingC. There are two cases for the second phase.

(a) If V (C)⋂

V (G′) = ∅, the two phases are indepen-dent. Therefore, when the second phase terminates, foreach vertexu ∈ V (C), dst(u) ∈ [−2|E(C)|, 0], andfor each vertexu ∈ V (G′), dst(u) ∈ [−2|E(G′)|, 0],Lemma 4.2 holds.

(b) If V (C)⋂

V (G′) 6= ∅, supposew ∈ V (C)⋂

V (G′),thendst(w) ∈ [−2|E(G′)|− |E(C)|, 0] when the firstphase completes. During the second phase,dst(w)decreases by|E(C)|, thendst(w) ≥ −2|E(G′)| −2|E(C)| ∈ [−2m, 0]. For any vertexv ∈ V (G′) \V (C), dst(v) can only change along a pathp = (w0, w1,. . . , wk−1, wk = v), wherew0 ∈ V (C) and(wi, wi−1)∈ E(G′), for i = 1, . . . , k. Thend(v) = d(w0) +∑k−1

i=0 w(wi, wi+1) > d(w0) ≥ −2m ∈ [−2m, 0].Therefore, Lemma 4.2 holds.

5

Page 6: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

v1

v2

v3

v4

v5

v6 v7

v8

Figure 5: Example graph to explain 2(b) in Lemma 4.2

Example 4.2: We explain the proof of 2(b) in Lemma 4.2 usingFig. 5. Fig. 5 shows an Eulerian graphG containing simple cy-cles. Suppose the first negative cycle found isC1 = (v1, v2, v3, v4,v1), with the resultingdst(v1) = −4, dst(v2) = −1, dst(v3) =−2, dst(v4) = −3. In a similar way, suppose the second negativecycle found isC2 = (v1, v5, v6, v3, v2, v1) by relaxingdst(v1) =−5, dst(v5) = −5, dst(v6) = −6, dst(v3) = −7, dst(v2) =−6, and the third negative cycle isC3 = (v3, v7, v8, v1, v4, v3)with dst(v1) = −10, dst(v4) = −9, dst(v3) = −8, dst(v7) =−8, dst(v8) = −9. By reversing these three negative cycles, wehave a cycleC = (v1, v2, v3, v4), and a graphG′ which is a simplecycle with (v1, v5, v6, v3, v7, v8, v1). As can be seen, the currentmin{dst(u)} = dst(v1) = −10 is in the range of[−2|E(G′)| −|E(C)|, 0]. WhenDS-U (G) terminates,min{dst(u)|u ∈ V (C)}= dst(v1) = −14 is in the rage of[−2|E(G)|, 0], andmin{dst(u)|u ∈ V (G′)\V (C)} = dst(v8) = −13 is in the range of[−2|E(G)|,0]. It shows that Lemma 4.2 holds for this example.

Lemma 4.3: Given a general graphG, whenDS-U(G) terminates,for each vertexu, dst(u) ∈ [−4m, 0], wherem = |E(G)|.

Proof Sketch: For a general graphG, we can add edges(u, v) fromverticesu with dI(u) > dO(u) to verticesv with dI(v) < dO(v)and(u, v) /∈ E(G). Obviously, the resulting augment graphGA

has at most2m edges.Based on Lemma 4.2, whenDS-U (GA) terminates, each ver-

tex u satisfiesd(u) ∈ [−4m, 0]. On the other hand,DS-U (GA)can be decomposed into two phases,DS-U (G) and further relax-ations exploitingE(GA) \ E(G), implying that for each vertexu,dst(u) ∈ [−4m, 0] holds whenDS-U (G) terminates. ✷

Lemma 4.4: For each value ofdst(u) of every vertexu, the out-neighbors ofu, i.e.NO(u), are relaxed at most once.

Proof Sketch: As shown in Algorithm. 3,dst(u) is monotone de-creasing, andpos(u) is monotone increasing for a particulardst(u)value. So Lemma 4.4 holds. ✷

Theorem 4.2:Time complexity ofDS-U (G) isO(m2).

Proof Sketch: Given Lemma 4.2, Lemma 4.3, and Lemma 4.4,since every edge(u, v) is checked at most|dst(u)| + |dst(v)| ≤8m times for relaxations. By applying amortized analysis [23], thetime complexity ofDS-U (G) is O(m2). ✷

Consider Algorithm 2. During each iteration of the while loop,only a small part of the graph can be traversed and most edgesare visited at most twice. Therefore, each iteration can be approx-imately bounded asO(m), and the time complexity ofDS-U isapproximated asO(K · m), whereK is the number of iterations,bounded by|E(U(G))| ≤ m. In the following discussion, we willanalyze the time complexity of algorithms based on the number ofiterations.

5. THE OPTIMAL: GREEDY-&-REFINEDS-U reduces the time complexity ofBF-U to O(m2), but it is

still very slow for large graphs. To further reduce the running timeof DS-U, we propose a new two-phase algorithm which is shownto be two orders of magnitude faster thanDS-U. Below, we first in-

troduce an important observation which can be used to prune manyunpromising edges. Then, we will present our new algorithmsaswell as theoretical analysis.

Let S be a set of strongly connected components (SCCs) of G,such thatS = {G1, G2, · · · }, whereGi is anSCC of G, Gi ⊆ G,andGi ∩ Gj = ∅ for i 6= j. We show that for any edge, if it isnot included in anySCC Gi of G, then it cannot be contained inthe maximum Eulerian subgraphU(G). Therefore, the problem offinding the maximum Eulerian subgraph ofG becomes a problemof finding the maximum Eulerian subgraph of eachGi ∈ S , sincethe union of the maximum Eulerian subgraph ofGi ∈ S , 1 ≤ i ≤|S|, is the maximum Eulerian subgraph ofG.

Lemma 5.1: An Eulerian graphG can be divided into several edgedisjoint simple cycles.

Proof Sketch: It can be proved if there is a process that we canrepeatedly remove edges from a cycle found in an Eulerian graphG, andG has no edges after the last cycle being removed. Note thatdI(u) = dO(u) for everyu in G. Let a subgraph ofG, denoted asGc, be such a cycle found inG. Gc is an Eulerian subgraph, andG⊖Gc is also an Eulerian subgraph. The lemma is established.✷

Theorem 5.1:LetG be a directed graph, andS = {G1, G2, · · · }be a set ofSCCs of G. The maximum Eulerian subgraph ofG,U(G) =

⋃Gi∈S U(Gi).

Proof Sketch: For each edgee = (u, v) ∈ U(G), there is at leastone cycle containing this edge, given by Lemma 5.1. Therefore,uandv belong to the sameSCC, i.e., for any edgee′ ∈ G − S , itcannot be included inU(G). The theorem is established. ✷

Below, we discuss how to find the maximum Eulerian subgraphfor each strongly connected component (SCC) Gi of G. In thefollowing discussion, we assume that a graphG is anSCC itself.

We can useDS-U to find the maximum Eulerian subgraph for anSCC G. However,DS-U is still too expensive to deal with largegraphs. The key issue is that the number of iterations inDS-U(Algorithm 2, Lines 3-15), can be very large when the graph andits maximum Eulerian subgraph are both very large. Since in mostiterations, the number of edges with weight +1 increases only by 1,and thus it takes almost|U(G)| iterations to get the optimal numberof edges in the maximum Eulerian subgraphU(G).

In order to reduce the number of iterations, we propose a two-phase Greedy-&-Refine algorithm, abbreviated byGR-U. Here, aGreedyalgorithm computes an Eulerian subgraph ofG, denoted asU(G), and aRefinealgorithm refines the greedy solutionU(G) toget the maximum Eulerian subgraphU(G), which needs at most|E(U(G))| − |E(U(G))| iterations. TheGR-U algorithm is givenin Algorithm 4, and an overview is shown in Fig. 6. In Algorithm 4,it first computes allSCCs (Line 1). For eachSCC Gi, it computesan Eulerian subgraph usingGreedy, denoted asU(Gi) (Line 3). InGreedy, in every iterationl (1 ≤ l ≤ lmax), it identifies a sub-graph by anl-Subgraphalgorithm, and further deletes/reverses allspecific length-l paths calledpn-paths which we will discuss indetails byDFS. Note lmax is a small number. After computingU(Gi), Gi − U(Gi) is near acyclic, and it moves all cycles fromGi − U(Gi) to U(Gi) (Line 4). Finally, it refinesU(Gi) to ob-tain the optimalU(Gi) by callingRefine(Line 5). The union of allU(Gi) is the maximum Eulerian subgraph forG. Below, we firstlist some important concepts introduced in the algorithm and anal-ysis parts in Table. 2, and then we shall detail the greedy algorithmand refine algorithm, respectively.

5.1 The Greedy AlgorithmsGiven a graphG, we propose two algorithms to obtain an ini-

6

Page 7: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

Used-In Symbol MeaningGreedy label(u) label(u) = dO(u)− dI(u)

pn-path (u, v) path(u = v1, v2, · · · , vl = v), label(u) > 0, label(v) < 0, andlabel(vi) = 0 for 1 < i < lGl (l-Subgraph) subgraph ofG contains allpn-paths of lengthlGT V (GT ) = V (G), E(GT ) = {(u, v) | (v, u) ∈ E(G)}level(v) the shortest distance from any vertexu with a positive label,label(u) > 0, in Grlevel(v) the shortest distance to any vertexu with a negative label,label(u) < 0, in G

Refine G V (G) = V (G), ∀(vi, vj) ∈ E(G), (vj , vi) ∈ E(G), andw(vj , vi) = −w(vi, vj)Analysis p-path/n-path a path where every edge is with a positive/negative weight

k-cycle (v+1 , v−1 , v+2 , . . . , v+k, v−

k, v+1 ), where(v+i , v−i ) aren-paths, and(v−i , v+i+1) plus(v−

k, v+1 ) arep-paths

△k the total weight ofn-edges for ak-cycle (△k =∑

i=1,...,k w(v+i , v−i ))△′

kthe total weight ofp-edges for ak-cycle (△′

k=

∑i=1,...,k−1 w(v−i , v+i+1) +w(v−

k, v+1 ))

G G = GP ⊕GN , GP = G⊖ U(G) andGN = G⊖ U(G)

Table 2: Notations

Algorithm 4 GR-U (G)1: ComputeSCCs ofG, S = {G1, G2, · · · };2: for eachGi ∈ S do3: U(Gi)← Greedy(Gi);4: Move all cycles found inGi−U(Gi) to U(Gi); {MakeGi−U(Gi)

acyclic}5: U(Gi)← Refine(U(Gi), Gi);6: end for7: return

⋃|S|i=1 U(Gi);

G

1 l

O(n +m)

delete/reverse pnpaths of length l

2

O(n +m) O(n +m)

lmax

Greedy

U(G)

Refine

O(cm2), c ≪ 1U(G)

one iteration

l-Subgraph DFS

Figure 6: An Overview of Greedy-&-Refine

tial Eulerian subgraphU(G). The first algorithm is denoted asGreedy-D (Algorithm 5), which deletes edges fromG to makedI(v) = dO(v) for every vertexv in U(G). The second algo-rithm is denoted asGreedy-R(Algorithm 8), which reverses edgesinstead of deletion to the same purpose. We useGreedywhen werefer to either of these two algorithms. By definition, the result-ing U(G) is an Eulerian subgraph ofG. The more edges we havein U(G), the closer the resulting subgraphU(G) is to U(G). Wediscuss some notations below

The vertex label: For each vertexu in G, we define a vertex labelonu, label(u) = dO(u)− dI(u). If label(u) = 0, it means thatucan be a vertex in an Eulerian subgraph without any modifications.If label(u) 6= 0, it needs to delete/reverse some adjacent edges tomakelabel(u) being zero.

The pn-path: We also define a positive-start and negative-end pathbetween two vertices,u andv, denoted aspn-path (u, v). Here,pn-path (u, v) is a pathp = (v1, v2, · · · , vl), whereu = v1 andv = vl with the following conditions:label(u) > 0, label(v) < 0,and all label(vi) = 0 for 1 < i < l. Clearly, by this defini-tion, if we delete all the edges inpn-path (u, v), then label(u)decreases by 1,label(v) increases by 1, and all intermediate ver-tices inpn-path (u, v) will have their labels as zero. To make allvertex labels being zero, the total number of suchpn-paths to bedeleted/reversed isN =

∑label(u)>0 label(u).

The transportation graph GT : A transportation graphGT of G isa graph such thatV (GT ) = V (G) andE(GT ) = {(u, v) | (v, u) ∈E(G)}.

Algorithm 5 Greedy-D(G)1: l← 1; G′ ← G;2: while some vertexu ∈ G′ with label(u) > 0 do3: G′ ← PN-path-D(G′, l); l← l + 1;4: end while5: return G′;

Algorithm 6 PN-path-D(G, l)1: Gl ← l-Subgraph(G, l);2: Enqueue all verticesu ∈ V (Gl) with label(u) > 0 into queueQ;3: while Q 6= ∅ do4: u←Q.top();5: Following DFS starting fromu over Gl, traverse unvisited edges

and mark them “visited”; let the path fromu to v bepn-path (u, v),when it reaches the first vertexv in Gl with level(v) = l;

6: if pn-path (u, v) 6= ∅ then7: delete all edges inpn-path (u, v) from G;8: label(u)← label(u)− 1; label(v) ← label(v) + 1;9: if label(u) = 0 then Q.dequeue();

10: else11: Q.dequeue();12: end if13: end while14: return G;

The level and rlevel: level(v) is the shortest distance from anyvertex u with a positive label,label(u) > 0, in G. rlevel(v)is the shortest distance from any vertexu with a positive label,label(u) > 0, in GT . Note rlevel(v) is the shortest distance toany vertexu with a negative label,label(u) < 0, in G.

5.1.1 The Greedy-D AlgorithmBelow, we first concentrate onGreedy-D(Algorithm 5). LetG′

beG (Line 1). In thewhile loop (Lines 2-4), it repeatedly deletesall pn-paths starting from lengthl = 1 by calling an algorithmPN-path-D(Algorithm 6) until no vertexu in G′ with a positive value(label(u) > 0).

Example 5.1: Consider graphG in Fig. 7. Three vertices,v2, v4,and v8, in double cycles, have alabel +1, and three other ver-tices,v1, v3, andv7, in dashed cycles, have alabel-1. Initially,l = 1, Greedy-D(Algorithm 5) deletespn-path (v2, v3), makinglabel(v2) = label(v3) = 0. When l = 2, pn-path (v4, v1) =(v4, v3, v1) is deleted. Finally, whenl = 5, pn-path (v8, v7) =(v8, v11, v12, v13, v14, v7) will be deleted. In Fig. 7, the graph withsolid edges isU(G) or the graphG′ returned by Algorithm 5. It isworth mentioning that for the same graphG, DS-Uneeds 10 itera-tions. From the Eulerian subgraphU(G) obtain byGreedy, it onlyneeds at most 2 additional iterations to get the maximum Eulerian

7

Page 8: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

v1

v11

v2

v3 v4

v5

v6 v7v8

v9v10

v12 v13 v14

Figure 7: An Eulerian subgraph obtained byGreedyfor graphG in Fig. 1

v4 v8

v9v3v6 v10v11

v1 v5 v7 v12

level

1

0

2

s −1

(a) BFS-Tree ofBFS(G, s)

v1 v7

v3v5 v9 v14

v6 v10 v4 v13

rlevel

0

1

2

t −1

(b) BFS-Tree ofBFS(GT , t)

Figure 8: BFS-Trees used for constructingl-Subgraph

subgraph.

It is worth noting thatU(G) is not optimal. Some edges inU(G)may not be in the maximum Eulerian subgraph, while some edgesdeleted should appear in the maximum Eulerian subgraph. In nextsection, we will discuss how to obtain the maximum Eulerian sub-graphU(G) from the greedy solutionU(G).

Finding all pn-paths with length l: ThePN-path-Dalgorithm isshown in Algorithm 6. In brief, for a given graphG, PN-path-D first extracts a subgraphGl ⊆ G which contains allpn-pathsof length l that are possible to be deleted fromG by calling analgorithml-Subgraph(Algorithm 7) in Line 1. In other words, alledges inE(G) but not inE(Gl) cannot appear in anypn-pathswith a length≤ l. Based onGl obtained,PN-path-Ddeletespn-paths fromG (not fromGl) with additional conditions (in Lines 2-13). LetG′

l be a subgraph ofGl that includes all edges appearingin pn-paths of lengthl to be deleted inPN-path-D. PN-path-Dwillreturn a subgraphG \G′

l as a subgraph ofG, which will be used inthe next run inGreedy-Dfor deletingpn-paths with lengthl + 1.

We discuss thel-Subgraphalgorithm (Algorithm 7), which ex-tractsGl from G by BFS(breadth-first-search) traversingG twice.In the firstBFS (Lines 4-6), it adds a virtual vertexs, and adds anedge(s, u) to every vertexu with a positive label (label(u) > 0)in G. Then, it assigns alevel to every vertex inG as follows. Letlevel(s) be−1. By BFS, it assignslevel(u) to belevel(parent(u))+1, whereparent(u) is the parent vertex ofu following BFS. Inthe secondBFS (Lines 7-10), it conceptually considers the trans-position graphGT of G by reversing every edge(v, u) ∈ E(G)as (u, v) ∈ E(GT ) (Line 7). Then, it assigns a differentrlevelto every vertex inG using the transposition graphGT . Like thefirst BFS, it adds a virtual vertext, and adds an edge(t, u) to ev-ery vertexu with a negative label (label(u) < 0) in GT . Then,it assignsrlevel to every vertex inGT as follows. Letrlevel(t)be−1. By BFS, it assignsrlevel(u) to berlevel(parent(u)) + 1,whereparent(u) is the parent vertex ofu in GT following BFS.The resulting subgraphGl to be returned froml-Subgraphis ex-tracted as follows. Here,V (Gl) contains all verticesu in G iflevel(u) + rlevel(u) = l for the given lengthl, andE(Gl) con-tains all edges(u, v) if both u andv appear inV (Gl), (u, v) is anedge in the given graphG, andlevel(u) + 1 = level(v) (Lines 11-13). The following example illustrates howl-Subgraphalgorithmworks.

Example 5.2:Fig. 9 illustrates theGl returned byl-Subgraph(Al-

v1

v11

v2

v3 v4

v5

v6 v7v8

v9v10

v12 v13 v14

Figure 9: An l-Subgraphfor length l = 2

Algorithm 7 l-Subgraph(G, l)1: for each vertexu in V (G) do2: level(u)←∞, rlevel(u)←∞;3: end for4: Add a virtual vertexs and an edge(s, u) from s to every vertexu in G

if label(u) > 0;5: level(s)← −1;6: level(u) ← level(parent(u)) + 1 for all verticesu in G following

BFSstaring froms;7: Construct a graphGT where V (GT ) = V (G) and E(GT ) ={(u, v)|(v, u) ∈ E(G)};

8: Add a virtual vertext and an edge(t, u) from s to every vertexu inGT if label(u) < 0;

9: rlevel(t)← −1;10: rlevel(u)← rlevel(parent(u))+1 for all verticesu in GT following

BFSstaring fromt;11: Extract a subgraphGl;12: V (Gl) = {u | level(u) + rlevel(u) = l};13: E(Gl) = {(u, v) | u ∈ V (Gl), v ∈ V (Gl), (u, v) ∈ E(G),

level(u) + 1 = level(v)};14: return Gl;

gorithm 7) whenl = 2. It is constructed using twoBFS, i.e.,BFS(G, s) and BFS (GT , t), and the associatedBFS-trees with level≤ 2 andrlevel ≤ 2 are shown in Fig. 8(a) and Fig. 8(b), respec-tively. In Fig. 8(a), verticesv1 andv7 are the only vertices withlabel < 0. In Fig. 8(b), vertexv4 is the only one withlabel > 0.Therefore,Gl contains only four edges, in dashed lines, which ismuch smaller than the original graphG to be handled.

Lemma 5.2: Byl-Subgraph, the resulting subgraphGl includes allpn-paths of lengthl in G.

Proof Sketch: Recall thatl-Subgraphreturns a graphGl whereV (Gl) = {u | level(u)+rlevel(u) = l} andE(Gl) = {(u, v) | u ∈V (Gl), v ∈ V (Gl), (u, v) ∈ E(G), level(u) + 1 = level(v)}. Itimplies the following. All vertices inGl are on at least one shortestpath from a positive label vertexu (label(u) > 0) to a negative la-bel vertexv (label(v) < 0) of lengthl. All edges are on such short-est paths. No any edge in apn-path of lengthl will be excludedfrom Gl. In other words, there does not exist an edge(u′, v′) onpn-path (u, v) of lengthl, which does not appear inE(Gl). ✷

We explainPN-path-D(Algorithm 6). Based onGl obtainedfrom G using l-Subgraph(Algorithm 7), in PN-path-D, we deleteall possiblepn-paths of lengthl fromG (Lines 2-13). The deletionof all pn-paths of lengthl from the given graphG is done usingDFSoverGl with a queueQ. It first pushes all verticesu in V (Gl)with a positive label (label(u) > 0) into queueQ, because theyare the starting vertices of allpn-paths with lengthl. We checkthe vertexu on the top of queueQ. With the vertexu, we doDFSstarting fromu overGl, traverse unvisited edges inGl, and markthe edges visited as “visited”. Letp be the firstpn-path (u, v) withlengthl alongDFS. We delete all edges onp, and adjust the labelsas to reducelabel(u) by 1 and increaselabel(v) by 1. We dequeueu from queueQ until we cannot find any morepn-paths of lengthlstarting fromu, i.e. p returned byDFS(u) is empty. It is importantto note that we only visit each edge at most once. There are two

8

Page 9: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

Algorithm 8 Greedy-R(G)1: l← 1;2: Assign an initial value of−1 to the weightw(vi, vj) for every edge

(vi, vj) ∈ E;3: while some vertexu ∈ G with label(u) > 0 do4: G← PN-path-R(G, l); { PN-path-Ris the same asPN-path-D(Al-

gorithm 6) except that in Algorithm 6, Line 7 is changed to be “re-verse all edges inpn-paths (u, v) in G, both weights and direc-tions”}

5: l← l+ 1;6: end while7: Remove edges(vi, vj) from G if w(vi, vj) = +1;8: return G;

cases. One is that the edges visited will be deleted and thereis noneed to revisit. The other is that they are marked as “visited” butnot included in anypn-paths with lengthl. For this case, theseedges will not appear in any otherpn-paths starting from any othervertices.

Lemma 5.3: By PN-path-D, all pn-paths of lengthl are deleted.

Proof Sketch: It can be proved based onDFS overGl obtainedfrom l-Subgraph.

Lemma 5.4: By PN-path-D, the resultingG does not include anypn-paths of length≤ l.

Proof Sketch: Let G′i be the resulting graph ofPN-path-Dafter

deleting allpn-paths of lengthi from G. It is trivial wheni = 1.Assume that it holds forG′

i wheni < l. We prove thatG′i holds

wheni = l. First, there are nopn-paths of length≤ l− 1 in graphG′

l−1 as a result ofPN-path-Dby assumption. Second,G′l ⊆ G′

l−1

becauseG′l is obtained by deletingpn-paths of lengthl fromG′

l−1,as given in theGreedy-Dalgorithm (Algorithm 5). Furthermore,in PN-path-D, every vertexu with label(u) = 0 in G′

l−1 keepslabel(u) = 0 in G′

l. If there is apn-path (u, v) of length≤ l − 1found inG′

l, then it must be inG′l−1, which contradicts the assump-

tion. Therefore,G′l does not include anypn-paths of length≤ l.

Theorem 5.2:ThePN-path-Dalgorithm correctly identifies a sub-graph Gl which contains allpn-paths of lengthl and returns agraph includes nopn-paths of length≤ l.

Proof Sketch: It can be proved by Lemma 5.2 and Lemma 5.4.We discuss the time complexity of theGreedy-Dalgorithm. In

our experiments, we show that more than 99.99%pn-paths deletedin most real-world datasets are with a length less than or equal to6. We take the maximum lengthlmax in theGreedy-Dalgorithm,which is equivalent to the iterations of callingPN-path-D, as a con-stant, since it is always less than 100 in our extensive experiments.Here, bothPN-path-Dandl-SubgraphcostO(n + m), becausel-SubgraphinvokesBFS twice andPN-path-DperformsDFS oncein addition. Givenlmax as a constant, the time complexity of theGreedy-Dalgorithm isO(n+m).

5.1.2 The Greedy-R AlgorithmTheGreedy-Ralgorithm is shown in Algorithm 8. LikeGreedy-

D, Greedy-Rwill result in an Eulerian subgraph. UnlikeGreedy-D,it reverses the edges onpn-paths of lengthl from l = 1 until theredoes not exist a vertexu in G with label(u) > 0. Initially, Greedy-R assigns every edge,(vi, vj), inG with a weightw(vi, vj) = −1.Then, in the while loop, it callsPN-path-R. PN-path-Ris the sameas PN-path-D(Algorithm 6) except that in Algorithm 6 Line 7is changed to be “reverse all edges inpn-path (u, v) in G, bothweights and directions”. As a result,Greedy-Ridentifies an Eule-

u2

v1

v′v

u1

v2

Figure 10: An example to explainPN-path-R.

rian subgraph ofG, U(G). Here,E(U(G)) contains all edges witha weight= −1 andV (U(G)) contains all the vertices inE(U(G)).Below, we give two lemmas to prove the correctness ofGreedy-R.

Lemma 5.5: By PN-path-R, the resultingG does not include anypn-paths of length≤ l.

Proof Sketch: Let G′i be the resulting graph ofPN-path-R(G, i),

i.e. after reversing allpn-paths of lengthi from G. It is triv-ial when i = 1. Assume that it holds forG′

i when i < l. Weprove thatG′

i holds wheni = l. Otherwise, suppose that there isa pn-path (v1, v2) of length< l in G′

l, then there exists at leastone edgee = (v, v′) in pn-path (v1, v2) that has been reversedduringPN-path-R(G, l), otherwise,pn-path (v1, v2) will be fullyincluded inG′

l−1. Without loss of generality, assume that edge(v′, v) is a part ofpn-path (u1, u2) = (u1, v

′, v, u2) of lengthl.Fig. 10 showsG′

l−1 (before callingPN-path-R(G, l)). Then, wecan easily constructpn-path (u1, v2) = (u1, v

′, v2) andpn-path(v1, u2) = (v1, v, u2), and at least one of them is of length≤ l−1,contradicting the assumption. In addition, there can not exist anypn-path of length l in G′

l. As a consequence, byPN-path-R, theresultingG does not include anypn-paths of length≤ l. ✷

Similar to Theorem. 5.2,PN-path-Ralgorithm correctly iden-tifies a subgraphGl which contains allpn-paths of lengthl andreturns a graph includes nopn-paths of length≤ l.

Theorem 5.3:ThePN-path-Ralgorithm correctly identifies a sub-graph Gl which contains allpn-paths of lengthl and returns agraph includes nopn-paths of length≤ l.

We omit the proof of Theorem. 5.3 since it can be proved in asimilar manner like Theorem. 5.2 using Lemma 5.5.

v1

v11

v2

v3 v4

v5

v6 v7v8

v9v10

v12 v13 v14

(a) G′ returned byGreedy-D

v1

v11

v2

v3 v4

v5

v6 v7v8

v9v10

v12 v13 v14

(b) G returned byGreedy-R

Figure 11: U(G) returned by Greedy-Dand Greedy-RIt is worth noticing thatU(G) obtained byGreedy-Ris at least

as good as that obtained byGreedy-D. If each edge inG − U(G)

is reversed once, then theU(G) obtained byGreedy-Ris equiva-lent to that obtained byGreedy-D, as each edge appears in at mostonepn-path. On the other hand, if there are some edges being re-versed more than once,Greedy-Rperforms better. Fig. 11 showsthe difference betweenGreedy-DandGreedy-R. Sincepn-paths oflength 1 and 2 are the same, we only show the last deleted/reversedpn-path. In Fig. 11(a), we deletepn-path (v8, v7) = (v8, v11, v12,v13, v14, v7). On the other hand, in Fig. 11(b), we reversepn-path(v8, v7) = (v8, v10, v3, v4, v9, v7). Here edge(v4, v3) is reversedtwice. U(G) returned byGreedy-Rconsists of solid lines, which isbetter than that returned byGreedy-D.

5.2 The Refine AlgorithmWith the greedy Eulerian subgraphU(G) found, we have insight

9

Page 10: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

Algorithm 9 Refine(U(G), G)

Input : A graphG, and the Eulerian subgraph obtained byGreedy, U(G)Output : Two subgraphs ofG, U(G) and GD (G = U(G) ∪ GD)

1: for each edge(vi, vj) in E(G) do2: if (vi, vj) ∈ U(G) then3: reverse the edge to be(vj , vi) in G; w(vj , vi)← +1;4: else5: w(vi, vj)← −1;6: end if7: end for8: Assigndst(u) for everyu ∈ V (G) based on Eq. (1);9: for each vertexu in V (G) do relax(u)← true, pos(u)← 0;

10: Enqueue every vertexu in V (G) into a queueQ;11: u← Q.front();12: while Q 6= ∅ do13: SV ← ∅, SE ← ∅, NV ← ∅;14: if relax(u) = true andFindNC(G, u) then15: Reverse negative cycle and change the edge weights usingSV

andSE (refer to Algorithm 2);16: Q ← Q∪ SV ;17: else18: Q.pop();u← Q.front();19: end if20: end while21: GD is a subgraph that contains all edges with a weight of -1;22: U(G) is a subgraph that contains the edges reversed for all edges with

a weight of +1;

on G because we knowG = U(G) ∪ GD whereGD is a DAG(acyclic), and can design aRefinealgorithm based on such insight,to reduce the number of times to updatedst(u), which reduces thecost of relaxing. TheRefinealgorithm (Algorithm 9) is designedbased on the similar idea given inDS-U usingFindNC with twofollowing enhancements.

First, we utilizeG = U(G) ∪ GD to initialize the edge weightw(vi, vj) for every edge(vi, vj) and dst(u) for every vertexuin G. The edge weights are initialized in Line 1-7 in Algorithm 9based onU(G) which is a greedy Eulerian subgraph. We also makeuse ofGD to initializedst(u) based on Eq. (1) in Line 8.

dst(u) =

0 if dI (u) in GD is 0,min{dst(v) − 1|(v, u) ∈ GD} u ∈ GD ,

0 otherwise(1)

Some comments on the initialization are made below. FollowingAlgorithm 2, dst(u) can be initialized asdst(u) = 0. In fact,consider Lemma 4.1. No matter whatdst(vi) is for a vertexvi(1 ≤ i ≤ k − 1) in a negative cycleC = (v1, v2, . . . , vk = v1),the negative cycle can be identified because there is at leastoneedge(vi, vi+1) that can be relaxed. Based on it, if we initializedst(u) in a way such thatdst(u) ≤ dst(v) + w(v, u), thenucannot be relaxed through(v, u) before updatingdst(v). It reducesthe number of times to updatedst(u), and improves the efficiency.We explain it further. Because for any edge,(v, u) ∈ GD, u cannever be relaxed through edge(v, u) beforedst(v) being updated,FindNC (G,u) will relax edges along a path with a few branchesto identify anegative-cycle. The variables such asrelax(u) andpos(u) are initialized in Line 9 as done in Algorithm 2.

Second, we use a queueQ to maintain candidate vertices,u,from which there may existnegative-cycles, if relax(u) = true.Initially, all vertices are enqueued intoQ. In each iteration, wheninvokingFindNC(G, v), letV ′ be the set of vertices relaxed. Among

V ′, for any vertexw ∈ SV \ {v}, dst(w) has been updated andit has only relaxed partial out-neighbors when finding the negativecycle. On the other hand, for any vertexw ∈ V ′ \ SV , all of theout-neighbors ofw have been relaxed and cannot be relaxed beforeupdatingdst(w). We excludew ∈ V ′ \ SV from Q implicitly bysettingrelax(w) = false in FindNC(G,w).

Example 5.3:Suppose we have a greedy Eulerian subgraphU(G)(Fig. 7) of G (Fig. 1) byGreedy-D, and will refine it to the opti-mal U(G) usingRefine. Initially, all edges (solid lines) inU(G)are reversed with initial +1 edge weight, and all remaining edges inGD are initialized with -1 edge weight.dst(v1) = −2, dst(v3) =−1, dst(v7) = −5, dst(v11) = −1, dst(v12) = −2, dst(v13) =−3, dst(v14) = −4, and other verticesu havedst(u) = 0. Inthe while loop,FindNC (G, v1) relaxesdst(v5) = −1 and returnsfalse. This makesrelax(v1) = relax(v5) = false by whichv1andv5 are dequeued fromQ. Afterwards, none ofv2, v3, v4, v6can relax any out-neighbors, and all are dequeued fromQ. FindNC(G, v7) relaxes all vertices, finds a negative cycle(v7, v9, v4, v2, v3,v10, v8, v11, v12, v13, v14, v7), and addsv2, v3, v4 into Q as newcandidates. Then, no vertices fromv8 to v14 can relax any out-neighbors untilFindNC(G, v2) finds the last negative cycle(v2, v4,v3, v2). For most cases,FindNC (G,u) relaxes a few ofu’s out-neighbors.

We discuss the time complexity ofRefine. The initialization(Lines 1–9) isO(n + m). SinceU(G) approximatesU(G), thenumber ofnegative-cycles found byRefinewill be no more than|E(U(G))| − |E(U(G))|, and verticesu will havedst(u) updatedless than|E(U(G))| − |E(U(G))| times. This implies the whileloop costsO(|E(U(G))| − |E(U(G))| · m). Time complexity ofRefineisO(cm2), wherec ≪ 1, as confirmed in our testing.

5.3 The Bound between Greedy and OptimalWe discuss the bound betweenU(G) obtained byGreedyand

the maximum Eulerian subgraphU(G). To simplify our discus-sion, below, a graphG is a graph with multiple edges between twovertices but without self loops, and every edge(vi, vj) is associatedwith a weightw(vi, vj), which is initialized to be -1. Given a graphG, we useG to represent the reversed graph ofG such thatV (G) =V (G) andE(G) contains every edge(vj , vi) if (vi, vj) ∈ E(G),andw(vj , vi) = −w(vi, vj). In addition, we use two operations,⊕ and⊖, for two graphsGi andGj . Here,Gij = Gi ⊕ Gj is anoperation that constructs a new graphGij by union of two graphs,Gi andGj , such thatV (Gij) = V (Gi) ∪ V (Gj), andE(Gij) =E(Gi)∪E(Gj). AndG′ = Gi⊖Gj is an operation that constructsa new graphG′ by removing a subgraphGj from Gi (Gj ⊆ Gi)such thatV (G′) = V (Gi) andE(G′) = E(Gi) \ E(Gj). Giventwo Eulerian subgraphs,Gi andGj , it is easy to show thatGi⊕Gj

andGi ⊖Gj are still Eulerian graphs. Given any graphG, G⊕Gis an Eulerian graph. Note that assume that there is a cycle withtwo edges,(vi, vj) and(vj , vi), between two vertices,vi andvj ,in G. there will be four edges inG⊕G, i.e., two edges are fromGand two corresponding reversed edges fromG.

We discuss the bound using an Eulerian graphG = GP ⊕ GN ,whereGP = G⊖U(G) andGN = G⊖U(G). We call every edgein GN a negative edge (n-edge), and a path inGN a negative path(n-path). We also call every edge inGP a positive edge (p-edge),and a path inGP a positive path (p-path). It is important to notethatp-edges are given forGP but not forGP , all n-edges are witha weight of -1, while allp-edges are with a weight of +1, becausethey are the reversed edges inGP . Here,G is a graph with multiple

10

Page 11: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

v1

v11

v2

v3 v4

v5

v6 v7v8

v9v10

v12 v13 v14

(a) U(G)

v1

v11

v2

v3 v4

v5

v6 v7v8

v9v10

v12 v13 v14

(b) GP = G⊖ U(G)

v1

v11

v2

v3 v4

v5

v6 v7v8

v9v10

v12 v13 v14

(c) U(G)

v1

v11

v2

v3 v4

v5

v6 v7v8

v9v10

v12 v13 v14

(d) GN = G⊖ U(G)

Figure 12: U(G) andU(G) of G in Fig. 1

v1

v11

v2

v3 v4

v5

v6 v7v8

v9v10

v12 v13 v14

Figure 13: G = GP ⊕ GN whereGP = G ⊖ U(G) and GN =G ⊖ U(G).

edges between a pair of vertices.

Example 5.4: Consider the example graphG in Fig. 1. The Eule-rian subgraph obtained byGreedy, i.e. U(G) is shown in Fig. 12(a).It is worth noting that we make use of the resulting graph ofGreedy-D, since that obtained byGreedy-Ris actually the maximum Eule-rian subgraph in this case. Fig. 12(c) shows the maximum EuleriansubgraphU(G). As observed, some edges inU(G) do not appearin U(G), while some edges that do not appear inU(G) appear inU(G). Fig. 12(b) and Fig. 12(d) showGP = G ⊖ U(G) andGN = G ⊖ U(G), respectively. Fig. 13 showsG = GP ⊕ GN .In Fig. 13, the solid edges represent thep-edges fromGP , and thedashed edges represent then-edges fromGN .

SinceG is Eulerian, it can be divided into several edge disjointsimple cycles as given by Lemma 5.1. Among these cycles, thereare no cycles inG with onlyn-edges, because they must be inU(G)if exist. And there are no cycles inG with onlyp-edges, because allsuch cycles have been moved intoU(Gi) in GR-U (Algorithm 4,Line 4).

Next, let a cycle be apositive-cycle if the total weight of theedges in this cycle> 0, and let it be anegative-cycle if its totalweight of edges< 0. We show there are nonegative-cycles inG.

Lemma 5.6: There does not exist anegative-cycle in G.

Proof Sketch: Assume there is anegative-cycle in G, denoted asGcyc. Since there are no cycle with onlyp-edges orn-edges, therearep-edges andn-edges in Gcyc. We divideGcyc into two sub-graphs,Gp andGn. HereGp consists of allp-edges, where eachp-edge is with a +1 weight, andGn consists of alln-edges, whereeachn-edge is with a -1 weight. Clearly,|E(Gp)| < |E(Gn)|,since it assumes thatGcyc is anegative-cycle. Note thatU(G) ⊖Gp ⊕ Gn, which is equivalent toU(G) ⊕ Gcyc ⊖ (Gp ⊕ Gp), isEulerian, and it contains more edges thanU(G), resulting in a con-

. . .

. . .

v+1

v+2 v+

k

v−1

v−2 v−

k

(a) k-cycle

v+1 v+2

v−1 v−2

(b) 2-cycle

v+1

v+2 v+

3

v−1 v−

2 v−3

(c) 3-cycle

Figure 14: k-cycle

n1 n2 n3 n4

v+1

v+2

v+3

v+4

v−1 v−

2v−3

v−4

(a) Step-1

n1 n2 n3 n4p1

v+1 v+

2v+3

v+4

v−1 v−

2v−3 v−

4

(b) Step-2

n1 n2 n3 n4p1

n′1

v+1

v+2

v+3

v+4

v−1

v−2 v−

3v−4

(c) Step-3

Figure 15: k-cycle generated byGreedy-R

tradiction. Therefore, there does not exist anegative-cycle in G.✷

Lemma 5.6 shows all cycles inG are non-negative. Since thereare no cycles with onlyp-edges or n-edges, each cycle inG canbe partitioned into an alternating sequence ofk p-paths andkn-paths, and represented as(v+1 , v−1 , v+2 , . . . , v+k , v−k , v+1 ), where(v+i , v−i ), for i = 1, 2, . . . , k, aren-paths, and(v−i , v+i+1), fori = 2, . . . , k − 1, k, plus (v−k , v+1 ) arep-paths. We call suchcycle ak-cycle. Fig. 14(a) shows an example ofk-cycle, and anarrow presents a path.p-paths are in solid lines whilen-paths arein dashed lines.

The difference|E(U(G))| − |E(U(G))| is equal to|E(G ⊖

U(G))| − |E(G⊖U(G))| = |E(GP )| − |E(GN )| = |E(GP )| −|E(GN )|, becomes the total number of edges inGP minus thetotal number of edges inGN . On the other hand, the difference|E(U(G))| − |E(U(G))| can be considered as the total weight ofall k-cycles inG. Recall that all edges inG are with weight -1 andthe edges inG are with weight +1 by our definition. Assume thatG = {C1, C2, · · · }, whereCi is a k-cycle. The total weight ofG regarding allk-cycles isw(G) =

∑iw(Ci). Below, we bound

|E(U(G))| − |E(U(G))| usingk-cycles.ConsiderG in Fig. 13, there are 3k-cycles. C1 = (v3, v1, v3)

andC2 = (v3, v2, v3) with weight 0, andC3 = (v8, v11, v12, v13,v14, v7, v9, v4, v3, v10, v8) with weight 2. This means that it needsat most 2 more iterations to get the maximum Eulerian subgraphfrom the greedy solution.

For ak-cycle (v+1 , v−1 , v+2 , . . . , v+k , v−k , v+1 ), we use△k and△′k

to represent the total weight ofn-edges1 andp-edges, i.e. △k =∑i=1,...,k w(v+i , v−i ) and △′

k =∑

i=1,...,k−1 w(v−i , v+i+1) +

w(v−k , v+1 ). Because△k is determined by the optimal in|E(GN)|= |E(G ⊖ U(G))|, the bound is obtained when getting the maxi-mum of△′

k.

Theorem 5.4:The upper bound of the total weight ofp-edges in ak-cycle with specifick isk times that ofn-edges, i.e.,△′

k ≤ k ·△k

Proof Sketch: The proof is based on the wayk-cycles constructedby Greedy-R. For simplicity, we first assume that eachn-path andp-path is a pn-path itself, and we will deal with general caseslater. Based onGreedy-R, a k-cycle is constructed as shown inFig. 15, which is a 4-cycle. Initially, there are4 n-paths, ni =

1Forn-edges, we take the absolute value of total weight.

11

Page 12: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

(v+i , v−i ), i = 1, 2, . . . , 4, as Fig. 15(a) shows.Greedy-Rdealswith pn-paths of lengthl from a smalll to a largel. First,Greedy-R finds a pathp1 = pn-path (v+2 , v−1 ), and combinesp1 with twoseparatedn-paths,n1 andn2 into a newn-path n′

1. Here,len(p1)is no larger than anylen(ni). Greedy-Rwill repeat this processto add allp-paths, pi into k-cycle in an ascending order of theirlengths. The lastp-path (v−k , v+1 ) to be added tok-cycle shouldbe the longest one among allp-paths. Then its upper bound is∑

i=1,...,k w(v+i , v−i ) +∑

i=1,...,k−1 w(v−i , v+i+1). Otherwise, its

upper bound should bemax{w(v−i , v+i+1)}. Below, we prove The-orem. 5.4.

• For 2-cycle (Fig. 14(b)): Sincew(v−1 , v+2 ) ≤ w(v+1 , v−1 ),w(v−1 , v+2 ) ≤ w(v+2 , v−2 ), andw(v−2 , v+1 ) ≤ w(v+1 , v−1 ) +w(v−1 , v+2 ) + w(v+2 , v−2 ) we have,

△′2 = w(v−1 , v+2 ) + w(v−2 , v+1 )

≤ w(v+1 , v−1 ) + 2 · w(v−1 , v+2 ) + w(v+2 , v−2 )

≤ 2 · △2

• For 3-cycle (Fig. 14(c)): Since

w(v−1 , v+2 ) ≤ w(v+1 , v−1 ), w(v+2 , v−2 ), w(v+3 , v−3 )

w(v−2 , v+3 ) ≤ w(v+1 , v−1 ) + w(v−1 , v+2 ) + w(v+2 , v−2 )

w(v−2 , v+3 ) ≤ w(v+3 , v−3 )

w(v−3 , v+1 ) ≤ w(v+1 , v−1 ) + w(v−1 , v+2 ) + w(v+2 , v−2 ) +

w(v−2 , v+3 ) + w(v+3 , v−3 )

we have,

△′3 = w(v−1 , v+2 ) + w(v−2 , v+3 ) + w(v−3 , v+1 )

≤ △3 + 2 · (w(v−1 , v+2 ) + w(v−2 , v+3 ))

≤ 2 · △3 + 3 · w(v−1 , v+2 ) ≤ 3 · △3

• Assume that it holds fork-cycles whenk < l, we provethat it also holds whenk = l. Suppose that the shortestp-path is (v−1 , v+2 ), combine(v+1 , v−1 ), (v−1 , v+2 ) and(v+2 , v−2 )into a singlep-path (v+1 , v−2 ), then we get ak-cycle ask =l − 1. As a result,△′

k ≤ (k − 1) · (△k + w(v−1 , v+2 )) +w(v−1 , v+2 ) ≤ k · △k. ✷

Let △′Ci

and△Cidenote the total weight ofp-edges andn-

edges in ak-cycle Ci. Bounding|E(U(G))|- |E(U(G))| can beformulated as an LP (linear programming) problem.

max∑

Ci

(△′Ci

−△Ci)

s.t. (Cond-1)△′Ci

> △Ci, ∀i,

(Cond-2)△′Ci

≤ ki · △Ci, for ak-cycle Ci with k-valueki,

(Cond-3)∑

Ci

(△′Ci

+△Ci) ≤ |E(G)| ≤ |E|

In Fig. 16(a),Bt at y-axis illustrates the theoretical upper boundof |E(U(G))| − |E(U(G))| = K−1

K+1|E| by solving the LP prob-

lem, where the three solid lines represent the three conditions in theabove LP problem, respectively. Here,K is the maximum amongall k values. The theoretical upper bound is far from tight. First,|E(G)| ≪ |E|, which is a tighter upper bound of

∑Ci(△′

Ci+

△Ci), moving Cond-3 towards the origin. Second, for mostk-

cycles,△′k = (1 + ǫ) · △k, 0 < ǫ < 1, since mostp-paths in a

k-cycle are far from the upper bound it can get. This leads Cond-2moving towards x-axis. Therefore, a tighter empirical upper boundis Bp at y-axis in Fig. 16(b). We will show it in the experiments.

|E|

|E|

∑△′

Ci

∑△Ci

Bt

cond-1

cond-2

cond-3

(a) Theoretical Upper Bound

∑△′

Ci

∑△Ci|G|

|G|

Bp

cond-1cond-2

cond-3

(b) Empirical Upper Bound

Figure 16: Upper Bounds

v+1

v+2

v+3

v−1

v−2 v−

3

v4

v5

v6

Figure 17: Generalp-paths

We have proved Theorem 5.4 for the casep-paths andn-pathsarepn-paths, which shows that eachp-path in a k-cycle has animplicit upper bound. In general, there are cases wherep-paths arenot pn-paths. In fact, eachp-path in a k-cycle can be classifiedinto two classes. (a) Ap-path is a part of apn-path, includingthe case that thep-path is a pn-path. (b) A p-path can be di-vided into several sub-paths, each is a part of apn-path. In Fig. 17,there are threep-paths in thek-cycle, (v−1 , v+2 ) is a part ofpn-path (v−1 , v4), (v+2 , v−3 ) itself is pn-path (v−2 , v+3 ) and(v−3 , v+1 )consists of two sub-paths:(v−3 , v5) and(v5, v+1 ), and each of themis a part of apn-path or itself is apn-path.

For the cases when ap-path in ak-cycle is not apn-path, we usewp andwu to denote its practical weight and the theoretical upperbound it can reach when itself is apn-path, respectively. Sincewe concentrate on weight ofp-paths, we treat such ap-path as apn-path with weight wp if wp < wu, and treat it as apn-pathwith weightwu if wp > wu and add the differencewp − wu to aglobal variableW . We will show in Section 6 thatW is very smallcompared with|E(U(G))|.

Time complexity: Revisit GR-U (Algorithm 4), it includes fourparts:SCC decomposition (Line 1),Greedy(Line 3), cycle moving(Line 4) andRefine(Line 5). SCC decomposition can be accom-plished in 2DFS, in timeO(n + m). As analyzed in Section 5,Greedyinvokes lmax times PN-path, and eachPN-pathneeds 2BFS (l-Subgraph) and 1DFS (remove/reversepn-paths). Sincelmax is small (< 100 in our extensive experiments), the time com-plexity of Greedyis O(n + m). Regarding moving cycles fromGi − U(Gi) to U(Gi), it is equivalent to moving cycles from non-trivial SCCs of Gi − U(Gi) to U(Gi). Based on the fact thatGi − U(Gi) is near acyclic, there are a few cycles inGi − U(Gi),cycle moving is inO(n + m). The time complexity ofRefine, asgiven in Section 5.2 isO(cm2), because mostFindNC(G,u) relaxedges along a path with a few branches and verticesu will havedst(u) updated less than|E(U(G))| − |E(U(G))| times.

6. PERFORMANCE STUDIESWe conduct extensive experiments to evaluate two proposedGR-

U algorithms. One isGR-U-D usingGreedy-D(Algorithm 5) andRefine (Algorithm 9), and the other isGR-U-R using Greedy-R(Algorithm 8) andRefine(Algorithm 9). We do not compare our

12

Page 13: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

Graph |V | |E| |V (U(G))| |E(U(G))|

wiki-Vote 7,115 103,689 1,286 17,676Gnutella 62,586 147,892 11,952 18,964Epinions 75,879 508,837 33,673 264,995Slashdot0811 77,360 828,159 70,849 734,021Slashdot0902 82,168 870,159 71,833 748,580web-NotreDame 325,729 1,469,679 99,120 783,788web-Stanford 281,903 2,312,497 211,883 691,521amazon 403,394 3,387,388 399,702 1,973,965Wiki-Talk 2,394,385 5,021,410 112,030 1,083,509web-Google 875,713 5,105,039 461,381 1,841,215web-BerkStan 685,230 7,600,595 478,774 2,068,081Pokec 1,632,803 30,622,560 1,297,362 20,911,934

Table 3: Summary of real Datasets

Graph Refine GR-U-D Refine GR-U-R DS-U c

wiki-Vote 0.1 0.1 0.1 0.1 1.0 0.100Gnutella 0.5 0.5 0.4 0.4 1.6 0.250Epinions 15.9 16.1 15.2 15.4 414.4 0.037Slashdot0811 80.6 80.8 70.9 71.0 12,748.6 0.006Slashdot0902 87.3 87.5 76.6 76.8 14,324.5 0.005web-NotreDame 2.6 3.0 2.4 2.7 370.4 0.007web-Stanford 21.5 25.7 16.7 24.9 2,780.0 0.009amazon 126.5 133.5 124.8 130.5 44,865.0 0.003Wiki-Talk 504.3 504.9 487.3 487.9 9,120.1 0.053web-Google 100.2 110.3 78.6 84.6 35,271.7 0.002web-BerkStan 129.7 137.7 67.8 75.9 7,853.9 0.010Pokec 30,954.5 30,983.7 30,120.4 30,140.5 - -Gplus2 363.5 364.2 360.5 361.2 39,083.8 0.009Weibo0 206.5 207.3 202.4 203.3 8,004.6 0.025

Table 4: Efficiency of GR-U-D, GR-U-Rand DS-U

algorithms withBF-U in [13], becauseBF-U is in O(nm2) andis too slow. We use ourDS-U as the baseline algorithm, which isO(m2). We show thatGreedyproduces an answer which is veryclose the the exact answer. In order to confirmGreedyis of timecomplexityO(n +m), we show the largest iterationlmax used inGreedyis a small constant by showing that the longestpn-path (thesame aslmax) deleted/reversed byGreedyis small. In addition,we confirm the constantc of O(c · m2) for Refineis very smallby showing statistics ofG, W , andk-cycles. We also confirm thescalability ofGR-Uas well asGreedyandRefine.

All these algorithms are implemented in C++ and complied bygcc 4.8.2, and tested on machine with 3.40GHz Intel Core i7-4770CPU, 32GB RAM and running Linux. The time unit used is second.

Datasets: We use 14 real datasets. Among the datasets, Epin-ions, wiki-Vote, Slashdot0811, Slashdot0902, Pokec, Google+, andWeibo are social networks; web-NotreDame, web-Stanford, web-Google, and web-BerkStan are web graphs; Gnutella is a peer-to-peer network; amazon is a product co-purchasing network; andWiki-Talk is a communication network. All the datasets are down-loaded from Stanford large network dataset collection (http://snap.stanford.edu/data)except for Google+ and Weibo. The detailed information of thedatasets are summarized in Table 1 and Table 3. In the tables,foreach graph, the 2nd and 3rd columns show the numbers of verticesand edges2, respectively, and the 4th and 5th columns show thenumbers of vertices and edges of its maximum Eulerian subgraph,respectively.

Efficiency: Table 4 shows the efficiency of these three algorithms,i.e., GR-U-D, GR-U-R, andDS-U, over 14 real datasets. ForGR-U-D, the 2nd column shows the running time ofRefineand the 3rdcolumn shows the total running time ofGR-U-D. As can be seen,for GR-U-D, the running time ofRefinedominates that ofGreedy-D. The 4th and 5th columns show the running time ofRefineandthe total running time ofGR-U-R, respectively. Likewise, theRe-fine algorithm is the most time-consuming procedure inGR-U-R.

2for each dataset, we delete all self-loops if exist.

0.8

0.9

1.0

wiki-Vote

Gnutella

Epinions

Slashdot0811

Slashdot0902

web-NotreDame

web-Stanford

amazon

Wiki-Talk

web-Google

web-BerkStan

PokecGplus2

Weibo0

Ratio

Greedy-D Greedy-R

Figure 18: |E(U(G))|/|E(U(G))|Graph IRD ISD % IRR ISR % IR_DSU

wiki-Vote 659 95.4 629 95.6 14,361Gnutella 2,504 69.5 1,410 82.8 8,202Epinions 5,466 97.4 5,334 97.4 207,124Slashdot0811 11,464 97.9 9,990 98.2 541,970Slashdot0902 12,036 97.8 10,426 98.1 554,163web-NotreDame 9,030 98.1 6,119 98.7 486,240web-Stanford 23,427 94.8 15,721 96.5 448,960amazon 75,104 94.1 61,818 95.2 1,282,326Wiki-Talk 37,662 95.7 36,139 95.9 871,020web-Google 90,375 92.4 59,387 95.0 1,196,616web-BerkStan 69,078 95.2 41,703 97.1 1,437,188Pokec 686,765 - 635,286 - -Gplus2 18,766 96.9 18,721 96.9 613,008Weibo0 25,991 96.2 24,550 96.4 686,765

Table 5: The numbers of Iterations

It is important to note that bothGR-U-DandGR-U-RsignificantlyoutperformDS-U. In most large datasets,GR-U-DandGR-U-Raretwo orders of magnitude faster thanDS-U. For instance, in web-Stanford dataset,GR-U-R takes 25 seconds to find the maximumEulerian subgraph, whileDS-U takes 2,780 seconds, which is morethan 100 times slower. In addition, it is worth mentioning that inPokec dataset,DS-U cannot get a solution in two weeks. In the6th column,c is thec value inRefine’s time complexityO(cm2),by comparing running time ofGR-U-R andDS-U. In all graphs,c ≪ 1. NoteBF-U is very slow, for example,BF-U takes morethan 30,000 seconds to handle the smallest dataset wiki-Vote, whileour GR-U takes only 0.1 second.

Effectiveness ofGreedy: To evaluate the effectiveness of the greedyalgorithms, we first study the size of Eulerian subgraph obtainedby Greedy-DandGreedy-R. Fig. 18 depicts the results. In Fig. 18,|E(U(G))| denotes the size of Eulerian subgraph obtained by thegreedy algorithms,|E(U(G))| denotes the size of the maximumEulerian subgraph, and|E(U(G))|/|E(U(G))| denotes the ratiobetween them. The ratios obtained by bothGreedy-DandGreedy-R are very close to 1 in most datasets. That is to say, bothGreedy-Dand Greedy-Rcan get a near-maximum Eulerian subgraph, indi-cating that bothGreedy-DandGreedy-Rare very effective. Theperformance ofGreedy-Ris slightly better than that ofGreedy-D,which supports our analysis. The ratio of Gnutella dataset usingGreedy-Dis slightly lower than others. One possible reason is thatGnutella is much sparser than other datasets, thus some inappropri-atepn-path deletions may result in enlarging otherpn-paths, andthis situation can be largely relieved inGreedy-R.

Second, we investigate the numbers of iterations used inGR-U-D, GR-U-R, andDS-U. Table 5 reports the results. In Table 5, the2nd and 4th columns ‘IRD’ and ‘IRR’ denote the numbers of iter-ations used in the refinement procedure (i.e.,Refine, Algorithm 9)of GR-U-D andGR-U-R, respectively. The last column ‘IR_DSU’reports the total number of iterations used inDS-U. From thesecolumns, we can see that in large graphs (e.g., web-NotreDamedataset), the numbers of iterations used inRefineof GR-U-D andGR-U-R are at least two orders of magnitude smaller than thoseused inDS-U. In addition, it is worth mentioning that in Pokecdataset,DS-U cannot get a solution in two weeks. The 3rd and 5th

13

Page 14: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

Graph |E(U(G))| |E(G)| W

wiki-Vote 17,676 3,214 20Gnutella 18,964 6,906 10Epinions 264,995 30,997 129Slashdot0811 734,021 45,315 118Slashdot0902 748,580 46,830 145web-NotreDame 783,788 10,439 3,963web-Stanford 691,521 35,402 6,168amazon 1,973,965 202,513 12,994Wiki-Talk 1,083,509 158,848 331web-Google 1,841,215 149,425 22,361web-BerkStan 2,068,081 105,569 16,991Pokec 20,911,934 3,003,797 8,964Gplus2 770,854 117,641 80weibo0 850,136 124,395 384

Table 6: Statistics of|G| andW

2 3 4 5 6 7 8 9 >=10

1.5

2

2.5

Rat

io

(a) Epinions2 3 4 5 6 7 8 9 >=10

1

1.5

2

2.5

3

3.5

Rat

io

(b) web−Stanford

Figure 19: Distributions of k-cycles for eachk

columns report the percentages of iterations saved byGR-U-DandGR-U-R, respectively. BothGreedy-DandGreedy-Rcan reduce atleast 95% iterations in most datasets. Similarly, the results obtainedby GR-U-Rare slightly better than those obtained byGR-U-D.

The largest iteration lmax: We show the largest iterationlmax inGreedyby showing the longestpn-paths deleted/reversed, which isthe numbers ofPN-path-D/PN-path-Rinvoked byGreedy-D/Greedy-R using the real datasets. Below, the first/second number is thelongestpn-paths deleted/reversed. wiki-Vote (9/9), Gnutella (29/22),Epinions (12/10), Slashdot0811 (6/6), Slashdot0902 (8/8), web-NotreDame (96/41), web-Stanford (275/221), amazon (57/37), Wiki-Talk (9/7), web-Google (93/37), web-BerkStan (123/85), Pokec(14/13), Gplus2 (9/8), and Weibo0 (12/10). The longestpn-pathsdeleted or reversed are always of small size, especially comparedwith |E|. Therefore, the time complexity ofGreedycan be re-garded asO(n+m).

The support to a small c: We show the support thatc given inO(cm2) for Refineis small by giving statistics ofG, W , andk-cycles. We first show the statistics of|G| (= GP ⊕ GN ) andWdiscussed in Section 5.3. Table 6 reports the results. From Table 6,we can find that for each graph,|E(G)| andW are small comparedwith |E(U(G))|. These results confirm our theoretical analysis inSection 5.3. Second, we study the statistics ofk-cycles. The resultsof Epinions and web-Stanford datasets are depicted in Fig. 19, andsimilar results can be observed from other datasets. In Fig.19, y-axis denotes the ratio between the total weights ofp-edges and thetotal weights ofn-edges (i.e.,∆′

k/∆k defined in Section 5.3), andthe x-axis denotesk for k-cycles, wherek = 2, 3, · · · , >= 10. Ascan be seen, for allk-cycles, the ratios are always smaller than 2in both Epinions and web-Stanford datasets. These results confirmour analysis in Section 5.3.

Scalability: We test the scalability forGR-U-R, GR-U-D, andDS-U. We report the results for web-NotreDame and web-Stanford inFig. 20. Similar results are observed for other real datasets. Totest the scalability, we sample 10 subgraphs starting from 10% ofedges, up to 100% by 10% increments. Fig. 20(a) and Fig. 20(b)show bothGR-U-RandGR-U-D scale well. For web-NotreDame,we further show the performance ofGreedyandRefinein Fig. 20(c)and Fig. 20(d). In Fig. 20(c),Greedyseems to be not really linear.

0.01

0.1

1

10

100

1000

10 20 30 40 50 60 70 80 90 100

Ru

nn

ing

tim

e (

s)

Fraction of edges (%)

GR-U-D GR-U-R DS-U

(a) web-NotreDame

0.01

0.1

1

10

100

1000

10000

10 20 30 40 50 60 70 80 90 100

Ru

nn

ing

tim

e (

s)

Fraction of edges (%)

GR-U-D GR-U-R DS-U

(b) web-Stanford

0.1

0.2

0.3

0.4

0.5

10 20 30 40 50 60 70 80 90 100

Ru

nn

ing

tim

e (

s)

Fraction of edges (%)

Greedy-D Greedy-R

(c) Greedyof (a)

0.01

0.1

1

10

10 20 30 40 50 60 70 80 90 100

Ru

nn

ing

tim

e (

s)

Fraction of edges (%)

Refine (D) Refine (R)

(d) Refineof (a)

Figure 20: Scalability: web-NotreDame and web-Stanford

We explain the reason below. Revisit Algorithm 4, the efficiencyof Greedyis mainly determined by two factors, the graph size (ormore precisely the size of the largestSCC) and the number of timesinvokingPN-path(i.e. lmax). When a subgraph is sparse, bothSCC

size andlmax tend to be small (the smallest sample graph with 10%edges contains a largestSCC with 1,155 vertices and 4,317 edges,and lmax = 30/16 for Greedy-D/Greedy-R), whereas, both thesize of the largestSCC andlmax tend to be large in dense subgraphs(the entire graph contains a largestSCC with 53,968 vertices and296,228 edges, andlmax = 96/41 for Greedy-D/Greedy-R).

7. CONCLUSIONIn this paper, we study social hierarchy computing to find a so-

cial hierarchyGD as DAG from a social network represented asa directed graphG. To find GD, we study how to find a maxi-mum Eulerian subgraphU(G) of G such thatG = U(G) ∪ GD .We justify our approach, and give the properties ofGD and theapplications. The key is how to computeU(G). We propose aDS-U algorithm to computeU(G), and develop a novel two-phaseGreedy-&-Refine algorithm, which greedily computes an Euleriansubgraph and then refines this greedy solution to find the maxi-mum Eulerian subgraph. The quality of our greedy approach ishigh which can be used to support social mobility and recoverthehidden directions. We conduct extensive experiments to confirmthe efficiency of our Greedy-&-Refine approach.

8. REFERENCES[1] D. Acemoglu, A. Ozdaglar, and A. ParandehGheibi. Spreadof (mis)

information in social networks.Games and Economic Behavior,70(2), 2010.

[2] J. A. Almendral, L. López, and M. A. Sanjuán. Informationflow ingeneralized hierarchical networks.Physica A: Statistical Mechanicsand its Applications, 324(1), 2003.

[3] B. Ball and M. E. Newman. Friendship networks and social status.Network Science, 1(01), 2013.

[4] L. Cai and B. Yang. Parameterized complexity of evenodd subgraphproblems.Journal of Discrete Algorithms, 9(3), 2011.

[5] P. A. Catlin. Supereulerian graphs: a survey.Journal of Graphtheory, 16(2), 1992.

[6] W. Chen, L. V. Lakshmanan, and C. Castillo.Information andInfluence Propagation in Social Networks. Morgan & Claypool,2013.

[7] Z. Chen and H. Lai. Reduction techniques for supereulerian graphsand related topics: a survey.Combinatorics and graph theory, 95,1995.

[8] A. Clauset, C. Moore, and M. E. Newman. Hierarchical structure andthe prediction of missing links in networks.Nature, 453(7191), 2008.

14

Page 15: Exploring Hierarchies in Online Social Networks1502.04220v1 [cs.SI] 14 Feb 2015 Exploring Hierarchies in Online Social Networks Can Lu, Jeffrey Xu Yu, Rong-Hua Li∗, Hao Wei The Chinese

[9] H. Fleischner.Eulerian graphs and related topics, volume 1. NorthHolland, 1990.

[10] H. Fleischner. (some of) the many uses of eulerian graphs in graphtheory (plus some applications).Discrete Mathematics, 230(1), 2001.

[11] R. V. Gould. The origins of status hierarchies: A formaltheory andempirical test1.American journal of sociology, 107(5), 2002.

[12] M. S. Granovetter. The strength of weak ties.American journal ofsociology, 1973.

[13] M. Gupte, P. Shankar, J. Li, S. Muthukrishnan, and L. Iftode. Findinghierarchy in directed online social networks. InProc. of WWW’11,2011.

[14] V. Guruswami, R. Manokaran, and P. Raghavendra. Beating therandom ordering is hard: Inapproximability of maximum acyclicsubgraph. InProc. of FOCS’08, 2008.

[15] J. M. Kleinberg. Authoritative sources in a hyperlinked environment.Journal of the ACM (JACM), 46(5), 1999.

[16] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a socialnetwork or a news media? InProc. of WWW’10, 2010.

[17] J. Leskovec and C. Faloutsos. Sampling from large graphs. InProc.of KDD’06, 2006.

[18] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positiveand negative links in online social networks. InProc. of WWW’10,2010.

[19] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Signednetworks insocial media. InProc. of CHI’10, 2010.

[20] D. Li, D. Li, and J. Mao. On maximum number of edges in aspanning eulerian subgraph.Discrete mathematics, 274(1), 2004.

[21] D. Liben-Nowell and J. Kleinberg. The link-predictionproblem forsocial networks.Journal of the American society for informationscience and technology, 58(7), 2007.

[22] A. S. Maiya and T. Y. Berger-Wolf. Inferring the maximumlikelihood hierarchy in social networks. InProc. of CSE’09, 2009.

[23] R. E. Tarjan. Amortized computational complexity.SIAM Journal onAlgebraic Discrete Methods, 6(2), 1985.

[24] J. Zhang, B. Liu, J. Tang, T. Chen, and J. Li. Social influence localityfor modeling retweeting behaviors. InProc. of IJCAI’13, 2013.

[25] J. Zhang, C. Wang, and J. Wang. Who proposed the relationship?:recovering the hidden directions of undirected social networks. InProc. of WWW’14, 2014.

[26] N. Zhenqiang Gong, A. Talwalkar, L. Mackey, L. Huang, E.C. R.Shin, E. Stefanov, D. Song, et al. Jointly predicting links andinferring attributes using a social-attribute network (SAN). ACMWorkshop on Social Network Mining and Analysis, 2011.

[27] N. Zhenqiang Gong, W. Xu, L. Huang, P. Mittal, E. Stefanov,V. Sekar, and D. Song. Evolution of social-attribute networks:Measurements, modeling, and implications using google+.ACM/USENIX Internet Measurement Conference, 2012.

15