[ieee 2012 international conference on advances in social networks analysis and mining (asonam 2012)...



Page 1: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

One-mode projections of multiplex bipartite graphs

Emőke-Ágnes Horvát

Interdisciplinary Center for Scientific Computing

Heidelberg University

Speyererstrasse 6

69115 Heidelberg, Germany

[email protected]

Katharina A. Zweig

Interdisciplinary Center for Scientific Computing

Heidelberg University

Speyererstrasse 6

69115 Heidelberg, Germany

[email protected]

Abstract: Several important social network data sets have an inherent bipartite structure: for example, agents are affiliated with societies, authors write articles, customers buy, rent, or rate products. One commonly used network analytic approach to their analysis involves projecting them, i.e., deducing relations between actors of the same type (e.g. societies, articles, or products). Some of the available large scale data sets not only represent one, but several distinct relations between the same actors thereby calling for a projection method that accounts for the multiple nature of the relations. In this article we present a statistical method that properly extends a projection algorithm developed for bipartite networks containing one single type of relation. We show the stability of the proposed method on synthetic data. Then, we apply it to a real-world network of users rating films, namely a subset of the Netflix prize data set. We show that there is a gain from differentiating between the relation types. Based on the assumption that co-ratings of films contain information about the films’ similarity, we analyze the co-liking and co-disliking structures obtained by the new one-mode projection. We find that the projections of concordant ratings show a high clustering coefficient while discordant co-ratings have a very small one. This result indicates that the assumption is valid and that thus the new one-mode projection can be used as a basis for recommendations.

I. INTRODUCTION

Data about human behavior that was once considered unattainable is nowadays collected in unprecedented amounts [1]. With data acquisition methods becoming highly efficient, the need for more sophisticated exploratory methods that allow studying this often multi-dimensional data becomes more and more evident. The field devoted to understanding complex systems of interacting social actors, social network analysis, is one of the modeling frameworks able to represent several different, simultaneously observed aspects of a given social phenomenon. However, analytic methods that handle networks having different types of actors (multipartite networks) and different types of relations between the same set of actors (multiplex, multirelational or multilayered networks) are scarce. So far, research has mainly concentrated on understanding either multipartite or multiplex networks but has seldom addressed networks that are both multipartite and multiplex. For example, regarding multi-actor networks, state-of-the-art work studied the role of different actor properties and their influence on connection probability [2], [3]. The analysis of multiplex networks is firmly rooted in classic social network analysis and has advanced from small-scale questionnaire-based approaches [4], [5] to large-scale analyses that granted additional insight into the organization principles [6], [7], [8], the community structure [9], and the predictability [10], [11], [12] of multiplex networks. Recently, researchers suggested frameworks for representing and handling multiplex networks [13] and for extending traditional network measures to deal with multiplicity [14].

Systematic large-scale methods for investigating networks that are both multipartite and multiplex are still largely missing (for an exception see [15]). The present article advances in this direction by generalizing a method for bipartite, single-relation networks to bipartite multiplex networks, i.e., networks that describe multiplex relations between actors of two different types. Such data is collected in huge amounts for market basket analysis purposes. Examples are records of customers buying, renting, or rating products. Usually, the task associated with these data sets of economic interest is to relate products by quantifying their similarity based on customer behavior. Accordingly, the product-customer bipartite graph (also called a two-mode graph in social network analysis) is transformed into a unipartite graph between the products, i.e., it is subjected to a so-called one-mode projection. Previous work on the example of the Netflix prize data set [16] illustrated how user ratings can be used to detect statistically relevant similarity between pairs of films [17], [18]. The Netflix data set consists of millions of discrete ratings from 1 to 5. The aforementioned research analyzed the bipartite graph consisting of the ”good” ratings (having a value of 4 or 5). Because the different ratings are available, the data set enables a multirelational analysis in which we differentiate between likes (ratings of 4 and 5) and dislikes (ratings of 1, 2, and 3). For the analysis we apply the extended one-mode projection algorithm to bipartite graphs having two types of relations [19].

In this article we present the generalization of the uniplex one-mode projection method introduced in [17], [18] to multiplex bipartite graphs having many types of relations. Then we test its robustness against the random addition and elimination of edges using synthetic graphs. Next, we show how it performs on Netflix film rating data. We discuss the obtained results in terms of user behavior and film like/dislike patterns. Finally, we conclude our article with a discussion and present possible directions for future research.

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4799-2/12 $26.00 © 2012 IEEE

DOI 10.1109/ASONAM.2012.101

Fig. 1. Example: a film-user bipartite graph in which positive ratings are shown as red lines and negative ratings appear as green lines (left). The multiplex film network of co-rated films contains film pairs that were both liked (red line), film pairs that were both disliked (green line), and film pairs where one of the films was disliked (source of the blue arrow) and the other was liked (target of the blue arrow) by the same users. The presented method assesses the statistical significance of the edges in this network.

II. DEFINITIONS

We start by providing the graph definitions needed to describe our approach. Formally, a network is represented by a graph consisting of a set of actors (vertices) V and a set of edges E ⊆ V × V. If the actor set can be partitioned into two disjoint sets L and R such that every edge connects an actor in L with an actor in R, the graph is called bipartite. An unweighted, undirected bipartite graph is then described by a tuple B = (L ∪ R, E ⊆ L × R), where edges (i, j) ∈ E can exist only between actors i ∈ L and j ∈ R. Alternatively, the edges of a bipartite graph can be captured by a |L| × |R| binary adjacency matrix $A_{ij}$. Entries of the matrix are equal to one if there is an edge between actors i and j, and zero otherwise.

Multiplex networks model simultaneous relations among the same actors by representing each relation type with a corresponding edge type from a set M. Accordingly, a multiplex bipartite graph is defined as $B = (L \cup R, E^{|M|})$, where $E^{|M|}$ denotes the set of all possible edges of different types. The so-called supersociomatrix [4] representation of such a graph is a tensor $A_{ij\alpha}$ of dimension |L| × |R| × |M| that stores the bipartite adjacency matrix for each edge type α ∈ M. The multiplicity $m_{ij} = \sum_{\alpha} A_{ij\alpha}$ of a relation between i and j counts the number of different edges between them. In the special case where at most one type of relation is admitted between any pair of actors, the tensor $A_{ij\alpha}$ can be aggregated to a weighted adjacency matrix whose entries $\Lambda_{ij}$ encode the relation type. Figure 1 (left) shows a bipartite graph with two relation types and maximal multiplicity 1.

The degree of an actor i ∈ L (j ∈ R) with respect to relation type α, denoted by $\deg_\alpha(i)$ ($\deg_\alpha(j)$), is the number of actors it is connected to by edges of type α and can be directly computed as the row (column) sum of the adjacency matrix $A_\alpha$. Given a relation type α, the degree sequences corresponding to the two sides of a bipartite graph, $D_\alpha(L)$ and $D_\alpha(R)$, represent the ordered sequences of the degrees for the individual actor sets. Further, $\mathcal{B}(D_\alpha(L), D_\alpha(R))$ denotes the set of all possible graphs with the same degree sequences as $B_\alpha$, i.e., the bipartite graph containing only relations of type α.
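The definitions of the tensor, the multiplicity, and the type-specific degrees can be illustrated with a minimal sketch. The toy matrix `Lambda`, the integer codes for the relation types, and all variable names are assumptions made for this illustration and do not come from the article.

```python
import numpy as np

# Hypothetical toy data: rows are actors in L, columns are actors in R.
# Entries of Lambda encode the relation type: 0 = no edge, 1 = "+", 2 = "-".
Lambda = np.array([
    [1, 1, 0, 2],
    [1, 0, 2, 2],
    [0, 2, 1, 1],
])
relation_codes = [1, 2]  # the set M, here {+, -}

# Tensor A[i, j, alpha]: one binary |L| x |R| adjacency matrix per relation type.
A = np.stack([(Lambda == code).astype(int) for code in relation_codes], axis=2)

# Multiplicity m_ij: sum over the relation types alpha of A[i, j, alpha].
multiplicity = A.sum(axis=2)

# Degrees per relation type: row sums for actors in L, column sums for actors in R.
deg_L = A.sum(axis=1)  # shape (|L|, |M|)
deg_R = A.sum(axis=0)  # shape (|R|, |M|)

print(multiplicity.max())  # maximal multiplicity of the toy graph
print(deg_L.tolist())
print(deg_R.tolist())
```

Because each entry of `Lambda` holds at most one relation code, the maximal multiplicity of this toy graph is 1, which is the special case described above.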

Without loss of generality, let us now consider projecting the multiplex bipartite graph $B = (L \cup R, E^{|M|})$ to the actor set L. The co-occurrence of two actors v, w ∈ L is denoted by $\mathrm{coocc}_{\alpha\beta}(v, w)$ and equals the number of common neighbors u ∈ R they have with respect to relation types α, β ∈ M. This means that the co-occurrence counts how often there is an edge of type α between v and u and an edge of type β between w and u. Figure 1 (right) shows a one-mode network with three relation types where two actors are connected if they have a co-occurrence of one in the bipartite graph.
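Since the co-occurrence sums, over all u ∈ R, the product of the type-α entry for (v, u) and the type-β entry for (w, u), it can be written as a matrix product of the two type-specific adjacency matrices. The following is a minimal sketch under that observation; the function name `cooccurrence` and the toy matrices are assumptions for illustration only.

```python
import numpy as np

def cooccurrence(A_alpha, A_beta):
    """Return the |L| x |L| matrix whose entry (v, w) is coocc_{alpha,beta}(v, w).

    A_alpha and A_beta are binary |L| x |R| adjacency matrices of two relation types.
    Diagonal entries refer to v == w and are not used by the projection.
    """
    return A_alpha @ A_beta.T

# Hypothetical toy graph with 3 actors in L and 4 actors in R.
plus = np.array([[1, 1, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 1, 1]])
minus = np.array([[0, 0, 0, 1],
                  [0, 0, 1, 1],
                  [0, 1, 0, 0]])

pp = cooccurrence(plus, plus)    # both edges of type +
mm = cooccurrence(minus, minus)  # both edges of type -
pm = cooccurrence(plus, minus)   # type + at v and type - at w

print(pp[0, 1], mm[0, 1], pm[2, 1])
```

Note that the mixed matrix `pm` is not symmetric: `pm[v, w]` counts neighbors connected to v by a + edge and to w by a − edge, which matches the directed reading of the mixed relation in Figure 1.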

The one-mode projection of B to L is based on the co-occurrence of the actor pairs from L. It results in a complete graph, meaning that an edge is deduced between all pairs of actors. The graph is also weighted, multiplex, and unipartite. Thus, $G = (L, {E'}^{|M'|}, \Omega)$, where ${E'}^{|M'|}$ denotes the set of all possible edges between actors from L of types belonging to M′ = M × M and $\Omega : {E'}^{|M'|} \to [0, 1]$ is a function that assigns each actor pair a weight. This weight is an empirical p-value. It approximates the probability that the co-occurrence of the actor pair would assume a value greater than or equal to the observed co-occurrence in B given an appropriate null model. Details of constructing the null model and the procedure by which the projected graph is obtained from the Ω mapping are described in the following.

III. MODELS AND METHODS

It is common practice in network analysis to assess the statistical significance of topological structures using the null model approach. A typical example of this approach was used by Shen-Orr et al. [20], [21] to detect so-called network motifs (for an earlier application in ecology see [22]). These are substructures that occur more often in an observed network than expected in randomized versions of it under proper constraints. The underlying assumption is that a network’s structure follows its function and thus the prevalence of certain motifs can be attributed to their assumed functional role. The relevant substructure for the one-mode projection of bipartite graphs is the co-occurrence of two nodes belonging to the actor set we project onto, i.e., the number of their common neighbors. Zweig et al. presented a method that assesses the statistical significance of co-occurrences under a null model called the fixed degree sequence model (FDSM) [18]. Here we propose an extension of this method for multiplex bipartite graphs in which at most one type of relation is admitted between any pair of actors¹:

1) Given a multiplex bipartite graph B, we compute the co-occurrence $\mathrm{coocc}_{\alpha\beta}(v, w)$ ∀ α, β ∈ M for all distinct actor pairs v and w on the side of interest.

2) We want to compare these observed values with the expected ones in a bipartite graph in which every node maintains its degree in each of the relation types. Let $\mathcal{B}$ be the set of all such graphs. In addition, these graphs have at most one edge of any type between two actors at the same time. Note that $\mathcal{B}$ cannot be fully enumerated for most real-world networks. Thus, we compute a large sample of graphs Θ ⊂ $\mathcal{B}$. For sampling we use an edge swap method based on a Markov chain Monte Carlo technique [23]. From a list with all edge pairs of the same type, in each step we choose two edges (v, w), (x, y) uniformly at random. This way we sample each relation type with the probability with which it is contained in the data. If neither (v, y) nor (x, w) is already contained in E, we remove the two chosen edges and add (v, y), (x, w) to E. By always checking for the existence of any type of relation between (v, y) and (x, w) we do not allow for multiple edges. Thus, the multiplicity of the bipartite graph remains 1.

3) We count the number of co-occurrences of v and w with respect to all possible combinations of relation types for all sampled graphs Θ and compute the fraction p of graphs in which $\mathrm{coocc}_{\alpha\beta}(v, w)$ ∀ α, β ∈ M is at least as large as in the original graph B. This value gives the empirical probability (the p-value) that the number of co-occurrences can be explained by the structure of the graph.

¹ Note that this is the case of the Netflix data set: users rate films only once by either liking or disliking them. Thus, the maximal multiplicity of the bipartite graph deduced from film ratings is 1.
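The three steps above can be sketched compactly in code. This is an illustrative sketch, not the authors' implementation: the dictionary-based edge representation, the function names, and the rejection of mixed-type edge pairs (used here in place of an explicit list of same-type edge pairs) are assumptions. P-values are only computed for pairs with an observed co-occurrence of at least 1.

```python
import random
from collections import Counter

def edge_swap_sample(edges, n_swaps, rng):
    """Return a randomized copy of `edges`, a dict (i, j) -> relation type, i in L, j in R.

    Each accepted swap replaces (v, w), (x, y) of the same type by (v, y), (x, w),
    which preserves every actor's degree in every relation type.
    """
    edges = dict(edges)
    edge_list = list(edges)
    for _ in range(n_swaps):
        a, b = rng.sample(range(len(edge_list)), 2)
        (v, w), (x, y) = edge_list[a], edge_list[b]
        if edges[(v, w)] != edges[(x, y)]:
            continue  # only edges of the same relation type are swapped
        if (v, y) in edges or (x, w) in edges:
            continue  # an edge of any type already exists: keep multiplicity 1
        rel = edges.pop((v, w))
        edges.pop((x, y))
        edges[(v, y)] = rel
        edges[(x, w)] = rel
        edge_list[a], edge_list[b] = (v, y), (x, w)
    return edges

def cooccurrences(edges):
    """Counter mapping (v, w, type at v, type at w) -> number of common neighbors in R."""
    by_right = {}
    for (i, j), rel in edges.items():
        by_right.setdefault(j, []).append((i, rel))
    counts = Counter()
    for neighbors in by_right.values():
        for v, rel_v in neighbors:
            for w, rel_w in neighbors:
                if v != w:
                    counts[(v, w, rel_v, rel_w)] += 1
    return counts

def empirical_p_values(edges, n_samples, n_swaps, seed=0):
    """Fraction of sampled graphs whose co-occurrence is at least the observed one."""
    rng = random.Random(seed)
    observed = cooccurrences(edges)
    hits = Counter()
    current = edges
    for _ in range(n_samples):
        # Each sample is obtained from the previous one by further swaps.
        current = edge_swap_sample(current, n_swaps, rng)
        sampled = cooccurrences(current)
        for key, value in observed.items():
            if sampled[key] >= value:
                hits[key] += 1
    return {key: hits[key] / n_samples for key in observed}
```

A call such as `empirical_p_values(edges, n_samples=1000, n_swaps=len(edges) * 10)` would return one p-value per ordered actor pair and type combination; the sample size and swap count in this call are placeholders, not values taken from the article.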

Note that the multiplex projection G induced by a multiplex bipartite graph with |M| = n types of relations will contain $|M'| = \binom{n+1}{2}$ types of relations. Transforming the resulting complete projection into a sparser graph involves choosing a set of threshold p-values $P = \{p_{\alpha\beta} \mid (\alpha, \beta) \in M'\}$ (one for each relation type available in the projection) and creating a relation of type αβ between all actor pairs having a p-value of at most $p_{\alpha\beta}$ in $G_{\alpha\beta}$. There is a rule of thumb stating that observations with p ≤ 0.05 are statistically significant. We will use this standard threshold for the synthetic graphs. Later on we show on the example of Netflix that for real-world problems the topology of the sub-graphs of $G_{\alpha\beta}$ built with different possible thresholds indicates non-arbitrary and meaningful significance thresholds [24].
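As a small illustration of the thresholding step, the sketch below counts the relation types of the projection and keeps only actor pairs whose p-value does not exceed the threshold of their relation type. The key layout `(v, w, relation)`, the function names, and the toy p-values are hypothetical.

```python
from math import comb

def n_projection_types(n):
    """Number of relation types in the projection for n relation types in B."""
    return comb(n + 1, 2)

def threshold_projection(p_values, thresholds):
    """Keep the pairs whose p-value is at most the threshold of their relation type.

    p_values:   dict mapping (v, w, relation) -> empirical p-value
    thresholds: dict mapping relation -> threshold p-value
    """
    return {key: p for key, p in p_values.items() if p <= thresholds[key[2]]}

# Hypothetical p-values for a projection with relation types ++, -- and +-.
p_values = {
    ("a", "b", "++"): 0.01,
    ("a", "c", "++"): 0.20,
    ("a", "b", "+-"): 0.04,
    ("b", "c", "--"): 0.05,
}
standard = {"++": 0.05, "--": 0.05, "+-": 0.05}

sparse = threshold_projection(p_values, standard)
print(n_projection_types(2), sorted(sparse))
```

With n = 2 relation types the count is 3, in line with the three relation types ++, −−, and +− used in the experiments.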

The computational complexity of the algorithm arises from a) generating the random graph samples and b) computing the co-occurrences for each sample. The runtime of generating the samples with the edge swap method is bounded by the number of swaps. An empirical study gave evidence that the number of steps required for convergence is in the order of |E| [23]. The complexity of computing the co-occurrences is given by the total number of co-occurrences for each relation type available in the projection G. In case of a one-mode projection onto the actor set L, each actor j from R induces $\binom{\deg_\alpha(j)}{2}$ co-occurrences of type (α, α) and $\deg_\alpha(j)\deg_\beta(j)$ co-occurrences of type (α, β) with α ≠ β, α, β ∈ M. Aggregating over the relation types yields the following total number of co-occurrences, where $\deg(j) = \sum_{\alpha} \deg_\alpha(j)$ is the total degree of j:

$$\mathrm{coocc}(G) = \sum_{j \in R} \binom{\deg(j)}{2}.$$

Thus, computing the co-occurrences for each sample is in O(coocc(G)).
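The aggregation over relation types can be checked numerically for two relation types: the same-type and mixed-type counts of one actor add up to the binomial coefficient of its total degree. The sketch below assumes two relation types; the function names are illustrative.

```python
from math import comb

def cooccurrences_by_type(deg_plus, deg_minus):
    """Co-occurrences induced by one actor j in R, split by projection type."""
    return {
        "++": comb(deg_plus, 2),
        "--": comb(deg_minus, 2),
        "+-": deg_plus * deg_minus,
    }

def total_cooccurrences(total_degrees):
    """coocc(G): sum over all actors j in R of (deg(j) choose 2)."""
    return sum(comb(d, 2) for d in total_degrees)

# One actor with 3 edges of type + and 2 edges of type -.
parts = cooccurrences_by_type(3, 2)
print(parts, sum(parts.values()), comb(3 + 2, 2))
```

For this actor the split is 3 + 1 + 6 = 10, which equals the binomial coefficient of the total degree 5.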

IV. EXPERIMENTAL RESULTS

In this section, we first show the robustness of the presented method on bipartite graphs containing two types of relations. This will show how stable the one-mode projection is with respect to a single relation type (the robustness of the projection $G_{\alpha\alpha}$ ∀ α ∈ M) and with respect to the combination of relation types (the robustness of the projection $G_{\alpha\beta}$ ∀ α ≠ β and α, β ∈ M). Let us refer in the following to the two different relation types of the bipartite graph as + and −, i.e., M = {+, −}. Accordingly, the one-mode projection will contain the relation types M′ = {++, −−, +−}.

To demonstrate the robustness of the method we use computer-generated data for which the optimal structure of the projection is defined by construction. Testing an algorithm on synthetic graphs represents a standard procedure in network analysis that has been extensively used for assessing the quality of clustering methods [25], [26].

A. Robustness analysis on synthetic data

Synthetic graphs suited for testing the presented multiplex one-mode projection algorithm should have a built-in clustered structure. This enables us to check the ability of the algorithm to detect groups of actors on the projection side that are similar based on their connection patterns. Synthetic graphs should also reproduce one of the main difficulties that real-world data sets pose from the perspective of a one-mode projection algorithm, namely the heterogeneity of the degree sequences.

These two requirements can be instantiated in multiple different ways. Existing bipartite network models aim at explaining cooperation for ecological [27] and organizational networks [28] or model affiliation [29] and scientific collaboration networks [30]. These problem-specific models are not suited for producing toy graphs for large-scale tests. They are rather based on processes that constrain the graph structure beyond the degree sequences. Therefore, we use a bipartite model that resembles the structure of the Netflix data while keeping the toy graphs small and considering the above two requirements². Our synthetic graphs consist of four built-in clusters with |L| = |R| = 60 actors and the same number of $|E_{+}| = |E_{-}| = 128$ edges per cluster. Each cluster has the following structure: in the actor set on which we are projecting, there are two equal-sized groups that have only + or − relations. With this we model a film rating scenario where we have a group of films X which is basically liked by most users, and a group of films Y that is not liked by the majority. In the synthetic graphs, the degree sequences corresponding to the two relation types are the same ($D_{+}(L) = D_{-}(L)$): there are 16 actors with degree 2, 8 with degree 4, 4 with degree 8, and 2 with degree 16 (implicitly, the rest has degree 0). To keep it simple, on the other side of the bipartite graph actors have as many + as − edges. Accordingly, there are 32 actors with degree 1 + 1, 16 with degree 2 + 2, 8 with 8 + 8, and 4 with 16 + 16.

² We ran experiments also on synthetic data where the degree sequence on one of the sides of the bipartite graph was more homogeneous. This work in progress shows that the presented multiplex one-mode projection is robust when using a different network model as well.

We generate an ensemble of 100 random graphs with the given degree sequences and project each of them. The ground truth³ is defined by the projections that contain all actor pairs in relations ++, −−, and +− with a co-occurrence of at least 1 in the original graph. Without noise all the edges contained in the ground truth are recognized by the algorithm, i.e., they are assigned a p-value that is below τ = 0.05 (the standard p-value threshold). Thus, ground truth is perfectly recovered when no noise is added to the system. In the following we randomly add and eliminate up to ρ = 20% of the edges in the synthetic graphs (without preference for the relation types) and check how the applied noise affects the ability of the method to recover the ground truth.

Figure 2 (left) shows that the performance measured by the AUC⁴ is almost perfect for up to ρ = 20% added edges, but less accurate for eliminated edges. Since the AUC is hardly able to detect any changes for added edges, we use a more sensitive measure as well. The $PPV_k$ is equal to the fraction of true positives among the k top-ranked actor similarities, where k is the number of elements in the ground truth. The $PPV_k$ shows an inferior quality for both added and eliminated edges, see Figure 2 (right). Nevertheless, the algorithm is able to detect more than 90% of the truly significant edges when up to ρ = 6 to 8% of edges are added and ρ = 8% are eliminated, depending on the relation type. There are two trends in the results:

1) adding random edges affects the quality of the algorithm less than randomly missing edges and

2) the different relation types seem to show different sensitivity to noise: while ++ and −− relations are practically indistinguishable in terms of precision due to the symmetry of the synthetic graphs, the prediction quality in case of the +− relation is slightly better.

We conclude that the presented method is robust against random noise. According to an earlier finding of Zweig [17], the one-mode projection is also stable regarding different subsets of a larger data set: Zweig obtained the same significant film similarities for different, equal-sized subsets of the Netflix data. Thus, even in the case of networks with a strong heterogeneity of the degrees, we can analyze representative random samples. These can then be projected within a reasonable time frame. Thus, in the following we apply the one-mode projection method to a subset of the Netflix data, a real-world data set of particular interest.

³ The term ground truth is a standard term in machine learning which defines the set of observations that is to be re-discovered by a good algorithm. Any algorithm can then be evaluated by the number of true positive predictions, i.e., those that are in the ground truth, the number of false positives, i.e., those not in the ground truth set but predicted by the algorithm, the number of true negatives (not predicted, not present in ground truth), and the number of false negatives (not predicted, but present in ground truth).

⁴ The Area Under (the receiver operating, ROC) Curve is a standard machine learning measure, which quantifies the probability that true positives are assigned lower p-values than true negatives by a given algorithm. Thus, a perfect one-mode projection algorithm regarding ground truth has an AUC of 1 while random guessing results in an AUC of 0.5.
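The two quality measures used in the robustness analysis can be sketched as follows. This is a minimal illustration based on the verbal definitions above; the function names, the handling of ties (counted as one half), and the toy values are assumptions.

```python
def auc_from_p_values(p_values, truth):
    """Probability that a ground-truth pair gets a lower p-value than a pair
    outside the ground truth (ties count one half)."""
    pos = [p for key, p in p_values.items() if key in truth]
    neg = [p for key, p in p_values.items() if key not in truth]
    wins = sum((a < b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def ppv_k(p_values, truth):
    """Fraction of ground-truth pairs among the k pairs with the lowest p-values,
    where k is the size of the ground truth."""
    k = len(truth)
    top_ranked = sorted(p_values, key=p_values.get)[:k]
    return sum(1 for key in top_ranked if key in truth) / k

# Hypothetical p-values of five actor pairs and a ground truth with three pairs.
p_values = {"ab": 0.01, "ac": 0.02, "ad": 0.40, "bc": 0.03, "bd": 0.90}
truth = {"ab", "ac", "ad"}

print(auc_from_p_values(p_values, truth), ppv_k(p_values, truth))
```

In this toy example one ground-truth pair is ranked below a pair outside the ground truth, so both measures drop below 1, with the PPVk reacting more strongly than the AUC.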

B. Application to real-world data

As argued above, it suffices to use a large enough random subset of the Netflix data in order to obtain representative results for the whole data set. In the following we work with the graph containing all ratings given by anonymized users that have their IDs between 0 and 10,000. The users are not numbered consecutively, and thus the resulting subset involves only 1,811 users and 13,581 films. We consider two types of relations in the graph: out of a total of 364,225 ratings, 221,512 express like (+) and 142,713 express dislike (−).

We compute the corresponding multiplex projection that contains film pairs that are both liked (++ relation type), both disliked (−− relation type), and rated antagonistically by the users (+− relation type). For comparison we project the bipartite network where all edges are considered to be of the same type, i.e., there is no relation type sensitivity. In the following, we call this projection graph the all network. The one-mode projections arise by generating 25,000 random graphs with the same degree distributions as the original graph, where each of the graphs results from the previously sampled one with |E| log |E| link swaps.

Transforming the complete one-mode projections into

sparser graphs by thresholding with the standard τ = 0.05results in graphs that are still too dense (depending on the

relation type they contain between 1.45 and 3.27 million

edges). Instead, we choose a set of more restrictive thresholds

based on the structure of the projection graphs. To do so, we

are guided by a comprehensive measure of topological network

characteristics called the clustering coefficient (denoted by

cc). This quantifies the average probability that neighbors

of an actor are connected themselves [31]. Monitoring the

average clustering coefficient (cc) of the sub-graphs of Gαβ

in dependence of different p-value thresholds, we see non-

trivial changes in topology indicating meaningful threshold

candidates. For each of the four networks (++,−−,+−,all)we find the corresponding p-value threshold based on Fig-

ure 3. Accordingly, we choose τ++ = 0.006, τ−− = 0.014,

τ+− = 0.004, and τall = 0.008. These thresholds are

considered optimal since they mark a clear maximum in

the average clustering coefficient of the particular projection

graphs suggesting a strong increase in interconnectedness. The

+− network is a striking exception. The very low clustering

and its monotonic increase indicates that there is no specific

trend in the way users couple liked and disliked films. This

observation strengthens our intuition that a user’s co-rating

does say something about the similarity of two films. If we

assume that co-liked films are in the same subgroup, we

naturally expect a transitivity in the co-liking and co-disliking

behavior. A blue triangle implies that all three films are

in different subgroups – otherwise we would expect to see

additional red or green edges, which is very rare. For each

blue edge in the triangle there would then be a statistically

significant subset of users that likes subgroup A but not B,

another that likes B but not C, and a third subset that likes Cbut not A. While this is possible, intuition says it is unlikely

601602

Page 5: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

Fig. 2. Robustness analysis on synthetic data: AUC and PPVk evaluating the performance of the algorithm on an ensemble of 100 synthetic graphs for increasing ρ noise levels. Results are shown for added (upper row) and eliminated edges (lower row). Red data points represent the performance of recovering the ++ network, green data points refer to the −− network, and blue ones indicate results for the +− network.

Fig. 3. Analyzing the Netflix data: Deducing meaningful τ significance level thresholds for the four networks based on the average clustering coefficient of the projections containing film similarities at most equal to the candidate p-value thresholds. Results for the various types of networks (++, −−, +−, all) are color coded differently (red, green, blue, purple).

and thus the network structure of the one-mode projection

reinforces the assumption on which it was built: that co-

rating behavior of users – if it is properly denoised – tells

us something about the similarity landscape of the films.
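The threshold selection described above can be illustrated with a minimal sketch. It assumes that each film pair carries a p-value, that a candidate threshold τ keeps all pairs with p-value at most τ, and that the clustering coefficient is the average of the local coefficients in the sense of [31], with films of degree below two counted as zero. The function names `average_clustering` and `scan_thresholds` and the toy p-values are hypothetical and are not taken from the original study; edges are treated as undirected for simplicity.

```python
from itertools import combinations

def average_clustering(edges):
    """Average local clustering coefficient of an undirected graph.

    `edges` is an iterable of (u, v) pairs; nodes with fewer than two
    neighbors contribute a coefficient of 0.
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    if not neighbors:
        return 0.0
    total = 0.0
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            continue
        # Count connected pairs among the neighbors of `node`.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(neighbors)

def scan_thresholds(p_values, candidates):
    """Return [(tau, cc)], where cc is computed on pairs with p-value <= tau."""
    results = []
    for tau in candidates:
        kept = [pair for pair, p in p_values.items() if p <= tau]
        results.append((tau, average_clustering(kept)))
    return results

# Toy p-values for film pairs (hypothetical numbers, not Netflix data).
toy_p_values = {
    ("A", "B"): 0.001, ("B", "C"): 0.002, ("A", "C"): 0.003,
    ("C", "D"): 0.020, ("D", "E"): 0.030, ("E", "F"): 0.050,
}
scan = scan_thresholds(toy_p_values, [0.005, 0.025, 0.06])
best_tau, best_cc = max(scan, key=lambda item: item[1])
```

In this toy case the strictest candidate keeps only a closed triangle, so the clustering coefficient peaks there and decays as weaker pairs are admitted; on real data the maximum need not lie at the smallest threshold.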

Besides the above-stated difference in the clustering coefficient, other properties of the individual projections can be considered: Table I summarizes their basic network statistics.

TABLE I
BASIC NETWORK STATISTICS OF THE SINGLE NETFLIX PROJECTIONS. SHOWN ARE THE p-VALUE THRESHOLD τ, THE NUMBER OF NODES |L| AND LINKS |E′|, THEIR DENSITY δ, THE NUMBER OF COMPONENTS |C|, THE PERCENTAGE OF ACTORS IN THE LARGEST COMPONENT |Cmax|, AND THE CLUSTERING COEFFICIENT cc.

        τ       |L|      |E′|      δ        |C|   |Cmax|   cc
G++     0.006   9,452    351,989   0.0078   388   77.42    0.50
G−−     0.014   10,923   502,149   0.0084   390   77.20    0.49
G+−     0.004   9,807    177,365   0.0036    73   98.32    0.03
Gall    0.008   12,664   833,410   0.0103   397   76.53    0.49

While there is no considerable variation in the number of

nodes the different projections have, there is some distinction

in terms of their density: the +− network is roughly half as

dense as the ++ and −− networks and has one third of the

density of the all network. When it comes to the division

in components and the average clustering coefficient, the +− network is again the outlier. As opposed to the many highly clustered components the other three projections have, the +− relation forms one large component that is "widely spread"

among the films. The role and position of these +− edges

can be established more precisely based on the multiplex

projection where the ++, −−, and +− relations coexist.
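The density comparison can be checked directly from the node and link counts in Table I. The sketch below assumes the standard density of a simple undirected graph, δ = 2|E′| / (|L|(|L| − 1)); the text does not state the formula, so this is an assumption, although it agrees with the tabulated values to within about 10⁻⁴. The helper name `density` is hypothetical.

```python
def density(num_nodes, num_edges):
    """Density of a simple undirected graph: 2|E'| / (|L| (|L| - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))

# (|L|, |E'|) as listed in Table I.
table_1 = {
    "++": (9452, 351989),
    "--": (10923, 502149),
    "+-": (9807, 177365),
    "all": (12664, 833410),
}
densities = {name: density(n, m) for name, (n, m) in table_1.items()}

# Ratios behind the "roughly half" and "one third" statements in the text.
ratio_to_pp = densities["+-"] / densities["++"]
ratio_to_all = densities["+-"] / densities["all"]
```

Under this assumption the +− projection comes out at a little under half the density of the ++ projection and a little over one third of the density of the all projection.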

This projection is too big for a visual inspection. Thus, as

an illustration we show an ego-network from it⁵. We randomly

⁵ An ego-network is a subgraph centered around an actor, containing the actor itself, its neighbors, and all edges between them.


pick season 3 of the series Buffy the Vampire Slayer. Maintaining only those films that have at least degree 2, we obtain the ego-network shown in Figure 4. We see that the

Buffy sequels and Buffy The Movie form a tightly connected

++ group, meaning that users who liked and rated season 3

did the same with the other sequels as well. Moreover, the

ego-network reflects significant extra information about the relatedness between Buffy, Angel (a well-defined ++ group on its own), and Firefly: all of them have a similar topic and style and share the same creator (Joss Whedon). The connection

patterns might even indicate a temporal aspect: Buffy The Movie preceded the Buffy series which in turn came out before Angel. What we see in the connections is that fans of Buffy have seen the feature film as well, but fans of Angel did not necessarily. Further fantasy/science fiction series are co-liked with these series, e.g. The Highlander, The X-Files, Joan of Arcadia, or Mystery Science Theater. Interestingly, the family

series Gilmore Girls is much more connected to Buffy than to

Angel. Regarding the +−, i.e., the like-dislike relation type, we

see an interesting pattern: whoever disliked the comedy series

Cheers and Ernest Scared Stupid liked sequels of Buffy.
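The extraction of such an ego-network can be sketched as follows. The sketch follows the definition in the footnote (the ego, its neighbors, and all edges among them) and then keeps only films with degree at least 2. Whether the degree is measured inside the ego-network or in the full projection is not stated in the text; the sketch assumes the former. The function name `ego_network`, the single-pass filtering, and the toy edges are hypothetical, and edge directions are ignored.

```python
def ego_network(edges, ego, min_degree=2):
    """Return (nodes, edges) of the ego-network of `ego`.

    The ego-network holds the ego, its neighbors, and all edges among them.
    Nodes whose degree inside that subgraph is below `min_degree` are then
    dropped in a single pass (the ego itself is always kept).
    """
    edges = {frozenset(e) for e in edges if e[0] != e[1]}
    members = {ego}
    for e in edges:
        if ego in e:
            members |= e
    sub_edges = {e for e in edges if e <= members}
    degree = {n: 0 for n in members}
    for e in sub_edges:
        for n in e:
            degree[n] += 1
    kept = {n for n in members if n == ego or degree[n] >= min_degree}
    return kept, {e for e in sub_edges if e <= kept}

toy_edges = [
    ("ego", "a"), ("ego", "b"), ("ego", "c"),
    ("a", "b"),            # a and b are also linked to each other
    ("c", "x"),            # x is not adjacent to the ego
]
nodes, links = ego_network(toy_edges, "ego")
```

Here film c is a neighbor of the ego but has no further link inside the ego-network, so the degree filter removes it, while a and b remain.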

Another remark concerning the directed +− relation type:

It is informative to extract the film pairs A and B that are in

a reciprocal like-dislike (+−) relation. This means that some

users liked film A and disliked B, while other users liked

B and disliked A. From these pairs we can deduce which

were the controversial films for the Netflix users at the time of

data collection. The +− subnetwork with the reciprocal edges

contains 805 films (mainly classics, blockbusters, and popular

films of the considered time frame) and 975 reciprocal edges.

Interestingly, it seems that this network reveals subcultures

within the popular genres. Figure 5 shows as examples a

component of musicals from the 40s/50s (left) and one of

horror and mystery films from the 80s/90s (right).
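Extracting the reciprocal like-dislike pairs amounts to keeping every directed +− edge whose reverse edge is also present. A minimal sketch, assuming the +− projection is available as a list of (liked film, disliked film) tuples; the function name `reciprocal_pairs` and the toy film identifiers are hypothetical.

```python
def reciprocal_pairs(directed_edges):
    """Return unordered pairs {a, b} for which both a->b and b->a exist."""
    edge_set = set(directed_edges)
    return {
        frozenset((a, b))
        for a, b in edge_set
        if a != b and (b, a) in edge_set
    }

# Hypothetical like-dislike edges: (liked_film, disliked_film).
toy_plus_minus = [
    ("F1", "F2"), ("F2", "F1"),   # reciprocal: F1 and F2 are "controversial"
    ("F1", "F3"),                 # one-directional only
    ("F4", "F3"), ("F3", "F4"),   # reciprocal
]
pairs = reciprocal_pairs(toy_plus_minus)
controversial_films = set().union(*pairs)
```

Each reciprocal pair is reported once, regardless of direction, and the films involved form the node set of the reciprocal subnetwork.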

Next, to compare the connection patterns of the films in

the multiplex projection (++,−−,+−) and the aggregate

projection (all), we compute the Jaccard coefficient. In our

case this measures the pairwise overlap between the dif-

ferent networks in terms of their actor and edge sets. E.g.

the actor overlap between the ++ and −− projections is

equal to the intersection divided by the union of their actor

sets: $J^{a}_{++,--} = \frac{|L_{++} \cap L_{--}|}{|L_{++} \cup L_{--}|}$. Table II records the resulting

coefficients. Accordingly, there is a high actor overlap and

a very low edge overlap. Note that 13,015 out of 13,581 films show up in the union of the actor sets of the multiplex

(++,−−,+−) projection. Thus, the method is able to find

significant similarities for almost all films, even for those

with a small degree and co-occurrence in the bipartite graph.

The relatively high overlap between the edges contained in

the multiplex projection and the all network suggests that

the latter detects several significant rating patterns as well.

However, due to its inability to differentiate between their

distinct connotations the all projection remains uninformative

with respect to the ++ and −− relations and misleading

regarding the +− relation type. For example, the all network

contains an edge between Tootsie (1982) and Rosemary’s Baby

Fig. 4. Ego-network of Buffy the Vampire Slayer: Season 3 (black ellipse) showing highly interconnected groups of liked sequels belonging to the series Buffy and Angel. Also remarkable is how two films are steadily disliked by the fans of Buffy. Consistent with the color coding of previous figures, red lines indicate the like-like (++) relation, blue arrows stand for like-dislike (+−).

Fig. 5. Two components of the +− subnetwork containing only the reciprocal edges.

(1968). Here the edge between the romantic comedy and

the horror film is deceiving: the multiplex projection reveals

that this relation should actually be of type +− because

significantly many users liked Tootsie and at the same time


TABLE II
JACCARD COEFFICIENTS QUANTIFYING THE OVERLAP BETWEEN THE FILMS (UPPER TRIANGLE) AND THE LIKE/DISLIKE RELATIONS (LOWER TRIANGLE).

       ++       −−       +−       all
++     ×        0.58     0.67     0.70
−−     0.0239   ×        0.74     0.81
+−     0.0181   0.0027   ×        0.75
all    0.2151   0.1769   0.0459   ×

TABLE III
OVERLAP BETWEEN THE EDGES IN THE MULTIPLEX PROJECTION.

          |E′++| ∩ |E′−−|   |E′++| ∩ |E′+−|   |E′−−| ∩ |E′+−|
|E′++|    0.0568            0.0267            ×
|E′−−|    0.0398            ×                 0.0036
|E′+−|    ×                 0.0531            0.0104

disliked Rosemary’s Baby.
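The Jaccard coefficient used for Table II can be written down in a few lines. The sketch below applies the definition given above to both actor sets and edge sets; the helper names `jaccard` and `actor_set` and the toy projections are hypothetical, and edges are represented as unordered film pairs.

```python
def jaccard(set_a, set_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| (0.0 for two empty sets)."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

def actor_set(edges):
    """All films that appear as an endpoint of at least one edge."""
    return {film for edge in edges for film in edge}

# Toy projections (hypothetical film identifiers).
edges_pp = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("C", "D")]}
edges_mm = {frozenset(p) for p in [("A", "B"), ("C", "E"), ("D", "E")]}

actor_overlap = jaccard(actor_set(edges_pp), actor_set(edges_mm))
edge_overlap = jaccard(edges_pp, edges_mm)
```

The toy example mirrors the qualitative pattern of Table II: the two projections share most of their films (4 of 5) but only one of five distinct edges.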

Even more interesting are the overlaps between the relation

type sensitive projections because they reveal information

about the film liking/disliking patterns of users. Quantifying

the edge overlap based on the Jaccard coefficient is difficult

due to the normalization with the different union sizes. Ta-

ble III offers a more appropriate comparison. Accordingly,

from the shown overlaps the one between the like-like (++)

and the dislike-dislike (−−) patterns is the highest. This marks

films belonging to debated categories having at the same time

a clear fan-base and oppositional raters. The relatively high

probability of finding films that are both liked by some users

and perceived antagonistically by others indicates that user

preferences are not bound to categories (e.g. defined by genre),

but are also guided by other aspects (e.g. cast or director style).

For instance, the two sci-fi adventure films Back to the Future Part III (1990) and X-Men (2000) were both liked by some users, while others only liked X-Men and at the same time disliked Part III of Back to the Future. Interestingly, the overlap

between the disliked and antagonistically perceived films is

considerably lower. This reveals a steady consensus in their

case.
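The entries of Table III appear to normalize the size of each pairwise edge intersection by the number of edges of a single projection (the row) instead of by the size of the union. This reading is an assumption, but it is consistent with Table I: 0.0568 × 351,989 and 0.0398 × 502,149 both give roughly 20,000 shared edges. A minimal sketch of such a one-sided overlap follows; the function name `overlap_fraction` and the toy edge sets are hypothetical, and film pairs are compared without regard to direction.

```python
def overlap_fraction(edges_row, edges_other):
    """Share of the edges in `edges_row` that also occur in `edges_other`."""
    if not edges_row:
        return 0.0
    return len(edges_row & edges_other) / len(edges_row)

# Toy edge sets of different sizes (hypothetical film pairs).
small = {frozenset(p) for p in [("A", "B"), ("B", "C")]}
large = {frozenset(p) for p in [("A", "B"), ("C", "D"), ("D", "E"), ("E", "F")]}

share_of_small = overlap_fraction(small, large)   # 1 shared edge out of 2
share_of_large = overlap_fraction(large, small)   # 1 shared edge out of 4
```

Unlike the Jaccard coefficient, this measure is asymmetric: the same shared edge accounts for a larger share of the smaller edge set, which makes projections of different sizes easier to compare.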

V. DISCUSSION

The presented algorithm for the one-mode projection of

multiplex bipartite graphs extends the uniplex projection

method introduced by Zweig et al. [17], [18] and details the

original idea we used for a systems biology project [19]. The

adjustment of the uniplex method consisted in making the

random graph model sensitive to the relation types. This was

a necessary step to assure that we have the proper null model

for establishing the relevance of the edges in the projection.

Also, here we use a different statistical test for assessing

their significance. The empirical p-value has the advantages

that a) it does not make any assumptions concerning the

distribution of the initial co-occurrences; b) it expresses a

probability and is thus normalized and comparable; and c) it

does not directly depend on the degrees of the actors in the

bipartite graph.
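To make these properties concrete, the following sketch computes an empirical p-value in its common one-sided form: the fraction of null-model samples in which the co-occurrence of a film pair is at least as large as the observed one. The exact definition used by the method is given earlier in the paper and may differ in detail, so this form is an assumption; the function name `empirical_p_value` and the sample counts are hypothetical.

```python
def empirical_p_value(observed, null_samples):
    """Fraction of null-model samples with co-occurrence >= the observed one."""
    if not null_samples:
        raise ValueError("need at least one null-model sample")
    at_least = sum(1 for value in null_samples if value >= observed)
    return at_least / len(null_samples)

# Hypothetical co-occurrence counts of one film pair in 10 randomized graphs.
null_cooccurrences = [3, 5, 4, 6, 2, 5, 7, 4, 3, 5]
p_rare = empirical_p_value(9, null_cooccurrences)    # never reached
p_common = empirical_p_value(5, null_cooccurrences)  # reached in 5 of 10 samples
```

The result always lies between 0 and 1 and is obtained by counting alone, without fitting a distribution to the co-occurrences, which is what makes values for different film pairs directly comparable.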

We showed through the analysis of synthetic data that the

method is very stable in the range of noise we expect from

market basket analysis data. Future research will have to show

whether the algorithm is more sensitive in the case of difficult

regimes (e.g. actors with extremely low/high degree and initial

co-occurrence). The results shown here also clearly imply

that constructing meaningful synthetic data for multipartite,

multiplex networks with heterogeneous degree sequences is

an area for future research.

Based on the Netflix film ratings we analyzed the projections separately and found that the +− relation type differed from the others in many respects. Also, the aggregate

network (all) obtained from the bipartite graph where we did

not differentiate between the positive and negative connotation

of the ratings inevitably mixed up edges that should have had

different connotations in the projection and resulted in partly

uninformative/uninterpretable similarities. Relevant informa-

tion was contained in the projection where the three patterns

(co-like, co-dislike, and antagonistic perception of the films)

coexisted. In this vein, we took a first step in showing what

type of additional insight we can gain into the structure of the

multiplex projection.

VI. CONCLUSION AND FUTURE WORK

In this paper we presented a method for the one-mode

projection of multiplex bipartite graphs. We showed the sta-

bility of the approach on synthetic data and evaluated its

usefulness on Netflix data. We found that a proper handling

of multiplexity reveals valuable extra information about film

rating patterns. Interesting questions arise from the Netflix

analysis done here that represent the baseline for future work:

• How can we explain the difference in the clustering

structures of the three relation types (++,−−,+−)?

• Can we deduce a model for the film rating behavior of

the users?

• What is the benefit of incorporating aspects revealed here

into a film recommendation system?

We conclude that beyond the specific problem tackled here the

analysis of multipartite, multiplex networks is an important

future research area in complex network analysis that has not

yet been explored to its full potential. The versatile method

introduced in this paper is not limited to the Netflix-case: e.g.,

we also used a variant of it for a biological data set and were

able to identify three biomolecules that hinder the growth of an

especially lethal breast cancer type [19]. We thus expect that

its generality can be further exploited for multiplex bipartite

graphs in social and complex network analysis.

ACKNOWLEDGMENT

The authors would like to thank Andreas Spitz for use-

ful discussions and software and the anonymous reviewers

for helpful comments. EAH is supported by the Heidelberg

Graduate School of Mathematical and Computational Methods

for the Sciences, University of Heidelberg, Germany, which is

funded by the German Excellence Initiative (GSC 220).


REFERENCES

[1] d. boyd and K. Crawford, “Six provocations for big data,” in A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, September 2011.

[2] J. Park and A.-L. Barabasi, “Distribution of node characteristics in complex networks,” Proceedings of the National Academy of Sciences, vol. 104, no. 46, pp. 17 916–17 920, 2007.

[3] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, “Link communities reveal multiscale complexity in networks,” Nature, vol. 466, pp. 761–764, 2010.

[4] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

[5] M. McPherson, L. Smith-Lovin, and J. M. Cook, “Birds of a feather: Homophily in social networks,” Annual Review of Sociology, vol. 27, pp. 415–444, 2001.

[6] K. Lewis, J. Kaufman, M. Gonzalez, A. Wimmer, and N. Christachis, “Tastes, ties, and time: A new social network dataset using facebook.com,” Social Networks, vol. 30, pp. 330–342, 2008.

[7] M. Szell, R. Lambiotte, and S. Thurner, “Multirelational organization of large-scale social networks in an online world,” Proceedings of the National Academy of Sciences Early Edition, pp. 1–6, 2010.

[8] M. Szell and S. Thurner, “Measuring social dynamics in a massive multiplayer online game,” Social Networks, vol. 32, pp. 313–329, 2010.

[9] P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J.-P. Onnela, “Community structure in time-dependent, multiscale, and multiplex networks,” Science, vol. 328, pp. 876–878, 2010.

[10] N. Eagle, A. S. Pentland, and D. Lazer, “Inferring friendship network structure by using mobile phone data,” Proceedings of the National Academy of Sciences, vol. 106, pp. 15 274–15 278, 2009.

[11] P. Kazienko, K. Musial, and T. Kajdanowicz, “Multidimensional social network in the social recommender system,” IEEE Transactions on Systems, Man and Cybernetics Part A: Systems and Humans, vol. 41, no. 4, pp. 746–759, 2011.

[12] N. Li and G. Chen, “Multi-layered friendship modeling for location-based mobile social networks,” in Proceedings of Mobiquitous 2009 (MobiQuitous ’09), 2009, pp. 1–10.

[13] M. Magnani and L. Rossi, “The ML-model for multi-layer social networks,” in Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining (ASONAM ’11), 2011, pp. 5–12.

[14] Brodka, P. Stawiak, and P. Kazienko, “Shortest path discovery in the multi-layered social network,” in Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining (ASONAM ’11), 2011, pp. 497–501.

[15] D. Davis, R. Lichtenwalter, and N. V. Chawla, “Multi-relational link prediction in heterogeneous information networks,” in Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining (ASONAM ’11), 2011, pp. 281–288.

[16] “The Netflix Prize.” [Online]. Available: http://www.netflixprize.com/

[17] K. A. Zweig, “How to forget the second side of the story: A new method for the one-mode projection of bipartite graphs,” in Proceedings of the second International Conference on Advances in Social Networks Analysis and Mining (ASONAM ’10), 2010, pp. 200–207.

[18] K. A. Zweig and M. Kaufmann, “A systematic approach to the one-mode projection of bipartite graphs,” Social Network Analysis and Mining, vol. 1, no. 3, pp. 187–218, 2011.

[19] S. Uhlmann, H. Mannsperger, J. D. Zhang, E.-A. Horvat, C. Schmidt, M. Kublbeck, A. Ward, U. Tschulena, K. Zweig, U. Korf, S. Wiemann, and O. Sahin, “Global miRNA regulation of a local protein network: Case study with the EGFR-driven cell cycle network in breast cancer,” Molecular Systems Biology, vol. 570, p. 8, 2012.

[20] S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, “Network motifs in the transcriptional regulation network of Escherichia coli,” Nature Genetics, vol. 31, pp. 64–68, 2002.

[21] R. Milo, S. S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: Simple building blocks of complex networks,” Science, vol. 298, pp. 824–827, 2004.

[22] N. J. Gotelli and G. R. Graves, Null-Models in Ecology. Smithsonian Institution Press, 1996.

[23] A. Gionis, H. Mannila, T. Mielikinen, and P. Tsaparas, “Assessing data mining results via swap randomization,” ACM Transactions on Knowledge Discovery from Data, vol. 1, 2007.

[24] L. Zahoranszky, G. Katona, P. Hari, A. Malnasi-Csizmadia, K. Zweig, and G. Zahoranszky-Kohalmi, “Breaking the hierarchy – a new cluster selection mechanism for hierarchical clustering methods,” Algorithms for Molecular Biology, vol. 4, p. 12, 2009.

[25] M. Girvan and M. E. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences, vol. 99, pp. 7821–7826, 2002.

[26] A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Physical Review E, vol. 78, p. 046110, 2008.

[27] C. Campbell, S. Yang, R. Albert, and K. Sheab, “A network model for plant-pollinator community assembly,” Proceedings of the National Academy of Sciences, vol. 108, pp. 197–202, 2011.

[28] S. Saavedra, F. Reed-Tsochas, and B. Uzzi, “A simple model of bipartite cooperation for ecological and organizational networks,” Nature, vol. 457, pp. 463–466, 2009.

[29] J. Gomez-Gardenes, D. Vilone, and A. Sanchez, “Disentangling social and group heterogeneities: Public Goods games on complex networks,” European Journal of Physics, vol. 95, p. 68003, 2011.

[30] J. Ramasco, S. Dorogovtsev, and R. Pastor-Satorras, “Self-organization of collaboration networks,” Physical Review E, vol. 70, p. 036106, 2004.

[31] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature, vol. 393, pp. 440–442, 1998.
