# [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Recurrent Structural Motifs Reflect Characteristics of Distinct Networks



Recurrent structural motifs reflect characteristics of distinct networks

*Chen-Hsiang Yeang 1, Liang-Cheng Huang 1,2, Wei-Chung Liu 1

1 Institute of Statistical Science, Academia Sinica, Taipei, Taiwan

2 Department of Information Management, National Taiwan University, Taipei, Taiwan

*Corresponding author; address: 128 Academia Road, Section 2, Nankang 115, Taipei, Taiwan

tel: +886-2-27835611; email: chyeang@stat.sinica.edu.tw

Abstract: In large-scale networks, certain topological patterns may occur more frequently than expected from a null model that preserves global (such as the density of the graph) and local (such as the connectivity of each node) properties of the graph. These network motifs are the building blocks of large-scale networks and may confer functional/mechanistic implications of their underlying processes. Despite active investigations and rich literature in systems biology, network motifs are less explored in social network studies. In this work, we modified and improved the method from Milo et al. 2002 to detect significantly enriched motifs in both directed and undirected networks. We applied this method to identify 3-node and 4-node motifs from the datasets of 18 networks (4 directed and 14 undirected) covering social interactions, co-authorships, web document hyperlinks, neuronal circuitry, protein-protein interactions (PPI), trophic relations in a food web, and others. Presence and absence of enriched motifs provide rich information regarding each type of network relations. In undirected networks, triangles are enriched in almost all datasets, suggesting the prevalence of transitivity in diverse networks. However, 4-node structures lacking transitivity (diamonds and stars) are also enriched in the majority of undirected networks. In directed networks, variations of feed-forward loops are over-represented in the networks of web document and political weblog hyperlinks as well as neuronal connections. In contrast, the food web is enriched with unidirectional motifs with distinct trophic levels. These results reveal the nature of distinct types of networks and invite further explorations on the relations of network structures and types of relations.

Keywords: network motifs; directed graph; undirected graph; permutation tests

I. INTRODUCTION

Large-scale networks have ubiquitous presence in a wide range of phenomena. The sheer size of the networks, inter-dependencies between the constituting members, and diverse types of possibly encoded relations make the analysis of network data challenging. One important topic of large-scale network analysis is network motif detection. A motif is a recurrent pattern that appears in multiple instantiations of an application domain, such as biology (protein structural motifs and DNA sequence motifs), music (musical ideas of a symphony), literature (recurring narrative elements in a story), and visual arts (a graphical pattern that populates an Islamic textile artwork). These recurrent patterns are considered as building blocks of entities (a protein, the promoter of a gene, a symphony, a tapestry), and different combinations of a few motifs can generate enormously diverse forms. Similarly, a network motif is a graph structure (topology) that recurs in multiple parts of a large-scale network. The concept of network motifs was first coined by Alon in systems biology [8]. His team devised statistical methods to detect network motifs with significant enrichment [1]. Moreover, they provided functional arguments to support enrichment of certain network motifs in gene regulatory networks [2], and experimentally validated these arguments in selected cases [3]. Since then network motif detection has become a sub-discipline in systems biology.

In the arena of social networks, sociologists have intensively studied small network structures, particularly dyads (two-person interactions) and triads (three-person interactions), for a long time (e.g., [4], [5]). Faust counted the occurrences of all possible triads in a variety of social networks and used the occurrence count vectors to categorize the interactions encoded by those social networks [6]. However, to our knowledge systematic approaches to identify statistically enriched motifs with more than 3 nodes have been pursued primarily in the analysis of biological networks but are relatively rare in social networks.

In this work, we modify and improve the method from [1] to detect significantly enriched motifs in both directed and undirected networks. We apply our method to detect 3-node and 4-node motifs from the datasets of 18 networks (4 directed and 14 undirected) covering social interactions, co-authorships, web document hyperlinks, neuronal circuitry, protein-protein interactions, trophic interactions in a food web, and others. We first describe the concept of network motifs and the algorithm to exactly count network motifs in a graph. We then introduce a null model of generating randomized networks that retain key characteristics (node connectivity and the counts of lower-order motifs) of the empirical network. Motifs significantly over-represented in the empirical network according to the null model are reported. Finally, we discuss the functional/mechanistic implications of the presence/absence of enriched motifs.

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4799-2/12 $26.00 © 2012 IEEE. DOI 10.1109/ASONAM.2012.94


II. METHODS

A. Network motifs

Qualitatively, a network motif is a subgraph structure embodying multiple instantiations in a network. In graph theory, an isomorphism of graphs $G$ and $H$ is a bijection (one-to-one and onto mapping) between their vertex sets,

$$f : V(G) \to V(H),$$

such that any node pair $u, v \in V(G)$ are adjacent if and only if $f(u), f(v) \in V(H)$ are also adjacent. In other words, two graphs $G$ and $H$ are isomorphic if there exists a bijection mapping $f$ such that the adjacency matrices of $V(G)$ and $f(V(G))$ are identical. Isomorphism applies to both directed and undirected graphs.

All possible graphs with a fixed number of nodes are partitioned into equivalence classes according to isomorphisms. Graphs in the same equivalence class are isomorphic to each other, thus are considered as belonging to the same graph structure. We define a network motif as an equivalence class according to graph isomorphisms.
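The partition into equivalence classes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`canonical_form`, `is_connected`, `motif_classes`) are hypothetical, and the brute-force search over all node relabelings is only practical for the very small motif sizes considered here. Two graphs receive the same canonical label exactly when they are isomorphic.

```python
from itertools import combinations, permutations, product

def canonical_form(n, edges, directed=False):
    """Return a label shared by all graphs isomorphic to (n, edges).

    Brute force: try every relabeling of the n nodes and keep the
    lexicographically smallest sorted edge list.
    """
    best = None
    for perm in permutations(range(n)):
        relabeled = []
        for u, v in edges:
            a, b = perm[u], perm[v]
            if not directed and a > b:
                a, b = b, a  # undirected edges are stored as (small, large)
            relabeled.append((a, b))
        key = tuple(sorted(relabeled))
        if best is None or key < best:
            best = key
    return best

def is_connected(n, edges):
    """Weak connectivity check (edge directions are ignored)."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n

def motif_classes(n, directed=False):
    """One canonical representative per connected n-node equivalence class."""
    if directed:
        slots = [(u, v) for u in range(n) for v in range(n) if u != v]
    else:
        slots = list(combinations(range(n), 2))
    classes = set()
    for mask in product([0, 1], repeat=len(slots)):
        edges = [e for e, bit in zip(slots, mask) if bit]
        if is_connected(n, edges):
            classes.add(canonical_form(n, edges, directed))
    return classes

print(len(motif_classes(3)))                 # 2: chain and triangle
print(len(motif_classes(4)))                 # 6 connected undirected 4-node motifs
print(len(motif_classes(3, directed=True)))  # 13 connected directed 3-node motifs
```

Restricting the enumeration to connected graphs is a choice made in this sketch, in line with the connected subgraphs counted in the next subsection.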

B. Exact counting of network motifs

As the name suggests, a network motif is a small graph structure (e.g., a triangle) that possesses multiple instantiations in a much larger base network (e.g., the social network of all students in a school or the interaction network of all human proteins). To detect these recurrent motifs it is necessary to count the motif occurrences in a base network. The upper bound of motif occurrences is $O(n^m)$, where $n$ is the number of nodes in the base network and $m$ is the number of nodes in the motif. For relatively small base networks ($n \le 1000$) and motifs ($m \le 4$), we propose a decision-tree algorithm to efficiently count the exact number of motif occurrences. Briefly, we consider all the $m$-node connected subgraphs of the base network and count the total number of those subgraphs isomorphic to each motif. The root of the decision tree is an empty node, and its children include all nodes in the base network. The children of each non-empty node are its neighbors in the base network, excluding its ancestors in the decision tree. The depth of the decision tree is the motif size ($m$). A path from the root to a leaf corresponds to an $m$-node connected subgraph. To avoid double-counting, we only consider one traversing order among all the paths sharing the same nodes.
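The counting step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the decision-tree algorithm itself: it grows connected node sets one neighbor at a time and removes duplicates with a set instead of fixing one traversal order, and it handles undirected graphs only. The names `connected_subsets`, `motif_label`, and `count_motifs` are hypothetical.

```python
from collections import Counter
from itertools import combinations, permutations

def connected_subsets(adj, m):
    """All m-node subsets of an undirected graph that induce a connected subgraph."""
    frontier = {frozenset([v]) for v in adj}
    for _ in range(m - 1):
        grown = set()
        for nodes in frontier:
            # Extend each connected set by one node adjacent to any member.
            neighbors = set().union(*(adj[v] for v in nodes)) - nodes
            for w in neighbors:
                grown.add(nodes | {w})
        frontier = grown  # the set removes subsets reached in several orders
    return frontier

def motif_label(adj, nodes):
    """Canonical label of the subgraph induced by `nodes` (brute-force relabeling)."""
    nodes = sorted(nodes)
    best = None
    for perm in permutations(range(len(nodes))):
        index = dict(zip(nodes, perm))
        key = tuple(sorted(
            tuple(sorted((index[u], index[v])))
            for u, v in combinations(nodes, 2) if v in adj[u]
        ))
        if best is None or key < best:
            best = key
    return best

def count_motifs(adj, m):
    """Number of occurrences of each m-node motif, keyed by canonical label."""
    return Counter(motif_label(adj, s) for s in connected_subsets(adj, m))

# Hypothetical base network: a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
CHAIN = ((0, 1), (0, 2))
TRIANGLE = ((0, 1), (0, 2), (1, 2))
counts = count_motifs(adj, 3)
print(counts[TRIANGLE], counts[CHAIN])  # 1 2
```

Each subgraph is counted by the motif it induces, so the node set {0, 1, 2} contributes one triangle and no chains.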

The time complexity of the exact counting algorithm is $O(n d^{m-1})$, where $d$ is the maximum connectivity of nodes in $G$. It is much smaller than that of the brute-force approach $O(n^m)$ on sparse graphs where $d \ll n$.

C. Evaluating statistical significance of network motif frequencies

A fundamental hypothesis for the existence of network motifs is that certain graph structures (topologies) confer functional/mechanistic implications thus are over-represented in the base networks. To detect such functionally/mechanistically favorable network motifs, it is necessary to show that certain network structures are over-represented in the base networks, conditioned on a proper null hypothesis regarding the motif distributions in the base networks. The enrichment results depend on the choice of the null hypothesis. An ideal null hypothesis should capture the characteristics of the base network and exclude the information of the target motif frequencies. Two common null models were previously employed in a wide range of publications. The first null model controls the size and average density (# edges / # node pairs) of the network and generates random Erdős-Rényi graphs accordingly (e.g., [8]). This null model retains the global characteristics of the base network (size and density) but distorts the properties of individual nodes (e.g., connectivity of each node). The second null model randomly swaps edge pairs from the base network (e.g., [7]). A swap on an edge pair with distinct nodes (e.g., converting AB and CD into AD and CB) does not alter the connectivity of any node. Hence the perturbed networks retain the global (network size, average density, degree distribution) as well as local (connectivity of individual nodes) properties of the base network.
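The second null model can be sketched in a few lines of Python for undirected simple graphs. The helper names (`norm`, `swap_once`, `degrees`) and the example network are hypothetical, and rejecting swaps that would duplicate an existing edge is an assumption of this sketch. The final check illustrates that converting AB and CD into AD and CB leaves every node's connectivity unchanged.

```python
import random

def norm(u, v):
    """Store an undirected edge as an ordered pair (small, large)."""
    return (u, v) if u < v else (v, u)

def swap_once(edges, rng, max_tries=1000):
    """Attempt one swap (a, b), (c, d) -> (a, d), (c, b) on a set of edges.

    Returns a new edge set, or the input unchanged if no admissible
    swap was found within max_tries attempts.
    """
    edge_list = sorted(edges)  # fixed order so a seeded run is reproducible
    for _ in range(max_tries):
        (a, b), (c, d) = rng.sample(edge_list, 2)
        if rng.random() < 0.5:
            c, d = d, c  # an undirected edge may be read in either orientation
        if len({a, b, c, d}) < 4:
            continue  # the two edges must involve four distinct nodes
        new1, new2 = norm(a, d), norm(c, b)
        if new1 in edges or new2 in edges:
            continue  # assumption: keep the graph simple (no duplicate edges)
        return (edges - {norm(a, b), norm(c, d)}) | {new1, new2}
    return edges

def degrees(edges):
    """Connectivity (degree) of every node."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

# Hypothetical base network: an 8-node ring with two chords.
base = {norm(i, (i + 1) % 8) for i in range(8)} | {(0, 4), (2, 6)}
rng = random.Random(0)
perturbed = base
for _ in range(50):
    perturbed = swap_once(perturbed, rng)
print(degrees(perturbed) == degrees(base))  # True
```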

Despite the advantages of the second null model, edge swaps alone may not preserve the statistics of lower-order network motifs from the base network. To claim over-representation of an $m$-node network motif, we have to exclude the contributions from enrichment of network motifs with fewer than $m$ nodes. For instance, in networks where triangles are over-represented, the probability of observing any 4-node motif containing a triangle is also elevated compared to random Erdős-Rényi networks or edge-swapped networks. Yet this enrichment is due to over-representation of triangles instead of the true effects of the 4-node motif.

To incorporate the effects of lower-order motif statistics, we consider only the edge swaps that preserve the numbers of lower-order motifs. Figure 1 gives examples of one valid edge swap and one invalid edge swap. In the top diagram, the subgraph contains 1 triangle and 2 chains before and after the edge swap. In the bottom diagram, however, the subgraph contains 1 triangle and no chain before the edge swap and no triangle and 2 chains afterwards. In Figure 2, we describe a method to test whether an edge swap preserves occurrences of lower-order motifs.

Notice that when $m = 3$, edge swaps alone preserve the counts of 2-node connected motifs (edges). Hence the lower-order motif preservation criteria in Figure 2 are not needed. Moreover, the edge swap and validity test apply to both directed and undirected graphs.
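For $m = 4$, the validity criterion amounts to keeping the counts of the two connected 3-node motifs (triangles and chains) unchanged. The Python sketch below checks this by recounting over the whole graph, which is a simplification for illustration and not the local test of Figure 2. The function names and the two small example graphs are hypothetical (they are not the graphs drawn in Figure 1), and chains are counted as induced subgraphs, which is an assumption.

```python
from itertools import combinations

def build_adj(edges):
    """Adjacency sets of an undirected graph given as a collection of node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def three_node_counts(adj):
    """Return (triangles, chains), the counts of the connected 3-node motifs."""
    triangles = chains = 0
    for a, b, c in combinations(sorted(adj), 3):
        k = (b in adj[a]) + (c in adj[a]) + (c in adj[b])  # edges among a, b, c
        if k == 3:
            triangles += 1
        elif k == 2:
            chains += 1
    return triangles, chains

def swap_is_valid(edges, e1, e2):
    """True if swapping e1=(a, b), e2=(c, d) into (a, d), (c, b) keeps 3-node counts."""
    (a, b), (c, d) = e1, e2
    adj = build_adj(edges)
    if len({a, b, c, d}) < 4 or d in adj[a] or b in adj[c]:
        return False  # assumption: swaps that create duplicate edges are rejected
    swapped = (set(edges) - {e1, e2}) | {(a, d), (c, b)}
    return three_node_counts(adj) == three_node_counts(build_adj(swapped))

# Hypothetical graph 1: the triangle A-B-E is destroyed, but A-D-E is created.
g1 = {("A", "B"), ("C", "D"), ("A", "E"), ("B", "E"), ("D", "E")}
print(swap_is_valid(g1, ("A", "B"), ("C", "D")))  # True

# Hypothetical graph 2: the triangle A-B-E is destroyed and nothing replaces it.
g2 = {("A", "B"), ("C", "D"), ("A", "E"), ("B", "E")}
print(swap_is_valid(g2, ("A", "B"), ("C", "D")))  # False
```

A global recount costs far more than a local test restricted to the neighborhood of the swapped edges, so this version is only suitable for small illustrative graphs.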

Proposition: The edge swaps passing the test in Figure 2 preserve the counts of $(m-1)$-node motifs.

Proof: It suffices to show that only subgraphs obtained from the first step of the test can possibly alter the counts of $(m-1)$-node motifs. Since the only changes are the swapped edges ($(v_1, v_2), (v_3, v_4)$ before the edge swap and $(v_1, v_4), (v_3, v_2)$ after the edge swap), only subgraphs containing at least one swapped edge can possibly alter the counts of $(m-1)$-node motifs. Each such subgraph contains $(m-1)$ nodes, and at least two nodes belong to $\{v_1, v_2, v_3, v_4\}$. Hence there are at most $m-3$ additional nodes. Furthermore, only the nodes within distance $m-3$ from $\{v_1, v_2, v_3, v_4\}$ can be possibly contained in the connected $(m-1)$-node motifs. Q.E.D.

Figure 1. Top: a subgraph where an edge swap $AB, CD \to AD, BC$ preserves the counts of 3-node motifs. Bottom: a subgraph where an edge swap $AB, CD \to AD, BC$ annihilates a triangle and creates two chains.

Figure 2. A test verifying whether an edge swap operation preserves occurrences of $(m-1)$-node motifs.

Input: A graph $G$, network motif size $m > 3$, edge pair $(v_1, v_2), (v_3, v_4)$.

Output: Whether the edge swap $(v_1, v_2), (v_3, v_4) \to (v_1, v_4), (v_3, v_2)$ preserves the numbers of $(m-1)$-node motifs in $G$.

Procedures:

1) Exhaust all subgraphs containing $v_1, v_2, v_3, v_4$ and $k$ additional nodes, where $k \le m-3$. Moreover, the $k$ additional nodes are connected to at least one of $v_1, v_2, v_3, v_4$ with distance $\le m-3$.

2) For each of those subgraphs, count the number of each $(m-1)$-node motif before and after the edge swap.

3) If there exists at least one subgraph and $(m-1)$-node motif such that the motif counts before and after the edge swap on the subgraph are different, then the edge swap does not preserve the counts of $(m-1)$-node motifs. Otherwise the edge swap preserves the counts of $(m-1)$-node motifs.

To evaluate the statistical significance of an $m$-node network motif, we perturb the base network by performing valid edge swaps until all the nodes containing swappable edges have been swapped at once. 100 perturbed networks are generated, and the p-value is the fraction of perturbed networks whose motif counts exceed those of the empirical values.
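The permutation test can be sketched in Python for the simplest case, triangles ($m = 3$), where plain edge swaps already satisfy the validity criterion. Several details are assumptions of this sketch and not taken from the paper: the function names, the example network, the fixed number of attempted swaps per perturbed network (ten per edge, used in place of the stopping rule described above), and the rejection of swaps that would duplicate an edge.

```python
import random

def count_triangles(edges):
    """Number of triangles in an undirected graph given as (small, large) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every triangle is found once from each of its three edges.
    return sum(1 for u, v in edges for _ in adj[u] & adj[v]) // 3

def perturb(edges, n_swaps, rng):
    """Apply up to n_swaps degree-preserving edge swaps to a copy of the graph."""
    edges = set(edges)
    for _ in range(n_swaps):
        (a, b), (c, d) = rng.sample(sorted(edges), 2)
        if rng.random() < 0.5:
            c, d = d, c
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if len({a, b, c, d}) == 4 and new1 not in edges and new2 not in edges:
            edges -= {tuple(sorted((a, b))), tuple(sorted((c, d)))}
            edges |= {new1, new2}
    return edges

def empirical_p_value(edges, count_fn, n_perturbed=100, seed=0):
    """Fraction of perturbed networks whose motif count exceeds the empirical count."""
    rng = random.Random(seed)
    observed = count_fn(edges)
    exceed = sum(
        count_fn(perturb(edges, 10 * len(edges), rng)) > observed
        for _ in range(n_perturbed)
    )
    return exceed / n_perturbed

# Hypothetical base network: six triangles joined in a ring by single edges.
base = set()
for i in range(6):
    a, b, c = 3 * i, 3 * i + 1, 3 * i + 2
    base |= {(a, b), (b, c), (a, c), tuple(sorted((c, (3 * i + 3) % 18)))}

print(count_triangles(base))                     # 6
print(empirical_p_value(base, count_triangles))  # small when triangles are enriched
```

With 100 perturbed networks the smallest nonzero p-value is 0.01, so the resolution of the test is limited by the number of perturbed networks generated.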

In large and dense networks containing more than 1000 nodes and/or highly connected nodes, exact counting of the entire networks is either intractable or time-consuming. For each perturbed network, we identify the nodes undergoing swap operations and extract the subgraphs spanned by the swapped nodes in the empirical and perturbed networks. We then count network motifs in the subgraphs of the empirical and perturbed networks and evaluate the p-value accordingly. This counting is approximate, as the non-swapped nodes may also contribute to the differences in motif counts. Yet the motif counts in the selected subgraphs can faithfully reflect the order (not magnitude) of motif abundance between the empirical and perturbed networks. Thus the approximate counting would not drastically distort the true p-values.
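The approximate counting can be illustrated with a small Python sketch. The function names and the example networks are hypothetical, and identifying swapped nodes as the endpoints of edges present in only one of the two networks is an assumption about how the bookkeeping might be done.

```python
def swapped_nodes(empirical, perturbed):
    """Nodes incident to any edge that differs between the two networks."""
    changed = set(empirical) ^ set(perturbed)  # edges present in only one network
    return {v for edge in changed for v in edge}

def induced_edges(edges, nodes):
    """Edges of the subgraph spanned by `nodes`."""
    return {(u, v) for u, v in edges if u in nodes and v in nodes}

def count_triangles(edges):
    """Number of triangles in an undirected graph given as node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(1 for u, v in edges for _ in adj[u] & adj[v]) // 3

# Hypothetical empirical network: triangles 0-1-2 and 7-8-9, edges 3-4 and 5-6.
empirical = {(0, 1), (1, 2), (0, 2), (3, 4), (5, 6), (7, 8), (8, 9), (7, 9)}
# After two swaps: (0,1),(3,4) -> (0,4),(1,3) and (1,2),(5,6) -> (1,6),(2,5).
perturbed = {(0, 2), (0, 4), (1, 3), (1, 6), (2, 5), (7, 8), (8, 9), (7, 9)}

nodes = swapped_nodes(empirical, perturbed)
approx = (count_triangles(induced_edges(empirical, nodes)),
          count_triangles(induced_edges(perturbed, nodes)))
exact = (count_triangles(empirical), count_triangles(perturbed))
print(sorted(nodes))  # [0, 1, 2, 3, 4, 5, 6]
print(approx, exact)  # (1, 0) (2, 1)
```

In this example the untouched triangle 7-8-9 is excluded from both subgraph counts, so the magnitudes differ from the exact counts while the ordering between the empirical and perturbed networks is the same.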

Notice that our method shares the same goal as the one proposed by Milo et al.: to preserve the counts of lower-order motifs in the randomized graphs [1]. Milo et al. defined an energy function as the normalized difference between the counts of lower-order motifs in the empirical and randomized networks. They applied Metropolis Monte-Carlo sampling to generate randomized networks that minimized the energy function. Despite the relative simplicity of implementing the Metropolis Monte-Carlo sampling, the burn-in period for achieving zero energy and the nodes undergoing edge swaps are not explicitly controlled. In these regards our method of explicitly generating valid edge swaps is superior to the Metropolis sampling.

III. RESULTS

A. Data sources

We extracted 18 network datasets compiled from the webpage http://www-personal.umich.edu/~mejn/netdata/ and additional sources. Table I summarizes the general information of these networks. They include directed networks of hyperlinks of World-Wide-Web documents [9] and political weblogs [11], neuronal connections of the worm Caenorhabditis elegans...
