
2011 Seventh International Conference on Natural Computation

A Parallel Computing Model for Large-Graph Mining with MapReduce

Bin Wu, Yuxiao Dong, Qing Ke and Yanan Cai
School of Computer Science
Beijing University of Posts and Telecommunications
Beijing, China

Abstract—How can we quickly find the structures and characteristics of a large-scale graph? Large-scale graphs are everywhere: call graphs, the World Wide Web, Facebook networks and many more. The continued exponential growth in both the size and complexity of these graphs poses a new challenge to analysts and researchers, and a new class of algorithms and computing models is urgently needed to meet it. A promising direction for dealing with graphs of such size is the emerging MapReduce framework and its open-source implementation, Hadoop. The enumeration of 3-cliques in a graph is an important operation that supports structure mining, and a difficult task for very large graphs on a single computer. In this paper, we propose a parallel computing model for 3-clique enumeration in large-scale graphs, built on a cluster system with MapReduce. The enumeration first extracts the one-leap information of the graph, then the two-leap information, and finally performs key-based 3-clique enumeration. We also apply the computing model to the computation of the clustering coefficient. Finally, the model is applied to three real-world large call graphs, and the experimental results demonstrate its good scalability and efficiency.

Keywords-graph mining; social network analysis; MapReduce; clustering coefficient; 3-clique;

I. INTRODUCTION

Networks are ubiquitous. The analysis of networks such as the World Wide Web and social, computer and biological networks has attracted much attention recently, especially for large networks [1]. Common social network analyses run as sequential applications on single-computer systems over small to medium-sized datasets, and struggle with larger datasets due to large memory consumption and long execution times [2]. There is a clear need for high-performance, distributed social network analysis to enable scenarios with truly large data. The Google File System [3] and the MapReduce [4] approach have proved very successful for analyzing large data in parallel on huge clusters of computers [2].

Structure mining plays an important role in social network analysis. As a main task in this area, the problem of 3-clique enumeration has attracted much interest and has been studied from various angles in prior work. In this paper we focus on estimating the clustering coefficient of massive graphs by enumerating their 3-cliques. The clustering coefficient is one of the most interesting metrics of a network: $C = 3\Delta / \tau$, where $\Delta$ is the number of triangles in the network and $\tau$ is the number of connected triples [5]. The computing model can also serve as a basis for enumerating all maximal cliques. Due to the exponential growth in both the volume and the complexity of real-world data, computing the clustering coefficient of truly large data is an extremely challenging task [6].

Motivated by these challenges, in this work we provide a parallel computing model for extracting the 3-cliques of a network using Hadoop [7], the open-source implementation of MapReduce. With this computing model, we can compute the clustering coefficient in a scalable, highly optimized way. We chose Hadoop because it is freely available and because of the power and convenience it provides to the programmer [2]. We run our algorithms on large call graphs with millions of edges, where we achieve excellent scale-up.

The rest of this paper is organized as follows: Section 2 presents a short description of preliminaries. Section 3 describes the parallel computing model for 3-cliques in detail and also gives the process of computing the clustering coefficient. Section 4 presents an evaluation of our algorithm together with the experimental results. Finally, we conclude our work in Section 5.

II. PRELIMINARY

In this section, we introduce two definitions employed throughout this paper. We also introduce basic knowledge of the MapReduce framework and its open-source implementation, Hadoop.

A. Definition

Given a graph G, V(G) denotes its vertex set and E(G) its edge set. In this paper, we assume without loss of generality that G is simple and connected. For a vertex $v$ of G, let $\partial(v) = \{u \mid (u, v) \in E(G)\}$ denote the neighborhood of $v$, and $|\partial(v)|$ the degree of $v$. A clique can then be defined as follows: in an undirected graph G, a clique C is a complete subgraph of G, i.e., one in which any two vertices are adjacent [6]. Based on this definition, a k-clique is a clique containing k vertices (k >= 3).

Two versions of this clustering coefficient exist: the global and the local. The global clustering coefficient is designed to give an overall indication of the clustering in the network,


whereas the local version gives an indication of the embeddedness of single nodes. From the local clustering coefficients we can also obtain the network average clustering coefficient. For undirected, unweighted networks, the global clustering coefficient is defined as

$$C = \frac{3\Delta}{\tau}$$

where $\Delta$ is the number of triangles in the network and $\tau$ is the number of connected triples. The factor three accounts for the fact that each triangle can be treated as consisting of three different connected triples, one with each of its vertices as the central vertex, and ensures that $0 \le C \le 1$. A triangle is a set of three vertices in which each vertex can be reached from each other [5].
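As a quick check of this definition, consider a small toy graph of our own (not one used in the paper) on vertices $\{a, b, c, d\}$ with edges $ab$, $ac$, $bc$ and $cd$. Each vertex of degree $k_i$ is the center of $\binom{k_i}{2}$ connected triples, so

$$\Delta = 1 \;(\text{the triangle } abc), \qquad \tau = \sum_i \binom{k_i}{2} = \binom{2}{2} + \binom{2}{2} + \binom{3}{2} + \binom{1}{2} = 1 + 1 + 3 + 0 = 5, \qquad C = \frac{3\Delta}{\tau} = \frac{3}{5}.$$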

The local clustering coefficient of a vertex $i$ in an undirected graph is defined as

$$C_i = \frac{2 e_i}{k_i (k_i - 1)}$$

where $e_i$ is the number of edges among the neighbors of vertex $i$ and $k_i$ is the degree of vertex $i$. Equivalently,

$$C_i = \frac{\Delta_i}{T_i}$$

where $\Delta_i$ is the number of triangles that include vertex $i$ and $T_i$ is the number of connected triples in which $i$ is incident to both edges. The network average clustering coefficient is given by Watts and Strogatz [8] as the average of the local clustering coefficients over all $n$ vertices:

$$\bar{C} = \frac{1}{n} \sum_{i=1}^{n} C_i.$$
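Continuing the toy example above (again ours, not the paper's): vertex $c$ has neighbors $\{a, b, d\}$, so $k_c = 3$, and the only edge among those neighbors is $ab$, so $e_c = 1$ and

$$C_c = \frac{2 \cdot 1}{3(3-1)} = \frac{1}{3}, \qquad \text{which agrees with} \qquad \frac{\Delta_c}{T_c} = \frac{1}{3}.$$

With the common convention that $C_i = 0$ for vertices of degree one,

$$\bar{C} = \frac{1}{4}\left(1 + 1 + \frac{1}{3} + 0\right) = \frac{7}{12}.$$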

B. MapReduce Framework

MapReduce is a programming framework for processing huge amounts of unstructured data in a massively parallel way on a group of machines collectively referred to as a cluster. Computation can take place on data stored in a distributed file system. The framework is inspired by the map and reduce functions commonly used in functional programming. In the Map step, the master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. In the Reduce step, it takes the answers to all the sub-problems and combines them to obtain the output. The MapReduce framework has two advantages: (a) the programmer is oblivious to the details of data distribution, replication, load balancing, etc., and (b) the programming concept is familiar, resembling functional programming [9].
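To make the flow concrete, here is a toy single-JVM analogy of the map, shuffle and reduce steps written with Java streams. It only illustrates the programming concept, not Hadoop itself, and all names in it are ours:

    import java.util.*;
    import java.util.stream.*;

    public class MapReduceToy {
      public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b", "b c", "a c");
        // Map: split each line into words (each word plays the role of a (key, 1) pair).
        // Shuffle + Reduce: group equal words together and sum their counts.
        Map<String, Long> counts = lines.stream()
            .flatMap(l -> Arrays.stream(l.split(" ")))
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        System.out.println(counts); // e.g. {a=2, b=2, c=2}
      }
    }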

Hadoop is an open-source implementation of the MapReduce framework. It provides a distributed file system (HDFS) to store data and a means of running applications on large clusters built of commodity hardware, placing computation near the data.

III. THE COMPUTING MODEL FOR 3-CLIQUE

The input of the computing model is an undirected graph, represented as node pairs. The map() function of the MapReduce framework receives one node pair each time it reads a line of the input file.

The computing model for 3-cliques consists of two MapReduce processes. In this section, we introduce the model as a graph-manipulation process, taking the graph in Fig. 1 as a running example.

Figure 1. The Example Graph

A. One-Leap Process

To obtain the adjacency list of the input graph, we launch a MapReduce task.

The Mapper of the task is described in Table 1. The key-value pair <u, v> denotes an edge of the input graph. The output is <u, v> with u < v, since we only consider undirected graphs. The entire list is stored locally as an intermediate result that will be accessed by the Reducers.

Table 1 also describes the Reducer of the task. The Reducer collects the key-value pairs emitted by the Mappers with the same key. The output key is u, and the output value is the adjacency list of u; by the ordering above, all vertices in the adjacency list of u are larger than u. We demonstrate the first MapReduce procedure in Fig. 2, with the graph in Fig. 1 as input. A Java sketch of this job follows Table 1.

TABLE I. ONE-LEAP MAPREDUCE PROCESS

One-Leap Mapper
  Input: u, v
  Key: line position
  Value: u, v
  Map: output <u, v>  (with u < v)

One-Leap Reducer
  Input: <u, v>
  Key: u
  Value: ∂(u)
  Reduce: output <u, ∂(u)>
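A minimal Java sketch of this one-leap job on Hadoop's 0.20-era API is given below. The paper does not publish its code, so the tab-separated "u v" input format, the integer vertex ids and all class names are our assumptions; duplicate input edges would simply produce duplicate list entries.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class OneLeap {
      // Map: emit each undirected edge keyed by its smaller endpoint.
      public static class OneLeapMapper
          extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable pos, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] uv = line.toString().split("\t");
          int u = Integer.parseInt(uv[0]), v = Integer.parseInt(uv[1]);
          if (u != v)  // ignore self-loops
            ctx.write(new IntWritable(Math.min(u, v)), new IntWritable(Math.max(u, v)));
        }
      }

      // Reduce: collect the larger neighbors of u into its adjacency list.
      public static class OneLeapReducer
          extends Reducer<IntWritable, IntWritable, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable u, Iterable<IntWritable> vs, Context ctx)
            throws IOException, InterruptedException {
          StringBuilder adj = new StringBuilder();
          for (IntWritable v : vs) {
            if (adj.length() > 0) adj.append(',');
            adj.append(v.get());
          }
          ctx.write(u, new Text(adj.toString()));  // <u, adjacency list of u>
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "one-leap");
        job.setJarByClass(OneLeap.class);
        job.setMapperClass(OneLeapMapper.class);
        job.setReducerClass(OneLeapReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }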

B. Two-Leap Process

The one-leap information of a specific vertex is the adjacency list of that vertex, namely ∂(v), the neighbors of v. To extract a 3-clique, we must obtain the two-leap information of a given vertex: the adjacency lists of the vertices adjacent to it, namely ∂(∂(v)), the neighbors of v's neighbors. The process is presented in Table 2.


Figure 2. The MapReduce Process of One-Leap and Two-Leap

Figure 3. The Process of Enumerating 3-Cliques

Taking the graph in Fig. 1 as an example: with an index in main memory, obtaining the two-leap information of a specific vertex is a direct manipulation. But in a distributed system, the adjacency list given as input is generally split into several partitions, so the information about one vertex may be located on different slave nodes that have no knowledge of each other.

To recover the two-leap information of a vertex, we now describe our solution, which is illustrated in Fig. 2. Given the record <2, <3, 4, 6>> in the adjacency list, we transform it into the records:

<3, < 2-4, 6 >>

<4, < 2-3, 6 >>

<6, < 2-3, 4 >>

We perform the same operation on all adjacency lists, finally obtaining intermediate information in which each record contains a vertex, one of its neighbors, and the neighbors of that neighbor. This transformation is carried out by the Mapper in Table 2; the transformation function itself is presented in Table 3, and a Java sketch of it follows that table.

After the transformation, the MapReduce framework collects all the two-leap information of each vertex, from which we can extract all the 3-cliques.

TABLE II. TWO-LEAP MAPREDUCE PROCESS

Two-Leap Mapper
  Input: <u, ∂(u)>
  Key: line position
  Value: <u, ∂(u)>
  Map:
    Function Get two-leap information
    output two-leap list

Two-Leap Reducer
  Input: <v, List(u + ∂(u) - v)>
  Key: v
  Value: List(u + ∂(u) - v)
  Reduce:
    Function Enumerate 3-clique
    output 3-cliques

TABLE III. FUNCTION GET TWO-LEAP INFORMATION

Function Get two-leap information(u', ∂(u')):
  for each v' ∈ ∂(u')
    output <v', u' + (∂(u') - v')>
  end for
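The following single-machine Java sketch shows the effect of this function; in the real system the logic runs inside the two-leap Mapper, and the record encoding "u-n1,n2,..." is our assumption about how the pair u + (∂(u) - v) could be serialized.

    import java.util.*;

    public class TwoLeap {
      // For a record <u, adj(u)>, emit <v, "u-(adj(u) minus v)"> for each v in adj(u).
      static List<Map.Entry<Integer, String>> twoLeap(int u, List<Integer> adj) {
        List<Map.Entry<Integer, String>> out = new ArrayList<>();
        for (int v : adj) {
          List<String> rest = new ArrayList<>();
          for (int w : adj)
            if (w != v) rest.add(String.valueOf(w));
          out.add(new AbstractMap.SimpleEntry<>(v, u + "-" + String.join(",", rest)));
        }
        return out;
      }

      public static void main(String[] args) {
        // The record <2, <3, 4, 6>> yields <3, 2-4,6>, <4, 2-3,6>, <6, 2-3,4>,
        // matching the three records listed above.
        twoLeap(2, Arrays.asList(3, 4, 6)).forEach(e ->
            System.out.println("<" + e.getKey() + ", " + e.getValue() + ">"));
      }
    }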

C. Enumerating 3-Cliques

Take vertex 4 as an example; its collected two-leap information is:

<4, <1-2, 3>>

<4, <2-3, 6>>

<4, <3->>

We can detect 3-cliques from these two-leap records. The complete demonstration of this procedure is presented in Fig. 3. When we process the tree rooted at vertex 4, we can extract four 3-cliques: (1, 2, 4), (1, 3, 4), (2, 3, 4) and (1, 2, 3). The 3-clique (1, 2, 3), however, can also be extracted from the two-leap information of vertex 3. To solve this redundancy problem, we only generate and output a 3-clique whose key is the maximum vertex in it. This step is carried out by the Reducer in Table 2, and the detailed algorithm is given in Table 4, followed by a Java sketch.

TABLE IV. FUNCTION ENUMERATE 3-CLIQUE

Function Enumerate 3-clique(v, List(u + ∂(u) - v)):
  take v as the root of a tree T
  for each u in List(u + ∂(u) - v)
    take u as a child of v
  end for
  for each u1 in the children of v
    for each u2 (not equal to u1) in the children of v
      if (edge <u1, u2> exists)
        output (v, u1, u2)
    end for
  end for
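A single-machine Java sketch of this enumeration step is shown below; it reproduces the vertex-4 example above. The string encoding of the two-leap records and the helper names are our assumptions.

    import java.util.*;

    public class EnumerateTriangles {
      // Given vertex v and its two-leap records "u-n1,n2,...", emit each
      // triangle (u, w, v) where w is a neighbor of u and also a child of v.
      static List<int[]> triangles(int v, List<String> records) {
        Set<Integer> children = new HashSet<>();          // v's tree children
        Map<Integer, Set<Integer>> nbrs = new HashMap<>();
        for (String r : records) {                        // parse "u-n1,n2,..."
          String[] parts = r.split("-", 2);
          int u = Integer.parseInt(parts[0]);
          children.add(u);
          Set<Integer> s = new HashSet<>();
          if (parts.length > 1 && !parts[1].isEmpty())
            for (String n : parts[1].split(",")) s.add(Integer.parseInt(n));
          nbrs.put(u, s);
        }
        List<int[]> out = new ArrayList<>();
        for (int u : children)
          for (int w : nbrs.get(u))
            // Edge <u, w> exists; the lists hold only larger vertices,
            // so u < w also guards against duplicates.
            if (children.contains(w) && u < w)
              out.add(new int[]{u, w, v});                // v is the maximum vertex
        return out;
      }

      public static void main(String[] args) {
        // Vertex 4's records from Section III.C (written without spaces):
        for (int[] t : triangles(4, Arrays.asList("1-2,3", "2-3,6", "3-")))
          System.out.println(Arrays.toString(t));         // [1,2,4] [1,3,4] [2,3,4]
      }
    }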


D. Computing the Clustering Coefficient

In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterized by a relatively high density of ties [8, 10]. In this paper, we take the computation of the local clustering coefficient and the network average clustering coefficient as examples of using the 3-clique computing model.

Based on the extracted 3-cliques, the local clustering coefficient and the network average clustering coefficient can be computed through two MapReduce processes. The first MapReduce job is described in Table 5. The input of this job includes the 3-clique information and the degree of every vertex. Depending on which kind of input record it reads, the Mapper emits the corresponding output; in the Reducer, the clustering coefficient is computed from the collected keys and values.

TABLE V. LOCAL CLUSTERING COEFFICIENT

Mapper
  Input: 3-clique records (u, v, w) and degree records (m, k)
  Key: line position
  Value: (u, v, w) or (m, k)
  Map:
    if (input == (u, v, w))
      for x in {u, v, w}: output <x, 1>
      end for
    else
      output <m, k>

Reducer
  Input: <x, List(1, 1, 1, ..., k)>
  Key: x
  Value: List(1, 1, 1, ..., k)
  Reduce:
    t = 1 + 1 + ... + 1        (the number of 3-cliques containing x)
    tr = k (k - 1) / 2
    C = t / tr
    output <x, C>

After getting the local clustering coefficient of every vertex, computing the network average clustering coefficient with the MapReduce framework is simple: the process reduces to computing an average, as presented in Table 6. A single-machine Java sketch covering both computations follows Table 6.

TABLE VI. NETWORK AVERAGE CLUSTERING COEFFICIENT

Mapper
  Input: x, C
  Key: line position
  Value: x, C
  Map: output <cc, C>        (a single constant key cc)

Reducer
  Input: <cc, List(C1, C2, ..., Cn)>
  Key: cc
  Value: List(C1, C2, ..., Cn)
  Reduce:
    C = (C1 + C2 + ... + Cn) / n
    output <cc, C>
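To tie Tables 5 and 6 together, here is a single-machine Java sketch of both computations. It uses the four 3-cliques of the Fig. 1 example and vertex degrees we derived from the adjacency records in Fig. 2 and Section III.C; the 0-for-degree-one convention is our assumption.

    import java.util.*;

    public class ClusteringCoefficient {
      public static void main(String[] args) {
        // Inputs: extracted 3-cliques and per-vertex degrees (Fig. 1 example graph).
        int[][] cliques = {{1, 2, 4}, {1, 3, 4}, {2, 3, 4}, {1, 2, 3}};
        Map<Integer, Integer> degree = new HashMap<>();
        degree.put(1, 3); degree.put(2, 4); degree.put(3, 3);
        degree.put(4, 3); degree.put(6, 1);

        // "Map + shuffle": each 3-clique contributes 1 to each of its three vertices.
        Map<Integer, Integer> t = new HashMap<>();
        for (int[] c : cliques)
          for (int v : c) t.merge(v, 1, Integer::sum);

        // "Reduce": C_i = t_i / (k_i (k_i - 1) / 2), then average over all vertices.
        double sum = 0;
        for (Map.Entry<Integer, Integer> e : degree.entrySet()) {
          int k = e.getValue();
          double c = k < 2 ? 0 : t.getOrDefault(e.getKey(), 0) / (k * (k - 1) / 2.0);
          System.out.println("C(" + e.getKey() + ") = " + c);
          sum += c;
        }
        System.out.println("average cc = " + sum / degree.size()); // 0.7 here
      }
    }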

IV. EXPERIMENTS

In this section, we describe experiments in which we implemented the computing model described above and ran it on three different CALL graphs.

A. Hardware Description

We set up a Hadoop cluster environment composed of one master and 32 computing nodes (Intel Xeon 2.50GHz x 4, 8GB RAM, Ubuntu 10.10) with 8TB of total storage capacity. The cluster is interconnected through a 1000Mbps switch, and the deployed Hadoop version is 0.20.0. All algorithms are implemented in Java.

B. Datasets Description

The basic statistics of the three datasets are summarized in Table 7. These datasets come from Call Detail Records (CDR) in three different cities in China. In these datasets, each vertex stands for a service number and each edge for the calls between two people. All identities in the graphs have been replaced by unique numbers to protect private information.

TABLE VII. GRAPH INFORMATION

Graph     N            M            avg. degree   density
CALL A    872 118      2 807 203    6.438         3.691e-6
CALL B    1 632 055    5 676 890    6.957         2.131e-6
CALL C    12 107 977   29 372 757   4.852         2.004e-7

C. Results

Table 8 shows the network average clustering coefficients computed by the parallel computing model described above. More importantly, we measured the running time of the algorithms as the number of computing nodes increases. Fig. 4 and Fig. 5 show that as the number of nodes grows from 2 to 4, 8, 16 and 32, the computing model processes the data with excellent scalability; that is, the speedup ratio increases nearly linearly with the number of nodes.

TABLE VIII. NETWORK AVERAGE CLUSTERING COEFFICIENTS

Graph   CALL A    CALL B    CALL C
cc      0.05425   0.05719   0.04208


Figure 4. Running time and scalability of cc on CALL A

Figure 5. Running time and scalability of cc on CALL B

V. CONCLUSION AND FUTURE WORK

In this paper, a novel parallel computing model based on the MapReduce framework is introduced for extracting 3-cliques from extremely large graphs. The model comprises extracting the one-leap information, extracting the two-leap information, and enumerating the 3-cliques of the graph. The most important step in extracting 3-cliques is obtaining the two-leap information, a process the MapReduce framework handles well. We also apply the computing model to the computation of the local clustering coefficient and the network average clustering coefficient.

More generally, some graph mining operations, especially structure mining, can be solved in the MapReduce manner. The 3-clique structure can be applied to many social network analysis tasks beyond the computation of the clustering coefficient, for example community detection and evolution, frequent pattern extraction, etc. Here, we have described not only our algorithmic results but also the speedup ratios obtained using MapReduce in a cluster environment.

Graph mining is a foundation of social network analysis. This work explores large-graph mining problems in a parallel manner with MapReduce. Our future work will continue to address large-graph mining problems in the cluster environment and to construct a graph mining framework on top of MapReduce.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation of China (Grant No. 60905025, 90924029, 61074128), the National High Technology Research and Development Program of China (No. 2009AA04Z136), the National Key Technology R&D Program of China (2006BAH03B05) and the Fundamental Research Funds for the Central Universities.

REFERENCES

[1] D. Wegener, M. Mock, D. Adranale and S. Wrobel, "Toolkit-based high-performance data mining of large data on MapReduce clusters," 2009 IEEE International Conference on Data Mining Workshops.

[2] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez and D. U. Hwang, "Complex networks: structure and dynamics," Physics Reports, 424(4-5):175-308, February 2006.

[3] S. Ghemawat, H. Gobioff and S. Leung, "The Google File System," 19th ACM Symposium on Operating Systems Principles, Lake George, NY, 2003.

[4] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Sixth Symposium on Operating Systems Design and Implementation, pp. 137-149, San Francisco, CA, USA, 2004.

[5] L. da F. Costa, F. A. Rodrigues, G. Travieso and P. R. Villas Boas, "Characterization of complex networks: a survey of measurements."

[6] Bin Wu, Shengqi Yang, Haizhou Zhao and Bai Wang, "A distributed algorithm to enumerate all maximal cliques in MapReduce," 2009 International Conference on Frontier of Computer Science and Technology.

[7] Hadoop. http://hadoop.apache.org

[8] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature 393(6684):440-442.

[9] Bin Wu, Shengqi Yang, Haizhou Zhao, Yuan Gao and Lijun Suo, "CosDic: towards a comprehensive system for knowledge discovery in large-scale data," 2009 IEEE/WIC/ACM International Conference on Web Intelligence, 2009.

[10] P. W. Holland and S. Leinhardt, "Transitivity in structural models of small groups," Comparative Group Studies 2:107-124.