ActiveDBC: learning Knowledge-based Information propagation in mobile social networks
Jiho Park1 • Jegwang Ryu1 • Sung-Bong Yang1
© Springer Science+Business Media, LLC 2017
Abstract Due to the fast-growing use of mobile devices such
as smartphones and wearable devices, rapid information
propagation in the Mobile Social Network environment
is very important. In particular, the information
transmission of people with repeated daily patterns in
complex areas such as big cities calls for meaningful
analysis. We address the problem of identifying a key
player who can quickly propagate the information to the
whole network. This problem is often referred to as
the information propagation problem. In this research, we
selected the top-k influential nodes to learn the knowledge-
based movements of people by using a Markov chain
process in a real-life environment. Subsequently, their
movement probabilities according to virtual regions were
used to ensure appropriate clustering based on the Density-
based Spatial Clustering of Applications with Noise
(DBSCAN) algorithm. Since the movement patterns in the university
campus data involve a dense collection of people, the
DBSCAN algorithm was useful for producing very dense
groupings. After clustering, we also elected the top-k
influential nodes based on the results learned from the
score of each node according to groups. We determined the
rate at which information spreads by using trace data from
a real network. Our experiments were conducted in the
Opportunistic Network Environment simulator. The results
showed that the proposed method has outstanding perfor-
mance for the level of spreading time in comparison to
other methods such as Naïve, Degree, and K-means. Furthermore,
we compared the performance of RandomDBC
with that of ActiveDBC, showing that the latter extracts
the influential top-k nodes more effectively and achieves
superior performance.
Keywords DBSCAN · Markov chain · Machine learning · Real trace data · Opportunistic network environment simulator
1 Introduction
As the usage of smart devices is increasing, applications
of Mobile Social Networks (MSNs) are becoming more
popular. MSNs can be viewed as a type of socially-aware
Delay Tolerant Network (DTN), and they
promise to provide diverse communication services by
interconnecting the usage of mobile devices and social
networks. With short range wireless communication tech-
niques such as Wi-Fi and Bluetooth, mobile users share
information via opportunistic links [1]. Since the mobile
users receive information by employing opportunistic
contacts, their mobility pattern can be a significant basis for
information propagation. However, balancing the social tie
and mobility based on intermittent and uncertain network
connectivity in MSNs is a challenging problem [2–4]. For
example, if the user changes his interests or social prefer-
ences, he would be willing to forward the information that
he is currently interested in, not the one that he is not.
Therefore, a model of analyzing the dissemination phe-
nomenon is necessary to efficiently capture the realistic
features of information propagation in MSNs. In some
& Jiho Park
Jegwang Ryu
Sung-Bong Yang
1 Department of Computer Science, Yonsei University, Seoul,
Korea
Wireless Netw
DOI 10.1007/s11276-017-1608-9
approaches [2, 5–10], data dissemination is similar to the
information propagation model, which means that one or
more special mobile users are intentionally deployed to
spread the information. The actions of these special users are
controllable to facilitate connectivity among other users.
Such special users take the burden of data dissemination
away from general users, and thereby save the limited
energy and storage resources of all users. Typically,
the main focus of the influence maximization problem
[3, 11–14] over the past decade has been on optimization;
that is, the goal is to maximize the information
propagation through a given network, or to find the top-k
nodes with the highest expected degree of influence given the
social information. Finding such special users for
information spreading plays an increasingly important role as the
simulation time gets longer. The top-k nodes
that efficiently maximize the information spread are the k-influential
nodes, and the method of selecting these nodes is the
key issue in this research.
In this paper, we introduce a new method, named ActiveDBC,
for propagating information based on the collected
information of each node. During the initialization
period, the influential nodes are defined by the expected
number of active nodes, i.e., the extracted k-influential
nodes are in the active state before spreading information to
other inactive nodes. Then, active nodes can affect other
inactive nodes until the maximum number of nodes is
active within the given simulation time. The main objective of this
study is to explore the properties of each node according to
its movement patterns and to construct groups from
real trace data in order to select proper top-k influential nodes
based on their movement patterns. The detailed phases are
condensed into the three main points described below.
First, ActiveDBC performs a Markov chain process [15]
to analyze the movements of the nodes for prediction
purposes. Considering the nature of people in real life, we
analyzed the movements of the nodes and eventually
translated them into a pattern of inheritance. Therefore, the
Markov chain is best suited to obtain the convergence
probability of each node according to the movement pat-
terns [16].
Second, based on the movement patterns analyzed by
the Markov chain, we implement efficient grouping by
applying the Density-based Spatial Clustering of Applica-
tions with Noise (DBSCAN) algorithm [17, 18]. The
results we obtained by employing this algorithm to analyze
real data enabled us to verify the performance of the
algorithm compared to other clustering methods. Although
there may be outliers, DBSCAN is an effective method for
grouping in a concentrated area that we used for our
investigation. Therefore, we attempted to empirically
reduce the number of outliers with a variety of parameters.
After the clustering procedure, we also adopted a scoring
rule based on the learned knowledge to elect the top-k
influential nodes.
Third, we implemented this approach by using the
Opportunistic Network Environment (ONE) simulator [19],
which was designed to specifically evaluate MSN [20] or
Delay Tolerant Network (DTN) [21] protocols and focuses
on the network layer without considering the details of the
physical layer. We applied our methods to realistic envi-
ronments and based the real trace data on the set of human
geotagged traces that were experimentally obtained in an
area on and around the National Cheng-Chi University
(NCCU) campus [22]. The NCCU dataset has been
appropriately applied to our proposed idea by using
the DBSCAN algorithm. Our detailed contributions are as
follows:
• The proposal of a Markov chain-based model to
analyze and predict node movement patterns. Based
on real world human behaviors, node movement
patterns can be translated into patterns of inheritance
with high convergence probability.
• The use of the DBSCAN algorithm with outlier-reducing
parameters to implement an efficient clustering method
on the analyzed movement patterns. After node clus-
tering, we select the top-k influential nodes based on
knowledge learned on the movement patterns.
The remainder of this paper is organized as follows. In
Sect. 2, we briefly introduce the Markov chain model and
DBSCAN algorithm with some of their characteristics. Our
ActiveDBC algorithm is presented in Sect. 3. Section 4
explains the real trace data and provides the analysis of our
experimental results. Finally, the conclusions in this paper
are presented in Sect. 5.
2 Background
2.1 Information propagation model in MSNs
In complex and time-varying networks, information prop-
agation modeling is an interesting issue for mobile social
networks. Many data dissemination methods in MSN
environments have been proposed from different perspectives,
such as influence maximization models including
[3, 6, 9, 11], information diffusion models including [2, 8],
and alarm dissemination models including [23]. However,
most related research in MSNs is based on data dissemination
without learning the nodes' movement patterns, and
such an approach does not yield optimized solutions. In
everyday life, for example, the movement patterns of people
correlate with their lifecycle, and they live according to
their movement patterns. Using these patterns is a much
more efficient way of dealing with data dissemination.
Unlike the existing work, in this paper we assume that
movement patterns with infrastructure can be trained during
an initialization period; i.e., if we choose the top-k
influential nodes for effective data dissemination based on
the pre-trained movement patterns, the information will not only spread
quickly but its propagation will also be maximized. Note
that the use of social relations (e.g., community-based) can
also be effective, and such applications are
considered in [4, 12, 13]. In contrast, we
investigate application scenarios in which the top-k
influential nodes are chosen by the pre-trained movement
patterns and propagate information in a community-based,
intermittently connected infrastructure.
2.2 Markov-Chain model
Commonly, a Markov chain model is used to represent
a discrete stochastic process. Its defining property is that
the future state of the system depends only on the
system's present state and is independent of the history of
previous events. The model can be computed as a series of
state transitions, in which each
state passes to another at each time step according to
fixed probabilities. A stochastic process whose transition
probability to a future state depends only on the present
state is defined as a first-order Markov process [15]. In a
stochastic process $\{X(t),\ t \in T\}$, the Markov chain model can
be expressed as in Eq. (1).
$$P(X_{t+1} = i_{t+1} \mid X_0 = i_0, X_1 = i_1, \ldots, X_t = i_t) = P(X_{t+1} = i_{t+1} \mid X_t = i_t) \tag{1}$$
where P is the conditional probability of a future event, and
$i_t$ is the process state at time t. Suppose that a Markov chain
process has n possible states at any given time. At the
nth observation period, the probability of the system
being in a particular state depends on its state at the (n−1)th
period. Define $a_{ij}$ to be the probability of the system being
in state i after it was in state j at the previous observation; with
these $a_{ij}$ we create the matrix $P = [a_{ij}]$, called the transition
matrix. This matrix P consists of transition probabilities,
and the probabilities in each column must sum to 1. A typical
transition probability matrix P is defined as in Eq. (2).
$$P = \begin{bmatrix} P_{11} & \cdots & P_{1m} \\ \vdots & \ddots & \vdots \\ P_{n1} & \cdots & P_{nm} \end{bmatrix}, \quad P_{ij} \ge 0, \quad \sum_{i=1}^{n} P_{ij} = 1 \tag{2}$$
where n and m are the numbers of condition states, and $P_{ij}$
represents the probability that the condition passes from
state i to state j during a certain time step. In this way, if the
initial distribution X(0) is known, the future condition can be
obtained by Eq. (3) after several time steps.
$$X(t) = X(0) \cdot P^{t} \tag{3}$$
If the Markov process is ergodic, there is a unique
steady-state distribution X with positive entries.
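The repeated application of the transition matrix in Eq. (3) can be illustrated with a minimal Python sketch. The 3-state matrix and the function names `step` and `propagate` are illustrative assumptions, not taken from the paper. The matrix is column-stochastic, matching the column-sum constraint in Eq. (2), so the update is written as x ← P x; under the row-vector form of Eq. (3) one would use the transpose instead.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def step(P: Matrix, x: Vector) -> Vector:
    """One Markov step for a column-stochastic P: returns the product P x."""
    n = len(x)
    return [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]


def propagate(P: Matrix, x0: Vector, t: int) -> Vector:
    """Apply t Markov steps to the initial distribution x0."""
    x = list(x0)
    for _ in range(t):
        x = step(P, x)
    return x


# Hypothetical 3-state chain; every COLUMN sums to 1
# (P[i][j] is the probability of moving from state j to state i).
P_example: Matrix = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]
x0_example: Vector = [1.0, 0.0, 0.0]

x1 = propagate(P_example, x0_example, 1)    # distribution after one step
x50 = propagate(P_example, x0_example, 50)  # close to the steady state
```

Because this example chain is ergodic, the distribution after many steps no longer depends noticeably on the starting state, which is the property the steady-state analysis relies on.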
2.3 Density-based clustering algorithm (DBSCAN)
The key function of the DBSCAN algorithm is to facilitate
the determination of arbitrarily shaped groups. It is
a typical density-based clustering
algorithm [17] with several useful
characteristics. First, it can learn clusters of arbitrary
shape; second, it can distinguish noise points from
clusters; and last, it is efficient for large spatial
networks. We assume a set O of n objects, in which an object lies in a
dense region if it has at least a certain number of neighbors (minPoints)
within a specified range (epsilon, ε), where minPoints and
ε are the initial input parameters. The main idea is to
find clusters by starting from each object. However,
determining both of these parameters is not
a trivial problem. The following definitions are used in
the DBSCAN algorithm.
Definition 1 (ε-neighborhood) $N_\varepsilon(p) = \{q \in O \mid d(p, q) \le \varepsilon\}$. The ε-neighborhood of an object $p \in O$,
denoted as $N_\varepsilon(p)$, is the set of objects q whose distance from p is
less than or equal to ε.
Definition 2 (Core object) An object p is a core object if $|N_\varepsilon(p)| \ge$ minPoints.
Definition 3 (Border object) An object q is a border object if $|N_\varepsilon(q)| <$ minPoints and q is density-reachable from a core object p.
Definition 4 (Noise) An object q is noise if $|N_\varepsilon(q)| <$ minPoints
and q is not density-reachable from any core object.
Definition 5 (Density-reachability) An object q is
directly density-reachable from an object p if $q \in N_\varepsilon(p)$ and $|N_\varepsilon(p)| \ge$ minPoints.
Definition 6 (Density-connectedness) Two objects p and
q are density-connected if they are density-reachable
through a chain of connected core objects $\{p_1, \ldots, p_n\}$, where $p_1 = p$ and $p_n = q$, such that $p_{i+1}$ is density-reachable
from $p_i$ for all $i \in \{1, \ldots, n-1\}$.
The DBSCAN algorithm constructs clusters by randomly
choosing an unlabeled object p and performing the
epsilon query on p. If $|N_\varepsilon(p)| \ge$ minPoints, a
new cluster C is created and the query is repeated for all $q \in N_\varepsilon(p)$
to expand the cluster until no further core object is found. Then,
object p and all of its density-connected objects are
assigned a cluster label. The algorithm terminates when all
unlabeled objects are processed to form new expanding
clusters.
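The procedure described above can be sketched in plain Python as follows. This is a minimal illustration over a precomputed distance matrix, not the implementation used in the paper; the function names, the label convention (1, 2, ... for clusters, -1 for noise), and the one-dimensional sample points are assumptions made for the example. For reproducibility, the sketch visits objects in index order rather than in random order.

```python
from typing import List

NOISE = -1
UNVISITED = 0


def region_query(dist: List[List[float]], p: int, eps: float) -> List[int]:
    """Indices q with d(p, q) <= eps (Definition 1; the result includes p)."""
    return [q for q in range(len(dist)) if dist[p][q] <= eps]


def dbscan(dist: List[List[float]], eps: float, min_points: int) -> List[int]:
    """Return one label per object: 1, 2, ... for clusters and -1 for noise."""
    n = len(dist)
    labels = [UNVISITED] * n
    cluster = 0
    for p in range(n):
        if labels[p] != UNVISITED:
            continue
        neighbors = region_query(dist, p, eps)
        if len(neighbors) < min_points:
            labels[p] = NOISE        # may be relabeled later as a border object
            continue
        cluster += 1                 # p is a core object: start a new cluster
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster  # border object reached from a core object
            if labels[q] != UNVISITED:
                continue
            labels[q] = cluster
            q_neighbors = region_query(dist, q, eps)
            if len(q_neighbors) >= min_points:
                queue.extend(q_neighbors)  # q is a core object: keep expanding
    return labels


# Hypothetical 1-D sample: two dense groups and one isolated point.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 30.0]
dist_example = [[abs(a - b) for b in points] for a in points]
labels_example = dbscan(dist_example, eps=1.0, min_points=2)
```

With these sample values the two dense groups receive separate cluster labels and the isolated point is reported as noise, which mirrors the role of outliers discussed later in the paper.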
3 System overview
3.1 Overview
Generally, an information propagation or diffusion prob-
lem is focused on reducing the overall time and increasing
the propagation speed to other neighbors. This requires a
similar type of group to be clustered because information
only needs to be sent once to delegates who have properties
similar to neighboring nodes. We model the relationship
between nodes with undirected graphs, G = (V, E), where
V denotes a set of all nodes and E denotes a set of links
between nodes based on contact frequency.
Before spreading information (or messages), we assume
that the whole stage is divided into two phases. The first
phase is to learn the movement information of each node,
which is called the initialization period, and the second
phase is to efficiently propagate the information based on
the learned data. In the initialization period, a server is
concerned only with predicting the movement patterns
based on a Markov chain process. Then, the DBSCAN
algorithm is applied to create a meaningful group and the
top-k influential nodes are identified through active learn-
ing. After all the above processes are completed in the
initialization period, the server distributes information of
each node to the top-k nodes. We describe and illustrate the
general information propagation model in Sect. 3.2. In
Sect. 3.3, we present the algorithms and methods behind
ActiveDBC in a step-by-step manner by utilizing a Markov
chain and DBSCAN clustering.
3.2 Information propagation model
Based on previous studies [4], the information propagation
model is similar to the diffusion minimization problem and
can be described as follows. We assume that each
node can be either active or inactive. Active nodes are the
adopters of the information and are ready to propagate the
information to their inactive neighbors. When two nodes make contact,
the state can change only from inactive to active, never in the
opposite direction. The more frequently node u contacts its
neighbor node v, the more likely node v is to become informed
and switch to the active state. From the social behavior point of
view, people most likely share information with their
best friends or with those they encounter frequently. First, an initial set of
active nodes is selected. When an active node
contacts inactive nodes, the inactive nodes become active
with a certain probability, and this continues until all the nodes are
active. Then, the information propagation process
terminates. Given a weighted graph G = (V, E), let V be the set
of all nodes, S be the initial set of active nodes, k be the
expected number of initially active nodes, and $s(S, V)$ be the information
propagation time obtained with the initially selected node set S.
The information propagation model is then
formulated by the following equation:
$$\underset{S \subseteq V}{\arg\min}\; s(S, V), \quad |S| \le k \tag{4}$$
Under the probabilistic information propagation model,
the contact frequency, expressed as the edge weight, is an
important factor because it determines the transition from
the inactive to the active state.
Figure 1 illustrates an example of the general information
propagation model at times t0 and t1. Assume that we
select the top-k influential node (in this case node p) before
time t0, and that the communication range of all nodes is x.
Here, nodes f, g, and q are in the inactive state at time t0,
and there is no contact among them at all. However, at time
t1, when p and q meet each other, node q changes its state
to active. On the other hand, even though f and g also meet
each other, they undergo no change and remain in
the inactive state, because neither of them is active. In this way, the state of any node may
change at any given time t.
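The scenario of Fig. 1 can be replayed with a small Python sketch of the active/inactive model. The contact list, the function name `spread`, and the default activation probability of 1.0 are assumptions made for illustration; the paper derives the activation probability from contact frequencies, which this sketch does not attempt to reproduce.

```python
import random
from typing import Dict, Iterable, Optional, Set, Tuple

Contact = Tuple[float, str, str]  # (time, node_u, node_v)


def spread(contacts: Iterable[Contact], seeds: Set[str],
           activation_prob: float = 1.0,
           rng: Optional[random.Random] = None) -> Dict[str, float]:
    """Return the activation time of every node that becomes active.

    Seeds are active from time 0. A contact changes state only when exactly
    one side is active, and only in the direction inactive -> active.
    """
    rng = rng or random.Random(0)
    activated: Dict[str, float] = {s: 0.0 for s in seeds}
    for t, u, v in sorted(contacts):
        if (u in activated) != (v in activated):
            target = v if u in activated else u
            if rng.random() < activation_prob:
                activated[target] = t
    return activated


# Hypothetical contacts at time t1 = 1.0: p meets q, and f meets g.
contacts_example = [(1.0, "p", "q"), (1.0, "f", "g")]
result_example = spread(contacts_example, seeds={"p"})
```

In this run q becomes active at t1 because it meets the seed p, while f and g stay inactive because neither of them holds the information when they meet.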
3.3 Major steps of ActiveDBC
ActiveDBC utilizes a Markov chain process before it
applies DBSCAN clustering, because the movement pattern of each
node can then be predicted and nodes with the same pattern fall into the same
cluster. Before propagating information to other
nodes, all these steps are performed during the initializa-
tion period. Ultimately, we select the top-k influential
nodes by using the Markov chain process and DBSCAN
clustering. Initially, we can use the Markov chain method
to learn the mobility pattern of each of the nodes to predict
Fig. 1 Information propagation model at time t0 and t1. Nodes are
labeled p, q, f, and g. The red and blue dots represent active and
inactive nodes, respectively
future movement paths. Generally, ActiveDBC consists of
the following major steps.
Step 1 Collecting contact information
In this step, each node sends a probing message and
timestamp ts at regular time intervals during the initial-
ization period, where s = 1…u and u is the time at which
the initialization period terminates. We assume that each
node knows the wall-clock time [24].
Figure 2 illustrates how node a creates a contact frequency
vector as time proceeds. In the initial state, the
vector of node a is set to zeros (0, 0, 0, 0, 0). At time t1, the
contact count for its own state is automatically increased by 1,
which indicates that node a did not meet anyone. At times
t2 and t3, node a collects contact information from other
nodes. Similarly, the other nodes build their own contact frequency
vectors.
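The bookkeeping in Step 1 can be sketched as follows. The function name `contact_vector`, the node ordering, and the encounter sequence are hypothetical; the actual values shown in Fig. 2 are not reproduced here.

```python
from typing import List, Sequence


def contact_vector(node: str, nodes: Sequence[str],
                   encounters: Sequence[Sequence[str]]) -> List[int]:
    """Build the contact frequency vector of `node`.

    `encounters[s]` lists the nodes met during probing interval s. An empty
    interval increments the node's own entry, as described for time t1.
    """
    index = {name: i for i, name in enumerate(nodes)}
    vec = [0] * len(nodes)
    for met in encounters:
        if not met:
            vec[index[node]] += 1      # nobody met: count the node's own state
        else:
            for other in met:
                vec[index[other]] += 1
    return vec


nodes_example = ["a", "b", "c", "d", "e"]
# Hypothetical intervals t1, t2, t3: nobody, then b and c, then c again.
vector_a = contact_vector("a", nodes_example, [[], ["b", "c"], ["c"]])
```

Each node would build such a vector independently, and the server later stacks the vectors into the temporary matrix used in Step 2.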
Step 2 Constructing a transition matrix P
At the given time u, the server accumulates the contact
information of all the nodes. As shown in Fig. 3, the server
generates the temporary matrix (b) from the contact frequency
vectors (a) of each node; (b) has the form of a
symmetric matrix. The server then normalizes the
temporary matrix to obtain the transition probability matrix
P (c), in which every element is calculated by dividing the
element by the sum of the elements in the corresponding
column. For example, the first column of (c) is (0.12 0.25
0.37 0.12 0.12), because the column entries of
a probability vector should sum to 1 (up to rounding). In this way, all columns
are converted to probability values; these are later multiplied
by the steady-state vector in the Markov chain process of
Sect. 2.2.
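The column normalization of Step 2 can be sketched as follows. The count matrix is hypothetical: its first column (1, 2, 3, 1, 1) was chosen only because it yields 0.125, 0.25, 0.375, 0.125, 0.125, which agrees with the truncated values quoted for Fig. 3; the remaining entries are invented to keep the matrix symmetric.

```python
from typing import List


def column_normalize(counts: List[List[float]]) -> List[List[float]]:
    """Divide every element by the sum of its column (all-zero columns stay 0)."""
    n = len(counts)
    col_sums = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    return [
        [counts[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical symmetric contact-count matrix for five nodes.
counts_example = [
    [1, 2, 3, 1, 1],
    [2, 1, 0, 1, 0],
    [3, 0, 1, 0, 0],
    [1, 1, 0, 2, 0],
    [1, 0, 0, 0, 3],
]
P_step2 = column_normalize(counts_example)
first_column = [row[0] for row in P_step2]
```

After normalization every column is a probability vector, which is the column-stochastic form assumed by the steady-state computation in Step 3.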
Step 3 Selecting promising probability vectors
After constructing the transition probability matrix P,
the server also generates the initial distribution X(0) of
probability vectors to obtain steady-state vectors. The
distribution X(0) consists of $\binom{N}{k}$ initial probability vectors
whose elements are all possible combinations of (N − k)
zeros and k entries equal to 1/k. For example, the initial distribution contains
$\binom{5}{2} = 10$ vectors when N = 5 and k = 2. Here, N is the
number of nodes and k is the number of nodes assumed to be selected as
candidate seed nodes. As time proceeds to t, the set
$X(t) = \{x_i(t) \mid 1 \le i \le \binom{N}{k}\}$ of probability vectors is
computed by Eq. (5).
$$x_j(t) = \begin{bmatrix} r_1^t \\ r_2^t \\ \vdots \\ r_N^t \end{bmatrix} \tag{5}$$
where $r_1^t, \ldots, r_N^t$ are the probability values of the nodes at time t.
Then we use the stationary distribution of the Markov
chain to select promising probability vectors because this
depends on the initial distribution. The long-term behavior
property of the Markov chain with the transition matrix P
results in a unique probability vector q such that Pq = q.
The vector q is known as the steady-state vector, which can be
found by solving the homogeneous linear system in
Eq. (6).
$$(I - P)\,q = 0 \tag{6}$$
The vector q represents the probability that each node
has the information as t → ∞. In Fig. 4, we can obtain the
mobility pattern probability of a location from the vector q.
For example, the matrix P is 5 × 5 and holds the
location probabilities of the places each node has visited. Then,
each node is multiplied by the vector q, which is the probability
of a future visit. Since the first entry, 0.260, is the highest
probability value in vector q, each node can predict the
Fig. 2 Contact frequency
vectors of node a at time t1, t2,
and t3
probability of moving forward in accordance with its
highest probability values.
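Step 3 can be sketched in Python under two stated assumptions: P is column-stochastic (as produced by Step 2), and the steady-state vector is approximated by power iteration instead of solving Eq. (6) directly. The helper names and the 2 × 2 example matrix are hypothetical.

```python
from itertools import combinations
from typing import List


def initial_vectors(n: int, k: int) -> List[List[float]]:
    """All vectors with k entries equal to 1/k and n - k zeros."""
    return [
        [1.0 / k if i in chosen else 0.0 for i in range(n)]
        for chosen in combinations(range(n), k)
    ]


def steady_state(P: List[List[float]], x0: List[float],
                 tol: float = 1e-12, max_iter: int = 10_000) -> List[float]:
    """Iterate x <- P x until successive iterates differ by less than tol."""
    x = list(x0)
    n = len(x)
    for _ in range(max_iter):
        nxt = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical 2-state column-stochastic matrix; solving Pq = q by hand
# gives q = (5/6, 1/6).
P_small = [
    [0.9, 0.5],
    [0.1, 0.5],
]
starts = initial_vectors(2, 1)                  # [[1, 0], [0, 1]]
limits = [steady_state(P_small, x0) for x0 in starts]
count_5_2 = len(initial_vectors(5, 2))          # number of vectors for N = 5, k = 2
```

For this ergodic example both starting vectors converge to the same q, so the choice among the initial vectors affects the transient behavior rather than the limit.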
Step 4 Creating a graph for DBSCAN
In order to create clusters according to the mobility
patterns of the nodes, we obtain the Euclidean distances
by using the steady-state vector q. According to the probabilities
in vector q, each node moves following the mobility pattern
of its future locations. The server transforms vector q
into a similarity matrix S using the Euclidean distances
between the steady-state vectors $v_q^i$ and $v_q^j$. Our aim is to
construct the similarity matrix S by minimizing the Euclidean
distance in Eq. (7).
$$\text{minimize}\; D(x, y) = \sqrt{\sum_{i=1}^{n} \left(v_q^{i} - v_q^{j}\right)^{2}} \tag{7}$$
The similarity
matrix S computed in this way is a symmetric matrix.
Then, we apply the DBSCAN algorithm to the
similarity matrix S in Step 5.
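A pairwise Euclidean distance matrix of the kind used as input to DBSCAN can be computed as in the following sketch. The function name `distance_matrix` and the sample vectors are assumptions for illustration; in the paper the inputs would be the per-node vectors derived from the steady-state vector q.

```python
import math
from typing import List


def distance_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Symmetric matrix S with S[i][j] = Euclidean distance between vectors i and j."""
    n = len(vectors)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j])))
            S[i][j] = S[j][i] = d   # fill both triangles: the matrix is symmetric
    return S


# Hypothetical per-node vectors: nodes 0 and 2 share the same pattern.
vectors_example = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
S_example = distance_matrix(vectors_example)
```

Nodes with identical patterns end up at distance 0 and therefore fall into the same ε-neighborhood for any positive epsilon.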
Step 5 Finding primitive clusters
The DBSCAN algorithm uses the Euclidean distances of
the similarity of nodes to capture the relationships among
the primitive clusters. This is why the core
objects are determined according to epsilon, which is
measured as a Euclidean distance. Thus, primitive clusters
are created at a given time t, and there may be two primitive
clusters with mutual density-connectedness. Generally, two
connected primitive clusters belong to the same
cluster. Ultimately, depending on the values of minPoints
and epsilon, if a node is density-reachable from another
object of the cluster, it is also part of the cluster (by
Definition 6 in Sect. 2.3). In other words, the probability
of belonging to the same cluster increases.
However, the DBSCAN algorithm produces outliers
because not all nodes are reachable from other nodes
(by Definition 4 in Sect. 2.3). Hence, it is very
important to choose epsilon appropriately when the
primitive clusters are initially found. The final clustering
graph is completed by repeating this procedure in
the same way. Figure 5 shows the result of DBSCAN
clustering in our environment, where the units represent
the nodes in the NCCU dataset. Figure 5(a) presents the results
that were obtained by using the Naïve method with random
Fig. 3 Construct transition probability matrix P
Fig. 4 Example for selecting
promising probability vectors
q using the Markov chain
process
movements, and Fig. 5(b) presents the result after
DBSCAN clustering with minPoints 4 and epsilon 0.1.
Step 6 Selecting k-influential nodes
After all the steps have been carried out, it is important to
extract the k-influential nodes in each cluster for efficient
information propagation. The naïve approach, which we call
RandomDBC, randomly selects a k-influential node
from the clusters. This method chooses a node
without using information such as the relationships between nodes.
Thus, it is inefficient because it uses no
learned knowledge, such as the strong edges of the
current cluster graph. Preliminary knowledge obtained by
learning from the strong edges and degrees enables weak-tie
nodes or unknown nodes to be eliminated faster. Therefore,
the general idea is to learn actively among the nodes in the
current cluster graph, which we consider the most effective way
of choosing the k-influential nodes. ActiveDBC first evaluates
the importance of the degree of each node. Then, the nodes with the highest
scores, compared with the
degrees of the surrounding nodes, are chosen as the top-k nodes. At any given time t, the
knowledge-based information learned for each node,
denoted as score(d), is defined as in Eq. (8)
score dð Þ ¼ 2� uð Þ �Xp;q2C
1
E p; qð Þ ; ð8Þ
where � uð Þ is the degree when a node encountered another
node in the same cluster, $p, q \in C$ signifies that the nodes
belong to the same cluster, and $E(p, q)$ is the number of degrees
when an encounter occurs between nodes p and q at time
t. The score(d) is calculated based on the division of the sum of the
degrees of all nodes by the current number of neighboring
nodes. This scoring rule is formulated based on learning the
knowledge-based information of all nodes. Hence, based
on the information learned earlier, the server sprays the information to the top-k
nodes with high scores. In contrast to RandomDBC,
ActiveDBC is capable of effectively selecting the top-k
nodes. Thus, it is a better information propagation
method with high accuracy.
4 Experiments
4.1 Real trace data
Unlike synthetic models, whose movements do not
reflect the everyday lives of people, models based on a
real trace consider specific places, such as the locations of the
office, school, and home, and even the time of day. Real
trace models have been reported in the past, e.g., Infocom06
[25], Cambridge [26], and MIT Reality [27].
Although these real trace models follow human movements
through the real environment, specific groups act similarly
in that they supposedly perform the same actions or have
similar interests in the same building [28]. Since we
assume that people usually go to a certain location
depending on their interests and schedules, the NCCU trace
data [22] are well suited as a representation of real human
movements. The NCCU students are not restricted to any
specific place and are free to move around the campus. All
students in this university walk around according to their
class schedule or for some purpose. Thus, 115 students
with mobile devices running an Android app were traced
over a period of 2 weeks. Furthermore, since the NCCU
trace data was designed to be location-aware by recording
the GPS position once every 10 min, we utilized the
Fig. 5 Clustering results: a Naïve method and b DBSCAN clustering
location coordinates by dividing the entire map into nine
segments. Figure 6 shows a map of the area surrounding
NCCU with the location coordinates and the nine segments
into which we divided this area. These nine segments are
denoted as virtual regions and they enable us to implement
our proposed idea, because the server can determine the
position of each node according to its location coordinates.
4.2 Simulation methods
We verified our proposed method, ActiveDBC, by comparing
its performance with that of other comparable
methods, namely Naïve, Degree, and K-means. The following
summarizes the methods we used for the comparison
given a simulation time.
• Naïve The top-k influential nodes are randomly selected
without social properties or learned knowledge.
Information is disseminated to other nodes unconditionally.
We refer to this method as Naïve
because it is the simplest method.
• Degree This approach is very similar to the degree-centrality
problem [29] of social networks. In order to
select the top-k influential nodes, the Degree method
compares the degree of a node with that of all the other
nodes. The weighted node with the highest degree is
chosen as an influential node. Although it is very
simple to compute, the method is ineffective with
respect to isolated nodes.
• K-means The most similar approach to our proposed
method. K-means clustering [30] is effective for
quickly and optimally identifying the top-k influential
nodes. The method aims to partition the nodes into
k clusters in which each node belongs to the cluster
with the nearest mean. Initially,
K-means randomly chooses k nodes from all the nodes
and uses these as the initial means. Then, it repeatedly
assigns nodes to the closest center and updates the
clusters. It uses the Euclidean distance to measure
the distance between nodes. For this method, it is also
important to select the k nodes appropriately.
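As a point of reference for the Degree baseline described above, the following Python sketch ranks nodes by weighted degree. It assumes that the weighted degree of a node is the sum of its contact counts and that ties are broken alphabetically; both choices, the function name `top_k_by_degree`, and the sample graph are illustrative assumptions rather than details from the paper.

```python
from typing import Dict, List


def top_k_by_degree(weights: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Return the k nodes with the largest weighted degree.

    weights[u][v] is the contact count between u and v; the weighted degree
    of u is the sum of its row. Ties are broken by node name.
    """
    degree = {u: sum(nbrs.values()) for u, nbrs in weights.items()}
    return sorted(degree, key=lambda u: (-degree[u], u))[:k]


# Hypothetical contact graph; node d is isolated.
weights_example = {
    "a": {"b": 3, "c": 1},
    "b": {"a": 3},
    "c": {"a": 1},
    "d": {},
}
top2 = top_k_by_degree(weights_example, 2)
```

The isolated node d can never be selected by this rule, and a seed chosen this way has no special ability to reach it, which illustrates the weakness with respect to isolated nodes noted above.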
4.3 Simulation setup
Our simulation used the ONE simulator [19] and the map
of the area surrounding NCCU [22] to validate our pro-
posed idea.
Table 1 summarizes the parameters of the simulation
environments. The map of NCCU covers an area of
3764 m × 3420 m, and the movement model is the NCCU
trace data, consisting of movement data collected from the
mobile devices of 115 students. The total data collection
period included an initialization period of 1500 s during
which the mobility pattern of each node was learned. This
Fig. 6 Data acquisition and segmentation: a Area surrounding the National Cheng-Chi University and b the nine virtual regions
Table 1 Simulation setting
Parameter (unit)                  Value (default)
Number of nodes                   115
Area (m²)                         3764 × 3420
Movement model                    NCCU trace data
Interval of message behavior      Student behavior
Top-k influential nodes           4, 5, 6, 7 (4)
Communication ranges (m)          1, 5, 10, 15, 20 (10)
Epsilon ε for DBSCAN              0.01, 0.1, 0.5 (0.1)
minPoints within ε                3, 4, 5 (4)
Initialization period (s)         1500
Simulation time (s)               15,000
Fig. 7 Results with variation of active nodes according to communication ranges
Fig. 8 Results with variation of active nodes according to top-k influential nodes
training process was conducted by processing the Markov
chain and obtaining the clustering groups. Subsequently,
the simulation was carried out for a period of 15,000 s,
which was the information propagation time defined above
in Sect. 3. When the total simulation time has elapsed,
most of the nodes have an active status. The communication
range indicates the scope for communication with other
nodes. A relatively small communication range is
indicative of a sparse environment. However, we do not
consider the buffer size and message size of the nodes,
because our model focuses on the extent to which the
information becomes widespread.
Extensive experiments were conducted to obtain
optimal values for the parameters in each environment.
It is especially important to specify appropriate values for the
variables (minPoints and epsilon) of the DBSCAN algorithm.
Note that establishing an influential node is an important
matter. Thus, we iteratively learn the scoring of each node at
the end of the initialization period for our proposed Acti-
veDBC. Additionally, we measured the percentage of active
nodes within a given simulation time and the simulations
were conducted 20 times to obtain the average results.
4.4 Simulation results
4.4.1 Communication ranges
In the experiments described in this section, the number of
active nodes was measured for different communication
ranges, because our aim was to determine the extent to which
the information was spread among the nodes as x values, as
mentioned in Sect. 3. Figure 7(a) compares the overall
percentage of active nodes with communication ranges of 5, 10,
15 and 20 m. The smaller communication ranges can be seen
as relatively sparse environments. In these settings, the
results confirm the superior performance of our proposed
ActiveDBC compared to the other methods. Conversely, the 20 m
communication range produces a high node density, because all
nodes are within easy reach of each other without requiring a
special movement pattern. Hence, all methods show similar
tendencies. Moreover, since isolated nodes exist in the
NCCU trace model as a mobility property, some nodes
remain inactive. Accordingly, in Fig. 7(d), the differences
among the methods become increasingly small.
Fig. 9 Comparison of results between RandomDBC and ActiveDBC with variation of number of active nodes
4.4.2 Number of top-k influential nodes
Since we choose the top-k influential nodes from the k
clusters produced by the DBSCAN algorithm, it is important to
determine the size of the epsilon ε. In Fig. 8(a), we set the
epsilon to 0.205, 0.113, 0.024, and 0.014 for 4, 5, 6, and 7
influential nodes, respectively. For example, Fig. 8(b) shows
the result with epsilon 0.205, which yields eight outliers
among 115 nodes (noise ratio 0.06) when there are four top-k
influential nodes, whereas the setting with seven influential
nodes yields 32 outliers (noise ratio 0.27). Although the
results in Fig. 8(e) include many outliers, the proposed
method still performs better, because we measure how
fast the active nodes spread information to others starting
from the top-k nodes, without considering the number of
outliers. As a result, when a larger number of influential nodes
initially exist, information tends to spread faster.
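To make the role of epsilon, minPoints, and the noise ratio concrete, the following is a minimal pure-Python sketch of DBSCAN. It is a textbook-style stand-in, not the authors' implementation; the two-dimensional points are hypothetical movement-probability vectors, and the function names dbscan and noise_ratio are introduced only for this example. The values eps=0.1 and min_pts=4 are the defaults listed in Table 1.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # The neighborhood includes the point itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_nbrs)
    return labels


def noise_ratio(labels):
    """Fraction of points labelled as noise (outliers)."""
    return labels.count(NOISE) / len(labels)


# Hypothetical 2-D movement-probability vectors (illustrative only).
points = [(0.10, 0.10), (0.12, 0.10), (0.10, 0.12), (0.12, 0.12),
          (0.80, 0.80), (0.82, 0.80), (0.80, 0.82), (0.82, 0.82),
          (0.50, 0.10), (0.10, 0.90)]

labels = dbscan(points, eps=0.1, min_pts=4)
print(labels, noise_ratio(labels))
```

With these points the sketch finds two clusters and two outliers. Shrinking eps makes neighborhoods smaller, so more points fail the minPoints test and end up as noise, which matches the trend of more outliers at smaller epsilon reported above.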
4.4.3 Knowledge-learning based propagation
The results show that our proposed method, ActiveDBC, is
more effective than RandomDBC after DBSCAN clustering.
Even though we predicted the mobility patterns of each node
based on the Markov chain and clustered the nodes effectively
by using the DBSCAN algorithm, randomly choosing the
top-k influential nodes led to poor performance, as shown in
Fig. 9(c). In contrast, when the scoring rule we suggested is
applied, the results in Fig. 9(a, b) show good performance.
However, the communication range of 20 m shows a similar
tendency for both methods in Fig. 9(a): since the coverage of
each node is so large, the nodes easily encounter each other.
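The difference between the two selection strategies can be sketched as follows. This is a simplified illustration under assumptions: the cluster labels and the per-node scores are made up, the scoring rule itself is not reproduced here, and the function names select_random and select_by_score are hypothetical. Both strategies pick one node per DBSCAN cluster; only the choice within each cluster differs.

```python
import random
from collections import defaultdict

NOISE = -1


def group_by_cluster(labels):
    """Map each cluster id to its member node indices, skipping noise."""
    clusters = defaultdict(list)
    for node, label in enumerate(labels):
        if label != NOISE:
            clusters[label].append(node)
    return clusters


def select_random(labels, rng):
    """RandomDBC-style choice: one random node from each cluster."""
    clusters = group_by_cluster(labels)
    return sorted(rng.choice(members) for members in clusters.values())


def select_by_score(labels, scores):
    """ActiveDBC-style choice: the highest-scoring node of each cluster."""
    clusters = group_by_cluster(labels)
    return sorted(max(members, key=lambda n: scores[n])
                  for members in clusters.values())


# Hypothetical cluster labels and learned scores for seven nodes.
labels = [0, 0, 0, 1, 1, NOISE, 1]
scores = [0.2, 0.9, 0.4, 0.1, 0.3, 0.99, 0.7]

print(select_random(labels, random.Random(0)))
print(select_by_score(labels, scores))
```

In this sketch node 5 has the highest score overall but is a noise point, so it is never selected; the score-based choice returns nodes 1 and 6, the best node of each cluster.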
5 Conclusion
In this paper, to provide effective information
propagation, we proposed a knowledge-based information
propagation method that identifies proper top-k influential
nodes. Our idea is to learn the movement probabilities of
each node over the virtual regions in order to reduce the
number of isolated nodes as much as possible. In
addition, the DBSCAN algorithm was applied to obtain
effective results in denser and more populated areas. To select
the top-k influential nodes, we followed two major steps.
First, we predicted the movement patterns of each node by
using a Markov chain process; the resulting movement
probabilities of each node provided useful information for
clustering. Second, since the nodes with similar movement
patterns were clustered by using the DBSCAN algorithm,
applying the scoring rule to choose the top-k nodes, based on
the learned acquaintance among nodes, gave much better results.
As a result, for short communication distances, the
proposed method performed well in comparison to
the other methods. A shortcoming is that all methods
tend to give similar results with larger communication
ranges; however, in the context of the MSN environment,
shorter communication ranges are more suitable for our
proposed idea. Moreover, we used real location-based
NCCU trace data to demonstrate the efficiency of this
method. The experiments on the real dataset showed that
the proposed method generally outperformed the other
methods.
Acknowledgements This research was supported by the Basic Sci-
ence Research Program through the National Research Foundation of
Korea (NRF) funded by the Ministry of Education, Science, and
Technology (2016R1A2B4010142).
References
1. Xu, Q., Su, Z., Zhang, K., Ren, P., & Shen, X. S. (2015). Epi-
demic information dissemination in mobile social networks with
opportunistic links. IEEE Transactions on Emerging Topics in
Computing, 3(3), 399–409.
2. Ma, H., Yang, H., Lyu, M. R., & King, I. (2008). Mining social
networks using heat diffusion processes for marketing candidates
selection. In Proceedings of the 17th ACM conference on infor-
mation and knowledge management (pp. 233–242).
3. Kempe, D., Kleinberg, J., & Tardos, E. (2003). Maximizing the
spread of influence through a social network. In Proceedings of
ACM SIGKDD.
4. Lu, Z., Wen, Y., & Cao, G. (2014). Information diffusion in
mobile social networks: The speed perspective. In Proceedings of
IEEE INFOCOM (pp. 1932–1940).
5. Ning, T., Yang, Z., Wu, H., & Han, Z. (2013). Self-interest-driven
incentives for ad dissemination in autonomous mobile social
networks. In Proceedings of IEEE INFOCOM.
6. Kempe, D., Kleinberg, J., & Tardos, E. (2003). Maximizing the
spread of influence through a social network. In Proceedings of
the 9th ACM SIGKDD international conference on knowledge
discovery and data mining (pp. 137–146).
7. Richardson, M., & Domingos, P. (2002). Mining knowledge-
sharing sites for viral marketing. In Proceedings of the eighth
ACM SIGKDD international conference on Knowledge discovery
and data mining (pp. 61–70).
8. Myers, S. A., Zhu, C., & Leskovec, J. (2012). Information dif-
fusion and external influence in networks. In Proceedings of the
18th ACM SIGKDD international conference on knowledge dis-
covery and data mining (pp. 33–41).
9. Kim, Y., Kim, J. K., Seok, J., & Du Kim, B. (2016). Information
propagation modeling in a drone network using disease epidemic
models. In 2016 Eighth international conference on ubiquitous
and future networks (ICUFN) (pp. 79–81).
10. Araniti, G., Orsino, A., Militano, L., Wang, L., & Iera, A. (2017).
Context-aware information diffusion for alerting messages in 5G
mobile social networks. IEEE Internet of Things Journal, 4(2),
427–436.
11. Chen, W., Wang, C., & Wang, Y. (2010). Scalable influence
maximization for prevalent viral marketing in large-scale social
networks. In Proceedings of ACM SIGKDD.
12. Wang, Y., Cong, G., Song, G., & Xie, K. (2010). Community-
based greedy algorithm for mining top-k influential nodes in
mobile social networks. In Proceedings of ACM SIGKDD.
13. Lu, Z., Wen, Y., Zhang, W., Zheng, Q., & Cao, G. (2016).
Towards information diffusion in mobile social networks. IEEE
Transactions on Mobile Computing, 15(5), 1292–1304.
14. Lu, Z., Sun, X., & La Porta, T. (2016). Cooperative data offload
in opportunistic networks: From mobile devices to Infrastructure.
arXiv preprint arXiv:1606.03493.
15. Markov chain. (2016). https://en.wikipedia.org/wiki/Markov_
chain.
16. Lee, J. K., & Hou, J. C. (2006). Modeling steady-state and
transient behaviors of user mobility: formulation, analysis, and
application. In Proceedings of the 7th ACM international sym-
posium on Mobile ad hoc networking and computing (pp. 85–96).
17. Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-
based algorithm for discovering clusters in large spatial databases
with noise. In KDD (Vol. 96, No. 34, pp. 226–231).
18. Mai, S. T., Assent, I., & Storgaard, M. (2016, August). AnyDBC:
An Efficient Anytime Density-based Clustering Algorithm for
Very Large Complex Datasets. In Proceedings of the 22nd ACM
SIGKDD international conference on knowledge discovery and
data mining (pp. 1025–1034). ACM.
19. Keranen, A. (2008). Opportunistic network environment simu-
lator. Special Assignment report, Helsinki University of Tech-
nology, Department of Communications and Networking.
20. Conti, M., Giordano, S., May, M., & Passarella, A. (2010). From
opportunistic networks to opportunistic computing. IEEE Com-
munications Magazine, 48(9), 126–139.
21. Zhang, Z. (2006). Routing in intermittently connected mobile ad
hoc networks and delay tolerant networks: Overview and chal-
lenges. IEEE Communications Surveys and Tutorials, 8(1),
24–37.
22. Tsai, T. C., & Chan, H. H. (2015). NCCU trace: Social-network-
aware mobility trace. IEEE Communications Magazine, 53(10),
144–149.
23. Fratini, A., & Caleffi, M. (2014). Medical emergency alarm
dissemination in urban environments. Telematics and Informat-
ics, 31(3), 511–517.
24. Pelusi, L., Passarella, A., & Conti, M. (2006). Opportunistic
networking: data forwarding in disconnected mobile ad hoc
networks. IEEE Communications Magazine, 44(11), 15.
25. Srinivasan, V., Motani, M., & Ooi, W. T. (2006). Analysis and
implications of student contact patterns derived from campus
schedules. In Proceedings of ACM MobiCom, Los Angeles, CA
(pp. 86–97).
26. Hui, P. (2008). People are the network: experimental design and
evaluation of social-based forwarding algorithms, Ph.D.
Dissertation, UCAM-CL-TR-713. University of Cambridge,
Computer Laboratory.
27. Eagle, N., & Pentland, A. (2006). Reality mining: sensing com-
plex social systems. Personal and Ubiquitous Computing, 10(4),
255–268.
28. Socievole, A., De Rango, F., & Caputo, A. (2014). Wireless
contacts, Facebook friendships and interests: analysis of a multi-layer
social network in an academic environment. In 2014 IFIP
Wireless Days (WD).
29. Freeman, L. C. (1978). Centrality in social networks conceptual
clarification. Social Networks, 1(3), 215–239.
30. Alsabti, K., Ranka, S., & Singh, V. (1997). An efficient k-means
clustering algorithm.
Jiho Park is currently a Ph.D.
candidate in computer science at
Yonsei University in South
Korea. His research interests
include wireless sensor net-
works, machine learning, deep
learning and social network
analysis.
Jegwang Ryu is currently an
M.S. candidate in computer
science at Yonsei University in
South Korea. His research
interests include mobile social
networks, data offloading and
machine learning.
Sung-Bong Yang received his
M.S. and Ph.D. from the Dept.
of Computer Science at the
University of Oklahoma in 1986
and 1992, respectively. He has
been a professor at Yonsei
University since 1994. His
research interests include graph
algorithms, mobile computing,
machine learning and social
network analysis.