
Post on 21-Apr-2017


Modelling and mining complex networks

Kaushalya Madhawa

Source: Facebook Engineering blog

What is a graph?

● Graph theory started with Euler’s solution to the Königsberg bridges problem in 1736

● In simple terms, a graph is a set of vertices (V) connected by a set of edges (E)
  ○ Vertices: entities
  ○ Edges: pairwise relations among vertices
  ○ Edges can optionally have a direction and a weight

● Graphs can be used to model many real-world datasets
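A minimal sketch of the definition above, assuming a plain adjacency map as the in-memory representation. The function name `build_graph` and the toy edge list are illustrative, not taken from the slides.

```python
def build_graph(edges, directed=False):
    """Return an adjacency map {vertex: {neighbour: weight}}."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})  # make sure v is a vertex even without out-edges
        if not directed:
            adj[v][u] = w  # an undirected edge is stored in both directions
    return adj


# Hypothetical toy data: (source, target, weight) triples.
edges = [("a", "b", 1.0), ("b", "c", 2.5), ("a", "c", 1.0)]
undirected = build_graph(edges)
directed = build_graph(edges, directed=True)

print(undirected["b"])  # neighbours of b with edge weights
print(directed["c"])    # c has no outgoing edges in the directed version
```

The same structure covers all four variants mentioned above (directed or undirected, weighted or unweighted); an unweighted graph simply stores a constant weight.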

Analysis of graph datasets

● Graph datasets have been studied in the past
  ○ they were small
  ○ visual inspection could reveal a lot of information

Now:

● Increasingly large networks with millions or billions of nodes
  ○ impossible to visualize

Types of networks

● Social networks
  ○ Phone call networks, email networks

● Knowledge and information networks
  ○ The web, peer-to-peer networks, blog networks

● Technology networks
  ○ Power grid, transportation networks

● Biological networks
  ○ Protein-protein interaction networks, gene regulation networks

Network science

● The field that studies complex networks
● Draws theories and methods from many fields
  ○ Mathematics: graph theory
  ○ Physics: statistical mechanics
  ○ Computer science: data mining, information visualization
  ○ Sociology: social structure

● Understanding networks
  ○ Understand their topology and measure their properties
  ○ Study their evolution and dynamics
  ○ Create realistic models
  ○ Create algorithms that make use of the network structure

Frieze, Gionis, and Tsourakakis, Algorithmic Techniques for Modeling and Mining Large Graphs

Describing a network: network properties

● Density: ratio of the number of edges E to the number of possible edges
● Size: number of nodes
● Average degree
● Average path length: average number of steps it takes to get from one member of the network to another
● Network diameter: longest of all the calculated shortest paths in a network
● Clustering coefficient: measures the "all-my-friends-know-each-other" property
● Connectedness: the way in which the network is connected
● Node centrality: set of measures to identify the most important nodes
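Several of these properties can be computed directly from an adjacency map. The sketch below assumes an undirected, unweighted, connected graph and uses breadth-first search for shortest paths. The helper names (`bfs_distances`, `describe`) and the four-node path graph are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop counts from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def describe(adj):
    """Size, density, average degree, average path length and diameter."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is stored twice
    lengths = []
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                lengths.append(d)  # only reachable pairs are counted
    return {
        "size": n,
        "density": m / (n * (n - 1) / 2),
        "average_degree": 2 * m / n,
        "average_path_length": sum(lengths) / len(lengths),
        "diameter": max(lengths),
    }


# Toy path graph 1 - 2 - 3 - 4.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
stats = describe(path)
print(stats)
```

Running a BFS from every vertex costs O(V·(V+E)), which is fine for small graphs but is exactly the kind of computation that needs sampling or approximation on networks with millions of nodes.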

A sample dataset

• Multiple mobile operators in Sri Lanka have provided four different types of metadata
  – Call Detail Records (CDRs)
    • Records of calls
    • SMS
    • Internet access
  – Airtime recharge records

• Datasets do not include any Personally Identifiable Information
  – All phone numbers are pseudonymized
  – LIRNEasia does not maintain any mappings of identifiers to original phone numbers

• Covers 50-60% of users

CDR: What is the underlying graph?

● Vertices: users, base stations
● Edges: calls, texts
● Edge weights: number of calls between 2 vertices, number of texts between 2 vertices
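As a sketch of how such a graph could be assembled, the snippet below counts calls and texts per pair of users. The record layout `(caller_id, callee_id, record_type)` and the identifiers are assumptions made for illustration; the real CDR schema is not given in the slides.

```python
from collections import Counter

# Hypothetical, pseudonymized records: (caller_id, callee_id, record_type).
records = [
    ("u1", "u2", "call"),
    ("u2", "u1", "call"),
    ("u1", "u3", "sms"),
    ("u1", "u2", "sms"),
    ("u3", "u2", "call"),
]


def edge_weights(records, record_type):
    """Count records of one type per unordered pair of vertices."""
    weights = Counter()
    for src, dst, kind in records:
        if kind == record_type:
            # frozenset ignores direction, so u1->u2 and u2->u1 share one edge
            weights[frozenset((src, dst))] += 1
    return weights


call_graph = edge_weights(records, "call")
sms_graph = edge_weights(records, "sms")

print(call_graph[frozenset(("u1", "u2"))])  # number of calls between u1 and u2
```

Keying on an ordered tuple instead of a `frozenset` would give the directed variant, where who initiated the call matters.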

Properties of real-world networks

● Power law degree distribution
  ○ p_k: the fraction of vertices in the network that have degree k
  ○ The degree distribution of a network can be visualized by making a histogram of the p_k values

● Heavy-tailed distribution: existence of nodes that have very high degree
● Scale-free: the average degree is not informative
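The fraction p_k of vertices with degree k can be tabulated in a few lines. The sketch below uses a small star graph (an illustrative assumption, not the CDR data) to show why the average degree says little when one hub dominates.

```python
from collections import Counter


def degree_distribution(adj):
    """Return {k: p_k}, the fraction of vertices that have degree k."""
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}


# Toy star graph: hub 0 connected to five leaves.
star = {0: set(range(1, 6))}
star.update({leaf: {0} for leaf in range(1, 6)})

p_k = degree_distribution(star)
mean_degree = sum(k * p for k, p in p_k.items())

print(p_k)
print(mean_degree)
```

Here the mean degree is about 1.67, yet no vertex has a degree close to that value: five vertices have degree 1 and the hub has degree 5.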

Properties of real-world networks...

● Transitivity (Clustering)
  ○ “The friend of your friend is likely also to be your friend.”
  ○ If vertex A is connected to vertex B, and vertex B is connected to vertex C, then there is a heightened probability that vertex A is also connected to vertex C.
  ○ Measured by the clustering coefficient
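One common way to quantify this is the local clustering coefficient: for a vertex, the fraction of pairs of its neighbours that are themselves connected. The sketch below assumes an undirected graph stored as an adjacency map; the function names and the toy graph are illustrative.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of v's neighbours that are connected to each other."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # no neighbour pairs to close into a triangle
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


def average_clustering(adj):
    """Mean of the local clustering coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Triangle A-B-C with a pendant vertex D attached to C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

print(local_clustering(adj, "A"))  # all of A's friends know each other
print(local_clustering(adj, "C"))  # only one of C's three neighbour pairs is linked
print(average_clustering(adj))
```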

Properties of real-world networks...

● Communities
  ○ “A set of vertices densely connected to each other and sparsely connected to the rest of the graph”
  ○ Real-world insights can be gained from community structure
    ■ Metabolic networks have communities based on functional groupings
    ■ Communities in social networks can be formed based on common location, interests, occupation, etc.

Finding the community structure

● There are multiple approaches to finding community structure in a network
● Modularity maximization is one of the most widely used methods
  ○ Modularity Q = (edges inside the community) - (expected number of edges inside the community)
  ○ The goal of such algorithms is to find the community structure with the highest modularity
  ○ Since modularity maximization is NP-hard, heuristic methods are used

M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks”, Physical Review E, APS, Vol. 69, No. 2, p. 1-16, 2004.
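A sketch of how modularity can be evaluated for a given partition, using the commonly stated normalised form Q = Σ_c [ L_c/m − (d_c/2m)² ], where m is the total number of edges, L_c the number of edges inside community c, and d_c the sum of degrees in c. The toy graph (two triangles joined by one edge) and the function name are illustrative assumptions.

```python
def modularity(adj, communities):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # total number of edges
    q = 0.0
    for community in communities:
        members = set(community)
        # each internal edge is seen from both endpoints, hence the division by 2
        internal = sum(1 for u in members for v in adj[u] if v in members) / 2
        degree_sum = sum(len(adj[u]) for u in members)
        # observed fraction of internal edges minus the expected fraction
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q


# Two triangles {0,1,2} and {3,4,5} joined by the single edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

q_split = modularity(adj, [{0, 1, 2}, {3, 4, 5}])
q_single = modularity(adj, [{0, 1, 2, 3, 4, 5}])
print(q_split, q_single)
```

Splitting the graph into its two triangles scores higher than lumping everything into one community, which scores exactly zero. Heuristic methods search for the partition that maximises this value.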

Properties of real-world networks...

● Small-world phenomenon
  ○ Most pairs of vertices are connected by a short path through the network.
  ○ S. Milgram’s famous experiment demonstrated the small-world effect
  ○ Each Facebook user is connected to every other user by an average of three and a half other people. With more interconnections, the degree of separation on Facebook is shrinking over time: https://research.facebook.com/blog/three-and-a-half-degrees-of-separation/

Communities in a mobile network

● Louvain method [1] applied to tower-to-tower call network

● The highest-modularity community structure found for Sri Lanka consists of 11 clusters.

[1] V. Blondel and J. Guillaume, “Fast unfolding of communities in large networks,” J. Stat. …, pp. 1–12, 2008.

Limitations of modularity

● Often fails to detect communities smaller than a certain size (resolution limit)

● In real-world networks, nodes can belong to more than one community, whereas a partition assigns each node to exactly one

Machine Learning in network analysis

● Finding low-dimensional feature representations of large networks
  ○ DeepWalk:
    ■ uses deep-learning-based word-embedding techniques developed for natural language modelling
    ■ The set of vertices is treated as the vocabulary
  ○ GraRep
  ○ LINE

Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "DeepWalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.
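The link to language modelling comes from truncated random walks: each walk is a sequence of vertex ids that plays the role of a sentence, and a skip-gram word-embedding model is then trained on those sequences. The sketch below covers only the walk-generation step; the parameter names, default values, and toy graph are assumptions, and the embedding training itself is not shown.

```python
import random


def random_walks(adj, walks_per_vertex=2, walk_length=5, seed=0):
    """Generate truncated random walks; each walk acts as one 'sentence'."""
    rng = random.Random(seed)
    vertices = sorted(adj)
    walks = []
    for _ in range(walks_per_vertex):
        rng.shuffle(vertices)  # visit start vertices in a random order each pass
        for start in vertices:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = sorted(adj[walk[-1]])
                if not nbrs:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks


# Toy graph: two triangles joined by the edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

walks = random_walks(adj)
print(len(walks), walks[0])
```

Vertices that co-occur in many walks end up with similar embeddings, in the same way that words appearing in similar contexts do.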

Machine Learning in network analysis...

● Anomaly detection
  ○ Bayesian anomaly detection methods are used to detect anomalies in large dynamic networks
● Link prediction
● Maximizing information diffusion in networks
  ○ Bayesian networks are used to model belief propagation in networks.

Q & A