Community Mining in Social Networks

Upload: tusharsharma

Post on 06-Apr-2018


  • 8/3/2019 Community Mining in Social Networks

    1/66


    More links inside than outside

    Graphs are sparse

    Communities


    Examples of networks: metabolic, protein-protein, social, economic


    Outline

    Elements of community detection
    Graph partitioning
    Hierarchical clustering
    The Girvan-Newman algorithm
    New methods
    Testing algorithms
    Conclusions


    Null hypothesis

    The relations between nodes can be inferred from the topology, i.e.

    Real communities = Topological communities


    Questions

    What is a community?
    What is a partition?
    What is a good partition?


    Communities: definition

    Local criteria
    Global criteria
    Vertex similarity


    Local definitions: self-referring

    Look at the subgraph, forget the rest of the network.

    Ex. Clique! n-cliques, k-plexes, etc.


    Local definitions: comparative

    Compare links inside and outside of the subgraph

    1) Strong definition: LS set
    2) Weak definition: the internal degree of the subgraph is larger than its external degree.
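The weak definition can be checked directly on an adjacency list. The following is a minimal sketch, assuming an undirected graph stored as a dict of neighbour sets; the function name `is_weak_community` and the example graph are illustrative and do not come from the slides.

```python
def is_weak_community(adj, group):
    """Weak definition: the total internal degree of the subgraph
    exceeds its total external degree."""
    group = set(group)
    internal = external = 0
    for v in group:
        for u in adj[v]:
            if u in group:
                internal += 1   # each internal link is counted once per endpoint
            else:
                external += 1
    return internal > external


# Hypothetical example: two triangles joined by the single link 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(is_weak_community(adj, {0, 1, 2}))  # True: internal degree 6, external degree 1
print(is_weak_community(adj, {2, 3}))     # False: internal degree 2, external degree 4
```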


    Global definitions

    Looking at the community with respect

    to the whole graph

    Null model: graph with no community structure

    Ex. Newman-Girvan null model: a random graph with the same degree sequence as the original graph


    Definitions based on

    vertex similarity

    Vertices are in the same community if

    they are similar to each other

    Similarity measure can be local or global

    Ex. Distance between vertices, eigenvector

    components, structural equivalence, etc.


    Warning

    Communities are usually implicitly defined

    by the specific algorithm adopted, without

    an explicit definition!

    The practical definition may depend on the

    specific system/application


    What is a partition?

    A partition is a subdivision of a graph into groups of vertices, such that each vertex is assigned to exactly one group

    Problems:

    1) Overlapping communities
    2) Hierarchical structure


    Overlapping communities

    In real networks,

    vertices may belong

    to different modules

    G. Palla, I. Derényi, I. Farkas, T. Vicsek, Nature 435, 814, 2005


    Hierarchies

    Modules may embed smaller modules, yielding different organizational levels

    A. Clauset, C. Moore, M.E.J. Newman,

    LNCS 4503, 1, 2007


    What is a good partition?


    How can we compare different

    partitions?


    Partition $P_1$ versus $P_2$: which one is better?

    Quality function $Q$

    Is $Q(P_1) > Q(P_2)$ or $Q(P_1) < Q(P_2)$?


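    Modularity

    $$Q = \frac{1}{m} \sum_i \left( l_i - \langle l_i \rangle \right)$$

    $l_i$ = number of links in module $i$
    $\langle l_i \rangle$ = expected number of links in module $i$
    $m$ = total number of links in the graph

    $d_i / 2m$ = probability that a stub, randomly selected, ends in module $i$, where $d_i$ is the sum of the degrees of the vertices in module $i$

    $(d_i / 2m)^2$ = probability that the link is internal to module $i$

    $\langle l_i \rangle = m \, (d_i / 2m)^2$ = expected number of links in module $i$

    $$Q = \sum_i \left[ \frac{l_i}{m} - \left( \frac{d_i}{2m} \right)^2 \right]$$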
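As a hedged illustration of the quantities on these slides, the sketch below computes modularity for an undirected, unweighted graph given as an edge list. It uses the standard form in which each module contributes its fraction of internal links minus the squared fraction of stubs it holds. The function name `modularity`, the `membership` mapping and the example graph are assumptions made for this example.

```python
def modularity(edges, membership):
    """Q = sum_i [ l_i / m - (d_i / 2m)^2 ] for an undirected edge list."""
    m = len(edges)
    links_inside = {}   # l_i: links with both ends in module i
    degree_sum = {}     # d_i: total degree (number of stubs) of module i
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            links_inside[cu] = links_inside.get(cu, 0) + 1
    return sum(links_inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Hypothetical example: two triangles joined by the link 2-3 (m = 7).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {v: "all" for v in range(6)}

print(modularity(edges, two_modules))  # 2 * (3/7 - (7/14)**2) = 5/14
print(modularity(edges, one_module))   # 0: a single module never beats the null model
```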


    History

    1970s: Graph partitioning in computer science

    Hierarchical clustering in social sciences

    2002: Girvan and Newman, PNAS 99, 7821-7826

    2002-onward: methods of new generation, mostly by physicists


    Graph partitioning

    Divide a graph into n parts, such that the number of links between them (cut size) is minimal

    Problems:

    1. Number of clusters must be specified
    2. Size of the clusters must be specified


    If cluster sizes are not specified, the minimal cut size is zero, for a partition where all nodes stay in a single cluster and the other clusters are empty

    Bipartition: divide a graph into two clusters of equal size and minimal cut size


    Spectral partitioning

    Laplacian matrix $L = D - A$, where $D$ is the diagonal matrix of vertex degrees and $A$ is the adjacency matrix


    Spectral properties of L:

    All eigenvalues are non-negative
    If the graph is divided into g components, there are g zero eigenvalues
    In this case L can be rewritten in a block-diagonal form


    If the network is connected, but there

    are two groups of nodes weakly linked

    to each other, they can be identified from

    the eigenvector of the second smallest

    eigenvalue (Fiedler vector)

    The Fiedler vector has both positive and negative components; their sum must be 0

    If one wants a split into $n_1$ and $n_2 = n - n_1$ nodes, one takes the $n_1$ largest (smallest) components of the Fiedler vector
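The split described above can be sketched without any numerical library. This is only an illustration under stated assumptions: the graph is connected and given as a list of neighbour lists, and the Fiedler vector is approximated by power iteration on the shifted matrix $cI - L$ (a substitute for a proper eigensolver, chosen to keep the example dependency-free). The names `fiedler_vector` and `spectral_bisection` are hypothetical.

```python
import random


def fiedler_vector(adj, iterations=500, seed=0):
    """Approximate the eigenvector of the second smallest Laplacian eigenvalue."""
    n = len(adj)
    degree = [len(neigh) for neigh in adj]
    shift = 2 * max(degree)            # upper bound on the largest eigenvalue of L
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]    # remove the constant eigenvector (eigenvalue 0)
        # y = (shift * I - L) x, using (L x)_i = k_i x_i - sum of x over neighbours of i
        y = [(shift - degree[i]) * x[i] + sum(x[j] for j in adj[i]) for i in range(n)]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]
    return x


def spectral_bisection(adj, n1):
    """Put the n1 vertices with the largest Fiedler components in the first group."""
    v = fiedler_vector(adj)
    order = sorted(range(len(adj)), key=lambda i: v[i], reverse=True)
    return set(order[:n1]), set(order[n1:])


# Hypothetical example: two triangles joined by the link 2-3.
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
part_a, part_b = spectral_bisection(adj, 3)
print(part_a, part_b)
```

Because the sign of an eigenvector is arbitrary, either triangle may come out as the first group.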


    Kernighan-Lin algorithm

    Start: split in two groups

    At each step, a pair of nodes of different groups are swapped so as to decrease the cut size

    Sometimes swaps are allowed that increase the cut size, to avoid local minima
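The swapping idea can be illustrated with a deliberately simplified sketch. It only accepts swaps that lower the cut size and stops at the first local minimum, so it omits the uphill moves mentioned above and is not the full Kernighan-Lin procedure. The function names and the example graph are assumptions.

```python
from itertools import product


def cut_size(edges, side):
    """Number of links whose endpoints lie in different groups."""
    return sum(1 for u, v in edges if side[u] != side[v])


def greedy_swap_bisection(edges, group_a, group_b):
    """Swap pairs of nodes between the two groups while the cut size decreases."""
    side = {v: 0 for v in group_a}
    side.update({v: 1 for v in group_b})
    best = cut_size(edges, side)
    improved = True
    while improved:
        improved = False
        a_nodes = [v for v in side if side[v] == 0]
        b_nodes = [v for v in side if side[v] == 1]
        for a, b in product(a_nodes, b_nodes):
            side[a], side[b] = 1, 0          # tentative swap
            trial = cut_size(edges, side)
            if trial < best:
                best = trial
                improved = True
                break                        # keep the swap and rescan
            side[a], side[b] = 0, 1          # undo the swap
    final_a = {v for v in side if side[v] == 0}
    final_b = {v for v in side if side[v] == 1}
    return final_a, final_b, best


# Hypothetical example: two triangles joined by the link 2-3, started from a bad split.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
final_a, final_b, final_cut = greedy_swap_bisection(edges, [0, 1, 3], [2, 4, 5])
print(final_a, final_b, final_cut)
```

Group sizes are preserved because nodes are only ever exchanged in pairs.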


    Hierarchical clustering

    Very common in social network analysis

    1. A criterion is introduced to compare nodes based on their similarity
    2. A similarity matrix $X$ is constructed: the similarity of nodes $i$ and $j$ is $X_{ij}$
    3. Starting from the individual nodes, larger groups are built by joining groups of nodes based on their similarity


    Final result: a hierarchy of partitions

    (dendrogram)
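A minimal sketch of the agglomerative procedure is given below. It assumes single linkage (the similarity of two groups is the largest similarity between their members), which the slides do not specify; the function name `agglomerative` and the example matrix are illustrative only.

```python
def agglomerative(similarity):
    """Return the hierarchy of partitions, from singletons down to one group."""
    n = len(similarity)
    clusters = [frozenset([i]) for i in range(n)]
    hierarchy = [list(clusters)]
    while len(clusters) > 1:
        best_pair, best_sim = None, float("-inf")
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage (an assumption): most similar pair of members.
                sim = max(similarity[i][j] for i in clusters[a] for j in clusters[b])
                if sim > best_sim:
                    best_pair, best_sim = (a, b), sim
        a, b = best_pair
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        hierarchy.append(list(clusters))
    return hierarchy


# Hypothetical similarity matrix X: nodes 0 and 1 are close, as are nodes 2 and 3.
X = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
hierarchy = agglomerative(X)
for level in hierarchy:
    print([sorted(c) for c in level])
```

Each entry of `hierarchy` is one horizontal cut of the dendrogram.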


    Problems of traditional methods

    Graph partitioning: one needs to specify the number and the size of the clusters

    Hierarchical clustering: many partitions recovered, which one is the best?

    One would like a method that can predict the number and the size of the clusters and indicate a subset of good partitions


    Girvan-Newman algorithm

    M. Girvan & M.E.J. Newman, PNAS 99, 7821-7826 (2002)

    Divisive method: one removes the links that connect the clusters, until the latter are isolated

    How to identify intercommunity links? Betweenness


    Link-betweenness: number of shortest

    paths crossing a link


    Steps

    1. Calculate the betweenness of all links

    2. Remove the one with highest betweenness

    3. Recalculate the betweenness of the remaining edges

    4. Repeat from 2
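The four steps can be sketched as follows. This is an illustrative implementation, not the authors' code: link betweenness is computed with a breadth-first accumulation in the style of Brandes' algorithm, the graph is assumed to be undirected, unweighted and stored as a dict of neighbour sets, and the loop stops as soon as the number of connected components increases. All function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths crossing each link (fractional when paths tie)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}      # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):           # accumulate from the farthest nodes back
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return {e: b / 2.0 for e, b in bet.items()}   # each pair was counted from both ends


def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def girvan_newman_split(adj):
    """Remove highest-betweenness links until one more component appears."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)               # recalculated after every removal
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical example: two triangles joined by the link 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(edge_betweenness(adj)[frozenset((2, 3))])   # 9.0: all 3 x 3 cross pairs use the bridge
print(girvan_newman_split(adj))
```

Calling `girvan_newman_split` repeatedly on the resulting components would yield the full hierarchy of partitions.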


    The process delivers a hierarchy of

    partitions: which one is the best?

    The best partition is the one corresponding

    to the highest modularity Q

    M.E.J. Newman & M. Girvan, Phys. Rev. E 69, 026113 (2004)

    The algorithm runs in time $O(n^3)$ on a sparse graph (i.e. when $m \sim n$)


    New methods

    Divisive algorithms
    Modularity optimization
    Spectral methods
    Dynamics methods
    Clique percolation


    Divisive algorithms

    Based on link removal (like GN)

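    Ex. Algorithm by Radicchi et al. (PNAS 101, 2658-2663, 2004)

    Edge clustering: $C_{ij} = \frac{z_{ij} + 1}{\min(k_i - 1,\; k_j - 1)}$, where $z_{ij}$ is the number of triangles that include the link between vertices $i$ and $j$, and $k_i$, $k_j$ are their degrees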


    Main idea: inter-community links have

    low edge clustering coefficient
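To make the idea concrete, the sketch below evaluates the triangle-based edge clustering coefficient of Radicchi et al., that is, the number of triangles through a link plus one, divided by the largest number of triangles the link could belong to given the degrees of its endpoints. The function name, the dict-of-sets graph representation and the example graph are assumptions.

```python
def edge_clustering(adj, u, v):
    """(triangles through u-v + 1) / min(k_u - 1, k_v - 1)."""
    triangles = len(adj[u] & adj[v])                  # common neighbours close triangles
    possible = min(len(adj[u]) - 1, len(adj[v]) - 1)
    if possible == 0:
        return float("inf")                           # a leaf link can never be in a triangle
    return (triangles + 1) / possible


# Hypothetical example: two triangles joined by the link 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = {(u, v): edge_clustering(adj, u, v) for u in adj for v in adj[u] if u < v}
print(scores[(2, 3)])               # 0.5: the inter-community link
print(scores[(0, 1)])               # 2.0: a link inside a triangle
print(min(scores, key=scores.get))  # (2, 3)
```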


    Steps

    1. Calculate the edge clustering of all links
    2. Remove the one with lowest edge clustering
    3. Recalculate the measure for the remaining edges
    4. Repeat from 2

    Advantage over GN: fast! The CPU time scales as $O(n^2)$ on a sparse graph


    Modularity optimization

    1) Greedy algorithms
    2) Simulated annealing
    3) Extremal optimization

    Goal: find the maximum of $Q$ over all possible network partitions

    Problem: NP-complete!


    Greedy algorithm

    Start: partition with one node in each community

    Merge groups of nodes so as to obtain the highest increase of $Q$

    Continue until all nodes are in the same community

    Pick the partition with largest modularity

    M.E.J. Newman, Phys. Rev. E 69, 066133, 2004

    CPU time $O(n^2)$
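The greedy procedure can be sketched naively as below. This version recomputes $Q$ from scratch for every candidate merge, so it is far slower than the running time quoted on the slide and is meant only to show the logic. The function names, the integer node labels and the example graph are assumptions.

```python
def modularity(edges, membership):
    """Q = sum_i [ l_i / m - (d_i / 2m)^2 ] for an undirected edge list."""
    m = len(edges)
    links_inside, degree_sum = {}, {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            links_inside[cu] = links_inside.get(cu, 0) + 1
    return sum(links_inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


def greedy_modularity(nodes, edges):
    """Merge communities pairwise, always taking the merge with the highest Q."""
    membership = {v: v for v in nodes}            # start: one node per community
    best_q = modularity(edges, membership)
    best_partition = dict(membership)
    while len(set(membership.values())) > 1:
        labels = sorted(set(membership.values()))
        step_q, step_membership = float("-inf"), None
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                trial = {v: (a if c == b else c) for v, c in membership.items()}
                q = modularity(edges, trial)
                if q > step_q:
                    step_q, step_membership = q, trial
        membership = step_membership              # merge even if Q decreases
        if step_q > best_q:                       # remember the best level seen
            best_q, best_partition = step_q, dict(membership)
    return best_partition, best_q


# Hypothetical example: two triangles joined by the link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
partition, q = greedy_modularity(range(6), edges)
groups = {}
for node, label in partition.items():
    groups.setdefault(label, set()).add(node)
print(sorted(sorted(g) for g in groups.values()), q)
```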


    Resolution limit of modularity


    S. Fortunato & M. Barthélemy, PNAS 104, 36 (2007)


    Spectral methods

    Finding communities from spectral

    properties of graph matrices: A, L,

    etc.

    Ex. Algorithm by Donetti & Muñoz (JSTAT, P10012, 2004)

    The first few eigenvectors of the Laplacian are computed

    Eigenvector components act like coordinates to represent nodes in space


    Nodes are then grouped with

    hierarchical clustering


    Dynamic algorithms

    Potts model
    Synchronization
    Random walks


    Clique percolation

    G. Palla, I. Derényi, I. Farkas, T. Vicsek, Nature 435, 814, 2005


    Communities in

    weighted networks

    Aij = Wij

    Most methods based on topology fail

    Some methods can be easily adapted

    Sparsity may not be crucial!


    Weighted modularity

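    $$Q_w = \frac{1}{W} \sum_i \left( W_i - \langle W_i \rangle \right)$$

    $W_i$ = sum of weights of links in module $i$ (strength of module $i$)
    $\langle W_i \rangle$ = expected strength of module $i$
    $W$ = sum of weights of all links in the graph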


    Markov Cluster Algorithm

    S. Van Dongen, PhD thesis (2000)

    Basic idea: diffusion flow on a network

    The weight matrix $W_{ij}$ is mapped to the stochastic matrix $S_{ij} = W_{ij} / s_i$, where $s_i = \sum_j W_{ij}$ is the strength of vertex $i$


    Steps:

    1. (Diffusion) Raise the stochastic matrix to the power $p$ (e.g. $p = 2$)
    2. (Inflation) Raise each resulting matrix element to a fixed power (the inflation exponent)
    3. Normalize the elements of the resulting matrix (by row)
    4. Keep only the $k$ largest elements per column
    5. Repeat from 1.

    Three parameters: $p$, the inflation exponent, $k$
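The loop above can be sketched in plain Python as follows. This is an illustration of the listed steps rather than the reference MCL implementation: the parameter names `inflation` and `iterations`, the self-loops in the example weights and the way clusters are read off the final matrix (connected components of its non-zero entries) are all assumptions.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][x] * b[x][j] for x in range(n)) for j in range(n)] for i in range(n)]


def row_normalize(mat):
    out = []
    for row in mat:
        total = sum(row)
        out.append([x / total for x in row] if total > 0 else list(row))
    return out


def markov_cluster(weights, p=2, inflation=2.0, k=None, iterations=30):
    flow = row_normalize(weights)                 # S_ij = W_ij / s_i
    n = len(flow)
    for _ in range(iterations):
        expanded = flow
        for _ in range(p - 1):                    # 1. diffusion: S^p
            expanded = mat_mul(expanded, flow)
        inflated = [[x ** inflation for x in row] for row in expanded]   # 2. inflation
        flow = row_normalize(inflated)            # 3. normalization by row
        if k is not None:                         # 4. optional pruning per column
            for j in range(n):
                keep = set(sorted(range(n), key=lambda i: flow[i][j], reverse=True)[:k])
                for i in range(n):
                    if i not in keep:
                        flow[i][j] = 0.0
    return flow


def clusters_from_flow(flow, tol=1e-6):
    """Group nodes linked by surviving flow (an assumed read-out rule)."""
    n = len(flow)
    label = list(range(n))

    def find(i):
        while label[i] != i:
            label[i] = label[label[i]]
            i = label[i]
        return i

    for i in range(n):
        for j in range(n):
            if flow[i][j] > tol:
                label[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), set()).add(i)
    return list(groups.values())


# Hypothetical weights: two triangles with self-loops, joined by a weak link 2-3.
W = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0.1, 0, 0],
    [0, 0, 0.1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
flow = markov_cluster(W)
print(clusters_from_flow(flow))
```

Pruning is left off by default here because removing entries after normalization means the rows no longer sum to one until the next pass.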


    http://www.micans.org/mcl/

    After a sufficient number of iterations

    the matrix converges to a matrix with 0s

    and 1s, with disconnected components!

    Problem: the final configuration depends on the parameters $p$ and (mostly!) the inflation exponent

    Complexity: $O(nk^2)$


    Matrix ordering

    Goal: to put a matrix in block-diagonal

    form!


    Cost function optimization

    Quadratic assignment problem,

    NP complete!

    Standard optimization techniques, e.g.

    simulated annealing

    M. Sales-Pardo, R. Guimerà, A.A. Moreira, L.A.N. Amaral, PNAS 104, 15224 (2007)


    K-clustering

    Set of data points, distance d(x,y) for

    each pair of points x,y

    Goal: divide the points into k groups so as to maximize or minimize a given measure

    Problem: number of clusters given

    as input!


    Minimum k-clustering: minimizing the diameter

    of a cluster, i.e. the largest distance between

    points of the cluster

    k-clustering sum: minimizing the average

    intracluster distance

    k-center: minimizing the maximum distance of

    cluster points from a centroid

    k-median: minimizing the average distance of

    cluster points from a centroid

    k-means: minimizing the average squared distance of cluster points from an arbitrary centroid point


    Example: k-means clustering

    1. k points are randomly chosen, as far as possible from each other (centroids)

    2. Each data point is assigned to the nearest centroid

    3. Recalculate positions of centroids by determining the centers of mass of the k clusters

    4. Repeat from 2.

    Points embedded in metric space

    Problem: result sensitive to initial conditions!
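The k-means loop can be sketched as below. The initialization is one possible reading of step 1 (a random first centroid, then repeatedly the point farthest from the centroids chosen so far); that choice, the function names and the example points are assumptions.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1 (assumed reading): random first centroid, then farthest-point picks.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(squared_distance(p, c) for c in centroids)))
    assignment = None
    for _ in range(iterations):
        # Step 2: assign each point to the nearest centroid.
        new_assignment = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:
            break                                   # converged
        assignment = new_assignment
        # Step 3: move each centroid to the centre of mass of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


# Hypothetical data: two well-separated groups of points in the plane.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = k_means(points, 2)
print(assignment, centroids)
```

Changing `seed` changes the first centroid, which is one way to observe the sensitivity to initial conditions on less cleanly separated data.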


    Testing algorithms

    Artificial networks
    Real networks with known community structure


    Benchmark of Girvan &

    Newman


    Problems

    All nodes have the same degree
    All communities have equal size

    In real networks the distributions of degree and community size are highly heterogeneous!


    New benchmark (A. Lancichinetti, S. Fortunato, F. Radicchi, PRE 78, 046110, 2008)

    Power law distribution of degree
    Power law distribution of community size
    A mixing parameter sets the ratio between the external and the total degree of each node
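For a given graph and a known community assignment, the ratio described by the mixing parameter can be measured per node. The sketch below is only a measurement helper under assumed names (`mixing_ratios`, `membership`); it does not generate benchmark graphs.

```python
def mixing_ratios(adj, membership):
    """External degree divided by total degree, for every node with at least one link."""
    ratios = {}
    for v, neighbours in adj.items():
        if not neighbours:
            continue
        external = sum(1 for u in neighbours if membership[u] != membership[v])
        ratios[v] = external / len(neighbours)
    return ratios


# Hypothetical example: two triangles joined by the link 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
membership = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
ratios = mixing_ratios(adj, membership)
print(ratios[0], ratios[2])   # 0.0 and 1/3
```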


    Real networks


    Outlook

    Overlapping communities
    Hierarchies
    Testing
    Computational complexity
    Clustering in dense correlation matrices (i.e. neither sparse nor complete)

    A long way to go: more questions than answers from clustering

    More rigorous definition of the problem!


    S. Fortunato, C. Castellano, arXiv:0712.2716