Chapter 2 Graphs

First develop some of the basic ideas behind graph theory, then look at some fundamental applications

2.1 Basic Definitions

Graphs: Nodes and Edges

A graph is a way to specify relationships among a collection of items

A graph consists of a set of objects, called nodes

Certain pairs of these objects connected by links called edges

E.g., the graph in Fig. 2.1(a) consists of 4 nodes, labeled A, B, C, D

B is connected to each of the other 3 nodes by edges

C and D are connected by an edge too

Two nodes are neighbors if they’re connected by an edge

In Fig. 2.1(a), we think of the relationship between the 2 ends of an edge as being symmetric

The edge simply connects them to each other

But often we want to express asymmetric relationships

Define a directed graph to consist of a set of nodes together with a set of directed edges, each a link from one node to another

The direction being important—see Fig. 2.1(b)

To emphasize that a graph isn’t directed, call it an undirected graph

But generally a graph is assumed undirected unless otherwise noted
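
As a minimal sketch in plain Python (no graph library), the graph of Fig. 2.1(a) can be stored as a list of edges and turned into neighbor sets. The helper name neighbors_from_edges is an illustrative choice, and reading the same pairs as directed edges is an assumption made only to show the asymmetry; it is not the edge set of Fig. 2.1(b).

```python
from collections import defaultdict

def neighbors_from_edges(edges, directed=False):
    """Map each node to the set of nodes it has an edge to."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v]              # touch v so it appears even with no outgoing edges
        if not directed:
            nbrs[v].add(u)   # symmetric relationship: store both directions
    return dict(nbrs)

# Fig. 2.1(a): B is joined to A, C, and D; C and D are joined as well.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]

undirected = neighbors_from_edges(edges)
print(sorted(undirected["B"]))   # ['A', 'C', 'D']
print(sorted(undirected["A"]))   # ['B']

# Same pairs read as directed edges (an assumption for illustration):
# an edge A -> B no longer implies an edge B -> A.
directed = neighbors_from_edges(edges, directed=True)
print(sorted(directed["B"]))     # ['C', 'D']
```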

Graphs as Models of Networks

Graphs serve as mathematical models of network structures

For a real example, Fig. 2.2 depicts the network structure of the Internet (“Arpanet”) in Dec. 1970—only 13 sites

Nodes represent computing hosts

There’s an edge joining 2 nodes if there’s a direct communication link between them

Ignoring the superimposed US map and the blow-up circles in MA and CA, the rest depicts this 13-node graph using the dots-and-lines style of Fig. 2.1

For showing the pattern of connections, the actual placement of the nodes is immaterial; all that matters is which nodes link to which

Fig. 2.3 is a different drawing of the same 13-node Arpanet graph

Graphs are useful whenever we want to represent how things are either physically or logically linked to one another in a network structure

The 13-node Arpanet is an example of a communication network

In Chapter 1, we saw examples from 2 other broad classes of graph structures

Social networks

Nodes are people or groups of people

Edges represent some kind of social interaction

Information networks

Nodes are info resources (e.g., Web pages or documents)

Edges represent logical connections such as hyperlinks, citations, or cross-references

Fig. 2.4: a few further examples

The depictions of airline and subway systems in (a) and (b) are examples of transportation networks

Nodes are destinations and edges represent direct connections

The prerequisites among college courses in (c) are an example of a dependency network

Nodes are tasks and directed edges indicate that one task must be performed before another

The Tank Street Bridge from Brisbane, Australia in (d) is an example of a structural network

Joints are nodes and physical linkages are edges

2.2 Paths and Connectivity

Paths

A path is a sequence of nodes with the property that each consecutive pair in the sequence is connected by an edge

E.g., the sequence of nodes MIT, BBN, RAND, UCLA is a path in the Internet graph from Figs. 2.2 and 2.3

Another path is the sequence CASE, LINC, MIT, UTAH, SRI, UCSB

A path can repeat nodes, e.g., SRI, STAN, UCLA, SRI, UTAH, MIT

But most paths we consider won’t do this

A path without repeat nodes is a simple path
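
The two definitions can be checked mechanically. The sketch below is plain Python and uses only the edges that the first and third example paths above rely on, not the full 13-node graph; the function names are illustrative.

```python
# Only the edges that the first and third example paths rely on; the full
# graph of Fig. 2.3 has more edges than are listed here.
EDGES = {
    frozenset(pair) for pair in [
        ("MIT", "BBN"), ("BBN", "RAND"), ("RAND", "UCLA"),
        ("SRI", "STAN"), ("STAN", "UCLA"), ("UCLA", "SRI"),
        ("SRI", "UTAH"), ("UTAH", "MIT"),
    ]
}

def is_path(nodes, edges=EDGES):
    """True if every consecutive pair of nodes is joined by an edge."""
    return all(frozenset((u, v)) in edges for u, v in zip(nodes, nodes[1:]))

def is_simple_path(nodes, edges=EDGES):
    """True if the sequence is a path that never repeats a node."""
    return is_path(nodes, edges) and len(set(nodes)) == len(nodes)

print(is_path(["MIT", "BBN", "RAND", "UCLA"]))                        # True
print(is_simple_path(["MIT", "BBN", "RAND", "UCLA"]))                 # True
print(is_path(["SRI", "STAN", "UCLA", "SRI", "UTAH", "MIT"]))         # True
print(is_simple_path(["SRI", "STAN", "UCLA", "SRI", "UTAH", "MIT"]))  # False
print(is_path(["MIT", "UCLA"]))   # False: not among the listed edges
```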

Cycles

A particularly important kind of non-simple path is a cycle,

a path with at least 3 edges, in which the 1st and last nodes are the same, but otherwise all nodes are distinct

Many cycles in Fig. 2.3, e.g.,

SRI, STAN, UCLA, SRI (as short as possible: it has 3 edges)

SRI, STAN, UCLA, RAND, BBN, MIT, UTAH, SRI
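
A small plain-Python check of the cycle definition, applied to the two cycles just listed. As an assumption of this sketch, the edge set contains only the edges those two cycles use, and is_cycle is an illustrative name.

```python
# Only the edges used by the two example cycles, not the whole graph.
EDGES = {
    frozenset(pair) for pair in [
        ("SRI", "STAN"), ("STAN", "UCLA"), ("UCLA", "SRI"),
        ("UCLA", "RAND"), ("RAND", "BBN"), ("BBN", "MIT"),
        ("MIT", "UTAH"), ("UTAH", "SRI"),
    ]
}

def is_cycle(nodes, edges=EDGES):
    """True if nodes form a cycle: closed, at least 3 edges, no other repeats."""
    if len(nodes) < 4 or nodes[0] != nodes[-1]:
        return False          # fewer than 3 edges, or not closed
    interior = nodes[:-1]
    if len(set(interior)) != len(interior):
        return False          # some node other than the endpoint repeats
    return all(frozenset((u, v)) in edges for u, v in zip(nodes, nodes[1:]))

print(is_cycle(["SRI", "STAN", "UCLA", "SRI"]))                               # True
print(is_cycle(["SRI", "STAN", "UCLA", "RAND", "BBN", "MIT", "UTAH", "SRI"]))  # True
print(is_cycle(["SRI", "STAN", "SRI"]))                                       # False
print(is_cycle(["MIT", "BBN", "RAND", "UCLA"]))                               # False
```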

Every edge in the 1970 Arpanet belongs to a cycle

If any single edge were to fail, there would still be a way to get from any node to any other node

More generally, cycles in communication and transportation networks allow for redundancy—provide for alternate routings

In the social network of friendships, cycles are common

NetworkX: Cycles

cycle_basis(G) returns a list of cycles that form a basis for cycles of G

Each cycle list is a list of nodes forming a cycle; cyclic permutations are not included

>>> import networkx as nx

>>> G = nx.barbell_graph(4,2)

>>> nx.cycle_basis(G)

[[1, 3, 0], [2, 3, 0], [8, 7, 6], [9, 7, 6], [8, 9, 6], [1, 2, 0]]

G must be a Graph—can’t be a DiGraph

A basis for cycles of a network is a minimal collection of cycles s.t. any cycle in the network can be written as a sum of cycles in the basis

Here summation of cycles is defined as “exclusive or” of the edges
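
To make the "exclusive or" sum concrete, here is a small plain-Python sketch using two of the basis cycles printed above. Each cycle is represented by its set of edges; cycle_edges is an illustrative helper, not a NetworkX function.

```python
def cycle_edges(nodes):
    """Edge set of the cycle that visits nodes in order and closes back to the start."""
    closed = list(nodes) + [nodes[0]]
    return {frozenset((u, v)) for u, v in zip(closed, closed[1:])}

# Two of the basis cycles printed above for barbell_graph(4, 2).
c1 = cycle_edges([1, 3, 0])
c2 = cycle_edges([2, 3, 0])

# "Exclusive or" of edge sets: keep the edges that lie in exactly one cycle.
total = c1 ^ c2

# The shared edge 3-0 cancels, leaving the 4-cycle 1, 3, 2, 0, 1.
print(total == cycle_edges([1, 3, 2, 0]))   # True
```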

simple_cycles(DG) returns the simple cycles (elementary circuits) of a directed graph

DG must be a DiGraph

A simple cycle is a closed path where no node appears twice, except the 1st and last are the same

Two elementary circuits are distinct if they aren’t cyclic permutations of each other

>>> DG1 = nx.DiGraph([(0,1), (1,3), (3,0), (3,2), (2,0)])

>>> nx.simple_cycles(DG1)

[[0, 1, 3, 0], [0, 1, 3, 2, 0]]

Connectivity

A graph is connected if there’s a path between every pair of nodes

E.g., the 13-node Arpanet graph is connected

We expect most communication and transportation networks to (try to) be connected

Their goal is to move traffic from one node to another

But there’s no a priori reason to expect graphs in other settings to be connected

Figs. 2.5 and 2.6 show disconnected graphs

Fig. 2.6 is the collaboration graph of the biological research center Structural Genomics of Pathogenic Protozoa

Nodes represent researchers

There’s an edge between 2 nodes if the researchers co-authored a publication

Components

In Fig. 2.5, the graph consists of 3 “pieces”:

one consisting of nodes A and B,

one consisting of nodes C, D, and E, and

one consisting of the rest of the nodes

The network in Fig. 2.6 also consists of 3 pieces: one on 3 nodes, one on 4 nodes, and one that’s much larger

A connected component (or just component) of a graph is a subset of the nodes s.t.

(i) every node in the subset has a path to every other, and

(ii) the subset isn't part of some larger set with the property that every node can reach every other
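
Conditions (i) and (ii) suggest a direct way to compute components: start from an unvisited node and collect everything reachable from it. The following is a plain-Python sketch on a made-up graph (a triangle, a separate edge, and an isolated node), not on the graph of Fig. 2.5.

```python
from collections import deque

def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    nbrs = {n: set() for n in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Everything reachable from start forms one maximal piece.
        seen.add(start)
        queue, piece = deque([start]), []
        while queue:
            u = queue.popleft()
            piece.append(u)
            for v in nbrs[u] - seen:
                seen.add(v)
                queue.append(v)
        components.append(sorted(piece))
    return components

# Hypothetical graph: a triangle, a separate edge, and an isolated node.
comps = connected_components(range(6), [(0, 1), (0, 2), (1, 2), (3, 4)])
print(comps)   # [[0, 1, 2], [3, 4], [5]]
```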

Dividing a graph into its components is just a first, global way of describing its structure

Within a given component, there may be richer internal structure that’s important to our interpretation of the network

E.g., in the largest component in Fig. 2.6, there’s a prominent node at the center, and tightly-knit groups linked to this node but not to each other

This component would break into 3 distinct components if this node were removed

Analyzing a graph this way (its densely-connected regions and the boundaries between them) is a powerful way of thinking about network structure—cf. Chap. 3

Giant Components

Consider the social network of the entire world, with a link between 2 people if they’re friends

This global friendship network probably isn’t connected—consider, e.g., un-contacted tribes

But the component you inhabit probably contains a significant fraction of the world’s population.

This is true for a range of network datasets—large, complex networks often have a giant component

This is a deliberately informal term for a connected component containing a significant fraction of all the nodes

When a network contains a giant component, it almost always contains only one

E.g., if the global friendship network had 2 giant components, all it would take is a meeting between a representative of each to combine them

This in fact happened with the discovery of America—with dramatic consequences

The notion of giant components is useful for reasoning about networks on much smaller scales as well

See the collaboration network in Fig. 2.6

Another example is Fig. 2.7: the romantic relationships in an American high school over an 18-month period

Not all edges were present at once

The fact that this graph contains such a large component is significant regarding the spread of STDs

The researchers noted that, “like social facts, [these structures] are invisible yet consequential macrostructures that arise as the product of individual agency.”

NetworkX: Subgraphs

A subgraph of a graph G is a graph

whose vertex set is a subset of that of G, and

whose adjacency relation is a subset of that of G restricted to this subset

Graph.subgraph(nbunch) returns the subgraph induced on the nodes in nbunch

I.e., the nodes in nbunch and the edges between them
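
The idea of an induced subgraph is easy to state in plain Python. This is only a sketch on a made-up edge list; induced_subgraph is an illustrative helper and says nothing about how NetworkX shares attributes.

```python
def induced_subgraph(edges, nbunch):
    """Keep the nodes in nbunch and only the edges with both ends in nbunch."""
    keep = set(nbunch)
    return keep, [(u, v) for u, v in edges if u in keep and v in keep]

# Hypothetical path graph 0 - 1 - 2 - 3 with one extra chord 0 - 2.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
nodes, sub_edges = induced_subgraph(edges, [0, 1, 2])
print(sorted(nodes))   # [0, 1, 2]
print(sub_edges)       # [(0, 1), (1, 2), (0, 2)]
```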

The graph, edge or node attributes just point to the original graph

So changes to the node or edge structure won’t be reflected in the original graph

But changes to the attributes will

To create a subgraph with its own copy of the edge/node attributes use

nx.Graph(G.subgraph(nbunch))

If edge attributes are containers, get a deep copy using

G.subgraph(nbunch).copy()

For an in-place reduction of a graph to a subgraph, remove nodes

G.remove_nodes_from([n for n in G if n not in set(nbunch)])

The following all have the same description as Graph.subgraph()

DiGraph.subgraph(nbunch)

MultiGraph.subgraph(nbunch)

MultiDiGraph.subgraph(nbunch)

Make a barbell without the handle


>>> G1 = G.subgraph([0,1,2,3,6,7,8,9])

NetworkX: Connected Components

In an undirected graph G, vertices u and v are connected if G contains a path from u to v

A graph is connected if every pair of vertices in it is connected

A connected component is a maximal connected subgraph of G

Each vertex belongs to exactly 1 connected component, as does each edge

A directed graph is weakly connected if replacing all of its directed edges with undirected edges produces a connected (undirected) graph

It’s unilaterally connected if it contains a directed path from u to v or a directed path from v to u for every pair of vertices u, v

It’s strongly connected if it contains a directed path from u to v and a directed path from v to u for every pair of vertices u, v

The weakly connected components are the maximal weakly connected subgraphs

The strongly connected components are the maximal strongly connected subgraphs

The condensation of a directed graph is the graph with each of the strongly connected components contracted into a single node
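
A plain-Python sketch of strong components and the condensation, using mutual reachability on a small digraph (a directed 3-cycle plus one extra edge). It is deliberately simple rather than efficient, and the order of components, and hence the node labels of the condensation, is a choice of this sketch and may differ from what NetworkX returns.

```python
def reachable(succ, start):
    """Set of nodes reachable from start along directed paths (start included)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in succ[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def strongly_connected_components(nodes, edges):
    """Group nodes that can reach each other; simple, not efficient."""
    succ = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)
    reach = {n: reachable(succ, n) for n in nodes}
    comps, assigned = [], set()
    for n in nodes:
        if n not in assigned:
            comp = {m for m in reach[n] if n in reach[m]}   # mutual reachability
            comps.append(sorted(comp))
            assigned |= comp
    return comps

def condensation_edges(comps, edges):
    """Edges between component indices after contracting each component."""
    index = {n: i for i, comp in enumerate(comps) for n in comp}
    return sorted({(index[u], index[v]) for u, v in edges if index[u] != index[v]})

nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
comps = strongly_connected_components(nodes, edges)
print(comps)                              # [[0, 1, 2], [3], [4]]
print(condensation_edges(comps, edges))   # [(1, 2)]
```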

For a Graph G

is_connected(G) tests G’s connectivity

number_connected_components(G) returns the number of connected components in G

connected_components(G) returns a list of the connected components of G, each a list of nodes

connected_component_subgraphs(G) returns a list of the connected components of G as subgraphs

node_connected_component(G, n) returns a list of the nodes in the connected component of G containing node n

>>> G1 = nx.complete_graph(3)

>>> G2 = nx.complete_graph(2)

>>> GG = nx.disjoint_union(G1, G2)

>>> GG.edges()

[(0, 1), (0, 2), (1, 2), (3, 4)]

>>> nx.connected_components(GG)

[[0, 1, 2], [3, 4]]

>>> H1, H2 = nx.connected_component_subgraphs(GG)

>>> H1.edges()

[(0, 1), (0, 2), (1, 2)]

>>> H2.edges()

[(3, 4)]

>>> nx.node_connected_component(GG, 2)

[0, 1, 2]

For a DiGraph G

is_strongly_connected(G) tests G for strong connectivity

number_strongly_connected_components(G) returns the number of strongly connected components in G

strongly_connected_components(G) returns a list of the strongly connected components of G, each a list of nodes

strongly_connected_component_subgraphs(G) returns a list of the strongly connected components of G as subgraphs

condensation(G, scc) returns the condensation of G

scc is a list of strongly connected components—cf. strongly_connected_components()

The resulting graph is a directed acyclic graph

Node labels are the indices of the components in the list of strongly connected components

is_weakly_connected(G) tests G for weak connectivity

number_weakly_connected_components(G) returns the number of weakly connected components in G

weakly_connected_components(G) returns a list of the weakly connected components of G, each a list of nodes

weakly_connected_component_subgraphs(G) returns a list of the weakly connected components of G as subgraphs

>>> DG = nx.DiGraph([(0,1),(1,2),(2,0),(3,4)])

>>> nx.weakly_connected_components(DG)

[[0, 1, 2], [3, 4]]

>>> scc = nx.strongly_connected_components(DG)

>>> scc

[[0, 1, 2], [4], [3]]

>>> DGS = nx.strongly_connected_component_subgraphs(DG)[0]

>>> DGS.edges()

[(0, 1), (1, 2), (2, 0)]

>>> DGcon = nx.condensation(DG, scc)

>>> DGcon.edges()

[(2, 1)]

>>> DGcon.nodes()

[0, 1, 2]

Cliques

A clique in an undirected graph G = (V, E) is a subset of the vertex set C ⊆ V s.t. every 2 vertices in C are connected by an edge

Equivalently, the subgraph induced by C is complete

Sometimes the term “clique” also refers to the subgraph

A maximal clique is a clique that can’t be extended by including 1 more adjacent vertex

i.e., a clique that doesn’t exist exclusively within the vertex set of a larger clique

A maximum clique is a clique of the largest possible size in G

The clique number of G is the number of nodes in a maximum clique of G

Finding a maximum clique is NP-hard (the decision version of the problem is NP-complete)
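
The difference between a clique and a maximal clique can be checked directly from the definitions. The plain-Python sketch below builds by hand a graph with the shape of a barbell (two complete graphs on 4 nodes joined through one middle node); the function names are illustrative.

```python
from itertools import combinations

def is_clique(nbrs, nodes):
    """True if every pair of the given nodes is joined by an edge."""
    return all(v in nbrs[u] for u, v in combinations(nodes, 2))

def is_maximal_clique(nbrs, nodes):
    """True if nodes form a clique that no outside vertex can extend."""
    members = set(nodes)
    if not is_clique(nbrs, members):
        return False
    return not any(members <= nbrs[w] for w in nbrs if w not in members)

# Two complete graphs (nodes 0-3 and 5-8) joined through node 4.
edges = list(combinations(range(4), 2)) + list(combinations(range(5, 9), 2))
edges += [(3, 4), (4, 5)]
nbrs = {n: set() for n in range(9)}
for u, v in edges:
    nbrs[u].add(v)
    nbrs[v].add(u)

print(is_clique(nbrs, [0, 1, 2]))              # True
print(is_maximal_clique(nbrs, [0, 1, 2]))      # False: node 3 extends it
print(is_maximal_clique(nbrs, [0, 1, 2, 3]))   # True
print(is_maximal_clique(nbrs, [3, 4]))         # True
```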

In the following, G may be a Graph, DiGraph, MultiGraph, or MultiDiGraph

find_cliques(G) returns a generator of maximal cliques in G as node lists

graph_clique_number(G) returns the clique number for G

graph_number_of_cliques(G) returns the number of maximal cliques in G

>>> F = nx.barbell_graph(4,1)

>>> for cl in nx.find_cliques(F):

...     print(cl)

...

[8, 5, 6, 7]

[3, 0, 1, 2]

[3, 4]

[5, 4]

>>> nx.graph_number_of_cliques(F)

4

>>> nx.graph_clique_number(F)

4

2.3 Distance and Breadth-First Search

Beyond asking whether 2 nodes are connected by a path, ask how long such a path is

The length of a path is the number of edges in the sequence that comprises it

E.g., the path MIT, BBN, RAND, UCLA in Fig. 2.3 has length 3

The path MIT, UTAH has length 1

The distance between 2 nodes is the length of the shortest path between them

E.g., the distance between LINC and SRI is 3

Check that there’s no length-1 or length-2 path between them

Breadth-First Search

First declare all of your actual friends to be at distance 1

Then find all of their friends (not counting people already friends of yours), declare these to be at distance 2

Then find all of their friends (not counting people already found at distances 1 and 2), declare these to be at distance 3

Continuing in this way, search in successive layers, each representing the next distance out

Each new layer is built from all those nodes that have not already been discovered in earlier layers, and have an edge to some node in the previous layer
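
The layer-by-layer procedure translates almost line for line into code. This is a plain-Python sketch; the friendship graph is invented for illustration and bfs_layers is an illustrative name.

```python
def bfs_layers(nbrs, source):
    """Return [layer0, layer1, ...] where layer d holds the nodes at distance d."""
    seen = {source}
    layers = [[source]]
    while True:
        next_layer = []
        for u in layers[-1]:
            for v in sorted(nbrs[u]):
                if v not in seen:        # not discovered in an earlier layer
                    seen.add(v)
                    next_layer.append(v)
        if not next_layer:
            return layers
        layers.append(next_layer)

# Hypothetical friendship graph, invented for illustration.
nbrs = {
    "you": {"ann", "bob"},
    "ann": {"you", "bob", "cat"},
    "bob": {"you", "ann", "dan"},
    "cat": {"ann", "eve"},
    "dan": {"bob"},
    "eve": {"cat"},
}
for distance, layer in enumerate(bfs_layers(nbrs, "you")):
    print(distance, layer)
# 0 ['you']
# 1 ['ann', 'bob']
# 2 ['cat', 'dan']
# 3 ['eve']
```

The index of the layer in which a node first appears is its distance from the source.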

Figure 2.8

Figure 2.9. How to discover all distances from the node MIT in the 13-node Arpanet graph from Figure 2.3

NetworkX: Depth First Search

Various algorithms giving the result of a depth-first search (DFS) on a graph

The source argument (where the traversal begins) is optional (defaulting to node 0 or whichever is listed first) but generally included

dfs_edges(G, source) returns a generator that produces edges in a DFS

dfs_tree(G, source) returns a directed tree (a DiGraph) of a DFS

dfs_predecessors(G, source) returns a dictionary of predecessors in a DFS

dfs_successors(G, source) returns a dictionary of successors in a DFS

dfs_preorder_nodes(G, source) returns a generator producing nodes in a DFS pre-ordering

dfs_postorder_nodes(G, source) returns a generator producing nodes in a DFS post-ordering

dfs_labeled_edges(G, source) returns a generator that produces edges in a DFS labeled by direction (‘dir’) type (‘forward’, ‘reverse’, ‘nontree’)

>>> G = nx.krackhardt_kite_graph()

>>> list(nx.dfs_edges(G,0))

[(0, 1), (1, 3), (3, 2), (2, 5), (5, 6), (6, 4), (6, 7), (7, 8), (8, 9)]

>>> list(nx.dfs_edges(G,9))

[(9, 8), (8, 7), (7, 5), (5, 0), (0, 1), (1, 3), (3, 2), (3, 4), (4, 6)]

>>> list(nx.dfs_edges(G))

[(0, 1), (1, 3), (3, 2), (2, 5), (5, 6), (6, 4), (6, 7), (7, 8), (8, 9)]

>>> tree = nx.dfs_tree(G, 9)

>>> tree

<networkx.classes.digraph.DiGraph object at 0x0217CF10>

>>> tree.succ

{0: {1: {}}, 1: {3: {}}, 2: {}, 3: {2: {}, 4: {}}, 4: {6: {}}, 5: {0: {}}, 6: {}, 7: {5: {}}, 8: {7: {}}, 9: {8: {}}}

>>> nx.dfs_successors(G, 9)

{0: [1], 1: [3], 3: [2, 4], 4: [6], 5: [0], 7: [5], 8: [7], 9: [8]}

>>> nx.dfs_predecessors(G, 9)

{0: 5, 1: 0, 2: 3, 3: 1, 4: 3, 5: 7, 6: 4, 7: 8, 8: 9}

>>> list(nx.dfs_preorder_nodes(G, 9))

[9, 8, 7, 5, 0, 1, 3, 2, 4, 6]

>>> list(nx.dfs_postorder_nodes(G, 9))

[2, 6, 4, 3, 1, 0, 5, 7, 8, 9]

NetworkX: Breadth First Search

Various algorithms that give the result of a breadth-first search (BFS) on a graph

The source argument (where the traversal begins) again is optional (defaulting to the node listed first) but generally included

bfs_edges(G, source) returns a generator that produces edges in a BFS

bfs_tree(G, source) returns a directed tree of a BFS

bfs_predecessors(G, source) returns a dictionary of predecessors in a BFS

bfs_successors(G, source) returns a dictionary of successors in a BFS

>>> list(nx.bfs_edges(G, 0))

[(0, 1), (0, 2), (0, 3), (0, 5), (1, 4), (1, 6), (5, 7), (7, 8), (8, 9)]

>>> list(nx.bfs_edges(G, 9))

[(9, 8), (8, 7), (7, 5), (7, 6), (5, 0), (5, 2), (5, 3), (6, 1), (6, 4)]

>>> tree = nx.bfs_tree(G, 9)

>>> tree.succ

{0: {}, 1: {}, 2: {}, 3: {}, 4: {}, 5: {0: {}, 2: {}, 3: {}},

6: {1: {}, 4: {}}, 7: {5: {}, 6: {}}, 8: {7: {}}, 9: {8: {}}}

>>> nx.bfs_predecessors(G, 9)

{0: 5, 1: 6, 2: 5, 3: 5, 4: 6, 5: 7, 6: 7, 7: 8, 8: 9}

>>> nx.bfs_successors(G, 9)

{8: [7], 9: [8], 5: [0, 2, 3], 6: [1, 4], 7: [5, 6]}

NetworkX: Shortest Path

These algorithms work for undirected and directed graphs

Return an arbitrary shortest path when there is more than 1 shortest path between 2 nodes

Those that search for a path between 2 particular nodes raise a NetworkXNoPath exception if there is no such path

has_path(G, source, target) returns True if G has a path from source to target; otherwise, False is returned

shortest_path(G, source=None, target=None, weight=None) computes shortest paths in G

If the source and target are both specified, return a single list of nodes in a shortest path

If only the source is specified, return a dictionary keyed by targets with a list of nodes in a shortest path

If neither the source nor the target is specified, return path, a dictionary of dictionaries where

path[source][target] is the list of nodes in the source-to-target path

weight, if None (default), causes every edge to have weight/distance 1

If weight is a string, it’s the edge attribute to use as the edge weight

Any edge attribute not present defaults to 1

shortest_path_length(G, source=None, target=None, weight=None) computes shortest path lengths in G

source, target, weight are as with shortest_path()

If the source and target are both specified, return a single number for the shortest path length

If only the source is specified, return a dictionary keyed by targets with the shortest path lengths as values

If neither the source nor the target is specified, return length, a dictionary of dictionaries where length[source][target] is the length of a shortest path from source to target

average_shortest_path_length(G, weight=None) returns the average shortest path length over all pairs of distinct nodes of G

weight is as before

The average shortest path length a is

$$a = \sum_{s,t \in V} \frac{d(s, t)}{n(n-1)}$$

where

V is the set of nodes in G,

d(s, t) is the length of a shortest path from s to t, and

n is the number of nodes in G
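
As a cross-check of this definition, the plain-Python sketch below computes weighted shortest-path lengths with a small Dijkstra routine and divides their sum by n(n-1). It uses the same edge weights as the DG example in this section. Skipping pairs that have no connecting path is an assumption of this sketch, and the function name is illustrative.

```python
import heapq

def dijkstra_lengths(succ, source):
    """Weighted shortest-path lengths from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                     # stale heap entry
        for v, w in succ.get(u, {}).items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

# succ[u][v] is the weight of the directed edge u -> v.
succ = {
    0: {1: 1.0},
    1: {0: 1.0, 3: 1.0, 4: 3.0},
    2: {1: 1.0},
    3: {4: 1.0},
    4: {2: 1.0, 1: 3.0},
    5: {4: 1.0},
}

n = len(succ)
# Sum d(s, t) over all reachable ordered pairs; d(s, s) = 0 adds nothing.
total = sum(sum(dijkstra_lengths(succ, s).values()) for s in succ)
a = total / (n * (n - 1))
print(total)         # 58.0
print(round(a, 4))   # 1.9333
```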

>>> import networkx as nx

>>> DG = nx.DiGraph()

>>> DG.add_weighted_edges_from([(0,1,1.0), (1,0,1.0),

...     (1,3,1.0), (3,4,1.0), (4,2,1.0), (2,1,1.0),

...     (1,4,3.0), (4,1,3.0), (5,4,1.0)])

>>> nx.has_path(DG,5,0)

True

>>> nx.has_path(DG,0,5)

False

>>> d = nx.shortest_path(DG)

>>> d

{0: {0: [0], 1: [0, 1], 2: [0, 1, 4, 2], 3: [0, 1, 3], 4: [0, 1, 4]},

1: {0: [1, 0], 1: [1], 2: [1, 4, 2], 3: [1, 3], 4: [1, 4]},

2: {0: [2, 1, 0], 1: [2, 1], 2: [2], 3: [2, 1, 3], 4: [2, 1, 4]},

3: {0: [3, 4, 1, 0], 1: [3, 4, 1], 2: [3, 4, 2], 3: [3], 4: [3, 4]},

4: {0: [4, 1, 0], 1: [4, 1], 2: [4, 2], 3: [4, 1, 3], 4: [4]},

5: {0: [5, 4, 1, 0], 1: [5, 4, 1], 2: [5, 4, 2], 3: [5, 4, 1, 3],

4: [5, 4], 5: [5]}}

>>> d[0]

{0: [0], 1: [0, 1], 2: [0, 1, 4, 2], 3: [0, 1, 3], 4: [0, 1, 4]}

>>> dw = nx.shortest_path(DG, weight='weight')

>>> dw[0]

{0: [0], 1: [0, 1], 2: [0, 1, 3, 4, 2], 3: [0, 1, 3], 4: [0, 1, 3, 4]}

>>> nx.shortest_path(DG, source=0, weight='weight')

{0: [0], 1: [0, 1], 2: [0, 1, 3, 4, 2], 3: [0, 1, 3], 4: [0, 1, 3, 4]}

>>> nx.shortest_path(DG, source=0, target=4, weight='weight')

[0, 1, 3, 4]

>>> nx.shortest_path(DG, source=0, target=5)

Traceback (most recent call last):

…

networkx.exception.NetworkXNoPath: No path between 0 and 5.

>>> dw_len = nx.shortest_path_length(DG, weight='weight')

>>> dw_len[0]

{0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0}

>>> dw_len[0][4]

3.0

>>> nx.shortest_path_length(DG, source=0, weight='weight')

{0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0}

>>> nx.shortest_path_length(DG, source=0, target=4, weight='weight')

3.0

>>> nx.average_shortest_path_length(DG, weight='weight')

1.9333333333333333

>>> dw_len = nx.shortest_path_length(DG, weight='weight')

>>> count = len_sum = 0

>>> for lens in dw_len.values():

...     count += len(lens)

...     len_sum += sum(lens.values())

...

>>> print(count, len_sum)

31 58.0

>>> len_sum / count

1.8709677419354838

Advanced Shortest Path

These are often more specific than the functions listed above and often provide the implementations for those functions

Generally return results in the now familiar nested dictionary format

A function without ‘dijkstra’ in its name ignores weights and other edge data

A function with ‘dijkstra’ in its name by default considers the values of edge ‘weight’ attributes

To consider different edge data, set the ‘weight’ keyword argument to that attribute

All these functions have a keyword argument cutoff

Can be set to stop the search at the given depth

Paths of length greater than the cutoff are ignored

single_source_shortest_path(G, source, cutoff=None) computes shortest path from source to all nodes reachable from it

single_source_shortest_path_length(G, source, cutoff=None) computes shortest path lengths from source to all reachable nodes

all_pairs_shortest_path(G, cutoff=None) computes shortest paths between all nodes

all_pairs_shortest_path_length(G, cutoff=None) computes the shortest path lengths between all nodes

>>> nx.single_source_shortest_path(DG, 0)

{0: [0], 1: [0, 1], 2: [0, 1, 4, 2], 3: [0, 1, 3],

4: [0, 1, 4]}

>>> nx.single_source_shortest_path(DG, 0, cutoff=2)

{0: [0], 1: [0, 1], 3: [0, 1, 3], 4: [0, 1, 4]}

>>> nx.single_source_shortest_path_length(DG, 0)

{0: 0, 1: 1, 2: 3, 3: 2, 4: 2}

>>> nx.all_pairs_shortest_path_length(DG)

{0: {0: 0, 1: 1, 2: 3, 3: 2, 4: 2},

1: {0: 1, 1: 0, 2: 2, 3: 1, 4: 1},

2: {0: 2, 1: 1, 2: 0, 3: 2, 4: 2},

3: {0: 3, 1: 2, 2: 2, 3: 0, 4: 1},

4: {0: 2, 1: 1, 2: 1, 3: 2, 4: 0},

5: {0: 3, 1: 2, 2: 2, 3: 3, 4: 1, 5: 0}}

dijkstra_path(G, source, target, weight='weight') returns the shortest path from source to target in a weighted graph

dijkstra_path_length(G, source, target, weight='weight') returns the shortest path length from source to target in a weighted graph

single_source_dijkstra_path(G, source, cutoff=None, weight='weight') computes the shortest paths between source and all other reachable nodes for a weighted graph

single_source_dijkstra_path_length(G, source, cutoff=None, weight='weight') computes the lengths of the shortest paths between source and all other reachable nodes for a weighted graph

all_pairs_dijkstra_path(G, cutoff=None, weight='weight') computes shortest paths between all nodes in a weighted graph

all_pairs_dijkstra_path_length(G, cutoff=None, weight='weight') computes the lengths of the shortest paths between all nodes in a weighted graph

single_source_dijkstra(G, source, target=None, cutoff=None, weight='weight') computes the shortest paths and their lengths in a weighted graph

Returns a tuple of 2 dictionaries keyed by node,

1st for distances from the source

2nd for the paths from the source to that node

>>> nx.dijkstra_path(DG,0,4)

[0, 1, 3, 4]

>>> ls, ps = nx.single_source_dijkstra(DG, 0)

>>> ls

{0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0}

>>> ps

{0: [0], 1: [0, 1], 2: [0, 1, 3, 4, 2], 3: [0, 1, 3], 4: [0, 1, 3, 4]}

floyd_warshall(G, weight='weight') finds all-pairs shortest path lengths using Floyd’s algorithm

Floyd’s algorithm is appropriate for finding shortest paths in dense graphs or graphs with negative weights when Dijkstra’s algorithm fails

This algorithm can still fail if there are negative cycles

It has running time O(n^3) and running space O(n^2)
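
A minimal plain-Python sketch of Floyd’s algorithm, run on the same edge weights as the DG example in this section. It assumes there are no negative cycles and is meant to show where the cubic time and quadratic space come from, not to stand in for the library routine.

```python
INF = float("inf")

def floyd_warshall(nodes, weighted_edges):
    """All-pairs shortest path lengths; assumes no negative cycles."""
    dist = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
    for u, v, w in weighted_edges:
        dist[u][v] = min(dist[u][v], w)
    # Allow each node k in turn as an intermediate stop. The three nested
    # loops give the cubic running time; the table is the quadratic space.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

edges = [(0, 1, 1.0), (1, 0, 1.0), (1, 3, 1.0), (3, 4, 1.0), (4, 2, 1.0),
         (2, 1, 1.0), (1, 4, 3.0), (4, 1, 3.0), (5, 4, 1.0)]
fw = floyd_warshall(range(6), edges)
print(fw[5][0])   # 4.0
print(fw[0][4])   # 3.0
print(fw[0][5])   # inf
```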

>>> fw = nx.floyd_warshall(DG)

>>> fw[0]

defaultdict(<function <lambda> at 0x01E0E5B0>, {0: 0, 1: 1.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: inf})

>>> fw[5]

defaultdict(<function <lambda> at 0x01FC1230>, {0: 4.0, 1: 3.0, 2: 2.0, 3: 4.0, 4: 1.0, 5: 0})

>>> fw[5][0]

4.0

Note: defaultdict (from the collections module) is a dict subclass that calls a factory function to supply missing values

astar_path(G, source, target, heuristic=None, weight='weight') returns a list of nodes in a shortest path between source and target using the A* (“A-star”) algorithm

heuristic is a function to estimate the distance from a node to target

To guarantee a shortest path, this function should never overestimate this distance

The function takes 2 node arguments and must return a number
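
To show the role of the heuristic, here is a plain-Python sketch of A* on an unweighted 3-by-3 grid. It is not the NetworkX implementation: the function names, the grid builder, and the unit edge cost are all assumptions of this sketch. Straight-line distance never exceeds the number of grid steps, so it doesn’t overestimate.

```python
import heapq

def astar_path(nbrs, source, target, heuristic):
    """Shortest path by A* on an unweighted graph (every edge costs 1)."""
    # Heap entries: (estimated total cost, cost so far, node, path so far).
    heap = [(heuristic(source, target), 0, source, [source])]
    best = {source: 0}
    while heap:
        _, cost, node, path = heapq.heappop(heap)
        if node == target:
            return path
        for nxt in nbrs[node]:
            new_cost = cost + 1
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                estimate = new_cost + heuristic(nxt, target)
                heapq.heappush(heap, (estimate, new_cost, nxt, path + [nxt]))
    raise ValueError("no path between source and target")

def grid_neighbors(n):
    """Neighbor sets for an n-by-n grid whose nodes are (x, y) tuples."""
    nbrs = {}
    for x in range(n):
        for y in range(n):
            steps = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
            nbrs[(x, y)] = {(a, b) for a, b in steps if 0 <= a < n and 0 <= b < n}
    return nbrs

def euclid(a, b):
    """Straight-line distance between two grid points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

grid = grid_neighbors(3)
path = astar_path(grid, (0, 0), (2, 2), euclid)
print(len(path) - 1)   # 4
```

Several corner-to-corner paths with 4 edges exist, so which one is returned depends on tie-breaking.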

>>> G=nx.grid_graph(dim=[3,3])

>>> def dist(a, b):

...     (x1, y1) = a

...     (x2, y2) = b

...     return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

...

>>> nx.astar_path(G,(0,0),(2,2),dist)

[(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]

The Small-World Phenomenon

Go back to our thought experiments on the global friendship network

The argument explaining why you belong to a giant component asserts something stronger:

Not only do you have paths of friends connecting you to a large fraction of the world’s population,

but these paths are surprisingly short

E.g., consider a friend from another country (thence to his friends and relatives, etc.)

This is the small-world phenomenon:

the idea that the world looks “small” when you think of how short a path of friends it takes to get from you to almost anyone else

Also known as the 6 degrees of separation

Title of a play by John Guare

One line is:

“I read somewhere that everybody on this planet is separated by only six other people. Six degrees of separation between us and everyone else on this planet”

The first experimental study of this notion (and the origin of “6”) was by Stanley Milgram and his colleagues in the 1960s

With a small budget, he tested the idea that people are really connected in the global friendship network by short chains of friends

Asked 296 randomly chosen “starters” to try forwarding a letter to a “target” person, a stockbroker living in a suburb of Boston

The starters were each given some personal info about the target (including address and occupation)

Asked to forward the letter to someone they knew on a first-name basis, with the same instructions: try to reach the target as quickly as possible

Formed chains of people that closed in on the stockbroker

Figure 2.10: the distribution of path lengths, among the 64 chains that reached the target

The median length was 6

It’s striking that so many letters reached their destination and by such short paths

Some caveats about the experiment

It clearly doesn't establish a statement as bold as “six degrees of separation between us and everyone else on this planet”

The paths were just to a single, fairly affluent target

Many letters never got there

Attempts to recreate the experiment have been problematic due to lack of participation

We can ask how useful these short paths really are

Milgram himself in his original paper:

If we think of each person as the center of their own social “world,” then “six short steps” becomes “six worlds apart”

Makes 6 sound like a much larger number

Still, the overall conclusion has been accepted in a broad sense:

Social networks tend to have very short paths between essentially arbitrary pairs of people

The existence of all these short paths has substantial consequences for the potential speed with which info, diseases, and other kinds of contagion can spread

Cf. also the potential access that the social network provides to opportunities and to people with very different characteristics from our own

See Chapter 20, a more detailed study of the small-world phenomenon and its consequences

Instant Messaging, Paul Erdös, and Kevin Bacon

That social networks generally are “small worlds” has been increasingly confirmed in settings where we do have full data on the network structure

Milgram resorted to an experiment where letters served as “tracers” through a global friendship network

He had no hope of fully mapping the network on his own

But where the full graph structure is known,

we load it into a computer and perform breadth-first search to determine what typical distances look like

One of the largest such computational studies was by Leskovec and Horvitz

Analyzed the 240 million active user accounts on Microsoft Instant Messenger

Employed by Microsoft at the time, they had access to a complete snapshot of the system for the month under study

Built a graph where each node corresponds to a user

There’s an edge between two users if they engaged in a two-way conversation at any point during a month-long observation period

The graph had a giant component containing almost all of the nodes

The distances within this giant component were very small

An estimated average distance of 6.6

An estimated median of 7

Figure 2.11: The distribution of distances averaged over a random sample of 1000 users:

Breadth-first search was done separately from each of these 1000 users

The results from these 1000 nodes were combined to produce the plot in Figure 2.11

The graph was so large that doing breadth-first search from every single node would have taken an astronomical amount of time

Producing plots like this efficiently for massive graphs is an interesting research topic in itself

Figure 2.11 approximates what Milgram was after: the distribution of how far apart we all are in the full global friendship network

Reconciling the structure of such massive datasets with the underlying networks we’re trying to measure is an issue arising here and many times later

Here we’re still some distance from Milgram's goal

We only track people who are technologically-endowed enough to have access to instant messaging

Rather than basing the graph on who is truly friends with whom, we observe only who talks to whom during an observation period

Turning to a smaller scale (magnitude 10^5 rather than 10^8 people), researchers have also discovered very short paths in the collaboration networks within professional communities

E.g., in mathematics, Erdös (published c. 1500 papers) is a central figure in the collaborative structure of the field

Define a collaboration graph (as in Figure 2.6) with

nodes for mathematicians and

edges connecting pairs who have jointly authored a paper

Figure 2.12: A small hand-drawn piece of the collaboration graph, with paths leading to Paul Erdös

A mathematician's Erdös number is the distance from them to Erdös

Most mathematicians have Erdös numbers of at most 4 or 5

Extending the collaboration graph to co-authorship across all the sciences, most scientists in other fields have Erdös numbers only slightly (if at all) larger

Einstein (2), Fermi (3), Chomsky (4), Pauling (4), Crick (5), Watson (6)

Three students at Albright College in PA around 1994 adapted the idea of Erdös numbers to the collaboration graph of movie actors

Nodes are performers

An edge connects 2 performers if they've appeared together in a movie

A performer's Bacon number is their distance in this graph to Kevin Bacon

Using cast lists from the Internet Movie Database (IMDB), compute Bacon numbers for all performers via breadth-first search

The ave. Bacon number, over all performers in the IMDB, is c. 2.9

Hard to find one that's larger than 5

One movie enthusiast tried to come up with the largest Bacon number

Found an obscure 1928 Soviet pirate film, Plenniki Morya, starring P. Savin with Bacon number of 7

Supporting cast of 8 appeared nowhere else

NetworkX: Minimum Spanning Tree

Given a connected, undirected graph G, a spanning tree of G is a subgraph that

is a tree and

connects all the vertices

A single graph may have several spanning trees

The weight of a spanning tree is the sum of the weights of the edges in that spanning tree

A minimum spanning tree (MST) is a spanning tree whose weight is less than or equal to the weight of every other spanning tree

More generally, any undirected graph (not necessarily connected) has a minimum spanning forest: the union of MSTs for its connected components

minimum_spanning_tree(G, weight='weight') returns an MST or, if the graph isn’t connected, a min. spanning forest of G

G must be a Graph (not a DiGraph, MultiGraph, …)

A Graph is returned

weight is the edge-data key to use for the weight (default = 'weight')

If the edges don’t have a weight attribute, a default weight of 1 is used

Uses Kruskal’s algorithm

minimum_spanning_edges(G, weight='weight', data=True) returns a generator that produces edges in the MST

Edges are 3-tuples (u,v,w), where w is the edge-data dictionary

If keyword argument data is False, edges are just (u,v)

>>> import networkx as nx

>>> import matplotlib.pyplot as plt

>>> G = nx.watts_strogatz_graph(30, 10, 0.2)

>>> nx.draw_graphviz(G, prog='sfdp', node_color='w')

>>> plt.show()

>>> mst = nx.minimum_spanning_tree(G)

>>> nx.draw_graphviz(mst, prog='sfdp', node_color='w')

>>> plt.show()

>>> ee = nx.minimum_spanning_edges(G)

>>> for (u,v,w) in ee:

...     if u == 0 or v == 0:

...         print(u, v, w)

...

0 3 {}

0 4 {}

0 5 {}

0 7 {}

0 16 {}

0 26 {}

0 27 {}

0 28 {}

0 29 {}
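The example above has no edge weights, so every edge counts as 1. To see how weights matter, the following is a minimal pure-Python sketch of Kruskal's algorithm, which the notes above say minimum_spanning_tree uses. It is an illustration only, not NetworkX's implementation; the function name kruskal_mst and the small weighted graph are assumptions made for this example.

```python
def kruskal_mst(nodes, weighted_edges):
    """Return (total_weight, edges) of a minimum spanning forest.

    weighted_edges is a list of (u, v, weight) tuples for an undirected graph.
    """
    parent = {n: n for n in nodes}  # union-find forest, one tree per component

    def find(n):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    chosen = []
    total = 0
    # Consider edges from lightest to heaviest.
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two different components
            parent[root_u] = root_v   # merge them
            chosen.append((u, v, w))
            total += w
    return total, chosen

# Hypothetical graph; node "e" is isolated, so the result is a spanning forest.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("c", "d", 4), ("b", "d", 5)]
total, tree = kruskal_mst(nodes, edges)
print(total, tree)
```

The edges a-c and b-d are skipped because each would close a cycle among nodes that are already connected. The forest has 3 edges: 5 nodes minus 2 connected components.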

NetworkX: Distance Measures

First define the graph-theoretic distance-related concepts, then give the relevant NetworkX functions

The distance between 2 vertices (nodes) in a graph is the number of edges in a shortest path connecting them

Also called the geodesic distance: it’s the length of the graph geodesic between those 2 vertices

A graph geodesic is a shortest path between 2 nodes (there may be several for a given pair of nodes)

If there is no path connecting the 2 vertices, the distance is defined as infinite

The eccentricity of a vertex v is the greatest geodesic distance between v and any other vertex

How far a node is from the node most distant from it in the graph

The radius of a graph is the min. eccentricity of any vertex in the graph

The diameter of a graph is the max. eccentricity of any vertex in the graph

I.e., the greatest distance between any pair of vertices.

A central vertex in a graph of radius r is one whose eccentricity is r —i.e., a vertex that achieves the radius

A peripheral vertex in a graph of diameter d is one that is distance d from some other vertex—i.e., a vertex that achieves the diameter
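The definitions above can be checked by hand on a small graph. The following is a minimal sketch that computes them directly with breadth-first search. It is not how NetworkX is implemented; the helper names and the 5-node example graph are assumptions for illustration, and the graph is assumed to be connected and undirected.

```python
from collections import deque

def bfs_distances(adj, source):
    """Geodesic distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def eccentricities(adj):
    """Eccentricity of every node; raises if some distance is infinite."""
    ecc = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        if len(dist) != len(adj):
            raise ValueError("graph is not connected")
        ecc[v] = max(dist.values())
    return ecc

# Hypothetical example: the path 0-1-2-3 with an extra node 4 attached to node 1.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}

ecc = eccentricities(adj)
radius = min(ecc.values())
diameter = max(ecc.values())
center = [v for v in adj if ecc[v] == radius]       # vertices achieving the radius
periphery = [v for v in adj if ecc[v] == diameter]  # vertices achieving the diameter
print(ecc, radius, diameter, center, periphery)
```

Here nodes 1 and 2 are central (eccentricity 2), while nodes 0, 3, and 4 are peripheral (eccentricity 3).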

Python Functions for Distance Measures

eccentricity(G) returns a dictionary mapping each node of G to its eccentricity

radius(G) returns the radius of G

diameter(G) returns the diameter of G

center(G) returns the set of central vertices (nodes) of G

periphery(G) returns the set of peripheral nodes of G


>>> nx.eccentricity(G)

{0: 5, 1: 5, 2: 5, 3: 4, 4: 3, 5: 3, 6: 4, 7: 5, 8: 5, 9: 5}

>>> nx.diameter(G)

5

>>> nx.periphery(G)

[0, 1, 2, 7, 8, 9]

>>> nx.radius(G)

3

>>> nx.center(G)

[4, 5]

>>> DG1 = nx.DiGraph([(0,1), (1,3), (3,0), (3,2), (2,0)])

>>> nx.eccentricity(DG1)

{0: 3, 1: 2, 2: 3, 3: 2}

>>> nx.diameter(DG1)

3

>>> nx.periphery(DG1)

[0, 2]

>>> nx.radius(DG1)

2

>>> nx.center(DG1)

[1, 3]

NetworkX: Directed Acyclic Graphs

The following functions work only for a DiGraph

A directed acyclic graph (DAG) is a directed graph with no cycles

A topological sort is a non-unique permutation of the nodes of a DAG s.t. an edge from u to v implies that u appears before v

is_directed_acyclic_graph(DG) returns True if DG is a DAG or False if not

topological_sort(DG, nbunch=None) returns a list of nodes in topological sort order

nbunch is an optional container of nodes; only those nodes are sorted

If DG isn’t a DAG, no topological sort exists, and a NetworkXUnfeasible exception is raised

>>> DG = nx.DiGraph([(0,2), (0,3), (1,2), (1,4), (2,3), (2,4)])

>>> nx.is_directed_acyclic_graph(DG)

True

>>> nx.topological_sort(DG)

[1, 0, 2, 4, 3]

>>> nx.topological_sort(DG, [4,2,3])

[2, 3, 4]

>>> DG.add_edge(3,1)

>>> nx.is_directed_acyclic_graph(DG)

False

>>> nx.topological_sort(DG)

Traceback (most recent call last):

…

networkx.exception.NetworkXUnfeasible: Graph contains a cycle.
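To make the idea of a topological sort concrete, here is a minimal sketch of Kahn's algorithm in plain Python. This is one standard way to compute such an order and is not necessarily what NetworkX does internally; the function name topological_order is an assumption for this example. It uses the same edges as DG above.

```python
from collections import deque

def topological_order(nodes, edges):
    """Return one topological order, or raise ValueError if there is a cycle."""
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Nodes with no incoming edges can be placed first.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1          # u is placed, so its edges no longer block v
            if indegree[v] == 0:
                ready.append(v)

    if len(order) != len(successors):  # leftover nodes all lie on or behind a cycle
        raise ValueError("graph contains a cycle")
    return order

nodes = [0, 1, 2, 3, 4]
edges = [(0, 2), (0, 3), (1, 2), (1, 4), (2, 3), (2, 4)]
order = topological_order(nodes, edges)
print(order)
```

This sketch yields 0, 1, 2, 3, 4, which differs from the order NetworkX printed above. Both are valid, which illustrates that a topological sort is not unique. Adding the edge (3, 1) makes the function raise, mirroring the NetworkXUnfeasible case.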

NetworkX: Reversing a DiGraph

DiGraph.reverse(copy=True) returns the reverse of the graph: a graph with the same nodes and edges but with the edge directions reversed

copy, if True, results in a new DiGraph returned that holds the reversed edges

If copy is False, the reverse graph is created using the original graph (changing it in place)

MultiDiGraph.reverse(copy=True) has the same description

>>> DG = nx.DiGraph([(0,1), (1,2)])

>>> DG1 = DG.reverse()
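By the definition above, DG1 holds the edges 1→0 and 2→1. The following plain-Python sketch shows the same operation on an edge list; the function name reversed_edges is hypothetical and this is not NetworkX code.

```python
def reversed_edges(edges):
    """Return a new edge list with every directed edge (u, v) flipped to (v, u)."""
    return [(v, u) for u, v in edges]

edges = [(0, 1), (1, 2)]         # same edges as DG above
flipped = reversed_edges(edges)  # the input list is left unchanged, like copy=True
print(flipped)
```

Reversing twice gives back the original edges, and every node's successors in the original become its predecessors in the reverse.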

2.4 Network Datasets: An Overview

The increasing availability of large, detailed network datasets has led to an explosion of research on large-scale networks in recent years

Now think more systematically about where people get the data for such research

There are several reasons we might study a particular network dataset

We may care about the actual domain it comes from

So fine-grained details of the data are potentially as interesting as the broad picture

Or we’re using the dataset as a proxy for a related network that may be impossible to measure

E.g., the Microsoft IM graph from Figure 2.11 gave us info about distances in a social network of a scale and character that begins to approximate the global friendship network

Or we’re looking for network properties common across many different domains

So finding a similar effect in unrelated settings can suggest that it has a certain universal nature

All 3 motivations are often at work simultaneously, to varying degrees

E.g., consider the analysis of the Microsoft IM graph

It gave insight into the global friendship network

At a more specific level, the researchers were also interested in the dynamics of instant messaging in particular

At a more general level, the result of the IM graph analysis fit into the broader framework of small-world phenomena that spans many domains

To study a social network on 20 people, we can interview them all and ask them who their friends are

But to study the interactions among 20,000 people, we need to be more opportunistic in where we look for data

Can't just go collect everything by hand

Must think about settings where the data has in some essential way already been measured for us

Now consider some of the main sources of large-scale network data used for research

The resulting list is not exhaustive

The categories aren’t truly distinct—a single dataset can exhibit characteristics from several

Collaboration Graphs

Collaboration graphs record who works with whom in a specific setting

E.g., co-authorships among scientists, co-appearances by actors

An example extensively studied by sociologists is the graph on highly-placed people in the corporate world

An edge joins 2 if they’ve served together on the board of directors of the same Fortune 500 company

The on-line world provides new instances

The Wikipedia collaboration graph (connecting 2 Wikipedia editors if they've ever edited the same article)

The World-of-Warcraft collaboration graph (connecting 2 W-o-W users if they've ever taken part together in the same raid or other activity)
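All of these collaboration graphs are built the same way: start from records of who took part in what, then join every pair that shares an item. A minimal sketch, assuming made-up editing records and a hypothetical helper collaboration_edges:

```python
from itertools import combinations

# Hypothetical records: article -> editors who have edited it.
edited_by = {
    "Graph theory": ["ann", "bob", "cara"],
    "Arpanet": ["bob", "dev"],
    "Food web": ["eve"],  # a lone editor produces no edges
}

def collaboration_edges(groups):
    """Undirected edges between every pair that appears together in some group."""
    edges = set()
    for members in groups.values():
        # sorted() gives each pair one canonical orientation (a < b).
        for a, b in combinations(sorted(set(members)), 2):
            edges.add((a, b))
    return edges

edges = collaboration_edges(edited_by)
print(sorted(edges))
```

The same function would apply to papers and authors, films and performers, or boards and directors, by swapping in a different mapping.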

Sometimes a collaboration graph is studied to learn about the specific domain it comes from

E.g., sociologists who study the business world are interested in the relationships among companies at the director level, as expressed via co-membership on boards

In contrast, e.g., people other than research scientists are interested in scientific co-authorship networks

because they form detailed, pre-digested snapshots of a rich form of social interaction that unfolds over a long period of time

With on-line bibliographic records, can often track the patterns of collaboration within a field across a century or more

Thereby extrapolate how the social structure of collaboration may work across a range of harder-to-measure settings as well

Who-Talks-to-Whom Graphs

The Microsoft IM graph is a snapshot of a large community engaged in several billion conversations during a month

Captures the “who-talks-to-whom” structure of the community

Similar datasets have been constructed

from the e-mail logs within a company or a university

from records of phone calls: study the structure of call graphs, where

each node is a phone number

there’s an edge between 2 if they engaged in a phone call over a given observation period
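A call graph of this kind might be assembled as follows. This is a minimal sketch: the records, phone numbers, and day-based observation window are all invented for illustration, and call_graph is a hypothetical helper.

```python
# Hypothetical call records: (caller, callee, day of the call).
calls = [
    ("555-0101", "555-0102", 3),
    ("555-0102", "555-0101", 9),   # same pair, opposite direction
    ("555-0103", "555-0101", 15),
    ("555-0104", "555-0102", 40),  # outside the observation period
]

def call_graph(records, start, end):
    """Undirected edges for pairs with at least one call during [start, end]."""
    edges = set()
    for caller, callee, day in records:
        if start <= day <= end and caller != callee:
            # frozenset ignores direction, so A->B and B->A give one edge.
            edges.add(frozenset((caller, callee)))
    return edges

edges = call_graph(calls, start=1, end=30)
print(len(edges))
```

Treating edges as undirected is a modeling choice made here; keeping (caller, callee) tuples instead would give a directed who-calls-whom graph.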

Can also use the fact that mobile phones with short-range wireless technology can detect other similar devices nearby

Equip subjects with such devices

Study the traces they record

Thereby build “face-to-face” graphs that record physical proximity

A node is a person carrying one of the mobile devices

There’s an edge joining 2 people if they were detected to be in close physical proximity over the observation period

The nodes generally represent customers, employees, or students of the organization that maintains the data, and these individuals have strong expectations of privacy

The research is generally restricted in specific ways to protect privacy

Such privacy considerations have also become an issue where

companies try to use this type of data for marketing

governments try to use it for intelligence-gathering purposes

Economic network measurements recording the “who-transacts-with-whom” structure of a market or financial community have been used to study the ways in which

different levels of access to market participants lead to

different levels of market power and different prices for goods

This motivates more mathematical investigations of how a network structure limiting access between buyers and sellers affects outcomes (cf. Chaps. 10-12)

Information Linkage Graphs

Snapshots of the Web are central examples of network datasets

Nodes are Web pages

Directed edges represent links from one page to another

Of particular interest (beyond the info in the documents) are the social and economic structures that stand behind the info

hundreds of millions of personal pages on social-networking and blogging sites

hundreds of millions more representing companies and governmental organizations engineering their external images

Because of the scale of the full Web, just manipulating the data effectively is a research challenge in itself

So a lot of network research is done on interesting, well-defined subsets of the Web, including the linkages among

bloggers

pages on Wikipedia

pages on social-networking sites such as Facebook or MySpace

discussions and product reviews on shopping sites

Since the early 20th century (well before the Web), citation analysis has studied the network structure of citations among scientific papers or patents

Lets us track the evolution of science

Citation networks remain popular in social research for the same reason that scientific co-authorship graphs are

They’re very clean datasets that span decades

Technological Networks

Don’t think of the Web as primarily a technological network

It’s really a projection onto a technological backdrop of ideas, info, and social and economic structure created by humans

But there’s been a convergence of social and technological networks

A lot of interesting network data comes from the more overtly technological end of the spectrum

Nodes represent physical devices

Edges represent physical connections between them

Examples include the interconnections among computers on the Internet or among generating stations in a power grid

Even such physical networks are ultimately also economic networks

Represent the interactions among the competing organizations, companies, regulatory bodies, and other economic entities that shape them

On the Internet, we have a two-level view of the network

At the lowest level

Nodes are individual routers and computers

An edge means that 2 devices are physically connected

At a higher level, these nodes are grouped into little “nation-states” termed autonomous systems (ASs)

Each is controlled by a different Internet service provider (ISP)

The who-transacts-with-whom graph on the ASs is the AS graph

It represents the data transfer agreements these ISPs make with each other

Networks in the Natural World

Network research has special interest in several different types of biological networks

Look at 3 examples at 3 different scales, from population level down to molecular level

Food webs represent the who-eats-whom relationships among species in an ecosystem

There’s a node for each species

A directed edge from node A to node B indicates that members of A consume members of B

Seeing the structure of a food web as a graph helps us reason about issues such as cascading extinctions

If certain species become extinct, species relying on them for food also risk extinction

These extinctions can propagate through the food web
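The propagation can be sketched as a simple fixed-point computation on the directed graph. The rule used here is a deliberate simplification and an assumption of this example: a consumer is lost only when all of the species it eats are gone. The species, the food web, and the function name cascade are hypothetical.

```python
def cascade(eats, removed):
    """Return the set of species lost when the species in `removed` go extinct.

    eats maps each consumer to the set of species it eats
    (a directed edge from consumer to prey, as in the food web above).
    Assumption: a consumer goes extinct only once ALL of its prey are extinct.
    """
    extinct = set(removed)
    changed = True
    while changed:                      # repeat until no new extinctions occur
        changed = False
        for consumer, prey in eats.items():
            if consumer not in extinct and prey and prey <= extinct:
                extinct.add(consumer)
                changed = True
    return extinct

# Hypothetical food web; species with no entry in `eats` consume nothing here.
eats = {
    "rabbit": {"grass"},
    "mouse": {"grass", "seeds"},
    "fox": {"rabbit", "mouse"},
    "hawk": {"rabbit"},
}
lost = cascade(eats, {"grass"})
print(sorted(lost))
```

Losing grass takes the rabbit with it, and then the hawk, which eats only rabbits. The mouse survives on seeds, so the fox still has food. Removing both grass and seeds would wipe out every species in this toy web.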

In the structure of neural connections within an organism's brain:

Nodes are neurons

An edge represents a connection between 2 neurons

The global brain architecture for the simple organism C. elegans (a 1mm roundworm), with 302 nodes and c. 7000 edges, has been completely mapped

But detailed network pictures for brains of higher organisms are far beyond the state of the art

Still, significant insight has been gained by studying the structure of specific modules within a complex brain and understanding how they interrelate

There are many ways to define the set of networks that make up a cell’s metabolism, but roughly:

Nodes are compounds that play a role in a metabolic process

Edges represent chemical interactions among them

The hope is that analysis of these networks can shed light on the complex reaction pathways and regulatory feedback loops that take place inside a cell, and suggest network-centric attacks on pathogens that disrupt a cell’s metabolism
