networks all around us: extracting networks from your problem domain

METIS MEETUP Networks All Around Us: Analyzing Networks in your Problem Domain | 3/3/2016 Russell Jurney http://bit.ly/socialnetworkanalysis2

Upload: russell-jurney

Post on 15-Apr-2017



RELATO MAPS

MARKET

BACKGROUND

Serial Entrepreneur
Contributed code to Apache Druid, Apache Pig, Apache DataFu, Apache Whirr, Azkaban, MongoDB
Apache Committer
Three-time O'Reilly Author
Started & Shipped Product at E8 Security
Ning, LinkedIn, Hortonworks veteran

2009 2010 2011

2012 2014

EXAMPLES OF NETWORKS

FOUNDER NETWORKS

node = company
edge = employment transition, as in people who worked at one startup, then founded another

WEBSITE BEHAVIOR

node = web page
edge = user browses one page, then another
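As a rough illustration of how such a network might be extracted, the sketch below turns browsing sessions into weighted page-to-page edges. The `sessions` data and the `browse_edges` helper are hypothetical names made up for this example, not part of the talk.

```python
from collections import Counter

# Hypothetical sessions: each is the ordered list of pages one user viewed.
sessions = [
    ["/home", "/pricing", "/signup"],
    ["/home", "/blog", "/pricing"],
    ["/home", "/pricing"],
]

def browse_edges(sessions):
    """Count directed edges (page -> next page) across all sessions."""
    edges = Counter()
    for pages in sessions:
        # Pair each page with the page viewed immediately after it.
        for src, dst in zip(pages, pages[1:]):
            edges[(src, dst)] += 1
    return edges

edges = browse_edges(sessions)
for (src, dst), weight in edges.most_common():
    print(src, "->", dst, weight)
```

The edge weight here is simply how many times the transition was observed; other weightings (unique users, dwell time) would be equally reasonable.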

ONLINE SOCIAL NETWORKS

node = LinkedIn profile, edge = LinkedIn connection

EMAIL INBOX

node = email address, edge = sent email

MARKETS

node = company, edge = partnership

MARKET REPORTS

TYPES OF NETWORKS

TINKERPOP

“Marko Rodriguez is the Doug Cutting of graph analytics.” —Mark Twain

PROPERTY GRAPHS

A PROPERTY GRAPH IN EVERY DATABASE

PROPERTY GRAPHS IN YOUR DOMAIN

identify entities
identify relationships
specify schema (or not)
populate graph database
learn to think in graph walks (hard)
query in batch
query in realtime
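The first few steps above can be sketched with a toy in-memory property graph. This is only an illustration under stated assumptions: the `PropertyGraph` class, its method names, and the example domains are hypothetical and stand in for a real graph database.

```python
class PropertyGraph:
    """Toy in-memory property graph: labeled vertices and typed edges."""

    def __init__(self):
        self.vertices = {}  # vertex id -> {"label": ..., plus properties}
        self.edges = []     # (source id, edge type, target id)

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = {"label": label, **props}

    def add_edge(self, src, edge_type, dst):
        self.edges.append((src, edge_type, dst))

    def out(self, vids, edge_type):
        """One step of a graph walk: follow edges of one type outward."""
        return [d for s, t, d in self.edges if t == edge_type and s in vids]

# Entities become vertices, relationships become typed edges.
graph = PropertyGraph()
graph.add_vertex("a.example", "company", name="A")
graph.add_vertex("b.example", "company", name="B")
graph.add_vertex("c.example", "company", name="C")
graph.add_edge("a.example", "partnership", "b.example")
graph.add_edge("b.example", "partnership", "c.example")
graph.add_edge("a.example", "customer", "c.example")

# Thinking in walks: partners, then partners of partners.
partners = graph.out(["a.example"], "partnership")
partners_of_partners = graph.out(partners, "partnership")
print(partners, partners_of_partners)
```

Each query is a walk that starts from a set of vertices and follows one edge type at a time, which is the habit of mind the list calls hard.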

POPULATING A PROPERTY GRAPH

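// Add nodes
import groovy.json.JsonSlurper

jsonSlurper = new JsonSlurper()

// company_reader is assumed to be a reader over a file with one JSON document per line
while((json = company_reader.readLine()) != null) {
  document = jsonSlurper.parseText(json)
  v = graph.addVertex('company')
  v.property("_id", document._id)
  v.property("domain", document.domain)
  v.property("name", document.name)
}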

POPULATING A PROPERTY GRAPH

// Get a graph traverser
g = graph.traversal()

while((json = links_reader.readLine()) != null) {
  document = jsonSlurper.parseText(json)

  // Add edges to graph
  v1 = g.V().has('domain', document.home_domain).next()
  v2 = g.V().has('domain', document.link_domain).next()
  v1.addEdge(document.type, v2)
}

MULTI RELATIONAL TO SINGLE RELATIONAL

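// Extract the single-relational subgraph made of 'friend' edges
g.E().hasLabel('friend').subgraph('friends').cap('friends').next()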

EXPORT LINKS AS JSON

final Graph g = TinkerFactory.createClassic();
try (final OutputStream os = new FileOutputStream("jsondump/links.json")) {
  GraphSONWriter.build().create().writeGraph(os, g);
}

THEN USE SNA LIBRARIES

#
# Example - calculate friendship dispersion
#
import networkx as nx

di_graph = nx.DiGraph()

# util.json_cr_file_2_array is a helper that is not shown in the slides
all_edges = util.json_cr_file_2_array('jsondump/links.json')

for edge in all_edges:
    if 'type' in edge and edge['type'] == 'partnership':
        di_graph.add_edge(edge['domain1'], edge['domain2'])

dispersion = nx.dispersion(di_graph)

TOOLS OF SNA

SNA = Social Network Analysis

centrality
clustering
block models
cores
dispersion
center-pieces

CENTRALITY

Centrality is a way of measuring how central or important a particular node is in a social network.

OR

What nodes should I care about?

SINGLE-RELATIONAL CENTRALITY(S)

// all-links-the-same-type-centrality
g.V().out().groupCount()

// things-humans-walk-centrality
g.V().hasLabel('human').out('walks').groupCount()

// things-dogs-eat-centrality
g.V().hasLabel('dog').out('eats').groupCount()

MULTI-RELATIONAL CENTRALITY(S)

// things-eaten-by-things-humans-walk-centrality
g.V().hasLabel('human').out('walks').out('eats').groupCount()

// things-hated-by-things-humans-pet-centrality
g.V().hasLabel('human').out('pets').out('hates').groupCount()

// things-that-pet-things-that-eat-mice-centrality
g.V().in('eats').in('pets').groupCount()
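To make the counting behind these traversals concrete, here is a minimal Python sketch of the same walk-then-count idea. The vertex names, labels, and edges are invented for illustration, and the `out` helper is a hypothetical stand-in for the Gremlin step of the same name.

```python
from collections import Counter

# Hypothetical toy graph: vertex labels and typed, directed edges.
labels = {
    "alice": "human", "bob": "human",
    "rex": "dog", "fido": "dog",
    "kibble": "food", "bone": "food",
}
edges = [
    ("alice", "walks", "rex"),
    ("bob", "walks", "rex"),
    ("bob", "walks", "fido"),
    ("rex", "eats", "kibble"),
    ("fido", "eats", "kibble"),
    ("fido", "eats", "bone"),
]

def out(vertices, edge_type):
    """Follow edges of one type from every walker; duplicates are kept on purpose."""
    return [dst for v in vertices
            for src, t, dst in edges if src == v and t == edge_type]

humans = [v for v, label in labels.items() if label == "human"]

# Single-relational: things humans walk.
walked = Counter(out(humans, "walks"))

# Multi-relational: things eaten by things humans walk.
eaten_by_walked = Counter(out(out(humans, "walks"), "eats"))

print(walked)
print(eaten_by_walked)
```

Keeping duplicates matters: a vertex reached by many walkers ends up with a high count, which is what makes the count usable as a centrality score.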

CENTRALITIES

degree centrality
closeness centrality
betweenness centrality
eigenvector centrality

DEGREE CENTRALITY

in-degree centrality is nice… it works even if you’re missing a node’s outbound links

DEGREE CENTRALITY

# computation
count connections… it's that simple
in-degree centrality = popularity
out-degree centrality = gregariousness

# meaning
risk of catching cold
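Counting connections can be sketched in a few lines of Python. The edge list below is made up for illustration; each pair is (source, target).

```python
from collections import Counter

# Hypothetical directed edges: (source, target).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

# In-degree counts incoming links; out-degree counts outgoing links.
in_degree = Counter(dst for _, dst in edges)
out_degree = Counter(src for src, _ in edges)

print(in_degree.most_common())
print(out_degree.most_common())
```

These are raw counts. Dividing by the number of other nodes would give a normalized score, which is a design choice left out of this sketch.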

DEGREE CENTRALITY IN GREMLIN

// all-links-the-same-type-centrality
g.V().out().groupCount()

CLOSENESS CENTRALITY

# computation
count hops of all shortest paths
distance from all other nodes
reciprocal of farness

# meaning
communication efficiency
spread of information
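A minimal sketch of this computation, assuming an unweighted, undirected graph stored as an adjacency dict. The `closeness` function and the example graph are hypothetical, and the score is normalized by the number of reachable nodes, which is one common convention rather than the only one.

```python
from collections import deque

def closeness(adjacency, node):
    """Reciprocal of the mean shortest-path distance from node to the nodes it can reach."""
    distance = {node: 0}
    queue = deque([node])
    # Breadth-first search yields hop counts along shortest paths.
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    farness = sum(distance.values())
    return (len(distance) - 1) / farness if farness else 0.0

# Hypothetical path graph: a - b - c - d
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

for node in adjacency:
    print(node, closeness(adjacency, node))
```

The two middle nodes score higher than the endpoints because they are fewer hops from everyone else.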

CLOSENESS CENTRALITY IN GREMLIN

closenessCentrality = g.V().as("a").repeat(both('relationship_type').simplePath()).emit().as("b")
  .dedup().by(select("a","b")).path()
  .group().by(limit(local, 1)).by(count(local).map {1/it.get()}.sum())

BETWEENNESS CENTRALITY

# computation
count of times node appears in shortest paths between all pairs of nodes

# meaning
control of communication between other nodes
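One way to carry out this count is Brandes' algorithm, sketched below for an unweighted, undirected graph. The talk does not say which algorithm it has in mind, so treat this as an illustrative choice; the function name and example graph are hypothetical.

```python
from collections import deque

def betweenness(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph given as {node: [neighbors]}."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # BFS from source, counting shortest paths (sigma) and recording predecessors.
        order = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating each node's share of paths.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}

# Hypothetical path graph: a - b - c - d
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = betweenness(adjacency)
print(result)
```

In the path graph, every route between the two ends passes through the middle nodes, so they carry all of the betweenness.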

EIGENVECTOR CENTRALITY

# computation
counts connections of connected nodes
more connected neighbors matter more

# meaning
influence of one node on others
pagerank is an eigenvector centrality

EIGENVECTOR CENTRALITY IN GREMLIN

g.V()
  .repeat(out('relationship_type').groupCount('m').by('unique_key'))
  .times(n).cap('m')
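The idea that well-connected neighbors count for more can also be sketched as power iteration in plain Python. This is an illustrative sketch, not the method used in the talk: the function name, the fixed iteration count, and the star-shaped example graph are all assumptions. Each node's own score is added at every step so the iteration also settles on bipartite graphs.

```python
import math

def eigenvector_centrality(adjacency, iterations=100):
    """Power iteration on an undirected graph given as {node: [neighbors]}."""
    score = {v: 1.0 for v in adjacency}
    for _ in range(iterations):
        # New score = own score plus the sum of the neighbors' scores.
        updated = {v: score[v] + sum(score[u] for u in adjacency[v])
                   for v in adjacency}
        # Rescale to unit length so the values stay bounded.
        norm = math.sqrt(sum(s * s for s in updated.values()))
        score = {v: s / norm for v, s in updated.items()}
    return score

# Hypothetical star graph: hub "h" connected to three leaves.
adjacency = {"h": ["x", "y", "z"], "x": ["h"], "y": ["h"], "z": ["h"]}
score = eigenvector_centrality(adjacency)
print(score)
```

The hub ends up with the highest score, and each leaf inherits some score from being attached to it.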

CLUSTERING

CLUSTERING

property based clustering: k-means
graph based clustering: modularity
property graph based clustering: CESNA
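For the graph-based case, modularity scores how much more densely a proposed set of clusters is linked internally than chance would suggest. The sketch below computes it for a given partition of an undirected graph; the `modularity` function, the edge list, and the cluster assignments are hypothetical examples. It only evaluates a partition and does not search for the best one.

```python
def modularity(edges, community):
    """Modularity of a partition of an undirected graph.

    edges: list of (u, v) pairs; community: {node: community id}.
    """
    m = len(edges)
    inside = {}   # community -> number of edges with both ends inside it
    degree = {}   # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    # Fraction of edges inside each community minus the fraction expected at random.
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Hypothetical graph: two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_clusters = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_cluster = {node: 1 for node in two_clusters}

print(modularity(edges, two_clusters))
print(modularity(edges, one_cluster))
```

Splitting the graph along the bridge scores well above lumping everything together, which scores zero.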

BLOCK MODELS

how much do clusters connect?
are links reciprocal?
Circos plots are helpful
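Both questions can be answered with simple counts once every node has a cluster assignment. The sketch below is an illustration under assumed data: the function names, edges, and cluster labels are hypothetical.

```python
from collections import Counter

def block_counts(edges, cluster):
    """Count directed edges from each cluster to each cluster."""
    return Counter((cluster[src], cluster[dst]) for src, dst in edges)

def reciprocity(edges):
    """Fraction of directed edges whose reverse edge also exists."""
    edge_set = set(edges)
    return sum((dst, src) in edge_set for src, dst in edge_set) / len(edge_set)

# Hypothetical directed edges and cluster assignments.
edges = [("a", "b"), ("b", "a"), ("a", "c"),
         ("c", "d"), ("d", "c"), ("b", "d")]
cluster = {"a": "X", "b": "X", "c": "Y", "d": "Y"}

blocks = block_counts(edges, cluster)
print(blocks)
print(reciprocity(edges))
```

The block counts form the cluster-to-cluster matrix that a chord-style plot would draw.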

CORES
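Assuming "cores" here refers to k-cores (the largest subgraph in which every node has at least k neighbors), a minimal sketch is to peel away low-degree nodes repeatedly. The `k_core` function and the example graph are hypothetical and only illustrate that definition.

```python
def k_core(adjacency, k):
    """Return the nodes of the k-core of an undirected graph {node: [neighbors]}."""
    remaining = {v: set(nbrs) for v, nbrs in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for v in list(remaining):
            if len(remaining[v]) < k:
                # Drop v and remove it from its neighbors' neighbor sets.
                for u in remaining.pop(v):
                    remaining[u].discard(v)
                changed = True
    return set(remaining)

# Hypothetical graph: triangle a-b-c with a pendant node d attached to c.
adjacency = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

print(k_core(adjacency, 2))
print(k_core(adjacency, 3))
```

Removing one node can push its neighbors below k, which is why the peeling repeats until nothing changes.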

DISPERSION

Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook
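As a rough sketch of the measure that paper introduces, the code below counts, for a tie between u and v, the pairs of mutual friends who are not linked to each other and share no other neighbor within u's network. It is unnormalized and assumes an undirected graph stored as an adjacency dict; the function name and example data are hypothetical.

```python
from itertools import combinations

def dispersion(adjacency, u, v):
    """Unnormalized dispersion of the u-v tie in an undirected graph."""
    u_nbrs = set(adjacency[u])
    common = u_nbrs & set(adjacency[v])
    total = 0
    for s, t in combinations(sorted(common), 2):
        # Neighbors of s inside u's network, other than u and v themselves.
        s_nbrs = (u_nbrs & set(adjacency[s])) - {u, v}
        # Count the pair if s and t are unlinked and share no such neighbor.
        if t not in s_nbrs and s_nbrs.isdisjoint(adjacency[t]):
            total += 1
    return total

# Hypothetical network: u and v share friends s1, s2, s3; only s1 and s2 know each other.
adjacency = {
    "u": ["v", "s1", "s2", "s3"],
    "v": ["u", "s1", "s2", "s3"],
    "s1": ["u", "v", "s2"],
    "s2": ["u", "v", "s1"],
    "s3": ["u", "v"],
}

print(dispersion(adjacency, "u", "v"))
print(dispersion(adjacency, "u", "s1"))
```

A high value means the mutual friends come from otherwise separate parts of u's network, which is the signal the paper associates with romantic partners.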

CENTER-PIECE SUBGRAPHS

*Slide stolen from Tong, Faloutsos, Pan

Russell Jurney, CEO
[email protected]
twitter.com/rjurney
404-317-3620

http://bit.ly/socialnetworkanalysis2