Statistical Latent Variable and Event Models for Network Data

Padhraic Smyth

Department of Computer Science

University of California, Irvine

January 7th, 2016

Workshop on Big Graphs: Theory and Practice

UCSD

Acknowledgements

Students and Colleagues

Chris Dubois, Jimmy Foulds, Arthur Asuncion, Carter Butts, Zach Butler

Funding


References

Multiplicative latent factor models for description and prediction of social networks

P. D. Hoff, Computational and Mathematical Organization Theory, 15(4), 2009.

Dyadic data analysis with amen

P. D. Hoff, available online, June 2015

A relational event model for social action

C. E. Butts, Sociological Methodology, 2008

A survey of statistical network models

A. Goldenberg, A. Zheng, S. Fienberg, E. Airoldi, Foundations and Trends in Machine Learning, 2009


Email Contact Network

Data from HP Labs


Goals

• Learn a predictive distribution over future events in the network

– Incorporate node and edge attributes

• Be able to answer queries such as

– What will the network look like at time t + k?

– How likely is it that node i will communicate with node j?

– How much influence does node i have on node j?

• Understand the dynamics of the network process


C. Butts, Science, 2009


Descriptive/Exploratory Analysis of Networks

Long history in social network analysis, complex systems, etc

– Degree distributions, power laws, scale-free networks

– Clustering effects

– Betweenness and centrality

Often focused on broad network properties

Very useful, but does not support inferential or predictive statements about specific nodes or edges


Statistical Network Modeling

Basic idea: hypothesize a (simple) generative model for the data given parameters, and then infer the parameters given observed data

• Learning

– Systematic methods for estimating network parameters

• Prediction/Querying

– reduces to computation of conditional probabilities and expectations

• Noise/Missing Data

– Systematic way to handle real-world noise

• Covariates

– Relatively straightforward to integrate “non-network” information


Modeling Approaches

• Static model

– Aggregate event data into a single network

– e.g., static model for binary edges

• Discrete time models

– Aggregate event data into temporal windows, e.g., per week

• Continuous-time models

– Model event rates directly

– e.g., stationary Poisson (simple)

– e.g., non-stationary Poisson (more complex)

• Sequences of dependent events

– Cascade models


Static Network Models


Network Notation

N actors (node set)

• Generally assume that set of actors is known and fixed



Edges between actors: adjacency matrix Y

y_ij = 1: an edge between actors i and j

y_ij real-valued or counts: indication of strength of relationship

Covariates/Attributes X

• e.g., for each actor, for each edge


Example of a Y matrix: counts of 200,000 email messages between 3000 individuals over 3 months


Sidenote: Graphical Models

It is tempting to think of our N x N network as being related to a graphical model on N variables

However, in network modeling, the edges are viewed as the random variables, not the nodes

This hints at the complexity of the problem: O(N^2) random variables, and a number of possible graph realizations that grows exponentially in N^2


Network Models via Regression

Mean effect, row effect, column effect


Binary Undirected Edges


Likelihood

Note that edges are conditionally independent given parameters
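Because edges are conditionally independent given the parameters, the log-likelihood is a sum of Bernoulli terms over unordered pairs. A minimal sketch, assuming a logistic link with a density term `mu` and per-node effects `a` (these names and the additive form are illustrative assumptions, not taken from the slide):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_likelihood(Y, mu, a):
    """Bernoulli log-likelihood of a symmetric 0/1 adjacency matrix Y.

    Assumes logit P(y_ij = 1) = mu + a[i] + a[j] (illustrative form).
    """
    n = len(Y)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # visit each undirected pair once
            p = sigmoid(mu + a[i] + a[j])
            total += math.log(p) if Y[i][j] else math.log(1.0 - p)
    return total

# Toy 3-node path graph (made-up data).
Y = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
ll = log_likelihood(Y, mu=0.0, a=[0.0, 0.0, 0.0])
```

Setting every `a[i]` to zero leaves a single shared edge probability, which is the Erdos-Renyi special case.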


Special Case: Erdos-Renyi Graph


Likelihood


Likelihood

We can learn the θ's by maximum likelihood or Bayesian methods, using techniques such as gradient methods, MCMC, or variational approximations


Adding Node and Edge Covariates

Covariates, weights

Example:


Adding Latent (Hidden) Variables

Hypothesize that the nodes are embedded in a latent (hidden) space

The probability of a link is higher if nodes are closer in this space

Given a set of observed links can we infer a set of “good locations”?


Old idea in social science, e.g., McFarland and Brown, “Social distance as metric…”, 1973

See also more recent word embedding methods


Latent Space Model

K-dimensional real-valued latent space vector for each node

Intuition:

• Embed nodes in a K-dimensional latent space, K much smaller than N

• Probability (or log-odds) of edge (i,j) decreases as i and j move further apart

Hoff, Raftery, Handcock, JASA, 2002
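The intuition above can be sketched in a few lines. This assumes the log-odds of an edge equal an intercept `alpha` minus the Euclidean distance between latent positions, in the spirit of the distance model; the positions and the value of `alpha` are made up for illustration.

```python
import math

def edge_prob(z_i, z_j, alpha):
    """P(edge) under logit p = alpha - ||z_i - z_j|| (illustrative form)."""
    dist = math.dist(z_i, z_j)
    return 1.0 / (1.0 + math.exp(-(alpha - dist)))

# Hypothetical 2-D latent positions for three nodes (K = 2).
z = {"A": (0.0, 0.0), "B": (0.5, 0.0), "C": (3.0, 4.0)}
p_near = edge_prob(z["A"], z["B"], alpha=1.0)  # distance 0.5
p_far = edge_prob(z["A"], z["C"], alpha=1.0)   # distance 5.0
```

In practice the positions are unknown and are inferred from the observed links; this sketch only evaluates the model for fixed positions.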


Figure from Hoff, Raftery, Handcock, 2002


Additive Latent Interactions

This model implies transitivity:

e.g., if (A,B) close and if (B,C) close then (A,C) close (and has high probability)

…but some relations are not transitive, e.g., “conflict”


Multiplicative Latent Interactions

Hoff, 2009

W: K x K real-valued matrix (learned from the data)


Hoff (NIPS 2008) showed that for a diagonal W matrix (the latent eigenmodel) this model is a strict generalization of the distance model

For directed networks or rectangular matrices we can replace z_j with v_j, yielding links to matrix factorization


Building Blocks for Network Modeling

See also P. Hoff, Dyadic data analysis with amen, ArXiv, 2015



e.g., g = log(p/(1-p))

Network density

Row and column effects

Edge covariates and regression weights

K-dimensional latent vector per node

Similarity function on latent vectors




Stochastic Block Model

Each node assumed to belong to 1 of K “stochastically equivalent” blocks

z vectors are K-dimensional indicators, e.g., z = [0, 0, 1, 0]

Within-block and between-block edge probabilities at block level, K x K matrix W

Nowicki and Snijders, 2002
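As a rough illustration of the block structure, the sketch below samples an undirected graph where the edge probability for a pair depends only on the two nodes' blocks. The block assignments and the entries of `W` are made-up values, and the function name `sample_sbm` is hypothetical.

```python
import random

def sample_sbm(block_of, W, seed=0):
    """Sample a symmetric 0/1 adjacency matrix from a stochastic block model.

    block_of[i] is node i's block index; W[a][b] is the edge probability
    between blocks a and b (W is assumed symmetric).
    """
    rng = random.Random(seed)
    n = len(block_of)
    Y = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # For one-hot indicator vectors z, this lookup equals z_i' W z_j.
            p = W[block_of[i]][block_of[j]]
            if rng.random() < p:
                Y[i][j] = Y[j][i] = 1
    return Y

block_of = [0, 0, 0, 1, 1, 1]
W = [[0.9, 0.1],
     [0.1, 0.8]]
Y = sample_sbm(block_of, W)
```

With large diagonal entries in `W` the sampled graph tends to show dense within-block and sparse between-block connections, but other patterns (e.g., large off-diagonal entries) are equally expressible.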







(Figure from Goldenberg et al, 2010)






Example:

Interaction:



Binary Relational Feature Model

Each node can “turn on” any subset of K binary features (latent)

z vectors are K-dimensional binary vectors, e.g., z = [0, 0, 1, 1]

K x K weight matrix W captures feature interactions

Miller, Griffiths, Jordan, NIPS 2009



Hidden Features

Actors

Presence of edge between actor i and actor j is (e.g.) a logistic function of a weighted sum of features they have in common

Estimation: based on MCMC or variational EM
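A small sketch of the edge probability for fixed binary features, assuming a logistic link applied to the bilinear score z_i' W z_j. The weight values are invented for illustration, and inference of the hidden features (the hard part) is not shown.

```python
import math

def edge_prob(z_i, z_j, W):
    """Logistic function of the bilinear score z_i' W z_j for binary z."""
    score = sum(W[a][b] * z_i[a] * z_j[b]
                for a in range(len(z_i)) for b in range(len(z_j)))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical 2 x 2 feature-interaction weights.
W = [[2.0, -1.0],
     [-1.0, 0.5]]
p_shared = edge_prob([1, 0], [1, 0], W)  # both have feature 0: score = 2.0
p_none = edge_prob([0, 0], [1, 1], W)    # actor i has no features: score = 0.0
```

Only pairs of features that are switched on for both actors contribute to the score, so an actor with no active features has edge probability 0.5 under this intercept-free form.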




Example:

Interaction:


(Originally proposed as an infinite-dimensional non-parametric model)


Predictions on NIPS Coauthorship Data

From Miller, Griffiths, Jordan, 2009


Other Models

Mixed membership stochastic blockmodel (MMSB), Airoldi et al, 2008

Each node: a probability vector zi over K possible groups

W is a matrix of Bernoulli probabilities


Relational topic model, Chang and Blei 2009

For modeling linked documents, e.g., via citations

Each node = document = K-dimensional topic probability vector

Various possible combination functions to reflect topic similarity


General Formulation

e.g., g = log(p/(1-p))

Network density

Row and column effects

Edge covariates and regression weights

K-dimensional latent vector per node

Similarity function on latent vectors
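The labels on this slide describe the terms of a single regression-style equation. One plausible way to write it is given below; the symbol names (\mu, a_i, b_j, \beta, x_{ij}) are assumptions chosen for illustration, and f(z_i, z_j) is the similarity function on latent vectors.

```latex
g\big(\mathbb{E}[y_{ij}]\big) \;=\; \mu \;+\; a_i + b_j \;+\; \beta^{\top} x_{ij} \;+\; f(z_i, z_j)
```

Here \mu plays the role of network density, a_i and b_j are the row and column effects, \beta^{\top} x_{ij} is the covariate regression, and f could be, for example, -\lVert z_i - z_j \rVert for a distance model or z_i^{\top} W z_j for a multiplicative model.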


Scalability

• The O(N^2) term in the likelihood is problematic for scalability

• However, there is hope

– In most real-world social networks the number of edges often scales as O(N), not O(N^2)

– ...but the number of non-edges still scales as O(N^2)

• This suggests factoring the likelihood into 2 pieces

– A product over edges, with O(N) terms

– A product over non-edges, with O(N^2) terms that we approximate with O(N) terms

– This idea has been discovered (and rediscovered) several times


Approximating the Log-Likelihood

Can approximate this term with O(N) randomly-sampled non-edges

See Raftery et al, 2012, J. Computational and Graphical Statistics

This idea can also be combined with stochastic gradient methods
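A simplified sketch of the idea: keep the edge term exact and estimate the non-edge term from a uniform random subsample, rescaled to the total number of non-edges. This uniform scheme is an assumption made for brevity and is not necessarily the sampling design used in the cited paper; `approx_log_lik` and `prob` are hypothetical names.

```python
import math
import random

def approx_log_lik(edges, n, prob, m, seed=0):
    """Estimate the log-likelihood of an undirected graph on n nodes.

    edges: set of (i, j) pairs with i < j; prob(i, j) returns P(y_ij = 1).
    The edge term is exact; the non-edge term is estimated from m
    uniformly sampled non-edges and rescaled to the non-edge count.
    """
    rng = random.Random(seed)
    edge_term = sum(math.log(prob(i, j)) for i, j in edges)
    n_non_edges = n * (n - 1) // 2 - len(edges)
    sampled = []
    while len(sampled) < m:  # rejection sampling; cheap when the graph is sparse
        i, j = sorted(rng.sample(range(n), 2))
        if (i, j) not in edges:
            sampled.append((i, j))
    non_edge_mean = sum(math.log(1.0 - prob(i, j)) for i, j in sampled) / m
    return edge_term + n_non_edges * non_edge_mean

edges = {(0, 1), (1, 2), (2, 3)}
estimate = approx_log_lik(edges, n=50, prob=lambda i, j: 0.05, m=100)
```

The cost is O(|edges| + m) instead of O(N^2). With a constant edge probability, as in this toy call, the estimate coincides with the exact value; in general it is an unbiased but noisy estimate of the non-edge term.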


Stochastic Variational Inference: a-MMSB model

From Gopalan et al, 2012

Red: stochastic gradient with mini-batch

Blue: conventional gradient batch algorithm


Variations and Extensions

• Sender and receiver effects

– Latent vectors for sender and receiver roles can be different

• Rectangular matrices, bipartite graphs

– rows and columns each get their own latent vectors

• Multi-way arrays and tensors

• Bayesian estimation

– Fully Bayesian methods: infer posterior locations in latent space

– MAP and regularized variations: enforce sparsity in solutions

• Non-linear “deep” models

– Could incorporate non-linearities in various ways


Dynamic Networks: Adding Time


Networks over Time

• Many network problems are dynamic rather than static

– e.g., social relationships are changing over time

– instantaneous communication events (emails, phone calls)

• Edges, nodes, and covariates may all be evolving over time

– We will assume node set is fixed and edges and covariates may change

– Systematic temporal effects often important (time of day, day of week, seasonality)

• Different ways to define networks over time

– Snapshots at time t

– Aggregation over time windows

– Continuous time models


Discrete-Time Models

Y_t represents the network at discrete time t

Data D = {Y_1, ..., Y_t, ..., Y_T}

Example

actors = students in a school

Y_t = friendships between students measured in month t, t = 1, ..., 12

Interest is often in network dynamics and evolution

e.g., Markov models for P(Y_{t+1} | Y_t)

(See work of Tom Snijders, Eric Xing, and others)


Figure from Carter Butts


General Formulation

In principle we can add time-dependence to any or all terms


One approach is to make the z’s time-dependent

i.e., allow the latent features of each actor to change over time

Example: linear Gaussian dynamics in z-space

- Sarkar and Moore (2005) for actors’ latent-space positions

- Fu, Song, and Xing (2009) for actors’ mixed membership vectors


Dynamic Relational Binary Feature Model

Recall the static version:

z_i = k-dimensional binary vector, e.g., (1, 0, 1, 0, 1)

f(z_i, z_j) = z_i' W z_j, where W is a k x k matrix

Common set of k features across all actors

Dynamic version (Dynamic Relational Features)

• Assume discrete time

• The kth feature for actor i, z_ik(t), is a binary hidden Markov process

• Features can turn on, persist, or turn off at each time step

• For the infinite version, new features can be born over time

• Inference via MCMC – tricky, but works

Foulds, Asuncion, DuBois, Butts, Smyth 2011
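To make the feature dynamics concrete, here is a minimal simulation of a single binary feature as a two-state Markov chain. The transition probabilities `p_on` and `p_off`, and the choice to start the feature switched off, are illustrative assumptions; the sketch covers only the latent chain, not the edge model or the MCMC inference.

```python
import random

def simulate_feature(T, p_on, p_off, seed=0):
    """Simulate one binary feature z_ik(t), t = 0..T-1, as a Markov chain.

    p_on:  P(z(t+1) = 1 | z(t) = 0), the feature turns on.
    p_off: P(z(t+1) = 0 | z(t) = 1), the feature turns off.
    """
    rng = random.Random(seed)
    z = [0]  # assumption: the feature starts switched off
    for _ in range(T - 1):
        if z[-1] == 0:
            z.append(1 if rng.random() < p_on else 0)
        else:
            z.append(0 if rng.random() < p_off else 1)
    return z

path = simulate_feature(T=10, p_on=0.3, p_off=0.1)
```

A small `p_off` makes features persistent once acquired, which is one way such a model can capture slowly changing roles.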


Hidden Features

Actors

Time

Presence of edge i,j at time t depends on the interaction of actor i's and j's feature vectors at time t


Example of DRIFT Predictions on Enron


Continuous-Time Data and Models

Relational events: < i, j, t >

yt is an edge between some pair i and j at time t

Birth-death edges: each yt has start and end times

Instantaneous edges: each yt is (effectively) instantaneous

• Data D = {y_1, ..., y_t, ..., y_T}

In a certain sense there is no graph!

Example

actors = students in a school

yt = text message between 2 students at time t

Interest is often in rates and patterns of communication

e.g., Poisson rates for y_ij given network history up to time t


Multinomial Models for Relational Events

• Let λ_ij be the rates of Poisson processes for each pair of nodes (i, j) in a network

• Assume for simplicity that these processes are conditionally independent given model parameters

• We can decompose the network process into

– A global rate λ, the sum of the λ_ij, which generates events globally

– A choice process: given an event, which pair generated it, i.e., pair (i, j) with probability λ_ij / λ
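This decomposition follows from the superposition property of independent Poisson processes: the merged process has rate equal to the sum of the pairwise rates, and each event belongs to a given pair with probability proportional to that pair's rate. A short sketch with made-up rates (the function name `decompose` is hypothetical):

```python
def decompose(rates):
    """Split pairwise Poisson rates into a global rate and choice probabilities.

    rates: dict mapping (i, j) -> pairwise rate > 0.
    Returns (lam, probs) where lam is the sum of all rates and
    probs[(i, j)] is that pair's rate divided by lam.
    """
    lam = sum(rates.values())
    probs = {pair: r / lam for pair, r in rates.items()}
    return lam, probs

rates = {("a", "b"): 2.0, ("a", "c"): 1.0, ("b", "c"): 1.0}
lam, probs = decompose(rates)
```

The `probs` dictionary is the multinomial over pairs; modeling it directly lets the timing of events and the choice of pair be handled separately.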


Marginal Product Mixture Model

DuBois and Smyth, 2010

Multinomial over N^2 possible edges

Mixture over K unobserved groups




Distribution over senders for group k

Distribution over receivers for group k

Marginal probability of group k

Edge events (rather than nodes) belong to latent groups (unlike MMSB)

Straightforward to learn via EM or collapsed Gibbs sampling
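Reading the labels together, the probability of an event on pair (i, j) is a mixture over groups of a sender probability times a receiver probability. The sketch below computes that multinomial for made-up parameters; the variable names are assumptions, and for simplicity it does not exclude self-pairs (i, i).

```python
def edge_distribution(pi, senders, receivers):
    """Multinomial over pairs: P[i][j] = sum_k pi[k] * senders[k][i] * receivers[k][j]."""
    n = len(senders[0])
    groups = range(len(pi))
    return [[sum(pi[k] * senders[k][i] * receivers[k][j] for k in groups)
             for j in range(n)] for i in range(n)]

pi = [0.6, 0.4]                  # marginal probability of each group
senders = [[0.7, 0.2, 0.1],      # group 0: distribution over senders
           [0.1, 0.1, 0.8]]      # group 1
receivers = [[0.1, 0.8, 0.1],    # group 0: distribution over receivers
             [0.5, 0.4, 0.1]]    # group 1
P = edge_distribution(pi, senders, receivers)
total = sum(sum(row) for row in P)
```

The model needs on the order of K * N parameters rather than N^2, which is what lets it share information across sparse pairs.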


Likelihood

DuBois and Smyth, 2010

Product over events

Product over pairs with non-zero counts

For large sparse networks, the number of non-zero pairs << N^2

Similar to use of multinomial versus Bernoulli models for text


Application to Email Data: 200,000 email messages among 3000 individuals (data from Eckmann, Moses, Sergi, 2004)

Most likely Edge Assignments by Group

Figures from DuBois and Smyth, 2010


International Relations Data: 40,000 events, 2700 actors, 171 action types

(King, 2003)


Prediction and Evaluation

• Use future data to evaluate predictive power and compare models

– e.g., predict network at time t+1 given network up to time t

• Metrics

– Log score = log probability of events that actually occurred

– Brier/MSE style scores

– Ranking/ROC scores
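Two of these metrics are easy to state in code for binary edge forecasts. The sketch below uses made-up predictions and outcomes; for event-sequence models the log score would instead sum the log probabilities of the observed events.

```python
import math

def log_score(probs, outcomes):
    """Sum of log probabilities assigned to what actually happened."""
    return sum(math.log(p if y else 1.0 - p) for p, y in zip(probs, outcomes))

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

probs = [0.9, 0.2, 0.5]   # predicted edge probabilities (made-up values)
outcomes = [1, 0, 1]      # whether each edge actually occurred
ls = log_score(probs, outcomes)
bs = brier_score(probs, outcomes)
```

Higher log scores and lower Brier scores are better; the log score penalizes confident wrong predictions much more heavily.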


Simple Baseline for Comparison

• We could predict the likelihood of i and j communicating based directly on i and j’s history

– Multinomial with O(N^2) entries

– Can use smoothing to combat sparsity

• Problems

– Data can be extremely sparse for large N – smoothing is non-informative, and does not “borrow strength” from the graph

• Nonetheless this is a useful baseline when evaluating predictions

– Historically, few papers evaluate models predictively

– Even fewer compare their models to simple baselines
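One way to build such a baseline is an additively smoothed multinomial over ordered pairs, estimated from historical counts. The pseudo-count `alpha` and the exclusion of self-pairs are assumptions made for this sketch.

```python
from collections import Counter

def smoothed_baseline(events, n, alpha=1.0):
    """Additively smoothed multinomial over ordered pairs (i, j), i != j.

    events: list of (sender, receiver) pairs with node ids in range(n).
    alpha:  pseudo-count added to every pair (an assumed smoothing choice).
    """
    counts = Counter(events)
    n_pairs = n * (n - 1)
    denom = len(events) + alpha * n_pairs

    def prob(i, j):
        return (counts[(i, j)] + alpha) / denom

    return prob

history = [(0, 1), (0, 1), (1, 2)]
prob = smoothed_baseline(history, n=3)
```

Every pair with no history receives the same probability, which illustrates why the smoothing is non-informative: it does not use the graph structure to decide which unseen pairs are more plausible.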


From DuBois and Smyth, 2010


Relational Event Model

Time-varying Poisson rate for edge i,j

Base rate

Sender and receiver effects

Butts, 2008




Base rate

p-dim vector of regression parameters

p-dim vector of historical statistics on edge i,j

Butts, 2008

Edge rates are time-varying functions of historical features

Results in a piecewise constant (between events) Poisson process

Features can include conversation effects, recency, persistence, etc
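A hedged sketch of how such a rate might be computed, assuming a log-linear form in which the base rate is multiplied by the exponential of a weighted sum of history statistics. The two statistics shown (a persistence count and a reciprocity indicator) and the parameter values are illustrative choices, not the specific features used in the cited work.

```python
import math

def event_rate(base_rate, beta, stats):
    """Log-linear rate: base_rate * exp(beta . stats) for one edge at one time."""
    return base_rate * math.exp(sum(b * s for b, s in zip(beta, stats)))

def history_stats(history, i, j):
    """Two illustrative statistics of the event history for edge (i, j)."""
    persistence = sum(1 for event in history if event == (i, j))
    reciprocity = 1.0 if history and history[-1] == (j, i) else 0.0
    return [persistence, reciprocity]

history = [(0, 1), (1, 0), (0, 1)]   # (sender, receiver) events so far
beta = [0.5, 1.0]                    # hypothetical regression parameters
rate_10 = event_rate(0.1, beta, history_stats(history, 1, 0))
rate_02 = event_rate(0.1, beta, history_stats(history, 0, 2))
```

Because the statistics change only when a new event arrives, each rate stays constant between events, which gives the piecewise constant Poisson process described above.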


Parameter Estimation

• Likelihood includes terms for all events that occurred and all events that did not occur, for all inter-event times

– Computation of likelihood is O(T N^2), T = number of events

– Some computational tricks possible to improve scalability

– See Vu et al (ICML 2011, NIPS 2011) for extensions to large social networks and citation networks

• Can use point estimates (optimization) or Bayesian inference (MCMC)


Applications?

• Modeling classroom interactions in education [DuBois, Butts, McFarland, Smyth, J Math Psych, 2013]

• Understanding and predicting citation patterns among documents [Vu et al, NIPS 2011, ICML 2011; Foulds and Smyth, EMNLP 2013]

• Modeling communication patterns among individuals [DuBois, Smyth, KDD 2010; Foulds et al, AISTATS 2011]

• Clustering individuals in email networks over time [Navaroli, DuBois, Smyth, MLJ, 2013]


Modeling Cascades

• Given a structural network with binary directed/undirected edges

(Figure: example network with nodes A, B, C, D, E, F)


• A cascade is a sequence of “node infections” (may have time-stamps)

– E.g., a post that spreads on a network such as Facebook or LinkedIn

• We observe a set of cascades, e.g.,

{A, B, E}, {B, A, D, F}, {A, B, C, E, F}, ….

• Given cascades …. make inferences about the “infection process”


Prior Work

• Ideas based on epidemics in networks

– Analyze how infection spreads as a function of network structure

• e.g., work by Kempe, Kleinberg, Newman, and many others

– Typically assume a single homogeneous infection rate β

– Typically do not look at learning from data

• Statistical models (more recent)

– Define a generative model (i.e., likelihood) for cascades on a network

– Example

• Assume cascades are independent

• Assume heterogeneous infection rates for different edges

• Define a probabilistic model of how infection spreads to the next node

– Learn parameters, e.g., a matrix of infection rates β

(see work by Manuel Gomez-Rodriguez and colleagues)


Summary

• Static networks

– Statistical models can be built up from basic building blocks

– Latent representations (“node embeddings”) can be broadly useful

• Dynamic networks

– Modeling networks over time can be more straightforward than static case

– More natural representation of the underlying data

– Notion of prediction is clearer

– Can build these models using same building blocks as for static networks

• Scalability of the learning algorithms is a general issue, but there are promising approaches emerging
