Statistical Latent Variable and Event Models for Network Data
Padhraic Smyth
Department of Computer Science
University of California, Irvine
January 7th 2016
Workshop on Big Graphs: Theory and Practice
UCSD
Padhraic Smyth, January 2016: 2
Acknowledgements
Students and Colleagues
Chris DuBois, Jimmy Foulds, Arthur Asuncion, Carter Butts, Zach Butler
Funding
References
Multiplicative latent factor models for description and prediction of social networks
P. D. Hoff, Computational and Mathematical Organization Theory, 15(4), 2009.
Dyadic data analysis with amen
P. D. Hoff, available online, June 2015
A relational event model for social action
C. E. Butts, Sociological Methodology, 2008
A survey of statistical network models
A. Goldenberg, A. Zheng, S. Fienberg, E. Airoldi, Foundations and Trends in Machine Learning, 2009
Email Contact Network
Data from HP Labs
Goals
• Learn a predictive distribution over future events in the network
– Incorporate node and edge attributes
• Be able to answer queries such as
– What will the network look like at time t + k?
– How likely is it that node i will communicate with node j?
– How much influence does node i have on node j?
• Understand the dynamics of the network process
C. Butts, Science, 2009
Descriptive/Exploratory Analysis of Networks
Long history in social network analysis, complex systems, etc
– Degree distributions, power laws, scale-free networks
– Clustering effects
– Betweenness and centrality
Often focused on broad network properties
Very useful, but does not support inferential or predictive statements about specific nodes or edges
Statistical Network Modeling
Basic idea: hypothesize a (simple) generative model for the data given parameters, and then infer the parameters given observed data
• Learning
– Systematic methods for estimating network parameters
• Prediction/Querying
– reduces to computation of conditional probabilities and expectations
• Noise/Missing Data
– Systematic way to handle real-world noise
• Covariates
– Relatively straightforward to integrate “non-network” information
Modeling Approaches
• Static model
– Aggregate event data into a single network
– e.g., static model for binary edges
• Discrete time models
– Aggregate event data into temporal windows, e.g., per week
• Continuous-time models
– Model event rates directly
– e.g., stationary Poisson (simple)
– e.g., non-stationary Poisson (more complex)
• Sequences of dependent events
– Cascade models
Static Network Models
Network Notation
N actors (node set)
• Generally assume that set of actors is known and fixed
Edges between actors: adjacency matrix Y
y_ij = 1 : an edge between actors i and j
y_ij : real-valued or counts: indication of strength of relationship
Covariates/Attributes X
• e.g., for each actor, for each edge
Example of a Y matrix: counts of 200,000 email messages between 3000 individuals over 3 months
Sidenote: Graphical Models
It is tempting to think of our N x N network as being related to a graphical model on N variables
However, in network modeling, the edges are viewed as the random variables, not the nodes
This hints at the complexity of the problem, i.e., O(N^2) variables, and a number of possible graph realizations that is exponential in the number of node pairs (2^(N(N-1)/2) for binary undirected edges)
Network Models via Regression
Mean effect, row effect, column effect
Binary Undirected Edges
Likelihood
Note that edges are conditionally independent given parameters
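The equation on this slide did not survive extraction. As a minimal sketch, assume the logistic form logit P(y_ij = 1) = mu + a_i + a_j for binary undirected edges (this specific form and the names `mu`, `a`, and `edge_log_likelihood` are assumptions for illustration). Because edges are conditionally independent given the parameters, the log-likelihood is a plain sum over the N(N-1)/2 unordered pairs:

```python
import numpy as np

def edge_log_likelihood(Y, mu, a):
    """Bernoulli log-likelihood of a symmetric 0/1 adjacency matrix Y.

    Assumed model (illustrative): logit P(y_ij = 1) = mu + a[i] + a[j].
    """
    n = Y.shape[0]
    iu, ju = np.triu_indices(n, k=1)          # each unordered pair once
    eta = mu + a[iu] + a[ju]                  # log-odds per pair
    y = Y[iu, ju]
    # log p = eta - log(1 + e^eta); log(1 - p) = -log(1 + e^eta)
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

# Toy 4-node graph with edges (0,1), (0,2), (1,2)
Y = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    Y[i, j] = Y[j, i] = 1

ll = edge_log_likelihood(Y, mu=0.0, a=np.zeros(4))
```

With all effects set to zero every pair has probability 0.5. Holding `a` at zero and varying only `mu` gives a single edge probability shared by all pairs.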
Special Case: Erdos-Renyi Graph
Likelihood
We can learn the θ's using maximum likelihood or Bayesian methods, via techniques such as gradient methods, MCMC, variational approximations, etc.
Adding Node and Edge Covariates
Covariates, weights
Example:
Adding Latent (Hidden) Variables
Hypothesize that the nodes are embedded in a latent (hidden) space
The probability of a link is higher if nodes are closer in this space
Given a set of observed links can we infer a set of “good locations”?
Old idea in social science, e.g., McFarland and Brown, “Social distance as metric…”, 1973
See also more recent word embedding methods
Latent Space Model
K-dimensional real-valued latent space vector for each node
Intuition:
• Embed nodes in a K-dimensional latent space, K much smaller than N
• Probability (or log-odds) of edge(i,j) decreases as i and j become further away
Hoff, Raftery, Handcock, JASA, 2002
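The intuition above can be sketched in a few lines, assuming the covariate-free distance form logit P(y_ij = 1) = alpha - ||z_i - z_j|| (the function name `distance_model_probs`, the intercept `alpha`, and the toy positions are illustrative assumptions, not values from the talk):

```python
import numpy as np

def distance_model_probs(Z, alpha):
    """Edge probabilities under an assumed latent distance model:
    logit P(y_ij = 1) = alpha - ||z_i - z_j||.
    """
    diff = Z[:, None, :] - Z[None, :, :]      # (N, N, K) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)      # (N, N) Euclidean distances
    P = 1.0 / (1.0 + np.exp(-(alpha - dist)))
    np.fill_diagonal(P, 0.0)                  # no self-edges
    return P

# Three nodes in a K = 2 latent space: 0 and 1 are close, 2 is far away
Z = np.array([[0.0, 0.0], [0.5, 0.0], [4.0, 0.0]])
P = distance_model_probs(Z, alpha=1.0)
```

Nodes 0 and 1 receive a higher edge probability than nodes 0 and 2, which is the "closer means more likely" behavior described on the slide.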
Figure from Hoff, Raftery, Handcock, 2002
Additive Latent Interactions
This model implies transitivity:
e.g., if (A,B) close and if (B,C) close then (A,C) close (and has high probability)
…but some relations are not transitive, e.g., “conflict”
Multiplicative Latent Interactions (Hoff, 2009)
W: K x K real-valued matrix (learned from the data)
Hoff (NIPS 2008) showed that for a diagonal W matrix (the latent eigenmodel) this model is a strict generalization of the distance model
For directed networks or rectangular matrices we can replace z_j with v_j, yielding links to matrix factorization
Building Blocks for Network Modeling
• e.g., g = log(p/(1-p))
• Network density
• Row and column effects
• Edge covariates and regression weights
• K-dimensional latent vector per node
• Similarity function on latent vectors
See also P. Hoff, Dyadic data analysis with amen, arXiv, 2015
Stochastic Block Model
Each node assumed to belong to 1 of K “stochastically equivalent” blocks
z vectors are K-dimensional indicators, e.g., z = [0, 0, 1, 0]
Within-block and between-block edge probabilities at block level, K x K matrix W
Nowicki and Snijders, 2001
(Figure from Goldenberg et al, 2010)
Example:
Interaction:
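The example and interaction formulas on this slide were lost in extraction. As a hedged illustration, assume the interaction is the bilinear form z_i' W z_j: with one-hot indicator vectors, it simply selects the entry of W for the two nodes' blocks. The probabilities in `W` and the helper `one_hot` are made up for this sketch.

```python
import numpy as np

K = 4
W = np.full((K, K), 0.05)        # between-block edge probability (made-up value)
np.fill_diagonal(W, 0.6)         # within-block edge probability (made-up value)

def one_hot(block, K):
    """K-dimensional indicator vector with a 1 at position `block`."""
    z = np.zeros(K)
    z[block] = 1.0
    return z

z_i = one_hot(2, K)              # z_i = [0, 0, 1, 0]
z_j = one_hot(2, K)              # same block as i
z_l = one_hot(0, K)              # a different block

p_within = z_i @ W @ z_j         # picks out W[2, 2]
p_between = z_i @ W @ z_l        # picks out W[2, 0]
```

All nodes in the same block share the same row and column of W, which is one way to read "stochastically equivalent".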
Binary Relational Feature Model
Each node can “turn on” any subset of K binary features (latent)
z vectors are K-dimensional binary vectors, e.g., z = [0, 0, 1, 1]
K x K weight matrix W captures feature interactions
[Figure: actors and their hidden features]
Presence of an edge between actor i and actor j is (e.g.) a logistic function of a weighted sum of the features they have in common
Estimation: based on MCMC or variational EM
Example:
Interaction:
Miller, Jordan, Griffiths, NIPS 2009
(Originally proposed as an infinite-dimensional non-parametric model)
Predictions on NIPS Coauthorship Data
From Miller, Griffiths, Jordan, 2009
Other Models
Mixed membership stochastic blockmodel (MMSB), Airoldi et al, 2008
Each node: a probability vector zi over K possible groups
W is a matrix of Bernoulli probabilities
Relational topic model, Chang and Blei 2009
For modeling linked documents, e.g., via citations
Each node = document = K-dimensional topic probability vector
Various possible combination functions to reflect topic similarity
General Formulation
• e.g., g = log(p/(1-p))
• Network density
• Row and column effects
• Edge covariates and regression weights
• K-dimensional latent vector per node
• Similarity function on latent vectors
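The labels on this slide annotate an equation that did not survive extraction. A plausible reconstruction, assuming an additive form in the spirit of the Hoff references listed earlier, is given below; the symbols \mu, a_i, b_j, \beta, x_{ij}, and f are assumed names, not recovered from the slide.

```latex
g\big(\mathrm{E}[y_{ij}]\big) = \mu + a_i + b_j + \beta^{\top} x_{ij} + f(z_i, z_j)
```

Under this reading, \mu sets the network density, a_i and b_j are the row and column effects, \beta holds the regression weights on the edge covariates x_{ij}, and f is a similarity function on the K-dimensional latent vectors, for example f = -\lVert z_i - z_j \rVert for the distance model or f = z_i^{\top} W z_j for the multiplicative model.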
Scalability
• The O(N^2) term in the likelihood is problematic for scalability
• However, there is hope
– In most real-world social networks the number of edges often scales as O(N), not O(N^2)
– but the number of non-edges still scales as O(N^2)
• This suggests factoring the likelihood into 2 pieces
– A product over edges, with O(N) terms
– A product over non-edges, with O(N^2) terms that we approximate with O(N) terms
– This idea has been discovered (and rediscovered) several times
Approximating the Log-Likelihood
Can approximate this term with O(N) randomly-sampled non-edges
See Raftery et al, 2012, J. Computational and Graphical Statistics
This idea can also be combined with stochastic gradient methods
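A minimal sketch of the subsampling idea, assuming a symmetric binary graph and uniform sampling of non-edges without replacement (the function names, the callable `prob`, and the toy data are illustrative assumptions; published schemes such as the one cited above may sample differently). The edge terms are summed exactly, and the non-edge sum is replaced by a rescaled sample mean:

```python
import numpy as np

def sample_non_edges(edges, n, m, rng):
    """Draw m distinct non-edges (i < j) uniformly by rejection sampling."""
    total_non_edges = n * (n - 1) // 2 - len(edges)
    m = min(m, total_non_edges)
    chosen = set()
    while len(chosen) < m:
        i, j = (int(v) for v in rng.integers(0, n, size=2))
        pair = (min(i, j), max(i, j))
        if i != j and pair not in edges:
            chosen.add(pair)
    return sorted(chosen), total_non_edges

def approx_log_likelihood(prob, edges, n, m, rng):
    """Exact sum over edges plus a rescaled sample of non-edge terms.

    prob(i, j) returns the model's edge probability for the pair (i, j).
    """
    edge_term = sum(np.log(prob(i, j)) for i, j in edges)
    sample, total_non_edges = sample_non_edges(edges, n, m, rng)
    if not sample:
        return float(edge_term)
    mean_log = np.mean([np.log1p(-prob(i, j)) for i, j in sample])
    return float(edge_term + total_non_edges * mean_log)

# Toy check with a constant edge probability, where the estimate is exact
edges = {(0, 1), (1, 2), (2, 3)}
n = 50
rng = np.random.default_rng(0)
approx = approx_log_likelihood(lambda i, j: 0.1, edges, n, m=100, rng=rng)
exact = 3 * np.log(0.1) + (n * (n - 1) // 2 - 3) * np.log(0.9)
```

With a constant probability every non-edge term is identical, so the estimate matches the exact value; in general the rescaled sample is an unbiased but noisy estimate of the non-edge sum.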
Stochastic Variational Inference: a-MMSB model
From Gopalan et al, 2012
Red: stochastic gradient with mini-batch
Blue: conventional gradient batch algorithm
Variations and Extensions
• Sender and receiver effects
– Latent vectors for sender and receiver roles can be different
• Rectangular matrices, bipartite graphs
– rows and columns each get their own latent vectors
• Multi-way arrays and tensors
• Bayesian estimation
– Fully Bayesian methods: infer posterior locations in latent space
– MAP and regularized variations: enforce sparsity in solutions
• Non-linear “deep” models
– Could incorporate non-linearities in various ways
Dynamic Networks: Adding Time
Networks over Time
• Many network problems are dynamic rather than static
– e.g., social relationships are changing over time
– instantaneous communication events (emails, phone calls)
• Edges, nodes, and covariates may all be evolving over time
– We will assume node set is fixed and edges and covariates may change
– Systematic temporal effects often important (time of day, day of week, seasonality)
• Different ways to define networks over time
– Snapshots at time t
– Aggregation over time windows
– Continuous time models
Discrete-Time Models
Y_t represents the network at discrete time t
Data D = {Y_1, …, Y_t, …, Y_T}
Example
actors = students in a school
Y_t = friendships between students measured in month t, t = 1, …, 12
Interest is often in network dynamics and evolution
e.g., Markov models for P(Y_{t+1} | Y_t)
(See work of Tom Snijders, Eric Xing, and others)
Figure from Carter Butts
General Formulation
In principle we can add time-dependence to any or all terms
One approach is to make the z’s time-dependent
i.e., allow the latent features of each actor to change over time
Example: linear Gaussian dynamics in z-space
- Sarkar and Moore (2005) for actors’ latent-space positions
- Fu, Song, and Xing (2009) for actors’ mixed membership vectors
Dynamic Relational Binary Feature Model
Recall for the static version:
• z_i = k-dimensional binary vector, e.g., (1, 0, 1, 0, 1)
• f(z_i, z_j) = z_i' W z_j, where W is a k x k matrix
• Common set of k features across all actors
Dynamic version (Dynamic Relational Features)
• Assume discrete time
• The kth feature for actor i, z_ik(t), is a binary hidden Markov process
• Features can turn on, persist, or turn off at each time step
• For infinite version, new features can be born over time
• Inference via MCMC – tricky, but works
Foulds, Asuncion, DuBois, Butts, Smyth 2011
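As a simplified forward simulation of the dynamics described above, assume a fixed finite number of features, each following an independent two-state Markov chain, and a logistic link on z_i(t)' W z_j(t). The parameter names `p_on` and `p_off`, the logistic link, and all numeric values are assumptions for illustration; the published model is infinite-dimensional and inferred by MCMC rather than simulated forward.

```python
import numpy as np

def simulate_feature_chains(n_actors, n_features, n_steps, p_on, p_off, rng):
    """Simulate Z[t, i, k] in {0, 1}: feature k of actor i at time t.

    Each feature is an independent two-state Markov chain: an "off"
    feature turns on with probability p_on, and an "on" feature turns
    off with probability p_off (otherwise it persists).
    """
    Z = np.zeros((n_steps, n_actors, n_features), dtype=int)
    for t in range(1, n_steps):
        u = rng.random((n_actors, n_features))
        prev = Z[t - 1]
        Z[t] = np.where(prev == 1, u >= p_off, u < p_on).astype(int)
    return Z

def edge_probs(Z_t, W):
    """P(edge i,j at time t) = logistic(z_i(t)' W z_j(t)) (assumed link)."""
    return 1.0 / (1.0 + np.exp(-(Z_t @ W @ Z_t.T)))

rng = np.random.default_rng(1)
Z = simulate_feature_chains(n_actors=5, n_features=3, n_steps=10,
                            p_on=0.2, p_off=0.1, rng=rng)
W = rng.normal(size=(3, 3))
P_last = edge_probs(Z[-1], W)
```

Because the edge probabilities at time t depend only on the feature vectors at time t, all temporal structure in this sketch comes from the feature chains.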
[Figure: hidden features for each actor, over time]
Presence of edge (i, j) at time t depends on the interaction of actor i's and actor j's feature vectors at time t
Example of DRIFT Predictions on Enron
Continuous-Time Data and Models
Relational events: < i, j, t >
y_t is an edge between some pair i and j at time t
Birth-death edges: each y_t has start and end times
Instantaneous edges: each y_t is (effectively) instantaneous
• Data D = {y_1, …, y_t, …, y_T}
In a certain sense there is no graph!
Example
actors = students in a school
y_t = text message between 2 students at time t
Interest is often in rates and patterns of communication
e.g., Poisson rates for y_ij given network history up to time t
Multinomial Models for Relational Events
• Let λ_ij be the rates of Poisson processes for each pair of nodes (i, j) in a network
• Assume for simplicity that these processes are conditionally independent given model parameters
• We can decompose the network process into
– A global rate λ = Σ_ij λ_ij, which generates events globally
– A choice process: given an event, which pair generated it, i.e., P(i, j | event) = λ_ij / λ
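The decomposition can be sketched as follows, using the standard superposition property of independent Poisson processes: the global rate is the sum of the pairwise rates, and each event is assigned to a pair with probability proportional to that pair's rate. The function names and the rate matrix are made up for illustration.

```python
import numpy as np

def decompose_rates(rates):
    """Split pairwise Poisson rates into a global rate and choice probabilities."""
    total = rates.sum()                 # global rate: sum of all pair rates
    return total, rates / total         # P(pair (i, j) | an event occurred)

def simulate_events(rates, horizon, rng):
    """Simulate (i, j, t) events on [0, horizon) using the decomposition."""
    total, choice = decompose_rates(rates)
    n = rates.shape[0]
    events, t = [], rng.exponential(1.0 / total)
    while t < horizon:
        flat = rng.choice(n * n, p=choice.ravel())   # which pair fired
        i, j = divmod(int(flat), n)
        events.append((i, j, t))
        t += rng.exponential(1.0 / total)            # next global event time
    return events

rates = np.array([[0.0, 2.0, 0.5],
                  [1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.0]])     # made-up rates, zero on the diagonal
total, choice = decompose_rates(rates)
events = simulate_events(rates, horizon=10.0, rng=np.random.default_rng(0))
```

The choice probabilities form a multinomial over pairs, which is what motivates the multinomial models for relational events.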
Marginal Product Mixture Model (DuBois and Smyth, 2010)
Multinomial over N^2 possible edges
Mixture over K unobserved groups
Distribution over senders for group k
Distribution over receivers for group k
Marginal probability of group k
Edge events (rather than nodes) belong to latent groups (unlike MMSB)
Straightforward to learn via EM or collapsed Gibbs sampling
Likelihood (DuBois and Smyth, 2010)
Product over events
Product over pairs with non-zero counts
For large sparse networks, number of non-zero pairs << N^2
Similar to use of multinomial versus Bernoulli models for text
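The slide equations were lost, so the following is a sketch under an assumed form: the probability of an event on pair (i, j) is a mixture over groups, p(i, j) = sum_k pi_k * theta_k(i) * phi_k(j), matching the three labeled pieces (group probability, sender distribution, receiver distribution). The symbols `pi`, `theta`, `phi`, and all values are illustrative, and self-pairs are not excluded here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 2

pi = rng.dirichlet(np.ones(K))             # marginal probability of each group
theta = rng.dirichlet(np.ones(N), size=K)  # theta[k]: distribution over senders
phi = rng.dirichlet(np.ones(N), size=K)    # phi[k]: distribution over receivers

# p(i, j) = sum_k pi[k] * theta[k, i] * phi[k, j]
P = np.einsum("k,ki,kj->ij", pi, theta, phi)

def log_likelihood(counts, P):
    """Multinomial log-likelihood (up to a constant) from pair counts.

    Only pairs with non-zero counts contribute, so a sparse count
    matrix needs far fewer than N^2 terms.
    """
    i, j = np.nonzero(counts)
    return float(np.sum(counts[i, j] * np.log(P[i, j])))

counts = np.zeros((N, N), dtype=int)
counts[0, 1], counts[2, 3] = 5, 2          # made-up event counts
ll = log_likelihood(counts, P)
```

Grouping repeated events into counts turns the product over events into a product over the non-zero pairs only.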
Application to Email Data:
200,000 email messages among 3000 individuals
(data from Eckmann, Moses, Sergi, 2004)
Most likely Edge Assignments by Group
Figures from DuBois and Smyth, 2010
International Relations Data
40,000 events
2700 actors
171 action types
(King, 2003)
Prediction and Evaluation
• Use future data to evaluate predictive power and compare models
– e.g., predict network at time t+1 given network up to time t
• Metrics
– Log score = log probability of events that actually occurred
– Brier/MSE style scores
– Ranking/ROC scores
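Two of the metrics above can be sketched directly. The function names and the toy predictions are illustrative assumptions; `P[i, j]` stands for whatever probability a model assigns to an event on pair (i, j).

```python
import numpy as np

def log_score(P, events):
    """Mean log probability assigned to the events that actually occurred."""
    return float(np.mean([np.log(P[i, j]) for i, j in events]))

def brier_score(P, Y):
    """Mean squared error between predicted edge probabilities and 0/1 outcomes."""
    return float(np.mean((P - Y) ** 2))

P = np.array([[0.0, 0.8],
              [0.3, 0.0]])          # made-up predicted probabilities
Y = np.array([[0, 1],
              [0, 0]])              # observed outcomes
events = [(0, 1)]

ls = log_score(P, events)
bs = brier_score(P, Y)
```

Higher (less negative) log scores and lower Brier scores indicate better predictions; the log score heavily penalizes events that were given near-zero probability.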
Simple Baseline for Comparison
• We could predict the likelihood of i and j communicating based directly on i and j’s history
– Multinomial with O(N^2) entries
– Can use smoothing to combat sparsity
• Problems
– Data can be extremely sparse for large N – smoothing is non-informative, and does not “borrow strength” from the graph
• Nonetheless this is a useful baseline when evaluating predictions
– Historically, few papers evaluate models predictively
– Even fewer compare their models to simple baselines
From DuBois and Smyth, 2010
Relational Event Model
Time-varying Poisson rate for edge i,j
Base rate
Sender and receiver effects
p-dim vector of regression parameters
p-dim vector of historical statistics on edge i,j
Butts, 2008
Edge rates are time-varying functions of historical features
Results in a piecewise constant (between events) Poisson process
Features can include conversation effects, recency, persistence, etc
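A hedged sketch of such a model, assuming the log-linear rate log lambda_ij(t) = mu + a_i + b_j + beta' s_ij(t) with two simple historical statistics (counts of past i->j and j->i events). The rate equation on the slide was lost, so this form, the choice of statistics, and the names below are assumptions. Since rates change only at events, the likelihood combines an event term with a survival term for each inter-event interval:

```python
import numpy as np

def rem_log_likelihood(events, n, mu, a, b, beta):
    """Log-likelihood of time-ordered events (i, j, t), with t > 0.

    Assumed rate: log lambda_ij = mu + a[i] + b[j] + beta . s_ij, where
    s_ij = (# past i->j events, # past j->i events). Rates change only
    when an event occurs, so they are constant between events.
    """
    counts = np.zeros((n, n))                  # past events per directed pair
    off_diag = ~np.eye(n, dtype=bool)          # exclude self-events
    ll, t_prev = 0.0, 0.0
    for i, j, t in events:
        log_rate = (mu + a[:, None] + b[None, :]
                    + beta[0] * counts + beta[1] * counts.T)
        # Survival term: no event on any pair during (t_prev, t)
        ll -= (t - t_prev) * np.exp(log_rate[off_diag]).sum()
        # Event term: the pair (i, j) fired at time t
        ll += log_rate[i, j]
        counts[i, j] += 1
        t_prev = t
    return float(ll)

events = [(0, 1, 0.5), (1, 0, 0.7), (0, 1, 1.2)]   # made-up data
n = 3
ll0 = rem_log_likelihood(events, n, mu=np.log(0.2),
                         a=np.zeros(n), b=np.zeros(n), beta=np.zeros(2))
```

Each event touches all N^2 pair rates in the survival term, so this naive version costs on the order of N^2 work per event. The sketch also ignores the interval between the last event and the end of the observation window.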
Parameter Estimation
• Likelihood includes terms for all events that occurred and all events that did not occur, for all inter-event times
– Computation of likelihood is O(T N^2), T = number of events
– Some computational tricks possible to improve scalability
– See Vu et al (ICML 2011, NIPS 2011) for extensions to large social networks and citation networks
• Can use point estimates (optimization) or Bayesian inference (MCMC)
Applications?
• Modeling classroom interactions in education [DuBois, Butts, McFarland, Smyth, J Math Psych, 2013]
• Understanding and predicting citation patterns among documents [Vu et al, NIPS 2011, ICML 2011; Foulds and Smyth, EMNLP 2013]
• Modeling communication patterns among individuals [DuBois, Smyth, KDD 2010; Foulds et al, AISTATS 2011]
• Clustering individuals in email networks over time [Navaroli, DuBois, Smyth, MLJ, 2013]
Modeling Cascades
• Given a structural network with binary directed/undirected edges
• A cascade is a sequence of “node infections” (may have time-stamps)
– E.g., a post that spreads on a network such as Facebook or LinkedIn
• We observe a set of cascades, e.g.,
{A, B, E}, {B, A, D, F}, {A, B, C, E, F}, ….
• Given cascades …. make inferences about the “infection process”
[Figure: example network with nodes A, B, C, D, E, F]
Prior Work
• Ideas based on epidemics in networks
– Analyze how infection spreads as a function of network structure
• e.g., work by Kempe, Kleinberg, Newman, and many others
– Typically assume a single homogeneous infection rate β
– Typically do not look at learning from data
• Statistical models (more recent)
– Define a generative model (i.e., likelihood) for cascades on a network
– Example
• Assume cascades are independent
• Assume heterogeneous infection rates for different edges
• Define a probabilistic model of how infection spreads to the next node
– Learn parameters, e.g., a matrix of infection rates β
(see work by Manuel Gomez-Rodriguez and colleagues)
Summary
• Static networks
– Statistical models can be built up from basic building blocks
– Latent representations (“node embeddings”) can be broadly useful
• Dynamic networks
– Modeling networks over time can be more straightforward than static case
– More natural representation of the underlying data
– Notion of prediction is clearer
– Can build these models using same building blocks as for static networks
• Scalability of the learning algorithms is a general issue, but promising approaches are emerging