Poisson Gamma Dynamical Systems - Columbia University (ScheinZhouWallach2016_poster.pdf)




Augment and conquer

Sequentially observed count vectors $y^{(1)}, \ldots, y^{(T)}$, where $y^{(t)}_v$ is the number of times event type $v$ occurred during time step $t$; e.g., counting daily interactions between pairs of countries. [Figure: example count vector for July 1, 2003.]

Poisson–Gamma Dynamical Systems

Aaron Schein¹, Mingyuan Zhou², Hanna Wallach³

Objective: Gibbs-sampling inference

Poisson–gamma dynamical system

$$y^{(t)}_v \sim \text{Pois}\Big(\delta^{(t)} \sum_{k=1}^{K} \phi_{kv}\,\theta^{(t)}_k\Big) \qquad \theta^{(t)}_k \sim \text{Gam}\Big(\tau_0 \sum_{k_2=1}^{K} \pi_{kk_2}\,\theta^{(t-1)}_{k_2},\ \tau_0\Big)$$

$$\pi_k \sim \text{Dir}(\nu_1\nu_k, \ldots, \xi\nu_k, \ldots, \nu_K\nu_k) \qquad \theta^{(1)}_k \sim \text{Gam}(\tau_0\nu_k,\ \tau_0) \qquad \nu_k \sim \text{Gam}\Big(\tfrac{\gamma_0}{K},\ \beta\Big)$$

Poisson matrix factorization (natural for count matrices) combined with a dynamical system of gammas (the conjugate prior to the Poisson):

$$\mathbb{E}\big[y^{(t)}\big] = \Phi\,\theta^{(t)}\,\delta^{(t)} \qquad \mathbb{E}\big[\theta^{(t)}\big] = \Pi\,\theta^{(t-1)}$$

Columns of the transition matrix $\Pi$ are probability vectors. A shrinkage prior on $\nu_k$ shuts off unneeded model capacity by shrinking the transition probabilities $\pi_{kk_2}$ and the initial chain value $\theta^{(1)}_k$.
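As a concrete illustration, here is a minimal sketch of ancestral sampling from this generative model. The sizes and hyperparameter values are arbitrary assumptions for the sketch (not the paper's settings), and treating each $\phi_k$ as a Dirichlet draw over the $V$ types is likewise an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes and hyperparameters for illustration only
K, V, T = 5, 20, 30
tau0, gamma0, beta, xi = 1.0, 1.0, 1.0, 1.0
delta = np.ones(T)                           # per-step scales delta^(t)

# Component weights: nu_k ~ Gam(gamma0/K, beta)  (shape, rate)
nu = rng.gamma(gamma0 / K, 1.0 / beta, size=K)

# Transition matrix: column k is pi_k ~ Dir(nu_1 nu_k, ..., xi nu_k, ..., nu_K nu_k)
Pi = np.empty((K, K))
for k in range(K):
    alpha = nu * nu[k]
    alpha[k] = xi * nu[k]
    Pi[:, k] = rng.dirichlet(alpha)

# Factor loadings phi_k (assumed Dirichlet over the V types)
Phi = rng.dirichlet(np.ones(V), size=K)      # shape (K, V)

# Latent chain: theta^(1)_k ~ Gam(tau0 nu_k, tau0), then the gamma dynamics
theta = np.empty((T, K))
theta[0] = rng.gamma(tau0 * nu, 1.0 / tau0)
for t in range(1, T):
    theta[t] = rng.gamma(tau0 * (Pi @ theta[t - 1]), 1.0 / tau0)

# Observed counts: y^(t)_v ~ Pois(delta^(t) sum_k phi_kv theta^(t)_k)
Y = rng.poisson(delta[:, None] * (theta @ Phi))
print(Y.shape)  # (30, 20)
```

Note that each column of `Pi` sums to one, matching the constraint that columns of the transition matrix are probability vectors.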

Challenge: Conditional non-conjugacy in the original model means the conditional posteriors are not available in closed form

Solution: Augment the model with auxiliary variables and transform it into a model with closed-form conditional posteriors

Three rules, applied recursively, transform the original model.

MCMC inference

Forwards sampling:

$$\theta^{(1)}_k \sim \text{Gam}\big(m^{(1)}_k + \tau_0\nu_k,\ \tau_0 + \delta^{(1)} + \zeta^{(2)}\tau_0\big)$$

for $t = 2, \ldots, T$:

$$\theta^{(t)}_k \sim \text{Gam}\Big(m^{(t)}_k + \tau_0\sum_{k_2=1}^{K}\pi_{kk_2}\theta^{(t-1)}_{k_2},\ \ \tau_0 + \delta^{(t)} + \zeta^{(t+1)}\tau_0\Big)$$

Backwards filtering:

for $t = T, \ldots, 2$:

$$\zeta^{(t)} := \ln\big(1 + \delta^{(t)}\tau_0^{-1} + \zeta^{(t+1)}\big)$$

for $k = 1, \ldots, K$:

$$m^{(t)}_k := y^{(t)}_k + \sum_{k_1=1}^{K} l^{(t+1)}_{k_1k} \qquad l^{(t)}_{k\cdot} \sim \text{CRT}\Big(m^{(t)}_k,\ \tau_0\sum_{k_2=1}^{K}\pi_{kk_2}\theta^{(t-1)}_{k_2}\Big)$$

$$\big(l^{(t)}_{kk_2}\big)_{k_2=1}^{K} \sim \text{Mult}\Big(l^{(t)}_{k\cdot},\ \big(\pi_{kk_2}\theta^{(t-1)}_{k_2}\big)_{k_2=1}^{K}\Big)$$

[Diagram: each $K \times K$ count matrix $l^{(t)}_{k_1k_2}$ is allocated across columns from its row sums $l^{(t)}_{k_1\cdot}$; summing across rows gives the column sums $l^{(t)}_{\cdot k_2}$, which are used to sample the row sums $l^{(t-1)}_{k_1\cdot}$ for step $t-1$.]

Setup to BFFS. Input: $\zeta^{(T+1)}$ (default is 0) and

$$l^{(T+1)}_{k_1k_2} \sim \text{Pois}\big(\zeta^{(T+1)}\,\tau_0\,\pi_{k_1k_2}\,\theta^{(T)}_{k_2}\big)$$

Conditional posteriors for all latent variables are available under one or all of the alternate models.
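The backward pass above can be sketched in code. This is a minimal illustration under assumed shapes: `delta` is the length-$T$ vector of $\delta^{(t)}$ scales, and `sample_crt` draws from the Chinese restaurant table distribution used for the row sums:

```python
import numpy as np

def backward_zeta(delta, tau0, zeta_T_plus_1=0.0):
    """zeta^(t) := ln(1 + delta^(t)/tau0 + zeta^(t+1)) for t = T, ..., 2.
    zeta[t] holds zeta^(t); zeta^(T+1) defaults to 0 (no future information)."""
    T = len(delta)
    zeta = np.zeros(T + 2)
    zeta[T + 1] = zeta_T_plus_1
    for t in range(T, 1, -1):
        zeta[t] = np.log(1.0 + delta[t - 1] / tau0 + zeta[t + 1])
    return zeta

def sample_crt(m, r, rng):
    """Chinese restaurant table draw: l ~ CRT(m, r) is a sum of independent
    Bernoulli(r / (r + i)) variables for i = 0, ..., m - 1."""
    i = np.arange(m)
    return int((rng.random(m) < r / (r + i)).sum())

rng = np.random.default_rng(0)
zeta = backward_zeta(np.ones(5), tau0=1.0)   # zeta[5] = ln(1 + 1 + 0) = ln 2
l = sample_crt(10, 2.0, rng)                 # an integer in [0, 10]
```

Running the recursion backwards accumulates the "future information" term $\zeta^{(t+1)}$ into each step's rate, which is what lets the subsequent forward pass sample the $\theta^{(t)}_k$ in a single sweep.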

Sampling the transition matrix: for $k = 1, \ldots, K$:

$$\pi_k \sim \text{Dir}\big(\nu_1\nu_k + l^{(\cdot)}_{1k},\ \ldots,\ \xi\nu_k + l^{(\cdot)}_{kk},\ \ldots,\ \nu_K\nu_k + l^{(\cdot)}_{Kk}\big)$$

This results in an efficient backward filtering–forward sampling (BFFS) algorithm.
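A sketch of this Gibbs step, assuming the auxiliary counts have already been aggregated over time into a $K \times K$ matrix `L_dot` with `L_dot[k1, k2]` $= l^{(\cdot)}_{k_1k_2}$ (the names and the toy counts below are illustrative, not from the paper):

```python
import numpy as np

def sample_transition_matrix(L_dot, nu, xi, rng):
    """Dirichlet Gibbs update for each column pi_k of the transition matrix:
    pi_k ~ Dir(nu_1 nu_k + l_{1k}, ..., xi nu_k + l_{kk}, ..., nu_K nu_k + l_{Kk})."""
    K = len(nu)
    Pi = np.empty((K, K))
    for k in range(K):
        alpha = nu * nu[k] + L_dot[:, k]       # prior nu_{k1} nu_k plus counts
        alpha[k] = xi * nu[k] + L_dot[k, k]    # diagonal entry gets the xi nu_k prior
        Pi[:, k] = rng.dirichlet(alpha)
    return Pi

rng = np.random.default_rng(0)
K = 4
L_dot = rng.poisson(3.0, size=(K, K))          # toy aggregated counts
Pi = sample_transition_matrix(L_dot, np.ones(K), 2.0, rng)
```

Each sampled column is a probability vector, so the update preserves the column-stochastic structure of $\Pi$.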

[Figure: inferred latent factors $\theta^{(t)}_k$ over time, 1988–2000 (NIPS, yearly) and Jan 2003–Dec 2003 (GDELT, daily).]

Interpretable latent structure

[Figure: inferred component weights $\nu_k$, transition matrix $\pi_{k_1k_2}$, and time-step factors $\theta^{(t)}_k$.]

All parameters are non-negative and interpretable.

NIPS corpus data: $y^{(t)}_v$ = number of times word type $v$ was used in NIPS papers during year $t$. Shown are the word types $v$ with the largest $\phi_{kv}$ in component $k$.

GDELT data: $y^{(t)}_v$ = number of interactions between country pair $v$ during day $t$. Shown are the country pairs with the largest $\phi_{kv}$ in component $k$.

The three components visualized here are those with the largest $\nu_k$ values.

Shrinkage promotes diagonal structure in the transition matrix

1. Green (Israel–Palestine)
2. Blue (Iraq War)
3. Red (Six-party talks)

Predictive performance: NIPS (top) versus ICEWS (bottom). The NIPS corpus is less bursty.

PGDS has a better inductive bias for bursty count data

Baseline models

Gaussian linear dynamical system (LDS):

$$y^{(t)} \sim \mathcal{N}\big(\Phi\theta^{(t)},\ D\big) \qquad \theta^{(t)} \sim \mathcal{N}\big(\Pi\theta^{(t-1)},\ \Sigma\big)$$

Gamma process dynamic Poisson factor analysis (GP-DPFA):

$$y^{(t)}_v \sim \text{Pois}\Big(\sum_{k=1}^{K} \lambda_k\,\phi_{kv}\,\theta^{(t)}_k\Big) \qquad \theta^{(t)}_k \sim \text{Gam}\big(\theta^{(t-1)}_k,\ c^{(t)}\big)$$

We compared the predictive performance on smoothing (predicting missing entries in the input matrix) and forecasting (predicting future data) to two baselines on two country event data sets (GDELT, ICEWS) and three text data sets (SOTU, DBLP, NIPS).

Step 1: Augment with a Poisson
Step 2: Apply Rule 1
Step 3: Apply Rule 2
Step 3: Augment with a CRT
Step 3: Apply Rule 3

Alternative model (recurse)

When a red variable has green arrows leading out, we can form its conditional posterior!

[Diagram callouts: $\zeta^{(T+1)}$ represents future information; "Original model" labels the graphical model before augmentation.]

We measure burstiness as:

$$\hat{B} = \frac{1}{V}\sum_{v=1}^{V} \frac{T}{T-1}\,\frac{\sum_{t=1}^{T-1}\big|y^{(t+1)}_v - y^{(t)}_v\big|}{\sum_{t=1}^{T} y^{(t)}_v}$$
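A direct transcription of this statistic, assuming the counts are arranged as a $T \times V$ array and every type has a nonzero total:

```python
import numpy as np

def burstiness(Y):
    """B_hat = (1/V) sum_v (T/(T-1)) * sum_{t<T} |y_v^(t+1) - y_v^(t)| / sum_t y_v^(t).
    Y is a (T, V) count array; a type with zero total would divide by zero."""
    T, V = Y.shape
    num = np.abs(np.diff(Y, axis=0)).sum(axis=0)   # per-type absolute step changes
    den = Y.sum(axis=0)                            # per-type totals
    return (T / (T - 1)) * np.mean(num / den)

print(burstiness(np.ones((4, 3))))                  # constant series -> 0.0
print(burstiness(np.tile([[0.0], [2.0]], (2, 3))))  # 0,2,0,2 per type -> 2.0
```

Constant series score zero while series that oscillate sharply score high, which is why the statistic separates the smoother NIPS corpus from the burstier event data.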

Rule 1: Two independent Poissons are multinomial when conditioned on their sum (the two generative processes below are equivalent):

$$y \sim \text{Pois}(\theta),\ \ l \sim \text{Pois}(\theta) \qquad \Longleftrightarrow \qquad m \sim \text{Pois}(2\theta),\ \ (y, l) \sim \text{Mult}\big(m,\ (0.5, 0.5)\big), \ \text{i.e. } y \sim \text{Bin}(m, 0.5),\ l := m - y$$

Rule 2: A Poisson with a gamma-distributed rate becomes a negative binomial if its rate is marginalized out:

$$\theta \sim \text{Gam}(\alpha, \beta),\ \ m \sim \text{Pois}(\theta) \qquad \Longrightarrow \qquad m \sim \text{NB}\Big(\alpha,\ \tfrac{1}{1+\beta}\Big)$$

Rule 3: The magic bivariate count distribution. The same bivariate distribution factorizes in two ways that encode different conditional independencies:

$$m \sim \text{NB}\Big(\alpha,\ \tfrac{1}{1+\beta}\Big),\ \ l \sim \text{CRT}(m, \alpha) \qquad \Longleftrightarrow \qquad l \sim \text{Pois}\big(\alpha \ln(1+\beta^{-1})\big),\ \ m \sim \text{SumLog}\Big(l,\ \tfrac{1}{1+\beta}\Big)$$
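Rule 2 is easy to check by simulation. A sketch (note that numpy parameterizes the negative binomial by the success probability $\beta/(1+\beta)$, the complement of the $1/(1+\beta)$ written above):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, n = 3.0, 2.0, 200_000

# Gamma-Poisson mixture: theta ~ Gam(alpha, rate beta), then m | theta ~ Pois(theta)
theta = rng.gamma(alpha, 1.0 / beta, size=n)
m_mix = rng.poisson(theta)

# Marginal per Rule 2: m ~ NB(alpha, 1/(1+beta)); in numpy's convention this is
# negative_binomial(alpha, beta / (1 + beta))
m_nb = rng.negative_binomial(alpha, beta / (1.0 + beta), size=n)

# Both samples should have mean approximately alpha / beta = 1.5
print(round(m_mix.mean(), 1), round(m_nb.mean(), 1))
```

The sample means and variances of the two draws agree up to Monte Carlo error, confirming that marginalizing the gamma rate yields the stated negative binomial.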