TRANSCRIPT
Theory of Big Data 2 Conference
Big Data Institute, University College London
Causal Inference from Multivariate Time Series: Principles and Problems
Michael Eichler
Department of Quantitative Economics
Maastricht University
http://researchers-sbe.unimaas.nl/michaeleichler
6 January 2016
Outline
• Causality concepts
• Graphical representation
  • Definition
  • Markov properties
  • Extension: systems with latent variables
• Causal learning
  • Basic principles
  • Identification from empirical relationships
• Non-Markovian constraints
  • Trek separation in graphs
  • Tetrad representation theorem
  • Testing for tetrad constraints
• Open problems and conclusions
Concepts of causality for time series
We consider two variables X and Y measured at discrete times t ∈ ℤ:
X = (X_t)_{t∈ℤ},  Y = (Y_t)_{t∈ℤ}.
Question: When is it justified to say that X causes Y?
Various approaches:
• intervention causality (Pearl 1993; Eichler & Didelez 2007, 2010)
• structural causality (White & Lu 2010)
• Granger causality (Granger 1969, 1980, 1988)
• Sims causality (Sims 1972)
Granger causality
Two fundamental principles:
• The cause precedes its effect in time.
• The causal series contains special information about the series being caused that is not available otherwise.
This leads us to consider two information sets:
• F*(t) — all information in the universe up to time t
• F*_{−X}(t) — this information except the values of X

Granger's definition of causality (Granger 1969, 1980)
We say that X causes Y if the probability distributions of
• Y_{t+1} given F*(t) and
• Y_{t+1} given F*_{−X}(t)
are different.
Granger causality
Problem: This definition cannot be used with actual data.
Suppose the data consist of a multivariate time series V = (X, Y, Z), and let
• X^t denote the information given by X up to time t,
• and similarly Y^t and Z^t for Y and Z.

Definition: Granger non-causality
• X is Granger-noncausal for Y with respect to V if
  Y_{t+1} ⊥⊥ X^t | Y^t, Z^t.
• Otherwise we say that X Granger-causes Y with respect to V.

Additionally:
• X and Y are said to be contemporaneously independent with respect to V if
  X_{t+1} ⊥⊥ Y_{t+1} | V^t.
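In the linear (VAR) case this definition is typically checked by comparing nested autoregressions. Below is a minimal sketch assuming a fixed lag order p; the function granger_test and the simulated chain X → Z → Y are illustrative assumptions, not material from the talk.

```python
# Sketch of a conditional Granger non-causality test: regress Y_{t+1} on
# lags of (Y, Z) with and without lags of X and compare residual sums of
# squares via an F-test.
import numpy as np
from scipy import stats

def granger_test(x, y, z, p=2):
    """F-test of 'X does not Granger-cause Y with respect to V = (X, Y, Z)'."""
    T = len(y)
    # lagged regressor matrix: columns are the series at lags 1..p
    lags = lambda s: np.column_stack([s[p - k:T - k] for k in range(1, p + 1)])
    target = y[p:]
    restricted = np.column_stack([np.ones(T - p), lags(y), lags(z)])
    full = np.column_stack([restricted, lags(x)])
    rss = lambda M: np.sum((target - M @ np.linalg.lstsq(M, target, rcond=None)[0]) ** 2)
    rss0, rss1 = rss(restricted), rss(full)
    df1, df2 = p, (T - p) - full.shape[1]
    F = ((rss0 - rss1) / df1) / (rss1 / df2)
    return F, stats.f.sf(F, df1, df2)

rng = np.random.default_rng(0)
T = 500
x = rng.standard_normal(T)
z = np.zeros(T); y = np.zeros(T)
for t in range(1, T):           # X drives Z, Z drives Y: X affects Y only indirectly
    z[t] = 0.8 * x[t - 1] + rng.standard_normal()
    y[t] = 0.8 * z[t - 1] + rng.standard_normal()
F, pval = granger_test(x, y, z)
print(f"F = {F:.2f}, p = {pval:.3f}")   # typically large p: no *direct* Granger causality
```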
Sims causality
Definition: Sims non-causality
X does not Sims-cause Y with respect to V = (X, Y, Z) if
{Y_{t′} | t′ > t} ⊥⊥ X_t | X^{t−1}, Y^t, Z^t.

Note:
• Granger causality is a concept of direct causality.
• Sims causality is a concept of total causality (direct and indirect pathways).
The following statistics are measures of Sims causality:
• impulse response function (time and frequency domain)
• directed transfer function (DTF)
Vector autoregressive processes
Let X be a multivariate stationary Gaussian time series with vector autoregressive representation
X_t = ∑_{k=1}^{∞} A_k X_{t−k} + ε_t = ∑_{k=0}^{∞} B_k ε_{t−k}.

Granger non-causality in VAR models:
The following are equivalent:
• X_b does not Granger-cause X_a with respect to X;
• A_{ab,k} = 0 for all k ∈ ℕ.

Sims non-causality in VAR models:
The following are equivalent:
• X_b does not Sims-cause X_a with respect to X;
• B_{ab,k} = 0 for all k ∈ ℕ.
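For a finite-order VAR the two characterizations are linked by the recursion B_0 = I, B_k = ∑_{j=1}^{min(k,p)} A_j B_{k−j}. A sketch with an assumed three-variable chain X_1 → X_2 → X_3 (the coefficient values are illustrative):

```python
# Zero entries in the A_k encode Granger non-causality; zero entries in
# the moving-average coefficients B_k encode Sims non-causality, which
# aggregates direct and indirect pathways.
import numpy as np

# X1 -> X2 -> X3 with no direct X1 -> X3 link; A[0] is the lag-1 matrix,
# entry (a, b) being the effect of X_b on X_a.
A = [np.array([[0.5, 0.0, 0.0],
               [0.4, 0.5, 0.0],
               [0.0, 0.4, 0.5]])]
p, d = len(A), A[0].shape[0]

B = [np.eye(d)]                       # B_0 = I
for k in range(1, 10):
    B.append(sum(A[j - 1] @ B[k - j] for j in range(1, min(k, p) + 1)))

print("A_1[3,1] =", A[0][2, 0])   # 0: X1 is not a direct Granger cause of X3
print("B_2[3,1] =", B[2][2, 0])   # nonzero: X1 Sims-causes X3 via X2
```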
Graphical models for time series
Basic idea: use graphs to encode conditional independences among variables:
• nodes/vertices represent variables;
• a missing edge between two nodes implies conditional independence of the two variables.
Application to time series:
• treat each variable at each time point separately (time series chain graphs), or
• treat each series as one variable (only one node per series in the graph).
Graphical models for time series
Granger causality graphs (Eichler 2007)
Idea: represent the Granger-causal relations in X by a mixed graph G:
• vertices v ∈ V represent the variables (time series) X_v;
• directed edges between the vertices indicate Granger-causal relationships;
• additionally, undirected (dashed) edges indicate contemporaneous associations.
Graphical models for time series
Granger causality graphs (Eichler 2007)
Example: consider the five-dimensional autoregressive process X_V
X_t = f(X_{t−1}) + ε_t
[Figure: mixed graph G on vertices 1–5]
with
• X_{1,t} = f_1(X_{3,t−1}) + ε_{1,t}
• X_{2,t} = f_2(X_{4,t−1}) + ε_{2,t}
• X_{3,t} = f_3(X_{1,t−1}, X_{2,t−1}) + ε_{3,t}
• X_{4,t} = f_4(X_{3,t−1}, X_{5,t−1}) + ε_{4,t}
• X_{5,t} = f_5(X_{3,t−1}) + ε_{5,t}
• (ε_{1,t}, ε_{2,t}, ε_{3,t}) ⊥⊥ (ε_{4,t}, ε_{5,t}) and ε_{4,t} ⊥⊥ ε_{5,t}
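A simulation of this system is sketched below. The linear choice f_v(x) = 0.4 · (sum of parents) and the shared noise component among ε_1–ε_3 are assumptions chosen to satisfy the stated independence structure, not values from the talk.

```python
# Illustrative simulation of the five-dimensional example; the parent
# sets match the equations on the slide.
import numpy as np

rng = np.random.default_rng(1)
T = 1000
X = np.zeros((T, 5))
for t in range(1, T):
    e = rng.standard_normal(5)
    e[:3] += 0.7 * rng.standard_normal()   # eps_1..eps_3 mutually dependent,
                                           # jointly independent of (eps_4, eps_5)
    X[t, 0] = 0.4 * X[t - 1, 2] + e[0]                   # X1 <- X3
    X[t, 1] = 0.4 * X[t - 1, 3] + e[1]                   # X2 <- X4
    X[t, 2] = 0.4 * (X[t - 1, 0] + X[t - 1, 1]) + e[2]   # X3 <- X1, X2
    X[t, 3] = 0.4 * (X[t - 1, 2] + X[t - 1, 4]) + e[3]   # X4 <- X3, X5
    X[t, 4] = 0.4 * X[t - 1, 2] + e[4]                   # X5 <- X3
# Each parent set contributes a directed edge of G; the correlated
# innovations among series 1-3 contribute dashed (contemporaneous) edges.
```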
Markov properties
Objective: derive Granger-causal relationships for X_S, S ⊆ V
Idea: characterize the pathways that induce associations
Tool: concepts of separation in graphs
• DAGs: d-separation (Pearl 1988)
• mixed graphs: d-separation (Spirtes et al. 1998, Koster 1999) or m-separation (Richardson 2003)
Markov properties
Chain 1 → 2 → 3:
p(x) = p(x_3|x_2) p(x_2|x_1) p(x_1)  ⇒  X_3 ⊥⊥ X_1 | X_2
Fork 1 ← 2 → 3:
p(x) = p(x_1|x_2) p(x_3|x_2) p(x_2)  ⇒  X_3 ⊥⊥ X_1 | X_2
Collider 1 → 2 ← 3:
p(x) = p(x_2|x_1, x_3) p(x_3) p(x_1)  ⇏  X_3 ⊥⊥ X_1 | X_2
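In the Gaussian case these conditional independences correspond to zero partial correlations, which the simulation below reproduces for all three structures; the unit coefficients are arbitrary assumptions for the demo.

```python
# Estimate the partial correlation of X1 and X3 given X2 for the chain,
# fork, and collider factorizations.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

def partial_corr(x1, x2, x3):
    """Correlation of X1 and X3 after linearly regressing out X2."""
    r1 = x1 - np.polyval(np.polyfit(x2, x1, 1), x2)
    r3 = x3 - np.polyval(np.polyfit(x2, x3, 1), x2)
    return np.corrcoef(r1, r3)[0, 1]

# chain 1 -> 2 -> 3
x1 = rng.standard_normal(n); x2 = x1 + rng.standard_normal(n); x3 = x2 + rng.standard_normal(n)
print("chain   :", round(partial_corr(x1, x2, x3), 3))   # ~ 0
# fork 1 <- 2 -> 3
x2 = rng.standard_normal(n); x1 = x2 + rng.standard_normal(n); x3 = x2 + rng.standard_normal(n)
print("fork    :", round(partial_corr(x1, x2, x3), 3))   # ~ 0
# collider 1 -> 2 <- 3: conditioning on X2 *induces* dependence
x1 = rng.standard_normal(n); x3 = rng.standard_normal(n); x2 = x1 + x3 + rng.standard_normal(n)
print("collider:", round(partial_corr(x1, x2, x3), 3))   # clearly nonzero
```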
Global Granger-causal Markov property
Separation in mixed graphs
Question: What type of paths induce Granger-causal relations between variables?
Note: Granger (non)causality is not symmetric.
Idea: consider only paths ending with a directed edge →.
Example: the path 1 → 2 ← 3 → 4 entails
• X_1 does not Granger-cause X_4 with respect to X_1, X_4
• X_1 does not Granger-cause X_4 with respect to X_1, X_3, X_4
• X_1 does not Granger-cause X_4 with respect to X_1, X_2, X_3, X_4
but not
• X_1 does not Granger-cause X_4 with respect to X_1, X_2, X_4
(conditioning on the collider 2 without 3 activates the path).
Principles of causal inference
Objective: identify the causal structure of the process X
Question: What should be used in practice?
• Granger causality or Sims causality?
• bivariate or fully multivariate analysis?
Answer: For causal inference . . . all and more.
Principles of identification
An example of indirect causality: the graph
1 → 2 → 3
implies for the bivariate submodel
1 → 3
Principles of identification
An example of spurious causality:
[Figure: graph on vertices 1, 2, 3 together with a latent variable L]
implies for the trivariate and bivariate submodels
[Figure: induced graphs on {1, 2, 3} and on {1, 3}]
Principles of identification
Inverse problem:
What can we say about the full system based on the Granger-noncausal relations observed for the (sub)process?
Suppose
• X_a → X_c [X_S] for all {a, c} ⊆ S ⊆ V,
• X_c → X_b [X_S] for all {c, b} ⊆ S ⊆ V.
Rules of causal inference
• Indirect causality rule: X_a truly causes X_b if
  X_a ↛ X_b [X_S] for some S ⊆ V with c ∈ S.
• Spurious causality rule: X_a is a spurious cause of X_b if
  X_a ↛ X_b [X_S] for some S ⊆ V with c ∉ S.
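An illustrative encoding of the two rules' decision logic (not an algorithm from the talk): given the conditioning sets S for which the observed non-causality X_a ↛ X_b [X_S] holds, the position of the mediator c relative to S separates the two cases.

```python
# Toy sketch: classify an a -> b relation from observed non-causality sets.
def indirect_cause(noncausal_sets, c):
    # non-causality appears once c is conditioned on -> a acts on b *through* c
    return any(c in S for S in noncausal_sets)

def spurious_cause(noncausal_sets, c):
    # non-causality appears in a set *excluding* c -> the a -> b link is spurious
    return any(c not in S for S in noncausal_sets)

# Example: X_a -/-> X_b [X_S] is observed only for S = {a, b, c}
observed = [{"a", "b", "c"}]
print(indirect_cause(observed, "c"), spurious_cause(observed, "c"))   # True False
```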
Principles of causal inference
[Figure: causal graph over the variables Y, Z, X, U; panels show estimates of A_{YX}(h) against lag h from bivariate and trivariate Granger analyses, and of B_{YX}(h) from a trivariate Sims analysis]
Principles of causal inference
[Figure: causal graph over the variables Y, Z, X, U, V; panels show estimates of A_{YX}(h) against lag h from bivariate and trivariate Granger analyses, and of B_{YX}(h) from a trivariate Sims analysis]
Identification of causal structure
Algorithm: identification of adjacencies
• insert a dashed edge a --- b whenever X_a and X_b are not contemporaneously independent
• insert a → b whenever
  • X_a → X_b [X_S] for all S ⊆ V with a, b ∈ S;
  • X_a(t−k) is not independent of X_b(t+1) given F_{S_1}(t) ∨ F_{S_2}(t−k) ∨ F_a(t−k−1)
    for all k ∈ ℕ, t ∈ ℤ, and all disjoint S_1, S_2 ⊆ V with b ∈ S_1 and a ∉ S_1 ∪ S_2.
Identification of causal structure
Algorithm: identification of tails
• colliders: a — c — b in G and X_a ↛ X_b [X_S] for some S with c ∉ S
  ⇒ the edge between c and b receives an arrowhead at c (c ← b)
• non-colliders: a — c — b in G and X_a ↛ X_b [X_S] for some S with c ∈ S
  ⇒ the edge between c and b receives a tail at c (c → b)
• ancestors: a → · · · → b in G ⇒ the edge between a and b is oriented as a → b
• discriminating paths: e.g. Ali et al. (2004)
Identification of causal structure
Example: application to neural spike train data
[Figure: spike trains of neurons 1–10 over 8 seconds; estimated partial directed coherence pdc(i → j) against lag (−60 to 60) for the pairs 1→2, 1→3, 1→4, 2→3, 2→4, 3→4]
Identification of causal structure
Example:
[Figure: orientation steps (a)–(k) for the mixed graph on vertices 1, 2, 3, 4]
Result:
[Figure: identified graph on vertices 1, 2, 3, 4]
Problem
Example:
[Figure: latent variable L with directed edges to each of the observed vertices 1, 2, 3, 4]
• X_1, X_2, X_3, X_4 are conditionally independent given L
• no conditional independences among X_1, . . . , X_4
Trek separation
Problem:
• conditional independences are not sufficient to describe processes that involve latent variables
• identification of such structures relies on sparsity that is often not given
Approach: Sullivant et al. (2011) for multivariate Gaussian distributions
• a new concept of separation in graphs
• encodes rank constraints on minors of the covariance matrix
• generalizes other concepts of separation
• special case: conditional independences
Trek separation
A trek between nodes i and j is a path π = (π_L, π_M, π_R) such that
• π_L is a directed path from some node k_L to i;
• π_R is a directed path from some node k_R to j;
• π_M is an undirected edge k_L --- k_R or a path of length zero (k_L = k_R).
Examples: i ← k_L --- k_R → j,   i ← v ← k → j,   i ← v → j,   i --- j

Definition (trek separation)
(C_L, C_R) t-separates sets A and B if every trek (π_L, π_M, π_R) between A and B satisfies
• π_L contains a vertex in C_L, or
• π_R contains a vertex in C_R.
Trek separation
Let X be a stationary Gaussian process with spectral matrix Σ(ω) satisfying
Σ(ω) = (1/2π) ∑_{u=−∞}^{∞} cov(X_t, X_{t−u}) e^{−iuω}.

Theorem
Let X be G-Markov. Then the following are equivalent:
• rank(Σ_{AB}(ω)) ≤ r for all ω ∈ [−π, π];
• A and B are t-separated by some (C_L, C_R) with |C_L| + |C_R| ≤ r.
Trek separation
Corollaries:
Let X be a stationary Gaussian process. Then
X_A ⊥⊥ X_B | X_C  ⇔  rank(Σ_{A∪C, B∪C}) = |C|.
Furthermore, the following are equivalent:
• X_A ⊥⊥ X_B | X_C for all G-Markov processes X;
• (C_A, C_B) t-separates A ∪ C and B ∪ C for some partition C = C_A ∪ C_B.
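The rank characterization has a simple numerical illustration in the static Gaussian case; the construction below, where A and B depend on each other only through C, is an assumption for the demo.

```python
# If X_A and X_B are independent given X_C, the covariance block
# Sigma_{A u C, B u C} has rank |C| (here |C| = 1).
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
c = rng.standard_normal(n)
a = c + 0.5 * rng.standard_normal(n)     # A depends on B only through C
b = -c + 0.5 * rng.standard_normal(n)
data = np.vstack([a, c, b, c])           # rows: (A u C), then (B u C)
Sigma = np.cov(data)
block = Sigma[:2, 2:]                    # Sigma_{A u C, B u C}
print(np.linalg.svd(block, compute_uv=False))  # one large, one ~0 singular value: rank 1
```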
Tetrad representation theorem
Consider the class M(G) of all G-Markov stationary Gaussian processes.

Proposition
The following are equivalent:
• the spectral matrices Σ(·) of processes in M(G) satisfy
  Σ_{ik}(ω) Σ_{jl}(ω) − Σ_{il}(ω) Σ_{jk}(ω) = 0;
• {i, j} and {k, l} are t-separated by (c, ∅) or (∅, c) for some node c in G.
Tetrad representation theorem
If the spectral matrix Σ(ω) satisfies the tetrad constraints
Σ_{ik}(ω) Σ_{jl}(ω) − Σ_{il}(ω) Σ_{jk}(ω) = 0
Σ_{ij}(ω) Σ_{kl}(ω) − Σ_{il}(ω) Σ_{kj}(ω) = 0
Σ_{ik}(ω) Σ_{lj}(ω) − Σ_{ij}(ω) Σ_{lk}(ω) = 0
then there exists a node P such that X_i, X_j, X_k, and X_l are mutually conditionally independent given X_P.
[Figure: node P with directed edges to each of 1, 2, 3, 4]
Note: if no such X_P is among the observed variables, X_P must be a latent factor.
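To see why a single common factor forces the tetrad constraints, a static analogue is easy to check numerically; the loadings and noise variances below are arbitrary assumptions.

```python
# One-factor model: Sigma = lam lam' + diag(psi), so every off-diagonal
# entry factorizes as lam_i * lam_j and all tetrad differences vanish.
import numpy as np

lam = np.array([0.9, 0.7, 0.8, 0.6])       # loadings on a single factor P
Sigma = np.outer(lam, lam) + np.diag([0.5, 0.4, 0.3, 0.6])

i, j, k, l = 0, 1, 2, 3
t1 = Sigma[i, k] * Sigma[j, l] - Sigma[i, l] * Sigma[j, k]
t2 = Sigma[i, j] * Sigma[k, l] - Sigma[i, l] * Sigma[k, j]
t3 = Sigma[i, k] * Sigma[l, j] - Sigma[i, j] * Sigma[l, k]
print(t1, t2, t3)   # all (numerically) zero: the three tetrad constraints hold
```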
Testing tetrad constraints
Approach: nonparametric test (Eichler 2008)
Null hypothesis: ψ(Σ(ω)) ≡ 0, where ψ(Z) = z_{ik} z_{jl} − z_{il} z_{jk}
Test statistic:
S_T = ∫ |ψ(Σ̂(ω))|² dω,
where Σ̂(ω) is a kernel spectral estimator with bandwidth b_T.

Theorem
Under the null hypothesis,
b_T^{1/2} T S_T − b_T^{−1/2} μ  →_D  N(0, σ²),
where
μ = C_h C_{w,2} ∫ tr[ ∇ψ(Σ(ω))′ Σ(ω) ∇ψ(Σ(−ω)) Σ(ω) ] dω,
σ² = 4π C_h² C_{w,4} ∫ | tr[ ∇ψ(Σ(ω))′ Σ_{AA}(ω) ∇ψ(Σ(−ω)) Σ_{BB}(ω) ] |² dω.
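As a rough illustration of how S_T might be computed, the sketch below estimates Σ̂(ω) by a smoothed periodogram and integrates |ψ|² over frequency. The rectangular smoothing window, the bandwidth, and the omission of the centering term b_T^{−1/2} μ are simplifications, not Eichler's (2008) implementation.

```python
import numpy as np

def spectral_estimate(X, half_width=10):
    """Smoothed periodogram: X is (T, d); returns (T, d, d) over FFT frequencies."""
    T, d = X.shape
    F = np.fft.fft(X - X.mean(0), axis=0) / np.sqrt(2 * np.pi * T)
    I = np.einsum('ta,tb->tab', F, F.conj())       # periodogram matrices I(omega)
    kernel = np.ones(2 * half_width + 1) / (2 * half_width + 1)
    S = np.empty_like(I)
    for a in range(d):                             # smooth each entry, wrapping
        for b in range(d):                         # around by periodicity
            padded = np.r_[I[-half_width:, a, b], I[:, a, b], I[:half_width, a, b]]
            S[:, a, b] = np.convolve(padded, kernel, 'valid')
    return S

def tetrad_statistic(S, i, j, k, l):
    psi = S[:, i, k] * S[:, j, l] - S[:, i, l] * S[:, j, k]
    return (2 * np.pi / len(S)) * np.sum(np.abs(psi) ** 2)   # Riemann sum of the integral

rng = np.random.default_rng(4)
f = rng.standard_normal(500)                       # one common factor
X = np.outer(f, [0.9, 0.7, 0.8, 0.6]) + 0.5 * rng.standard_normal((500, 4))
print(tetrad_statistic(spectral_estimate(X), 0, 1, 2, 3))   # small under the one-factor null
```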
Latent variable models
Common identifiability constraint for factor models: factors are uncorrelated/independent.
But: in many applications (e.g. in neuroscience), we think of latent variables that are causally connected:
• EEG recordings measure neural activity in nearby cortical regions;
• fMRI recordings measure hemodynamic responses, which depend on the underlying neural activity.
Objective: recover the latent processes and the interrelations among them.
Latent variable models
Suppose that Y(t) can be partitioned into Y_{I_1}(t), . . . , Y_{I_r}(t) such that
Y_{I_j}(t) = Λ_j X_j(t) + ε_{I_j}(t)
and X(t) is a VAR(p) process.
Then the model can be fitted by the following steps (a code sketch follows below):
• identify clusters of variables depending on one latent variable (based on tetrad rules)
• use PCA to determine the latent variable processes X_j(t)
• fit a VAR model to all latent variable processes jointly
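A minimal sketch of steps 2 and 3, assuming the clusters from step 1 are already identified; the use of statsmodels' VAR fit and all simulated loadings are illustrative assumptions. Note that PCA recovers each latent series only up to sign and scale, so the fitted VAR coefficients are determined up to that indeterminacy.

```python
import numpy as np
from statsmodels.tsa.api import VAR

def fit_latent_var(Y, clusters, p=2):
    """Y: (T, d) observations; clusters: list of column-index lists."""
    factors = []
    for idx in clusters:
        block = Y[:, idx] - Y[:, idx].mean(0)
        # leading principal component of the cluster as a proxy for X_j(t)
        _, _, Vt = np.linalg.svd(block, full_matrices=False)
        factors.append(block @ Vt[0])
    X = np.column_stack(factors)
    return VAR(X).fit(p)                 # joint VAR over the latent proxies

# usage with simulated two-cluster data (loadings are assumptions):
rng = np.random.default_rng(5)
T = 600
X = np.zeros((T, 2))
for t in range(1, T):                    # latent VAR(1): X1 drives X2
    X[t] = [0.6 * X[t-1, 0], 0.4 * X[t-1, 0] + 0.5 * X[t-1, 1]]
    X[t] += rng.standard_normal(2)
Y = np.column_stack([np.outer(X[:, 0], [1.0, 0.8, 0.9]),
                     np.outer(X[:, 1], [1.0, 0.7, 0.8])]) + 0.3 * rng.standard_normal((T, 6))
res = fit_latent_var(Y, clusters=[[0, 1, 2], [3, 4, 5]], p=1)
print(res.coefs[0].round(2))             # lag-1 matrix: X1 -> X2 entry nonzero (up to sign)
```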
Latent variable models
Example:
[Figure: five simulated time series X(1)–X(5), each of length 1000]
Latent variable models
Example:
Set {1,2} with:
• {3,4}: S = −0.98
• {3,5}: S = −0.31
• {4,5}: S = −1.4
[Figure: absolute residuals |Res| against observation index for each pairing]
Latent variable models
Example:
Set {1,3} with:
• {2,4}: S = −1.37
• {2,5}: S = 0.76
• {4,5}: S = −0.44
[Figure: |Res| against index for each pairing]
Latent variable models
Example:
Set {1,4} with:
• {2,3}: S = −1.19
• {2,5}: S = 6.54
• {3,5}: S = 6.55
[Figure: |Res| against index for each pairing]
Latent variable models
Example:
Set {1,5} with:
• {2,3}: S = −1.22
• {2,4}: S = 5.43
• {3,4}: S = 5.77
[Figure: |Res| against index for each pairing]
Latent variable models
Example:
Set {2,3} with:
• {1,4}: S = −1.18
• {1,5}: S = −1.21
• {4,5}: S = −1.58
[Figure: |Res| against index for each pairing]
Latent variable models
Example:
Set {2,4} with:
• {1,3}: S = −1.36
• {1,5}: S = 5.43
• {3,5}: S = 5.66
[Figure: |Res| against index for each pairing]
Latent variable models
Example:
Set {2,5} with:
• {1,3}: S = 0.76
• {1,4}: S = 6.55
• {3,4}: S = 5.73
[Figure: |Res| against index for each pairing]
Latent variable models
Example:
Set {3,4} with:
• {1,2}: S = −0.98
• {1,5}: S = 5.77
• {2,5}: S = 5.73
[Figure: |Res| against index for each pairing]
Latent variable models
Example:
Set {3,5} with:
• {1,2}: S = −0.31
• {1,4}: S = 6.54
• {2,4}: S = 5.66
[Figure: |Res| against index for each pairing]
Latent variable models
Example:
Set {4,5} with:
• {1,2}: S = −1.41
• {1,3}: S = −0.44
• {2,3}: S = −1.58
[Figure: |Res| against index for each pairing]
Latent variable models
Example:
[Figure: identified structure with latent variables P and Q over the observed series 1–5]
Latent variable models
Example:
[Figure: structure with latent variables L1, L2, L3 over the observed series 1–6]
Conclusion
Causal inference is a complex task:
• requires modelling at all levels (bivariate to fully multivariate)
• requires Granger causality as well as other measures (e.g. Sims causality)
• definite results may be sparse without further assumptions
• latent variables induce further (non-Markovian) constraints on the distribution

Open problems:
• merging of information about latent variables; development of algorithms for latent variables
• uncertainty in the identification of Granger-causal relationships
• instantaneous causality
• aggregation over time (distortion of identification; identification only possible up to Markov equivalence)
• non-stationarity and non-linearity
References
• Eichler (2007). Granger causality and path diagrams for multivariate time series. Journal of Econometrics 137, 334–353.
• Eichler (2008). Testing nonparametric and semiparametric hypotheses in vector stationary processes. Journal of Multivariate Analysis 99, 968–1009.
• Eichler (2009). Causal inference from time series: What can be learned from Granger causality? In: C. Glymour, W. Wang, D. Westerståhl (eds), Proceedings of the 13th International Congress of Logic, Methodology and Philosophy of Science. College Publications, London.
• Eichler (2010). Graphical modelling of multivariate time series with latent variables. Journal of Machine Learning Research W&CP 9.
• Eichler (2012). Graphical modelling of multivariate time series. Probability Theory and Related Fields 153, 233–268.
• Eichler (2012). Causal inference in time series analysis. In: C. Berzuini, A.P. Dawid, L. Bernardinelli (eds), Causality: Statistical Perspectives and Applications. Wiley, Chichester.
• Eichler (2013). Causal inference with multiple time series: principles and problems. Philosophical Transactions of the Royal Society A 371, 20110613.