Learning DAGs from observational data
General overview
• Introduction
• DAGs and conditional independence
• DAGs and causal effects
• Learning DAGs from observational data
• IDA algorithm
• Further problems
What can we do when the DAG is unknown?
• Knowing the DAG is unrealistic in high-dimensional settings. So we assume that the data come from an unknown DAG.
• A DAG encodes conditional independence relationships. So given all conditional independence relationships in the observational distribution, can we learn the DAG?
• Almost... several DAGs can encode the same conditional independence relationships. They are Markov equivalent.
• Example: the independence relationships X1 ⊥⊥ X3 and X1 ⊥⊥ X3 | X2 in four DAGs:

  DAG             X1 ⊥⊥ X3   X1 ⊥⊥ X3 | X2
  X1 → X2 → X3    false      true            (no v-structure)
  X1 ← X2 ← X3    false      true            (no v-structure)
  X1 ← X2 → X3    false      true            (no v-structure)
  X1 → X2 ← X3    true       false           (v-structure)
• A v-structure is a triple i → j ← k where i and k are not adjacent
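To see this pattern numerically, here is a minimal simulation sketch (Python; not part of the slides, and the coefficients and variable names are illustrative assumptions): linear-Gaussian data generated from a chain and from a collider, checking the correlations that correspond to X1 ⊥⊥ X3 and X1 ⊥⊥ X3 | X2.

```python
# Sketch: compare the (partial) correlation pattern of a chain vs. a collider.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def partial_corr_given_x2(x1, x2, x3):
    """Correlation of the residuals of X1 and X3 after regressing out X2."""
    r1 = x1 - np.polyval(np.polyfit(x2, x1, 1), x2)
    r3 = x3 - np.polyval(np.polyfit(x2, x3, 1), x2)
    return np.corrcoef(r1, r3)[0, 1]

# Chain X1 -> X2 -> X3 (no v-structure)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
x3 = x2 + rng.normal(size=n)
print("chain:    corr =", round(np.corrcoef(x1, x3)[0, 1], 3),
      " pcorr|X2 =", round(partial_corr_given_x2(x1, x2, x3), 3))

# Collider X1 -> X2 <- X3 (v-structure)
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
x2 = x1 + x3 + rng.normal(size=n)
print("collider: corr =", round(np.corrcoef(x1, x3)[0, 1], 3),
      " pcorr|X2 =", round(partial_corr_given_x2(x1, x2, x3), 3))
```

For the chain the marginal correlation is clearly nonzero and the partial correlation given X2 vanishes; for the collider the pattern is reversed, as in the table above.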
Markov equivalence class
• All DAGs in a Markov equivalence class have the same skeleton and the same v-structures (Verma and Pearl, 1990)
• They can be uniquely represented by a CPDAG:
  • edge between X and Y iff X and Y are d-connected given S for all subsets S of the remaining variables (edges are stronger than in CIGs/Gaussian Graphical Models)
  • X → Y iff X → Y in all DAGs in the equivalence class (direct causal effect)
  • X – Y iff there is a DAG in the equivalence class with X → Y and one with X ← Y (unidentifiable orientations)
• Example:
  [Figure: a CPDAG and the four DAGs (DAG 1 to DAG 4) in its Markov equivalence class]
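The Verma and Pearl (1990) criterion is easy to check directly. Below is a small Python sketch (helper names and the edge-list format are illustrative assumptions) that tests Markov equivalence of two DAGs by comparing skeletons and v-structures.

```python
# Sketch: Markov equivalence via same skeleton + same v-structures.
from itertools import combinations

def skeleton(edges):
    """Undirected edges {i, j} of a DAG given as directed pairs (i, j)."""
    return {frozenset(e) for e in edges}

def v_structures(edges, nodes):
    """Normalized triples (i, j, k) with i -> j <- k and i, k not adjacent."""
    parents = {v: {i for i, j in edges if j == v} for v in nodes}
    adj = skeleton(edges)
    return {(min(i, k), j, max(i, k))
            for j in nodes
            for i, k in combinations(parents[j], 2)
            if frozenset((i, k)) not in adj}

def markov_equivalent(e1, e2, nodes):
    return (skeleton(e1) == skeleton(e2)
            and v_structures(e1, nodes) == v_structures(e2, nodes))

nodes = {"X1", "X2", "X3"}
chain    = [("X1", "X2"), ("X2", "X3")]   # X1 -> X2 -> X3
fork     = [("X2", "X1"), ("X2", "X3")]   # X1 <- X2 -> X3
collider = [("X1", "X2"), ("X3", "X2")]   # X1 -> X2 <- X3
print(markov_equivalent(chain, fork, nodes))      # True
print(markov_equivalent(chain, collider, nodes))  # False
```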
Causal structure learning
• Learning (Markov equivalence classes of) DAGs is challenging. Main methods:
  • Score-based methods: e.g. Greedy Equivalence Search (Chickering, 2002)
  • Constraint-based methods: e.g. PC algorithm (Spirtes et al., 2000)
    • Fast
    • Consistent for high-dimensional sparse graphs (Kalisch & Bühlmann, 2007)
  • Restricted structural equation models: e.g. LiNGAM (Shimizu et al., 2006; Tübingen group)
    • DAG is identifiable!
Faithfulness
• Constraint-based methods require a faithfulness assumption: the conditional independencies in the distribution exactly equal the ones encoded in the DAG via d-separation
• Example of a distribution that is not faithful to its generating DAG:
  X1 ← ε1
  X2 ← X1 + ε2
  X3 ← X1 − X2 + ε3

  [Figure: DAG with edges X1 → X2 (coefficient 1), X1 → X3 (coefficient 1), X2 → X3 (coefficient −1)]
• X1 and X3 are not d-separated by the empty set
• But: X1 = ε1, X2 = ε1 + ε2, X3 = ε1 − (ε1 + ε2) + ε3 = −ε2 + ε3. Hence, X1 and X3 are independent.
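A quick numerical check of this cancellation (a Python sketch using the coefficients from the slide; the seed and sample size are arbitrary):

```python
# Sketch: the path effects X1 -> X3 and X1 -> X2 -> X3 cancel exactly,
# so X1 and X3 are marginally independent despite not being d-separated.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
eps1, eps2, eps3 = rng.normal(size=(3, n))

x1 = eps1
x2 = x1 + eps2          # X2 <- X1 + eps2
x3 = x1 - x2 + eps3     # X3 <- X1 - X2 + eps3  =  -eps2 + eps3

print(np.corrcoef(x1, x3)[0, 1])  # approximately 0
```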
Skeleton of a DAG
• Under the faithfulness assumption:
  • There is an edge between Xi and Xj in the DAG if and only if Xi and Xj are dependent given every subset of the remaining variables
  • This means that the skeleton of a DAG is determined uniquely by conditional independence relationships
  • But the directions of the edges are generally not uniquely determined
PC algorithm
• Assuming faithfulness, a CPDAG can be estimated by the PC algorithm of Peter Spirtes and Clark Glymour (2000):
  • Determine the skeleton (a sketch of this phase follows this slide)
    • No edge between Xi and Xj
      ⇐⇒ Xi ⊥⊥ Xj | S for some subset S of the remaining variables
      ⇐⇒ Xi ⊥⊥ Xj | S′ for some subset S′ of adj(Xi) or of adj(Xj)
    • Start with the complete graph
    • For k = 0, 1, ...: consider all pairs of adjacent vertices (Xi, Xj), and remove the edge if they are conditionally independent given some subset of size k of adj(Xi) or of adj(Xj)
  • Determine the v-structures
  • Direct as many of the remaining edges as possible
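A minimal Python sketch of the skeleton phase just described (the function name pc_skeleton and the ci_test callback are illustrative assumptions, not pcalg's API; ci_test(i, j, S) should return True when Xi ⊥⊥ Xj | S is accepted):

```python
# Sketch of the PC skeleton phase: start complete, remove edges using
# conditioning sets of growing size k drawn from current adjacency sets.
from itertools import combinations

def pc_skeleton(nodes, ci_test):
    adj = {v: set(nodes) - {v} for v in nodes}   # complete graph
    sepset = {}                                  # separating sets found
    k = 0
    # Stop once no vertex has enough neighbors to form a size-k set.
    while any(len(adj[i] - {j}) >= k for i in nodes for j in adj[i]):
        for i in nodes:
            for j in list(adj[i]):
                if j not in adj[i]:
                    continue  # edge already removed this sweep
                # Test all size-k subsets of adj(i) \ {j}; the symmetric
                # pass over (j, i) covers subsets of adj(j) \ {i}.
                for S in combinations(sorted(adj[i] - {j}), k):
                    if ci_test(i, j, set(S)):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[frozenset((i, j))] = set(S)
                        break
        k += 1
    return adj, sepset
```

The separating sets recorded in sepset are what the subsequent v-structure step uses: orient Xi → Xj ← Xk for nonadjacent Xi, Xk whenever Xj is not in their separating set.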
PC algorithm - oracle version
• Assume faithfulness and an ‘oracle’ that tells us whether or not X ⊥⊥ Y | S for any triple (X, Y, S).
• Then a CPDAG can be estimated by the PC algorithm of Peter Spirtes and Clark Glymour (2000):
  • Determine the skeleton
  • Determine the v-structures
  • Direct as many of the remaining edges as possible
• Fast implementation in the R-package pcalg (Kalisch et al., 2012)
• Consistent in sparse high-dimensional settings (Kalisch and Bühlmann, 2007)
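When the true DAG is known, the ‘oracle’ can be realized as a d-separation check. Below is a self-contained Python sketch (illustrative function name and edge-list format; not pcalg's implementation) using the standard moralization criterion: x and y are d-separated by S iff they are separated by S in the moralized ancestral graph of {x, y} ∪ S.

```python
# Sketch: a d-separation oracle for a known DAG via moralization.
from itertools import combinations

def d_separated(edges, x, y, s):
    parents = {}
    for a, b in edges:                       # edge a -> b
        parents.setdefault(b, set()).add(a)
        parents.setdefault(a, set())
    # 1. Restrict to the ancestors of {x, y} and S.
    anc, stack = set(), [x, y, *s]
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moralize: marry co-parents, drop edge directions.
    und = {v: set() for v in anc}
    for b in anc:
        ps = parents.get(b, set())
        for a in ps:
            und[a].add(b); und[b].add(a)
        for a, c in combinations(ps, 2):
            und[a].add(c); und[c].add(a)
    # 3. Delete S and check whether x and y are still connected.
    seen, stack = set(s), [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False                     # connected => d-connected
        if v in seen:
            continue
        seen.add(v)
        stack.extend(und[v])
    return True

# e.g. the v-structure X1 -> X2 <- X3:
# d_separated([("X1","X2"), ("X3","X2")], "X1", "X3", set())   -> True
# d_separated([("X1","X2"), ("X3","X2")], "X1", "X3", {"X2"})  -> False
```

Passing this as the ci_test callback to the pc_skeleton sketch above gives an oracle version of the skeleton phase.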
PC algorithm - sample version
• Instead of the oracle, we perform conditional independence tests
• In the multivariate Gaussian setting, this is equivalent to testing for zero partial correlation: H0: ρij|S = 0 versus Ha: ρij|S ≠ 0.
• Partial correlations can be computed via regression, inversion of parts of the covariance matrix, or a recursive formula
• For testing, it is helpful to use Fisher's Z-transform:

    zij|S = (1/2) log( (1 + ρ̂ij|S) / (1 − ρ̂ij|S) ).

  Under H0, √(n − |S| − 3) · zij|S ∼ N(0, 1).

• Hence, we reject H0 versus Ha if √(n − |S| − 3) · |zij|S| > Φ⁻¹(1 − α/2)
• The significance level α serves as a tuning parameter for the PC algorithm
• We perform very many tests during the algorithm. Can we obtain consistency results?
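A Python sketch of this Gaussian CI test (illustrative function names; the partial correlation is computed by inverting the relevant part of the covariance matrix, one of the options listed above):

```python
# Sketch: test H0: rho_{ij|S} = 0 via Fisher's Z-transform.
import numpy as np
from scipy.stats import norm

def partial_corr(data, i, j, S):
    """rho_{ij|S} from the precision matrix of the columns (i, j, *S)."""
    idx = [i, j, *S]
    prec = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

def gauss_ci_test(data, i, j, S, alpha=0.01):
    """Return True iff H0: rho_{ij|S} = 0 is NOT rejected at level alpha."""
    n = data.shape[0]
    r = partial_corr(data, i, j, S)
    z = 0.5 * np.log((1 + r) / (1 - r))        # Fisher's Z-transform
    stat = np.sqrt(n - len(S) - 3) * abs(z)    # ~ N(0,1) under H0
    return stat <= norm.ppf(1 - alpha / 2)
```

With columns of a data matrix as nodes, this plugs into the pc_skeleton sketch earlier, e.g. via `lambda i, j, S: gauss_ci_test(data, i, j, sorted(S))`, with α playing its role as the tuning parameter.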
High-dimensional asymptotic framework
• Since typical datasets in biology contain many more variables than observations, we consider a framework in which the graph is allowed to grow with the sample size n:
  • DAG: Gn
  • Number of variables: pn
  • Variables: Xn1, ..., Xnpn
  • Distribution: Pn
  • Partial correlations: ρn,ij|S
Assumptions
• Pn is multivariate Gaussian and faithful to the true unknown causal DAG Gn
• High-dimensionality and sparseness:
  • pn = O(n^a), for some 0 ≤ a < ∞
  • Maximum number of neighbors in Gn is qn = O(n^(1−b)), for some 0 < b ≤ 1
• Regularity conditions on partial correlations:
  • sup_{n, i≠j, S} |ρn,ij|S| ≤ M for some M < 1, where S ⊆ {Xn1, ..., Xnpn} \ {Xni, Xnj} with |S| ≤ qn
  • inf_{i,j,S} {|ρn,ij|S| : ρn,ij|S ≠ 0} ≥ cn, where S ⊆ {Xn1, ..., Xnpn} \ {Xni, Xnj} with |S| ≤ qn and cn⁻¹ = O(n^d) for some 0 < d < b/2
High dimensional consistency
• Denote the estimated CPDAG by Ĉn(αn) and the true CPDAG by Cn.
• Then there exists a sequence αn → 0 such that

    P(Ĉn(αn) = Cn) = 1 − O(exp(−C n^(1−2d))),

  for some C > 0 and d as in the assumptions (Kalisch & Bühlmann, 2007)
Sketch of proof
• Sketch of the proof:
  • En,ij|S is the event of a type I/II error when testing ρn,ij|S = 0
  • Let PC_qn denote the PC algorithm where we test conditional independencies up to level qn
  • Choose αn s.t. P(En,ij|S) = O(n exp(−C(n − qn) cn^2)) if |S| ≤ qn
  • Then

      P(error occurs in PC_qn(αn))
        ≤ P(∪_{i,j,S: |S| ≤ qn} En,ij|S) ≤ Σ_{i,j,S: |S| ≤ qn} P(En,ij|S)
        ≤ O(pn^(qn+2)) · O(n exp(−C(n − qn) cn^2))
        = O(exp(qn log(pn) + log(n) − C(n − qn) cn^2))
        = O(exp(n^(1−b) a log(n) + log(n) − C n^(1−2d) + C n^(1−2d−b))) → 0
Summary: learning DAGs from observational data
• Markov equivalence class
• Faithfulness
• PC algorithm
• Consistency in high-dimensional settings