Learning DAGs from observational data

General overview

• Introduction

• DAGs and conditional independence

• DAGs and causal effects

• Learning DAGs from observational data

• IDA algorithm

• Further problems

What can we do when the DAG is unknown?

• Knowing the DAG is unrealistic in high-dimensional settings. So we assume that the data come from an unknown DAG.

• A DAG encodes conditional independence relationships. So, given all conditional independence relationships in the observational distribution, can we learn the DAG?

• Almost... several DAGs can encode the same conditional independence relationships. They are Markov equivalent.

• Example (checked by simulation below): consider the statements X1 ⊥⊥ X3 and X1 ⊥⊥ X3|X2 for the four DAGs on X1, X2, X3:

  X1 → X2 → X3:  X1 ⊥⊥ X3 false,  X1 ⊥⊥ X3|X2 true   (no v-structure)
  X1 ← X2 ← X3:  X1 ⊥⊥ X3 false,  X1 ⊥⊥ X3|X2 true   (no v-structure)
  X1 ← X2 → X3:  X1 ⊥⊥ X3 false,  X1 ⊥⊥ X3|X2 true   (no v-structure)
  X1 → X2 ← X3:  X1 ⊥⊥ X3 true,   X1 ⊥⊥ X3|X2 false  (v-structure)

• A v-structure is a triple i → j ← k where i and k are not adjacent
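
The truth values in the table are easy to check numerically. A minimal sketch in base R (coefficients and sample size are illustrative, not from the slides), testing conditional independence via correlations of regression residuals:

    set.seed(1)
    n <- 1e5

    ## Chain X1 -> X2 -> X3 (no v-structure)
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n)
    x3 <- x2 + rnorm(n)
    cor(x1, x3)                                  # nonzero: X1 ⊥⊥ X3 is false
    cor(resid(lm(x1 ~ x2)), resid(lm(x3 ~ x2)))  # ~0: X1 ⊥⊥ X3|X2 is true

    ## Collider X1 -> X2 <- X3 (v-structure)
    x1 <- rnorm(n)
    x3 <- rnorm(n)
    x2 <- x1 + x3 + rnorm(n)
    cor(x1, x3)                                  # ~0: X1 ⊥⊥ X3 is true
    cor(resid(lm(x1 ~ x2)), resid(lm(x3 ~ x2)))  # nonzero: X1 ⊥⊥ X3|X2 is false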

Markov equivalence class

• All DAGs in a Markov equivalence class have the same skeleton and the same v-structures (Verma and Pearl, 1990)

• They can be uniquely represented by a CPDAG:
  • edge between X and Y iff X and Y are d-connected given S for all subsets S of the remaining variables (edges are stronger than in CIGs/Gaussian graphical models)
  • X → Y iff X → Y in all DAGs in the equivalence class (direct causal effect)
  • X – Y (undirected) iff there is a DAG in the equivalence class with X → Y and one with X ← Y (unidentifiable orientations)

• Example: [figure: a CPDAG together with the four DAGs (DAG 1–DAG 4) in the Markov equivalence class it represents]
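
The pcalg package can map a DAG to the CPDAG of its equivalence class. A minimal sketch, assuming pcalg and its graph dependencies are installed (the random DAG is purely illustrative):

    library(pcalg)

    set.seed(42)
    g <- randomDAG(5, prob = 0.4)  # a random DAG on 5 nodes (graphNEL object)
    cpdag <- dag2cpdag(g)          # its CPDAG: directed edges are identifiable,
                                   # undirected edges are not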

Causal structure learning

• Learning (Markov equivalence classes of) DAGs is challenging. Main methods:
  • Score-based methods: e.g. Greedy Equivalence Search (Chickering, 2002)
  • Constraint-based methods: e.g. PC algorithm (Spirtes et al., 2000)
    • Fast
    • Consistent for high-dimensional sparse graphs (Kalisch & Bühlmann, 2007)
  • Restricted structural equation models: e.g. LiNGAM (Shimizu et al., 2006; Tübingen group)
    • DAG is identifiable!

Faithfulness

• Constraint-based methods require a faithfulness assumption: the conditional independencies in the distribution exactly equal the ones encoded in the DAG via d-separation

• Example of a distribution that is not faithful to its generating DAG:

X1

X1 ← ε1X2

X2 ← X1 + ε21

X3X3 ← X1 −X2 + ε3

-1

1

• X1 and X3 are not d-separated by the empty set

• But:

  X1 = ε1,
  X2 = ε1 + ε2,
  X3 = ε1 − (ε1 + ε2) + ε3 = −ε2 + ε3.

  Hence, X1 and X3 are independent: the effect of X1 along X1 → X3 exactly cancels the effect along X1 → X2 → X3.
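
This cancellation is easy to verify by simulation. A minimal sketch in base R (Gaussian errors and the sample size are illustrative choices, not from the slides):

    set.seed(1)
    n  <- 1e5
    e1 <- rnorm(n); e2 <- rnorm(n); e3 <- rnorm(n)

    x1 <- e1
    x2 <- x1 + e2
    x3 <- x1 - x2 + e3

    cor(x1, x3)  # ~0, although X1 and X3 are d-connected: not faithful
    cor(resid(lm(x1 ~ x2)), resid(lm(x3 ~ x2)))  # nonzero: dependent given X2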

Skeleton of a DAG

• Under the faithfulness assumption:
  • There is an edge between Xi and Xj in the DAG if and only if Xi and Xj are dependent given every subset of the remaining variables

  • This means that the skeleton of a DAG is uniquely determined by conditional independence relationships

  • But the directions of the edges are generally not uniquely determined

PC algorithm

• Assuming faithfulness, a CPDAG can be estimated by the PC algorithm of Peter Spirtes and Clark Glymour (2000):
  • Determine the skeleton (a code sketch follows this list):
    • No edge between Xi and Xj
      ⇐⇒ Xi ⊥⊥ Xj|S for some subset S of the remaining variables
      ⇐⇒ Xi ⊥⊥ Xj|S′ for some subset S′ of adj(Xi) or of adj(Xj)
    • Start with the complete graph
    • For k = 0, 1, . . .: consider all pairs of adjacent vertices (Xi, Xj), and remove the edge if they are conditionally independent given some subset of size k of adj(Xi) or of adj(Xj)
  • Determine the v-structures
  • Direct as many of the remaining edges as possible
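
A minimal R sketch of the skeleton phase (pc.skeleton and ci.test are illustrative names; ci.test stands for any conditional independence oracle or test and must return TRUE iff Xi ⊥⊥ Xj|S is accepted):

    ## Skeleton search of the PC algorithm.
    pc.skeleton <- function(p, ci.test) {
      adj <- matrix(TRUE, p, p); diag(adj) <- FALSE    # start: complete graph
      for (k in 0:(p - 2)) {                           # size of conditioning set
        for (i in 1:p) for (j in 1:p) {
          if (i == j || !adj[i, j]) next
          nbrs <- setdiff(which(adj[i, ]), j)          # candidates: adj(Xi) \ {Xj}
          if (length(nbrs) < k) next
          sets <- if (k == 0) list(integer(0)) else combn(nbrs, k, simplify = FALSE)
          for (S in sets) {
            if (ci.test(i, j, S)) {                    # Xi ⊥⊥ Xj|S accepted
              adj[i, j] <- adj[j, i] <- FALSE          # delete the edge
              break
            }
          }
        }
      }
      adj                                              # adjacency matrix of skeleton
    }

Since the double loop visits both ordered pairs (i, j) and (j, i), subsets of adj(Xi) and of adj(Xj) are both searched; restricting to these (rather than all remaining variables) is what makes the algorithm fast on sparse graphs.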

PC algorithm - oracle version

• Assume faithfulness and an ‘oracle’ that tells us whether or not X ⊥⊥ Y|S for any triple (X, Y, S).

• Then a CPDAG can be estimated by the PC algorithm of Peter Spirtes and Clark Glymour (2000):
  • Determine the skeleton
  • Determine the v-structures
  • Direct as many of the remaining edges as possible

• Fast implementation in the R package pcalg (Kalisch et al., 2012); see the example below

• Consistent in sparse high-dimensional settings (Kalisch and Bühlmann, 2007)
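
A minimal pcalg sketch, with simulated data and illustrative parameter choices (not from the slides):

    library(pcalg)

    set.seed(123)
    p <- 8; n <- 5000
    g <- randomDAG(p, prob = 0.3)  # ground-truth DAG
    d <- rmvDAG(n, g)              # n Gaussian samples generated from it

    ## gaussCItest is the Fisher-z partial correlation test (next slide).
    suffStat <- list(C = cor(d), n = n)
    fit <- pc(suffStat, indepTest = gaussCItest, alpha = 0.01, p = p)
    fit                            # the estimated CPDAG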

PC algorithm - sample version

• Instead of the oracle, we perform conditional independence tests

• In the multivariate Gaussian setting, this is equivalent to testing for zero partial correlation: H0: ρij|S = 0 versus Ha: ρij|S ≠ 0.

• Partial correlations can be computed via regression, inversion of parts of the covariance matrix, or a recursive formula

• For testing, it is helpful to use Fisher’s z-transform (see the sketch after this list):

      zij|S = (1/2) log( (1 + ρij|S) / (1 − ρij|S) ).

  Under H0, √(n − |S| − 3) · zij|S ∼ N(0, 1).

• Hence, we reject H0 versus Ha if √(n − |S| − 3) · |zij|S| > Φ⁻¹(1 − α/2)

• The significance level α serves as a tuning parameter for the PC algorithm

• We perform many, many tests during the algorithm. Can we still obtain consistency results?
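
A minimal sketch of this test in base R, computing the partial correlation via regression residuals (the function name zStat.test and the accept/reject convention are illustrative):

    ## Test H0: rho_{ij|S} = 0 from a data matrix d.
    ## Returns TRUE iff H0 is NOT rejected, i.e. Xi ⊥⊥ Xj|S is accepted.
    zStat.test <- function(d, i, j, S, alpha = 0.01) {
      n <- nrow(d)
      r <- if (length(S) == 0) cor(d[, i], d[, j]) else
        cor(resid(lm(d[, i] ~ d[, S])), resid(lm(d[, j] ~ d[, S])))
      z <- 0.5 * log((1 + r) / (1 - r))          # Fisher's z-transform
      sqrt(n - length(S) - 3) * abs(z) <= qnorm(1 - alpha / 2)
    }

Wrapped as ci.test <- function(i, j, S) zStat.test(d, i, j, S), this is exactly the callback that the skeleton sketch above expects.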

High-dimensional asymptotic framework

• Since typical datasets in biology contain many more variables than observations, we consider a framework in which the graph is allowed to grow with the sample size n:
  • DAG: Gn
  • Number of variables: pn
  • Variables: Xn1, . . . , Xnpn
  • Distribution: Pn
  • Partial correlations: ρnij|S

Assumptions

• Pn is multivariate Gaussian and faithful to the true unknown causal DAG Gn

• High-dimensionality and sparseness:
  • pn = O(n^a), for some 0 ≤ a < ∞
  • maximum number of neighbors in Gn is qn = O(n^(1−b)), for some 0 < b ≤ 1

• Regularity conditions on partial correlations:
  • sup_{n, i≠j, S} |ρnij|S| ≤ M for some M < 1, where S ⊆ {Xn1, . . . , Xnpn} \ {Xni, Xnj} with |S| ≤ qn
  • inf_{i,j,S} { |ρnij|S| : ρnij|S ≠ 0 } ≥ cn, where S ⊆ {Xn1, . . . , Xnpn} \ {Xni, Xnj} with |S| ≤ qn and cn⁻¹ = O(n^d) for some 0 < d < b/2

High-dimensional consistency

• Denote the CPDAG estimated at significance level αn by Ĉn(αn) and the true CPDAG by Cn.

• Then there exists a sequence αn → 0 such that

      P( Ĉn(αn) = Cn ) = 1 − O(exp(−C n^(1−2d))),

  for some constant C > 0 and d as in the assumptions (Kalisch & Bühlmann, 2007)

Sketch of proof

• Sketch of the proof:
  • Enij|S is the event of a type I/II error when testing ρnij|S = 0
  • Let PCqn denote the PC algorithm in which we test conditional independencies up to level qn
  • Choose αn such that P(Enij|S) = O(n exp(−C (n − qn) cn²)) if |S| ≤ qn
  • Then

        P(error occurs in PCqn(αn))
          ≤ P( ∪_{i,j,S: |S|≤qn} Enij|S ) ≤ Σ_{i,j,S: |S|≤qn} P(Enij|S)
          ≤ O(pn^(qn+2)) · O(n exp(−C (n − qn) cn²))
          = O(exp( qn log(pn) + log(n) − C (n − qn) cn² ))
          = O(exp( a n^(1−b) log(n) + log(n) − C n^(1−2d) + C n^(1−2d−b) )) → 0

Summary: learning DAGs from observational data

• Markov equivalence class

• Faithfulness

• PC algorithm

• Consistency in high-dimensional settings