Learning DAGs from observational data
General overview
• Introduction
• DAGs and conditional independence
• DAGs and causal effects
• Learning DAGs from observational data
• IDA algorithm
• Further problems
What can we do when the DAG is unknown?
• Knowing the DAG is unrealistic in high-dimensional settings. So we assume that the data come from an unknown DAG.
• A DAG encodes conditional independence relationships. So given all conditional independence relationships in the observational distribution, can we learn the DAG?
• Almost... several DAGs can encode the same conditional independence relationships. They are Markov equivalent.
• Example: the independence relationships X1 ⊥⊥ X3 and X1 ⊥⊥ X3 | X2 in four DAGs:

  DAG             X1 ⊥⊥ X3   X1 ⊥⊥ X3 | X2
  X1 → X2 → X3    false      true            (no v-structure)
  X1 ← X2 ← X3    false      true            (no v-structure)
  X1 ← X2 → X3    false      true            (no v-structure)
  X1 → X2 ← X3    true       false           (v-structure)
• A v-structure is a triple i → j ← k where i and k are not adjacent
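To see this pattern numerically, here is a minimal simulation sketch (Python; not part of the slides, and the coefficients and variable names are illustrative assumptions): linear-Gaussian data generated from a chain and from a collider, checking the correlations that correspond to X1 ⊥⊥ X3 and X1 ⊥⊥ X3 | X2.

```python
# Sketch: compare the (partial) correlation pattern of a chain vs. a collider.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def partial_corr_given_x2(x1, x2, x3):
    """Correlation of the residuals of X1 and X3 after regressing out X2."""
    r1 = x1 - np.polyval(np.polyfit(x2, x1, 1), x2)
    r3 = x3 - np.polyval(np.polyfit(x2, x3, 1), x2)
    return np.corrcoef(r1, r3)[0, 1]

# Chain X1 -> X2 -> X3 (no v-structure)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
x3 = x2 + rng.normal(size=n)
print("chain:    corr =", round(np.corrcoef(x1, x3)[0, 1], 3),
      " pcorr|X2 =", round(partial_corr_given_x2(x1, x2, x3), 3))

# Collider X1 -> X2 <- X3 (v-structure)
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
x2 = x1 + x3 + rng.normal(size=n)
print("collider: corr =", round(np.corrcoef(x1, x3)[0, 1], 3),
      " pcorr|X2 =", round(partial_corr_given_x2(x1, x2, x3), 3))
```

For the chain the marginal correlation is clearly nonzero and the partial correlation given X2 vanishes; for the collider the pattern is reversed, as in the table above.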
Markov equivalence class
• All DAGs in a Markov equivalence class have the same skeleton and the same v-structures (Verma and Pearl, 1990)
• They can be uniquely represented by a CPDAG:
  • edge between X and Y iff X and Y are d-connected given S for all subsets S of the remaining variables (edges are stronger than in CIGs/Gaussian Graphical Models)
  • X → Y iff X → Y in all DAGs in the equivalence class (direct causal effect)
  • X – Y iff there is a DAG in the equivalence class with X → Y and one with X ← Y (unidentifiable orientations)
• Example:
  [Figure: a CPDAG and the four DAGs (DAG 1 to DAG 4) in its Markov equivalence class]
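The Verma and Pearl (1990) criterion is easy to check directly. Below is a small Python sketch (helper names and the edge-list format are illustrative assumptions) that tests Markov equivalence of two DAGs by comparing skeletons and v-structures.

```python
# Sketch: Markov equivalence via same skeleton + same v-structures.
from itertools import combinations

def skeleton(edges):
    """Undirected edges {i, j} of a DAG given as directed pairs (i, j)."""
    return {frozenset(e) for e in edges}

def v_structures(edges, nodes):
    """Normalized triples (i, j, k) with i -> j <- k and i, k not adjacent."""
    parents = {v: {i for i, j in edges if j == v} for v in nodes}
    adj = skeleton(edges)
    return {(min(i, k), j, max(i, k))
            for j in nodes
            for i, k in combinations(parents[j], 2)
            if frozenset((i, k)) not in adj}

def markov_equivalent(e1, e2, nodes):
    return (skeleton(e1) == skeleton(e2)
            and v_structures(e1, nodes) == v_structures(e2, nodes))

nodes = {"X1", "X2", "X3"}
chain    = [("X1", "X2"), ("X2", "X3")]   # X1 -> X2 -> X3
fork     = [("X2", "X1"), ("X2", "X3")]   # X1 <- X2 -> X3
collider = [("X1", "X2"), ("X3", "X2")]   # X1 -> X2 <- X3
print(markov_equivalent(chain, fork, nodes))      # True
print(markov_equivalent(chain, collider, nodes))  # False
```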
Causal structure learning
• Learning (Markov equivalence classes of) DAGs is challenging. Main methods:
  • Score-based methods: e.g. Greedy Equivalence Search (Chickering, 2002)
  • Constraint-based methods: e.g. PC algorithm (Spirtes et al., 2000)
    • Fast
    • Consistent for high-dimensional sparse graphs (Kalisch & Bühlmann, 2007)
  • Restricted structural equation models: e.g. LiNGAM (Shimizu et al., 2006; Tübingen group)
    • DAG is identifiable!
Faithfulness
• Constraint-based methods require a faithfulness assumption: the conditional independencies in the distribution exactly equal the ones encoded in the DAG via d-separation
• Example of a distribution that is not faithful to its generating DAG:
  X1 ← ε1
  X2 ← X1 + ε2
  X3 ← X1 − X2 + ε3

  [Figure: DAG with edges X1 → X2 (coefficient 1), X1 → X3 (coefficient 1), X2 → X3 (coefficient −1)]
• X1 and X3 are not d-separated by the empty set
• But: X1 = ε1, X2 = ε1 + ε2, X3 = ε1 − (ε1 + ε2) + ε3 = −ε2 + ε3. Hence, X1 and X3 are independent.
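A quick numerical check of this cancellation (a Python sketch using the coefficients from the slide; the seed and sample size are arbitrary):

```python
# Sketch: the path effects X1 -> X3 and X1 -> X2 -> X3 cancel exactly,
# so X1 and X3 are marginally independent despite not being d-separated.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
eps1, eps2, eps3 = rng.normal(size=(3, n))

x1 = eps1
x2 = x1 + eps2          # X2 <- X1 + eps2
x3 = x1 - x2 + eps3     # X3 <- X1 - X2 + eps3  =  -eps2 + eps3

print(np.corrcoef(x1, x3)[0, 1])  # approximately 0
```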
Skeleton of a DAG
• Under the faithfulness assumption:
  • There is an edge between Xi and Xj in the DAG if and only if Xi and Xj are dependent given every subset of the remaining variables
  • This means that the skeleton of a DAG is determined uniquely by conditional independence relationships
  • But the directions of the edges are generally not uniquely determined
PC algorithm
• Assuming faithfulness, a CPDAG can be estimated by the PC algorithm of Peter Spirtes and Clark Glymour (2000):
  • Determine the skeleton (a sketch of this phase follows this slide)
    • No edge between Xi and Xj
      ⇐⇒ Xi ⊥⊥ Xj | S for some subset S of the remaining variables
      ⇐⇒ Xi ⊥⊥ Xj | S′ for some subset S′ of adj(Xi) or of adj(Xj)
    • Start with the complete graph
    • For k = 0, 1, ...: consider all pairs of adjacent vertices (Xi, Xj), and remove the edge if they are conditionally independent given some subset of size k of adj(Xi) or of adj(Xj)
  • Determine the v-structures
  • Direct as many of the remaining edges as possible
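A minimal Python sketch of the skeleton phase just described (the function name pc_skeleton and the ci_test callback are illustrative assumptions, not pcalg's API; ci_test(i, j, S) should return True when Xi ⊥⊥ Xj | S is accepted):

```python
# Sketch of the PC skeleton phase: start complete, remove edges using
# conditioning sets of growing size k drawn from current adjacency sets.
from itertools import combinations

def pc_skeleton(nodes, ci_test):
    adj = {v: set(nodes) - {v} for v in nodes}   # complete graph
    sepset = {}                                  # separating sets found
    k = 0
    # Stop once no vertex has enough neighbors to form a size-k set.
    while any(len(adj[i] - {j}) >= k for i in nodes for j in adj[i]):
        for i in nodes:
            for j in list(adj[i]):
                if j not in adj[i]:
                    continue  # edge already removed this sweep
                # Test all size-k subsets of adj(i) \ {j}; the symmetric
                # pass over (j, i) covers subsets of adj(j) \ {i}.
                for S in combinations(sorted(adj[i] - {j}), k):
                    if ci_test(i, j, set(S)):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[frozenset((i, j))] = set(S)
                        break
        k += 1
    return adj, sepset
```

The separating sets recorded in sepset are what the subsequent v-structure step uses: orient Xi → Xj ← Xk for nonadjacent Xi, Xk whenever Xj is not in their separating set.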
PC algorithm - oracle version
• Assume faithfulness and an ‘oracle’ that tells us whether or not X ⊥⊥ Y | S for any triple (X, Y, S).
• Then a CPDAG can be estimated by the PC algorithm of Peter Spirtes and Clark Glymour (2000):
  • Determine the skeleton
  • Determine the v-structures
  • Direct as many of the remaining edges as possible
• Fast implementation in the R-package pcalg (Kalisch et al., 2012)
• Consistent in sparse high-dimensional settings (Kalisch and Bühlmann, 2007)
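When the true DAG is known, the ‘oracle’ can be realized as a d-separation check. Below is a self-contained Python sketch (illustrative function name and edge-list format; not pcalg's implementation) using the standard moralization criterion: x and y are d-separated by S iff they are separated by S in the moralized ancestral graph of {x, y} ∪ S.

```python
# Sketch: a d-separation oracle for a known DAG via moralization.
from itertools import combinations

def d_separated(edges, x, y, s):
    parents = {}
    for a, b in edges:                       # edge a -> b
        parents.setdefault(b, set()).add(a)
        parents.setdefault(a, set())
    # 1. Restrict to the ancestors of {x, y} and S.
    anc, stack = set(), [x, y, *s]
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moralize: marry co-parents, drop edge directions.
    und = {v: set() for v in anc}
    for b in anc:
        ps = parents.get(b, set())
        for a in ps:
            und[a].add(b); und[b].add(a)
        for a, c in combinations(ps, 2):
            und[a].add(c); und[c].add(a)
    # 3. Delete S and check whether x and y are still connected.
    seen, stack = set(s), [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False                     # connected => d-connected
        if v in seen:
            continue
        seen.add(v)
        stack.extend(und[v])
    return True

# e.g. the v-structure X1 -> X2 <- X3:
# d_separated([("X1","X2"), ("X3","X2")], "X1", "X3", set())   -> True
# d_separated([("X1","X2"), ("X3","X2")], "X1", "X3", {"X2"})  -> False
```

Passing this as the ci_test callback to the pc_skeleton sketch above gives an oracle version of the skeleton phase.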
PC algorithm - sample version
• Instead of the oracle, we perform conditional independence tests
• In the multivariate Gaussian setting, this is equivalent to testing for zero partial correlation: H0: ρij|S = 0 versus Ha: ρij|S ≠ 0.
• Partial correlations can be computed via regression, inversion of parts of the covariance matrix, or a recursive formula
• For testing, it is helpful to use Fisher's Z-transform:

    zij|S = (1/2) log( (1 + ρ̂ij|S) / (1 − ρ̂ij|S) ).

  Under H0, √(n − |S| − 3) · zij|S ∼ N(0, 1).

• Hence, we reject H0 versus Ha if √(n − |S| − 3) · |zij|S| > Φ⁻¹(1 − α/2)
• The significance level α serves as a tuning parameter for the PC algorithm
• We perform very many tests during the algorithm. Can we obtain consistency results?
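A Python sketch of this Gaussian CI test (illustrative function names; the partial correlation is computed by inverting the relevant part of the covariance matrix, one of the options listed above):

```python
# Sketch: test H0: rho_{ij|S} = 0 via Fisher's Z-transform.
import numpy as np
from scipy.stats import norm

def partial_corr(data, i, j, S):
    """rho_{ij|S} from the precision matrix of the columns (i, j, *S)."""
    idx = [i, j, *S]
    prec = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

def gauss_ci_test(data, i, j, S, alpha=0.01):
    """Return True iff H0: rho_{ij|S} = 0 is NOT rejected at level alpha."""
    n = data.shape[0]
    r = partial_corr(data, i, j, S)
    z = 0.5 * np.log((1 + r) / (1 - r))        # Fisher's Z-transform
    stat = np.sqrt(n - len(S) - 3) * abs(z)    # ~ N(0,1) under H0
    return stat <= norm.ppf(1 - alpha / 2)
```

With columns of a data matrix as nodes, this plugs into the pc_skeleton sketch earlier, e.g. via `lambda i, j, S: gauss_ci_test(data, i, j, sorted(S))`, with α playing its role as the tuning parameter.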
High-dimensional asymptotic framework
• Since typical datasets in biology contain many more variables than observations, we consider a framework in which the graph is allowed to grow with the sample size n:
  • DAG: Gn
  • Number of variables: pn
  • Variables: Xn1, ..., Xnpn
  • Distribution: Pn
  • Partial correlations: ρn,ij|S
Assumptions
• Pn is multivariate Gaussian and faithful to the true unknown causal DAG Gn
• High-dimensionality and sparseness:
  • pn = O(n^a), for some 0 ≤ a < ∞
  • Maximum number of neighbors in Gn is qn = O(n^(1−b)), for some 0 < b ≤ 1
• Regularity conditions on partial correlations:
  • sup_{n, i≠j, S} |ρn,ij|S| ≤ M for some M < 1, where S ⊆ {Xn1, ..., Xnpn} \ {Xni, Xnj} with |S| ≤ qn
  • inf_{i,j,S} {|ρn,ij|S| : ρn,ij|S ≠ 0} ≥ cn, where S ⊆ {Xn1, ..., Xnpn} \ {Xni, Xnj} with |S| ≤ qn and cn⁻¹ = O(n^d) for some 0 < d < b/2
High dimensional consistency
• Denote the estimated CPDAG by Ĉn(αn) and the true CPDAG by Cn.
• Then there exists a sequence αn → 0 such that

    P(Ĉn(αn) = Cn) = 1 − O(exp(−C n^(1−2d))),

  for some C > 0 and d as in the assumptions (Kalisch & Bühlmann, 2007)
Sketch of proof
• Sketch of the proof:
  • En,ij|S is the event of a type I/II error when testing ρn,ij|S = 0
  • Let PC_qn denote the PC algorithm where we test conditional independencies up to level qn
  • Choose αn s.t. P(En,ij|S) = O(n exp(−C(n − qn) cn^2)) if |S| ≤ qn
  • Then

      P(error occurs in PC_qn(αn))
        ≤ P(∪_{i,j,S: |S| ≤ qn} En,ij|S) ≤ Σ_{i,j,S: |S| ≤ qn} P(En,ij|S)
        ≤ O(pn^(qn+2)) · O(n exp(−C(n − qn) cn^2))
        = O(exp(qn log(pn) + log(n) − C(n − qn) cn^2))
        = O(exp(n^(1−b) a log(n) + log(n) − C n^(1−2d) + C n^(1−2d−b))) → 0
Summary: learning DAGs from observational data
• Markov equivalence class
• Faithfulness
• PC algorithm
• Consistency in high-dimensional settings