An Introduction to Causal Modeling and Discovery Using Graphical Models
Greg Cooper
University of Pittsburgh
Overview
• Introduction
• Representation
• Inference
• Learning
• Evaluation
What Is Causality?
• Much consideration in philosophy
• I will treat it as a primitive
• Roughly, if we manipulate something and something else changes, then the former causally influences the latter.
Why Is Causation Important?
• Causal issues arise in most fields, including medicine, business, law, economics, and the sciences
• An intelligent agent is continually considering what to do next in order to change the world (including the agent's own mind). That is a causal question.
Representing Causation Using Causal Bayesian Networks
• A causal Bayesian network (CBN) represents some entity (e.g., a patient) that we want to model causally
• Features of the entity are represented by variables/nodes in the CBN
• Direct causation is represented by arcs
An Example of a Causal Bayesian Network Structure
[Figure: a DAG in which History of Smoking (HS) points to Chronic Bronchitis (CB) and Lung Cancer (LC), which in turn point to Fatigue (F) and Weight Loss (WL)]
An Example of the Accompanying Causal Bayesian Network Parameters
P(HS = no) = 0.80, P(HS = yes) = 0.20
P(CB = absent | HS = no) = 0.95, P(CB = present | HS = no) = 0.05
P(CB = absent | HS = yes) = 0.75, P(CB = present | HS = yes) = 0.25
P(LC = absent | HS = no) = 0.99995, P(LC = present | HS = no) = 0.00005
P(LC = absent | HS = yes) = 0.997, P(LC = present | HS = yes) = 0.003
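To make the representation concrete, here is a minimal Python sketch (not from the talk) that encodes the slide's parameters as plain dictionaries; each conditional probability table (CPT) maps a parent value to a distribution over the child's values.

```python
# The slide's CBN parameters as nested dicts: CPT[parent value] -> {value: prob}.
p_hs = {"no": 0.80, "yes": 0.20}                      # P(HS)

p_cb_given_hs = {                                     # P(CB | HS)
    "no":  {"absent": 0.95, "present": 0.05},
    "yes": {"absent": 0.75, "present": 0.25},
}

p_lc_given_hs = {                                     # P(LC | HS)
    "no":  {"absent": 0.99995, "present": 0.00005},
    "yes": {"absent": 0.997, "present": 0.003},
}

# Sanity check: every distribution sums to 1.
assert abs(sum(p_hs.values()) - 1.0) < 1e-12
for cpt in (p_cb_given_hs, p_lc_given_hs):
    for dist in cpt.values():
        assert abs(sum(dist.values()) - 1.0) < 1e-12
```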
Causal Markov Condition
• A node is independent of its non-effects given just its direct causes.
• This is the key representational property of causal Bayesian networks.
• Special case: A node is independent of its distant causes given just its direct causes.
• General notion: Causality is local
Causal Modeling Framework
• An underlying process generates entities that share the same causal network structure. The entities may have different parameters (probabilities).
• Each entity independently samples the joint distribution defined by its CBN to generate values (data) for each variable in the CBN model (a sampling sketch in Python follows the figure below)
[Figure: an entity generator produces existing entities 1, 2, 3, ..., each with its own CBN over HS, LC, and WL; each entity samples its feature values, yielding records such as (no, absent, absent), (yes, present, present), and (yes, absent, absent)]
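A hedged sketch of the entity-generation step, in Python: each entity is produced by ancestral (forward) sampling from its CBN, roots first, then children given their sampled parents. It reuses the CPT dicts sketched earlier, so it covers only the HS/CB/LC fragment whose parameters appear on the slides.

```python
import random

def sample(dist):
    """Draw one value from a {value: probability} dict."""
    r, cum = random.random(), 0.0
    for value, p in dist.items():
        cum += p
        if r <= cum:
            return value
    return value  # guard against floating-point round-off

def sample_entity():
    hs = sample(p_hs)                  # root node first
    cb = sample(p_cb_given_hs[hs])     # children given their sampled parents
    lc = sample(p_lc_given_hs[hs])
    return {"HS": hs, "CB": cb, "LC": lc}

data = [sample_entity() for _ in range(5)]
# e.g., [{'HS': 'no', 'CB': 'absent', 'LC': 'absent'}, ...]
```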
Discovering the Average Causal Bayesian Network
[Figure: a single CBN over HSavg, LCavg, and WLavg that averages over the entity-specific networks]
Some Key Types of Causal Relationships
[Figure: example graphs over HS, CB, LC, F, and WL illustrating direct causation (e.g., HS → LC), indirect causation (e.g., HS → WL through an intermediate variable), confounding (a common cause of two variables), and sampling bias (conditioning on a variable Sampled = true)]
Inference Using a Single CBN When Given Evidence in the Form of Observations
[Figure: the HS/CB/LC/F/WL network]
P(F | CB = present, WL = present, CBN1)
Inference
• The Markov Condition implies the following equation:

P(X1, X2, ..., Xn) = ∏_{i=1..n} P(Xi | DirectCauses(Xi))

• The above equation specifies the full joint probability distribution over the model variables.
• From the joint distribution we can derive any conditional probability of interest.
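A minimal sketch of the factorization in Python: the probability of a full assignment is the product, over nodes, of each node's CPT entry given the values of its direct causes. The tuple-keyed CPT layout is an assumption for illustration, reusing the dicts sketched earlier.

```python
# parents: {var: [parent vars]}
# cpts: {var: {tuple of parent values: {value: prob}}}
parents = {"HS": [], "CB": ["HS"], "LC": ["HS"]}
cpts = {
    "HS": {(): p_hs},
    "CB": {("no",): p_cb_given_hs["no"], ("yes",): p_cb_given_hs["yes"]},
    "LC": {("no",): p_lc_given_hs["no"], ("yes",): p_lc_given_hs["yes"]},
}

def joint_probability(assignment, cpts, parents):
    """P(assignment) = product over vars of P(var | its direct causes)."""
    p = 1.0
    for var, value in assignment.items():
        parent_vals = tuple(assignment[pa] for pa in parents[var])
        p *= cpts[var][parent_vals][value]
    return p

joint_probability({"HS": "yes", "CB": "present", "LC": "absent"}, cpts, parents)
# = 0.20 * 0.25 * 0.997 ≈ 0.0499
```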
Inference Algorithms
• In the worst case, the brute-force algorithm (sketched below) is exponential time in the number of variables in the model
• Numerous exact inference algorithms have been developed that exploit independences among the variables in the causal Bayesian network.
• However, in the worst case, these algorithms are exponential time.
• Inference in causal Bayesian networks is NP-hard (Cooper, AIJ, 1990).
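For concreteness, a sketch of the brute-force algorithm mentioned in the first bullet: enumerate every joint assignment, keep those consistent with the evidence, and normalize. Its running time grows exponentially with the number of variables, which is what the specialized exact algorithms try to avoid.

```python
from itertools import product

def query(target, evidence, domains, cpts, parents):
    """P(target | evidence) by full enumeration (assumes the evidence has
    nonzero probability). domains: {var: [values]}; evidence: {var: value}."""
    variables = list(domains)
    scores = {v: 0.0 for v in domains[target]}
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        if any(assignment[v] != val for v, val in evidence.items()):
            continue  # inconsistent with the evidence
        scores[assignment[target]] += joint_probability(assignment, cpts, parents)
    total = sum(scores.values())
    return {val: s / total for val, s in scores.items()}

domains = {"HS": ["no", "yes"], "CB": ["absent", "present"],
           "LC": ["absent", "present"]}
query("LC", {"CB": "present"}, domains, cpts, parents)
```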
Inference Using a Single CBN When Given Evidence in the Form of Manipulations
P(F | MCB = present, CBN1)
• Let MCB be a new variable that can have the same values as CB (present, absent) plus the value observe.
• Add an arc from MCB to CB.
• Define the probability distribution of CB given its parents.
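A hedged sketch of that construction in Python, for the deterministic-manipulation case shown on a later slide: when MCB = "observe", CB keeps its usual distribution given HS; when MCB names a value, CB takes that value with probability 1.

```python
def p_cb_given_parents(hs, mcb):
    """P(CB | HS, MCB) after adding the manipulation variable MCB."""
    if mcb == "observe":
        return p_cb_given_hs[hs]          # passive observation: unchanged CPT
    # Deterministic manipulation: CB is forced to the value named by MCB.
    return {"present": 1.0 if mcb == "present" else 0.0,
            "absent":  1.0 if mcb == "absent" else 0.0}
```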
Inference Using a Single CBN When Given Evidence in the Form of Manipulations
[Figure: the HS/CB/LC/F/WL network with a new node MCB and an arc MCB → CB]
P(F | MCB = present, CBN1)
A Deterministic Manipulation
[Figure: the HS/CB/LC/F/WL network with MCB → CB, where the manipulation deterministically sets CB]
P(F | MCB = present, CBN1)
Inference Using a Single CBN When Given Evidence in the Form of Observations and Manipulations
[Figure: the HS/CB/LC/F/WL network with MCB → CB]
P(F | MCB = present, WL = present, CBN1)
Inference Using Multiple CBNs: Model Averaging

P(X | manipulate(Y'), observe(Z')) = Σi P(X | manipulate(Y'), observe(Z'), CBNi) · P(CBNi | data, K)
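A minimal sketch of the averaging step: each candidate CBN answers the query, and the answers are weighted by the models' posterior probabilities. The single-model query routine is passed in as a function (e.g., the enumeration sketch above), since the slide does not fix one.

```python
def model_averaged_query(cbns, posteriors, single_model_query):
    """cbns: candidate models; posteriors: P(CBN_i | data, K), summing to 1;
    single_model_query: maps a CBN to P(X | manipulations, observations, CBN_i)."""
    return sum(single_model_query(cbn) * post
               for cbn, post in zip(cbns, posteriors))
```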
Some Key Reasons for Learning CBNs
• Scientific discovery among measured variables
  o Example of general: What are the causal relationships among HS, LC, CB, F, and WL?
  o Example of focused: What are the causes of LC from among HS, CB, F, and WL?
• Scientific discovery of hidden processes
• Prediction
  o Example: The effect of not smoking on contracting lung cancer
Major Methods for Learning CBNs from Data
• Constraint-based methods
  o Use tests of independence to find patterns of relationships among variables that support causal relationships
  o Relatively efficient in discovery of causal models with hidden variables
  o See talk by Frederick Eberhardt this morning
• Score-based methods
  o Bayesian scoring: allows informative prior probabilities of causal structure and parameters
  o Non-Bayesian scoring: does not allow informative prior probabilities
Learning CBNs from Observational Data: A Bayesian Formulation

P(X → Y | D, K) = Σ_{i : Si contains X → Y} P(Si | D, K)

where D is observational data, Si is the structure of CBNi, and K is background knowledge and belief.
Learning CBNs from Observational Data When There Are No Hidden Variables

P(Si | D, K) = [ P(Si | K) ∫ P(D | Si, θi, K) P(θi | Si, K) dθi ] / [ Σj P(Sj | K) ∫ P(D | Sj, θj, K) P(θj | Sj, K) dθj ]

where θi are the parameters associated with Si and the sum is over all CBNs for which P(Sj | K) > 0.
The BD Marginal Likelihood
The previous integral has the following closed-form solution when we assume Dirichlet priors (αijk and αij), multinomial likelihoods (Nijk and Nij denote counts), parameter independence, and parameter modularity:

P(D | S, K) = ∏_{i=1..n} ∏_{j=1..qi} [ Γ(αij) / Γ(αij + Nij) ] ∏_{k=1..ri} [ Γ(αijk + Nijk) / Γ(αijk) ]
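The closed form is usually computed in log space with log-gamma to avoid overflow. A sketch, assuming a uniform hyperparameter αijk = α (the K2 prior when α = 1), with counts indexed as Nijk = counts[i][j][k]:

```python
from math import lgamma

def log_bd_score(counts, alpha=1.0):
    """Log BD marginal likelihood with uniform Dirichlet hyperparameters."""
    log_p = 0.0
    for node_counts in counts:            # product over nodes i
        for n_ijk in node_counts:         # product over parent configurations j
            r_i = len(n_ijk)              # number of values of node i
            alpha_ij, n_ij = alpha * r_i, sum(n_ijk)
            log_p += lgamma(alpha_ij) - lgamma(alpha_ij + n_ij)
            for n in n_ijk:               # product over node values k
                log_p += lgamma(alpha + n) - lgamma(alpha)
    return log_p

# One parentless binary node observed 3 times "true" and 7 times "false":
log_bd_score([[[3, 7]]])
```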
Searching for Network Structures
• Greedy search often used (a sketch follows this list)
• Hybrid methods have been explored that combine constraints and scoring
• Some algorithms guarantee locating the generating model in the large-sample limit (assuming the Markov and Faithfulness conditions), for example the GES algorithm (Chickering, JMLR, 2002)
• The ability to approximate the generating network is often quite good
• An excellent discussion and evaluation of several state-of-the-art methods, including a relatively new method (Max-Min Hill Climbing), is in: Tsamardinos, Brown, Aliferis, Machine Learning, 2006.
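A hedged sketch of greedy score-based search: repeatedly take the single-edge change (add, delete, or reverse) that most improves the score, stopping at a local maximum. Here `score` and `neighbors` are passed in; `neighbors` stands for a legal-move generator that enforces acyclicity, and `score` could be the BD score above.

```python
def greedy_search(initial_structure, score, neighbors):
    """Hill-climb over structures until no neighbor improves the score."""
    current, current_score = initial_structure, score(initial_structure)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current):   # single-edge modifications
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:
            return current                     # local maximum reached
        current, current_score = best, best_score
```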
The Complexity of Search
Given a complete dataset and no hidden variables, locating the Bayesian network structure that has the highest posterior probability is NP-hard (Chickering, AIS, 1996; Chickering, et al., JMLR, 2004).
We Can Learn More from Observational and Experimental Data Together than from Either One Alone
[Figure: C → E together with a hidden variable H that is a common cause of C and E]
We cannot learn the above causal structure from observational or experimental data alone. We need both.
Learning CBNs from Observational Data When There Are Hidden Variables

P(Si | D, K) = [ P(Si | K) Σ_{Hi} ∫ P(D, Hi | Si, θi, K) P(θi | Si, K) dθi ] / [ Σj P(Sj | K) Σ_{Hj} ∫ P(D, Hj | Sj, θj, K) P(θj | Sj, K) dθj ]

where Hi (Hj) are the hidden variables in Si (Sj) and the sum in the numerator (denominator) is taken over all values of Hi (Hj).
Learning CBNs from Observational and Experimental Data: A Bayesian Formulation
• For each model variable Xi that is experimentally manipulated in at least one case, introduce a potential parent MXi of Xi.
• Xi can have parents as well from among the other {X1, ..., Xi-1, Xi+1, ..., Xn} domain variables in the model.
• Priors on the distribution of Xi will include conditioning on MXi, when it is a parent of Xi, as well as conditioning on the other parents of Xi.
• Define MXi to have the same values vi1, vi2, ..., viq as Xi, plus a value o (for observe).
  o When MXi has value vij in a given case, this represents that the experimenter intended to manipulate Xi to have value vij in that case.
  o When MXi has value observe in a given case, this represents that no attempt was made by the experimenter to manipulate Xi; rather, Xi was merely observed to have the value recorded for it.
• With the above variable additions in place, use the previous Bayesian methods for causal modeling from observational data.
An Example Database Containing Observations and Manipulations

HS   MCB   CB   LC   F   WL
T    obs   T    F    T   T
F    F     F    T    T   F
F    F     T    T    F   F
T    obs   F    F    T   F
Faithfulness Condition
Any independence among variables in the data-generating distribution follows from the Markov Condition applied to the data-generating causal structure.
A simple counterexample:
[Figure: C → E with a hidden common cause H of C and E, where the two paths from C to E cancel]
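The counterexample can be made concrete numerically: in a linear-Gaussian version (my construction, not from the slides), the direct path C → E and the path through the common cause H cancel exactly, so C and E are uncorrelated even though C causes E.

```python
import random

random.seed(0)
n, cov = 200_000, 0.0
for _ in range(n):
    h = random.gauss(0, 1)
    c = h + random.gauss(0, 1)                    # C := H + noise
    e = 1.0 * c - 2.0 * h + random.gauss(0, 1)    # E := C - 2H + noise
    cov += c * e
print(cov / n)  # ≈ 0: Cov(C, E) = Var(C) - 2*Cov(C, H) = 2 - 2 = 0
```

The independence of C and E holds in the distribution but is not implied by the Markov Condition applied to the structure, so the distribution is unfaithful to it.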
Challenges of Bayesian Learning of Causal Networks
• Major challenges:
  o Large search spaces
  o Hidden variables
  o Feedback
  o Assessing parameter and structure priors
  o Modeling complicated distributions
• The remainder of this talk will summarize several methods for dealing with hidden variables, which is arguably the biggest challenge today
  o These examples provide only a small sample of previous research
Learning Belief Networks in the Presence of Missing Values and Hidden Variables (N. Friedman, ICML, 1997)
• Assumes a fixed set of measured and hidden variables
• Uses Expectation Maximization (EM) to "fill in" the values of the hidden variables
• Uses BIC to score causal network structures with the filled-in data. Greedily finds the best structure and then returns to the EM step using this new structure.
• Some subsequent work:
  o Using patterns of induced relationships among the measured variables to suggest where to introduce hidden variables (Elidan, et al., NIPS, 2000)
  o Determining the cardinality of the hidden variables introduced (Elidan & Friedman, UAI, 2001)
A Non-Parametric Bayesian Method for Inferring Hidden Causes (Wood, et al., UAI, 2006)
• Learns hidden causes of measured variables
• Assumes binary variables and noisy-OR interactions (sketched after the figure below)
• Uses MCMC to sample the hidden structures
• Allows in principle an infinite number of hidden variables
• In practice, the optimal number of hidden variables is constrained by the measured data
[Figure: a bipartite graph in which hidden variables point to measured variables]
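For reference, a short sketch of the noisy-OR interaction the method assumes: each active cause independently fails to turn the effect on with probability 1 − pi, and an optional leak term (my addition for illustration) allows the effect to occur with no active cause.

```python
def noisy_or(active_cause_probs, leak=0.0):
    """P(effect = 1) given the activation probabilities of the active causes."""
    p_off = 1.0 - leak
    for p in active_cause_probs:
        p_off *= 1.0 - p          # every active cause must independently fail
    return 1.0 - p_off

noisy_or([0.8, 0.6], leak=0.05)   # = 1 - 0.95*0.2*0.4 = 0.924
```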
Bayesian Learning of Measurement and Structural Models (Silva & Scheines, ICML, 2006)
• Learns the following type of model
• Assumes continuous variables, mixtures of Gaussian distributions, and linear interactions
[Figure: a model in which hidden variables cause each other (the structural model) and cause the measured variables (the measurement model)]
Mixed Ancestral Graphs*
• A MAG(G) is a graphical object that contains only the observed variables, causal arcs, and a new relationship (a bidirected edge, ↔) for representing hidden confounding.
• There exist methods for scoring linear MAGs (Richardson & Spirtes, Ancestral Graph Markov Models, Annals of Statistics, 2002)
[Figure: a latent-variable DAG over SES, SEX, PE, CP, and IQ with latent variables L1 and L2, and the corresponding MAG over the observed variables]
* This slide was adapted from a slide provided by Peter Spirtes.
A Theoretical Study of Y Structures for Causal Discovery (Mani, Spirtes, Cooper, UAI, 2006)
• Learn a Bayesian network structure on the measured variables
• Identify patterns in the structure that suggest causal relationships (a simplified detection sketch follows the figure)
• The "Y" structure shown in green supports that D is an unconfounded cause of F.
[Figure: a DAG over A, B, C, D, E, and F; the Y structure involving D and F, shown in green, has two non-adjacent parents of D and the arc D → F]
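A simplified sketch of scanning a learned DAG for such patterns (the paper's actual conditions are more involved): report each W → X where X has no other parent and W has two non-adjacent parents V1 and V2.

```python
from itertools import combinations

def find_y_structures(parents):
    """parents: {node: set of parents}; returns (v1, v2, w, x) Y structures."""
    found = []
    for x, pa_x in parents.items():
        if len(pa_x) != 1:
            continue
        (w,) = pa_x                               # x's sole parent
        for v1, v2 in combinations(parents.get(w, set()), 2):
            adjacent = (v1 in parents.get(v2, set()) or
                        v2 in parents.get(v1, set()))
            if not adjacent:                      # non-adjacent parents of w
                found.append((v1, v2, w, x))
    return found

# Hypothetical graph V1 -> W <- V2, W -> X:
find_y_structures({"W": {"V1", "V2"}, "X": {"W"}})
# [('V1', 'V2', 'W', 'X')] (order within the pair may vary)
```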
Causal Discovery Using Subsets of Variables
• Search for an estimate M of the Markov blanket of a variable X (e.g., Aliferis, et al., AMIA, 2002)
  o X is independent of the other variables in the generating causal network model, conditioned on the variables in X's Markov blanket (a sketch follows this list)
• Within M, search for patterns among the variables that suggest a causal relationship to X (e.g., Mani, doctoral dissertation, Univ. of Pittsburgh, 2006)
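A small sketch of the Markov blanket of X in a DAG, which consists of X's parents, X's children, and the other parents of those children; the example graph is the HS/CB/LC/F fragment, with the CB → F and LC → F arcs assumed for illustration.

```python
def markov_blanket(x, parents):
    """parents: {node: set of parents}. Returns parents, children, and
    spouses (other parents of x's children) of x."""
    children = {v for v, pa in parents.items() if x in pa}
    spouses = {p for c in children for p in parents[c]} - {x}
    return set(parents.get(x, set())) | children | spouses

parents = {"HS": set(), "CB": {"HS"}, "LC": {"HS"}, "F": {"CB", "LC"}}
markov_blanket("LC", parents)   # {'HS', 'F', 'CB'}
```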
Causal Identifiability
Generally depends upon:
• Markov Condition
• Faithfulness Condition
• Informative structural relationships among the measured variables
Example of the "Y structure":
[Figure: A → C ← B, with C → E]
Evaluation of Causal Discovery
• In evaluating a classifier, the correct answer in any instance is just the value of some variable of interest, which typically is explicitly in the data set. This makes evaluation relatively straightforward.
• In evaluating the output of a causal discovery algorithm, the answer is not in the dataset. In general we need some outside knowledge to confirm that the causal output is correct. This makes evaluation relatively difficult. Thus, causal discovery algorithms have not been thoroughly evaluated.
Methods for Evaluating Causal Discovery Algorithms
• Simulated data
• Real data with expert judgments of causation
• Real data with previously validated causal relationships
• Real data with follow-up experiments
An Example of an Evaluation Using Simulated Data (Mani, poster here)
• Generated 20,000 observational data samples from each of five CBNs that were manually constructed
• Applied the BLCD algorithm, which considers many 4-variable subsets of all the variables and applies Bayesian scoring. It is based on the causal properties of "Y" structures.
• Results (computed as sketched below):
  o Precision: 83%
  o Recall: 27%
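For clarity, a sketch of how the reported precision and recall are computed: the arcs output by the algorithm are compared against the arcs of the generating networks.

```python
def precision_recall(predicted_arcs, true_arcs):
    """Arcs are (cause, effect) pairs; returns (precision, recall)."""
    predicted, true = set(predicted_arcs), set(true_arcs)
    tp = len(predicted & true)                    # correctly discovered arcs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall

precision_recall({("HS", "LC"), ("HS", "F")}, {("HS", "LC"), ("HS", "CB")})
# -> (0.5, 0.5)
```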
An Example of an Evaluation Using Previously Validated Causal Relationships (Yoo, et al., PSB, 2002)
• ILVS is a Bayesian method that considers pairwise relationships among a set of variables
• It works best when given both observational and experimental data
• ILVS was applied to a previously collected DNA microarray dataset on 9 genes that control galactose metabolism in yeast (Ideker, et al., Science, 2001). The causal relationships among the genes have been extensively studied and reported in the literature.
• ILVS predicted 12 of 27 known causal relationships among the genes (44% recall), and of those 12, eight were correct (67% precision)
• Yoo has explored numerous extensions to ILVS
An Example of an Evaluation Using Real Data with Follow-Up Experiments (Sachs, et al., Science, 2005)
• Experimentally manipulated human immune system cells
• Used flow cytometry to measure the effects on 11 proteins and phospholipids on a large number of individual cells
• Used a Bayesian method for causally learning from observational and experimental data
• Derived 17 causal relationships with high probability
  o 15 highly supported by the literature (precision = 15/17 = 88%)
  o The other two were confirmed experimentally by the authors (precision = 17/17 = 100%)
  o Three causal relationships were missed ("recall" = 17/20 = 85%)
A Possible Approach to Combining Causal Discovery and Feature Selection
1. Use prior knowledge and statistical associations to develop overlapping groups of features (variables)
2. Derive causal probabilistic relationships within groups
3. Have the causal groups constrain each other
4. Determine additional groups of features that might constrain causal relationships further
5. Either go to step 2 or step 6
6. Model average within and across groups to derive approximate model-averaged causal relationships

David Danks (2002). Learning the Causal Structure of Overlapping Variable Sets. In S. Lange, K. Satoh, & C.H. Smith (eds.), Discovery Science: Proceedings of the 5th International Conference. Berlin: Springer-Verlag, pp. 178-191.
Some Suggestions for Further Information
Books:
• Glymour, Cooper (eds.), Computation, Causation, and Discovery (MIT Press, 1999)
• Pearl, Causality: Models, Reasoning, and Inference (Cambridge University Press, 2000)
• Spirtes, Glymour, Scheines, Causation, Prediction, and Search (MIT Press, 2001)
• Neapolitan, Learning Bayesian Networks (Prentice Hall, 2003)
Conferences: UAI, ICML, NIPS, AAAI, IJCAI
Journals: JMLR, Machine Learning
Acknowledgement
Thanks to Peter Spirtes for his comments on an outline of this talk.