An Introduction to Causal Modeling and Discovery Using Graphical Models
Greg Cooper
University of Pittsburgh
Overview
• Introduction
• Representation
• Inference
• Learning
• Evaluation
What Is Causality?
• Much consideration in philosophy
• I will treat it as a primitive
• Roughly, if we manipulate something and something else changes, then the former causally influences the latter.
Why Is Causation Important?
• Causal issues arise in most fields, including medicine, business, law, economics, and the sciences
• An intelligent agent is continually considering what to do next in order to change the world (including the agent's own mind). That is a causal question.
Representing Causation Using Causal Bayesian Networks
• A causal Bayesian network (CBN) represents some entity (e.g., a patient) that we want to model causally
• Features of the entity are represented by variables/nodes in the CBN
• Direct causation is represented by arcs
An Example of a Causal Bayesian Network Structure
[Figure: a DAG in which History of Smoking (HS) points to Chronic Bronchitis (CB) and Lung Cancer (LC), which in turn point to Fatigue (F) and Weight Loss (WL)]
An Example of the Accompanying Causal Bayesian Network Parameters
P(HS = no) = 0.80, P(HS = yes) = 0.20
P(CB = absent | HS = no) = 0.95, P(CB = present | HS = no) = 0.05
P(CB = absent | HS = yes) = 0.75, P(CB = present | HS = yes) = 0.25
P(LC = absent | HS = no) = 0.99995, P(LC = present | HS = no) = 0.00005
P(LC = absent | HS = yes) = 0.997, P(LC = present | HS = yes) = 0.003
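To make the representation concrete, here is a minimal Python sketch (not from the talk) that encodes the slide's parameters as plain dictionaries; each conditional probability table (CPT) maps a parent value to a distribution over the child's values.

```python
# The slide's CBN parameters as nested dicts: CPT[parent value] -> {value: prob}.
p_hs = {"no": 0.80, "yes": 0.20}                      # P(HS)

p_cb_given_hs = {                                     # P(CB | HS)
    "no":  {"absent": 0.95, "present": 0.05},
    "yes": {"absent": 0.75, "present": 0.25},
}

p_lc_given_hs = {                                     # P(LC | HS)
    "no":  {"absent": 0.99995, "present": 0.00005},
    "yes": {"absent": 0.997, "present": 0.003},
}

# Sanity check: every distribution sums to 1.
assert abs(sum(p_hs.values()) - 1.0) < 1e-12
for cpt in (p_cb_given_hs, p_lc_given_hs):
    for dist in cpt.values():
        assert abs(sum(dist.values()) - 1.0) < 1e-12
```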
Causal Markov Condition
• A node is independent of its non-effects given just its direct causes.
• This is the key representational property of causal Bayesian networks.
• Special case: A node is independent of its distant causes given just its direct causes.
• General notion: Causality is local
Causal Modeling Framework
• An underlying process generates entities that share the same causal network structure. The entities may have different parameters (probabilities).
• Each entity independently samples the joint distribution defined by its CBN to generate values (data) for each variable in the CBN model (a sampling sketch in Python follows the figure below)
[Figure: an entity generator produces existing entities 1, 2, 3, ..., each with its own CBN over HS, LC, and WL; each entity samples its feature values, yielding records such as (no, absent, absent), (yes, present, present), and (yes, absent, absent)]
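A hedged sketch of the entity-generation step, in Python: each entity is produced by ancestral (forward) sampling from its CBN, roots first, then children given their sampled parents. It reuses the CPT dicts sketched earlier, so it covers only the HS/CB/LC fragment whose parameters appear on the slides.

```python
import random

def sample(dist):
    """Draw one value from a {value: probability} dict."""
    r, cum = random.random(), 0.0
    for value, p in dist.items():
        cum += p
        if r <= cum:
            return value
    return value  # guard against floating-point round-off

def sample_entity():
    hs = sample(p_hs)                  # root node first
    cb = sample(p_cb_given_hs[hs])     # children given their sampled parents
    lc = sample(p_lc_given_hs[hs])
    return {"HS": hs, "CB": cb, "LC": lc}

data = [sample_entity() for _ in range(5)]
# e.g., [{'HS': 'no', 'CB': 'absent', 'LC': 'absent'}, ...]
```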
Discovering the Average Causal Bayesian Network
[Figure: a single CBN over HSavg, LCavg, and WLavg that averages over the entity-specific networks]
Some Key Types of Causal Relationships
[Figure: example graphs over HS, CB, LC, F, and WL illustrating direct causation (e.g., HS → LC), indirect causation (e.g., HS → WL through an intermediate variable), confounding (a common cause of two variables), and sampling bias (conditioning on a variable Sampled = true)]
Inference Using a Single CBN When Given Evidence in the Form of Observations
[Figure: the HS/CB/LC/F/WL network]
P(F | CB = present, WL = present, CBN1)
Inference
• The Markov Condition implies the following equation:

P(X1, X2, ..., Xn) = ∏_{i=1..n} P(Xi | DirectCauses(Xi))

• The above equation specifies the full joint probability distribution over the model variables.
• From the joint distribution we can derive any conditional probability of interest.
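A minimal sketch of the factorization in Python: the probability of a full assignment is the product, over nodes, of each node's CPT entry given the values of its direct causes. The tuple-keyed CPT layout is an assumption for illustration, reusing the dicts sketched earlier.

```python
# parents: {var: [parent vars]}
# cpts: {var: {tuple of parent values: {value: prob}}}
parents = {"HS": [], "CB": ["HS"], "LC": ["HS"]}
cpts = {
    "HS": {(): p_hs},
    "CB": {("no",): p_cb_given_hs["no"], ("yes",): p_cb_given_hs["yes"]},
    "LC": {("no",): p_lc_given_hs["no"], ("yes",): p_lc_given_hs["yes"]},
}

def joint_probability(assignment, cpts, parents):
    """P(assignment) = product over vars of P(var | its direct causes)."""
    p = 1.0
    for var, value in assignment.items():
        parent_vals = tuple(assignment[pa] for pa in parents[var])
        p *= cpts[var][parent_vals][value]
    return p

joint_probability({"HS": "yes", "CB": "present", "LC": "absent"}, cpts, parents)
# = 0.20 * 0.25 * 0.997 ≈ 0.0499
```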
Inference Algorithms
• In the worst case, the brute-force algorithm (sketched below) is exponential time in the number of variables in the model
• Numerous exact inference algorithms have been developed that exploit independences among the variables in the causal Bayesian network.
• However, in the worst case, these algorithms are exponential time.
• Inference in causal Bayesian networks is NP-hard (Cooper, AIJ, 1990).
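For concreteness, a sketch of the brute-force algorithm mentioned in the first bullet: enumerate every joint assignment, keep those consistent with the evidence, and normalize. Its running time grows exponentially with the number of variables, which is what the specialized exact algorithms try to avoid.

```python
from itertools import product

def query(target, evidence, domains, cpts, parents):
    """P(target | evidence) by full enumeration (assumes the evidence has
    nonzero probability). domains: {var: [values]}; evidence: {var: value}."""
    variables = list(domains)
    scores = {v: 0.0 for v in domains[target]}
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        if any(assignment[v] != val for v, val in evidence.items()):
            continue  # inconsistent with the evidence
        scores[assignment[target]] += joint_probability(assignment, cpts, parents)
    total = sum(scores.values())
    return {val: s / total for val, s in scores.items()}

domains = {"HS": ["no", "yes"], "CB": ["absent", "present"],
           "LC": ["absent", "present"]}
query("LC", {"CB": "present"}, domains, cpts, parents)
```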
Inference Using a Single CBN When Given Evidence in the Form of Manipulations
P(F | MCB = present, CBN1)
• Let MCB be a new variable that can have the same values as CB (present, absent) plus the value observe.
• Add an arc from MCB to CB.
• Define the probability distribution of CB given its parents.
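A hedged sketch of that construction in Python, for the deterministic-manipulation case shown on a later slide: when MCB = "observe", CB keeps its usual distribution given HS; when MCB names a value, CB takes that value with probability 1.

```python
def p_cb_given_parents(hs, mcb):
    """P(CB | HS, MCB) after adding the manipulation variable MCB."""
    if mcb == "observe":
        return p_cb_given_hs[hs]          # passive observation: unchanged CPT
    # Deterministic manipulation: CB is forced to the value named by MCB.
    return {"present": 1.0 if mcb == "present" else 0.0,
            "absent":  1.0 if mcb == "absent" else 0.0}
```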
Inference Using a Single CBN When Given Evidence in the Form of Manipulations
[Figure: the HS/CB/LC/F/WL network with a new node MCB and an arc MCB → CB]
P(F | MCB = present, CBN1)
A Deterministic Manipulation
[Figure: the HS/CB/LC/F/WL network with MCB → CB, where the manipulation deterministically sets CB]
P(F | MCB = present, CBN1)
Inference Using a Single CBN When Given Evidence in the Form of Observations and Manipulations
[Figure: the HS/CB/LC/F/WL network with MCB → CB]
P(F | MCB = present, WL = present, CBN1)
Inference Using Multiple CBNs: Model Averaging

P(X | manipulate(Y'), observe(Z')) = Σi P(X | manipulate(Y'), observe(Z'), CBNi) · P(CBNi | data, K)
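A minimal sketch of the averaging step: each candidate CBN answers the query, and the answers are weighted by the models' posterior probabilities. The single-model query routine is passed in as a function (e.g., the enumeration sketch above), since the slide does not fix one.

```python
def model_averaged_query(cbns, posteriors, single_model_query):
    """cbns: candidate models; posteriors: P(CBN_i | data, K), summing to 1;
    single_model_query: maps a CBN to P(X | manipulations, observations, CBN_i)."""
    return sum(single_model_query(cbn) * post
               for cbn, post in zip(cbns, posteriors))
```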
Some Key Reasons for Learning CBNs
• Scientific discovery among measured variables
  o Example of general: What are the causal relationships among HS, LC, CB, F, and WL?
  o Example of focused: What are the causes of LC from among HS, CB, F, and WL?
• Scientific discovery of hidden processes
• Prediction
  o Example: The effect of not smoking on contracting lung cancer
Major Methods for Learning CBNs from Data
• Constraint-based methods
  o Use tests of independence to find patterns of relationships among variables that support causal relationships
  o Relatively efficient in discovery of causal models with hidden variables
  o See talk by Frederick Eberhardt this morning
• Score-based methods
  o Bayesian scoring: allows informative prior probabilities of causal structure and parameters
  o Non-Bayesian scoring: does not allow informative prior probabilities
Learning CBNs from Observational Data: A Bayesian Formulation

P(X → Y | D, K) = Σ_{i : Si contains X → Y} P(Si | D, K)

where D is observational data, Si is the structure of CBNi, and K is background knowledge and belief.
Learning CBNs from Observational Data When There Are No Hidden Variables

P(Si | D, K) = [ P(Si | K) ∫ P(D | Si, θi, K) P(θi | Si, K) dθi ] / [ Σj P(Sj | K) ∫ P(D | Sj, θj, K) P(θj | Sj, K) dθj ]

where θi are the parameters associated with Si and the sum is over all CBNs for which P(Sj | K) > 0.
The BD Marginal Likelihood
The previous integral has the following closed-form solution when we assume Dirichlet priors (αijk and αij), multinomial likelihoods (Nijk and Nij denote counts), parameter independence, and parameter modularity:

P(D | S, K) = ∏_{i=1..n} ∏_{j=1..qi} [ Γ(αij) / Γ(αij + Nij) ] ∏_{k=1..ri} [ Γ(αijk + Nijk) / Γ(αijk) ]
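The closed form is usually computed in log space with log-gamma to avoid overflow. A sketch, assuming a uniform hyperparameter αijk = α (the K2 prior when α = 1), with counts indexed as Nijk = counts[i][j][k]:

```python
from math import lgamma

def log_bd_score(counts, alpha=1.0):
    """Log BD marginal likelihood with uniform Dirichlet hyperparameters."""
    log_p = 0.0
    for node_counts in counts:            # product over nodes i
        for n_ijk in node_counts:         # product over parent configurations j
            r_i = len(n_ijk)              # number of values of node i
            alpha_ij, n_ij = alpha * r_i, sum(n_ijk)
            log_p += lgamma(alpha_ij) - lgamma(alpha_ij + n_ij)
            for n in n_ijk:               # product over node values k
                log_p += lgamma(alpha + n) - lgamma(alpha)
    return log_p

# One parentless binary node observed 3 times "true" and 7 times "false":
log_bd_score([[[3, 7]]])
```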
Searching for Network Structures
• Greedy search often used (a sketch follows this list)
• Hybrid methods have been explored that combine constraints and scoring
• Some algorithms guarantee locating the generating model in the large-sample limit (assuming the Markov and Faithfulness conditions), for example the GES algorithm (Chickering, JMLR, 2002)
• The ability to approximate the generating network is often quite good
• An excellent discussion and evaluation of several state-of-the-art methods, including a relatively new method (Max-Min Hill Climbing), is in: Tsamardinos, Brown, Aliferis, Machine Learning, 2006.
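A hedged sketch of greedy score-based search: repeatedly take the single-edge change (add, delete, or reverse) that most improves the score, stopping at a local maximum. Here `score` and `neighbors` are passed in; `neighbors` stands for a legal-move generator that enforces acyclicity, and `score` could be the BD score above.

```python
def greedy_search(initial_structure, score, neighbors):
    """Hill-climb over structures until no neighbor improves the score."""
    current, current_score = initial_structure, score(initial_structure)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current):   # single-edge modifications
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:
            return current                     # local maximum reached
        current, current_score = best, best_score
```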
The Complexity of Search
Given a complete dataset and no hidden variables, locating the Bayesian network structure that has the highest posterior probability is NP-hard (Chickering, AIS, 1996; Chickering, et al., JMLR, 2004).
We Can Learn More from Observational and Experimental Data Together than from Either One Alone
[Figure: C → E together with a hidden variable H that is a common cause of C and E]
We cannot learn the above causal structure from observational or experimental data alone. We need both.
Learning CBNs from Observational Data When There Are Hidden Variables

P(Si | D, K) = [ P(Si | K) Σ_{Hi} ∫ P(D, Hi | Si, θi, K) P(θi | Si, K) dθi ] / [ Σj P(Sj | K) Σ_{Hj} ∫ P(D, Hj | Sj, θj, K) P(θj | Sj, K) dθj ]

where Hi (Hj) are the hidden variables in Si (Sj) and the sum in the numerator (denominator) is taken over all values of Hi (Hj).
Learning CBNs from Observational and Experimental Data: A Bayesian Formulation
• For each model variable Xi that is experimentally manipulated in at least one case, introduce a potential parent MXi of Xi.
• Xi can have parents as well from among the other {X1, ..., Xi-1, Xi+1, ..., Xn} domain variables in the model.
• Priors on the distribution of Xi will include conditioning on MXi, when it is a parent of Xi, as well as conditioning on the other parents of Xi.
• Define MXi to have the same values vi1, vi2, ..., viq as Xi, plus a value o (for observe).
  o When MXi has value vij in a given case, this represents that the experimenter intended to manipulate Xi to have value vij in that case.
  o When MXi has value observe in a given case, this represents that no attempt was made by the experimenter to manipulate Xi; rather, Xi was merely observed to have the value recorded for it.
• With the above variable additions in place, use the previous Bayesian methods for causal modeling from observational data.
An Example Database Containing Observations and Manipulations

HS   MCB   CB   LC   F   WL
T    obs   T    F    T   T
F    F     F    T    T   F
F    F     T    T    F   F
T    obs   F    F    T   F
Faithfulness Condition
Any independence among variables in the data-generating distribution follows from the Markov Condition applied to the data-generating causal structure.
A simple counterexample:
[Figure: C → E with a hidden common cause H of C and E, where the two paths from C to E cancel]
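The counterexample can be made concrete numerically: in a linear-Gaussian version (my construction, not from the slides), the direct path C → E and the path through the common cause H cancel exactly, so C and E are uncorrelated even though C causes E.

```python
import random

random.seed(0)
n, cov = 200_000, 0.0
for _ in range(n):
    h = random.gauss(0, 1)
    c = h + random.gauss(0, 1)                    # C := H + noise
    e = 1.0 * c - 2.0 * h + random.gauss(0, 1)    # E := C - 2H + noise
    cov += c * e
print(cov / n)  # ≈ 0: Cov(C, E) = Var(C) - 2*Cov(C, H) = 2 - 2 = 0
```

The independence of C and E holds in the distribution but is not implied by the Markov Condition applied to the structure, so the distribution is unfaithful to it.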
Challenges of Bayesian Learning of Causal Networks
• Major challenges:
  o Large search spaces
  o Hidden variables
  o Feedback
  o Assessing parameter and structure priors
  o Modeling complicated distributions
• The remainder of this talk will summarize several methods for dealing with hidden variables, which is arguably the biggest challenge today
  o These examples provide only a small sample of previous research
Learning Belief Networks in the Presence of Missing Values and Hidden Variables (N. Friedman, ICML, 1997)
• Assumes a fixed set of measured and hidden variables
• Uses Expectation Maximization (EM) to "fill in" the values of the hidden variables
• Uses BIC to score causal network structures with the filled-in data. Greedily finds the best structure and then returns to the EM step using this new structure.
• Some subsequent work:
  o Using patterns of induced relationships among the measured variables to suggest where to introduce hidden variables (Elidan, et al., NIPS, 2000)
  o Determining the cardinality of the hidden variables introduced (Elidan & Friedman, UAI, 2001)
A Non-Parametric Bayesian Method for Inferring Hidden Causes (Wood, et al., UAI, 2006)
• Learns hidden causes of measured variables
• Assumes binary variables and noisy-OR interactions (sketched after the figure below)
• Uses MCMC to sample the hidden structures
• Allows in principle an infinite number of hidden variables
• In practice, the optimal number of hidden variables is constrained by the measured data
[Figure: a bipartite graph in which hidden variables point to measured variables]
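For reference, a short sketch of the noisy-OR interaction the method assumes: each active cause independently fails to turn the effect on with probability 1 − pi, and an optional leak term (my addition for illustration) allows the effect to occur with no active cause.

```python
def noisy_or(active_cause_probs, leak=0.0):
    """P(effect = 1) given the activation probabilities of the active causes."""
    p_off = 1.0 - leak
    for p in active_cause_probs:
        p_off *= 1.0 - p          # every active cause must independently fail
    return 1.0 - p_off

noisy_or([0.8, 0.6], leak=0.05)   # = 1 - 0.95*0.2*0.4 = 0.924
```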
Bayesian Learning of Measurement and Structural Models (Silva & Scheines, ICML, 2006)
• Learns the following type of model
• Assumes continuous variables, mixtures of Gaussian distributions, and linear interactions
[Figure: a model in which hidden variables cause each other (the structural model) and cause the measured variables (the measurement model)]
Mixed Ancestral Graphs*
• A MAG(G) is a graphical object that contains only the observed variables, causal arcs, and a new relationship (a bidirected edge, ↔) for representing hidden confounding.
• There exist methods for scoring linear MAGs (Richardson & Spirtes, Ancestral Graph Markov Models, Annals of Statistics, 2002)
[Figure: a latent-variable DAG over SES, SEX, PE, CP, and IQ with latent variables L1 and L2, and the corresponding MAG over the observed variables]
* This slide was adapted from a slide provided by Peter Spirtes.
A Theoretical Study of Y Structures for Causal Discovery (Mani, Spirtes, Cooper, UAI, 2006)
• Learn a Bayesian network structure on the measured variables
• Identify patterns in the structure that suggest causal relationships (a simplified detection sketch follows the figure)
• The "Y" structure shown in green supports that D is an unconfounded cause of F.
[Figure: a DAG over A, B, C, D, E, and F; the Y structure involving D and F, shown in green, has two non-adjacent parents of D and the arc D → F]
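A simplified sketch of scanning a learned DAG for such patterns (the paper's actual conditions are more involved): report each W → X where X has no other parent and W has two non-adjacent parents V1 and V2.

```python
from itertools import combinations

def find_y_structures(parents):
    """parents: {node: set of parents}; returns (v1, v2, w, x) Y structures."""
    found = []
    for x, pa_x in parents.items():
        if len(pa_x) != 1:
            continue
        (w,) = pa_x                               # x's sole parent
        for v1, v2 in combinations(parents.get(w, set()), 2):
            adjacent = (v1 in parents.get(v2, set()) or
                        v2 in parents.get(v1, set()))
            if not adjacent:                      # non-adjacent parents of w
                found.append((v1, v2, w, x))
    return found

# Hypothetical graph V1 -> W <- V2, W -> X:
find_y_structures({"W": {"V1", "V2"}, "X": {"W"}})
# [('V1', 'V2', 'W', 'X')] (order within the pair may vary)
```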
Causal Discovery Using Subsets of Variables
• Search for an estimate M of the Markov blanket of a variable X (e.g., Aliferis, et al., AMIA, 2002)
  o X is independent of the other variables in the generating causal network model, conditioned on the variables in X's Markov blanket (a sketch follows this list)
• Within M, search for patterns among the variables that suggest a causal relationship to X (e.g., Mani, doctoral dissertation, Univ. of Pittsburgh, 2006)
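A small sketch of the Markov blanket of X in a DAG, which consists of X's parents, X's children, and the other parents of those children; the example graph is the HS/CB/LC/F fragment, with the CB → F and LC → F arcs assumed for illustration.

```python
def markov_blanket(x, parents):
    """parents: {node: set of parents}. Returns parents, children, and
    spouses (other parents of x's children) of x."""
    children = {v for v, pa in parents.items() if x in pa}
    spouses = {p for c in children for p in parents[c]} - {x}
    return set(parents.get(x, set())) | children | spouses

parents = {"HS": set(), "CB": {"HS"}, "LC": {"HS"}, "F": {"CB", "LC"}}
markov_blanket("LC", parents)   # {'HS', 'F', 'CB'}
```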
Causal Identifiability
Generally depends upon:
• Markov Condition
• Faithfulness Condition
• Informative structural relationships among the measured variables
Example of the "Y structure":
[Figure: A → C ← B, with C → E]
Evaluation of Causal Discovery
• In evaluating a classifier, the correct answer in any instance is just the value of some variable of interest, which typically is explicitly in the data set. This makes evaluation relatively straightforward.
• In evaluating the output of a causal discovery algorithm, the answer is not in the dataset. In general we need some outside knowledge to confirm that the causal output is correct. This makes evaluation relatively difficult. Thus, causal discovery algorithms have not been thoroughly evaluated.
Methods for Evaluating Causal Discovery Algorithms
• Simulated data
• Real data with expert judgments of causation
• Real data with previously validated causal relationships
• Real data with follow-up experiments
An Example of an Evaluation Using Simulated Data (Mani, poster here)
• Generated 20,000 observational data samples from each of five CBNs that were manually constructed
• Applied the BLCD algorithm, which considers many 4-variable subsets of all the variables and applies Bayesian scoring. It is based on the causal properties of "Y" structures.
• Results (computed as sketched below):
  o Precision: 83%
  o Recall: 27%
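For clarity, a sketch of how the reported precision and recall are computed: the arcs output by the algorithm are compared against the arcs of the generating networks.

```python
def precision_recall(predicted_arcs, true_arcs):
    """Arcs are (cause, effect) pairs; returns (precision, recall)."""
    predicted, true = set(predicted_arcs), set(true_arcs)
    tp = len(predicted & true)                    # correctly discovered arcs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall

precision_recall({("HS", "LC"), ("HS", "F")}, {("HS", "LC"), ("HS", "CB")})
# -> (0.5, 0.5)
```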
An Example of an Evaluation Using Previously Validated Causal Relationships (Yoo, et al., PSB, 2002)
• ILVS is a Bayesian method that considers pairwise relationships among a set of variables
• It works best when given both observational and experimental data
• ILVS was applied to a previously collected DNA microarray dataset on 9 genes that control galactose metabolism in yeast (Ideker, et al., Science, 2001). The causal relationships among the genes have been extensively studied and reported in the literature.
• ILVS predicted 12 of 27 known causal relationships among the genes (44% recall), and of those 12, eight were correct (67% precision)
• Yoo has explored numerous extensions to ILVS
An Example of an Evaluation Using Real Data with Follow-Up Experiments (Sachs, et al., Science, 2005)
• Experimentally manipulated human immune system cells
• Used flow cytometry to measure the effects on 11 proteins and phospholipids on a large number of individual cells
• Used a Bayesian method for causally learning from observational and experimental data
• Derived 17 causal relationships with high probability
  o 15 highly supported by the literature (precision = 15/17 = 88%)
  o The other two were confirmed experimentally by the authors (precision = 17/17 = 100%)
  o Three causal relationships were missed ("recall" = 17/20 = 85%)
A Possible Approach to Combining Causal Discovery and Feature Selection
1. Use prior knowledge and statistical associations to develop overlapping groups of features (variables)
2. Derive causal probabilistic relationships within groups
3. Have the causal groups constrain each other
4. Determine additional groups of features that might constrain causal relationships further
5. Either go to step 2 or step 6
6. Model average within and across groups to derive approximate model-averaged causal relationships

David Danks (2002). Learning the Causal Structure of Overlapping Variable Sets. In S. Lange, K. Satoh, & C.H. Smith (eds.), Discovery Science: Proceedings of the 5th International Conference. Berlin: Springer-Verlag, pp. 178-191.
Some Suggestions for Further Information
Books:
• Glymour, Cooper (eds.), Computation, Causation, and Discovery (MIT Press, 1999)
• Pearl, Causality: Models, Reasoning, and Inference (Cambridge University Press, 2000)
• Spirtes, Glymour, Scheines, Causation, Prediction, and Search (MIT Press, 2001)
• Neapolitan, Learning Bayesian Networks (Prentice Hall, 2003)
Conferences: UAI, ICML, NIPS, AAAI, IJCAI
Journals: JMLR, Machine Learning
Acknowledgement
Thanks to Peter Spirtes for his comments on an outline of this talk.