
Causal Search in the Real World

A menu of topics

Some real-world challenges:
- Convergence & error bounds
- Sample selection bias
- Simpson's paradox

Some real-world successes:
- Learning based on more than just independence
- Learning about latents & their structure

Short-run causal search

Bayes net learning algorithms can give the wrong answer if the data fail to reflect the "true" associations and independencies.

Of course, this is a problem for all inference: we might just be really unlucky.

Note: this is not (really) the problem of unrepresentative samples (e.g., black swans).

Convergence in search

In search, we would like to bound our possible error as we acquire data; i.e., we want search procedures that have uniform convergence.

Without uniform convergence:
- we cannot set confidence intervals for inference
- Bayesians with different priors over hypotheses need not agree on probable error bounds, no matter how loose

Pointwise convergence

Assume hypothesis H is true. Then:
- for any standard of "closeness" to H, and
- for any standard of "successful refutation,"
- for every hypothesis that is not "close" to H, there is a sample size at which that hypothesis is refuted.

Uniform convergence

Assume hypothesis H is true. Then:
- for any standard of "closeness" to H, and
- for any standard of "successful refutation,"
- there is a sample size such that, for all hypotheses H* that are not "close" to H, H* is refuted at that sample size.
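The two definitions differ only in the order of the quantifiers. Schematically, with notation that is ours rather than the slides' (C(H) for the set of hypotheses "close" to H, R_n(H*) for "H* is refuted at sample size n"):

```latex
% Pointwise: the sufficient sample size may depend on the rival hypothesis H*.
\forall H^{*} \notin C(H) \;\; \exists n :\; R_{n}(H^{*})

% Uniform: a single sample size works for every rival hypothesis at once.
\exists n \;\; \forall H^{*} \notin C(H) :\; R_{n}(H^{*})
```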

Two theorems about convergence

There are procedures that, for every model, pointwise converge to the Markov equivalence class containing the true causal model. (Spirtes, Glymour, & Scheines, 1993)

There is no procedure that, for every model, uniformly converges to the Markov equivalence class containing the true causal model. (Robins, Scheines, Spirtes, & Wasserman, 1999; 2003)

Two theorems about convergence

What if we didn't care about "small" causes?

ε-Faithfulness: if X & Y are d-connected given S, then |ρXY.S| > ε, i.e., every association predicted by d-connection has magnitude greater than ε.

For any ε, standard constraint-based algorithms are uniformly convergent given ε-Faithfulness. So we have error bounds, confidence intervals, etc.
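For illustration only (none of this code is from the slides; the model coefficients and the value of ε are invented), here is a minimal sketch of the thresholded partial-correlation check a constraint-based search would run under ε-Faithfulness:

```python
# Illustrative only: an epsilon-thresholded partial-correlation check, the kind
# of test a constraint-based search runs under epsilon-Faithfulness. The model
# coefficients and epsilon are invented for this sketch.
import numpy as np

def partial_corr(x, y, conditioning):
    """Correlation of x and y after linearly regressing both on the conditioning set."""
    Z = np.column_stack([np.ones(len(x))] + list(conditioning))
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 0.05 * x + 0.8 * z + rng.normal(size=n)   # a "small" X -> Y effect

eps = 0.1
rho = partial_corr(x, y, [z])
print(round(rho, 3), abs(rho) > eps)
# rho is about 0.05, below eps, so this weak edge is not detected;
# epsilon-Faithfulness simply assumes such sub-threshold causes away.
```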

Sample selection bias

Sometimes, a variable of interest is a cause of whether people get into the sample:
- e.g., measuring various skills or knowledge in college students
- or measuring joblessness by a phone survey during the middle of the day

Simple problem: you might get a skewed picture of the population.

Sample selection bias

If two variables matter (Factor A and Factor B both influence inclusion), then we have the graph: Factor A → Sample ← Factor B.

Sample = 1 for everyone we measure.

That is equivalent to conditioning on Sample ⇒ induces an association between A and B!
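The induced association is easy to see in a small simulation; this is an illustrative sketch, with made-up variable names and a made-up inclusion rule, not anything from the slides:

```python
# An illustrative simulation (invented names and inclusion rule): A and B are
# independent in the population, both raise the chance of being sampled, and
# within the sample they become associated.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
A = rng.normal(size=n)
B = rng.normal(size=n)                         # independent of A

# Sample = 1 exactly when we measure someone; inclusion depends on both factors.
in_sample = (A + B + rng.normal(size=n)) > 1.0

print(round(np.corrcoef(A, B)[0, 1], 3))                        # ~0 in the population
print(round(np.corrcoef(A[in_sample], B[in_sample])[0, 1], 3))  # clearly non-zero in the sample
```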

Simpson’s Paradox

Consider the following data:

Men            Treated   Untreated
Alive          3         20
Dead           3         24
P(A | T) = 0.5            P(A | U) ≈ 0.455

Women          Treated   Untreated
Alive          16        3
Dead           25        6
P(A | T) ≈ 0.39           P(A | U) ≈ 0.333

Treatment is superior in both groups!

Simpson’s Paradox

Consider the following data:

Pooled         Treated   Untreated
Alive          19        23
Dead           28        30
P(A | T) ≈ 0.404          P(A | U) ≈ 0.434

In the "full" population, you're better off not being Treated!
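A quick arithmetic check of these tables (a minimal Python sketch; the counts are exactly those on the slides):

```python
# A quick check of the tables above (the counts are exactly those on the slides).
men = {"T": (3, 3), "U": (20, 24)}        # (alive, dead)
women = {"T": (16, 25), "U": (3, 6)}

def p_alive(alive, dead):
    return alive / (alive + dead)

for name, grp in [("Men", men), ("Women", women)]:
    print(name, round(p_alive(*grp["T"]), 3), round(p_alive(*grp["U"]), 3))

pooled = {k: (men[k][0] + women[k][0], men[k][1] + women[k][1]) for k in ("T", "U")}
print("Pooled", round(p_alive(*pooled["T"]), 3), round(p_alive(*pooled["U"]), 3))
# Men    0.5    0.455
# Women  0.39   0.333
# Pooled 0.404  0.434  <- the comparison reverses when the groups are pooled
```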

Simpson’s Paradox

Berkeley Graduate Admissions case

More than independence

Independence & association can reveal only the Markov equivalence class. But our data contain more statistical information!

Algorithms that exploit this additional information can sometimes learn more (including unique graphs). Example: the LiNGaM algorithm for non-Gaussian data.

Non-Gaussian data

Assume linearity & independent non-Gaussian noise.

Linear causal DAG functions are: D = B D + ε, where B is permutable to lower triangular (because the graph is acyclic).

Non-Gaussian data

Assume linearity & independent non-Gaussian noise.

Linear causal DAG functions are: D = A ε, where A = (I − B)^-1.

ICA is an efficient estimator for A ⇒ efficient causal search that reveals direction! C ⟶ E iff there is a non-zero entry in A.
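A hedged sketch of the two-variable case of this ICA-based recipe (assuming scikit-learn's FastICA; the data-generating coefficients are invented, and this is an illustration rather than the slides' own code):

```python
# A sketch of the two-variable ICA-LiNGaM idea (Shimizu et al. 2006), assuming
# scikit-learn's FastICA. Variable names and coefficients are invented.
import numpy as np
from itertools import permutations
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 5000

# True model X -> Y, with uniform (non-Gaussian) noise.
x = rng.uniform(-1, 1, n)
y = 2.0 * x + rng.uniform(-1, 1, n)
D = np.column_stack([x, y])            # observed data, columns = [X, Y]

# ICA estimates an unmixing matrix W whose rows are, up to permutation and
# scaling, the rows of (I - B).
ica = FastICA(n_components=2, random_state=0)
ica.fit(D)
W = ica.components_

# Step 1: permute rows of W so the diagonal has no (near-)zero entries.
perm = min(permutations(range(2)),
           key=lambda p: np.sum(1.0 / np.abs(W[list(p)].diagonal())))
Wp = W[list(perm)]

# Step 2: rescale each row to put 1 on the diagonal, then read off B = I - W.
Wp = Wp / np.diag(Wp)[:, None]
B_hat = np.eye(2) - Wp

print(np.round(B_hat, 2))
# Expected: B_hat[1, 0] is roughly 2 and B_hat[0, 1] is roughly 0,
# i.e. the non-zero entry identifies the direction X -> Y.
```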

Non-Gaussian data

Why can we learn the directions in this case?

[Figure: two scatterplots of A vs. B, one generated with Gaussian noise and one with uniform noise]

Non-Gaussian data

Case study: European electricity cost

Learning about latents

Sometimes, our real interest is in variables that:
- are only indirectly observed,
- or are observed by their effects,
- or are unknown altogether but influencing things behind the scenes.

[Figure: example causal graph in which latent variables (General IQ, Sociability) influence observed measures such as Test score, Reading level, Math skills, and Size of social network, along with Other factors]

Factor analysis

Assume linear equations. Given some set of (observed) features, determine the coefficients for (a fixed number of) unobserved variables that minimize the error.

Factor analysis

If we have one factor, then we find coefficients to minimize the error in:

Fi = ai + bi U

where U is the unobserved variable (with fixed mean and variance).

Two factors ⇒ minimize the error in:

Fi = ai + bi,1 U1 + bi,2 U2
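A minimal sketch of the one-factor case, assuming the linear model above and scikit-learn's maximum-likelihood FactorAnalysis; the loadings, intercepts, and noise level are invented for illustration:

```python
# A minimal one-factor sketch: simulate F_i = a_i + b_i * U + noise and recover
# the loadings with scikit-learn's maximum-likelihood FactorAnalysis.
# All numbers here are invented for illustration.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 2000
U = rng.normal(size=n)                      # the unobserved factor
b = np.array([0.9, 0.7, 0.5, 0.3])          # true loadings b_i
a = np.array([1.0, -2.0, 0.5, 3.0])         # true intercepts a_i
F = a + U[:, None] * b + rng.normal(scale=0.4, size=(n, 4))   # observed features

fa = FactorAnalysis(n_components=1)
fa.fit(F)
print(np.round(fa.components_, 2))   # estimated loadings (recovered up to sign)
print(np.round(fa.mean_, 2))         # per-feature means (~ the intercepts a_i here, since U has mean 0)
```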

Factor analysis

The decision about exactly how many factors to use is typically based on some "simplicity vs. fit" tradeoff.

Also, the interpretation of the unobserved factors must be provided by the scientist: the data do not dictate the meaning of the unobserved factors (though it can sometimes be "obvious").

Factor analysis as graph search

One-factor factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:

[Figure: a single latent U with arrows into each observed feature F1, F2, …, Fn]

Factor analysis as graph search

Two-factor factor analysis is equivalent to finding the ML parameter estimates for the SEM with graph:

[Figure: two latents U1 and U2, each with arrows into the observed features F1, F2, …, Fn]

Better methods for latents

Two different types of algorithms:

1. Determine which observed variables are caused by shared latents (BPC, FOFC, FTFC, …)
2. Determine the causal structure among the latents (MIMBuild)

Note: these need additional parametric assumptions (usually linearity, though it can be done with weaker information).

Discovering latents

Key idea: for many parameterizations, the association between X & Y can be decomposed.

Linearity ⇒ cov(X, Y) is a sum, over the paths connecting X and Y, of products of edge coefficients (and source variances).

⇒ We can use patterns in the precise associations to discover the number of latents, using the ranks of different sub-matrices of the covariance matrix.
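A small numerical illustration (coefficients invented) of one such rank constraint: with four indicators of a single latent, the cross-covariance between {A, B} and {C, D} has rank 1, so its determinant (a tetrad difference) is approximately zero:

```python
# A numerical illustration (invented coefficients) of the rank idea: with four
# indicators A, B, C, D of one latent U in a linear model, the cross-covariance
# between {A, B} and {C, D} has rank 1, so its determinant (a tetrad
# difference) is approximately zero.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
U = rng.normal(size=n)                    # latent common cause
A = 0.9 * U + rng.normal(scale=0.5, size=n)
B = 0.8 * U + rng.normal(scale=0.5, size=n)
C = 0.7 * U + rng.normal(scale=0.5, size=n)
D = 0.6 * U + rng.normal(scale=0.5, size=n)

cov = np.cov([A, B, C, D])
sub = cov[:2, 2:]                         # cov({A, B}, {C, D})
tetrad = sub[0, 0] * sub[1, 1] - sub[0, 1] * sub[1, 0]
print(round(tetrad, 4))                   # ~0: the sub-matrix is (nearly) rank 1
```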

Discovering latents

[Figure: graph with a single latent U pointing into the observed variables A, B, C, D]

Discovering latents

[Figure: graph over the observed variables A, B, C, D involving two latents, U and L]

Discovering latents

There are many instantiations of this type of search, for different parametric knowledge, numbers of observed variables (⇒ numbers of discoverable latents), etc.

And once we have one of these "clean" models, we can use "traditional" search algorithms (with modifications) to learn the structure between the latents.

Other Algorithms

CCD: learn directed cyclic graphs (DCGs, with non-obvious semantics)

ION: learn global features from overlapping local variable sets (including relations between variables that were never measured together)

SAT-solver: learn causal structure (possibly cyclic, possibly with latents) from arbitrary combinations of observational & experimental constraints

LoSST: learn causal structure while that structure potentially changes over time

And lots of other ongoing research!

Tetrad project

http://www.phil.cmu.edu/projects/tetrad/current.html