
The Case for Evaluating Causal Models Using Interventional Measures and Empirical Data

Summary
Algorithms for causal discovery are unlikely to be widely accepted unless they can be shown to accurately predict the effects of intervention when applied to empirical data.

Q1. Don’t we do this already?
Q2. Aren’t structural measures enough?
Q3. Is there any data that supports this type of evaluation?
Q4. Does empirical data add any value?

• We surveyed 91 causality papers from the past 5 years of NeurIPS, AAAI, KDD, UAI, and ICML.

[Figure: survey results, shown as the percentage of papers by source of data (synthetic, empirical, both), by evaluation measure (structural, distributional, visual inspection, interventional) for synthetic vs. empirical data, and by type of algorithm (bivariate, multivariate, other).]

[Figure: evaluation pipeline. A parameterized DAG over variables T, O, and C is used to draw observational data (by observational sampling) and interventional data. Structure learning and parameterization are applied to the observational data; the interventional distribution is then computed from the true parameterized DAG and estimated from the learned model, and the two are compared in the evaluation step. Example data tables show records of (T, O, C) values.]
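As an illustration of the evaluation step, here is a minimal sketch of computing total variation distance (TVD) between a true interventional distribution (computed from the generating DAG) and an estimated one (derived from the learned model). The discrete two-valued outcome and the names p_true, p_est, and total_variation_distance are illustrative assumptions, not the poster's implementation.

# Sketch of the evaluation step: compare a true interventional distribution
# (computed from the generating DAG) with one estimated from a learned model.
# The discrete-outcome setting and the helper names are illustrative assumptions.

def total_variation_distance(p_true, p_est):
    """TVD between two distributions given as dicts mapping outcome -> probability.
    TVD(P, Q) = 0.5 * sum_x |P(x) - Q(x)|."""
    outcomes = set(p_true) | set(p_est)
    return 0.5 * sum(abs(p_true.get(x, 0.0) - p_est.get(x, 0.0)) for x in outcomes)

# Hypothetical example: an interventional distribution P(O | do(T = 1))
# over a two-valued outcome, under the true model vs. the learned model.
p_true = {"low": 0.30, "high": 0.70}   # computed from the parameterized DAG
p_est  = {"low": 0.45, "high": 0.55}   # estimated from the model learned on observational data

print(total_variation_distance(p_true, p_est))  # 0.15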

[Figure: performance of GES, MMHC, and PC on ‘look-alike’ synthetic data generated from PC-learned and GES-learned structures, and on the original empirical data.]

• Many datasets exist with known interventional effects (DREAM, ACIC 2016 challenge, flow cytometry, cause-effect pairs challenge, etc.)
• We can collect data from computational systems, which has many advantages: it is empirical, easily intervenable, and naturally stochastic
• We collected data from Postgres, the JDK, and networking infrastructure and intervened by changing system parameters
• Pseudo-observational data can be created by biasing the sample with an observed covariate (a sketch appears after this list)
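A minimal sketch of creating pseudo-observational data from experimental records by biasing inclusion with an observed covariate, so that treatment assignment becomes confounded with that covariate. The record format, the column names T and C, and the logistic weighting are illustrative assumptions, not the poster's exact procedure.

import math
import random

def biased_subsample(records, seed=0):
    """Keep each record with a probability that depends on the covariate C,
    so treatment T becomes confounded with C in the retained sample."""
    rng = random.Random(seed)
    kept = []
    for r in records:  # each r is a dict with keys "T" (0/1) and "C" (numeric covariate)
        if r["T"] == 1:
            p_keep = 1.0 / (1.0 + math.exp(-r["C"]))  # treated records kept more often when C is high
        else:
            p_keep = 1.0 / (1.0 + math.exp(r["C"]))   # untreated records kept more often when C is low
        if rng.random() < p_keep:
            kept.append(r)
    return kept

# Example: records from a randomized experiment, where T was assigned independently of C.
records = [{"T": 1, "C": 0.8, "O": 5.7}, {"T": 0, "C": -1.2, "O": 3.2}]
pseudo_observational = biased_subsample(records)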

• Most structural measures penalize all errors equally
• Even structural measures designed to consider interventions (e.g., Structural Intervention Distance) produce results similar to other structural measures
• Interventional measures (e.g., total variation distance, TVD) instead compare estimated and true interventional distributions
• We generated random DAGs and compared different evaluation measures, for both GES and PC
• TVD and SHD (structural Hamming distance) produce significantly different results, varying by algorithm (a sketch of SHD appears after this list)
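For contrast with TVD, here is a minimal sketch of structural Hamming distance (SHD) over binary adjacency matrices. It charges one unit for every missing, extra, or reversed edge, regardless of how much that error changes any interventional distribution, which is why it can disagree with interventional measures. The adjacency-matrix encoding and the small three-node example are illustrative assumptions.

import numpy as np

def shd(a_true, a_learned):
    """Structural Hamming distance between two DAGs given as binary adjacency
    matrices (A[i, j] = 1 means an edge i -> j). Each mismatched node pair
    (missing, extra, or reversed edge) contributes exactly 1."""
    n = a_true.shape[0]
    diff = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (a_true[i, j], a_true[j, i]) != (a_learned[i, j], a_learned[j, i]):
                diff += 1
    return diff

# Hypothetical example: true graph C -> T -> O (nodes 0, 1, 2);
# learned graph reverses C -> T and drops T -> O. SHD counts both errors equally.
a_true    = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
a_learned = np.array([[0, 0, 0], [1, 0, 0], [0, 0, 0]])
print(shd(a_true, a_learned))  # 2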

Measures of interventional effect are necessary when trying to assess how well an algorithm learns actual causal effects.

Effective methods exist to create empirical data for evaluating algorithms for causal discovery.

[Figure: causal structures of computational systems learned by PC (left) and GES (center).]

• Empirical data provides realistic complexity generally not present in synthetic data
• Data not generated by the researcher is less likely to contain unintentional biases and can be standardized across the community
• It provides a stronger demonstration of effectiveness
• We learned the causal structure of computational systems using PC (left) and GES (center)
• We generated synthetic data based on these structures and evaluated the performance of GES, MMHC, and PC (a sketch of such forward sampling appears after this list)
• Compared with the performance of GES, MMHC, and PC on the original empirical data, the relative order of the algorithms differs significantly
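A minimal sketch of how ‘look-alike’ synthetic data might be generated by forward-sampling a parameterization of a learned structure. The linear-Gaussian parameterization, the variable names, and the fitted edge weights are illustrative assumptions; the poster does not specify the exact generation procedure.

import numpy as np

def sample_synthetic(structure, weights, n, seed=0):
    """Forward-sample a linear-Gaussian parameterization of a learned DAG.
    `structure` maps each node to its parent list, in topological order;
    `weights` maps (parent, child) pairs to fitted edge coefficients."""
    rng = np.random.default_rng(seed)
    data = {}
    for node, parents in structure.items():  # assumes topological order
        noise = rng.normal(0.0, 1.0, size=n)
        data[node] = noise + sum(weights[(p, node)] * data[p] for p in parents)
    return data

# Hypothetical example: structure C -> T -> O learned by PC or GES, with fitted weights.
structure = {"C": [], "T": ["C"], "O": ["T"]}
weights = {("C", "T"): 0.8, ("T", "O"): 1.5}
synthetic = sample_synthetic(structure, weights, n=1000)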

A1. No, fewer than 10% of papers published in the past 5 years use a combination of empirical data and interventional measures.

A2. No, structural measures correspond poorly to measures of interventional effect, such as TVD.

A3. Yes, several data sets exist, including ones we have recently created from experimentation with computational systems.

A4. Yes, results on empirical data sets appear to differ substantially from results on ‘look-alike’ synthetic data sets.

Amanda Gentzel, Dan Garant, and David Jensen
College of Information and Computer Sciences, University of Massachusetts Amherst
