Causal data mining: Identifying causal effects at scale


Page 1: Causal data mining: Identifying causal effects at scale

1

Causal data mining: Identifying causal effects at scale
AMIT SHARMA, Postdoctoral Researcher, Microsoft Research New York
http://www.amitsharma.in | @amt_shrma

Page 2: Causal data mining: Identifying causal effects at scale

2

A tale of two questions

Q1: How much activity comes from the recommendation system?

Q2: How much activity comes because of the recommendation system?

Page 3: Causal data mining: Identifying causal effects at scale

3

How much activity comes because of the recommendation system?

A causal question.

With recommender: the real world (observed).

Without recommender: the counterfactual world (unobserved).

Page 4: Causal data mining: Identifying causal effects at scale

Understanding causal relationships from data:
1. Modeling user behavior
2. Evaluating systems

Page 5: Causal data mining: Identifying causal effects at scale

Distinguishing between personal preference and homophily in online activity feeds. Sharma and Cosley (2016).
Studying and modeling the effect of social explanations in recommender systems. Sharma and Cosley (2013).

[Figure: example social explanation for SOME MUSICAL ARTIST: "Amit and Dan like this."]

2. Evaluating and improving systems

Understanding causal relationships from data

Page 6: Causal data mining: Identifying causal effects at scale

Understanding causal relationships from data

Averaging Gone Wrong: Using Time-Aware Analyses to Better Understand Behavior. Barbosa, Cosley, Sharma, Cesar (2016).
Auditing search engines for differential satisfaction across demographics. Mehrotra, Anderson, Diaz, Sharma, Wallach (2016).

Page 7: Causal data mining: Identifying causal effects at scale

7

A core problem across the sciences

[Figure: another activity-feed example: "Jake and Duncan like this."]

Understanding causal relationships from data:
- Code profiling, static analysis [Berger et al.]
- Debugging machine learning [Chakarov et al.]
- Decision-making in robotics

Page 8: Causal data mining: Identifying causal effects at scale

8

Why is it hard?

"Without" can mean without a recommender system algorithm, without a code change, or without a social policy or medical treatment.

We observe data from the real world, but no data from the counterfactual world.

Without a randomized experiment, the causal effect is hard to estimate.

Page 9: Causal data mining: Identifying causal effects at scale

9

Difference between prediction and causation

[Causal graph: Cause (X) → Outcome (Y), with Unobserved Confounders (U) affecting both.]

$y = f(x, u)$

Hofman, Sharma, and Watts (2017). Science, 355.6324.

Page 10: Causal data mining: Identifying causal effects at scale

10

Prediction: $y = k(x) + \epsilon$. Given data $\langle X, Y \rangle$, estimate $\hat{k}$.

Causation: Given data $\langle X, Y \rangle$, estimate $\hat{f}$ in $y = f(x, u)$, where the confounders $u$ are unobserved.

Hofman, Sharma, and Watts (2017). Science, 355.6324.

Page 11: Causal data mining: Identifying causal effects at scale

11

Research goal

How can we use large-scale data to infer causal estimates?

Use algorithms to find experiment-like data: quasi ("natural") experiments.

Page 12: Causal data mining: Identifying causal effects at scale

12

Prediction: $y = \beta x + \epsilon$. Given $\langle X, Y \rangle$, estimate $\hat{\beta}$.

Causation: Given $\langle X, Y \rangle$ plus a natural experiment, estimate the causal $\hat{\beta}$.

Page 13: Causal data mining: Identifying causal effects at scale

Combine Pearl's causal graphical model framework with natural experiments.

Page 14: Causal data mining: Identifying causal effects at scale

14

Inverting the natural experiment paradigm

Traditional approach: hypothesize about a natural variation, then argue why it resembles a randomized experiment.

Inverted approach: start from observational data $\langle X, Y \rangle$, develop tests for the validity of a natural variation, then mine for data subsets with such valid variations.

Page 15: Causal data mining: Identifying causal effects at scale

15

[Figure: mining observational data $\langle X, Y \rangle$ yields many natural experiments, in contrast to the single, hand-crafted natural experiments researchers have relied on since the 1850s.]

Page 16: Causal data mining: Identifying causal effects at scale

16

Data mining for causal inference: mine $\langle X, Y \rangle$ data for natural experiments.

1. Split-door criterion: causal effect of recommender systems.
2. Bayesian natural experiment test: validate past economics studies.

Page 17: Causal data mining: Identifying causal effects at scale

17

Part 0: Traditional causal inference using a natural experiment

Page 18: Causal data mining: Identifying causal effects at scale

18

1854: London was having a devastating cholera outbreak

Page 19: Causal data mining: Identifying causal effects at scale

19

Causal question: What is causing cholera?

Air-borne: spreads through air ("miasma").
Water-borne: spreads through contaminated water.

Page 20: Causal data mining: Identifying causal effects at scale

[Competing causal graphs: Polluted Air → Cholera Diagnosis versus Contaminated Water → Cholera Diagnosis, with Neighborhood as a common background factor.]

Page 21: Causal data mining: Identifying causal effects at scale

21

Enter John Snow. He found higher cholera deaths near a water pump, but this could have been merely correlational.

Page 22: Causal data mining: Identifying causal effects at scale

22

New idea: two major water companies served London, one drawing water upstream (Lambeth Water Company) and one downstream (S&V Water Company).

Page 23: Causal data mining: Identifying causal effects at scale

23

No difference in neighborhood, yet an 8-fold increase in cholera among customers of the downstream company (S&V) compared to Lambeth.

Page 24: Causal data mining: Identifying causal effects at scale

24

Led to a change in belief about cholera’s cause.

Page 25: Causal data mining: Identifying causal effects at scale

25

Why was Snow's study so convincing?

- Choice of water company cannot cause cholera.
- Choice of water company was not related to people's neighborhood or its air quality.
- People receiving water from the two companies were interspersed within neighborhoods.

Page 26: Causal data mining: Identifying causal effects at scale

26

Probably the first application of cause-effect principles.

Exclusion: choice of water company cannot cause cholera.
As-if-random: choice of water company is not related to neighborhood.

[Causal graph: Water Company → Contaminated Water → Cholera Diagnosis, with Neighborhood as an unobserved confounder of Contaminated Water and Cholera Diagnosis.]

Page 27: Causal data mining: Identifying causal effects at scale

27

Contaminated Water (X)

Cholera Diagnosis

(Y)

Other factors [e.g.

neighborhood] (U)

Water Compan

y(Z)

As-If-Random

Exclusion

Two assumptions central to causal inference: Exclusion and As-if-random

Page 28: Causal data mining: Identifying causal effects at scale

28

Two assumptions central to causal inference:
Exclusion: $(Z \perp Y \mid X, U)$, the new variable Z affects the outcome only through the cause.
As-if-random: $(Z \perp U)$, Z is independent of the unobserved confounders.

[Causal graph: New variable (Z) → Cause (X) → Outcome (Y), with Unobserved Confounders (U) affecting both X and Y.]
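As a small illustration of what these two assumptions buy (my sketch with an assumed linear data-generating process, not an example from the talk), the Wald ratio cov(Z, Y)/cov(Z, X) recovers the causal effect even though a plain regression of Y on X is confounded; all parameter values below are arbitrary.

```python
import numpy as np

# Sketch with an assumed linear data-generating process: Z satisfies Exclusion
# (it affects Y only through X) and As-if-random (it is independent of U).
rng = np.random.default_rng(1)
n = 200_000
beta = 0.5                                     # true causal effect (assumed)

u = rng.normal(size=n)                         # unobserved confounder
z = rng.normal(size=n)                         # instrument, drawn independently of u
x = z + u + rng.normal(size=n)                 # both z and u move x
y = beta * x + 2.0 * u + rng.normal(size=n)    # z enters y only through x

ols = np.cov(x, y)[0, 1] / np.var(x)           # confounded regression estimate
wald = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1] # instrumental-variable (Wald) ratio
print(f"OLS: {ols:.2f}   IV: {wald:.2f}   truth: {beta}")
```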

Page 29: Causal data mining: Identifying causal effects at scale

1930s: Fisher introduces the randomized experiment.

Since then, these assumptions have formed the core of causal inference.

29

[Causal graph: Randomized Assignment (Z) → Cause (X) → Outcome (Y), with Unobserved Confounders (U).]

Exclusion: randomized assignment should not affect the outcome except through the cause.
As-if-random: randomized assignment should be independent of unobserved confounders.

Page 30: Causal data mining: Identifying causal effects at scale

30

Z is now a special observed variable, called an instrumental variable.

All studies using observational data also need to satisfy these two assumptions.

[Causal graph: Instrumental Variable (Z) → Cause (X) → Outcome (Y), with Unobserved Confounders (U).]

Page 31: Causal data mining: Identifying causal effects at scale

But Exclusion and As-if-random are hard to establish, because of unobserved confounders.

Page 32: Causal data mining: Identifying causal effects at scale

32

More formally: starting from the full dataset $\langle X, Y \rangle$, find subsets of the data that behave like experiments, such that
As-if-random: $(Z \perp U)$, and
Exclusion: $(Z \perp Y \mid X, U)$.

Both conditions are hard to verify from observed data.

Page 33: Causal data mining: Identifying causal effects at scale

33

Current methods haven't changed much from those used by John Snow in the 1850s: researchers use rhetorical arguments to justify an instrumental variable.

1. Manually finding an instrumental variable restricts researchers to single-source events (e.g., weather or lotteries).
2. There is still no guarantee that either Exclusion or As-if-random is satisfied.

Page 34: Causal data mining: Identifying causal effects at scale

34

Causal data mining: Inverting the natural experiment paradigm

Traditional approach: hypothesize about a natural variation, then argue why it resembles a randomized experiment.

Inverted approach: start from observational data $\langle X, Y \rangle$, develop tests for the validity of a natural variation, then mine for data subsets with such valid variations.

Page 35: Causal data mining: Identifying causal effects at scale

35

Part I: Split-door criterion for causal identification

Page 36: Causal data mining: Identifying causal effects at scale

36

Intuition: What if we can observe an auxiliary outcome that is unaffected by the causal variable?

[Causal graph: Cause → Outcome; Unobserved Confounders affect the Cause, the Outcome, and the Auxiliary Outcome; the Cause does not affect the Auxiliary Outcome.]

The outcome can be separated into two observable parts:
i) Primary outcome: (possibly) affected by the cause.
ii) Auxiliary outcome: unaffected by the cause.

Page 37: Causal data mining: Identifying causal effects at scale

37

Intuition: What if we can observe an auxiliary outcome that is unaffected by the causal variable?

[Causal graph as above, with the Outcome relabeled as the Primary Outcome.]


Page 40: Causal data mining: Identifying causal effects at scale

40

Simplest case: Outcome can be separated into two observable parts

i) Primary outcome: (possibly) affected by cause

ii) Auxiliary outcome: unaffected by cause

Page 41: Causal data mining: Identifying causal effects at scale

41

Such outcome data is commonly available in digital systems: recommender systems, ad systems, app notifications, and any content website (such as news).

Let’s take a concrete example: recommender systems

Page 42: Causal data mining: Identifying causal effects at scale

42

Can we find such an auxiliary outcome?

Page 43: Causal data mining: Identifying causal effects at scale

43

Example: Estimating the causal impact of a recommender system (novel recommendations)

Page 44: Causal data mining: Identifying causal effects at scale

44

How much activity comes from the recommendation system?

30% of product page visits.

30% of groups joined.

80% of movies watched.

Sharma and Yan (2013); Sharma, Hofman, and Watts (2015); Gomez-Uribe and Hunt (2015)

Page 45: Causal data mining: Identifying causal effects at scale

Confounding: observed click-throughs may be due to correlated demand.

45

[Causal graph: Demand for The Road → Visits to The Road → Rec. visits to No Country for Old Men ← Demand for No Country for Old Men, where the two demands are linked by correlated demand for Cormac McCarthy.]

Page 46: Causal data mining: Identifying causal effects at scale

46

Observed activity is almost surely an overestimate of the causal effect.

[Figure: of all page visits, the observed activity from the recommender splits into a causal part and a convenience part; the activity that would occur without the recommender is unknown.]

Page 47: Causal data mining: Identifying causal effects at scale

47

Counterfactual thought experiment: What would have happened without recommendations?

Page 48: Causal data mining: Identifying causal effects at scale

48

Hypothetical experiment: a randomized A/B test with a treatment group (A) and a control group (B).

But such experiments can be costly. Can we develop an offline metric?

Page 49: Causal data mining: Identifying causal effects at scale

49

Past work: a traditional instrumental variable. Carmi et al. (2012).

[Causal graph: Instrument → Demand for Cormac McCarthy → Visits to The Road → Rec. visits to No Country for Old Men.]

Page 50: Causal data mining: Identifying causal effects at scale

Data mining approach (Shock-IV): finding valid shocks across product categories.

50

Example shock: a spike in demand for a product due to Oprah.

Given $\langle X, Y \rangle$: argue why a shock resembles a randomized experiment, develop tests for the validity of a shock, and mine for shocks in observational data.

Page 51: Causal data mining: Identifying causal effects at scale

Finding an auxiliary outcome: split the outcome into recommender visits (primary) and direct visits (auxiliary).

51

All visits to a recommended product = recommender visits + direct visits (search visits and direct browsing).

Auxiliary outcome: a proxy for unobserved demand.

Page 52: Causal data mining: Identifying causal effects at scale

52

Causal graphical model for the effect of a recommendation system

[Causal graph: Demand for focal product (U_X) → Visits to focal product (X) → Rec. visits (Y_R); Demand for rec. product (U_Y) → Rec. visits (Y_R) and Direct visits (Y_D); U_X and U_Y may be correlated.]

Page 53: Causal data mining: Identifying causal effects at scale

53

1a. Search for any product with a shock to page visits.

Page 54: Causal data mining: Identifying causal effects at scale

1b. Filtering out invalid natural experiments

54

Page 55: Causal data mining: Identifying causal effects at scale

55

The "split-door" criterion: test whether the auxiliary outcome is independent of the cause. Criterion: $X \perp Y_D$.

This provides a data-driven check of the Exclusion-style assumption.

[Causal graph: Demand for focal product (U_X) → Visits to focal product (X) → Rec. visits (Y_R); Demand for rec. product (U_Y) → Rec. visits (Y_R) and Direct visits (Y_D).]

Page 56: Causal data mining: Identifying causal effects at scale

56

More formally, why does it work?

Theorem 1: Barring incidental equality of parameters, statistical independence of $X$ and $Y_D$ guarantees unconfoundedness between $X$ and $Y_R$.
Proof: follows from properties of causal graphical models and Pearl's do-calculus [Pearl 2009].

[Causal graph: Unobserved variables (U_X) → Cause (X) → Outcome (Y_R); Unobserved variables (U_Y) → Outcome (Y_R) and Auxiliary Outcome (Y_D).]

Page 57: Causal data mining: Identifying causal effects at scale

57

Example: assuming a linear model.

Theorem 1a: Under a linear model, whenever $X \perp Y_D$ holds, an unbiased estimate of the causal effect of $X$ on $Y_R$ can be obtained from the observed data.

(Here $X$ is the treatment, $Y_R$ the outcome, and $U_X$, $U_Y$ the unobserved confounders.)
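A small simulation sketch of this idea, under assumed linear structural equations rather than the paper's exact model: when demand is not correlated, X is independent of Y_D and the naive regression of Y_R on X recovers the causal effect; when demand is correlated, the split-door check fails and the same regression is biased. Variable names mirror the graph above; the numbers are arbitrary.

```python
import numpy as np

# Assumed linear structural equations mirroring the graph on the previous slides.
rng = np.random.default_rng(2)
n = 100_000
rho = 0.3                                          # true causal effect of X on Y_R (assumed)

def simulate(correlated_demand):
    u_x = rng.normal(size=n)                       # demand for the focal product
    u_y = u_x if correlated_demand else rng.normal(size=n)  # demand for the rec. product
    x = u_x + rng.normal(size=n)                   # visits to the focal product
    y_r = rho * x + u_y + rng.normal(size=n)       # recommendation click-throughs
    y_d = u_y + rng.normal(size=n)                 # direct visits (auxiliary outcome)
    return x, y_r, y_d

for correlated in (True, False):
    x, y_r, y_d = simulate(correlated)
    estimate = np.cov(x, y_r)[0, 1] / np.var(x)    # naive regression of Y_R on X
    check = np.corrcoef(x, y_d)[0, 1]              # split-door check: corr(X, Y_D)
    print(f"corr(X, Y_D) = {check:+.2f} -> estimated effect {estimate:.2f} (truth {rho})")
```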

Page 58: Causal data mining: Identifying causal effects at scale

58

Relationship to the instrumental variable technique: both utilize naturally occurring variation in the data.

Instrumental variable: assumptions are Exclusion and As-if-random.
Split-door criterion: an independence test is used to find natural experiments; the only assumption is that the auxiliary outcome is affected by the causes of the primary outcome.

By testing whether the treatment is independent of the auxiliary outcome, the split-door criterion requires a weaker dependence assumption for validity.

Page 59: Causal data mining: Identifying causal effects at scale

59

By testing whether the treatment is independent of the auxiliary outcome, the split-door criterion requires a weaker dependence assumption for validity.

[Figure: side-by-side causal graphs. Instrumental variable: Treatment → Outcome with Unobserved Confounders, where Exclusion must be assumed. Split-door criterion: Treatment → Outcome with Unobserved Confounders also driving an Auxiliary Outcome.]

Page 60: Causal data mining: Identifying causal effects at scale

Data from Amazon.com, collected via the Bing toolbar. Anonymized browsing logs (Sept 2013-May 2014):
- 23 M pageviews
- 2 M Bing Toolbar users
- 1.3 M Amazon products, of which 20 K products have at least 10 visits on any one day.
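A sketch of the kind of preprocessing this implies, using pandas; the file name and column names (timestamp, product_id) are hypothetical, since the talk does not describe the log schema.

```python
import pandas as pd

# Hypothetical log schema: one row per pageview with a timestamp and product_id.
logs = pd.read_csv("amazon_pageviews.csv", parse_dates=["timestamp"])

# Keep products with at least 10 visits on at least one day, mirroring the
# "20 K products with >= 10 visits on any one day" filter described above.
daily = (logs.assign(day=logs["timestamp"].dt.date)
             .groupby(["product_id", "day"]).size()
             .rename("visits").reset_index())
active = daily.loc[daily["visits"] >= 10, "product_id"].unique()
logs = logs[logs["product_id"].isin(active)]
print(f"{len(active)} products pass the activity filter")
```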

Page 61: Causal data mining: Identifying causal effects at scale

61

Constructed the sequence of visits for each user: search page → focal product page → recommended product page.

Page 62: Causal data mining: Identifying causal effects at scale

62

Recreating the sequence of visits from log data:

Timestamp             URL
2014-01-20 09:04:10   http://www.amazon.com/s/ref=nb_sb_noss_1?field-keywords=Cormac%20McCarthy
2014-01-20 09:04:15   http://www.amazon.com/dp/0812984250/ref=sr_1_2
2014-01-20 09:05:01   http://www.amazon.com/dp/1573225797/ref=pd_sim_b_1

Page 63: Causal data mining: Identifying causal effects at scale

63

Recreating the sequence of visits from log data:

2014-01-20 09:04:10   http://www.amazon.com/s/ref=nb_sb_noss_1?field-keywords=Cormac%20McCarthy   (user searches for Cormac McCarthy)
2014-01-20 09:04:15   http://www.amazon.com/dp/0812984250/ref=sr_1_2                               (user clicks on the second search result)
2014-01-20 09:05:01   http://www.amazon.com/dp/1573225797/ref=pd_sim_b_1                           (user clicks on the first recommendation)
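One way to turn such log rows into visit types is to key off the ref= tag in the URL, as in the sketch below. Only the three tag patterns visible in the excerpt above (nb_sb_noss on a search page, sr_* for search results, pd_sim_* for recommendations) are grounded in the slides; the rest of the mapping and the category names are my assumptions.

```python
import re
from urllib.parse import urlparse

# Classify each pageview from its URL, based on the ref= patterns above.
def classify_visit(url: str) -> str:
    if urlparse(url).path.startswith("/s/") or "field-keywords" in url:
        return "search_page"                      # a search-results page
    match = re.search(r"/ref=([^/?]+)", url)
    tag = match.group(1) if match else ""
    if tag.startswith("sr_"):
        return "search_click"                     # product reached from search results
    if tag.startswith("pd_"):
        return "recommendation_click"             # product reached from a recommendation
    return "direct_visit"                         # everything else: direct browsing

urls = [
    "http://www.amazon.com/s/ref=nb_sb_noss_1?field-keywords=Cormac%20McCarthy",
    "http://www.amazon.com/dp/0812984250/ref=sr_1_2",
    "http://www.amazon.com/dp/1573225797/ref=pd_sim_b_1",
]
for u in urls:
    print(f"{classify_visit(u):22s} <- {u}")
```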

Page 64: Causal data mining: Identifying causal effects at scale

I. Weekly and seasonal patterns in traffic, nearly tripling during the holidays.

Page 65: Causal data mining: Identifying causal effects at scale

65

II. 30% of pageviews come from recommendations

Page 66: Causal data mining: Identifying causal effects at scale

III. Books and eBooks are the most popular categories by far

Page 67: Causal data mining: Identifying causal effects at scale

67

Implementing the split-door criterion

[Figure: the time series of $\langle X, Y_D \rangle$ is split into consecutive multi-day windows, giving pairs $(x^{(1)}, y_D^{(1)}), (x^{(2)}, y_D^{(2)}), \ldots, (x^{(n)}, y_D^{(n)})$; each window that passes the independence test is treated as a natural experiment, and the valid windows are combined into a causal effect estimate.]

Page 68: Causal data mining: Identifying causal effects at scale

68

Implementing the split-door criterion:
1. Divide the data into t = 15 day periods.
2. For each time period:
   a) Using Fisher's test, find product pairs (X, Y) such that visits to the focal product (X) are independent of direct visits to the recommended product (Y_D).
   b) Compute the causal estimate from the recommendation click-throughs in these valid periods.
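A compact sketch of this loop is below. It is a simplification of the procedure described here: a plain Pearson correlation test stands in for the independence test (the talk mentions Fisher's test), a high p-value is treated as evidence of independence, and the per-window causal estimate is taken to be the ratio of recommendation click-throughs to focal-product visits. Function and variable names are mine; the toy data are made up.

```python
import numpy as np
from scipy import stats

# Simplified split-door mining loop: a Pearson correlation test stands in for
# the independence test, and a high p-value is treated as evidence that the
# focal-product visits X and the direct visits Y_D are independent in a window.
def split_door_estimates(x, y_r, y_d, window=15, p_cutoff=0.95):
    """x, y_r, y_d: daily counts for one (focal, recommended) product pair."""
    estimates = []
    for start in range(0, len(x) - window + 1, window):
        xs, yrs, yds = (v[start:start + window] for v in (x, y_r, y_d))
        if xs.std() == 0 or yds.std() == 0:
            continue                               # no variation, nothing to test
        _, p = stats.pearsonr(xs, yds)
        if p > p_cutoff:                           # window looks like a natural experiment
            estimates.append(yrs.sum() / xs.sum()) # causal click-through for this window
    return estimates

# Toy daily series for 60 days (made-up numbers, for illustration only).
rng = np.random.default_rng(3)
x = rng.poisson(50, 60).astype(float)              # visits to the focal product
y_r = rng.binomial(50, 0.05, 60).astype(float)     # recommendation click-throughs
y_d = rng.poisson(20, 60).astype(float)            # direct visits to the rec. product
print(split_door_estimates(x, y_r, y_d))
```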

Page 69: Causal data mining: Identifying causal effects at scale

69

Using the split-door criterion, we obtain 23,000 natural experiments for over 12,000 products.
1) The traditional IV method using Oprah Winfrey [Carmi et al.] yields 133 natural experiments.
2) The split-door criterion covers more than half of all ~20 K products.

Page 70: Causal data mining: Identifying causal effects at scale

70

[Figure: example time series of focal-product visits and direct visits, labeled VALID (independent) and INVALID (correlated).]

Page 71: Causal data mining: Identifying causal effects at scale

71

Observational click-through rate overestimates causal effect

Over half of the recommendation click-throughs would have happened anyway.

Page 72: Causal data mining: Identifying causal effects at scale

72

We can vary the confidence in the validity of the obtained natural experiments.

Page 73: Causal data mining: Identifying causal effects at scale

73

Page 74: Causal data mining: Identifying causal effects at scale

74

Similar, but more precise, causal estimates than simply using shocks.

Page 75: Causal data mining: Identifying causal effects at scale

75

Generalization? The distribution of products with a natural experiment is identical to the overall distribution.

Causal estimates are consistent with experimental findings (e.g., Belluf et al. [2012], Lee and Hosanagar [2014]).

Page 76: Causal data mining: Identifying causal effects at scale

76

Generalizable to all products on amazon.com?

- Shocks may be due to discounts or sales.
- Lower CTR may be due to the holiday season.

Page 77: Causal data mining: Identifying causal effects at scale

77

Generalization to all of Amazon.com?

- Split-door products are not a representative sample of all products, nor are the users who participate in them.
- But the split-door criterion covers more than half of all products with at least 10 visits on any single day.
- Causal estimates are consistent with experimental findings (e.g., Belluf et al. [2012], Lee and Hosanagar [2014]).

Page 78: Causal data mining: Identifying causal effects at scale

78

Potential applications: whenever an auxiliary outcome is available.
- Digital systems: recommender systems, ad systems, app notifications; any media website or app (such as newspapers).
- Offline contexts: discount mailers sent by stores; any two marketing channels.
- In the future: effects of medical treatments, teaching interventions, etc.

Page 79: Causal data mining: Identifying causal effects at scale

79

Summary: mining natural experiments at scale.

Unlike traditional natural experiments, the split-door criterion relies on fine-grained data to:
- verify the exclusion assumption [Robustness];
- cover a broad range of data [Generalizability].

It provides an offline metric for computing causal effects in digital systems (e.g., ad systems, media websites, app notifications). Code available for use.

Oprah [Carmi et al.]: 133 shocks, restricted to books.
Split-door criterion: 12,000 natural experiments, representative of the overall product distribution.

Page 80: Causal data mining: Identifying causal effects at scale

80

The spectrum: split-door, regression, and a natural experiment.

[Figure: amount of data plotted against the cutoff for likelihood of independence (ticks at 0, 0.80, 0.95, 1), with regression at 0, the split-door criterion around 0.80-0.95, and a single natural experiment at 1.]

Page 81: Causal data mining: Identifying causal effects at scale

81

Part 2: A general Bayesian test for natural experiments in any dataset

Page 82: Causal data mining: Identifying causal effects at scale

[Causal graph: Instrumental Variable (Z) → Cause (X) → Outcome (Y), with Unobserved Confounders (U); the questionable edges are As-if-random? (U → Z) and Exclusion? (Z → Y).]

Given observed data, can we determine whether it was generated from (a) the above model class (Invalid-IV), or (b) a model class without the red edges (Valid-IV)?

Page 83: Causal data mining: Identifying causal effects at scale

83

Observational Data

[Figure: candidate causal graphs over Cause (X), Outcome (Y), Unobserved Confounders (U), and I.V. (Z). Valid-IV models have $y = f(x, u)$; Invalid-IV models have $y = f(x, z, u)$, with Z entering the outcome directly or depending on U.]

Page 84: Causal data mining: Identifying causal effects at scale

84

Necessary test: by properties of the causal graph (Pearl 1993).

[Causal graph: I.V. (Z) → Cause (X) → Outcome (Y), with Unobserved Confounders (U).]

Page 85: Causal data mining: Identifying causal effects at scale

85

But we would like a sufficient test for instrumental variables.

Page 86: Causal data mining: Identifying causal effects at scale

86

A first try: compare model classes by maximum likelihood.

Every data distribution that can be generated by a Valid-IV model can also be generated by an Invalid-IV model.

$ML_{InvalidIV} = \max_{m' \in InvalidIV} P(Data \mid m')$

So the maximized likelihood of the Invalid-IV class is always at least as high as that of the Valid-IV class, and maximum likelihood alone cannot favor Valid-IV.
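A toy numerical illustration of this point (linear-Gaussian stand-ins of my own, not the talk's discrete setup): because the Invalid-IV class nests the Valid-IV class, its maximized likelihood is never lower, even when the data really come from a valid instrument.

```python
import numpy as np

# Linear-Gaussian stand-in: data truly come from a valid instrument, yet the
# larger Invalid-IV model (which allows a direct Z -> Y edge) never has a
# lower maximized likelihood than the Valid-IV model it nests.
rng = np.random.default_rng(4)
n = 5_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + u + rng.normal(size=n)
y = 0.5 * x + u + rng.normal(size=n)       # no direct effect of z on y

def max_loglik(design, y):
    """Gaussian maximum log-likelihood of y under a linear model on `design`."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    sigma2 = np.var(y - design @ beta)
    return -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)

ones = np.ones_like(y)
valid = np.column_stack([ones, x])         # Valid-IV: y depends on x only
invalid = np.column_stack([ones, x, z])    # Invalid-IV: also allows z -> y
print("max log-likelihood, Valid-IV  :", round(max_loglik(valid, y), 1))
print("max log-likelihood, Invalid-IV:", round(max_loglik(invalid, y), 1))  # >= Valid-IV
```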

Page 87: Causal data mining: Identifying causal effects at scale

87

Sufficiency is almost "impossible".

[Figure: the diamond represents all observable probability distributions P(X, Y | Z); a distribution that passes the necessary test can still be generated by both Valid-IV and Invalid-IV models.]

We can attain a weaker notion: probable sufficiency.

Page 88: Causal data mining: Identifying causal effects at scale

88

A "probably sufficient" criterion:

$ValidityRatio = \dfrac{P(ValidIV \mid Data)}{P(InvalidIV \mid Data)}$

Page 89: Causal data mining: Identifying causal effects at scale

89

Intuition

$ValidityRatio = \dfrac{P(ValidIV \mid Data)}{P(InvalidIV \mid Data)}$

[Figure: the Valid-IV model class (graphs without the extra edges, with generating functions $f_1, f_2, f_3, f_4$) and the Invalid-IV model class (graphs with those edges, with functions $g_1, \ldots, g_4, h_3, h_4$) can each generate the observational data; the ratio compares how probable the data is under each class.]

Page 90: Causal data mining: Identifying causal effects at scale

90

Can formalize as a Bayesian model comparison:
- Develop a generative meta-model of the data.
- Compare the marginal likelihoods of Valid-IV versus Invalid-IV models.

Data is likely to be generated from a Valid-IV model if ValidityRatio ≫ 1.

Page 91: Causal data mining: Identifying causal effects at scale

91

Computing the Validity Ratio: two problems.
1. Each causal model contains an unobserved variable U.
2. There are infinitely many causal models in each sub-class.

Page 92: Causal data mining: Identifying causal effects at scale

92

I. Use a response-variable framework (assumes discrete variables), in which the unobserved $u$ determines which response function $y = f(x, u)$ applies.
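A minimal sketch of the response-variable idea for the simplest case, assuming binary X and Y (my assumption for illustration): the unobserved u only matters through which of four possible response functions it selects, so a distribution over four types replaces the arbitrary U.

```python
# Binary-variable sketch: the effect of the unobserved u on y is captured by
# which of the four possible response functions y = f(x) it selects.
RESPONSE_FUNCS = [
    lambda x: 0,      # y is always 0
    lambda x: x,      # y follows x
    lambda x: 1 - x,  # y opposes x
    lambda x: 1,      # y is always 1
]

def p_y1_given_x(theta):
    """theta: probabilities of the four response types (must sum to 1).
    Returns P(Y = 1 | X = x) implied by mixing over the response functions."""
    return {x: sum(t * f(x) for t, f in zip(theta, RESPONSE_FUNCS)) for x in (0, 1)}

print(p_y1_given_x([0.1, 0.6, 0.1, 0.2]))   # {0: 0.3, 1: 0.8}
```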

Page 93: Causal data mining: Identifying causal effects at scale

93

II. A non-standard integral over infinitely many models.

Denominator (Invalid-IV): a closed-form solution was derived, using properties of Dirichlet and hyper-Dirichlet distributions and the Laplace transform.

Numerator (Valid-IV): no closed-form solution exists; Monte Carlo methods (annealed importance sampling) are used to approximate it.
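The sketch below mimics the flavor of this computation on a deliberately simplified problem of my own (just Z and Y, no X or U, and plain Monte Carlo over the prior instead of annealed importance sampling): the "valid" model forces Y to be independent of Z, the "invalid" model does not, and the ratio of their approximate marginal likelihoods plays the role of the ValidityRatio.

```python
import numpy as np

# Toy validity-ratio computation: marginal likelihoods of an "independent"
# model (Y unaffected by Z) and a "dependent" model (Y may differ by Z),
# approximated by plain Monte Carlo over near-uniform priors.
rng = np.random.default_rng(5)
n = 200
z = rng.integers(0, 2, size=n)
y = rng.binomial(1, 0.3, size=n)           # here Y really is independent of Z

def loglik_indep(t):                       # one success probability for Y
    return np.sum(y * np.log(t[0]) + (1 - y) * np.log(1 - t[0]))

def loglik_dep(t):                         # P(Y=1) allowed to differ by Z
    p = np.where(z == 1, t[1], t[0])
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def marginal_lik(loglik, n_params, draws=20_000):
    thetas = rng.uniform(0.001, 0.999, size=(draws, n_params))  # prior samples
    return np.mean([np.exp(loglik(t)) for t in thetas])

ratio = marginal_lik(loglik_indep, 1) / marginal_lik(loglik_dep, 2)
print(f"ValidityRatio-style Bayes factor: {ratio:.2f}")   # > 1 favors independence
```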

Page 94: Causal data mining: Identifying causal effects at scale

94

Use the NPS (necessary and probably sufficient) test to validate IV studies from the American Economic Review: collected studies from the American Economic Review (AER) with "instrumental variable" in the title or abstract.

Page 95: Causal data mining: Identifying causal effects at scale

95

Many recent studies from the American Economic Review do not pass the test. (Studies collected from AER with "instrumental variable" in the title or abstract.)

Study / Validity Ratio
Effect of Mexican immigration on crime in the United States (2015): 0.07
Effect of subsidy manipulation on Medicare premiums (2015): 1.02
Effect of credit supply on housing prices (2015): 0.01
Effect of Chinese import competition on local labor markets (2013): 0.3
Effect of rural electrification on employment in South Africa (2011): 3.6
Expt: National Job Training Partnership Act (JTPA) Study (2002): 3.4

Page 96: Causal data mining: Identifying causal effects at scale

Conclusion: causal data mining enables causal inference from large-scale data.

It challenges the decades-long belief that causal assumptions cannot be tested from data: we can use data mining to identify causal effects in large-scale data.

Two recipes:
- Create new graphical structures that identify the causal effect: the split-door criterion.
- Use Bayesian modeling to test instrumental variables: the NPS test.

96

Page 97: Causal data mining: Identifying causal effects at scale

97

More generally, a viable methodology for causal inference in large datasets: starting from $\langle X, Y \rangle$, develop tests for the validity of a natural variation, then mine for such valid variations in observational data.

Page 98: Causal data mining: Identifying causal effects at scale

98

More generally, a viable methodology for causal inference in large datasets.

Hard-to-find variations: lotteries, weather, shocks, discontinuities.

Other natural variations: a change in access to digital services, a change in medicines at a hospital, a change in train stops in a city.

Page 99: Causal data mining: Identifying causal effects at scale

99

Future work

[Figure: methods arranged by amount of data (10^2 to 10^10) and ability to experiment: controlled experiments, A/B tests, contextual bandits, split-door, IV test, causal algorithms, warm start (choosing experiments), online + offline.]

Page 100: Causal data mining: Identifying causal effects at scale

100

Future work: Causal inference and machine learning

Causal inference as robust prediction.

Causal inference: the predicted value under the counterfactual distribution P'(X, y).
(Supervised) ML: the predicted value under the training distribution P(X, y).

Page 101: Causal data mining: Identifying causal effects at scale

101

Thank you!
Amit Sharma, http://www.amitsharma.in

1. Hofman, Sharma, and Watts (2017). Prediction and explanation in social systems. Science, 355.6324.

2. Sharma (2016). Necessary and probably sufficient test for finding instrumental variables. Working paper.

3. Sharma, Hofman, and Watts (2016). Split-door criterion for causal identification: An algorithm for finding natural experiments. Under review at Annals of Applied Statistics (AOAS).

4. Sharma, Hofman, and Watts (2015). Estimating the causal impact of recommendation systems from observational data. In Proceedings of the 16th ACM Conference on Economics and Computation.

Page 102: Causal data mining: Identifying causal effects at scale

102

References
1. Angrist and Pischke (2008). Mostly harmless econometrics: An empiricist's companion. Princeton Univ. Press.
2. Belluf, Xavier and Giglio (2012). Case study on the business value impact of personalized recommendations on a large online retailer. In Proc. ACM Conf. on Recommender Systems.
3. Carmi, Oestreicher-Singer and Sundararajan (2012). Is Oprah contagious? Identifying demand spillovers in online networks. SSRN 1694308.
4. Dunning (2012). Natural experiments in the social sciences: A design-based approach. Cambridge University Press.
5. Gomez-Uribe and Hunt (2015). The Netflix recommender system: Algorithms, business value and innovation. ACM Transactions on Management Information Systems.
6. Lee and Hosanagar (2014). When do recommender systems work the best? The moderating effects of product attributes and consumer reviews on recommender performance. In Proc. ACM World Wide Web Conference.

Page 103: Causal data mining: Identifying causal effects at scale

103

References (continued)
7. Lin, Goh and Heng (2013). The demand effects of product recommendation networks: An empirical analysis of network diversity and stability. SSRN 2389339.
8. Linden, Smith and York (2003). Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing.
9. Mulpuru (2006). What you need to know about third-party recommendation engines. Forrester Research.
10. Oestreicher-Singer and Sundararajan (2012). The visible hand? Demand effects of recommendation networks in electronic markets. Management Science.
11. Pearl (2009). Causality: Models, reasoning and inference. Cambridge Univ Press.
12. Sharma and Yan (2013). Pairwise learning in recommendation: Experiments with community recommendation on LinkedIn. In ACM Conf. on Recommender Systems.