Uncertainty in Online Experiments with Dependent Data (KDD 2013 presentation)

Eytan Bakshy and Dean Eckles, Facebook Data Science. ACM KDD 2013, Chicago, IL. "Uncertainty in Online Experiments with Dependent Data: An Evaluation of Bootstrap Methods"


DESCRIPTION

Presentation from KDD 2013 paper with Dean Eckles: Uncertainty in Online Experiments with Dependent Data: An Evaluation of Bootstrap Methods. See http://arxiv.org/pdf/1304.7406v2.pdf

TRANSCRIPT

Page 1: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

facebook data science

Eytan Bakshy, Dean Eckles — Facebook Data Science

ACM KDD 2013, Chicago, IL, August 12, 2013

Uncertainty in Online Experiments with Dependent Data: An Evaluation of Bootstrap Methods

Page 2

Outline
▪ Motivation
▪ A model of user–item experiments and its variance
▪ The bootstrap
▪ Evaluation of bootstrap methods using search, ads, and feed data
  ▪ Under the sharp null (no effect)
  ▪ Under the null with effects
▪ Takeaways

Page 3
Page 4

http://www.tucsonsentinel.com/local/report/"%!&#!_az_students/state-natl-rankings-vary-widely-az-student-performance/

100 items, 100 users, 100,000 impressions. Does N = 100,000?

Page 5

A ranking experiment

Exposure matrices (rows: users; columns: items; 1 = user exposed to item, 0 = not exposed):

d = 0        Item 1   Item 2   Item 3   Item 4   Item 5
  User 1       1        1        1        0        0
  User 2       1        1        0        0        0
  User 3       0        1        0        1        0
  User 4       0        0        0        0        1

d = 1        Item 1   Item 2   Item 3   Item 4   Item 5
  User 1       1        0        1        1        0
  User 2       1        0        1        0        0
  User 3       0        1        0        1        0
  User 4       1        0        0        0        0

Many ranking experiments affect the pattern of exposure...

Z^(0): pattern of exposure under control; Z^(1): pattern of exposure under treatment

Page 6

A ranking experiment

Observed responses (rows: users; columns: items; "-" = not exposed, so no response is observed):

d = 0        Item 1    Item 2    Item 3    Item 4    Item 5
  User 1     Y_{1,1}   Y_{1,2}   Y_{1,3}   -         -
  User 2     Y_{2,1}   Y_{2,2}   -         -         -
  User 3     -         Y_{3,2}   -         Y_{3,4}   -
  User 4     -         -         -         -         Y_{4,5}

d = 1        Item 1    Item 2    Item 3    Item 4    Item 5
  User 1     Y_{1,1}   -         Y_{1,3}   Y_{1,4}   -
  User 2     Y_{2,1}   -         Y_{2,3}   -         -
  User 3     -         Y_{3,2}   -         Y_{3,4}   -
  User 4     Y_{4,1}   -         -         -         -

Many ranking experiments affect the pattern of exposure, but not the potential outcomes:

    Y = Y^(0) = Y^(1)

(potential outcomes under control = potential outcomes under treatment)
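A hypothetical numeric sketch of this point (all sizes and click rates below are invented, not from the slides): if potential outcomes are identical under both conditions, merely changing which items are exposed still produces a non-zero difference in observed means.

```python
import numpy as np

rng = np.random.default_rng(3)
n_users = 1000

# hypothetical per-item click probabilities; identical under both conditions
item_rate = np.linspace(0.05, 0.5, 20)
Y = rng.binomial(1, np.tile(item_rate, (n_users, 1)))  # Y = Y(0) = Y(1)

treated = np.arange(n_users) < n_users // 2
# control exposes items 0-9, treatment exposes items 10-19: only Z changes
ctrl_mean = Y[~treated][:, :10].mean()
treat_mean = Y[treated][:, 10:].mean()
delta_hat = treat_mean - ctrl_mean  # non-zero despite identical potential outcomes
```

Here the naive difference in means picks up the change in exposure pattern, not any effect on responses.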

Page 7

A user interface experiment

    Z^(0) = Z^(1)

UI experiments may not affect the pattern of exposure, but do affect responses to items:

    Y^(0) < Y^(1)

The pattern of exposure is the same for treatment (d = 1) and control (d = 0):

d = i        Item 1       Item 2       Item 3       Item 4       Item 5
  User 1     Y_{1,1}^(i)  Y_{1,2}^(i)  Y_{1,3}^(i)  -            -
  User 2     Y_{2,1}^(i)  Y_{2,2}^(i)  -            -            -
  User 3     -            Y_{3,2}^(i)  -            Y_{3,4}^(i)  -
  User 4     -            -            -            -            Y_{4,5}^(i)

The response to the same item differs depending on treatment status.

Page 8

Basic crossed random effects model

This paper describes sources and consequences of dependence in common applications of experimentation to Internet services. We posit a general data generating process and illustrate how experimental assignment procedures and common effects of units (e.g., users and ads) affect the true uncertainty about experimental comparisons. We then evaluate independent, one-way, and multi-way bootstrap methods for computing confidence intervals using null experiments ("A/A tests") on three real datasets from Facebook: clicks on advertisements, search results, and content in the News Feed. We additionally modify these datasets to simulate systematic imbalance in items across conditions, as would result from changes to CTR prediction or ranking models. To examine performance under additional deviations from the sharp null, including treatment effects, we conduct additional simulations using a realistic probit random effects model.

Our primary contribution is providing guidance about when accounting for dependence among observations is most important: while previous work has shown that neglecting all dependence structure results in massive overconfidence, less work has examined how accounting for some sources of dependence, but not others, affects inference in practice. We conclude that analysts should certainly use an inferential procedure that accounts for dependence among observations of the units assigned to conditions (e.g., users), but whether additionally accounting for secondary units (e.g., ads, search results, links) is needed to avoid misleading inference is more likely to depend on the specific (usually partially unknown) deviations from the hypothesis that the experiment has no effects.

The literature on routine Internet-scale experimentation (e.g., [8, 14]) largely does not address such questions about statistical inference. Some authors [14] suggest conducting null experiments to evaluate one's experimentation tools, but little is said about how these should be conducted and exactly what problems they can detect. We intend that, in addition to our results, this paper provides a blueprint for other experimenters who wish to evaluate and choose among inferential procedures in their own settings.

2. DEPENDENCE IN EXPERIMENTS

Many experiments allow observing the same units repeatedly. We may observe responses from the same person many times and also observe responses to the same items many times. In this section, we examine how this affects our estimation of contrasts between experimental conditions, such as differences in means between treatment and control, i.e., the average treatment effect (ATE).¹

Recent work in applied econometrics has been concerned with dependence due to clustering in data. It is now routine for work in empirical economics to consider and account for dependence in observations produced by one or more types of units. This is reflected in the fact that a recent paper by Cameron et al. [6] on dealing with dependence due to observing two or more types of units repeatedly has been cited over 550 times.² Concerns about such dependence have been featured centrally in methodological work in the context of a growing number of field experiments in economics and other social sciences [12]. Similarly, work on two-way and tensor data in the context of recommender systems and observational comparisons has emphasized the importance of accounting for multiway dependency [18, 19]. And in psychometrics [5] and psycholinguistics [2], investigators have identified problems with ignoring either of two sources of dependence.

As practitioners conducting and analyzing massive Internet experiments, the degree of attention given to this area suggests a need to consider the consequences of dependence for our data. We present our effort to understand whether it would be necessary, in order to have inferential procedures with good performance, to account for multiple units causing dependence in our data, or whether a single unit would suffice.

¹There are likely other sources of dependence among observations in online experiments, including some arising from general equilibria in advertising auctions, peer effects, and other "spillovers". In this paper, we restrict our attention to dependence due to repeated observations of the same units, for which we have inferential procedures, while these other sources of dependence take us into active areas of research beyond the scope of this paper.

2.1 Random effects model

We use the random effects model to illustrate how dependence can affect uncertainty in ATEs, and motivate the use of the bootstrap. Random effects models provide a natural and general way to describe outcomes for data generated by combinations of units, called random effects. In the two-way crossed random effects model [2, 24], each observation is generated by some function f of a linear combination of a grand mean, μ, a random effect α_i for the first variable, which (without loss of generality) we take to be the idiosyncratic deviation for user i, and a second random variable β_j for the idiosyncratic deviation for item j (e.g., an ad, a search result, a URL). Finally, we have an error term ε_ij for each user's idiosyncratic response to each item.³ This final term could be caused by a number of factors, including how relevant the item is to the user. Thus, we have the model

    Y_ij = f(μ + α_i + β_j + ε_ij)
    α_i ∼ H(0, σ²_{α_i}),  β_j ∼ H(0, σ²_{β_j}),  ε_ij ∼ H(0, σ²_{ε_ij}).

Each random effect is modeled as being drawn from some distribution H with zero mean and some variance. In the homogeneous random effects model, this variance is the same for each user or item (i.e., σ_{α_i} = σ_α), whereas in the heterogeneous random effects model, each variable or group of variables may have its own variance.

2.1.1 Comparisons of means

We extend the basic random effects model above to consider multiple treatments and develop expressions for the variance of a difference in means between experimental conditions. This illustrates how repeatedly observing the same units, and which units are randomly assigned to conditions, affects this variance.

Let, without loss of generality, users (rather than items) be assigned to experimental conditions. That is, all observations of the same user have the same experimental condition, such that D_ij = D_ij′ for all j ≠ j′. As in other work on random effects models where we observe only a small

²Citation count according to Google Scholar [2013-02-22].
³For expository simplicity, we consider only a single observation of each user–item pair. An additional error term can be included when there are repeated observations of pairs.

Some function (often identity, logit, or probit)

User effect / item effect

User-item interaction effect (error term)

Effects are random variables

Variances may be the same or different for each user or item

Random effects models are a general way of describing a data-generating process
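The crossed random effects model above can be sketched in a few lines. This is a minimal simulation assuming the identity link for f; the sizes and variance components are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items = 500, 200
s_alpha, s_beta, s_eps = 1.0, 0.5, 2.0  # illustrative variance components

mu = 0.0
alpha = rng.normal(0.0, s_alpha, n_users)         # user effects alpha_i
beta = rng.normal(0.0, s_beta, n_items)           # item effects beta_j
eps = rng.normal(0.0, s_eps, (n_users, n_items))  # user-item error eps_ij

# Y_ij = f(mu + alpha_i + beta_j + eps_ij), with f = identity
Y = mu + alpha[:, None] + beta[None, :] + eps
```

Observations sharing a user are correlated (covariance σ²_α), and observations sharing an item are correlated (covariance σ²_β); this is exactly the dependence the rest of the deck is concerned with.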

Page 9

A random effects model for experiments

number of combinations of units [18, 19], we work conditional on D.

For the sake of exposition, we restrict our attention to linear models with normally distributed random effects. That is, the following analysis considers cases where Y is unbounded, f is the identity function, and random effects are drawn from a multivariate normal distribution, so that

    Y_ij^(d) = μ^(d) + α_i^(d) + β_j^(d) + ε_ij^(d)
    α⃗_i ∼ N(0, Σ_α),  β⃗_j ∼ N(0, Σ_β),  ε⃗_ij ∼ N(0, Σ_ε).    (1)

Note that α⃗_i, etc., are vectors, where each element corresponds to the random effect of a unit under a given treatment.

We wish to estimate quantities comparing outcomes for different values of D_ij — most simply, the difference in means, or average treatment effect (ATE) for a binary treatment

    δ ≡ E[Y_ij^(1) | D_ij = 1] − E[Y_ij^(0) | D_ij = 0] = μ^(1) − μ^(0).

While this difference cannot be directly observed from the data (since user–item pairs can only be assigned to one condition at a time), we can estimate δ with the difference in sample means⁴ [21]. Our focus is then to consider the true variance of this estimator of δ and, later, bootstrap methods for estimating that variance.

The sample mean for each condition is

    Ȳ^(d) = μ^(d) + (1/n_••^(d)) ( Σ_i n_i•^(d) α_i^(d) + Σ_j n_•j^(d) β_j^(d) + Σ_i Σ_j ε_ij^(d) )

where, e.g., n_i•^(d) is the number of observations of user i in condition d. We then estimate the ATE with δ̂ = Ȳ^(1) − Ȳ^(0).

Consider the case where the treatment and control groups are of equal size such that N = n_••^(1) = n_••^(0). This enables simplifying the expression for δ̂ and its variance to

    δ̂ = δ + (1/N) [ Σ_i ( n_i•^(1) α_i^(1) − n_i•^(0) α_i^(0) )
                  + Σ_j ( n_•j^(1) β_j^(1) − n_•j^(0) β_j^(0) )
                  + Σ_i Σ_j ε_ij^(D_ij) ]

and

    V[δ̂] = (1/N²) [ Σ_i ( (n_i•^(1))² σ²_{α(1)} + (n_i•^(0))² σ²_{α(0)} )
                  + Σ_j ( (n_•j^(1))² σ²_{β(1)} + (n_•j^(0))² σ²_{β(0)} − 2 n_•j^(0) n_•j^(1) σ_{β(0),β(1)} )
                  + Σ_i Σ_j ( (n_ij^(1))² σ²_{ε(1)} + (n_ij^(0))² σ²_{ε(0)} ) ].    (2)

The first term is the contribution of random effects of users to the variance, and the second is the contribution of the random effects of items. The covariance term, present for items, is absent for users and user–item pairs since each is only observed in either the treatment or control.

To further simplify, we can introduce coefficients measuring how much units are duplicated in the data. Following previous work [18, 19], we define

    ν_A^(d) ≡ (1/N) Σ_i (n_i•^(d))²,    ν_B^(d) ≡ (1/N) Σ_j (n_•j^(d))²,

which are the average number of observations sharing the same user (the ν_A's) or item (the ν_B's) as an observation (including itself). For the units assigned to conditions (in this case, users), either n_i•^(0) or n_i•^(1) is zero for each i; for the non-assigned units (items), we need a measure of this between-condition duplication

    ω_B ≡ (1/N) Σ_j n_•j^(0) n_•j^(1).

Under the homogeneous random effects model (1), we can then simplify (2) to

    V[δ̂] = (1/N) [ ν_A^(1) σ²_{α(1)} + ν_A^(0) σ²_{α(0)}
                 + ν_B^(1) σ²_{β(1)} + ν_B^(0) σ²_{β(0)} − 2 ω_B σ_{β(0),β(1)}
                 + σ²_{ε(0)} + σ²_{ε(1)} ].    (3)

This expression makes clear that if the random effects for items in the treatment and control are correlated (as we would usually expect), then an increase in the balance of how often items appear in each condition reduces the variance of the estimated treatment effect.

2.1.2 Sharp and non-sharp null hypotheses

Under the sharp null hypothesis, the treatment has no average or interaction effects; that is, the outcome for a particular user or item is the same regardless of treatment assignment. In the context of our model, this would mean that the variances of the random effects are equal in both conditions and are perfectly correlated across conditions, such that, in addition to δ = 0,

    σ²_{α(1)} = σ²_{α(0)} = σ_{α(0),α(1)}
    σ²_{β(1)} = σ²_{β(0)} = σ_{β(0),β(1)}
    σ²_{ε(1)} = σ²_{ε(0)} = σ_{ε(0),ε(1)}.    (4)

In this case, only random effects for items that are not balanced across conditions contribute to the variance of our ATE estimate: the contribution a single item j makes to the variance simplifies to (n_•j^(0) − n_•j^(1))² σ²_β; that is, it depends only on the squared difference in duplication between treatment and control. It is easy to show that

    V[δ̂] = (1/N) [ (ν_A^(1) + ν_A^(0)) σ²_α + Δ_B σ²_β + 2 σ²_ε ],    (5)

where Δ_B ≡ (1/N) Σ_j (n_•j^(0) − n_•j^(1))² measures the average between-condition imbalance in the duplication of observations of items. If items, like users, also only appear in either treatment or control, then Δ_B = ν_B^(1) + ν_B^(0), highlighting the resulting symmetry between users' and items' contributions to our uncertainty.

When Eq. 4 does not hold, we say that there are interaction effects of the treatment and units; for example, there may be an item–treatment interaction effect.

⁴For true experiments, D is randomly assigned, but under some circumstances (i.e., conditional ignorability) treatment effects may be estimated without randomization [7, 17, 21].

Response of user i under treatment d

Average effect under d
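The duplication coefficients ν_A^(d) and ν_B^(d) and the between-condition duplication ω_B are straightforward to compute from an observation log. This is a minimal sketch on an invented eight-observation dataset (users assigned to conditions, so d is constant within user).

```python
import numpy as np

# toy log: one row per (user, item) observation, with condition d
user = np.array([1, 1, 2, 2, 3, 3, 4, 4])
item = np.array([1, 2, 1, 3, 2, 3, 1, 2])
d    = np.array([1, 1, 1, 1, 0, 0, 0, 0])

def nu(counts, N):
    # average number of observations sharing a unit with an observation
    # (including itself): (1/N) * sum of squared per-unit counts
    return (counts.astype(float) ** 2).sum() / N

N1, N0 = (d == 1).sum(), (d == 0).sum()
nu_A1 = nu(np.bincount(user[d == 1]), N1)  # user duplication, treatment
nu_B1 = nu(np.bincount(item[d == 1]), N1)  # item duplication, treatment
nu_B0 = nu(np.bincount(item[d == 0]), N0)  # item duplication, control

# between-condition item duplication omega_B
n1 = np.bincount(item[d == 1], minlength=item.max() + 1)
n0 = np.bincount(item[d == 0], minlength=item.max() + 1)
omega_B = (n0 * n1).sum() / N1
```

These quantities plug directly into the simplified variance expression (3); large ν's signal heavy duplication, and large ω_B signals items appearing in both conditions.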


d might affect units in different ways

True difference in means

Model of potential outcomes for thinking about ads, search, and feed experiments

2.1.2 Difference in means for user–item experiments

We wish to estimate quantities comparing outcomes observed under different values of D_i — most simply, the difference in means for a binary treatment

    δ ≡ E[Y_ij^(1) | Z_ij^(1) = 1] − E[Y_ij^(0) | Z_ij^(0) = 1].

Experiments can produce a non-zero δ simply by changing this pattern of exposure. For example, a search ranking experiment could primarily have effects by changing which items are displayed as results (and thus observed). At the extreme, it could be that the potential outcomes are identical under treatment and control, Y_ij^(0) = Y_ij^(1) for all i, j, but that the pattern of exposure is different (Z_ij^(0) ≠ Z_ij^(1)), such that δ ≠ 0.

Other experiments can produce a non-zero δ while leaving the pattern of exposure identical (i.e., Z^(0) = Z^(1)) or otherwise ignorably similar. For example, an experiment might display the same item slightly differently, so that Y_ij^(0) ≠ Y_ij^(1) for some i, j. In this case, δ is then an average treatment effect (ATE) since it is a difference in means for the same units [20].

If for all i, j, Z_ij^(0) = Z_ij^(1), the pattern of exposure is the same and δ is an ATE,

    δ = E[Y_ij^(1) − Y_ij^(0) | Z_ij = 1].

For analytical and expository simplicity, the remainder of this section assumes that the pattern of exposure is the same⁴ under the treatment and control: Z_ij ≡ Z_ij^(0) = Z_ij^(1).

2.1.3 Estimates of average treatment effects

We extend the basic random effects model above to include experimental conditions that may affect the delivery of items to subjects and subjects' responses to items. We then derive expressions for the variance of the difference in mean outcomes between conditions to illustrate how repeatedly observing the same units, and which units are randomly assigned to conditions, influences estimates of experimental effects.

For analytical and expository simplicity, we restrict our attention to linear models with normally distributed random effects. That is, the following analysis considers cases where Y is unbounded, f is the identity function, and random effects are drawn from a multivariate normal distribution, so that

    Y_ij^(d) = μ^(d) + α_i^(d) + β_j^(d) + ε_ij^(d)
    α⃗_i ∼ N(0, Σ_α),  β⃗_j ∼ N(0, Σ_β),  ε⃗_ij ∼ N(0, Σ_ε).    (1)

Note that we define α⃗_i, etc., as vectors so that their effects may be more or less similar (correlated) across conditions.

The sample mean for each condition is

    Ȳ^(d) = μ^(d) + (1/n_••^(d)) ( Σ_i n_i•^(d) α_i^(d) + Σ_j n_•j^(d) β_j^(d) + Σ_i Σ_j n_ij^(d) ε_ij^(d) )

where, e.g., n_j•^(d) is the number of observations of item j in condition d (i.e., n_j•^(d) = Σ_i Z_ij 1{D_i = d}). We then estimate the difference in outcomes with δ̂ = Ȳ^(1) − Ȳ^(0).

Consider the case where the treatment and control groups are of equal size such that N = n_••^(1) = n_••^(0). This enables simplifying the expression for δ̂ and its variance to

    δ̂ = δ + (1/N) [ Σ_i ( n_i•^(1) α_i^(1) − n_i•^(0) α_i^(0) )
                  + Σ_j ( n_•j^(1) β_j^(1) − n_•j^(0) β_j^(0) )
                  + Σ_i Σ_j n_ij^(D_i) ε_ij^(D_i) ]

and

    V[δ̂] = (1/N²) [ Σ_i ( (n_i•^(1))² σ²_{α(1)} + (n_i•^(0))² σ²_{α(0)} )
                  + Σ_j ( (n_•j^(1))² σ²_{β(1)} + (n_•j^(0))² σ²_{β(0)} − 2 n_•j^(0) n_•j^(1) σ_{β(0),β(1)} )
                  + Σ_i Σ_j ( (n_ij^(1))² σ²_{ε(1)} + (n_ij^(0))² σ²_{ε(0)} ) ].    (2)

The first term is the contribution of random effects of users to the variance, and the second is the contribution of the random effects of items. The covariance term σ_{β(0),β(1)}, present for items, is absent for users and user–item pairs since each is only observed in either the treatment or control.

To further simplify, we can introduce coefficients measuring how much units are duplicated in the data. Following previous work [17, 18], we define

    ν_A^(d) ≡ (1/N) Σ_i (n_i•^(d))²,    ν_B^(d) ≡ (1/N) Σ_j (n_•j^(d))²,

which are the average number of observations sharing the same user (the ν_A's) or item (the ν_B's) as an observation (including itself). For the units assigned to conditions (in this case, users), either n_i•^(0) or n_i•^(1) is zero for each i; for the non-assigned units (items), we need a measure of this between-condition duplication

    ω_B ≡ (1/N) Σ_j n_•j^(0) n_•j^(1).

Under the homogeneous random effects model (1), we can then simplify (2) to

    V[δ̂] = (1/N) [ ν_A^(1) σ²_{α(1)} + ν_A^(0) σ²_{α(0)}
                 + ν_B^(1) σ²_{β(1)} + ν_B^(0) σ²_{β(0)} − 2 ω_B σ_{β(0),β(1)}
                 + σ²_{ε(0)} + σ²_{ε(1)} ].    (3)

This expression makes clear that if the random effects for items in the treatment and control are correlated (as we would usually expect), then an increase in the balance of how often items appear in each condition reduces the variance of the estimated treatment effect.

⁴The results for the more general case require introducing a model for exposure such that the random effects share common causes with the missingness. The σ terms in variance expressions are then replaced with variances conditional on data being observed.
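Under the sharp null, the simplified variance formula (5) can be checked by Monte Carlo. This sketch assumes an invented, fully crossed, perfectly balanced design (every user sees every item, users split evenly between conditions), so the item term drops out of (5); all sizes and variances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, R = 200, 50, 2000
s_a, s_e = 1.0, 1.0  # illustrative user and error standard deviations
treated = np.arange(n_users) < n_users // 2  # users are the assigned units

deltas = np.empty(R)
for r in range(R):
    alpha = rng.normal(0, s_a, n_users)
    beta = rng.normal(0, 0.5, n_items)
    eps = rng.normal(0, s_e, (n_users, n_items))
    # sharp null: the same outcomes regardless of assignment
    Y = alpha[:, None] + beta[None, :] + eps
    deltas[r] = Y[treated].mean() - Y[~treated].mean()

# Eq. (5): N = (n_users/2)*n_items observations per condition;
# nu_A^(d) = n_items for each condition; items are perfectly balanced,
# so Delta_B = 0 and the beta term vanishes.
N = (n_users // 2) * n_items
analytic = (2 * n_items * s_a**2 + 2 * s_e**2) / N
```

The empirical variance of `deltas` should agree with `analytic`, and both are far larger than the naive independent-observations variance, which ignores the user term entirely.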


If Z = Z^(0) = Z^(1), δ is an ATE

Page 10

"Clicky" users (D = 0)

KDD attendees (D = 1)

"Bad luck" in an A/A test: the treatment has no true effect, but the estimated CTR (mean click rate) is higher for the control.

3/5 clicks for D = 0; 1/3 clicks for D = 1

d = i        Item 1       Item 2       Item 3       Item 4       Item 5
  User 1     Y_{1,1}^(0)  Y_{1,2}^(0)  Y_{1,3}^(0)  Y_{1,4}^(0)  -
  User 2     Y_{2,1}^(0)  -            -            -            -
  User 3     -            -            Y_{3,3}^(1)  Y_{3,4}^(1)  -
  User 4     -            -            -            -            Y_{4,5}^(1)


Ȳ^(0) = 3/5 clicks

Ȳ^(1) = 1/3 clicks

δ̂ = 1/3 − 3/5 ≈ −0.27


“Good” ads “Bad” ads

Page 11: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Variance of ATE for user-item experiments

number of combinations of units [18, 19], we work condi-tional on D.

For the sake of exposition, we restrict our attention to lin-ear models with normally distributed random e↵ects. Thatis, the following analysis considers cases where Y is un-bounded, f is the identity function, and random e↵ects aredrawn from a multivariate normal distribution, so that

Y (d)ij = µ(d) + ↵(d)

i + �(d)j + "(d)ij

~↵i ⇠ N (0,⌃↵), ~�j ⇠ N (0,⌃�), ~"ij ⇠ N (0,⌃"). (1)

Note that ~↵i, etc., are vectors, where each element corre-sponds to the random e↵ect of a unit under a given treat-ment.

We wish to estimate quantities comparing outcomes fordi↵erent values of Dij — most simply, the di↵erence inmeans, or average treatment e↵ect (ATE) for a binary treat-ment

� ⌘ E[Y (1)ij | Dij = 1]� E[Y (0)

ij | Dij = 0]

= µ(1) � µ(0).

While this di↵erence cannot be directly observed from thedata (since user–item pairs can only be assigned to one con-dition at a time), we can estimate � with the di↵erence insample means4 [21]. Our focus is then to consider the truevariance of this estimator of � and, later, bootstrap methodsfor estimating that variance.

The sample mean for each condition is

Y (d) = µ(d)+1

n(d)••

X

i

n(d)i• ↵(d)

i +X

j

n(d)•j �(d)

j +X

i

X

j

"(d)ij

!

where, e.g., n(d)i• is the number of observations of user i in

condition d. We then estimate the ATE with � = Y (1)�Y (0).Consider the case where the treatment and control groups

are of equal size such that N = n(1)•• = n(0)

•• . This enablessimplifying the expression for � and its variance to

� =1N

X

i

⇣�n(1)i•

�2↵(1)i � �n(0)

i•

�2↵(0)i

⌘+

X

j

⇣�n(1)

•j

�2�(1)j � �n(0)

•j

�2�(0)j

⌘+X

i

X

j

"(Dij)ij

and

V[�] =1N2

X

i

✓�n(1)i•

�2�2↵(1) +

�n(0)i•

�2�2↵(0)

+X

j

✓�n(1)

•j

�2�2�(1) +

�n(0)

•j

�2�2�(0) � 2n(0)

•j n(1)•j �

2�(0),�(1)

+X

i

X

j

✓�n(1)ij

�2"(1)ij � �n(0)

ij

�2"(0)ij

◆�.

(2)

The first term is the contribution of random e↵ects of usersto the variance, and the second is the contribution of therandom e↵ects of items. The covariance term, present foritems, is absent for users and user–item pairs since each isonly observed in either the treatment or control.

4For true experiments, D is randomly assigned, but undersome circumstances (i.e. conditional ignorability) treatmente↵ects may be estimated without randomization [7, 17, 21].

To further simplify, we can introduce coe�cients measur-ing how much units are duplicated in the data. Followingprevious work [18, 19], we define

⌫(d)A ⌘ 1

N

X

i

�n(d)i•

�2⌫(d)B ⌘ 1

N

X

j

�n(d)

•j

�2,

which are the average number of observations sharing thesame user (the ⌫As) or item (the ⌫Bs) as an observation(including itself). For the units assigned to conditions (in

this case, users), either n(0)i• or n(1)

i• is zero for each i; forthe non-assigned units (items), we need a measure of thisbetween-condition duplication

!B ⌘ 1N

X

j

n(0)•j n

(1)•j .

Under the homogeneous random e↵ects model (1), we canthen simplify (2) to

V[�] =1N

✓⌫(1)A �2

↵(1) + ⌫(0)A �2

↵(0)

+

✓⌫(1)B �2

�(1) + ⌫(0)B �2

�(0) � 2!��2�(0),�(1)

+ �2"(0) + �2

"(1)

�.

(3)

This expression makes clear that if the random e↵ects foritems in the treatment and control are correlated (as wewould usually expect), then an increase in the balance of howoften items appear in each condition reduces the variance ofthe estimated treatment e↵ect.

2.1.2 Sharp and non-sharp null hypothesesUnder the sharp null hypothesis, the treatment has no av-

erage or interaction e↵ects; that is, the outcome for a partic-ular user or item is the same regardless of treatment assign-ment. In the context of our model, this would mean thatthe variances of the random e↵ects are equal in both con-ditions and are perfectly correlated across conditions, suchthat, in addition to � = 0,

�2↵(1) = �2

↵(0) = �2↵(0),↵(1)

�2�(1) = �2

�(0) = �2�(0),�(1)

�2"(1) = �2

"(0) = �2"(0),�(1) .

(4)

In this case, only random e↵ects for items that are not bal-anced across conditions contribute to the variance of ourATE estimate: the contribution a single item j makes tothe variance simplifies to

�n(0)

•j � n(1)•j

�2�2� ; that is, it de-

pends only on the squared di↵erence in duplication betweentreatment and control. It is easy to show that

V[�] =1N

�⌫(1)A + ⌫(0)

A

��2↵ + B�

2� + 2�2

"

�, (5)

where B ⌘ 1N

Pj

�n(0)

•j � n(1)•j

�2measures the average

between-condition duplication of observations of items. Ifitems, like users, also only appear in either treatment orcontrol, then B = ⌫(1)

B + ⌫(0)B , highlighting the resulting

symmetry between users’ and items’ contributions to ouruncertainty.When Eq 4 does not hold, we say that there are interaction

e↵ects of the treatment and units; for example, there maybe an item–treatment interaction e↵ect.

number of combinations of units [18, 19], we work condi-tional on D.

For the sake of exposition, we restrict our attention to lin-ear models with normally distributed random e↵ects. Thatis, the following analysis considers cases where Y is un-bounded, f is the identity function, and random e↵ects aredrawn from a multivariate normal distribution, so that

Y (d)ij = µ(d) + ↵(d)

i + �(d)j + "(d)ij

~↵i ⇠ N (0,⌃↵), ~�j ⇠ N (0,⌃�), ~"ij ⇠ N (0,⌃"). (1)

Note that ~↵i, etc., are vectors, where each element corre-sponds to the random e↵ect of a unit under a given treat-ment.

We wish to estimate quantities comparing outcomes fordi↵erent values of Dij — most simply, the di↵erence inmeans, or average treatment e↵ect (ATE) for a binary treat-ment

� ⌘ E[Y (1)ij | Dij = 1]� E[Y (0)

ij | Dij = 0]

= µ(1) � µ(0).

While this di↵erence cannot be directly observed from thedata (since user–item pairs can only be assigned to one con-dition at a time), we can estimate � with the di↵erence insample means4 [21]. Our focus is then to consider the truevariance of this estimator of � and, later, bootstrap methodsfor estimating that variance.

The sample mean for each condition is

Y (d) = µ(d)+1

n(d)••

X

i

n(d)i• ↵(d)

i +X

j

n(d)•j �(d)

j +X

i

X

j

"(d)ij

!

where, e.g., n(d)i• is the number of observations of user i in

condition d. We then estimate the ATE with � = Y (1)�Y (0).Consider the case where the treatment and control groups

are of equal size such that N = n(1)•• = n(0)

•• . This enablessimplifying the expression for � and its variance to

� =1N

X

i

⇣�n(1)i•

�2↵(1)i � �n(0)

i•

�2↵(0)i

⌘+

X

j

⇣�n(1)

•j

�2�(1)j � �n(0)

•j

�2�(0)j

⌘+X

i

X

j

"(Dij)ij

and

V[�] =1N2

X

i

✓�n(1)i•

�2�2↵(1) +

�n(0)i•

�2�2↵(0)

+X

j

✓�n(1)

•j

�2�2�(1) +

�n(0)

•j

�2�2�(0) � 2n(0)

•j n(1)•j �

2�(0),�(1)

+X

i

X

j

✓�n(1)ij

�2"(1)ij � �n(0)

ij

�2"(0)ij

◆�.

(2)

The first term is the contribution of random e↵ects of usersto the variance, and the second is the contribution of therandom e↵ects of items. The covariance term, present foritems, is absent for users and user–item pairs since each isonly observed in either the treatment or control.

4For true experiments, D is randomly assigned, but undersome circumstances (i.e. conditional ignorability) treatmente↵ects may be estimated without randomization [7, 17, 21].

To further simplify, we can introduce coe�cients measur-ing how much units are duplicated in the data. Followingprevious work [18, 19], we define

⌫(d)A ⌘ 1

N

X

i

�n(d)i•

�2⌫(d)B ⌘ 1

N

X

j

�n(d)

•j

�2,

which are the average number of observations sharing thesame user (the ⌫As) or item (the ⌫Bs) as an observation(including itself). For the units assigned to conditions (in

this case, users), either n(0)i• or n(1)

i• is zero for each i; forthe non-assigned units (items), we need a measure of thisbetween-condition duplication

!B ⌘ 1N

X

j

n(0)•j n

(1)•j .

Under the homogeneous random e↵ects model (1), we canthen simplify (2) to

V[�] =1N

✓⌫(1)A �2

↵(1) + ⌫(0)A �2

↵(0)

+

✓⌫(1)B �2

�(1) + ⌫(0)B �2

�(0) � 2!��2�(0),�(1)

+ �2"(0) + �2

"(1)

�.

(3)

This expression makes clear that if the random e↵ects foritems in the treatment and control are correlated (as wewould usually expect), then an increase in the balance of howoften items appear in each condition reduces the variance ofthe estimated treatment e↵ect.

2.1.2 Sharp and non-sharp null hypothesesUnder the sharp null hypothesis, the treatment has no av-

erage or interaction e↵ects; that is, the outcome for a partic-ular user or item is the same regardless of treatment assign-ment. In the context of our model, this would mean thatthe variances of the random e↵ects are equal in both con-ditions and are perfectly correlated across conditions, suchthat, in addition to � = 0,

�2↵(1) = �2

↵(0) = �2↵(0),↵(1)

�2�(1) = �2

�(0) = �2�(0),�(1)

�2"(1) = �2

"(0) = �2"(0),�(1) .

(4)

In this case, only random e↵ects for items that are not bal-anced across conditions contribute to the variance of ourATE estimate: the contribution a single item j makes tothe variance simplifies to

�n(0)

•j � n(1)•j

�2�2� ; that is, it de-

pends only on the squared di↵erence in duplication betweentreatment and control. It is easy to show that

V[�] =1N

�⌫(1)A + ⌫(0)

A

��2↵ + B�

2� + 2�2

"

�, (5)

where B ⌘ 1N

Pj

�n(0)

•j � n(1)•j

�2measures the average

between-condition duplication of observations of items. Ifitems, like users, also only appear in either treatment orcontrol, then B = ⌫(1)

B + ⌫(0)B , highlighting the resulting

symmetry between users’ and items’ contributions to ouruncertainty.When Eq 4 does not hold, we say that there are interaction

e↵ects of the treatment and units; for example, there maybe an item–treatment interaction e↵ect.

number of combinations of units [18, 19], we work condi-tional on D.

For the sake of exposition, we restrict our attention to lin-ear models with normally distributed random e↵ects. Thatis, the following analysis considers cases where Y is un-bounded, f is the identity function, and random e↵ects aredrawn from a multivariate normal distribution, so that

Y (d)ij = µ(d) + ↵(d)

i + �(d)j + "(d)ij

~↵i ⇠ N (0,⌃↵), ~�j ⇠ N (0,⌃�), ~"ij ⇠ N (0,⌃"). (1)

Note that ~↵i, etc., are vectors, where each element corre-sponds to the random e↵ect of a unit under a given treat-ment.

We wish to estimate quantities comparing outcomes fordi↵erent values of Dij — most simply, the di↵erence inmeans, or average treatment e↵ect (ATE) for a binary treat-ment

� ⌘ E[Y (1)ij | Dij = 1]� E[Y (0)

ij | Dij = 0]

= µ(1) � µ(0).

While this di↵erence cannot be directly observed from thedata (since user–item pairs can only be assigned to one con-dition at a time), we can estimate � with the di↵erence insample means4 [21]. Our focus is then to consider the truevariance of this estimator of � and, later, bootstrap methodsfor estimating that variance.

The sample mean for each condition is

Y (d) = µ(d)+1

n(d)••

X

i

n(d)i• ↵(d)

i +X

j

n(d)•j �(d)

j +X

i

X

j

"(d)ij

!

where, e.g., n(d)i• is the number of observations of user i in

condition d. We then estimate the ATE with � = Y (1)�Y (0).Consider the case where the treatment and control groups

are of equal size such that N = n(1)•• = n(0)

•• . This enablessimplifying the expression for � and its variance to

� =1N

X

i

⇣�n(1)i•

�2↵(1)i � �n(0)

i•

�2↵(0)i

⌘+

X

j

⇣�n(1)

•j

�2�(1)j � �n(0)

•j

�2�(0)j

⌘+X

i

X

j

"(Dij)ij

and

V[�] =1N2

X

i

✓�n(1)i•

�2�2↵(1) +

�n(0)i•

�2�2↵(0)

+X

j

✓�n(1)

•j

�2�2�(1) +

�n(0)

•j

�2�2�(0) � 2n(0)

•j n(1)•j �

2�(0),�(1)

+X

i

X

j

✓�n(1)ij

�2"(1)ij � �n(0)

ij

�2"(0)ij

◆�.

(2)

The first term is the contribution of random e↵ects of usersto the variance, and the second is the contribution of therandom e↵ects of items. The covariance term, present foritems, is absent for users and user–item pairs since each isonly observed in either the treatment or control.

4For true experiments, D is randomly assigned, but undersome circumstances (i.e. conditional ignorability) treatmente↵ects may be estimated without randomization [7, 17, 21].

To further simplify, we can introduce coe�cients measur-ing how much units are duplicated in the data. Followingprevious work [18, 19], we define

⌫(d)A ⌘ 1

N

X

i

�n(d)i•

�2⌫(d)B ⌘ 1

N

X

j

�n(d)

•j

�2,

which are the average number of observations sharing thesame user (the ⌫As) or item (the ⌫Bs) as an observation(including itself). For the units assigned to conditions (in

this case, users), either n(0)i• or n(1)

i• is zero for each i; forthe non-assigned units (items), we need a measure of thisbetween-condition duplication

!B ⌘ 1N

X

j

n(0)•j n

(1)•j .

Under the homogeneous random e↵ects model (1), we canthen simplify (2) to

V[�] =1N

✓⌫(1)A �2

↵(1) + ⌫(0)A �2

↵(0)

+

✓⌫(1)B �2

�(1) + ⌫(0)B �2

�(0) � 2!��2�(0),�(1)

+ �2"(0) + �2

"(1)

�.

(3)

This expression makes clear that if the random e↵ects foritems in the treatment and control are correlated (as wewould usually expect), then an increase in the balance of howoften items appear in each condition reduces the variance ofthe estimated treatment e↵ect.

2.1.2 Sharp and non-sharp null hypothesesUnder the sharp null hypothesis, the treatment has no av-

erage or interaction e↵ects; that is, the outcome for a partic-ular user or item is the same regardless of treatment assign-ment. In the context of our model, this would mean thatthe variances of the random e↵ects are equal in both con-ditions and are perfectly correlated across conditions, suchthat, in addition to � = 0,

�2↵(1) = �2

↵(0) = �2↵(0),↵(1)

�2�(1) = �2

�(0) = �2�(0),�(1)

�2"(1) = �2

"(0) = �2"(0),�(1) .

(4)

In this case, only random e↵ects for items that are not bal-anced across conditions contribute to the variance of ourATE estimate: the contribution a single item j makes tothe variance simplifies to

�n(0)

•j � n(1)•j

�2�2� ; that is, it de-

pends only on the squared di↵erence in duplication betweentreatment and control. It is easy to show that

V[�] =1N

�⌫(1)A + ⌫(0)

A

��2↵ + B�

2� + 2�2

"

�, (5)

where B ⌘ 1N

Pj

�n(0)

•j � n(1)•j

�2measures the average

between-condition duplication of observations of items. Ifitems, like users, also only appear in either treatment orcontrol, then B = ⌫(1)

B + ⌫(0)B , highlighting the resulting

symmetry between users’ and items’ contributions to ouruncertainty.When Eq 4 does not hold, we say that there are interaction

e↵ects of the treatment and units; for example, there maybe an item–treatment interaction e↵ect.

Estimates of the average treatment effect (ATE) include noise that depends on users and items

Average number of observations sharing the same user

Balance of items across conditions

Duplication coefficients

Correlation between item-level effects under treatment and control

*under the the linear homogeneous random effects model with an equal number of observations in each condition, randomizing over users with Z(!) = Z (")

*

Page 12: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Bootstrap Illustration

Page 13: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

IID bootstrap illustration, R = !

clicked

weight (iid)

product

# " # " #

# " " # "

# " " " "

t*r=# = (#+"+"+"+")/(#+"+"+#+") = #/! = ".(

t = (#+"+#+"+#)/( = $/( = ".%

Page 14: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

IID bootstrap illustration, R = "

t*r=! = (#+"+#+"+")/(#+#+#+"+") = !/$ = ".%%

clicked

weight (iid)

product

# " # " #

# # # " "

# " # " "

t = (#+"+#+"+#)/( = $/' = ".%

Page 15: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Repeat %&& times...▪ Repeat process ("" times, and compute the )(* interval of the

estimated meansR=500 bootstrap replicates

estimated average effect

Frequency

0.70 0.75 0.80

020

4060

80120

estimated mean clicks per impression

Page 16: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Problem: variance from IID bootstrap does not account for dependence due to users or items.

Page 17: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Problem: variance from IID bootstrap does not account for dependence due to users or items.

“'%( confidence intervals” may not include the true mean '%( of the time.

Page 18: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

User bootstrap illustration, R = !

t*r=# = ("+"+#+"+#)/("+#+#+"+#) = !/$ = ".%%

clicked

weight (user)

click*weight

# " # " #

" # # " #

" " # " #

t = ("+"+#+"+#)/( = $/( = ".%

! " " "!

Page 19: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Multiway bootstrap illustration, R = !

t*r=# = ("+"+#+"+")/("+#+"+"+#) = #/! = ".(

clicked

w! (user)

w" (ad)

w!*w"

clicked*w!*w"

# " # " #

" # # " #

# # # " "

" # # " "

" " # " "

t = ("+"+#+"+#)/( = $/( = ".%

! " " "!

" ! !" "

Page 20: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Bootstrap evaluation under the sharp null

iid item user multiway

−1e−03

−5e−04

0e+00

5e−04

1e−03

0 100 200 300 400 500 0 100 200 300 400 500 0 100 200 300 400 500 0 100 200 300 400 500rank

95%

boo

tstra

p C

I

#. Randomly assign #* segments of users to treatment (&) or control (!)!. Bootstrap difference in mean outcomes under (&) and (!)$. Repeat ("" times'. True coverage is the proportion of times the null is accepted (i.e., the “)(*” confidence interval for the difference in means crosses zero)

2.1.4 Sharp and non-sharp null hypotheses

Sharp null hypothesis. Under the sharp null hypothe-

sis for user–item experiments, the treatment has no average,interaction, or exposure e↵ects; that is, the outcome for aparticular user–item pair, and whether or not the item isdisplayed to the user, would be the same regardless of treat-ment assignment. In the context of our model, in additionto � = 0, the sharp null can be defined by:

Zij ⌘ Z(0)ij = Z(1)

ij (4)

�2↵ ⌘ �2

↵(1) = �2↵(0) = �2

↵(0),↵(1)

�2� ⌘ �2

�(1) = �2�(0) = �2

�(0),�(1)

�2" ⌘ �2

"(1) = �2"(0) = �2

"(0),"(1) .

(5)

In the case of the sharp null, only random e↵ects for itemsthat are not balanced across conditions contribute to thevariance of our di↵erence: the contribution a single item jmakes to the variance simplifies to

�n(0)

•j � n(1)•j

�2�2� ; that

is, it depends only on the squared di↵erence in duplicationbetween treatment and control. It is easy to show that

V[�] =1N

�⌫(1)A + ⌫(0)

A

��2↵ + B�

2� + 2�2

"

�, (6)

where B ⌘ 1N

Pj

�n(1)

•j � n(0)•j

�2measures the average

between-condition duplication of observations of items. Ifitems, like users, also only appear in either treatment orcontrol, then B = ⌫(1)

B + ⌫(0)B , highlighting the resulting

symmetry between users’ and items’ contributions to ouruncertainty.

Non-sharp null hypothesis. Experiments may havezero average e↵ects (� = 0) and violate the sharp null. Forexample, when (4) does not hold, the pattern of exposuremay change such that users are exposed to di↵erent items ineach condition, but there are no experimental e↵ects (i.e. (5)remains true). This can occur when exposures are missingfrom the layout Z(d) at random. That is, the change inexposure between conditions is independent of the potentialoutcomes: Z(1)

ij � Z(0)ij ?? (Y (0)

ij , Y (1)ij ).

Another deviation from the sharp null is when (5) does nothold. In this case, we say that there are interaction e↵ectsof the treatment and units; for example, an experiment canpositively or negatively a↵ect particular items or user-itempairs, but when outcomes are averaged over all exposures,there is zero e↵ect.

Together these considerations highlight the need to evalu-ate tests and confidence intervals under both the sharp andnon-sharp null. For example, even the simplistic data gen-erating process under the sharp null from (6), � can havesubstantial variance due to duplication and imbalance ofitems across conditions. The non-sharp null is more di�-cult to express under more realistic circumstances in whichsystematic changes to the exposure layout occur and inter-action e↵ects exist. After introducing the bootstrap, we willuse simulations with real data to investigate inferential pro-cedures under the sharp null in Section 3.3, and deviationsfrom the sharp null in Section 3.4 and Section 4.

2.2 Bootstrapping dependent dataThe bootstrap [9] o↵ers a very general method for char-

acterizing the sampling distribution of a statistic (e.g., a

di↵erence in means), and can be used to produce confidenceintervals for experimental comparisons for many data gen-erating processes. The bootstrap distribution of a samplestatistic is the distribution of that statistic under resampling[9] or reweighting [21] of the data. In this section, we de-scribe how the bootstrap can be applied to dependent data.We focus on a version of the bootstrap that uses indepen-dent weights, rather than the resampling bootstrap, since itis suitable for use in online (i.e., streaming) computationalsettings [18, 19].Bootstrap methods are attractive because they involve

minimal assumptions and scale well to large datasets com-pared to other methods commonly used in practice for sta-tistical inference with dependent data. One could fit crossedrandom e↵ects model to the data [2], but such models don’tscale to large datasets and involve specifying (likely incor-rect) assumptions not needed for the bootstrap. Other al-ternatives include cluster robust Huber–White “sandwich”standard errors from econometrics [6], but such methodscannot be applied in a streaming fashion and are not avail-able for robust statistics of interest, such as trimmed or Win-sorized means.iid bootstrap. In order to get a confidence interval

for some statistic t, we produce R replicates of the thestatistic, t⇤r , computed on randomly reweighted versions ofthe sample. That is, for some replicate r 2 [1, R], eachobservation Yij is randomly reweighted with weights Wij .These reweighted samples allow us to estimate features ofthe sampling distribution of our statistic. We generally haveWij ⇠ G where G is some distribution with mean and vari-ance 1, such as Poisson(1) and Uniform{0, 2} [18, §3.3]. Notethat each observation is reweighted independently, includingother observations of the same units. Applied to two-waydata, the iid bootstrap can be expected to underestimatethe variance of statistics and thus produce confidence inter-vals with poor coverage [15].Single-way bootstrap. In the single-way bootstrap, or

“block” or “cluster” bootstrap, the analyst chooses a sin-gle relevant type of unit (e.g., users) and all observationsfrom the same unit are given the same random weight whenreweighting. In other words, taken i as indexing the cho-sen type of unit, we have Wij = Wij0 = ui and ui ⇠ G forall j, j0. When the data only has one-way dependency, thisprocedure produces a bootstrap distribution that gives con-sistent confidence intervals. When the data has additionaldependency structure, it can be anticonservative; we usereal data and simulations to examine how poorly it works inpractice.Multiway bootstrap. When there are two or more rel-

evant units, analysts can use a bootstrap that reweightsall relevant units. Under a more general model than theone presented above, the multiway bootstrap produces vari-ance estimates, and thus confidence intervals, that are accu-rate or mildly conservative [17, 18]. The two-way bootstraphas been used for analyzing large online advertising experi-ments [3].With two-way data, we have Wij = uivj , where ui ⇠ G

and vj ⇠ G. That is, the random weights for an observa-tion is the product of two independently sampled weightsassigned to unit i and unit j. For example, if in one repli-cate, user i gets weight 2 and item j gets weight 3 then allobservations of the pair (i, j) get weight 2 ⇥ 3 = 6 in thatreplicate. Note that if either unit has a weight of 0, any com-

Page 21: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Duplication increases over timeads search feed

0.4

0.6

0.8

1.0

0 5 10 15 20 0 5 10 15 20 0 5 10 15 20days

true

cove

rage

for 9

5% b

oots

trap

CI

boot

iid

item

user

multiway

Figure 2: True coverage for nominal 95% confidence intervals produced by the iid, single-way, and multiway

bootstrap for A/A tests segmented by user id as a function of time. Uncertainty estimates for the iid and

item-level bootstrap become increasingly inaccurate over time, while the user-level and multiway bootstrap

have the advertised or conservative Type I error rate.

Ads Search Feedusers 4,515,816 908,339 545,218items 317,159 1,362,061 326,831user-item pairs 24,081,939 4,263,769 2,882,452⌫users 18.5 35.5 20.3⌫items 6,625.9 543.6 1,333.0

Table 1: The amount of duplication present in our

datasets for a single 1% segment of users.

corresponding 10 salts from the null experiments and ap-ply the bootstrap procedures described in Section 2.2. Todetermine whether or not a bootstrap experiment is signif-icant, we compute the mean and variance of the di↵erencein means (�k,m,r) over all R replicates. The distributions of�k,m,r, are asymptotically normal under the bootstrap, so wesimply use quantiles of the normal to compute the central100(1� ↵)% interval.

To obtain the estimated true coverage under the sharp nullhypothesis (zero mean di↵erence, equal variance), we com-pute the proportion of times the K bootstrap tests indicatea significant di↵erence in means at some level ↵. We treateach of the K comparisons as independent, and use the Wil-son score interval for binomial proportions [1] to estimatethe uncertainty around the coverage.

We may also obtain the coverage for a non-sharp null hy-pothesis by creating synthetic imbalance between the itemsin both conditions. To do this, for each pair of segments(m,m + 1), we downsample each item from either segmentm or m + 1 (chosen with equal probability); in the down-sampled segment for some item j, its user–item pairs are(independently) removed with probability p. Thus, whenp = 0, we have the sharp null hypothesis, and when p = 1,we have total imbalance (i.e., the two conditions contain dis-joint sets of items).

3.3 DuplicationA central quantity that contributes to the variance of �

is the average number of observations that share the sameuser, ⌫user, and item, ⌫item. We give basic summary statisticsabout the duplication in the data for a random 1% segment

user item

5

10

15

5 10 15 5 10 15days (t)

ν(t)

ν(0)

ads

search

feed

Figure 3: Duplication (⌫) for users and items over

time relative to the first day.

used in our evaluation for the Ads, Search, and Feed datasetsin Table 1. For the restricted categories of items we considerin each dataset, there are more users exposed to ads thanthe search results or feed stories. While per user duplicationis similar across the three datasets, the per-item duplicationfor Ads is much higher than either Search or Feed. Thispattern is congruent with the nature of the items: the num-ber of businesses that are actively advertising are far fewerthan the number of users, while search and News Feed sto-ries tend to have a much longer tail of items that result inlower duplication.Experiments often run for many days; as the number of

days increase, so does the duplication. Figure 3 shows howduplication increases over time. With the exception of Feeditems, this relationship is rather linear both for user anditems. The behavior for Feed may be explained by the waysocial media feeds work: unlike ads and search results, userssee and interact with very recent content, therefore limitingthe average number of users that may be exposed to an item.Given the two-way random e↵ects model and the increas-

ing relationship between the duplication coe�cients andtime, we expect that users and items contribute substan-tially to the variance of �. Not taking these units intoaccount when computing confidence intervals may result inpoor coverage. Figure 2 shows the true coverage of the dif-

ads search feed

0.4

0.6

0.8

1.0

0 5 10 15 20 0 5 10 15 20 0 5 10 15 20days

true

cove

rage

for 9

5% b

oots

trap

CI

boot

iid

item

user

multiway

Figure 2: True coverage for nominal 95% confidence intervals produced by the iid, single-way, and multiway

bootstrap for A/A tests segmented by user id as a function of time. Uncertainty estimates for the iid and

item-level bootstrap become increasingly inaccurate over time, while the user-level and multiway bootstrap

have the advertised or conservative Type I error rate.

Ads Search Feedusers 4,515,816 908,339 545,218items 317,159 1,362,061 326,831user-item pairs 24,081,939 4,263,769 2,882,452⌫users 18.5 35.5 20.3⌫items 6,625.9 543.6 1,333.0

Table 1: The amount of duplication present in our

datasets for a single 1% segment of users.

corresponding 10 salts from the null experiments and ap-ply the bootstrap procedures described in Section 2.2. Todetermine whether or not a bootstrap experiment is signif-icant, we compute the mean and variance of the di↵erencein means (�k,m,r) over all R replicates. The distributions of�k,m,r, are asymptotically normal under the bootstrap, so wesimply use quantiles of the normal to compute the central100(1� ↵)% interval.

To obtain the estimated true coverage under the sharp nullhypothesis (zero mean di↵erence, equal variance), we com-pute the proportion of times the K bootstrap tests indicatea significant di↵erence in means at some level ↵. We treateach of the K comparisons as independent, and use the Wil-son score interval for binomial proportions [1] to estimatethe uncertainty around the coverage.

We may also obtain the coverage for a non-sharp null hy-pothesis by creating synthetic imbalance between the itemsin both conditions. To do this, for each pair of segments(m,m + 1), we downsample each item from either segmentm or m + 1 (chosen with equal probability); in the down-sampled segment for some item j, its user–item pairs are(independently) removed with probability p. Thus, whenp = 0, we have the sharp null hypothesis, and when p = 1,we have total imbalance (i.e., the two conditions contain dis-joint sets of items).

3.3 DuplicationA central quantity that contributes to the variance of �

is the average number of observations that share the sameuser, ⌫user, and item, ⌫item. We give basic summary statisticsabout the duplication in the data for a random 1% segment

user item

5

10

15

5 10 15 5 10 15days (t)

ν(t)

ν(0)

ads

search

feed

Figure 3: Duplication (⌫) for users and items over

time relative to the first day.

used in our evaluation for the Ads, Search, and Feed datasets in Table 1. For the restricted categories of items we consider in each dataset, there are more users exposed to ads than to the search results or feed stories. While per-user duplication is similar across the three datasets, the per-item duplication for Ads is much higher than for either Search or Feed. This pattern is congruent with the nature of the items: the number of businesses that are actively advertising is far smaller than the number of users, while search results and News Feed stories tend to have a much longer tail of items, which results in lower duplication.

Experiments often run for many days; as the number of days increases, so does the duplication. Figure 3 shows how duplication increases over time. With the exception of Feed items, this relationship is rather linear both for users and items. The behavior for Feed may be explained by the way social media feeds work: unlike ads and search results, users see and interact with very recent content, therefore limiting the average number of users that may be exposed to an item.

Given the two-way random effects model and the increasing relationship between the duplication coefficients and time, we expect that users and items contribute substantially to the variance of \hat{\delta}. Not taking these units into account when computing confidence intervals may result in poor coverage. Figure 2 shows the true coverage of the different bootstrap methods.
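The duplication coefficients are straightforward to compute from an impression log; a small sketch (the log is represented simply as (user, item) pairs):

```python
from collections import Counter

def duplication_coeffs(pairs):
    """Average number of observations sharing the same user (nu_user) or
    the same item (nu_item) as a given observation, itself included:
    nu_user = (1/N) * sum_i n_i^2,  nu_item = (1/N) * sum_j n_j^2."""
    n = len(pairs)
    user_counts = Counter(u for u, _ in pairs)
    item_counts = Counter(i for _, i in pairs)
    nu_user = sum(c * c for c in user_counts.values()) / n
    nu_item = sum(c * c for c in item_counts.values()) / n
    return nu_user, nu_item
```

Appending further days of log data can only grow the per-unit counts, which is why these coefficients increase with the length of the experiment.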

…number of combinations of units [18, 19], we work conditional on D.

For the sake of exposition, we restrict our attention to linear models with normally distributed random effects. That is, the following analysis considers cases where Y is unbounded, f is the identity function, and random effects are drawn from a multivariate normal distribution, so that

Y_{ij}^{(d)} = \mu^{(d)} + \alpha_i^{(d)} + \beta_j^{(d)} + \varepsilon_{ij}^{(d)},
\qquad \vec{\alpha}_i \sim \mathcal{N}(0, \Sigma_\alpha), \quad \vec{\beta}_j \sim \mathcal{N}(0, \Sigma_\beta), \quad \vec{\varepsilon}_{ij} \sim \mathcal{N}(0, \Sigma_\varepsilon). \tag{1}

Note that \vec{\alpha}_i, etc., are vectors, where each element corresponds to the random effect of a unit under a given treatment.

We wish to estimate quantities comparing outcomes for different values of D_{ij}, most simply the difference in means, or average treatment effect (ATE), for a binary treatment:

\delta \equiv \mathrm{E}[Y_{ij}^{(1)} \mid D_{ij} = 1] - \mathrm{E}[Y_{ij}^{(0)} \mid D_{ij} = 0] = \mu^{(1)} - \mu^{(0)}.

While this difference cannot be directly observed from the data (since user–item pairs can only be assigned to one condition at a time), we can estimate \delta with the difference in sample means^4 [21]. Our focus is then to consider the true variance of this estimator of \delta and, later, bootstrap methods for estimating that variance.

The sample mean for each condition is

\bar{Y}^{(d)} = \mu^{(d)} + \frac{1}{n_{\bullet\bullet}^{(d)}} \left( \sum_i n_{i\bullet}^{(d)} \alpha_i^{(d)} + \sum_j n_{\bullet j}^{(d)} \beta_j^{(d)} + \sum_i \sum_j \varepsilon_{ij}^{(d)} \right),

where, e.g., n_{i\bullet}^{(d)} is the number of observations of user i in condition d. We then estimate the ATE with \hat{\delta} = \bar{Y}^{(1)} - \bar{Y}^{(0)}.

Consider the case where the treatment and control groups are of equal size, such that N = n_{\bullet\bullet}^{(1)} = n_{\bullet\bullet}^{(0)}. This enables simplifying the expressions for \hat{\delta} and its variance to

\hat{\delta} = \frac{1}{N} \left[ \sum_i \left( n_{i\bullet}^{(1)} \alpha_i^{(1)} - n_{i\bullet}^{(0)} \alpha_i^{(0)} \right) + \sum_j \left( n_{\bullet j}^{(1)} \beta_j^{(1)} - n_{\bullet j}^{(0)} \beta_j^{(0)} \right) + \sum_i \sum_j \varepsilon_{ij}^{(D_{ij})} \right]

and

V[\hat{\delta}] = \frac{1}{N^2} \left[ \sum_i \left( \big(n_{i\bullet}^{(1)}\big)^2 \sigma_{\alpha^{(1)}}^2 + \big(n_{i\bullet}^{(0)}\big)^2 \sigma_{\alpha^{(0)}}^2 \right) + \sum_j \left( \big(n_{\bullet j}^{(1)}\big)^2 \sigma_{\beta^{(1)}}^2 + \big(n_{\bullet j}^{(0)}\big)^2 \sigma_{\beta^{(0)}}^2 - 2\, n_{\bullet j}^{(0)} n_{\bullet j}^{(1)}\, \sigma_{\beta^{(0)},\beta^{(1)}} \right) + \sum_i \sum_j \left( \big(n_{ij}^{(1)}\big)^2 \sigma_{\varepsilon^{(1)}}^2 + \big(n_{ij}^{(0)}\big)^2 \sigma_{\varepsilon^{(0)}}^2 \right) \right]. \tag{2}

The first term is the contribution of the random effects of users to the variance, and the second is the contribution of the random effects of items. The covariance term, present for items, is absent for users and user–item pairs since each is only observed in either the treatment or the control.

^4 For true experiments, D is randomly assigned, but under some circumstances (i.e., conditional ignorability) treatment effects may be estimated without randomization [7, 17, 21].

To further simplify, we can introduce coefficients measuring how much units are duplicated in the data. Following previous work [18, 19], we define

\nu_A^{(d)} \equiv \frac{1}{N} \sum_i \big(n_{i\bullet}^{(d)}\big)^2, \qquad \nu_B^{(d)} \equiv \frac{1}{N} \sum_j \big(n_{\bullet j}^{(d)}\big)^2,

which are the average number of observations sharing the same user (the \nu_A s) or item (the \nu_B s) as an observation (including itself). For the units assigned to conditions (in this case, users), either n_{i\bullet}^{(0)} or n_{i\bullet}^{(1)} is zero for each i; for the non-assigned units (items), we need a measure of the between-condition duplication,

\omega_B \equiv \frac{1}{N} \sum_j n_{\bullet j}^{(0)} n_{\bullet j}^{(1)}.

Under the homogeneous random effects model (1), we can then simplify (2) to

V[\hat{\delta}] = \frac{1}{N} \left[ \nu_A^{(1)} \sigma_{\alpha^{(1)}}^2 + \nu_A^{(0)} \sigma_{\alpha^{(0)}}^2 + \nu_B^{(1)} \sigma_{\beta^{(1)}}^2 + \nu_B^{(0)} \sigma_{\beta^{(0)}}^2 - 2\,\omega_B\, \sigma_{\beta^{(0)},\beta^{(1)}} + \sigma_{\varepsilon^{(0)}}^2 + \sigma_{\varepsilon^{(1)}}^2 \right]. \tag{3}

This expression makes clear that if the random effects for items in the treatment and control are correlated (as we would usually expect), then an increase in the balance of how often items appear in each condition reduces the variance of the estimated treatment effect.

2.1.2 Sharp and non-sharp null hypotheses

Under the sharp null hypothesis, the treatment has no average or interaction effects; that is, the outcome for a particular user or item is the same regardless of treatment assignment. In the context of our model, this would mean that the variances of the random effects are equal in both conditions and are perfectly correlated across conditions, such that, in addition to \delta = 0,

\sigma_{\alpha^{(1)}}^2 = \sigma_{\alpha^{(0)}}^2 = \sigma_{\alpha^{(0)},\alpha^{(1)}}, \qquad
\sigma_{\beta^{(1)}}^2 = \sigma_{\beta^{(0)}}^2 = \sigma_{\beta^{(0)},\beta^{(1)}}, \qquad
\sigma_{\varepsilon^{(1)}}^2 = \sigma_{\varepsilon^{(0)}}^2 = \sigma_{\varepsilon^{(0)},\varepsilon^{(1)}}. \tag{4}

In this case, only random effects for items that are not balanced across conditions contribute to the variance of our ATE estimate: the contribution a single item j makes to the variance simplifies to \big(n_{\bullet j}^{(0)} - n_{\bullet j}^{(1)}\big)^2 \sigma_\beta^2; that is, it depends only on the squared difference in duplication between treatment and control. It is easy to show that

V[\hat{\delta}] = \frac{1}{N} \left[ \big(\nu_A^{(1)} + \nu_A^{(0)}\big) \sigma_\alpha^2 + \Delta_B\, \sigma_\beta^2 + 2 \sigma_\varepsilon^2 \right], \tag{5}

where \Delta_B \equiv \frac{1}{N} \sum_j \big(n_{\bullet j}^{(0)} - n_{\bullet j}^{(1)}\big)^2 measures the average between-condition imbalance of observations of items. If items, like users, also only appear in either treatment or control, then \Delta_B = \nu_B^{(1)} + \nu_B^{(0)}, highlighting the resulting symmetry between users' and items' contributions to our uncertainty.

When Eq 4 does not hold, we say that there are interaction effects of the treatment and units; for example, there may be an item–treatment interaction effect.
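The simplification in Eq (5) can be checked numerically. The sketch below simulates the sharp-null model on a tiny fixed layout (user and item effects shared across conditions, every user–item pair observed at most once) and compares the empirical variance of \hat{\delta} against the closed form; the layout and parameter values are illustrative, not from the paper.

```python
import random
from collections import Counter

def eq5_variance(layout1, layout0, s_a, s_b, s_e):
    """Closed-form V[delta-hat] under the sharp null, Eq (5):
    (1/N) * ((nuA1 + nuA0) * s_a^2 + Delta_B * s_b^2 + 2 * s_e^2)."""
    n = len(layout1)
    nu_a = (sum(c * c for c in Counter(u for u, _ in layout1).values())
            + sum(c * c for c in Counter(u for u, _ in layout0).values())) / n
    c1 = Counter(i for _, i in layout1)
    c0 = Counter(i for _, i in layout0)
    delta_b = sum((c0[j] - c1[j]) ** 2 for j in set(c0) | set(c1)) / n
    return (nu_a * s_a ** 2 + delta_b * s_b ** 2 + 2 * s_e ** 2) / n

def draw_delta(layout1, layout0, s_a, s_b, s_e, rng):
    """One draw of delta-hat with user and item random effects shared
    across conditions (perfect correlation, i.e., the sharp null)."""
    users = {u for u, _ in layout1} | {u for u, _ in layout0}
    items = {i for _, i in layout1} | {i for _, i in layout0}
    a = {u: rng.gauss(0, s_a) for u in users}
    b = {i: rng.gauss(0, s_b) for i in items}
    y1 = [a[u] + b[i] + rng.gauss(0, s_e) for u, i in layout1]
    y0 = [a[u] + b[i] + rng.gauss(0, s_e) for u, i in layout0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)
```

For example, with layout1 = [("u1","i1"), ("u1","i2"), ("u2","i1")] and layout0 = [("u3","i1"), ("u4","i2"), ("u5","i2")], the variance of many Monte Carlo draws of draw_delta matches eq5_variance to within sampling error.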


Variance increases with duplication:

Page 22: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Anti-conservatism increases over time

[Figure: true coverage of nominal 95% bootstrap confidence intervals over 0–20 days, with panels for ads, search, and feed; methods: iid, item, user, multiway.]

Page 23: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Experiments with effects
▪ Fit probit random effects models to several small countries to estimate random effects for a realistic generative model
▪ Simulate varying treatment effect levels via a synthetic model with a fixed layout whose duplication is similar to real-world duplication

Figure 2 shows the true coverage of the different bootstrap methods for consecutively larger spans of time in each dataset. We find that the iid confidence intervals tend to be highly anti-conservative. For example, after two weeks of data collection, a search experiment that tests the difference in click-through rates between two equivalent groups of users could result in rejecting the null hypothesis nearly 50% of the time. We find that bootstrapping by the unit not being randomized over (the item) often leads to anti-conservative intervals, and that for the sharp null with little imbalance in items, the user-level bootstrap yields accurate coverage. The multiway bootstrap, on the other hand, remains conservative no matter how many days are considered.
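One common formulation of the multiway (user x item) bootstrap reweights each observation by the product of independent Poisson(1) weights drawn per user and per item; the user-level bootstrap is the special case where all item weights are fixed at one. The sketch below follows that idea; it is an illustration under those assumptions, not necessarily the exact procedure used in the paper.

```python
import math
import random

def poisson1(rng):
    """Draw from Poisson(1) via Knuth's multiplication method."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def multiway_boot(obs, reps=500, seed=0):
    """Multiway (user x item) bootstrap replicates of delta-hat.

    obs: list of (user, item, condition, outcome), condition in {0, 1}.
    Each replicate reweights every observation by the product of an
    independent Poisson(1) weight for its user and for its item.
    Returns the list of replicate difference-in-means estimates.
    """
    rng = random.Random(seed)
    users = {u for u, _, _, _ in obs}
    items = {i for _, i, _, _ in obs}
    out = []
    for _ in range(reps):
        wu = {u: poisson1(rng) for u in users}
        wi = {i: poisson1(rng) for i in items}
        tot, wsum = [0.0, 0.0], [0.0, 0.0]
        for u, i, d, y in obs:
            w = wu[u] * wi[i]
            tot[d] += w * y
            wsum[d] += w
        if wsum[0] > 0 and wsum[1] > 0:  # skip degenerate replicates
            out.append(tot[1] / wsum[1] - tot[0] / wsum[0])
    return out
```

The standard deviation of the replicates estimates the standard error of \hat{\delta}, from which a normal-approximation confidence interval follows.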

3.4 Imbalance in items

Given how these A/A tests were constructed, there is approximate balance of items across conditions, such that the primary contributors to the variance of \hat{\delta} are the user and residual error components. However, if items are systematically imbalanced across treatments (e.g., the experiment results in showing similarly relevant, but different, ads), then item random effects can also make a substantial contribution (Eq 5). To examine how such imbalance might affect the coverage of the confidence intervals in practice when the treatment has no average or interaction effects, we created imbalance by downsampling items from either condition with probability p (see Section 3.2 for details).

[Figure 4 panels: ads, feed, search; x-axis: p; y-axis: true coverage for 95% bootstrap CI; methods: iid, item, user, multiway.]

Figure 4: True coverage for nominal 95% confidence intervals for each bootstrap method applied to data with varying levels of synthetic imbalance of items across conditions for 2 weeks of data. Imbalance does not appear to affect the accuracy of the true coverage for the multiway and user-level bootstrap, while the iid and item-level bootstrap become more conservative when imbalance is greatest.

Figure 4 shows the true coverage with varying censoring probabilities, p ∈ {0.3, 0.6, 1.0}. Despite the threat that the imbalance might result in a large item-level contribution to the variance, the coverage of the user bootstrap, which neglects this variance, remains approximately as advertised. This result may be due to a number of factors. First, the most straightforward expressions for V[\hat{\delta}] and the expected variance estimates from the bootstrap procedures involve assuming a homogeneous random effects model, when it can actually be expected that the variances of the random effects for, e.g., frequently observed users differ from those for infrequently observed users. Second, there is a relatively high amount of index-level duplication in the data, such that for many users and items only a small number of user–item pairs are observed; such duplication can cause the multiway bootstrap to be very conservative [19, Theorem 7].

For Feed and Search, the coverage of the iid and item bootstrap confidence intervals notably increases, though they continue to undercover. This is expected since, in addition to creating imbalance, the downsampling procedure reduces within-condition duplication.

4. SIMULATIONS

We have seen how different bootstrap methods perform under the sharp null hypothesis and synthetic imbalance in three real-world domains. However, these A/A tests cannot tell us how bootstrap procedures might perform in situations where treatments do have effects. For example, an ads experiment that manipulates the display of certain advertising units may only affect certain items and not others [3]. To explore these circumstances, we conduct simulations with a probit random effects model parameterized to mirror the kinds of outcomes described in the previous section. We use this generative model to vary the presence of an item–treatment interaction, a plausible source of violations of the sharp null hypothesis.

We modify the model of (1) so that Y is binary and there is a single intercept common to both treatment and control, reflecting the lack of an ATE:

y_{ij}^{(d)} = \mu + \alpha_i^{(d)} + \beta_j^{(d)} + \varepsilon_{ij}^{(d)} \tag{6}

Y_{ij}^{(d)} = \mathbb{1}\{ y_{ij}^{(d)} > 0 \} \tag{7}

Also reflecting the absence of an ATE, we restrict the random effect variance to be the same in treatment and control. For example, the covariance matrix for the item random effects is

\Sigma_\beta = \begin{bmatrix} \sigma_\beta^2 & \rho_\beta \sigma_\beta^2 \\ \rho_\beta \sigma_\beta^2 & \sigma_\beta^2 \end{bmatrix}.
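Drawing item effects with this covariance matrix reduces to the standard Cholesky construction for a bivariate normal; a minimal sketch (parameter values illustrative, residual s.d. fixed at 1 as in a probit model):

```python
import math
import random

def item_effects(sigma_b, rho_b, rng):
    """One item's random effects (b0, b1) for control and treatment:
    each N(0, sigma_b^2), with correlation rho_b (Cholesky construction)."""
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    b0 = sigma_b * z1
    b1 = sigma_b * (rho_b * z1 + math.sqrt(1.0 - rho_b ** 2) * z2)
    return b0, b1

def outcome(mu, alpha, beta, rng):
    """Probit-style binary outcome per Eqs (6)-(7):
    Y = 1{mu + alpha + beta + eps > 0}, eps ~ N(0, 1)."""
    return 1 if mu + alpha + beta + rng.gauss(0, 1) > 0 else 0
```

Setting rho_b = 1 makes the two effects identical (the sharp null), while rho_b = 0 makes an item's effect in treatment independent of its effect in control.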

To make realistic choices for the variances of the random effects, we fit probit random effects models to the ads dataset, using a large random sample of users in each of several small countries. This produced several estimates of \sigma_\alpha and \sigma_\beta. We report on simulation results for \sigma_\alpha = 0.3, which is close to several of the estimates. Our estimates of \sigma_\beta often ranged from 0.2 to 0.9, so we present results for \sigma_\beta ∈ {0.1, 0.3, 0.5, 1.0}. We set \mu so as to achieve E[Y_{ij}] close to 0.02.⁷

We constructed the set of observed user–item pairs used in the simulations by assigning each of 3,000 potential users and 200 potential ads log-normally distributed scores. For each of 2N observations, we selected a particular user and ad with probability proportional to this score. This yielded a "layout" with 2,481 unique users, 199 unique ads, and duplication coefficients \nu_A ≐ 30.9 and \nu_B ≐ 6,077.4, which is similar to the Ads dataset.
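This layout construction can be sketched as follows; the spread of the log-normal scores is not given in the source, so the scale below is an arbitrary choice:

```python
import math
import random
from collections import Counter

def make_layout(n_users, n_items, n_obs, seed=0):
    """Sample user-item observations with probability proportional to
    log-normally distributed per-unit scores (underlying normal s.d.
    chosen arbitrarily here)."""
    rng = random.Random(seed)
    u_scores = [math.exp(rng.gauss(0, 1)) for _ in range(n_users)]
    i_scores = [math.exp(rng.gauss(0, 1)) for _ in range(n_items)]
    users = rng.choices(range(n_users), weights=u_scores, k=n_obs)
    items = rng.choices(range(n_items), weights=i_scores, k=n_obs)
    return list(zip(users, items))

def dup_coeffs(layout):
    """Duplication coefficients nu_A (users) and nu_B (items)."""
    n = len(layout)
    nu_a = sum(c * c for c in Counter(u for u, _ in layout).values()) / n
    nu_b = sum(c * c for c in Counter(i for _, i in layout).values()) / n
    return nu_a, nu_b
```

With many more users than items, \nu_B far exceeds \nu_A, as in the Ads-like layout described above.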

4.1 Item–treatment interactions

Even if the treatment has no effects on average, it can have positive effects for some users and items and negative effects for others.

⁷Since there is no scale to the latent variable y_{ij}, we achieved this by in fact choosing a fixed \mu = −2 and rescaling the random effect variances to sum to 1.


Probit model:



Variance of the ATE for an experiment under the (non-sharp) null:

Page 24: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Experiments with effects
▪ Fit probit random effects models to several small countries to estimate random effects for a realistic generative model
▪ Simulate varying treatment effect levels via a synthetic model with a fixed layout whose duplication is similar to real-world duplication

[Figure 5 panels: \sigma_\beta ∈ {0.1, 0.3, 0.5, 1}, each with \sigma_\alpha = 0.3; x-axis: strength of item–treatment interaction, 1 − \rho_\beta; y-axis: true coverage of nominal 95% CI; methods: user, item, multiway.]

Figure 5: Effects of item–treatment interaction effects on the true coverage of 95% confidence intervals. Decreasing \rho_\beta, which makes the random item effects less correlated between treatment and control, reduces the coverage of user bootstrap confidence intervals. This effect is moderated by the magnitude of the item-level random effects.

for others. Given our random effects model, we know that item–treatment interactions can increase the contribution of duplication of items to the variance of the mean difference. We vary item–treatment interactions by setting the correlation coefficient ρ_β ∈ {0, 0.25, 0.5, 0.75, 1}. Perfect correlation ρ_β = 1 corresponds to the sharp null hypothesis, while decreasing ρ_β corresponds to an increasing proportion of item random effects not being shared across conditions. At the extreme of ρ_β = 0, the random effect of an item in the treatment is completely independent of its random effect in the control.

4.2 Results

Figure 5 summarizes the results of 1000 simulations for each combination of parameter values. Without any item–treatment interaction, both the user and item bootstrap have approximately correct coverage; this is attributable to the relatively low ν_A in this simulation, and is consistent with the results from a small number of days in our real data sets. As the item–treatment interaction increases, the coverage of the user bootstrap confidence intervals drops substantially. For example, even with moderate values of σ_β = 0.5 and ρ_β = 0.75, a nominally 95% confidence interval has a true coverage of 87.5%. While we do not expect to observe the extreme of all item-level variance being treatment-specific (i.e., ρ_β = 0), these results demonstrate that deviations from the sharp null in the form of item–treatment interactions have serious consequences for the single-way bootstrap. On the other hand, the multiway bootstrap remains mildly conservative even with large σ_β and small ρ_β.

5. DISCUSSION

Despite having a large number of individual observations, many settings for online experiments involve substantial dependence and small effects, such that statistical inference remains a central concern. The preceding analysis of real and simulated data makes clear that methods which neglect dependence structure in these large experiments can result in high Type I error rates and confidence intervals with poor coverage. In each of our three datasets, the iid bootstrap performed very poorly, such that using it (or other methods assuming iid observations) would result in reaching incorrect conclusions about the presence, sign, and magnitude of treatment effects [11].

On the other hand, neglecting dependence among observations of units not assigned to conditions (the items) generally did not result in lower coverage with our data. For each of the data sets, this remained the case even when we produced imbalance of items across conditions. Given the random effects model posited in Section 2.1, one might expect this imbalance to make both the user and item contributions to the variance necessary to account for separately. Since bootstrapping multiple units and storing these replicates can have substantial costs in terms of computation and infrastructure, our results suggest that experimenters should consider whether a single-way bootstrap on the experimental units may be practically sufficient, even in the presence of other clearly relevant units, such as ads and URLs.

Nonetheless, neglecting dependence among observations of these non-experimental units may have substantial effects on coverage when the treatment has any effects. Most treatments are expected to have some effects. Our simulations with item–treatment interaction effects demonstrate that the coverage of the user bootstrap can be extremely sensitive to the presence of these effects. This highlights that using A/A tests only serves to validate inferential procedures under a narrow set of conditions (i.e., the sharp null hypothesis), but cannot detect other (potentially severe) inferential problems that occur in other circumstances. Given that experimenters expect treatment effects, and often want to know how large the average effects are, they should consider whether they wish to use a procedure that provides a somewhat conservative measurement of uncertainty (i.e., the multiway bootstrap), or the user-level bootstrap, which correctly tests the less plausible sharp null.
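One practical way to realize the multiway bootstrap weighed here against the user-level bootstrap is its weighted variant: each replicate draws independent random weights per user and per item and reweights every observation by their product. The sketch below assumes Poisson(1) weights and hypothetical data; the function name and parameters are mine, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def multiway_bootstrap_se(y, users, items, reps=500):
    """Bootstrap std. error of mean(y), resampling users and items
    jointly: replicate r weights observation (i, j) by the product
    of a per-user and a per-item Poisson(1) draw."""
    y, users, items = map(np.asarray, (y, users, items))
    u_idx = {u: k for k, u in enumerate(np.unique(users))}
    i_idx = {i: k for k, i in enumerate(np.unique(items))}
    u = np.array([u_idx[x] for x in users])
    v = np.array([i_idx[x] for x in items])
    means = []
    for _ in range(reps):
        w = rng.poisson(1, len(u_idx))[u] * rng.poisson(1, len(i_idx))[v]
        if w.sum() > 0:  # skip the (rare) all-zero replicate
            means.append(np.average(y, weights=w))
    return np.std(means)

# Hypothetical crossed data: 50 users x 10 items, dependence on both.
users = np.repeat(np.arange(50), 10)
items = np.tile(np.arange(10), 50)
y = rng.normal(0, 1, 50)[users] + rng.normal(0, 1, 10)[items] + rng.normal(0, 1, 500)
print(multiway_bootstrap_se(y, users, items))
```

Because weights multiply rather than requiring joint resampling of rows, this form needs only one pass per replicate and no materialized resamples, which is what makes it attractive at scale.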

ferent bootstrap methods for consecutively larger spans of time in each dataset. We find that the iid confidence intervals tend to be highly anti-conservative. For example, after two weeks of data collection, a search experiment that tests the difference in click-through rates between two equivalent groups of users could result in rejecting the null hypothesis nearly 50% of the time. We find that bootstrapping by the unit not being randomized over (the item) often leads to anti-conservative intervals, and that for the sharp null with little imbalance in items, the user-level bootstrap yields accurate coverage. The multiway bootstrap, on the other hand, remains conservative no matter how many days are considered.

3.4 Imbalance in items

Given how these A/A tests were constructed, there is approximate balance of items across conditions, such that the primary contributors to the variance of Δ are the user and residual error components. However, if items are systematically imbalanced across treatments (e.g., the experiment results in showing similarly relevant, but different, ads), then item random effects can also make a substantial contribution (Eq 5). To examine how such imbalance might affect the coverage of the confidence intervals in practice when the treatment has no average or interaction effects, we created imbalance by downsampling items from either condition with probability p (see Section 3.2 for details).

[Figure 4: three panels (ads, feed, search) plotting true coverage of nominal 95% bootstrap CIs (iid, item, user, multiway) against the imbalance parameter p ∈ {0.3, 0.6, 1.0}.]

Figure 4: True coverage for nominal 95% confidence intervals for each bootstrap method applied to data with varying levels of synthetic imbalance of items across conditions for 2 weeks of data. Imbalance does not appear to affect the accuracy of the true coverage for the multiway and user-level bootstrap, while the iid and item-level bootstrap become more conservative when imbalance is greatest.

Figure 4 shows the true coverage with varying censoring probabilities, p ∈ {0.3, 0.6, 1.0}. Despite the threat that the imbalance might result in a large item-level contribution to the variance, the coverage of the user bootstrap, which neglects this variance, remains approximately as advertised. This result may be due to a number of factors. First, the most straightforward expressions for V[Δ] and the expected variance estimates from the bootstrap procedures involve assuming a homogeneous random effects model, when it can actually be expected that the variances of the random effects for, e.g., frequently observed users are different than those for infrequently observed users. Second, there is a relatively high amount of index-level duplication in the data, such that for many users and items only a small number of user–item pairs are observed; such duplication can cause the multiway bootstrap to be very conservative [19, Theorem 7].

For Feed and Search, the poor coverage of the iid and item bootstrap confidence intervals notably increases, though they continue to undercover. This is expected since, in addition to creating imbalance, the downsampling procedure reduces within-condition duplication.

4. SIMULATIONS

We have seen how different bootstrap methods perform under the sharp null hypothesis and synthetic imbalance with three real-world domains. However, these A/A tests cannot tell us how bootstrap procedures might perform in situations where treatments do have effects. For example, an ads experiment that manipulates the display of certain advertising units may only affect certain items and not others [3]. To explore these circumstances, we conduct simulations with a probit random effects model parameterized to mirror the kinds of outcomes described in the previous section. We use this generative model to vary the presence of an item–treatment interaction, a plausible source of violations of the sharp null hypothesis.

We modify the model of (1) so that Y is binary and there is a single intercept common to both treatment and control, reflecting the lack of an ATE:

    y^{(d)}_{ij} = \mu + \alpha^{(d)}_i + \beta^{(d)}_j + \varepsilon^{(d)}_{ij}    (6)
    E[Y^{(d)}_{ij}] = \mathbb{1}\{y^{(d)}_{ij} > 0\}    (7)

Also reflecting the absence of an ATE, we restrict the random effect variances to be the same in treatment and control. For example, the covariance matrix for the item random effects is

    \Sigma_\beta = \begin{bmatrix} \sigma^2_\beta & \rho_\beta \sigma^2_\beta \\ \rho_\beta \sigma^2_\beta & \sigma^2_\beta \end{bmatrix}.

To make realistic choices for the variances of the random effects, we fit a probit random effects model to the ads dataset from a large random sample of users in each of several small countries. This produced several estimates of \sigma_\alpha and \sigma_\beta. We report on simulation results for \sigma_\alpha = 0.3, which is close to several of the estimates. Our estimates of \sigma_\beta often ranged from 0.2 to 0.9, so we present results for \sigma_\beta \in \{0.1, 0.3, 0.5, 1.0\}. We set \mu so as to achieve E[Y_{ij}] close to 0.02.^7

We constructed the set of observed user–item pairs used in the simulations by assigning each of 3,000 potential users and 200 potential ads log-normally distributed scores. For each of 2N observations, we selected a particular user and ad with probability proportional to this score. This yielded a "layout" with 2481 unique users, 199 unique ads, and duplication coefficients \nu_A \doteq 30.9 and \nu_B \doteq 6077.4, which is similar to the Ads dataset.

4.1 Item–treatment interactions

Even if the treatment has no effects on average, it can have positive effects for some users and items and negative effects

^7 Since there is no scale to the latent variable y_{ij}, we achieved this by in fact choosing a fixed \mu = -2 and rescaling the random effect variances to sum to 1.
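The generative model of Eqs 6–7 is simple to simulate. The sketch below is my own simplified version (parameter defaults are illustrative, not the paper's exact layout, and user effects are fully shared across conditions, i.e. ρ_α = 1):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_probit(n_users=500, n_items=50, obs=5000,
                    mu=-2.0, s_a=0.3, s_b=0.5, rho_b=0.75):
    """Binary outcomes Y = 1{mu + a_i^(d) + b_j^(d) + e > 0} (Eqs 6-7);
    item effects (b^(0), b^(1)) are bivariate normal with correlation
    rho_b, as in the covariance matrix Sigma_beta."""
    a = rng.normal(0, s_a, n_users)  # user effects, shared across d
    cov = s_b ** 2 * np.array([[1.0, rho_b], [rho_b, 1.0]])
    b = rng.multivariate_normal([0.0, 0.0], cov, n_items)  # cols: d = 0, 1
    d_user = rng.integers(0, 2, n_users)  # users are randomized to conditions
    i = rng.integers(0, n_users, obs)     # user observed at each impression
    j = rng.integers(0, n_items, obs)     # item observed at each impression
    d = d_user[i]
    latent = mu + a[i] + b[j, d] + rng.normal(0.0, 1.0, obs)
    return (latent > 0).astype(int), d

y, d = simulate_probit()
# With rho_b < 1 the sharp null fails even though the ATE is exactly zero.
print(y.mean(), y[d == 1].mean() - y[d == 0].mean())
```

Sweeping rho_b over {0, 0.25, 0.5, 0.75, 1} and s_b over {0.1, 0.3, 0.5, 1.0}, then checking CI coverage over many such draws, reproduces the kind of experiment summarized in Figure 5.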


Probit model:

Page 25: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Takeaways
▪ Not accounting for any dependence results in incorrect CIs
▪ The bootstrap provides a simple way to account for dependence
▪ Bootstrapping on the units being randomized over (e.g. users) is sufficient to test the (very narrow) sharp null
▪ When experiments have effects, not using multiway CIs can result in anti-conservative estimates
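The contrast between ignoring dependence and bootstrapping on the randomized unit can be made concrete with a small sketch (data and function names are hypothetical, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_se(y, users=None, reps=1000):
    """SE of mean(y): resample rows iid, or resample whole users
    (clusters) when a users array is given."""
    y = np.asarray(y)
    means = []
    if users is None:
        for _ in range(reps):
            means.append(rng.choice(y, len(y)).mean())
    else:
        groups = [y[np.asarray(users) == u] for u in np.unique(users)]
        for _ in range(reps):
            pick = rng.integers(0, len(groups), len(groups))
            means.append(np.concatenate([groups[k] for k in pick]).mean())
    return np.std(means)

# 100 users with 20 observations each and strong user-level dependence:
users = np.repeat(np.arange(100), 20)
y = rng.normal(0, 1, 100)[users] + rng.normal(0, 0.3, 2000)
print(bootstrap_se(y), bootstrap_se(y, users))  # user-level SE is several times the iid SE
```

The iid bootstrap treats all 2000 rows as independent and so reports a far smaller standard error than the user-level bootstrap, which is the anti-conservatism the first takeaway warns about.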

Page 26: Uncertainty in online experiments with dependent data (KDD 2013 presentation)

Thanks
▪ Coauthor, Dean Eckles
▪ Helpful comments: Dan Merl, Yaron Greif, Alex Deng, Daniel Ting, Wojtek Galuba, Art Owen

▪ For more details, see paper @:

▪ http://arxiv.org/abs/1304.7406