[trying to correct] selection bias
Slides: ims.nus.edu.sg/events/2017/quan/files/noah2.pdf (2017-07-11)
TRANSCRIPT
The following ideas may have partly been borrowed from...
He’s the one in the center... giving everyone else a hard time at seminar!
What is Precision Medicine?
I will stick mostly to oncology...
Not because I know much there...
But I definitely know less about everything else!
What is Precision Medicine?
The practice of medicine has always been about
- characterizing dysfunction
- treating based on specific characterizations
What is Precision Medicine?
In the beginning this was based on simple observation alone:
you’ve been vomiting and missed your period → Pregnant
Now we have more sophisticated methods:
hCG in urine → Pregnant
In oncology, tumors are characterized using histology
What is Precision Medicine?
My understanding is:
Medicine attempts to differentiate diseases...
to develop treatments that target specific disease characteristics
Precision medicine attempts to differentiate diseases...
more precisely?
to develop treatments that target specific disease characteristics
That reads like a high schooler “not-plagiarizing” an essay...
What is Precision Medicine?
My understanding is:
Medicine attempts to differentiate diseases...
to develop treatments that target specific disease characteristics
[Biomolecular] Precision medicine attempts to differentiate diseases using biomolecular profiling
to develop treatments that target specific biomolecular disease characteristics
What am I leaving out?
Screening diagnostics
e.g. cfDNA
Actionable prognostic biomarkers
e.g. oncotypeDX
It is often forgotten that the goal is to find actionable biomarkers
Back to “Predictive Biomarkers”
Two common scenarios:
Developing a targeted treatment + diagnostic
Developing a new diagnostic for an existing, non-targeted treatment
Targeted Treatments
30+ targeted cancer drugs [1] with many different targets
The primary FDA-specified “biomolecular” indications were
- HER2/HR status
- KRAS/EGFR mutation
- BRAF mutation
Many with no “biomolecular indication”...
only approved in very specific cancer types though!
(histology-based personalization!)
[1] From “Overview of FDA-approved Anti-Cancer Drugs Used for Targeted Therapy”, WCRJ 2015; 2(3): e553
The Road to Failure in Precision Medicine
Where have I seen little success?
Characterizing the [in]effectiveness of non-targeted treatments
Why do poor treatments tend not to work?
???
Why do I tend to miss free throws?
Because I keep forgetting to wear my lucky shirt...?
Or maybe because I’m generally bad at basketball...
The Road to Failure in Precision Medicine
Where have I seen little success?
Characterizing the [in]effectiveness of non-targeted treatments
Why do poor treatments tend not to work?
Because they tend not to work...
Why do I tend to miss free throws?
Because I keep forgetting to wear my lucky shirt...?
Or maybe because I’m generally bad at basketball...
The Road to Success in Precision Medicine?
What is the best place for statisticians on that road?
Is it building fancier methods?
(in some avenues things work pretty well with simple methods)
Or domain expertise?
Or some other option?
Solve Easy Problems!
EE/CS does this well!
Very approximately solve useful + “easy” domain problems
Statistics seems to have a deeper, but slower, prodding phenotype.
Sometimes the problems are messy...
A familiar problem - testing multiple hypotheses
Prostate cancer data [2]
n = 102 samples:
50 healthy controls
52 prostate cancer patients
p = 6033 genes
[2] Singh et al. (2002)
A familiar problem - testing multiple hypotheses
Interested in
δ_j = (µ_{1j} − µ_{2j}) / σ_j
Calculate a (scaled) two-sample t-statistic for each gene:
z_j = (x̄_j^(c) − x̄_j^(d)) / s_j,
where s_j is your favorite estimate of the standard deviation
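To make the computation concrete, here is a minimal Python sketch (not from the original slides); `X`, `group`, and the random placeholder data are hypothetical stand-ins for the Singh et al. expression matrix, and the pooled standard deviation is just one choice of s_j.

```python
import numpy as np

def scaled_t_stats(X, group):
    """z_j = (mean of controls - mean of cases) / s_j, per gene j."""
    Xc, Xd = X[group == 0], X[group == 1]
    n1, n2 = len(Xc), len(Xd)
    diff = Xc.mean(axis=0) - Xd.mean(axis=0)
    # One "favorite" estimate of sigma_j: the pooled standard deviation.
    pooled_var = ((n1 - 1) * Xc.var(axis=0, ddof=1) +
                  (n2 - 1) * Xd.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return diff / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
X = rng.normal(size=(102, 6033))      # random placeholder, NOT the real data
group = np.repeat([0, 1], [50, 52])   # 50 healthy controls, 52 cancer patients
z = scaled_t_stats(X, group)          # roughly N(delta_j, sqrt(1/n1 + 1/n2))
```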
A familiar problem - testing multiple hypotheses
Suppose I adjust for multiplicity...
and find 10 differentially expressed genes.
However, I also want to estimate the effect size (δ_j) for those 10.
Given that I already adjusted for multiplicity in testing...
can I just report the unadjusted estimates δ̂_j?
NO!
Estimating Effect Sizes
The test statistics are approximately
z_j ∼ N(δ_j, √(n₁⁻¹ + n₂⁻¹)), with
δ_j = (µ_{1j} − µ_{2j}) / σ_j
We’re pretty good at testing whether δ_j = 0
Bonferroni, Benjamini-Hochberg (/Yekutieli)
We’re pretty bad at estimating δ_j
(especially for the most extreme values of δ_j)
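For the testing half of that statement, here is a hand-rolled Benjamini-Hochberg sketch (not from the talk), assuming the normal model above with a few hypothetical non-null genes mixed in:

```python
import numpy as np
from scipy.stats import norm

def bh_reject(pvals, alpha=0.1):
    """Benjamini-Hochberg: reject the k smallest p-values, where k is the
    largest index with p_(k) <= alpha * k / m."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.nonzero(below)[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

n1, n2 = 50, 52
se = np.sqrt(1 / n1 + 1 / n2)
rng = np.random.default_rng(1)
delta = np.concatenate([np.zeros(6000), np.full(33, 0.5)])  # mostly null genes
z = rng.normal(delta, se)              # z_j ~ N(delta_j, sqrt(1/n1 + 1/n2))
pvals = 2 * norm.sf(np.abs(z) / se)    # two-sided p-values under the null
selected = bh_reject(pvals, alpha=0.1)
```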
Before we move on
Two flavors of past approaches:
- Conditioning on exceeding a threshold (univariate correction)
- Empirical Bayes (multivariate correction)
This talk will be more related to Empirical Bayes
Illustrative Example
500 data points based on z_j | δ_j ∼ N(δ_j, 1)
[Figure: densities of the means δ_j and the statistics z_j]
Illustrative Example
[Figure: densities of the means and the statistics, as above]
The standard estimate δ̂_j = z_j is poor:
E‖z‖₂² = ‖δ‖₂² + p
James-Stein shrinks toward the overall mean
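For reference, a sketch of positive-part James-Stein shrinkage toward the grand mean (not from the slides), under the unit-variance model of the illustration:

```python
import numpy as np

def james_stein(z):
    """Shrink each z_j toward the grand mean (positive-part James-Stein)."""
    p, zbar = len(z), z.mean()
    s2 = np.sum((z - zbar) ** 2)
    factor = max(0.0, 1.0 - (p - 3) / s2)
    return zbar + factor * (z - zbar)

rng = np.random.default_rng(2)
delta = rng.normal(0, 1, size=500)
z = rng.normal(delta, 1)
print(np.mean((z - delta) ** 2),               # naive risk
      np.mean((james_stein(z) - delta) ** 2))  # JS risk, smaller on average
```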
Illustrative Example
[Figure: a multimodal density of z, spread over roughly 0 to 30]
What should we shrink towards here??
Need local shrinkage
Winner’s Curse
Some definitions:
z_(k): the k-th order statistic
j(k): the index of the k-th order statistic
Note: j(k) is the inverse of the “rank” operator, so z_{j(k)} = z_(k)
By Jensen’s inequality (and our pictures):
E[z_(p)] ≥ max_j δ_j ≥ E[δ_{j(p)}]
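A tiny simulation (not from the talk) checking this chain of inequalities under z_j | δ_j ∼ N(δ_j, 1):

```python
import numpy as np

rng = np.random.default_rng(3)
delta = rng.normal(0, 1, size=500)    # fixed means
top_z, top_delta = [], []
for _ in range(2000):
    z = rng.normal(delta, 1)
    k = np.argmax(z)                  # k = j(p), the index achieving z_(p)
    top_z.append(z[k])
    top_delta.append(delta[k])
print(np.mean(top_z), delta.max(), np.mean(top_delta))
# Typically prints E[z_(p)] > max delta_j > E[delta_{j(p)}], with clear gaps.
```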
Winner’s Curse
Across repeated simulated draws, the observed gap was
z_(p) − δ_{j(p)} = 2.52, 3.29, 2.78, 2.63, 2.43
[Figure: densities of the means and the statistics for each draw]
Winner’s Curse
Define β_k by
β(δ)_k = E[z_(k) − δ_{j(k)}]
Now consider the estimate
δ̂_{j(k)} = z_(k) − β(δ)_k
or equivalently
δ̂_j = z_j − β(δ)_{r(j)}
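An oracle sketch of this estimate (not from the slides): approximate β(δ)_k by Monte Carlo using the true δ, which is only possible in simulation, then subtract rank-by-rank:

```python
import numpy as np

def beta_oracle(delta, reps=2000, rng=None):
    """beta(delta)_k = E[z_(k) - delta_{j(k)}], approximated by simulation."""
    rng = rng or np.random.default_rng(4)
    gaps = np.zeros(len(delta))
    for _ in range(reps):
        z = rng.normal(delta, 1)
        order = np.argsort(z)              # order[k] is j(k+1), 0-indexed
        gaps += np.sort(z) - delta[order]  # z_(k) - delta_{j(k)}
    return gaps / reps

def debias(z, beta):
    """delta_hat_j = z_j - beta_{r(j)}, with r(j) the rank of z_j."""
    ranks = np.argsort(np.argsort(z))
    return z - beta[ranks]

delta = np.repeat([5, 10, 15, 20, 25], 40)  # the toy example below
rng = np.random.default_rng(5)
z = rng.normal(delta, 1)
delta_hat = debias(z, beta_oracle(delta))
```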
Winner’s Curse...?
A more efficient estimate:
E‖δ̂ − δ‖₂² = E‖z − δ‖₂² − Σ_k β(δ)_k²
Removes the selection bias:
E[δ̂_{j(p)}] = E[δ_{j(p)}]
Toy example
200 data points with δ_j equal to each of 5, 10, 15, 20, 25.
[Figure: density of z, and estimated mean vs. z for Naive, Oracle Debias, and Truth]
Estimating β
Use the parametric bootstrap; estimate β(δ) by
β̂ = β(δ̂)
[Figure: estimated mean vs. z for Oracle Debias, Bootstrap Debias, and Truth]
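A sketch of that plug-in (not necessarily the talk's implementation): resimulate around the naive estimate δ̂ = z and average the rank gaps; one could also iterate this.

```python
import numpy as np

def beta_of(center, reps=2000, rng=None):
    """Monte-Carlo beta(center)_k = E[z*_(k) - center_{j(k)}], z* ~ N(center, 1)."""
    rng = rng or np.random.default_rng(6)
    gaps = np.zeros(len(center))
    for _ in range(reps):
        z_star = rng.normal(center, 1)
        order = np.argsort(z_star)
        gaps += np.sort(z_star) - center[order]
    return gaps / reps

rng = np.random.default_rng(7)
delta = np.repeat([5, 10, 15, 20, 25], 40)
z = rng.normal(delta, 1)
beta_hat = beta_of(z)                               # plug in delta_hat = z
delta_hat = z - beta_hat[np.argsort(np.argsort(z))]
```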
Non-parametric Empirical Bayes
Assume δ ∼ g(·)
Estimate g by ĝ
Take δ̂ = E_ĝ[δ | z]
Extremely strong in simple scenarios.
Difficult/impossible to apply in general.
Similar to considering E[δ_{j(k)}]
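One standard recipe for this posterior mean is Tweedie's formula, E[δ | z] = z + (log f)′(z) for N(δ, 1) noise, which needs only the marginal density f of the z's. Here is a sketch using a kernel density estimate; this is an illustrative choice, not necessarily the estimator behind the "NP EB" results shown later:

```python
import numpy as np
from scipy.stats import gaussian_kde

def tweedie(z, eps=1e-3):
    """delta_hat_j = z_j + d/dz log f_hat(z_j), f_hat a kernel density estimate."""
    f = gaussian_kde(z)
    score = (np.log(f(z + eps)) - np.log(f(z - eps))) / (2 * eps)
    return z + score

rng = np.random.default_rng(8)
delta = rng.normal(0, 2, size=500)    # draws from an (unknown) prior g
z = rng.normal(delta, 1)
delta_hat_eb = tweedie(z)
```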
More Complex Scenarios
What about correlation among estimates?
Maybe interested in regression coefficients?
Or entries of a precision matrix?
Or a very complicated parameter based on a very complicated procedure?
More Complex Scenarios
There is a similar, simple framework to accommodate all of these!
(Details are notationally dense, but not hard)
A particularly intriguing scenario
Often consider several candidate biomolecular signatures...
for predicting response to a new test treatment
Want an estimate of the ATE in the test-positive group for the best signature
Fit on phase II data and select the best (induces a bias!)
but signatures are likely very correlated (maybe less bias?)
This framework can quite easily be applied there.
Back to Prostate Cancer
n = 102 samples:
50 healthy controls
52 prostate cancer patients
p = 6033 genes
Interested in
δ_j = (µ_{1j} − µ_{2j}) / σ_j
Prostate Cancer — Scaled t-statistics
[Figure: density of the scaled t-statistics, spread over roughly −1 to 1]
Prostate Cancer — Shrinkage
[Figure: shrunken estimate vs. z for uncor bootstrap, para bootstrap, nonpara bootstrap, NP EB, and JS]
Prostate Cancer — Evaluation
SSE of a 50-50 split

Method        k = 50          k = 25          k = 15
ebayes        204.33 (3.11)   110.13 (2.83)    71.03 (2.40)
para-uncor    190.81 (2.40)    93.56 (1.84)    54.93 (1.40)
para-cor      178.65 (1.97)    87.90 (1.55)    51.07 (1.17)
nonpara       191.73 (2.42)    93.65 (1.84)    54.75 (1.37)
unadjusted    726.62 (8.05)   400.35 (5.76)   258.56 (4.21)
Rank Conditional Coverage
Consider intervals I_1, . . . , I_p, ordered as I_{j(1)}, . . . , I_{j(p)}
We know average coverage is OK:
(1/p) Σ_{j≤p} P(δ_j ∈ I_j) = (1/p) Σ_{k≤p} P(δ_{j(k)} ∈ I_{j(k)}) = 1 − α
Rank Conditional Coverage
The problem is that
P(δ_{j(k)} ∈ I_{j(k)}) << 1 − α for k interesting
P(δ_{j(k)} ∈ I_{j(k)}) ∼ 1 for k boring
The most interesting intervals are also the most under-covering
Conditioning on “rejecting the null” doesn’t fix this!
Rank Conditional Coverage
For random intervals I_1, . . . , I_p we call
RCC_k ≡ P(δ_{j(k)} ∈ I_{j(k)})
the Rank Conditional Coverage.
Generally we want to control RCC uniformly at level 1 − α
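A quick simulation (not from the slides) estimating RCC_k for the naive intervals z_(k) ± z_{α/2} in the toy example; the extreme ranks under-cover badly:

```python
import numpy as np
from scipy.stats import norm

alpha = 0.1
halfwidth = norm.ppf(1 - alpha / 2)
delta = np.repeat([5, 10, 15, 20, 25], 40)
p, reps = len(delta), 2000
hits = np.zeros(p)
rng = np.random.default_rng(9)
for _ in range(reps):
    z = rng.normal(delta, 1)
    order = np.argsort(z)
    hits += np.abs(np.sort(z) - delta[order]) <= halfwidth  # coverage at rank k
rcc = hits / reps  # RCC_k: near 1 - alpha at middle ranks, far below at extremes
```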
Intervals for large δ
We would like a procedure that forms I_1, . . . , I_p with
P(δ_{j(k)} ∈ I_{j(k)}) ≥ 1 − α for every k
Intervals for large δ
Using a similar framework + resampling, can construct intervals!
Details are notationally dense, but not hard
As before there is an approximation by resampling
Interval Examples (n=100, p=500)
δ ∼ N(0, 1)
[Figure: rank conditional coverage by order statistic (1 to 500), comparing Naive and BS intervals]
Interval Examples (n=100, p=500)
δ_j = cor(x_j, y), where X ∼ N(0, Σ) and y = Σ_{j=1}^{20} x_j β_j + ε, with β_1, . . . , β_20 ∼ N(0, 1)
[Figure: rank conditional coverage for order statistics 400 to 500, comparing Naive, Bootstrap, WFB, and WFB2 intervals]
Takeaways!
A simple formulation for dealing with “selection bias” in high dimensions
Revolves around the distribution of z_(k) − δ_{j(k)}
You have to be careful with plug-ins.
More Complex Scenarios — EXTRA
Consider a distribution F and a parameter vector Θ(F).
Let Θ̂ be an empirical estimate of Θ(F) based on X_1, . . . , X_n ∼ F.
If we define
β(F)_k = E[Θ̂_(k) − Θ(F)_{j(k)}]
and
Θ̃_j = Θ̂_j − β(F)_{r(j)}
then
E‖Θ̃ − Θ(F)‖₂² = E‖Θ̂ − Θ(F)‖₂² − Σ_k β(F)_k²
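A generic plug-in sketch of this framework (not from the slides): estimate β(F)_k by resampling from an estimate of F, here the nonparametric bootstrap, treating the full-data estimate Θ̂ as the stand-in truth. `estimator` is any hypothetical function mapping a data matrix to a parameter vector.

```python
import numpy as np

def debias_generic(X, estimator, reps=1000, rng=None):
    """Rank-debias an arbitrary vector-valued estimator via the bootstrap."""
    rng = rng or np.random.default_rng(10)
    theta_hat = estimator(X)
    n, p = len(X), len(theta_hat)
    gaps = np.zeros(p)
    for _ in range(reps):
        Xb = X[rng.integers(n, size=n)]           # resample rows of X
        tb = estimator(Xb)
        order = np.argsort(tb)
        gaps += np.sort(tb) - theta_hat[order]    # Theta*_(k) - theta_hat_{j(k)}
    ranks = np.argsort(np.argsort(theta_hat))
    return theta_hat - (gaps / reps)[ranks]
```

For instance, `estimator` could return every cor(x_j, y) from a matrix whose last column is y, matching the correlation example on the earlier slide.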
Intervals for large δ EXTRA
Rather than using the mean of our distribution, use quantiles!
- Let G_k(δ) denote the distribution of z_(k) − δ_{j(k)}
- Define L(δ)_k and U(δ)_k to be the 1 − α/2 and α/2 quantiles of G_k(δ), i.e.
  P[U_k ≤ z_(k) − δ_{j(k)} ≤ L_k] = 1 − α
- Pivot! Define I_k = [z_(k) − L_k, z_(k) − U_k]
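Finally, a sketch pulling the pivot construction together (not from the slides): estimate the G_k quantiles by parametric bootstrap around δ̂ = z, then pivot.

```python
import numpy as np

def rank_pivot_intervals(z, alpha=0.1, reps=2000, rng=None):
    """I_k = [z_(k) - L_k, z_(k) - U_k] using bootstrap quantiles of
    z_(k) - delta_{j(k)}, resampled around delta_hat = z."""
    rng = rng or np.random.default_rng(11)
    p = len(z)
    gaps = np.empty((reps, p))
    for b in range(reps):
        z_star = rng.normal(z, 1)
        order = np.argsort(z_star)
        gaps[b] = np.sort(z_star) - z[order]
    L = np.quantile(gaps, 1 - alpha / 2, axis=0)  # upper quantile L_k
    U = np.quantile(gaps, alpha / 2, axis=0)      # lower quantile U_k
    z_sorted = np.sort(z)
    return z_sorted - L, z_sorted - U             # lower, upper endpoints by rank

rng = np.random.default_rng(12)
delta = np.repeat([5, 10, 15, 20, 25], 40)
lo, hi = rank_pivot_intervals(rng.normal(delta, 1))
```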