
Page 1:

Statistics for high-dimensional data: p-values and confidence intervals

Peter Bühlmann

Seminar für Statistik, ETH Zürich

June 2014

Page 2:

High-dimensional data
Behavioral economics and genetics (with Ernst Fehr, U. Zurich)

• n = 1'525 persons
• genetic information (SNPs): p ≈ 10^6
• 79 response variables, measuring “behavior”

p ≫ n

goal: find significant associations between behavioral responses and genetic markers

[Figure: “Number of significant target SNPs per phenotype”; x-axis: phenotype index (5–75), y-axis: number of significant target SNPs (0–700).]
Page 3:

in high-dimensional statistics: a lot of progress has been achieved over the last 8-10 years for

• point estimation
• rates of convergence

but very little work on assigning measures of uncertainty, p-values, confidence intervals

Page 4:

we need uncertainty quantification! (the core of statistics)

Page 5:

goal (regarding the title of the talk):

p-values/confidence intervals for a high-dimensional linear model

(and we can then generalize to other models)

Page 6:

Motif regression and variable selection

for finding HIF1α transcription factor binding sites in DNA sequences (Müller, Meier, PB & Ricci)

for coarse DNA segments i = 1, ..., n:

• predictor $X_i = (X_i^{(1)}, \dots, X_i^{(p)}) \in \mathbb{R}^p$: abundance score of candidate motifs j = 1, ..., p in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN)

• univariate response $Y_i \in \mathbb{R}$: binding intensity of HIF1α to coarse DNA segment i (from ChIP-chip experiments)

Page 7:

question: what is the relation between the binding intensity Y and the abundance of short candidate motifs?

⇝ a linear model is often reasonable: “motif regression” (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)

$$Y_i = \sum_{j=1}^{p} \beta_j^0 X_i^{(j)} + \varepsilon_i, \quad i = 1, \dots, n = 143, \quad p = 195$$

goal: variable selection and significance of variables ⇝ find the relevant motifs among the p = 195 candidates

Page 8:

Lasso for variable selection:

$$\hat{S}(\lambda) = \{j;\ \hat{\beta}_j(\lambda) \neq 0\}, \quad \text{an estimate of}\ S_0 = \{j;\ \beta_j^0 \neq 0\}$$

no significance testing involved; it’s convex optimization only! and it’s very popular (Meinshausen & PB, 2006; Zhao & Yu, 2006; Wainwright, 2009; ...)

Page 9:

for motif regression (finding HIF1α transcription factor binding sites), n = 143, p = 195

⇝ Lasso selects 26 covariates when choosing λ = λ_CV via cross-validation, with resulting R² ≈ 50%

i.e. 26 interesting candidate motifs

how significant are the findings?
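The selection step itself is easy to reproduce in a few lines. Below is a minimal sketch in Python/scikit-learn (the talk's accompanying software is the R package hdi; Python is used here only for illustration) on synthetic data standing in for the motif matrix, since the real n = 143, p = 195 data are not reproduced here; the sparse beta0 is a made-up stand-in for the true motif effects.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 143, 195                        # dimensions as in the motif regression
X = rng.standard_normal((n, p))        # synthetic stand-in for the motif scores
beta0 = np.zeros(p)
beta0[:5] = 0.5                        # hypothetical handful of true motifs
y = X @ beta0 + rng.standard_normal(n)

fit = LassoCV(cv=10).fit(X, y)         # lambda_CV via 10-fold cross-validation
S_hat = np.flatnonzero(fit.coef_)      # S_hat(lambda_CV) = {j : beta_hat_j != 0}
print(f"selected {S_hat.size} covariates, in-sample R^2 = {fit.score(X, y):.2f}")
```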


Page 11:

estimated coefficients $\hat{\beta}(\lambda_{CV})$

[Figure: estimated coefficients plotted against the variable index (original data); x-axis: variables (0–200), y-axis: coefficients (0.00–0.20).]

p-values for $H_{0,j}: \beta_j^0 = 0$?

Page 12:

P-values for high-dimensional linear models

$$Y = X\beta^0 + \varepsilon$$

goal: statistical hypothesis testing

$$H_{0,j}: \beta_j^0 = 0 \quad \text{or} \quad H_{0,G}: \beta_j^0 = 0\ \text{for all}\ j \in G \subseteq \{1, \dots, p\}$$

background: if we could handle the asymptotic distribution of the Lasso $\hat{\beta}(\lambda)$ under the null hypothesis ⇝ we could construct p-values

this is very difficult! the asymptotic distribution of $\hat{\beta}$ has some point mass at zero, ... (Knight and Fu (2000) for p < ∞ and n → ∞)
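The point mass is easy to see in simulation. The following sketch (my own illustration, not from the talk) refits the Lasso at a fixed penalty over many synthetic data sets and records how often a truly nonzero coefficient is estimated as exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, reps = 100, 10, 500
beta0 = np.zeros(p)
beta0[0] = 0.2                        # small but truly nonzero coefficient

zero_count = 0
for _ in range(reps):
    X = rng.standard_normal((n, p))
    y = X @ beta0 + rng.standard_normal(n)
    bhat = Lasso(alpha=0.1).fit(X, y).coef_
    zero_count += (bhat[0] == 0.0)    # exact zero due to soft-thresholding

print(f"fraction of fits with beta_hat_1 exactly 0: {zero_count / reps:.2f}")
```

Any limit distribution thus mixes an atom at zero with a continuous part, which is what defeats the naive bootstrap.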

Page 13:

⇝ standard bootstrapping and subsampling cannot be used either

but there are recent proposals using adaptations of standard resampling methods (Chatterjee & Lahiri, 2013; Liu & Yu, 2013) ⇝ non-uniformity/super-efficiency issues remain...

Page 14:

Low-dimensional projections and bias correction
or: de-sparsifying the Lasso estimator (related work by Zhang and Zhang, 2011)

motivation:

$\hat{\beta}_{OLS,j}$ = projection of Y onto the residuals $(X_j - X_{-j}\hat{\gamma}^{(j)}_{OLS})$

the projection is not well defined if p > n ⇝ use “regularized” residuals from a Lasso of $X_j$ on the other X-variables:

$$Z_j = X_j - X_{-j}\hat{\gamma}^{(j)}_{\text{Lasso}}, \qquad \hat{\gamma}^{(j)} = \operatorname{argmin}_{\gamma}\ \|X_j - X_{-j}\gamma\|_2^2/n + 2\lambda_j\|\gamma\|_1$$

Page 15:

using $Y = X\beta^0 + \varepsilon$ ⇝

$$Z_j^T Y = Z_j^T X_j\, \beta_j^0 + \sum_{k \neq j} Z_j^T X_k\, \beta_k^0 + Z_j^T \varepsilon$$

and hence

$$\frac{Z_j^T Y}{Z_j^T X_j} = \beta_j^0 + \underbrace{\sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}\, \beta_k^0}_{\text{bias}} + \underbrace{\frac{Z_j^T \varepsilon}{Z_j^T X_j}}_{\text{noise component}}$$

⇝

$$\hat{b}_j = \frac{Z_j^T Y}{Z_j^T X_j} - \underbrace{\sum_{k \neq j} \frac{Z_j^T X_k}{Z_j^T X_j}\, \hat{\beta}_{\text{Lasso};k}}_{\text{Lasso-estimated bias correction}}$$
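In code, this projection estimate is only a few lines. A sketch under the notation above, with scikit-learn's Lasso for the nodewise regression; the penalty level lam_j and the cross-validated initial estimator are illustrative choices, not the theoretically prescribed ones:

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def desparsified_bj(X, y, j, lam_j=0.1):
    """Bias-corrected estimate b_j via regularized residuals Z_j."""
    X_mj = np.delete(X, j, axis=1)
    # Z_j = X_j - X_{-j} gamma_hat^{(j)} with gamma_hat from a nodewise Lasso
    gamma = Lasso(alpha=lam_j, fit_intercept=False).fit(X_mj, X[:, j]).coef_
    Z_j = X[:, j] - X_mj @ gamma
    # initial Lasso estimate of beta^0, used for the bias correction
    beta_hat = LassoCV(cv=5, fit_intercept=False).fit(X, y).coef_
    denom = Z_j @ X[:, j]
    # b_j = Z_j'Y/Z_j'X_j - sum_{k != j} (Z_j'X_k/Z_j'X_j) * beta_hat_k
    return Z_j @ y / denom - (Z_j @ X_mj) @ np.delete(beta_hat, j) / denom
```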

Page 16:

$\hat{b}_j$ is not sparse! ... and this is crucial to obtain the Gaussian limit; nevertheless, it is “optimal” (see later)

• target: the low-dimensional component $\beta_j^0$
• $\eta := \{\beta_k^0;\ k \neq j\}$ is a high-dimensional nuisance parameter

⇝ exactly as in semiparametric modeling! and sparsely estimated (e.g. with the Lasso)


Page 18:

=⇒ let’s turn to the blackboard!

Page 19:

A general principle: de-sparsifying via “inversion” of the KKT conditions

KKT conditions (sub-differential of the objective function $\|Y - X\beta\|_2^2/n + \lambda\|\beta\|_1$):

$$-X^T(\underbrace{Y}_{X\beta^0 + \varepsilon} - X\hat{\beta})/n + \lambda\hat{\tau} = 0, \qquad \|\hat{\tau}\|_\infty \le 1,\ \ \hat{\tau}_j = \operatorname{sign}(\hat{\beta}_j)\ \text{if}\ \hat{\beta}_j \neq 0.$$

with $\hat{\Sigma} = X^T X/n$ ⇝ $\hat{\Sigma}(\hat{\beta} - \beta^0) + \lambda\hat{\tau} = X^T\varepsilon/n$

take a “regularized inverse” of $\hat{\Sigma}$, denoted by $\hat{\Theta}$ (not e.g. the GLasso):

$$\hat{\beta} - \beta^0 + \hat{\Theta}\lambda\hat{\tau} = \hat{\Theta}X^T\varepsilon/n - \Delta, \qquad \Delta = (\hat{\Theta}\hat{\Sigma} - I)(\hat{\beta} - \beta^0)$$

new estimator: $\hat{b} = \hat{\beta} + \hat{\Theta}\lambda\hat{\tau} = \hat{\beta} + \hat{\Theta}X^T(Y - X\hat{\beta})/n$

Page 20:

⇝ $\hat{b}$ is exactly the same estimator as before (based on low-dimensional projections using the residual vectors $Z_j$)

... when taking $\hat{\Theta}$ (the “regularized inverse of $\hat{\Sigma}$”) with rows built from the (“nodewise”) Lasso-estimated coefficients of $X_j$ versus $X_{-j}$:

$$\hat{\gamma}_j = \operatorname{argmin}_{\gamma \in \mathbb{R}^{p-1}}\ \|X_j - X_{-j}\gamma\|_2^2/n + 2\lambda_j\|\gamma\|_1$$

Denote by

$$\hat{C} = \begin{pmatrix} 1 & -\hat{\gamma}_{1,2} & \cdots & -\hat{\gamma}_{1,p} \\ -\hat{\gamma}_{2,1} & 1 & \cdots & -\hat{\gamma}_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat{\gamma}_{p,1} & -\hat{\gamma}_{p,2} & \cdots & 1 \end{pmatrix}$$

and let

$$\hat{T}^2 = \operatorname{diag}(\hat{\tau}_1^2, \dots, \hat{\tau}_p^2), \qquad \hat{\tau}_j^2 = \|X_j - X_{-j}\hat{\gamma}_j\|_2^2/n + \lambda_j\|\hat{\gamma}_j\|_1.$$

Then $\hat{\Theta}_{\text{Lasso}} = \hat{T}^{-2}\hat{C}$ (not symmetric...!)
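The whole construction in matrix form might look as follows; a sketch in which the tuning parameters lam (for the initial β̂) and lam_node (for the nodewise regressions, playing the role of λ_j) are assumed inputs. Note that scikit-learn's Lasso objective ‖·‖₂²/(2n) + α‖·‖₁ matches the nodewise criterion above with α = λ_j.

```python
import numpy as np
from sklearn.linear_model import Lasso

def desparsified_lasso(X, y, lam=0.1, lam_node=0.1):
    """b = beta_hat + Theta_hat X'(y - X beta_hat)/n, Theta_hat = T^{-2} C."""
    n, p = X.shape
    beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    C = np.eye(p)
    tau2 = np.empty(p)
    for j in range(p):
        X_mj = np.delete(X, j, axis=1)
        g = Lasso(alpha=lam_node, fit_intercept=False).fit(X_mj, X[:, j]).coef_
        C[j, np.arange(p) != j] = -g           # row j of C holds -gamma_hat_j
        resid = X[:, j] - X_mj @ g
        tau2[j] = resid @ resid / n + lam_node * np.abs(g).sum()  # tau_j^2
    Theta_hat = C / tau2[:, None]              # T^{-2} C: row j scaled by 1/tau_j^2
    b = beta_hat + Theta_hat @ X.T @ (y - X @ beta_hat) / n
    return b, Theta_hat, tau2
```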

Page 21:

“inverting” the KKT conditions is a very general principle ⇝ the principle can be used for GLMs and many other models

Page 22:

Asymptotic pivot and optimality

Theorem (van de Geer, PB & Ritov, 2013)

$$\sqrt{n}(\hat{b}_j - \beta_j^0) \Rightarrow \mathcal{N}(0, \sigma_\varepsilon^2\, \Omega_{jj}) \quad (j = 1, \dots, p)$$

$\Omega_{jj}$ has an explicit expression $\sim (\Sigma^{-1})_{jj}$: optimal! reaching the semiparametric information bound

⇝ asymptotically optimal p-values and confidence intervals, if we assume:

• sub-Gaussian design (i.i.d. sub-Gaussian rows of X) whose population covariance Cov(X) = Σ has minimal eigenvalue ≥ M > 0
• sparsity of the regression Y vs. X: $s_0 = o(\sqrt{n}/\log(p))$ (“quite sparse”)
• sparsity of the design: Σ⁻¹ sparse, i.e. sparse regressions $X_j$ vs. $X_{-j}$: $s_j = o(n/\log(p))$ (“maybe restrictive”)
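Given the pivot, p-values and confidence intervals are immediate. A small sketch; in practice σ̂_ε would come from a noise-level estimator such as the scaled Lasso, and Ω̂_jj from the nodewise quantities, but here both are simply inputs:

```python
import numpy as np
from scipy.stats import norm

def inference_for_bj(b_j, sigma_hat, omega_jj, n, level=0.95):
    """Two-sided p-value for H_{0,j}: beta_j^0 = 0, plus a confidence interval."""
    se = sigma_hat * np.sqrt(omega_jj / n)   # standard error implied by the pivot
    pval = 2 * norm.sf(abs(b_j) / se)        # two-sided Gaussian p-value
    half = norm.ppf((1 + level) / 2) * se    # half-width of the interval
    return pval, (b_j - half, b_j + half)
```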

Page 23:

It is optimal! (Cramér-Rao)

Page 24:

for designs with Σ⁻¹ non-sparse:

• Ridge projection (PB, 2013): good type I error control, but not optimal in terms of power
• a convex program instead of the Lasso for $Z_j$ (Javanmard & Montanari, 2013; MSc thesis Dezeure, 2013); Javanmard & Montanari prove optimality

so far: no convincing empirical evidence that we can deal well with such scenarios (Σ⁻¹ non-sparse)

Page 25:

Uniform convergence:

$$\sqrt{n}(\hat{b}_j - \beta_j^0) \Rightarrow \mathcal{N}(0, \sigma_\varepsilon^2\, \Omega_{jj}) \quad (j = 1, \dots, p)$$

the convergence is uniform over $B(s_0) = \{\beta;\ \|\beta\|_0 \le s_0\}$

⇝ honest tests and confidence regions!

and we can avoid post-model-selection inference (cf. Pötscher and Leeb)

Page 26:

Simultaneous inference over all components:

$$\sqrt{n}(\hat{b} - \beta^0) \approx (W_1, \dots, W_p) \sim \mathcal{N}_p(0, \sigma_\varepsilon^2\, \Omega)$$

⇝ can construct p-values for $H_{0,G}$ with any G: test statistic $\max_{j \in G} |\hat{b}_j|$, since the covariance structure Ω is known

and we can easily do efficient multiple testing adjustment, since the covariance structure Ω is known!

Page 27:

Alternatives?

• versions of bootstrapping (Chatterjee & Lahiri, 2013) ⇝ super-efficiency phenomenon! i.e. non-uniform convergence (Joe Hodges)
  • good for estimating the zeroes (i.e., $j \in S_0^c$ with $\beta_j^0 = 0$)
  • bad for estimating the non-zeroes (i.e., $j \in S_0$ with $\beta_j^0 \neq 0$)
• multiple sample splitting (Meinshausen, Meier & PB, 2009): split the sample repeatedly into two halves, select variables on the first half, and compute p-values on the second half based on the selected variables ⇝ avoids (because of the sample splitting) over-optimistic p-values, but potentially suffers in terms of power; a simplified sketch follows this list
• covariance test (Lockhart, Taylor, Tibshirani & Tibshirani, 2014)
• no sparsity assumption on Σ⁻¹ (Javanmard and Montanari, 2014)
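To make the sample-splitting idea concrete, here is a deliberately simplified sketch. The published procedure aggregates per-split p-values with a refined quantile rule; the rule below (twice the median of the Bonferroni-adjusted per-split p-values) is only one simple valid variant, and the screening/testing choices are illustrative:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

def multi_split_pvalues(X, y, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pvals = np.ones((B, p))                   # p-value 1 for unselected variables
    for b in range(B):
        idx = rng.permutation(n)
        i1, i2 = idx[: n // 2], idx[n // 2:]
        # screen on the first half ...
        S = np.flatnonzero(LassoCV(cv=5).fit(X[i1], y[i1]).coef_)
        if 0 < S.size < i2.size - 1:
            # ... and compute OLS p-values on the second half
            ols = sm.OLS(y[i2], sm.add_constant(X[np.ix_(i2, S)])).fit()
            pvals[b, S] = np.minimum(ols.pvalues[1:] * S.size, 1.0)  # Bonferroni
    return np.minimum(2.0 * np.median(pvals, axis=0), 1.0)  # aggregate over splits
```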

Page 28:

Some empirical results (Dezeure, PB, Meier & Meinshausen, in progress)

compare power and control of the familywise error rate (FWER); always p = 500, n = 100 and s₀ = 15

[Figure: two panels, FWER and Power (both on a 0–1 scale), for the methods Covtest, JM, MS-Split, Ridge and Despars-Lasso.]

Page 29:

confidence intervals

• for $\beta_j^0$ ($j \in S_0$)
• for $\beta_j^0 = 0$ ($j \in S_0^c$), where the intervals exhibit the worst coverage (for each method)

[Figure: per-variable coverage percentages of the confidence intervals for the methods Jm2013, liuyu, Res-Boot, MS-Split, Ridge, Lasso-Pro Z&Z and Lasso-Pro; scenario: Toeplitz design, s₀ = 3, coefficients from U[0,2].]

Page 30:

Motif regression example

one significant variable with both the “de-sparsified Lasso” and multi sample splitting

[Figure: “motif regression”; estimated coefficients plotted against the variable index, x-axis: variables (0–200), y-axis: coefficients (0.00–0.20), with one variable highlighted.]

highlighted variable/motif: FWER-adjusted p-value 0.006; all other variables: p-values clearly larger than 0.05

(this variable corresponds to a known true motif)

Page 31:

for data sets with p ≈ 4'000–10'000 and n ≈ 100 ⇝ often no significant variable, because the ratio log(p)/n is too extreme

Page 32:

Behavioral economics and genomewide association
with Ernst Fehr, University of Zurich

• n = 1525 probands (all students!)
• m = 79 response variables measuring various behavioral characteristics (e.g. risk aversion) from well-designed experiments
• 460 Target SNPs (as a proxy for ≈ 10^6 SNPs): 1380 parameters per response (but only 1341 meaningful parameters)

model: multivariate linear model

$$\underbrace{Y_{n \times m}}_{\text{responses}} = \underbrace{X_{n \times p}}_{\text{SNP data}}\ \beta_{p \times m} + \underbrace{\varepsilon_{n \times m}}_{\text{error}}$$

although p < n, the design matrix X (with categorical values ∈ {1,2,3}) does not have full rank

Page 33:

$$Y_{n \times m} = X_{n \times p}\, \beta_{p \times m} + \varepsilon_{n \times m}$$

interested in p-values for

$$H_{0,jk}: \beta_{jk} = 0\ \text{versus}\ H_{A,jk}: \beta_{jk} \neq 0, \qquad H_{0,G}: \beta_{jk} = 0\ \text{for all}\ (j,k) \in G\ \text{versus}\ H_{A,G} = H_{0,G}^c$$

adjusted to control the familywise error rate (i.e. a conservative criterion); in total we consider 110'857 hypotheses

we test for non-marginal regression coefficients ⇝ “predictive” GWAS

Page 34:

there is structure!

• 79 response experiments
• 23 chromosomes per response experiment
• 20 Target SNPs per chromosome = 460 Target SNPs

[Diagram: tree with a global root node, 79 response nodes, 23 chromosome nodes per response, and 20 Target SNP nodes per chromosome.]

Page 35:

do hierarchical FWER adjustment (Meinshausen, 2008)

[Diagram: the same tree, with the significant branches marked and the non-significant branches greyed out.]

1. test the global hypothesis
2. if significant: test all single-response hypotheses
3. for the significant responses: test all single-chromosome hypotheses
4. for the significant chromosomes: test all Target SNPs

⇝ powerful multiple testing with data-dependent adaptation of the resolution level (our analysis with 20 Target SNPs per chromosome is ad hoc); a generic sketch of the top-down scheme follows below

cf. the general sequential testing principle (Goeman & Solari, 2010)
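A generic sketch of this top-down scheme on a hypothetical tree (responses → chromosomes → Target SNPs); test_group stands for any valid group test, e.g. the max-type test on the next slide, and the level adjustment along the tree from Meinshausen (2008) is deliberately omitted:

```python
def hierarchical_test(node, test_group, alpha=0.05, rejected=None):
    """Test a node's group hypothesis; descend to its children only on rejection.

    `node` is a dict {"name": str, "variables": list, "children": list of nodes};
    `test_group(variables)` returns a p-value for H_{0,G} on that group.
    """
    if rejected is None:
        rejected = []
    pval = test_group(node["variables"])
    if pval <= alpha:                      # in practice: a hierarchy-adjusted level
        rejected.append((node["name"], pval))
        for child in node.get("children", []):
            hierarchical_test(child, test_group, alpha, rejected)
    return rejected
```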

Page 36:

testing a group hypothesis:

$$H_{0,G}: \beta_j^0 = 0\ \text{for all}\ j \in G$$

test statistic:

$$\max_{j \in G} |\hat{b}_j|$$

since under $H_{0,G}$:

$$\sqrt{n}\,\hat{b}_G = \mathcal{N}_G(0, \sigma_\varepsilon^2\, \Omega_G) + \Delta_G, \qquad \Delta_G = (\Delta_j;\ j \in G), \quad \sqrt{n}\|\Delta\|_\infty = o_P(1)$$

thus:

$$\max_{j \in G} \sqrt{n}\,|\hat{b}_j| \Rightarrow \sigma_\varepsilon \max_{j \in G} |W_j|, \qquad (W_1, \dots, W_p) \sim \mathcal{N}_p(0, \Omega)$$

and one can easily simulate $\max_{j \in G} |W_j|$; see the sketch below
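The simulation step is a few lines; a sketch in which b, σ̂_ε and Ω̂ are assumed to be available from the de-sparsified fit, and G is a list of coordinate indices:

```python
import numpy as np

def group_pvalue(b, G, sigma_hat, Omega, n, nsim=10_000, seed=0):
    """Monte-Carlo p-value for H_{0,G} based on the max statistic."""
    rng = np.random.default_rng(seed)
    stat = np.sqrt(n) * np.max(np.abs(b[G]))            # observed max statistic
    W = rng.multivariate_normal(np.zeros(len(G)), Omega[np.ix_(G, G)], size=nsim)
    null_max = sigma_hat * np.max(np.abs(W), axis=1)    # simulated null maxima
    return (1 + np.sum(null_max >= stat)) / (nsim + 1)  # Monte-Carlo p-value
```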

Page 37:

number of significant SNP parameters per response

[Figure: “Number of significant target SNPs per phenotype”, as on Page 2; x-axis: phenotype index, y-axis: number of significant target SNPs.]

response 40 has the most significant (levels of) Target SNPs

Page 38:

Conclusions

we can construct asymptotically optimal p-values and confidence intervals for low-dimensional targets in high-dimensional models

R-package hdi (“high-dimensional inference”; Meier, 2013)

assuming/based on suitable conditions:

• sparsity of Y vs. X: $s_0 = o(\sqrt{n}/\log(p))$
• sparsity of $X_j$ vs. $X_{-j}$ (j = 1, ..., p): $\max_j s_j = o(n/\log(p))$
• the design matrix X is not too ill-posed (e.g. restricted eigenvalue assumption, or a nice population covariance)

these conditions are typically uncheckable... ⇝ confirmatory high-dimensional inference remains challenging

Thank you!



Page 41:

R-package: hdi (Meier, 2013)

References:

• Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methodology, Theory and Applications. Springer.
• Meinshausen, N., Meier, L. and Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association 104, 1671-1681.
• Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19, 1212-1242.
• van de Geer, S., Bühlmann, P. and Ritov, Y. (2013). On asymptotically optimal confidence regions and tests for high-dimensional models. Preprint arXiv:1303.0518v1.
• Meier, L. (2013). hdi: High-dimensional inference. R-package available from R-Forge.