High-dimensional data analysis, fall 2013 (rootzen/highdimensional/hdd10.pdf, 2013-11-01)
- Yeast, understanding basic life functions: 11,904 p-values (Blomberg et al. 2003, 2010)
- Arabidopsis thaliana, association mapping: 3,745 p-values (Zhao et al. 2007)
- fMRI brain scans, function of the brain language network: approx. 3 million p-values (Taylor et al. 2006)
High-dimensional data analysis, fall 2013
Slides for B&vdG 10.1 – 10.5, 10.7: Stable solutions
Exercises: 10.1
B&vdG 10.2: Subsampling, stability and selection

Sometimes the aim is prediction, sometimes variable selection (and sometimes both). Both are important, but selection is harder!
Setting: $Y = X\beta + \epsilon$; think of the Lasso (but the ideas are more general).
Recall from B&vdG 2 that the regularisation path is the set of $p$ functions of $\lambda$ defined as
$$\{\hat\beta_j(\lambda);\ \lambda \in \Lambda,\ j = 1, \dots, p\},$$
where $\Lambda$ typically is some interval $[\lambda_{\min}, \lambda_{\max}]$, and that
$$\hat S(\lambda) = \{j;\ \hat\beta_j(\lambda) \neq 0\}.$$
Now write $\hat S(\lambda) = \hat S(\lambda; I)$ to indicate the dependence on the sample, which above is $I = \{1, \dots, n\}$.
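As a small illustration of the path and its active sets $\hat S(\lambda)$, here is a sketch on simulated data, assuming scikit-learn's `lasso_path` is available (all data and sizes are my own choices, not from the slides):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
Y = 2 * X[:, 0] - X[:, 1] + rng.standard_normal(n)

# lambdas come back in decreasing order, starting at the value where
# every coefficient is zero; coefs has shape (p, n_lambdas)
lambdas, coefs, _ = lasso_path(X, Y)
active_sets = [set(np.flatnonzero(coefs[:, k])) for k in range(len(lambdas))]
```

Each entry of `active_sets` is one $\hat S(\lambda)$; the first (largest $\lambda$) is empty, and the truly active covariates enter as $\lambda$ decreases.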
Let $I^*$ be a random subset of $\{1, \dots, n\}$ of size $m = \lfloor n/2 \rfloor$ selected by drawing without replacement, and for a subset $K$ (typically $K = \{j\}$) of $\{1, \dots, p\}$ let the subsampling probability be
$$\hat\Pi_K(\lambda) = P^*[K \subset \hat S(\lambda; I^*)] = \frac{\#\{\text{size-}m\text{ subsets } I^* \text{ with } K \subset \hat S(\lambda; I^*)\}}{\binom{n}{m}}.$$
Here $\hat\Pi_K(\lambda)$ may be estimated by drawing randomly without replacement a large number $B$ of subsets $I^{*1}, \dots, I^{*B}$ and computing
$$\hat\Pi_K(\lambda) = \frac{1}{B} \sum_{b=1}^{B} 1\{K \subset \hat S(\lambda; I^{*b})\}.$$
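A minimal sketch of this Monte Carlo estimator with $K = \{j\}$ for each $j$, on simulated data (the design, $B$, and $\lambda$ are illustrative assumptions; scikit-learn's `Lasso` is assumed for the selector):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, B, lam = 200, 20, 50, 0.1
X = rng.standard_normal((n, p))
Y = 10 * X[:, 0] + 0.1 * rng.standard_normal(n)   # strong signal on j = 0

counts = np.zeros(p)
for _ in range(B):
    # draw a subsample of size n//2 without replacement
    I_star = rng.choice(n, size=n // 2, replace=False)
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    fit.fit(X[I_star], Y[I_star])
    counts += (fit.coef_ != 0)

Pi_hat = counts / B   # Pi_hat[j] estimates the selection probability of j
```

The strongly relevant covariate $j = 0$ gets a selection frequency near one, while noise covariates typically sit far lower.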
B&vdG argue that the stability path
$$\{\hat\Pi_{\{j\}}(\lambda);\ \lambda \in \Lambda,\ j = 1, \dots, p\}$$
is better for variable selection than the regularization path.
Typically they use $\lambda_{\min} = 0$, and take $\lambda_{\max}$ as the smallest value of $\lambda$ for which all $\hat\beta_j$ are estimated as zero (this value can be seen to be $\max_{1 \le j \le p} 2|X_j^T Y|/n$).
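A quick numerical check of this formula on simulated data. Note that scikit-learn's `Lasso` minimises $\|Y - X\beta\|^2/(2n) + \alpha\|\beta\|_1$, so its $\alpha$ corresponds to $\lambda/2$ in B&vdG's parametrisation $\|Y - X\beta\|^2/n + \lambda\|\beta\|_1$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
Y = X[:, 0] + rng.standard_normal(n)

lam_max = np.max(2 * np.abs(X.T @ Y) / n)   # B&vdG's lambda_max

# at lambda_max every coefficient is shrunk to zero; just below it,
# the most correlated covariate enters the model
at_max = Lasso(alpha=lam_max / 2, fit_intercept=False).fit(X, Y)
below = Lasso(alpha=0.9 * lam_max / 2, fit_intercept=False).fit(X, Y)
```

Here `at_max.coef_` is identically zero while `below.coef_` has at least one nonzero entry.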
B&vdG 10.2.1.1: Vitamin B2 production using Bacillus subtilis
Numerical experiment using
𝑛 = 115 values of the logarithm of vitamin B2 production
𝑝 = 4088 gene expression values
6 genes were selected at random from the 200 genes with the highest marginal empirical correlation with the log vitamin B2 production response variable.
The other genes were subjected to a random permutation of rows, so that their possible connections with the response variable disappeared.
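The construction above can be sketched as follows, using fully simulated stand-in data (with $p = 500$ rather than the 4088 genes, and a made-up response):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 115, 500
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(n)

# rank genes by absolute marginal empirical correlation with the response
cors = np.abs(np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)]))
top200 = np.argsort(cors)[-200:]
keep = rng.choice(top200, size=6, replace=False)   # the 6 genes left intact

# independently permute the rows of every other gene: each gene's marginal
# distribution is preserved but any link to y is destroyed
X_perm = X.copy()
for j in np.setdiff1d(np.arange(p), keep):
    X_perm[:, j] = rng.permutation(X_perm[:, j])
```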
[Figure: regularization path (left) and stability path (right). x-axis: $\lambda/\lambda_{\max}$ (in reverse ordering); y-axis: the $\hat\beta_j(\lambda)$ (left), the $\hat\Pi_{\{j\}}(\lambda)$ (right). Red lines are the non-permuted genes.]
B&vdG 10.2.1.2: Motif regression
Heat shock experiment for finding transcription factor binding sites in DNA sequences. Subset containing
𝑛 = 1200 gene expression values
𝑝 = 666 motif scores
Lasso estimates $\hat\beta_j = \hat\beta_j(\hat\lambda_{CV})$ with $\hat\lambda_{CV}$ chosen by 10-fold cross-validation, and the corresponding subsampling probabilities $\hat\Pi_j = \hat\Pi_j(\hat\lambda_{CV})$ for the 9 most promising motifs:
Should one use the ordering from $|\hat\beta_j|$ or from $\hat\Pi_j$?
Numerical experiment: choose 5 covariates at random, set the corresponding $\beta$'s to values which lead to a very low signal-to-noise ratio ($= 0.1$), set all other $\beta$'s to zero, and simulate with i.i.d. $N(0, 1)$ error variables $\epsilon_t$. This gives the following result:
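One way such a design can be generated is to rescale $\beta$ so that $\mathrm{Var}(X\beta)/\mathrm{Var}(\epsilon)$ hits the target signal-to-noise ratio; a sketch (the dimensions here are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, snr = 200, 50, 0.1
X = rng.standard_normal((n, p))

active = rng.choice(p, size=5, replace=False)   # 5 covariates at random
beta = np.zeros(p)
beta[active] = 1.0

# rescale so that Var(X beta) / Var(eps) equals the target SNR (eps ~ N(0,1))
beta *= np.sqrt(snr / np.var(X @ beta))
y = X @ beta + rng.standard_normal(n)
```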
[Scatterplot: x-axis $\hat\Pi_j(\hat\lambda_{CV})$, y-axis $|\hat\beta_j(\hat\lambda_{CV})|$. Red crosses are the active genes.]
B&vdG 10.3: Stability selection
Traditionally: select one element, say $\hat S(\lambda_0)$, from the set of models
$$\{\hat S(\lambda);\ \lambda \in \Lambda\}.$$
Alternatively: select a value $\pi_{thr}$ and select the model
$$\hat S_{stable} = \{j;\ \max_{\lambda \in \Lambda} \hat\Pi_j(\lambda) > \pi_{thr}\}$$
(and then perhaps re-estimate the $\beta$'s in this set with OLS). Often $\Lambda = \{\hat\lambda_{CV}\}$.
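The whole selection rule can be sketched end-to-end on simulated data (the $\lambda$ grid, $B$, and threshold are illustrative assumptions; the Lasso is the base selector):

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_select(X, y, lambdas, pi_thr=0.6, B=50, seed=0):
    """Return S_stable = {j : max over the grid of Pi_hat_j(lambda) > pi_thr}."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros((p, len(lambdas)))
    for _ in range(B):
        idx = rng.choice(n, size=n // 2, replace=False)  # subsample of size n//2
        for k, lam in enumerate(lambdas):
            fit = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
            fit.fit(X[idx], y[idx])
            counts[:, k] += (fit.coef_ != 0)
    max_freq = counts.max(axis=1) / B   # max over lambda of Pi_hat_j(lambda)
    return np.flatnonzero(max_freq > pi_thr), max_freq

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 20))
y = X[:, 0] + X[:, 1] + 0.2 * rng.standard_normal(100)
selected, freq = stability_select(X, y, lambdas=[0.05, 0.1, 0.2])
```

With this strong signal the two active covariates are selected with frequency near one; noise covariates only rarely cross the threshold.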
Type 1 error: select a covariate which isn't active, i.e. a $j \notin S_0$.
Type 2 error: fail to select a covariate which is active, i.e. a $j \in S_0$.
We want to make the probability of both errors "small".
$$\hat S(\Lambda) := \cup_{\lambda \in \Lambda} \hat S(\lambda), \qquad q_\Lambda = E|\hat S(\Lambda)|,$$
$$V = |S_0^c \cap \hat S_{stable}| = \#\text{type 1 errors}.$$
Thm 10.1 Assume that $\{1(j \in \hat S(\Lambda));\ j \in S_0^c\}$ has an exchangeable distribution and that
$$\frac{E|S_0 \cap \hat S(\Lambda)|}{E|S_0^c \cap \hat S(\Lambda)|} \ge \frac{|S_0|}{|S_0^c|}.$$
Then, for $\pi_{thr} > 1/2$,
$$E(V) \le \frac{1}{2\pi_{thr} - 1} \cdot \frac{q_\Lambda^2}{p}.$$
$E(V) :=$ PFER $=$ per-family error rate
$E(V)/p :=$ PCER $=$ per-comparison error rate
Type 1 error control: for a given value $\nu$ choose $\pi_{thr}$ such that $E(V) \le \nu$. If $\nu$ is chosen as some suitably small number, say $\nu = \alpha = 0.05$, one then gets Type 1 (or, equivalently, PFER) error control,
$$P(V > 0) \le E(V) \le 0.05.$$
However, sometimes bigger $\nu$-values are also of interest, e.g. if one wants to control the PCER. By Thm 10.1, $E(V) \le \nu$ holds if the threshold is chosen so that
$$\frac{1}{2\pi_{thr} - 1} \cdot \frac{q_\Lambda^2}{p} = \nu \iff \pi_{thr} = \Big(1 + \frac{q_\Lambda^2}{p\nu}\Big)\Big/2$$
(only useful if $q_\Lambda^2 < p\nu$, so that $\pi_{thr} < 1$).
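As a worked check of this formula (the example numbers are mine, not from the slides):

```python
def pi_thr_for(q, p, nu):
    """Threshold from Thm 10.1 giving E(V) <= nu; requires q**2 < p*nu."""
    assert q**2 < p * nu, "bound is vacuous: threshold would reach 1"
    return (1 + q**2 / (p * nu)) / 2

# e.g. q_Lambda = 30 expected selections, p = 1000, target E(V) <= 2.5:
# (1 + 900/2500)/2 = (1 + 0.36)/2 = 0.68
pi_thr = pi_thr_for(30, 1000, 2.5)
```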
Homework: Problem 10.1
But here $q_\Lambda = E|\hat S(\Lambda; I)|$ isn't known. One way to handle this is to decide beforehand on a value $q$ and then use a procedure which selects at most $q$ covariates. Then of course $E|\hat S(\Lambda; I)| \le q$. Possible ways of doing this include
• use standard Lasso but only select the 𝑞 covariates with the largest absolute values of the regression coefficients;
• select the 𝑞 variables which enter first in the regularization path.
This instead leads to the problem of selecting $q$. An alternative is to turn things around and decide on a value of $\pi_{thr}$, say $\pi_{thr} = 0.9$, and then use
$$q = \big\lfloor \sqrt{\nu p (2\pi_{thr} - 1)} \big\rfloor.$$
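Numerically, with illustrative values $\nu = 2.5$, $p = 1000$, $\pi_{thr} = 0.9$:

```python
import math

def q_for(nu, p, pi_thr):
    """Largest integer q for which Thm 10.1 still gives E(V) <= nu."""
    return math.floor(math.sqrt(nu * p * (2 * pi_thr - 1)))

q = q_for(2.5, 1000, 0.9)                 # floor(sqrt(2000)) = 44
bound = q**2 / ((2 * 0.9 - 1) * 1000)     # about 2.42, below the target 2.5
```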
B&vdG 10.4: A numerical experiment
[Figure: red triangles are stability selection, controlled to $E(V) \le 2.5$; black dots are cross-validated Lasso. Each pair comes from a different simulation set-up.]
B&vdG 10.7: Proofs
Read!