High-dimensional data analysis, fall 2013 (rootzen/highdimensional/hdd10.pdf, 2013-11-01)
- Yeast, understanding basic life functions: 11,904 p-values (Blomberg et al. 2003, 2010)
- Arabidopsis thaliana, association mapping: 3,745 p-values (Zhao et al. 2007)
- fMRI brain scans, function of the brain language network: approx. 3 million p-values (Taylor et al. 2006)
High-dimensional data analysis, fall 2013
Slides for B&vdG 10.1 – 10.5, 10.7: Stable solutions
Exercises: 10.1
B&vdG 10.2: Subsampling, stability and selection

Sometimes the aim is prediction, sometimes variable selection (and sometimes both). Both are important, but selection is harder!
Setting: $Y = X\beta + \epsilon$; think of the Lasso (but the ideas are more general).
Recall from B&vdG 2 that the regularisation path is the set of $p$ functions of $\lambda$ defined as
$$\{\hat\beta_j(\lambda);\ \lambda \in \Lambda,\ j = 1, \dots, p\},$$
where $\Lambda$ typically is some interval $[\lambda_{\min}, \lambda_{\max}]$, and that
$$\hat S(\lambda) = \{j;\ \hat\beta_j(\lambda) \neq 0\}.$$
Now write $\hat S(\lambda) = \hat S(\lambda; I)$ to indicate the dependence on the sample, which above is $I = \{1, \dots, n\}$.
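As a small illustration of the path and its active sets $\hat S(\lambda)$, here is a sketch on simulated data, assuming scikit-learn's `lasso_path` is available (all data and sizes are my own choices, not from the slides):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
Y = 2 * X[:, 0] - X[:, 1] + rng.standard_normal(n)

# lambdas come back in decreasing order, starting at the value where
# every coefficient is zero; coefs has shape (p, n_lambdas)
lambdas, coefs, _ = lasso_path(X, Y)
active_sets = [set(np.flatnonzero(coefs[:, k])) for k in range(len(lambdas))]
```

Each entry of `active_sets` is one $\hat S(\lambda)$; the first (largest $\lambda$) is empty, and the truly active covariates enter as $\lambda$ decreases.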
Let $I^*$ be a random subset of $\{1, \dots, n\}$ of size $m = \lfloor n/2 \rfloor$ selected by drawing without replacement, and for a subset $K$ (typically $K = \{j\}$) of $\{1, \dots, p\}$ let the subsampling probability be
$$\hat\Pi_K(\lambda) = P^*[K \subset \hat S(\lambda; I^*)] = \frac{\#\{\text{size-}m\text{ subsets } I^* \text{ with } K \subset \hat S(\lambda; I^*)\}}{\binom{n}{m}}.$$
Here $\hat\Pi_K(\lambda)$ may be estimated by drawing randomly without replacement a large number $B$ of subsets $I^{*1}, \dots, I^{*B}$ and computing
$$\hat\Pi_K(\lambda) = \frac{1}{B} \sum_{b=1}^{B} 1\{K \subset \hat S(\lambda; I^{*b})\}.$$
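A minimal sketch of this Monte Carlo estimator with $K = \{j\}$ for each $j$, on simulated data (the design, $B$, and $\lambda$ are illustrative assumptions; scikit-learn's `Lasso` is assumed for the selector):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, B, lam = 200, 20, 50, 0.1
X = rng.standard_normal((n, p))
Y = 10 * X[:, 0] + 0.1 * rng.standard_normal(n)   # strong signal on j = 0

counts = np.zeros(p)
for _ in range(B):
    # draw a subsample of size n//2 without replacement
    I_star = rng.choice(n, size=n // 2, replace=False)
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    fit.fit(X[I_star], Y[I_star])
    counts += (fit.coef_ != 0)

Pi_hat = counts / B   # Pi_hat[j] estimates the selection probability of j
```

The strongly relevant covariate $j = 0$ gets a selection frequency near one, while noise covariates typically sit far lower.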
B&vdG argue that the stability path
$$\{\hat\Pi_{\{j\}}(\lambda);\ \lambda \in \Lambda,\ j = 1, \dots, p\}$$
is better for variable selection than the regularization path.
Typically they use $\lambda_{\min} = 0$, and take $\lambda_{\max}$ as the smallest value of $\lambda$ for which all $\hat\beta_j$ are estimated as zero (this value can be seen to be $\max_{1 \le j \le p} 2|X_j^T Y|/n$).
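A quick numerical check of this formula on simulated data. Note that scikit-learn's `Lasso` minimises $\|Y - X\beta\|^2/(2n) + \alpha\|\beta\|_1$, so its $\alpha$ corresponds to $\lambda/2$ in B&vdG's parametrisation $\|Y - X\beta\|^2/n + \lambda\|\beta\|_1$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
Y = X[:, 0] + rng.standard_normal(n)

lam_max = np.max(2 * np.abs(X.T @ Y) / n)   # B&vdG's lambda_max

# at lambda_max every coefficient is shrunk to zero; just below it,
# the most correlated covariate enters the model
at_max = Lasso(alpha=lam_max / 2, fit_intercept=False).fit(X, Y)
below = Lasso(alpha=0.9 * lam_max / 2, fit_intercept=False).fit(X, Y)
```

Here `at_max.coef_` is identically zero while `below.coef_` has at least one nonzero entry.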
B&vdG 10.2.1.1: Vitamin B2 production using Bacillus subtilis
Numerical experiment using
𝑛 = 115 values of the logarithm of vitamin B2 production
𝑝 = 4088 gene expression values
6 genes were selected at random from the 200 genes with the highest marginal empirical correlation with the log vitamin B2 production response variable.
The other genes were subjected to a random permutation of rows, so that their possible connections with the response variable disappeared.
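The construction above can be sketched as follows, using fully simulated stand-in data (with $p = 500$ rather than the 4088 genes, and a made-up response):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 115, 500
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(n)

# rank genes by absolute marginal empirical correlation with the response
cors = np.abs(np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)]))
top200 = np.argsort(cors)[-200:]
keep = rng.choice(top200, size=6, replace=False)   # the 6 genes left intact

# independently permute the rows of every other gene: each gene's marginal
# distribution is preserved but any link to y is destroyed
X_perm = X.copy()
for j in np.setdiff1d(np.arange(p), keep):
    X_perm[:, j] = rng.permutation(X_perm[:, j])
```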
[Figure: regularization path (left) and stability path (right). x-axis: $\lambda/\lambda_{\max}$ (in reverse ordering); y-axis: the $\hat\beta_j(\lambda)$ (left), the $\hat\Pi_{\{j\}}(\lambda)$ (right). Red lines are the non-permuted genes.]
B&vdG 10.2.1.2: Motif regression
Heat shock experiment for finding transcription factor binding sites in DNA sequences. Subset containing
𝑛 = 1200 gene expression values
𝑝 = 666 motif scores
Lasso estimates $\hat\beta_j = \hat\beta_j(\hat\lambda_{CV})$ with $\hat\lambda_{CV}$ chosen by 10-fold cross-validation, and the corresponding subsampling probabilities $\hat\Pi_j = \hat\Pi_j(\hat\lambda_{CV})$ for the 9 most promising motifs:
Should one use the ordering from $|\hat\beta_j|$ or from $\hat\Pi_j$?
Numerical experiment: choose 5 covariates at random, set the corresponding $\beta$'s to values which lead to a very low signal-to-noise ratio ($= 0.1$), set all other $\beta$'s to zero, and simulate with i.i.d. $N(0, 1)$ error variables $\epsilon_t$. This gives the following result:
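One way such a design can be generated is to rescale $\beta$ so that $\mathrm{Var}(X\beta)/\mathrm{Var}(\epsilon)$ hits the target signal-to-noise ratio; a sketch (the dimensions here are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, snr = 200, 50, 0.1
X = rng.standard_normal((n, p))

active = rng.choice(p, size=5, replace=False)   # 5 covariates at random
beta = np.zeros(p)
beta[active] = 1.0

# rescale so that Var(X beta) / Var(eps) equals the target SNR (eps ~ N(0,1))
beta *= np.sqrt(snr / np.var(X @ beta))
y = X @ beta + rng.standard_normal(n)
```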
[Scatterplot: x-axis $\hat\Pi_j(\hat\lambda_{CV})$, y-axis $|\hat\beta_j(\hat\lambda_{CV})|$. Red crosses are the active genes.]
B&vdG 10.3: Stability selection
Traditionally: select one element, say $\hat S(\lambda_0)$, from the set of models
$$\{\hat S(\lambda);\ \lambda \in \Lambda\}.$$
Alternatively: select a value $\pi_{thr}$ and select the model
$$\hat S_{stable} = \{j;\ \max_{\lambda \in \Lambda} \hat\Pi_j(\lambda) > \pi_{thr}\}$$
(and then perhaps re-estimate the $\beta$'s in this set with OLS). Often $\Lambda = \{\hat\lambda_{CV}\}$.
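The whole selection rule can be sketched end-to-end on simulated data (the $\lambda$ grid, $B$, and threshold are illustrative assumptions; the Lasso is the base selector):

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_select(X, y, lambdas, pi_thr=0.6, B=50, seed=0):
    """Return S_stable = {j : max over the grid of Pi_hat_j(lambda) > pi_thr}."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros((p, len(lambdas)))
    for _ in range(B):
        idx = rng.choice(n, size=n // 2, replace=False)  # subsample of size n//2
        for k, lam in enumerate(lambdas):
            fit = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
            fit.fit(X[idx], y[idx])
            counts[:, k] += (fit.coef_ != 0)
    max_freq = counts.max(axis=1) / B   # max over lambda of Pi_hat_j(lambda)
    return np.flatnonzero(max_freq > pi_thr), max_freq

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 20))
y = X[:, 0] + X[:, 1] + 0.2 * rng.standard_normal(100)
selected, freq = stability_select(X, y, lambdas=[0.05, 0.1, 0.2])
```

With this strong signal the two active covariates are selected with frequency near one; noise covariates only rarely cross the threshold.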
Type 1 error: select a covariate which isn't active, i.e. a $j \notin S_0$.
Type 2 error: fail to select a covariate which is active, i.e. a $j \in S_0$.
We want to make the probability of both errors "small".
$$\hat S(\Lambda) := \cup_{\lambda \in \Lambda} \hat S(\lambda), \qquad q_\Lambda = E|\hat S(\Lambda)|,$$
$$V = |S_0^c \cap \hat S_{stable}| = \#\text{type 1 errors}.$$
Thm 10.1 Assume that $\{1(j \in \hat S(\Lambda));\ j \in S_0^c\}$ has an exchangeable distribution and that
$$\frac{E|S_0 \cap \hat S(\Lambda)|}{E|S_0^c \cap \hat S(\Lambda)|} \ge \frac{|S_0|}{|S_0^c|}.$$
Then, for $\pi_{thr} > 1/2$,
$$E(V) \le \frac{1}{2\pi_{thr} - 1} \cdot \frac{q_\Lambda^2}{p}.$$
$E(V) :=$ PFER $=$ per-family error rate
$E(V)/p :=$ PCER $=$ per-comparison error rate
Type 1 error control: for a given value $\nu$ choose $\pi_{thr}$ such that $E(V) \le \nu$. If $\nu$ is chosen as some suitably small number, say $\nu = \alpha = 0.05$, one then gets Type 1 (or, equivalently, PFER) error control,
$$P(V > 0) \le E(V) \le 0.05.$$
However, sometimes bigger $\nu$-values are also of interest, e.g. if one wants to control the PCER. By Thm 10.1, $E(V) \le \nu$ holds if the threshold is chosen so that
$$\frac{1}{2\pi_{thr} - 1} \cdot \frac{q_\Lambda^2}{p} = \nu \iff \pi_{thr} = \Big(1 + \frac{q_\Lambda^2}{p\nu}\Big)\Big/2$$
(only useful if $q_\Lambda^2 < p\nu$, so that $\pi_{thr} < 1$).
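As a worked check of this formula (the example numbers are mine, not from the slides):

```python
def pi_thr_for(q, p, nu):
    """Threshold from Thm 10.1 giving E(V) <= nu; requires q**2 < p*nu."""
    assert q**2 < p * nu, "bound is vacuous: threshold would reach 1"
    return (1 + q**2 / (p * nu)) / 2

# e.g. q_Lambda = 30 expected selections, p = 1000, target E(V) <= 2.5:
# (1 + 900/2500)/2 = (1 + 0.36)/2 = 0.68
pi_thr = pi_thr_for(30, 1000, 2.5)
```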
Homework: Problem 10.1
But here $q_\Lambda = E|\hat S(\Lambda; I)|$ isn't known. One way to handle this is to decide beforehand on a value $q$ and then use a procedure which selects at most $q$ covariates. Then of course $E|\hat S(\Lambda; I)| \le q$. Possible ways of doing this include
• use standard Lasso but only select the 𝑞 covariates with the largest absolute values of the regression coefficients;
• select the 𝑞 variables which enter first in the regularization path.
This instead leads to the problem of selecting $q$. An alternative is to turn things around and decide on a value of $\pi_{thr}$, say $\pi_{thr} = 0.9$, and then use
$$q = \big\lfloor \sqrt{\nu p (2\pi_{thr} - 1)} \big\rfloor.$$
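Numerically, with illustrative values $\nu = 2.5$, $p = 1000$, $\pi_{thr} = 0.9$:

```python
import math

def q_for(nu, p, pi_thr):
    """Largest integer q for which Thm 10.1 still gives E(V) <= nu."""
    return math.floor(math.sqrt(nu * p * (2 * pi_thr - 1)))

q = q_for(2.5, 1000, 0.9)                 # floor(sqrt(2000)) = 44
bound = q**2 / ((2 * 0.9 - 1) * 1000)     # about 2.42, below the target 2.5
```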
B&vdG 10.4: A numerical experiment
[Figure: red triangles are stability selection, controlled to $E(V) \le 2.5$; black dots are cross-validated Lasso. Each pair comes from a different simulation set-up.]
B&vdG 10.7: Proofs
Read!