Slide 1
Should we use Standard Errors
or Cross-Validation
in Component Analysis Techniques?
Henk A.L. Kiers
University of Groningen
IFCS 2002, Krakow, July 16-19
Slide 2
Preliminaries
Compare 2 groups of people on income in $1000 (Y):
Group 1 = young (e.g., 20-30 yrs): mean1 = 49.8
Group 2 = old (e.g., 50-60 yrs): mean2 = 60.8
Slide 3
View this comparison as model fitting:
• X = Age (two binary dummy variables X1 and X2)
• General linear model: Y = β1 X1 + β2 X2 + E
• Estimates: β̂1 = 49.8, se = 1.2
  β̂2 = 60.8, se = 1.2
• Model fit: 1 - Var(Y - Ŷ) / Var(Y) = 52.2 %
How well does the model fit the data?
→ 52.2 % OK
How reliable are the model estimates?
→ se's are small compared to the scale
→ 95% confidence intervals roughly:
  β1: 49.8 ± 2.4
  β2: 60.8 ± 2.4
How well would the model fit other data?
→ … a cross-validation question!
Slide 4
We ‘happen’ to have a second data set from same population:
Question 1:
What estimates do we get from new data?
Estimates on 2nd data set:
β̂1 = 49.8 (was 49.8, se = 1.2)
β̂2 = 59.2 (was 60.8, se = 1.2)
→ description 'quite' similar.
Question 2:
How well does original solution describe 2nd data set?
How well would model fit other data?
Compute Ŷ = β̂1 X1 + β̂2 X2 using the 1st-sample estimates
Compute Q² = 1 - Var(Y2 - Ŷ) / Var(Y2) = 45.7 %
→ a reasonable guess of 'how good' the current description is for future observations
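To make the Q² computation concrete: a minimal numpy sketch, using hypothetical simulated samples in place of the two income data sets (means chosen to match the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two income samples (in $1000)
y1_young = rng.normal(49.8, 8, 50)   # 1st sample, young group
y1_old   = rng.normal(60.8, 8, 50)   # 1st sample, old group
y2_young = rng.normal(49.8, 8, 50)   # 2nd sample, young group
y2_old   = rng.normal(60.8, 8, 50)   # 2nd sample, old group

# Estimates from the 1st sample: the two group means
b1, b2 = y1_young.mean(), y1_old.mean()

# Apply the 1st-sample estimates to the 2nd sample
y2     = np.concatenate([y2_young, y2_old])
y2_hat = np.concatenate([np.full_like(y2_young, b1),
                         np.full_like(y2_old, b2)])

# Q² = 1 - Var(Y2 - Ŷ) / Var(Y2)
q2 = 1 - np.var(y2 - y2_hat) / np.var(y2)
print(f"Q² on the 2nd sample: {q2:.1%}")
```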
Slide 5
… and if we have continuous data …
Does the linear regression model fit (well)?
Y = β0 + β1 X + E
Estimates: β̂0 = 37.1, se = 3.0
β̂1 = 0.40, se = 0.08
Slide 6
Model fit: R² = 1 - Var(Y - Ŷ) / Var(Y) = 39.8 %
Slide 7
Question 1:
What model estimates do we get from new data?
se’s give an idea about this …
What if we had a different sample?
or look at graphs:
example of different lines for different samples (drawn through original sample):
Slide 8
Question 2:
How well does the original solution describe a 2nd data set X2, Y2 from the same population?
What if no second data set is available?
→ resort to se's
→ split the data set in parts and cross-validate
Compute Ŷ = β̂0 + β̂1 X2 using the 1st-sample model estimates
Compute Q² = 1 - Var(Y2 - Ŷ) / Var(Y2) = 13.8 %
→ the current model estimates are not at all good for another sample from the same population
… even though β̂0, β̂1 are highly significant!
(→ significance is uninformative about future behavior!)
Slide 9
se's:
→ indicate what happens to model estimates
→ don't indicate what happens to model quality when we get a new sample from the same population
Alternatives: split the data set in parts, create a '2nd' set:
• 2 parts (split-half): find estimates on one half (training set), cross-validate these on the other half (test set)
• k<n parts (k-fold jack-knife): full data, but leave a subset out; use the rest as training set and the left-out subset as test set. Do this for all k sets.
• n parts (ordinary jack-knife): full data, but leave one case out; use the rest as training set and the left-out case as test 'set'. Do this for all n cases.
Slide 10
How does the jack-knife etc. work?
1. Split the data Z = (X,Y) into K subsets Zk = (Xk, Yk)
For k = 1,…,K:
2. Fit model to Z-k = (Z1,…,Zk-1 ,Zk+1 ,…,ZK)
→ model parameter estimates A-k
3. Validate model on Zk:
3a. Compute estimates Ŷk for Yk by applying A-k to Xk
→ Rk = Yk - Ŷk = cross-validated residuals
3b. Compute cross-validated model fit:
Q² = fit(parameter estimates A-k | data = Zk)
4. Compute the sum of squares of all cross-validated residuals: PRESS measure
5. Compare estimates across k
(display or compute stdev)
→ visualizes influential observations
→ gives an uncertainty measure (se)
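A minimal numpy sketch of steps 1-4 above for an ordinary regression model; the k-fold split, the example data, and the function name are hypothetical:

```python
import numpy as np

def kfold_press(X, y, k=5, seed=0):
    """K-fold cross-validation for linear regression: fit on Z_-k,
    compute residuals on Z_k, and sum their squares (PRESS)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    Xd = np.column_stack([np.ones(n), X])        # add intercept column
    press = 0.0
    for test in folds:
        train = np.setdiff1d(idx, test)
        # step 2: fit the model to Z_-k -> estimates A_-k
        a, *_ = np.linalg.lstsq(Xd[train], y[train], rcond=None)
        # step 3a: cross-validated residuals R_k = Y_k - Ŷ_k(A_-k)
        r = y[test] - Xd[test] @ a
        # step 4: accumulate the sum of squared cv residuals
        press += (r ** 2).sum()
    return press

# hypothetical example data, roughly matching the regression slide
rng = np.random.default_rng(1)
X = rng.uniform(20, 60, 100)
y = 37.1 + 0.40 * X + rng.normal(0, 6, 100)
print("PRESS:", kfold_press(X, y))
```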
Slide 11
First Overview

approach          | assessment of uncertainty of model estimates | assessment of model fit on other data *    | signalling problematic observations
se's              | excellent                                    | none                                       | no
jack-knife        | reasonable                                   | good (unbiased, but sample-dependent)      | signals influential observations
k-fold jack-knife | poor                                         | good (more biased, less sample-dependent)  | to some extent
split-half        | none/poor                                    | good (most biased, least sample-dependent) | to some extent

*) see e.g. Hastie, Tibshirani & Friedman, 2001
Slide 12
Se's and cross-validation are OK for "(X,Y)-models":
X = predictor(s)
Y = criterion
A = parameters (e.g., regression weights), with unique estimates
Ŷ = f(X,A) (e.g., in regression: Ŷ = XA)
But, how to handle component methods?
Component methods:
→ no (X,Y)-models
→ no unique parameter estimates
e.g., Principal Component Analysis:
Model: Y = AB’ + E
A and B both parameter matrices; no predictors
→ estimates A and B are just as good as AT and B(T⁻¹)'
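This indeterminacy is easy to verify numerically; a small numpy sketch with hypothetical matrices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical rank-2 decomposition Y = AB'
A = rng.normal(size=(100, 2))        # component scores
B = rng.normal(size=(8, 2))          # loadings
Y = A @ B.T

# Any nonsingular T gives an equally good pair (AT, B(T⁻¹)'):
theta = 0.7
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

A2 = A @ T
B2 = B @ np.linalg.inv(T).T          # B(T⁻¹)'

print(np.allclose(Y, A2 @ B2.T))     # True: identical fit
```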
Cross-validation approaches usable for
all sorts of descriptive methods:
Slide 13
Why is "no (X,Y) relation" a problem?
→ For computing se's: no problem
→ For cross-validation: the validation process becomes ambiguous
Why is "no unique estimates" a problem?
→ For computing se's: variation in estimates will be due to nonuniqueness (not only to sampling fluctuation)
→ For cross-validation: need not be problematic, but often is …
Slide 14
Determining se's in PCA
1. Identify the PCA model:
A (component scores):
- unrotated, in order of explained variance
- normalized to sum of squares n
- ensure that column sums are positive
→ B (loadings) automatically identified by the above
or:
B (loadings):
- rotated by varimax, ordered by explained variance
- ensure that column sums are positive
→ A (component scores) automatically identified by the above
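One possible reading of the first identification recipe in code; the SVD route and the column-centering are assumptions of mine:

```python
import numpy as np

def identified_pca(Y, r):
    """Identified PCA solution: unrotated components in order of
    explained variance, scores with sum of squares n per column,
    and positive column sums of A."""
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)                  # column-center (assumption)
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    A = U[:, :r] * np.sqrt(n)                # each column of A has SS = n
    B = Vt[:r].T * s[:r] / np.sqrt(n)        # loadings, so Yc ≈ AB'
    flip = np.where(A.sum(axis=0) < 0, -1.0, 1.0)
    return A * flip, B * flip                # enforce positive column sums
```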
Slide 15
2a. Use distributional assumptions, compute se’s
e.g., Anderson, 1963, Archer & Jennrich, 1973, Jennrich, 1973, Ogasawara, 1996, 2000
2b. Use resampling techniques, compute se's:
bootstrap (or, if you like, jackknife)
e.g., Efron & Tibshirani, 1993
→ construct N bootstrap samples
→ apply PCA to each sample
→ for each parameter: compute the std across all samples → se's
                 | pros                               | cons
distr. ass. se's | explicit expressions               | takes orientation, order, assumptions (too?!) seriously
bootstrap se's   | does not require prefixed rotation | computer intensive
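A sketch of the resampling route (2b), with a fixed identification per bootstrap sample but no matching across samples; this is the naive scheme whose failure mode the next slides illustrate (function name, defaults, and sign convention are hypothetical):

```python
import numpy as np

def naive_bootstrap_ses(Y, r, n_boot=500, seed=3):
    """Elementwise bootstrap se's for PCA loadings: resample rows,
    re-run PCA, identify each solution separately, take std's."""
    n = Y.shape[0]
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        Yb = Y[rng.integers(0, n, size=n)]              # resample rows
        Yc = Yb - Yb.mean(axis=0)
        _, s, Vt = np.linalg.svd(Yc, full_matrices=False)
        B = Vt[:r].T * s[:r] / np.sqrt(n)               # unrotated loadings
        B = B * np.where(B.sum(axis=0) < 0, -1.0, 1.0)  # a sign convention
        boots.append(B)
    return np.std(boots, axis=0)                        # se per loading
```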
Slide 16
What can go wrong when you take orientation too seriously?
Example data: 100 × 8 data set
PCA: 2 components
Eigenvalues: 4.04, 3.96, 0.0002, etc. (note: the first two are close to each other)
PCA (unrotated) solutions for variables (a,b,c,d,e,f,g,h) and bootstrap-based 95% confidence ellipses*:
*) thanks to program by Patrick Groenen (procedure by Meulman & Heiser, 1983)
Slide 17
Look at the loadings for the data and some bootstraps:

     Data        Bootstrap 1   Bootstrap 2   Bootstrap 3
a   -0.6  0.8   -0.6  0.8    -1.0 -0.3     0.8  0.6
b   -0.8  0.7   -0.7  0.7    -0.9 -0.4     0.7  0.7
c   -0.5  0.9   -0.5  0.9    -1.0 -0.2     0.9  0.5
d   -0.8  0.6   -0.8  0.6    -0.8 -0.6     0.6  0.8
e   -0.8 -0.6   -0.8 -0.6     0.3 -1.0    -0.7  0.7
f   -0.7 -0.7   -0.7 -0.7     0.5 -0.9    -0.8  0.6
g   -0.9 -0.5   -0.9 -0.5     0.2 -1.0    -0.6  0.8
h   -0.6 -0.8   -0.7 -0.8     0.6 -0.8    -0.9  0.5

… leading to standard errors:

     Loadings    Bootstrap-based se's
a   -0.6  0.8   0.6  0.5
b   -0.8  0.7   0.5  0.6
c   -0.5  0.9   0.6  0.5
d   -0.8  0.6   0.5  0.6
e   -0.8 -0.6   0.6  0.5
f   -0.7 -0.7   0.6  0.5
g   -0.9 -0.5   0.5  0.5
h   -0.6 -0.8   0.6  0.5
What caused these enormous ellipses?
Slide 18
Conclusion: the solutions seem very unstable, hence the loadings seem very uncertain.
But: the configurations of the subsamples are very similar!
So: we should have considered the whole configuration!
However …
Slide 19
How to consider the whole configuration?
(e.g., Meulman & Heiser, 1983, Krzanowski, 1987, Ringrose, 1992, Markus, 1994, Milan & Whittaker, 1995)
1. Compute PCA on the original data → Ao, Bo
2. Create N bootstrap samples
3. Compute PCA in all samples → Ab, Bb, b = 1,…,N
4. Optimally rotate the bootstrap solutions to the original solution: e.g., minimize gb(Tb) = || BbTb - Bo ||², b = 1,…,N
5. Compute se’s for elements of Bb,rot = BbTb
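Step 4 is an orthogonal Procrustes problem with a closed-form solution: Tb = UV', where Bb'Bo = USV' is an SVD. A minimal numpy sketch (function names are mine):

```python
import numpy as np

def match_rotation(B_b, B_o):
    """Rotation T minimizing ||B_b T - B_o||²: T = UV' from
    the SVD of B_b'B_o (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(B_b.T @ B_o)
    return U @ Vt

def matched_ses(B_o, bootstrap_loadings):
    """Step 5: se's over the rotation-matched bootstrap loadings."""
    rotated = [B @ match_rotation(B, B_o) for B in bootstrap_loadings]
    return np.std(rotated, axis=0)
```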
Applying the above procedure to the previous example:

     Loadings    Bootstrap-based se's
a   -0.6  0.8   0.03 0.03
b   -0.8  0.7   0.03 0.04
c   -0.5  0.9   0.04 0.02
d   -0.8  0.6   0.03 0.04
e   -0.8 -0.6   0.03 0.03
f   -0.7 -0.7   0.04 0.03
g   -0.9 -0.5   0.02 0.04
h   -0.6 -0.8   0.04 0.03
Slide 20
95 % Confidence ellipses after matching rotation of bootstraps:
Wouldn't varimax rotation have solved the problem? Yes: 95% confidence ellipses:
Slide 21
What else can go wrong when you take orientation too seriously?
Data: 50 × 8 data set
PCA: 2 components
Eigenvalues: 3.07, 3.01, 0.56, 0.39, etc.
PCA followed by varimax
Bootstrap 95% confidence ellipses:
… Varimax doesn’t always help …!
Slide 22
Cause: the varimax solution is unstable!
Look at observed loadings and three bootstrap solutions:
Solution (again):
rotation to optimal agreement with original solution
Slide 23
Confidence ellipses, based on bootstraps rotated towards varimax rotated solution:
Slide 24
Is matching rotation enough to get good se's?
Bootstrap solutions will differ by more than a rotation, even for perfect data:
If X = AB' (with A'A = I)
then Xb = AbB' (with Ab'Ab ≠ I)
But the PCA solution is: Xb = Pb(QbDb)'
with: QbDb = BT
      Pb = Ab(T')⁻¹
for some nonsingular transformation T
So: it is better to transform bootstrap solutions to the original solution → minimize || BbT - Bo ||² over any nonsingular T
→ better/smaller confidence ellipses for small n and almost perfect data
Slide 25
Do standard errors satisfy all our needs?
Remember: PCA is a descriptive technique.
Se's answer the question:
"What would the description look like if we had a different sample?"
Maybe more interesting (in a descriptive context, or always?):
"How well would our description describe the data if we had a different sample?"
"Did it overfit the first sample?"
Answer: cross-validation!
But: how? What are our predictors and criteria?
Slide 26
Cross-validation in PCA
1. Eastment & Krzanowski, 1982
2. Ten Berge, 1986
3. Martens & Martens, 2001
4. … whatever else we can think of
Eastment & Krzanowski (1982) approach:
1. solve PCA using the SVD: X = USV', taking only the first r components
For each datum xij:
2. leave out row i → compute SVD U-i S-i V-i'
   leave out column j → compute SVD U-j S-j V-j'
   retain the first r components
3. compute the cross-validation estimate for xij as
   x̂ij = [U-j]i' (S-i)^(1/2) (S-j)^(1/2) [V-i]j
   (with sign correction, if necessary)
4. combine the results by computing
   PRESS = (1/IJ) Σij (xij - x̂ij)²
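A numpy sketch of this scheme, following the steps literally; the sign correction of step 3 is omitted here, so this is an illustration rather than a faithful implementation:

```python
import numpy as np

def ek_press(X, r):
    """Eastment-Krzanowski cross-validation: for each x_ij, combine
    the SVD without row i and the SVD without column j."""
    I, J = X.shape
    # step 2 (column part), precomputed once per column
    col_svds = [np.linalg.svd(np.delete(X, j, axis=1), full_matrices=False)
                for j in range(J)]
    press = 0.0
    for i in range(I):
        # step 2 (row part)
        _, si, Vit = np.linalg.svd(np.delete(X, i, axis=0),
                                   full_matrices=False)
        for j in range(J):
            Uj, sj, _ = col_svds[j]
            # step 3: x̂_ij = [U_-j]_i' (S_-i)^(1/2) (S_-j)^(1/2) [V_-i]_j
            x_hat = np.sum(Uj[i, :r] * np.sqrt(si[:r] * sj[:r]) * Vit[:r, j])
            press += (X[i, j] - x_hat) ** 2
    return press / (I * J)      # step 4: PRESS = (1/IJ) Σ (x_ij - x̂_ij)²
```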
Slide 27
Ten Berge (1986) approach:
PCA cross-validation procedure meant for the situation with a "2nd" data set:
1. Apply PCA to data matrix X1
→ loadings B1
→ component scores A1 = X1W1 (W1: component weights)
→ proportion of explained variance: 1 - SS(X1 - A1B1')/SS(X1)
2. Use the same components in the 2nd set, by using W1
→ component scores A2 = X2W1
3. Compute loadings B2 for these components (by regression of X2 on A2)
4. Compute the explained variance of the original components in the 2nd data set:
1 - SS(X2 - X2W1B2')/SS(X2)
5. Compare to the explained variance of X1 and to the maximal explained variance of X2
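A numpy sketch of these five steps; taking W1 to be the first r right singular vectors of X1 is my assumption (the slide does not fix the choice of weights):

```python
import numpy as np

def ten_berge_cv(X1, X2, r):
    """Ten Berge cross-validation for two matched data sets."""
    # 1. PCA of X1: weights W1, scores A1 = X1 W1, loadings by regression
    _, _, Vt = np.linalg.svd(X1, full_matrices=False)
    W1 = Vt[:r].T
    A1 = X1 @ W1
    B1 = np.linalg.lstsq(A1, X1, rcond=None)[0].T
    fit1 = 1 - np.sum((X1 - A1 @ B1.T) ** 2) / np.sum(X1 ** 2)
    # 2. same components in the 2nd set: A2 = X2 W1
    A2 = X2 @ W1
    # 3. loadings B2 by regression of X2 on A2
    B2 = np.linalg.lstsq(A2, X2, rcond=None)[0].T
    # 4. explained variance of the original components in the 2nd set
    fit2 = 1 - np.sum((X2 - A2 @ B2.T) ** 2) / np.sum(X2 ** 2)
    return fit1, fit2   # 5. compare fit2 to fit1 (and to X2's own PCA fit)
```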
Slide 28
Martens & Martens (2001) approach:
PCA cross-validation procedure meant for jackknife or k-fold cv:
For each case (or subset):
1. leave row(s) i out → X-i
2. apply PCA to X-i → loadings Bi (with Bi'Bi = I)
3. apply the loadings to the test set (xi') → xi'BiBi'
4. compute the residuals xi' - xi'BiBi'
5. compute the explained variance of the original components in the test set:
compare SS(xi') with SS(xi' - xi'BiBi'); sum over all rows (test sets) → PRESS
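A numpy sketch of this leave-one-row-out scheme; the SVD in step 2 guarantees Bi'Bi = I:

```python
import numpy as np

def mm_press(X, r):
    """Martens & Martens jackknife cv for PCA."""
    press = 0.0
    for i in range(X.shape[0]):
        X_i = np.delete(X, i, axis=0)       # 1. leave row i out
        _, _, Vt = np.linalg.svd(X_i, full_matrices=False)
        B = Vt[:r].T                        # 2. orthonormal loadings B_i
        x = X[i]
        resid = x - x @ B @ B.T             # 3.-4. project, take residual
        press += np.sum(resid ** 2)         # 5. accumulate PRESS
    return press
```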
Slide 29
Comparison

                         | Eastment & Krzanowski       | Ten Berge           | Martens & Martens
used information         | all (except the data point) | only weights        | weights and loadings
rotational indeterminacy | ignored                     | bypassed            | bypassed
prediction estimates for | each original data point    | only a new data set | each original data point
predictions based on     | other data only (almost)    | other data          | same data
Slide 30
Problems of the Eastment & Krzanowski approach:
- estimates are not completely data-independent (due to the sign alignment)
- rotation is not taken into account
Corrected by: "orientation alignment" and "PCA while treating the left-out data as missing" (cf. Wold, 1978, but a componentwise approach)
→ more accurate cv values
Example: a 50 × 40 data set with equal underlying eigenvalues and a 3-dimensional true structure (approx. 80% of the data) gives the following Q² values:

      E&K       E&K + aligning orientation   CV via PCA with xij treated as missing
r=1   -17.1 %   -45.4 %                      13.5 %
r=2   -18.7 %   -81.3 %                      41.4 %
r=3    21.1 %    77.6 %                      77.5 %
r=4    20.2 %    75.2 %                      74.4 %
Slide 31
… and in the presence of outliers …
20 × 9 data set; outlier in case 13, variable 2
E&K: Q² = 70.1 %; missing-values cv: Q² = 43.0 %
Look at the residuals:
(figure: data, PCA residuals, E&K cv residuals, missing-values cv residuals)
Slide 32
Alternative cv possibilities:
PCA also fits the covariance/correlation matrix:
X'X = AA' + E
Then the k-fold/jackknife scheme is:
for i = 1,…,I
1. Leave out object (or subset) i → X-i
2. Compute the covariance/correlation matrix X-i'X-i
3. Fit PCA to X-i'X-i → A-i (same size as the original A)
end
4a. Compare all matrices A-i, after matching rotation, to A → gives "se's"
4b. Compute the CV-fit to Xi'Xi: Q² = || Xi'Xi - A-iA-i' ||²
Notes:
→ comparable to cross-validation in LISREL
→ instead of covariances/correlations: distances, or other (dis)similarity matrices
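A numpy sketch of steps 1-4b; fitting AA' to the reduced cross-product matrix via its eigendecomposition is my assumption:

```python
import numpy as np

def cov_pca(S, r):
    """Fit S ≈ AA' with A = V_r L_r^(1/2) (top-r eigenpairs of S)."""
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(vals)[::-1][:r]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def cov_jackknife_q2(X, r):
    """Leave out each case, fit PCA to the reduced cross-product
    matrix, and accumulate the step-4b residual norm."""
    total = 0.0
    for i in range(X.shape[0]):
        X_i = np.delete(X, i, axis=0)       # 1. leave out case i
        A_i = cov_pca(X_i.T @ X_i, r)       # 2.-3. PCA of X_-i'X_-i
        xi = X[i:i + 1]                     # left-out case as a 1 x J row
        total += np.sum((xi.T @ xi - A_i @ A_i.T) ** 2)   # 4b
    return total
```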
Slide 33
Advantages/disadvantages of se's and cv for PCA

                               | Overfitting signalled?                   | Underfitting signalled? | Results for perfect data
TB, M&M procedures             | not well, Q² increases with # components | yes → small Q²          | good, Q² = 1
E&K cv                         | yes                                      | yes → small Q²          | bad, Q² < 1
CV via missings estimation     | yes                                      | yes → small Q²          | good, Q² = 1
bootstrap se's with rotation   | not at all (small even for high # comps) | yes → big se's          | bad, se's > 0
bootstrap se's with transform. | not at all (small even for high # comps) | yes → big se's          | good, se's = 0
Slide 34
What about other component methods?
General principle:
Answer one of two questions:
1. “How much will our estimates differ when we use other data from the same population?”
2. “How well will our solution describe other data from the same population?”
Question 1:
- use mathematical statistics to derive se’s
or
- use bootstrap/jackknife to approximate se’s
Essential: clearly define what the solution is:
e.g.,
the precisely identified estimates
a class of solutions which have the same mutual distances (allowing for rotational freedom)
Slide 35
Question 2:
“How well will our solution describe other data from the same population?”
Answer: cross-validation
→ split the data in parts (or use a 2nd data set)
→ fit the model to one part (X1) → estimates A1
→ use A1 as (a subset of) the estimates to model X2
→ find estimates for possibly remaining parameters
→ compute the cv-fit for those estimates
→ repeat for different choices of subsets
Slide 36
What to choose? Se's or cv?
Or combine the two!
1. k-fold split → k solutions
Fit on the test set → cv residuals
and
Compare the solutions (after matching) → se's
(e.g., Krzanowski, 1987; new: combine with the missings-fitting approach)
2. bootstrap → k solutions
Compare the solutions → se's
and
Cross-validate the solutions on the original data → estimate of overfitting ("optimism")
Consider the cases missing from each bootstrap sample as a test set → the .632 bootstrap estimator gives the cv result
(see Efron & Tibshirani, 1993)
Slide 37
Three-way component analysis: A complex case
(figure: I × J × K three-way data array; SUBJECTS i = 1,…,I; VARIABLES j = 1,…,J; OCCASIONS k = 1,…,K)
Tucker3 model:
xijk = Σp Σq Σr aip bjq ckr gpqr + eijk
or in matrix algebra:
X = AG(C' ⊗ B') + E    (X = I × JK matrix)
where
xijk = score of subject i on variable j at time k
aip = loading of individual i on person component p
bjq = loading of variable j on variable component q
ckr = loading of situation k on situation component r
gpqr = (latent) score of person type p on factor q in situation type r (core array)
eijk = error
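The elementwise formula and the matricized form can be checked against each other numerically; a numpy sketch with hypothetical dimensions (the unfolding convention here, with k running slowest over the columns, is one possible choice):

```python
import numpy as np

rng = np.random.default_rng(4)
I, J, K, P, Q, R = 10, 8, 5, 3, 2, 2    # hypothetical sizes
A = rng.normal(size=(I, P))             # subject components
B = rng.normal(size=(J, Q))             # variable components
C = rng.normal(size=(K, R))             # occasion components
G = rng.normal(size=(P, Q, R))          # core array

# elementwise model: x_ijk = Σ_pqr a_ip b_jq c_kr g_pqr
X = np.einsum('ip,jq,kr,pqr->ijk', A, B, C, G)

# matrix form: X_(I×KJ) = A G_(P×RQ) (C' ⊗ B'), since C'⊗B' = (C⊗B)'
G_mat = G.transpose(0, 2, 1).reshape(P, R * Q)     # unfold the core
X_mat = A @ G_mat @ np.kron(C.T, B.T)
print(np.allclose(X_mat, X.transpose(0, 2, 1).reshape(I, K * J)))  # True
```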
Slide 38
How to determine se's?
How to cross-validate?
First consider the rotational freedom:
AG(C' ⊗ B') = (AS) [S⁻¹G(U⁻¹' ⊗ T⁻¹')] ((CU)' ⊗ (BT)')
            = A* G* (C*' ⊗ B*')
→ A, B, and C can be rotated independently
→ this has to be taken into account while determining se's and in cross-validation
But how?
→ rotate the solution to simple structure and fully identify the solution
→ rotate the solutions for subsamples to the original solution
Slide 39
Three procedures for cross-validation in Tucker3:
1. Split-half procedure: split the data into two sets; assess stability for A, B, C; cross-validate the core (Kiers & Van Mechelen, 2001)
2. EM-Tucker3 cross-validation: leave elements out at random (use them as test set); fit Tucker3 on the nonmissing entries; cross-validate the model parameters on the missing ones (Louwerse, Kiers & Smilde, 1999)
3. Leave-bar-out cross-validation: a generalization of the Eastment-Krzanowski approach (Louwerse, Kiers & Smilde, 1999)
Slide 40
Split-Half Procedure
0. analysis of the full data: Sol = {Afull, Bfull, Cfull, Gfull}
1. split the data into two sets, e.g., two random subsets of subjects (A-mode) → two I/2 × J × K data sets: X1, X2
2. Preprocess X1, X2 (same choices as for the full data)
3. Fit the Tucker3 model to X1, X2 (same P, Q, R as for the full data) → Sol1 = {A1, B1, C1, G1}, Sol2 = {A2, B2, C2, G2}
4. Use the rotational freedom: match B1 to Bfull and C1 to Cfull: B1* = B1(B1'B1)⁻¹B1'Bfull, C1* = C1(C1'C1)⁻¹C1'Cfull
5. Assess stability for B and C (congruences between the columns of B1 and B2, and between the columns of C1 and C2)
6. Quasi cross-validation for the core: use Afull, Bfull, Cfull to find the optimal core for X1 and X2; compare the 'cross-validated' cores (focus on big values; compute absolute differences)
Slide 41
Leave-Bar-Out Cross-validation
1a. Leave out subject set s → (I-Is) × J × K data set Xs
2a. Preprocess Xs, analyze → As*, Bs, Cs, Gs
1b. Leave out variable set t → I × (J-Jt) × K data set Xt
2b. Preprocess Xt, analyze → At, Bt*, Ct, Gt
1c. Leave out occasion set u → I × J × (K-Ku) data set Xu
2c. Preprocess Xu, analyze → Au, Bu, Cu*, Gu
(Note: always use the same identification procedure → solutions fully comparable)
3. Estimate the Is × Jt × Ku left-out part of X, X̂stu, using all solutions (except the *'d ones), e.g.,
X̂stu = AtGs(Cs ⊗ Bu)' or X̂stu = AuGt(Cs ⊗ Bs)'
1st decision: use the average of all 8 combinations Ap, Bq, Cr
2nd decision: take the core for Ap, Bq, Cr equal to (Gp)^(1/3) (Gq)^(1/3) (Gr)^(1/3) (elementwise powers and products) + size correction
4. Repeat, and compute PRESS = Σstu || Xstu - X̂stu ||²
Slide 42
EM-Tucker3 Cross-validation
1. Leave out a random set s of observations (Xs)
2. Preprocess Xs
3. Analyze Xs by "EM-Tucker3" → As, Bs, Cs, Gs (fits the Tucker3 model to the nonmissing entries only)
4. Compute estimates X̂s based on As, Bs, Cs, Gs
5. Repeat, and compute PRESS = Σs || Xs - X̂s ||²
Slide 43
Comparison of cross-validation procedures

                 | pros                         | cons
split-half       | simple, quick                | split-dependent; no 'real' cv (solutions partly based on the 'test set')
leave-bar-out cv | efficient; real cv           | rotation-dependent; many arbitrary decisions
EM cv            | no rotational choices needed | very time consuming; EM: 'degenerate' solutions
Slide 44
Conclusions
Insight into sampling fluctuation:
→ via standard errors or via cross-validation
→ different answers to different questions
→ they supplement each other
Component analysis: nonunique solutions
→ special procedures needed for computing standard errors and for cross-validation
… and for cross-validation there is a difficulty: what is the criterion?
→ before you know it, you are using the criterion while making the prediction …