11 Dec 2012 COMP80131-SEEDSM12_10 1
Scientific Methods 1
‘Scientific evaluation, experimental design & statistical methods’
COMP80131
Lecture 10: Statistical Methods - Intro to PCA & Monte Carlo Methods
Barry & Goran
www.cs.man.ac.uk/~barry/mydocs/MyCOMP80131
Correlation & covariance
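Covariance between two samples of M variables:
$$\mathrm{cov}(x,y) = \frac{1}{M-1}\sum_{m=1}^{M}(x_m - \mathrm{mean}_x)(y_m - \mathrm{mean}_y)$$
or, written out,
$$\mathrm{cov}(x,y) = \frac{1}{M-1}\left[(x_1 - \mathrm{mean}_x)(y_1 - \mathrm{mean}_y) + \dots + (x_M - \mathrm{mean}_x)(y_M - \mathrm{mean}_y)\right]$$
Pearson correlation coeff for two samples of M variables:
$$r = \frac{\frac{1}{M-1}\sum_{m=1}^{M}(x_m - \mathrm{mean}_x)(y_m - \mathrm{mean}_y)}{\sqrt{\mathrm{var}_x\,\mathrm{var}_y}}$$
Lies between -1 & 1.
Use 1/(M-1) if mean_x is the sample mean; 1/M if mean_x is the population mean.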
In vector notation
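x = [1 2 3 -4 5 6]^T, y = [2 3 -3 5 4 1]^T
Both measure similarity between 2 cols of numbers (vectors).
When means have been subtracted:
$$\text{correlation} = \frac{x^{T}y}{M\sqrt{\mathrm{var}_x\,\mathrm{var}_y}} \quad\text{or}\quad \text{covariance} = \frac{1}{M}\,x^{T}y$$
(Use 1/(M-1) in place of 1/M when sample means are used; var_x and var_y must be computed with the same divisor.)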
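A minimal MATLAB sketch of the vector forms above, using the x and y columns from this slide. The variable names (xz, yz, covxy, r) are illustrative rather than taken from the lecture, and the 1/(M-1) sample-mean convention is assumed.

```matlab
x = [1 2 3 -4 5 6]';          % column vectors from the slide
y = [2 3 -3 5 4 1]';
M = length(x);
xz = x - mean(x);             % subtract the sample means
yz = y - mean(y);
covxy = (xz' * yz) / (M-1);   % covariance, sample-mean convention
varx  = (xz' * xz) / (M-1);   % variances with the same divisor
vary  = (yz' * yz) / (M-1);
r = covxy / sqrt(varx * vary) % Pearson correlation, lies between -1 and 1
```

The divisor cancels in r, so the correlation is the same whether 1/M or 1/(M-1) is used, provided it is used consistently.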
11 Dec 2012 COMP80131-SEEDSM12_10 4
Principal Components Analysis
• PCA converts samples of M variables into samples of a smaller number of variables called principal components.
• Produces shorter columns.
• Exploits interdependency or correlation among the M variables in each col.
• Evidence is the similarity between columns as seen in lots of examples.
• If there is none, PCA cannot reduce number of variables.
• First princ comp has the highest variance.
• It accounts for as much variability as possible.
• Each succeeding princ comp has the highest variance possible while being orthogonal to (uncorrelated with) the previous ones.
PCA
• Reduces number of variables (dimensionality) without significant loss of information.
• Also named: ‘discrete Karhunen–Loève transform (KLT)’, ‘Hotelling transform’, ‘proper orthogonal decomposition (POD)’.
• Related to (but not the same as): ‘Factor analysis’
Example
Assume 5 observations of behaviour of 3 variables x1, x2, x3:
x1:  1  2  3  1  4    sample-mean = 11/5 = 2.2
x2:  2  1  3 -1  2    sample-mean = 1.4
x3:  3  4  7  1  8    sample-mean = 4.6
Subtract means (call this matrix X):
x1': -1.2 -0.2  0.8 -1.2  1.8
x2':  0.6 -0.4  1.6 -2.4  0.6
x3': -1.6 -0.6  2.4 -3.6  3.4
Calculate ‘Covariance Matrix’ (C):
       x1'    x2'    x3'
x1':   1.7    1.15   3.6
x2':   1.15   2.3    3.45
x3':   3.6    3.45   8.3
C(1,2) = average value of x1'.x2' (dividing by 5 - 1 = 4) = (-1.2×0.6 + 0.2×0.4 + 0.8×1.6 + 1.2×2.4 + 1.8×0.6)/4 = 4.6/4 = 1.15
Eigenvalues & eigenvectors
[U, diagV] = eig(C);
U = [u3 u2 u1] =
   0.811    0.458    0.364
   0.324   -0.87     0.372
  -0.487    0.184    0.854
D = diagV = diag(λ3, λ2, λ1) =
   0    0        0
   0    0.964    0
   0    0        11.34
Transforming the measurements
• For each column of matrix X, multiply by U^T to transform it to a different set of numbers.
• For each column x', transform it to U^T x'.
• Or do it all at once by calculating Y = U^T * X.
• We get:
Y =
   0       0       0       0       0
  -1.37    0.146  -0.583   0.874   0.929
  -1.58   -0.734   2.936  -4.404   3.781
• First column of X is now expressed as: 0 u3 - 1.37 u2 - 1.58 u1 (the rows of Y follow the column order of U, which is u3, u2, u1).
• Similarly for all the other four columns of X.
• Each column is now expressed in terms of the eigenvectors of C.
Reducing dimensionality
• U^T C U = D, therefore C = U D U^T (since U^T is the inverse of U).
• Now we can express:
• C = λ1 (u1 u1^T) + λ2 (u2 u2^T) + λ3 (u3 u3^T)
• Now look at the eigenvalues λ1, λ2, λ3.
• Strike out the zero-valued one (λ3) with its corresponding eigenvector (u3).
• Leaves u1 & u2 as princ components.
• Can represent all the data, without loss, with just these two.
• Can also remove smaller eigenvalues (such as λ2) with their eigenvectors.
• (If they do not affect C much they should not affect X much.)
• Whole data can be represented by just u1 without serious loss of accuracy.
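The expansion of C into rank-one terms can be checked numerically. The following is a sketch only, using the covariance matrix from the example; the names lambda, Cfull, C2, C1 and the error variables are illustrative, and the eigenvalues are sorted in decreasing order here so that lambda(1) is the largest.

```matlab
C = [1.7 1.15 3.6; 1.15 2.3 3.45; 3.6 3.45 8.3];   % covariance matrix from the example
[U, D] = eig(C);                            % columns of U are eigenvectors
[lambda, idx] = sort(diag(D), 'descend');   % lambda(1) is the largest eigenvalue
U = U(:, idx);                              % reorder eigenvectors to match
% Rebuild C as a sum of rank-one terms lambda_i * u_i * u_i'
Cfull = zeros(3);
for i = 1:3
    Cfull = Cfull + lambda(i) * (U(:,i) * U(:,i)');
end
C2 = Cfull - lambda(3) * (U(:,3) * U(:,3)');   % strike out the (near) zero eigenvalue
C1 = lambda(1) * (U(:,1) * U(:,1)');           % keep only the largest eigenvalue
errFull = norm(C - Cfull)   % rounding error only
err2 = norm(C - C2)         % still negligible: nothing is lost
err1 = norm(C - C1)         % equals the second eigenvalue
```

Dropping the zero eigenvalue leaves C unchanged up to rounding, while keeping only the largest one changes C by an amount equal to the discarded eigenvalue, which is small next to the largest.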
Reconstructing orig data from princ comps
• Because Y = U^T * X, then X = U * Y.
• If we don’t strike out any eigenvals & eigenvecs, this gets us back to orig data.
• If we strike out row 1 of Y and u3 (first col of U, with zero eigenvalue), we still get back to orig data.
• If we also strike out row 2 of Y and u2, we get back something close to orig data.
• We do not lose much info by keeping just one princ. comp.
• Dimensionality reduces from 3 to 2 or 1.
• (Normally, eigenvals are reordered in decreasing magnitude, but I have not done that here.)
In MATLAB
clear all;
origData = [1 2 3 1 4 ; 2 1 3 -1 2 ; 3 4 7 1 8]
[M,N] = size(origData);
meanofCols = mean(origData,2); % subtract off mean for EACH dimension
zmData = origData - repmat(meanofCols,1,N)
covarMat = 1 / (N-1) * (zmData * zmData')
% find the eigenvectors and eigenvalues of covarMat
[eigVecs, diagV] = eig(covarMat)
eigVals = diag(diagV)
[reigVals, Ind] = sort(eigVals,'descend'); % sort the variances in decreasing order
reigVecs = eigVecs(:,Ind); % Reorder eigenvectors accordingly
proj_zmData = reigVecs' * zmData
disp('Approx to original data taking just a few principal components');
nPC = input('How many PCs do you need (look at eigenvals to decide):');
PCproj_zmData = proj_zmData(1:nPC,:)
PCVecs = reigVecs(:,1:nPC) %Only keep the first few reordered eigVecs
RecApprox_zmData = PCVecs * PCproj_zmData
RecApprox_origData = RecApprox_zmData + repmat(meanofCols,1,N)
Monte Carlo Methods
• Use of repeated random sampling of the behaviour of highly complex multidimensional mathematical equations describing real or simulated systems, to determine their properties.
• The repeated random sampling produces observations to which statistical inference can be applied.
Pseudo-random processes
• Name Monte Carlo refers to the famous casino.
• Gambling requires a random process such as the spinning of a roulette wheel.
• Monte Carlo methods use pseudo-random processes implemented in software.
• ‘Pseudo-random’ processes are not truly random.
• The variables produced can be predicted by a person who knows the algorithm being used.
• However, they can be used to simulate the effects of true randomness.
• Simulations not required to be numerically identical to real processes.
• Aim is to produce statistical results such as averages & distributions.
• Requires a ‘sampling’ of the population of all possible modes of behaviour of the system.
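The predictability of a pseudo-random process can be seen by resetting the generator. This is a small sketch, assuming a MATLAB version that provides rng (R2011a or later); the seed value 1 and the variable names are arbitrary choices for illustration.

```matlab
rng(1);            % seed the generator (the seed value is arbitrary)
a = rand(1,5);     % five pseudo-random numbers in (0,1)
rng(1);            % reset to the same seed
b = rand(1,5);     % the "random" sequence repeats exactly
sameSequence = isequal(a, b)
avg = mean(rand(1,100000))   % statistical behaviour: close to 0.5
```

The sequence is fully determined by the algorithm and its seed, yet its average still behaves like that of a truly uniform random variable, which is all a Monte Carlo simulation needs.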
Illustration
• Monte Carlo methods may be used to evaluate multi-dimensional integrals.
• Consider the problem of calculating the area of an ellipse by generating a set of N pseudo-random number pairs (xi,yi) uniformly covering the area -1<x<1, -1<y<1 as illustrated next:
Area of an ellipse
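[Figure: ellipse drawn inside the square -1 < x < 1, -1 < y < 1, with random points scattered over the square]
Area of square is 2 × 2 = 4.
If there are N points and M of them fall inside the ellipse, area of ellipse ≈ 4M/N, becoming exact as N → ∞.
(Frequentist approach)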
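The hit-counting estimate can be sketched in a few lines of MATLAB. The slide does not give the ellipse dimensions, so the semi-axes a = 1 and b = 0.5 below are assumed purely for illustration, as are the seed and the sample size.

```matlab
rng(0);                       % fixed seed so the run is repeatable (arbitrary choice)
a = 1.0; b = 0.5;             % ASSUMED semi-axes; not given on the slide
N = 100000;                   % number of pseudo-random points
x = 2*rand(N,1) - 1;          % uniform in -1 < x < 1
y = 2*rand(N,1) - 1;          % uniform in -1 < y < 1
inside = (x/a).^2 + (y/b).^2 <= 1;   % true where a point falls inside the ellipse
M = sum(inside);
areaMC = 4 * M / N            % area of square times fraction of hits
areaExact = pi * a * b        % known area of an ellipse, for comparison
```

For an ellipse the exact area is known, so the example only serves to show how close 4M/N gets for a given N.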
Simplicity of MC methods
• This example illustrates the simplicity of MC techniques, but not their computational advantages.
• We could have used a regularly placed grid of points rather than randomly placed points in the square, as on the next slide.
Regular grid
[Figure: regular grid of points covering the square -1 < x < 1, -1 < y < 1 (axes x and y, in cm)]
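A regular grid can be used for the area estimate in exactly the same counting fashion. This sketch assumes an ellipse with semi-axes a = 1 and b = 0.5 (hypothetical values, not from the slides) and places one grid point at the centre of each cell.

```matlab
a = 1.0; b = 0.5;                  % ASSUMED semi-axes, hypothetical example values
n = 100;                           % grid points per axis, so N = n^2 points in total
h = 2 / n;                         % spacing between adjacent grid points
g = -1 + h * ((1:n) - 0.5);        % cell-centred points in -1 < x < 1
[X, Y] = meshgrid(g, g);           % all N grid points in the square
M = sum(sum((X/a).^2 + (Y/b).^2 <= 1));   % grid points inside the ellipse
N = n^2;
areaGrid = 4 * M / N               % same formula as for random points
```

In two dimensions the grid is at least as good as random sampling for the same N, which is why this example shows no computational advantage for Monte Carlo.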
Advantages of Monte Carlo
• In fact there are no advantages for such a 2-dimensional problem.
• Consider a multi-dimensional integration:
$$I = \int_{a_1}^{b_1}\int_{a_2}^{b_2}\int_{a_3}^{b_3}\cdots\int_{a_L}^{b_L} f(x_1, x_2, x_3, \ldots, x_L)\,dx_1\,dx_2\,dx_3\cdots dx_L$$
Disadvantage of regular grid
• f(x1, x2, …, xL) may be evaluated at regularly spaced points as a means of evaluating the integral.
• Number of regularly spaced points, N, must increase exponentially with dimension L if error is not to increase exponentially with L.
• If N = 100 when L = 2 (10 points per axis across a 2 cm range), then adjacent points will be 0.2 cm apart.
• If L increases to 3, we need N = 10^3 points to maintain the same separation between them.
• When L = 4, we need N = 10^4 etc.: the ‘Curse of dimensionality’.
• Look at this another way:
• Assume N remains fixed with regular sampling, and L increases.
• Each dimension must be sampled more & more sparsely, and less and less efficiently.
• More & more points with same value in each dimension.
• Error is proportional to N^(-2/L), so for fixed N it grows as L increases.
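The growth in grid size can be tabulated directly. This is an illustrative sketch: n = 10 points per axis matches the N = 100, L = 2 case above, while the fixed budget Nfixed = 10000 and the range of L are assumed example values.

```matlab
n = 10;                    % points per axis (N = 100 when L = 2)
L = 2:6;                   % number of dimensions
Ngrid = n .^ L             % points needed to keep the same separation
% Alternatively keep N fixed and see how few points each axis gets
Nfixed = 10000;            % ASSUMED fixed budget of points
perAxis = Nfixed .^ (1 ./ L)     % points per axis shrinks as L grows
errScale = Nfixed .^ (-2 ./ L)   % error factor N^(-2/L) grows with L
```

With the budget fixed at 10000 points, each axis gets 100 points in two dimensions but fewer than 5 in six dimensions.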
Advantage of random sampling
• Uniformly distributed random points in L-dimensional space.
• Avoids inefficiency of rectangular grids created by regular sampling by using a purely random sample of N points uniformly distributed.
• For high dimensions L, error is proportional to 1/√N.
• To reduce the error by a factor of 2, the sample size N must be increased by a factor of 4 regardless of the dimensionality.
• There are ways of decreasing the Monte Carlo error to make the technique still more efficient.
• One approach is to use ‘quasi-random’ or ‘low-discrepancy’ sampling.
• The use of such quasi-random sampling for numerical integration is referred to as “quasi-Monte Carlo” integration.
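The 1/√N behaviour can be observed on a simple test integral. The integrand below (the sum of squared coordinates over the unit cube, whose exact integral is L/3) is an assumed example chosen because the answer is known; the dimension, seed and sample sizes are likewise illustrative.

```matlab
rng(0);                          % arbitrary fixed seed for repeatability
L = 6;                           % number of dimensions (example value)
exact = L / 3;                   % integral of sum(x.^2) over the unit cube [0,1]^L
Ns = [1000 4000 16000 64000];    % each sample size is 4 times the previous one
errs = zeros(size(Ns));
for k = 1:length(Ns)
    X = rand(Ns(k), L);          % N uniform random points in L dimensions
    f = sum(X.^2, 2);            % integrand evaluated at each point
    errs(k) = abs(mean(f) - exact);   % cube volume is 1, so the mean is the estimate
end
errs                             % expected to shrink like 1/sqrt(N), with random scatter
```

Any single run shows scatter around the trend, so the factor-of-2 reduction per fourfold increase in N holds on average rather than at every step.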
MATLAB: Area of Semicircle
for N=1:200
    M=0;
    for i=1:N
        x=2*rand -1; y=rand*1.0;
        I = sqrt(1-x*x);
        if y <= I, M=M+1; end; %if y <= I point (x,y) is below curve !!!
    end;
    Int(N)=M*2/N;
end; % of N loop
figure(6); plot(Int); title('Area of semicircle');
xlabel('Number of points');
Convergence as N → ∞
[Figure: plot titled 'Area of semicircle'; x-axis: number of points, 0 to 200; y-axis: estimated area, 1.3 to 2]
MATLAB code for scatter plot
clear; N=6000; M=0;
for i=1:N
    x(i)=rand*2-1; y(i)=rand*1.0;
    I = sqrt(1-x(i)*x(i));            % height of the semicircle at x(i)
    if y(i)<=I, M=M+1; C(i) = 2; else C(i)=1; end;   % colour index 2 if below curve
end;
scatter(x,y,6,C,'filled');
Int = M*2/N;                          % bounding rectangle has area 2
title(sprintf('Scatter of MC area method: N=%d, Int=%.4f',N,Int));
disp('Red if y<=I, blue if y>I');
xlabel('x Plot red if y <= I'); ylabel('y');
Integration of circle – scatter plot
[Figure: scatter plot titled 'Scatter of MC area method: N=6000, Int=1.574000e+000'; x-axis from -1 to 1, labelled 'x Plot red if y <= I'; y-axis from 0 to 1, labelled 'y']