scientific methods 1

11 Dec 2012 COMP80131-SEEDSM12_10 1

Scientific Methods 1

Barry & Goran

‘Scientific evaluation, experimental design & statistical methods’

COMP80131

Lecture 10: Statistical Methods - Intro to PCA & Monte Carlo Methods

www.cs.man.ac.uk/~barry/mydocs/MyCOMP80131


Correlation & covariance

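Covariance between two samples of M variables:

$$\mathrm{cov}(x,y) = \frac{1}{M-1}\sum_{m=1}^{M}(x_m - \mathrm{mean}_x)(y_m - \mathrm{mean}_y) \quad\text{or}\quad \frac{1}{M}\sum_{m=1}^{M}(x_m - \mathrm{mean}_x)(y_m - \mathrm{mean}_y)$$

Pearson correlation coeff for two samples of M variables:

$$\rho_{xy} = \frac{\frac{1}{M-1}\sum_{m=1}^{M}(x_m - \mathrm{mean}_x)(y_m - \mathrm{mean}_y)}{\sqrt{\mathrm{var}_x\,\mathrm{var}_y}}$$

Lies between -1 & 1.

Use 1/(M-1) if mean_x is the sample-mean; 1/M if mean_x is the pop-mean.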


In vector notation

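$$x = (1,\ 2,\ 3,\ -4,\ 5,\ 6)^T, \qquad y = (2,\ 3,\ -3,\ 5,\ 4,\ 1)^T$$

Both measure similarity between 2 cols of numbers (vectors).

When means have been subtracted:

$$\text{correlation} = \frac{x^T y}{M\sqrt{\mathrm{var}_x\,\mathrm{var}_y}}, \qquad \text{covariance} = \frac{1}{M}\,x^T y$$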
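As a rough illustration of the vector form, the sketch below computes the covariance and Pearson correlation of the two example columns by inner products. It assumes the sample means are estimated from the data, so it uses the 1/(M-1) divisor; the correlation is the same with either divisor. The variable names (xc, yc, covxy, r) are only illustrative.

```matlab
% Two example columns of numbers (vectors)
x = [1; 2; 3; -4; 5; 6];
y = [2; 3; -3; 5; 4; 1];
M = numel(x);

% Subtract the sample means so that the formulas reduce to inner products
xc = x - mean(x);
yc = y - mean(y);

% Sample covariance: divisor M-1 because the means come from the data
covxy = (xc' * yc) / (M - 1);

% Pearson correlation: covariance divided by sqrt(var_x * var_y)
varx = (xc' * xc) / (M - 1);
vary = (yc' * yc) / (M - 1);
r = covxy / sqrt(varx * vary);

fprintf('covariance = %.4f, correlation = %.4f\n', covxy, r);
```

The built-in functions cov and corrcoef should give the same values and are the more usual choice in practice.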


Principal Components Analysis

• PCA converts samples of M variables into samples of a smaller number of variables called principal components.

• Produces shorter columns.

• Exploits interdependency or correlation among the M variables in each col.

• Evidence is the similarity between columns as seen in lots of examples.

• If there is none, PCA cannot reduce number of variables.

• First princ comp has the highest variance.

• It accounts for as much variability as possible.

• Each succeeding princ comp has the highest variance possible while being orthogonal to (uncorrelated with) the previous ones.


PCA

• Reduces number of variables (dimensionality) - without significant loss of information

• Also named: ‘discrete Karhunen–Loève transform (KLT)’, ‘Hotelling transform’, ‘proper orthogonal decomposition (POD)’.

• Related to (but not the same as): ‘Factor analysis’


Example

Assume 5 observations of behaviour of 3 variables x1, x2, x3:

x1:   1    2    3    1    4     sample-mean = 11/5 = 2.2
x2:   2    1    3   -1    2     sample-mean = 1.4
x3:   3    4    7    1    8     sample-mean = 4.6

Subtract means (call this matrix X):

x1':  -1.2  -0.2   0.8  -1.2   1.8
x2':   0.6  -0.4   1.6  -2.4   0.6
x3':  -1.6  -0.6   2.4  -3.6   3.4

Calculate ‘Covariance Matrix’ (C):

         x1'    x2'    x3'
x1':     1.7    1.15   3.6
x2':     1.15   2.3    3.45
x3':     3.6    3.45   8.3

C(1,2) = average value of x1'.x2' (dividing by N-1 = 4 because sample-means were used) = (-1.2×0.6 + 0.2×0.4 + 0.8×1.6 + 1.2×2.4 + 1.8×0.6)/4 = 4.6/4 = 1.15
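The covariance matrix of this example can be checked with a few lines of MATLAB. This is only a sketch: the names data, X and C are illustrative, and the rows are taken to be the variables x1, x2, x3.

```matlab
% Rows are the variables x1, x2, x3; columns are the 5 observations
data = [1 2 3  1 4;
        2 1 3 -1 2;
        3 4 7  1 8];
N = size(data, 2);

% Subtract each variable's sample mean (row by row) to get X
X = data - repmat(mean(data, 2), 1, N);

% Covariance matrix with the 1/(N-1) divisor used for sample means
C = (X * X') / (N - 1);
disp(C);
```

The built-in call cov(data') should return the same matrix, since cov treats rows as observations and columns as variables.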


Eigenvalues & eigenvectors

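[U, diagV] = eig(C);

$$U = [\,u_3\ \ u_2\ \ u_1\,] = \begin{pmatrix} 0.811 & 0.458 & 0.364 \\ 0.324 & -0.87 & 0.372 \\ -0.487 & 0.184 & 0.854 \end{pmatrix}$$

$$D = \begin{pmatrix} \lambda_3 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_1 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0.964 & 0 \\ 0 & 0 & 11.34 \end{pmatrix}$$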


Transforming the measurements

• For each column of matrix X, multiply by U^T to transform it to a different set of numbers.

• For each column x', transform it to U^T x'.

• Or do it all at once by calculating Y = U^T*X.

• We get:

$$Y = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ -1.37 & 0.146 & -0.583 & 0.874 & 0.929 \\ -1.58 & -0.734 & 2.936 & -4.404 & 3.781 \end{pmatrix}$$

• First column of X is now expressed as: 0 u3 - 1.37 u2 - 1.58 u1 (u3 being the first column of U, with zero eigenvalue).

• Similarly for all the other four columns of X.

• Each column is now expressed in terms of the eigenvectors of C.
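The transformation can be sketched in MATLAB as follows. The column order and signs of U are whatever eig returns, so individual entries of Y may differ in sign or row order from the values quoted above; the variable names are illustrative.

```matlab
data = [1 2 3  1 4;
        2 1 3 -1 2;
        3 4 7  1 8];
N = size(data, 2);
X = data - repmat(mean(data, 2), 1, N);   % zero-mean data
C = (X * X') / (N - 1);                   % covariance matrix

[U, D] = eig(C);   % columns of U are unit-length eigenvectors of C
Y = U' * X;        % every column of X expressed in the eigenvector basis

% The rows of Y are uncorrelated: their covariance matrix is diagonal (equal to D)
CY = (Y * Y') / (N - 1);
offDiag = CY - diag(diag(CY));

disp(Y);
fprintf('largest off-diagonal covariance of Y: %.2e\n', max(abs(offDiag(:))));
fprintf('largest error in U*Y - X: %.2e\n', max(max(abs(U * Y - X))));
```

The second printed value confirms that multiplying by U undoes the transformation, which is what makes reconstruction from the principal components possible.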


Reducing dimensionality

• U^T C U = D, therefore C = U D U^T (since U^T is the inverse of U)

• Now we can express:

• C = λ1 (u1 u1^T) + λ2 (u2 u2^T) + λ3 (u3 u3^T)

• Now look at the eigenvalues λ1, λ2, λ3

• Strike out the zero valued one (λ3) with its corresponding eigenvector (u3).

• Leaves u1 & u2 as princ components.

• Can represent all the data, without loss, with just these two.

• Can also remove smaller eigenvalues (such as λ2) with their eigenvectors.

• (If they do not affect C much they should not affect X much)

• Whole data can be represented by just u1 without serious loss of accuracy.
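One way to see how little the small eigenvalues contribute is to rebuild C term by term from λk (uk uk^T), largest eigenvalue first. The sketch below does this for the example covariance matrix; the relative error measure (Frobenius norm) and the variable names are choices made here for illustration.

```matlab
% Covariance matrix of the example
C = [1.7  1.15 3.6;
     1.15 2.3  3.45;
     3.6  3.45 8.3];

[U, D] = eig(C);
[lambda, idx] = sort(diag(D), 'descend');   % lambda(1) is the largest eigenvalue
U = U(:, idx);                              % reorder eigenvectors to match

% Add one rank-1 term lambda_k * u_k * u_k' at a time
Ck = zeros(3);
relErr = zeros(1, 3);
for k = 1:3
    Ck = Ck + lambda(k) * (U(:, k) * U(:, k)');
    relErr(k) = norm(C - Ck, 'fro') / norm(C, 'fro');
    fprintf('terms kept = %d, relative error in C = %.4f\n', k, relErr(k));
end
```

With two terms the error should be at rounding level, because the third eigenvalue is zero; with one term it stays below ten percent.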


Reconstructing orig data from princ comps

• Because Y = U^T*X, then X = U*Y.

• If we don’t strike out any eigenvals & eigenvecs, this gets us back to orig data.

• If we strike out row 1 of Y and u3 (first col of U, zero eigenvalue), we still get back to orig data.

• If we also strike out row 2 of Y and u2, we get back something close to orig data.

• We do not lose much info by keeping just one princ. comp.

• Dimensionality reduces from 3 to 2 or 1.

• (Normally, eigenvals reordered in decreasing magnitude, but I have not done that here)


In MATLAB

clear all;
origData = [1 2 3 1 4 ; 2 1 3 -1 2 ; 3 4 7 1 8]
[M,N] = size(origData);
meanofCols = mean(origData,2); % mean of each row, i.e. of each variable
% subtract off mean for EACH dimension
zmData = origData - repmat(meanofCols,1,N)
covarMat = 1 / (N-1) * (zmData * zmData')
% find the eigenvectors and eigenvalues of covarMat
[eigVecs, diagV] = eig(covarMat)
eigVals = diag(diagV)
[reigVals, Ind] = sort(eigVals,'descend'); % sort the variances in decreasing order
reigVecs = eigVecs(:,Ind); % Reorder eigenvectors accordingly
proj_zmData = reigVecs' * zmData
disp('Approx to original data taking just a few principal components');
nPC = input('How many PCs do you need (look at eigenvals to decide):');
PCproj_zmData = proj_zmData(1:nPC,:)
PCVecs = reigVecs(:,1:nPC) % Only keep the first few reordered eigVecs
RecApprox_zmData = PCVecs * PCproj_zmData
RecApprox_origData = RecApprox_zmData + repmat(meanofCols,1,N)


Monte Carlo Methods

• Use of repeated random sampling of the behaviour of highly complex multidimensional mathematical equations describing real or simulated systems, to determine their properties.

• The repeated random sampling produces observations to which statistical inference can be applied.


Pseudo-random processes

• Name Monte Carlo refers to the famous casino.

• Gambling requires a random process such as the spinning of a roulette wheel.

• Monte Carlo methods use pseudo-random processes implemented in software.

• ‘Pseudo-random’ processes are not truly random.

• The variables produced can be predicted by a person who knows the algorithm being used.

• However, they can be used to simulate the effects of true randomness.

• Simulations not required to be numerically identical to real processes.

• Aim is to produce statistical results such as averages & distributions.

• Requires a ‘sampling’ of the population of all possible modes of behaviour of the system.


Illustration

• Monte Carlo methods may be used to evaluate multi-dimensional integrals.

• Consider the problem of calculating the area of an ellipse by generating a set of N pseudo-random number pairs (xi,yi) uniformly covering the area -1<x<1, -1<y<1 as illustrated next:


Area of an ellipse

[Figure: ellipse inside the square; x axis from -1 to 1, y axis up to 1]

Area of square is 2 × 2 = 4.

If there are N points and M of them fall inside the ellipse, area of ellipse ≈ 4 M / N as N → ∞.

(Frequentist approach)
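A minimal sketch of this estimate in MATLAB is given below. The semi-axes a and b are assumptions made here (no particular ellipse is specified); any ellipse that fits inside the square would do. The exact area π a b is used only as a check.

```matlab
% Hypothetical semi-axes, chosen so that the ellipse fits inside the square
a = 1.0;
b = 0.5;

N = 100000;                  % number of pseudo-random points
x = 2 * rand(N, 1) - 1;      % uniform on (-1, 1)
y = 2 * rand(N, 1) - 1;      % uniform on (-1, 1)

% Count the points that fall inside the ellipse (x/a)^2 + (y/b)^2 <= 1
M = sum((x / a).^2 + (y / b).^2 <= 1);

areaEst = 4 * M / N;         % area of square (4) times fraction of hits
areaExact = pi * a * b;
fprintf('estimate = %.4f, exact = %.4f\n', areaEst, areaExact);
```

Because the points are pseudo-random, the estimate changes slightly from run to run unless the generator is seeded.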


Simplicity of MC methods

• This example illustrates the simplicity of MC techniques, but not their computational advantages.

• We could have used a regularly placed grid of points rather than randomly placed points in the square, as on the next slide.


Regular grid

[Figure: regular grid of sample points; x axis from -1 to 1 cm, y axis up to 1]


Advantages of Monte Carlo

• In fact there are no advantages for such a 2-dimensional problem.

• Consider a multi-dimensional integration

$$I = \int_{a_1}^{b_1}\int_{a_2}^{b_2}\int_{a_3}^{b_3}\cdots\int_{a_L}^{b_L} f(x_1, x_2, x_3, \ldots, x_L)\,dx_1\,dx_2\,dx_3 \cdots dx_L$$


Disadvantage of regular grid

• f(x1, x2, …, xL) may be evaluated at regularly spaced points as a means of evaluating the integral.

• Number of regularly spaced points, N, must increase exponentially with dimension L if error is not to increase exponentially with L.

• If N = 100 when L = 2, then adjacent points will be 0.2 cm apart.

• If L increases to 3, we need N = 10^3 points to maintain the same separation between them.

• When L = 4, we need N = 10^4 etc. – ‘Curse of dimensionality’

• Look at this another way:

• Assume N remains fixed with regular sampling, and L increases.

• Each dimension must be sampled more & more sparsely, and less and less efficiently.

• More & more points with same value in each dimension.

• Error is proportional to N^(-2/L), so for fixed N it grows as L increases.
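The growth in the number of grid points can be tabulated directly. The sketch below assumes each dimension spans 2 cm (from -1 to 1) with a 0.2 cm spacing, as in the N = 100, L = 2 case; the variable names are illustrative.

```matlab
side = 2;          % each dimension runs from -1 to 1 (cm)
spacing = 0.2;     % separation between adjacent grid points (cm)
nPerDim = round(side / spacing);   % 10 points along each dimension

L = 2:6;                 % number of dimensions
N = nPerDim .^ L;        % grid points needed to keep the same spacing
for k = 1:numel(L)
    fprintf('L = %d: N = %d\n', L(k), N(k));
end

% The other way round: with N fixed at 100, points available per dimension
Nfixed = 100;
pointsPerDim = Nfixed .^ (1 ./ L);
disp(pointsPerDim);
```

With N fixed, the number of distinct sample values along each axis shrinks quickly as L grows, which is the sparse sampling described above.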


Advantage of random sampling

• Uniformly distributed random points in L-dimensional space.

• Avoids inefficiency of rectangular grids created by regular sampling by using a purely random sample of N points uniformly distributed.

• For high dimensions L, error is proportional to 1/√N.

• To reduce the error by a factor of 2, the sample size N must be increased by a factor of 4 regardless of the dimensionality.

• There are ways of decreasing the Monte Carlo error to make the technique still more efficient.

• One approach is to use ‘quasi-random’ or ‘low-discrepancy’ sampling.

• The use of such quasi-random sampling for numerical integration is referred to as “quasi–Monte Carlo” integration.
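The 1/√N behaviour can be checked empirically. The sketch below is an illustration under assumptions chosen here: it estimates the area under y = sqrt(1 - x^2) on -1 < x < 1 (exact value π/2) many times with N points and with 4N points, and compares the root-mean-square errors. The repetition count R and the sample sizes are arbitrary.

```matlab
R = 400;           % number of independent repetitions
N1 = 500;          % smaller sample size
N2 = 4 * N1;       % four times as many points
exact = pi / 2;    % area under y = sqrt(1 - x^2) for -1 < x < 1

% Each row is one Monte Carlo run: y uniform on (0,1), x uniform on (-1,1);
% the bounding rectangle has area 2, so area = 2 * fraction of hits
estimate = @(N) 2 * mean(rand(R, N) <= sqrt(1 - (2 * rand(R, N) - 1).^2), 2);

rms1 = sqrt(mean((estimate(N1) - exact).^2));
rms2 = sqrt(mean((estimate(N2) - exact).^2));
ratio = rms1 / rms2;

fprintf('RMS error with N = %d: %.4f\n', N1, rms1);
fprintf('RMS error with N = %d: %.4f\n', N2, rms2);
fprintf('ratio = %.2f (about 2 expected)\n', ratio);
```

The ratio fluctuates from run to run because it is itself estimated from random samples, but it should stay close to 2.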


MATLAB: Area of Semicircle

for N=1:200
    M=0;
    for i=1:N
        x=2*rand -1;
        y=rand*1.0;
        I = sqrt(1-x*x);
        if y <= I, M=M+1; end; % if y <= I point (x,y) is below curve
    end;
    Int(N)=M*2/N; % area of bounding rectangle (2) times fraction of hits
end; % of N loop
figure(6); plot(Int); title('Area of semicircle');
xlabel('Number of points');


Convergence as N → ∞

[Figure: plot titled 'Area of semicircle'; x axis 'Number of points', 0 to 200; y axis from 1.3 to 2]


MATLAB code for scatter plot

clear; N=6000; M=0;
for i=1:N
    x(i)=rand*2-1;
    y(i)=rand*1.0;
    I = sqrt(1-x(i)*x(i));
    % colour index 2 for points on or below the curve, 1 for points above it
    if y(i)<=I, M=M+1; C(i) = 2; else C(i)=1; end;
end;
scatter(x,y,6,C,'filled');
Int = M*2/N ;
title(sprintf('Scatter of MC area method: N=%d, Int=%d',N,Int));
disp('Red if y <= I, blue if y > I');
xlabel('x Plot red if y <= I'); ylabel('y');


Integration of circle – scatter plot

[Figure: scatter plot titled 'Scatter of MC area method: N=6000, Int=1.574000e+000'; x axis from -1 to 1, labelled 'x Plot red if y <= I'; y axis from 0 to 1, labelled 'y']
