Random Projections - Healthiness and Stochastic Simulation

João Brazuna

Statistical Methods in Data Mining
Instituto Superior Técnico
November 29, 2016



Contents

Defining Random Projections

How Do Random Projections Work?

How to Apply Random Projections?

Multiple Linear Regression Model

Stochastic Simulation of Multiple Linear Regression Models

Multiple Logistic Regression Model

Diagnosing Leukaemia

Other Applications


Information Era

Characterized by:

- High-dimensional data;
- Data that is difficult to process.

Random Projections' Goal:

Efficiently reduce the dimension of the data.

Some Applications of Random Projections

- Classification;
- Clustering;
- Regression.


"Curse of Dimensionality"

It affects data analysis in two different settings:

- Many features and many samples;
- Many features and few samples.

An Efficient Solution: Random Projections


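Notation

- Sample size n;
- For the i-th sample, with $i \in \{1, \dots, n\}$, we observe p features;
- Vector with the p features for sample i:

  $$x_i = (x_{i1}, \dots, x_{ip}) \in \mathbb{R}^p, \quad \forall i \in \{1, \dots, n\};$$

- We stack the n vectors (by rows) in an n × p matrix:

  $$X = \begin{bmatrix} x_1^t \\ \vdots \\ x_n^t \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \in \mathbb{R}^{n \times p}.$$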


Random Projections’ Goal

What do we have?
n vectors in $\mathbb{R}^p$.

What do we want?
n vectors in $\mathbb{R}^k$, with k < p, and with all pairwise squared distances preserved up to a factor of (1 ± ε).

So, we keep the sample size but significantly reduce the number of features.

What do we need?
A function $f : \mathbb{R}^p \to \mathbb{R}^k$ such that, for any $u, v \in \{x_1, \dots, x_n\}$,

$$(1-\varepsilon)\,\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1+\varepsilon)\,\|u - v\|^2.$$


Random Projections’ Goal

Figure: Example of our goal. Source: [1]


Principal Component Analysis vs. Random Projections

PCA's Goal
To preserve data variability:

- 1st principal component ⇒ direction of maximum variability;
- 2nd principal component ⇒ direction of maximum variability that is orthogonal to the 1st;
- ...

RP's Goal
To preserve distances between vectors: for all rows u, v of X, the squared distance between the projected values f(u) and f(v) is close to the original squared distance between u and v: it lies between $(1 - \varepsilon)\|u - v\|^2$ and $(1 + \varepsilon)\|u - v\|^2$.


Some Questions on Random Projections

- How can we define that function?
- Can we use it regardless of the values of n or p?
- Is there any restriction on k? How can we determine the smallest possible k (the dimension of the space into which we want to project the n vectors of $\mathbb{R}^p$)?


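How Do Random Projections Work?

Lemma (Johnson-Lindenstrauss)

Let:

- X be a matrix in $\mathbb{R}^{n \times p}$, whose rows are denoted by $x_i \in \mathbb{R}^p$, $\forall i \in \{1, \dots, n\}$:

  $$X = \begin{bmatrix} x_1^t \\ \vdots \\ x_n^t \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \in \mathbb{R}^{n \times p};$$

- $\varepsilon \in \,]0, 1[$ arbitrary;
- $k \in \mathbb{N}$ such that $\left\lceil \frac{24}{3\varepsilon^2 - 2\varepsilon^3} \log n \right\rceil \le k < p$ (natural logarithm).

Then there exists $f : \mathbb{R}^p \to \mathbb{R}^k$ such that, for any $u, v \in \{x_1, \dots, x_n\}$,

$$(1-\varepsilon)\,\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1+\varepsilon)\,\|u - v\|^2.$$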


The Answers

It is not possible to randomly project an arbitrary set of vectors with this guarantee. We must have a value of p of at least $\left\lceil \frac{24}{3\varepsilon^2 - 2\varepsilon^3} \log n \right\rceil + 1$, because k < p.

Figure: Smallest value of k vs. ε and n


Conclusions

- For fixed ε, k increases slowly (logarithmically) as n increases;
- For fixed n, k decreases quickly as ε increases.

Figure: Smallest value of k vs. ε and n


Conclusions

We can produce a table, taking n and k as integer values.

n \ ε    0.0001       0.001       0.1     0.3    0.5    0.7   0.9   0.999   0.9999
10       1.85×10^9    1.85×10^7   1974    256    111    71    57    56      56
10^2     3.69×10^9    3.69×10^7   3948    512    222    141   114   111     111
10^3     5.53×10^9    5.53×10^7   5921    768    332    212   171   166     166
10^6     1.11×10^10   1.11×10^8   11842   1536   664    423   342   332     332
10^9     1.66×10^10   1.66×10^8   17763   2303   995    635   512   498     498
10^12    2.22×10^10   2.22×10^8   23684   3071   1327   846   683   664     664

Table: Smallest value of k for fixed ε and n

For instance, 1 million vectors of dimension 10 million can be projected, with ε = 0.5, into $\mathbb{R}^{664}$.
We obtain 1 million vectors of dimension 664 each, with all the squared distances preserved up to a factor of 1 ± ε.
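The bound on k can be evaluated directly. The sketch below assumes the natural logarithm, which is what reproduces the tabulated values; the function name `jl_min_k` is hypothetical and chosen only for illustration.

```python
import math


def jl_min_k(n: int, eps: float) -> int:
    """Smallest k allowed by the bound ceil(24 / (3 eps^2 - 2 eps^3) * ln n)."""
    if n < 2 or not 0.0 < eps < 1.0:
        raise ValueError("need n >= 2 and 0 < eps < 1")
    return math.ceil(24.0 / (3.0 * eps**2 - 2.0 * eps**3) * math.log(n))


if __name__ == "__main__":
    # Print a few rows of the table: one line per n, one value per eps.
    for n in (10, 10**2, 10**3, 10**6):
        print(n, [jl_min_k(n, eps) for eps in (0.1, 0.3, 0.5, 0.7, 0.9)])
```

Note that the bound depends only on n and ε, not on the original dimension p, which is why the same k is obtained for p = 1000 and p = 10000 when n is unchanged.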


How to Apply Random Projections?

- The tool comes from the proof of the Johnson-Lindenstrauss Lemma;
- Given a data matrix $X \in \mathbb{R}^{n \times p}$ with n samples of p features each, our goal is to find a projection matrix R such that E = XR is the projection of X;
- Let $f : \mathbb{R}^p \to \mathbb{R}^k$ be given by $f(u) = \frac{1}{\sqrt{k}} A u$, where $A \in \mathbb{R}^{k \times p}$ is a matrix with entries $A_{ij} \overset{\text{i.i.d.}}{\sim} N(0, 1)$ for all i, j;
- A single random draw of A preserves all the squared distances up to a factor of 1 ± ε with positive probability (at least 1/n in the proof of the lemma), so repeating the draw O(n) times yields a suitable map with high probability;
- Take R such that $R^t = \frac{1}{\sqrt{k}} A$; the projection is then given by E = XR.
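The single-draw map $f(u) = \frac{1}{\sqrt{k}} A u$ described above can be sketched in a few lines. This is a minimal illustration in pure Python, not the code used for the slides: all function names are hypothetical, and the toy sizes in the demo are chosen for speed rather than from the lemma's bound, so no (1 ± ε) guarantee is claimed for them.

```python
import math
import random


def gaussian_projection_matrix(p, k, rng):
    """Return R (p x k) with R^t = A / sqrt(k), where A_ij ~ N(0, 1) i.i.d."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(p)]


def project(X, R):
    """Compute E = X R for X (n x p) and R (p x k), both stored as lists of rows."""
    k = len(R[0])
    return [[sum(x_j * r_row[c] for x_j, r_row in zip(row, R)) for c in range(k)]
            for row in X]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distortion_range(X, E):
    """Min and max of ||f(u) - f(v)||^2 / ||u - v||^2 over all pairs of rows."""
    ratios = [sq_dist(E[i], E[j]) / sq_dist(X[i], X[j])
              for i in range(len(X)) for j in range(i + 1, len(X))]
    return min(ratios), max(ratios)


if __name__ == "__main__":
    rng = random.Random(0)
    n, p, k = 4, 600, 200  # toy sizes (assumption), with k < p
    X = [[rng.uniform(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    E = project(X, gaussian_projection_matrix(p, k, rng))
    print(distortion_range(X, E))  # both ratios should be close to 1
```

If the observed ratios fall outside [1 - ε, 1 + ε] for the chosen ε, the draw can simply be repeated with a fresh matrix, in line with the repetition argument above.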


How to Apply Random Projections?

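1. Fix $\varepsilon \in \,]0, 1[$;

2. Choose $k \in \mathbb{N}$ such that $\left\lceil \frac{24}{3\varepsilon^2 - 2\varepsilon^3} \log n \right\rceil \le k < p$;

3. Build a matrix $R = \frac{1}{n\sqrt{k}} A^t$, where $A = \sum_{l=1}^{n} A_l$ and $A_1, \dots, A_n$ are n real k × p matrices with standard normal entries, which means that $(A_l)_{ij} \overset{\text{i.i.d.}}{\sim} N(0, 1)$, $\forall i \in \{1, \dots, k\}$, $\forall j \in \{1, \dots, p\}$;

4. Obtain the projection of matrix X by computing E = XR.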


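Interpreting Projected Features

The data matrix X contains n samples of p features:

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \in \mathbb{R}^{n \times p}.$$

After the projection, the new matrix is the product of X by a matrix $R \in \mathbb{R}^{p \times k}$:

$$E = XR = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} r_{11} & \cdots & r_{1k} \\ \vdots & \ddots & \vdots \\ r_{p1} & \cdots & r_{pk} \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^{p} x_{1j} r_{j1} & \cdots & \sum_{j=1}^{p} x_{1j} r_{jk} \\ \vdots & \ddots & \vdots \\ \sum_{j=1}^{p} x_{nj} r_{j1} & \cdots & \sum_{j=1}^{p} x_{nj} r_{jk} \end{bmatrix} \in \mathbb{R}^{n \times k}$$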


Interpreting Projected Features

Each new feature is a linear combination of all the original features, where the coefficients are the elements of matrix R, which are

$$r_{ij} = \frac{1}{n\sqrt{k}} \sum_{l=1}^{n} (A_l)_{ji}, \quad \text{with } (A_l)_{ji} \sim N(0, 1).$$


Multiple Linear Regression Model

Let us consider the general linear model with Gauss-Markov structure:

$$y = X_D \beta + \varepsilon \iff Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$$

- $y = (Y_1, \dots, Y_n)$ is the n-dimensional vector containing the values of the response variable Y;
- $X_D = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}$ is the n × (p + 1) design matrix;


Multiple Linear Regression Model

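$$y = X_D \beta + \varepsilon$$

- $\beta = (\beta_0, \dots, \beta_p)$ is the vector of the p + 1 regression parameters;
- $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$ is the vector of random errors such that:
  - $E(\varepsilon) = 0 \iff E(\varepsilon_i) = 0, \ \forall i \in \{1, \dots, n\}$;
  - $\mathrm{Var}(\varepsilon) = \sigma^2 I \iff \mathrm{Var}(\varepsilon_i) = \sigma^2, \ \forall i \in \{1, \dots, n\}$;
  - $\mathrm{Corr}(\varepsilon_i, \varepsilon_j) = 0, \ \forall i \neq j$.

$$E(Y \mid x) = X_D \beta$$

To make inferences, we additionally assume that

$$\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).$$

So the fitted values are

$$\hat{y} = X_D \hat{\beta} \iff \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip}.$$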
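For reference, the fitted values above rely on an estimate $\hat{\beta}$. The slides do not state the estimator, so the following is the standard least-squares result, added here as an assumption about what is meant. It exists only when $X_D^t X_D$ is invertible, which requires at least as many samples as parameters (n ≥ p + 1).

```latex
\hat{\beta} = \left( X_D^{t} X_D \right)^{-1} X_D^{t}\, y,
\qquad
\hat{y} = X_D \hat{\beta} = X_D \left( X_D^{t} X_D \right)^{-1} X_D^{t}\, y .
```

When p + 1 > n, the matrix $X_D^t X_D$ is singular and the estimator is not uniquely defined, which is the situation that motivates reducing the number of features first.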


Simulating a Multiple Linear Regression Model - Ideal Case

We generated in R:

- n = 5000 samples;
- p = 1000 features:
  - 400 features from the distribution N(20, 25);
  - 500 features from the distribution Unif(5, 95);
  - 100 features from the distribution Bin(100, 0.5);
- $X \in \mathbb{R}^{5000 \times 1000}$ is the concatenation of all 5000 samples of the 1000 generated features;
- $\beta_k \sim N(0, 100)$, $\forall k \in \{0, 1, \dots, p\}$ are the true regression parameters;
- σ = 2.055656 is the constant standard deviation of the random errors;
- $\varepsilon_i \sim N(0, \sigma^2)$, $\forall i \in \{1, \dots, n\}$ are the random errors, with the only restriction imposed by the model!


Simulating a Multiple Linear Regression Model - Ideal Case

With all this data, we generate the response variable vector:

$$y = X_D \beta + \varepsilon.$$

As a sanity check, we can estimate the parameters $\beta_0, \beta_1, \dots, \beta_p$ and verify that they are similar to the original ones.
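The simulation recipe can be sketched as follows. The original was run in R and its code is not shown, so this is a pure-Python illustration with hypothetical names and much smaller sizes in the demo. It assumes that the second parameter of N(·, ·) is a variance, as in $N(0, \sigma^2)$, so N(20, 25) has standard deviation 5 and N(0, 100) has standard deviation 10.

```python
import random


def simulate_linear_model(n, n_norm, n_unif, n_binom, sigma, rng):
    """Generate (X, beta, y) with y = beta_0 + X beta_{1..p} + eps, following the recipe."""
    def feature_row():
        row = [rng.gauss(20.0, 5.0) for _ in range(n_norm)]         # N(20, 25): sd = 5
        row += [rng.uniform(5.0, 95.0) for _ in range(n_unif)]      # Unif(5, 95)
        row += [float(sum(rng.random() < 0.5 for _ in range(100)))  # Bin(100, 0.5)
                for _ in range(n_binom)]
        return row

    p = n_norm + n_unif + n_binom
    X = [feature_row() for _ in range(n)]
    beta = [rng.gauss(0.0, 10.0) for _ in range(p + 1)]             # N(0, 100): sd = 10
    y = [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) + rng.gauss(0.0, sigma)
         for row in X]
    return X, beta, y


if __name__ == "__main__":
    rng = random.Random(2016)
    # Same 4:5:1 proportions as the slide, at a reduced size (assumption for speed).
    X, beta, y = simulate_linear_model(50, 4, 5, 1, 2.055656, rng)
    print(len(X), len(X[0]), len(beta), len(y))
```

Calling the function with `n=5000, n_norm=400, n_unif=500, n_binom=100` would match the sizes of the ideal case, at a noticeably higher running time in pure Python.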


Applying Random Projections to the Generated Model - Ideal Case

Using ε = 0.5 (the factor from random projections, not the error term of the linear model), we reduce the number of features from p = 1000 to k = 409.
Taking ε = 0.999, we can get k = 205, which is the smallest integer k that can be used!

ε       p      k     R²_a     VIF   AIC     PRESS
0.5     1000   409   71.23%   < 5   95433   5.7×10^10
0.999   1000   205   37.58%   < 5   98976   1.2×10^10

Model assumptions still seem to hold in both cases.
We obtain good results using ε = 0.5, but they are not as good when we take ε = 0.999.


Simulating a Multiple Linear Regression Model - Non-Ideal Case

- n = 5000 samples, as before;
- p = 10000 features, 10 times more, keeping the proportions and the distributions:
  - 4000 from the distribution N(20, 25);
  - 5000 from the distribution Unif(5, 95);
  - 1000 from the distribution Bin(100, 0.5);
- $X \in \mathbb{R}^{5000 \times 10000}$ is the concatenation of all 5000 samples of the 10000 generated features;
- $\beta_k \sim N(0, 100)$, $\forall k \in \{0, 1, \dots, p\}$ are the regression parameters;
- σ = 2.055656 is the constant standard deviation of the random errors;
- $\varepsilon_i \sim N(0, \sigma^2)$, $\forall i \in \{1, \dots, n\}$ are the random errors, with the only restriction imposed by the model;
- $y = X_D \beta + \varepsilon$.

There are more features than samples, so we cannot estimate the regression parameters!


Applying Random Projections to the Generated Model - Non-Ideal Case

Using ε = 0.5 (the factor from random projections, not the error term of the linear model), we reduce the number of features from p = 10000 to k = 409, as before.
Taking ε = 0.999, we can also get k = 205, which is the smallest integer k that can be used!

ε       p       k     R²_a   VIF   AIC      PRESS
0.5     10000   409   8%     < 5   112323   1.7×10^12
0.999   10000   205   4%     < 5   112379   1.7×10^12

Model assumptions still seem to hold in both cases.
We do not obtain good results using ε = 0.5, and they are even worse when we choose ε = 0.999.


Conclusions

- Model assumptions still hold;
- The proportion of data variability explained by the linear model seems to decrease, with AIC increasing and $R^2_a$ decreasing;
- The number of influential observations seems to increase as we reduce the dimension k of the space into which we project our data, with PRESS getting larger or, at least, staying of the same order of magnitude;
- Interpreting the regression parameters is more difficult.


Interpreting the Regression Parameters

In General:
The coefficient $\beta_k$ tells us how much a fitted value increases or decreases when the k-th explanatory variable (after projection) increases by one unit, keeping all the other terms fixed.

$$Y = \beta_0 + \beta_1 x_1^* + \cdots + \beta_p x_p^*$$

When Applying Random Projections:

The coefficient $\beta_k$ tells us how much a fitted value increases or decreases when a linear combination $x_k^*$ of all the original explanatory variables (before projection) increases by one unit, keeping all the other terms fixed. But those other terms also depend on the original variables, through the structure of the model...


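Multiple Logistic Regression Model

For the next dataset, our response variable will be binary.
When the response variable is categorical, we should use logistic regression.
Using the logistic function, we obtain the model

$$E(Y_i \mid x_{i1}, \dots, x_{ip}) = \pi_i = \frac{e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}}$$

which can be linearised by the logit function

$$\pi_i^* = \log\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip},$$

which is continuous, linear in the parameters and takes values in $\mathbb{R}$.

$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \quad X_i = \begin{bmatrix} 1 \\ X_{i,1} \\ \vdots \\ X_{i,p} \end{bmatrix}$$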
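The logistic function and its inverse, the logit, can be written down directly from the two formulas above. The sketch below uses hypothetical function names and a numerically stable form of the logistic function. It only evaluates the model for given coefficients and does not estimate them.

```python
import math


def logistic(eta):
    """pi = e^eta / (1 + e^eta), evaluated without overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def logit(pi):
    """Inverse of the logistic function: log(pi / (1 - pi)), for 0 < pi < 1."""
    return math.log(pi / (1.0 - pi))


def predicted_probability(beta, x):
    """pi_i for beta = (beta_0, ..., beta_p) and features x = (x_1, ..., x_p)."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return logistic(eta)


if __name__ == "__main__":
    print(logistic(0.0))          # a linear predictor of 0 gives probability 0.5
    print(logit(logistic(1.3)))   # the logit undoes the logistic function
```

The round trip `logit(logistic(eta))` recovering `eta` is exactly the linearisation mentioned above: on the logit scale the model is linear in the parameters.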


Multiple Logistic Regression Model

Figure: Example of a Logistic Function


Multiple Logistic Regression Model

The Model

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}} = \frac{e^{x^t \beta}}{1 + e^{x^t \beta}} = \left[ 1 + e^{-x^t \beta} \right]^{-1}$$

where the logit function is given by

$$\pi^*(x) = \log\left[ \frac{\pi(x)}{1 - \pi(x)} \right] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

Estimating the Parameters

It is possible to estimate β by maximum likelihood, but the likelihood equations are non-linear in the coefficients $\beta_k$, so we need numerical methods. R uses the Fisher scoring algorithm.
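To make the estimation step concrete, here is a minimal sketch of Fisher scoring for a logistic model with a single predictor, so that the update reduces to solving a 2 × 2 system by hand. It is an illustration under that simplifying assumption, not the routine R runs internally, and the function name and toy data are hypothetical. For the logit link, Fisher scoring coincides with Newton-Raphson.

```python
import math


def fit_logistic_fisher(x, y, max_iter=25, tol=1e-10):
    """Fisher scoring for logit(pi_i) = b0 + b1 * x_i (one predictor, for brevity)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Accumulate the score (g0, g1) and the Fisher information [[i00, i01], [i01, i11]].
        g0 = g1 = i00 = i01 = i11 = 0.0
        for xi, yi in zip(x, y):
            pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = pi * (1.0 - pi)
            g0 += yi - pi
            g1 += (yi - pi) * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        # Solve information * step = score for the 2 x 2 case.
        det = i00 * i11 - i01 * i01
        s0 = (i11 * g0 - i01 * g1) / det
        s1 = (i00 * g1 - i01 * g0) / det
        b0, b1 = b0 + s0, b1 + s1
        if abs(s0) + abs(s1) < tol:
            break
    return b0, b1


if __name__ == "__main__":
    # Toy data (made up): the classes overlap, so the maximum likelihood estimate exists.
    x = [0, 1, 2, 3, 4, 5, 6, 7]
    y = [0, 0, 1, 0, 1, 0, 1, 1]
    print(fit_logistic_fisher(x, y))
```

If the two classes were perfectly separated by x, the likelihood would have no finite maximiser and the iterations would not settle, which is one way convergence problems can arise in practice.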


Another Application - Clinical Data From Leukaemia Diagnosis

Data from clinical experiments at St. Jude Children's Research Hospital, Memphis, Tennessee, USA, were published in 2002.
They are microarray data for diagnosing acute lymphoblastic leukaemia.

- n = 327 samples: 327 patients from that hospital;
- p = 12625 explanatory variables: 12625 genes;
- Y is the binary response variable:

$$Y = \begin{cases} 1, & \text{if the patient is ill} \\ 0, & \text{otherwise.} \end{cases}$$


Another Application - Clinical Data From Leukaemia Diagnosis

Figure: Construction of a Microarray


Building a Regression Model

Ideal Number of Samples:

Between 5 and 10 samples per explanatory variable.

What do we have?
More explanatory variables than samples (0.026 samples per explanatory variable). It is not possible to estimate the regression parameters!

What can we do?
Apply random projections to the explanatory variables.
There are n = 327 vectors in $\mathbb{R}^p$, with p = 12625, that can be projected into $\mathbb{R}^k$, with k < p. By the Johnson-Lindenstrauss Lemma, the smallest k under those conditions, picking ε = 0.999, is 139.

After Projection:

327 samples and 139 explanatory variables. It is not ideal, but at least we can now build a model.


Building a Regression Model

The response variable is binary, so we should apply logisticregression. We will apply both methods: linear and logisticregression.


Applying Multiple Linear Regression

We estimate the k + 1 = 140 regression parameters (one per projected feature, plus the intercept) in R, without treating the response variable as categorical.

Classification Rule
We want to classify the patients as being ill or not.
If a fitted value $\hat{y}_i$ is larger than 0.5, we classify the i-th patient as being ill.
Otherwise, we do not consider that patient as ill.
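The classification rule amounts to thresholding the fitted values at 0.5. A minimal sketch, with hypothetical function names and made-up fitted values rather than those of the leukaemia model:

```python
def classify(fitted_values, threshold=0.5):
    """Label 1 (ill) when the fitted value is strictly larger than the threshold, else 0."""
    return [1 if v > threshold else 0 for v in fitted_values]


def count_labels(labels):
    """Return (number classified as ill, number classified as not ill)."""
    ill = sum(labels)
    return ill, len(labels) - ill


if __name__ == "__main__":
    fitted = [0.2, 0.5, 0.7, 1.3]  # made-up values; a linear model can exceed 1
    labels = classify(fitted)
    print(labels, count_labels(labels))
```

Because a linear model does not constrain fitted values to [0, 1], values such as 1.3 are possible and are simply classified as ill by this rule.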

Multiple Linear Regression Model

- $R^2_a \simeq 16\%$;
- Several variables with a very high p-value in their t-tests;
- F-test with a small p-value: 0.8%;
- VIF > 10 for 4 variables, VIF < 5 for 92 variables;
- AIC ≃ 20;
- PRESS ≃ 27.


Applying Multiple Linear Regression

Model Assumptions

When we check the model assumptions, we easily find that they do not hold, so the multiple linear regression model does not fit the data.

Prediction

There are 327 patients, 19 of them not ill and 308 ill. Applying the previous classification rule, we get

• 312 ill patients;
• 15 healthy patients.

Also, we did not split the data into a training set and a test set, so these predictions were made on the same data used to fit the model.

Applying Multiple Logistic Regression

An Additional Problem

We have a sample of 19 healthy patients and 308 ill ones in a total of 327 observations, so the data are not balanced. This leads to convergence problems when estimating the parameters.
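To see how unbalanced the sample is, the arithmetic below uses only the counts quoted above. It shows that a rule labelling every patient as ill would already be right for about 94% of the sample, which is why raw accuracy alone says little here. The variable names are illustrative.

```python
n_ill, n_healthy = 308, 19
n_total = n_ill + n_healthy

# Accuracy of a trivial rule that always predicts the majority class ("ill").
majority_accuracy = n_ill / n_total
# Share of the sample that belongs to the minority class ("healthy").
minority_share = n_healthy / n_total

print(n_total)                      # 327
print(f"{majority_accuracy:.3f}")   # 0.942
print(f"{minority_share:.3f}")      # 0.058
```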

One Possible Solution

Allow a maximum error of 0.1 in the parameter estimation and up to 100000 iterations.

Applying Multiple Logistic Regression

• Very high standard deviation for each coefficient (order of $10^3$);
• AIC $\approx 280$;
• PRESS $\approx 4.62$.
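One possible reading of these results, which is an assumption and not something the slides state, is that the two classes are (nearly) separable in the projected space. In that case the likelihood has no finite maximum, the coefficients keep growing as the fit runs longer, and a loose tolerance simply decides where the iteration stops. The sketch below illustrates this on a hypothetical, perfectly separable one-feature data set. It uses plain gradient ascent rather than whatever routine was actually used in the study, and `fit_logistic`, `step` and the toy data are all illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, tol=0.1, max_iter=100000, step=0.5):
    """Gradient ascent on the log-likelihood of a one-feature logistic model.

    Stops when the largest parameter update is below `tol`,
    or after `max_iter` iterations. Returns (intercept, slope, iterations).
    """
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        d0 = step * sum(residuals)
        d1 = step * sum(r * x for r, x in zip(residuals, xs))
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            return b0, b1, iteration
    return b0, b1, max_iter


# Toy data (hypothetical): the sign of x separates the classes perfectly.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

# Loose tolerance, as on the slide: the loop stops after a few iterations.
_, slope_loose, iters_loose = fit_logistic(xs, ys, tol=0.1)

# Tight tolerance: the slope keeps growing with the iteration budget.
_, slope_200, _ = fit_logistic(xs, ys, tol=1e-9, max_iter=200)
_, slope_2000, _ = fit_logistic(xs, ys, tol=1e-9, max_iter=2000)

print(iters_loose, slope_loose, slope_200, slope_2000)
```

On this toy data the slope reported with the loose tolerance is only a stopping point, not an optimum: tightening the tolerance and raising the iteration cap yields ever larger slopes.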

Applying Multiple Logistic Regression

Prediction

It is in prediction that we see the most important improvement. Applying the classification rule defined above, we predict

• 308 ill patients;
• 19 healthy patients,

which corresponds to the real data. Once again, we did not split the data into a training set and a test set, so this agreement is measured on the data used to fit the model.
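A held-out evaluation would need a split that keeps some healthy patients in both parts. The sketch below shows one way to do that with a stratified split. It is not part of the original analysis, and `stratified_split`, `test_fraction` and `seed` are illustrative names.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split indices into train and test sets, class by class.

    Each class contributes roughly `test_fraction` of its members
    (and at least one) to the test set.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Class counts from the slides: 308 ill (1) and 19 healthy (0) patients.
labels = [1] * 308 + [0] * 19
train_idx, test_idx = stratified_split(labels)

healthy_in_test = sum(1 for i in test_idx if labels[i] == 0)
print(len(train_idx), len(test_idx), healthy_in_test)
```

With `test_fraction=0.3` this holds out 92 ill and 6 healthy patients, so the minority class is represented in both the training and the test set.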

Comparing Random Projections and PCA

Advantages of PCA

• Better adjusted coefficient of determination (consistent with PCA’s goal);
• Faster if we extract few principal components.

Advantages of RP

• Better predictions;
• Faster if we need to extract a lot of principal components when using PCA.
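The speed argument rests on the fact that a random projection matrix is drawn without looking at the data, whereas PCA has to decompose the data first. A minimal sketch of a Gaussian random projection follows. The scaling $N(0, 1/k)$ is one common choice and is an assumption here, and the function name and toy data are illustrative.

```python
import math
import random


def random_projection(rows, k, seed=0):
    """Project each p-dimensional row onto k dimensions.

    The p x k matrix has i.i.d. N(0, 1/k) entries, so squared lengths
    are preserved in expectation. No decomposition of the data is needed.
    """
    rng = random.Random(seed)
    p = len(rows[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(p)]
    return [
        [sum(row[i] * matrix[i][j] for i in range(p)) for j in range(k)]
        for row in rows
    ]


# Toy data (hypothetical): 5 observations with 50 variables each.
data_rng = random.Random(1)
data = [[data_rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(5)]

projected = random_projection(data, k=10)
print(len(projected), len(projected[0]))  # 5 10
```

Drawing the matrix costs on the order of $pk$ operations and the projection on the order of $npk$, independently of how many components a comparable PCA would have to extract.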

Other Applications - Linear Regression with Compressive Ordinary Least Squares

Figure: Application to Detection of Musical Patterns with $n = 2000$, $p = 10^6$ and Very Sparse Data. Source: [5]

Other Applications - Noise Detection on Images

Figure: Application to Noise Detection on Images. Source: [7]

Other Applications - Noise Detection on Images

Figure: Impact on the required floating point operations after noise detection on images, using MATLAB. Source: [7]

Appendix

Bibliography

Bibliography I

Aditya Krishna Menon. Random Projections and Applications to Dimensionality Reduction. School of Information Technologies, University of Sydney, Australia, 2007.

Lopez-Paz and Duvenaud. Random Projections. School of Engineering and Applied Sciences, Harvard University, United States of America, 2013.

Michael Mahoney. The Johnson-Lindenstrauss Lemma. School of Engineering, Stanford University, United States of America, 2009.

Bibliography II

Sanjoy Dasgupta and Anupam Gupta. An Elementary Proof of a Theorem of Johnson and Lindenstrauss. New Jersey, United States of America, 2001.

Robert J. Durrant and Ata Kabán. Random Projections for Machine Learning and Data Mining: Theory and Applications. University of Birmingham, United Kingdom, 2012.

Conceição Amado. Regressão Logística - Uma Introdução. Instituto Superior Técnico, Universidade de Lisboa, Portugal, 2010.