Converting High Dimensional Problems to Low Dimensional Ones

General Paradigm: Reduce and Conquer

• Large Problem → Small Problem

– Break array into two parts

– Consider odd and even elements

– Sample edges in a graph to obtain a smaller graph

– Represent a graph by a collection of trees

– Take number modulo small prime

– Multiply matrix by a random vector

– Project high dimensional point sets into fewer dimensions

The Problem

• Given n points in D dimensional space

• Project them into d << D dimensions

– So that the (Euclidean) distance between every pair of points is (almost) preserved

• How does d compare to n?

Application

• Hierarchical Clustering

• Say ten thousand samples each over a few million SNPs

• Few million dimensions → a few hundred or a few thousand? And fast?

First Attempt

• Can we make d=n-1?

– X axis through 2 of the points

– Y axis so 3rd point is in the XY plane

– Z axis so 4th point is in the XYZ 3d space

– And so on

First Attempt

• Time taken

– Each new axis has to be made orthogonal to all previous axes

– $O(n^2 D)$

– Too slow
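The axis-by-axis construction above is essentially Gram-Schmidt orthogonalization. The following is a minimal sketch of it, assuming the points are given as plain Python lists of floats; the names `embed_exact` and `dot` are illustrative and not from the slides. Each point is compared against every axis built so far, which is where the $O(n^2 D)$ cost comes from.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def embed_exact(points, eps=1e-12):
    """Map n points in D dimensions to at most n-1 coordinates,
    preserving all pairwise Euclidean distances (up to rounding)."""
    origin = points[0]          # the first point becomes the origin
    basis = []                  # orthonormal axes built so far
    coords = []
    for p in points:
        v = [a - b for a, b in zip(p, origin)]
        # Coordinates of v along the existing axes: O(n * D) work per point.
        c = [dot(v, b) for b in basis]
        # Remove those components; what is left is orthogonal to all axes.
        r = v[:]
        for ci, b in zip(c, basis):
            r = [x - ci * y for x, y in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:
            # The point needs a new axis; its coordinate along it is norm.
            basis.append([x / norm for x in r])
            c.append(norm)
        coords.append(c)
    k = len(basis)
    # Earlier points have zero coordinates along axes added later.
    return [c + [0.0] * (k - len(c)) for c in coords]
```

For example, four points in five dimensions in general position come out with three coordinates each, and every pairwise distance is unchanged up to floating-point error.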

Second Attempt: Use Random Projections

• Take d random vectors r1..rd

• For every point p, take the d dimensional point

– [ p.r1 p.r2 .. p.rd ] * scaling-factor

• Do these d-dim points preserve inter-point distances approximately? How large should d be?

Random Projections: Further Simplification

• Take any vector p in D dimensions

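• Suppose we show

– [ p.r1 p.r2 .. p.rd ] * scaling-factor has length ~ |p|

– Failure prob < $1/n^3$

• The projection is linear, so the projected difference of two points is the projection of their difference vector

• Then the prob that even one of the (fewer than $n^2$) difference vectors does not have its length preserved is < $n^2/n^3 = 1/n$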

Random Projections: What is a random vector?

• No directional bias

Normal Distributions

• Pr of being between x and x+dx

– For N(0,1), this is proportional to $e^{-x^2/2}\,dx$

Generating Random Vectors without Directional Bias

• Take D numbers (X1...XD), each N(0,1), independently

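• Distribution of each number X

– Pr of being between a..a+da is proportional to $e^{-a^2/2}\,da$

• Pr of X1 in a1..a1+da1, X2 in a2..a2+da2, …, XD in aD..aD+daD is proportional to

– $e^{-a_1^2/2}\, e^{-a_2^2/2} \cdots e^{-a_D^2/2}\; da_1\, da_2 \cdots da_D$

– $= e^{-(a_1^2 + a_2^2 + \cdots + a_D^2)/2}\; da_1\, da_2 \cdots da_D$

– $= e^{-l^2/2}\; da_1\, da_2 \cdots da_D$, where $l^2 = a_1^2 + a_2^2 + \cdots + a_D^2$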

So no dependence on direction, only on length l !
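One way to see the absence of directional bias empirically: the dot product of such a Gaussian vector with any fixed unit vector should itself behave like N(0,1), whichever direction is chosen. The sketch below estimates the variance of that dot product by simulation. The function names, the trial count, and the seed are illustrative assumptions, not part of the slides.

```python
import math
import random

def gaussian_vector(D, rng):
    """D independent N(0,1) coordinates."""
    return [rng.gauss(0.0, 1.0) for _ in range(D)]

def projection_variance(direction, trials, rng):
    """Estimate E[(g . u)^2] where u is `direction` scaled to unit length
    and g is a fresh Gaussian vector on every trial."""
    norm = math.sqrt(sum(x * x for x in direction))
    unit = [x / norm for x in direction]
    total = 0.0
    for _ in range(trials):
        g = gaussian_vector(len(unit), rng)
        s = sum(a * b for a, b in zip(g, unit))
        total += s * s
    return total / trials

rng = random.Random(0)
along_axis = projection_variance([1, 0, 0, 0, 0], 20000, rng)
along_diagonal = projection_variance([1, 1, 1, 1, 1], 20000, rng)
```

Both estimates should come out close to 1, which is consistent with the density depending only on the length l and not on the direction.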

The Algorithm

• Take d random vectors r1..rd

– Each ri = [Xi1 Xi2 … XiD] where the X’s are chosen from N(0,1) independently

• For every point p, take the d dimensional point

– [ p.r1 p.r2 .. p.rd ] * sqrt(1/d)

• Time: n*d*D
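The steps of the algorithm can be sketched in a few lines of plain Python. This is a minimal illustration using only the standard library; the function name `random_projection` and the optional `seed` parameter are assumptions made for the example. A practical version would likely use a matrix library, but the arithmetic is the same: d Gaussian vectors, n·d dot products of length D, and a final scaling by sqrt(1/d).

```python
import math
import random

def random_projection(points, d, seed=None):
    """Project each D-dimensional point onto d random Gaussian vectors
    and scale by sqrt(1/d). Runs in O(n * d * D) time."""
    rng = random.Random(seed)
    D = len(points[0])
    # d random vectors r_1..r_d, each with D independent N(0,1) entries.
    R = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(d)]
    scale = math.sqrt(1.0 / d)
    # New coordinate i of point p is (p . r_i) * sqrt(1/d).
    return [[scale * sum(pj * rj for pj, rj in zip(p, r)) for r in R]
            for p in points]
```

The same random vectors must be used for every point, which is why they are drawn once before the loop over points.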

Simplifying Further

• Take any vector p in D dimensions

• We need to show that

– [ p.r1 p.r2 .. p.rd ] * sqrt(1/d) has length ~ |p|

– Failure prob < $1/n^3$

• We can assume p to be [1 0 0 0 0 0 …] (after scaling p to unit length)

– because random vectors have no directional bias

– Then [ p.r1 p.r2 .. p.rd ] * sqrt(1/d) = [X11 X21 … Xd1] * sqrt(1/d)

Analysis

• We need to show that

– [X1 X2 … Xd] * sqrt(1/d) has length ~ 1

– Failure prob < $1/n^3$

• Or $(X_1^2 + \cdots + X_d^2)/d \approx 1$, failure prob < $1/n^3$

• Or $X_1^2 + \cdots + X_d^2 \approx d$, failure prob < $1/n^3$

• Note each $X_i^2$ has mean 1 and s.d. $\sqrt{2}$
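The mean and standard deviation quoted here refer to the squares $X_i^2$ of the N(0,1) variables. A short derivation, using the standard fact that the fourth moment of N(0,1) is 3:

```latex
\begin{align*}
E[X_i^2] &= \operatorname{Var}(X_i) + (E[X_i])^2 = 1 + 0 = 1 \\
\operatorname{Var}(X_i^2) &= E[X_i^4] - (E[X_i^2])^2 = 3 - 1 = 2
  \quad\Rightarrow\quad \text{s.d.}(X_i^2) = \sqrt{2} \\
E\Big[\sum_{i=1}^{d} X_i^2\Big] &= d, \qquad
\operatorname{Var}\Big(\sum_{i=1}^{d} X_i^2\Big) = 2d
  \quad \text{(by independence)}
\end{align*}
```

So the sum of squares has mean d and s.d. $\sqrt{2d}$.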

Law of Large Numbers (Central Limit Theorem)

• Y1..Yd each with any (decent) distribution with mean 1 and s.d sqrt(2)

• Then Y1+…+Yd tends to a Normal distribution with mean d and s.d sqrt(2d) (for large d)

• Pr (Y1+…+Yd not in $(1-\Delta)d \,..\, (1+\Delta)d$) < $e^{-(\Delta d)^2 / (2 \cdot 2d)} = e^{-\Delta^2 d / 4}$

• Choose $d = 12 \ln n / \Delta^2$; then this is < $1/n^3$ as needed
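To get a feel for the numbers, the choice of d can be evaluated directly. The helper names below are illustrative and not from the slides; `failure_bound` simply evaluates the tail estimate $e^{-\Delta^2 d/4}$ used above.

```python
import math

def target_dimension(n, delta):
    """Smallest integer d with d >= 12 ln(n) / delta^2."""
    return math.ceil(12.0 * math.log(n) / delta ** 2)

def failure_bound(d, delta):
    """The tail estimate e^(-delta^2 d / 4) for a single vector."""
    return math.exp(-delta ** 2 * d / 4.0)

d = target_dimension(10000, 0.5)   # ten thousand points, delta = 0.5
per_vector = failure_bound(d, 0.5)
```

With n = 10000 and ∆ = 0.5 this gives d = 443, and the per-vector bound falls below $1/n^3 = 10^{-12}$. Note that the original dimension D does not enter the formula at all.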

Conclusion

• n points in D dimensions

– can be projected to $12 \ln n / \Delta^2$ dimensions

– all pairwise distances change by a factor of at most $(1 \pm \Delta)$

– with prob > $1 - 1/n$
