numerical linear algebra in the streaming model

Numerical Linear Algebra in the Streaming Model

Ken Clarkson - IBM

David Woodruff - IBM

The Problems

Given n x d matrix A and n x d’ matrix B, we want estimators for

• The matrix product A^T B

• The matrix X* minimizing ||AX-B||

– A slightly generalized version of linear regression

• Given integer k, the matrix A_k of rank k minimizing ||A-A_k||

• We consider the Frobenius matrix norm: square root of sum of squares

General Properties of Our Algorithms

• 1 pass over matrix entries, given in any order (allow multiple updates)

• Maintain compressed versions or “sketches” of matrices

• Do small work per entry to maintain the sketches

• Output result using the sketches

• Randomized approximation algorithms

• Since we minimize space complexity, we restrict matrix entries to be O(log(nc)) bits, or O(log(nd)) bits

Matrix Compression Methods

• In a line of similar efforts…

– Element-wise sampling [AM01], [AHK06]

– Row / column sampling: pick small random subset of the rows, columns, or both [DK01], [DKM04], [DMM08]

– Sketching / Random Projection: maintain a small number of random linear combinations of rows or columns [S06]

– Usually more than 1 pass

• Here: sketching

Outline

• Matrix Product

• Linear Regression

• Low-rank approximation

An Optimal Matrix Product Algorithm

• A and B have n rows, and a total of c columns, and we want to estimate A^T B, so that ||Est - A^T B|| ≤ ε ||A|| · ||B||

• Let S be an n x m sign (Rademacher) matrix

– Each entry is +1 or -1 with probability 1/2

– m small, set to O(ε^{-2} log(1/δ))

– Entries are O(log(1/δ))-wise independent

• Observation:

– E[A^T S S^T B/m] = A^T E[S S^T] B/m = A^T B

• Wouldn't it be nice if all the algorithm did was maintain S^T A and S^T B, and output A^T S S^T B/m?
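The estimator A^T S S^T B/m can be tried numerically. The following is a minimal sketch, assuming NumPy and fully independent random signs (the slides only require O(log(1/δ))-wise independence); the function name sketch_product and the test sizes are illustrative choices, not part of the talk.

```python
import numpy as np

def sketch_product(A, B, m, rng):
    """Estimate A^T B from the sketches S^T A and S^T B."""
    n = A.shape[0]
    # n x m matrix of +1/-1 signs (assumption: fully independent entries).
    S = rng.choice([-1.0, 1.0], size=(n, m))
    StA = S.T @ A            # m x (columns of A)
    StB = S.T @ B            # m x (columns of B)
    return StA.T @ StB / m   # A^T S S^T B / m

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 6))
B = rng.standard_normal((500, 4))
est = sketch_product(A, B, m=400, rng=rng)

err = np.linalg.norm(est - A.T @ B)                  # Frobenius norm of the error
rel = err / (np.linalg.norm(A) * np.linalg.norm(B))  # to be compared with epsilon
```

The value rel plays the role of ε in the guarantee: it should shrink roughly like m^{-1/2} as the sketch grows.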

An Optimal Matrix Product Algorithm

• This does work, and we are able to improve the previous dependence on m:

• New Tail Estimate: for δ, ε > 0, there is m = O(ε^{-2} log(1/δ)) so that

Pr[ ||A^T S S^T B/m - A^T B|| > ε ||A|| ||B|| ] ≤ δ

(again ||C|| = [Σ_{i,j} C_{i,j}^2]^{1/2})

• Follows from bounding O(log(1/δ))-th moment of ||A^T S S^T B/m - A^T B||

Efficiency

• Easy to maintain sketches given updates

– O(m) time/update, O(mc log(nc)) bits of space for S^T A and S^T B

– Improves Sarlos' algorithm by a log c factor

– Sarlos' algorithm based on JL Lemma

• JL preserves all entries of A^T B up to an additive error, whereas we only preserve overall error

– Can compute [A^T S][S^T B]/m via fast rectangular matrix multiplication
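The O(m) update cost comes from the fact that adding v to entry (i, j) of A changes only column j of S^T A, by v times row i of S. A minimal sketch of this, assuming NumPy; the class name SketchMaintainer is hypothetical, and S is stored explicitly here for simplicity, whereas a small-space implementation would generate its rows from a limited-independence hash family.

```python
import numpy as np

class SketchMaintainer:
    """Maintains S^T A under streaming updates A[i, j] += v."""

    def __init__(self, n, cols, m, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: S is stored explicitly; this is not space-efficient.
        self.S = rng.choice([-1.0, 1.0], size=(n, m))
        self.sketch = np.zeros((m, cols))

    def update(self, i, j, v):
        # Column j of S^T A changes by v * (row i of S): O(m) work.
        self.sketch[:, j] += v * self.S[i, :]

n, cols, m = 50, 3, 8
sk = SketchMaintainer(n, cols, m)
A = np.zeros((n, cols))
# Entries arrive in arbitrary order; entry (4, 0) is updated twice.
stream = [(4, 0, 2.0), (10, 2, -1.5), (4, 0, 0.5), (49, 1, 3.0)]
for i, j, v in stream:
    sk.update(i, j, v)
    A[i, j] += v          # kept only to check the sketch afterwards

matches = np.allclose(sk.sketch, sk.S.T @ A)
```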

Matrix Product Lower Bound

• Our algorithm is space-optimal for constant δ

– a new lower bound

• Reduction from a communication game

– Augmented Indexing, players Alice and Bob

– Alice has random x ∈ {0,1}^s

– Bob has random i ∈ {1, 2, …, s}

• also x_{i+1}, …, x_s

• Alice sends Bob one message

• Bob should output x_i with probability at least 2/3

• Theorem [MNSW]: Message must be Ω(s) bits on average

Lower Bound Proof

• Set s := Θ(c ε^{-2} log(cn))

• Alice makes matrix U

– Uses x_1 … x_s

• Bob makes matrix U' and B

– Uses i and x_{i+1}, …, x_s

• Alg input will be A := U+U' and B

– A and B are n x c/2

• Alice:

– Runs streaming matrix product Alg on U

– Sends Alg state to Bob

– Bob continues Alg with A := U + U' and B

• A^T B determines x_i with probability at least 2/3

– By choice of U, U', B

– Solving Augmented Indexing

• So space of Alg must be Ω(s) = Ω(c ε^{-2} log(cn)) bits

Lower Bound Details

• U = U^(1); U^(2); …; U^(log(cn)); 0s

– Each U^(k) is a Θ(ε^{-2}) x c/2 submatrix with entries in {-10^k, 10^k}

– U^(k)_{i,j} = 10^k if matched entry of x is 0, else U^(k)_{i,j} = -10^k

• Bob's index i corresponds to U^(k*)_{i*,j*}

• U' is such that A = U+U' = U^(1); U^(2); …; U^(k*); 0s

– U' is determined from x_{i+1}, …, x_s

• A^T B is i*-th row of U^(k*)

• ||A|| ≈ ||U^(k*)|| since the entries of A are geometrically increasing

• ε^2 ||A||^2 · ||B||^2, the squared error, is small, so most entries of the approximation to A^T B have the correct sign
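The claim ||A|| ≈ ||U^(k*)|| can be checked with a geometric sum. As a worked sketch, assume every block U^(k) has the same number of rows r = Θ(ε^{-2}) (the symbol r is introduced here only for illustration), so that each block has rc/2 entries of magnitude 10^k:

```latex
\begin{align*}
\|A\|^2 &= \sum_{k=1}^{k^*} \|U^{(k)}\|^2
         = \frac{rc}{2} \sum_{k=1}^{k^*} 100^k
         = \frac{rc}{2} \cdot \frac{100^{k^*+1} - 100}{99} \\
        &< \frac{100}{99} \cdot \frac{rc}{2} \cdot 100^{k^*}
         = \frac{100}{99} \, \|U^{(k^*)}\|^2 .
\end{align*}
```

Under that assumption, the last block accounts for all but about 1% of the squared norm of A.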

Outline

• Matrix Product

• Linear Regression


Linear Regression

• The problem: min_X ||AX-B||

• X* minimizing this has X* = A^- B, where A^- is the pseudo-inverse of A

• Every matrix A = UΣV^T using singular value decomposition (SVD)

– If A is n x d of rank k, then

• U is n x k with orthonormal columns

• Σ is k x k diagonal matrix, diagonal is positive

• V is d x k with orthonormal columns

• A^- = VΣ^{-1}U^T

• Normal Equations: A^T A X = A^T B for optimal X
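The pseudo-inverse formula and the normal equations can be verified on a small instance. A minimal sketch, assuming NumPy and a random Gaussian A, which has full column rank with probability 1 (so k = d here); the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 4))
B = rng.standard_normal((30, 2))

# Thin SVD: U is n x d, sigma holds the singular values, Vt is V^T.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / sigma) @ U.T   # A^- = V Sigma^{-1} U^T
X_star = A_pinv @ B                          # X* = A^- B

# X* satisfies the normal equations and agrees with a least-squares solver.
normal_eq_ok = np.allclose(A.T @ A @ X_star, A.T @ B)
matches_lstsq = np.allclose(X_star, np.linalg.lstsq(A, B, rcond=None)[0])
```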

Linear Regression

• Let S be an n x m sign matrix, m = O(d ε^{-1} log(1/δ))

• The algorithm is

– Maintain S^T A and S^T B

– Return X' solving min_X ||S^T(AX-B)||

– Space is O(d^2 ε^{-1} log(1/δ)) words

– Improves Sarlos' space by log c factor

– Space is optimal via new lower bound

• Main claim: With probability at least 1-δ, ||AX'-B|| ≤ (1+ε)||AX*-B||

– That is, relative error for X' is small
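The sketch-and-solve algorithm above can be illustrated as follows. This is a minimal sketch, assuming NumPy, fully independent signs, and hypothetical sizes n, d, m chosen only so that the example runs quickly; it is not tuned to the bound m = O(d ε^{-1} log(1/δ)).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 2000, 5, 400
A = rng.standard_normal((n, d))
B = A @ rng.standard_normal((d, 2)) + rng.standard_normal((n, 2))

S = rng.choice([-1.0, 1.0], size=(n, m))   # assumption: fully independent signs
StA, StB = S.T @ A, S.T @ B                # the only state kept while streaming

X_sketch = np.linalg.lstsq(StA, StB, rcond=None)[0]  # X' = argmin ||S^T(AX-B)||
X_star = np.linalg.lstsq(A, B, rcond=None)[0]        # X* = argmin ||AX-B||

# Relative quality of X': the main claim says this is at most 1 + epsilon.
ratio = np.linalg.norm(A @ X_sketch - B) / np.linalg.norm(A @ X_star - B)
```

The ratio is never below 1, since X* is optimal for the unsketched problem.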

Regression Analysis

• Why should X' solving min_X ||S^T(AX-B)|| be good?

• S^T approximately preserves AX-B for fixed X

• If this worked for all X, we're done

• S^T must preserve norms even for X', chosen using S

• First reduce to showing that ||A(X*-X')|| is small

• Use normal equation A^T A X* = A^T B

– Implies ||AX'-B||^2 = ||AX*-B||^2 + ||A(X'-X*)||^2

• Bounding ||A(X'-X*)||^2 equivalent to bounding ||U^T A(X'-X*)||^2, where A = UΣV^T, from SVD, and U is an orthonormal basis of the columnspace of A
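The Pythagorean identity used above follows by expanding the squared Frobenius norm and applying the normal equations to the cross term. A short worked derivation:

```latex
\begin{align*}
\|AX' - B\|^2 &= \|(AX^* - B) + A(X' - X^*)\|^2 \\
  &= \|AX^* - B\|^2 + \|A(X' - X^*)\|^2
     + 2\,\mathrm{tr}\!\left( (X' - X^*)^T A^T (AX^* - B) \right) \\
  &= \|AX^* - B\|^2 + \|A(X' - X^*)\|^2,
     \qquad \text{since } A^T (AX^* - B) = 0 .
\end{align*}
```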

Regression Analysis Continued

• Bounding ||β||^2 := ||U^T A(X'-X*)||^2

– ||β|| ≤ ||U^T S S^T U β/m|| + ||U^T S S^T U β/m - β||

• Normal equations in sketch space imply (S^T A)^T (S^T A) X' = (S^T A)^T (S^T B)

• U^T S S^T U β = U^T S S^T A(X'-X*) = U^T S S^T A(X'-X*) + U^T S S^T (B-AX') = U^T S S^T (B-AX*)

• ||U^T S S^T U β/m|| = ||U^T S S^T (B-AX*)/m|| ≤ (ε/k)^{1/2} ||U|| · ||B-AX*|| (new tail estimate) = ε^{1/2} ||B-AX*||

Regression Analysis Continued

• Hence, for β := U^T A(X'-X*)

– ||β|| ≤ ||U^T S S^T U β/m|| + ||U^T S S^T U β/m - β|| ≤ ε^{1/2} ||B-AX*|| + ||U^T S S^T U β/m - β||

• Recall the spectral norm: ||A||_2 = sup_x ||Ax||/||x||

• Implies ||CD|| ≤ ||C||_2 ||D||

• ||U^T S S^T U β/m - β|| ≤ ||U^T S S^T U/m - I||_2 ||β||

• Subspace JL: for m = Ω(k log(1/δ)), S^T approximately preserves lengths of all vectors in a k-space

• ||U^T S S^T U β/m - β|| ≤ ||β||/2

• ||β|| ≤ 2ε^{1/2} ||AX*-B||

• ||AX'-B||^2 = ||AX*-B||^2 + ||β||^2 ≤ ||AX*-B||^2 + 4ε||AX*-B||^2
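The subspace JL step bounds the spectral norm ||U^T S S^T U/m - I||_2. A minimal numerical sketch of that quantity, assuming NumPy, fully independent signs, and arbitrary illustrative sizes n, k, m:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, m = 2000, 5, 800
# U: orthonormal basis of a random k-dimensional subspace of R^n.
U, _ = np.linalg.qr(rng.standard_normal((n, k)))
S = rng.choice([-1.0, 1.0], size=(n, m))   # assumption: fully independent signs

SU = S.T @ U                               # m x k sketch of the basis
M = SU.T @ SU / m - np.eye(k)              # U^T S S^T U / m - I
distortion = np.linalg.norm(M, 2)          # spectral norm

# If distortion <= 1/2, then ||U^T S S^T U beta/m - beta|| <= ||beta||/2
# for every vector beta, which is the step used in the analysis.
```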

Regression Lower Bound

• Tight Ω(d^2 ε^{-1} log(nd)) space lower bound

• Again a reduction from augmented indexing

• This time more complicated

• Embed log(nd) ε^{-1} independent regression sub-problems into hard instance

– Uses deletions and geometrically growing property, as in matrix product lower bound

• Choose the entries of A and b so that the algorithm's output x encodes some entries of A


• Lower bound of Ω(d^2) already tricky because of bit complexity

• Natural approach:

– Alice has random d x d sign matrix A^{-1}

– b is a standard basis vector e_i

– Alice computes A = (A^{-1})^{-1} and puts it into the stream. Solution x to min_x ||Ax-b|| is i-th column of A^{-1}

– Bob can isolate entries of A^{-1}, solving indexing

• Wrong: A has entries that can be exponentially small!


• We design A and b together (Aug. Index)

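[ 1  A_{1,2}  A_{1,3}  A_{1,4}  A_{1,5} ]   [ x_1 ]   [ 0       ]
[ 0  1        A_{2,3}  A_{2,4}  A_{2,5} ]   [ x_2 ]   [ A_{2,4} ]
[ 0  0        1        A_{3,4}  A_{3,5} ] · [ x_3 ] = [ A_{3,4} ]
[ 0  0        0        1        A_{4,5} ]   [ x_4 ]   [ 1       ]
[ 0  0        0        0        1       ]   [ x_5 ]   [ 0       ]

x_5 = 0, x_4 = 1, x_3 = 0, x_2 = 0, x_1 = -A_{1,4}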
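The solution above can be reproduced by back-substitution. A minimal sketch, assuming NumPy and random sign entries above the diagonal (an illustrative choice); indices in the code are 0-based, so column 4 of the slide is index 3.

```python
import numpy as np

rng = np.random.default_rng(4)
d, col = 5, 3    # col = 3 is the 4th column in the slide's 1-based numbering
# Unit upper-triangular A with random sign entries above the diagonal.
A = np.triu(rng.choice([-1.0, 1.0], size=(d, d)), k=1) + np.eye(d)

# b = (0, A_{2,4}, A_{3,4}, 1, 0) in the slide's notation.
b = np.zeros(d)
b[1:col] = A[1:col, col]
b[col] = 1.0

# Back-substitution, from the last row up (the diagonal of A is all 1s).
x = np.zeros(d)
for r in range(d - 1, -1, -1):
    x[r] = b[r] - A[r, r + 1:] @ x[r + 1:]

# x_1 reveals the entry A_{1,4} up to sign.
recovered = -x[0]
```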

Outline

• Matrix Product

• Regression


Best Low-Rank Approximation

• For any matrix A and integer k, there is a matrix A_k of rank k that is closest to A among all matrices of rank k.

• Since rank of A_k is k, it is the product CD^T of two k-column matrices C and D

– A_k can be found from the SVD (singular value decomposition), taking C and D to be U and VΣ_k restricted to the top k singular values

– This is a good compression of A

– LSI, PCA, recommendation systems, clustering
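The factorization A_k = CD^T can be computed directly from a truncated SVD. A minimal sketch, assuming NumPy; the helper name best_rank_k is hypothetical. It also checks the standard fact that the Frobenius error of A_k equals the norm of the discarded singular values.

```python
import numpy as np

def best_rank_k(A, k):
    """Return factors C (n x k) and D (d x k) with A_k = C @ D.T."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    C = U[:, :k]                    # top-k left singular vectors
    D = Vt[:k, :].T * sigma[:k]     # top-k right singular vectors, scaled
    return C, D

rng = np.random.default_rng(5)
A = rng.standard_normal((40, 12))
k = 3
C, D = best_rank_k(A, k)
A_k = C @ D.T

sigma = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k)               # Frobenius error of A_k
tail = np.sqrt(np.sum(sigma[k:] ** 2))      # norm of the discarded singular values
```

Storing C and D takes (n + d)k numbers instead of nd, which is the compression the slide refers to.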

Best Low-Rank Approximation

• Previously, nothing was known for 1-pass low-rank approximation and relative error

– Even for k = 1, best upper bound O(nd log(nd)) bits

– Problem 28 of [Mut]: can one get sublinear space?

• We get 1-pass and O(k ε^{-2} (n + d ε^{-2}) log(nd)) space

• Update time is O(k ε^{-4}), so total work is O(N k ε^{-4}), where N is the number of non-zero entries of A

• New space lower bound shows optimal up to 1/ε

Best Low-Rank Approximation and STA

• The sketch S^T A holds information about A

• In particular, there is a rank k matrix A_k' in the rowspace of S^T A nearly as close to A as the closest rank k matrix A_k

– The rowspace of S^T A is the set of linear combinations of its rows

• That is, ||A-A_k'|| ≤ (1+ε)||A-A_k||

• Why is there such an A_k'?

Low-Rank Approximation via Regression

• Apply the regression results with A ← A_k, B ← A

• The X' minimizing ||S^T(A_k X-A)|| has

||A_k X'-A|| ≤ (1+ε)||A_k X*-A||

• But here X* = I, and X' = (S^T A_k)^- S^T A

• So the matrix A_k X' = A_k (S^T A_k)^- S^T A:

– Has rank k

– In the rowspace of S^T A

– Within 1+ε of smallest distance of any rank-k matrix
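The existence argument can be checked numerically by forming A_k (S^T A_k)^- S^T A explicitly. A minimal sketch, assuming NumPy, fully independent signs, and an illustrative near-rank-k test matrix; the explicit rcond cutoff is a numerical safeguard because S^T A_k is exactly rank-deficient.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k, m = 300, 40, 3, 150
# Near rank-k test matrix: rank-k signal plus small noise.
A = (rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
     + 0.1 * rng.standard_normal((n, d)))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * sigma[:k]) @ Vt[:k, :]      # best rank-k approximation

S = rng.choice([-1.0, 1.0], size=(n, m))      # assumption: independent signs
StA = S.T @ A
# X' = (S^T A_k)^- S^T A; rcond discards the numerically zero singular values.
X_prime = np.linalg.pinv(S.T @ A_k, rcond=1e-8) @ StA
A_k_prime = A_k @ X_prime                     # rows lie in the rowspace of S^T A

rank = np.linalg.matrix_rank(A_k_prime, tol=1e-8)
ratio = np.linalg.norm(A - A_k_prime) / np.linalg.norm(A - A_k)
```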

Low-Rank Approximation in 2 Passes

• Can't use A_k (S^T A_k)^- S^T A without finding A_k

• Instead: maintain S^T A

– Can show that if G^T has orthonormal rows, then the best rank-k approximation to A in the rowspace of G^T is A'_k G^T, where A'_k is the best rank-k approximation to AG

– After 1st pass, compute orthonormal basis G^T for rowspace of S^T A

– In 2nd pass, maintain AG

– Afterwards, compute A'_k and A'_k G^T
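The two passes can be sketched as follows, assuming NumPy, fully independent signs, and hypothetical sizes; the helper truncate and the use of a QR factorization to orthonormalize the rows of S^T A are illustrative choices, and A is held in memory here only to simulate the two passes.

```python
import numpy as np

def truncate(M, k):
    """Best rank-k approximation of M via the SVD."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * sigma[:k]) @ Vt[:k, :]

rng = np.random.default_rng(7)
n, d, k, m = 300, 60, 3, 30
A = (rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
     + 0.1 * rng.standard_normal((n, d)))

# Pass 1: maintain S^T A, then orthonormalize its rows.
S = rng.choice([-1.0, 1.0], size=(n, m))   # assumption: independent signs
StA = S.T @ A                              # m x d
G, _ = np.linalg.qr(StA.T)                 # d x m; columns span rowspace of S^T A

# Pass 2: maintain AG, then truncate and map back.
AG = A @ G                                 # n x m
approx = truncate(AG, k) @ G.T             # A'_k G^T

ratio = np.linalg.norm(A - approx) / np.linalg.norm(A - truncate(A, k))
```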

Low-Rank Approximation in 2 Passes

• A'_k is best rank-k approximation to AG

• For any rank-k matrix Z,

||AGG^T - A'_k G^T|| = ||AG - A'_k|| ≤ ||AG - Z|| ≤ ||AGG^T - ZG^T||

• For all Y: (AGG^T - YG^T)(A - AGG^T)^T = 0, so we can apply Pythagorean Theorem twice:

||A - A'_k G^T||^2 = ||A - AGG^T||^2 + ||AGG^T - A'_k G^T||^2 ≤ ||A - AGG^T||^2 + ||AGG^T - ZG^T||^2 = ||A - ZG^T||^2

1 Pass Algorithm

• With high probability, (*) ||AX'-B|| ≤ (1+ε)||AX*-B||, where

– X* minimizes ||AX-B||

– X' minimizes ||S^T AX - S^T B||

• Apply (*) with A ← AR and B ← A and X' minimizing ||S^T(ARX-A)||

• So X' = (S^T AR)^- S^T A has ||ARX'-A|| ≤ (1+ε) min_X ||ARX-A||

• Columnspace of AR contains a (1+ε)-approximation to A_k

• So, ||ARX'-A|| ≤ (1+ε) min_X ||ARX-A|| ≤ (1+ε)^2 ||A-A_k||

• Key idea: ARX' = AR(S^T AR)^- S^T A is

– easy to compute in 1-pass with small space and fast update time

– behaves like A (similar to SVD)

– use it instead of A in our 2-pass algorithm!

1 Pass Algorithm

• Algorithm:

– Maintain AR and S^T A

– Compute AR(S^T AR)^- S^T A

– Let G^T be an orthonormal basis for the rowspace of S^T A, as before

– Output the best rank-k approximation to AR(S^T AR)^- S^T A in the rowspace of S^T A

• Same as 2-pass algorithm except we don't need a second pass to project A onto the rowspace of S^T A

• Analysis is similar to that for regression
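The one-pass algorithm can be sketched along the same lines. This is a minimal illustration, assuming NumPy; the slides do not define R, so it is taken here to be a d x m_R random sign matrix, and the sizes m_S and m_R are hypothetical. Both AR and S^T A are linear in A, so each can be maintained under entry updates.

```python
import numpy as np

def truncate(M, k):
    """Best rank-k approximation of M via the SVD."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * sigma[:k]) @ Vt[:k, :]

rng = np.random.default_rng(8)
n, d, k = 400, 300, 3
m_S, m_R = 100, 20                       # hypothetical sketch sizes
A = (rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
     + 0.1 * rng.standard_normal((n, d)))

# Assumption: S is an n x m_S sign matrix and R a d x m_R sign matrix.
S = rng.choice([-1.0, 1.0], size=(n, m_S))
R = rng.choice([-1.0, 1.0], size=(d, m_R))

# State maintained during the single pass.
AR = A @ R                               # n x m_R
StA = S.T @ A                            # m_S x d

# Post-processing, with no further access to A.
StAR = StA @ R                           # S^T A R, m_S x m_R
X_prime = np.linalg.lstsq(StAR, StA, rcond=None)[0]   # (S^T A R)^- S^T A
A_tilde = AR @ X_prime                   # AR (S^T A R)^- S^T A
G, _ = np.linalg.qr(StA.T)               # orthonormal basis of rowspace of S^T A
approx = truncate(A_tilde @ G, k) @ G.T  # best rank-k approx. in that rowspace

# A is used below only to evaluate the quality of the output.
ratio = np.linalg.norm(A - approx) / np.linalg.norm(A - truncate(A, k))
```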

A Lower Bound

[Figure: Alice's binary string x determines the matrix A; Bob holds the index i and x_{i+1}, x_{i+2}, …. The blocks of A have entries ±10, ±100, ±1000, ±10000, …, and some blocks are all 0s. Each block has n-k rows and k ε^{-1} columns, with k rows of 0s below.]

Error now dominated by block of interest

Bob also inserts a k x k identity submatrix into block of interest

Lower Bound Details

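Block of interest: [Figure: a block with n-k rows and k ε^{-1} columns containing one marked entry (*); the k rows below are 0s except for the submatrix P*I_k.]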

Bob inserts k x k identity submatrix, scaled by large value P

Show any rank-k approximation must err on all of shaded region

So good rank-k approximation likely has correct sign on Bob’s entry

Concluding Remarks

• Space bounds are tight for product, regression

– Sharpen prior upper bounds

– Prove optimal lower bounds

• Space bounds off by a factor of ε^{-1} for low-rank approximation

– First sub-linear (and near-optimal) 1-pass algorithm

– We have better upper bounds for restricted cases

• Improve the dependence on ε in the update time

• Lower bounds for multi-pass algorithms?
