Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures (IEEE BigData)
TRANSCRIPT
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures
Austin Benson, ICME, Stanford University
David Gleich (Purdue) and Jim Demmel (UC Berkeley)
Figure: title-slide diagram of A (m × n) = Q (m × n) R (n × n).

IEEE BigData, October 8, 2013
Contributions 2
- Numerically stable and scalable algorithm for QR and SVD of tall-and-skinny matrices in MapReduce
- Performance and stability tradeoffs of several methods
- Performance model: prediction within a factor of two
- Code: https://github.com/arbenson/mrtsqr
MapReduce overview 3
Two functions that operate on key-value pairs:

map: (key, value) → (key, value)
reduce: (key, ⟨value1, ..., valuen⟩) → (key, value)

A shuffle stage between map and reduce sorts values by key.
MapReduce overview 4
The programmer implements:
- map(key, value)
- reduce(key, ⟨value1, ..., valuen⟩)
Handled by MapReduce framework, e.g., Hadoop:
- shuffle
- load balancing
- reading and writing data
- data serialization
- fault tolerance
- ...
MapReduce Example: ColorCount 5
(key, value) input is (image id, image)
Figure: each mapper emits (color, 1) for every pixel in its image; the shuffle groups the pairs by color; each reducer sums the counts for one color.
def ColorCountMap(key, val):
    for pixel in val:
        yield (pixel, 1)

def ColorCountReduce(key, vals):
    total = sum(vals)
    yield (key, total)
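The two functions above can be exercised without a Hadoop cluster by simulating the shuffle in plain Python. The driver below (`run_mapreduce`, a toy helper written for this sketch, not part of mrtsqr) sorts and groups intermediate pairs by key, which is exactly what the shuffle stage does:

```python
from itertools import groupby

def color_count_map(key, val):
    # Emit (color, 1) for every pixel in the image.
    for pixel in val:
        yield (pixel, 1)

def color_count_reduce(key, vals):
    # Sum the per-color counts.
    yield (key, sum(vals))

def run_mapreduce(records, mapper, reducer):
    # Map: apply the mapper to every (key, value) record.
    mapped = [kv for k, v in records for kv in mapper(k, v)]
    # Shuffle: sort and group the intermediate pairs by key.
    mapped.sort(key=lambda kv: kv[0])
    groups = ((k, [v for _, v in g])
              for k, g in groupby(mapped, key=lambda kv: kv[0]))
    # Reduce: apply the reducer to each (key, value list) group.
    return [kv for k, vals in groups for kv in reducer(k, vals)]

# Two toy "images": lists of pixel colors keyed by image id.
images = [(1, ["red", "blue", "red"]), (2, ["blue", "red", "blue"])]
print(run_mapreduce(images, color_count_map, color_count_reduce))
# [('blue', 3), ('red', 3)]
```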
Why MapReduce? (for scientists) 6
MapReduce is restrictive! Why Bother?
- Easy
- load balancing
- structured data I/O
- fault tolerance
- cheap clusters with large data storage
Hadoop may not be the best option...
Generate lots of data on supercomputer
Post-process and analyze on MapReduce cluster
Tall-and-skinny matrices 7
What are tall-and-skinny matrices? m ≫ n

Figure: A is m × n with m much larger than n.

Examples: rows are data samples; blocks of A are images from a video; Krylov subspaces; unrolled tensors
Matrix representation 8
We have matrices, so what are the key-value pairs?
A = [ 1.0  0.0 ]
    [ 2.4  3.7 ]   →   (1, [1.0, 0.0])
    [ 0.8  4.2 ]       (2, [2.4, 3.7])
    [ 9.0  9.0 ]       (3, [0.8, 4.2])
                       (4, [9.0, 9.0])

(key, value) → (row index, row)
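The row-wise representation above is a short loop in code. The helper below (`matrix_to_keyvalues`, a name invented for this sketch; the real mrtsqr input uses Hadoop's serialized record formats rather than Python tuples) produces the (row index, row) pairs:

```python
import numpy as np

def matrix_to_keyvalues(A):
    # One (row index, row) pair per matrix row; 1-based indices to
    # match the example above.
    return [(i + 1, row.tolist()) for i, row in enumerate(np.asarray(A))]

A = np.array([[1.0, 0.0], [2.4, 3.7], [0.8, 4.2], [9.0, 9.0]])
for kv in matrix_to_keyvalues(A):
    print(kv)
```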
Matrix representation: an example 9
Scientific example: (x, y, z) coordinates and model number:
((47570,103.429767811242,0,-16.525510963787,iDV7), [0.00019924
-4.706066e-05 2.875293979e-05 2.456653e-05 -8.436627e-06 -1.508808e-05
3.731976e-06 -1.048795e-05 5.229153e-06 6.323812e-06])
Figure: Aircraft simulation data. Aero/Astro Department, Stanford
Tall-and-skinny matrices 10
Tall-and-skinny: m ≫ n

Figure: A is m × n with m much larger than n.

Slightly more rigorous definition: it is "cheap" to pass O(n^2) data to all processors.
Quick QR and SVD review 11
Figure: the QR factorization A = QR and the SVD A = U Σ V^T. Q, U, and V are orthogonal matrices. R is upper triangular and Σ is diagonal with decreasing, nonnegative entries.
Tall-and-skinny QR 12
Figure: A (m × n) = Q (m × n) R (n × n).

Tall-and-skinny (TS): m ≫ n. Q^T Q = I.
TS-QR → TS-SVD 13
A = QR, and R = U_R Σ V^T, so A = (Q U_R) Σ V^T = U Σ V^T with U = Q U_R.

R is small, so computing its SVD is cheap.
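The TSQR → TSSVD step can be sketched serially in a few lines of numpy; here an in-core QR stands in for the distributed TSQR, so this illustrates the algebra rather than the MapReduce implementation:

```python
import numpy as np

def tsqr_svd(A):
    # A serial QR stands in for the MapReduce TSQR here.
    Q, R = np.linalg.qr(A)
    # R is only n x n, so its SVD costs O(n^3) locally.
    U_R, S, Vt = np.linalg.svd(R)
    # Combine: A = Q R = Q (U_R S Vt) = (Q U_R) S Vt = U S Vt.
    return Q @ U_R, S, Vt

rng = np.random.default_rng(0)
A = rng.random((1000, 10))
U, S, Vt = tsqr_svd(A)
# U has orthonormal columns and U diag(S) Vt reconstructs A.
```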
Why Tall-and-skinny QR and SVD? 14
1. Regression with many samples
2. Principal Component Analysis (PCA)
3. Model reduction

Figure: Dynamic mode decomposition of the screech of a jet (pressure, dilation, jet engine). Joe Nichols, University of Minnesota.
Cholesky QR 15
Cholesky QR
A^T A = (QR)^T (QR) = R^T Q^T Q R = R^T R
- Computing A^T A in MapReduce is easy and well-studied.
- We call this Cholesky QR.
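A serial numpy sketch of Cholesky QR follows; in the MapReduce version, A^T A is accumulated as a sum over row blocks across mappers, but the algebra is the same:

```python
import numpy as np

def cholesky_qr(A):
    # The Gram matrix A^T A is a sum over row blocks, which is the
    # easy, well-studied MapReduce step.
    G = A.T @ A
    # A^T A = R^T R, so the upper-triangular Cholesky factor is R.
    R = np.linalg.cholesky(G).T
    # Q comes from local matrix multiplies: Q = A R^{-1}.
    Q = A @ np.linalg.inv(R)
    return Q, R

rng = np.random.default_rng(0)
A = rng.random((500, 5))
Q, R = cholesky_qr(A)
# Q R reconstructs A; Q^T Q is near I for well-conditioned A.
```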
Cholesky QR: Getting Q 16
Q = A R^{-1}.

Figure: R^{-1} is distributed to every mapper; each mapper computes Q_i = A_i R^{-1} with a local matrix multiply and emits Q_i.
Stability problems 17
- Can get Q = A R^{-1}
- Problem: columns can be far from orthogonal, and forming A^T A squares the condition number (data later)
- Idea: use a more advanced algorithm.
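The loss of orthogonality is easy to reproduce. The sketch below builds a synthetic matrix with condition number around 1e6 (an illustrative construction, not the paper's test matrices) and compares Cholesky QR against a Householder QR:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 10
# Synthetic A with condition number about 1e6.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -6, n)) @ V.T

# Cholesky QR: forming A^T A squares the condition number to ~1e12,
# so the computed Q drifts far from orthogonality.
R = np.linalg.cholesky(A.T @ A).T
Q = A @ np.linalg.inv(R)
err_chol = np.linalg.norm(Q.T @ Q - np.eye(n))

# Householder QR of the same matrix stays orthogonal.
Qh, _ = np.linalg.qr(A)
err_house = np.linalg.norm(Qh.T @ Qh - np.eye(n))
print(err_chol, err_house)  # err_chol is many orders of magnitude larger
```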
Communication-avoiding TSQR 18
A = [A1; A2; A3; A4]                              (8n × n)
  = diag(Q1, Q2, Q3, Q4) [R1; R2; R3; R4]         (8n × 4n)(4n × n)
  = diag(Q1, Q2, Q3, Q4) Q̃ R                      (8n × 4n)(4n × n)(n × n)

where Q = diag(Q1, Q2, Q3, Q4) Q̃ is the 8n × n orthogonal factor.

Demmel et al. 2008
Communication-avoiding TSQR 19
A = [A1; A2; A3; A4]                              (8n × n)
  = diag(Q1, Q2, Q3, Q4) [R1; R2; R3; R4]         (8n × 4n)(4n × n)

Each Ai = Qi Ri can be computed in parallel. If we only need R, then we can throw out the intermediate Qi factors.
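The R-only reduction takes only a few lines serially. In the sketch below (numpy; four in-memory blocks stand in for mappers), the local Q_i factors are thrown away exactly as described:

```python
import numpy as np

def tsqr_r_only(A, nblocks=4):
    # Each block is one mapper's local QR; the local Q_i are discarded.
    Rs = [np.linalg.qr(Ai)[1] for Ai in np.array_split(A, nblocks)]
    # Stack the small R_i factors and factor once more for the final R.
    _, R = np.linalg.qr(np.vstack(Rs))
    return R

rng = np.random.default_rng(0)
A = rng.random((800, 6))
R = tsqr_r_only(A)
# R^T R = A^T A, the defining property of the R factor.
```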
MapReduce TSQR 20
Figure: each mapper runs a local TSQR on its rows of A and emits an Ri factor; a shuffle collects these rows into S(1), and reducers run local TSQRs to emit the R2,j factors; after an identity map and a second shuffle, a final reducer runs a local TSQR on S(2) to emit R. S(1) is the matrix consisting of the rows of all of the Ri factors. Similarly, S(2) consists of all of the rows of the R2,j factors.
MapReduce TSQR: Getting Q 21
- Again: have R, want Q
- A = QR → Q = A R^{-1}
- We call this method Indirect TSQR.
- Problem: Q can be far from orthogonal (again).
Indirect TSQR: Iterative Refinement 22
Iterative refinement: repeat TSQR for a more orthogonal Q
Figure: the first pass computes Q = A R^{-1} as in Cholesky QR (R^{-1} distributed, local matmuls); a TSQR of the computed Q produces R1, and a second pass of local matmuls by R1^{-1} yields a more orthogonal Q. The second pass is the iterative refinement step.
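One refinement step can be sketched serially (numpy; each R^{-1} multiply is a per-block local matmul in the MapReduce version, and `indirect_tsqr_refined` is a name invented for this sketch):

```python
import numpy as np

def indirect_tsqr_refined(A):
    # First pass: Q = A R^{-1}, which may be far from orthogonal.
    R = np.linalg.cholesky(A.T @ A).T
    Q = A @ np.linalg.inv(R)
    # Refinement: factor the computed Q again (TSQR in MapReduce),
    # then multiply by R1^{-1} locally; update R to R1 R.
    _, R1 = np.linalg.qr(Q)
    return Q @ np.linalg.inv(R1), R1 @ R

rng = np.random.default_rng(0)
A = rng.random((600, 8))
Q, R = indirect_tsqr_refined(A)
```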
Indirect TSQR: a randomized approach 23
- Idea: take a small sample of rows of A and form Rs
- Refinement step: Qs = A Rs^{-1}; a TSQR of Qs gives R1; Q = Qs R1^{-1}
- R = R1 Rs, and Q^T Q ≈ I even for ill-conditioned A
- Theory on why this works: need ≈ 100 n log n rows [Mahoney 2011], [Avron, Maymounkov, and Toledo 2010], [Ipsen and Wentworth 2012]

We call this Pseudo-Iterative Refinement
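A serial sketch of the sampling idea follows (numpy; `pseudo_ir_qr` is a name invented here, and uniform row sampling with the 100 n log n sample size is one simple reading of the cited theory, not necessarily the exact scheme in the implementation):

```python
import numpy as np

def pseudo_ir_qr(A, seed=0):
    m, n = A.shape
    rng = np.random.default_rng(seed)
    # Sample ~100 n log n rows of A and factor the sample to get R_s.
    nsample = min(m, int(100 * n * np.log(n)))
    Rs = np.linalg.qr(A[rng.choice(m, size=nsample, replace=False)])[1]
    # Q_s = A R_s^{-1} is only roughly orthogonal; one refinement
    # (TSQR of Q_s, then a local multiply by R_1^{-1}) fixes it.
    Qs = A @ np.linalg.inv(Rs)
    _, R1 = np.linalg.qr(Qs)
    return Qs @ np.linalg.inv(R1), R1 @ Rs

rng = np.random.default_rng(1)
A = rng.random((5000, 8))
Q, R = pseudo_ir_qr(A)
```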
Pseudo-Iterative Refinement 24
Figure: a local TSQR of a sampled block forms Rs; Rs^{-1} is distributed and each mapper computes Qs,i = Ai Rs^{-1} (local matmul); a TSQR of Qs produces R1, and a final pass of local matmuls by R1^{-1} yields Q.

(In the implementation, the A Rs^{-1} multiply and the TSQR are combined in one pass.)
Direct TSQR 25
Why is computing a truly orthogonal Q difficult in MapReduce?

- Orthogonality is a global property, but we compute locally.
- Can only label data via keys and file names.
Communication-avoiding TSQR 26
A = [A1; A2; A3; A4]                                      (8n × n)
  = diag(Q1, Q2, Q3, Q4) [R1; R2; R3; R4]                 (8n × 4n)(4n × n)
  = diag(Q1, Q2, Q3, Q4) [Q1,2; Q2,2; Q3,2; Q4,2] R       (8n × 4n)(4n × n)(n × n)
  = [Q1 Q1,2; Q2 Q2,2; Q3 Q3,2; Q4 Q4,2] R                (8n × n)(n × n)
  = QR
Gathering Q 27
[R1; R2; R3; R4]  =  [Q1,2; Q2,2; Q3,2; Q4,2] R
(n·#(mappers) × n)   (n·#(mappers) × n)(n × n)

- Idea: compute this QR (n·#(mappers) rows) in serial.
- Idea: pass the Qi,2 (n rows each) back in a second pass to reconstruct Q.
- We call this Direct TSQR
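The three Direct TSQR steps can be sketched serially (numpy; a Python list stands in for the keys and file names that route each Q_i to its Q_{i,2} in MapReduce, and `direct_tsqr` is a name invented for this sketch):

```python
import numpy as np

def direct_tsqr(A, nblocks=4):
    n = A.shape[1]
    # Step 1 (map): local QR of each row block; keep both factors.
    Qs, Rs = zip(*(np.linalg.qr(Ai) for Ai in np.array_split(A, nblocks)))
    # Step 2 (one reducer): QR of the stacked R_i; slice the result
    # into one n x n block Q_{i,2} per mapper.
    Q2, R = np.linalg.qr(np.vstack(Rs))
    Q2_blocks = [Q2[i * n:(i + 1) * n] for i in range(nblocks)]
    # Step 3 (map): each mapper forms its rows of Q as Q_i Q_{i,2}.
    Q = np.vstack([Qi @ Qi2 for Qi, Qi2 in zip(Qs, Q2_blocks)])
    return Q, R

rng = np.random.default_rng(0)
A = rng.random((400, 5))
Q, R = direct_tsqr(A)
# Q is orthogonal to machine precision, independent of cond(A).
```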
Direct TSQR: Steps 1 and 2 28
Figure: First step: each mapper computes a local QR of its block Ai and emits both Qi and Ri. Second step: a shuffle gathers R1, ..., R4 to a single reducer, which factors the stacked Ri into [Q1,2; ...; Q4,2] R and emits the Qi,2 blocks and R.
Direct TSQR: Step 3 29
Figure: Third step: the Qi,2 blocks are distributed back to the mappers; each mapper multiplies its stored Qi by Qi,2 and emits its rows of Q.
Stability 30
Figure: numerical stability on 10,000 × 10 matrices. x-axis: κ2(A) from 10^0 to 10^16; y-axis: ‖Q^T Q − I‖2 from 10^-16 to 10^2. Methods: Chol., Chol. + IR, Indir. TSQR, Indir. TSQR + IR, Indir. TSQR + PIR, Dir. TSQR.
Performance model 31
- Only count reads and writes
- Streaming benchmark for read and write bandwidth of the system
- Within a factor of two of experimental data for all algorithms
- I/O dominates runtime
- Algorithms take the same time as a few passes over the data
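Counting only reads and writes reduces the model to bytes moved divided by measured bandwidth. The sketch below uses made-up bandwidth and volume numbers purely for illustration; the paper calibrates the bandwidths with a streaming benchmark:

```python
def predicted_time(bytes_read, bytes_written, read_bw, write_bw):
    # I/O dominates, so runtime is modeled as read time plus write time.
    return bytes_read / read_bw + bytes_written / write_bw

# Hypothetical numbers: a 100 GB matrix read twice and written once,
# with 1 GB/s read and 0.5 GB/s write aggregate bandwidth.
GB = 1e9
t = predicted_time(2 * 100 * GB, 100 * GB, 1 * GB, 0.5 * GB)
print(t)  # 400.0 (seconds)
```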
Performance 32
Figure: performance of the QR algorithms on MapReduce; time to solution in seconds (0 to 7000) for matrices of size 4B × 4 (134.6 GB), 2.5B × 10 (193.1 GB), 600M × 25 (112.0 GB), 500M × 50 (183.6 GB), and 150M × 100 (109 GB). Algorithms: Chol, Chol + PIR, Chol + IR, Indir TSQR, Indir TSQR + PIR, Indir TSQR + IR, Direct TSQR.
Direct TSQR: recursive extension 33
[R1; R2; R3; R4]   --TSQR-->   [Q1,2; Q2,2; Q3,2; Q4,2] R
(n·#(mappers) × n)             (n·#(mappers) × n)(n × n)

- n·#(mappers) rows is too large → recurse
Direct TSQR: recursive performance 34
Figure: running time (s) as a function of the number of columns, with and without recursion, for matrices with 150M rows (up to 200 columns), 100M rows (up to 250 columns), and 50M rows (up to 300 columns).
End 35
Contributions:

- Numerically stable and scalable QR
- Performance and stability tradeoffs
- Performance model
- Code: https://github.com/arbenson/mrtsqr
Contact: