Iterative Refinement on FPGAs
Tennessee Advanced Computing Laboratory
University of Tennessee
JunKyu Lee
July 19th 2011
This work was partially supported by
the National Science Foundation,
grant NSF CHE-0625598.
Floating Point Performance
Processors (CPU/GPU):
- Fast, customized, static ALUs
FPGAs:
- Slower clock; parallel, application-specific ALUs
- Pipelining
- Precision flexibility
1) CPU/GPU: good performance for single and double precision
2) Can we exploit FPGA flexibility?
- Arbitrary precision
Benefits from Lower Precision ALUs on FPGAs
ALU PrecisionHigherLower
Smaller ALUs Larger ALUs
ParallelismNumber of ALUs
in Fixed AreaShorter Wires
Shorter Pipeline
Clock Rate
SPEED UP!!
OK, let us explore precisions on FPGAs.
Which applications?
Dense linear system solvers: iterative refinement on FPGAs can provide high performance according to a prescribed accuracy.
Dense Linear System Solvers with Arbitrary Accuracy
- eXtended Mixed Precision Iterative Refinement (XMIR)
1. Iterative Refinement Algorithm
2. Implementation of XMIR on FPGAs
3. Performance Comparison with GPGPUs (Xilinx XC6VSX475T vs. NVIDIA GTX480)
4. Conclusions
Iterative Refinement
Notation:
GEPP: Gaussian elimination with partial pivoting
P: permutation matrix from GEPP
PL: lower precision; PH: higher precision
A: square matrix (n × n); b: right-hand-side vector in Ax = b
x: solution vector; x(1): initial solution (n × 1)
r: residual vector
ε: prescribed accuracy

Step 1: Factor PA = LU by GEPP and solve LU·x(1) = P·b    O(n³), PL
i = 0
Repeat
  i = i + 1
  Step 2: r(i) = b − A·x(i)          O(n²), PH
  Step 3: Solve LU·z(i) = P·r(i)     O(n²), PL
  Step 4: x(i+1) = x(i) + z(i)       O(n), PH
Until ||x(i+1) − x(i)||₂ ≤ ε
Steps 2-4 are computationally inexpensive relative to Step 1.
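A minimal software sketch of Steps 1-4 in Python/NumPy (an illustration, not the FPGA implementation): float32 plays PL, float64 plays PH, and the stopping test is a relative version of the ||·||₂ ≤ ε criterion.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def iterative_refinement(A, b, lo=np.float32, hi=np.float64,
                         eps=1e-12, max_iter=50):
    """Steps 1-4 above: LUPP in the lower precision (PL),
    residual and update in the higher precision (PH)."""
    # Step 1: GEPP factorization and first solve in PL -- O(n^3).
    lu, piv = lu_factor(A.astype(lo))
    x = lu_solve((lu, piv), b.astype(lo)).astype(hi)
    for _ in range(max_iter):
        r = b.astype(hi) - A.astype(hi) @ x               # Step 2: O(n^2), PH
        z = lu_solve((lu, piv), r.astype(lo)).astype(hi)  # Step 3: O(n^2), PL
        x = x + z                                         # Step 4: O(n), PH
        # Relative version of the ||.||_2 <= eps stopping test.
        if np.linalg.norm(z) <= eps * np.linalg.norm(x):
            return x, True
    return x, False
```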
LU Decomposition with Partial Pivoting (LUPP)
PA = LU    matrix decomposition: (2/3)n³ ops
Ly = Pb    forward substitution: n² ops
Ux = y     back substitution: n² ops
Direct method accuracy: ||x − x*|| / ||x*|| = q, where q = ϕ(n)·κ(A)·ε₀ for LUPP (ε₀: unit roundoff of the factorization precision).
Success condition for iterative refinement: q < 1.
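Given an estimate of κ(A), the success condition can be checked a priori. A small sketch, assuming ϕ(n) ≈ n as the growth-factor bound and ε₀ ≈ 2^(−t) for a t-bit mantissa (both assumptions, not from the slides):

```python
def ir_will_converge(n, cond_A, mantissa_bits):
    """Success condition q = phi(n) * kappa(A) * eps0 < 1.
    Assumptions: phi(n) ~ n, eps0 ~ 2**-mantissa_bits (rough unit roundoff)."""
    eps0 = 2.0 ** (-mantissa_bits)
    q = n * cond_A * eps0
    return q < 1, q

ok, q = ir_will_converge(n=1024, cond_A=1e3, mantissa_bits=23)
print(ok, q)   # True, q ~= 0.12 -> single-precision LUPP should converge
```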
MPIR flow:
1. LUPP in single precision
2. Refinement in double precision
3. Converge? Yes => DONE. No => redo LUPP in double precision.

XMIR flow:
1. LUPP in an arbitrary (lower) precision
2. Refinement in an arbitrary (higher) precision
3. Converge? Yes => DONE. No => redo LUPP in a higher arbitrary precision.
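The XMIR escalation loop, sketched in Python by continuing the earlier `iterative_refinement` sketch; NumPy/LAPACK only expose float32/float64, so the two-rung precision ladder here merely stands in for the FPGA's freely chosen exponent/mantissa widths:

```python
import numpy as np

def xmir_solve(A, b, eps=1e-12):
    """XMIR control flow (sketch): raise the LUPP precision after each
    failed refinement until the prescribed accuracy is reached."""
    x = None
    for lo in (np.float32, np.float64):   # escalating PL ladder
        x, converged = iterative_refinement(A, b, lo=lo, hi=np.float64,
                                            eps=eps)
        if converged:
            break                          # DONE
    return x
```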
Impact of XMIR
Achievable accuracy — XMIR: arbitrarily high; MPIR: double.
Assume the time cost model T(ε) = α·m^β (m: mantissa size), and let γ = mL/mH.
Performance comparison:
MPIR: T = T(εD) × ((2/3)n³·γ^β + 2n²m·(1 + γ^β))
XMIR: T = T(εAH) × ((2/3)n³·γ^β + 2n²m·(1 + γ^β))
The two have the same form; XMIR's advantage is maximized when T(εAL) << T(εAH).
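A rough evaluation of this model (a sketch with assumed values: β = 2, single/double mantissa widths 23/52, and the m in the 2n²m term read as the number of refinement iterations):

```python
# Cost model T(eps) = alpha * m**beta, gamma = mL / mH.
# Assumptions: beta = 2; 'iters' plays the role of m in the 2*n^2*m term.
def mixed_vs_high(n, m_lo, m_hi, iters, beta=2.0):
    g = m_lo / m_hi
    mixed = (2 / 3) * n**3 * g**beta + 2 * n**2 * iters * (1 + g**beta)
    high = (2 / 3) * n**3            # all work at the high precision
    return high / mixed              # speedup of the mixed scheme

# Single/double mantissas (23/52 bits), n = 10,000, 5 refinements:
print(mixed_vs_high(10_000, 23, 52, 5))   # ~5.1x
```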
Dense Linear System Solvers with Flexible Precisions
- Direct Method: 1 precision
- Wilkinson's Iterative Refinement (WIR): 2 precisions (Original/Higher); accuracy: original precision
- Mixed Precision IR (MPIR): 2 precisions (Single/Double(Original)); accuracy: double
- Arbitrary Initial Precision MPIR (AMIR): 2 precisions (Arbitrary/Double(Original)); accuracy: double
- eXtended MPIR (XMIR): 2 precisions (Arbitrary/Arbitrary(Original)); accuracy: arbitrary
Step 2. Residual Calculation
r = b − A·x
[Figure: PE datapath — a multiplier and an adder fed from a register file; a mux (sel) injects b into the accumulation; subtraction is realized by flipping the sign bit (MSB <= not(MSB)).]
Execution time: T = (n/#PEs) × (n + k + l + r)/f ≈ n²/(#PEs × f)
2 ops per PE per clock cycle; implementation block size = 1,024.
Implementation (Xilinx ISE/VHDL)
[Figure: PE array. Double-buffered matrix-side BRAMs (BRAM 0 / BRAM 1) and vector-side BRAMs (BRAM 0 / BRAM 1) feed each PE's partial-sum register; a MicroBlaze monitors the BRAM 0/1 status flags and handles loading.]
Implementation
Step 2. Residual Calculation: place-and-route (PAR) results. Resource figures are used/available.

PAR: Xilinx XC6VSX475T
Exp/Mantissa Size | Pipeline Depth (Add/Mult) | DSP48E | Slice Registers | LUTs | BRAM (36Kb) | # of PEs | Clock | GFLOPs
8/23 (S)  | 11/8  | 4/2,016  | 1,278/595,200 | 1,748/297,600 | 4/1,064 | 170 | 210 MHz | 71
11/38     | 12/12 | 5/2,016  | 2,355/595,200 | 2,807/297,600 | 6/1,064 | 106 | 167 MHz | 35
11/52 (D) | 14/15 | 13/2,016 | 2,912/595,200 | 3,546/297,600 | 8/1,064 | 83  | 125 MHz | 21
15/63     | 13/22 | 16/2,016 | 3,816/595,200 | 4,517/297,600 | 8/1,064 | 65  | 178 MHz | 23

PAR: Xilinx XC5VLX110T
Exp/Mantissa Size | Pipeline Depth (Add/Mult) | DSP48E | Slice Registers | LUTs | BRAM (36Kb) | # of PEs | Clock | GFLOPs
8/23 (S) | 12/8 | 2/64 | 1,496/69,120 | 1,724/69,120 | 4/148 | 32 | 196 MHz | 13
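Since each PE performs 2 operations per clock cycle, the GFLOPs column is just 2 × (# of PEs) × clock. A quick sanity check in Python, with the numbers copied from the XC6VSX475T table above:

```python
# Peak throughput check: 2 ops per PE per clock cycle.
configs = {
    "8/23 (S)":  (170, 210e6),
    "11/38":     (106, 167e6),
    "11/52 (D)": (83, 125e6),
    "15/63":     (65, 178e6),
}
for name, (pes, clk) in configs.items():
    gflops = 2 * pes * clk / 1e9
    print(f"{name}: {gflops:.1f} GFLOPs")
# 8/23 (S): 71.4, 11/38: 35.4, 11/52 (D): 20.8, 15/63: 23.1
```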
Implementation
Step 3. Triangular System Solver (Block Method)
[Figure: blocked lower-triangular system L·y = z — an 8 × 8 grid of blocks of size 64, with diagonal blocks L11…L88 and off-diagonal blocks L21…L87, acting on y1…y8 and z1…z8.]
Update z vector (block size = 64). After each y-block is solved, the trailing z-blocks are updated:
y1 solved: z2 = z2 − L21·y1
y2 solved: z3 = z3 − L31·y1 − L32·y2
…
z8 = z8 − L81·y1 − L82·y2 − … − L87·y7
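A minimal software sketch of this blocked scheme in Python; `small_forward_solve`, the 64×64 diagonal kernel, is sketched in the next section:

```python
import numpy as np

def block_forward_solve(L, z, nb=64):
    """Blocked forward substitution for L y = z (sketch).

    Each solved y-block immediately updates the trailing z-blocks,
    matching the z_i = z_i - L_ij * y_j updates above."""
    n = L.shape[0]
    y = np.array(z, dtype=float)
    for j in range(0, n, nb):
        d = slice(j, min(j + nb, n))
        y[d] = small_forward_solve(L[d, d], y[d])   # diagonal block
        t = slice(min(j + nb, n), n)
        y[t] -= L[t, d] @ y[d]                      # trailing update
    return y
```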
Implementation
Step 3. Triangular System Solver (small triangular matrices, block size = 64)
x0 = (b0)/l00
x1 = (b1 − l10·x0)/l11
x2 = (b2 − l20·x0 − l21·x1)/l22
x3 = (b3 − l30·x0 − l31·x1 − l32·x2)/l33
Once x0 is available: z1 = (b1 − l10·x0), z2 = (b2 − l20·x0), and z3 = (b3 − l30·x0).
In the next iteration: z2 = (z2 − l21·x1), z3 = (z3 − l31·x1), and so on.
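A sketch of this column-oriented kernel in Python; in hardware, updating the remaining z entries with column k overlaps the pipelined divider's latency, which is the point of this ordering:

```python
import numpy as np

def small_forward_solve(L, b):
    """Column-oriented forward substitution (the 64x64 kernel above).

    After x_k is produced, every remaining z entry is updated with
    column k, so the divider's latency can overlap with the
    multiply-subtract updates in the hardware version."""
    n = L.shape[0]
    z = np.array(b, dtype=float)
    x = np.zeros(n)
    for k in range(n):
        x[k] = z[k] / L[k, k]             # x_k = z_k / l_kk
        z[k + 1:] -= L[k + 1:, k] * x[k]  # z_i = z_i - l_ik * x_k
    return x
```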
Implementation
Step 3. Triangular System Solver (block size = 64)
[Figure: solver datapath — PEs, an XADDR arbiter, a divider (Div_inter), and Td/T/z BRAMs exchanging address/data signals, with loc_enable/ext_enable fanned out to all the modules.]
- Latency from division is hidden.
- 2 operations per clock cycle.
- Triangular matrix and b vector => intermediate z vector; division by the diagonal elements => final solution.
Performance Comparison
NVIDIA GTX480 (MAGMA v0.2) vs. Xilinx XC6VSX475T
Data transfer time from the host to the accelerator is excluded for both (GPGPU/FPGA).
[Bar chart: GFlops vs. precision (mantissa sizes 23, 38, 52, 63); reported bars include 71, 48, 47, 35, and 32 GFlops.]
FPGA ≈ GPU
Conclusions
1. XMIR can produce arbitrary accuracy in dense linear system solvers.
2. For applications requiring very high accuracy, the impact of XMIR is maximized.
3. XMIR (FPGA): lower precision / beyond double precision; MPIR (GPU): moderately high precision.
Future Work
Hybrid platform (FPGA + GPU)
- Power-Aware Performance
Dynamic precision?
- Update precisions during iteration
Thank You, Any Questions?