Iterative Refinement on FPGAs
Tennessee Advanced Computing Laboratory
University of Tennessee
JunKyu Lee
July 19th 2011
This work was partially supported by
the National Science Foundation,
grant NSF CHE-0625598.
Floating Point Performance
Processors (CPU/GPU):
- Fast, customized, static ALUs
FPGAs:
- Slower clock; parallel, application-specific ALUs
- Pipelining
- Precision flexibility
1) CPU/GPU: good performance for single and double precision
2) Can we exploit FPGA flexibility?
- Arbitrary precision
Benefits from Lower Precision ALUs on FPGAs
ALU PrecisionHigherLower
Smaller ALUs Larger ALUs
ParallelismNumber of ALUs
in Fixed AreaShorter Wires
Shorter Pipeline
Clock Rate
SPEED UP!!
OK, let us explore precisions on FPGAs.
Which applications?
Dense linear system solvers: iterative refinement on FPGAs can provide high performance according to a prescribed accuracy.
Dense Linear System Solvers with Arbitrary Accuracy
- eXtended Mixed Precision Iterative Refinement (XMIR)
1. Iterative Refinement Algorithm
2. Implementation of XMIR on FPGAs
3. Performance Comparison with GPGPUs (Xilinx XC6VSX475T vs. NVIDIA GTX480)
4. Conclusions
Iterative Refinement
Notation:
GEPP: Gaussian elimination with partial pivoting
P: permutation matrix from GEPP
PL: lower precision; PH: higher precision
A: square matrix (n × n); b: right-hand-side vector in Ax = b
x: solution vector; x(1): initial solution (n × 1)
r: residual vector
ε: prescribed accuracy

Step 1: Factor PA = LU by GEPP and solve LU·x(1) = P·b    O(n³), PL
i = 0
Repeat
  i = i + 1
  Step 2: r(i) = b − A·x(i)          O(n²), PH
  Step 3: Solve LU·z(i) = P·r(i)     O(n²), PL
  Step 4: x(i+1) = x(i) + z(i)       O(n), PH
Until ||x(i+1) − x(i)||₂ ≤ ε
Steps 2-4 are computationally inexpensive relative to Step 1.
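A minimal software sketch of Steps 1-4 in Python/NumPy (an illustration, not the FPGA implementation): float32 plays PL, float64 plays PH, and the stopping test is a relative version of the ||·||₂ ≤ ε criterion.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def iterative_refinement(A, b, lo=np.float32, hi=np.float64,
                         eps=1e-12, max_iter=50):
    """Steps 1-4 above: LUPP in the lower precision (PL),
    residual and update in the higher precision (PH)."""
    # Step 1: GEPP factorization and first solve in PL -- O(n^3).
    lu, piv = lu_factor(A.astype(lo))
    x = lu_solve((lu, piv), b.astype(lo)).astype(hi)
    for _ in range(max_iter):
        r = b.astype(hi) - A.astype(hi) @ x               # Step 2: O(n^2), PH
        z = lu_solve((lu, piv), r.astype(lo)).astype(hi)  # Step 3: O(n^2), PL
        x = x + z                                         # Step 4: O(n), PH
        # Relative version of the ||.||_2 <= eps stopping test.
        if np.linalg.norm(z) <= eps * np.linalg.norm(x):
            return x, True
    return x, False
```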
LU Decomposition with Partial Pivoting (LUPP)
PA = LU    matrix decomposition: (2/3)n³ ops
Ly = Pb    forward substitution: n² ops
Ux = y     back substitution: n² ops
Direct method accuracy: ||x − x*|| / ||x*|| = q, where q = ϕ(n)·κ(A)·ε₀ for LUPP (ε₀: unit roundoff of the factorization precision).
Success condition for iterative refinement: q < 1.
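Given an estimate of κ(A), the success condition can be checked a priori. A small sketch, assuming ϕ(n) ≈ n as the growth-factor bound and ε₀ ≈ 2^(−t) for a t-bit mantissa (both assumptions, not from the slides):

```python
def ir_will_converge(n, cond_A, mantissa_bits):
    """Success condition q = phi(n) * kappa(A) * eps0 < 1.
    Assumptions: phi(n) ~ n, eps0 ~ 2**-mantissa_bits (rough unit roundoff)."""
    eps0 = 2.0 ** (-mantissa_bits)
    q = n * cond_A * eps0
    return q < 1, q

ok, q = ir_will_converge(n=1024, cond_A=1e3, mantissa_bits=23)
print(ok, q)   # True, q ~= 0.12 -> single-precision LUPP should converge
```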
MPIR flow:
1. LUPP in single precision
2. Refinement in double precision
3. Converge? Yes => DONE. No => redo LUPP in double precision.

XMIR flow:
1. LUPP in an arbitrary (lower) precision
2. Refinement in an arbitrary (higher) precision
3. Converge? Yes => DONE. No => redo LUPP in a higher arbitrary precision.
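The XMIR escalation loop, sketched in Python by continuing the earlier `iterative_refinement` sketch; NumPy/LAPACK only expose float32/float64, so the two-rung precision ladder here merely stands in for the FPGA's freely chosen exponent/mantissa widths:

```python
import numpy as np

def xmir_solve(A, b, eps=1e-12):
    """XMIR control flow (sketch): raise the LUPP precision after each
    failed refinement until the prescribed accuracy is reached."""
    x = None
    for lo in (np.float32, np.float64):   # escalating PL ladder
        x, converged = iterative_refinement(A, b, lo=lo, hi=np.float64,
                                            eps=eps)
        if converged:
            break                          # DONE
    return x
```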
Impact of XMIR
Achievable accuracy — XMIR: arbitrarily high; MPIR: double.
Assume the time cost model T(ε) = α·m^β (m: mantissa size), and let γ = mL/mH.
Performance comparison:
MPIR: T = T(εD) × ((2/3)n³·γ^β + 2n²m·(1 + γ^β))
XMIR: T = T(εAH) × ((2/3)n³·γ^β + 2n²m·(1 + γ^β))
The two have the same form; XMIR's advantage is maximized when T(εAL) << T(εAH).
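A rough evaluation of this model (a sketch with assumed values: β = 2, single/double mantissa widths 23/52, and the m in the 2n²m term read as the number of refinement iterations):

```python
# Cost model T(eps) = alpha * m**beta, gamma = mL / mH.
# Assumptions: beta = 2; 'iters' plays the role of m in the 2*n^2*m term.
def mixed_vs_high(n, m_lo, m_hi, iters, beta=2.0):
    g = m_lo / m_hi
    mixed = (2 / 3) * n**3 * g**beta + 2 * n**2 * iters * (1 + g**beta)
    high = (2 / 3) * n**3            # all work at the high precision
    return high / mixed              # speedup of the mixed scheme

# Single/double mantissas (23/52 bits), n = 10,000, 5 refinements:
print(mixed_vs_high(10_000, 23, 52, 5))   # ~5.1x
```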
Dense Linear System Solvers with Flexible Precisions
- Direct Method: 1 precision
- Wilkinson's Iterative Refinement (WIR): 2 precisions (Original/Higher); accuracy: original precision
- Mixed Precision IR (MPIR): 2 precisions (Single/Double(Original)); accuracy: double
- Arbitrary Initial Precision MPIR (AMIR): 2 precisions (Arbitrary/Double(Original)); accuracy: double
- eXtended MPIR (XMIR): 2 precisions (Arbitrary/Arbitrary(Original)); accuracy: arbitrary
Step 2. Residual Calculation
r = b − A·x
[Figure: PE datapath — a multiplier and an adder fed from a register file; a mux (sel) injects b into the accumulation; subtraction is realized by flipping the sign bit (MSB <= not(MSB)).]
Execution time: T = (n/#PEs) × (n + k + l + r)/f ≈ n²/(#PEs × f)
2 ops per PE per clock cycle; implementation block size = 1,024.
Implementation (Xilinx ISE/VHDL)
[Figure: PE array. Double-buffered matrix-side BRAMs (BRAM 0 / BRAM 1) and vector-side BRAMs (BRAM 0 / BRAM 1) feed each PE's partial-sum register; a MicroBlaze monitors the BRAM 0/1 status flags and handles loading.]
Implementation
Step 2. Residual Calculation: place-and-route (PAR) results. Resource figures are used/available.

PAR: Xilinx XC6VSX475T
Exp/Mantissa Size | Pipeline Depth (Add/Mult) | DSP48E | Slice Registers | LUTs | BRAM (36Kb) | # of PEs | Clock | GFLOPs
8/23 (S)  | 11/8  | 4/2,016  | 1,278/595,200 | 1,748/297,600 | 4/1,064 | 170 | 210 MHz | 71
11/38     | 12/12 | 5/2,016  | 2,355/595,200 | 2,807/297,600 | 6/1,064 | 106 | 167 MHz | 35
11/52 (D) | 14/15 | 13/2,016 | 2,912/595,200 | 3,546/297,600 | 8/1,064 | 83  | 125 MHz | 21
15/63     | 13/22 | 16/2,016 | 3,816/595,200 | 4,517/297,600 | 8/1,064 | 65  | 178 MHz | 23

PAR: Xilinx XC5VLX110T
Exp/Mantissa Size | Pipeline Depth (Add/Mult) | DSP48E | Slice Registers | LUTs | BRAM (36Kb) | # of PEs | Clock | GFLOPs
8/23 (S) | 12/8 | 2/64 | 1,496/69,120 | 1,724/69,120 | 4/148 | 32 | 196 MHz | 13
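Since each PE performs 2 operations per clock cycle, the GFLOPs column is just 2 × (# of PEs) × clock. A quick sanity check in Python, with the numbers copied from the XC6VSX475T table above:

```python
# Peak throughput check: 2 ops per PE per clock cycle.
configs = {
    "8/23 (S)":  (170, 210e6),
    "11/38":     (106, 167e6),
    "11/52 (D)": (83, 125e6),
    "15/63":     (65, 178e6),
}
for name, (pes, clk) in configs.items():
    gflops = 2 * pes * clk / 1e9
    print(f"{name}: {gflops:.1f} GFLOPs")
# 8/23 (S): 71.4, 11/38: 35.4, 11/52 (D): 20.8, 15/63: 23.1
```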
Implementation
Step 3. Triangular System Solver (Block Method)
[Figure: blocked lower-triangular system L·y = z — an 8 × 8 grid of blocks of size 64, with diagonal blocks L11…L88 and off-diagonal blocks L21…L87, acting on y1…y8 and z1…z8.]
Update z vector (block size = 64). After each y-block is solved, the trailing z-blocks are updated:
y1 solved: z2 = z2 − L21·y1
y2 solved: z3 = z3 − L31·y1 − L32·y2
…
z8 = z8 − L81·y1 − L82·y2 − … − L87·y7
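A minimal software sketch of this blocked scheme in Python; `small_forward_solve`, the 64×64 diagonal kernel, is sketched in the next section:

```python
import numpy as np

def block_forward_solve(L, z, nb=64):
    """Blocked forward substitution for L y = z (sketch).

    Each solved y-block immediately updates the trailing z-blocks,
    matching the z_i = z_i - L_ij * y_j updates above."""
    n = L.shape[0]
    y = np.array(z, dtype=float)
    for j in range(0, n, nb):
        d = slice(j, min(j + nb, n))
        y[d] = small_forward_solve(L[d, d], y[d])   # diagonal block
        t = slice(min(j + nb, n), n)
        y[t] -= L[t, d] @ y[d]                      # trailing update
    return y
```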
Implementation
Step 3. Triangular System Solver (small triangular matrices, block size = 64)
x0 = (b0)/l00
x1 = (b1 − l10·x0)/l11
x2 = (b2 − l20·x0 − l21·x1)/l22
x3 = (b3 − l30·x0 − l31·x1 − l32·x2)/l33
Once x0 is available: z1 = (b1 − l10·x0), z2 = (b2 − l20·x0), and z3 = (b3 − l30·x0).
In the next iteration: z2 = (z2 − l21·x1), z3 = (z3 − l31·x1), and so on.
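A sketch of this column-oriented kernel in Python; in hardware, updating the remaining z entries with column k overlaps the pipelined divider's latency, which is the point of this ordering:

```python
import numpy as np

def small_forward_solve(L, b):
    """Column-oriented forward substitution (the 64x64 kernel above).

    After x_k is produced, every remaining z entry is updated with
    column k, so the divider's latency can overlap with the
    multiply-subtract updates in the hardware version."""
    n = L.shape[0]
    z = np.array(b, dtype=float)
    x = np.zeros(n)
    for k in range(n):
        x[k] = z[k] / L[k, k]             # x_k = z_k / l_kk
        z[k + 1:] -= L[k + 1:, k] * x[k]  # z_i = z_i - l_ik * x_k
    return x
```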
Implementation
Step 3. Triangular System Solver (block size = 64)
[Figure: solver datapath — PEs, an XADDR arbiter, a divider (Div_inter), and Td/T/z BRAMs exchanging address/data signals, with loc_enable/ext_enable fanned out to all the modules.]
- Latency from division is hidden.
- 2 operations per clock cycle.
- Triangular matrix and b vector => intermediate z vector; division by the diagonal elements => final solution.
Performance Comparison
NVIDIA GTX480 (MAGMA v0.2) vs. Xilinx XC6VSX475T
Data transfer time from the host to the accelerator is excluded for both (GPGPU/FPGA).
[Bar chart: GFlops vs. precision (mantissa sizes 23, 38, 52, 63); reported bars include 71, 48, 47, 35, and 32 GFlops.]
FPGA ≈ GPU
Conclusions
1. XMIR can produce arbitrary accuracy in dense linear system solvers.
2. For applications requiring very high accuracy, the impact of XMIR is maximized.
3. XMIR (FPGA): lower precision / beyond double precision; MPIR (GPU): moderately high precision.
Future Work
Hybrid platform (FPGA + GPU)
- Power-Aware Performance
Dynamic precision?
- Update precisions during iteration
Thank You, Any Questions?