

Page 1

OAK RIDGE NATIONAL LABORATORY
U.S. DEPARTMENT OF ENERGY

Scientific Computing Beyond CPUs: FPGA Implementations of Common Scientific Kernels

Melissa C. Smith¹, Jeffrey S. Vetter², Sadaf R. Alam², Sreesa Akella³, Luis Cordova³

¹ Engineering Science and Technology Division, ORNL
² Computer Science and Mathematics Division, ORNL
³ University of South Carolina

September 2005

Page 2


Outline

Introduction & Motivation
Candidate Kernels/Apps & Implementation
Results
Function Library
Lessons Learned
Conclusions

Page 3


Introduction

Traditional computing
– Hardware development struggling to keep pace with analysis needs
– Reaching limits on computing speed due to I/O bandwidth and the clock wall
– Managing heat dissipation becoming increasingly difficult

Reconfigurable Computing (RC) with FPGAs
– Faster execution and lower power consumption, all with slower clock speeds
– Exploit inherent parallelism in algorithms
– Match computation to application data flow (i.e., data flow graph theory)
– Hardware-like speed with software-like flexibility that can adapt to the needs of the application
– Gate densities suitable for 64b floating-point

[Figure: performance improvement (1x to 10000x) of FPGAs vs. processors over the years 1996-2005; image courtesy of SRC]

Page 4


Motivation

Many scientific applications at ORNL and elsewhere depend on double-precision operations

Kernel selection and classification
– compute intensive
– common among many relevant applications
– candidate for hardware implementation

Interface to legacy code (FORTRAN & C) extremely important

Memory bottleneck in conventional memory hierarchies for scientific applications is throttling performance

With this knowledge:
– Can users harness reconfigurable hardware without (a) becoming hardware experts and (b) completely re-writing their code?
– Can we develop function libraries such as BLAS, VSIPL, or others without loss of generality?

Page 5


Candidate Kernels & Applications

Initial studies
– Kernels
  • Dense matrix operations (e.g., DGEMM)
  • Sparse matrix operations
– Climate
  • PSTSWM
– Bioinformatics
  • BLAST
  • Fragment assembly
– Molecular dynamics
  • AMBER
  • LAMMPS

We cannot cover all of the application studies today.

Page 6


DGEMM & SGEMM

BLAS routines: SGEMM and DGEMM perform the matrix-matrix operation

C = αAB + βC

where α and β are scalars, and A, B, and C are matrices (A is m×k, B is k×n, and C is m×n).

What makes them difficult and interesting:
– Memory communication bottleneck (limited bandwidth)
– Local storage limitation (for both sequential & parallel machines)

Answer: Exploit data reusability and data flow with FPGAs
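For reference, this is the operation as a host code would invoke it through the standard CBLAS interface (a minimal sketch; the row-major layout and the alpha/beta values shown are illustrative only):

    /* DGEMM: C = alpha*A*B + beta*C via the standard CBLAS call.
       A is m x k, B is k x n, C is m x n, all row-major here. */
    #include <cblas.h>

    void dgemm_call(int m, int n, int k,
                    const double *A, const double *B, double *C)
    {
        const double alpha = 1.0, beta = 1.0;   /* illustrative scalars */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, alpha, A, k, B, n, beta, C, n);
    }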

Page 7


Implementation

Fully utilize both user FPGAs (XC2V6000) of the SRC MAPstation

DGEMM: 12 MAC units per FPGA (SGEMM: 25 MAC units per FPGA)

Geared to handle arbitrary-size matrices up to 1024×1024

Matrix operations occur in blocks

How to count FLOPS?
– The FPGA algorithm performs more FLOPS than an efficient SW implementation
– Takes advantage of the data flow architecture
– Later referred to as alternate FLOPS

[Figure: matrix A partitioned into blocks A00-A55; the 2×2 group of blocks A00, A01, A10, A11 is highlighted.]
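A software analogue of this blocking scheme is the standard blocked matrix-multiply loop nest below (a sketch only; the block size is illustrative, and the actual MAPstation design streams blocks through on-board memory banks rather than relying on a cache):

    /* Blocked DGEMM sketch (C = A*B + C, square n x n, row-major),
       illustrating data reuse: each block of A and B is loaded once and
       reused across a full block of C. BS is an illustrative block size. */
    #define BS 128

    void dgemm_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int kk = 0; kk < n; kk += BS)
                    /* multiply block A[ii..][kk..] by block B[kk..][jj..] */
                    for (int i = ii; i < ii + BS && i < n; i++)
                        for (int k = kk; k < kk + BS && k < n; k++) {
                            double a = A[i * n + k];
                            for (int j = jj; j < jj + BS && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }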

Page 8


Implementation – Stage0

[Figure: Stage 0 block assignment in the six OBM (on-board memory) banks shared by FPGA0 and FPGA1.
Bank A: A00, A01; Bank B: B00, B10; Bank C: C00, C01;
Bank D: A10, A11; Bank E: B01, B11; Bank F: C10, C11]

Calculations are conducted in two stages. The two FPGAs exchange ownership of the matrix B blocks.

(800 MB/s per OBM bank)

Page 9


Implementation – Stage1

[Figure: Stage 1 uses the same OBM bank contents as Stage 0 (Bank A: A00, A01; B: B00, B10; C: C00, C01; D: A10, A11; E: B01, B11; F: C10, C11), with the two FPGAs having exchanged ownership of the B blocks.]

In the second stage, the two FPGAs have exchanged ownership of the matrix B blocks.

Page 10


DGEMM Analysis

[Figure: measured DGEMM performance (Mflops/s, up to ~7000) and execution time (seconds, up to ~0.6) vs. matrix dimension N (0-900), comparing ATLAS cblas_dgemm() built with -O3 on the SRC dual-Xeon host against the FPGA hardware measured with RDTSC, measured with FPGA counters, and computation-only with FPGA counters, plus projections for a 2x-faster FPGA and for a 2x-faster FPGA with faster data transfer.]

Data transfer time in/out of hardware is significant and takes away from "time to solution"; hence the interest in other memory systems such as those used in systems by Cray and SGI.

Faster and/or denser FPGAs can significantly improve performance and time to solution.

Performance and time to solution could potentially be improved with DMA streaming of data.

Page 11


FPGA Opportunity and Potential

[Figure: performance improvement (1x to 10000x) of FPGAs vs. processors over the years 1996-2005; image courtesy of SRC]

Our results using SRC CARTE v1.8
– Dual Xilinx XC2V6000
– 12 64-b MACs @ 100 MHz (or 25 32-b MACs)
– 3.5 GFlops (5.3 GFlops alternate FLOPS)

Dou et al. results using a hardware description language
– Xilinx XC2VP125-7
– 39 64-b MACs @ 200 MHz
– 15.6 GFlops

Parts available on the Cray XD1
– Xilinx XC2VP50-7 × 6 nodes
– Up to 200 MHz
– Conservative estimate: 18 64-b MACs -> 7.2 GFlops per node
– Full utilization of all 6 nodes: potentially 43.2 GFlops

Flops/MAC ratios: MAPstation = 0.44, Dou's = 0.4
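For reference, the per-node XD1 estimate above follows from counting two floating-point operations (a multiply and an add) per MAC per cycle, and the Flops/MAC ratios appear to be sustained GFlops divided by the per-FPGA MAC count (an interpretation, not stated explicitly on the slide):

    Per XD1 node:  18 MACs x 2 flops/cycle x 200 MHz = 7.2 GFlops
    All 6 nodes:   6 x 7.2 GFlops = 43.2 GFlops
    MAPstation:    5.3 GFlops / 12 MACs ≈ 0.44 GFlops per MAC
    Dou et al.:    15.6 GFlops / 39 MACs = 0.40 GFlops per MAC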

Page 12


Building Function Libraries for RC

Goal: To assemble a library of user-friendly, familiar, and pertinent scientific functions

Initial functions identified:
– BLAS Level 2 and Level 3 (e.g., DGEMM/SGEMM)
– Sparse matrix operations
– FFT and 3D-FFT
– Bioinformatics query functions

[Diagram: MD, climate, and bioinformatics applications layered over the library functions FFT, BLAS, iterative solvers, sparse matrix-vector (SpMatVec), and queries.]
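One way such a library can stay transparent to legacy C and FORTRAN callers is to keep the familiar BLAS-style entry point and dispatch internally. The sketch below is only an illustration of that idea: the FPGA entry point map_dgemm() and the size threshold are hypothetical, not the CARTE or library API.

    /* Sketch of a drop-in GEMM wrapper: legacy code keeps its familiar call,
       and the library decides when the FPGA is worth the transfer cost.
       map_dgemm() is a hypothetical FPGA-accelerated kernel. */
    void map_dgemm(int m, int n, int k, double alpha, const double *A,
                   const double *B, double beta, double *C);   /* hypothetical */

    /* plain host fallback: C = alpha*A*B + beta*C, row-major */
    static void host_dgemm(int m, int n, int k, double alpha, const double *A,
                           const double *B, double beta, double *C)
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int p = 0; p < k; p++)
                    s += A[i * k + p] * B[p * n + j];
                C[i * n + j] = alpha * s + beta * C[i * n + j];
            }
    }

    #define FPGA_MIN_DIM 256   /* illustrative threshold only */

    void library_dgemm(int m, int n, int k, double alpha, const double *A,
                       const double *B, double beta, double *C)
    {
        /* Offload only when the matrices are large enough to amortize the
           host-to-FPGA data transfer noted in the DGEMM analysis. */
        if (m >= FPGA_MIN_DIM && n >= FPGA_MIN_DIM && k >= FPGA_MIN_DIM)
            map_dgemm(m, n, k, alpha, A, B, beta, C);
        else
            host_dgemm(m, n, k, alpha, A, B, beta, C);
    }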

Page 13


Sparse Matrix Vector Operations

Used in iterative solvers for linear systems

Not efficient on general-purpose microprocessor systems
– High cache miss rate due to poor data locality
– Low utilization of the floating-point unit due to a high ratio of load/store to floating-point operations

RC advantage
– Avoid cache misses with high on-chip and off-chip memory bandwidth
– Local distributed memory banks
– High-density FPGAs
– High-speed host-to-FPGA communication

[Figure: SpMV data path. Parallel MAC units multiply each non-zero element NZ[i] by the input-vector entry IV[CO[i]] and accumulate into the output vector OV. NZ – non-zero element vector, CO – column indices vector, IV – input vector, OV – output vector.]

Investigating multiple storage formats (CSR, ELLPACK, and CSRPERM)
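In software terms, the operation the MAC pipeline implements is the familiar CSR sparse matrix-vector product below (a minimal sketch using the figure's vector names; the row-pointer array RP is the standard CSR addition, not shown on the slide):

    /* y = A*x in CSR format, written with the slide's vector names:
       NZ - non-zero values, CO - column index of each non-zero,
       RP - row pointers (RP[i]..RP[i+1]-1 index row i's non-zeros),
       IV - input vector x, OV - output vector y. */
    void spmv_csr(int nrows, const double *NZ, const int *CO, const int *RP,
                  const double *IV, double *OV)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int j = RP[i]; j < RP[i + 1]; j++)
                sum += NZ[j] * IV[CO[j]];   /* one MAC per non-zero */
            OV[i] = sum;
        }
    }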

Page 14


Candidate Application: Amber8 Acceleration Strategy

Identified regions of the Amber8 application using detailed profiling and modeling of the code
– ew_direct.f
– veclib.f

Examining a strategy for mapping these routines into SRC's two FPGAs

Also investigating acceleration of FFTs using FPGAs
– ew_recip.f
– ew_fft.f
– pub_fft.f & passb2.f

[Figure: profiled call tree of Amber8's sander (main -> sander -> runmd -> force -> ewald_force -> get_nb_energy -> short_ene; do_pmesh_kspace -> fft_backrc/fft_forwardrc -> fft3dzxyrc -> fft2drc -> fft3d0rc -> cfftb1/cfftf1/cffti -> passb2/passb4/passf4; plus nb_adjust, adjust_imagcrds, shake, fastwt_mp_quick3, fft_setup, grad_sumrc, vdinvsqrt). Hotspots are annotated with time fractions of 73.14%, 11.22%, and 3.39%, call counts ranging from 1 up to 47,116,000, and the source files ew_direct.f, vec_lib.f, ew_force.f, ew_box.f, ew_recip.f, ew_fft.f, pub_fft.f, passb2.f. Notes on the figure: 3D FFT time worsens for parallel systems due to communication costs; the O(N²) portions are smaller problems.]

Page 15


3D FFTs in LAMMPS (Large Scale Atomic/Molecular Massively Parallel Simulator)

fft_3d( ) performs the 3D FFT as three passes of 1D FFTs along the Nfast (1), Nmid (2), and Nslow (3) dimensions, with a data remap between passes:

1. fftw (plan, total1/length1, data, 1/-1, length1, NULL, 0, 0)
   remap_3d (data, copy, scratch, pre_plan)
2. fftw (plan, total2/length2, data, 1/-1, length2, NULL, 0, 0)
   remap_3d (data, copy, scratch, pre_plan)
3. fftw (plan, total3/length3, data, 1/-1, length3, NULL, 0, 0)
   remap_3d (data, copy, scratch, pre_plan)

For a 3×3×3 example: total1/length1 = 1x3x3/3 = 3, total2/length2 = 3x1x3/3 = 3, total3/length3 = 3x3x1/3 = 3; the 1/-1 argument selects forward/inverse.

[Figure: FPGA mapping on a single/multi-MAP system. An fftw_orchestrator feeds parallel 'fly' FFT units from OBM through BRAM planes, with results gathered via the GCM.]

The data will not necessarily fit on-chip, but there is a penalty for going off-chip.

The remap stages are replaced by intelligent access and addressing.

Depending on the data size, the FPGA implementation of the fftw will resemble the software counterpart, with improved performance and data reuse.

The 'fly' elements indicated stand for different FFT computation units with radix 2, 3, 4, and 5 and with a certain level of parallelism.
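As a point of reference for the pass structure above, here is a minimal, purely software sketch of a 3D transform built from strided 1D passes over a contiguous Nslow × Nmid × Nfast array. The naive O(n²) DFT stands in for the fftw() calls, and the strided indexing plays the role of the remap/addressing step; this illustrates the structure only, and is not the LAMMPS or FPGA code.

    /* 3D DFT as three passes of 1D transforms along fast, mid, slow axes.
       dft1d() is an O(n^2) stand-in for the fftw() call on the slide;
       strided access replaces the explicit remap_3d() steps. */
    #include <complex.h>
    #include <math.h>
    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    static void dft1d(double complex *x, int n, int stride, int sign)
    {
        double complex tmp[64];              /* sketch assumes n <= 64 */
        for (int k = 0; k < n; k++) {
            double complex s = 0;
            for (int j = 0; j < n; j++)
                s += x[j * stride] * cexp(sign * 2.0 * M_PI * I * j * k / n);
            tmp[k] = s;
        }
        for (int k = 0; k < n; k++)
            x[k * stride] = tmp[k];
    }

    /* data[islow][imid][ifast], contiguous; sign = -1 forward, +1 backward
       (unnormalized). */
    void fft3d_naive(double complex *data, int nfast, int nmid, int nslow, int sign)
    {
        /* pass 1: along the fast dimension (stride 1) */
        for (int s = 0; s < nslow; s++)
            for (int m = 0; m < nmid; m++)
                dft1d(data + (s * nmid + m) * nfast, nfast, 1, sign);

        /* pass 2: along the mid dimension (stride nfast) */
        for (int s = 0; s < nslow; s++)
            for (int f = 0; f < nfast; f++)
                dft1d(data + s * nmid * nfast + f, nmid, nfast, sign);

        /* pass 3: along the slow dimension (stride nmid*nfast) */
        for (int m = 0; m < nmid; m++)
            for (int f = 0; f < nfast; f++)
                dft1d(data + m * nfast + f, nslow, nmid * nfast, sign);
    }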

Page 16


Bioinformatics – BLAST

BLAST: Basic Local Alignment Search Tool

Profiling of the NCBI source code to determine the time-consuming functions that could be targeted to the FPGA has been completed.

Currently investigating the best problem structure and domain for the given RC architecture and bandwidths (analysis of data streams, memory capacity, etc.)

Page 17


Lessons Learned

Effective use of HLLs (such as the Carte tool used here) to design for FPGAs still requires some hardware knowledge
– Memory limitations
– FPGA limitations
– 'Tricks' to take advantage of FPGA strengths
– 'Tricks' to take advantage of the RC architecture

Library development requires analysis to determine functions appropriate for FPGA implementation

The breakout level of library functions may not always be appropriate for RC implementation (still under investigation)
– Combine or fuse appropriate function calls to form larger functions with more computational weight

Page 18


Status Review & Future Work

Consider these caveats
– FPGA growth rates exceeding general-purpose microprocessors
  • These FPGA implementations demonstrate performance with additional power and space savings vs. general-processor implementations
– Restricted our evaluation to compiler-transformed high-level languages
  • No manual VHDL coding
  • Performance comparable with VHDL techniques (adjusting for FPGA size & clock frequency)
– New higher-bandwidth RC architectures promise to dramatically reduce data transfer costs
– Efforts in 64b floating-point computation just beginning
  • Cores not widely available
– No common tools exist that identify candidate codes or regions in the application for acceleration
  • Must manually profile and model large, complex applications

We expect the performance advantages and applicability of these systems to only improve over the coming years.

Page 19


Status Review & Future Work (cont.)

The ability to code in C or FORTRAN is a significant benefit for our users

Progress on several application areas
– Initial studies completed with competitive performance
  • Kernels (dense & sparse matrix), climate
– Actively studying other fruitful areas
  • Molecular dynamics, bioinformatics

Future work will focus on
– Maximum utilization of FPGA resources
– Additional function/kernel library development
– Resource management for multi-paradigm platforms
– Evaluations of other RC platforms (Cray XD1 and SGI)

Page 20


End