FGPU: An SIMT-Architecture for FPGAs
RUHR-UNIVERSITÄT BOCHUM
FGPU: An SIMT-Architecture for FPGAs
Muhammed Al Kadi, Benedikt Janssen, Michael Huebner
Ruhr University of Bochum
Chair for Embedded Systems for Information Technology
FPGA’16, Monterey, 23 February 2016
www.esit.rub.de
Outline
1 Motivation and Background
2 FGPU Architecture
  Execution Model
  Platform Model
  Compute Unit Architecture
  Global Memory Controller
3 Implementation and Results
4 Future Work and Conclusion
Motivation: What is FGPU?
An FPGA-GPU for general-purpose computing
A portable, flexible and scalable soft processor
A multi-core GPU-like architecture
Its ISA is designed to support execution of OpenCL kernels
Capable of interfacing many AXI4-compatible data interfaces with an internal L1 cache
It does not replicate any other architecture
Implemented completely in VHDL-2002
[Block diagram: an AXI Control Interface and Threads Scheduler in front of the compute devices (PE0–PE7 with a CU Memory Controller), connected through a cache memory and the Global Memory Controller over AXI data interfaces to global memory]
Motivation: Why FGPU?
Standard programming: OpenCL-compatible, no pragmas necessary, shorter development cycles
Scalable task management: new tasks occupy no extra area on the FPGA
Design space exploration: not possible with hard embedded GPUs
Application-specific adaptations: to achieve the best area/power and performance trade-off
High efficiency: compared to other soft architectures
Background: SIMT and GPGPU
SIMT (Single-Instruction Multiple-Threads)
is a parallel execution model for programming many-core architectures.
A single instruction can be executed concurrently by multiple threads.
Threads are scheduled at runtime on the available cores.
GPGPU (General-Purpose Computing on Graphics Processing Units)
An efficient solution for many applications, e.g. filtering, scientific simulations, matrix operations, sorting, DSP etc.
Embedded GPUs are easier to program with OpenCL or CUDA than FPGAs with HDLs.
FGPU Architecture: Execution Model (OpenCL-Compatible)
Threads (work-items) are coordinated in a 1-, 2- or 3D index space
Work-items are scheduled together in work-groups (WGs) on idle cores
WGs are split into wavefronts (WFs)
A WF’s work-items share the same program counter
Example (array multiplication): [Figure: two N-element input arrays are multiplied element-wise into a result array over a 1D index space; each thread (work-item) i computes one element, and work-items are grouped into wavefronts and work-groups]
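The decomposition above can be sketched as a minimal Python model. The work-group and wavefront sizes used here are illustrative parameters, not FGPU's fixed values:

```python
# Minimal model of the OpenCL-style decomposition described above:
# a 1-D index space of N work-items is split into work-groups (WGs),
# and each WG is split into wavefronts (WFs) whose work-items share
# one program counter. Sizes here are illustrative, not FGPU's.

def decompose(n_items, wg_size, wf_size):
    """Return a list of work-groups, each a list of wavefronts,
    each a list of global work-item indices."""
    wgs = []
    for wg_start in range(0, n_items, wg_size):
        wg = list(range(wg_start, min(wg_start + wg_size, n_items)))
        wfs = [wg[i:i + wf_size] for i in range(0, len(wg), wf_size)]
        wgs.append(wfs)
    return wgs

def run_array_mul(a, b, wg_size=8, wf_size=4):
    """Element-wise multiplication executed WF by WF: all work-items
    of a wavefront 'execute' the same instruction together."""
    result = [0] * len(a)
    for wg in decompose(len(a), wg_size, wf_size):
        for wf in wg:
            for i in wf:  # lock-step across the wavefront
                result[i] = a[i] * b[i]
    return result
```

On real hardware the wavefronts of different work-groups run concurrently on different compute units; the sequential loops here only model which work-item lands in which WF and WG.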
FGPU Architecture: Platform Model
[Block diagram: an AXI Control Interface and WG Dispatcher feed the CRAM, LRAM, Control Registers, WF Scheduler and RTM; each CU contains PE0–PE7 and a CU Memory Controller, connected via a cache memory and AXI data interfaces to the Global Memory Controller]
FGPU accommodates several Compute Units (CUs); each holds a single array of Processing Elements (PEs)
All PEs in a CU execute the same instruction at the same time
The binary code is executed from the Code RAM (CRAM)
Information that does not belong to the binary, e.g. the number of work-items to launch or parameter values, has to be stored in the Link RAM (LRAM)
FGPU Architecture: Platform Model - continued
[Block diagram: same platform overview as on the previous slide]
The WG Dispatcher assigns WGs to idle CUs
The WF Scheduler admits WFs for execution, e.g. after a memory access is performed
The Runtime Memory (RTM) implements the Work-Item Built-In Functions defined by OpenCL, e.g. feeding coordinates in the index space.
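The WG Dispatcher's "assign WGs to idle CUs" policy can be sketched as a small event-driven model. The CU count and per-WG durations are hypothetical inputs for illustration; the slides do not specify a dispatch algorithm beyond picking idle CUs:

```python
# Hypothetical sketch of the WG Dispatcher behaviour described above:
# each work-group is handed to the compute unit (CU) that becomes
# idle first. Durations and CU count are illustrative assumptions.
import heapq

def dispatch(wg_durations, n_cus):
    """wg_durations: execution time of each work-group, in order of
    arrival. Returns a list of (wg_id, cu_id, start_time) tuples."""
    cus = [(0.0, cu) for cu in range(n_cus)]  # (idle_at, cu_id)
    heapq.heapify(cus)
    schedule = []
    for wg_id, dur in enumerate(wg_durations):
        idle_at, cu = heapq.heappop(cus)      # earliest-idle CU wins
        schedule.append((wg_id, cu, idle_at))
        heapq.heappush(cus, (idle_at + dur, cu))
    return schedule
```

This captures why task management scales without extra FPGA area: adding work-groups only lengthens the dispatch queue, it does not add hardware.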
FGPU Architecture: Compute Unit Architecture
8 WFs can be managed within a single CU
The same instruction gets executed over 8 clock cycles on 8 PEs
A work-item owns 32 registers (32-bit)
Register files are held in dual-port RAMs and switched with no latency
The pipeline consists of 18 stages!
Doubled clock frequency for the register files and the ALUs
64 outstanding memory requests can be managed at the same time
[Block diagram: instruction fetch from the CRAM feeds a Wavefront Manager (WG Scheduler, Wavefront Scheduler, wavefronts 0–7) and the RTM; each PE (PE0–PE7) pairs a register file (ports A/B, running at clk_2x) with an ALU; a station/buffer and write-back path connect the compute vector to the CU Memory Controller via R/W address and data lines and a read-address FIFO]
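The "same instruction over 8 cycles on 8 PEs" scheme implies a wavefront of 8 × 8 = 64 work-items. A minimal sketch of that time-multiplexing, with the op and register contents as illustrative inputs:

```python
# Sketch of how one instruction is time-multiplexed over a wavefront:
# with 8 PEs and 8 clock cycles per instruction, one wavefront covers
# 8 x 8 = 64 work-items. The ALU op and operands are illustrative.

N_PES = 8
CYCLES_PER_INSTR = 8

def execute_instruction(op, regs_a, regs_b):
    """Apply one ALU op to every work-item of a wavefront,
    serving 8 work-items (one per PE) per clock cycle."""
    assert len(regs_a) == len(regs_b) == N_PES * CYCLES_PER_INSTR
    out = [0] * len(regs_a)
    for cycle in range(CYCLES_PER_INSTR):
        for pe in range(N_PES):
            i = cycle * N_PES + pe  # work-item slot served this cycle
            out[i] = op(regs_a[i], regs_b[i])
    return out
```

Serializing each instruction over 8 cycles is what lets a deep (18-stage) pipeline and double-clocked register files and ALUs stay busy without per-work-item hardware.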
FGPU Architecture: Global Memory Controller Architecture
[Block diagram: per-CU stations feed a write pipeline and a tag manager (tag memory, tag controllers, cache cleaner, byte-dirty bits) in front of a dual-port cache; an AXI controller drives the AXI port signals ARADDR, RDATA, WDATA, WSTRB and AWADDR]
It processes up to 64 outstanding requests from multiple CUs
Requests that are not yet served get increasing priority while they wait
A multi-bank, direct-mapped cache with a write-back strategy is included
Accesses to global memory are parallelized over many AXI4 ports
Data is transferred in bursts that fill a single cache line
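The aging arbitration ("waiting requests gain priority") can be sketched as follows. The one-request-per-cycle service rate and the tie-breaking rule are illustrative assumptions, not details given on the slides:

```python
# Hedged sketch of the aging arbitration described above: outstanding
# requests gain priority the longer they wait, so none starves.
# Service rate and tie-breaking are illustrative assumptions.

def arbitrate(requests, n_cycles):
    """requests: list of (arrival_cycle, request_id). One request is
    served per cycle; among pending ones, the longest-waiting (i.e.
    earliest-arrived) wins. Returns request ids in service order."""
    todo = sorted(requests)
    pending = []
    served = []
    for cycle in range(n_cycles):
        while todo and todo[0][0] <= cycle:
            pending.append(todo.pop(0))
        if pending:
            pending.sort(key=lambda r: r[0])  # oldest arrival first
            served.append(pending.pop(0)[1])
    return served
```

With a fixed-priority scheme, a CU issuing a burst of requests could starve the others; aging bounds every request's waiting time instead.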
FGPU | FPGA’16, Monterey, 23 February 2016 10
![Page 43: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/43.jpg)
RUHR-UNIVERSITÄT BOCHUM
FGPU ArchitectureGlobal Memory Controller Architecture
Byte Dirty
Tag Controller
StationCU
Station
PO
RT A
PO
RT A
PO
RT B
PO
RT B
Cache
Global Memory Controller
Rd DataRd Addr
Cache Cleaner
Tag Controller
Tag MemoryTag Manager
Write Pipeline
AXI Controller
AXI Port
ARADDR
RDATA
WDATA
WSTRB
AWADDR
It processes up to 64 outstanding requests from multiple CUs
Not served requests get increasing priorities as they are waiting
A multi-bank, directed mapped cache with write-back strategy is included
Accesses to global memory are parallelized over many AXI4-ports
Data is transfered in bursts that fill a single cache line
FGPU | FPGA’16, Monterey, 23 February 2016 10
![Page 44: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/44.jpg)
RUHR-UNIVERSITÄT BOCHUM
FGPU ArchitectureGlobal Memory Controller Architecture
Byte Dirty
Tag Controller
StationCU
Station
PO
RT A
PO
RT A
PO
RT B
PO
RT B
Cache
Global Memory Controller
Rd DataRd Addr
Cache Cleaner
Tag Controller
Tag MemoryTag Manager
Write Pipeline
AXI Controller
AXI Port
ARADDR
RDATA
WDATA
WSTRB
AWADDR
It processes up to 64 outstanding requests from multiple CUs
Not served requests get increasing priorities as they are waiting
A multi-bank, directed mapped cache with write-back strategy is included
Accesses to global memory are parallelized over many AXI4-ports
Data is transfered in bursts that fill a single cache line
![Page 48: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/48.jpg)
Outline
1 Motivation and Background
2 FGPU Architecture: Execution Model, Platform Model, Compute Unit Architecture, Global Memory Controller
3 Implementation and Results
4 Future Work and Conclusion
![Page 49: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/49.jpg)
Results: System Setup
Development board: Xilinx ZC706 (Zynq XC7Z045)
Two other solutions for comparison:
The hard ARM Cortex-A9
The soft MicroBlaze
The processed data is located in the DDR module
The benchmarks were compiled with -O3 as bare-metal applications
Only 32-bit integer operations were considered
Cache flushing was always performed
[Figure: Zynq XC7Z045 system setup. PS: two Cortex-A9 CPUs with NEON engines, a memory interface to the DDR3 SDRAM, and HP ports 0 to 3 behind the AXI slave HP interconnect. PL: FGPU (control interface, compute units, cache, memory controller) attached as an AXI master to the central memory interconnect.]
![Page 56: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/56.jpg)
FGPU Implementation: Highlights
Scalability
Up to 4K threads run simultaneously on 64 PEs
Only slight degradation of the operating frequency for bigger designs
Portability: no IP cores or primitives
Even DSP slices in pipeline mode were targeted without using any IP cores
Flexibility
Many parameters can be configured to meet the best performance/area trade-off
[Figure: floorplan for 8 CUs on the XC7Z045]
![Page 60: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/60.jpg)
Results: Area Requirements
Many FGPU configurations with different settings were tested
Clock frequencies were fixed at 200 MHz / 400 MHz
BRAMs and DSPs were targeted, but without the manufacturer's IP cores
More FFs than LUTs were consumed due to the deep pipeline
Empty pipeline stages were inserted between the two clock domains to improve timing
Adjustable parameters with direct influence on design scalability

| Module | Parameter | Range |
| --- | --- | --- |
| CU | # CUs | 2, 4, 8 |
| CU | # Outstanding mem. requests | 16, 24, 32 |
| Memory Controller | Cache size | 1 to 8 KB |
| Memory Controller | # Cache read banks | 2, 4, 8 |
| Memory Controller | # Outstanding mem. requests | 32, 64 |
| Memory Controller | # AXI4 interfaces | 1, 2, 4 |
| Memory Controller | # Tag managers | 2, 4, 8, 16 |
Area requirements for different configurations

| Configuration | LUTs | FFs | BRAM | DSPs |
| --- | --- | --- | --- | --- |
| Available | 219K | 437K | 545 | 900 |
| 8 CUs | 57% | 36% | 31% | 21% |
| 4 CUs | 35% | 20% | 18% | 11% |
| 2 CUs | 26% | 14% | 12% | 5.3% |
| 2 CUs (min) | 9.6% | 8.1% | 8.5% | 5.3% |
| MicroBlaze | 3.2% | 1.3% | 5.0% | 0.7% |
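As a sketch of how these tunables might be captured at the tooling level, the following Python fragment encodes the parameter ranges from the table above and rejects values outside them. The dictionary keys and the helper function are hypothetical names invented for this sketch; they are not part of the FGPU sources.

```python
# Allowed ranges for FGPU's scalability parameters, taken from the table.
# Key names are hypothetical; only the value ranges come from the slides.
PARAM_RANGES = {
    "num_cus": {2, 4, 8},
    "cu_outstanding_requests": {16, 24, 32},
    "cache_size_kb": set(range(1, 9)),      # 1 to 8 KB
    "cache_read_banks": {2, 4, 8},
    "mc_outstanding_requests": {32, 64},
    "axi4_interfaces": {1, 2, 4},
    "tag_managers": {2, 4, 8, 16},
}

def validate_config(config):
    """Return the names of parameters whose values fall outside their range."""
    return [name for name, value in config.items()
            if value not in PARAM_RANGES.get(name, {value})]
```

For example, `validate_config({"num_cus": 8, "cache_size_kb": 8})` accepts the largest tabulated design, while an unlisted CU count such as 3 is flagged.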
![Page 66: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/66.jpg)
Results: Speedup over MicroBlaze
The MicroBlaze is:
configured with default settings for best performance
clocked at 185 MHz
49x average speedup was achieved overall
166x maximum speedup for matrix multiplication
7.7 Gbps maximum throughput when used like a DMA (memcopy), averaged over the whole execution time
[Figure: wall-clock speedup of 8 CUs over the MicroBlaze implementation for task sizes between 256 and 256K; geomean, max, and min shown per benchmark: FIR (20 taps), matrix multiplication, cross correlation, vecadd & vecmul, memcopy, FIR (5 taps).]
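The "average speedup" quoted on this slide is the geometric mean over the benchmark set, which is the customary way to average speedup ratios (an arithmetic mean would overweight the largest ratios). A minimal sketch; the values in the usage example are made up, not the measured ones:

```python
from math import prod

def geomean(speedups):
    """Geometric mean: the appropriate average for ratios such as speedups."""
    return prod(speedups) ** (1.0 / len(speedups))

# Illustrative only: two benchmarks at 2x and 8x average to 4x, not 5x.
```

With the measured per-benchmark speedups from the figure, the same computation yields the 49x overall figure cited above.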
![Page 72: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/72.jpg)
Results: Speedup over ARM+NEON
The ARM CPU is:
supported by the NEON vector engine
clocked at 667 MHz
3.5x average speedup was achieved overall
35x maximum speedup for matrix multiplication
Better speedups were recorded for the more computationally intensive tasks
Cache flushing on the ARM was done only for the dirty region
[Figure: wall-clock speedup of 8 CUs over the ARM+NEON implementation for task sizes between 256 and 256K; geomean, max, and min shown per benchmark: FIR (20 taps), matrix multiplication, cross correlation, vecadd & vecmul, FIR (5 taps), transpose, memcopy.]
![Page 79: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/79.jpg)
Results: Problem Size
Better speedups are achieved when processing bigger pieces of data
The speedup increases more rapidly for the most complex applications
A minimum of about 4 µs is needed by FGPU for:
preparing to execute a new task at the beginning
flushing the cache contents at the end
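The fixed cost of roughly 4 µs explains why the speedup grows with task size: it is amortized over more work items. A hedged, Amdahl-style sketch of this effect; the per-item execution times passed in are hypothetical, only the 4 µs overhead comes from the slide:

```python
def effective_speedup(n_items, t_item_ref_us, t_item_fgpu_us, overhead_us=4.0):
    """Speedup over a reference processor once FGPU's fixed setup and
    cache-flush cost (overhead_us) is charged against the kernel run."""
    return (n_items * t_item_ref_us) / (n_items * t_item_fgpu_us + overhead_us)
```

As `n_items` grows, the overhead term vanishes and the speedup approaches the asymptotic ratio `t_item_ref_us / t_item_fgpu_us`, matching the rising curves in the figure.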
[Figure: wall-clock speedup of 8 CUs over the ARM+NEON implementation for task sizes from 64 to 256K (speedup axis 0 to 12); benchmarks: memcopy, vecmul & vecadd, FIR (5 taps), cross correlation.]
![Page 85: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/85.jpg)
Results: Number of CUs
Implementing more CUs always improved the speedups:
linearly in the best cases
even for the less computationally intensive applications
Having multiple CUs improves the efficiency of the memory controller by pushing more memory requests into the pipeline
[Figure: wall-clock speedup over the MicroBlaze implementation for FGPU with 8, 4, and 2 CUs at a task size of 16K (speedup axis 0 to 180); benchmarks: memcopy, vecadd & vecmul, FIR (5 taps), cross correlation, matrix multiplication.]
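One way to quantify the "linearly in the best cases" claim is the scaling efficiency relative to the smallest configuration: the ratio of the speedup gain to the CU-count gain. A small sketch; the numbers in the example are illustrative, not the measured data:

```python
def scaling_efficiency(speedup_small, cus_small, speedup_big, cus_big):
    """1.0 means perfectly linear scaling when going from cus_small to cus_big;
    values below 1.0 indicate diminishing returns (e.g. memory-bound kernels)."""
    return (speedup_big / speedup_small) / (cus_big / cus_small)
```

For instance, a kernel that goes from 40x on 2 CUs to 160x on 8 CUs scales with efficiency 1.0, i.e. perfectly linearly.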
FGPU | FPGA’16, Monterey, 23 February 2016 17
![Page 90: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/90.jpg)
Results: Power Consumption
The measurements were done using the on-board power monitor
The 2nd ARM core recorded the power consumption while the 1st controlled FGPU
On average over the whole benchmark:
the biggest and smallest FGPUs consumed 13x and 3.3x more power than the MicroBlaze
5.2 W was the maximum recorded by the power monitor
[Chart: ratio of consumed power over MicroBlaze (estimated average after P&R vs. measured average) for 8-CU, 4-CU, 2-CU and minimal 2-CU FGPUs across memcopy, FIR (5 taps), FIR (20 taps), matrix multiply, cross correlation, vecmul & vecadd and transpose; ratio axis 0x to 16x]
FGPU | FPGA’16, Monterey, 23 February 2016 18
![Page 95: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/95.jpg)
Results: Summary
In comparison to the MicroBlaze, all tested FGPUs:
consume 3.2x - 4.4x less energy
speed up execution by 11x - 49x
need 3.0x - 17.7x more area
In comparison to ARM+NEON, all tested FGPUs:
speed up execution by up to 3.5x
|                        | 8 CUs | 4 CUs | 2 CUs | 2 CUs (Min) |
|------------------------|-------|-------|-------|-------------|
| Speedup over MB        | 48.5  | 31.9  | 19.9  | 10.6        |
| Power saving over MB   | 3.7   | 4.3   | 4.4   | 3.2         |
| #LUTs over MB          | 17.7  | 10.9  | 8.1   | 3.0         |
| Speedup over ARM+NEON  | 3.5   | 2.3   | 1.4   | 1.0         |

Average wall clock time speedup, power saving and area overhead for different FGPUs over MicroBlaze and ARM+NEON implementations. Speedups were averaged over the whole benchmark and problem sizes from 256 to 256K
FGPU | FPGA’16, Monterey, 23 February 2016 19
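The power and speedup figures combine directly into the energy numbers: since E = P · t, the energy ratio is the power ratio divided by the speedup, so even an FGPU drawing 13x the MicroBlaze's power saves energy when it finishes 48.5x faster. A quick sanity check of that arithmetic (pairing the 8-CU power ratio with the 8-CU average speedup from the preceding slides; treating the two averages as directly combinable is a simplification):

```python
# Energy saving follows from power ratio and speedup:
#   E = P * t, so  E_mb / E_fgpu = speedup / (P_fgpu / P_mb).
# Values from the slides: the 8-CU FGPU draws ~13x MicroBlaze power
# but runs 48.5x faster on average over the benchmark.
power_ratio = 13.0   # FGPU power / MicroBlaze power (8 CUs)
speedup = 48.5       # average wall-clock speedup over MicroBlaze (8 CUs)

energy_saving = speedup / power_ratio  # MicroBlaze energy / FGPU energy
print(round(energy_saving, 2))  # → 3.73
```

The result (~3.7x) matches the 8-CU power-saving bar in the summary chart, which confirms the reported energy numbers are consistent with the power and speedup measurements.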
![Page 101: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/101.jpg)
Future Work
Extending the ISA to cover more benchmarks
Developing an LLVM backend to compile OpenCL kernels
Developing a Linux driver with an OpenCL-compatible interface
Enabling branches at the work-item level
Supporting soft/hard floating-point computations
Providing local/global atomic operations
Improving the cache system by adding a second level
Implementing local storage within the CUs to comply with the OpenCL memory model
FGPU | FPGA’16, Monterey, 23 February 2016 20
![Page 110: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/110.jpg)
Thank you for your attention!
![Page 111: FGPU: An SIMT-Architecture for FPGAs](https://reader034.vdocuments.mx/reader034/viewer/2022051506/589844c71a28ab2a498b639d/html5/thumbnails/111.jpg)
Instruction Set Architecture Example: FIR Filter
__kernel void fir(int *input_array, int *filter, int *res_array, int filter_len) {
    int index = get_global_id(0);
    int i = 0;
    int acc = 0;
    do {
        acc += input_array[index + i] * filter[i];
        i++;
    } while (i != filter_len);
    res_array[index] = acc;
}
# FIR filter using 1D index space. It has 4 parameters:
# 0 : address of the first element in the input array
# 1 : address of the first element in the coefficients array
# 2 : address of the first element in the results array
# 3 : filter length (L)
       LID   r1, d0        # Local ID: load the local work-item index within its work-group into r1
       WGOFF r2, d0        # Work-Group Offset: load the work-group global offset into r2
       ADD   r1, r1, r2    # ADD integers: r1 now holds the global ID of the work-item
       LP    r2, 3         # Load Parameter: r2 holds the filter length
       LP    r3, 0         # Load Parameter: r3 is a pointer to the input array
       LP    r4, 1         # Load Parameter: r4 is a pointer to the coefficients array
       ADDI  r5, r0, 0     # ADD Immediate: r5 is the loop index (initialized to 0)
       ADDI  r6, r0, 0     # ADD Immediate: r6 will hold the result (initialized to 0)
begin: LW    r10, r4[r5]   # Load Word: load a coefficient into r10
       ADD   r11, r5, r1   # ADD integers: compute the index of an element in the input array
       LW    r11, r3[r11]  # Load Word: load the input element into r11
       MACC  r6, r10, r11  # Multiply and ACCumulate: update the result
       ADDI  r5, r5, 1     # ADD Immediate: increment the loop index
       BNE   r5, r2, begin # Branch if Not Equal: repeat the iteration if necessary
       LP    r20, 2        # Load Parameter: r20 is a pointer to the result array
       SW    r6, r20[r1]   # Store Word: store the result r6 at index r1 in the result array
       RET                 # RETurn: end of task
(a) FIR filter as OpenCL kernel (b) Equivalent implementation in FGPU ISA
FGPU | FPGA’16, Monterey, 23 February 2016 22
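For reference, the computation the kernel above distributes across work-items can be modeled on the host as a plain loop (a Python sketch for checking results, not part of the FGPU toolchain; `fir_reference` is a hypothetical helper name):

```python
def fir_reference(input_array, coeffs):
    """Reference model of the FIR kernel: each output element is the
    dot product of filter_len consecutive inputs with the coefficients,
    i.e. what the MACC loop in the FGPU assembly accumulates per work-item."""
    filter_len = len(coeffs)
    # Work-item `index` reads input_array[index] .. input_array[index + filter_len - 1],
    # so only this many outputs have fully valid inputs.
    n_out = len(input_array) - filter_len + 1
    res = []
    for index in range(n_out):        # one iteration per work-item (global ID)
        acc = 0
        for i in range(filter_len):   # the do/while loop in the kernel
            acc += input_array[index + i] * coeffs[i]
        res.append(acc)
    return res

print(fir_reference([1, 2, 3, 4, 5], [1, 1, 1]))  # → [6, 9, 12]
```

Note that the kernel itself has no such bound check: in the OpenCL version, the host is expected to launch one work-item per valid output index.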