an fpga-based scalable simulation accelerator for tile architectures @heart2011

An FPGA-based Scalable Simulation Accelerator for Tile Architectures

Shinya Takamaeda-Yamazaki†‡, Ryosuke Sasakawa†, Yoshito Sakaguchi†, Kenji Kise†

†Tokyo Institute of Technology, Japan ‡JSPS Research Fellow

14:30 – 15:00 June 2, 2011 HEART 2011 @Imperial College London

This presentation shows the ScalableCore system

•  Multi-FPGA system for tile architecture simulations
   •  Achieving SCALABLE simulation speed

Target Core

System Function

Agenda

•  Background & Motivation
•  Proposal: ScalableCore
•  System Implementation
   •  Overall system
   •  Components: ScalableCore Unit & Board
   •  Logic Hierarchy & Architecture
•  Evaluation
   •  Simulation Speed
   •  Power
•  Conclusion

Background: Multicores to Many-cores

Intel Single Chip Cloud Computer 48 cores (x86)

TILERA TILE-Gx100 100 cores (MIPS)

Simulation Target Manycore: M-Core

•  Tile architecture with 2D mesh network
   •  Each Node has a Core, Local Memory, INCC (DMA controller), and Router
   •  Local Memory: independent address space per Node; data is transferred by DMA

Local Memory

INCC Core

DRAM Controller DRAM Controller

How to evaluate the architectures?

•  Customizability vs. Simulation Speed
   •  We want to run a large benchmark fast

[Figure: evaluation options arranged by difficulty of construction. Labels: Software Simulator; FPGA Simulator; "Faster simulation and customizable"; "Easy construction of ideal system without HW limitations"; "Real but expensive"]

Less scalability of simulation speed on software simulators

•  Speed decreases as the number of target cores increases
   •  SimMc: M-Core simulator
   •  Difficult to achieve scalable speed
      •  Overhead of cycle-accurate simulation

[Figure: Simulation speed on SimMc (M-Core simulator) vs. # Target Cores (16, 32, 48, 64). Annotation: speed degrades by more than the increase in # cores]

Motivation

•  Achieve SCALABLE simulation speed
   •  = Keep the simulation speed constant even with a large number of cores
•  How to scale the simulation speed?
   •  Our target architecture: M-Core
      •  Tile architecture with 2D mesh network

Partitioning of the target processor into multiple FPGAs

[Figure: a many-core processor partitioned across FPGAs. Labels: Many-core Processor; Partition]

Proposal of ScalableCore

•  Multiple FPGAs corresponding to the target processor
   •  Each ScalableCore Unit holds a part of the target processor and shares its simulation progress with its neighbor Units

ScalableCore Unit (FPGA card with off-chip memory): a part of the target processor

ScalableCore Board: connects the ScalableCore Units

LCD Display for simulation information

Target Core

System Function Target Processor (M-Core)

Simulation Target Manycore: M-Core

•  Tile architecture with 2D mesh network
   •  Each Node has a Core, Local Memory, INCC (DMA controller), and Router
   •  Local Memory: independent address space per Node; data is transferred by DMA

Local Memory

INCC Core

DRAM Controller DRAM Controller

Current Target of ScalableCore system

ScalableCore system 1.1: Overview

•  Simulating the M-Core with up to 64 Nodes (= FPGAs)

Local Memory

INCC Core

System Functions

The number of Nodes can be increased or decreased:

•  1 Node: 1 ScalableCore Unit
•  4 Nodes (2×2): 4 ScalableCore Units
•  16 Nodes (4×4): 16 ScalableCore Units
•  64 Nodes (8×8): 64 ScalableCore Units

Scalable Extension!

ScalableCore system 1.1: Components

•  ScalableCore Unit: FPGA board with off-chip SRAM
   •  Xilinx Spartan-3E XC3S500E
   •  512 KiB SRAM (8-bit, 1 port for read/write)
   •  Configuration ROM
•  ScalableCore Board: interface board bridging Units
   •  Power regulator & SD card slot

ScalableCore system 1.1: Logic Hierarchy

[Figure: logic hierarchy with two groups, Target Core (a Node in M-Core) and System Functions. Block labels: Core, INCC, Local Memory (Interface), Router, Ser/Des, Memory Multiplexer, Initializer, Device Controller, Arbiter, Interface Register]

ScalableCore system 1.1: Logic Architecture

[Figure: logic architecture of a ScalableCore Unit (Spartan-3E FPGA) and its off-chip devices. Block labels: Memory Multiplexer, DMA Generator/Receiver, Fetch Unit, Decoder, Execution Unit, Register File, Memory Access Unit, DMA Register, Memory Controller, SRAM Controller, SRAM, Arbiter, Interface Register, SD Card Controller, Node Memory, Router (to/from Adjacent Units), State Machine Controller, Ser/Des, IR (×4), Configuration ROM, XCF04S, JTAG port]

Two key techniques

•  Local Barrier Synchronization
   •  Each FPGA holds one Node of M-Core (or of another tile architecture)
   •  To preserve cycle accuracy, the FPGAs must handshake on their simulation state
      •  All-to-all handshake: overhead increases with the number of cores
   •  Our target is a tile architecture, so … handshake with only the 4 neighbors
•  Virtual Cycle
   •  How to emulate complex hardware?
      •  Ex.) a larger number of memory ports
   •  Use multiple FPGA cycles for 1 target cycle
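The handshake-overhead argument can be made concrete with a small count. The following is a minimal sketch, not the authors' implementation: it assumes a rectangular 2D mesh and compares how many partners a node would have to handshake with under an all-to-all scheme versus a neighbors-only scheme. The function names `mesh_neighbors` and `handshake_partners` are hypothetical.

```python
def mesh_neighbors(x, y, width, height):
    """Return coordinates of the up-to-4 mesh neighbors (N, E, S, W) of node (x, y)."""
    candidates = [(x, y - 1), (x + 1, y), (x, y + 1), (x - 1, y)]
    return [(nx, ny) for nx, ny in candidates
            if 0 <= nx < width and 0 <= ny < height]


def handshake_partners(width, height):
    """Worst-case handshake partners per node as (all_to_all, local)."""
    # All-to-all: every node waits for every other node.
    all_to_all = width * height - 1
    # Local: a node waits only for its mesh neighbors (at most 4).
    local = max(len(mesh_neighbors(x, y, width, height))
                for y in range(height) for x in range(width))
    return all_to_all, local


for side in (2, 4, 8):
    all_to_all, local = handshake_partners(side, side)
    print(f"{side}x{side} mesh: all-to-all = {all_to_all}, local = {local}")
```

Under these assumptions the all-to-all partner count grows with the mesh (63 for 8×8), while the local count stays at 4 or fewer.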

Local Barrier Synchronization

•  Handshakes with the 4 neighbor FPGAs
   •  Constant handshaking overhead that does not grow with the number of target cores
   •  So it achieves scalable simulation speed

[Figure: timing diagram over Cycle 1 and Cycle 2. In each cycle a Unit sends to Unit 0, Unit 1, Unit 2, and Unit 3; "Receiving from Unit 0" is shown alongside]
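The behavior of neighbor-only synchronization can be illustrated with a toy software model. This is a simplified sketch under stated assumptions, not the hardware protocol: each Unit is reduced to a cycle counter, a Unit may advance only when none of its mesh neighbors is behind it, and Units are picked in random order to mimic unrelated timing. The names `neighbors` and `simulate` are hypothetical.

```python
import random


def neighbors(unit, side):
    """Mesh neighbors (N, E, S, W) of a unit in a side x side mesh."""
    x, y = unit
    candidates = [(x, y - 1), (x + 1, y), (x, y + 1), (x - 1, y)]
    return [(nx, ny) for nx, ny in candidates
            if 0 <= nx < side and 0 <= ny < side]


def simulate(side, target_cycles, seed=0):
    """Advance every unit to target_cycles using neighbor-only barriers."""
    rng = random.Random(seed)
    units = [(x, y) for y in range(side) for x in range(side)]
    cycle = {u: 0 for u in units}
    max_skew = 0  # largest cycle difference ever seen between adjacent units
    while min(cycle.values()) < target_cycles:
        u = rng.choice(units)  # arbitrary order stands in for unrelated timing
        if cycle[u] == target_cycles:
            continue
        nbrs = neighbors(u, side)
        # Local barrier: advance only if no neighbor is behind this unit.
        if all(cycle[v] >= cycle[u] for v in nbrs):
            cycle[u] += 1
            skew = max((abs(cycle[u] - cycle[v]) for v in nbrs), default=0)
            max_skew = max(max_skew, skew)
    return cycle, max_skew


final_cycles, max_skew = simulate(side=4, target_cycles=20)
print("all units finished:", set(final_cycles.values()) == {20})
print("max skew between adjacent units:", max_skew)
```

In this model adjacent Units never drift more than one cycle apart, even though no Unit ever consults a global barrier.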

Virtual Cycle

•  Multiple FPGA clock cycles for 1 target clock cycle
   •  Virtually realizes complex hardware using simple FPGA resources
      •  Example: a multiport RAM emulated by driving a 1-port RAM multiple times

[Figure: timeline of 1 Virtual Cycle Time, from Virtual Cycle N to Virtual Cycle N+1, with three lanes.
 Data Sender via Serial I/Os: from "Start sending", sending the synchronized data via Serial I/O (North, East, West, South).
 Data Receiver via Serial I/Os: receiving the synchronized data via Serial I/O (North, East, West, South), until "Finish synchronization".
 Proceeding Target Circuit State: drive the circuit of target components (Router, INCC, Core) and process the memory accesses (INCC Send, Core (IF), INCC Recv, Core (L/S): interleaved memory access via Memory Multiplexer)]
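The multiport-RAM example of the virtual cycle idea can be sketched in software. This is an illustrative model, not the actual Memory Multiplexer design: the classes `SinglePortRAM` and `VirtualMultiPortRAM` are hypothetical, one access is assumed to cost one FPGA cycle, and reads are serviced before writes so that every reader sees the memory state from the start of the virtual cycle (an assumption about the intended semantics).

```python
class SinglePortRAM:
    """A RAM that can serve exactly one access per FPGA cycle."""

    def __init__(self, size):
        self.data = [0] * size
        self.fpga_cycles = 0

    def access(self, addr, write_value=None):
        """Perform one read (write_value is None) or one write; costs 1 FPGA cycle."""
        self.fpga_cycles += 1
        if write_value is None:
            return self.data[addr]
        self.data[addr] = write_value
        return None


class VirtualMultiPortRAM:
    """Presents `ports` ports per virtual cycle by reusing one physical port."""

    def __init__(self, ram, ports):
        self.ram = ram
        self.ports = ports
        self.virtual_cycles = 0

    def virtual_cycle(self, requests):
        """requests: list of (addr, write_value or None), one entry per used port."""
        if len(requests) > self.ports:
            raise ValueError("more requests than virtual ports")
        results = [None] * len(requests)
        # Reads first, so they observe the state at the start of the virtual cycle.
        for i, (addr, value) in enumerate(requests):
            if value is None:
                results[i] = self.ram.access(addr)
        for addr, value in requests:
            if value is not None:
                self.ram.access(addr, value)
        # Unused port slots still consume FPGA cycles: fixed-length virtual cycle.
        self.ram.fpga_cycles += self.ports - len(requests)
        self.virtual_cycles += 1
        return results


ram = SinglePortRAM(16)
mem = VirtualMultiPortRAM(ram, ports=4)  # e.g. INCC Send, Core (IF), INCC Recv, Core (L/S)
first = mem.virtual_cycle([(0, 7), (0, None)])  # write and read of address 0 in one cycle
second = mem.virtual_cycle([(0, None)])
print(first, second, ram.fpga_cycles, mem.virtual_cycles)
```

The trade-off is explicit here: each virtual cycle takes `ports` FPGA cycles, so the target clock runs slower than the FPGA clock in exchange for hardware that the FPGA does not natively provide.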

Evaluation

•  Evaluation Points
   •  Simulation Speed [K cycle / sec]
   •  Power [W]
•  Environment
   •  ScalableCore system 1.1 (FPGA-based simulator)
      •  Freq.: 45 MHz
   •  SimMc 1.1 (software simulator of M-Core)
      •  Intel Core 2 Duo, 4 GB memory, gcc 4.1.2, Debian 5
•  # Nodes
   •  16, 32, 48, 64

Evaluation: Simulation Speed [K cycle/sec]

•  = Clock frequency of the target processor [kHz]
   •  Software simulator: speed degrades as the number of target cores increases
   •  ScalableCore system: constant speed
•  Relative Speed
   •  The relative speed increases with the number of cores
      •  In simulation of 64 Nodes, achieves 14.2x speedup

[Figure: simulation speed [K cycle/sec] vs. # Nodes]

# Nodes                16     32     48     64
ScalableCore system  1000   1000   1000   1000
Software Simulator    343    149     96     70

[Figure: relative speed of the ScalableCore system over the Software Simulator vs. # Nodes (16, 32, 48, 64)]
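The relative speed can be recomputed from the plotted values. The sketch below is only arithmetic on the numbers shown in the figure; the variable names are hypothetical, and the "FPGA cycles per target cycle" figure is an inference from the 45 MHz FPGA clock and the 1000 K cycle/sec simulation speed, not a number stated in the slides.

```python
nodes = [16, 32, 48, 64]
scalablecore_kcps = [1000, 1000, 1000, 1000]  # K cycle/sec, from the figure
software_kcps = [343, 149, 96, 70]            # K cycle/sec, from the figure

# Speedup of the FPGA system over the software simulator at each node count.
relative_speed = [hw / sw for hw, sw in zip(scalablecore_kcps, software_kcps)]

# How much the software simulator slows down when going from 16 to 64 nodes.
software_slowdown = software_kcps[0] / software_kcps[-1]

# Inferred: FPGA clock cycles spent per simulated target cycle (45 MHz / 1000 kHz).
fpga_clock_khz = 45_000
fpga_cycles_per_target_cycle = fpga_clock_khz / scalablecore_kcps[0]

for n, r in zip(nodes, relative_speed):
    print(f"{n} nodes: {r:.1f}x")
print(f"software slowdown 16 -> 64 nodes: {software_slowdown:.1f}x")
print(f"FPGA cycles per target cycle: {fpga_cycles_per_target_cycle:.0f}")
```

The 64-node ratio comes out near 14.3, close to the 14.2x quoted above; the small gap may come from rounding of the plotted values. The software simulator slows down by about 4.9x for a 4x increase in nodes, which matches the observation that its speed degrades by more than the increase in cores.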

Evaluation: Power [W]

•  = Energy consumption of the system per second
   •  Software simulator: constant power consumption [W]
   •  ScalableCore system: power increases with the number of Nodes [W]
•  Relative Efficiency (= ratio of energy used to simulate 1 clock cycle of the target)
   •  Efficiency improves as the number of target cores increases
      •  In simulation of 64 nodes, achieves

[Figure: power [W] vs. # Nodes]

# Nodes                16     32     48     64
ScalableCore system  19.2   22.2   22.9   23.5
Software Simulator     84     84     84     84
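The relative efficiency defined above (energy per simulated target cycle) can be derived from the power and speed figures. The sketch below is a back-of-the-envelope calculation under the assumption that energy per cycle equals power divided by simulation speed; the function name `joules_per_cycle` is hypothetical, and the result is not the authors' reported value, which is missing from this text.

```python
nodes = [16, 32, 48, 64]
scalablecore_watts = [19.2, 22.2, 22.9, 23.5]  # from the power figure
software_watts = [84, 84, 84, 84]              # from the power figure
scalablecore_kcps = [1000, 1000, 1000, 1000]   # from the speed figure
software_kcps = [343, 149, 96, 70]             # from the speed figure


def joules_per_cycle(watts, kcps):
    """Energy spent per simulated target cycle: W / (cycles per second)."""
    return watts / (kcps * 1000)


# Ratio > 1 means the FPGA system spends less energy per simulated cycle.
relative_efficiency = [
    joules_per_cycle(sw_w, sw_s) / joules_per_cycle(hw_w, hw_s)
    for sw_w, sw_s, hw_w, hw_s in zip(
        software_watts, software_kcps, scalablecore_watts, scalablecore_kcps)
]

for n, e in zip(nodes, relative_efficiency):
    print(f"{n} nodes: {e:.1f}x less energy per simulated cycle")
```

Under this assumption the ratio grows with the node count, which agrees with the claim that efficiency improves as the number of target cores increases.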

Conclusion

•  ScalableCore system 1.1: an FPGA-based scalable simulation system for tile architecture evaluations
   •  Multiple FPGAs
   •  Two key techniques
      •  Virtual Cycle
      •  Local Barrier Synchronization
   •  14.2 times faster simulation than the software simulator
      •  When simulating a more detailed architecture, the speedup becomes much larger
•  Future Work
   •  Off-chip DRAM support
   •  Virtually combining multiple FPGAs for a large core
   •  Time-multiplexed driving for higher hardware utilization
