an fpga-based scalable simulation accelerator for tile architectures @heart2011
An FPGA-based Scalable Simulation Accelerator for Tile Architectures
Shinya Takamaeda-Yamazaki†‡, Ryosuke Sasakawa†, Yoshito Sakaguchi†, Kenji Kise†
†Tokyo Institute of Technology, Japan ‡JSPS Research Fellow
14:30 – 15:00 June 2, 2011 HEART 2011 @Imperial College London
This presentation shows ScalableCore system
■ Multi-FPGA system for Tile architecture simulations
  ● Achieving SCALABLE simulation speed
[Figure: ScalableCore system; labels: Target Core, System Function]
Agenda
■ Background & Motivation
■ Proposal: ScalableCore
■ System Implementation
  ● Overall system
  ● Components: ScalableCore Unit & Board
  ● Logic Hierarchy & Architecture
■ Evaluation
  ● Simulation Speed
  ● Power
■ Conclusion
Background: Multicores to Many-cores
Intel Single Chip Cloud Computer 48 cores (x86)
TILERA TILE-Gx100 100 cores (MIPS)
Simulation Target Manycore: M-Core
■ Tile architecture with 2D mesh network
  ● A Node has: Core, Local Memory, INCC (DMA controller) and Router
  ● Local Memory: Independent Address Space, Data transfer by DMAs
[Figure: M-Core tile diagram; labels: Node (Core, INCC, Local Memory, Router "R"), four DRAM Controllers]
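The Node description above can be sketched in a few lines of Python. This is only an illustration of "independent address space, data transfer by DMA": the class name `Node`, the function `dma_transfer`, and the 1 KiB memory size are assumptions made for the example and do not come from the M-Core design.

```python
from dataclasses import dataclass, field

LOCAL_MEM_BYTES = 1024  # assumed size, chosen only for this example


@dataclass
class Node:
    """One tile: a local memory that other Nodes can reach only via DMA."""
    x: int
    y: int
    local_mem: bytearray = field(
        default_factory=lambda: bytearray(LOCAL_MEM_BYTES))


def dma_transfer(src, src_addr, dst, dst_addr, length):
    """Copy `length` bytes between two independent local address spaces."""
    if min(src_addr, dst_addr, length) < 0:
        raise ValueError("negative address or length")
    if (src_addr + length > len(src.local_mem)
            or dst_addr + length > len(dst.local_mem)):
        raise ValueError("DMA range exceeds local memory")
    dst.local_mem[dst_addr:dst_addr + length] = \
        src.local_mem[src_addr:src_addr + length]


# A 2x2 mesh: address 0 exists once per Node, not once globally.
mesh = {(x, y): Node(x, y) for y in range(2) for x in range(2)}
mesh[(0, 0)].local_mem[0:4] = b"\x01\x02\x03\x04"
dma_transfer(mesh[(0, 0)], 0, mesh[(1, 1)], 16, 4)
```

Writing to address 0 of one Node leaves address 0 of every other Node untouched; data moves only when a DMA transfer is issued.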
How to evaluate the architectures?
■ Customizability vs. Simulation Speed
  ● We want to run a large benchmark fast
[Figure: evaluation platforms plotted by "Difficulty to construct" and "Reality"]
  Software Simulator: Easy construction of ideal system without HW limitations
  FPGA Simulator: Faster simulation and customizable
  Chip: Real but expensive
Less scalability of simulation speed on software simulators
■ Decreasing speed with the increasing # target cores
  ● SimMc: M-Core simulator
  ● Difficult to achieve the scalable speed
    • Overhead for cycle accurate simulation

Simulation Speed on SimMc (M-Core simulator)
# Target Cores                     16   32   48   64
Simulation Speed [K cycle / sec]   343  149  96   70
The speed drops by a larger factor than the # cores grows
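The claim that SimMc slows down by more than the core count grows can be checked directly from the four plotted values (343, 149, 96 and 70 K cycle/sec). The short Python sketch below only uses those numbers; the variable names and the "aggregate" metric (cores times speed) are choices made for this illustration.

```python
# SimMc simulation speed in K cycle/sec, as plotted on the slide.
simmc_speed = {16: 343, 32: 149, 48: 96, 64: 70}

base_cores = 16
rows = []
for cores, speed in simmc_speed.items():
    core_growth = cores / base_cores              # how many times more cores
    slowdown = simmc_speed[base_cores] / speed    # how many times slower
    aggregate = cores * speed                     # K simulated core-cycles/sec
    rows.append((cores, core_growth, slowdown, aggregate))
    print(f"{cores:2d} cores: x{core_growth:.1f} cores, "
          f"x{slowdown:.2f} slower, {aggregate} K core-cycles/sec")
```

If the slowdown merely tracked the core count, the aggregate column would stay flat; here it falls as cores are added, which is the overhead the slide attributes to cycle accurate simulation.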
Motivation
■ Achieve SCALABLE simulation speed
  ● = Keep the simulation speed constant even for a large number of cores
■ How to scale the simulation speed?
  ● Our target architecture: M-Core
    • Tile architecture with 2D mesh network
  ● Partitioning of the target processor into multiple FPGAs
[Figure: Many-core Processor split by "Partition"]
Proposal of ScalableCore
■ Multiple FPGAs corresponding to the target processor
  ● Each ScalableCore Unit has a part of the target processor and shares the simulation progress with its neighbor Units
[Figure: ScalableCore system next to the Target Processor (M-Core); labels: Target Core, System Function]
  ScalableCore Unit (FPGA card with off-chip memory): a part of the target processor
  ScalableCore Board: connecting among the ScalableCore Units
  LCD display for simulation information
Simulation Target Manycore: M-Core
[Figure: M-Core tile diagram repeated from the earlier slide, captioned "Current Target of ScalableCore system"]
ScalableCore system 1.1: Overview
■ Simulating the M-Core with up to 64 Nodes (= FPGAs)
■ Able to increase/decrease the number of Nodes
[Figure: one Node (Core, INCC, Local Memory, Router "R") together with System Functions]
[Photos: system configurations; dimension labels on the 1-, 4- and 16-Node photos: 45cm, 30cm]
1 Node: 1 ScalableCore Unit
4 Nodes (2×2): 4 ScalableCore Units
16 Nodes (4×4): 16 ScalableCore Units
64 Nodes (8×8): 64 ScalableCore Units
Scalable Extension!
ScalableCore system 1.1: Components
■ ScalableCore Unit: FPGA board with off-chip SRAM
  ● Xilinx Spartan-3E XC3S500E
  ● 512 KiB SRAM (8-bit, 1 port for read/write)
  ● Configuration ROM
■ ScalableCore Board: interface board bridging Units
  ● Power regulator & SD card slot
ScalableCore system 1.1: Logic Hierarchy
[Figure: logic hierarchy of a ScalableCore Unit]
  Target Core (a Node in M-Core): Core, INCC, Local Memory (Interface), Router
  System Functions: Ser/Des, Memory Multiplexer, Initializer, Device Controller, Arbiter, Interface Register
ScalableCore system 1.1: Logic Architecture
[Figure: block diagram of the ScalableCore Unit FPGA (Spartan-3E) and its Off-chip Devices]
  Group labels: Core, INCC, Node Memory, Router
  Blocks: Fetch Unit, Decoder, Execution Unit, Register File, Memory Access Unit, DMA Generator/Receiver, DMA Register, Memory Multiplexer, Memory Controller, SRAM Controller, Arbiter, XBAR, Interface Register (IR), State Machine Controller, SD Card Controller, four Ser/Des (to/from Adjacent Units), Clock, Reset
  Off-chip: SRAM, SD, Configuration ROM (XCF04S), JTAG port
Two key techniques
■ Local Barrier Synchronization
  ● Each FPGA has one Node of M-Core (or other tile architecture)
  ● To satisfy cycle accuracy, handshaking of the simulation state is needed
    • All-to-all handshake: the overhead increases with the number of cores
  ● Our target is a tile architecture, so …
    • Handshaking with only 4 neighbors
■ Virtual Cycle
  ● How to emulate the complex hardware?
    • Ex.) larger number of memory ports
    • Use multiple FPGA cycles for 1 target cycle
Local Barrier Synchronization
■ Handshakes with 4 neighbor FPGAs
  ● Constant handshaking overhead, not increasing with the # target cores
  ● So it achieves scalable simulation speed
[Figure: timing diagram over Cycle 1 and Cycle 2; in each cycle: Sending to Unit 0, 1, 2, 3 and Receiving from Unit 0, 1, 2, 3. Inset: Units labeled 0 to 4]
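The neighbor-only handshake can be illustrated with a small Python model. It is a sketch under assumptions, not the system's actual protocol: each Unit is reduced to a counter of completed target cycles, a Unit may start its next cycle only after every mesh neighbor has completed the current one, and the random scheduler stands in for unsynchronized FPGA timing. The function names are hypothetical.

```python
import random


def neighbors(x, y, n):
    """North, east, south and west neighbors that exist in an n x n mesh."""
    cand = [(x, y - 1), (x + 1, y), (x, y + 1), (x - 1, y)]
    return [(i, j) for i, j in cand if 0 <= i < n and 0 <= j < n]


def simulate(n, target_cycles, seed=0):
    """Run all Units to `target_cycles` using only neighbor handshakes."""
    rng = random.Random(seed)
    units = [(x, y) for y in range(n) for x in range(n)]
    done = {u: 0 for u in units}      # completed target cycles per Unit
    max_skew = 0
    while min(done.values()) < target_cycles:
        u = rng.choice(units)         # arbitrary order models free-running FPGAs
        nbrs = neighbors(u[0], u[1], n)
        # Local barrier: proceed only if every neighbor has caught up with u.
        if done[u] < target_cycles and all(done[v] >= done[u] for v in nbrs):
            done[u] += 1
            max_skew = max([max_skew] + [abs(done[u] - done[v]) for v in nbrs])
    links_per_unit = max(len(neighbors(x, y, n)) for x, y in units)
    return max_skew, links_per_unit


for n in (2, 4, 8):
    skew, links = simulate(n, target_cycles=10)
    print(f"{n}x{n}: max neighbor skew = {skew} cycle(s), "
          f"handshakes per Unit per cycle <= {links}, "
          f"all-to-all would need {n * n - 1}")
```

In this model adjacent Units never drift more than one target cycle apart, and the number of handshakes per Unit stays bounded by 4 while an all-to-all scheme would grow with the Node count.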
Virtual Cycle
■ Multiple FPGA clock cycles for 1 target clock cycle
  ● Virtually complex hardware by using simple FPGA equipment
    • Example: multiport RAM by driving a 1-port RAM multiple times
[Figure: timeline of 1 Virtual Cycle Time (Virtual Cycle N, Virtual Cycle N+1, …)]
  Proceeding Target Circuit State: drive the circuit of target components (Core, INCC, Router); process the memory accesses as interleaved memory access via Memory Multiplexer: INCC Send, Core (IF), INCC Recv, Core (L/S)
  Data Sender via Serial I/Os: "Start sending"; sending the synchronized data via Serial I/O (North, East, West, South)
  Data Receiver via Serial I/Os: receiving the synchronized data via Serial I/O (North, East, West, South); "Finish synchronization"
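The multiport-RAM example can be sketched as follows. The slot names mirror the interleaving shown on the slide (INCC Send, Core (IF), INCC Recv, Core (L/S)), but the classes, the request format, and the rule that slots are served strictly in that order are assumptions of this illustration. The last lines divide the 45 MHz FPGA clock by the reported 1000 K cycle/sec; treating that quotient as the length of one virtual cycle is an inference, not a figure stated in the slides.

```python
class SinglePortRAM:
    """RAM that serves exactly one read or write per FPGA clock cycle."""

    def __init__(self, size):
        self.data = [0] * size
        self.fpga_cycles = 0

    def access(self, addr, write_value=None):
        self.fpga_cycles += 1          # every access costs one FPGA cycle
        if write_value is None:
            return self.data[addr]     # read
        self.data[addr] = write_value  # write
        return None


# Slot order follows the slide; what each slot carries is assumed here.
SLOTS = ("incc_send", "core_if", "incc_recv", "core_ls")


def virtual_cycle(ram, requests):
    """Serve the port requests of one target cycle, one per FPGA cycle.

    `requests` maps a slot name to (addr, write_value or None for a read).
    """
    results = {}
    for slot in SLOTS:
        if slot in requests:
            addr, value = requests[slot]
            results[slot] = ram.access(addr, value)
    return results


ram = SinglePortRAM(256)
ram.data[8] = 0xAB
out = virtual_cycle(ram, {
    "incc_send": (8, None),    # DMA engine reads outgoing data
    "core_if": (0, None),      # instruction fetch
    "incc_recv": (32, 0x55),   # DMA engine stores incoming data
    "core_ls": (32, None),     # load served after the store in this sketch
})

FPGA_CLOCK_HZ = 45_000_000         # "Freq.: 45MHz" from the evaluation setup
TARGET_CYCLES_PER_SEC = 1_000_000  # 1000 K cycle/sec from the evaluation
fpga_cycles_per_virtual_cycle = FPGA_CLOCK_HZ // TARGET_CYCLES_PER_SEC
print(out, ram.fpga_cycles, fpga_cycles_per_virtual_cycle)
```

From the target's point of view the four accesses happen within a single cycle, while the one-port SRAM sees them as four consecutive FPGA cycles.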
Evaluation
■ Evaluation Points
  ● Simulation Speed [K cycle / sec]
  ● Power [W]
■ Environment
  ● ScalableCore system 1.1 (FPGA-based simulator)
    • Freq.: 45MHz
  ● SimMc 1.1 (Software simulator of M-Core)
    • Intel Core2Duo, Memory 4GB, gcc4.1.2, Debian 5
■ # Nodes
  ● 16, 32, 48, 64
Evaluation: Simulation Speed [K cycle/sec]
■ = Clock frequency of the target processor [kHz]
  ● Software simulator: speed degrades as the # target cores increases
  ● ScalableCore system: constant speed
■ Relative Speed
  ● The relative speed increases with the # cores
    • In simulation of 64 Nodes, achieves 14.2x speed up

Simulation Speed [K cycle / sec]
# Nodes               16    32    48    64
ScalableCore system   1000  1000  1000  1000
Software Simulator    343   149   96    70

Relative Speed
# Nodes               16    32    48    64
Relative Speed        2.9   6.7   10.4  14.2
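The relative speed is the ratio of the two simulation speeds, so it can be recomputed from the plotted values. The Python sketch below does that and compares against the reported figures; the variable names are illustrative. The recomputed ratios agree with the chart to within about 0.1, and the small gap at 64 Nodes is presumably due to the plotted speeds being rounded.

```python
nodes = [16, 32, 48, 64]
scalablecore = [1000, 1000, 1000, 1000]   # K cycle/sec, from the chart
software = [343, 149, 96, 70]             # K cycle/sec, from the chart
reported = [2.9, 6.7, 10.4, 14.2]         # relative speed, from the chart

recomputed = [hw / sw for hw, sw in zip(scalablecore, software)]
for n, ratio, rep in zip(nodes, recomputed, reported):
    print(f"{n:2d} Nodes: recomputed x{ratio:.2f}, reported x{rep}")
```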
Evaluation: Power [W]
■ = Energy consumption of the system per sec
  ● Software simulator: constant consumption [W]
  ● ScalableCore system: power [W] increases with the # Nodes
■ Relative Efficiency (= ratio of energy used for simulation of 1 clock cycle on the target)
  ● More efficient as the # target cores increases
    • In simulation of 64 nodes, achieves

Relative Efficiency
# Nodes               16    32    48    64
Relative Efficiency   19.2  22.2  22.9  23.5

Power [W]
# Nodes               16    32    48    64
ScalableCore system   13    26    38    51
Software Simulator    84    84    84    84
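One way to read the relative efficiency is as the energy the software simulator spends per simulated target cycle divided by the energy the ScalableCore system spends, where energy per cycle is power divided by simulation speed. That reading is an assumption; the slide gives only the one-line definition. The Python sketch below applies it to the plotted power and speed values. The results land within roughly 0.5 of the reported figures, with the difference presumably coming from rounded chart readings.

```python
nodes = [16, 32, 48, 64]
power_hw = [13, 26, 38, 51]               # W, ScalableCore system
power_sw = [84, 84, 84, 84]               # W, software simulator
speed_hw = [1000, 1000, 1000, 1000]       # K cycle/sec
speed_sw = [343, 149, 96, 70]             # K cycle/sec
reported = [19.2, 22.2, 22.9, 23.5]       # relative efficiency, from the chart


def energy_per_kcycle(power_w, speed_kcps):
    """Joules needed to simulate 1000 target cycles."""
    return power_w / speed_kcps


efficiency = [
    energy_per_kcycle(psw, ssw) / energy_per_kcycle(phw, shw)
    for psw, ssw, phw, shw in zip(power_sw, speed_sw, power_hw, speed_hw)
]
for n, eff, rep in zip(nodes, efficiency, reported):
    print(f"{n:2d} Nodes: recomputed x{eff:.1f}, reported x{rep}")
```

Although the ScalableCore system draws more power as Nodes are added, its speed stays constant while the software simulator slows down, so the energy per simulated cycle moves in its favor.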
Conclusion
■ ScalableCore system 1.1: an FPGA-based scalable simulation system for tile architecture evaluations
  ● Multiple FPGAs
  ● Two key techniques
    • Virtual cycle
    • Local Barrier Synchronization
  ● 14.2 times faster simulation than the software simulator
    • When simulating a more detailed architecture, the speedup becomes much larger
■ Future Work
  ● Off-chip DRAM support
  ● Virtually combining multiple FPGAs for a large core
  ● Time-multiplexed driving for higher hardware utilization