active cells: a programming model for configurable...
TRANSCRIPT
CASE STUDY 4: CUSTOM DESIGNED MULTI-PROCESSOR SYSTEM
Active Cells:
A Programming Model for Configurable Multicore Systems
387
Acknowledgements
Results originate in project "Supercomputer in the Pocket" (J. Gutknecht, L. Liu, A. Morzov, P. Hunziker, A. Gokhberg, FF) in the Microsoft Innovation Cluster for Embedded Systems funded by Microsoft Research (2009-2014).
We are particularly inbebted for consulting to Chuck Thacker, Niklaus Wirth, Timothée Martiel, Paul Reed and Florian Negele.
Thanks for Support from Xilinx Academic Program.
CRBM Implementation and exercise 12 by Stephan Koster.
388
Vision
389
P P NILP NILNIL P P PP P PP P
core core
cache
bus
cache
memory
P P
core corePP
core engineP
core corePP
coreP
coreP
engine
General Purpose Shared Memory Computer Application Specific Multicore Network On Chip
Motivation: Multicore Systems Challenges
Cache Coherence
Shared Memory Communication Bottleneck
Thread Synchronization Overhead
Hard to predict performance of a program
Difficult to scale the design to massive multi-core architecture
390
Operating System Challenges
Processor Time Sharing
Interrupts
Context Switches
Thread Synchronisation
Memory Sharing
Inter-process: Paging
Intra-process, Inter-Thread: Monitors
391
Focus
Academia: Education
holistic design of computing systems
simplicity
consistency
Industry: High Performance Sensor Driven Medical IT
streaming applications: ultrasound, tomography, hemodynamics, etc.
392
Focus: Streaming Applications
393
Stream-Parallelism: Pipelining
Task Parallelism:
ParallelExecution Data Paralellism:
Vector ComputingLoop-level parallelism
TRM: Tiny Register Machine*
Extremely simple processor on FPGA with Harvard architecture.
Two-stage pipelined
Each TRM contains
Arithmetic-logic unit (ALU) and a shifter.
32-bit operands and results stored in a bank of 2*8 registers.
local data memory: d*512 words of 32 bits.
local program memory: i*1024 instructions with 18 bits.
7 general purpose registers
Register H for storing the high 32 bits of a product, and 4 conditional registers C, N, V, Z.
No caches
395* Invented and implemented by Dr. Ling Liu and Prof. Niklaus Wirth
TRM Machine Language
Machine language: binary representation of instructions
18-bit instructions
Three instruction types:
Type a: arithmetical and logical operations
Type b: load and store instructions
Type c: branch instructions (for jumping)
396from Lectures on Reconfigurable Computing, Dr. Ling Liu, ETH Zürich
Encoding Overview Register Operations
Load and Store
Conditional Branches
Special Instructions
Branch and Link
397
091011131417
0 immRdop imm is zero extended to 32 bits
091011131417
1 RsRdop 0 0 0 x x 0
(a)
(b)
091011131417
1 VRsVRdop 1 0 0 x x x(c)
091011131417
1 x x xRdop 0 0 0 0 0 1(a)
(c)
091011131417
1 RsRdop 1 0 x x x x
091011131417
1 RsRdop 0 1 x x x x(d)
091011131417
1 x x xVRdop 1 0 0 001(b)
091011131417
1 RsRdop 101(e) xxxx
6
(a)
(b)
091011131417
0 RsRdop off
091011131417
1 RsVRdop off
3
3
off is zero extended
0910131417
cond1110 off off is sign extended0131417
1111 off off is 14-bit offset
TRM architecture
398
Figure from: Niklaus Wirth, Experiments in Computer System Design, Technical Report, August 2010
http://www.inf.ethz.ch/personal/wirth/Articles/FPGA-relatedWork/ComputerSystemDesign.pdf
Variants of TRM
FTRM
includes floating point unit
VTRM (Master Thesis Dan Tecu)
includes a vector processing unit
supports 8 x 8-word registers
available with / without FP unit
TRM with software-configurable instruction width (Master Thesis Stefan Koster, 2015)
399
Initial ExperimentsTRM12 Bus TRM12 Ring
400
Column 0
Column 1
Column 2
Column 3
H0
H1
H2H3
C0
inbound arbiteroutbound arbiter
inbound arbiteroutbound arbiter
inbound arbiteroutbound arbiter
inbound arbiter
outbound arbiter
N0
C7
N1
N2
C2
N6
N7
N8
C6
C1
C8
N3
N4
N5
N9
N10
N11
C5 C11
C4
C3
C10
C9
RS232TR
CiNi
: processor core : network controller RS232TR : RS232 transmitter receiver
Timer
LCD
LEDs
DDR2
node0 node1 node2 node3 node4 node5
node11 node10 node9 node8 node7 node6
TRM0 TRM1 TRM2 TRM3 TRM4 TRM5
TRM11 TRM10 TRM9 TRM8 TRM7 TRM6T
RM
Rin
g
RS232
TR
MR
ing
TR
MR
ing
TR
MR
ing
TR
MR
ing
TR
MR
ing
TR
MR
ing
TR
MR
ing
TR
MR
ing
TR
MR
ing
TR
MR
ing
TR
MR
ing
01111110
01111110
01111110
01111110
01111110
01111110
01111110
01111110
01111110 0111
111001111110
01111110
Software / Hardware Co-designVision: Custom System on Button Push
System design as high-level program code
Electronic circuits
Computing model
Programming Language
Compiler, Synthesizer,
Hardware Library,
Simulator
Programmable Hardware
(FPGA)
402
System specification, HW/SW partitioning
Program microcontroller in
C/C++
Programsystem specific
hardware in HDL
Compilation Synthesis
Microcontroller + machine code + specific hardware (eg. DSP)
Traditional HW/SW co-design for embedded
systems
Traditional HW/SW co-design
One Program
One Toolchain
System on FPGA
Active Cells approach for embedded systems
development
Goal
403
Active Cells Computing Model
On-chip distributed system
Cell
Scope and environment for a running isolated process.
Integrated control thread(s)
Provides communication ports
Net
Network of communication cells
Cells connected via channels (FIFOs)
404
Inspired by
• Kahn Process Networks
• Dataflow Programming
• CSP (e.g. Google's Go)
• Actor Model (e.g. Erlang)
Consequences of the approach
No global memory
No processor sharing
No pecularities of specific processor
No predefined topology (NoC)
No interrupts
No operating system
406
Cell
407
type
BernoulliSampler* = cell (probIn: port in; valOut: port out);
var
r: Random.Generator;
p: real;
begin
new(r);
loop
p << probIn;
valOut << r.Bernoulli(p);
end
end BernoulliSampler;
non-typed communication ports
blocking receive
asynchronous send
BernoulliSampler
probIn
valOut
CellActivity
Properties
type
Controller = cell {Processor=TRM, FPU, DataMemory=2048, BitWidth=18} (in: port in (64); result: port out);
...
begin
(* ... controller action ... *)
end Controller;
....
Port Width
408
Properties can influence both, generation of hardware and the generation of software code.
Controller(TRM)
FPU
Configurable Processor on PL
409
Tiny Register Machine
FPU
offon
Vector Unit
offon
Instruction Width
14 16 18 20 22
Multiplier
offon
Program Memory
0k 1k 2k 3k 4k
Data Memory
0.5k 1k 1.5k
Engine
typeMerger = cell {Engine, inputs=1}
(ind: array inputs of port in; outd: port out);var data: longint;begin
loopfor i := 0 to len(in)-1 do
data << ind[i]outd << data;
endend
end Merger;
410
Engines are prebuilt components instantiated as electronic circuitson a target hardware
Merger
…
Unit of Deployment: (Terminal) CellnetLearnTest = cellnet;var
learner: CRBMNet.CRBMLearner;reader: MLUtil.imageReader;ims0,ims1: MLUtil.imshow;…
beginnew(learner)new(reader);new(ims0{name='v0debug',posx=0,posy=100});new(ims1{name='v1debug',posx=300,posy=100});…reader.imageOUT >> learner.imgIN;learner.v0DebugOUT >> ims0.imageIN;learner.v1DebugOUT >> ims1.imageIN;…
end LearnTest;
411
CRBMLearner
imShow
imageReader
imShow
connection
dynamic construction
connection
Hierarchical Composition
412
CRBMLearner
imShow
imageReader
imShow
CRBMLearner
DownNet
UpNet
Delay
Delay
Delay
Delay
debug
result
img in
… …
…Sample
UpNet
…
GetGradients
visible
P(h|v)visible
hidden
…hidden
kernels
UpNetsplit
…
UpStep
Hierarchic Composition: non-terminal CellnetUpNet = cellnet
{vr=28,vc=28,kr=5,kc=5,k=9,c=2,name='upstep'}(pvIN, kerIN, bIN : port in; phOUT: array k of port out; pvSideOUT: port out );
vari,hr,hc: longint;upstep: CRBMUpstepCell;split: MLFunctions.SplitterCell;
beginhr:=vr-kr+1; hc:=vc-kc+1;new(vSplit {dataSize=vr*vc,numOut=2});new(upstep {vr=vr,vc=vc,kr=kr,kc=kc,k=k,c=c});pvIN >> vSplit.dataIN;
vSplit.dataOUT[0] >> pvSideOUT;
vSplit.dataOUT[1] >> UpstepCell.vIN;
kerIN >> UpstepCell.kerIN;bIN >> UpstepCell.bIN;for i:=0 to k-1 do
upstep.phOUT[i] >> phOUT[i];end;
end CRBMUpNet;
413
UpNet
img in
split
…
UpStep
pvOut
phOut
kernIn
biasIn
port delegationport delegation
ports and properties
Software Hardware Map
414
ARM TRM
ARM ENGINE
ENGINE
ZYNQ PL
PS
cell
engine
port
thread
softcore
hw engine
AX
I4
FIFO
AXI4 Stream Interconnect
Hybrid Compilation
Code body Role Compilation method
Cell (Softcore) Program logic Software Compilation
Cell (Engine) Computation unit Hardware Generation
Cell Net Architecture Hardware Compilation
415
cellnet N;
type A=cell(pi: port in; po: port out);var x: integer;begin… pi ? x; … po ! x; …end A;
var a,b: A;begin… connect(a.po, b.pi)end N.
Implementation Alternatives (1)
416
compilerfrontend
Zybo Backend
Spartan3 Backend
Spartan3e Backend
…
+ simple- redundant
- not flexible- hard to extend
sourcebitstream
Implementation Alternatives (2)
417
compilerfrontend
+ extensible+ not redundant- not simple- static configuration limits flexibility
Hardware Library Component Library
cellnet interpreterbackend
source bitstream
Implementation Alternatives (3)
418
compilerfrontend
+ extensible+ not redundant+ simpler+ simulation becomes execution
Hardware LibraryExecutables
Component LibraryExecutables
hdl runtime source bitstreamexecutable
Frontend
419
Source Code(Module)
Intermediate Code
(Module)Executable
CompilerFrontend
CompilerBackend
SimulationRuntime
VisualizationRuntime
Software Libraries
Software Libraries
Software Libraries
HDL BackendRuntime
Backend
420
HDL Backend Runtime
DependencyAnalysis
HDL Code Generator
Instruction Code Generator
HW Library
Dependency Graph
HDL Code
TRM Kernel
TRM CodeExecutable
ARM Kernel
ARM CodeExecutable
TRM Kernel
TRM CodeExecutable
ARM Kernel
ARM CodeExecutable
TRM Kernel
TRM CodeExecutable
RuntimeLibraries
CompilerBackend
Executable
FPGA VendorImplementation
Toolsbitstream
Deployment
Intermediate Code
(Module)
RuntimeLibrariesRuntimeLibrariesRuntimeLibraries
Extensibility: Defining components and platforms
Software HLL Codetype Gpo = cell{Engine,DataWidth=32,InitState="0H"} (input: port in);
421
Hardware HDL Codemodule Gpo
#( parameter integer DW = 8 … )
(
input aclk, input aresetn , input [DW−1:0] …
);
Component Specification
Build CommandAcHdlBackend.Build
--target="ZyboBoard" -p="Vivado" CRBM.TestCellnet ~
Platform Specification
Hardware Platform and ToolsHardware Types
Platform Instances
Vendor specific tools
Component Specification
422
module Gpo;import Hdl := AcHdlBackend;var c: Hdl.Engine;begin
new(c,"Gpo","Gpo");c.SetDescription("General Purpose Output … ");
c.SupportedOn("∗"); (∗ portable ∗)
c.NewProperty("DataWidth","DW",Hdl.NewInteger(32),Hdl.IntegerPropertyRangeCheck(1,Hdl.MaxInteger));c.NewProperty("InitState","InitState",Hdl.NewBinaryValue("0H"),nil);
c.SetMainClockInput("aclk"); (* main component's clock *)c.SetMainResetInput("aresetn",Hdl.ActiveLow); (* active-low reset *)c.NewAxisPort("input","inp",Hdl.In,8);c.NewExternalHdlPort("gpo","gpo",Hdl.Out,8);
c.NewDependency("Gpo.v",true,false);
c.AddPostParamSetter(Hdl.SetPortWidthFromProperty("inp","DW"));c.AddPostParamSetter(Hdl.SetPortWidthFromProperty("gpo","DW"));
Hdl.hwLibrary.AddComponent(c);end Gpo.
Identification and description
Supported devices
Parameters
Ports
Dependencies
Configuration Actions
c.NewExternalHdlPort("gpo","gpo",Hdl.Out,8);
Generic Peer-to-Peer Communication Interface
Use of AXI4 Stream interconnect standard from ARM
Generic, flexible
Non-redundant
423
TVALID: Data valid
Senderport
Receiverport
TREADY:Ready to process
Mu
ltip
lexi
ng
net
wo
rkData out
Select Select
Data in
clk clk
master must assert tvalid and keep asserted
slave can wait for master's tvalid and then assert tready
Target Platform Specificationmodule Basys2Board;
import Hdl := AcHdlBackend, AcXilinx;
var t: Hdl.TargetDevice;
pldPart: AcXilinx.PldPart;
ioSetup: Hdl.IoSetup;
pin: Hdl.IoPin;
begin
new(pldPart,"XC3S100E-4CP132");
pldPart.SetJtagChainIndex(0);
new(t,"Basys2Board",pldPart);
new(pin,Hdl.In,"B8","LVCMOS33");
t.NewExternalClock(pin,50000000,50,0); (* ExternalClock0 *)
t.SetSystemClock(t.clocks.GetClockByName("ExternalClock0"),1,1);
new(pin,Hdl.In,"G12","LVCMOS33");
t.SetSystemReset(pin,true);
new(ioSetup,"Gpo_0");
ioSetup.NewIoPort("gpo",Hdl.Out,"U16,E19,U19,V19","LVCMOS33");
t.AddIoSetup(ioSetup);
Hdl.hwLibrary.AddTarget(t);
end Basys2Board.
424
System Signals
FPGA Part
Mapping of external Ports
ioSetup.NewIoPort("gpo",Hdl.Out,"U16,E19,U19,V19","LVCMOS33");
TRM1
TRM2
TRM9
TRM10 TRM11 TRM12
FIFO1
FIFO8
FIFO9
FIFO16
FIFO17 FIFO18
FIFO19
FIFO20
FIFO33
FIFO34
UART
controller CF
controller
LCD
controller
Virtex-5LX50T FPGA
Xilinx ML505 board
RS232
CF
LCD
ECG
Sensor
·
·
·
·
·
·
Case Study 1: ECGFocus: Resources and Power
Real-time
ECG MonitorSignalinput
Wave proc_1
QRSdetect
HRV analysis
Disease classifier
Wave proc_2
Wave proc_8
ECG
bitstream
out
stream
425
Resources
ECG Monitor*
Maximum number of TRMs in communication chain
426
#TRMs #LUTs #BRAMs #DSPs TRM load
12 13859(48%)
52(86%)
12(25%)
<5%@116 MHz
FPGA #TRMs #LUTs #BRAMs #DSPs
Virtex-5 30 27692 (96%) 60 (100%)
30 (62%)
Virtex 6 500
*8 physical channels @ 500 Hz sampling frequency
implemented on Virtex 5 426
Comparative Power Usage
Preconfigured FPGA (#TRMs, IM/DM, I/O, Interconnect fixed)versus fully configurable FPGA (Active Cells)
System StaticPower (W)
Dynamic Power (W)
Preconfigured("TRM12")
3.44 0.59
Dynamicallyconfigured
0.5 0.58
86% saving!
Core 0
Core 1
Core 2
Core 3
Core 4
Core 5
Core 6
Core 7
Core 8
Core 9
Core 10
Core 11
Signal
inpu
t
Wave
proc_
1
QRS
det
ect
HRV analysis
Disease classifier
Wave
proc_2
Wave
proc_8
427
Case Study: Non-Invasive Continuous Blood Pressure MonitorFocus: Development Cycle Time
A2 Host OS with GUI on ARMSensor control and medical algorithms on Zynq PL
Sensors and Motorson Bracelet
428
Medical Monitor Network On Chip
429429
Dominated by TRM processors. Feedback driven. Not performance critical.
Development Cycle Times
Step Medical Monitor
OCT (full)
Software Compilation 4s 2s
Hardware Implementation
20 min 20 min
Stream Patching (all) 2 min -
Stream Patching (typical) 12 s -
Deployment 11s 16s
sporadic often430
430
Case Study 3: Optical Coherence TomographyFocus: Performance
A(λi) = A(1/fi)
f(z)
z-Axis Processing
1. Non uniform sampling
A(λi) A(f i)
2. Dispersion compensation
3. (Inverse) FFT
… for many lines x in a row (2d)
… and many rows y in a column (3d)
~
zx
y 431 431
A component of OCT image processingDispersion Compensation
Dominated by Engines. Dataflow driven.
432
Case Study: ANN
433
Sigmoid
InnerProdFlt
Í Í Í
+
+
x0x0 x1x1 x2x2y0y0 y1y1 y2y2
- Í
+
Frac
LUT
yy
Fix
zz
input[0]input[0]
input[1]input[1]
outputoutput
xx
y
bx0 x1 x2w0 w1 w2
ARM SoC
x
w
P0
R0
P2
R2P1
R1
z
ANN Layer 1(MoreLayers)
b
Programmable Logic
Perceptron
Performance and Resource Usage
Medical Monitor Simple OCT OCT Perceptron
Architecture Spartan 6XC6SLX75
Zynq 7000XC7Z020
Zynq 7000XC7Z020
Zynq 7000XC7Z010
Resources 28% Slice LUTs, 4% Slice Registers80% BRAMs24% DSPs
11% Slice LUTs, 6% Slice Registers7% BRAMs15% DSPs1 ARM Cortex A9
17% Slice LUTS8% Slice Registers22% BRAMs31% DSPs1 ARM Cortex A9
83% Slice LUTS, 67% Slice Registers13.33% BRAMs, 21% DSP1 ARM Cortex A9
Clock Rate 58 MHz 118 MHz 50 MHz 147 MHz
Data Bandwidth 1.25 Mbit /s (in)23 kB/s (out)
236 MWords/s (in)118 MWords/s (out)
50 MWords/s (in)50 MWords/s (out)
9.6 GBits/s in9.6 GBits/s out
Performance -- 8.3 GFPOps*up to 32 GFPops**
4.3 GFPOps* 4.9 GFlops
Power ~2W ~5W ~5W 2.1 W
** Fixed point operations, 32bit
* when instantiated 4 times434
Conclusion
ActiveCells: Computing model and tool-chain for configurable computing
Configurable interconnect Simple Computing, Power Saving
Embedding of task engines High Performance
Hybrid compilation Quick Development
Backend execution Eased Flexibility and Extensibility
435
Mapping to Hardware – Simple ExampleIO*=CELL (input: PORT IN; output: PORT OUT;
buttons: PORT IN; leds: PORT OUT);BEGIN ...END IO;
Controller*=CELL (input: PORT IN; output: PORT OUT)BEGIN ...END Controller;
VAR controller: Controller; io: IO; gpo{DataWidth=8}: Engines.Gpo;gpi{DataWidth=11}: Engines.Gpi;
BEGINNEW(controller); NEW(io);NEW(gpi); NEW(gpo);
CONNECT (controller.output, io.input, 32);CONNECT (io.output, controller.input, 32);CONNECT (gpi.output, io.buttons);CONNECT (io.leds, gpo.input);
END Simple.
436
IOTRM
ControllerTRM
fifo fifo
gpi gpo
Blackboard / Code inspection
Resource Usage Scenarios
438
CELLNET Game;
IMPORT LED;
TYPE
IO*=CELL {DataMemorySize(512), CodeMemorySize(1024), LED} (in: PORT IN);BEGINEND IO;
VAR io: IO; BEGIN
NEW(io); END Game.
1 TRM
Resources
4-input LUTS 27%
Flip Flops 5%
BRAMs 16%
Resources
4-input LUTS 58%
Flip Flops 9%
BRAMs 33%
TRM ≈ 21-27% (LUTs), 16% (BRAMs); Fifo ≈6% (LUTs)
VAR io: IO; trm: TRM;BEGINNEW(io); NEW(trm);
CONNECT(trm.out, io.in);END Game.
TRM TRM
(cf. next page)
TRM TRM
-> TRM
Resources
4-input LUTS 86%
Flip Flops 18%
BRAMs 75%
Resource Usage (Lab)controller: Controller;ioTRM: IO;digitsDriver: DigitsDriver;digits: Engines.LEDDigits;gpo{DataWidth=8}: Engines.Gpo;gpi{DataWidth=11}: Engines.Gpi;
BEGINNEW(controller); NEW(ioTRM);NEW(digitsDriver);NEW(digits);NEW(gpi);NEW(gpo);CONNECT( controller.cmdOUT,ioTRM.cmdIN,10);ONNECT( ioTRM.cmdOUT,controller.cmdIN,10);CONNECT( gpi.output, ioTRM.ButtonsIN);CONNECT( ioTRM.LEDOUT, gpo.input);CONNECT(ioTRM.digitsOUT, digitsDriver.cmdIN,10);CONNECT(digitsDriver.ledOUT, digits.input);
3 TRMs
3 FIFOs
439
Resources
4-input LUTS 86%
Flip Flops 18%
BRAMs 75%